1. Introduction
Aerial refueling, a key technology for delivering fuel from a tanker to reconnaissance, fighter, or other receiver aircraft in flight, was first proposed in 1917 by the Russian–American Alexander de Seversky [1] to enhance the combat capability, range, and payload of manned aircraft [2,3]. Manually performed aerial refueling missions are relatively inefficient and place heavy demands on the pilot’s skill and psychological and physiological state [4]. In recent years, the widespread use of drones in both military and civilian applications has greatly stimulated research in the field of autonomous aerial refueling (AAR) [5]. Autonomous aerial refueling technology can extend the flight time of UAVs in the air, increase their flight range [6], resolve the trade-off between take-off weight and flight performance, and significantly reduce the risks of refueling manned aircraft [7]. In addition, autonomous aerial refueling could eliminate the need for UAVs to make intermittent landings, thus greatly enhancing air operations. A detection algorithm with high accuracy and speed ensures more reliable docking between the receiver and the tanker, reducing the risk of docking failure. It enables the autonomous aerial refueling system to operate in more diverse environments and more complex weather conditions, enhances the system’s adaptability and range of applications, and reduces both the dependence on manual operation and the risk of accidents caused by human error. UAV autonomous aerial refueling technology therefore holds great potential and application value.
Currently, there are three aerial refueling methods: probe-and-drogue refueling (PDR), flying-boom refueling (FBR), and boom–drogue–adapter refueling (BDAR). FBR is the most efficient, but it is costly and only suitable for specific aircraft equipped with a receiving receptacle. BDAR can switch the refueling method through an adapter, but the adapter reduces the refueling efficiency and speed of the original system. The PDR hose, by contrast, is lighter and more flexible, making PDR better suited to UAV autonomous aerial refueling missions on the grounds of flexibility and economy.
Probe-and-drogue refueling (PDR) (see Figure 1) involves installing a probe on the receiver aircraft and trailing a hose behind the tanker, with a refueling drogue attached at the end of the hose; the probe is then docked with the drogue to transfer fuel [8]. However, many unsolved problems and challenges remain in the PDR refueling process. For example, atmospheric turbulence can cause the drogue and hose to swing significantly [9]. Complex weather conditions such as rain, snow, clouds, fog, and bright light lead to uneven or partially saturated drogue images [10]. During the final stage of docking, when the receiver is close to the tanker, the drogue oscillates violently due to the bow wave effect (BWE) and is often occluded by the refueling probe [11]. These problems significantly increase the difficulty of UAV refueling docking. Most current detection systems are prone to misdetection and missed detection, while low positioning accuracy and long response times may lead to flight accidents and even aircraft crashes and casualties [7]. Therefore, fast and accurate positioning of the drogue in complex environments has become a critical problem to be solved in autonomous aerial refueling missions.
Conventional perception methods have various drawbacks in locating the drogue: integrated inertial/GPS navigation may yield inaccurate localization due to signal occlusion and distortion [12]; radar systems struggle to accurately distinguish the position of the drogue [13,14]; and infrared and laser perception methods are susceptible to interference from the external environment, such as cloud and glare [15], which degrades recognition accuracy. In contrast, vision-based detection algorithms can recognize the drogue more accurately, avoid error accumulation, and are free from signal interference.
In recent years, many contributions have been made to vision-based drogue detection. These are mainly classified into manual-feature- and learning-based methods [16]. Pollini et al. mounted light-emitting diodes [17] and applied paint with special markers [18] on the drogue to assist in locating it and estimating its position. However, the markers can obscure parts of the drogue, causing local loss of image information, the LEDs require additional power lines on the drogue, and hand-crafted features adapt poorly to locating the drogue in real scenes. Yin et al. [19] used template matching and threshold segmentation to identify the motion trajectory of the drogue during autonomous aerial refueling without manual features; however, template matching cannot cover all motion positions of the drogue. Martínez et al. [20] designed a drogue detection method based on edge features, but this method has difficulty correctly detecting a drogue under occlusion. Wang et al. [21] designed an HSV-based feature extraction method for the drogue; however, this method struggles in complex environments, particularly when the background is rich in colors or similar to the target color, which may lead to significant errors in feature extraction. Wang et al. [22] established a drogue dataset and used a simple CNN structure to identify and detect the drogue, and Zhang et al. [23] proposed a new Faster R-CNN algorithm for detecting the refueling drogue in the close-docking stage. However, these algorithms do not handle occlusion and complex backgrounds well, require substantial computational resources, lack real-time performance, and are therefore difficult to deploy on UAVs.
Existing drogue detection algorithms sacrifice speed to improve accuracy, which increases the difficulty of deploying them on UAVs. The YOLO series has become the classic one-stage target detection family; it strikes a better balance between detection accuracy and speed than other detection algorithms and is more suitable for UAVs’ autonomous aerial docking scenarios [24]. However, for the autonomous aerial refueling task, deploying the algorithm on a mobile platform still faces limitations in computing power, memory capacity, and energy consumption, and obtaining better detection results with the fewest computational resources is the problem we want to solve. In this paper, we design a new efficient drogue detection network, DREP-Net, based on the YOLOv8 model to address the above problems, and we train and test it on a real refueling drogue dataset. The main contributions of this study are as follows:
- (1)
To address the changing scale of drogue targets, we introduce the DGST module into the backbone network of YOLOv8. The module integrates channel shuffling and the vision transformer architecture and adopts a 3:1 channel partitioning strategy. This approach effectively reduces the number of parameters and the computation required for feature extraction in the backbone network, thereby enhancing the network’s ability to recognize drogue targets of changing scale.
- (2)
We designed a module, called RGConv, that merges the re-parameterized convolution module with the GhostNet concept. This module was incorporated into the neck network. By discarding the C2f module’s residual block, adding a 1 × 1 convolution at the end of the gradient flow branch, and employing cost-effective operations to generate redundant feature maps, we mitigated the loss of image information due to up-sampling. This approach not only accelerates the neck network during inference but also enhances the network’s ability to recognize targets amid occlusions and complex backgrounds.
- (3)
We introduce an efficient local attention mechanism in the neck network that obtains feature maps in both horizontal and vertical directions through the strip pooling technique. By using 1D convolution and group normalization, we enhance the capture of the global perceptual field without increasing the model size. This significantly improves the model’s focus on the overall region of the drogue and enhances localization accuracy.
- (4)
Building on the concepts of the coupled and decoupled head, we designed a lightweight and efficient detection head based on partial convolution (PConv). This design significantly reduces the model’s parameter count and computational demand, greatly improving its speed in detecting the refueling drogue while maintaining detection accuracy.
2. Methodology
2.1. Proposed Network Framework
YOLOv8 is an object detection network open-sourced on 10 January 2023 by Ultralytics. Although the YOLOv8 algorithm is already very efficient, its performance in AAR tasks is still unsatisfactory. To this end, we propose DREP-Net, an improved network based on YOLOv8, for detecting and localizing the drogue in the docking phase of autonomous aerial refueling of a UAV. DREP-Net is characterized by efficient feature extraction using the DGST module with reduced computation; the RGConv module and the ELA attention mechanism in the neck network for finer detail capture and improved attention to the target region, respectively; and a more efficient detection head to speed up localization of the target region. The proposed model offers better detection speed, accuracy, and adaptability to varied environments. DREP-Net is therefore also potentially useful in application areas with complex backgrounds and dynamic environments, such as autonomous UAV landing and takeoff and aerial surveillance. The general structure of the network is shown in Figure 2.
2.2. Dynamic Group Convolution Shuffle Transformer (DGST)
The DGST module was proposed by Wenkai Gong [25]; it integrates the advantages of grouped convolution, channel shuffle, and the vision transformer to maximize feature extraction while using minimal resources. We introduce the DGST module into the backbone network of the original YOLOv8 model, replacing the original C2f module, which allows the model to handle features of various scales more effectively while maintaining computational efficiency.
The dynamic group convolution shuffle module (DGSM) is the base component of the DGST module, as shown in Figure 3; it is designed to significantly improve computational efficiency while maintaining model performance. Group convolution is introduced in this structure, which significantly reduces the number of parameters and computations of the model and prevents overfitting, thus maintaining the robustness and generalization ability of the network. In addition, the module adopts the channel shuffling technique of ShuffleNetV2 [26]: by rearranging channels between groups, features of different groups can interact and integrate with each other, and changing the channel order of the input feature map enriches the multi-scale feature representation, enabling more effective information exchange between groups and enhancing the comprehensive representation capability of the network.
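As a concrete illustration, below is a minimal PyTorch sketch of the ShuffleNetV2-style channel shuffle operation; the tensor shape and group count are illustrative assumptions rather than values taken from the DGSM module itself.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Rearrange channels so successive grouped convolutions can exchange information."""
    b, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by the group count"
    # Reshape to (batch, groups, channels_per_group, H, W), swap the two
    # channel axes, then flatten back to (batch, channels, H, W).
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

# Example: shuffle the output of a grouped 3x3 convolution so the next
# grouped convolution sees channels drawn from every group.
feats = torch.randn(1, 64, 40, 40)
shuffled = channel_shuffle(feats, groups=4)
```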
DGST combines the architecture of the vision transformer (ViT) [27] with that of DGSM, using a similar modular processing approach; the specific structure is shown in Figure 4. DGST adopts a 3:1 channel division strategy. First, the channel information of the input features is extracted by a 1 × 1 convolution kernel, mapping the input features to low-dimensional features, and the channels are then divided into two groups. The features holding 1/4 of the original channel count are passed to a 3 × 3 group convolution, with feature extraction performed within each group by independent convolution operations, followed by a channel shuffle to improve the flow of information across groups. Finally, feature fusion is performed through the ConvFFN structure, which is functionally similar to the feed-forward network (FFN) that follows the multi-head attention mechanism in ViT and is built from 1 × 1 convolutional layers. This structure strengthens feature map associations, improving feature representation and the integration of outputs from different information sources, and the feature maps pass through residual connections that facilitate efficient gradient flow in deep networks. The introduction of the DGST module reduces the computation and parameter count of the backbone network during feature extraction and effectively enhances the network’s ability to recognize drogue targets of changing scale.
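To make the structure concrete, the following is a minimal PyTorch sketch of a DGST-style block under the description above; the channel sizes, the group count of 4, and the GELU-based ConvFFN are assumptions for illustration (reusing the channel_shuffle helper from the previous sketch), not the authors’ exact implementation.

```python
import torch
import torch.nn as nn

class DGSTBlock(nn.Module):
    """Illustrative DGST-style block: 1x1 channel mixing, 3:1 split, grouped
    3x3 + channel shuffle on the 1/4 part, ConvFFN fusion, residual connection."""
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        self.mix = nn.Conv2d(channels, channels, 1)        # 1x1 channel extraction
        self.split = channels // 4                         # 3:1 partition point
        self.groups = groups
        self.gconv = nn.Conv2d(self.split, self.split, 3,
                               padding=1, groups=groups)   # grouped 3x3 convolution
        self.conv_ffn = nn.Sequential(                     # ConvFFN-style fusion
            nn.Conv2d(channels, 2 * channels, 1),
            nn.GELU(),
            nn.Conv2d(2 * channels, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.mix(x)
        work, keep = y[:, :self.split], y[:, self.split:]  # 1/4 and 3/4 branches
        work = channel_shuffle(self.gconv(work), self.groups)
        return x + self.conv_ffn(torch.cat([work, keep], dim=1))

out = DGSTBlock(64)(torch.randn(1, 64, 40, 40))            # -> (1, 64, 40, 40)
```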
2.3. Efficient Local Attention
The attention mechanism has greatly improved the performance of deep neural networks in the field of computer vision by dynamically adjusting the focus of the network to attend to important information and ignoring irrelevant details; however, it still faces some problems when dealing with spatial information. Some existing methods have to sacrifice channel dimensions or increase network complexity in order to enhance spatial information processing, which can increase the computational burden of the model.
Wei et al. [28] proposed the efficient local attention (ELA) mechanism, which effectively addresses these limitations. ELA obtains feature vectors in the spatial dimension using an approach similar to the coordinate attention (CA) mechanism [29]: it extracts feature vectors in the horizontal and vertical directions through strip pooling while maintaining the channel dimension of the input feature map. This approach captures long-range dependencies by maintaining a narrow kernel shape, effectively reducing the influence of uncorrelated regions on label prediction. In addition, it generates rich target position features in each direction, capturing the global information of the feature map more efficiently and pinpointing the location of the target. The ELA module significantly improves the processing efficiency and computational speed of the sequential signals by employing 1D convolution instead of more complex 2D convolution. Specifically, ELA uses a narrow kernel to perform local interactions on the two feature vectors separately, a strategy that not only simplifies the model structure but also improves processing speed.
Wu et al. [30] showed that batch normalization (BN) can be inaccurate when dealing with small batches of data: if the batch size is too small, the mean and variance do not effectively reflect the characteristics of the entire dataset. In contrast, group normalization (GN) improves the adaptability of the model to different data distributions by normalizing the data within independent groups. With Sigmoid activation, the model is able to make accurate location predictions in the two spatial dimensions.
Figure 5a,b show the specific structures of the coordinate attention mechanism and the efficient local attention mechanism, respectively.
ELA uses strip pooling to capture long-distance spatial dependencies by performing average pooling in the horizontal and vertical directions for each channel separately, generating a pooled feature map for each direction. The output of the convolution block is expressed as $x \in \mathbb{R}^{H \times W \times C}$, where $H$ denotes the height, $W$ denotes the width, and $C$ denotes the number of channels. The pooling operations in the horizontal and vertical directions are given by (1) and (2), respectively:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \quad (1)$$

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \quad (2)$$

where $z_c^h(h)$ denotes the horizontally pooled feature of the $c$th channel at height $h$, $z_c^w(w)$ denotes the feature of the $c$th channel at width $w$, $i$ and $j$ are the width index and height index, respectively, and $x_c(h, i)$ denotes the original input feature at position $(h, i)$ in channel $c$.
The ELA model employs a lighter 1D convolution to process the location information from (1) and (2), which effectively enhances the interaction of location information. Group normalization is then used to process the enhanced location information; by normalizing each group of channels independently, it provides more stable performance when training on small batches and is better suited to lightweight networks. The specific process is shown in Equations (3) and (4):

$$y^h = \sigma\left(\mathrm{GN}\left(F_h\left(z^h\right)\right)\right) \quad (3)$$

$$y^w = \sigma\left(\mathrm{GN}\left(F_w\left(z^w\right)\right)\right) \quad (4)$$

In the above equations, the 1D convolutions are denoted as $F_h$ and $F_w$, respectively, $\mathrm{GN}$ denotes group normalization, $\sigma$ denotes the nonlinear (Sigmoid) activation function, and the weights in the horizontal and vertical directions are denoted as $y^h$ and $y^w$, respectively. The final output of ELA is denoted by $Y$ and is described by Equation (5):

$$Y = x \times y^h \times y^w \quad (5)$$
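The following is a minimal PyTorch sketch of the ELA computation in Equations (1)–(5); the kernel size of 7 and group count of 16 are illustrative assumptions, as the exact hyperparameters are not specified here.

```python
import torch
import torch.nn as nn

class ELA(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 7, groups: int = 16):
        super().__init__()
        pad = kernel_size // 2
        # 1D convolutions over the height and width axes (Eqs. (3) and (4)).
        self.conv_h = nn.Conv1d(channels, channels, kernel_size, padding=pad, groups=groups)
        self.conv_w = nn.Conv1d(channels, channels, kernel_size, padding=pad, groups=groups)
        self.gn = nn.GroupNorm(groups, channels)
        self.act = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z_h = x.mean(dim=3)            # strip pooling over width  -> (B, C, H), Eq. (1)
        z_w = x.mean(dim=2)            # strip pooling over height -> (B, C, W), Eq. (2)
        y_h = self.act(self.gn(self.conv_h(z_h))).view(b, c, h, 1)
        y_w = self.act(self.gn(self.conv_w(z_w))).view(b, c, 1, w)
        return x * y_h * y_w           # Eq. (5)

attn = ELA(64)
out = attn(torch.randn(1, 64, 40, 40))
```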
2.4. GhostNet-Based Re-Parameterized Convolution Module
In the autonomous aerial refueling mission, to meet the need for immediate action after detecting a drogue target and to counter the low recognition accuracy caused by complex backgrounds or occlusion encountered during flight, we designed a module that combines the GhostNet idea with re-parameterized convolution, called RGConv, and introduced it into the neck network of YOLOv8. The specific structure is shown in Figure 6.
As can be seen in Figure 6a, we remove the residual block in C2f, which reduces the number of parameters and the computational burden of the model. To compensate for the performance loss caused by removing the residual block, to improve the ability to capture information in the feature map, and to alleviate the multi-scale information loss caused by up-sampling in the neck network, we introduced the RepConv module [31] on the gradient-flow branch of the designed RGConv structure. RepConv is a model re-parameterization technique, and its structure is shown in Figure 6b.
During training, RepConv uses a parallel structure comprising batch normalization (BN) and 1 × 1 and 3 × 3 convolutions, which efficiently captures and enhances multilevel image information; during inference, the multiple computational branches are merged into a single-branch structure to improve the efficiency and performance of the model. The core idea of re-parameterization is to merge a convolutional layer and its batch normalization into a single convolution, thereby reducing the amount of computation.
In the inference phase, the convolutional layer and batch normalization are represented by Equations (6) and (7), respectively:

$$\mathrm{Conv}(x) = W x + b \quad (6)$$

$$\mathrm{BN}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \quad (7)$$

where $W$ is the convolution weight, $x$ is the input feature map of the convolution layer, $b$ is the bias, $\mu$ and $\sigma^2$ are the input mean and input variance of the batch normalization, and $\gamma$ and $\beta$ are the parameters of the trainable affine transformation. The fusion process can be shown by Equation (8):

$$\mathrm{BN}(\mathrm{Conv}(x)) = \frac{\gamma W}{\sqrt{\sigma^2 + \epsilon}} \cdot x + \left( \frac{\gamma (b - \mu)}{\sqrt{\sigma^2 + \epsilon}} + \beta \right) \quad (8)$$

The corresponding weights $W'$ and bias $b'$ of the new convolutional layer after merging are shown in Equations (9) and (10):

$$W' = \frac{\gamma W}{\sqrt{\sigma^2 + \epsilon}} \quad (9)$$

$$b' = \frac{\gamma (b - \mu)}{\sqrt{\sigma^2 + \epsilon}} + \beta \quad (10)$$
In this process, the weights of each branch are converted to a uniform 3 × 3 size. By merging the weights and bias terms of each branch, the final weights and bias of the single-branch convolution are formed. The introduction of RepConv not only compensates for the possible performance loss caused by the removal of residual blocks but also ensures that critical information is not lost during the feature fusion process. In addition, it optimizes the gradient flow and reduces the problem of gradient vanishing, thus ensuring effective training of the deep network.
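As an illustration of this fusion, the sketch below folds a BatchNorm2d layer into a preceding Conv2d at inference time, following Equations (8)–(10); the helper name and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding,
                      groups=conv.groups, bias=True)
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                                        # gamma / sqrt(var + eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))   # Eq. (9)
    b = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((b - bn.running_mean) * scale + bn.bias)      # Eq. (10)
    return fused

# Sanity check: the fused single convolution reproduces the original
# two-layer pipeline up to floating-point error.
conv, bn = nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16).eval()
x = torch.randn(1, 8, 32, 32)
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```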
In addition, redundancy in feature maps is an important characteristic of CNNs, and the intermediate feature maps of mainstream CNNs contain extensive redundancy [32]. Rich redundancy not only helps the network understand the input data better but also captures information about the input from different perspectives. Inspired by GhostNet [33], we introduce a 1 × 1 convolution at the end of the gradient-flow branch that does not generate redundant computation. Through cheap operations, it generates redundant feature maps that alleviate the loss of scale information in the neck network caused by up-sampling. This not only reduces the overall computational effort and improves computational efficiency but also makes more effective use of high-level semantic information and low-level detail information, which, in turn, captures more detail and contextual information, strengthens the identification of occluded drogues, and thus improves drogue detection in complex environments.
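As a sketch of the cheap-operation idea (GhostNet-style, not the authors’ exact RGConv branch), half of the output channels below come from an ordinary 1 × 1 convolution and the other half are generated by an inexpensive depthwise convolution over those primary features; channel counts and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        primary = out_channels // 2
        self.primary = nn.Conv2d(in_channels, primary, 1)      # ordinary 1x1 features
        self.cheap = nn.Conv2d(primary, out_channels - primary, 3,
                               padding=1, groups=primary)      # depthwise "ghost" maps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

out = GhostConv(64, 128)(torch.randn(1, 64, 40, 40))           # -> (1, 128, 40, 40)
```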
In addition, we introduce a deep network scaling factor that allows flexible adjustment of the number of channels in a given layer. This adjustment not only controls the detail and complexity of the information flow but also enables the differentiated processing and fusion of features in different layers according to their importance and accuracy. In this way, the features delivered by the backbone network can be tuned more finely when multiple 3 × 3 convolutional layers are used, improving the model’s ability to detect and localize refueling drogues of different sizes and shapes.
2.5. Lightweight Detection Head Based on PConv
Previous YOLO detection models, such as YOLOv3 to YOLOv5, use a coupled head design [34], which combines the bounding box regression and classification tasks in a single detection head. Although this approach is simple and intuitive, classification is a multi-class problem while location prediction is a regression problem, so the errors of the two may interact, limiting the improvement of detection and localization accuracy. In contrast, YOLOv8 uses a decoupled head design to process target location and category information separately, allowing the model to learn each task in a targeted manner.
Compared to YOLOv5, the detection head of YOLOv8 achieves significantly better accuracy but also brings higher parameter and computation counts. Specifically, the parameters and computation of YOLOv8’s detection head account for 25% and 36% of the total model, respectively. This is mainly because it uses two independent branches, each with two 3 × 3 convolutional layers and one 1 × 1 convolutional layer, to process three different scales of feature maps from the feature pyramid network (FPN) for parallel classification and regression.
For this specific task, namely autonomous aerial refueling, which involves only a single detection category, we need to position the refueling drogue with high accuracy and speed. This requires a detection head that combines the advantages of a decoupled head in handling independent tasks with the advantages of a coupled head in having few parameters and high detection speed. To this end, we combined the design concepts of coupled and decoupled heads to design a lightweight detection head based on PConv [35], called PHead. The detection heads of YOLOv5, YOLOv8, and DREP-Net are shown in Figure 7a–c, respectively.
First, following the coupling concept, we merge the 3 × 3 convolutions of the original two branches into a single branch, allowing the subsequent regression and classification tasks to share the same set of weight parameters. To further lighten the detection head, we use partial convolution (PConv) and 1 × 1 convolution in place of the original two 3 × 3 convolutions. By introducing PConv, we reduce unnecessary computational redundancy, utilizing the computational resources of the device more efficiently and improving the detection speed of the model. Figure 8a,b demonstrate the difference between regular convolution and partial convolution and how each works.
It can be seen that regular convolution utilizes all channels to extract features, with the number of floating point operations calculated as in Equation (11):

$$\mathrm{FLOPs}_{\mathrm{conv}} = h \times w \times k^2 \times c_{in} \times c_{out} \quad (11)$$

where $h$ and $w$ are the height and width of the input feature map, respectively, $k$ denotes the size of the convolution kernel, and $c_{in}$ and $c_{out}$ are the numbers of channels of the input and output feature maps, respectively. Taking $c_{in} = c_{out} = c$, the number of memory accesses is given by Equation (12):

$$\mathrm{MAC}_{\mathrm{conv}} = h \times w \times 2c + k^2 \times c^2 \approx h \times w \times 2c \quad (12)$$

PConv applies regular convolution to only some of the input channels for spatial feature extraction and leaves the remaining channels unchanged. If the numbers of channels of the input and output feature maps are assumed to be equal, the number of floating point operations for partial convolution can be described as in (13):

$$\mathrm{FLOPs}_{\mathrm{pconv}} = h \times w \times k^2 \times c_p^2 \quad (13)$$

where $c_p$ is the number of channels in the first or last consecutive part of the input feature map, selected to ensure efficient memory access. Assuming the partial ratio is $c_p / c = 1/4$, the number of floating point operations is only 1/16 of that of regular convolution, and the memory access is only 1/4. The number of memory accesses is calculated as shown in (14):

$$\mathrm{MAC}_{\mathrm{pconv}} = h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p \quad (14)$$
Compared with regular convolution, PConv restricts the convolution kernel to the subset of channels that carry the most informative spatial features and leaves the remaining, largely redundant channels untouched. By applying regular convolution selectively, lower computational effort is achieved.
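A minimal PyTorch sketch of partial convolution as characterized by Equations (13) and (14) follows, using the 1/4 partial ratio mentioned above; the remaining sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    def __init__(self, channels: int, ratio: float = 0.25):
        super().__init__()
        self.cp = int(channels * ratio)          # channels actually convolved (c_p)
        self.conv = nn.Conv2d(self.cp, self.cp, 3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        part, rest = x[:, :self.cp], x[:, self.cp:]   # first consecutive c_p channels
        return torch.cat([self.conv(part), rest], dim=1)

out = PConv(64)(torch.randn(1, 64, 40, 40))      # -> (1, 64, 40, 40)
```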
We integrate the partial convolution and the 1 × 1 convolution into the same path of the detection head, which minimizes computation and memory access while making reasonable use of the raw channel information to provide rich feature representations, in turn effectively improving the model’s drogue detection speed.
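Combining the pieces, the following is a minimal sketch of a PHead-style head: one shared PConv + 1 × 1 branch feeds both the regression and classification outputs. The output widths (4 × reg_max for boxes, following the YOLOv8 convention, and one class score) and the layer sizes are assumptions for illustration, and the PConv class from the previous sketch is reused.

```python
import torch
import torch.nn as nn

class PHead(nn.Module):
    def __init__(self, channels: int, num_classes: int = 1, reg_max: int = 16):
        super().__init__()
        self.shared = nn.Sequential(              # single shared feature branch
            PConv(channels),                      # partial 3x3 convolution
            nn.Conv2d(channels, channels, 1),     # 1x1 channel mixing
            nn.SiLU(),
        )
        self.box = nn.Conv2d(channels, 4 * reg_max, 1)   # regression output
        self.cls = nn.Conv2d(channels, num_classes, 1)   # classification output

    def forward(self, x: torch.Tensor):
        y = self.shared(x)                        # both tasks share these weights
        return self.box(y), self.cls(y)

boxes, scores = PHead(64)(torch.randn(1, 64, 40, 40))
```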
3. Case Studies
3.1. Parameters of Hardware and Software
To ensure the fairness and reasonableness of the experiments in this paper, all experiments were conducted in the same environment: the Ubuntu 20.04 operating system, an NVIDIA GeForce RTX 3090 GPU, PyTorch 2.0.0 as the training framework, and Python 3.8 with CUDA 11.8. The model was optimized using the SGD optimizer, with the total number of training epochs set to 350, the batch size set to 32, and default values for the other parameters.
3.2. Dataset
To validate the performance of DREP-Net, we performed drogue detection and localization in real AAR scenarios, using a total of 41 videos of aircraft refueling in real airspace collected by Wang et al. [22]. Considering the diversity of situations faced in aerial refueling, we enriched the refueling drogue dataset by collecting additional real aerial refueling videos and sampling video frames at regular intervals, ensuring that the images differ sufficiently from one another. After labeling all the drogue images with a labeling tool, we obtained a total of 2600 images for aerial refueling drogue detection. These images were not marked with artificial features such as LEDs or painted markers; 1909 images were used for training and 691 for testing, and the test set includes a large number of first-person refueling views to match the real aerial refueling environment. The resulting dataset is rich enough to cover a variety of conditions encountered in real scenarios, such as cloud, fog, sunshine, occlusion, and deformation, and it contains both the first-person refueling viewpoint and other viewpoints. Some of the video frames extracted from the above videos are shown in Figure 9.
3.3. Evaluation Indicators
In this paper, we use precision (P), recall (R), F1 score, mean average precision (mAP), FPS, and giga floating point operations (GFLOPs) to evaluate the model. To reduce the effect of fluctuations in the model detection speed, we take eight measurements of FPS and average them. P, R, F1, and mAP are calculated as follows:

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times P \times R}{P + R}$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

where TP (true positive) denotes the number of drogues correctly detected, FP (false positive) denotes the number of drogues incorrectly detected, FN (false negative) denotes the number of drogues missed, and N is the number of categories. AP is the average precision of a single category, which is derived from the area under the P–R curve, and mAP is the mean of the average precision over all categories. Since the task is single-category detection, AP is equal to mAP here.
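For reference, a minimal sketch of turning TP/FP/FN counts into P, R, and F1 follows; the input counts in the example are illustrative, not results from our experiments.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"P": precision, "R": recall, "F1": f1}

# Example: 620 drogues detected correctly, 40 false alarms, 31 misses.
print(detection_metrics(620, 40, 31))
```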
3.4. Comparative Experiment on Different Positions of the Lightweight Module
Table 1 shows the experimental results of adding the DGST module and the RGConv module to different parts of the backbone and neck networks, respectively. The results are divided into four groups (a, b, c, and d) for comparative analysis. From Table 1, it can be seen that group a has a higher mAP50 but the lowest mean average precision under the stricter IOU criteria, and its detection speed is also the lowest of the four groups. Groups b and c have faster detection speeds but fall short of group d in accuracy and F1 score. Group d, on the other hand, achieves a better balance between detection accuracy, F1, and detection speed. This shows that replacing the original C2f structure by adding the DGST module and the RGConv module to the backbone and neck networks, respectively, is reasonable and effective.
3.5. Lightweight Module Speed Performance Analysis
In the field of real-time target detection, computational efficiency is a key indicator of technical merit. To further validate the computational efficiency and performance of the DGST and RGConv modules, we selected several core modules from different YOLO versions (the C3 module in YOLOv5, the C2f module in YOLOv8, and the RepNCSPELAN4 module in YOLOv9) and compared their performance under the same test conditions. Table 2 shows the specific experimental results.
First, to ensure hardware stability during testing, we performed 1000 warm-up runs for each module before formally recording performance data, allowing the hardware to reach a stable working state. Each module was then forward-propagated 2000 times independently using a random tensor with dimensions (batch size, channels, height, width) set to (1, 256, 128, 128) to provide an accurate performance evaluation. The performance metrics include total execution time (the sum over the 2000 runs), average execution time (total execution time divided by the number of runs), frames per second (FPS, the reciprocal of the average execution time), the number of parameters (params), and floating point operations (FLOPs).
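A minimal sketch of this timing protocol (assuming a CUDA device and any nn.Module under test) is shown below; the helper name is hypothetical.

```python
import time
import torch

def benchmark(module: torch.nn.Module, runs: int = 2000, warmup: int = 1000):
    module.eval().cuda()
    x = torch.randn(1, 256, 128, 128, device="cuda")
    with torch.no_grad():
        for _ in range(warmup):                  # let the hardware stabilize
            module(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            module(x)
        torch.cuda.synchronize()                 # wait for all queued kernels
    total = time.perf_counter() - start
    avg = total / runs
    return {"total_s": total, "avg_ms": avg * 1e3, "fps": 1.0 / avg}
```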
From these results, it can be seen that the designed RGConv module (with a scaling factor of 0.5) outperforms the other tested modules in FPS and average execution time and is more effective at reducing redundant computation and optimizing data flow. This performance advantage is mainly attributable to its internal architecture, demonstrating the benefit of using inexpensive operations to generate redundant feature maps. With a scaling factor of 1, RGConv still has fewer parameters than the C2f module and a slightly higher processing speed.
The DGST module operates at a moderate speed, though still faster than the C2f module, and achieves the lowest parameter and computation counts. Combined with Table 1, this demonstrates the value of introducing the module into the YOLOv8 backbone network: it significantly reduces the number of model parameters and improves computational efficiency whilst maintaining model performance.
3.6. Validity Analysis of Different Deep Network Scaling Factors
In an autonomous aerial refueling mission, as the receiver aircraft approaches the drogue, the imaged drogue target gradually becomes larger and, at close range, undergoes rapid motion and deformation. We therefore designed the scaling factor of the RGConv modules in the neck network and carried out targeted experiments.
The RGConv module uses different scaling factors to enable the model to adapt flexibly to different drogue sizes in complex scenarios, thus maintaining higher robustness in variable refueling scenarios. To verify the reasonableness of the scaling factor settings, we denote the scaling factors placed in front of the small, medium, and large target detection heads as small (S), medium (M), and large (L), respectively, and compared the combinations experimentally on the drogue dataset. The experimental results are shown in Table 3.
From Table 3, we can see that setting different scaling factors allows us to adjust the detail and complexity of the information flow flexibly. When the scaling factor is set to 0.5 in front of the small and medium target detection heads and to 1 in front of the large target detection head, the detection accuracy and F1 score are optimal. Evidently, when dealing with smaller drogue sizes, the computational burden can be reduced by shrinking the number of channels in the middle layer while retaining important feature information. When dealing with a nearby drogue, keeping the full number of channels in the middle layer captures more detailed information about the drogue’s target area, which effectively mitigates the effects of the drogue’s drastic shaking, deformation, and occlusion.
By setting different scale parameters for different detector heads, we can make each detector head specialize in the target size it is best at. This differentiation not only helps to improve the detection accuracy of specific types of targets but also improves the overall detection efficiency by reducing unnecessary calculations.
3.7. Ablation Analysis
To verify the role of the different modules in the overall model, we conducted ablation experiments on the refueling drogue dataset. The experimental results are shown in Table 4.
3.7.1. Effect of Individual Improvements
Taking the YOLOv8-n model as the baseline, we applied the improvement strategies one by one. The comparison shows that after introducing the DGST module into the backbone network, mAP50 improves by 1.0%, and the parameters and computations are reduced by 0.88 M and 1.5 GFLOPs, respectively, indicating that the DGST module can efficiently extract the effective information of the feature map. After adding the RGConv module to the neck network, the detection accuracy improves by 0.8%, the parameters and computations are reduced by 0.4 M and 0.8 GFLOPs, respectively, and the inference time drops to 3.96 ms, indicating that RGConv improves the neck network’s ability to capture detailed information of the drogue region at low computational cost. When the original detection head is replaced with the designed PHead, the inference time drops to 3.56 ms, the speed reaches 178.6 FPS, and the mAP decreases only marginally. When the ELA attention mechanism is added, mAP50 increases by 0.7% without a significant rise in parameters or computations.
3.7.2. Effectiveness of Joint Improvements
When DGST and RGConv are combined, the mAP50 of the model improves by 1.3%, and the detection speed also improves relative to introducing DGST or RGConv alone. When DGST, RGConv, and ELA are added, mAP50 improves by 2.4% and mAP50-95 by 0.5%. When DGST, RGConv, and PHead are added, the detection speed reaches 185.6 FPS, the inference time reaches a minimum of 3.3 ms, and the model size and computation also reach their minima, while mAP50 still outperforms the baseline by 1.6%, even though detection accuracy decreases at high thresholds. When all four improvement strategies are introduced together, mAP50 and mAP50-95 reach 88.7% and 77.3%, respectively, and the detection speed reaches 184.4 FPS, indicating that the four modules fuse well.
Figure 10a,b show the relationship between precision and recall and the relationship between mean average precision and the number of training epochs, respectively, which intuitively shows that the proposed method achieves a better balance between precision and recall and has higher detection accuracy. The ablation experiments show that all four proposed modules improve the performance of the algorithm in different respects and combine well with one another. Overall, compared to the baseline, DREP-Net improves mAP50 by 2.7% and mAP50-95 by 0.7%, reduces the computation and parameter counts by 54.3% and 38.5%, respectively, and improves the detection speed by 31.4 frames per second.
3.8. Comparison Experiment
In order to further analyze the detection performance of the DREP-Net model, we conducted comparison experiments with several mainstream detection algorithms.
Table 5 gives the experimental results, and Figure 11 gives the accuracy vs. training epochs curves and the mean average precision vs. inference time scatter plots for the six models.
Combined with Table 5, it can be seen that the proposed model has 1.9% higher detection precision than the YOLOv8s model with its deeper network, while its parameters and computational effort are much smaller than those of YOLOv8s. The YOLOv5n model has the smallest parameter count, computation, and weight file size and also achieves a speed of 181 FPS. The proposed DREP-Net algorithm is 3.9% and 2.8% higher in mAP50 than YOLOv5n and YOLOv5s, respectively, while remaining almost equal in detection speed, parameter count, and computation. Neither YOLOv6n nor YOLOv6s [36] achieves a good level of detection accuracy. YOLOv7-tiny [37] achieves a high average precision, but its parameters and computations are 3.3 and 3.5 times those of DREP-Net. Compared to YOLOv9s [38], the mAP50 of DREP-Net is 2.7% lower, but its parameters and computations are only 1/5 and 1/100 of those of YOLOv9s, and its detection speed is 139.6 FPS higher. Compared with the lighter YOLOv9t model, the mAP50 difference is only 0.3%, while the parameters and computations are 71% and 34.6% of those of YOLOv9t. YOLOv10 [39] removes non-maximum suppression at the expense of accuracy; although its detection speed is improved, it is still lower than that of DREP-Net. Combined with Figure 11a,b, it can be seen that the proposed DREP-Net achieves a good balance of precision and detection speed and has better overall performance than the current mainstream detection algorithms.
3.9. Algorithmic Visualization and Analysis
To validate the detection performance of DREP-Net in real refueling scenarios, we used it to visualize and analyze the video data described in this work. Figure 12, Figure 13, Figure 14, Figure 15, Figure 16 and Figure 17 show the detection results of the two models for different target sizes and different refueling scenarios. The numbers in the detection results represent the confidence level; results in red boxes represent the proposed method, and orange boxes represent the YOLOv8n model. From Figure 12a–d, it can be seen that when the drogue is far from the receiver’s probe, the YOLOv8n model detects the drogue with low confidence and misses detections. From Figure 13a–d, it can be seen that when the drogue target is larger, both models detect the target correctly, but the confidence of the proposed method remains higher than that of the original model. From Figure 14a–d, it can be seen that DREP-Net locates the large drogue target better and fits the correct target area of the drogue. From Figure 15a–d, it can be seen that the YOLOv8 model loses its prediction of the drogue’s shape under cloudy and foggy weather conditions, while DREP-Net is still able to identify the complete region of the drogue target properly. From Figure 16a–d, it can be seen that DREP-Net locates and identifies the drogue well and efficiently under severe occlusion and harsh conditions such as bright light and water fog, while the YOLOv8 model suffers from missed detections and low identification accuracy. As can be seen in Figure 17a–d, the original model loses the correct fit to the target when the drogue target resembles the background; in Figure 17e–h, the original model fails to distinguish similar-looking objects and produces serious misdetections, whereas the proposed model still maintains highly accurate localization of the drogue.
Figure 18 shows a comparison of the heat maps of the two models, and Figure 19 shows the heat map scale, where the closer the color is to red, the higher the model’s focus on, and confidence in, the target.
It can be observed that for large-, medium-, and small-sized drogues, the proposed algorithm effectively focuses on the overall location of the drogue, excludes surrounding interference, and captures the target’s structural information comprehensively, demonstrating good size adaptability. In the presence of drogue analogues in the background, the YOLOv8 model exhibits correlated noise in its heatmap and excessively focuses on incorrect targets, whereas the proposed algorithm reliably distinguishes the drogue from the background.
In summary, the DREP-Net model proposed in this study can locate the drogue accurately and robustly in a variety of complex environments and is suitable for UAVs carrying out aerial refueling missions in scenarios with complex backgrounds or multiple interfering features.
4. Conclusions
To address the high-speed requirements of drogue detection in aerial refueling scenarios, as well as the low accuracy of drogue target detection caused by constant changes in drogue scale, deformation, rapid jitter, and multiple interference factors in complex environments, we introduce targeted improvements based on the YOLOv8 model. We propose an efficient drogue detection algorithm, DREP-Net, specifically designed for the docking phase of autonomous UAV aerial refueling. To tackle the drogue target’s ever-changing scale, we introduce the DGST module into the backbone network of YOLOv8; this module enhances the network’s ability to express multi-scale feature information while reducing the number of parameters, enabling efficient extraction of drogue target features. The RGConv module we designed is introduced into the neck to alleviate the multi-scale information loss caused by up-sampling in the neck network; it generates redundant feature maps using cheap operations, enhancing the ability to capture detailed information while reducing the number of parameters, and effectively addresses the low drogue recognition accuracy caused by complex backgrounds or occlusion. Additionally, the ELA attention mechanism is introduced to further improve the network’s localization of the target area. For the single-category autonomous aerial refueling task, we introduced the designed PHead to replace the original detection head of YOLOv8, which significantly improved the model’s recognition speed for the drogue while maintaining detection accuracy. Finally, we conducted extensive experiments on a drogue dataset from real aerial refueling scenarios. The results show that DREP-Net exceeds the YOLOv8n model by 2.7% in mAP50 and improves the detection speed by 31.4 frames per second, a substantial improvement in both detection accuracy and speed with high robustness, and its performance surpasses that of current mainstream algorithms. At present, the dataset and samples we have collected are not rich enough, and the algorithm has only been improved and validated offline. In the future, we will continue to enrich the dataset and deploy the algorithm on UAVs to verify its effectiveness.