1. Introduction
In recent years, the evolution of military technology and the emergence of diverse threats [1] have posed significant challenges to conventional security measures. In response, deep learning, a potent artificial intelligence technology, has gained prominence in the national defense and security field [2]. It plays a pivotal role in tasks including critical target tracking [3], scene matching for autonomous navigation in restricted conditions [4], and threat identification [5]. Given their maneuverability and rapid deployment capabilities [6], small-scale, low-altitude unmanned aerial vehicles (UAVs) are frequently employed in site protection, targeted patrols, and strategic reconnaissance missions [1] to ensure swift responses to potential threats.
When identifying threat targets, conventional manual methods are inefficient, demanding considerable human resources and time, and their outcomes frequently suffer from subjectivity and arbitrariness. Handcrafted image processing methods, such as the histogram of oriented gradients (HOG) [7] and the Deformable Part Model (DPM) [8], have shown drawbacks including poor robustness and sensitivity to scale variations. Sumari et al. [9] employed a Support Vector Machine (SVM) to recognize aerial military targets; the SVM learned 11 features of military airplanes, including the wing, engine, fuselage, and tail, achieving good accuracy on their dataset.
In contrast, object detection techniques offer a promising solution for accurately locating and classifying specific targets, alleviating the aforementioned limitations. Du et al. [10] proposed a lightweight military target detector, employing a coordinate attention module in the backbone to reduce parameters and computation; they also proposed a power parameter loss combining EIOU and focal loss, further enhancing detection accuracy and convergence speed. Jafarzadeh et al. [11] applied YOLO to the automated detection of tanks, constructing a tank dataset and using data augmentation to address the scarcity of targets, and achieved satisfactory results in real-time military tank detection. Jacob et al. [12] introduced a CNN for classifying military vehicles in Synthetic Aperture Radar (SAR) images; through transfer learning on a pretrained VGG16 model, they achieved a classification accuracy of 98%. Yu et al. [13] presented a military target detector based on YOLOv3; by introducing deformable convolution and dual-attention mechanisms, they designed a ResNet50-D network that effectively improved the accuracy and speed of military target detection, providing better technical support for battlefield situation analysis.
Deploying detection algorithms on UAVs can significantly enhance the accuracy and efficiency of threat identification. However, detecting targets from low-altitude UAVs faces several challenges. First, in scouting operations, UAVs may fly at diverse attitudes and angles, leading to drastic changes in the scale and background of the targets. Additionally, unlike remote sensing images captured from a top-down perspective at high altitudes, UAVs often monitor targets from a large oblique perspective so as to stay camouflaged and concealed. Furthermore, potential threat targets are stealthy, disguised, and versatile, which further complicates recognition. Given these challenges, an ideal dataset should meet two key requirements:
The dataset should encompass a wide range of terrain and weather conditions, covering both ground and water domains;
The UAV’s viewpoint should feature a large oblique perspective and altitude variations to match real scouting mission scenarios.
The performance of detectors heavily relies on thorough extraction of image information. RGB images exhibit a strong two-dimensional local structure [14] because spatially neighboring pixels are highly correlated. This property allows humans to comprehend high-level concepts even from a tiny portion of an image. CNNs excel at capturing local feature information by effectively modeling pixel relationships within a defined kernel size. Moreover, the local receptive field and weight-sharing properties of the convolution kernel grant it shift, rotation, and scale invariance. These inductive biases have enabled CNN-based detectors to achieve good performance over recent decades. The two-stage detector Faster R-CNN [15] first generates a vast number of predefined anchors and then regresses the positions of positive ones to obtain detection results. One-stage detectors such as the YOLO series [16,17,18,19,20,21] and SSD [22] leverage a single CNN to facilitate end-to-end detection and directly generate detection results. However, CNN-based detectors pay little attention to global information modeling.
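To make these inductive biases concrete, the following minimal sketch (assuming PyTorch; the tensor sizes are arbitrary) demonstrates the shift equivariance that weight sharing provides: a convolution responds to a shifted input with a correspondingly shifted output, so the same local pattern is detected wherever it appears.

```python
# Minimal sketch (PyTorch assumed): weight sharing makes convolution
# shift-equivariant -- shifting the input shifts the output feature map.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, 1, 16, 16)                # a toy single-channel "image"
x_shifted = torch.roll(x, shifts=2, dims=3)  # shift the image 2 pixels to the right

y = conv(x)
y_shifted = conv(x_shifted)

# Away from the borders (where roll wraps around and padding differs),
# the response to the shifted image equals the shifted response.
print(torch.allclose(torch.roll(y, shifts=2, dims=3)[..., 4:-4],
                     y_shifted[..., 4:-4]))   # True
```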
By contrast, the Transformer, characterized by a global self-attention mechanism, can establish long-range dependencies between remote pixels in an image. It provides better generalization and larger model capacity than CNNs, but it requires pretraining on large datasets before transfer training on custom datasets. ViT [23] introduced the Transformer architecture to computer vision, achieving results comparable to CNNs in classification tasks. DETR [24] pioneered the encoder–decoder Transformer architecture in detection tasks. However, the global attention mechanism requires more computing resources and longer training time. To improve the efficiency of the Transformer block, subsequent works attempted to reduce the computational scope of self-attention, including Deformable DETR [25], Swin Transformer [26], CSWin Transformer [27], and PVT [28]. Nonetheless, the weight matrix derived by self-attention between tokens, which describes the relevance of different areas of an image, cannot model relationships between pixels within a single token, implying a deficiency in extracting local information.
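For reference, a minimal sketch of the global self-attention described above (plain PyTorch, single head, no learned projections) shows how every token attends to every other token, producing the N × N relevance matrix whose quadratic cost motivates the efficiency-oriented variants listed above.

```python
# Minimal sketch (PyTorch assumed) of global self-attention: every token
# attends to every other token, so long-range dependencies are modeled
# directly, at a cost quadratic in the number of tokens; relationships
# *inside* one token (patch) are never modeled.
import torch

def self_attention(x):
    # x: (N, d) -- N tokens (e.g., flattened image patches), d channels
    n, d = x.shape
    q, k, v = x, x, x                  # single head, no projections, for brevity
    attn = (q @ k.t()) / d ** 0.5      # (N, N) relevance between all token pairs
    attn = attn.softmax(dim=-1)
    return attn @ v                    # each output token mixes information globally

tokens = torch.randn(196, 64)          # 14 x 14 patches, a ViT-style layout
out = self_attention(tokens)           # every output depends on all 196 inputs
print(out.shape)                       # torch.Size([196, 64])
```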
CNN and Transformer architectures both have strengths and weaknesses. CNNs are proficient in local perception but lack global modeling capability, while Transformers excel at establishing long-range dependencies and gathering global information but are weaker at extracting localized features. Many studies have explored hybrid structures. CvT [14] proposed convolutional projection to replace the linear projection in each Transformer block; it abandoned position embedding thanks to the built-in local context of convolutions, simplifying the process. BoTNet [29] achieved remarkable results in object detection tasks by replacing the convolution layer with a multi-head self-attention block in the final three bottlenecks of ResNet [30]. However, these works were only preliminary attempts at hybrid models, as they merely coupled convolution and Transformer modules sequentially without fully leveraging the strengths of each.
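As a simplified illustration of such sequential coupling, the sketch below (PyTorch assumed; normalization and activations omitted for brevity) places multi-head self-attention where a bottleneck's 3 × 3 convolution would sit, in the spirit of BoTNet [29]; it is a schematic of the idea, not BoTNet's exact block.

```python
# Simplified sketch of sequential conv/attention coupling: inside a
# ResNet-style bottleneck, the 3x3 convolution is swapped for multi-head
# self-attention, so convolution and attention run one after the other.
import torch
import torch.nn as nn

class BottleneckWithMHSA(nn.Module):
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels // 4, kernel_size=1)  # 1x1 conv
        self.attn = nn.MultiheadAttention(channels // 4, heads, batch_first=True)
        self.expand = nn.Conv2d(channels // 4, channels, kernel_size=1)  # 1x1 conv

    def forward(self, x):                        # x: (B, C, H, W)
        y = self.reduce(x)
        b, c, h, w = y.shape
        seq = y.flatten(2).transpose(1, 2)       # (B, H*W, C): pixels become tokens
        seq, _ = self.attn(seq, seq, seq)        # attention replaces the 3x3 conv
        y = seq.transpose(1, 2).reshape(b, c, h, w)
        return x + self.expand(y)                # residual connection, as in ResNet

block = BottleneckWithMHSA()
print(block(torch.randn(2, 64, 14, 14)).shape)   # torch.Size([2, 64, 14, 14])
```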
Motivated by the above design ideas and application scenarios, this article explores how to combine the feature modeling processes of CNNs and Transformers in a decoupled manner, aiming to leverage the merits of both and to improve the performance of UAV object detection from a large oblique perspective. Our contributions are as follows:
A hybrid detection model that combines CNN and Transformer architectures is proposed for detecting military targets on the ground and at sea. The detector incorporates the T-branch Region–Pixel two-stage Transformer (RPFormer) and the C-branch Multi-gradient Path Network (MgpNet) in a decoupled manner.
RPFormer, an efficient Transformer architecture consisting of Region-stage and Pixel-stage attention mechanisms, is proposed to gather the global information of an entire image. Additionally, MgpNet, based on the Elan block [21] and inspired by the multi-gradient flow strategy [31], is introduced for local feature extraction.
A feature fusion strategy for hybrid models in the channel dimension is proposed. It fuses the two branches' feature maps in three steps: Cross-Concatenation Interaction, Global Context Embedding, and Cross-Channel Interaction (an illustrative sketch follows this list).
An object detection dataset with a military background is constructed. The dataset covers air-to-ground and air-to-sea scenarios and includes common combat units in realistic settings. All images are captured at low altitude from a large oblique perspective of 10 to 45 degrees.
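To indicate where such a channel-wise fusion sits in a hybrid model, the sketch below shows one plausible arrangement. The module name and the concrete operations are illustrative placeholders only; the step names follow the contribution list above, but this is not the exact design evaluated in this paper.

```python
# Hypothetical sketch (PyTorch assumed) of channel-dimension fusion of a
# CNN-branch map and a Transformer-branch map; operations are placeholders,
# not the paper's exact fusion module.
import torch
import torch.nn as nn

class ChannelFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Global Context Embedding: squeeze spatial dims into per-channel weights
        self.context = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                     nn.Conv2d(2 * channels, 2 * channels, 1),
                                     nn.Sigmoid())
        # Cross-Channel Interaction: mix the concatenated channels back down
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_cnn, f_trans):
        # Cross-Concatenation Interaction: join the branches along channels
        # (a plain concatenation stands in for the paper's operation)
        fused = torch.cat([f_cnn, f_trans], dim=1)   # (B, 2C, H, W)
        fused = fused * self.context(fused)          # reweight by global context
        return self.mix(fused)                       # (B, C, H, W)

fuse = ChannelFusion(channels=256)
out = fuse(torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40))
print(out.shape)  # torch.Size([1, 256, 40, 40])
```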
4. Datasets
Aerial remote sensing images are commonly captured from a top-down perspective at high altitudes [54,55,56]. However, for UAV reconnaissance patrol missions, collecting large numbers of images from such a viewpoint is impractical: the top-down perspective makes it easier for the UAV to expose its position and be attacked. In contrast, a large oblique perspective allows for better concealment and reduces the risk of being detected. Additionally, the top-down perspective provides flattened planar information, while oblique perspectives offer more details, rendering targets more visibly distinct and facilitating the localization and identification of potential threats.
Based on the above background, existing datasets cannot meet our demands. We therefore created a military target dataset for air-to-ground and air-to-sea scenarios to simulate more realistic environments. The dataset includes common units on the ground and at sea. Sample images from the dataset are shown in Figure 8.
4.1. Simulation Environments
Digital Combat Simulator (DCS) World is a sandbox game whose maps and combat units are designed to be highly realistic and to accurately reproduce real-world situations. The proposed military dataset is built on this simulator and includes air-to-ground and air-to-sea scenarios. In the air-to-sea scenario, we collected combat units of six categories, namely aircraft carriers, cruisers, destroyers, patrol ships, speedboats, and landing ships, and grouped them into three classes: large warships, medium warships, and small warships. Large warships, typically aircraft carriers, have a displacement of over 20,000 tons; medium warships, including cruisers and destroyers, have a displacement between 2000 and 20,000 tons; and small warships, including patrol ships, speedboats, and landing ships, have a displacement below 2000 tons. In the air-to-ground scenario, we collected combat units in five categories: tanks, radar vehicles, transport vehicles, rocket launchers, and armored vehicles.
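The displacement thresholds above translate directly into a simple grouping rule; the following snippet (the function name is ours) is a minimal rendering of that rule.

```python
# Displacement-based grouping of the six ship categories into three warship
# classes; thresholds (in tons) come from the text, the function name is ours.
def warship_size_class(displacement_tons: float) -> str:
    """Map a ship's displacement to the dataset's three warship classes."""
    if displacement_tons > 20_000:    # e.g., aircraft carriers
        return "large warship"
    if displacement_tons >= 2_000:    # e.g., cruisers, destroyers
        return "medium warship"
    return "small warship"            # e.g., patrol ships, speedboats, landing ships

print(warship_size_class(100_000))  # large warship
print(warship_size_class(9_500))    # medium warship
print(warship_size_class(500))      # small warship
```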
4.2. Details of Dataset
We set an oblique perspective of 10 to 45 degrees to capture images. In the air-to-ground scenario, the UAV’s flight altitude ranges from 25 to 200 m, while in the air-to-sea scenario it ranges from 200 to 1000 m. To make the data more reliable and generalizable, four weather conditions are covered: sunny, cloudy, overcast, and cloudy with rain. We carefully filtered and cleaned the data, removing irrelevant or redundant information, and we classified and labeled each image with bounding boxes using both manual and semi-automatic methods. The air-to-ground scenario contains 5238 images with 31,606 boxes; the air-to-sea scenario contains 4878 images with 38,702 boxes. The image resolution is 1080 × 1920. The detailed contents of the dataset are given in Table 2. Considering the constraints of an actual reconnaissance mission, the UAV flies at a nearly fixed altitude for 15 rounds in each scenario. The distribution details of the dataset are visualized in Figure 9.
6. Conclusions
This paper presented a hybrid detection model combining CNN and Transformer architectures to detect military targets from a UAV’s oblique perspective. The proposed detector combined the C-branch Multi-gradient Path Network and the T-branch RPFormer in a parallel, decoupled manner. A feature fusion strategy was proposed to integrate the feature maps of the two branches in three steps: Cross-Concatenation Interaction, Global Context Embedding, and Cross-Channel Interaction. Because existing remote sensing datasets were mostly collected from a top-down view, which does not correspond to actual UAV reconnaissance mission scenes, we constructed a dataset of air-to-ground and air-to-sea scenarios captured from a large oblique perspective to evaluate the effectiveness of the proposed approaches. In ablation experiments, we validated different fusion methods, including add and concatenation operations, and demonstrated the effectiveness of the proposed fusion strategy. In comparison experiments, the proposed method achieved mAP values of 57.9 and 68.9 in the air-to-ground and air-to-sea scenarios, improvements of 1.9 and 2.4, respectively, surpassing most current detection methods.
This research contributes to UAV-based detection of military targets in challenging scenarios. However, our work still has shortcomings. Although we carefully designed the dataset for military targets, a gap remains with respect to actual reconnaissance scenarios involving, for example, low light, smoke, and flames. Detection of small targets has improved, but small targets are still more easily missed or wrongly detected than medium and large ones. In future work, we will continue to focus on these problems.