Article

An Effective and Lightweight Full-Scale Target Detection Network for UAV Images Based on Deformable Convolutions and Multi-Scale Contextual Feature Optimization

School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(16), 2944; https://doi.org/10.3390/rs16162944
Submission received: 21 June 2024 / Revised: 7 August 2024 / Accepted: 8 August 2024 / Published: 11 August 2024
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

Abstract

Currently, target detection on unmanned aerial vehicle (UAV) images is a research hotspot. Due to the significant scale variability of targets and the interference of complex backgrounds, current target detection models face challenges when applied to UAV images. To address these issues, we designed an effective and lightweight full-scale target detection network, FSTD-Net. The design of FSTD-Net is based on three principal aspects. Firstly, to optimize the extracted target features at different scales while minimizing background noise and sparse feature representations, a multi-scale contextual information extraction module (MSCIEM) is developed. The multi-scale information extraction module (MSIEM) in MSCIEM can better capture multi-scale features, and the contextual information extraction module (CIEM) in MSCIEM is designed to capture long-range contextual information. Secondly, to better adapt to various target shapes at different scales in UAV images, we propose the feature extraction module fitting different shapes (FEMFDS), based on deformable convolutions. Finally, considering that low-level features contain rich details, a low-level feature enhancement branch (LLFEB) is designed. The experiments demonstrate that, compared to the second-best model, the proposed FSTD-Net achieves improvements of 3.8%, 2.4%, and 2.0% in AP50, AP, and AP75 on the VisDrone2019 dataset, respectively. Additionally, FSTD-Net achieves improvements of 3.4%, 1.7%, and 1.0% in AP50, AP, and AP75 on the UAVDT dataset. Our proposed FSTD-Net achieves better detection performance than state-of-the-art detection models. The experimental results indicate the effectiveness of FSTD-Net for target detection in UAV images.

1. Introduction

Target detection represents one of the most important fundamental tasks in the domain of computer vision, attracting substantial scholarly and practical interest in recent years. The core objective of target detection lies in accurately identifying and classifying various targets within digital images, such as humans, vehicles, or animals, and precisely determining their spatial locations. The advent and rapid evolution of deep learning technologies have significantly propelled advancements in target detection, positioning them as a focal point of contemporary research with extensive applications across diverse sectors [1,2]. Notably, drones, leveraging their exceptional flexibility, mobility, and cost-effectiveness, have been integrated into a wide array of industries. These applications span emergency rescue [3], traffic surveillance [4], military reconnaissance [5], agricultural monitoring [6], and so on.
For natural images, target detection technology has made considerable progress. This technology can be divided into single-stage and two-stage detection methods. Two-stage detection methods primarily include region-based target detection approaches such as R-CNN [7], Fast R-CNN [8], Faster R-CNN [9], and so on. Single-stage detection methods encompass techniques like YOLO, SSD, and DETR [10,11]. Natural images are typically captured from a ground-level perspective, with a relatively fixed viewpoint and minimal changes in the shapes and sizes of targets. In contrast, drone images are usually taken from high altitudes, offering a bird’s-eye view. Under this perspective, especially when the drone’s flight altitude varies, the shapes and sizes of targets can change significantly. As shown in Figure 1, we analyzed the number of small, medium, and large targets in the VisDrone 2019 and UAVDT datasets (the latter being divided into training, testing, and validation sets for clarity). The results indicate that the training, testing, and validation sets of both the VisDrone 2019 and UAVDT datasets contain targets of various sizes, showing significant differences in target scale distribution.
Moreover, targets in the drone images often have more complex backgrounds due to variations in lighting, weather conditions, and surrounding environments. As shown in Figure 2a, the buildings, trees, and other non-target categories presented in the images, along with the dim lighting conditions, make target detection in UAV images more challenging. For Figure 2b, the occlusion by trees, shadows, and adverse weather conditions introduces more challenges to the target detection task. Consequently, the significant scale differences and complex backgrounds make it challenging to directly apply existing detection models designed for natural images to UAV images.
In this paper, to address the challenges of significant scale variability of targets and complex backgrounds in UAV images, we propose the UAV target detection network, FSTD-Net. The design of FSTD-Net is based on three principal aspects. Firstly, to capture more effective multi-level features and avoid the interference of complex backgrounds, the multi-scale and contextual information extraction module (MSCIEM) is designed. Secondly, to extract full-scale object information that can better represent targets of various scales, we designed the feature extraction module fitting different shapes (FEMFDS), which is guided by the deformable convolutions. The positions of the sampling points in deformable convolutions can be dynamically adjusted during the learning process; the FEMFDS is positioned before the detection heads for different scales, enabling it to effectively capture full-scale target features of different target shapes. Finally, to fully utilize the detailed information contained in the low-level features, we propose the low-level feature enhancement branch (LLFEB), which uses a top-down and bottom-up feature fusion strategy to incorporate low-level semantic features. In summary, the main contributions are as follows:
(1)
The MSCIEM is proposed to optimize the semantic features of targets on different scales while avoiding the interference of complex backgrounds. MSIEM employs parallel convolutions with different kernel sizes to capture the features of targets on different scales. The CIEM embedded in MSCIEM is designed to capture long-range contextual information. Thus, the MSCIEM can effectively capture multi-scale and contextual information.
(2)
We design the FEMFDS, which is guided by deformable convolutions. Because the FEMFDS breaks through the conventional feature extraction paradigm of standard convolution, it can dynamically match full-scale target shapes.
(3)
Because low-level features contain rich details, such as edges and textures, we designed the LLFEB. The LLFEB uses a top-down and bottom-up feature fusion strategy to incorporate low-level and high-level features, which can effectively utilize the detailed information in low-level features.
(4)
Experiments on the VisDrone2019 and UAVDT datasets show that the proposed FSTD-Net surpasses state-of-the-art detection models. Our proposed FSTD-Net achieves more accurate detection of targets at various scales and in complex backgrounds. In addition, FSTD-Net is lightweight. These experiments demonstrate the superior performance of the proposed FSTD-Net.
This paper is structured into six distinct sections. Section 2 reviews the related work pertinent to this work. Section 3 elaborates on the design details of the FSTD-Net. Section 4 presents the parameter settings of the model training process, training environment, and experimental results. Section 5 provides an in-depth discussion of the advantages of FSTD-Net and the impacts of the cropping and resizing methods designed for the experiments. Finally, Section 6 summarizes the current work and proposes directions for future research.

2. Related Work

2.1. YOLO Object Detection Models and New Structure Innovations

The YOLO series of target detection models has played a significant role in advancing the task of target detection in natural images. YOLOv1 initially proposed an end-to-end target detection framework, treating target detection as a regression problem [12]. Subsequent versions, such as YOLOv2 [13], YOLOv3 [14], YOLOv4 [15], and so on, continually optimized network architectures and training methods, leading to improvements in detection accuracy and speed. YOLOv2 improves YOLOv1 by using anchor boxes for better target size and shape handling, higher resolution feature maps for improved localization, and batch normalization for faster convergence and enhanced detection performance. YOLOv3 enhances YOLOv2 by using multi-scale predictions, a ResNet-based architecture, and logistic regression classifiers, improving small target detection, feature extraction, and suitability for multi-label classification, thus boosting accuracy and speed. YOLOv4 introduced the Bag of Specials and Bag of Freebies to enhance model expressiveness. YOLOv5 absorbed state-of-the-art structures and added modules that are easy to study. YOLOv6 customized networks of different scales for various applications and adopted a self-correcting strategy [16]. YOLOv7 addressed several issues through trainable bag-of-freebies and extended the approach [17]. YOLOv8 employed the latest C2f structure and PAN structure, with a decoupled form for the detection head [18]. YOLOv9 further improved the model structure, resolving the issue of information loss [19]. The latest YOLOv10 further advances the performance-efficiency boundary of YOLOs through improvements in both post-processing techniques and model architecture [20]. These enhancements have significantly advanced the YOLO series models in target detection tasks.
Feature Pyramid Networks (FPN) are widely adopted in target detection frameworks, including the YOLO series, and have been proven to significantly enhance detection performance [21]. Recently, several studies have focused on improving the FPN structure. DeNoising FPN (DN-FPN), proposed by Liu et al., employs contrastive learning to suppress noise at each level of the feature pyramid in the top-down pathway of the FPN [22]. GraphFPN is capable of adjusting the topological structure based on different intrinsic image structures, achieving feature interaction at the same and different scales in contextual layers and hierarchical layers [23]. AFPN enables feature fusion across non-adjacent levels to prevent feature loss or degradation during transmission and interaction. Additionally, through adaptive spatial fusion operations, AFPN suppresses information conflicts between features at different levels [24].
Furthermore, several enhancements to standard convolutional feature extraction have been proposed recently, with deformable convolution garnering significant attention. Standard convolution has a fixed geometric structure, which is detrimental to feature extraction. However, the Deformable Convolution series can overcome this limitation of the CNN building blocks. When first introduced, Deformable Convolution (DCNv) introduced deformable convolution kernels and deformable ROI pooling to enhance the model’s modeling capability, and it can automatically learn offsets in target tasks without additional supervision [25]. DCNv2 addresses the issues of DCNv by further introducing a modulation mechanism to adjust the spatial support regions, avoiding irrelevant information affecting the features, and achieving better results compared to the original DCNv [26]. DCNv3 further extends DCNv2 by weight sharing among convolutional neurons and introducing multiple sets of mechanisms and per-sample normalization modulation scalars [27]. DCNv4 addresses the limitations of DCNv3 by eliminating soft-max normalization in spatial aggregation to enhance its dynamic properties and expressive power. Then, it optimizes memory access to minimize redundant operations and improve speed [28].

2.2. Methods of Mitigating the Impact of Complex Backgrounds on Target Detection

To mitigate the impact of complex backgrounds on target detection, Liu et al. designed a center-boundary dual attention (CBDA) module within their network, employing dual attention mechanisms to extract attention features from target centers and boundaries, thus learning more fundamental characteristics of rotating targets and reducing background interference [29]. Shao et al. developed a spatial attention mechanism to enhance target feature information and reduce interference from irrelevant background information [30]. Wang et al. introduced the target occlusion contrast module (TOCM) to improve the model’s ability to distinguish between targets and non-targets [31]. To alleviate interference caused by complex backgrounds, Zhang et al. designed a global-local feature guidance (GLFG) module that uses self-attention to capture global information and integrate it with local information [32]. Zhou proposed a novel encoder-decoder architecture, combining a transformer-based encoder with a CNN-based decoder, to address the limitations of CNNs in global modeling [33]. Zhang et al. constructed a spatial location attention (SLA) module to highlight key targets and suppress background information [34]. Gao et al. introduced a multitask enhanced structure (MES) that injects prominent information into different feature layers to enhance multi-scale target representation [35]. Gao et al. also designed the hidden recursive feature pyramid network (HRFPN) to improve multi-scale target representation by focusing on information changes and injecting optimized information into fused features [36]. Furthermore, Gao et al. designed an innovative scale-aware network with a plug-and-play global semantic interaction module (GSIIM) embedded within the network [37]. Dong et al. developed a novel multiscale deformable attention module (MSDAM), embedded in the FPN backbone network, to suppress background features while highlighting target information [38].

2.3. Methods of Addressing the Impact of Significant Scale Variations on Target Detection

To address the impact of significant scale variations on target detection, Shen et al. combined inertial measurement units (IMU) to estimate target scales and employed a divide-and-conquer strategy to detect targets of different scales [39]. Jiang et al. enhanced the model’s ability to perceive and detect small targets by designing a small target detection head to replace the large target detection head [40]. Liu et al. designed a widened residual block (WRB) to extract more features at the residual level, improving detection performance for small-scale targets [41]. Lan et al. developed a multiscale localized feature aggregation (MLFA) module to capture multi-scale local feature information, enhancing small target detection [42]. Mao et al. proposed an efficient receptive field module (ERFM) to extract multi-scale information from a variety of feature maps, mitigating the impact of scale variations on small target detection performance [43]. Hong et al. developed the scale selection pyramid network (SSPNet) to enhance the detection ability of small-scale targets in images [44]. Cui et al. created the context-aware block network (CAB Net), which establishes high-resolution and strong semantic feature maps, thereby better detecting small-scale targets [45]. Yang et al. introduced the QueryDet scheme, which utilizes a novel query mechanism to speed up the inference of feature pyramid-based target detectors by first predicting rough locations of small targets on low-resolution features and then using these rough locations to guide high-resolution features for precise detection results [46]. Zhang et al. proposed the semantic feature fusion attention (FFA) module, which aggregates information from all feature maps of different scales to better facilitate information exchange and alleviate the challenge of significant scale variations [47]. Nie et al. designed the enhanced context module (ECM) and triple attention module (TAM) to enhance the model’s ability to utilize multi-scale contextual information, thereby improving multi-scale target detection accuracy [48]. Wang et al. introduced a lightweight backbone network, MSE-Net, which better extracts small-scale target information while also providing good feature descriptions for targets of other scales [49]. Cai et al. designed a poly kernel inception network (PKINet) to capture features of targets more effectively at different scales, thereby addressing the significant variations in target scales more efficiently [50].

3. Methods

In this section, the baseline model YOLOv5s and the proposed FSTD-Net will be introduced. Firstly, the overall framework of the baseline model YOLOv5s will be introduced. Then, we will give an overall introduction to the FSTD-Net. Finally, we will discuss the components of FSTD-Net in detail: the LLFEB, which provides sufficient detailed information from low-level features; the MSCIEM, which effectively captures target features at different scales as well as long-range contextual information; and the FEMFDS, which is based on deformable convolution and is sensitive to full-scale targets of various shapes.

3.1. The Original Structure of Baseline YOLOv5s

YOLOv5 is a classic single-stage target detection algorithm known for its fast inference speed and good detection accuracy. It comprises five different models: n, s, m, l, and x, which can meet the needs of various scenarios. As shown in Figure 3, the YOLOv5s architecture consists of a backbone for feature extraction, which includes multiple CBS modules composed of convolutional layers, batch normalization layers, and SiLU activation functions, as well as C3 modules for complex feature extraction (comprising multiple convolutional layers and cross-stage partial connections). At the end of the backbone, global feature extraction is performed using the SPPF module. The SPPF (Spatial Pyramid Pooling-Fast) module in YOLOv5 serves the purpose of enhancing the receptive field and capturing multi-scale spatial information without increasing the computational burden significantly. By applying multiple pooling operations with different kernel sizes on the same feature map and then concatenating the pooled features, SPPF effectively aggregates contextual information from various scales. This helps in improving the model’s ability to detect targets of different sizes and enhances the robustness of feature representations, contributing to better overall performance in target detection tasks.
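The repeated-pooling behavior of SPPF described above can be sketched in a few lines of PyTorch. The block below is a minimal illustration, not the exact YOLOv5s implementation: the channel sizes are arbitrary, and plain 1 × 1 convolutions stand in for the full CBS (Conv-BN-SiLU) blocks.

```python
# A minimal SPPF-style sketch: one 5x5 max-pooling applied repeatedly, with the intermediate
# results concatenated. Channel sizes and the plain 1x1 convolutions (instead of the full
# CBS Conv-BN-SiLU blocks) are simplifying assumptions, not the exact YOLOv5s code.
import torch
import torch.nn as nn

class SPPFSketch(nn.Module):
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2                                   # reduce channels before pooling
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1, 1)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)                                      # 5x5 receptive field
        y2 = self.pool(y1)                                     # stacked pooling approximates 9x9
        y3 = self.pool(y2)                                     # stacked pooling approximates 13x13
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))     # aggregate multi-scale context

x = torch.randn(1, 512, 20, 20)
print(SPPFSketch(512, 512)(x).shape)                           # torch.Size([1, 512, 20, 20])
```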
The Neck incorporates both FPN and PANet [51] structures, employing concatenation and nearest-neighbor interpolation for upsampling to achieve feature fusion. FPN is utilized to build a feature pyramid that enhances the model’s ability to detect targets at different scales by combining low-level and high-level feature maps. PANet further improves this by strengthening the information flow between different levels of the feature pyramid, thus enhancing the model’s performance in detecting small and large targets.
The Head consists of several convolutional layers that transform the feature maps into prediction outputs, providing the class, bounding box coordinates, and confidence for each detected target. In the YOLOv5s, the detection head adopts a similar structure to YOLOv3 due to its proven efficiency and effectiveness in real-time target detection. This structure supports multi-scale predictions by detecting targets at three different scales, handles varying target sizes accurately, and utilizes anchor-based predictions to generalize well across different target shapes and sizes. Its simplicity and robustness make it easy to implement and optimize, leading to faster training and inference times while ensuring high performance and reliability across diverse applications.

3.2. The Overall Framework of FSTD-Net

In this paper, our proposed FSTD-Net is based on the YOLOv5s architecture, as illustrated in Figure 4. The structure of FSTD-Net includes three main components: The Backbone, the Neck, and the Head, all integrated into an end-to-end framework for efficient and precise target detection. The Backbone is responsible for feature extraction, leveraging YOLOv5s for its lightweight design, which balances performance with computational efficiency. The Neck further processes and fuses these features from different levels of the backbone, enhancing the network’s ability to recognize targets of varying scales and complex backgrounds through multi-scale feature fusion. The Head, employing a detection mechanism similar to YOLOv3, accurately identifies and localizes target categories and positions using regression methods, ensuring high accuracy and reliability. Overall, FSTD-Net’s design combines lightweight architecture with multi-scale feature integration, achieving high detection accuracy while maintaining low computational complexity and high processing efficiency, making it suitable for UAV target detection tasks.
For FSTD-Net, the Backbone adopts the CSP-Darknet network structure, which effectively extracts image features. In the Neck part, to better suppress complex background information in UAV images and extract information from targets of different scales, the MSCIEM replaces the original C3 module. The MSCIEM is mainly composed of MSIEM and CIEM. MSIEM utilizes multiple convolutional kernels with different receptive fields to better extract information from targets on different scales. Meanwhile, the context information extraction module (CIEM) is embedded in the MSCIEM to extract long-distance contextual semantic information. To improve the detection of targets of different scales in UAV images, the FEMFDS module breaks through the inherent feature extraction method of standard convolution. It can better fit the shapes of targets of different scales and extract more representative features. The designs of MSCIEM and FEMFDS aim to enhance the detection performance of FSTD-Net for targets of different sizes. Furthermore, considering that low-level features contain rich, detailed features such as edges, corners, and texture features, the LLFEB is separately designed to better utilize low-level semantic features and enhance the target detection performance of FSTD-Net.

3.3. Details of MSCIEM

Since UAV images are captured from a high-altitude perspective, they introduce complex background information and contain various types of targets. The complexity of the background and the scale differences can significantly impact the performance of target detection. To mitigate the introduction of complex background information and address the issue of scale differences in targets, MSCIEM is designed. As shown in Figure 5, the structural composition of MSCIEM is illustrated. MSCIEM consists of an independent shortcut connection branch (SCB), MSIEM, and CIEM, where features from these three paths are fused as the final output feature. The MSIEM, designed with convolutional kernels of various sizes, effectively captures feature information from targets of different scales in feature maps. Meanwhile, the CIEM is employed to capture contextual information, enhancing the overall feature representation capability.
The input and output of MSCIEM are $f_{in} \in \mathbb{R}^{C_{in} \times H_{in} \times W_{in}}$ and $f_{out} \in \mathbb{R}^{C_{out} \times H_{out} \times W_{out}}$, respectively. The independent branch SCB consists of a convolutional layer with a kernel size of 3 × 3 and a stride of 1. Its input is $f_{in} \in \mathbb{R}^{C_{in} \times H_{in} \times W_{in}}$, and the output feature after processing by SCB is $SCB_{out}$, as shown in Equation (1).
$$SCB_{out} = SCB(f_{in}) \in \mathbb{R}^{C_{in} \times H_{in} \times W_{in}} \qquad (1)$$
Before being fed into MSIEM and CIEM, $f_{in}$ undergoes a convolutional layer with a 1 × 1 kernel to reduce the number of channels to 1/4 of the original, resulting in the input $f_{in} \in \mathbb{R}^{C_{in}/4 \times H_{in} \times W_{in}}$ for both MSIEM and CIEM. The MSIEM follows an inception-style structure [52,53]. The input first undergoes a 3 × 3 depth-wise convolution (DWConv) to extract local information and is then passed through a group of parallel DWConvs with kernel sizes of 5 × 5, 7 × 7, and 9 × 9 to extract cross-scale contextual information. Let $L$ represent the local features extracted by the 3 × 3 DWConv, and let $Z_i \in \mathbb{R}^{C_{in}/4 \times H_{in} \times W_{in}}$, $i = 1, 2, 3$, represent the cross-scale contextual features extracted by the DWConvs with kernel sizes of 5 × 5, 7 × 7, and 9 × 9, respectively. The contextual features $Z_i$ and the local features $L$ are fused using a 1 × 1 convolution, resulting in the features $P_{out}$, as shown in Equation (2).
$$P_{out} = Conv_{1 \times 1}\left(L + \sum_{i=1}^{3} Z_i\right) \qquad (2)$$
The main role of CIEM is to capture long-range contextual information. CIEM is integrated with MSIEM to strengthen the central features while further capturing the contextual interdependencies between distant pixels. As shown in Figure 5, the local features $F_{local}$ are obtained by applying an average pooling layer ($Avg$) followed by a 1 × 1 convolutional layer to $f_{in} \in \mathbb{R}^{C_{in}/4 \times H_{in} \times W_{in}}$. Two depth-wise strip convolutions are then applied to $F_{local}$ to extract the long-range contextual information $F_{long}$. Finally, the sigmoid activation function is used to obtain the final attention feature map $F_{att}$, as shown in Equation (3).
$$F_{att} = Sigmoid\left(Conv_{1 \times 1}\left(DWC_{11 \times 1}\left(DWC_{1 \times 11}(F_{local})\right)\right)\right) \qquad (3)$$
In Equation (3), $DWC_{11 \times 1}$ and $DWC_{1 \times 11}$ represent depth-wise strip convolutions with kernel sizes of 11 × 1 and 1 × 11, respectively, and $Sigmoid$ denotes the activation function. $F_{att}$ is used to enhance the contextual expression of the features extracted by MSIEM. After the features extracted by MSIEM and CIEM are processed by a 1 × 1 convolution, they are fused with the features extracted by SCB to obtain the final features.
$$f_{out} = Conv_{1 \times 1}\left(P_{out} + P_{out} \otimes F_{att}\right) + SCB_{out} \qquad (4)$$
In Equation (4), $\otimes$ represents element-wise multiplication.
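To make Equations (1)-(4) concrete, the following PyTorch sketch assembles the SCB, MSIEM, and CIEM paths as described above. The kernel sizes follow the text; the average-pooling window and the assumption that $C_{out}$ equals $C_{in}$ are ours, so this is an illustrative sketch rather than the authors' implementation.

```python
# A minimal sketch of MSCIEM (Equations (1)-(4)). Kernel sizes follow the text
# (3x3 SCB; 3x3/5x5/7x7/9x9 depth-wise convs; 1x11 and 11x1 strip convs); the average-pooling
# window and C_out = C_in are assumptions.
import torch
import torch.nn as nn

def dwconv(c, k):
    # depth-wise convolution that keeps the spatial size
    if isinstance(k, int):
        k = (k, k)
    return nn.Conv2d(c, c, k, padding=(k[0] // 2, k[1] // 2), groups=c)

class MSCIEMSketch(nn.Module):
    def __init__(self, c):
        super().__init__()
        cr = c // 4
        self.scb = nn.Conv2d(c, c, 3, 1, 1)                    # shortcut connection branch, Eq. (1)
        self.reduce = nn.Conv2d(c, cr, 1)                      # 1x1 channel reduction to C/4
        # MSIEM: local 3x3 DWConv plus parallel 5x5 / 7x7 / 9x9 DWConvs, fused by 1x1 conv, Eq. (2)
        self.local = dwconv(cr, 3)
        self.scales = nn.ModuleList([dwconv(cr, k) for k in (5, 7, 9)])
        self.fuse = nn.Conv2d(cr, cr, 1)
        # CIEM: avg pool + 1x1 conv, two strip DWConvs, 1x1 conv, sigmoid, Eq. (3)
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)       # pooling window is an assumption
        self.pre = nn.Conv2d(cr, cr, 1)
        self.strip_h = dwconv(cr, (1, 11))
        self.strip_v = dwconv(cr, (11, 1))
        self.att = nn.Conv2d(cr, cr, 1)
        self.out = nn.Conv2d(cr, c, 1)                         # final 1x1 conv of Eq. (4)

    def forward(self, x):
        scb = self.scb(x)
        f = self.reduce(x)
        l = self.local(f)                                      # local features L
        p = self.fuse(l + sum(s(l) for s in self.scales))      # P_out, Eq. (2)
        f_local = self.pre(self.pool(f))                       # F_local
        f_att = torch.sigmoid(self.att(self.strip_v(self.strip_h(f_local))))  # F_att, Eq. (3)
        return self.out(p + p * f_att) + scb                   # Eq. (4), * is element-wise

x = torch.randn(1, 256, 40, 40)
print(MSCIEMSketch(256)(x).shape)                              # torch.Size([1, 256, 40, 40])
```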

3.4. The Design of FEMFDS

Standard convolutions have fixed geometric structures and receptive fields; thus, they have certain limitations in feature extraction. Standard convolutions cannot effectively model full-scale targets of different shapes present in UAV images. However, deformable convolutions, by adding offsets to the sampling positions within the module and learning these offsets during the target task, can better model full-scale targets of various shapes present in UAV images. Thus, we designed the FEMFDS to better fit different full-scale target shapes. As shown in Figure 6, the FEMFDS consists of two branches: one branch is used to extract local features from the image, and the other branch is dedicated to extracting features that better fit targets of different sizes. The features extracted by these two branches are then fused to obtain the final feature representation.
The deformable convolution used in FEMFDS is deformable convolution v4 (DCNv4). The reason for choosing DCNv4 is that, building on DCNv3, it removes the soft-max normalization to achieve stronger dynamic properties and expressiveness. Furthermore, as shown in Figure 7, memory access is further optimized: DCNv4 uses one thread to process multiple channels in the same group that share sampling offsets and aggregation weights. Thus, DCNv4 has a faster inference speed. Given an input $x \in \mathbb{R}^{C \times H \times W}$ with height $H$, width $W$, and $C$ channels, the process of deformable convolution is as follows:
$$y_g = \sum_{k=1}^{K} m_{gk}\, x_g(p_0 + p_k + \Delta p_{gk}), \qquad y = concat([y_1, y_2, \ldots, y_g],\ axis{=}1) \qquad (5)$$
In Equation (5), $g$ represents the number of groups for spatial grouping along the channel dimension. For the $g$-th group, $x_g \in \mathbb{R}^{C^{*} \times H \times W}$ and $y_g \in \mathbb{R}^{C^{*} \times H \times W}$ represent the input and output feature maps of that group, where the group channel dimension is $C^{*} = C/g$. $m_{gk} \in \mathbb{R}$ represents the spatial aggregation weight of the $k$-th sampling point in the $g$-th group. $p_k$ represents the $k$-th position of the predefined sampling grid, as in standard convolution: $\{(-1, -1), (-1, 0), \ldots, (0, +1), \ldots, (+1, +1)\}$. $\Delta p_{gk}$ represents the offset corresponding to the grid sampling position $p_k$ in the $g$-th group.
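DCNv4 ships with its own optimized CUDA kernels; as an illustration of the learned-offset sampling in Equation (5), the sketch below uses torchvision's modulated deformable convolution (a DCNv2-style operator with sigmoid-bounded aggregation weights). It is a stand-in for, not a reproduction of, the authors' FEMFDS, and the offset-prediction layer is an assumed design.

```python
# Illustrative deformable block: a small conv predicts sampling offsets and modulation scalars,
# which torchvision's deform_conv2d then uses to sample the input at learned positions.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableBlockSketch(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.01)
        # one conv predicts both sampling offsets (2 per point) and modulation scalars (1 per point)
        self.offset_mask = nn.Conv2d(c_in, 3 * k * k, 3, padding=1)

    def forward(self, x):
        om = self.offset_mask(x)
        offset = om[:, : 2 * self.k * self.k]                 # learned offsets (Delta p)
        mask = torch.sigmoid(om[:, 2 * self.k * self.k :])    # aggregation weights (m_gk analogue)
        return deform_conv2d(x, offset, self.weight, padding=self.k // 2, mask=mask)

x = torch.randn(1, 64, 32, 32)
print(DeformableBlockSketch(64, 64)(x).shape)                 # torch.Size([1, 64, 32, 32])
```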

3.5. The Structure of LLFEB

In convolutional neural networks, low-level semantic features include rich details such as edges, corners, textures, and so on. These features provide the foundation for subsequent high-level features, thereby enhancing the model's detection performance on images. In the YOLOv5s baseline model, the feature maps obtained after two down-sampling operations are not directly used. Rather than fusing these low-level semantic features directly with the deeper, down-sampled features, we adopt a top-down and bottom-up feature fusion approach. The fusion of low-level and high-level features combines local, detailed information with global semantic information, forming a more comprehensive feature representation. This multi-level information integration helps FSTD-Net better understand and parse image content, thereby further enhancing the detection performance of FSTD-Net on UAV images.
As shown in Figure 8, the specific implementation of LLFEB is as follows. The low-level feature map is $F_2$. A convolutional layer first operates on the corresponding higher-level semantic feature $F_2^{high}$, and the processed $F_2^{high}$ is then upsampled to obtain $F_2^{LE}$. Next, $F_2^{LE}$ and $F_2$ are merged with a Concat operation, the fused feature map is further processed by FEMFDS and a convolutional layer, and finally the enhanced low-level semantic feature is obtained.
The overall process of LLFEB is as follows:
$$F_2^{LE} = Up(CBS(F_2^{high}))$$
$$F_2^{LE} = Concat\big(FE(Concat(F_2^{LE}, F_2)),\ CBS(F_2^{high})\big) \qquad (6)$$
In Equation (6), $Up$ represents the upsampling operation and $FE$ represents the FEMFDS module.
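The LLFEB fusion path can be sketched as follows. The sketch follows the textual description (CBS on the high-level feature, upsampling, concatenation with $F_2$, then FEMFDS and a convolution); FEMFDS is replaced by a placeholder module and the channel widths are assumptions, so this is an illustration rather than the exact module.

```python
# A minimal LLFEB-style fusion sketch: process and upsample the high-level feature, concatenate
# it with the low-level map F2, then refine with a (placeholder) FEMFDS and a CBS block.
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=3, s=1):
    # Conv + BatchNorm + SiLU, the CBS block used throughout YOLOv5-style networks
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class LLFEBSketch(nn.Module):
    def __init__(self, c_low, c_high, femfds: nn.Module):
        super().__init__()
        self.pre = cbs(c_high, c_low)                          # CBS on the high-level feature
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.femfds = femfds                                   # placeholder for the deformable module
        self.post = cbs(2 * c_low, c_low)

    def forward(self, f2, f2_high):
        f2_le = self.up(self.pre(f2_high))                     # Up(CBS(F2_high))
        fused = torch.cat([f2_le, f2], dim=1)                  # Concat(F2_LE, F2)
        return self.post(self.femfds(fused))                   # FEMFDS + conv -> enhanced feature

f2, f2_high = torch.randn(1, 128, 160, 160), torch.randn(1, 256, 80, 80)
block = LLFEBSketch(128, 256, femfds=nn.Identity())            # Identity stands in for FEMFDS
print(block(f2, f2_high).shape)                                # torch.Size([1, 128, 160, 160])
```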

3.6. Loss Function

The loss function of FSTD-Net consists of three components: the bounding box regression loss, the classification loss, and the objectness loss. For the bounding box regression loss, the Complete Intersection over Union (CIoU) loss is used. The CIoU loss considers not only the IoU of the bounding box but also the distance between box centers and the difference in aspect ratios, which makes the prediction of the bounding box more accurate.
$$CIoU = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v, \qquad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \qquad (7)$$
In Equation (7), $IoU$ denotes the Intersection over Union of the predicted and ground truth boxes. As shown in Figure 9, we use two different colors of boxes to represent the ground truth boxes and the predicted boxes; $w^{gt}$, $h^{gt}$, $w$, and $h$ represent the widths and heights of the ground truth boxes and predicted boxes, respectively. $\rho(b, b^{gt})$ represents the Euclidean distance between the center point $b$ of the predicted box and the center point $b^{gt}$ of the ground truth box. $c$ represents the length of the diagonal of the smallest enclosing box that contains both the predicted box and the ground truth box. $\alpha$ is the weight parameter used to balance the impact of the aspect ratio, and $v$ is the consistency measure of the aspect ratio.
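A direct implementation of Equation (7) for axis-aligned (x1, y1, x2, y2) boxes is given below; it is a standalone sketch of the formula rather than the exact loss code used to train FSTD-Net.

```python
# CIoU loss per Equation (7): IoU term, normalized center distance, and aspect-ratio penalty.
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # intersection-over-union term
    lt, rb = torch.max(pred[:, :2], target[:, :2]), torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).clamp(min=0).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).clamp(min=0).prod(dim=1)
    iou = inter / (area_p + area_t - inter + eps)

    # squared center distance rho^2 over squared enclosing-box diagonal c^2
    c_p, c_t = (pred[:, :2] + pred[:, 2:]) / 2, (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((c_p - c_t) ** 2).sum(dim=1)
    enc_lt, enc_rb = torch.min(pred[:, :2], target[:, :2]), torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(dim=1) + eps

    # aspect-ratio consistency term v and its weight alpha
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v

pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
gt = torch.tensor([[12.0, 8.0, 48.0, 62.0]])
print(ciou_loss(pred, gt))  # small value (about 0.16) for these well-aligned boxes
```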
For classification loss, the classification loss (CLoss) measures the difference between the predicted and true categories of the target within each predicted box. The Binary Cross-Entropy Loss (BCEloss) is used to calculate the classification loss.
$$CLoss = -\sum_{c=1}^{C}\left[y_c \log(p_c) + (1 - y_c)\log(1 - p_c)\right] \qquad (8)$$
In Equation (8), $C$ represents the total number of categories, $y_c$ represents the true label of class $c$ (0 or 1), and $p_c$ represents the predicted probability of class $c$.
For Objectness Loss (OLoss), the objectness loss measures the difference between the predicted confidence and the true confidence regarding whether each predicted box contains a target.
$$OLoss = -\left[y \log(p) + (1 - y)\log(1 - p)\right] \qquad (9)$$
In Equation (9), $y$ represents the true confidence (1 if a target is present, otherwise 0), and $p$ is the predicted confidence.
The final total loss function (TLoss) is obtained by weighting and combining the three aforementioned loss components:
$$TLoss = \lambda_{box} \times CIoU + \lambda_{cls} \times CLoss + \lambda_{obj} \times OLoss \qquad (10)$$
In Equation (10), $\lambda_{box}$, $\lambda_{cls}$, and $\lambda_{obj}$ represent the weight coefficients for the bounding box regression loss, classification loss, and objectness loss, respectively. In this experiment, the weights of the box regression loss, classification loss, and objectness loss are set to 0.05, 0.5, and 1.0, respectively.
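To make the weighting in Equation (10) concrete, the sketch below combines BCE-based classification and objectness terms with a precomputed CIoU term using the weights quoted above; the toy tensors are placeholders, not outputs of the actual model.

```python
# Weighted combination of the three loss terms (Equations (8)-(10)).
import torch
import torch.nn.functional as F

def detection_loss(ciou, cls_logits, cls_targets, obj_logits, obj_targets,
                   lambda_box=0.05, lambda_cls=0.5, lambda_obj=1.0):
    closs = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)  # Eq. (8)
    oloss = F.binary_cross_entropy_with_logits(obj_logits, obj_targets)  # Eq. (9)
    return lambda_box * ciou + lambda_cls * closs + lambda_obj * oloss   # Eq. (10)

# toy example: 4 predictions, 10 classes
cls_logits, cls_targets = torch.randn(4, 10), torch.zeros(4, 10)
obj_logits, obj_targets = torch.randn(4), torch.zeros(4)
print(detection_loss(torch.tensor(0.8), cls_logits, cls_targets, obj_logits, obj_targets))
```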

4. Experiments and Analysis

In this section, the VisDrone2019 and UAVDT [54] datasets used in this paper will be introduced first, including their characteristics and relevance to the experiments. The experimental setup is then elaborated, encompassing the training configuration, the computational platform used for training, and the evaluation metrics employed to assess the performance of the proposed FSTD-Net and other state-of-the-art models.

4.1. Dataset and Experimental Settings

4.1.1. Dataset Introduction

The datasets used in this experiment are the large-scale VisDrone2019 dataset and the UAVDT dataset. The VisDrone2019 dataset contains 6471 images for training, 548 images for validation, and 3190 images for testing; however, since 1580 test images are reserved for the challenge and do not provide ground truth, we use 1610 images for testing in this experiment. The VisDrone2019 dataset includes a large number of targets of different scales and complex background information, making it suitable for evaluating the performance of the proposed algorithm. Figure 10 shows the statistical information of the VisDrone2019 dataset, including the number of instances per category, the distribution of bounding box centers, the normalized coordinates of bounding box center points in the image, and the distribution of bounding boxes with different width-to-height ratios. During the training process, all images are cropped into 800 × 800 pixels with a stride of 700. Additionally, to further validate the performance of the proposed network, we conducted experiments on the UAVDT dataset. The UAVDT dataset consists of 50 videos, comprising a total of 40,375 images, each with a resolution of 1024 × 540 pixels. The UAVDT dataset has three categories: car, truck, and bus. Due to the high similarity among image sequences within each video, we selected image sequences from each video to construct the train and test sets. The train set comprises 6566 images, and the test set includes 1830 images.
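The sliding-window cropping used for training (800 × 800 patches with a stride of 700 pixels) can be sketched as follows. Clamping the last window to the image border and the handling of images smaller than the crop size are our assumptions; in practice, ground-truth boxes would additionally need to be clipped and reassigned to each patch.

```python
# Overlapping sliding-window cropping: 800x800 patches with a 700-pixel stride.
import numpy as np

def crop_image(img, crop=800, stride=700):
    """Yield (y, x, patch) tuples covering the whole image with overlapping windows."""
    h, w = img.shape[:2]
    ys = list(range(0, max(h - crop, 0) + 1, stride))
    xs = list(range(0, max(w - crop, 0) + 1, stride))
    # make sure the right/bottom borders are covered by an extra clamped window
    if ys[-1] + crop < h:
        ys.append(h - crop)
    if xs[-1] + crop < w:
        xs.append(w - crop)
    for y in ys:
        for x in xs:
            yield y, x, img[y:y + crop, x:x + crop]

img = np.zeros((1500, 2000, 3), dtype=np.uint8)     # a dummy UAV-sized image
patches = list(crop_image(img))
print(len(patches), patches[0][2].shape)            # 6 (800, 800, 3)
```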

4.1.2. The Details of Experimental Environment

In this experiment, as shown in Table 1, we utilized an operational environment running on Windows, equipped with a GeForce RTX 2080ti GPU. The programming was conducted using Python 3.8 within the PyCharm integrated development environment, leveraging CUDA 12.2 for GPU support. For deep learning tasks, the Pytorch framework was employed, ensuring efficient model development and training processes.

4.1.3. The Setting of Model Training

In this training session, as shown in Table 2, the batch size for each epoch was set to 2, and the input image size was 800 × 800 for the VisDrone 2019 dataset and 640 × 640 for the UAVDT dataset. The initial learning rate was set at 0.01 and gradually decayed to 0.001. A momentum of 0.937 was applied to accelerate the convergence process, and a weight decay of 0.0005 was employed to prevent overfitting. Since more training epochs bring a higher risk of overfitting, our FSTD-Net is trained for 80 epochs on both the VisDrone 2019 and UAVDT datasets. The first three epochs involved a warm-up stage, during which a lower learning rate and momentum were applied to stabilize the training process. Data augmentation techniques included changes in hue, saturation, and brightness, as well as geometric transformations such as rotation, translation, scaling, and shearing. Additionally, to improve model generalization and avoid overfitting, horizontal and vertical flips, as well as advanced augmentation methods like Mosaic, Copy-Paste, and so on, were applied.
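For reference, the training configuration described above can be collected into a single dictionary; values not stated in the text (for example, the exact augmentation magnitudes) are omitted rather than guessed.

```python
# Training configuration as stated in the text (Table 2), gathered for reference.
train_cfg = {
    "batch_size": 2,
    "img_size": {"VisDrone2019": 800, "UAVDT": 640},
    "epochs": 80,
    "lr0": 0.01,            # initial learning rate
    "lrf": 0.001,           # final learning rate after decay
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "warmup_epochs": 3,     # lower learning rate and momentum during warm-up
    "loss_weights": {"box": 0.05, "cls": 0.5, "obj": 1.0},
    "augmentations": ["hsv", "rotate", "translate", "scale", "shear",
                      "fliplr", "flipud", "mosaic", "copy_paste"],
}
```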

4.1.4. Evaluation Metrics

To better evaluate the detection performance of the model, we selected six metrics: average precision (AP), AP50, AP75, APs, APm, and APl. AP represents the average precision across the ten IoU thresholds ranging from 0.5 to 0.95. AP50 is the average precision calculated at an IoU threshold of 0.5, while AP75 is calculated at an IoU threshold of 0.75. APs, APm, and APl refer to the average precision for targets with areas below 32 × 32 pixels, between 32 × 32 and 96 × 96 pixels, and above 96 × 96 pixels, respectively. The metrics are calculated as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
$$AP = \sum_{i=1}^{n-1}\left(R(i+1) - R(i)\right) \times P(i+1)$$
$$AP_{50} = \sum_{i=1}^{n-1}\left(R_{0.5}(i+1) - R_{0.5}(i)\right) \times P_{0.5}(i+1)$$
$$AP_{75} = \sum_{i=1}^{n-1}\left(R_{0.75}(i+1) - R_{0.75}(i)\right) \times P_{0.75}(i+1)$$
$$AP_{s} = \sum_{i=1}^{n-1}\left(R_{s}(i+1) - R_{s}(i)\right) \times P_{s}(i+1)$$
$$AP_{m} = \sum_{i=1}^{n-1}\left(R_{m}(i+1) - R_{m}(i)\right) \times P_{m}(i+1)$$
$$AP_{l} = \sum_{i=1}^{n-1}\left(R_{l}(i+1) - R_{l}(i)\right) \times P_{l}(i+1)$$
True Positive (TP), False Positive (FP), and False Negative (FN) are important metrics in evaluating the performance of classification and detection models. TP refers to the number of correctly identified positive instances, meaning the targets that are correctly detected and match the ground truth. FP refers to the number of instances incorrectly identified as positive, meaning the targets that are detected but do not match any ground truth. FN refers to the number of positive instances that were missed by the model, meaning the targets that are present in the ground truth but were not detected by the model. To calculate the specific value of each AP in the above formula, we first need to calculate the Precision (P) and Recall (R) values. Then, AP is obtained by computing the weighted sum of the precision at different recall levels, where the weight is the difference in recall multiplied by the corresponding precision. Specifically, AP50, AP75, APs, APm, and APl correspond to the average precision at different IOU thresholds (such as 0.5 and 0.75) and different target scales (small, medium, and large targets). The calculation method is the same, but the precision and recall used are under the corresponding conditions.
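The computation described above can be sketched as follows. Detections are assumed to be already matched to ground truth at the relevant IoU threshold (represented here by a boolean TP vector); AP is then the sum of (R(i+1) − R(i)) × P(i+1), following the formula given above rather than COCO's 101-point interpolation.

```python
# Average precision from confidence-sorted detections, per the formulas above.
import numpy as np

def average_precision(tp, conf, n_gt):
    """tp: is each detection a true positive; conf: its confidence; n_gt: number of GT boxes."""
    order = np.argsort(-conf)                      # sort detections by descending confidence
    tp = tp[order].astype(float)
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    precision = cum_tp / (cum_tp + cum_fp)         # P = TP / (TP + FP)
    recall = cum_tp / max(n_gt, 1)                 # R = TP / (TP + FN)
    # prepend a recall = 0 point so the first rectangle of the sum is counted
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([1.0], precision))
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))

tp = np.array([True, True, False, True, False])
conf = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
print(average_precision(tp, conf, n_gt=4))         # 0.6875
```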

4.2. Experimental Results on VisDrone2019 Dataset

4.2.1. Detecting Results on the VisDrone2019 Dataset

To validate the detection performance of our proposed FSTD-Net, we compared it with several state-of-the-art target detection models, including YOLOv3, SSD, YOLOv6, YOLOv7, YOLOv8, YOLOv10, Faster R-CNN, and so on, on the VisDrone2019 dataset. As shown in Table 3, FSTD-Net outperforms the other models and achieves the best values across all selected evaluation metrics. Specifically, FSTD-Net achieves a 2.4% higher AP, 3.8% higher AP50, and 2.0% higher AP75 compared to the second-best model. Moreover, in the APs, APm, and APl metrics, FSTD-Net achieves 1.4%, 3.1%, and 3.7% higher scores compared to the second-best model. FSTD-Net not only achieves the best detection performance overall among all models but also attains the best detection results in APs, APm, and APl. This indicates that FSTD-Net is more sensitive to targets of different scales in UAV images and can better detect targets with significant scale variability. At the same time, the overall number of parameters in FSTD-Net remains relatively low. FSTD-Net has a parameter count comparable to that of YOLOv8s but achieves better detection performance: 5% higher AP50, 2.4% higher AP, 2% higher AP75, 1.4% higher APs, 3.1% higher APm, and 6.4% higher APl than YOLOv8s. Our proposed FSTD-Net can achieve better detection performance with a lightweight architecture.

4.2.2. Visual Comparisons of Detection Results

To better demonstrate the superior detection performance of FSTD-Net compared to other state-of-the-art models, we selected several images from the VisDrone 2019 dataset to visualize the detection results, which allows us to directly compare and evaluate the performance differences between FSTD-Net and other models in practical applications. As shown in Figure 11, we use blue boxes to highlight the detection areas for comparison, allowing us to evaluate the detection performance of our proposed FSTD-Net and other advanced models in these specific areas. For better visibility, we enlarge the area outlined by the blue box and move it to the upper right corner of the image. In the first column of images, the blue box areas contain two "motor" targets in the drone imagery. Comparative models such as YOLOv6 and YOLOv8 failed to detect these two targets, and YOLOv5 detected only one of them. Because the proposed LLFEB, FEMFDS, and MSCIEM enhance FSTD-Net's ability to learn features of targets at different scales, FSTD-Net can precisely detect both targets. For the second column of images, YOLOv5, YOLOv6, and YOLOv8 all failed to detect the car target highlighted by the blue box, whereas our model successfully detected it. This is attributed to LLFEB, which enhances FSTD-Net's capability to learn detailed features of certain target types, such as texture and edge features. In the third column of images, YOLOv5, YOLOv6, and YOLOv8 all failed to detect the human target highlighted by the blue box. However, our FSTD-Net successfully detected this human target, demonstrating FSTD-Net's superior capability in detecting difficult targets. In the fourth column of images, YOLOv6 and YOLOv8 failed to fully detect the bicycle and the two motor targets highlighted in the blue box area, and YOLOv5 mistakenly identified the bicycle as a motor. However, our FSTD-Net successfully detected all of these targets, demonstrating its superior detection capability. In the fifth column of images, YOLOv6 and YOLOv8 both failed to fully detect the "van" target highlighted by the blue box, which is situated in a relatively complex background area, whereas our FSTD-Net successfully detected the entire target. Thanks to MSCIEM, FSTD-Net can effectively mitigate the interference from the complex background and successfully detect the target.
Overall, our proposed FSTD-Net produces better detection results when targets of different scales are present in the UAV image or when targets are interfered with by complex backgrounds. LLFEB provides FSTD-Net with sufficient detailed features, enabling it to detect targets that are otherwise difficult to detect. MSCIEM provides FSTD-Net with sufficient multi-scale information and, to some extent, avoids interference from complex backgrounds. FEMFDS enhances the model's ability to learn targets on different scales. Both MSCIEM and FEMFDS strengthen FSTD-Net's detection of targets on different scales. FSTD-Net outperforms the compared state-of-the-art models and can correctly identify the majority of targets in the VisDrone2019 images.

4.2.3. Experimental Results of Different Categories on the VisDrone2019 Test Set

To further validate the effectiveness of the proposed FSTD-Net, we compare the detection performance of the proposed FSTD-Net and YOLOv8s on all categories of the VisDrone2019 dataset. As shown in Table 4, YOLOv8s surpasses the proposed FSTD-Net by only 0.1% in AP75 and APm for the awning-tricycle category; for the remaining categories, the proposed FSTD-Net outperforms YOLOv8s in detection performance. As shown in Table 3, the parameter count of FSTD-Net is comparable to that of YOLOv8s. However, compared to YOLOv8s, FSTD-Net exhibits overall better detection performance, providing more accurate detection for targets of varying scales present in UAV images. This indicates the superior detection capability of the proposed FSTD-Net, demonstrating its ability to effectively detect the various target categories present in the UAV dataset.

4.2.4. Ablation Studies of FSTD-Net on the VisDrone 2019 Dataset

To further explore the effectiveness of the proposed method, we conducted ablation experiments to investigate the impact of the LLFEB, MSCIEM, and FEMFDS modules on FSTD-Net detection performance on the VisDrone2019 dataset. Table 5 presents the improvements over the baseline YOLOv5s model in the ablation experiments, and Table 6 shows the detection performance of the baseline with the addition of the different modules. In Table 5, √ indicates that the corresponding module is embedded in the baseline.
In the baseline model, the feature maps obtained after two down-sampling operations are not used; thus, the detailed features of the low-level features have not been fully utilized. The LLFEB enhances the baseline's learning and understanding of target features in UAV images by providing low-level semantic features such as edges, corners, and textures as a basis for subsequent high-level feature extraction. After introducing LLFEB into the baseline, AP50, AP, AP75, APs, and APm on the VisDrone2019 dataset increase by 0.9%, 0.7%, 0.7%, 0.7%, and 0.8%, respectively, while APl remains comparable. With the integration of LLFEB, the overall target detection accuracy of the model is improved, particularly for medium and small targets.
The FEMFDS overcomes the limitation of standard convolutions, which extract features according to fixed geometric patterns, providing more representative features for the detection of targets of various scales. After introducing FEMFDS, there are improvements of 1% in APs, 2% in APm, and 3.1% in APl, while AP50, AP, and AP75 on VisDrone2019 increase by 2.1%, 1.5%, and 1.9%, respectively. Under the guidance of the FEMFDS, the designed model can better detect targets of different scales in drone imagery.
The MSCIEM suppresses complex backgrounds in UAV images while enhancing the feature extraction of targets of different scales by utilizing multiple convolutions with different kernel sizes. The CIEM further enhances the baseline’s ability to capture long-range semantic information. With the introduction of MSCIEM, we observed improvements of 0.3% in APs, 0.8% in APm, and 0.7% in APl. MSCIEM can guide the model to detect targets of different scales in UAV images more accurately.
In summary, the proposed MSCIEM and FEMFDS modules enhance the baseline's detection of targets of different scales, while LLFEB improves the baseline's learning and understanding of target features in UAV images.

4.3. Experimental Results on UAVDT Dataset

To further validate the effectiveness of the proposed FSTD-Net, we further conducted experiments on the UAVDT dataset and compared the detection performance of the FSTD-Net with that of YOLOv6, YOLOv7, YOLOv8, YOLOv10, and so on. Table 7 presents the AP, AP50, AP75, and the detection precision for small, medium, and large targets (APs, APm, APl) on the UAVDT dataset. Compared to other algorithms, FSTD-Net achieved the highest detection precision in AP50, AP, AP75, and APm. Specifically, our method improved by 3.4%, 1.7%, 1%, and 2.1%, respectively, over the second-best model in these metrics. As for the APs and APl, FSTD-Net achieved relatively good results. These results demonstrate the effectiveness of our FSTD-Net in handling large-scale variations in object detection and achieving a good detection accuracy.
To further explore the effectiveness of the proposed method on the UAVDT dataset, we conducted ablation experiments to investigate the impact of the LLFEB, MSCIEM, and FEMFDS modules on FSTD-Net detection performance. As shown in Table 8, LLFEB mainly improves the detection performance for medium and large targets, while FEMFDS helps the model better learn small targets. MSCIEM enhances the model's ability to comprehensively learn targets of various scales, effectively balancing the learning of small and large targets while significantly improving the learning of medium-sized targets. Overall, owing to the design of the three modules, FSTD-Net achieves relatively optimal performance in target detection across various scales.
To further validate the effectiveness of the proposed FSTD-Net, we compared its detection performance with that of the baseline YOLOv5s across all categories on the UAVDT dataset. Table 9 presents the detection performance of FSTD-Net across the various categories in the UAVDT dataset. YOLOv5s outperforms FSTD-Net only in the APm of the bus category and the APl of the truck category; in all other metrics across the target categories, FSTD-Net achieves better results. Overall, FSTD-Net achieves better detection performance for all target categories on the UAVDT dataset.

4.4. Extended Experimental Results on VOC Dataset

To validate the generalization capability of the FSTD-Net, we conducted further experiments on the Pascal VOC datasets [56], which include the VOC 2007 and VOC 2012 datasets. The training and validation sets from VOC 2007 and 2012 are combined to form new training and validation sets, respectively, while the test set from VOC 2007 is used as the new test set. The training set consists of 8218 images; the test set contains 4952 images; and the validation set comprises 8333 images. FSTD-Net was trained for a total of 300 epochs, and the results are presented in Table 10. As shown in Table 10, FSTD-Net achieves an AP50 of 68.2%, an AP of 44.1%, and an AP75 of 47.0%. Compared to the baseline, our proposed FSTD-Net achieves improvements of 1.5%, 2.9%, 3.6%, 0.5%, and 4.3% in AP50, AP, AP75, APm, and APl, respectively. It indicates that our proposed model is well-suited for generic image applications.
As shown in Table 11, the results demonstrate the detection accuracy of FSTD-Net compared to the baseline across various categories on the VOC dataset. It can be seen that our proposed model outperforms the baseline in most categories, indicating that our method also improves detection performance in general image applications.

5. Discussion

5.1. The Effectiveness of FSTD-Net

Targets in UAV images are more complex compared to those in natural images due to significant scale variability and complex background conditions. Target detection models suitable for natural images, such as the YOLO series of neural networks, cannot be directly applied to UAV images. To address this issue, we designed FSTD-Net, providing some insights on how to adapt models such as the YOLO series for use in UAV image detection.
Overall, the structure of FSTD-Net is essentially consistent with the YOLO series, including the backbone, neck, and head parts, maintaining a relatively lightweight structure. Its parameter count is 11.3M, which is roughly equivalent to that of YOLOv8s. However, compared to the YOLO series of neural networks, FSTD-Net is better suited for UAV image detection tasks.
Experimental results on the VisDrone 2019 and UAVDT datasets demonstrate that the methods proposed in this paper can enhance the applicability of the YOLO model for detection tasks involving UAV images. Additionally, we conducted further experiments on the general datasets VOC2007 and VOC2012. Experiments on the general datasets indicate that the proposed methods can also improve the baseline’s detection performance on natural images. Overall, FSTD-Net not only performs well on UAV image datasets but also achieves commendable results on general datasets.

5.2. The Impact of Image Resizing and Cropping on Target Detection Performance

We conducted further experiments on VisDrone 2019, cropping the train, test, and validation sets into images of 640 × 640 pixels. The original training set contained 6471 images, and after cropping, the number of images in the training set increased to 44,205. We then compared the results of training on two datasets: one with the original images resized to 640 × 640, containing 6471 images, and the other with the cropped images, containing 44,205 images. FSTD-Net was trained for 80 epochs on both datasets. The detection results are shown in Table 12.
As shown in Table 12, it can be observed that the detection performance on the datasets with cropping methods is significantly better across all metrics compared to the performance on the datasets with resizing methods. On one hand, the cropping operation increases the number of data samples in the training set, allowing the network to undergo more iterations during training, which leads to improved detection performance on the cropped dataset. On the other hand, resizing methods change the original shape of the targets in the UAV images, which may significantly degrade detection performance compared to the cropped dataset.

6. Conclusions

In this paper, to address the target scale variability and complex background interference present in UAV images, we propose FSTD-Net, which effectively detects multi-scale targets in UAV images and avoids interference caused by complex backgrounds. Firstly, to better capture the features of targets at different scales in UAV images, MSCIEM utilizes a multi-kernel combination approach, and the CIEM in MSCIEM captures long-range contextual information. Due to MSCIEM, the model can effectively capture features of targets at different scales, perceive long-range contextual information, and remain sensitive to significant variations in scale. Secondly, to further account for targets of different shapes at different scales, FEMFDS breaks through the conventional method of feature extraction by standard convolution, providing a more flexible approach to learning full-scale targets of different shapes. The FEMFDS can better fit full-scale targets of different shapes in UAV images, providing more representative features for the final detection of FSTD-Net. Finally, LLFEB is used to efficiently utilize low-level semantic features, including edges, corners, and textures, providing a foundation for subsequent high-level feature extraction and guiding the model to better understand various target features. Experimental results demonstrate that FSTD-Net outperforms the selected state-of-the-art models, achieving advanced detection performance. FSTD-Net achieves better detection results in terms of overall detection accuracy while maintaining a relatively lightweight structure. In the future, we will further explore more efficient and lightweight target detection models to meet the requirements for direct deployment on UAVs.

Author Contributions

W.Y. performed the experiments and wrote the paper; J.Z. and D.L. analyzed the data and provided experimental guidance. Y.X. and Y.W. helped to process the data. J.Z. and D.L. helped edit the manuscript. Y.X. and Y.W. checked and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

VisDrone2019 dataset: https://github.com/VisDrone (accessed on 7 August 2024). UAVDT dataset: https://sites.google.com/view/grli-uavdt (accessed on 7 August 2024). Pascal VOC dataset: The PASCAL Visual Object Classes Homepage (ox.ac.uk).

Conflicts of Interest

All authors declare no conflicts of interest.

References

1. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276.
2. Zaidi, S.S.A.; Ansari, M.S.; Aslam, A.; Kanwal, N.; Asghar, M.; Lee, B. A survey of modern deep learning based object detection models. Digit. Signal Process. 2022, 126, 103514.
3. Waheed, M.; Ahmad, R.; Ahmed, W.; Alam, M.M.; Magarini, M. On coverage of critical nodes in UAV-assisted emergency networks. Sensors 2023, 23, 1586.
4. Gupta, H.; Verma, O.P. Monitoring and surveillance of urban road traffic using low altitude drone images: A deep learning approach. Multimed. Tools Appl. 2022, 81, 19683–19703.
5. Deng, A.; Han, G.; Chen, D.; Ma, T.; Liu, Z. Slight aware enhancement transformer and multiple matching network for real-time UAV tracking. Remote Sens. 2023, 15, 2857.
6. Feng, L.; Zhang, Z.; Ma, Y.; Sun, Y.; Du, Q.; Williams, P.; Drewry, J.; Luck, B. Multitask learning of alfalfa nutritive value from UAV-based hyperspectral images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 5506305.
7. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
8. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
10. Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 91–124.
11. Terven, J.; Cordova-Esparza, D. A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS. arXiv 2023, arXiv:2304.00501.
12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
13. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. arXiv 2016, arXiv:1612.08242.
14. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
15. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
16. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Wei, X. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976.
17. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
18. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-Time Flying Object Detection with YOLOv8. arXiv 2023, arXiv:2305.09972.
19. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616.
20. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
21. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
22. Liu, H.I.; Tseng, Y.W.; Chang, K.C.; Wang, P.J.; Shuai, H.H.; Cheng, W.H. A DeNoising FPN With Transformer R-CNN for Tiny Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4704415.
23. Zhao, G.; Ge, W.; Yu, Y. GraphFPN: Graph feature pyramid network for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 2743–2752.
24. Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic feature pyramid network for object detection. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, HI, USA, 1–4 October 2023; pp. 2184–2189.
25. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
26. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 9308–9316.
27. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H. InternImage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14408–14419.
28. Xiong, Y.; Li, Z.; Chen, Y.; Wang, F.; Zhu, X.; Luo, J.; Dai, J. Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications. arXiv 2024, arXiv:2401.06197.
29. Liu, S.; Zhang, L.; Lu, H.; He, Y. Center-boundary dual attention for oriented object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5603914.
30. Shao, Z.; Cheng, G.; Ma, J.; Wang, Z.; Wang, J.; Li, D. Real-time and accurate UAV pedestrian detection for social distancing monitoring in COVID-19 pandemic. IEEE Trans. Multimed. 2022, 24, 2069–2083.
31. Wang, M.; Zhang, B. Contrastive Learning and Similarity Feature Fusion for UAV Image Target Detection. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6001105.
32. Zhang, Y.; Wu, C.; Zhang, T.; Liu, Y.; Zheng, Y. Self-attention guidance and multiscale feature fusion-based UAV image object detection. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6004305.
33. Zhou, X.; Zhou, L.; Gong, S.; Zhang, H.; Zhong, S.; Xia, Y.; Huang, Y. Hybrid CNN and Transformer Network for Semantic Segmentation of UAV Remote Sensing Images. IEEE J. Miniat. Air Space Syst. 2024, 5, 33–41.
34. Zhang, Y.; Liu, T.; Yu, P.; Wang, S.; Tao, R. SFSANet: Multi-scale object detection in remote sensing image based on semantic fusion and scale adaptability. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4406410.
35. Gao, T.; Li, Z.; Wen, Y.; Chen, T.; Niu, Q.; Liu, Z. Attention-free global multiscale fusion network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5603214.
36. Gao, T.; Liu, Z.; Zhang, J.; Wu, G.; Chen, T. A task-balanced multiscale adaptive fusion network for object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5613515.
37. Gao, T.; Niu, Q.; Zhang, J.; Chen, T.; Mei, S.; Jubair, A. Global to local: A scale-aware network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5615614.
38. Dong, X.; Qin, Y.; Fu, R.; Gao, Y.; Liu, S.; Ye, Y.; Li, B. Multiscale deformable attention and multilevel features aggregation for remote sensing object detection. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6510405.
39. Shen, H.; Lin, D.; Song, T. Object detection deployed on UAVs for oblique images by fusing IMU information. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6505305.
40. Jiang, L.; Yuan, B.; Du, J.; Chen, B.; Xie, H.; Tian, J.; Yuan, Z. MFFSODNet: Multi-Scale Feature Fusion Small Object Detection Network for UAV Aerial Images. IEEE Trans. Instrum. Meas. 2024, 73, 5015214.
41. Liu, X.; Leng, C.; Niu, X.; Pei, Z.; Cheng, I.; Basu, A. Find small objects in UAV images by feature mining and attention. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6517905.
42. Lan, Z.; Zhuang, F.; Lin, Z.; Chen, R.; Wei, L.; Lai, T.; Yang, C. MFO-Net: A Multiscale Feature Optimization Network for UAV Image Object Detection. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6006605.
43. Mao, G.; Liang, H.; Yao, Y.; Wang, L.; Zhang, H. Split-and-Shuffle Detector for Real-Time Traffic Object Detection in Aerial Image. IEEE Internet Things J. 2024, 11, 13312–13326.
44. Hong, M.; Li, S.; Yang, Y.; Zhu, F.; Zhao, Q.; Lu, L. SSPNet: Scale selection pyramid network for tiny person detection from UAV images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8018505.
45. Cui, L.; Lv, P.; Jiang, X.; Gao, Z.; Zhou, B.; Zhang, L.; Shao, L.; Xu, M. Context-aware block net for small object detection. IEEE Trans. Cybern. 2022, 52, 2300–2313.
46. Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13658–13667.
47. Zhang, Y.; Wu, C.; Zhang, T.; Zheng, Y. Full-Scale Feature Aggregation and Grouping Feature Reconstruction-Based UAV Image Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5621411.
48. Nie, J.; Pang, Y.; Zhao, S.; Han, J.; Li, X. Efficient selective context network for accurate object detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3456–3468.
49. Wang, G.; Zhuang, Y.; Chen, H.; Liu, X.; Zhang, T.; Li, L.; Sang, Q. FSoD-Net: Full-scale object detection from optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5602918.
50. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly Kernel Inception Network for Remote Sensing Detection. arXiv 2024, arXiv:2403.06258.
51. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
52. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
53. Yu, W.; Zhou, P.; Yan, S.; Wang, X. InceptionNeXt: When Inception meets ConvNeXt. arXiv 2023, arXiv:2303.16900.
54. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386.
55. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
56. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
Figure 1. Target size distribution in the VisDrone 2019 and UAVDT datasets. (a) Target numbers of different sizes in VisDrone 2019; (b) target numbers of different sizes in UAVDT.
Figure 2. Examples from the VisDrone 2019 and UAVDT datasets. (a) Examples from the VisDrone 2019 dataset; (b) examples from the UAVDT dataset.
Figure 3. The original structure of the baseline YOLOv5s.
Figure 4. The overall framework of FSTD-Net.
Figure 5. The overall structure of MSCIEM.
Figure 6. The overall structure of FEMFDS.
Figure 7. The improvements of DCNv4 over DCNv3.
Figure 8. The overall structure of LLFEB.
Figure 9. The design principles of the CIoU.
Figure 10. Statistical information of VisDrone 2019. (a) The distribution of instance quantities for different categories; (b) the distribution of annotated bounding box center points; (c) the density of bounding box center points, further refining the positional distribution of targets within the image; (d) the width-to-height ratio distribution of annotated bounding boxes.
Figure 11. Visual detection results of FSTD-Net and other state-of-the-art models. Columns from left to right: original images and the detection results of YOLOv5, YOLOv6, YOLOv8, and the proposed FSTD-Net.
Table 1. Experimental environment configuration.
Parameter                            Configuration
Operating Environment                Windows
GPU                                  GeForce RTX 2080 Ti
Programming Language                 Python 3.8
Integrated Development Environment   PyCharm
CUDA                                 12.2
Deep Learning Framework              PyTorch
Table 2. The basic training settings on the two datasets.
Parameter          VisDrone 2019 / UAVDT
Input Image Size   800 × 800 / 640 × 640
Batch Size         2
Epochs             80
Learning Rate      0.01
Warmup Epochs      3
Momentum           0.937
Weight Decay       0.0005
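For scripting experiments, the settings in Table 2 can be gathered into a plain configuration dictionary. The snippet below is an illustrative sketch only; the key names are our own and are not tied to the authors' training code or any particular framework.

```python
# Training hyperparameters from Table 2, collected into a Python dict (illustrative).
train_cfg = {
    "img_size": {"VisDrone2019": 800, "UAVDT": 640},  # square input resolution per dataset
    "batch_size": 2,
    "epochs": 80,
    "lr0": 0.01,             # initial learning rate
    "warmup_epochs": 3,
    "momentum": 0.937,       # SGD momentum
    "weight_decay": 0.0005,
}
```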
Table 3. Detection results of FSTD-Net and state-of-the-art models on the VisDrone dataset.
Method            AP50  AP    AP75  APs   APm   APl   Params
Faster-RCNN [9]   13.4  4.9   2.5   0.3   8.3   25.7  41.2M
SSD [55]          24.1  12.1  11.3  6.8   19.7  19.8  26.3M
YOLOv3-Tiny [14]  26.2  13.0  11.4  6.1   20.5  30.6  8.85M
YOLOv6s [16]      33.2  18.8  19.1  10.7  29.8  32.1  18.5M
YOLOv7-Tiny [17]  32.4  17.6  17.0  9.6   26.7  35.8  6.2M
MFO-Net [42]      34.3  18.4  17.9  11.3  28.0  32.6  35.2M
YOLOv5s           37.7  20.5  20.1  11.2  31.5  38.4  7.2M
YOLOv10s [20]     35.0  19.9  20.3  11.6  30.5  33.8  7.2M
YOLOv8s [18]      36.5  20.8  21.0  11.8  32.0  35.7  11.2M
FSTD-Net (Ours)   41.5  23.2  23.0  13.2  35.1  42.1  11.3M
Table 4. Experimental results of different categories on the VisDrone2019 test set.
Method            Category          AP50  AP    AP75  APs   APm   APl
FSTD-Net (ours)   pedestrian        36.8  14.5  8.7   11.2  42.8  51.4
                  people            21.8  7.1   2.6   6.4   21.2  23.0
                  bicycle           21.4  8.7   5.2   6.9   18.3  26.7
                  car               79.8  48.9  52.9  34.3  64.8  77.5
                  van               50.6  33.3  38.1  19.5  47.2  54.1
                  truck             48.2  30.3  33.7  8.0   36.6  52.7
                  tricycle          26.7  14.6  14.5  6.7   21.5  36.5
                  awning-tricycle   25.2  14.9  15.6  9.6   20.4  20.0
                  bus               61.9  42.2  48.3  14.2  48.6  63.0
                  motor             43.2  17.4  10.2  15.0  29.4  16.1
YOLOv8s           pedestrian        31.4  12.6  7.6   9.3   40.2  47.0
                  people            17.4  6.1   2.5   5.4   19.1  21.8
                  bicycle           16.2  6.8   4.1   6.2   11.8  21.1
                  car               76.7  47.5  51.7  33.0  63.8  75.2
                  van               46.5  30.8  35.5  18.0  44.2  46.1
                  truck             38.9  24.8  27.5  5.9   31.2  41.5
                  tricycle          21.3  11.7  12.0  6.0   17.4  31.2
                  awning-tricycle   23.4  13.8  15.7  7.8   20.5  12.1
                  bus               55.8  38.1  43.9  12.8  44.8  56.4
                  motor             37.8  15.5  9.7   13.5  27.0  4.7
Table 5. Detailed settings of the ablation experiments.
Method     YOLOv5s  LLFEB  FEMFDS  MSCIEM  Params
baseline                                   7.2M
a                                          7.3M
b                                          8.6M
c                                          11.3M
Table 6. Detailed comparison of the proposed FSTD-Net in the ablation studies.
Method     AP50  AP    AP75  APs   APm   APl
baseline   37.7  20.5  20.1  11.2  31.5  38.4
a          38.6  21.2  20.8  11.9  32.3  38.3
b          40.7  22.7  22.7  12.9  34.3  41.4
c          41.5  23.2  23.0  13.2  35.1  42.1
Table 7. Detection results of FSTD-Net and state-of-the-art models on the UAVDT dataset.
Method            AP50  AP    AP75  APs   APm   APl   Params
YOLOv3-Tiny [14]  35.4  18.6  17.3  17.3  19.5  26.3  8.85M
YOLOv6s [16]      37.4  22.5  25.4  23.2  22.3  17.0  18.5M
YOLOv7-Tiny [17]  35.9  17.3  14.0  15.8  18.6  17.7  6.2M
YOLOv10s [20]     36.8  21.4  22.1  23.4  21.0  20.6  7.2M
YOLOv8s [18]      37.0  22.0  24.4  23.8  20.8  30.2  11.2M
YOLOv5s           37.3  21.7  22.6  20.7  22.5  35.7  7.2M
FSTD-Net (Ours)   40.8  24.2  26.4  23.2  24.6  32.0  11.3M
Table 8. Ablation studies of FSTD-Net on the UAVDT dataset.
Method     AP50  AP    AP75  APs   APm   APl
baseline   37.3  21.7  22.6  20.7  22.5  35.7
a          37.6  22.2  24.1  19.6  23.2  36.5
b          38.8  23.5  26.4  23.9  22.6  28.6
c          40.8  24.2  26.4  23.2  24.6  32.0
Table 9. Experimental results of different categories on the UAVDT dataset.
Method     Category   AP50  AP    AP75  APs   APm   APl
FSTD-Net   car        76.2  46.6  52.9  40.5  53.2  -
           truck      22.8  13.7  15.0  4.5   17.0  25.1
           bus        23.2  12.3  11.4  24.7  3.5   38.9
baseline   car        76.2  45.4  50.0  39.9  51.4  -
           truck      14.9  8.0   7.1   3.1   10.2  32.9
           bus        21.0  11.6  10.7  19.2  5.8   38.5
Table 10. Detection results of FSTD-Net and the baseline on the VOC dataset.
Method     AP50  AP    AP75  APs   APm   APl
baseline   66.7  41.2  43.4  11.5  29.2  49.8
FSTD-Net   68.2  44.1  47.0  11.2  29.7  54.1
Table 11. Experimental results of different categories on the VOC dataset for FSTD-Net and the baseline.
Method      Category      AP50  AP    AP75  APs   APm   APl
FSTD-Net    airplane      76.3  49.0  52.7  23.1  41.9  58.1
            bicycle       77.5  50.7  56.2  5.1   36.1  61.2
            bird          60.8  36.4  36.1  14.0  31.3  46.5
            boat          54.8  28.7  25.8  14.6  23.9  44.4
            bottle        44.1  25.7  26.3  8.4   27.2  43.0
            bus           78.1  61.2  67.9  15.1  24.5  74.8
            car           80.3  55.1  58.8  21.2  43.6  76.7
            cat           78.5  55.0  58.0  0.0   18.6  58.8
            chair         50.8  28.7  28.2  3.8   25.8  37.3
            cow           70.4  44.1  49.6  19.6  44.6  52.9
            diningtable   61.6  41.3  43.7  0.0   3.5   49.3
            dog           71.0  47.3  50.5  0.7   30.5  51.6
            horse         79.5  52.9  56.1  5.2   21.6  61.2
            motorbike     79.3  50.1  54.9  9.0   30.9  60.1
            person        79.3  46.9  47.6  14.1  37.5  60.6
            pottedplant   42.9  19.6  14.7  6.2   15.6  30.0
            sheep         65.5  43.6  47.6  18.4  45.1  54.3
            sofa          63.9  43.3  46.8  -     21.2  45.3
            train         80.0  54.2  60.6  -     31.1  58.0
            tvmonitor     69.7  47.4  57.4  23.2  38.8  57.6
baseline    airplane      74.7  45.3  48.3  23.5  39.4  53.2
            bicycle       76.2  49.3  54.6  5.0   36.8  58.9
            bird          62.1  35.1  36.4  17.2  28.5  45.3
            boat          49.9  25.1  20.3  10.7  23.2  36.2
            bottle        45.2  24.3  22.6  6.8   27.1  38.4
            bus           75.2  57.2  62.6  13.4  23.4  69.8
            car           78.6  52.8  56.2  19.3  43.4  73.5
            cat           76.9  50.0  53.5  0.0   20.3  52.7
            chair         49.6  27.1  26.9  5.5   25.7  33.3
            cow           74.3  45.6  48.1  24.4  48.7  52.2
            diningtable   58.5  35.8  38.2  0.0   3.4   42.7
            dog           70.2  43.0  45.7  0.4   30.3  46.8
            horse         75.5  48.2  51.9  4.0   20.9  55.9
            motorbike     75.8  45.6  48.2  13.1  27.8  55.2
            person        77.8  44.0  43.7  13.1  36.4  56.9
            pottedplant   41.1  19.3  14.0  7.4   16.2  27.3
            sheep         64.6  42.3  46.8  17.0  47.8  49.7
            sofa          61.9  39.3  41.1  -     20.5  40.9
            train         77.4  50.1  54.9  -     26.6  53.9
            tvmonitor     69.0  45.4  53.3  26.8  38.5  53.6
Table 12. Detection results of FSTD-Net with resizing and cropping methods.
Method                 AP50  AP    AP75  APs   APm   APl
FSTD-Net (resized)     25.6  12.8  11.5  6.3   19.6  25.9
FSTD-Net (with crop)   40.9  23.1  22.9  14.4  34.7  39.7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
