Unified Spatial-Frequency Modeling and Alignment for Multi-Scale Small Object Detection

Liu, Jing; Wang, Ying; Cao, Yanyan; Guo, Chaoping; Shi, Peijun; Li, Pan

doi:10.3390/sym17020242

Open AccessArticle

Unified Spatial-Frequency Modeling and Alignment for Multi-Scale Small Object Detection

by

Jing Liu

^1,*

,

Ying Wang

^2,*

,

Yanyan Cao

¹

,

Chaoping Guo

¹,

Peijun Shi

¹ and

Pan Li

¹

Xi’an Key Laboratory of Human-Machine Integration and Control Technology for Intelligent Rehabilitation, School of Computer Science, Xijing University, Xi’an 710123, China

²

School of Information Science and Engineering, Wuchang Shouyi University, Wuhan 430072, China

^*

Authors to whom correspondence should be addressed.

Symmetry 2025, 17(2), 242; https://doi.org/10.3390/sym17020242

Submission received: 18 December 2024 / Revised: 31 January 2025 / Accepted: 2 February 2025 / Published: 6 February 2025

(This article belongs to the Special Issue Symmetry and Asymmetry Study in Object Detection)

Download

Browse Figures

Versions Notes

Abstract

:

Small object detection in aerial imagery remains challenging due to sparse feature representation, limited spatial resolution, and complex background interference. Current deep learning approaches enhance detection performance through multi-scale feature fusion, leveraging convolutional operations to expand the receptive field or self-attention mechanisms for global context modeling. However, these methods primarily rely on spatial-domain features, while self-attention introduces high computational costs, and conventional fusion strategies (e.g., concatenation or addition) often result in weak feature correlation or boundary misalignment. To address these challenges, we propose a unified spatial-frequency modeling and multi-scale alignment fusion framework, termed USF-DETR, for small object detection. The framework comprises three key modules: the Spatial-Frequency Interaction Backbone (SFIB), the Dual Alignment and Balance Fusion FPN (DABF-FPN), and the Efficient Attention-AIFI (EA-AIFI). The SFIB integrates the Scharr operator for spatial edge and detail extraction and FFT/IFFT for capturing frequency-domain patterns, achieving a balanced fusion of global semantics and local details. The DABF-FPN employs bidirectional geometric alignment and adaptive attention to enhance the significance expression of the target area, suppress background noise, and improve feature asymmetry across scales. The EA-AIFI streamlines the Transformer attention mechanism by removing key-value interactions and encoding query relationships via linear projections, significantly boosting inference speed and contextual modeling. Experiments on the VisDrone and TinyPerson datasets demonstrate the effectiveness of USF-DETR, achieving improvements of 2.3% and 1.4% mAP over baselines, respectively, while balancing accuracy and computational efficiency. The framework outperforms state-of-the-art methods in small object detection.

Keywords:

small object detection; spatial-frequency modeling; multi-scale feature fusion; RT-DETR; aerial images

1. Introduction

Detecting small objects in aerial imagery is a critical yet challenging task in computer vision, with wide-ranging applications in surveillance, disaster monitoring, urban planning, and environmental protection [1]. Compared with large-scale objects, small objects typically exhibit limited spatial resolution, ambiguous feature representation, and higher sensitivity to noise [2]. In aerial imagery, these challenges are further exacerbated by complex backgrounds and varying viewpoints. Therefore, achieving robust and precise small object detection requires innovative strategies that can effectively capture fine-grained local details and global contextual information [3].

In recent years, with the rapid development of deep learning and convolutional neural networks (CNN), object detection models based on Transformer [4] architecture, such as Detection Transformer (DETR [5]), have made significant progress in object detection tasks. DETR reformulates object detection as an end-to-end sequence modeling problem, eliminating the need for candidate box generation required in traditional detectors. Compared to methods like R-CNN [6] and YOLO [7,8,9,10,11,12], DETR simplifies the detection process by removing post-processing complexities and parameter tuning. It achieves this by using query vectors as soft anchor boxes to locate objects, rather than relying on predefined anchor boxes. However, this approach leads to slow convergence of the soft anchor boxes, resulting in longer training times. To address this limitation, researchers have introduced several improvements, such as Deformable DETR [13]. These advancements have led to the development of algorithms like RT-DETR [14], demonstrating that DETR-based approaches have evolved into novel and mature detection methods. By incorporating multi-scale feature extraction modules, context information fusion strategies, and advanced bounding box regression mechanisms, these methods significantly improve small object detection precision.

However, small objects in aerial images often exhibit characteristics such as low resolution, limited pixel information, and sparse feature representation, making detection in complex scenarios a significant challenge [15]. On the one hand, small objects typically occupy only a few pixels in the image, resulting in low feature resolution, the loss of detailed information, and sparse feature representation, which makes distinguishing them from the background difficult [16]. In addition, small objects often exhibit low variability, meaning their appearance remains relatively consistent with little variation in shape, size, or orientation. This low variability, combined with the limited number of pixels they occupy, makes their detection more challenging, as subtle differences between these objects and the background may be hard to capture. Common deep learning approaches [7,17,18] build multi-scale feature pyramids to capture multi-scale representations of objects, but these primarily rely on spatial-domain convolutional operations. As a result, they are unable to adequately extract global structural features and struggle to capture the boundaries and details of small objects. On the other hand, in the detection of small objects, contextual information is crucial to distinguish objects from the background. Existing methods [19,20] predominantly rely on local convolution operations or computationally expensive attention mechanisms. For example, Transformer-based [4] methods (e.g., DETR [5] and Deformable DETR [13]) use self-attention to capture contextual information. However, their key-value interaction operations involve extensive matrix multiplications, leading to high computational costs and slow inference speeds [21]. Furthermore, during the multi-scale feature fusion stage, low-resolution semantic information may not align effectively with high-resolution detail features, resulting in small object features being suppressed by larger objects and amplified background noise. Conventional multi-scale fusion methods, such as PANet [22] and NAS-FPN [23], typically concatenate or weight features directly, failing to address geometric alignment issues between detailed high-resolution features and coarse low-resolution features. Redundant information in the fusion process further exacerbates false detections in the background [24].

To address the challenges of sparse feature representation, insufficient multi-scale fusion, limited contextual modeling capacity, and balancing the trade-off between processing speed and detection accuracy in small objects, this paper proposes a unified spatial-frequency modeling and alignment framework, USF-DETR, which consists of three primary modules: Spatial-Frequency Interaction Backbone (SFIB), Dual Alignment and Balance Fusion FPN (DABF-FPN), and Efficient Attention-AIFI (EA-AIFI). The SFIB module unifies spatial and frequency-domain feature modeling to enhance the boundary and detail representations of objects. The DABF-FPN module optimizes multi-scale feature fusion through bidirectional alignment and saliency balancing. Finally, the EA-AIFI module employs an efficient additive attention mechanism to achieve a balance between global contextual modeling and inference efficiency.

As illustrated in Figure 1, the detection process of RT-DETR is compared with the proposed method. RT-DETR mainly relies on spatial domain operations for feature extraction, which has limited hierarchical representation capability for features of different scales. Consequently, the generated feature maps have low resolution and lack sufficient detail, resulting in blurred boundaries and contours for small objects. In contrast, the feature maps produced by USF-DETR are richer in detail with clearer boundaries. After passing through the AIFI and the Encoder, the heatmaps generated by our method (second row) highlight small objects more accurately and significantly reduce background noise compared to RT-DETR (first row). Additionally, the incorporation of the S2 layer during feature fusion enhances the preservation of small object details, and the use of a bidirectional alignment strategy improves the geometric alignment of high- and low-resolution features. The saliency balance mechanism further optimizes multi-scale feature fusion, preventing small object features from being suppressed by larger ones. As evident from the final detection results, the proposed approach substantially enhances the accuracy of small object detection, with notable reductions in both missed and false detections. These findings demonstrate that the USF-DETR achieves better recognition of small object details and boundaries in complex scenarios. The specific contributions of the framework are detailed as follows:

A novel backbone, SFIB, is proposed that combines spatial and frequency domain features. In the spatial domain, the Scharr operator is utilized to extract details and edge information, enhancing local spatial structure features. In the frequency domain, FFT and IFFT are employed to capture both low-frequency and high-frequency patterns, balancing global and local information. The dynamic weighting integration of spatial and frequency domain features enables the model to comprehensively consider both spatial details and frequency distributions, significantly improving feature extraction capability and multi-scale adaptability in complex scenarios.
The proposed DABF-FPN achieves geometric alignment and saliency balance between high- and low-resolution features through a bidirectional fusion mechanism. The DABF module utilizes an adaptive attention mechanism to selectively aggregate boundary details and semantic information based on the resolution and content of feature maps. This enhances the saliency representation of target regions while suppressing background noise, leading to more effective feature fusion and further improving the performance of multi-scale feature integration.
EA-AIFI introduces Efficient Additive Attention, eliminating key-value interactions and relying solely on linear projections to efficiently encode the query-key relationship. This approach alleviates the computational bottleneck of matrix operations. In combination with positional encoding and a feed-forward network (FFN), EA-AIFI enhances inference speed and robust contextual representation, optimizing both global context modeling and small object detection in terms of efficiency and performance.

The structure of this paper is organized as follows: Section 2 provides an overview of related studies on small object detection and Transformer-based approaches. Section 3 details the proposed methodology, including its architecture and key components. Section 4 outlines the experimental setup and results, offering a comprehensive analysis of the model’s performance. Finally, Section 5 concludes the paper with a summary and a discussion of potential directions for future research.

2. Related Work

Aerial images present distinct challenges in object detection due to characteristics such as significant variations in target size, complex backgrounds, and a high prevalence of small objects [25]. This section reviews related research on small object detection in aerial images from two perspectives.

2.1. Transformer-Based Detection Methods

In recent years, Transformer-based models have significantly advanced object detection, offering superior global context modeling compared to traditional convolutional neural networks (CNNs). The DETR [5] eliminates the reliance on region proposals, anchors, and non-maximum suppression (NMS), enabling an end-to-end object detection framework. By reformulating object detection as a set prediction problem, DETR uses self-attention to capture long-range dependencies and global relationships within images. However, DETR suffers from slow convergence and computational inefficiency, particularly when handling high-resolution inputs.

To address DETR’s limitations, Deformable DETR [13] incorporates a deformable attention mechanism that focuses only on sparse, highly informative key points rather than the entire image. This improvement significantly reduces computational costs while accelerating convergence. The Swin Transformer [26] introduces a shifted window mechanism and hierarchical feature maps to balance computational efficiency and detection accuracy across scales. Similarly, PVT [27] employs pyramid-like features to enhance multi-scale detection, which is particularly beneficial for detecting small objects and objects with varying scales. RT-DETR [14], an optimized version of DETR, improves real-time object detection by reducing the complexity of the Transformer structure and enhancing the feature extraction module. O2DETR [28] replaces the attention module with local convolutions and integrates multi-scale feature maps, which have been shown to improve detection performance in scenarios involving rotated objects.

Furthermore, methods like Sparse R-CNN [29] incorporate Transformers to enhance region proposal generation, improving detection accuracy for objects in cluttered or complex environments. Query-based Transformers, such as TSP-RCNN [30], optimize attention mechanisms to focus on relevant object regions, improving both detection precision and computational efficiency. More recently, DINO [31] introduced a denoising training strategy to further refine detection results and reduce convergence time, making Transformer-based approaches more practical for real-world applications.

Despite these advancements, challenges remain, such as the high computational cost of self-attention and the inefficiency of Transformers when dealing with high-resolution inputs, particularly in aerial imagery where global and local features must be effectively balanced. Therefore, combining Transformers with lightweight, efficient attention mechanisms continues to be an active research direction.

2.2. Small Object Detection Methods in Aerial Images

Unlike standard object detection tasks, small objects occupy only a few pixels, making them difficult to distinguish from their surroundings. Traditional methods often rely on multi-scale feature fusion strategies to address this issue. For example, Feature Pyramid Networks (FPN) [19] and their variants (e.g., PANet [22], NAS-FPN [23]) enhance small object detection by combining features from different levels of a backbone network.

Xiao et al. [32] utilize an adaptive fusion mechanism that combines multi-scale features and boundary detail information to improve the detection accuracy of small targets while effectively suppressing background noise. DN-FPN [33] addresses the problem of noisy features arising from the lack of regularization between features of different scales during fusion by using contrastive learning to suppress noise in each layer of FPN’s top-down pathway, thereby improving small object detection accuracy. CFPT [34] introduces a feature pyramid network designed without upsampling, tailored for small object detection in aerial imagery. By enhancing feature interaction and global information utilization through cross-layer channel and spatial attention mechanisms, this approach effectively prevented information loss and improved detection performance.

ClusDet [35] proposes a progressive refinement detection strategy, which first identifies clustering regions using a clustering proposal network (CPNet), then adjusts the region size with a scale estimation network (ScaleNet), and finally completes high-precision detection with a detection network (DetecNet). DMNet [36] simplifies the training process of ClusDet by using a density map generation network to predict clustering regions. UFPMP-Det [37] employs a multi-path aggregation strategy for feature pyramids and anchor optimization techniques, assembling predicted subregions into a complete image, thereby achieving efficient detection with improved precision and efficiency through a single inference process. CEASC [38] introduces global contextual features to replace sparse sampling statistics and designs an adaptive multi-layer masking strategy to optimize foreground coverage across different scales by generating optimal masks, significantly improving detection performance. DTSSNet [39] adds manually designed modules between the backbone network and the feature pyramid to enhance sensitivity to multi-scale features while optimizing detection performance through a sample selection method specifically designed for small objects.

Drone-YOLO [40] uses a three-layer PAFPN structure combined with a large feature map detection head optimized for small objects, significantly enhancing small object detection capabilities. ESOD [41] integrates feature-level target search and image-slicing techniques, using the backbone network for feature extraction and a sparse detection head to reduce computational overhead in background regions, achieving efficient detection in high-resolution images. FFCA-YOLO [42] introduces a Feature Enhancement Module (FEM), Feature Fusion Module (FFM), and Spatial Context-Aware Module (SCAM) to improve local region perception and multi-scale feature fusion, enhancing global correlation across channels and spatial dimensions. SOD-YOLOv8 [43] includes explicit supervision for tiny object regions during training, generating attention maps to modulate the semantics of regions containing small objects while suppressing background features, thereby improving detection accuracy for small objects. YOLC [44] improves the quality of bounding boxes. Its detection head combines deformable convolutions with refinement techniques, enhancing small object detection performance. The sparse feature representation of small objects and their susceptibility to background noise make accurate feature extraction critical.

DQ-DETR [45] improves DETR for tiny object detection by introducing a Dynamic Query mechanism that adaptively adjusts query positions and refines features based on the input image. This dynamic adjustment enhances the model’s ability to focus on small, densely packed objects.

The aforementioned methods effectively address various challenges in small object detection. However, due to the inherent characteristics of small objects, several issues remain, including the low proportion of small objects, insufficient feature extraction, misalignment between low-resolution semantic information and high-resolution detailed features during fusion, and the computationally expensive nature of self-attention mechanisms. To tackle these issues, this paper proposes USF-DETR, which effectively mitigates the aforementioned problems and improves the accuracy of small object detection. The following sections will elaborate on the proposed method in detail.

3. Methodologies

This section provides an comprehensive explanation of the proposed USF-DETR framework for small object detection in aerial images. We begin with an overview of the method, introducing the overall network architecture. Subsequently, we provide a detailed description of each module and its components.

3.1. Overview

This study adopts RT-DETR [14] as the baseline framework. RT-DETR is an end-to-end object detector based on the DETR [5] architecture, eliminating the need for non-maximum suppression (NMS). By doing so, RT-DETR significantly reduces the latency compared to previous CNN-based object detectors, such as the YOLO series. It combines a robust backbone, a hybrid encoder, and a unique query selector to efficiently and accurately process features. Building on RT-DETR, this paper proposes USF-DETR, an end-to-end framework for small object detection that unifies the spatial and frequency domains. As shown in Figure 2, the proposed method comprises three key components: SFIB (Spatial-Frequency Interaction Backbone), EA-AIFI (Efficient Attention-AIFI), and DABF-FPN (Dual Alignment and Balance Fusion FPN), along with a minimum uncertainty query selection and decoding head. The backbone network, SFIB, employs a Spatial-Frequency Interaction (SFI) module to integrate spatial and frequency domain features. This enables the effective capture of fine-grained local structures and global contextual information, leveraging the complementary characteristics of spatial details and frequency patterns to extract rich multi-scale representations. The backbone generates multi-scale feature maps (S2–S4), which are directly fed into the DABF-FPN, providing multi-level inputs for subsequent stages. The S5 feature map is processed through the EA-AIFI module, producing F5. This module efficiently models dependencies between different scales of features, enabling effective interaction across scales. The DABF-FPN refines multi-scale features through a bidirectional dynamic recalibration mechanism, enhancing key features while suppressing redundant information to ensure high-quality representation across scales. The integrated features are subsequently processed through a minimum uncertainty query selection strategy, which prioritizes high-confidence feature queries for input into the decoder, significantly reducing ambiguity during detection. The decoder and detection head further process these optimized query features, generating accurate target localization and classification results. This unified approach ensures efficient and accurate small object detection, leveraging the complementary strengths of spatial and frequency domains to address the unique challenges of aerial images.

3.2. Spatial-Frequency Interaction Backbone (SFIB)

3.2.1. Module Structure

Small object detection in aerial imagery faces challenges such as small target sizes and sparse feature representations, which are often suppressed by large object features in deeper layers of the network. Current deep learning approaches commonly utilize multi-scale feature pyramids to enhance small object features and improve detection performance [46]. However, these methods predominantly rely on spatial domain features, which struggle to comprehensively capture the spatial structure and frequency information of images. This limitation results in insufficient feature extraction for targets and an increase in background false positives. In the frequency domain, high-frequency components typically correspond to image details and textures, while low-frequency components represent variations in overall brightness, including smooth regions and gradient transitions in the image. The Fourier Transform translates an image from the spatial domain into the frequency domain, revealing the distribution of different frequency components, which helps to better understand the structure and content of the image. Inspired by this, we propose the SFIB, a dual-domain fusion backbone that extracts features from both the spatial and frequency domains and then integrates them. This effectively preserves the spatial structural features of small objects, significantly enriches feature representation, and enhances the robustness and accuracy of small object detection.

As illustrated in Figure 3, the backbone consists of four stages (Stage 1–4), each comprising a convolutional layer and an SFI block, forming a five-layer feature pyramid. For an input image of size H × W × 3, preliminary feature extraction is performed using a convolutional layer (Conv). In subsequent stages, each stage extracts features through a Conv module followed by an SFI block. The resolution progressively decreases, while the number of channels expands to 2c, 4c, and 6c, allowing the network to capture more abstract and higher-order features.

The structure of the SFI block, as shown in the lower-left part of Figure 3, integrates the CSP [47] design. First, the input feature map is split into two parts, each undergoing independent feature extraction pathways before being fused. This design reduces computational overhead, avoids feature loss, and optimizes gradient propagation, thereby facilitating more efficient model training and feature learning. This dual-domain fusion approach leverages the complementary strengths of spatial and frequency domains, enriching feature representation and enabling robust small object detection.

3.2.2. SFI Block

The SFI block is a CNN block designed to fuse spatial and frequency domain features. By extracting features from both domains, it aims to capture different levels of spatial and frequency information, enhancing the model’s robustness and representational capability for image data. The Scharr operator is employed to extract spatial-domain features based on the spatial structure of the image, focusing on details such as edges and gradients. The block employs FFT and IFFT to extract frequency-domain patterns, capturing both low-frequency components (smooth regions and gradual transitions) and high-frequency components (details and textures) of the image. Spatial and frequency features are fused through a weighted addition to produce the final feature map. This weighted fusion enables the model to simultaneously leverage spatial structure information and frequency patterns, improving its performance across various scenarios.

Given the input feature

I \in R^{H \times W \times C}

, the horizontal and vertical gradients

G_{x} (x, y)

and

G_{y} (x, y)

are computed using the Scharr operator:

G_{x} (x, y) = I (x, y) * K_{x}, G_{y} (x, y) = I (x, y) * K_{y}

(1)

where ∗ represents the 2D convolution operation,

K_{x}

, and

K_{y}

are the pre-defined convolution kernels of the Scharr operator, and

G_{x} (x, y)

and

G_{y} (x, y)

are the gradient responses in the horizontal and vertical directions, respectively.

The spatial gradient features are subsequently processed through several convolutional layers to capture both local and global information, thereby generating the spatial-domain feature output

F_{spatial}

:

F_{spatial} = Conv (I (x, y) + Conv (G_{x} + G_{y}))

(2)

After applying FFT and IFFT, the frequency domain features extracted from the other features are

F_{frequency} = Conv (IFFT (Conv (FFT (I (x, y)))))

(3)

The spatial-domain and frequency-domain enhanced features are combined through addition and further processed with an additional convolutional layer to generate the final output:

F_{out} = Conv (F_{spatial} + F_{frequency})

(4)

The SFI block employs a CSP branching structure to reduce redundant computations while retaining important feature information. By integrating local spatial details with global structural information from the frequency domain, it significantly enhances feature richness and discriminative capability. This design captures edge and frequency characteristics, improving resistance to complex background noise and making it particularly effective in detecting small objects and intricate details.

3.3. Dual Alignment and Balance Fusion FPN (DABF-FPN)

3.3.1. Overall Structure

In aerial image small object detection, features with strong class information and high-resolution spatial details are crucial. To achieve this, existing deep learning models often rely on feature fusion, where upsampled coarse features from deeper layers are directly added to high-resolution features from shallower layers. It is well understood that shallow layers contain fewer semantic details but are rich in spatial details, offering clearer boundaries and less distortion, while deeper layers provide abundant semantic information. However, directly fusing high-level and low-level feature maps often leads to rapid changes in feature values, which can interfere with high-frequency features [48]. Repeated downsampling further exacerbates the issue by blurring boundaries, resulting in imprecise high-frequency components and boundary misalignments [49]. To strengthen the interaction between shallow and deep features, this paper proposes the DABF-FPN, as illustrated in Figure 4. In this framework, features from the previous layer and the current layer are fused using the DABF module. The DABF module effectively integrates boundary and semantic information to define finer-grained object contours, asymmetrically enhancing the target regions while suppressing background noise. This optimization highlights target boundaries and positions, improving the saliency of these features. When the fused features are passed to the next level, they incorporate additional details from lower-resolution layers, further enriching the information.

To better leverage information from small objects, features from the S2 layer are also included in the fusion process. This ensures the retention of features rich in information related to small objects. Subsequently, a bottom-up feature fusion process is applied, enabling more thorough information exchange across features and improving the multi-scale feature fusion process. Finally, the fused features, denoted P2, N3, N4, and N5, are passed to the decoder and detection head to facilitate the detection of small objects. Compared to traditional FPN structures, the DABF module incorporates a bidirectional fusion mechanism to bridge high- and low-resolution features, enabling a more comprehensive exchange of information between them, significantly improving multiscale feature fusion performance, and enhancing the accuracy of small object detection.

3.3.2. Dual Alignment and Balance Fusion (DABF)

Directly fusing low- and high-resolution features can lead to redundancy and inconsistencies. The DABF module overcomes this by selectively aggregating boundary and semantic information, refining object contours and recalibrating positions with greater precision. Through an adaptive attention mechanism, the module dynamically adjusts feature weights based on the resolution and content of the feature maps, effectively capturing multi-scale target features.

The designed DABF module adaptively extracts mutual representations from two inputs before fusion. As shown in Figure 5, the high-level semantic features (low-resolution) are denoted as

F_{l}

, and the low-level detailed features (high-resolution) are denoted as

F_{h}

. Both shallow and deep information is input into two DABF blocks in different ways to address the deficiency of spatial boundary details in high-level semantic features and the lack of semantic richness in low-level features. This ensures a more comprehensive information exchange between high- and low-resolution features, expressed as

F_{h}^{'} = DPAC (F_{h}, F_{l}), F_{l}^{'} = DPAC (F_{l}, F_{h})

(5)

Finally, the outputs of the two DABF blocks are concatenated after a 3 × 3 convolution, and then passed through

R e p C_{3}

to obtain the fused features

F_{fusion}

. This aggregation strategy achieves the robust combination of different features and refines rough features. The fused features can be written as follows:

F_{fusion} = {RepC}_{3} ({Conv}_{3 \times 3} (F_{h}^{'} + F_{l}^{'}))

(6)

In the DABF module, the low-resolution input

F_{l}

is first processed using a 1 × 1 convolution to align the channels, producing the initial features

I_{1}

. These features are further refined through attention mechanisms, enhancing their spatial representation and generating dynamically adjusted features

I_{1}^{'}

. Similarly, the high-resolution input

F_{h}

undergoes a 1 × 1 convolution to align its channel dimensions, resulting in the initial features

I_{2}

. The attention operation strengthens its spatial detail representation, producing dynamically enhanced features

I_{2}^{'}

. The two sets of refined features,

I_{1}^{'}

and

I_{2}^{'}

, are combined through a complementary weighted fusion mechanism. This involves element-wise multiplication of one feature set with the other and summing the results, ensuring dynamic feature integration. The fused output is expressed as

F_{l}^{'}

.

I_{1}^{'} = {Conv}_{1 \times 1} (Sigmod (I_{1})), I_{2}^{'} = {Conv}_{1 \times 1} (Sigmod (I_{2})) F_{l}^{'} (I_{1}, I_{2}) = I_{1}^{'} \cdot I_{1} + I_{2}^{'} \cdot I_{2} \cdot (- I_{1}^{'}) + I_{1}

(7)

The DABF-FPN provides a multi-scale feature fusion method that combines alignment and balance mechanisms. By modeling symmetry in high-resolution features and asymmetry in low-resolution features, it effectively enhances the boundary clarity and saliency of target regions while suppressing interference from complex backgrounds. This significantly enhances the precision and robustness of small object detection.

3.4. Efficient Attention-AIFI (EA-AIFI)

In RT-DETR, the AIFI module interacts within the scale of

S_{5}

using a single-scale Transformer encoder, further reducing the computational cost. Specifically, features

S_{5}

are flattened into vectors and fed into the AIFI module, which functions as a standard Transformer encoder that comprises multi-head self-attention and a feed-forward network (FFN). The output is then reshaped back into two-dimensional features, expressed as

F_{5}

:

\begin{matrix} Q = K = V = Flatten (C_{5}) \\ F_{5} = Reshape (AIFI (Q, K, V)) \end{matrix}

(8)

The Transformer encoder adopts multi-head self-attention, which has proven to be an effective choice for capturing global context in various applications. Additive attention mechanisms capture global context through pairwise interactions between tokens using element-wise multiplication instead of dot-product operations. However, the expensive matrix multiplications involved continue to remain a bottleneck. To improve detection efficiency, this paper incorporates the Efficient Additive Attention from SwiftFormer [50]. This approach eliminates the need for key-value interactions and learns token relationships effectively using only linear projection layers for query-key interactions. Without sacrificing performance, this method significantly reduces computational overhead. The proposed EA-AIFI module, as illustrated in Figure 6a, processes data flow from the lower-level input (e.g., S5 features) through position encoding and EA-AIFI, and then through the FFN. This process achieves efficient feature extraction and the integration of object interaction information. The EA-AIFI module offers faster inference speed while producing stronger contextual representations, enabling efficient small object detection and robust target feature modeling.

Efficient Additive Attention is depicted in Figure 6b. The input x is used to generate queries (Q) and keys (K) through linear transformations. The query matrix Q is multiplied by a learnable parameter vector to determine the attention weights of the queries, resulting in a global attention query vector

α

. Subsequently, the query matrices are aggregated based on the learned attention weights to form a unified global query vector, denoted as q:

α = Q \cdot w_{a} / \sqrt{d}, q = \sum_{i = 1}^{n} α_{i} * Q_{i}

(9)

Subsequently, the global attention is normalized and combined with the global context and the original input x to generate the updated feature representation (

\hat{x}

), formulated as

\hat{x} = \hat{Q} + Γ (K * q)

(10)

where

\hat{Q}

denotes the normalized global query matrix, and

Γ

represents the linear transformation applied to the original input.

Compared to traditional multi-head attention mechanisms, the use of additive models can effectively reduce computational complexity while maintaining strong global modeling capabilities. Replacing the original AIFI with EA-AIFI not only lowers computational complexity but also substantially boosts small object detection performance by strengthening global context modeling capabilities. Especially in complex backgrounds or target-dense aerial image scenes, it can better preserve target information and improve detection accuracy.

4. Experiment

4.1. DataSet

This study evaluates the USF-DETR on two small object detection datasets: VisDrone 2019 [51] and TinyPerson [52]. VisDrone is a large-scale drone-captured dataset collected from 14 different cities across urban and suburban areas in China. It contains annotations for ten object categories: bicycles, tricycles with canopies, tricycles, vans, buses, trucks, cars, pedestrians, people, and vehicles. The resolution of these images is approximately 2000 × 1500 pixels. Due to significant viewpoint variations and severe occlusions, the dataset contains a high proportion of small objects. It is divided into three subsets: a training set of 6471 images, a validation set of 548 images, and a test set of 1610 images.

TinyPerson is a dataset designed for small object crowd detection tasks, particularly suitable for studying how to accurately detect and localize small-scale human targets in complex environments. The dataset consists of 1610 annotated images and 759 unannotated images, encompassing a total of 72,651 annotated instances. In TinyPerson, the average target height is only 12 pixels, with many targets near the lower limit of annotation precision. These extremely small targets pose significant challenges for feature extraction and fine-grained information modeling. Furthermore, targets often exhibit dense distributions with minimal spacing between multiple objects. The 1610 annotated images are split into training, validation, and testing sets in an 8:1:1 ratio. These datasets provide a rigorous testing ground for evaluating the robustness and accuracy of small object detection methods, particularly in challenging scenarios with dense object distributions and severe occlusion.

Figure 7 shows the distribution of the square root of bounding box areas for the two datasets. From the top plot for the TinyPerson dataset, it is evident that the bounding box sizes are mostly concentrated below 10 pixels, indicating that the targets are extremely small. In the VisDrone dataset, bounding box sizes vary. For small objects (e.g., “motor”), the bounding box sizes primarily fall within the range of 10 to 30 pixels. In contrast, large objects (e.g., “bus”) have bounding box sizes concentrated between 20 and 40 pixels. Categories like “tricycle” and “awning-tricycle” exhibit similar bounding box size distributions, which could lead to detection ambiguities. These variations affect the model’s ability to effectively learn small object categories.

4.2. Evaluation Metrics and Environment

4.2.1. Evaluation Metrics

To rigorously assess the performance of small object detection methods in aerial imagery, several well-known quantitative metrics are employed. These metrics include precision, recall, and mean average precision (mAP), which are standard in machine learning for evaluating classification performance. Precision measures the proportion of correctly predicted positive samples among all positive predictions, while recall assesses the proportion of actual positive samples that are correctly identified. Average precision (AP) evaluates the precision–recall relationship for a single class, and mean average precision (mAP) extends this by averaging the AP values across multiple classes to provide an overall evaluation of model performance.

Additionally, we use Params, GFLOPs, and FPS to evaluate the model’s efficiency. Params refer to the number of trainable parameters, GFLOPs measure the computational cost, and FPS indicates the model’s processing speed. These metrics offer a comprehensive assessment of the model’s trade-offs between accuracy, efficiency, and speed.

4.2.2. Experimental Environment

To ensure fairness in model training and comparison experiments, the ablation experiments and training processes were conducted at the Super Intelligent Computing Center of Xijing University. The GPU used is NVIDIA A800 with 80 GB of memory, and the CPU is Intel 6338N Xeon, both sourced from Intel and NVIDIA Corporation respectively, Santa Clara, CA, USA. The operating system is Red Hat 4.8.5-28. All models were trained using an environment configured with CUDA 12.1, Python 3.11, and Pytorch 2.1 [53]. The hardware specifications used for training and testing are listed in Table 1, and the key parameters used during the training process are shown in Table 2.

4.3. Ablation Study

We conducted ablation experiments on the validation sets of the VisDrone and TinyPerson datasets to assess the impact of each proposed enhancement in our methods. RT-DETR was used as the baseline model, with ResNet-r18 as the backbone. The experiments systematically evaluated the impact of each proposed component on the model. The ablation study used precision, recall, mAP_50, and mAP_50:95 as metrics to evaluate detection accuracy. Additionally, GFLOPs and Params were employed to assess the complexity of the model. The experimental results are presented in Table 3 and Table 4.

Table 3 compares the impact of different module combinations (SFIB, DABF-FPN, and EA-AIFI) on model performance. As shown in the table, the baseline model achieves a mAP_50 of 48.2%. The baseline model lacks optimization strategies specifically tailored for small objects and relies solely on global features, which leading to the neglect of small object details and consequently results in limited performance. When the SFIB module is added, the Params and GFLOPs slightly increase, while the mAP_50 improves by 1%. This improvement is attributed to the SFIB module’s integration of spatial and frequency domain features, enabling the model to comprehensively consider spatial details and frequency distributions. As a result, this significantly enhances its feature extraction capability and multi-scale adaptability in complex scenarios. Upon including both SFIB and DABF-FPN, the number of parameters increases significantly to 42.39M, and GFLOPs rise to 106.9. Meanwhile, mAP_50 improves by 3.7% and mAP_50:95 by 3.3%. This improvement is largely due to the DABF-FPN module, which strengthens the interaction between shallow and deep features through bidirectional fusion. Additionally, it selectively aggregates boundary and semantic information to depict finer object contours, asymmetrically enhancing target regions, suppresses background noise, and optimizes the saliency representation of target boundaries and positions. These features collectively enhance the model’s feature fusion capability, resulting in a significant improvement in detection accuracy. After integrating the EA-AIFI module, Params and GFLOPs slightly decrease, while precision reaches its highest value of 65%, with mAP_50 improving to 52.3% and mAP_50:95 reaching 33.2%. The EA-AIFI module reduces matrix operations and combines positional encoding with FFN, achieving faster inference speeds and robust contextual representation. The synergistic effect of these three modules compensates for the baseline model’s shortcomings in small object detection. By complementing each other, these modules optimize different aspects of the network, making it better suited for small object detection in aerial imagery and collectively improving overall detection performance.

Figure 8 compares the feature maps generated by USF-DETR and RT-DETR after the backbone network. As shown in Figure 8b, the feature map extracted by RT-DETR, which demonstrates limited capability in capturing global semantics and local details. The target regions exhibit sparse feature representation, with noticeable background noise. Figure 8c illustrates the feature map generated by USF-DETR. Compared to the baseline, the proposed method’s feature map demonstrates stronger saliency in target regions (e.g., pedestrians) and clearer details. This improvement is mainly due to the integration of the SFIB module in the backbone, which integrates spatial-domain local details and frequency-domain global structural information. This approach greatly improves feature richness and discriminative power by capturing edge and frequency characteristics, enhancing resistance to noise in complex backgrounds. As a result, the produced feature map surpasses RT-DETR in target saliency representation and background suppression, delivering outstanding performance in challenging scenarios and multi-scale detection tasks.

In the TinyPerson dataset, we achieve similar experimental results, as presented in Table 4. SFIB provides improved feature representation, DABF-FPN enhances multi-scale feature fusion, while EA-AIFI focuses on key feature regions through attention mechanisms, resulting in optimal performance. The optimized model has significantly improved the precision, recall, and mAP for small targets, representing an important enhancement over the original model. Additionally, the proposed modules are compatible, and when all proposed methods are implemented, the model achieves its highest performance, reaching 31.2% on the TinyPerson.

To illustrate the advantages of our method more intuitively, Figure 9 presents a visual comparison of thermal maps and detection results for two images from the TinyPerson dataset. Compared with the Baseline method, the heatmap in (c) generated by our method more accurately focuses on small target areas, reducing background noise interference, indicating that our method can effectively extract and combine local and global features. In the detection results (d and e), the baseline method exhibits false positives (first row) and false negatives (second row), while USF-DETR (e) can more comprehensively and accurately detect targets, significantly improving small target detection accuracy. By comparing it with the ground truth, it can be seen that the USF-DETR effectively improves the receptive field for small objects, especially in areas with dense targets or those located far from the camera. This demonstrates the optimization effect of USF-DETR on small target feature extraction and representation. This superior performance is attributed to our feature extraction and fusion process, which minimizes the loss of small object information and increases feature correlation.

4.4. Comparative Experiment

4.4.1. VisDrone

We evaluated our proposed method against other state-of-the-art approaches on the VisDrone dataset, using the COCO metrics to assess performance, including average precision (AP) at different IoU thresholds. As shown in Table 5, the evaluation metrics include

AP

,

{AP}_{50}

,

{AP}_{75}

, performance for small objects (

{AP}_{s}

), medium objects (

{AP}_{m}

), and large objects (

{AP}_{l}

), as well as the number of parameters (Params), computational complexity (GFLOPs), and inference speed (FPS).

Two-stage detectors such as Faster R-CNN and Cascade R-CNN exhibit moderate AP performance but require significantly higher computational resources and parameters, resulting in low FPS. One-stage models (e.g., TOOD and RTMDet) achieve higher AP compared to two-stage methods but have lower FPS, rendering them less practical for real-time applications. RTMDet has a considerably lower parameter count and computational complexity than other models while maintaining a relatively high inference speed (52.1 FPS). The YOLO series models (e.g., YOLOv8m and YOLOv11m) excel in inference speed but have slightly lower AP compared to other methods. RT-DETR achieves a balanced AP performance and demonstrates some advantages in small object detection (

{AP}_{s}

). Our proposed method (USF-DETR) outperforms others in overall performance, particularly in small object detection (

{AP}_{s}

= 12.3%) and overall AP (AP = 34.1%), while maintaining a high inference speed (FPS = 80.4). These results emphasize the notable benefits of our method in both performance and efficiency. The optimal results in the table are displayed in bold.

Figure 10 displays the detection results achieved by USF-DETR on the VisDrone dataset. It can be observed that the targets are distributed across a wide range of scenarios, from dense traffic environments to open areas, and the model demonstrates robust performance throughout. The detection results for various objects, including pedestrians, cars, trucks, and bicycles, are accurate, with bounding boxes effectively covering the targets. Many targets in the images are small pedestrians or vehicles, often located in dense traffic intersections or crowded public areas. Despite the complexity of backgrounds, such as nighttime streets or sports venues, the model effectively detects numerous targets with minimal overlap between bounding boxes. This indicates the model’s robustness to background interference and its strong ability to distinguish small objects in densely populated scenes. The results highlight the model’s capability for precise detection in challenging scenarios.

Figure 11 compares the detection results of the USF-DETR and the baseline model. Green boxes indicate correct detections, blue boxes represent false positives, and red boxes denote missed detections. USF-DETR achieves significantly higher accuracy in various challenging scenarios, including dense crowds and complex backgrounds such as square scenes. The number of false positives and missed detections is notably reduced, particularly in environments with high noise, intricate backgrounds such as streets, small or distant objects, and low-contrast settings such as nighttime scenes. These proposed components work synergistically to enable USF-DETR to perform effectively in complex environments, detect dense and multi-scale targets, and significantly reduce both false positives and false negatives, resulting in improved overall detection accuracy.

4.4.2. TinyPerson

Table 6 presents the detection performance comparison of the USF-DETR against various mainstream detection algorithms on the TinyPerson dataset. Traditional two-stage methods (e.g., Faster R-CNN and Cascade R-CNN) perform poorly in small object detection (

{AP}_{s}

) and have low FPS, indicating their limited applicability to real-time scenarios. The YOLO series one-stage methods (e.g., YOLOv8m and YOLOv11m) demonstrate better performance in terms of both AP and FPS, with YOLOv8m achieving an impressive FPS of 154.9 while maintaining a relatively good AP of 6%. RT-DETR also performs well, striking a balance between AP and FPS, achieving an AP of 6.9% with a high inference speed of 133.8 FPS. Among all methods, USF-DETR achieves the best performance, with an overall AP of 8.3%. It also outperforms other methods significantly in

{AP}_{50}

and

{AP}_{75}

, reaching 25.4% and 3.2%, respectively. Notably, in small object detection,

{AP}_{s}

reaches 8.1%, demonstrating the superior capability of the USF-DETR for detecting small objects. However, the FPS of USF-DETR is slightly lower than that of the YOLO series, but a balance has still been struck between speed and accuracy.

From Figure 12, it is evident that USF-DETR (first row) outperforms the baseline method (second row) in overall object detection. USF-DETR shows significantly more green boxes indicating correct detections, reflecting its higher precision. There are noticeably fewer blue boxes indicating false positives, demonstrating a lower false positive rate. Similarly, fewer red boxes indicating missed detections highlight the improved detection coverage of USF-DETR. Notably, in densely populated areas (e.g., the nearshore region in the left image), USF-DETR successfully detects more targets, whereas the baseline method shows higher rates of missed detections and false positives. Overall, USF-DETR surpasses the baseline in both accuracy and robustness, particularly excelling in complex scenes where its superior performance is more evident.

Figure 13 compares the detection performance of USF-DETR against other approaches on the TinyPerson dataset. Overall, USF-DETR excels in detecting both the “earth person” and “sea person” categories. The number of detected targets closely matches the ground truth, with bounding boxes that are accurate in both position and size. In the first row’s scene, RT-DETR exhibits false positive detections for some “earth person” targets, especially in densely overlapping regions, where the detection boxes lack precision. In the second row’s dock scene, USF-DETR successfully detects a large number of “sea person” targets on the water surface, with bounding box distributions closely aligned with the ground truth. It also effectively detects distant small targets. While RT-DETR shows improved performance, it still misses some targets in densely packed boats and complex backgrounds (e.g., overlapping targets). YOLOv11m exhibits significant missed detections, leaving many targets unmarked. In the third and fourth rows’ beach scenes, the USF-DETR accurately detects small “earth person” targets on the ground, with bounding boxes closely fitting the actual targets. In contrast, RT-DETR and YOLOv11m show frequently missed detections for distant small targets, and some detection boxes are misaligned with the target positions. Through improved feature fusion and multi-scale processing, USF-DETR significantly enhances the detection performance for the two target categories in the TinyPerson dataset, particularly excelling in complex scenes and densely populated areas.

5. Conclusions

This article proposes a unified spatial frequency domain modeling and alignment framework to address the problems of sparse feature expression, insufficient multi-scale fusion, and limited contextual modeling capabilities in small object detection in aerial images, significantly improving the performance of small object detection. By designing SFIB, this model achieves a dynamic weighted combination of spatial and frequency domain features, effectively capturing global semantic information and local detail features. By using DABF-FPN, the geometric alignment and saliency balance of high and low-resolution features are achieved in multi-scale fusion, further enhancing the feature expression of small targets. Meanwhile, EA-AIFI optimizes the balance between context modeling efficiency and inference speed, significantly reducing the missed and false detection rates of small targets. The experimental results show that USF-DETR can better extract the boundary and detail information of small targets in complex scenes, and has significant advantages compared to existing methods, fully demonstrating the robustness and accuracy of the proposed framework in small target detection in aerial images. The experimental results show that the method proposed in this paper outperforms existing methods in terms of accuracy in small object detection. Meanwhile, the feature maps generated by the method in this article have richer details, clearer boundaries, significantly improved box localization accuracy for small targets, reduced background noise, and effectively reduced missed and false detection rates.

Although this method has made significant progress in small object detection, there are still certain limitations. Firstly, while the introduction of frequency domain modeling and the additive attention mechanism has improved the feature expression ability of the model, the consumption of computational resources is still higher than that of lightweight models, which is not suitable for resource-limited terminal devices. Secondly, this method optimizes the geometric alignment and saliency balance of multi-scale features, but there is still room for improvement in performance in extremely small object detection tasks such as extremely small object sizes or highly overlapping targets. Future work includes further optimizing the model structure to reduce computational complexity, making it more suitable for resource-constrained scenarios. Meanwhile, we plan to add more robustness verification of the algorithm, such as handling artifacts like blurring, low contrast, affine transformations, etc. Additionally, we will explore how to further improve the model’s robustness and detection accuracy in extremely complex scenes. In addition, combining unsupervised or weakly supervised learning methods to expand the application potential of models in scenarios with insufficient annotated data is also a direction worth exploring.

Author Contributions

Conceptualization, J.L. and Y.C.; Methodology, J.L., Y.W. and P.L.; Software, P.L., C.G. and P.S.; Validation, P.L.; Visualization, Y.C.; Writing—original draft, J.L. and P.S.; Writing—review and editing, Y.W. and C.G. All authors have read and agreed to the published version of this manuscript.

Funding

This research was funded by Science and Technology Research Project of Education Department of Hubei Province (B2023362) and Excellent Young and Middle aged Science and Technology Innovation Team Project for Higher Education Institutions of Hubei Province (T2023045).

Data Availability Statement

The VisDrone is available at https://github.com/VisDrone (accessed on 10 May 2024); the TinyPerson is available at https://github.com/ucas-vg/PointTinyBenchmark (accessed on 12 May 2024).

Acknowledgments

The authors acknowledge the referees and the editor for carefully reading this paper and giving many helpful comments. The authors also express their gratitude to the reviewers for their insightful comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef] [PubMed]
Mittal, P.; Singh, R.; Sharma, A. Deep learning-based object detection in low-altitude UAV datasets: A survey. Image Vis. Comput. 2020, 104, 104046. [Google Scholar] [CrossRef]
Jiang, X.; Tang, H.; Li, Z. Global Meets Local: Dual Activation Hashing Network for Large-Scale Fine-Grained Image Retrieval. IEEE Trans. Knowl. Data Eng. 2024, 36, 6266–6279. [Google Scholar] [CrossRef]
Vaswani, A. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
Farhadi, A.; Redmon, J. YOLOv3: An Incremental Improvement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 1–6. [Google Scholar]
Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024. [Google Scholar]
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
Wang, S.; Wan, C.; Yan, J.; Li, S.; Sun, T.; Chi, J.; Yang, G.; Chen, C.; Yu, T. Hierarchical Scale Awareness for object detection in Unmanned Aerial Vehicle Scenes. Appl. Soft Comput. 2024, 168, 112487. [Google Scholar] [CrossRef]
Tong, K.; Wu, Y. Deep learning-based detection from the perspective of small or tiny objects: A survey. Image Vis. Comput. 2022, 123, 104471. [Google Scholar] [CrossRef]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings; Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 5156–5165. [Google Scholar]
Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar]
Liu, S.; Huang, D.; Wang, Y. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 385–400. [Google Scholar]
Wang, Z.; Guo, J.; Zhang, C.; Wang, B. Multiscale feature enhancement network for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5634819. [Google Scholar] [CrossRef]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 568–578. [Google Scholar]
Ma, T.; Mao, M.; Zheng, H.; Gao, P.; Wang, X.; Han, S.; Ding, E.; Zhang, B.; Doermann, D. Oriented object detection with transformer. arXiv 2021, arXiv:2106.03146. [Google Scholar]
Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Montreal, BC, Canada, 11–17 October 2021; pp. 14454–14463. [Google Scholar]
Sun, Z.; Cao, S.; Yang, Y.; Kitani, K.M. Rethinking transformer-based set prediction for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3611–3620. [Google Scholar]
Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
Xiao, Y.; Xu, T.; Yu, X.; Fang, Y.; Li, J. A Lightweight Fusion Strategy with Enhanced Inter-layer Feature Correlation for Small Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4708011. [Google Scholar] [CrossRef]
Liu, H.I.; Tseng, Y.W.; Chang, K.C.; Wang, P.J.; Shuai, H.H.; Cheng, W.H. A DeNoising FPN with Transformer R-CNN for Tiny Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4704415. [Google Scholar] [CrossRef]
Du, Z.; Hu, Z.; Zhao, G.; Jin, Y.; Ma, H. Cross-Layer Feature Pyramid Transformer for Small Object Detection in Aerial Images. arXiv 2024, arXiv:2407.19696. [Google Scholar]
Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8311–8320. [Google Scholar]
Li, C.; Yang, T.; Zhu, S.; Chen, C.; Guan, S. Density map guided object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 190–191. [Google Scholar]
Huang, Y.; Chen, J.; Huang, D. UFPMP-Det: Toward accurate and efficient object detection on drone imagery. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 28 February–1 March 2022; Volume 36, pp. 1026–1033. [Google Scholar]
Du, B.; Huang, Y.; Chen, J.; Huang, D. Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13435–13444. [Google Scholar]
Chen, L.; Liu, C.; Li, W.; Xu, Q.; Deng, H. DTSSNet: Dynamic Training Sample Selection Network for UAV Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5902516. [Google Scholar] [CrossRef]
Zhang, Z. Drone-YOLO: An efficient neural network method for target detection in drone images. Drones 2023, 7, 526. [Google Scholar] [CrossRef]
Liu, K.; Fu, Z.; Jin, S.; Chen, Z.; Zhou, F.; Jiang, R.; Chen, Y.; Ye, J. ESOD: Efficient Small Object Detection on High-Resolution Images. arXiv 2024, arXiv:2407.16424. [Google Scholar] [CrossRef]
Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for small object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611215. [Google Scholar] [CrossRef]
Khalili, B.; Smyth, A.W. SOD-YOLOv8—Enhancing YOLOv8 for Small Object Detection in Aerial Imagery and Traffic Scenes. Sensors 2024, 24, 6209. [Google Scholar] [CrossRef]
Liu, C.; Gao, G.; Huang, Z.; Hu, Z.; Liu, Q.; Wang, Y. YOLC: You Only Look Clusters for Tiny Object Detection in Aerial Images. IEEE Trans. Intell. Transp. Syst. 2024, 25, 13863–13875. [Google Scholar] [CrossRef]
Liu, S.; Huang, S.; Li, F.; Zhang, H.; Liang, Y.; Su, H.; Zhu, J.; Zhang, L. DQ-DETR: Dual query detection transformer for phrase extraction and grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 1728–1736. [Google Scholar]
Liu, J.; Jing, D.; Zhang, H.; Dong, C. SRFAD-Net: Scale-Robust Feature Aggregation and Diffusion Network for Object Detection in Remote Sensing Images. Electronics 2024, 13, 2358. [Google Scholar] [CrossRef]
Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
Chen, L.; Fu, Y.; Gu, L.; Yan, C.; Harada, T.; Huang, G. Frequency-aware feature fusion for dense image prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10763–10780. [Google Scholar] [CrossRef] [PubMed]
Li, Y.; Jin, W.; Qiu, S.; He, Y. Multiscale Spatial-Frequency Domain Dynamic Pansharpening of Remote Sensing Images Integrated with Wavelet Transform. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5408915. [Google Scholar] [CrossRef]
Shaker, A.; Maaz, M.; Rasheed, H.; Khan, S.; Yang, M.H.; Khan, F.S. Swiftformer: Efficient additive attention for transformer-based real-time mobile vision applications. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 17425–17436. [Google Scholar]
Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; Han, Z. Scale match for tiny person detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1257–1265. [Google Scholar]
Liu, J.; Jing, D.; Cao, Y.; Wang, Y.; Guo, C.; Shi, P.; Zhang, H. Lightweight Progressive Fusion Calibration Network for Rotated Object Detection in Remote Sensing Images. Electronics 2024, 13, 3172. [Google Scholar] [CrossRef]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
Cai, Z.; Vasconcelos, N. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1483–1498. [Google Scholar] [CrossRef] [PubMed]
Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-aligned one-stage object detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 3490–3499. [Google Scholar]
Biffi, L.J.; Mitishita, E.; Liesenberg, V.; Santos, A.A.d.; Gonçalves, D.N.; Estrabis, N.V.; Silva, J.d.A.; Osco, L.P.; Ramos, A.P.M.; Centeno, J.A.S.; et al. ATSS deep learning-based approach to detect apple fruits. Remote Sens. 2020, 13, 54. [Google Scholar] [CrossRef]
Lin, T. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002. [Google Scholar]
Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. Rtmdet: An empirical study of designing real-time object detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar]
Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]

Figure 1. Comparison between RT-DETR and the proposed USF-DETR method. The feature maps generated by USF-DETR (bottom row) exhibit sharper edges and richer details due to the SFIB and EA-AIFI modules. After multi-scale alignment fusion through the DABF-FPN Encoder, USF-DETR produces more accurate heatmaps, effectively highlighting small objects and improving detection results with fewer missed detections and false positives, as demonstrated by the red bounding boxes.

Figure 2. Architecture of the proposed USF-DETR, which includes three modules: SFIB, EA-AIFI, and DABF-FPN. The top part illustrates the pipeline of USF-DETR, while the bottom part presents the module flowchart.

Figure 3. The pipeline of SFIB consists of four stages, with each stage including a Conv layer and a SFI block. The SFI block, shown in the lower left figure, is connected across layers using the CSP concept; As depicted in the lower right image, the SFI extracts spatial and frequency domain features of the image and then fuses them.

Figure 4. The overall structure of the DABF-FPN integrates bidirectional feature fusion to enhance small object detection and outputs multi-scale features (P2, N3, N4, and N5) for further processing.

Figure 5. The structure of the DABF module involves high-level semantic features and low-level detailed features being adaptively processed to extract mutual representations. Two DABF blocks facilitate comprehensive information exchange and enhance feature fusion quality.

Figure 6. EA-AIFI module. (a) Through input embedding and positional encoding, combined with enhanced representation of contextual information, further internal feature interaction and optimization are carried out through a FFN. (b) Efficient Additive Attention eliminates key value interactions and relies solely on linear projections.

Figure 7. Bounding box distribution. (a) VisDrone2019-DET Dataset. (b) TinyPerson Dataset. The vertical axis represents the categories of annotated bounding boxes, while the horizontal axis depicts the square root of the bounding box area, measured in pixels.

Figure 8. Visualization of feature maps. (a) Input image. (b) Feature map generated without using the SFI module in the baseline model. (c) Feature map generated with the SFI module in USF-DETR.

Figure 9. Visualizing the detection results and heatmap on TinyPerson. The highlighted area represents the region of network attention, demonstrating the outstanding performance of USF-DETR in detecting small objects.

Figure 10. Detection results of the USF-DETR on the VisDrone dataset. Boxes of different colors represent different target categories.

Figure 11. A comparison of detection results between USF-DETR and the baseline model is presented. Green boxes indicate correct detections, blue boxes represent false positives, and red boxes denote missed detections.

Figure 12. A comparison of detection performance between the two methods. The first row represents USF-DETR, while the second row shows the baseline method. USF-DETR significantly reduces false positives (blue) and false negatives (red).

Figure 13. Comparison of detection performance between USF-DETR and popular methods. The yellow circle shows the outstanding detection effect of USF-DETR.

Table 1. Configuration of the experiment environments.

Environment	Parameter
Operating System	Red Hat 4.8.5-28
Programming Language	Python 3.11
Framework	Pytorch 2.1
CUDA	CUDA 12.1
GPU	NVIDIA A800
CPU	Intel_6338N_Xeon
VRAM	80 G

Table 2. Model training parameters.

Parameter	Value	Description
optimizer	AdamW	Combines Adam with weight decay to prevent overfitting.
base_learning_rate	0.0001	Initial step size for the optimizer.
weight_decay	0.0001	Regularization parameter to prevent overfitting by penalizing large weights.
global_gradient_clip_norm	0.1	Limits gradient norm to ensure stable training and prevent exploding gradients.
linear_warmup_steps	2000	Defines the maximum norm for gradient clipping during training.
minimum learning rate	0.00001	Defines the lower bound for the learning rate during training.
input image size	640 × 640	Specifies the height and width of input images.
epochs	300	The total passes through the entire training dataset.
batch size	4	The count of samples is handled in a single forward and backward pass.

Table 3. Ablation experiments on the VisDrone dataset.

SFIB	DABF-FPN	EA-AIFI	Params	GFLOPs	Precision	Recall	mAP_50	mAP_50:95
×	×	×	38.61	58.3	62.1	47.1	48.2	29.4
✓	×	×	39.81	62	62.8	47.7	49.2	30.5
✓	✓	×	42.39	106.9	62.9	50.0	51.9	32.7
✓	✓	✓	42.12	106.8	65	50.8	52.3	33.2

Note: “×” indicates that the corresponding module is not used, while “✓” denotes that the module is included in the model.

Table 4. Ablation experiments on the TinyPerson dataset.

SFIB	DABF-FPN	EA-AIFI	Params	GFLOPs	Precision	Recall	mAP_50	mAP_50:95
×	×	×	38.6	56.9	46.1	33.9	28.6	8.96
✓	×	×	39.8	60.6	48.5	34.6	29.7	9.04
✓	✓	×	42.1	103.5	46.4	35.9	30.4	9.5
✓	✓	✓	42.1	107.4	48.8	36.8	31.2	9.79

Note: “×” indicates that the corresponding module is not used, while “✓” denotes that the module is included in the model.

Table 5. Performance evaluation on the VisDrone dataset.

Method	BackBone	AP	AP₅₀	AP₇₅	AP_s	AP_m	AP_l	Params (M)	GFLOPs	FPS
Two-stage models:
Faster-RCNN [54]	ResNet-50	20.5	34.2	21.9	10.0	29.5	43.3	41.39	208	46.8
Cascade-RCNN [55]	ResNet-50	20.8	33.7	22.4	10.1	29.9	45.2	69.29	236	24.6
One-stage models:
TOOD [56]	ResNet-50	21.4	34.6	23.0	10.4	30.3	41.6	32.04	199	25.7
ATSS [57]	ResNet-50	21.6	34.6	23.1	10.2	30.8	45.8	38.91	110	35.4
RetinaNet [58]	ResNet-50	17.8	29.4	18.9	6.7	26.5	43.0	36.517	210	54.4
RTMDet [59]	CSPNeXt-tiny	18.4	31.2	21.3	7.7	28.8	44.5	4.876	8.033	52.1
YOLOX [60]	Darknet53	15.6	28.3	15.5	7.8	21.3	28.8	5.035	7.578	88.0
YOLOv8m	Darknet53	16.2	30.7	15.2	10.7	31.9	39.1	99.0	78.7	165.2
YOLOv11m	Darknet53	16.5	30.8	15.1	10.6	32.1	39.3	87.0	68.5	156.6
RT-DETR [14]	ResNet-18	19.83	31.3	7.1	10.4	30.2	39.0	38.61	57.0	147.9
USF-DETR	SFIB	22.10	34.7	8.9	12.3	34.1	38.1	42.12	103.6	80.4

Note: Bold values indicate the best result for each metric.

Table 6. Comparison of experimental results on the TinyPerson dataset.

Method	BackBone	AP	AP₅₀	AP₇₅	AP_s	AP_m	Params (M)	FPS
Two-stage models:
Faster-RCNN [54]	ResNet-50	4.1	12.2	1.8	3.7	36.9	41.39	59.7
Cascade-RCNN [55]	ResNet-50	4.5	12.7	2.1	3.9	37.5	69.29	39.6
One-stage models:
ATSS [57]	ResNet-50	4.0	13.9	1.2	3.8	21.8	38.91	46.6
RetinaNet [58]	ResNet-50	1.6	5.7	0.5	1.4	16.6	36.517	46.3
YOLOX [60]	ResNet-50	5.6	21.4	1.6	6.0	12.6	5.035	99.1
YOLOv8m	Darknet53	6.0	18.6	2.2	5.7	35.3	49.6	154.9
YOLOv11m	Darknet53	6.1	18.8	2.1	5.7	38.4	48.3	153.7
RT-DETR [14]	ResNet-18	6.9	21.8	3.0	6.6	37.4	38.6	133.8
USF-DETR	SFIB	8.3	25.4	3.2	8.1	37.9	42.1	72.6

Note: Bold values indicate the best result for each metric.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liu, J.; Wang, Y.; Cao, Y.; Guo, C.; Shi, P.; Li, P. Unified Spatial-Frequency Modeling and Alignment for Multi-Scale Small Object Detection. Symmetry 2025, 17, 242. https://doi.org/10.3390/sym17020242

AMA Style

Liu J, Wang Y, Cao Y, Guo C, Shi P, Li P. Unified Spatial-Frequency Modeling and Alignment for Multi-Scale Small Object Detection. Symmetry. 2025; 17(2):242. https://doi.org/10.3390/sym17020242

Chicago/Turabian Style

Liu, Jing, Ying Wang, Yanyan Cao, Chaoping Guo, Peijun Shi, and Pan Li. 2025. "Unified Spatial-Frequency Modeling and Alignment for Multi-Scale Small Object Detection" Symmetry 17, no. 2: 242. https://doi.org/10.3390/sym17020242

APA Style

Liu, J., Wang, Y., Cao, Y., Guo, C., Shi, P., & Li, P. (2025). Unified Spatial-Frequency Modeling and Alignment for Multi-Scale Small Object Detection. Symmetry, 17(2), 242. https://doi.org/10.3390/sym17020242

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Unified Spatial-Frequency Modeling and Alignment for Multi-Scale Small Object Detection

Abstract

1. Introduction

2. Related Work

2.1. Transformer-Based Detection Methods

2.2. Small Object Detection Methods in Aerial Images

3. Methodologies

3.1. Overview

3.2. Spatial-Frequency Interaction Backbone (SFIB)

3.2.1. Module Structure

3.2.2. SFI Block

3.3. Dual Alignment and Balance Fusion FPN (DABF-FPN)

3.3.1. Overall Structure

3.3.2. Dual Alignment and Balance Fusion (DABF)

3.4. Efficient Attention-AIFI (EA-AIFI)

4. Experiment

4.1. DataSet

4.2. Evaluation Metrics and Environment

4.2.1. Evaluation Metrics

4.2.2. Experimental Environment

4.3. Ablation Study

4.4. Comparative Experiment

4.4.1. VisDrone

4.4.2. TinyPerson

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI