Article

USD-YOLO: An Enhanced YOLO Algorithm for Small Object Detection in Unmanned Systems Perception

Hongqiang Deng, Shuzhe Zhang, Xiaodong Wang, Tianxin Han and Yun Ye
1 College of Mechanical Engineering and Automation, Northeastern University, Shenyang 110819, China
2 Houston International Institute, Dalian Maritime University, Dalian 116026, China
3 Faculty of Maritime and Transportation, Ningbo University, Ningbo 315211, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 3795; https://doi.org/10.3390/app15073795
Submission received: 8 March 2025 / Revised: 27 March 2025 / Accepted: 28 March 2025 / Published: 30 March 2025
(This article belongs to the Special Issue Advanced Pattern Recognition & Computer Vision)

Abstract

In the perception of unmanned systems, small object detection faces numerous challenges, including small size, low resolution, dense distribution, and occlusion, leading to suboptimal perception performance. To address these issues, we propose a specialized algorithm named Unmanned-system Small-object Detection-You Only Look Once (USD-YOLO). First, we designed an innovative module called the Anchor-Free Precision Enhancer to achieve more accurate bounding box overlap measurements and provide a smarter processing mechanism, thereby improving the localization accuracy of candidate boxes for small and densely distributed objects. Second, we introduced the Spatial and Channel Reconstruction Convolution module to reduce redundancy in spatial and channel features while extracting key features of small objects. Additionally, we designed a novel C2f-Global Attention Mechanism module to expand the receptive field and capture more contextual information, optimizing the detection head’s ability to handle small and low-resolution objects. We conducted extensive experimental comparisons with state-of-the-art models on three mainstream unmanned system datasets and a real unmanned ground vehicle. The experimental results demonstrate that USD-YOLO achieves higher detection precision and faster speed. On the Citypersons dataset, compared with the baseline, USD-YOLO improves mAP50-95, mAP50, and Recall by 8.5%, 5.9%, and 2.3%, respectively. Additionally, on the FloW-Img and DOTA-v1.0 datasets, USD-YOLO improves mAP50-95 by 2.5% on each.

1. Introduction

Object detection technology is crucial in the perception field of unmanned systems. It has been widely applied across various platforms, including unmanned ground vehicles (UGVs), unmanned aerial vehicles (UAVs), and unmanned surface vehicles (USVs) [1,2,3]. The detections in camera or radar images can provide necessary assistance for unmanned system perception [4]. Currently, small object detection in unmanned systems faces significant challenges due to the limitations of visual sensor performance and the complexity of real-world scenarios. For example, UGVs identify distant pedestrians and traffic signs [5], USVs detect small boats and floating objects on water surfaces [6], and UAVs recognize objects in aerial images [7].
These challenges arise from the hardware limitations of visual sensors, which cause objects to appear small or low-resolution in images, as well as from complex scenarios that lead to densely distributed or occluded objects [8]. For general object detection algorithms in existing research, on the one hand, the small size of objects results in their features being distorted or lost during downsampling operations, while low resolution causes a mismatch between the receptive field and the size of small objects [8], leading to lower detection precision for small objects compared with larger ones. On the other hand, dense distributions and occlusions interfere with bounding boxes, causing significant fluctuations in the Intersection over Union (IoU) metric, further reducing the detection precision of small objects [9]. These issues weaken the perception capabilities of unmanned systems, directly limiting their overall performance and application scope [10]. Therefore, improving detection precision and candidate box precision for small objects in resource-constrained unmanned systems has become an urgent problem.
To address this, a novel USD-YOLO algorithm is proposed to enhance detection precision. The framework of this study is illustrated in Figure 1, and the main contributions are summarized as follows:
  • In the post-processing, an innovative Anchor-Free Precision Enhancer (APE) module was designed. It dynamically adjusts box scores based on Enhanced IoU (EIoU). Unlike traditional Non-Maximum Suppression (NMS) or Soft-NMS, APE significantly improves precision for dense and small objects.
  • In the backbone network, the Spatial and Channel Reconstruction Convolution (SCConv) module was introduced. Compared with conventional convolution methods, SCConv reduces both spatial and channel redundancy in Convolutional Neural Network (CNN) features, enhancing feature extraction for small objects.
  • In the neck section, a novel C2f-Global Attention Mechanism (CGAM) module was developed. By expanding the receptive field and capturing more contextual information, CGAM outperforms existing neck designs such as PANet and BiFPN in small object detection.
To demonstrate the significance of these contributions, experiments were conducted on three small object datasets commonly used in unmanned systems studies. Compared with the baseline YOLOv8 on the Citypersons dataset, USD-YOLO achieved improvements of 8.5%, 5.9%, and 2.3% in mAP50-95, mAP50, and Recall, respectively. On the FloW-Img and DOTA-v1.0 datasets, USD-YOLO improved mAP50-95 by 2.5% on both. In addition, a visualization-based comparison of results and a real UGV deployment experiment further validate the superiority and generalization capability of USD-YOLO. While the proposed enhancements significantly improve small object detection precision, they come at the cost of a slight reduction in computational speed. Compared with the original YOLOv8 baseline, the frame rate decreases by 3.5, 1.4, and 6.2 Frames Per Second (FPS) on the Citypersons, FloW-Img, and DOTA-v1.0 datasets, respectively. However, this trade-off remains acceptable for detection tasks in unmanned systems.
The remainder of this paper is organized as follows. Section 2 reviews related studies on object enhancement methods and detection methods. Section 3 provides methodology details of the proposed USD-YOLO algorithm. Section 4 presents the preparation of the experiments, and Section 5 interprets the experimental results. Finally, Section 6 draws conclusions.

2. Related Works

2.1. Object Enhancement Methods

In visual perception tasks for unmanned systems, small objects often occupy only a minimal portion of the image [10]. Their feature responses are prone to information attenuation in deep networks.
Feature enhancement technology can significantly improve the detection precision and localization of small objects. Lin et al. [11] proposed the Feature Pyramid Network (FPN), which introduced an innovative top-down multi-scale feature fusion paradigm. By integrating high-resolution shallow features (rich in detail) with low-resolution deep features (rich in semantic information), FPN significantly enhanced the detection sensitivity for small objects. Subsequent researchers have proposed various improvements to FPN, such as PANet [12], NAS-FPN [13], and BiFPN [14]. These methods enhance multi-scale feature representation but often increase computational complexity. To address the limited feature representation capability for small objects, Cheng et al. [15] introduced the Dual Attention-guided Fusion mechanism, which refines feature maps through Channel and Spatial Attention Modules. While effective, these attention mechanisms can be computationally expensive. Wang et al. [16] proposed the Non-local Neural Network, which constructs a global context modeling framework by computing long-range dependencies between spatial pixels. This approach improves localization precision but may struggle with real-time performance.
In addition to feature enhancement techniques, data augmentation and balancing techniques have also been explored to improve the performance of small object detection. For instance, ALADA, a lightweight automatic data augmentation framework, optimizes augmentation policies and detection networks simultaneously. It formulates a compact search space and a three-step bi-level optimization algorithm to address imbalanced sampling and redundant hyperparameter issues in search-based AutoDA methods [17]. Similarly, Balanced Weighted Extreme Learning Machine techniques have been proposed to handle imbalanced learning scenarios [18]. These techniques adjust the weights of minority and majority classes to ensure balanced learning, thereby enhancing the generalization performance of the detection model. While effective, they may require additional training time.

2.2. Object Detection Methods

Based on the feature extraction methods used, object detection algorithms can be classified into traditional detectors and deep learning-based detectors. Traditional detectors primarily rely on handcrafted features, exhibiting weak generalization and poor performance in complex scenes. In contrast, deep learning-based detectors leverage convolutional neural networks to automatically learn features, offering strong feature representation and generalization capabilities, thus achieving superior performance in complex scenarios. Currently, deep learning-based detectors can be further categorized into two-stage detectors and single-stage detectors based on their image-processing approaches [19].
Two-stage detectors follow a process from coarse detection to refined detection. In 2015, Ren et al. proposed Faster R-CNN [42], which achieves high detection precision but is computationally expensive and slow. Single-stage detectors extract features only once and can search for all objects in a single inference step. Redmon et al. [21] proposed the first single-stage detector, YOLO, while Liu et al. [22] proposed the Single Shot MultiBox Detector (SSD). These models significantly improve detection speed but struggle with low precision for small objects. At present, the YOLO series has developed rapidly, with continuous improvements in precision, making it a widely used algorithm for small object detection. Considering the limited computing resources of unmanned systems, we choose YOLOv8s as the baseline model to balance detection precision and speed.
Recent studies have introduced improved object detection algorithms to address the challenges of small object detection. For instance, Wang et al. [23] proposed FE-YOLO, which integrates deformable convolutions in the neck of YOLO to fuse high- and low-level feature maps, aiming to reduce the semantic gap caused by top-down connections. However, this approach significantly increases the number of parameters. Shen et al. [24] proposed CA-YOLO, embedding a coordinate attention module to suppress redundant background information, though at the cost of additional parameters. Zhu et al. [25] introduced TPH-YOLO, incorporating a Transformer prediction head to improve small object detection precision, but this substantially increases parameters and reduces inference speed. Sunkara et al. [26] applied a Spatial-to-Depth Convolution (SPD-Conv) module in YOLO to enhance depth information in feature maps. However, the rearrangement of feature maps leads to information loss, reducing detection precision for partially occluded small objects. Thus, small object detection algorithms still face challenges in maintaining real-time performance while improving detection precision on low-computation-cost devices. To address this, we designed USD-YOLO, integrating APE, SCConv, and CGAM modules to improve detection precision while maintaining speed without increasing the number of parameters.

3. Methodology

3.1. Anchor-Free Precision Enhancer

IoU is a metric used to calculate the ratio of the intersection to the union between a candidate box and a target box in YOLO [27]. Its mathematical expression is given in Equation (1), with A and B representing the areas of two detection boxes. A higher IoU value indicates greater overlap between the two boxes. However, IoU only measures area overlap between the predicted and target boxes without considering factors such as the distance between their centers or size differences.
$\mathrm{IoU} = \dfrac{A \cap B}{A \cup B}$ (1)
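As a reference point, Equation (1) for two axis-aligned boxes in (x1, y1, x2, y2) format can be written as a minimal Python helper (an illustrative sketch, not code taken from the authors):

```python
def iou(box_a, box_b):
    """Equation (1): intersection area divided by union area for two (x1, y1, x2, y2) boxes."""
    # Intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = area(A) + area(B) - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)
```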
$s_i = \begin{cases} s_i, & \mathrm{IoU}(M, b_i) < N_t \\ 0, & \mathrm{IoU}(M, b_i) \geq N_t \end{cases}$ (2)
NMS is a post-processing technique in YOLO that filters and retains candidate boxes with higher confidence scores based on IoU values while eliminating those that overlap significantly with the retained boxes but have lower confidence scores [28]. In Equation (2), M and b_i represent the candidate box with the highest confidence score and the current candidate box, respectively. Candidate boxes with IoU values exceeding the threshold N_t have their scores set to 0 and are subsequently removed, while those with IoU values below N_t retain their scores s_i. However, NMS applies a fixed threshold N_t, which is not suitable for all scenarios. In real-world object detection scenarios, densely distributed small objects can lead to high overlap and high IoU values, causing important candidate boxes to be mistakenly removed due to occlusion. Conversely, increasing the threshold N_t may result in a large number of false detections in some cases.
Soft-NMS extends NMS by introducing a penalty decay mechanism, incorporating a Gaussian decay function to mitigate threshold-related issues and reduce false removals or false detections [29]. However, since IoU only measures area overlap without accounting for distance or scale relationships between target boxes, Soft-NMS, which relies on IoU calculations, may still lead to false removals or false detections in certain cases.
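The difference between hard NMS (Equation (2)) and the Gaussian decay used by Soft-NMS comes down to how the score of an overlapping candidate is updated; a minimal sketch follows (the threshold and the decay parameter sigma are illustrative placeholders):

```python
import math

def nms_update(score, iou_value, iou_threshold=0.5):
    """Hard NMS, Equation (2): zero the score once the overlap with M exceeds the threshold."""
    return 0.0 if iou_value >= iou_threshold else score

def soft_nms_update(score, iou_value, sigma=0.5):
    """Soft-NMS with Gaussian decay: shrink the score smoothly as the overlap with M grows."""
    return score * math.exp(-(iou_value ** 2) / sigma)
```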
To address the limitations of NMS and IoU, we designed the APE module. APE integrates EIoU for more precise IoU calculations and enhances Soft-NMS to improve detection performance for high-density small objects. EIoU is an advanced IoU calculation method that incorporates additional factors [30]. It optimizes IoU calculation by considering three additional aspects: the center distance loss, the size difference between boxes A and B, and the size of the minimum enclosing box. This leads to more accurate IoU calculations for target detection boxes. The mathematical expression of EIoU is given in Equation (3), where $\rho(a, b) = \lVert a - b \rVert_2$ denotes the Euclidean distance, and the remaining parameters are illustrated in Figure 2.
$\mathrm{EIoU} = \mathrm{IoU} - \dfrac{\rho^2(b, b^{gt})}{w_c^2 + h_c^2} - \dfrac{\rho^2(w, w^{gt})}{w_c^2} - \dfrac{\rho^2(h, h^{gt})}{h_c^2}$ (3)
Algorithm 1 presents the complete flowchart of the APE method. The algorithm first initializes an empty set D to store the final detection boxes. While the candidate box set B is not empty, the box M with the highest score in S is selected and added to D. The mathematical expression of s_i in S is given in Equation (4). Next, for each remaining box in B, its score is dynamically adjusted based on its EIoU value with M rather than being directly removed. Finally, the algorithm returns the filtered detection box set D and the updated scores S. By incorporating a Gaussian decay function, APE minimizes false removals while improving the screening and localization accuracy of candidate boxes. Through the integration of EIoU and the Gaussian decay function, APE enables more precise calculation of candidate box scores s_i, providing a more robust criterion for determining whether to retain or remove boxes. This effectively reduces both false detections and removals. Consequently, APE significantly improves the detection precision for small objects in densely distributed and occluded scenarios, mitigating the false removal or false detection issues commonly associated with traditional NMS.
$s_i = s_i \, e^{-\frac{\mathrm{EIoU}(M, b_i)^2}{\sigma}}, \quad \forall b_i \notin D$ (4)
Algorithm 1: The flowchart of the APE method
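Based on the description above, Equations (3) and (4), and Algorithm 1, the APE post-processing can be sketched as follows. This is a simplified illustration, not the authors' released implementation; the box format, the decay parameter sigma, the final score threshold, and the choice to decay only overlapping boxes are assumptions of the sketch.

```python
import math

def eiou(box_a, box_b):
    """Equation (3): IoU minus center-distance, width, and height penalty terms.
    Boxes are (x1, y1, x2, y2)."""
    # Plain IoU
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    iou_val = inter / (area_a + area_b - inter + 1e-9)
    # Widths, heights, and centers of the two boxes
    wa, ha = box_a[2] - box_a[0], box_a[3] - box_a[1]
    wb, hb = box_b[2] - box_b[0], box_b[3] - box_b[1]
    cxa, cya = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cxb, cyb = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    # Width and height of the minimum enclosing box
    wc = max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])
    hc = max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])
    return (iou_val
            - ((cxa - cxb) ** 2 + (cya - cyb) ** 2) / (wc ** 2 + hc ** 2 + 1e-9)
            - (wa - wb) ** 2 / (wc ** 2 + 1e-9)
            - (ha - hb) ** 2 / (hc ** 2 + 1e-9))

def ape(boxes, scores, sigma=0.5, score_threshold=0.001):
    """APE: Soft-NMS-style suppression driven by EIoU with Gaussian score decay (Equation (4))."""
    remaining = list(zip(boxes, scores))
    kept = []                                    # the set D of final detections
    while remaining:
        m_idx = max(range(len(remaining)), key=lambda i: remaining[i][1])
        m_box, m_score = remaining.pop(m_idx)    # box M with the highest score
        kept.append((m_box, m_score))
        updated = []
        for box, score in remaining:
            e = eiou(m_box, box)
            if e > 0:                            # decay only boxes that overlap with M (sketch assumption)
                score *= math.exp(-(e ** 2) / sigma)
            if score > score_threshold:          # discard boxes whose score has decayed away
                updated.append((box, score))
        remaining = updated
    return kept
```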

3.2. Spatial and Channel Reconstruction Convolution

CNNs are widely used in object detection but often require high computational resources. Researchers frequently explore model compression techniques or develop lightweight network models to reduce computational costs. However, these methods may lead to significant precision degradation or feature redundancy. SCConv is a novel method that exploits spatial and channel feature redundancy to compress CNNs while preserving key features of small objects, thereby improving precision while reducing model complexity [31]. SCConv consists of a Spatial Reconstruction Unit (SRU) and a Channel Reconstruction Unit (CRU), as illustrated in Figure 3.
The Spatial Reconstruction Unit (SRU) performs separation and reconstruction operations. The separation operation isolates feature maps containing meaningful information from those with lower spatial relevance. The reconstruction operation then integrates information-rich and information-poor features to form a comprehensive feature set while optimizing spatial resource utilization.
The Channel Reconstruction Unit (CRU) applies split, transform, and fuse operations. The split operation divides spatially refined features into two channel groups and compresses them using convolution. The transform operation employs efficient convolution operations to extract key representative information. Finally, the fuse operation merges the output features from the upper and lower transformations.
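A simplified PyTorch-style sketch of the separation-reconstruction and split-transform-fuse ideas described above is given below. It is an illustration based on this description only, not the reference SCConv implementation: the GroupNorm-based gating, the cross-channel reconstruction, and the averaged fusion are simplifications, and the hyperparameters mirror the α = 1/2, r = 2, g = 2 setting used later in the text.

```python
import torch
import torch.nn as nn

class SRU(nn.Module):
    """Spatial Reconstruction Unit (simplified): separate informative and less-informative
    responses with a learned gate, then recombine them by cross-adding channel halves.
    Assumes an even channel count divisible by `groups`."""
    def __init__(self, channels, groups=4):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)

    def forward(self, x):
        xn = self.gn(x)
        # Normalized GroupNorm scale factors act as per-channel informativeness weights
        w = (self.gn.weight / self.gn.weight.sum()).view(1, -1, 1, 1)
        gate = torch.sigmoid(w * xn)
        info, noninfo = gate * x, (1.0 - gate) * x            # separation
        a1, a2 = torch.chunk(info, 2, dim=1)
        b1, b2 = torch.chunk(noninfo, 2, dim=1)
        return torch.cat([a1 + b2, a2 + b1], dim=1)           # reconstruction

class CRU(nn.Module):
    """Channel Reconstruction Unit (simplified): split channels, squeeze with 1x1 convolutions,
    transform with group-wise and point-wise convolutions, and fuse the two branches."""
    def __init__(self, channels, alpha=0.5, r=2, g=2, k=3):
        super().__init__()
        up_c = int(alpha * channels)
        low_c = channels - up_c
        self.up_c, self.low_c = up_c, low_c
        self.squeeze_up = nn.Conv2d(up_c, up_c // r, 1, bias=False)
        self.squeeze_low = nn.Conv2d(low_c, low_c // r, 1, bias=False)
        self.gwc = nn.Conv2d(up_c // r, channels, k, padding=k // 2, groups=g, bias=False)
        self.pwc_up = nn.Conv2d(up_c // r, channels, 1, bias=False)
        self.pwc_low = nn.Conv2d(low_c // r, channels, 1, bias=False)

    def forward(self, x):
        xu, xl = torch.split(x, [self.up_c, self.low_c], dim=1)   # split
        xu, xl = self.squeeze_up(xu), self.squeeze_low(xl)
        y1 = self.gwc(xu) + self.pwc_up(xu)                       # transform (upper branch)
        y2 = self.pwc_low(xl)                                     # transform (lower branch)
        return 0.5 * (y1 + y2)                                    # fuse (simplified average)

class SCConvSketch(nn.Module):
    """SCConv-style block: SRU followed by CRU."""
    def __init__(self, channels):
        super().__init__()
        self.sru, self.cru = SRU(channels), CRU(channels)

    def forward(self, x):
        return self.cru(self.sru(x))

# Quick shape check: the block preserves the feature map shape
x = torch.randn(1, 64, 40, 40)
print(SCConvSketch(64)(x).shape)   # torch.Size([1, 64, 40, 40])
```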
In general, the parameters of a standard convolution can be calculated as follows:
$P_{\mathrm{Conv}} = k^2 C_1 C_2$
where k is the kernel size of the convolution, and C_1 and C_2 are the numbers of input and output feature channels.
The parameters of the SCConv consist of
$P_{\mathrm{SCConv}} = 1 \times 1 \times \alpha C_1 + k \times k \times \dfrac{\alpha C_1}{g r} \times \dfrac{C_2}{g} \times g + 1 \times 1 \times \dfrac{\alpha C_1}{r} \times C_2 + (1-\alpha) C_1 \times \dfrac{(1-\alpha) C_1}{r} + 1 \times 1 \times \dfrac{(1-\alpha) C_1}{r} \times \left( C_2 - \dfrac{1-\alpha}{r} C_1 \right)$
where α denotes the split ratio, r refers to the squeeze ratio, and g is the group size of the group-wise convolution operation. In the experiments, α, r, and g are set to 1/2, 2, and 2, respectively [31]. Then, the parameters of the SCConv can be simplified as follows:
$P_{\mathrm{SCConv}} = \dfrac{1}{8}\left[ C_1^2 + (k^2 + 4) C_1 C_2 \right]$
In a typical scenario, when k = 3 and C_1 = C_2 = C, the parameter counts are $P_{\mathrm{Conv}} = 9C^2$ and $P_{\mathrm{SCConv}} = 1.75C^2$ [31]. Consequently, SCConv significantly reduces the number of parameters, to approximately one-fifth of that of a standard convolution, highlighting its lightweight architecture.
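A quick numerical check of the two parameter-count expressions above, using k = 3 and C1 = C2 = C:

```python
def conv_params(k, c1, c2):
    # Standard convolution: k^2 * C1 * C2
    return k * k * c1 * c2

def scconv_params_simplified(k, c1, c2):
    # Simplified SCConv count: (1/8) * [C1^2 + (k^2 + 4) * C1 * C2]
    return (c1 ** 2 + (k ** 2 + 4) * c1 * c2) / 8

C = 256
print(conv_params(3, C, C) / C ** 2)               # 9.0  -> 9 C^2
print(scconv_params_simplified(3, C, C) / C ** 2)  # 1.75 -> 1.75 C^2, roughly one-fifth of 9 C^2
```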

3.3. C2f-Global Attention Mechanism

In recent years, attention mechanisms have been widely applied in object detection to improve detection precision. However, previous methods have overlooked the importance of preserving information across both channel and spatial dimensions to enhance cross-dimensional interactions. The Global Attention Mechanism (GAM) expands the receptive field, mitigates information diffusion, enhances global interaction representations, and improves the performance of deep neural networks [32].
As shown in Figure 4, GAM combines channel and spatial attention to capture important features across three dimensions, increasing the receptive field to extract more contextual information. The channel attention sub-module employs 3D permutation to retain 3D information and a two-layer MLP to strengthen cross-dimensional channel-spatial dependencies. The spatial attention sub-module applies two convolution layers to integrate and refine spatial information.
The original YOLOv8 uses the C2f module in its neck. C2f is a feature fusion module based on the Cross-Stage-Partial structure, designed to enable efficient information propagation and fusion between feature maps of different scales through partial connections and cross-stage feature reuse. Figure 5 illustrates the schematic diagram of the CGAM module. By integrating the traditional C2f module with GAM, the CGAM module is constructed to enhance the detection head’s ability to process small-sized and low-resolution object features.
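A minimal PyTorch-style sketch of a GAM-style attention block and its combination with a C2f-type fusion block is shown below. It follows the description above; the reduction ratio, the 7 × 7 kernels, and the generic fusion_block argument are assumptions rather than the authors' exact implementation (in USD-YOLO the fusion block is the YOLOv8 C2f module).

```python
import torch
import torch.nn as nn

class GAM(nn.Module):
    """Global attention: channel attention via permutation + a two-layer MLP,
    followed by spatial attention via two convolution layers."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = channels // reduction
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, channels)
        )
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=7, padding=3),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # Channel attention: permute to (B, H, W, C), apply the MLP, permute back
        att = self.channel_mlp(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = x * torch.sigmoid(att)
        # Spatial attention: two convolution layers produce a per-position gate
        return x * torch.sigmoid(self.spatial(x))

class CGAM(nn.Module):
    """CGAM = a C2f-style fusion block followed by GAM; any module that preserves
    the channel count can stand in for the fusion block."""
    def __init__(self, fusion_block, channels):
        super().__init__()
        self.fusion, self.gam = fusion_block, GAM(channels)

    def forward(self, x):
        return self.gam(self.fusion(x))

# Quick shape check with an identity stand-in for C2f
x = torch.randn(1, 128, 20, 20)
print(CGAM(nn.Identity(), 128)(x).shape)   # torch.Size([1, 128, 20, 20])
```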

3.4. Architecture of USD-YOLO

The architecture of the USD-YOLO algorithm is shown in Figure 6. Compared with the baseline model YOLOv8, USD-YOLO integrates the three designed modules SCConv, CGAM, and APE into the backbone, neck, and post-processing parts of the network, respectively.
The backbone is responsible for extracting multi-scale features from the input image; USD-YOLO introduces the SCConv module into the backbone. The image first passes through a series of Conv and C2f modules for initial feature extraction. The SCConv module is inserted between the C2f module and the Spatial Pyramid Pooling-Fast (SPPF) module. SCConv reduces redundancy in channel and spatial features, filtering out key features of small objects in the image and passing them to the SPPF module. After the SPPF module, the backbone outputs multi-scale feature maps (P3, P4, P5) for further processing in the neck part.
In the neck, USD-YOLO replaces the original C2f module with the CGAM module. The neck upsamples and concatenates the feature maps output by the backbone to fuse features at different scales. After upsampling and concatenation, the CGAM module performs convolution operations on the features and employs a global attention mechanism to expand the receptive field, providing more contextual information. After processing by the CGAM module, the neck outputs enhanced feature maps for detection in the head part.
The post-processing step filters the original predictions from the head and removes redundant detections. USD-YOLO introduces the APE module in the post-processing stage after the head. The head receives the feature maps (P3, P4, P5) output by the neck and generates candidate detection boxes along with their confidence scores. After the candidate boxes are generated, the APE module improves on traditional IoU and NMS. APE provides more accurate candidate box overlap measurements and a smarter processing mechanism, enhancing the precision of candidate boxes. After processing by the APE module, the final detections show significantly improved precision for densely distributed and occluded small objects.
By integrating the SCConv, CGAM, and APE modules into the backbone, neck, and post-processing parts, USD-YOLO significantly improves the detection precision of small objects.

4. Experiments Preparation

4.1. Dataset

We selected three popular datasets in the field of unmanned systems, each containing a large number of small objects. These datasets are Citypersons, FloW-Img, and DOTA-v1.0, which are publicly available and widely used in unmanned systems research [33,34,35,36]. The ablation, comparison, and generalization experiments in Section 5 refer to deep learning model evaluations performed on benchmark datasets rather than physical-world experiments.
  • Citypersons Dataset
    This dataset is a publicly available dataset re-annotated from the Cityscapes dataset by a team from Nanjing University, China [37]. It focuses on detecting small pedestrian objects captured by ground vehicles. The dataset contains 2975 training images, 500 validation images, and 1575 test images, all with a resolution of 1280 × 640. This dataset emphasizes small objects with dense distributions and occlusions, which are common in UGVs.
  • FloW-Img Dataset
    This dataset is provided by researchers from Tsinghua University and Northwestern Polytechnical University, China [38]. It aims to detect small floating bottle objects captured by USVs. The dataset contains 1200 training images and 800 test images, all with a resolution of 1280 × 720. In this study, the original dataset is randomly divided into training, validation, and test sets in a 6:2:2 ratio. This dataset focuses on small objects with small sizes and low resolutions, which are typical challenges for USVs.
  • DOTA-v1.0 Dataset
    This dataset is provided by researchers from Wuhan University, China [39]. It contains a large number of extremely small objects captured by various sensors and platforms in aerial images. Before experimentation, the original dataset requires preprocessing. In this study, the original dataset is processed into images with a resolution of 1024 × 1024 and randomly divided into training, validation, and test sets in a 6:2:2 ratio. This dataset emphasizes diverse and extremely small objects, which are common challenges for UAVs.

4.2. Evaluation Metrics

In the experiments, the USD-YOLO algorithm is compared with state-of-the-art (SOTA) algorithms based on classic evaluation metrics. These metrics are widely used in object detection research [40,41].
Precision (P) measures the ratio of correctly detected objects to all positive (true positive and false positive) detection results. It is calculated as follows, where TP and FP represent true positives and false positives:
$P = \dfrac{TP}{TP + FP}$
Recall (R) measures the probability of detecting all positive samples. It is calculated as follows, where FN represents false negatives:
$R = \dfrac{TP}{TP + FN}$
Average Precision (AP) is the weighted average of precision (P) at different recall levels. It is calculated as follows:
$AP = \int_0^1 P(R)\, \mathrm{d}R$
mean Average Precision (mAP) is the average of the average precision (AP) across all categories. It is commonly used to evaluate the overall performance of multi-class object detection models. It is calculated as follows:
$mAP = \dfrac{1}{n} \sum_{i=1}^{n} AP(i)$
The mAP calculated at IoU = 0.5 and IoU = [0.5, 0.95] is referred to as mAP50 and mAP50-95, respectively. In comparison, mAP50-95 represents the average mAP across IoU thresholds ranging from 0.5 to 0.95 (with a step size of 0.05). This metric provides a stricter reflection of detection precision, as even minor deviations between candidate boxes and target boxes are penalized.
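The metric definitions above can be expressed as small numerical helpers (an illustrative sketch, independent of any particular detection framework; AP is approximated here by trapezoidal integration over sampled precision-recall points):

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and Recall from true positive, false positive, and false negative counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall_pts, precision_pts):
    """AP: area under the precision-recall curve, via trapezoidal integration."""
    r = np.asarray(recall_pts, dtype=float)
    p = np.asarray(precision_pts, dtype=float)
    order = np.argsort(r)
    r, p = r[order], p[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

def mean_ap(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return float(np.mean(ap_per_class))

# mAP50-95 averages the mAP over IoU thresholds 0.50, 0.55, ..., 0.95
iou_thresholds = np.arange(0.50, 1.00, 0.05)
```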

4.3. Experimental Setting

The hardware and software platforms used in the experiments, as well as the training parameter settings, are detailed in Table 1 and Table 2, respectively.

5. Experiments

5.1. Ablation Experiment

To validate the effectiveness of each improved module, ablation experiments were conducted on the Citypersons dataset using the experimental settings above. The results of the ablation experiments on the Citypersons dataset are presented in Table 3.
Based on the results, the introduction of the SCConv module successfully focused on key features, improving Recall, mAP50, and mAP50-95. Applying the CGAM module to the neck of YOLOv8s increased the receptive field size, further enhancing the mAP50-95 value. The use of the self-designed APE module in the post-processing stage significantly improved the accuracy of candidate boxes for dense and small objects, leading to notable increases in Recall, mAP50, and mAP50-95.
Additionally, compared with using a single improved module, combining YOLOv8s with any two modules significantly improved detection precision, especially when the self-designed APE module was included, where the improvement was particularly pronounced. Furthermore, the algorithm combining the SCConv, CGAM, and APE modules—namely, the proposed USD-YOLO algorithm—outperformed the configurations using these modules individually or in any two-module combination. Compared with the baseline YOLOv8s algorithm, the USD-YOLO algorithm achieved significant improvements of 2.3%, 5.9%, and 8.5% in Recall, mAP50, and mAP50-95, respectively.
According to the calculation principle of the mAP metric, mAP50-95 provides a stricter measure of precision compared with mAP50. From the ablation experiment results, the USD-YOLO algorithm, which combines SCConv, CGAM, and APE modules, showed the most significant improvement in mAP50-95, reaching 8.5%. This indicates that the USD-YOLO algorithm has excellent capabilities for small object detection and candidate box localization.

5.2. Comparison Experiment

On the Citypersons dataset, we compared the USD-YOLO algorithm with SOTA algorithms, including mainstream YOLO series algorithms (YOLOv5, YOLOv7, YOLOv8, YOLOv9, and YOLOv10), YOLO-based improved small object detection algorithms (TPH-YOLO, SPD-YOLO), two-stage algorithms (Faster R-CNN), and single-stage algorithms outside the YOLO series (SSD). The YOLOv5, TPH-YOLO, YOLOv7, YOLOv8, YOLOv9, YOLOv10, Faster R-CNN, and SSD algorithms were obtained from the open-source platform GitHub. The SPD-YOLO algorithm was sourced from reference [26].
As depicted in Table 4, compared with other SOTA algorithms, the USD-YOLO algorithm significantly outperforms them in terms of the Recall, mAP50, and mAP50-95 metrics while maintaining a fast speed.
In terms of precision, USD-YOLO achieves significant improvements of 6.8–7.9% in mAP50 and 8.9–10.8% in mAP50-95 compared with the small object detection algorithms improved by other researchers (TPH-YOLO, SPD-YOLO). Compared with the baseline, USD-YOLO improves mAP50-95, mAP50, and Recall by 8.5%, 5.9%, and 2.3%, respectively. USD-YOLO is also clearly superior to the other SOTA algorithms, including the YOLO series detectors, the single-stage detector SSD, and the two-stage detector Faster R-CNN. In terms of network parameters and GFLOPs, USD-YOLO has fewer parameters than YOLOv8m, SPD-YOLO, and SSD, and significantly fewer than Faster R-CNN. This indicates that USD-YOLO has a relatively lightweight structure, requiring fewer computational resources. In terms of FPS, USD-YOLO outperforms most algorithms, although it is 3.5 FPS lower than the baseline YOLOv8s. This suggests that USD-YOLO sacrifices a small amount of speed after the improvements, but it remains fast in the field of object detection.
These advantages demonstrate that USD-YOLO significantly improves the precision of small object detection while maintaining a fast detection speed. Unmanned systems demand high precision and efficient utilization of limited computational resources. Consequently, USD-YOLO is well-suited for deployment on unmanned devices, meeting these critical requirements. However, based on the comparison experiments, the two-stage detector Faster R-CNN exhibits significant drawbacks, including a large number of parameters, high computational complexity, and extremely low detection speed. These limitations make it entirely unsuitable for the detection requirements of unmanned systems. Therefore, Faster R-CNN will not be included in subsequent generalization experiments.

5.3. Generalization Experiment

5.3.1. FloW-Img Dataset

First, generalization experiments were conducted on the FloW-Img dataset by comparing the proposed algorithm with SOTA algorithms to validate the generalization capability of USD-YOLO across different datasets. The results of the generalization experiments on the FloW-Img dataset are presented in Table 5.
Compared with other SOTA algorithms, USD-YOLO achieved the highest values in precision metrics such as Recall, mAP50, and mAP50-95 on the FloW-Img dataset, demonstrating its superior detection precision. Additionally, USD-YOLO outperformed most algorithms in terms of FPS, indicating its fast detection speed. Compared with the baseline model YOLOv8s, USD-YOLO improved small object detection precision with only a slight sacrifice in speed, meeting the requirements of unmanned systems for both precision and speed.
Therefore, the results of the generalization experiments demonstrate that USD-YOLO exhibits excellent generalization performance, maintaining robust detection capabilities and reliability across other datasets.

5.3.2. DOTA-v1.0 Dataset

To further enhance credibility, generalization experiments were conducted on the popular DOTA-v1.0 dataset by comparing USD-YOLO with SOTA algorithms. The results of the generalization experiments on the DOTA-v1.0 dataset are presented in Table 6.
Compared with SOTA algorithms, USD-YOLO achieved the highest values in metrics such as Recall, mAP50, and mAP50-95 on the DOTA-v1.0 dataset, demonstrating its superior detection precision. Additionally, USD-YOLO surpassed most algorithms in terms of FPS, further indicating its fast detection speed. These results show that USD-YOLO can meet the computational resource constraints of unmanned systems while significantly improving the detection precision of small objects and ensuring real-time detection speed.
Therefore, the experiments on the DOTA-v1.0 dataset further validate that USD-YOLO possesses excellent generalization capabilities.

5.4. Visualization of Results

To demonstrate the detection performance of the USD-YOLO algorithm, a visual analysis was conducted by comparing USD-YOLO with YOLOv8s and YOLOv10 on the Citypersons and FloW-Img datasets.

5.4.1. Citypersons Dataset

Figure 7 shows the detection results of YOLOv8, YOLOv10, and USD-YOLO on three sets of images with small objects from the Citypersons dataset.
In the first set of images, USD-YOLO accurately identified and bounded extremely small and occluded pedestrian objects, while YOLOv8 failed to detect them, and YOLOv10 detected only one target. This indicates that the addition of the CGAM attention mechanism in USD-YOLO enhanced the perception of important features in the image, improving the detection accuracy of small objects. In the second set of images, YOLOv8 and YOLOv10 incorrectly detected two densely packed and occluded tiny pedestrian objects as one. In contrast, USD-YOLO accurately detected and precisely bounded both targets. In the third set of images, YOLOv8 and YOLOv10 falsely detected non-target regions as pedestrian objects, leading to false positives. Conversely, USD-YOLO accurately detected all targets without any false positives.
These visualization results demonstrate that the use of the APE, CGAM, and SCConv modules in the USD-YOLO algorithm significantly improved candidate box capabilities and enhanced the detection accuracy of small objects.

5.4.2. FloW-Img Dataset

Figure 8 shows the detection results of YOLOv8, YOLOv10, and USD-YOLO on three sets of images with small objects from the FloW-Img dataset.
In the first set of images, USD-YOLO accurately detected a small bottle object that was missed by both YOLOv8 and YOLOv10. In the second set of images, for extremely small and densely packed objects in the distance, USD-YOLO accurately detected all targets, while YOLOv8 failed to detect any, and YOLOv10 detected only two. In the third set of images, USD-YOLO accurately detected small objects partially visible at the edges of the image, whereas YOLOv8 and YOLOv10 missed them.
These visualization results indicate that the USD-YOLO algorithm significantly improved detection and candidate box precision, even in the presence of interference factors such as water reflections, demonstrating strong robustness compared with SOTA algorithms.

5.5. Real UGV Deployment

We deployed the USD-YOLO algorithm on an unmanned system for real-time testing in real-world scenarios to validate its practical performance. We selected a UGV as the unmanned system device. The physical structure of the UGV is shown in Figure 9. The main body of the UGV is the Scout 2.0 model produced by the Agilex Robotics Team. We equipped it with an Intel NUC 12 Enthusiast Kit (NUC12SNKi72) as the computation unit and installed a camera for transmitting image data. The USD-YOLO algorithm was deployed on the computation unit under the settings listed in Table 1 and Table 2. Our experimental scenarios were complex, including both indoor and outdoor environments with varying lighting conditions. The detection targets included small objects with dense distributions and occlusions.
The comparison results between the USD-YOLO algorithm and the YOLOv8 and YOLOv10 algorithms are shown in Figure 10. In the first set of images, for densely packed small objects, USD-YOLO accurately detected four targets, while YOLOv8 failed to detect any, and YOLOv10 detected only two, including one false detection. In the second set of images, for an occluded target and a nearby extremely small target, USD-YOLO accurately detected both, while YOLOv8 detected a false target and YOLOv10 detected only the occluded target. In the third set of images, for two extremely small targets, USD-YOLO accurately detected both, while YOLOv8 detected only one, and YOLOv10 produced a false detection. In the fourth set of images, for a target that was almost completely occluded, USD-YOLO accurately detected it, while YOLOv8 and YOLOv10 failed to detect it.
Therefore, the real UGV deployment experiment in real-world scenarios once again demonstrated the superiority and generalization capability of the USD-YOLO algorithm.

6. Conclusions

After thoroughly considering the challenges and hardware limitations of unmanned systems, the USD-YOLO algorithm is proposed to improve detection precision in the perception process of unmanned systems. Due to the hardware and software constraints of unmanned systems, it is necessary to balance detection precision and speed. The single-stage detection algorithm YOLOv8s was selected as the baseline, and three improved modules—APE, SCConv, and CGAM—were integrated into the post-processing, backbone, and neck sections, respectively, to enhance detection precision.
Experiments were conducted on three widely used datasets in unmanned systems research, each containing a large number of small objects. Ablation and comparison experiments on the Citypersons dataset demonstrated that, compared with the baseline algorithm, USD-YOLO significantly improved the Recall, mAP50, and mAP50-95 metrics by 2.3%, 5.9%, and 8.5%, respectively, while maintaining a high FPS. Generalization experiments on the FloW-Img and DOTA-v1.0 datasets showed that USD-YOLO also achieved the highest detection precision and excellent real-time performance. Additionally, visual comparisons of detection results on the Citypersons and FloW-Img datasets, together with the real UGV deployment experiment, further highlighted the superiority of the USD-YOLO algorithm. These experimental results prove that, compared with SOTA models, USD-YOLO offers high detection precision, fast speed, and strong generalization capabilities. This holds significant importance for small object perception research in unmanned systems such as UGVs, USVs, and UAVs.
In future research, efforts will continue to explore ways to improve the detection precision and speed of small objects in unmanned systems under more complex scenarios, such as environments with varying lighting conditions or foggy weather. Additionally, the study will focus on balancing the detail improvements with the high resource consumption of high-resolution cameras.

Author Contributions

Methodology, H.D.; software, S.Z.; validation, X.W.; data curation, T.H.; writing—original draft, H.D.; writing—review and editing, Y.Y.; funding acquisition, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Zhejiang Provincial Natural Science Foundation of China under Grant LQN25E080011, in part by the Ningbo Natural Science Foundation under Grant 2024J440, in part by the National Natural Science Foundation of China under Grant 62203434, and in part by the Liaoning Provincial Natural Science Foundation Program of China under Grant 2023-MS-032.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The Citypersons dataset can be found here: https://github.com/cvgroup-njust/CityPersons (accessed on 7 March 2025). The FloW-Img dataset can be found here: https://github.com/ORCA-Uboat/FloW-Dataset (accessed on 7 March 2025). The DOTA-v1.0 dataset can be found here: https://captain-whu.github.io/DOTA/index.html (accessed on 7 March 2025).

Acknowledgments

The authors would like to thank Mingyang Zhang from the Shenyang Institute of Automation, Chinese Academy of Sciences, and Lina Sun from the College of Mechanical Engineering and Automation, Northeastern University, for their valuable support and assistance with resources and funding for this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ye, T.; Qin, W.; Zhao, Z.; Gao, X.; Deng, X.; Ouyang, Y. Real-Time Object Detection Network in UAV-Vision Based on CNN and Transformer. IEEE Trans. Instrum. Meas. 2023, 72, 2505713.
  2. Wang, Y.; Wang, Z.; Cheng, P.; Tian, P.; Yuan, Z.; Tian, J.; Wang, W.; Zhao, L. UVCPNet: A UAV-Vehicle Collaborative Perception Network for 3D Object Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5615916.
  3. He, X.; Wu, D.; Wu, D.; You, Z.; Zhong, S.; Liu, Q. Millimeter-Wave Radar and Camera Fusion for Multiscenario Object Detection on USVs. IEEE Sensors J. 2024, 24, 31562–31572.
  4. Viadero-Monasterio, F.; Alonso-Rentería, L.; Pérez-Oria, J.; Viadero-Rueda, F. Radar-Based Pedestrian and Vehicle Detection and Identification for Driving Assistance. Vehicles 2024, 6, 1185–1199.
  5. Chen, L.; Lin, S.; Lu, X.; Cao, D.; Wu, H.; Guo, C.; Liu, C.; Wang, F.Y. Deep neural network based vehicle and pedestrian detection for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 2021, 22, 3234–3246.
  6. Bovcon, B.; Muhovič, J.; Vranac, D.; Mozetič, D.; Perš, J.; Kristan, M. MODS—A USV-oriented object detection and obstacle segmentation benchmark. IEEE Trans. Intell. Transp. Syst. 2021, 23, 13403–13418.
  7. Liu, B.Y.; Chen, H.X.; Huang, Z.; Liu, X.; Yang, Y.Z. Zoominnet: A novel small object detector in drone images with cross-scale knowledge distillation. Remote Sens. 2021, 13, 1198.
  8. Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13668–13677.
  9. Zhang, S.; Li, C.; Jia, Z.; Liu, L.; Zhang, Z.; Wang, L. Diag-IoU loss for object detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7671–7683.
  10. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488.
  11. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  12. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
  13. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045.
  14. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790.
  15. Cheng, G.; Lang, C.; Wu, M.; Xie, X.; Yao, X.; Han, J. Feature enhancement network for object detection in optical remote sensing images. J. Remote Sens. 2021, 2021, 9805389.
  16. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803.
  17. Wang, Y.; Chung, S.H.; Khan, W.A.; Wang, T.; Xu, D.J. ALADA: A lite automatic data augmentation framework for industrial defect detection. Adv. Eng. Informatics 2023, 58, 102205.
  18. Khan, W.A. Balanced weighted extreme learning machine for imbalance learning of credit default risk and manufacturing productivity. Ann. Oper. Res. 2023, 1–29.
  19. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276.
  20. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
  21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  22. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
  23. Wang, C.; Li, Z.; Gao, Q.; Cui, T.; Sun, D.; Jiang, W. Lightweight and Efficient Air-to-Air Unmanned Aerial Vehicle Detection Neural Networks. In Proceedings of the 2023 IEEE International Conference on Unmanned Systems (ICUS), Hefei, China, 13–15 October 2023; pp. 1575–1580.
  24. Shen, L.; Lang, B.; Song, Z. CA-YOLO: Model optimization for remote sensing image object detection. IEEE Access 2023, 11, 64769–64781.
  25. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788.
  26. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Grenoble, France, 19–23 September 2022; pp. 443–459.
  27. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. Unitbox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520.
  28. Neubeck, A.; Van Gool, L. Efficient non-maximum suppression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 3, pp. 850–855.
  29. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS–improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569.
  30. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157.
  31. Li, J.; Wen, Y.; He, L. SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 6153–6162.
  32. Liu, Y.; Shao, Z.; Hoffmann, N. Global attention mechanism: Retain information to enhance channel-spatial interactions. arXiv 2021, arXiv:2112.05561.
  33. Tan, Y.; Yao, H.; Li, H.; Lu, X.; Xie, H. Prf-ped: Multi-scale pedestrian detector with prior-based receptive field. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 6059–6064.
  34. Pu, Z.; Geng, X.; Sun, D.; Feng, H.; Chen, J.; Jiang, J. Comparison and Simulation of Deep Learning Detection Algorithms for Floating Objects on the Water Surface. In Proceedings of the 2023 4th International Conference on Computer Engineering and Application (ICCEA), Hangzhou, China, 7–9 April 2023; pp. 814–820.
  35. Ouyang, C.; Hou, Q.; Dai, Y. Surface Object Detection Based on Improved YOLOv5. In Proceedings of the 2022 IEEE 5th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 16–18 December 2022; Volume 5, pp. 923–928.
  36. Xie, X.; Cheng, G.; Rao, C.; Lang, C.; Han, J. Oriented object detection via contextual dependence mining and penalty-incentive allocation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5618010.
  37. Zhang, S.; Benenson, R.; Schiele, B. Citypersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3213–3221.
  38. Cheng, Y.; Zhu, J.; Jiang, M.; Fu, J.; Pang, C.; Wang, P.; Sankaran, K.; Onabola, O.; Liu, Y.; Liu, D.; et al. Flow: A dataset and benchmark for floating waste detection in inland waters. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10953–10962.
  39. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983.
  40. Dai, Y.; Liu, W.; Wang, H.; Xie, W.; Long, K. Yolo-former: Marrying yolo and transformer for foreign object detection. IEEE Trans. Instrum. Meas. 2022, 71, 1–14.
  41. Zhao, H.; Chu, K.; Zhang, J.; Feng, C. YOLO-FSD: An improved target detection algorithm on remote-sensing images. IEEE Sensors J. 2023, 23, 30751–30764.
  42. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
Figure 1. Research framework.
Figure 2. Formula parameters of the EIoU.
Figure 3. The schematic diagram of the SCConv module.
Figure 4. The schematic diagram of the GAM module.
Figure 5. The schematic diagram of the CGAM module.
Figure 6. The structure of the USD-YOLO algorithm.
Figure 7. Visualization results on the Citypersons dataset with three sets of images. (a) Detected by YOLOv8. (b) Detected by YOLOv10. (c) Detected by USD-YOLO.
Figure 8. Visualization results on the FloW-Img dataset with three sets of images. (a) Detected by YOLOv8. (b) Detected by YOLOv10. (c) Detected by USD-YOLO.
Figure 9. The physical structure of the UGV.
Figure 10. Comparisons of real-time detection results on the real UGV. (a) Detected by YOLOv8. (b) Detected by YOLOv10. (c) Detected by USD-YOLO.
Table 1. Software and hardware platforms for experiments.

Component | Value | Component | Value
System | Ubuntu 18.04 | CUDA | CUDA 11.1
CPU | Xeon 4214R | Language | Python 3.8
GPU | RTX 3080Ti | Framework | PyTorch 1.9.0
Memory | 12 GB | GPU Number | 1
Table 2. Training parameter settings for experiments.

Parameter | Citypersons | FloW-Img | DOTA-v1.0
Optimizer | SGD | SGD | SGD
Momentum | 0.937 | 0.937 | 0.937
Lr0 | 0.01 | 0.01 | 0.01
Lr1 | 0.01 | 0.01 | 0.01
Epoch | 100 | 200 | 200
Workers | 8 | 8 | 8
Batch Size | 32 | 32 | 12
Table 3. Ablation experiments on the Citypersons dataset.

YOLOv8s | SCConv | CGAM | APE | Recall (%) | mAP50 (%) | mAP50-95 (%)
✓ |   |   |   | 54.4 | 64.9 | 37.7
✓ | ✓ |   |   | 54.5 | 65.1 | 37.9
✓ |   | ✓ |   | 54.1 | 63.9 | 38.3
✓ |   |   | ✓ | 55.1 | 70.1 | 45.8
✓ | ✓ | ✓ |   | 54.6 | 64.6 | 38.3
✓ | ✓ |   | ✓ | 55.7 | 70.4 | 45.8
✓ |   | ✓ | ✓ | 54.9 | 70.3 | 46.4
✓ | ✓ | ✓ | ✓ | 56.7 | 70.8 | 46.2
Table 4. Results of comparison experiments on the Citypersons dataset.

Algorithm | Recall (%) | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS
YOLOv8s | 54.4 | 64.9 | 37.7 | 11.2 | 28.8 | 129.8
TPH-YOLO [25] | 54.0 | 64.0 | 37.3 | 11.36 | 29.8 | 74.8
SPD-YOLO [26] | 52.9 | 62.9 | 35.4 | 12.74 | 46.2 | 94.5
YOLOv5 | 56.2 | 66.3 | 41.8 | 7.02 | 15.9 | 100.0
YOLOv7 | 43.8 | 49.1 | 23.0 | 9.32 | 26.1 | 56.2
YOLOv8n | 49.5 | 56.9 | 29.2 | 3.15 | 8.9 | 133.7
YOLOv8m | 54.4 | 68.3 | 43.4 | 25.9 | 79.3 | 77.0
YOLOv9 | 53.0 | 62.5 | 38.4 | 6.17 | 23.9 | 42.5
YOLOv10 | 56.2 | 66.2 | 41.1 | 8.06 | 24.8 | 89.3
SSD [22] | 41.8 | 44.8 | 18.8 | 24.48 | 62.2 | 47.5
Faster R-CNN [42] | 46.2 | 64.0 | 24.6 | 136.98 | 199.6 | 6.2
USD-YOLO | 56.7 | 70.8 | 46.2 | 12.35 | 30.5 | 126.3
Table 5. Results of generalization experiments on the FloW-Img dataset.

Algorithm | Recall (%) | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS
YOLOv8s | 79.7 | 84.9 | 42.6 | 11.2 | 28.8 | 126
TPH-YOLO [25] | 78.7 | 85.6 | 42.5 | 11.36 | 29.8 | 81.5
SPD-YOLO [26] | 73.1 | 80.1 | 37.1 | 12.74 | 46.2 | 88.4
YOLOv5 | 78.9 | 84.9 | 43.6 | 7.02 | 15.9 | 106.3
YOLOv7 | 64.2 | 69.1 | 28.8 | 9.32 | 26.1 | 55.0
YOLOv9 | 74.2 | 79.9 | 38.9 | 6.17 | 23.9 | 41.8
YOLOv10 | 74.5 | 82.4 | 39.9 | 8.06 | 24.8 | 99.1
SSD [22] | 45.2 | 59.2 | 29.8 | 24.48 | 62.2 | 48.8
USD-YOLO | 81.3 | 86.2 | 45.1 | 12.35 | 30.5 | 124.6
Table 6. Results of generalization experiments on the DOTA-v1.0 dataset.

Algorithm | Recall (%) | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS
YOLOv8s | 68.2 | 71.7 | 46.4 | 11.2 | 28.8 | 86.4
TPH-YOLO [25] | 67.6 | 71.1 | 43.7 | 11.36 | 29.8 | 57.1
SPD-YOLO [26] | 62.0 | 66.8 | 39.7 | 12.74 | 46.2 | 63.2
YOLOv5 | 61.3 | 64.6 | 40.9 | 7.02 | 15.9 | 72.4
YOLOv7 | 61.0 | 64.3 | 37.8 | 9.32 | 26.1 | 32.1
YOLOv9 | 62.1 | 64.0 | 41.7 | 6.17 | 23.9 | 43.6
YOLOv10 | 69.1 | 71.7 | 47.1 | 8.06 | 24.8 | 68.0
SSD [22] | 22.6 | 29.6 | 14.9 | 24.48 | 62.2 | 40.3
USD-YOLO | 68.6 | 74.0 | 48.8 | 12.35 | 30.5 | 80.2
