1. Introduction
As a core topic in computer vision, pedestrian detection aims to automatically locate pedestrians in images or videos through algorithms [1,2,3]. Technological advancements in this field have significant implications for intelligent transportation systems, autonomous driving, video surveillance, and human–computer interaction [4,5,6,7]. In autonomous driving [8], real-time and accurate pedestrian detection can effectively reduce traffic accident risks and ensure the safety of vulnerable road users. In security surveillance [9], this technology can be used for crowd density estimation and abnormal behavior detection, contributing to public safety management. In smart city development [10], traffic flow analysis incorporating pedestrian detection provides valuable data support for urban planning.
With the continuous advancement of computer vision [11], an increasing number of algorithms have been developed for automatically recognizing pedestrians in various scenes. These methods can be categorized into traditional object detection methods and deep learning-based object detection methods.
Traditional object detection methods [12] primarily consist of three stages: candidate region selection, feature extraction, and classification. However, using a fixed-size search window may result in missing candidate regions of other sizes, thereby affecting detection accuracy and efficiency. Ojala et al. (2002) [13] introduced Local Binary Patterns (LBP) to extract texture features of the target. The method binarized the surrounding pixels based on the grayscale value of the central pixel in a window and then computed a weighted sum over the different pixel positions within the window to obtain the LBP value. However, this algorithm only processed fixed-size small regions and could not adapt to multi-scale variations. Dalal et al. (2005) [14] proposed a detection method based on Histogram of Oriented Gradients (HOG) features. This method described the local shape and texture of the target using histograms of image gradient orientations and exhibited strong robustness to optical distortions in target images. However, the computation of HOG features relied on local gradient information, making them sensitive to scale and rotation variations, and their detection performance degraded under occlusion conditions.
Deep learning-based object detection methods can be categorized into two-stage and one-stage methods based on whether they employ a region proposal mechanism. Two-stage object detection algorithms transform the detection problem into a classification task for local images within the generated proposal regions through explicit region proposals. Initially, a region proposal network generates candidate regions. Subsequently, feature extraction, classification, and bounding box regression are performed for each candidate region. Representative two-stage object detection algorithms include R-CNN [15], Fast R-CNN [16], and others. R-CNN [15] used selective search to generate 2000 candidate regions, each of which was individually input into a CNN for feature extraction and then classified using an SVM. Fast R-CNN [16] introduced an RoI Pooling layer to achieve feature sharing, thereby improving efficiency. Faster R-CNN [17] further integrated the RPN with the detection network for end-to-end joint training, significantly reducing redundant computations. Although these algorithms achieved high accuracy, their computational complexity was high and their speed was slow due to the need to process a large number of candidate regions. To address this issue, one-stage algorithms directly regress target positions and categories. By dividing the image into grids and predicting bounding boxes and class probabilities for each grid cell, combined with non-maximum suppression to optimize the results, they achieve end-to-end real-time detection. You Only Look Once (YOLO) [18,19,20,21,22] has recently become mainstream due to its high efficiency. The YOLO model enables real-time pedestrian detection while maintaining high accuracy; however, it performs poorly in detecting small pedestrian targets. To address this issue, the YOLO series has undergone continuous iteration and optimization. YOLOv3 [23] introduced feature pyramids and multi-scale prediction techniques, improving detection accuracy for targets of varying scales and aspect ratios. YOLOv4 [24] reduced computational complexity and improved the network's receptive field and feature representation ability by modifying the processing of input feature maps. YOLOv5 [25] further introduced techniques such as adaptive training data augmentation, Mosaic data augmentation, and the Swish activation function, significantly improving detection accuracy and inference speed. YOLOX [26] employed Mosaic and Mixup data augmentation techniques, along with innovative methods such as an anchor-free design and a decoupled head, further enhancing detection accuracy, particularly for small targets. In subsequent versions, YOLOv8 [27] integrated detection, segmentation, and tracking capabilities, enhancing the model's versatility. As the latest version, YOLOv11 [28] has undergone significant improvements in both architecture and training methods. It adopts an improved backbone and neck architecture, enhancing feature extraction capabilities, and introduces components such as Cross Stage Partial with kernel size 2 (C3k2) [20], Spatial Pyramid Pooling - Fast (SPPF) [29], and Convolutional block with Parallel Spatial Attention (C2PSA) [20], further optimizing feature extraction and processing efficiency. From YOLOv3 to YOLOv11, the YOLO series models [30,31] have been continuously optimized, gradually improving performance and practicality in object detection tasks such as pedestrian detection through the introduction of various innovative techniques and architectural improvements.
Although these methods have made some progress, surveillance video scenes are large-field environments, where pedestrians are at varying distances from the camera and appear at different scales in the image. Furthermore, pedestrian detection in real-world scenarios remains challenging due to factors such as variable illumination, occlusions, and adverse weather like rain or fog, all of which can compromise the robustness of detection systems. To address these issues, we propose a novel pedestrian detection algorithm, FA-YOLO. First, we develop a feature enhancement module (FEM) that seamlessly integrates global and local information, thereby significantly enhancing the network's ability to extract comprehensive and robust pedestrian features. This integration ensures that the model captures both contextual and detailed information, leading to improved feature representation. Second, we put forward an adaptive sparse self-attention (ASSA) module. This module is specifically designed to reduce noise interactions in irrelevant regions and mitigate feature redundancy across both the spatial and channel domains. By doing so, it enhances the model's discriminative power and ensures that the extracted features are more relevant and informative for pedestrian detection tasks. Furthermore, we enhance the model's architecture by incorporating an improved C3K2 structure. This structure enables the model to focus more effectively on critical target features, thereby improving its detection accuracy and overall performance. The improved C3K2 structure is optimized to better capture the essential characteristics of pedestrians, even in challenging scenarios. Finally, we adopt a scalable intersection over union (SIoU) loss function, which accounts for vector angle differences between predicted and ground-truth bounding boxes, leading to more precise localization. These improvements collectively enable FA-YOLO to handle multi-scale pedestrian detection while maintaining robustness against environmental challenges such as lighting variations, occlusions, and adverse weather conditions. The key contributions of this study are summarized as follows:
We propose FA-YOLO, a novel pedestrian detection algorithm that enhances YOLOv11 by integrating a feature enhancement module (FEM) and an adaptive sparse self-attention (ASSA) module to improve feature representation and reduce redundancy.
We introduce the SIoU loss function, which considers vector angle differences to enhance localization accuracy and improve the model’s ability to detect pedestrians in complex environments.
We integrate an improved C3K2 structure, C3ASSA, which enhances the network’s ability to focus on key pedestrian features, improving robustness against occlusions and scale variations.
Extensive experiments on public pedestrian detection datasets demonstrate that FA-YOLO outperforms existing methods, achieving higher detection accuracy and robustness in challenging real-world scenarios.
3. Materials and Methods
3.1. Overview
Our work builds on YOLOv11 by introducing multiple key enhancements. To enhance feature representation, we design a FEM that integrates global and local features, improving the network's ability to capture fine-grained pedestrian characteristics. Additionally, we propose an ASSA module to suppress irrelevant background noise and reduce feature redundancy in both the spatial and channel dimensions. Furthermore, we integrate an improved C3K2 structure, C3ASSA, which enhances the model's ability to focus on key pedestrian features, improving robustness against occlusions and scale variations. Moreover, we introduce the SIoU loss function, which refines the bounding box regression by addressing vector angle differences between predicted and ground-truth boxes, leading to more accurate localization. The overall architecture of FA-YOLO is illustrated in Figure 2. The backbone network serves as the primary feature extractor, capturing fundamental structures from input images. The neck module aggregates multi-scale information to ensure robustness against variations in pedestrian sizes, while the detection head generates the final classification and bounding box predictions. These improvements collectively enable FA-YOLO to achieve superior pedestrian detection performance compared to existing models. Next, we introduce the FEM, ASSA, C3ASSA, and SIoU loss function in detail.
3.2. FEM
In object detection tasks, convolutional layers primarily focus on extracting local detailed features, while the acquisition of global features is relatively limited. Furthermore, during the forward propagation of deep neural networks, feature information may encounter bottlenecks as network depth increases, leading to the gradual weakening or loss of key features. To address these issues, we propose a Feature Enhancement Module (FEM), as shown in Figure 3. This module consists of a local feature extraction branch and a Global Context Enhancement Branch (GCEB) to optimize feature representation in deep networks.
3.2.1. Local Feature Extraction Branch
This branch first utilizes a 1 × 1 convolution to adjust the number of channels of the input feature map, reducing computational redundancy while preserving the original information representation, as shown in Equation (4). Then, a 3 × 3 convolution is applied to further extract local contextual features, as shown in Equation (5). To prevent gradient vanishing or explosion, we employ residual connections and concatenate the input features with the local features to enhance feature representation, as shown in Equation (6).
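A minimal formulation consistent with this description, using illustrative symbol names ($F$ for the input feature map and $F_{1}$, $F_{2}$, $F_{\mathrm{local}}$ for the intermediate outputs), is:

$F_{1} = \mathrm{Conv}_{1\times 1}(F)$  (cf. Equation (4))

$F_{2} = \mathrm{Conv}_{3\times 3}(F_{1})$  (cf. Equation (5))

$F_{\mathrm{local}} = \mathrm{Concat}(F,\ F_{2})$  (cf. Equation (6))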
3.2.2. Global Context Enhancement Branch
To compensate for the limited receptive field of local features, we introduce the Global Context Enhancement Branch (GCEB) in parallel. This branch aggregates global information from the feature map using global average pooling (GAP), as shown in Equation (7), where H and W represent the height and width of the feature map, respectively, and F(i, j) denotes the pixel value of feature map F at position (i, j). Then, a 1 × 1 convolution is applied for channel compression, followed by a ReLU activation function to enhance the network's nonlinearity, as shown in Equation (8). Subsequently, the channel attention weights are normalized using the Sigmoid function, which maps feature values to the range of 0 to 1 and generates the attention weights, as shown in Equation (9). Finally, the input features are multiplied element-wise with the channel attention weights to achieve global feature enhancement, as shown in Equation (10), where ⊙ represents element-wise multiplication, used to combine the input features with the channel attention weights.
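A minimal formulation consistent with this description is given below; the symbol names are illustrative, and whether an additional 1 × 1 convolution precedes the Sigmoid is left as an assumption:

$g = \mathrm{GAP}(F) = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} F(i, j)$  (cf. Equation (7))

$z = \mathrm{ReLU}\big(\mathrm{Conv}_{1\times 1}(g)\big)$  (cf. Equation (8))

$w = \mathrm{Sigmoid}(z)$  (cf. Equation (9))

$F_{\mathrm{global}} = F \odot w$  (cf. Equation (10))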
3.2.3. Feature Fusion
The features from the local and global branches are fused to enhance feature representation, as shown in Equation (11). Finally, a 1 × 1 convolution is applied to map the fused features to the final output, achieving channel expansion and enhancing nonlinear feature representation. The FEM optimizes the feature extraction process in object detection tasks by combining local features with global context information, improving the network's ability to detect complex targets and effectively mitigating the issue of feature information loss.
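To make the data flow concrete, a minimal PyTorch sketch of an FEM-style block is given below. The module name, layer widths, and the use of concatenation for fusion are our own assumptions based on the description above, not the authors' implementation.

```python
import torch
import torch.nn as nn


class FEM(nn.Module):
    """Illustrative FEM-style block: a local branch and a global context branch, then fusion.

    Sketch based on the textual description; layer widths, the concatenation-based
    fusion, and all names are assumptions, not the authors' implementation.
    """

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Local feature extraction branch (in the spirit of Equations (4)-(6)).
        self.reduce = nn.Conv2d(in_channels, in_channels, kernel_size=1)            # 1x1 conv
        self.local = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)  # 3x3 conv
        # Global Context Enhancement Branch (in the spirit of Equations (7)-(10)).
        self.gap = nn.AdaptiveAvgPool2d(1)                                           # GAP
        self.squeeze = nn.Conv2d(in_channels, in_channels, kernel_size=1)            # 1x1 conv
        self.act = nn.ReLU(inplace=True)
        self.gate = nn.Sigmoid()                                                     # attention weights
        # Fusion and output projection (Equation (11) plus the final 1x1 conv).
        self.project = nn.Conv2d(3 * in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Local branch: 1x1 conv, 3x3 conv, then concatenate with the input (residual-style).
        f1 = self.reduce(x)
        f2 = self.local(f1)
        local_feat = torch.cat([x, f2], dim=1)
        # Global branch: channel attention derived from globally pooled statistics.
        w = self.gate(self.act(self.squeeze(self.gap(x))))
        global_feat = x * w
        # Fuse both branches and project to the output channel count.
        fused = torch.cat([local_feat, global_feat], dim=1)
        return self.project(fused)


if __name__ == "__main__":
    fem = FEM(in_channels=64, out_channels=128)
    out = fem(torch.randn(1, 64, 80, 80))
    print(out.shape)  # torch.Size([1, 128, 80, 80])
```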
3.3. ASSA
In complex environments, pedestrian detection is often affected by irrelevant background noise and feature redundancy, which can significantly degrade the performance of detection models. To address these challenges, we propose the ASSA module, which is designed to suppress irrelevant background noise and reduce feature redundancy in both the spatial and channel dimensions.
The ASSA module is designed to adaptively capture the most informative interactions among tokens while preserving essential information. It consists of two branches: a sparse self-attention (SSA) branch and a dense self-attention (DSA) branch. The SSA branch filters out irrelevant interactions among tokens, while the DSA branch ensures that necessary information flows through the network.
Given a normalized feature map X, where C represents the number of channels, we partition it into non-overlapping windows, resulting in a representation X_i for the i-th window. We then generate the matrices of queries Q, keys K, and values V from X, as shown in Equation (12), where the linear projection matrices are shared among all windows and d represents the dimension of the vectors. The attention computation is defined in Equation (13), where A denotes the estimated attention, B refers to a learnable relative positional bias, and f is a scoring function. The standard dense self-attention (DSA) mechanism employs a SoftMax layer to obtain the attention scores, as shown in Equation (14).
However, not all query tokens are closely relevant to the corresponding ones in the keys, making the utilization of all similarities ineffective for robust feature aggregation. To enhance feature aggregation, we develop a sparse self-attention (SSA) mechanism to select the most useful interactions among tokens, as shown in Equation (15).
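For reference, one common formulation consistent with this description is given below; the symbols and the squared-ReLU scoring function for the sparse branch follow standard sparse-attention designs and are assumptions rather than necessarily the exact expressions used here:

$Q = X W_{Q}, \quad K = X W_{K}, \quad V = X W_{V}$  (cf. Equation (12))

$\mathrm{Attention}(Q, K, V) = A V, \quad A = f\!\left(\frac{Q K^{\top}}{\sqrt{d}} + B\right)$  (cf. Equation (13))

$f_{\mathrm{DSA}}(\cdot) = \mathrm{SoftMax}(\cdot)$  (cf. Equation (14))

$f_{\mathrm{SSA}}(\cdot) = \mathrm{ReLU}^{2}(\cdot)$  (cf. Equation (15))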
To balance the sparse and dense branches, we propose an adaptive two-branch self-attention mechanism. The attention matrix in Equation (13) is updated to Equation (16), in which two normalized weights adaptively modulate the two branches; these weights are computed as in Equation (17) from learnable parameters initialized to 1 for the two branches. This design ensures a trade-off between filtering out noisy interactions from irrelevant areas and retaining enough informative features.
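Under the same assumed notation, the adaptive combination can be written as follows, with $w_{1}$ and $w_{2}$ the normalized branch weights and $\theta_{1}$ and $\theta_{2}$ the learnable parameters (names illustrative):

$A = w_{1}\, f_{\mathrm{SSA}}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + B\right) + w_{2}\, f_{\mathrm{DSA}}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + B\right)$  (cf. Equation (16))

$w_{i} = \frac{\exp(\theta_{i})}{\exp(\theta_{1}) + \exp(\theta_{2})}, \quad i \in \{1, 2\}$  (cf. Equation (17))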
3.4. C3ASSA
The C3ASSA module is an enhanced feature extraction component that replaces the Bottleneck structure in the original module with IR-ASSA, which uses an inverted residual structure, as shown in Figure 4. This module leverages the ASSA mechanism to improve the C3K2 structure, enhancing the ability to capture fine-grained pedestrian features while maintaining low computational overhead.
The IR-ASSA module first expands the input channels to twice their original size using pointwise convolution, mapping low-dimensional features to a higher-dimensional feature space. This enables subsequent operations to extract complex feature information more efficiently. Next, a 3 × 3 depthwise separable convolution is used instead of a standard convolution to extract local spatial features while significantly reducing computational complexity. Subsequently, the ASSA attention mechanism is integrated to effectively filter out irrelevant features and highlight key features, thereby enhancing the ability to capture target features. Finally, pointwise convolution is applied to compress and reduce the dimensionality of the channels, while a residual connection is introduced to prevent the loss of important global features during the dimensionality reduction process, further improving information retention and representation capability. Compared to the traditional C3K2 module, which extracts features using standard convolution, the C3ASSA module allows the model to focus on key pedestrian features, improving robustness against occlusion and scale variations.
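A minimal PyTorch sketch of such an inverted-residual block is given below; the expansion factor of 2 follows the description, the attention slot is a generic placeholder where the ASSA module would be inserted, and all names are illustrative rather than the authors' implementation.

```python
from typing import Optional

import torch
import torch.nn as nn


class IRASSA(nn.Module):
    """Illustrative inverted-residual block in the spirit of IR-ASSA.

    Sketch only: the expansion factor of 2 follows the text, the attention slot
    is a generic placeholder for the ASSA module, and all names are illustrative.
    """

    def __init__(self, channels: int, attn: Optional[nn.Module] = None):
        super().__init__()
        hidden = channels * 2  # pointwise expansion to twice the input channels
        self.expand = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(inplace=True),
        )
        # 3x3 depthwise convolution (the depthwise half of a depthwise-separable conv).
        self.depthwise = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(inplace=True),
        )
        # Slot where the ASSA attention module would go; identity stand-in here.
        self.attn = attn if attn is not None else nn.Identity()
        # Pointwise projection back to the original channel count.
        self.project = nn.Sequential(
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.expand(x)
        y = self.depthwise(y)
        y = self.attn(y)
        y = self.project(y)
        return x + y  # residual connection to retain information lost in projection


if __name__ == "__main__":
    block = IRASSA(channels=64)
    out = block(torch.randn(1, 64, 40, 40))
    print(out.shape)  # torch.Size([1, 64, 40, 40])
```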
3.5. SIoU Loss Function
The YOLOv11 network model uses the CIoU loss for bounding box regression, which only considers the aspect ratio of the bounding box and the distance between the predicted and ground-truth centers, without addressing the angle of the bounding box. However, the angle of the bounding box also affects the model's regression. To address this, the SIoU loss is introduced to resolve the vector angle issue between the ground-truth and predicted bounding boxes, reducing detection errors caused by inaccurate position or shape estimation. Ground truth refers to the actual annotation data used as a basis for comparison in machine learning models. The SIoU loss function includes the angle loss, distance loss, shape loss, and IoU loss.
The formula for the angle loss is given by Equation (18), where C represents the normalized difference between the center coordinates of the predicted and ground-truth bounding boxes, with values ranging from −1 to 1.
The formula for the distance loss is given by Equation (19), in which the Euclidean distance between the center points of the predicted and ground-truth bounding boxes is measured, and a hyperparameter controls the sensitivity of the distance loss.
The formula for the shape loss is given by Equation (20), where w represents the width difference between the predicted and ground-truth bounding boxes, h represents the height difference, and a hyperparameter adjusts the weight of the shape loss.
The Intersection over Union (IoU) [41] is calculated by dividing the area of overlap between the predicted bounding box and the ground-truth bounding box by the total area covered by both boxes. Specifically, the coordinates of the intersection rectangle, i.e., the overlapping region between the two boxes, are determined first. Then, the area of this intersection is computed. Next, the union area is calculated by adding the areas of the predicted and ground-truth boxes and subtracting the intersection area. Finally, the IoU is obtained as the ratio of the intersection area to the union area. This metric quantifies how well the predicted box matches the ground-truth box, with values ranging from 0 (no overlap) to 1 (perfect overlap).
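In its standard form, with $B_{p}$ and $B_{gt}$ denoting the predicted and ground-truth boxes:

$\mathrm{IoU} = \frac{\left|B_{p} \cap B_{gt}\right|}{\left|B_{p} \cup B_{gt}\right|}$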
From the above, the SIoU loss function is given by Equation (21). SIoU optimizes the matching accuracy of bounding boxes from multiple dimensions by introducing the angle loss, distance loss, and shape loss. It provides a more refined measurement of the relationship between predicted and ground-truth boxes, reducing localization errors and increasing convergence speed.
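For completeness, the SIoU loss in its commonly used form, which we assume corresponds to Equations (18)–(21) up to notation ($\Lambda$, $\Delta$, and $\Omega$ denote the angle, distance, and shape costs), is:

$\Lambda = 1 - 2\sin^{2}\!\left(\arcsin(\sin\alpha) - \frac{\pi}{4}\right)$

$\Delta = \sum_{t \in \{x, y\}} \left(1 - e^{-\gamma \rho_{t}}\right), \quad \gamma = 2 - \Lambda$

$\Omega = \sum_{t \in \{w, h\}} \left(1 - e^{-\omega_{t}}\right)^{\theta}$

$L_{\mathrm{SIoU}} = 1 - \mathrm{IoU} + \frac{\Delta + \Omega}{2}$

where $\alpha$ is the angle between the line connecting the box centers and the horizontal axis, $\rho_{t}$ is the normalized squared center-distance term along each axis, $\omega_{t}$ is the normalized width or height difference, and $\theta$ controls the weight of the shape cost.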
3.6. Evaluation Metrics
In this paper, four performance metrics [42]—Precision, Recall, mean average precision (mAP), and inference time—are employed to evaluate the performance of the improved network model for pedestrian detection. Specifically, Precision refers to the proportion of correctly detected target objects, while Recall measures the proportion of actual target objects identified by the algorithm. The mAP represents the average precision calculated under multiple thresholds, serving as a critical indicator of the algorithm's overall performance across multiple categories. The inference time is the average time taken by the model from the input image to the output of the detection result, which reflects the real-time performance and efficiency of the model. The Precision, Recall, and mAP values are computed as follows:
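In standard form, assuming the conventional definitions:

$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$

$AP = \int_{0}^{1} P(R)\, \mathrm{d}R, \quad mAP = \frac{1}{c} \sum_{i=1}^{c} AP_{i}$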
where TP denotes the number of targets that were correctly detected; FP denotes the number of objects that were falsely detected; FN denotes the number of targets that were missed; c is the number of classes; and AP is the average precision for a single target category, corresponding to the area under the curve in the P–R coordinate system.
4. Results and Discussion
4.1. Datasets
In this study, we used two pedestrian detection datasets from complex scenarios, which include factors such as adverse weather and dense distributions, closely resembling the challenging real-world environments captured by urban surveillance cameras.
The WiderPerson dataset [43] focuses on outdoor pedestrian detection and includes 13,382 images with 400,000 occlusion annotations. The dataset contains a large number of pedestrians, totaling 386,353, with an average of 29 pedestrians per image and varying resolutions. The RTTS dataset [44] was collected in fog, rain, and snow weather conditions, providing valuable data for researching and analyzing pedestrian detection in complex environmental scenarios. It contains a total of 4322 images covering various scenes, including roads, rural areas, and tourist spots. This diversity makes the dataset more reflective of real-world application scenarios and allows for an evaluation of the robustness and adaptability of the proposed model.
The data in these datasets are anonymized, pertain to groups of people, do not include any identifiable individual, are not derived from human specimens, and do not involve personal information. Ethical approval is not required.
4.2. Experimental Settings
The model was trained on an NVIDIA RTX 4090 GPU, and inference validation was performed on an NVIDIA RTX 4060 Ti GPU (NVIDIA, Santa Clara, CA, USA). A comprehensive breakdown of the experimental configurations is provided in Table 1.
4.3. Comparison with SOTA Models
To comprehensively evaluate the performance of FA-YOLO, we compared it against state-of-the-art pedestrian detection models. The comparison results on the two datasets [43,44] are presented in Table 2 and Table 3. FA-YOLO outperforms conventional methods across all four evaluation metrics, which can be attributed to several key enhancements.
The FEM integrates both global and local features, enabling the network to better capture fine-grained pedestrian characteristics, as reflected in the improved precision and recall. The ASSA module suppresses irrelevant background noise, reduces feature redundancy, and enhances the model’s robustness to occlusion and scale variations, as evidenced by the increase in the mAP score. Furthermore, the integration of the C3ASSA directs the model’s focus toward critical pedestrian features, further boosting detection performance. The SIoU loss function refines bounding box regression, leading to more precise localization, which is crucial for high-accuracy pedestrian detection. Despite these enhancements, FA-YOLO maintains a competitive inference speed, making it well-suited for real-world pedestrian detection applications.
Figure 5 and Figure 6 show the precision–recall curves of FA-YOLO and several baseline detectors on the WiderPerson and RTTS datasets, respectively. FA-YOLO consistently achieves higher precision across a broad range of recall values compared to traditional detectors such as SSD, Faster R-CNN, and DETR. Notably, in the high-recall region (recall > 0.7), the FA-YOLO curve remains more stable and stays closer to the ideal top-right corner, indicating greater robustness in detecting challenging or small-scale pedestrian instances. On the RTTS dataset, which includes more challenging conditions such as low visibility and occlusion, FA-YOLO again outperforms other YOLO variants and two-stage detectors. Although the performance gap is less pronounced than on the WiderPerson dataset, FA-YOLO achieves a balanced trade-off, exhibiting less degradation in precision as recall increases. These precision–recall curves demonstrate that FA-YOLO not only achieves high average precision (as shown in Table 2 and Table 3) but also delivers consistent performance across different detection thresholds, making it well-suited for safety-critical applications where high recall is required without compromising precision.
4.4. Ablation Experiments
The results of the ablation experiments conducted on the WiderPerson and RTTS datasets are shown in Table 4 and Table 5, respectively. On the WiderPerson dataset, we observed that removing the FEM led to a noticeable decline in precision, recall, and mAP@0.5. We interpreted these results as evidence that the FEM plays a critical role in improving feature representation. Similarly, the absence of ASSA resulted in a significant drop in recall and mAP@0.5:0.95. These results can be explained by supposing that ASSA contributes to suppressing irrelevant background noise and reducing feature redundancy. Furthermore, when the improved C3K2 structure (C3ASSA) was removed, a noticeable reduction in mAP@0.5:0.95 was observed. This suggests that C3ASSA plays an important role in enhancing the model's focus on key pedestrian features and improving robustness against occlusions and scale variations. Finally, eliminating the SIoU loss function caused a decrease in mAP@0.5:0.95, which we interpret as an indication of its role in refining bounding box regression and improving localization accuracy.
On the RTTS dataset, similar trends were observed. Removing the FEM led to lower precision and recall, which we interpreted as further supporting its significance for feature representation. The absence of ASSA again caused a substantial drop in recall and mAP@0.5:0.95; this result can be explained by assuming that ASSA helps to suppress background noise and reduce redundancy. Likewise, excluding C3ASSA resulted in a decrease in mAP@0.5:0.95, possibly due to its contribution to robust feature extraction under challenging conditions. Finally, removing the SIoU loss function led to a decline in mAP@0.5:0.95, which may suggest its usefulness in enhancing bounding box regression performance.
In conclusion, the ablation experiments on both datasets validated the indispensable role of each FA-YOLO component in optimizing performance. The FEM is presumed to enhance feature representation, the ASSA module mitigates background noise and feature redundancy, the C3ASSA structure strengthens robustness against occlusions and scale variations, and the SIoU loss function improves bounding box localization. These components are designed to work synergistically, enabling FA-YOLO to achieve superior pedestrian detection performance compared to existing models.
4.5. Visualisation and Analysis
Four images were selected from each dataset for visual analysis, and the results for the two datasets are shown in Figure 7 and Figure 8, respectively. The visualization results demonstrate the robustness and accuracy of our model in detecting pedestrians under diverse and challenging conditions. Specifically, even in complex environments such as foggy weather, crowded pedestrian zones, and varying lighting conditions, the model consistently identifies small-scale pedestrians and accurately annotates their bounding boxes, even when they appear at a significant distance from the camera.
A particularly noteworthy observation is that the model detected pedestrians that were not annotated in the original ground truth labels. After careful manual verification, it was confirmed that these detected pedestrians were indeed present but had been overlooked by the human annotator. This highlights an important advantage of our model: it exhibits finer detection granularity and greater sensitivity in identifying multiple objects, even those that might be difficult for human annotators to notice. Such a capability is crucial for real-world applications where missing a target, even a small one, could have significant consequences, such as in autonomous driving, surveillance, and public safety monitoring.
Additionally, the model’s ability to detect objects under occlusion and varying poses further emphasizes its robustness. In cases where pedestrians were partially occluded by other objects (e.g., vehicles, trees, or other people), the model still managed to infer their presence and accurately delineate their bounding boxes. This suggests that the model effectively captures contextual cues and learns meaningful feature representations beyond simple edge detection.
These results collectively validate the effectiveness of our model for multi-object detection in real-world environments. Unlike traditional detection approaches that may struggle with small objects, occlusions, or variations in environmental conditions, our model demonstrates enhanced adaptability and superior recall performance. This makes it highly suitable for applications requiring precise and comprehensive target identification, reinforcing its potential for deployment in practical, high-stakes scenarios.
4.6. Detection Failure Case Analysis
Although FA-YOLO outperforms mainstream object detection algorithms in both accuracy and robustness across multiple datasets, it still inevitably encounters detection failures. Understanding the limitations of the model is essential for its further development. To further optimize the algorithm's performance and guide future research, this section analyzes typical failure cases, investigates their root causes, and proposes potential directions for improvement. Examples of detection failures on the two datasets are shown in Figure 9 and Figure 10, respectively.
On the WiderPerson dataset, detection failures are mainly observed in scenarios involving densely packed pedestrians or partial occlusions. On the one hand, occlusions blur pedestrian boundaries, making it difficult to extract complete target features. On the other hand, smaller-scale pedestrians are more likely to suffer feature loss during downsampling. Furthermore, when multiple pedestrians are densely clustered, heavily overlapping feature regions often lead to target confusion or false detections. While ASSA enhances local context modeling to a certain degree, its performance remains limited in severely occluded scenarios. Future work could incorporate human body priors or adopt instance segmentation methods to assist identification, thereby further enhancing detection performance.
On the RTTS dataset, detection failures predominantly occur under nighttime or backlit conditions. Under these conditions, reduced image contrast leads to indistinct boundaries between pedestrians and the background, resulting in misclassifications or missed detections. Specifically, this manifests as blurred pedestrian silhouettes in low-light environments and the misclassification of background pseudo-targets as pedestrians. Despite employing feature enhancement strategies, the model's robustness to illumination variations remains limited. Future research could explore integrating image enhancement preprocessing or adopting style-transfer-based illumination normalization modules to improve model adaptability in complex illumination environments.
5. Conclusions
This paper proposes a novel pedestrian detection algorithm, FA-YOLO, designed to address challenges in complex real-world scenarios, such as drastic illumination changes and diverse pedestrian postures. To enhance detection performance, we design a Feature Enhancement Module (FEM) to fuse global and local features, thereby improving feature representation capability. Additionally, we propose an Adaptive Sparse Self-Attention (ASSA) mechanism to reduce noise interference from irrelevant regions and optimize information representation across the spatial and channel dimensions. In addition, the C3ASSA module further strengthens the model's focus on target features. Meanwhile, we employ the Scalable Intersection over Union (SIoU) loss function, incorporating the vector angular relationship between predicted and ground-truth boxes to improve localization accuracy.
Extensive experiments demonstrate that FA-YOLO achieves outstanding performance in pedestrian detection tasks, maintaining high accuracy and efficiency in challenging scenarios such as occlusion, scale variations, and adverse weather conditions, outperforming existing mainstream methods. Notably, FA-YOLO is also capable of detecting pedestrians missed in manual annotations, further verifying its robustness and practical value in real-world applications such as smart city surveillance and autonomous driving.
Although FA-YOLO demonstrates strong performance across multiple datasets, it still presents certain limitations. First, although the ASSA module enhances attention allocation, it introduces additional computational complexity compared to lightweight detection models, which may hinder deployment on resource-constrained edge devices. Second, the proposed method has been primarily evaluated on pedestrian datasets, and its generalizability to more diverse object detection tasks has yet to be verified. In future work, we plan to explore model compression techniques and expand evaluations to more comprehensive benchmarks, with the goal of improving the practicality and generalizability of FA-YOLO.