1. Introduction
As a core topic in computer vision, pedestrian detection aims to automatically locate pedestrians in images or videos through algorithms [1,2,3]. Technological advancements in this field have significant implications for intelligent transportation systems, autonomous driving, video surveillance, and human–computer interaction [4,5,6,7]. In autonomous driving [8], real-time and accurate pedestrian detection can effectively reduce traffic accident risks and ensure the safety of vulnerable road users. In security surveillance [9], this technology can be used for crowd density estimation and abnormal behavior detection, contributing to public safety management. In smart city development [10], traffic flow analysis incorporating pedestrian detection provides valuable data support for urban planning.
With the continuous advancement of computer vision [11], an increasing number of algorithms have been developed for automatically recognizing pedestrians in various scenes. These methods can be categorized into traditional object detection methods and deep learning-based object detection methods.
Traditional object detection methods [12] primarily consist of three stages: candidate region selection, feature extraction, and classification. However, using a fixed-size search window may result in missing candidate regions of other sizes, thereby affecting detection accuracy and efficiency. Ojala et al. (2002) [13] introduced Local Binary Patterns (LBP) to extract texture features of the target. The method binarized the surrounding pixels based on the grayscale value of the central pixel in a window and then computed a weighted sum over the different pixel positions within the window to obtain the LBP value. However, this algorithm only processed fixed-size small regions and could not adapt to multi-scale variations. Dalal et al. (2005) [14] proposed a detection method based on Histogram of Oriented Gradients (HOG) features. This method described the local shape and texture of the target using histograms of image gradient orientations and exhibited strong robustness to optical distortions in target images. However, the computation of HOG features relied on local gradient information, making them sensitive to scale and rotation variations, and their detection performance degraded under occlusion conditions.
Deep learning-based object detection methods can be categorized into two-stage and one-stage methods based on whether they employ a region proposal mechanism. Two-stage object detection algorithms transform the detection problem into a classification task for local images within the generated proposal regions through explicit region proposals. Initially, a region proposal network generates candidate regions. Subsequently, feature extraction, classification, and bounding box regression are performed for each candidate region. Representative two-stage object detection algorithms include R-CNN [15], Fast R-CNN [16], and others. R-CNN [15] used selective search to generate 2000 candidate regions, each of which was individually input into a CNN for feature extraction and then classified using an SVM. Fast R-CNN [16] introduced an RoI Pooling layer to achieve feature sharing, thereby improving efficiency. Faster R-CNN [17] further integrated the RPN with the detection network for end-to-end joint training, significantly reducing redundant computations. Although these algorithms achieved high accuracy, their computational complexity was high and their speed was slow due to the need to process a large number of candidate regions. To address this issue, one-stage algorithms directly regress target positions and categories. By dividing the image into grids and predicting bounding boxes and class probabilities for each grid cell, combined with non-maximum suppression to optimize the results, they achieve end-to-end real-time detection. You Only Look Once (YOLO) [18,19,20,21,22] has recently become mainstream due to its high efficiency. The YOLO model enables real-time pedestrian detection while maintaining high accuracy; however, it performs poorly in detecting small pedestrian targets. To address this issue, the YOLO series has undergone continuous iteration and optimization. YOLOv3 [23] introduced feature pyramids and multi-scale prediction techniques, improving detection accuracy for targets of varying scales and aspect ratios. YOLOv4 [24] reduced computational complexity and improved the network's receptive field and feature representation ability by modifying the processing of input feature maps. YOLOv5 [25] further introduced techniques such as adaptive training data augmentation, Mosaic data augmentation, and the Swish activation function, significantly improving detection accuracy and inference speed. YOLOX [26] employed Mosaic and Mixup data augmentation techniques, along with innovative methods such as an anchor-free design and a decoupled head, further enhancing detection accuracy, particularly for small targets. In subsequent versions, YOLOv8 [27] integrated detection, segmentation, and tracking capabilities, enhancing the model's versatility. As the latest version, YOLOv11 [28] has undergone significant improvements in both architecture and training methods. It adopts an improved backbone and neck architecture, enhancing feature extraction capabilities, and introduces components such as Cross Stage Partial with kernel size 2 (C3k2) [20], Spatial Pyramid Pooling - Fast (SPPF) [29], and Convolutional block with Parallel Spatial Attention (C2PSA) [20], further optimizing feature extraction and processing efficiency. From YOLOv3 to YOLOv11, the YOLO series models [30,31] have been continuously optimized, gradually improving performance and practicality in object detection tasks such as pedestrian detection through the introduction of various innovative techniques and architectural improvements.
Although these methods have made some progress, surveillance video scenes are large-field environments, where pedestrians are at varying distances from the camera and appear at different scales in the image. Furthermore, pedestrian detection in real-world scenarios remains challenging due to factors such as variable illumination, occlusions, and adverse weather like rain or fog, all of which can compromise the robustness of detection systems. To address these issues, we propose a novel pedestrian detection algorithm, FA-YOLO. First, we develop a feature enhancement module (FEM) that seamlessly integrates global and local information, thereby significantly enhancing the network's ability to extract comprehensive and robust pedestrian features. This integration ensures that the model captures both contextual and detailed information, leading to improved feature representation. Second, we put forward an adaptive sparse self-attention (ASSA) module. This module is specifically designed to reduce noise interactions in irrelevant regions and mitigate feature redundancy across both the spatial and channel domains. By doing so, it enhances the model's discriminative power and ensures that the extracted features are more relevant and informative for pedestrian detection tasks. Furthermore, we enhance the model's architecture by incorporating an improved C3K2 structure. This structure enables the model to focus more effectively on critical target features, thereby improving its detection accuracy and overall performance. The improved C3K2 structure is optimized to better capture the essential characteristics of pedestrians, even in challenging scenarios. Finally, we adopt a scalable intersection over union (SIoU) loss function, which accounts for vector angle differences between predicted and ground-truth bounding boxes, leading to more precise localization. These improvements collectively enable FA-YOLO to handle multi-scale pedestrian detection while maintaining robustness against environmental challenges such as lighting variations, occlusions, and adverse weather conditions. The key contributions of this study are summarized as follows:
We propose FA-YOLO, a novel pedestrian detection algorithm that enhances YOLOv11 by integrating a feature enhancement module (FEM) and an adaptive sparse self-attention (ASSA) module to improve feature representation and reduce redundancy.
We introduce the SIoU loss function, which considers vector angle differences to enhance localization accuracy and improve the model’s ability to detect pedestrians in complex environments.
We integrate an improved C3K2 structure, C3ASSA, which enhances the network’s ability to focus on key pedestrian features, improving robustness against occlusions and scale variations.
Extensive experiments on public pedestrian detection datasets demonstrate that FA-YOLO outperforms existing methods, achieving higher detection accuracy and robustness in challenging real-world scenarios.
3. Materials and Methods
3.1. Overview
Our work builds on YOLOv11 by introducing multiple key enhancements. To enhance feature representation, we design a FEM that integrates global and local features, improving the network's ability to capture fine-grained pedestrian characteristics. Additionally, we propose an ASSA module to suppress irrelevant background noise and reduce feature redundancy in both the spatial and channel dimensions. Furthermore, we integrate an improved C3K2 structure, C3ASSA, which enhances the model's ability to focus on key pedestrian features, improving robustness against occlusions and scale variations. Moreover, we introduce the SIoU loss function, which refines the bounding box regression by addressing vector angle differences between predicted and ground-truth boxes, leading to more accurate localization. The overall architecture of FA-YOLO is illustrated in Figure 2. The backbone network serves as the primary feature extractor, capturing fundamental structures from input images. The neck module aggregates multi-scale information to ensure robustness against variations in pedestrian sizes, while the detection head generates the final classification and bounding box predictions. These improvements collectively enable FA-YOLO to achieve superior pedestrian detection performance compared to existing models. Next, we introduce the FEM, ASSA, C3ASSA, and SIoU loss function in detail.
3.2. FEM
In object detection tasks, convolutional layers primarily focus on extracting local detailed features, while the acquisition of global features is relatively limited. Furthermore, during the forward propagation of deep neural networks, feature information may encounter bottlenecks as network depth increases, leading to the gradual weakening or loss of key features. To address these issues, we propose a Feature Enhancement Module (FEM), as shown in Figure 3. This module consists of a local feature extraction branch and a Global Context Enhancement Branch (GCEB) to optimize feature representation in deep networks.
3.2.1. Local Feature Extraction Branch
This branch first utilizes a 1 × 1 convolution to adjust the number of channels of the input feature map, reducing computational redundancy while preserving the original information representation, as shown in Equation (4). Then, a 3 × 3 convolution is applied to further extract local contextual features, as shown in Equation (5). To prevent gradient vanishing or explosion, we employ residual connections and concatenate the input features with the local features to enhance feature representation, as shown in Equation (6).
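A minimal formulation consistent with this description, using illustrative symbol names ($F$ for the input feature map and $F_{1}$, $F_{2}$, $F_{\mathrm{local}}$ for the intermediate outputs), is:

$F_{1} = \mathrm{Conv}_{1\times 1}(F)$  (cf. Equation (4))

$F_{2} = \mathrm{Conv}_{3\times 3}(F_{1})$  (cf. Equation (5))

$F_{\mathrm{local}} = \mathrm{Concat}(F,\ F_{2})$  (cf. Equation (6))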
3.2.2. Global Context Enhancement Branch
To compensate for the limited receptive field of local features, we introduce the Global Context Enhancement Branch (GCEB) in parallel. This branch aggregates global information from the feature map using global average pooling (GAP), as shown in Equation (7), where H and W represent the height and width of the feature map, respectively, and F(i, j) denotes the pixel value of feature map F at position (i, j). Then, a 1 × 1 convolution is applied for channel compression, followed by a ReLU activation function to enhance the network's nonlinearity, as shown in Equation (8). Subsequently, the channel attention weights are normalized using the Sigmoid function, which maps feature values to the range of 0 to 1 and generates the attention weights, as shown in Equation (9). Finally, the input features are multiplied element-wise with the channel attention weights to achieve global feature enhancement, as shown in Equation (10), where ⊙ represents element-wise multiplication, used to combine the input features with the channel attention weights.
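A minimal formulation consistent with this description is given below; the symbol names are illustrative, and whether an additional 1 × 1 convolution precedes the Sigmoid is left as an assumption:

$g = \mathrm{GAP}(F) = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} F(i, j)$  (cf. Equation (7))

$z = \mathrm{ReLU}\big(\mathrm{Conv}_{1\times 1}(g)\big)$  (cf. Equation (8))

$w = \mathrm{Sigmoid}(z)$  (cf. Equation (9))

$F_{\mathrm{global}} = F \odot w$  (cf. Equation (10))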
3.2.3. Feature Fusion
The features from the local and global branches are fused to enhance feature representation, as shown in Equation (11). Finally, a 1 × 1 convolution is applied to map the fused features to the final output, achieving channel expansion and enhancing nonlinear feature representation. The FEM optimizes the feature extraction process in object detection tasks by combining local features with global context information, improving the network's ability to detect complex targets and effectively mitigating the issue of feature information loss.
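To make the data flow concrete, a minimal PyTorch sketch of an FEM-style block is given below. The module name, layer widths, and the use of concatenation for fusion are our own assumptions based on the description above, not the authors' implementation.

```python
import torch
import torch.nn as nn


class FEM(nn.Module):
    """Illustrative FEM-style block: a local branch and a global context branch, then fusion.

    Sketch based on the textual description; layer widths, the concatenation-based
    fusion, and all names are assumptions, not the authors' implementation.
    """

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Local feature extraction branch (in the spirit of Equations (4)-(6)).
        self.reduce = nn.Conv2d(in_channels, in_channels, kernel_size=1)            # 1x1 conv
        self.local = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)  # 3x3 conv
        # Global Context Enhancement Branch (in the spirit of Equations (7)-(10)).
        self.gap = nn.AdaptiveAvgPool2d(1)                                           # GAP
        self.squeeze = nn.Conv2d(in_channels, in_channels, kernel_size=1)            # 1x1 conv
        self.act = nn.ReLU(inplace=True)
        self.gate = nn.Sigmoid()                                                     # attention weights
        # Fusion and output projection (Equation (11) plus the final 1x1 conv).
        self.project = nn.Conv2d(3 * in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Local branch: 1x1 conv, 3x3 conv, then concatenate with the input (residual-style).
        f1 = self.reduce(x)
        f2 = self.local(f1)
        local_feat = torch.cat([x, f2], dim=1)
        # Global branch: channel attention derived from globally pooled statistics.
        w = self.gate(self.act(self.squeeze(self.gap(x))))
        global_feat = x * w
        # Fuse both branches and project to the output channel count.
        fused = torch.cat([local_feat, global_feat], dim=1)
        return self.project(fused)


if __name__ == "__main__":
    fem = FEM(in_channels=64, out_channels=128)
    out = fem(torch.randn(1, 64, 80, 80))
    print(out.shape)  # torch.Size([1, 128, 80, 80])
```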
3.3. ASSA
In complex environments, pedestrian detection is often affected by irrelevant background noise and feature redundancy, which can significantly degrade the performance of detection models. To address these challenges, we propose the ASSA module, which is designed to suppress irrelevant background noise and reduce feature redundancy in both the spatial and channel dimensions.
The ASSA module is designed to adaptively capture the most informative interactions among tokens while preserving essential information. It consists of two branches: a sparse self-attention (SSA) branch and a dense self-attention (DSA) branch. The SSA branch filters out irrelevant interactions among tokens, while the DSA branch ensures that necessary information flows through the network.
Given a normalized feature map X, where C represents the number of channels, we partition it into non-overlapping windows, resulting in a representation X_i for the i-th window. We then generate the matrices of queries Q, keys K, and values V from X, as shown in Equation (12), where the linear projection matrices are shared among all windows and d represents the dimension of the vectors. The attention computation is defined in Equation (13), where A denotes the estimated attention, B refers to a learnable relative positional bias, and f is a scoring function. The standard dense self-attention (DSA) mechanism employs a SoftMax layer to obtain the attention scores, as shown in Equation (14).
However, not all query tokens are closely relevant to the corresponding ones in the keys, making the utilization of all similarities ineffective for robust feature aggregation. To enhance feature aggregation, we develop a sparse self-attention (SSA) mechanism to select the most useful interactions among tokens, as shown in Equation (15).
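For reference, one common formulation consistent with this description is given below; the symbols and the squared-ReLU scoring function for the sparse branch follow standard sparse-attention designs and are assumptions rather than necessarily the exact expressions used here:

$Q = X W_{Q}, \quad K = X W_{K}, \quad V = X W_{V}$  (cf. Equation (12))

$\mathrm{Attention}(Q, K, V) = A V, \quad A = f\!\left(\frac{Q K^{\top}}{\sqrt{d}} + B\right)$  (cf. Equation (13))

$f_{\mathrm{DSA}}(\cdot) = \mathrm{SoftMax}(\cdot)$  (cf. Equation (14))

$f_{\mathrm{SSA}}(\cdot) = \mathrm{ReLU}^{2}(\cdot)$  (cf. Equation (15))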
To balance the sparse and dense branches, we propose an adaptive two-branch self-attention mechanism. The attention matrix in Equation (13) is updated to Equation (16), in which two normalized weights adaptively modulate the two branches; these weights are computed as in Equation (17) from learnable parameters initialized to 1 for the two branches. This design ensures a trade-off between filtering out noisy interactions from irrelevant areas and retaining enough informative features.
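Under the same assumed notation, the adaptive combination can be written as follows, with $w_{1}$ and $w_{2}$ the normalized branch weights and $\theta_{1}$ and $\theta_{2}$ the learnable parameters (names illustrative):

$A = w_{1}\, f_{\mathrm{SSA}}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + B\right) + w_{2}\, f_{\mathrm{DSA}}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + B\right)$  (cf. Equation (16))

$w_{i} = \frac{\exp(\theta_{i})}{\exp(\theta_{1}) + \exp(\theta_{2})}, \quad i \in \{1, 2\}$  (cf. Equation (17))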
3.4. C3ASSA
The C3ASSA module is an enhanced feature extraction component that replaces the Bottleneck structure in the original module with IR-ASSA, which uses an inverted residual structure, as shown in Figure 4. This module leverages the ASSA mechanism to improve the C3K2 structure, enhancing the ability to capture fine-grained pedestrian features while maintaining low computational overhead.
The IR-ASSA module first expands the input channels to twice their original size using pointwise convolution, mapping low-dimensional features to a higher-dimensional feature space. This enables subsequent operations to extract complex feature information more efficiently. Next, a 3 × 3 depthwise separable convolution is used instead of a standard convolution to extract local spatial features while significantly reducing computational complexity. Subsequently, the ASSA attention mechanism is integrated to effectively filter out irrelevant features and highlight key features, thereby enhancing the ability to capture target features. Finally, pointwise convolution is applied to compress and reduce the dimensionality of the channels, while a residual connection is introduced to prevent the loss of important global features during the dimensionality reduction process, further improving information retention and representation capability. Compared to the traditional C3K2 module, which extracts features using standard convolution, the C3ASSA module allows the model to focus on key pedestrian features, improving robustness against occlusion and scale variations.
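A minimal PyTorch sketch of such an inverted-residual block is given below; the expansion factor of 2 follows the description, the attention slot is a generic placeholder where the ASSA module would be inserted, and all names are illustrative rather than the authors' implementation.

```python
from typing import Optional

import torch
import torch.nn as nn


class IRASSA(nn.Module):
    """Illustrative inverted-residual block in the spirit of IR-ASSA.

    Sketch only: the expansion factor of 2 follows the text, the attention slot
    is a generic placeholder for the ASSA module, and all names are illustrative.
    """

    def __init__(self, channels: int, attn: Optional[nn.Module] = None):
        super().__init__()
        hidden = channels * 2  # pointwise expansion to twice the input channels
        self.expand = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(inplace=True),
        )
        # 3x3 depthwise convolution (the depthwise half of a depthwise-separable conv).
        self.depthwise = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(inplace=True),
        )
        # Slot where the ASSA attention module would go; identity stand-in here.
        self.attn = attn if attn is not None else nn.Identity()
        # Pointwise projection back to the original channel count.
        self.project = nn.Sequential(
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.expand(x)
        y = self.depthwise(y)
        y = self.attn(y)
        y = self.project(y)
        return x + y  # residual connection to retain information lost in projection


if __name__ == "__main__":
    block = IRASSA(channels=64)
    out = block(torch.randn(1, 64, 40, 40))
    print(out.shape)  # torch.Size([1, 64, 40, 40])
```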
3.5. SIoU Loss Function
The YOLOv11 network model uses the CIoU loss for bounding box regression, which only considers the aspect ratio of the bounding box and the distance between the predicted and ground-truth centers, without addressing the angle of the bounding box. However, the angle of the bounding box also affects the model's regression. To address this, the SIoU loss is introduced to resolve the vector angle issue between the ground-truth and predicted bounding boxes, reducing detection errors caused by inaccurate position or shape estimation. Ground truth refers to the actual annotation data used as a basis for comparison in machine learning models. The SIoU loss function includes the angle loss, distance loss, shape loss, and IoU loss.
The formula for the angle loss is given by Equation (18), where C represents the normalized difference between the center coordinates of the predicted and ground-truth bounding boxes, with values ranging from −1 to 1.
The formula for the distance loss is given by Equation (19), in which the Euclidean distance between the center points of the predicted and ground-truth bounding boxes is measured, and a hyperparameter controls the sensitivity of the distance loss.
The formula for the shape loss is given by Equation (20), where w represents the width difference between the predicted and ground-truth bounding boxes, h represents the height difference, and a hyperparameter adjusts the weight of the shape loss.
The Intersection over Union (IoU) [41] is calculated by dividing the area of overlap between the predicted bounding box and the ground-truth bounding box by the total area covered by both boxes. Specifically, the coordinates of the intersection rectangle, i.e., the overlapping region between the two boxes, are determined first. Then, the area of this intersection is computed. Next, the union area is calculated by adding the areas of the predicted and ground-truth boxes and subtracting the intersection area. Finally, the IoU is obtained as the ratio of the intersection area to the union area. This metric quantifies how well the predicted box matches the ground-truth box, with values ranging from 0 (no overlap) to 1 (perfect overlap).
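In its standard form, with $B_{p}$ and $B_{gt}$ denoting the predicted and ground-truth boxes:

$\mathrm{IoU} = \frac{\left|B_{p} \cap B_{gt}\right|}{\left|B_{p} \cup B_{gt}\right|}$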
From the above, the SIoU loss function is given by Equation (21). SIoU optimizes the matching accuracy of bounding boxes from multiple dimensions by introducing the angle loss, distance loss, and shape loss. It provides a more refined measurement of the relationship between predicted and ground-truth boxes, reducing localization errors and increasing convergence speed.
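For completeness, the SIoU loss in its commonly used form, which we assume corresponds to Equations (18)–(21) up to notation ($\Lambda$, $\Delta$, and $\Omega$ denote the angle, distance, and shape costs), is:

$\Lambda = 1 - 2\sin^{2}\!\left(\arcsin(\sin\alpha) - \frac{\pi}{4}\right)$

$\Delta = \sum_{t \in \{x, y\}} \left(1 - e^{-\gamma \rho_{t}}\right), \quad \gamma = 2 - \Lambda$

$\Omega = \sum_{t \in \{w, h\}} \left(1 - e^{-\omega_{t}}\right)^{\theta}$

$L_{\mathrm{SIoU}} = 1 - \mathrm{IoU} + \frac{\Delta + \Omega}{2}$

where $\alpha$ is the angle between the line connecting the box centers and the horizontal axis, $\rho_{t}$ is the normalized squared center-distance term along each axis, $\omega_{t}$ is the normalized width or height difference, and $\theta$ controls the weight of the shape cost.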
3.6. Evaluation Metrics
In this paper, four performance metrics [42]—Precision, Recall, mean average precision (mAP), and inference time—are employed to evaluate the performance of the improved network model for pedestrian detection. Specifically, Precision refers to the proportion of correctly detected target objects, while Recall measures the proportion of actual target objects identified by the algorithm. The mAP represents the average precision calculated under multiple thresholds, serving as a critical indicator of the algorithm's overall performance across multiple categories. The inference time is the average time taken by the model from the input image to the output of the detection result, which reflects the real-time performance and efficiency of the model. The Precision, Recall, and mAP values are computed as follows:
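In standard form, assuming the conventional definitions:

$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$

$AP = \int_{0}^{1} P(R)\, \mathrm{d}R, \quad mAP = \frac{1}{c} \sum_{i=1}^{c} AP_{i}$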
where TP denotes the number of targets that were correctly detected; FP denotes the number of objects that were falsely detected; FN denotes the number of targets that were missed; c is the number of classes; and AP is the average precision for a single target category, corresponding to the area under the curve in the P–R coordinate system.
4. Results and Discussion
4.1. Datasets
In this study, we used two pedestrian detection datasets from complex scenarios, which include factors such as adverse weather and dense distributions, closely resembling the challenging real-world environments captured by urban surveillance cameras.
The WiderPerson dataset [43] focuses on outdoor pedestrian detection and includes 13,382 images with 400,000 occlusion annotations. The dataset contains a large number of pedestrians, totaling 386,353, with an average of 29 pedestrians per image and varying resolutions. The RTTS dataset [44] was collected in fog, rain, and snow weather conditions, providing valuable data for researching and analyzing pedestrian detection in complex environmental scenarios. It contains a total of 4322 images covering various scenes, including roads, rural areas, and tourist spots. This diversity makes the dataset more reflective of real-world application scenarios and allows for an evaluation of the robustness and adaptability of the proposed model.
The data in these datasets are anonymized, pertain to groups of people, do not include any identifiable individual, are not derived from human specimens, and do not involve personal information. Ethical approval is not required.
4.2. Experimental Settings
The model was trained on an NVIDIA RTX 4090 GPU, and inference validation was performed on an NVIDIA RTX 4060 Ti GPU (NVIDIA, Santa Clara, CA, USA). A comprehensive breakdown of the experimental configurations is provided in Table 1.
4.3. Comparison with SOTA Models
To comprehensively evaluate the performance of FA-YOLO, we compared it against state-of-the-art pedestrian detection models. The comparison results on the two datasets [43,44] are presented in Table 2 and Table 3. FA-YOLO outperforms conventional methods across all four evaluation metrics, which can be attributed to several key enhancements.
The FEM integrates both global and local features, enabling the network to better capture fine-grained pedestrian characteristics, as reflected in the improved precision and recall. The ASSA module suppresses irrelevant background noise, reduces feature redundancy, and enhances the model’s robustness to occlusion and scale variations, as evidenced by the increase in the mAP score. Furthermore, the integration of the C3ASSA directs the model’s focus toward critical pedestrian features, further boosting detection performance. The SIoU loss function refines bounding box regression, leading to more precise localization, which is crucial for high-accuracy pedestrian detection. Despite these enhancements, FA-YOLO maintains a competitive inference speed, making it well-suited for real-world pedestrian detection applications.
Figure 5 and Figure 6 show the precision–recall curves of FA-YOLO and several baseline detectors on the WiderPerson and RTTS datasets, respectively. FA-YOLO consistently achieves higher precision across a broad range of recall values compared to traditional detectors such as SSD, Faster R-CNN, and DETR. Notably, in the high-recall region (recall > 0.7), the FA-YOLO curve remains more stable and stays closer to the ideal top-right corner, indicating greater robustness in detecting challenging or small-scale pedestrian instances. On the RTTS dataset, which includes more challenging conditions such as low visibility and occlusion, FA-YOLO again outperforms other YOLO variants and two-stage detectors. Although the performance gap is less pronounced than on the WiderPerson dataset, FA-YOLO achieves a balanced trade-off, exhibiting less degradation in precision as recall increases. These precision–recall curves demonstrate that FA-YOLO not only achieves high average precision (as shown in Table 2 and Table 3) but also delivers consistent performance across different detection thresholds, making it well-suited for safety-critical applications where high recall is required without compromising precision.
4.4. Ablation Experiments
The results of the ablation experiments conducted on the WiderPerson and RTTS datasets are shown in Table 4 and Table 5, respectively. On the WiderPerson dataset, we observed that removing the FEM led to a noticeable decline in precision, recall, and mAP@0.5. We interpreted these results as evidence that the FEM plays a critical role in improving feature representation. Similarly, the absence of ASSA resulted in a significant drop in recall and mAP@0.5:0.95. These results can be explained by supposing that ASSA contributes to suppressing irrelevant background noise and reducing feature redundancy. Furthermore, when the improved C3K2 structure (C3ASSA) was removed, a noticeable reduction in mAP@0.5:0.95 was observed. This suggests that C3ASSA plays an important role in enhancing the model's focus on key pedestrian features and improving robustness against occlusions and scale variations. Finally, eliminating the SIoU loss function caused a decrease in mAP@0.5:0.95, which we interpret as an indication of its role in refining bounding box regression and improving localization accuracy.
On the RTTS dataset, similar trends were observed. Removing the FEM led to lower precision and recall, which we interpreted as further supporting its significance for feature representation. The absence of ASSA again caused a substantial drop in recall and mAP@0.5:0.95; this result can be explained by assuming that ASSA helps to suppress background noise and reduce redundancy. Likewise, excluding C3ASSA resulted in a decrease in mAP@0.5:0.95, possibly due to its contribution to robust feature extraction under challenging conditions. Finally, removing the SIoU loss function led to a decline in mAP@0.5:0.95, which may suggest its usefulness in enhancing bounding box regression performance.
In conclusion, the ablation experiments on both datasets validated the indispensable role of each FA-YOLO component in optimizing performance. The FEM is presumed to enhance feature representation, the ASSA module mitigates background noise and feature redundancy, the C3ASSA structure strengthens robustness against occlusions and scale variations, and the SIoU loss function improves bounding box localization. These components are designed to work synergistically, enabling FA-YOLO to achieve superior pedestrian detection performance compared to existing models.
4.5. Visualisation and Analysis
Four images were selected from each dataset for visual analysis, and the results for the two datasets are shown in Figure 7 and Figure 8, respectively. The visualization results demonstrate the robustness and accuracy of our model in detecting pedestrians under diverse and challenging conditions. Specifically, even in complex environments such as foggy weather, crowded pedestrian zones, and varying lighting conditions, the model consistently identifies small-scale pedestrians and accurately annotates their bounding boxes, even when they appear at a significant distance from the camera.
A particularly noteworthy observation is that the model detected pedestrians that were not annotated in the original ground truth labels. After careful manual verification, it was confirmed that these detected pedestrians were indeed present but had been overlooked by the human annotator. This highlights an important advantage of our model: it exhibits finer detection granularity and greater sensitivity in identifying multiple objects, even those that might be difficult for human annotators to notice. Such a capability is crucial for real-world applications where missing a target, even a small one, could have significant consequences, such as in autonomous driving, surveillance, and public safety monitoring.
Additionally, the model’s ability to detect objects under occlusion and varying poses further emphasizes its robustness. In cases where pedestrians were partially occluded by other objects (e.g., vehicles, trees, or other people), the model still managed to infer their presence and accurately delineate their bounding boxes. This suggests that the model effectively captures contextual cues and learns meaningful feature representations beyond simple edge detection.
These results collectively validate the effectiveness of our model for multi-object detection in real-world environments. Unlike traditional detection approaches that may struggle with small objects, occlusions, or variations in environmental conditions, our model demonstrates enhanced adaptability and superior recall performance. This makes it highly suitable for applications requiring precise and comprehensive target identification, reinforcing its potential for deployment in practical, high-stakes scenarios.
4.6. Detection Failure Case Analysis
Although FA-YOLO outperforms mainstream object detection algorithms in both accuracy and robustness across multiple datasets, it still inevitably encounters detection failures. Understanding the limitations of the model is essential for its further development. To further optimize the algorithm's performance and guide future research, this section analyzes typical failure cases, investigates their root causes, and proposes potential directions for improvement. Examples of detection failures on the two datasets are shown in Figure 9 and Figure 10, respectively.
On the WiderPerson dataset, detection failures are mainly observed in scenarios involving densely packed pedestrians or partial occlusions. On the one hand, occlusions blur pedestrian boundaries, making it difficult to extract complete target features. On the other hand, smaller-scale pedestrians are more likely to suffer feature loss during downsampling. Furthermore, when multiple pedestrians are densely clustered, heavily overlapping feature regions often lead to target confusion or false detections. While ASSA enhances local context modeling to a certain degree, its performance remains limited in severely occluded scenarios. Future work could incorporate human body priors or adopt instance segmentation methods to assist identification, thereby further enhancing detection performance.
On the RTTS dataset, detection failures predominantly occur under nighttime or backlit conditions. Under these conditions, reduced image contrast leads to indistinct boundaries between pedestrians and the background, resulting in misclassifications or missed detections. Specifically, this manifests as blurred pedestrian silhouettes in low-light environments and the misclassification of background pseudo-targets as pedestrians. Despite employing feature enhancement strategies, the model's robustness to illumination variations remains limited. Future research could explore integrating image enhancement preprocessing or adopting style-transfer-based illumination normalization modules to improve model adaptability in complex illumination environments.
5. Conclusions
This paper proposes a novel pedestrian detection algorithm, FA-YOLO, designed to address challenges in complex real-world scenarios, such as drastic illumination changes and diverse pedestrian postures. To enhance detection performance, we design a Feature Enhancement Module (FEM) to fuse global and local features, thereby improving feature representation capability. Additionally, we propose an Adaptive Sparse Self-Attention (ASSA) mechanism to reduce noise interference from irrelevant regions and optimize information representation across the spatial and channel dimensions. In addition, the C3ASSA module further strengthens the model's focus on target features. Meanwhile, we employ the Scalable Intersection over Union (SIoU) loss function, incorporating the vector angular relationship between predicted and ground-truth boxes to improve localization accuracy.
Extensive experiments demonstrate that FA-YOLO achieves outstanding performance in pedestrian detection tasks, maintaining high accuracy and efficiency in challenging scenarios such as occlusion, scale variations, and adverse weather conditions, outperforming existing mainstream methods. Notably, FA-YOLO is also capable of detecting pedestrians missed in manual annotations, further verifying its robustness and practical value in real-world applications such as smart city surveillance and autonomous driving.
Although FA-YOLO demonstrates strong performance across multiple datasets, it still presents certain limitations. First, although the ASSA module enhances attention allocation, it introduces additional computational complexity compared to lightweight detection models, which may hinder deployment on resource-constrained edge devices. Second, the proposed method has been primarily evaluated on pedestrian datasets, and its generalizability to more diverse object detection tasks has yet to be verified. In future work, we plan to explore model compression techniques and expand evaluations to more comprehensive benchmarks, with the goal of improving the practicality and generalizability of FA-YOLO.