Figure 1.
YOLOv5 Model Architecture. The YOLOv5 framework comprises three core components: the backbone, the neck, and the head. The backbone is composed of multi-scale convolutional layers, batch normalization, and Mish activation functions for feature extraction. The C3 modules enhance feature processing through convolutional and bottleneck layers. The neck employs a feature pyramid structure and upsampling, using Concat operations to integrate features across scales, thereby improving multi-scale target detection efficiency. The head network includes three detection layers designed for classification, detection, or segmentation.
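For readers unfamiliar with the C3 block referenced above, a minimal PyTorch sketch of the standard CSP-style C3 module from the stock YOLOv5 code base is given below. It uses SiLU activations as in the released YOLOv5 implementation (the caption mentions Mish, which could be swapped in via `nn.Mish`); the expansion ratio `e` and bottleneck count `n` are the usual defaults rather than values taken from this paper.

```python
import torch
import torch.nn as nn

def conv_bn_silu(c1: int, c2: int, k: int = 1, s: int = 1) -> nn.Sequential:
    """Conv + BatchNorm + activation, the basic YOLOv5 convolution block."""
    return nn.Sequential(
        nn.Conv2d(c1, c2, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c2),
        nn.SiLU(inplace=True),
    )

class Bottleneck(nn.Module):
    """Standard YOLOv5 bottleneck: 1x1 conv, 3x3 conv, optional residual add."""
    def __init__(self, c1: int, c2: int, shortcut: bool = True):
        super().__init__()
        self.cv1 = conv_bn_silu(c1, c2, 1, 1)
        self.cv2 = conv_bn_silu(c2, c2, 3, 1)
        self.add = shortcut and c1 == c2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C3Sketch(nn.Module):
    """CSP-style C3 block: two 1x1 branches, one running n bottlenecks,
    concatenated and fused by a final 1x1 convolution."""
    def __init__(self, c1: int, c2: int, n: int = 1, e: float = 0.5):
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = conv_bn_silu(c1, c_, 1, 1)
        self.cv2 = conv_bn_silu(c1, c_, 1, 1)
        self.cv3 = conv_bn_silu(2 * c_, c2, 1, 1)
        self.m = nn.Sequential(*(Bottleneck(c_, c_) for _ in range(n)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.cv3(torch.cat([self.m(self.cv1(x)), self.cv2(x)], dim=1))
```

A call such as `C3Sketch(64, 64, n=2)(torch.randn(1, 64, 80, 80))` preserves the spatial resolution while mixing features across the two branches.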
Figure 2.
Enhanced YOLOv5 model architecture diagram. This diagram highlights the modifications made to the original YOLOv5 architecture shown in Figure 1. In the backbone, the standard downsampling convolution modules are replaced with DownC modules (DownC(1/4), DownC(1/8), DownC(1/16), and DownC(1/32)), and C3 modules are replaced with C3AFAM modules to enhance feature extraction with attention mechanisms. In the neck, EMA modules are added after each C3 module to further improve multi-scale feature integration. Additionally, an extra detection layer (Detect(1/4)) is included in the head to enhance the detection of small objects.
Figure 3.
Adaptive Fusion Attention Module. The module is composed of two primary components: the ACM and the FSM. The ACM adjusts the feature channels adaptively, enhancing the most informative channels. The FSM then applies spatial attention to the fused feature maps, focusing on the most relevant spatial information.
Figure 4.
ACM network architecture diagram. It begins with three parallel convolution layers, each scaled by its own learnable weighting coefficient. The outputs are concatenated and passed through max pooling and average pooling layers. The pooled features are then fed into an MLP to generate channel-wise attention weights.
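Because the extracted caption drops the kernel sizes and weight symbols, the PyTorch sketch below is only a loose interpretation of Figure 4: the 1×1/3×3/5×5 kernels, the learnable branch coefficients, the reduction ratio, and the way the two pooled descriptors are combined are all assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class ACMSketch(nn.Module):
    """Loose sketch of the adaptive channel module (ACM) described in Figure 4."""

    def __init__(self, in_ch: int, reduction: int = 16):
        super().__init__()
        # Three parallel convolutions (assumed 1x1, 3x3, 5x5), each scaled by a
        # learnable coefficient.
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, in_ch, k, padding=k // 2, bias=False) for k in (1, 3, 5)
        )
        self.branch_weights = nn.Parameter(torch.ones(3))
        # Shared MLP turning pooled descriptors into channel attention weights.
        self.mlp = nn.Sequential(
            nn.Linear(3 * in_ch, 3 * in_ch // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(3 * in_ch // reduction, in_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Weighted parallel branches, concatenated along the channel axis.
        fused = torch.cat(
            [w * conv(x) for w, conv in zip(self.branch_weights, self.convs)], dim=1
        )
        # Global max and average pooling, combined and fed to the MLP.
        max_desc = torch.amax(fused, dim=(2, 3))
        avg_desc = torch.mean(fused, dim=(2, 3))
        attn = torch.sigmoid(self.mlp(max_desc) + self.mlp(avg_desc))  # (N, in_ch)
        return x * attn[:, :, None, None]
```

The module is shape-preserving: for an input of shape (N, C, H, W) it returns a channel-reweighted tensor of the same shape.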
Figure 5.
Schematic diagram of the FSM. The input feature map is processed through max pooling and average pooling layers, followed by a convolutional layer. The outputs from these operations are concatenated and passed through additional convolutional layers to generate spatial attention maps.
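A minimal sketch of the FSM's spatial-attention path follows, assuming a CBAM-like layout: channel-wise max and average pooling produce two single-channel maps that are concatenated and passed through a small stack of convolutions to yield one spatial attention map. The 7×7 kernel and the intermediate channel width are assumptions.

```python
import torch
import torch.nn as nn

class FSMSketch(nn.Module):
    """Loose sketch of the fusion spatial module (FSM) from Figure 5."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Two stacked convolutions collapse the pooled maps into one attention map.
        self.conv = nn.Sequential(
            nn.Conv2d(2, 8, kernel_size, padding=kernel_size // 2, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(8, 1, kernel_size, padding=kernel_size // 2, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel-wise max and average pooling give two single-channel spatial maps.
        max_map, _ = torch.max(x, dim=1, keepdim=True)
        avg_map = torch.mean(x, dim=1, keepdim=True)
        attn = torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
        return x * attn  # attention map broadcasts over channels
```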
Figure 6.
Schematic diagram of the DownC. The input feature tensor is processed through a convolutional layer followed by a max pooling layer. The resulting feature maps are then processed through a convolutional layer with a stride of 2 and another convolutional layer. The outputs of these layers are concatenated and summed, resulting in a feature map with reduced spatial dimensions.
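The DownC block appears to follow the YOLOv7-style downsampling layout the caption describes: one branch applies a 1×1 convolution and then a stride-2 3×3 convolution, the other applies 2×2 max pooling followed by a 1×1 convolution, and the two halves are concatenated. The sketch below assumes that layout and an even channel split; the paper's exact channel widths are not stated.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in: int, c_out: int, k: int = 1, s: int = 1) -> nn.Sequential:
    """Conv + BatchNorm + SiLU helper (assumed building block)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class DownCSketch(nn.Module):
    """Loose sketch of a DownC downsampling block (Figure 6); c_out is assumed even."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.cv1 = conv_bn_act(c_in, c_in, 1, 1)
        self.cv2 = conv_bn_act(c_in, c_out // 2, 3, 2)   # stride-2 convolution branch
        self.cv3 = conv_bn_act(c_in, c_out // 2, 1, 1)   # max-pooling branch
        self.mp = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both branches output H/2 x W/2; concatenation restores c_out channels.
        return torch.cat([self.cv2(self.cv1(x)), self.cv3(self.mp(x))], dim=1)
```

For an input of shape (N, C, H, W), the output is (N, c_out, H/2, W/2), consistent with the 1/4, 1/8, 1/16, and 1/32 stages labeled in Figure 2.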
Figure 7.
Schematic diagram of the EMA. The input feature map is divided into g groups, each processed separately. For each group, the features are passed through an average pooling layer and concatenated. This is followed by sigmoid and softmax activation functions applied to the pooled features. The outputs are then normalized using group normalization. The resultant features are multiplied and summed, producing the final output feature map, which retains the input dimensions after passing through a sigmoid activation function.
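For concreteness, here is a sketch of the EMA (Efficient Multi-scale Attention) module following its publicly available reference implementation; the group count (`factor`) and any paper-specific modifications are assumptions. It mirrors the data flow in Figure 7: grouped features, directional average pooling, sigmoid/softmax gating, group normalization, and a cross-branch multiply-and-sum that produces a spatial weighting with the same output shape as the input.

```python
import torch
import torch.nn as nn

class EMASketch(nn.Module):
    """Sketch of the EMA attention module (Figure 7), after the public reference code."""

    def __init__(self, channels: int, factor: int = 8):
        super().__init__()
        self.groups = factor
        assert channels % self.groups == 0
        c = channels // self.groups
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d((1, 1))        # global descriptor
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool along width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool along height
        self.gn = nn.GroupNorm(c, c)
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)
        self.conv3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.size()
        group_x = x.reshape(b * self.groups, -1, h, w)
        # Directional (H and W) pooled descriptors, fused by a shared 1x1 conv.
        x_h = self.pool_h(group_x)
        x_w = self.pool_w(group_x).permute(0, 1, 3, 2)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        # 1x1 branch (sigmoid gating + group norm) and parallel 3x3 branch.
        x1 = self.gn(group_x * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        x2 = self.conv3x3(group_x)
        # Cross-branch interaction: softmax-normalized global descriptors of one
        # branch attend over the flattened features of the other.
        x11 = self.softmax(self.agp(x1).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        x12 = x2.reshape(b * self.groups, c // self.groups, -1)
        x21 = self.softmax(self.agp(x2).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        x22 = x1.reshape(b * self.groups, c // self.groups, -1)
        weights = (torch.matmul(x11, x12) + torch.matmul(x21, x22)).reshape(
            b * self.groups, 1, h, w
        )
        # Per-group spatial attention; the output keeps the input shape.
        return (group_x * weights.sigmoid()).reshape(b, c, h, w)
```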
Figure 8.
Baseline loss function graphs. (a) Training set loss. This graph shows the training loss for box, object, and class predictions over 500 epochs. (b) Validation set loss. This graph shows the validation loss for box, object, and class predictions over 500 epochs.
Figure 9.
Detection results of the baseline model. (a) Close-up view. Detection results of large targets in close-up scenarios. (b) Distant view. Detection results of small targets in distant scenarios. (c) Car-dominant Image. This image predominantly contains cars, with only a few instances not being detected. (d) Pedestrian-dominant Image. All instances of pedestrians in this image are successfully detected.
Figure 10.
Loss function graphs with the EMA module incorporated. (a) Training set loss. This graph shows the training loss for box, object, and class predictions over 500 epochs with the incorporation of the EMA module. (b) Validation set loss. This graph shows the validation loss for box, object, and class predictions over 500 epochs with the incorporation of the EMA module.
Figure 11.
Detection results with EMA integration. (a) Close-up view. Detection results of large targets in close-up scenarios. (b) Distant view. Detection results of small targets in distant scenarios. (c) Car-dominant image. This image predominantly contains cars, with only a few instances not being detected. (d) Pedestrian-dominant image. All instances of pedestrians in this image are successfully detected.
Figure 12.
Loss function graphs with the modified MPDIoU loss function. (a) Training set loss. This graph shows the training loss for box, object, and class predictions over 500 epochs with the modified MPDIoU loss function. (b) Validation set loss. This graph shows the validation loss for box, object, and class predictions over 500 epochs with the modified MPDIoU loss function.
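For reference, a hedged sketch of an MPDIoU-style box regression loss is shown below, following the published MPDIoU formulation (IoU penalized by the squared distances between the corresponding top-left and bottom-right corners, normalized by the squared image size). How the paper weights this term and combines it with the objectness and classification losses is not specified here, so that part is omitted.

```python
import torch

def mpdiou_loss(pred: torch.Tensor, target: torch.Tensor,
                img_w: int, img_h: int, eps: float = 1e-7) -> torch.Tensor:
    """MPDIoU-style loss sketch. Boxes are (x1, y1, x2, y2); img_w/img_h are the
    input image dimensions used to normalize the corner distances."""
    # Intersection area.
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)

    # Union area and plain IoU.
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_p + area_t - inter + eps
    iou = inter / union

    # Squared distances between the top-left and bottom-right corner pairs,
    # normalized by the squared image diagonal.
    d1 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    d2 = (pred[..., 2] - target[..., 2]) ** 2 + (pred[..., 3] - target[..., 3]) ** 2
    norm = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1 / norm - d2 / norm
    return (1.0 - mpdiou).mean()
```

With `img_w = img_h = 640` (the training resolution from Table 2), identical boxes give a loss near zero, and the penalty grows as the predicted corners drift away from the ground-truth corners.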
Figure 13.
Detection results with the modified MPDIoU loss function. (a) Close-up view. Detection results of large targets in close-up scenarios. (b) Distant view. Detection results of small targets in distant scenarios. (c) Car-dominant image. This image predominantly contains cars, with only a few instances not being detected. (d) Pedestrian-dominant image. All instances of pedestrians in this image are successfully detected.
Figure 14.
Comparison of detection results. (a) Baseline model detection results, illustrating the detection performance for each category using the baseline model. (b) Improved model detection results, showcasing the detection performance of the proposed improved model.
Figure 15.
Improved comparison charts. (a) Precision graph. This graph shows the precision values over 500 epochs for the baseline model, the model with EMA, and the model with EMA and MPDIoU. (b) Recall graph. This graph depicts the recall values over 500 epochs for the three models. (c) mAP50 graph. This graph illustrates the mAP at 50% IoU over 500 epochs. (d) mAP50-90 graph. This graph shows the mAP across IoU thresholds from 50% to 90% over 500 epochs.
Table 1.
Experimental platform configuration. The GPU configuration includes two RTX 4090 cards, each with 24 GB of memory. The CPU used is a 24 vCPU Intel(R) Xeon(R) Platinum 8352V running at 2.10 GHz. The operating system is Ubuntu 20.04, with CUDA version 11.3 and cuDNN version 8.2. PyTorch version 1.10.0+cu113 is used as the deep learning framework.
| Name | Configuration Details |
|---|---|
| GPU | RTX 4090 (24 GB) × 2 |
| CPU | 24 vCPU Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10 GHz |
| Operating System | Ubuntu 20.04 |
| CUDA | 11.3 |
| cuDNN | 8.2 |
| PyTorch | 1.10.0+cu113 |
Table 2.
Experimental parameter settings. The model was trained for 500 epochs. There are 2 categories in the dataset. The batch size used during training is 8, and the input image size is 640 × 640 pixels.
| Parameter | Configuration Details |
|---|---|
| Epoch | 500 |
| Number of Categories | 2 |
| Batch size | 8 |
| Image size | 640 × 640 |
Table 3.
Baseline model evaluation metrics. This table presents the precision, recall, and mean average precision (mAP) metrics for different classes using the baseline model. The results are shown for both mAP@50 and mAP@50-90 thresholds. The “All” category represents the overall performance across all classes, while specific metrics are provided for the “Car” and “Pedestrian” classes.
| Class | Precision | Recall | mAP50 | mAP50-90 |
|---|---|---|---|---|
| All | 0.913 | 0.836 | 0.921 | 0.647 |
| Car | 0.921 | 0.901 | 0.959 | 0.742 |
| Pedestrian | 0.905 | 0.772 | 0.883 | 0.552 |
Table 4.
Evaluation metrics for the introduced EMA module. This table presents the precision, recall, and mAP metrics for different classes after incorporating the EMA module. The results are shown for both mAP@50 and mAP@50-90 thresholds. The “All” category represents the overall performance across all classes, while specific metrics are provided for the “Car” and “Pedestrian” classes.
| Class | Precision | Recall | mAP50 | mAP50-90 |
|---|---|---|---|---|
| All | 0.939 | 0.859 | 0.938 | 0.691 |
| Car | 0.943 | 0.915 | 0.967 | 0.779 |
| Pedestrian | 0.934 | 0.803 | 0.908 | 0.604 |
Table 5.
Evaluation metrics utilizing MPDIoU. This table presents the precision, recall, and mAP metrics for different classes using the modified MPDIoU loss function. The results are shown for both mAP@50 and mAP@50-90 thresholds. The “All” category represents the overall performance across all classes, while specific metrics are provided for the “Car” and “Pedestrian” classes.
| Class | Precision | Recall | mAP50 | mAP50-90 |
|---|---|---|---|---|
| All | 0.937 | 0.862 | 0.936 | 0.695 |
| Car | 0.945 | 0.913 | 0.964 | 0.779 |
| Pedestrian | 0.928 | 0.811 | 0.908 | 0.612 |
Table 6.
Comparative experiment table. This table compares the evaluation metrics of three models: baseline, EMA, and MPDIoU. The metrics include precision, recall, and mAP at 50% and 50–90% IoU thresholds. Arrows indicate the change in performance relative to the baseline model. The EMA model improves precision, recall, and both mAP metrics, while the MPDIoU model shows a slight decrease in precision relative to the EMA variant but improvements in recall and mAP over the baseline.
| Model | Precision | Recall | mAP50 | mAP50-90 |
|---|---|---|---|---|
| Baseline | 0.913 | 0.836 | 0.921 | 0.647 |
| EMA | 0.939 ↑ | 0.859 ↑ | 0.938 ↑ | 0.691 ↑ |
| MPDIoU | 0.937 ↑ | 0.862 ↑ | 0.936 ↑ | 0.695 ↑ |
Table 7.
Comparison with YOLOv8. This table presents the comparison of various metrics between the experimental model and YOLOv8.
| Model | Precision | Recall | mAP50 | mAP50-90 |
|---|---|---|---|---|
| YOLOv8 | | | | |
| Ours | 0.937 | 0.862 | 0.936 | 0.695 |