1. Introduction
A rapid increase in vehicles on the road has created challenges in traffic management, requiring accurate vehicle detection and counting for effective traffic control. Vehicle detection, which is vital for real-time estimation of the types and total counts of vehicles, involves localization to provide the locations of vehicles and classification to identify their categories [1]. Vehicle detection is a crucial application in computer vision, with wide-ranging uses in traffic monitoring, autonomous driving, security, and environmental management. Several challenges can significantly affect detection accuracy. Variations in lighting conditions can alter the appearance of vehicles in images or videos, making detection inconsistent. Obstacles such as other vehicles, buildings, or trees can occlude vehicles wholly or partially. Diversity in vehicle colors, shapes, and sizes adds further complexity. Viewing angle also affects the consistency of vehicle appearance, with perspectives such as the front, rear, and side posing distinct challenges. Image or video quality, particularly low-resolution or blurry media, can impair the precision of detection models. Finally, extreme weather conditions such as heavy rain, fog, or snow can drastically reduce visibility, further complicating the detection of vehicles.
Traditionally, methods like the histogram of oriented gradients (HOG) have been used to implement vehicle detection [2]. These traditional approaches often involve complex workflows, significant manual intervention, and lengthy processing times. To overcome these challenges, convolutional neural networks (CNNs) have been introduced, demonstrating superior performance [3,4,5]. Further advancements in this field have led to the development of region-based convolutional networks (R-CNN), Fast R-CNN, Faster R-CNN, and region-based fully convolutional networks (R-FCN), which employ a two-stage detection process: region proposals are generated first and then refined through localization and classification [6,7,8,9,10].
Given the computational complexity, slower inference speed, and resource demands of the previously mentioned models, the one-stage detection approach has gained popularity. This method identifies objects in a single pass by predicting bounding box locations and assigning labels to detected objects [11]. The Single Shot MultiBox Detector (SSD) leverages convolutional features across multiple scales to predict bounding boxes and class probabilities. RefineDet++ [12] enhances accuracy by merging improved feature extraction with precise boundary determination. The Deconvolutional Single Shot Detector (DSSD) [13] uses deconvolutional layers to restore spatial information lost during feature pooling, while RetinaNet manages class imbalance effectively with focal loss.
You Only Look Once (YOLO) stands out among single-stage detectors for its simple architecture, high accuracy, and rapid inference speed. YOLOv1 employs a 24-layer convolutional network with 2 fully connected layers, while a streamlined variant, Fast YOLO, uses 9 convolutional layers; both are built on Darknet. YOLOv2 improves upon this with batch normalization, anchor boxes, a high-resolution classifier, dimension clusters for anchor sizing, and fine-grained features to detect smaller objects, using Darknet-19 instead of the more complex VGG-16 or the less accurate GoogLeNet [14,15]. YOLOv3 further deepens the architecture and refines feature extraction using Darknet-53 coupled with residual connections, and it shifts from softmax to binary cross-entropy for more versatile box labeling [16].
YOLOv4 and YOLOv5 continue this evolution by enhancing both speed and accuracy. YOLOv4 integrates advancements in the backbone network (CSPDarknet53), a neck module for feature integration (PANet with SPP), and a predictive head based on the YOLOv3 architecture [17]. YOLOv5, built on PyTorch, simplifies the model for better user accessibility and further speeds up performance using a new CSP-Darknet53 backbone, an SPPF module, and a new CSP-PAN neck. YOLOv6 and YOLOv7 focus on reducing computational demands and improving precision, respectively, with YOLOv7 introducing auxiliary and lead heads for enhanced training and output accuracy [18,19].
YOLOv8 expands the framework to support a variety of AI tasks such as detection, segmentation, and tracking, enhancing its versatility across multiple domains. It features a modified CSPDarknet53 backbone and a PAN-FPN neck. YOLOv9 and YOLOv10 introduce cutting-edge techniques like programmable gradient information (PGI) and the generalized efficient layer aggregation network (GELAN), with YOLOv10 eliminating the need for non-maximum suppression (NMS) through an end-to-end head, facilitating real-time object detection [20,21].
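As background for YOLOv10's NMS-free design, the post-processing step it removes can be sketched as follows. This is a minimal greedy NMS in NumPy for illustration only, not the implementation used by any YOLO release:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = np.argsort(scores)[::-1]  # process highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # drop every remaining box that overlaps the kept box too much
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep
```

An end-to-end head such as YOLOv10's produces one prediction per object directly, so this per-image loop (and its latency) disappears from inference.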
This study aims to compare the performance of YOLOv8 and YOLOv10 for vehicle detection, focusing on their effectiveness in various challenging environments. Given the advancements in one-stage detection algorithms, it is crucial to evaluate how these newer models perform in terms of accuracy, speed, and computational efficiency. This analysis will help identify which model delivers superior performance and under what conditions, providing valuable insights for applications in autonomous driving, traffic management, and other related fields.
The remainder of this work is organized as follows. Section 2 thoroughly discusses the literature related to this study, providing a comprehensive overview of developments in vehicle detection technologies, with a focus on the evolution of the YOLO architectures. Section 3 details the workflow and mechanisms behind YOLOv8 and YOLOv10, elaborating on their technical specifications and implementation strategies. Finally, in Section 4 and Section 5, exploratory data analysis is conducted, and performance metrics are evaluated and discussed, highlighting the comparative strengths and weaknesses of the models.
2. Literature Review
In the field of computer vision, the ability to accurately detect and track vehicles has become increasingly vital across various applications, from traffic management to autonomous driving. Over the years, technology has evolved from basic detection algorithms to sophisticated neural networks capable of performing complex vehicle recognition tasks under a variety of conditions. This literature review explores the progression from early vehicle detection methods to the latest advancements in YOLO, highlighting significant contributions and innovations which have shaped current practices in vehicle monitoring systems.
Vehicle detection and tracking are critical components of computer vision-based monitoring systems, facilitating tasks such as vehicle counting, accident identification, traffic pattern analysis, and traffic surveillance. Vehicle detection involves identifying and locating vehicles using bounding boxes, while tracking entails following and predicting vehicle movements through trajectories [22]. Early algorithms relied on background removal and user-defined feature extraction, but these struggled with dynamic backgrounds and variable weather conditions [23]. To address these issues, Barth et al. introduced the Stixels method, which uses a color scheme to encode motion information [24]. Convolutional neural networks (CNNs) have since been adopted to overcome obstacles such as occluding objects, varying backgrounds, and processing delays, enhancing accuracy in the process [6]. Various papers have explored CNN architectures tailored to these tasks, including R-CNN, Faster R-CNN, SSD, and ResNet [7,25,26,27,28,29,30,31,32,33]. Additionally, Azimjonov and Özmen compared traditional machine learning and deep learning algorithms for road vehicle detection [11]. Vehicle tracking methods include tracking by detection using bounding boxes and appearance-based tracking focused on visual features. Among these, the combination of YOLO for detection and a CNN for tracking has shown superior performance compared with nine other machine learning models, demonstrating a robust approach for vehicle monitoring systems.
YOLO, by approaching vehicle detection as a regression task using convolutional neural networks (CNNs), has significantly improved the accuracy of detecting the locations, types, and confidence scores of vehicles. It also enhances detection speed while managing blur during movement by providing bounding boxes and class probabilities.
Figure 1 shows a comprehensive visual overview of the progressive enhancements in the YOLO architecture series from YOLOv1 to YOLOv10. YOLOv2, leveraging GPU acceleration and the anchor box technique, improved upon its predecessor in the detection, classification, and tracking of vehicles [34]. Ćorović et al. applied YOLOv3, training it on five classes, including cars, trucks, street signs, people, and traffic lights, to detect traffic participants effectively across various weather conditions [35]. YOLOv4 focused on enhancing the detection speed for slow-moving vehicles in video feeds [36], while YOLOv5 was used with an infrared camera to locate heavy vehicles in snowy conditions, facilitating real-time parking space prediction thanks to its efficient framework and fast identification capabilities [37]. Furthermore, YOLOv6 introduced an enhanced network architecture and training methods to achieve superior accuracy in real-time detection [38]. Modifications in YOLOv7 were tailored for detecting, tracking, and measuring vehicles on highways, improving decision making and tracking in urban settings [39,40]. Future iterations of YOLO are expected to incorporate features such as lane tracking and velocity estimation to further enhance the accuracy and utility of the system in diverse driving environments.
In their research, Soylu and Soylu explored the development of a traffic sign detection system using YOLOv8, which was enhanced with a spatial attention module (SAM) for better feature representation and a spatial pyramid pooling (SPP) module to capture vehicle features at multiple scales, aiming to improve road safety and efficiency in autonomous vehicles [41]. YOLOv8m, with 295 layers and an input resolution of 1280 pixels, achieved impressive accuracies of 99% mAP@50 and 83% mAP@50–95. Further optimizations could involve simplification, fine-tuning, and pruning of the model, and future enhancements may integrate natural language processing to extend YOLOv8’s capabilities. Additionally, this model has improved vehicle detection in segmented images by employing advanced feature extraction techniques such as the scale-invariant feature transform (SIFT), oriented FAST and rotated BRIEF (ORB), and KAZE, followed by training with a deep belief network classifier, achieving accuracies of 95.6% and 94.6% on the Vehicle Detection in Aerial Imagery and Vehicle Aerial Imagery from a Drone datasets, respectively. Increasing the number of vehicle categories and integrating additional features could further enhance classification accuracy [42].
In comparison, YOLOv10, although faster in post-processing due to its non-maximum suppression (NMS)-free approach, which reduces latency, tends to miss smaller objects because of its fewer parameters and lower confidence scores [43]. Each iteration of YOLO has brought enhancements in inference speed and overall performance. While earlier versions like YOLO and YOLOv3 struggled with vehicle visibility in severe weather, YOLOv5 proved effective under cold conditions, and YOLOv8 and YOLOv10 have the potential to further refine feature extraction techniques to better cope with adverse weather. These models have traditionally struggled with detecting small and occluded objects due to their fixed grid structure; however, YOLOv8 improved the detection of smaller objects through advanced feature extraction, and YOLOv10 adopted multi-resolution techniques and an anchor-free approach. While older YOLO models grappled with inaccuracies in complex scenarios at high frame rates, YOLOv8 implemented model quantization to balance accuracy with real-time performance, and YOLOv10 employs a complex architecture with lightweight layers to maintain precise detection capabilities.
In conclusion, the literature review underscores the rapid advancements in vehicle detection technologies, particularly within the YOLO series, from basic detection to sophisticated systems capable of handling complex scenarios and adverse conditions. The progression from YOLO to YOLOv10 highlights a continual improvement in speed, accuracy, and adaptability, addressing previous limitations such as poor visibility in bad weather and difficulty detecting small or occluded objects. These developments set a solid foundation for this study, which aims to delve deeper into the capabilities and performance of YOLOv8 and YOLOv10, focusing on their application in real-world vehicle detection scenarios. By comparing these models, this research seeks to identify optimal solutions which may significantly enhance the efficacy of automated systems in traffic management and autonomous driving.
4. Experimental Results
The experimental results for the YOLOv8 and YOLOv10 models, detailed in Figure 3 and Figure 4, highlight their performance across various metrics throughout the training and validation phases. Both models were trained and validated in a PyTorch environment on a high-performance system equipped with NVIDIA GPUs. The hyperparameters were meticulously fine-tuned over 1000 epochs to enhance model accuracy and efficiency. The stochastic gradient descent (SGD) optimizer was employed with an initial learning rate of 0.01 and a momentum of 0.9, complemented by a decaying learning rate schedule and specific regularization settings. Notably, the parameter groups were adjusted so that some weights experienced no decay (decay = 0.0), while the biases were subjected to L2 regularization (decay = 0.0005). This regimen helped minimize overfitting and maintain generalizability. The batch size was set to 16 to balance computational efficiency with accuracy.
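A minimal PyTorch sketch of such an optimizer configuration is given below. The grouping heuristic follows the common convention of exempting biases and normalization weights from decay; the authors' exact grouping, and the helper name `build_optimizer`, are assumptions rather than their published code:

```python
import torch
import torch.nn as nn

def build_optimizer(model, lr=0.01, momentum=0.9, weight_decay=5e-4):
    """SGD with two parameter groups: one with L2 decay, one without."""
    decay, no_decay = [], []
    for module in model.modules():
        for name, p in module.named_parameters(recurse=False):
            if not p.requires_grad:
                continue
            # biases and normalization weights are commonly exempt from decay
            if name == "bias" or isinstance(module, nn.BatchNorm2d):
                no_decay.append(p)
            else:
                decay.append(p)
    return torch.optim.SGD(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, momentum=momentum)
```

A learning rate scheduler (e.g. `torch.optim.lr_scheduler.LambdaLR`) would then supply the decaying schedule mentioned above.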
The training was executed for up to 1000 epochs, incorporating an early stopping mechanism activated if no improvement in the validation metrics was observed for 20 consecutive epochs. This strategy is crucial for preventing overfitting while striving for optimal accuracy. From the trends depicted in the figures, it is evident that as the training progressed, there was a consistent decrease in box loss, classification loss, and distribution focal loss for both the training and validation phases. These reductions correlated with improved accuracy in bounding box prediction, object classification, and confidence in prediction, respectively.
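The early stopping rule described above (stop after 20 epochs without improvement in the validation metric) amounts to the following, shown here as an illustrative sketch rather than the training framework's own implementation:

```python
class EarlyStopping:
    """Stop training when the monitored metric fails to improve for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Call once per epoch with the validation metric; returns True to stop."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0  # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop one would break out as soon as `step()` returns `True`, keeping the checkpoint from the best epoch.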
The precision, although fluctuating, showed a general upward trend, indicating gradual enhancements in the model’s accuracy for identifying vehicles. The recall metrics, while variable, displayed an overall upward trajectory, suggesting an increase in the model’s ability to detect vehicles consistently. The mean average precision (mAP) at IoU thresholds of 0.50 (mAP@50) and 0.50–0.95 (mAP@50–95) also demonstrated a notable increase throughout the training epochs, underscoring a significant improvement in the models’ overall precision and reliability at these thresholds.
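The mAP metrics above can be made concrete with a small sketch. The function below computes an all-point interpolated average precision from a precision–recall curve; mAP@50–95 then averages this AP over IoU matching thresholds from 0.50 to 0.95 in steps of 0.05. This is an illustration of the metric, not the evaluation code used in this study:

```python
import numpy as np

def average_precision(recall, precision):
    """All-point interpolated AP: area under the precision envelope vs. recall."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]  # monotone precision envelope
    idx = np.where(r[1:] != r[:-1])[0]        # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP@50-95 averages per-class AP over these IoU matching thresholds
iou_thresholds = np.arange(0.50, 1.00, 0.05)
```

A detector that reaches recall 0.5 at perfect precision and finds nothing beyond that, for instance, scores an AP of 0.5 at that IoU threshold.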
The precision–confidence curves for the YOLOv8 and YOLOv10 models, as displayed in Figure 5A,B, illustrate how the precision of vehicle detection varied with the confidence threshold for each class. These curves reveal a clear relationship between the confidence levels and the precision of detection for various vehicle classes, including bicycles, buses, cars, motorcycles, and trucks. As the confidence threshold increased, the precision of detecting each vehicle class generally increased, which reflects the models’ effectiveness in confidently predicting more accurate bounding boxes.
In Figure 5A, representing YOLOv8, the precision for all classes combined approached a perfect score (1.00) at a confidence level of 0.948. This indicates the model’s excellent capability to accurately identify and classify objects when the confidence is high. Similarly, in Figure 5B for YOLOv10, the model achieved perfect precision (1.00) at an even higher confidence threshold of 0.989, demonstrating the incremental improvements of YOLOv10 in handling high-confidence detections.
Specifically, the curve for the motorcycle class shows considerable fluctuations in precision at lower confidence levels but stabilizes and increases as the confidence level approaches the 0.6–0.8 range, highlighting a relative challenge in motorcycle detection compared with other vehicle types. On the other hand, the bus and truck categories maintained higher precision across a wider range of confidence levels, indicating robustness in detecting larger vehicle types.
These insights underscore the models’ reliability and precision across various vehicle categories at higher confidence levels, reflecting their robustness in practical applications where high confidence in detection is crucial.
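Curves like those in Figure 5 can be reproduced in outline by sweeping a confidence threshold over the detections. In this sketch the per-detection `is_true_positive` flags are assumed to come from a prior IoU matching step against the ground truth, which is not shown:

```python
import numpy as np

def precision_confidence_curve(confidences, is_true_positive, thresholds=None):
    """Precision of the detections retained at each confidence threshold."""
    conf = np.asarray(confidences, dtype=float)
    tp = np.asarray(is_true_positive, dtype=bool)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    precisions = []
    for t in thresholds:
        kept = conf >= t
        # precision is undefined when nothing is kept; report 1.0 by convention
        precisions.append(tp[kept].mean() if kept.any() else 1.0)
    return np.asarray(thresholds), np.asarray(precisions)
```

Computing the curve per class (filtering detections by predicted label first) yields the per-class traces shown in the figures.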
The recall–confidence curves for YOLOv8 and YOLOv10, as illustrated in Figure 6A,B, elucidate the trade-off between recall and confidence in the context of vehicle detection for various classes, such as bicycles, buses, cars, motorcycles, and trucks. The recall, which measures the proportion of actual positives correctly identified by the model, tended to decrease as the confidence threshold increased. This typical inverse relationship indicates that as the model is required to be more confident in its predictions, the number of actual positives it detects tends to decrease.
In both graphs, the curves for different vehicle types show distinct behaviors, reflecting their varying detection challenges. For instance, the recall for motorcycles and trucks started relatively high at lower confidence levels, suggesting these models initially detected most of these vehicles but with varying degrees of confidence. As the confidence requirements increased, the recall for motorcycles and trucks sharply decreased, indicating fewer detections meeting the higher confidence threshold.
Notably, the curve representing all classes combined shows an initial high recall at extremely low confidence levels, achieving a recall of 0.90 at a confidence level of 0.000 in both YOLOv8 and YOLOv10. This demonstrates the models’ ability to detect a high number of actual positives initially, which declined as more stringent confidence thresholds were enforced.
Furthermore, the recall for cars and buses exhibited a more gradual decline compared with bicycles, which suggests that cars and buses are easier for the models to consistently detect across a range of confidence levels. Conversely, bicycles showed a steep drop in recall as the confidence level increased, highlighting specific challenges in detecting this class with high confidence.
These observations underscore the nuances of model performance across different vehicle classes, highlighting the balance between maintaining high recall and achieving high confidence in detections. This information is crucial for practical applications where missing detections (lower recall) can be critical, such as in autonomous driving and traffic monitoring systems.
The F1–confidence curves for the YOLOv8 and YOLOv10 models, as shown in Figure 7A,B, depict the harmonic mean of the precision and recall at various confidence thresholds for different vehicle classes, including bicycles, buses, cars, motorcycles, and trucks. The F1 score serves as a balanced measure of the models’ precision and recall, providing a comprehensive metric of overall performance.
For YOLOv8, the F1 score for all classes combined peaked at 0.65 at a confidence level of 0.511, indicating a balanced trade-off between recall and precision at this threshold. This peak suggests an optimal point for detecting objects across all classes with reasonable accuracy and minimal false negatives or positives. In contrast, YOLOv10 achieved a slightly higher peak F1 score of 0.69 at the same confidence level, reflecting improvements in either precision, recall, or both compared with YOLOv8.
When examining the curves in detail, YOLOv8 showed a steady increase in F1 score as the confidence level increased from zero, reaching a plateau around the 0.30–0.50 confidence range before gradually declining as the model became excessively strict in its predictions. YOLOv10 demonstrated a similar trend but maintained a higher F1 score across most confidence levels, indicating an enhanced ability to sustain high precision and recall simultaneously.
Specifically, the F1 scores for motorcycles and trucks in YOLOv10 exhibited higher peaks compared with those in YOLOv8, suggesting particular improvements in the detection of these vehicle types. Buses and cars also showed strong performance in both models but were particularly pronounced in YOLOv10, where the F1 curve remained above those for other vehicle classes across a broader range of confidence levels.
These curves clearly demonstrate the performance gains of YOLOv10 over YOLOv8, highlighting the former’s advanced capability in balancing detection accuracy across various vehicle categories at differing levels of confidence.
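The peak-F1 operating point discussed above can be located with a short computation over the precision and recall curves. This is a sketch of the metric itself, not the plotting code behind Figure 7:

```python
import numpy as np

def best_f1_threshold(thresholds, precision, recall):
    """F1 (harmonic mean of precision and recall) per threshold, plus its peak."""
    p = np.asarray(precision, dtype=float)
    r = np.asarray(recall, dtype=float)
    # clip the denominator to avoid division by zero when p = r = 0
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
    best = int(np.argmax(f1))
    return f1, thresholds[best], f1[best]
```

The threshold returned is the natural default confidence cutoff for deployment, since it balances false positives against false negatives.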
The normalized confusion matrices for YOLOv8 and YOLOv10 provide a clear comparative view of the classification accuracy across various vehicle classes, as depicted in Figure 8 and Figure 9. These matrices show the proportion of each class which was correctly classified, along with the distribution of misclassifications.
In Figure 8, YOLOv8 shows strong performance in correctly classifying buses, with a high accuracy of 0.89, matching the performance seen in YOLOv10. For bicycles and trucks, both models achieved consistent correct classification scores of 0.62 and 0.33, respectively. Notably, YOLOv8 slightly outperformed YOLOv10 in car classification, with a score of 0.76 compared with 0.75 for YOLOv10, suggesting slightly better reliability in car detection.
However, YOLOv10, as shown in Figure 9, demonstrated a notable improvement in the classification of motorcycles, with a correct classification rate of 0.74 compared with 0.63 for YOLOv8. This improvement reflects enhancements in YOLOv10’s ability to handle the dynamic and smaller profile of motorcycles, which can be more challenging to detect accurately.
Both models also exhibited certain levels of misclassification; most noticeably, some motorcycles and trucks were misclassified as background, as highlighted by the non-trivial values in the ‘background’ row of the confusion matrix. This may indicate challenges in distinguishing these vehicles from the background under certain conditions.
Additionally, the confusion matrices revealed areas where each model could potentially improve. For example, the lower scores in truck classification by both models suggest difficulties in distinguishing trucks from other vehicle types or the background, which could be due to their varied sizes and overlapping features with other large vehicles.
Overall, YOLOv10 shows a trend toward better overall accuracy, particularly in the challenging motorcycle category, while maintaining competitive performance in other categories when compared to YOLOv8.
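The normalized matrices read as follows in code. This sketch places true classes on rows and normalizes each row by the number of ground-truth instances of that class; note that some tools instead normalize per predicted-class column, so the orientation used in Figure 8 and Figure 9 should be checked against their axis labels:

```python
import numpy as np

def normalized_confusion_matrix(y_true, y_pred, num_classes):
    """Confusion matrix with each true-class row normalized to sum to 1."""
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    # leave rows with no ground-truth instances as zeros instead of dividing by 0
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)
```

The diagonal entries of the result are the per-class correct classification rates (e.g., the 0.89 reported for buses), while off-diagonal entries show where the misclassifications go.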
5. Discussion
The comparative analysis between YOLOv8 and YOLOv10 for vehicle detection underscores the distinct performance characteristics inherent to each model. YOLOv10 demonstrated superior performance in detecting bicycles and trucks, with correct classification values of 0.87 and 0.33 compared with YOLOv8’s lower scores of 0.75 and 0.29, respectively. This enhancement suggests that YOLOv10 may incorporate optimizations and architectural improvements which are more adept at capturing the features of bicycles and trucks, possibly due to advanced feature extraction techniques or more effective handling of small or complex object geometries.
Conversely, YOLOv8 excelled in the detection of cars, achieving a higher classification accuracy of 0.82 against YOLOv10’s 0.79. This indicates that YOLOv8 may still retain certain advantages, particularly in recognizing features typical of passenger vehicles, which could stem from more targeted training or better tuning of model parameters for car-like shapes and sizes.
For the detection of buses and motorcycles, both models performed equivalently well, with each achieving a classification accuracy of 0.89 for buses and 0.74 for motorcycles. This parity in performance for these vehicle types suggests that the foundational elements of the YOLO architecture, present in both versions, are sufficiently robust for detecting these classes, indicating that either model may be effective in scenarios where buses and motorcycles are prevalent.
The observed discrepancies in model performance can be attributed to several factors. YOLOv10’s enhancements in handling bicycles and trucks may result from the incorporation of new layers or tuning strategies which improve the model’s ability to generalize from training data to real-world scenarios, particularly in handling objects with varying aspect ratios or in cluttered environments. On the other hand, the slight edge YOLOv8 holds in car detection could be linked to its potentially less complex but more focused approach to feature relevance, which may benefit the detection of more uniform and larger objects like cars.
These results suggest that YOLOv10 is generally more versatile, especially for detecting more challenging object classes such as bicycles and trucks, while YOLOv8 might be favored in applications where car detection is critical. Given the performance equivalency in detecting buses and motorcycles, the choice between these models could be determined by specific application needs, computational constraints, or the availability of training data.
Limitations of This Study
While this research provides a detailed comparative analysis of YOLOv8 and YOLOv10 in vehicle detection, several areas remain unexplored and can be addressed in future studies. Notably, the impact of varying environmental conditions, such as lighting and weather changes, on model performance was not assessed. Future research could investigate how these models perform in different scenarios, such as at nighttime or in adverse weather, which are critical for real-world applications like autonomous driving. Additionally, this study did not examine the computational efficiency of the models in real-time processing environments, an important factor for deployment in onboard vehicle systems. Further analysis could also evaluate the models on a larger and more diverse dataset covering a wider variety of vehicle types and more complex urban scenarios, which would help in understanding their scalability and robustness. Finally, integrating newer techniques, such as transfer learning, or exploring the impact of different training strategies might reveal additional insights into optimizing model performance for specific vehicle detection tasks.
In conclusion, the selection between YOLOv8 and YOLOv10 should consider these nuanced performance differences, aligning model capabilities with specific detection requirements and operational environments.