4.1. Custom Dataset and Preprocessing
A large-scale custom dataset consisting of over 16 h of video was collected. Videos recorded from dashboard cameras during real road driving were utilized, resulting in a total of more than 11,000 images. These images were obtained by setting the time interval
T to 5 s. Manual annotation was performed on all images to label the bounding box of each driving vehicle and its brake light status; the total number of annotations exceeded 30,000. To ensure a balanced and diverse dataset, various driving scenarios were included, such as daytime, nighttime, city, highway, and tunnel environments, and precautions were taken to avoid bias toward any specific category. The numbers of images and annotations in the train and test sets for each brake light status category are given in
Table 1.
To make our custom dataset trainable with YOLOv8 and achieve robust performance, several preprocessing steps are necessary. The first crucial steps are image resizing and normalization. To maintain a consistent input size, the width and height of all images were resized to $W$ and $H$, respectively; in this study, both $W$ and $H$ were defined as 640. For image normalization, min-max normalization was applied so that all pixel values fall within a specific range, typically between 0 and 1, as follows:

$\hat{x} = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$,

where $x$ and $\hat{x}$ are the original and normalized pixel values, respectively, and $x_{\max}$ and $x_{\min}$ are the maximum and minimum pixel values of the image, respectively. In this study, $x_{\max}$ and $x_{\min}$ were set to 255 and 0, respectively, following the usual convention for 8-bit images. Furthermore, various data augmentation techniques were applied to enhance the robustness of the inference performance. Random horizontal flipping and image cropping were performed to generate variations of the collected images that could realistically occur. To improve detection performance in cases of occlusion, random black-box cutout augmentation was also applied. Finally, to ensure robustness across different camera setups, the quality of the input images was intentionally degraded using various methods. As stated in
Section 3.1, the dashboard cameras used for image acquisition are equipped with various postprocessing methods to capture high-quality images. To ensure robust performance of the trained network even with general cameras, random modifications, such as brightness changes, blur, and noise injection, were applied to the images. The details of all random modifications applied during the preprocessing stage are as follows:
Crop: zoom rate sampled from a uniform distribution over a predefined range;
Cutout: black boxes up to a fixed fraction of the image size, with a maximum of 3 boxes;
Brightness: adjustment within a predefined minimum-to-maximum range;
Blur: Gaussian blur with a bounded kernel size;
Noise: noise injected into up to a fixed percentage of pixels.
The images illustrating each modification can be found in
Figure 3.
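As an illustration, the normalization and random modification steps described above can be sketched in NumPy as follows. This is a minimal sketch, not the Roboflow pipeline used in this work: all function names and default parameter values (cutout fraction, noise fraction, and so on) are illustrative placeholders, since the exact ranges are only given in the text and Figure 3, and the Gaussian blur step is omitted (it would typically rely on an image library such as OpenCV).

```python
import numpy as np

rng = np.random.default_rng(0)

def min_max_normalize(image: np.ndarray) -> np.ndarray:
    """Scale 8-bit pixel values into [0, 1] via min-max normalization."""
    x_min, x_max = 0.0, 255.0  # conventional 8-bit bounds
    return (image.astype(np.float32) - x_min) / (x_max - x_min)

def horizontal_flip(image: np.ndarray) -> np.ndarray:
    """Mirror the image left-right."""
    return image[:, ::-1]

def cutout(image: np.ndarray, max_boxes: int = 3, max_frac: float = 0.1) -> np.ndarray:
    """Blank out up to max_boxes random black boxes to simulate occlusion."""
    out = image.copy()
    h, w = out.shape[:2]
    for _ in range(int(rng.integers(1, max_boxes + 1))):
        bh, bw = int(h * max_frac), int(w * max_frac)
        y = int(rng.integers(0, h - bh + 1))
        x = int(rng.integers(0, w - bw + 1))
        out[y:y + bh, x:x + bw] = 0
    return out

def adjust_brightness(image: np.ndarray, factor: float) -> np.ndarray:
    """Scale pixel intensities, clipping to the valid 8-bit range."""
    return np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def inject_noise(image: np.ndarray, frac: float = 0.05) -> np.ndarray:
    """Replace a random fraction of pixels with random values."""
    out = image.copy()
    n = int(out.size * frac)
    idx = rng.choice(out.size, size=n, replace=False)
    out.flat[idx] = rng.integers(0, 256, size=n, dtype=np.uint8)
    return out
```

In practice, such modifications are applied stochastically per image during training, so each epoch sees a slightly different version of the dataset.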
Figure 4 provides examples of preprocessed images. All preprocessing steps were performed using Roboflow [
41], a comprehensive platform for computer vision and image processing tasks. The preprocessed train dataset is publicly available for access [
28].
4.3. Results
In this section, the evaluation results of the trained detection models are presented, both qualitatively and quantitatively. Qualitative analysis confirmed that the trained detection models accurately detect the bounding boxes of driving vehicles and classify their brake light status (on or off) in various road environments.
Figure 5 shows some of the images used for qualitative analysis. These images are provided as pairs of two or more consecutive frames to clearly illustrate the analysis results.
Figure 5a,b represent one continuous image pair displaying the detection performance for a leading vehicle located at a close distance.
Figure 5c,d represent another continuous image pair displaying the detection performance for a leading vehicle located at a far distance. As evident from these two image pairs, the proposed model accurately detects the location and brake light status of leading vehicles regardless of their distance from the ego vehicle.
Figure 5e–h represent other continuous image pairs depicting scenarios with multiple vehicles present at a close distance. In these images, the location and brake light status of all vehicles are successfully detected. Furthermore, even in scenarios with multiple vehicles at a far distance, all vehicles are accurately detected, as demonstrated by the image pairs in
Figure 5i–l.
Qualitative analysis was conducted not only for different vehicle quantities and distances but also for various driving environments and vehicle types.
Figure 6a–d demonstrate the robust performance of the trained model in diverse driving environments, including highway, city, tunnel, and nighttime.
Figure 6e–h provide qualitative evidence that the model is able to detect various vehicle types, including passenger cars, motorcycles, buses, trucks, and special vehicles.
The evaluation of the driving vehicle and brake light status detection performance of each model that underwent transfer learning is conducted by calculating the mean average precision (mAP) on the test set. Two mAP values are calculated: mAP50 and mAP50-95. mAP50 represents the average precision at an intersection over union (IoU) threshold of 0.5. The IoU measures the overlap between the predicted bounding boxes and the ground truth labels, indicating how well the predicted boxes align with the actual objects. mAP50-95 represents the average precision over a range of IoU thresholds from 0.5 to 0.95, with a step size of 0.05. This metric provides a comprehensive assessment of the detection model’s performance across different levels of overlap. Both mAP50 and mAP50-95 are commonly used metrics to evaluate the overall performance of object detection models.
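As a concrete reference for how these metrics are grounded, the IoU between a predicted box and a ground truth box can be computed as follows. This is a generic sketch, not the evaluation code used in this work; the function name and box format are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# mAP50 counts a detection as correct when IoU >= 0.5;
# mAP50-95 averages AP over ten thresholds:
thresholds = [0.5 + 0.05 * i for i in range(10)]  # 0.50, 0.55, ..., 0.95
```

For example, two unit-overlap boxes such as (0, 0, 2, 2) and (1, 1, 3, 3) share an intersection of 1 against a union of 7, giving an IoU of 1/7, which would fail the 0.5 threshold.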
Figure 7 presents the detection performance of each trained model, showcasing the results for mAP50 and mAP50-95 in
Figure 7a,b, respectively. The detection performance for each individual class, brake light off and brake light on, is represented by blue and red bars, respectively. The overall detection performance for all classes is shown by the purple bar, with a purple line plot illustrating the trend of performance differences across models. Both mAP50 and mAP50-95 exhibit similar overall trends, although they differ in scale. As expected, the detection performance for all classes generally increases with model size. However, the YOLOv8s model shows a slightly smaller performance gain, primarily due to its lower detection performance for the brake light on class. Comparing the mAP50 values for each class, all models except YOLOv8s show higher detection performance for the brake light on class than for the brake light off class. Overall, the proposed methodology achieved mAP50 values ranging from
to
and mAP50-95 values ranging from
to
. Considering the recent benchmarking performance of MS-COCO [
35], one of the leading object detection benchmarks, with mAP50 values ranging from
to
and mAP50-95 values ranging from
to
, the proposed methodology demonstrates significant results [
43,
44]. Detailed detection performance for each model and class can be found in
Table 3.
In
Figure 7, it was observed that the detection performance for the brake light on class was generally better than that for the brake light off class. However, since the two classes are distinguished solely based on the brightness of the brake light and not the shape or form of the vehicle, it is reasonable to hypothesize that ambient illumination can affect the detection performance. To verify this hypothesis, the test dataset was split into two types based on ambient light levels: Day, representing images taken during daytime with high ambient illumination, and Night, representing images taken at night or in tunnels with low ambient illumination. The numbers of images and annotations for each type are provided in
Table 4.
Figure 8 depicts the detection performance for each class on the Day/Night split test dataset. In
Figure 8a,b, mAP50 is plotted, while in
Figure 8c,d, mAP50-95 is plotted.
Figure 8a,c show the performance on the Day test set, while Figure 8b,d show the performance on the Night test set. On the Day test set, the detection performance for the brake light off class is better than that for the brake light on class. Conversely, on the Night test set, the detection performance for the brake light on class is superior to that for the brake light off class. The brake light off class, which was well detected under high ambient light, experienced a decline in detection performance as the ambient light decreased. On the other hand, the brake light on class, which initially exhibited relatively low detection performance under high ambient light, demonstrated high detection performance when the ambient light was low. This illumination-dependent performance difference is more pronounced for the brake light off class. Detailed detection performance comparisons across ambient illumination conditions for each model and class can be found in
Table 5.
According to the detailed analysis, the performance difference attributed to the difference in ambient illumination can be explained in terms of accuracy. The overall accuracies for driving vehicle detection across all classes were in the Day test set and in the Night test set. As the ambient illumination decreased, the accuracy for driving vehicle detection slightly improved. However, the accuracies for the brake light off class decreased to in the Day test set and in the Night test set, while the accuracies for the brake light on class increased to in the Day test set and in the Night test set. The detailed analysis revealed that this performance difference is influenced by the presence of tail lights. As the ambient illumination decreases, the tail lights of the vehicles are turned on, enhancing the detection performance for driving vehicles. However, the turned-on tail lights can cause confusion with the turned-on brake lights, leading to a decrease in the detection performance for brake light off class. Consequently, the decrease in ambient illumination improves the detection performance of vehicles with the brake light turned on while deteriorating the detection performance of vehicles with the brake light turned off.
In order to validate the real-time inference performance of the trained models on edge devices, experiments were conducted to evaluate both detection accuracy and inference time. The Nvidia Jetson Nano device was utilized for this purpose. The trained models were converted to the Open Neural Network Exchange (ONNX) format, an open format that facilitates the sharing and interoperability of neural network models across different frameworks. The inference time was measured on the Jetson Nano device using the ONNX models. The measured inference times ranged from
ms to
ms, depending on the size of the model. As expected, among the proposed models, YOLOv8n, with the smallest number of parameters and computations, exhibited the fastest inference time of
ms, surpassing even human cognitive processing time; the average human cognitive response time is approximately 200 ms. While faster inference times are generally preferred, it is crucial to acknowledge that the Jetson Nano operates in a resource-constrained environment. Despite these strict limitations, YOLOv8n achieved an inference time faster than human cognitive processing, indicating that it has sufficient real-time capability. The trade-off between inference time and detection accuracy is illustrated in
Figure 9. To provide a comprehensive performance comparison, the performance on different devices, including the Jetson Nano, CPU (Intel Xeon 2.20 GHz), and GPU (Nvidia Tesla T4), was included. Detailed values can be found in
Table 6.
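The latency measurement procedure above can be sketched with a simple timing harness. This is an illustrative sketch, not the benchmarking code used in this study; the ONNX Runtime usage shown in the comment assumes a hypothetical exported model path and input shape.

```python
import time

def measure_latency_ms(run_fn, n_warmup: int = 10, n_runs: int = 100) -> float:
    """Average wall-clock latency of run_fn over n_runs calls, in milliseconds."""
    for _ in range(n_warmup):  # warm-up calls to stabilize caches and clocks
        run_fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        run_fn()
    return (time.perf_counter() - start) / n_runs * 1000.0

# With a model exported to ONNX (the path here is hypothetical), the runner
# could wrap an ONNX Runtime session:
#   import numpy as np
#   import onnxruntime as ort
#   sess = ort.InferenceSession("yolov8n.onnx")
#   dummy = np.zeros((1, 3, 640, 640), dtype=np.float32)
#   input_name = sess.get_inputs()[0].name
#   latency = measure_latency_ms(lambda: sess.run(None, {input_name: dummy}))
```

Averaging over many runs after a warm-up phase matters on devices like the Jetson Nano, where the first few inferences are typically dominated by memory allocation and kernel compilation overhead.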
Table 7 provides a detailed description of the differences between our proposed model and the key existing brake light status detection studies. The algorithms listed in
Table 7 are state-of-the-art learning-based brake light status detection algorithms. The five studies listed at the top are divided into two or more stages, involving vehicle localization and classification of brake light status [
13,
14,
15,
16,
18]. The methodologies for each stage are sequentially presented in the second column, named Proposed Work. It is important to note that these five studies only present evaluation results for brake light status classification, excluding vehicle localization; hence, evaluation metrics such as detection rate, F1 score, and accuracy were used to describe the classification performance. On the other hand, both the algorithm presented by Li et al. [
17] and our proposed algorithm perform vehicle localization and brake light status classification simultaneously in a single-stage process. The evaluation results provided encompass both vehicle localization and brake light status classification, and mAP50 was used to describe both classification and localization performance.
When comparing the two mAP50 values in
Table 7, Li et al.’s is higher than ours. However, one should be mindful that the number of classification classes and the dataset differ. The
value reported by Li et al. [
17] pertains only to the performance considering the turned on brake light status, while our value of
accounts for both the turned on and turned off brake light statuses. In terms of data and sample size, our research utilized the largest dataset. Furthermore, our dataset includes diverse environmental conditions, including daytime, nighttime, and tunnel scenarios, which were not all simultaneously considered in the other studies. By integrating this vast amount of data covering diverse conditions, our experimental results effectively represent a wide range of real-world scenarios.
Among the existing key algorithms in
Table 7, the inference time of the algorithms is only presented through the study by Li et al. [
17], which reported an impressive inference time of
ms. However, it is important to note that their experiments were conducted on a high-performance GPU (GTX-1060), which may have contributed to the rapid inference speed. In contrast, our study not only utilized a powerful GPU but also conducted experiments on an edge device with limited resources. As the experimental environments differed, a direct comparison of the inference speeds between the two algorithms is not feasible. Nonetheless, our study presented the inference time on an edge device, demonstrating real-time performance. This yields more practical research results, considering the constraints of edge devices, and emphasizes the relevance of our findings for real-world applications.