Article

Distracted Driving Behavior Detection Algorithm Based on Lightweight StarDL-YOLO

1 School of Automation, Huaiyin Institute of Technology, Huaian 223003, China
2 Intelligent Energy Research Institute, Huaiyin Institute of Technology, Huaian 223003, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3216; https://doi.org/10.3390/electronics13163216
Submission received: 15 July 2024 / Revised: 12 August 2024 / Accepted: 13 August 2024 / Published: 14 August 2024

Abstract

Distracted driving is one of the major factors leading drivers to ignore potential road hazards. In response to the challenges of high computational complexity, limited generalization capacity, and suboptimal detection accuracy in existing deep learning-based detection algorithms, this paper introduces a novel approach called StarDL-YOLO (StarNet-detectlscd-yolo), which builds on an enhanced version of YOLOv8n. First, StarNet is integrated into the backbone of YOLOv8n, significantly improving the feature extraction capability of the model while markedly reducing computational complexity. Next, the Star Block is incorporated into the neck network to form the C2f-Star module, which offers lower computational cost. Additionally, shared convolution is introduced in the detection head to further reduce the computational burden and parameter size. Finally, the Wise-Focaler-MPDIoU loss function is proposed to strengthen detection accuracy. The experimental results demonstrate that StarDL-YOLO significantly improves the efficiency of distracted driving behavior detection, achieving an accuracy of 99.6% on the StateFarm dataset. Moreover, the parameter count of the model is reduced by 56.4%, and its computational load is decreased by 45.1%. Generalization experiments on the 100-Driver dataset further show that the proposed scheme generalizes better than YOLOv8n. Therefore, the algorithm significantly reduces computational load while maintaining high reliability and generalization capability.

1. Introduction

Distracted driving behaviors include making or receiving phone calls, talking to passengers, drinking water and other behaviors that distract the driver’s attention while driving. With the dramatic increase in the number of cars in recent years, distracted driving has become a widespread and increasingly serious global problem, which not only poses a major threat to the safety of drivers and passengers, but also profoundly affects the overall safety of road traffic. This phenomenon has aroused widespread concern. With the advancement of technology, more and more vehicles are equipped with L2- or even L3-level autonomous driving capabilities, which can quickly take control of the vehicle in emergency situations to reduce the incidence of accidents. However, effectively recognizing distracted driving behaviors is a prerequisite for assisted driving. Therefore, a precise and efficient algorithm for detecting distracted driving behaviors holds paramount significance in enhancing road safety and mitigating the occurrence of traffic accidents.
Current research has analyzed natural driving data collected from vehicles [1]. This approach primarily uses sensors to measure vehicle speed, acceleration, and steering wheel rotation angle to assess the driving state [2]. However, it suffers from poor real-time performance and cannot accurately judge the driving state in real time. Alternatively, detecting driver physiological parameters with wearable devices [3] offers high accuracy, but the devices can interfere with driving operations. Presently, the primary method to detect distracted driving involves computer vision [4], which uses cameras to identify distracted driving behaviors.
In early distracted driving detection, traditional handcrafted features were used through various methods to detect distracted driving. Sharma et al. [5] (2012) conducted a multi-scale scale-invariant feature transform, followed by feature sampling to simulate the contribution of each image region to classification, and then used a Support Vector Machine (SVM) to classify the weighted heat maps. Zhao et al. [6] (2012) employed an unsupervised method to separate the driver region from the image background, extract the contour information of the driver, and use a classifier to judge driving behavior and detect distracted driving. Guo et al. [7] (2014) utilized color and shape information for detecting driving behavior. Yan et al. [8] (2014) combined motion history images and pyramid histograms of oriented gradients for driver behavior recognition.
In recent years, deep learning has made significant progress in detecting distracted driving behaviors. However, most existing methods still suffer from high computational complexity, poor generalization ability and insufficient average precision, which pose a significant barrier to their widespread adoption, particularly on resource-constrained embedded devices [9]. In order to address these challenges, this paper proposes an improved algorithm based on the YOLOv8n model: StarDL-YOLO. The main contributions and methods of this paper are summarized as follows:
  • The StarDL-YOLO algorithm integrates StarNet [10] with the YOLOv8n backbone. StarNet utilizes star operations to efficiently process high-dimensional features within a low-dimensional computational space, enhancing the model’s ability to focus on distracted driving behaviors and improving its feature extraction capabilities.
  • The C2f module of YOLOv8n is improved by replacing its bottleneck structure with the Star Block, forming the C2f-Star module. This improvement not only increases the detection precision, but also optimizes the operational efficiency of the model.
  • Shared convolution is added in the detection head to construct the LSCD (Lightweight Shared Convolutional Detection head), significantly lowering computational load.
  • The Wise-Focaler [11]-MPDIoU [12] loss function is proposed to optimize the bounding box regression task. This loss function not only enhances regression performance across a spectrum of distracted driving behaviors, ranging from simple to challenging, but also improves detection for behaviors exhibiting significant variations in width and height.

2. Related Works

In recent years, deep learning has achieved great success in various fields, including the detection of distracted driving behaviors. Koesdwiady et al. [13] (2017) utilized VGGNet (Visual Geometry Group Network) for distracted driver detection; however, the classes of distracted driving identified by this method are too few to cover most current distracted driving behaviors. Xing et al. [14] (2019) used a Gaussian mixture model to segment the original image of the driver and trained AlexNet, GoogLeNet, and ResNet50 models using transfer learning, achieving accuracy rates of 78.6%, 74.9%, and 81.6%, respectively, in the task of classifying distracted behaviors, thereby analyzing the feasibility of classical convolutional networks and transfer learning for this classification task. Hesham et al. [15] (2019) used their proposed algorithm to capture the face and hands, extracting features with AlexNet, and achieved high detection accuracy on the AUC dataset; their work also created a new distracted driving dataset, making a great contribution to subsequent distracted driving behavior detection research. Hu et al. [16] (2019) improved the recognition of distracted driving behaviors by applying transfer learning to neural networks; their work built and used the SEU-DRIVING dataset, but the dataset was not collected in a real vehicle environment. Zhang [17] (2020) improved the pose estimation algorithm OpenPose, feeding the algorithm's output skeleton confidence maps, skeletal affinity fields, and original images into the VGG19 deep learning network to classify distracted and abnormal driving behaviors. Bai et al. [18] (2020) used graph convolutional networks to extract driver posture features and combined them with object detection algorithms to classify distracted driving behaviors, achieving 90% accuracy on the StateFarm dataset. Tran et al. [19] (2020) proposed a dual-camera-based driver behavior detection system, where data fusion from dual cameras outperformed single-camera input. Li et al. [20] (2021) proposed a lightweight convolutional neural network, OLCMNet, incorporating an SE module to reduce model parameters while improving classification accuracy. Yin et al. [21] (2022) proposed a distracted driving behavior detection model based on a 2D pose estimation algorithm; although the accuracy of the model on the StateFarm dataset reached 95.77%, its accuracy on a self-built actual driving set decreased by 10.35%. Zhang [22] (2023) introduced an improved YOLOv5s network model, mainly using the MobileNet-v3 structure to replace the backbone network, adding an SPPF structure, and combining an attention mechanism with the bottleneck structure to detect dangerous driving behaviors; however, that work does not consider the influence of lighting, weather, and other factors on the algorithm. Peng et al. [23] (2023) proposed a method based on a spatiotemporal dual-stream deep learning network and a causal and-or graph for distracted behavior recognition; this method processed continuous frames to identify distracted driving behaviors, substantially surpassing other state-of-the-art methods. Li et al. [24] (2023) integrated YOLOv5 with HRNet, achieving accuracies of 96.17% on the AUC dataset and 96.97% on a self-constructed dataset, thereby significantly outperforming traditional methods. Lou et al. [25] (2023) proposed an improved algorithm based on YOLOv5s, integrating it with GhostNet to reduce computation and parameter count. Additionally, the CBAM attention mechanism was incorporated into the C3 module to enhance model performance. On a self-constructed distracted driving behavior dataset, the proposed model demonstrated a 1.5% improvement in accuracy and a reduction of 7.6 GFLOPs in computational load compared to YOLOv5s. Du et al. [26] (2023) introduced an improved algorithm based on YOLOv8n, named YOLO-LBS. The study integrated the GhostBottleneck with the C2f module to form GhostC2f and replaced the PAN structure in YOLOv8n with a BiFPN (Bidirectional Feature Pyramid Network). This modification resulted in a 5.1% improvement in model accuracy while significantly reducing computational load. He et al. [27] (2024) presented a MobileViT-CA model for detecting distracted driving in commercial vehicles. The study, based on the MobileViT network, identified MobileViT-CA as the optimal recognition model through extensive experimentation. It achieved an accuracy of 96.57% on a self-constructed dataset, demonstrating advantages in both model size and detection accuracy.

3. Materials and Methods

YOLOv8 [28], the most recent advancement in the Ultralytics YOLO series [29] of object detection algorithms, incorporates a wide array of optimizations and enhancements derived from its predecessors, with the goal of substantially improving both performance and adaptability. YOLOv8 offers five different scale versions: YOLOv8n, YOLOv8s [30], YOLOv8m, YOLOv8l, and YOLOv8x. These models share the same basic architecture but differ in depth and width. Among them, YOLOv8n is the fastest in detection speed and has the smallest number of parameters. The method proposed in this paper, StarDL-YOLO, is also based on improvements made to YOLOv8n, as shown in Figure 1.

3.1. Modified Backbone with StarNet

The original backbone network of YOLOv8n consists of a series of convolutional layers and pooling layers. In order to further improve the performance of the model in target detection tasks, enhance feature extraction capabilities, and reduce the amount of computation and parameters, this paper replaces the original backbone with StarNet. StarNet utilizes star operations, which show great potential by efficiently capturing complex, high-dimensional, and nonlinear feature spaces from low-dimensional input spaces. In a single-layer neural network, StarNet consolidates the weight matrix and bias into a single entity $\mathbf{W} = \begin{bmatrix} W \\ B \end{bmatrix}$, where $W$ represents the weight part and $B$ represents the bias term. Correspondingly, the input vector $X$ is expanded to include a constant term (typically 1), resulting in a new input matrix $\mathbf{X} = \begin{bmatrix} X \\ 1 \end{bmatrix}$. Through this arrangement, StarNet implements the star operation, specifically $(\mathbf{W}_1^{\mathrm{T}}\mathbf{X}) * (\mathbf{W}_2^{\mathrm{T}}\mathbf{X})$. To simplify the analysis, we first focus on the scenario with a single input and a single output, defining $w_1, w_2, x \in \mathbb{R}^{(d+1)\times 1}$, where $d$ represents the number of input channels. This can be easily extended to multiple output channels, with $W_1, W_2 \in \mathbb{R}^{(d+1)\times n}$. In summary, the star operation can be expressed as
$$w_1^{\mathrm{T}} x * w_2^{\mathrm{T}} x = \left( \sum_{i=1}^{d+1} w_1^{i} x_i \right) * \left( \sum_{j=1}^{d+1} w_2^{j} x_j \right) = \sum_{i=1}^{d+1} \sum_{j=1}^{d+1} w_1^{i} w_2^{j} x_i x_j = \alpha_{(1,1)} x_1 x_1 + \cdots + \alpha_{(4,5)} x_4 x_5 + \cdots + \alpha_{(d+1,d+1)} x_{d+1} x_{d+1} \tag{1}$$
$$\alpha_{(i,j)} = \begin{cases} w_1^{i} w_2^{j}, & i = j, \\ w_1^{i} w_2^{j} + w_1^{j} w_2^{i}, & i \neq j. \end{cases}$$
Here, i and j index the channels, and α denotes the coefficients for each item.
The star operation can ultimately be expanded into $\frac{(d+2)(d+1)}{2}$ distinct item combinations, as shown in Equation (1). Next, by stacking multiple layers, the hidden dimensions can be recursively increased to approach infinity. Assuming the initial network layer width is $d$, applying the star operation once yields $\sum_{i=1}^{d+1} \sum_{j=1}^{d+1} w_1^{i} w_2^{j} x_i x_j$, which corresponds to an implicit feature space representation in $\mathbb{R}^{\left(\frac{d}{\sqrt{2}}\right)^{2^{1}}}$. Let $S_n$ denote the star operation output of the $n$-th iteration:
$$\begin{aligned} S_1 &= \sum_{i=1}^{d+1} \sum_{j=1}^{d+1} w_{(1,1)}^{i} w_{(1,2)}^{j} x_i x_j \\ S_2 &= W_{2,1}^{\mathrm{T}} S_1 * W_{2,2}^{\mathrm{T}} S_1 \\ S_3 &= W_{3,1}^{\mathrm{T}} S_2 * W_{3,2}^{\mathrm{T}} S_2 \\ &\;\;\vdots \\ S_n &= W_{n,1}^{\mathrm{T}} S_{n-1} * W_{n,2}^{\mathrm{T}} S_{n-1} \end{aligned}$$
By stacking multiple star-operation layers, even just a few, the latent dimensions can be amplified exponentially. Based on this concept, the Star Block is designed and further developed into StarNet, as shown in Figure 2. This simple yet powerful model, supported by a minimal network structure and efficient star operations, achieves high real-time performance and accuracy, making it highly suitable for the problem addressed in this paper.
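To make Equation (1) concrete, the following minimal sketch (an illustrative example, not the authors' code) numerically verifies that a single star operation equals the explicit sum over all pairwise terms $w_1^{i} w_2^{j} x_i x_j$; the channel count and random weights are arbitrary assumptions.

```python
# Verify the star-operation expansion: (w1^T x) * (w2^T x) equals the sum of all
# pairwise terms w1_i * w2_j * x_i * x_j over the (d+1)-dimensional extended input.
import numpy as np

d = 4                                              # number of input channels (arbitrary)
rng = np.random.default_rng(0)
w1 = rng.normal(size=d + 1)                        # weights with the bias term folded in
w2 = rng.normal(size=d + 1)
x = np.append(rng.normal(size=d), 1.0)             # input extended with the constant 1

star = (w1 @ x) * (w2 @ x)                         # the star operation: product of two linear maps

expanded = sum(w1[i] * w2[j] * x[i] * x[j]         # explicit double sum from Equation (1)
               for i in range(d + 1) for j in range(d + 1))

print(np.isclose(star, expanded))                  # True: one multiplication spans all pairwise terms
```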
In the field of image recognition, the continuous increase in the depth of neural networks and in the number of feature map channels has significantly improved network performance. However, this also leads to a large amount of redundant information and increased computing costs. The same issue applies to the YOLO family of models, whose size and computational cost necessitate deployment on powerful hardware. To enhance the efficiency of detecting distracted driving behaviors, this paper introduces a lightweight backbone network, StarNet, which substantially reduces computation and parameter count. Specifically, this paper substitutes the corresponding stages of the YOLOv8n backbone with the four stages of StarNet.

3.2. The Proposed C2f-Star Module

The C2f module is a pivotal component in YOLOv8, introduced as an enhancement over the C3 module in YOLOv5. The purpose is to preserve the lightweight architecture of the model while promoting richer gradient flow information. The C2f module in YOLOv8n incorporates multiple bottleneck structures, effectively addressing issues like gradient vanishing and exploding, which in turn bolsters the performance of the network and facilitates the utilization of deeper architectures. However, due to the excessive stacking of bottleneck structures within the C2f module, redundant or irrelevant feature information may be present in the feature maps. This redundancy will not only increase the computational complexity of the model, but also affect the recognition accuracy. The star block has the same capabilities as the bottleneck, achieving comparable or superior recognition performance at a reduced computational cost. In this paper, the original bottleneck components within the C2f module are replaced with Star Blocks, illustrated in Figure 3.
The Star Block begins with a convolution operation on the input data, followed by two fully connected operations, one of which undergoes ReLU activation. The outputs of these two branches are then combined using the star operation (element-wise multiplication). Finally, the result passes through another convolution and is combined with the initial input data through a residual connection to yield the final output.
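For reference, below is a minimal PyTorch sketch of a Star Block following the description above. It is an approximation for illustration only: the 7 × 7 depthwise kernels, ReLU6 activation, and 4× channel expansion are assumptions borrowed from the StarNet design rather than values stated in this paper.

```python
import torch
import torch.nn as nn

class StarBlock(nn.Module):
    """Sketch of a Star Block: conv -> two linear branches -> star (element-wise product) -> conv -> residual."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.dw1 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)   # depthwise conv on the input
        self.fc1 = nn.Conv2d(dim, dim * expansion, 1)              # first "fully connected" branch (1x1 conv)
        self.fc2 = nn.Conv2d(dim, dim * expansion, 1)              # second branch
        self.act = nn.ReLU6()                                      # activation applied to one branch only
        self.fc3 = nn.Conv2d(dim * expansion, dim, 1)              # project back to dim channels
        self.dw2 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)   # final conv before the residual add

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        x = self.dw1(x)
        x = self.act(self.fc1(x)) * self.fc2(x)    # star operation: element-wise multiplication
        x = self.dw2(self.fc3(x))
        return identity + x                        # residual connection with the block input

# In a C2f-Star-style module, blocks like this would stand in for the original bottlenecks.
y = StarBlock(64)(torch.randn(1, 64, 80, 80))
print(y.shape)   # torch.Size([1, 64, 80, 80])
```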
The C2f-Star module makes use of the star operations in the Star Block and abandons the original bottleneck, thereby reducing redundant computation and model size. Moreover, the Star Block can derive a high-dimensional feature space from a low-dimensional input, which greatly improves the ability to extract distracted driving features. This makes the model more efficient and practical for real-world applications.

3.3. LSCD Detection Head with Shared Convolutions

In the YOLO series, the detection head typically comprises three branches, each handling information about the same object at different scales. Traditionally, these branches operate independently, potentially leading to the inefficient use of model parameters and susceptibility to overfitting. To address these issues, this paper introduces the concept of shared convolutions within the detection head. The structure of the module is shown in Figure 4. As shown in the figure, small, medium-sized, and large targets are typically processed through three separate convolutional layers, each with its own set of parameters, which greatly increases the computational load. In detecting distracted driving behaviors, the LSCD detection head effectively reduces parameter count, minimizes model complexity, and enhances computational performance. Additionally, parameter sharing facilitates the discovery of the global optimal solution and helps to avoid the risk of overfitting.
To address the issue of inconsistent target scale processing across individual detection heads, this paper proposes the incorporation of a scale layer within each detection head. These modules, equipped with specialized convolutional layers, are tasked with predicting bounding box coordinates and class probabilities for their respective feature maps. Following this, the predictions are harmoniously consolidated into a comprehensive tensor, ensuring consistent and efficient target detection across varying scales.
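The following PyTorch sketch illustrates the idea of a shared-convolution detection head with a per-scale Scale layer. It is a simplified illustration under assumptions (a common 128-channel input for all three pyramid levels, two shared 3 × 3 convolutions, SiLU activations, and ten classes), not the exact LSCD implementation.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Learnable per-branch scalar that compensates for sharing weights across scales."""
    def __init__(self, init: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.scale

class SharedConvHead(nn.Module):
    def __init__(self, ch: int = 128, num_classes: int = 10, reg_ch: int = 64):
        super().__init__()
        # One convolution stack reused by all three pyramid levels instead of three separate stacks.
        self.shared = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
        )
        self.cls = nn.Conv2d(ch, num_classes, 1)                 # shared classification conv
        self.reg = nn.Conv2d(ch, reg_ch, 1)                      # shared box-regression conv
        self.scales = nn.ModuleList(Scale() for _ in range(3))   # one Scale layer per detection level

    def forward(self, feats):
        outs = []
        for f, scale in zip(feats, self.scales):
            f = self.shared(f)                                   # same weights for every scale
            outs.append(torch.cat([scale(self.reg(f)), self.cls(f)], dim=1))
        return outs

feats = [torch.randn(1, 128, s, s) for s in (80, 40, 20)]        # P3/P4/P5-style feature maps
print([o.shape for o in SharedConvHead()(feats)])
```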

3.4. Improved Wise-Focaler-MPDIoU Loss Function

Bounding box regression serves as a pivotal component in object detection, playing a crucial role in achieving the precise localization of targets. Moreover, the effectiveness of target localization heavily depends on the choice of bounding box regression loss function. This study proposes the Wise-Focaler-MPDIoU algorithm, which integrates the strengths of MPDIoU and Focaler-IoU techniques.
MPDIoU improves optimization by enhancing alignment between predicted and ground truth bounding boxes, addressing scenarios where the predicted and ground truth boxes share the same aspect ratio but differ significantly in width and height. In contrast, Focaler-IoU prioritizes optimizing bounding box regression across various difficulty distributions, further enhancing detection performance by focusing on harder-to-detect objects. Detecting distracted driving behavior poses a unique challenge, as it involves identifying actions rather than objects with distinct boundaries or uniform sizes, and there is no universally defined bounding box that accurately captures such behaviors. MPDIoU effectively improves the accuracy of the bounding box. However, some actions involve rapid movements of the human body, such as reaching behind, which can compromise accuracy; in such instances, Focaler-IoU proves effective in enhancing the recognition precision of these dynamic actions. The Wise-Focaler-MPDIoU loss function is detailed below.
Typically, a rectangle can be defined by two points (its top–left and bottom–right corners), and MPDIoU is computed from the minimum point distances between the corresponding corners of the two boxes, which implicitly accounts for overlap area, center point distance, and width and height deviations. Let w and h denote the width and height of the input image. Then $d_1^2$ denotes the squared distance between the top–left corners of rectangles A and B, while $d_2^2$ denotes the squared distance between their bottom–right corners.
$$\mathrm{MPDIoU} = \frac{A \cap B}{A \cup B} - \frac{d_1^2}{w^2 + h^2} - \frac{d_2^2}{w^2 + h^2}$$
The MPDIoU bounding box regression loss function is defined as
$$L_{\mathrm{MPDIoU}} = 1 - \mathrm{MPDIoU} = 1 - \frac{A \cap B}{A \cup B} + \frac{d_1^2}{w^2 + h^2} + \frac{d_2^2}{w^2 + h^2}$$
The formula for Focaler-IoU is as follows:
$$\mathrm{IoU}^{focaler} = \begin{cases} 0, & \mathrm{IoU} < d \\ \dfrac{\mathrm{IoU} - d}{u - d}, & d \leq \mathrm{IoU} \leq u \\ 1, & \mathrm{IoU} > u \end{cases}$$
By adjusting the values of d and u, Focaler-IoU can focus on different regression samples. The loss function is defined as follows:
$$L_{\mathrm{Focaler\text{-}IoU}} = 1 - \mathrm{IoU}^{focaler}$$
In this paper, Focaler-IoU and MPDIoU are combined to enhance the accuracy of the model. The loss function is defined as follows:
$$L_{\mathrm{Focaler\text{-}MPDIoU}} = L_{\mathrm{MPDIoU}} + \mathrm{IoU} - \mathrm{IoU}^{focaler}$$
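For reference, the sketch below combines MPDIoU and Focaler-IoU as in the formulas above. It is an illustrative approximation under assumptions: boxes are in (x1, y1, x2, y2) format, the corner distances are normalized by the input-image diagonal, the Focaler thresholds are d = 0 and u = 0.95, and the "Wise" re-weighting component is omitted.

```python
import torch

def focaler_mpdiou_loss(pred, target, img_w, img_h, d=0.0, u=0.95, eps=1e-7):
    """Sketch of a combined Focaler-MPDIoU loss for (x1, y1, x2, y2) boxes."""
    # Plain IoU of predicted and ground-truth boxes.
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distances between matching corners, normalized by the image diagonal.
    diag2 = img_w ** 2 + img_h ** 2
    d1 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2 = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2
    mpdiou = iou - d1 / diag2 - d2 / diag2
    l_mpdiou = 1.0 - mpdiou

    # Focaler re-mapping of IoU onto the interval [d, u].
    iou_focaler = ((iou - d) / (u - d)).clamp(0.0, 1.0)

    # Combined loss: L_MPDIoU + IoU - IoU^focaler.
    return (l_mpdiou + iou - iou_focaler).mean()

pred = torch.tensor([[10.0, 10.0, 110.0, 90.0]])
gt = torch.tensor([[12.0, 14.0, 108.0, 96.0]])
print(focaler_mpdiou_loss(pred, gt, img_w=640, img_h=640))
```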

4. Experiments and Results

In this section, we first introduce the software and hardware setup of the experiment. Next, we describe the datasets used in our experiments and define the evaluation metrics for distracted driving detection. Subsequently, we compare the modified backbone against other mainstream backbone networks, and then conduct ablation experiments to verify the effectiveness of the proposed method. Finally, we compare StarDL-YOLO with models proposed in recent years and conduct experiments to test the generalization of our model.

4.1. Experimental Environment and Parameter Settings

Experimental platform: Windows 11, NVIDIA GeForce RTX 4080 Super (16 GB), Python 3.11, PyTorch 2.1.0, CUDA 12.1. Parameter settings: 150 epochs, batch size of 64, image size of 640 pixels, SGD optimizer.
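As an illustration of these settings, the snippet below shows how such a run could be launched with the Ultralytics API; the configuration file names ("stardl-yolo.yaml" and "statefarm.yaml") are hypothetical placeholders standing in for the modified model definition and the dataset description, not files released by the authors.

```python
from ultralytics import YOLO

# Hypothetical model config carrying the StarNet / C2f-Star / LSCD modifications.
model = YOLO("stardl-yolo.yaml")

model.train(
    data="statefarm.yaml",   # hypothetical dataset config (train/val paths, 10 classes)
    epochs=150,
    batch=64,
    imgsz=640,
    optimizer="SGD",
    device=0,                # single NVIDIA RTX 4080 Super GPU
)
```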

4.2. Datasets

(1) StateFarm dataset. This paper validates the model by using the StateFarm Distracted Driver dataset, which contains 10 categories of distracted driving behaviors observed from 26 drivers of different nationalities. These behaviors captured in the dataset include safe driving, texting while driving (with left or right hand), talking on the phone while driving (with left or right hand), conversing with a passenger, adjusting the radio, reaching behind, grooming (hair or makeup), and drinking water, as illustrated in Figure 5. Each class in this dataset comprises approximately 2000 images, each featuring a high-definition resolution of 640 × 480 pixels, collectively amounting to a total of 22,424 high-quality images that ensure the richness and diversity of the dataset. The detailed distribution is provided in Table 1.
(2) 100-Driver dataset. To demonstrate the generalization capability of our model, this paper further utilizes the 100-Driver [31] distracted driving dataset. This dataset employs RGB format during daylight hours and switches to NIR (Near-Infrared) format at night, and in the daytime images, different lighting conditions are collected. This extensive dataset comprises data captured from four different angles for 100 drivers, encompassing 21 distinct distracted driving behaviors along with 1 normal driving behavior. The selected photos are taken from a fourth camera angle, including all kinds of activities of distracted driving: texting (with left or right hand), talking on the phone (with left or right hand), interacting with the co-driver, operating the radio, reaching to the back seat, adjusting hair or makeup, and drinking water. For each class, 100 daytime and 100 nighttime photos are randomly selected to test the generalization ability of the algorithm. In total, there are 1000 images in RGB format and 1000 images in NIR format. Sample images are shown in Figure 6.
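A balanced generalization split like the one described above could be drawn with a short script such as the following sketch; the directory layout and folder names are hypothetical assumptions, not the actual structure of the 100-Driver release.

```python
import random
from pathlib import Path

def sample_split(root: str, per_class: int = 100, seed: int = 0):
    """Randomly pick `per_class` images from each behavior-class folder under `root`."""
    rng = random.Random(seed)
    selected = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            images = sorted(class_dir.glob("*.jpg"))
            selected[class_dir.name] = rng.sample(images, min(per_class, len(images)))
    return selected

# Hypothetical paths: daytime RGB and nighttime NIR images from the fourth camera angle.
day = sample_split("100driver/day_cam4")      # 10 classes x 100 RGB images
night = sample_split("100driver/night_cam4")  # 10 classes x 100 NIR images
```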

4.3. Evaluation Metrics

The objective of this study is to optimize YOLOv8n by reducing its model size while maintaining high detection accuracy. Therefore, mAP@0.5 and mAP@0.5:0.95 are chosen as the metrics to assess the precision of the model. mAP stands for mean average precision; the thresholds 0.5 and 0.5:0.95 indicate that the IoU overlap between the predicted bounding box and the ground truth bounding box must reach 50%, or 50–95% (averaged over thresholds), respectively. Additionally, model complexity is evaluated based on parameter count and computational load. Precision and recall serve as secondary metrics, defined as follows, where n represents the number of classification classes:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\%$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\%$$

$$AP = \int_0^1 P(r)\, \mathrm{d}r$$

$$mAP = \frac{\sum_{i=1}^{n} AP(i)}{n}$$
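The small self-contained sketch below illustrates how these metrics are computed in practice: precision and recall from TP/FP/FN counts, AP as the area under an interpolated precision–recall curve, and mAP as the mean of per-class AP values. The numbers are made-up toy values, not results from this paper.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) * 100.0
    recall = tp / (tp + fn) * 100.0
    return precision, recall

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """AP as the area under the (interpolated) precision-recall curve."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # enforce a non-increasing precision envelope
    return float(np.sum(np.diff(r) * p[1:]))        # AP = integral of P(r) over recall

# Toy example for one class; mAP averages AP over all n classes.
print(precision_recall(tp=95, fp=5, fn=3))
ap = average_precision(np.array([0.2, 0.5, 0.8, 0.97]),
                       np.array([1.0, 0.98, 0.95, 0.90]))
print(ap)
```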

4.4. Experimental Results and Analysis of YOLOv8n Model-Modified Backbone Comparison

To evaluate the performance advantages of the StarNet when replacing the backbone of YOLOv8n, this paper conducts a series of comparative experiments. The original YOLOv8n model and five popular lightweight models from the past few years are selected as references. All models are tested under a unified experimental environment to ensure fairness and comparability of the results. The specific experimental results are summarized in Table 2.
In this experiment, HGNetV2 [32] slightly reduces computational and parameter complexity compared to the baseline. EfficientViT [33] and MobileNetV4 [34] require more parameters and computing resources, yet model performance is not improved. FasterNet [35] exhibits a slight improvement in accuracy over the baseline, but at increased computational and parameter costs. Compared with YOLOv8n, StarNet delivers better performance while reducing computational requirements by 20.7% and parameters by 26.6%. In summary, adopting StarNet as the backbone enhances the model's ability to detect distracted driving while maintaining a lighter model footprint.

4.5. Ablation Experiments and Analysis

This study conducts ablation experiments utilizing YOLOv8n as the baseline model, maintaining a consistent software and hardware environment and a fixed set of hyperparameters, to systematically evaluate the influence of each model component on overall performance. Table 3 presents the experimental data obtained from nine sets of trials. A notable observation is that each substituted module significantly reduces both computational and parametric complexity. Specifically, the integration of StarNet, C2f-Star, and LSCD reduces computational cost by 20.7%, 15.8%, and 26.8%, respectively, and decreases the number of parameters by 26.7%, 16.7%, and 23.3%, respectively. Furthermore, these substitutions also marginally improve accuracy compared to the baseline model. These results comprehensively illustrate the effectiveness and practicality of the four improvement strategies proposed in this paper: in line with the goal of simplifying the model, detection accuracy is maintained while computational cost and parameter count are reduced.

4.6. Compared with the Algorithm Proposed in Recent Years

To more intuitively showcase the superiority of the StarDL-YOLO algorithm proposed in this paper, we compare it with several other well-performing algorithms: YOLOv5n, YOLOv6n, YOLOv8n, YOLOv9t, YOLOv10n, and RT-DETR. As shown in Table 4, although YOLOv9t and YOLOv10n achieve notable reductions in computation and parameters, they do not surpass YOLOv8n in accuracy. RT-DETR does deliver strong performance, but at the cost of far greater computation, parameter count, and inference time. In contrast, our algorithm not only substantially reduces the amount of computation and parameters but also achieves a slight improvement in accuracy, making it more suitable for distracted driving detection.

4.7. Comparison of GFLOPs and Parameter Number between StarDL-YOLO and YOLOv8n Models

This paper aims to design a detection model that optimizes model performance while significantly reducing computational complexity and parameter count. In order to show the advantages of the StarDL-YOLO algorithm more intuitively, this paper compares the GFLOPs and parameters of each module of the StarDL-YOLO and YOLOv8n models. In Figure 7, the red numbers represent the computational complexity of each module in units of G, while blue numbers indicate the parameter count of each module. A comparison of these modules demonstrates varying extents of reduction in both computational complexity and parameter count, with particularly notable improvements observed in the backbone and head sections. Consequently, the StarDL-YOLO achieves a substantial reduction in hardware requirements while maintaining comparable accuracy.

4.8. Generalization Experiment

The generalization experiment utilizes the 100-Driver dataset. Since the photos utilized for training in this paper are captured from the perspective of co-driver and include ten types of distracted driving behaviors, we randomly select photos with corresponding angles and classifications from the 100-Driver dataset. For each class, 100 photos are randomly selected, encompassing various vehicles, drivers, times of day, and lighting conditions. Moreover, the 100-Driver dataset includes two types of images: RGB photos taken during the day and NIR photos captured at night, with 1000 images for each modality, all of which are used for validation. Keeping the experimental environment unchanged, experiments are conducted using YOLOv8n and our proposed StarDL-YOLO. The experimental results are shown in Table 5.
In the generalization experiments of this paper, compared to the weights trained with the original model, recognition accuracy (mAP@0.5) improves by 11.5 percentage points during the daytime and 10.3 percentage points at night. These results underscore the stronger generalization ability of the model. Upon closer examination of the recognized images, it is noted that excessively strong or weak lighting during the day can lead to decreased recognition accuracy. Similarly, at night the model often struggles to recognize behaviors, frequently mistaking onboard terminals for smartphones. This phenomenon is illustrated in Figure 8.

4.9. More Challenging Generalization Experiments

The generalization experiments show that the proposed algorithm generalizes significantly better than the original model. To further explore its practical impact and potential challenges, we conduct generalization experiments under more stringent conditions. For each class of distracted driving behavior, this paper selects pictures from a low-light environment, a high-light environment, an environment with interlacing light and shadow, NIR mode, three additional camera angles, and an ordinary environment, for a total of eight different conditions, and uses StarDL-YOLO to detect these pictures. The results are shown in Table 6. Examples of the detection results are shown in Figure 9.
As can be seen from the table, images similar to the training dataset were detected easily. Using a mobile phone with the right hand, adjusting hair or makeup, and talking with passengers could all be detected under high light and under alternating light and shadow. Using a mobile phone with the left hand and communicating with passengers were successfully recognized in low light. Under the other three camera angles, however, no behaviors were recognized except communicating with a passenger. Moreover, in nighttime NIR mode, the model failed to detect or recognize any of the behaviors. Consequently, the algorithm proposed in this paper still has certain limitations, and there is considerable room for improvement.

5. Conclusions

The StarDL-YOLO algorithm proposed in this paper enhances the recognition accuracy of distracted driving behaviors of vehicle drivers and reduces the computational and parameter complexity of the model. The specific improvements are as follows. Firstly, we replace the YOLOv8n backbone with StarNet, which employs star operations to capture high-dimensional and nonlinear feature spaces from low-dimensional inputs. Secondly, replacing the bottleneck structures in the C2f module of the neck with Star Blocks reduces the computational burden. Thirdly, introducing the idea of shared convolution in the detection head significantly reduces computational cost. Finally, we propose Wise-Focaler-MPDIoU as the bounding box loss function, which not only focuses on different regression samples but also enhances detection performance across various detection tasks and optimizes detection results even when widths and heights vary greatly. Experimental results indicate that, under standard lighting conditions, the StarDL-YOLO algorithm achieves an accuracy of 99.6% on the StateFarm dataset. Additionally, StarDL-YOLO reduces the number of parameters by 56.4% and the computational load by 45.1% compared to YOLOv8n. In the generalization experiments on 100-Driver, StarDL-YOLO also demonstrates a significant superiority over YOLOv8n in terms of accuracy.
Nevertheless, the proposed method still exhibits several areas requiring further refinement. Firstly, the model encounters difficulties in recognizing distracted driving behaviors under varying lighting conditions, such as low-light and bright-light environments: drinking water in bright light is often mistaken for adjusting hair or makeup because the shape of the water cup is largely washed out by the strong light, and most behaviors cannot be recognized in low-light conditions. Additionally, in nighttime NIR mode, strong light sources interfere with the detection accuracy of the model; for instance, bright car infotainment screens are often misidentified as mobile phones, and most distracted driving behaviors cannot be identified in this mode. Moreover, when distracted driving behaviors are viewed from other camera perspectives, recognition often fails or produces false detections. To overcome these limitations, future research will focus on integrating diverse datasets, improving model stability under complex conditions, and developing advanced network architectures.

Author Contributions

Conceptualization, Q.S. and L.Z.; methodology, Q.S.; software, L.Z.; validation, Q.S., L.Z. and Y.Z.; formal analysis, Q.S.; investigation, Q.S.; resources, Q.S.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, Y.X.; visualization, Y.Z.; supervision, S.L.; project administration, S.L.; funding acquisition, Q.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 62173159 and the Natural Science Foundation of Huaian under Grant HAB202362.

Data Availability Statement

All the original datasets mentioned in this paper are accessible. The StateFarm dataset can be obtained from https://www.kaggle.com/c/state-farm-distracted-driver-detection (accessed on 1 December 2023). The 100-Driver dataset can be obtained from https://github.com/Shenqishaonv/100-Driver-Source/blob/main/README.md (accessed on 27 June 2024).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sun, J.; Zhang, Y.H. Detecting distraction behavior of drivers using naturalistic driving data. China J. High Way Transp. 2020, 33, 225–235. [Google Scholar]
  2. Wang, X.S.; Xu, R.J. Driver distraction detection based on vehicle dynamics using naturalistic driving data. Transp. Res. Part Emerg. Technol. 2022, 136, 103561. [Google Scholar] [CrossRef]
  3. Persson, A.; Jonasson, H.; Fredriksson, I. Heart rate variability for classification of alert versus sleep deprived drivers in real road driving conditions. IEEE Trans. Intell. Transp. Syst. 2020, 22, 3316–3325. [Google Scholar] [CrossRef]
  4. Shi, D.M.; Xiao, F.; Fredriksson, I. Study on driving behavior detection method based on improved long and short-term memory network. Automot. Eng. 2021, 43, 1023–1029. [Google Scholar]
  5. Sharma, K.; Moon, I. Improved scale-invariant feature transform feature-matching technique based object tracking in video sequences via a neural network and Kinect sensor. J. Electron. Imaging 2012, 22, 033017. [Google Scholar] [CrossRef]
  6. Zhao, C.H.; Zhang, B.L.; He, J. Recognition of driving postures by contourlet transform and random forests. IET Intell. Transp. Syst. 2012, 6, 161–168. [Google Scholar] [CrossRef]
  7. Guo, G.D.; Lai, A. A survey on still image based human action recognition. Pattern Recognit. 2014, 47, 3343–3361. [Google Scholar] [CrossRef]
  8. Yan, C.; Coenen, F. Driving posture recognition by joint application of motion history image and pyramid histogram of oriented gradients. Int. J. Veh. Technol. 2014, 2014, 719413. [Google Scholar] [CrossRef]
  9. Li, W.; Zhang, L.; Wu, C. A new lightweight deep neural network for surface scratch detection. Int. J. Adv. Manuf. Technol. 2022, 123, 1999–2015. [Google Scholar] [CrossRef]
  10. Ma, X.; Dai, X.; Bai, Y. Rewrite the Stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024, Seattle, WA, USA, 17–21 June 2024; pp. 5694–5703. [Google Scholar]
  11. Zhang, H.; Zhang, S. Focaler-IoU: More focused intersection over union loss. arXiv 2024, arXiv:2401.10525. [Google Scholar]
  12. Siliang, M.; Yong, X. Mpdiou: A loss for efficient and accurate bounding box regression. arXiv 2023, arXiv:2307.07662. [Google Scholar]
  13. Koesdwiady, A.; Bedawi, S.M. End-to-end deep learning for driver distraction recognition. In Proceedings of the Image Analysis and Recognition: 14th International Conference, ICIAR 2017, Montreal, QC, Canada, 5–7 July 2017; pp. 11–18. [Google Scholar]
  14. Xing, Y.; Lv, C.; Wang, H.J.; Velenis, E.; Wang, F.Y. Driver activity recognition for intelligent vehicles: A deep learning approach. IEEE Trans. Veh. Technol. 2019, 68, 5379–5390. [Google Scholar] [CrossRef]
  15. Hesham, E.; Abouelnaga, Y.; Saad, M.H.; Moustafa, M.N. Driver distraction identification with an ensemble of convolutional neural networks. J. Adv. Transp. 2019, 2019, 4125865. [Google Scholar]
  16. Hu, Y.; Lu, M.; Lu, X. Driving behaviour recognition from still images by using multi-stream fusion CNN. Mach. Vis. Appl. 2019, 30, 851–865. [Google Scholar] [CrossRef]
  17. Zhang, Z.W. Research on Abnormal Driving Behavior Detection Method Based on Machine Vision. Master’s Thesis, Hunan University, Changsha, China, 2020. [Google Scholar]
  18. Bai, Z.W.; Wang, Y.Y.; Zhang, L.W. Driver distraction behavior detection with multi-information fusion based on graph convolution networks. Automot. Eng. 2020, 42, 1027–1033. [Google Scholar]
  19. Tran, D.; Do, H.M. Real-time detection of distracted driving using dual cameras. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 2014–2019. [Google Scholar]
  20. Li, P.; Yang, Y.; Grosu, R. Driver distraction detection using octave-like convolutional neural network. IEEE Trans. Intell. Transp. Syst. 2021, 23, 8823–8833. [Google Scholar] [CrossRef]
  21. Yin, Z.S.; Zhong, S.; Nie, L.Z. Distracted driving behavior detection based on human pose estimation. China J. Highw. Transp. 2022, 35, 312–323. [Google Scholar]
  22. Zhang, Z.Y. Research on the Detection Method of Dangerous Driving Behavior of Motor Vehicle Drivers Based on Deep Learning. Master’s Thesis, Hangzhou University of Electronic Science and Technology, Hangzhou, China, 2023. [Google Scholar]
  23. Peng, P.; Huang, C.; Ding, W. Distracted driving detection based on the fusion of deep learning and causal reasoning. Inf. Fusion 2023, 89, 121–142. [Google Scholar] [CrossRef]
  24. Li, S.F.; Gao, S.B.; Zhang, Y.Y. Pose-guided instance-aware learning for driver distraction recognition. J. Image Graph. 2023, 28, 3550–3561. [Google Scholar]
  25. Lou, C.C.; Nie, X. Research on Lightweight-Based Algorithm for Detecting Distracted Driving Behaviour. Electronics 2023, 12, 4640. [Google Scholar] [CrossRef]
  26. Du, Y.J.; Liu, X.F.; Yi, Y.W.; Wei, K. Optimizing Road Safety: Advancements in Lightweight YOLOv8 Models and GhostC2f Design for Real-Time Distracted Driving Detection. Sensors 2023, 23, 8844. [Google Scholar] [CrossRef]
  27. He, Y.; Lu, M.K.; Gao, S. Distracted behavior detection of commercial vehicle drivers based on the MobileViT-CA model. Inf. Fusion 2024, 37, 195–204. [Google Scholar]
  28. Sun, Z.; Zhu, L.; Qin, S.; Yu, Y. Road Surface Defect Detection Algorithm Based on YOLOv8. Electronics 2024, 12, 2413. [Google Scholar] [CrossRef]
  29. Zhang, Z.; Yang, X.; Wu, C. An Improved Lightweight YOLOv5s-Based Method for Detecting Electric Bicycles in Elevators. Electronics 2024, 13, 2660. [Google Scholar] [CrossRef]
  30. Du, S.J.; Pan, W.G.; Li, N.Y. TSD-YOLO: Small traffic sign detection based on improved YOLO v8. IET Image Process. 2024, 1–15. [Google Scholar] [CrossRef]
  31. Wang, J.; Li, W.; Li, F. 100-driver: A large-scale, diverse dataset for distracted driver classification. IEEE Trans. Intell. Transp. Syst. 2023, 24, 7061–7072. [Google Scholar] [CrossRef]
  32. Yao, T.; Li, Y.; Pan, Y.; Mei, T. Hgnet: Learning hierarchical geometry from points, edges, and surfaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 21846–21855. [Google Scholar]
  33. Liu, X.; Peng, H.; Zheng, N.; Yang, Y. Efficientvit: Memory efficient vision transformer with cascaded group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 14420–14430. [Google Scholar]
  34. Qin, D.; Leichner, C.; Delakis, M. MobileNetV4-Universal Models for the Mobile Ecosystem. arXiv 2024, arXiv:2404.10518. [Google Scholar]
  35. Chen, J.; Kao, S.H.; He, H.; Zhuo, W. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
Figure 1. Network structure diagram of StarDL-YOLO algorithm.
Figure 2. Network structure diagram of StarNet.
Figure 3. Comparison of C2f-Star and C2f module structures.
Figure 4. Detect_LSCD module structure.
Figure 5. Example diagram of the StateFarm dataset.
Figure 6. Example diagram of the 100-Driver dataset.
Figure 7. Parameters and GFLOPs of each module in the models.
Figure 8. Graph of the results of the generalization experiment.
Figure 9. Detection samples of normal driving behavior in different situations.
Table 1. Summary of the StateFarm dataset.

Class | Activity | Number of Images
0 | Safe Driving | 2489
1 | Right_text | 2267
2 | Right_phone | 2317
3 | Left_text | 2346
4 | Left_phone | 2326
5 | Radio | 2312
6 | Drinking | 2325
7 | Reaching Behind | 2002
8 | Hair or Makeup | 1911
9 | Talking | 2129
Total | | 22,424
Table 2. Comparison of the effectiveness of different backbone networks.

Subject | mAP@0.5 | mAP@0.5:0.95 | GFLOPs (G) | Parameters (M) | P | R
YOLOv8n | 99.3 | 69.3 | 8.2 | 3.0 | 98.6 | 99.1
YOLOv8n + HGNetV2 | 99.4 | 68.9 | 6.9 | 2.3 | 99.1 | 99.4
YOLOv8n + EfficientViT | 99.4 | 69.0 | 9.4 | 4.0 | 99.1 | 99.5
YOLOv8n + FasterNet | 99.5 | 69.8 | 10.7 | 4.2 | 99.1 | 99.5
YOLOv8n + MobileNetV4 | 99.4 | 69.1 | 23.2 | 5.7 | 99.4 | 99.2
YOLOv8n + StarNet (ours) | 99.4 | 69.1 | 6.5 | 2.2 | 99.8 | 99.1
Table 3. Ablation experiments.

Configuration | mAP@0.5 | mAP@0.5:0.95 | GFLOPs (G) | Parameters (M)
Baseline (YOLOv8n) | 99.3 | 69.3 | 8.2 | 3.0
+ StarNet | 99.4 | 69.1 | 6.5 | 2.2
+ C2f-Star | 99.4 | 69.4 | 6.9 | 2.5
+ LSCD | 99.5 | 69.1 | 6.0 | 2.3
+ Wise-Focaler-MPDIoU | 99.4 | 69.0 | 8.2 | 3.0
Combination of improvements | 99.4 | 69.3 | 6.1 | 2.0
Combination of improvements | 99.5 | 69.1 | 4.9 | 1.6
Combination of improvements | 99.5 | 69.0 | 5.3 | 1.9
+ StarNet + C2f-Star + LSCD | 99.5 | 69.5 | 4.5 | 1.4
All four improvements (StarDL-YOLO) | 99.6 | 69.5 | 4.5 | 1.4
Table 4. Comparison experiments of different algorithms proposed in recent years.

Subject | mAP@0.5 | mAP@0.5:0.95 | GFLOPs (G) | Parameters (M)
YOLOv5n | 99.3 | 69.1 | 7.1 | 2.5
YOLOv6n | 99.2 | 68.2 | 11.8 | 4.2
YOLOv8n | 99.3 | 69.3 | 8.2 | 3.0
YOLOv9t | 99.1 | 69.1 | 7.6 | 2.0
YOLOv10n | 99.3 | 68.8 | 8.2 | 2.7
RT-DETR | 99.6 | 69.4 | 108.0 | 32.8
StarDL-YOLO (ours) | 99.6 | 69.5 | 4.5 | 1.4
Table 5. Experiments on generalization during day and night.

Subject | mAP@0.5 (Day) | mAP@0.5 (Night)
YOLOv8n | 26.3 | 20.2
StarDL-YOLO (ours) | 37.8 | 30.5
Table 6. Detection of ten distracted driving behaviors in different conditions.

Classes | Common | Low Light | High Light | Light and Shadow Interlaced | NIR Mode | Right Vision | Mesopic Vision | Left Vision
normal
Right_text
Right_phone
Left_text
Left_phone
Radio
Drinking
Behind
Hair
Talking