Drowsiness is a major issue while driving; therefore, drowsiness detection solutions should be deployed to monitor the driver while the vehicle is in motion. With the help of the OpenCV and Dlib libraries, we developed a driver drowsiness detection system that determines whether the person’s eyes are closed or open, i.e., whether the eyes are in an active state or a passive (drowsy) state. A further goal is to detect whether the person is yawning while holding the steering wheel. Implementing such a detection system is essential to reduce accidents caused by fatigue resulting from tiredness or sleepiness, which is especially dangerous at night, when accident rates increase by more than 50 percent. Therefore, to reduce the number of road accidents, an advanced detection method must be deployable in real-world scenarios. The motive of this study is to control the rate of accidents due to fatigue or driver drowsiness so that no mishaps occur and, most notably, to enhance safety in terms of traffic rules and regulations whenever reckless driving takes place due to an unconscious state of mind, i.e., drowsiness. The proposed detection method relies on physical cues, i.e., eye movement and facial expression, to identify the drowsy state of the driver with the help of convolutional neural networks (CNNs). The proposed model uses a 15 s video clip as input. The video is sampled at 1 s intervals, yielding 15 frames. These frames are then passed through a U-Net to extract the region of interest (ROI), in this case, the driver’s body. Using 1 × 1 Conv layers significantly reduces the dimension of the output obtained from the convolution layers, which plays a significant role in encoding the features extracted from the input frames.
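As an illustration of this front end, the sketch below samples one frame per second from a 15 s clip and applies a 1 × 1 convolution to reduce the channel dimension. This is a hypothetical illustration: the 30 fps source, feature-map size, and channel counts are assumptions, not values from the paper.

```python
import numpy as np

def sample_frames(num_frames_total, fps, interval_s=1.0, clip_s=15):
    """Indices of one frame per `interval_s` seconds over a `clip_s`-second clip."""
    idx = [int(t * fps) for t in np.arange(0, clip_s, interval_s)]
    return [i for i in idx if i < num_frames_total]

def conv1x1(feature_map, weights):
    """1x1 convolution: a per-pixel linear map over channels.
    feature_map: (H, W, C_in), weights: (C_in, C_out)."""
    return np.tensordot(feature_map, weights, axes=([2], [0]))

# A 15 s clip recorded at 30 fps -> 450 frames; sample at 1 s intervals.
indices = sample_frames(num_frames_total=450, fps=30)
# Reduce a 64-channel feature map to 16 channels with a 1x1 conv.
fmap = np.random.rand(32, 32, 64)
reduced = conv1x1(fmap, np.random.rand(64, 16))
```

The 1 × 1 convolution leaves the spatial dimensions untouched and only mixes channels, which is why it is a cheap way to compress the encoded features.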
The overall effectiveness of the segmentation results directly contributes to the accuracy and reliability of the entire driver drowsiness detection system. Regular testing and evaluation on diverse datasets and under various conditions are essential to ensure that the segmentation process meets the desired performance standards.
4.3. Detection of Drowsiness State
Monitoring the PERCLOS can be part of a system for detecting when a driver may be becoming drowsy, which is crucial for preventing accidents on the road. Drowsiness detection is a complex task, and combining multiple indicators often leads to more accurate results. The PERCLOS, when used in conjunction with other measures, contributes to a more comprehensive and reliable drowsiness detection system. The human body reflects its states automatically. EM-CNN uses these kinds of human physiological reactions to evaluate the PERCLOS and POM. The PERCLOS, expressed as a percentage, is given below in Equation (1) [52]:

PERCLOS = (N_close / N_total) × 100%

Here, N_close represents the number of closed-eye frames per unit of time, and N_total is the total number of frames per unit of time. To calculate the threshold of drowsiness, a collection of 13 video frames was used to test and evaluate the PERCLOS value. According to Equation (2), a value of 0.25 or greater means that the eye is in a closed state for a continuous period, which indicates drowsiness. The neural network is trained based on the drowsiness thresholds of the PERCLOS and POM [12]. The PERCLOS (percentage of eyelid closure over time) measures the proportion of time during which the eyes remain closed, and the POM (percentage of mouth opening) measures the proportion of time during which the mouth remains open, which captures yawning. MT-CNN extracts the facial features along with the feature points, which helps obtain the ROI of the eyes and mouth. Then, EM-CNN evaluates the states of the eyes and mouth. By observing the uninterrupted image frames, the degrees of eye and mouth closure are calculated, and drowsiness is flagged when a threshold is matched or exceeded. The segmentation algorithm used in the proposed method (U-Net [13]) is essentially a series of convolution and ReLU blocks with some max-pooling layers between the first and second halves of the convolution and ReLU blocks, followed by some transpose convolution layers. The two halves are also connected with multiple skip connections. The convolution, ReLU, and max-pooling layers are also used in the primary model, specifically in the CNN part of the CNN-LSTM architecture [53]. The convolution operation is described by Equation (2) [53].
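A minimal sketch of the PERCLOS thresholding described above (the 0.25 threshold and the 13-frame evaluation window come from the text; the per-frame eye-state labels are made-up illustrative data):

```python
def perclos(eye_closed_flags):
    """Fraction of frames in which the eye is closed (Equation (1), as a ratio)."""
    return sum(eye_closed_flags) / len(eye_closed_flags)

def is_drowsy(eye_closed_flags, threshold=0.25):
    """Flag drowsiness when PERCLOS meets or exceeds the threshold."""
    return perclos(eye_closed_flags) >= threshold

# 13 frames, as in the evaluation described above: 1 = eye closed, 0 = open.
frames = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
score = perclos(frames)   # 4/13, which exceeds the 0.25 threshold
drowsy = is_drowsy(frames)
```

The same ratio-plus-threshold pattern applies to the POM, with closed-eye flags replaced by open-mouth flags.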
Here, I is the input matrix, and K is the 2D kernel with a size of p × r, so that

conv(I, K)_{x,y} = Σ_{i=1..p} Σ_{j=1..r} K_{i,j} · I_{x+i−1, y+j−1}

In Equation (3), the convolution blocks also use the ReLU activation function to add non-linearity to the output. The ReLU operation can be described as a function of x:

f(x) = max(0, x)
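Equations (2) and (3) can be checked numerically with a direct, unoptimized implementation; the input matrix and kernel values below are arbitrary:

```python
import numpy as np

def conv2d(I, K):
    """Valid 2D convolution of input I with a p x r kernel K (Equation (2))."""
    p, r = K.shape
    H, W = I.shape
    out = np.zeros((H - p + 1, W - r + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = np.sum(I[x:x + p, y:y + r] * K)
    return out

def relu(x):
    """ReLU activation, f(x) = max(0, x) (Equation (3))."""
    return np.maximum(0, x)

I = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
K = np.array([[1., 0.], [0., -1.]])   # simple difference kernel
raw = conv2d(I, K)                    # every window gives 1*a - 1*d
feat = relu(raw)                      # negatives are clipped to zero
```

A 3 × 3 input convolved with a 2 × 2 kernel yields a 2 × 2 output, matching the (H − p + 1) × (W − r + 1) valid-convolution size.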
Max-pooling is the most commonly used method among all the pooling layers. It reduces the number of parameters by sampling the maximum activation value from a patch of the image or the matrix. Max-pooling can be described by Equation (4),

P_{x,y} = max_{(i,j) ∈ patch(x,y)} A_{i,j}

where A is the activation output from the ReLU, and P is the output from the max-pooling layer. The U-Net also uses the transposed convolution operation, which is applied over patches in a similar manner but upsamples the encoding instead. In Equation (5), transposed convolution processes an image with a size of i × i using a kernel with a size of k × k and outputs an upsampled matrix whose side length is given by

o = s(i − 1) + k − 2p,

where s is the stride and p is the padding. The operation of the transposed convolution involves multiplying the value of the filter with the encoded matrix to obtain another padded, higher-resolution matrix. The final stage presents the output from the LSTM layer, which can be combined with another system to create a useful end-to-end system. For instance, a buzzer may be connected at the end, which is triggered when a driver is detected as drowsy, or, in the case of a self-driving car, the vehicle could safely park on the side of the road and take measures to wake the driver up.
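The pooling and upsampling arithmetic above can be sketched as follows; the 4 × 4 input values are arbitrary, and the output-size helper implements the standard transposed-convolution relation o = s(i − 1) + k − 2p:

```python
import numpy as np

def max_pool2d(A, size=2, stride=2):
    """Max-pooling (Equation (4)): keep the maximum of each patch."""
    H, W = A.shape
    return np.array([[A[i:i + size, j:j + size].max()
                      for j in range(0, W - size + 1, stride)]
                     for i in range(0, H - size + 1, stride)])

def transposed_conv_output_size(i, k, s=1, p=0):
    """Output side length of a transposed convolution (Equation (5))."""
    return s * (i - 1) + k - 2 * p

A = np.array([[1., 3., 2., 0.],
              [4., 6., 1., 1.],
              [0., 2., 9., 7.],
              [5., 1., 3., 8.]])
P = max_pool2d(A)                            # 4x4 activations -> 2x2
up = transposed_conv_output_size(2, 2, s=2)  # 2x2 encoding upsampled to 4x4
```

Note that max-pooling with size 2 and stride 2 halves each spatial dimension, and the matching transposed convolution restores it, which is exactly the symmetry the U-Net encoder/decoder relies on.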
Figure 6 presents the architecture of the CNN-LSTM model.
The CNN-LSTM model uses LSTM layers to fuse information from past time steps. For retaining activations over much longer spans of the recursion, LSTM layers have proven to be more effective than GRU layers. The main reason for this difference is that the LSTM uses three gates to update the memory cell: the update gate, which is also present in the GRU, and the forget and output gates. More formally, the three gates can be described by the following equations [54]:

Γ_u = σ(W_u [a^(t−1), x^(t)] + b_u)    (6)
Γ_f = σ(W_f [a^(t−1), x^(t)] + b_f)    (7)
Γ_o = σ(W_o [a^(t−1), x^(t)] + b_o)    (8)

In the above equations and the subsequent ones, a^(t−1) denotes the activation from the previous time step, and x^(t) denotes the input in the current time step. In Equation (6), W_u and b_u represent the parameter matrix and the bias, respectively, and Γ_u is the value of the update gate. In Equation (7), W_f and b_f are the parameter matrix and the bias, respectively, and Γ_f is the value of the forget gate. In Equation (8), W_o and b_o are the parameter matrix and the bias, respectively, and Γ_o is the value of the output gate.
The memory cell of the LSTM is calculated using the following equation:

c^(t) = Γ_u ∗ c̃^(t) + Γ_f ∗ c^(t−1)    (9)

where c^(t) and c^(t−1) are the values of the memory cell at the current and previous time steps, and c̃^(t) is the candidate that is supposed to replace the current memory cell. Here, ∗ denotes element-wise multiplication. The candidate c̃^(t) can be written as

c̃^(t) = tanh(W_c [a^(t−1), x^(t)] + b_c)    (10)

Finally, in Equation (11), the current activation is calculated by combining the output gate and the tanh of the memory cell:

a^(t) = Γ_o ∗ tanh(c^(t))    (11)
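Equations (6)–(11) can be sketched as a single LSTM step in NumPy; the weights are random and the dimensions are arbitrary, so this only demonstrates the gate arithmetic, not a trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, x_t, c_prev, W, b):
    """One LSTM step following Equations (6)-(11). W and b hold the
    parameters of the update (u), forget (f), output (o), and candidate (c)
    transformations."""
    z = np.concatenate([a_prev, x_t])            # [a^(t-1), x^(t)]
    gamma_u = sigmoid(W["u"] @ z + b["u"])       # update gate, Eq. (6)
    gamma_f = sigmoid(W["f"] @ z + b["f"])       # forget gate, Eq. (7)
    gamma_o = sigmoid(W["o"] @ z + b["o"])       # output gate, Eq. (8)
    c_tilde = np.tanh(W["c"] @ z + b["c"])       # candidate cell, Eq. (10)
    c_t = gamma_u * c_tilde + gamma_f * c_prev   # memory cell, Eq. (9)
    a_t = gamma_o * np.tanh(c_t)                 # activation, Eq. (11)
    return a_t, c_t

rng = np.random.default_rng(0)
n_a, n_x = 4, 3
W = {k: rng.standard_normal((n_a, n_a + n_x)) for k in "ufoc"}
b = {k: np.zeros(n_a) for k in "ufoc"}
a_t, c_t = lstm_step(np.zeros(n_a), rng.standard_normal(n_x), np.zeros(n_a), W, b)
```

Because the output gate lies in (0, 1) and tanh in (−1, 1), every activation component is strictly bounded in magnitude by 1.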
Detecting drowsiness in drivers is a challenging task that demands a sophisticated approach. One promising solution is the fusion of CNN and LSTM networks, two powerful deep learning architectures. This fusion capitalizes on the strengths of both CNNs, known for their image analysis prowess, and LSTMs, renowned for modeling sequential patterns. The CNN-LSTM architecture holds the potential to revolutionize drowsiness detection systems, making roads safer and saving lives. By analyzing real-time video streams of a driver’s face, this hybrid model can not only capture intricate facial features but also track temporal patterns in driver behavior. This innovative approach has the capability to accurately determine when a driver is becoming drowsy, thus enabling timely interventions and preventive measures.
In this system, CNNs are employed as the first line of defense, extracting meaningful features from images of a driver’s face. These features are then passed to the LSTM network, which specializes in understanding the sequence of these features over time. By learning from historical patterns, the LSTM can distinguish between normal behavior and signs of drowsiness, such as drooping eyelids, yawning, or erratic facial movements. The strength of this combined architecture lies in its ability to consider not only the current frame but also the context provided by preceding frames. This context-aware capability allows the model to identify subtle changes that might escape a single-frame analysis. As a result, the CNN-LSTM model can adapt to the dynamic nature of drowsiness, which often manifests gradually rather than abruptly. Through rigorous training on diverse datasets encompassing various lighting conditions, driver characteristics, and scenarios, the CNN-LSTM model refines its ability to accurately recognize drowsiness. The model’s high accuracy, sensitivity, and specificity make it an indispensable tool for modern driver assistance systems. Its potential applications extend beyond drowsiness detection—it can be integrated into smart vehicles, fleet management systems, and transportation infrastructures, contributing to a safer and more secure transportation ecosystem. The steps of the proposed algorithm are as follows:
- Step 1:
Preprocess the image (M) datasets.
- Step 2:
Combine the images with the inputs from the trained models.
- Step 3:
Retrieve the results of the final convolution layer of the model that was provided.
- Step 4:
Flatten the n dimensions, reducing their number to one.
- Step 5:
Apply the different layers of CNN-LSTM.
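The steps above can be traced with a shape-level sketch. This is a hypothetical illustration: the CNN feature extractor is stubbed out with random per-frame features, the recurrence is a simple tanh update standing in for the LSTM layers, and all sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Steps 1-2: a preprocessed batch of 15 frames (one per second of the clip).
frames = rng.random((15, 32, 32, 3))

# Step 3: stand-in for the final convolution layer's output per frame.
conv_out = rng.random((15, 4, 4, 8))

# Step 4: flatten the n spatial/channel dimensions to one per frame.
flat = conv_out.reshape(15, -1)          # (15, 128)

# Step 5: feed the per-frame features through a recurrence (a plain tanh
# update here, shown only to trace the data flow of the CNN-LSTM stage).
Wh, Wx = rng.standard_normal((16, 16)), rng.standard_normal((16, 128))
h = np.zeros(16)
for x_t in flat:
    h = np.tanh(Wh @ h + Wx @ x_t)       # fuse information across time steps

# A final drowsiness score in (0, 1) from the last hidden state (hypothetical head).
w_out = rng.standard_normal(16)
score = 1.0 / (1.0 + np.exp(-(w_out @ h)))
```

The point of the trace is the shape discipline: 15 frames in, one feature vector per frame, one hidden state carried across frames, and a single score out.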
Padding (Conv2d): The formula below is used to determine the padding width, where p stands for padding, and f stands for the filter dimension:

p = (f − 1) / 2
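Assuming the “same”-padding relation p = (f − 1)/2 for odd filter sizes, the padding needed to preserve the input size can be checked against the standard convolution output-size formula:

```python
def same_padding(f):
    """Padding width that preserves input size for stride 1 and odd filter size f."""
    assert f % 2 == 1, "odd filter size assumed"
    return (f - 1) // 2

def conv_output_size(n, f, p, s=1):
    """Standard convolution output size: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

p3 = same_padding(3)                  # 3x3 filter -> padding of 1
n_out = conv_output_size(32, 3, p3)   # a 32-wide input stays 32 wide
```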
Forward propagation: This is separated into two phases. First, the intermediate value K is produced through the convolution of the input data A^(l−1) from the preceding layer with the M tensor; then, bias b is added, and a nonlinear activation function g is applied to the intermediate values:

K = A^(l−1) ∗ M,  A^(l) = g(K + b)
Max-pooling: The output matrix’s proportions can be calculated using Equation (14), accounting for padding p and stride s, with input size n and filter size f:

o = ⌊(n + 2p − f) / s⌋ + 1
The cost function’s partial derivative with respect to the layer parameters is obtained by applying the chain rule in Equation (15):

∂J/∂M = (∂J/∂K) · (∂K/∂M)
The sigmoid activation function, linear transformation, and leaky ReLU are expressed as follows:

σ(x) = 1 / (1 + e^(−x)),  z = Wx + b,  f(r) = r if r > 0, and 0.01r otherwise

The leaky ReLU returns r if the input is positive and 0.01 times r if the input is negative. As a result, it also produces an output for negative values. This minor modification makes the gradient on the left side of the graph nonzero.
Applying the softmax function: A neural network typically does not produce a single final figure. To represent the likelihood of each class, the raw outputs must be mapped to real numbers between zero and one that sum to one:

softmax(z)_i = e^(z_i) / Σ_j e^(z_j)
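The activations just described can be written out directly; the input values below are arbitrary:

```python
import math

def sigmoid(x):
    """Sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def leaky_relu(r):
    """Returns r for positive input and 0.01*r for negative input."""
    return r if r > 0 else 0.01 * r

def softmax(z):
    """Map raw scores to probabilities in (0, 1) that sum to one."""
    m = max(z)                        # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])      # ordered like the raw scores
neg = leaky_relu(-5.0)                # small but nonzero output for a negative input
```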
Applying the CNN-LSTM: The LSTM is applied after the CNN, i.e., CNN-LSTM:

f_t = σ(W_f · [h_(t−1), x_t] + b_f)
i_t = σ(W_i · [h_(t−1), x_t] + b_i)
c̃_t = tanh(W_c · [h_(t−1), x_t] + b_c)
c_t = f_t ∗ c_(t−1) + i_t ∗ c̃_t

Here, x_t is the input at the current timestamp, and h_(t−1) is the output of the previous LSTM block. σ represents the sigmoid function, f_t is the forget gate (and i_t the input gate), and b is the bias for the respective gate. c_t is the cell state at timestamp t, and c̃_t represents the candidate for the cell state at timestamp t.
In the context of driver drowsiness, this hybrid CNN-LSTM architecture represents a formidable tool. By analyzing real-time video feeds of a driver’s face, the CNN component is capable of discerning crucial facial features, such as eye movement, blink rate, and facial expressions. These features, extracted through the convolutional layers of the CNN, serve as a rich foundation of visual cues.
Figure 7 presents the architecture of the proposed model. However, what sets this architecture apart is its LSTM component. This network structure possesses the ability to understand sequences and patterns within the extracted features. In the realm of drowsiness detection, this implies that the system can not only assess the current state of the driver’s face but also track how that state evolves over time. Subtle cues that might signify drowsiness, such as prolonged eye closure or micro-expressions, can be identified by the LSTM as part of a sequence, enabling more accurate and reliable detection. The predictive power of the CNN-LSTM architecture extends beyond mere real-time analysis. By leveraging the LSTM’s memory capabilities, the model can recognize trends and tendencies that indicate an increasing likelihood of drowsiness. This forward-looking approach allows for timely intervention, such as alerts to the driver or automated adjustments to vehicle settings, preventing potential accidents before they occur.
A CNN-LSTM network is a popular architecture used in deep learning for tasks that involve both spatial and temporal data, such as video analysis or sequential image data. The input data for a CNN-LSTM network typically consist of a sequence of images or tensors, where each element in the sequence represents a frame in a video or a timestamp in a time series. The input data are first passed through a CNN to extract spatial features. The CNN layers consist of convolutional layers followed by pooling layers, which help capture important spatial patterns in the data. These layers are responsible for feature extraction from individual frames. After the CNN layers, the features can either be flattened into a 1D vector or global average pooling can be used to reduce the spatial dimensions. This step depends on the specific problem and the architecture used. The output of the CNN is then fed into an LSTM network, which is responsible for capturing temporal dependencies and patterns across the sequence of frames. The LSTM network consists of multiple LSTM layers. Each LSTM cell maintains an internal state that can capture information from previous time steps. This internal state helps model long-term dependencies in the data. The LSTM layers process the sequence of features generated by the CNN, one time step at a time, and update their internal states accordingly. The final LSTM layer is often connected to one or more fully connected (dense) layers, which can be used for making predictions or classifications. The output layer’s architecture depends on the specific task. For example, the output for video classification might consist of a softmax layer producing class probabilities. In our approach, the MSE (mean-squared error) function was utilized as the loss function. To update the parameters of each network layer, the common Adam optimization technique was employed as the optimizer.
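As a small sketch of the training objective just mentioned, the MSE loss over a batch of predictions can be computed directly; the target and prediction values here are made up for illustration:

```python
def mse(y_true, y_pred):
    """Mean-squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical per-clip drowsiness targets vs. model outputs.
loss = mse([1.0, 0.0, 1.0], [0.9, 0.2, 0.8])   # (0.01 + 0.04 + 0.04) / 3
```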
The dropout layer we used contributed to the model’s enhanced generalizability, decreased the training time, and prevented overfitting. In our research, the constructed model’s prediction performance was compared with that of the EM-CNN, VGG-16, GoogLeNet, AlexNet, and ResNet50 models in order to confirm the model’s efficacy. These methods were chosen for comparison because of their specific characteristics. EM-CNN is a semi-supervised learning algorithm that uses only weakly annotated data and performs very efficiently in face detection. VGG-16 is a 16-layer deep neural network, a relatively extensive network with a total of 138 million parameters, that can achieve a test accuracy of 92.7% on ImageNet, a dataset containing more than 14 million training images across 1000 object classes. GoogLeNet is a type of CNN based on the Inception architecture. It utilizes Inception modules, which allow the network to choose between multiple convolutional filter sizes in each block. AlexNet uses an 8-layer CNN, showing, for the first time, that the features obtained through learning can transcend manually designed features, thereby breaking the previous paradigm in computer vision. ResNet-50 is a 50-layer CNN (48 convolutional layers, 1 max-pooling layer, and 1 average-pooling layer) that forms networks by stacking residual blocks.