**3. Methodology**

Two different methodologies are employed to classify driver drowsiness using ECG signals: (1) two traditional classifiers (random forest and KNN) trained on features extracted from HRV signals, and (2) a deep convolutional neural network (CNN) trained on wavelet scalogram images of the ECG. The Bayesian optimization method is used to optimize the hyperparameters of the classifiers. Figure 5 shows the flowchart of these methods. The following subsections describe the structure of these methodologies.

**Figure 5.** Two different approaches to classify driver drowsiness using ECG signals: wavelet scalograms or derived HRV features. The hyperparameters of the KNN, random forest, and CNN models are optimized using the Bayesian optimization method.

#### *3.1. Data Synchronization*

Ground truth is defined based on the video observations and recorded using the frame number information of the SmartEye™ data, collected at a sampling frequency of 100 Hz. Physiological signals were recorded with separate equipment at a 500 Hz sampling frequency, but were also fed into the central recording module and stored at the same 100 Hz sampling frequency. The lower sample rate is not sufficient for high-quality processing of physiological data. Therefore, the video and physiological data sources are synchronized with the help of the respiration signal, which is available at both sampling rates of 100 Hz and 500 Hz. The normalized cross-correlation between the two respiration signals is calculated at all possible lags, and the delay between them is taken as the lag with the largest absolute value of the normalized cross-correlation. Figure 6 shows an example of data synchronization, where the 500 Hz respiration data is shifted about 21.4 s forward to be synchronized with the 100 Hz respiration data. The same time shift is applied to the ECG signals collected at 500 Hz to sync them with the video observations. In these data, offset correction was sufficient for an accuracy of ±1 video frame.

**Figure 6.** Data synchronization using respiration signals collected with two different frequencies of 100 Hz and 500 Hz. In this example, the 500 Hz data is shifted about 21.4 s forward to be synced with the 100 Hz data.
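
A minimal sketch of this lag estimation in Python follows, assuming both respiration channels are available as NumPy arrays; the polyphase resampling step and the function name are our own choices, not details given in the text:

```python
import numpy as np
from scipy import signal

def estimate_delay(resp_100hz, resp_500hz, fs_low=100, fs_high=500):
    """Estimate the delay (in seconds) between two respiration recordings
    via normalized cross-correlation on a common time base."""
    # Downsample the 500 Hz channel to 100 Hz so both share one time base.
    resp_ds = signal.resample_poly(resp_500hz, up=1, down=fs_high // fs_low)

    # Standardize both signals so the cross-correlation is normalized.
    a = (resp_100hz - resp_100hz.mean()) / resp_100hz.std()
    b = (resp_ds - resp_ds.mean()) / resp_ds.std()

    # Cross-correlation at all possible lags.
    xcorr = signal.correlate(a, b, mode="full")
    lags = signal.correlation_lags(len(a), len(b), mode="full")

    # The delay is the lag with the largest absolute cross-correlation.
    return lags[np.argmax(np.abs(xcorr))] / fs_low
```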

#### *3.2. ECG Preprocessing*

Generally, ECG signals are contaminated with different noise sources such as power line interference (50 Hz) [31] and baseline wander [32]. A second-order infinite impulse response (IIR) notch filter [33] is utilized here to remove the power line noise from ECG signals. Furthermore, a high pass filter with a passband frequency of 0.5 Hz is also employed to remove the low-frequency baseline wander noise. Figure 7 shows one part of the noisy and denoised ECG signals after removing the baseline wander and power line noise.

**Figure 7.** Noisy and denoised ECG signals after removing baseline wander and power line noise.
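
A sketch of this preprocessing chain with SciPy is given below; the notch quality factor and the Butterworth order are assumptions, as the text specifies only the filter types and corner frequencies:

```python
from scipy import signal

def denoise_ecg(ecg, fs=500):
    """Remove power line interference (50 Hz) and baseline wander."""
    # Second-order IIR notch filter at 50 Hz; Q sets the notch width.
    b_notch, a_notch = signal.iirnotch(w0=50.0, Q=30.0, fs=fs)
    ecg = signal.filtfilt(b_notch, a_notch, ecg)

    # High-pass filter with a 0.5 Hz passband edge for baseline wander.
    sos = signal.butter(4, 0.5, btype="highpass", fs=fs, output="sos")
    return signal.sosfiltfilt(sos, ecg)
```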

#### *3.3. Driver Drowsiness Classification Using Scalograms of ECG Signals*

This section describes the proposed method for driver drowsiness classification using deep neural networks trained by wavelet scalogram images of the ECG signals.

#### 3.3.1. Wavelet Scalogram

Wavelet analysis calculates the correlation (similarity) between an input signal and a given wavelet function *ψ*(*t*). Unlike the Fourier transform, wavelet analysis provides a multiresolution time–frequency output under the assumption that low frequencies maintain the same characteristics over the whole duration of the input signal, whereas high frequencies appear at different time points as short events.

Therefore, the wavelet function is scaled and translated by two parameters *s* ∈ R<sup>+</sup> and *u* ∈ R to generate a wavelet filter bank, *ψ<sub>u,s</sub>*, as presented by Equation (1) [34].

$$
\psi\_{u,s}(t) = \frac{1}{\sqrt{s}} \psi \left( \frac{t - u}{s} \right) \tag{1}
$$

By using this transformed wavelet, the continuous wavelet transform (CWT) of the input signal *x*(*t*) at time *u* and scale *s* can be calculated as:

$$X\_{WT}(u, s) = \int\_{-\infty}^{\infty} x(t)\, \psi\_{u,s}^{\*}(t)\, dt \tag{2}$$

where *ψ<sub>u,s</sub>*<sup>∗</sup>(*t*) is the complex conjugate of *ψ<sub>u,s</sub>*(*t*), and *X<sub>WT</sub>*(*u*, *s*) provides the frequency content of *x*(*t*) corresponding to time *u* and scale *s*. By using the two parameters *u* and *s*, it is possible to investigate the input *x*(*t*) in the time and frequency domains simultaneously, whereby the resolution in time and frequency depends on the scale parameter *s*. CWT provides a time–frequency decomposition of *x*(*t*) in the time–frequency plane. This method can be more beneficial than methods such as the short-time Fourier transform (STFT) when investigating non-stationary signals, since it provides a higher time resolution at higher frequencies by reducing frequency resolution, and a higher frequency resolution at lower frequencies by reducing time resolution. In contrast, the time and frequency resolutions are constant in STFT. The scalogram of *x*(*t*) at any positive scale is calculated as the norm of *X<sub>WT</sub>*(*u*, *s*):

$$S(s) = \left\| X\_{WT}(u, s) \right\| = \left( \int\_{-\infty}^{\infty} \left| X\_{WT}(u, s) \right|^{2} \, du \right)^{\frac{1}{2}} \tag{3}$$

This equation calculates the energy of *X<sub>WT</sub>* at a scale *s*. Therefore, we can find the significant scales (which correspond to frequencies) in the signal using the scalogram.

The wavelet scalogram is used to transform the time series ECG signal into the time–frequency domain. Here, the Morse wavelet [35] is employed to calculate the wavelet transformation of the ECG signals. A sliding window with a length of 10 s and an overlap of 5 s between two consecutive windows was employed, and the scalogram image of every window of the data is calculated. The resulting number of data windows in each level of driver drowsiness is provided in Table 2 for manual driving, and in Table 3 for automated driving. As these tables show, the generated data sets are imbalanced in both manual and automated modes. This fact must be taken into account in the structure of the deep and traditional classifiers. Moreover, the percentages of the MD and ED classes are higher in the automated driving tests than in the manual tests. Thus, the drivers were generally drowsier in automated driving.

Figure 8 shows sample ECG signals and their corresponding scalogram images for all three drowsiness levels in a rested automated test. The generated images are resized to 224 × 224 pixels. To reduce the computational complexity of training the deep network, the RGB scalogram images are also converted to grayscale, as presented in Figure 9. These grayscale images are used as the input data to train the deep convolutional neural network.

**Table 2.** The number of data samples in each drowsiness class after applying time windows to generate the scalograms in the **manual** driving tests.


**Table 3.** The number of data samples in each drowsiness class after applying time windows to generate the scalograms in the **automated** driving tests.


**Figure 8.** Examples of ECG signal segments and their corresponding scalograms for the AL (**a**), MD (**b**), and ED (**c**) classes.

**Figure 9.** Sample of a grayscale resized image (224 × 224 pixels) of the ECG scalogram image.
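
The windowing and image-generation steps could look as follows. PyWavelets does not ship the Morse wavelet used in the paper, so the Morlet wavelet stands in here; the scale range and the min–max intensity scaling are likewise assumptions:

```python
import numpy as np
import pywt
from PIL import Image

FS = 500           # ECG sampling frequency in Hz
WIN = 10 * FS      # 10 s window length
STEP = 5 * FS      # 5 s hop, i.e., 5 s overlap between windows

def sliding_windows(ecg):
    """Yield 10 s ECG segments with 5 s overlap."""
    for start in range(0, len(ecg) - WIN + 1, STEP):
        yield ecg[start:start + WIN]

def scalogram_image(segment, scales=np.arange(1, 129)):
    """Turn one ECG segment into a 224x224 grayscale scalogram image."""
    coeffs, _ = pywt.cwt(segment, scales, "morl", sampling_period=1 / FS)
    mag = np.abs(coeffs)
    # Map magnitudes to 8-bit gray values and resize to the input size.
    gray = (255 * (mag - mag.min()) / (np.ptp(mag) + 1e-12)).astype(np.uint8)
    return Image.fromarray(gray).resize((224, 224))
```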

#### 3.3.2. Architecture of Deep CNN and Optimization of Its Hyperparameters

Convolutional neural networks (CNNs) have been widely used to learn features from input images in different applications [36–39]. These networks capture the spatial dependencies between different parts of an input image by applying convolution operations with specific filters to the input [40]. This study used scalogram images of ECG signals to train a deep CNN and classify driver drowsiness. As scalograms are time–frequency representations of an underlying time signal, temporal information is coded in the spatial features of the image. The input images are first normalized to have zero mean and unit variance. Then, the whole data set is split randomly into training (80% of the input data), validation (10%), and test (10%) subsets in such a way that the percentages of the drowsiness classes in each subset are approximately the same as presented in Tables 2 and 3.

The utilized deep CNN is composed of five convolutional blocks and one fully connected block in its hidden layers. The convolutional and fully connected blocks are presented in Figure 10, where Conv, BN, ReLU, Max Pool, and FC denote a convolution layer, a batch normalization layer, the ReLU activation function (*max*(*x*, 0)), a max pooling layer, and a fully connected layer, respectively. The hidden layers are followed by the output layer, which is constructed from an FC layer, a soft-max layer, and a weighted classification layer (weight). The number of neurons in the fully connected layer of the output layer is equal to the number of classes (here, three). The weight layer is used to mitigate the data imbalance issue. This layer contains one element per drowsiness class, where every element is calculated using Equation (4):

$$W\_i = \frac{N\_c}{C\_i \sum\_{j=1}^{N\_c} \frac{1}{C\_j}} \tag{4}$$

where *N<sub>c</sub>* is the number of classes (here, three), *C<sub>i</sub>* is the number of data samples that belong to the *i*-th class, and *W<sub>i</sub>* is the weight of the *i*-th class. By applying Equation (4) to the data samples that belong to the drowsiness classes in the manual and automated modes (presented in Tables 2 and 3), the corresponding weight for every class is computed. Table 4 provides these weights. As this table shows, the class weight of the ED class is highest in both manual and automated mode tests. With these weights, misclassification errors on the MD and ED classes receive more weight than on the AL class. Therefore, if the network wrongly classifies an MD or ED sample into the AL class, the result is a large misclassification error with a significant influence on the optimization process, which in turn reduces the frequency of this kind of classification error.
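
A direct implementation of Equation (4) is straightforward; the class counts below are made-up placeholders, as the real counts are listed in Tables 2 and 3:

```python
import numpy as np

def class_weights(counts):
    """Class weights per Equation (4): the rarer a class, the larger
    its weight in the classification loss."""
    counts = np.asarray(counts, dtype=float)
    return len(counts) / (counts * np.sum(1.0 / counts))

# Placeholder counts for AL, MD, ED; the ED weight comes out largest.
print(class_weights([5000, 2000, 500]))  # -> [0.222 0.556 2.222]
```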

**Figure 10.** Convolutional (**a**) and fully connected (**b**) blocks are used to construct the deep CNN.

**Table 4.** Class weights of the different drowsiness classes used in the deep CNNs to alleviate the imbalanced data set issue.


Figure 11 presents the architecture of the deep CNN, where five convolution blocks are followed by one fully connected block. Moreover, one dropout layer is added after the convolution blocks to reduce the possibility of overfitting or of getting stuck in local minima during the training process. The dropout layer temporarily eliminates some neurons with a predefined probability, along with all of their input and output connections [41].

Deep neural networks have several hyperparameters, such as the learning rate, the regularization parameter, and the number of filters or neurons, that can influence network performance. Finding a combination of these hyperparameters that yields the optimal performance of the deep network is an active task in the research field of deep learning [42,43]. In this study, the Bayesian optimization method [44,45] is applied to optimize the hyperparameters of the deep CNN. This method has the capability of reasoning about an iteration's performance before it is performed. Therefore, fewer iterations are needed to find the optimal hyperparameter combination than with other hyperparameter optimization methods [45]. Moreover, this method yields better generalization on the test data set [46]. Here, four different hyperparameters were optimized using the Bayesian optimization method: the learning rate, the L2 regularization, the dropout probability, and the number of filters in the convolution layers (Conv1 to Conv5 in Figure 11) together with the number of neurons in the fully connected layer of the hidden layers (FC1 in Figure 11). It is assumed that the number of filters in Conv1 to Conv5 and the number of neurons in FC1 are equal, so a single hyperparameter is defined to find their optimal value. Table 5 presents the specified search space for each of these hyperparameters.

**Figure 11.** The architecture of the deep CNN used to classify driver drowsiness using ECG scalogram images.
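
The description above maps onto a compact PyTorch sketch. Kernel sizes, the pooling factor, and the flattened-feature arithmetic are our assumptions (the paper fixes these only implicitly via Figures 10 and 11), and the loss weights are placeholders:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """Conv -> BN -> ReLU -> Max Pool, as in Figure 10a."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

class DrowsinessCNN(nn.Module):
    """Five conv blocks, dropout, one FC block, and a 3-class output.
    n (filters/neurons) and p (dropout) are the tuned hyperparameters
    H4 and H2 from Table 5."""
    def __init__(self, n=16, p=0.3, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, n), *[conv_block(n, n) for _ in range(4)],
            nn.Dropout(p), nn.Flatten(),
        )
        # After five 2x poolings, a 224x224 input shrinks to 7x7.
        self.fc1 = nn.Sequential(nn.Linear(n * 7 * 7, n), nn.ReLU())
        self.out = nn.Linear(n, n_classes)

    def forward(self, x):  # x: (batch, 1, 224, 224)
        return self.out(self.fc1(self.features(x)))

# Weighted cross-entropy plays the role of the weighted classification
# layer; the weights follow Equation (4) (placeholder values here).
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([0.22, 0.56, 2.22]))
```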


**Table 5.** Hyperparameters of the deep CNN optimized using the Bayesian optimizer, and their search spaces.

An adaptive moment estimation (ADAM) optimizer is employed to train the parameters (weights and biases) of the designed deep CNNs. The maximum number of epochs is empirically chosen to be 15. A learning-rate schedule is utilized that multiplies the initial learning rate by 0.1 after 12 epochs to alleviate overfitting in the later training epochs. The mini-batch size is constant and equal to 16. The training process was conducted on a system with an Intel Core™ i7-782HQ CPU and an NVIDIA™ Quadro M2200 GPU.
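
With scikit-optimize, the hyperparameter search can be sketched as follows; the search ranges are invented stand-ins for the actual bounds in Table 5, and `train_and_validate` is a hypothetical helper that trains the CNN and returns a validation error:

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Hypothetical search space; the real ranges are given in Table 5.
space = [
    Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),  # H1
    Real(0.1, 0.7, name="dropout_p"),                             # H2
    Real(1e-6, 1e-2, prior="log-uniform", name="l2_reg"),         # H3
    Integer(8, 64, name="n_filters"),                             # H4
]

def objective(params):
    lr, p, l2, n = params
    # Train for up to 15 epochs and return a loss to minimize,
    # e.g., 1 - balanced validation accuracy.
    return train_and_validate(lr, p, l2, n)  # hypothetical helper

# The Gaussian-process surrogate reasons about promising settings
# before evaluating them, so few training runs are needed.
result = gp_minimize(objective, space, n_calls=30, random_state=0)
print(result.x)
```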

#### *3.4. Driver Drowsiness Classification Using Heart Rate Variability Data*

This section describes the proposed method for driver drowsiness classification using feature extraction from HRV data.

#### 3.4.1. Derivation of Heart Rate Variability Data from ECG Signals

The heart rate variability signal is derived from the preprocessed ECG signals by applying an R-peak detection algorithm. In this study, we used the automatic multiscale-based peak detection (AMPD) method [47] as the ECG R-peak detector; then the RR intervals (RRIs), defined as the time intervals between every two consecutive R-peaks, are calculated. Figure 12 shows the detected R-peaks in a part of the ECG signal.

**Figure 12.** Detected R-peaks in the denoised ECG signal using the AMPD method.
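
Since AMPD is not part of SciPy, the sketch below substitutes `scipy.signal.find_peaks` with a refractory-period constraint; the distance and height thresholds are assumptions:

```python
import numpy as np
from scipy.signal import find_peaks

def rr_intervals(ecg, fs=500):
    """Detect R-peaks and return the RR intervals in seconds.

    The paper uses the AMPD detector [47]; a thresholded peak search
    serves as a simple stand-in here."""
    # At most one R-peak per 0.3 s (~200 bpm), with a height well above
    # the signal mean to avoid counting T-waves as beats.
    peaks, _ = find_peaks(
        ecg, distance=int(0.3 * fs),
        height=np.mean(ecg) + 2 * np.std(ecg),
    )
    return np.diff(peaks) / fs
```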

#### 3.4.2. Feature Extraction from HRV Data

The literature has proposed several features to be extracted from RR intervals for driver drowsiness detection [18], which conform to measures that are well established in clinical contexts [48]. Other HRV features are based on a visualization technique called the Poincaré plot. In this subsection, this plot is first introduced; then those and other commonly extracted features from RR intervals are explained.

Poincaré plot: This plot is a type of recurrence plot used to investigate self-similarity in time series, and it can be used to analyze the nonlinear properties of HRV data [49]. Consider X = [RR<sub>1</sub>, RR<sub>2</sub>, . . . , RR<sub>N</sub>] as an RR interval time series. The Poincaré plot first plots (RR<sub>1</sub>, RR<sub>2</sub>), then (RR<sub>2</sub>, RR<sub>3</sub>), then (RR<sub>3</sub>, RR<sub>4</sub>), and so on. This plot provides information about the short-term and long-term dynamics of the RR intervals. An ellipse is fitted to the plotted data points, and the minor and major semi-axes of the ellipse are associated with short-term and long-term HRV, respectively. Figure 13 shows the Poincaré plot for RR intervals collected in a rested automated driving test. The least-squares method was employed to fit an ellipse to the given RR intervals [50], and geometrical properties of this ellipse are extracted as features to describe the HRV dynamics.

**Figure 13.** Poincaré plot and fitted ellipse for RR intervals during a rested automated driving test. Minor and major semi-axes of the fitted ellipse, SD1, SD2, and their ratio, are calculated as features to capture the dynamics of HRV data.

Three features are extracted from this plot (a code sketch follows the list):

- SD1: the minor semi-axis of the fitted ellipse, which captures short-term HRV;
- SD2: the major semi-axis of the fitted ellipse, which captures long-term HRV;
- SD1/SD2: the ratio of the two semi-axes.
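
A minimal sketch of these three features follows. Instead of the least-squares ellipse fit of [50], it uses the common statistical shortcut that expresses the semi-axes through variances of the RR series and its successive differences:

```python
import numpy as np

def poincare_features(rr):
    """SD1, SD2, and SD1/SD2 from an RR interval series (in seconds)."""
    d = np.diff(rr)                                   # successive differences
    sd1 = np.sqrt(0.5 * np.var(d))                    # short-term HRV
    sd2 = np.sqrt(max(2.0 * np.var(rr) - 0.5 * np.var(d), 0.0))  # long-term
    return sd1, sd2, sd1 / sd2
```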


Other features that have been proposed by previous studies [18,51,52] are also extracted from RR intervals. These features are:

1. MeanRR: This feature presents the mean value of the time intervals between every two consecutive R-peaks. MeanRR is calculated by Equation (5):

$$\text{MeanRR} = \frac{1}{\text{N}\_{\text{R}} - 1} \sum\_{i=1}^{\text{N}\_{\text{R}} - 1} \text{RR}\_{i+1} \tag{5}$$

where N<sub>R</sub> is the number of heartbeats in the sliding window and RR<sub>i+1</sub> is the time interval between R-peaks R<sub>i</sub> and R<sub>i+1</sub>.

2. SDRR: This feature represents the standard deviation of RR intervals, calculated by Equation (6).

$$\text{SDRR} = \sqrt{\frac{1}{\text{N}\_{\text{R}} - 1} \sum\_{i=1}^{\text{N}\_{\text{R}} - 1} \left( \text{RR}\_{i+1} - \text{MeanRR} \right)^{2}} \tag{6}$$

3. RMSSD: This feature is the root mean square of successive RR interval differences, calculated by Equation (7). It reflects parasympathetic activity.

$$\text{RMSSD} = \sqrt{\frac{1}{\text{N}\_{\text{R}} - 1} \sum\_{i=1}^{\text{N}\_{\text{R}} - 1} \left( \text{RR}\_{i+1} - \text{RR}\_{i} \right)^{2}} \tag{7}$$

4. pRR50: This feature measures the ratio of the number of successive RR intervals that differ by more than 50 ms to the total number of RR intervals in every sliding window. Equation (8) calculates pRR50.

$$\text{pRR50} = \frac{\text{RR50}\_{\text{count}}}{\text{N}\_{\text{R}}} \tag{8}$$
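
Equations (5)–(8) translate directly into code; note that Equation (8) normalizes by the heartbeat count N<sub>R</sub>, which is approximated here by the interval count:

```python
import numpy as np

def hrv_time_features(rr):
    """MeanRR, SDRR, RMSSD, and pRR50 per Equations (5)-(8); `rr`
    holds the RR intervals (in seconds) of one sliding window."""
    d = np.diff(rr)  # differences of consecutive RR intervals
    return {
        "MeanRR": np.mean(rr),
        "SDRR": np.std(rr),
        "RMSSD": np.sqrt(np.mean(d ** 2)),
        # Share of successive differences exceeding 50 ms.
        "pRR50": np.count_nonzero(np.abs(d) > 0.050) / len(rr),
    }
```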


Overall, eleven features are extracted from the HRV data and used to classify driver drowsiness. A window length of ten seconds, as used in the ECG scalogram approach above, is considered extremely short for the evaluation of HRV, as it captures almost exclusively the fast fluctuations of heart rate driven by parasympathetic activity [55]. For exploratory purposes, we therefore also applied longer sliding windows than in the deep learning model. Two additional sliding windows are employed: (1) 60 s with 30 s overlap, and (2) 40 s with 20 s overlap. These longer windows provide the classifiers with a better estimation of the mid-range dynamics of the HRV data than can be expected from the short windows used in the deep learning method. The results of these sliding windows are compared in Section 4.1.

The following subsection explains the two classifiers (KNN and random forest) used for drowsiness classification.

#### 3.4.3. Classifying Driver Drowsiness Using Traditional Classifiers

KNN and random forest classifiers are employed to classify driver drowsiness using the features extracted from the HRV data. Each of these classifiers has two hyperparameters. The KNN hyperparameters are the number of neighbors for every sample (numNei) and the function used to measure the distance between samples (distance) [56]. The random forest hyperparameters are the minimum leaf size (minLS) and the number of predictors to sample at each node (numPTS) [57]. These hyperparameters are also optimized using the Bayesian optimization method to find the optimal set. Moreover, the imbalanced data set issue is addressed by using a uniform prior probability for every class in the KNN [58] and random under-sampling boosting (RUSBoost) [59] for the random forest classifier.
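
A sketch of this tuning setup with scikit-learn, scikit-optimize, and imbalanced-learn follows; the search ranges and distance candidates are assumptions, and `X_train`, `y_train` denote the HRV feature matrix and drowsiness labels:

```python
from imblearn.ensemble import RUSBoostClassifier  # RUSBoost [59]
from skopt import BayesSearchCV
from skopt.space import Categorical, Integer
from sklearn.neighbors import KNeighborsClassifier

# Bayesian search over the two KNN hyperparameters (numNei, distance);
# an analogous space would be defined for minLS and numPTS of the
# random forest.
knn_search = BayesSearchCV(
    KNeighborsClassifier(),
    {
        "n_neighbors": Integer(1, 50),                          # numNei
        "metric": Categorical(
            ["euclidean", "cityblock", "chebyshev", "cosine"]   # distance
        ),
    },
    n_iter=30, cv=5, scoring="balanced_accuracy", random_state=0,
)
# knn_search.fit(X_train, y_train)

# RUSBoost counters the class imbalance for the tree-based model by
# random under-sampling inside the boosting loop.
rus_forest = RUSBoostClassifier(n_estimators=100, random_state=0)
# rus_forest.fit(X_train, y_train)
```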

#### **4. Results and Discussion**

To evaluate the generated classifiers, confusion matrices were calculated for the test dataset. These matrices provide four different values that are computed for every drowsiness level: the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).


These four values are used to calculate five different metrics for every level of driver drowsiness: specificity, sensitivity, precision, F1-score, and balanced accuracy (see the sketch after this paragraph).
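
For reference, these per-class values and metrics can be read off a confusion matrix as follows (rows: true classes, columns: predicted classes); this is a generic computation, not code from the study:

```python
import numpy as np

def per_class_metrics(cm):
    """Specificity, sensitivity, precision, F1-score per class, and
    the balanced accuracy, from a confusion matrix `cm`."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp   # members of the class that were missed
    tn = cm.sum() - tp - fp - fn
    sen = tp / (tp + fn)                  # sensitivity (recall)
    spe = tn / (tn + fp)                  # specificity
    pre = tp / (tp + fp)                  # precision
    f1 = 2 * pre * sen / (pre + sen)      # F1-score
    return spe, sen, pre, f1, sen.mean()  # mean recall = balanced acc.
```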


The following subsections present the results of the two proposed methods for driver drowsiness classification.

#### *4.1. Results of Driver Drowsiness Classification Using Heart Rate Variability Data*

To evaluate the performance of the KNN and random forest classifiers, the confusion matrices of these classifiers trained on HRV-based features with the three sliding windows of 10 s, 40 s, and 60 s are provided in Figures 14–16, respectively. In these figures, the diagonal elements (in gray) give the number of sliding windows that are correctly classified into the different drowsiness classes, according to the ground truth classification from the video observations. Accordingly, the percentages written in these elements show the correct classification accuracy for every specific drowsiness level. The non-diagonal cells present the number of misclassified samples. As these figures show, the classification accuracy of the MD class is lower than that of the two other classes in the manual mode. Furthermore, random forest performs better than KNN for drowsiness classification in both manual and automated modes, regardless of the sliding window used. The balanced accuracies for every classifier in every driving mode are provided in Table 6. These accuracies are calculated as the average of the TP accuracies shown in the gray elements of the confusion matrices in Figure 16c,d. Therefore, the best balanced accuracy achieved with the traditional methods in the manual mode is the average of 64.7%, 56.2%, and 56.4%, and the best balanced accuracy in the automated mode achieved by the same methods is the average of 63.4%, 63.5%, and 66.6%. According to this table, the best balanced accuracies in the automated and manual modes are 63.8% and 62.1%, respectively, both obtained using the random forest classifier and the sliding window of 60 s with 30 s overlap.

**Figure 14.** Confusion matrices of **KNN** classifier in the manual tests (**a**), **KNN** classifier in automated tests (**b**), **random forest** in the manual tests (**c**), and **random forest** in the automated tests (**d**) for driver drowsiness classification. The length of the sliding window for feature extraction is 10 s with a 5 s overlap. AL: alert, MD: moderately drowsy, and ED: extremely drowsy.


**Figure 15.** Confusion matrices of **KNN** classifier in the manual tests (**a**), **KNN** classifier in automated tests (**b**), **random forest** in the manual tests (**c**), and **random forest** in the automated tests (**d**) for driver drowsiness classification. The length of the sliding window for feature extraction is 40 s with a 20 s overlap. AL: alert, MD: moderately drowsy, and ED: extremely drowsy.


**Figure 16.** Confusion matrices of **KNN** classifier in the manual tests (**a**), **KNN** classifier in automated tests (**b**), **random forest** in the manual tests (**c**), and **random forest** in the automated tests (**d**) for driver drowsiness classification. The length of the sliding window for feature extraction is 60 s with a 30 s overlap. AL: alert, MD: moderately drowsy, and ED: extremely drowsy.

**Table 6.** Balanced accuracies of the KNN and random forest classifiers in the manual and automated driving modes.



For the sake of brevity, the classification metrics, including specificity, sensitivity, precision, and F1-score, are shown only for the best classifier (random forest trained on HRV-based features with a 60 s sliding window) in the manual and automated modes. Table 7 presents these metrics. As this table shows, the precision value for the ED class is low. This occurs because the number of TP is low for this class, so even a modest number of false positives weighs heavily. According to this table, the AL class has the maximum F1-score in both manual and automated modes. Therefore, the accuracy of the random forest for the AL class is higher than for the two other classes, and accordingly, the false alarm rate is reduced using this classifier.

**Table 7.** Classification metrics for the **random forest** classifier trained by HRV-based features in the **manual** and **automated** driving modes. Spe.: specificity; Sen.: sensitivity; Pre.: precision; and F1S: F1-score.


#### *4.2. Results of Driver Drowsiness Classification Using Scalograms of ECG Signals*

As presented in Section 3.3.2, four different hyperparameters of the deep CNN are optimized using the Bayesian optimizer. Table 8 shows the optimized hyperparameters of the deep CNNs in the manual and automated driving modes. As this table shows, the number of filters in the convolution layers and of neurons in the fully connected layer (hyperparameter *H*4) is higher in the automated driving mode. Therefore, the computational cost of classifying driver drowsiness with the proposed deep CNNs is higher for the automated tests. The L2 regularization value (hyperparameter *H*3) is much higher in the manual tests than in the automated tests; thus, stronger weight regularization was needed to classify driver drowsiness in the manual tests. The dropout probability (hyperparameter *H*2) of the deep CNN trained on the ECG signals of the automated tests is higher than that of the network designed for the manual tests. Since the number of neurons is also higher for the deep CNN trained on the automated-driving ECG signals, that network is wider, and consequently its dropout probability should also be higher so that more neurons are turned off and overfitting is avoided.

**Table 8.** Optimized values of the hyperparameters in the different driving modes, obtained by inputting ECG scalograms to the deep CNNs. Hyperparameters *H*1 to *H*4 are defined in Table 5.


In comparison with other widely used deep CNNs that are implemented in embedded systems for real-time face recognition or object detection, our developed deep CNNs have far fewer parameters. Table 9 compares the number of parameters of four deep networks frequently used in real-time applications (AlexNet [60], VGG16 [61], ResNet18 [62], and GoogLeNet [63]) with that of our developed networks.

**Table 9.** Comparison of the approximate number (app. no.) of parameters in the developed deep CNNs with other deep networks implemented in real-time applications in previous works. ECG-automated and ECG-manual denote the deep networks trained on the driving tests in automated and manual modes, respectively.


The confusion matrices of the deep CNNs trained on the ECG signals of the manual and automated tests are provided in Figure 17 to evaluate their classification performance. As this figure shows, the MD class has the lowest and the AL class the highest classification accuracy in both manual and automated driving modes. Therefore, reducing the number of classes from three to two could increase classification accuracy; however, binary classification would make it impossible to capture the transition between the AL and ED states. The balanced accuracy of the deep CNNs in both manual and automated modes is provided in Table 10. These accuracies are calculated as the average of the TP accuracies shown in the gray elements of the confusion matrices in Figure 17a,b. Therefore, the balanced accuracy in the manual mode is the average of 81.2%, 78.6%, and 79.1%, and the balanced accuracy in the automated mode is the average of 82.2%, 73.8%, and 82.0%. According to Table 10, the balanced accuracies of the deep CNN in the manual and automated driving modes are about 77% and 79%, respectively. Comparing Table 10 with Table 6 shows that the deep CNNs significantly outperform the random forest and KNN methods in both manual and automated modes. Therefore, the input ECG scalograms are more informative regarding driver drowsiness levels than the HRV-based features.

The classification metrics of the deep CNNs in the manual and automated modes are provided in Table 11. Comparing Table 11 with Table 7 shows that the F1-scores of all classes in both driving modes, except the AL class in the manual mode, are improved by using the deep CNN method. According to this table, the precision value for the ED class is again lower than for the other classes, since the number of data samples in this class is much lower than in the MD and AL classes.


**Figure 17.** Confusion matrices of **deep CNN** for driver drowsiness classification in the manual (**a**) and automated (**b**) driving tests.

**Table 10.** Balanced accuracy of the deep CNN in the manual and automated driving modes. B.Acc.: balanced accuracy.


**Table 11.** Classification metrics for the **deep CNN** trained by ECG scalogram images in the **manual** and **automated** driving tests. Spe.: specificity; Sen.: sensitivity; Pre.: precision; and F1S: F1-score.

