1. Introduction
Driving requires various cognitive skills such as visual perception, attention, memory, executive functions, and motor skills [1]. Performance in these cognitive domains is related to driving performance and driving ability. In addition, psychophysiological aspects such as drowsiness and fatigue can affect cognitive processes during driving and lead to traffic accidents [2]. Moreover, fatigue and drowsy driving are some of the major causes of traffic accidents [3,4]. Therefore, objective detection of a driver’s drowsiness is an important factor that could improve driving safety.
However, drowsiness in drivers is not easy to detect. The most commonly used measures are self-assessments of drowsiness. The European Union regulation for the type-approval of motor vehicles with regard to their drivers’ drowsiness and attention warning systems is based on the self-assessment of drowsiness with the Karolinska Sleepiness Scale [5]. Other methods for inferring correlates of drowsiness are image-based methods, vehicle-based methods, physiological-based methods, and hybrid methods [6]. Each of these categories can be further split into subcategories, e.g., physiological-based methods can be based on electroencephalogram (EEG), electrocardiogram (ECG), respiration, electrooculogram, electromyogram, galvanic skin response, or skin temperature [7].
ECG records the heart’s electrical activity (heartbeats), while heart rate variability (HRV) refers to the variation in the time between adjacent heartbeats. HRV is considered a good drowsiness predictor, and accuracies of drowsiness detection based on HRV range from 56.6% to 95% among studies [8]. Respiration can be derived from HRV but is also measured using several other techniques. Siddiqui et al. [9], in their recent research, used radar for non-invasive respiration measurement and obtained 87% drowsiness classification accuracy. Measuring skin temperature can also be used for the assessment of drowsiness. Gielen and Aerts [10] measured temperature on the nose and wrist, and obtained accuracies of 68.4% and 88.9%, respectively. Electrooculogram measures eye movement characteristics. It is often used for drowsiness detection, achieving accuracies that range from 64% to 99% among studies [11].
EEG provides a very accurate assessment of the driver’s mental state [12], achieving accuracies of drowsiness detection that range from 67% to 99% among studies [13]. Furthermore, changes in EEG activity are considered biomarkers of mental fatigue and drowsiness [14]. The most prominent indicator associated with mental fatigue and drowsiness is increased theta activity in frontal, central, and posterior cortical sites. In addition, increased alpha activity is associated with individual variability in cortical changes related to mental fatigue and drowsiness. Many studies also use ratio indices between these frequency bands as indicators of drowsiness [15], and a recent study suggests that multichannel ratio indices could bring an additional increase in drowsiness detection accuracy [16]. However, these results come from studies with only male participants or with males and females pooled into one group, without separating them by sex.
Differences in driving behavior between men and women have been extensively documented using experimental driving tasks, attitudes toward driving, behavioral analysis, risk perception, and the number of accidents.
Drowsiness also has differential effects on driving behavior between men and women. For example, self-rated levels of drowsiness while driving tend to be higher in women than in men [17]. Women also tend to report longer ideal sleep durations than men [18]. As a result, it appears that women’s greater need for sleep leads to higher ratings of drowsiness.
Studies also report sex differences in brain organization that influence the regulation of brain activity during wakefulness and sleep [19]. In addition, sleep deprivation has a differential effect on brain activity between females and males. Therefore, it is reasonable to assume that changes in EEG activity related to drowsiness differ between men and women.
There are various methods of EEG signal analysis to detect drowsiness and alertness in drivers. They are mostly based on the different types of features extracted from the signal (i.e., frequency-domain features, recurrence quantification analysis features, entropies, etc.) [20,21,22,23,24,25]. To the best of our knowledge, there is no research that includes drivers’ sex information in drowsiness detection systems.
A preliminary version of this work has been reported [26] and it shows the statistical difference between EEG features of alert male and female drivers. The main goal of this substantially extended study is to improve drowsiness detection by including the information about drivers’ sex in the classifier. This is done in two ways: (1) by considering sex as a feature, and (2) by separating the datasets into male and female. In addition, the study aims to develop a reliable EEG-based sex classification model, where correlations between features are introduced as a novel feature that differentiates between male and female drivers.
2. Methodology
2.1. Experimental Design
In our study, 34 healthy participants were recorded during two sessions in a driving simulation. All participants had a valid driver’s license and drove a car regularly but were not professional drivers. The EEG was recorded with a 32-channel actiCHamp EEG amplifier (Brain Products, Munich, Germany) with passive sintered Ag/AgCl electrodes. The electrodes were located at prefrontal (Fp1, Fp2), frontal (F3, F4), central frontal (FC5, FC1, FC2, FC6), inferior frontal (F7, F8), midline frontal (Fz), central (C3, C4), midline central (Cz), midline parietal (Pz), central parietal (CP5, CP1, CP2, CP6), midline occipital (Oz), inferior temporal (T7, T8), posterior temporal (TP9, TP10), parietal (P3, P4), inferior parietal (P7, P8), posterior occipital (PO9, PO10), and occipital scalp sites (O1, O2). The electrodes were positioned according to the International 10–20 system guidelines. The electrode at the FCz position was used as a reference. A ground electrode was placed on the forehead. BrainVision Recorder software was used for impedance check and on-line monitoring. The EEG signals were recorded with a 1000 Hz sampling rate. A simulation scenario was shown on three wide LCD screens. The car was controlled with a professional steering wheel joystick and pedals. All participants were instructed to follow the traffic rules. Male participants averaged 30.24 years of age with a standard deviation of 6.86 years and female participants averaged 30.12 years of age with a standard deviation of 6.98 years. The driving scenario was the same for all participants and consisted of driving on state roads and highways, and in an urban environment. The study was approved by the Ethics Committee of the University of Zagreb, Faculty of Electrical Engineering and Computing.
Each participant had two recording sessions. The first recording session (alert session) began at 2:00 p.m. and the second one (drowsy session) began at midnight. The sessions were recorded on different days. The driving scenario was the same for both sessions: after 15 min of adaptation driving (simple acceleration, braking, and turning exercises), all participants were instructed to drive on a highway for 90 min, with day/night lighting adjusted according to the time of the session. Highway driving was monotonous, with very few other cars in traffic, and did not include any unexpected events (such as road crossings of animals or pedestrians). Room lighting, temperature, and humidity were controlled and were the same for both sessions.
Each EEG recording was divided into 10 s epochs with five seconds’ overlap between epochs. The first 250 epochs (~21 min) of highway driving in the alert session were labeled as periods with the highest alertness level for all participants. The last 250 epochs of the drowsy session (approximately after 70 min of highway driving) were labeled as periods with the highest drowsiness level for all participants. This labeling, where the beginning of the session is labeled as alert, and the end is labeled as drowsy, is commonly used in practice [27,28,29,30].
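The epoching scheme described above can be sketched as follows. This is a minimal illustration, assuming the 1000 Hz sampling rate and the 90 min highway drive stated in Section 2.1; the function names `epoch_bounds` and `span_seconds` and their parameters are hypothetical and are not taken from the authors' code.

```python
def epoch_bounds(n_samples, fs=1000, epoch_s=10, step_s=5):
    """Return (start, stop) sample indices of overlapping epochs.

    A 10 s epoch with a 5 s overlap means that consecutive epochs
    start 5 s apart.
    """
    length = epoch_s * fs
    step = step_s * fs
    return [(start, start + length)
            for start in range(0, n_samples - length + 1, step)]


def span_seconds(n_epochs, epoch_s=10, step_s=5):
    """Recording time covered by n_epochs consecutive overlapping epochs."""
    return (n_epochs - 1) * step_s + epoch_s


# 90 min of highway driving sampled at 1000 Hz (assumed to be gap-free)
bounds = epoch_bounds(90 * 60 * 1000)
alert_epochs = bounds[:250]    # first 250 epochs, used for the alert session
drowsy_epochs = bounds[-250:]  # last 250 epochs, used for the drowsy session
```

Under these assumptions, 250 epochs span 1255 s (about 21 min), and the last 250 epochs of a 90 min drive begin roughly 69 min into it, which agrees with the figures quoted in the text.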
These two class labels were additionally confirmed by an expert in psychophysiological behavior based on the participants’ self-assessment of drowsiness (Karolinska Sleepiness Scale) before and after the session, and on visual inspection of the participants. Participants indicated their drowsiness level on the Karolinska Sleepiness Scale for the alert session as extremely alert, very alert, alert, or fairly alert. Additionally, their total score on the Fatigue Assessment Scale indicated no fatigue before the alert session.
Table 1 shows the number of male and female drivers in each class.
Although we used a 32-channel EEG, we wanted to describe dependencies and differences between specific brain regions: the front left (FL), front right (FR), occipital left (OL), and occipital right (OR) regions. We calculated the mean value for each feature from five channels in each region (see Table 2). In addition to the channels from these four regions, we also included Oz, Pz, and Cz channels in our analysis.
2.2. Preprocessing and Feature Extraction
The first step of preprocessing was to filter the raw EEG signals to remove unwanted artifacts from the signal. For filtering, we used a Butterworth bandpass filter from 0.5 Hz to 40 Hz. In this way, we filtered out the line noise (50 Hz). Independent component analysis (ICA) was also used to remove artifacts such as eye movements [31]. The Infomax ICA algorithm was used for separating the original signal into independent components [32]. If any of these components contained waveforms characteristic of eye movements, we removed those components from the signal. Blinks were removed for all participants and left–right eye movements were removed whenever present (in approximately 50% of participants).
EEG features were calculated based on 10 s epochs with a five-second overlap between epochs. We computed the basic frequency-domain features: the relative power of the delta (0.5–4 Hz), theta (4–8 Hz), alpha (8–12 Hz), beta (12–30 Hz), and gamma (30–50 Hz) frequencies. They were calculated using the Thomson multitaper method [33] to obtain the power spectral density. We also used recurrence quantification analysis (RQA) [34,35,36] features: determinism (Det), laminarity (Lam), recurrence rate (RR), trapped time (TT), determinism divided by recurrence rate (Det/RR), longest diagonal (Lmax), longest vertical line (Vmax), average diagonal line length (Adll), divergence (Div), and entropy (Ent). The RQA features were calculated from the recurrence plot (RP) of the signal. RP is a 2D representation of the phase space trajectory of the signal [37]. It is a matrix with dimensions N × N, where N is the length of the signal. The position (i, j) in the matrix is marked as one if the i-th and j-th points in the signal are close to each other.
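To make the recurrence plot definition concrete, the following sketch builds an RP for a short scalar signal and derives three of the listed RQA features (RR, Det, and Lmax). It is an illustration under stated assumptions, not the authors' implementation: the closeness threshold `eps`, the minimum line length `lmin = 2`, the exclusion of the main diagonal, and the absence of phase-space embedding are all choices made here, since the source does not state its RQA settings.

```python
def recurrence_plot(signal, eps):
    """N x N matrix with 1 where points i and j are within eps of each other."""
    n = len(signal)
    return [[1 if abs(signal[i] - signal[j]) <= eps else 0
             for j in range(n)]
            for i in range(n)]


def recurrence_rate(rp):
    """RR: fraction of matrix entries that are recurrence points."""
    n = len(rp)
    return sum(map(sum, rp)) / (n * n)


def diagonal_line_lengths(rp):
    """Lengths of all diagonal runs of ones, skipping the main diagonal."""
    n = len(rp)
    lengths = []
    for k in range(-(n - 1), n):          # k is the diagonal offset j - i
        if k == 0:
            continue                       # assumption: ignore the trivial diagonal
        run = 0
        for i in range(max(0, -k), min(n, n - k)):
            if rp[i][i + k]:
                run += 1
            else:
                if run:
                    lengths.append(run)
                run = 0
        if run:
            lengths.append(run)
    return lengths


def determinism(rp, lmin=2):
    """Det: share of off-diagonal recurrence points in lines of length >= lmin."""
    lengths = diagonal_line_lengths(rp)
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(length for length in lengths if length >= lmin) / total


def longest_diagonal(rp):
    """Lmax: length of the longest diagonal line (0 if there is none)."""
    return max(diagonal_line_lengths(rp), default=0)


# Toy periodic signal; real epochs would hold 10 s of EEG samples.
signal = [0, 1, 0, 1, 0, 1]
rp = recurrence_plot(signal, eps=0.1)
```

Because the matrix grows quadratically with the signal length, a practical implementation would typically downsample the epoch or use a dedicated RQA library rather than nested Python lists.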
The total number of extracted features was 15 per channel/region. We used four regions (with averaged feature value from 5 channels) and three additional channels (see above), which gave us a total of 105 features.
2.3. Drowsiness Detection
The 105 features described in Section 2.2 served as the initial feature set for the first drowsiness detection model. The aim of this analysis was to investigate whether sex as a feature could improve drowsiness detection.
The best combination of algorithms and hyperparameters was found using a grid search approach. The experiment was conducted using four classification algorithms (XGBoost, naïve Bayes, random forest, and support vector machines) and three feature selection algorithms (chi2, information gain, and ANOVA F-test). Table 3 shows the hyperparameters that were explored for classification algorithm selection on the training set. In addition, we also searched for the optimal number of features to include in the model (selection of 10, 20, 30, 40, 50, 60, and 100). Grid search was applied to all these hyperparameters, classification algorithms, and feature selection methods simultaneously. The decision was made based on maximizing the accuracy of a model without including sex as a feature.
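The joint search over classifiers, feature selection methods, and feature counts can be pictured as one Cartesian product. The sketch below only enumerates the configurations and picks the best one according to a caller-supplied scoring function; it does not train any model. The names `search_space` and `best_configuration` are hypothetical, and the per-classifier hyperparameter grids are passed in as an argument because the values from Table 3 are not reproduced in the text.

```python
from itertools import product

CLASSIFIERS = ["XGBoost", "naive Bayes", "random forest", "SVM"]
SELECTORS = ["chi2", "information gain", "ANOVA F-test"]
N_FEATURES = [10, 20, 30, 40, 50, 60, 100]


def search_space(classifier_grids):
    """Yield every (classifier, params, selector, k) combination.

    classifier_grids maps a classifier name to a dict of
    hyperparameter name -> list of candidate values. A classifier
    without an entry is searched with its defaults only.
    """
    for clf, selector, k in product(CLASSIFIERS, SELECTORS, N_FEATURES):
        grid = classifier_grids.get(clf, {})
        names = sorted(grid)
        for values in product(*(grid[name] for name in names)):
            yield clf, dict(zip(names, values)), selector, k


def best_configuration(classifier_grids, score):
    """Return the configuration with the highest score (e.g., accuracy).

    score is a callable (clf, params, selector, k) -> float; in a real
    experiment it would train and validate a model on the training set.
    """
    return max(search_space(classifier_grids), key=lambda cfg: score(*cfg))
```

With default hyperparameters only, the four classifiers, three selectors, and seven feature counts already give 84 configurations; every additional hyperparameter value multiplies the number of candidates for its classifier.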
Once the best algorithm and feature set were obtained, the evaluation metrics for the drowsiness detection model without sex as a feature were calculated. The classification was performed for each epoch of each participant, resulting in a total of 17,000 epochs (both classes with 34 recording sessions, each session with 250 epochs) for drowsiness detection. Approximately two-thirds of the dataset (66%), selected at random, was used for training the classifier and the remaining third (33%) was used for testing the classifier.
Then, sex as a feature was added to the dataset, and classification was performed in the same way. In the second analysis, to further investigate the influence of driver’s sex on drowsiness detection, the dataset was divided into two subsets containing only male and female drivers, respectively. The classification was performed in the same way for these two subsets of data.
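The two ways of using sex information can be sketched as simple dataset transformations. This is a minimal illustration with made-up records; the record layout `(features, label, sex)`, the 0/1 encoding of sex, and the function names are assumptions introduced here, not details given in the paper.

```python
# Total number of labeled epochs: 2 classes x 34 sessions x 250 epochs
TOTAL_EPOCHS = 2 * 34 * 250


def add_sex_feature(records):
    """Strategy 1: append sex (0 = male, 1 = female) to each feature vector."""
    return [(features + [1.0 if sex == "F" else 0.0], label)
            for features, label, sex in records]


def split_by_sex(records):
    """Strategy 2: build separate male and female datasets."""
    subsets = {"M": [], "F": []}
    for features, label, sex in records:
        subsets[sex].append((features, label))
    return subsets["M"], subsets["F"]


# Toy records with two features each; real records would hold the
# selected EEG features of one epoch.
records = [
    ([0.2, 0.7], "alert", "M"),
    ([0.4, 0.1], "drowsy", "F"),
    ([0.9, 0.3], "drowsy", "M"),
]
with_sex = add_sex_feature(records)
male, female = split_by_sex(records)
```

In the first strategy a single classifier sees one extra column, whereas in the second strategy two classifiers are trained and evaluated independently on smaller datasets.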
2.4. Sex Classification Model
A reliable EEG-based sex classification model was developed to make previous findings applicable in cases where the dataset does not contain information about drivers’ sex. For the sex classification model, we divided each channel/region of each participant into nine segments (the first eight segments with 27 epochs and the ninth segment with the remaining 34 epochs). The values of each feature in each of these segments were averaged, as shown in Figure 1. We chose nine segments to increase our dataset for classification to 306 records (34 participants with 9 segments each), which was sufficient to properly split the dataset into training and test subsets. Moreover, each of the nine segments had enough epochs to compute correlations between features. The final feature set consisted of the 105 features explained in Section 2.2 and the correlations between each pair of features as new features. In total, this amounted to 5565 features. The training dataset consisted of each participant’s six randomly selected segments and the test dataset consisted of each participant’s remaining three segments. We also used leave-one-subject-out cross-validation (LOSOCV). Hyperparameter optimization and algorithm selection were performed on the training dataset using 5-fold cross-validation. The same grid search approach as in Section 2.3 was used for hyperparameter optimization.
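The segment construction and the resulting feature count can be checked with a short sketch. It assumes that the correlation features are Pearson correlations computed across the epochs of one segment, which the text does not state explicitly; the function names and the handling of constant features (correlation set to 0) are likewise assumptions made for illustration.

```python
from itertools import combinations
from math import comb, sqrt


def segment_bounds(n_epochs=250, n_segments=9, seg_len=27):
    """Epoch index ranges: eight segments of 27 epochs, the ninth takes the rest."""
    bounds = [(k * seg_len, (k + 1) * seg_len) for k in range(n_segments - 1)]
    bounds.append(((n_segments - 1) * seg_len, n_epochs))
    return bounds


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat a constant feature as uncorrelated
    return sxy / sqrt(sxx * syy)


def segment_record(epoch_features):
    """Build one record from a segment.

    epoch_features maps a feature name to its per-epoch values within
    the segment. The record holds the mean of every feature plus the
    correlation of every feature pair.
    """
    names = sorted(epoch_features)
    record = {name: sum(epoch_features[name]) / len(epoch_features[name])
              for name in names}
    for a, b in combinations(names, 2):
        record[f"corr({a},{b})"] = pearson(epoch_features[a], epoch_features[b])
    return record


# 105 base features plus one correlation per unordered pair of features
N_TOTAL_FEATURES = 105 + comb(105, 2)
```

The count works out to 105 + 5460 = 5565 features per record, matching the total given above, and the nine segments per participant give 34 × 9 = 306 records.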
3. Results
3.1. Improvement of Drowsiness Detection with Addition of Sex as a Feature
Based on the optimization of grid search hyperparameters, XGBoost with default parameters, along with 50 features selected by the chi2 method, was selected for our drowsiness detection model. The decision was made based on maximizing the accuracy of a model without including sex as a feature.
Table 4 shows the evaluation metrics’ scores of the drowsiness detection model with and without sex as a feature. Including sex as an additional feature in the feature set resulted in only an incremental improvement in the evaluation metrics. However, when the dataset was split into two datasets based on the sex information, it led to a significant improvement in the drowsiness detection evaluation metrics.
Table 4 shows that the accuracy of the drowsiness detection is 84% for male drivers and 88% for female drivers, which is 3% and 7% better, respectively, than the classification without information about driver sex. Additionally, precision and recall are higher for both the alert and drowsy classes in both the male and female subsets.
3.2. Model for Sex Classification
The best classification model for drivers’ sex classification was XGBoost. The following hyperparameters were selected as the best ones and were the same for both groups: eta equals 0.1, gamma equals 0, max_depth equals 10, reg_alpha equals 0, reg_lambda equals 1, and learning_rate parameter equals 0.1. The chi2 method was selected as the feature selection method, and the 40 best-ranked features were selected for our final classification model.
Interestingly, among the selected features, there were no frequency-domain features. There were 36 selected RQA features and four selected correlation features. The same five features (Det, Lam, Det/RR, Lmax, and Vmax) were selected for all seven channels/regions. Besides these 35 features, the model also selected the RR feature from the Cz channel, the correlation between Cz Theta and Cz Delta, the correlation between FL Theta and Cz Gamma, the correlation between FR Theta and Cz Gamma, and the correlation between OL Theta and OL Delta. All the selected correlations for the drowsy group were between frequency-domain features. Among the selected features, the distribution of channels/regions was almost uniform: 5 ± 1 features were selected from each channel/region.
In general, the final accuracy of any classification model may vary with a different selection of the test set. Since we randomly selected our test set as described in Section 2.4, we applied the methodology described there five times (as seen in Table 5 and Table 6) to verify the stability and robustness of the classification model.
Table 5 and Table 6 show the accuracy of the classification models for alert and drowsy participants, respectively. Each participant had three randomly selected segments in the test set. The classification accuracy of these segments is marked with blue. If at least two out of three of these segments have a correctly classified sex (accuracy 0.67 or 1.00), the participant is marked as correctly classified in the right, green part of the table. The average classification accuracy was calculated for each participant and each run of the methodology. The average classification accuracy of the segments (marked in blue) was 93% (alert participants) and 92% (drowsy participants), while the average classification accuracy of the participants (marked in green) was 97% (alert participants) and 96% (drowsy participants). The classification accuracy with LOSOCV was 82%.
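The two accuracy levels reported in the tables can be expressed as a small aggregation routine. The sketch below uses made-up outcomes, not the study's results, and the function names are hypothetical. It also assumes that voting across runs is applied to the participant-level outcomes of each run, which is one plausible reading of the procedure.

```python
def participant_correct(segment_hits):
    """A participant counts as correct if most test segments are correct."""
    return sum(segment_hits) * 2 > len(segment_hits)


def summarize(run):
    """Return (segment accuracy, participant accuracy) for one run.

    run maps a participant id to a list of booleans, one per test
    segment, where True means the sex was classified correctly.
    """
    n_segments = sum(len(hits) for hits in run.values())
    segment_acc = sum(sum(hits) for hits in run.values()) / n_segments
    participant_acc = (sum(participant_correct(hits) for hits in run.values())
                       / len(run))
    return segment_acc, participant_acc


def vote_across_runs(runs, participant):
    """Majority vote over the participant-level outcomes of several runs."""
    outcomes = [participant_correct(run[participant]) for run in runs]
    return sum(outcomes) * 2 > len(outcomes)


# Toy example with three participants and three test segments each
run = {
    "p1": [True, True, True],     # segment accuracy 1.00
    "p2": [True, False, True],    # segment accuracy 0.67, still correct
    "p3": [False, False, True],   # segment accuracy 0.33, misclassified
}
segment_acc, participant_acc = summarize(run)
```

Participant-level accuracy is never lower than what a single misclassified segment would suggest, which is why it exceeds segment-level accuracy whenever errors are spread thinly across participants.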
4. Discussion
Splitting the dataset into two subsets, male and female, resulted in a significant improvement in drowsiness detection (confirmed with Mann–Whitney U test). After adding sex as a feature to the dataset, drowsiness detection was only incrementally improved, but splitting into two sex-dependent datasets yielded a 3% and 7% improvement for male and female drivers, respectively. Precision and recall were higher for alert and drowsy states for both male and female subsets. Interestingly, precision and recall were higher for female drivers than for male drivers. This increase in prediction accuracy, precision, and recall for female drivers suggests differential changes in EEG activity associated with drowsiness compared to male drivers.
Theoretically, approaches using sex as a feature and manual split of the dataset could have the same accuracy. The reason why this is not the case could be the relatively high dimensionality of the dataset and the relatively small number of examples in the dataset. For this reason, the algorithm was unable to optimize the hyperplane to artificially separate the data based on the sex feature.
A recent review of EEG signal features and their application in driver drowsiness detection systems summarized 39 papers, none of which used participants’ sex as a feature [13]. Based on our results, it is reasonable to assume that the model accuracy presented in many of these papers could be further increased by using sex as a feature or by splitting the dataset into male and female subsets.
For these results to be applicable to all datasets, a high-accuracy sex classifier is needed. The average classification accuracy of the segment’s sex (blue markers in Table 5 and Table 6) is 93% and 92% for alert and drowsy drivers, respectively. On average, the classifiers correctly classify 97% of the alert participants and 96% of the drowsy participants (green markers in Table 5 and Table 6). When majority voting is applied to each participant across all five runs, the accuracy of both classifiers is 100%.
Other studies with high accuracy of sex classification models based on EEG were those of Kaur et al. [38] and Kaushik et al. [39]. Their classification accuracy was 96.7% and 97.5%, respectively. The experimental design was the same in both papers: participants were measured in a relaxed resting position with their eyes closed. In both works, the discrete wavelet transform was used to obtain the frequency-domain features. Their final classifiers were based only on these frequency-domain features. In comparison, our participants were measured while driving, which is a complex mental activity. Kaur et al. achieved their accuracy using LOSOCV on 60 participants. Their high accuracy suggests that a higher number of participants could bring our LOSOCV accuracy closer to the accuracy obtained with the segment-based split.
Our previous work showed that frequency-domain features and RQA features differ significantly between alert male and female drivers [26]. Similar statistical results were found when brain activity was analyzed between male and female drowsy drivers. Female drowsy drivers showed significantly higher relative beta power in all regions and significantly higher relative alpha power in all regions except the Cz electrode. Male drowsy drivers showed significantly higher relative delta power in all regions and relative theta power in the occipital region. On the other hand, relative gamma power showed a different pattern between male and female drowsy drivers. In female drowsy drivers, relative gamma power was significantly higher in the occipital region, whereas in male drowsy drivers, relative gamma power was higher in the frontal region. Furthermore, the results of this paper showed that the features in the frequency domain are more correlated in males than in females during both alert and drowsy sessions.
In the current work, Figure 2 shows the difference between the average feature value for all male and all female drowsy drivers for two RQA features with the smallest p-values (Vmax in the OR region and Lmax in the Pz channel) and two frequency-domain features (beta in the OR region and the OL region). Figure 3 shows a high correlation (0.79) between the Det feature from the FR region and the Lam feature from the Cz channel for male drowsy drivers. The right part of Figure 3 shows the same features for female drowsy drivers, which are only weakly correlated (0.37). These differences in the correlation of the features are the main reason for their use as features.
It should be noted that, although our dataset consisted of frequency-domain features, our feature selection method filtered out all the frequency-domain features for sex classification. The final feature sets for the sex classification model thus consisted of only RQA features and correlation features. Since RQA features discriminate male and female drivers better than frequency-domain features in our work, it is reasonable to assume that the studies (Kaur et al. [38] and Kaushik et al. [39]) reporting high accuracy of the sex classification models would report even higher results with the inclusion of RQA features, but further research should be conducted to confirm this assumption.
Our system is based on 10 s epochs with 5 s overlaps between them, so in a real-world scenario it could make a decision every 5 s. The classifiers make their decisions in less than a second, which means that if the epochs had a step of one second instead of five, the system would still be able to make decisions in time.
There are also some drawbacks to this work. One is the relatively small dataset of 34 participants in 68 recording sessions. This results in a lower LOSOCV accuracy (82%), and increasing the number of participants would likely increase this accuracy. Another drawback is the exclusive use of binary drowsiness classification. In real-world applications, at least three levels of drowsiness are usually targeted [39,40].
For future work, the number of participants and the number of features considered should be increased. For example, an interesting point would be to observe driving performance data (e.g., line crossings, distance from an ideal position on the road) based on participants’ sex and EEG features. Since our data suggest functional differences between males and females during drowsy driving, the next step would be to investigate whether these differences are related to driving performance. Investigating the relation between driving performance and EEG features could provide insight into how people drive and explain potential differences. The underlying mechanisms related to driving would therefore provide a more accurate model for driving safety systems. Another interesting topic for future work is to investigate the influence of driver’s sex as a feature on deep learning models that have lately been used extensively and are showing promising results [40]. In addition, combining our findings (sex as a feature and RQA features as good drowsiness predictors) with decreasing the number of electrodes used [41] could lead to a reliable system that is also easy to implement in practice. Such a system’s accuracy could also benefit from the usage of blink-related features derived from EEG [42].
5. Conclusions
This research has shown that including the information about driver’s sex increases the accuracy of drowsiness detection. Furthermore, a reliable sex classifier based on EEG signals was developed. Although it is hard to implement exactly the same system in a real-time environment, due to the high number of electrodes, these important findings may benefit all other systems that are less intrusive simply by including sex as a feature in the existing systems.
The drowsiness detection model for drivers is usually based on the EEG features and without sex as a feature. After adding sex as an additional feature in the dataset, only incremental improvements in drowsiness detection accuracy were achieved. With the further step of manually splitting the dataset into male and female subsets, the drowsiness detection model accuracy increased by 3% and 7% for male and female datasets, respectively. We consider these results relevant to a variety of drowsiness detection systems currently being developed in the field.
The sex classification model based on EEG features achieved high accuracy. All participants were correctly classified after applying majority voting to the results of all five runs. Correlations between features used as features scored high on the feature selection list, suggesting that correlations between features from different brain regions/channels should be used more frequently as features.