1. Introduction
According to data published by the World Health Organization (WHO), approximately 1.2 million people die in traffic accidents worldwide every year [1]. According to the National Highway Traffic Safety Administration (NHTSA), approximately 20% of traffic accidents and 80% of near-crash incidents are caused by driver distraction, which is a key factor in serious and fatal accidents [2]. In 2018 alone, driver distraction claimed the lives of 2841 people in the USA [3]. Therefore, investigating the causes of distracted driving and reducing the number of distraction-affected traffic accidents remains an imperative issue.
According to related research [4], there are two main causes of driver distraction: (i) internal causes: fatigued driving, drunk driving, and drugged driving, that is, mental states of the driver that are unsuitable for driving. Methods for detecting driver distraction due to internal causes are mainly divided into physiological parameter-based methods [5,6] and naturalistic driving data-based methods [7,8]; (ii) external causes: the driver is subject to external interference, such as calling, texting, and talking with passengers, and other secondary tasks that prevent the driver from driving in a proper mental condition. Computer vision methods are used to identify driver distraction caused by external factors, and they have two advantages that make them suitable for practical application. First, compared with physiological parameter-based methods, they acquire data non-intrusively, ensuring that drivers are not affected by measuring instruments. Second, compared with naturalistic driving data-based methods, they can warn drivers as soon as a distracted action is performed, rather than after the vehicle's behavior has already become abnormal.
Against this backdrop, we focus on driver distraction caused by external factors; that is, we use computer vision methods to detect the actions of distracted drivers. Driver action recognition (DAR) is a branch of human action recognition (HAR). In the HAR field, the two major aspects in developing deep networks for action recognition are the convolution process and temporal modeling [9,10], and handling the temporal dimension remains a challenging issue. The current mainstream solutions fall into three major categories: two-stream convolutional networks [11], three-dimensional convolutional networks (3D-ConvNets) [12], and the fusion of convolutional neural networks and long short-term memory (CNN-LSTM) [13].
Table 1 gives a brief overview of these architectures with their advantages and disadvantages. Because the CNN-LSTM architecture offers high accuracy and fast speed, it was selected as the basic architecture of this research. However, selecting the architecture alone is not enough, because HAR systems are not automatically useful under DAR constraints. The limited in-vehicle space in which the actions are executed and the parallel execution of in-vehicle actions with driving tasks drastically challenge HAR techniques [14]. Therefore, the problem that urgently needs to be solved at this stage is how to extract efficient temporal and spatial features of the driver's actions so as to effectively identify the different actions of drivers.
In this paper, a hybrid deep learning model is proposed to recognize the actions of distracted drivers. The model uses the OpenPose skeleton extraction algorithm, essentially a CNN model, to obtain the skeleton information of the human body (including bone maps and joint point positions) by processing every frame captured by the in-vehicle camera. Then, action description features (ADFs) are constructed from the joint points. On this basis, the ADF vectors, composed of the vector angles and vector modulus ratios of each frame, are used as the input of the K-means clustering algorithm to preselect the original frames, and the keyframe sequences are then obtained by inter-frame comparison (IFC). Finally, the ADF vectors representing the keyframe sequences are fed into the LSTM, which outputs the recognition results. The proposed model improves recognition accuracy through the combination of the following three processes: (i) combining OpenPose and LSTM as the basic architecture guarantees the extraction of spatiotemporal features; (ii) constructing ADFs realizes the fusion of deep network features and handcrafted features; the proposed ADFs improve the information density of spatial features and, to a certain extent, eliminate the influence of individual differences and changes in shooting distance; (iii) using the K-means clustering algorithm and IFC to extract keyframe sequences reduces the interference caused by the similarity of distracted driving actions and by variations in action speed.
There are three major contributions of this paper.
- We propose a novel model that avoids the use of complex devices (i.e., wearable sensors [15,16] and depth cameras [14]) and requires only 2D cameras in vehicles.
- A highly efficient method is introduced to handcraft effective spatial features based on the joint points (i.e., deep neural features) derived from OpenPose.
- Temporal features are extracted by the K-means clustering algorithm, IFC, and LSTM networks, which makes up for current deficiencies in the DAR field.
The outline of this paper is as follows. Related works and the current state of research are reviewed in Section 2. Section 3 elaborates on the data collection process. In Section 4, our model with its four modules is described in detail. The experimental results and analysis are presented in Section 5. Finally, conclusions and directions for future work are given in Section 6.
2. Literature Review
The main focus of our research is to extract efficient spatiotemporal features from driver action sequences so as to improve the accuracy and robustness of distracted driver action recognition. Therefore, we review the literature from three aspects: the application of computer vision methods in the DAR field, the acquisition of spatiotemporal features based on skeleton data, and the current status of keyframe extraction.
Many researchers have applied computer vision methods to the field of DAR. Li et al. [17] locate and detect the driver's right ear and right hand using You Only Look Once (YOLO), take the coordinates of the regions of interest (ROIs) as input, and design a multi-layer perceptron to infer the driver's status from the ROIs. Huang et al. [18] present a hybrid CNN framework (HCF) combining Xception, Inception V3, and ResNet50 to detect the actions of distracted drivers, which improves the accuracy of the driving activity detection system. Baheti et al. [19] propose a new architecture named MobileVGG, based on depth-wise separable convolutions, for detecting and classifying driver distraction, which greatly reduces the number of parameters compared with other CNN models. Mase et al. [20] introduce a novel method using CNNs and stacked bidirectional long short-term memory networks (BiLSTM) to capture the spectral-spatial features of the images, where the BiLSTM handles the sequence of filtered channels, that is, the output of the CNNs (i.e., 8 × 8 feature maps with 2048 channels). Omerustaoglu et al. [21] integrate the predictions of a vision-based CNN and a sensor-based LSTM model into a final model to obtain distracted driving detection results, which improves the accuracy and generalization ability of the system. For chaotic driving scenes, Jegham et al. [14] use an RGB-D camera to capture RGB images and propose a novel soft spatial attention-based network. In summary, in the field of DAR, most researchers focus only on combining and improving CNN models, striving to increase the accuracy or speed of static detection models while ignoring the importance of temporal information. Although the CNN-LSTM architecture has begun to be used by some researchers to recognize distracted driving actions as the research develops further, it still requires additional equipment such as sensors and depth cameras.
With the rapid development of pose estimation techniques, action recognition based on skeleton data has become a research hotspot. Skeleton data comprise the characteristic information of the joint points obtained from an action sequence, including relative trajectory, position, and so on. Wu et al. [22] extract meaningful temporal features of sub-actions from three-dimensional skeleton data by a multiscale wavelet transform, which improves the robustness of action recognition. Zuo et al. [23] propose two new graph convolution methods, a partial-image convolution network and a full-image convolution network, to learn part-scale and full-scale skeleton spatiotemporal features, which are then combined to obtain more effective skeleton features. To improve the performance of action recognition, Ahad et al. [24] regard 3D bone joints as kinematic sensors based on the three-dimensional linear joint positions and the angles between bone segments, and propose the linear joint position feature and the angular joint position feature. Ma et al. [25] use the distances and angles between joint points as spatial features for a deep graph convolutional network (DGCN) and LSTM, which can recognize the actions of basketball players. By connecting the same joints in adjacent frames, Tasnim et al. [26] propose a 3D spatiotemporal image formation technique for skeletal joints that captures spatial information and temporal changes for action discrimination. Overall, previous studies have processed the acquired skeleton data to different degrees to concentrate the information of the action sequence, enabling models to capture more information for training. The above review also shows that fusing heterogeneous features, namely handcrafted and deep neural features, can improve the robustness of action recognition by analyzing action sequences from the expert view and the data-driven model view, respectively.
Many studies have shown that using only a few keyframes instead of a complete sequence of frames can accomplish action recognition tasks more effectively and summarize the video [27,28,29,30,31,32]. Kim et al. [33] show that keyframe extraction enables fast and robust gesture recognition regardless of motion speed. Wang et al. [34] extract an energy feature, combining kinetic energy and potential energy, from 3D video sequences to represent human actions and employ a support vector machine (SVM) to recognize human actions from the energy features of selected keyframes. Tang et al. [35] combine image density clustering and entropy and use keyframes in gesture videos for further feature extraction to improve recognition efficiency. Yasin et al. [36] extract the keyframes that contribute most to the action from 3D motion sequences to eliminate redundant frames and summarize the motion while retaining the original motion semantics. It can be seen that keyframes should be representative of the video content, diverse enough to reduce redundancy, and able to cope with the impact of movement speed on recognition.
To summarize, in the field of DAR, most researchers focus only on combining deep learning models to extract spatial features with higher information density, ignoring the importance of temporal information. The method proposed in this paper not only introduces a spatial feature extraction method that differs from existing techniques but also extracts the temporal features of the driver's distracted actions. To the best of the authors' knowledge, methods for obtaining spatiotemporal features from RGB video sequences have yet to be fully researched in this field. The proposed model, which embeds a feature construction method based on skeleton information and a keyframe sequence technique, fills this gap in the DAR field and can eliminate the influence of individual differences and movement speed.
3. Data Collection
The literature shows that the driver's distracted actions mainly include eating, drinking, manipulating dashboard controls, watching a smartphone screen, talking on a phone, talking with passengers, and grooming [37]. Therefore, in this study, these seven actions are selected as the distracted actions to be recognized. The State Farm Distracted Driver Detection dataset published on Kaggle [38] and the American University in Cairo (AUC) Distracted Driver Dataset [39] are the most frequently used datasets in related studies. However, they cannot meet our needs for two reasons. First, images extracted from the same video are almost identical to one another. Second, there is no timestamp or sequence information for the images [21,39,40]. Therefore, we created a new dataset. To make the data collection reasonable, the custom dataset was collected by mimicking the State Farm dataset (e.g., the camera perspective, distance, and scenarios).
Figure 1 and Table 2 show examples from the custom dataset, which contains 8 types of actions performed by 5 females and 10 males of various heights and body shapes. While the vehicle was moving, we collected videos at 30 frames per second, and each video was limited to 3 s, so that each action sequence is 90 frames long. It is worth noting that the performers demonstrated actions C0, C3, C4, C5, and C6 twice at different speeds, and actions C1, C2, and C7 three times each, resulting in a custom dataset containing 285 action sequences, that is, about 25,650 frames with timestamp information. A mobile phone placed in the upper right corner of the vehicle was used to collect the video sequences; this location was chosen because a similar placement was used in the State Farm dataset.
4. Methodology
The architecture of our proposed model, as shown in Figure 2, consists of four modules: the human body pose estimation module (Module I), the data processing and feature construction module (Module II), the keyframe sequence extraction module (Module III), and the action recognition module (Module IV).
4.1. Module I
We chose the OpenPose algorithm [41], first proposed by the Perceptual Computing Lab at Carnegie Mellon University, to detect human joint points because of its high accuracy. After several generations of updating and optimization, as shown in the bottom left of Figure 2, the latest OpenPose algorithm halves the computational cost compared with the original structure while keeping the accuracy almost unchanged, which makes it suitable for obtaining skeleton data. The algorithm was first applied to the COCO keypoint challenge, greatly surpassing previous results [42]. We also chose the COCO model because it generates 18 joint points, providing a good trade-off between a detailed representation of the human pose and complexity. Table 3 lists the 18 joint points saved for each frame by OpenPose, and Figure 3 illustrates them on the human body. The data of each joint point include the abscissa and ordinate in the Cartesian coordinate system and a confidence value.
However, applying the OpenPose algorithm may be problematic in the following cases. First, as shown in Figure 4, multiple human skeletons may appear in a frame. Second, body occlusion can lead to localization errors and false negatives. Therefore, we set up the data processing unit in Module II, which handles both cases.
4.2. Module II
This module is divided into two parts: data processing and feature construction. First, we process the collected coordinate data of the joint points. Second, in order to improve the information density of the spatial features and give the proposed model better descriptive power, we construct the vector angles and vector modulus ratios from the processed joint point coordinates as the ADFs.
4.2.1. Data Processing
Objects or pedestrians are sometimes mistaken for human skeletons by OpenPose, causing information for multiple skeletons to be stored in the JSON file. To cope with this, we compared a large amount of data and found that the skeleton with the highest confidence is written on the first line of the JSON file; in other words, the skeletons are sorted by confidence, and the skeleton that reaches the threshold but has the lowest confidence is placed at the end of the file. Since the focus of the video is on the driver, the driver's skeleton is the most prominent and has the highest confidence. Therefore, only the first skeleton in the file is retained.
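As a concrete illustration of this step, the following minimal sketch (Python; it assumes OpenPose's --write_json output with the pose_keypoints_2d field used by recent releases, and the file path and helper name are our own, not the authors' code) keeps only the first detected person and reshapes its keypoints into (x, y, confidence) triples.

```python
import json
import numpy as np

def load_driver_skeleton(json_path):
    """Read one OpenPose frame file and keep only the first (highest-confidence) person."""
    with open(json_path) as f:
        frame = json.load(f)
    people = frame.get("people", [])
    if not people:
        return None  # no skeleton detected in this frame
    # OpenPose stores keypoints as a flat list [x0, y0, c0, x1, y1, c1, ...]
    keypoints = np.array(people[0]["pose_keypoints_2d"], dtype=float).reshape(-1, 3)
    return keypoints  # shape (18, 3) for the COCO model: (x, y, confidence)

# Example usage (file name is hypothetical):
# skel = load_driver_skeleton("frame_000000_keypoints.json")
```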
Since the driver's distracted actions involve only the upper body, we deleted the joint point data numbered 9, 10, 12, and 13 in Table 3, together with all confidence values, to avoid the interference of irrelevant data. Because the shooting angle of the dataset causes a large loss of the performer's left ear joint point, as shown in Figure 4, the joint point data numbered 17 were also deleted. Existing methods for dealing with missing joint points include: (i) handling missing data within the model; for example, XGBoost [43] and LightGBM [44] skip missing values and calculate directly; (ii) statistical methods that replace missing values with the mean, median, or mode [45]; (iii) estimating missing data using a Kalman filter [46], currently the most reliable method. In our experiments, the missing values can be well supplemented by statistical methods because there are few missing points. To complete the coordinates of missed joint points in each frame, the detailed procedure of the proposed mean-coordinate supplement method (MCSM) is as follows. We divide the joint points into two categories: (i) fixed joint points, whose positions remain unchanged, namely 1, 8, and 11 in Figure 3, and (ii) changing joint points, namely 0, 2, 3, 4, 5, 6, 7, 14, 15, and 16 in Figure 3; the movements are mainly reflected by these joint points related to the arms and head. The processing methods are as follows:
- (i)
Fixed joint points: we take the average of the coordinates over all frames in which the joint point is not missing and use it to replace the coordinates of that joint point in every frame:

$$\bar{x}_j = \frac{1}{N_j}\sum_{f \in F_j} x_{j}^{(f)}, \qquad \bar{y}_j = \frac{1}{N_j}\sum_{f \in F_j} y_{j}^{(f)} \tag{1}$$

where $j$ denotes a fixed joint point, $F_j$ is the set of frames in which the $j$-th joint point data are not missing, $N_j$ is the number of such frames, and $\bar{x}_j$ and $\bar{y}_j$ are the values that replace the abscissa and ordinate of the $j$-th joint point in each frame.
- (ii)
Changing joint points: there are three possible scenarios. First, a single frame is missing. The missing coordinate is replaced by the average of the corresponding coordinates in the frames immediately before and after it:

$$\hat{x}_{j}^{(f)} = \frac{1}{2}\left(x_{j}^{(f-1)} + x_{j}^{(f+1)}\right), \qquad \hat{y}_{j}^{(f)} = \frac{1}{2}\left(y_{j}^{(f-1)} + y_{j}^{(f+1)}\right) \tag{2}$$

where $j$ denotes a changing joint point, $f$ is the frame in which the $j$-th joint point is missing, and $\hat{x}_{j}^{(f)}$ and $\hat{y}_{j}^{(f)}$ are the values that replace the abscissa and ordinate of the $j$-th joint point in frame $f$. Experiments on non-missing joint points show that averaging over the single frame before and the single frame after the missing frame is optimal.
Second, consecutive frames are missing. Formula (3) gives the supplementary method for a run of $m$ consecutive frames in which the $j$-th joint point is missing. Third, if the joint point data for the first frame of an action sequence are missing, the mean of all non-missing coordinates of that joint point is taken and assigned to the first frame; from then on, the data can be processed with the two methods above (a sketch of the whole supplement procedure is given after this list).
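To make the supplement procedure concrete, the sketch below (Python with NumPy; the array layout and function names are our own illustration, not the authors' code) fills fixed joints with their per-sequence mean and single-frame gaps in changing joints with the average of the neighbouring frames; longer gaps are filled here by linear interpolation purely as a stand-in for Formula (3).

```python
import numpy as np

FIXED = [1, 8, 11]          # joint indices whose position barely changes
CHANGING = [0, 2, 3, 4, 5, 6, 7, 14, 15, 16]

def mcsm(coords):
    """coords: (n_frames, 18, 2) array of joint coordinates; missing entries are np.nan."""
    coords = coords.copy()
    for j in FIXED:
        # replace the fixed joint in every frame by its mean over non-missing frames
        coords[:, j, :] = np.nanmean(coords[:, j, :], axis=0)
    for j in CHANGING:
        for d in range(2):                      # x and y handled separately
            col = coords[:, j, d]
            missing = np.isnan(col)
            if missing[0]:                      # first frame missing: use the sequence mean
                col[0] = np.nanmean(col)
                missing[0] = False
            if missing.any():
                # single or consecutive gaps: interpolate between the surrounding frames
                # (linear interpolation is an illustrative stand-in for Formula (3))
                idx = np.arange(len(col))
                col[missing] = np.interp(idx[missing], idx[~missing], col[~missing])
            coords[:, j, d] = col
    return coords
```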
The processed data are stored in the following format: (i) each frame of data occupies a separate row, arranged in chronological order, with two row indexes identifying the person and the action of that row; (ii) each row contains 28 columns of data, namely the abscissa and ordinate values of the above-mentioned 10 changing joint points and 3 fixed joint points in a rectangular coordinate system.
4.2.2. Feature Construction
If the processed coordinate data of the 13 joint points were used directly for subsequent operations, the generalization ability of the model would be low. Therefore, based on the joint point coordinates, we construct the ADFs, namely the vector angles and modulus ratios of the human body structure, to obtain a more effective feature descriptor for action recognition [47]. Furthermore, through analysis of the characteristics of the driver's actions, two auxiliary points, an improvement for this specific application scenario, are proposed to assist in the construction of the ADFs. The detailed process is as follows:
Stage 1. Acquisition of the structure vectors. A structure vector is obtained by subtracting the coordinates of two joint points in the same frame:

$$\vec{V}_{ij} = (x_j - x_i,\; y_j - y_i) \tag{4}$$

where $(x_i, y_i)$ and $(x_j, y_j)$ are the coordinates of the $i$-th and $j$-th joint points, and $\vec{V}_{ij}$ is the structure vector formed by the $i$-th and $j$-th joint points.
This paper constructs 19 structure vectors for the subsequent calculation of the vector angles and vector modulus ratios, as shown in Figure 5. An innovative aspect of this paper is the creation of two auxiliary points: the auxiliary midpoint (the midpoint of fixed points 8 and 11) and the auxiliary centroid (the center of gravity of the triangle formed by fixed points 1, 8, and 11). The auxiliary midpoint supports the subsequent calculation of the modulus ratios, and the auxiliary centroid supports a better description of upper limb movements. Taking joint point 3 (the right elbow) in Figure 5 as an example, the structure vectors connecting it to the right shoulder, the right wrist, and the auxiliary centroid describe the movements related to the right elbow well.
Stage 2. Acquisition of the vector angles. The angle between two structure vectors is calculated using the law of cosines:

$$\theta = \arccos\left(\frac{\vec{V}_{ij}\cdot\vec{V}_{ik}}{\lVert\vec{V}_{ij}\rVert\,\lVert\vec{V}_{ik}\rVert}\right) \tag{5}$$

where $\vec{V}_{ij}$ and $\vec{V}_{ik}$ are two structure vectors and $\theta$ is the angle between them.
This paper constructs 13 vector angles, as shown in Figure 6. Again taking joint point 3 (the right elbow) as an example, the angle in Figure 6a between the upper arm and the forearm measures the swing of the forearm relative to the upper arm, while the angles in Figure 6b represent the angular relationship of the right elbow joint with respect to the right shoulder, the right wrist, and the auxiliary centroid. The unique position of the joint point can be determined by these three vector angles.
Stage 3. Acquisition of the vector modulus ratios. In order to avoid large errors in the recognition of driver actions due to individual differences, this paper does not use the absolute distances between the joints but chooses relative distances, that is, vector modulus ratios. A total of eight vector modulus ratios are constructed, as shown in Table 4. Equation (6) gives the calculation of a vector modulus ratio with respect to the base vector defined below:

$$R_{ij} = \frac{\lVert\vec{V}_{ij}\rVert}{\lVert\vec{V}_{\mathrm{base}}\rVert} \tag{6}$$
The distance between the auxiliary midpoint and joint point 1 remains almost constant during driving and reflects the body size of different drivers well. Therefore, the vector from joint point 1 to the auxiliary midpoint is selected as the base vector $\vec{V}_{\mathrm{base}}$ for calculating the vector modulus ratios, which eliminates individual differences between drivers.
In this paper, 13 vector angles and 8 vector modulus ratios of human body structure are constructed as the features of the driver’s actions, totaling 21 ADFs.
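To illustrate Stages 1-3, the following sketch (Python/NumPy; the joint pairs and names are illustrative choices, not the paper's exact 19 vectors, 13 angles, and 8 ratios, and the base vector is our assumption from the text above) computes a structure vector, the angle between two vectors, and a modulus ratio normalized by the base vector.

```python
import numpy as np

def structure_vector(coords, i, j):
    """Vector from joint i to joint j in one frame; coords has shape (18, 2)."""
    return coords[j] - coords[i]

def vector_angle(v1, v2):
    """Angle (radians) between two structure vectors via the cosine formula."""
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def adf_example(coords):
    """A few illustrative ADF entries for one frame (not the full 21-dimensional set)."""
    mid = 0.5 * (coords[8] + coords[11])                    # auxiliary midpoint of joints 8 and 11
    centroid = (coords[1] + coords[8] + coords[11]) / 3.0   # auxiliary centroid of joints 1, 8, 11
    base = coords[1] - mid                                  # assumed base vector: joint 1 to midpoint
    upper_arm = structure_vector(coords, 2, 3)              # right shoulder -> right elbow
    forearm = structure_vector(coords, 3, 4)                # right elbow -> right wrist
    elbow_to_centroid = centroid - coords[3]
    return np.array([
        vector_angle(upper_arm, forearm),                           # a vector angle (Figure 6a)
        vector_angle(forearm, elbow_to_centroid),                   # another vector angle
        np.linalg.norm(forearm) / (np.linalg.norm(base) + 1e-8),    # a vector modulus ratio
    ])
```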
4.3. Module III
In this section, we propose a module based on the K-means clustering algorithm [48] and IFC. The vectors composed of the ADFs are used as the input of the K-means clustering algorithm, and the number of keyframes to be extracted is determined by manually setting the number of clusters. We then compare the differences between the vectors representing the frames and the vectors representing the cluster centers to obtain the final vectors representing the keyframe sequences.
The detailed process is: (i) obtaining the keyframes, in which the most informative frames are extracted and pose redundancy is removed, effectively compressing and refining the driver actions; (ii) obtaining the keyframe sequences, in which the extracted keyframes are sorted according to their order of occurrence, which not only ensures that the keyframe sequences carry efficient spatiotemporal information but also reduces the number of ADF vectors sent to Module IV. Through this module, the accuracy of action recognition can be improved.
Step 1: The K-means clustering algorithm is used to obtain the keyframes. The basic principle of the algorithm is to group similar objects into the same cluster and dissimilar objects into different clusters. The vector composed of the ADF values of each frame, that is, the vector angle values and the vector modulus ratio values, is used as the input of the K-means clustering algorithm. We denote the complete action sequence as $A = \{a_1, a_2, \ldots, a_N\}$, where $N$ is the total number of frames in the action sequence, $a_n$ is the $n$-th frame of the sequence, $\mathbf{v}_n \in \mathbb{R}^{21}$ is the vector composed of the values of the 21 ADFs in frame $a_n$, and $V = \{\mathbf{v}_1, \ldots, \mathbf{v}_N\}$ is the collection of ADF vectors of a complete sequence. The $N$ ADF vectors representing the frames are clustered into $K$ clusters. The detailed process is as follows:
- (1)
Randomly select $K$ cluster centroids, marked as $\boldsymbol{\mu}_k$, $k = 1, \ldots, K$;
- (2)
Calculate the distance from each sample $\mathbf{v}_n$ to each centroid $\boldsymbol{\mu}_k$ and assign the sample to the cluster with the minimum distance, that is:

$$c_n = \arg\min_{k} \lVert \mathbf{v}_n - \boldsymbol{\mu}_k \rVert^2$$

- (3)
After the division, recalculate the centroid of each cluster $k$:

$$\boldsymbol{\mu}_k = \frac{\sum_{n=1}^{N} r_{nk}\,\mathbf{v}_n}{\sum_{n=1}^{N} r_{nk}}$$

where $r_{nk}$ indicates whether the vector $\mathbf{v}_n$ is assigned to cluster $k$: $r_{nk} = 1$ if it belongs to cluster $k$, otherwise $r_{nk} = 0$.
- (4)
Repeat (2) and (3) until the cluster centers remain unchanged, then the algorithm ends. Through this process, $K$ cluster centers, which serve as pre-selected keyframes, are extracted, and each center is a 21-dimensional vector. A minimal sketch of this clustering step is given after this list.
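The following sketch (Python, using scikit-learn's KMeans purely as a convenient stand-in for the clustering step described above; the value of K and the array names are illustrative assumptions) clusters the per-frame ADF vectors and returns the cluster centers used as pre-selected keyframes.

```python
import numpy as np
from sklearn.cluster import KMeans

def preselect_keyframes(adf_vectors, n_keyframes=10):
    """adf_vectors: (n_frames, 21) array of per-frame ADF values."""
    km = KMeans(n_clusters=n_keyframes, n_init=10, random_state=0)
    labels = km.fit_predict(adf_vectors)          # cluster index of every frame
    return km.cluster_centers_, labels            # each center is a 21-dimensional vector
```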
Step 2: Since a cluster center does not necessarily coincide exactly with any ADF vector and carries no temporal order, IFC is used to further obtain the keyframe sequences. The vector of the cluster center of cluster $k$ is expressed as

$$\boldsymbol{\mu}_k = \left(\theta_{1}^{(k)}, \theta_{2}^{(k)}, \ldots, \theta_{13}^{(k)}, R_{1}^{(k)}, R_{2}^{(k)}, \ldots, R_{8}^{(k)}\right), \quad k = 1, \ldots, K$$

where $\boldsymbol{\mu}_k$ represents the cluster center of cluster $k$, $K$ is the number of keyframes to be extracted from an action sequence, $\theta_{1}^{(k)}, \ldots, \theta_{13}^{(k)}$ are the vector angle values, and $R_{1}^{(k)}, \ldots, R_{8}^{(k)}$ are the modulus ratio values.
Similarly, the ADF vector representing a frame in an action sequence is expressed as

$$\mathbf{v}_n = \left(\theta_{1}^{(n)}, \theta_{2}^{(n)}, \ldots, \theta_{13}^{(n)}, R_{1}^{(n)}, R_{2}^{(n)}, \ldots, R_{8}^{(n)}\right), \quad n = 1, \ldots, N$$

where $\mathbf{v}_n$ is the ADF vector of frame $a_n$, $N$ is the total number of frames in the action sequence, $\theta_{1}^{(n)}, \ldots, \theta_{13}^{(n)}$ are the vector angle values, and $R_{1}^{(n)}, \ldots, R_{8}^{(n)}$ are the modulus ratio values of frame $a_n$.
By finding, for each cluster center, the frame vector with the minimum Euclidean distance to it, we determine the correspondence between the cluster centers and the frames of the action sequence:

$$n_k = \arg\min_{n \in \{1,\ldots,N\}} \lVert \mathbf{v}_n - \boldsymbol{\mu}_k \rVert$$
In order to ensure that the extracted keyframes coincide with actual frames in the action sequence, the frame with the smallest distance to each cluster center is marked as a keyframe, the frame indexes are saved, and the keyframes are finally sorted by index to obtain the final keyframe sequence. Because some actions change only slightly, the ADF vectors of neighbouring frames are similar, which can cause the order of the extracted keyframes to differ from their order in the video; however, this does not affect the recognition performance.
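A minimal sketch of the inter-frame comparison step (Python/NumPy; the variable names are ours) matches each cluster center to its nearest frame and sorts the matched frames by their original index.

```python
import numpy as np

def ifc_keyframes(adf_vectors, centers):
    """adf_vectors: (n_frames, 21); centers: (K, 21). Returns keyframe ADF vectors and sorted indexes."""
    # Euclidean distance from every frame vector to every cluster center
    dists = np.linalg.norm(adf_vectors[:, None, :] - centers[None, :, :], axis=-1)
    nearest = dists.argmin(axis=0)          # index of the closest frame for each center
    keyframe_idx = np.unique(nearest)       # drop duplicates and sort chronologically
    return adf_vectors[keyframe_idx], keyframe_idx
```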
4.4. Module IV
An LSTM network is used to process the output of Module III and extract the spatiotemporal features, which are then passed to a softmax layer to output the action recognition results. In a typical deep neural network with LSTM, the internal state is represented as a one-dimensional vector [49]. Figure 7 displays a basic LSTM neuron. Within LSTM models, three gates control and update the cell state: the input gate, the forget gate, and the output gate. Each gate consists of a sigmoid neural network layer and a pointwise multiplication operation.
For time step $t$, the cell state can be updated by using the following equations:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ stands for the sigmoid activation function, defined as $\sigma(z) = 1/(1 + e^{-z})$; $i_t$, $f_t$, and $o_t$ respectively stand for the outputs of the input, forget, and output gates; $c_t$ represents the long-term memory state of the cell at time $t$, and $\tilde{c}_t$ denotes the candidate state value of $c_t$; $h_t$ and $x_t$ are the final output and initial input at time $t$; and $W_{xi}$, $W_{hi}$, $W_{xf}$, $W_{hf}$, $W_{xo}$, $W_{ho}$, $W_{xc}$, $W_{hc}$, $b_i$, $b_f$, $b_o$, and $b_c$ stand for the coefficient matrices and offset vectors.
In the proposed model presented in Figure 2, a two-layer LSTM network is constructed to learn the ADF vectors of the keyframe sequence and thereby obtain the spatiotemporal features of the video sequences. In Figure 2, $x_1, x_2, \ldots, x_K$ are the ADF vectors constructed by Module II that represent a keyframe sequence. From this input sequence, the memory cells in the two LSTM layers produce a representation sequence $h_1, h_2, \ldots, h_K$. Finally, the feature vector $h_K$ at the last time step is fed into the softmax layer so that the driver's distracted action can be identified.
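As an illustration of Module IV, the sketch below (Python with TensorFlow/Keras; the layer sizes, number of keyframes, and number of classes are assumptions for illustration, not values reported by the authors) stacks two LSTM layers on the keyframe ADF sequences and ends with a softmax classifier.

```python
import tensorflow as tf

K_FRAMES = 10      # assumed number of keyframes per sequence
N_ADF = 21         # 13 vector angles + 8 modulus ratios
N_CLASSES = 8      # action classes C0-C7

def build_module_iv():
    """Two stacked LSTM layers followed by a softmax classifier (illustrative sizes)."""
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, return_sequences=True,
                             input_shape=(K_FRAMES, N_ADF)),   # first LSTM layer
        tf.keras.layers.LSTM(64),                              # second LSTM layer, last hidden state
        tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```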
6. Conclusions and Future Work
This paper proposes a method for driver distraction recognition based on RGB video, which emphasizes the importance of temporal features and fills a gap in the DAR field. The proposed hybrid deep learning model does not rely only on spatial features but extracts efficient spatiotemporal features from the driver's action sequences to improve the accuracy and robustness of distracted driver action recognition. The framework relies mainly on three methods: (i) a computer vision method, namely the CNN-LSTM architecture, as the basic framework; (ii) feature construction based on joint points; and (iii) keyframe sequence extraction. The improved feature construction method weakens individual differences and improves the generalization ability of the model when the relative distance between the driver and the camera changes or when drivers differ in height, weight, and body proportions. The extracted keyframes enhance the process by providing information that is free of redundancy yet carries the most relevant details about the motion. Finally, to thoroughly evaluate every module and the model as a whole, we designed two sets of comparative experiments. The first set compared the influence of different module combinations on distracted action recognition; the results show that the proposed model achieves higher recognition accuracy than other module combinations, and the separate and combined evaluations of Module II and Module III prove the effectiveness of these modules for action recognition. The second set compared our method with state-of-the-art methods; extensive experiments on the custom dataset show that the proposed method produces very competitive results.
Because the dataset itself constrains the neural network, enlarging the dataset would benefit the feature extraction and generalization capabilities of the model. Although a high-quality video dataset was collected, a more diverse dataset is required to cover more scenarios and more kinds of drivers (e.g., different races and body shapes), which is the focus of future work. On the technical side, our approach can be extended in the following two directions: