1. Introduction
Hazardous driving, which includes aggressive and emotional driving as well as a variety of other poor driving behaviors, is a major contributor to vehicle crashes and fatalities. According to the National Highway Traffic Safety Administration (NHTSA), traffic fatalities in 2021 increased by an estimated 10.5% over the previous year, with urban roadway fatalities rising by 16%. Detecting hazardous driving behavior is therefore essential for effective driver-control strategies and offers potential value for improving mass-transit safety.
Typical sensor approaches for wireless perception include Wi-Fi, such as the continuous-state decision process for Wi-Fi link selection [1] and the Scalable Uplink Multiple-Access (SUMA) protocol for Wi-Fi backscatter systems [2]; these methods improve measurement performance, with channel state information (CSI) [3,4] forming an important part of the modeling state. Sensing information from other sources, such as body-worn sensors [5] and the Industrial Internet of Things [6], has driven the development of multi-information fusion algorithms and action-recognition technologies. Video is one such source, carrying multimodal information such as reflected brightness, distance, and heat distribution; however, in the complex environment of vehicle driving, environmental brightness distortion and the misjudgment of distance and heat information may cause deviations in behavior recognition [7]. Traditional array radar can only monitor a single target point-to-point, whereas the frequency-modulated continuous-wave (FMCW) approach encapsulates a millimeter-wave radar antenna array, solving the multipath and environmental-complexity problems caused by the narrow space inside the vehicle and offering an excellent solution. FMCW therefore has good prospects in human pose estimation and related microwave-sensing fields [8,9].
The purpose of multimodal fusion is to combine multiple effective modalities into a unified high-dimensional representation while preserving the original model; the key problem in multimodal data is therefore data heterogeneity. Normalized, highly representative data can reflect the essential features of the original signal, as in the fusion of speech recognition with facial emotion recognition [10,11,12] and the use of self-supervised learning (SSL) to extract multiple features for emotion recognition. In FMCW radar frame-signal preprocessing, features such as Doppler, velocity, scattering-point difference, and amplitude are usually extracted separately, and two kinds of strong features are combined at random [13,14,15], which may not solve the problem of feature extraction given the coupling relationships between features and the balance of the feature space. Other methods in the literature [16,17,18] explore high-dimensional representations for multi-feature fusion: a multi-feature fusion algorithm over HSV, LBP, and LSFP improves the high-dimensional fused representation, and the DM image constructed by multi-feature fusion is binarized. For complex human targets, small target features are monitored, local entropy, average gradient strength, and other features are fused algorithmically, and peak normalization is performed in vector space (MFVS). Micro-action multimodal classification generally uses the Hidden Markov Model [19], while more complex actions or gestures are handled with convolutional neural networks (CNNs), verified via the DisturbIoU algorithm [20,21] on the public NTU-Microsoft-Kinect Gesture (NTU) dataset and the Vision for Intelligent Vehicles and Applications (VIVA) dataset.
In this paper, we parse raw FMCW radar data with DAM to fuse micro-Doppler, velocity-Doppler, and amplitude features in a low-rank multimodal fusion (LMF) tensor representation, and we use an attention mechanism to assign weights in a gating system built on the ReLU activation function. This method addresses the heterogeneous-fusion confusion and feature-space balance problems of multimodal fusion and makes the following contributions: (1) It fuses the micro-Doppler spectrum with the radial velocity and amplitude complex features in a dynamic sparse cascade, coupling low-dimensional action features into a dynamically representable high-dimensional linear vector representation, which improves model stability and classification efficiency. (2) Through context modeling with a low-rank tensor network (LMF), the tensor representation is created by computing the outer product of three different single-peaked modes, and modeling their interactions achieves end-to-end learning (see the sketch below). (3) The proposed method uses a self-attentive rectified linear unit (AReLU) activation in the GRU gated cells, and this activation function improves the performance of the FM continuous-wave tensor representations.
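To make contribution (2) concrete, the following is a minimal PyTorch sketch of low-rank fusion over the three modalities. The embedding sizes, rank, output dimension, and initialization are illustrative assumptions rather than this paper's configuration; the sketch follows the general LMF idea of replacing the full outer-product fusion tensor with per-modality low-rank factors.

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Factorized outer-product fusion of three unimodal embeddings (LMF-style sketch)."""
    def __init__(self, dims=(64, 32, 16), rank=4, out_dim=128):
        super().__init__()
        # one stack of rank-many factor matrices per modality; the +1 row implements
        # the usual "append a constant 1" trick so lower-order (unimodal and bimodal)
        # interactions are retained inside the product
        self.factors = nn.ParameterList(
            nn.Parameter(torch.randn(rank, d + 1, out_dim) * 0.05) for d in dims
        )
        self.fusion_weights = nn.Parameter(torch.randn(rank) * 0.05)
        self.fusion_bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, modalities):
        # modalities: list of (batch, d_m) tensors, e.g. micro-Doppler, velocity, amplitude
        batch = modalities[0].shape[0]
        fused = None
        for x, w in zip(modalities, self.factors):
            x1 = torch.cat([x, torch.ones(batch, 1, device=x.device)], dim=1)
            proj = torch.einsum("bd,rdo->rbo", x1, w)      # (rank, batch, out_dim)
            # the elementwise product across modalities realizes the outer-product
            # interaction without ever materializing the full rank-3 tensor
            fused = proj if fused is None else fused * proj
        return (self.fusion_weights.view(-1, 1, 1) * fused).sum(0) + self.fusion_bias
```

Because the full fusion tensor is never materialized, the parameter count grows linearly with the number of modalities, which is what makes this representation practical for the three radar features.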
The rest of this paper is organized as follows. Section 2 provides an overview of related methods. Section 3 introduces the principles and technologies underlying radar information feature extraction and feature fusion. Section 4 describes the implementation principle and parameters of the feature fusion method and the LMF-AR-GRU classification model in detail. Section 5 details the experimental setup and the analysis of the experimental results, comparing the overall efficiency and advantages of the system. Section 6 concludes the paper.
5. Experimental Design and Analysis
5.1. Experimental Setup and Environment Variables
The experimental equipment used in this paper included a Texas Instruments millimeter-wave radar operating at 77–81 GHz, corresponding to a wavelength of about 4 mm and a bandwidth of 4 GHz. This ultra-wide bandwidth is superior to that of other wireless sensing devices and ensures high-precision recognition of movements. The hardware mainly comprised an IWR1642 radar sensor, a DCA1000EVM data-capture card, a sensor bracket, a Lenovo R7000 laptop (i5-7300HQ, GTX1050, 802.11ac wireless network card), and two vehicles. Experimental comparisons under the same parameter settings on sunny and rainy days showed that weather and environmental factors had a negligible effect on the millimeter-wave echoes, so they are not discussed further. Experiments were conducted on vehicles of different sizes: a three-box sedan and an off-road vehicle. The radar sensor was fixed to the vehicle's B-pillar at a height of 85 cm above the chassis, with the radar transmitter and receiver placed perpendicular to each other and facing the target driver, so that the driver was within the radar scanning range and the target dangerous-behavior characteristics could be captured. The experimental setup is shown in Figure 6.
The equipment and environment used for data collection and processing included a Windows 10 machine with an Intel Core i7-6700 CPU (main frequency 3.4 GHz), 32 GB of memory, and a GTX1080 graphics card, running PyCharm 2018; the programming environment was Python 3.7. In this environment, MemTotal indicates the actual RAM available to the system: physical memory minus some reserved bits and the size of the kernel's binary code. Reading the proc file with cat /proc/meminfo reported a MemTotal of 816,136 kB. The system's partitions were identified in the proc file with cat /proc/partitions; summing the capacities of the 10 partitions (0 through 9) gave a ROM capacity of 1,563,645 kB. The data-acquisition equipment included a Logitech surveillance camera (wired/Wi-Fi) with 720P/1080P high-definition wide-angle imaging that was easy to install. After comparing data collected from multiple placement angles and positions, the camera installed in front of the co-pilot position was found to give the best results, and sitting-posture image data for the front-seat driver and occupant were obtained.
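For reference, the memory figures above can be reproduced on Linux from the proc filesystem. This is a minimal sketch using the standard /proc/meminfo and /proc/partitions formats; the function names are our own, not code from the paper.

```python
def read_memtotal_kb(path="/proc/meminfo"):
    """Return MemTotal in kB: physical RAM minus reserved bits and the kernel binary."""
    with open(path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1])   # e.g., 816136 on the test machine
    raise RuntimeError("MemTotal not found")

def sum_partitions_kb(path="/proc/partitions"):
    """Sum the #blocks column (1 kB blocks) over all listed partitions."""
    total = 0
    with open(path) as f:
        next(f); next(f)                      # skip the header row and blank line
        for line in f:
            fields = line.split()
            if len(fields) == 4:
                total += int(fields[2])       # #blocks is the third column
    return total

if __name__ == "__main__":
    print("RAM:", read_memtotal_kb(), "kB")
    print("ROM:", sum_partitions_kb(), "kB")
```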
Before setting the parameters, the speed and distance of the moving target in the car were first estimated, because an overestimated maximum speed or distance leads to a significant decrease in velocity resolution, while a suitable sampling time ensures that an action sample is complete without producing redundancy. The millimeter-wave radar parameters were set as shown in Table 1. To ensure the integrity of each behavior during sampling, the frame period was set to 50 ms and the number of frames to 100, giving a 5 s sampling time per action; this preserved complete actions while reducing blank, redundant time slices. Each frame unit was set to 256 samples and 128 chirps.
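As a sanity check on these settings, the nominal FMCW resolutions implied by the stated bandwidth and wavelength can be computed as follows. The chirp timing is an assumption (chirps filling the 50 ms frame) rather than a value taken from Table 1.

```python
# Deriving nominal FMCW resolutions from the Section 5.1 setup:
# 77-81 GHz band, B = 4 GHz, wavelength ~4 mm, 50 ms frame, 128 chirps/frame.
C = 3e8                      # speed of light, m/s
B = 4e9                      # sweep bandwidth, Hz
WAVELENGTH = 3.9e-3          # ~4 mm at 77 GHz
FRAME_PERIOD = 50e-3         # s
N_CHIRPS = 128

range_res = C / (2 * B)                              # ~0.0375 m
chirp_period = FRAME_PERIOD / N_CHIRPS               # assumes chirps fill the frame
v_max = WAVELENGTH / (4 * chirp_period)              # max unambiguous velocity
v_res = WAVELENGTH / (2 * N_CHIRPS * chirp_period)   # velocity resolution

print(f"range resolution   : {range_res:.4f} m")
print(f"velocity resolution: {v_res:.4f} m/s (max +/- {v_max:.2f} m/s)")
```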
In terms of basic conditions, the experiment involved 6 known subjects, 3 males and 3 females of different heights and weights, aged 23 to 55 years, with heights from 152 cm to 190 cm and weights from 43 kg to 97 kg. Six random subjects with unknown information were used for comparison. The sample set consisted of 9 risk-behavior datasets, with a sample acquisition time of 5 s and each specific action repeated 60 times. The dataset contained 6480 samples (9 actions × 60 repetitions × 12 subjects), each of size 255 × 255 × 3; the Doppler sensing distance was set to 0.3–1.1 m in steps of 0.1 m, giving target sensing at 9 distances.
Due to the special nature of in-vehicle recognition, radar recognition of dangerous movements could only be realized for the upper body; leg movements could not be identified because of occlusion.
5.2. Experimental Analysis
5.2.1. Model Performance Analysis for Visual Verification
While the driver's wireless FM signal was being captured, dual-channel processing of the RGB image information followed by detection of human skeletal key points is essential for describing human posture; skeletal key-point detection underpins many computer vision tasks, such as motion classification, abnormal-behavior detection, and autonomous driving. This experiment used surveillance equipment to capture images of the front-row driver and passenger and applied the AlphaPose model to obtain the coordinates of 13 upper-body skeletal key points: left ear, left eye, nose, right ear, right eye, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, and right hip. These key points and the line segments connecting them were used to construct a human sitting posture, as shown in Figure 7. The dataset was divided into four parts: training (70%), validation (10%), test A (10%), and test B (10%). In the in-car scenario, the driver and passenger were seated most of the time, so the behaviors of the front-row occupants consisted of upper-body movement, while lower-body movement could be ignored. The motion and coordinate information of the upper-body skeleton key points was used to analyze and identify abnormal behavior of the driver and passenger in the car.
Figure 7 shows the predicted behavior of the skeleton key points for a specific action.
After the skeletal key points were extracted, the change in distance between key points was used to determine whether any of the nine dangerous actions occurred. However, this vision-based approach has a low recognition rate, being easily affected by the shadows of other people and by poor lighting; its average recognition rate over the nine behaviors was 73.14%, whereas the millimeter-wave radar reached the highest recognition rate of 95.74%. The proposed model therefore recognizes dangerous in-vehicle behavior more accurately than the other techniques. Due to the larger interior space of the off-road vehicle, the noise affecting recognition was slightly reduced, and in an experimental comparison with the same individuals and parameters, recognition accuracy inside the off-road vehicle was 0.641% higher than in the small car.
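As an illustration of this distance-change criterion, the following sketch (array layout and function names are assumptions, not the paper's code) computes the per-pair key-point displacement over a clip, which a threshold- or template-based classifier could then compare against the nine actions.

```python
import numpy as np

# The 13 upper-body key points used in the paper, in AlphaPose order
KEYPOINTS = ["l_ear", "l_eye", "nose", "r_ear", "r_eye", "l_shoulder",
             "r_shoulder", "l_elbow", "r_elbow", "l_wrist", "r_wrist",
             "l_hip", "r_hip"]

def pairwise_distances(frame):
    """frame: (13, 2) array of (x, y) key-point coordinates for one video frame."""
    diff = frame[:, None, :] - frame[None, :, :]
    return np.linalg.norm(diff, axis=-1)          # (13, 13) distance matrix

def distance_change(seq):
    """seq: (T, 13, 2) key-point track over a clip.
    Returns the per-pair change in distance between the first and last frames,
    the quantity compared against action templates in the vision baseline."""
    return np.abs(pairwise_distances(seq[-1]) - pairwise_distances(seq[0]))
```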
Figure 8 presents a comparison of the experimental environment and equipment used in the car vs. the off-road vehicle. The off-road vehicle was obviously larger than the car, and there was a difference in the recognition effect.
5.2.2. Analysis of Cross-Person Results
The experimental setup included six people with widely differing physical characteristics, three males and three females ranging in height from 152 cm to 190 cm. In FMCW-radar-based cross-person movement recognition, drivers of different physiques and heights may produce large differences in recognition accuracy, because the radar is sensitive to distance and responds differently to driving targets across the three features. The six participants were labeled a, b, c, d, e, and f to indicate different people. For each experiment, the dispersion of each data group was demonstrated by visualizing the data nodes of the six targets; to improve the generality of the experiment, a further group of data from six unknown participants was used for comparison, and a box plot was created to show the median and spread for each target. The box plot shows that good behavior recognition was achieved for every individual.
Figure 9 presents the box plot diagram.
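For illustration, a plot of the kind in Figure 9 can be produced as follows; the per-trial accuracy arrays here are random placeholders, not the measured data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
targets = list("abcdef")
# placeholder per-trial recognition accuracies for each of the six targets
acc = [rng.normal(loc=0.93, scale=0.02, size=60) for _ in targets]

plt.boxplot(acc, showmeans=True)                 # median and spread per target
plt.xticks(range(1, len(targets) + 1), targets)
plt.xlabel("Target")
plt.ylabel("Recognition accuracy")
plt.show()
```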
5.2.3. Analysis of Results at Different Distances
In behavior detection, the physiological characteristics of the tested participants, including weight and height, directly affect recognition accuracy because they change the distance from the target to the radar transmitter. The experiment measured the millimeter-wave FMCW signal at different distances to determine the speed of and distance to the target. Nine detection nodes were used at intervals of 10 cm, and data were resolved experimentally at distances of 0.3 m, 0.4 m, 0.5 m, 0.6 m, 0.7 m, 0.8 m, 0.9 m, 1.0 m, and 1.1 m. The Doppler spectra collected at these distances show that at 0.3–0.5 m there was no obvious echo or radial velocity, indicating no reflected signal from the target at that range; since the driver's actual action state cannot be reflected there, the data streams and tensors for those distances were eliminated through data screening. At 0.5–1.1 m from the radar transmitter, by contrast, clear behavioral Doppler features could be detected. As shown in Figure 10a, the recognition rate at 0.3–0.5 m did not exceed 20%, so the Doppler spectra at those distances cannot serve as evaluation data, whereas the average recognition rate for driver behavior at 0.5–1.1 m reached 92.1%. The CDF plot in Figure 10b shows that, under the same experimental conditions, three targets with different physiological characteristics were compared: target a, target b, and target c. Target a was taller and heavier than the other two and achieved higher recognition accuracy at smaller distances, while target c, being smaller and lighter, achieved the best accuracy at sensing distances of 0.8–1.1 m. Cross-person hazardous driving perception must therefore account for the uncertainty of target body characteristics.
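The data-screening step described above can be expressed as a simple range-bin mask; the tensor layout (frames × range bins × Doppler bins) is an assumption for illustration.

```python
import numpy as np

RANGE_BINS_M = np.linspace(0.3, 1.1, 9)   # the nine sensing distances, m
MIN_USABLE_RANGE_M = 0.5                  # below this, no reliable target echo

def screen_by_distance(rd_tensor: np.ndarray) -> np.ndarray:
    """rd_tensor: (n_frames, 9, n_doppler) range-Doppler magnitudes.
    Drops the 0.3-0.5 m range bins, which carry no usable reflected signal."""
    keep = RANGE_BINS_M >= MIN_USABLE_RANGE_M
    return rd_tensor[:, keep, :]
```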
5.2.4. Analysis of Classification Models under Different RNNs
Both GRU and LSTM are branches of the RNN family. The fatal disadvantage of the plain RNN is the vanishing gradient, together with data explosion while processing the data stream, and introducing a gating mechanism effectively solves this problem. The GRU controls the gradient of the input information from the previous layer and computes the hidden-layer information flow via its update gate. In terms of parameters, the activation function that causes the gradient to vanish is replaced by a rectified linear unit (ReLU), initialized as an unbounded activation with identity weights. The LSTM is composed of additional gate parameters (the forget gate and the input gate), and its parameter control differs completely from that of the GRU: in information processing, the LSTM does not select between the information from the previous moment and the information in the current hidden layer, whereas the GRU can. Moreover, the most important requirement for in-vehicle human posture perception is a timely early warning with low time complexity; although the GRU has one less gate than the LSTM, it converges faster and has fewer parameters. The confusion matrices show that the accuracy of the GRU on the same actions is slightly higher than that of the LSTM, with the highest accuracy at 96% and the lowest at 88%. Comparing the confusion matrices also reveals correlations between similar actions, including similar misjudgment outputs as high as 19%, so the model's output classes may overlap; the multi-parameter LSTM produced higher-similarity outputs than the GRU when classifying adjacent actions. The matrices make the advantages of the GRU clear: it is better suited to processing the radar information collected in the car. The confusion matrix is shown in Figure 11. The confusion matrix indicates the classifier's effect on the test dataset; recognition accuracy for the different behaviors is indicated with a gradient from white to blue, with high accuracy for the same behavior shown in blue.
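To make the comparison concrete, here is a minimal PyTorch sketch (layer sizes and input dimensions are illustrative assumptions, not the paper's settings) in which the same classification head is paired with either recurrent core; printing the parameter counts shows the GRU's smaller footprint.

```python
import torch
import torch.nn as nn

class RadarSeqClassifier(nn.Module):
    def __init__(self, in_dim=64, hidden=128, n_classes=9, core="gru"):
        super().__init__()
        # GRU has one fewer gate than LSTM, hence fewer parameters and faster convergence
        rnn_cls = nn.GRU if core == "gru" else nn.LSTM
        self.rnn = rnn_cls(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)     # the nine hazardous behaviors

    def forward(self, x):                            # x: (batch, frames, features)
        out, _ = self.rnn(x)
        return self.head(out[:, -1])                 # classify from the last time step

# parameter counts make the GRU's advantage concrete
for core in ("gru", "lstm"):
    m = RadarSeqClassifier(core=core)
    print(core, sum(p.numel() for p in m.parameters()))
```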
5.2.5. Analysis of Combined Modalities
A single modality cannot reflect the complete physiological movement involved in human body recognition, so the model in this paper used a three-channel feature input, with the micro-Doppler spectral features as the main recognition input and the radial velocity and amplitude features as hidden features. The three single modalities were combined pairwise to form three bimodal combinations, and all three were also spliced together into a high-dimensional multimodal feature; each modal combination served as the input to a relational network that determines the relationships between modalities. For unimodal features, the relational network first transforms them nonlinearly before stitching them together. Each single modality was fed into the GRU model alone with the attention weight-decomposition method for recognition, and the unimodal results were compared overall against the pairwise combinations and the fused multimodal representation. The three line graphs in the figure show that unimodal recognition was the worst, followed by the dual-dominant modal fusion, and finally the multimodal multi-peak representation: recognition rates ranged from 76.2% to 84.9% for the single modalities, from 84.6% to 89.8% for dual-dominant modal fusion, and from 91.7% to 95.7% for the multimodal multi-peak representation. The statistics for the three lines show no overlapping points, so the accuracy for a given action is in line with the overall modal accuracy. The line graph thus indicates the quantitative relationship between the fusion method and the number of modalities, and the influence of inter-modality dependency on accuracy, as shown in Figure 12.
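For clarity, the seven modal inputs compared above (three unimodal, three bimodal, one trimodal) can be enumerated programmatically; the modality names below are descriptive labels of ours, not identifiers from the paper.

```python
from itertools import combinations

modalities = ["micro_doppler", "radial_velocity", "amplitude"]
# 3 unimodal inputs + 3 pairwise fusions + 1 full trimodal fusion = 7 combinations
combos = [c for r in (1, 2, 3) for c in combinations(modalities, r)]
for combo in combos:
    print(len(combo), "modality/ies:", " + ".join(combo))
```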
5.2.6. Model and Related Model Performance Analysis
A common approach in multimodal corpus fusion is to use BiLSTM models with different activation functions on frame-by-frame features, with attention operation layers and LSTMs carrying more parameters. Another common baseline is the interaction-aware attention network with contextual awareness, BiGRU-IAAN, which is built on the GRU network. Comparing the models by weighted accuracy (WA) and unweighted accuracy (UA), the AReLU proposed in this paper is significantly better than the tanh activation function in absolute WA and absolute UA, and the analysis in Figure 13 shows that the LMF-AR-GRU model is more accurate than the BiGRU-IAAN and BiLSTM models in terms of both UA and WA. The proposed method of integrating GRU units with multimodal fusion and attention-ReLU activation improves the interaction between features, captures driver actions in long-range interaction recognition, and overcomes the BiLSTM problems of gradient vanishing and information explosion. A series of validation experiments optimized the AReLU parameters so that the AReLU behaves similarly to the ReLU while still able to control values below 0. AReLU effectively improves the absolute UA and WA relative to the non-learnable BiGRU-IAAN baseline, and the experiments clearly show that suppressing negative values toward 0 with the default AReLU weighting parameters negatively affects performance. The control-unit components of AR-GRU are AReLU(α, β), with the specific parameter values set to those that return the AReLU method to its optimal state; with these values, the data representation reaches up to 84.6% unweighted accuracy (UA) and 89.7% weighted accuracy (WA). This parameter variant improves UA and WA by only 0.6% and 0.7%, respectively, relative to the IAAN baseline. These results suggest that the AReLU unit does help to improve the overall accuracy of the model but requires empirical determination of the ideal parameter values, which ensures that negative values close to 0 do not adversely affect performance. The validated experimental accuracy rates demonstrate the effectiveness of AR-GRU for classifying dangerous driving behavior data.
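For reference, a minimal AReLU module consistent with the description above is sketched here; the formulation and default initial values follow the public AReLU reference implementation, not parameters reported in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AReLU(nn.Module):
    """Attention-based rectified linear unit with two learnable parameters."""
    def __init__(self, alpha=0.90, beta=2.0):   # defaults from the reference code
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor([alpha]))  # attends to negative inputs
        self.beta = nn.Parameter(torch.tensor([beta]))    # amplifies positive inputs

    def forward(self, x):
        alpha = torch.clamp(self.alpha, min=0.01, max=0.99)
        beta = 1 + torch.sigmoid(self.beta)
        # positive part amplified by beta in (1, 2); negative part attenuated
        # by alpha in (0.01, 0.99) rather than zeroed, as the text describes
        return F.relu(x) * beta - F.relu(-x) * alpha
```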
Figure 13 visualizes the weighted and unweighted accuracy of the different technical methods, showing that the method proposed in this paper achieves good recognition results.
6. Conclusions
In this paper, we explored a multi-feature low-rank multimodal fusion method based on FMCW radar, which decomposes Doppler information into data of different dimensions via DAM; the method generates multidimensional Doppler spectral information through the Fourier transform and then maps it to different vectors. Because of the large amount of information and the poor fit of the Doppler matrix, the information is normalized using low-rank multimodal fusion (LMF). The low-rank multimodal representation scales linearly with the number of modalities and achieves competitive results on different multimodal tasks. The tensor data are fed into the proposed AR-GRU approach, and the attention unit then performs nonlinear fusion of the weight decomposition. Representing the dense weight matrices within the RNN layer with multiple low-rank tensors, the model was evaluated on the self-built millimeter-wave action dataset and significantly outperformed the conventional model in the dynamic articulation of low-rank fusion. The three Doppler features exhibited excellent weight cascades, the fusion tensor completely interpreted the limb echoes over the 5 s acquisition window from the beginning to the end of each action, and redundant parameters could be reduced in the driving environment to lower time complexity and achieve timely warning and correction of dangerous driving behaviors. In future work, we will consider additional variables to enrich the model input and will try new network architectures, manually adjusting parameters and selecting architectures to refine the granularity of the recognition effect and reduce overfitting in the recognition network.