1. Introduction
People with severe physical disabilities face many difficulties in their daily lives, such as moving around and eating. Research has been conducted on the operation of electric wheelchairs to improve the quality of life of these people, and various user interfaces have been developed [1]. In voice control methods, the user operates the electric wheelchair by uttering commands such as “stop”, “go forward”, “left”, and “right” [2,3]. Moreover, the EEG (electroencephalography)-based brain–computer interface (BCI) processes the user’s EEG signals and converts them into control commands to drive the wheelchair [4]. These interfaces are an alternative to joysticks for people with severe physical disabilities. However, voice input may produce undesired commands in noisy environments and makes fine movement adjustments laborious. In the case of BCI, the user must constantly concentrate on the control commands, which places a heavy burden on the user [5].
On the other hand, research has been conducted on applying eye-tracking devices to interfaces because eye movements allow hands-free operation of equipment. Furthermore, human eye behavior is less susceptible to paralysis and strongly correlates with visual intentions. Typical eye movements include fixations and saccades: a fixation is an eye movement in which the user gazes at an arbitrary area, and a saccade is a rapid eye movement [6]. In several studies [7,8,9,10], fixation was detected from the user’s eye movements using eye tracking, and the electric wheelchair was driven in the fixated direction. However, people also move their heads and eyes when checking their surroundings or when focusing on a specific object. Such unintentional fixations are incorrectly recognized as operations by an eye-tracking interface. This problem is known as the Midas touch problem and requires the classification of visual intentions [11,12].
One method to solve the Midas touch problem is to distinguish intentional fixation by setting an extended dwell time for making choices and decisions [13]. This method is called the gaze dwell time method, which selects arbitrary directions or objects when the user intentionally fixates on them for a certain time. We have developed an electric wheelchair that can steer in the fixated direction using the gaze dwell time method. Although the gaze dwell time method makes it easy for the user to check the surroundings while the wheelchair is moving, it requires a certain amount of time to distinguish fixation, which causes a lag when turning left or right. Furthermore, the longer gaze dwell time required for fixation detection forces the user to expend extra effort to keep fixating [1]. Thus, applying the results of fixation detection directly to the control system of an electric wheelchair makes the operation more difficult and increases the burden on the user. Hence, there is a need to develop a method that solves these problems.
Ideally, visual intentions should be estimated in real time from subtle differences in the user’s natural eye movements [14,15]. Researchers have assumed that intentions are reflected in eye movement time series data and have used machine learning models to estimate subjects’ intentions from eye movement patterns. In particular, Subramanian et al. used not only eye movements but also external environmental information, such as the depth value of the gazed object and the object’s name, to estimate visual intention with high accuracy. Doshi et al. [16] used the temporal relationship between eye and head pose to determine the attentional state of the subject. They concluded from their experimental results that the head tends to move before the eyes during task-oriented visual attention, such as lane changes. Other studies on the dynamics of head and eye gaze behavior show that when stimuli are presented in the visual field, the head tends to move later than the eyes [17,18]. Thus, visual intentions tend to relate to both head and eye movements. Inspired by these findings, we consider head movements in addition to eye movements and use depth information to the fixation point as external environment information in order to estimate visual intentions accurately while driving the electric wheelchair.
In this paper, we aim to develop a data-driven model that estimates the user’s “moving intentions” in real time and to build an eye-controlled electric wheelchair based on this intention estimation. The system estimates visual intentions in real time during electric wheelchair operation and provides operation assistance according to those intentions, which is expected to improve operability. For example, the user can turn right or left immediately according to their intention, without having to keep looking in the direction of movement for a certain period at a corner.
The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the methodology, including the architecture of the electric wheelchair system and the machine learning algorithm. Section 4 summarizes the performance of the visual intentions estimation model and the electric wheelchair control system. Finally, the discussion is given in Section 5.
2. Related Work
Many studies have been conducted on integrating eye-tracking interfaces into electric wheelchair control systems, including video-based and EOG (electrooculography)-based methods.
In EOG-based eye-tracking methods [19,20], electrodes were placed on the user’s forehead or the skin around the eyes, and the gaze direction was detected by measuring and processing the electric potential generated when the user moved the eyeballs. The EOG-based method was less affected by lighting changes than the video-based eye-tracking method, but it is difficult to detect oblique eye rotation using this method.
In video-based eye-tracking methods [7,8,9,21,22], cameras were used to capture the user’s eyes and to track the position of the pupil via image processing. The pupil detection method depended on the camera type. In the infrared camera method, infrared light was irradiated onto the eyeball. The infrared light reflected by the cornea (the Purkinje image) served as a base point, and the gaze direction was determined from the relative position of the pupil and the Purkinje image. This method had high detection accuracy, but if the infrared light missed the eye, the pupil could not be tracked. The method using a visible light camera determined the position of the pupil using image recognition from eye images taken under natural light. This method was inexpensive and easy to handle compared to an infrared camera. However, it required illumination of a certain brightness, and detection was difficult when the eyeball image was washed out or too dark. Moreover, video-based eye-tracking methods have been studied in two configurations: with the camera placed at a distance from the user (non-invasive) or mounted directly on the user’s head or face (invasive).
In studies using invasive eye trackers, the eye tracker was placed on the user’s head to detect gaze direction [21]. The pupil was extracted using image processing, and the gaze direction was calculated from the motion trajectory of the pupil center coordinates. The pupil’s range of motion was divided into five labeled regions, and these five labels defined commands to control the movement of the wheelchair (backward, left turn, stop, right turn, and forward). A similar study also set five regions in the range of motion of the pupil and implemented left turn, right turn, forward, stop, and toggle switch (ON/OFF) functions so that the electric wheelchair was driven according to these regions [22]. Research has also been conducted to detect eye direction without tracking the pupil [8]. Eye movements were classified by inputting eye images into convolutional neural networks (CNNs), and the electric wheelchair was driven according to the classification results (right, left, forward, and eyes closed). Invasive eye trackers constantly track the eyes from a fixed angle regardless of the face direction. However, the eye tracker is attached directly to the user’s head, which imposes a significant physical burden.
Studies using non-invasive eye trackers placed the eye tracker in front of the user to detect gaze direction [9]. The user’s face was detected using face recognition, and the eye region was then determined from the facial landmarks. Next, four types of eye movements (look up, look left, look right, and look middle) were estimated from the eye images using transfer learning with VGG-16, and the electric wheelchair was driven according to the estimated results. Research has also been conducted with a monitor connected to the eye tracker placed in front of the user [7]. The eye tracker captured the user’s gaze on the monitor, which displayed a front-facing camera image and a control panel consisting of options and operation commands for controlling the electric wheelchair. The user operated the electric wheelchair by fixating on the control panel. Non-invasive eye trackers are less physically burdensome because the camera is located away from the user, but tracking is unavailable if the eyes are not visible to the camera due to the face pose.
The methods described so far divided the eyeball’s range of motion into an arbitrary number of regions or classified eye movements to control the electric wheelchair, but these methods led to misselection due to involuntary eye movements. This problem is called the Midas touch problem. A standard solution to the Midas touch problem is to combine eye tracking with other modalities such as gaze dwell time [13], eye gestures [23], blinking [21], or a switch and touchpad [1]. For example, the visual intention is confirmed by blinking multiple times or by pressing a switch as an additional input to decide whether the person wants to move to the gazed destination or is merely checking the surroundings. Such multimodal input enables the system to correctly understand visual intentions. However, the user is forced to concentrate excessively on the operation when driving on a complex route that requires repeated checks and stops, and severely disabled users who cannot move their upper limbs cannot use a switch or touchpad at all. Therefore, research has been conducted to infer visual intentions from natural gaze behavior.
Inoue et al. [14] developed a classification model for classifying cooking operations from eye movement patterns. They used N-grams to describe gaze patterns such as eye movements, fixations, and blinks during cooking, and trained a random forest to classify the gaze patterns. The mean accuracy of the trained random forest was 66.2% over all of the cooking motions. Other approaches have also been studied. Ishii et al. [24] proposed an algorithm to estimate the user’s level of engagement in a conversation based on gaze information such as eye movement patterns, fixation time, amount of eye movement, and pupil size. They trained a decision tree as an engagement estimation model and predicted user disengagement with about 70% accuracy. These studies tried to estimate visual intentions using a data-driven approach that trains a model on features related only to eye movements, and succeeded in estimating them with moderate accuracy.
In recent years, low-cost RGB-D cameras and laser rangefinders (LRFs), together with advances in object detection algorithms, have made it easy to obtain information on fixated targets. Hence, research on visual intention estimation that considers the fixated object has been conducted. Huang et al. [15] studied the prediction of which ingredients a customer would request based on the customer’s gaze cues, in a collaborative process in which a worker makes a sandwich using the ingredients requested by the customer. They collected and analyzed data from a simulated sandwich-making process and found that the intention to select an ingredient was represented by features such as the number of fixations on the ingredient, the time per fixation, the total fixation time, and the most recently seen ingredient. These features were used to train an SVM classifier, which successfully predicted the customer’s order intention with 76% accuracy. Furthermore, the classifier made the correct prediction approximately 1.8 s before the customer’s voiced request.
A recent work by Subramanian et al. [25] showed that visual intentions toward objects can be inferred from a wheelchair user’s natural eye movements during an electric wheelchair navigation task. Moreover, they successfully applied the estimated visual intentions to wheelchair steering control. First, subjects performed a task with and without interactive intentions toward objects, and their fixation points were recorded during the task. An object detection model based on the single shot multibox detector (SSD) [26] and MobileNets [27] architectures was used to compute object labels and bounding boxes. Next, an SVM and weighted K-nearest neighbors (KNNs) were trained on the fixation points on the objects. These classifiers output Boolean values for interactive or non-interactive intention, and the classification accuracy was generally higher than 78.8%. The visual intention was estimated each time the user looked at an object, and the integrated system autonomously navigated the wheelchair to the object’s location.
Thus, information about the fixated object has become an essential element in estimating visual intentions. In addition, the remarkable advancement of deep learning-related technologies has led to the development of high-performance regression and classification algorithms for time series data. Therefore, we estimate visual intentions by training a deep learning model on patterns of eye and head movement, fixation, and depth information to the fixation point.
3. Materials and Methods
3.1. The Electric Wheelchair Control System
Figure 1 shows the appearance of the eye-controlled electric wheelchair and the system components. We use the WHILL Model CR, a research and development model of electric wheelchair manufactured by WHILL Inc. The single-board computer is a Jetson TX2 manufactured by NVIDIA. The tablet device is an Apple iPad Pro, placed in front of the wheelchair user to capture facial images every 0.2 s. The measurement range of the camera is 3.14 rad horizontally and 2.09 rad vertically. The LRF is a URG-04LX-UG01, manufactured by HOKUYO; its scanning angle is 4.18 rad, and its measurement range is 0.02 m to 5.6 m. The LRF obtains depth values to the points the user’s eyes and head are facing. WHILL’s joystick module converts X-axis input values to angular velocity and Y-axis input values to translational velocity. The electric wheelchair is controlled by bypassing this joystick module and instead inputting speed commands from the electric wheelchair control system implemented on the Jetson TX2.
The control system comprises a visual intentions estimation model and a gaze dwell time method. The visual intentions estimation model is data-driven and can be customized for many tasks by adding more driving scenario data during the training phase. First, we construct a simple model by limiting the tasks to be estimated to “forward”, “right turn”, “left turn”, and “stop” only, and we verify whether intention estimation is possible. The speed command is calculated based on the rotation angle of the eye at the time of intention estimation. Moreover, the gaze dwell time method is applied when driving straight ahead to avoid unintended movement in the direction when looking aside or checking the surroundings.
Figure 2 shows the flow from the data acquisition of various sensors to driving.
1. The depth values to the points the user’s eyes and head are facing are acquired using the LRF. The depth values are input to the visual intentions estimation model.
2. The rotation angles of the horizontal and vertical axes of the user’s eyes and head are obtained from the camera image at each time t using ARKit Face Tracking, a library for measuring face and eye posture provided by Apple Inc. (Cupertino, CA, USA) [28]. The rotation angles are denoised using a five-point moving average filter. The rotation angles of the horizontal and vertical axes are treated as a set of vectors in a two-dimensional plane. Next, the angular velocity, standard deviation, and dwell time are calculated from the amount of change in these vectors. The rotation angle vector, angular velocity, standard deviation, and dwell time are input to the proposed model.
3. The proposed model outputs the control commands “forward”, “turn left”, “turn right”, and “stop” at every time t according to the input values.
4. The gaze dwell time method is applied when going straight ahead, and the wheelchair drives in the direction where the user has gazed for more than 0.7 s. If the dwell time is less than 0.7 s, the wheelchair continues at the previous speed command.
5. The following equations calculate the speed command input to the joystick module of the electric wheelchair (a minimal sketch of this mapping is given at the end of this subsection). $\theta_v$ is the rotation angle of the eyeball on the vertical axis at the time the command is output by the visual intentions estimation model, and $\theta_h$ is the eye rotation angle on the horizontal axis. $\theta_v$ is used as a switch to stop and start according to the threshold value $th$, as shown in Equation (1). We set $th$ = 0.35 so that the electric wheelchair stops when the user looks at the bottom of the tablet device. $c$ in Equation (1) is a constant, and in this paper, $c$ is set to 100. The translational speed of the electric wheelchair is constant while the wheelchair is driven. In Equation (2), $k$ is a coefficient for adjusting the angular velocity. However, the $V_y$ and $V_x$ values are 0 only when the model outputs a stop command.
6. The speed command values $V_x$ and $V_y$ are input to the WHILL, which converts $V_x$ to an angular velocity and $V_y$ to a translational velocity to drive the motors.
The above process enables the electric wheelchair to drive following the direction of the eyes. The movement of the electric wheelchair is limited to forward, right/left turn, and stop to ensure safety, because physically disabled persons cannot quickly check behind them due to muscle paralysis or rigidity.
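For illustration, the following Python sketch shows one way the eye angles and the model output could be mapped to the joystick-level speed commands described in step 5. The constants $c$ = 100 and $th$ = 0.35 follow the text, whereas the turning gain, the sign convention for "looking at the bottom of the tablet", and the clipping range are assumptions introduced only for this example.

```python
import numpy as np

# c = 100 and th = 0.35 follow the text; K_TURN, the sign convention, and the
# clipping range are assumptions for this sketch.
C_FORWARD = 100.0     # constant translational speed command (Equation (1))
TH_STOP = 0.35        # vertical eye-angle threshold used as a stop/start switch
K_TURN = 60.0         # coefficient adjusting the angular velocity (Equation (2))

def speed_command(theta_v, theta_h, model_command):
    """Map eye rotation angles and the estimated intention to (V_y, V_x).

    theta_v, theta_h: vertical/horizontal eye rotation angles [rad]
    model_command:    output of the visual intentions estimation model
    Returns V_y (translational command) and V_x (angular command).
    """
    if model_command == "stop" or theta_v < -TH_STOP:   # looking at the bottom of the tablet
        return 0.0, 0.0
    v_y = C_FORWARD                                      # constant translational speed
    v_x = K_TURN * theta_h                               # turn rate proportional to horizontal gaze
    return v_y, float(np.clip(v_x, -100.0, 100.0))

print(speed_command(0.1, 0.25, "turn right"))            # -> (100.0, 15.0)
```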
3.2. Generation of Feature Vectors
This section describes the feature vectors input to the visual intentions estimation model. First, the rotation angles of the horizontal and vertical axes of the eyes and head are treated as vectors $\boldsymbol{v}_{eye}$ and $\boldsymbol{v}_{head}$, respectively, in the two-dimensional plane. Next, the angular velocity and the standard deviation are calculated from the amount of change in these vectors. Moreover, the “attention histogram” proposed by Adachi et al. [13] is used to calculate the “gaze dwell time”, a feature that readily expresses the psychological state.
The attention histogram is a two-dimensional histogram that indicates fixation intensity and is distributed around the fixation point. As shown in Equation (3), a two-dimensional Gaussian distribution is used for this distribution. The two-dimensional Gaussian distribution is calculated from the angles $\theta_x$ and $\theta_y$ between the fixation point and the other points shown in Figure 3. The histogram is obtained by accumulating this distribution at every time step; Equation (4) shows the update formula for the attention histogram, where $H_p(t)$ is the histogram frequency of point $p$ at the current time $t$ and $\lambda$ is the attenuation rate.

When the user keeps fixating on the same point, the histogram frequency approaches its peak value. When the histogram frequency reaches 90% of the peak value, the user is judged to have fixated on the point, and the gaze dwell time starts counting. When the histogram frequency falls below 50% of the peak value, the gaze dwell time is reset. The attention histogram determines the time required for fixation detection through the parameters $\sigma$ (the standard deviation of the Gaussian distribution) and $\lambda$. In this study, we set the standard deviation $\sigma$ to 5 and the attenuation rate $\lambda$ to 0.5 to obtain a fixation detection time of 0.7 s. The attention histogram is also applied to the gaze dwell time method implemented in the electric wheelchair control system.
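The following Python sketch illustrates how such an attention histogram and the resulting gaze dwell time could be maintained. It assumes a grid over gaze angles, a Gaussian bump added at every 0.2 s sample, and exponential attenuation; the grid size, the peak estimate, and the exact form of Equations (3) and (4) are simplifications for illustration, not the implementation of Adachi et al. [13].

```python
import numpy as np

class AttentionHistogram:
    """Simplified attention histogram / gaze dwell time (sketch, not the original method)."""

    def __init__(self, sigma=5.0, lam=0.5, dt=0.2, grid=(61, 41)):
        self.sigma, self.lam, self.dt = sigma, lam, dt
        self.hist = np.zeros(grid)                          # histogram over gaze angles
        self._gy, self._gx = np.mgrid[0:grid[0], 0:grid[1]]
        self.peak = 1.0 / (1.0 - lam)                       # limit of the attenuated sum
        self.dwell = 0.0                                    # current gaze dwell time [s]

    def update(self, px, py):
        """px, py: current fixation point as grid indices; returns the dwell time."""
        d2 = (self._gx - px) ** 2 + (self._gy - py) ** 2
        gauss = np.exp(-d2 / (2.0 * self.sigma ** 2))       # Equation (3), simplified
        self.hist = self.lam * self.hist + gauss            # Equation (4): decay old, add new
        level = self.hist[int(py), int(px)]
        if level >= 0.9 * self.peak:                        # fixation detected: count dwell time
            self.dwell += self.dt
        elif level < 0.5 * self.peak:                       # fixation lost: reset dwell time
            self.dwell = 0.0
        return self.dwell

ah = AttentionHistogram()
for _ in range(10):                                         # 2 s of gazing at the same point
    t = ah.update(20, 30)
print(round(t, 1))                                          # accumulated dwell time
```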
Finally, the eye and head tracking data alone are vectors in a two-dimensional plane, so the system does not know what the user is actually gazing at. For example, if the user looks at a wall when the electric wheelchair stops, the system cannot determine from the eye rotation angle vector alone whether that direction is open or obstructed. Environmental information, such as the distance to the object being looked at, is extremely important for determining whether the user intends to interact. Therefore, we use the depth values to the points where the user’s eyes and head are facing as additional features.

Thus, the feature vectors input to the visual intentions estimation model are multivariate time series data with 10 dimensions: the rotation angle vector, angular velocity, standard deviation, dwell time, and depth value for each of the eyes and the head.
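Before training, the per-step features are cut into fixed-length windows. The sketch below shows one plausible way to do this; the window length, the 0.2 s sampling period, and labelling each window with the intention at its last step are assumptions for illustration.

```python
import numpy as np

def make_windows(features, labels, win_len=20):
    """Cut the 10-dimensional feature stream into fixed-length training windows.

    features: array of shape (T, 10), one row per 0.2 s sample
    labels:   intention label index per sample (0: forward, 1: right, 2: left, 3: stop)
    win_len:  samples per window (20 samples = 4 s here, an assumed value)
    Returns X of shape (N, win_len, 10) and y of shape (N,).
    """
    X, y = [], []
    for end in range(win_len, len(features) + 1):
        X.append(features[end - win_len:end])   # sliding window over the time series
        y.append(labels[end - 1])               # label of the most recent step
    return np.stack(X), np.array(y)

stream = np.random.randn(100, 10)               # 20 s of synthetic features
intent = np.random.randint(0, 4, size=100)
X, y = make_windows(stream, intent)
print(X.shape, y.shape)                         # -> (81, 20, 10) (81,)
```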
3.3. One-Dimensional Convolution Neural Network
CNN (convolutional neural network) is one of the deep learning methods using human neural models. CNN extracts local features from the given data via the convolution and pooling layers and performs regression and classification based on the local feature values in a fully connected layer [
29,
30,
31]. CNN is characterized by robustness to input a signal shift [
32] and is widely applied for image processing, object detection, natural language processing, and biometric signal processing [
33]. Since the eye and head movements are one-dimensional sequence data, we use a one-dimensional CNN (1DCNN) model in which the convolution and pooling layers have a one-dimensional shape. Detailed explanations of 1DCNN are described below.
Figure 4 shows the architecture of a 1DCNN model.
3.3.1. Convolution Layer
The convolution layer extracts local features from a local region of the input signal or of the upper layer’s feature map, generating a feature map that summarizes the local features [34,35]. This mechanism is known as local receptive fields. As shown in Figure 4, the feature map is produced by multiplying the inputs with the convolution kernels and passing the summed values to the activation function. In this process, a convolution kernel with the same weights is used for all convolutions in each local region, which is known as weight sharing. Local receptive fields and weight sharing effectively reduce the parameters, such as the convolution kernels’ weights and biases, thereby reducing the complexity of the network structure. Generally, convolutional processing uses multiple convolution kernels to extract the various features that contribute to classification. The one-dimensional convolution process is calculated as in Equation (5):

$$x_j^l = f\left( b_j^l + \sum_{c=1}^{C} \sum_{i=1}^{N} w_{i,j}^{l,c} * y_i^{l-1,c} \right) \quad (5)$$

where $x_j^l$ is the value of the $j$th neuron in layer $l$, and $b_j^l$ is the bias corresponding to the convolution kernel. $w_{i,j}^{l,c}$ is the $i$th kernel weight linked to the $j$th neuron in layer $l$ for input channel $c$, and $y_i^{l-1,c}$ denotes the input value from the $i$th neuron in layer $l-1$ to the $j$th neuron in layer $l$ for input channel $c$. The kernel weights $w_{i,j}^{l,c}$ and bias $b_j^l$ are obtained using backpropagation [36]. $C$ is the number of input channels, $N$ indicates the number of neurons within the selected range of the feature map in layer $l-1$, ∗ represents the convolution operator, and $f$ denotes the activation function. In this paper, ReLU [30] is selected as the activation function. ReLU is described in Equation (6):

$$f(x) = \max(0, x) \quad (6)$$
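As a concrete illustration of Equations (5) and (6), the following minimal NumPy sketch performs a one-dimensional convolution over a multichannel sequence followed by ReLU. The toy shapes (10 channels, 48 kernels of width 2) mirror the values used later in Section 3.5, and the unpadded, stride-1 loop form is chosen for readability rather than efficiency.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                  # Equation (6)

def conv1d(y, kernels, biases):
    """Minimal 1-D convolution (Equation (5)), stride 1, no padding.

    y:       input of shape (channels, length)
    kernels: weights of shape (num_kernels, channels, kernel_width)
    biases:  one bias per kernel, shape (num_kernels,)
    """
    n_k, n_c, k_w = kernels.shape
    out_len = y.shape[1] - k_w + 1
    out = np.zeros((n_k, out_len))
    for j in range(n_k):                       # each convolution kernel
        for t in range(out_len):               # slide over the sequence
            window = y[:, t:t + k_w]           # local receptive field
            out[j, t] = np.sum(kernels[j] * window) + biases[j]
    return relu(out)

# Toy example: 10 input channels (the feature vector), 48 kernels of width 2.
x = np.random.randn(10, 20)
w = np.random.randn(48, 10, 2) * 0.1
b = np.zeros(48)
print(conv1d(x, w, b).shape)                   # -> (48, 19)
```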
3.3.2. Pooling Layer
The pooling layer subsamples the features extracted by the preceding convolution layer and aggregates the local features. We use max pooling [37], which outputs the largest value in the feature map within a window range as the feature. Figure 4 shows how the max pooling method halves the feature map obtained in the previous layer. The pooling layer reduces the dimension of the features and thus helps avoid overfitting [38]. Furthermore, the pooling layer can retain important features while reducing noise. Hence, a CNN is robust to input signal shifts (translational invariance). Max pooling is defined in Equation (7):

$$p_j^l = \max_{i \in R_j^l} a_i^l \quad (7)$$

where $R_j^l$ represents the index set of the $j$th pooling region in the $l$th layer, $p_j^l$ is the output feature of the $j$th pooling region in the $l$th layer, and $a_i^l$ denotes the input feature with index $i$ in the $l$th layer.
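A short NumPy sketch of Equation (7) over non-overlapping windows; the pooling width of 2 follows the model configuration in Section 3.5, and the channel-wise layout is an assumption.

```python
import numpy as np

def max_pool1d(a, pool=2):
    """Max pooling over non-overlapping windows (Equation (7) sketch).

    a: feature map of shape (channels, length); the length is truncated to a
    multiple of the pooling width before taking the window-wise maximum.
    """
    n_c, n_t = a.shape
    n_t = (n_t // pool) * pool
    return a[:, :n_t].reshape(n_c, n_t // pool, pool).max(axis=2)

fm = np.random.randn(48, 19)
print(max_pool1d(fm).shape)    # -> (48, 9): the feature map length is roughly halved
```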
3.3.3. Fully Connected Layer
Feature maps extracted through multiple convolution and pooling layers are passed to the fully connected layer [39,40]. Whereas the convolution layer is a sparse network architecture, all of the neurons in the fully connected layer are connected to every neuron in the previous layer and in the next layer [41]. The output value of each neuron is calculated as in Equation (8) and provided to the next layer:

$$z_j = f\left( b_j + \sum_{i=1}^{M} w_{ij} y_i \right) \quad (8)$$

where $z_j$ is the output value of the $j$th neuron, and $b_j$ is the bias of the $j$th neuron. $w_{ij}$ represents the $i$th weight linked to the $j$th neuron, $y_i$ denotes the input value from the $i$th neuron, $M$ is the number of neurons, and $f$ denotes the activation function.
3.3.4. Output Layer
A softmax function is adopted for the output layer. The softmax function is described in Equation (9):

$$y_i = \frac{\exp(u_i)}{\sum_{k=1}^{M} \exp(u_k)} \quad (9)$$

where $y_i$ is the value of the $i$th output unit, $u_k$ is the value of the $k$th neuron, and $M$ is the number of neurons. The number of neurons equals the number of predicted labels. The value of $y_i$ can be interpreted as a probability because the sum of all $y_i$ in the output layer is equal to 1. Finally, the index of the unit with the highest value is output as the predicted label $\hat{y}$, as shown in Equation (10):

$$\hat{y} = \underset{i}{\arg\max}\; y_i \quad (10)$$
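For reference, Equations (9) and (10) can be written in a few lines of NumPy, using the usual max-subtraction trick for numerical stability; the example logits below are arbitrary.

```python
import numpy as np

def softmax(u):
    """Equation (9), with max-subtraction for numerical stability."""
    e = np.exp(u - np.max(u))
    return e / e.sum()

logits = np.array([1.2, -0.3, 0.4, 2.1])                 # one logit per intention label
probs = softmax(logits)
label = ["forward", "right turn", "left turn", "stop"][int(np.argmax(probs))]  # Equation (10)
print(probs.round(3), label)
```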
3.4. Long Short-Term Memory
The RNN (recurrent neural network) is a neural network model with a recurrent structure in the middle layer. The RNN can process data while considering past states because the previous output is used as an input value. Hence, RNNs are used to forecast time series data [42]. In particular, LSTM (long short-term memory) is a type of RNN with a unit called the LSTM block in the middle layer that learns contexts with long-term dependencies from sequences [43,44]. This model is widely utilized in natural language processing [45] and speech recognition [46]. Figure 5 shows a schematic diagram of an LSTM block.
LSTM consists of a unit for storing information, called the cell state, and three gate mechanisms (forget gates $f_t$, input gates $i_t$, and output gates $o_t$) that control the information. The cell state is an essential component of the LSTM, which stores and passes information between time steps by retaining or erasing old information in the cell state and adding new information. The forget gates $f_t$ control the amount of old information erased, as shown in Equation (11):

$$f_t = \sigma\left( W_f \cdot [h_{t-1}, x_t] + b_f \right) \quad (11)$$

where $h_{t-1}$ is the output value of the middle layer at the previous time, $x_t$ is the input value at the current time $t$, and $\sigma$ denotes the sigmoid function. $f_t$ is multiplied by the old cell state $C_{t-1}$ to determine how much information to erase.

The new information added to the cell state is selected through the following process. First, the candidate values $\tilde{C}_t$ of information added to the cell state $C_t$ are calculated as shown in Equation (12):

$$\tilde{C}_t = \tanh\left( W_C \cdot [h_{t-1}, x_t] + b_C \right) \quad (12)$$

where $\tanh$ represents the hyperbolic tangent function.

Next, the input gates $i_t$ determine the amount of the candidate values $\tilde{C}_t$ to be added to the cell state $C_t$. The input gates $i_t$ are defined in Equation (13):

$$i_t = \sigma\left( W_i \cdot [h_{t-1}, x_t] + b_i \right) \quad (13)$$

The cell state $C_t$ is updated using the obtained $f_t$, $C_{t-1}$, $i_t$, and $\tilde{C}_t$, as shown in Equation (14):

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad (14)$$

Finally, the output value $h_t$ of the LSTM block is obtained by passing the updated cell state value $C_t$ through the output gates $o_t$. The output gates $o_t$ and the output value $h_t$ are given by Equations (15) and (16):

$$o_t = \sigma\left( W_o \cdot [h_{t-1}, x_t] + b_o \right) \quad (15)$$

$$h_t = o_t \odot \tanh(C_t) \quad (16)$$

The updated cell state $C_t$ and the newly generated output value $h_t$ are passed to the next time step. In the previous equations, $W_f$, $W_i$, $W_o$, and $W_C$ represent weights, and $b_f$, $b_i$, $b_o$, and $b_C$ represent biases.
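To make the gate computations concrete, here is a minimal NumPy implementation of one LSTM block step following Equations (11)–(16). The weight shapes assume the common formulation in which each gate weight acts on the concatenation [h_{t-1}, x_t], and the toy sizes (10 inputs, 64 blocks) mirror the model in Section 3.5.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM block step, Equations (11)-(16).

    x_t: input at time t (n_in,); h_prev, c_prev: previous output / cell state (n_hid,)
    W: dict of weight matrices W_f, W_i, W_o, W_C of shape (n_hid, n_hid + n_in)
    b: dict of bias vectors b_f, b_i, b_o, b_C of shape (n_hid,)
    """
    z = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])             # forget gate, Eq. (11)
    c_tilde = np.tanh(W["C"] @ z + b["C"])         # candidate values, Eq. (12)
    i_t = sigmoid(W["i"] @ z + b["i"])             # input gate, Eq. (13)
    c_t = f_t * c_prev + i_t * c_tilde             # cell state update, Eq. (14)
    o_t = sigmoid(W["o"] @ z + b["o"])             # output gate, Eq. (15)
    h_t = o_t * np.tanh(c_t)                       # block output, Eq. (16)
    return h_t, c_t

# Toy example: 10-dimensional input, 64 LSTM blocks.
rng = np.random.default_rng(0)
n_in, n_hid = 10, 64
W = {k: rng.standard_normal((n_hid, n_hid + n_in)) * 0.1 for k in "fiCo"}
b = {k: np.zeros(n_hid) for k in "fiCo"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.standard_normal((5, n_in)):           # five time steps
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)                                     # -> (64,)
```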
3.5. Proposed Method for Estimating Visual Intentions Using One-Dimensional Convolutional Neural Network and Long Short-Term Memory
This section describes the 1DCNN-LSTM (one-dimensional convolutional neural network and long short-term memory). 1DCNN has strengths in local feature extraction, while RNN has advantages in mining time series data. Therefore, we focus on the advantages of both 1DCNN and LSTM and combine the two models to develop a 1DCNN-LSTM model.
Table 1 shows the main structural differences between the 1DCNN, LSTM, and 1DCNN-LSTM. The 1DCNN excels at extracting local features through convolution and pooling layers, and the LSTM excels at processing sequence data through its recurrent structure. The 1DCNN-LSTM combines these two model structures, preserving their respective characteristics. In the 1DCNN-LSTM model, the 1DCNN is tasked with extracting useful local features from the feature vector in the first stage, and the LSTM is tasked with learning the long-term dependencies of the feature sequences in the next stage, so that the model can infer from local features while taking time series information into account. The 1DCNN-LSTM extracts and learns local spatial and temporal features from multivariate time series data, and many promising results have been reported [47,48,49,50,51]. Thus, the 1DCNN-LSTM is expected to achieve a high degree of classification accuracy for human intention recognition in this study.
The network structure of the constructed 1DCNN-LSTM model is shown in Figure 6. The 1DCNN-LSTM model consists of an input layer, one CNN layer, two LSTM layers, two fully connected layers, and an output layer. First, the 1DCNN-LSTM model receives multivariate time series data with 10 variables as input. Next, the 1DCNN extracts local features from the input data and provides them to the LSTM. The one-dimensional CNN downsamples the data during feature extraction, so the time series is reduced to half its length. The 1DCNN stage consists of one convolution layer and one pooling layer; the number of convolution kernels is 48, the kernel window width is 2, the stride width is 1, the pooling width is 2, and the activation function is ReLU.

Next, the LSTM layers extract temporal features from the feature map provided by the 1DCNN layer. In these layers, each LSTM block extracts features at each time step, resulting in a two-dimensional (time steps × number of LSTM blocks) feature map. This feature map is converted to one dimension through a flatten layer and fed to the fully connected layers. The LSTM layers are stacked; each layer has 64 LSTM blocks, and the activation functions used are the hyperbolic tangent and sigmoid functions.

Two fully connected layers follow the LSTM layers. The first fully connected layer has 48 neurons and the second has 16 neurons, with ReLU applied as the activation function. Dropout [52] is applied to the fully connected layers to avoid overfitting. Dropout is a method that improves the model’s generalization performance by randomly disabling some neurons in the network with a certain probability during the learning process. Disabled neurons have an output value of zero or no output value [53,54]. The white neurons in the fully connected layers in Figure 6 indicate the disabled neurons. The dropout rate is set to 0.05.

Finally, the output layer is placed at the end of the model. The output layer has four neurons, to which a softmax function is applied to calculate the probability of each label (forward, right turn, left turn, and stop). The hyperparameters of each layer were determined using hyperparameter optimization, as described in Section 4.1.2.
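The architecture described above can be written compactly in Keras. The sketch below is only an illustration of that description: the filter count, kernel and pooling widths, LSTM and dense sizes, and dropout rate follow the text, while the input window length and the exact placement of the dropout layers are assumptions.

```python
from tensorflow.keras import layers, models

def build_1dcnn_lstm(timesteps=20, n_features=10, n_classes=4):
    """Sketch of the 1DCNN-LSTM described in the text (window length is assumed)."""
    m = models.Sequential([
        layers.Input(shape=(timesteps, n_features)),
        # 1DCNN stage: 48 kernels of width 2, stride 1, ReLU, then max pooling of width 2
        layers.Conv1D(48, kernel_size=2, strides=1, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        # Two stacked LSTM layers with 64 blocks each (tanh/sigmoid activations)
        layers.LSTM(64, return_sequences=True),
        layers.LSTM(64, return_sequences=True),
        layers.Flatten(),
        # Two fully connected layers with dropout (rate 0.05)
        layers.Dense(48, activation="relu"),
        layers.Dropout(0.05),
        layers.Dense(16, activation="relu"),
        layers.Dropout(0.05),
        # Output layer: probabilities for forward, right turn, left turn, stop
        layers.Dense(n_classes, activation="softmax"),
    ])
    return m

model = build_1dcnn_lstm()
model.summary()
```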
The model’s parameters (weights and biases) that minimize the error in the output values are acquired via supervised learning on patterns of feature vectors corresponding to the visual intentions. After learning, when feature vectors are input to the trained model, the output values corresponding to the input patterns are computed using the acquired parameters. The learning process is as follows. First, the parameters are initialized to random values, and the input values are propagated from the input layer to the output layer to obtain the output values. Next, the gradient for each parameter is calculated from the error in the output values using the error backpropagation method [55]. Finally, new parameters are computed from the gradients by the optimization algorithm, and the parameters are updated. This process is repeated until the tolerance value is satisfied or the maximum number of iterations is reached. Thus, optimizing the parameters minimizes the error and improves the prediction accuracy. We use the cross-entropy loss function to calculate the error, and Adam [56] is used as the optimization algorithm. The cross-entropy loss function is represented by Equation (17):

$$E = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k} t_{n,k} \log y_{n,k} \quad (17)$$
where $y_{n,k}$ represents the predicted values, $t_{n,k}$ denotes the ground truth, and $N$ is the total number of sample data. Adam is defined by Equations (18)–(22):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \frac{\partial E}{\partial w} \quad (18)$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) \left( \frac{\partial E}{\partial w} \right)^2 \quad (19)$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}} \quad (20)$$

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}} \quad (21)$$

$$w \leftarrow w - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \quad (22)$$

A moving average $m_t$ of the gradient is obtained by Equation (18), where $m_{t-1}$ is the value at the previous update, $E$ is the cross-entropy loss, and $w$ is the parameter. Similarly, Equation (19) gives a moving average $v_t$ of the squared gradient. $\beta_1$ and $\beta_2$ denote attenuation rates, and the initial values of $m_0$ and $v_0$ are 0. As $m_t$ and $v_t$ are very small in the early stage of updating, their values are corrected according to Equations (20) and (21). In Equations (20) and (21), $t$ represents the number of updates; as the updates proceed, $\beta_1^{\,t}$ and $\beta_2^{\,t}$ asymptotically approach 0, so the effect of this correction becomes minimal. Finally, the parameter $w$ is updated using Equation (22), where $\eta$ is the learning rate and $\epsilon$ is a small value to avoid division by zero. When the gradient changes significantly, the effective step size is reduced by $\sqrt{\hat{v}_t} + \epsilon$, which suppresses large changes in the parameter $w$ and makes the parameter updates more stable. In this paper, we set the parameters $\eta$, $\beta_1$, $\beta_2$, and $\epsilon$ to the recommended values given in reference [56].
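Continuing the Keras sketch above, the loss and optimizer described here map directly onto a compile/fit call. The Adam parameters below are the values recommended in [56]; the synthetic stand-in data, batch size, number of epochs, and early stopping are assumptions added so the sketch runs end to end.

```python
import numpy as np
from tensorflow.keras import optimizers, callbacks, utils

# Synthetic stand-in data; real windows would come from the feature pipeline.
X_train = np.random.randn(256, 20, 10).astype("float32")
y_train = utils.to_categorical(np.random.randint(0, 4, 256), num_classes=4)
X_val = np.random.randn(64, 20, 10).astype("float32")
y_val = utils.to_categorical(np.random.randint(0, 4, 64), num_classes=4)

model = build_1dcnn_lstm()               # defined in the previous sketch
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8),
    loss="categorical_crossentropy",     # Equation (17), with one-hot ground-truth labels
    metrics=["accuracy"],
)
model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    batch_size=32,
    callbacks=[callbacks.EarlyStopping(patience=10, restore_best_weights=True)],
)
```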
3.6. Overall of the Visual Intention Estimation Method
A flowchart of the visual intention estimation method based on machine learning models proposed in this paper is shown in
Figure 7. First, the signals measured by each sensor are normalized to construct a data set with a set of predictive labels. This data set is split into a training set, a validation set and a test set in a certain proportion. The model is then built according to the pre-defined hyperparameters (convolutional layer, LSTM layer, fully connected layers, number of filters, kernel size, and activation function) and the weight parameters
w,
b are initialized. Next, the model training is started using the training data and the validation data model. At this point, time series cross-validation (
) is used to evaluate the generalizability of the model. Finally, the performance of the models is evaluated by inputting the test data set into the
K trained models and calculating the
from the difference between the output values and the ground truth. The models to be built are 1DCNN-LSTM, 1DCNN, and LSTM, and their performance is compared.
5. Discussion
In the model evaluation experiment described in Section 4.1, the 1DCNN-LSTM using the eye and head data sets classified visual intentions with the highest accuracy. Therefore, adding information on head movement as well as eye movement to the feature vector is presumed to be effective in estimating visual intentions.
In this study, the classification score improved with the length of the input time series, and each model’s score reached its peak within an input length of 2 s to 4 s. Festor et al. [67] also reported that visual intentions were determined with 65% accuracy from a 0.6 s input time series, and that accuracy reached a peak of 92% with a 3.3 s input time series. These results indicate that estimating visual intentions is possible by learning time series patterns from natural eye and head behavior data. Furthermore, they suggest that a temporal component is essential for accurate estimation and that the characteristic eye and head movement patterns needed to estimate visual intentions may be contained within 4 s.
The PFI results showed that the eye rotation angle vector is the most important feature contributing to the estimation of visual intentions. The next most important is the distance to the point where the eyes/head are facing, suggesting that information about the external environment also contributes to accurate estimation. The high ranking of the depth information is considered to be due to the depth data recorded during poster gazing and turning. In Figure 13b, Figure 14b and Figure 15b, the depth data to the point where the eyes and head were facing while gazing at the poster ranged from 0.49 m to 1.83 m for Subject A, 0.72 m to 1.89 m for Subject B, and 0.67 m to 1.46 m for Subject C. In contrast, the depth data during the left turns ranged from 1.47 m to 5.60 m (first left turn) and 1.23 m to 5.60 m (second left turn) for Subject A, 1.46 m to 5.60 m (first left turn) and 1.25 m to 5.60 m (second left turn) for Subject B, and 1.63 m to 3.11 m (first left turn) and 1.37 m to 5.60 m (second left turn) for Subject C, suggesting that the subjects were looking down the hallway. These results indicate that the depth data mainly capture the difference between looking at the poster and turning a corner.
Next, the questionnaire results from the electric wheelchair driving experiment showed that our implementation of the control system (1DCNN-LSTM model + gaze dwell time method) was more convenient to operate than the traditional method using only the gaze dwell time method. In the free-answer questionnaire, the subjects answered that the proposed control system allowed them to turn at the intended time and drive smoothly without getting too close to a wall when turning. These results were also reflected in the driving data recorded during the experiment. Subject B tended to move his eyes frequently to check his surroundings while turning. Thus, the system’s operability decreased when using the traditional method because the estimation of visual intention was performed many times and the turning operation was interrupted during the estimation. Hence, the control method based on gaze dwell time requires more effort to maintain gaze fixation [1]. On the other hand, incorporating the visual intentions estimation model into the traditional method enabled real-time estimation, which reduced the delay before the start of turning and minimized the occurrence of a time lag. Therefore, the system’s operability was improved because the subject could turn right/left, move forward, and stop at the intended timing, without the effort of maintaining gaze fixation or allowing for the delay time.
Moreover, the electric wheelchair stopped several times when turning a corner during operation with the proposed control system. An analysis of the eye movements and depth data while driving revealed that the subject was gazing at a wall within 1.5 m to 2.0 m, and not down the hallway. Thus, we infer that the proposed model estimated that the subject was gazing at an object and output the stop command. Such a malfunction, caused by the subject’s gazing behavior, can easily occur.

However, since the visual intentions estimation model uses only the depth data obtained from the depth sensor as information on the external environment, the model cannot recognize the object seen by the user. Hence, we need to capture egocentric video and apply object detection to solve this problem. For example, if the user looks at a wall while turning the electric wheelchair, the system should not stop the wheelchair, whereas if the user focuses on a specific object, the system should stop it. Therefore, the visual intentions estimation model is expected to adapt to various driving scenarios by using the information obtained from object detection.

Finally, at the current stage of our study, the gaze dwell time method was used to suppress the effects on electric wheelchair driving of eye movements with observational intentions, such as looking away and checking the surroundings, but this suppression is insufficient. Although the proposed model outputs only four behavioral intentions (forward, right turn, left turn, and stop), the number of estimated tasks can be increased through data collection and labeling for each driving scenario. Thus, we need to collect more data to estimate observational intentions. Further research will be conducted to create data sets for various driving scenes, such as avoiding obstacles, checking left and right at crossroads, and navigating pedestrian traffic, for the practical use of the electric wheelchair control system.