Article

Eye-Gaze Controlled Wheelchair Based on Deep Learning

1 School of Automation, Harbin University of Science and Technology, Harbin 150080, China
2 School of Electrical and Electronic Engineering, Harbin University of Science and Technology, Harbin 150080, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(13), 6239; https://doi.org/10.3390/s23136239
Submission received: 31 May 2023 / Revised: 22 June 2023 / Accepted: 30 June 2023 / Published: 7 July 2023
(This article belongs to the Section Sensors and Robotics)

Abstract

In this paper, we design an intelligent wheelchair with eye-movement control for patients with ALS in a natural environment. The system consists of an electric wheelchair, a vision system, a two-dimensional robotic arm, and a main control system. The smart wheelchair obtains the eye image of the controller through a monocular camera and uses deep learning and an attention mechanism to calculate the eye-movement direction. In addition, starting from the relationship between the trajectory of the joystick and the wheelchair speed, we establish a motion acceleration model of the smart wheelchair, which reduces sudden acceleration during rapid motion and improves the smoothness of the wheelchair's movement. The lightweight eye-movement recognition model is deployed on an embedded AI controller. The test results show that the accuracy of eye-movement direction recognition is 98.49%, the wheelchair movement speed is up to 1 m/s, and the movement trajectory is smooth, without sudden changes.

1. Introduction

Amyotrophic lateral sclerosis (ALS) is a progressive and fatal neurodegenerative disease that causes the degeneration of the patient’s upper and lower motor neurons, thereby weakening the muscles. Therefore, although many ALS patients are conscious, they cannot perform physical movements or speak. A growing body of research is dedicated to applying artificial intelligence technologies to modify power wheelchairs to improve the quality of life of people with ALS. In the past few decades, researchers have carried out corresponding research on wheelchair motion control methods, including gesture control [1,2,3,4,5], voice control [6,7,8,9,10], eye-tracking control [11,12,13,14,15], and brain–computer interfaces [16,17,18,19,20,21,22]. These control methods can replace the joystick in reading the user’s intended direction of motion and realize the motion control of the wheelchair. However, because patients with ALS lose limb control and the ability to communicate verbally, gesture control and voice control are not viable options. In brain–computer interfaces, although the collection of brain signals largely eliminates noise interference, the semi-invasive or invasive electrodes used in these interfaces can pose risks to human health [23]. Compared with the above-mentioned control methods, eye-tracking control has unique advantages in terms of safety, portability, and practicality for patients with ALS.
Currently, research on the eye-tracking control of wheelchairs mainly focuses on two aspects: eye-tracking recognition and wheelchair control. Xiaokun Li et al. designed a head-mounted device based on an energy-controlled iterative curve-fitting method of infrared light, which can achieve precise pupil detection and tracking. The experimental results showed that the average tracking accuracy of the method for pupil rotation was at least 1.38% higher than that of conventional methods [24]. Fatma et al. proposed a method that uses a front camera to capture the user’s face information and determines the pupil center coordinates through a fuzzy logic controller to output wheelchair control decisions. This method features an adaptive threshold adjustment, enabling eye-tracking even under poor or unstable lighting conditions [25]. In terms of eye-tracking recognition, although the eye-tracker equipment has a high accuracy rate, the infrared light it generates can cause irreversible damage to the human eye, and it is far inferior to the camera recognition solution in terms of comfort and economic practicability.
In recent years, deep learning technology has been widely used in the fields of non-verbal communication, human–computer interaction, and intention prediction. Dilber Çetintaş et al. used the AlexNet network structure for eye-movement direction determination and trained the network model using a private dataset. The visualization results showed that the recognition accuracy of the model reached 97.88% [26]. Moayad Mokatren et al. designed a faster region-based convolutional neural network (RCNN) to detect pupils in infrared and RGB images. The experimental results showed that the method had a 3D gaze error of only 2.12° and could be used with high accuracy for various real-world moving scenes [27]. Ling Hu et al. proposed a model called SSD to detect eyeballs in images using a single deep neural network. The total average accuracy of target detection for this scheme was 91.85%, and the total average relative error after distortion removal was reduced to 1.25% [28]. The above-mentioned research based on deep learning provides a new direction for the development of eye-tracking technology.
For wheelchair control, Delia Dragusin et al. used a pupil-corneal reflex-based approach to accomplish sight estimation on a PC, and the wheelchair control chip received eye-movement commands from a PC via Bluetooth and controlled relay closures to drive motor motion [29]. Razvan Solea et al. used the PCI protocol to send the eye-tracking signal from a PC via a data acquisition card to a servo amplifier, which, in turn, sent a PWM control signal to the DC motor [30]. Aniwat Juhong et al. used x-axis–y-axis servos to control the wheelchair rocker so that the wheelchair could theoretically have a 360° movement direction, which greatly improved the maneuverability of the wheelchair [31]. Sho Higa et al. bypassed the joystick and directly sent the eye-tracking signal output via an embedded AI computing device to the control system of a WHILL CR electric wheelchair through a USB to complete the control of the wheelchair movement. Although this control method is simple and reliable, the electric wheelchair it is paired with is very expensive [32].
Deep learning-based eye-tracking methods heavily rely on the completeness and richness of the training data. For the public eye-gaze dataset, Wolfgang et al. compiled a dataset (TEyeD) from more than 20 million real-world eye images and pupil information, and the data were carefully labeled [33]. The MRL (Media Research Lab) built a large-scale dataset of human eye images with 15,000 pupil points (images) using different devices to capture drivers’ facial images under realistic conditions [34]. However, the existing dataset suffers from a lack of labeled samples, physical constraints on the recording process, and poor applicability to specific scenarios.
To sum up, to develop an intelligent eye-tracking wheelchair with low economic and operational barriers, here, we use a human eye-gaze model that combines convolutional neural networks and attention mechanisms in the eye-tracking recognition scheme. To further improve the generalizability and accuracy of the model, a large-range, multidimensional eye-tracking dataset is established. For wheelchair control, a structure that controls the movement of the wheelchair by manipulating the rocker with a tiller is developed to achieve precise control of the wheelchair. The intelligent eye-tracking wheelchair based on deep learning designed in this paper is an example of the application of artificial intelligence in medical care, elderly care, and other fields, and can serve as a reference for the use of emerging technologies to improve the quality of human life.
The rest of the paper is organized as follows. Section 2 presents a review of the related works. Section 3 outlines the intelligent eye-tracking wheelchair scheme developed in this paper and describes the acquisition process of the multidimensional eye-tracking dataset. Section 4 introduces the eye-tracking model-building and training methods. Section 5 describes the design of the eye-tracking wheelchair control system. Section 6 discusses the experiments and analyzes the results. Section 7 summarizes the article and suggests directions for future work.

2. Related Work

Eye-tracking methods can be divided into invasive and non-invasive methods. Invasive methods mainly include EEG (electroencephalogram) and EOG (electrooculogram) methods, whereas non-invasive methods can be further divided into appearance-based and model-based methods.
Invasive methods use specific contact sensors to detect changes in the voltage, current, and magnetic field associated with eye movements to determine the direction of sight. The EOG method is an objective and quantitative retinal function detection method that measures the electrostatic potential of the eye, which changes slowly with adaptation to light. To detect the horizontal movement of the eyeball, a pair of electrodes is placed on the inner and outer corners of the eye; to detect vertical movement, electrodes are placed above and below the eye, and the potential difference between them is recorded. There is no potential difference when the eyeball faces straight ahead, but when the eyeball rotates, the electrode on the corneal side becomes more positive than the other electrodes [35,36]. The contact lens method involves placing a contact lens containing reflective lenses on the human eye. The principle is to measure the reflection of different light rays by the reflective lens after they enter the human eye, obtain the position of the pupil center by calculating the angle of the reflected rays, and then judge the direction of sight. This method of mounting the measuring equipment directly on the eye is insensitive to changes in the external environment and has high measurement accuracy; however, it is expensive, requires special equipment, and causes serious interference to users [37]. The above two methods determine the line-of-sight direction through contact sensor devices and are highly robust to changes in the external environment, such as lighting and head movement. However, the user must wear special eye-contact equipment that is highly intrusive, which significantly affects comfort and limits the user’s freedom of movement.
On the non-invasive side, many studies have applied model-based principles. Morimoto et al. proposed a sight estimation technique based on the pupil–corneal reflection vector method, which uses polynomial equations to describe the mapping relationship between the pupil–corneal vector and the gaze point. This method has high accuracy, but the head must remain absolutely still, which restricts the user’s freedom of movement [38]. Cerrolaza et al. used a camera and two infrared light-emitting tubes to calculate the relative distance between two spots formed by infrared light reflected from the cornea and then calculated the mapping relationship between the pupil–corneal reflection vector and the gaze point; this method can, to some extent, compensate for the error caused by small displacements of the head [39]. Hennessey et al. performed a similar calibration using the quadrilateral formed by four infrared reflected spots. However, such methods require infrared illumination, and prolonged exposure to near-infrared light may cause damage to the eyes [40].
Appearance-based eye-tracking methods take image data as input and directly map the image data to gaze points without building 2D or 3D eye models. The appearance-based approach [41,42] relies on the completeness and richness of the training data, i.e., the training data should contain all the situations present in the application scene; otherwise, the approach cannot complete the mapping of image data to the gaze point. In current applications, large image datasets are mainly used as the input to a neural network, which is trained to achieve the mapping of image data to gaze points. The appearance-based method does not require complicated operations such as camera and geometric scene calibration, which makes it easier to use and facilitates practical application.
Therefore, considering economy, safety, practicality, and comfort of use, we designed a neural network model that combines a convolutional neural network with an attention mechanism for line-of-sight direction estimation.

3. Materials and Methods

3.1. Eye-Tracking Wheelchair Program

The eye-tracking wheelchair system consists of eye-tracking data acquisition, data preprocessing, eye-tracking direction estimation, and wheelchair motion control. The camera in front of the wheelchair transmits the face images collected in real time to the embedded AI computing module. The latter inputs the processed image into the neural network model to obtain the estimated direction of eye-tracking and transmits the signal to the Arduino; then, the Arduino controls the 2D servo to adjust the wheelchair rocker and change the motion state of the wheelchair. The user’s eye state and the estimated direction of eye-tracking are updated in real time on the display. The general block diagram of the eye-tracking wheelchair system is shown in Figure 1.

3.2. Dataset Creation

Datasets are the basis for training deep learning models, and the performance of deep learning models heavily depends on the quality and size of the datasets they are trained on. In order to further improve the practicability and accuracy of the model, 100 Chinese volunteers were recruited for this dataset collection. In two major scenes, virtual and real, we recorded videos of volunteers gazing in different directions while completing tasks, extracted human eye images frame-by-frame using OpenCV and the Dlib algorithm, and automatically labeled the obtained data according to the task attributes of the time period in which each frame was located.

3.2.1. Multidimensional Eye-Tracking Data Acquisition

In this paper, we built the eye-tracking dataset through two dimensions: virtual and reality. Multidimensional datasets can capture the complex relationships between different features, better describe the characteristics and attributes of the data, provide a more comprehensive and accurate representation of the data, and enhance model performance.
(1) Virtual scene acquisition
In this paper, a virtual scene for eye-tracking direction detection was built in the robot operating system (ROS) using the Gazebo software, as shown in Figure 2. The virtual scene is a maze containing a car, which is matched with a virtual camera so that the perspective of the volunteer is consistent with that of the car. The volunteers keep their heads still while watching the road in front of the car with their eyes and control the movement of the car through the keyboard, simulating the movement of a wheelchair in a real-life scenario. For example, when the car is in the state shown in Figure 2, the volunteer looks at the feasible road to the right of the wall and at the same time steers the car to the right through the keyboard. The volunteer’s facial image is captured and recorded by a camera placed in the center of the computer screen, while the volunteer’s keyboard actions are recorded by a script to achieve hand–eye synergy data collection. Through this method, the eye-tracking dataset in the virtual scene was established.
(2) Real scene acquisition
We set up an environment for eye-tracking data acquisition in a real scene, as shown in Figure 3b. The scene consisted of a wall with a nine-grid (as shown in Figure 3a), a laser pointer, and a wheelchair with a fixed camera. The nine-grid was a square area of 210 cm × 210 cm, each grid was 70 cm × 70 cm, and a red sign was pasted in the center of each grid. When the volunteer gazes at a designated red sign, determining the direction of the eye gaze becomes a nine-class classification problem, which reduces the influence of the volunteer’s subjective behavior. In the real scene, the experimental assistant pointed to the red mark in the center of each grid with a laser pointer, row by row from left to right, while the volunteer sat in the wheelchair, kept their head still, and stared at the position illuminated by the experimental assistant; each position was maintained for 10 s. At the same time, the facial changes of the volunteers were recorded in real time by the camera on the wheelchair. Each frame of facial data was automatically labeled according to the time to which the frame belonged.

3.2.2. Data Preprocessing

To ensure the accuracy of convolutional neural network-based algorithms, images need to be preprocessed before they are analyzed. The quality of the image has a direct impact on the accuracy of the algorithm. Different tasks require different image preprocessing methods to remove irrelevant information and enhance relevant information to improve task reliability. In this paper, the convolutional neural network was used to detect the human eye-tracking state, so redundant information other than human eyes needed to be removed in the preprocessing stage.
We used the detector function of the Dlib library [43] to locate 68 feature points on the detected face, used OpenCV to extract frames from the video, and applied the Dlib face-detection algorithm to each frame. In order to exclude the irrelevant information of the nose-bridge area between the two eyes, the images of the left and right eyes were extracted separately instead of directly extracting a single image containing both eyes. The comparison between the two is shown in Figure 4.
The area delineated by the feature points with serial numbers 42–47 corresponds to the left eye, and the feature points corresponding to the right eye have serial numbers 36–41. Taking the extraction of the left-eye image as an example, the minimum and maximum values of the abscissa and ordinate coordinates of the six points give the boundary coordinates of the left-eye image. The following formula was used to obtain the boundary coordinates of the eye image, where $x_i$, $y_i$ are the abscissa and ordinate of each feature point, $a$, $b$ are the minimum abscissa and ordinate of the eye area, and $A$, $B$ are the maximum abscissa and ordinate of the eye area:
$$a = \min_i x_i, \qquad b = \min_i y_i, \qquad A = \max_i x_i, \qquad B = \max_i y_i$$
A total of 200 videos of volunteers staring in different directions in virtual and real scenes were saved. In order to extract the human eye information in each frame of the image, we processed the video frame-by-frame, and cut out the binocular pictures according to the feature point coordinates of human eyes. The part between the two eyes may have some noise and redundant information that interferes with the prediction of the gaze direction. For example, the nose bridge, eyeglass frames, eye shadows, and eye spacing between the two eyes may have an impact on the feature extraction of the eyes, thereby reducing the accuracy of estimating the gaze direction. By removing the part between the two eyes, the attention can be focused on the separate eye regions and the performance of the gaze direction recognition can be improved.
Considering the interference of the distance between the eyes on the human eye information, we intercepted the eye information of each frame of the picture, adjusted the size of the extracted picture to 100 × 50, and merged the eyes together, horizontally. According to the task attributes of the time of each frame of the image, the human eye information in each frame of the image was automatically marked, and finally, the collection of gaze data of 100 volunteers in virtual and real scenes was completed.
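For reference, the following is a minimal sketch of this preprocessing pipeline, assuming the standard Dlib 68-landmark predictor file is available locally; the per-eye resize to 50 × 50 before horizontal merging (giving a 100 × 50 binocular image) is an assumption about the exact sizing.

```python
import cv2
import dlib
import numpy as np

# Assumed local path to the standard Dlib 68-point landmark model.
PREDICTOR_PATH = "shape_predictor_68_face_landmarks.dat"

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(PREDICTOR_PATH)

def crop_eye(gray, landmarks, idx_range):
    """Crop one eye using the bounding box (a, b, A, B) of its six landmarks."""
    pts = np.array([(landmarks.part(i).x, landmarks.part(i).y) for i in idx_range])
    a, b = pts[:, 0].min(), pts[:, 1].min()   # minimum abscissa / ordinate
    A, B = pts[:, 0].max(), pts[:, 1].max()   # maximum abscissa / ordinate
    return gray[b:B, a:A]

def extract_binocular_image(frame):
    """Return the horizontally merged two-eye image, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 0)
    if len(faces) == 0:
        return None
    landmarks = predictor(gray, faces[0])
    right_eye = crop_eye(gray, landmarks, range(36, 42))  # landmarks 36-41
    left_eye = crop_eye(gray, landmarks, range(42, 48))   # landmarks 42-47
    # Assumed sizing: each eye resized to 50x50, then merged side by side into 100x50.
    right_eye = cv2.resize(right_eye, (50, 50))
    left_eye = cv2.resize(left_eye, (50, 50))
    return np.hstack([right_eye, left_eye])
```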

3.2.3. Unification of Datasets

The data collected in the real environment in Section 3.2.1 had nine different labels, but after visualization with t-SNE [44], these labels could be directly merged into three categories, which is also consistent with reality. For example, when the test volunteers gazed at the leftmost column, their eye features were distributed in approximately the same region regardless of the row they gazed at. Therefore, the three category labels could be used to uniformly mark the dataset of the real scene, which matches exactly the label classification of the dataset collected in the virtual environment and facilitates further dataset screening. The t-SNE visualization effect is shown in Figure 5.
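As an illustration of this check (not code from the paper), a t-SNE projection of the flattened eye images can be produced with scikit-learn; the perplexity value and the label-merging rule in the comment are assumptions.

```python
# A minimal sketch assuming the 9-class real-scene eye images are available as a
# (N, 100*50) float array `features` with integer labels 0-8.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize_tsne(features, labels):
    # Project flattened eye images to 2D; perplexity is a tunable assumption.
    embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=4)
    plt.colorbar(label="nine-grid position")
    plt.title("t-SNE of real-scene eye images")
    plt.show()

# Columns of the nine-grid collapse into three clusters (left / forward / right),
# so the nine labels can be merged with a simple rule such as
# merged_label = original_label % 3   (hypothetical, assumes column-major label order)
```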
For the convenience of description, we introduce some notation: $X = \{X_1, X_2, \ldots, X_n\}$, $n = 1, 2, \ldots, 100$; $Y = \{y_1, y_2, y_3\}$. Here, the set $X$ represents the 100 volunteers participating in the data collection. Each $X_i$ contains 1350 photos, and each photo has a corresponding label. $Y$ represents the set of labels: $y_1$ corresponds to the label left, $y_2$ to forward, and $y_3$ to right. A total of 135,000 pictures with gaze labels in different directions were selected to provide training data for the algorithm training in the next section. The established dataset is shown in Figure 6.

4. Eye-Tracking Model Building

In this paper, we used deep learning to estimate the eye-gaze direction in the human eye-tracking recognition task for the eye-tracking wheelchair, which is divided into a feature extraction network and a classification network. This was accomplished by training a deep learning network (GazeNet), which incorporates several modules for feature extraction optimization, on the human eye dataset we built, and tri-classifying the extracted features for output using a cross-entropy loss function.

4.1. Eye-Gaze Direction Estimation

In this paper, we carefully traded off the lightweight design and accuracy of the algorithm when designing the network. In order to run the human gaze direction determination algorithm efficiently on embedded devices, we added an improved Inception module to reduce the model computation, as well as a ResBlock module and a CBAM attention module to improve the network performance of the designed GazeNet network structure; the final overall network structure is shown in Figure 7. Each extracted human eye image first entered the improved Inception module after convolution and pooling operations, its output entered the ResBlock module after further convolution and pooling operations, and the resulting output entered the CBAM attention module after additional convolution and pooling operations. Then, the output features were again convolved and pooled before being tri-classified by the fully connected layer to achieve the eye-gaze direction estimation.

4.1.1. Inception Module

The parallel structure adopted by the Inception module enables the input image to be processed by multiple convolutional kernels of different scales and pooling operations to obtain feature information at different levels [45]. The purpose is to extract features at different scales while keeping the size of the output feature map of the convolutional layer unchanged, to efficiently expand the depth and width of the network, and to prevent overfitting while improving the accuracy of the deep learning network. The specific implementation uses a combination of multiple convolutional kernels of different scales and pooling operations to increase the nonlinear representation of the model without increasing the number of model parameters. In this paper, we used the Inception module to decompose a 5 × 5 convolution kernel (Figure 8a) into two 3 × 3 convolution kernels (Figure 8b), so that only about $(3 \times 3 + 3 \times 3)/(5 \times 5) = 72\%$ of the computational overhead is needed. This reduced the number of model parameters and the amount of computation while maintaining the same receptive field, as shown in Figure 8.
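The following PyTorch fragment is a minimal sketch of this factorization; the channel widths are placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Factorized5x5Branch(nn.Module):
    """Two stacked 3x3 convolutions cover the same 5x5 receptive field with
    (9 + 9) / 25 = 72% of the multiply-accumulate cost of a single 5x5 kernel."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Same output spatial size and receptive field as a padded 5x5 convolution.
        return self.relu(self.conv2(self.relu(self.conv1(x))))
```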

4.1.2. ResBlock Module

Convolutional neural networks can extract a rich feature hierarchy, but they also carry the hidden danger of gradient vanishing or gradient explosion, and the use of regularization may produce degradation problems. Therefore, we chose to add the ResBlock module in the middle layers of the network, which achieves a skip connection by directly adding the output and input of the convolutional layer, thus ensuring better gradient transfer during backpropagation and reducing the number of model parameters [46]. The residual $F(x)$ that the network needs to learn can be written as follows:
$$F(x) = H(x) - x$$
During the training process, if the model finds that the gradient becomes very small as the neural network grows deeper, i.e., the “degradation phenomenon” appears, it can directly add the output and input to achieve an identity mapping [47]. The advantage of this is that even if the network is very deep, effective feature extraction can be guaranteed at each layer. In this way, the performance of the network can be maintained as the network becomes deeper, avoiding the degradation problem that occurs in traditional neural networks, as shown in Figure 9.
In Figure 9, $x$ represents the input, $H(x)$ represents the output obtained after a series of transformations in the network, and $F(x)$ represents the residual, which is the difference between the network output $H(x)$ and the input $x$. During the training process, the network automatically learns a set of appropriate weights so that $F(x)$ converges to 0. The introduction of the residual concept allows the network to better learn the mapping relationship between the input and the output, while reducing the risk of gradient vanishing or gradient explosion, thus improving the performance and generalization of the model.
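A minimal PyTorch sketch of such a residual block is given below; the channel counts and the use of batch normalization are assumptions, since the exact GazeNet configuration is not reproduced here.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block in the spirit of F(x) = H(x) - x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        # The skip connection adds the input back, so the branch only has to learn F(x).
        return self.relu(x + residual)
```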

4.1.3. CBAM Module

In addition, since we wanted the feature processing to focus on the direction of the pupil in the eye, the Convolutional Block Attention Module (CBAM) was inserted after the ResBlock module. The CBAM module can improve the performance of convolutional neural networks for eye-tracking direction estimation by better learning and representing specific image features through the attention mechanism [48]. The overall process of the CBAM module can be divided into two parts, as shown in Figure 10.
Channel attention is mainly used to capture the correlation between different channels [49]. This module first obtained two 1 × 1 × C descriptors through global average pooling and global maximum pooling operations. Each descriptor was then mapped and activated by a shared two-layer neural network, in which the number of neurons in the first layer is C/r with a ReLU activation and the number of neurons in the second layer is C. The two resulting features were summed and passed through the Sigmoid activation function to obtain the weight coefficients, $M_C$, which were multiplied with the input feature map to obtain the output with an enhanced channel feature representation. Denoting the H × W × C input feature map by F, the output feature formula is shown below:
$$M_C(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F_{avg}^{C})) + W_1(W_0(F_{max}^{C}))\big)$$
In Equation (3), $\sigma$ denotes the Sigmoid function, MLP denotes the multilayer perceptron, AvgPool denotes the average pooling layer, MaxPool denotes the maximum pooling layer, $W_0$ and $W_1$ denote the two weight matrices, $F_{avg}$ and $F_{max}$ denote the results of the input data after the AvgPool and MaxPool operations, and the superscript $C$ stands for Channel, indicating average pooling and maximum pooling in the channel dimension.
Spatial attention is mainly used to capture the correlation between different locations on the feature map [50]. This module first performed average pooling and maximum pooling along the channel dimension to obtain two H × W × 1 maps, which were concatenated and passed through a 7 × 7 convolutional layer followed by the Sigmoid activation function to obtain the weight coefficients, $M_S$. These weights were multiplied element-wise with the input feature map to obtain the output with an enhanced spatial feature representation [51]. Denoting the H × W × C input feature map by F, the output feature formula is shown below:
$$M_S(F) = \sigma\big(f^{7\times7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{7\times7}([F_{avg}^{S}; F_{max}^{S}])\big)$$
In Equation (4), $\sigma$ denotes the Sigmoid function, AvgPool denotes the average pooling layer, MaxPool denotes the maximum pooling layer, $f^{7\times7}$ denotes a convolution kernel of size 7 × 7, $F_{avg}$ and $F_{max}$ denote the results of the input data after the AvgPool and MaxPool operations, and the superscript $S$ stands for Spatial, indicating average pooling and maximum pooling in the spatial dimension.
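The following PyTorch sketch implements the channel and spatial attention of Equations (3) and (4); the reduction ratio r and other details not stated in the text are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        # Shared two-layer MLP: C -> C/r (ReLU) -> C.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))                 # global max pooling
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)   # M_C(F)

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                 # channel-wise average pooling
        mx = x.amax(dim=1, keepdim=True)                  # channel-wise max pooling
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_S(F)

class CBAM(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.ca = ChannelAttention(channels, r)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)      # refine channels first
        return x * self.sa(x)   # then refine spatial locations
```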

4.2. Classification Network

4.2.1. Fully Connected Layer

The classification task was implemented by a fully connected layer that integrates the local features obtained after the convolution operations through a weight matrix. In the eye-tracking direction classification task of this paper, the result must indicate whether the gaze is forward, left, or right, so this is a triple-classification problem. The output of the feature extraction network in this paper was 2 × 2 × 4, and the fully connected layer had 3 neurons. First, the output of the feature extraction network was flattened into a one-dimensional column vector, $x = [x_1, x_2, \ldots, x_{15}, x_{16}]^T$; then, for each neuron of the fully connected layer, $Z = [Z_1, Z_2, Z_3]^T$, a linear operation was performed with each element of $x$. In the forward-propagation process, the fully connected layer can be viewed as a linear weighted summation: each node in the previous layer is multiplied by a weighting factor $w$, and a bias $b$ is added to obtain the corresponding output $z$ of the fully connected layer. The computational process of the fully connected layer can be expressed using Equation (5):
$$\begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1,16} \\ w_{21} & w_{22} & \cdots & w_{2,16} \\ w_{31} & w_{32} & \cdots & w_{3,16} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{16} \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}$$
The output was normalized using the SoftMax function, which maps a vector into a probability distribution in which each element is non-negative and all elements sum to 1. Thus, for a multiclass problem with $n$ classes, the SoftMax function converts an $n$-dimensional vector into a probability distribution in which each element represents the predicted probability of that class. The neuron vector $Z = [Z_1, Z_2, Z_3]^T$ of the fully connected layer was normalized to $y = [y_1, y_2, y_3]^T$ by the SoftMax function under the constraint $y_1 + y_2 + y_3 = 1$. The transformation relation between $y_i$ and $Z_j$ is:
$$y_i = \frac{e^{Z_i}}{\sum_{j=1}^{3} e^{Z_j}}$$
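A minimal sketch of this classification head is shown below: the 2 × 2 × 4 feature map is flattened to 16 values, mapped to 3 logits by a fully connected layer as in Equation (5), and normalized by SoftMax; the dummy input is for illustration only.

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),        # 2 x 2 x 4 feature map -> 16 values
    nn.Linear(16, 3),    # z = Wx + b, with W of size 3x16 and b 3-dimensional
)

features = torch.randn(1, 4, 2, 2)           # dummy feature map from the extractor
logits = classifier(features)                # z_1, z_2, z_3
probs = torch.softmax(logits, dim=1)         # y_1 + y_2 + y_3 = 1
direction = ["left", "forward", "right"][probs.argmax(dim=1).item()]
```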

4.2.2. Cross-Entropy Loss Function

During the training process, we want the output probability distribution of the model to be as close as possible to the true probability distribution, so a suitable loss function is needed to measure the difference between them. The cross-entropy function can effectively measure the difference between two probability distributions and is, therefore, widely used in classification problems [52]. In particular, if the true label of a sample $x$ is $y$ and the predicted output of the SoftMax model is $\hat{y}$, then the cross-entropy loss function can be expressed as:
$$\mathrm{Loss}(y, \hat{y}) = -\sum_i y_i \ln \hat{y}_i$$
This loss function can be regarded as the Kullback–Leibler (KL) divergence between the true and predicted label distributions [53]. The cross-entropy loss function attains its minimum value of 0 when the output probability distribution of the model is exactly the same as the true probability distribution; as the difference between them increases, the value of the cross-entropy loss function also increases. Since the SoftMax function maps the $y_i$ values to between 0 and 1, and according to the constraint $\sum_i y_i = 1$, it can be deduced that:
$$\frac{e^{z_i}}{\sum_{j=1}^{3} e^{z_j}} = 1 - \frac{\sum_{j \neq i} e^{z_j}}{\sum_{j=1}^{3} e^{z_j}}$$
When the true label of the sample is class $i$ (i.e., $y_i = 1$), the loss function reduces to:
$$\mathrm{Loss}_i(y, \hat{y}) = -\ln y_i$$
The derivative of $\mathrm{Loss}_i(y, \hat{y})$ with respect to $z_i$ is obtained as follows:
$$\frac{\partial \mathrm{Loss}_i}{\partial z_i} = -\frac{\partial \ln y_i}{\partial z_i} = -\frac{\partial}{\partial z_i}\ln\frac{e^{z_i}}{\sum_j e^{z_j}} = -\frac{\sum_j e^{z_j}}{e^{z_i}}\cdot\frac{\partial}{\partial z_i}\frac{e^{z_i}}{\sum_j e^{z_j}} = -\frac{\sum_j e^{z_j}}{e^{z_i}}\cdot\frac{e^{z_i}\sum_{j\neq i} e^{z_j}}{\left(\sum_j e^{z_j}\right)^2} = -\frac{\sum_{j\neq i} e^{z_j}}{\sum_j e^{z_j}} = y_i - 1$$
From this derivation, it can be seen that the combined use of the cross-entropy loss function and SoftMax for the triple-classification task makes the gradient in backpropagation very easy to calculate: the gradient for the backward update is obtained simply by taking the $y_i - 1$ already computed in the forward pass.
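As a quick illustration (not part of the paper), the following snippet numerically confirms that the gradient of the combined SoftMax and cross-entropy loss with respect to the true-class logit equals $y_i - 1$.

```python
import torch
import torch.nn.functional as F

z = torch.tensor([1.5, -0.3, 0.2], requires_grad=True)   # arbitrary logits
true_class = 0
loss = F.cross_entropy(z.unsqueeze(0), torch.tensor([true_class]))
loss.backward()

y = torch.softmax(z.detach(), dim=0)
# The two printed values match: autograd gradient vs. the closed form y_i - 1.
print(z.grad[true_class].item(), (y[true_class] - 1).item())
```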

5. The Design of the Eye-Tracking Wheelchair Control System

In the previous section, the eye-tracking recognition algorithm was investigated. A complete eye-tracking wheelchair control system design solution should also include data acquisition, data processing, motion control, human–machine interaction, and system optimization. The physical diagram and hardware data diagram of the wheelchair designed in this paper are shown in Figure 11.
According to the role of the hardware in the system, it can be divided into three sections: data acquisition, data processing, and motion control. The data acquisition section consists of a Gook HD98 HD camera with a resolution of 1920 × 1080 and a 10-inch touchscreen; the camera captures face images, and the touchscreen displays eye-tracking information in real time. The data processing section consists of a Jetson TX2, which has 256 CUDA cores and up to 8 GB of memory. Its computing power is comparable to that of a desktop-class GTX 750 graphics card, and it can easily handle the computational task of eye-tracking recognition. The motion control section is composed of an Arduino and MG995 servos. The Arduino receives the eye-tracking signal and drives the MG995 servos to move the rocker and change the wheelchair motion.
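As an illustration only, the host side of this link could look like the following sketch; the serial port name, baud rate, and one-byte command protocol are assumptions rather than the interface actually used in this paper.

```python
import serial  # pyserial

# Hypothetical one-byte command protocol understood by the Arduino firmware.
COMMANDS = {"left": b"L", "forward": b"F", "right": b"R", "stop": b"S"}

# Assumed port name and baud rate for the TX2-to-Arduino serial link.
link = serial.Serial("/dev/ttyUSB0", 9600, timeout=0.1)

def send_direction(direction: str) -> None:
    """Send a single-byte motion command that the Arduino maps to servo angles."""
    link.write(COMMANDS.get(direction, COMMANDS["stop"]))
```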
Common power wheelchair modifications are adjustments to the hardware of an existing wheelchair. Changes to the wheelchair motion can be achieved simply by controlling the rocker during wheelchair motion. Therefore, in this paper, a mechanical structure (shown in Figure 12) was installed on the wheelchair rocker controller. The structure consists of a base, a servo bracket, a control rod extension, and a control arm. The two servos are the x-axis and y-axis servos, which cooperate to turn in different directions according to the control command and thereby steer the electric wheelchair rocker, controlling the direction of wheelchair movement.
On the host computer (TX2), we designed the intelligent eye-tracking wheelchair control platform using PyQt5. The platform contains a data acquisition section and a wheelchair control section. The data acquisition section collects face and human eye images in real time. The wheelchair control section contains a wheelchair start button, a wheelchair motion direction display, camera selection, and a face count display. The platform makes it easier to activate the wheelchair and monitor its movement. The interactive interface is shown in Figure 13.

5.1. Motion Control Optimization

During wheelchair steering, the motion path of the rocker has an important impact on the acceleration change of the system because of the non-linear relationship between the left and right wheel speeds and the rocker angle, as shown in Figure 14. In order to make the wheelchair motion change smoothly and reduce the risk of vibration or even rollover caused by sudden changes in acceleration, this section presents our research on rocker trajectory-tracking control.
In order to establish the rocker control model, we introduced a polar coordinate system in the rocker motion plane, specifying the distance from the rocker to its neutral (reset) position as $r$ and the angle between the projection of the rocker on the plane and the positive (leftward) reference axis as $\theta$. The coordinate diagram is shown in Figure 15.
Since the speed of the left and right wheels is symmetrical with respect to the change in the rocker angle, and the speed of the right wheel is almost constant over the interval 0°–90°, it is only necessary to analyze the velocity change curve of the left wheel over 0°–90°. The acceleration $a$ of the left wheel is related to the velocity $V_w$ and the rocker angle $\theta$ in polar coordinates as follows:
$$a = \frac{dV_w}{d\theta}\cdot\frac{d\theta}{dt}$$
In the range of 0°–90°, $\frac{dV_w}{d\theta}$ continuously decreases as $\theta$ increases. In order to make $a$ change more smoothly, it is necessary to make $\frac{d\theta}{dt}$ continuously increase as $\theta$ increases. Accordingly, the $\theta$–$t$ curve shown in Figure 16 was drawn.
From:
$$\theta(t) = \int_0^t \omega(t)\,dt, \qquad \omega(t) = \frac{V_R}{r(t)}$$
it follows that:
$$r(t) = \frac{V_R}{\dot{\theta}(t)}$$
where $\omega(t)$ is the rocker angular velocity and $V_R$ is the rocker velocity, which can be considered constant in magnitude. Then, $r(t)$ can be obtained according to Equation (13) once $\theta(t)$ has been derived, and the rocker motion trajectory can be determined accordingly.
To simplify the analysis, $V_w(\theta)$ and $\theta(t)$ can be expressed as Equations (14) and (15):
$$V_w(\theta) = a_1\theta^2 + b_1\theta, \qquad \theta < 90^\circ$$
$$\theta(t) = a_2 t^2 + b_2 t, \qquad t < T_e$$
where $a_1$, $b_1$, and $T_e$ are known, $T_e$ denotes the time required for the rocker to move from the starting position to the end position, and $\theta(T_e) = \pi/2$.
Then, substituting Equations (14) and (15) into (11) yields:
$$a(t) = \left[2a_1\left(a_2 t^2 + b_2 t\right) + b_1\right]\left(2a_2 t + b_2\right)$$
The simplification yields:
$$a(t) = A(t - t_1)(t - t_2)(t - t_3)$$
with the following constraints:
$$a(0) = 0, \qquad a(T_e) = 0, \qquad \int_0^{T_e} a(t)\,dt = V_e$$
Substituting Equation (17) into (18) yields:
$$a(t) = \frac{12 V_e}{2 t_3 T_e^3 - T_e^4}\, t\,(t - T_e)(t - t_3)$$
where $T_e$ indicates the time taken for the rocker to move from 0° to 90°. According to the characteristics of the cubic function, it is necessary to satisfy $t_3 > T_e$ in order to make the acceleration change more smoothly over $[0, T_e]$. Matlab can then be used to plot the curve of $a$ against time $t$ as $t_3$ changes.
From Figure 17, it can be seen that the curve of $a(t)$ tended to flatten out as $t_3$ increased from $T_e$, and its maximum value tended to $1.5 V_e / T_e$. Accordingly, $\theta(t)$ and $r(t)$ under the target conditions could be derived, and the rocker motion trajectory could then be obtained from them.
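In place of the Matlab plot mentioned above, the following Python sketch plots the acceleration expression derived above for several values of $t_3$; the values of $V_e$ and $T_e$ are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

Ve, Te = 1.0, 2.0                      # placeholder target speed (m/s) and rocker travel time (s)
t = np.linspace(0, Te, 200)

for t3 in (1.2 * Te, 2 * Te, 5 * Te, 20 * Te):
    # Acceleration profile with roots at 0, Te, and t3, normalized so its integral is Ve.
    a = 12 * Ve / (2 * t3 * Te**3 - Te**4) * t * (t - Te) * (t - t3)
    plt.plot(t, a, label=f"t3 = {t3 / Te:.1f} Te")

plt.axhline(1.5 * Ve / Te, linestyle="--", label="asymptotic max 1.5 Ve / Te")
plt.xlabel("t (s)")
plt.ylabel("a(t)")
plt.legend()
plt.show()
```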

5.2. System Flow

5.2.1. Blink Detection

The face-detection algorithm, which uses the Dlib library with OpenCV image processing to obtain periocular feature point data, was described in Section 3.2.2, and the eye-tracking estimation model was built in Section 4. Taking the left eye as an example, the outline of the eye was first located in the eye image, and points p1–p6 were used to represent the key points of the eye, as shown in Figure 18. In actual use, the system needs to detect whether the human eye is open before performing line-of-sight estimation. The eye opening and closing states can be determined by calculating the eye aspect ratio (EAR) [54] in real time and comparing it with a set threshold; there is no need to enter the line-of-sight estimation procedure if the eye opening does not meet the EAR threshold.
The respective EAR values of the left and right eyes were calculated, and finally, the average EAR value of the left and right eyes was obtained. When the eyes were open, EAR maintained a constant value; when the eyes were closed, the EAR value tended to 0. If the calculated EAR value was less than the set threshold, it was determined to be a blink action. The EAR calculation formula is shown below:
$$EAR_{left} = \frac{\lVert p_2 - p_6\rVert + \lVert p_3 - p_5\rVert}{2\lVert p_1 - p_4\rVert}$$
$$EAR = \frac{EAR_{left} + EAR_{right}}{2}$$
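A minimal sketch of this blink check is given below; the landmark ordering follows the Dlib 68-point model, and the EAR threshold value is an assumption.

```python
import numpy as np

EAR_THRESHOLD = 0.2  # assumed value; below this the eye is treated as closed

def eye_aspect_ratio(p):
    """p is a (6, 2) array of one eye's landmarks ordered p1..p6."""
    vertical = np.linalg.norm(p[1] - p[5]) + np.linalg.norm(p[2] - p[4])
    horizontal = 2.0 * np.linalg.norm(p[0] - p[3])
    return vertical / horizontal

def is_blinking(left_pts, right_pts):
    # Average the two eyes' EAR values and compare with the threshold.
    ear = (eye_aspect_ratio(left_pts) + eye_aspect_ratio(right_pts)) / 2.0
    return ear < EAR_THRESHOLD
```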

5.2.2. Saccades Processing

In addition to the impact that blinking can have on wheelchair control, the user's saccadic (rapid gaze-shifting) behavior in response to unexpected events may also affect the control of the wheelchair and the safety of the user. If the user rapidly changes the direction of gaze in a short period of time, the wheelchair may continuously receive movement commands in different directions, causing it to swing from side to side. To reduce this risk, we used, according to the majority principle, the result that occurred four or more times out of every eight classification results as the motion control command; if no result reached a majority among the eight classifications, the wheelchair motion stopped. Since the network model is capable of generating approximately 16 classification results per second, this treatment has little impact on wheelchair maneuverability.
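The following sketch illustrates the described 4-of-8 majority filter; the behavior while the window is still filling is an assumption.

```python
from collections import Counter, deque

class SaccadeFilter:
    """Issue a direction only if it appears at least four times in the last eight
    classification results; otherwise command the wheelchair to stop."""
    def __init__(self, window=8, min_votes=4):
        self.window = deque(maxlen=window)
        self.min_votes = min_votes

    def update(self, prediction: str) -> str:
        self.window.append(prediction)
        if len(self.window) < self.window.maxlen:
            return "stop"                         # not enough evidence yet (assumption)
        label, votes = Counter(self.window).most_common(1)[0]
        return label if votes >= self.min_votes else "stop"
```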

5.2.3. System Flow Chart

Based on the workflow of the eye-tracking wheelchair analyzed in the previous sections, the system flowchart of the eye-tracking wheelchair is presented in Figure 19. The control program begins on the host computer, which automatically turns on the camera in front of the wheelchair; the camera then starts to extract facial features and collect eye-movement data. The Jetson TX2 calculates the EAR of each frame of the acquired eye-movement data. If the EAR does not exceed the threshold, the eye-movement data are reacquired; if the threshold is exceeded, the frame is fed into the GazeNet network model to calculate the eye-tracking direction, and the classification result is output to the motion controller, the Arduino. The latter manipulates the rocker to control the wheelchair movement based on the eye-movement signal. If the wheelchair is turned off at this point, the system stops working; otherwise, it returns to the eye-movement data acquisition step and continues to calculate and evaluate the EAR of the eye-movement data, and so on.

6. Experiments and Results Analysis

In the previous sections, we completed the establishment of the model and the construction of the system. In order to verify the practicality and reliability of the eye-tracking wheelchair, in this section, we describe the test experiments conducted on the accuracy of the model recognition results and the accuracy of the wheelchair control. The first experiment was a comparative evaluation of the GazeNet proposed in this paper against three existing models, namely AlexNet, ResNet18, and MobileNet-V2, to demonstrate the superiority of the model used in this paper for eye-tracking recognition. The second experiment quantified the wheelchair control accuracy by testing the deviation of the actual wheelchair motion trajectory from the target path, and the third experiment tested the effect of the motion control optimization in Section 5.1 by measuring the Arduino output PWM signal.

6.1. Evaluation of GazeNet’s Effectiveness

6.1.1. Hyperparameter Optimization

In the training and testing of the GazeNet network, each image in $X$ was subjected to the preprocessing operation described in Section 3.2.2 to obtain $\bar{X}$, which we named the multi-environment attentional gaze (MEAGaze) dataset. The 135,000 binocular images in $\bar{X}$ were divided into a training set, validation set, and test set in the ratio of 98:1:1 and put into the network for training. The initial parameters were set with random initialization, the learning rate was 0.001, the optimizer was SGD (stochastic gradient descent) [55], the loss function was cross-entropy, and accuracy was used as the main evaluation index of model performance. The accuracy rate was calculated as follows:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP is true positive, TN is true negative, FP is false positive, and FN is false negative.
As shown in Figure 20, the model was trained for a total of 30 epochs. After 18 epochs, the accuracy curve gradually leveled off and no longer improved significantly, which indicated that the training of the classification network was complete. In this paper, we chose to save the weight parameters of the model at the 29th epoch, for which the accuracy was 0.98494.
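For reference, a minimal training-loop sketch matching the stated setup (SGD, learning rate 0.001, cross-entropy loss, 30 epochs) is given below; GazeNet and the MEAGaze data loaders are assumed to be defined elsewhere.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=30, lr=0.001, device="cuda"):
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            criterion(model(images), labels).backward()
            optimizer.step()
        # Validation accuracy = (TP + TN) / (TP + TN + FP + FN), i.e., correct / total.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for images, labels in val_loader:
                preds = model(images.to(device)).argmax(dim=1).cpu()
                correct += (preds == labels).sum().item()
                total += labels.numel()
        print(f"epoch {epoch + 1}: val accuracy = {correct / total:.5f}")
```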

6.1.2. Assessment Measures and Methods

The GazeNet network described above was trained on the MEAGaze dataset using the same training set as the widely used AlexNet [56], ResNet18 [57], and MobileNet-V2 [58] models and tested on the same test set. The test results are shown in Table 1 and Figure 21. The GazeNet network in this paper achieved an accuracy of 98.49% and used the smallest number of parameters, only 125,749, which was significantly better than the other three models. The comparison showed that the GazeNet network was about 1 percentage point more accurate than the ResNet18 model, which had the second-highest accuracy, while using only 22.3% of the parameters of the AlexNet model, which had the second-lowest parameter count. The GazeNet network thus combines a minimal number of parameters with excellent accuracy, making it an ideal lightweight model that meets the needs of both high accuracy and low resource consumption.

6.2. Reliability Analysis of Eye-Tracking Wheelchair Control

The comparative experiments in the previous section demonstrated the high recognition rate of the eye-tracking model used in this paper. In order to verify the reliability and control accuracy of the eye-tracking wheelchair, we launched experiments related to the eye-tracking wheelchair in specific scenarios.
The experimental route diagram is shown in Figure 22. The solid part represents the target path of the wheelchair center, and the dashed part is the target path of both wheels. Since the width of the wheelchair we used was 60 cm, the dashed path spacing was 60 cm. Each experiment consisted of two laps along the line: the yellow line in Figure 22a is the target path of the first lap, the blue line in Figure 22c is the target path of the second lap, and Figure 22b is the complete target line diagram combining the two, where the green line represents the overlapping part of the two laps. The start and end points of both laps are the red points in Figure 22. This design ensured that each experiment included an equal number of left and right turns (eight each), making the experimental data more meaningful.
To make the experimental data generalizable, we recruited 20 volunteers aged 20–35 years to participate in the experiment. These 20 volunteers were 50% men and 50% women and wore glasses. Each volunteer was trained to operate the wheelchair while keeping the head still to complete the experiment, during which the speed of the wheelchair was set to 1 m/s. We installed a laser distance measurement module on each side of the wheelchair, as shown in Figure 23, which had a range of up to 80 m, a measurement accuracy of 1.0 mm, and a measurement frequency of 20 Hz. It measured and recorded the distance to the wall in real time, from which the deviation of the wheelchair center from the target path was obtained. The deviation values along the movement distance were summed over all experiments and divided by the number of participants to obtain the curve of deviation versus movement distance shown in Figure 24. The fluctuating parts of this figure correspond to the deviation of the wheelchair during turns, and the maximum deviation of 6.76 cm appeared at the 8 m mark of the second lap. It can be seen that the deviation during wheelchair movement was small, the accuracy of the model and control was high, and the eye-tracking wheelchair we designed has high practicality.
In Section 5.1, we optimized the rocker control. To verify the effect of this work on the stability of the wheelchair control, we had the experimenter sit on the motorized wheelchair, disconnected the motor power, and connected the Arduino output pins to an oscilloscope. The experimenter simulated the eye-movement states of the previous experiment while keeping the head still. The average curve of the servo duty cycle over time was obtained by superimposing the duty-cycle curves recorded for the 20 volunteers during the experiment and dividing by the number of participants, as shown in Figure 25.
As can be seen from Figure 25, the control of the servo was smooth and free of noise, which also indicates the high accuracy of the neural network's triple classification and the absence of jumps between output classes.

7. Conclusions

We collected an eye-movement dataset with 135,000 annotated images through virtual and real scenes and proposed a GazeNet eye-movement neural network model based on the three-category dataset. The model comparison experiments showed that the GazeNet model proposed in this paper had a faster convergence speed and higher accuracy than the other three models. In terms of wheelchair control, we used a two-axis servo mechanism to control the joystick and optimized the servo control signal. The subsequent Arduino output waveform experiment showed that the servo control signal was smooth and gentle, and the optimization results were good. In addition, the experiments also proved that the three-category output of the model was accurate, and there was no case of jumping between eye-movement categories. In the wheelchair control reliability analysis, we obtained the deviation between the target and the actual trajectory during the movement process through laser ranging and concluded that the accuracy of the motion control and the eye-movement model was high.
However, the current motion control part is relatively complicated and leaves limited room for further optimization. If the signal of the Jetson TX2 could be output directly to control the wheelchair, the motion control could be made simpler and more reliable. In addition, the eye-movement dataset was collected in specific scenes and did not fully consider applications in daily-life situations, such as crossing the road or rainy days. After the experiment, many volunteers reported that continuous eye-movement control increased the burden on users, which reduced the accuracy of control to a certain extent and increased the risk for users. Considering that the current wheelchair control scheme lacks adaptability to the environment, we expect to add visual SLAM and path planning [59] to the control part in follow-up work. In that case, the direction of sight can be controlled freely, and the wheelchair can autonomously perceive the surrounding environment and make adjustments to reach the destination. For safety reasons, we also plan to add a positioning system to the wheelchair, so that when the user operates the wheelchair, family members can remotely track the user's specific location through a mobile phone application.

Author Contributions

J.X., writing—review and editing, funding acquisition; Z.H., methodology, supervision; L.L., software, formal analysis; X.L., visualization; K.W., data curation. All authors have read and agreed to the published version of the manuscript.

Funding

Heilongjiang Provincial Education Department Scientific Research Fund Project “Intelligent Eye-Movement Wheelchair Based on Deep Learning”, Project No. S202210214067.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of the Harbin University of Science and Technology (project identification code: R4-12-2, approval code: 56).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The research data is available at: https://github.com/crzzx1/dataset (accessed on 1 May 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Callupe Luna, J.; Martinez Rocha, J.; Monacelli, E.; Foggea, G.; Hirata, Y.; Delaplace, S. WISP, Wearable Inertial Sensor for Online Wheelchair Propulsion Detection. Sensors 2022, 22, 4221. [Google Scholar] [CrossRef] [PubMed]
  2. Li, Y. Hand gesture recognition using kinect. In Proceedings of the International Conference on Software Engineering and Service Science, Beijing, China, 22–24 June 2012; IEEE Computer Society: Washington, DC, USA, 2011; pp. 196–199. [Google Scholar]
  3. Adebayo, O.O.; Adetiba, E.; Ajayi, O.T. Hand Gesture Recognition-Based Control of Motorized Wheelchair using Electromyography Sensors and Recurrent Neural Network. In Proceedings of the International Conference on Engineering for Sustainable World (ICESW 2020), Ota, Nigeria, 10–14 August 2020; 1107. [Google Scholar]
  4. Nasare, R.K.; Yenurkar, G.K. Hand gesture based navigation control for automated wheelchair. Int. J. Latest Trends Eng. Technol. 2016, 8, 230–238. [Google Scholar]
  5. Ashley, S.; Jaydip, D. Hand Gesture-based Artificial Neural Network Trained Hybrid Human–machine Interface System to Navigate a Powered Wheelchair. J. Bionic Eng. 2021, 18, 1045–1058. [Google Scholar]
  6. Iskanderani, A.I.; Tamim, F.R.; Rana, M.M.; Ahmed, W.; Mehedi, I.M.; Aljohani, A.J.; Latif, A.; Shaikh, S.A.; Shorfuzzaman, M.; Akther, F.; et al. Voice Controlled Artificial Intelligent Smart Wheelchair. In Proceedings of the 2020 8th International Conference on Intelligent and Advanced Systems (ICIAS), Kuching, Malaysia, 13–15 July 2021; pp. 1–5. [Google Scholar] [CrossRef]
  7. Jayakody, A.; Nawarathna, A.; Wijesinghe, I.; Liyanage, S.; Dissanayake, J. Smart Wheelchair to Facilitate Disabled Individuals. In Proceedings of the 2019 International Conference on Advancements in Computing (ICAC), Malabe, Sri Lanka, 5–7 December 2019; pp. 249–254. [Google Scholar] [CrossRef]
  8. Karim, A.B.; Haq, A.; Noor, A.; Khan, B.; Hussain, Z. Raspberry Pi Based Voice Controlled Smart Wheelchair. In Proceedings of the 2022 International Conference on Emerging Trends in Smart Technologies (ICETST), Karachi, Pakistan, 23–24 September 2022; pp. 1–5. [Google Scholar]
  9. Cao, W.; Yu, H.; Wu, X.; Li, S.; Meng, Q.; Chen, C. Voice controlled wheelchair integration rehabilitation training and posture transformation for people with lower limb motor dysfunction. Technol. Health Care Off. J. Eur. Soc. Eng. Med. 2020, 29, 609–614. [Google Scholar] [CrossRef] [PubMed]
  10. Abdulghani, M.M.; et al. Wheelchair Neuro Fuzzy Control and Tracking System Based on Voice Recognition. Sensors 2020, 20, 2872. [Google Scholar]
  11. Cojocaru, D.; Manta, L.F.; Vladu, I.C.; Dragomir, A.; Mariniuc, A.M. Using an Eye Gaze New Combined Approach to Control a Wheelchair Movement. In Proceedings of the 2019 23rd International Conference on System Theory, Control and Computing (ICSTCC), Sinaia, Romania, 9–11 October 2019; pp. 626–631. [Google Scholar] [CrossRef]
  12. Wanluk, N.; Visitsattapongse, S.; Juhong, A.; Pintavirooj, C. Smart wheelchair based on eye tracking. In Proceedings of the 2016 9th Biomedical Engineering International Conference (BMEiCON), Laung Prabang, Laos, 7–9 December 2016; pp. 1–4. [Google Scholar] [CrossRef]
  13. Luo, W.; Cao, J.; Ishikawa, K.; Ju, D. A Human-Computer Control System Based on Intelligent Recognition of Eye Movements and Its Application in Wheelchair Driving. Multimodal Technol. Interact. 2021, 5, 50. [Google Scholar] [CrossRef]
  14. Wästlund, E.; Sponseller, K.; Pettersson, O.; Bared, A. Evaluating gaze-driven power wheelchair with navigation support for persons with disabilities. J. Rehabil. Res. Dev. 2015, 52, 815–826. [Google Scholar] [CrossRef]
  15. Bai, D.; Liu, Z.; Hu, Q.; Yang, J.; Yang, G.; Ni, C.; Yang, D.; Zhou, L. Design of an eye movement-controlled wheelchair using Kalman filter algorithm. In Proceedings of the 2016 IEEE International Conference on Information and Automation (ICIA), Ningbo, China, 1–3 August 2016; IEEE: Piscateville, NJ, USA, 2017. [Google Scholar]
  16. Gautam, G.; Sumanth, G.; Karthikeyan, K.C.; Sundar, S.; Venkataraman, D. Eye movement based electronic wheel chair for physically challenged persons. Int. J. Sci. Technol. Res. 2014, 3, 206–212. [Google Scholar]
  17. Antoniou, E.; Bozios, P.; Christou, V.; Tzimourta, K.D.; Kalafatakis, K.; Tsipouras, M.G.; Giannakeas, N.; Tzallas, A.T. EEG-Based Eye Movement Recognition Using Brain–Computer Interface and Random Forests. Sensors 2021, 21, 2339. [Google Scholar] [CrossRef]
  18. Rao, E.V.K.; Reddy, N.Y.; Greeshma, B.V.S.S.; Reddy, Y.S.S.V. EEG Based Smart Wheelchair For Disabled Persons Using Non-Invasive BCI. In Proceedings of the 2022 International Conference on Computational Intelligence and Sustainable Engineering Solutions (CISES), Greater Noida, India, 20–21 May 2022; pp. 440–446. [Google Scholar] [CrossRef]
  19. Ngo, B.V.; Nguyen, T.H.; Tran, D.K.; Vo, D.D. Control of a Smart Electric Wheelchair Based on EEG Signal and Graphical User Interface for Disabled People. In Proceedings of the 2021 International Conference on System Science and Engineering (ICSSE), Ho Chi Minh City, Vietnam, 26–28 August 2021; pp. 257–262. [Google Scholar] [CrossRef]
  20. Joshi, K.; Soni, P.; Joshi, S.; Vyas, A.; Joshi, R. Cognitive-Chair: AI based advanced Brain Sensing Wheelchair for Paraplegic/Quadriplegic people. In Proceedings of the 2022 4th International Conference on Artificial Intelligence and Speech Technology (AIST), Delhi, India, 9–10 December 2022; pp. 1–6. [Google Scholar] [CrossRef]
  21. Fadheel, B.A.; Mahdi, A.J.; Jaafar, H.F.; Nazir, M.S.; Obaid, M.S.; Musa, S.H. Speed Control of a Wheelchair Prototype Driven by a DC Motor Through Real EEG Brain Signals. In Proceedings of the 3rd International Conference on Engineering Sciences, Kerbala, Iraq, 4–6 November 2019; Volume 671, p. 012036. [Google Scholar]
  22. Ferracuti, F.; Freddi, A.; Iarlori, S.; Longhi, S.; Monteriù, A.; Porcaro, C. Augmenting robot intelligence via EEG signals to avoid trajectory planning mistakes of a smart wheelchair. J. Ambient. Intell. Humaniz. Comput. 2021, 14, 223–235. [Google Scholar] [CrossRef]
  23. Griss, P.; Enoksson, P.; Tolvanen-Laakso, H.K.; Merilainen, P.; Ollmar, S.; Stemme, G. Micromachined electrodes for biopotential measurements. J. Microelectromechanical Syst. 2001, 10, 10–16. [Google Scholar] [CrossRef]
  24. Li, X.; Wee, W.G. An efficient method for eye tracking and eye-gazed FOV estimation. In Proceedings of the 2009 16th IEEE International Conference on Image Processing (ICIP), Cairo, Egypt, 7–10 November 2009; pp. 2597–2600. [Google Scholar] [CrossRef]
  25. Taher, F.B.; Amor, N.B.; Jallouli, M. A multimodal wheelchair control system based on EEG signals and Eye tracking fusion. In Proceedings of the 2015 International Symposium on Innovations in Intelligent SysTems and Applications (INISTA), Madrid, Spain, 2–4 September 2015; pp. 1–8. [Google Scholar] [CrossRef]
  26. Çetintaş, D.; Firat, T.T. Eye-Tracking Analysis with Deep Learning Method. In Proceedings of the 2021 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT), Zallaq, Bahrain, 29–30 September 2021; pp. 512–515. [Google Scholar] [CrossRef]
  27. Mokatren, M.; Kuflik, T.; Shimshoni, I. 3D Gaze Estimation Using RGB-IR Cameras. Sensors 2023, 23, 381. [Google Scholar] [CrossRef] [PubMed]
  28. Hu, L.; Gao, J. Research on real-time distance measurement of mobile eye tracking system based on neural network. In Proceedings of the 2020 IEEE 5th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 12–14 June 2020; pp. 1561–1565. [Google Scholar] [CrossRef]
  29. Dragusin, D.; Baritz, M.I. Development of a System for Correlating Ocular Biosignals to Achieve the Movement of a Wheelchair. In Proceedings of the 2020 International Conference on e-Health and Bioengineering (EHB), Iasi, Romania, 29–30 October 2020; pp. 1–4. [Google Scholar] [CrossRef]
  30. Solea, R.; Margarit, A.; Cernega, D.; Serbencu, A. Head Movement Control of Powered Wheelchair. In Proceedings of the 2019 23rd International Conference on System Theory, Control and Computing (ICSTCC), Sinaia, Romania, 9–11 October 2019; pp. 632–637. [Google Scholar] [CrossRef]
  31. Juhong, A.; Treebupachatsakul, T.; Pintavirooj, C. Smart eye-tracking system. In Proceedings of the 2018 International Workshop on Advanced Image Technology (IWAIT), Chiang Mai, Thailand, 7–9 January 2018; pp. 1–4. [Google Scholar] [CrossRef]
  32. Higa, S.; Yamada, K.; Kamisato, S. Intelligent Eye-Controlled Electric Wheelchair Based on Estimating Visual Intentions Using One-Dimensional Convolutional Neural Network and Long Short-Term Memory. Sensors 2023, 23, 4028. [Google Scholar] [CrossRef] [PubMed]
  33. Fuhl, W.; Kasneci, G.; Kasneci, E. TEyeD: Over 20 Million Real-World Eye Images with Pupil, Eyelid, and Iris 2D and 3D Segmentations, 2D and 3D Landmarks, 3D Eyeball, Gaze Vector, and Eye Movement Types. In Proceedings of the 2021 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Bari, Italy, 4–8 October 2021; pp. 367–375. [Google Scholar] [CrossRef]
  34. Fusek, R.; Sojka, E. Iris Center Localization Using Geodesic Distance and CNN. In Proceedings of the 9th Iberian Conference on Pattern Recognition and Image Analysis, IbPRIA 2019, Madrid, Spain, 1–4 July 2019; pp. 76–85. Available online: http://mrl.cs.vsb.cz/eyedataset (accessed on 1 January 2023).
  35. Constable, P.A.; Bach, M.; Frishman, L.J.; Jeffrey, B.G.; Robson, A.G. ISCEV Standard for Clinical Electro-oculography (EOG) 2019. Doc. Ophthalmol. 2019, 113, 205–212. [Google Scholar]
  36. Robinson, D.A. A Method of Measuring Eye Movement Using a Scleral Search Coil in a Magnetic Field. IEEE Trans. Bio-Med. Electron. 1963, 10, 137–145. [Google Scholar]
  37. Ohno, T.; Mukawa, N.; Yoshikawa, A. FreeGaze: A Gaze Tracking System for Everyday Gaze Interaction. In Proceedings of the Eye Tracking Research & Application Symposium, Warsaw, Poland, 14–17 June 2018; pp. 125–132. [Google Scholar]
  38. Coutinho, F.L.; Morimoto, C.H. Augmenting the robustness of cross-ratio gaze tracking methods to head movement. In Symposium on Eye Tracking Research & Applications; ACM: New York, NY, USA, 2019; pp. 59–66. [Google Scholar]
  39. Cerrolaza, J.J.; Villanueva, A.; Cabeza, R. Taxonomic Study of Polynomial Regressions Applied to the Calibration of Video-Oculographic Systems. In Proceedings of the 2008 Symposium on Eye Tracking Research & Applications (ETRA 2008), Savannah, GA, USA, 26–28 March 2008; pp. 259–266. [Google Scholar]
  40. Hennessey, C.; Lawrence, P. Improving the accuracy and reliability of remote system-calibration-free eye-gaze tracking. IEEE Trans. Biomed. Eng. 2009, 56, 1891–1900. [Google Scholar] [CrossRef]
  41. Xu, L.Q.; Machin, D.; Sheppard, P. A Novel Approach to Real-Time Non-Intrusive Gaze Finding. In Proceedings of the BMVC, London, UK, 4–7 September 2017. [Google Scholar]
  42. Tan, K.-H.; Kriegman, D.J.; Ahuja, N. Appearance-based Eye Gaze Estimation. In Proceedings of the IEEE Workshop on Applications of Computer Vision, Nashville, TN, USA, 19–25 June 2021; IEEE Computer Society: Washington, DC, USA, 2019. [Google Scholar]
  43. Jeong, C.; Kim, T. Eye Blink Detection Using Algorithm Based On dlib And OpenCV Library for Game Players In Competitive Environments. J. Int. Res. Med. Pharm. Sci. 2021, 16, 33–45. [Google Scholar]
  44. Robert, G.; Lucas, C.; Nicholas, L.; Jedediah, M.S.; Alexander, Z. New guidance for using t-SNE: Alternative defaults, hyperparameter selection automation, and comparative evaluation. Vis. Inform. 2022, 6, 87–97. [Google Scholar]
  45. Sunsuhi, G.S.; Albin Jose, S. An Adaptive Eroded Deep Convolutional neural network for brain image segmentation and classification using Inception ResnetV2. Biomed. Signal Process. Control. 2022, 78, 103863. [Google Scholar] [CrossRef]
  46. Zhang, J.; Zhu, Y.; Li, W.; Fu, W.; Cao, L. DRNet: A Deep Neural Network with Multi-Layer Residual Blocks Improves Image Denoising. IEEE Access 2021, 9, 79936–79946. [Google Scholar] [CrossRef]
  47. Wang, B.; Yan, X.; Li, D. An End-to-End Lane Detection Model with Attention and Residual Block. Comput. Intell. Neurosci. 2022, 2022, 5852891. [Google Scholar] [CrossRef]
  48. Ma, R.; Wang, J.; Zhao, W.; Guo, H.; Dai, D.; Yun, Y.; Li, L.; Hao, F.; Bai, J.; Ma, D. Identification of Maize Seed Varieties Using MobileNetV2 with Improved Attention Mechanism CBAM. Agriculture 2022, 13, 11. [Google Scholar] [CrossRef]
  49. Zhang, T.; Sui, Y.; Wu, S.; Shao, F.; Sun, R. Table Structure Recognition Method Based on Lightweight Network and Channel Attention. Electronics 2023, 12, 673. [Google Scholar] [CrossRef]
  50. Cao, Y. An Expression Recognition Model Based on Channel and Spatial Attention Fusion. J. Phys. Conf. Ser. 2022, 2363, 012016. [Google Scholar] [CrossRef]
  51. Imen, J.; Ihsen, A.; Anouar, B.K.; Ali, M.M. Deep learning-based hard spatial attention for driver in-vehicle action monitoring. Expert Syst. Appl. 2023, 219, 119629. [Google Scholar]
  52. Li, L.; Doroslovacki, M.; Loew, M.H. Approximating the Gradient of Cross-Entropy Loss Function. IEEE Access 2020, 8, 111626–111635. [Google Scholar] [CrossRef]
  53. Huber Holly, A.; Georgia Senta, K.; Finley Stacey, D. Systematic Bayesian Posterior Analysis Guided by Kullback-Leibler Divergence Facilitates Hypothesis Formation. J. Theor. Biol. 2022, 558, 111341. [Google Scholar] [CrossRef] [PubMed]
  54. Akihiro, K.; Kazu, N.; Rin, H.; Hideaki, K.; Yoshihisa, N. Eye fatigue estimation using blink detection based on Eye Aspect Ratio Mapping (EARM). Cogn. Robot. 2022, 2, 50–59. [Google Scholar]
  55. Yovan, L.; Kumar, P.; Singh, P.; Chand, J.; Narayana, U.H.; Sarojam, S.B.D.; Keshava, B. Submarine Groundwater Discharge (SGD): Impacts, challenges, limitations, and management recommendations. Groundw. Sustain. Dev. 2023, 21, 100903. [Google Scholar]
  56. Ren, S.; Wu, H.; Chen, W.; Li, D. Polarization Domain Spectrum Sensing Algorithm Based on AlexNet. Sensors 2022, 22, 8946. [Google Scholar] [CrossRef]
  57. Huang, B.; Liu, J.; Zhang, Q.; Liu, K.; Li, K.; Liao, X. Identification and Classification of Aluminum Scrap Grades Based on the Resnet18 Model. Appl. Sci. 2022, 12, 11133. [Google Scholar] [CrossRef]
  58. Wang, H. Residual Mask Based on MobileNet-V2 for Driver’s Dangerous Behavior Recognition. In Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence, Normal, IL, USA, 6–8 December 2019. [Google Scholar]
  59. Durrant-Whyte, H.; Bailey, T. Simultaneous Localization and Mapping (SLAM): Part I The Essential Algorithms. IEEE Robot. Autom. Mag. 2019, 2, 206–215. [Google Scholar]
Figure 1. General block diagram of intelligent eye-tracking wheelchair system.
Figure 2. Virtual eye-tracking data acquisition.
Figure 3. Real eye-tracking data acquisition: (a) nine-grid and (b) real scene.
Figure 4. Collecting Dlib key points to extract human eye pictures.
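For readers who wish to reproduce the eye-region cropping illustrated in Figure 4, the following minimal Python sketch uses Dlib's 68-point facial landmark predictor (indices 36–41 and 42–47 cover the two eyes). The model file path and the crop margin are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: cropping eye regions from a face image with Dlib's
# 68-point landmark predictor. The model path and crop margin are
# illustrative assumptions, not the authors' exact pipeline.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def crop_eyes(image_bgr, margin=10):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    eyes = []
    for face in detector(gray, 1):
        landmarks = predictor(gray, face)
        # Indices 36-41 and 42-47 are the two eye contours in the 68-point scheme.
        for start, end in ((36, 42), (42, 48)):
            xs = [landmarks.part(i).x for i in range(start, end)]
            ys = [landmarks.part(i).y for i in range(start, end)]
            x1, y1 = max(min(xs) - margin, 0), max(min(ys) - margin, 0)
            x2, y2 = max(xs) + margin, max(ys) + margin
            eyes.append(image_bgr[y1:y2, x1:x2])
    return eyes
```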
Figure 5. t-SNE visualization effect diagram.
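A t-SNE projection of the kind shown in Figure 5 can be produced with scikit-learn. The sketch below assumes a flattened image array `images` and integer `labels` as placeholders; the perplexity value is only a common default, not a setting taken from the paper.

```python
# Minimal sketch: 2D t-SNE projection of flattened eye images with scikit-learn.
# `images` (N x H x W) and `labels` (N,) are assumed placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(images: np.ndarray, labels: np.ndarray):
    features = images.reshape(len(images), -1).astype(np.float32)
    embedding = TSNE(n_components=2, perplexity=30, init="pca",
                     random_state=0).fit_transform(features)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=5, cmap="tab10")
    plt.title("t-SNE projection of eye-gaze classes")
    plt.show()
```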
Figure 6. Eye-tracking dataset.
Figure 7. GazeNet structure diagram.
Figure 8. Receptive fields corresponding to two different convolution operations: (a) 5 × 5 convolution kernel and (b) 3 × 3 convolution kernel.
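Figure 8 contrasts a single 5 × 5 kernel with stacked 3 × 3 kernels: two 3 × 3 convolutions reach the same 5 × 5 receptive field while using fewer parameters. The short sketch below checks the parameter counts for an assumed channel width of 64.

```python
# Minimal sketch: two stacked 3x3 convolutions give the same 5x5 receptive
# field as one 5x5 convolution with fewer parameters.
# The channel count (64) is an illustrative assumption.
import torch.nn as nn

c = 64
conv5 = nn.Conv2d(c, c, kernel_size=5, padding=2, bias=False)
conv3x2 = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv5))    # 5*5*64*64   = 102,400
print(count(conv3x2))  # 2*3*3*64*64 =  73,728
```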
Figure 9. ResBlock module.
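A generic PyTorch residual block of the kind sketched in Figure 9 is shown below; the layer widths and normalization choices are assumptions and may differ from GazeNet's exact ResBlock.

```python
# Minimal sketch of a residual block; not necessarily identical to GazeNet's.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.body(x) + x)  # identity skip connection
```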
Figure 10. Convolutional Block Attention Module (CBAM).
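Figure 10 shows the Convolutional Block Attention Module, which applies channel attention followed by spatial attention. The sketch below follows the commonly published CBAM structure; the reduction ratio (16) and the 7 × 7 spatial kernel are assumed defaults rather than values taken from the paper.

```python
# Minimal sketch of CBAM: channel attention followed by spatial attention.
# Reduction ratio 16 and the 7x7 spatial kernel are assumed common defaults.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: shared MLP over average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: 7x7 conv over channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```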
Figure 11. Hardware design diagram: (a) wheelchair physical picture and (b) hardware data flow diagram.
Figure 12. Modified wheelchair controller: (a) stop, (b) left, (c) forward, and (d) right.
Figure 13. Interactive interface.
Figure 14. Fitting diagram of wheelchair speed and rocker angle.
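The speed–angle relationship plotted in Figure 14 can be obtained with an ordinary least-squares polynomial fit. In the sketch below, the sample points and the quadratic degree are placeholders for illustration, not measurements reported in the paper.

```python
# Minimal sketch: least-squares fit of wheelchair speed against rocker (joystick)
# deflection angle. The sample points below are placeholders, not measured data.
import numpy as np

angle_deg = np.array([0, 5, 10, 15, 20, 25])              # assumed deflection angles
speed_mps = np.array([0.0, 0.15, 0.35, 0.55, 0.8, 1.0])   # assumed wheelchair speeds

coeffs = np.polyfit(angle_deg, speed_mps, deg=2)  # quadratic degree is an assumption
speed_model = np.poly1d(coeffs)
print(speed_model(12.5))  # predicted speed at an intermediate deflection
```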
Figure 15. Polar coordinate diagram of rocker position.
Figure 16. θ–t diagram.
Figure 17. Variation curve of a with t at different t3.
Figure 18. Key points of the eye.
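Given the six key points per eye shown in Figure 18, the eye aspect ratio (EAR) commonly used for blink detection (cf. [43,54]) can be computed as follows; the blink threshold mentioned in the comment is an assumption.

```python
# Minimal sketch: eye aspect ratio (EAR) from the six eye key points
# (p1..p6 ordered as in the 68-point Dlib scheme), often used for blink detection.
import numpy as np

def eye_aspect_ratio(pts: np.ndarray) -> float:
    """pts: (6, 2) array of eye landmarks."""
    v1 = np.linalg.norm(pts[1] - pts[5])  # vertical distance p2-p6
    v2 = np.linalg.norm(pts[2] - pts[4])  # vertical distance p3-p5
    h = np.linalg.norm(pts[0] - pts[3])   # horizontal distance p1-p4
    return (v1 + v2) / (2.0 * h)

# A blink is typically flagged when EAR drops below a threshold (e.g., ~0.2)
# for a few consecutive frames; the threshold here is an assumption.
```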
Figure 19. System flow chart.
Figure 20. GazeNet training process.
Figure 21. Training models for four different networks.
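The training curves in Figures 20 and 21 correspond to standard supervised classification with a cross-entropy loss [52]. A minimal training-step sketch is given below, with the data loader, optimizer, and device left as placeholders rather than the authors' exact settings.

```python
# Minimal sketch of one classification training epoch with cross-entropy loss;
# the loader, optimizer, and device are placeholders.
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cpu"):
    model.train()
    criterion = nn.CrossEntropyLoss()
    for images, targets in loader:  # loader yields (eye image batch, direction label)
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
```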
Figure 22. Experimental circuit diagram: (a) first lap, (b) actual route, and (c) second lap.
Figure 23. Actual test diagram.
Figure 24. Variation of deviation with distance travelled: (a) lap 1 and (b) lap 2.
Figure 25. Servo duty cycle variation curve with motion time: (a) servo No. 1 and (b) servo No. 2.
Table 1. Comparison of the experimental results of different networks.

Model           Accuracy    Parameter Amount
GazeNet         98.49%      125,749
AlexNet         96.5%       564,003
ResNet18        97.5%       11,178,051
MobileNet-V2    96.9%       2,227,715
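The parameter amounts listed in Table 1 can be reproduced for any PyTorch network with the helper below; `model` is a placeholder for GazeNet or one of the baseline networks.

```python
# Minimal sketch: counting trainable parameters of a PyTorch model, as reported
# in Table 1. `model` is a placeholder for GazeNet or any baseline network.
def count_parameters(model) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```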