A CNN consists of three building blocks: convolutional layers that extract local features, pooling (subsampling) layers that reduce the spatial resolution, and fully connected layers that map the extracted features to class scores.


**Figure 9.** Convolutional Neural Network (CNN) architecture.

#### Data Preprocessing

Before being fed to the network, the input images were downsampled to 64 × 64 pixels regardless of their original size. This reduces the time needed for the forward propagation of a single image when the CNN is deployed for real-time classification. The images were then zero-centered by subtracting the image mean and normalized to unit variance by dividing by the standard deviation:

$$I\_N(x, y) = \frac{I(x, y) - I\_{\text{mean}}}{\sigma\_I} \tag{3}$$

where *I* is the original image; *I*<sub>*N*</sub> is the preprocessed one; and *I*<sub>mean</sub> and *σ<sub>I</sub>* are the mean and standard deviation of the original image, respectively.
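A minimal sketch of this preprocessing step is given below, assuming OpenCV and NumPy are available; the function name is illustrative and not taken from the original implementation.

```python
import cv2
import numpy as np

def preprocess(image, size=(64, 64)):
    """Downsample, zero-center, and normalize an eye image (Equation (3))."""
    # Resize to 64 x 64 regardless of the original resolution
    resized = cv2.resize(image, size).astype(np.float32)
    # Zero mean, unit variance (small epsilon avoids division by zero)
    return (resized - resized.mean()) / (resized.std() + 1e-8)
```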

The preprocessing stage will help reduce the convergence time of the training algorithm.

#### Loss Function

The mean squared error (MSE) was chosen as the error function to quantify how well the classifier fits the data:

$$E(\text{weights}, \text{biases}) = \sum\_{i=1}^{N\_c} (y\_i - t\_i)^2 \tag{4}$$

where *E* is the loss function; *y<sub>i</sub>* and *t<sub>i</sub>* are the predicted and the actual scores of the *i*-th class, respectively; and *N<sub>c</sub>* denotes the number of classes, which is 4 in this classification problem.
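A minimal sketch of this loss, assuming the four class scores are compared against a one-hot target vector (the class order shown is only illustrative):

```python
import numpy as np

def sse_loss(y_pred, y_true):
    """Squared-error loss over the Nc = 4 class scores (Equation (4))."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    return float(np.sum((y_pred - y_true) ** 2))

# Predicted scores vs. a one-hot target for one of the four classes
print(sse_loss([0.1, 0.2, 0.6, 0.1], [0.0, 0.0, 1.0, 0.0]))  # 0.22
```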

#### Training Algorithm Selection

Learning is essentially an optimization problem in which the loss function is the objective function and the output of the training algorithm is the set of network parameters (i.e., weights and biases) that minimizes this objective.

Stochastic gradient descent with an adaptive learning rate (adaptive backpropagation) is the optimization algorithm employed to train the CNN. Backpropagation iteratively computes the gradient of the loss function (the vector of derivatives of the loss with respect to each weight and bias) and uses it to update all of the model's parameters:

$$w\_{ki}^l(t+1) = w\_{ki}^l(t) - \varepsilon \frac{\partial E}{\partial w\_{ki}^l} \tag{5}$$

$$b\_{ki}^l(t+1) = b\_{ki}^l(t) - \varepsilon \frac{\partial E}{\partial b\_{ki}^l} \tag{6}$$

where *w<sup>l</sup><sub>ki</sub>* is the weight of the connection between the *k*-th neuron in the *l*-th layer and the *i*-th neuron in the (*l*−1)-th layer, *b<sup>l</sup><sub>ki</sub>* is the bias of the *k*-th neuron in the *l*-th layer, and *ε* is the learning rate.
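The update rule, together with one common way of adapting the learning rate, can be sketched as follows; the growth and shrink factors below are illustrative values, not taken from the paper.

```python
def sgd_step(params, grads, lr):
    """Gradient-descent update applied to every weight and bias (Equations (5) and (6))."""
    return [p - lr * g for p, g in zip(params, grads)]

def adapt_lr(lr, loss, prev_loss, grow=1.05, shrink=0.7):
    """One common adaptive-rate rule: enlarge the step while the loss keeps
    decreasing and shrink it when the loss rises (factors are illustrative)."""
    return lr * grow if loss < prev_loss else lr * shrink
```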

#### Hyperparameters Setting

Five-fold cross-validation is employed to provide more confidence in decisions about the model structure and to prevent overfitting.

First, the whole dataset is split into a test set, on which the model performance will be evaluated, and a development set, used to tune the CNN hyperparameters. The latter set is divided into five folds. Second, each fold is treated in turn as the validation set; in other words, the network is trained on the other four folds and tested on the validation fold. Finally, the five validation losses are averaged to produce a more accurate error estimate, since the model is effectively tested on the full development set. The hyperparameters are then chosen such that they minimize the cross-validation error.
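A minimal sketch of this procedure, assuming a placeholder `train_and_eval` routine that trains the CNN with the candidate hyperparameters and returns its loss on the held-out fold:

```python
import numpy as np

def cross_validation_error(train_and_eval, dev_x, dev_y, k=5, seed=0):
    """Average validation loss over k folds for one hyperparameter setting."""
    idx = np.random.default_rng(seed).permutation(len(dev_x))
    folds = np.array_split(idx, k)            # split the development set into k folds
    losses = []
    for i in range(k):
        val = folds[i]                        # fold i is the validation set
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        losses.append(train_and_eval(dev_x[train], dev_y[train],
                                     dev_x[val], dev_y[val]))
    return float(np.mean(losses))             # cross-validation error estimate
```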

To test and compare the performance of these four gaze estimation techniques, a testing dataset was needed.

#### *3.3. Building a Database for Training and Testing*

All of the discussed gaze-tracking algorithms require a database for evaluation. Although three algorithms (template matching, LBP, and CNN) require a labeled dataset, the template matching algorithm that uses the eye pupil as a template does not require such labels. Several publicly available databases could be used to test the algorithms. The Columbia Gaze Data Set [44] has 5880 images of 56 people in different head poses and eye gaze directions. Gi4E [45] is another public dataset for iris center detection; it contains 1339 images collected with a standard webcam. UBIRIS [46] is a 1877-image database dedicated to iris recognition.

Nevertheless, none of these databases was used, because each one has one or more of the following drawbacks with respect to our testing approach.


Accordingly, a database was created for eight different users. Videos of the eight users gazing in three directions (right, left, and forward) and closing their eyes were captured under two different lighting conditions. The dataset thus covers the four selected classes, namely Right, Forward, and Left gazing, plus one class for closed eyes. For each user, 2000 images were collected (500 per class), covering both indoor and outdoor lighting conditions. In addition, a dataset was collected for ten different users, but only under indoor lighting. Then, 500 frames per class for each user were taken to constitute a benchmark dataset of 16,000 frames. Next, the dataset was partitioned into a training set and a testing set, with 80% of the data used for training. Consequently, the algorithm learnt from 12,800 frames and was tested on 3200 unseen frames.
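A minimal sketch of such an 80/20 partition is shown below; a random shuffle is assumed here, since the paper does not specify how frames were assigned to the two sets.

```python
import numpy as np

def split_dataset(frames, labels, train_fraction=0.8, seed=0):
    """Shuffle and split the 16,000-frame benchmark into 12,800 training
    and 3,200 testing frames (80/20)."""
    idx = np.random.default_rng(seed).permutation(len(frames))
    cut = int(train_fraction * len(frames))
    return (frames[idx[:cut]], labels[idx[:cut]],   # training set
            frames[idx[cut:]], labels[idx[cut:]])   # testing set
```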

#### *3.4. Safety System—Ultrasonic Sensors*

For safety purposes, an automatic stopping mechanism installed on the wheelchair should disconnect the gaze-based controller and immediately stop the chair whenever the wheelchair comes close to any object in its surroundings. Throughout this paper, the word "object" refers to any type of obstacle, which can be as small as a brick, a chair, or even a human being.

A proximity sensor qualitatively measures how close an object is to the sensor. It has a specific range and raises a flag when any object enters this threshold area. Proximity sensors are based on two main technologies: ultrasonic and infrared (IR). The same physical concept applies to both, namely wave reflection: the sensor transmits waves and receives them after they are reflected from surrounding objects, a process very analogous to the operation of radar. The difference between ultrasonic-based and IR-based proximity sensors is the type of wave used: high-frequency sound waves for the former and infrared radiation for the latter.

The two types of proximity sensors suit different applications. As infrared sensors operate in the IR spectrum, they cannot be used outdoors, where sunlight interferes with their operation [47], and they are also difficult to use in dark areas [47]. Ultrasonic waves, on the other hand, are high-frequency sound waves that do not suffer from this kind of interference; ultrasonic reflections are also insensitive to hindering factors such as light, dust, and smoke. This makes ultrasonic sensors advantageous over IR sensors for the wheelchair. A comparison between the two technologies [47] suggests that combining them gives more reliable results for certain types of obstacles, such as rubber and cardboard. For this paper, however, there is no need to use both; ultrasonic sensors are enough to accomplish the task of object detection.

Usually, an ultrasonic sensor is used together with a microcontroller (MCU) to measure the distance. As shown in Figure 10, the sensor has an ultrasonic transmitter (Tx) and a receiver (Rx). The microcontroller sends a trigger signal to the sensor, which causes the Tx to transmit ultrasonic waves. The waves reflect off the object and are received by the Rx port on the sensor. The sensor then outputs an echo signal (a digital pulse) whose duration equals the time taken by the ultrasonic waves to travel the round-trip distance.

$$\text{distance (m)} = \frac{\text{travelling time (s)}}{2} \times \text{speed of ultrasonic waves} \left(\frac{\text{m}}{\text{s}}\right) \tag{7}$$

$$\text{error in distance (m)} = \text{total delay in the system (s)} \times \text{speed of the wheelchair} \left(\frac{\text{m}}{\text{s}}\right) \tag{8}$$

**Figure 10.** Operation of the proximity sensor with the microcontroller.

By knowing the travelling time of the signal, the distance can be calculated using Equation (7). As in radar operation, the time is divided by two because the waves travel the distance twice. Ultrasonic waves travel in air at a speed of about 340 m/s.
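A minimal sketch of this calculation, using the 340 m/s figure quoted above:

```python
SPEED_OF_SOUND = 340.0  # m/s in air, the value used in the text

def distance_from_echo(echo_duration_s):
    """Equation (7): the echo pulse lasts the round-trip travel time,
    so halve it before multiplying by the speed of sound."""
    return (echo_duration_s / 2.0) * SPEED_OF_SOUND

print(distance_from_echo(0.010))  # a 10 ms echo corresponds to 1.7 m
```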

As the wheelchair moves with a certain speed, the echo-based distance measurement is affected (Doppler effect), which introduces some inaccuracy in the calculated distance. However, this error did not affect the performance of the system. First, note that any fault in the measurement depends entirely on the speed of the wheelchair, not the speed of the ultrasonic waves; this makes the error extremely small, since the error is directly proportional to the wheelchair speed. The erroneous distance can be calculated using Equation (8). The wheelchair moves at a maximum speed of 20 km/h (5.56 m/s). The delay is not a fixed parameter; thus, for design purposes, the maximum delay (worst-case scenario) was used.

The total delay in the system is the sum of the delay introduced by the MCU and that of the sensor. The delay of the sensor cannot exceed 1 millisecond, because the average distance the sensor measures is ~1.75 m; the average travelling time of the ultrasonic wave can be calculated from Equation (7). The delay due to MCU processing is in the microsecond range. Applying this total delay to Equation (8), the maximum inaccuracy in the measured distance is 0.011 m (1.1 cm).
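A minimal sketch of Equation (8); the 2 ms total delay used below is only an illustrative worst-case value chosen to reproduce the 1.1 cm error quoted above.

```python
MAX_SPEED = 20 / 3.6  # maximum wheelchair speed: 20 km/h = 5.56 m/s

def distance_error(total_delay_s, speed_mps=MAX_SPEED):
    """Equation (8): the ranging error grows linearly with the total system
    delay (sensor + MCU) and the wheelchair speed."""
    return total_delay_s * speed_mps

print(distance_error(2e-3))  # an assumed 2 ms total delay gives about 0.011 m (1.1 cm)
```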

To cover a wider area, a number of proximity sensors was used to span 120°, since the chair is designed to move only in the forward direction (with right and left steering, of course), not backwards; refer to Figure 11. A sketching program (Fusion 360) was used to visualize the scope of the proximity sensors and to specify the number of sensors to use.


**Figure 11.** Proximity sensor covering range: (**a**) 360 degrees and (**b**) 120 degrees.
