*4.2. Gaze Estimation Algorithm*

In this paper, different solutions are proposed to estimate gaze directions: template matching-based classifiers, LBP, and the CNN. Although the first alternative yielded a very satisfactory performance (i.e., an average accuracy of 95%) for all indoor lighting conditions, it failed to preserve this accuracy when images were collected outdoors because of the simplicity of the underlying algorithm. To clarify, this degradation in performance is not caused by the change in illumination, but rather by the physical difference in the shape of the eye between indoor and outdoor conditions. Two major changes occur when the user is exposed to direct sunlight: the iris shrinks when exposed to high-intensity light, and the eyelids tend to close.

One way to tackle this issue is to capture distinct templates for the various lighting conditions, which can be discerned by a light sensor, and then match the input with the template of the corresponding lighting condition. Conversely, the CNN is inherently complex enough to generalize to all lighting conditions without the need for any additional hardware. The template matching technique was therefore used as an aiding tool for the CNN to achieve better performance.
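As an illustration of this idea, the following is a minimal sketch of a template-matching gaze classifier using OpenCV. The class names, the use of normalized cross-correlation, and the per-condition template dictionary are assumptions for illustration; the paper does not specify these implementation details in this section.

```python
import cv2

# Illustrative gaze class names; the paper mentions four classes but
# does not name them in this section.
GAZE_CLASSES = ("left", "right", "forward", "closed")

def classify_by_template(eye_img, templates):
    """Return the gaze class whose calibration template best matches.

    `eye_img` and the values of `templates` are grayscale images of the
    same eye, captured under the same lighting condition (which could be
    selected via a light sensor, as suggested in the text).
    """
    best_label, best_score = None, -1.0
    for label, templ in templates.items():
        # Normalized cross-correlation: scores close to 1.0 indicate a
        # near-perfect match.
        score = cv2.matchTemplate(eye_img, templ,
                                  cv2.TM_CCOEFF_NORMED).max()
        if score > best_score:
            best_label, best_score = label, float(score)
    return best_label, best_score
```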

#### 4.2.1. Collecting Training Dataset for the CNN—Calibration Phase

To make use of accurate and fast CNNs, a training dataset must be available to train the model. The performance of the CNN depends on the selection of the training dataset; if the training dataset is not properly selected, the accuracy of the gaze estimation system decreases dramatically. Thus, a calibration phase is required for any new user of the wheelchair. The output of this calibration phase is the trained CNN model, which can later be used for gaze estimation. To prepare a high-quality training dataset, a template matching technique was used; template matching does not require any prior knowledge of the dataset. Note that the calibration phase faces two problems. The first arises from the fact that the user blinks during the collection of the training data, which leads to flawed model predictions. The second is that the user may not respond immediately to the given instructions; for example, the user may be asked to gaze right for a couple of seconds but respond one or two seconds late.

Fortunately, template matching, if smartly used, can overcome these two problems, and can be used to make sure that the training dataset is 100% correctly labeled. Moreover, the training dataset should be diverse enough to account for different lighting conditions, and different placement of glasses. The cameras could zoom the user's eyes, thus if the user slightly changes the position of glasses, noticeable differences occur between the images.
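A minimal sketch of how template matching could filter out blink frames follows, assuming OpenCV and a per-frame match-score threshold; the exact score used for this filtering is not given in the paper, so the 0.8 value is an assumption.

```python
import cv2

def remove_blink_frames(frames, template, score_threshold=0.8):
    """Discard frames that do not resemble the open-eye class template.

    Blink frames correlate poorly with a non-blinking template, so a low
    normalized cross-correlation score flags them for removal. The 0.8
    threshold is illustrative, not taken from the paper.
    """
    clean = []
    for frame in frames:
        score = cv2.matchTemplate(frame, template,
                                  cv2.TM_CCOEFF_NORMED).max()
        if score >= score_threshold:
            clean.append(frame)
    return clean
```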

Figure 13 shows a flowchart of the calibration phase. For simplicity, the illustration is discussed for one eye only; however, the same applies to the second eye. First, to obtain a wide-ranging dataset, the calibration stage was repeated under different conditions, each of which is referred to as a scenario.

For each lighting condition, the user was asked to change the position of the glasses (slightly slide them either down or up); thus, four different scenarios were needed. However, this is not enough to ensure a diverse dataset. Therefore, for each scenario, images were acquired for the four classes in three nonconsecutive attempts, to capture all the possible patterns of the user's gaze in a particular direction. Each time, the user indeed gazed in a slightly different manner, which underlines the importance of collecting images three times. In each attempt, 200 frames were collected (a total of 600 frames per class per scenario). These frames (which are contaminated with blinks) constitute a temporary training dataset. As previously mentioned, template matching was used to remove the blinks; a random frame was selected from each class to serve as a template.
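The resulting dataset sizes can be summarized with simple arithmetic. Reading the four scenarios as 2 lighting conditions times 2 glasses positions is an interpretation of the text, not an explicit statement in it.

```python
LIGHTING_CONDITIONS = 2   # interpretation: e.g., two distinct conditions
GLASSES_POSITIONS = 2     # glasses slid slightly up or down
SCENARIOS = LIGHTING_CONDITIONS * GLASSES_POSITIONS   # 4, as in the text

CLASSES = 4
ATTEMPTS_PER_CLASS = 3    # nonconsecutive acquisition runs
FRAMES_PER_ATTEMPT = 200

frames_per_class_per_scenario = ATTEMPTS_PER_CLASS * FRAMES_PER_ATTEMPT  # 600
frames_per_scenario = CLASSES * frames_per_class_per_scenario            # 2400
total_raw_frames = SCENARIOS * frames_per_scenario                       # 9600
```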

**Figure 13.** Flowchart for the calibration phase.

There are several scenarios that may occur. The first and most probable is that the user has followed the instructions and the template selection was successful, that is, the template image is not a blinking frame. In such a case, based on testing on the benchmark dataset, the accuracy of template matching was higher than 80%. This accuracy was set as the threshold; if it was attained, the image acquisition was considered successful. The second possibility is that the template-matching algorithm returns a low accuracy, which can be caused either by a wrong acquisition or by a wrong template selection. It is better to test first for a wrong template selection; thus, a different template was randomly selected, and the template matching was carried out again. If the accuracy then reached the threshold, the issue was with the template, i.e., it might have been an irrelevant blinking image. However, if the low accuracy persisted, there was a high probability that the user had gazed in a wrong direction during the image acquisition phase, and the image acquisition was therefore repeated.
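This decision logic can be sketched as follows. Here, `evaluate_accuracy` is a hypothetical helper that returns the fraction of frames the candidate template classifies correctly, and the single re-selection step reflects the flowchart described above.

```python
import random

ACCURACY_THRESHOLD = 0.80  # the 80% figure reported in the text

def validate_class_acquisition(frames, evaluate_accuracy):
    """Return a validated template, or None if acquisition must be redone.

    First suspect a wrong template (e.g., a blink frame was picked) and
    re-select once; if low accuracy persists, the user most likely gazed
    in a wrong direction, so the frames must be re-acquired.
    """
    for _ in range(2):  # initial template + one re-selection
        template = random.choice(frames)
        if evaluate_accuracy(frames, template) >= ACCURACY_THRESHOLD:
            return template
    return None  # repeat the image acquisition for this class
```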

Once the threshold accuracy was obtained, it was known with certainty that the template image was correctly selected. The next stage was to clean the 600 images of each class of the blinking frames and to keep 500 useful images per class. Then, 400 images were randomly selected for training, while the remaining 100 were used to test the trained model. After all four scenarios were finished, the final training dataset was fed to the CNN to obtain the trained CNN model. It is worth mentioning that even though this calibration takes a few minutes to accomplish, it needs to be executed only once.
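A sketch of this cleaning and splitting step, assuming the blink filter above has already run and that at least 500 clean frames per class remain:

```python
import random

def build_class_split(clean_frames, keep=500, n_train=400, seed=0):
    """Keep 500 blink-free frames per class and split them 400/100."""
    rng = random.Random(seed)
    kept = list(clean_frames[:keep])
    rng.shuffle(kept)
    return kept[:n_train], kept[n_train:keep]  # (train, test)
```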

With the trained CNN, the system was ready for real-time gaze estimation to steer the wheelchair. Between the real-time gaze estimation and moving the wheelchair, two points should be taken into consideration. The first is the safety of the user, i.e., whether there is an obstacle in his or her way. The second is that the real-time estimation for the right eye may give a classification different from that of the left eye (because of a misclassification or because the two eyes are captured at slightly different gaze instants).

Figure 14 shows a flowchart of the real-time classification, in which these two points are tackled. As discussed earlier, proximity sensors were used to scan the region in the wheelchair's path. In the decision-making part, the minicomputer first measured the distance to the closest obstacle. If there was no close obstacle, the wheelchair moved in the predicted direction. However, if there was an object inside the danger area (defined by a threshold distance), the wheelchair stopped immediately.
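A minimal sketch of the obstacle check follows; the danger-area threshold value is an assumption, as the paper does not state it in this section.

```python
DANGER_THRESHOLD_CM = 50  # illustrative value only; not from the paper

def safe_to_move(closest_obstacle_cm):
    """Allow motion only if the closest obstacle is outside the danger area."""
    return closest_obstacle_cm > DANGER_THRESHOLD_CM
```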

**Figure 14.** Flowchart for real-time classification and controlling the movement of the wheelchair.

If there was no close obstacle, the real-time classification returned two results, one for each eye. It is important to make use of this redundancy; otherwise, there would be no need for two cameras. If the two eyes returned the same class, then, given the high classification performance of the CNN, this predicted class was used to move the wheelchair in the determined direction. However, if the two eyes gave different classifications, no action was taken, as the safety of the user has priority.
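The two-eye redundancy rule can be sketched as follows (class labels are illustrative):

```python
def fuse_predictions(left_eye_class, right_eye_class):
    """Move only when both eyes agree; otherwise take no action.

    Disagreement may come from a misclassification or from the two
    images being captured at slightly different gaze instants, so the
    safe choice is to do nothing.
    """
    if left_eye_class == right_eye_class:
        return left_eye_class      # confident prediction: move
    return None                    # no action; safety has priority
```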
