*3.2. Gaze Estimation Algorithm*

Accuracy, speed, and robustness to variation in illumination are the key parameters of any gaze estimation algorithm. This work presents a comparison between different gaze estimation algorithms using these three parameters. Numerous gaze estimation algorithms have been implemented by researchers; nevertheless, the focus of this work is on algorithms that are user-specific. Figure 6 shows a set of such algorithms. User specificity means that an algorithm is initially trained for a particular user using sample images, and the trained algorithm is then used to estimate gaze directions. The four algorithms marked in blue in Figure 6 are those tested in this work.

To give the disabled user full mobility of the wheelchair, at least three commands are required: forward, right, and left. In addition, two more commands are needed to start and stop the movement of the chair. It was decided to use a left-eye wink to start image acquisition and chair movement, and another left-eye wink to stop acquisition. Correspondingly, four eye states must be recognized: Right Gazing, Forward Gazing, Left Gazing, and Closed Eye.

**Figure 6.** Classification of algorithms for gaze estimation.

The following subsections discuss the different algorithms tested in this work.

(a) Template Matching

The process of template matching between a given patch image (known as a template T) and a search image S simply involves finding the degree of similarity between the two images. The template image is usually the smaller of the two (lower resolution). The simplest way to find the similarity is the Full Search (FS) method [25], in which the template image moves (slides) over the search image, and for each new position of T, the degree of similarity between the pixels of both images is calculated. At each position of T, the degree of matching is measured by the correlation operation represented in Equation 1, where T stands for the template, S for the search image, and the primed variables denote the x and y coordinates within the template image.

$$R(x, y) = \sum_{x', y'} T(x', y') \cdot S(x + x', y + y') \tag{1}$$
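To make the full-search operation concrete, the following is a minimal NumPy sketch of Equation 1; the function name and the representation of both images as grayscale 2-D arrays are illustrative assumptions.

```python
import numpy as np

def full_search_correlation(search: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Slide the template T over the search image S and evaluate the
    correlation R(x, y) of Equation 1 at every valid position."""
    sh, sw = search.shape
    th, tw = template.shape
    # R is defined wherever the template fits entirely inside S.
    result = np.zeros((sh - th + 1, sw - tw + 1), dtype=np.float64)
    for y in range(result.shape[0]):
        for x in range(result.shape[1]):
            window = search[y:y + th, x:x + tw]
            result[y, x] = np.sum(template * window)  # Equation 1
    return result
```

The position with the largest value of R is the best match; in practice, a normalized variant of the correlation is often preferred so that uniformly bright regions do not dominate the score.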

Template Matching Using Whole Image Templates

To use template matching for gaze estimation, a set of search images should be available for each user (this set of images was collected in a calibration phase). Four search images were used: three for the three different gaze positions and one for the closed eye. The template image was the image with the unknown gaze direction (which certainly belonged to one of the four classes). For each new template image, the correlation was performed four times (once per class).

It is worth mentioning that the search and template images should be acquired at the same resolution. Consequently, the template matching involved no sliding; a single correlation was computed per class. The correlation process returned a correlation coefficient for each search image, and the search image with the highest correlation determined the class of the template image. Figure 7 shows how the gaze estimation algorithm using equal-sized images was applied.

The process shown in Figure 7 requires that the template and search images be acquired under the same illumination conditions. However, the search images are collected only once, at the beginning, and are not changed afterwards, whereas the template images are subject to changes in the lighting conditions. Accordingly, to manage changes in illumination, Histogram Equalization (HE) was used. Histogram Equalization remaps the intensity values of an image so that their distribution becomes approximately uniform, improving the contrast.
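As a sketch of this whole-image classifier, the snippet below equalizes the histograms and correlates a new eye image against the four same-size calibration images; the class labels, the dictionary layout, and the use of OpenCV's normalized correlation are illustrative assumptions.

```python
import cv2

def classify_gaze(eye_img, search_images):
    """Classify a new eye image (the template) by correlating it against
    the four calibration (search) images of identical resolution.

    eye_img:       grayscale eye image with unknown gaze direction
    search_images: dict mapping a class label to its calibration image,
                   e.g. {"right": ..., "forward": ..., "left": ..., "closed": ...}
    """
    # Histogram equalization compensates for illumination differences
    # between calibration time and run time.
    eye_eq = cv2.equalizeHist(eye_img)
    scores = {}
    for label, search in search_images.items():
        # With equal-size images, matchTemplate returns a single coefficient,
        # i.e., one correlation per class rather than a sliding search.
        scores[label] = float(cv2.matchTemplate(cv2.equalizeHist(search),
                                                eye_eq,
                                                cv2.TM_CCORR_NORMED)[0, 0])
    # The search image with the highest correlation gives the class.
    return max(scores, key=scores.get)
```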

**Figure 7.** Gaze estimation using template matching.

Template Matching Using Eye Pupil Templates

The second template-matching-based method also uses correlation, but with a slight variation. In this method, the template is an image that contains only the pupil of the user. It was extracted in the calibration phase while the user was gazing forward; this gaze direction allowed for easier and simpler pupil extraction. The iris could have been used as a template; however, the pupil was a better choice because it appears fully in all gaze directions (assuming an open eye). This is evident in Figure 7, where the whole iris does not appear in the cases of right and left gazes. Notably, in this approach only an image of the pupil was required as the template; the newly acquired images (with unknown pupil position) served as the search images.

The correlation was applied to each new search image by sliding the pupil template over all possible positions of the search image to locate the position that maximally matched the template. This method has two main advantages over the previous one. First, the correlation is performed only once for each new image (instead of four times). Second, it removes the constraint of having only a small number of known gaze directions: the pupil can be located at any position in the eye, not only forward, right, and left.
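A sketch of this second method, using OpenCV's sliding correlation to localize the pupil, is shown below; the band thresholds that map the pupil's horizontal position to a gaze direction are hypothetical, as they are not specified here.

```python
import cv2

def locate_pupil(eye_img, pupil_template):
    """Slide the pupil template over the whole eye image and return the
    (x, y) of the top-left corner of the best-matching position."""
    result = cv2.matchTemplate(eye_img, pupil_template, cv2.TM_CCORR_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(result)  # single correlation pass
    return max_loc

def gaze_from_pupil_x(x, eye_width, left_band=0.4, right_band=0.6):
    """Map the horizontal pupil position to one of three directions.
    The band fractions and the left/right orientation are assumptions."""
    if x < left_band * eye_width:
        return "left"
    if x > right_band * eye_width:
        return "right"
    return "forward"
```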

Feature-Based Template Matching

Another template matching technique correlates feature values rather than raw pixel intensities. Local Binary Patterns (LBP) represent a statistical approach to analyzing image texture [42]. The term "local" is used because texture analysis is performed for each pixel relative to the pixels in its neighborhood. The image was divided into cells of 3 × 3 pixels each. In each cell, the center pixel is surrounded by eight pixels. Simply stated, the value of the center pixel was subtracted from each of the surrounding pixels. Let *x* denote the difference; the output of each operation, s(*x*), is zero or one, depending on the following thresholding.

$$s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \tag{2}$$

Starting from the upper left pixel, the outputs were concatenated into a string of binary digits. The string was then converted to the corresponding decimal representation.

This decimal number was stored at the position of the center pixel in a new image. The LBP operator therefore returns a new image (matrix) that contains the extracted texture features. The template matching was then performed between the LBP outputs of the search and template images. Although this operation requires more time than directly matching the raw pixels, the advantage provided by LBP is its robustness to monotonic variations in pixel intensities due to changes in illumination conditions [43]. Thus, applying the LBP operator eliminates the need for histogram equalization.
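The basic 3 × 3 LBP operator described above can be sketched as follows; the clockwise ordering of the neighbors after the upper-left starting pixel is an assumption, since the text only fixes the starting point.

```python
import numpy as np

def lbp_image(img: np.ndarray) -> np.ndarray:
    """Compute the basic 3 x 3 LBP code for every interior pixel and
    store it at the position of the center pixel in a new image."""
    img = img.astype(np.int16)
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.uint8)
    # Neighbor offsets, clockwise, starting from the upper-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            center = img[y, x]
            code = 0
            for dy, dx in offsets:
                # s(x) of Equation 2: 1 if neighbor - center >= 0, else 0.
                code = (code << 1) | int(img[y + dy, x + dx] >= center)
            out[y, x] = code  # decimal value of the 8-bit binary string
    return out
```

Because each code depends only on the sign of local intensity differences, any monotonic change applied to all pixel intensities leaves the output unchanged, which is the robustness property noted above.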

(b) Network Architecture and Parameters

In compliance with the notion of a user-specific approach, Convolutional Neural Networks (CNNs) were proposed as another alternative for classifying gaze directions in a fast and accurate manner. As with the template matching-based methods, a calibration process should be carried out to generate the labeled dataset required to train the supervised network. This needs to be performed only once, the first time the user utilizes the wheelchair. As a result, a relatively small user-specific dataset, which can be acquired in less than 5 minutes at 30 frames/s, was employed to train a dedicated CNN for each user. The training time is therefore short, and the trained classifier is well suited for real-time eye tracking, predicting the probabilities of the input image (i.e., the user's eye) belonging to one of four classes: right, forward, left, and closed.
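As an illustration, a minimal user-specific four-class CNN of this kind could be sketched as follows in Keras; the input dimensions, layer sizes, and optimizer are assumptions rather than the architecture used in this work, which is detailed in the stages below.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_gaze_cnn(input_shape=(48, 96, 1), num_classes=4):
    """Small CNN predicting the probabilities of a grayscale eye image
    belonging to one of four classes: right, forward, left, closed."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),  # class probabilities
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With a few minutes of labeled frames per user, such a network trains quickly and can classify each incoming frame in real time.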

The proposed solution is a data-driven approach and comprises the following stages.

