**1. Introduction**

Accurately estimating gaze direction plays a major role in applications such as the analysis of visual attention, research on consumer behavior, augmented reality (AR), and virtual reality (VR). Because deep-learning models yield better inference results than other approaches, they can be applied to advanced technologies, such as autonomous driving [1] and smart glasses [2], and can overcome challenges in the medical field [3]. Although these deep-learning models are quite helpful, they are difficult to train because of varying lighting conditions and insufficient, poor-quality datasets. Moreover, gaze datasets are very expensive to acquire and complicated to process. To alleviate this problem, we propose a model that extracts eye features using UnityEyes [4], a high-quality synthetic dataset. Precise feature positions are obtained from a model enhanced with a self-attention module. Subsequently, gaze estimation is performed using high-level eye features, which is less restrictive because it does not require complex information, such as the full-face images and head poses used for gaze estimation in [5–7].

Recently, deep-learning-based eye-tracking technology has been developed mainly through appearance-based methods [5–11] that use eye or face images. These appearance-based models currently perform particularly well in a controlled environment
in which there are no disturbances, such as noise in the input frames. However, these models have some drawbacks. First, the cost of datasets is very high, and the quality of the data has a significant impact on training. Second, most models are black-box solutions, which makes it challenging to locate and understand points for improvement. This study reduces the dependency on feature maps, which are difficult to interpret, and takes a feature-based approach that estimates a gaze vector from accurate feature points after acquiring landmarks from the image. Refs. [12,13] used a stacked hourglass model [14] to extract a few eyelid and iris points.

In this study, we reinforced and used an advanced model, HRNet [15], which shows state-of-the-art performance in pose estimation, to extract high-quality landmarks. In pixel-wise tasks, such as pose estimation and landmark detection, the resolution and size of images have a huge impact on performance. Therefore, we extracted landmarks with high accuracy by remodeling the network using a self-attention module [16–18]. CBAM [18], a self-attention technique, helps generate a refined feature map that better encodes positional information using channel and spatial attention.

Because we aimed to estimate a gaze vector centered on landmark extraction, a dataset labeled with both gaze vectors and eye landmarks was essential. However, because gaze data are very expensive and difficult to generate, it is even more difficult to obtain a dataset that provides both high-resolution images and landmarks simultaneously. Therefore, UnityEyes, a synthetic eye-image dataset with eye landmarks, was adopted as the training dataset because it provides high-resolution images and an automatic labeling system. The model was trained on 32 iris and 16 eyelid points from eye images obtained with a fixed head pose. Figure 1 shows the predicted heatmaps during training. We evaluated landmark and gaze performance by composing a test set from UnityEyes and performed a gaze performance evaluation using MPIIGaze [11], which has real-environment settings.
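
The exact target encoding is not specified in this section; as an illustration only, the sketch below renders one Gaussian heatmap per landmark (e.g., the 32 iris and 16 eyelid points), the usual target form for heatmap-regression landmark detectors such as the one whose outputs are shown in Figure 1. The function name, sigma, and coordinate values are hypothetical.

```python
import numpy as np

def render_gaussian_heatmaps(landmarks, heatmap_size, sigma=2.0):
    """Render one Gaussian target heatmap per landmark.

    landmarks: (N, 2) array of (x, y) coordinates in heatmap pixels.
    heatmap_size: (height, width) of the output maps.
    Returns an (N, height, width) array with a peak of 1.0 at each landmark.
    This target encoding is an assumption; the paper only states that the
    network predicts per-landmark heatmaps.
    """
    h, w = heatmap_size
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.zeros((len(landmarks), h, w), dtype=np.float32)
    for i, (cx, cy) in enumerate(landmarks):
        maps[i] = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return maps

# Example: 48 eye landmarks (32 iris + 16 eyelid) on a hypothetical 40 x 24 grid
example_points = np.random.rand(48, 2) * [40, 24]
targets = render_gaussian_heatmaps(example_points, heatmap_size=(24, 40))
```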

Our paper is organized as follows. We first summarize related work in Section 2. In Section 3, the proposed gaze estimation method is explained. Section 4 describes the datasets used in the experiments. The experiment results are provided in Section 5. Finally, Section 6 presents our discussion on this study, and Section 7 presents the conclusion.

**Figure 1.** From left to right, the predicted heatmaps are shown as the training epoch increases. The heatmaps have high confidence scores where the landmarks are most likely to be located, and the right bar represents the color space corresponding to the confidence score.

#### **2. Related Work**

Gaze estimation is a research topic of great interest because of its excellent applicability to real environments. As it can be applied to various fields, obtaining and creating accurate gaze values and gaze estimations with fewer constraints are challenging tasks. In this section, we provide a brief overview of the research related to our method. The studies in each subsection are summarized in Tables 1–3, respectively.

#### *2.1. Feature-Based Method*

Feature-based estimation [13,19–22], one approach to gaze estimation, mainly uses unique features that have geometric relationships with the eye. Existing studies have focused on objects that have a strong visual influence on images.

Roberto et al. [19] used the saliency method to estimate visual gaze behavior and used gaze estimation devices to compensate for errors caused by incorrect calibrations, thereby reducing restrictions caused by user movements. However, the use of these devices increases the error rate as the head moves and interferes with gaze detection.

Some researchers added multiple cameras to capture head movements in multiple directions, extending the influence of eye and head-pose information [20,21]. Head pose has a significant effect on the gaze and imposes many restrictions.

To avoid this problem, studies addressing the static head-pose problem were conducted [13,22] using convolutional neural network (CNN) models, in which images from a single camera are used to perform gaze estimation based on landmarks, as these are less restrictive features. This makes it easier to build an experimental environment because there is no need for separate camera calibrations. As only eye images are used for gaze estimation, the dependence on the eye-landmark feature vector increases. After acquiring eye landmarks using a CNN model, the gaze is inferred using support-vector regression (SVR) [23].

Bernard et al. [24] used two gaze maps; one represented the eyeball region and the other represented the iris region. A gaze vector was regressed through the positional relationship between the two gaze maps.



#### *2.2. Landmark Detection*

We used a deep-learning-based pose-estimation model as a tool to acquire eye-region features. The landmark-detection task includes key-point detection, which detects a skeleton representing the body structure, and facial-landmark detection, which extracts landmarks from the face; it is a field that requires large datasets depending on the domain. Some models have a direct regression structure based on a deep neural network and predict key points directly [25]. The predicted key-point positions are progressively refined using feedback on the error prediction.

Some researchers proposed a heatmap generation method using soft-max in a fully differentiable manner [26]. The convolutional pose machine [27] predicts a heatmap with intermediate supervision to prevent vanishing gradients, which are detrimental to deep-learning models. A new network architecture, the stacked hourglass [14], proved that repeated bottom-up and top-down processing with intermediate supervision is important for improving performance. A network structure that uses high-to-low sub-networks in parallel is one of the networks currently showing the best performance [15]. For spatially accurate heatmap estimation, high-resolution learning is maintained throughout the entire process; unlike the stacked hourglass, it does not use intermediate heatmap supervision, which makes it efficient in terms of complexity and parameters, and it can generate highly information-rich feature outputs through a multi-scale feature-fusion process. Yang et al. [28] introduced a transformer for key-point detection using HRNet as a backbone to extract feature maps. It improves on existing performance through multi-head self-attention but is computationally demanding.


**Table 2.** A summary table of the landmark detection.

#### *2.3. Attention Mechanism*

Attention mechanisms in computer vision aim to selectively focus on the prominent parts of an image to better model the human visual system. Several attempts have been made to improve the performance of CNN models in large-scale classification tasks. Residual attention networks improve feature maps through encoder–decoder-style attention modules and are very robust to noisy inputs. Attention modules that split the computation into channel and spatial components, instead of operating directly on the full three-dimensional feature volume, reduce complexity and parameters while still achieving a significant effect.

The squeeze-and-excitation module [16] is an attention module that exploits the relationships between channels. Channel weights are generated through average pooling or max pooling to apply attention to each channel. BAM [17] adds a spatial attention module to this channel mechanism and places it at the network bottlenecks to create richer features. CBAM [18] is not only located at each bottleneck of a network but can also form a convolution block that configures the network. In addition, processing channel and spatial attention sequentially empirically outperforms using channel attention alone, and CBAM has achieved state-of-the-art performance in the classification field.

The self-attention module with such a flexible structure has been applied to many tasks, such as image captioning and visual question answering [29]. The self-attention module is widely used in detection and key-point detection in which spatial information is important [30,31].


**Table 3.** A summary table of the attention mechanism.

#### **3. Proposed Method**

#### *3.1. Overview of Gaze Estimation Based on Landmark Features*

In this section, we introduce the network structure and the process for extracting a rich and accurate landmark feature vector from an eye image and then estimating a gaze based on it. The overall procedure of the proposed gaze estimation is shown in Figure 2. Eye images can simply be acquired from a single camera. If a frame contains a full-face image, the frame must be cropped to a 160 × 96 image centered on the eye area using a face-detection algorithm [32]. The image is converted into a grayscale image for simple processing; this also makes the method applicable to the output of an infrared camera.
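
As a rough illustration of this preprocessing, the sketch below crops a 160 × 96 patch around a detected eye center and converts it to a single-channel image. The eye-center point, the OpenCV usage, and the boundary clamping are assumptions; the face-detection algorithm [32] itself is not reproduced here.

```python
import cv2
import numpy as np

def crop_eye_region(frame, eye_center, out_w=160, out_h=96):
    """Crop a fixed-size eye patch around a detected eye center and convert it
    to a single-channel image, mirroring the preprocessing described above.
    `eye_center` (x, y) is assumed to come from an external face/eye detector.
    """
    x, y = int(eye_center[0]), int(eye_center[1])
    half_w, half_h = out_w // 2, out_h // 2
    # Clamp the crop window so it stays inside the frame.
    x0 = max(0, min(x - half_w, frame.shape[1] - out_w))
    y0 = max(0, min(y - half_h, frame.shape[0] - out_h))
    patch = frame[y0:y0 + out_h, x0:x0 + out_w]
    return cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)

# Example with a synthetic frame and a hypothetical detected eye center
frame = np.zeros((480, 640, 3), dtype=np.uint8)
eye_patch = crop_eye_region(frame, eye_center=(320, 240))  # shape (96, 160)
```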

**Figure 2.** Overall flowchart of our feature-based gaze-estimation system.

To obtain eye-feature vectors from the processed images, we selected HRNet as the baseline model because it can generate feature maps containing rich information through the fusion of various feature maps while maintaining a high image resolution. HRNet showed the best performance in the key-point detection task, proving its utility. We modified HRNet by adding the self-attention module CBAM. Channel-wise and spatial-wise weights were applied to infer the most important channels in the 3D feature map and the most important spatial points within those channels. Section 5.1 shows that the proposed model achieved higher landmark accuracy than models in previous studies.

The EAR [33] threshold (*T*), which can differ between individuals, was set using the initial 30 input frames. The EAR value (*E*) was calculated for each frame; if it was less than the threshold, it was judged that there was no need to estimate gaze because the eyes were closed. By reducing false-positive errors, gaze estimation could proceed with a computational advantage. In some cases, 3D gaze regression uses SVR, but we instead constructed an optimal MLP. This architectural configuration has the advantage that the whole pipeline can be trained in a single stage.

The most important task of our proposed method was acquiring high-level landmark features that affect the EAR and gaze. Before training the model, we faced a lack of data, a chronic problem of deep-learning models that adversely affects training and can result in over-fitting. Landmark datasets are especially expensive, and only a few datasets include both gaze and landmarks. To avoid this problem, we used a large set of UnityEyes synthetic data for training. UnityEyes is a synthetic dataset created with the Unity game engine that models a 3D eyeball based on actual eye shapes and includes rich annotations, such as eye landmarks and gaze. Models trained with this synthetic dataset [4] have shown good performance, and the images are information-rich and of high resolution; therefore, they are very suitable for processing and application.

#### *3.2. Architecture of Proposed Landmark-Detection Model*

We used, as the feature vector for gaze, a large set of eye landmarks obtained by the model from an input frame containing eye information from a single camera. To increase the gaze accuracy, it was important to generate high-level features, so the main goal of this study was to improve the model so that it extracts a heatmap output as close as possible to the ground truth. Previous studies [12,13] that used eye landmarks as features mainly adopted the stacked hourglass [14] to produce feature outputs. However, because that feature map is restored through decoding after passing forward from high resolution to low resolution, it is weak at representation learning at high resolution. Because our model must extract more eye landmarks from small eye images than previous studies [13], feature learning at high resolution, which is highly sensitive to positional information in the image space, was necessary. Therefore, we adopted HRNet, which maintains multi-resolution learning (including high resolution), as the baseline model.

The basic structure of HRNet consists of 4 steps and 4 stages. Each step creates a feature with twice the number of channels and half the resolution of the previous step. Each stage consists of residual blocks and exchange units, and the feature map of each step is processed in parallel. The exchange unit is an information-exchange process that fuses the feature maps of the steps; the second, third, and fourth stages have one, four, and three exchange units, respectively. At the end of each stage, feature-fusion and transition processes add a step by generating a feature map that is half the previous size. Fusion between multi-scale features includes an up-sampling process that uses a 1 × 1 convolution and nearest-neighbor interpolation in the bottom-up path, and a down-sampling process that uses several 3 × 3 convolution blocks with a stride of 2 in the top-down path.

$$\text{input} = \{X_1, X_2, \ldots, X_r\} \qquad \text{output} = \{Y_1, Y_2, \ldots, Y_r\}$$

$$Y_k = \sum_{i=1}^{r} F(X_i, k), \quad F(X_i, k) = \begin{cases} \text{identity connection}, & \text{if } i = k \\ \text{up-sampling}, & \text{if } i < k \\ \text{down-sampling}, & \text{if } i > k \end{cases} \tag{1}$$

Equation (1) describes the feature-fusion process. For inputs $\{X_1, X_2, \ldots, X_r\}$ of different resolutions, output features $\{Y_1, Y_2, \ldots, Y_r\}$ are generated through an element-wise summation of the features after down-sampling and up-sampling. *r* denotes the resolution index; if *r* is the same, the width and resolution of the input and output are the same. At the end of the 4th stage, the information of all steps is concatenated to create the feature block $F_b = [Y_1^4; Y_2^4; Y_3^4; Y_4^4]$, which is passed to the prediction head.
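
The following minimal PyTorch sketch illustrates the fusion of Equation (1): the branch at the target resolution passes through unchanged, lower-resolution branches are mapped by a 1 × 1 convolution and nearest-neighbor up-sampling, and higher-resolution branches are reduced by strided 3 × 3 convolutions before element-wise summation. The branch ordering, channel widths, and on-the-fly convolutions are illustrative assumptions, not the exact HRNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_to_resolution(features, k, channels):
    """Minimal sketch of the multi-scale fusion in Equation (1).

    `features` holds the branch outputs X_1..X_r ordered from lowest to highest
    resolution (matching Equation (1): i < k is up-sampled, i > k is
    down-sampled, i == k is an identity connection). Convolutions are created
    on the fly for brevity; a real module would register them once.
    """
    target = features[k]
    out = torch.zeros_like(target)
    for i, x in enumerate(features):
        if i == k:                      # identity connection
            y = x
        elif i < k:                     # lower-resolution branch -> up-sample
            y = nn.Conv2d(channels[i], channels[k], kernel_size=1)(x)
            y = F.interpolate(y, size=target.shape[-2:], mode="nearest")
        else:                           # higher-resolution branch -> down-sample
            y = x
            for step in range(i - k):   # one strided 3x3 conv per halving
                in_c = channels[i] if step == 0 else channels[k]
                y = nn.Conv2d(in_c, channels[k], kernel_size=3, stride=2, padding=1)(y)
        out = out + y
    return out

# Example: three branches with assumed widths 128/64/32 at 16x24, 32x48, 64x96
chans = [128, 64, 32]
sizes = [(16, 24), (32, 48), (64, 96)]
feats = [torch.randn(1, c, h, w) for c, (h, w) in zip(chans, sizes)]
y_mid = fuse_to_resolution(feats, k=1, channels=chans)   # fused middle-resolution feature
```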

To address the problem that typically acquired eye images have a small resolution, we added a residual block layer composed of 3 × 3 convolutions to the model to create a feature $F_o$ at the original resolution, which stores the information of the largest resolution. Through the summation of $F_o$ and the up-sampled $F_b$, more spatially accurate features are created.
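
A possible form of this modification is sketched below: a small residual layer of 3 × 3 convolutions produces $F_o$ at the input resolution, the fused block $F_b$ is up-sampled to the same size, and their sum feeds a 1 × 1 heatmap head. The channel widths, the up-sampling mode, and the layer names are assumptions for illustration, not the exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OriginalResolutionHead(nn.Module):
    """Sketch of the modification described above: a residual layer of 3x3
    convolutions keeps a feature F_o at the original resolution, and the
    concatenated multi-step block F_b is up-sampled and summed with it before
    the heatmap prediction head. Channel sizes and nearest-neighbour
    up-sampling are assumptions."""

    def __init__(self, in_channels, fb_channels, num_landmarks):
        super().__init__()
        self.res_conv1 = nn.Conv2d(in_channels, fb_channels, 3, padding=1)
        self.res_conv2 = nn.Conv2d(fb_channels, fb_channels, 3, padding=1)
        self.skip = nn.Conv2d(in_channels, fb_channels, 1)
        self.head = nn.Conv2d(fb_channels, num_landmarks, 1)

    def forward(self, x, fb):
        # F_o: residual block computed at the original input resolution.
        fo = F.relu(self.res_conv2(F.relu(self.res_conv1(x)))) + self.skip(x)
        # Up-sample the fused block F_b to the original resolution and sum.
        fb_up = F.interpolate(fb, size=x.shape[-2:], mode="nearest")
        return self.head(fo + fb_up)

# Example: grayscale 96x160 eye image; F_b with an assumed 270 channels at 1/4 resolution
head = OriginalResolutionHead(in_channels=1, fb_channels=270, num_landmarks=50)
heatmaps = head(torch.randn(1, 1, 96, 160), torch.randn(1, 270, 24, 40))
```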

Because the heatmap, which is the final result of the network, requires accurate spatial information for each channel, we applied CBAM, a self-attention technique, to the normal residual and convolution blocks of each stage. The architecture of the modified network is illustrated in Figure 3. These techniques (adding the residual CBAM layer and applying CBAM to the residual blocks of all stages) improved the landmark-detection performance, as described in Section 5.1.

**Figure 3.** Our landmark-detection network architecture used to extract feature map.

#### *3.3. Network Engineering with the Self-Attention Module CBAM*

Attention mechanisms have been widely used for feature selection using multi-modal relationships. Refining feature maps with an attention module helps the network perform well and become robust to noisy inputs. Building on empirical results such as those in [16,17], the CBAM self-attention module developed rapidly and showed higher accuracy than existing modules in the image classification task through various structural and processing experiments. We judged that the positional information of the refined features would improve performance; therefore, we replaced the residual blocks of the network with CBAM blocks. The architecture of CBAM is illustrated in Figure 4.

**Figure 4.** Convolutional block attention module (CBAM) architecture in residual block.

CBAM adds two sub-networks, a channel attention network and a spatial attention network, to the basic residual block. A feature $F \in \mathbb{R}^{C \times H \times W}$ is generated through the 3 × 3 convolutions of the residual branch applied to the output of the previous block. $F$ then passes through the channel attention and spatial attention networks sequentially. First, for channel attention, two types of global pooling over the spatial dimensions, max pooling and average pooling, are performed to obtain weight parameters for the channels. The feature vectors $F_{max} \in \mathbb{R}^{C \times 1 \times 1}$ and $F_{avg} \in \mathbb{R}^{C \times 1 \times 1}$ generated by pooling share an MLP with a bottleneck structure, which has the advantages of parameter reduction and generalization, and are merged using element-wise summation. Finally, the result is normalized with the sigmoid function to obtain the weights $M_c(F) \in \mathbb{R}^{C \times 1 \times 1}$, and $F^c \in \mathbb{R}^{C \times H \times W}$ is generated by multiplying $M_c(F)$ and $F$. This process is described by Equation (2).

$$\begin{aligned} M_c(F) &= F_{\text{sigmoid}}(MLP(F_{max}) + MLP(F_{avg})), \\ F^{c} &= M_c(F) \otimes F \end{aligned} \tag{2}$$

Subsequently, using the channel-refined feature $F^c$ as the input, $M_s(F^c)$ is generated through the spatial attention module.

$$\begin{aligned} M_s(F^c) &= F_{\text{sigmoid}}(\mathrm{Conv}_{7 \times 7}([F^{c}_{max}; F^{c}_{avg}])), \\ F^{sc} &= M_s(F^c) \otimes F^c, \\ \text{output} &= \text{Residual} \oplus F^{sc} \end{aligned} \tag{3}$$

In Equation (3), the spatial weight map $M_s(F^c) \in \mathbb{R}^{1 \times H \times W}$ is made by sequentially applying channel-wise pooling, concatenation, a 7 × 7 convolution, and sigmoid normalization; then, $F^{sc} \in \mathbb{R}^{C \times H \times W}$ is obtained by multiplying $M_s(F^c) \in \mathbb{R}^{1 \times H \times W}$ and $F^c$. The block output, obtained by the element-wise summation of the residual and $F^{sc}$, is thus refined with a focus on 'what' and 'where', respectively. Because we applied this module to the residual blocks of the processing stages in parallel, the output at each stage contains very rich information and encodes channel information at each pixel over all spatial locations owing to the attention and fusion.
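
A minimal PyTorch sketch of the residual CBAM block described by Equations (2) and (3) follows. Batch normalization and the exact layer arrangement of the modified HRNet are omitted, and the reduction ratio and block layout are assumptions.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sketch of the channel and spatial attention of Equations (2) and (3)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared bottleneck MLP for channel attention (Equation (2)).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # 7x7 convolution over the stacked max/avg maps (Equation (3)).
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f):
        b, c, h, w = f.shape
        # Channel attention: M_c(F) = sigmoid(MLP(F_max) + MLP(F_avg)).
        f_max = torch.amax(f, dim=(2, 3))             # (B, C)
        f_avg = torch.mean(f, dim=(2, 3))             # (B, C)
        m_c = torch.sigmoid(self.mlp(f_max) + self.mlp(f_avg)).view(b, c, 1, 1)
        f_c = m_c * f
        # Spatial attention: M_s(F^c) = sigmoid(Conv7x7([F^c_max; F^c_avg])).
        s_max = torch.amax(f_c, dim=1, keepdim=True)  # (B, 1, H, W)
        s_avg = torch.mean(f_c, dim=1, keepdim=True)  # (B, 1, H, W)
        m_s = torch.sigmoid(self.spatial_conv(torch.cat([s_max, s_avg], dim=1)))
        return m_s * f_c                              # F^sc

class ResidualCBAMBlock(nn.Module):
    """Residual block with CBAM on its convolutional branch,
    i.e. output = residual + F^sc, as in Figure 4 (layout assumed)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.cbam = CBAM(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + self.cbam(f))

block = ResidualCBAMBlock(channels=64)
out = block(torch.randn(1, 64, 24, 40))   # same shape as the input
```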

We applied the CBAM module to the additional residual layer and, additionally, to all steps of each stage. We demonstrated the performance improvement through the normalized mean error (NME), a key-point-detection metric. Detailed outcome indicators are described in Section 5.

#### *3.4. Gaze-Estimation-Based Eye Landmarks with EAR*

We estimated gaze vectors based on an eye feature consisting of a total of 50 eye landmarks (1 eye center, 1 iris center, 16 eyelid points, and 32 iris points). We extracted high-accuracy eye-landmark localizations while optimizing and improving the network. As the quality of the landmarks extracted by the network improved, the gaze-regression performance also improved empirically. Existing feature-based studies [13,34] mainly used SVR for gaze regression. We empirically confirmed that the difference between SVR and multi-layer perceptron (MLP) performance is very small and that the MLP performance is slightly better. The MLP simply contains two hidden layers and uses Leaky ReLU [35] as the activation function. In addition, when the MLP is used, landmark detection and gaze estimation are possible in one-stage training. The coordinates used as inputs are normalized by the distance between the eye endpoints, and all eye points are translated with respect to the eye-center coordinates.
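
The sketch below illustrates this gaze regressor under the stated design: landmark coordinates translated to the eye center and scaled by the eye-endpoint distance are flattened and passed through a two-hidden-layer MLP with Leaky ReLU. The hidden sizes and the output dimensionality are assumptions.

```python
import torch
import torch.nn as nn

def normalize_landmarks(points, eye_center, corner_left, corner_right):
    """Translate all eye landmarks to the eye-center origin and scale them by
    the distance between the eye endpoints, as described above."""
    scale = torch.norm(corner_right - corner_left)
    return (points - eye_center) / scale

class GazeMLP(nn.Module):
    """Two-hidden-layer MLP with Leaky ReLU that regresses a gaze vector from
    the 50 normalized eye-landmark coordinates. Hidden sizes are assumptions."""

    def __init__(self, num_landmarks=50, hidden=128, out_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_landmarks * 2, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, landmarks):            # landmarks: (B, 50, 2)
        return self.net(landmarks.flatten(1))

# Example with random normalized landmark inputs
gaze = GazeMLP()(torch.randn(4, 50, 2))      # (4, 3) gaze vectors
```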

To reduce false positives and increase efficiency, we utilized the EAR value, which was calculated from the 16 eyelid points to decide whether an eye was closed. We introduced a new EAR metric based on the method that uses 6 points, because we could obtain richer, high-quality eyelid points. Figure 5 shows the measured lengths of an eye, including a closed eye. We measured the horizontal length between points *p*1 and *p*9 out of the 16 points {*p*1, *p*2, ..., *p*16}, and the average distance over the remaining seven pairs of points {(*p*2, *p*16), (*p*3, *p*15), ..., (*p*8, *p*10)} was defined as the vertical length. The EAR was calculated using Equation (4).

$$EAR = \frac{\sum_{n=1}^{7} \left\| p_{n+1} - p_{17-n} \right\|_1}{\sum_{n=1}^{7} \left\| p_1 - p_9 \right\|_1} \tag{4}$$

Because the EAR varies considerably from user to user, we set the EAR threshold (*T*) to half the median value of the EARs measured over the initial 30 input frames. Then, if the measured EAR was smaller than *T*, the network did not estimate gaze.
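
A small sketch of this gating step, assuming the eyelid points are ordered *p*1–*p*16 as in Figure 5:

```python
import numpy as np

def eye_aspect_ratio(eyelid):
    """Equation (4): ratio of the average vertical distance over the seven
    point pairs (p2,p16) ... (p8,p10) to the horizontal distance between the
    eye endpoints p1 and p9. `eyelid` is a (16, 2) array ordered p1..p16."""
    p = eyelid  # 0-indexed: p[n] corresponds to point p_{n+1}
    vertical = sum(np.abs(p[n] - p[16 - n]).sum() for n in range(1, 8))  # L1 distances
    horizontal = 7 * np.abs(p[0] - p[8]).sum()
    return vertical / horizontal

def calibrate_threshold(first_frames_eyelids):
    """Per-user threshold T: half the median EAR of the initial 30 frames."""
    ears = [eye_aspect_ratio(e) for e in first_frames_eyelids]
    return 0.5 * np.median(ears)

# Gaze is only estimated on frames whose EAR is not below the threshold:
# if eye_aspect_ratio(eyelid) < T: skip gaze estimation for this frame
```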

**Figure 5.** The EAR was calculated from the displayed landmark coordinates (*P*1 to *P*16). The blue dots represent the width of the eye and the red dots the height. The eye on the right is in a state in which there is no need to estimate the gaze.
