Article

Appearance-Based Gaze Estimation Method Using Static Transformer Temporal Differential Network

1 Guangxi Colleges and Universities Key Laboratory of AI Algorithm Engineering, School of Artificial Intelligence, Guilin University of Electronic Technology, Jinji Road, Guilin 541004, China
2 School of Artificial Intelligence, Guilin University of Electronic Technology, Jinji Road, Guilin 541004, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(3), 686; https://doi.org/10.3390/math11030686
Submission received: 3 December 2022 / Revised: 19 January 2023 / Accepted: 25 January 2023 / Published: 29 January 2023

Abstract

Gaze behavior is important, non-invasive human–computer interaction information that plays a key role in many fields—including skills transfer, psychology, and human–computer interaction. Recently, improving the performance of appearance-based gaze estimation, using deep learning techniques, has attracted increasing attention; however, several key problems remain in these deep-learning-based gaze estimation methods. Firstly, the feature fusion stage is not fully considered: existing methods simply concatenate the different obtained features into one feature, without considering their internal relationship. Secondly, dynamic features can be difficult to learn, because of the unstable extraction process of ambiguously defined dynamic features. In this study, we propose a novel method to address the feature fusion and dynamic feature extraction problems. We propose the static transformer module (STM), which uses a multi-head self-attention mechanism to fuse fine-grained eye features and coarse-grained facial features. Additionally, we propose an innovative recurrent neural network (RNN) cell—that is, the temporal differential module (TDM)—which can be used to extract dynamic features. We integrated the STM and the TDM into the static transformer temporal differential network (STTDN). We evaluated the STTDN performance, using two publicly available datasets (MPIIFaceGaze and Eyediap), and demonstrated the effectiveness of the STM and the TDM. Our results show that the proposed STTDN outperformed state-of-the-art methods, including on the Eyediap dataset (by 2.9%).

1. Introduction

Gaze direction is important non-invasive human behavioral information, and can be an important cue for understanding human intention, making it valuable in human-behavior-related fields, such as human–computer interaction [1,2,3], autonomous driving [4,5], and virtual reality [6,7]. Gaze estimation tasks can be divided into three main categories: (1) 3D gaze direction estimation [8,9]; (2) target estimation [10,11]; (3) tracking estimation [12,13]. In this paper, we focus on 3D gaze direction estimation.
3D gaze estimation methods can be divided into model-based gaze estimation methods and appearance-based gaze estimation methods, as shown in Figure 1. Model-based gaze estimation methods [14,15,16,17] usually consider the eye’s geometric features, such as the eyeball shape and the pupil position, using machine learning methods—for example, the support vector machine—to predict gaze direction; however, model-based gaze estimation methods usually require specific detection equipment, making them impractical to use in real-world environments. Conversely, compared to model-based gaze estimation methods, appearance-based gaze estimation methods do not require specific detection equipment. Such methods can predict gaze direction using face or eye images; however, these methods depend on large volumes of training data. Zhang et al. [18] first applied deep learning in the gaze estimation field: since then, many researchers [18,19,20,21] have proposed appearance-based deep learning gaze estimation methods and, more recently, a new pipeline of gaze estimation has been proposed [22]. The new pipeline of appearance-based gaze estimation includes static feature extraction, dynamic feature extraction, and gaze estimation: such methods are usually highly accurate, even when the head pose and illumination change.
There are several significant problems with appearance-based deep learning gaze estimation methods: for example, the feature fusion stage has not been fully considered. Recent research [23,24] has proposed several feature fusion strategies to help settle this issue. In this paper, we considered the feature fusion stage, applying the self-attention mechanism [25] to fuse the coarse-grained facial features and fine-grained eye features. The self-attention mechanism was able to effectively learn the interrelation among different features, and retain more important features. As far as we know, our study is the first to apply the self-attention mechanism for feature fusion in appearance-based deep learning gaze estimation methods. The self-attention mechanism [25] was first proposed in natural language processing (NLP), and the vision transformer (ViT) [26] made the connection between the self-attention mechanism and computer vision (CV). The ViT has two variants: the pure-ViT and the hybrid-ViT. The pure-ViT needs to split the original image, whereas the hybrid-ViT uses a convolutional network to extract the feature map from the original image, before splitting the feature map. Both the pure-ViT and the hybrid-ViT are extremely competitive state-of-the-art techniques used in the image classification field: Cheng et al. [27] applied them to gaze estimation, and achieved highly accurate results. Compared to the hybrid-ViT, the pure-ViT does not perform quite as well, because it needs to slice the facial image into several patches, which can destroy global information in the image, such as the head pose. To avoid the destruction of global information, we treated each entire feature as an independent patch. In particular, features of the face image, the left eye image, and the right eye image were extracted using a convolutional neural network, and each was flattened to a one-dimensional vector that served as one patch. We implemented this operation in a static transformer module (STM): this module is described in detail in Section 3.1.
Another key problem in appearance-based deep learning gaze estimation methods is the efficient extraction of dynamic features. Unlike the static features extracted from an image, the dynamic features need to be extracted from an input video. Some research [22,28] has extended the basic static network, using a dynamic network—for example, using a long short-term memory (LSTM) approach [29]. However, these methods do not define dynamic features clearly—in other words, such methods obtain implicit dynamic features through RNNs. Liu et al. [30] proposed a differential network to predict differential information in solving personal-calibration problems in the gaze estimation field. The differential information was characterized as the difference between the gaze direction of two images, reflecting sight movements in continuous time. Inspired by this, we defined the dynamic feature to be sight movement, and we proposed a new RNN cell—the temporal differential module (TDM)—to obtain it in our work: this module can steadily extract effective dynamic features, using differential information. The TDM is described in detail in Section 3.2. The TDM is the core of the temporal differential network (TDN). We then combined the STM and the TDN into an end-to-end gaze estimation network—that is, a static transformer temporal differential network (STTDN)—to achieve better results.
To our knowledge, this is the first study to use self-attention mechanisms to fuse both coarse-grained facial features and fine-grained eye features. The contribution of our work can be summarized as follows:
(1)
We proposed a novel STTDN for gaze estimation, which could achieve better accuracy, compared to state-of-the-art algorithms;
(2)
We proposed the STM, to extract and fuse features. In the STM, we used a convolutional neural network to extract features from the face, left eye, and right eye images. Then, we treated each feature as an independent patch, to avoid the key problem of the pure-ViT. Lastly, we used multi-head self-attention to fuse these patches;
(3)
The proposed TDM was used to obtain the dynamic information from video, and we clearly defined the dynamic feature to be sight movement.
The rest of the paper is organized as follows: we discuss the related work in Section 2; we describe our proposed STTDN method in Section 3; we summarize and analyze the experimental results in Section 4; finally, the study conclusions are discussed in Section 5.

2. Related Work

In recent years, appearance-based deep learning methods have been the mainstream methods used in the gaze estimation field. Compared to other traditional gaze estimation methods, appearance-based deep learning methods perform accurately when the head pose and illumination change. Zhang et al. [18] proposed the first deep learning model based on LeNet [31] for gaze estimation, which was used to predict the gaze direction from a grayscale image of the eye. Zhang et al. extended the convolutional neural network to 13 layers in their later work [20], achieving more accurate results with an appearance-based method that used a VGG16-inherited network [32]. Zhang et al. also considered facial features as inputs in their work [21]. Other studies [32,33,34] further developed appearance-based deep learning gaze estimation methods. Fischer et al. [33] used two VGG16 networks simultaneously, to extract features from two images of the human eye, and connected the two-eye features for regression. Cheng et al. [34] established a four-channel network for gaze estimation, in which two channels were used to extract features from the left eye and right eye images, the other two channels being used to extract facial features. Additionally, Chen et al. [19] applied a dilated convolution to extract high-dimensional features, the dilated convolution effectively increasing the receptive field while avoiding reduced image resolution. Krafka et al. [11] added face grid information to their model.
Although past appearance-based deep learning methods demonstrated high performance, they still had room for improvement. For example, existing methods [33,34] simply concatenate the different obtained features (such as the left eye and right eye features) into one feature, without considering their internal relationship: the feature fusion stage is not fully considered in these existing methods. More recently, some research [23,24] has proposed the use of a feature fusion strategy. Bao et al. [23] applied the squeeze-and-excitation mechanism in the fusion of eye features, and used adaptive group normalization to correct the fused eye features with facial features. Cheng et al. [24] proposed an attention module to fuse fine-grained eye features, with the guidance of coarse-grained facial features.
With the self-attention mechanism developing in CV, Cheng et al. [27] first applied the ViT in the gaze estimation field, and achieved an advanced level of performance. The self-attention mechanism [25] was proposed initially in the NLP field, and Dosovitskiy et al. [26] proposed the ViT to integrate this method into the CV field. Liu et al. [35] further developed the ViT into the Swin transformer (Swin), the sliding window technique being applied to reduce the computing resource overhead. At this stage, the Swin method remains at the forefront of several CV tasks—including image classification [36], semantic segmentation [37], and object detection [38], among others. In this paper, we further developed the application of the self-attention mechanism in the gaze estimation field, by specifically applying the self-attention mechanism to the feature fusion process.
Apart from the static features extracted from images, important features—namely, the dynamic features—can be obtained from video. Research [22,28,39,40] has proposed several gaze estimation methods that incorporate dynamic features. Kellnhofer et al. [22] proposed a video-based annotation tracking model, and used an LSTM method to obtain the dynamic features from a video. Zhou et al. [28] applied the Bi-LSTM to obtain dynamic features. Some research [22,28] has applied RNNs to implicitly obtain dynamic features: at this stage, dynamic features are difficult to learn, because of the unstable extraction process of ambiguously defined dynamic features. More recently, some research [39,40] has defined the obtained dynamic features clearly. Wang et al. [39] defined optical flow to be a dynamic feature, and then used the optical flow to reconstruct the three-dimensional face structure. Wang et al. [40] considered eye movement as a dynamic feature, and proposed a gaze tracking algorithm. Inspired by differential information, we also defined sight movement as a dynamic feature. Liu et al. [30] proposed a differential network to settle the personal calibration problem: in their work, the differential network predicted the differential information, which reflected eye movement. In this paper, we propose a new RNN cell—the TDM—to more efficiently obtain this dynamic feature.

3. Proposed Model and Algorithm

In this section, we describe in detail how we designed this end-to-end 3D gaze estimation network. The pipeline of the proposed STTDN is as shown in Figure 2. The STTDN consisted of two main components: (1) the STM, which was the static network; (2) the TDN, which was the dynamic network. The network used input video frames to predict the gaze direction corresponding to the last input frame. At this stage, the STM was responsible for extracting the static features from a single face image and the left eye and right eye images corresponding to it, while the TDN was responsible for extracting the dynamic features from the static features. The flow chart of the proposed model is shown in Figure 3. There were two basic states of our proposed model: back propagation and forward propagation. Forward propagation could be used to calculate the intermediate variables of each layer, and back propagation could be used to calculate the gradient of each layer.

3.1. The Design of the STM

The overall structure of the STM is as shown in Figure 2, where we integrated two convolutional neural networks and a multi-head self-attention fusion block. The STM took a face image and its corresponding left eye and right eye images as input, before outputting static features.
Given a face image ($I_{face}^{(i)} \in \mathbb{R}^{H \times W \times C}$) and its corresponding left eye ($I_{left\_eye}^{(i)} \in \mathbb{R}^{H \times W}$) and right eye ($I_{right\_eye}^{(i)} \in \mathbb{R}^{H \times W}$) images—where (H, W) denotes the resolution of the original image, C denotes the number of channels, and i denotes the ith image in the input video frames—we used two independent convolutional neural networks for feature extraction. We differentiated the left eye and right eye images in the feature extraction stage, because the left eye and right eye images contributed differently to the gaze direction, due to the head pose and illumination. The first convolutional neural network was used to extract the facial features; the second convolutional neural network extracted the left eye and right eye features simultaneously, before flattening them to one-dimensional vectors—that is, $f_{face}^{(i)} \in \mathbb{R}^{1 \times d}$, $f_{left\_eye}^{(i)} \in \mathbb{R}^{1 \times d}$, and $f_{right\_eye}^{(i)} \in \mathbb{R}^{1 \times d}$, where d was the feature dimension and d = 32 in our experiments.
Similar to the position encoding and patch embedding process in [26], we created the feature matrix $X$: firstly, we created a learnable $f_{token}^{(i)} \in \mathbb{R}^{1 \times d}$; secondly, we coded the feature positions—specifically, $f_{token}^{(i)}$ was coded as position 0, and $f_{face}^{(i)}$, $f_{left\_eye}^{(i)}$, and $f_{right\_eye}^{(i)}$ were coded as positions 1, 2, and 3, respectively; finally, we concatenated the four one-dimensional vectors into a feature matrix $X = [f_{token}^{(i)}; f_{face}^{(i)}; f_{left\_eye}^{(i)}; f_{right\_eye}^{(i)}]$.
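To make the patch construction concrete, the following is a minimal PyTorch sketch of how the learnable token, the three CNN feature vectors, and the position coding could be assembled into the 4 × d feature matrix X. The class name StaticFusionInput and the zero initialization of the learnable parameters are illustrative assumptions, not taken from the authors' released code.

```python
import torch
import torch.nn as nn

class StaticFusionInput(nn.Module):
    """Builds X = [f_token; f_face; f_left_eye; f_right_eye] with position coding.

    A sketch of the patch/position-embedding step of Section 3.1; d = 32 follows
    the paper, the initialization and naming are assumptions.
    """
    def __init__(self, d: int = 32):
        super().__init__()
        self.f_token = nn.Parameter(torch.zeros(1, 1, d))    # learnable token, position 0
        self.pos_embed = nn.Parameter(torch.zeros(1, 4, d))  # positions 0..3

    def forward(self, f_face, f_left_eye, f_right_eye):
        # each input: (batch, d) one-dimensional feature vector from the CNNs
        b = f_face.size(0)
        token = self.f_token.expand(b, -1, -1)                          # (b, 1, d)
        feats = torch.stack([f_face, f_left_eye, f_right_eye], dim=1)   # (b, 3, d)
        x = torch.cat([token, feats], dim=1)                            # (b, 4, d)
        return x + self.pos_embed                                       # add position coding

# usage with random stand-ins for the CNN outputs
X = StaticFusionInput()(torch.randn(8, 32), torch.randn(8, 32), torch.randn(8, 32))
print(X.shape)   # torch.Size([8, 4, 32])
```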
The feature matrix $X$ could be further fused by using the multi-head self-attention fusion block. The core of the multi-head self-attention fusion block was the multi-head self-attention mechanism [25], which was a derivative of the self-attention mechanism. The self-attention mechanism used a multi-layer perceptron (MLP) to map the feature matrix $X$ to Queries $Q \in \mathbb{R}^{n \times d_k}$, Keys $K \in \mathbb{R}^{n \times d_k}$, and Values $V \in \mathbb{R}^{n \times d_v}$, where n was the batch size, $d_k$ was the dimension of the Queries and Keys, and $d_v$ was the dimension of the Values. For our experiment, $d_k = d_v = 8$. The formulaic definition of self-attention can be expressed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \qquad (1)$$
Unlike the self-attention mechanism, the multi-head self-attention mechanism used projection matrices $W_i^Q \in \mathbb{R}^{d \times d_k}$, $W_i^K \in \mathbb{R}^{d \times d_k}$, and $W_i^V \in \mathbb{R}^{d \times d_v}$, projecting the feature matrix $X \in \mathbb{R}^{n \times d}$ into different representation subspaces. Moreover, a fusion matrix $W^O \in \mathbb{R}^{h d_v \times d}$ fused the information extracted from the different representation subspaces, where i denoted the ith representation subspace, and h denoted the number of representation subspaces. In our experiment, we employed h = 4. The definition of the multi-head self-attention mechanism calculation can be expressed as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^O \qquad (2)$$
where $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$, and Concat represents the concatenation operation.
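The following is a compact PyTorch sketch of Equations (1) and (2), using the dimensions stated in the text (d = 32, h = 4, d_k = d_v = 8). Packing the per-head projections into single linear layers, and the class name, are implementation assumptions.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product self-attention over the 4 feature tokens, Equations (1)-(2)."""
    def __init__(self, d: int = 32, heads: int = 4):
        super().__init__()
        assert d % heads == 0
        self.heads, self.d_k = heads, d // heads   # d_k = d_v = 8 for d = 32, h = 4
        self.w_q = nn.Linear(d, d)                 # packs W_i^Q for all heads
        self.w_k = nn.Linear(d, d)
        self.w_v = nn.Linear(d, d)
        self.w_o = nn.Linear(d, d)                 # fusion matrix W^O

    def forward(self, x):                          # x: (batch, tokens, d)
        b, n, _ = x.shape
        def split(t):                              # (b, n, d) -> (b, heads, n, d_k)
            return t.view(b, n, self.heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)   # concatenate the heads
        return self.w_o(out)

fused = MultiHeadSelfAttention()(torch.randn(8, 4, 32))      # (8, 4, 32)
```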
We could stack multiple transformer layers [25], to implement the multi-head self-attention fusion block, the structure of which is shown in Figure 4. In Figure 4, L represents the number of transformer layers. An independent transformer layer comprises an MLP and a multi-head self-attention layer (MSA). The definition of the transformer calculation can be expressed as follows:
$$X' = X + \mathrm{MSA}(\mathrm{LN}(X)) \qquad (3)$$
$$Y = X' + \mathrm{MLP}(\mathrm{LN}(X')) \qquad (4)$$
where $X$ denotes the feature matrix received by each transformer layer, $X'$ denotes the intermediate variable output by the multi-head self-attention sub-layer, $Y$ denotes the feature matrix output by each transformer layer, and $\mathrm{LN}$ denotes layer normalization.
Finally, $g_{token}^{(i)}$ is the output of the STM for the ith frame, and represents the static feature of that frame.
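As a sketch of how Equations (3) and (4) and the token readout could be implemented, the block below builds one pre-norm transformer layer, stacks L = 6 of them (the depth given in Section 3.3), and reads out the feature at the token position as g_token. The GELU activation in the MLP and the other implementation details are assumptions. Reading out the fused feature at position 0 mirrors the class-token design of the ViT [26].

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One fusion-block layer: X' = X + MSA(LN(X)), Y = X' + MLP(LN(X'))."""
    def __init__(self, d: int = 32, heads: int = 4, d_inner: int = 128):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.msa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, d_inner), nn.GELU(), nn.Linear(d_inner, d))

    def forward(self, x):                                    # x: (batch, 4, d)
        h = self.ln1(x)
        x = x + self.msa(h, h, h, need_weights=False)[0]     # Equation (3)
        return x + self.mlp(self.ln2(x))                     # Equation (4)

# stack L = 6 layers as in Section 3.3 and read out the token at position 0
fusion_block = nn.Sequential(*[TransformerLayer() for _ in range(6)])
g_token = fusion_block(torch.randn(8, 4, 32))[:, 0]          # (8, 32) static feature
```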

3.2. Design of the TDN

In this section, we will take a closer look at the TDN in the STTDN. The structure of the TDN was as shown in Figure 5. In the TDN, we used five RNN cells (TDM) to obtain the dynamic features from five input video frames, and applied a fully connected layer to predict the gaze direction of the fifth input video frame.
The TDM was the core component in the TDN, its structure being as shown in Figure 6. In the TDM, we clearly defined the dynamic feature as sight movement, and we introduced differential information to define it. Liu et al. [30] initially proposed the differential information concept, which represents the difference between the gaze directions of two eye images. We generalized this idea to define the dynamic feature as sight movement using differential information: specifically, we generalized the differential information from eye images to face images, and from only two images to a sequence of video frames. A generalized definition of the dynamic feature can be expressed as follows:
$$d_i = \mathit{gaze}_i - \mathit{gaze}_{i-1} \qquad (5)$$
where $d_i$ denotes the sight movement from the $(i-1)$th frame to the $i$th frame, $\mathit{gaze}_i$ denotes the gaze direction of the $i$th frame, and $\mathit{gaze}_{i-1}$ denotes the gaze direction of the $(i-1)$th frame.
We could apply $d_i$ in the TDM to extract the dynamic features from the video frames. Compared to the LSTM cell, we kept only two gates in the TDM: the forget gate ($f_i$) and the output gate ($o_i$), the coefficients of both gates being determined by $d_i$. The proposed algorithm of the TDM is shown in Algorithm 1:
Algorithm 1: TDM
Input: Feature vectors of 5 frames, $gaze_{token} = \{g_{token}^{(1)}, g_{token}^{(2)}, g_{token}^{(3)}, g_{token}^{(4)}, g_{token}^{(5)}\}$
Output: The state of the hidden layer $h_5$
1:  procedure TDM($gaze_{token}$)
2:      Initialize two zero vectors $h_0 \in \mathbb{R}^{1 \times d}$, $c_0 \in \mathbb{R}^{1 \times d}$
3:      repeat
4:          for $i = 1$ to $5$ do
5:              Calculate $d_i$ based on $c_{i-1}$ and $g_{token}^{(i)}$ using (6)
6:              Calculate $f_i$ and $o_i$ based on $d_i$ using (7) and (8)
7:              Update $c_i$ based on $c_{i-1}$ and $g_{token}^{(i)}$ using (9)
8:              Update $h_i$ based on $h_{i-1}$ and $g_{token}^{(i)}$ using (10)
9:          end for
10:     until the eigenvector calculation is completed
11:     return $h_5$
12: end procedure
The definitions can be expressed as follows:
$$d_i = \tanh\!\left(w_1 c_{i-1} + w_1 g_{token}^{(i)} + b_1\right) \qquad (6)$$
$$f_i = \sigma\!\left(w_2 d_i + b_2\right) \qquad (7)$$
$$o_i = \sigma\!\left(w_3 d_i + b_3\right) \qquad (8)$$
$$c_i = c_{i-1} \odot f_i + \tanh\!\left(w_4 g_{token}^{(i)} + b_4\right) \odot (1 - f_i) \qquad (9)$$
$$h_i = h_{i-1} + d_i \odot o_i + \tanh\!\left(w_5 g_{token}^{(i)} + b_5\right) \odot (1 - o_i) \qquad (10)$$
where $w_1$, $w_2$, $w_3$, $w_4$, and $w_5$ represent weight matrices, $b_1$, $b_2$, $b_3$, $b_4$, and $b_5$ represent bias terms, $\sigma$ denotes the sigmoid function, and $\odot$ denotes element-wise multiplication.
Finally, h 5 was output by the last TDM in the TDN. A fully connected layer was implemented after the last TDM. The fully connected layer used h 5 to predict the gaze direction.
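Below is a minimal PyTorch sketch of the TDM cell (Equations (6)–(10)) and of the TDN that unrolls it over five static features before the fully connected prediction layer. Whether $w_1$ is shared between the two terms of Equation (6) and the two-dimensional (yaw, pitch) output of the final layer are assumptions made for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class TDMCell(nn.Module):
    """Temporal differential module cell, following Equations (6)-(10)."""
    def __init__(self, d: int = 32):
        super().__init__()
        self.w1_c = nn.Linear(d, d)   # acts on c_{i-1}; the paper writes w_1 for both terms
        self.w1_g = nn.Linear(d, d)   # acts on g_token^(i)
        self.w2 = nn.Linear(d, d)
        self.w3 = nn.Linear(d, d)
        self.w4 = nn.Linear(d, d)
        self.w5 = nn.Linear(d, d)

    def forward(self, g, c_prev, h_prev):
        d_i = torch.tanh(self.w1_c(c_prev) + self.w1_g(g))              # Eq. (6)
        f_i = torch.sigmoid(self.w2(d_i))                               # Eq. (7), forget gate
        o_i = torch.sigmoid(self.w3(d_i))                               # Eq. (8), output gate
        c_i = c_prev * f_i + torch.tanh(self.w4(g)) * (1 - f_i)         # Eq. (9)
        h_i = h_prev + d_i * o_i + torch.tanh(self.w5(g)) * (1 - o_i)   # Eq. (10)
        return c_i, h_i

class TDN(nn.Module):
    """Unrolls the TDM over the 5 static features g_token^(1..5) and predicts gaze."""
    def __init__(self, d: int = 32):
        super().__init__()
        self.cell = TDMCell(d)
        self.fc = nn.Linear(d, 2)     # assumed 2D (yaw, pitch) gaze output

    def forward(self, g_tokens):      # g_tokens: (batch, 5, d)
        b, t, d = g_tokens.shape
        c = g_tokens.new_zeros(b, d)  # c_0
        h = g_tokens.new_zeros(b, d)  # h_0
        for i in range(t):
            c, h = self.cell(g_tokens[:, i], c, h)
        return self.fc(h)             # gaze direction of the last (5th) frame

gaze = TDN()(torch.randn(8, 5, 32))   # torch.Size([8, 2])
```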

3.3. Topology of the STTDN Network

In this section, we supplied additional topology details. We integrated the convolutional neural networks and the multi-head self-attention fusion block in the STM. We implemented ResNet18 [41] as the convolutional neural network for feature extraction, with the number of feature maps set to 32—two ResNet18 networks were used to extract features from the eye and face images, respectively. An additional average-pooling down-sampling layer followed the convolutional layers, to ensure a consistent feature dimension. We then stacked six transformer encoders as the multi-head self-attention fusion block. The number of heads of each transformer encoder was 4, the input dimension of each transformer encoder was 32, and the inner dimension of each transformer encoder was 128. The TDM had fewer parameters: the hidden layer dimension was 32, and the number of layers was 1.
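One plausible way to realize the backbone described above is sketched below: a ResNet18 trunk (truncated before its classification head) followed by a 1 × 1 convolution to 32 feature maps and the additional average-pooling layer, shared across the two eye images and duplicated for the face image. The exact truncation point and the torchvision API used (weights=None requires torchvision ≥ 0.13) are assumptions, not details taken from the authors' code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def make_backbone(in_channels: int, d: int = 32) -> nn.Module:
    """ResNet18 trunk reduced to a d-dimensional feature vector (a sketch only)."""
    net = resnet18(weights=None)
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    trunk = nn.Sequential(*list(net.children())[:-2])   # drop avgpool and fc head
    return nn.Sequential(
        trunk,
        nn.Conv2d(512, d, kernel_size=1),                # reduce to 32 feature maps
        nn.AdaptiveAvgPool2d(1),                         # average-pooling down-sampling
        nn.Flatten(),                                    # (batch, d)
    )

face_net = make_backbone(in_channels=3)                  # 224x224 RGB face images
eye_net = make_backbone(in_channels=1)                   # shared for 36x60 grayscale eyes

f_face = face_net(torch.randn(2, 3, 224, 224))
f_left = eye_net(torch.randn(2, 1, 36, 60))
f_right = eye_net(torch.randn(2, 1, 36, 60))
print(f_face.shape, f_left.shape, f_right.shape)         # each torch.Size([2, 32])
```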

4. Experimental Results

In this section, we discuss the experimental performance of the proposed STTDN on two public datasets, MPIIFaceGaze [20] and Eyediap [42], and the effectiveness of the STM and the TDM. The remaining sections are organized as follows: first, we introduce the two public datasets used in this study (Section 4.1), and the evaluation metric (Section 4.2); then, we compare our proposed method with state-of-the-art methods, in Section 4.3, and analyze its performance in Section 4.4; in Section 4.5 and Section 4.6, we evaluate the effectiveness of the two main ideas in the STTDN—that is, (1) the way features are fused by the STM, and (2) the way dynamic features are extracted by the TDN; in Section 4.7, we discuss an ablation study conducted for the STTDN; finally, in Section 4.8, we analyze the computational complexity of the proposed model.
To allow readers to reproduce our proposed architecture and conduct further research, we offer several important parameters related to this study. We implemented our model using Pytorch, and evaluated it on two TITAN RTX platforms. The optimizer used was the AdamW optimizer [43], the loss function was the L1 loss, the number of epochs was set to 30, the initial learning rate was set to $5 \times 10^{-4}$, and the number of video frame inputs was set to 5.
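For reference, a minimal training-loop sketch with the reported settings (AdamW, L1 loss, 30 epochs, initial learning rate 5 × 10⁻⁴) is given below; the model and the data loader are placeholders standing in for the STTDN and the dataset loaders, not part of the original implementation.

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 2)   # placeholder standing in for the STTDN
train_loader = [(torch.randn(8, 32), torch.randn(8, 2)) for _ in range(10)]  # dummy batches

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)   # AdamW, initial lr 5e-4
criterion = nn.L1Loss()                                      # L1 loss on the gaze angles
num_epochs = 30                                              # 30 epochs, 5-frame inputs

for epoch in range(num_epochs):
    for frames, gaze_gt in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(frames), gaze_gt)
        loss.backward()       # back propagation
        optimizer.step()
```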

4.1. Datasets

To better evaluate the performance of the STTDN, we used two public datasets: (i) MPIIFaceGaze [20] and (ii) Eyediap [42]. Figure 7 shows examples of face images on the two public datasets, and their corresponding left eye and right eye images.
The MPIIFaceGaze dataset [20] was proposed by Zhang et al., and is the most popular dataset used for appearance-based gaze estimation methods. The MPIIFaceGaze dataset contains a total of 213,659 images collected from 15 subjects during several months of daily life without head pose constraints. Because the images come from the real-world environment, the dataset has abundant illumination and head pose scenes. We considered the two face images with the shortest time interval as two adjacent frames.
The Eyediap [42] dataset contains 94 video clips from 16 participants in experimental scenes: it contains three visual target segments—that is, continuous moving targets, screen targets, and floating balls. Each participant was recorded with six static-head postures and six free-head postures. As the data were collected in a laboratory environment, the images lacked illumination variation. As 2 subjects lacked screen-target video, we obtained images from 14 subjects for our study.
Based on the dataset pre-processing procedure in [32], we cropped out RGB face images with resolutions of 224 × 224 pixels, as well as grayscale left eye and right eye images with resolutions of 36 × 60 pixels.
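A small sketch of this cropping step is shown below, assuming OpenCV and already-known bounding boxes; how the face and eye boxes are obtained (face and landmark detection and normalization, following [32]) is outside the sketch, and the function name is illustrative.

```python
import cv2
import numpy as np

def crop_inputs(frame: np.ndarray, face_box, left_eye_box, right_eye_box):
    """Crop a 224x224 RGB face patch and 36x60 grayscale eye patches from one BGR frame.

    Boxes are (x, y, w, h) in pixels.
    """
    def crop(box, size, gray=False):
        x, y, w, h = box
        patch = frame[y:y + h, x:x + w]
        patch = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY if gray else cv2.COLOR_BGR2RGB)
        return cv2.resize(patch, size)                        # size is (width, height)

    face = crop(face_box, (224, 224))                         # 224x224 RGB face image
    left_eye = crop(left_eye_box, (60, 36), gray=True)        # 36x60 grayscale eye images
    right_eye = crop(right_eye_box, (60, 36), gray=True)
    return face, left_eye, right_eye
```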

4.2. Evaluation Metric

We used the leave-one-person-out criterion as the experimental evaluation protocol—a common choice in gaze estimation studies. Taking the MPIIFaceGaze dataset as an example, we used 14 subjects as the training dataset and the remaining 1 subject as the validation dataset, selected each of the 15 subjects as the validation subject in turn, and used the average error of the 15 experiments as the model performance. We used the angular error as the evaluation metric: the greater the angular error, the lower the accuracy of the model. The definition of the angular error can be expressed as follows:
$$L_{angular} = \arccos\!\left(\frac{g \cdot \hat{g}}{\|g\| \, \|\hat{g}\|}\right) \qquad (11)$$
where $g \in \mathbb{R}^3$ denotes the actual gaze direction, and $\hat{g} \in \mathbb{R}^3$ denotes the estimated gaze direction.
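A direct implementation of Equation (11) is straightforward; the sketch below converts the result to degrees and clamps the cosine against rounding error, both of which are implementation choices rather than part of the paper's definition.

```python
import torch

def angular_error_deg(g: torch.Tensor, g_hat: torch.Tensor) -> torch.Tensor:
    """Equation (11): angle in degrees between true and estimated 3D gaze vectors.

    g and g_hat have shape (batch, 3).
    """
    cos = (g * g_hat).sum(dim=1) / (g.norm(dim=1) * g_hat.norm(dim=1))
    return torch.rad2deg(torch.acos(cos.clamp(-1.0, 1.0)))

# identical directions give an angular error of 0 degrees
print(angular_error_deg(torch.tensor([[0., 0., 1.]]), torch.tensor([[0., 0., 1.]])))
```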

4.3. Comparison with State-of-the-Art Methods

To evaluate the performance of the model, we selected several networks which had exhibited advanced performance, for comparison: Hybrid-ViT [27]; Gaze360 [22]; iTracker [11]; DilatedNet [19]; RTGene [33]; AFFNet [23]; and CANet [24]. We recorded the performance of the STTDN and the comparison methods, as shown in Table 1—a more intuitive representation being reflected in Figure 8.
The STTDN achieved an angular error of 3.73° on the MPIIFaceGaze dataset [20], which was highly competitive with the previous best method—that is, AFFNet [23], which also achieved an angular error of 3.73° on MPIIFaceGaze. The STTDN achieved an angular error of 5.02° on Eyediap [42], an improvement of 2.9% on the previous best method—that is, Hybrid-ViT [27], which achieved an angular error of 5.17° on the Eyediap dataset. It is evident that our model outperformed the other models, as shown in Figure 8.
We also analyzed the performance errors from two different perspectives: (1) the angular error of the STTDN on different experiment participants; (2) the angular error distribution of the STTDN at different gaze directions.

4.4. Performance Analysis

4.4.1. The Angular Error of the STTDN on Different Experiment Participants

First, we analyzed the error of the STTDN on different experiment participants. We recorded the angular error of each experiment participant, as shown in Figure 9. As shown in Figure 9a, the STTDN performed best on person ID p0 (2.4°), and worst on person ID p14 (4.74°), the difference being 2.34° (4.74°–2.4°). In Figure 9b, the STTDN performed best on person ID p14 (3.33°), and worst on person ID p7 (7.57°), the difference being 4.24° (7.57°–3.33°).
There was still a large difference among the different experiment participants, which prevented our model from performing better. This problem in the gaze estimation field is called the personal calibration problem. The calibration problem can be considered as a domain adaptation problem, where the training set is the source domain and the test set is the target domain. The proposed method did not use a calibration sample in the target domain: thus, this proposed method did not solve the personal calibration problem. For example, the facial contrast of person ID p8 in Figure 9b was quite different from the others, leading to a higher angular error. Moreover, the personal calibration problem exists in other gaze estimation methods, too. Compared to the Eyediap dataset [42], the difference was smaller on the MPIIFaceGaze dataset [20] (4.24° > 2.34°). We attribute this mainly to the following reasons: MPIIFaceGaze has a larger data scale and more trainable experiment participants, and it has richer illumination conditions. Improving the dataset could effectively alleviate the personal calibration problem, to a certain extent.

4.4.2. The Angular Error Distribution of the STTDN on Different Angles

We analyzed the angular error distribution of the STTDN at different gaze directions. We recorded the distribution of the gaze directions for the two datasets, as shown in Figure 10, and Figure 11 shows the corresponding distribution of the angular error of the STTDN at different gaze directions. The proposed method performed poorly at some extreme gaze directions, because the training datasets contained only a limited number of samples at extreme angles; conversely, the more concentrated the gaze direction distribution, the better the proposed model performed.

4.5. The Effectiveness of the STM

We applied the MSA method to the STM, to fuse the coarse-grained facial features and the fine-grained eye features. We evaluated the effectiveness of the fusion feature in this subsection. Specifically, we evaluated the effectiveness of: (1) adding fine-grained eye features; (2) fusing two differently-grained features, using the MSA method.
To evaluate the effectiveness of the fusion feature, we appended one fully connected layer after the STM, forming an end-to-end gaze estimation network called the static transformer network (STN). We evaluated the angular error of the STN, using two public datasets—MPIIFaceGaze [20] and Eyediap [42]—and recorded the results, as shown in Table 2. We also set up two comparison models—the STN-W/O self-attention and the STN-W/O eye patches—the angular errors of which were also recorded in Table 2. The specific implementation details of the STN-W/O self-attention and the STN-W/O eye patches were as follows:
(1)
STN-W/O self-attention: we removed the multi-head self-attention fusion block from the STN, to obtain the STN-W/O self-attention. At this stage, the fine-grained eye and coarse-grained facial features were connected and directly input to the fully connected layer, before the fully connected layer predicted the gaze direction.
(2)
STN-W/O eye patches: this model was approximately the same as the hybrid-ViT network architecture [27]. The hybrid-ViT [27] used only the face image as its input, its main structure also comprising two parts: Resnet18 and the transformer layers. The hybrid-ViT used Resnet18 to extract the feature map from the face image, before splitting the feature map into patches: finally, it input these patches into the transformer layers. The hyperparameters used in this model were kept consistent with those of the STN.
Table 2 shows that the STN achieved the best performance on both datasets—that is, an angular error of 5.07° on the Eyediap dataset, and 3.75° on the MPIIFaceGaze dataset. After removing the multi-head self-attention fusion block from the STN, the angular error of the STN-W/O self-attention increased by 0.24° on MPIIFaceGaze, and by 0.02° on the Eyediap dataset, demonstrating the effectiveness of the multi-head self-attention fusion block. Compared to the STN, the STN-W/O eye patches also exhibited different degrees of degradation on the two public datasets: specifically, the angular error of the STN-W/O eye patches increased by 0.25° on the MPIIFaceGaze dataset, and by 0.17° on the Eyediap dataset, demonstrating the effectiveness of adding fine-grained eye features.

4.6. The Effectiveness of the TDN

In this section, we explore the effectiveness of using dynamic features. We set up two comparison methods—that is, STM–LSTM and STM–BiLSTM. Specifically, we replaced the TDM with an LSTM and a BiLSTM, to obtain STM–LSTM and STM–BiLSTM, respectively. The number of input video frames was set to five. We recorded the angular errors of these models, as shown in Table 3. The STTDN performed best among these models. Compared to STM–LSTM, the STTDN improved by 4.6% and 4.1% on the MpiiFaceGaze dataset [20] and the Eyediap dataset [42], respectively; compared to STM–BiLSTM, the STTDN improved by 0.7% and 2.7%, respectively. This shows that the TDN extracts dynamic features better than the other common RNNs.
We also discovered another critical issue: STM–LSTM and STM–BiLSTM showed degradation on both the MpiiFaceGaze dataset [20] and the Eyediap dataset [42], compared to the STN, which did not add dynamic features. Unlike the STN, the STTDN, STM–LSTM, and STM–BiLSTM all used an RNN to extract dynamic features and predicted the gaze direction based on those dynamic features—in other words, they added the dynamic features to their models. Specifically, the STN achieved an angular error of 3.75° on the MpiiFaceGaze dataset in Table 2, while STM–BiLSTM and STM–LSTM achieved 3.76° and 3.91°, respectively, on the MpiiFaceGaze dataset; the STN achieved an angular error of 5.07° on the Eyediap dataset in Table 2, while STM–BiLSTM and STM–LSTM reached 5.16° and 5.24°, respectively, on the Eyediap dataset. However, this does not prove that adding dynamic features is ineffective, because the proposed STTDN performed better than the STN on both datasets. A reliable explanation for this degradation is that RNNs are difficult to train when they are used to extract ambiguously defined dynamic features, a phenomenon that is more obvious in non-lightweight models. Thus, the proposed TDN can extract better dynamic features and predict a more accurate gaze direction.

4.7. Ablation Study

We conducted ablation experiments to evaluate the effectiveness of the main modules—(1) the STM and (2) the TDN—in the STTDN. For this purpose, we set up two variant models—namely, the STTDN-W/O STM and STTDN-W/O TDN:
(1)
The STTDN-W/O STM: unlike the STTDN, we replaced the Resnet18 structure in the STM with four 3 × 3 convolutional layers and a global average down-sampling layer, and replaced the multi-head self-attention fusion block with a fully connected layer.
(2)
The STTDN-W/O TDN: unlike the STTDN, we removed the TDN from the STTDN. With the removal of the TDN, an external fully connected layer was implemented after the STM, to predict the gaze direction.
We recorded the angular errors of these variants and the STTDN on the two public datasets, as shown in Table 4. The angular error of the STTDN reached 3.73° on the MPIIFaceGaze dataset [20] and 5.02° on the Eyediap dataset [42] when the number of input frames was set to five. After removing the STM, the angular error of the STTDN-W/O STM reached 4.67° on the MPIIFaceGaze dataset [20] and 5.85° on the Eyediap dataset [42]. Both the STTDN-W/O STM and the STTDN-W/O TDN exhibited degraded performance compared to the STTDN, demonstrating the effectiveness of the STM and the TDN in the STTDN.

4.8. Computational Complexity

Our proposed model included a convolution structure, a transformer structure, and an RNN structure. In order to compute the complexity of the whole model, we defined the complexity calculation formulas of the three main structures. The computational complexity of the convolution structure can be expressed as follows:
$$\mathrm{Time} \sim O\!\left(\sum_{l=1}^{D} M_l^2 \times K_l^2 \times C_{l-1} \times C_l\right) \qquad (12)$$
where D is the network depth, l is the lth convolution layer, $M_l$ is the side length of the feature map output by the lth convolution layer, $K_l$ is the kernel side length of the lth convolution layer, $C_l$ is the number of channels output by the lth convolution layer, and $C_{l-1}$ is the number of channels output by the $(l-1)$th convolution layer.
The computational complexity of the RNN structure can be expressed as follows:
$$\mathrm{Time} \sim O\!\left(\sum_{l=1}^{D} n_l \times d_l^2\right) \qquad (13)$$
where D is the network depth, l is the lth RNN layer, $n_l$ is the number of RNN cells in the lth RNN layer, and $d_l$ is the input feature dimension of the lth RNN layer.
The computational complexity of the transformer structure can be expressed as follows:
$$\mathrm{Time} \sim O\!\left(\sum_{l=1}^{D} n_l^2 \times d_l\right) \qquad (14)$$
where D is the network depth, l is the lth transformer layer, $n_l$ is the number of tokens (the sequence length) of the lth transformer layer, and $d_l$ is the input feature dimension of the lth transformer layer.
We could compute the time complexity of the convolution part of our proposed model using (12), that of the multi-head self-attention fusion module using (14), and that of the TDN using (13). The overall time complexity of our proposed model was $O(N^2)$, where N was the number of input frames. Compared to a pure convolutional network (e.g., the STN-W/O Self-Attention), the computational complexity of the proposed model increased only because of (1) using the multi-head self-attention mechanism to fuse the different-grained features, and (2) extracting the dynamic features. Fortunately, the complexity did not grow N-fold when using N frames as input: we saved the fusion features of the most recent N−1 frames during operation, so they did not need to be recomputed for the next prediction, as sketched below.
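The sketch below illustrates this caching strategy: each incoming frame is passed through the STM once, its fusion feature is appended to a fixed-length buffer of the last N features, and the TDN is run on the buffered sequence. The streaming wrapper class and its method names are illustrative assumptions.

```python
from collections import deque
import torch

class StreamingSTTDN:
    """Caches the last N static features so each new frame needs only one STM pass."""
    def __init__(self, stm, tdn, n_frames: int = 5):
        self.stm, self.tdn, self.n = stm, tdn, n_frames
        self.buffer = deque(maxlen=n_frames)       # fusion features of previous frames

    @torch.no_grad()
    def step(self, face, left_eye, right_eye):
        self.buffer.append(self.stm(face, left_eye, right_eye))   # only the new frame
        if len(self.buffer) < self.n:
            return None                            # warm-up until N frames are cached
        g_tokens = torch.stack(list(self.buffer), dim=1)           # (batch, N, d)
        return self.tdn(g_tokens)                                  # gaze of the latest frame
```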

5. Conclusions

This study proposed a novel gaze estimation network, the STTDN, which integrated the STM and the TDM. We provided a multi-head self-attention fusion strategy, the STM, for fusing fine-grained eye features and coarse-grained facial features. Additionally, we defined a new dynamic feature (sight movement) and proposed an innovative RNN cell, the TDM, to obtain it. Through experimental evaluation, the STTDN demonstrated great competitiveness, compared to the state-of-the-art methods, on two publicly available datasets: MPIIFaceGaze and Eyediap. In future work, we will apply contrastive learning in gaze estimation, to solve the personal calibration challenge and to increase performance in extreme angle environments. In addition, we will apply the proposed STTDN to cognitive workload estimation, which represents the occupancy rate of human mental resources under working conditions.

Author Contributions

Y.L. was responsible for methodology, original draft preparation, review, and editing; L.H. was responsible for methodology, validation, review, and editing; J.C. was responsible for validation, review, and editing; X.W. was responsible for validation, review, and editing; B.T. was responsible for methodology, review, and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Guangxi Science and Technology Major Project (AA22068057), the Guangxi Natural Science Foundation (2022GXNSFBA035644, 2021GXNSFBA220039), and the National Natural Science Foundation of China (61903090).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, P.; Hou, X.; Duan, X.; Yip, H.; Song, G.; Liu, Y. Appearance-based gaze estimator for natural interaction control of surgical robots. IEEE Access 2019, 7, 25095–25110. [Google Scholar] [CrossRef]
  2. Mohammad, Y.; Nishida, T. Controlling gaze with an embodied interactive control architecture. Appl. Intell. 2010, 32, 148–163. [Google Scholar] [CrossRef]
  3. Vanneste, P.; Oramas, J.; Verelst, T.; Tuytelaars, T.; Raes, A.; Depaepe, F.; Van den Noortgate, W. Computer vision and human behaviour, emotion and cognition detection: A use case on student engagement. Mathematics 2021, 9, 287. [Google Scholar] [CrossRef]
  4. Fridman, L.; Reimer, B.; Mehler, B.; Freeman, W.T. Cognitive load estimation in the wild. In Proceedings of the 2018 Chi Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 21–26 April 2018; pp. 1–9. [Google Scholar]
  5. Ma, H.; Pei, W.; Zhang, Q. Research on Path Planning Algorithm for Driverless Vehicles. Mathematics 2022, 10, 2555. [Google Scholar] [CrossRef]
  6. Patney, A.; Kim, J.; Salvi, M.; Kaplanyan, A.; Wyman, C.; Benty, N.; Lefohn, A.; Luebke, D. Perceptually-based foveated virtual reality. In Proceedings of the ACM SIGGRAPH 2016 Emerging Technologies, Anaheim, CA, USA, 24–28 July 2016; pp. 1–2. [Google Scholar]
  7. Moral-Sánchez, S.N.; Sánchez-Compaña, M.T.; Romero, I. Geometry with a STEM and Gamification Approach: A Didactic Experience in Secondary Education. Mathematics 2022, 10, 3252. [Google Scholar] [CrossRef]
  8. Funes-Mora, K.A.; Odobez, J.M. Gaze estimation in the 3d space using rgb-d sensors. Int. J. Comput. Vis. 2016, 118, 194–216. [Google Scholar] [CrossRef]
  9. Huang, L.; Li, Y.; Wang, X.; Wang, H.; Bouridane, A.; Chaddad, A. Gaze Estimation Approach Using Deep Differential Residual Network. Sensors 2022, 22, 5462. [Google Scholar] [CrossRef]
  10. Li, Y.; Tan, B.; Akaho, S.; Asoh, H.; Ding, S. Gaze prediction for first-person videos based on inverse non-negative sparse coding with determinant sparse measure. J. Vis. Commun. Image Represent. 2021, 81, 103367. [Google Scholar] [CrossRef]
  11. Krafka, K.; Khosla, A.; Kellnhofer, P.; Kannan, H.; Bhandarkar, S.; Matusik, W.; Torralba, A. Eye tracking for everyone. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2176–2184. [Google Scholar]
  12. Recasens, A.; Khosla, A.; Vondrick, C.; Torralba, A. Where are they looking? Adv. Neural Inf. Process. Syst. 2015, 28, 199–207. [Google Scholar] [CrossRef]
  13. Xu, B.; Li, W.; Liu, D.; Zhang, K.; Miao, M.; Xu, G.; Song, A. Continuous Hybrid BCI Control for Robotic Arm Using Noninvasive Electroencephalogram, Computer Vision, and Eye Tracking. Mathematics 2022, 10, 618. [Google Scholar] [CrossRef]
  14. Guestrin, E.D.; Eizenman, M. General theory of remote gaze estimation using the pupil center and corneal reflections. IEEE Trans. Biomed. Eng. 2006, 53, 1124–1133. [Google Scholar] [CrossRef] [PubMed]
  15. Zhu, Z.; Ji, Q. Novel eye gaze tracking techniques under natural head movement. IEEE Trans. Biomed. Eng. 2007, 54, 2246–2260. [Google Scholar] [PubMed]
  16. Valenti, R.; Sebe, N.; Gevers, T. Combining head pose and eye location information for gaze estimation. IEEE Trans. Image Process. 2011, 21, 802–815. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  17. Alberto Funes Mora, K.; Odobez, J.M. Geometric generative gaze estimation (g3e) for remote rgb-d cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1773–1780. [Google Scholar]
  18. Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. Appearance-based gaze estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4511–4520. [Google Scholar]
  19. Chen, Z.; Shi, B.E. Appearance-based gaze estimation using dilated-convolutions. In Proceedings of the Asian Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2018; pp. 309–324. [Google Scholar]
  20. Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. Mpiigaze: Real-world dataset and deep appearance-based gaze estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 41, 162–175. [Google Scholar] [CrossRef] [Green Version]
  21. Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. It’s written all over your face: Full-face appearance-based gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 51–60. [Google Scholar]
  22. Kellnhofer, P.; Recasens, A.; Stent, S.; Matusik, W.; Torralba, A. Gaze360: Physically unconstrained gaze estimation in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 6912–6921. [Google Scholar]
  23. Bao, Y.; Cheng, Y.; Liu, Y.; Lu, F. Adaptive feature fusion network for gaze tracking in mobile tablets. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 9936–9943. [Google Scholar]
  24. Cheng, Y.; Huang, S.; Wang, F.; Qian, C.; Lu, F. A coarse-to-fine adaptive network for appearance-based gaze estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10623–10630. [Google Scholar]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  26. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  27. Cheng, Y.; Lu, F. Gaze estimation using transformer. arXiv 2021, arXiv:2105.14424. [Google Scholar]
  28. Zhou, X.; Lin, J.; Jiang, J.; Chen, S. Learning a 3D gaze estimator with improved Itracker combined with bidirectional LSTM. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 850–855. [Google Scholar]
  29. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  30. Liu, G.; Yu, Y.; Mora, K.A.F.; Odobez, J.M. A differential approach for gaze estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1092–1099. [Google Scholar] [CrossRef] [Green Version]
  31. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef] [Green Version]
  32. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  33. Fischer, T.; Chang, H.J.; Demiris, Y. Rt-gene: Real-time eye gaze estimation in natural environments. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 334–352. [Google Scholar]
  34. Cheng, Y.; Lu, F.; Zhang, X. Appearance-based gaze estimation via evaluation-guided asymmetric regression. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 100–115. [Google Scholar]
  35. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  36. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef] [Green Version]
  37. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  38. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  39. Wang, Z.; Chai, J.; Xia, S. Realtime and accurate 3D eye gaze capture with DCNN-based iris and pupil segmentation. IEEE Trans. Vis. Comput. Graph. 2019, 27, 190–203. [Google Scholar] [CrossRef] [PubMed]
  40. Wang, K.; Su, H.; Ji, Q. Neuro-inspired eye tracking with eye movement dynamics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 9831–9840. [Google Scholar]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  42. Funes Mora, K.A.; Monay, F.; Odobez, J.M. Eyediap: A database for the development and evaluation of gaze estimation algorithms from rgb and rgb-d cameras. In Proceedings of the Symposium on Eye Tracking Research and Applications, San Antonio, TX, USA, 22–24 March 2004; pp. 255–258. [Google Scholar]
  43. Loshchilov, I.; Hutter, F. Fixing weight decay regularization in adam. arXiv 2018, arXiv:1711.05101. [Google Scholar]
Figure 1. Two categories of gaze estimation methods: (1) appearance-based gaze estimation method; (2) model-based gaze estimation method.
Figure 2. The pipeline of the proposed static transformer temporal differential network.
Figure 3. The flow chart of the proposed static transformer temporal differential network.
Figure 4. The structure of the multi-head self-attention fusion block.
Figure 5. The structure of the temporal differential network (TDN).
Figure 6. The structure of the TDM.
Figure 7. Example face images and their corresponding left eye and right eye images on two public datasets.
Figure 8. Angular error of the STTDN, and comparison methods on the MPIIFaceGaze [20] and Eyediap [42] datasets. We recorded the angular error of the MPIIFaceGaze on the horizontal axis, and the angular error of Eyediap on the vertical axis. The performance was better when the position was closer to the bottom left corner.
Figure 9. The angular error of the STTDN among the different experiment participants: (a) records the angular error of the different experiment participants (p0–p14) on the MPIIFaceGaze dataset [20]; (b) records the angular error of the different experiment participants (p1–p16) on the Eyediap dataset [42]. The angular error is recorded on the vertical axis. Each experiment participant ID and its corresponding face image is recorded on the horizontal axis.
Figure 10. The data distribution of gaze direction on the two datasets: (a) MPIIFaceGaze [20]; (b) Eyediap [42]. The yaw of the gaze direction is recorded on the horizontal axis, and the pitch of the gaze direction on the vertical axis. The brighter the color of the heatmap, the more concentrated the gaze direction.
Figure 11. The angular error distribution of the STTDN on two datasets: (a) MPIIFaceGaze [20]; (b) Eyediap [42]. The yaw of the gaze direction is recorded on the horizontal axis, and the pitch of the gaze direction on the vertical axis. The brighter the color of the heatmap, the bigger the angular error.
Table 1. Angular error of the STTDN and comparison methods on the MpiiFaceGaze [20] and Eyediap [42] datasets.

Method            MpiiFaceGaze [20]   Eyediap [42]
Hybrid-ViT [27]   4.00°               5.17°
Gaze360 [22]      4.06°               5.36°
iTracker [11]     7.33°               7.13°
DilatedNet [19]   4.42°               6.19°
RTGene [33]       4.66°               6.02°
AFFNet [23]       3.73°               6.41°
CANet [24]        4.27°               5.27°
STTDN (ours)      3.73°               5.02°
Table 2. The angular error of the STN and its variants, STN-W/O Self-Attention and STN-W/O Eye Patches, on MpiiFaceGaze [20] and Eyediap [42].

Method                   MpiiFaceGaze [20]   Eyediap [42]
STN-W/O Self-Attention   3.99°               5.09°
STN-W/O Eye Patches      4.00°               5.24°
STN                      3.75°               5.07°
Table 3. The angular error of STM–LSTM, STM–BiLSTM, and the STTDN.

Method         MpiiFaceGaze [20]   Eyediap [42]
STM–BiLSTM     3.76°               5.16°
STM–LSTM       3.91°               5.24°
STTDN (ours)   3.73°               5.02°
Table 4. Angular errors of the STTDN, STTDN-W/O STM, and STTDN-W/O TDN on two public datasets: MpiiFaceGaze [20] and Eyediap [42].

Method          STM   TDN   MpiiFaceGaze [20]   Eyediap [42]
STTDN           ✓     ✓     3.73°               5.02°
STTDN-W/O STM   ×     ✓     4.67°               5.85°
STTDN-W/O TDN   ✓     ×     3.75°               5.07°
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
