*3.5. Learning*

Two losses, a landmark loss and a gaze loss, were required to train the proposed network. Regressing a heatmap, i.e., the probability that each feature point is present at each pixel, with a CNN requires fewer parameters than directly regressing the feature-point coordinates and is less prone to over-fitting. However, heatmap regression struggles to achieve sub-pixel precision because the heatmap is converted to coordinates through an arg-max operation, which yields only integer coordinates. We used integral regression [36] to compensate for both of these shortcomings. The integral regression module removes negative values by applying a ReLU operation to the heatmap and then divides by the total sum to normalize it. As shown in Equation (5), every value of *Ĥ* lies between 0 and 1 and the values sum to 1, so the normalized heatmap defines a probability distribution. The coordinates of each feature point can then be obtained as an expected value over this distribution.

$$\begin{aligned} \hat{H}_c(x, y) &= \frac{\mathrm{ReLU}(H_c(x, y))}{\sum_{i} \sum_{j} \mathrm{ReLU}(H_c(i, j))} \\ \text{predicted coordinates} &= \begin{cases} x_c = \sum_{i} \sum_{j} i\,\hat{H}_c(i, j) \\ y_c = \sum_{i} \sum_{j} j\,\hat{H}_c(i, j) \end{cases} \end{aligned} \tag{5}$$
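As an illustration of Equation (5), the following is a minimal NumPy sketch of the integral (soft-argmax) operation. The function name and the small stabilizer added to the denominator are our own and not part of the original formulation.

```python
import numpy as np

def soft_argmax(heatmap):
    """Integral regression (Eq. 5): ReLU-normalize a heatmap into a
    probability distribution, then take the expected (x, y) coordinate."""
    h = np.maximum(heatmap, 0.0)          # remove negative values (ReLU)
    h = h / (h.sum() + 1e-8)              # normalize so the map sums to 1
    ys, xs = np.mgrid[0:h.shape[0], 0:h.shape[1]]
    x_c = (xs * h).sum()                  # expected x coordinate
    y_c = (ys * h).sum()                  # expected y coordinate
    return x_c, y_c

# A peaked heatmap yields sub-pixel coordinates near the peak.
hm = np.zeros((96, 160))
hm[40, 70] = 1.0
hm[40, 71] = 0.5
print(soft_argmax(hm))   # x ≈ 70.33, y = 40.0
```

Unlike a hard arg-max, the expected-value operation returns fractional coordinates, which is the sub-pixel precision motivated in the text.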

Therefore, the final landmark cost function consists of the mean squared error (MSE) loss between the predicted and ground-truth heatmaps and the L1 loss between the ground-truth coordinates and the coordinates obtained through the expected-value operation. Here, *Ĥ* is the predicted heatmap, *H* is the ground-truth heatmap, (*x*′, *y*′) is the coordinate predicted through the integral module, and (*x*, *y*) is the ground-truth coordinate.

$$\text{Loss}_{\text{heatmap}} = \sum_{i} \sum_{x} \sum_{y} \parallel \hat{H}_i(x, y) - H_i(x, y) \parallel_2 \quad , \quad \text{Loss}_{\text{coordinates}} = \sum_{i} \parallel (x_i^{\prime}, y_i^{\prime}) - (x_i, y_i) \parallel_1 \tag{6}$$

$$\text{Loss}_{\text{landmark}} = \text{Loss}_{\text{heatmap}} + \text{Loss}_{\text{coordinates}}$$
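The combined landmark loss of Equation (6) can be sketched as follows. This is an illustrative NumPy version; the array names and shapes (one heatmap and one coordinate pair per landmark) are assumptions, not the paper's implementation.

```python
import numpy as np

def landmark_loss(pred_hm, gt_hm, pred_xy, gt_xy):
    """Landmark loss (Eq. 6): squared-error term over heatmaps plus an L1
    term over the coordinates obtained by integral regression.

    pred_hm, gt_hm : (num_landmarks, H, W) heatmaps
    pred_xy, gt_xy : (num_landmarks, 2) coordinates
    """
    loss_heatmap = np.sum((pred_hm - gt_hm) ** 2)        # summed over landmarks and pixels
    loss_coordinates = np.sum(np.abs(pred_xy - gt_xy))   # L1 over landmark coordinates
    return loss_heatmap + loss_coordinates
```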

To compare gaze-estimation performance, experiments were conducted using several methods. There are two commonly used formulations of gaze regression: the first directly regresses a 3D vector, and the second encodes the 3D unit gaze vector as a 2D angle pair of pitch (*θ*) and yaw (*ϕ*). Pitch and yaw are the angles between the pupil and the eyeball, which describe their positional relationship; this relationship is illustrated in Figure 6. Empirically, we found that the 2D angle encoding generalized better. We tested both the cosine distance loss and MSE as cost functions, and MSE gave the best performance. (Pitch, yaw), that is, (*θ*′, *ϕ*′), denotes the predicted 2D gaze and (*θ*, *ϕ*) the ground-truth 2D gaze.

$$\begin{aligned} \text{pitch}(\theta) &= \arcsin(y), \quad \text{yaw}(\varphi) = \arctan\left(\frac{x}{z}\right) \\ \text{Loss}_{\text{gaze}} &= \left\| (\theta^{\prime}, \varphi^{\prime}) - (\theta, \varphi) \right\|_{2}^{2} \end{aligned} \tag{7}$$
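A minimal sketch of the angle encoding and gaze loss in Equation (7) is given below. The exact sign and axis conventions of the gaze coordinate frame are not stated here, so this follows the formulas as written; `arctan2` is used only to handle the quadrant of x/z.

```python
import numpy as np

def encode_gaze(v):
    """Encode a unit 3D gaze vector (x, y, z) as (pitch, yaw) per Eq. (7)."""
    x, y, z = v
    pitch = np.arcsin(y)
    yaw = np.arctan2(x, z)   # arctan(x / z) with the quadrant resolved
    return pitch, yaw

def gaze_loss(pred_py, gt_py):
    """Squared L2 loss between predicted and ground-truth (pitch, yaw)."""
    d = np.asarray(pred_py) - np.asarray(gt_py)
    return float(np.dot(d, d))
```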

We trained our model on the UnityEyes dataset, using 80,000 images for training and 10,000 images each for validation and testing. We used grayscale 1 × 160 × 96 images and set the batch size to 16. We used the Adam optimizer, and the learning schedule followed the settings in [15]. The network was initialized with weights pre-trained on ImageNet. The base learning rate was set to 4 × 10<sup>−4</sup> and was decayed every 25 epochs. The experiments were run on a PC with an Intel Core i9-11900K 3.5 GHz CPU and an NVIDIA RTX 3090 GPU with 24 GB of memory.
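For concreteness, the optimizer setup described above could look like the following PyTorch-style sketch. The step-decay factor (`gamma`) is not stated in the text and is an assumption here, and `model` is only a placeholder for the actual gaze network.

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder; the real model is the gaze network
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)   # base learning rate 4e-4
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.1)  # decay every 25 epochs (factor assumed)

batch_size = 16               # as stated in the text
input_shape = (1, 160, 96)    # grayscale input size stated in the text
```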

**Figure 6.** (**a**) illustrates the simple network architecture for gaze estimation and (**b**) shows the relationship between the pupil and the eyeball. The gray embedding vector encodes the landmark coordinates. The gaze vector (red) can be described by pitch (*θ*) and yaw (*ϕ*).

#### **4. Description of the Dataset**

This section describes the dataset used for network training and evaluation. Figure 7 shows the original forms of the utilized datasets.

**Figure 7.** Samples from two datasets: left is UnityEyes and right is MPIIGaze.
