4.2.2. Training the CNN

The 2-D CNN-based gaze estimator had a relatively shallow and compact structure with only two CNN layers, two subsampling layers, and two fully connected (FC) layers, as summarized in Table 1. This compactness boosted the system's computational efficiency during training and, most importantly, during real-time classification. In this configuration, the subsampling factor (in both dimensions) of the last subsampling layer was adaptively set to 13 so that its output was a stack of 1 × 1 feature maps (i.e., scalar features).


**Table 1.** CNN hyperparameters.

The network had 16 and 12 filters (i.e., hidden neurons) in the first and second CNN layers, respectively, and 16 hidden neurons in the hidden FC layer. The output layer had 4 neurons, corresponding to the number of classes. RGB images were fed to the 2-D CNN, so the input depth was 3 channels, each a 64 × 64 frame. Training was conducted by means of backpropagation (BP) with three stopping criteria: training stopped when the training classification error dropped below 1%, when the number of BP iterations reached 100, or when the loss gradient fell below 0.001 (i.e., the optimizer had converged).
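
For concreteness, a minimal PyTorch sketch of such a network is given below. The filter counts (16 and 12), the FC sizes (16 and 4), and the 3 × 64 × 64 input follow the text; the kernel sizes (5 × 5 and 3 × 3), the tanh activations, and the use of average pooling are assumptions, chosen so that the second stack of feature maps is 13 × 13 and the final subsampling factor of 13 indeed yields 1 × 1 outputs.

```python
import torch
import torch.nn as nn

class GazeCNN(nn.Module):
    """Sketch of the shallow 2-D CNN gaze estimator.

    Filter counts (16, 12), FC sizes (16, 4), and the 3x64x64 input
    follow the text; kernel sizes, activations, and pooling type are
    assumptions consistent with the stated subsampling factor of 13.
    """
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5),   # 3x64x64 -> 16x60x60 (assumed 5x5 kernels)
            nn.Tanh(),
            nn.AvgPool2d(4),                   # 16x60x60 -> 16x15x15 (assumed factor 4)
            nn.Conv2d(16, 12, kernel_size=3),  # 16x15x15 -> 12x13x13 (assumed 3x3 kernels)
            nn.Tanh(),
            nn.AvgPool2d(13),                  # 12x13x13 -> 12x1x1 (factor 13, per the text)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                      # 12 scalar features
            nn.Linear(12, 16),                 # hidden FC layer with 16 neurons
            nn.Tanh(),
            nn.Linear(16, n_classes),          # 4 output neurons, one per gaze class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```

A forward pass on a batch of `N` 64 × 64 RGB frames, e.g. `GazeCNN()(torch.randn(N, 3, 64, 64))`, then yields an `N × 4` matrix of class scores.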

The learning rate ε was initialized to 0.001, and global adaptation was then carried out at each BP iteration: ε was increased by 5% for the next iteration if the training loss decreased, and reduced by 30% otherwise.
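
The sketch below shows one way this globally adapted BP loop and the three stopping criteria could be implemented. The full-batch gradient step, the choice of loss function, and the use of the gradient norm for the convergence test are assumptions, as the text does not specify these details.

```python
import torch

def train(model, loss_fn, inputs, targets,
          lr=1e-3, max_iters=100, min_error=0.01, min_grad=1e-3):
    """BP loop with global learning-rate adaptation (sketch).

    Full-batch gradient descent and the gradient-norm convergence
    test are assumptions; the adaptation rule (+5% / -30%) and the
    three stopping criteria follow the text.
    """
    prev_loss = float("inf")
    for _ in range(max_iters):                   # criterion 2: at most 100 BP iterations
        model.zero_grad()
        logits = model(inputs)
        loss = loss_fn(logits, targets)
        loss.backward()

        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.cat([g.flatten() for g in grads]))
        if grad_norm < min_grad:                 # criterion 3: loss gradient below 0.001
            break

        with torch.no_grad():
            for p in model.parameters():         # plain gradient-descent update
                p -= lr * p.grad
            error = (logits.argmax(1) != targets).float().mean()
        if error <= min_error:                   # criterion 1: training error below 1%
            break

        # Global adaptation: grow the learning rate by 5% when the
        # training loss decreased, shrink it by 30% otherwise.
        lr = lr * 1.05 if loss.item() < prev_loss else lr * 0.70
        prev_loss = loss.item()
    return model
```

Because the adaptation is global, a single scalar ε scales every weight update; per-parameter schemes (e.g., Adam) would behave differently and are not what the text describes.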
