#### 4.1.1. Datasets

To evaluate the proposed algorithms, we used several datasets that probe robustness to the variations encountered in real-life landmark detection. The datasets contain independent variations in pose, expression, illumination, background, occlusion and image quality. For instance, the 300W dataset [22] covers a wide range of head poses, and AFLW2000-3D [43] provides large-pose images with 3D landmark annotations. For training and validation, we used 300W-LP [23], a synthetically expanded version of 300W, as the basis for training our model. The model was then fine-tuned on the LFPW, HELEN and 300W datasets. To observe how well the network generalizes to unseen data, we evaluated on the AFLW2000-3D dataset without training on it, as presented in Table 3. In our evaluation experiments, we applied our proposed algorithm (Algorithm 1) to the following "in-the-wild" datasets:



**Table 3.** The list of face datasets used for training and testing.

#### 4.1.2. Data Augmentation

For data augmentation (e.g., randomly flipping, resizing and cropping images), the PyTorch framework [42] leaves the original input images untouched and returns a transformed copy at each batch generation.

To reduce overfitting, we artificially expanded the training data with random augmentations, including cropping, rotation, flipping, color jittering, scale noise and random occlusion. We rotated the input image by a random angle within ±50° and applied scale noise between 0.8 and 1.2. We also rescaled the longest side to 256 pixels, resulting in a 256 × *H* or *H* × 256 image, where *H* ≤ 256. A sketch of such a pipeline is shown below.
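The following is a minimal sketch of the image-side augmentation pipeline in PyTorch/torchvision. The rotation range, scale noise and longest-side resizing come from the text; the color-jitter strengths, crop behavior and random-erasing settings are illustrative assumptions. In a real landmark pipeline, the geometric transforms would also have to be applied to the landmark coordinates.

```python
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def resize_longest_side(img, size=256):
    # Scale so the longest side equals `size`, giving a 256 x H or H x 256 image.
    w, h = img.size
    scale = size / max(w, h)
    return TF.resize(img, (round(h * scale), round(w * scale)))

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                    # random flipping
    T.RandomAffine(degrees=50, scale=(0.8, 1.2)),     # rotation within +/-50 deg, scale noise
    T.ColorJitter(brightness=0.4, contrast=0.4,
                  saturation=0.4),                    # color jittering (assumed strengths)
    T.Lambda(lambda img: resize_longest_side(img)),   # longest side -> 256
    T.RandomCrop(256, pad_if_needed=True),            # pad/crop to the 256 x 256 input (assumed)
    T.ToTensor(),
    T.RandomErasing(p=0.5),                           # random occlusion (assumed settings)
])
```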

#### *4.2. Experimental Setting*

#### 4.2.1. Implementation Detail

We implemented our model with the open-source PyTorch framework [42], which provides dynamic computation graphs and GPU execution. First, we cropped each input image to 256 × 256 resolution; the network generates a set of response maps at the same resolution. We then converted the image's facial key points into heatmap key points using a 2D Gaussian kernel, whose standard deviation (sigma) in the ideal response map was set to 0.25. For training, we optimized the network parameters with RMSprop [50] using a momentum of 0.9 and a weight decay of $10^{-4}$. We trained our model for 100 epochs with an initial learning rate of $10^{-4}$, reduced to $10^{-5}$ after 50 epochs and to $10^{-6}$ after 80 epochs.
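A minimal sketch of the heatmap-generation step, assuming landmark coordinates are given in pixels of the 256 × 256 input and that sigma is expressed in the same units as in the text:

```python
import numpy as np

def landmarks_to_heatmaps(landmarks, size=256, sigma=0.25):
    """Render one Gaussian response map per landmark.

    landmarks: array of shape (K, 2) holding (x, y) pixel coordinates.
    Returns an array of shape (K, size, size) with a peak of 1 at each landmark.
    """
    xs = np.arange(size, dtype=np.float32)
    grid_x, grid_y = np.meshgrid(xs, xs)              # pixel coordinate grids
    maps = np.zeros((len(landmarks), size, size), dtype=np.float32)
    for k, (x, y) in enumerate(landmarks):
        d2 = (grid_x - x) ** 2 + (grid_y - y) ** 2    # squared distance to landmark
        maps[k] = np.exp(-d2 / (2.0 * sigma ** 2))    # 2D Gaussian response
    return maps
```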

As the loss function of our network, we chose the Euclidean distance loss,

$$L(\Theta) = \frac{1}{N} \sum\_{i=1}^{N} ||Z(X\_i; \Theta) - Z\_i^{gt}||\_2^2 \tag{9}$$

where $N$ is the size of the training batch, $Z(X_i; \Theta)$ is the output generated by the DSC network with parameters $\Theta$, $X_i$ is the input image and $Z_i^{gt}$ is the ground-truth response map for $X_i$.

During training, $L(\Theta)$ measures the difference between the estimated and corresponding ground-truth feature maps and drives the updates of the weight parameters $\Theta$, ultimately identifying the set of parameters that makes $L(\Theta)$ as small as possible. A training-step sketch is given below.
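A hedged sketch of one training step under these settings; `DSCNetwork` is a stand-in name for the paper's model, and the learning-rate milestones follow the schedule above:

```python
import torch

model = DSCNetwork()  # hypothetical constructor for the DSC network
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4,
                                momentum=0.9, weight_decay=1e-4)
# Drop the learning rate by 10x at epochs 50 and 80 (1e-4 -> 1e-5 -> 1e-6);
# scheduler.step() is called once per epoch.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[50, 80], gamma=0.1)

def train_step(images, gt_maps):
    """One step of Eq. (9): batch mean of the squared L2 distance
    between predicted and ground-truth response maps."""
    optimizer.zero_grad()
    pred = model(images)                                   # (N, K, 256, 256)
    loss = ((pred - gt_maps) ** 2).sum(dim=(1, 2, 3)).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```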

#### 4.2.2. Evaluation

We evaluated accuracy with three popular metrics: the normalized mean error (NME), the cumulative error distribution (CED) curve and the area under the curve (AUC). The NME measures the distance between the detected landmark coordinates and the ground-truth facial landmark coordinates, averaged over all landmarks and normalized by the inter-ocular distance:

$$NME = \frac{1}{n} \sum\_{i=1}^{n} \frac{||\mathbf{x}\_i - \mathbf{x}\_i^{gt}||\_2}{d},\tag{10}$$

where $\mathbf{x}_i$ is the predicted coordinates and $\mathbf{x}_i^{gt}$ the ground-truth coordinates of the $i$th landmark, $d$ denotes the inter-ocular distance (the Euclidean distance between the two eye centres) and $n$ is the total number of facial landmarks. A minimal implementation is given below.
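A minimal implementation of Eq. (10) for a single image, assuming the inter-ocular distance $d$ has been computed from the ground truth:

```python
import numpy as np

def nme(pred, gt, d):
    """Normalized mean error (Eq. (10)).

    pred, gt: arrays of shape (n, 2) with predicted / ground-truth landmarks.
    d: inter-ocular distance (Euclidean distance between the eye centres).
    """
    return np.linalg.norm(pred - gt, axis=1).mean() / d
```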

The CED is the cumulative distribution function of the normalized error; an image whose error exceeds a threshold $l$ is reported as a failure. The CED at error $l$ is defined as

$$CED = \frac{N\_{NME \le l}}{n},\tag{11}$$

where $N_{NME \le l}$ is the number of images whose error $NME_i$ is no higher than $l$ and $n$ is the total number of test images.

The AUC summarizes the CED curve by integrating the percentage of images whose error lies under each threshold, up to an upper bound. It is defined as:

$$AUC\_{\alpha} = \int\_0^{\alpha} f(\epsilon) d\epsilon,\tag{12}$$

where $\epsilon$ is the normalized error, $f(\epsilon)$ is the CED function and $\alpha$ is the upper bound of the definite integral. Both metrics can be computed directly from the per-image NME values, as sketched below.
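A sketch of both metrics computed from per-image NME values; normalizing the integral by $\alpha$ so the AUC can be reported as a percentage is an assumption consistent with the values in Figure 7:

```python
import numpy as np

def ced(errors, l):
    """Fraction of images whose NME is no higher than l (Eq. (11))."""
    return (np.asarray(errors) <= l).mean()

def auc(errors, alpha=0.07, steps=1000):
    """Area under the CED curve up to alpha (Eq. (12)), normalized by alpha."""
    ts = np.linspace(0.0, alpha, steps)
    cdf = np.array([ced(errors, t) for t in ts])
    return np.trapz(cdf, ts) / alpha
```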

In this study, we present our evaluations using the mean error rate and CED curves. From the CED curves we calculated additional statistics such as the AUC up to an error of 0.07. CED curves for our experiments on the 300W and AFLW2000-3D testing sets are illustrated in Figure 7. As shown in the figure, the AUC is 72.49% for the 300W dataset and 65.99% for the AFLW2000-3D dataset.

**Figure 7.** Cumulative error distribution (CED) curve and area under the curve (AUC).

#### *4.3. Comparison with State-of-the-Art Algorithms*

#### 4.3.1. Comparison with LFPW Dataset

The LFPW dataset was designed to study faces under unconstrained conditions; it provides 811 images for training and 224 for testing. The images were collected from Google, Flickr and Yahoo using text queries.

Comparisons of different methods with the proposed method are listed in Table 4. Our proposed method substantially reduced the mean error rate: the second-best method in the table, CFSS [51], has a mean error of 4.87%, whereas ours is only 3.52%. Furthermore, compared to the SDM [5] method, which uses cascaded regression and has an error rate of 5.67%, our method lowers the error by 2.15 percentage points.


**Table 4.** Mean error in LFPW dataset.

#### 4.3.2. Comparison with HELEN Dataset

Similar to LFPW, the HELEN images were taken under unconstrained conditions at high resolution and collected from Flickr using text queries. The dataset contains 2000 images for training and 330 for testing.

Mean error comparisons of different methods on the HELEN dataset are presented in Table 5. Our method achieved the lowest mean error among all listed methods, 3.11%, compared to 4.60% for the second-best, TCDCN [55].

**Table 5.** Mean error on HELEN dataset.


#### 4.3.3. Comparison with 300W Dataset

300W is an extremely challenging dataset that is widely used to compare facial landmark detection algorithms under the same evaluation protocol. Table 6 presents the mean error rates on 300W. Our method achieved mean error rates of 3.60%, 8.69% and 3.90% on the common subset, challenging subset and full set, respectively. On the full set, it outperformed the previous methods with an error reduction of 0.46 percentage points over the second-best method, CPM [56]. On all three subsets, our method also improved significantly on the current state-of-the-art method DeFA [57], whose error rates of 5.37% (common), 9.38% (challenging) and 6.10% (full set) are all higher than ours. Example landmark detection results of our method on the common, challenging and full set subsets are illustrated in Figure 8.


**Table 6.** Mean error on 300W dataset.

**Figure 8.** Landmark detection examples from the 300W dataset.

#### 4.3.4. Comparison with AFLW2000-3D Dataset

The AFLW2000-3D dataset is intended for evaluating algorithms on large-pose faces. On this dataset, we compared our proposed method with several state-of-the-art methods, as presented in Table 7. The results show that our method achieved a mean error of 4.04%.

Compared to 3DSTN [60], our method reduced the mean error by 0.45 percentage points on AFLW2000-3D. The third-best result was DeFA [57], with an error rate of 4.50%. Our method thus achieved the lowest error on this dataset. Example landmark detection results of our method are illustrated in Figure 9.

**Figure 9.** Landmark detection examples from AFLW2000-3D dataset.


**Table 7.** Mean error on AFLW2000-3D dataset.
