**3. Methods**

The proposed facial landmark detection architecture is illustrated in Figure 3. We divide our approach into two connected sub-parts: the local appearance initialization (LAI) subnet and the dilated skip convolution (DSC) subnet for shape refinement. The LAI subnet performs heatmap regression followed by a kernel convolution to serve as a local detector of facial landmarks, and the DSC subnet refines the local predictions of the first subnet.

**Figure 3.** Overview of the proposed approach for facial landmark detection.

#### *3.1. Local Appearance Initialization Networks*

Facial landmark detection conventionally uses a single specific pixel location *p*(*x*, *y*) as the training label, where *x* and *y* are pixel coordinates in the 2D image. However, using a single-pixel point *p*(*x*, *y*) as the training label is inefficient for learning features from the input data. Even though the model may return a result close to the ground-truth pixel, a result that does not match the exact pixel location *p*(*x*, *y*) is considered wrong; as a result, the model may search for another pattern despite being close to the answer.

Recently, the Gaussian distribution has come into play for converting the training label into a Gaussian heatmap label. It modifies the training label, not as a single specific point *p*(*x*, *y*) but rather as probabilities near the given training label pixel. References [24,40] present several successful heatmap implementations in facial alignment. As presented by both papers, using heatmaps as training labels allows the network to learn faster. Furthermore, heatmaps show how the network is behaving during training, since they are visible to the naked eye. The correct point has the highest probability in the distribution, whereas neighboring pixels close to the correct pixel also have high probabilities, but not as high as that of the correct pixel. In Equation (1), the value of *p*ˆ*<sub>i</sub>* lets the network know whether or not it is making a guess close to the ground truth, rather than penalizing a guess that deviates by a small number of pixels. During training, the network weights *w* and biases *b* are learned in the predicted heatmaps *h<sub>i</sub>*(*p*; *w*, *b*).

$$
\hat{p}_i = \arg\max_p h_i(p; w, b). \tag{1}
$$

The output of a network will now be a continuous probability distribution on an input image plane, making it easier to see where the network's guess is confident; in contrast, having a single position as an output does not show how the network is guessing.
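As a minimal sketch of the decoding step in Equation (1) (NumPy-based; the function name is illustrative, not from the paper), recovering each landmark from its predicted heatmap reduces to a per-channel argmax:

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Recover landmark coordinates from predicted heatmaps (Equation (1)).

    heatmaps: array of shape (N, H, W), one channel per landmark.
    Returns an (N, 2) array of (x, y) pixel coordinates, one per landmark.
    """
    n, h, w = heatmaps.shape
    # Flatten each channel and take the argmax: p_hat_i = argmax_p h_i(p; w, b).
    flat_idx = heatmaps.reshape(n, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1)
```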

Our goal in the first part of the network is to obtain output feature maps that contain sufficient pixel-level detail, high-resolution outputs that remain the same size as the input image (no resolution loss) and less extensive computation. An FCN-based heatmap regression, followed by a kernel convolution, is used to meet this goal. To do so, we initially transform the ground-truth location *p<sup>gt</sup><sub>i</sub>*(*x*, *y*) of the *i*th key point into the target heatmap *h<sup>gt</sup><sub>i</sub>*(*p*) of the *i*th key point (Figure 4a) via a 2D Gaussian kernel (Equation (2)). Then, FC-DenseNets are trained to regress the target heatmaps *h<sup>gt</sup><sub>i</sub>*(*p*), and the output is finally convolved with a kernel convolution, as illustrated in Figure 4b. In fully convolutional heatmap regression fashion, the task becomes one of predicting the per-pixel likelihood of each key point's heatmap from the image. The network regresses the target heatmap of each landmark *h<sup>gt</sup><sub>i</sub>*(*p*) directly to obtain the response map *M*(*p*) stated in Equation (3), which has the same resolution as the input image.

We transform the ground-truth location *p<sup>gt</sup><sub>i</sub>*(*x*, *y*) to the target heatmap *h<sup>gt</sup><sub>i</sub>*(*p*) as

$$h_i^{gt}(p) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{\|p - p_i^{gt}\|^2}{2\sigma^2}\right), \quad p \in \Omega,\tag{2}$$

where *σ* is the standard deviation for the heatmaps used to control the response scope and Ω is the set of all pixel locations in image *I*.
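The label transformation of Equation (2) can be sketched as follows (a NumPy version; the function name and default σ are illustrative assumptions, not values from the paper):

```python
import numpy as np

def make_target_heatmap(pt, shape, sigma=2.0):
    """Turn a ground-truth landmark p_i^gt = (x, y) into a target heatmap
    via the 2D Gaussian of Equation (2), evaluated over all pixels in Omega.

    pt: (x, y) ground-truth coordinates; shape: (H, W) of image I.
    """
    h, w = shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    sq_dist = (xs - pt[0]) ** 2 + (ys - pt[1]) ** 2  # ||p - p_i^gt||^2
    return np.exp(-sq_dist / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))
```

The resulting map peaks at the ground-truth pixel and decays smoothly around it, which is exactly the soft supervision signal described above.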

We set the FC-DenseNet architecture to 56 layers following Reference [19], i.e., FC-DenseNet56 with 4 layers per dense block and a growth rate of 12. We adopted the smallest FC-DenseNet to reduce network computational complexity, as shown in Table 1, while still achieving notable outcomes compared with current popular architectures. We also applied fully convolutional ResNets with 50 layers (FC-ResNet50 [41]), available in the PyTorch framework [42] (Torchvision), and compared the outcomes with fully convolutional DenseNets with 56 layers (FC-DenseNet56). As expected, FC-DenseNet56 outperformed FC-ResNet50 due to more depth and hence more parameters.

**Figure 4.** Local appearance initialization network. (**a**) An example image with facial landmarks and the image's first 20 key points as heatmap key points; (**b**) local appearance initialization diagram.


**Table 1.** Architecture of FC-DenseNet56 used in the LAI network.

**Kernel Convolution**

The FC-DenseNet output is channel-wise, with the same resolution as the input image. After reaching the output resolution of the network, a 45 × 45 pixel kernel convolution *K<sub>σ</sub>* is applied to produce a clear shape output of the feature maps. For computational efficiency, the kernel convolution *K<sub>σ</sub>* was generated by the Gaussian function in Equation (2). Here, the kernel convolution filter acts as a point-spread function to blur the input feature maps, as shown in Figure 5. The kernel convolution filter *K<sub>σ</sub>* removes detail and noise and provides gentle smoothing while preserving the edges of the feature maps. Without the kernel convolution, the landmarks' sub-pixel positions are neglected [43].

The kernel convolution filter convolves with the entire image using grouped convolution [44], which allows for more efficient learning and improved representation. In grouped convolutions, each input channel is convolved with its own filter. The final output of the network is a set of heatmaps that contain the probability of each key point's presence at each pixel. With the convolved response maps *M*(*p*) = [*h<sup>gt</sup><sub>i</sub>*(*p*) | *i* = 1...*N*] and a kernel convolution filter *K<sub>σ</sub>*, we can obtain the density heatmap *H*<sup>0</sup> as follows:

$$H^0 = M(p) * K_\sigma. \tag{3}$$
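A minimal sketch of Equation (3) in pure NumPy (a real implementation would use a grouped convolution layer, e.g. PyTorch's `Conv2d` with `groups` equal to the number of channels; all names here are illustrative): each response-map channel is convolved with its own copy of the Gaussian filter *K<sub>σ</sub>*.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Build the kernel-convolution filter K_sigma from the Gaussian of
    Equation (2), normalized to sum to 1 (size=45 matches the text)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def grouped_blur(response_maps, kernel):
    """Equation (3): H^0 = M(p) * K_sigma, applied channel-wise
    (grouped convolution: each key-point channel gets its own filter)."""
    n, h, w = response_maps.shape
    ks = kernel.shape[0]
    pad = ks // 2  # zero padding keeps the output resolution unchanged
    out = np.zeros_like(response_maps)
    padded = np.pad(response_maps, ((0, 0), (pad, pad), (pad, pad)))
    for c in range(n):  # one filter per input channel
        for y in range(h):
            for x in range(w):
                out[c, y, x] = (padded[c, y:y + ks, x:x + ks] * kernel).sum()
    return out
```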

**Figure 5.** Best viewed in color. **Left**: Output of FC-DenseNets. **Middle**: Visualization of the kernel convolution filter (*K<sub>σ</sub>*). **Right**: Feature map after applying the filter (*K<sub>σ</sub>*).

#### *3.2. Dilated Skip Convolution Network for Shape Refinement*

To learn the spatial relationships between key points and make better guesses, the network must be able to view large portions of the input image. The portion of the input image viewed by the network is called the receptive field. Achieving a large receptive field with vanilla convolution filters [45] is challenging: it is computationally expensive and easily overfitted due to the vast number of parameters. This problem is usually tackled with pooling layers in conventional CNNs. Pooling layers choose one pixel from their field and discard the rest, thereby reducing the information and resolution of the input image. This degrades network performance because important information is lost when the resolution is decreased. Fortunately, dilated convolutions [37] solve this problem by using sparse kernels in place of alternating pooling and convolutional layers; the kernels are dilated with zeros, which increases the size of the receptive field without increasing the number of parameters. In practice, kernels with different dilation factors are convolved with the input, and the outputs of those kernels are concatenated for subsequent layers [9]. Subsequent layers thus lose no information from the input image and have fewer parameters with different receptive fields. Applying this concept, References [18,46] introduced a stack of dilated convolutions that enlarges the receptive field exponentially while keeping the number of parameters low. Inspired by this design, we constructed a dilated skip convolution network that combines seven consecutive zero-padded dilated convolutions with skip-connections to overcome the issue of scale variations. In the network, the dilation factors range from *d* = 1 to *d* = 32, as stated in Table 2.

This module was carefully designed to increase the performance of our dense prediction architecture and to ensure accurate spatial information by aggregating multi-scale contextual information. Our objective was to combine intermediate feature representations to learn global-context information and improve the final heatmap predictions. We exploited dilated convolutions to extract the global context from the input feature maps and then progressively updated the initial heatmap (*H*<sup>0</sup>). Due to their capacity to capture texture information at the pixel level, concatenating the dilated convolutions of sub-layers aids the network in extracting features from different scales concurrently. We also built extra skip-connections into our dilated convolution network to add global information from the entire image to the knowledge the network gained from the previous feature map (*H*<sup>τ−1</sup>[.]). During training, the skip-connections concatenate the output feature maps of the previous and current layers. Thus, our dilated skip convolution feature map *H<sup>DSC</sup>*[.], with current feature map *H<sup>τ</sup>*, previous feature map *H*<sup>τ−1</sup>[.], kernel filter *k*[.] and dilation factor *d*, is defined as:

$$H^{DSC}[x, y] = \sum_{i} \sum_{j} k[i, j] \cdot H^{\tau}[x - di,\, y - dj] + H^{\tau-1}[x, y].\tag{4}$$

Intuitively, Equation (4) shows that the model learns from each dilated convolution layer and the initial input heatmap, *H*<sup>0</sup>, providing robustness against appearance changes. This is achieved through skip-connections, which are extra connections between *H*<sup>0</sup> and dilated layers with different dilation factors, *d*. Consider the output feature map of the *n*th layer, *H<sup>n</sup>*, and a non-linear transformation of the *n*th layer, *T<sub>n</sub>*(.). At each stage, the kernel *k*[.] convolves with *H<sup>n</sup>* and then concatenates it with *H*<sup>0</sup>. Thus, from Table 2, the network from the initial to the final output feature map for a DSC subnet with 7 dilation factors can be formulated as

$$\begin{cases} H^1 = T_1(H^0) \\ H^2 = T_2([H^0, H^1]) \\ \vdots \\ H^7 = T_7([H^0, H^1, \dots, H^6]) \end{cases} \tag{5}$$

where [*H*<sup>0</sup>, ..., *H<sup>n</sup>*] denotes feature map concatenation.
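The dense stacking of Equation (5) can be sketched as follows (a toy 1D NumPy version; the actual subnet uses 2D dilated convolutions with the Table 2 factors, and the averaging stand-in for *T<sub>n</sub>* is an assumption made purely for illustration):

```python
import numpy as np

def dilated_conv1d(x, kernel, d):
    """Zero-padded 1D dilated convolution: the kernel taps are spaced
    d samples apart (Equation (4) without the skip term)."""
    ks = len(kernel)
    pad = d * (ks // 2)
    xp = np.pad(x, pad)
    return np.array([sum(kernel[j] * xp[i + j * d] for j in range(ks))
                     for i in range(len(x))])

def dsc_stack(h0, kernel, dilations=(1, 1, 2, 4, 8, 16, 32)):
    """Equation (5): layer n sees the concatenation [H^0, ..., H^(n-1)].
    Here T_n averages the accumulated maps (a placeholder for the learned
    transformation), then applies a dilated convolution with factor d."""
    feats = [h0]
    for d in dilations:
        merged = np.mean(feats, axis=0)  # stand-in for [H^0, ..., H^(n-1)]
        feats.append(dilated_conv1d(merged, kernel, d))
    return feats[-1]
```

Note that the output keeps the input resolution at every stage, which is what allows the final sum with *H*<sup>0</sup> in Equation (6).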

Rather than having the dilated skip convolution network predict the landmark locations from scratch, it is beneficial to refine the LAI subnet predictions. This is achieved by summing *H*<sup>0</sup> and *H<sup>DSC</sup>* to obtain the final feature map of the architecture in Equation (6),

$$H^f = H^0 + H^{DSC}.\tag{6}$$

To better understand how the heatmap is regressed in a real image, we transferred *H*<sup>0</sup>, *H<sup>DSC</sup>* and *H<sup>f</sup>* back to the shapes *S*<sup>0</sup>, *S<sup>DSC</sup>* and *S<sup>f</sup>*. Thus, Equation (6) becomes:

$$S^f = S^0 + S^{DSC}.\tag{7}$$

Figure 6 compares visualizations of the landmark coordinates (green dots) in the real face image for both stages. The landmark coordinates from Figure 6a are improved in the second stage; for example, the green dots with red circles in Figure 6b are located more accurately on the face contour, and no landmark on the left eyebrow is missed, in contrast to Figure 6a.

**Figure 6.** Dilated skip convolution network for shape refinement. (**a**) First stage: initial shape (*S*<sup>0</sup>) from the LAI subnet; (**b**) second stage: final shape refinement (*S<sup>f</sup>*).

Dilated convolutions thus offer a method to increase the global view on the input image exponentially; hence, the dilation factors should be set as exponential values following [35],

$$d_{(i+1)} = 2^i, \quad \text{for} \quad i = 0, 1, 2, \dots, n-2, \tag{8}$$

where *d*<sub>(*i*+1)</sub> is the dilation factor for the (*i* + 1)th layer and *n* is the number of layers. In this case, the dilated convolution has 7 layers; hence, the optimal dilation factors satisfy *d*<sub>(*i*+1)</sub> ≤ 32 for *i* = 0, 1, ···, 5. Table 2 shows dilation factors = 1, 1, 2, 4, 8, 16, 32, where the first two layers serve as conventional convolution layers.
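Reading Equation (8) together with the Table 2 sequence (one extra conventional layer with *d* = 1 before the exponential schedule), the factors can be generated by a small illustrative snippet (the function name is an assumption, not from the paper):

```python
def dilation_factors(n_layers):
    """Dilation factors per Equation (8): d_(i+1) = 2**i for i = 0..n-2,
    preceded by one conventional layer (d = 1), giving the Table 2
    sequence 1, 1, 2, 4, 8, 16, 32 for 7 layers."""
    return [1] + [2 ** i for i in range(n_layers - 1)]
```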


**Table 2.** Structure of dilated convolutions.

Table 2 compares the proposed method's mean error rate on the datasets, which should ideally be as small as possible. We therefore need to find the number of dilated layers most suitable for the entire network. Table 2 shows that the optimal number of dilated layers is 7: increasing the number of layers beyond that does not significantly improve the mean error rate while introducing more parameters, and aggressively widening the receptive field via the dilation factors would be detrimental to the local features of small objects.

**Algorithm 1** Dilated skip convolution for facial landmark detection

