**3. Methodology**

#### *3.1. Overview*

Given 3D facial geometry data *G*, 3D facial landmark detection is the task of locating *N* pre-defined fiducial points, including the eye corners, nose tip, mouth corners and so on. We denote the coordinates of the 3D facial landmarks as the matrix *S*:

$$S = \begin{pmatrix} x_1 & x_2 & \dots & x_N \\ y_1 & y_2 & \dots & y_N \\ z_1 & z_2 & \dots & z_N \end{pmatrix}, \tag{1}$$

where *N* is the pre-defined number of landmarks. Equivalently, *S* can be written as a function of the 2D parametrization (*u*, *v*):

$$S = \begin{pmatrix} x(u_1, v_1) & x(u_2, v_2) & \dots & x(u_N, v_N) \\ y(u_1, v_1) & y(u_2, v_2) & \dots & y(u_N, v_N) \\ z(u_1, v_1) & z(u_2, v_2) & \dots & z(u_N, v_N) \end{pmatrix}, \tag{2}$$

where *x*, *y* and *z* represent the *x*, *y*, *z* coordinate maps for each pair (*u*, *v*). Given 3D facial data, our goal is to accurately estimate the (*u*, *v*) of each landmark.

For this purpose, we propose transforming 3D face landmark estimation into detecting the landmarks on five types of 2D facial attribute maps, including the shape index map, the normal maps and the original range map, all computed from the 3D geometry data. A novel framework, shown in Figure 1, is then presented to achieve this goal accurately and efficiently. Based on a coarse-to-fine strategy, the framework comprises two main parts: one for global estimation and the other for local refinement. Specifically, the global estimation phase locates the landmarks roughly from the fused global feature extracted from these attribute maps. The local refinement stage then learns the nonlinear mapping from the fused local feature, extracted from a local patch around each globally estimated landmark, to the residual distance.

In the global estimation phase, the goal is to locate the landmarks roughly, yet more robustly and accurately than the mean shape. To train this model, instead of applying handcrafted features, we use a pre-trained deep network to extract a global feature from each facial attribute map and then concatenate them into a fused feature. Given the fused feature, the regression model estimates the global landmarks directly. With the trained model, the global landmarks are obtained roughly but robustly, which lays the foundation for the local refinement.

After global estimation with the fused global feature, we obtain the initialization shape. The initialization shape is more robust and accurate than the mean shape; however, it is still not satisfactory. The refinement stage is therefore designed to refine the global estimation. We extract local CNN features from the cropped local patches around the global landmarks and then learn the mapping from the fused local feature to the residual distance between the previous landmarks and the ground truth.

#### *3.2. Facial Attribute Maps*

To comprehensively describe the geometric information of the 3D data, five types of facial attribute maps are constructed: three surface normal maps *Nx*, *Ny*, *Nz*, the curvature feature map *SI*, and the range data *R*. Among these, the surface curvature and normal maps are the most significant features for 3D object detection, recognition and other 3D tasks. Figure 2 shows the five types of facial attribute maps computed from the original 3D facial geometry data.

**Figure 2.** The five facial attribute maps: three surface normal maps *Nx*, *Ny*, *Nz*, the curvature feature map *SI* and the range map *R*.

#### 3.2.1. Surface Curvature Feature

Surface curvature features have been adopted for 3D face landmark estimation in many studies; indeed, surface curvature is among the most significant features in 3D object detection, recognition and other 3D tasks. This paper therefore chooses the shape index feature map as the first facial attribute.

The shape index is a continuous mapping of the principal curvature values (*kmax*, *kmin*) at a 3D surface point *p*. Given the two principal curvatures (*kmax*, *kmin*), the shape index, which describes different shape classes as a single number ranging from 0 to 1, is calculated as:

$$SI(p) = \frac{1}{2} - \frac{1}{\pi} \arctan\left(\frac{k_{\max} + k_{\min}}{k_{\max} - k_{\min}}\right). \tag{3}$$
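As a minimal NumPy sketch of Equation (3), the following computes the shape index from principal curvature maps; the curvatures themselves are assumed to have been estimated beforehand (e.g., by local quadric fitting), and the small `eps` guard at umbilic points (*kmax* = *kmin*) is our addition, not a detail from the paper:

```python
import numpy as np

def shape_index(k_max, k_min, eps=1e-8):
    """Shape index of Equation (3): maps principal curvatures to [0, 1].

    k_max, k_min: arrays of principal curvatures with k_max >= k_min.
    eps guards the division at umbilic points where k_max == k_min.
    """
    return 0.5 - (1.0 / np.pi) * np.arctan((k_max + k_min) / (k_max - k_min + eps))
```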

#### 3.2.2. Surface Normal Maps

Consider normalized 3D facial geometry data *G*, denoted as an *m* × *n* × 3 matrix:

$$G = [P_{uv}(x, y, z)]_{m \times n} = [p_{uvx}, p_{uvy}, p_{uvz}]_{1 \le u \le m,\ 1 \le v \le n}, \tag{4}$$

where *Puv*(*x*, *y*, *z*) denotes the corresponding 3D point coordinates of the facial geometry data. The corresponding surface normal maps are represented as:

$$N(I_g) = N[P_{uv}(x, y, z)]_{m \times n} = [N(p_{uvx}), N(p_{uvy}), N(p_{uvz})]_{1 \le u \le m,\ 1 \le v \le n}. \tag{5}$$

In this paper, a local plane fitting method is applied to compute *N*(*Ig*), which consists of three *m* × *n* matrices. In other words, for each point in the 3D facial geometry data, the surface normal vector is computed by fitting the local plane:

$$S_{uv}: N_{uvx} q_{uvx} + N_{uvy} q_{uvy} + N_{uvz} q_{uvz} = d, \tag{6}$$

where (*quvx*, *quvy*, *quvz*) represents any point within the local neighbourhood of the point *puv*, and the normal is constrained to unit length, ‖(*Nuvx*, *Nuvy*, *Nuvz*)‖<sub>2</sub> = 1. In this paper, a 5 × 5 neighbourhood window is adopted, and three normal maps are obtained, denoted as *Nx*, *Ny*, *Nz*.
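A minimal NumPy sketch of this per-pixel plane fit follows; the total-least-squares solution (the normal is the singular vector of the smallest singular value of the centred neighbourhood) and the orientation fix toward +*z* are standard choices we assume here, not details given in the paper:

```python
import numpy as np

def normal_maps(G, win=5):
    """Per-pixel surface normals by local plane fitting (Equation (6)).

    G: (m, n, 3) array of 3D points P_uv(x, y, z).
    Returns Nx, Ny, Nz as (m, n) maps of unit-length normals;
    border pixels are left as zero in this sketch.
    """
    m, n, _ = G.shape
    r = win // 2
    N = np.zeros_like(G)
    for u in range(r, m - r):
        for v in range(r, n - r):
            # Neighbourhood points q_uv, flattened to (win*win, 3) and centred.
            q = G[u - r:u + r + 1, v - r:v + r + 1].reshape(-1, 3)
            q = q - q.mean(axis=0)
            # Total least squares: the plane normal is the right singular
            # vector associated with the smallest singular value.
            _, _, vt = np.linalg.svd(q, full_matrices=False)
            nrm = vt[-1]
            if nrm[2] < 0:   # orient normals consistently toward the sensor
                nrm = -nrm
            N[u, v] = nrm    # unit length by construction
    return N[..., 0], N[..., 1], N[..., 2]
```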

#### *3.3. Global Estimation*

As illustrated by the proposed method, the five types of attribute maps shown in Figure 2 are fed into the neural network to estimate the landmarks roughly. Consider the computed feature maps, denoted as the shape index *SI*, the normal maps *Nx*, *Ny*, *Nz* and the original range map *R*, and let *Sg*(*x*) ∈ R<sup>2*N*×1</sup> represent the ground truth of the *N* landmarks. The goal of our global model is to learn the mapping function *F* from the fused feature maps to the ground truth coordinates:

$$S_g(x) \leftarrow F(SI, N_x, N_y, N_z, R). \tag{7}$$

Given the limited amount of training data, training a global CNN model directly is prone to over-fitting. To overcome this limitation, fine-tuning a pre-trained deep model is employed to learn *F*: the parameters of the pre-trained model are fixed, and only the last layer is trained. The *SI*, *Nx*, *Ny*, *Nz*, *R* maps are then fed into the pre-trained model (e.g., VGG (Visual Geometry Group)-net in this paper) separately. In general, the pre-trained deep CNN model can be regarded as a special feature extractor, written as *v* = *DNN*(*Map*), where *DNN* represents the fixed part of the pre-trained model, *Map* denotes the resized facial attribute map, and *v* is the extracted feature vector of each attribute map. The fused feature and a regression layer then produce a coarse *S*0 as the result of the first step. The deep models all comprise three main parts: convolutional layers, pooling layers and fully connected layers.
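The feature-extraction step can be sketched as follows in PyTorch (an assumption; the paper does not name a framework). Here VGG-16 with its final classification layer removed yields a 4096-dimensional vector *v* per attribute map, matching the 4096-dimensional features referred to in Section 3.4:

```python
import torch
import torchvision.models as models

# Fixed pre-trained VGG-16 used as the feature extractor v = DNN(Map).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
extractor = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(),
    *list(vgg.classifier.children())[:-1],   # drop the 1000-way classifier
)
for p in extractor.parameters():
    p.requires_grad = False                  # pre-trained parameters stay fixed
extractor.eval()

def fused_feature(maps):
    """maps: five attribute maps, each a (1, 3, 224, 224) tensor
    (single-channel maps replicated across three channels).
    Returns the concatenated feature V = [v_SI, v_Nx, v_Ny, v_Nz, v_R]."""
    with torch.no_grad():
        return torch.cat([extractor(m) for m in maps], dim=1)  # (1, 5 * 4096)
```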

#### • Convolutional Layer and ReLU Non-linearity.

Through a set of designed and learnable filters, the convolutional layer transforms the input images or activation maps into another set of activation maps. Specifically, given the activation maps from the previous layer *y*<sup>*l*−1</sup> ∈ R<sup>*W*<sub>*l*−1</sub> × *H*<sub>*l*−1</sub> × *D*<sub>*l*−1</sub></sup> and *K*<sub>*l*</sub> convolutional filters, each of size *W*<sub>*f*</sub> × *H*<sub>*f*</sub> × *D*<sub>*l*−1</sub>, a set of activation maps *y*<sup>*l*</sup> ∈ R<sup>*W*<sub>*l*</sub> × *H*<sub>*l*</sub> × *D*<sub>*l*</sub></sup> is computed and output at layer *l*. Let the stride be *S*; then *W*<sub>*l*</sub> = (*W*<sub>*l*−1</sub> − *W*<sub>*f*</sub> + 2*P*)/*S* + 1 and *H*<sub>*l*</sub> = (*H*<sub>*l*−1</sub> − *H*<sub>*f*</sub> + 2*P*)/*S* + 1, where *P* is the padding. An activation function *ϕ* is then applied to make the result nonlinear. In this paper, the rectified linear unit (ReLU), denoted as *ϕ*(*x*) = max(0, *x*), is used. Thus, the output of layer *l* is:

$$y^l = \varphi(W_l * y^{l-1} + b_l), \tag{8}$$

where *b*<sub>*l*</sub> denotes the bias term and ∗ denotes the convolution operator.
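As a small worked check of the size formula above (the 224 × 224 input matches the resized maps of Section 3.4; the 3 × 3 filter, stride and padding values are illustrative):

```python
def conv_output_size(w_in, h_in, w_f, h_f, stride=1, padding=0):
    """Spatial output size of a convolutional layer:
    W_l = (W_{l-1} - W_f + 2P) / S + 1, and likewise for the height."""
    return ((w_in - w_f + 2 * padding) // stride + 1,
            (h_in - h_f + 2 * padding) // stride + 1)

# A 224x224 input with 3x3 filters, stride 1 and padding 1 keeps 224x224:
assert conv_output_size(224, 224, 3, 3, stride=1, padding=1) == (224, 224)
```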

#### • Fully Connected Layers.

This layer reshapes the feature maps into a feature vector. The hidden layers are fully connected, meaning that each unit in a previous layer is connected to each unit in the next layer. Suppose the global network has *L* convolutional layers in total, so the feature maps of the last convolutional layer are represented as *y*<sup>*L*</sup> ∈ R<sup>*W*<sub>*L*</sub> × *H*<sub>*L*</sub> × *D*<sub>*L*</sub></sup>. Let the (*L* + 1)-th layer be the fully connected layer, with the output of layer *L* as the input of layer *L* + 1, of size *y*<sup>*L*+1</sup> ∈ R<sup>*K*</sup>, where *K* = *W*<sub>*L*</sub> × *H*<sub>*L*</sub> × *D*<sub>*L*</sub>. Thus, this layer is equal to:

$$y^{L+1} = \mathrm{reshape}(y^L). \tag{9}$$

Then, the next fully connected layer will be:

$$y^{L+2} = \varphi(W^{L+1} y^{L+1} + b^{L+1}), \tag{10}$$

where *W*<sup>*L*+1</sup> is the weight matrix of the (*L* + 1)-th layer and *b*<sup>*L*+1</sup> is the bias term. *ϕ* denotes the tanh activation function.
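A minimal sketch of Equations (9) and (10) under the same PyTorch assumption (the layer sizes below are illustrative, not taken from the paper):

```python
import torch

W_L, H_L, D_L = 7, 7, 512            # illustrative sizes of the last conv layer
K = W_L * H_L * D_L
fc = torch.nn.Linear(K, 4096)        # weights W^{L+1} and bias b^{L+1}

y_L = torch.randn(1, D_L, W_L, H_L)  # activation maps y^L from layer L
y_L1 = y_L.reshape(1, -1)            # Equation (9): y^{L+1} = reshape(y^L)
y_L2 = torch.tanh(fc(y_L1))          # Equation (10) with phi = tanh
```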

#### • Objective Function.

After feature extraction for each facial attribute map is performed separately, the feature vectors are concatenated as *V* = [*vSI*, *vNx*, *vNy*, *vNz*, *vR*] to train the global model *F*. Specifically, by training the designed neural network, our target is formulated as solving the objective function:

$$\arg\min \left\| S_g - F(V) \right\|_2^2, \tag{11}$$

where *F* is the nonlinear regression function from *V* to the landmarks *Sg*, denoted as *F* = *σ*(*W*<sup>*T*</sup>*V* + *b*), where *σ* represents a nonlinear activation function such as sigmoid, tanh or ReLU. In this paper, the sigmoid function is employed by the final output layer to learn the parameters [*W*, *b*]. However, the range of the final output is [0, 1], which is inconsistent with the range of the regression targets. Therefore, *Sg* is normalized to the range [0, 1], so that the objective function can be formulated as minimizing:

$$\arg\min \left\| S_g - F(V) \right\|_2^2 + \lambda \left\| W \right\|_F^2, \tag{12}$$

where ‖*W*‖<sup>2</sup><sub>*F*</sub> denotes the regularization term, added to prevent over-fitting, and *λ* is set to 0.00005.

After the optimization of Equation (12), the learned parameters [*W*, *b*] are obtained, and *S*0 is calculated via *S*0 = *F*(*V*).
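Under the same PyTorch assumption as above, the global model of Equation (12) can be sketched as a single sigmoid-activated linear layer on the fused feature *V*, with *λ* entering as L2 weight decay (the 22-landmark count follows Figure 3; the 2*N* output dimension follows *Sg* ∈ R<sup>2*N*×1</sup>):

```python
import torch

N_LANDMARKS = 22                                  # as in Figure 3
model = torch.nn.Sequential(
    torch.nn.Linear(5 * 4096, 2 * N_LANDMARKS),   # F = sigma(W^T V + b)
    torch.nn.Sigmoid(),                           # outputs constrained to [0, 1]
)
# lambda = 0.00005 from Equation (12), applied as L2 weight decay.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=5e-5)
loss_fn = torch.nn.MSELoss()

def train_step(V, S_g):
    """V: (batch, 5*4096) fused features; S_g: (batch, 2N), normalized to [0, 1]."""
    optimizer.zero_grad()
    loss = loss_fn(model(V), S_g)
    loss.backward()
    optimizer.step()
    return loss.item()
```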

#### *3.4. Local Refinement*

The global estimation phase describes the mapping from the fused facial attribute maps to the target landmark locations. Unlike the mean-shape initialization used by other methods, the estimated shape is global and more accurate than the mean shape. However, it is still rough, and there is room for improvement. To achieve more accurate locations, a coarse-to-fine approach is proposed to improve the performance. Similar to many cascade regression methods for 2D face alignment, a local model, shown in Figure 3, is employed to estimate the residual distance Δ*S*, i.e., the distance between the globally estimated shape *S*0 and the ground truth *Sg*.

**Figure 3.** Five different local attribute maps for 22 landmarks. (**a**): depth feature map; (**b**): curvature feature; (**c**): surface normal feature along the *x*-axis; (**d**): surface normal feature along the *y*-axis; (**e**): surface normal feature along the *z*-axis.

Similar to the global estimation, we employ the pre-trained CNN model to extract local features from the local patches around the estimated shape *S*0. Each local patch around *S*0 is cropped within 30 mm and then transformed into attribute maps. After the local attribute maps are calculated, they are resized to 224 × 224 and fed into the pre-trained deep neural network to extract local CNN features. We initially considered concatenating the fused local features of all landmarks to estimate Δ*S* jointly; however, given the huge number of parameters to train (e.g., 4096 × 5 × 22 × 44 = 19,824,640), we propose refining each local patch around a landmark independently. For this purpose, deep feature fusion is also applied to train the local models, denoted as *φi* = [*φiSI*, *φiNx*, *φiNy*, *φiNz*, *φiR*], *i* = 1, 2, ..., *N*, where *i* represents the *i*-th landmark and *N* is the number of located landmarks.

Given the local feature vectors, the local refinement model learns a nonlinear function *Hi* from the fused local feature *φi* to Δ*Si* for each landmark, denoted as Δ*Si* = *Sg*(*i*) − *S*0(*i*). The objective function of each model can be formulated as follows:

$$\arg\min \|\Delta S_i - H_i(\varphi_i)\|_2^2 + \beta \left\| W_i \right\|_F^2, \tag{13}$$

where *Hi* is a regression function of the same form as *F*, represented as *Hi* = *σ*(*Wiφi* + *bi*). Unlike in the global estimation, the activation function *σ* is the tanh function, so that all outputs are in the range [−1, 1]. After optimization, we compute Δ*Si* according to Δ*Si* = *Hi*(*φi*) and obtain Δ*S* = [Δ*S*1, Δ*S*2, ..., Δ*SN*]. The normalized result *Sfinal* is then computed as follows:

$$S_{final} = \Delta S + S_0. \tag{14}$$
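Continuing the PyTorch sketch, the local refinement of Equations (13) and (14) amounts to one independent tanh-output regressor per landmark; the 2D per-landmark residual mirrors the 2*N*-dimensional shape used for the global model, and *β* would again enter as weight decay:

```python
import torch

N_LANDMARKS = 22
# One independent regressor H_i = tanh(W_i phi_i + b_i) per landmark,
# with outputs constrained to [-1, 1] by tanh.
heads = torch.nn.ModuleList([
    torch.nn.Sequential(torch.nn.Linear(5 * 4096, 2), torch.nn.Tanh())
    for _ in range(N_LANDMARKS)
])

def refine(S0, local_features):
    """S0: (N, 2) global estimate; local_features: list of N tensors of shape
    (1, 5*4096), the fused local features phi_i. Returns S_final (Equation (14))."""
    delta = torch.cat([heads[i](phi) for i, phi in enumerate(local_features)])
    return S0 + delta
```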
