1. Introduction
In the metaverse, 3D animated faces twinned from real-life characters can provide users with an immersive experience [
1] and are more realistic than 2D faces or AI-generated faces. A typical application is
Yaya Tu in
Wandering Earth 2. Consequently, the construction and animation of 3D twin faces have constituted an important area of concern within both the academic [
2,
3] and engineering communities [
4]. In academia, the process of creating an animatable facial model is commonly referred to as ‘facial rigging’, and the animated model itself is referred to as a ‘rig’.
Digital facial rigs are usually divided into three kinds, of which the first is blendshape-based rigging. It is a linear facial model, and each basis vector is a complete facial expression [
5] or an offset from the neutral face [
6]. As the semantics of each basis vector are clear, it is intuitive for artists to modulate expressions. However, it is difficult to maintain orthogonality between different basis vectors, which may lead to repeated operations when modulating expressions. To solve this problem, most existing facial rigging works, such as 3D Morphable Model (3DMM) [
7], Faces Learned with an Articulated Model and Expressions (FLAME) [
8], FaceScape [
9], FaceVerse [
10], etc., use principal component analysis (PCA) to obtain the basis vectors. However, PCA deprives the basis vectors of semantic interpretability: animators cannot use them directly, and generating a specific expression requires considerable artistic experience. In addition, blendshape-based rigging is slightly deficient in localized subtle deformation. The second alternative is skeleton-based rigging, which generates a specific pose by joint transformations [
11] or nonlinear interpolation [
12] in a pose space, then affects the vertex positions through weights. The pose space consists of joint positions, joint angles, and degrees of freedom at each level [
12]. This operation is referred to as ‘skinning’ in animation production. Because joints deform the mesh locally, the animated expressions can be finely detailed; however, the pose space lacks interpretable semantic parameters. Moreover, due to the lack of triangle normal constraints, unsmooth artifacts [
13] such as candy-wrapping and collapsing elbows [
14] may appear. The third alternative is muscle-based or physical rigging, which generates the facial shape by simulating the muscles and tissues [
15] or controlling facial deformation through physical curves [
16]. This approach is the most faithful to human facial dynamics, but it is rarely used because its dynamics simulation is time-consuming. Nonetheless, several of the physical constraints employed in muscle-based facial rigging are beneficial, such as 2D/3D facial keypoint constraints [
17,
18].
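To make the blendshape formulation above concrete, the following minimal NumPy sketch combines delta blendshapes (offsets from the neutral face) with semantic weights; the array shapes and weight values are illustrative assumptions, not data from this paper.

```python
import numpy as np

def blend_expression(neutral, deltas, weights):
    """Linear blendshape model: neutral face plus a weighted sum of offsets.

    neutral: (V, 3) vertices of the neutral mesh
    deltas:  (K, V, 3) per-blendshape vertex offsets from the neutral mesh
    weights: (K,) semantic weights, typically in [0, 1]
    """
    return neutral + np.tensordot(weights, deltas, axes=1)  # (V, 3)

# Toy example: activate blendshape 0 at 60% and blendshape 2 at 30%.
V, K = 5000, 50
neutral = np.zeros((V, 3))
deltas = 0.01 * np.random.randn(K, V, 3)
w = np.zeros(K)
w[0], w[2] = 0.6, 0.3
expression = blend_expression(neutral, deltas, w)
```

Because the offset vectors of real rigs are not orthogonal, adjusting one weight can visibly interact with another, which is the repeated-operation problem noted above.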
In industry, producing such rigs often requires extensive animation experience on the part of artists; hence, manual facial rigging is labor-intensive and costly. While researchers have attempted to use deep learning networks to accomplish automatic facial rigging [
15,
19,
20,
21], insufficient 3D datasets continue to be a bottleneck. Moreover, these methods typically employ complex network architectures, resulting in many parameters and slow loading. Consequently, when a 3D expression is needed, researchers typically resort to 3D reconstruction from 2D images [
22,
23]. However, the quality of such reconstructions is significantly influenced by factors such as image noise, lighting conditions, and the camera’s point of view. When there are multiple characters in the scene, it is difficult for 3D reconstruction methods to guarantee correct spatial proportions. Furthermore, when 3D expressions must be reenacted from a template 3D animation, 3D reconstruction methods are not applicable. To handle this problem, researchers may use facial rigging transfer methods [
24,
25,
26] to extract source expression features and transfer them to the target face. However, the aforementioned methods usually do not consider the semantic interpretability of the latent space, or can only constrain the hidden variables with simple semantics [3,20]. This results in rigs that are difficult to operate.
In summary, current facial rigging methods and rigs cannot balance production cost against application quality. Therefore, our long-term objective is to develop a lightweight rigging paradigm that integrates the benefits of blendshape-based and skeleton-based rigs. The goal of this paper is to establish real-time bidirectional relationships between joints and semantic parameters from a limited number of meshes. The specific contributions are summarized below:
A large training dataset containing joint positions, semantic parameters, and vertex positions is automatically expanded from a limited number of spatiotemporal facial meshes.
An expression generation network called RigGenNet is established to map semantic parameters to joint positions, based on a multilayer perceptron with vertex constraints. An expression recognition network called RigRecogNet is established to map joint positions to semantic parameters, based on a generative adversarial network (GAN)-like structure. The recognition accuracy of critical areas is improved by adding local constraints determined by 3D masks (a minimal sketch of both mappings is given after this list).
After training, RigGenNet can generate 3D expressions in real time, while RigRecogNet can recognize 3D expressions in real time. Both networks are applicable to 3D meshes with different levels of detail, and a limited number of target meshes can be accurately reenacted.
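For concreteness, the sketch below illustrates the two directions of the mapping (semantic parameters to joint positions, and back); the layer widths, activations, and dimensionalities are assumptions for illustration only and do not reproduce this paper’s architectures, vertex constraints, or adversarial training.

```python
import torch
import torch.nn as nn

class GenSketch(nn.Module):
    """Semantic parameters -> joint positions (the RigGenNet direction)."""
    def __init__(self, d_sem=50, n_joints=170, hidden=512):
        super().__init__()
        self.n_joints = n_joints
        self.mlp = nn.Sequential(
            nn.Linear(d_sem, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_joints * 3),
        )

    def forward(self, s):                       # s: (B, d_sem)
        return self.mlp(s).view(-1, self.n_joints, 3)

class RecogSketch(nn.Module):
    """Joint positions -> semantic parameters (the RigRecogNet direction)."""
    def __init__(self, n_joints=170, d_sem=50, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_joints * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, d_sem), nn.Sigmoid(),  # semantics lie in [0, 1]
        )

    def forward(self, joints):                  # joints: (B, n_joints, 3)
        return self.net(joints.flatten(1))
```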
The rest of this paper is structured as follows:
Section 2 analyzes recent relevant works;
Section 3 introduces the technical details of our proposed methods;
Section 4 describes the experimental protocol;
Section 5 discusses the results;
Section 6 points out limitations; finally,
Section 7 highlights the contributions and proposes future works.
4. Experiments
This section first describes the datasets, the hyperparameters and training pseudocode, and the evaluation factors in
Section 4.1,
Section 4.2 and
Section 4.3. Then, the purpose and details of the four experiments described in this paper are provided in
Section 4.4.
4.1. Datasets
In this paper, we used Metahuman characters [
46] with different levels of detail as template blendshapes; see
Table 4 for details. In addition, we performed validation on four sets of meshes with different levels of detail: one neutral expression and 75 other expressions of Danielle with a level of detail (LOD) of 6, one neutral expression and 99 other expressions of Keiji with LOD3, one neutral expression and 49 other expressions of Hou with LOD3, and one neutral expression and 99 other expressions of Zhou with LOD0. Danielle and Keiji are both Metahuman characters, while Hou and Zhou were collected using our 4D facial capture system.
For the 4D capture process, the actor was required to wear three cubes containing four ArUco markers [
13] on their head during the collection process in order to register and retopologize the meshes. After the captured meshes were preprocessed as described in Section 3.1, 100,000 sets of expressions were morphed for use as the training meshes. The semantic parameters, joint positions, and vertex positions corresponding to each mesh were packed to obtain the final training set, which was split into training and validation subsets at a ratio of 8:2. For the Metahuman characters, each expression satisfied the same topology condition, and the training set was obtained using the dataset expansion method described in Section 3.1.
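As an illustration of how each sample can be packed and split, the sketch below groups the semantic parameters, joint positions, and vertex positions of each mesh into one dataset and divides it 8:2; the tensor shapes and the reduced sample count are assumptions for brevity, not the actual data dimensions.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy sizes for illustration; the expanded set in this paper contains 100,000 meshes.
n_samples, d_sem, n_joints, n_verts = 200, 50, 170, 5000

semantics = torch.rand(n_samples, d_sem)          # semantic parameters per mesh
joints    = torch.randn(n_samples, n_joints, 3)   # joint positions per mesh
vertices  = torch.randn(n_samples, n_verts, 3)    # vertex positions per mesh

dataset = TensorDataset(semantics, joints, vertices)
n_train = int(0.8 * len(dataset))                  # 8:2 train/validation split
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
```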
4.2. Hyperparameters and Pseudocode
We found that the expressions generated by RigGenNet fit the labeled semantics in the validation set best when the corresponding weighting hyperparameter was close to 0.1, with the positional errors of the joints and vertices remaining within the acceptable range. When the corresponding hyperparameter for RigRecogNet was near 0.8, the network generalized better. The expressions could be reproduced precisely using the recognized semantic parameters, and the reconstructed expressions combined the smoothness of blendshape-based rigs with the localized deformability of skeleton-based rigs.
All networks were trained on an NVIDIA GeForce GTX 1080 Ti with a training batch size of 16 and a test batch size of 1; the pseudocode for training is shown in Algorithm 1. RigGenNet was trained for 100 epochs, while RigRecogNet was trained for 60 epochs. The Adam optimizer was used for all networks, with an exponential decay rate of 0.9 for the first-order moment estimate, 0.999 for the second-order moment estimate, and an L2 weight decay penalty of 1 × 10−4. The learning rate was 1 × 10−4 for RigGenNet and 1 × 10−5 for RigRecogNet.
Algorithm 1: Training for the RigGenNet and the RigRecogNet.
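Since the Algorithm 1 figure is not reproduced here, the following PyTorch sketch reflects the stated settings (Adam with decay rates 0.9/0.999, weight decay 1 × 10−4, batch size 16, 100 epochs, learning rate 1 × 10−4 for RigGenNet); the loss is a plain joint-position MSE placeholder rather than the paper’s full vertex-constrained objective, and RigRecogNet would instead use 60 epochs and a learning rate of 1 × 10−5.

```python
import torch
from torch.utils.data import DataLoader

def train_riggennet(model, train_set, epochs=100, lr=1e-4):
    loader = DataLoader(train_set, batch_size=16, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr,
                           betas=(0.9, 0.999), weight_decay=1e-4)
    for _ in range(epochs):
        for semantics, joints, _vertices in loader:
            pred_joints = model(semantics)
            # Placeholder loss; the actual objective also includes a weighted
            # vertex-constraint term (see Section 3).
            loss = torch.nn.functional.mse_loss(pred_joints, joints)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```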
4.3. Evaluation Factors
We evaluated RigGenNet based on the difference between the generated expressions and the labeled expressions in the validation sets. Four evaluation factors were used: the full-face joint relative mean square error, the full-face vertex relative mean square error, the absolute vertex error, and the single-frame expression generation time. The relative mean error is a common evaluation criterion for facial animation [13,47]; it is normalized by the diagonal length of the bounding box enclosing the target character’s neutral mesh.
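The normalization described above can be computed as in the sketch below; the exact error form (mean versus mean squared) follows the definitions in the paper, while the array shapes and function names here are illustrative assumptions.

```python
import numpy as np

def relative_mean_error(pred, gt, neutral_mesh):
    """Mean positional error normalized by the diagonal of the bounding box
    of the target character's neutral mesh.

    pred, gt:     (N, 3) predicted and labeled joint or vertex positions
    neutral_mesh: (V, 3) vertices of the neutral mesh
    """
    diag = np.linalg.norm(neutral_mesh.max(axis=0) - neutral_mesh.min(axis=0))
    return np.mean(np.linalg.norm(pred - gt, axis=1)) / diag
```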
We evaluated RigRecogNet based on the difference between the reconstructed expressions and the labeled expressions. The evaluation factors were the semantic recognition accuracy P, the reconstructed full-face vertex relative mean square error, the reconstructed eye vertex relative mean square error, the reconstructed lip vertex relative mean error, the absolute vertex error, and the single-frame expression recognition time.
Unlike conventional emotion category recognition, we recognize expressions as parameters of a continuously varying space, where each dimension represents a subtle expression, such as ‘opening the left eye’. Categorical emotions were first classified into six categories according to the Facial Action Coding System (FACS) [28]: happiness, sadness, anger, surprise, fear, and disgust. Real human emotions are complex, with blurred boundaries between expressions; thus, discrete categories may not reflect subtle emotion shifts, and our continuously varying semantic space can better approximate real expressions. Each dimension of the semantic space varies from 0 to 1. Here, the semantic recognition accuracy is the proportion of well-fitted dimensions among all dimensions of the semantic space, where ‘well-fitted’ means that the absolute error between the recognized semantic parameters and the labeled semantic parameters is less than 0.1. In addition to the above factors, we evaluated the time and computational complexity of all methods based on the memory occupied by the parameters (Params) and the number of floating-point operations (FLOPs).
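Under the definition above, the semantic recognition accuracy P can be computed as in the following sketch; the function and parameter names are illustrative.

```python
import numpy as np

def semantic_accuracy(pred, labeled, tol=0.1):
    """Fraction of 'well-fitted' semantic dimensions, i.e., dimensions whose
    absolute error between recognized and labeled parameters is below tol.

    pred, labeled: (D,) semantic parameter vectors in [0, 1]
    """
    return float(np.mean(np.abs(pred - labeled) < tol))
```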
4.4. Experimental Settings
The objective of the first experiment was to verify that our RigGenNet can generate facial expressions that match the reference expressions with less error under the same semantic parameters, and can do so in real time. We used Myles as the template blendshape and the Keiji and Hou datasets as the validation set. We compared the evaluation factors of five methods: the traditional blendshape-based rigging method [
43,
48], the latest ‘NFR’ blendshape-based rigging method [
20], the ‘SketchMetaFace’ physical rigging method [
16], the ‘BTCNET’ skeleton-based rigging method [
31], and our RigGenNet. The expressions to be verified were generated using the same semantic parameters.
The objective of the second experiment was to verify that our RigRecogNet can accurately recognize emotions and that the recognized emotions can be accurately reproduced by RigGenNet. We compared the evaluation factors of six expression recognizers on the Keiji and Hou datasets: the ‘FFNet’ skeleton-based rigging method [
30], BTCNET [
31], the traditional blendshape-based rigging method [
43,
48], the latest ‘NFR’ blendshape-based rigging method [
20], the ‘Shape Transformer’ rigging transfer method [
24], and our RigRecogNet. We used the second step of the blendshape-based rigging method [
43,
48] to optimize the semantic parameters, where the blendshape was fixed. For the same set of expressions, the optimized semantic parameters differed between runs with no fixed pattern; thus, we took the average of ten runs as the final value of each evaluation factor for this method.
The objective of the third experiment was to test the robustness of RigGenNet and RigRecogNet on datasets with different levels of detail. We used labeled semantic parameters to generate meshes on the Danielle, Keiji, and Zhou datasets and compared the generated faces to verify the robustness of RigGenNet. We then used RigRecogNet to recognize facial emotions on the same datasets and applied RigGenNet to the recognized semantic parameters to obtain reenacted faces, which were compared to verify the robustness of RigRecogNet.
The objective of the fourth experiment was to verify the effectiveness of the local vertex constraints and to validate the superiority of the proposed 3D mask selection method. Ablation experiments were conducted on the LOD3 datasets to evaluate the efficacy of different masks. We tested five cases: BTCNET [31] with full-face constraints only; BTCNET [31] with full-face and eye constraints; BTCNET [31] with full-face and lip constraints; BTCNET [31] with all of the above constraints, using local masks obtained by 2D3D-MATR [40]; and BTCNET [31] with all of the above constraints, using local masks obtained by our method.
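The local vertex constraints compared in this ablation can be expressed as additional masked terms in the vertex loss, as in the hedged sketch below; the mask construction and the relative weights are assumptions, and the paper’s 3D mask selection method is not reproduced here.

```python
import torch
import torch.nn.functional as F

def masked_vertex_loss(pred_verts, gt_verts, eye_mask, lip_mask,
                       w_eye=1.0, w_lip=1.0):
    """Full-face vertex loss plus extra terms over the eye and lip regions.

    pred_verts, gt_verts: (B, V, 3) predicted and labeled vertex positions
    eye_mask, lip_mask:   (V,) boolean masks selecting the constrained regions
    """
    loss = F.mse_loss(pred_verts, gt_verts)                        # full face
    loss += w_eye * F.mse_loss(pred_verts[:, eye_mask], gt_verts[:, eye_mask])
    loss += w_lip * F.mse_loss(pred_verts[:, lip_mask], gt_verts[:, lip_mask])
    return loss
```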
7. Conclusions
We propose an automatic dataset expansion method that can synthesize a large number of meshes from a limited number of spatiotemporal samples and decouple the joints from the semantic parameters. Based on the expanded dataset, we propose RigGenNet to fit the generative mapping from semantic parameters to joints; the proposed network outperforms the latest works [
16,
20,
31] in terms of generation time and computational complexity, and the generated faces have the smallest error relative to the reference expressions. Based on RigGenNet, we built RigRecogNet to decouple the semantic parameters from the joints; again, the proposed network outperforms existing works [
20,
24,
31] in terms of recognition accuracy. During optimization of RigRecogNet, we use 3D masks to constrain the critical parts. An ablation experiment verifies that the local constraints can improve overall recognition accuracy by 3.02%. Compared with 3D masks chosen by 2D3D-MATR [
40], the recognition accuracy of our method is higher and the mask locations are more stable. In addition, our rigging method provides robust results for faces with different levels of detail, and the resulting rig has more dimensions of emotional parameters and can express more subtle emotional changes.
Subsequent work could focus on pretraining a mean expression generator and a mean expression recognizer based on a large dataset containing different characters. In this case, even a small number of captured meshes (e.g., 50) would be sufficient to fine-tune a facial autoencoder [
31] for rigging meshes. In addition, subsequent works might consider adding triangular surface normal constraints during network optimization or adding dynamics constraints referring to the physical rig.