1. Introduction
In the metaverse, 3D animated faces twinned from real-life characters can provide users with an immersive experience [
1] and are more realistic than 2D faces or AI-generated faces. A typical application is
Yaya Tu in
Wandering Earth 2. Consequently, the construction and animation of 3D twin faces have constituted an important area of concern within both the academic [
2,
3] and engineering communities [
4]. In academia, the process of creating an animatable facial model is commonly referred to as ‘facial rigging’, and the animated model itself is referred to as a ‘rig’.
Digital facial rigs are usually divided into three kinds, of which the first is blendshape-based rigging. It is a linear facial model, and each basis vector is a complete facial expression [
5] or an offset from the neutral face [
6]. As the semantics of each basis vector are clear, it is intuitive for artists to modulate expressions. However, it is difficult to maintain orthogonality between different basis vectors, which may lead to repeated operations when modulating expressions. To solve this problem, most existing facial rigging works, such as 3D Morphable Model (3DMM) [
7], Faces Learned with an Articulated Model and Expressions (FLAME) [
8], FaceScape [
9], FaceVerse [
10], etc., use principal component analysis (PCA) to obtain the basis vectors. However, PCA deprives the basis vectors of semantic interpretability: animators cannot use them directly, and generating a specific expression requires considerable artistic experience. In addition, blendshape-based rigging is slightly deficient in localized subtle deformation. The second alternative is skeleton-based rigging, which generates a specific pose by joint transformations [
11] or nonlinear interpolation [
12] in a pose space, then affects the vertex positions through weights. The pose space consists of joint positions, joint angles, and degrees of freedom at each level [
12]. This operation is referred to as ‘skinning’ in animation production. Because joints deform the mesh locally, the animated expressions can be finely detailed; however, the pose space lacks interpretable semantic parameters. Moreover, due to the lack of triangle normal constraints, unsmooth artifacts [
13] such as candy-wrapping and collapsing elbows [
14] may appear. The third alternative is muscle-based or physical rigging, which generates the facial shape by simulating the muscles and tissues [
15] or controlling facial deformation through physical curves [
16]. This approach is the most faithful to human facial dynamics, but it is rarely used because its dynamics simulation is time-consuming. Nonetheless, several of the physical constraints employed in muscle-based facial rigging are beneficial, such as 2D/3D facial keypoint constraints [
17,
18].
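To make the blendshape formulation above concrete, the following minimal NumPy sketch combines delta blendshapes (offsets from the neutral face) with semantic weights; the array shapes and weight values are illustrative assumptions, not data from this paper.

```python
import numpy as np

def blend_expression(neutral, deltas, weights):
    """Linear blendshape model: neutral face plus a weighted sum of offsets.

    neutral: (V, 3) vertices of the neutral mesh
    deltas:  (K, V, 3) per-blendshape vertex offsets from the neutral mesh
    weights: (K,) semantic weights, typically in [0, 1]
    """
    return neutral + np.tensordot(weights, deltas, axes=1)  # (V, 3)

# Toy example: activate blendshape 0 at 60% and blendshape 2 at 30%.
V, K = 5000, 50
neutral = np.zeros((V, 3))
deltas = 0.01 * np.random.randn(K, V, 3)
w = np.zeros(K)
w[0], w[2] = 0.6, 0.3
expression = blend_expression(neutral, deltas, w)
```

Because the offset vectors of real rigs are not orthogonal, adjusting one weight can visibly interact with another, which is the repeated-operation problem noted above.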
In industry, producing such rigs often requires extensive animation experience on the part of artists; hence, manual facial rigging is labor-intensive and costly. While researchers have attempted to use deep learning networks to accomplish automatic facial rigging [
15,
19,
20,
21], insufficient 3D datasets continue to be a bottleneck. Moreover, these methods typically employ complex network architectures, resulting in many parameters and slow loading. Consequently, when a 3D expression is needed, researchers typically resort to 3D reconstruction from 2D images [
22,
23]. However, the quality of such reconstructions is significantly influenced by factors such as image noise, lighting conditions, and the camera’s point of view. When there are multiple characters in the scene, it is difficult for 3D reconstruction methods to guarantee correct spatial proportions. Furthermore, when 3D expressions must be reenacted from a template 3D animation, 3D reconstruction methods are not applicable. To handle this problem, researchers may use facial rigging transfer methods [
24,
25,
26] to extract source expression features and transfer them to the target face. However, the aforementioned methods usually do not consider the semantic interpretability of the latent space, or can only constrain the hidden variables with simple semantics [3,20]. This results in rigs that are difficult to operate.
In summary, current facial rigging methods and rigs cannot balance production cost against application quality. Therefore, our long-term objective is to develop a lightweight rigging paradigm that integrates the benefits of blendshape-based and skeleton-based rigs. The goal of this paper is to establish real-time bidirectional relationships between joints and semantic parameters from a limited number of meshes. The specific contributions are summarized below:
A large training dataset containing joint positions, semantic parameters, and vertex positions is automatically expanded from a limited number of spatiotemporal facial meshes.
An expression generation network called RigGenNet is established to map semantic parameters to joint positions, based on a multilayer perceptron with vertex constraints. An expression recognition network called RigRecogNet is established to map joint positions to semantic parameters, based on a generative adversarial network (GAN)-like structure. The recognition accuracy of critical areas is improved by adding local constraints determined by 3D masks (a minimal sketch of both mappings is given after this list).
After training, RigGenNet can generate 3D expressions in real time, while RigRecogNet can recognize 3D expressions in real time. Both networks are applicable to 3D meshes with different levels of detail, and a limited number of target meshes can be accurately reenacted.
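For concreteness, the sketch below illustrates the two directions of the mapping (semantic parameters to joint positions, and back); the layer widths, activations, and dimensionalities are assumptions for illustration only and do not reproduce this paper’s architectures, vertex constraints, or adversarial training.

```python
import torch
import torch.nn as nn

class GenSketch(nn.Module):
    """Semantic parameters -> joint positions (the RigGenNet direction)."""
    def __init__(self, d_sem=50, n_joints=170, hidden=512):
        super().__init__()
        self.n_joints = n_joints
        self.mlp = nn.Sequential(
            nn.Linear(d_sem, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_joints * 3),
        )

    def forward(self, s):                       # s: (B, d_sem)
        return self.mlp(s).view(-1, self.n_joints, 3)

class RecogSketch(nn.Module):
    """Joint positions -> semantic parameters (the RigRecogNet direction)."""
    def __init__(self, n_joints=170, d_sem=50, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_joints * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, d_sem), nn.Sigmoid(),  # semantics lie in [0, 1]
        )

    def forward(self, joints):                  # joints: (B, n_joints, 3)
        return self.net(joints.flatten(1))
```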
The rest of this paper is structured as follows:
Section 2 analyzes recent relevant works;
Section 3 introduces the technical details of our proposed methods;
Section 4 describes the experimental protocol;
Section 5 discusses the results;
Section 6 points out limitations; finally,
Section 7 highlights the contributions and proposes future works.
4. Experiments
This section first describes the datasets, the hyperparameters and training pseudocode, and the evaluation factors in
Section 4.1,
Section 4.2 and
Section 4.3. Then, the purpose and details of the four experiments described in this paper are provided in
Section 4.4.
4.1. Datasets
In this paper, we used Metahuman characters [
46] with different levels of detail as template blendshapes; see
Table 4 for details. In addition, we performed validation on four sets of meshes with different levels of detail: one neutral expression and 75 other expressions of Danielle with a level of detail (LOD) of 6, one neutral expression and 99 other expressions of Keiji with LOD3, one neutral expression and 49 other expressions of Hou with LOD3, and one neutral expression and 99 other expressions of Zhou with LOD0. Danielle and Keiji are both Metahuman characters, while Hou and Zhou were collected using our 4D facial capture system.
For the 4D capture process, the actor was required to wear three cubes containing four ArUco markers [
13] on their head during the collection process in order to register and retopologize the meshes. After the captured meshes were preprocessed as described in Section 3.1, 100,000 sets of expressions were morphed for use as the training meshes. The semantic parameters, joint positions, and vertex positions corresponding to each mesh were packed to obtain the final training set, which was split into training and validation subsets at a ratio of 8:2. For the Metahuman characters, each expression satisfied the same topology condition, and the training set was obtained using the dataset expansion method described in Section 3.1.
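As an illustration of how each sample can be packed and split, the sketch below groups the semantic parameters, joint positions, and vertex positions of each mesh into one dataset and divides it 8:2; the tensor shapes and the reduced sample count are assumptions for brevity, not the actual data dimensions.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy sizes for illustration; the expanded set in this paper contains 100,000 meshes.
n_samples, d_sem, n_joints, n_verts = 200, 50, 170, 5000

semantics = torch.rand(n_samples, d_sem)          # semantic parameters per mesh
joints    = torch.randn(n_samples, n_joints, 3)   # joint positions per mesh
vertices  = torch.randn(n_samples, n_verts, 3)    # vertex positions per mesh

dataset = TensorDataset(semantics, joints, vertices)
n_train = int(0.8 * len(dataset))                  # 8:2 train/validation split
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
```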
4.2. Hyperparameters and Pseudocode
We found that the expressions generated by RigGenNet fit the labeled semantics in the validation set best when the corresponding weighting hyperparameter was close to 0.1, with the positional errors of the joints and vertices remaining within the acceptable range. When the corresponding hyperparameter for RigRecogNet was near 0.8, the network generalized better. The expressions could be reproduced precisely using the recognized semantic parameters, and the reconstructed expressions combined the smoothness of blendshape-based rigs with the localized deformability of skeleton-based rigs.
All networks were trained on an NVIDIA GeForce GTX 1080 Ti with a training batch size of 16 and a test batch size of 1; the pseudocode for training is shown in Algorithm 1. RigGenNet was trained for 100 epochs, while RigRecogNet was trained for 60 epochs. The Adam optimizer was used for all networks, with an exponential decay rate of 0.9 for the first-order moment estimate, 0.999 for the second-order moment estimate, and an L2 weight decay penalty of 1 × 10−4. The learning rate was 1 × 10−4 for RigGenNet and 1 × 10−5 for RigRecogNet.
Algorithm 1: Training for the RigGenNet and the RigRecogNet.
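Since the Algorithm 1 figure is not reproduced here, the following PyTorch sketch reflects the stated settings (Adam with decay rates 0.9/0.999, weight decay 1 × 10−4, batch size 16, 100 epochs, learning rate 1 × 10−4 for RigGenNet); the loss is a plain joint-position MSE placeholder rather than the paper’s full vertex-constrained objective, and RigRecogNet would instead use 60 epochs and a learning rate of 1 × 10−5.

```python
import torch
from torch.utils.data import DataLoader

def train_riggennet(model, train_set, epochs=100, lr=1e-4):
    loader = DataLoader(train_set, batch_size=16, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr,
                           betas=(0.9, 0.999), weight_decay=1e-4)
    for _ in range(epochs):
        for semantics, joints, _vertices in loader:
            pred_joints = model(semantics)
            # Placeholder loss; the actual objective also includes a weighted
            # vertex-constraint term (see Section 3).
            loss = torch.nn.functional.mse_loss(pred_joints, joints)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```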
4.3. Evaluation Factors
We evaluated RigGenNet based on the difference between the generated expressions and the labeled expressions in the validation sets. Four evaluation factors were used: the full-face joint relative mean square error, the full-face vertex relative mean square error, the absolute vertex error, and the single-frame expression generation time. The relative mean error is a common evaluation criterion for facial animation [13,47]; it is normalized by the diagonal length of the bounding box enclosing the target character’s neutral mesh.
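The normalization described above can be computed as in the sketch below; the exact error form (mean versus mean squared) follows the definitions in the paper, while the array shapes and function names here are illustrative assumptions.

```python
import numpy as np

def relative_mean_error(pred, gt, neutral_mesh):
    """Mean positional error normalized by the diagonal of the bounding box
    of the target character's neutral mesh.

    pred, gt:     (N, 3) predicted and labeled joint or vertex positions
    neutral_mesh: (V, 3) vertices of the neutral mesh
    """
    diag = np.linalg.norm(neutral_mesh.max(axis=0) - neutral_mesh.min(axis=0))
    return np.mean(np.linalg.norm(pred - gt, axis=1)) / diag
```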
We evaluated RigRecogNet based on the difference between the reconstructed expressions and the labeled expressions. The evaluation factors were the semantic recognition accuracy P, the reconstructed full-face vertex relative mean square error, the reconstructed eye vertex relative mean square error, the reconstructed lip vertex relative mean error, the absolute vertex error, and the single-frame expression recognition time.
Unlike conventional emotion category recognition, we recognize expressions as parameters of a continuously varying space, where each dimension represents a subtle expression, such as ‘opening the left eye’. Categorical emotions were first classified into six categories according to the Facial Action Coding System (FACS) [28]: happiness, sadness, anger, surprise, fear, and disgust. Real human emotions are complex, with blurred boundaries between expressions; thus, discrete categories may not reflect subtle emotion shifts, and our continuously varying semantic space can better approximate real expressions. Each dimension of the semantic space varies from 0 to 1. Here, the semantic recognition accuracy is the proportion of well-fitted dimensions among all dimensions of the semantic space, where ‘well-fitted’ means that the absolute error between the recognized semantic parameters and the labeled semantic parameters is less than 0.1. In addition to the above factors, we evaluated the time and computational complexity of all methods based on the memory occupied by the parameters (Params) and the number of floating-point operations (FLOPs).
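Under the definition above, the semantic recognition accuracy P can be computed as in the following sketch; the function and parameter names are illustrative.

```python
import numpy as np

def semantic_accuracy(pred, labeled, tol=0.1):
    """Fraction of 'well-fitted' semantic dimensions, i.e., dimensions whose
    absolute error between recognized and labeled parameters is below tol.

    pred, labeled: (D,) semantic parameter vectors in [0, 1]
    """
    return float(np.mean(np.abs(pred - labeled) < tol))
```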
4.4. Experimental Settings
The objective of the first experiment was to verify that our RigGenNet can generate facial expressions that match the reference expressions with less error under the same semantic parameters, and can do so in real time. We used Myles as the template blendshape and the Keiji and Hou datasets as the validation set. We compared the evaluation factors of five methods: the traditional blendshape-based rigging method [
43,
48], the latest ‘NFR’ blendshape-based rigging method [
20], the ‘SketchMetaFace’ physical rigging method [
16], the ‘BTCNET’ skeleton-based rigging method [
31], and our RigGenNet. The expressions to be verified were generated using the same semantic parameters.
The objective of the second experiment was to verify that our RigRecogNet can accurately recognize emotions and that the recognized emotions can be accurately reproduced by RigGenNet. We compared the evaluation factors of six expression recognizers on the Keiji and Hou datasets: the ‘FFNet’ skeleton-based rigging method [
30], BTCNET [
31], the traditional blendshape-based rigging method [
43,
48], the latest ‘NFR’ blendshape-based rigging method [
20], the ‘Shape Transformer’ rigging transfer method [
24], and our RigRecogNet. We used the second step of the blendshape-based rigging method [
43,
48] to optimize the semantic parameters, where the blendshape was fixed. For the same set of expressions, the optimized semantic parameters differed between runs with no fixed pattern; thus, we took the average of ten runs as the final value of each evaluation factor for this method.
The objective of the third experiment was to test the robustness of RigGenNet and RigRecogNet on datasets with different levels of detail. We used labeled semantic parameters to generate meshes on the Danielle, Keiji, and Zhou datasets and compared the generated faces to verify the robustness of RigGenNet. We then used RigRecogNet to recognize facial emotions on the same datasets and applied RigGenNet to the recognized semantic parameters to obtain reenacted faces, which were compared to verify the robustness of RigRecogNet.
The objective of the fourth experiment was to verify the effectiveness of the local vertex constraints and to validate the superiority of the proposed 3D mask selection method. Ablation experiments were conducted on the LOD3 datasets to evaluate the efficacy of different masks. We tested five cases: BTCNET [31] with full-face constraints only; BTCNET [31] with full-face and eye constraints; BTCNET [31] with full-face and lip constraints; BTCNET [31] with all of the above constraints, using local masks obtained by 2D3D-MATR [40]; and BTCNET [31] with all of the above constraints, using local masks obtained by our method.
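The local vertex constraints compared in this ablation can be expressed as additional masked terms in the vertex loss, as in the hedged sketch below; the mask construction and the relative weights are assumptions, and the paper’s 3D mask selection method is not reproduced here.

```python
import torch
import torch.nn.functional as F

def masked_vertex_loss(pred_verts, gt_verts, eye_mask, lip_mask,
                       w_eye=1.0, w_lip=1.0):
    """Full-face vertex loss plus extra terms over the eye and lip regions.

    pred_verts, gt_verts: (B, V, 3) predicted and labeled vertex positions
    eye_mask, lip_mask:   (V,) boolean masks selecting the constrained regions
    """
    loss = F.mse_loss(pred_verts, gt_verts)                        # full face
    loss += w_eye * F.mse_loss(pred_verts[:, eye_mask], gt_verts[:, eye_mask])
    loss += w_lip * F.mse_loss(pred_verts[:, lip_mask], gt_verts[:, lip_mask])
    return loss
```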
7. Conclusions
We propose an automatic dataset expansion method that can synthesize a large number of meshes from a limited number of spatiotemporal samples and decouple the joints from the semantic parameters. Based on the expanded dataset, we propose RigGenNet to fit the generative mapping from semantic parameters to joints; the proposed network outperforms the latest works [
16,
20,
31] in terms of generation time and computational complexity, and the generated faces have the smallest error relative to the reference expressions. Based on RigGenNet, we built RigRecogNet to decouple the semantic parameters from the joints; again, the proposed network outperforms existing works [
20,
24,
31] in terms of recognition accuracy. During optimization of RigRecogNet, we use 3D masks to constrain the critical parts. An ablation experiment verifies that the local constraints can improve overall recognition accuracy by 3.02%. Compared with 3D masks chosen by 2D3D-MATR [
40], the recognition accuracy of our method is higher and the mask locations are more stable. In addition, our rigging method provides robust results for faces with different levels of detail, and the resulting rig has more dimensions of emotional parameters and can express more subtle emotional changes.
Subsequent work could focus on pretraining a mean expression generator and a mean expression recognizer based on a large dataset containing different characters. In this case, even a small number of captured meshes (e.g., 50) would be sufficient to fine-tune a facial autoencoder [
31] for rigging meshes. In addition, subsequent works might consider adding triangular surface normal constraints during network optimization or adding dynamics constraints referring to the physical rig.