Article

VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

Department of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361204, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(18), 3657; https://doi.org/10.3390/electronics13183657
Submission received: 1 August 2024 / Revised: 8 September 2024 / Accepted: 11 September 2024 / Published: 14 September 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

Speech-driven lip synchronization is a crucial technology for generating realistic facial animations, with broad application prospects in virtual reality, education, training, and other fields. However, existing methods still face challenges in generating high-fidelity facial animations, particularly in addressing lip jitter and facial motion instability issues in continuous frame sequences. This study presents VividWav2Lip, an improved speech-driven lip synchronization model. Our model incorporates three key innovations: a cross-attention mechanism for enhanced audio-visual feature fusion, an optimized network structure with Squeeze-and-Excitation (SE) residual blocks, and the integration of the CodeFormer facial restoration network for post-processing. Extensive experiments were conducted on a diverse dataset comprising multiple languages and facial types. Quantitative evaluations demonstrate that VividWav2Lip outperforms the baseline Wav2Lip model by 5% in lip sync accuracy and image generation quality, with even more significant improvements over other mainstream methods. In subjective assessments, 85% of participants perceived VividWav2Lip-generated animations as more realistic compared to those produced by existing techniques. Additional experiments reveal our model’s robust cross-lingual performance, maintaining consistent quality even for languages not included in the training set. This study not only advances the theoretical foundations of audio-driven lip synchronization but also offers a practical solution for high-fidelity, multilingual dynamic face generation, with potential applications spanning virtual assistants, video dubbing, and personalized content creation.

1. Introduction

Speech-driven lip synchronization is a cutting-edge research direction at the intersection of computer vision and speech processing [1]. It aims to generate facial animations that accurately synchronize with given speech content and facial images or videos [2]. This technology has broad application prospects in various fields, including virtual reality, distance education, digital entertainment, and telemedicine [3]. Although it has seen preliminary applications in virtual presenters and intelligent customer service, significant challenges remain in terms of lip-sync accuracy and naturalness [4].
Existing research has primarily focused on improving the fidelity of generated facial images and adapting to different facial features [5]. However, these methods still exhibit notable limitations when dealing with complex facial expressions and multilingual environments [6]. In particular, we have observed a significant lip jitter phenomenon in the generated results, which may stem from the insufficient fusion of visual and speech information when the model processes rapidly changing audio inputs [7]. Furthermore, the quality of generated facial images still shows a marked gap compared to real videos, which to some extent limits the practical application of this technology.
To address these issues, this study proposes VividWav2Lip, an audio-driven lip synchronization system based on an improved Wav2Lip [8] architecture. As shown in Figure 1, given a source video or facial image and a specified audio input, our model can generate lip-synced facial animations.
Our approach incorporates three main innovations: firstly, we introduce a cross-attention mechanism to achieve the fusion of audio and visual features; secondly, we enhance the model’s expressive capacity through additional residual blocks and Squeeze-and-Excitation (SE) residual mechanism; finally, we integrate CodeFormer [9], an advanced facial reconstruction technique, to further enhance the visual quality of generated videos. The main contributions of this study are as follows:
  • We propose the VividWav2Lip model architecture, introducing a feature fusion strategy that significantly improves the accuracy and stability of lip synchronization;
  • We enhance the video dataset, expanding the model’s language adaptability and laying a foundation for training;
  • The integration of advanced facial restoration networks in the post-processing pipeline greatly improves the quality of the generated videos;
  • Through extensive experiments, we validate the superiority of our proposed method across multiple objective metrics and subjective evaluations.
The structure of this paper is as follows: Section 2 reviews related work, Section 3 provides a detailed description of the VividWav2Lip model, Section 4 presents experimental results, Section 5 discusses the findings and explores future research directions, and Section 6 concludes the study.

2. Related Work

2.1. Talking-Head Generation

Talking-head generation is a significant research area in computer graphics and computer vision, aiming to create facial animations synchronized with audio [10]. In recent years, research methods in this field have primarily focused on 3D-model-based approaches and deep-learning-based 2D image methods. Three-dimensional model-based methods typically utilize trained neural networks, combining audio datasets with 3D facial meshes to achieve vertex-level facial animation [11]. The VOCA model [12] is a representative work in this category, employing an end-to-end approach to directly predict 3D facial mesh vertex movements from audio. Subsequently, FaceFormer [13] proposed an autoregressive Transformer-based 3D facial animation architecture, further improving the quality and naturalness of generated animations. Additionally, studies such as Imitator [14], DiffPoseTalk [15], CodeTalker [16], and MeshTalk [17] have provided new insights and methods for audio-driven 3D facial animation.
However, 3D-model-based methods face several challenges. Firstly, these methods are computationally complex, requiring the processing of numerous vertices and faces, resulting in high computational costs and difficulties in rendering. Secondly, 3D model methods have limited generalization capabilities, struggling to adapt to different individual facial features and expression variations.
Given the aforementioned limitations, researchers have gradually shifted their attention to deep-learning-based 2D methods. These approaches learn directly from audio and two-dimensional images without requiring explicit three-dimensional intermediate representations, thus offering potential advantages in computational efficiency and generation quality. ObamaNet [18], building upon Synthesizing Obama [19], utilized neural networks to train various modules, laying the foundation for the subsequent LipGAN network [20]. This work pioneered a surge of research into audio-driven animation for arbitrary identities. SyncTalkFace [21] introduced visual information from the mouth region into the lip synchronization process, while FaceChain-ImagineID [22] proposed a progressive decoupling method to estimate facial geometry, enhancing the learning of lip movements. As research progresses, scholars began to focus on more facial features and expression variations. VideoReTalking [23] and FlowVQTalker [24] not only concentrated on lip synchronization but also addressed expression matching. Works such as Speech2Lip [25], LipFormer [26], and FaceTalk [27] further considered changes in head features and postures, making the generated head animations more natural. Audio2Head [28] generated head movements from arbitrary audio inputs, providing an effective solution to the challenge of lip synchronization in real, uncontrolled environments.

2.2. Wav2Lip

Wav2Lip represents a revolutionary approach to audio-driven lip synchronization, exerting a profound influence on the field of talking-head generation. It comprises an audio encoder, a video encoder, and a generator. The audio encoder primarily extracts features from the audio, while the video encoder processes the input facial image sequence. Both then transmit these features to the generator to synthesize precisely synchronized lip movements. Its advantage lies in its powerful generalization capability across different languages and styles, demonstrating excellent performance on languages and tasks beyond the training dataset. It exhibits robust adaptability in the wild.
Recently, researchers have explored ways to extend Wav2Lip’s capabilities, proposing a series of improvements such as ca-Wav2Lip [29], Wav2Lip-HR [30], SadTalker [31], Diff2Lip [32], WavSyncSwap [33], etc. These Wav2Lip-based studies have not only expanded the model’s application scope but also enhanced the quality and naturalness of generated results. Despite facing challenges from numerous emerging technologies, Wav2Lip remains one of the most advanced lip synchronization methods currently available, laying a crucial foundation for subsequent research. However, despite its strengths, Wav2Lip and its variants still face challenges in several key areas. These include the need for more effective fusion of audio and visual features, limited expressiveness in handling complex facial expressions, and difficulties in adapting to multilingual environments with varying speech rates. Additionally, the visual quality of the generated facial images often falls short of real video standards, which to some extent limits the practical application of this technology.

2.3. Facial Restoration Networks

In the process of audio-driven lip generation, the generated facial images may sometimes exhibit distortions or unnatural phenomena, particularly in the dental region, where generated details may not be clearly rendered. In this study, we integrate a facial restoration network as a post-processing step into the generation pipeline to enhance overall visual quality [34]. Current mainstream facial restoration techniques typically rely on traditional image restoration and enhancement methods, which may have certain limitations when processing facial details. CodeFormer stands out in this field by combining local attention mechanisms with global feature extraction capabilities. It can effectively restore and enhance facial details while maintaining identity consistency. By integrating CodeFormer into our end-to-end system, we provide a comprehensive solution for generating talking-head videos. This integration addresses a critical gap in current lip synchronization research, where the focus has primarily been on synchronization accuracy rather than overall visual quality. By combining advanced lip synchronization techniques with state-of-the-art facial restoration, our approach aims to produce more realistic and visually appealing results, pushing the boundaries of what is possible in audio-driven facial animation.
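As an illustration of how such a restoration stage can be attached to the end of the pipeline, the sketch below applies a restoration model frame by frame to a generated video. The restore_frame callable is a hypothetical wrapper around a pretrained face-restoration model such as CodeFormer, not its actual API, and the wiring is a simplified assumption rather than the exact implementation used here.

```python
import cv2

def enhance_video(input_path, output_path, restore_frame):
    """Apply a face-restoration model frame by frame as a post-processing pass.

    restore_frame is a hypothetical callable (e.g., wrapping a pretrained
    CodeFormer model) that maps a BGR frame to an enhanced frame of the same
    size; the actual CodeFormer interface may differ from this sketch.
    """
    cap = cv2.VideoCapture(input_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(restore_frame(frame))  # enhance facial detail, keep identity
    cap.release()
    writer.release()
```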

3. Methodology

3.1. Dataset Preparation

High-quality datasets are crucial foundations for training high-performance models. In the field of speech recognition and lip generation, the LRS2 dataset [35] is widely applied. This study optimizes and expands upon the LRS2 dataset, aiming to enhance model performance and improve its applicability in Chinese language contexts.

3.1.1. Optimization Data Processing

We have made the following improvements to the dataset:
  • Optimization: We conducted random screening and cleaning of the original dataset. Video samples with obvious occlusions and those predominantly featuring side angles were removed to ensure dataset quality;
  • Integration: Given the significant differences between Chinese and English phonetic systems, we introduced some Chinese data to enhance the model’s performance in Chinese language contexts. The data were sourced from current-year CCTV news broadcasts and their associated video recordings, featuring standard Mandarin pronunciations with superior audio quality and diverse content. To ensure data diversity and representativeness, we maintained a strict 1:1 male-to-female ratio in the selection process and limited individual sample durations to one minute. This carefully designed dataset enables the model to be exposed to a wide array of vocabulary and linguistic patterns, thereby improving its adaptability and accuracy in practical applications;
  • Supplementation: To further augment the model’s linguistic diversity and robustness, we supplemented the dataset with English video samples from the White House’s official website and YouTube educational content. In the selection process, we paid particular attention to maintaining a balance in gender, age, and duration of the samples, ensuring that the model could adapt to the speech characteristics and expression styles of diverse speakers.

3.1.2. Data Processing

This study employed the Single Shot Scale-invariant Face Detector (S3FD) [36] algorithm for face detection, which efficiently detects faces of various scales within a single image while maintaining consistent detection performance. Following data integration and supplementation, we segmented the dataset into 2 s video clips to optimize model training efficiency. These clips underwent meticulous processing: video frames were extracted at a rate of 25 frames per second, and audio was extracted at a sampling rate of 16,000 Hz. Using the S3FD algorithm, we standardized all facial images to uniform dimensions, ensuring data consistency. This cleaned and standardized dataset was ultimately utilized for training the lip-sync model. Figure 2 illustrates examples of the processed facial images.
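The sketch below illustrates this preprocessing under stated assumptions: detect_face stands in for an S3FD-style detector (its interface is hypothetical), and the 96 × 96 crop size is an assumed value, since the exact output resolution is not specified above.

```python
import subprocess
import cv2
import librosa

def preprocess_clip(video_path, wav_path, detect_face, crop_size=96):
    """Sketch of the clip preprocessing described above.

    detect_face is a hypothetical S3FD-style detector returning one
    (x1, y1, x2, y2) box per frame; crop_size is an assumed output resolution.
    Requires ffmpeg on the PATH.
    """
    # Extract mono audio resampled to 16 kHz.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-ac", "1",
                    "-ar", "16000", wav_path], check=True)
    wav, _ = librosa.load(wav_path, sr=16000)

    # Read frames (the 2 s clips are assumed to already be at 25 fps)
    # and crop each detected face to a fixed square size.
    faces = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        x1, y1, x2, y2 = detect_face(frame)
        faces.append(cv2.resize(frame[y1:y2, x1:x2], (crop_size, crop_size)))
    cap.release()
    return faces, wav
```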
Our dataset optimization and expansion strategy aims to provide a more balanced and comprehensive training foundation for lip-driven technology. The carefully adjusted dataset not only provides richer, high-quality learning materials for improving the Wav2Lip model but also significantly enhances the model’s performance and generalization capabilities.

3.2. Model Architecture

This section details the architecture of our proposed VividWav2Lip model. Our research improves upon the original Wav2Lip model to enhance lip synchronization performance and generation quality while reducing lip jitter issues. Our improvements primarily focus on two aspects: adding a cross-attention mechanism and additional residual blocks (SE residual mechanism, introduction of pointwise convolution, etc.). These enhancements significantly boost the model’s feature extraction and fusion capabilities. The overall architecture of the VividWav2Lip model is illustrated in Figure 3.

3.2.1. Encoders

The model inputs are shown in Figure 3. Facial images are randomly selected from the dataset and undergo lower-half occlusion processing to form ground-truth facial segments, which are then input into the facial encoder. Simultaneously, the corresponding audio signals are transformed into mel-spectrograms before being input into the audio encoder, ensuring that the model can process both visual and auditory information concurrently.
The audio encoder extracts high-level features of the audio signal (such as pitch, prosody, etc.) through neural networks, providing rich audio information for the subsequent generation process. The facial encoder is responsible for processing input video frames, using neural networks to extract high-level facial information, capturing key facial features (facial structure, details, etc.), and providing accurate visual information for subsequent generation. The coordinated work of both encoders provides a solid foundation for the model to generate high-fidelity facial animations.
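For concreteness, the following sketch converts a waveform into the log-mel-spectrogram consumed by the audio encoder. Only the 16 kHz sampling rate is taken from the text; the n_mels, n_fft, and hop_length values follow common Wav2Lip-style settings and are assumptions.

```python
import librosa
import numpy as np

def audio_to_mel(wav_path, sr=16000, n_mels=80, n_fft=800, hop_length=200):
    """Convert a waveform into the log-mel-spectrogram fed to the audio encoder.

    Only the 16 kHz sampling rate comes from the text; n_mels, n_fft, and
    hop_length are assumed values in the style of Wav2Lip preprocessing.
    """
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(np.clip(mel, 1e-5, None))  # shape: (n_mels, T)
```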

3.2.2. Cross-Attention Mechanism Module

To enhance the interaction between audio features and visual features, we introduce a cross-attention mechanism [37]. This mechanism is inspired by the application of the Transformer model, where the attention mechanism has demonstrated advantages in capturing complex relationships and enhancing feature representation across various domains. This inspired us to apply it to facial animation generation.
The cross-attention mechanism aims to achieve natural and precise audio-visual synchronization by establishing closer connections between audio and visual information. This bidirectional interaction mechanism enables the model to “listen” to audio content while processing visual information, and vice versa, thereby better aligning audio and visual information. Figure 4 illustrates the detailed structure of the cross-attention mechanism.
When extracting features, the audio encoder extracts audio features $A \in \mathbb{R}^{C \times T}$, and the facial image is processed to extract features $I \in \mathbb{R}^{C \times H \times W}$. These features are then transformed via linear transformations to generate Query (Q), Key (K), and Value (V) matrices as follows:
$$Q = W_q I,\qquad K = W_k A,\qquad V = W_v A, \tag{1}$$
In Equation (1), Wq, Wk, and Wv are learnable matrix weights. The attention weights are computed by taking the dot product of Q and K, followed by normalization as shown in Equation (2), where dk represents the dimension of K. This operation allows the model to dynamically adjust the attention given to different features, thus achieving more accurate audio-visual synchronization.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \tag{2}$$
To enhance the model’s expressive capability, we employ a multi-head attention mechanism for parallel computation, with the final output processed through residual connections and normalization. In the model, this module is positioned after the encoders and before the generator, ensuring that the generated facial features are highly synchronized with the audio content. The advantages of using this mechanism include:
  • Dynamic Association: It allows the model to dynamically focus on relevant parts of audio features and visual features;
  • Long-Range Dependencies: It can capture long-distance dependencies, which is effective for processing long audio sequences;
  • Adaptive Fusion: It can adaptively adjust the fusion method according to different features.
This cross-attention mechanism significantly enhances the model’s ability to align audio and visual information, resulting in more natural and accurate lip synchronization. By enabling the model to “listen” and “see” simultaneously, it creates a more integrated and context-aware generation process.
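A minimal PyTorch sketch of this fusion step, implementing Equations (1) and (2) with multi-head attention, a residual connection, and normalization, is given below; the embedding dimension and number of heads are illustrative assumptions rather than the exact values used in VividWav2Lip.

```python
import torch
import torch.nn as nn

class AudioVisualCrossAttention(nn.Module):
    """Cross-attention fusion sketch following Equations (1)-(2).

    Queries come from facial features and keys/values from audio features, so
    the visual stream attends to the audio content. Dimension and head count
    are illustrative, not the values used in VividWav2Lip.
    """

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, audio):
        # visual: (B, H*W, C) flattened face features; audio: (B, T, C) audio features
        fused, _ = self.attn(query=visual, key=audio, value=audio)
        return self.norm(visual + fused)  # residual connection + normalization
```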

3.2.3. Decoder

The decoder is a core component of VividWav2Lip, responsible for transforming fused features into lip-synchronized video sequences. It comprises a series of upsampling and convolutional layers, gradually converting low-resolution feature maps into high-resolution facial images. As shown in Figure 3, we introduce additional residual blocks in the upsampling process to enhance the model’s expressive capability and generation quality. These additional residual blocks are computed after each upsampling, not only increasing the network’s depth but also effectively mitigating the vanishing gradient problem, ensuring thorough training of the deep network. By introducing these additional residual blocks, the network can learn residual functions rather than directly learning underlying mappings, simplifying the learning process and improving generation quality.
We further enhance the model’s detail representation and overall quality in the generation process by incorporating the SE residual mechanism into the additional residual blocks. The application of the SE mechanism enables the model to dynamically adjust channel feature weights when processing various input features, thereby achieving more precise generation results. The SE mechanism effectively improves model performance by explicitly modeling interdependencies between channels to adjust feature responses. This mechanism originates from Squeeze-and-Excitation Networks (SENets) [38], which have achieved significant success in wide-ranging applications. Figure 5 illustrates the specific application of the SE residual mechanism in additional residual blocks.
We use Squeeze and Excitation to perform channel recalibration, enhancing sensitivity to different features. The Squeeze operation compresses the spatial information of each channel into a scalar using global average pooling, thereby obtaining a global channel description, as shown in Equation (3):
$$F_{sq} = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} U(i, j), \tag{3}$$
The Excitation operation uses two fully connected layers to learn the inter-channel correlations and adjusts the channel weights using activation functions, as shown in Equation (4):
$$F_{ex} = \sigma\!\left(W_2\,\delta\!\left(W_1 F_{sq}\right)\right), \tag{4}$$
In the above equations, W1 and W2 represent the weights of our fully connected layers, δ denotes the ReLU activation function, and σ denotes the Sigmoid activation function. Finally, the input features are recalibrated using Fscale to obtain the recalibrated features Ũ.
The SE residual mechanism plays a crucial role in our model, enabling it to better handle complex multilingual audio-driven lip synchronization tasks. Notably, to achieve efficient feature fusion, we added pointwise convolution layers after the decoder. Although these convolutional layers have a simple structure, they effectively integrate cross-channel information, allowing different information to interact and recombine. Strategically placing these convolutional layers at fusion points of skip connections and after upsampling operations helps adjust feature dimensions and enhances the fusion effect of features at different scales.
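The following PyTorch sketch shows one possible form of such an SE residual block with a trailing pointwise (1 × 1) convolution, following Equations (3) and (4); the channel count and reduction ratio are illustrative assumptions.

```python
import torch.nn as nn

class SEResidualBlock(nn.Module):
    """Sketch of an SE residual block with a trailing pointwise convolution.

    The squeeze/excitation steps follow Equations (3)-(4); channel count and
    reduction ratio are illustrative assumptions.
    """

    def __init__(self, channels=256, reduction=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        # Squeeze: global average pooling; Excitation: bottleneck (1x1 convs) + sigmoid.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.pointwise = nn.Conv2d(channels, channels, 1)  # cross-channel fusion

    def forward(self, x):
        u = self.conv(x)
        u = u * self.se(u)            # channel recalibration (F_scale)
        return self.pointwise(x + u)  # residual connection, then pointwise mixing
```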

3.2.4. Loss Function

During the training process, the L1 loss function and the Lsync loss function were employed to optimize model performance. These two loss functions correspond to the quality of the generated images and the accuracy of lip synchronization, respectively, ensuring that the final generated image sequence possesses both high image quality and accurate lip synchronization. The L1 loss function is used to measure the difference between generated images and real images. The calculation formula is shown in Equation (5):
$$L_1 = \frac{1}{N}\sum_{i=1}^{N}\left\lVert I_{low} - I_{gi}\right\rVert_1, \tag{5}$$
where Ilow represents the generated image frame, and Igi represents the real image frame. By minimizing the difference between the two, the model is encouraged to generate output closer to real images, thereby improving image quality. Lsync is used to measure the effect of generated lip synchronization, ensuring visual consistency between the generated images and audio synchronization. We use a pre-trained lip-sync expert discriminator to calculate this loss (Equation (6)):
$$L_{sync} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{I \cdot A}{\max\!\left(\lVert I\rVert_2\,\lVert A\rVert_2,\ \epsilon\right)}, \tag{6}$$
Lsync ensures that the generated lip movements are synchronized with the input audio by minimizing the difference between the audio (A) and the generated image (I).
By combining the L1 loss function and Lsync loss function, the model can simultaneously optimize image quality and lip synchronization accuracy. This comprehensive loss function design enables the generated image sequence to not only have high-quality visual effects but also accurately reflect the lip movements in the audio.
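A compact sketch of this combined objective is given below. The sync term follows the cosine-similarity form of Equation (6), computed on embeddings from the pretrained lip-sync expert; the balancing weight sync_weight is an assumed hyperparameter, not a value reported in the text.

```python
import torch
import torch.nn.functional as F

def generator_loss(gen_frames, real_frames, video_emb, audio_emb,
                   sync_weight=0.03, eps=1e-8):
    """Sketch of the combined objective: L1 reconstruction plus a sync term.

    video_emb and audio_emb are (B, D) embeddings from the pretrained lip-sync
    expert; sync_weight is an assumed balancing coefficient.
    """
    l1 = F.l1_loss(gen_frames, real_frames)                  # Equation (5)
    cos = F.cosine_similarity(video_emb, audio_emb, dim=-1)  # in [-1, 1]
    l_sync = -torch.log(cos.clamp(min=eps)).mean()           # Equation (6)
    return l1 + sync_weight * l_sync
```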

4. Experiments

4.1. Data Setup

4.1.1. Pre-Trained Lip Sync Expert Discriminator

In this study, we first prepared and processed the optimized dataset. The dataset was divided into training, validation, and test sets in a 7:2:1 ratio, totaling approximately 30 h of data and resulting in over 50,000 video segments after cutting.
The training process for the pre-trained lip-sync expert discriminator was as follows: We conducted a total of 12 × 10^5 steps of training, using cosine loss for loss computation. During the training, the loss value stabilized around 0.3 after 6 × 10^5 steps and further stabilized below 0.2 at 12 × 10^5 steps, indicating that the training was effective.

4.1.2. Generator Training

The VividWav2Lip model was trained on an NVIDIA GeForce RTX 4090 GPU (based in Santa Clara, CA, USA) with a batch size of 64. We employed the Adam optimizer and utilized a combination of L1 loss and Lsync loss as our loss functions. The training process lasted for 48 h, encompassing a total of 500 epochs.
To ensure the reliability and accuracy of our experimental results, we compared the quality of generated images under the optimized dataset with the same number of epochs. Figure 6 illustrates the dynamic evolution of the loss function. Following the methodological approach of Wav2Lip, we have omitted data points where Lsync exceeds 0.3 from the graphical representation. This data processing strategy enables us to focus on model performance within lower loss intervals, thereby facilitating a more precise evaluation of the model’s efficacy in lip synchronization. When Lsync surpasses the predetermined threshold, the quality of the generated lip synchronization typically fails to meet the standards required for practical applications. By excluding these anomalously high loss values, we can more clearly observe and analyze the model’s performance within the most relevant range for real-world applications. This approach not only enhances the interpretability of our results but also allows for a more nuanced comparison of different methods across critical performance indicators.

4.2. Evaluation Metrics

To comprehensively assess the performance of the VividWav2Lip model, we employed a combination of objective and subjective evaluation methods. The objective evaluation primarily focuses on two aspects: lip synchronization quality and image generation quality.

4.2.1. Lip Generation Evaluation

For evaluating lip synchronization quality, we adopted two metrics: Lip Sync Error Confidence (LSE-C) and Lip Sync Error Distance (LSE-D). LSE-C reflects the confidence of lip synchronization in the generated facial animation, determined by calculating the dispersion of the distance distribution between audio and image features. A higher LSE-C value indicates higher synchronization confidence. LSE-D directly assesses the degree of synchronization between the two by calculating the Euclidean distance between video frames and audio features.
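The sketch below illustrates the idea behind these two metrics on sync-expert embeddings: the mean audio-visual embedding distance is evaluated over a range of temporal offsets, LSE-D is the minimum of these distances, and LSE-C is the margin between the median and the minimum. It is a simplified illustration, not the official evaluation script.

```python
import numpy as np

def lse_metrics(video_emb, audio_emb, max_offset=15):
    """Simplified sketch of LSE-D / LSE-C from sync-expert embeddings.

    video_emb and audio_emb are assumed (T, D) arrays from a pretrained
    SyncNet-style model; max_offset is an assumed search window.
    """
    video_emb, audio_emb = np.asarray(video_emb), np.asarray(audio_emb)
    T = min(len(video_emb), len(audio_emb))
    offset_dists = []
    for off in range(-max_offset, max_offset + 1):
        lo, hi = max(0, off), min(T, T + off)
        if hi <= lo:
            continue  # window empty for very short clips
        v = video_emb[lo - off:hi - off]  # video frames shifted by `off`
        a = audio_emb[lo:hi]
        offset_dists.append(np.linalg.norm(v - a, axis=-1).mean())
    offset_dists = np.array(offset_dists)
    lse_d = float(offset_dists.min())                 # lower is better
    lse_c = float(np.median(offset_dists) - lse_d)    # higher is better
    return lse_c, lse_d
```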

4.2.2. Image Generation Quality Evaluation

For image generation quality assessment, we employed a comprehensive system of objective indicators, including Brenner, Energy, Variance, SMD, SMD2, and Entropy. Brenner evaluates image sharpness by calculating the squared difference in grayscale values between interval pixels. Energy reflects the uniformity of image texture and richness of details. Variance measures the dispersion of pixel values, representing image contrast. SMD and SMD2 assess local clarity by calculating the difference between pixels and their neighboring pixels. Entropy measures the information content of the image, reflecting its complexity. Generally, higher values for these indicators suggest a better quality of the generated images. In Equations (7)–(11), I(i,j) denotes the pixel value at position (i,j), μ represents the average pixel value of the image, M and N indicate the number of rows and columns, respectively, and p_i represents the probability of pixel value i appearing in the image.
$$\mathrm{Brenner} = \sum_{i=1}^{M-2}\sum_{j=1}^{N}\left[I(i+2, j) - I(i, j)\right]^{2}, \tag{7}$$
$$\mathrm{Energy} = \sum_{i=1}^{M-1}\sum_{j=1}^{N-1}\left\{\left[I(i+1, j) - I(i, j)\right]^{2} + \left[I(i, j+1) - I(i, j)\right]^{2}\right\}, \tag{8}$$
$$\mathrm{Variance} = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left[I(i, j) - \mu\right]^{2}, \tag{9}$$
$$\mathrm{SMD2} = \frac{1}{MN}\sum_{i=1}^{M-1}\sum_{j=1}^{N}\left|I(i+1, j) - I(i, j)\right|, \tag{10}$$
$$\mathrm{Entropy} = -\sum_{i=0}^{255} p_i \log p_i \tag{11}$$
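For reference, the sketch below computes these indicators for a grayscale frame, following Equations (7)–(11) directly; SMD is omitted because its explicit formula is not reproduced above.

```python
import numpy as np

def image_quality_metrics(img):
    """Compute the indicators of Equations (7)-(11) for a grayscale frame.

    img is a 2-D array (e.g., uint8). SMD is not included here because its
    explicit formula is not given above; it is computed analogously from
    neighboring pixel differences.
    """
    I = np.asarray(img, dtype=np.float64)
    M, N = I.shape
    brenner = np.sum((I[2:, :] - I[:-2, :]) ** 2)                      # Eq. (7)
    energy = np.sum((I[1:, :-1] - I[:-1, :-1]) ** 2
                    + (I[:-1, 1:] - I[:-1, :-1]) ** 2)                 # Eq. (8)
    variance = np.mean((I - I.mean()) ** 2)                            # Eq. (9)
    smd2 = np.sum(np.abs(I[1:, :] - I[:-1, :])) / (M * N)              # Eq. (10)
    hist, _ = np.histogram(I, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -np.sum(p * np.log(p))                                   # Eq. (11)
    return {"Brenner": brenner, "Energy": energy, "Variance": variance,
            "SMD2": smd2, "Entropy": entropy}
```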

4.2.3. Subjective Evaluation

Relying solely on objective metrics may not fully capture the nuances of human visual perception. Therefore, we introduced subjective evaluation methods to complement the objective assessment. We recruited 30 volunteers, including graduate students specializing in computer vision, professional video editors, and general users, to ensure a diverse and representative evaluation panel. These volunteers were tasked with rating the generated videos based on the following criteria: Lip Sync Accuracy (LSA), Image Sharpness (IS), Facial Naturalness (FN), and Overall Visual Quality (OVQ). Prior to the evaluation, volunteers underwent a brief training session to ensure a comprehensive understanding of each assessment criterion. The evaluation process involved each volunteer reviewing five video samples, with each sample consisting of the original video alongside videos generated by our VividWav2Lip model and other comparative methods. To maintain objectivity and mitigate order effects, the videos were presented in a randomized sequence, and volunteers were not informed about the generation method of each video.
Lip Sync Accuracy assesses the degree of match between the generated lip movements and the audio. Image Sharpness focuses on the presence of blurring or distortion phenomena. Facial Naturalness evaluates the coordination and coherence of expressions. Overall Visual Quality comprehensively considers factors such as brightness, color, and contrast. Each indicator is scored on a scale of 1–5, with higher scores indicating better results.

4.3. Test Video Generation

To comprehensively evaluate the performance of the VividWav2Lip model, we designed a diverse test dataset inspired by the ReSyncED evaluation framework. Our test dataset aims to simulate various real-world application scenarios and comprises the following components:
  • Source Content: 8 different source videos or character source images;
  • Driven Content: 10 segments of driven audio, including real speech recordings (in the wild) and text-to-speech (TTS)-generated audio. All segments exceed 10 s to assess the model’s long-term performance;
  • Language Distribution: A 2:3 ratio of Chinese to English content to test the model’s adaptability to different speech characteristics and mouth shape variations;
  • Test Matrix: Through cross-combination of 8 source contents and 10 driving contents, we generate an 8 × 10 test matrix, producing 80 distinct test cases.
This systematic test dataset ensures comprehensive and representative evaluation. By covering all source-driving pairing scenarios, we can assess the model’s performance under different conditions and conduct fair comparisons with other advanced methods. In the experiments, each comparative model generates 80 test cases, which are then comprehensively evaluated based on the aforementioned objective and subjective assessment metrics. This large-scale, diverse testing approach provides a solid experimental foundation for our research conclusions, facilitating a deep understanding of the VividWav2Lip model’s performance advantages and potential application value.
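The test matrix itself is a simple cross product of the source contents and driving audio clips; a minimal sketch with placeholder identifiers is:

```python
from itertools import product

# Cross-combine the 8 source contents with the 10 driving audio clips to obtain
# the 8 x 10 = 80 test cases described above (identifiers are placeholders).
sources = [f"source_{i:02d}" for i in range(1, 9)]
audios = [f"audio_{j:02d}" for j in range(1, 11)]
test_cases = list(product(sources, audios))
assert len(test_cases) == 80
```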

4.4. Experimental Results

This section compares the VividWav2Lip model with current mainstream open-source audio-driven lip-sync models, including Wav2Lip, Audio2Head, Diff2Lip, and LipGAN, across three dimensions: generation effects, generation quality, and subjective evaluation. To ensure the objectivity and authenticity of the assessment, we did not use post-processed video frames during the data calculation and analysis, more accurately reflecting the original output quality of each model and providing more reliable and convincing comparative results.

4.4.1. Lip Synchronization Effect Evaluation

We used the LSE-C and LSE-D metrics for the quantitative evaluation of lip synchronization effects. Table 1 and Table 2 show the performance of different models on these two indicators under various conditions. In all tables, ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
VividWav2Lip demonstrates superior performance under in-the-wild conditions, achieving an LSE-C of 9.61 and reducing LSE-D to 6.22, outperforming the other models. This confirms the effectiveness of introducing the cross-attention mechanism and optimized model architecture in improving lip synchronization accuracy. In TTS speech-driven scenarios, while our model’s overall performance is comparable to Wav2Lip, it shows a slight advantage in the LSE-D metric. Compared to Audio2Head and LipGAN, VividWav2Lip exhibits significant advantages under both driving conditions. Notably, all models show a performance decrease when processing TTS speech, possibly due to the lack of certain subtle features present in natural speech.
We further visually compare the performance of various models through visualization results. Figure 7 includes lip generation results from different methods (including source video, driven audio, LipGAN, Wav2Lip, and VividWav2Lip) for two different speakers.
In Figure 7, the left four columns demonstrate the generation results driven by Chinese audio from the input source video, while the right four columns show the results driven by English TTS audio for the input source image. We can conclude that all models are capable of maintaining the speaker’s lip information adequately. However, our method excels in detail preservation. Regarding lip synchronization, our method demonstrates more accurate and natural performance in lip-sync effects driven by both languages. In processing Chinese pronunciations, it more reasonably reflects the degree of lip opening and closing. In terms of image quality, our post-processed images retain better clarity and facial detail, showing a clear advantage. The generated images are sharper and more realistic, with richer and more vivid facial expressions, especially in the subtle changes of the eyes and mouth corners, demonstrating a higher degree of realism.

4.4.2. Analysis of Generated Image Quality

To comprehensively evaluate the quality of the generated images, we employed multi-dimensional analysis using objective metrics. First, we utilized the S3FD algorithm to precisely detect and extract the facial regions from the generated facial videos. This ensures that all experiments are based on the same facial range, thus guaranteeing consistency and comparability in the evaluation. Image quality assessment was performed through Brenner, Energy, Variance, SMD, SMD2, and Entropy. Table 3 presents a comparison of our model with Wav2Lip on these metrics.
Table 3 presents image quality metrics, where higher values indicate superior performance. The Brenner and Energy metrics quantify image clarity and detail richness, while Variance, SMD, and SMD2 assess the naturalness of facial region transitions. Entropy measures information content, correlating with expression diversity. Our VividWav2Lip model demonstrates consistent superiority over Wav2Lip across all metrics, indicating a comprehensive enhancement in image generation quality. The marked improvements in Brenner and Energy metrics suggest significantly clearer images with enhanced detail. Elevated Variance, SMD, and SMD2 values point to more natural transitions between facial regions, while increased Entropy implies richer facial expressions. These results collectively underscore the efficacy of our proposed improvements in generating high-quality, naturalistic facial animations.
The comprehensive improvement in objective indicators not only confirms our model’s exceptional performance in high-quality image generation but also provides reliable baseline data for subsequent facial optimization processes. Experiments verify that after applying facial restoration techniques, all image quality indicators achieved a further improvement of over 5% from their original values, highlighting the high plasticity and potential value of our model’s output results.

4.4.3. Subjective Evaluation Results

Given that certain subtle yet significant visual differences are challenging to fully capture through objective metrics alone, we conducted a subjective evaluation experiment. Thirty volunteers participated in the assessment, rating the generated videos on a scale of 1–5 for LSA, IS, FN, and OVQ. Figure 8 presents the average scores from the subjective evaluation.
Table 4 presents the average scores (AVG) and standard deviations (SDs) of subjective evaluation metrics for different models. Our model achieved higher average scores across all evaluation metrics with lower standard deviations, indicating improved consistency and stability in the generated facial results. The subjective evaluation results align with the previous objective metric analysis, providing strong evidence for the superiority of the VividWav2Lip model. Notably, it demonstrates significant progress in addressing the long-standing issue of facial instability, which has important implications for enhancing user experience and expanding the range of applications.

4.5. Ablation Studies

Ablation experiments are crucial for validating the effectiveness of various model components. In this section, we conduct ablation studies to evaluate the contributions of key components in the VividWav2Lip model, including the cross-attention mechanism and additional residual blocks, aiming to quantify the impact of these improvements on the model’s overall performance.

4.5.1. Experimental Setup

While keeping other conditions constant, we sequentially replace key components of the model, including the cross-attention mechanism and additional residual blocks. The ablation experiments use video as the input source, with other methods consistent with those used in Section 4.3. In addition to the original metrics, we introduce the Peak Signal-to-Noise Ratio (PSNR) metric to assess the quality of the generated images. We did not use PSNR in the lip generation evaluation phase, primarily because that phase focuses on continuous video sequences rather than static images as input sources. Here, PSNR can more accurately reflect the temporal consistency and detail fidelity of the generated results. Furthermore, as an objective metric widely applied in video quality assessment, PSNR complements our original evaluation system, providing a more comprehensive performance analysis.
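PSNR is computed in the standard way from the mean squared error between a generated frame and its reference; a minimal sketch is:

```python
import numpy as np

def psnr(reference, generated, max_val=255.0):
    """Peak Signal-to-Noise Ratio between a reference frame and a generated frame.

    Both inputs must have the same shape; max_val is the dynamic range of the
    pixel values (255 for 8-bit images). Identical frames return infinity.
    """
    ref = np.asarray(reference, dtype=np.float64)
    gen = np.asarray(generated, dtype=np.float64)
    mse = np.mean((ref - gen) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)
```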

4.5.2. Quantitative Results

Table 5 presents the quantitative results from our ablation experiments. We compared the performance of the model with the removal of additional residual blocks, the cross-attention mechanism, and the complete model.
In the ablation experiments, the VividWav2Lip model achieved a PSNR of 32.06, which is higher than the models using only the cross-attention mechanism or only the additional residual blocks. Furthermore, the LSE-C and LSE-D metrics indicate that the complete model performs best in terms of lip synchronization. In comparison, the Wav2Lip model achieved a PSNR of 31.68 under the same conditions.

4.5.3. Ablation Study Results Analysis

Additional Residual Blocks: When the additional residual blocks were removed, the model’s performance slightly decreased. This indicates that the SE residual mechanism, through adaptive feature calibration, enhances the model’s sensitivity to different features, thereby improving facial generation results.
Cross-Attention Mechanism: The addition of this mechanism resulted in an improvement across all metrics. This demonstrates that the attention mechanism plays a crucial role in audio-visual feature fusion, effectively mitigating lip jitter issues and enhancing the stability of generated animations.

4.6. Qualitative Analysis and Facial Repair Techniques

Numerical metrics alone are insufficient to comprehensively reflect the actual performance of lip-driven generation. Subtle lip jitters and changes may not be evident in metrics but are noticeably perceptible to humans. To address this issue, we conducted a frame-by-frame analysis of the generated videos and compared the results with the Wav2Lip model, demonstrating the effectiveness of our method. In terms of generation quality, VividWav2Lip exhibited more stable performance than Wav2Lip, effectively eliminating inter-frame jumps and jitters in lip movements. Additionally, the application of the facial restoration network has enhanced the realism and clarity of the generated facial animations: we utilized CodeFormer for facial restoration, which further improves the authenticity and detail expression of the generated images.
Figure 9 presents the qualitative analysis and facial restoration results. The first column shows the source-driven video and source image. Driven by audio, we compare the results generated by Wav2Lip (second column) and our generator (third column). It is evident that the video generated by Wav2Lip exhibits a noticeable lip jitter, while our model maintains stable lip synchronization effect throughout the generation process. The last column demonstrates the results after applying our facial restoration network. Through this network, we enhanced the realism and three-dimensional quality of the generated video, with particularly notable improvements in dental artifacts and facial details.
The experimental results demonstrate that VividWav2Lip has achieved significant advancements in lip sync accuracy, visual quality, and multilingual adaptability. Notably, the model shows superior performance when handling tonal languages like Chinese, offering new possibilities for cross-linguistic applications. The introduction of the cross-attention mechanism allows the model to more precisely capture the subtle relationships between audio and visual features, resulting in more natural lip animations. Additionally, the application of the SE residual mechanism enhances the model’s sensitivity to different facial features, improving the detail and quality of the generated results.

5. Discussion and Future Work

While significant advancements have been made in audio-driven lip synchronization technology, certain limitations and areas for future research remain. The primary challenge lies in generating fine oral structures such as the teeth and tongue, which are crucial for enhancing the realism of facial animations. Future studies may consider employing advanced image processing techniques, such as feathering algorithms or deep-learning-based detail enhancement methods. Additionally, the model’s performance slightly degrades when processing complex speech inputs, such as tongue twisters, necessitating the introduction of more sophisticated speech feature extraction techniques to improve the model’s adaptability. Furthermore, the model’s performance in handling emotionally expressive speech inputs requires improvement. Future research could explore the integration of emotion networks, such as normalizing flows and vector quantization models, to better capture and generate emotion-related facial animation features.
Despite VividWav2Lip’s strong generalization capabilities, the diversity of languages manifested in phonemes, pronunciation methods, and lip movements affects the model’s generalization performance. To enhance the model’s applicability across various languages, we propose the following improvements: First, expand the training dataset to include samples from a wider range of languages, particularly those with significant differences in phonemes and lip movements. Second, consider introducing language-independent phoneme representations or universal speech feature extractors to help the model capture common features across different languages. Third, explore transfer learning techniques to enable more effective knowledge transfer from learned languages to new linguistic environments and conduct in-depth analysis on how varying speech rates affect the model’s speech recognition and lip synchronization accuracy, thereby enhancing the model’s adaptability to different speech rate characteristics. Through these efforts, we anticipate the future development of truly multilingual lip synchronization models, providing advanced technological support for global communication.

6. Conclusions

This study presents an innovative deep learning approach for audio-driven lip-synchronized facial animation generation, applicable to multilingual environments and particularly excelling in processing Chinese and English audio. Our pipeline optimizes the Wav2Lip model in four aspects: First, we laid a solid foundation for model training by optimizing and expanding the dataset. Second, we employed a cross-attention mechanism to capture complex associations between audio and visual features. Third, we designed and integrated additional residual blocks to enhance sensitivity to different facial features. Lastly, we introduced a facial restoration network that significantly improved the overall realism of synthesized images. Through a systematic evaluation framework, we validated the effectiveness of our proposed method. Notably, significant progress was achieved in eliminating inter-frame jumping and jitter, providing a new solution for generating high-quality lip-synchronized animations.

Author Contributions

Conceptualization, L.L. and J.W.; methodology, Z.L.; software, S.C.; validation, L.L., J.W. and S.C.; formal analysis, J.W.; data curation, S.C.; writing—original draft preparation, L.L. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the High-Level Talent Recruitment Hundred Talents Program of Fujian Province, grant number 7050123001, and the Research Program of Xiamen University of Technology, grant number XPDKT20029.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request. All images used in this study are sourced from the following two sources: © UK government (Open Government Licence), © White House (public domain).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Song, H.; Kwon, B. Facial Animation Strategies for Improved Emotional Expression in Virtual Reality. Electronics 2024, 13, 2601. [Google Scholar] [CrossRef]
  2. Park, S.J.; Kim, M.; Choi, J.; Ro, Y.M. Exploring Phonetic Context-Aware Lip-Sync on Talking Face Generation. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; Cornell University Library: Ithaca, NY, USA, 2024; pp. 4325–4329. [Google Scholar]
  3. Yu, R.; He, T.; Zhang, A.; Wang, Y.; Guo, J.; Xu, T.; Liu, C.; Chen, J.; Bian, J. Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement. arXiv 2024, arXiv:2406.08096. [Google Scholar] [CrossRef]
  4. Wang, J.; Qian, X.; Zhang, M.; Tan, R.T.; Li, H. Seeing what you said: Talking face generation guided by a lip reading expert. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14653–14662. [Google Scholar]
  5. Wu, M.; Zhu, H.; Huang, L.; Zhuang, Y.; Lu, Y.; Cao, X. In High-fidelity 3D face generation from natural language descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4521–4530. [Google Scholar]
  6. Li, H.; Hou, X.; Huang, Z.; Shen, L. StyleGene: Crossover and Mutation of Region-level Facial Genes for Kinship Face Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 20960–20969. [Google Scholar]
  7. Ling, J.; Tan, X.; Chen, L.; Li, R.; Zhang, Y.; Zhao, S.; Song, L. StableFace: Analyzing and Improving Motion Stability for Talking Face Generation. IEEE J. Sel. Top. Signal Process. 2023, 17, 1232–1247. [Google Scholar] [CrossRef]
  8. Prajwal, K.R.; Mukhopadhyay, R.; Namboodiri, V.P.; Jawahar, C.V. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, Virtual Event/Seattle, WA, USA, 12–16 October 2020; pp. 484–492. [Google Scholar]
  9. Liu, G.; Zhou, X.; Pang, J.; Yue, F.; Liu, W.; Wang, J. Codeformer: A gnn-nested transformer model for binary code similarity detection. Electronics 2023, 12, 1722. [Google Scholar] [CrossRef]
  10. Peng, Z.; Hu, W.; Shi, Y.; Zhu, X.; Zhang, X.; Zhao, H.; He, J.; Liu, H.; Fan, Z. SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; Cornell University Library: Ithaca, NY, USA, 2024; pp. 666–676. [Google Scholar]
  11. Hegde, S.; Mukhopadhyay, R.; Jawahar, C.V.; Namboodiri, V. Towards Accurate Lip-to-Speech Synthesis in-the-Wild. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; ACM: New York, NY, USA, 2023; pp. 5523–5531. [Google Scholar]
  12. Cudeiro, D.; Bolkart, T.; Laidlaw, C.; Ranjan, A.; Black, M.J. Capture, Learning, and Synthesis of 3D Speaking Styles. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 10093–10103. [Google Scholar]
  13. Fan, Y.; Lin, Z.; Saito, J.; Wang, W.; Komura, T. FaceFormer: Speech-Driven 3D Facial Animation with Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 18749–18758. [Google Scholar]
  14. Thambiraja, B.; Habibie, I.; Aliakbarian, S.; Cosker, D.; Theobalt, C.; Thies, J. Imitator: Personalized speech-driven 3d facial animation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Vancouver, BC, Canada, 17–24 June 2023; pp. 20621–20631. [Google Scholar]
  15. Sun, Z.; Lv, T.; Ye, S.; Lin, M.; Sheng, J.; Wen, Y.; Yu, M.; Liu, Y. Diffposetalk: Speech-driven stylistic 3d facial animation and head pose generation via diffusion models. ACM Trans. Graph. (TOG) 2024, 43, 1–9. [Google Scholar] [CrossRef]
  16. Xing, J.; Xia, M.; Zhang, Y.; Cun, X.; Wang, J.; Wong, T. Codetalker: Speech-driven 3d facial animation with discrete motion prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12780–12790. [Google Scholar]
  17. Richard, A.; Zollhofer, M.; Wen, Y.; de la Torre, F.; Sheikh, Y. MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1153–1162. [Google Scholar]
  18. Kumar, R.; Sotelo, J.; Kumar, K.; de Brebisson, A.; Bengio, Y. ObamaNet: Photo-realistic lip-sync from text. arXiv 2017. [Google Scholar] [CrossRef]
  19. Suwajanakorn, S.; Seitz, S.M.; Kemelmacher-Shlizerman, I. Synthesizing obama: Learning lip sync from audio. ACM Trans. Graph. (ToG) 2017, 36, 1–13. [Google Scholar] [CrossRef]
  20. KR, P.; Mukhopadhyay, R.; Philip, J.; Jha, A.; Namboodiri, V.; Jawahar, C.V. Towards automatic face-to-face translation. In Proceedings of the 27th ACM international conference on multimedia, Nice, France, 21–25 October 2019; pp. 1428–1436. [Google Scholar]
  21. Park, S.J.; Kim, M.; Hong, J.; Choi, J.; Ro, Y.M. Synctalkface: Talking face generation with precise lip-syncing via audio-lip memory. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; pp. 2062–2070. [Google Scholar]
  22. Xu, C.; Liu, Y.; Xing, J.; Wang, W.; Sun, M.; Dan, J.; Huang, T.; Li, S.; Cheng, Z.; Tai, Y. FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 1292–1302. [Google Scholar]
  23. Cheng, K.; Cun, X.; Zhang, Y.; Xia, M.; Yin, F.; Zhu, M.; Wang, X.; Wang, J.; Wang, N. VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing in the Wild. In SIGGRAPH Asia 2022 Conference Papers, Proceedings of the SA ‘22: SIGGRAPH Asia 2022, Daegu, Republic of Korea, 6–9 December 2022; Cornell University Library: Ithaca, NY, USA, 2022; pp. 1–9. [Google Scholar]
  24. Tan, S.; Ji, B.; Pan, Y. FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; Cornell University Library: Ithaca, NY, USA, 2024; pp. 26317–26327. [Google Scholar]
  25. Wu, X.; Hu, P.; Wu, Y.; Lyu, X.; Yan-Pei, C.; Shan, Y.; Yang, W.; Sun, Z.; Qi, X. Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Vancouver, BC, Canada, 17–24 June 2023; Cornell University Library: Ithaca, NY, USA, 2023; pp. 22168–22177. [Google Scholar]
  26. Wang, J.; Zhao, K.; Zhang, S.; Zhang, Y.; Shen, Y.; Zhao, D.; Zhou, J. LipFormer: High-Fidelity and Generalizable Talking Face Generation with A Pre-Learned Facial Codebook. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 13844–13853. [Google Scholar]
  27. Aneja, S.; Thies, J.; Dai, A.; Nießner, M. FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; Cornell University Library: Ithaca, NY, USA, 2024; pp. 21263–21273. [Google Scholar]
  28. Wang, S.; Li, L.; Ding, Y.; Fan, C.; Yu, X. Audio2Head: Audio-Driven One-Shot Talking-Head Generation with Natural Head Motion. arXiv 2021. [Google Scholar] [CrossRef]
  29. Wang, K.; Zhang, J.; Huang, J.; Li, Q.; Sun, M.; Sakai, K.; Ku, W. CA-Wav2Lip: Coordinate Attention-Based Speech to Lip Synthesis in the Wild. In Proceedings of the 2023 IEEE International Conference on Smart Computing (SMARTCOMP), Nashville, TN, USA, 26–30 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–8. [Google Scholar]
  30. Liang, C.; Wang, Q.; Chen, Y.; Tang, M. Wav2Lip-HR: Synthesising clear high-resolution talking head in the wild. Comput. Animat. Virtual Worlds 2024, 35, e2226. [Google Scholar] [CrossRef]
  31. Zhang, W.; Cun, X.; Wang, X.; Zhang, Y.; Shen, X.; Guo, Y.; Shan, Y.; Wang, F. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8652–8661. [Google Scholar]
  32. Ma, Z.; Zhu, X.; Qi, G.; Chen, Q.; Zhang, Z.; Lei, Z. DiffSpeaker: Speech-Driven 3D Facial Animation with Diffusion Transformer. arXiv 2024. [Google Scholar] [CrossRef]
  33. Bao, W.; Chen, L.; Zhou, C.; Yang, S.; Wu, Z. Wavsyncswap: End-To-End Portrait-Customized Audio-Driven Talking Face Generation. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
  34. Kumari, A.; Dubey, R.K.; Mishra, S.K. A cascaded method for real face image restoration using GFP-GAN. Int. J. Innov. Res. Techn. Manag. 2022, 6, 9–105. [Google Scholar]
  35. Afouras, T.; Chung, J.S.; Senior, A.; Vinyals, O.; Zisserman, A. Deep audio-visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 44, 8717–8727. [Google Scholar] [CrossRef] [PubMed]
  36. Zhang, S.; Zhu, X.; Lei, Z.; Shi, H.; Wang, X.; Li, S.Z. S3fd: Single shot scale-invariant face detector. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 192–201. [Google Scholar]
  37. Gheini, M.; Ren, X.; May, J. Cross-attention is all you need: Adapting pretrained transformers for machine translation. arXiv 2021, arXiv:2104.08771. [Google Scholar]
  38. Cheng, G.; Li, Q.; Wang, G.; Xie, X.; Min, L.; Han, J. SFRNet: Fine-grained oriented object recognition via separate feature refinement. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–10. [Google Scholar] [CrossRef]
Figure 1. Speech-driven facial animation generation illustration.
Figure 2. Partial dataset sample illustration.
Figure 3. The architecture of VividWav2Lip. The network architecture primarily consists of three core components: Generator, Pre-trained Lip-Sync Expert, and Facial Restoration module. The Generator encompasses four key elements: Facial Encoder, Audio Encoder, our proposed Cross-Attention mechanism module, and Additional Residual Blocks.
Figure 4. Cross-Attention mechanism architecture diagram.
Figure 5. SE residual mechanism architecture diagram.
Figure 6. Training loss curve diagram.
Figure 7. Performance comparison across leading lip synchronization approaches. The first four terms in the ‘Audio’ row are Chinese text used to drive the model.
Figure 8. Subjective evaluation results across four different tasks.
Figure 9. Qualitative analysis. The Wav2Lip model suffers from jitter issues, which VividWav2Lip effectively addresses. The incorporation of a facial restoration network further enhances the output, resulting in high-fidelity images. The arrows and boxes indicate the lip regions of interest.
Table 1. Evaluation of facial animation metrics generated by different models under in-the-wild speech-driven conditions.

Method        LSE-C ↑    LSE-D ↓
Audio2Head    7.66       7.75
LipGAN        5.39       10.62
Diff2Lip      8.35       7.15
Wav2Lip       9.10       6.63
Ours          9.61       6.22
Table 2. Evaluation of facial animation metrics generated by different models under TTS speech-driven conditions.

Method        LSE-C ↑    LSE-D ↓
Audio2Head    6.93       8.21
LipGAN        4.83       9.53
Diff2Lip      8.36       7.02
Wav2Lip       8.83       6.67
Ours          9.08       6.24
Table 3. Comparison of image generation quality between VividWav2Lip and Wav2Lip models.

Method     Brenner ↑    Energy ↑      Variance ↑     SMD ↑      SMD2 ↑     Entropy ↑
Wav2Lip    4,662,686    75,296,180    184,085,379    351,049    636,804    4.81
Ours       4,736,445    77,909,458    188,225,164    357,726    642,126    4.93
Table 4. AVG and SD of subjective evaluation metrics for different models.

Method        AVG ↑    SD ↓
Audio2Head    2.90     0.26
LipGAN        3.28     0.62
Diff2Lip      3.70     0.26
Wav2Lip       3.92     0.40
Ours          4.65     0.11
Table 5. Results of the ablation evaluation.

Model                             PSNR ↑    LSE-C ↑    LSE-D ↓
Wav2Lip                           31.68     8.965      6.650
Ours (w/o cross-attention)        31.98     9.270      6.405
Ours (w/o additional residual)    32.04     9.302      6.328
Ours                              32.06     9.345      6.230
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
