Article

Emotionally Controllable Talking Face Generation from an Arbitrary Emotional Portrait

1 College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China
2 College of Software, Taiyuan University of Technology, Taiyuan 030024, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(24), 12852; https://doi.org/10.3390/app122412852
Submission received: 9 October 2022 / Revised: 20 November 2022 / Accepted: 12 December 2022 / Published: 14 December 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract
With the continuous development of cross-modal generation, audio-driven talking face generation has made substantial advances in speech content and mouth shape, but existing research on generating emotion in talking faces remains relatively unsophisticated. In this work, we present Emotionally Controllable Talking Face Generation from an Arbitrary Emotional Portrait, which synthesizes a lip-synced, emotionally controllable, high-quality talking face. Specifically, we take a facial reenactment perspective, using facial landmarks as an intermediate representation and driving the expression of the talking face with the landmark features of an arbitrary emotional portrait. Meanwhile, a decoupled design divides the model into three sub-networks to improve emotion control: the lip-sync landmark animation generation network, the emotional landmark animation generation network, and the landmark-to-animation translation network. The two landmark animation generation networks generate content-related lip-area landmarks and facial expression landmarks, respectively, which correct the landmark sequence of the target portrait. The corrected landmark sequence and the target portrait are then fed into the translation network to generate an emotionally controllable talking face. Our method controls the expression of the talking face via the driving emotional portrait while ensuring lip-sync, and it can handle audio and portraits not seen during training. A multi-perspective user study and extensive quantitative and qualitative evaluations demonstrate the superiority of the system in terms of visual emotion representation and video authenticity.

1. Introduction

With the rapid growth in demand for virtual humans [1,2,3], talking face generation has become a hot spot in current research [4,5,6]. The goal of talking face generation is to synthesize a realistic, lip-synced portrait animation from one (or a few) static portrait image(s) and an audio (or motion) sequence [7]. This task has a wide range of applications across multiple fields through simulated visual avatars, such as treatment systems for phantom vision and hearing, virtual anchors, and character customization in games.
Research in talking face animation generation is making good progress as large-scale audiovisual datasets [8,9,10] become increasingly available. However, most existing studies have focused on realistic lip sync [5,11,12,13,14] and head-pose movement [15,16], and research on conveying emotion, a vital ingredient of authenticity in communication, lacks maturity.
Some previous studies have attempted to generate emotionally controlled portrait animations [17,18,19,20]. Sadoughi et al. [18] and Eskimez et al. [20] controlled the expression of the generated animation by means of additional emotion labels, but fixed labels can only roughly express a limited number of expressions. Fang et al. [19] used audio to directly control the emotion of the generated animation, but with lower visual quality. Building on this, the state-of-the-art method proposed by Ji et al. [17] generates emotional animations while ensuring high fidelity. However, owing to the complexity of audio emotion itself, determining emotion from audio can lead to ambiguity in the animation. Furthermore, in the emotional animation generation methods above, the target portrait must be seen by the model during training, which limits the ability to render emotions and narrows the application scenarios [21].
In this study, we propose Emotionally Controllable Talking Face Generation from an Arbitrary Emotional Portrait, a method that selects a portrait image as the emotion-driving source and takes audio and a target identity portrait as input to generate emotionally diverse facial animations. Facial landmarks are used as intermediate representations, dividing the model into an audio-to-landmark network and a landmark-to-animation network. Facial landmarks connect audio to pixels and have the following advantages for facial emotion animation generation. On the one hand, facial geometry information is effectively maintained while millions of pixels are discarded, resulting in a natural low-dimensional representation of the portrait. This allows the model to capture subtle dynamic landmark displacements of the head, avoiding the confusion between portrait contours and background pixels that would otherwise obscure these details. On the other hand, landmark features can convey sufficient emotional information for downstream emotional reenactment tasks and are more structured than intermediate representations such as action units (AUs) [22].
In addition, to make the animation more emotionally controllable, we use a decoupled design to divide the audio-to-landmark network into two sub-networks that predict lip-sync landmarks and facial emotion landmarks, respectively. Firstly, the prediction of lip-sync landmarks is handled by the lip-sync landmark animation generation network, which is based on MakeItTalk [15] and decouples the audio content in order to improve accuracy. Secondly, for the prediction of facial emotion landmarks, an additional emotion source portrait is taken as input, which can be an arbitrary, unseen portrait. Driven jointly by the arbitrary emotion source portrait and the audio, our emotional landmark animation generation network effectively transfers the emotion to the target portrait frame by frame. Finally, the landmark-to-animation translation network is used to generate an emotionally controllable talking face. Compared to label-driven and audio-driven emotion methods, our method controls the facial emotions in the animation more flexibly, and the video covers more expressions while ensuring lip sync and visual clarity. The advanced performance of our method and its key components is demonstrated through qualitative and quantitative experiments.
In summary, our approach allows the generation of emotionally controlled talking faces from two unseen portraits and one unseen audio sequence; a simplified diagram is shown in Figure 1. The contributions of our work can be summarized as follows: (a) We propose an emotionally controllable facial animation generation method based on facial landmarks, using uniform low-dimensional facial landmarks as intermediate variables to avoid the training dependency problem of traditional emotionally controllable talking faces. (b) We adopt a decoupled design to divide the facial animation generation task into a lip-sync task and a facial emotion task, avoiding jumbled landmarks between the lip region and the facial expressions. (c) We enrich the emotional coverage of the portrait animation with an additional emotion source to reenact the expressions of the target portrait, making it more versatile and flexible for practical applications.

2. Related Work

2.1. Audio-Driven Talking Face Generation

Audio-driven facial animation generation is a popular topic in computer vision and generative modeling. Its purpose is to transform a static portrait into a realistic talking face animation driven by arbitrary audio. Unlike graphics-based portrait facial animation, audio-driven facial animation does not require many video clips of the target individual for training in order to maintain the target's phonemic information [7,23,24]. Previous methods for audio-driven facial animation can be divided into two groups based on the presence or absence of an intermediate structural representation.
Direct Image Reconstruction Methods. Raw pixels or points with lip movements synchronized to the speech signal are generated directly from the audio. With the development of audiovisual cross-modal learning, Chung et al. [11] proposed generating lip-synced videos with an encoder–decoder CNN model named Speech2Vid. Song et al. [12] used conditional recurrent generative networks to combine image and audio features in recurrent units to achieve temporal dependence across animation frames. Prajwal et al. [14] developed a new lip-sync discriminator that enhances the robustness of lip synchronization in the wild. Zhou et al. [16] successfully added head motion to the generated animation by defining a 12-dimensional pose encoding in latent space. The above methods focus on matching lip movements to audio but are unable to control the emotion of the video.
More recently, a few methods [10,19,20,21,25] have added an emotional element to the generated talking face. Karras et al. [25] proposed an end-to-end network with a more realistic emotional component using a 3D-based method; however, the learned emotional states are difficult to interpret and do not model facial features. Fang et al. [19] used a generative adversarial network (GAN) to directly synthesize animations from audio and pixels to express the emotions conveyed by the audio; however, their results suffered from fidelity problems (facial distortion, blurring, and other artifacts). Recently, Eskimez et al. [20] developed an end-to-end emotionally controllable talking face generation system using speech, a single face image, and a categorical emotion label as inputs. Although the idea is innovative, emotion control is constrained by the categorical labels and can only cover a limited range of expressions. In addition, some studies have opted to add extra visual input to control emotion. Wang et al. [10] could only achieve a transition from neutral to emotional; Liang et al. [21] controlled head pose and facial expression through multiple visual inputs, but this also resulted in a model with low robustness to complex backgrounds.
Intermediate Representation Methods. In the portrait reconstruction pipeline, structural information is employed as an intermediate representation to help the model learn the features of the target. The neural network of Eskimez et al. [26] can generate talking face landmarks from arbitrarily long noisy speech. Chen et al. [13] divided the process into two stages based on facial landmarks: an audio-to-landmark network and a landmark-to-animation network. This cascade method avoids fitting spurious audiovisual correlations that are unrelated to the speech content. Zhou et al. [15] used decoupled learning to separate speech into content and speaker identity components trained separately, whereby spontaneous head-pose movement is predicted after the lip-sync task is completed. In addition to facial landmarks, previous approaches have also used other intermediate representations to facilitate animation synthesis, such as the approach of Sadoughi et al. [18], who completed emotional talking face synthesis of the lip region via markers from the IEMOCAP corpus, and that of Chen et al. [27], who completed talking face synthesis using action units from the facial action coding system (FACS) [22].
Compared to direct methods for generating talking faces, facial emotion animation based on intermediate representations has been less studied. Recently, a state-of-the-art method was proposed by Ji et al. [17] that uses a cross-reconstructed emotion disentanglement technique to decompose speech into two decoupled spaces and then uses an audio-to-landmark network to generate landmarks for emotional animations. However, this method can only be applied to seen portraits and audio and generalizes weakly to new emotions. In contrast, our method captures emotion from arbitrary portraits and can be applied to unseen audio and portraits, allowing more flexibility in generating emotions for the talking face.

2.2. Facial Expression Reenactment

Significant progress in face generation and editing [28,29,30] has been made in recent years through advanced image-to-image translation [31,32]. Facial emotional expression reenactment is more challenging than facial attribute editing, which only modifies the appearance of specific facial areas; facial expression involves the movement of several facial muscles and therefore a large amount of geometric information. In ExprGAN, proposed by Ding et al. [33], an expression controller module is designed to learn expressions so that the expression intensity can be controlled during facial expression editing. Pumarola et al. [34] proposed GANimation, which uses action units (AUs) as emotion labels, allowing the intensity of each AU to be controlled individually. This solution can reenact facial emotional expressions in a continuous domain while preserving identity information. Wu et al. [35] used a cascading transformation strategy to perform progressive facial expression reenactment with local expression focuses, better preserving identity-related features and details around the eyes, nose, and mouth, thereby improving the fidelity of the generated image.
On the other hand, some studies [36,37,38] introduce facial landmarks as intermediate representations to accomplish facial expression reenactment. Qiao et al. [36] used contrastive learning to transform geometric manifolds into embedded semantic manifolds of facial expressions to accomplish cross-identity emotion transfer. Zhang et al. [37] proposed a unified landmark converter to bridge the source and target images via facial landmarks with geometric information; a triple perceptual loss function was then applied to enrich the facial details of the image by allowing the model to learn both appearance and geometric information. Liu et al. [38] also used facial landmark transformation but added a face rotation module and an expression enhancement generator to solve the entanglement problem between expression and pose features. Inspired by facial landmark-based reenactment of facial expressions, in this study, another portrait is input into the network for emotion generation; the combined operation of the two portraits and the audio results in a talking face that is emotionally rich and realistic. Unlike traditional facial expression reenactment, our method needs to ensure the identity of the target in each frame while warping the geometric information of the video frame by frame to make the video smoother and more natural, which is undoubtedly more difficult for expression reenactment.

3. Method

3.1. Overview

Figure 2 shows a schematic of the general structure of the proposed method, which accepts as input audio and two static portrait images (one representing the identity source and the other representing the emotion source) and outputs an animation of the identity source portrait with lip sync, spontaneous head movement, and the emotion source's expression. Facial landmarks are used as intermediate representations in this network to bridge the gap between audio and visual features. The network can be broken down into three main components: the lip-sync landmark animation generation network, the emotional landmark animation generation network, and the landmark-to-animation translation network. The lip-sync landmark animation generation network generates the speech-content landmark sequence from the identity source facial landmarks and the audio content (see Section 3.2). The emotional landmark animation generation network reenacts the facial landmarks of the identity source portrait, frame by frame, with respect to an arbitrary emotion source portrait and the audio, to create landmark displacements with emotion and head movement (see Section 3.3). The landmark-to-animation translation network is an image-to-image translation network that takes the target portrait and the landmarks predicted by the above networks as input to generate an emotionally controllable talking face (see Section 3.4). In the following sections, the modules of the proposed architecture are described.
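Before describing each module, the composition of the three sub-networks can be summarized with the following minimal inference-time sketch. All class and function names here are illustrative placeholders, not the authors' released code, and the exact tensor shapes are assumptions consistent with the descriptions below.

```python
# A minimal sketch of the three-stage pipeline; names are hypothetical placeholders.
import numpy as np

def generate_talking_face(audio, identity_img, emotion_img,
                          lip_net, emo_net, translation_net, detector):
    """Compose the three sub-networks into one inference pipeline."""
    L_a = detector(identity_img)     # identity-source landmarks, shape (68, 3)
    L_b = detector(emotion_img)      # emotion-source landmarks,  shape (68, 3)

    # Stage 1: audio content -> lip-synced landmark sequence for the identity source.
    P = lip_net(audio, L_a)          # (T, 68, 3)

    # Stage 2: emotion source + audio -> emotional displacements correcting P.
    Q = emo_net(audio, L_a, L_b, P)  # (T, 68, 3)

    # Stage 3: render each corrected landmark frame onto the identity portrait.
    frames = [translation_net(identity_img, q) for q in Q]
    return np.stack(frames)          # (T, H, W, 3) output video frames
```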

3.2. Lip-Sync Landmark Animation Generation Network

The goal of the lip-sync landmark animation generation network is to generate a lip-region landmark animation of the target portrait synchronized with the audio content, which requires mapping the audio to facial landmark locations. The network uses an encoder–decoder structure containing a multilayer perceptron (MLP) encoder, a long short-term memory (LSTM) [39] encoder, and an MLP decoder. Content-independent features in the audio reduce prediction accuracy to a certain degree. Therefore, adopting the idea in [15], AutoVC [40], a state-of-the-art disentangled learning approach from the field of voice conversion, is used to disentangle the audio into a content embedding and a style embedding. The speaker-independent content embedding $E_c \in \mathbb{R}^{T \times C}$ is separated from the audio $V$, where $T$ is the total number of input audio frames ($t$ denotes any frame) and $C$ is the content dimension. In addition to the content embedding, a landmark embedding vector is also extremely important, providing facial shape and identity information for this network. To this end, a 3D landmark detector [41] is first used to obtain the landmarks $L_a \in \mathbb{R}^{68 \times 3}$, followed by an MLP encoder that extracts the landmark feature from $L_a$. The content embedding and the landmark feature are then fed into an LSTM encoder that captures the audio content and the continuous lip landmark dependencies. Finally, the MLP decoder accepts the output of the LSTM encoder and predicts the lip landmark displacement $\Delta P_t$, which is summed with the original landmarks to obtain the lip-sync landmark sequence $P_t$. Mathematically, this network can be expressed as follows:
$$\Delta P_t = \mathrm{MLP}_c\big(\mathrm{LSTM}_c\big(E_c^{\,t:t+\lambda},\ \mathrm{MLP}_L(L_a; W_{mlp,l});\ W_{lstm}\big);\ W_{mlp,c}\big)$$
$$P_t = L_a + \Delta P_t$$
where $\{W_{mlp,l}, W_{lstm}, W_{mlp,c}\}$ are the learnable parameters of the MLP encoder, the LSTM encoder, and the MLP decoder, respectively; the MLP encoder $\mathrm{MLP}_L$ consists of two layers with internal hidden state vectors of size 256 and 128; the LSTM encoder $\mathrm{LSTM}_c$ is made up of three layers of units, each with a hidden state vector of size 256; and the MLP decoder $\mathrm{MLP}_c$ comprises three layers with internal hidden state vectors of size 512, 256, and 204 ($68 \times 3$). $\{E_c, L_a, \Delta P_t, P_t\}$ are consistent with the above descriptions; for the content embedding $E_c$, the window size for each frame $t$ is $\lambda = 18$.
To improve the realism of the lip landmark displacement, we use the loss function $\mathcal{L}_C$ to minimize the distance between the reference landmarks $\hat{P}_t$ and the predicted landmarks $P_t$ (where $i$ is the index of each individual landmark):
$$\mathcal{L}_C = \sum_{t=1}^{T} \sum_{i=1}^{N} \left\| P_{i,t} - \hat{P}_{i,t} \right\|_2^2$$
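For concreteness, the following is a minimal PyTorch sketch of this encoder–decoder under the layer sizes quoted above (MLP_L: 256/128, LSTM_c: three layers of 256, MLP_c: 512/256/204), together with the landmark loss. The content dimension `content_dim` and the per-frame windowing details are assumptions not fully specified in the text.

```python
# A sketch of the lip-sync audio-to-landmark network; content_dim is assumed.
import torch
import torch.nn as nn

class LipSyncLandmarkNet(nn.Module):
    def __init__(self, content_dim=80, window=18):
        super().__init__()
        self.window = window
        # Landmark (shape/identity) encoder: 68*3 -> 256 -> 128.
        self.mlp_l = nn.Sequential(nn.Linear(68 * 3, 256), nn.ReLU(),
                                   nn.Linear(256, 128), nn.ReLU())
        # LSTM over the audio-content window, conditioned on the landmark feature.
        self.lstm_c = nn.LSTM(input_size=content_dim + 128, hidden_size=256,
                              num_layers=3, batch_first=True)
        # Decoder: 256 -> 512 -> 256 -> 204 (= 68*3 displacement per frame).
        self.mlp_c = nn.Sequential(nn.Linear(256, 512), nn.ReLU(),
                                   nn.Linear(512, 256), nn.ReLU(),
                                   nn.Linear(256, 68 * 3))

    def forward(self, E_c, L_a):
        # E_c: (B, T, C) content embedding; L_a: (B, 68, 3) static landmarks.
        B, T, _ = E_c.shape
        lm_feat = self.mlp_l(L_a.flatten(1))                 # (B, 128)
        deltas = []
        for t in range(T):
            win = E_c[:, t:t + self.window]                  # (B, <=window, C)
            cond = lm_feat.unsqueeze(1).expand(-1, win.size(1), -1)
            out, _ = self.lstm_c(torch.cat([win, cond], dim=-1))
            deltas.append(self.mlp_c(out[:, -1]))            # (B, 204)
        delta_P = torch.stack(deltas, dim=1).view(B, T, 68, 3)
        return L_a.unsqueeze(1) + delta_P                    # P_t = L_a + ΔP_t

def lip_landmark_loss(P, P_ref):
    # L_C = sum over frames and landmarks of squared Euclidean distances.
    return ((P - P_ref) ** 2).sum(dim=-1).sum()
```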

3.3. Emotional Landmark Animation Generation Network

A realistic talking face must include lips that match the audio content, as well as facial expressions and head movements. To make the generated animations richer in emotion, we propose to extract emotional features from the expressions of arbitrary portrait landmarks instead of from the speech signal, as depicted in Figure 2.
With the input of the emotion source portrait, the audio and emotional conditions can be separated, and the model can efficiently transfer emotion from any portrait to the target portrait, allowing the emotion to be controlled when synthesizing the animation. The visual-modality inputs to the emotional landmark animation generation network consist of the same identity source landmarks $L_a \in \mathbb{R}^{68 \times 3}$ as the lip-sync landmark animation generation network and arbitrary emotion source landmarks $L_b \in \mathbb{R}^{68 \times 3}$. The auditory-modality input uses the content embedding $E_c$ and the style embedding $E_s$ isolated by AutoVC [40], where the style embedding is a feature of the speaker. The reason for using a voice conversion network to decouple the audio is that there is a geometric relationship between the discourse content and the speaker, notably because the mouth shape for the same sentence varies somewhat from person to person, not to mention features such as the eyebrows and eyes. Our network consists of an audio encoder ($\phi_s$), two facial landmark encoders ($\phi_{e1}$ and $\phi_{e2}$), and a facial landmark decoder ($\psi_e$).
To capture the effect of audio on facial expressions, the audio encoder $\phi_s$ is modeled after the corresponding component of MakeItTalk [15]. Unlike MakeItTalk, an MLP is not used to directly predict landmarks from the acquired audio features; instead, the features are passed on to a downstream task for multimodal fusion. Consequently, the visual and auditory modalities complement and corroborate each other, resulting in high-quality portrait animation with emotion. First, the content embedding $E_c$ is fed into an LSTM with the same structure and time window as in the lip-sync network, except for the input size. Second, the style embedding $E_s$ goes through the model of [42] to maximize the similarity between embeddings of different utterances from the same speaker and to minimize the similarity between different speakers; the internal hidden state vector sizes are 256, 128, and 128. Finally, the above outputs are assembled in a self-attention layer [43] to capture longer structural dependencies, because head motion lasts much longer than the tens of milliseconds of a phoneme. The extraction of audio features can be formulated as
$$S_t = \mathrm{Attn}\big(\mathrm{LSTM}_c(E_c^{\,t:t+\lambda}; W_{lstm}),\ \mathrm{MLP}_s(E_s; W_{mlp,s});\ W_{attn}\big)$$
where $\{W_{lstm}, W_{mlp,s}, W_{attn}\}$ are the learnable parameters of the network; $\{\mathrm{LSTM}_c, \mathrm{MLP}_s, \mathrm{Attn}\}$ are the models described above; and $\{E_c, E_s\}$ are consistent with those described above. The output is the feature $S_t$; its length is the total number of input audio frames and its width is 64.
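The audio branch can be sketched as below. The input dimensions of the content and style embeddings and the number of attention heads are assumptions not stated in the text; only the 256/128/128 hidden sizes of the style MLP and the 64-dimensional output feature follow the description above.

```python
# A sketch of the audio branch of the emotional landmark network.
import torch
import torch.nn as nn

class AudioEmotionEncoder(nn.Module):
    def __init__(self, content_dim=80, style_dim=256, out_dim=64):
        super().__init__()
        self.lstm_c = nn.LSTM(content_dim, 256, num_layers=3, batch_first=True)
        # Style MLP with hidden sizes 256, 128, 128.
        self.mlp_s = nn.Sequential(nn.Linear(style_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 128), nn.ReLU(),
                                   nn.Linear(128, 128), nn.ReLU())
        self.proj = nn.Linear(256 + 128, out_dim)
        self.attn = nn.MultiheadAttention(embed_dim=out_dim, num_heads=4,
                                          batch_first=True)

    def forward(self, E_c, E_s):
        # E_c: (B, T, content_dim) content embedding; E_s: (B, style_dim) style embedding.
        h_c, _ = self.lstm_c(E_c)                      # (B, T, 256)
        h_s = self.mlp_s(E_s)                          # (B, 128)
        h_s = h_s.unsqueeze(1).expand(-1, h_c.size(1), -1)
        tokens = self.proj(torch.cat([h_c, h_s], -1))  # (B, T, 64)
        # Self-attention captures dependencies longer than single phonemes.
        S, _ = self.attn(tokens, tokens, tokens)
        return S                                       # (B, T, 64) audio feature S_t
```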
The encoders $\phi_{e1}$ and $\phi_{e2}$ are assigned the tasks of extracting landmark features from the target portrait landmarks $L_a$ and the emotional portrait landmarks $L_b$. Both have similar structures and are lightweight neural networks consisting of a seven-layer MLP, but their functions are not the same. The encoder $\phi_{e1}$ primarily extracts geometric information $F_a$ about the identity and position from $L_a$, whereas the encoder $\phi_{e2}$ primarily extracts geometric information $F_b$ about the facial expression from $L_b$. Subsequently, the three extracted features are linearly fused to obtain the fused feature $F_t$. In addition, the decoder $\psi_e$, consisting of a three-layer MLP, accepts the fused feature $F_t$ and reconstructs the emotional landmark displacement $\Delta Q_t$ for each frame of the identity source. Lastly, $\Delta Q_t$ is used to revise the lip-sync landmark sequence $P_t$ to obtain the target portrait facial landmark sequence with the emotion source's expression, $Q_t$. To summarize, the module produces the output landmarks in the following sequential order:
$$F_a = \phi_{e1}(L_a; W_{e1})$$
$$F_b = \phi_{e2}(L_b; W_{e2})$$
$$F_t = \mathrm{concat}(F_a, F_b, S_t)$$
$$\Delta Q_t = \psi_e(F_t; W_e)$$
$$Q_t = P_t + \Delta Q_t$$
where $\{W_{e1}, W_{e2}, W_e\}$ are the learnable parameters of the network; the internal hidden state vectors of the encoders $\{\phi_{e1}, \phi_{e2}\}$ are all of size 256; the three layers of the decoder $\psi_e$ are of size 512, 256, and 204 ($68 \times 3$); $\{L_a, L_b, F_a, F_b, S_t, F_t, \Delta Q_t, P_t, Q_t\}$ are consistent with those described above; and $\mathrm{concat}$ denotes horizontal concatenation.
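A minimal sketch of this landmark branch is given below: two seven-layer MLP encoders with hidden size 256, concatenation with the audio feature $S_t$, and a three-layer decoder of sizes 512/256/204 that predicts $\Delta Q_t$. Layer widths beyond the stated sizes and the output dimensions of the encoders are assumptions.

```python
# A sketch of phi_e1, phi_e2 and psi_e; exact intermediate widths are assumed.
import torch
import torch.nn as nn

def seven_layer_mlp(in_dim, hidden=256, out_dim=256):
    layers, dim = [], in_dim
    for _ in range(6):
        layers += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    layers.append(nn.Linear(dim, out_dim))  # seventh linear layer
    return nn.Sequential(*layers)

class EmotionalLandmarkDecoder(nn.Module):
    def __init__(self, audio_dim=64):
        super().__init__()
        self.enc_a = seven_layer_mlp(68 * 3)   # phi_e1: identity geometry F_a
        self.enc_b = seven_layer_mlp(68 * 3)   # phi_e2: emotion geometry F_b
        self.dec = nn.Sequential(              # psi_e: fused feature -> ΔQ_t
            nn.Linear(256 + 256 + audio_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 68 * 3))

    def forward(self, L_a, L_b, S, P):
        # L_a, L_b: (B, 68, 3); S: (B, T, audio_dim); P: (B, T, 68, 3) lip-sync landmarks.
        B, T = S.shape[:2]
        F_a = self.enc_a(L_a.flatten(1)).unsqueeze(1).expand(-1, T, -1)
        F_b = self.enc_b(L_b.flatten(1)).unsqueeze(1).expand(-1, T, -1)
        F_t = torch.cat([F_a, F_b, S], dim=-1)       # concat(F_a, F_b, S_t)
        delta_Q = self.dec(F_t).view(B, T, 68, 3)    # ΔQ_t
        return P + delta_Q                           # Q_t = P_t + ΔQ_t
```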
In the training phase, the overall loss is given by the weighted sum of the three loss terms $\mathcal{L}$, $\mathcal{L}_{D_L}$, and $\mathcal{L}_{D_T}$:
$$\mathcal{L}_s = \lambda_1 \mathcal{L} + \lambda_2 \mathcal{L}_{D_L} + \lambda_2 \mathcal{L}_{D_T}$$
where $\lambda_1 = 1$ and $\lambda_2 = 0.001$ are set through hold-out validation.
Landmark Loss. $\mathcal{L}$ minimizes the distance between the reference landmarks $\hat{Q}_t$ and the predicted landmarks $Q_t$, with the aim of promoting the correct placement of the emotional landmarks while preserving the target identity:
$$\mathcal{L} = \sum_{t=1}^{T} \sum_{i=1}^{N} \left\| Q_{i,t} - \hat{Q}_{i,t} \right\|_2^2$$
Adversarial Loss. The adversarial terms encourage the network to generate facial landmark sequences that are temporally stable and rich in facial expression during training. To this end, we created two discriminator networks following GAN design ideas and inspired by [37]. The goal of the discriminator $D_L$ is to determine the authenticity of the predicted landmarks. To ensure the fluidity of the landmark sequence and to prevent the landmarks of interval frames from shifting too much, we use the discriminator $D_T$ to judge the similarity of identities across interval frames. $\mathcal{L}_{D_L}$ and $\mathcal{L}_{D_T}$ are their respective loss functions:
$$\mathcal{L}_{D_L} = \log D_L(\hat{Q}_t) + \log\big(1 - D_L(Q_t)\big)$$
$$\mathcal{L}_{D_T} = \log D_T(\hat{Q}_{t-1}, \hat{Q}_t) + \log\big(1 - D_T(\hat{Q}_{t-1}, Q_t)\big)$$
where $\hat{Q}_{t-1}$ denotes the reference landmark of the previous frame.
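The following sketch assembles the three terms exactly as written above, with the stated weights $\lambda_1 = 1$ and $\lambda_2 = 0.001$. The discriminators are treated as opaque callables returning probabilities in (0, 1), since their architectures are not specified here; in practice the generator and discriminators would be updated alternately in the usual adversarial fashion.

```python
# A sketch of the combined training objective; D_L and D_T are assumed callables.
import torch

def total_loss(Q, Q_ref, D_L, D_T, lambda1=1.0, lambda2=0.001, eps=1e-8):
    # Landmark loss: sum of squared distances to the reference landmarks.
    loss_lm = ((Q - Q_ref) ** 2).sum(dim=-1).sum()

    # Frame realism term, written as in the text: D_L scores reference vs. predicted frames.
    loss_DL = (torch.log(D_L(Q_ref) + eps)
               + torch.log(1 - D_L(Q) + eps)).mean()

    # Temporal term: D_T judges (previous reference frame, current frame) pairs.
    loss_DT = (torch.log(D_T(Q_ref[:, :-1], Q_ref[:, 1:]) + eps)
               + torch.log(1 - D_T(Q_ref[:, :-1], Q[:, 1:]) + eps)).mean()

    return lambda1 * loss_lm + lambda2 * loss_DL + lambda2 * loss_DT
```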

3.4. Landmark-To-Animation Translation Network

This network generates the portrait animation from the input identity source portrait image and the landmark sequence $Q_t$ obtained above. Our network follows the basic idea of Zakharov et al. [44]: a U-Net encoder–decoder architecture performs the landmark-to-animation translation while preserving the identity information and facial texture of the image, acting as a face generator. For each frame, a 256 × 256 three-channel facial sketch is drawn by connecting the predicted landmarks with predefined colored lines and is then concatenated with the original image channels to form a 256 × 256 six-channel input. The preprocessed image is passed to a 12-layer network: the first six layers are encoder layers, each consisting of a stride-2 convolution with a 3 × 3 kernel followed by two residual blocks, and the last six decoder layers form a symmetric upsampling structure. Skip connections also exist between encoding and decoding layers with the same spatial resolution. The output facial image sequence is assembled frame by frame to obtain the result: a lip-synced facial animation of the target portrait with the emotion source's expression.
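The input preparation for this translator can be sketched as below: the 68 predicted landmarks are drawn as colored polylines on a blank 256 × 256 canvas and stacked with the portrait to form the six-channel input. The grouping of landmark indices into facial parts and the specific colors are assumptions based on the standard 68-point layout, not values given in the paper.

```python
# A sketch of landmark-sketch rasterization; part ranges and colors are assumed.
import cv2
import numpy as np

# (start, end, BGR color) index ranges of the standard 68-point annotation.
FACE_PARTS = [
    (0, 17, (255, 0, 0)),    # jaw line
    (17, 22, (0, 255, 0)),   # left eyebrow
    (22, 27, (0, 255, 0)),   # right eyebrow
    (27, 36, (0, 0, 255)),   # nose
    (36, 42, (255, 255, 0)), # left eye
    (42, 48, (255, 255, 0)), # right eye
    (48, 68, (0, 255, 255)), # lips
]

def landmarks_to_input(landmarks, portrait, size=256):
    """landmarks: (68, 2 or 3) pixel coordinates; portrait: (size, size, 3) uint8 image."""
    sketch = np.zeros((size, size, 3), dtype=np.uint8)
    pts = landmarks[:, :2].astype(np.int32)
    for start, end, color in FACE_PARTS:
        segment = pts[start:end].reshape(-1, 1, 2)
        cv2.polylines(sketch, [segment], isClosed=False, color=color, thickness=2)
    # Stack portrait and sketch channels -> (size, size, 6) network input.
    return np.concatenate([portrait, sketch], axis=-1)
```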

4. Experiment

4.1. Experimental Settings

Datasets. Two audiovisual datasets were employed: MEAD [10] and the Obama Weekly Address database [8]. MEAD is a high-resolution emotional audiovisual dataset consisting of 60 actors from different continents with eight emotions, three intensities, and seven poses. A subset of this database was used to pre-train the emotional landmark animation generation network and the landmark-to-animation translation network. Videos of the frontal pose were selected from the MEAD database to train our network. The dataset was divided into three sections: 60%, 20%, and 20% for training, hold-out validation, and testing, respectively. For the emotional landmark animation generation network, the reference was a video of a target portrait with the emotion source's expression. The Obama Weekly Address database contains high-quality video clips of weekly presidential speeches from a single speaker, former President Barack Obama. Its consistent frontal pose with little emotion is ideal for training the lip-sync landmark animation generation network.
Implementation Details. All audio waveforms in the dataset were sampled at 16 kHz, and all videos were converted to 30 fps. All images fed to the network models were 256 × 256 pixels in size. The 68 landmark key points were detected with the method of [41]. The models were trained using PyTorch on a 16 GB NVIDIA Tesla V100 GPU (NVIDIA, Santa Clara, CA, USA).
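A minimal preprocessing sketch matching these settings (16 kHz mono audio, 30 fps video, 256 × 256 frames) is given below, using ffmpeg via subprocess. The file paths are placeholders, and the naive scale-to-256 step stands in for whatever face cropping the authors actually used.

```python
# A preprocessing sketch under the stated settings; paths and cropping are assumptions.
import subprocess

def preprocess_clip(src_video, out_video, out_audio):
    # Re-encode video at 30 fps and scale frames to 256x256 (no audio stream).
    subprocess.run(["ffmpeg", "-y", "-i", src_video, "-r", "30",
                    "-vf", "scale=256:256", "-an", out_video], check=True)
    # Extract mono audio resampled to 16 kHz.
    subprocess.run(["ffmpeg", "-y", "-i", src_video, "-vn",
                    "-ac", "1", "-ar", "16000", out_audio], check=True)
```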
Comparing Methods. Our approach was compared with two state-of-the-art, representative algorithms for image-based talking face video generation: ATVG and MakeItTalk. ATVG, proposed by Chen et al. [13], is a cascaded network structure based on facial landmarks, and MakeItTalk, by Zhou et al. [15], uses decoupled learning to generate spontaneous head movements. These two methods were used extensively for comparison.

4.2. Quantitative Evaluation

To evaluate the quality of the images generated by the different methods, the metrics SSIM [45], PSNR, and FID [46] were selected. To evaluate lip synchronization and facial expression at the semantic level, the landmark distance (LD) and landmark velocity difference (LVD) were selected as metrics [13,15,17]. LD denotes the average Euclidean distance between all predicted landmark locations and the reference locations. LVD is the average Euclidean distance between the landmark velocities of the two sequences, where velocity refers to the difference in landmark positions between consecutive frames. Prior to evaluation, the resulting sequences were aligned with the ground-truth sequences, and facial landmarks were extracted from them. The results are presented in Table 1, with metrics covering video quality (SSIM, PSNR, FID), lip sync (L-LD, L-LVD), and facial expression (F-LD, F-LVD), where the facial expression metrics include all landmarks in the facial region. As can be seen, our generated animation adds additional emotion, and the scores outperform those of Chen et al. [13] and Zhou et al. [15]. Furthermore, our results are comparable to the existing state-of-the-art algorithms of Ji et al. [17] and Wang et al. [10]. Notably, emotion can be controlled in our method using arbitrary portraits, avoiding the training dependency problem whereby a method is individual-specific and a model must be trained for a particular person.
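The two landmark metrics can be written down directly from their definitions above; the sketch below assumes the sequences have already been aligned and extracted as arrays of shape (T, N, 2 or 3).

```python
# A sketch of the LD and LVD metrics on pre-aligned landmark sequences.
import numpy as np

def landmark_distance(pred, ref):
    # LD: mean Euclidean distance between predicted and reference landmark positions.
    return np.linalg.norm(pred - ref, axis=-1).mean()

def landmark_velocity_difference(pred, ref):
    # Velocity = landmark displacement between consecutive frames;
    # LVD = mean Euclidean distance between the two velocity sequences.
    v_pred = pred[1:] - pred[:-1]
    v_ref = ref[1:] - ref[:-1]
    return np.linalg.norm(v_pred - v_ref, axis=-1).mean()
```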

4.3. Qualitative Evaluation

Subjective evaluation is equally crucial for the visual content generation and emotion generation tasks. As shown in Figure 3, two advanced techniques (listed in Section 4.1) were compared with the proposed approach. Our system requires the additional input of an emotion source portrait (the portrait in the lower left corner) in addition to the audio and target portrait inputs. The proposed approach allows the target portrait to produce expressions corresponding to the given emotion source portrait while ensuring the visual quality of the video and the identity of the target. Specifically, in Chen et al. [13] and Zhou et al. [15], lip movements are acceptable, and Zhou et al. [15] additionally considers head posture. However, the expressions of both characters are constrained by the input portrait, and less attention is paid to the emotional aspect. In our case, lip movement and head posture are guaranteed, and the input of the emotion source image allows control of the emotional content in the output video. This shows that multimodal inputs can provide rich emotional cues, allowing the creation of emotionally richer videos via the fusion of diverse input types.
In addition to this, we used the proposed method to perform a comparison in which the same target portrait and audio were combined with different emotion-driven sources as input. As shown in Figure 4, we generated seven emotions for the target portrait, from top to bottom: anger, contempt, disgust, fear, happiness, sadness, and surprise. Our approach allows diverse expressions to be generated based on the emotion source. Expressions with more distinctive traits, such as anger, happiness, and sadness, produce better results, whereas the more ambiguous contempt, disgust, fear, and surprise are rendered less effectively. In addition to the basic categories of facial emotions, there are also compound emotions, i.e., emotions constructed by combining the basic categories into new ones [47]. Theoretically, the proposed method can generate expressions of arbitrary emotional origin, including compound emotions. The experimental results show that this system ensures the visual quality and target identity to a certain extent and enriches the emotional representation ability of talking face animation. However, it is limited by the identity source portrait and the landmark-to-animation translation network, leading to some confusion in the reenactment of certain expressions.

4.4. Ablation Study

To evaluate the impact of each component of our approach, an ablation study was carried out. We created two variants: the first, referred to as the content variant, consists of the lip-sync landmark animation generation network and the landmark-to-animation translation network; the second, the emotion variant, consists of the emotional landmark animation generation network and the landmark-to-animation translation network. As shown in Figure 5, the content variant produces well-synchronized lip motion and retains the same expression profile as the identity source portrait. The emotion variant is driven by the audio and the emotion source portrait to generate a video that matches the expression features of the emotion source portrait. Unlike either variant, our full method generates lip sync while also transferring expression features from another portrait. Furthermore, to further demonstrate the versatility and flexibility of our approach, we chose unseen portraits as inputs (the identity source is the corresponding author of this paper, and the emotion source is the first author).

4.5. User Study

The user study was conducted in three phases, each with 126 participants. The first phase focused on assessing image quality and lip syncing, while the second and third phases focused on assessing the emotional validity of the videos. The input video and audio were obtained from MEAD, and the input image was the first frame of the video in this database.
The first stage was a cross-sectional comparison between four groups: real data, our proposed method, Chen et al. [13], and Zhou et al. [15]. The selected target identity pictures had neutral emotion (no emotion on the face) in three frontal poses, and three voices were randomly selected for each emotion except neutral (seven in total), generating sixty-three animations per group. The emotion source input to our method was the first frame of the video of the target identity for each of the seven non-neutral emotions. Participants were required to rate the videos on a five-point scale for video authenticity and degree of lip sync. The comparison results of the four groups are shown in Figure 6, where error bars indicate the standard deviation and asterisks represent a significant difference (** indicates p < 0.01). The pink, orange, blue, and green histograms denote Chen et al. [13], Zhou et al. [15] (which received similar scores), our method, and the real data, respectively.
The second phase investigated whether the emotional rendering of our generated videos was authentic and effective. Participants at this stage were asked to classify the emotions of videos generated from real data, by Wang et al. [10], and by our proposed method. The video samples for this stage were all taken from the same portrait and the same audio but with different emotions, for a total of 64 videos per method. Additionally, the videos had no background sound, so that participants classified the emotions using visual cues alone. The confusion matrices for the subjective emotion classification in the user survey are shown in Figure 7. Our videos were more similar to the ground-truth video patterns and produced more diagonal confusion matrices than those of Wang et al. The overall classification accuracies of 64.2% (ground truth), 43.5% (Wang et al.), and 50.9% (ours) validate the effectiveness of the proposed model in generating emotionally controlled talking face videos, showing the power of the emotional landmark animation generation network.
In the last stage, an emotion perception user survey of emotionally mismatched videos was conducted, as in Eskimez et al. [20], asking participants to identify the primary and secondary emotions (if any) conveyed by each video. The input consisted of identity source portraits with neutral emotion, together with audio and emotion source portraits covering eight emotions. Sixty-four audiovisual emotion videos were generated by our system by permuting the audio and visual emotions, of which eight were emotionally matched. The participants' judgements of the emotions displayed in these videos indicate, to some extent, which modality people rely on most when recognizing the depicted emotions. As can be seen from the confusion matrices in Figure 8, the proportion of responses matching the visual emotion as the primary emotion is greater than that matching the auditory emotion, at 41.4% versus 34.9%. This indicates that, for the emotional perception of audiovisual content, the subjects relied considerably more on the visual modality than on the audio modality. Emotions such as happiness, sadness, and anger are relatively easy to perceive through the visual modality, whereas the pairs disgust/contempt and fear/surprise are more easily confused. The audio modality has three pairs of confusable emotions: disgust/contempt, happy/surprised, and fearful/sad.

5. Conclusions

In this study, we proposed a generative network for talking faces based on facial landmarks, from the perspective of decoupled design and expression reenactment, to enrich the emotion of the generated animation. Our method extracts expression features from an additional, unseen emotional portrait and processes them together with the audio to control the generation of the animated facial expressions. The method generates videos that are authentic and high in fidelity while enhancing the emotion of the animation, enriching the video material, and providing flexible control over the animated expressions. Qualitative and quantitative experiments have validated the efficiency and flexibility of our approach. In the future, we will use more efficient networks to capture long-term temporal and spatial correlations between pixels to reduce the artifacts of expression reenactment. The emotional element of audio will also be further explored to generate emotionally rich audiovisual animations.

Author Contributions

Conceptualization, Z.Z.; methodology, Z.Z.; software, Z.Z. and T.W.; validation, Y.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, H.G. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All datasets used for training and evaluating the performance of our proposed approach are publicly available and can be accessed from [8,10].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, X.; Wu, Q.; Zhou, H.; Xu, Y.; Qian, R.; Lin, X.; Zhou, X.; Wu, W.; Dai, B.; Zhou, B. Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10462–10472. [Google Scholar]
  2. Liu, X.; Xu, Y.; Wu, Q.; Zhou, H.; Wu, W.; Zhou, B. Semantic-aware implicit neural audio-driven video portrait generation. arXiv 2022, arXiv:2201.07786. [Google Scholar]
  3. Sheng, C.; Kuang, G.; Bai, L.; Hou, C.; Guo, Y.; Xu, X.; Pietikäinen, M.; Liu, L. Deep Learning for Visual Speech Analysis: A Survey. arXiv 2022, arXiv:2205.10839. [Google Scholar]
  4. Hong, F.-T.; Zhang, L.; Shen, L.; Xu, D. Depth-Aware Generative Adversarial Network for Talking Head Video Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3397–3406. [Google Scholar]
  5. Wang, S.; Li, L.; Ding, Y.; Yu, X. One-shot talking face generation from single-speaker audio-visual correlation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 22 February–1 March 2022; pp. 2531–2539. [Google Scholar]
  6. Shen, S.; Li, W.; Zhu, Z.; Duan, Y.; Zhou, J.; Lu, J. Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 666–682. [Google Scholar]
  7. Chen, L.; Cui, G.; Kou, Z.; Zheng, H.; Xu, C. What comprises a good talking-head video generation?: A survey and benchmark. arXiv 2020, arXiv:2005.03201. [Google Scholar]
  8. Suwajanakorn, S.; Seitz, S.M.; Kemelmacher-Shlizerman, I. Synthesizing obama: Learning lip sync from audio. ACM Trans. Graph. (ToG) 2017, 36, 1–13. [Google Scholar] [CrossRef]
  9. Chung, J.S.; Nagrani, A.; Zisserman, A. Voxceleb2: Deep speaker recognition. arXiv 2018, arXiv:1806.05622. [Google Scholar]
  10. Wang, K.; Wu, Q.; Song, L.; Yang, Z.; Wu, W.; Qian, C.; He, R.; Qiao, Y.; Loy, C.C. Mead: A large-scale audio-visual dataset for emotional talking-face generation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2020; pp. 700–717. [Google Scholar]
  11. Chung, J.S.; Jamaludin, A.; Zisserman, A. You said that? arXiv 2017, arXiv:1705.02966. [Google Scholar]
  12. Song, Y.; Zhu, J.; Li, D.; Wang, X.; Qi, H. Talking face generation by conditional recurrent adversarial network. arXiv 2018, arXiv:1804.04786. [Google Scholar]
  13. Chen, L.; Maddox, R.K.; Duan, Z.; Xu, C. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7832–7841. [Google Scholar]
  14. Prajwal, K.; Mukhopadhyay, R.; Namboodiri, V.P.; Jawahar, C. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 484–492. [Google Scholar]
  15. Zhou, Y.; Han, X.; Shechtman, E.; Echevarria, J.; Kalogerakis, E.; Li, D. MakeltTalk: Speaker-aware talking-head animation. ACM Trans. Graph. (TOG) 2020, 39, 1–15. [Google Scholar] [CrossRef]
  16. Zhou, H.; Sun, Y.; Wu, W.; Loy, C.C.; Wang, X.; Liu, Z. Pose-controllable talking face generation by implicitly modularized audio-visual representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4176–4186. [Google Scholar]
  17. Ji, X.; Zhou, H.; Wang, K.; Wu, W.; Loy, C.C.; Cao, X.; Xu, F. Audio-driven emotional video portraits. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14080–14089. [Google Scholar]
  18. Sadoughi, N.; Busso, C. Speech-driven expressive talking lips with conditional sequential generative adversarial networks. IEEE Trans. Affect. Comput. 2019, 12, 1031–1044. [Google Scholar] [CrossRef]
  19. Fang, Z.; Liu, Z.; Liu, T.; Hung, C.-C.; Xiao, J.; Feng, G. Facial expression GAN for voice-driven face generation. Vis. Comput. 2021, 38, 1151–1164. [Google Scholar] [CrossRef]
  20. Eskimez, S.E.; Zhang, Y.; Duan, Z. Speech driven talking face generation from a single image and an emotion condition. IEEE Trans. Multimed. 2021, 24, 3480–3490. [Google Scholar] [CrossRef]
  21. Liang, B.; Pan, Y.; Guo, Z.; Zhou, H.; Hong, Z.; Han, X.; Han, J.; Liu, J.; Ding, E.; Wang, J. Expressive talking head generation with granular audio-visual control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3387–3396. [Google Scholar]
  22. Friesen, E.; Ekman, P. Facial action coding system: A technique for the measurement of facial movement. Palo Alto 1978, 3, 5. [Google Scholar]
  23. Sha, T.; Zhang, W.; Shen, T.; Li, Z.; Mei, T. Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis. arXiv 2021, arXiv:2109.02081. [Google Scholar] [CrossRef]
  24. Zhu, H.; Luo, M.; Wang, R.; Zheng, A.H.; He, R. Deep Audio-visual Learning: A Survey. Int. J. Autom. Comput. 2021, 18, 351–376. [Google Scholar] [CrossRef]
  25. Karras, T.; Aila, T.; Laine, S.; Herva, A.; Lehtinen, J. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Trans. Graph. (TOG) 2017, 36, 1–12. [Google Scholar] [CrossRef]
  26. Eskimez, S.E.; Maddox, R.K.; Xu, C.; Duan, Z. Generating talking face landmarks from speech. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, Guildford, UK, 2–6 July 2018; pp. 372–381. [Google Scholar]
  27. Chen, S.; Liu, Z.; Liu, J.; Yan, Z.; Wang, L. Talking Head Generation with Audio and Speech Related Facial Action Units. arXiv 2021, arXiv:2110.09951. [Google Scholar]
  28. Larsen, A.B.L.; Sønderby, S.K.; Larochelle, H.; Winther, O. Autoencoding beyond pixels using a learned similarity metric. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1558–1566. [Google Scholar]
  29. Wang, H.-P.; Yu, N.; Fritz, M. Hijack-gan: Unintended-use of pretrained, black-box gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7872–7881. [Google Scholar]
  30. He, J.; Shi, W.; Chen, K.; Fu, L.; Dong, C. GCFSR: A Generative and Controllable Face Super Resolution Method Without Facial and GAN Priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1889–1898. [Google Scholar]
  31. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  32. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Honolulu, HI, USA, 21–26 July 2017; pp. 2223–2232. [Google Scholar]
  33. Ding, H.; Sricharan, K.; Chellappa, R. Exprgan: Facial expression editing with controllable expression intensity. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; p. 32. [Google Scholar]
  34. Pumarola, A.; Agudo, A.; Martinez, A.M.; Sanfeliu, A.; Moreno-Noguer, F. Ganimation: One-shot anatomically consistent facial animation. Int. J. Comput. Vis. 2020, 128, 698–713. [Google Scholar] [CrossRef]
  35. Wu, R.; Zhang, G.; Lu, S.; Chen, T. Cascade ef-gan: Progressive facial expression editing with local focuses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5021–5030. [Google Scholar]
  36. Qiao, F.; Yao, N.; Jiao, Z.; Li, Z.; Chen, H.; Wang, H. Geometry-contrastive gan for facial expression transfer. arXiv 2018, arXiv:1802.01822. [Google Scholar]
  37. Zhang, J.; Zeng, X.; Wang, M.; Pan, Y.; Liu, L.; Liu, Y.; Ding, Y.; Fan, C. Freenet: Multi-identity face reenactment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5326–5335. [Google Scholar]
  38. Liu, J.; Chen, P.; Liang, T.; Li, Z.; Yu, C.; Zou, S.; Dai, J.; Han, J. Li-Net: Large-Pose Identity-Preserving Face Reenactment Network. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
  39. Graves, A.; Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1764–1772. [Google Scholar]
  40. Qian, K.; Zhang, Y.; Chang, S.; Yang, X.; Hasegawa-Johnson, M. Autovc: Zero-shot voice style transfer with only autoencoder loss. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 5210–5219. [Google Scholar]
  41. Bulat, A.; Tzimiropoulos, G. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision, Honolulu, HI, USA, 21–26 July 2017; pp. 1021–1030. [Google Scholar]
  42. Wan, L.; Wang, Q.; Papir, A.; Moreno, I.L. Generalized end-to-end loss for speaker verification. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4879–4883. [Google Scholar]
  43. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  44. Zakharov, E.; Shysheya, A.; Burkov, E.; Lempitsky, V. Few-shot adversarial learning of realistic neural talking head models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Long Beach, CA, USA, 15–20 June 2019; pp. 9459–9468. [Google Scholar]
  45. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  46. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6629–6640. [Google Scholar]
  47. Du, S.C.; Tao, Y.; Martinez, A.M. Compound facial expressions of emotion. Proc. Natl. Acad. Sci. USA 2014, 111, E1454–E1462. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The abbreviated framework of the proposed method. Two portraits and audio are input, and a video of the identity source portrait talking face with new emotional expressions is obtained as output.
Figure 2. The general framework of the proposed method consists of three network models in dashed lines in three colors. Using the above network structures, the identity source portrait image, emotion source portrait image, and audio are used as inputs to generate an emotionally controllable talking face.
Figure 3. Qualitative comparison with the state-of-the-art methods. All the methods in the figure are based on image generation animation. The input images are all the first frames of the target video; the input videos are all the same speech; and our method additionally selects the unseen portrait as the emotion source.
Figure 4. Internal qualitative comparison. Using the same identity source and different emotion sources.
Figure 5. A comparison of our approach with different variants: our full model (top right), the content variant (middle right), and the affective variant (bottom right).
Figure 6. User study results of audio-visual sync and video authenticity (refs. [13,15]).
Figure 7. User study results of the human emotion classification confusion matrix. Ground truth video (left), and video generated on the basis of our proposed method (middle) and Wang et al. [10] (right).
Figure 8. User survey of emotional perception of emotionally mismatched videos. Visual modality emotion perception (left) and auditory modality emotion perception (right).
Table 1. Quantitative comparison with state-of-the-art methods. Landmark accuracy and video quality are calculated by comparing the results of different methods with real data. L- represents the lip area and F- represents the face area.

| Method           | SSIM↑ | PSNR↑ | FID↓  | L-LD↓ | L-LVD↓ | F-LD↓ | F-LVD↓ |
|------------------|-------|-------|-------|-------|--------|-------|--------|
| Chen et al. [13] | 0.60  | 28.55 | 67.60 | 3.27  | 2.09   | 3.82  | 1.71   |
| Zhou et al. [15] | 0.62  | 28.94 | 17.33 | 2.49  | 1.66   | 3.55  | 1.64   |
| Wang et al. [10] | 0.68  | 28.61 | 22.52 | 2.52  | 2.28   | 3.16  | 2.01   |
| Ji et al. [17]   | 0.71  | 29.53 | 7.99  | 2.45  | 1.78   | 3.01  | 1.56   |
| Ours             | 0.62  | 28.92 | 16.80 | 2.52  | 1.68   | 2.98  | 1.48   |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
