Article

Editable Co-Speech Gesture Synthesis Enhanced with Individual Representative Gestures

Yihua Bao, Dongdong Weng and Nan Gao
1 Beijing Engineering Research Center of Mixed Reality and Advanced Display, Beijing Institute of Technology, No. 5 Yard, Zhongguancun South Street, Haidian District, Beijing 100081, China
2 Institute of Automation, Chinese Academy of Sciences, No. 95 Zhongguancun East Road, Haidian District, Beijing 100190, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3315; https://doi.org/10.3390/electronics13163315
Submission received: 16 July 2024 / Revised: 15 August 2024 / Accepted: 19 August 2024 / Published: 21 August 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

Co-speech gesture synthesis is a challenging task due to the complexity and uncertainty of the relationship between gestures and speech. Gestures that accompany speech (i.e., co-speech gestures) are an essential part of natural and efficient embodied human communication, as they work in tandem with speech to convey information more effectively. Although data-driven approaches have improved gesture synthesis, existing deep learning-based methods use deterministic modeling, which can average out the predicted gestures. Additionally, these methods offer little control over gesture generation, such as user editing of the generated results. In this paper, we propose an editable gesture synthesis method based on a learned pose script, which disentangles gestures into individual representative gestures and rhythmic gestures to produce high-quality, diverse, and realistic poses. Specifically, we first detect the times at which gestures occur in video sequences and transform them into pose scripts. Regression models are then built to predict the pose scripts. Next, the learned pose scripts are used for gesture synthesis, while rhythmic gestures are modeled using a variational auto-encoder and a one-dimensional convolutional network. Moreover, we introduce a large-scale Chinese co-speech gesture synthesis dataset with multimodal annotations for training and evaluation, which will be publicly available to facilitate future research. The proposed method allows for the re-editing of generated results by changing the pose scripts, enabling applications such as interactive digital humans. Experimental results show that the method generates higher-quality, more diverse, and more realistic gestures than existing methods.

1. Introduction

In contemporary human–computer interaction research, gesture synthesis has become a fundamental component for enhancing the naturalness and intuitiveness of communication across a wide range of applications, including robotics [1], virtual reality [2], and other interactive systems. These applications demand gestures that not only appear natural and expressive but also align closely with human communication patterns, thus improving the overall user experience. For example, in robotics, gesture synthesis facilitates more effective communication by providing non-verbal cues that complement spoken language, making interactions smoother and more intuitive. In virtual reality, gestures enhance immersion by allowing avatars to perform lifelike and contextually appropriate movements, thereby making the virtual environment more engaging and realistic.
The intelligent interactive digital human employs recognition systems to analyze and comprehend speech signals, enabling the synthesis of corresponding behaviors such as lip synchronization [3], gestures [4], head and body postures [5], and facial expressions [6]. This technology has successfully simulated realistic human interactions with impressive results. It is worth noting that gestures are effective in conveying non-verbal messages that complement verbal expressions [7]. The speech-driven gesture generation task focuses on establishing cross-modal associations between gestures and speech, facilitating the generation of plausible gestures. This is a challenging problem stemming from the uncertainty of gesture expressions, which leads to difficulties in modeling the relationship between gestures and speech.
Although some psychological studies have attempted to classify gestures based on their intended expressions [8], there are no universally applicable rules for gesture generation due to its highly individualized nature. Earlier rule-based gesture generation methods [9,10] annotated gestures and symbolically described the input text or speech using specific rules in order to match them against a predefined gesture database. Rule-based systems for generating predefined gestures ensure the consistency and stability of the generated results, making them widely used in commercial robotics applications. However, these rule-based approaches require specialized experts with prior knowledge to design the rules and necessitate extensive manual data annotation. The predefined rules may seem simplistic and subjective when applied to complex speech, resulting in generated gestures that lack diversity, authenticity, and naturalness.
The rules governing the generation of gestures from speech are both subjective and complex. Co-speech gestures—gestures that naturally accompany spoken language—are particularly challenging to model due to their spontaneous and context-dependent nature. Subsequent research has focused on learning to generate gestures from data [11,12], leading to the production of more expressive and nuanced gestures. Recent advancements in deep learning techniques have greatly facilitated the task of gesture generation [13,14]. These methods leverage large-scale datasets such as TED [15], Speech2Gesture [16], the Trinity Gesture Dataset [17], and the BEAT dataset [18], among others, and employ deep neural networks to effectively model the intricate relationships between different modalities. This process, known as gesture synthesis, aims to generate gestures that correspond to speech or other input modalities in a coherent and natural manner. Several studies [19,20] have demonstrated the ability of deep learning-based approaches to successfully capture the correlation between speech rhythmic features and corresponding gestures. However, it is important to note that gesture generation remains an ill-posed problem. Adopting a deterministic modeling approach can lead to the averaging of results, producing digital human gestures that may appear monotonous and lack diversity. Furthermore, learning-based modeling approaches directly output serialized gestures, limiting the ability to customize or edit the generated results. This constraint presents challenges for practical applications of the technique.
In this paper, we focus on gesture generation for Chinese speech, with the goal of producing comprehensive and expressive gestures that can be easily edited. The scarcity of studies focusing on Chinese datasets, coupled with the distinct rhythmic patterns of Chinese speech, further underscores the necessity of this research. Unlike other datasets, such as BEAT, which often lack a complete gesture for each sentence, the Chinese anchor gesture dataset we utilize ensures that a complete gesture accompanies each sentence, making it uniquely suited for detailed gesture analysis. Specifically, we introduce a novel concept called representative gestures, which is based on the following observations. McNeill's classification [21] categorizes gestures into emblematic, deictic, iconic, metaphoric, and beat gestures based on their communicative purposes. Among these, beat gestures occur most frequently in synchronization with speech, constituting over 90% of gestures in news speech contexts. We refer to these beat gestures without explicit semantic meaning as representative gestures. By detecting changes in gestures and clustering them, we automatically obtain a set of representative gestures. According to Kendon's research [22], gestures can be composed of multiple gesture units and pose variables related to rhythmic patterns, which we refer to as rhythmic gestures. These rhythmic gestures are organized according to a pose script, a sequence that dictates the specific poses and movements tied to speech rhythms, ensuring the natural flow and coherence of the generated gestures.
Based on these observations, we propose an editable gesture synthesis method that incorporates individual representative gestures, as depicted in Figure 1. First, we predict the probability of gesture occurrence from speech, generating a pose script to guide subsequent gesture generation. Next, we decompose gestures into representative gestures and rhythmic gestures. We employ a learning-based approach to model rhythmic gestures from speech and use its outputs as the rhythmic component, while generative methods are adopted to model personalized representative gestures. Finally, we merge these two types of gestures according to the learned pose script to obtain the final gestures. To objectively evaluate the generated results, we employ multidimensional evaluation metrics. Experimental results demonstrate that our proposed method achieves natural and diverse gestures, with the ability for further editing.
To summarize, our contributions are as follows:
  • We propose an editable gesture generation pipeline that generates gestures based on a pose script predicted from audio and text. We decouple the gestures into representative gestures and rhythmic gestures, modeling them separately. Experimental results demonstrate that our model produces comprehensive and expressive gestures.
  • We model representative gestures as personalized gestures with distinct phases. We introduce a representative gesture modeling module to generate representative gestures by sampling the VAE distribution space. The representative gestures are professional and highly detailed, which can further enhance the effectiveness of gesture generation.
  • We have gathered a large number of videos featuring Chinese anchors from the internet and constructed a substantial Chinese speech dataset named the ZHUBO dataset. This dataset includes audio, text, and gesture annotations. To the best of our knowledge, this is the first dataset specifically dedicated to Chinese co-speech gestures. The dataset is available at https://github.com/sunny123gaoao/ZHUBO-gesture, accessed on 1 December 2023.

2. Related Works

End-to-end generation methods leverage multi-layer networks to encode high-dimensional audio and text features into compact representations, which are then decoded into corresponding gestures, thus avoiding complex rule-based manual annotations [23,24,25]. Gesture expressions exhibit individual styles, and some studies have incorporated role styles in the gesture generation process. GestureDiffuCLIP [26] exploits the powerful capabilities of large-scale contrastive language-image pre-training (CLIP) models to extract efficient style representations from diverse input modalities, including text, example action clips, or videos and transfers them to facilitate novel gesture generation. Yoon et al. [15] represent different roles as identity IDs and jointly input them into the network during training, thereby enabling control over role styles during inference. ZeroEGGS [27] employs a variational framework to learn style embeddings and modifies the final gesture style through latent space operations like mixing and scaling of style embeddings. Alexanderson et al. [28] incorporate style control into the learning process of normalizing flow models. Deep learning models can effectively capture underlying features, as Kucherenko et al. [29] observed a clear consistency between the rhythm of speech and gestures. However, similar speech may correspond to different gestures in the training dataset, and this ambiguity in the data may result in the averaging of the learning results. Additionally, the loss of some high-frequency features during the encoding and decoding processes leads to a lack of richness in the gestures learned by end-to-end methods, limiting the model’s capacity. In addition to end-to-end learning, effectively decomposing gesture synthesis is a research trend.
Generating synchronized speech and gestures is a challenging problem with multiple possible mappings. End-to-end modeling methods may simplify gestures and lack detailed expression. To address this, researchers have introduced prior knowledge to decompose gesture synthesis, resulting in improved gesture quality. Qian et al. [30] introduced a template vector to determine the overall appearance of generated gesture sequences, enabling the generation of diverse styles by adjusting this learned vector. Speech2video [31] defined emblematic gestures as a gesture dictionary and integrated them into a recurrent neural network-based framework using text retrieval. This approach yields richer gesture expressions compared to direct end-to-end learning. Audio2gestures [32] decoupled gesture generation into shared and distinctive action features, easing the challenges of modeling the direct mapping from speech to gestures. Xu et al. [33] proposed a dual-stream model that decomposes gesture synthesis into rhythmic and postural modal actions. Rhythmic gesticulator [34] decomposed gestures into basic patterns using VAE, allowing for conditional sampling in the latent space to ensure gesture diversity. Some studies aim to mathematically represent physically interpretable features by decomposing gestures. Ferstl et al. [35] employed machine learning models to encode the relationship between speech features and expressive parameters, such as gesture velocity, acceleration, size, and arm swing. They then searched a gesture database for parameterized gestures that best matched predicted mathematical parameters. Recent work [36] has utilized large language models to enhance the generation of semantically meaningful gestures. We believe that generating gestures with clear and distinct meanings is crucial, as we prioritize the generation of representative gestures. To obtain diverse and rhythmically relevant gestures, our approach decomposes gestures into independently modeled rhythmic and representative gestures. Additionally, we learn pose scripts to integrate these two types of gestures. Furthermore, gestures can be re-edited.

3. Proposed Method

3.1. Overall Architecture

Co-speech gesture synthesis maps speech (and text) to gestures. Audio is represented by the mel-spectrum S, obtained by applying the Short-Time Fourier Transform (STFT) to compute the power spectral density and then converting it to the mel-frequency scale using a mel filter bank. Text T is aligned and formatted using Yoon et al.'s method [15], which pads tokens to match the text sequence length with the gesture sequence and transforms the text into feature vectors using word embeddings and Temporal Convolutional Networks (TCNs). Gestures are represented by the positions of landmarks G, which consist of 49 2D points: seven body keypoints, namely the nose, left shoulder, right shoulder, left elbow, right elbow, left wrist, and right wrist, as well as 21 hand keypoints for each of the left and right hands. We segment each video into clips of K frames, and the goal of gesture generation is $G_K = M(S_K, T_K)$. In this study, K is set to 50.
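For concreteness, the sketch below shows how the per-clip audio and landmark features described above could be prepared. It is not the authors' code; the sampling rate, number of mel bands, and FFT parameters are assumptions, since the paper does not specify them.

```python
# Illustrative sketch (not the authors' code): mel-spectrum S for a 50-frame clip at 25 fps.
import numpy as np
import librosa

def mel_spectrogram_for_clip(wav_path, clip_start_s, fps=25, k_frames=50,
                             sr=16000, n_mels=64):
    clip_len_s = k_frames / fps                          # 2 s of audio per 50-frame clip
    audio, _ = librosa.load(wav_path, sr=sr, offset=clip_start_s, duration=clip_len_s)
    # STFT -> power spectrum -> mel filter bank, as described in Section 3.1
    S = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=1024,
                                       hop_length=sr // fps, n_mels=n_mels)
    return librosa.power_to_db(S)                        # (n_mels, ~k_frames) mel features

# Gestures G: 7 body keypoints + 21 keypoints per hand = 49 2D landmarks per frame
def stack_landmarks(body_xy, left_hand_xy, right_hand_xy):
    return np.concatenate([body_xy, left_hand_xy, right_hand_xy], axis=0)  # (49, 2)
```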
Our system employs a two-step approach to model and learn different types of gestures separately, namely representative gesture modeling $M_R$ and rhythmic gesture modeling $M_P$. These two types of gestures are then fused by a pose script, which is learned by the pose script generation module $M_A$. The overall framework is illustrated in Figure 2. To generate the pose script n, we leverage the text $T_i$ and speech $S_i$ inputs. Specifically, the pose script generation network is trained to obtain the pose script as $n = M_A(T_i, S_i)$. Subsequently, representative gestures $P_i^R$ are learned through the pose auto-encoder network, denoted as $P_i^{R*} = M_R(P_i^R)$, while rhythmic gestures $P_i^B$ are learned through the gesture generator network, denoted as $P_i^B = M_P(S_i)$. Finally, the complete gesture $G_i$ is predicted by the network guided by the pose script n, as shown in Equation (1).
$$G_i = P_i^B + P_i, \quad P_i = \begin{cases} P_i^{R*}, & \text{if } n = 1 \\ 0, & \text{otherwise} \end{cases} \quad (1)$$
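The fusion in Equation (1) reduces to masking the representative component with the pose script; a minimal NumPy sketch with illustrative array names is shown below.

```python
# Minimal sketch of Equation (1): fuse rhythmic and representative gestures under the
# pose script. Shapes follow Section 3.1 (K = 50 frames, 49 2D landmarks per frame).
import numpy as np

def fuse_gestures(pose_script, rhythmic, representative):
    """pose_script: (K,) array of 0/1; rhythmic, representative: (K, 49, 2)."""
    mask = pose_script[:, None, None].astype(rhythmic.dtype)   # broadcast n over landmarks
    return rhythmic + mask * representative                    # G_i = P_i^B + P_i
```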

3.2. Pose Script Generation

We present a gesture synthesis scheme based on a pose script, which leverages a large set of audio-gesture pairs to predict the probability of gesture occurrence as the pose script. This approach allows users to edit the gesture generation through the pose script, providing a higher degree of controllability and editability. Specifically, we automatically annotate the pose script training data from the video and utilize a script generation network to predict the probability distribution, aiming to generate an appropriate pose script that corresponds to a given audio input.
Pose Script Dataset Construction: In Kendon’s gesture framework [22], a complete gesture is composed of several phases in the temporal dimension, including rest position, preparation, stroke, hold, and retraction. The gesture involves an initial movement from one rest position to another new one, passing through these aforementioned phases. To detect the start of the gesture, we identify the moment when the rest pose changes. To simplify the representation of gestures, we first utilized the landmark positions of the left and right elbow, as well as the left and right wrist. By computing the position histograms of these four landmarks, we identified the position with the highest frequency as the rest pose.
Next, we compute the Euclidean distance between the four landmarks of the current frame and the rest pose (denoted as $D_t^r$). Additionally, we calculate the distance between the current frame and the next frame (denoted as $D_t^p$). By setting threshold values, we can detect the gesture frames according to (2). When the pose script n is assigned a value of 1, it indicates the presence of a gesture at that particular location. Conversely, a value of 0 indicates the absence of a gesture. Consequently, the combination of the audio data and the 0–1 annotation forms the dataset used for pose script generation.
$$n = \begin{cases} 1, & \text{if } D_t^r > 0.05 \text{ and } 0.01 < D_t^p < 0.05 \\ 0, & \text{otherwise} \end{cases} \quad (2)$$
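A possible implementation of this annotation procedure is sketched below. The histogram-based rest-pose estimate and its bin size are our assumptions; only the two distance thresholds come from Equation (2).

```python
# Sketch of the automatic pose-script annotation in Section 3.2 (our reading of Eq. (2)).
import numpy as np

def annotate_pose_script(landmarks, bins=50):
    """landmarks: (T, 4, 2) positions of left/right elbow and left/right wrist."""
    T = landmarks.shape[0]
    # Rest pose: the most frequent quantized position of the four landmarks
    quantized = np.round(landmarks.reshape(T, -1) * bins) / bins
    vals, counts = np.unique(quantized, axis=0, return_counts=True)
    rest_pose = vals[counts.argmax()].reshape(4, 2)

    script = np.zeros(T, dtype=np.int64)
    for t in range(T - 1):
        d_rest = np.linalg.norm(landmarks[t] - rest_pose)         # D_t^r
        d_next = np.linalg.norm(landmarks[t] - landmarks[t + 1])  # D_t^p
        if d_rest > 0.05 and 0.01 < d_next < 0.05:                # thresholds from Eq. (2)
            script[t] = 1
    return script
```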
Pose Script Generation: We utilize the audio from the preceding 24 frames and the subsequent 25 frames of the current frame to represent the audio for each frame. A regression model is trained to acquire the pose script. Specifically, an LSTM network is employed as the feature extraction model, while a ResNet-101 network structure is used as the regression model. Subsequently, the probability of gesture occurrence is obtained through a sigmoid layer. The Mean Squared Error is employed as the loss function, where m represents the batch size, as depicted in (3).
$$L_{SCRIPT} = \frac{1}{m} \sum_{i=1}^{m} \left( n_i - \hat{n}_i \right)^2 \quad (3)$$
where $n_i$ is the annotated pose script and $\hat{n}_i$ the predicted probability.
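The following PyTorch sketch illustrates the regressor structure under stated simplifications: the paper's ResNet-101 regression head is replaced by a single linear layer for brevity, and the audio feature dimension is an assumption.

```python
# Simplified sketch of the pose-script regressor (Section 3.2).
import torch
import torch.nn as nn

class PoseScriptRegressor(nn.Module):
    def __init__(self, n_mels=64, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # stand-in for the ResNet-101 regression head

    def forward(self, mel_context):
        # mel_context: (B, 50, n_mels) -- 24 preceding + current + 25 following frames
        _, (h, _) = self.lstm(mel_context)
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)   # gesture probability per frame

# Training objective, Equation (3): mean squared error against the annotated script
loss_fn = nn.MSELoss()
```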

3.3. Representative Gestures Modeling

Personalized representative gestures often appear in co-speech gestures. We will describe how to represent and model these gestures, with the aim of integrating the modeling results as gesture units into the gesture generation framework, thereby enhancing the final effectiveness of gesture generation.
Representative Gestures Representations: According to the aforementioned definition of the complete gesture phase by Kendon et al. [22], we define representative gestures as those that occur within this process, starting from the rest pose and ending with the next rest pose. We employ (2) to extract gesture clips, where each clip encompasses the entire phase of the gesture, and we refer to these clips as personalized representative gestures. To ensure consistency, we uniformly sample the gesture clips such that each representative gesture contains 50 frames.
However, missing hand landmarks caused by self-occlusion and occlusion by on-screen captions can affect the modeling of representative gestures. To address this, we perform inpainting on the hand landmarks within the dataset. The complete hand consists of 21 landmarks, and we determine the first frame $F_i$ in which the hand is missing by counting the detected hand landmarks. Subsequently, we utilize the previous frame $F_{i-1}$ for padding; the inpainting process calculates the relative rotation $R_{i-1}^{i}$ and translation $T_{i-1}^{i}$ between adjacent frames, as illustrated in (4).
$$F_i = F_{i-1} R_{i-1}^{i} + T_{i-1}^{i} \quad (4)$$
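A sketch of this inpainting step is given below. The paper does not state how the relative rotation and translation are estimated; here a least-squares (Kabsch-style) fit on reference landmarks visible in both frames (e.g., the arm keypoints) is assumed.

```python
# Sketch of the hand-inpainting step, Equation (4); estimation of R and T is an assumption.
import numpy as np

def relative_transform(ref_prev, ref_curr):
    """Least-squares 2D rotation R and translation T mapping ref_prev -> ref_curr."""
    mu_p, mu_c = ref_prev.mean(axis=0), ref_curr.mean(axis=0)
    H = (ref_prev - mu_p).T @ (ref_curr - mu_c)
    U, _, Vt = np.linalg.svd(H)
    R = U @ Vt
    if np.linalg.det(R) < 0:            # keep a proper rotation (no reflection)
        U[:, -1] *= -1
        R = U @ Vt
    T = mu_c - mu_p @ R
    return R, T

def inpaint_hand(hand_prev, ref_prev, ref_curr):
    """F_i = F_{i-1} R + T (Equation (4)); hand_prev: (21, 2) hand landmarks."""
    R, T = relative_transform(ref_prev, ref_curr)
    return hand_prev @ R + T
```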
Representative Gestures Modeling: The VAE model enables the generation of data that closely resembles the original distribution by constructing a latent vector layer, which we utilize to model representative gestures effectively. Each representative gesture is composed of 50 frames, with 49 landmarks per frame serving as the input $P_i^R \in \mathbb{R}^{49 \times 2}$. The Pose Auto-encoder employs the encoder to learn a Gaussian distribution $N(\mu, \sigma)$ approximating a standard normal distribution. Subsequently, the decoder reconstructs the corresponding gesture $P_i^{R*}$ from this learned distribution. Training combines the reconstruction loss $L_{CONS}$ and the KL divergence $L_{KL}$, as depicted in (5) and (6).
$$L_{CONS} = \frac{1}{m} \sum_{i=1}^{m} \left\| P_i^R - P_i^{R*} \right\|^2 \quad (5)$$
$$L_{KL} = -\log \sigma + \frac{\mu^2 + \sigma^2}{2} - \frac{1}{2} \quad (6)$$
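The following is a minimal sketch of such a pose auto-encoder with the combined loss of Equations (5) and (6); the layer widths and latent dimension are assumptions.

```python
# Minimal sketch of the Pose Auto-encoder (Section 3.3). Gestures are flattened from
# 50 frames x 49 landmarks x 2 coordinates.
import torch
import torch.nn as nn

class PoseVAE(nn.Module):
    def __init__(self, in_dim=50 * 49 * 2, hidden=512, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def forward(self, x):                      # x: (B, 50*49*2)
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    l_cons = ((x - recon) ** 2).mean()                               # Equation (5)
    l_kl = (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp())).mean()   # Equation (6)
    return l_cons + l_kl
```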

3.4. Rhythmic Gestures Modeling

Rhythmic gestures refer to action variables that are closely linked to the rhythm of speech. Recent research [19] has demonstrated the efficacy of deep learning approaches in capturing the correlation between audio signals and corresponding rhythmic actions. In our study, we adopt a learning-based framework for accurately modeling rhythmic gestures. Specifically, we employ a one-dimensional convolutional model, namely U-NET, to facilitate the representation of rhythmic gestures.
We let the gesture generator learn the mapping from audio $S_i$ to rhythmic gestures $P_i^B$, where $P_i^B \in \mathbb{R}^{49 \times 2}$. During training, the generation of rhythmic gestures is supervised using the $L_1$ distance, which quantifies the dissimilarity between the ground truth gestures $P_{GT}^B$ and the predicted gestures $P_i^B$. To enhance the stability and smoothness of the learned gestures, we incorporate a loss based on higher-order derivatives, as indicated in (7), following the methodology employed by Xu et al. [33].
$$L_{BEAT} = \frac{1}{m} \sum_{i=1}^{m} \left( \left\| P_{GT}^B - P_i^B \right\|_1 + \left\| \dot{P}_{GT}^B - \dot{P}_i^B \right\|_1 + \left\| \ddot{P}_{GT}^B - \ddot{P}_i^B \right\|_1 \right) \quad (7)$$
where the dots denote first- and second-order temporal derivatives.
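A sketch of this loss, reading the higher-order terms as first- and second-order temporal differences, is shown below.

```python
# Sketch of the rhythmic-gesture loss, Equation (7): L1 distance on positions plus
# first- and second-order temporal differences (our reading of the derivative terms).
import torch

def beat_loss(pred, gt):
    """pred, gt: (B, K, 49, 2) rhythmic gesture sequences."""
    def diff(x):                                   # finite difference along time
        return x[:, 1:] - x[:, :-1]
    l_pos = (pred - gt).abs().mean()
    l_vel = (diff(pred) - diff(gt)).abs().mean()
    l_acc = (diff(diff(pred)) - diff(diff(gt))).abs().mean()
    return l_pos + l_vel + l_acc
```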

3.5. Training and Inference

First, we train the three modules $M_A$, $M_R$, and $M_P$ separately and subsequently fine-tune them jointly. Our gesture synthesis system is designed to generate expressive gestures that resemble representative gestures at appropriate timings, rather than strictly adhering to the ground truth gestures. To achieve this, we introduce the training loss $L_{TOTAL}$, as depicted in (8), for effective optimization.
$$L_{TOTAL} = \begin{cases} L_{BEAT} + \left( n_j - \hat{n}_j \right)^2, & \text{if } n = 0 \\ \left( n_j - \hat{n}_j \right)^2, & \text{if } n = 1 \end{cases} \quad (8)$$
In the inference stage, we extract the mel-spectrum and text features of the audio and predict the corresponding pose script n and rhythmic gestures $P^B$. The representative gestures $P^R$ and rhythmic gestures $P^B$ are then linearly fused by the pose script n to synthesize the co-speech gestures G, as shown in (9). The representative gestures $P^R$ are reconstructed by the decoder of $M_R$ from a latent variable z, where z is randomly sampled from (0, 1).
$$G = P^B + P^R, \quad P^R = \begin{cases} M_R(z), & \text{if } n = 1 \\ 0, & \text{otherwise} \end{cases} \quad (9)$$
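The inference path of Equation (9) can be sketched as follows; the decoder handle and the latent dimension are illustrative, and the decoder is assumed to output a flattened 50 x 49 x 2 gesture clip.

```python
# Sketch of the inference fusion in Section 3.5 (Equation (9)).
import torch

def synthesize(pose_script, rhythmic, vae_decoder, z_dim=32):
    """pose_script: (K,) 0/1 tensor; rhythmic: (K, 49, 2) tensor."""
    z = torch.rand(1, z_dim)                                       # sample z from (0, 1)
    rep = vae_decoder(z).reshape(-1, 49, 2)[: rhythmic.shape[0]]   # P^R = M_R(z)
    mask = pose_script.view(-1, 1, 1).float()
    return rhythmic + mask * rep                                   # G = P^B + P^R where n = 1
```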

4. Experiments

4.1. Dataset and Experimental Details

ZHUBO Dataset: We constructed a dataset of professional Chinese news commentary videos, comprising 613 videos featuring 10 Chinese news anchors. The videos mainly show speakers commenting on current events while employing professional gestures to aid expression. Each video lasts approximately 2 min at 25 frames per second, and in total the dataset comprises around 20.4 h of footage. To assist with gesture annotation, we extracted the audio of each video in wav format and used the Mediapipe tool [37] to extract human landmarks. Specifically, we extracted 42 hand keypoints, 33 pose keypoints, and 468 face landmarks, which form the basis of our human landmark annotation. The dataset was split into training, testing, and validation sets, with 80%, 10%, and 10% of the data allocated to each set, respectively. Taking the Kang Hui character in the ZHUBO dataset as an example, we used 85 videos for the training set, with each clip comprising 50 frames at 25 fps, resulting in 12,296 clips in total; 10 videos were used for the validation set, containing 1160 clips.
Experimental Details: We initially trained three separate models: the script generation model, the representative gesture model, and the rhythmic gesture model. The gesture generation framework was optimized using the Adam optimizer with parameters $\beta_1 = 0.5$ and $\beta_2 = 0.999$. For the script generation model and the rhythmic gesture model, we set the batch size to 32 and the learning rate to 0.0001 and trained them for 200 epochs. For the representative gesture model, we used a batch size of 6 and a learning rate of 0.001 and trained it for 1000 epochs. Finally, all the individual modules were combined for fine-tuning with a batch size of 6 for 20 epochs.
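As a concrete reading of this setup, the optimizer configuration might look like the sketch below; the module objects are placeholders, and only the hyperparameter values are taken from the text.

```python
# Sketch of the optimizer setup described in Section 4.1 (module objects are placeholders).
import torch

def build_optimizers(script_model, rhythmic_model, representative_model):
    adam = lambda params, lr: torch.optim.Adam(params, lr=lr, betas=(0.5, 0.999))
    return {
        "script":         adam(script_model.parameters(), 1e-4),          # batch 32, 200 epochs
        "rhythmic":       adam(rhythmic_model.parameters(), 1e-4),        # batch 32, 200 epochs
        "representative": adam(representative_model.parameters(), 1e-3),  # batch 6, 1000 epochs
    }
```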

4.2. Objective Study

Objective Evaluation Metrics: Our goal is to generate gestures that are rich and natural, rather than solely fitting into the ground truth. Therefore, we assess the quality of gesture synthesis based on criteria such as rhythmic consistency, diversity, and authenticity.
(1) Rhythm Consistency: The pose script indicates the occurrence of co-speech gestures, and we employ precision as the objective evaluation metric for assessing the rhythm consistency between the predicted pose script and the ground truth, as shown in (10). Here, $Count(\hat{n}_{positive})$ represents the number of pose script entries predicted as positive samples. A higher $F_{Precision}$ value indicates a closer temporal alignment between the predicted gestures and the ground truth, and thus a higher degree of rhythm consistency.
$$F_{Precision} = \frac{Count(\hat{n}_{positive} = n_{positive})}{Count(\hat{n}_{positive})} \quad (10)$$
(2) Diversity: Inspired by the work of Xu et al. [33], we quantify the gesture diversity by measuring the frame difference within the generated gesture sequence, as depicted in (11). Here, N represents the total number of frames.
$$F_{Diversity} = \frac{\sum_{i=1}^{N} \left| G_i - G_{i-1} \right|}{N} \quad (11)$$
(3) Authenticity: The evaluation of authenticity in image generation has commonly employed the FID (Fréchet Inception Distance) metric, and Yoon et al. [15] have utilized a modified version of FID to assess the realism of generated gestures. Drawing inspiration from their work, we adopt FID as a measure of the realism of our generated gestures. Specifically, we train a VAE on the ZHUBO dataset and employ its middle-layer features as the representation of the true data distribution. Subsequently, we calculate the distance between the means and variances of the true gestures $(\mu_1, \sigma_1)$ and the predicted gestures $(\mu_2, \sigma_2)$, as shown in (12), where $\mathrm{Tr}(\cdot)$ denotes the trace operation.
$$FID = \left\| \mu_1 - \mu_2 \right\|^2 + \mathrm{Tr}\left( \sigma_1 + \sigma_2 - 2\sqrt{\sigma_1 \sigma_2} \right) \quad (12)$$
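The three metrics can be computed roughly as in the sketch below; extracting the VAE middle-layer statistics that feed the FID is assumed and not shown.

```python
# Sketch of the objective metrics, Equations (10)-(12).
import numpy as np
from scipy.linalg import sqrtm

def script_precision(pred, gt):                 # Equation (10); pred, gt: (T,) arrays of 0/1
    positive = pred == 1
    return (gt[positive] == 1).sum() / max(positive.sum(), 1)

def diversity(gestures):                        # Equation (11); gestures: (N, 49, 2)
    return np.abs(np.diff(gestures, axis=0)).sum() / len(gestures)

def fid(mu1, sigma1, mu2, sigma2):              # Equation (12); sigma: covariance matrices
    covmean = sqrtm(sigma1 @ sigma2).real
    return np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * covmean)
```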
Objective Evaluation Experiment: The comparison methods employed in our study include the following: (i) Speech2Gesture [16]: The Speech2Gesture method employs a GAN network to model the correlation between audio and gesture. The original Speech2Gesture model was trained on the Speech2Gesture dataset, which is a large video dataset of person-specific gestures annotated in English. To adapt this method to the ZHUBO dataset, which focuses on Chinese scenarios, we retrained the model on this specific dataset to ensure objective comparison. (ii) Trimodal [15]: The Trimodal approach utilizes joint inputs of audio and text to capture the complex relationship between multimodal data and gestures. The original Trimodal model was trained on the TED Gesture dataset, which includes gesture video data from TED talks in English. To optimize its performance on the ZHUBO dataset, we retrained the model using a sequence model, as this dataset better reflects the gestures in a Chinese context. (iii) Template [30]: The Template method uses a learned template to generate character gestures in diverse styles. Initially trained on the Speech2Gesture dataset, this method was also retrained on the ZHUBO dataset with the template-BP form mentioned in the paper to better suit the Chinese data.
The objective evaluation results displayed in Table 1 demonstrate that our proposed method exhibits notable advantages in terms of diversity and realism. The script generation module effectively ensures the accurate timing of generated gestures, as indicated by the 70.80% accuracy rate for rhythmic consistency. Moreover, the decomposition of gestures into rhythmic and representative components circumvents the direct end-to-end approach, which tends to produce gesture results with overly average characteristics, resulting in the production of more comprehensive and authentic gestures. This observation strongly supports the notion that appropriate decomposition of gestures facilitates more effective learning of gesture-related features, ultimately yielding better performance than direct end-to-end learning.

4.3. Ablation Study

The results of the ablation experiments are presented in Table 2, and we utilize Equations (10)–(12) as objective evaluation metrics.
Script Generation: The pose script generation module $M_A$ learns the probability of gesture occurrence, aiming to demonstrate its ability to achieve rhythmic consistency. Baseline comparisons encompass (i) all audio clips corresponding to the rest pose, with a pose script of all 0; (ii) all audio clips corresponding to gestures, with a pose script of all 1; (iii) randomly generated gestures, with a pose script randomly assigned values between 0 and 1. The ablation experiments reveal that our proposed pose script learning module can, to some extent, ensure the appropriate timing of gestures.
Gesture Modeling: We adopt a learning-based end-to-end gesture generation model as the baseline, which corresponds to the rhythmic gesture modeling module $M_P$ in our proposed approach. As illustrated in Table 2, incorporating the representative gesture module $M_R$ for gesture synthesis enhances both the diversity and realism of the generated gestures, with the best results observed when rhythmic gestures and representative gestures are combined. Modeling based on the pose script with representative gestures effectively addresses the issue of result averaging in deep learning-based methods and enhances gesture diversity, while our proposed representative gestures ensure the realism of the synthesized gestures. Additionally, we found that incorporating rhythm-related gesture minutiae further enhances the realism of the generated gestures.

4.4. Visualization Results

The visualization results are presented in Figure 3, where it is important to note that the baseline uses the sequential model from the Trimodal approach [15]. In the ZHUBO dataset, gestures sometimes exhibit a rest pose, and the features extracted from the multilayer network directly modeled in an end-to-end manner tend to lose high-frequency details. As a consequence, the generated results become averaged, resembling the rest pose and lacking richness. In contrast, our method effectively disentangles gestures into representative gestures and rhythmic gestures. By leveraging the modeling capabilities of the VAE, our approach successfully generates complete and diverse representative gestures, thereby circumventing the issue of result averaging encountered in end-to-end modeling approaches.
We employ the VAE to model the representative gestures. The VAE’s encoder compresses the gestures into a latent space, following which new gestures are sampled and generated by the decoder. The obtained results from different sampling values are depicted in Figure 4. It is observed that the representative gestures commence and conclude at the rest pose, ensuring the completeness of the generated gestures. Random sampling was conducted from the range of (0, 1), with each row in the figure representing results obtained from distinct sampling values. Our proposed method for modeling representative gestures enhances the richness and realism of the synthesized gestures, aligning with human gestures where individuals may exhibit different gestures while delivering the same speech.
We utilize the pose script to enable gesture re-editing, as depicted in Figure 5. Existing deep learning-based gesture generation pipelines lack the capability to modify generated gestures, while applications like embodied conversational agents require precise gesture control. Our proposed pose script generation module ensures rhythmic consistency between gestures and speech, and the script file itself can be edited. Specifically, the pose script learns the probability of gesture occurrence (1 or 0) from the available data; by modifying the 0 and 1 values in the pose script, users can exercise control over the generated gestures.

4.5. User Study

We conducted a comparative analysis between the generated gesture landmark videos and the ground truth landmark videos, involving 15 participants in a subjective evaluation experiment. The participants' task was to assess the videos across three dimensions: rhythm matching, gesture authenticity, and gesture diversity. They were also asked to select their preferred gesture expression video. To prepare the subjective evaluation material, we selected 50 sets of videos from the test set of the ZHUBO dataset. Each set consisted of a generated video with corresponding audio and a ground truth video, each lasting 30 s. We randomized the order of the ground truth and generated videos, and participants made their choice after each set was played.
We computed the average scores from the subjective evaluation results and found that 63.6% of the data indicated a better subjective perception of the gestures produced by our proposed method. Upon analyzing the data, we discovered that users showed a preference for videos with diverse gestures. This finding suggests that our approach is capable of achieving gesture diversity while maintaining authenticity. The subjective evaluation experiment conducted with a diverse set of participants provides valuable insights into the effectiveness and preference of our gesture generation approach. The results highlight the potential of our method in generating authentic and diverse gestures that align with the audio content, thereby enhancing the overall user experience.

5. Conclusions and Discussion

In this paper, we propose a method for generating editable gestures that combines individual representative gestures with a learned pose script. Our objective is to generate diverse and high-quality gestures in a controlled manner. To achieve this, we introduce an editable pose script that accurately represents the location of gesture occurrence, enabling precise control over the generation of final gestures. To ensure diversity, we employ a representative gesture modeling module capable of generating fine gestures. This module offers various sampling techniques to produce a wide range of representative gestures. To maintain consistency between gestures and speech rhythms, we propose a rhythmic gesture modeling module. This module ensures that the generated gestures effectively align with the accompanying speech. Experimental results demonstrate the effectiveness of our disentanglement learning approach, which produces comprehensive, diverse, and realistic gestures. Additionally, the generated gestures can be easily modified by adjusting the pose script, providing enhanced flexibility and customization options.
However, our study also has some limitations. Firstly, the gesture classification process is currently semi-automatic, which, while allowing for more precise categorization, is time-consuming and requires manual intervention. Future research could focus on developing a fully automated classification system to improve efficiency and scalability, making the process more suitable for large-scale applications.
Furthermore, our study has not yet fully utilized the capabilities of large models in semantic understanding. Advances in large models within the field of natural language processing (NLP) present new opportunities for gesture generation. By integrating large models for semantic analysis and context understanding, gesture generation systems can more accurately capture the emotions and contextual information behind spoken content. This would help in generating more natural and contextually relevant semantic gestures, significantly enhancing the user experience in applications such as virtual reality and human–computer interaction. Future research could explore how to better combine the semantic understanding capabilities of large models with gesture generation systems to achieve more realistic and immersive digital human interactions.
In summary, our approach offers a novel solution in the field of gesture generation, demonstrating excellence in diversity, flexibility, and editability. However, further optimization and improvement will depend on the refinement of automated processes and the application of large models in semantic understanding, which are important directions for future research.

Author Contributions

Conceptualization, N.G., Y.B. and D.W.; methodology, N.G. and Y.B.; software, N.G. and Y.B.; validation, N.G. and Y.B.; formal analysis, Y.B.; investigation, N.G., Y.B. and D.W.; resources, D.W.; data curation, N.G. and Y.B.; writing—original draft preparation, Y.B.; writing—review and editing, N.G.; visualization, Y.B.; supervision, N.G. and D.W.; project administration, N.G., Y.B. and D.W.; funding acquisition, D.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (grant number 2022YFF0902303) and the 2022 major science and technology project “Yuelu·Multimodal Graph-Text-Sound-Semantic Gesture Big Model Research and Demonstration Application” in Changsha (grant number kh2301019).

Institutional Review Board Statement

Ethical review and approval are not applicable to this research.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The ZHUBO dataset is available at https://github.com/sunny123gaoao/ZHUBO-gesture, accessed on 1 December 2023.

Acknowledgments

The authors are grateful for the support from the Beijing Engineering Research Center of Mixed Reality and Advanced Display, Beijing Institute of Technology. They also appreciate the editor and anonymous reviewers for their insightful suggestions, which have greatly enhanced the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Qi, J.; Ma, L.; Cui, Z.; Yu, Y. Computer vision-based hand gesture recognition for human-robot interaction: A review. Complex Intell. Syst. 2024, 10, 1581–1606. [Google Scholar] [CrossRef]
  2. Bhattacharya, U.; Rewkowski, N.; Banerjee, A.; Guhan, P.; Bera, A.; Manocha, D. Text2gestures: A transformer-based network for generating emotive body gestures for virtual agents. In Proceedings of the 2021 IEEE Virtual Reality and 3D User Interfaces (VR), Lisbon, Portugal, 27 March–1 April 2021; pp. 1–10. [Google Scholar]
  3. Liang, B.; Pan, Y.; Guo, Z.; Zhou, H.; Hong, Z.; Han, X.; Han, J.; Liu, J.; Ding, E.; Wang, J. Expressive talking head generation with granular audio-visual control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3387–3396. [Google Scholar]
  4. Nyatsanga, S.; Kucherenko, T.; Ahuja, C.; Henter, G.E.; Neff, M. A Comprehensive Review of Data-Driven Co-Speech Gesture Generation. Comput. Graph. Forum 2023, 42, 569–596. [Google Scholar] [CrossRef]
  5. Petrovich, M.; Black, M.J.; Varol, G. TEMOS: Generating diverse human motions from textual descriptions. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 480–497. [Google Scholar]
  6. Otberdout, N.; Ferrari, C.; Daoudi, M.; Berretti, S.; Bimbo, A.D. Sparse to dense dynamic 3d facial expression generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20385–20394. [Google Scholar]
  7. Tuite, K. The production of gesture. Semiotica 1993, 93, 83–106. [Google Scholar] [CrossRef]
  8. Wagner, P.; Malisz, Z.; Kopp, S. Gesture and speech in interaction: An overview. Speech Commun. 2014, 57, 209–232. [Google Scholar] [CrossRef]
  9. Cassell, J.; Pelachaud, C.; Badler, N.; Steedman, M.; Achorn, B.; Becket, T.; Douville, B.; Prevost, S.; Stone, M. Animated conversation: Rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents. In Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, Orlando, FL, USA, 24–29 July 1994; pp. 413–420. [Google Scholar]
  10. Salem, M.; Kopp, S.; Wachsmuth, I.; Joublin, F. Towards meaningful robot gesture. In Human Centered Robot Systems: Cognition, Interaction, Technology; Springer: Berlin/Heidelberg, Germany, 2009; pp. 173–182. [Google Scholar]
  11. Chiu, C.C.; Marsella, S. How to train your avatar: A data driven approach to gesture generation. In Proceedings of the International Workshop on Intelligent Virtual Agents, Reykjavik, Iceland, 15–17 September 2011; pp. 127–140. [Google Scholar]
  12. Yang, Y.; Yang, J.; Hodgins, J. Statistics-based motion synthesis for social conversations. Comput. Graph. Forum 2020, 39, 201–212. [Google Scholar] [CrossRef]
  13. Ferstl, Y.; Neff, M.; McDonnell, R. Adversarial gesture generation with realistic gesture phasing. Comput. Graph. 2020, 89, 117–130. [Google Scholar] [CrossRef]
  14. Zhu, L.; Liu, X.; Liu, X.; Qian, R.; Liu, Z.; Yu, L. Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10544–10553. [Google Scholar]
  15. Yoon, Y.; Cha, B.; Lee, J.H.; Jang, M.; Lee, J.; Kim, J.; Lee, G. Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Trans. Graph. (TOG) 2020, 39, 1–16. [Google Scholar] [CrossRef]
  16. Ginosar, S.; Bar, A.; Kohavi, G.; Chan, C.; Owens, A.; Malik, J. Learning individual styles of conversational gesture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3497–3506. [Google Scholar]
  17. Ferstl, Y.; McDonnell, R. Investigating the use of recurrent motion modelling for speech gesture generation. In Proceedings of the 18th International Conference on Intelligent Virtual Agents, Sydney, Australia, 5–8 November 2018; pp. 93–98. [Google Scholar]
  18. Liu, H.; Zhu, Z.; Iwamoto, N.; Peng, Y.; Li, Z.; Zhou, Y.; Bozkurt, E.; Zheng, B. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 612–630. [Google Scholar]
  19. Kucherenko, T.; Nagy, R.; Neff, M.; Kjellström, H.; Henter, G.E. Multimodal analysis of the predictability of hand-gesture properties. arXiv 2021, arXiv:2108.05762. [Google Scholar]
  20. Yoon, Y.; Wolfert, P.; Kucherenko, T.; Viegas, C.; Nikolov, T.; Tsakov, M.; Henter, G.E. The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation. In Proceedings of the 2022 International Conference on Multimodal Interaction, Bengaluru, India, 7–11 November 2022; pp. 736–747. [Google Scholar]
  21. Studdert-Kennedy, M. Hand and Mind: What Gestures Reveal About Thought. Lang. Speech 1994, 37, 203–209. [Google Scholar] [CrossRef]
  22. Kendon, A. Gesticulation and speech: Two aspects of the process of utterance. Relatsh. Verbal Nonverbal Commun. 1980, 25, 207–227. [Google Scholar]
  23. Liu, X.; Wu, Q.; Zhou, H.; Xu, Y.; Qian, R.; Lin, X.; Zhou, X.; Wu, W.; Dai, B.; Zhou, B. Learning hierarchical cross-modal association for co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10462–10472. [Google Scholar]
  24. Ye, S.; Wen, Y.H.; Sun, Y.; He, Y.; Zhang, Z.; Wang, Y.; He, W.; Liu, Y.-J. Audio-driven stylized gesture generation with flow-based model. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 712–728. [Google Scholar]
  25. Yi, H.; Liang, H.; Liu, Y.; Cao, Q.; Wen, Y.; Bolkart, T.; Tao, D.; Black, M.J. Generating holistic 3d human motion from speech. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 469–480. [Google Scholar]
  26. Ao, T.; Zhang, Z.; Liu, L. GestureDiffuCLIP: Gesture diffusion model with CLIP latents. arXiv 2023, arXiv:2303.14613. [Google Scholar] [CrossRef]
  27. Ghorbani, S.; Ferstl, Y.; Holden, D.; Troje, N.F.; Carbonneau, M.A. ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech. Comput. Graph. Forum 2023, 42, 206–216. [Google Scholar] [CrossRef]
  28. Alexanderson, S.; Henter, G.E.; Kucherenko, T.; Beskow, J. Style-controllable speech-driven gesture synthesis using normalising flows. Comput. Graph. Forum 2020, 39, 487–496. [Google Scholar] [CrossRef]
  29. Kucherenko, T.; Jonell, P.; Van Waveren, S.; Henter, G.E.; Alexandersson, S.; Leite, I.; Kjellström, H. Gesticulator: A framework for semantically-aware speech-driven gesture generation. In Proceedings of the 2020 International Conference on Multimodal Interaction, Virtual, 25–29 October 2020; pp. 242–250. [Google Scholar]
  30. Qian, S.; Tu, Z.; Zhi, Y.; Liu, W.; Gao, S. Speech drives templates: Co-speech gesture synthesis with learned templates. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 11077–11086. [Google Scholar]
  31. Liao, M.; Zhang, S.; Wang, P.; Zhu, H.; Zuo, X.; Yang, R. Speech2video synthesis with 3d skeleton regularization and expressive body poses. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  32. Li, J.; Kang, D.; Pei, W.; Zhe, X.; Zhang, Y.; He, Z.; Bao, L. Audio2gestures: Generating diverse gestures from speech audio with conditional variational autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 11293–11302. [Google Scholar]
  33. Xu, J.; Zhang, W.; Bai, Y.; Sun, Q.; Mei, T. Freeform body motion generation from speech. arXiv 2022, arXiv:2203.02291. [Google Scholar]
  34. Ao, T.; Gao, Q.; Lou, Y.; Chen, B.; Liu, L. Rhythmic gesticulator: Rhythm-aware co-speech gesture synthesis with hierarchical neural embeddings. ACM Trans. Graph. (TOG) 2022, 41, 1–19. [Google Scholar] [CrossRef]
  35. Ferstl, Y.; Neff, M.; McDonnell, R. ExpressGesture: Expressive gesture generation from speech through database matching. Comput. Animat. Virtual Worlds 2021, 32, e2016. [Google Scholar] [CrossRef]
  36. Gao, N.; Zhao, Z.; Zeng, Z.; Zhang, S.; Weng, D.; Bao, Y. GesGPT: Speech Gesture Synthesis with Text Parsing from ChatGPT. IEEE Robot. Autom. Lett. 2024, 9, 2718–2725. [Google Scholar] [CrossRef]
  37. Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.-L.; Yong, M.G.; Lee, J.; et al. Mediapipe: A framework for perceiving and processing reality. In Proceedings of the Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 17 June 2019; Volume 2019. [Google Scholar]
Figure 1. Motivation of editable co-speech gesture synthesis. We learn the probability of gestural occurrences from speech and use it to generate a pose script that serves as the foundation for blending two types of gestures, namely rhythmic gestures and representative gestures.
Figure 2. Pipeline of our method. The pose script generation module generates when gestures should occur. The representative gestures modeling module generates gesture segments with rich details and complete stages. The rhythmic gesture module generates basic gestures related to rhythm.
Figure 3. Visualization results that illustrate the comparison between our proposed model and the baseline, both trained and tested on the ZHUBO dataset. The first row displays the ground truth, while the second and third rows correspond to our results and the baseline results, respectively. We sampled the visualization results in the video at every five frames.
Figure 4. Visualization results of representative gestures. Each row represents a sequence of representative gestures generated under different sampling values.
Figure 5. Editable gestures where the first line represents the original predicted pose script, while the subsequent lines display the corresponding generated gestures based on this script. In the pose script, a value of 0 indicates the absence of any gesture, representing a rest pose. Conversely, a value of 1 signifies the presence of a gesture at that specific time. By modifying the values within the pose script, users can exert controlled editing over the generated gestures, as depicted in the subsequent lines.
Table 1. Evaluation metrics for different methods. Objective comparison results with other methods on the ZHUBO dataset, where ↓ means smaller is better and ↑ means larger is better, and the optimal results are marked in bold.
Methods          F_Precision ↑   F_Diversity ↑   FID ↓
Speech2Gesture   /               3.49            10.43
Trimodal         /               2.27            10.12
Template         /               3.97            16.74
Ours             70.80%          4.34            8.62
Table 2. Ablation experiment results where the symbol ↓ indicates that smaller values are better, while ↑ indicates that larger values are better. The optimal results are highlighted in bold.
Methods                  F_Precision ↑   F_Diversity ↑   FID ↓
Rest Pose                58.69%          /               /
All Pose                 41.31%          /               /
Random Pose              54.40%          /               /
M_P (Baseline)           /               3.28            10.79
M_A                      70.80%          /               /
M_A + M_R                /               4.04            11.76
M_P + M_A + M_R (Ours)   70.80%          4.34            8.62

