Article

Virtual Teacher-Aided Learning System Based on Voice Operated Character Animation

1 Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou 571158, China
2 College of Information and Communication Engineering, Dalian Nationalities University, Dalian 116600, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8177; https://doi.org/10.3390/app14188177
Submission received: 16 August 2024 / Revised: 5 September 2024 / Accepted: 9 September 2024 / Published: 11 September 2024

Abstract

Throughout the development of the education industry, the core competitiveness of education has centered on the delivery of high-quality content, and the emergence of the virtual human provides a more efficient carrier that can fundamentally help the industry transform and improve efficiency. By combining virtual reality technology with artificial intelligence, this paper designs a virtual teacher based on the VOCA model for real-time interaction. The learned model, VOCA (voice-operated character animation), takes any speech signal as input and realistically animates a wide range of adult faces. Compared with traditional virtual teachers based on text or speech alone, the virtual teacher in this paper provides human-like interaction, a new teaching form for people working in the field of artificial intelligence. The virtual teacher's image is designed according to the appearance, movement, and behavior characteristics of real teachers, and its interaction modes are enriched through facial expression, body posture, voice, and speech. A virtual teacher with personalized, interactive, and intelligent characteristics is developed by combining voice, image, and natural language-processing technology. It enables virtual teachers to interact with students in a more intuitive and personalized way, provide real-time feedback and personalized guidance, and offer better learning support and a better teaching experience for online education.

1. Background

With the development of the times, online education has become a major mode of teaching. Limited by time, space, equipment, teachers, and other factors, traditional classroom teaching can no longer meet people's growing demand for knowledge, and the introduction of virtual teachers addresses this pain point [1]. The emergence of virtual teachers not only improves the efficiency and quality of teaching, but also provides students with more convenient and diversified ways of learning, which plays a far-reaching role in advancing education and teaching.
As a new teaching assistant tool, the virtual teacher has important innovative significance in the field of educational technology. Virtual teachers integrate cutting-edge technologies, such as natural language processing, machine learning, and human–computer interaction, and provide new possibilities for teaching. Through the study of virtual teachers, the nature and law of teacher–student interaction in the education process can be deeply discussed [2]. As a human–computer interaction system, virtual teachers can help us better understand the mechanism of information transmission, cognitive inspiration, and emotional communication in teaching and provide theoretical support for improving teaching quality and educational experience. Therefore, the research of virtual teachers can not only help to improve the level of personalized teaching and promote the innovative development of educational technology, but also deepen our understanding of the education process and provide important reference and support for education reform and improving teaching effect [3].
The importance of the virtual teacher lies not only in its technical innovation and teaching function, but also in the emotional value it can bring. An American study showed that when people express their thoughts in speech, 93% of the information is conveyed through non-verbal behavior and intonation and only 7% through the words themselves [4]. In addition to the usual oral feedback, teachers can also use non-verbal communication to influence students, and non-verbal expression can better help virtual teachers convey information. Articulate and enthusiastic teachers can clearly convey suggestions for solving problems while also having a noticeable motivating effect on students' learning outcomes [5]. Therefore, based on the VOCA model, this paper generates a virtual teacher with rich head movements, expressions, and intelligent question-answering capabilities, aiming to behave like a real person and provide a richer, more diverse teaching experience.

2. Related Work

2.1. Digital Human Voice Interaction

With the maturing of speech recognition and natural language interaction technology, there is an urgent need for robots that can recognize users' emotions and respond intelligently. Research on human emotion can be traced back to the 1960s and 1970s, when Ekman et al. proposed six basic emotions: happiness, surprise, disgust, sadness, fear, and anger; current emotion research still builds on this foundation [6].
Among the various forms of communication, human–computer speech interaction can greatly facilitate people's daily life and work. With the continuous progress of technology, the virtual digital human has reached a higher degree of realism [7]. However, a virtual digital human presented as simple 2D animation or a 3D model is not very intelligent: it can only play back content designed in advance, without any interaction or communication. The main reasons are as follows:
(i).
Without a pre-trained language model such as ChatGPT, a virtual digital human cannot intelligently process and respond to content given by the user. Relying only on keyword triggering makes it difficult to meet users' needs for real, natural conversation.
(ii).
Traditional 3D digital humans often cannot drive expression animation in real time, which leads to a mismatch between speech and mouth shape, makes voice and facial expression difficult to coordinate, and reduces the realism and immersion of the character.
(iii).
Text input restricts the virtual digital human to a relatively narrow range of uses, limited by the operating platform. With voice input, users are freed from the restrictions of the operating interface and can communicate with the virtual digital human through convenient, natural interaction in a much wider range of settings.

2.2. Virtual Teacher

The virtual teacher originated in the 1950s and 1960s as computer-aided instruction, that is, the use of computers to simulate the behavior of teachers and to teach through interaction between students and computers. Since then, researchers have explored the relationship between computers and teaching aids. In subsequent developments, more interactive intelligent tutors emerged; these tutors took on part of the role of a real teacher and could provide basic learning guidance [8]. After entering the 21st century, the rapid development of virtual reality and virtual human technology gave birth to the virtual teacher.
Our survey finds that current virtual teacher research focuses on building a high-fidelity virtual teacher character: one whose external appearance is close to a real person, whose actions and behavior are realistic, and which possesses a certain degree of intelligence. However, there is little research on intelligent interaction technology that generates virtual teachers with rich head movements, expressions, and intelligent question-answering capabilities. As a result, the interactive ability of virtual teachers cannot fully simulate real human emotions and non-verbal expressions. Existing virtual teacher systems were investigated in depth and their functions compared and analyzed, as shown in Table 1.
Adele, a teaching agent proposed by W. L. Johnson and E. Shaw of the University of Southern California, USA, accomplishes basic educational functions through an intelligent agent: expressing knowledge, monitoring students, providing feedback, exploring questions, and prompting and answering.
Steve is a teaching agent developed at the Information Sciences Institute of the University of Southern California to support the learning process. Steve can demonstrate skill operations, answer students' questions, watch students complete tasks, and give suggestions when students encounter difficulties.
AvaTalk is an intelligent virtual teacher system for interactive skill training. It uses agent technology to design a surrogate with knowledge and emotion expression.
The Jacob project integrates various disciplines of technology, including an intelligent tutor system, virtual reality technology, intelligent agent technology, natural language-processing technology, agent visualization, animation technology, etc., and complies with the H-Anim standard [15] in human motion.
Researchers at Massey University in New Zealand have created a virtual teacher dubbed Eve, a human-like 3D animated teacher that puts the concept of an "intelligent and emotion-aware system" into concrete practice.
Jack is a software (Version 1.0) system developed by the Center for Human Modeling and Simulation at the University of Pennsylvania to control and simulate realistic human characters in a 3D environment. It provides the functions of collision detection, real-time object grasping, and real-time visualization.
Baldi, a three-dimensional tutor consisting of only a head, was designed and implemented under the leadership of Ron Cole of the University of Colorado at Boulder to develop the conversational skills of children with hearing disabilities. Through speech recognition and other new technologies, Baldi shows students how to understand and produce spoken language.
With the intelligent virtual English teacher Lucy (Version 1.2191009), learners can talk and chat with the virtual character, helping them overcome the common barrier of being reluctant to speak English aloud and creating an environment of intelligent human–computer interaction for practicing spoken English.

3. Principle

3.1. Frame Structure

The construction of a 3D virtual teacher is mainly divided into five aspects: appearance expression, motion generation and control, behavior expression, cognitive expression, and emotional expression. These involve research on the virtual teacher's appearance and motion characteristics, its intelligence, and its emotional realism. Based on these five aspects, a universal virtual agent is applied to the virtual learning environment. After refinement and expansion, a five-layer model architecture is designed, consisting of front-end interaction, information perception, session generation, voice interaction, and user portrait [16]. The architecture of the virtual teacher function module in this paper is shown in Figure 1; it comprises front-end interaction, information perception, session generation, speech synthesis, and user portrait, which together realize the complete virtual teacher function.

3.2. Dialogue Module

The key to realizing the speech interaction of the virtual teacher is to establish a complete technical system, including speech recognition, natural language understanding, speech synthesis, and other modules. As shown in Figure 2, the speech recognition module converts the user's speech input into text. After speech recognition, the natural language understanding module parses and interprets the user's text input. The speech synthesis module converts the digital human's responses into natural, fluent speech output [17]. In addition, a sentiment analysis module is added to the virtual teacher in this paper; it identifies the emotional tendency and emotional state in the text, analyzes the positive, negative, or neutral emotions in human language, and converts them into a form the computer can understand, giving the virtual teacher deeper human characteristics and enabling more personalized, attentive teaching services.
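To make this flow concrete, the following minimal Python sketch strings the four modules together. The helper functions are hypothetical placeholders for the speech recognition, sentiment analysis, dialogue, and speech synthesis components in Figure 2, not functions from the implemented system.

```python
# Illustrative pipeline sketch; the four helpers are hypothetical placeholders
# for the modules in Figure 2, not functions from the released system.
from typing import Callable


def dialogue_turn(audio: bytes,
                  recognize_speech: Callable[[bytes], str],
                  analyze_sentiment: Callable[[str], str],
                  generate_reply: Callable[[str, str], str],
                  synthesize_speech: Callable[[str], bytes]) -> bytes:
    """One turn of the virtual teacher's speech interaction loop."""
    text = recognize_speech(audio)          # speech recognition: audio -> text
    emotion = analyze_sentiment(text)       # sentiment analysis: positive/negative/neutral
    reply = generate_reply(text, emotion)   # natural language understanding + response
    return synthesize_speech(reply)         # speech synthesis: text -> audio
```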

3.3. AI-Driven

According to the driving mode, virtual digital humans can be divided into AI-driven and human-driven (motion-capture) types. AI-driven speech animation, also known as speech animation synthesis for virtual characters, generates the facial expression coefficients of the corresponding 3D virtual character from input text or speech using hand-crafted rules or deep learning algorithms, thereby accurately driving mouth movements and facial expressions. As shown in Figure 3, the virtual teacher in this paper uses the VOCA (voice-operated character animation) model to generate digital human animation from speech. The main principle is to capture the acoustic characteristics of speech through audio signal recognition and convert the user's real-time voice input into realistic facial animation, thereby realizing the voice interaction and expression control of digital characters [18].

4. Materials and Methods

4.1. Iflytek (Version 3.5) Voice Interaction

The Iflytek voice interaction API supports a variety of scenarios, such as intelligent human–machine dialogue systems, voice assistants, and intelligent customer service. Whether in smart homes, smart cars, smart medicine, or smart education, this API provides developers with powerful voice interaction capabilities that help them quickly upgrade their products with voice. In addition, the Iflytek voice interaction API is highly customizable, so developers can tailor the functions and interface of voice interaction to their own needs and better meet user requirements. In short, the Iflytek voice interaction API provides developers with a simple, flexible, and efficient voice interaction solution, making artificial intelligence voice interaction capabilities more convenient to apply.
Iflytek (Anhui, China) provides an Application Programming Interface (API) for intelligent voice development. This paper uses the Iflytek API as the technical support for speech recognition and speech synthesis, and the API is embedded directly in the overall system. When users ask questions by voice in the virtual teacher learning system, the system accesses the Iflytek platform over the network, calls its speech recognition API, and receives the converted text. After the system query obtains the correct answer dialog, the Iflytek platform is accessed again and its speech synthesis API is called to convert the answer text into audio. Calling the Iflytek API greatly simplifies the construction of the virtual teacher-aided learning system and improves the response speed of the system's overall voice technology. To use the Iflytek voice interaction API, developers need to register an account on the Iflytek open platform and create an application in order to obtain the corresponding AppID, API Key, and API Secret. This information is the key to calling the API for authentication and secure access.
appID is a unique identifier generated automatically when an application is created on the Iflytek open platform; it identifies a specific application or service. apiKey is a key used for authentication and authorization, typically used to ensure that only legitimate users can access an API service. apiSecret is a security key used to generate a request signature, ensuring the legitimacy of the request and the security of the data.
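As an illustration of how these credentials are typically used, the following Python sketch builds a signed request URL with HMAC-SHA256. The host, endpoint path, and header layout are assumptions drawn from common Iflytek WebSocket examples rather than from this paper, so the official platform documentation should be followed for the exact protocol.

```python
# Hedged sketch of building a request signature from apiKey/apiSecret.
# Endpoint path, host, and header layout are assumptions, not authoritative.
import base64
import hashlib
import hmac
from datetime import datetime, timezone
from urllib.parse import urlencode

API_KEY = "your-apiKey"        # placeholder credentials from the open platform
API_SECRET = "your-apiSecret"
HOST = "iat-api.xfyun.cn"      # assumed speech-recognition host
PATH = "/v2/iat"               # assumed endpoint path


def build_auth_url() -> str:
    """Sign the request with HMAC-SHA256 so the platform can authenticate it."""
    date = datetime.now(timezone.utc).strftime("%a, %d %b %Y %H:%M:%S GMT")
    origin = f"host: {HOST}\ndate: {date}\nGET {PATH} HTTP/1.1"
    signature = base64.b64encode(
        hmac.new(API_SECRET.encode(), origin.encode(), hashlib.sha256).digest()
    ).decode()
    authorization = base64.b64encode(
        (f'api_key="{API_KEY}", algorithm="hmac-sha256", '
         f'headers="host date request-line", signature="{signature}"').encode()
    ).decode()
    query = urlencode({"authorization": authorization, "date": date, "host": HOST})
    return f"wss://{HOST}{PATH}?{query}"
```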

4.2. CEmotion Emotion Analysis

CEmotion is a sentiment analysis tool that focuses on extracting and understanding emotional information from textual data. Using natural language processing techniques and machine learning algorithms, CEmotion can analyze sentiment tendencies (positive, negative, and neutral) in user comments, social media posts, and other textual content. It considers not only the sentiment of individual words but also their context, providing more precise sentiment insights; this allows enterprises to better understand customer feedback, optimize products and services, and enhance the user experience. To improve accuracy, CEmotion takes contextual information into account: by understanding sentence structure and context, the system avoids the misjudgments that simple keyword matching can cause. CEmotion employs modern NLP techniques, including semantic parsing, dependency parsing, and entity recognition, which allow the system to understand the deeper meaning of the text rather than just the surface words. The system uses a variety of machine learning models, such as deep learning networks (e.g., LSTM and BERT) and traditional classifiers (e.g., support vector machines and random forests), to improve the accuracy and robustness of sentiment analysis, and it can continuously improve its performance by training on large amounts of labeled data. CEmotion supports sentiment analysis in multiple languages, including English, Chinese, French, and Spanish, so it can handle sentiment data from different linguistic and cultural contexts, and its analysis pipeline can process large amounts of data and generate sentiment reports in real time.
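A minimal usage sketch is shown below, assuming the pip-installable cemotion package; the predict() interface and the 0–1 score convention follow the library's public examples and may differ between versions, so it should be treated as illustrative rather than authoritative.

```python
# Minimal usage sketch, assuming the pip-installable `cemotion` package
# (pip install cemotion). The predict() call and 0-1 score convention follow
# the library's public examples and may differ between versions.
from cemotion import Cemotion

analyzer = Cemotion()


def classify_sentiment(text: str) -> str:
    """Map CEmotion's confidence score to the three labels used in this paper."""
    score = analyzer.predict(text)          # closer to 1.0 => more positive
    if score >= 0.6:
        return "positive"
    if score <= 0.4:
        return "negative"
    return "neutral"


# Example: a positive learner comment ("Today's lesson was explained clearly").
print(classify_sentiment("今天的课讲得很清楚"))
```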

5. System Implementation

In this section, we discuss the implementation of the proposed virtual teacher system in detail, including the 3D modeling of the teacher character, voice interaction, sentiment analysis, the VOCA-based audio-driven facial animation, and the presentation of the overall system.

5.1. Three-Dimensional Modeling

(i).
Character modeling
In this system, 3ds Max, Poser, and FaceGen are used to design the static model of the figure and the facial expressions of the teacher. The static model of the character consists of two parts: the human body model and the appearance. The human body model is constructed with Poser (Version 12), and the appearance is produced with 3ds Max (Version 2022). Poser provides many ready-made human body models, character actions, and corresponding body parts, and different types of parts can be freely combined to compose a unique human model. Because the head and appearance of the virtual teacher are complex to construct, FaceGen is used in this paper to create the head of the virtual teacher.
(ii).
Texture mapping
After the model is built, it needs to be textured. To make the digital human look more realistic, appropriate materials and textures must be added. The first step is UV unwrapping: the model surface is flattened and laid out on the UV map, onto which the texture is painted. To balance model clarity against resource consumption, this paper bakes the texture detail of the high-resolution model onto a low-resolution model.
(iii).
Bone binding
To achieve the animation effect of the digital human, a skeletal system must be bound to the digital human model. After the model is built, bones are created and bound for the joints, muscles, and other structures to facilitate animating the model. Once the skeleton is built, the model is skinned: skinning binds the created skeleton to the mesh so that the model can move smoothly and correctly.

5.2. Voice Interaction

At the beginning of a voice interaction, the virtual teacher system starts its speech recognition module, as shown in Figure 4. It first converts the student's voice input into text so that the system can understand the student's question or request, recognizing the instruction quickly and accurately. This is followed by the natural language processing phase, in which the system not only understands the student's voice input but also analyzes and processes it based on its content. After the comprehension and processing phase, the virtual teacher system responds through speech or text. If the student is using voice input, the system answers in voice form, which makes the communication more natural and vivid; if the student prefers written feedback, the system can also present the solution as text, so that the student can consult and understand it more easily.
Iflytek provides the application programming interface for intelligent voice development. In this paper, the Iflytek API serves as the technical support for speech recognition and speech synthesis: the appID, apiSecret, and apiKey of the registered application are used as the verification credentials, and the connection with the Iflytek platform is established through the play() function. The API is embedded directly in the overall system. When users ask voice questions, the system accesses the Iflytek platform over the network, calls its speech recognition API, and receives the converted text. After the system query obtains the correct answer dialog, the Iflytek platform is accessed again and its speech synthesis API is called to convert the answer text into audio. Calling the Iflytek API greatly simplifies the construction of the virtual teacher answering system and improves the response speed of the system's overall voice technology.
Once the user's speech has been successfully converted to text, the text needs to be sent to the ChatGPT model to generate a response. With the application programming interface provided by OpenAI, communication with ChatGPT is straightforward to realize. In Unreal Engine, the network request and data-processing functions can be written using the Set Open AiApi Key blueprint system.
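For illustration, the following Python sketch shows an equivalent server-side call with the OpenAI Python SDK (openai ≥ 1.0); the model name and system prompt are assumptions, and the actual system performs this step through the Unreal Engine blueprint described above.

```python
# Illustrative sketch of sending recognized text to ChatGPT and reading the
# reply with the OpenAI Python SDK; the model name and system prompt are
# assumptions, not values taken from the paper.
from openai import OpenAI

client = OpenAI(api_key="your-api-key")     # key configured on the OpenAI platform


def ask_virtual_teacher(question_text: str) -> str:
    """Send the transcribed student question and return the teacher's answer."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",              # assumed model choice
        messages=[
            {"role": "system", "content": "You are a patient virtual teacher."},
            {"role": "user", "content": question_text},
        ],
    )
    return response.choices[0].message.content
```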

5.3. Sentiment Analysis

Sentiment analysis aims to identify and extract the sentiment and emotional information contained in text. It is not limited to identifying the emotional polarity of the text (positive, negative, or neutral); it can also analyze the complexity and context of the sentiment, including its intensity, its object, and the emotional tendency behind the text [19]. As shown in Figure 5, the American psychologist Ekman listed six basic emotions: anger, happiness, surprise, disgust, sadness, and fear. Mehrabian proposed the PAD (Pleasure-Arousal-Dominance) three-dimensional emotional space, in which P stands for pleasure, indicating the positive or negative character of an individual's emotional state; A stands for arousal, indicating the individual's level of neurophysiological activation; and D stands for dominance, indicating the individual's sense of control over the situation and others. In this paper, we use CEmotion, a third-party sentiment analysis library, which offers high accuracy and is easy to call. The implementation consists of the following steps:
(i).
Data acquisition: Through the sentiment classification or sentiment polarity classification of the user’s text, multi-modal data are collected from the perspective of the learner’s speech and text, and preprocessing and feature extraction are carried out.
(ii).
CEmotion emotion analysis: In the teaching process, the virtual teacher’s facial expressions are constantly adjusted with the changes of the situation, as shown in Figure 6. The CEmotion emotion analysis algorithm is used to process the multimodal data of the learner and obtain the emotional state of the learner.
(iii).
Emotional interaction: According to the learner's emotional state, the virtual teacher automatically adjusts its language style, expression, and tone to communicate better with the learner. Based on the learner's emotional state and behavior data, the virtual teacher can also adjust its teaching strategy, such as changing the difficulty or providing more support, to achieve a better teaching effect; a minimal sketch of this adjustment logic follows this list.
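The sketch below illustrates step (iii) with a hypothetical mapping from the detected emotional state to expression, tone, and difficulty adjustments; the preset labels are illustrative and are not parameters defined by the implemented system.

```python
# Hypothetical sketch of step (iii): adapting the virtual teacher's delivery
# to the detected emotional state. The preset labels are illustrative only.
from typing import Dict

ADJUSTMENTS: Dict[str, Dict[str, str]] = {
    "positive": {"expression": "smile",   "tone": "energetic", "difficulty": "raise"},
    "neutral":  {"expression": "neutral", "tone": "steady",    "difficulty": "keep"},
    "negative": {"expression": "concern", "tone": "gentle",    "difficulty": "lower"},
}


def adapt_teaching(emotion: str) -> Dict[str, str]:
    """Pick expression, tone, and difficulty adjustments for the next response."""
    return ADJUSTMENTS.get(emotion, ADJUSTMENTS["neutral"])
```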

5.4. Audio Driver

(i).
Introduction to the VOCA model
Speech-driven facial animation and facial expression synthesis are two essential components of digital human construction, used to realize the speech and expression functions of the digital human. The VOCA model can quickly realize both, generating realistic facial expressions and mouth movements and making the 3D virtual teacher more vivid and natural [20].
In this paper, VOCA is used to realize the functions related to speech interaction, such as the expression and mouth movements of the virtual teacher. VOCA compares favorably with many current methods. VOCA introduces a unique 4D face dataset containing approximately 29 min of 4D scans captured at 60 fps with synchronized audio from 12 speakers. VOCA also provides animator control to change the speaking style, identity-related face shape, and pose (i.e., head, chin, and eye rotation) during animation, and it is a model that enables realistic 3D facial animation.
(ii).
VOCA model architecture
The input of the VOCA model is the subject-specific template T and the raw speech signal, from which speech features are extracted with DeepSpeech. The output is the 3D mesh of the target. The whole network adopts an encoder-decoder structure, as shown in Figure 7 and Table 2. The encoder consists of four convolutional layers and two fully connected layers, and the decoder is a fully connected layer with a linear activation function [21].
The overall framework process is as follows:
Speech feature extraction: The speech signal is input into DeepSpeech, and the feature extraction is carried out to obtain a three-dimensional array (dimension is 60T × W × D), where T is the length of the input speech in seconds; W is the window size (speech overlap window size); and D is the number of letters in the alphabet plus 1 (space label).
Encoder: The subject (head model) labels are one-hot encoded, stacked onto the extracted speech features along the channel dimension, and fed into the encoder. The one-hot encoding of the head-model label is concatenated again with the output of the fourth convolutional layer, and the result passes through the final two fully connected layers to produce the encoder output.
Decoder: The output of the encoder is fed into the decoder to obtain the vertex offsets of the 3D mesh. The final 3D mesh is obtained by adding these offsets to the vertex positions of the template.
Animation control: Changing the eight-dimensional one-hot identity vector changes the output style of the character. The VOCA model has the same face mesh topology in its rest position as the FLAME model, so it can easily be extended. A structural sketch of the encoder-decoder described above is given below.
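The following Keras sketch mirrors the layer sizes in Table 2 for illustration only; it is not the authors' released code, which uses a fixed pre-trained DeepSpeech front end and an eight-dimensional one-hot speaker identity.

```python
# Structural sketch of the encoder-decoder in Table 2, written with Keras for
# illustration; not the authors' released TensorFlow code.
import tensorflow as tf
from tensorflow.keras import layers

NUM_SUBJECTS = 8          # one-hot identity vector length
NUM_VERTICES = 5023       # FLAME-topology mesh vertices

speech = layers.Input(shape=(16, 1, 29), name="deepspeech_features")
identity = layers.Input(shape=(NUM_SUBJECTS,), name="speaker_one_hot")

# Broadcast the identity vector along the time axis and stack it onto the
# speech features (16 x 1 x 29 -> 16 x 1 x 37).
id_map = layers.Reshape((1, 1, NUM_SUBJECTS))(identity)
id_map = layers.UpSampling2D(size=(16, 1))(id_map)
x = layers.Concatenate(axis=-1)([speech, id_map])

# Four 3x1 convolutions with 2x1 stride: 8x1x32 -> 4x1x32 -> 2x1x64 -> 1x1x64.
for filters in (32, 32, 64, 64):
    x = layers.Conv2D(filters, kernel_size=(3, 1), strides=(2, 1),
                      padding="same", activation="relu")(x)

# Concatenate the identity again (64 + 8 = 72), apply the two encoder FC
# layers, then the linear decoder producing per-vertex offsets.
x = layers.Concatenate()([layers.Flatten()(x), identity])
x = layers.Dense(128, activation="tanh")(x)
x = layers.Dense(50, activation="linear")(x)
offsets = layers.Dense(NUM_VERTICES * 3, activation="linear")(x)
offsets = layers.Reshape((NUM_VERTICES, 3))(offsets)

voca_sketch = tf.keras.Model([speech, identity], offsets)
```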
(iii).
Model training
We employ a new dataset of 4D face scans together with speech. The dataset has 12 subjects and 480 sequences of about 3–4 s each, with sentences chosen from an array of standard protocols that maximize phonetic diversity. The 4D scans are captured at 60 fps, and we align a common face template mesh to all the scans, bringing them into correspondence.
The raw 3D head data were aligned using the method provided by the FLAME model [22]. Image-based keypoint detection methods can enhance the robustness of the alignment. After alignment, each mesh contains 5023 vertices. For all sequences, the neck boundary and ears were automatically repaired, and the area around the eyes was smoothed with a Gaussian filter to remove acquisition noise. The mouth region is not smoothed, in order to preserve subtle movements [23].
The audio-4D scan training data are represented as pairs $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{F}$, where $\mathbf{x}_i \in \mathbb{R}^{W \times D}$ is the speech input window whose center corresponds to the video frame $\mathbf{y}_i \in \mathbb{R}^{N \times 3}$. Moreover, let $\mathbf{f}_i \in \mathbb{R}^{N \times 3}$ denote the VOCA output for $\mathbf{x}_i$.
The loss function consists of two terms: a position term and a velocity term.
The position term is used to calculate the distance between the predicted vertex and the training data vertex. The formula is expressed as follows:
$E_p = \left\| \mathbf{y}_i - \mathbf{f}_i \right\|_F^2$
The velocity term is used to calculate the distance between the differences in consecutive frames between the predicted output and the training vertices. The formula is expressed as follows:
$E_v = \left\| (\mathbf{y}_i - \mathbf{y}_{i-1}) - (\mathbf{f}_i - \mathbf{f}_{i-1}) \right\|_F^2$
The hyperparameters were tuned on the validation set and trained for 50 epochs using a learning rate of 1 × 10−4. The weights for the position and velocity terms are 1 and 10, respectively, and batch normalization with a batch size of 64 is used during training. The window size is W = 16, and the speech feature dimension is D = 29.
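For reference, the combined training objective can be sketched as follows, assuming batched sequences of per-frame vertex positions; this is an illustrative TensorFlow 2 formulation of the position and velocity terms (weighted 1 and 10) rather than the original training code.

```python
# Sketch of the loss described above: position term plus velocity term,
# weighted 1 and 10 respectively. Illustrative TensorFlow 2 formulation.
import tensorflow as tf

POSITION_WEIGHT = 1.0
VELOCITY_WEIGHT = 10.0


def voca_loss(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    """y_true, y_pred: (batch, frames, 5023, 3) vertex positions per frame."""
    # Position term: squared distance between predicted and ground-truth vertices.
    position = tf.reduce_sum(tf.square(y_true - y_pred), axis=[-2, -1])

    # Velocity term: squared distance between frame-to-frame differences.
    true_vel = y_true[:, 1:] - y_true[:, :-1]
    pred_vel = y_pred[:, 1:] - y_pred[:, :-1]
    velocity = tf.reduce_sum(tf.square(true_vel - pred_vel), axis=[-2, -1])

    return (POSITION_WEIGHT * tf.reduce_mean(position)
            + VELOCITY_WEIGHT * tf.reduce_mean(velocity))
```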
The virtual teacher in this paper is implemented in Python (Version 3.6) using TensorFlow and trained using Adam. Training one epoch takes about ten minutes on a single NVIDIA Tesla K20. We use a pre-trained DeepSpeech model which is kept fixed during training.
Extracting identity from facial motion enables the animation of a wide range of adult faces. Figure 8 shows the static template and some VOCA animation frames driven by the same audio sequence; the results show that the model has some generalization ability. The possibility of changing the identity-related shape (top) and head pose (bottom) during animation, driven by the same audio sequence, makes the facial animation look realistic despite the different shapes and poses.

5.5. System Presentation

The running environment of the virtual teacher-learning system in this paper includes two main components: PyCharm and Unreal Engine 5 (UE5). In PyCharm, we implemented the design of the virtual teacher controller. PyCharm serves as a powerful integrated development environment (IDE) for writing and debugging code for controllers that are responsible for managing the behavior of the virtual teacher, responding to user input, and enabling interaction with other systems. The core functions of the controller include the natural language processing, situation recognition, and dynamic generation of teaching content.
On the other hand, Unreal Engine 5 (UE5) is used to complete the binding and rendering of the virtual teacher model. As an advanced game engine, UE5 provides powerful 3D modeling and real-time rendering functions, which makes the visual effect of virtual teachers more realistic. With UE5, we are able to seamlessly interface the 3D model of the virtual teacher with the controller designed in PyCharm, ensuring that the virtual teacher can interact with the user in a natural and fluent way in the virtual environment. Ultimately, this integration allows the virtual teacher system to provide an immersive teaching experience while maintaining a high degree of interactivity and flexibility.
After the controller is started and connected to the virtual teacher character model, as shown in Figure 9, users can communicate with the virtual teacher directly by voice or text, ask questions, or request explanations. The virtual teacher can understand and analyze the voice input and then provide detailed answers or explanations in the form of speech or text, achieving a more intuitive and natural interactive learning experience. During the interaction, the virtual teacher makes facial micro-expressions when explaining, uses hand gestures to emphasize teaching points, and nods or smiles at appropriate moments to encourage students, making the teaching process more vivid and interactive.

6. Conclusions

In this paper, we use the VOCA model to generate digital human animation from speech and to accurately drive mouth shape and facial expression. Audio signal recognition captures the acoustic characteristics of speech, and the user's real-time voice input is converted into realistic facial animation, realizing the voice interaction and expression control of digital characters. The system uses advanced artificial intelligence technology and an online education platform to improve learning outcomes and teaching quality. By integrating intelligent analysis, real-time feedback, and personalized learning paths, the system can dynamically adjust teaching content and provide targeted guidance and support according to students' learning progress and needs. The interactive interface and natural language processing ability of the system enable students to hold a natural dialog with the virtual teacher, which enhances the interactivity of learning and student participation.
In addition, in view of the lack of empathy in virtual teachers, this paper adds the function of sentiment analysis, which can make virtual teachers more intelligent and humanized. By recognizing students’ emotional states, virtual teachers can adjust teaching strategies and tone in time to better adapt to students’ emotional needs and provide a warmer and intimate teaching experience.
In view of virtual teachers' lack of interpersonal interaction, this paper provides students with a richer, more diversified learning experience. Students can communicate and interact with virtual teachers through various forms such as voice, text, and pictures, which makes the learning process more vivid and friendly.
In future research, we plan to further optimize the interactivity and emotional intelligence of virtual teachers, so that they can achieve more efficient, personalized, and interactive teaching. Future optimization will focus on improving intelligence, personalization, and interactivity. By integrating advanced artificial intelligence and machine learning technologies, virtual teachers will be able to more accurately analyze students' learning habits and performance and provide tailored teaching content and feedback. The optimized virtual teacher will have deep learning ability, and it can not only identify students' knowledge blind spots but also adjust teaching strategies to adapt to individual differences. In addition, the application of augmented reality (AR) and virtual reality (VR) technology will enable virtual teachers to create a more immersive learning environment, improving the interest and practicality of learning. Through continuous technology updates and data-driven improvements, future virtual teachers will become more intelligent and interactive learning assistants, greatly enhancing students' learning effect and experience.

Author Contributions

Writing—original draft, X.M.; Writing—review and editing, J.H. All authors have read and agreed to the published version of the manuscript.

Funding

The work was funded by the Key Laboratory of Data Science and Intelligence Education (Hainan Normal University), the Ministry of Education, China (No: DSIE202301).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bunglowala, A.; Bunglowala, A. Nonverbal communication: An integral part of teaching learning process. Int. J. Res. Advent Technol. 2015, 1, 371–375. [Google Scholar]
  2. Yu, L.; Xu, F.; Qu, Y.; Zhou, K. Speech emotion recognition based on multi-dimensional feature extraction and multi-scale feature fusion. Appl. Acoust. 2024, 216, 109752. [Google Scholar] [CrossRef]
  3. Arnau-González, P.; Arevalillo-Herráez, M.; Albornoz-De Luise, R.; Arnau, D. A methodological approach to enable natural language interaction in an Intelligent Tutoring System. Comput. Speech Lang. 2023, 81, 101516. [Google Scholar] [CrossRef]
  4. Yang, X. Design of Intelligent Voice Interactive Robot Based on Cloud Platform. Adv. Comput. Commun. 2023, 4, 21–24. [Google Scholar] [CrossRef]
  5. Tu, J. Learn to speak like a native: AI-powered chatbot simulating natural conversation for language tutoring. J. Phys. Conf. Ser. 2020, 1693, 012216. [Google Scholar] [CrossRef]
  6. Cudeiro, D.; Bolkart, T.; Laidlaw, C.; Ranjan, A.; Black, M.J. Capture, learning, and synthesis of 3D speaking styles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10101–10111. [Google Scholar]
  7. Ephrat, A.; Mosseri, I.; Lang, O.; Dekel, T.; Wilson, K.; Hassidim, A.; Freeman, W.T.; Rubinstein, M. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. arXiv 2018, arXiv:1804.03619. [Google Scholar] [CrossRef]
  8. Zhao, P.; Gu, Z. Application status and development of Intelligent Tutor System in military training. Comput. Eng. Des. 2007, 28, 4275–4277. [Google Scholar]
  9. Johnson, W.L.; Shaw, E. Using agents to overcome deficiencies in Web-based Courseware. In Proceedings of the Workshop, Intelligent Educational Systems on the World Wide Web, 8th World Conference of the AIED Society, Kobe, Japan, 18–22 August 1997. [Google Scholar]
  10. Rickel, J.; Johnson, W.L. Animated agents for procedural training in virtual reality: Perception, cognition, and motor control. Appl. Artif. Intell. 1999, 13, 343–382. [Google Scholar] [CrossRef]
  11. Guinn, C.; Hubal, R. Extracting Emotional Information from the Text of Spoken Dialog. 2009. Available online: http://www.cs.ubc.ca/~conati/um03-affect/guinn-final.pdf (accessed on 15 April 2024).
  12. Tan, T.; Shi, Y.; Gao, W. Jacob-An Animated Instruction Agent in Virtual Reality; Springer: Berlin/Heidelberg, Germany, 2000; pp. 526–533. [Google Scholar]
  13. Lester, J.C.; Zettlemoyer, L.S.; Gregoire, J.P.; Bares, W. Explanatory lifelike avatars: Performing user-centered tasks in 3D learning environments. In Proceedings of the Third International Conference on Autonomous Agents (Agents’99), Seattle, WA, USA, 1–5 May 1999; ACM Press: New York, NY, USA, 1999; pp. 24–31. [Google Scholar]
  14. Zhao, H.; Sun, B.; Zhang, C. Virtual teacher research review. Micro Comput. Appl. 2010, 29, 1–5+8. [Google Scholar] [CrossRef]
  15. ISO/IEC 19774; Human Animation. International Organization for Standardization: Geneva, Switzerland, 2009.
  16. Kim, J.H.; Yu, C.Y.; Seo, K.; Wang, F.; Oprean, D. The effect of virtual instructor and metacognition on workload in a location-based augmented reality learning environment. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 2023, 67, 1550–1555. [Google Scholar]
  17. Samonte, M.J.; Acuña, G.E.O.; Alvarez, L.A.Z.; Miraflores, J.M. A Personality-Based Virtual Tutor for Adaptive Online Learning System. Int. J. Inf. Educ. Technol. 2023, 13, 899–905. [Google Scholar] [CrossRef]
  18. Wu, J.; Fan, M.; Sheng, L.; Sun, G. Exploring the design space of virtual tutors for children with autism spectrum disorder. Educ. Inf. Technol. 2023, 28, 16531–16560. [Google Scholar] [CrossRef]
  19. Cambria, E.; Das, D.; Bandyopadhyay, S.; Feraco, A. Affective computing and sentiment analysis. In A Practical Guide to Sentiment Analysis; Springer: Cham, Switzerland, 2017; pp. 1–10. [Google Scholar]
  20. Spielmaker, D. “Making it Easy” with Innovations to Increase Agricultural Literacy. Agric. Educ. Mag. 2023, 95, 6–9. [Google Scholar]
  21. Shallal, T.M.; Alkhateeb, N.E.; Al-Dabbagh, A. Virtual faculty development program in bioethics evaluated by Kirkpatrick model: A unique opportunity. PLoS ONE 2023, 18, e0293008. [Google Scholar] [CrossRef] [PubMed]
  22. Palsodkar, P.; Dubey, Y.; Palsodkar, P.; Bajaj, P. Project-based pedagogical inevitability and social media impact. Int. J. Technol. Enhanc. Learn. 2023, 15, 346–363. [Google Scholar] [CrossRef]
  23. Liew, T.W.; Tan, S.M.; Pang, W.M.; Khan, M.T.I.; Kew, S.N. I am Alexa, your virtual tutor: The effects of Amazon Alexa’s text-to-speech voice enthusiasm in a multimedia learning environment. Educ. Inf. Technol. 2022, 28, 31–35. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Virtual teacher functional module architecture.
Figure 2. Conceptual diagram of speech interaction.
Figure 3. AI-driven conceptual graphs.
Figure 4. Voice interaction process.
Figure 5. Pleasure-Arousal-Dominance model and Valence-Arousal emotion model.
Figure 6. Effects of affective computing for virtual teachers.
Figure 7. Network architecture.
Figure 8. Model training effect.
Figure 9. System interface display.
Table 1. Comparison of the main characteristics of typical virtual teacher characters.

Name of Virtual Teacher | 3D Virtual Learning Environment | 3D Teacher Image | Intelligence Characteristics | Voice Interaction | Sentiment Analysis
Adele [9]    | Yes | Yes | Yes | No  | No
Steve [10]   | Yes | Yes | Yes | Yes | No
AvaTalk [11] | No  | No  | Yes | Yes | No
Jacob [12]   | Yes | Yes | Yes | No  | No
Whizlow [13] | Yes | Yes | No  | No  | No
Jack [14]    | Yes | Yes | No  | No  | No
Baldi        | No  | Yes | No  | Yes | No
Lucy         | No  | Yes | No  | Yes | No
Table 2. Model architecture.

Type            | Kernel | Stride | Output       | Activation
DeepSpeech      | -      | -      | 16 × 1 × 29  | -
Identity concat | -      | -      | 16 × 1 × 37  | -
Convolution     | 3 × 1  | 2 × 1  | 8 × 1 × 32   | ReLU
Convolution     | 3 × 1  | 2 × 1  | 4 × 1 × 32   | ReLU
Convolution     | 3 × 1  | 2 × 1  | 2 × 1 × 64   | ReLU
Convolution     | 3 × 1  | 2 × 1  | 1 × 1 × 64   | ReLU
Identity concat | -      | -      | 72           | -
Fully connected | -      | -      | 128          | tanh
Fully connected | -      | -      | 50           | linear
Fully connected | -      | -      | 5023 × 3     | linear
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
