Article

Speech Emotion Recognition and Serious Games: An Entertaining Approach for Crowdsourcing Annotated Samples

by Lazaros Matsouliadis 1, Eleni Siamtanidou 2, Nikolaos Vryzas 2 and Charalampos Dimoulas 2,*

1 Department of Applied Arts and Sustainable Design, Hellenic Open University (HOU), 26331 Patra, Greece
2 Multidisciplinary Media and Mediated Communication (M3C) Research Group, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
* Author to whom correspondence should be addressed.
Information 2025, 16(3), 238; https://doi.org/10.3390/info16030238
Submission received: 27 January 2025 / Revised: 25 February 2025 / Accepted: 12 March 2025 / Published: 18 March 2025
(This article belongs to the Special Issue Information Processing in Multimedia Applications)

Abstract
Computer games have emerged as valuable tools for education and training. In particular, serious games, which combine learning with entertainment, offer unique potential for engaging users and enhancing knowledge acquisition. This paper presents a case study on the design, development, and evaluation of two serious games, “Silent Kingdom” and “Job Interview Simulator”, created using Unreal Engine 5 and incorporating speech emotion recognition (SER) technology. Informed by a systematic analysis of the existing research in SER and game development, these games were designed to elicit a wide range of emotional responses from players and to collect voice data for the enhancement of SER models. By evaluating player engagement, emotional expression, and overall user experience, this study investigates the effectiveness of serious games in collecting speech data and creating more immersive player experiences. The research also explores the technical limitations of real-time SER integration within game environments, as well as its impact on player enjoyment. Although the latency of real-time SER analysis imposes some technological limitations, the results reveal that a properly developed game with integrated SER technology could become a more engaging and efficient tool for crowdsourcing speech data.

1. Introduction

The rapid development of artificial intelligence has created an inseparable bond between humans and computers, known as human–computer interaction (HCI). An integral and vital part of human communication is the recognition of emotions. Computer systems for speech emotion recognition (SER) have been developed to process human speech and to recognize and categorize the emotions expressed. Applications of these systems span media communication and education, including interactive and performing arts, instructing journalists and presenters in emotional expression and the paralinguistic features of speech, detecting aggressive language and hate speech, and identifying and regulating audio and audiovisual content based on emotional cues. Considering the broad range of uses of SER systems, it is essential to prioritize efforts to refine and enhance this technology [1,2].
For speech emotion recognition (SER) systems to work well, it is essential to collect speech utterances with diverse emotional content. These data are then used to train the system to recognize human emotions, which in turn supports human understanding. Building well-crafted and complete datasets for the proficient training of SER systems is a highly detail-oriented process that involves repeating the same steps many times; while this repetition is crucial for accuracy, it quickly becomes monotonous. Configuring sizable speech datasets with emotional annotations (where emotional states are labeled before the SER system is trained) can be resource-intensive and financially burdensome. The process also requires significant time and effort from human annotators to meticulously label and categorize the emotions within the dataset. Consequently, the availability of high-quality datasets is limited, impeding the effective training of speech emotion recognition systems [1].
Audio plays a significant role in information processing within multimedia applications. It offers several key advantages, including enhancing ambient continuity, reducing computational load (i.e., compared to vision systems), and facilitating content indexing and dataset annotation. By seamlessly integrating audio recording and processing capabilities within an application, we can leverage audio for several key functions. Firstly, audio provides essential context and enhances the immersive experience, offering ambient sounds that contribute to the overall narrative and ensuring a more engaging and continuous stream of information for the audience. Secondly, audio processing can significantly reduce the computational load associated with joint audiovisual content indexing and annotation, particularly when compared to complex vision-based systems [3,4,5,6].
Gamification offers a unique approach to mitigating the lack of motivation often associated with repetitive tasks, such as collecting speech data, by transforming the process into an engaging and interactive experience. An advantage of video games is that, as participants become more deeply involved in the game and its narrative, they tend to interact and express themselves more freely, resulting in more authentic emotional responses and in a more enriched and diverse dataset. More generally, video games can enhance individuals’ lives and contribute to broader social and cultural influences. They also provide avenues for identity play, social interaction, and personal growth. Research suggests that engaging with video games can lead to positive outcomes such as improved cognitive abilities, academic success, and decision-making skills. Moreover, video games offer opportunities for self-expression through narrative structures and design strategies. They promote engagement, immersion, challenge, mastery, and flow, contributing to feelings of enjoyment and accomplishment [7,8,9,10,11,12,13].
Prior to our research, several serious games had been created incorporating speech emotion recognition (SER) or speech recording features to collect and, in some cases, annotate speech data [14,15,16,17,18]. These games leverage the interactive and immersive nature of gaming to gather emotional and vocal expressions from players, enhancing the quality of the data. By integrating such systems, these games not only engage players in meaningful ways but also contribute to the development of more accurate and diverse datasets for training emotion recognition models. This integration offers an innovative approach to data collection, blending entertainment with research in a seamless and efficient manner.
This paper explores the creation of serious video games in Unreal Engine 5, with a specific focus on crowdsourcing emotional speech data through integrated real-time SER technology. It also investigates the potential of serious games to address the limitations of real-time SER systems, which are often hampered by a lack of diverse and comprehensive training data [19]. A two-pronged approach is proposed through two new games we created, “Silent Kingdom” and “Job Interview Simulator”, each catering to different player preferences and gameplay styles. “Silent Kingdom” adopts an arcade-style format, incorporating elements like health bars, levels, checkpoints, and platform puzzles within a 2D environment. Players navigate their character, encountering enemies and solving puzzles that necessitate vocal expression. This engaging design encourages players to exhibit a wider range of emotions, such as excitement, frustration, and determination, while simultaneously facilitating the collection of valuable speech data for SER system development.
“Job Interview Simulator” caters to players who enjoy role-playing experiences. Players are presented with realistic job interview scenarios, allowing them to make decisions and explore various conversational paths. This approach fosters natural emotional responses within a contextually relevant setting, mirroring real-world interactions and capturing the nuances of human emotion during job interviews.
By offering these distinct gameplay experiences, the research aims to attract a broader player base, encompassing individuals with varying preferences for game mechanics and narratives. This wider range of participants translates to a more diverse and comprehensive speech data collection, encompassing a richer emotional spectrum and a larger variety of speakers. This enriched dataset holds significant potential to enhance the generalizability and robustness of SER systems, enabling them to perform more accurately and reliably across different contexts and speaker demographics.
The gamified approach presents several advantages over traditional data collection methods [8]. The engaging and interactive nature of games motivates players to participate for extended periods, potentially leading to larger datasets compared to passive techniques. Additionally, the inherent variability within gameplay scenarios encourages players to express a wider range of emotions, enriching the data with diverse emotional expressions crucial for training robust SER systems.
Based on the above analysis, the research hypothesis (RH) of the conducted work is stated as follows:
RH: 
What are the potential outcomes and implications of integrating real-time speech emotion recognition into serious games?
The following research questions (RQs) stem from this proposed hypothesis and aim to examine the effectiveness of serious games in collecting speech emotional data:
RQ1: 
To what extent does integrating SER technology in real-time within engaging serious games enhance player enjoyment?
RQ2: 
Does the integration of SER technology in real-time within serious games introduce any technical limitations or biases that could affect the quality of the collected emotional data?
Throughout development, we addressed the research questions through continuous communication, iterative reviews, and ongoing improvements.
This paper is organized into five sections, including this introduction. The second section provides a comprehensive review of the existing literature, exploring the challenges encountered in obtaining annotated speech data and the various methods of emotion annotation. This review lays the groundwork for the third section, which details the materials used and the methodologies employed, together with the research goals and the project’s motivation. The fourth section presents the results, and the fifth section concludes with a discussion of the findings and directions for future work.

2. Related Work

2.1. Speech Emotion Recognition and Its Application

SER, or speech emotion recognition, is the task of identifying emotional states or related characteristics (like happiness, sadness, etc.) from speech samples. In recent years, the field of SER has been heavily influenced by the rise of deep learning techniques in speech recognition [20].
Specifically, convolutional neural networks (CNNs) have become a popular choice for automatically extracting features from audio data. These features can be temporal (1D CNNs) representing changes over time, or spectrotemporal (2D CNNs) representing both time and frequency information. Various audio representations like spectrograms, MFCC vectors, and Mel-scale spectrograms are used as input to CNNs. Recurrent neural networks (RNNs) are known for their ability to model temporal sequences, and have also been explored for SER. These are often combined with CNNs to leverage both temporal and spectral information in speech signals. Recent research has focused on enhancing SER models with attention mechanisms and large-margin learning techniques. Additionally, one-dimensional dilated convolutional neural networks (DCNNs) have shown promise for end-to-end real-time SER systems [10,19,20,21,22,23,24,25].
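To make the spectrotemporal approach above concrete, the sketch below (in Python with PyTorch, chosen here only for illustration) defines a small 2D CNN that maps log-Mel spectrogram patches to five emotion classes. The layer sizes, input shape, and class count are illustrative assumptions and do not reproduce any specific published SER model.

```python
# Minimal sketch of a 2D (spectrotemporal) CNN classifier over log-Mel spectrograms.
# All layer sizes and the 128 x 128 input shape are illustrative assumptions.
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolve over time x frequency
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # pool to a fixed-size embedding regardless of clip length
            nn.Flatten(),
            nn.Linear(32, n_classes),  # logits for e.g. anger/disgust/fear/happiness/sadness
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Stand-in input: a batch of 8 one-channel log-Mel spectrograms (128 Mel bands x 128 frames).
model = SpectrogramCNN()
logits = model(torch.randn(8, 1, 128, 128))
print(logits.shape)  # torch.Size([8, 5])
```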
The application of SER has spread over various domains, including medicine, performance interaction, serious video games, education, multimodal semantic repositories, etc., demonstrating its potential across diverse scientific and societal contexts [10,21,22,24].
Current speech emotion research suffers from inconsistency due to limitations in existing databases. While studies have explored diverse features and selection techniques, they often rely on single databases with low human agreement on emotion labels (e.g., 65–80% accuracy). Additional issues include low audio quality, limited numbers of utterances, and missing phonetic information. This hinders the generalizability of findings. The solution lies in increased collaboration across research institutes to develop robust benchmark databases that provide a more reliable foundation for speech emotion recognition [25,26,27,28].
While SER has demonstrated significant advances through deep learning techniques, it is only one approach to emotion recognition. Facial expression analysis, which relies on visual cues such as muscle movements and micro expressions, provides another widely used method for detecting emotions. However, it can be influenced by individual differences, cultural variations, and the intentional masking of emotions. In contrast, SER captures affective states through vocal characteristics such as pitch, tone, and rhythm, making it particularly effective when facial cues are ambiguous or suppressed. Despite its advantages, SER also faces challenges, including variations in speech patterns across speakers and languages, background noise interference, and inconsistencies in emotional labeling. Given the limitations of both methods, integrating facial- and speech-based recognition in a multimodal framework presents a promising direction for improving the accuracy and robustness of emotion detection. By combining complementary information from visual and auditory cues, multimodal approaches can enhance emotion recognition in AI-driven learning environments, leading to more adaptive and responsive intelligent tutoring systems [29,30].
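One simple way to realize the multimodal fusion mentioned above is late fusion: each modality produces class probabilities independently, and the probabilities are combined before the final decision. The sketch below uses invented probability vectors purely for illustration; it is not tied to any particular SER or facial-expression model.

```python
# Minimal sketch of late (decision-level) fusion of speech and facial emotion predictions.
# The probability vectors below are invented stand-ins, not real model outputs.
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness"]

speech_probs = np.array([0.10, 0.05, 0.15, 0.60, 0.10])  # hypothetical SER model output
face_probs   = np.array([0.20, 0.05, 0.10, 0.55, 0.10])  # hypothetical facial model output

fused = (speech_probs + face_probs) / 2.0                 # simple average fusion
print(EMOTIONS[int(np.argmax(fused))])                    # -> "happiness"
```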

2.2. Crowdsourcing

It is known that crowdsourcing can be defined as an “online, distributed problem-solving and production model that leverages the collective intelligence or energy of an online community to serve an organizational goal” [14]. It involves a mutually beneficial relationship between an organization with a task and a community willing to voluntarily complete it, facilitated by an online platform. Crowdsourcing has diverse applications, ranging from consumer goods and media to scientific research, policy-making, and microtasks [15,20].
Estellés-Arolas and González-Ladrón-de Guevara define crowdsourcing as a participatory online activity where an individual, organization, or company proposes a task to a diverse group of individuals through an open call. This task, varying in complexity and scope, is undertaken voluntarily by the crowd, who contribute their work, knowledge, money, or experience for mutual benefit. Participants receive personal satisfaction, whether it be economic, social recognition, self-esteem, or skill development, while the crowdsourcer gains valuable contributions that depend on the nature of the task. For a process to be considered crowdsourcing, it must meet the following specific criteria: a clearly defined crowd and task, transparent compensation for both parties, an open call for participation, and the use of online platforms. Additionally, understanding the crowd’s motivations, whether intrinsic (enjoyment, challenge) or extrinsic (financial rewards), is crucial for designing effective crowdsourcing initiatives [15,16,31,32,33,34].
In recent years, crowdsourcing has gained significant traction in machine learning research, particularly in areas like data generation, model evaluation, and the development of hybrid human–machine intelligence systems [35]. Numerous projects exemplify this trend, such as Cascade for taxonomy creation [36], Whale FM for whale sound classification [37], and Clotho for audio captioning [38]. These projects demonstrate the potential of crowdsourcing to efficiently collect and annotate large-scale datasets, which are essential for training and improving machine learning models.
Several studies have integrated SER within serious video games with positive outcomes. This integration serves a dual purpose: first, creating engaging gameplay with emotionally responsive elements, and second, harnessing players’ interactions to crowdsource data for enhancing existing SER databases [14,15,16,17]. By leveraging the inherent motivational power of games and their ability to elicit natural emotional expressions in controlled environments, this approach facilitates the collection of large-scale datasets that can surpass traditional methods in terms of representativeness and generalizability. Moreover, serious games can incorporate gamified mechanisms for participant-driven emotion labelling, promoting more efficient data annotation than manual expert approaches. This, in turn, enables active learning processes, where player interactions and emotional responses inform the selection of new training data, leading to more targeted and efficient model improvement. However, careful consideration must be given to potential limitations such as game design bias, participant diversity, and data quality control mechanisms to ensure the robustness and generalizability of the collected data [16,17].
Despite the advances in SER and its integration into video games, a research gap remains in real-time speech emotion recognition within video game environments. Addressing this gap, the present research leverages the benefits of crowdsourcing and serious games to integrate a real-time SER system into two new serious games. By collecting in-game emotional speech data, this approach aims to enhance the ecological validity and generalizability of SER models, ultimately leading to more immersive and emotionally intelligent gaming experiences.

3. Materials and Methods

3.1. Analysis and Methodology

This research is interdisciplinary, combining areas such as game development, sound design, and SER. An analysis of the existing literature in these areas reveals a research gap: the integration of SER with cutting-edge game engine technology (Unreal Engine).
Serious game development presents a multifaceted and intricate challenge, demanding the seamless integration of diverse systems such as graphics, physics, and sound engines. While the independent creation of a game engine remains achievable, it is a time-consuming and highly complex undertaking requiring expertise in numerous technical domains. For the current research project, Unreal Engine 5 was chosen because its comprehensive suite of features satisfied most project requirements. For the integration of SER with Unreal Engine 5, the Acted Emotional Speech Dynamic Database (AESDD) was used [2].
The AESDD contains utterances of emotional speech in a theatrical context in the Greek language, captured by six professional actors (three male, three female) aged between 25 and 30 years old. A professional theatrologist supervised the process and approved the recordings. The dataset includes utterances in the five discrete emotions of anger, disgust, fear, happiness, and sadness. Subjective evaluation through listening tests resulted in an accuracy of around 74% by native speakers [39]. A 2D CNN model has been trained on the dataset and is provided online as a web service [20].
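Because the trained 2D CNN is exposed online as a web service [20], a game client only needs to upload a recorded clip and read back the prediction. The minimal sketch below illustrates such a client; the endpoint URL, form field, and response format are assumptions made for illustration and are not the actual API of the published service.

```python
# Hedged sketch of a client that uploads a .wav clip to a remote SER web service.
# The URL, the "audio" form field, and the JSON response shape are assumptions.
import requests

SER_ENDPOINT = "https://example.org/ser/api/classify"  # hypothetical endpoint

def classify_utterance(wav_path: str) -> dict:
    """Upload a .wav file and return the predicted emotion probabilities."""
    with open(wav_path, "rb") as f:
        response = requests.post(
            SER_ENDPOINT,
            files={"audio": (wav_path, f, "audio/wav")},
            timeout=16,  # matches the worst-case round-trip discussed in Section 5
        )
    response.raise_for_status()
    return response.json()  # e.g., {"anger": 0.1, "disgust": 0.05, "fear": 0.1, ...}

# Example usage: print(classify_utterance("Record-1.wav"))
```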

3.2. The Case of Silent Kingdom

3.2.1. Game Design

Silent Kingdom is a two-dimensional Adventure Platformer game. The player must use their movement abilities to overcome various obstacles. They are also given the ability to defend themselves against any enemies they may encounter.
The scenario was designed with the purpose of collecting voice timbre data. Thus, the goal was to create a fun, unique, and well-structured scenario that would allow for the recording and analysis of voice timbre.
Therefore, it is set in a fantasy world where an Emperor has instituted a regime that forces citizens to whisper. It is forbidden to speak loudly, and any hint of emotion is punished. However, the protagonist is unaware of this regime (because they are not from there) and the game begins when they arrive in the kingdom. The final goal of the game is for the player to convince the emperor to change the regime.
In the first level, the player encounters citizens of the kingdom who, while having various emotions, speak only in whispers because they do not know how else to express themselves. The player, with guidance from the computer, teaches the citizens of the kingdom to express their emotions.
More specifically, each recording of the player’s voice is analyzed by an SER model trained on the Acted Emotional Speech Dynamic Database (AESDD). The voice analysis detects the emotion expressed at any given time, and the game progresses accordingly.
In addition, the character has statistics such as life and stamina, which the player can replenish by saying a sentence that expresses the emotion given to them by the computer.
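Although the actual game logic is implemented inside Unreal Engine 5, the following Python sketch illustrates, with purely invented names and values, how a detected emotion could drive progression and stat replenishment: the game compares the SER result against the emotion requested by the NPC and either advances the dialogue or asks the player to retry.

```python
# Illustrative sketch (not the project's Unreal Engine 5 Blueprints) of emotion-driven
# progression. TARGET_EMOTION, the stat values, and the return codes are invented.
TARGET_EMOTION = "happiness"          # emotion the NPC asks the player to express

def apply_ser_result(detected: str, health: int, stamina: int):
    """Advance the dialogue if the detected emotion matches the target; otherwise retry."""
    if detected == TARGET_EMOTION:
        # Replenish stats (capped at 100) and signal the game to advance.
        return min(health + 10, 100), min(stamina + 10, 100), "advance"
    return health, stamina, "retry"

print(apply_ser_result("happiness", 80, 50))  # -> (90, 60, 'advance')
```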
In the second level, in addition to the citizens, the emperor’s evil guards appear, searching for the protagonist. To defeat them, the player uses musical notes as a weapon; if the player manages to avoid the guards, they advance to the third and final level.
In the final level, the protagonist comes face to face with the emperor. The final battle is nothing more than a dialogue that contains all the emotions that were explored in the previous levels. If the player manages to express them all correctly, they convince the emperor and the game ends with them shouting, “Everyone can now speak freely!”

3.2.2. Development

The environment for Silent Kingdom was created using the technique of Tilemaps (Figure 1). This is an effective and flexible way to build 2D environments in a time-efficient manner. Additionally, collectible “stars” were implemented to guide the player and eliminate confusion about their next objective.
The recordings for the NPCs and all sound effects were based on the script, which in turn represented the enforced regime and the fantasy world the protagonist inhabits. For the NPCs, 2D models were sourced from the community, specifically chosen to reflect the emotions they represent. Movable platforms and spikes were also created to introduce environmental puzzles.

3.3. The Case of Job Interview Simulator

3.3.1. Game Design

Job Interview Simulator is a serious game specifically designed to collect voice timbre data within a simulated environment. This educational and uniquely structured game aims to record and analyze the player’s vocal tone during a job interview scenario using the AESDD. Players navigate a 3D environment populated by simulated android characters, mimicking a real-world company setting, and participate in a virtual interview process. This innovative approach allows for controlled data collection of vocal tones during a standardized, simulated social interaction.
Upon initiating the game, players assume a first-person perspective and start at the company’s entrance. Immediately thereafter, the player engages with a human resources representative, represented as a non-player character (NPC). This interaction initiates the tutorial phase, wherein the NPC explains the player’s presence within the simulation and outlines the overarching objective: successful completion of the virtual job interview.
After the tutorial phase is finished, the player has to pass through the foyer of the company and go to the head office. This navigation leverages binaural audio technology, replicating the spatial sound characteristics experienced by the human auditory system, to enhance the overall realism and immersion of the simulated environment.
The players culminate their journey by engaging in a face-to-face encounter with the company director, initiating the job interview. This interview deviates from the traditional format, focusing instead on eliciting specific emotional responses from the player. A series of questions designed to evoke the targeted emotions are posed, and the player’s success is not contingent upon providing “correct” answers, but rather on their ability to convincingly express the desired emotional states. This unique approach prioritizes the player’s expression over content.

3.3.2. Development

Job Interview Simulator adopted a realistic approach, featuring a single level with photorealistic textures and characters (Figure 2). The first-person perspective and vocal guidance from a Human Resources (HR) NPC during the tutorial enhanced the immersive experience. As with Silent Kingdom, sound design employed meticulously crafted recordings to reflect the environment and script, resulting in a realistic office soundscape. Additionally, 3D scanned models were used for NPCs to heighten realism.

3.4. User Connection Process with AESDD Database

The flow chart (Figure 3) shows that, when the user’s voice is recorded, the recording is saved to an external folder as a .wav file and archived in a form that can be sent to the database (AESDD) for the purpose of collecting the audio information and performing speech emotion recognition.
The game code in Unreal Engine 5 stores the recording in an external folder with proper archiving (e.g., Record-1). The code that connects the user to the database (AESDD) runs externally to Unreal Engine 5 because it relies on a “watch” function. More specifically, this function detects when a new file is added, connects to the database to perform emotional analysis of the newly recorded file, and “prints” the results to a .txt file, which Unreal Engine 5 subsequently reads.
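A minimal sketch of this external “watch” process is shown below, assuming Python and the watchdog package: it monitors the folder where Unreal Engine 5 saves recordings, sends each new .wav file for emotional analysis, and writes the result to a .txt file that the game subsequently reads. The folder path, the ser_client helper, and the file naming are assumptions for illustration, not the project’s actual code.

```python
# Hedged sketch of the external "watch" process: monitor a recordings folder,
# classify each new .wav, and write the result to a .txt file for UE5 to read.
import time
from pathlib import Path

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

from ser_client import classify_utterance  # hypothetical module wrapping the earlier web-service sketch

RECORDINGS_DIR = Path("Recordings")  # hypothetical folder used by the game (e.g., Record-1.wav)

class NewRecordingHandler(FileSystemEventHandler):
    def on_created(self, event):
        # React only to newly created .wav files, not folders or other file types.
        if event.is_directory or not event.src_path.endswith(".wav"):
            return
        wav_path = Path(event.src_path)
        result = classify_utterance(str(wav_path))   # query the SER service for this clip
        top_emotion = max(result, key=result.get)    # e.g., "happiness"
        # "Print" the result next to the recording (Record-1.txt) for Unreal Engine 5 to read.
        wav_path.with_suffix(".txt").write_text(top_emotion)

if __name__ == "__main__":
    observer = Observer()
    observer.schedule(NewRecordingHandler(), str(RECORDINGS_DIR), recursive=False)
    observer.start()
    try:
        while True:          # keep watching until the process is stopped
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```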

3.5. Evaluation

Data were collected from a sample of 11 participants through personal interviews which encompassed key areas of interest including participant demographics (age, gender, educational background), prior video game experience, gameplay experience across both games, and overall user perceptions. Demographic data provided valuable insights into the participant pool, while gameplay metrics and user feedback offered a comprehensive understanding of user engagement and satisfaction levels.

4. Results

This section presents the data from the two serious video games referred to above: Silent Kingdom and Job Interview Simulator. Table 1 summarizes the questions included in the personal interviews of a sample of 11 participants who played either one or both of the games.
First, we will see the graphs of the users’ backgrounds referring to their age and educational level.
Figure 4 reveals that the majority of participants (54.5%) fell within the 26–30 age range. Participation rates were lower for other age groups, with 9.1% each in the 18–25, 40–60, and above 60 brackets, and 27.3% between 31 and 39 years old.
Similarly, it indicates that 54.5% of participants held a bachelor’s degree. Short-cycle education followed with 18.2%, while post-secondary, upper secondary, and doctorate degrees were all represented by 9.1% each.
Figure 5 shows the users’ experience with Silent Kingdom, its perceived graphics quality, level difficulty, difficulty of controls, and their overall enjoyment of the experience.
In more detail, Figure 5 delves into user perceptions of graphics quality. Here, 54.5% of participants rated the graphics as low quality. Conversely, 27.3% perceived them as high quality, and 18.2% considered them to be of medium quality.
Additionally, Figure 5 highlights user perceptions of level difficulty. A total of 36.4% of participants found Silent Kingdom to have either a low or medium difficulty level. On the other hand, 18.2% considered it to have almost no difficulty, while 9.1% perceived the game as having a high level of difficulty.
As to perceived control difficulty, 36.4% of participants found the controls to have either low or medium difficulty, and 18.2% perceived them as having almost no difficulty. However, not everyone found the controls intuitive, with 9.1% reporting high difficulty.
Lastly, Figure 5 sets out the level of enjoyment experienced by the participants in Silent Kingdom. While the response was positive overall, only 18.2% expressed that they liked the experience very much, 27.3% of participants said they derived little enjoyment from the game, and a further 36.4% said their experience was moderate.
Next, we observe participants’ playing time (Figure 6), revealing that 45.5% of participants played for 16–20 min. This is followed by 36.4% who played for 31–60 min, and 18.2% who played for 21–30 min.
Shifting focus to the user gameplay experience with Silent Kingdom, the next figure (Figure 7) explores the participants’ prior experience in videogames, completion rates, bug encounters, perceptions of repetitiveness, the potential influence of survey participation on enjoyment, and user recommendations.
Figure 7 reveals that 81.8% of participants had prior video game experience, while the remaining 18.2% were new to video games or do not play often. A resounding 90.9% of participants successfully completed the game, while only 9.1% did not finish it. Examining player progression, Figure 7 reveals that 63.6% of participants encountered moments where they got stuck at least once during gameplay. Conversely, 36.4% were able to progress through the game without getting stuck. As to repetitiveness, 90.9% of participants did not find this a problem with Silent Kingdom. Conversely, 9.1% found it to be repetitive. All participants reported that participating in the scientific survey enhanced their enjoyment of the Silent Kingdom experience. Furthermore, all participants indicated they would be willing to recommend Silent Kingdom to others.
Moving on to the Job Interview Simulator data, the following figures are based on a sample of 11 participants. The below analysis of this game follows the approach used above for Silent Kingdom.
From Figure 8, we can glean the age distribution of the participants. The majority (54.6%) fell within the 26–30 age range. The remaining participants were spread across the 31–39 (27.3%), 40–60 (9.1%), and above 60 (9.1%) age brackets. Additionally, Figure 8 shows the educational background of the participants. The largest group (45.5%) held a bachelor’s degree. The remaining participants possessed a variety of qualifications: 18.2% had short-cycle education, followed by 9.1% each with upper secondary, post-secondary, master’s, and doctorate degrees.
Furthermore, we set out in Figure 9 the users’ experience with the Job Interview Simulator, its perceived graphics’ quality, level difficulty, difficulty of controls, and overall user enjoyment of the experience.
Specifically, Figure 9 reveals participants’ perception of the game’s graphics quality. A substantial majority (81.8%) rated the Job Interview Simulator’s graphics as very high quality. The remaining participants (18.2%) perceived them to be of high quality. As to the perceived level of difficulty within Job Interview Simulator, Figure 9 shows a range of experiences, with nearly half the participants (45.5%) finding the simulation to have almost no difficulty. In relation to the perceived difficulty of controls within the simulation, we see a positive user experience with the controls. A substantial majority (63.6%) found them to be almost effortless to use. An additional 18.2% perceived the controls to have low or medium difficulty, indicating an overall intuitive control scheme. Finally, Figure 9 illustrates the level of enjoyment participants experienced within the simulation. While responses varied, only 20% expressed that they liked the experience very much, a further 20% indicated they derived little enjoyment from the simulation (‘liked it a little’), and 40% of participants reported a moderate level of enjoyment (‘liked it moderately’).
Next, Figure 10 shows participants’ playtime on the Job Interview Simulator illustrating a variation among participants with the largest group (45.5%) playing for 16–20 min. This is followed by 36.4% who played for 21–30 min. The remaining participants were divided between shorter durations (9.1% for 5–10 min) and longer durations (a further 9.1% for 31–60 min).
Figure 11 reveals a diverse participant pool for the Job Interview Simulator, with 81.8% possessing prior video game experience. A significant majority (90.9%) successfully completed the simulation, while only 9.1% did not finish. While 36.4% encountered moments of difficulty navigating the simulation, 63.6% navigated the experience seamlessly. Repetition did not appear to be a concern, with 90.9% finding the Job Interview Simulator engaging. Notably, 100% of participants reported that the scientific survey enhanced their enjoyment of the simulation. Regarding control difficulty, 63.6% found the controls effortless, with an additional 18.2% perceiving them as low to medium difficulty. Finally, the user experience was positive, with a range of sentiments, including 30% who “liked it a little”, 20% who “liked it moderately”, and 20% who “strongly enjoyed it”. This positive sentiment translated into strong recommendation rates, with a substantial majority (90.9%) indicating they would recommend the Job Interview Simulator to others.

5. Conclusions and Future Work

Creating annotated speech datasets for effectively recognizing emotions in speech faces two significant roadblocks.
First, deep learning models thrive on vast amounts of data to learn complex patterns and nuances, but the currently available speech emotion recognition datasets often suffer from limited data volume. This shortage poses a major challenge, especially for training deep learning models, as they require more data to achieve optimal performance. With less data, the models struggle to generalize effectively and become easily susceptible to overfitting, where they learn specific details of the training data instead of capturing broader emotional patterns. This leads to poor performance on unseen examples, hindering their real-world applicability.
Secondly, the issue is further complicated by the unbalanced distribution of emotions within these datasets. Often, specific emotions, like happiness or anger, may be overrepresented, while others, like sadness or frustration, are comparatively scarce. This imbalance creates a biased learning environment for the models, leading them to favor the dominant emotions and potentially neglect the underrepresented ones. The result is unbalanced classification, where the model performs well on the common emotions but struggles with the less frequent ones, failing to capture the full spectrum of human emotional expression [9].
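One common mitigation for such imbalance, sketched below with invented class counts, is to weight each emotion class inversely to its frequency so that under-represented emotions contribute more to the training loss.

```python
# Small sketch of inverse-frequency class weighting for an imbalanced emotion dataset.
# The label counts below are invented purely for illustration.
from collections import Counter

labels = (["happiness"] * 500 + ["anger"] * 450 +
          ["sadness"] * 120 + ["fear"] * 100 + ["disgust"] * 80)

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# weight_c = N / (K * count_c): rare classes receive larger weights in the loss.
class_weights = {c: n_samples / (n_classes * k) for c, k in counts.items()}
print(class_weights)  # e.g., {'happiness': 0.5, 'anger': 0.56, ..., 'disgust': 3.125}
```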
While developing video games like those presented is a complex and multifaceted undertaking requiring meticulous planning, diverse expertise, and organization due to their interdisciplinary nature, modern technology and software offer valuable tools for scientific research. These games, as evidenced by our applications, present a comprehensive and accessible approach for collecting large speech-annotated datasets usable by a wide range of ages and abilities. Although creating serious games for such purposes may require a substantial investment of time and effort, the long-term benefits of automated data collection and annotation outweigh the upfront investment. Furthermore, the results demonstrate that SER technology in real-time can significantly enhance the interactivity and enjoyment of serious games and video games alike.
Silent Kingdom employed a gamified approach, utilizing an arcade-style perspective, to facilitate the collection of speech data. Predetermined answer prompts streamlined data collection, resulting in a more controlled and potentially more reliable dataset. This format also simplified gameplay, making the experience more welcoming for younger participants. Furthermore, the adoption of a simpler, “old-school” graphical style enhanced accessibility for older hardware. This design choice was particularly important considering the expanding target audience for speech data collection. While we do not have explicit counts of how often each emotion appeared, the high completion rate (90.9%) and the fact that the game is designed to elicit the full spectrum of emotions in the AESDD repository suggest that a diverse range of emotions was probably expressed [2]. Participants generally enjoyed the game (100% were willing to recommend it), which indicates a positive experience and potentially greater emotional engagement. However, a significant expansion would involve adding more levels, missions, dangers, etc. Additionally, enriching the story of the game’s fantasy world would expand the game and offer a more attractive outcome, leading to a greater collection of annotated speech data.
Job Interview Simulator adopted a simulation-based approach, aiming to replicate real-life interview experiences; unlike Silent Kingdom, it does not provide predetermined “correct” answers. However, a hidden scoring system evaluates player performance, allowing for personalized feedback at the conclusion. The use of photorealistic graphics and binaural audio created an immersive environment, and the absence of stat-heavy gameplay and interface elements enhanced accessibility for older users. This approach, while visually appealing due to its photorealistic graphics, could pose performance challenges for older hardware, potentially restricting the target audience. The open-ended nature of the simulation likely resulted in a wider range of emotions being expressed. However, the relative frequency and intensity of different emotions within the dataset are unknown without further analysis of the SER results, and the dataset may exhibit an imbalance in the representation of certain emotions, which would make it less effective at collecting data for particular emotions. Additionally, a significant expansion would entail adding specialized topics with the option to choose a profession. Furthermore, adding more questions and characters would assist in a greater collection of annotated speech data.
Both video games utilize real-time SER models, enabling direct user feedback and enhancing the user’s experience within the game. This integration of emotion recognition technology allows for continuous user engagement and personalized interaction, ultimately enriching users’ emotional understanding and the overall gaming experience. However, limitations emerged. First, regarding player recording, there is a delay of up to 16 s to ensure that the recording is sent and the result (the response) is received from the SER model (cloud computing). This process is subject to current internet speeds and could potentially be reduced to 10 s; for accessibility reasons, the delay was set at 16 s. One significant improvement would be to process the recording locally, which would reduce the delay to a minimum. Additionally, predetermined answers restricted player exploration within the AESDD, potentially hindering deeper engagement. Moreover, the 74% accuracy rate of the AESDD presented challenges [39]. Users might encounter difficulty progressing due to misinterpretations of their responses, requiring them to replicate the desired emotional state multiple times (RQ2).
This study aimed to integrate real-time SER into video games using Unreal Engine 5. More specifically, this integration opens up new possibilities for creating immersive and captivating games with the ability to collect voice tone data. This process has potential applications not only in video games but also in various other fields where SER is relevant (RH).
Both “Silent Kingdom” and “Job Interview Simulator” demonstrated high levels of player engagement, as evidenced by the high completion rates (90.9% for both games) and by the fact that the majority of participants expressed their willingness to recommend the games to others (100% for Silent Kingdom, 90.9% for Job Interview Simulator). This suggests that integrating real-time SER technology within engaging serious games can create a positive and enjoyable user experience, potentially leading to increased participation and data collection (RQ1). It is important to acknowledge that this study did not directly compare player enjoyment between serious games with integrated SER and traditional data collection methods. Therefore, any conclusions about the extent of enhancement in enjoyment are based on indirect evidence and inferences. Future research could directly address this question by conducting a controlled experiment comparing player enjoyment between the two approaches.
Future iterations of Silent Kingdom and Job Interview Simulator would benefit from a combined approach that leverages their strengths. Both games can utilize scalable graphics to broaden hardware compatibility. Alternative user interactions, like open-ended responses in Job Interview Simulator and free-form speech recordings in Silent Kingdom, can enrich data collection and user agency. Enhancing engagement through variable scenarios, branching storylines, and unlockable content can encourage replayability. By combining these elements, future iterations can offer efficient data collection, engaging experiences, and ultimately, enhanced effectiveness of SER datasets.
Finally, adding the ability to annotate more accurately the emotions demonstrated in the collected speech samples would be a significant expansion. Additionally, further developing the AESDD database would constitute a significant improvement, enabling a larger number of valid results [7,18].

Author Contributions

Methodology, E.S., N.V. and C.D.; software, L.M. and N.V.; validation, C.D.; data curation, L.M.; writing—original draft, L.M.; writing—review & editing, E.S., N.V. and C.D.; supervision, C.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hudlická, E. To feel or not to feel: The role of affect in human–computer interaction. Int. J. Hum.-Comput. Stud. 2003, 59, 1–32. [Google Scholar] [CrossRef]
  2. Vryzas, N.; Kotsakis, R.; Liatsou, A.; Dimoulas, C.; Kalliris, G. Speech Emotion Recognition for Performance Interaction. J. Audio Eng. Soc. 2018, 66, 457–467. [Google Scholar] [CrossRef]
  3. Vryzas, N.; Tsipas, N.; Dimoulas, C. Web radio automation for audio stream management in the era of big data. Information 2020, 11, 205. [Google Scholar] [CrossRef]
  4. Kotsakis, R.; Matsiola, M.; Kalliris, G.; Dimoulas, C. Investigation of Spoken-Language detection and classification in broadcasted audio content. Information 2020, 11, 211. [Google Scholar] [CrossRef]
  5. Vryzas, N.; Katsaounidou, A.; Vrysis, L.; Kotsakis, R.; Dimoulas, C. A prototype web application to support Human-Centered audiovisual content authentication and crowdsourcing. Future Internet 2022, 14, 75. [Google Scholar] [CrossRef]
  6. Vryzas, N.; Sidiropoulos, E.; Vrisis, L.; Avraam, E.; Dimoulas, C. Machine-Assisted Reporting in the Era of Mobile Journalism: The MoJo-Mate Platform. Strategy Dev. Rev. 2019, 9, 22–43. [Google Scholar] [CrossRef]
  7. Hamari, J.; Koivisto, J. Why do people use gamification services? Int. J. Inf. Manag. 2015, 35, 419–431. [Google Scholar] [CrossRef]
  8. Triantoro, T.; Gopal, R.; Benbunan-Fich, R.; Lang, G. Would you like to play? A comparison of a gamified survey with a traditional online survey method. Int. J. Inf. Manag. 2019, 49, 242–252. [Google Scholar] [CrossRef]
  9. Hall, A.; Chavarria, E.A.; Maneeratana, V.; Chaney, B.H.; Bernhardt, J.M. Health Benefits of Digital videogames for Older Adults: A Systematic Review of the literature. Games Health J. 2012, 1, 402–410. [Google Scholar] [CrossRef]
  10. Alshammari, S.H.; Ali, M.B.; Rosli, M.S. The effectiveness of video games in enhancing students’ learning. Res. J. Appl. Sci. 2015, 10, 311–316. [Google Scholar]
  11. Leonardou, A.; Rigou, M.; Garofalakis, J. Techniques to Motivate Learner Improvement in Game-Based Assessment. Information 2020, 11, 176. [Google Scholar] [CrossRef]
  12. Yu, D.; Sun, S. A systematic exploration of deep neural networks for EDA-Based emotion recognition. Information 2020, 11, 212. [Google Scholar] [CrossRef]
  13. Crepaldi, M.; Bocci, F.; Sarini, M.; Greco, A. An Integrated Approach of Video Game Therapy®: A case study. Information 2025, 16, 68. [Google Scholar] [CrossRef]
  14. Peinemann, F.; Tendal, B.; Bölte, S. Digital serious games for emotional recognition in people with autism spectrum disorder. Cochrane Libr. 2021, 2021, CD014673. [Google Scholar] [CrossRef]
  15. Whyte, E.M.; Smyth, J.M.; Scherf, K.S. Designing Serious Game Interventions for Individuals with Autism. J. Autism Dev. Disord. 2014, 45, 3820–3831. [Google Scholar] [CrossRef] [PubMed]
  16. Hagerer, G.; Eyben, F.; Schuller, D.M.; Scherer, K.R.; Schuller, B.W. VoicePlay—An affective sports game operated by speech emotion recognition based on the component process model. In Proceedings of the 2017 Seventh International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), San Antonio, TX, USA, 23–26 October 2017. [Google Scholar] [CrossRef]
  17. Siamtanidou, E.; Vryzas, N.; Vrysis, L.; Dimoulas, C. A Serious Game for Crowdsourcing and Self-Evaluating Speech Emotion Annotated Data. In Proceedings of the Audio Engineering Society Convention 154, Helsinki, Finland, 13–15 May 2023. [Google Scholar]
  18. Elhaddadi, M.; Maazouz, H.; Alami, N.; Drissi, M.M.; Mènon, C.S.; Latifi, M.; Touhami Ahami, A.O. Serious Games to Teach Emotion Recognition to Children with Autism Spectrum Disorders (ASD). Acta Neuropsychol. 2021, 19, 81–92. [Google Scholar] [CrossRef]
  19. Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 2019, 47, 312–323. [Google Scholar] [CrossRef]
  20. Vryzas, N.; Vrysis, L.; Kotsakis, R.; Dimoulas, C. A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition. Mach. Learn. Appl. 2021, 6, 100132. [Google Scholar] [CrossRef]
  21. Zheng, W.Q.; Yu, J.S.; Zou, Y.X. An experimental study of speech emotion recognition based on deep convolutional neural networks. In Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), Xi’an, China, 21–24 September 2015; pp. 827–831. [Google Scholar] [CrossRef]
  22. Lee, J.; Tashev, I. High-level feature representation using recurrent neural network for speech emotion recognition. In Proceedings of the Interspeech 2015, Dresden, Germany, 6–10 September 2015. [Google Scholar]
  23. Tzirakis, P.; Trigeorgis, G.; Nicolaou, M.A.; Schuller, B.W.; Zafeiriou, S. End-to-end multimodal emotion recognition using deep neural networks. IEEE J. Sel. Top. Signal Process. 2017, 11, 1301–1309. [Google Scholar] [CrossRef]
  24. Lim, W.; Jang, D.; Lee, T. Speech emotion recognition using convolutional and recurrent neural networks. In Proceedings of the 2016 Asia-Pacific Signal and Information Processing Association annual Summit and Conference (APSIPA), Jeju, Republic of Korea, 13–16 December 2016; pp. 1–4. [Google Scholar]
  25. Madanian, S.; Chen, T.; Adeleye, O.; Templeton, J.Y.; Poellabauer, C.; Parry, D.; Schneider, S. Speech emotion recognition using machine learning—A systematic review. Intell. Syst. Appl. 2023, 20, 200266. [Google Scholar] [CrossRef]
  26. Pao, T.; Chen, Y.; Yeh, J.; Liao, W. Combining acoustic features for improved emotion recognition in Mandarin speech. In Affective Computing and Intelligent Interaction; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; pp. 279–285. [Google Scholar] [CrossRef]
  27. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiß, B. A database of German emotional speech. In Proceedings of the INTERSPEECH 2005—Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, 4–8 September 2005. [Google Scholar] [CrossRef]
  28. Ayadi, M.E.; Kamel, M.S.; Karray, F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognit. 2011, 44, 572–587. [Google Scholar] [CrossRef]
  29. D’mello, S.; Graesser, A. AutoTutor and affective autotutor. ACM Trans. Interact. Intell. Syst. 2012, 2, 1–39. [Google Scholar] [CrossRef]
  30. El-Haddad, C.; Laouris, Y. The Ability of Children with Mild Learning Disabilities to Encode Emotions through Facial Expressions. In Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces. Theoretical and Practical Issues; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; pp. 386–402. [Google Scholar] [CrossRef]
  31. Vryzas, N.; Vrysis, L.; Kotsakis, R.; Dimoulas, C. Speech Emotion Recognition Adapted to Multimodal Semantic Repositories. In Proceedings of the 2018 13th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP), Zaragoza, Spain, 6–7 September 2018. [Google Scholar] [CrossRef]
  32. Brabham, D.C. Crowdsourcing in the Public Sector; Georgetown University Press: Washington, DC, USA, 2015. [Google Scholar] [CrossRef]
  33. Nevo, D.; Kotlarsky, J. Crowdsourcing as a strategic IS sourcing phenomenon: Critical review and insights for future research. J. Strateg. Inf. Syst. 2020, 29, 101593. [Google Scholar] [CrossRef]
  34. Estellés-Arolas, E.; González-Ladrón-De-Guevara, F. Towards an integrated crowdsourcing definition. J. Inf. Sci. 2012, 38, 189–200. [Google Scholar] [CrossRef]
  35. Vaughan, J.W. Making Better Use of the Crowd: How Crowdsourcing Can Advance Machine Learning Research. J. Mach. Learn. Res. 2018, 18, 1–46. Available online: https://www.jmlr.org/papers/v18/17-234.html (accessed on 17 October 2024).
  36. Chilton, L.B.; Little, G.; Edge, D.; Weld, D.S.; Landay, J.A. Cascade: Crowdsourcing taxonomy creation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Paris, France, 27 April–2 May 2013; pp. 1999–2008. [Google Scholar]
  37. Shamir, L.; Yerby, C.; Simpson, R.; Von Benda-Beckmann, A.M.; Tyack, P.; Samarra, F.; Miller, P.; Wallin, J. Classification of large acoustic datasets using machine learning and crowdsourcing: Application to whale calls. J. Acoust. Soc. Am. 2014, 135, 953–962. [Google Scholar] [CrossRef]
  38. Lipping, S.; Drossos, K.; Virtanen, T. Crowdsourcing a dataset of audio captions. arXiv 2019, arXiv:1907.09238. [Google Scholar]
  39. Vryzas, N.; Matsiola, M.; Kotsakis, R.; Dimoulas, C.; Kalliris, G. Subjective Evaluation of a Speech Emotion Recognition Interaction Framework. In Proceedings of the Audio Mostly 2018 on Sound in Immersion and Emotion, Wrexham, UK, 12–14 September 2018. [Google Scholar] [CrossRef]
Figure 1. Second level inside Unreal Engine 5 Editor “Silent Kingdom”.
Figure 2. “Job Interview Simulator” level inside Unreal Engine 5 editor.
Figure 3. Flow chart of the recording of the player’s voice and checking their response in real-time.
Figure 4. User background (Silent Kingdom).
Figure 5. User experience (Silent Kingdom).
Figure 6. Playtime in minutes (Silent Kingdom).
Figure 7. User gameplay experience (Silent Kingdom).
Figure 8. User background (Job Interview Simulator).
Figure 9. User experience (Job Interview Simulator).
Figure 10. Playtime in minutes (Job Interview Simulator).
Figure 11. User gameplay experience (Job Interview Simulator).
Table 1. The analysis and usability evaluation questionnaire.

Question (Indicative Answers—Range)
A. User Age (18–60+)
B. User educational level (1–7)
C. Perceived graphics quality (0–4)
D. Perceived level of difficulty (0–4)
E. Perceived difficulty of controls (0–4)
F. User enjoyment of overall experience (0–4)
G. Playtime in minutes (16–60)
H. Prior experience with video games (Yes/No)
I. Game completion rates (Yes/No)
J. User perception of repetitiveness (Yes/No)
K. Impact of survey participation on user experience (Yes/No)
L. User encountered bugs (Yes/No)
M. User recommendation (Yes/No)
