1. Introduction
The rapid development of artificial intelligence has forged a close interplay between humans and computers, studied under the field of human–computer interaction (HCI). An integral part of human communication is the recognition of emotions. Speech emotion recognition (SER) systems have been developed to process human speech and to recognize and categorize the emotions expressed. Applications of SER span media communication and education, including interactive and performing arts, training journalists and presenters in emotional expression and the paralinguistic features of speech, detecting aggressive language and hate speech, and identifying and regulating audio and audiovisual content based on emotional cues. Given this broad range of uses, it is essential to prioritize efforts to refine and enhance the technology [1,2].
For speech emotion recognition (SER) systems to work well, it is essential to collect speech utterances with diverse emotional content. These data are then used to train models to recognize human emotions. Developing well-crafted, complete datasets for training SER systems is a highly detail-oriented process that involves repeating the same steps many times; while this rigor is crucial for accuracy, the work quickly becomes monotonous. Assembling sizable speech datasets with emotional annotations (where emotional states are labeled before the SER system learns from them) can be resource-intensive and financially burdensome. The process also demands significant time and effort from human annotators, who must meticulously label and categorize the emotions within the dataset. Consequently, the availability of high-quality datasets is limited, impeding the effective training of speech emotion recognition systems [1].
Audio plays a significant role in information processing within multimedia applications. It offers several key advantages: it enhances ambient continuity, reduces computational load (for example, compared to vision systems), and facilitates content indexing and dataset annotation. By integrating audio recording and processing capabilities within an application, we can leverage audio for several key functions. First, audio provides essential context and enhances the immersive experience, offering ambient sounds that contribute to the overall narrative and ensuring a more engaging and continuous stream of information for the audience. Second, audio processing can significantly reduce the computational load associated with joint audiovisual content indexing and annotation, particularly when compared to complex vision-based systems [3,4,5,6].
Gamification offers a unique approach to mitigating the lack of motivation often associated with repetitive tasks, such as collecting speech data, by transforming the process into an engaging and interactive experience. One advantage of video games is that, as participants become more deeply involved in the game and its narrative, they tend to interact and express themselves more freely, resulting in more authentic emotional responses and a richer, more diverse dataset. More broadly, video games can enhance individuals’ lives and contribute to wider social and cultural influences. They provide avenues for identity play, social interaction, and personal growth. Research suggests that playing video games can lead to positive outcomes such as improved cognitive abilities, academic success, and decision-making skills. Moreover, video games offer opportunities for self-expression through narrative structures and design strategies, and they promote engagement, immersion, challenge, mastery, and flow, contributing to feelings of enjoyment and accomplishment [7,8,9,10,11,12,13].
Prior to our research, several serious games had been created incorporating speech emotion recognition (SER) or speech recording features to collect and, in some cases, annotate speech data [14,15,16,17,18]. These games leverage the interactive and immersive nature of gaming to gather emotional and vocal expressions from players, enhancing the quality of the data. By integrating such systems, these games not only engage players in meaningful ways but also contribute to the development of more accurate and diverse datasets for training emotion recognition models. This integration offers an innovative approach to data collection, blending entertainment with research in a seamless and efficient manner.
This paper explores the creation of serious video games in Unreal Engine 5, with a specific focus on crowdsourcing emotional speech data through integrated real-time SER technology. It also investigates the potential of serious games to address the limitations of real-time SER systems, which are often hampered by a lack of diverse and comprehensive data for training and development [19]. A two-pronged approach is proposed through our creation of two new games, “Silent Kingdom” and “Job Interview Simulator”, each catering to different player preferences and gameplay styles. “Silent Kingdom” adopts an arcade-style format, incorporating elements like health bars, levels, checkpoints, and platform puzzles within a 2D environment. Players navigate their character, encountering enemies and solving puzzles that necessitate vocal expression. This engaging design encourages players to exhibit a wider range of emotions, such as excitement, frustration, and determination, while simultaneously facilitating the collection of valuable speech data for SER system development.
“Job Interview Simulator” caters to players who enjoy role-playing experiences. Players are presented with realistic job interview scenarios, allowing them to make decisions and explore various conversational paths. This approach fosters natural emotional responses within a contextually relevant setting, mirroring real-world interactions and capturing the nuances of human emotion during job interviews.
By offering these distinct gameplay experiences, the research aims to attract a broader player base, encompassing individuals with varying preferences for game mechanics and narratives. This wider range of participants translates to a more diverse and comprehensive speech data collection, encompassing a richer emotional spectrum and a larger variety of speakers. This enriched dataset holds significant potential to enhance the generalizability and robustness of SER systems, enabling them to perform more accurately and reliably across different contexts and speaker demographics.
The gamified approach presents several advantages over traditional data collection methods [8]. The engaging and interactive nature of games motivates players to participate for extended periods, potentially leading to larger datasets compared to passive techniques. Additionally, the inherent variability within gameplay scenarios encourages players to express a wider range of emotions, enriching the data with diverse emotional expressions crucial for training robust SER systems.
Based on the above analysis, the research hypothesis (RH) of the conducted work is stated as follows:
RH: What are the potential outcomes and implications of integrating real-time speech emotion recognition into serious games?
The following research questions (RQs) stem from this proposed hypothesis and aim to examine the effectiveness of serious games in collecting speech emotional data:
RQ1: To what extent does integrating real-time SER technology within engaging serious games enhance player enjoyment?
RQ2: Does the integration of real-time SER technology within serious games introduce any technical limitations or biases that could affect the quality of the collected emotional data?
Throughout development, we addressed the research questions through continuous communication, iterative reviews, and ongoing improvements.
This paper is organized into five sections, including this introduction. The second section provides a comprehensive review of the existing literature, exploring the challenges encountered in obtaining annotated speech data and the various methods of emotion annotation. This review lays the groundwork for the third section, which explicitly outlines the research goals and the project’s motivation. The fourth section details the materials used and the methodologies employed in this research. Finally, the fifth section presents the findings of this research alongside a thorough discussion of their importance.
4. Results
This section presents the data from the two serious video games referred to above: Silent Kingdom and Job Interview Simulator.
Table 1 summarizes the questions included in the personal interviews of a sample of 11 participants who played either one or both of the games.
First, we present the participants’ background data, namely their age and educational level.
Figure 4 reveals that the majority of participants (54.5%) fell within the 26–30 age range. Participation rates were lower for other age groups, with 9.1% each in the 18–25, 40–60, and above 60 brackets, and 27.3% between 31 and 39 years old.
Figure 4 also indicates that 54.5% of participants held a bachelor’s degree. Short-cycle education followed with 18.2%, while post-secondary, upper secondary, and doctorate degrees were each represented by 9.1%.
Figure 5 shows the users’ experience with Silent Kingdom: its perceived graphics quality, level difficulty, difficulty of controls, and their overall enjoyment of the experience.
In more detail, Figure 5 delves into user perceptions of graphics quality. Here, 54.5% of participants rated the graphics as low quality. Conversely, 27.3% perceived them as high quality, and 18.2% considered them to be of medium quality.
Additionally, Figure 5 highlights user perceptions of level difficulty. A total of 36.4% of participants found Silent Kingdom to have either a low or medium difficulty level. On the other hand, 18.2% considered it to have almost no difficulty, while 9.1% perceived the game as having a high level of difficulty.
As to perceived control difficulty, 36.4% of participants found the controls to have either low or medium difficulty, and 18.2% perceived them as having almost no difficulty. However, not everyone found the controls intuitive, with 9.1% reporting high difficulty.
Lastly, Figure 5 sets out the level of enjoyment experienced by the participants in Silent Kingdom. While the response was positive overall, only 18.2% expressed that they liked the experience very much, 27.3% of participants said they derived little enjoyment from the game, and a further 36.4% said their experience was moderate.
Next, we observe participants’ playing time (Figure 6), revealing that 45.5% of participants played for 16–20 min. This is followed by 36.4% who played for 31–60 min, and 18.2% who played for 21–30 min.
Shifting focus to the user gameplay experience with Silent Kingdom, the next figure (Figure 7) explores the participants’ prior experience with video games, completion rates, bug encounters, perceptions of repetitiveness, the potential influence of survey participation on enjoyment, and user recommendations.
Figure 7 reveals that 81.8% of participants had prior video game experience, while the remaining 18.2% were new to video games or did not play often. A resounding 90.9% of participants successfully completed the game, while only 9.1% did not finish it. Examining player progression, Figure 7 reveals that 63.6% of participants got stuck at least once during gameplay, whereas 36.4% progressed through the game without getting stuck. As to repetitiveness, 90.9% of participants did not find this a problem with Silent Kingdom; conversely, 9.1% found it to be repetitive. All participants reported that participating in the scientific survey enhanced their enjoyment of the Silent Kingdom experience, and all indicated they would be willing to recommend Silent Kingdom to others.
Moving on to the Job Interview Simulator data, the following figures are based on a sample of 11 participants. The analysis below follows the approach used for Silent Kingdom.
From Figure 8, we can glean the age distribution of the participants. The majority (54.5%) fell within the 26–30 age range. The remaining participants were spread across the 31–39 (27.3%), 40–60 (9.1%), and above 60 (9.1%) age brackets. Additionally, Figure 8 shows the educational background of the participants. The majority (45.5%) held a bachelor’s degree. The remaining participants possessed a variety of qualifications: 18.2% had short-cycle education, followed by 9.1% each with upper secondary, post-secondary, master’s, and doctorate degrees.
Furthermore, Figure 9 sets out the users’ experience with the Job Interview Simulator: its perceived graphics quality, level difficulty, difficulty of controls, and overall user enjoyment of the experience.
Specifically, Figure 9 reveals participants’ perception of the game’s graphics quality. A substantial majority (81.8%) rated the Job Interview Simulator’s graphics as very high quality, and the remaining participants (18.2%) perceived them to be of high quality. As to the perceived level of difficulty within Job Interview Simulator, Figure 9 shows a range of experiences, with nearly half the participants (45.5%) finding the simulation to have almost no difficulty. In relation to the perceived difficulty of controls within the simulation, we see a positive user experience: a substantial majority (63.6%) found the controls almost effortless to use, and an additional 18.2% perceived them to have low or medium difficulty, indicating an overall intuitive control scheme. Finally, Figure 9 illustrates the level of enjoyment participants experienced within the simulation. While responses were varied, only 20% expressed that they liked the experience very much. A further 20% indicated they derived little enjoyment from the simulation, rating it as ‘liking it a little’, and 40% of participants reported a moderate level of enjoyment, stating they ‘liked it moderately’.
Next, Figure 10 shows participants’ playtime on the Job Interview Simulator, illustrating variation among participants, with the largest group (45.5%) playing for 16–20 min. This is followed by 36.4% who played for 21–30 min. The remaining participants were divided between shorter durations (9.1% for 5–10 min) and longer durations (a further 9.1% for 31–60 min).
Figure 11 reveals a diverse participant pool for the Job Interview Simulator, with 81.8% possessing prior video game experience. A significant majority (90.9%) successfully completed the simulation, while only 9.1% did not finish. While 36.4% encountered moments of difficulty navigating the simulation, 63.6% navigated the experience seamlessly. Repetition did not appear to be a concern, with 90.9% finding the Job Interview Simulator engaging. Notably, 100% of participants reported that the scientific survey enhanced their enjoyment of the simulation. Regarding control difficulty, 63.6% found the controls effortless, with an additional 18.2% perceiving them as low to medium difficulty. Finally, the user experience was positive, with a range of sentiments, including 30% who “liked it a little”, 20% who “liked it moderately”, and 20% who “strongly enjoyed it”. This positive sentiment translated into strong recommendation rates, with a substantial majority (90.9%) indicating they would recommend the Job Interview Simulator to others.
5. Conclusions and Future Work
Creating annotated speech datasets for effective emotion recognition in speech faces two significant roadblocks.
First, deep learning models thrive on vast amounts of data to learn complex patterns and nuances, but the currently available speech emotion recognition datasets often suffer from limited data volume. This shortage poses a major challenge, especially for training deep learning models, as they require more data to achieve optimal performance. With less data, the models struggle to generalize effectively and become easily susceptible to overfitting, where they learn specific details of the training data instead of capturing broader emotional patterns. This leads to poor performance on unseen examples, hindering their real-world applicability.
Second, the issue is further complicated by the unbalanced distribution of emotions within these datasets. Often, specific emotions, like happiness or anger, are overrepresented, while others, like sadness or frustration, are comparatively scarce. This imbalance creates a biased learning environment for the models, leading them to favor the dominant emotions and potentially neglect the underrepresented ones. The result is unfair classification, where the model performs well on the common emotions but struggles with the less frequent ones, failing to capture the full spectrum of human emotional expression [9].
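To make the imbalance concrete, the sketch below computes inverse-frequency class weights, a common mitigation that scales the training loss so that rare emotions count for more. The label counts are purely illustrative assumptions, not taken from any dataset discussed here:

```python
from collections import Counter

# Hypothetical label list showing the imbalance described above:
# "happiness" and "anger" overrepresented, others scarce.
labels = (["happiness"] * 50 + ["anger"] * 40 +
          ["sadness"] * 7 + ["frustration"] * 3)

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# Inverse-frequency weighting (the scheme behind e.g. scikit-learn's
# "balanced" option): rare emotions receive larger weights so the
# loss does not favor the dominant classes.
weights = {emo: n_samples / (n_classes * c) for emo, c in counts.items()}

for emo, w in sorted(weights.items()):
    print(f"{emo}: {w:.2f}")
```

With these counts, "frustration" (3 samples) receives a weight roughly 17 times larger than "happiness" (50 samples), so misclassifying a rare emotion costs the model proportionally more during training.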
While developing video games like those presented here is a complex and multifaceted undertaking, requiring meticulous planning, diverse expertise, and organization due to its interdisciplinary nature, modern technology and software offer valuable tools for scientific research. These games, as evidenced by our applications, present a comprehensive and accessible approach for collecting large annotated speech datasets, usable by a wide range of ages and abilities. Although creating serious games for such purposes may require a substantial investment of time and effort, the long-term benefits of automated data collection and annotation outweigh the upfront cost. Furthermore, the results demonstrate that real-time SER technology can significantly enhance the interactivity and enjoyment of serious games and video games alike.
Silent Kingdom employed a gamified approach, utilizing an arcade-style perspective, to facilitate the collection of speech data. Predetermined answer prompts streamlined data collection, resulting in a more controlled and potentially more reliable dataset. This format also simplified gameplay, making the experience more welcoming for younger participants. Furthermore, the adoption of a simpler, “old-school” graphical style enhanced accessibility for older hardware, a design choice that was particularly important considering the expanding target audience for speech data collection. While we do not have the explicit number of appearances of each emotion, the high completion rate (90.9%) and the fact that the game is designed to elicit the full spectrum of emotions in the AESDD repository suggest that a diverse range of emotions was probably expressed [2]. Participants generally enjoyed the game (100% were willing to recommend it), which indicates a positive experience and potentially greater emotional engagement. However, a significant expansion would involve adding more levels, missions, dangers, and so on. Additionally, enriching the story of the game’s fantasy world would expand the game and offer a more attractive outcome, leading to a greater collection of annotated speech data.
Job Interview Simulator, by contrast, adopted a simulation-based approach aiming to replicate real-life interview experiences; unlike Silent Kingdom, it does not provide predetermined “correct” answers. However, a hidden scoring system evaluates player performance, allowing for personalized feedback at the conclusion. The use of photorealistic graphics and binaural audio created an immersive environment, and the absence of stat-heavy gameplay and interface elements enhanced accessibility for older users. This approach, while visually appealing due to its photorealistic graphics, could pose performance challenges for older hardware, potentially restricting the target audience. The open-ended nature of the simulation likely resulted in a wider range of emotions being expressed. However, the relative frequency and intensity of different emotions within the dataset are unknown without further analysis of the SER results, and the dataset may exhibit an imbalance in the representation of certain emotions, which would make it less effective at collecting data for those emotions. Additionally, a significant expansion would entail adding specialized topics with the option to choose a profession, and adding more questions and characters would assist in a greater collection of annotated speech data.
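As a rough illustration of how such a hidden scoring system might work, the sketch below silently accumulates points over the player’s (question, answer) choices and maps the total to end-of-game feedback. All question names, answer options, point values, and thresholds are hypothetical; the game’s actual scoring rules are not disclosed here:

```python
# Hypothetical point table: points awarded per (question, chosen answer).
SCORES = {
    ("tell_me_about_yourself", "structured_summary"): 2,
    ("tell_me_about_yourself", "rambling_story"): 0,
    ("describe_a_weakness", "honest_with_mitigation"): 2,
    ("describe_a_weakness", "humblebrag"): 1,
}

def evaluate(choices):
    """Silently sum points over the player's (question, answer) pairs
    and map the total to feedback revealed only at the conclusion."""
    total = sum(SCORES.get(choice, 0) for choice in choices)
    feedback = ("Strong interview performance." if total >= 3
                else "Consider more structured, honest answers.")
    return total, feedback

total, feedback = evaluate([
    ("tell_me_about_yourself", "structured_summary"),
    ("describe_a_weakness", "humblebrag"),
])
print(total, feedback)  # 3 Strong interview performance.
```

Keeping the table and threshold out of the interface is what makes the scoring “hidden”: the player only ever sees the final feedback string, never the running total.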
Both video games utilize real-time SER models, enabling direct user feedback and enhancing the user’s experience within the game. This integration of emotion recognition technology allows for continuous user engagement and personalized interaction, ultimately enriching the user’s emotional understanding and the overall gaming experience. However, limitations emerged. First, regarding player recording, there is a delay of up to 16 s to ensure that the recording is uploaded and the response is received from the cloud-hosted SER model. This process depends on current internet speeds; the delay could potentially be reduced to 10 s, but for accessibility reasons it was set at 16 s. One significant improvement would be to process the recording locally, which would reduce the delay to a minimum. Additionally, predetermined answers restricted player exploration within the AESDD, potentially hindering deeper engagement. Moreover, the 74% accuracy rate of the SER model trained on the AESDD presented challenges [9]. Users might encounter difficulty progressing due to misinterpretations of their responses, requiring them to replicate the desired emotional state multiple times (RQ2).
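The client-side round trip described above can be sketched as a deadline-bounded call: the game submits the recording in a worker thread and falls back if no emotion label arrives within the 16 s cap. The `send_to_cloud_ser()` stub below is a hypothetical stand-in for the actual cloud endpoint, which is not described here:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

RESPONSE_DEADLINE_S = 16  # accessibility-driven cap on the SER round trip

def send_to_cloud_ser(recording: bytes) -> str:
    """Hypothetical stand-in for the cloud SER call: a real client would
    upload the recording and parse the predicted emotion from the reply."""
    time.sleep(0.01)  # simulate network transfer plus model inference
    return "happiness"

def recognize_with_deadline(recording: bytes) -> str:
    """Run the round trip off the game thread; if the deadline expires,
    return a fallback label so gameplay can continue on a neutral path."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(send_to_cloud_ser, recording)
        try:
            return future.result(timeout=RESPONSE_DEADLINE_S)
        except TimeoutError:
            return "timeout"

print(recognize_with_deadline(b"\x00" * 1024))  # happiness
```

Running the call in a worker thread keeps the game loop responsive during the wait; the local-processing improvement mentioned above would replace `send_to_cloud_ser()` with on-device inference and shrink the deadline accordingly.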
This study aimed to integrate real-time SER into video games using Unreal Engine 5. This integration opens up new possibilities for creating immersive and captivating games with the ability to collect voice-tone data, a process with potential applications not only in video games but also in various other fields where SER is relevant (RH).
Both “Silent Kingdom” and “Job Interview Simulator” demonstrated high levels of player engagement, as evidenced by the high completion rates (90.9% for both games) and the fact that the majority of participants expressed a willingness to recommend the games to others (100% for Silent Kingdom, 90.9% for Job Interview Simulator). This suggests that integrating real-time SER technology within engaging serious games can create a positive and enjoyable user experience, potentially leading to increased participation and data collection (RQ1). It is important to acknowledge that this study did not directly compare player enjoyment between serious games with integrated SER and traditional data collection methods; therefore, any conclusions about the extent of the enhancement in enjoyment rest on indirect evidence and inference. Future research could address this question directly through a controlled experiment comparing player enjoyment between the two approaches.
Future iterations of Silent Kingdom and Job Interview Simulator would benefit from a combined approach that leverages their strengths. Both games can utilize scalable graphics to broaden hardware compatibility. Alternative user interactions, like open-ended responses in Job Interview Simulator and free-form speech recordings in Silent Kingdom, can enrich data collection and user agency. Enhancing engagement through variable scenarios, branching storylines, and unlockable content can encourage replayability. By combining these elements, future iterations can offer efficient data collection, engaging experiences, and ultimately, enhanced effectiveness of SER datasets.
Finally, adding the ability to annotate more accurately the emotions expressed in collected speech samples would be a significant expansion. Additionally, further developing the AESDD database would constitute a significant improvement, enabling a larger number of valid results [7,18].