1. Introduction
Exercise games (or exer-games) combine gamification strategies and physical activity to motivate individuals to engage in physical exercise within a more playful context [1]. Recent technological advances in virtual environments and the Internet of Things (IoT) have enabled the development of new games that enhance the perceived value of the experience and improve the effectiveness of interventions during physical activity. These advantages have drawn the attention of researchers from fields beyond entertainment, particularly in health and well-being domains [2]. In these contexts, games not only impact the physical state of the player but also influence emotional well-being [3]. This additional effect necessitates the integration of technologies capable of assessing and understanding the emotions and affective responses of players during activities.
These technological ecosystems must align with players’ expectations and preferences. Various studies have identified key requirements for the design of exer-games from the players’ perspective. Findings indicate that the inclusion of music and social aspects is essential for enhancing both the gaming experience and its outcomes [4,5]. Music has been shown to produce positive effects on players, both physically and emotionally. These effects amplify the impact of the exercises on the players and, consequently, increase the long-term benefits of the proposed activities [6]. As a result, music has been consistently integrated into the development of games, from early commercial exer-games based on dancing to its use in setting the ambiance and tracking players’ progress and achievements. More recently, certain acoustic properties of music (primarily rhythm) have been employed to enhance physical activities and guide players toward their goals [7,8]. Despite this, the full potential of music in these games remains underexplored, particularly in terms of personalization, which considers players’ musical preferences and the emotions evoked by the music they listen to. Several studies suggest that these emotions are also influenced by the social model employed in the game design [4,6,9]. For instance, multi-player models tend to increase users’ motivation more effectively than equivalent single-player games and are generally perceived as more enjoyable.
Recent advances in wearable devices and artificial intelligence have improved methods and tools for recognizing users’ emotions. Music emotion recognition systems have leveraged these advances to explore the relationship between music and listeners’ emotions. These systems have subsequently been applied to develop music recommendation systems that incorporate the emotional dimension as a decision criterion [10]. Such systems have proven to be powerful tools for personalization, extending beyond the simple consideration of listeners’ music preferences. Given the importance of emotions in physical activity, these systems could play a significant role in exer-game design, particularly in the context of music-driven customization of game elements (such as ambiance, progress monitoring, and performance tracking, as well as serving as a motivational tool). Nevertheless, the integration of affective computing solutions in exer-games remains an open challenge. Current proposals primarily focus on monitoring players’ physiological parameters and using that data to adjust certain game conditions [11,12,13]. While these parameters can be correlated with the emotions players experience, only the model in [14] recognizes emotions based on physiological data and uses this information to configure the background music of the game accordingly.
This paper presents a technological solution aimed at integrating emotion-based music into modern exer-games. The game model involves multiple players competing against one another by performing various physical activities in a virtual world. Players participate in a sensorized environment that incorporates IoT devices to manage their actions, monitor the progress of activities, recognize players’ emotions during the game, and adjust the playing environment accordingly. The objective is for music and emotions to serve as primary instruments in programming various aspects of the game. Players’ emotions are recognized during gameplay and analyzed to introduce changes in their activities and generate stimuli that affect their physical and emotional responses, as well as those of their opponents. Additionally, these stimuli are personalized for each player to enhance the gaming experience and meet individual expectations and preferences. The solution combines IoT architectures and cloud computing technologies, facilitating integration into online games. To the best of our knowledge, no similar technological system exists in the field of gaming.
The paper focuses primarily on describing the emotion-based music system proposed for the aforementioned game model. The main contributions of this proposal are as follows:
It is a complex engineering result that combines wearables, smart devices, intelligent systems, IoT, service-oriented architectures, and cloud computing.
It recognizes players’ emotions during gameplay.
It generates musical stimuli capable of influencing players’ emotional behavior.
The music used in these stimuli is automatically selected using emotion-aware recommendation and personalization algorithms.
The solution is implemented using commercial devices and is integrated into the Spotify service platform.
The remainder of this paper is structured as follows. Section 2 reviews the use of music and affective computing solutions in the development of exer-games. Section 3 presents the six-layer architecture based on the IoT paradigm designed for the implementation of exer-games that incorporate music and emotions. In Section 4, the emotion-based music system is described, including the wearable devices used by players, the systems for emotion recognition, and the generation of music stimuli based on those emotions. Additionally, a framework of services specializing in music recommendation and personalization is introduced. These music services, detailed in Section 5, are implemented as Azure Functions and leverage core functionalities of the Spotify developer platform. Finally, Section 6 presents the main conclusions and outlines directions for future work.
3. An IoT-Based Architecture for Exer-Games
The main characteristics of our game model are as follows: it is multi-player, based on physical exercises, with each player individually performing exercises in a sensorized environment; players’ emotions are recognized and incorporated into the game’s control system; musical stimuli, based on those emotions, are generated to induce changes in the players’ physical and emotional responses during exercises, and these stimuli are personalized and can be directed toward one or more players (often configured to evoke different reactions in each participant). This model serves as the conceptual foundation for programming various exer-games. Due to the complexity involved in programming these games, a robust software architecture and a set of highly reusable tools are necessary for game development. In this section, we present the architecture designed to implement exer-games based on the model described above.
The proposed software architecture draws inspiration from architectural models commonly used in IoT systems. These models have evolved from solutions structured in three functional layers [36,37] to more modern architectures consisting of six layers [37,38]. The six-layer model has been recently adopted by major cloud service providers who offer an extensive catalog of services for programming IoT-based systems and by leading standardization initiatives in the IoT domain, such as the IoT Reference Architecture defined by ISO [39].
Figure 1 illustrates the six-layer architecture designed for the development of collaborative exer-games. Two types of actors are involved in these games: the players and the game manager. Players wear various devices (such as smart glasses, physiological monitors, or motion sensors) and must complete a series of physical tasks to reach the final goal of the game. Due to the collaborative nature of these games, the milestones achieved by one player may influence the progress of tasks for other players. The game manager oversees the overall progress of the game and can optionally introduce new challenges that modify the game’s control and the tasks players must complete.
The bottom layer of the architecture is composed of the Physical devices integrated into the controlled environment. In this work, particular focus is placed on virtual reality glasses, devices for monitoring players’ physiological signals, and equipment used to deliver music as a stimulus capable of inducing changes in the player during the game. The Device management layer maintains a registry of the specific devices available in the environment and is responsible for ensuring the security requirements related to those devices.
Devices generate various types of events related to players’ actions, their physical and emotional responses during activities, the progress of the game, and more. These events must be transmitted to the functional components in the upper architectural layers, for example, to be interpreted for updating game conditions. Similarly, devices can receive response events from those components to modify game activities or adjust environmental conditions. The Messaging layer is responsible for managing this flow of events between devices and components. In this architecture, this layer serves as a transversal element, facilitating the integration of different distributed device environments.
The functional core of the games is organized into three layers. The Ingestion layer filters and processes events received from devices to generate relevant knowledge for game control and player states. This knowledge is stored in data repositories, which are accessible to the other functional components in the solution. Some of these components provide general-purpose functionality, while others offer specific support for the development of exer-games; both are encapsulated as services. The Service layer integrates all these network-accessible components, which can be reused in the development of various games. Finally, the Application layer contains the applications that configure the games and manage the overall control of each game round. These actions are facilitated by integrating services from the lower layers and leveraging the knowledge stored in the repositories.
4. Integrating Music into Exer-Games
As discussed above, the focus is on integrating music and emotions into the programming of exer-games. A technological framework based on the architecture presented in Figure 1 is developed to support this integration. The functionality of this framework can be reused to program various games designed in accordance with the described game model.
In this section, the requirements for using music and emotions to enhance exer-games are first detailed. Then, the technological framework developed to meet these requirements is presented.
4.1. Problem Description and Requirements
When a participant is playing, the sensors in their environment acquire data that describe their actions and physiological responses to physical activity. These data are processed by the game application to modify the conditions of the current activity or to determine the next activities the player will complete. As part of these decisions, the application also selects the music to be played during activities to influence the player’s emotions. Typically, this music is intended to increase motivation and thereby improve the outcomes of the physical tasks the player is performing. However, it could also be used for other purposes, such as creating distractions that lead to overexertion.
In collaborative games, the actions and achievements of one player may influence other participants. This principle also applies to music: the application can select or modify the music that a player hears based on the performance of other players. For example, it may increase the motivation of players who are lagging behind in their activities or create a distracting atmosphere for those who are ahead. These decisions are more complex than those affecting individual players, but both types of decisions share the same technological requirements.
The requirements related to emotion-based music in exer-games are as follows (these are listed to facilitate reference in subsequent sections):
R1: Recognize players’ emotions during physical activities.
R2: Recognize the emotions that songs are likely to evoke in listeners in order to create a catalog of songs labeled from an emotional perspective.
R3: Make music recommendations based on emotional criteria.
R4: Personalize emotion-based music recommendations for each player, considering their tastes and musical preferences.
R5: Integrate decision mechanisms that use music as a stimulus to influence game progress.
These requirements are translated into a set of software components that work in conjunction with sensors and devices available in players’ environments to provide the necessary functionality. These components are organized and orchestrated based on the architectural model presented above. Additionally, cloud computing principles are adopted to address the technical and integration challenges of the proposed solution.
4.2. Solution Design
Figure 2 illustrates the devices and software components that constitute the emotion-based music system for exer-games. The right-hand side of the figure outlines the relationship between the six-layer architecture and the system elements. The connectivity between elements across different architectural layers is primarily implemented using an event-driven interaction model. This model enhances the decoupling of distributed components, improves system scalability, and increases fault tolerance.
The player wears two devices: an Empatica E4 physiological wearable [40] and Meta Quest 3 smart glasses [41]. The former allows real-time monitoring of the player’s physiological signals (such as heart rate, electrodermal activity, blood volume, and temperature), which are then used to recognize the emotions the player is experiencing. The glasses display game elements and manage the player’s interaction with these virtual components. Additionally, the glasses play a crucial role in game control for two reasons: (1) they generate events describing relevant conditions about game progress (e.g., player actions, challenges overcome, activities completed) as well as events concerning the player’s physiological response during the game, and (2) they receive events specifying changes in game conditions, environment configurations, and stimuli aimed at influencing player behavior. Some of these input/output events are programmed directly using the glasses’ core utilities, while others require specific applications executed on the glasses.
As part of this work, two applications are developed. The first, the Empatica Physiological Data Acquisition (PDA) system, is an Android application that communicates with the Empatica E4 wearable to acquire and filter the player’s physiological data and generate the corresponding events. These events primarily contain measurements of physiological signals over time. This application is built for Android 12 (API level 31). The second application, the Spotify Player, is another Android application that plays songs from the Spotify discography. This application is responsible for playing the musical stimuli through the audio output of the glasses.
These devices and applications comprise the Device layer of the solution. The flow of events between these elements and the other components of the music system is managed by the Messaging layer, which essentially functions as an event bus. In this work, the Azure Event Grid service is chosen as the integration bus [42].
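The publish/subscribe interaction that the Messaging layer provides can be sketched in a few lines. The example below is a minimal in-memory stand-in and not the Azure Event Grid API: the `EventBus` class, the topic names, and the payload fields are all hypothetical and serve only to show how devices and services remain decoupled.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class EventBus:
    """Toy in-memory event bus (hypothetical; not Azure Event Grid)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Dict[str, Any]) -> int:
        """Deliver the event to every subscriber; return how many were notified."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = EventBus()
received: List[Dict[str, Any]] = []

# An ingestion service subscribes to physiological events.
bus.subscribe("physiological-data", received.append)

# The glasses publish events without knowing who consumes them.
delivered = bus.publish("physiological-data", {"player": "p1", "eda": [0.41, 0.43]})
ignored = bus.publish("game-progress", {"player": "p1", "task": "done"})
```

The publisher only names a topic, so new ingestion services can be added by subscribing, without changing the code that runs on the glasses.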
Data ingestion services subscribe to the event bus to receive events from the glasses. These services specialize in processing specific events and storing the results in data repositories, which are accessible to services in the upper layers. The Ingestion layer of the music system includes two services: the Emotion Recognition Service and the Game Event Processing Service. The Emotion Recognition Service processes events related to the player’s physiological data, using machine learning models to determine the player’s current emotional state based on their electrodermal activity (i.e., translating physiological events into emotions) [43]. Many physiological events are generated for each player during a match, allowing for the progressive computation of their emotional sequence, which helps define the player’s mood and emotional changes over time. The Game Event Processing Service, on the other hand, analyzes the player’s behavior patterns during the match, translating game progress events into behavior patterns to identify each player’s achievements and challenges.
The emotional and behavioral data processed by these ingestion services are stored in a COSMOS database [44], which is accessible to the games that operate within the Application layer. Game applications use these data to execute rules that control gameplay. Internally, an Event–Condition–Action (ECA) rule engine is used as follows: certain behavior patterns (events) trigger changes in the player’s activities and/or the generation of musical stimuli to influence those activities (actions), while these actions may also be conditioned by the player’s emotions (conditions). These actions are then converted into events, which are sent to one or more players’ glasses through the messaging layer’s event bus.
A musical stimulus consists of one or more songs selected to evoke a specific emotion in the listener. The chosen songs depend on the player’s current emotional state and the emotion that the system aims to induce. The game engine interacts with a music recommendation system that takes both emotional perspectives into account when determining stimuli. Additionally, the recommendation system includes personalization features that enhance musical decisions based on the listener’s preferences. In this proposal, the recommendation and personalization systems are developed to work with Spotify, the most popular music streaming provider. The Spotify discography (containing over 100 million songs) was previously classified from an emotional perspective using the RIADA system [45], a distributed infrastructure developed by the authors to recognize the emotions conveyed by Spotify songs. All these systems are encapsulated as services and integrated into a framework within the Service layer. The framework is based on Azure technology and publishes its functional interfaces through the event bus.
Finally, the requirements outlined in Section 4.1 are mapped to the software elements of the proposed solution: R1 is addressed by the Empatica Physiological Data Acquisition system (Device layer) and the Emotion Recognition Service (Ingestion layer); R2, R3, and R4 are handled by the music services framework (Service layer); and R5 is managed by the game engine (Application layer).
4.3. Example of Emotion-Based Musical Stimulation
In this section, the interaction between the components shown in Figure 2 is illustrated through a specific stimulation scenario. This interaction consists of a flow of messages, some of which contain information about the players’ emotions and the songs to be played as part of the affective regulation actions. Therefore, before explaining the flow, we briefly introduce the affective model used to represent emotions and the way songs are identified.
Various models for representing emotions have been proposed in the field of affective computing. The most popular is Russell’s circumplex model [46], which represents affective states in a two-dimensional space defined by the valence (X-axis) and arousal (Y-axis) dimensions. The combination of these two dimensions (valence/arousal) determines four distinct quadrants: aggressive (negative/positive), happy (positive/positive), sad (negative/negative), and relaxed (positive/negative). Each emotion is mapped to a point within this two-dimensional space and is thus located in one of these quadrants. Based on Russell’s model, we represent an emotion by means of two vectors of four values: the first gives the probability that the emotion is mapped to each of the four quadrants, and the second indicates which of these probabilities are significant (a relevance threshold was calculated experimentally). Two examples of emotions are as follows:
The annotation ([, , , ], [false, true, false, false]) represents that the affective state is happy with a probability. The aggressive, sad, and relaxed probabilities (, and , respectively) are lower than the threshold and therefore the state is also defined as not sad, not aggressive and not relaxed. This annotation could correspond with a positive emotion with a high arousal value, for example, with the emotion “excited”.
The annotation ([, , , ], [false, true, false, true]) represents that the affective state is happy and relaxed (and not aggressive and not sad). It could correspond with a positive emotion closer to the X-axis than the one of the previous example, for example, with the emotion “delighted”.
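The two-vector representation can be sketched as a small data structure. This is a minimal illustration under stated assumptions: the quadrant order (aggressive, happy, sad, relaxed) follows the order in which the quadrants are listed above, the threshold value of 0.3 is hypothetical (the experimentally calculated value is not reported here), and the probabilities are invented for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed quadrant order, following the order in which the text lists them.
QUADRANTS = ("aggressive", "happy", "sad", "relaxed")

# Hypothetical value; the experimentally calculated threshold is not given.
RELEVANCE_THRESHOLD = 0.3


@dataclass(frozen=True)
class EmotionAnnotation:
    probabilities: Tuple[float, ...]
    relevant: Tuple[bool, ...]

    def labels(self) -> List[str]:
        """Names of the quadrants whose probability is flagged as significant."""
        return [q for q, flag in zip(QUADRANTS, self.relevant) if flag]


def annotate(probabilities: Tuple[float, ...],
             threshold: float = RELEVANCE_THRESHOLD) -> EmotionAnnotation:
    """Build the two-vector annotation from four quadrant probabilities."""
    relevant = tuple(p >= threshold for p in probabilities)
    return EmotionAnnotation(tuple(probabilities), relevant)


def quadrant(valence: float, arousal: float) -> str:
    """Map a point of the valence/arousal plane to its quadrant name."""
    if valence >= 0:
        return "happy" if arousal >= 0 else "relaxed"
    return "aggressive" if arousal >= 0 else "sad"


# Illustrative probabilities only (not the values used in the paper).
excited = annotate((0.10, 0.70, 0.05, 0.15))
delighted = annotate((0.05, 0.50, 0.05, 0.40))
```

With these illustrative inputs, `excited.labels()` contains only the happy quadrant, while `delighted.labels()` contains both happy and relaxed, mirroring the two examples above.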
Regarding the identification of songs, the Spotify IDs of tracks are reused. This decision facilitates the use of the Spotify player, which is integrated into the smart glasses, and the interaction with the Spotify online services to provide the music-based functionality.
After presenting the representation models, the stimulation scenario is introduced. Suppose that a player is particularly motivated and is successfully completing the physical tasks (the player’s predominant emotion would be “excited” at that moment). The other players are performing their activities more slowly and are lagging behind the motivated player (they would feel “relaxed”). The emotion-based music system aims to regulate the players’ affective states to balance their task outcomes.
Figure 3 synthesizes the flow of messages that would happen in this scenario.
The motivated player wears an Empatica E4 device that monitors their physiological signals. The Empatica PDA system, integrated into the player’s glasses, periodically accesses these raw signals and filters the information related to electrodermal activity (EDA). The EDA data are then packaged in an event message that is published on the bus. Each package contains the information extracted from a five-minute signal fragment. The Emotion Recognition Service is subscribed to the bus and is notified when new physiological events are available. It applies a series of machine learning models capable of translating the received EDA data into an emotion. This emotion represents the predominant affective state that the player is most likely to have felt during those five minutes. In this scenario, the affective annotation would resemble the first example above, indicating that the player feels “excited” (an emotion that corresponds to high-motivation states). That annotation is stored in the COSMOS database as part of the player’s mood during gameplay.
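The packaging step can be illustrated with a short sketch that groups EDA samples into five-minute fragments and wraps each one in an event. The sampling rate of 4 Hz and the event field names are assumptions made for this example; they are not taken from the described implementation.

```python
from typing import Dict, Iterator, List

EDA_SAMPLE_RATE_HZ = 4          # assumed sampling rate for the EDA signal
WINDOW_SECONDS = 5 * 60         # five-minute fragments, as described in the text
WINDOW_SIZE = EDA_SAMPLE_RATE_HZ * WINDOW_SECONDS


def eda_events(player_id: str, samples: List[float]) -> Iterator[Dict]:
    """Yield one event per complete five-minute fragment of EDA samples."""
    for start in range(0, len(samples) - WINDOW_SIZE + 1, WINDOW_SIZE):
        yield {
            "type": "physiological-data",      # hypothetical event type
            "player": player_id,
            "start_second": start // EDA_SAMPLE_RATE_HZ,
            "eda": samples[start:start + WINDOW_SIZE],
        }


# Eleven minutes of synthetic data produce two complete fragments;
# the trailing minute is not emitted because its fragment is incomplete.
synthetic = [0.5] * (11 * 60 * EDA_SAMPLE_RATE_HZ)
events = list(eda_events("player-1", synthetic))
```

Emitting only complete fragments keeps every event the same length, which simplifies the input expected by a downstream recognition model.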
The exer-game engine has the following rule: (EVENT: “When a player is performing the physical tasks much better and faster than the opponents”; CONDITION: “That player is feeling a positive emotion with a high arousal value”; ACTION: “Send a relaxing stimulus to the player and motivating stimuli to the rest of the players”). The rule is activated when that game event is generated by the Game Event Processing Service. This high-level event is a composite event, formed from game events published by the players’ glasses. The engine queries the database for the player’s latest emotion (“excited”, a positive emotion) to evaluate the rule condition and, since the condition is fulfilled, it generates the corresponding actions. These actions consist of personalized stimuli based on the emotions to be evoked.
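The rule above can be expressed as an Event-Condition-Action triple. The following is a minimal sketch rather than the engine used in this work: the `Rule` class, the event name `player-far-ahead`, and the set of emotions treated as positive with high arousal are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Emotions = Dict[str, str]           # player id -> latest recognized emotion
Action = Dict[str, str]             # {"player": ..., "evoke": ...}

# Assumption: emotions treated as positive valence with high arousal.
POSITIVE_HIGH_AROUSAL = {"excited", "enthusiastic"}


@dataclass
class Rule:
    event: str
    condition: Callable[[str, Emotions], bool]
    action: Callable[[str, Emotions], List[Action]]


def leader_is_aroused(leader: str, emotions: Emotions) -> bool:
    return emotions.get(leader) in POSITIVE_HIGH_AROUSAL


def rebalance(leader: str, emotions: Emotions) -> List[Action]:
    """Relax the leading player and motivate everyone else."""
    return [
        {"player": p, "evoke": "relaxed" if p == leader else "enthusiastic"}
        for p in sorted(emotions)
    ]


RULES = [Rule("player-far-ahead", leader_is_aroused, rebalance)]


def handle(event: str, leader: str, emotions: Emotions) -> List[Action]:
    """Run every rule registered for the event whose condition holds."""
    actions: List[Action] = []
    for rule in RULES:
        if rule.event == event and rule.condition(leader, emotions):
            actions.extend(rule.action(leader, emotions))
    return actions


stimuli = handle("player-far-ahead", "p1",
                 {"p1": "excited", "p2": "relaxed", "p3": "relaxed"})
```

Each resulting action names a player and the emotion to evoke, which is the information a recommendation request needs.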
The game application knows the identity of the players, which is needed to personalize the musical stimuli. In Figure 3, the red connectors represent the sending and/or receiving of a set of events to/from the bus. In accordance with this representation, the application publishes one recommendation request event on the bus for each participant. A recommendation request mainly contains the identity of the player and the emotion to be evoked by the recommended songs: in this case, “relaxed” for the motivated player and “enthusiastic” for the rest. An emotion-based music recommendation service receives these events and searches the Spotify discography for a set of candidate songs capable of evoking the requested emotion. These songs are then ranked and filtered by a Personalization service to tailor the response to the music preferences of each player. The identity of the player determines the personalization profile to be applied to each recommendation request. The emotion-based recommendation and personalization algorithms are detailed in Section 5. The result is a response event for each recommendation request, containing a list of suggested songs, more specifically a list of Spotify track IDs.
Finally, the application publishes the events containing the recommended songs on the bus. The glasses of each player are notified and retrieve the corresponding event, and the list of tracks is played locally through the Spotify player. From that moment on, the physiological data of each player can be used to evaluate whether the music-based regulation actions have the intended effect: the motivated player reduces their performance intensity, while the remaining players increase their motivation and their performance in the physical tasks.
5. Music Recommendation and Personalization Based on Emotions
This section explores the integration of emotional insights into music recommendation systems, focusing on the methodologies used to recognize emotional content in music based on audio features and machine learning models, and how these insights are adapted to user emotions for personalized recommendations. The discussion begins with an examination of the architectural framework for implementing emotion-aware music functionalities. Following this, the methodologies and technologies used to accurately identify and interpret the emotional content of music are presented. The focus then shifts to how user preferences and emotional responses are utilized to build comprehensive profiles that enhance recommendation accuracy. Finally, the section concludes with an analysis of techniques for tailoring music suggestions to individual users’ emotional states and preferences.
5.1. Function-Based Design of the Music Services
Figure 4 illustrates the Music services developed to support the generation of musical stimuli. These services offer the functionalities required to achieve requirements R2 (Section 5.2), R3 (Section 5.4), and R4 (Section 5.3) described in Section 4.1. These functionalities are accessible through an event bus, as shown on the right side of the figure. Essentially, two types of service requests can be published: registering a new user and obtaining music recommendations based on emotions.
User registration is required to provide personalized recommendations. When a new user is registered, a musical seed that describes their tastes and preferences is generated. This seed is derived from the user’s Spotify playlists and recently played songs on the streaming platform. Access to this information requires that the user holds a premium license and provides the necessary credentials. Once access is granted, the seed is generated automatically without user intervention. The user’s registration data and musical seed are then stored in a repository. The two services responsible for these operations are the User Registration and Musical Seed Creation services. The latter interacts with the Spotify developer platform [47] to compute the user’s seed.
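To make the idea of a musical seed concrete, the sketch below derives one from two lists of tracks. The content of the seed is not specified at this point in the text, so representing it as the listener's most frequent artists is purely an assumption, as are the track fields, the `recent_weight` parameter, and the sample data. No Spotify API call is made.

```python
from collections import Counter
from typing import Dict, Iterable, List

Track = Dict[str, str]   # hypothetical shape: {"id": ..., "artist": ...}


def build_seed(playlist_tracks: Iterable[Track],
               recent_tracks: Iterable[Track],
               top_n: int = 3,
               recent_weight: int = 2) -> List[str]:
    """Return the user's most frequent artists, weighting recent plays higher."""
    counts: Counter = Counter()
    for track in playlist_tracks:
        counts[track["artist"]] += 1
    for track in recent_tracks:
        counts[track["artist"]] += recent_weight
    # Sort by descending weight, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [artist for artist, _ in ranked[:top_n]]


playlists = [{"id": "t1", "artist": "A"}, {"id": "t2", "artist": "B"},
             {"id": "t3", "artist": "A"}]
recent = [{"id": "t4", "artist": "C"}, {"id": "t2", "artist": "B"}]
seed = build_seed(playlists, recent, top_n=2)
```

In a real deployment the two input lists would come from the user's playlists and listening history once the required credentials have been granted.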
Game applications request recommended songs to generate musical stimuli. A recommendation request must include at least the listener’s identity (for personalization purposes) and the emotion to be evoked. The Recommendation service is responsible for identifying candidate songs that best match the requested emotion. As part of this process, a Personalization service uses the listener’s seed to filter the candidate songs based on their preferences, ensuring a customized result. A repository of emotionally labeled songs supports these emotion-based recommendations. As described earlier and shown on the left side of Figure 4, the RIADA system employs a Random Forest machine learning model to classify songs based on their emotional content, using a set of audio features provided by Spotify (e.g., Valence, Energy, Tempo, Acousticness, Danceability). This system assigns each song to one of four emotional quadrants (happy, sad, relaxed, and angry), and the resulting labels are stored in the song and label repository. These labels serve as the basis for generating music recommendations aligned with the user’s emotional state.
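The request flow described in this paragraph can be sketched as two chained steps, candidate selection by emotion label followed by personalization. This is an illustrative simplification and not the algorithms of Sections 5.3 and 5.4: the catalog, the track IDs, the artist-based seed, and the ranking criterion are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical labeled catalog: track ID -> (emotion quadrant, artist).
CATALOG: Dict[str, Tuple[str, str]] = {
    "track-1": ("relaxed", "A"),
    "track-2": ("relaxed", "B"),
    "track-3": ("happy", "A"),
    "track-4": ("relaxed", "C"),
}


def recommend(emotion: str) -> List[str]:
    """Candidate tracks whose label matches the emotion to be evoked."""
    return sorted(t for t, (label, _) in CATALOG.items() if label == emotion)


def personalize(candidates: List[str], seed_artists: Set[str],
                limit: int = 2) -> List[str]:
    """Rank candidates so that tracks by the listener's seed artists come first."""
    ranked = sorted(candidates,
                    key=lambda t: (CATALOG[t][1] not in seed_artists, t))
    return ranked[:limit]


def handle_request(request: Dict, seeds: Dict[str, Set[str]]) -> Dict:
    """Turn a recommendation request event into a response event."""
    candidates = recommend(request["emotion"])
    tracks = personalize(candidates, seeds.get(request["player"], set()))
    return {"player": request["player"], "tracks": tracks}


response = handle_request({"player": "p1", "emotion": "relaxed"},
                          {"p1": {"C"}})
```

Keeping the two steps separate means the same emotional candidates can be reused for several listeners, with only the ranking changing per player.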
All music services (represented as white rounded rectangles) are implemented as Azure Durable Functions [48], a type of serverless solution that reduces the costs associated with programming and executing in cloud environments. Azure Durable Functions extend the capabilities of standard Azure Functions by enabling the creation of stateful workflows. They are designed to manage and coordinate complex, long-running business processes and stateful operations without requiring developers to manage the underlying infrastructure.
Durable functions are particularly useful in our scenario, as our workflows involve long-running processes such as music tag processing, music recommendation, and human interactions, as well as multiple steps that must be executed in sequence. These functions provide a way to build reliable and scalable applications while abstracting away the complexity of state management.
In this work, three types of durable functions in Azure Functions are used:
Activity functions: These are the building blocks of durable workflows, responsible for performing tasks or operations. They are called by orchestrator functions and can be executed in parallel or sequentially.
Orchestrator functions: These functions define the workflow of the durable application. They manage the coordination and state of activities and control the flow of execution. Orchestrator functions are durable and can handle long-running processes by maintaining their state across restarts.
Client functions: These functions are responsible for starting and interacting with orchestrator functions. They provide an entry point for initiating durable workflows and can be used to pass inputs and retrieve results from orchestrator functions.
The architecture described is designed to integrate Azure Durable Function components to handle and manage complex workflows efficiently. The process is triggered through the Azure Event Grid, which provides a unified event routing mechanism that can handle events from multiple sources. The triggers for the process can be categorized into three main types: HTTP Trigger, Event Trigger, and Timer Trigger.
HTTP trigger: This trigger type serves as an endpoint for clients to initiate the process. When a client sends an HTTP request to the designated endpoint, an HTTP-triggered function is activated. This function acts as a gateway to start the orchestration workflow.
Event trigger: This trigger type is used to initiate the process based on events launched from various services. For example, an event generated by a service like Azure Service Bus or Azure Blob Storage can activate the function, which then starts the orchestration process.
Timer trigger: This trigger is employed for scheduled executions of the process. A Timer-triggered function activates based on a specified schedule, such as daily or weekly intervals, thus enabling the orchestration process to run periodically.
Figure 5 depicts the generic flow described. Upon activation by one of the aforementioned triggers, the process is captured by a function designated as an orchestration trigger. This function is responsible for initiating the orchestration of the workflow. Specifically, it starts an orchestrator function, which manages the workflow and coordinates the execution of various tasks.
The orchestrator function, once initiated, executes a series of activity functions. These activity functions represent the individual tasks or operations that need to be performed as part of the workflow. The sequence of activity functions, ranging from Activity 1 to Activity N, is executed as defined by the orchestrator. Each activity function performs a specific operation that contributes to fulfilling the overall functionality required by the process.
The orchestrator function ensures that the activities are executed in the correct order and handles the state management necessary for long-running processes. It maintains the state of the workflow, allowing for the management of complex, multi-step processes and providing resilience against failures and restarts.
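The coordination pattern described above can be sketched in plain Python. This sketch does not use the Azure SDK: it only imitates how an orchestrator yields activity calls to a runtime that executes them and feeds the results back. The activity names are borrowed from the song-labeling workflow of Section 5.2, while their bodies, the `ACTIVITIES` registry, and the `run_orchestration` helper are hypothetical placeholders.

```python
from typing import Any, Callable, Dict, Generator, List, Tuple

ActivityCall = Tuple[str, Any]

# Hypothetical activity functions: each performs one task of the workflow.
def get_new_songs(_: Any) -> List[str]:
    return ["track-1", "track-2"]

def label_tracks(track_ids: List[str]) -> Dict[str, str]:
    return {track_id: "happy" for track_id in track_ids}

def insert_songs(labels: Dict[str, str]) -> int:
    return len(labels)

ACTIVITIES: Dict[str, Callable[[Any], Any]] = {
    "get_new_songs": get_new_songs,
    "label_tracks": label_tracks,
    "insert_songs": insert_songs,
}

def orchestrator() -> Generator[ActivityCall, Any, int]:
    """Defines the workflow: yields (activity name, input), receives results."""
    track_ids = yield ("get_new_songs", None)
    labels = yield ("label_tracks", track_ids)
    stored = yield ("insert_songs", labels)
    return stored

def run_orchestration(history: List[ActivityCall]) -> int:
    """Toy runtime: runs each yielded activity and resumes the orchestrator."""
    workflow = orchestrator()
    try:
        call = next(workflow)
        while True:
            name, payload = call
            history.append(call)          # record the step that was executed
            result = ACTIVITIES[name](payload)
            call = workflow.send(result)  # hand the result back to the workflow
    except StopIteration as finished:
        return finished.value

history: List[ActivityCall] = []
stored_count = run_orchestration(history)
```

In the actual service, the runtime also checkpoints each activity result so that the orchestrator can be replayed after a restart; the toy runtime above only keeps the call history in memory.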
Overall, this architecture leverages Azure Durable Functions to create a robust, scalable solution for managing and executing stateful workflows, with triggers facilitating the initiation of processes and orchestrators coordinating the execution of activities.
5.2. Music Emotion’s Recognition
The emotion recognition process implemented in this study is based on the models introduced in the RIADA system [45], designed to tag songs from an emotional perspective. The input for these models consists of various audio features provided by Spotify, including Valence, Energy, Acousticness, Danceability, Instrumentalness, Loudness, Duration, Speechiness, Tempo, Key, Mode, and Liveness.
The output of the emotion recognition model is a pair of values: a binary value and a continuous value that predict whether the emotions perceived by listeners fall within the corresponding quadrant. There is a separate model for each quadrant.
Figure 6 illustrates the flow of the Azure Durable Function responsible for finding and labeling new songs. The function is triggered by an orchestrator, which sequentially executes tasks to incorporate new data into the database. The first task, Get New Songs, retrieves a list of Spotify track identifiers by sending requests for New Releases, retrieving up to 100 songs from key international markets. The following task, Get MusicInfo, extracts general song information (title, artist, etc.) as well as audio features from Spotify, sending Several Tracks and Audio Features requests via the Spotipy library.
Once the data are gathered, the Label Tracks task applies the pre-trained Random Forest models, loaded using joblib, to assign the four different emotional labels to each song based on the extracted audio features. Finally, the Insert Songs task stores the songs’ general information, features, and emotional labels in the COSMOS database.
This function is designed to periodically update the COSMOS database with newly released songs from Spotify, ensuring that the database remains current with the most recent emotionally labeled music. The process is triggered weekly using an Azure timer. Two of the activities interact with Spotify web services, represented by the Spotify logo, while another activity accesses Random Forest pre-trained machine learning models, symbolized by a blue test tube. The final activity interacts with the COSMOS database.
We now describe the process at a more detailed technical level. Before feature selection, normalization was applied to ensure that the values of all features fall within the range of zero to one. This was achieved using the MinMaxScaler from Scikit-learn version 1.5 [49], and the fitted transformation was saved so that it could be applied to new data during model deployment.
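The normalization step can be illustrated with a small standard-library sketch. It is a stand-in for the Scikit-learn MinMaxScaler used in this work, not the actual implementation: the helper names `fit_min_max` and `transform`, the JSON persistence, and the toy feature values are assumptions made for illustration.

```python
import json
from typing import Dict, List

Params = Dict[str, Dict[str, float]]

def fit_min_max(rows: List[Dict[str, float]]) -> Params:
    """Learn the per-feature minimum and maximum from the training rows."""
    params: Params = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        params[name] = {"min": min(values), "max": max(values)}
    return params

def transform(row: Dict[str, float], params: Params) -> Dict[str, float]:
    """Rescale one row to [0, 1] using previously fitted parameters."""
    scaled = {}
    for name, value in row.items():
        low, high = params[name]["min"], params[name]["max"]
        span = high - low
        # A constant feature has no range; map it to 0.0 (an assumption).
        scaled[name] = (value - low) / span if span else 0.0
    return scaled

# Toy training data (hypothetical values).
training = [
    {"tempo": 60.0, "loudness": -60.0},
    {"tempo": 120.0, "loudness": -30.0},
    {"tempo": 180.0, "loudness": 0.0},
]
params = fit_min_max(training)

# Persist the fitted parameters so the same scaling is applied at deployment.
saved = json.dumps(params)
restored = json.loads(saved)
new_song = transform({"tempo": 90.0, "loudness": -15.0}, restored)
```

Reusing the saved parameters, rather than refitting on new songs, keeps the deployed models consistent with the scale they were trained on.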
Feature selection was performed through a voting system based on three metrics: Chi-Square, ANOVA F-value, and Mutual Information (all available in Scikit-learn [50]). For each metric, features were ranked according to their relevance, where 1 indicated the least important and 12 the most important. The feature with the highest cumulative rank across the three metrics was considered the most relevant, while the feature with the lowest cumulative rank was considered the least relevant. This ranking process was conducted independently for each of the four emotional quadrants. The following feature combinations, ordered from least to most important, yielded the best results for each emotion:
Sad: danceability, key, speechiness, mode, instrumentalness, tempo, duration, liveness, loudness, valence, acousticness, energy;
Happy: danceability, key, speechiness, mode, instrumentalness, tempo, duration, liveness, loudness, valence, acousticness, energy;
Angry: key, valence, mode, duration, instrumentalness, tempo, danceability, liveness, speechiness, loudness, energy, acousticness;
Relaxed: key, tempo, liveness, mode, duration, speechiness, danceability, valence, acousticness, loudness, energy, instrumentalness.
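The rank-voting procedure behind these orderings can be sketched as follows. The metric scores are toy values chosen for illustration, only three features are used instead of twelve, and the alphabetical tie-break is an assumption, since the text does not state how ties were resolved.

```python
from typing import Dict, List

def rank_features(scores: Dict[str, float]) -> Dict[str, int]:
    """For one metric: rank 1 = least important, rank n = most important."""
    ordered = sorted(scores, key=lambda name: scores[name])
    return {name: position for position, name in enumerate(ordered, start=1)}

def vote(metric_scores: List[Dict[str, float]]) -> List[str]:
    """Sum the ranks across metrics; return features from least to most important."""
    totals: Dict[str, int] = {}
    for scores in metric_scores:
        for name, rank in rank_features(scores).items():
            totals[name] = totals.get(name, 0) + rank
    # Ties are broken alphabetically (assumption).
    return sorted(totals, key=lambda name: (totals[name], name))

# Hypothetical scores from the three metrics for one emotional quadrant.
chi_square = {"energy": 9.1, "key": 0.2, "valence": 4.0}
anova_f = {"energy": 7.5, "key": 0.9, "valence": 8.0}
mutual_info = {"energy": 0.30, "key": 0.01, "valence": 0.12}

ordering = vote([chi_square, anova_f, mutual_info])
```

With these toy scores, `key` collects a cumulative rank of 3, `valence` 7, and `energy` 8, so the ordering from least to most important is key, valence, energy.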
Then, a recognition model was built for each of Russell’s affective quadrants. Different types of machine learning algorithms were considered: Random Forest (RF), K-Nearest Neighbor (KNN), Support Vector Machines (SVM), Gradient Boosting (GB), and Multi-Layer Perceptron (MLP). The first three were built and analyzed during the development of RIADA [45], while the last two were created as part of this work.
Table 1 shows the results of the different combinations of algorithms and affective quadrants. All models were trained using repeated five-fold cross-validation with a 70/30 data split and a randomized hyperparameter search. The best combinations are highlighted in green. In conclusion, the Random Forest models offer good accuracy for all four quadrants and, for this reason, they were selected for integration into the music emotion recognition process.
Several factors explain why the Random Forest models achieve these promising recognition results. In general, Random Forest has proven to be a highly flexible and robust model. The input dataset was slightly imbalanced (i.e., the number of songs in each of the four affective quadrants differs) and consisted of both categorical and continuous features. Random Forest handles the impact of these two characteristics more effectively than the alternative learning algorithms. With respect to class sizes, voting over decision trees trained on different data subsets reduces the effects of imbalance. In addition, decision-making based on trees rather than on distance measures facilitates the handling of the categorical features included in our dataset. On the other hand, the size of the input dataset could lead to overfitting of the resulting models. Although we applied cross-validation to reduce this risk, Random Forest is also usually less sensitive to overfitting than the other models.
The following analysis of hyperparameters was carried out for the Random Forest models using RandomizedSearchCV from Scikit-learn (similar work was performed for the other models):
Number of estimators: 10, 35, 60, 85, 110, 135, 160, 185, 210
Criterion: gini or entropy
Minimum samples per leaf: 2, 7, 12
Minimum samples per split: 2, 5, 10
Maximum depth of the tree: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100
Bootstrap: true or false
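The search space listed above can be written as a dictionary keyed by the corresponding Scikit-learn `RandomForestClassifier` parameter names, in the form accepted by the `param_distributions` argument of RandomizedSearchCV. The snippet below is only a sketch that counts the candidate configurations; it does not run the search, and the variable names are illustrative.

```python
import math

# Hyperparameter values taken from the list above.
search_space = {
    "n_estimators": [10, 35, 60, 85, 110, 135, 160, 185, 210],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 7, 12],
    "min_samples_split": [2, 5, 10],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "bootstrap": [True, False],
}

# Size of the full grid: the product of the number of options per parameter.
total_combinations = math.prod(len(values) for values in search_space.values())
```

The full grid contains 3240 combinations per quadrant, which illustrates why a randomized search that samples a subset of configurations is a reasonable choice over an exhaustive one.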
The selected hyperparameters for each of the four Random Forest models were as follows:
Sad: bootstrap = False, criterion = entropy, maximum tree depth = 30, minimum samples per leaf = 2, minimum samples per split = 5, number of estimators = 160;
Happy: bootstrap = False, criterion = entropy, maximum tree depth = 30, minimum samples per leaf = 2, minimum samples per split = 5, number of estimators = 160;
Angry: bootstrap = False, criterion = gini, maximum tree depth = 90, minimum samples per leaf = 2, minimum samples per split = 2, number of estimators = 185;
Relaxed: bootstrap = False, criterion = gini, maximum tree depth = 90, minimum samples per leaf = 2, minimum samples per split = 2, number of estimators = 185.
5.3. Generation of a User Profile for Personalization
The personalization system is based on data services provided by Spotify. Functionally, the system uses a musical seed, which represents the user’s musical preferences and tastes. This seed, combined with the user’s requested emotion, is then used to select the appropriate songs to return.
Figure 7 illustrates the system’s functionalities, which are divided into distinct activities, each managed by an orchestrator and triggered via an HTTP Trigger accessible to the user.
A Spotify premium account and an authorization token are required to access specific Spotify API requests and retrieve user data. Depending on the scope of the data, different permissions are necessary. In this case, the following permissions are utilized: user-library-read, user-top-read, playlist-read-private, user-read-recently-played, and playlist-read-collaborative. These scopes must be set prior to requesting the user’s authorization, and this step is handled locally on the device using the Authorization Code with PKCE Flow, which is recommended for this environment. Once the user is registered in the application and grants the necessary permissions, the authentication details are stored in the COSMOS database.
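As an illustration of the PKCE step, the sketch below derives a code verifier and its S256 code challenge as defined in RFC 7636, and assembles the scope string from the permissions listed above. It only covers the local preparation performed before the user is redirected to the authorization page; the function names are hypothetical, and the request itself is not shown.

```python
import base64
import hashlib
import secrets

def make_code_verifier(num_bytes: int = 32) -> str:
    """Random URL-safe verifier; 32 random bytes yield 43 characters."""
    raw = secrets.token_bytes(num_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def make_code_challenge(verifier: str) -> str:
    """S256 challenge: base64url(SHA-256(verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

# Scopes named in the text, joined with spaces for the authorization request.
SCOPES = " ".join([
    "user-library-read",
    "user-top-read",
    "playlist-read-private",
    "user-read-recently-played",
    "playlist-read-collaborative",
])

verifier = make_code_verifier()
challenge = make_code_challenge(verifier)
```

The verifier stays on the device and is sent only when the authorization code is exchanged for a token, whereas the challenge travels with the initial authorization request; this is what allows the flow to work without a client secret.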
The process begins when the Compute musical seed trigger is activated by an HTTP GET request. This trigger retrieves the user ID from the JSON body of the request and initiates the orchestrator with the recovered JSON as a parameter. It waits for the orchestrator to return the response to the client.
The Compute musical seed orchestrator is responsible for coordinating the execution of all subsequent activities. The first activity, Find user’s auth activity, retrieves the user’s authentication details from the COSMOS database. The user must have been previously registered, and their authentication token must have been requested.
Once the user’s authentication is confirmed, the Get preferred tracks activity retrieves the identifiers of the songs that best match the user’s preferences. This is accomplished through several Spotify API requests:
Get Saved Tracks retrieves the user’s saved songs.
Get Recently Played Tracks returns the user’s recently played songs.
Get User’s Top Items lists the top tracks identified by Spotify as most relevant to the user.
Get User’s Playlists, combined with Get Playlist Items, retrieves the user’s playlists and the songs within each playlist.
The output of these API requests is a list of Spotify identifiers representing the user’s favorite songs.
The next activity, Find songs activity, checks the COSMOS database to find songs that are already stored. It returns the stored songs, including their audio features and emotional labels, in JSON format.
For songs not already stored in the database, the process continues with the following activities: Get Music Info, Label Tracks, and Insert Songs, which are described in Section 5.2.
Finally, the Update Musical Seed activity calculates the average of the musical features provided by Spotify, excluding key and mode due to their discrete nature as well as duration as it is less relevant to user preferences. If a song appears multiple times across the different Spotify requests, its features carry more weight in the seed calculation, emphasizing the importance of songs frequently listened to or saved by the user. The resulting musical seed is then updated in the COSMOS database.
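The seed calculation can be sketched as a weighted average in which each track counts once per appearance across the Spotify requests. The data layout, the `compute_seed` name, and the toy feature values are assumptions made for illustration; only the exclusion of key, mode, and duration and the weighting by repetition come from the description above.

```python
from collections import Counter
from typing import Dict, List

EXCLUDED = {"key", "mode", "duration"}  # features left out of the seed

def compute_seed(track_ids: List[str],
                 features: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average the features over all occurrences, so repeated tracks weigh more."""
    counts = Counter(track_ids)
    total = sum(counts.values())
    first_track = next(iter(features.values()))
    names = [name for name in first_track if name not in EXCLUDED]
    return {
        name: sum(features[tid][name] * n for tid, n in counts.items()) / total
        for name in names
    }

# Track "a" appears in three request results, track "b" in one (toy data).
preferred_ids = ["a", "a", "b", "a"]
track_features = {
    "a": {"energy": 0.75, "valence": 1.0, "key": 5, "mode": 1, "duration": 200.0},
    "b": {"energy": 0.25, "valence": 0.0, "key": 2, "mode": 0, "duration": 180.0},
}
seed = compute_seed(preferred_ids, track_features)
```

In this toy case, the seed is pulled toward track "a" (energy 0.625, valence 0.75) because it is counted three times, which mirrors the emphasis on songs that are frequently listened to or saved.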
5.4. Music Recommendation
The music recommendation service is built upon personalization profiles. As depicted in Figure 8, when a user identifier and a desired emotion are provided, the request is triggered by the client.
The Get Personalized Recommendation Trigger is activated through an HTTP GET request. This trigger retrieves the user ID and the desired emotion label from the JSON body of the request and initiates the orchestrator with these data.
The Get Personalized Recommendation Orchestrator is responsible for coordinating the execution of all subsequent activities. The first activity, Get User’s Seed Activity, retrieves the user’s musical seed from the COSMOS database using the provided user ID. This seed is calculated as described previously, representing the user’s preferences based on their listening habits.
Once the seed is retrieved, the Find Songs Activity searches for songs whose audio features fall within a 20% tolerance range of the user’s musical seed. The selected Spotify audio features are normalized between 0 and 1, except for loudness, which ranges from −60 to 0 and requires an adjustment. This range-based search was selected for its efficiency: with index creation in the database, its worst-case time complexity is O(log(n)). Other recommendation systems [51] often employ similarity functions such as Euclidean Distance, Cosine Similarity, or Manhattan Distance, which are more precise but also more computationally expensive, with a time complexity of O(n). From our perspective, a faster approach is preferred despite being less precise, as the resulting group of similar songs is randomized to introduce variability in the music recommendation. In addition to being similar in audio features, the selected songs must align with the user’s requested emotion. To achieve this, a range of values is defined for the emotional labels, maximizing the likelihood that the selected songs belong to the requested emotional quadrant while minimizing subjective user interpretations. The defined ranges for each emotion are as follows:
Angry: Angry > 0.6, Sad < 0.2, Relaxed < 0.2, Happy < 0.4;
Sad: Angry < 0.2, Sad > 0.8, Relaxed < 0.55, Happy < 0.2;
Relaxed: Angry < 0.2, Sad < 0.6, Relaxed > 0.85, Happy < 0.2;
Happy: Angry < 0.3, Sad < 0.2, Relaxed < 0.2, Happy > 0.6.
The system selects a default of five songs, provided enough songs meet the emotional criteria based on their labels. These labels are generated using the Random Forest classifier trained on Spotify’s audio features, ensuring that the selected songs align with the emotional quadrant requested (happy, sad, relaxed, or angry). Songs are selected randomly from this pool to ensure diversity in the recommendations and avoid monotony.
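The selection logic can be sketched as follows. Several points are assumptions rather than details given in the text: the 20% tolerance is interpreted as ±0.2 on the normalized scale, loudness is rescaled linearly from [−60, 0] to [0, 1], fewer than five songs are returned when the pool is small, and the song records are toy data. The deployed system performs the range filter as an indexed database query; the linear scan here only illustrates the criteria.

```python
import random
from typing import Any, Dict, List, Optional

TOLERANCE = 0.2  # assumption: +/-0.2 around the seed on the [0, 1] scale

# Label thresholds taken from the ranges listed above.
EMOTION_RULES = {
    "angry":   {"angry": (">", 0.6), "sad": ("<", 0.2), "relaxed": ("<", 0.2), "happy": ("<", 0.4)},
    "sad":     {"angry": ("<", 0.2), "sad": (">", 0.8), "relaxed": ("<", 0.55), "happy": ("<", 0.2)},
    "relaxed": {"angry": ("<", 0.2), "sad": ("<", 0.6), "relaxed": (">", 0.85), "happy": ("<", 0.2)},
    "happy":   {"angry": ("<", 0.3), "sad": ("<", 0.2), "relaxed": ("<", 0.2), "happy": (">", 0.6)},
}

def normalize_loudness(decibels: float) -> float:
    """Map loudness from [-60, 0] onto [0, 1] (assumed linear adjustment)."""
    return (decibels + 60.0) / 60.0

def near_seed(features: Dict[str, float], seed: Dict[str, float]) -> bool:
    """True when every seed feature is within the tolerance range."""
    return all(abs(features[name] - value) <= TOLERANCE for name, value in seed.items())

def matches_emotion(labels: Dict[str, float], emotion: str) -> bool:
    """True when all four label values satisfy the ranges of the requested emotion."""
    for quadrant, (comparison, threshold) in EMOTION_RULES[emotion].items():
        value = labels[quadrant]
        if comparison == ">" and value <= threshold:
            return False
        if comparison == "<" and value >= threshold:
            return False
    return True

def recommend(songs: List[Dict[str, Any]], seed: Dict[str, float], emotion: str,
              count: int = 5, rng: Optional[random.Random] = None) -> List[Dict[str, Any]]:
    """Filter by seed proximity and emotion, then sample randomly from the pool."""
    rng = rng or random.Random()
    pool = [s for s in songs
            if near_seed(s["features"], seed) and matches_emotion(s["labels"], emotion)]
    return rng.sample(pool, min(count, len(pool)))

happy_labels = {"angry": 0.1, "sad": 0.05, "relaxed": 0.1, "happy": 0.9}
weak_labels = {"angry": 0.1, "sad": 0.1, "relaxed": 0.1, "happy": 0.5}
seed = {"energy": 0.7, "loudness": normalize_loudness(-12.0)}
songs = [
    {"id": "s1", "features": {"energy": 0.8, "loudness": normalize_loudness(-6.0)}, "labels": happy_labels},
    {"id": "s2", "features": {"energy": 0.2, "loudness": normalize_loudness(-6.0)}, "labels": happy_labels},
    {"id": "s3", "features": {"energy": 0.8, "loudness": normalize_loudness(-6.0)}, "labels": weak_labels},
]
picked = recommend(songs, seed, "happy", rng=random.Random(0))
```

Here only `s1` is returned: `s2` lies too far from the seed in energy, and `s3` does not reach the Happy > 0.6 threshold.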
Once the songs are selected, they are returned to the client as a personalized recommendation, providing a tailored experience based on the user’s musical seed and emotional preferences.
5.5. IoT Principles and Computational Framework
In terms of the IoT principles underlying the proposed system, the architecture leverages a distributed model, where the computational workload is divided between edge devices (wearables) and cloud-based services. This approach ensures that low-latency operations, such as emotion recognition, are processed locally on the wearables, while more computationally intensive tasks, including music recommendation and personalized stimuli generation, are handled by cloud services like Azure Durable Functions. By distributing the workload, the system minimizes delays during real-time interactions, crucial for maintaining an uninterrupted gaming experience. Additionally, the system uses MQTT (Message Queuing Telemetry Transport) as the primary network protocol due to its lightweight nature and low bandwidth consumption, ideal for IoT ecosystems with constrained devices.
The machine learning algorithm used for music emotion recognition is a Random Forest model, trained using audio features provided by the Spotify API. This model was selected for its robustness and ability to handle non-linear relationships between input features and output labels, which is essential when working with complex emotional states. It was chosen after evaluating several alternatives, including Support Vector Machines and Gradient Boosting classifiers, as discussed in [45]. Random Forest provided the best balance between accuracy and computational efficiency, particularly in scenarios involving limited real-time processing power on edge devices. Furthermore, the computational complexity of the entire system is optimized through the use of serverless computing (Azure Durable Functions), where the pay-per-use model minimizes resource waste and scales automatically with demand, thus reducing the overall cost and complexity of managing a cloud-based infrastructure.
5.6. Experimentation
Two experiments with real users were conducted to corroborate the affective annotations of songs recommended by the system and the musical seed used for the personalization of songs, respectively. We note that these experiments do not constitute a formal and thorough study of the emotion-based recommendation functionality provided by the system described above. The scope of the paper focuses on the technological aspects of the proposal; therefore, in this section, we are simply interested in carrying out a preliminary analysis of the suitability of the algorithms included in the solution.
With respect to the participants, a total of 12 users with Spotify premium subscriptions took part in both experiments. The participants were aged between 20 and 30 years; 9 were male and 3 female. Their musical preferences were varied, although a taste for contemporary music predominated. Nevertheless, these preferences were not considered as part of these experiments.
5.6.1. Perceived Emotions by the Listeners
The goal of the first experiment is to corroborate the affective annotations of songs with the emotions that users feel when listening to them. In the design of the experiment, we used a repository of Spotify songs that were annotated with the RIADA emotion recognition models as a part of a previous research project.
Instruments: On the one hand, a playlist was created from a repository of songs. It consists of 16 Spotify songs, 4 songs from each of Russell’s quadrants. The main criterion for selecting these songs was that their affective annotations were characterized by one dominant emotion, i.e., songs that could evoke more than one emotion with high probability were excluded. The preferences or tastes of the participants in the experiment were not taken into account in the selection of the songs, and the songs were randomly ordered. On the other hand, a Google survey form was created to gather the participants’ responses. The survey asked each participant to enter the emotion that they felt when listening to each song. In the design of the survey, the “Pick-A-Mood” (PAM) model [52] was used. PAM is a cartoon-based pictorial instrument for representing the user’s possible emotional states based on Russell’s affective model. More specifically, PAM expresses nine emotional states, two for each of the four quadrants and a neutral state. This visual representation reduces the time and effort required from respondents, which makes the PAM model suitable for reporting these state-based emotions.
Methodology: The participants were in a relaxed environment and wore headphones to listen to the playlist. The experiment consisted of playing each song and asking the participant what they felt while listening to it. Between songs, the listener had enough time to enter the answer in the online survey. The duration of the experiment was approximately 50 min.
Results:
Table 2 presents the main outcomes of this experiment. The rows represent the affective annotations of the songs included in the playlist and the columns show the listeners’ responses. For example, the component [Sad, Happy] represents the percentage of responses (
) in which a listener declared to feel happy when listening to a sad song. Therefore, the diagonal of the table indicates the percentage of responses that matched the affective annotations of the songs. Overall, a high percentage of aggressive, happy, and relaxed songs were correctly recognized by the listeners (
%,
%, and
%, respectively). However, the percentage of sad songs recognized correctly was lower (
%), as many were identified as relaxed (both emotional categories share a similar valence).
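The way such a table is built from the survey responses can be sketched as follows. The responses below are toy data that do not reproduce the experiment, and restricting the columns to the four quadrants (leaving out the neutral PAM state) is an assumption.

```python
from collections import Counter
from typing import Dict, List, Tuple

QUADRANTS = ["angry", "happy", "relaxed", "sad"]

def response_table(responses: List[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Rows: annotated emotion; columns: reported emotion; cells: row percentages."""
    counts = Counter(responses)
    table: Dict[str, Dict[str, float]] = {}
    for annotated in QUADRANTS:
        row_total = sum(counts[(annotated, reported)] for reported in QUADRANTS)
        table[annotated] = {
            reported: (100.0 * counts[(annotated, reported)] / row_total) if row_total else 0.0
            for reported in QUADRANTS
        }
    return table

# Each pair is (annotation of the song, emotion reported by the listener).
toy_responses = [
    ("sad", "sad"), ("sad", "relaxed"), ("sad", "relaxed"), ("sad", "sad"),
    ("happy", "happy"), ("happy", "happy"),
]
table = response_table(toy_responses)
```

In this toy example, the cell [sad, relaxed] is 50%, meaning that half of the responses to songs annotated as sad reported a relaxed feeling, while the diagonal cell [happy, happy] is 100%.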
Conclusions and improvements: The number of participants in the experiment is small, and as such, the results should be interpreted with caution. Nevertheless, these preliminary results are promising and suggest that the affective annotations agree well with the emotions felt by the listeners. We acknowledge that the duration and conditions of the experiment limit its scalability. The participants must be relaxed and focused on listening, avoiding interruptions or early termination of the activity. As a future improvement, an alternative method for large-scale validation of the annotations should be designed. Ideally, this validation should be automated, which requires a database of songs emotionally labelled by users. These labels should have similar semantics to those used in this work in order to ensure the validity of the results.
5.6.2. Recommendations Based on the Listener’s Seed
The second experiment aimed to determine whether the musical seed computed for a user improves emotion-based music recommendations. As was described, this seed is based on the user’s listening habits.
Instruments: A web application was developed as part of this experiment. When a participant connects to the application, they must first enter the details of their Spotify premium subscription. Then, the application uses this information to interact with the music recommendation service and to obtain a list of songs to be played. The list contains songs that match the participant’s musical preferences and others that are randomly selected. The application plays the songs in a random order and asks the participant to rate how well each song matches their preferences. A rating of up to five stars could be given for each song, with one meaning the recommendation was not liked at all and five meaning the recommendation was spot on.
Methodology: The application was configured to play 48 songs, with 12 songs from each of the four affective quadrants (i.e., 12 songs annotated as happy, 12 as relaxed, and so on). In each group of 12 songs, 9 were selected according to the listener’s seed and 3 randomly. The participants could listen to each song in its entirety or stop the playback and rate how well it matched their preferences. The duration of the experiment was more than 2 h. For this reason, we offered the participants the option of conducting the experiment at home and in several stages.
Results: The personalized songs received an average score of , indicating that listeners generally agreed with or liked the recommendations. The randomly selected songs received an average score of , which, although lower than the personalized songs, was higher than expected. This may be attributed to participants’ openness to listening to new or alternative songs. From the affective perspective, the highest score for personalized songs was for those songs annotated as happy (the average rating was ), while the lower for those as sad (). The relaxed and aggressive songs scored very similar results, and , respectively.
Conclusions and improvements: As in the first experiment, the number of participants is small and, therefore, the results should be interpreted with caution. Nevertheless, the score of personalized songs suggests that the listeners’ seed is suitable to improve the music recommendations. Again, the duration of the experiment is an obstacle for its scalability. As a future alternative, a new player for Spotify subscribers could be programmed. It could play only recommended songs (personalized and random songs) and gather information about the users’ listening behavior. The player could subsequently be published in forums for developers and users of Spotify-based solutions or for researchers in advances in recommendation systems, for instance. Additionally, when exer-games based on musical stimuli are programmed, we are interested in monitoring the players’ responses to those stimuli and in analyzing how those responses are correlated to the personalized recommendations.
6. Conclusions and Future Work
Finally, the main conclusions and the future challenges are presented.
6.1. Conclusions
In this paper, we presented a technological solution to enhance exer-games through the integration of emotion-based music. The system is designed for IoT-based games and consists of a service-oriented infrastructure that integrates smart devices, artificial intelligence models, cloud technologies, and online Spotify resources. During gameplay, players’ emotions are monitored and used to influence the progression of the matches through music. Musical stimuli are generated to regulate players’ affective states and induce changes in their physical performance. These stimuli are personalized to maximize the impact of music-based interventions on participants, with the emotional responses of players providing further insights into the effectiveness of these stimuli.
The paper focused on presenting the Spotify-based systems involved in the recommendation of personalized music from an affective perspective. These systems leverage the data and resources provided by Spotify to meet the requirements involved in the generation of stimuli: the recognition of the emotions that Spotify songs are likely to evoke in players (Requirement R2), emotion-based recommendation (R3), and the characterization of players’ musical preferences based on their listening habits in order to improve the recommendations (R4). The integration with Spotify provides access to a large-scale catalog of songs and automates the capture of participants’ musical preferences. This contributes a novel and appealing approach to the system. Note that these music-based requirements ideally need to be combined with the recognition of players’ emotions during physical activity (Requirement R1) and the configuration of stimuli (R5). The solutions proposed by the authors in [53] were adapted to the Empatica device in order to accomplish Requirement R1, while R5 has to be addressed during the programming of concrete games.
One of the key advantages of the proposed solution lies in the use of Azure Durable Functions for the orchestration of music services. This serverless, event-driven architecture reduces operational complexity and provides scalability without the need for extensive infrastructure management. By enabling stateful workflows, Azure Durable Functions allow for the execution of long-running processes such as emotion recognition and music recommendation while optimizing resource consumption. The pay-per-execution model of this serverless approach also reduces costs, making it more accessible for large-scale deployments and adaptable to varying workloads in real-time gaming environments. Additionally, the inherent fault tolerance and auto-scaling capabilities of this architecture ensure system robustness and reliability during game operation.
This proposal represents a significant advancement in the role of music and emotions in the design of exer-games. To the best of our knowledge, the combination of these two elements to guide gameplay progression and regulate player performance during physical tasks has not been explored previously. Another key contribution is the real-time monitoring of players’ emotions and their integration into game decision-making processes. This affective dimension enhances the effectiveness of the stimuli, enables the assessment of their real impact, and facilitates the application of personalization strategies that improve the overall gaming experience.
6.2. Future Work
Looking ahead, several avenues of research and development remain open. First, we are working on expanding the system’s capabilities by incorporating additional devices such as cameras and movement recognition technologies, which will allow for more precise monitoring of players’ physical activities and emotional responses. The integration of these devices is expected to improve the accuracy of emotion-based interventions and provide new opportunities for assessing player performance.
We are currently developing two exer-games to validate the concepts presented in this paper. These games evaluate the effectiveness of emotion-based musical interventions across different user profiles. User-centered design is being applied to ensure the usability and acceptance of emotion-based systems. The seamless integration of music and the personalization of stimuli are crucial factors in enhancing user engagement. An iterative design process is being followed, focusing on user motivation through personalization and social interaction. In the future, user testing will be expanded across different demographic groups, and challenges related to the synchronization of music preferences in multiplayer environments will be addressed. The evaluation will involve collaboration with experts in psychology, physical activity, and music to refine the emotion recognition system, optimize device integration, and address challenges in personalization and real-time music synchronization.
We also aim to explore real-world application scenarios by deploying the system in various environments outside of gaming. Potential applications include rehabilitation centers, where personalized music-based interventions could support physical therapy; gyms, where emotion-based music stimuli could enhance workout routines; and wellness programs, where the system could be used to promote physical and emotional well-being. Testing the system in these settings will provide critical insights into its scalability, usability, and overall impact in practical, non-gaming contexts.
Furthermore, in terms of sustainability, future work will focus on analyzing the environmental impact of the system, particularly the carbon footprint generated by the execution of cloud services like Azure Durable Functions. As cloud computing plays an increasingly important role in the deployment of scalable systems, understanding its environmental impact is critical. This analysis will examine the energy consumption and carbon emissions associated with maintaining a continuously operating cloud-based infrastructure. Additionally, we plan to explore more sustainable alternatives, comparing the environmental footprint of Azure Durable Functions with other serverless platforms such as AWS Lambda or Google Cloud Functions. These comparisons will help identify opportunities to reduce the environmental impact of the system. Strategies such as optimizing resource usage, minimizing idle time, and selecting data centers powered by renewable energy will also be considered as part of a broader commitment to sustainability in the development of IoT-based gaming systems.
In terms of cost, the current implementation of the system leverages cloud-based services such as Azure Durable Functions, which operate on a pay-per-execution model. This serverless architecture offers scalability and flexibility, but it also incurs costs related to cloud storage, data processing, and computational resources. These costs are dependent on the volume of users and the frequency of interactions, and while they are manageable for small- to medium-scale deployments, larger implementations could require a more detailed budget analysis.
To address cost concerns and reduce the overall budget, future iterations of the system could explore the feasibility of incorporating open-source frameworks. For example, open-source serverless platforms such as OpenFaaS could be evaluated as alternatives to commercial cloud services, reducing infrastructure costs. Additionally, the use of open-source machine learning libraries such as TensorFlow or PyTorch, combined with locally hosted databases, could further minimize costs. While these open-source options may offer financial advantages, their performance and scalability need to be assessed to ensure that they meet the system’s requirements for real-time processing and personalization in exer-games.