Non-speech sounds play a crucial role in robot-to-human communication, enabling the conveyance of affective information, which is useful when robots need to interact with individuals from diverse cultural and linguistic backgrounds [1]. The potential for non-speech sounds to express emotion has been explored since such sounds were popularized in media, where fictional robots like R2D2 from Star Wars use squeaks, beeps, and whirrs to communicate emotion and intention [2,3].
Yilmazyildiz et al. [4] studied semantic-free utterances, which include gibberish speech, paralinguistic utterances, and non-linguistic utterances (NLUs), and explained that, while NLUs are commonly employed as signals in environments such as train stations and airports, they differ from music and speech in that they lack semantic content and consist solely of affective signals. In this study, the focus is on NLUs, not music or speech sounds.
NLUs have been used to convey information, convey affect, or facilitate communication, with their acoustic parameters derived from natural language or from real-world sources. NLUs are characterized by sounds that do not contain discernible words, are not specifically musical, and exclude laughter or onomatopoeic elements. Having been popularized in films by fictional robots such as R2D2 and WallE, NLUs find their most obvious application in robotics. Characters like R2D2 and WallE are loved by audiences [1], who can interpret the emotions these characters convey even without decoding the precise meaning of the NLUs themselves. Sound designers and Foley artists utilize NLUs to evoke specific sentiments within scenes. As robots and social robots become a bigger part of society, both in industrial and daily life settings, it becomes increasingly important to maintain a harmonious relationship with them and to be able to understand them in all possible situations [1]. To this end, sounds can form part of multi-modal communication channels including gestures, actions, expressions, movements, colors, and normal speech.
This work presents a method of generating NLUs that differs from previous work in the area. Researchers [1,3,5,6] have used varying methods to create NLUs, sometimes in consultation with professional sound designers [7], and some using musical notation schemes or similar abstractions [1,5]. To ensure that social robots can dynamically generate sounds in any given interaction scenario, a new type of method is needed that does away with manual sound creation in favor of a systematic method of sound generation. This work presents a scheme that uses a MIDI note framework and a genetic algorithm (GA) to generate sounds for use by social robots. The method generates many sounds very rapidly, and experiments showed that people perceived the generated sounds as expressing emotion. This method of sound generation could be implemented in a system that operates in a social robot that actively and dynamically interacts with users in the real world. Such a system can create many sounds for each type of emotion, in contrast to previous work in the area, which usually proposed only a few sounds per emotion. In some applications, for example, elderly assistance or alarms, we may want a system that uses only a few different sounds that can be easily learned by users, but in others, where more natural and human-like behavior is desired from the social robot, it would be better to have a system that is capable of producing a wide range of sounds in a dynamic and context-sensitive way. Human-like behavior in robots interacting with humans has been found to improve the perception of competence and warmth of the robots [8]. In this work, we automatically generate many sounds and validate their ability to communicate distinct emotions through machine learning methods, experiments with human subjects, and statistical analyses.
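To make the generation scheme concrete, the sketch below shows one way a candidate sound could be encoded as a short sequence of MIDI-style notes and evolved by a GA. It is a minimal illustration, not the exact implementation used in this work: the genome layout, parameter ranges, and genetic operators (random_genome, mutate, crossover, evolve) are illustrative assumptions.

```python
import random

# Illustrative genome: each gene is one MIDI-style note (pitch, duration, velocity).
PITCH_RANGE = (48, 96)        # assumed MIDI pitch range
DUR_RANGE = (0.05, 0.5)       # assumed note duration range, seconds
NOTES_PER_SOUND = (3, 8)      # assumed number of notes per NLU

def random_genome():
    """Create a random candidate sound as a list of (pitch, duration, velocity) genes."""
    n = random.randint(*NOTES_PER_SOUND)
    return [(random.randint(*PITCH_RANGE),
             random.uniform(*DUR_RANGE),
             random.randint(40, 127)) for _ in range(n)]

def mutate(genome, rate=0.2):
    """Randomly perturb the pitch and duration of some notes."""
    out = []
    for pitch, dur, vel in genome:
        if random.random() < rate:
            pitch = min(max(pitch + random.randint(-4, 4), PITCH_RANGE[0]), PITCH_RANGE[1])
            dur = min(max(dur + random.uniform(-0.05, 0.05), DUR_RANGE[0]), DUR_RANGE[1])
        out.append((pitch, dur, vel))
    return out

def crossover(a, b):
    """Single-point crossover on the two parents' note sequences."""
    return a[:random.randint(1, len(a))] + b[random.randint(1, len(b)):]

def evolve(fitness, pop_size=50, generations=30):
    """Plain GA loop: evaluate, keep the fitter half, recombine, and mutate."""
    population = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)
```

The fitness argument is where an emotion model enters: a function that renders the note sequence to audio and scores how strongly the result expresses the target emotion, such as the random-forest-based scoring sketched below.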
While GAs have been used together with a multi-layer perceptron to produce music as far back as 1995 [9], the current study concerns non-speech sounds, and that approach, combined with the other machine learning methods mentioned here to validate the emotional meaning of the sounds, has not previously been applied to social robots.
1.1. Related Work on Interpretation of NLUs
Researchers have demonstrated the usefulness of NLUs for social robots. They can be applied to simple robots, robot pets, or toys that do not require complex speech [2], or even to location-tracking devices [10]. NLUs can also function independently of any specific language and are able to convey simple sentiments very rapidly [11]. Researchers [1,3,5,6] have developed various methods of NLU generation for the expression of intention, emotion, and sometimes dialogue parts. Most have stopped after assessing the recognition rate of emotions from the sounds. This study introduces an automated technique for generating a broad array of NLU sounds that social robots can use naturally and dynamically to convey a wide spectrum of emotions. The method employs a genetic algorithm alongside a random forest model, which has been trained on a selection of representative non-speech sounds, to assess and ensure that each generated sound effectively communicates the intended emotion.
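To illustrate how a pre-trained emotion model could serve as the fitness function for such a GA, the sketch below scores a rendered candidate by how close its predicted valence and arousal are to a target point; the helper names (render, extract_features) and the use of scikit-learn's RandomForestRegressor are illustrative assumptions rather than a description of the exact pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def emotion_fitness(waveform, target_va, model, extract_features):
    """Higher is better: negative distance between the model's predicted
    (valence, arousal) for the sound and the target emotion's coordinates."""
    features = np.asarray(extract_features(waveform)).reshape(1, -1)
    predicted_va = model.predict(features)[0]
    return -float(np.linalg.norm(predicted_va - np.asarray(target_va)))

# Hypothetical wiring with a model trained on an annotated NLU corpus:
# model = RandomForestRegressor(n_estimators=200).fit(X_train, y_train)
# fitness = lambda genome: emotion_fitness(render(genome), (0.8, 0.6), model, extract_features)
```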
Read [12] found that adults and children could interpret NLUs that conveyed four basic emotions: happiness, sadness, anger, and fear. The authors suggested that NLUs be used as an additional modality with visual cues and speech, rather than a replacement for them. The same authors [11] also discovered that children could easily interpret NLUs in terms of affect, such as happiness, sadness, anger, or fear, though their interpretations were not always consistent with each other. On the other hand, adults interpreted NLUs categorically and did not distinguish subtle differences between NLUs. Subsequently, Read [13] investigated how situational context influenced the interpretation of NLUs and found that context overrode the interpretation of the NLU itself: the same NLU could be interpreted differently depending on the situational context, and when the situational context and the NLU aligned, the interpretation was intensified.
Latupeirissa et al. [14] carried out a study to analyze the sounds of popular fictional robot characters from movies to determine the key features that enable them to convey emotion. They identified important functional categories of robot sound, including the robot’s inner workings, communication of movement, and conveying of emotion. They also suggested that the sounds used in films, having been designed for that purpose, inevitably lead to expectations from the audience regarding how robots should sound in the real world. The authors found that the long-term average spectrum (LTAS) effectively characterized robot sounds and that the sonic characteristics of robots in films varied with their movements and physical appearance. Lastly, they observed that the sounds of the robots they studied used a wider range of frequencies than humans do when speaking. Jee et al. [1] also studied the sounds of robots featured in popular films, namely R2D2 and WallE, with the aim of identifying the fundamental factors employed in conveying emotions and intentions. To accomplish this, the researchers devised a set of seven musical sounds, five for intentions and two for emotions, which were evaluated using their English teaching robot Silbot. Among the parameters examined, intonation, pitch, and timbre emerged as the most prominent in effectively expressing emotions and intentions. The study established standard intonation and pitch contours while emphasizing the importance of crafting non-linguistic sounds that possess a universal quality, transcending any specific culture or language. Notably, the recognition rate experiment demonstrated that a combination of five sounds effectively conveyed intentions, while two sounds sufficed for emotions. The findings indicated that R2D2 relied on intonation and pitch to communicate emotion and intention, whereas WallE used pitch variation, intonation, and timbre to accentuate its communicative intentions. The timbre component represented the character of the speaker, with R2D2’s metallic beeping nature symbolizing an honest, trustworthy persona. The pitch range of the sounds spanned from 100 Hz to 1500 Hz, aligning with the typical range for human communication. Overall, the study concluded that non-verbal sounds (NLUs) hold promise for effective human–robot communication, with 55% of participants successfully recognizing the sounds for intentions, and 80% recognizing the sounds for emotions.
Khota et al. [15] developed a model to infer the valence and arousal of 560 NLUs extracted from popular movies, TV shows, and video games. Three sets of audio features, which included combinations of spectral energy, spectral spread, zero-crossing rate (ZCR), mel frequency cepstral coefficients (MFCCs), audio chroma, pitch, jitter, formant, shimmer, loudness, and harmonics-to-noise ratio, were used. These features were extracted from the sounds and, after feature reduction where applicable, the best-performing models used a random forest regressor and inferred emotional valence with a mean absolute error (MAE) of 0.107 and arousal with an MAE of 0.097. The correlation coefficients between predicted and actual valence and arousal were 0.63 and 0.75, respectively. This random forest regression model is used in the current work as well and is referred to repeatedly in this paper. Korcsok et al. [16] applied coding rules based on animal calls and vocalizations in the design of NLUs for social robots. They synthesized their sounds using sine waves as a basis and progressively altered the pitch, duration, harmonics, and timbre to modify them. They carried out experiments where they verified that humans could recognize the emotions conveyed by the sounds. They also suggested that sounds with higher frequencies corresponded to higher arousal, while shorter sounds corresponded to a positive valence. The researchers used a linear mixed model to infer the valence and arousal of the sounds and obtained correlation coefficients between 0.5 and 0.6 for both.
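For readers unfamiliar with this kind of pipeline, the following sketch extracts a broadly comparable (though not identical) feature set with librosa and fits a random forest regressor for valence and arousal; the specific features, hyperparameters, and data handling are assumptions and do not reproduce the model of [15].

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

def extract_features(path):
    """Summarize an audio file with a few of the feature families mentioned above."""
    y, sr = librosa.load(path, sr=22050)
    feats = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
    ]
    # Mean and standard deviation over time for every feature dimension.
    return np.concatenate([np.r_[f.mean(axis=1), f.std(axis=1)] for f in feats])

def fit_valence_arousal(paths, labels):
    """paths: audio files; labels: array of shape (n, 2) with annotated (valence, arousal)."""
    X = np.vstack([extract_features(p) for p in paths])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
    print("held-out MAE:", mean_absolute_error(y_te, model.predict(X_te)))
    return model
```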
Komatsu [17] showed that NLUs could convey positive and negative affect using simple sine tones with rising, falling, or flat frequency gradients. Rising frequency gradients corresponded to positive emotions while falling frequency gradients corresponded to negative emotions. The same patterns were reported in studies on sounds known as earcons, which are everyday tones and sounds used in computers, mobile phones, and other machines to signal basic feedback to the user, such as task completion, notifications, or warnings [18]. Komatsu [19] also suggested that NLUs help to manage the expectations of users of social robots.
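The rising/falling-gradient tones described in [17,18] are straightforward to picture; as a small illustrative example (not the original stimuli), the snippet below synthesizes a sine tone whose frequency glides linearly up or down over its duration.

```python
import numpy as np

def gliding_tone(f_start, f_end, duration=0.6, sr=22050):
    """Sine tone whose frequency moves linearly from f_start to f_end (Hz).
    Following the pattern reported above, a rising glide would suggest
    positive affect and a falling glide negative affect."""
    n = int(sr * duration)
    freq = np.linspace(f_start, f_end, n)          # instantaneous frequency, Hz
    phase = 2.0 * np.pi * np.cumsum(freq) / sr     # integrate frequency to get phase
    return 0.5 * np.sin(phase)

positive_tone = gliding_tone(400.0, 800.0)   # rising gradient
negative_tone = gliding_tone(800.0, 400.0)   # falling gradient
```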
Savery et al. [20] explored the effect of musical prosody on interactions between humans and groups of robots. They introduced the concept of entitativity, meaning the perception of the group as being a single entity, and found that alterations of prosodic features of musical sounds increased the likeability and trust of the robots as well as their entitativity. The results of this study suggest that NLUs can also improve interactions between humans and groups of robots.
These studies demonstrate the potential of NLUs in expressing emotions and affect for social robotics. NLUs hold promise for applications involving uncomplicated robots, robotic companions, or toys that do not need intricate speech capabilities, instead relying on endearing or subtle sounds made feasible by NLUs. Furthermore, NLUs possess language-agnostic qualities, allowing for swift and efficient communication of messages. They can be understood by both children and adults, effectively conveying fundamental emotions such as joy, sadness, anger, and fear. The interpretation of NLUs can be influenced by situational context, and they can function as an additional modality of communication alongside speech and visual cues [12,13,21]. Insights into how emotions can be expressed have been gleaned from analyzing the sounds and speech of robots featured in popular films. When designing NLUs for social robots, crucial factors to consider include intonation, pitch, timbre, and communicative movements. Various experiments and studies have consistently affirmed the efficacy of NLUs in conveying emotions, underscoring their role in managing user expectations and enhancing human–robot interaction.
1.2. Related Work on Design Methods, Models, and Systems for Producing NLUs
Researchers [16,22,23] have drawn inspiration from human speech, music, and animal sounds when studying and generating NLUs. Compared to these sounds, NLUs are abstract, lack defined rules, and are not as easily recognizable. They are more open to interpretation and can originate from a wide variety of existing and imaginary sources.
Jee et al. [23] used music theory and notation to design NLUs for happiness, sadness, fear, and disgust. Experiments were conducted to test emotion recognizability using facial expressions alone and NLUs combined with facial expressions. Fernandez et al. [5] suggested that the use of NLUs in some form as part of the communication system of a social robot would enhance its expressiveness. They developed a novel approach to the creation of NLUs based on a “sonic expression system”, using a dynamic approach in which parameters are modulated according to the context of the interaction between the robot and the user. The researchers proposed a concept called the quason, which they defined as “the smallest sound unit that holds a set of indivisible psycho-acoustic features that makes it perfectly distinguishable from other sounds, and whose combinations generate a more complex individual sound unit”. The main parameters of the quason are amplitude, frequency, and duration. Individual quasons combine to form sonic expressions. Such expressions were created for communicative acts including agreement, hesitation, denial, questioning, hush, summon, encouragement, greeting, and laughter. Each sonic expression, created with the help of a sound designer, was produced at three intensity levels and was evaluated by 51 subjects in an experiment. The results of the experiment showed that, while most sounds were categorized correctly, some categories, such as agreement, encouragement, and greeting, were easily confused with each other, whereas denial, laughter, questioning, summon, and hush were more easily distinguishable. The researchers recommended integrating their Sonic Expression System into a full, multi-modal communication system for robot agents. In addition, they determined that fundamental frequency, pauses, volume contour, rhythm, articulation, and speech rate affected the perceived emotion.
Khota et al. [6,24] modeled NLUs in terms of dialogue parts using DAMSL (Dialogue Act Markup in Several Layers). A total of 53 sounds were created using PureData [25] by combining and modulating sine, saw, and square waveforms, in a similar way to the approach of [5], while randomly varying the number of notes and the frequency of each note. A total of 31 subjects evaluated the sounds based on communicative dialogue acts including greeting, reject, question, thanking, accept, apology, non-understanding, and exclamation. Analyses, including factor analysis, showed that the pitch, timbre, and duration of the sounds had important effects on how participants interpreted them.
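A comparable synthesis step can be sketched with NumPy and SciPy by mixing sine, sawtooth, and square components over a random sequence of notes; this is a hypothetical illustration, not a reproduction of the PureData patches used in [6,24].

```python
import numpy as np
from scipy import signal

SR = 22050
rng = np.random.default_rng(0)

def random_nlu(num_notes=None):
    """Random NLU-like sound: a note sequence, each note a mix of sine, saw, and square waves."""
    num_notes = num_notes or rng.integers(2, 7)
    notes = []
    for _ in range(num_notes):
        freq = rng.uniform(200.0, 1500.0)        # Hz (assumed range)
        dur = rng.uniform(0.08, 0.3)             # seconds (assumed range)
        t = np.linspace(0.0, dur, int(SR * dur), endpoint=False)
        mix = rng.dirichlet(np.ones(3))          # random blend of the three waveforms
        wave = (mix[0] * np.sin(2 * np.pi * freq * t)
                + mix[1] * signal.sawtooth(2 * np.pi * freq * t)
                + mix[2] * signal.square(2 * np.pi * freq * t))
        notes.append(0.5 * wave)
    return np.concatenate(notes)
```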