Article

Within and Across-Language Comparison of Vocal Emotions in Mandarin and English

Ting Wang, Yong-cheol Lee and Qiuwu Ma
1 School of Foreign Languages, Tongji University, Shanghai 200092, China
2 Department of English Literature and Language, Cheongju University, Cheongju 28503, Korea
3 School of Foreign Languages and Literature, Fudan University, Shanghai 200433, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2018, 8(12), 2629; https://doi.org/10.3390/app8122629
Submission received: 2 November 2018 / Revised: 6 December 2018 / Accepted: 11 December 2018 / Published: 14 December 2018
(This article belongs to the Section Acoustics and Vibrations)

Abstract

This study reports experimental results on whether the acoustic realization of vocal emotions differs between Mandarin and English. Prosodic cues, spectral cues and articulatory cues obtained by electroglottography (EGG) for five emotions (anger, fear, happiness, sadness and neutral) were compared within and across Mandarin and English through a production experiment. Results of the within-language comparison demonstrated that each vocal emotion had specific acoustic patterns in each language. Normalized data were then used in the across-language comparison. Results indicated that Mandarin and English use different mechanisms for encoding emotions with pitch: the differences in pitch variation between neutral and the other emotions were significantly larger in English than in Mandarin. However, the variations in speech rate and certain phonation cues (e.g., CPP (Cepstral Peak Prominence) and CQ (Contact Quotient)) were significantly greater in Mandarin than in English. The differences in emotional speech between the two languages may be due to the restriction of pitch variation by the presence of lexical tones in Mandarin. This study reveals an interesting finding: when a certain cue (e.g., pitch) is restricted in one language, other cues are strengthened to take on the role of differentiating vocal emotions. We therefore posit that the acoustic realization of emotional speech is multidimensional.

1. Introduction

Human speech is believed to be a reliable means of conveying a speaker’s emotional state [1]. In the past, researchers looked for acoustic cues that could clearly distinguish different vocal emotions (e.g., [2,3,4], among many others)—mostly anger, fear, happiness and sadness from a core set of basic emotions [5]. Much of the previous research is limited to prosodic cues including fundamental frequency (f0), duration and intensity [2,6], which are considered to be sufficient to differentiate vocal emotions [7]. However, studies focusing on emotions such as anger and happiness report similar prosodic patterns with high mean pitch, fast speech rate and high intensity [2], meaning that it is difficult to distinguish these two emotions using only prosodic cues. Moreover, evidence from research on emotional speech synthesis and recognition demonstrates that multiple cues in addition to f0, duration and intensity are necessary for the acoustic realization of emotional speech [8,9]. Thus, the understanding of the acoustic mechanisms underlying vocal emotions remains incomplete based on prosodic cues.
Recent trends in the acoustic analysis of vocal emotions have led to examinations of phonation-related spectral cues. Patel et al. (2011) [10] investigated the mapping between phonation cues and emotion dimensions in French speech. Results indicated three underlying components (tension, perturbation and voicing frequency), which were claimed to relate to the emotion dimensions of arousal, potency and control, respectively. Xu et al. (2013) [11] proposed a set of spectral and f0 cues, named BID measurements, to test the size code hypothesis using synthetic speech. Results revealed that listeners could infer emotions from changes in cues caused by the physiological structure of a speaker. Similar ideas are also found in References [12,13,14]. Wang et al. (2014) [15] highlighted the importance of phonation cues in differentiating vocal emotions in Mandarin: by adding phonation-related spectral cues to a multi-dimensional scaling acoustic space, different emotions were separated clearly. Although these studies have enriched the knowledge of vocal expression of emotions, few attempts have been made to analyse the different roles played by prosodic cues and phonation cues in differentiating vocal emotions.
Another important issue is whether the vocal expression of emotions is universal across languages or language-specific. Many research studies attempted to untangle this debate through cross-cultural perception experiments by asking subjects from different language backgrounds to identify the underlying emotions expressed by speakers from either the same or a different language background (e.g., [16,17,18,19,20]). Results showed that listeners could successfully recognize vocal emotions expressed in non-native languages with no great variation in perception except for an in-group advantage when judging emotions expressed in their native languages [7]. Based on the evidence in perception studies, vocal expression of emotions seems to be universal across languages.
Unlike perception, only a limited number of studies have investigated the cross-language vocal expression of emotions through production experiments. This may be due to the difficulties in controlling the elicitation of emotions in different languages and in designing experimental paradigms. Among the few studies, Pell et al. (2009) [7] compared acoustic parameters of six vocal emotions in English, German, Hindi and Arabic. Results showed that mean pitch, pitch range and speech rate contributed in a similar way to differentiating vocal emotions across languages. In contrast, Anolli et al. (2008) [21] investigated the vocal expression of emotions in Chinese and Italian and found differences in the specific patterns of acoustic cues, including f0, duration and intensity, between the two languages. For example, their results showed that vocal emotions in Chinese had more restrained variation in acoustic cues across emotions than those in Italian, indicating that the production of vocal emotions may be influenced by culture. Similarly, Li et al. (2013) [19] analysed the f0 features of vocal emotions in Mandarin and Japanese. Results indicated that the two languages had distinct f0 features, such as a lower pitch for sadness in Mandarin but a higher pitch in Japanese, which was attributed to the influence of different language structures. Considering all of this evidence, there is still disagreement in production studies as to whether the vocal expression of emotions is universal or language-specific.
A separate body of production literature has investigated the influence of the lexical tone system in tonal languages on the marking of vocal emotions. The motivation to examine this possible influence arose from the hypothesis that the existence of a lexical tone system may restrict the paralinguistic use of pitch [22,23]. Ross et al. (1986) [24] found that the manipulation of f0 measures in emotional prosody was restricted in tonal languages, including Taiwanese, Mandarin and Thai, compared to English as a non-tonal language. In References [25,26], prosodic cues of vocal emotions in Mandarin were analysed across different tone groups. Pitch variation was found to be restricted in sentences composed entirely of high-level-tone syllables, because the pitch of a high-level tone must remain quasi-static to maintain its tonal structure, indicating that the acoustic patterns of the same emotion can differ even within a language. Likewise, Chong et al. (2015) [27] reported a restriction of f0 variation in signalling vocal emotions in Cantonese, a tonal language, when compared to English, a non-tonal language.
Taken together, the established literature has some limitations. First, a full discussion of comprehensive acoustic cues, including both prosodic and phonation cues, needs to be made within one research paradigm. Much of the previous research emphasizes prosodic cues only in encoding vocal emotions [2], especially in cross-language studies, while a smaller set of studies focuses on phonation cues separately [28,29], which limits a full understanding of the mechanisms that underpin vocal emotions. Second, when examining the phonation mechanisms of vocal emotions using spectral measures, noise in the speech signal caused by poor recording quality may lead to unreliable measurement. In addition, the short distance between the first harmonic and the first formant in the spectrum adds difficulty to extracting the spectral measures. Moreover, it is difficult to confirm the physiological mechanism of voice production from spectral measures alone, since different physiological mechanisms may yield similar spectral values; for example, breathy voice and soft voice both have a relatively large H1-H2 value [30]. Thus, an EGG experiment is worthwhile for directly observing glottal activity and, more importantly, for supplementing spectral measures. Third, very little is currently known about how other cues change if pitch variation is restricted when encoding vocal emotions in tonal languages such as Mandarin.
With these issues in mind, the current study aims to improve the understanding of the underlying mechanisms of vocal emotions in a tonal language (Mandarin) and a non-tonal language (English) through the collective analysis of prosodic cues, spectral cues and EGG cues. These two languages were selected because they are typical tonal and non-tonal languages with distinct structures. Moreover, both have been widely studied, so our results can be compared with the existing literature. With this approach, we attempt to answer the following research questions: (a) whether Mandarin and English have different patterns of prosodic cues, spectral cues and EGG cues in the vocal expression of emotions; (b) whether Mandarin shows restricted pitch variation in encoding vocal emotions compared to English; and (c) if so, how other cues change to compensate for the restricted use of pitch. The preliminary results of this study were presented at the 8th International Conference on Speech Prosody [31].

2. Methods

2.1. Speech Materials

Speech materials consisted of 15 declarative sentences. They were first designed in Chinese and then translated into English. Each target sentence was semantically neutral and suitable for conveying different emotions. The sentences were embedded in specific contexts to reflect five different emotions: anger, fear, happiness, sadness and neutral. Table 1 lists a sample target sentence in four emotional contexts in the two languages.

2.2. Subjects

We recruited five native speakers (two males and three females) for each language. All speakers were graduate students at the University of Pennsylvania. The Mandarin speakers had spent less than a year in the U.S. at the time of recording. All speakers had acting and public speaking experience. Participants signed a consent form before the experiment and received ten dollars as compensation for their time. None reported any problems with speech or hearing.

2.3. Recording Procedure

Recordings were conducted in a sound-proof booth in the Department of Linguistics at the University of Pennsylvania. We obtained simultaneous electroglottograph (EGG) and audio recordings from all speakers. Audio recordings were made electronically and saved directly on a computer as 16-bit wave files at a sampling rate of 44.1 kHz, using a Glottal Enterprises M80 omnidirectional headset microphone. EGG data were obtained using a two-channel Glottal Enterprises Electroglottograph, model EG2. During the recordings, we presented speech materials on PowerPoint slides.
Each emotion was recorded in a separate block and speakers were offered a break between blocks to allow a smooth transition from one emotion to another. Neutral sentences were produced in isolation, while sentences with anger, fear, happiness and sadness were embedded in a dialogue setting in which speakers conversed with a native speaker. This dialogue setting enabled speakers to express the different emotions in a natural way. In total, we collected 750 sentences (15 sentences × 5 speakers × 2 repetitions × 5 emotions) for each language.

2.4. Listening Tests

Listening tests were conducted with the respective Mandarin and English stimuli to confirm that the intended emotions were accurately produced, using the Qualtrics online survey tool [32]. For each language, the stimuli were divided into five sets, one per speaker. Native Mandarin and English listeners were recruited online, with at least 15 listeners per set. During the listening tests, participants listened carefully to randomized audio stimuli and selected the most appropriate emotion in a five-choice task (anger, happiness, fear, neutral and sadness) by mouse click.
Recordings whose identification rate was below 60%, i.e., three times the 20% chance level of the five-choice task [7], were excluded. In order to compare the five emotions with parallel texts in each language, we retained 530 perceptually valid recordings (106 sentences × 5 emotions) in Mandarin and 300 perceptually valid recordings (60 sentences × 5 emotions) in English.
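This screening step amounts to a simple threshold filter. A minimal Python sketch, assuming a table of per-recording identification rates (the table and column names are illustrative, not the authors' actual data structures):

```python
import pandas as pd

# One row per recording; 'identification_rate' is the proportion of
# listeners who chose the intended emotion (illustrative values).
ratings = pd.DataFrame({
    "recording": ["m01_anger_03", "m01_sad_07"],
    "identification_rate": [0.87, 0.53],
})

# Keep recordings identified at >= 60%, three times the 20% chance
# level of the five-choice task.
valid = ratings[ratings["identification_rate"] >= 0.60]
```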

2.5. Measurements

Automatic segmentation of the recordings was performed with SPPAS [33]. Three tiers (phoneme, syllable and sentence) were generated and manually corrected afterward. A sequence of pitch target points was detected by the Momel algorithm [34] in Praat. Pitch contours were then modelled as continuous smooth curves, interpolated quadratically from the pitch target points of each utterance in order to eliminate microprosodic effects. Based on the Momel outputs, a set of prosodic cues was measured to quantify the degree of pitch movement as a function of time.
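Momel itself runs in Praat, but the smoothing step can be sketched as quadratic interpolation over the detected target points. A minimal Python sketch, with made-up target values for illustration:

```python
import numpy as np
from scipy.interpolate import interp1d

# Hypothetical Momel pitch targets for one utterance: (time s, f0 Hz).
t_targets = np.array([0.05, 0.40, 0.80, 1.20, 1.55])
f0_targets = np.array([180.0, 230.0, 195.0, 210.0, 160.0])

# Quadratic interpolation between targets yields a continuous smooth
# pitch curve with microprosodic perturbations removed.
smooth = interp1d(t_targets, f0_targets, kind="quadratic")
t = np.arange(t_targets[0], t_targets[-1], 0.01)  # 10 ms steps
f0_curve = smooth(t)
```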
Nine prosodic cues, in seven groups, were generated by a Praat script: (1) number of syllables per second (Speech Rate); (2) mean intensity of each sentence (Mean Intensity); (3) mean pitch of each sentence (Mean Pitch); (4) pitch range of each sentence (Pitch Range); (5) average absolute difference between two adjacent pitch target points divided by their distance in seconds (Mean Absolute Slope, MAS hereafter), reflecting the frequency of pitch movements [35]; (6) average pitch difference between two adjacent pitch target points, computed separately for rises and falls in each sentence (Rise, Fall), quantifying the degree of pitch raising and falling [35]; and (7) average slope, computed separately for rises and falls in each sentence (Rise Slope, Fall Slope), indicating the speed of pitch raising and falling [35]. These cues capture both global and local properties of each sentence. Pitch-related cues were normalized to eliminate individual differences using the OMe (Octave-Median) scale [36] by applying the following equation:
OMe = log2 (Hz/Median)
where Hz is a raw pitch value and Median is the speaker's median pitch.
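As a sketch, the OMe conversion can be implemented in a few lines of Python (the function name is ours):

```python
import numpy as np

def to_ome(f0_hz, speaker_median_hz):
    """Convert raw f0 (Hz) to the Octave-Median (OMe) scale:
    OMe = log2(Hz / median). 0 is the speaker's median pitch;
    +1/-1 are one octave above/below it."""
    return np.log2(np.asarray(f0_hz, dtype=float) / speaker_median_hz)

# A pitch point one octave above a 200 Hz median maps to 1.0:
print(to_ome([100.0, 200.0, 400.0], 200.0))  # [-1.  0.  1.]
```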
In addition, we obtained a set of spectral cues over the voiced portions of each sentence using VoiceSauce [37]: (1) Cepstral Peak Prominence (CPP), indicating harmonics-to-noise ratio and periodicity [38]; (2) amplitudes of the first, second and fourth harmonics (H1, H2 and H4); (3) amplitude difference between the first and second harmonics (H1-H2), reflecting the relative breathiness or creakiness of phonation [39]; and (4) amplitude differences between the first harmonic and the harmonics nearest to F1, F2 and F3 (H1-A1, H1-A2 and H1-A3), as measures of spectral tilt [30]. This set of spectral cues is commonly used as a standard for measuring different phonation properties [40].
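To illustrate the idea behind the harmonic-amplitude measures, here is a rough, uncorrected Python sketch that reads harmonic amplitudes off an FFT spectrum. VoiceSauce additionally applies formant-based corrections, so treat this only as an approximation (function and parameter names are ours):

```python
import numpy as np

def harmonic_amps_db(frame, sr, f0, harmonics=(1, 2, 4)):
    """Rough harmonic amplitudes (dB) for one voiced frame with known f0."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    amps = {}
    for h in harmonics:
        # strongest bin within +/- f0/4 of the expected harmonic frequency
        band = (freqs > h * f0 - f0 / 4) & (freqs < h * f0 + f0 / 4)
        amps[f"H{h}"] = 20 * np.log10(spec[band].max())
    amps["H1-H2"] = amps["H1"] - amps["H2"]  # low-frequency spectral tilt
    return amps
```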
Next, three kinds of EGG cues were extracted using EggWorks [41]: (1) Contact Quotient (CQ), the proportion of each glottal period during which the vocal folds are in contact [42]; (2) Peak Increase in Contact (PIC), the peak positive value of the EGG derivative, indicating the highest speed of vocal fold contact [43]; and (3) Speed Quotient (SQ), the ratio of closing duration to opening duration, reflecting the asymmetry of the EGG pulses [44]. These cues are indicators of the physiological mechanism of vocal fold vibration during speech production.
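A simplified, threshold-based sketch of these three measures for a single EGG cycle is given below. EggWorks uses more elaborate detection criteria, so this only illustrates the definitions (the threshold value and function name are ours):

```python
import numpy as np

def egg_cues(cycle, threshold=0.25):
    """Threshold-based CQ, PIC and SQ for one EGG cycle
    (higher signal value = greater vocal fold contact)."""
    x = (cycle - cycle.min()) / (cycle.max() - cycle.min())  # scale to 0..1
    contact = x >= threshold                   # samples counted as contact
    cq = contact.mean()                        # contact quotient
    pic = np.diff(x).max()                     # peak increase in contact
    peak = np.argmax(x)                        # maximum-contact sample
    onset = np.argmax(contact)                 # first contacting sample
    offset = len(x) - 1 - np.argmax(contact[::-1])  # last contacting sample
    closing = peak - onset                     # closing-phase duration
    opening = offset - peak                    # opening-phase duration
    sq = closing / max(opening, 1)             # speed quotient
    return cq, pic, sq
```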
In summary, in order to assess the vocal expression of emotion in Mandarin and English, we measured a total of 20 cues, which were then converted to z-scores within each speaker, pooling all emotions.
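The per-speaker z-scoring can be sketched with pandas as follows (column names are illustrative):

```python
import pandas as pd

def zscore_within_speaker(df, cue_cols):
    """Z-score each cue within speaker, pooling all emotions, so that
    speakers' individual baselines and scales are factored out."""
    out = df.copy()
    g = out.groupby("speaker")[cue_cols]
    out[cue_cols] = (out[cue_cols] - g.transform("mean")) / g.transform("std")
    return out

# e.g., cue_cols = ["mean_pitch", "pitch_range", "speech_rate", "cpp", "cq"]
```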

3. Results

In this section, we first examine whether all the cues differed significantly between emotions within each language. After that, normalized data are used to compare the acoustic and articulatory patterns of vocal emotions across languages.

3.1. Within-Language Comparison of Vocal Emotions

Table 2 presents the normalized mean values for each emotion in Mandarin and English. To test whether each of the 20 cues discriminated between the five emotions within each language, we fitted linear mixed-effects models with the lmerTest package [45] in R [46]. Each cue was fitted in a separate model for each language, with emotion as a fixed factor and speaker as a random factor. Kolmogorov-Smirnov (K-S) tests were used to assess the normality of the model residuals. Results supported the normality assumption for most variables, the exceptions being the residuals of range (p = 0.001), rise (p = 0.003), fall (p = 0.018), rise slope (p = 0.001), fall slope (p = 0.023), PIC (p = 0.006) and SQ (p < 0.001) in Mandarin and rise (p < 0.001), fall (p = 0.013), rise slope (p < 0.001) and SQ (p = 0.007) in English. We attempted data transformations (e.g., log transformation) to improve the normality of the residuals, but these caused more variables to violate the normality assumption and made the results harder to interpret. We therefore did not transform the data, as the normality assumption is of relatively little importance in regression models [47]. Below, we present the results in the order of prosodic cues, spectral cues and EGG cues for each language.
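The paper's analysis was run with lmerTest in R. A rough Python analogue of the per-cue test, using a likelihood-ratio comparison against a null model plus the K-S residual check, might look like this (data frame and column names are assumptions):

```python
import statsmodels.formula.api as smf
from scipy import stats

def emotion_effect(df, cue):
    """Likelihood-ratio test for an effect of emotion on one cue,
    with a random intercept per speaker."""
    full = smf.mixedlm(f"{cue} ~ C(emotion)", df,
                       groups=df["speaker"]).fit(reml=False)
    null = smf.mixedlm(f"{cue} ~ 1", df,
                       groups=df["speaker"]).fit(reml=False)
    lr = 2 * (full.llf - null.llf)
    p_emotion = stats.chi2.sf(lr, df=4)  # 5 emotions -> df = 4
    # K-S normality check on standardized residuals, as in the paper
    z = (full.resid - full.resid.mean()) / full.resid.std()
    p_normal = stats.kstest(z, "norm").pvalue
    return lr, p_emotion, p_normal
```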

3.1.1. Mandarin

There was a significant effect of emotion on all prosodic cues in Mandarin (Mean pitch: χ² = 565.29, df = 4, p < 0.001; Pitch range: χ² = 205.71, df = 4, p < 0.001; Mean absolute slope: χ² = 320.36, df = 4, p < 0.001; Rise: χ² = 93.82, df = 4, p < 0.001; Fall: χ² = 171.16, df = 4, p < 0.001; Rise slope: χ² = 106.39, df = 4, p < 0.001; Fall slope: χ² = 152.80, df = 4, p < 0.001; Speech rate: χ² = 161.24, df = 4, p < 0.001; Intensity: χ² = 806.23, df = 4, p < 0.001).
For spectral cues, the effect of emotion was significant as well (CPP: χ² = 259.72, df = 4, p < 0.001; H1: χ² = 209.87, df = 4, p < 0.001; H2: χ² = 262.13, df = 4, p < 0.001; H4: χ² = 273.75, df = 4, p < 0.001; H1-H2: χ² = 193.82, df = 4, p < 0.001; H1-A1: χ² = 73.816, df = 4, p < 0.001; H1-A2: χ² = 89.116, df = 4, p < 0.001; H1-A3: χ² = 90.89, df = 4, p < 0.001). Therefore, the vocal emotions demonstrated significantly different acoustic patterns in terms of both prosodic and spectral cues.
To understand the physiological mechanism of phonation in the different emotional expressions, we examined the EGG results of the five emotions in Mandarin. Figure 1 illustrates the pitch-normalized EGG waveforms of the five emotions, whose shapes differed across emotions. Neutral had a typical modal-phonation waveform, with a left-skewed EGG pulse indicating a slightly longer vocal fold opening phase than closing phase. The waveform of happiness showed a similar modal pattern. Anger had a larger vocal fold contact area and a slower closing phase, resembling tense phonation. The pulse shape for fear was symmetrical, indicating roughly equal opening and closing durations, and had a small contact area, suggesting breathy phonation. Sadness showed a slight degree of breathy phonation.
Three EGG parameters (CQ, PIC and SQ) were extracted to quantify the different EGG waveform patterns described above. Statistical results confirmed our observation that the influence of emotion was significant for CQ, χ² = 185.98, df = 4, p < 0.001, PIC, χ² = 70.07, df = 4, p < 0.001 and SQ, χ² = 107.92, df = 4, p < 0.001.
Table 3 shows the post-hoc multiple comparison (Tukey) results between emotions for each language. In Mandarin, each emotion pair could be distinguished by most of the prosodic cues. However, neutral and sadness did not differ significantly in mean absolute slope, rise or fall slope, and differed only marginally in pitch range. Likewise, fear and sadness showed no significant difference in rise and only a marginal difference in fall. These results suggest that these two pairs of emotions were produced with similar degrees of local pitch movement. For spectral cues, happiness and anger displayed similar patterns, given the paucity of significant differences in H1, H2, H4, H1-H2, H1-A1, H1-A2 and H1-A3, indicating that these two emotions had similar spectral distributions. For EGG cues, CQ, the contact quotient of the vocal folds, showed a significant difference among all emotions, suggesting that different emotions were expressed with different phonation mechanisms.
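The paper's pairwise comparisons come from the mixed models. As a simplified stand-in that ignores the speaker random effect, a Tukey HSD over emotions for one cue could be run as follows (the synthetic data and column names are illustrative):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
emotions = ["anger", "fear", "happiness", "neutral", "sadness"]
df = pd.DataFrame({
    "emotion": np.repeat(emotions, 30),
    "cq": rng.normal(0, 1, 150),  # stand-in for z-scored CQ values
})

res = pairwise_tukeyhsd(endog=df["cq"], groups=df["emotion"], alpha=0.05)
print(res.summary())  # one row per emotion pair, e.g., anger vs. fear
```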

3.1.2. English

Similar to Mandarin, the effect of emotion was significant on all prosodic cues in English (Mean pitch: χ² = 225.47, df = 4, p < 0.001; Pitch range: χ² = 78.57, df = 4, p < 0.001; Mean absolute slope: χ² = 140.70, df = 4, p < 0.001; Rise: χ² = 5.70, df = 4, p < 0.001; Fall: χ² = 46.64, df = 4, p < 0.001; Rise slope: χ² = 47.35, df = 4, p < 0.001; Fall slope: χ² = 55.00, df = 4, p < 0.001; Speech rate: χ² = 6.28, df = 4, p < 0.001; Intensity: χ² = 206.82, df = 4, p < 0.001).
For spectral cues, the effect of emotion was significant on CPP (χ² = 11.44, df = 4, p < 0.001), H1 (χ² = 33.18, df = 4, p < 0.001), H2 (χ² = 35.84, df = 4, p < 0.001), H4 (χ² = 30.04, df = 4, p < 0.001), H1-H2 (χ² = 21.81, df = 4, p < 0.001), H1-A1 (χ² = 32.70, df = 4, p < 0.001), H1-A2 (χ² = 17.08, df = 4, p < 0.001) and H1-A3 (χ² = 15.23, df = 4, p < 0.001).
Finally, statistical analysis yielded a significant emotion effect on all EGG cues: CQ, χ² = 8.61, df = 4, p < 0.001, PIC, χ² = 4.49, df = 4, p < 0.01 and SQ, χ² = 27.60, df = 4, p < 0.001. As illustrated in Figure 2, the phonation mechanisms of some emotions could be identified from the EGG waveforms. The most prominent was fear, whose symmetrical shape indicates breathy phonation. Anger had a slightly larger vocal fold contact area and a slower closing phase, close to tense phonation. However, compared with the Mandarin EGG waveforms in Figure 1, the contrasts in Figure 2 are smaller.
A post-hoc multiple comparison procedure was conducted to determine which emotion pairs differed significantly. The output of the multiple comparison analysis is shown in Table 3. For prosodic cues, sadness and neutral differed only in mean pitch, pitch range and intensity, indicating that they were expressed with similar degrees of local pitch fluctuation and similar speech rates. It is worth mentioning that for speech rate, only sadness was significantly different from anger, fear and happiness, respectively. For spectral cues, happiness and anger exhibited similar spectral distributions, as did sadness and neutral, since only a few cues differed. As for EGG cues, all contrasts were significantly different in terms of SQ except neutral versus fear. The CQ of fear was significantly different from all other emotions, suggesting that fear had a distinct phonation mechanism.
In the analysis above, we examined the effect of emotion on prosodic cues, spectral cues and EGG cues within Mandarin and English separately. Results showed that in both languages the five vocal emotions had different acoustic patterns and phonation mechanisms, indicating that prosodic cues, spectral cues and EGG cues were all indispensable for distinguishing these emotions. An issue we want to pursue further is whether language-specific differences exist between Mandarin and English vocal emotions. It is noteworthy that we did not compare these results across languages directly; for example, it is meaningless to compare the mean pitch of anger in Mandarin with the mean pitch of anger in English. Although the same experimental paradigm was applied to each language and all cues were z-score normalized within each language to eliminate individual differences, it is risky to compare the two languages directly, since variation in emotional effects may come from the languages themselves or from different intensities of emotional elicitation. Therefore, in this part, we focused only on how each cue behaved across emotions within each language and compared only the ordering of the five emotions for each cue in the two languages. For instance, anger had the highest pitch-related measures and sadness the lowest (except for mean pitch) in Mandarin, whereas in English happiness was ordered highest in pitch-related cues and neutral lowest.

3.2. Across-Language Comparison of Vocal Emotions: Mandarin versus English

As mentioned above, to overcome the evident biases arising from comparing absolute values across the two languages, we adopted the data normalization method of [21,48], according to the following equation:
Normalized X = (x − N)/N
where x is the absolute value of each parameter for anger, fear, happiness and sadness, respectively, and N is the absolute value of the same parameter for neutral. This equation thus yields the relative value of each parameter in each of the four emotions compared to the neutral baseline. The derived data take a positive or a negative value, depending on the direction of the difference from neutral.
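A minimal pandas sketch of this neutral-baseline normalization, assuming per-speaker mean values and illustrative column names:

```python
import pandas as pd

def normalize_to_neutral(df, cue_cols):
    """Express each emotional value as a proportion of change relative
    to the same speaker's neutral baseline: (x - N) / N."""
    neutral = (df[df["emotion"] == "neutral"]
               .groupby("speaker")[cue_cols].mean())
    out = df[df["emotion"] != "neutral"].copy()
    base = neutral.loc[out["speaker"]].to_numpy()
    out[cue_cols] = (out[cue_cols].to_numpy() - base) / base
    return out
```

Positive values then mean the cue exceeds its neutral baseline; negative values mean it falls below it.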
These derived data were analysed in a set of mixed repeated measures ANOVAs, with Emotion (anger, fear, happiness, sadness) as a within-subject factor and Language (Mandarin, English) as a between-subject factor. Mauchly's tests were conducted to check the assumption of sphericity. Results indicated that rise (χ² = 15.69, df = 5, p = 0.008), rise slope (χ² = 17.99, df = 5, p = 0.003), fall slope (χ² = 11.71, df = 5, p = 0.041), H1 (χ² = 13.63, df = 5, p = 0.019), H2 (χ² = 13.07, df = 5, p = 0.024), H4 (χ² = 25.02, df = 5, p < 0.001), H1-A1 (χ² = 25.36, df = 5, p < 0.001), H1-A2 (χ² = 22.04, df = 5, p = 0.001), H1-A3 (χ² = 19.08, df = 5, p = 0.002) and CQ (χ² = 11.97, df = 5, p = 0.037) violated the sphericity assumption. Therefore, the degrees of freedom for these variables were corrected using Greenhouse-Geisser estimates of sphericity. Results are displayed in Table 4 and analysed in the order of prosodic cues, spectral cues and EGG cues.
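In Python, the same design could be sketched with the pingouin package, which provides Mauchly's test and a mixed-design ANOVA. The synthetic data frame and column names below are assumptions; the paper's analysis additionally applied Greenhouse-Geisser correction where sphericity was violated:

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Synthetic stand-in for the neutral-normalized data: one row per
# speaker x emotion, with 'language' as the between-subject factor.
rng = np.random.default_rng(1)
speakers = [f"s{i}" for i in range(10)]  # 5 per language
emotions = ["anger", "fear", "happiness", "sadness"]
df_norm = pd.DataFrame([
    {"speaker": s, "language": "Mandarin" if i < 5 else "English",
     "emotion": e, "rise": rng.normal(0, 1)}
    for i, s in enumerate(speakers) for e in emotions
])

# Mauchly's test of sphericity for the within-subject factor
print(pg.sphericity(df_norm, dv="rise", subject="speaker", within="emotion"))

# Mixed-design ANOVA: Emotion within-subject, Language between-subject
aov = pg.mixed_anova(data=df_norm, dv="rise", within="emotion",
                     subject="speaker", between="language")
print(aov[["Source", "F", "p-unc", "np2"]])
```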

3.2.1. Prosodic Cues for Encoding Emotions in Mandarin and English

As presented in Figure 3, the variation of pitch-related cues around 0 (the neutral baseline) is greater in English than in Mandarin, except for mean pitch. The main effect of Emotion was significant on all seven pitch-related cues, as shown in Table 4. In terms of Language, Mandarin and English differed significantly in pitch range, mean absolute slope, rise, fall, rise slope and fall slope. The Emotion × Language interaction was significant for mean absolute slope, rise, fall and fall slope, suggesting that English vocal emotions showed more dynamic pitch movements than Mandarin vocal emotions. Specifically, compared to neutral, happiness, fear and anger were realized with a raised mean pitch in both Mandarin and English, whereas the change in mean pitch for sadness was small in both languages. For pitch range, happiness and anger showed an expanded pitch range relative to neutral in both languages, with a much greater degree of expansion in English. Fear and sadness showed different pitch range patterns in the two languages: their pitch range values were negative in Mandarin, indicating compression relative to neutral, but positive in English, indicating expansion. Likewise, the mean absolute slope of sadness was negative in Mandarin and positive in English, with a greater degree of variation, suggesting that pitch fluctuation was suppressed in Mandarin but expanded in English. For rise, rise slope and fall, anger and happiness were higher than neutral in both languages, while fear and sadness showed opposite patterns; again, English showed a clearly greater degree of change relative to neutral. Anger, fear and happiness had steeper fall slopes than neutral in both Mandarin and English, but the fall slope of sadness was reduced in Mandarin and increased in English.
Unlike the pitch-related cues, the variation of speech rate around 0 is much greater in Mandarin than in English. Statistical results showed a significant main effect of Emotion on both speech rate and mean intensity, as indicated in Table 4. Language had a significant main effect only on speech rate, and there was a significant Emotion × Language interaction for speech rate. As Figure 3 shows, anger, fear and happiness increased in speech rate in both Mandarin and English, while sadness slowed relative to neutral; however, the degree of change in Mandarin was much larger than in English. For intensity, both languages showed similar patterns: increased intensity for anger, fear and happiness and lowered intensity for sadness.

3.2.2. Phonation Cues for Encoding Emotions in Mandarin and English

Figure 4 illustrates the effects of Emotion and Language on the comprehensive spectral cues. The mean differences in CPP, H4 and H1-H2 among the four emotions are clearly greater in Mandarin than in English. The mixed repeated measures ANOVAs showed that Emotion had a significant main effect on CPP and H2, while Language had no significant main effect on any spectral cue. The Emotion × Language interaction was significant for CPP, meaning that the two languages had different CPP patterns. Specifically, the difference in CPP between anger and neutral was small in both Mandarin and English, with mean values close to 0. The CPP of fear and sadness was smaller than that of neutral in both languages, and that of happiness was greater, indicating that fear and sadness were less periodic, with increased spectral noise, in both languages; Mandarin showed a greater degree of change relative to neutral. The pattern of H1 was similar to that of mean pitch: anger and happiness increased H1, whereas sadness lowered H1 in Mandarin. The changes in H2 were similar in Mandarin and English: anger and happiness increased H2, while sadness and fear reduced it.
To understand the physiological mechanisms of vocal emotions in Mandarin and English, the three EGG cues were analysed. Figure 5 displays the effects of Emotion and Language on these cues. As can be seen from the figure, the mean differences in all three cues among the four emotions are greater in Mandarin than in English. In terms of Emotion, there were significant differences among the four emotions in CQ (F(3,24) = 20.77, p < 0.001) and SQ (F(3,24) = 20.41, p < 0.001) and a marginally significant difference in PIC (F(3,24) = 2.93, p = 0.054). However, the main effects of Language on these three cues were not significant. The Emotion × Language interaction was significant for CQ (F(3,24) = 9.52, p < 0.001) and SQ (F(3,24) = 11.03, p < 0.001). Specifically, CQ varied more among the four emotions, relative to neutral, in Mandarin than in English. Anger and happiness increased CQ from the neutral baseline, whereas fear decreased it in both languages; the CQ of sadness was lower than neutral in Mandarin but slightly raised in English. Similarly, the changes in PIC across the four emotions were close to 0 in English but much larger in Mandarin. For SQ, fear showed a significant increase in Mandarin, suggesting that fear had more symmetrical EGG waveforms.

4. Discussion and Conclusions

This study examined whether a tonal language shows restricted pitch variation when encoding vocal emotions compared to a non-tonal language and, if so, how other cues are used to differentiate vocal emotions. To address these questions, both within- and across-language comparisons of emotions were made on comprehensive prosodic cues, phonation-related spectral cues and EGG cues through a production experiment.

4.1. Acoustic and Physiological Patterns of Each Vocal Emotion in Mandarin and English

The results of the within-language comparison showed that all the prosodic cues, including f0, duration and intensity, differed significantly between emotions in each language, suggesting that different vocal emotions have unique prosodic patterns, consistent with previous studies (e.g., [2,3,49,50]). Specifically, happiness and anger showed a relatively high pitch level, a greater degree of local pitch fluctuation, a faster speech rate and higher intensity in both Mandarin and English, matching earlier observations (Mandarin: [15,20,51]; English: [7]). Fear had the highest pitch level and the fastest speech rate of the five emotions; its intensity was also high, but it had smaller local pitch movements. These results are likewise consistent with [7,52,53]. The mean pitch of sadness was close to that of neutral, as also reported by [54] for Mandarin and [7] for English. In addition, sadness had smaller local pitch movements, the slowest speech rate and the weakest intensity in both languages. Unlike previous studies that mostly used global pitch measures such as mean pitch and pitch range to describe changes in the pitch contour, our study included finer pitch-related cues that quantify the degree of local pitch change over time, such as mean absolute slope, rise, fall, rise slope and fall slope. With these cues, emotion pairs such as happiness versus fear and anger versus fear can be clearly separated by their degree of local pitch movement. Furthermore, these cues have proved useful for examining whether pitch fluctuation is influenced by the existence of a lexical tone system within one language [25,26]. Consequently, we also included them in our across-language analysis to test whether such a restriction exists.
In addition, our study analysed a set of spectral cues reflecting the degree of periodicity, the signal-to-noise ratio (SNR) and the spectral tilt of the speech signal [30,38]. The spectral analysis revealed different patterns in the two languages. In Mandarin, happiness had better periodicity, a higher SNR and a smaller spectral tilt due to increased high-frequency energy; anger showed similar patterns. Fear had the poorest periodicity, the smallest SNR and a higher spectral tilt caused by lower high-frequency energy. Finally, sadness was less periodic, with a smaller SNR and the highest spectral tilt caused by the lowest high-frequency energy. In contrast, the periodicity and SNR of the five emotions showed few differences in English; happiness had the highest high-frequency energy and thus the smallest spectral tilt, followed by anger, sadness and fear. These spectral results are meaningful for inferring source features and for categorizing the vocal fold vibration patterns of different vocal emotions. On the one hand, periodicity and SNR are reduced when the vocal folds do not close completely during vibration and air flows through the glottis, which indicates a breathy voice or an aspirated sound. On the other hand, spectral tilt reflects the shape of the glottal wave, which represents the mechanism of vocal fold vibration; a larger spectral tilt can be caused by a breathy voice or a soft voice [30]. Therefore, our spectral results suggest that fear and sadness may involve breathy voice in Mandarin.
Another new element of this study was the EGG experiment, which directly probed the physiological mechanism of the different vocal emotions as a supplement to the spectral measures. Results revealed that neutral and happiness were expressed with modal voice in both Mandarin and English. Anger was produced with a small degree of tense voice in Mandarin. Fear showed the typical features of breathy voice in both Mandarin and English. Sadness also had a small degree of breathy voice in Mandarin but not in English. Based on Table 3, CQ showed a significant difference among all vocal emotions in Mandarin, and in English all contrasts except neutral versus fear were significantly different in SQ. This indicates that the EGG cues helped discriminate all of the emotions and clearly complemented both prosodic and spectral cues. Furthermore, through the EGG experiment, anger and happiness could be differentiated better than with prosodic cues alone, in line with previous reports [55]. Moreover, the EGG cues helped identify the phonation type of fear as breathy voice rather than an aspirated or soft voice, a question that could not be resolved by the spectral cues alone. Thus, EGG cues usefully supplement spectral cues in exploring the voice quality features of vocal emotions.

4.2. Multidimensionality of the Acoustic Realization of Vocal Emotions

It is worth noting that in the within-language analysis, no direct comparison of each value between Mandarin and English was made, although the same experimental paradigm was used for both languages. Therefore, we further used normalized data to compare the patterns of vocal emotions across languages. Each value of anger, fear, happiness and sadness was normalized against that of neutral in each language, so that the derived data represented the proportion of change for each emotion with neutral as a reference.
Based on the acoustic analysis of the normalized pitch-related cues, we observed that Mandarin and English use different mechanisms when employing pitch to express vocal emotions. There were significant interactions between Emotion and Language for mean absolute slope, rise, fall and fall slope. Pitch variations, with neutral as the baseline, were significantly greater in English than in Mandarin. We posit that this difference is due to the restriction of pitch variation by the existence of lexical tones in Mandarin, since pitch in Mandarin is used primarily to maintain tonal shapes, and thereby lexical distinctions, and only secondarily to realize paralinguistic functions. Although the present study is not the first to propose this idea [24,25,27], it expands upon previous work and provides additional objective measures. For example, this study included more sophisticated pitch measures that better reflect the variation of pitch over time. Additionally, instead of comparing absolute values between Mandarin and English, we used derived values to reflect the magnitude of change that a cue showed in emotional speech given neutral as the baseline; that is, we measured the relative contribution of each cue to differentiating among emotions.
Distinct from the established literature, which looked only at how the paralinguistic use of pitch is influenced by the lexical tone system, we explored how other cues, such as duration, spectral and EGG cues, function when this restriction exists. Results showed that the differences between neutral and the other emotions in speech rate, CPP, CQ and SQ were significantly greater in Mandarin than in English, the opposite of the pattern for pitch-related cues. A possible explanation is that these cues were enhanced in Mandarin vocal emotions to compensate for the suppressed pitch variation. We therefore believe our study provides new evidence for the pitch restriction hypothesis from a fresh perspective.
Despite these promising results, the study has several limitations. First, given the limited number of tonal and non-tonal languages examined, one may argue that the differences between Mandarin and English vocal emotions could be due to the different cultural backgrounds of the speakers. The conclusions therefore need to be further tested on a larger and more diverse pool of languages. The second limitation lies in the small number of speakers per language; more speakers should be recruited in future research to increase the generalizability of the findings.
In conclusion, this study compared vocal emotions within and across Mandarin and English using a set of prosodic cues, phonation-related spectral cues and EGG cues. Results revealed that, within each language, different vocal emotions showed significantly different acoustic patterns. However, pitch variation was suppressed in Mandarin vocal emotions, relative to English, due to the influence of the lexical tone system. Evidence from other cues supported the idea that when a certain dimension (for example, pitch) is restricted within a language, other dimensions may be strengthened in compensation. Thus, we posit that the acoustic realization of vocal emotions is multidimensional.

Author Contributions

T.W., Y.-c.L. and Q.M. designed the experiment; T.W. conducted the production experiment, analysed the data and wrote the first draft of the manuscript; Y.-c.L. reviewed and improved the manuscript.

Funding

This work was supported in part by the Youth Project of Humanities and Social Sciences Foundation [grant no. 18YJC740103] of the Ministry of Education in China, the Chenguang Program [grant no. 16CG21] of the Shanghai Education Development Foundation and the Shanghai Municipal Education Commission, and Fundamental Research Funds for the Central Universities [grant no. 22120170484] of Tongji University, which was awarded to the first author.

Acknowledgments

We thank all the participants in the production and perception experiments. We also thank Professor Jianjing Kuang at the University of Pennsylvania for help with the EGG recording facility.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xu, Y. Speech melody as articulatorily implemented communicative functions. Speech Commun. 2005, 46, 220–251. [Google Scholar] [CrossRef] [Green Version]
  2. Juslin, P.N.; Laukka, P. Communication of emotions in vocal expression and music performance: Different channels, same code? Psychol. Bull. 2003, 129, 770–814. [Google Scholar] [CrossRef] [PubMed]
  3. Murray, I.R.; Arnott, J.L. Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. J. Acoust. Soc. Am. 1993, 93, 1097–1108. [Google Scholar] [CrossRef]
  4. Scherer, K.R. Vocal communication of emotion: A review of research paradigms. Speech Commun. 2003, 40, 227–256. [Google Scholar] [CrossRef] [Green Version]
  5. Ekman, P. Are there basic emotions? Psychol. Rev. 1992, 99, 550–553. [Google Scholar] [CrossRef] [PubMed]
  6. Juslin, P.N.; Scherer, K.R. Vocal expression of affect. In The New Handbook of Methods in Nonverbal Behavior Research; Oxford University Press: New York, NY, USA, 2005; pp. 65–135. [Google Scholar]
  7. Pell, M.D.; Paulmann, S.; Dara, C.; Alasseri, A.; Kotz, S.A. Factors in the recognition of vocally expressed emotions: A comparison of four languages. J. Phon. 2009, 37, 417–435. [Google Scholar] [CrossRef]
  8. Toivanen, J.; Waaramaa, T.; Alku, P.; Laukkanen, A.-M.; Seppänen, T.; Väyrynen, E.; Airas, M. Emotions in [a]: A perceptual and acoustic study. Logop. Phoniatr. Vocology 2006, 31, 43–48. [Google Scholar] [CrossRef]
  9. Zhang, S. Emotion recognition in Chinese natural speech by combining prosody and voice quality features. In International Symposium on Neural Networks; Springer: Berlin, Germany, 2008; pp. 457–464. [Google Scholar]
  10. Patel, S.; Scherer, K.R.; Björkner, E.; Sundberg, J. Mapping emotions into acoustic space: The role of voice production. Biol. Psychol. 2011, 87, 93–98. [Google Scholar] [CrossRef]
  11. Xu, Y.; Kelly, A.; Smillie, C. Emotional expressions as communicative signals. In Prosody Iconicity; John Benjamins Publishing Company: Amsterdam, The Netherlands, 2013; pp. 33–60. [Google Scholar]
  12. Xu, Y.; Lee, A.; Wu, W.-L.; Liu, X.; Birkholz, P. Human vocal attractiveness as signaled by body size projection. PLoS ONE 2013, 8, e62397. [Google Scholar] [CrossRef]
  13. Chuenwattanapranithi, S.; Xu, Y.; Thipakorn, B.; Maneewongvatana, S. Encoding emotions in speech with the size code. Phonetica 2008, 65, 210–230. [Google Scholar] [CrossRef]
  14. Liu, X.; Xu, Y. Body size projection and its relation to emotional speech—Evidence from Mandarin Chinese. In Proceedings of the Seventh Speech Prosody Conference 2014, Dublin, Ireland, 20–23 May 2014; pp. 974–977. [Google Scholar]
  15. Wang, T.; Ding, H.; Kuang, J.; Ma, Q. Mapping Emotions into Acoustic Space: The Role of Voice Quality. In Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH 2014), Singapore, 14–18 September 2014; pp. 1978–1982. [Google Scholar]
  16. Scherer, K.R.; Banse, R.; Wallbott, H.G. Emotion inferences from vocal expression correlate across languages and cultures. J. Cross-Cult. Psychol. 2001, 32, 76–92. [Google Scholar] [CrossRef]
  17. Thompson, W.F.; Balkwill, L.-L. Decoding speech prosody in five languages. Semiotica 2006, 2006, 407–424. [Google Scholar] [CrossRef]
  18. Graham, C.R.; Hamblin, A.W.; Feldstein, S. Recognition of emotion in English voices by speakers of Japanese, Spanish and English. IRAL-Int. Rev. Appl. Linguist. Lang. Teach. 2001, 39, 19–37. [Google Scholar] [CrossRef]
  19. Li, A.-J.; Jia, Y.; Fang, Q.; Dang, J.-W. Emotional intonation modeling: A cross-language study on Chinese and Japanese. In Proceedings of the Signal and Information Processing Association Annual Summit and Conference (APSIPA), Kaohsiung, Taiwan, 29 October–1 November 2013; IEEE: New York, NY, USA, 2013; pp. 1–6. [Google Scholar]
  20. Wang, T.; Ding, H.; Gu, W. Perceptual Study for Emotional Speech of Mandarin Chinese. In Proceedings of the Sixth International Conference on Speech Prosody 2012, Shanghai, China, 22–25 May 2012; pp. 653–656. [Google Scholar]
  21. Anolli, L.; Wang, L.; Mantovani, F.; De Toni, A. The voice of emotion in Chinese and Italian young adults. J. Cross-Cult. Psychol. 2008, 39, 565–598. [Google Scholar] [CrossRef]
  22. Cruttenden, A. Intonation; Cambridge University Press: Cambridge, UK, 1997. [Google Scholar]
  23. Hirst, D.; Wakefield, J.; Li, H.Y. Does lexical tone restrict the paralinguistic use of pitch? Comparing melody metrics for English, French, Mandarin and Cantonese. In Proceedings of the International Conference on the Phonetics of Languages in China, Hong Kong, China, 2–4 December 2013; pp. 15–18. [Google Scholar]
  24. Ross, E.D.; Edmondson, J.A.; Seibert, G.B. The effect of affect on various acoustic measures of prosody in tone and non-tone languages: A comparison based on computer analysis of voice. J. Phon. 1986, 14, 283–302. [Google Scholar]
  25. Wang, T.; Lee, Y. Does restriction of pitch variation affect the perception of vocal emotions in Mandarin Chinese? J. Acoust. Soc. Am. 2015, 137, EL117–EL123. [Google Scholar] [CrossRef]
  26. Wang, T.; Qian, Y. Are pitch variation cues indispensable to distinguish vocal emotions? In Proceedings of the 9th International Conference on Speech Prosody 2018, Poznań, Poland, 13–16 June 2018; pp. 324–328. [Google Scholar]
  27. Chong, C.; Kim, J.; Davis, C. Exploring acoustic differences between Cantonese (tonal) and English (non-tonal) spoken expressions of emotions. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), Dresden, Germany, 6–10 September 2015; pp. 1522–1526. [Google Scholar]
  28. Laver, J. The phonetic description of voice quality. Camb. Stud. Linguist. Lond. 1980, 31, 1–186. [Google Scholar]
  29. Gobl, C.; Chasaide, A.N. The role of voice quality in communicating emotion, mood and attitude. Speech Commun. 2003, 40, 189–212. [Google Scholar] [CrossRef]
  30. Stevens, K.N. Diverse acoustic cues at consonantal landmarks. Phonetica 2000, 57, 139–151. [Google Scholar] [CrossRef]
  31. Wang, T.; Lee, Y.-C.; Ma, Q. An experimental study of emotional speech in Mandarin and English. In Proceedings of the 8th International Conference on Speech Prosody 2016, Boston, MA, USA, 31 May–3 June 2016; pp. 430–434. [Google Scholar]
  32. Qualtrics Software; Version 2018; Qualtrics: Provo, UT, USA; Available online: https://www.qualtrics.com (accessed on 7 December 2018).
  33. Bigi, B.; Hirst, D. SPeech Phonetization Alignment and Syllabification (SPPAS): A tool for the automatic analysis of speech prosody. In Proceedings of the Sixth International Conference on Speech Prosody 2012, Shanghai, China, 22–25 May 2012; pp. 19–22. [Google Scholar]
  34. Hirst, D. The analysis by synthesis of speech melody: From data to models. J. Speech Sci. 2011, 1, 55–83. [Google Scholar]
  35. Hirst, D. Melody metrics for prosodic typology: Comparing English, French and Chinese. In Proceedings of the 14th Annual Conference of the International Speech Communication Association (INTERSPEECH 2013), Lyon, France, 25–29 August 2013; pp. 572–576. [Google Scholar]
  36. De Looze, C.; Hirst, D.J. The OMe (Octave-Median) scale: A natural scale for speech melody. In Proceedings of the Seventh International Conference on Speech Prosody, Dublin, Ireland, 20–23 May 2014; pp. 910–913. [Google Scholar]
  37. Shue, Y.-L.; Keating, P.A.; Vicenik, C.; Yu, K. Voicesauce: A Program for Voice Analysis. In Proceedings of the 17th International Conference of Phonetic Sciences (ICPhS 17), Hong Kong, China, 17–21 August 2011. [Google Scholar]
  38. Hillenbrand, J.; Cleveland, R.A.; Erickson, R.L. Acoustic Correlates of Breathy Vocal Quality. J. Speech Hear. Res. 1994, 37, 769–778. [Google Scholar] [CrossRef] [PubMed]
  39. Johnson, K. The Auditory/Perceptual Basis for Speech Segmentation; Department of Linguistics, Ohio State University: Columbus, OH, USA, 1997. [Google Scholar]
  40. Kuang, J.; Keating, P. Vocal fold vibratory patterns in tense versus lax phonation contrasts. J. Acoust. Soc. Am. 2014, 136, 2784–2797. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  41. Tehrani, H. EggWorks. 2012. Available online: http://www.linguistics.ucla.edu/faciliti/facilities/physiology/EGG.htm (accessed on 13 August 2014).
  42. Rothenberg, M.; Mahshie, J.J. Monitoring Vocal Fold Abduction through Vocal Fold Contact Area. J. Speech Hear. Res. 1988, 31, 338–351. [Google Scholar] [CrossRef] [PubMed]
  43. Keating, P.; Esposito, C.M.; Garellek, M.; Khan, S.U.D.; Kuang, J. Phonation Contrasts across Languages. In Proceedings of the 17th International Conference of Phonetic Sciences (ICPhS 17), Hong Kong, China, 17–21 August 2011; Volume 108, pp. 188–202. [Google Scholar]
  44. Marasek, K. Glottal correlates of the word stress and the tense/lax opposition in German. In Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP ’96), Philadelphia, PA, USA, 3–6 October 1996; Volume 3, pp. 1573–1576. [Google Scholar]
  45. Kuznetsova, A.; Brockhoff, P.B.; Christensen, R.H.B. lmerTest Package: Tests in Linear Mixed Effects Models. J. Stat. Softw. 2017, 82. [Google Scholar] [CrossRef] [Green Version]
  46. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2015. [Google Scholar]
  47. Gelman, A.; Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models; Cambridge University Press: New York, NY, USA, 2006. [Google Scholar]
  48. Nasoz, F.; Alvarez, K.; Lisetti, C.L.; Finkelstein, N. Emotion recognition from physiological signals using wireless sensors for presence technologies. Cogn. Technol. Work 2004, 6, 4–14. [Google Scholar] [CrossRef]
  49. Scherer, K.R. Vocal affect expression: Review and a model for future research. Psychol. Bull. 1986, 99, 143–165. [Google Scholar] [CrossRef] [PubMed]
  50. Johnstone, T.; Scherer, K.R. Vocal communication of emotion. In Handbook of Emotions; Lewis, M., Haviland-Jones, J.M., Eds.; The Guilford Press: New York, NY, USA, 2000; pp. 220–235. ISBN 978-1-57230-529-8. [Google Scholar]
  51. Yuan, J.; Shen, L.; Chen, F. The acoustic realization of anger, fear, joy and sadness in Chinese. In Proceedings of the Seventh International Conference on Spoken Language Processing, Denver, CO, USA, 16–20 September 2002; pp. 2025–2028. [Google Scholar]
  52. Williams, C.E.; Stevens, K.N. Emotions and Speech: Some Acoustical Correlates. J. Acoust. Soc. Am. 1972, 52, 1238–1250. [Google Scholar] [CrossRef] [PubMed]
  53. Juslin, P.N.; Laukka, P. Impact of intended emotion intensity on cue utilization and decoding accuracy in vocal expression of emotion. Emotion 2001, 1, 381–412. [Google Scholar] [CrossRef] [PubMed]
  54. Liu, P.; Pell, M.D. Recognizing vocal emotions in Mandarin Chinese: A validated database of Chinese vocal emotional stimuli. Behav. Res. 2012, 44, 1042–1051. [Google Scholar] [CrossRef] [Green Version]
  55. Nwe, T.L.; Foo, S.W.; De Silva, L.C. Speech emotion recognition using hidden Markov models. Speech Commun. 2003, 41, 603–623. [Google Scholar] [CrossRef]
Figure 1. Pitch-normalized EGG waveform of five emotions in Mandarin.
Figure 2. Pitch-normalized EGG waveform of five emotions in English.
Figure 3. Prosodic cues of four emotions in Mandarin and English. Points indicate mean values and error bars 95% confidence intervals (adapted from Figure 1 in Reference [31]).
Figure 4. Phonation-related acoustic measures of four emotions in Mandarin and English. Points indicate mean values and error bars 95% confidence intervals (adapted from Figure 2 in Reference [31]).
Figure 5. Physiological measures of phonation of four emotions in Mandarin and English. Points indicate mean values and error bars 95% confidence intervals (adapted from Figure 3 in Reference [31]).
Table 1. An example of the target sentence “My advisor won’t come to my presentation” in four emotional dialogue settings.

Emotional Context | Mandarin | English
Happiness | A: 你看起来很开心啊. | A: You look so happy.
Happiness | B: 我刚收到学校的邮件.是好消息哦!(高兴地说)导师不来参加我的汇报了.他是系里最严厉的老师,他会指出学生报告中的任何一点儿小错误. | B: I just got an email from school. (In a happy mood) My advisor won’t come to my presentation. He is the toughest advisor in the department and points out every little mistake during presentations.
Anger | A: 我听说你今天要做报告.你的导师也来参加吧? | A: I heard you are giving a presentation today. Your advisor is coming today. Right?
Anger | B: 他不来!(生气地说)导师不来参加我的汇报了.他都不关心我做什么! | B: NO! (In an angry mood) My advisor won’t come to my presentation. He never keeps his promises.
Sadness | A: 你看起来很伤心,怎么了? | A: You seem sad. What’s the matter?
Sadness | B: 我为了准备这次汇报很辛苦,就想在导师面前留个好印象.但是我刚收到导师的邮件说他有事.(伤心地说)导师不来参加我的汇报了. | B: I’ve been working so hard for this presentation but I just got an email from my advisor that he is not feeling well. (In a sad mood) My advisor won’t come to my presentation.
Fear | A: 你怎么了? | A: What’s wrong?
Fear | B: 我今天要做汇报,本来导师在场,想着被问蒙了的时候他可以帮帮我.但是我收到导师的邮件说他儿子生病了.(害怕地说)导师不来参加我的汇报了.他不在,我汇报的时候更紧张啊. | B: I am giving a presentation today but I got an email from my advisor that his son is now in hospital. (In a fearful mood) My advisor won’t come to my presentation. I am so afraid of giving a talk without him today.
Table 2. Normalized mean values of five emotions in each language.

Language | Measurement | Anger | Fear | Happiness | Neutral | Sadness
Mandarin | Mean pitch | 0.474 | 0.889 | 0.664 | −1.111 | −0.952
Mandarin | Pitch range | 1.392 | −0.228 | 1.055 | −0.050 | −0.656
Mandarin | Mean absolute slope | 1.561 | −0.023 | 0.945 | −0.165 | −0.864
Mandarin | Rise | 1.363 | −0.206 | 0.648 | 0.006 | −0.417
Mandarin | Fall | −1.539 | 0.267 | −0.843 | −0.009 | 0.502
Mandarin | Rise slope | 1.231 | 0.070 | 0.669 | −0.206 | −0.716
Mandarin | Fall slope | −1.492 | 0.032 | −0.852 | 0.184 | 0.636
Mandarin | Speech rate | 1.059 | 1.098 | 0.742 | 0.034 | −0.670
Mandarin | Intensity | 1.157 | 0.056 | 0.954 | −0.308 | −1.466
Mandarin | CPP | 0.470 | −1.223 | 1.119 | 0.450 | −0.525
Mandarin | H1 | 0.902 | 0.376 | 0.757 | 0.063 | −1.352
Mandarin | H2 | 1.103 | −0.401 | 0.956 | 0.256 | −1.088
Mandarin | H4 | 1.011 | −0.387 | 1.038 | 0.539 | −1.100
Mandarin | H1-H2 | −0.694 | 1.380 | −0.580 | −0.325 | −0.110
Mandarin | H1-A1 | −0.864 | 0.872 | −0.668 | 0.065 | 0.084
Mandarin | H1-A2 | −0.686 | 0.619 | −0.814 | 0.252 | 0.584
Mandarin | H1-A3 | −0.536 | 0.615 | −0.589 | 0.619 | 0.732
Mandarin | CQ | 0.947 | −1.230 | 0.630 | 0.225 | −0.218
Mandarin | PIC | 0.710 | 0.537 | 0.365 | −0.920 | −0.394
Mandarin | SQ | −0.431 | 1.091 | −0.566 | −0.445 | 0.180
English | Mean pitch | 0.350 | 0.903 | 0.785 | −1.189 | −0.849
English | Pitch range | 0.542 | −0.054 | 1.008 | −0.953 | −0.542
English | Mean absolute slope | 0.433 | −0.065 | 1.279 | −0.927 | −0.721
English | Rise | 0.179 | −0.170 | 0.430 | −0.220 | −0.219
English | Fall | −0.309 | −0.064 | −0.937 | 0.813 | 0.497
English | Rise slope | 0.475 | −0.090 | 0.905 | −0.702 | −0.587
English | Fall slope | −0.324 | −0.059 | −0.994 | 0.821 | 0.556
English | Speech rate | 0.058 | 0.254 | 0.233 | −0.039 | −0.506
English | Intensity | 0.898 | −0.009 | 0.993 | −0.787 | −1.094
English | CPP | 0.158 | −0.632 | 0.463 | 0.084 | −0.073
English | H1 | 0.416 | 0.085 | 0.743 | −0.651 | −0.593
English | H2 | 0.456 | −0.166 | 0.841 | −0.507 | −0.624
English | H4 | 0.494 | −0.323 | 0.769 | −0.359 | −0.581
English | H1-H2 | −0.198 | 0.761 | −0.567 | −0.288 | 0.292
English | H1-A1 | −0.447 | 0.623 | −0.714 | −0.085 | 0.623
English | H1-A2 | −0.408 | 0.240 | −0.513 | 0.030 | 0.651
English | H1-A3 | −0.589 | 0.101 | −0.274 | 0.148 | 0.613
English | CQ | 0.111 | −0.632 | 0.192 | 0.123 | 0.205
English | PIC | −0.158 | 0.153 | 0.162 | 0.230 | −0.388
English | SQ | −0.362 | 0.048 | −0.724 | 0.270 | 0.769
Table 3. Results of multiple comparison analysis for each cue between emotions (A: anger; F: fear; H: happiness; N: neutral; S: sadness).

Language | Measurement | F-A | H-A | N-A | S-A | H-F | N-F | S-F | N-H | S-H | S-N
MandarinMean pitch****************************
Pitch range****************************
Mean absolute slope***************.************
Rise***************..*********
Fall***************************
Rise slope*****************************
Fall slope*************** ************
Speech rate ***************************
Intensity******************************
CPP****** *********************
H1*** ************************
H2***.************************
H4*** ************************
H1-H2*** *********************
H1-A1*** *********************
H1-A2*** *********** ********
H1-A3*** ********* ******
CQ******************************
PIC ******** ***************
SQ*** ************ ******
EnglishMean pitch************ ***************
Pitch range*****************************
Mean absolute slope***************************.
Rise ** ****
Fall.************************.
Rise slope*************************
Fall slope ************************
Speech rate ** *** ***.
Intensity*** ***********************
CPP*** ******** **
H1..*********************
H2**********************
H4*** ********* ******
H1-H2***. ******** *****
H1-A1*** ********** *********
H1-A2*** ******* *********
H1-A3*** ******. ******
CQ*** *********
PIC * ***
SQ*********** ***********
Signif. codes: ‘.’ 0.05 < p < 0.1; * p < 0.05; ** p < 0.01; *** p < 0.001.
Table 4. Results of mixed repeated measures ANOVAs for each cue.

Measurement | Emotion (within-subject): F(3,24) | p | η² | Language (between-subject): F(1,8) | p | η² | Emotion × Language: F(3,24) | p | η²
Mean pitch | 19.905 | 0.000 *** | 0.713 | 0.666 | 0.438 | 0.077 | 0.367 | 0.778 | 0.044
Pitch range | 31.567 | 0.000 *** | 0.798 | 7.768 | 0.024 * | 0.493 | 1.285 | 0.302 | 0.138
Mean absolute slope | 50.404 | 0.000 *** | 0.863 | 34.760 | 0.000 *** | 0.813 | 7.734 | 0.001 ** | 0.492
Rise | 33.377 | 0.000 *** | 0.807 | 12.769 | 0.007 ** | 0.615 | 5.902 | 0.025 * | 0.425
Fall | 23.507 | 0.000 *** | 0.746 | 30.419 | 0.001 ** | 0.792 | 5.158 | 0.007 ** | 0.392
Rise slope | 16.980 | 0.001 *** | 0.680 | 22.950 | 0.001 ** | 0.742 | 3.210 | 0.098 | 0.286
Fall slope | 27.416 | 0.000 *** | 0.774 | 22.086 | 0.002 ** | 0.734 | 5.095 | 0.023 * | 0.389
Speech rate | 29.519 | 0.000 *** | 0.787 | 12.342 | 0.007 ** | 0.226 | 5.291 | 0.006 ** | 0.398
Intensity | 47.668 | 0.000 *** | 0.856 | 0.851 | 0.383 | 0.096 | 2.157 | 0.119 | 0.212
CPP | 18.852 | 0.000 *** | 0.702 | 2.006 | 0.194 | 0.200 | 5.101 | 0.007 ** | 0.389
H1 | 3.595 | 0.051 | 0.310 | 0.004 | 0.953 | 0.000 | 0.465 | 0.637 | 0.055
H2 | 7.918 | 0.010 * | 0.497 | 0.011 | 0.919 | 0.001 | 0.812 | 0.431 | 0.092
H4 | 1.467 | 0.263 | 0.155 | 0.116 | 0.743 | 0.014 | 2.802 | 0.123 | 0.259
H1-H2 | 0.993 | 0.413 | 0.110 | 0.048 | 0.832 | 0.006 | 2.460 | 0.087 | 0.235
H1-A1 | 0.462 | 0.539 | 0.055 | 0.483 | 0.507 | 0.057 | 0.049 | 0.859 | 0.006
H1-A2 | 1.368 | 0.282 | 0.146 | 0.021 | 0.889 | 0.003 | 0.695 | 0.470 | 0.080
H1-A3 | 1.344 | 0.285 | 0.144 | 0.363 | 0.563 | 0.043 | 1.906 | 0.200 | 0.192
CQ | 20.772 | 0.000 *** | 0.722 | 0.331 | 0.581 | 0.040 | 9.524 | 0.006 ** | 0.543
PIC | 2.932 | 0.054 | 0.268 | 0.946 | 0.359 | 0.106 | 0.894 | 0.458 | 0.101
SQ | 20.408 | 0.000 *** | 0.718 | 4.954 | 0.057 | 0.382 | 11.031 | 0.000 *** | 0.580
Signif. codes: ‘.’ 0.05 < p < 0.1; * p < 0.05; ** p < 0.01; *** p < 0.001.
