Factors in the recognition of vocally expressed emotions: A comparison of four languages
Introduction
Nonverbal and paralinguistic cues provide a rich source of information about a speaker's emotions and social intentions during discourse (Wilson & Wharton, 2006). Emotions expressed in the face, and to a much lesser extent the voice, have been studied to elucidate a core set of emotions—most typically joy/happiness, anger, disgust, fear, sadness, and surprise (Ekman, Sorenson, & Friesen, 1969; Izard, 1994). While there are various ways of characterizing the emotional states and affective dimensions which can be expressed in speech, the notion of categorical emotions associated with discrete forms of expression is deeply entrenched in the literature (see Cowie & Cornelius, 2003 for a discussion). Many believe that characteristic expressions of basic emotions “erupt” in speech, often involuntarily, as one of the neurophysiological consequences of experiencing the emotion by the ‘sender’ or encoder of the expression. Perhaps owing to the biological significance of these expressions to conspecifics and their importance for adaptive behaviour, these emotional expressions are believed to possess certain invariant properties which allow them to be recognized independent of learning or culture when presented in the face (Ekman & Friesen, 1971) or in the voice (Scherer, Banse, & Wallbott, 2001).
The communicative or expressive aspects of emotional behaviour (e.g., emotional ‘display rules’; Ekman & Friesen, 1969) are also influenced by socio-cultural dimensions of an interaction. Despite similarities in how emotions are expressed across human cultures, the opportunity to express particular emotions and the form of these displays tend to vary according to cultural norms (Ekman et al., 1987; Elfenbein, Beaupré, Lévesque, & Hess, 2007; Mesquita & Frijda, 1992; Scherer, 1997). Moreover, cultural rules often dictate how males versus females communicate their emotions in speech or in the face (Atkinson, Tipples, Burt, & Young, 2005; Goos & Silverman, 2002; Hess, Adams, & Kleck, 2005; Hofmann, Suvak, & Litz, 2006; Wallbott, 1988). Finally, it is well recognized that individual encoders within a single language or cultural group further vary in how they regulate their nonverbal behaviour to express emotions in speech (Banse & Scherer, 1996; Wallbott & Scherer, 1986) and to achieve particular social-pragmatic goals, such as to signal dominance or affiliation, or to create other social impressions (Hess et al., 2005). It has been argued that individual differences and personality traits are of central importance for understanding emotional expressive behaviour (Matsumoto, 2006). Collectively, these studies underscore that social and interpersonal variables play an important role in how emotions are encoded in speech and other communication channels, and in how the receiver or decoder interprets their emotional meaning in many situations.
If one concentrates on the vocal expression of emotion in speech (emotional prosody), there is a conspicuous lack of research which directly compares how individuals from different linguistic and cultural backgrounds communicate their emotions. In speech, discrete emotion expressions are associated with characteristic variations in the acoustic structure of the speech signal, and the relative perturbation of specific acoustic cues over the course of an utterance, which listeners recognize as an utterance unfolds (Banse & Scherer, 1996; Juslin & Laukka, 2003). The prospect that vocal emotion expressions vary somewhat across languages is suggested by research which has presented vocal emotion expressions for cross-cultural recognition. These studies reveal that listeners can accurately detect and categorize vocal emotions when listening to a foreign language (Albas, McCluskey, & Albas, 1976; Pell, Monetta, Paulmann, & Kotz, 2009; Pell & Skorup, 2008; Scherer et al., 2001; Thompson & Balkwill, 2006; Van Bezooijen, Otto, & Heenan, 1983), consistent with the idea of basic human emotions and the existence of shared principles which guide emotional communication (Ekman, 1992). However, these same studies also typically demonstrate an in-group advantage for identifying vocal emotion expressions more accurately when produced by speakers of the same language when compared to speakers of a foreign language (see Elfenbein & Ambady, 2002 for an overview). Thus, based on the cross-cultural data it can be said that vocal emotion expressions seem to exhibit a core set of acoustic-perceptual features which promote accurate recognition across languages, but that there are also language-specific differences which lead to an in-group processing advantage (Pell & Skorup, 2008; Pell et al., 2009).
A separate literature has examined how vocal emotions are encoded and recognized in the context of specific languages (see Juslin & Laukka, 2003 for a comprehensive overview). From this research we know that emotional meanings in the voice are conveyed by concomitant changes in several acoustic parameters of speech, including but not limited to fundamental frequency (pitch), intensity (loudness), duration, rhythm, and different aspects of voice quality (Banse & Scherer, 1996). As demonstrated by Juslin and Laukka's (2003) meta-analysis, most researchers in the acoustic literature have measured changes in vocal pitch, intensity, and speech rate, implying that these parameters are critical features of vocal emotion expressions; in particular, a speaker's pitch level (mean), pitch range (or variation), and speech rate appear to differentiate well among discrete emotion categories in both acoustic and perceptual terms (Mozziconacci, 2001; Pell, 2001; Williams & Stevens, 1972). For example, expressions of sadness tend to be produced with a relatively low pitch/fundamental frequency (f0) and slow speaking rate, whereas expressions of anger, fear, and happiness tend to be produced with a moderate or high mean f0 and fast speaking rate. In addition, anger and happiness usually display high f0 variation, whereas fear and sadness often exhibit less f0 variation (Juslin & Laukka, 2003; cf. Banse & Scherer, 1996; Sobin & Alpert, 1999; Williams & Stevens, 1972 for discussion and exceptions to these patterns). Emotions such as disgust and surprise have been studied less in the context of speech and their acoustic-perceptual features are more controversial, although there is evidence that disgust is sometimes produced with a low mean f0 (Banse & Scherer, 1996).
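To make these summary cues concrete, the sketch below shows how the three parameters highlighted above (mean f0, f0 range, and speech rate) might be reduced to simple statistics over a frame-wise pitch track. This is a hypothetical helper for illustration only, not code from the study; it assumes f0 has already been estimated per frame, with unvoiced frames marked as NaN:

```python
import numpy as np

def prosodic_profile(f0_hz, n_syllables, utt_dur_s):
    """Summarize the acoustic cues most often measured in the vocal-emotion
    literature. `f0_hz` is a frame-wise pitch track in Hz; unvoiced frames
    are NaN and are excluded before computing the statistics."""
    voiced = f0_hz[~np.isnan(f0_hz)]
    return {
        "mean_f0_hz": float(np.mean(voiced)),        # pitch level
        "f0_range_hz": float(np.ptp(voiced)),        # pitch variation (max - min)
        "f0_sd_hz": float(np.std(voiced)),           # alternative variation index
        "speech_rate_syll_per_s": n_syllables / utt_dur_s,
    }

# Toy contour illustrating a "sad"-like profile: low, flat pitch, slow rate
sad_f0 = np.array([120.0, 118.0, np.nan, 122.0, 119.0, 121.0])
profile = prosodic_profile(sad_f0, n_syllables=6, utt_dur_s=2.4)
```

In practice a pitch tracker (e.g., autocorrelation-based analysis as in Praat) would supply the f0 contour, and syllable counts would come from the segmental transcription of each utterance.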
It bears noting that much of the information we have gained is based on analyses of posed or simulated exemplars of vocal emotion which were elicited from professional or lay actors who were native speakers of the language of interest. Given the close interplay of emotion and linguistic cues in speech, this investigative approach is often necessary in practical terms to control for variations in the linguistic content of utterances, especially when one of the research goals is to compare acoustic measures of different emotional expressions in speech which can be influenced by the segmental and suprasegmental properties of a language (Pell, 2001). Another characteristic of the present literature is that much of our knowledge of the acoustic properties of vocal emotions derives from major European languages such as English or German. Curiously, there have been few attempts to compare emotional expressions produced under a similar set of conditions by speakers of several different languages, especially languages which vary in their linguistic and/or cultural similarity. Thus, while there appear to be “modal tendencies” in how speakers encode discrete emotions in different languages (e.g., Scherer et al., 2001), this evidence is derived largely from the perceptual literature and/or through indirect comparisons of vocal emotion expressions produced in different languages and with different types of stimulus materials (words, sentences, nonsense speech, or spontaneous dialogue, see Juslin & Laukka, 2003). Research which has undertaken a systematic, controlled study of the acoustic and perceptual properties of vocal emotion expressions in several languages in tandem is still rare (Burkhardt et al., 2006).
In the present study, our goal was to directly compare patterns for expressing and recognizing vocal emotion expressions which are assumed to possess certain invariant properties in four distinct language contexts. Given certain evidence that linguistic and/or cultural similarity could play a role in how vocal emotions are recognized (Scherer et al., 2001), we focused on four distinct languages which varied in a systematic manner in their “linguistic proximity” and typology: English, German, Hindi, and Arabic. Whereas English and German are considered closely related in both linguistic and cultural terms (i.e., both from the Germanic branch of Indo-European languages), Hindi is a more distantly related language from the Indo-European family, and Arabic comes from an entirely distinct language group (Semitic). In each language condition, a common procedure was followed: male and female encoders produced utterances in their native language to convey a standard set of different emotions by using their voice; and, recordings of the vocal stimuli were presented to a native listener group (half male, half female) who judged the intended emotion of the speaker for items produced in the same language. By following the same methods in each language condition, our data allowed us to examine patterns of vocal emotion recognition in each language context separately and through direct cross-language comparisons. We also extracted basic acoustic measures of the items presented in each language to compare the acoustic data with emotion recognition rates across languages. Although it was not the purpose of this study to present vocal expressions of emotion to listeners in their non-native language, complementary studies of this nature are ongoing (e.g., Pell et al., 2009).
One of the unique methodological challenges of studying vocal communication of emotion is how to isolate processes related to the encoding/decoding of emotions in the voice from those of processing linguistic-contextual cues of the utterance which accompany vocal emotion expressions; this potential confound affects any investigation of how emotions are recognized from vocal cues in speech because listeners may attend to corresponding linguistic features which bias or conflict with the meaning of the vocal cues. One way to circumvent the “isolation problem” is to require speakers to express emotions in “pseudo-utterances” which mimic the phonotactic and morpho-syntactic properties of the language of interest, in the absence of meaningful lexical-semantic information (Pell & Baum, 1997; Scherer, Banse, Wallbott, & Goldbeck, 1991). It has been shown that such stimuli can be produced in a relatively natural manner by encoders to portray a range of vocal emotions. The recorded utterances can then be judged by listeners, allowing inferences about the processing of vocal emotions in different languages in a controlled context where listeners must base their judgements strictly on vocal parameters of the utterances. We adopted this approach here to determine precisely how vocal cues operate during emotional encoding and recognition in each of our four language contexts.
Since there is little precedent for this research in the vocal literature, firm predictions about the influence of language on emotion recognition patterns or on the major acoustic cues involved could not be made with certainty. Based on our literature review, we expected that individual speakers/encoders in our experiment would display somewhat different abilities and patterns for encoding emotions in the voice, and that this would be true for each of our language conditions under study. We also predicted that listeners would be capable of identifying each of the target emotions from pseudo-utterances in their native language at levels exceeding chance, although one may expect variations in recognition accuracy and error confusion patterns when the language conditions are compared. Overall, it was expected that expressions of anger and sadness would yield the highest recognition rates in each language, and that disgust might lead to relatively poor recognition rates, although more precise patterns for identifying emotions and their relationship among the four language conditions were unclear. It was assumed that the major acoustic parameters of emotion expressions—mean f0, f0 range, and speech rate—would contribute significantly to differences among the emotion categories in each of the four languages. In light of evidence that vocal emotions can be recognized across languages, we expected to find qualitatively similar tendencies in how the acoustic parameters were associated with specific emotional expressions in the four languages, although the relationship between acoustic and perceptual measures of vocal emotion recognition has not previously been described in this way.
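The "exceeding chance" criterion is typically evaluated against a guessing baseline of 1/k for k response categories, using an exact binomial test. The sketch below is an illustration from first principles; the trial counts and the seven-category response set are hypothetical numbers, not figures from the study:

```python
from math import comb

def binomial_p_above_chance(hits, trials, n_categories):
    """Exact one-sided binomial test: the probability of observing at least
    `hits` correct responses out of `trials` if the decoder were guessing
    uniformly among `n_categories` response options."""
    p0 = 1.0 / n_categories  # guessing baseline, e.g. 1/7 ≈ 14.3%
    return sum(comb(trials, k) * p0**k * (1 - p0)**(trials - k)
               for k in range(hits, trials + 1))

# Hypothetical example: 30 correct out of 60 trials, 7 response options
p_value = binomial_p_above_chance(30, 60, 7)
```

A small p-value here indicates that the observed recognition rate is very unlikely under uniform guessing, which is the sense in which listeners are said to identify emotions "at levels exceeding chance".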
Section snippets
Methods
For each of the four languages under study (English, German, Hindi, Arabic), a common set of procedures was adopted to construct and validate the stimuli in each condition. For each language separately, an emotion elicitation study was first carried out to produce recordings of vocal emotion expressions from native speakers. At a second stage, an emotion perception study was undertaken to measure how the vocal stimuli are perceived by a group of native listeners and acoustic analyses were
Results
Table 3 presents the mean recognition rates and Table 4 presents the acoustic data for valid exemplars of each emotion averaged across the four encoders, per language condition. As expected, despite eliminating tokens which were poorly recognized owing to presumed difficulties in the ability to consistently pose emotion expressions, there was marked variability in how accurately decoders recognized specific emotions from the voice in each of the four languages of interest. Based on qualitative
Discussion
Vocal expression is a primary and phylogenetically significant part of the human repertoire for communicating emotions independent of language (Cosmides, 1983; Wilson & Wharton, 2006). However, most commonly these expressions are realized in the context of speech, according to the prevailing structure of the language in use, which could influence the physical form of vocal emotion expressions (Pell, 2001) and/or how they are interpreted from one language to another (Juslin & Laukka, 2001). To
Acknowledgements
This work was supported by a Discovery grant from the Natural Sciences and Engineering Research Council of Canada (to M.D. Pell) and by the German Research Foundation (DFG FOR 499 to S.A. Kotz). We thank Sarah Crowder, Sarah Elgazar, Romy Leidig, and Rajashree Sen for their efforts with the recordings and for data organization.
References (64)
- et al. (2003). Describing the emotional states that are expressed in speech. Speech Communication.
- et al. (2003). The role of voice quality in communicating emotion, mood and attitude. Speech Communication.
- et al. (2006). Sex differences in face recognition and influence of facial affect. Personality and Individual Differences.
- et al. (2003). On the lateralization of emotional prosody: An event-related functional MR investigation. Brain and Language.
- et al. (2007). When emotional prosody and semantics dance cheek to cheek: ERP evidence. Brain Research.
- et al. (1996). Physical variations related to stress and emotional state: A preliminary study. Journal of Phonetics.
- et al. (2003). The neural response to emotional prosody, as revealed by functional magnetic resonance imaging. Neuropsychologia.
- et al. (1995). Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Communication.
- et al. (2008). An ERP investigation on the temporal dynamics of emotional prosody and emotional semantics in pseudo- and lexical-sentence context. Brain and Language.
- et al. (1997). The ability to perceive and comprehend intonation in linguistic and affective contexts by brain-damaged adults. Brain and Language.
- Implicit processing of emotional prosody in a foreign versus native language. Speech Communication.
- The voice of confidence: Paralinguistic cues and audience evaluation. Journal of Research in Personality.
- Relevance and prosody. Journal of Pragmatics.
- Perception of the emotional content of speech: A comparison of two Canadian groups. Journal of Cross-Cultural Psychology.
- Asymmetric interference between sex and emotion in face perception. Perception & Psychophysics.
- Vocal expression of emotion: Acoustic properties of speech are associated with emotional intensity and context. Psychological Science.
- Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology.
- Right hemisphere emotional perception: Evidence across multiple channels. Neuropsychology.
- Relationships among facial, prosodic, and lexical channels of emotional perceptual processing. Cognition and Emotion.
- Invariances in the acoustic expression of emotion during speech. Journal of Experimental Psychology: Human Perception and Performance.
- An argument for basic emotions. Cognition and Emotion.
- The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica.
- Constants across cultures in the face and emotion. Journal of Personality and Social Psychology.
- Universals and cultural differences in the judgments of facial expressions of emotion. Journal of Personality and Social Psychology.
- Pan-cultural elements in facial displays of emotion. Science.
- On the universality and cultural specificity of emotion recognition: A meta-analysis. Psychological Bulletin.
- Toward a dialect theory: Cultural differences in the expression and recognition of posed facial expressions. Emotion.
- Emotional patterns in intonation and music. Zeitschrift für Phonetik.
- The prosodic expression of anger: Differentiating threat and frustration. Aggressive Behavior.
- Sex related factors in the perception of threatening facial expressions. Journal of Nonverbal Behavior.
- Who may frown and who should smile? Dominance, affiliation, and the display of happiness and anger. Cognition and Emotion.