
Journal of Phonetics

Volume 37, Issue 4, October 2009, Pages 417-435

Factors in the recognition of vocally expressed emotions: A comparison of four languages

https://doi.org/10.1016/j.wocn.2009.07.005

Abstract

To understand how language influences the vocal communication of emotion, we investigated how discrete emotions are recognized and acoustically differentiated in four language contexts—English, German, Hindi, and Arabic. Vocal expressions of six emotions (anger, disgust, fear, sadness, happiness, pleasant surprise) and neutral expressions were elicited from four native speakers of each language. Each speaker produced pseudo-utterances (“nonsense speech”) which resembled their native language to express each emotion type, and the recordings were judged for their perceived emotional meaning by a group of native listeners in each language condition. Emotion recognition and acoustic patterns were analyzed within and across languages. Although overall recognition rates varied by language, all emotions could be recognized strictly from vocal cues in each language at levels exceeding chance. Anger, sadness, and fear tended to be recognized most accurately irrespective of language. Acoustic and discriminant function analyses highlighted the importance of speaker fundamental frequency (i.e., relative pitch level and variability) for signalling vocal emotions in all languages. Our data emphasize that while emotional communication is governed by display rules and other social variables, vocal expressions of ‘basic’ emotion in speech exhibit modal tendencies in their acoustic and perceptual attributes which are largely unaffected by language or linguistic similarity.

Introduction

Nonverbal and paralinguistic cues provide a rich source of information about a speaker's emotions and social intentions when engaged in discourse (Wilson & Wharton, 2006). Emotions expressed in the face, and to a much lesser extent the voice, have been studied to elucidate a core set of emotions—most typically joy/happiness, anger, disgust, fear, sadness, and surprise (Ekman, Sorenson, & Friesen, 1969; Izard, 1994). While there are various ways of characterizing the emotional states and affective dimensions which can be expressed in speech, the notion of categorical emotions which are associated with discrete forms of expression is deeply entrenched in the literature (see Cowie & Cornelius, 2003 for a discussion). Many believe that characteristic expressions of basic emotions “erupt” in speech, often involuntarily, as one of the neurophysiological consequences of experiencing the emotion by the ‘sender’ or encoder of the expression. Perhaps owing to the biological significance of these expressions to conspecifics and their importance for adaptive behaviour, these emotional expressions are believed to possess certain invariant properties which allow them to be recognized independent of learning or culture when presented in the face (Ekman & Friesen, 1971) or in the voice (Scherer, Banse, & Wallbott, 2001).

The communicative or expressive aspects of emotional behaviour (e.g., emotional ‘display rules’; Ekman & Friesen, 1969) are also influenced by socio-cultural dimensions of an interaction. Despite similarities in how emotions are expressed across human cultures, the opportunity to express particular emotions and the form of these displays tend to vary according to cultural norms (Ekman et al., 1987; Elfenbein, Beaupré, Lévesque, & Hess, 2007; Mesquita & Frijda, 1992; Scherer, 1997). Moreover, cultural rules often dictate how males versus females communicate their emotions in speech or in the face (Atkinson, Tipples, Burt, & Young, 2005; Goos & Silverman, 2002; Hess, Adams, & Kleck, 2005; Hofmann, Suvak, & Litz, 2006; Wallbott, 1988). Finally, it is well recognized that individual encoders within a single language or cultural group further vary in how they regulate their nonverbal behaviour to express emotions in speech (Banse & Scherer, 1996; Wallbott & Scherer, 1986) and to achieve particular social-pragmatic goals, such as to signal dominance or affiliation, or to create other social impressions (Hess et al., 2005). It has been argued that individual differences and personality traits are of central importance for understanding emotional expressive behaviour (Matsumoto, 2006). Collectively, these studies underscore that social and interpersonal variables play an important role in how emotions are encoded in speech and other communication channels, and in how the receiver or decoder interprets their emotional meaning in many situations.

If one concentrates on the vocal expression of emotion in speech (emotional prosody), there is a conspicuous lack of research which directly compares how individuals from different linguistic and cultural backgrounds communicate their emotions. In speech, discrete emotion expressions are associated with characteristic variations in the acoustic structure of the speech signal, and the relative perturbation of specific acoustic cues over the course of an utterance, which listeners recognize as an utterance unfolds (Banse & Scherer, 1996; Juslin & Laukka, 2003). The prospect that vocal emotion expressions vary somewhat across languages is suggested by research which has presented vocal emotion expressions for cross-cultural recognition. These studies reveal that listeners can accurately detect and categorize vocal emotions when listening to a foreign language (Albas, McCluskey, & Albas, 1976; Pell, Monetta, Paulmann, & Kotz, 2009; Pell & Skorup, 2008; Scherer et al., 2001; Thompson & Balkwill, 2006; Van Bezooijen, Otto, & Heenan, 1983), consistent with the idea of basic human emotions and the existence of shared principles which guide emotional communication (Ekman, 1992). However, these same studies also typically demonstrate an in-group advantage: vocal emotion expressions are identified more accurately when produced by speakers of the listener's own language than when produced by speakers of a foreign language (see Elfenbein & Ambady, 2002 for an overview). Thus, based on the cross-cultural data it can be said that vocal emotion expressions seem to exhibit a core set of acoustic-perceptual features which promote accurate recognition across languages, but that there are also language-specific differences which lead to an in-group processing advantage (Pell & Skorup, 2008; Pell et al., 2009).

A separate literature has examined how vocal emotions are encoded and recognized in the context of specific languages (see Juslin & Laukka, 2003 for a comprehensive overview). From this research we know that emotional meanings in the voice are conveyed by concomitant changes in several acoustic parameters of speech, including but not limited to fundamental frequency (pitch), intensity (loudness), duration, rhythm, and different aspects of voice quality (Banse & Scherer, 1996). As demonstrated by Juslin and Laukka's (2003) meta-analysis, most researchers in the acoustic literature have measured changes in vocal pitch, intensity, and speech rate, implying that these parameters are critical features of vocal emotion expressions; in particular, a speaker's pitch level (mean), pitch range (or variation), and speech rate appear to differentiate well among discrete emotion categories in both acoustic and perceptual terms (Mozziconacci, 2001; Pell, 2001; Williams & Stevens, 1972). For example, expressions of sadness tend to be produced with a relatively low pitch/fundamental frequency (f0) and slow speaking rate, whereas expressions of anger, fear, and happiness tend to be produced with a moderate or high mean f0 and fast speaking rate. In addition, anger and happiness usually display high f0 variation, whereas fear and sadness often exhibit less f0 variation (Juslin & Laukka, 2003; cf. Banse & Scherer, 1996; Sobin & Alpert, 1999; Williams & Stevens, 1972 for discussion and exceptions to these patterns). Emotions such as disgust and surprise have been studied less in the context of speech and their acoustic-perceptual features are more controversial, although there is evidence that disgust is sometimes produced with a low mean f0 (Banse & Scherer, 1996).
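As a rough illustration of how these three parameters can be measured, the sketch below estimates mean f0, f0 range, and speech rate for a single recorded utterance in Python. This is not the measurement pipeline used in the study; the librosa-based pitch tracker, the pitch floor and ceiling, the file name, and the externally supplied syllable count are all assumptions made for the example.

```python
# Minimal sketch (not the study's pipeline): estimate mean f0, f0 range,
# and speech rate for one recorded utterance.
import numpy as np
import librosa

def acoustic_profile(wav_path, n_syllables):
    """Return mean f0 (Hz), f0 range (Hz), and speech rate (syllables/s)."""
    y, sr = librosa.load(wav_path, sr=None)       # keep native sampling rate
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),            # ~65 Hz pitch floor (assumed)
        fmax=librosa.note_to_hz("C6"),            # ~1047 Hz ceiling (assumed)
        sr=sr,
    )
    voiced_f0 = f0[voiced_flag]                   # keep voiced frames only
    duration_s = len(y) / sr
    return {
        "mean_f0_hz": float(np.nanmean(voiced_f0)),
        "f0_range_hz": float(np.nanmax(voiced_f0) - np.nanmin(voiced_f0)),
        "speech_rate_syll_per_s": n_syllables / duration_s,
    }

# Hypothetical usage: a sad token should show a lower mean f0 and a slower
# rate than an angry token produced by the same speaker.
# acoustic_profile("speaker01_sadness_item03.wav", n_syllables=9)
```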

It bears noting that much of the information we have gained is based on analyses of posed or simulated exemplars of vocal emotion which were elicited from professional or lay actors who were native speakers of the language of interest. Given the close interplay of emotion and linguistic cues in speech, this investigative approach is often necessary in practical terms to control for variations in the linguistic content of utterances, especially when one of the research goals is to compare acoustic measures of different emotional expressions in speech which can be influenced by the segmental and suprasegmental properties of a language (Pell, 2001). Another characteristic of the present literature is that much of our knowledge of the acoustic properties of vocal emotions derives from major European languages such as English or German. Curiously, there have been few attempts to compare emotional expressions produced under a similar set of conditions by speakers of several different languages, especially languages which vary in their linguistic and/or cultural similarity. Thus, while there appear to be “modal tendencies” in how speakers encode discrete emotions in different languages (e.g., Scherer et al., 2001), this evidence is derived largely from the perceptual literature and/or through indirect comparisons of vocal emotion expressions produced in different languages and with different types of stimulus materials (words, sentences, nonsense speech, or spontaneous dialogue, see Juslin & Laukka, 2003). Research which has undertaken a systematic, controlled study of the acoustic and perceptual properties of vocal emotion expressions in several languages in tandem is still rare (Burkhardt et al., 2006).

In the present study, our goal was to directly compare patterns for expressing and recognizing vocal emotion expressions which are assumed to possess certain invariant properties in four distinct language contexts. Given certain evidence that linguistic and/or cultural similarity could play a role in how vocal emotions are recognized (Scherer et al., 2001), we focused on four distinct languages which varied in a systematic manner in their “linguistic proximity” and typology: English, German, Hindi, and Arabic. Whereas English and German are considered closely related in both linguistic and cultural terms (i.e., both from the Germanic branch of Indo-European languages), Hindi is a more distantly related language from the Indo-European family, and Arabic comes from an entirely distinct language group (Semitic). In each language condition, a common procedure was followed: male and female encoders produced utterances in their native language to convey a standard set of different emotions by using their voice; and, recordings of the vocal stimuli were presented to a native listener group (half male, half female) who judged the intended emotion of the speaker for items produced in the same language. By following the same methods in each language condition, our data allowed us to examine patterns of vocal emotion recognition in each language context separately and through direct cross-language comparisons. We also extracted basic acoustic measures of the items presented in each language to compare the acoustic data with emotion recognition rates across languages. Although it was not the purpose of this study to present vocal expressions of emotion to listeners in their non-native language, complementary studies of this nature are ongoing (e.g., Pell et al., 2009).

One of the unique methodological challenges of studying vocal communication of emotion is how to isolate processes related to the encoding/decoding of emotions in the voice from those of processing linguistic-contextual cues of the utterance which accompany vocal emotion expressions; this potential confound affects any investigation of how emotions are recognized from vocal cues in speech because listeners may attend to corresponding linguistic features which bias or conflict with the meaning of the vocal cues. One way to circumvent the “isolation problem” is to require speakers to express emotions in “pseudo-utterances” which mimic the phonotactic and morpho-syntactic properties of the language of interest, in the absence of meaningful lexical-semantic information (Pell & Baum, 1997; Scherer, Banse, Wallbott, & Goldbeck, 1991). It has been shown that such stimuli can be produced in a relatively natural manner by encoders to portray a range of vocal emotions. The recorded utterances can then be judged by listeners, allowing inferences about the processing of vocal emotions in different languages in a controlled context where listeners must base their judgements strictly on vocal parameters of the utterances. We adopted this approach here to determine precisely how vocal cues operate during emotional encoding and recognition in each of our four language contexts.

Since there is little precedent for this research in the vocal literature, firm predictions about the influence of language on emotion recognition patterns or on the major acoustic cues involved could not be made with certainty. Based on our literature review, we expected that individual speakers/encoders in our experiment would display somewhat different abilities and patterns for encoding emotions in the voice, and that this would be true for each of our language conditions under study. We also predicted that listeners would be capable of identifying each of the target emotions from pseudo-utterances in their native language at levels exceeding chance, although one may expect variations in recognition accuracy and error confusion patterns when the language conditions are compared. Overall, it was expected that expressions of anger and sadness would yield the highest recognition rates in each language, and that disgust might lead to relatively poor recognition rates, although more precise patterns for identifying emotions and their relationship among the four language conditions were unclear. It was assumed that the major acoustic parameters of emotion expressions—mean f0, f0 range, and speech rate—would contribute significantly to differences among the emotion categories in each of the four languages. In light of evidence that vocal emotions can be recognized across languages, we expected to find qualitatively similar tendencies in how the acoustic parameters were associated with specific emotional expressions in the four languages, although the relationship between acoustic and perceptual measures of vocal emotion recognition has not previously been described in this way.
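To make the last prediction concrete: analyses of this kind typically ask whether the acoustic parameters jointly separate the emotion categories, for instance with a discriminant function analysis. The sketch below shows that style of analysis on synthetic data whose values merely echo the modal tendencies reviewed above; the numbers and the scikit-learn classifier are illustrative assumptions, not the study's data or procedure.

```python
# Minimal sketch of a discriminant analysis over the three acoustic
# parameters; the data are synthetic and illustrative only.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def tokens(mean_f0, f0_range, rate, n=40):
    """Draw n synthetic utterances around a modal acoustic profile."""
    return np.column_stack([
        rng.normal(mean_f0, 15, n),   # mean f0 (Hz)
        rng.normal(f0_range, 10, n),  # f0 range (Hz)
        rng.normal(rate, 0.3, n),     # speech rate (syllables/s)
    ])

# Values loosely follow the modal tendencies described in the introduction:
X = np.vstack([tokens(220, 90, 4.2),   # "anger": high f0, wide range, fast
               tokens(140, 30, 2.7),   # "sadness": low f0, narrow range, slow
               tokens(230, 55, 4.5)])  # "fear": high f0, less variation, fast
y = np.repeat(["anger", "sadness", "fear"], 40)

lda = LinearDiscriminantAnalysis()
scores = cross_val_score(lda, X, y, cv=5)  # 5-fold classification accuracy
print(f"Mean cross-validated accuracy: {scores.mean():.2f}")
```

High cross-validated accuracy on real measurements would indicate, as predicted, that mean f0, f0 range, and speech rate jointly differentiate the emotion categories.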

Section snippets

Methods

For each of the four languages under study (English, German, Hindi, Arabic), a common set of procedures was adopted to construct and validate the stimuli in each condition. For each language separately, an emotion elicitation study was first carried out to produce recordings of vocal emotion expressions from native speakers. At a second stage, an emotion perception study was undertaken to measure how the vocal stimuli are perceived by a group of native listeners and acoustic analyses were

Results

Table 3 presents the mean recognition rates and Table 4 presents the acoustic data for valid exemplars of each emotion averaged across the four encoders, per language condition. As expected, despite eliminating tokens which were poorly recognized owing to presumed difficulties in the ability to consistently pose emotion expressions, there was marked variability in how accurately decoders recognized specific emotions from the voice in each of the four languages of interest. Based on qualitative
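For context on how "exceeding chance" can be established: with a forced choice among the six emotions plus neutral, chance performance is one in seven, and each emotion's hit rate can be tested against that level. The sketch below illustrates this with hypothetical counts; the seven-alternative assumption and the figures are not taken from Table 3.

```python
# Minimal sketch (hypothetical counts): test each emotion's hit rate
# against chance, assuming a seven-alternative forced choice.
from scipy.stats import binomtest

CHANCE = 1 / 7  # six emotions plus neutral (assumed response set)

# (correct identifications, total judgements) per intended emotion
counts = {"anger": (156, 240), "sadness": (148, 240), "disgust": (62, 240)}

for emotion, (hits, n) in counts.items():
    result = binomtest(hits, n, CHANCE, alternative="greater")
    print(f"{emotion:8s} hit rate = {hits / n:.2f}, p = {result.pvalue:.3g}")
```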

Discussion

Vocal expression is a primary and phylogenetically significant part of the human repertoire for communicating emotions independent of language (Cosmides, 1983; Wilson & Wharton, 2006). However, most commonly these expressions are realized in the context of speech, according to the prevailing structure of the language in use, which could influence the physical form of vocal emotion expressions (Pell, 2001) and/or how they are interpreted from one language to another (Juslin & Laukka, 2001). To

Acknowledgements

This work was supported by a Discovery grant from the Natural Sciences and Engineering Research Council of Canada (to M.D. Pell) and by the German Research Foundation (DFG FOR 499 to S.A. Kotz). We thank Sarah Crowder, Sarah Elgazar, Romy Leidig, and Rajashree Sen for their efforts with the recordings and for data organization.

References

  • M.D. Pell et al. Implicit processing of emotional prosody in a foreign versus native language. Speech Communication (2008).
  • K.R. Scherer et al. The voice of confidence: Paralinguistic cues and audience evaluation. Journal of Research in Personality (1973).
  • D. Wilson et al. Relevance and prosody. Journal of Pragmatics (2006).
  • D. Albas et al. Perception of the emotional content of speech: A comparison of two Canadian groups. Journal of Cross-Cultural Psychology (1976).
  • A. Atkinson et al. Asymmetric interference between sex and emotion in face perception. Perception & Psychophysics (2005).
  • J. Bachorowski et al. Vocal expression of emotion: Acoustic properties of speech are associated with emotional intensity and context. Psychological Science (1995).
  • R. Banse et al. Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology (1996).
  • J. Borod et al. Right hemisphere emotional perception: Evidence across multiple channels. Neuropsychology (1998).
  • J. Borod et al. Relationships among facial, prosodic, and lexical channels of emotional perceptual processing. Cognition and Emotion (2000).
  • Burkhardt, F., Audibert, N., Malatesta, L., Türk, O., Arslan, L., & Auberge, V. (2006). Emotional prosody—Does culture...
  • L. Cosmides. Invariances in the acoustic expression of emotion during speech. Journal of Experimental Psychology: Human Perception and Performance (1983).
  • P. Ekman. An argument for basic emotions. Cognition and Emotion (1992).
  • P. Ekman et al. The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica (1969).
  • P. Ekman et al. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology (1971).
  • P. Ekman et al. Universals and cultural differences in the judgments of facial expressions of emotion. Journal of Personality and Social Psychology (1987).
  • P. Ekman et al. Pan-cultural elements in facial displays of emotion. Science (1969).
  • H. Elfenbein et al. On the universality and cultural specificity of emotion recognition: A meta-analysis. Psychological Bulletin (2002).
  • H. Elfenbein et al. Toward a dialect theory: Cultural differences in the expression and recognition of posed facial expressions. Emotion (2007).
  • I. Fónagy et al. Emotional patterns in intonation and music. Zeitschrift für Phonetik (1963).
  • R. Frick. The prosodic expression of anger: Differentiating threat and frustration. Aggressive Behavior (1986).
  • L. Goos et al. Sex related factors in the perception of threatening facial expressions. Journal of Nonverbal Behavior (2002).
  • U. Hess et al. Who may frown and who should smile? Dominance, affiliation, and the display of happiness and anger. Cognition and Emotion (2005).