Elsevier

Journal of Phonetics

Volume 40, Issue 1, January 2012, Pages 177-189
Evidence for phonetic and social selectivity in spontaneous phonetic imitation

https://doi.org/10.1016/j.wocn.2011.09.001

Abstract

Spontaneous phonetic imitation is the process by which a talker comes to be more similar-sounding to a model talker as the result of exposure. The current experiment investigates this phenomenon, examining whether vowel spectra are automatically imitated in a lexical shadowing task and how social liking affects imitation. Participants were assigned to either a Black talker or White talker; within this talker manipulation, participants were either put into a condition with a digital image of their assigned model talker or one without an image. Liking was measured through attractiveness rating. Participants accommodated toward vowels selectively; the low vowels /æ ɑ/ showed the strongest effects of imitation compared to the vowels /i o u/, but the degree of this trend varied across conditions. In addition to these findings of phonetic selectivity, the degree to which these vowels were imitated was subtly affected by attractiveness ratings and this also interacted with the experimental condition. The results demonstrate the labile nature of linguistic segments with respect to both their perceptual encoding and their variation in production.

Highlights

► Social factors such as liking and dialect influence the degree of spontaneous phonetic imitation. ► Phonetic knowledge is labile with respect to both perception and production. ► Auditory exposure influences subsequent production.

Introduction

Language users acquire the varieties and dialects spoken around them. Sentence structure, lexical selection, and pronunciation are all determined by patterns in the ambient language. Children, for example, acquire language from their caretakers and peer group (Chambers, 1992, Payne, 1980). When adults move to new dialect areas the adoption of novel features is more variable, but, generally, they eventually acquire aspects of the new dialect (Evans and Iverson, 2007, Munro et al., 1999, Trudgill, 1986), even after brief exposure (Delvaux & Soquet, 2007). In the absence of physical relocation, adults' speech patterns are also susceptible to change (Harrington, 2006, Harrington, 2007, Harrington et al., 2000a, Harrington et al., 2000b). These new dialect features do not wholly replace the native dialect, but rather are added to a speaker's repertoire (Howell, Barry, & Vinson, 2006). This addition, in theory, would enable a talker to shift dialects or styles. A question surrounding such style-shifting behaviour is whether it is within a speaker's conscious control (cf. Eckert, 2001, Labov, 2001).

In addition to changing production as a result of real-world exposure to a new dialect, similar changes in speech have been induced in the laboratory. For example, imagine a picture with a man holding a cake and a woman standing by. The orientation of the image suggests the man intends to pass the cake to the woman. Participants who have been exposed to an oral description "The boy gave the toy to the teacher" are more likely to describe the cake picture as "The man gave the cake to the woman" as opposed to "The man gave the woman the cake". The second description is a completely grammatical utterance that accurately conveys what is going on in the picture. But simply being exposed to the construction "give X to Y" favours the future use of that construction over "give Y X" (Bock, 1986, Branigan et al., 2007, Pickering and Ferreira, 2008). Similarly, with word choice, individuals align their lexical selections to those of their interlocutors in experimental settings (Garrod & Doherty, 1994).

These constant updates to the language system are particularly interesting from a phonetic perspective. The fact that the perceptual categories of language are labile is well grounded (Clayards et al., 2008, Kraljic and Samuel, 2005, Kraljic and Samuel, 2006, Kraljic and Samuel, 2007, Kraljic et al., 2008, Kraljic et al., 2008, Maye et al., 2008, Norris et al., 2003). With respect to speech production, when faced with the question of why talkers sound like they do, the immediate answer is that it is determined by a talker's physiology. The exact size and shape of a talker's oral cavity go a long way in determining the acoustic characteristics of that particular talker. Putting aside physiological differences between talkers, however, the question of why we sound like we do returns us to the notion of sounding like those around us. From birth, talkers acquire the ambient language and dialect to which they are exposed, and, if we are to examine how these speech production targets may change from one production to the next, we simply need to examine phonetic imitation.

Phonetic imitation, also known as phonetic convergence or phonetic accommodation, is the process in which a talker takes on acoustic characteristics of the individual they are interacting with. Like sentence structure and lexical selection, phonetic imitation has also been investigated in the laboratory. This research has revealed that phonetic imitation can occur both in socially minimal situations where talkers are simply producing single words (Goldinger, 1997, Goldinger, 1998, Goldinger and Azuma, 2004, Namy et al., 2002) and in cooperative, socially rich, dyadic interactions (Pardo, 2006, Pardo, 2009). This phenomenon is important, since it may account for a wide range of phenomena such as historical sound change and dialect acquisition (Trudgill, 2008). Moreover, the cognitive mechanisms that motivate spontaneous imitation are pervasive from infancy (Kuhl & Meltzoff, 1996) and are found across non-linguistic behaviours like foot-shaking and face-touching (Dijksterhuis & Bargh, 2001). In terms of speech, the cognitive mechanisms that prompt actions that are taken in by the perceptual field are of great importance for theories of speech perception and production. The topic is particularly crucial for theories that propose a strong link between the two processes (e.g., Goldstein & Fowler, 2003) as well as for theories about the representational units that mediate the perception–production link.

For example, with respect to representation, Mitterer and Ernestus (2008) argue that abstract phonological units mediate the link between speech perception and speech production. Their argument is based on evidence gathered through a speeded shadowing task where they demonstrate that Dutch speakers do not accommodate to different speakers' /r/ variants or their degree of pre-voicing. Using a slightly modified version of this paradigm where the task is, crucially, not speeded, other researchers have shown that speakers phonetically imitate model speakers' productions (Goldinger, 1998, Namy et al., 2002). Investigations of the details involved in phonetic imitation reveal that speakers imitate subphonemic properties of the speech signal (Nielsen, 2008, Nielsen, 2011, Shockley et al., 2004), which robustly suggests that a more detailed phonetic signal bridges the relationship between perception and production in speech. The current experiment is an auditory naming task that uses methods similar to previous work (Goldinger, 1998, Namy et al., 2002, Shockley et al., 2004), but focuses on the phonetic details of vowel imitation. Using two model talkers across four conditions, the results support the claim that subphonemic phonetic details are mapped from speech perception to production in ways that are mediated by experimental condition and talker voice. Crucially, these data also suggest a relationship between social liking and the strength of the mapping. This finding indicates that the socially mediated perception-behaviour link proposed by Dijksterhuis and Bargh (2001) applies to speech behaviour as well.

The earliest studies on imitation and convergence examined broad acoustic measures and relied on perceptual judgments from listeners. Natale (1975a, 1975b) found that in dyads, interviewees accommodated to the intensity levels and temporal patterns of the interviewer. In a series of studies, Gregory and colleagues examined behaviours across conversational dyads. They found that conversational partners converge with respect to intensity, pause duration, pause frequency, turn-taking duration, and turn-taking frequency (Gregory & Hoyt, 1982) and long-term average spectra (Gregory et al., 1993, Gregory and Webster, 1996, Gregory et al., 1997, Gregory et al., 2001).

Goldinger (1998) examined phonetic imitation at the word level using a lexical shadowing task to prompt imitation and an AXB task to elicit judgments of perceptual similarity (i.e., imitation). Goldinger found lexical frequency, amount of exposure, and shadowing time (immediate or delayed) interacted with degree of imitation. In the immediate shadowing condition, all words were judged to have been imitated for all talkers with increasing word frequency inhibiting convergence. In the delayed shadowing condition only low and middle-low frequency words showed convergent behaviour. To determine what aspects of the acoustic signal were converging, Goldinger (1997) reported the results of a pilot study involving acoustic measurements. Average pitch, intonation, and word duration were measured. All participants deviated from their baseline average pitch toward the pitch of the talker they were exposed to in the shadowing task. In another task, Goldinger and Azuma (2004) revealed that effects of episodic memory traces are activated in orthographic representations as well. Goldinger's (1998) basic finding was replicated by Namy et al. (2002); however they expanded on the result in finding that female participants accommodated to the model talkers more than male participants.

One of the most influential pieces of work on phonetic convergence is reported in Pardo (2006). Pardo examines phonetic convergence in same-gender dyads involved in jointly completing a map task. Pardo also employed Goldinger's (1998) paradigm; the X tokens in the AXB similarity task were taken from the recordings of an individual's conversational partner in the map task. Dyads were perceived to have converged on 62% of the trials. Listeners judged that male dyads converged more than female dyads (75% vs. 58%). Women were found to converge toward the speaker who was receiving instructions. Men patterned oppositely; they converged toward the speech of the male talkers giving instructions. Pardo (2009) provides acoustic analyses of the same-gender dyads alone, and also presents results from mixed-gender dyads. Like Pardo's same-gender dyads, mixed-gender pairs converged, albeit to a lesser extent; listeners perceived mixed-gender pairs to have converged on 53% of the trials. Pardo conducted linear regressions with F0 and duration data to determine the cues on which listeners based their judgments. These values accounted for 41% of the variance for the female talkers, but only 7% of the variance for the male data. Pardo's analysis also involved comparing formant frequencies of /i u æ ɑ/ from a pre- and post-task hVt wordlist. The analysed items were words not produced as part of the map task that comprised the core of Pardo's analysis, so any changes in the wordlist items would be indicative of persistent system-wide changes. Pardo found that givers of map instructions centralized their low vowels from pre- to post-task, which also involved diverging from the receivers to whom they were giving directions.

Thus far, research has established that talkers do spontaneously imitate and accommodate in verbal interaction. Still, little is known about what within the acoustic signal may be the target of imitation. Some work has been done on the imitation of lengthened VOT. Shockley et al. (2004) demonstrated that talkers imitate artificially lengthened VOTs in American English aspirated stops. Nielsen (2011) further expanded on VOT imitation in American English, finding an interaction between imitation and generalization across the system. Nielsen revealed that not only do participants imitate the increased VOTs for words they are exposed to (in this case, /p/-initial words), but they generalize the lengthened pattern to a segment sharing the same [+ spread glottis] feature (/k/-initial words). Analyses revealed a word-specificity effect such that increased VOT effects were strongest in the /p/-initial words heard in the exposure phase. Sancier and Fowler (1997) examined how VOT patterns transfer cross-linguistically based on the ambient language environment for a bilingual speaker of Brazilian Portuguese and English. They found that after multi-month spans of speaking English (her L2), the VOT of voiceless stops increased in her Brazilian Portuguese productions. The reverse was also observed, such that her English voiceless stop VOTs decreased in duration after being in a Brazilian Portuguese environment. Turning to vowels, a study by Vallabha and Tuller (2004) examined talkers' abilities to intentionally imitate their own steady-state vowel productions. They found that their three participants exhibited talker-specific patterns of imitation bias, which could either be due to noise in production or bias in the perceptual processing of the tokens. Vallabha and Tuller speculate that the dialect background of their participants could also have influenced the directions of the imitation bias.

While models of speech perception and production may account for imitation through the structure of the cognitive language system (Pickering & Garrod, 2004), others have accounted for accommodation in speech in its function as a social tool. Communication Accommodation Theory (CAT) argues that speech convergence phenomena are driven by an individual's motivation to be socially accepted or to identify with a particular social group (Giles & Coupland, 1991: 71–72). Shepard, Giles, & LePoire (2001: 33–34) posit that social and cognitive processes affect linguistic behaviour and that individual motivation is the driving force behind accommodative speech behaviour. Language is, therefore, a tool used by speakers to achieve a particular social distance. For example, talkers converge with an interlocutor in order to lessen the social distance and they diverge to increase social distance (Bourhis & Giles, 1977; Giles, 1973, Giles et al., 1973). Similarly, Bell's theory of audience design (Bell, 1984, 2001) argues that talkers will shift speech styles in accordance with audience – interlocutor, overhearers, eavesdroppers, etc. – speech styles, such that in a given instance, a talker's speech style reflects the demands of the setting.

This background leads to the goals of the current experiment, which are multifold. The first goal is to expand our understanding about what is imitated in the phonetic signal by targeting the investigation on the first and second formant frequencies of vowels. While exploring vowels, the phenomenon of phonetic imitation provides a nice test case for theories of speech perception and production. Theories of speech perception that hypothesize that speech gestures are the objects of speech perception such as the Motor Theory (Liberman, 1957, Liberman and Mattingly, 1985) and Direct Realism (Fowler, 1996, Goldstein and Fowler, 2003) easily account for imitation in speech as the perceived signal essentially contains the specific gestural instructions as to how to imitate. Exemplar-based models of speech perception (Goldinger, 1997, Johnson, 1997, Pierrehumbert, 2001, Pierrehumbert, 2003) also easily account for imitative phenomena. Within an exemplar-based model episodic traces are activated in memory when a talker voice or word is perceived. Recently, Tilsen (2009) presented evidence for exemplars in speech production through a subphonemic shadowing task. A simple description of such a theory is as follows. Upon hearing a word, episodic traces associated with the talker voice or word are activated in memory. The more familiar a voice or the higher frequency the word, the higher the number of activated traces because of increased experience. Goldinger (1997: 46) clarifies, “even if an exact match to the [word] exists in memory, all similar activated traces create a ‘generic echo,’ regressing toward the mean of the activated set.” It is the mean of the activated set that is selected for production. Such a theory predicts that upon perceiving a particular word, the activated traces will contribute to its production. Thus, in terms of phonetic imitation, exposure to words produced by a model talker will shift a participant's productions toward those of the model talker.

In addition, given the assumptions of exemplar-based theories, the effect of imitation should be cumulative. That is, higher levels of activation, which are expected to result from repeated exposure, should lead to more imitation; this was essentially found by Goldinger (1998). In that study listeners judged more imitation for tokens that had higher repetition counts in the immediate shadowing conditions. A crucial difference between these two theories is that there exist formulations of exemplar-based models that incorporate social cognition directly into the model of spoken language (e.g., Johnson, 2006).
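
As a rough illustration of the "generic echo" idea and its cumulative prediction, the sketch below treats each episodic trace as an (F1, F2) pair with an activation weight and takes the activation-weighted mean of the activated set as the production target. The `Trace` class, the `generic_echo` function, the equal weights, and all formant values are hypothetical and chosen only for illustration; this is not the model implemented in Goldinger (1997) or in the present study.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trace:
    """One stored episode of a vowel (hypothetical representation)."""
    f1: float          # first formant in Hz
    f2: float          # second formant in Hz
    activation: float  # assumed activation weight of this trace


def generic_echo(traces: List[Trace]) -> Tuple[float, float]:
    """Return the activation-weighted mean (F1, F2) of the activated traces."""
    total = sum(t.activation for t in traces)
    if total <= 0:
        raise ValueError("at least one trace must have positive activation")
    f1 = sum(t.f1 * t.activation for t in traces) / total
    f2 = sum(t.f2 * t.activation for t in traces) / total
    return f1, f2


# Invented values: a participant's own traces for a low vowel.
own = [Trace(700.0, 1700.0, 1.0) for _ in range(8)]
# Invented values: one trace laid down by hearing the model talker.
model = Trace(800.0, 1600.0, 1.0)

baseline = generic_echo(own)
after_two = generic_echo(own + [model] * 2)    # two exposures
after_four = generic_echo(own + [model] * 4)   # four exposures
```

Under these assumptions the echo moves toward the model talker's values as exposures accumulate, which mirrors the cumulative effect described above; a fuller model would also let activation depend on similarity and recency.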

The second and third goals of the experiment relate to the social nature of imitation and how it relates to levels of social engagement. If phonetic imitation is a primarily social behaviour, then it should occur more in situations that have more social context. In this experiment, one set of conditions includes Visual Prompts, which are still digital images of the model talkers. This manipulation is intended to increase the level of cognitive social engagement with the task by embodying the model voices with the talkers' faces. Thus, the second goal is to compare imitation across Visual Prompt and No Visual Prompt Conditions, where the presence of the social information available in the image is predicted to alter participants' behaviours (Hay, Warren, & Drager, 2006). The third goal is an attempt to incorporate social and behavioural psychology into our understanding of phonetic imitation. To review, exemplar-based theories predict imitation will occur naturally as a function of the linguistic system and social theories predict imitation will occur in response to more highly crafted social factors. Previous work on phonetic imitation demonstrates that it occurs in relatively asocial laboratory tasks (Goldinger, 1997, Goldinger, 1998, Namy et al., 2002, Nielsen, 2008, Nielsen, 2011, Shockley et al., 2004), which suggests that it is a low-level automated behaviour. Such findings, however, do not necessarily preclude social factors from playing a role and, of course, a speech research laboratory is still a type of social context. As mentioned above, Dijksterhuis and Bargh (2001) argue that the simplest mitigating factor in the perception-behaviour link is liking. From this we can predict that an individual's degree of liking of a model talker should affect their degree of imitation. In this experiment liking is measured through participant ratings of model talker attractiveness, which is highly correlated with liking (Byrne, 1971).

Specifically, CAT offers the prediction for this study that when there is interpersonal liking between the participant and the model talker we can expect convergent speech behaviour. Cases of linguistic divergence are then predicted where the participant does not like the model talker and rates the model talker as unattractive.

Methodology

The methods for the auditory naming task are described below.

Auditory naming task

A Praat (Boersma & Weenink, 2005) script automatically identified pauses by marking boundaries preceding and following over 600 ms of low intensity energy (<59 dB). These boundaries were hand-corrected so as to mark the onset and offset of the vowel. A second script extracted the mean first and second formants from a series of Gaussian windows spanning the middle 50% of the vowel with a 2.5 ms step size. Outliers were identified as those tokens where the F1 or F2 was more than three standard
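
The pause-marking step described at the start of this section can be sketched outside Praat as follows. This is only an illustration under stated assumptions: the input is taken to be an intensity contour already sampled at a fixed step, and the function name `find_pauses` and its parameters are hypothetical. It is not the author's Praat script, which is not reproduced in the source.

```python
from typing import List, Tuple


def find_pauses(intensity_db: List[float],
                step_ms: int,
                threshold_db: float = 59.0,
                min_dur_ms: int = 600) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) for each run of frames below threshold_db
    that lasts strictly longer than min_dur_ms."""
    pauses: List[Tuple[int, int]] = []
    run_start = None  # index of the first low-intensity frame in the current run
    # A sentinel above any threshold closes a run that reaches the end of the contour.
    for i, level in enumerate(list(intensity_db) + [float("inf")]):
        if level < threshold_db:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            start_ms, end_ms = run_start * step_ms, i * step_ms
            if end_ms - start_ms > min_dur_ms:
                pauses.append((start_ms, end_ms))
            run_start = None
    return pauses


# Invented contour sampled every 10 ms: 700 ms quiet, 300 ms speech,
# 600 ms quiet (not "over" 600 ms), 100 ms speech.
contour = [50.0] * 70 + [70.0] * 30 + [50.0] * 60 + [70.0] * 10
pauses = find_pauses(contour, step_ms=10)
```

Boundaries found this way would still need hand correction to vowel onsets and offsets, as the text notes.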

Analysis across conditions

The histograms in Fig. 2 illustrate that the peaks in the distributions are negative, suggesting an overall trend of imitation. In order to determine whether there were overall effects of imitation for the entire data set and for each vowel across all participants, a series of six t-tests was conducted to determine whether the difference in distance values were significantly below 0. A Bonferroni correction adjusted the alpha level for the six comparisons to 0.008 (0.05/6). Overall, difference
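
The logic of testing whether difference in distance values fall below 0 under a Bonferroni-adjusted threshold can be sketched as below. The definition used for `difference_in_distance` (Euclidean distance to the model talker in F1/F2 space after shadowing, minus the same distance at baseline) is an assumption made for illustration, since this excerpt does not spell out the measure. All names and numbers are hypothetical, and the sketch stops at the t statistic rather than computing a p-value. It requires Python 3.8 or later for `math.dist`.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple

Formants = Tuple[float, float]  # (F1, F2) in Hz


def difference_in_distance(baseline: Formants, shadowed: Formants,
                           model: Formants) -> float:
    """Assumed measure: negative values mean the shadowed token is closer
    to the model talker than the baseline token was."""
    return math.dist(shadowed, model) - math.dist(baseline, model)


def one_sample_t(values: Sequence[float], mu0: float = 0.0) -> float:
    """One-sample t statistic for the mean of values against mu0."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    return (mean(values) - mu0) / (stdev(values) / math.sqrt(n))


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test alpha level after a Bonferroni correction."""
    return alpha / n_tests


# Invented tokens for one vowel.
did = difference_in_distance(baseline=(700.0, 1700.0),
                             shadowed=(760.0, 1640.0),
                             model=(800.0, 1600.0))
t_stat = one_sample_t([-2.0, -1.0, -3.0, -2.0])
alpha_per_test = bonferroni_alpha(0.05, 6)
```

A negative t statistic here points in the direction of imitation; deciding significance would require comparing the corresponding one-tailed p-value with the corrected alpha using a statistics package.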

Conclusions

The results reported here suggest that phonetic imitation is selective from both a phonetic and social perspective. Not all vowels were imitated to the same degree—it seems that there is more imitation when there is the phonetic space to do so. To this end, it seems that participant dialect may play a crucial role in accommodation, which may provide the impetus for dialect acquisition and change (Trudgill, 2008). Additionally, the fact that more imitation occurred in the condition that involved

Acknowledgements

This research was supported by the Center for Race and Gender and the Abigail Hodgen Publication Fund at the University of California, Berkeley. This project was conducted in partial fulfillment of the requirements of Doctor of Philosophy to the author at UC Berkeley. Thanks to Keith Johnson, Andrew Garrett, Rudy Mendoza-Denton, and Ben Munson for improving this project immeasurably. Special thanks to Dasha Bulatov and Tyler Frawley for help completing the project. This manuscript has been

References (77)

  • T. Kraljic et al. (2008). Accommodating variation: Dialects, idiolects, and speech processing. Cognition.
  • T. Kraljic et al. (2005). Perceptual learning for speech: Is there a return to normal? Cognitive Psychology.
  • T. Kraljic et al. (2007). Perceptual adjustments to multiple speakers. Journal of Memory and Language.
  • A. Liberman et al. (1985). The motor theory of speech perception revised. Cognition.
  • M.J. Munro et al. (1999). Canadians in Alabama: A perceptual study of dialect acquisition in adults. Journal of Phonetics.
  • K. Nielsen (2011). Specificity and abstractness of VOT imitation. Journal of Phonetics.
  • D. Norris et al. (2003). Perceptual learning in speech. Cognitive Psychology.
  • M.L. Sancier et al. (1997). Gestural drift in a bilingual speaker of Brazilian Portuguese and English. Journal of Phonetics.
  • S. Tilsen (2009). Subphonemic and cross-phonemic priming in vowel shadowing: Evidence for the involvement of exemplars in production. Journal of Phonetics.
  • P. Adank et al. (2004). A comparison of vowel normalization procedures for language variation research. Journal of the Acoustical Society of America.
  • R.H. Baayen (2008). Analyzing linguistic data: A practical introduction to statistics for R.
  • R.H. Baayen et al. (1993). The CELEX lexical database (CDROM). Linguistic data consortium.
  • M. Babel (2010). Dialect divergence and convergence in New Zealand English. Language in Society.
  • A. Bell (1984). Language style as audience design. Language in Society.
  • A. Bell. Back in style: Reworking audience design.
  • K. Bock (1986). Syntactic persistence in language production. Cognitive Psychology.
  • P. Boersma et al. (2005). Praat: Doing phonetics by computer (version 4.3.29) [computer program].
  • R.Y. Bourhis et al. The language of intergroup distinctiveness.
  • D. Byrne (1971). The attraction paradigm.
  • J. Chambers (1992). Dialect acquisition. Language.
  • C. Clopper et al. (2005). Acoustic characteristics of the vowel systems of six regional varieties of American English. Journal of the Acoustical Society of America.
  • V. Delvaux et al. (2007). The influence of ambient speech on adult speech productions through unintentional imitation. Phonetica.
  • P. Eckert. Style and social meaning.
  • B.G. Evans et al. (2007). Plasticity in vowel perception and production: A study of accent change in young adults. Journal of the Acoustical Society of America.
  • C. Fowler (1996). Listeners hear sounds, not tongues. Journal of the Acoustical Society of America.
  • H. Giles (1973). Accent mobility: A model and some data. Anthropological Linguistics.
  • H. Giles et al. (1991). Language: Contexts and consequences.
  • H. Giles et al. (1973). Towards a theory of interpersonal accommodation through language: Some Canadian data. Language in Society.