Cognition

Volume 100, Issue 3, July 2006, Pages B21-B31

Brief article
Audio-visual speech perception off the top of the head

https://doi.org/10.1016/j.cognition.2005.09.002

Abstract

The study examined whether people can extract speech-related information from a talker's upper face, presented either as normally textured videos (Experiments 1 and 3) or as videos showing only an outline of the head (Experiments 2 and 4). Experiments 1 and 2 used within- and cross-modal matching tasks. In the within-modal task, observers were presented with two pairs of short silent video clips showing the top part of a talker's head. In the cross-modal task, pairs of audio and silent video clips were presented. The task was to determine the pair in which the talker said the same sentence. Performance on both tasks was better than chance for the outline as well as the textured presentations, suggesting that judgments were primarily based on head movements. Experiments 3 and 4 tested whether observing the talker's upper face would help identify speech in noise. The results showed that viewing the talker's moving upper head produced a small but reliable improvement in speech intelligibility; however, this effect held only for the expressive sentences, which involved greater head movements. The results suggest that people are sensitive to speech-related head movements that extend beyond the mouth area and can use these to assist in language processing.

Section snippets

Experiments 1 and 2

Experiments 1 and 2 examined whether observers could use a talker's upper head and face movement to determine whether a pair of silent videos (within-modal) or an audio and silent video pair (cross-modal) was from the same sentence. Experiment 1 used full colour videos that showed the texture of the talker's face and Experiment 2 used black and white videos that showed only an outline of the talker's head and eyes (Fig. 1). This contrast was employed to gauge the degree to which facial

Participants

Forty-eight undergraduate students from The University of Melbourne participated in the within- and cross-modal matching tasks (24 each in Experiments 1 and 2). All were native speakers of English and had normal or corrected-to-normal vision and no report of hearing loss.

Materials

The materials consisted of two sets of 16 sentences selected to contrast degree of head movement: an expressive and a non-expressive set. The non-expressive, phonetically balanced sentences, drawn from the IEEE (1969) list, described mundane events (e.g. ‘the jacket hung on the back of the wide chair’). The 16 expressive sentences required a more animated rendition (e.g. ‘that is really annoying; I have to let you know’). Video recordings of a single male talker saying each sentence were selected

Procedure

The following procedure was used with both the textured videos (Experiment 1) and outline videos (Experiment 2). The expressive and non-expressive items were equally divided into two duration groups (short, M=2.2 s; long, M=2.6 s). The different items were always selected from the same stimulus type (expressive or non-expressive) and from within the appropriate speech duration group. A comparison of the duration differences between the same pairs (different tokens) and different pairs was not

Results and discussion

A series of analyses was conducted on the data from both the textured and outline videos. Two ANOVAs (one on the participant data, Fs; one on the item data, Fi) were conducted to determine whether the error scores differed across the display conditions. In the participant ANOVA, the factor of expression was repeated and presentation type was non-repeated; in the item analysis, presentation type was repeated and type of expression was non-repeated.
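To illustrate this by-participants and by-items scheme, the sketch below sets up the participant analysis as a mixed ANOVA on made-up error scores using the pingouin package in Python. The data layout, factor names, and choice of package are assumptions for illustration only and are not taken from the original study.

    import numpy as np
    import pandas as pd
    import pingouin as pg

    # Hypothetical long-format data: one mean error score per participant per
    # expression condition. Presentation type (textured vs. outline) varies
    # between participants; expression (expressive vs. non-expressive) within.
    rng = np.random.default_rng(0)
    rows = []
    for presentation in ("textured", "outline"):
        for p in range(24):  # 24 participants per experiment
            pid = f"{presentation}-{p}"
            for expression in ("expressive", "non-expressive"):
                rows.append({
                    "participant": pid,
                    "presentation": presentation,
                    "expression": expression,
                    "errors": rng.normal(30, 8),  # placeholder error percentages
                })
    df = pd.DataFrame(rows)

    # By-participants analysis (Fs): expression repeated, presentation non-repeated.
    print(pg.mixed_anova(data=df, dv="errors", within="expression",
                         between="presentation", subject="participant"))

    # A by-items analysis (Fi) would mirror this with items as the random factor,
    # presentation type as the repeated factor and expression non-repeated.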

Mean percent errors for

Experiments 3 and 4

The previous experiments demonstrated that people could use the talker's upper face movement as a source of information about what was spoken. Experiments 3 and 4 investigated whether this information can assist in recovering speech in noise. This investigation extends a recent study by Munhall et al. (2004). They used displays of a 3D animated talking head to investigate whether the intelligibility of noise-degraded speech could be influenced by head motion. Three types of head motion were

Participants

One hundred and seventeen participants were tested in Experiment 3 and 32 participants in Experiment 4.

Materials and procedure

Experiment 3 used the textured videos of Experiment 1. In Experiment 4 the outline videos of Experiment 2 were used. In Experiment 3, participants were presented with noisy speech accompanied by the full moving face (full face), the upper moving face (upper face), or a still picture of the talker's face (still face). Participants were asked to identify as many words as they could and to type these on the keyboard. White noise was used as a masker and word identification experiments were run at

Results

Mean percent words correct (collapsed over stimulus type) for Experiment 3 (textured videos) as a function of presentation condition for the three SNRs are shown in Fig. 3 (left panel). The results confirm the well-established finding that seeing the talker's articulating full face markedly improves intelligibility. What is new is that seeing only the top part of the talker's face likewise produced a gain in intelligibility (although this improvement was relatively small).
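To make the SNR manipulation concrete, the sketch below shows one conventional way of mixing a speech waveform with a white-noise masker at a fixed signal-to-noise ratio. The function and scaling scheme are illustrative assumptions only; the excerpt does not specify the exact SNR levels or calibration used in the experiments.

    import numpy as np

    def mix_with_white_noise(speech, snr_db, rng=None):
        """Add white noise to a speech waveform at the requested SNR (in dB).

        Illustrative sketch only: the study's actual masker calibration is not
        described in this excerpt.
        """
        rng = np.random.default_rng() if rng is None else rng
        noise = rng.standard_normal(speech.shape)
        speech_rms = np.sqrt(np.mean(speech ** 2))
        noise_rms = np.sqrt(np.mean(noise ** 2))
        # Scale the noise so that 20*log10(speech_rms / scaled_rms) equals snr_db.
        scaled_rms = speech_rms / (10 ** (snr_db / 20.0))
        return speech + noise * (scaled_rms / noise_rms)

    # Example: mix a dummy 1-second, 16 kHz signal at -6 dB SNR.
    dummy_speech = np.random.default_rng(1).standard_normal(16000) * 0.1
    noisy = mix_with_white_noise(dummy_speech, snr_db=-6)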

Statistical analysis

Discussion

Observing the talker's mouth and jaw offers a fairly direct source of information about the segmental properties of speech; it is clear how such information could improve intelligibility. The finding that seeing only the top part of the head and face can increase intelligibility (at least for the expressive videos) shows that regions beyond those directly involved in articulation play a part in audio-visual speech processing. Further, Experiments 2 and 4 using outline videos show that the

Acknowledgements

We wish to thank Guillaume Vignali for conducting the rigid head movement analyses and Alex Bahar-Fuchs and Ashleigh Lin for helping run some of the experiments. We thank James McQueen and the two anonymous reviewers for their useful suggestions. The second author wishes to acknowledge support from Australian Research Council Grant number DP0209664.

References (14)

  • Benoît, C., et al. (1998). Audio–visual speech synthesis from French text: Eight years of models, designs and evaluation at the ICP. Speech Communication.
  • Yehia, H. C., et al. (2002). Linking facial animation, head motion and speech acoustics. Journal of Phonetics.
  • Barbosa, A. V., Daffertshofer, A., & Vatikiotis-Bateson, E. (2004). Target practice on talking faces. ICSLP, 2004, Jeju,...
  • Cavé, C., Guaïtella, I., Bertrand, R., Santi, S., Harlay, F., & Espesser, R. (1996). About the relationship between eyebrow...
  • Cutler, A., et al. (1997). Prosody in the comprehension of spoken language: A literature review. Language and Speech.
  • Davis, C., et al. (2004). Audio–visual interactions with intact clearly audible speech. The Quarterly Journal of Experimental Psychology.
  • Forster, K. I., et al. (2003). DMDX: A Windows display program with millisecond accuracy. Behavior Research Methods, Instruments, & Computers.

Cited by (40)

  • Fixating the eyes of a speaker provides sufficient visual information to modulate early auditory processing

    2019, Biological Psychology
    Citation excerpt:

    For example, seeing the lips close for /p/ is more informative about the sound’s place of articulation than acoustic cues. Various regions across a speaker’s face provide helpful visual speech information, but do so to varying degrees (Davis & Kim, 2006; Jordan & Thomas, 2011; Preminger, Lin, Payen, & Levitt, 1998; Thomas & Jordan, 2004). The mouth is the most informative area for the recognition of speech sounds, providing sufficient information for visual speech recognition and for obtaining an audiovisual benefit (IJsseldijk, 1992; Marassa & Lansing, 1995; Thomas & Jordan, 2004).

  • How visual timing and form information affect speech and non-speech processing

    2014, Brain and Language
    Citation excerpt:

    It is well established that seeing the talker’s moving face (visual speech) influences the process of speech perception, e.g., speech is perceived more accurately in quiet (Davis & Kim, 2004) and in noise (Sumby & Pollack, 1954). Such visual influence has been attributed to the information available from the talker’s oral regions, e.g., from mouth shapes, mouth and lip motion and some tongue positions (Summerfield, 1979) and peri-oral regions such as jaw, eyebrows and head (Davis & Kim, 2006; Munhall, Jones, Callan, Kuratate, & Vatikiotis-Bateson, 2004). The current study focused on the effect that perceiving speech-related movements has on speech processing, and was motivated by the observation that such motion provides two broad types of information, speech form (segment) and timing information (Summerfield, 1987).

  • Recognizing prosody across modalities, face areas and speakers: Examining perceivers' sensitivity to variable realizations of visual prosody

    2012, Cognition
    Citation excerpt:

    Participants were tested individually in a double-walled, sound attenuated booth. Each participant completed two experimental tasks: a visual–visual (VV) matching task and an auditory–visual (AV) matching task (as used in Cvejic et al., 2010b; Davis & Kim, 2006) in a counter-balanced order. Stimuli were presented in a two-interval, alternate forced choice (2AFC) discrimination task, in which each interval included a pair of stimuli to be compared and the participant’s task was to select the pair of the same prosodic type.
