Brief article
Audio-visual speech perception off the top of the head
Section snippets
Experiments 1 and 2
Experiments 1 and 2 examined whether observers could use a talker's upper head and face movement to determine whether a pair of silent videos (within-modal) or an audio and silent video pair (cross-modal) were from the same sentence. Experiment 1 used full colour videos that showed the texture of the talker's face and Experiment 2 used black and white videos that showed only an outline of the talker's head and eyes (Fig. 1). This contrast was employed to gauge the degree to which facial
Participants
Forty-eight undergraduate students from The University of Melbourne participated in the within- and cross-modal matching tasks (24 each in Experiments 1 and 2). All were native speakers of English and had normal or corrected-to-normal vision and no report of hearing loss.
Materials
The materials consisted of two sets of 16 sentences selected to contrast degree of head movement: an expressive and a non-expressive set. The non-expressive sentences, phonetically balanced and drawn from the IEEE (1969) list, described mundane events (e.g. ‘the jacket hung on the back of the wide chair’). The 16 expressive sentences required a more animated rendition (e.g. ‘that is really annoying; I have to let you know’). Video recordings of a single male talker saying each sentence were selected
Procedure
The following procedure was used with both the textured videos (Experiment 1) and outline videos (Experiment 2). The expressive and non-expressive items were equally divided into two duration groups (short, M=2.2 s; long, M=2.6 s). The different items were always selected from the same stimulus type (expressive or non-expressive) and from within the appropriate speech duration group. A comparison of the duration differences between the same pairs (different tokens) and different pairs was not
Results and discussion
A series of analyses was conducted on the data from both the textured and outline videos. Two repeated-measures ANOVAs (one on the participant data, Fs; one on the item data, Fi) were conducted to determine whether the error scores differed across the display conditions. In the participant ANOVA, expression was a repeated factor and presentation type a non-repeated factor; in the item analysis, presentation type was repeated and type of expression non-repeated.
Mean percent errors for
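The dual by-participants (Fs) and by-items (Fi) analysis described above amounts to collapsing the same error matrix two ways before testing. A minimal sketch of that logic, with entirely made-up data and a hand-rolled one-way F statistic standing in for the fuller repeated-measures designs the authors actually ran:

```python
# Illustrative sketch only (not the authors' analysis): collapsing one
# participant-by-item error matrix two ways for Fs and Fi analyses.
# Data, group sizes, and the between-factor split are all invented.
import random
from statistics import mean

random.seed(1)

n_part, n_item = 8, 8
# Hypothetical percent-error scores: errors[participant][item].
errors = [[random.gauss(20 + 5 * (p % 2), 4) for _ in range(n_item)]
          for p in range(n_part)]

# By-participants (Fs): one mean per participant, collapsing over items.
fs_data = [mean(row) for row in errors]
# By-items (Fi): one mean per item, collapsing over participants.
fi_data = [mean(errors[p][i] for p in range(n_part)) for i in range(n_item)]

def one_way_f(groups):
    """One-way ANOVA F statistic computed from first principles."""
    grand = mean(x for g in groups for x in g)
    n = sum(len(g) for g in groups)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_b, df_w = len(groups) - 1, n - len(groups)
    return (ss_between / df_b) / (ss_within / df_w)

# Fs: a between-participants factor (here, arbitrarily, even vs odd IDs).
fs = one_way_f([fs_data[0::2], fs_data[1::2]])
print(round(fs, 2))
```

Running both analyses guards against generalisation failures: an effect that holds by participants but not by items may be driven by a few sentences rather than the manipulation itself.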
Experiments 3 and 4
The previous experiments demonstrated that people could use the talker's upper face movement as a source of information about what was spoken. Experiments 3 and 4 investigated whether this information can assist in recovering speech in noise. This investigation extends a recent study by Munhall et al. (2004), who used displays of a 3D animated talking head to investigate whether the intelligibility of noise-degraded speech could be influenced by head motion. Three types of head motion were
Participants
One hundred and seventeen participants were tested in Experiment 3 and 32 participants in Experiment 4.
Materials and procedure
Experiment 3 used the textured videos of Experiment 1; Experiment 4 used the outline videos of Experiment 2. In Experiment 3, participants were presented with noisy speech accompanied by the full moving face (full face), the upper moving face (upper face), or a still picture of the talker's face (still face). Participants were asked to identify as many words as they could and to type these on the keyboard. White noise was used as a masker and word identification experiments were run at
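Speech-in-noise testing of this kind requires mixing the speech signal with the white-noise masker at a controlled signal-to-noise ratio. The snippet does not report the exact SNR levels or mixing procedure, so the following is only a sketch of the standard RMS-based method, with a synthetic "speech" signal and an arbitrary −6 dB target standing in for the real materials:

```python
# Sketch (assumed details, not the authors' procedure): mixing a signal
# with white noise at a target SNR in dB, using the standard RMS scaling
# SNR_dB = 20 * log10(rms_speech / rms_noise).
import math
import random

random.seed(0)

def rms(x):
    """Root-mean-square amplitude of a sample sequence."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def mix_at_snr(speech, snr_db):
    """Add white noise scaled so the mixture sits at snr_db relative to speech."""
    noise = [random.gauss(0.0, 1.0) for _ in speech]
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20.0))
    scale = target_noise_rms / rms(noise)
    return [s + scale * n for s, n in zip(speech, noise)]

# Fake "speech": one second of a 1 kHz tone at 16 kHz, mixed at -6 dB SNR.
speech = [0.5 * math.sin(2 * math.pi * 1000 * t / 16000) for t in range(16000)]
mixed = mix_at_snr(speech, -6.0)
```

Lower (more negative) SNRs make the masker dominate, which is what produces the graded intelligibility differences across presentation conditions reported below.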
Results
Mean percent words correct (collapsed over stimulus type) for Experiment 3 (textured videos) as a function of presentation condition for the three SNRs are shown in Fig. 3 (left panel). The results confirm the well-established finding that seeing the talker's articulating full face markedly improves intelligibility. What is new is that seeing only the top part of the talker's face likewise produced a gain in intelligibility (although this improvement was relatively small).
Statistical analysis
Discussion
Observing the talker's mouth and jaw offers a fairly direct source of information about the segmental properties of speech; it is clear how such information could improve intelligibility. The finding that seeing only the top part of the head and face can increase intelligibility (at least for the expressive videos) shows that regions beyond those directly involved in articulation play a part in audio-visual speech processing. Further, Experiments 2 and 4 using outline videos show that the
Acknowledgements
We wish to thank Guillaume Vignali for conducting the rigid head movement analyses and Alex Bahar-Fuchs and Ashleigh Lin for helping run some of the experiments. We thank James McQueen and the two anonymous reviewers for their useful suggestions. The second author wishes to acknowledge support from Australian Research Council Grant number DP0209664.
References (14)
- et al. (1998). Audio–visual speech synthesis from French text: Eight years of models, designs and evaluation at the ICP. Speech Communication.
- et al. (2002). Linking facial animation, head motion and speech acoustics. Journal of Phonetics.
- Barbosa, A. V., Daffertshofer, A., & Vatikiotis-Bateson, E. (2004). Target practice on talking faces. ICSLP, 2004, Jeju, ...
- Cavé, Guaïtella, I., Bertrand, R., Santi, S., Harlay, F., & Espesser, R. (1996). About the relationship between eyebrow ...
- et al. (1997). Prosody in the comprehension of spoken language: A literature review. Language and Speech.
- et al. (2004). Audio–visual interactions with intact clearly audible speech. The Quarterly Journal of Experimental Psychology.
- et al. (2003). DMDX: A Windows display program with millisecond accuracy. Behavior Research Methods, Instruments, & Computers.