Neuropsychologia

Volume 65, December 2014, Pages 1–11

Visual abilities are important for auditory-only speech recognition: Evidence from autism spectrum disorder

https://doi.org/10.1016/j.neuropsychologia.2014.09.031

Highlights

  • The autism group (ASD) was impaired in lip-reading (visual-only speech recognition).

  • In controls, knowing a speaker’s face improved auditory-only speech recognition.

  • In ASD, knowing a speaker’s face impaired auditory-only speech recognition.

  • In addition, the ASD group was impaired in speaker identity recognition.

Abstract

In auditory-only conditions, for example when we listen to someone on the phone, it is essential to recognize quickly and accurately what is said (speech recognition). Previous studies have shown that speech recognition performance in auditory-only conditions is better if the speaker is known not only by voice, but also by face. Here, we tested the hypothesis that such an improvement in auditory-only speech recognition depends on the ability to lip-read. To test this, we recruited a group of adults with autism spectrum disorder (ASD), a condition associated with difficulties in lip-reading, and typically developed controls. All participants were trained to identify six speakers by name and voice. Three speakers were learned from a video showing their face, and three others were learned in a matched control condition without a face. After training, participants performed an auditory-only speech recognition test that consisted of sentences spoken by the trained speakers. As a control condition, the test also included speaker identity recognition on the same auditory material. The results showed that, in the control group, speech recognition performance was improved for speakers known by face in comparison to speakers learned in the matched control condition without a face. The ASD group lacked such a performance benefit: for the ASD group, auditory-only speech recognition was even worse for speakers known by face than for speakers not known by face. In speaker identity recognition, the ASD group performed worse than the control group, independently of whether the speakers had been learned with or without a face. Two additional visual experiments showed that the ASD group performed worse in lip-reading, whereas face identity recognition was within the normal range. The findings support the view that auditory-only communication involves specific visual mechanisms. Furthermore, they indicate that, in ASD, speaker-specific dynamic visual information is not available to optimize auditory-only speech recognition.

Introduction

When talking to someone on the phone, recognizing what is said and recognizing who is speaking are two inherently auditory tasks. According to the conventional view, performance in these auditory tasks relies on auditory processes without contribution from visual processes (e.g. Ellis et al., 1997, Hickok and Poeppel, 2007). An alternative view proposes that visual recognition abilities are also relevant for auditory-only tasks (von Kriegstein et al., 2008). We refer to these two views as the “auditory-only model” (Fig. 1A) and the “auditory–visual model” (Fig. 1B).

The auditory–visual model is based on behavioral findings and neuroimaging results (Blank et al., 2011, Rosenblum et al., 2007, Schall et al., 2013, Sheffert and Olson, 2004, von Kriegstein et al., 2008, von Kriegstein et al., 2006, von Kriegstein et al., 2005, von Kriegstein and Giraud, 2006). For example, several studies have shown that in auditory-only conditions (such as on the phone) typically developed individuals recognize someone’s identity by voice more easily if they know this person by voice and face (Sheffert and Olson, 2004, von Kriegstein et al., 2008, von Kriegstein and Giraud, 2006). The studies showed that speaker identity recognition performance is better for voices that have been learned in a brief learning period with a voice–face video recording of a speaker in contrast to learning the voice in a matched control learning condition without the face. In the following, we refer to this improvement in behavioral performance as “face-benefit” (von Kriegstein et al., 2008). In typically developed individuals this face-benefit in speaker identity recognition is associated with enhanced blood oxygen level dependent (BOLD) activity in the fusiform face area (FFA; Fig. 1B) (von Kriegstein et al., 2008), a brain region that is associated with face identity recognition (Puce et al., 1998, von Kriegstein et al., 2008).
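To make this measure concrete, here is a minimal sketch in Python of how a face-benefit score can be expressed as a difference in recognition accuracy between the two learning conditions; the function name and the example values are illustrative, not data from the study:

```python
def face_benefit(acc_voice_face: float, acc_control: float) -> float:
    """Face-benefit: improvement in auditory-only recognition accuracy
    for speakers learned with their face, relative to a matched control
    learning condition. Accuracies are proportions correct in [0, 1]."""
    return acc_voice_face - acc_control

# Hypothetical example: 88% correct for voice-face-learned speakers
# vs. 83% for speakers from the control learning condition.
print(face_benefit(0.88, 0.83))  # 0.05, i.e. a 5-percentage-point benefit
```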

A face-benefit also occurs for auditory-only speech recognition (Rosenblum et al., 2007, von Kriegstein et al., 2008): Typically developed individuals are better at auditory-only speech recognition for voices that have been learned in a brief period with a voice–face video recording in contrast to a matched control learning condition. In typically developed individuals this face-benefit in speech recognition is associated with enhanced BOLD activity in the posterior superior temporal sulcus (pSTS; Fig. 1B, von Kriegstein et al., 2008), a brain region that is associated with recognizing face movements (Haxby et al., 2000).

The auditory–visual model proposes that the recruitment of visual face areas during auditory-only speech and speaker recognition reflects a simulation process. In this view, we simulate a talking speaker’s face when we hear auditory-only speech. The simulation is thought to rely on two different processes: a simulation of the face identity via the FFA and a simulation of the orofacial speech movements via the pSTS. The simulation could fill in the missing visual input in auditory-only conditions and lead to the behavioral improvement, i.e. the face-benefit (von Kriegstein and Giraud, 2006, von Kriegstein et al., 2008). This view results in two hypotheses: first, a deficit in face identity recognition would lead to a lack of face-benefit in auditory-only speaker recognition; second, a deficit in orofacial speech movement perception would lead to a lack of face-benefit in auditory-only speech recognition. Currently, there is evidence for the first hypothesis: the face-benefit in speaker identity recognition is absent in individuals with a selective face identity recognition deficit, i.e. developmental prosopagnosia (Fig. 1B, “Prosopagnosia”) (von Kriegstein et al., 2008). The face-benefit for auditory-only speech recognition is normal in developmental prosopagnosics (von Kriegstein et al., 2008).

In the present study our aim is to test the second hypothesis, i.e. that a deficit in perceiving orofacial movements is associated with a lack of face-benefit in auditory-only speech recognition. To test this, we recruited a group of people with difficulties in recognizing visual-only speech (lip-reading), i.e. individuals with high-functioning autism spectrum disorder (ASD). ASD is a condition whose core features include atypical social interaction and communication (DSM-5, American Psychiatric Association, 2013; ICD-10, World Health Organization, 2004). Visual-only speech recognition is the ability to recognize speech from orofacial speech movements when only the face, but no auditory stimulus, is present. Several studies have shown that individuals with ASD have difficulties with visual-only speech recognition (Gepner et al., 1996, Iarocci et al., 2010, Irwin et al., 2011, Smith and Bennetto, 2007, Williams et al., 2004, Woynaroski et al., 2013). In contrast, the ability to recognize auditory-only speech, at least under a relatively good signal-to-noise ratio, is intact (Hillier et al., 2007, Iarocci et al., 2010, Irwin et al., 2011, Smith and Bennetto, 2007, Woynaroski et al., 2013). The auditory–visual model predicts that the face-benefit in auditory-only speech recognition will be absent in individuals with difficulties in visual-only speech recognition. Therefore, we expect that individuals with ASD will have difficulties in auditory-only speech recognition for speakers that they know by face, but not for those that they do not know by face (Fig. 1B, “ASD”). To test this prediction, we first evaluated the level of face processing abilities in an ASD sample with a visual-only speech recognition task (Fig. 2A) and a face identity recognition task (Fig. 2B). We expected difficulties in visual-only speech recognition (Gepner et al., 1996, Iarocci et al., 2010, Irwin et al., 2011, Smith and Bennetto, 2007, Williams et al., 2004, Woynaroski et al., 2013), and impaired, but more variable, face identity recognition performance across individuals in the ASD group (Barton et al., 2004, Hedley et al., 2011). These two tests of visual face processing were independent of the main experiment. In the main experiment we tested participants’ face-benefit in auditory-only speech recognition (Fig. 2C/D). For that, we trained participants to identify the voices and names of six speakers. In one learning condition, the voices and names of three of the speakers were learned together with their face (voice–face learning). In the other learning condition, the voices and names of the other three speakers were learned together with a symbol of their occupation instead of the face (voice–occupation learning). To test the face-benefit in auditory-only speech recognition, we subsequently presented auditory-only speech samples from all six speakers learned in these two different conditions. We expected individuals with ASD to show difficulties in auditory-only speech recognition only for those speakers they knew by face (Fig. 2C/D, “voice–face–name”) but not for those not known by face (Fig. 2C/D, “voice–occupation–name”). Thus, for the ASD group we expected that the face-benefit for auditory-only speech recognition would be reduced as compared to the control group. In addition, the design included an auditory-only speaker identity recognition task for the same speech samples and a lower-level auditory-only object recognition task as control conditions (see Section 2.2.2 for details). For these control tasks, we expected similar performance across groups, i.e. a comparable face-benefit in speaker identity recognition and comparable object recognition.
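To illustrate how a reduced face-benefit in the ASD group could be tested, the sketch below compares per-participant difference scores (voice–face minus voice–occupation accuracy) between groups with an independent-samples t-test, which in a 2×2 mixed design is equivalent to testing the group × learning-condition interaction. The accuracies are randomly generated placeholders, not the study's data, and this is a sketch of one possible analysis, not necessarily the one the authors used.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 14  # participants per group, as in the study design

# Placeholder per-participant accuracies (proportion correct)
# for the two learning conditions.
control_face = rng.uniform(0.70, 0.95, n)  # voice-face learning
control_occ = rng.uniform(0.70, 0.95, n)   # voice-occupation learning
asd_face = rng.uniform(0.70, 0.95, n)
asd_occ = rng.uniform(0.70, 0.95, n)

# Per-participant face-benefit = voice-face minus control condition.
benefit_control = control_face - control_occ
benefit_asd = asd_face - asd_occ

# Comparing difference scores between groups is equivalent to the
# group x learning-condition interaction in a 2 x 2 mixed design.
t, p = stats.ttest_ind(benefit_control, benefit_asd)
print(f"t({2 * n - 2}) = {t:.2f}, p = {p:.3f}")
```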

There is a large body of literature on audio-visual speech perception in ASD that points to altered audio-visual integration (de Boer-Schellekens et al., 2013, De Gelder et al., 1991, Foxe et al., 2013, Gepner et al., 1996, Iarocci et al., 2010, Irwin et al., 2011, Mongillo et al., 2008, Smith and Bennetto, 2007, Stevenson et al., 2014, Taylor et al., 2010, Williams et al., 2004, Woynaroski et al., 2013). In these studies, the impact of vision on auditory speech recognition was investigated when visual and auditory stimuli were presented either simultaneously or with varying stimulus onset asynchronies. In contrast, here we investigate the impact of audio-visual training on auditory-only speech recognition, when no visual input is available.

We expect that dysfunctional visual-only speech recognition in ASD is accompanied by difficulties in auditory-only speech recognition for speakers known by face (i.e. a lack of face-benefit in ASD); such a finding would be in accordance with the view that visual recognition abilities are important for auditory-only perception. In addition, the findings would highlight the significance of a multisensory perceptual approach when investigating difficulties in interaction and communication in ASD.

Section snippets

Participants

The study included an ASD group (n=14) and a control group (n=14). Groups were matched according to gender, chronological age and IQ (ASD: 10 male, mean age=28.93 years, range 19–44 years; controls: 10 male, mean age=29.40 years, range 20–43 years; for IQ see Table 1). IQ was assessed using the German adapted version of the Wechsler Adult Intelligence Scale (WAIS, Wechsler, 1997). All participants were native German speakers and right-handed (Edinburgh handedness questionnaire, Oldfield, 1971). …
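Matching of this kind is commonly verified with simple between-group tests. A minimal sketch, assuming placeholder ages rather than the study's actual participant data:

```python
from scipy import stats

# Placeholder ages for illustration only; the study reports mean ages
# of 28.93 (ASD) vs. 29.40 (controls) years, n = 14 per group.
asd_ages = [19, 22, 24, 25, 26, 27, 28, 29, 30, 31, 33, 35, 38, 44]
control_ages = [20, 22, 23, 25, 26, 28, 29, 30, 31, 32, 33, 35, 38, 43]

# A non-significant difference is consistent with matched groups.
t, p = stats.ttest_ind(asd_ages, control_ages)
print(f"t = {t:.2f}, p = {p:.3f}")
```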

Results

An overview of the results is presented in Table 2 and Fig. 4.

Discussion

The present study showed that, in adults with high-functioning ASD, there is a deficit in auditory-only speech recognition specifically for speakers known by face. In contrast, the same ASD group had similar auditory-only speaker identity recognition scores for speakers known by face and for speakers whose face was not known. In addition, the ASD sample had an impairment in visual-only speech recognition (i.e. lip-reading), but preserved face identity recognition abilities. …

Acknowledgments

This work was funded by a Max Planck Research Group grant to KVK. We thank Brad Duchaine for providing the Cambridge Face Memory Test (CFMT). We thank Laura Smith and Richard Hunger for comments on an earlier version of the manuscript.

References (86)

  • A.R. Nath et al. (2012). A neural basis for interindividual differences in the McGurk effect, a multisensory speech illusion. Neuroimage.
  • K. O’Hearn et al. (2010). Lack of developmental improvement on a face memory task during adolescence in autism. Neuropsychologia.
  • R.C. Oldfield (1971). The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia.
  • E. Redcay (2008). The superior temporal sulcus performs a common function for social and speech perception: implications for the emergence of autism. Neurosci. Biobehav. Rev.
  • G. Rippon et al. (2007). Disordered connectivity in the autistic brain: challenges for the new psychophysiology. Int. J. Psychophysiol.
  • C. Roswandowitz et al. (2014). Two cases of selective developmental voice-recognition impairment. Curr. Biol.
  • S. Schall et al. (2013). Early auditory sensory processing of voices is facilitated by visual mechanisms. Neuroimage.
  • J.L. Schwartz et al. (2004). Seeing to hear better: evidence for early audio-visual interactions in speech identification. Cognition.
  • R.A. Stevenson et al. (2010). Neural processing of asynchronous audiovisual speech perception. Neuroimage.
  • R.A. Stevenson et al. (2011). Discrete neural substrates underlie complementary audiovisual speech integration processes. Neuroimage.
  • D.R. Van Lancker et al. (1982). Impairment of voice and face recognition in patients with hemispheric damage. Brain Cogn.
  • K. von Kriegstein et al. (2003). Modulation of neural responses to speech by directing attention to voices or verbal content. Cogn. Brain Res.
  • J.H.G. Williams et al. (2004). Visual–auditory integration during speech imitation in autism. Res. Dev. Disabil.
  • M. Zilbovicius et al. (2006). Autism, the superior temporal sulcus and social perception. Trends Neurosci.
  • J.I. Alcantara et al. (2004). Speech-in-noise perception in high-functioning individuals with autism or Asperger’s syndrome. J. Child Psychol. Psychiatry.
  • American Psychiatric Association (1994). Diagnostic and Statistical Manual of Mental Disorders.
  • American Psychiatric Association (2013). Diagnostic and Statistical Manual of Mental Disorders.
  • B. Aschenberner et al. (2005). Phoneme-viseme mapping for German video-realistic audio-visual speech synthesis. …
  • S. Baron-Cohen et al. (2001). The autism-spectrum quotient (AQ): evidence from Asperger syndrome/high-functioning autism, males and females, scientists and mathematicians. J. Autism Dev. Disord.
  • J.J. Barton et al. (2004). Are patients with social developmental disorders prosopagnosic? Perceptual heterogeneity in the Asperger and socio-emotional processing disorders. Brain.
  • P. Belin et al. (2003). Adaptation to speaker’s voice in right anterior temporal lobe. Neuroreport.
  • P. Belin et al. (2000). Voice-selective areas in human auditory cortex. Nature.
  • R. Blake et al. (2003). Visual recognition of biological motion is impaired in children with autism. Psychol. Sci.
  • H. Blank et al. (2011). Direct structural connections between voice- and face-recognition areas. J. Neurosci.
  • J. Boucher et al. (1998). Familiar face and voice matching and recognition in children with autism. J. Child Psychol. Psychiatry.
  • A.B. Brandwein et al. (2013). The development of multisensory integration in high-functioning autism: high-density electrical mapping and psychophysical measures reveal impairments in the processing of audiovisual inputs. Cereb. Cortex.
  • J. Brock et al. (2002). The temporal binding deficit hypothesis of autism. Dev. Psychopathol.
  • N.T.R.S. David et al. (2011). Impairments in multisensory processing are not universal to the autism spectrum: no evidence for crossmodal priming deficits in Asperger syndrome. Autism Res.
  • L. de Boer-Schellekens et al. (2013). Diminished sensitivity of audiovisual temporal order in autism spectrum disorder. Front. Integr. Neurosci.
  • B. De Gelder et al. (1991). Face recognition and lip-reading in autism. Eur. J. Cogn. Psychol.
  • C. Deruelle et al. (2004). Spatial frequency and face processing in children with autism and Asperger syndrome. J. Autism Dev. Disord.
  • H.D. Ellis et al. (1997). Intra- and inter-modal repetition priming of familiar faces and voices. Br. J. Psychol.
  • T. Falck-Ytter et al. (2013). Lack of visual orienting to biological motion and audiovisual synchrony in 3-year-olds with autism. PLoS One.