Visual abilities are important for auditory-only speech recognition: Evidence from autism spectrum disorder
Introduction
When talking to someone on the phone, recognizing what is said and recognizing who is speaking are two inherently auditory tasks. According to the conventional view, performance in these auditory tasks relies on auditory processes without contribution of visual processes (e.g. Ellis et al., 1997, Hickok and Poeppel, 2007). An alternative view proposes that visual recognition abilities are also relevant for auditory-only tasks (von Kriegstein et al., 2008). We refer to these two views as the “auditory-only model” (Fig. 1A) and the “auditory–visual model” (Fig. 1B).
The auditory–visual model is based on behavioral findings and neuroimaging results (Blank et al., 2011, Rosenblum et al., 2007, Schall et al., 2013, Sheffert and Olson, 2004, von Kriegstein et al., 2008, von Kriegstein et al., 2006, von Kriegstein et al., 2005, von Kriegstein and Giraud, 2006). For example, several studies have shown that in auditory-only conditions (such as on the phone) typically developed individuals recognize someone's identity by voice more easily if they know this person by voice and face (Sheffert and Olson, 2004, von Kriegstein et al., 2008, von Kriegstein and Giraud, 2006). These studies showed that speaker identity recognition is better for voices learned during a brief learning period with a voice–face video recording of the speaker than for voices learned in a matched control condition without the face. In the following, we refer to this improvement in behavioral performance as the “face-benefit” (von Kriegstein et al., 2008). In typically developed individuals this face-benefit in speaker identity recognition is associated with enhanced blood oxygen level dependent (BOLD) activity in the fusiform face area (FFA; Fig. 1B) (von Kriegstein et al., 2008), a brain region that is associated with face identity recognition (Puce et al., 1998, von Kriegstein et al., 2008).
A face-benefit also occurs for auditory-only speech recognition (Rosenblum et al., 2007, von Kriegstein et al., 2008): Typically developed individuals are better at auditory-only speech recognition for voices that have been learned in a brief period with a voice–face video recording in contrast to a matched control learning condition. In typically developed individuals this face-benefit in speech recognition is associated with enhanced BOLD activity in the posterior superior temporal sulcus (pSTS; Fig. 1B, von Kriegstein et al., 2008), a brain region that is associated with recognizing face movements (Haxby et al., 2000).
The auditory–visual model proposes that the recruitment of visual face areas during auditory-only speech and speaker recognition reflects a simulation process. In this view, we simulate a talking speaker's face when we hear auditory-only speech. The simulation is thought to rely on two different processes: a simulation of the face identity via the FFA, and a simulation of the orofacial speech movements via the pSTS. The simulation could fill in the missing visual input in auditory-only conditions and lead to the behavioral improvement, i.e. the face-benefit (von Kriegstein and Giraud, 2006, von Kriegstein et al., 2008). This view yields two hypotheses: First, a deficit in face identity recognition would lead to a lack of face-benefit in auditory-only speaker recognition. Second, a deficit in orofacial speech movement perception would lead to a lack of face-benefit in auditory-only speech recognition. Currently, there is evidence for the first hypothesis: The face-benefit in speaker identity recognition is absent in individuals with a selective face identity recognition deficit, i.e. developmental prosopagnosia (Fig. 1B, “Prosopagnosia”) (von Kriegstein et al., 2008). In contrast, the face-benefit for auditory-only speech recognition is normal in individuals with developmental prosopagnosia (von Kriegstein et al., 2008).
In the present study, our aim is to test the second hypothesis, i.e. that a deficit in perceiving orofacial movements is associated with a lack of face-benefit in auditory-only speech recognition. To test this, we recruited a group of people with difficulties in recognizing visual-only speech (lip-reading), i.e. individuals with high-functioning autism spectrum disorder (ASD). ASD is a condition whose core features include atypical social interaction and communication (DSM-5, American Psychiatric Association, 2013; ICD-10, World Health Organization, 2004). Visual-only speech recognition is the ability to recognize speech from orofacial speech movements when only the face is visible and no auditory stimulus is present. Several studies have shown that individuals with ASD have difficulties with visual-only speech recognition (Gepner et al., 1996, Iarocci et al., 2010, Irwin et al., 2011, Smith and Bennetto, 2007, Williams et al., 2004, Woynaroski et al., 2013). In contrast, the ability to recognize auditory-only speech, at least under a relatively good signal-to-noise ratio, is intact (Hillier et al., 2007, Iarocci et al., 2010, Irwin et al., 2011, Smith and Bennetto, 2007, Woynaroski et al., 2013). The auditory–visual model predicts that the face-benefit in auditory-only speech recognition will be absent in individuals with difficulties in visual-only speech recognition. Therefore we expect that individuals with ASD will have difficulties in auditory-only speech recognition for speakers that they know by face, but not for those that they do not know by face (Fig. 1B, “ASD”). To test this prediction, we first evaluated the level of face processing abilities in an ASD sample with a visual-only speech recognition task (Fig. 2A) and a face identity recognition task (Fig. 2B).
We expected difficulties in visual-only speech recognition (Gepner et al., 1996, Iarocci et al., 2010, Irwin et al., 2011, Smith and Bennetto, 2007, Williams et al., 2004, Woynaroski et al., 2013), and impaired but more variable performance across individuals in the ASD group in face identity recognition (Barton et al., 2004, Hedley et al., 2011). These two tests of visual face processing were independent of the main experiment. In the main experiment we tested participants' face-benefit in auditory-only speech recognition (Fig. 2C/D). To this end, we trained participants to identify the voices and names of six speakers. In one learning condition, the voices and names of three of the speakers were learned together with their faces (voice–face learning). In the other learning condition, the voices and names of the other three speakers were learned together with a symbol of their occupation instead of the face (voice–occupation learning). To test the face-benefit in auditory-only speech recognition we subsequently presented auditory-only speech samples from all six speakers who had been learned in these two different conditions. We expected individuals with ASD to show difficulties in auditory-only speech recognition only for those speakers they knew by face (Fig. 2C/D, “voice–face–name”) but not for those not known by face (Fig. 2C/D, “voice–occupation–name”). Thus, for the ASD group we expected the face-benefit for auditory-only speech recognition to be reduced compared to the control group. In addition, the design included an auditory-only speaker identity recognition task for the same speech samples and a lower-level auditory-only object recognition task as control conditions (see Section 2.2.2 for details). We expected similar group performance for the face-benefit in speaker identity recognition and for the object recognition task.
There is a large body of literature on audio-visual speech perception in ASD that points to altered audio-visual integration (de Boer-Schellekens et al., 2013, De Gelder et al., 1991, Foxe et al., 2013, Gepner et al., 1996, Iarocci et al., 2010, Irwin et al., 2011, Mongillo et al., 2008, Smith and Bennetto, 2007, Stevenson et al., 2014, Taylor et al., 2010, Williams et al., 2004, Woynaroski et al., 2013). In these studies, the impact of vision on auditory speech recognition was investigated when the visual and auditory stimuli were presented either simultaneously or with varying stimulus onset asynchronies. In contrast, here we investigate the impact of audio-visual training on subsequent auditory-only speech recognition, when no visual input is available.
We expect that dysfunctional visual-only speech recognition in ASD is accompanied by difficulties in auditory-only speech recognition for speakers known by face (i.e. a lack of face-benefit in ASD); such a finding would be in accordance with the view that visual recognition abilities are important for auditory-only perception. In addition, the findings would highlight the significance of a multisensory perceptual approach when investigating difficulties in interaction and communication in ASD.
Section snippets
Participants
The study included an ASD group (n=14) and a control group (n=14). Groups were matched according to gender, chronological age and IQ (ASD: 10 male, mean age=28.93 years, range 19–44 years; controls: 10 male, mean age=29.40 years, range 20–43 years; IQ see Table 1). IQ was assessed using the German adapted version of the Wechsler Adult Intelligence Scale (WAIS, Wechsler, 1997). All participants were native German speakers and right-handed (Edinburgh handedness questionnaire, Oldfield, 1971). All
Results
An overview of the results is presented in Table 2 and Fig. 4.
Discussion
The present study showed that, in adults with high-functioning ASD, there is a deficit in auditory-only speech recognition specifically for speakers known by face. In contrast, the same ASD subject group had similar auditory-only speaker identity recognition scores for speakers known by face as for speakers of whom the face was not known. In addition, the ASD sample had an impairment in visual-only speech recognition (i.e. lip-reading), but preserved face identity recognition abilities. The
Acknowledgments
This work was funded by a Max Planck Research Group grant to KVK. We thank Brad Duchaine for providing the Cambridge Face Memory Test (CFMT). We thank Laura Smith and Richard Hunger for comments on an earlier version of the manuscript.
References (86)
The proactive brain: using analogies and associations to generate predictions. Trends Cogn. Sci. (2007)
Thinking the voice: neural correlates of voice perception. Trends Cogn. Sci. (2004)
Cortical substrates for the perception of face actions: an fMRI study of the specificity of activation for seen speech and for meaningless lower-face acts (gurning). Cogn. Brain Res. (2001)
Reduced multisensory facilitation in speakers with autism. Cortex (2013)
The Cambridge Face Memory Test: results for neurologically intact individuals and an investigation of its validity using inverted face stimuli and prosopagnosic participants. Neuropsychologia (2006)
Perception of biological motion in autism spectrum disorders. Neuropsychologia (2008)
The distributed human neural system for face perception. Trends Cogn. Sci. (2000)
The role of MT+/V5 during biological motion perception in Asperger syndrome: an fMRI study. Res. Autism Spectr. Disord. (2007)
Audiovisual integration in high functioning adults with autism. Res. Autism Spectr. Disord. (2010)
A neural mechanism for recognizing speech spoken by different speakers. Neuroimage (2014)
A neural basis for interindividual differences in the McGurk effect, a multisensory speech illusion. Neuroimage
Lack of developmental improvement on a face memory task during adolescence in autism. Neuropsychologia
The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia
The superior temporal sulcus performs a common function for social and speech perception: implications for the emergence of autism. Neurosci. Biobehav. Rev.
Disordered connectivity in the autistic brain: challenges for the new psychophysiology. Int. J. Psychophysiol.
Two cases of selective developmental voice-recognition impairment. Curr. Biol.
Early auditory sensory processing of voices is facilitated by visual mechanisms. Neuroimage
Seeing to hear better: evidence for early audio-visual interactions in speech identification. Cognition
Neural processing of asynchronous audiovisual speech perception. Neuroimage
Discrete neural substrates underlie complementary audiovisual speech integration processes. Neuroimage
Impairment of voice and face recognition in patients with hemispheric damage. Brain Cogn.
Modulation of neural responses to speech by directing attention to voices or verbal content. Cogn. Brain Res.
Visual–auditory integration during speech imitation in autism. Res. Dev. Disabil.
Autism, the superior temporal sulcus and social perception. Trends Neurosci.
Speech-in-noise perception in high-functioning individuals with autism or Asperger's syndrome. J. Child Psychol. Psychiatry
Diagnostic and Statistical Manual of Mental Disorders
The autism-spectrum quotient (AQ): evidence from Asperger syndrome/high-functioning autism, males and females, scientists and mathematicians. J. Autism Dev. Disord.
Are patients with social developmental disorders prosopagnosic? Perceptual heterogeneity in the Asperger and socio-emotional processing disorders. Brain
Adaptation to speaker's voice in right anterior temporal lobe. Neuroreport
Voice-selective areas in human auditory cortex. Nature
Visual recognition of biological motion is impaired in children with autism. Psychol. Sci.
Direct structural connections between voice- and face-recognition areas. J. Neurosci.
Familiar face and voice matching and recognition in children with autism. J. Child Psychol. Psychiatry
The development of multisensory integration in high-functioning autism: high-density electrical mapping and psychophysical measures reveal impairments in the processing of audiovisual inputs. Cereb. Cortex
The temporal binding deficit hypothesis of autism. Dev. Psychopathol.
Impairments in multisensory processing are not universal to the autism spectrum: no evidence for crossmodal priming deficits in Asperger syndrome. Autism Res.
Diminished sensitivity of audiovisual temporal order in autism spectrum disorder. Front. Integr. Neurosci.
Face recognition and lip-reading in autism. Eur. J. Cogn. Psychol.
Spatial frequency and face processing in children with autism and Asperger syndrome. J. Autism Dev. Disord.
Intra- and inter-modal repetition priming of familiar faces and voices. Br. J. Psychol.
Lack of visual orienting to biological motion and audiovisual synchrony in 3-year-olds with autism. PLoS One