Co-speech iconic gestures and visuo-spatial working memory
Introduction
Successful communication often requires multi-modal integration, whereby interlocutors combine information from the verbal channel with visual information about the speaker and the environment. For example, we have documented a speaker uttering the phrase, “manual adjustment lens,” to describe a camera while making hand movements that resemble the act of focusing a telephoto lens. The speech and the gesture in this example provide complementary information: when their meanings are combined, it becomes evident that the speaker is describing the lens of a camera, and not some other optical device, such as a telescope or a pair of binoculars (Wu & Coulson, 2007). Although prior research indicates that listeners rapidly combine the meaning of speech and iconic gestures in examples such as this (Kelly et al., 2004; Ozyurek et al., 2007; Wu & Coulson, 2010), little is known about the cognitive resources mediating these integration processes.
Here, we focus on depictive or iconic gestures – that is, those which bear featural similarities to the concepts they represent – as prior research suggests iconic gestures impact semantic aspects of real-time discourse comprehension (Kelly et al., 2004; Ozyurek et al., 2007; Wu & Coulson, 2010). Given that iconic gestures depict visual properties such as shape and size, one obvious possibility is that visuo-spatial processes are important for listeners' success at relating information conveyed in the verbal modality to visual information conveyed in the accompanying gestures. The visuo-spatial resources hypothesis is a natural fit with gesture production models suggesting that people gesture in order to convey analogue information in mental images (McNeill, 1992), or to coordinate spatial aspects of a message with the propositional content in their speech (Kita, 2000). Indeed, the gesture production literature suggests that people are more likely to gesture when their speech has spatial or imagistic content (Hadar & Krauss, 1999; Hostetter & Hopkins, 2002; Lavergne & Kimura, 1987; Morsella & Krauss, 2004). Although the comprehension of gestures has received much less attention, the visuo-spatial resources hypothesis is consistent with research demonstrating similarities between patterns of brain response to iconic gestures and photographs of real-world objects (Wu & Coulson, 2011), as well as the finding that listeners use information available in speakers' iconic gestures to help formulate visually specific situation models (Wu & Coulson, 2007, 2010).
However, as their name suggests, co-speech gestures occur almost exclusively in the context of speech, and hence their semantic analysis may depend heavily on verbal resources. The verbal resources hypothesis is in keeping with research suggesting that the meaning of iconic gestures is highly ambiguous, and is determined largely by the meaning of the speech that accompanies them (Hadar & Pinchas-Zamir, 2004; Krauss et al., 1995). It is also consistent with neuroimaging research indicating that many of the brain areas mediating the interpretation of gesture also mediate the interpretation of speech (Straube et al., 2012; Willems et al., 2007). Finally, the two hypotheses are not mutually exclusive, as it is quite possible that speech–gesture integration recruits both verbal and visuo-spatial resources.
Given the function of co-speech gestures in real-time language comprehension, working memory (WM) is likely to play an important role in their interpretation. According to the now classic model advanced by Baddeley and Hitch (1974), WM is critical for online processing, serving to temporarily maintain and store perceptual information, and enabling the appropriate updating of representations in long-term memory. Notably, WM is widely thought to be composed of a central controller as well as at least two distinct, modality-specific subsystems: the visuo-spatial sketch pad, dedicated to the maintenance of visual information, and the phonological loop, dedicated to the maintenance of auditory and verbal information. If listeners tend to preferentially recruit visuo-spatial or verbal resources during speech–gesture integration, we would expect to observe a relationship between the impact of iconic gestures on discourse comprehension and the availability of either visuo-spatial or verbal WM resources (or both).
The present study explored this hypothesis using a two-fold approach. Experiment 1 adopted a correlational method, examining whether there was a relationship between individual differences in measures of either verbal or visuo-spatial WM capacity and individual differences in sensitivity to iconic gestures. In Experiments 2 and 3, we used a dual task paradigm to examine whether taxing different components of WM impacts gesture comprehension, which would suggest a causal role for WM in speech–gesture integration. Accordingly, these studies assessed whether participants' ability to utilize the information in co-speech gestures was compromised by manipulating the load on either visuo-spatial (Experiment 2) or verbal (Experiment 3) WM. Finally, Experiment 4 was conducted to ensure that differences in the results of Experiments 2 and 3 did not stem from differences in the difficulty of the secondary verbal and visuo-spatial recall tasks used in those studies.
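The correlational logic of Experiment 1 can be illustrated with a minimal Python sketch. It is not the analysis reported here. It assumes that sensitivity to gesture is indexed by a per-participant congruency effect (accuracy after congruent minus accuracy after incongruent discourse primes) and that this index is correlated with a span score. The function names, field names, and all data values are hypothetical and were invented for illustration only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def congruency_effect(acc_congruent, acc_incongruent):
    """Hypothetical sensitivity index: congruent minus incongruent accuracy."""
    return acc_congruent - acc_incongruent


# Invented values for five imaginary participants (NOT data from the study).
participants = [
    {"corsi_span": 4, "acc_congruent": 0.90, "acc_incongruent": 0.88},
    {"corsi_span": 5, "acc_congruent": 0.91, "acc_incongruent": 0.88},
    {"corsi_span": 5, "acc_congruent": 0.93, "acc_incongruent": 0.88},
    {"corsi_span": 6, "acc_congruent": 0.92, "acc_incongruent": 0.86},
    {"corsi_span": 7, "acc_congruent": 0.95, "acc_incongruent": 0.86},
]

spans = [p["corsi_span"] for p in participants]
effects = [
    congruency_effect(p["acc_congruent"], p["acc_incongruent"])
    for p in participants
]

# A positive r would mean that larger spans go with larger congruency effects.
r = pearson_r(spans, effects)
print(f"r = {r:.2f}")
```

Under the visuo-spatial resources hypothesis, a coefficient of this kind would be expected to be positive for a visuo-spatial span measure; under the verbal resources hypothesis, the same would be expected for a verbal span measure.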
Experiment 1
To explore the cognitive resources mediating speech–gesture integration, Experiment 1 examined the relationship between individual differences in WM capacity, as measured through verbal and visuo-spatial span tests, and sensitivity to speech–gesture congruency, as measured through a picture probe classification task. Healthy adults viewed short video clips of spontaneous discourse involving iconic gestures, and then classified subsequent photographs of objects and scenes (picture probes) as
Participants
64 UCSD undergraduates (38 female) gave informed consent and received academic course credit for participation. All participants were fluent English speakers.
Corsi block task
The Corsi block-tapping task (Milner, 1971) is a widely used test of spatial skills and non-verbal WM. In the computerized variant implemented here, an asymmetric array of nine squares was presented on a monitor. On each trial, some or all of the squares would flash in sequence, though no square flashed more than once. Participants were
Accuracy
Analysis revealed that main effects of discourse congruency (F(1,63) = 8, p < 0.05; congruent more accurate) and probe relatedness (F(1,63) = 8.5, p < 0.05; unrelated more accurate) were qualified by a two-way interaction (F(1,63) = 16, p < 0.05). The interaction reflected the presence of a reliable discourse congruency effect for related (t(63) = 4, p < 0.05), but not unrelated (t < 1, n.s.), picture probes. Related picture probes were classified more accurately following discourse primes in which the speech
Discussion
Experiment 1 was intended to explore the relationship between participants' sensitivity to co-speech iconic gestures and the capacity of their verbal and visuo-spatial WM systems. Results suggest first, that the picture probe classification task was indeed a valid index of participants' sensitivity to iconic gestures, and, second, that visuo-spatial WM helps mediate speech–gesture integration. We briefly discuss each of these points below.
Experiment 2
In Experiment 2, we further explored possible visuo-spatial contributions to speech–gesture integration through a dual task paradigm designed to tax visuo-spatial WM concurrently with discourse and picture probe processing. The logic of the dual task paradigm is that performance deficits result when two tasks share the same resources (e.g., Wickens, 1980). Accordingly, we hypothesized that a secondary task that draws heavily on visuo-spatial resources will result in diminished capacity to
Participants
60 new volunteers from the UCSD community (44 female) gave informed consent and received academic course credit for participation. All participants were fluent English speakers.
Materials, design, and procedure
The primary task was identical to the picture probe classification task used in Experiment 1, requiring participants to view discourse primes and make relatedness judgments to pictures of discourse referents. The secondary task involved remembering a sequence of locations in a two-dimensional grid. Each trial began with
Secondary task accuracy (spatial recall)
As expected, superior recall of target locations was observed in low (93.0%, SD 7.4%) versus high (75%, SD 18%) load trials (memory load main effect: F(1,59) = 115, p < 0.05). Discourse congruity was not significant either as a main effect or in interaction with memory load.
Bivariate correlation coefficients in Table 3 confirm that Corsi Block Span, but not Sentence Span, was correlated with overall accuracy on the spatial recall task. The relative import of visuo-spatial versus verbal WM capacity
Discussion
The goal of Experiment 2 was to evaluate speech–gesture integration under the duress of a concurrent secondary task expected to tax visuo-spatial resources (i.e., remembering grid locations). As expected, accuracy of location recall was positively related to performance on a separate test of visuo-spatial, but not verbal, WM capacity. Participants with larger Corsi Block Spans tended to recall more grid locations on both high and low load trials. This finding suggests that the secondary memory
Experiment 3
Experiment 3 examines the impact of a secondary verbal WM load on speech–gesture integration. A new cohort of volunteers was presented with a similar paradigm to that employed in Experiment 2. Participants were asked to remember spoken digit sequences consisting of either one or four items during the same picture classification task used in the preceding studies. It is widely believed that this type of recall task engages the phonological loop, as digits are thought to be maintained in
Participants
56 new volunteers from the UCSD community (37 female) gave informed consent and received academic course credit for participation. All participants were fluent in English.
Materials, design, procedure, and analysis
The primary task was identical to that used in Experiment 2. For the secondary task, participants were asked to remember sequences of spoken numbers. At the outset of each trial, a series of digitized audio files containing either one (low memory load) or four (high memory load) spoken numbers ranging from one to nine was
Secondary task accuracy (verbal recall)
Unsurprisingly, digits were recalled more accurately on low (97%, SD 5%) versus high (89%, SD 10%) load trials (memory load main effect: F(1,55) = 54.7, p < 0.05). No main effect of speech–gesture congruity or interaction with memory load was obtained. Importantly, however, Sentence Span scores, but not Corsi Block Span scores, were correlated with digit recall accuracy (Table 5). Further, the multiple regression model using Corsi Block and Sentence Span scores as predictor variables revealed that
Discussion
Experiment 3 examined the relationship between verbal working memory abilities and multi-modal discourse comprehension through a dual task paradigm similar to that employed in Experiment 2. Instead of target locations, participants held number sequences in immediate memory while judging the relatedness of picture probes following segments of discourse containing congruent versus incongruent speech and gestures. As expected, some outcomes of this study paralleled findings from Experiment 2. For
Experiment 4
To compare within subjects the attentional demands of the secondary cognitive load manipulations in this study, a new dual task paradigm was created. The primary task involved a conjunctive visual search task analogous to that developed by Treisman and Gelade (1980). This task has been successfully utilized by other behavioral researchers (Hermer-Vazquez & Spelke, 1999) with normative objectives similar to those of the present study. Participants scanned visual displays in search of single
Participants
71 UCSD volunteers (44 female) were awarded course credit for participation in this study. All participants were fluent in English and gave informed consent.
Materials, design, procedure, and analysis
64 total trials were presented. On half, the secondary task involved the high load version of the visuo-spatial recall task used in Experiment 2, whereas the remainder involved the high load version of the verbal digit recall task from Experiment 3. The primary task involved visual displays containing either seven (small set) or eleven
Results
On average, 95% of targets were correctly detected. No main effects of, or interactions between, secondary recall modality and primary set size were detected (all Fs < 1, n.s.). With respect to response times, target detection was reliably slower with eleven (mean: 2251 ms; SD: 645) versus seven distractors (mean: 1994 ms; SD: 441), as expected (F(1, 70) = 31, p < 0.05). Intriguingly, a main effect of secondary recall modality indicated that verbal recall (mean: 2228 ms; SD: 593) resulted in slower target
Discussion
The purpose of Experiment 4 was to compare within subjects the overall demands placed on executive resources by the two secondary recall tasks used in this study. We reasoned that if attention is required to bind features of targets and distractors in the visual search task (Treisman & Gelade, 1980), then additional attentional loads incurred by the two types of secondary task should impact target detection — both with respect to the error rate and the length of time to make a response. If
General discussion
In three experiments, participants classified related pictures more rapidly and accurately when primed by multi-modal discourse with congruent versus incongruent speech and gestures, suggesting first, that people integrate the information conveyed by gestures with that conveyed by the speech, and, second, that our picture probe task offered a reliable index of sensitivity to iconic gestures. Further, Experiments 1 and 2 indicated that the participants who were the most sensitive to the
Conclusion
In conclusion, the present study demonstrates an important role for visuo-spatial resources in multi-modal discourse comprehension. In three experiments, healthy adults classified related picture probes more rapidly when primed by discourse with congruent versus incongruent gestures. The novel finding advanced here is that not all listeners are impacted equally by gestures. In particular, these data suggest visuo-spatial WM capacity plays a more important role in mediating speech–gesture
Acknowledgments
This work was supported by a grant to SC from the NSF (#BCS-0843946). Special thanks go to Rebecca Dai, Jordan Davison, and Marguerite McQuire for their contributions.
References (68)
- et al. (2001). Effects of visibility between speaker and listener on gesture production: Some gestures are meant to be seen. Journal of Memory and Language.
- et al. (1990). Event-related potentials and the semantic matching of pictures. Brain and Cognition.
- (1998). The Corsi block-tapping task: Methodological and theoretical considerations. Brain and Cognition.
- et al. (2008). Gesturing makes learning last. Cognition.
- et al. (1980). Individual differences in working memory and reading. Journal of Verbal Learning and Verbal Behavior.
- et al. (1999). Iconic gestures: The grammatical categories of lexical affiliates. Journal of Neurolinguistics.
- et al. (1999). Sources of flexibility in human cognition: Dual-task studies of space and language. Cognitive Psychology.
- et al. (1994). Event-related brain potentials reflect semantic priming in an object decision task. Brain and Cognition.
- et al. (2002). The effect of thought structure on the production of lexical movements. Brain and Language.
- et al. (2004). Neural correlates of bimodal speech and gesture comprehension. Brain and Language.