Elsevier

Acta Psychologica

Volume 153, November 2014, Pages 39-50
Co-speech iconic gestures and visuo-spatial working memory

https://doi.org/10.1016/j.actpsy.2014.09.002

Highlights

  • Listeners integrate meaning expressed in their interlocutors' speech and gestures.

  • Sensitivity to gesture is positively related to visuo-spatial working memory ability.

  • When visuo-spatial working memory is taxed, sensitivity to gestures declines.

  • No relationship was found between sensitivity to gestures and verbal working memory.

Abstract

Three experiments tested the role of verbal versus visuo-spatial working memory in the comprehension of co-speech iconic gestures. In Experiment 1, participants viewed congruent discourse primes in which the speaker's gestures matched the information conveyed by his speech, and incongruent ones in which the semantic content of the speaker's gestures diverged from that in his speech. Discourse primes were followed by picture probes that participants judged as being either related or unrelated to the preceding clip. Performance on this picture probe classification task was faster and more accurate after congruent than incongruent discourse primes. The effect of discourse congruency on response times was linearly related to measures of visuo-spatial, but not verbal, working memory capacity, as participants with greater visuo-spatial WM capacity benefited more from congruent gestures. In Experiments 2 and 3, participants performed the same picture probe classification task under conditions of high and low loads on concurrent visuo-spatial (Experiment 2) and verbal (Experiment 3) memory tasks. Effects of discourse congruency and verbal WM load were additive, while effects of discourse congruency and visuo-spatial WM load were interactive. Results suggest that congruent co-speech gestures facilitate multi-modal language comprehension, and indicate an important role for visuo-spatial WM in these speech–gesture integration processes.

Introduction

Successful communication often requires multi-modal integration, whereby interlocutors combine information from the verbal channel with visual information about the speaker and the environment. For example, we have documented a speaker uttering the phrase “manual adjustment lens” to describe a camera while making hand movements that resemble the act of focusing a telephoto lens. The speech and the gesture in this example provide complementary information: combining their meanings makes it evident that the speaker is describing the lens of a camera, and not some other optical device, such as a telescope or a pair of binoculars (Wu & Coulson, 2007). Although prior research indicates that listeners rapidly combine the meaning of speech and iconic gestures in examples such as this (Kelly et al., 2004; Ozyurek et al., 2007; Wu & Coulson, 2010), little is known about the cognitive resources mediating these integration processes.

Here, we focus on depictive or iconic gestures – that is, those which bear featural similarities to the concepts they represent – as prior research suggests iconic gestures impact semantic aspects of real-time discourse comprehension (Kelly et al., 2004; Ozyurek et al., 2007; Wu & Coulson, 2010). Given that iconic gestures depict visual properties such as shape and size, one obvious possibility is that visuo-spatial processes are important for listeners' success at relating information conveyed in the verbal modality to visual information conveyed in the accompanying gestures. The visuo-spatial resources hypothesis is a natural fit with gesture production models suggesting that people gesture in order to convey analogue information in mental images (McNeill, 1992), or to coordinate spatial aspects of a message with the propositional content in their speech (Kita, 2000). Indeed, the gesture production literature suggests that people are more likely to gesture when their speech has spatial or imagistic content (Hadar & Krauss, 1999; Hostetter & Hopkins, 2002; Lavergne & Kimura, 1987; Morsella & Krauss, 2004). Although the comprehension of gestures has received much less attention, the visuo-spatial resources hypothesis is consistent with research demonstrating similarities between patterns of brain response to iconic gestures and photographs of real world objects (Wu & Coulson, 2011), as well as the finding that listeners use information available in speakers' iconic gestures to help formulate visually specific situation models (Wu & Coulson, 2007; Wu & Coulson, 2010).

However, as their name suggests, co-speech gestures occur almost exclusively in the context of speech, and hence their semantic analysis may depend heavily on verbal resources. The verbal resources hypothesis is in keeping with research suggesting that the meaning of iconic gestures is highly ambiguous, and is determined largely by the meaning of the speech that accompanies them (Hadar & Pinchas-Zamir, 2004; Krauss et al., 1995). It is also consistent with neuroimaging research indicating that many of the brain areas mediating the interpretation of gesture also mediate the interpretation of speech (Straube et al., 2012; Willems et al., 2007). Finally, the two hypotheses are not mutually exclusive, as it is quite possible that speech–gesture integration recruits both verbal and visuo-spatial resources.

Given the function of co-speech gestures in real-time language comprehension, working memory (WM) is likely to play an important role in their interpretation. According to the now classic model advanced by Baddeley and Hitch (1974), WM is critical for online processing, serving to temporarily maintain and store perceptual information, and enabling the appropriate updating of representations in long-term memory. Notably, WM is widely thought to be composed of a central controller as well as at least two distinct, modality-specific subsystems: the visuo-spatial sketch pad, dedicated to the maintenance of visual information, and the phonological loop, dedicated to the maintenance of auditory and verbal information. If listeners tend to preferentially recruit visuo-spatial or verbal resources during speech–gesture integration, we would expect to observe a relationship between the impact of iconic gestures on discourse comprehension and the availability of either visuo-spatial or verbal WM resources (or both).

The present study explored this hypothesis using a two-fold approach. Experiment 1 adopted a correlational method, examining whether there was a relationship between individual differences in measures of either verbal or visuo-spatial WM capacity and individual differences in sensitivity to iconic gestures. In Experiments 2 and 3, we used a dual task paradigm to examine whether taxing different components of WM impacts gesture comprehension, which would suggest a causal role for WM in speech–gesture integration. Accordingly, these studies assessed whether participants' ability to utilize the information in co-speech gestures was compromised by manipulating the load on either visuo-spatial (Experiment 2) or verbal (Experiment 3) WM. Finally, Experiment 4 was conducted to ensure that differences in the results of Experiments 2 and 3 did not stem from differences in the difficulty of the secondary verbal and visuo-spatial recall tasks used in those studies.
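The correlational logic of Experiment 1 can be illustrated with a minimal sketch. It assumes that a participant's sensitivity to gesture is summarized as a discourse congruency effect, taken here as mean response time after incongruent primes minus mean response time after congruent primes, and that this effect is then correlated with a WM span score across participants. The function names, the data layout, and every number below are hypothetical and serve only as illustration; they are not the authors' analysis code or data.

```python
from math import sqrt
from statistics import mean


def congruency_effect(rt_congruent, rt_incongruent):
    """Mean incongruent RT minus mean congruent RT (ms) for one participant.

    A larger positive value means a larger benefit from congruent gestures.
    """
    return mean(rt_incongruent) - mean(rt_congruent)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical, made-up data: one dict per participant.
participants = [
    {"span": 4, "congruent": [900, 950], "incongruent": [910, 960]},
    {"span": 5, "congruent": [880, 920], "incongruent": [920, 940]},
    {"span": 6, "congruent": [850, 870], "incongruent": [890, 910]},
    {"span": 7, "congruent": [800, 840], "incongruent": [880, 900]},
]

spans = [p["span"] for p in participants]
effects = [congruency_effect(p["congruent"], p["incongruent"]) for p in participants]

# Correlation between span and congruency effect across participants.
r = pearson_r(spans, effects)
print(f"r = {r:.2f}")
```

In this toy data set the congruency effect grows with span, so the correlation is positive. Whether such a relationship holds for visuo-spatial or verbal span is the empirical question the experiment addresses.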

Experiment 1

To explore the cognitive resources mediating speech–gesture integration, Experiment 1 examined the relationship between individual differences in WM capacity, as measured through verbal and visuo-spatial span tests, and sensitivity to speech–gesture congruency, as measured through a picture probe classification task. Healthy adults viewed short video clips of spontaneous discourse involving iconic gestures, and then classified subsequent photographs of objects and scenes (picture probes) as

Participants

64 UCSD undergraduates (38 female) gave informed consent and received academic course credit for participation. All participants were fluent English speakers.

Corsi block task

The Corsi block-tapping task (Milner, 1971) is a widely used test of spatial skills and non-verbal WM. In the computerized variant implemented here, an asymmetric array of nine squares was presented on a monitor. On each trial, some or all of the squares would flash in sequence, though no square flashed more than once. Participants were
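As a rough illustration of the stimulus constraint described above (nine squares, some or all flashing in sequence, no square flashing more than once), the following sketch generates one such sequence. The function name, the 0 to 8 square indexing, and the use of uniform random sampling are assumptions made for illustration. The response and scoring side of the task is not sketched, because its description is cut off in this excerpt.

```python
import random

N_SQUARES = 9  # the text describes an array of nine squares


def make_sequence(length, rng=random):
    """Return `length` distinct square indices (0-8) in flash order.

    Sampling without replacement guarantees that no square appears twice.
    """
    if not 1 <= length <= N_SQUARES:
        raise ValueError("length must be between 1 and 9")
    return rng.sample(range(N_SQUARES), length)


# Example: a five-item sequence from a seeded generator for reproducibility.
sequence = make_sequence(5, random.Random(0))
print(sequence)
```

Passing a seeded `random.Random` instance keeps trial lists reproducible across sessions, which is one common design choice for computerized span tasks.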

Accuracy

Analysis revealed that main effects of discourse congruency (F(1,63) = 8, p < 0.05; congruent more accurate) and probe relatedness (F(1,63) = 8.5, p < 0.05; unrelated more accurate) were qualified by a two-way interaction (F(1,63) = 16, p < 0.05). The interaction reflected the presence of a reliable discourse congruency effect for related (t(63) = 4, p < 0.05), but not unrelated (t < 1, n.s.), picture probes. Related picture probes were classified more accurately following discourse primes in which the speech

Discussion

Experiment 1 was intended to explore the relationship between participants' sensitivity to co-speech iconic gestures and the capacity of their verbal and visuo-spatial WM systems. Results suggest first, that the picture probe classification task was indeed a valid index of participants' sensitivity to iconic gestures, and, second, that visuo-spatial WM helps mediate speech–gesture integration. We briefly discuss each of these points below.

Experiment 2

In Experiment 2, we further explored possible visuo-spatial contributions to speech–gesture integration through a dual task paradigm designed to tax visuo-spatial WM concurrently with discourse and picture probe processing. The logic of the dual task paradigm is that performance deficits result when two tasks share the same resources (e.g., Wickens, 1980). Accordingly, we hypothesized that a secondary task that draws heavily on visuo-spatial resources will result in diminished capacity to

Participants

60 new volunteers from the UCSD community (44 female) gave informed consent and received academic course credit for participation. All participants were fluent English speakers.

Materials, design, and procedure

The primary task was identical to the picture probe classification task used in Experiment 1, requiring participants to view discourse primes and make relatedness judgments to pictures of discourse referents. The secondary task involved remembering a sequence of locations in a two-dimensional grid. Each trial began with

Secondary task accuracy (spatial recall)

As expected, superior recall of target locations was observed in low (93.0%, SD 7.4%) versus high (75%, SD 18%) load trials (memory load main effect: F(1,59) = 115, p < 0.05). Discourse congruity was not significant either as a main effect or in interaction with memory load.

Bivariate correlation coefficients in Table 3 confirm that Corsi Block Span, but not Sentence Span, was correlated with overall accuracy on the spatial recall task. The relative import of visuo-spatial versus verbal WM capacity

Discussion

The goal of Experiment 2 was to evaluate speech–gesture integration under the duress of a concurrent secondary task expected to tax visuo-spatial resources (i.e., remembering grid locations). As expected, accuracy of location recall was positively related to performance on a separate test of visuo-spatial, but not verbal, WM capacity. Participants with larger Corsi Block Spans tended to recall more grid locations on both high and low load trials. This finding suggests that the secondary memory

Experiment 3

Experiment 3 examines the impact of a secondary verbal WM load on speech–gesture integration. A new cohort of volunteers was presented with a similar paradigm to that employed in Experiment 2. Participants were asked to remember spoken digit sequences consisting of either one or four items during the same picture classification task used in the preceding studies. It is widely believed that this type of recall task engages the phonological loop, as digits are thought to be maintained in

Participants

56 new volunteers from the UCSD community (37 female) gave informed consent and received academic course credit for participation. All participants were fluent in English.

Materials, design, procedure, and analysis

The primary task was identical to that used in Experiment 2. For the secondary task, participants were asked to remember sequences of spoken numbers. At the outset of each trial, a series of digitized audio files containing either one (low memory load) or four (high memory load) spoken numbers ranging from one to nine was

Secondary task accuracy (verbal recall)

Unsurprisingly, digits were recalled more accurately on low (97%, SD 5%) versus high (89%, SD 10%) load trials (memory load main effect: F(1,55) = 54.7, p < 0.05). No main effect of speech–gesture congruity or interaction with memory load was obtained. Importantly, however, Sentence Span, but not Corsi Block Span, scores were correlated with digit recall accuracy (Table 5). Further, the multiple regression model using Corsi Block and Sentence Span scores as predictor variables revealed that

Discussion

Experiment 3 examined the relationship between verbal working memory abilities and multi-modal discourse comprehension through a dual task paradigm similar to that employed in Experiment 2. Instead of target locations, participants held number sequences in immediate memory while judging the relatedness of picture probes following segments of discourse containing congruent versus incongruent speech and gestures. As expected, some outcomes of this study paralleled findings from Experiment 2. For

Experiment 4

To compare within subjects the attentional demands of the secondary cognitive load manipulations in this study, a new dual task paradigm was created. The primary task involved a conjunctive visual search task analogous to that developed by Treisman and Gelade (1980). This task has been successfully utilized by other behavioral researchers (Hermer-Vazquez & Spelke, 1999) with normative objectives similar to those of the present study. Participants scanned visual displays in search of single

Participants

71 UCSD volunteers (44 female) were awarded course credit for participation in this study. All participants were fluent in English and gave informed consent.

Materials, design, procedure, and analysis

64 total trials were presented. On half, the secondary task involved the high load version of the visuo-spatial recall task used in Experiment 2, whereas the remainder involved the high load version of the verbal digit recall task from Experiment 3. The primary task involved visual displays containing either seven (small set) or eleven

Results

On average, 95% of targets were correctly detected. No main effects of, or interactions between, secondary recall modality and primary set size were detected (all Fs < 1, n.s.). With respect to response times, target detection was reliably slower with eleven (mean: 2251 ms; SD: 645) versus seven distractors (mean: 1994 ms; SD: 441), as expected (F(1, 70) = 31, p < 0.05). Intriguingly, a main effect of secondary recall modality indicated that verbal recall (mean: 2228 ms; SD: 593) resulted in slower target

Discussion

The purpose of Experiment 4 was to compare within subjects the overall demands placed on executive resources by the two secondary recall tasks used in this study. We reasoned that if attention is required to bind features of targets and distractors in the visual search task (Treisman & Gelade, 1980), then additional attentional loads incurred by the two types of secondary task should impact target detection — both with respect to the error rate and the length of time to make a response. If

General discussion

In three experiments, participants classified related pictures more rapidly and accurately when primed by multi-modal discourse with congruent versus incongruent speech and gestures, suggesting first, that people integrate the information conveyed by gestures with that conveyed by the speech, and, second, that our picture probe task offered a reliable index of sensitivity to iconic gestures. Further, Experiments 1 and 2 indicated that the participants who were the most sensitive to the

Conclusion

In conclusion, the present study demonstrates an important role for visuo-spatial resources in multi-modal discourse comprehension. In three experiments, healthy adults classified related picture probes more rapidly when primed by discourse with congruent versus incongruent gestures. The novel finding advanced here is that not all listeners are impacted equally by gestures. In particular, these data suggest visuo-spatial WM capacity plays a more important role in mediating speech–gesture

Acknowledgments

This work was supported by a grant to SC from the NSF (#BCS-0843946). Special thanks go to Rebecca Dai, Jordan Davison, and Marguerite McQuire for their contributions.

References (68)

  • R.M. Krauss et al.

    The communicative value of conversational hand gestures

    Journal of Experimental Social Psychology

    (1995)
  • J. Lavergne et al.

    Hand movement asymmetry during speech: No effect of speaking topic

    Neuropsychologia

    (1987)
  • J. Palmer

    Set-size effects in visual search: The effect of attention is independent of the stimulus for simple tasks

    Vision Research

    (1994)
  • M. Sauter et al.

    Learning what children know about space from looking at their hands: The added value of gesture in spatial communication

    Journal of Experimental Child Psychology

    (2012)
  • J.J. Tree et al.

    Sometimes faster, sometimes slower: associative and competitor priming in picture naming with young and elderly participants

    Journal of Neurolinguistics

    (2003)
  • A.M. Treisman et al.

    A feature-integration theory of attention

    Cognitive Psychology

    (1980)
  • L. Valenzeno et al.

    Teacher's gestures facilitate students' learning: A lesson in symmetry

    Contemporary Educational Psychology

    (2003)
  • S.M. Wagner et al.

    Probing the mental representation of gesture: Is handwaving spatial?

    Journal of Memory and Language

    (2004)
  • Y.C. Wu et al.

    How iconic gestures enhance communication: An ERP study

    Brain and Language

    (2007)
  • Y.C. Wu et al.

    Are depictive gestures like pictures? Commonalities and differences in semantic processing

    Brain and Language

    (2011)
  • F.X. Alario et al.

    Semantic and associative priming in picture naming

    The Quarterly Journal of Experimental Psychology

    (2000)
  • M.W. Alibali

    Gesture in spatial cognition: Expressing, communicating, and thinking about spatial information

    Spatial Cognition and Computation

    (2005)
  • M.W. Alibali et al.

    Children’s gestures are meant to be seen

    Gesture

    (2001)
  • M.W. Alibali et al.

    Spontaneous gestures influence strategy choices in problem solving

    Psychological Science

    (2011)
  • A.D. Baddeley

    Working memory: Looking back and looking forward

    Nature Reviews. Neuroscience

    (2003)
  • A.D. Baddeley et al.

    Working memory

  • L.W. Barsalou

    Grounded cognition

    Annual Review of Psychology

    (2008)
  • G. Beattie et al.

    An experimental investigation of the role of different types of iconic gesture in communication: A semantic feature approach

    Gesture

    (2001)
  • P. Bernardis et al.

    Behavioural and neurophysiological evidence of semantic interaction between iconic gestures and words

    Cognitive Neuropsychology

    (2008)
  • A.M. Borghi et al.

    Putting words in perspective

    Memory and Cognition

    (2004)
  • S.C. Broaders et al.

    Making children gesture brings out implicit knowledge and leads to learning

    Journal of Experimental Psychology: General

    (2007)
  • M. Chu et al.

    Individual differences in frequency and saliency of speech-accompanying gestures: The role of cognitive abilities and empathy

    Journal of Experimental Psychology: General

    (2014)
  • R.B. Church et al.

    The role of gesture in bilingual education: Does gesture enhance learning?

    International Journal of Bilingual Education and Bilingualism

    (2004)
  • A.R. Conway et al.

    Working memory span tasks: A methodological review and user's guide

    Psychonomic Bulletin & Review

    (2005)