The surprising power of statistical learning: When fragment knowledge leads to false memories of unheard words

https://doi.org/10.1016/j.jml.2008.10.003

Abstract

Word-segmentation, that is, the extraction of words from fluent speech, is one of the first problems language learners have to master. It is generally believed that statistical processes, in particular those tracking “transitional probabilities” (TPs), are important to word-segmentation. However, there is evidence that word forms are stored in memory formats differing from those that can be constructed from TPs, i.e. in terms of the positions of phonemes and syllables within words. In line with this view, we show that TP-based processes leave learners no more familiar with items heard 600 times than with “phantom-words” not heard at all if the phantom-words have the same statistical structure as the occurring items. Moreover, participants are more familiar with phantom-words than with frequent syllable combinations. In contrast, minimal prosody-like perceptual cues allow learners to recognize actual items. TPs may well signal co-occurring syllables; this, however, does not seem to lead to the extraction of word-like units. We review other, in particular prosodic, cues to word-boundaries which may allow the construction of positional memories while not requiring language-specific knowledge, and suggest that their contributions to word-segmentation need to be reassessed.

Introduction

Speech comes as a continuous signal, with no reliable cues to signal word boundaries. Thus learners have not only to map the words of their native language to their meanings (which is in itself a difficult problem), but first they have to identify the sound stretches corresponding to words. Thus, they need mechanisms that allow them to memorize the phonological forms of the words they encounter in fluent speech. Here we ask what kinds of memory mechanisms they can employ for this purpose. It is generally accepted that statistical computations are well suited for segmenting words from fluent speech, and thus for memorizing phonological word-candidates (e.g., Aslin et al., 1998, Cairns et al., 1997, Elman, 1990, Goodsitt et al., 1993, Hayes and Clark, 1970, Saffran, 2001b, Saffran et al., 1996, Saffran et al., 1996, Swingley, 2005). However, as reviewed below in more detail, there is evidence, in particular from speech errors, that memory for words in fact appeals to different kinds of memory mechanisms, namely those encoding the positions of phonemes or syllables within words. We thus ask whether learners extract word-like units from fluent speech when just the aforementioned statistical cues are given, or whether they require other, possibly prosodic, cues that allow them to construct positional memories. Specifically, we presented participants with continuous speech streams containing statistically defined “words”. These words were chosen such that there were statistically matched “phantom-words” that, despite having the same statistical structure as words, never occurred in the speech streams. If statistical cues lead to the extraction of words from fluent speech, participants should know that they have encountered words but not phantom-words during the speech streams. In contrast, if memory for words is positional, participants should fail to prefer words to phantom-words when only statistical information is given. 
Rather such a preference should arise only once cues are available that lead to the construction of positional memories.

Once they reach a certain age, learners can use many different cues to predict word boundaries (e.g., Bortfeld et al., 2005, Cutler and Norris, 1988, Dahan and Brent, 1999, Jusczyk et al., 1993, Mattys and Jusczyk, 2001, Shukla et al., 2007, Suomi et al., 1997, Thiessen and Saffran, 2003, Vroomen et al., 1998). However, many of these cues are language-specific, and thus have to be learned. For instance, if learners assume that strong syllables are word-initial, they will be right in Hungarian but wrong in French (where strong syllables are word-final), and to learn where stress falls in a word, they have to know the words in the first place. Hence, at least initially, language learners need to use cues to word-boundaries that do not require any language-specific knowledge.

Co-occurrence statistics such as transitional probabilities (TPs) among syllables are one such cue that is particularly well-attested. These statistics indicate how likely it is that two syllables will follow each other. More formally, TPs are conditional probabilities of encountering a syllable after having encountered another syllable. Conditional probabilities like P(σi+1 = pet | σi = trum) (in the word trumpet) are high within words, and low between words (σ denotes a syllable in a speech stream). Dips in TPs may thus cue word boundaries, while high-TP transitions may indicate that words continue. That is, learners may postulate word boundaries between syllables that rarely follow each other. Saffran and collaborators (e.g., Aslin et al., 1998, Saffran et al., 1996) have shown that even young infants can deploy such statistical computations on continuous speech streams. After familiarization with speech streams in which dips in TPs were the only cue to word boundaries, 8-month-old infants were more familiar with items delimited by TP dips than with items that straddled such dips. Even more impressively, after such a familiarization, infants recognize the items delimited by dips in TPs in new English sentences pronounced by a new speaker (Saffran, 2001b), suggesting that TP-based segmentation procedures may lead infants to extract word-like units.
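The TP logic described above can be sketched in a few lines of code. The toy lexicon below (tupiro, golabu, bidaku) and the threshold value are our own illustration, not the authors' materials: TPs are estimated from syllable-bigram counts, and a word boundary is posited wherever the forward TP dips below the threshold.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Estimate forward TPs P(next syllable | current syllable) from a stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

def segment_at_dips(syllables, tps, threshold):
    """Posit a word boundary wherever the forward TP dips below `threshold`."""
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tps[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

# A toy stream: three trisyllabic nonce words concatenated in varying order.
lexicon = ["tupiro", "golabu", "bidaku"]
order = [0, 1, 2, 0, 2, 1, 2, 1, 0]
stream = [lexicon[k][i:i + 2] for k in order for i in range(0, 6, 2)]

tps = transitional_probabilities(stream)
print(segment_at_dips(stream, tps, threshold=0.9))  # recovers the nine word tokens
```

With this stream, within-word TPs are 1.0 while across-word TPs are at most 2/3, so thresholding at 0.9 recovers the words exactly.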

Results such as these have led to the widespread agreement that co-occurrence statistics are important for segmenting words from speech. Though not thought to be the only cues used for word-segmentation, they are thought to play a particularly prominent role because, unlike other cues, they can be used by infants without any knowledge of the properties of their native language (e.g., Thiessen & Saffran, 2003). Moreover, similar computations have been observed with other auditory and visual stimuli (Fiser and Aslin, 2002, Saffran et al., 1999, Turk-Browne et al., 2005), and with other mammals (Hauser et al., 2001, Toro and Trobalón, 2005). Such computations may thus be domain- and species-general, stressing again the potential importance of such processes for a wide array of cognitive learning situations. Accordingly, some authors have proposed that these processes may be crucial not only for word-learning but also for other, more grammatical aspects of language acquisition (Bates and Elman, 1996, Saffran, 2001a, Thompson and Newport, 2007).

Surprisingly, however, there is no evidence that TP-based computations lead to the extraction of word-candidates. The experiments above have provided numerous demonstrations that participants are more familiar with items with stronger TPs than with items with weaker TPs. This, however, does not imply that the items with stronger TPs are represented as actual word-like units, or even that they have been extracted. For example, one may well find that a piece of cheese is more associated with a glass of wine than with a glass of beer, but this does not imply that the wine/cheese combination is represented as a unit for parsing the visual scene. Likewise, choosing items with stronger TPs (where the syllables have stronger associations) over items with weaker TPs does not imply either that the items with stronger TPs have been extracted as perceptual units.

The distinction between a preference for high-TP items and representing these items as perceptual units is well illustrated in Turk-Browne and Scholl's (2009) studies of visual statistical learning. In these experiments, participants saw a continuous sequence of shapes. This sequence was composed of a concatenation of three-shape items (just as the experiments reviewed above used concatenations of three-syllable nonsense words). Following such a familiarization, participants were as good at discriminating high-TP items from low-TP items when the items were played forward (that is, in the temporal order in which they had been seen during familiarization) as when they were played backward. If a preference for high-TP items implied that these items had been extracted and memorized, one would have to conclude that participants had also extracted the backward items, although they had never seen them. It thus seems that a preference for high-TP items does not imply that these items have been memorized.

There are also other reasons to doubt that TPs may play an important role in word-segmentation. One reason is that computational studies using TPs (or related statistics) for segmenting realistic corpora of child-directed speech have met with mixed success at best (e.g., Swingley, 2005, Yang, 2004). At minimum, TPs thus have to be complemented with other cues. This seems highly plausible, given that one would certainly not expect a single cue to solve a highly complex problem such as speech-segmentation.

While the poor performance of word-segmentation mechanisms based on TPs can be improved by the inclusion of other cues, there is a second, more fundamental, reason for doubting that TPs play an important role in word-segmentation. This reason is related to the kinds of representations that are formed of acoustic word-forms. Presumably, the purpose of word-segmentation is to store phonological word-candidates in long-term memory. As these are essentially sound sequences (or sequences of articulatory gestures according to a direct realist perspective), it is reasonable to ask whether research on sequential memory can constrain the kinds of cues that can be used for word-segmentation. This issue is addressed in the next section.

Research on sequential memory has revealed (at least) two kinds of mechanisms for remembering sequences (for a review, see e.g., Henson, 1998). One mechanism is referred to as “chaining memory.” When memorizing the sequence ABCD using such a mechanism, one would learn that A goes to B, B to C, and C to D. In other words, this mechanism is fundamentally similar to TPs. There is another mechanism, however, that appeals to the sequential positions of items. For example, people often remember the first and the last element of a sequence, but not the intervening items. Chaining memories do not easily account for such results, because the “chain” is broken in the sequence middle. Positional mechanisms, in contrast, readily account for them: People may memorize the items that occurred in the first and last positions without remembering those in intervening positions. These (and, in fact, many other) results are thus readily explained if a distinction between positional and chaining memories is assumed (e.g., Conrad, 1960, Henson, 1998, Henson, 1999, Hicks et al., 1966, Ng and Maybery, 2002, Schulz, 1955). This distinction has also been observed in artificial grammar learning experiments. In such experiments, TPs and positional regularities seem to require different kinds of cues, to have different time courses, and to break down under different conditions (Endress and Bonatti, 2007, Endress and Mehler, in press, Peña et al., 2002). In these experiments, participants were familiarized with speech streams. The streams contained both chaining and positional regularities. Following familiarization, participants had to choose between items that instantiated the chaining regularity, the positional regularity, or both. Most relevant to the current experiments, participants were sensitive to the positional regularity only when the familiarization stream contained prosodic-like cues such as silences between words.
TPs, in contrast, were tracked even in the absence of such cues. It thus appears that both positional and chaining memories can be learned from speech streams by independent mechanisms, but that positional memories require additional, perhaps prosodic cues.
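The contrast between the two mechanisms can be made concrete with a toy sketch (our illustration; the item labels and the "forgotten middle" are hypothetical). A chaining code that has lost the B-to-C association can no longer reach the end of the sequence by traversal, whereas a positional code retrieves the last item directly from its slot:

```python
# Two toy memory codes for the sequence A-B-C-D, assuming (hypothetically)
# that the middle of the sequence has been forgotten.

# Chaining code: item-to-item associations; the B -> C link is lost.
chain = {"A": "B", "C": "D"}

# Positional code: slot-to-item bindings; the middle slots are lost.
slots = {1: "A", 4: "D"}

def last_by_chaining(chain, start):
    """Recover the final item by following associations from the start."""
    item = start
    while item in chain:
        item = chain[item]
    return item  # traversal halts at the broken link

def last_by_position(slots, length):
    """Recover the final item by addressing the last slot directly."""
    return slots.get(length)

print(last_by_chaining(chain, "A"))  # stops at "B": the chain is broken
print(last_by_position(slots, 4))    # "D": no traversal needed
```

This is why first-and-last recall patterns (as in the tip-of-the-tongue data discussed below) sit more naturally with positional than with chaining codes.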

Interestingly, a similar distinction between positional and chaining information has been proposed in artificial grammar learning experiments in the tradition developed by Miller, 1958, Reber, 1967, Reber, 1969 (although these experiments typically use simultaneously presented letter strings rather than sequences). In such experiments, participants are exposed to consonant strings governed by a finite-state grammar, and then have to judge whether new strings are grammatical. It now seems clear that participants acquire distributional information of the consonants of various kinds, including legal bigrams (which, we would argue, corresponds to chaining information; see e.g., Cleeremans and McClelland, 1991, Dienes et al., 1991, Kinder, 2000, Kinder and Assmann, 2000) and the positions of legal letters and bigrams within the strings (which may correspond to the positional information mentioned above; see e.g. Dienes et al., 1991, Johnstone and Shanks, 1999, Shanks et al., 1997, but see Perruchet & Pacteau, 1990). Whilst these experiments were not necessarily optimized to distinguish chaining and positional information, it is interesting to note that a similar distinction has also been proposed in this literature.

What kinds of memory mechanisms are used for words? There is some evidence from speech errors that word memory has at least a strong positional component. In the tip-of-the-tongue state, for instance, people often remember the first and the last phoneme of a word, but not the middle phonemes (e.g., Brown and McNeill, 1966, Brown, 1991, Kohn et al., 1987, Koriat and Lieblich, 1974, Koriat and Lieblich, 1975, Rubin, 1975, Tweney et al., 1975). Such observations are hard to explain if memory for words relies upon chaining memories, since such chains would be broken in the middles of words. In contrast, they follow naturally if one assumes that words rely on positional memories. Likewise, spoonerisms (that is, reversals in the order of phonemes such as in “queer old dean”, from “dear old queen”) often preserve the serial positions of the exchanged phonemes within words and syllables (e.g., MacKay, 1970). Again, this would be unexpected if words were remembered by virtue of chaining memories (because positions are not encoded in such memories), but it is easily explained if word memory has a positional component.

If memory for acoustic word forms is positional, cues to chaining memories such as TPs may not enable participants to extract words from fluent speech. Rather, learners may require other cues such as those that have triggered positional computations in other artificial language learning studies (Endress & Bonatti, 2007). Here, we thus return to the original motivation for TP-based processes, and examine their potential for the first step in word-learning, namely word-segmentation. (In the following, we will use word-learning and word-segmentation interchangeably. We thus hypothesize that the role of a word-segmentation mechanism is to provide candidates for phonological word forms, but are agnostic as to how such forms may become linked to meaning.) At the very least, if TP-based learning mechanisms are used for word-learning, one would expect the output of these mechanisms (that is, presumably phonological word candidates) to make learners more familiar with items they heard frequently than with items they never heard at all. After all, a word-segmentation mechanism should learn the words contained in its input, and not some syllable combination it has never encountered at all.

To test whether co-occurrence statistics would lead to the extraction of word-like units from fluent speech, we used standard TP-learning procedures with adults. Participants were told that they would listen to a monologue in an unknown language (in “Martian”). They were then familiarized with a continuous speech stream. This stream was a monotonous concatenation of nonce words (hereafter “words”) with no pauses between them. Six words were concatenated such that TPs among their syllables were identical to TPs among syllables of particular “items” that did not occur in the streams (i.e. “phantom-words”). Phantom-words are items that, even though they did not occur in the stream, could become familiar to learners who only track pairwise TPs among syllables (see Fig. 1 and “Materials and method” of Experiment 1a for the construction of words and phantom-words).

At test, participants had to choose which of two items was more likely to be a Martian word. One item was a word and the other a phantom-word. If participants extracted word-like units through TP computations, they should prefer words to phantom-words even though the TPs were the same (since words occurred in the stream while phantom-words did not). In terms of word-learning, one would expect participants to store only the words they actually encountered, and not all possible combinations of syllables that occurred together in other words.
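To see concretely why a pure TP tracker cannot separate words from phantom-words, consider a simulation sketch (our illustration, not the authors' analysis) using the Experiment 1b lexicon (words bagadu, togaso, bapiso, limudu, tomufe, lipife; phantom-words bagaso, limufe). Scoring an item by the product of its bigram TPs, the word bagadu and the phantom-word bagaso come out exactly equal, while a part-word such as dubaga (an assumed boundary-straddling foil) scores lower:

```python
import random
from collections import Counter

# The six words and two phantom-words of Experiment 1b.
words = ["bagadu", "togaso", "bapiso", "limudu", "tomufe", "lipife"]
phantoms = ["bagaso", "limufe"]

def syllabify(item):
    # All items here are CV.CV.CV, so two characters per syllable.
    return [item[i:i + 2] for i in range(0, len(item), 2)]

# Build a familiarization stream: 200 tokens per word, randomly ordered.
random.seed(0)
tokens = words * 200
random.shuffle(tokens)
stream = [s for w in tokens for s in syllabify(w)]

# Estimate forward TPs P(next syllable | current syllable) from the stream.
pairs = Counter(zip(stream, stream[1:]))
firsts = Counter(stream[:-1])
tps = {(a, b): n / firsts[a] for (a, b), n in pairs.items()}

def tp_familiarity(item):
    """Familiarity under a pure chaining (TP) memory: product of bigram TPs."""
    sylls = syllabify(item)
    score = 1.0
    for a, b in zip(sylls, sylls[1:]):
        score *= tps.get((a, b), 0.0)
    return score

# The word heard 200 times and the phantom-word never heard at all
# receive exactly the same familiarity score (0.5 * 0.5 = 0.25).
print(tp_familiarity("bagadu"), tp_familiarity("bagaso"))
```

With equal word frequencies, every within-word bigram TP is exactly 0.5 (e.g., ba is followed by ga in bagadu and by pi in bapiso), so the phantom-word is statistically indistinguishable from the words themselves.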

To assess whether participants tracked the statistical structure of the speech streams, we also asked them to choose between words and part-words. Part-words occurred in the stream, but straddled a word boundary; TPs between syllables in words are thus higher than in part-words. We thus expected to replicate the standard finding that participants prefer words to part-words (e.g., Aslin et al., 1998, Saffran et al., 1996a, Saffran et al., 1996b).

In Experiments 1a through 1d, participants were exposed to speech streams whose durations ranged from 5 to 40 min. Words in these streams had the same TPs as some non-occurring phantom-words. If participants can use TP-information for extracting words from speech, they should be more familiar with words than with phantom-words. The familiarization in Experiment 2 was the same as in Experiment 1, but participants then had to choose between phantom-words and part-words; we asked whether participants would be more familiar with phantom-words even though they did not occur in the stream. Experiments 3 and 4 were identical to Experiment 1, except that participants were given additional cues to word boundaries. These cues were 25 ms silences between words, and a lengthening of the final syllable in each word, respectively.

Of course, results with adult participants do not automatically hold for infants as well. However, statistical learning experiments have never revealed a difference between adults and infants (except that adults can learn more words), and some of our manipulations (such as presenting participants with a 40-min stream) are simply not feasible with infants. That is, infant language processing clearly differs from that of adults in many ways (e.g., Werker & Tees, 1984). However, such differences have never been observed in statistical learning experiments on speech segmentation. Moreover, our crucial result will be that adults cannot keep track of the items they have heard if the foils have the same TP-structure. As there is no reason to assume that the memory abilities of infants are more sophisticated than those of adults, it seems reasonable to assume that our results would hold for infants as well, though testing this assumption remains an important topic for future studies.

Section snippets

Participants

Fourteen native speakers of Italian (7 women, 7 men, mean age 23.4, range 20–27) took part in this experiment.

Materials

The stimuli were synthesized with the MBROLA speech synthesizer (Dutoit, Pagel, Pierret, Bataille, & van der Vreken, 1996), using the fr2 diphone database. (Pilot tests showed that native speakers of Italian find synthesized speech with the fr2 voice more intelligible than speech synthesized with the available Italian diphone databases. We thus decided to use fr2.)

Materials and methods

Experiment 1b was similar to Experiment 1a except that both words and part-words could have syllables with identical vowels. The words were bagadu, togaso, bapiso, limudu, tomufe and lipife; the phantom-words were bagaso and limufe. Moreover, each word was presented 200 times during familiarization (rather than 75 times as in Experiment 1a), yielding a 14-min familiarization. All test items are shown in Appendix C. Fourteen native speakers of Italian (12 women, 2 men, mean age: 23.4, range 19–33)

Materials and methods

Experiment 1c was similar to Experiment 1a except that 12 words were used instead of 6. That is, we employed two sets of six words with the same statistical properties as the six-word set in Experiment 1a. We thus obtained 12 words in total (paseti, Rosenu, pamonu, lekati, Rokafa, lemofa, bodesa, ludegi, bonagi, muRisa, luRivo and munavo) and four phantom-words (pasenu, lekafa, bodegi and muRivo). TPs within words were 0.5; TPs across words were 0.187 on average (range: 0.033–0.313). All test

Experiment 1d: Word-learning with 40-min exposure

Experiments 1a through 1c suggest that participants fail to prefer words over phantom-words even after controlling for a number of possible confounds. This is consistent with the hypothesis that participants do not extract word-candidates through TP computations. Another simple explanation, however, is that the familiarizations used in these experiments were simply too short for participants to extract words, and participants may require more exposure to prefer words over phantom-words. In

Materials and methods

The familiarization in Experiment 2 was identical to the one from Experiment 1a. At test, participants were presented with two other kinds of trials. They had to choose between phantom-words and part-words (rather than between words and phantom-words as in Experiment 1a); the part-word types (see above) were equally represented in these test pairs. In the remaining trials, participants had to choose between words and part-words; again, the part-word types were equally represented in the test

Materials and methods

Experiment 3 was identical to Experiment 1a except that words in the familiarization stream were separated by 25-ms silences. Fourteen native speakers of Italian (7 women, 7 men, mean age: 24.6, range 19–35) took part in this experiment.

Results and discussion

As shown in Fig. 8, participants preferred words to phantom-words (M = 66.1%, SD = 18.0%), t(13) = 3.3, p < .001, Cohen’s d = 0.89, CI.95 = 55.7%, 76.5% (and also words to part-words, M = 84.2%, SD = 17.8%, t(13) = 7.3, p < .001, Cohen’s d = 2.0, CI.95 = 74.5%, 95.1%). An ANOVA comparing
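The reported values can be reproduced from the summary statistics alone, assuming a one-sample t-test against the 50% chance level (the standard analysis in this paradigm); 2.160 is the two-tailed .05 critical t for df = 13:

```python
import math

def one_sample_summary(mean, sd, n, mu=50.0, t_crit=2.160):
    """Recompute t, Cohen's d and the 95% CI from summary statistics."""
    se = sd / math.sqrt(n)
    t = (mean - mu) / se                            # t statistic
    d = (mean - mu) / sd                            # Cohen's d
    ci = (mean - t_crit * se, mean + t_crit * se)   # 95% confidence interval
    return t, d, ci

t, d, ci = one_sample_summary(66.1, 18.0, 14)
print(round(t, 1), round(d, 2), round(ci[0], 1), round(ci[1], 1))
# 3.3 0.89 55.7 76.5 — matching the reported values
```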

Materials and methods

Experiment 4 was identical to Experiment 1a except that the duration of the final vowel of each word was doubled to 232 ms during familiarization (corresponding to a lengthening of the final syllable by 50%); the test items were (physically) identical to those used in Experiment 1a. As the preference for words to phantom-words was marginal with 14 participants (p = .048), we added six participants to be sure that the results would remain stable. In total, 20 native speakers of Italian (15 women, 5

General discussion

In the experiments presented here, we examined the potential of TP-based computations for word-segmentation. A basic prediction is that, if such computations are used for word-segmentation, they should make participants more familiar with items heard frequently than with unheard items. In contrast to this prediction, participants did not track at all whether or not they had encountered items when TPs in these items were matched; this remained true even after arbitrarily long periods of

Acknowledgments

This research has been supported by McDonnell Foundation Grant 21002089 and the European Commission Special Targeted Project CALACEI (contract No 12778 NEST). We are grateful to R. Aslin, L. Bonatti, D. Cahill, Á. Kovács, M. Nespor, M. Shukla, and J. Toro for helpful comments and discussions.

References (94)

  • Aslin, R. N., et al. (1998). Computation of conditional probability statistics by 8-month-old infants. Psychological Science.
  • Aslin, R. N., et al. Models of word segmentation in fluent maternal speech to infants.
  • Batchelder, E. O. (2002). Bootstrapping the lexicon: A computational model of infant speech segmentation. Cognition.
  • Bates, E., et al. (1996). Learning rediscovered. Science.
  • Bonatti, L. L., et al. (2005). Linguistic constraints on statistical computations: The role of consonants and vowels in continuous speech processing. Psychological Science.
  • Bortfeld, H., et al. (2005). Mommy and me: Familiar names help launch babies into speech-stream segmentation. Psychological Science.
  • Brent, M. (1997). Toward a unified model of lexical acquisition and lexical access. Journal of Psycholinguistic Research.
  • Brent, M., et al. (2001). The role of exposure to isolated words in early vocabulary development. Cognition.
  • Brown, A. S. (1991). A review of the tip-of-the-tongue experience. Psychological Bulletin.
  • Brown, R., et al. (1966). The tip of the tongue phenomenon. Journal of Verbal Learning and Verbal Behavior.
  • Cairns, P., et al. (1997). Bootstrapping word boundaries: A bottom-up corpus-based approach to speech segmentation. Cognitive Psychology.
  • Cattell, J. M. (1886). The time it takes to see and name objects. Mind.
  • Christiansen, M. H., et al. (1998). Learning to segment speech using multiple cues: A connectionist model. Language and Cognitive Processes.
  • Christophe, A., et al. (1994). Do infants perceive word boundaries? An empirical study of the bootstrapping of lexical acquisition. Journal of the Acoustical Society of America.
  • Christophe, A., et al. (2001). Perception of prosodic boundary correlates by newborn infants. Infancy.
  • Christophe, A., et al. (2004). Phonological phrase boundaries constrain lexical access. I: Adult data. Journal of Memory and Language.
  • Cleeremans, A., et al. (1991). Learning the structure of event sequences. Journal of Experimental Psychology: General.
  • Conrad, R. (1960). Serial order intrusions in immediate memory. British Journal of Psychology.
  • Cutler, A., et al. (1988). The role of strong syllables in segmentation for lexical access. Journal of Experimental Psychology: Human Perception and Performance.
  • Dahan, D., et al. (1999). On the discovery of novel wordlike units from utterances: An artificial-language study with implications for native-language acquisition. Journal of Experimental Psychology: General.
  • Davis, M. H., et al. (2002). Leading up the lexical garden path: Segmentation and ambiguity in spoken word recognition. Journal of Experimental Psychology: Human Perception and Performance.
  • Dienes, Z., et al. (1991). Implicit and explicit knowledge bases in artificial grammar learning. Journal of Experimental Psychology: Learning, Memory & Cognition.
  • Dutoit, T., Pagel, V., Pierret, N., Bataille, F., & van der Vreken, O. (1996). The MBROLA project: Towards a set of...
  • Elman, J. L. (1990). Finding structure in time. Cognitive Science.
  • Endress, A. D., et al. (2007). Rapid learning of syllable classes from a perceptually continuous speech stream. Cognition.
  • Endress, A. D., & Hauser, M. D. (in preparation). Cross-linguistic word segmentation without distributional...
  • Endress, A. D., & Mehler, J. (in press). Primitive computations in speech processing. Quarterly Journal of Experimental...
  • Fernald, A., et al. (1991). Prosody and focus in speech to infants and adults. Developmental Psychology.
  • Fernald, A., et al. (1984). Expanded intonation contours in mothers’ speech to newborns. Developmental Psychology.
  • Fiser, J., et al. (2002). Statistical learning of new visual feature combinations by infants. Proceedings of the National Academy of Sciences of the United States of America.
  • Fon, J. (2002). A cross-linguistic study on syntactic and discourse boundary cues in spontaneous speech. Unpublished...
  • Forster, K. I., et al. (1973). Lexical access and naming time. Journal of Verbal Learning and Verbal Behavior.
  • Goodsitt, J., et al. (1993). Perceptual strategies in prelingual speech segmentation. Journal of Child Language.
  • Gout, A., et al. (2004). Phonological phrase boundaries constrain lexical access. II: Infant data. Journal of Memory and Language.
  • Graf-Estes, K., et al. (2007). Can infants map meaning to newly segmented words? Statistical segmentation and word learning. Psychological Science.
  • Hauser, M. D., et al. (2001). Segmentation of the speech stream in a non-human primate: Statistical learning in cotton-top tamarins. Cognition.
  • Hay, J. S. F., et al. (2007). Perception of rhythmic grouping: Testing the iambic/trochaic law. Perception & Psychophysics.
  • Hayes, J. R., et al. Experiments in the segmentation of an artificial speech analog.
  • Henson, R. (1998). Short-term memory for serial order: The start–end model. Cognitive Psychology.
  • Henson, R. (1999). Positional information in short-term memory: Relative or absolute? Memory & Cognition.
  • Hicks, R., et al. (1966). Generalization of serial position in rote serial learning. Journal of Experimental Psychology.
  • Hitch, G. J., et al. (1996). Temporal grouping effects in immediate recall: A working memory analysis. Quarterly Journal of Experimental Psychology: Human Experimental Psychology.
  • Hoequist, C. (1983). Durational correlates of linguistic rhythm categories. Phonetica.
  • Hoequist, C. (1983). Syllable duration in stress-, syllable- and mora-timed languages. Phonetica.
  • Johnson, E. K., et al. (2001). Word segmentation by 8-month-olds: When speech cues count more than statistics. Journal of Memory and Language.
  • Johnstone, T., et al. (1999). Two mechanisms in implicit artificial grammar learning? Comment on Meulemans and van der Linden (1997). Journal of Experimental Psychology: Learning, Memory, and Cognition.
  • Jusczyk, P. W., et al. (1993). Infants’ preference for the predominant stress patterns of English words. Child Development.