Estimating working memory capacity for lists of nonverbal sounds

Li, Dawei; Cowan, Nelson; Saults, J. Scott

doi:10.3758/s13414-012-0383-z

Published: 10 November 2012

Volume 75, pages 145–160, (2013)
Attention, Perception, & Psychophysics

Abstract

Working memory (WM) capacity limits have been extensively studied in the domains of visual and verbal stimuli. Previous studies have suggested a fixed WM capacity of typically about three or four items, on the basis of the number of items in working memory reaching a plateau after several items as the set size increases. However, the fixed WM capacity estimate appears to rely on categorical information in the stimulus set (Olsson & Poom, 2005). We designed a series of experiments to investigate nonverbal auditory WM capacity and its dependence on categorical information. Experiments 1 and 2 used simple tones and revealed a capacity limit of up to two tones following a 6-s retention interval. Importantly, performance was significantly higher at set sizes 2, 3, and 4 when the frequency difference between target and test tones was relatively large. In Experiment 3, we added categorical information to the simple tones, and the effect of tone change magnitude decreased. Maximal capacity for each individual was just over three sounds, in the range of typical visual procedures. We propose that two types of information, categorical and detailed acoustic information, are kept in WM and that categorical information is critical for high WM performance.

Working memory (WM) refers to the cognitive process that involves maintenance and manipulation of a limited amount of information for a short period, usually a few seconds (Baddeley & Hitch, 1974; Cowan, 1995). WM is fundamental to a number of higher-order cognitive functions, such as decision making, language processing, and planning (Cowan, 2005). For decades, researchers have been curious about the limit of WM capacity—how much information can be stored in WM. Although WM has been investigated in great detail using visual and verbal stimuli, WM for tones has received far less attention and will be examined here. We will consider factors that affect performance in other domains in order to assess the capacity of a key attention-related component of WM for nonverbal sounds, uncorrupted by various mnemonic strategies.

Guidance from research on WM in other domains

Research on WM in other domains provides an important context for deciding how to examine WM for tones. On the basis of a range of empirical evidence, Miller (1956) proposed that people could keep, in what is now called WM, lists of approximately seven items, plus or minus two. Miller’s work elicited many subsequent studies on humans’ WM capacity. Some studies, however, indicated that Miller might have overestimated WM capacity (Cowan, 2001; Luck & Vogel, 1997; Sperling, 1960). In one experiment in Sperling’s seminal work, participants were briefly presented an array of 12 characters and were instructed to write down the characters that they could remember after the array had disappeared. The results showed that only about 4 characters could be written down, meaning that WM capacity might be more restricted than that estimated by Miller.

Chunking

One reason for the higher estimate obtained by Miller (1956) is that, as he himself pointed out, people sometimes can group several items from a list into a larger meaningful unit, or chunk, and remember the chunk instead of the individual items. In a straightforward illustration, although people usually cannot remember nine random letters, such as NGJLXISFH, they can easily remember IRSFBICIA, if they are able to chunk these letters into three U.S. government agencies—IRS, FBI, and CIA. In Sperling’s experiments, due to the rapid and concurrent presentation of items, one can assume that it was difficult to apply chunking, leading to a smaller estimate of WM capacity. Similar results have been obtained with the recognition of nonverbal items (e.g., Luck & Vogel, 1997). Musical knowledge may allow chunking for musical sounds, which we discouraged by selecting stimuli judiciously.

Rehearsal

Another factor affecting WM capacity estimates is the strategy of rehearsal, or covertly repeating items or labels in WM in order to refresh the representations of the items in WM. Some studies have shown that phonologically similar words, such as cat, bat, and mat, or letters, such as P, V, D, B, and so forth, were more difficult to remember than phonologically dissimilar words or letters, a phonological similarity effect, even when the items were visually presented (Conrad, 1964; Conrad & Hull, 1964) and that people could memorize fewer words with longer length than words with shorter length, a word length effect (Baddeley, Thomson, & Buchanan, 1975). According to some theories, these effects reflect the use of rehearsal in verbal WM (see Footnote 1), which is more error prone when the words are phonologically similar and takes longer when the words are longer. Thus, when participants are required to repeat a simple word, such as “the,” while remembering the items (an articulatory suppression task), both the phonological similarity effect and the word length effect are greatly diminished or disappear entirely (for a review, see Baddeley, 1986). Because humming might be used to rehearse nonverbal sounds (e.g., Hickok, Buchsbaum, Humphries, & Muftuler, 2003), we used suppression to prevent that possibility (cf. Schendel & Palmer, 2007).

Sensory memory

Another factor that can enhance the estimate of WM is memory for the physical properties of the stimuli—that is, exactly how they look or sound. Sperling (1960) showed that exposure to the character array led to sensory memory of most of the array items for a short period (under 1 s) and that this sensory memory was available for recall if a partial report cue was provided so that only one row of up to four items had to be recalled on a particular trial. Similar indications of a vivid but short-lived sensory memory for a complex array are obtained in the auditory modality, with auditory sensory memory lasting several seconds (e.g., Darwin, Turvey, & Crowder, 1972). In some experiments, items to be recalled are followed by an interfering item in the same modality in order to overwrite sensory memory, making it necessary that the concepts, rather than sensations, be recalled (e.g., Saults & Cowan, 2007). We adopted that strategy here for nonverbal sounds.

Core WM capacity

Cowan (2001) suggested that the smaller limit of three to five items in WM is obtained under conditions in which the items retained are chunks (meaningful units) based on already-known information. This is presumably the case when the items cannot be further grouped into larger meaningful units at the time when they are presented in the to-be-remembered materials, cannot be rehearsed verbally, and cannot be retained in a sensory form. To ensure that these conditions apply, known patterns across stimuli can be avoided, articulation can be suppressed, and the items to be retained can be followed by a sensory mask.

A key example upon which the present work is based is the recognition memory for colored squares, examined by Luck and Vogel (1997). In one experiment, they instructed the participants to memorize a briefly presented array of a few colored squares for several seconds, followed by the presentation of a second, probe array in which one square may have changed color; in another experiment yielding similar results, one item in the second array was marked to indicate which square might have changed. The task was to decide whether the new square had the same color as the previous square in that location. With this change detection paradigm, they estimated the participants’ visual WM capacity at about four items. The brief presentation of the first array made the items difficult to chunk, and a secondary memory load of two digits further discouraged rehearsal. Sensory memory presumably could not be used to great advantage either, inasmuch as the probe array would have overwritten the critical sensory information before a judgment could be made. Given these restrictions, it is suggested that the results are indicative of a core WM capacity (Cowan, 2001).

This core WM capacity was observed also when participants were taught pairs of words, in which case participants recalled about three chunks from a list in the presence of articulatory suppression, no matter whether the chunks in the list were singletons or learned pairs (Chen & Cowan, 2009). Given the considerable evidence for a small core capacity for information from stimuli that can be labeled, we wished to examine WM for tonal stimuli that could not easily be labeled.

Cowan (2001) proposed a measure that can be used to estimate the number of items held in WM. This measure applies to the experimental situation in which the test probe display clearly indicates which item changed, if any of them did (Rouder, Morey, Morey, & Cowan, 2011). It assumes that the array includes N items and that k items fit in WM. Then when N > k, the proportion of correct detections of a change, or hits, can be estimated as hits = k/N + (1 − k/N)g, where g is the rate of guessing that there has been a change, in the absence of WM information. Guessing takes place only if the tested item was not in WM, so when there is no change, false alarms = (1 − k/N)g. Combining these equations yields the estimate k = (hits − false alarms)N. We apply this formula to recognition memory for lists of tones.
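The derivation above can be checked numerically. The following is a minimal sketch, not the authors' analysis code; the function names are hypothetical. It generates the hit and false alarm rates the model predicts for a given k, N, and guessing rate g, and then recovers k from them with k = (hits − false alarms)N.

```python
def predicted_rates(k, n, g):
    """Hit and false alarm rates predicted by the single-probe model.

    k: number of items held in WM, n: set size, g: rate of guessing "change".
    The model is stated for n > k; the probability that the probed item
    is in WM is capped at 1 so that n <= k is handled gracefully.
    """
    p_in_memory = min(k / n, 1.0)
    hits = p_in_memory + (1 - p_in_memory) * g
    false_alarms = (1 - p_in_memory) * g
    return hits, false_alarms


def cowan_k(hits, false_alarms, n):
    """Estimate of items in WM: k = (hits - false alarms) * N."""
    return (hits - false_alarms) * n


if __name__ == "__main__":
    hits, false_alarms = predicted_rates(k=3, n=6, g=0.5)
    print(hits, false_alarms, cowan_k(hits, false_alarms, 6))
```

Because the guessing term (1 − k/N)g appears in both rates, it cancels in the subtraction, so the estimate does not depend on g.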

WM and categorical information

Although a number of studies directly support the theory of core WM capacity, there are a few alternative theories and experimental results. Some studies found lower capacity estimates in WM tasks. For example, Alvarez and Cavanagh (2004) found that participants remembered fewer items when complexity of the stimulus set increased. However, Awh and colleagues suggested that WM capacity was limited by sample–test similarity, instead of stimulus complexity (Awh, Barton, & Vogel, 2007). They presented the same complex objects as those used in Alvarez and Cavanagh but manipulated the test object to have either high or low similarity with the sample object. They found that when sample–test similarity was low, capacity for complex objects was identical to capacity for simple objects. These results indicated that stimulus complexity is unlikely to be the main factor influencing WM capacity, although it affects the conditions under which the representations in WM will be adequate for a comparison with the test stimulus.

Another line of studies used a different approach to study WM capacity. These studies used various modified change detection tasks that measure the precision of a memory trace. Zhang and Luck (2008) presented participants an array of colored squares. In the test, participants were presented with a continuous color wheel and were required to choose a color that matched the color of a target square. Zhang and Luck (2008) manipulated the number of squares in the sample array, and they found that memory capacity was high at set sizes 1, 2, and 3 but dropped drastically from set size 3 to 6, whereas the precision of participants’ responses decreased from set size 1 to 3 but stayed unchanged from set size 3 to 6. In support of the core WM capacity theory, these results indicated that when set size was high, participants devoted their mental resources to only a subset of stimuli. However, using similar paradigms, some studies have concluded that WM capacity is constrained by a limited pool of continuous mental resources that can be spread among any number of items, rather than a fixed number of slots (e.g., Bays & Husain, 2008). In Bays and Husain’s study, participants were briefly presented an array of colored squares or colored arrows, and after a short delay, they were required to judge whether a new square or arrow had the same spatial location or orientation as the previous square or arrow with the same color. The amount of location and orientation displacement was varied across trials. Results showed that memory precision decreased when the array size increased, even for the smallest set sizes. Bays and Husain also showed that WM resources could be allocated flexibly by manipulating eye movements during encoding. They proposed that WM resources could be distributed sparsely to all objects, instead of to a fixed number of objects. The debate between slot-based and resource-based models of WM capacity has yet to be resolved to everyone’s satisfaction (e.g., Anderson, Vogel, & Awh, 2011; Bays, Catalao, & Husain, 2009; Zhang & Luck, 2011), although we strongly favor the fixed slots view; for perhaps the most compelling rebuttals of the continuous resource view to date, see Anderson and Awh (2012) and Thiele, Pratte, and Rouder (2011).

Olsson and Poom (2005) manipulated the amount of categorical information in the stimulus sets. They found that visual WM capacity was as low as only one in a stimulus set with little categorical information, even if the visual items were easy to distinguish. When categorical information, such as discrete colors and shapes, was added to the stimulus set, the estimated visual WM capacity increased to slightly below three. The authors concluded that categorical information stored in long-term memory was crucial to visual WM performance. Considering this study along with Zhang and Luck (2008), it is possible that the core WM capacity must make use of categorical information in the stimulus set. When the set size is small (e.g., less than three items), participants are able to retain categorical information of all the stimuli, as well as some information about stimulus details. When the set size increases to a certain point (e.g., more than four items), participants may have to devote all their mental resources to the categorical information of only a subset of the stimuli, leaving less capability for storing object details. Categorical information is also likely the information stored in the focus of attention and is less susceptible to decay or interference than is information about details (Saults & Cowan, 2007). Therefore, if a stimulus set contains little categorical information, low WM capacity is expected, especially after a long delay.

Indeed, a few studies have suggested the role of categorical information in the auditory domain (Fujisaki & Kawashima, 1970; Nairne, 1990; Surprenant & Neath, 1996). Nairne proposed a feature model that consists of two features of memory trace: modality dependent and modality independent. Modality-dependent features refer to physical properties that are modality specific, whereas modality-independent features refer to categorized information that does not depend on a specific modality (Nairne, 1990). Consequently, if a stimulus set is difficult to categorize, only modality-dependent features can be used in a memory task, leading to low performance.

In this study, we manipulated the amount of categorical information in our stimulus sets to study its influence on WM capacity for nonverbal sounds.

Prior research on WM for tones

In contrast to the extensively investigated domains of visual and verbal WM, few studies have investigated the capacity limit of nonverbal auditory items in WM that are uncontaminated in that they contain little verbal or verbalizable information, are difficult to visualize, and contain no familiar structure. It is possible that such auditory items could be more difficult to remember due to their acoustic nature, which can be retained in WM only through their sound properties, instead of phonological, visual, or semantic properties.

Studies of music sequence production suggest the use of two levels of structure: the scales of discrete pitch relationships, or intervals (such as the 12-tone chromatic scale of equal temperament in Western music), and the seven-interval subsets of the chromatic scale called diatonic scales (Burns & Ward, 1999; Davies, 1979). People can use familiarity with these scales to encode the melodic contour of a musical sequence, grouping or chunking intervals to achieve better memory of musical sequences, as compared with random tone sequences (Dewar, Cuddy, & Mewhort, 1977; Idson & Massaro, 1976). In our stimuli, we avoided these familiar structures in order to assess WM capability without chunking. Even within structured stimuli, however, there is some evidence of a core capacity limit. In particular, in music sequence production, pitch-ordering errors (musical sequences reproduced with tones in the wrong order) suggest a WM capacity limitation: There is typically confusion between tones no more than three to four tones apart in the intended sequence (Drake & Palmer, 2000; Palmer, 2005; Palmer & Pfordresher, 2003).

Capacity for lists of tones

There has been little research in which the number of tones in a sequence has been varied in order to assess the effect of that manipulation on the ability to detect a change in one tone. Some studies have examined serial recall of nonverbal auditory stimuli that were presented in different spatial locations (Lehnert & Zimmer, 2006; Parmentier & Jones, 2000). Lehnert and Zimmer asked participants to remember arrays of visual, auditory, or mixed visual and auditory stimuli at different spatial locations. The visual and auditory stimuli were from the same objects (e.g., an image of an airplane and sound produced by an airplane). The results showed that the hit rate was approximately .66, .57, and .52 at set sizes 4, 6, and 8, respectively, for auditory arrays and .90, .79, and .67 at set sizes 4, 6, and 8, respectively, for visual arrays. Importantly, even in the mixed array condition, performance was significantly lower for auditory than for visual items. These results suggested lower capacity for spatial auditory than for spatial visual information. However, it is likely that binding of spatial and auditory features is more difficult than binding of spatial and visual features. Some studies also showed that sound localization was impaired during a spatial WM task, but not phonological WM tasks, indicating that spatial sound localization involves more spatial memory than does phonological memory (Merat & Groeger, 2003). Therefore, although the studies on auditory spatial WM provide insightful results for the organization of representations in WM, they are not direct measures of nonverbal auditory WM capacity, because of the involvement of spatial locations.

Watson, Foyle, and Kidd (1990) varied the number of component tones widely. They chose tones in a manner that eliminated conventional musical cues, dividing the frequency range from 300 Hz to 3 kHz into N tones based on logarithmically equal intervals, where N was the list length, and shuffling the order of the resulting tones. Clearly, the number of tones made a very large difference for performance, although no estimate of the WM capacity for tones could be obtained from their procedure. Note that, using this method, the number of tones in the list is confounded with the frequency difference between adjacent tones.
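The confound noted above can be made concrete with a short sketch. It assumes that the N tones include both endpoints of the 300 Hz to 3 kHz range; the source does not say how Watson et al. handled the endpoints, so this is an illustration rather than a reconstruction of their procedure. The function names are hypothetical.

```python
import random


def adjacent_ratio(n, low_hz=300.0, high_hz=3000.0):
    """Frequency ratio between neighboring tones for a list of n tones.

    Assumption: both endpoints are included, so n tones span n - 1
    logarithmically equal intervals.
    """
    if n < 2:
        raise ValueError("need at least two tones")
    return (high_hz / low_hz) ** (1 / (n - 1))


def log_spaced_tones(n, low_hz=300.0, high_hz=3000.0, rng=None):
    """Return n log-spaced frequencies; shuffled if an rng is supplied."""
    ratio = adjacent_ratio(n, low_hz, high_hz)
    tones = [low_hz * ratio ** i for i in range(n)]
    if rng is not None:
        rng.shuffle(tones)  # random presentation order, as in the description
    return tones


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, round(adjacent_ratio(n), 3))
```

Under this assumption the ratio between neighboring frequencies shrinks as N grows, so longer lists also contain tones that are closer together in frequency.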

Kidd and Watson (1992) found that what was important was not the number of tones per se but the proportion of the tone list taken up by the target tone (proportion of the total duration [PTD]). In their procedure, however, participants were held responsible for only one tone per series, the one in the middle of the pattern (or in one experiment, two tones flanking the middle tone and changing together), which would not place a load on WM commensurate with the list length. Surprenant (2001) replicated this finding and further showed that both PTD and relative distinctiveness account for memory effects in three tone sequence recognition tasks. These findings indicate that besides list length, stimulus properties are important factors affecting auditory WM performance, similar to the case of visual WM (Alvarez & Cavanagh, 2004; Awh et al., 2007).

In the closest precursor to the present study that we could find, Prosser (1995) chose 14 tones that were selected to avoid a musical scale and presented lists of 2, 4, or 6 randomly selected tones per trial. The list was followed by a tone probe to be judged present or absent from the sequence. To evaluate the results, we apply the formula of Cowan (2001) to the means shown in Prosser’s Fig. 1. Doing so, using data for a short (1-s) retention interval, for lists of 2, 4, and 6 tones yields estimates of k = 1.5, 2.2, and 2.9 tones in WM, respectively. These estimates are roughly consistent with past evidence on nontonal stimuli or are slightly lower. The shift across list lengths is found also for visual arrays and may occur because certain individuals have a capacity lower than set size N, resulting in ceiling effects that limit the estimates for the smaller set sizes. However, due to the short retention interval and lack of a mask sound, these estimates were likely affected by sensory memory. Indeed, capacity estimates fell to 1.5, 1.7, and 1.7 items for set sizes of 2, 4, and 6, respectively, when Prosser used a 7-s retention interval.

Some previous studies investigated the effect of perceptual organization of auditory sequences on memory performance (Deutsch, 1970; Jones, Macken, & Harries, 1997; Warren & Obusek, 1972). Warren and Obusek found that participants were unable to report serial order of auditory sequence with three or four sounds, including but not limited to tones, when the duration of a sound was 200 ms. They also found that for proper serial order identification, stimulus duration should be at least 670 ms for oral response and 300 ms for card-ordering response. In this study, we used a relatively long stimulus presentation (500 ms) to avoid the limitations found with short stimulus durations.

Capacity, attention, and time

Cowan (2001) suggested that a limited number of items making up the core contents of WM is held in the focus of attention. The primary function of holding information that way would be to make the item representations resistant to interference or decay. In that regard, it is useful to examine the items in WM after a several-second retention interval, so that features and items susceptible to decay already would have decayed, and what remains is the items held firmly in mind. Cowan et al. (2011) presented a combination of colored squares and spoken letters, followed by a mask and then an 8-s retention interval, and after that period still observed a capacity of 2.9–3.6 items.

Capacity might be lower, however, for tonal stimuli that do not correspond to known musical categories. Prosser (1995) included a 7-s retention interval and, for lists of two, four, and six tones, we estimate from his Fig. 1 that k = 1.5, 1.7, and 1.7 items, respectively.

The present study

We wished to explore further these rough estimates of tones in WM, derived from the findings of Prosser (1995) at a long retention interval, more systematically in order to understand WM capacity limits. We adapted the change detection procedure by presenting sequences of tones, followed by a probe tone or probe tone list to be recognized as the same as the original list or as changed (see Fig. 1). To identify the WM capacity limit for individual tones without any familiar musical structure, we used lists of tones randomly selected from a nonmusical scale of 12 pitches that differ from the notes of the chromatic scale and span several octaves. We used a retention interval of 6 s (following a list-final masking stimulus), which is long enough that any residual sensory memory that somehow survived the mask should already have decayed before the probe (see Darwin et al., 1972), leaving behind information that resists decay.

Several features distinguish our study from the past work of Prosser (1995) or any other study, to our knowledge. First, as one step to eliminate sensory memory information, we presented a masking sound after each list. There is a long history of auditory backward masking of recognition using interstimulus target–mask intervals of a fraction of a second (e.g., Massaro, 1975), but our purpose here was not to prevent recognition. Rather, similar to Saults and Cowan (2007), we waited long enough for recognition of all tones in the list to be completed and then presented a mask, in order to force participants to rely on the recognized abstract information in WM, rather than on a sensory memory trace, which otherwise might have persisted for several seconds (Cowan, 1984; Darwin et al., 1972).

Second, unlike most prior studies, on half of the trials, we suppressed articulation in case participants were able to vocalize tones covertly and rely on that process as subvocal rehearsal. To equate attention demand, we asked participants to tap their right index finger on the desk on the other half of the trials (Ricker, Cowan, & Morey, 2010).

Third, to equate the amount of intertone interference in memory (Cowan et al., 2005), we included conditions in which the number of tones stayed the same across different memory loads, which was accomplished by presenting six tones and requiring memorization starting at a variable point in the middle of the list (Fig. 1).

Fourth, and finally, we provided visual cues to indicate which serial position in the tone series was being probed. We did this because it is required for the k measure of items in working memory, which is based on the assumption that the participant needs to compare the probe with the memory of only one item. This measure of items in working memory has been psychometrically validated much more fully than any other measure; the data conform to a receiver operating characteristic function expected according to the model (Rouder et al., 2008; Rouder et al., 2011).

All of these precautions, taken together, should allow us to examine WM capacity for abstract information about tones without any prelearned categories for the tones.

Experiment 1

Method

Participants

Twenty-seven undergraduate University of Missouri students (12 male, 15 female) participated in the experiment to fulfill introductory psychology course requirements. In both Experiment 1 and the following two experiments, we included only individuals without special music training, defined as participation in a band or an orchestra or music instruction at a college level.

Apparatus and stimuli

The stimuli were presented with E-Prime (Schneider, Eschman, & Zuccolotto, 2002) in soundproof booths, using loudspeakers. Twelve simple tones (sine waves) were generated by Praat software (Boersma & Weenink, 2009), with a lowest frequency of 200 Hz and a highest frequency of 3900 Hz. There was a 31 % frequency difference between each two adjacent tones. Each tone had a duration of 500 ms and included 25-ms linear onset and offset ramps.
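As an illustration of the stimulus description above, the sketch below generates the 12 frequencies (each 31 % above the previous one, starting at 200 Hz) and a 500-ms sine tone with 25-ms linear onset and offset ramps. It is not the Praat script used in the study. The 44.1-kHz sample rate and the function names are assumptions made here for illustration.

```python
import math


def tone_frequencies(lowest_hz=200.0, step=1.31, count=12):
    """Twelve frequencies, each 31 % higher than the one before."""
    return [lowest_hz * step ** i for i in range(count)]


def ramped_sine(freq_hz, duration_s=0.5, ramp_s=0.025, sample_rate=44100):
    """Sine tone with linear onset and offset ramps, as a list of floats.

    sample_rate is an assumed value; the source does not report one.
    """
    n_samples = round(duration_s * sample_rate)
    n_ramp = round(ramp_s * sample_rate)
    samples = []
    for i in range(n_samples):
        # Gain rises linearly over the first ramp, falls over the last one.
        gain = min(1.0, i / n_ramp, (n_samples - 1 - i) / n_ramp)
        samples.append(gain * math.sin(2 * math.pi * freq_hz * i / sample_rate))
    return samples


if __name__ == "__main__":
    for f in tone_frequencies():
        print(round(f, 1))
```

With a step of 1.31, the twelfth tone comes out just under 3900 Hz, which matches the reported highest frequency.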

We wanted the pitches of our 12 tones to be as far apart as possible, so they would be easy to discriminate, but still within a range with similar difference limens for frequency change, which increase sharply beyond 4000 Hz (Sek & Moore, 1995). We also wanted them to differ from familiar musical notes. Thus, our lowest tone was about 35 cents above the G below middle C (G3), while our highest tone was about 23 cents below B7, the second highest note on an 88-key piano (100 cents = 1 semitone). A 31 % difference between tones avoids familiar musical intervals and harmonic relationships between tones. Adjacent semitones in music differ by about 5.9 % (a frequency ratio of exactly 2^(1/12)) in 12-tone equal temperament, the common tuning system for Western music (Burns & Ward, 1999). Although our stimuli spanned about four octaves, no tone in our set had a simple harmonic relationship with another tone. For example, the fourth harmonic of 200 Hz (two octaves above it) is 800 Hz, but the closest frequency to that in our set was 771.6 Hz. Avoiding octaves minimizes the tendency to confuse two tones with different pitch height but equal chroma on the basis of octave generalization (Shepard, 1982).
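The cent values quoted above can be reproduced with a few lines of arithmetic. This sketch assumes the standard A4 = 440 Hz reference for equal temperament, which the source does not state explicitly, and uses hypothetical function names.

```python
import math


def cents(freq_hz, reference_hz):
    """Signed distance from reference_hz to freq_hz in cents (1200 per octave)."""
    return 1200 * math.log2(freq_hz / reference_hz)


def equal_tempered_hz(semitones_from_a4, a4_hz=440.0):
    """Frequency of a 12-tone equal temperament note; a4_hz is an assumption."""
    return a4_hz * 2 ** (semitones_from_a4 / 12)


lowest_hz = 200.0
highest_hz = 200.0 * 1.31 ** 11      # twelfth tone in the 31 % series

g3_hz = equal_tempered_hz(-14)       # G3 lies 14 semitones below A4
b7_hz = equal_tempered_hz(38)        # B7 lies 38 semitones above A4

if __name__ == "__main__":
    print(round(cents(lowest_hz, g3_hz), 1))   # lowest tone relative to G3
    print(round(cents(highest_hz, b7_hz), 1))  # highest tone relative to B7
    print(round(cents(1.31, 1.0), 1))          # size of one 31 % step in cents
```

A 31 % step works out to roughly 467 cents, which falls between four and five semitones and so does not coincide with any equal-tempered interval.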

Six circles were presented in the center of the screen on a gray background, as shown in Fig. 1. The participants were seated approximately 50 cm from the screen. The sounds were presented through two speakers (left and right) in front of the participants, with intensities between 60 and 70 dB(A) as measured by a sound level meter.

Procedure

On each trial, participants had to try to remember 2, 3, 4, 5, or 6 tones and then perform a recognition task. At the beginning of each trial, a “+” appeared at the center of the screen for 1,000 ms, which indicated the onset of a trial and provided a fixation point for the participant. Next, six circles were presented at the center of the screen, as shown in Fig. 1. Six tones, randomly selected without replacement from the set of 12, were sequentially presented at a rate of one item every 750 ms. A printed character (*, &, $, @, #, %, or →) accompanied each tone, and the characters were presented sequentially, with each character in one of the circles, always starting from the circle at the top. The character disappeared as soon as its corresponding tone ended. The participants were instructed to start remembering tones, starting with the one accompanied by a forward arrow (→) and continuing until the end of the series. They were also instructed to ignore the characters except for the forward arrow (→) to minimize any additional processing load (Lavie, 1995). The position of the forward arrow (→) was manipulated such that the memory load was set to include five levels: 2, 3, 4, 5, and 6 tones. The other characters were randomly arranged, and there was no constant association between particular characters and particular tones.
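The encoding sequence for a PM1 trial can be sketched as follows. This is a hypothetical illustration, not the E-Prime script used in the experiment. It assumes that the five filler characters on a trial are drawn without repetition from the six non-arrow characters, a detail the description leaves open.

```python
import random

FILLER_CHARACTERS = ["*", "&", "$", "@", "#", "%"]
ARROW = "→"
LIST_LENGTH = 6


def build_pm1_trial(tone_set, memory_load, rng=random):
    """Return (sequence, arrow_index) for one six-tone trial.

    sequence is a list of (character, tone) pairs in presentation order.
    The arrow is placed so that exactly memory_load tones, from the arrow
    to the end of the list, have to be remembered.
    """
    if not 2 <= memory_load <= LIST_LENGTH:
        raise ValueError("memory_load must be between 2 and 6")
    tones = rng.sample(tone_set, LIST_LENGTH)      # without replacement
    arrow_index = LIST_LENGTH - memory_load
    # Assumption: fillers do not repeat within a trial.
    fillers = rng.sample(FILLER_CHARACTERS, LIST_LENGTH - 1)
    characters = fillers[:arrow_index] + [ARROW] + fillers[arrow_index:]
    return list(zip(characters, tones)), arrow_index


if __name__ == "__main__":
    tone_ids = list(range(1, 13))                  # placeholder tone labels
    sequence, arrow_index = build_pm1_trial(tone_ids, memory_load=4)
    print(arrow_index, sequence)
```

Holding the list at six tones while moving the arrow keeps the number of tones heard constant across memory loads, which is the point of this presentation method.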

Two additional types of trials were included in the experiment and were the same as those in the other conditions, except that the participants heard only two or four tones and saw two or four characters, respectively, during the encoding phase. The characters were presented sequentially, each in one circle, starting from the circle on the top, and the first character was always a forward arrow (→). We included these additional conditions to estimate to what extent the different stimulus presentation methods would affect the participants’ performance. In the following text, we will denote these trials as “presentation method 2” (PM2) and the other trials as “presentation method 1” (PM1).

A masking tone, which was produced by the simultaneous combination of all 12 stimulus tones, was presented for 500 ms after the last of the six tones, in the same temporal rhythm as these tones, to eliminate sensory memory. After a 6,000-ms retention interval, a probe tone was presented, accompanied by a “?” symbol in one of the circles corresponding to a tone that was to be remembered. The participants were to decide whether the probe tone corresponding to the “?” location was the same as the one at that location during encoding or was different. If the tone was different, it did not match any of the tones in the presented series, and the participants were made aware of that. On half of the trials, the correct answer would be “same,” and on the other half of the trials, the correct answer would be “different,” and the test tone was randomly selected from the 12 tones other than the target tone. The participants were instructed to press “s” for “same” and “d” for “different,” and they had unlimited time to respond. Feedback that lasted for 500 ms was provided after the participant made a response. A blank period with a dot in the center of the screen lasted for 1,000 ms before the next trial started.

The trials were allocated into 10 blocks. Each block contained 4 trials for each condition, adding up to 28 trials per block. Each trial lasted for 16 s, and the experiment lasted for 1.5 h.
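The block arithmetic above can be checked with a short sketch. It assumes, following the design described earlier, that the seven conditions are the five PM1 set sizes plus the two PM2 trial types. The variable names are illustrative, and the nominal duration it yields covers test trials only, excluding training, practice blocks, and any breaks.

```python
# Sketch of the trial bookkeeping described in the text.
# Assumption: 7 conditions = 5 PM1 set sizes + 2 PM2 trial types.
PM1_SET_SIZES = [2, 3, 4, 5, 6]
PM2_SET_SIZES = [2, 4]
TRIALS_PER_CONDITION_PER_BLOCK = 4
N_BLOCKS = 10
TRIAL_DURATION_S = 16  # approximate per-trial duration, as reported

n_conditions = len(PM1_SET_SIZES) + len(PM2_SET_SIZES)
trials_per_block = n_conditions * TRIALS_PER_CONDITION_PER_BLOCK
total_trials = trials_per_block * N_BLOCKS
nominal_minutes = total_trials * TRIAL_DURATION_S / 60

print(trials_per_block, total_trials, round(nominal_minutes, 1))
```

Under these assumptions the test trials alone take roughly 75 min, which is compatible with the reported 1.5-h session once the remaining activities are added.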

In half of the blocks, the participants were instructed to whisper “the” twice a second during the encoding and maintenance phases (“whisper” sessions); to equate attention cost, in the other half of the blocks, they were instructed to tap the right index finger on the table twice a second during these phases (“tap” sessions) (Ricker et al., 2010). The “whisper” and “tap” sessions were arranged in a fixed order: whisper–tap–tap–whisper–whisper–tap–tap–whisper–whisper–tap.

At the beginning of the experiment, participants were trained to whisper “the” and tap their right index finger on the desk, each for 1 min. During the practice, there was a beep every second to help the participants keep the pace. After the training, participants performed two practice memory blocks, each consisting of seven trials (one trial per condition). The first practice session was a “whisper” block, and the second practice was a “tap” block.

Results and discussion

Accuracy

A two-way repeated measures ANOVA of PM1 response accuracy, with the set size of tones to be remembered (two, three, four, five, or six) and articulation condition (“whisper” and “tap”) as within-participants factors, revealed significant main effects of set size, F(4, 104) = 15.39, η_p² = .37, p < .01, and articulation, F(1, 26) = 4.88, η_p² = .15, p < .05. The interaction between set size and articulation was not significant, F(4, 104) = 1.27, p > .05 (see Fig. 2, top left). The main effect of set size reflected better performance at smaller set sizes, and the main effect of articulation reflected better performance in the “tap” condition, suggesting that repeating a simple word could interrupt rehearsal of tones and replicating the results of previous work (Schendel & Palmer, 2007). Full ANOVA tables of all three experiments are provided in the supplementary material (Supplementary Fig. 1).

Tone change magnitude effects

The frequency difference between target and test tones varied widely among the “change” trials. It is possible that the magnitude of the frequency change influenced WM capacity estimates, in that participants performed worse when the frequency change was relatively small (although still clearly discriminable). Therefore, we examined the effect of tone change magnitude on memory performance. Only “change” trials were included in the analysis. All the “change” trials were sorted according to the frequency difference between target and test tones, which ranged from 1 to 11 tones apart. Trials with frequency differences from 1 to 4 tones apart were combined as one condition (“small”), and the remaining, larger-difference trials were combined as another condition (“large”). Groups were divided in this way so that the “small” and “large” groups would have approximately the same number of trials.^{Footnote 2} A three-way repeated measures ANOVA of the hit rate data, with set size (two, three, four, five, and six), articulation condition (“whisper” and “tap”), and tone change condition (“small” and “large”) as within-participants factors, revealed significant main effects of set size, F(4, 104) = 4.43, η_p² = .15, p < .01, and tone change condition, F(1, 26) = 30.47, η_p² = .54, p < .001. The hit rate was 0.62 ± 0.06 for small tone differences and 0.72 ± 0.06 for large tone differences; the intervals are 95 % repeated measures confidence intervals (Hollands & Jarmasz, 2010).

The only significant interaction was between set size and tone change magnitude, F(4, 104) = 2.86, η_p² = .10, p < .05. Post hoc Newman–Keuls tests revealed that the effect of tone change magnitude was significant (p < .05) at set sizes 2, 3, and 4, but not at set sizes 5 and 6 (see Fig. 3, left panel). This effect is unlikely to have been due solely to stimulus discriminability, because, if discriminability were the cause, performance would have been better for large than for small tone changes at all set sizes, rather than only at set sizes of four or fewer. Instead, it is possible that participants remembered two types of information in this task. The first type is categorical information. For example, participants could sort a tone into a certain category, such as “high tone,” “medium tone,” or “low tone,” and keep the categorical information in WM. The second type is more detailed acoustic information. When target and test tones were different enough, participants would be able to use both categorical and detailed information during the test; when the difference was small, however, participants would be unable to use categorical information, because target and test tones would be sorted into the same category. The effect of categorical information diminished at set sizes 5 and 6, when performance in the large change trials fell to the same level as in the small change trials. One explanation is that, at large set sizes, participants were unable to use any of their capacity to encode the more detailed information. Moreover, a higher proportion of correct responses at large set sizes came from lucky guesses, which do not depend on the magnitude of the tone frequency change.

Items in WM

For each set size, we calculated the participants’ WM capacity using Cowan’s k formula as noted above. The results are shown in Fig. 2 (bottom left). The highest capacity estimate (mean ± 95 % within-participants confidence interval) was 2.01 ± 0.42 tones at set size 6. This estimate is lower than those obtained in simple visual or verbal WM tasks (Chen & Cowan, 2009; Luck & Vogel, 1997), but similar to the capacity limit found in previous studies on memory for tone sequences (Prosser, 1995).
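As an illustration of how such estimates are obtained, the following minimal sketch uses the standard single-probe form of Cowan's k, k = N × (hit rate − false alarm rate), where N is the set size. The formula the authors actually applied is the one given earlier in the article; the function name and the example rates below are hypothetical and are not data from this experiment.

```python
def cowan_k(set_size: int, hit_rate: float, false_alarm_rate: float) -> float:
    """Estimate items in WM with the single-probe form of Cowan's k.

    hit_rate: proportion of "different" responses on change trials.
    false_alarm_rate: proportion of "different" responses on no-change trials.
    """
    for name, rate in (("hit_rate", hit_rate),
                       ("false_alarm_rate", false_alarm_rate)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {rate}")
    return set_size * (hit_rate - false_alarm_rate)


# Hypothetical rates, for illustration only (not values from the study).
example_k = cowan_k(set_size=6, hit_rate=0.70, false_alarm_rate=0.40)
print(round(example_k, 2))
```

With this form, perfect performance yields k equal to the set size, and responding at the same rate on change and no-change trials yields k = 0.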

To investigate the influence of the different stimulus presentation methods, we took the capacity estimates (k) at set sizes 2 and 4 under both presentation methods as the dependent variable and conducted a two-way repeated measures ANOVA with set size (2 and 4) and presentation method (PM1 and PM2) as within-participants factors. The results revealed significant main effects of set size, F(1, 26) = 11.83, η_p² = .31, p < .01, and presentation method, F(1, 26) = 5.11, η_p² = .16, p < .05. The interaction was not significant, F(1, 26) = 0.02, p > .05. The main effect of presentation method suggested that people performed slightly better when they memorized all the stimuli that they heard, instead of starting to remember from a specific stimulus. Nevertheless, the average k values for PM2 were still low, with 1.17 ± 0.13 at set size 2 and 1.56 ± 0.25 at set size 4, as compared with 0.99 ± 0.14 at set size 2 and 1.40 ± 0.26 at set size 4 for PM1. These results indicate that changing the presentation method does not substantially improve the k estimates.

It is possible that tones along a frequency continuum introduce sequential interference (e.g., Deutsch, 1970) that sometimes results in the inefficient use of WM capacity. In order to assess participants’ best performance, we examined the maximum k value for each individual, no matter which combination of set size and articulation condition produced that maximum. Trials in PM1 and PM2 were combined to calculate k values at set sizes 2 and 4. This produced a mean maximum of 3.11 items (SD = 0.94), within the range of mean capacities observed in studies with categorical stimuli.
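The per-individual maximum described above can be sketched as follows. The nested-dictionary layout, participant labels, and k values are hypothetical placeholders chosen purely for illustration; they are not data from the experiment.

```python
from statistics import mean, stdev


def max_k_per_participant(k_table):
    """Return each participant's highest k across all conditions.

    k_table maps participant -> {(set_size, articulation): k}.
    """
    return {pid: max(cells.values()) for pid, cells in k_table.items()}


# Hypothetical k estimates (illustration only, not values from the study).
k_table = {
    "p01": {(2, "tap"): 1.2, (4, "tap"): 2.4, (6, "whisper"): 1.8},
    "p02": {(2, "tap"): 0.8, (4, "whisper"): 1.6, (6, "tap"): 3.0},
    "p03": {(2, "whisper"): 1.0, (4, "tap"): 2.0, (6, "tap"): 1.2},
}

best = max_k_per_participant(k_table)
print(best)
print(round(mean(best.values()), 2), round(stdev(best.values()), 2))
```

The mean and standard deviation are then taken over the per-participant maxima, which is the kind of summary reported in the text.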

Stimulus discriminability is unlikely to have been the cause of the low WM capacity in this experiment. Adjacent tones differed in frequency by 31 %, which not only is a large difference, but also avoids any pair of tones with different pitches but equal chroma. Therefore, it is unlikely that the stimuli in this experiment were difficult to discriminate from each other.
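The chroma claim can be checked numerically. The sketch below assumes a constant ratio of exactly 1.31 between adjacent tones (the reported 31 % may be rounded). Two tones share a chroma when they lie a whole number of octaves apart, a property that does not depend on the base frequency. The names are illustrative.

```python
import math

RATIO = 1.31   # assumed exact adjacent-tone frequency ratio
N_TONES = 12


def octave_distance(steps: int) -> float:
    """Interval in octaves between two tones `steps` positions apart."""
    return steps * math.log2(RATIO)


# Distance of each possible interval from the nearest whole octave.
offsets = {
    steps: abs(octave_distance(steps) - round(octave_distance(steps)))
    for steps in range(1, N_TONES)
}
closest_steps = min(offsets, key=offsets.get)

# Report the nearest miss, with the offset converted to semitones.
print(closest_steps, round(offsets[closest_steps] * 12, 2))
```

Under this assumption, no pair among the 12 tones is a whole number of octaves apart; the nearest case is tones five steps apart, which fall a little over half a semitone short of two octaves.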

Experiment 2

The estimated auditory WM capacity in Experiment 1 was lower than the visual or verbal WM capacity measured in previous studies, even in the presence of a long retention interval (Cowan et al., 2011). In Experiment 2, we tried one method that might allow more categorical information about the tones to be extracted. An early study on short-term memory for tone lists found better accuracy when the context tones were presented together with the probe tone than when a single-tone probe was used (Dewar et al., 1977). The authors suggested that higher-order information, such as relational or pattern information, aided WM performance. Accordingly, in the second experiment, we re-presented the entire studied list as a test probe, with or without a change in one tone. This method should allow any contextual information encoded from the studied list to be of use in the test.