Introduction

The temporal parsing of auditory patterns is a form of temporal grouping. The twin problems of temporal grouping and meter are the main puzzles of auditory temporal organization (Drake 1998; Lerdahl and Jackendoff 1983).

The perception of meter involves the extraction of the pulse of a rhythmic sequence (Cooper and Meyer 1960)—the rate at which we tap our foot to the sound of music (Drake 1998). Once the listener hears the pulse of a sequence, its pattern of strongly and weakly accented beats causes a hierarchical structure to be perceived (Essens 1986). Meter is an emergent property of rhythmic organization, just as symmetry is an emergent property of visual organization (Handel 1998).

Grouping refers to the segmentation of a sequence of sounds into units on the basis of their duration, pitch, intensity, or timbre (Bregman 1990; Handel 1989). Whereas the perception of meter is a learned top-down process (Drake et al. 2000; Jones, 1976; Large & Jones, 1999), grouping is a bottom-up process (Handel 1998): sensitivity to rhythmic grouping is immediate (Hébert and Cuddy 2002) and is seen in infants as young as 3 months old (Demany et al. 1977).

Research on auditory grouping falls into three classes: (a) The perception of accents: how the perceived accent pattern of two- or three-note rhythms is affected by the loudness and duration of their notes (Povel and Okkerman 1981; Woodrow 1911); (b) Auditory scene analysis: how listeners separate parallel temporal patterns into their component streams (Bregman 1990); and (c) Parsing: how listeners determine the starting point of cyclical rhythmic patterns (Garner, 1974; Preusser et al. 1970; Royer & Garner, 1966; 1970).

In this study we have two goals. First, we undertake to quantify the principles that govern the parsing of ambiguous cyclical rhythm patterns. Second, we wish to compare these principles to principles of visual grouping, because our understanding of perceptual organization is by and large based on studies of visual stimuli (Kubovy et al. 1998; Kubovy & van den Berg, 2008; Peterson & Gibson, 1994; Peterson & Lampignano, 2003).

Using ambiguous dot lattices as a tool, Kubovy and his colleagues (Kubovy et al. 1998; Kubovy & Wagemans, 1995) found that visual grouping by proximity was lawful and proposed a probabilistic model to account for this regularity. Furthermore, Kubovy and van den Berg (2008) investigated how the strengths of two grouping principles combined to determine visual grouping. They found that the effects of grouping by proximity and grouping by similarity were additive. Thus, when two visual grouping principles are conjointly applied to a visual stimulus, “the whole is equal to the sum of its parts.”

In the following sections, we describe our stimuli (auditory necklaces) and present several models that could predict their grouping structure. We then describe two studies in which we confront these models with empirical data.

Fig. 1

Two common representations of the same an: 11100110. Time proceeds in a clockwise direction within each necklace. In each representation, we have made the note perceived as the starting point (i.e., the clasp) larger

Auditory necklaces

We call the auditory patterns in our studies auditory necklaces (a concept borrowed from combinatorics; Ruskey 2011) because they are best visualized as beads arranged on a circle. Figure 1 shows two common representations of a binary auditory necklace (an) of length 8, where a red bead stands for a note, and a grey bead stands for a rest. This an can also be represented as a single string of binary digits, where 1 stands for a note, and 0 stands for a rest; i.e., 11100110.

Following Garner’s terminology, in our ans a block is a sequence of consecutive identical events (be they notes or rests). When a block consists of notes, it is a run; when it consists of rests it is a gap. For example, Fig. 1 depicts a four-block an with two runs (111 and 11) and two gaps (00 and 0).
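To make the block vocabulary concrete, here is a minimal sketch (ours, not code from the article) that decomposes a binary an string into its cyclic blocks; the function name an_blocks and the rotation convention are our own.

```r
# Minimal sketch (not from the article): decompose a cyclic binary an
# into blocks. A block of 1s is a run; a block of 0s is a gap.
an_blocks <- function(an) {
  x <- as.integer(strsplit(an, "")[[1]])
  n <- length(x)
  # rotate so that position 1 begins a block (respects the cyclic structure)
  start <- which(x != c(x[n], x[-n]))[1]
  if (is.na(start)) start <- 1                  # uniform necklace: a single block
  x <- x[((seq_len(n) + start - 2) %% n) + 1]
  r <- rle(x)
  data.frame(value = r$values,                  # 1 = run (notes), 0 = gap (rests)
             length = r$lengths)
}

an_blocks("11100110")
#   value length
# 1     1      3    <- run 111
# 2     0      2    <- gap 00
# 3     1      2    <- run 11
# 4     0      1    <- gap 0
```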

The question raised by Garner and pursued here is the following: If an an is played cyclically so that it has no perceptible initial note, which note do listeners choose as the beginning of the pattern? In our parlance, can we predict which note will be perceived as the clasp of the an? In principle, listeners could perceive any note as the clasp, but as we explain later, the clasp is most likely to be the first note of a run.

The ans we use in this study are ambiguous. In Fig. 1, we illustrate the two ways the an 11100110 is typically heard: in each panel the clasp is indicated by a larger bead. For example, if one perceives this an as if it were the pattern 11100110 repeating itself, beginning with its first note, then we say that this first note is its clasp.

The an we have considered so far has both runs and gaps. When an an has gaps, we say that it is sparse (Fig. 2a). In this work we will also study dense ans. Consider two complementary ans: the runs of each fit into the gaps of the other, and their notes differ in some respect (pitch or loudness, for example). For example, N = 11100110 and M = 00022002 are complementary. When they are combined, we get a dense an (Fig. 2b): 11122112, which has no rests and hence no gaps.

Fig. 2

Two types of auditory necklace. Red and blue beads represent different notes, and grey discs represent rests
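As a quick illustration of the digit encoding used above (a sketch under our assumed 0/1/2 coding, not code from the article), two complementary sparse ans can be summed element by element to yield the dense an just described:

```r
# Sketch: combine two complementary sparse ans (0 = rest) into a dense an
combine_ans <- function(N, M) {
  n <- as.integer(strsplit(N, "")[[1]])
  m <- as.integer(strsplit(M, "")[[1]])
  stopifnot(length(n) == length(m), all((n > 0) != (m > 0)))  # runs must fill gaps
  paste(n + m, collapse = "")
}

combine_ans("11100110", "00022002")
# [1] "11122112"
```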

Approaches to the perceptual organization of auditory necklaces

We now compare four approaches to the prediction of the clasp of an an. We believe that only after we have a quantitative model of this form of grouping can we undertake a search for a mechanism. The Appendix provides details on the computation of the metrics involved in each approach.

Run and gap principles

In their seminal work, Garner and his colleagues (Garner, 1974; Preusser et al. 1970; Royer & Garner, 1966; 1970) proposed two principles for the segmentation of sparse ans: (a) the run principle, according to which the clasp is perceived as the first note of the longest run, and (b) the gap principle, according to which the clasp is perceived as the first note following the longest gap. For example, for the an 11100110, the run principle predicts that the clasp will be the first note of the run 111 (so the pattern is heard as 11100110), whereas the gap principle predicts that the clasp will be the note following the gap 00 (so the pattern is heard as 11011100).
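A minimal sketch of how the two principles can be operationalized for a sparse an (our reading of the principles; the function name and the 1-based position convention are ours):

```r
# Sketch: positions (1-based, into the original string) of the clasp
# predicted by the run principle and by the gap principle for a sparse an.
clasp_predictions <- function(an) {
  x <- as.integer(strsplit(an, "")[[1]])
  n <- length(x)
  run_len <- gap_len <- integer(n)
  for (i in seq_len(n)) {
    # length of the run starting at position i (0 unless i starts a run)
    j <- 0
    while (j < n && x[((i - 1 + j) %% n) + 1] == 1) j <- j + 1
    run_len[i] <- if (x[i] == 1 && x[((i - 2) %% n) + 1] == 0) j else 0
    # length of the gap that immediately precedes position i (0 unless x[i] is a note)
    k <- 0
    while (k < n && x[((i - 2 - k) %% n) + 1] == 0) k <- k + 1
    gap_len[i] <- if (x[i] == 1) k else 0
  }
  c(run_principle = which.max(run_len),   # first note of the longest run
    gap_principle = which.max(gap_len))   # first note after the longest gap
}

clasp_predictions("11100110")
# run_principle gap_principle
#             1             6     (position 6 starts the rotation 11011100)
```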

They also discussed the organization of dense ans. The organization of 11122112 depends on (a) the selection of 11100110 or 00022002 as the figure, while relegating the other to the background, and (b) the run and gap principles operating on each of the complementary sparse ans.

Garner and colleagues conjectured that if the two principles are in agreement, the clasp is stable and emerges readily, but if they disagree, the clasp is ambiguous and takes longer to emerge. To test this hypothesis, they asked participants to report the perceived organization of ans by pressing a key in synchrony with the pattern or notating the pattern. Although these procedures recorded the participants’ impressions faithfully, they were inefficient: each trial took too long. The amount of data collected was thus too small to allow quantitative modeling, though descriptive findings did support their predictions.

E measure

MacGregor (1985) made the first attempt to quantify the likely location of the clasp, using a transformation of the run and gap principles based on block sizes and their relative positions within the pattern. The size of a block (\(r_i\)) is the number of elements it contains (be it a run or a gap). The relative position, or “enclosure”, of a block (\(e_i\)) is the number of blocks from that block to the closest end block. He proposed the measure \(E = \sum_i r_i \cdot e_i\): the sum over blocks of the products of block size and enclosure.

MacGregor (1985) predicted that the organization with the lowest E value would be perceived most often because it is the least complex pattern organization. Indeed, using a variety of patterns from previous studies, he found an inverse relationship between the E-value and frequency of selection.
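A minimal sketch of the E computation for a given linear organization of an an (our implementation; the convention that end blocks have enclosure 0 is our assumption about the enclosure definition):

```r
# Sketch: MacGregor's E for one linear organization (i.e., one choice of clasp).
# Enclosure convention assumed here: end blocks get 0, their neighbours 1, etc.
E_measure <- function(an) {
  sizes <- rle(as.integer(strsplit(an, "")[[1]]))$lengths    # block sizes, in order
  k <- length(sizes)
  enclosure <- pmin(seq_len(k) - 1, k - seq_len(k))          # distance to nearest end block
  sum(sizes * enclosure)
}

E_measure("11100110")   # blocks 3, 2, 2, 1 -> 3*0 + 2*1 + 2*1 + 1*0 = 4
```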

Local surprise

Boker and Kubovy (1998) developed a measure called local surprise based on information theory, and they used it to predict the segmentation of sparse ans. The local surprise value is a measure of the predictability of an event at the current position within a given pattern. For example, in 1110000, the first note is less predictable than the second because the event that precedes the first note is a rest, whereas the second note is preceded by another note. The third note is more predictable than the second because it is preceded by two notes, whereas the second is preceded by only one. Boker and Kubovy conjectured that the less predictable a note, the more likely it is to be perceived as the clasp (which is in line with research on subjective accents, e.g., Cooper and Meyer 1960).

In their experiments, they asked participants to strike a key on a synthesizer keyboard at the moment they heard the clasp. This allowed them to collect voluminous data. They modeled their data using local surprise as well as the gap and run principles. The local surprise measure provided a better model fit than the run and the gap principles.

Although a productive means of data collection, this method has two drawbacks: (a) Participants had trouble synchronizing their responses with the tones; their responses often preceded or followed the note. To analyze the data, Boker and Kubovy had to decide which note a response aimed for, introducing noise into the data and complicating the analysis. (b) The task confounds the contributions of motor control and perception.

Predictive power

van der Vaart (2009) improved the original Boker and Kubovy algorithm by adding a new measure, predictive power, to predict the segmentation of ans. Whereas local surprise considers only backward information (the information before a point in an an), predictive power considers forward information (the information after a point). For example, in 1110000, the second note has larger predictive power than the third because the event that follows the second note is also a note, whereas the third note is followed by a rest. The first note has even larger predictive power than the second because it is followed by two notes, whereas the second is followed by only one.

Preliminary results from a study using methods similar to Boker and Kubovy (1998) suggest that integrating local surprise ratings and predictive power ratings may result in a better-fitting model than local surprise alone.

Our studies

We devised a new method that (a) allows participants to report the clasp easily and quickly, thus allowing us to obtain enough data to build quantitative models, and (b) does not require participants to synchronize their responses with the clasp, so that the data reflect perception alone.

At the beginning of each trial, a circular array of n icons (where n is the length of the an) appeared on the screen (Fig. 3). The computer randomly assigned icons to positions around the circle, and randomly associated the top icon with one of the events (a note or a rest) of the an. While the an was played (over headphones), a square highlighted the corresponding icon and moved clockwise as each note or rest played. The participants were instructed to click, at any time, on the icon corresponding to the note they perceived as the clasp.

Fig. 3

Screen shot of the display for a ten-event an. At the moment depicted, the cross is the highlighted note/rest

We created 49 ambiguous ans and asked two questions: (a) Which of the approaches described above best accounts for the temporal organization in these patterns? and (b) When two temporal grouping principles are conjointly applied to a stimulus, is their conjoint effect equal to the sum of their separate effects? In Study 1, we explored the segmentation of four-block sparse ans and in Study 2 we explored the segmentation of four-block dense ans.

Study 1: Sparse Auditory Necklaces

Method

Participants

Ten students from the University of Virginia volunteered. All reported normal or corrected-to-normal vision and normal hearing. We excluded one of them because of a misunderstanding of the instructions.

Stimuli

In this experiment, we used four-block sparse ans. Each an contained two runs (A and B) and two gaps (A and B). Gap A precedes run A and gap B precedes run B. We generated 49 sparse ans by crossing seven run ratios (the ratio between the lengths of run A and run B: {1:3, 1:2, 2:3, 1:1, 3:2, 2:1, 3:1}) with seven gap ratios (the length ratio between gap A and gap B: {1:3, 1:2, 2:3, 1:1, 3:2, 2:1, 3:1}).

Table 1 lists these ans. Only 25 of the 49 ans are unique (e.g., 10001110 and 11101000 are rotations of the same pattern), but we treated them as different ans because we assigned run A, run B, gap A and gap B differently for each pattern (e.g., 1 is run A and 111 is run B in the first example, whereas 111 is run A and 1 is run B in the second). This allowed us to fully cross the seven levels of Run ratio and the seven levels of Gap ratio.

Table 1 Stimuli in experiment 1

The notes were 440-Hz pure tones lasting 50 ms (including a 5-ms linear fade-in and a 5-ms linear fade-out). The stimulus-onset asynchrony (SOA) between successive events was 200 ms. To eliminate the bias toward selecting the first note heard as the clasp, each pattern was played very fast at first (SOA = 60 ms) and decelerated to a steady SOA of 200 ms after 20 events (Fig. 4). The first played event was randomly selected on each trial.

Fig. 4

Decelerating into 11011100
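A minimal sketch of such a decelerating onset schedule (the linear interpolation over the first 20 events is our assumption; the article specifies only the initial and steady SOAs):

```r
# Sketch: event onset times (in seconds) for an an that starts fast and
# decelerates to a steady SOA. Linear interpolation over the ramp is assumed.
soa_schedule <- function(n_events, soa_start = 0.060, soa_steady = 0.200, n_ramp = 20) {
  soa <- c(seq(soa_start, soa_steady, length.out = n_ramp),
           rep(soa_steady, max(0, n_events - n_ramp)))
  soa <- soa[seq_len(n_events)]
  cumsum(c(0, soa[-n_events]))        # onset of event i = sum of the preceding SOAs
}

round(soa_schedule(24), 3)            # 24 events: 20 decelerating, then steady 200 ms
```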

For visual stimuli, we used ten icons with different shapes (Fig. 3). On each trial, the program randomly selected n icons (corresponding to the length of the auditory necklace) and randomly arranged them on the circumference of a circle. The size of each icon was 55 × 55 pixels and the radius of the circle was 150 pixels.

Design

Each participant completed 25 blocks of trials. Each block contained the 49 ans in random order. It took about 5 hours to complete the experiment. The participants were allowed to divide the experiment into as many sessions as they wished. They were required to complete a block before pausing the experiment: after each block they could choose to continue to the next block, or to quit and later pick up where they left off.

Procedure

At the beginning of each trial, the screen showed the circular array of icons (Fig. 3). Through headphones, the participants heard the ans with the first 20 events decelerating. While the pattern was playing, a square highlighted the icon corresponding to the currently playing note/rest in a clockwise direction. The participants were asked to click on the icon corresponding to the tone they heard as the beginning of the pattern. If they clicked on an icon corresponding to a rest (as opposed to a note), the program asked them to choose again because we assumed that a clasp cannot coincide with a rest. They could click the REST button in the center of the display anytime to take a break.

Results and discussion

Responses to the first note of a run

The median proportion of error responses (i.e., choosing icons corresponding to rests) for the nine participants was 1.1 % (ranging from 0.1 % to 7.9 %). Among the remaining responses, the median proportion of responses to the first note of a run was 99.8 % (ranging from 90.7 % to 100 %).

Figure 5 shows the frequency of responses to the first note of a run (labeled A and B) compared with other responses for one auditory necklace pattern: 1110001100. The errors and the responses that were not to the first note of a run fell on the note immediately before or after the first note of a run, and may be due to momentary lapses of attention.

Fig. 5

Frequency of responses to 1110001100 (A11000B100)

Interpersonal concordance

To test the extent to which our participants’ choices of clasp were in agreement with one another, we used the R (R Development Core Team 2013) package irr (Gamer et al. 2010) to compute the Kendall coefficient of concordance W (where \(0 \le W \le 1\)), corrected for ties, across participants and within stimuli. We found high agreement among our participants’ clasp selections: \(W_t = 0.78\) (\(p \approx 0\)).
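A sketch of this computation with the irr package (the shape and content of the ratings matrix are our assumption, and the data below are simulated purely for illustration; the article does not give these details):

```r
library(irr)

set.seed(1)
# Illustrative fake data: one row per an (49), one column per participant (9);
# each cell could hold, e.g., that participant's proportion of trials on which
# run A was chosen as the clasp for that an.
ratings <- matrix(runif(49 * 9), nrow = 49, ncol = 9)

kendall(ratings, correct = TRUE)   # Kendall's W across raters, corrected for ties
```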

Statistical model selection

We excluded all trials in which participants did not choose the first note of a run. We could thus classify participants’ responses using a binomial response variable (clasp selection was either at the start of run A or run B). This allowed us to model our data using mixed-effects logistic regression. All of our generalized linear mixed-effects models (glmms) were computed using the package lme4 (Bates et al. 2014).

glmms, which use maximum-likelihood estimation, have many advantages over traditional repeated-measures analysis of variance, which use ordinary least-squares. In addition to providing estimates of fixed effects, they allow us to predict subject-by-subject variations in model parameters (called random effects). Furthermore, glmms do not rest on many of the assumptions required by traditional analyses, such as quasi-F tests, by-subjects analyses, combined by-subjects and by-items analyses, and random regression (Baayen et al. 2008).

For each of our five glmms (run/gap additive, run/gap non-additive, local surprise, predictive power, MacGregor E), we treated the predictors derived from the approaches described above (see the Appendix for formulas) as fixed effects, and the subject-by-subject variation of the intercept and of the slopes of the predictors as random effects. To compare the five candidate models, we used a method of model comparison based on Akaike's Information Criterion (AIC), which offers a principled balance between goodness-of-fit and parsimony (see Burnham et al., 2011, and Bozdogan, 1987, for introductory presentations). Because the probability of overfitting can be substantial when using AIC (Claeskens and Hjort 2008), we used AICc, which penalizes extra parameters more heavily than does AIC, as recommended by Anderson and Burnham (2002).

Whereas AICc is an appropriate method for model comparison and selection, it tells us nothing about a model's absolute fit. To give us an idea of this fit, we computed two types of \(R^2\) for glmms using the MuMIn package (Bartoń 2014). The first, the marginal \(R^2\) (\(R^{2}_{\text{marg.}}\)), estimates the proportion of variance accounted for by the fixed effects only, whereas the second, the conditional \(R^2\) (\(R^{2}_{\text{cond.}}\)), estimates the proportion of variance accounted for by the fixed and random effects taken together (Johnson 2014; Nakagawa and Schielzeth 2013).
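The following sketch shows the shape of such an analysis in R. The data are simulated and all variable names (d, runA, run_ratio, gap_ratio, subject) are our own; only the model structure (binomial response, run and gap ratio predictors as fixed effects, by-subject random intercepts and slopes, AICc comparison, marginal and conditional \(R^2\)) follows the text.

```r
library(lme4)     # glmer
library(MuMIn)    # AICc, r.squaredGLMM

# Simulated stand-in for the real data: one row per retained trial,
# runA = 1 if the clasp was placed at the start of run A, 0 if at run B.
set.seed(1)
ratios <- log(c(1/3, 1/2, 2/3, 1, 3/2, 2, 3))
d <- expand.grid(subject = factor(1:9), run_ratio = ratios,
                 gap_ratio = ratios, rep = 1:25)
subj_int <- rnorm(9, 0, 0.5)                       # fake by-subject variation
d$runA <- rbinom(nrow(d), 1,
                 plogis(subj_int[as.integer(d$subject)] +
                        0.8 * d$run_ratio + 1.6 * d$gap_ratio))

# Additive and non-additive run/gap models with by-subject random effects
m_add <- glmer(runA ~ run_ratio + gap_ratio +
                 (1 + run_ratio + gap_ratio | subject),
               family = binomial, data = d)
m_int <- update(m_add, . ~ . + run_ratio:gap_ratio)

AICc(m_add, m_int)        # small-sample corrected AIC for each model
r.squaredGLMM(m_add)      # marginal and conditional R^2
```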

Best-fitting model

Table 2 compares the five models. The best two models are the two versions of the run and gap approach, in which the two predictors are the ratio of the gap lengths and the ratio of the run lengths. The first of these does not include an interaction between the predictors; the second does. The marginal \(R^2\) (which is identical for the top two models) shows a good deal of variance explained by the fixed effects (\(R^{2}_{\text{marg.}} = 0.465\)). When adding in the variance of the random effects, an additional ≈20 % of variance is explained (\(R^{2}_{\text{cond.}} = 0.678\)).

Table 2 Study 1: Model comparison of the five models

The first two models are competitive: a ΔAICc of 1.78 implies an evidence ratio (or Bayes factor; see Anderson 2008, Section 4.4) of 2.44, which Jeffreys (1961, p. 432) considers “barely worth mentioning.” There is, however, no question that the evidence in favor of the first model is overwhelmingly stronger than the evidence in favor of the third, fourth, and fifth models. An AICc difference of 103 implies an evidence ratio on the order of \(10^{22}\), which is far beyond what Jeffreys considers “decisive.”
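These evidence ratios follow from the standard relation between an AICc difference and an evidence ratio, \(\mathrm{ER} = \exp(\Delta\mathrm{AICc}/2)\); a one-line check:

```r
exp(c(1.78, 103) / 2)     # evidence ratios for ΔAICc = 1.78 and ΔAICc = 103: ≈ 2.44 and ≈ 2.3e22
```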

When multiple models are competitive, we are faced with model uncertainty. The consensus in the statistical literature is that the best way to deal with such a situation is to construct a compromise model by a process called model averaging (Anderson, 2008; Claeskens and Hjort, 2008; Ginestet, 2009; Grueber et al. 2011; Richards et al. 2011; Symonds & Moussalli, 2011).
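Continuing the simulated-data sketch above, model averaging of the two competitive models can be done with MuMIn (a sketch; m_add and m_int are the models fitted earlier):

```r
library(MuMIn)

# AICc-weighted average of the two competitive run/gap models
avg <- model.avg(m_add, m_int)
summary(avg)      # averaged coefficient estimates and standard errors
confint(avg)      # 95 % confidence intervals for the averaged coefficients
```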

Table 3 shows the coefficients of the averaged model, their standard errors, and a 95 % CI (see Footnote 1). Because the interaction coefficient is almost zero (i.e., −0.03) and its confidence interval straddles zero (95 % CI: −0.15, 0.09), we are inclined to favor additive effects of the run and gap principles. Furthermore, because the two predictors are on the same scale, the larger coefficient of the gap principle indicates that it is more important than the run principle (i.e., participants relied on the gap principle more than on the run principle in making their clasp selections).

Table 3 Study 1: Coefficient estimates, standard errors, and the lower and upper limits of the 95 % CI of the coefficients (in log-odds) of the model resulting from the averaging of the two competitive models in Table 2

Figure 6 shows the predictions of the averaged model. In this figure, in addition to the data points and their confidence intervals, we plot the predictions of an additive model. Two features of this plot require some clarification.

First, the proportions are plotted on an unevenly spaced y-axis. This is because these are binomial data fit using logistic regression. There are two ways to plot the predictions of a logistic regression. One, which we did not use here, is to plot the predicted proportions on a linear y-axis, which produces seven sigmoid (i.e., S-shaped) functions, one for each level of gap ratio. We chose to plot the proportions on a log-odds scale (resulting in unevenly spaced proportions on the y-axis), which produces seven linear functions.

Second, we plot lines that represent the predictions of an additive model. We did this because it effectively shows that for the most part, the data deviate from the additive model only when the run ratio and gap ratio are 1/3 or 3/1.
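Continuing the simulated-data sketch from the model-selection section, the following illustrates both points: proportions are plotted on a log-odds (qlogis) scale, on which the additive model's fixed-effect predictions are straight, parallel lines, one per gap ratio. The aggregation and plotting choices are ours.

```r
# Observed cell proportions on a log-odds scale, with additive-model lines
p_obs <- aggregate(runA ~ run_ratio + gap_ratio, data = d, FUN = mean)
plot(qlogis(runA) ~ run_ratio, data = p_obs,
     xlab = "log run ratio", ylab = "log-odds of choosing run A as the clasp")
b <- fixef(m_add)                     # fixed effects of the additive model
for (g in unique(p_obs$gap_ratio))    # one straight line per gap ratio
  abline(a = b["(Intercept)"] + b["gap_ratio"] * g, b = b["run_ratio"])
```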

Figure 6 shows that as the run ratio and gap ratio increase, the growth of the probability of choosing run A as the clasp approximates a linear function.

Fig. 6

Study 1: Predictions of the model averaged using the additive and non-additive run and gap models. The individual data points accompanied by 95 % confidence intervals were produced assuming that run ratio and gap ratio are categorical predictors in a 7 × 7 crossed design. The fitted lines represent the averaged model, assuming that the run ratio and gap ratio are continuous predictors. Note that because we averaged a model with and without the interaction, in some cases the fitted lines do not precisely match the data points

Conclusions

First, the data show that participants organized the notes in each run as a perceptual unit and almost always perceived the first note of a run as the clasp, which replicates Preusser et al. (1970) and Royer and Garner (1966; 1970).

Second, although the MacGregor (1985), Boker and Kubovy (1998), and van der Vaart (2009) predictors are mathematically and conceptually more sophisticated, they did not fit the data nearly as well as the Garner models. There are several possible reasons for the poor showing of these sophisticated models:

(a) The E measure takes both run and gap principles into account, but it combines them by summing them with equal weight, which results in a loss of information. This is relevant because the gap principle was found to be more important than the run principle in the current study.

(b) The predictive power algorithm also takes both runs and gaps into consideration, which may be why it fares better than the local surprise model, which only considers gaps. Nonetheless, both are but transformations of run and gap lengths. The complexity of these transformations and their sophisticated rationale appear not to produce better fits than the simple measures of relative run and gap lengths.

(c) The local surprise model was designed to deal with data involving both perception and motor skills. The location of the clasp was only one of the three response variables Boker and Kubovy used. The other two, the temporal accuracy of the tap and its strength (which may reflect the participant’s confidence), do not involve perception.

(d) Boker and Kubovy (1998) sampled a number of eight-event ans without considering the number of blocks, whereas we used only ans with four blocks to make the stimuli bistable. Therefore, many of the ans they used were more complex than ours, and this complexity may favor their information-theoretic model. This remains an open question.

Third, the run length ratio and gap length ratio were found to additively predict the auditory organization. This is in line with what Kubovy and van den Berg (2008) found in the visual domain.

Study 2: Dense auditory necklaces

Method

Participants

We paid 14 students from the University of Virginia $8 an hour for their participation. They all reported normal or corrected-to-normal vision and normal hearing.

Stimuli

We created four-block dense ans by filling the rests of the sparse ans used in Study 1 with notes of another pitch. For example, the an 11100110 in Study 1 became 11122112 in Study 2. Therefore, in the current experiment, each an contains two A runs and two B runs, which we called run A1, run A2, run B1, and run B2. We treat the run1 length ratio (run A1/run B1) as we treated runs in Study 1 and the run2 length ratio (run A2/run B2) as we treated gaps in Study 1. The run1 length ratio and run2 length ratio again each have seven levels (1:3, 1:2, 2:3, 1:1, 3:2, 2:1, 3:1). Again, only 25 of the 49 ans were unique (e.g., 12221112 and 11121222 are rotations of the same pattern), but we treated them as different ans because they have different assignments of run A1, run A2, run B1, and run B2.
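In code, this construction amounts to replacing every rest with a high note (a trivial sketch under the digit encoding used above):

```r
# Replace rests (0) with highs (2) to turn a sparse an into its dense version
gsub("0", "2", "11100110")
# [1] "11122112"
```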

The notes corresponding to 1s were 440-Hz piano MIDI tones (which we called lows) and the notes corresponding to 2s were 880-Hz piano MIDI tones (which we called highs). The visual stimuli and all other aspects of the auditory stimuli were the same as Study 1.

Design and procedure

The design and procedure were identical to Study 1 except that the dense ans contain no rests and therefore participants could choose any note as the potential clasp (i.e., there were no incorrect responses due to the selection of a rest).

Results and discussion

Responses to the first note of a run

Three participants chose the first note of a run on fewer than 70 % of the trials. They were excluded from further analysis. The remaining 11 participants chose the first note of a run on more than 80% of the trials. The median proportion of those responses was 98.1 % (ranging from 83.8 to 99.8 %). As in Study 1, we disregarded trials on which participants did not choose the first note of a run.

Model comparison for responses to lows

All participants chose lows (1s) more than highs (2s) as the clasp, with eight of the 11 participants predominantly choosing low clasps (>80%).

Therefore, we fit the five glmms (used in Study 1) by including only responses to lows as the clasp. We did this by treating the run1 length ratio as we treated runs in Study 1 and the run2 length ratio as we treated gaps in Study 1.

Table 4 compares the five models in terms of AICc. The two run and gap models are again the best, and they are competitive with each other. The additive run and gap model is superior to the non-additive Garner model, with a ΔAICc of 2.00, which implies an evidence ratio of 2.72 (a small victory for the additive model). However, the AICc differences between the two run and gap models and the other three models show that the former are decisively better than the others.

Table 4 Study 2: Model comparison of the five models including responses to low clasps

Because the two run/gap ratio models were competitive, we again used model averaging. Table 5 shows the coefficient estimates of the averaged model, their standard errors, and a 95 % confidence interval. Both the model comparison results and the confidence interval of the interaction term (95 % CI: −0.16, 0.16) again lead us to favor additive effects of the run and gap principles. Furthermore, because the predictors are on the same scale, the gap principle (here, run2) was again more important than the run principle (here, run1).

Table 5 Study 2: Coefficient estimates, standard errors, and the lower and upper limits of the 95 % CI of the coefficients (in log-odds) of the model resulting from the averaging of the two competitive models in Table 4

Figure 7 shows the predictions of the averaged model. As the run (i.e., run1) ratio and gap (i.e., run2) ratio increase, the probability of choosing run A1 as the clasp increases more or less linearly. Additionally, the plotted lines represent the predictions of an additive model; these parallel lines show that, again, the data deviate from the additive model for the most part only when the run2 ratio is 1/3 or 3/1. Finally, the regression lines here are flatter than in Study 1. Because the gap principle was the stronger predictor of clasp selection in Study 1, the flatter slopes here may be due to the fact that gaps are replaced with run2s in dense ans.

Fig. 7

Study 2: Predictions of the model averaged using the additive and non-additive run and gap models. The individual data points accompanied by 95 % confidence intervals were produced assuming that the run1 ratio (i.e., run A1/run B1) and run2 ratio (i.e., run A2/run B2) are categorical predictors in a 7 × 7 crossed design. The fitted lines represent the averaged model, assuming that the run1 ratio and run2 ratio are continuous predictors

Statistical models for responses to highs

Because most participants chose low rather than high clasps, relatively few responses to highs remained, and none of the five models converged on a solution using the same fixed-effect and random-effect predictors as in the previous analyses. Therefore, it remains an open question whether the pattern of responses would be the same for high as for low clasps.

Conclusions

Perceiving the clasp in dense ans is a two-step process. First, listeners must solve a problem akin to figure-ground assignment: selecting which notes (lows versus highs) will be treated as the figure, while relegating the other set to the background. We found that the majority of participants chose lows (1s) rather than highs (2s) as the figure. Second, once the figure-ground assignment is complete, participants must choose the clasp on the basis of the run and gap principles: run1s are now treated as the runs of a sparse an and run2s as its gaps.

Using this interpretation, the results for dense ans (with low clasps) replicate the results for sparse ans in all respects: (a) Participants overwhelmingly perceived the first note of a run as the clasp (Preusser et al. 1970; Royer & Garner, 1966; 1970). (b) The two run and gap models fared much better than the MacGregor (1985), Boker and Kubovy (1998), or van der Vaart (2009) models. (c) Participants used the gap (i.e., run2) principle as a grouping cue more often than the run principle. (d) The ratio of run lengths and the ratio of gap lengths additively predicted the auditory organization (a result analogous to Kubovy and van den Berg 2008).

General discussion

We have established three facts:

1. The perceptual grouping of simple, cyclical auditory rhythm patterns can be predicted from two complementary (and not necessarily synergistic) principles, first proposed by Garner and his colleagues. According to the gap principle, a cyclically played pattern of sounds appears to start after the longest gap in the sounds. According to the run principle, such a pattern appears to start at the beginning of the longest run of sounds.

2. Of these two principles, the gap principle is much stronger than the run principle.

3. To a first approximation, when the two principles imply different parsings of the pattern, they additively affect the probabilities of the two parsings. Thus, Kubovy and van den Berg's (2008) finding that the conjoint effects of grouping by proximity and grouping by similarity are additive appears to generalize to audition. However, a larger replication is advisable to remove remaining doubts regarding the additivity of the run and gap principles.

The generality of our findings is limited in several regards. First, choice of rhythmic grouping may have been affected by our choice of tempo (SOA = 200 ms). Tempo is known to affect discrimination accuracy (Handel 1992) and the perception of constancy (Handel 1993) in rhythm patterns. Other work in our lab has also shown that performance of the same rhythm pattern shows systematic differences across tempi (Barton, Getz, & Kubovy, under review). It therefore remains an open question how aspects of our stimulus presentation (i.e., tempo, but also frequency and timbre) may change the results.

Second, although our models in Figs. 6 and 7 show an additive relationship whereby increasing the run ratio and gap ratio additively increases the likelihood of choosing run A as the clasp, we have not determined the limits (i.e., the ceiling and the floor) of the model. It remains an open question at what point the run ratio and gap ratio become so large that they make the selection of the clasp unambiguous. It also remains an open question whether more complex ans will be predicted best by the simple run and gap models, or whether they are better explained by more complex models (as Boker and Kubovy 1998 found).

Finally, we should keep in mind that additivity is not inevitable. Temporal grouping and spatial grouping are intertwined in a manner inconsistent with a linear mechanism that could produce additivity (Gepshtein and Kubovy 2000). Similarly, in a spatial grouping task, curvature, density, and proximity were non-additive (Strother and Kubovy 2012). Kubovy and Yu (2012) conjectured that additive conjoint effects are found when the conjoined grouping principles do not give rise to a new emergent property. If so, the additivity we observed suggests that our listeners did indeed separate temporal organization into its component aspects of grouping and metric structure. Because meter can be thought of as an emergent property of rhythmic grouping (Handel 1998), it is an open question whether imposing a stronger metric grid onto the rhythms used here would result in non-additivity of the grouping principles.