Perhaps the most prominent interference paradigm in psychology is the Stroop color-word task (Stroop, 1935), in which participants name the ink color of a color word while attempting to ignore its meaning. Typically, participants fail to fully ignore the word’s meaning, as demonstrated by prolonged response times (RTs) and higher error rates when word meaning and ink color mismatch (incongruent trial), relative to cases in which they match (congruent trial). To date, numerous Stroop variants have been developed (see MacLeod, 1991, for an overview), one being the Stroop matching task (Dyer, 1973; Treisman & Fearnley, 1969). In this task, a stimulus display consists of two stimuli: a Stroop stimulus and a probe. The Stroop stimulus, a colored color word, has two attributes: word meaning and ink color. The probe is either a color word printed in a neutral ink (e.g., black or white), or a colored nonword stimulus (e.g., a bar, a series of Xs). Thus, the probe has only one attribute: either a word meaning or an ink color. Participants are instructed to compare one of the Stroop stimulus’ attributes (the target attribute) with the probe while ignoring the other (the distracter attribute), and to respond “same” if the former two are equivalent, and “different” if they are not.

Treisman and Fearnley (1969) realized four different variants of the Stroop matching task that result from different choices of the target and probe attributes (color or word meaning; examples of all four variants are given in Table 1). They distinguished two types of matching: within-attribute matching versus between-attribute matching. In within-attribute matching tasks (i.e., the word-word and color-color matching tasks), two attributes of the same type (i.e., two words, two colors) had to be compared. To illustrate, in the word-word matching task, the probe consisted of a color word printed in black; its meaning had to be compared with the Stroop word’s meaning. In the color-color matching task, the probe was a series of colored Xs that had to be compared with the Stroop word’s ink color. In between-attribute matching tasks (i.e., the word-color and color-word matching tasks), two attributes of different types (i.e., ink color and word meaning) had to be compared. In the word-color matching task, the probe consisted of a color word printed in black; its meaning had to be compared with the Stroop word’s ink color. In the color-word matching task, the probe was a series of colored Xs that had to be compared with the Stroop word’s meaning.

Table 1 Overview of trial types of all four Stroop matching task (SMT) versions, the proposed notation, and relevant SMT studies

In their initial study, Treisman and Fearnley (1969) found negligible interference in both within-attribute matching tasks, but considerable (and comparable) interference in both versions of the between-attribute matching task. The absence of interference in within-attribute matching tasks was explained in terms of different cognitive analyzing modules for ink color and word meaning: In the color-color matching task, when comparing ink colors, the word analyzer can be switched off, such that the distracter attribute of the Stroop stimulus (i.e., word meaning) is not processed and therefore does not cause interference. Similarly, when comparing word meanings in the word-word matching task, the color analyzer can be switched off, implying that the distracter attribute of the Stroop stimulus (i.e., ink color) is not processed and therefore does not cause interference. In contrast, interference arises in between-attribute matching tasks because both types of attributes (i.e., words and colors) need to be processed in order to perform the task-relevant comparison; therefore, both analyzers have to be “switched on,” and, as a consequence, the distracter attribute of the Stroop stimulus (i.e., ink color in the color-word matching task; word meaning in the word-color matching task) can cause interference.

This study focuses on the latter between-attribute Stroop matching tasks and proposes an account of the basic pattern of performance in the different trial types of these tasks. Before introducing the study, we review previous conflicting findings and introduce a notation of trial types that will help in interpreting and integrating these findings.

Trial types in the Stroop matching task

Because Treisman and Fearnley (1969) measured performance for entire blocks, but not for single stimulus displays, they were not able to differentiate between trial types. In fact, target, distracter, and probe attributes can be combined in five different ways, resulting in five different trial types that are best described by the pattern of pairwise relations between attributes. Three relations between pairs of attributes can be distinguished: (1) the task-relevant relation between probe and target attribute, (2) the task-irrelevant relation between target and distracter attribute, and (3) the task-irrelevant relation between probe and distracter attribute. Each of these pairs of attributes can be either same or different. Table 1 illustrates how these pairwise relations result in five different trial types; examples for each trial type are given for each version of the Stroop matching task.

In contrast to the classical Stroop task, it is perhaps not immediately obvious which types of Stroop matching task trials would be associated with interference. This is reflected in the literature as an ongoing debate about how trial types should be classified and analyzed (e.g., Dyer, 1973; Goldfarb & Henik, 2006; Luo, 1999). Table 1 lists previous Stroop matching task studies and gives an overview of the different approaches.

Dyer (1973) first recorded RTs for the five trial types (plus two neutral control trial types) of both between-attribute matching tasks. To differentiate between the trial types, he proposed a classification primarily based on the relation between target and distracter, albeit with inconclusive results. To illustrate, a trial was termed congruent-different when target attribute and probe did not correspond (i.e., called for a different response), but the target attribute corresponded to the distracter attribute (i.e., they were congruent). Because the relation between distracter and probe is the same as between target and probe, performance in trials of this type was predicted to be superior to trials in which this relation is different: The redundancy was expected to speed up the decision process. However, contrary to this prediction, the congruent-different type turned out to be associated with interference (this was already noted by Dyer, at least for the word-color matching task, and confirmed later by Goldfarb & Henik, 2006, for the color-word matching task). Nevertheless, several subsequent studies have used this empirically inadequate classification; results obtained in these studies should be interpreted with caution (see Table 1).

Luo (1999) subsequently proposed a classification of trial types that focuses instead on the relation between probe and distracter. Specifically, he distinguished between match and mismatch trials (separately for “same” trial types, i.e., those requiring a “same” response, and “different” trial types, i.e., those requiring a “different” response), depending on whether the distracter attribute matched or mismatched the probe (see Table 1). This classification, however, distinguished only four categories of trial types; as a result, one of the categories (i.e., the mismatch-different category; see Table 1) conflated two trial types that differ with regard to the target-distracter relation. Goldfarb and Henik (2006; see also Caldas, Machado-Pinheiro, Souza, Motta-Ribeiro, & David, 2012) showed empirically that performance in these two conflated cases differs substantially. As we will discuss, this finding is difficult to reconcile with Luo’s semantic competition account (but see the General Discussion for an explanation of this finding in terms of semantic competition).

Finally, Goldfarb and Henik’s (2006) response competition account of the Stroop matching task focused on the relation between the target and distracter attribute as critical in explaining performance across trial types. Importantly, in contrast to Dyer’s (1973) approach, it does so in an empirically adequate manner; for instance, it predicts interference for Dyer’s congruent-different trials. The response competition account is described in more detail below.

It is clear from this brief review that there exists no consensus on the classification of trial types, and equivalently, about the basic pattern of performance in Stroop matching tasks. This fact complicates or even precludes the comparison of findings obtained in different studies, and it results in failures to fully and adequately characterize basic patterns of performance across trial types. Still, two basic findings can be established: First, performance in “same” trial types is typically superior to performance in “different” trial types (e.g., Dyer, 1973; Goldfarb & Henik, 2006; Luo, 1999). As a consequence, these two groups of trial types are typically analyzed separately (e.g., Goldfarb & Henik, 2006). Second, with regard to “same” trials, the findings are clear: Previous studies have consistently shown better performance for the trial type in which all three attributes match, as compared to the trial type in which the distracter attribute mismatches target and probe (e.g., Dyer, 1973; Goldfarb & Henik, 2006; Luo, 1999). As can be seen by the variety of classifications in Table 1, the picture is much less clear for the “different” trial types.

To clarify this picture, and to help establish the basic pattern of performance in Stroop matching tasks, we suggest a notation of trial types that (1) considers all relevant stimulus features, (2) is descriptive in the sense that it avoids a priori assumptions about interference associated with a trial type, and (3) is independent of the specific choice of relevant and irrelevant attributes and thus applicable to all four Stroop matching task variants. It is illustrated in Table 1. Representing the three possible pairwise relations between attributes, a trial type is characterized by three binary features, each denoted by one of two symbols, S for “same” and D for “different.” The first S or D denotes the relation between the task-relevant attributes, and thus, the required response. Based on their first feature, trial types can be subsumed into “same” trial types and “different” trial types, depending on whether they require a “same” or a “different” response. The second symbol (S or D) denotes the task-irrelevant relation between target attribute and distracter attribute of the Stroop stimulus; this is always a relation between different attributes of the same object (i.e., ink color and meaning of the Stroop word). The third symbol (again, S or D) denotes the task-irrelevant relation between the distracter attribute of the Stroop stimulus and the probe; in the between-attribute matching tasks investigated here, this is always a relation within the same attribute dimension across two different objects. In within-attribute matching tasks, it is a relation between different attributes across objects. Thus, the nature of this third relation (within-attributes or between-attributes) always differs from the nature of the task-relevant relation.
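To make this notation concrete, the following minimal sketch (in Python; the helper function is hypothetical and not part of any of the studies discussed here) derives the three-letter code from the three attributes of a stimulus display:

```python
def trial_type(target: str, distracter: str, probe: str) -> str:
    """Derive the three-letter trial type code from a stimulus display.

    Each letter is S ("same") or D ("different") for, in order:
    (1) the task-relevant probe-target relation (i.e., the required response),
    (2) the irrelevant target-distracter relation within the Stroop stimulus,
    (3) the irrelevant distracter-probe relation across the two objects.
    """
    rel = lambda a, b: "S" if a == b else "D"
    return rel(probe, target) + rel(target, distracter) + rel(distracter, probe)

# Color-word matching task: target = word meaning, distracter = word color,
# probe = bar color. A "RED" word in blue ink below a green bar:
assert trial_type(target="red", distracter="blue", probe="green") == "DDD"
# A "RED" word in red ink below a green bar:
assert trial_type(target="red", distracter="red", probe="green") == "DSD"
```

Note that if any two of the three relations are “same,” the third must be “same” as well; this rules out the codes SSD, SDS, and DSS and explains why only five (rather than eight) trial types exist.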

With the help of this descriptive classification of trial types, we attempt to fully characterize the interference pattern of both versions of the between-attribute matching task and to identify the processes underlying these patterns. As a starting point, we focus on two studies that have investigated the color-word matching task (i.e., a between-attribute matching task using a colored bar as probe), and have accounted for their findings in terms of competing theoretical accounts (Goldfarb & Henik, 2006; Luo, 1999). We will then extend our research to the word-color matching task. Thereby, we will provide empirical evidence concerning the comparability of these two between-attribute matching tasks that have been interpreted as equivalent in the literature (e.g., David, Volchan, Vila, et al., 2011; Goldfarb & Henik, 2006; Luo, 1999; Norris, Zysset, Mildner, & Wiggins, 2002; Zysset, Müller, Lohmann, & von Cramon, 2001).

Current accounts of the Stroop matching task

The semantic competition account (Luo, 1999) assumes that for the color-word matching task, both ink colors—word color and bar color—activate their respective semantic unit within a semantic system. Further, it is assumed that information from the semantic unit is mapped or translated into its verbal representation in a verbal-lexical unit, in which it is then compared with word meaning. Impaired identification of the relevant ink color (i.e., bar color) is predicted when the two colors differ; for instance, a word-color/bar-color mismatch should impair color identification because two different semantic units are activated and might interfere with each other. Importantly, it is assumed that this interference occurs at a translational stage located before the response selection stage. Based on this assumption, Luo (1999) divided trial types into match stimuli (i.e., trials in which word color and bar color match; e.g., a red bar presented along with a word in red ink) and mismatch stimuli (i.e., trials in which word color and bar color mismatch; e.g., a red bar presented along with a word in blue ink). As is evident from Table 1, this considers only four different types of trials; Luo’s “different/mismatch” type was in fact a mixture of two different types (i.e., DDD and DSD). Luo (1999) reported two experiments to test this account. In a first experiment, as predicted, robust interference was obtained for “same” trial types (i.e., increased RTs for the SDD trial over the SSS trial), and smaller but significant interference for “different” trial types (i.e., increased RTs for the mixture of DDD and DSD trials over DDS trials). Furthermore, when the colored bar preceded the colored word by one second or more (Luo, 1999, Experiment 2), interference was completely eliminated and even reversed for “different” trial types: Responses for the DDS trial were now significantly slower than those for the mixture of DDD and DSD trials. Luo’s interpretation was that, by delaying the presentation of the Stroop word—and thereby, the comparison process assumed to occur in the postulated verbal-lexical unit—the relevant color of the bar could be encoded semantically before the task-irrelevant color of the Stroop word could interfere.

However, this interpretation must be treated with caution, as the above conflation of trial types yielded misleading results. Goldfarb and Henik (2006) presented evidence showing that Luo’s (1999) classification of stimuli is not adequate. They compared RTs for all five trial types described above (see also Table 1). Focusing here on the “different” trial types, results showed that RTs were significantly larger for the DSD type than for the DDD type, suggesting that pooling both trial types, as in Luo’s studies, is not warranted. Also, contrary to Luo’s prediction, RTs for DDS and DDD trials did not differ significantly. Neither of these findings can easily be reconciled with Luo’s semantic competition account (but see the General Discussion for an explanation of these findings in terms of semantic competition).

Goldfarb and Henik’s (2006) response competition account argues that interference in the Stroop matching task arises at the response selection stage and is a consequence of task conflict (MacLeod & MacDonald, 2000), resulting in irrelevant comparisons being made. As pointed out previously, in addition to the task-relevant comparison between probe and target, two possible task-irrelevant comparisons can be made: a comparison between target and distracter (i.e., word meaning and word color) and a comparison between distracter and probe (i.e., word color and bar color in the color-word matching task). The authors argued that the former comparison is made, but not the latter, based on the assumption that attention is primarily object based, not feature based (Kahneman & Henik, 1981; Wühr & Waszak, 2003; for a review, see Chen, 2012). In other words, Goldfarb and Henik (2006) assumed that when focusing on the meaning of the Stroop word, its ink color cannot be ignored. As a result, when target and distracter attribute of the Stroop stimulus correspond (which is the case for the DSD type), a tendency toward a “same” response is elicited, competing with the required “different” response. This accounts well for the increased RTs for DSD as compared to DDD trials. In contrast, the second possible irrelevant comparison between the two ink colors is assumed not to be made; this assumption was supported by similar RTs for the DDD and DDS trials. From the perspective of Goldfarb and Henik’s response competition account, therefore, the target–distracter relation is critical for interference, implying that the SDD and DSD types would be associated with interference. In contrast, the probe–distracter relation is assumed to be irrelevant, implying that the DDS type would not be associated with interference.

The present study

We propose that the target–distracter relation may not be sufficient to fully account for interference in the Stroop matching task. Treisman and Fearnley’s (1969) finding of fast and efficient within-attribute comparisons suggests that the lack of an RT effect does not rule out that a task-irrelevant probe–distracter comparison (e.g., between two ink colors in the color-word matching task) is made. Such an irrelevant comparison could, if it is indeed made, result in accuracy costs in trials in which the result of this comparison is incongruent with that of the relevant comparison (i.e., as in the DDS type). Presumably, a physical match (of either two ink colors or two words) would be detected quickly and efficiently, and it would elicit a “same” response. The resulting response tendency would lead to incorrect responses based on the wrong, within-attribute match, without producing RT costs. Initial support for this hypothesis can be found in the work of Goldfarb and Henik (2006), who observed descriptively increased error rates for DDS as compared to DDD trials (see also Caldas et al., 2012; Dyer, 1973). On the basis of these considerations, we expect that two types of task-irrelevant comparisons, differing in their relative speed of processing, are inadvertently made when performing the Stroop matching task: (1) the task-irrelevant between-attribute comparison is assumed to be a slow comparison process (e.g., a process in which two different modules are involved) and should therefore result in both RT and accuracy costs, whereas (2) the task-irrelevant within-attribute comparison is assumed to be fast and should therefore produce accuracy costs in the absence of RT costs.

This study investigates the possibility of two distinct sources of interference in the Stroop matching task, not only at the level of mean RT and accuracy but also by analyzing the RT and error distributions. The RT distribution will be analyzed by means of delta plots (De Jong, Liang, & Lauber, 1994; Schwarz & Miller, 2012; Speckman, Rouder, Morey, & Pratte, 2008), which depict the magnitude of the interference effect at different RT percentiles. Specifically, the delta plot maps (on the y-axis) the difference between two RT distributions (e.g., the difference of the RT distributions of incongruent and congruent trials) across percentiles, against (on the x-axis) the average of the two RT distributions across percentiles. Thus, the slope of the delta plot captures the RT differences between incongruent and congruent trials across response speed, with a positive slope indicating that the RT difference (i.e., the interference effect) is larger for slower responses, and a negative slope indicating that the RT difference is smaller for slower responses. Delta plots of classical Stroop data typically reveal positive slopes (Dittrich, Kellen, & Stahl, 2014; Pratte, Rouder, Morey, & Feng, 2010), reflecting the finding that Stroop interference is larger for slow responses than for fast responses. Our assumption concerning the relative speed of processing of the two irrelevant comparisons results in the prediction of a steeper delta plot slope for trials involving the slower task-irrelevant between-attribute comparison (i.e., DSD trials) in comparison to trials involving the faster task-irrelevant within-attribute comparison (i.e., DDS trials). Assuming that the within-attribute comparison is fast and efficient (Treisman & Fearnley, 1969), and that it would result in the direct activation of an incorrect response (Gratton, Coles, & Donchin, 1992), we expect to see a greater proportion of fast errors on DDS trials when compared to trials involving task-irrelevant between-attribute comparisons (i.e., DSD trials). This assumption will be tested with the help of conditional accuracy functions (CAFs; Ridderinkhof, 2002), in which accuracy is displayed as a function of RT. We tested these predictions in both between-attribute matching tasks in Experiments 1a and 2a; Experiments 1b and 2b replicated these results in a procedure with slightly modified trial type proportions.
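For illustration, the delta plot computation described above can be sketched as follows (Python with NumPy; the use of quantile midpoints and a least-squares slope is our illustrative choice, not necessarily the exact procedure of the cited studies):

```python
import numpy as np

def delta_plot(rt_effect: np.ndarray, rt_baseline: np.ndarray, n_bins: int = 5):
    """Delta plot of one RT distribution (e.g., incongruent trials) against
    another (e.g., congruent trials): y is the per-quantile RT difference,
    x is the per-quantile mean RT."""
    probs = (np.arange(n_bins) + 0.5) / n_bins    # quantile midpoints
    q_eff = np.quantile(rt_effect, probs)
    q_base = np.quantile(rt_baseline, probs)
    x = (q_eff + q_base) / 2                      # average RT per quantile
    y = q_eff - q_base                            # interference effect per quantile
    slope = np.polyfit(x, y, 1)[0]                # summary delta plot slope
    return x, y, slope
```

A positive slope then indicates that the interference effect grows with response time; per-participant slopes can be submitted to standard t tests, as in the analyses reported below.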

Experiment 1a

In Experiment 1a, we used the color-word Stroop matching task. We expected to replicate Goldfarb and Henik’s (2006) results at the level of mean RT: RTs of the “different” trial types were expected to be faster when word color and word meaning mismatch (DDD = DDS < DSD), indicating that a slow task-irrelevant between-attribute matching is made. For RT, it should make no difference whether bar color matches or mismatches word color (DDD = DDS), whereas we expected to find accuracy costs for DDS trials relative to DDD trials. The latter result would be in line with our assumption that a task-irrelevant within-attribute comparison is also made. Delta plot slopes and CAFs served to further characterize both task-irrelevant comparisons: We expected to find a more negative delta plot slope for trials that involve the task-irrelevant within-attribute comparison (i.e., the DDS trial type) than for trials involving the task-irrelevant between-attribute comparison (i.e., the DSD trial type). Moreover, we expected to find a greater proportion of fast errors for DDS than for DSD trials. Both predictions derive from the hypothesis that the task-irrelevant within-attribute comparison occurs fast and relatively effortlessly, resulting in specifically fast errors along with negligible RT interference.

Method

Participants

Participants were mostly University of Freiburg students with different majors. They were recruited in lectures of different subjects and via flyers distributed in the city of Freiburg. Thirty-two persons participated (18 women, 14 men; mean age was 22 years, ranging from 18 to 29 years). All participants were native speakers of German with normal or corrected-to-normal vision and participated for course credit or as paid volunteers. Two participants were extreme outliers (with mean error rates of 0.26 and 0.35, respectively, in the total sample’s distribution of error rates, M = 0.05, SD = 0.22). These participants were excluded from analyses.

Stimuli and apparatus

The German words rot [red], grün [green], gelb [yellow], and blau [blue] were presented in capital letters, in one of four colors (red, green, yellow, and blue). A colored bar was presented above the colored word in either one of these four colors. The length and height of the colored bar matched those of the longest colored word. Stimuli subtended a visual angle of approximately 1.5° × 1.0°.

Trial types were generated as described by Goldfarb and Henik (2006): Word color, bar color, and word meaning were factorially combined, resulting in 64 possible combinations; 16 were “same” stimuli and 48 were “different” stimuli. To elicit the same number of “same” and “different” responses, the “same” stimuli were presented three times. One block consisted of 96 trials (i.e., 3 × 16 = 48 “same” and 48 “different” trials). Participants responded by pressing the keys j for “same” and k for “different” on a standard computer keyboard. All instructions and colored words were presented in 20-point Lucida Sans Regular font. Stimuli were presented on a light gray background on a 48.3 cm CRT screen at an approximate viewing distance of 60 cm.
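The resulting trial composition can be expressed in a short sketch (Python; the function name, seed, and shuffling scheme are illustrative assumptions, as the original randomization procedure is not specified):

```python
from itertools import product
import random

COLORS = ["red", "green", "yellow", "blue"]

def build_block(seed: int = 0) -> list:
    """One 96-trial block of the color-word matching task under the
    factorial design described above (meaning x ink color x bar color)."""
    displays = list(product(COLORS, COLORS, COLORS))  # 64 (meaning, ink, bar) triples
    same = [d for d in displays if d[0] == d[2]]      # meaning == bar color: "same"
    different = [d for d in displays if d[0] != d[2]]
    assert len(same) == 16 and len(different) == 48
    trials = same * 3 + different                     # 48 "same" + 48 "different"
    random.Random(seed).shuffle(trials)
    return trials
```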

Procedure and design

Each trial started with a fixation cross presented for 500 ms. A 300-ms blank screen followed, and afterwards the stimulus was presented. Participants’ task was to decide whether bar color and word meaning were the same or different. The response window was 1,500 ms. In case of an erroneous key press or a missing response, participants saw the word “Fehler” [Error] printed in red for 500 ms; in case of no error, a blank screen was presented for 500 ms to keep the trial length identical. The intertrial interval was 500 ms. Participants first practiced the task during one block of 96 trials. Subsequently, participants performed two experimental blocks of 96 trials each. The experiment lasted approximately 20 minutes.
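The original experiment software is not described in detail; purely as an illustration of this trial timeline, a PsychoPy-style sketch might look as follows (window settings, stimulus layout, and the run_trial helper are assumptions, not the original implementation):

```python
from psychopy import core, event, visual

win = visual.Window(fullscr=True, color="lightgray", units="deg")
fixation = visual.TextStim(win, text="+", color="black")
error_msg = visual.TextStim(win, text="Fehler", color="red")
clock = core.Clock()

def run_trial(bar_color, word, word_color, correct_key):
    """One trial: 500-ms fixation, 300-ms blank, stimulus with a 1,500-ms
    response window, 500-ms feedback or blank, 500-ms intertrial interval."""
    bar = visual.Rect(win, width=3.0, height=0.8, pos=(0, 1), fillColor=bar_color)
    stroop = visual.TextStim(win, text=word.upper(), color=word_color, pos=(0, -1))

    fixation.draw()
    win.flip()
    core.wait(0.5)                 # fixation cross, 500 ms
    win.flip()
    core.wait(0.3)                 # blank screen, 300 ms
    bar.draw()
    stroop.draw()
    win.flip()                     # stimulus display
    clock.reset()
    keys = event.waitKeys(maxWait=1.5, keyList=["j", "k"], timeStamped=clock)

    if keys is None or keys[0][0] != correct_key:
        error_msg.draw()           # error feedback on wrong or missing response
    win.flip()
    core.wait(0.5)                 # feedback or blank, keeping trial length constant
    win.flip()
    core.wait(0.5)                 # intertrial interval
    return keys                    # [(key, rt)] or None
```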

Results

Errors (5.28%) were excluded from all analyses of RTs. Outliers in RTs (0.72%) were removed from each individual’s distribution of latencies according to Tukey’s criterion (i.e., latencies more than 3 interquartile ranges below the first or above the third quartile). Mean RT was 696 ms (SD = 106), mean error rate was 0.05 (SD = 0.22).
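The outlier criterion can be written compactly; a minimal sketch (Python/NumPy), applied to each participant’s RT distribution as described above:

```python
import numpy as np

def remove_rt_outliers(rts: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Tukey criterion: drop latencies more than k interquartile ranges
    below the first quartile or above the third quartile."""
    q1, q3 = np.percentile(rts, [25, 75])
    iqr = q3 - q1
    keep = (rts >= q1 - k * iqr) & (rts <= q3 + k * iqr)
    return rts[keep]
```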

Mean RTs and mean error rates

Figure 1 (upper row) presents RTs and error rates for the five trial types. RTs and error rates were analyzed separately for the “same” trial types and “different” trial types (i.e., ANOVAs with trial type as a within-subject factor were conducted). For the “same” trial types, a main effect of trial type was found both in RTs, F(1, 28) = 195.27, MSE = 545.77, p < .001, η²G = .145, and error rates, F(1, 28) = 28.51, MSE = 0.00077, p < .001, η²G = .162, indicating that RTs and error rates were larger for SDD trials than for SSS trials (see Fig. 1, upper row). The “different” trial types also differed in their RTs, F(1.76, 49.41) = 5.87, MSE = 889.23, p = .007, η²G = .009, and error rates, F(1.9, 53.2) = 5.95, MSE = 0.002, p = .005, η²G = .075. Evidence for a task-irrelevant between-attribute comparison was obtained when comparing DSD with DDD trials: RT was higher, and more errors were made, for DSD trials than for DDD trials, Md = 23.77, 95% CI [7.93, 39.61], t(28) = 3.07, p = .005, and Md = 0.04, 95% CI [0.01, 0.06], t(28) = 3.42, p = .002, respectively. RT was also higher for DSD than for DDS trials, Md = 19.14, 95% CI [2.29, 35.99], t(28) = 2.33, p = .027, while in the error rates, this difference was not significant, t(28) = 0.23, p = .817 (see Fig. 1, upper row). Importantly, evidence for a relatively fast and efficient task-irrelevant within-attribute comparison was obtained when comparing DDS with DDD trials: RT did not differ between DDD and DDS trials, Md = 4.63, 95% CI [−7.46, 16.73], t(28) = 0.78, p = .439; yet, error rates were higher for DDS trials compared to DDD trials, as can be seen in Fig. 1 (upper row), Md = 0.03, 95% CI [0.01, 0.06], t(28) = 2.92, p = .007. Note that this pattern is similar to the one observed by Goldfarb and Henik (2006), but no statistical tests of accuracy were reported in that study.

Fig. 1

Experiment 1a (color-word matching task). The upper row displays mean RTs (left panel) and mean error rates (right panel) for all trial types. Error bars show the 95% within-subject confidence interval (see Morey, 2008). The lower row displays delta plots for DSD and DDS interference (left panel) and conditional accuracy functions (right panel) for the DSD and DDS trials

Delta plot slopes

To compute delta plots, we mapped (on the y-axis) the difference between the DSD and DDD RT distributions across five RT bins, against (on the x-axis) the average of the two RT distributions across five RT bins; the respective delta plot slope captures interference from a task-irrelevant between-attribute comparison. The same procedure applies for the difference between the DDS and DDD RT distributions, capturing interference from a task-irrelevant within-attribute comparison. Figure 1 (left panel, lower row) depicts the respective delta plots. As expected, the delta plot slope was positive for the task-irrelevant between-attribute comparison, M = 0.04, 95% CI [−0.07, 0.16], and negative for the task-irrelevant within-attribute comparison, M = −0.08, 95% CI [−0.17, 0.01]. Moreover, delta plot slopes of DSD and DDS interference differed significantly, Md = 0.13, 95% CI [0.00, 0.25], t(28) = 2.11, p = .044.

Conditional accuracy function

To compute CAFs, we plotted (on the y-axis) accuracy against RT (x-axis) across five RT bins, separately for DSD and DDS trials; Fig. 1 (right panel, lower row) depicts CAFs of DSD and DDS trials. Note that DDS trials showed a relatively high proportion of fast errors (i.e., in the first RT quintile), whereas there were fewer errors in slower responses. Contrasting DDS and DSD trials, DDS accuracy was lower than DSD accuracy in the first quintile and higher in the second. We compared the slopes of the CAFs over the fastest two quintiles as an index of a predominance of fast errors. In line with the notion that the efficient task-irrelevant within-attribute comparison directly activates the incorrect response, DDS trials elicited significantly more fast errors than DSD trials, t(28) = 2.73, p = .011, d = 0.64.
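A CAF and the fast-error index used here can be sketched as follows (Python/NumPy; equal-sized RT bins and a simple difference-quotient slope are our assumptions about implementation details not fully specified above):

```python
import numpy as np

def caf(rts: np.ndarray, correct: np.ndarray, n_bins: int = 5):
    """Conditional accuracy function: mean accuracy per RT bin.

    rts and correct (coded 0/1) must include error trials. Returns per-bin
    mean RTs and accuracies plus the slope over the two fastest bins,
    which indexes the prevalence of fast errors."""
    order = np.argsort(rts)
    bins = np.array_split(order, n_bins)               # equal-sized RT bins
    bin_rt = np.array([rts[b].mean() for b in bins])
    bin_acc = np.array([correct[b].mean() for b in bins])
    fast_slope = (bin_acc[1] - bin_acc[0]) / (bin_rt[1] - bin_rt[0])
    return bin_rt, bin_acc, fast_slope
```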

Discussion

The mean RT results of Experiment 1a suggest that the DSD trial type causes the greatest amount of interference, whereas RTs for the DDD and DDS trials were comparable. The DDS trial type also caused interference, as shown in the mean error rates (i.e., reduced accuracy for DDS as compared to DDD trials). This suggests that not only a task-irrelevant between-attribute comparison is made, but perhaps also a task-irrelevant within-attribute comparison between the two ink colors. The assumption that both task-irrelevant comparisons are made when performing the Stroop matching task was also supported by different delta plot slopes and CAF properties. First, a positive delta plot slope (as typically observed for Stroop interference) was found for DSD trials, whereas a negative slope was found for DDS trials. Second, in contrast to DSD trials, the accuracy distribution of DDS trials was characterized by a high proportion of especially fast errors.

Experiment 1b

In Experiment 1a, we replicated the trial type proportion used by Goldfarb and Henik (2006): We factorially combined word color, bar color, and word meaning. However, this results in an unequal proportion of trial types (see the Appendix). Importantly, the three “different” trial types occur with unequal frequency, with DDD being presented twice as often as DSD and DDS. As it is debated which of these three trial types should be classified as “incongruent” or “congruent,” and given the well-established finding that the proportion of congruent and incongruent stimuli affects the amount of interference in a task (e.g., Kane & Engle, 2003; Lindsay & Jacoby, 1994; Logan & Zbrodoff, 1979), the unequal distribution of trial types might have affected the result pattern. In the extreme, evidence for the second task-irrelevant within-attribute comparison might be an artifact of the relatively small proportion of DDS trials and might not be found with equal proportions of trial types. To rule out this possibility, we replicated Experiment 1a with a constant trial type proportion within the “same” and “different” trial types (i.e., the proportions were one fourth each for the SSS and SDD trials and one sixth each for the DSD, DDD, and DDS trials).

Method

The method was identical to Experiment 1a except for the differences indicated below. Participants were N = 40 University of Freiburg students with different majors (27 women, 13 men; mean age was 22 years, ranging from 18 to 46 years). All participants were native speakers of German with normal or corrected-to-normal vision and participated for course credit or as paid volunteers. One participant was an extreme outlier (with a mean error rate of 0.19 in the sample’s distribution of error rates, M = 0.05, SD = 0.22) and was excluded from analyses.

Stimuli and apparatus were identical to Experiment 1a, except that the trial type proportions changed: Proportions were one fourth each for the SSS and SDD trials and one sixth each for the DSD, DDD, and DDS trials. Participants practiced the task within one block of 72 trials and then performed three experimental blocks of 72 trials each. If participants responded more slowly than 1,200 ms, a message (“Zu langsam” [too slow]) printed in red was shown for 500 ms; the response window was 2,000 ms. The experiment lasted approximately 20 minutes.

Results

For RT analyses, data were preprocessed as in Experiment 1a (i.e., excluding error trials, 4.97%, and outliers, 0.57%). Mean RT was 687 ms (SD = 103); mean error rate was 0.05 (SD = 0.22).

Mean RTs and mean error rates

The result pattern for mean RT and mean error rate of Experiment 1a was replicated in Experiment 1b (see Fig. 2, upper row): First, for the “same” trial types, a main effect of trial type emerged both for RTs, F(1, 38) = 231.44, MSE = 958.01, p < .001, η²G = .242, and error rates, F(1, 38) = 31.15, MSE = 0.0016, p < .001, η²G = .258; it reflects the finding that RTs and error rates were larger for SDD than for SSS trials. Second, the “different” trial types also differed in their RTs, F(1.82, 69.05) = 11.80, MSE = 1,133.76, p < .001, η²G = .014, and error rates, F(1.95, 74.07) = 12.28, MSE = 0.001, p < .001, η²G = .133. Specifically, as in Experiment 1a, RT was greater, and more errors were made, for DSD trials than for DDD trials, RT: Md = 31.77, 95% CI [16.00, 47.55], t(38) = 4.08, p < .001; error rate: Md = 0.03, 95% CI [0.02, 0.05], t(38) = 5.00, p < .001. RT was again greater for DSD trials than for DDS trials, Md = 29.23, 95% CI [13.34, 45.13], t(38) = 3.72, p = .001, whereas both trial types did not differ in their error rates, Md = 0.01, 95% CI [−0.01, 0.02], t(38) = 0.75, p = .457. Importantly, the pattern was reversed for DDD and DDS trials: RT did not differ between DDD and DDS trials, Md = 2.54, 95% CI [−9.62, 14.70], t(38) = 0.42, p = .675, but error rates were significantly higher for DDS trials compared to DDD trials, Md = 0.03, 95% CI [0.01, 0.04], t(38) = 3.84, p < .001.

Fig. 2

Experiment 1b (color-word matching task). The upper row displays mean RTs (left panel) and mean error rates (right panel) for all trial types. Error bars show the 95% within-subject confidence interval (see Morey, 2008). The lower row displays delta plots for DSD and DDS interference (left panel) and conditional accuracy functions (right panel) for the DSD and DDS trials

Delta plots and conditional accuracy functions

Delta plot slopes and CAFs are displayed in Fig. 2 (lower row). Replicating the results of Experiment 1a, delta plot slopes were positive for the task-irrelevant between-attribute comparison (M = 0.11, 95% CI [0.01, 0.20]) and negative for the task-irrelevant within-attribute comparison (M = −0.08, 95% CI [−0.17, 0.02]). This difference in slopes was again statistically significant, Md = 0.18, 95% CI [0.07, 0.30], t(38) = 3.34, p = .002. Regarding the CAFs, the DSD–DDS crossover in slopes was again obtained descriptively, d = 0.29, but this time it was not statistically significant, t(38) = 1.29, p = .203.

Discussion

Results of Experiment 1a were largely replicated. Thus, the specific trial type proportion realized in Experiment 1a (following the trial type proportion used by Goldfarb & Henik, 2006) was not responsible for the pattern of significant and nonsignificant comparisons across trial types. In particular, the finding of DDS interference was not an artifact of this methodological choice.

Experiment 2a and 2b

In the Stroop matching task literature, both the color-word and the word-color versions of the between-attribute matching task have been used repeatedly (David, Volchan, Vila, et al., 2011; Goldfarb & Henik, 2006; Luo, 1999; Zysset et al., 2001), and results gathered from both task versions have been directly compared, assuming that the processes underlying both matching tasks are identical. However, an empirical investigation of the comparability of these processes has so far been missing. In Experiment 2, our prediction of two sources of interference was tested in the word-color matching task, allowing us to draw an empirical conclusion about the comparability of the processes underlying both between-attribute matching tasks. A replication of the result pattern of Experiment 1 would not only support our notion of two sources of interference in the Stroop matching task but would also demonstrate that the task-irrelevant comparisons are made irrespective of their specific properties (word or color). Specifically, it would demonstrate that a task-irrelevant within-attribute comparison is made not only for ink colors but also for words. Finally, if the result pattern replicates the findings of Experiment 1, this would also lend credence to the claim that the processes underlying both between-attribute Stroop matching tasks are comparable, justifying comparisons among studies using different versions of the task.

In Experiment 2a, we used the same trial type proportion as in Experiment 1a (Stroop word color, Stroop word meaning, and probe word were factorially combined). In Experiment 2b, we realized the word-color matching task using the same constant trial type proportions within the “same” and “different” trial types as in Experiment 1b.

Method

The method was identical to Experiment 1a, except for the differences indicated below.

Participants

Participants in Experiment 2a were N = 32 University of Cologne students with different majors (26 women, six men; mean age was 24 years, ranging from 19 to 39 years). Participants in Experiment 2b were N = 24 University of Freiburg students with different majors (19 women, five men; mean age was 24 years, ranging from 20 to 42 years). All participants were native speakers of German with normal or corrected-to-normal vision and participated for course credit or as paid volunteers.

Stimuli and apparatus

The German words rot [red], grün [green], gelb [yellow], and blau [blue] were used, presented in capital letters. Two words were presented one above the other. The (upper) probe word was presented in white on a black background (Exp. 2a), or in black on a light-gray background (Exp. 2b); the (lower) Stroop word was presented in one of the four colors (red, green, yellow, or blue). During the experiment, participants were required to compare the word meaning of the probe with the ink color of the Stroop word. A “same” or “different” response was required equally often. Experiment 2a applied the trial type proportion of Experiment 1a (word color, probe word, and Stroop word were factorially combined); Experiment 2b applied the trial type proportion of Experiment 1b (proportions were one fourth each for the SSS and SDD trials and one sixth each for the DSD, DDD, and DDS trials).

Procedure and design

Each experiment consisted of four experimental blocks with 96 trials each, preceded by an adaptive training phase: Participants performed at least two practice blocks of 10 trials each. In the second practice block, participants were allowed at most three errors; exceeding this limit triggered an instruction to make fewer errors and another practice block, up to a maximum of five rounds of practice. The response window was identical to Experiment 1b. Each experiment lasted approximately 20 minutes.

Results

For RT analyses, data were preprocessed as in Experiment 1 (i.e., excluding error trials, 5.50% and 6.74%, as well as outliers, 0.67% and 0.76%, respectively, for Experiments 2a and 2b). In Experiment 2a, mean RT was 670 ms (SD = 97); mean error rate was 0.05 (SD = 0.23); in Experiment 2b, mean RT was 719 ms (SD = 58); mean error rate was 0.07 (SD = 0.25).

Mean RTs and mean error rates

Figures 3 and 4 (upper rows) depict mean RTs and mean error rates for the trial types. The result pattern for mean RT and accuracy of Experiments 1a and 1b was replicated. First, for the “same” trial types, a main effect of trial type emerged both for RTs and error rates, Exp. 2a, RTs: F(1, 31) = 200.98, MSE = 582.49, p < .001, η²G = .194; Exp. 2a, error rates: F(1, 31) = 34.40, MSE = 0.0013, p < .001, η²G = .241; Exp. 2b, RTs: F(1, 23) = 181.30, MSE = 715.72, p < .001, η²G = .431; Exp. 2b, error rates: F(1, 23) = 94.63, MSE = 0.00087, p < .001, η²G = .436. It reflects the finding that RTs and error rates were larger for SDD than for SSS trials. The “different” trial types also differed in their RTs and error rates, Exp. 2a, RTs: F(1.55, 48.08) = 46.55, MSE = 716.99, p < .001, η²G = .047; Exp. 2a, error rates: F(1.75, 54.26) = 14.95, MSE = 0.0017, p < .001, η²G = .133; Exp. 2b, RTs: F(1.76, 40.39) = 26.78, MSE = 454.67, p < .001, η²G = .071; Exp. 2b, error rates: F(1.46, 33.6) = 13.08, MSE = 0.0018, p < .001, η²G = .198. As in Experiments 1a and 1b, RT was greater, and more errors were made, for DSD trials than for DDD trials, Exp. 2a, RTs: Md = 43.28, 95% CI [33.71, 52.86], t(31) = 9.22, p < .001; Exp. 2a, error rates: Md = 0.05, 95% CI [0.03, 0.07], t(31) = 6.19, p < .001; Exp. 2b, RTs: Md = 37.27, 95% CI [25.84, 48.69], t(23) = 6.75, p < .001; Exp. 2b, error rates: Md = 0.05, 95% CI [0.03, 0.07], t(23) = 4.73, p < .001. Importantly, RT did not differ between DDD and DDS trials, Exp. 2a: Md = −10.31, 95% CI [−21.34, 0.71], t(31) = −1.91, p = .066; Exp. 2b: Md = 1.46, 95% CI [−8.71, 11.63], t(23) = 0.30, p = .769, but error rates were significantly higher for DDS trials compared to DDD trials, Exp. 2a: Md = 0.03, 95% CI [0.01, 0.05], t(31) = 3.67, p = .001; Exp. 2b: Md = 0.04, 95% CI [0.03, 0.05], t(23) = 5.65, p < .001.

Fig. 3

Experiment 2a (word-color matching task). The upper row displays mean RTs (left panel) and mean error rates (right panel) for all trial types. Error bars show the 95% within-subject confidence interval (see Morey, 2008). The lower row displays delta plots for DSD and DDS interference (left panel) and conditional accuracy functions (right panel) for the DSD and DDS trials

Fig. 4

Experiment 2b (word-color matching task). The upper row displays mean RTs (left panel) and mean error rates (right panel) for all trial types. Error bars show the 95% within-subject confidence interval (see Morey, 2008). The lower row displays delta plots for DSD and DDS interference (left panel) and conditional accuracy functions (right panel) for the DSD and DDS trials

Delta plots

Figures 3 and 4 (lower rows, left panels) depict delta plots of DSD and DDS trials obtained in Experiments 2a and 2b, respectively. Replicating the results of Experiments 1a and 1b, delta plot slopes were positive for the task-irrelevant between-attribute comparison, Exp. 2a: M = 0.16, 95% CI [0.06, 0.25]; Exp. 2b: M = 0.20, 95% CI [0.13, 0.27], but negative for the task-irrelevant within-attribute comparison, Exp. 2a: M = −0.15, 95% CI [−0.24, −0.06]; Exp. 2b: M = −0.04, 95% CI [−0.13, 0.05]. The difference in slopes was again statistically significant: Delta plot slopes of DSD and DDS differed for Experiment 2a, Md = 0.31, 95% CI [0.19, 0.43], t(31) = 5.27, p < .001, and Experiment 2b, Md = 0.24, 95% CI [0.17, 0.31], t(23) = 7.05, p < .001.

Conditional accuracy functions

Figures 3 and 4 (lower rows, right panels) depict CAFs of DSD and DDS trials for Experiments 2a and 2b, respectively. In both experiments, the CAF slope was steeper for DDS trials than for DSD trials (Exp. 2a: d = 0.32; Exp. 2b: d = 0.54), suggesting a greater proportion of fast errors for the former. The slope difference did not, however, reach statistical significance in Experiment 2a, t(31) = 1.22, p = .233, and was only marginally significant in Experiment 2b, t(23) = 1.91, p = .068.

Discussion

Results of the word-color matching task in Experiments 2a and 2b replicated the findings obtained for the color-word matching task in Experiments 1a and 1b. First, overall performance was comparable across Stroop matching task versions, with overall mean RTs in the range of 670–720 ms and error rates below 0.08. More importantly, the word-color matching task revealed a pattern of performance across trial types similar to the one observed for the color-word matching task: A clear RT cost was found for DSD trials, whereas accuracy costs were found for DDS trials. Moreover, delta plot slopes were positive for the task-irrelevant between-attribute comparison, but negative for the task-irrelevant within-attribute comparison. Additionally, CAFs descriptively revealed a steeper slope for DDS than DSD trials between the first and second quintile, reflecting a greater proportion of fast errors in DDS than DSD trials.

We interpreted the finding of a steeper CAF slope for DDS than DSD trials in the fastest quintiles as evidence that, when compared to DSD trials, the task-irrelevant within-attribute comparison in DDS trials leads to especially fast errors. However, although we have consistently observed this descriptive crossover pattern across all of the Stroop matching task studies in our lab, the finding was weak in individual studies and repeatedly failed to reach statistical significance. To probe whether it reflected reliable evidence, we conducted joint analyses across all four studies. First, we computed a meta-analytic estimate of the slope difference, which was clearly positive, d = 0.44 (p < .001, 95% CI [0.20, 0.69]). Second, we computed a meta-analytic Bayesian t test (Rouder & Morey, 2011) that compared the alternative hypothesis of a greater slope for DDS than DSD trials (we used a Cauchy prior with scale parameter r = .707) to the null hypothesis of equal slopes. The Bayes factor was BF10 = 57.49; that is, the data obtained across the four studies are about 57 times more likely under the alternative hypothesis of more fast errors for DDS than DSD trials, as compared to the null hypothesis of comparable levels of fast errors. This shows that, over all of the present studies, there was reliable evidence for a greater proportion of fast errors in DDS than in DSD trials.
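For illustration, the single-study building block of such a test, the JZS Bayes factor for a one-sample or paired t test with a Cauchy(0, r) prior on effect size, can be computed by numerical integration. The sketch below (Python/SciPy) is two-sided, whereas the reported test was directional, and the meta-analytic version of Rouder and Morey (2011) additionally combines the t statistics of all studies under a single common effect size, which this sketch does not implement:

```python
import numpy as np
from scipy import integrate

def jzs_bf10(t: float, n: int, r: float = 0.707) -> float:
    """Two-sided JZS Bayes factor (BF10) for a one-sample/paired t test
    with a Cauchy(0, r) prior on the standardized effect size."""
    v = n - 1  # degrees of freedom

    def integrand(g):
        # Marginal likelihood under H1 for fixed g, weighted by the
        # inverse-gamma(1/2, 1/2) density of the mixing variable g
        c = 1 + n * g * r ** 2
        like = c ** -0.5 * (1 + t ** 2 / (c * v)) ** (-(v + 1) / 2)
        weight = (2 * np.pi) ** -0.5 * g ** -1.5 * np.exp(-1 / (2 * g))
        return like * weight

    m1, _ = integrate.quad(integrand, 0, np.inf)  # marginal likelihood under H1
    m0 = (1 + t ** 2 / v) ** (-(v + 1) / 2)       # marginal likelihood under H0
    return m1 / m0
```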

General discussion

We examined the basic performance pattern across trials of the Stroop matching task. First, Experiments 1a and 1b used a color-word task and replicated the results reported by Goldfarb and Henik (2006). We found evidence for the response competition account in the form of reliable word-color/word-meaning match costs in RT and accuracy for DSD trials. No evidence was found for semantic competition, as a word-color/bar-color match did not facilitate responses in DDS trials. In contrast, a robust pattern of increased error rates in DDS trials was found that replicated across all four experiments. This finding suggests the existence of two sources of interference in Stroop matching tasks. Specifically, we assume that accuracy costs for DDS trials result from a second type of task-irrelevant comparison between word color and bar color. In line with Treisman and Fearnley’s (1969) finding of fast and efficient within-attribute comparisons, we assume that the task-irrelevant comparison in DDS trials is quickly and efficiently made and therefore does not result in RT costs; however, it directly activates a “same” response and thereby produces a high proportion of fast errors that are also reflected in overall accuracy costs for DDS trials. Distributional analyses confirmed the prediction of two sources of interference: As expected for a slow process, RT costs increased with increasing mean RT for DSD trials, but they decreased for DDS trials. In addition, as predicted by the notion of a fast and direct response activation underlying DDS accuracy costs, a greater proportion of fast errors was found for DDS trials.

Experiments 2a and 2b provided evidence that the two types of interference observed in Experiments 1a and 1b are not limited to the color-word matching task. In these experiments, we used a word-color matching task in which participants compared the meaning of a probe word printed in a neutral color with the ink color of a Stroop word (Dyer, 1973; Treisman & Fearnley, 1969). Results revealed the same basic interference pattern across trial types as in the color-word matching task, at the level of mean RT and error rates as well as at the level of distributional analyses. Thus, similar interference effects were associated with the DSD and DDS trial types across task versions, suggesting that the response-competition account proposed by Goldfarb and Henik (2006) for the color-word matching task, and extended herein, can also explain interference in the word-color matching task. Additionally, similar interference effects across experiments suggest that the notation we put forward is appropriate for both between-attribute matching task versions. Notably, the interference pattern appears to be independent of whether the target attribute of the Stroop stimulus is word meaning (as in the color-word matching task) or word color (as in the word-color matching task); put differently, interference appears to be independent of the nature of the to-be-ignored attribute (color or word). This finding stands in stark contrast with the classical Stroop naming task, in which interference is typically found to vary greatly with the to-be-named and to-be-ignored attributes.

Finally, whereas Experiments 1a and 2a applied the trial type structure used by Goldfarb and Henik (2006) with unequal trial type proportions, we replicated the results in Experiments 1b and 2b, in which the proportions of trial types were held equal within “same” and “different” trials. Thus, the specific trial type proportions did not affect the pattern of performance in either version of the Stroop matching task.

Additional findings

In all experiments, RTs were shorter for “same” responses than for “different” responses, a finding that is consistently reported in the literature (e.g., Dyer, 1973; Goldfarb & Henik, 2006; Luo, 1999; Simon & Baker, 1995). One explanation given for this robust finding is that different processes might be involved in “same” and “different” judgments (e.g., Bindra, Donderi, & Nishisato, 1968; Dyer, 1973); another assumption is that a “different” response requires an additional rechecking process (Krueger, 1978). Although we replicated this basic pattern, the results of our experiments do not directly address the processes responsible for the difference between “same” and “different” judgments.

With respect to interference in “same” trials, one may speculate that SDD trials are associated with both types of interference, due to the combination of a between-attribute mismatch and a within-attribute mismatch. Across studies, performance costs in accuracy were often greatest in SDD trials, suggesting that both types of irrelevant comparisons indeed co-occur in that condition.

An extended response-competition account: Two types of irrelevant comparisons

In the Stroop matching task, participants are instructed to compare two of the three attributes in the stimulus display (i.e., probe and target), and to ignore the third (i.e., distracter) attribute. Interference comes about via task-irrelevant comparison processes involving the distracter attribute.

Our results support the assumption that two types of irrelevant comparisons are made in between-attribute matching tasks. First, as suggested by Goldfarb and Henik (2006), target and distracter (i.e., the Stroop word’s meaning and its word color) may erroneously be compared, perhaps driven by stimulus confusion, such that a between-attribute (meaning–color) comparison is performed as instructed, but the wrong attribute (i.e., the distracter instead of the probe) is subjected to comparison. Object-based attention can explain why the correct task rule is incorrectly applied: Goldfarb and Henik argued that attention might be specifically drawn to attributes of the same stimulus (Kahneman & Henik, 1981), an assumption that is also supported by Wühr and Waszak (2003), who reported object-based attention effects favoring the processing of task-irrelevant features of the same object over task-irrelevant features of different objects at a constant spatial distance (for a review of object-based attention, see Chen, 2012). As already suggested by Goldfarb and Henik, this first type of task-irrelevant comparison might thus reflect a top-down process of task-rule application misled by object-based attention, presumably causing interference in DSD trials.

Results of our experiments suggest that a second type of task-irrelevant comparison is made between distracter attribute and probe (i.e., between word color and bar color in the color-word matching task and between the Stroop word and the probe word in the word-color matching task). We assume that this within-attribute comparison occurs involuntarily, and that it is fast and relatively effortless, as suggested by Treisman and Fearnley (1969). The assumption of a fast comparison of colors is supported by studies on feature search, demonstrating speeded visual search for colors (e.g., Nothdurft, 1993; Treisman & Gormican, 1988), although it is still under debate why visual search for colors is executed so quickly; it has been speculated, for example, that color features are processed preattentively and in parallel (Nothdurft, 1993). Similar processes (based on line orientation features) might be involved in the perceptual comparison of two letter strings. Thus, this second task-irrelevant comparison is perhaps best described as an involuntary bottom-up effect that automatically operates when different objects in a stimulus display carry attributes of the same type. As an involuntary process, it directly activates the associated response, thereby interfering with the task-relevant response and producing accuracy costs (Gratton et al., 1992). To the degree that the within-attribute comparison is indeed fast and efficient, this also explains the absence of RT costs for DDS trials. Evidence for these assumptions was obtained via distributional analyses of DDS performance, showing negative delta-plot slopes as well as a high proportion of fast errors. This pattern of fast errors can also be accounted for by task-set errors (Bub, Masson, & Lalonde, 2006), as erroneous responses based on this task-irrelevant within-attribute comparison would have been given before the correct task set is instantiated.

Still, the discrepancy between RT and error data, with evidence for within-attribute interference only in the error data, might appear surprising, given that effects in Stroop tasks are most often found in RT data (see, e.g., MacLeod, 1991). Similar discussions can be found in studies applying the process dissociation procedure to Stroop task data (e.g., Klauer, Dittrich, Scholtes, & Voss, 2015; Lindsay & Jacoby, 1994). Process dissociation models (Jacoby, 1991) rely on error rates instead of RT data, and it is debated whether it is reasonable to draw conclusions from studies that apply a process dissociation model of the Stroop task but do not take RT results into account (Hillstrom & Logan, 1997), or that do not even find effects in RT data due to the specific procedure needed to estimate parameters in the process dissociation model (for a discussion of this issue, see Lindsay & Jacoby, 1994). Given this ambiguity, it is desirable to further investigate the nature of the task-irrelevant within-attribute matching. One way of gaining further insight into the nature of both sources of interference would be to apply a stimulus-onset asynchrony (SOA) manipulation: If the task-irrelevant within-attribute matching reflects a fast visual-search-like comparison, then a sequential presentation might reduce this involuntary matching. On the other hand, a sequential presentation might also increase within-attribute interference. This possibility is supported by a study in which the Stroop stimulus was presented before the probe (Simon & Berbaum, 1988; see also Flowers, 1975). Results showed descriptively larger RTs for DDS trials in comparison to DSD trials, a pattern consistent with an involuntary task-irrelevant within-attribute comparison that was intensified by the sequential stimulus presentation. SOA manipulation studies may thus help to gain additional insights into the processes underlying both sources of interference. If an SOA manipulation indeed dissociated the two sources of interference experimentally, this would yield more conclusive evidence for the present claim that DSD and DDS interference result from distinct underlying processes.

The nature of task-irrelevant comparisons in the Stroop matching task might be comparable to the task conflicts reported for the classical Stroop task: Goldfarb and Henik (2007) assumed that in the classical Stroop task, not only an "informational conflict" occurs between the relevant information (surface color) and the irrelevant information (word meaning), but also a "task conflict" (a conflict between the instructed task of naming the surface color and an involuntarily activated word-reading "task"). This assumption is in line with MacLeod and MacDonald (2000), who reported larger activation of the anterior cingulate cortex (ACC) in incongruent and congruent trials than in neutral trials, which was interpreted to indicate greater conflict when congruent or incongruent stimuli were presented than when neutral stimuli were presented. However, behavioral data typically do not confirm this assumption: Descriptively at least, a facilitation effect in terms of a performance benefit in congruent compared to neutral trials is often reported (see MacLeod, 1991). Goldfarb and Henik (2007) argued that in classical Stroop tasks, a task-conflict control mechanism is activated that eliminates the evoked task conflict. By using a specific procedure (a proportion manipulation and a cueing procedure), they were able to reduce this task-conflict control and found a behavioral expression of the task conflict: slower responses for congruent compared to neutral trials. In the Stroop matching task, the two task-irrelevant comparisons can also be described as task conflicts, and here as well, task-conflict control mechanisms might explain why the behavioral expression of these conflicts is weak, at least for the task-irrelevant within-attribute matching. Additional research is needed to directly investigate the role of task conflict in the Stroop matching task.

Interference at the response-selection stage

Goldfarb and Henik (2006) suggested that interference in the (color-word) Stroop matching task occurs at the response selection stage. However, in contrast to the classical Stroop task, in which interference can be explained by dimensional overlap (Kornblum, Hasbroucq, & Osman, 1990), Stroop matching tasks are characterized by the lack of such overlap between stimuli (i.e., colors) and responses (i.e., the judgments "same" or "different") that could explain the interference pattern. Instead, response interference in the Stroop matching task arises because of competition between the different judgments or response tendencies ("same" or "different") that result from the relevant and irrelevant comparisons (Goldfarb & Henik, 2006). Interestingly, Zysset et al. (2001) demonstrated in an fMRI study that performing a Stroop matching task does not activate the ACC, although it has been shown that the ACC is activated when participants perform the classical Stroop task (e.g., Pardo, Pardo, Janer, & Raichle, 1990). Zysset et al. (2001) argued that ACC activation might reflect motor preparation processes. Thus, the ACC might be activated when stimuli and responses overlap, leading to "late" response inhibition processes as in the classical Stroop task, whereas in Stroop matching tasks, "early" response selection processes might be involved, in which two response tendencies compete prior to motor preparation. Recent theoretical and empirical developments support this distinction between two types of response-related interference processes (e.g., Aron, 2011; Sebastian et al., 2013; Stahl et al., 2014).

Further support for the response-competition interpretation comes from a within-attribute Stroop matching task in which two ink colors (of a word and a bar) were to be compared and the irrelevant word meanings were “same” or “different” (Egeth, Blecker, & Kamlet, 1969; Kim, Kim, & Chun, 2005). In this within-attribute Stroop matching task, significant RT costs were observed when the irrelevant word (e.g., “same”) mismatched the result of the relevant comparison (e.g., the two colors were different). Thus, whereas irrelevant color words (i.e., those that overlap with relevant color stimuli) do not cause interference in color-color matching tasks (Simon & Baker, 1995; Simon & Berbaum, 1988; Treisman & Fearnley, 1969), irrelevant words that overlap with responses do cause interference in within-attribute matching tasks.

Contrasting with the present account, some behavioral studies have concluded that the interference effect in the Stroop matching task is based on processes prior to the response selection stage. Simon and Baker (1995), as well as Simon and Berbaum (1988), argued that interference in Stroop matching tasks can be explained by an impairment of decoding or retrieving the task-relevant information from short-term memory. They argued that interference in Stroop matching tasks cannot be explained by processes at the response selection stage because there is no overlap between the task-irrelevant attribute of the Stroop word and the response. However, as argued above, even in the absence of dimensional overlap, response interference in the Stroop matching task may arise because of competition between the response tendencies resulting from the task-relevant and task-irrelevant comparisons. Another prominent account that placed the locus of interference at a stage prior to response selection was put forward by Luo (1999; see the Introduction). Yet the present results show that Luo's semantic competition account, as well as the explanation put forward by Simon and Baker (1995) and Simon and Berbaum (1988), fails to fully explain the effect pattern of Stroop matching tasks when all five possible trial types are differentiated. Note, however, that although the present results are interpreted in favor of a response-competition account, alternative explanations are also conceivable. Take, for example, the result that RT does not differ between DDD and DDS trials, a pattern that we interpreted as evidence against a semantic competition account (a semantic competition account would predict faster RTs in DDS trials; see the Introduction). However, comparable RTs in DDD and DDS trials might reflect the sum of different underlying processes that add up to the same overall response speed: For example, in DDS trials there might be no semantic competition but a slower (and more error-prone) response selection process, whereas in DDD trials there might be semantic competition but a faster response selection process (given that all comparisons suggest a "different" response); a toy calculation below illustrates this possibility. The SOA manipulation suggested above might help to disentangle the underlying processes and to address more conclusively the question of whether semantic competition contributes to Stroop matching task interference.
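
This additive-processes alternative is easy to illustrate with a toy calculation (all latency values below are invented solely for illustration): opposite stage costs of equal size would leave the two trial types indistinguishable in overall RT.

```python
# Toy illustration (invented numbers) of how different stage costs could sum
# to identical overall RTs in DDD and DDS trials.
BASE_MS = 500
semantic_competition = {"DDD": 40, "DDS": 0}   # hypothetical semantic-stage cost
response_selection   = {"DDD": 0,  "DDS": 40}  # hypothetical selection-stage cost

for trial in ("DDD", "DDS"):
    total = BASE_MS + semantic_competition[trial] + response_selection[trial]
    print(trial, "->", total, "ms")            # both trial types yield 540 ms
```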

Recently, neurophysiological studies have examined the processes underlying interference in Stroop matching tasks (e.g., Caldas et al., 2012; David, Volchan, Vila, et al., 2011; Norris et al., 2002; Zysset et al., 2001). For instance, David, Volchan, Vila, et al. (2011) analyzed the modulation of early segments of the event-related potential (ERP) in a word-color matching task (while varying the SOA between the Stroop stimulus and the probe). They analyzed only "same" trial types and found that the N1 amplitude (an early ERP component assumed to be sensitive to feature-based selective attention) was greater for their "Stroop congruent" trial type (i.e., SSS trials) than for their "Stroop incongruent" trial type (i.e., SDD trials) in a condition in which the Stroop stimulus and probe were presented simultaneously (this effect was attenuated and even reversed with increasing SOA). One interpretation of this finding was that the N1 component might be related to the processing of the Stroop word's distracter attribute, which was assumed to be greatest in the SOA = 0 condition. Perhaps more relevant for the present findings is a study by Caldas et al. (2012), who recorded ERPs (the N450, an ERP marker assumed to reflect conflict processing during Stroop tasks) and EMGs (two electrodes placed on each hand) to examine the contributions of response conflicts, as well as of processes prior to response selection, to interference in a color-word Stroop matching task. Evidence for response conflicts was found in EMG activation, which was increased for "Stroop-incongruent same" trials (i.e., SDD trials) as well as for "Stroop-congruent different" trials (i.e., DSD trials). The results for the N450 were interpreted as evidence for processes unrelated to responses: The N450 had a greater amplitude for DSD and DDD trials than for DDS trials. The authors argued that the latter result can be interpreted as evidence for semantic conflicts in the Stroop matching task, as proposed by Luo's (1999) semantic competition account (i.e., it presents empirical support for a match/mismatch distinction along the lines proposed by Luo). Alternatively, they acknowledged that the result can also be explained differently: In DDS trials, the word-color/bar-color comparison might be less effortful (i.e., because both parts of the display share the same color), resulting in a smaller N450 negativity. Interestingly, although it may not reflect the response competition resulting from an irrelevant comparison, the latter interpretation of the N450 would be in line with our assumption that a task-irrelevant within-attribute comparison is indeed made in the Stroop matching task. Together with the pattern of EMG results, which can be taken to reflect the task-irrelevant between-attribute comparison between the target and distracter attributes of the Stroop word, the findings reported by Caldas et al. are consistent with the present account of Stroop matching task performance. Summarizing the behavioral and neurophysiological studies examining the processes underlying interference in the Stroop matching task, it appears that evidence for response competition prevails, whereas evidence for interference prior to the response stage is rare and ambiguous (e.g., the N450 results reported by Caldas et al., 2012).

Concluding remarks

This work scrutinizes the processes underlying performance in between-attribute Stroop matching tasks. Although the procedure closely followed that reported by Goldfarb and Henik (2006), the analyses of error rates as well as of the RT and error distributions allowed us to draw new conclusions: Specifically, as we found evidence that a task-irrelevant within-attribute comparison also plays a role in the Stroop matching task, we argue that a simplified classification of stimuli as "congruent" and "incongruent" invites inappropriate analyses and might thereby distort results, for instance, when neural substrates of the underlying processes are of interest (Zysset et al., 2001; Zysset, Schroeter, Neumann, & von Cramon, 2007). Indeed, some neurophysiological studies have combined SDD and DDD trials into an "incongruent" category, which was then compared with neutral trials (Menz, Neumann, Müller, & Zysset, 2006; Norris et al., 2002; Schroeter, Zysset, Wahl, & von Cramon, 2004; Yanagisawa et al., 2010; Zysset et al., 2007). Findings from these studies should be interpreted with caution, because activated regions identified in such an "incongruent" versus neutral contrast may reflect interference, facilitation, or both. Furthermore, some brain regions might have been overlooked because interference (reflected by SDD trials) and facilitation (possibly reflected by DDD trials) may have been conflated, and aggregation across heterogeneous conditions may have masked relevant activations. Even more caution should be applied to findings in which performance in the above "incongruent" class of trials was compared with a "congruent" category of trials that in fact contained DSD trials, which are clearly associated with interference (Schroeter, Zysset, Kruggel, & von Cramon, 2003; Schroeter, Zysset, Kupka, Kruggel, & von Cramon, 2002; Zysset et al., 2001). It is difficult to determine how the relative activation of brain regions identified in such an "incongruent" versus "congruent" contrast should be interpreted. The sketch below makes the five-way trial-type taxonomy explicit and thereby shows why these broad categories conflate distinct conditions.
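
The brief sketch below (Python) derives the three-letter trial-type codes from the three pairwise comparisons among the target attribute, the distracter attribute, and the probe; the assumed letter ordering (target-probe, target-distracter, distracter-probe) reflects our reading of the trial types discussed here and should be checked against Table 1. Enumerating all attribute combinations confirms that only five codes are logically possible.

```python
# Illustrative derivation of the three-letter trial-type codes; the letter
# ordering (target-probe, target-distracter, distracter-probe) is an assumption.
from itertools import product

def trial_type(target, distracter, probe):
    """Code each pairwise comparison as S (same) or D (different)."""
    s = lambda a, b: "S" if a == b else "D"
    return s(target, probe) + s(target, distracter) + s(distracter, probe)

colors = ["red", "green", "blue"]
codes = {trial_type(t, d, p) for t, d, p in product(colors, repeat=3)}
# Transitivity rules out SSD, SDS, and DSS, leaving exactly five trial types:
print(sorted(codes))   # ['DDD', 'DDS', 'DSD', 'SDD', 'SSS']
```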

These results have implications not only for research investigating the processes underlying the Stroop matching task, but also for research applying this task in other fields. For instance, Kim et al. (2005) used the Stroop matching task to investigate material-specific load effects (for other work examining such effects, see, e.g., Klauer & Zhao, 2004; Park, Kim, & Chun, 2007). Increased interference was obtained under verbal working memory load but not under spatial working memory load, suggesting that verbal, but not spatial, working memory resources are required for the Stroop matching task. However, Kim et al.'s findings should be interpreted with caution, as it is unclear which of the five possible trial types were administered in their study and how they were analyzed. Moreover, given the Stroop matching task's requirement of comparing two different types of attributes, this task might not be well suited to investigating material-specific load effects, especially when load effects are to be interpreted in terms of resource overlap with target versus distracter processing (Kim et al., 2005). Instead, other tasks might be less ambiguous in this regard (e.g., Dittrich & Stahl, 2012; Park et al., 2007) and should be preferred in investigations of material-specific load effects.

Although the present work has uncovered some relevant aspects of the Stroop matching task and its underlying processes, several questions remain open. For example, as discussed above, our findings generalize to different types of visual between-attribute matching tasks; yet it remains to be seen whether the above effects generalize to within-attribute matching tasks, or to other modalities (e.g., auditory or tactile stimuli). Auditory Stroop tasks have been repeatedly examined (e.g., Dittrich & Stahl, 2011; Green & Barber, 1981; Leboe & Mondor, 2007), but to our knowledge, auditory Stroop matching tasks have not. Similarly, we are not aware of reports of Stroop matching tasks in other modalities.

To conclude, although the Stroop matching task has recently received increasing attention, studies vary remarkably in how they analyze and, in turn, interpret the data. This study provides a notation of trial types (and, thereby, of the different Stroop matching task conditions that should be analyzed separately) that is both descriptive and complete. This approach helped us to fully characterize the effect pattern across trials, to demonstrate comparable results across Stroop matching task versions, and to extend the response-competition account of Stroop matching task performance. It is our hope that this approach will also prove helpful in future research and will further our understanding of the processes underlying the Stroop matching task and related tasks.