Human behavior is characterized by high adaptability and flexibility: goals can be achieved even when environmental factors create interference, and goals can be shifted internally even when environmental factors remain unchanged. The cognitive processes underlying this flexibility have recently been examined with respect to the notion of “cognitive control” (sometimes also called “executive functions”), which generally denotes the ability of humans to intentionally shift goals, update working-memory content, monitor their own responses, and inhibit the processing of distractors, unwanted thoughts, or prepotent but inappropriate responses.
Specifically, two theoretical frameworks have gained much interest in research on cognitive control over the last two decades. First, based on investigations of shared variance among sets of intercorrelated tasks, Miyake et al. (
2000) proposed a framework that postulates three general aspects of cognitive control: shifting of task-set, monitoring and updating of working-memory, and inhibition of prepotent response tendencies (see Karr et al.,
2018, for a recent review). Second, based on work on sequential modulation of well-established interference effects (i.e., the “sequential congruency effect”; Egner,
2007, for a review), Botvinick et al. (
2001; Botvinick et al.,
2004) developed a conflict-monitoring account, proposing that response conflict is internally monitored and that the detection of conflict triggers an upregulation of selective attention. Both frameworks are based on empirical “signature” effects, such as task-switch costs for the shifting component of cognitive control (for reviews, see, e.g., Kiesel et al.,
2010; Koch et al.,
2018; Monsell,
2003; Vandierendonck et al.,
2010), the well-known Stroop effect as an indicator of the degree of attentional selectivity (e.g., Stroop,
1935; see MacLeod,
1991), and its sequential modulation as an indicator of conflict-triggered control adjustments (see also Schuch et al.,
2019, for a recent discussion of sequential interference effects in multitasking paradigms).
Yet, recently, doubts about the reliability of interindividual differences in many cognitive control measures have emerged (e.g., Miyake & Friedman,
2012; Paap & Sawi,
2016; Rey-Mermet et al.,
2018). As it turns out, cognitive control measures that prove stable and reliable when measured at a group level (i.e., effects that have been replicated in many experimental studies using different participant samples, e.g. the “Many-Labs Project”, Klein et al.,
2014, and follow-up projects), do not necessarily show sufficient split-half and retest reliability when taken to assess interindividual differences (as is the case with correlational approaches, such as structural equation modeling). This puzzling discrepancy between reliability on a group level (i.e., the probability of replicating a group-level effect in a new sample of participants) and reliability on the level of interindividual differences (i.e., split-half and retest reliability) has recently been termed the “reliability paradox” (Hedge et al.,
2018a,
2018b,
2020).
In the present study, we aimed to examine the split-half and retest reliability of some prominent cognitive control measures. In Experiment 1, we examined the reliability of two important effects in task switching:
N − 2 task repetition costs and the general cue-based task-preparation benefit. In Experiment 2, we examined the reliability of two widely used effects in single-task paradigms: a Stroop-like effect (using a face-name interference paradigm) and its sequential modulation (the sequential congruency effect, or “conflict adaptation effect”). All of these effects have previously been examined in several studies in our own labs, and have proven robust when measured at the group level (for reviews of the task-switching-related effects, see Gade et al.,
2014; Koch et al.,
2010; Koch et al.,
2018; for congruency and sequential congruency effects in single-task paradigms, see e.g., Schuch & Koch,
2015, Schuch et al.,
2017; for review, see Schuch et al.,
2019). In the following, we will briefly discuss the theoretical background of a) the two task-switching measures and b) the two single-task measures of cognitive control.
Cognitive control measures in task switching
In the structural-equation modeling approach presented by Miyake et al. (
2000), three subcomponents of cognitive control were identified as latent variables: “shifting of task-sets”, “monitoring and updating of working memory”, and “inhibition of prepotent response tendencies”. Later empirical work using structural equation modeling confirmed the task-shifting and working-memory factors, but not the “inhibition of prepotent responses” factor (see Friedman & Miyake,
2017; Karr et al.,
2018; Miyake & Friedman,
2012, for reviews). Instead, Friedman et al. (
2008) proposed a “common executive function” factor that partially overlaps with the task-shifting and working-memory factors. They describe this common factor as “the ability to maintain and manage goals, and use those goals to bias ongoing processing” (Friedman & Miyake,
2017, citation from section “5.1.1. Hypothesized functions for the Common EF factor”). The working-memory factor is characterized by the ability to update some of the current working-memory content, while at the same time maintaining other working-memory content for later retrieval. This factor is measured by memory tasks that require participants to attend to sequentially presented items from different categories, and later recall the last item from each category.
The task-shifting factor is described as the ability to rapidly replace task-sets in Friedman and Miyake’s (
2017) framework, and the authors suggest that participants might differ in the speed of task-set replacement (see also Miyake & Friedman,
2012). The task-shifting factor is measured by cued task-switching paradigms, where the currently relevant task-set changes from trial to trial and is indicated by a task cue that is presented prior to the target stimulus (Meiran,
1996).
One popular measure that can be extracted from task-switching paradigms is the “task-switch cost”, defined as the performance difference between task-switch trials and task-repetition trials in a cued task-switching paradigm. For instance, Friedman and Miyake (
2004) tested more than 200 participants with three cued task-switching paradigms, and found good reliability of task-switch costs (with Spearman-Brown corrected split-half reliabilities ranging from
r = 0.43 to
r = 0.82). The reliability of task-switch costs has been confirmed in several other studies (e.g., Friedman et al.,
2008; Miyake et al.,
2000; Paap et al.,
2017; Pettigrew & Martin,
2016).
However, it is widely acknowledged that task-switch costs represent a mixture of different effects (see Kiesel et al.,
2010; Koch et al.,
2018, for reviews). One subcomponent of task-switch costs is task-level inhibition (e.g., Allport & Wylie,
1999; Goschke,
2000). Task-level inhibition can be measured with “
N − 2 task repetition costs” (e.g., Mayr & Keele,
2000; see Koch et al.,
2010, for a review), a sequential measure in which different kinds of task sequences are compared. For example, Gade and Koch (
2005) used three tasks, and in each trial, the task was indicated by an explicit instruction cue. As stimuli, they used colored (red vs. blue) symbols (a digit or a letter) that varied in size (small vs. large), so that there were three varying perceptual dimensions, and the task cue indicated the relevant stimulus dimension for selecting the target attribute (e.g., small vs. large for the size dimension). When the authors analyzed the sequential transitions, they found that sequences of the ABA type (
N − 2 task repetitions, e.g., color–size–color) resulted in worse performance (e.g., higher reaction time [RT]) than sequences of the CBA type (
N − 2 switches, e.g., symbol–size–color). The finding of higher RT in the last trial of an ABA versus CBA task sequence speaks in favor of a process that inhibits aspects of the preceding task set when shifting to a new task set; this is because accounts in terms of persisting activation of previously established task representations (task sets) would predict
better performance for ABA relative to CBA (Mayr & Keele,
2000; see Koch et al.,
2018, for a recent discussion). Even though some other effects in task switching have been related to inhibitory processing,
N − 2 repetition costs arguably represent the most unambiguous case for inhibition in task switching to date (Koch et al.,
2010; and see also Grange et al.,
2017, for a recent discussion). Yet, even though the experimental evidence for the existence (and replicability on the group level) of
N − 2 repetition costs in task switching is very robust (i.e., they have been replicated many times with different paradigms and in different participant samples, see Koch et al.,
2018, for a recent review), only a few studies have examined their split-half and retest reliability.
To our knowledge, three studies so far have assessed split-half reliability of
N − 2 repetition costs. Kowalczyk and Grange (
2017) used three different versions of task switching and found split-half reliabilities of
N − 2 repetition costs between
r = 0.37 and
r = 0.60 (these are corrected reliability scores; note that split-half reliability is usually corrected for attenuation by applying the Spearman–Brown correction). Pettigrew and Martin (
2016) reported a split-half reliability of
N − 2 repetition costs of
r = 0.44, and Rey-Mermet et al. (
2018) of
r = 0.27. One study assessed test–retest reliability of
N − 2 repetition costs in both a task-switching and a language-switching paradigm, with about one week between test and retest (Timmer et al.,
2018). These authors observed a retest reliability of
N − 2 repetition costs of
r ≈ 0.40 (both in the task-switching and the language-switching paradigm). Taken together, the available data on the reliability of
N − 2 repetition costs are scarce, with reliability estimates ranging from poor to moderate.
Apart from task-switch costs and
N − 2 repetition costs, another important cognitive-control measure that can be assessed in cued task-switching paradigms is the time-based task-preparation effect. Here we define this effect as the performance difference between trials with short versus long time intervals between task cue and task-specific stimulus (cue-stimulus interval, CSI). For instance, Lawo et al. (
2012) observed substantial task-preparation effects that differed between younger and older adults, suggesting that task-preparation ability deteriorates with older age (on a group level). Other aging and developmental studies confirm that the efficiency of task preparation is an important aspect when assessing age-related differences in cognitive control (e.g., Cepeda et al.,
2001; Crone et al.,
2006; Schuch,
2016; Schuch & Konrad,
2017; Wild-Wall et al.,
2007; for reviews, see Gajewski et al.,
2018; Kray & Doerrenbaecher, in press; Kray & Ferdinand,
2014). Assuming that the relevant task set becomes activated during the CSI, the performance difference between short and long CSI conditions can be interpreted as reflecting the degree of cue-based activation of the relevant task-set, especially in
N − 2 repetition cost paradigms where usually every trial is a task switch (e.g., Lawo et al.,
2012; Schuch & Grange,
2019; Schuch & Koch,
2003). It is often assumed that task preparation involves activation of the relevant attentional settings and task rules in working memory and builds up gradually over time, such that a longer CSI leads to better task preparation (for reviews, see Kiesel et al.,
2010; Koch et al.,
2018).
Beyond the general task-preparation effect discussed here (i.e., the reduction of mean RT in trials with long vs. short CSI), considerable research has been carried out focusing on the specific task-preparation effect, denoting the reduction of task-switch costs with long vs. short CSI (see Kiesel et al.,
2010; Koch et al.,
2018, for reviews). The latter measure is often interpreted as a marker of “advance reconfiguration of task set” (Meiran,
1996; Monsell,
2003; Vandierendonck et al.,
2010). Whether such a specific task-preparation effect also occurs with
N − 2 task repetition costs is, to date, an unresolved issue. While earlier studies did not find a reduction of
N − 2 task repetition costs with longer as compared to shorter CSI (e.g., Mayr & Keele,
2000; Schuch & Koch,
2003; see Koch et al.,
2010, for review), more recent studies do sometimes report reduced N − 2 repetition costs with longer task-preparation time (e.g., Gade & Koch,
2014; Scheil & Kleinsorge,
2014; Schuch & Grange,
2019). The design of the present Experiment 1 allowed us to contribute to this literature, by examining
N − 2 repetition costs with short vs. long CSI on a group level.
While CSI effects are well established on a group level, less attention has been paid to their reliability on the level of interindividual differences. Yet, the general task-preparation effect (i.e., performance improvement with long as compared to short CSI)—if it proves to be reliable—might be a good candidate for investigations of task switching processes from an interindividual-differences perspective. For instance, in the aging literature, age-related differences in task-preparation processes are widely discussed (e.g., Kray & Ferdinand,
2014, for review), but these studies typically compare task-preparation effects on a group level (i.e., comparing a group of younger adults with a group of older adults), such that reliability is usually not the focus. Yet, the time-based task-preparation effect may be suitable for correlational approaches, just as other behavioral indices of task preparation have been used in individual-differences studies (e.g., Wager et al.,
2006). For instance, task-preparation effects related to the informativeness of the task cues have been correlated with electrophysiological and neuroimaging markers of task preparation (e.g., Brass & von Cramon,
2004; Karayanidis et al.,
2009; see, e.g., Hsieh,
2012; Karayanidis et al.,
2010, for reviews). Hence, assessing reliability of task-preparation measures in general, and of the time-based task-preparation effect in particular, might be useful for future investigations of cognitive control from an individual-differences perspective.
Cognitive control measures in single-task paradigms
Regarding cognitive control measures in single-task contexts, perhaps the most popular effect is the color-word Stroop effect (i.e., naming the ink color of written color words whose meaning is either congruent or incongruent with the ink color they are printed in; see MacLeod,
1991; MacLeod & MacDonald,
2000, for reviews). The Stroop effect is a classic textbook example and a popular classroom demonstration of a “conflict task”, where task-relevant and task-irrelevant features interfere, creating some kind of cognitive conflict (e.g., conflict between stimulus features, or conflict between competing responses). It is sometimes explained in terms of an inhibitory process, such as inhibition of distractor processing, or inhibition of an inappropriate response tendency (e.g., Friedman & Miyake,
2017; Gärtner & Strobel,
2021; Miyake et al.,
2000; Pettigrew & Martin,
2016). Others have argued that the Stroop effect and other conflict tasks do not necessarily reflect inhibitory control (e.g., Paap et al.,
2020). Here, we will use the more descriptive terms “distractor interference control” or “control of cognitive conflict”. The Stroop effect has been reported to be quite reliable (with Spearman-Brown corrected split-half reliability often between
r = 0.80 and
r = 0.90, see, e.g., Friedman & Miyake,
2004; Rey-Mermet et al.,
2018).
Interestingly, despite the high robustness of this experimental effect, the split-half reliability of its sequential modulation (the sequential congruency effect, e.g., Egner,
2007), which is typically used to examine conflict adaptation, has been found to be very poor, ranging between
r = − 0.12 and
r = 0.08 across three experiments reported by Whitehead et al. (
2019). This drop in reliability is at least partly due to the fact that the sequential congruency effect is computed as the difference of a difference score, and therefore has lower reliability than the congruency effect, which is computed as a simple difference score (see Kopp,
2011; Miller & Ulrich,
2013; Whitehead et al.,
2019, for considerations on the reliability of difference scores).
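To make the “difference of a difference score” explicit, let C and I denote congruent and incongruent trials, and let a lowercase prefix denote the congruency of the preceding trial (our notation). The two measures are then typically computed as

$$\text{congruency effect} = \overline{\mathrm{RT}}_{I} - \overline{\mathrm{RT}}_{C}, \qquad \text{sequential congruency effect} = \left(\overline{\mathrm{RT}}_{cI} - \overline{\mathrm{RT}}_{cC}\right) - \left(\overline{\mathrm{RT}}_{iI} - \overline{\mathrm{RT}}_{iC}\right),$$

that is, the congruency effect after congruent trials minus the congruency effect after incongruent trials.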
While a considerable number of studies have assessed the split-half reliability of Stroop-like interference effects and task-switching effects, only a few studies have investigated the retest reliability of such effects. In one recent study, Hedge et al. (
2018b) assessed retest reliability of a number of interference effects, including the Stroop effect, with a temporal separation of three weeks between test and retest. They found a retest reliability of
r = 0.60 and
r = 0.66 for the Stroop effects in two studies (they did not report retest reliability of the sequential congruency effect). In another study, Paap and Sawi (
2016) examined retest reliability of effects in four different tasks, including task switching, over a period of one week and found only moderate reliabilities. For example, for color-shape switching, they found a retest reliability of
r = 0.62. For the Simon task (which is often considered a conflict task, similar to the Stroop task), they found a retest reliability of only
r = 0.43.
The present study
To summarize, several measures of cognitive control that are highly robust when analyzed on the group level in standard experimental paradigms have surprisingly low reliability when taken as a measure of interindividual differences in correlational approaches, for instance, in structural equation modeling. Therefore, more studies are needed that assess the split-half and retest reliability of standard cognitive control measures, to elucidate which of these measures are suitable for individual-differences approaches, and which are not.
In the present study, we assessed the reliability of four standard cognitive control measures. In Experiment 1, we focused on
N − 2 repetition costs, which are a measure of task-level inhibition (see Koch et al.,
2010, for review), and the time-based task-preparation effect (i.e., CSI effect, denoting the finding of improved performance with long as compared to short CSI), which may be considered as a marker of cue-based task-set activation (especially in paradigms with task switches only; e.g., Lawo et al.,
2012; Schuch & Grange,
2019; Schuch & Koch,
2003). The design of Experiment 1 also allowed us to explore the potential preparatory modification of
N − 2 repetition costs by task-preparation time (on a group level).
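Expressed as simple difference scores (our notation, following the definitions given above; analogous scores can be computed for error rates), these two measures are

$$\text{N − 2 repetition cost} = \overline{\mathrm{RT}}_{\mathrm{ABA}} - \overline{\mathrm{RT}}_{\mathrm{CBA}}, \qquad \text{CSI effect} = \overline{\mathrm{RT}}_{\text{short CSI}} - \overline{\mathrm{RT}}_{\text{long CSI}},$$

with positive values indicating a cost of returning to a recently abandoned task and a benefit of longer preparation time, respectively.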
In Experiment 2, we examined a variant of the Stroop effect. The family of Stroop-like effects is a marker for distractor interference processing, and is sometimes taken as a marker for inhibitory processing; moreover, the sequential modulation of Stroop-like effects has been taken as a hallmark of conflict-triggered adjustments of cognitive control (Botvinick et al.,
2001; see also Egner,
2007,
2017; Paap et al.,
2019; Schuch et al.,
2019, for more recent reviews). Here we used a face-name interference paradigm that resembles paradigms often used in the neuroimaging literature (e.g., Egner & Hirsch,
2005; Gazzaley et al.,
2005; O’Craven et al.,
1999), and has been used in our own lab before (Schuch & Koch,
2015; Schuch et al.,
2017).
For these four measures of cognitive control, we report the group-level effects (i.e., the average effects across all participants), as well as their split-half and retest reliability. In both experiments, the respective effects were measured with standard experimental paradigms in a first session; after a short, unrelated filler task, participants performed the same paradigm again in a second session on the same day. We first report the group-level effects as obtained with a standard analysis of variance (ANOVA), with first vs. second session as an independent within-subjects variable. Then, we report split-half reliability (correlation between odd and even trials) and retest reliability (correlation between first and second session) for each of the effects.
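As an illustration of how these two reliability indices can be computed from trial-level data, consider the following minimal sketch (column names, file names, and condition labels are hypothetical and do not correspond to our actual analysis scripts):

```python
# Illustrative sketch: per-participant experimental effect, then split-half and
# retest reliability of that effect (hypothetical column and file names).
import pandas as pd


def effect_per_subject(trials: pd.DataFrame) -> pd.Series:
    # Effect as a simple difference of condition means for each participant
    means = trials.groupby(["subject", "condition"])["rt"].mean().unstack()
    return means["incongruent"] - means["congruent"]


def spearman_brown(r: float) -> float:
    # Corrects the odd/even correlation for halving the number of trials
    return 2 * r / (1 + r)


session1 = pd.read_csv("session1_trials.csv")  # hypothetical trial-level files
session2 = pd.read_csv("session2_trials.csv")

# Split-half reliability: correlate effects from odd vs. even trials (Session 1)
odd = effect_per_subject(session1[session1["trial"] % 2 == 1])
even = effect_per_subject(session1[session1["trial"] % 2 == 0])
split_half = spearman_brown(odd.corr(even))  # pandas aligns participants by index

# Retest reliability: correlate effects from Session 1 vs. Session 2
retest = effect_per_subject(session1).corr(effect_per_subject(session2))

print(f"split-half r = {split_half:.2f}, retest r = {retest:.2f}")
```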
Methodological considerations: number of participants and number of trials per condition
To get reliable estimates for correlations, two issues are important: first, there needs to be a large enough number of participants—for instance, to reliably detect medium-sized correlations, a sample of
N = 85 or larger is necessary (Cohen,
1992). With smaller sample sizes, correlation estimates are very variable (Schönbrodt & Perugini,
2013).
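To illustrate where a figure like N = 85 comes from, the required sample size can be approximated with the standard Fisher z-transformation (a textbook approximation, shown here as a sketch rather than the exact procedure underlying Cohen’s tables):

```python
# Approximate sample size needed to detect a correlation of size r
# (two-sided test) with a given power, via the Fisher z-transformation.
import math
from scipy.stats import norm


def required_n(r: float, alpha: float = 0.05, power: float = 0.80) -> int:
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value of the two-sided test
    z_power = norm.ppf(power)          # quantile corresponding to the desired power
    z_r = math.atanh(r)                # Fisher z-transform of the population correlation
    return math.ceil(((z_alpha + z_power) / z_r) ** 2 + 3)


print(required_n(0.30))  # about 85 participants for a medium-sized correlation
```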
Second, and perhaps even more importantly, the number of experimental trials that provide the basis for computing the experimental effects plays a crucial role (Green et al.,
2016; Rouder & Haaf,
With small trial numbers, the estimates of the experimental effects are variable, which attenuates the correlations computed from such effect estimates (e.g., between test halves or between sessions). One remedy to this issue is to apply the Spearman–Brown correction formula (Spearman, 1904), which corrects for a reduction of test length (i.e., of trial numbers in the case of experimental effects).
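For reference, with $r_{12}$ denoting the observed correlation between two halves of equal length (e.g., odd and even trials), the Spearman–Brown correction for doubling the test length is

$$r_{\mathrm{SB}} = \frac{2\,r_{12}}{1 + r_{12}}.$$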
The estimates of split-half reliabilities of experimental effects are often Spearman–Brown corrected, to compensate for halving the “test length” by splitting trials into odd versus even trials. Note, however, that “test length” may vary considerably across experimental paradigms. When assessing, e.g., the Stroop effect, some researchers might use a paradigm with as few as 20 trials per condition, while others might use a different paradigm with, say, 100 trials per condition. Usually, researchers do not pay much attention to the number of trials that provide the basis for computing the experimental effect. Rouder and Haaf (
2019) therefore suggested calculating reliabilities of experimental effects for the case of infinitely large trial numbers. They did so by applying linear mixed models, and including trial-by-trial variability as an additional random factor in the model. They re-analyzed the data from Hedge et al. (
2018b), and found retest reliabilities of around
r = 0.70 for both the Stroop and the Flanker effect (as opposed to retest reliabilities of
r = 0.55 and
r = 0.50 when correlating the effects from first and second session without accounting for trial-by-trial variability). In a similar vein, Whitehead et al. (
2020) re-analyzed data from Whitehead et al. (
2019), and observed slightly larger split-half reliabilities for Stroop, Flanker, and Simon effects when using linear mixed models that account for trial-by-trial variability (split-half reliabilities ranging between
r = 0.57 and
r = 0.65) than when correlating the effects between odd and even trials without accounting for trial-by-trial variability (split-half reliabilities ranging between
r = 0.31 and
r = 0.61). Hence, it is important to always consider the number of trials per condition (or to extrapolate to the large-trial limit) when estimating split-half and test–retest reliabilities of experimental effects. Here, we considered the number of trials per condition when comparing reliability scores of different kinds (retest, split-half), and when comparing reliability measures across different studies.
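As a minimal illustration of this logic (a sketch only, not the exact models used by Rouder and Haaf, 2019, or Whitehead et al., 2020; variable and file names are hypothetical), a trial-level mixed model with a by-participant random slope for the congruency factor separates true interindividual variability in the effect from trial-by-trial noise:

```python
# Sketch: trial-level mixed model with a random congruency slope per participant.
# The random-slope variance reflects interindividual differences in the congruency
# effect, over and above the trial-by-trial residual noise.
import pandas as pd
import statsmodels.formula.api as smf

trials = pd.read_csv("trial_level_data.csv")  # hypothetical: subject, congruent (0/1), rt

model = smf.mixedlm(
    "rt ~ congruent",
    data=trials,
    groups=trials["subject"],
    re_formula="~congruent",  # random intercept and random congruency slope
)
fit = model.fit()

print(fit.cov_re)  # variance/covariance of random intercepts and slopes
print(fit.scale)   # residual (trial-by-trial) variance
```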
The large-trial limit might be regarded as the “ideal case” for computing reliabilities; however, there are both advantages and drawbacks to designing experiments with large trial numbers. A potential disadvantage is that the longer the experiment, the more pronounced the influence of practice effects, and the more likely it is that the cognitive tasks become highly overlearned and “automatized”. When investigating cognitive control functions, however, researchers might want to avoid too much automaticity and overlearning of task-specific associations or stimulus–response rules, as these cognitive processes might alter or even substitute for the cognitive control processes the researcher is interested in (see, e.g., Grange & Juvina,
2015; Scheil,
2016, for practice effects on
N − 2 repetition costs; Davidson et al.,
2003, for practice effects on the Stroop effect in young versus old adults; Strobach et al.,
2014, for review).