Classically, reaction time (RT) research has compared performance across different conditions to assess the effects of various experimental manipulations and group differences on RT. Most often, the results have been summarized in terms of the mean RTs for each condition or group, although occasionally comparisons are made with respect to variances or full RT distributions (e.g., Luce, 1986; Posner, 1978; Sanders, 1998; Smith, Ratcliff, & Wolfgang, 2004).

Increasingly, RT researchers also have examined the correlations of different RT-based measures with each other and with other measures. Such correlations are of interest in at least two kinds of research. First, RTs offer a promising tool for assessing individual differences (Cattell, 1890). For example, many researchers have looked for correlations of intelligence with overall mean RT and with the sizes of particular effects (i.e., differences in mean RT) used to assess the time needed for specific mental operations (e.g., Beauducel & Brocke, 1993; Helmbold, Troche, & Rammsayer, 2007; Hunt, 1978; Keating & Bobbitt, 1978; Kirby & Nettelbeck, 1989; Neubauer, Riemann, Mayer, & Angleitner, 1997; Smith & Stanley, 1983; Vernon & Mori, 1992; for a review, see Vernon, 1990). RT means and effect sizes have also been used to assess individual differences within many different areas, including social psychology (e.g., Hofmann, Gawronski, Gschwendner, Le, & Schmitt, 2005; Nosek & Smyth, 2007; Richeson & Shelton, 2003; Wiers, Van Woerden, Smulders, & De Jong, 2002), personality psychology (e.g., Indermühle, Troche, & Rammsayer, 2011; Karwoski & Schachter, 1948; Smulders & Meijer, 2008), aging research (e.g., Eckert, Keren, Roberts, Calhoun, & Harris, 2010; Myerson, Robertson, & Hale, 2007; Wood, Willmes, Nuerk, & Fischer, 2008), and assessment of brain damage and psychopathology (e.g., Godefroy, Lhullier, & Rousseaux, 1994; Nettelbeck, 1980; Stuss, Pogue, Buckle, & Bondar, 1994).

Second, even where individual differences are not the focus of investigation, correlations between different RT effects may intuitively be thought to provide important information about the underlying mechanisms responsible for those effects. For example, Corballis (2002) examined the correlation between two RT-based effects thought to depend on the time needed for interhemispheric transmission (i.e., redundancy gain and the crossed–uncrossed difference) in order to find out whether the two effects were mediated by a common neural pathway. Likewise, Maloney, Risko, Preston, Ansari, and Fugelsang (2010) examined correlations of numerical distance effects obtained with different number formats in order to see whether these different effects assessed the same underlying numerical representations and comparison mechanisms. Similarly, Stolz, Besner, and Carr (2005; see also Waechter, Stolz, & Besner, 2010) examined the intercorrelations of semantic priming effects across different RT sessions in order to assess the extent to which individual variation in these effects reflects systematic differences in semantic associations between individuals, as opposed to merely statistical noise.

The main purpose of this article was to examine formally the factors influencing the reliabilities and correlations of RT-based measures. Although much has been written about important statistical considerations involved in correlational research using RTs (e.g., Brown, 2011; Jensen, 2006; Sriram, Greenwald, & Nosek, 2010; Stolz et al., 2005), the fundamental questions of exactly what determines RT reliabilities and correlations and of how these quantities are related to the durations of underlying mental processes have not been addressed. In short, we ask in this article, “What aspects of mental processing times affect the reliabilities and correlations of RT-based measures?” The answers to these questions are important not only for the proper interpretation of correlations involving RT, but also for the evaluation of contemplated correlational research protocols (e.g., how power will depend on the number of trials per individual).

The main conclusion of this article is that interpretations of RT-based correlations are far more complicated than has typically been acknowledged. We analyzed several different types of RT-based correlations (e.g., mean RTs, RT effect sizes), each of which has been studied across a wide range of substantive areas. The common finding running through our analyses is that the meanings of these correlations are far more complex than intuition would suggest. Specifically, our analyses reveal that RT-based correlations are difficult to interpret because they are influenced by many factors that are not intuitively obvious. To motivate our analyses, we start with brief descriptions of four prototypical examples arising in quite different research domains. The implications of our analyses are not limited to these prototypical examples, of course, but extend to all situations involving correlations of RT-based measures.

As a first example of the complications associated with RT-based correlations, consider the question of how strongly general intelligence correlates with a given RT effect size—the latter measured either as the difference between experimental and control conditions (e.g., Hunt, 1978) or, equivalently, as the slope relating RT to some quantitative independent variable (e.g., Jensen & Munro, 1979; but see Beauducel & Brocke, 1993). Intuitively, it seems that this correlation should be relatively strong to the extent that extra processing required in the experimental condition is specifically associated with general intelligence. Our analysis shows, however, that this intuition is vastly oversimplified. In fact, the correlation of the RT effect size with intelligence is also influenced by numerous other factors, including the correlation of intelligence with performance in the control condition and the correlation of performance in the experimental and control conditions.

A second and somewhat similar example involves the correlations of explicit measures of socially sensitive attitudes with RT-based measures of those same attitudes obtained with the implicit association test (e.g., Greenwald, McGhee, & Schwartz, 1998; Nosek & Smyth, 2007). Small correlations of explicit and implicit attitude measures have been taken as evidence that these measures tap into different underlying attitude representations or systems (e.g., Greenwald et al., 1998), which seems intuitively to be quite a plausible interpretation of the small correlation. Our analysis suggests, however, that this interpretation is unwarranted, because the influences of other parameters could cause the observed correlations to be quite low even if the two measures are driven by a single attitude system.

A third example involves the correlation between costs and benefits—measured relative to a common neutral condition—in the Stroop color-naming task (e.g., Brown, 2011). Intuitively, it seems clear that this correlation should be large if costs and benefits are determined by the same mechanism, and researchers have therefore measured this correlation to assess single-mechanism accounts of Stroop effects (e.g., Brown, 2011; Lindsay & Jacoby, 1994). Our analysis shows, however, that this intuition is wrong, because the correlation of costs and benefits can be very small or negative even when a single-mechanism account is correct. In particular, the observable correlation of costs and benefits is influenced markedly by the correlations of performance in the two experimental conditions (i.e., facilitation and interference) with that in the neutral condition. Identical complications arise, of course, when interpreting correlations of costs and benefits obtained with a variety of other paradigms (e.g., precuing; Jonides & Mack, 1984).

Finally, as a fourth example, consider the correlation between two RT effects measured under different conditions, such as negative-priming effects assessed with normal versus degraded visual inputs (e.g., Kane, May, Hasher, Rahhal, & Stoltzfus, 1997). Again, intuition leads one to expect that these two effects should be strongly positively associated if they are signs of the same underlying mechanism, as would be expected. The present analysis, however, indicates that the correlation of such effects can be small even when they are driven by a common mechanism, because other factors can easily conceal the expected association between such effects. Analogous problems can even arise when two measures of the same RT effect are correlated across different testing sessions (e.g., Stolz et al., 2005). Low correlations between these two measures may simply indicate low test–retest reliability even though both measures seem necessarily to be driven by a single mechanism.

The individual differences in reaction time (IDRT) model

To explore the factors influencing RT-based reliabilities and correlations, we developed a general framework called the individual differences in RT (IDRT) model. IDRT is a specific classical test theory model, thereby allowing standard measures of classical test theory (e.g., reliabilities and correlations) to be investigated within an RT modeling framework. (Appendix 1 provides selected material on aspects of classical test theory especially relevant to this article.) Broadly, the IDRT model is intended to relate measurable RTs to underlying mental processes across a wide variety of tasks and to subsume standard RT models as special cases (e.g., diffusion models, accumulator models). It is also general enough to capture the key features of prominent models in the literature on individual differences in RT (e.g., Cerella, 1985; Fisher & Glaser, 1996; Hartley, 2001). We used IDRT to analyze various types of reliabilities and correlations that are computed from RT-based measures. The results of the analyses provide insight into how various parameters characterizing the times needed for mental processing would influence these quantities. Table 1 summarizes the different dependent measures whose reliabilities are considered in this article, and Table 2 summarizes the different types of correlational analyses studied—each in its own section. Readers interested in a particular type of analysis can focus on the section of interest but should first read the description of the IDRT model and the sections about the reliability of their measures (i.e., mean RT or difference score). The derivations of IDRT’s predicted reliabilities and correlations are presented in Appendix 2, and the main text provides numerical illustrations of the predictions for various combinations of the model’s parameters. Readers interested in other combinations of model parameters can, of course, use the general equations in Appendix 2 to examine the predicted reliabilities and correlations under any scenario of interest.

Table 1 Dependent measures involved in correlational analyses
Table 2 Types of correlation analyses and notation used

According to the general IDRT model, the total RT is the sum of latencies of several processing stages intervening between the stimulus and the response (e.g., Sternberg, 1969, 2001). The model is agnostic with respect to the nature of the processing within each stage, so this processing could conform to assumptions of diffusion models (e.g., Ratcliff, 1978), accumulator models (e.g., Usher & McClelland, 2001; Vickers, 1970), parallel models (e.g., Townsend & Nozawa, 1995), and so on. In addition, consistent with prior literature exploring individual differences in RT, IDRT proceeds from the assumption that an individual participant’s average latency in an RT task depends on certain latent processing time variables that differ across participants. Thus, IDRT attempts to integrate individual differences into a general framework that can be used to study the psychometric properties of mean RTs and their differences across a variety of tasks.

In general, IDRT represents the observed mean reaction time RT_k of a single participant, k, with the equation

$$ \mathbf{RT}_k = (A + B + C) \cdot G_k + B \cdot \varDelta_k + R_k + \mathbf{E}_k. \tag{1} $$

As is described next, according to this model, the observed mean RT_k is determined by three conceptually separate components.

The first component consists of the mental processing required in each stage of a task. In keeping with a long tradition within RT research, this component of the model characterizes task requirements rather generically as a set of three sequential stages—A, B, and C—that must be carried out between the onset of the stimulus and the initiation of the response (e.g., Donders, 1868/1969; Pashler, 1994; Smith, 1968; Sternberg, 1969). Stages A and C can be conceived as perceptual input and motor output stages, respectively, whereas stage B is a task-specific central stage such as response selection. The constants A, B, and C represent the amounts of work that need to be done in each of the three stages, and these are assumed to depend on the task but not on the person performing it.

The second component of the model represents individual differences in processing time that would allow RT to differ across people performing the same task. In keeping with much current thinking about the relationship of RT to intelligence (e.g., Vernon, 1990), the ability of each participant to perform cognitive tasks is modeled in terms of a general processing time parameter, G_k, which could be related to overall neural processing speed (e.g., Eysenck, 1986; Miller, 1994; Reed & Jensen, 1991; Vernon & Mori, 1992). This parameter may be thought of as the amount of time needed to carry out a single, arbitrarily defined unit of cognitive work, which we will sometimes refer to as the processing time. For example, participant k’s total processing time in the perceptual stage (A) is A·G_k. Analogous approaches in modeling stage processing times or speed parameters can be found in virtually all RT models (e.g., S. Brown & Heathcote, 2005; Kieras & Meyer, 1997; Miller & Ulrich, 2003; Navon & Miller, 2002; Ratcliff & Smith, 2004; Smith, 2000; Tombu & Jolicœur, 2003; Townsend & Nozawa, 1995; Usher & McClelland, 2001; for a review of older examples, see Luce, 1986).

In addition to the general processing time, G_k, the second component also includes a processing time parameter reflecting participant k’s particular facility at the central processing required by a particular task, Δ_k. For example, this participant’s total time for central stage (B) processing is \( B \cdot (G_k + \varDelta_k) \). Because of the parameter Δ_k, the size of an experimental effect on central processing time can vary across participants who have the same general processing time, G_k. Similar parameters could be added to capture individual differences in the durations of stages A and C, but in order to keep the analysis tractable, we considered only tasks varying in the requirements at the central stage (B). In developing this model, we allowed the general and task-specific processing times, G_k and Δ_k, to be correlated across participants, because a positive correlation is predicted by the idea that all cognitive operations depend to some extent on a common underlying neural processing speed (e.g., Eysenck, 1986; Vernon & Mori, 1992).

The second component also includes a residual term, R_k, which reflects individual differences in RT that are uncorrelated, or at least negligibly correlated, with the processing times represented by G_k and Δ_k. These seem most likely to involve fairly peripheral latency components that we assumed are constant across conditions within the same task. For example, R_k may include very early sensory processes starting with light transduction processes within the retina and very late motor processes ending with the activation of single muscle fibers (e.g., Ulrich & Wing, 1991). Although there is evidence that residual processes have at most a weak correlation with overall processing time or intelligence (e.g., Reed & Jensen, 1991; Vernon & Mori, 1992), we nonetheless included the correlation parameters ρ_ΔR and ρ_GR in the model development for completeness and generality.

The third component of the model is a purely statistical error term, E_k. This term arises from the random trial-to-trial variability of RT for a given participant in a given condition. To model this within-condition variability, we assumed that the standard deviation of RT was proportional to the mean, on the basis of prior evidence that the ratio of the standard deviation of RT to the mean RT—also known as the coefficient of variation (CV)—is approximately constant in a number of RT models and data sets (e.g., Luce, 1986; Wagenmakers & Brown, 2007). For a given participant k, then, the variance associated with the error term E for a mean across N trials is

$$ \mathrm{Var}[\mathbf{E}_k] = \left( CV \cdot \mathrm{E}[\mathbf{RT}_k] \right)^2 / N. \tag{2} $$

This assumption implies that the mean and within-subjects standard deviation of RT would be strongly correlated across participants, as is commonly observed (e.g., Jensen, 1992). Although the predictions derived below do not depend on any distributional assumptions about E_k, the central limit theorem implies that this error term would have a nearly normal distribution with a reasonably large number of trials, as is typical in RT research.
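To make Eq. 2 concrete, here is a minimal simulation sketch (with invented, purely illustrative values rather than any taken from the article) that generates many N-trial sessions for a single participant and compares the empirical variance of the session means with the predicted value:

```python
# Sketch: check Eq. 2 by simulation for one participant.
# Assumed, illustrative values: mean RT 500 ms, CV = .2, N = 30 trials.
import numpy as np

rng = np.random.default_rng(0)
mean_rt, cv, n_trials = 500.0, 0.2, 30

# Single-trial RTs with SD proportional to the mean (constant CV); the normal
# shape is for convenience only, since Eq. 2 needs no distributional assumption.
reps = 100_000
trials = rng.normal(mean_rt, cv * mean_rt, size=(reps, n_trials))
session_means = trials.mean(axis=1)

print("empirical Var[E_k]:", session_means.var())             # ~ 333
print("Eq. 2 prediction:  ", (cv * mean_rt) ** 2 / n_trials)  # (.2*500)^2/30 ~ 333
```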

Following classical test theory, the IDRT model for a particular participant (i.e., Eq. 1) is generalized to a randomly selected participant by regarding the person-specific general processing time (G_k), task-specific processing time (Δ_k), and residual time (R_k) as random variables G, Δ, and R. Thus, the RT for a randomly selected participant is

$$ \mathbf{RT} = (A + B + C) \cdot \mathbf{G} + B \cdot \boldsymbol{\varDelta} + \mathbf{R} + \mathbf{E}. \tag{3} $$

To produce more compact expressions, we define \( S = A + B + C \), so that Eq. 3 can be written as

$$ \mathbf{RT} = S \cdot \mathbf{G} + B \cdot \boldsymbol{\varDelta} + \mathbf{R} + \mathbf{E}. \tag{4} $$

The expected value of RT is given by

$$ \mathrm{E}[\mathbf{RT}] = S \cdot \mathrm{E}[\mathbf{G}] + B \cdot \mathrm{E}[\boldsymbol{\varDelta}] + \mathrm{E}[\mathbf{R}], \tag{5} $$

because E[E] = 0. In order to simplify the notation, we generally denote expected values of parameters with terms like μ_G and μ_Δ so that the above expectation may be written

$$ \mathrm{E}[\mathbf{RT}] = S \cdot \mu_G + B \cdot \mu_{\varDelta} + \mu_R, \tag{6} $$

and we similarly denote variances as \( \sigma_G^2 \), \( \sigma_{\varDelta}^2 \), and so on.
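For readers who prefer working code to algebra, the population-level model of Eqs. 4–6 can be simulated directly. The following sketch (with invented parameter values, not the defaults of Table 15) draws correlated G and Δ, adds R and the trial-based error E, and checks the empirical mean against Eq. 6:

```python
# Sketch: simulate RT = S*G + B*Delta + R + E across participants (Eq. 4)
# and check the expected value against Eq. 6. All values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
A, B, C = 100.0, 200.0, 100.0        # units of work per stage (task constants)
S = A + B + C
mu_G, sigma_G = 1.0, 0.1             # general processing time per unit of work
mu_D, sigma_D = 0.0, 0.1             # task-specific central processing time
mu_R, sigma_R = 150.0, 20.0          # residual sensory-motor time (ms)
rho_GD = 0.2                         # assumed Corr[G, Delta]
cv, n_trials, n = 0.2, 30, 100_000

# Correlated draws of (G, Delta); R drawn independently.
cov = [[sigma_G**2, rho_GD * sigma_G * sigma_D],
       [rho_GD * sigma_G * sigma_D, sigma_D**2]]
G, D = rng.multivariate_normal([mu_G, mu_D], cov, size=n).T
R = rng.normal(mu_R, sigma_R, n)

true_mean = S * G + B * D + R                            # true mean RT per person
E = rng.normal(0.0, cv * true_mean / np.sqrt(n_trials))  # N-trial error (Eq. 2)
RT = true_mean + E

print("empirical E[RT]:", RT.mean())                     # ~ 550 ms here
print("Eq. 6:          ", S * mu_G + B * mu_D + mu_R)    # 400*1 + 200*0 + 150
```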

The variance of RT is derived in Appendix 2 for the most general case (Eq. 38). As would be expected, this variance increases with task difficulty (i.e., the amounts of work required in the perceptual, central, and motor stages [A, B, C]) and with the variability across participants of the individual processing time and sensory–motor residual parameters (i.e., G, Δ, and R), as well as with the variance associated with measurement error (E). It also increases with the correlation between the general and task-specific processing times, G and Δ.

For further details concerning standard simplifying assumptions and plausible estimates of parameter values that we used in exploring the predictions of this model, the interested reader can consult Appendix 3. In addition, a preliminary check on the overall plausibility of the model involves its ability to produce realistic Brinley plots, which is examined in Appendix 4.

As was mentioned earlier, IDRT rests on rather general assumptions and subsumes many specific RT models that provide detailed descriptions of the latency mechanisms contributing to individual stages. It should be emphasized that these specific models describe total RT as the sum of a particular modeled stage and a residual component that is beyond the scope of the model (e.g., Jepma, Wagenmakers, & Nieuwenhuis, 2012; Ratcliff, 1978). For example, the information accumulation stage of a diffusion model would correspond to IDRT’s central stage B, and that model’s extra parameter for encoding and motor time would correspond to the sum of IDRT’s times for stages A and C. Similarly, almost all models of detection (e.g., Ollman, 1973; Smith, 1995), discrimination (e.g., Usher & McClelland, 2001), visual and memory search (e.g., Shiffrin & Schneider, 1977), and other cognitive processes postulate a sum of times for the key modeled process and the unmodeled residual processes (Luce, 1986), and this sum corresponds directly to IDRT’s additive stage times. Furthermore, many models that do not describe RT as a sum (e.g., McClelland, 1979) can be approximated well by an additive model (e.g., Miller, Van der Ham, & Sanders, 1995; Molenaar & Van der Molen, 1986). In further work it would also be possible to consider predicted reliabilities and correlations within more elaborate RT models that are not approximately additive, but it seems self-evident that such models would yield relationships even more complicated than the ones emerging from our simple model. In short, the complexity of interpreting correlations that emerges from the present analysis of our simple RT model, IDRT, is likely to provide a lower bound on such complexities within the space of all existing RT models. As was noted by Hillis (1993), “when a system is too complex to understand, it often helps to understand a simpler system with analogous behavior” (p. 80).

Mean reaction times

The most basic RT measure that might be used in correlational research is the mean RT, and this measure has been used by researchers in numerous fields, including intelligence (e.g., Jensen, 1985) and neurological assessment (e.g., Godefroy et al., 1994; Stuss et al., 1994). To interpret the correlations obtained in such studies, it is essential to have a picture of the psychometric properties of mean RTs. In this section, we investigated the reliabilities and correlations of mean RTs using the IDRT model; RT-based difference scores are considered in the next section.

Reliability of mean reaction times

We began by studying the reliability of mean RTs, because the reliability of any measure limits the correlations it might have (see Eq. 25). As in classical test theory, the reliability of a mean RT is the correlation across individuals of two parallel measures, \( \mathrm{Corr}[\mathbf{RT}, \mathbf{RT}'] \) (Lord & Novick, 1968, p. 61).
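For concreteness, this definition can be turned directly into a simulation sketch (with invented parameter values, and with G and Δ drawn independently for simplicity) that estimates the reliability from two parallel N-trial measurements sharing each participant’s true mean RT:

```python
# Sketch: reliability of mean RT as the correlation of two parallel measures.
# Illustrative values only; G and Delta are independent here for brevity.
import numpy as np

rng = np.random.default_rng(2)
A, B, C = 100.0, 200.0, 100.0
S = A + B + C
cv, n_trials, n = 0.2, 30, 50_000

G = rng.normal(1.0, 0.1, n)          # general processing time
D = rng.normal(0.0, 0.1, n)          # task-specific processing time
R = rng.normal(150.0, 20.0, n)       # residual sensory-motor time
true_mean = S * G + B * D + R

err_sd = cv * true_mean / np.sqrt(n_trials)   # Eq. 2
rt_1 = true_mean + rng.normal(0.0, err_sd)    # first parallel measurement
rt_2 = true_mean + rng.normal(0.0, err_sd)    # second parallel measurement

print("estimated reliability Corr[RT, RT']:", np.corrcoef(rt_1, rt_2)[0, 1])
```

With these particular values the estimate comes out near .85; Fig. 1 maps out how reliability changes across parameter combinations.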

Figure 1 shows the reliability of mean RTs under various parameter combinations chosen to reveal the effects of task properties on reliability. Of course, the exact choices of parameter values were necessarily somewhat arbitrary, but we attempted to vary each parameter over its widest possible range of plausible values in order to see how large its effects might be.

Fig. 1 Reliability of mean reaction time, \( \mathrm{Corr}[\mathbf{RT}, \mathbf{RT}'] \), as a function of the number of trials (N) and the task-related variables A, B, and the coefficient of variation (CV). Default values of the other parameters are shown in Table 15. \( \mathrm{Corr}[\mathbf{RT}, \mathbf{RT}'] \) was computed using Eq. 24 from the covariance given in Eq. 40 and from the variances given in Eqs. 38 and 39

Not surprisingly, the reliability of mean RTs is quite strongly affected by the number of trials, N. With these parameter combinations, reliability is often less than .5 for small numbers of trials but generally exceeds .85 with even 10–20 trials. By 100 trials—a number that is sometimes attainable in practice—reliability virtually always exceeds .95. Thus, these results provide some reassurance to researchers planning correlational studies using mean RT: High reliability can be obtained with only moderate numbers of trials.

As is also illustrated by Fig. 1, the reliability of mean RTs is strongly influenced by the RT distribution’s coefficient of variation, CV. The trial-to-trial error variance \( \sigma_E^2 \) increases with this CV, so reliability decreases as the CV increases. Naturally, researchers should take steps to minimize trial-to-trial fluctuations in arousal, attention, and other factors that might increase RT variability. The effect of the coefficient of variation, CV, diminishes rapidly as the number of trials increases, however, so such steps would not be very important with more than approximately 20–30 trials per participant.

Fig. 1 also shows that task difficulty has only a rather small effect on RT reliability. This may at first seem surprising, because it is well known that RTs are more variable in slower tasks, which would tend to increase error variance and thereby reduce reliability. On the other hand, increasing task difficulty also tends to increase the true score variance of the individual participants’ mean RTs and, hence, to increase the covariance across two independent measures (Eq. 40). The results shown in Fig. 1 suggest that these two counteracting forces balance each other out to a good approximation, leaving reliability fairly independent of task difficulty.

Figure 2 illustrates how the reliability of mean RTs depends on the population variability in the individual cognitive processing and sensory–motor residual time parameters, G, Δ, and R. Somewhat arbitrarily, these levels of population variability were chosen to range from very small values—simulating homogeneous populations—to values large enough to yield visible effects on reliability. Reliability increases with variability in all three of these parameters, consistent with the well-known phenomenon that the reliability of any measure depends not only on the measuring instrument itself, but also on the population to which it is applied (e.g., Graham, 2006). In particular, the general rule within classical test theory is that reliability increases with the amount of true score variance (Eq. 24; Lord & Novick, 1968). The most important new message of this figure is that the reliability of a mean RT tends to be high as long as there is variability in at least one of these three population parameters. This message is both good news and bad news for researchers studying correlations of mean RTs with other variables. It is good news because mean RTs can be expected to be highly reliable except in the worst case where there is severe range restriction on all three variables simultaneously (i.e., G, Δ, and R). For example, even in a sample that is restricted with respect to the cognitive processing times G and Δ (e.g., university students), the reliability of mean RTs will be high as long as there is adequate variability in the residual peripheral sensory and motor processing time, R. At the same time, this message is bad news because good reliability per se does not imply adequate sample variation in any particular type of processing time. If mean RTs are highly reliable only because of variation in R, for example, they might fail to correlate with cognitive measures (e.g., IQ) because of range restriction on the critical cognitive processing time parameters reflecting general and task-specific abilities, G and Δ.

Fig. 2 Reliability of mean reaction time, \( \mathrm{Corr}[\mathbf{RT}, \mathbf{RT}'] \), as a function of the population-related variables σ_G, σ_Δ, and σ_R. Default values of the other parameters are shown in Table 15. Correlations were computed using the same equations as those indicated in Fig. 1

Correlation of mean reaction times with another measure

Researchers often correlate the mean RT in one task with some external (i.e., non-RT) measure Y (e.g., IQ, total score on a symptom checklist). Figure 3 illustrates how Corr[RT, Y] is determined by the correlations of Y with the underlying cognitive processing time parameters determining mean RTs (i.e., G and Δ), because researchers computing such correlations seem to be interested mainly in the correlations of Y with these underlying processing times. Clearly, the main determinant of Corr[RT, Y] is ρ_GY. The relationship between these two correlations is remarkably linear, and they are nearly equal in most cases. There is also an effect of ρ_ΔY on Corr[RT, Y], however, and Corr[RT, Y] best matches ρ_GY and ρ_ΔY when the latter two correlations are equal to each other.

Fig. 3 Correlation of mean reaction time, RT, with an external measure, Y, as a function of the underlying correlations ρ_GY and ρ_ΔY. Default values of the other parameters are shown in Table 15. The lines do not cover the full range from ρ_GY = −1 to ρ_GY = 1 because some combinations of ρ_GY and ρ_ΔY are incompatible with the assumption of ρ_ΔG = .2. Corr[RT, Y] was computed using Eq. 23 from the covariance given in Eq. 41 and from the variances given in Eqs. 38 and 39

Figure 3 has two important implications for researchers trying to correlate external measures with mean RTs. First, researchers interested in correlating some such measure (e.g., IQ) with general processing time (G) must be alert to the fact that their observed correlations will inevitably also be influenced to some extent by the task-specific processing time (Δ) of whatever task they choose to use. The general solution to this problem is to use a variety of tasks and extract G as a common factor across all of them, correlating factor scores with the external measure (Jensen, 1993). Second and conversely, researchers trying to study the relationship between an external measure and some task-specific processing time (i.e., to correlate Y with some particular Δ) must be alert to the fact that their observed correlations will be strongly influenced by the correlation of Y with the general processing time G. For example, if G and Y are uncorrelated, the correlation of RT and Y will be less than the correlation between Δ and Y, due to the diluting influence of extraneous variability contributed by G. In fact, the results in Fig. 3 suggest that the correlation of the external measure with general processing time, ρ_GY, almost completely dominates the observed correlation, Corr[RT, Y], making it almost impossible to assess the external measure’s correlation with the task-specific processing time, ρ_ΔY. Similar influences could also arise from the residual component R if it were correlated with Y and varied substantially across individuals.
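The dilution just described is easy to demonstrate. In the sketch below (invented values; the error term E is omitted for clarity), Y correlates .5 with Δ and not at all with G, yet the observable Corr[RT, Y] comes out near .2 because G contributes most of the RT variance:

```python
# Sketch: dilution of Corr[RT, Y] by variability in G. Illustrative values.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
S, B = 400.0, 200.0
rho_DY = 0.5                         # true correlation of Delta with Y

# Draw (Delta, Y) with correlation rho_DY; G and R are independent of Y.
cov = [[1.0, rho_DY], [rho_DY, 1.0]]
z_D, Y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
Delta = 0.1 * z_D                    # sigma_Delta = 0.1
G = rng.normal(1.0, 0.1, n)          # sigma_G = 0.1, uncorrelated with Y
R = rng.normal(150.0, 20.0, n)
RT = S * G + B * Delta + R           # true mean RTs (error term omitted)

print("rho_DeltaY  =", rho_DY)
print("Corr[RT, Y] =", np.corrcoef(RT, Y)[0, 1])   # ~ .20 with these values
```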

Correlation of two mean reaction times

In some situations, researchers might want to examine the correlation across individuals between the mean RTs from two different tasks, RT_x and RT_y, possibly to assess the extent to which these tasks tap into the same versus different mental processes or to validate the equivalence of different RT-based measures of individual differences (e.g., Chen, Myerson, Hale, & Simon, 2000; Kauranen & Vanharanta, 2001; Seashore & Seashore, 1941; Simonen, Videman, Battie, & Gibbons, 1995). Since any general influence on processing time would be the same in both tasks by definition, the between-task correlation would presumably vary mainly with the correlation of the task-specific processing times, \( \rho_{\varDelta_x \varDelta_y} \). Under IDRT, the observed mean RTs for randomly selected individuals in the two tasks (i.e., i = x, y) are

$$ \mathbf{RT}_i = S_i \cdot \mathbf{G} + B_i \cdot \boldsymbol{\varDelta}_i + \mathbf{R}_i + \mathbf{E}_i. \tag{7} $$

Example results shown in Fig. 4 clearly indicate that the correlation between the two RTs is not a good index of the correlation between the two task-specific processing times, \( \rho_{\varDelta_x \varDelta_y} \). In particular, Corr[RT_x, RT_y] is virtually always much higher than \( \rho_{\varDelta_x \varDelta_y} \). This overestimation increases with σ_G, so it is evidently driven by the common contribution of the general processing time, G, to both tasks. The overestimation does lessen somewhat with increases in the variability of the task-specific processing times, \( \sigma_{\varDelta_x} \) and \( \sigma_{\varDelta_y} \), because increases in their variability allow the task-specific times to determine more of the variance in the corresponding RTs. The inadequacy of Corr[RT_x, RT_y] as a measure of the correlation of task-specific times, \( \rho_{\varDelta_x \varDelta_y} \), is also obvious because of the shallow slopes of the lines in Fig. 4, indicating that large changes in \( \rho_{\varDelta_x \varDelta_y} \) produce much smaller changes in Corr[RT_x, RT_y]. Thus, without detailed information about other parameters, the correlation between two RTs provides virtually no information about the correlation of the underlying task-specific processing times, \( \rho_{\varDelta_x \varDelta_y} \).
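To see how severe this overestimation can be, consider a minimal simulation sketch (with invented parameter values rather than the Table 15 defaults, and with the error terms E omitted for clarity): even when the task-specific processing times are completely uncorrelated, the shared term S·G produces a strong correlation between the two mean RTs.

```python
# Sketch: overestimation of rho_DeltaX,DeltaY by Corr[RTx, RTy].
# Illustrative values; error terms omitted.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
Sx, Bx, Sy, By = 400.0, 200.0, 500.0, 250.0
G = rng.normal(1.0, 0.1, n)          # shared across tasks by definition
Dx = rng.normal(0.0, 0.1, n)         # task-specific times, fully uncorrelated
Dy = rng.normal(0.0, 0.1, n)
Rx = rng.normal(150.0, 20.0, n)
Ry = rng.normal(150.0, 20.0, n)
RTx = Sx * G + Bx * Dx + Rx
RTy = Sy * G + By * Dy + Ry

# ~ .7 here, even though rho_DeltaX,DeltaY = 0
print("Corr[RTx, RTy] =", np.corrcoef(RTx, RTy)[0, 1])
```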

Fig. 4 Illustrations of how the correlation of two mean reaction times, RT_x and RT_y, varies as a function of the correlation of the underlying task-specific processing times (\( \rho_{\varDelta_x \varDelta_y} \)), the correlation of each of these processing times with G (ρ_ΔG), and the variability of the processing times σ_G and \( \sigma_{\varDelta_x} = \sigma_{\varDelta_y} = \sigma_{\varDelta} \), with N = 30 trials per participant in each task. A value of \( \rho_{R_x R_y} = .40 \) was assumed. Default values of the other parameters are shown in Table 15. Corr[RT_x, RT_y] was computed using Eq. 23 from the covariance given in Eq. 42 and from the variance given in Eq. 38 for each term

Reaction time difference scores

Often, researchers measure an individual’s performance using a difference between two mean RTs rather than a single overall mean. The difference score is generally used in order to focus more specifically on a particular kind of mental processing. For example, differences in mean memory scanning RTs to small versus large memory set sizes have been used to assess an individual’s speed of retrieval from short-term memory (e.g., Keating & Bobbitt, 1978; Neubauer et al., 1997). Similarly, differences in visual search RTs with displays of different sizes have been used to index perceptual inspection and comparison time (e.g., Schweizer, 1989). Other RT differences have been used to assess interhemispheric communication (e.g., Corballis, 2002; Iacoboni & Zaidel, 2000; Schulte, Pfefferbaum, & Sullivan, 2004), semantic memory access (Hunt, 1978), executive function (e.g., Larson & Clayson, 2011), and multisensory integration (e.g., Barutchu et al., 2011). Within social psychology, the implicit association test (IAT) uses differences in the mean RTs of different stimulus–response mapping conditions to assess an individual’s implicit attitudes about socially sensitive matters such as racial bias (e.g., Greenwald et al., 1998), although there is considerable debate about the meaning of the differences obtained in this way (e.g., Blanton, Jaccard, Gonzales, & Christie, 2006).

Using IDRT, it is also possible to study correlations involving differences in mean RT. For simplicity, we denote the two conditions generically as “experimental” and “control” and, hence, use the subscripts “e” and “c” to distinguish them. Also, we assume that the residual sensory–motor component is the same in all conditions of a given task, so we omit the subscript on R. Thus, the observed mean difference score is

$$ \mathbf{D} = \mathbf{RT}_e - \mathbf{RT}_c = \left[ S_e \cdot \mathbf{G} + B_e \cdot \boldsymbol{\varDelta}_e + \mathbf{R} + \mathbf{E}_e \right] - \left[ S_c \cdot \mathbf{G} + B_c \cdot \boldsymbol{\varDelta}_c + \mathbf{R} + \mathbf{E}_c \right]. \tag{8} $$

We assume that the experimental manipulation only influences the amount of processing needed in the central stage (B), so that A_e = A_c and C_e = C_c, in which case this simplifies to

$$ \mathbf{D} = (B_e - B_c) \cdot \mathbf{G} + B_e \cdot \boldsymbol{\varDelta}_e - B_c \cdot \boldsymbol{\varDelta}_c + \mathbf{E}_e - \mathbf{E}_c. \tag{9} $$

One immediate implication of this model is that the measured difference score does not completely isolate the individual ability of interest—that is, the time needed for task-specific processing in the experimental condition, Δ_e. Equation 9 shows that the difference also depends on both overall processing time, G, and the task-specific processing time in the control condition, Δ_c.

Common versus opposing task-specific processes

It is important to distinguish between two extreme types of RT difference scores that are typically measured in RT experiments. We will refer to these as differences in which the control and experimental conditions have common versus opposing task-specific processes. The distinction is important because the analyses reported below indicate that these two types of RT differences have very different psychometric properties. As later numerical examples illustrate, reliabilities tend to be higher for differences based on opposing processes than for those based on common processes. In contrast, correlations with an external measure tend to be stronger for differences based on common processes.

Comparisons involving common task-specific processes are those in which the control and experimental conditions differ with respect to the amount of some hypothesized mental processing. For example, researchers interested in mental rotation processes might compute the difference between mean RTs to stimuli rotated (say) 90° versus 180°, reasoning that the larger mean RT for the 180° condition reflects the extra time needed for the larger rotation (e.g., Cooper & Podgorny, 1976; Just & Carpenter, 1985). It is usually assumed that each individual participant rotates at approximately the same rate in both conditions, which implies within IDRT that the values of Δ_c and Δ_e would be approximately equal for each individual. Thus, across individuals, the difference shown in Eq. 9 would have strongly positively correlated values of Δ_c and Δ_e—that is, \( \rho_{\varDelta_c \varDelta_e} \gg 0 \)—so it seems natural to refer to these experimental and control conditions as having common task-specific processes. As other examples of differences based on common task-specific processes, researchers might assess the speed of visual search by comparing target-detection RTs in displays with smaller versus larger numbers of items (e.g., Schweizer, 1989), and researchers might assess the speed of short-term memory search by comparing target-detection RTs in conditions with smaller versus larger numbers of items held in memory (e.g., Chiang & Atkinson, 1976; Wilson & O’Donnell, 1986). In such comparisons, the control and experimental conditions differ in the amount of the common process needed (e.g., more mental rotation in the 180° experimental condition, more memory search with more items held in memory).

In contrast, comparisons involving opposing task-specific processes are those in which the control and experimental conditions differ with respect to the consequences of some hypothesized mental processing. As one example, consider color name interference as it is often measured in the Stroop (1935) paradigm. Participants are presented with words displayed in colored letters, and they must name the color of the letters. In a congruent control condition, the word matches the letter color (e.g., the word “red” displayed in red letters). In an incongruent experimental condition, the word is a conflicting color name (e.g., the word “blue” displayed in red letters). The difference between these two conditions reflects the effects of an automatic word reading process. Critically, this effect has opposite consequences for the two conditions. Specifically, stronger automatic processing of the irrelevant word tends to speed responses in the congruent condition (i.e., to reduce Δ_c) but tends to slow them in the incongruent one (i.e., to increase Δ_e). Thus, within IDRT, the difference would involve a strong negative correlation across individuals of Δ_c and Δ_e—that is, \( \rho_{\varDelta_c \varDelta_e} \ll 0 \)—so it seems natural to refer to these experimental and control conditions as having opposing task-specific processes.

Exactly analogous arguments suggest that opposing task-specific processes are involved in many other tasks assessing different kinds of congruence effects, including the flanker effect, the Simon effect, the crossed–uncrossed difference in simple RT tasks (e.g., Hasbroucq, Kornblum, & Osman, 1988), the SNARC effect (e.g., Dehaene, Bossini, & Giraux, 1993), and so on (e.g., Keye, Wilhelm, Oberauer, & Van Ravenzwaaij, 2009; Larson & Clayson, 2011; McConnell & Shore, 2011). In every case, the same processing that speeds responses in the congruent condition tends to slow responses in the incongruent one, so this processing has opposing consequences in the two conditions. Other examples include cue validity effects with spatial, semantic, and response cues (e.g., Huang, Mo, & Li, 2012; McConnell & Shore, 2011; Versace, Mazzetti, & Codispoti, 2008). The cues evoke selective preparation for a particular stimulus location, stimulus identity, or response; that preparation leads to especially fast responses in the valid cue condition, where it is appropriate, but to especially slow responses in the invalid cue condition, where it is inappropriate (i.e., a different stimulus location or identity was presented or a different response was required).

The above distinction focuses on the extremes of strongly correlated task-specific processing times (i.e., \( \rho_{\varDelta_c \varDelta_e} \gg 0 \) and \( \rho_{\varDelta_c \varDelta_e} \ll 0 \)), but intermediate cases are also possible. With the mental rotation task, for example, comparing rotations of 0° versus 180° would lead to less positively correlated task-specific processing times, because rotation is involved in only one of the two conditions. In the Stroop task example, comparing the incongruent condition against a neutral condition of colored Xs would lead to less negatively correlated task-specific processing times, because automatic word reading processes would have little or no effect with the Xs. Similarly, comparisons between primed and unprimed conditions (e.g., Tipper, 1985) and between cued and uncued conditions (e.g., Fan, McCandliss, Sommer, Raz, & Posner, 2002) would tend to be intermediate, because the processes responsible for the priming and cuing effects would simply be absent from the unprimed and uncued control conditions. In the following, to investigate the psychometric properties of various types of difference scores, we used task-specific speed correlations of \( \rho_{\varDelta_c \varDelta_e} = -.8 \), \( \rho_{\varDelta_c \varDelta_e} = 0 \), and \( \rho_{\varDelta_c \varDelta_e} = .8 \) to represent difference scores based on opposing, unrelated, and common processes, respectively.

Reliability of reaction time difference scores

The reliability of a difference score is defined as the correlation of two separate estimates of that difference, \( \mathrm{Corr}[\mathbf{D}, \mathbf{D}'] \). Figure 5 shows how the reliability of RT difference scores depends on a number of task-related variables that might be expected to influence these differences. In general, the number of trials, N, has a large effect, as expected. With some combinations of parameters, though, many more trials are needed to obtain reliable RT difference scores than were required to obtain reliable mean RTs. Hundreds of trials per condition are sometimes needed to produce reliabilities exceeding .8; although not shown in the figure, thousands are sometimes needed for reliabilities exceeding .9. Unless it is practical to obtain thousands of trials per participant and condition, these results raise a caution for researchers studying correlations of RT differences: It may be difficult to obtain high reliability.

Fig. 5 Reliability of a mean reaction time difference score, D, as a function of the number of trials (N), the duration of stage A, the duration of stage B in the experimental condition (B_e), as compared with a duration of B_c = 200 ms in the control condition, and the correlation of task-specific processing times in the two conditions involved in the difference score (\( \rho_{\varDelta_c \varDelta_e} \)). Default values of the other parameters are shown in Table 15. \( \mathrm{Corr}[\mathbf{D}, \mathbf{D}'] \) was computed using Eq. 24 from the covariance given in Eq. 45 and from the variance given in Eq. 43

It is clear from Fig. 5 that RT difference scores computed from opposing tasks (5a, 5d, and 5g: \( \rho_{\varDelta_c \varDelta_e} = -.8 \)) are more reliable than those computed from common tasks (5c, 5f, and 5i: \( \rho_{\varDelta_c \varDelta_e} = .8 \)). RT differences computed from unrelated tasks (5b, 5e, and 5h: \( \rho_{\varDelta_c \varDelta_e} = 0 \)) are intermediate. Furthermore, the Corr[D, D′] difference between common and opposing tasks can be quite large. For example, keeping other parameters constant, reliability could be .8 for opposing tasks but only .1 for common tasks.
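This qualitative pattern can be reproduced with a short simulation sketch (invented parameter values, not those of Table 15), which builds two parallel measurements of D that share each participant’s true scores and differ only in the trial-based error terms:

```python
# Sketch: reliability of an RT difference score for opposing, unrelated,
# and common task-specific processes. Illustrative values throughout.
import numpy as np

rng = np.random.default_rng(5)

def diff_score_reliability(rho_ce, n_trials=100, n=20_000,
                           A=100.0, Bc=200.0, Be=300.0, C=100.0,
                           sigma_G=0.1, sigma_D=0.1, cv=0.2):
    G = rng.normal(1.0, sigma_G, n)
    cov = [[sigma_D**2, rho_ce * sigma_D**2],
           [rho_ce * sigma_D**2, sigma_D**2]]
    Dc, De = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    R = rng.normal(150.0, 20.0, n)
    mean_c = (A + Bc + C) * G + Bc * Dc + R     # true mean RT, control
    mean_e = (A + Be + C) * G + Be * De + R     # true mean RT, experimental

    def one_measurement():                      # one N-trial estimate of D (Eq. 8)
        Ec = rng.normal(0.0, cv * mean_c / np.sqrt(n_trials))
        Ee = rng.normal(0.0, cv * mean_e / np.sqrt(n_trials))
        return (mean_e + Ee) - (mean_c + Ec)

    return np.corrcoef(one_measurement(), one_measurement())[0, 1]

for rho in (-0.8, 0.0, 0.8):
    print(f"rho_DeltaC,DeltaE = {rho:+.1f}: reliability =",
          round(diff_score_reliability(rho), 3))
```

With these particular values, the estimated reliability declines steadily as \( \rho_{\varDelta_c \varDelta_e} \) moves from −.8 to +.8, mirroring the ordering in Fig. 5.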

As would be expected, reliability tends to increase with a larger effect size (i.e., larger values of central stage processing time B_e relative to the fixed B_c). Interestingly, increases in the duration of the perceptual stage A decrease reliability, despite the fact that these stage times are removed from the difference score by the subtraction, and the same pattern is also found with the durations of the motor stage (C). This pattern arises because trial-to-trial error variance grows with overall RT, and overall RT increases when perceptual or motor time (i.e., A or C) increases. For the same reason, reliability would decrease with increases in the coefficient of variation CV, although this is not illustrated in the figure.

Figure 6 illustrates the increases in difference score reliability resulting from increases in the variabilities of the general and task-specific processing times (σ_G and σ_Δ), with the range of parameter values again chosen to produce clear effects. In general, these effects are not too large over the range of parameter values examined here. Again, the difference between common and opposing task-specific processes plays a crucial role, with reliability decreasing dramatically as \( \rho_{\varDelta_c \varDelta_e} \) increases from −.8 to .8.

Fig. 6 Reliability of a mean reaction time difference score, D, as a function of σ_G, σ_Δ, \( \rho_{\varDelta_c \varDelta_e} \), and the number of trials (N). Default values of the other parameters are shown in Table 15. Correlations were computed using the same equations as those indicated in Fig. 5

In summary, researchers wanting to study correlations involving an RT difference score must be aware that many more trials—often an order of magnitude more—are needed to obtain adequate reliability than are needed with mean RTs. Moreover, the relationship between the two conditions entering into the difference score must also be considered, because this relationship has a big effect on the number of trials needed for adequate reliability. Indeed, thousands of trials per condition may be needed when the two conditions involve common task-specific processing, especially if the effect is not too large. Fortunately, the equations presented in Appendix 2 can be used to estimate reliability and, thereby, help determine the number of trials needed to obtain a desired reliability level for a specific set of assumed parameter values that seem appropriate for the difference score under study.

In practice, unfortunately, it might be difficult to estimate the extent to which a given RT difference score involves common versus opposing processes, because this requires estimating the value of \( \rho_{\varDelta_c \varDelta_e} \). At present, we know of no way to do that empirically, so researchers must rely on a theoretical analysis of the tasks entering into the difference score. Although such an analysis seems convincing in most of the cases that have been considered here (e.g., the set-size effect in memory scanning, the Stroop congruency effect), it need not be convincing in every case.

Correlation of reaction time difference scores with another measure

One of the most common uses of RT in correlational research is to study the relationship of an RT difference score, D = RT_e − RT_c, with some external measure, Y (e.g., Greenwald et al., 1998; Hunt, 1978; Williams, Light, Braff, & Ramachandran, 2010). Intuitively, the RT difference score is used in order to remove unwanted influences of general processing time, G, and of residual sensory–motor time, R. The usual goal of the correlation is to assess the relationship between the external measure and the task-specific processing time in the experimental condition, Δ_e.

For example, the difference between RTs in primed and unprimed conditions in the negative-priming paradigm is thought to isolate the effects of inhibitory processes (e.g., Tipper, 1985), and researchers have correlated this measure of inhibitory processes with the severity of schizophrenic symptoms in order to examine the hypothesis that inhibition is disrupted by schizophrenia (e.g., Moritz & Andresen, 2004; Moritz & Mass, 1997). Others have correlated an RT-based measure of hemispheric disconnection known as the crossed–uncrossed difference with other behavioral (e.g., Cherbuin & Brinkman, 2006) and neurophysiological (e.g., Iacoboni & Zaidel, 2004) measures of such disconnection, essentially attempting to determine the validity of the RT-based disconnection measure. Likewise, RT difference scores are sometimes used to isolate particular cognitive processes, such as memory retrieval, that are thought to be especially strongly related to standard psychometric measures of IQ (e.g., Hunt, 1978; Keating & Bobbitt, 1978). As a final example, RT difference scores are now used extensively in social psychology within the context of the IAT (Greenwald et al., 1998), as was mentioned earlier. In the IAT people must classify examples based on two different categorical distinctions (e.g., flowers vs. insects and words having pleasant vs. unpleasant meanings). There are only two possible responses (e.g., left and right hands), and across two experimental conditions, the response assignments for the two distinctions are paired in opposite ways (e.g., flowers + pleasant words vs. insects + unpleasant words in one condition, flowers + unpleasant words vs. insects + pleasant words in the other). If the two distinctions are semantically related, responses should presumably be faster in the condition with two associated categories assigned to the same response (e.g., flowers + pleasant words vs. insects + unpleasant words) than in the condition with two unassociated categories assigned to the same response (e.g., flowers + unpleasant words vs. insects + pleasant words). Thus, the RT difference between these two conditions may be a measure of the strength of the semantic associations between categories (but see Blanton et al., 2006). This RT difference measure is thought to be implicit because it is extracted from performance measures rather than explicit questions about attitudes, and such RT difference scores are often correlated with corresponding explicit attitude measures (for a recent review and meta-analysis, see Hofmann et al., 2005). Moreover, the difference seems to be based on opposing processes, because stronger semantic associations would speed responses with associated categories and slow responses with unassociated categories.

In this section we analyze the correlation of an RT difference score, D = RT_e − RT_c, with an external measure, Y. As in the section on difference score reliability, we assume that a randomly selected participant’s difference score is described by Eq. 9. Figure 7 illustrates the correlations predicted by IDRT under a wide range of combinations of true correlations among model terms. When the researcher’s goal is to assess the correlation between Δ_e and Y, the perfect outcome would be \( \mathrm{Corr}[\mathbf{D}, \mathbf{Y}] = \rho_{\varDelta_e Y} \), which would imply that all points in each panel of the figure would lie exactly on the positive diagonal. As can be seen in the figure, the observable correlation Corr[D, Y] does tend to increase linearly with the correlation of Y with the underlying task-specific processing time, \( \rho_{\varDelta_e Y} \), which is good. Nonetheless, the values of Corr[D, Y] and \( \rho_{\varDelta_e Y} \) are often quite different (i.e., many points are far from the diagonal), so the former is not necessarily a good estimate of the latter. Moreover, depending on the other parameters, the observable correlation Corr[D, Y] can be either larger or smaller than the underlying correlation of interest, \( \rho_{\varDelta_e Y} \), so researchers cannot even be certain whether Corr[D, Y] will tend to underestimate or overestimate the true target value of \( \rho_{\varDelta_e Y} \). Comparisons across panels indicate that the observable Corr[D, Y] tends to increase with increases in the correlation of Y with general processing time, ρ_GY, and to decrease with increases in the correlation of Y with task-specific processing time in the control condition, \( \rho_{\varDelta_c Y} \).

Fig. 7 Correlation of a mean reaction time difference score, D, and an external measure, Y, as a function of the true correlations ρ_GY, \( \rho_{\varDelta_c Y} \), \( \rho_{\varDelta_c \varDelta_e} \), and \( \rho_{\varDelta_e Y} \). Default values of the other parameters are shown in Table 15. Corr[D, Y] was computed using Eq. 47

The results shown in Fig. 7 also indicate that there is a substantial effect of whether the RT difference score is based on common, unrelated, or opposing processes (i.e., \( \rho_{\varDelta_c \varDelta_e} = .8 \), 0, or −.8). Thus, the relation between task-specific processes, \( \rho_{\varDelta_c \varDelta_e} \), has an important effect on the observable correlation Corr[D, Y] even though this parameter does not directly involve Y. As can be seen within each panel, the lines relating the observable Corr[D, Y] to the underlying \( \rho_{\varDelta_e Y} \) are steepest with differences based on common processes and shallowest with differences based on opposing processes. In this sense, Corr[D, Y] may be regarded as a better indicator of \( \rho_{\varDelta_e Y} \) with common rather than opposing processes, although the actual numerical difference between the observable Corr[D, Y] and the target \( \rho_{\varDelta_e Y} \) depends on many parameters and is in many cases smaller with opposing processes than with common ones. In the final analysis, then, an observed value of Corr[D, Y] by itself conveys little information about the correlation between Y and the time needed for task-specific processing in the experimental condition. For example, it seems quite risky to assess the relation between schizophrenia and inhibitory processes by correlating the extent of schizophrenic symptoms with the difference in RTs between a condition with inhibitory negative priming and a neutral control condition, because the observable correlation is influenced by too many factors to provide a good estimate of the association between schizophrenia and inhibitory processes.
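The direction of these influences can be checked directly with another small sketch (invented values; error terms and the correlation \( \rho_{\varDelta_c Y} \) are set to zero for simplicity). Here the target correlation \( \rho_{\varDelta_e Y} \) is held fixed at .4 while ρ_GY varies, and the observable Corr[D, Y] shifts substantially as a result:

```python
# Sketch: Corr[D, Y] drifts with rho_GY even when rho_DeltaE,Y is fixed.
# Illustrative values; rho_DeltaC,Y = 0 and error terms omitted.
import numpy as np

rng = np.random.default_rng(6)

def corr_D_Y(rho_GY, rho_eY=0.4, n=200_000,
             Bc=200.0, Be=300.0, sigma_G=0.1, sigma_D=0.1):
    # Joint standard-normal draw of (G, DeltaE, Y) with the desired correlations.
    cov = [[1.0, 0.0, rho_GY],
           [0.0, 1.0, rho_eY],
           [rho_GY, rho_eY, 1.0]]
    zG, zE, Y = rng.multivariate_normal(np.zeros(3), cov, size=n).T
    G = 1.0 + sigma_G * zG
    De = sigma_D * zE
    Dc = rng.normal(0.0, sigma_D, n)            # independent of Y here
    D = (Be - Bc) * G + Be * De - Bc * Dc       # true difference score (Eq. 9)
    return np.corrcoef(D, Y)[0, 1]

for rho_GY in (-0.4, 0.0, 0.4):
    print(f"rho_GY = {rho_GY:+.1f}: Corr[D, Y] =", round(corr_D_Y(rho_GY), 3))
```

With these values, Corr[D, Y] ranges from well below to above the target value of .4 as ρ_GY changes, even though the correlation of interest never moves.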

Correlation of two distinct reaction time difference scores

Researchers might also want to examine the correlation across participants between two different experimental effects, perhaps in order to estimate the degree to which the effects are determined by the same versus distinct mental processes. Typically, the size of each effect is estimated for each participant by the difference in mean RTs between two conditions of a particular task, and these effect sizes are then correlated.

For example, Fan et al. (2002) developed the Attention Network Test (ANT) in order to obtain separate assessments of three previously suggested attentional networks involved in alerting, spatial orienting, and conflict resolution. Specifically, their goal was “to assess whether or not subjects’ efficiency within each of the [three attentional] networks was correlated” (p. 343) as a test of whether these attentional networks “engage separate brain mechanisms” (p. 344). On each trial, participants were presented with a row of five left- or right-pointing arrows and were required to respond with the left or right hand in accordance with the direction of the target arrow in the center of the row, with the other arrows being distractors (cf. Eriksen & Eriksen, 1974). The time needed for conflict resolution was measured as the difference in mean RT between trials with congruent distractor arrows (i.e., pointing in the same direction as the center target) and trials with incongruent distractors (i.e., pointing in the opposite direction). In addition, the spatial location of the row of arrows was either cued or unpredictable, and the difference between the mean RTs in these conditions was used to assess the efficiency of spatial orienting. Finally, the onset time of the row of arrows was either cued or unpredictable, with the difference between these conditions used to assess alerting. There were no statistically reliable correlations among these three effects, leading the authors to conclude that these effects are mediated by separate processes, although subsequent analyses of the psychometric properties of these measures have called this conclusion into question (e.g., MacLeod et al., 2010). The following analysis using IDRT also suggests that correlations may be small even if there are common components and that Fan et al.’s conclusion is, therefore, only weakly supported by the findings. This is especially true because only approximately 25 trials per condition and participant were included in some of the difference scores, weakening reliability.

Again, we consider one effect to be a comparison between experimental and control conditions, D ce = RT e − RT c , so that the difference for each individual in a particular task is given by Eq. 9. Let the second effect be denoted as D uv = RT v − RT u , so that the analogous equations apply substituting u and v for c and e. As before, we assume that both experimental effects involve changes in the time needed for the central stage, B.

In most situations, researchers seem mainly interested in the correlation of the task-specific processing times of the two experimental conditions, \( {\rho_{{{\varDelta_e}{\varDelta_v}}}} \), so one question is clearly how well that parameter is estimated by Corr[D ce , D uv ] (e.g., Kane et al., 1997). Figure 8 illustrates how the true correlation Corr[D ce , D uv ] varies as a function of the correlation between the task-specific processing times of the two experimental conditions, \( {\rho_{{{\varDelta_e}{\varDelta_v}}}} \), as well as the correlations between other pairs of task-specific processing times (e.g., \( {\rho_{{{\varDelta_c}{\varDelta_e}}}} \), \( {\rho_{{{\varDelta_u}{\varDelta_v}}}} \)). Specifically, the different panels represent correlations in which the difference scores represent different combinations of common, unrelated, and opposing task-specific processes. For example, Fig. 8b represents a correlation between one difference score involving unrelated task-specific processes (i.e., \( {\rho_{{{\varDelta_c}{\varDelta_e}}}}=0 \)) and one difference score involving opposing processes (i.e., \( {\rho_{{{\varDelta_u}{\varDelta_v}}}}=-.8 \)), which corresponds to the correlation of the flanker effect (opposing) and the spatial cuing effect (unrelated) in the study of Fan et al. (2002).

Fig. 8

Correlation of two mean reaction time difference scores, D ce and D uv , as a function of the correlations between task-specific processing times \( {\rho_{{{\varDelta_c}{\varDelta_e}}}} \), \( {\rho_{{{\varDelta_u}{\varDelta_v}}}} \), \( {\rho_{{{\varDelta_c}{\varDelta_u}}}} \), and \( {\rho_{{{\varDelta_e}{\varDelta_v}}}} \). Correlations between two differences involving two opposing, two unrelated, and two common task-specific processes are depicted in a, c, and f, respectively, whereas b, d, and e depict correlations of two differences involving other combinations of task-specific differences. Values of \( {\rho_{{{\varDelta_c}{\varDelta_v}}}}={\rho_{{{\varDelta_e}{\varDelta_u}}}}=0 \) were assumed. Default values of the other parameters are shown in Table 15. Corr[D ce , D uv ] was computed using Eq. 23 from the covariance given in Eq. 48 and from the variance given in Eq. 43 for each term

One important fact influencing the patterns shown in most panels of Fig. 8 is that the interrelationships of the different processing time parameters are tightly constrained. Because of these constraints, some values of \( {\rho_{{{\varDelta_e}{\varDelta_v}}}} \) are impossible given specified values of the other correlations, which causes left and right truncation of the lines in most panels. For example, consider the correlation of two differences both involving opposing task-specific processes (Fig. 8a). If the task-specific processing times of the two control conditions are uncorrelated (i.e., \( {\rho_{{{\varDelta_c}{\varDelta_u}}}}=0 \)) and all of the task-specific processing times have the same small correlation with G (i.e., \( {\rho_{{{\varDelta_c}G}}}={\rho_{{{\varDelta_e}G}}}={\rho_{{{\varDelta_u}G}}}={\rho_{{{\varDelta_v}G}}}=.2 \)), then the task-specific processing times of the two experimental conditions cannot possibly have a strong positive or negative correlation. Instead, their correlation must lie within the range of approximately −.1 to +.3. Thus, researchers interested in estimating such correlations (e.g., the correlation of the Stroop effect with the flanker effect) should keep in mind that the correlation under investigation is part of a complex network of interrelated quantities—not an entirely free parameter like mean effect size.
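
The feasibility constraint just described is easy to check numerically. The following sketch (in Python; the parameter values are illustrative choices, not the Table 15 defaults) assembles the joint correlation matrix of G and the four task-specific processing times and tests whether it is positive semidefinite, the formal requirement for a legitimate correlation matrix; scanning \( {\rho_{{{\varDelta_e}{\varDelta_v}}}} \) then recovers the kind of narrow feasible interval that produces the truncation visible in Fig. 8a.

```python
import numpy as np

def is_valid_corr_matrix(rho_ce, rho_uv, rho_cu, rho_ev,
                         rho_cv=0.0, rho_eu=0.0, rho_dG=0.2):
    """Return True if the hypothesized correlations among the task-specific
    processing times (Delta_c, Delta_e, Delta_u, Delta_v) and the general
    processing time G form a legitimate (positive semidefinite) correlation
    matrix.  rho_dG is a common correlation of each Delta with G, an
    illustrative simplification."""
    R = np.array([
        #   G       Dc       De       Du       Dv
        [1.0,    rho_dG,  rho_dG,  rho_dG,  rho_dG],
        [rho_dG, 1.0,     rho_ce,  rho_cu,  rho_cv],
        [rho_dG, rho_ce,  1.0,     rho_eu,  rho_ev],
        [rho_dG, rho_cu,  rho_eu,  1.0,     rho_uv],
        [rho_dG, rho_cv,  rho_ev,  rho_uv,  1.0],
    ])
    # A correlation matrix is valid only if all eigenvalues are nonnegative.
    return bool(np.all(np.linalg.eigvalsh(R) >= -1e-10))

# Scan the feasible range of the target correlation rho_ev for Fig. 8a-style
# settings: opposing processes within each difference (rho_ce = rho_uv = -.8)
# and uncorrelated control conditions (rho_cu = 0).
feasible = [r for r in np.linspace(-1, 1, 401)
            if is_valid_corr_matrix(-0.8, -0.8, 0.0, r)]
print(f"rho_ev is constrained to roughly [{min(feasible):.2f}, {max(feasible):.2f}]")
```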

Each panel of Fig. 8 reveals a highly linear relation between the observable correlation of difference scores, Corr[D ce , D uv ], and the underlying correlation of the task-specific processing times in the two experimental conditions, \( {\rho_{{{\varDelta_e}{\varDelta_v}}}} \), suggesting that the former could provide a good estimate of the latter within the narrow range of possible values. Unfortunately, the observable and target values (i.e., Corr[D ce , D uv ] and \( {\rho_{{{\varDelta_e}{\varDelta_v}}}} \), respectively) are not generally equal, raising complications for such estimates. For example, consider first the correlation of two differences both involving opposing task-specific processes (Fig. 8a). The observable values of Corr[D ce , D uv ] are systematically closer to zero than the underlying target values of \( {\rho_{{{\varDelta_e}{\varDelta_v}}}} \), making it less likely that researchers will find a statistically significant correlation even when the two task-specific processes are truly correlated (i.e., \( {\rho_{{{\varDelta_e}{\varDelta_v}}}}\ne 0 \)). For example, Fan et al. (2002) might not have obtained correlations among attentional effects even if the same mechanisms influenced task-specific processes in the experimental conditions, depending upon the relationships among the task-specific processes in the control conditions.
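
This attenuation can also be demonstrated by direct simulation of the additive model. In the sketch below, all parameter values (stage durations, standard deviations, trial-mean error) are again illustrative assumptions rather than the Table 15 defaults; the observed Corr[D ce , D uv ] nevertheless shows the characteristic shrinkage toward zero relative to the target \( {\rho_{{{\varDelta_e}{\varDelta_v}}}} \).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # simulated participants; large so Monte Carlo error is negligible

# Joint distribution of (G, Delta_c, Delta_e, Delta_u, Delta_v); values
# are illustrative, chosen to be a valid correlation matrix (see above).
rho_ev = 0.25  # target correlation, inside the feasible range found above
corr = np.array([
    #  G     Dc     De     Du     Dv
    [1.0,  0.2,   0.2,   0.2,   0.2],
    [0.2,  1.0,  -0.8,   0.0,   0.0],
    [0.2, -0.8,   1.0,   0.0,  rho_ev],
    [0.2,  0.0,   0.0,   1.0,  -0.8],
    [0.2,  0.0,   0.0,  -0.8,   1.0],
])
sds = np.array([0.15, 0.2, 0.2, 0.2, 0.2])   # SDs of G and the four Deltas
cov = corr * np.outer(sds, sds)
G, Delta_c, Delta_e, Delta_u, Delta_v = (
    rng.multivariate_normal(np.zeros(5), cov, size=n).T)
G += 1.0  # give G its mean of 1 so stage durations are unchanged on average

# Difference scores under the additive model (Eq. 9 and its u/v analogue),
# including trial-mean error for each condition (15-ms SD, illustrative).
Bc, Be, Bu, Bv = 250.0, 300.0, 250.0, 300.0
trial_err = lambda: 15.0 * rng.standard_normal(n)
D_ce = (Be - Bc) * G + Be * Delta_e - Bc * Delta_c + trial_err() - trial_err()
D_uv = (Bv - Bu) * G + Bv * Delta_v - Bu * Delta_u + trial_err() - trial_err()

# The observed correlation is much closer to zero than the target rho_ev.
print(np.corrcoef(D_ce, D_uv)[0, 1], "vs target", rho_ev)
```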

In addition, the separate lines within Fig. 8a illustrate that the observable Corr[D ce , D uv ] increases with the correlation between task-specific processing times in the two control conditions, \( {\rho_{{{\varDelta_c}{\varDelta_u}}}} \). This effect indicates that the observable Corr[D ce , D uv ] is not a pure measure of the target correlation \( {\rho_{{{\varDelta_e}{\varDelta_v}}}} \). Moreover, it shows that the observable Corr[D ce , D uv ] can differ from zero even when the target correlation \( {\rho_{{{\varDelta_e}{\varDelta_v}}}} \) equals zero, leading to the possibility that researchers will incorrectly reject the null hypothesis of interest (i.e., \( {H_0}:{\rho_{{{\varDelta_e}{\varDelta_v}}}}=0 \)). For example, researchers might incorrectly conclude that two experimental conditions are correlated when, in fact, only the two control conditions are related.

Similar problems for the estimation of the underlying \( {\rho_{{{\varDelta_e}{\varDelta_v}}}} \) by the observable Corr[D ce , D uv ] are also evident in Fig. 8b–f. Specifically, the lines do not generally lie on the positive diagonal, making it difficult for researchers to estimate the underlying correlation of interest, \( {\rho_{{{\varDelta_e}{\varDelta_v}}}} \), from the observable value of Corr[D ce , D uv ]. For example, even across the relatively restricted range of ± .25 examined here, the nuisance parameter \( {\rho_{{{\varDelta_c}{\varDelta_u}}}} \) always has a clear effect on Corr[D ce , D uv ], demonstrating that the observable value depends noticeably on factors other than the underlying correlation of interest.

Correlation of two reaction time difference scores involving the same baseline

A prominent variant of the RT difference score analysis considered in the previous section arises when researchers assess two effects on RT by computing difference scores that involve the same baseline condition. One case in which researchers examine such RT differences, on which we focus here, involves what is sometimes called cost–benefit analysis (e.g., Jonides & Mack, 1984). In this type of analysis, the researcher obtains the mean RTs for three conditions. Condition n is a neutral baseline condition, whereas conditions f and i represent conditions with some type of facilitation and interference, respectively. In the Stroop (1935) color naming task, for example, facilitation is expected when the irrelevant word matches the to-be-named color and inhibition is expected when the irrelevant word names an alternative color, relative to a neutral condition in which the word has no color association (e.g., Brown, 2011).

In cost–benefit analysis, the measured cost of interference is D i = RT i − RT n , the measured benefit of facilitation is D f = RT n − RT f , and the correlation between facilitation and interference is often of theoretical interest. Note that the neutral condition mean RT n enters into the two differences with opposite signs, which tends to produce a negative correlation between these two differences (e.g., Brown, 2011).

Considering as usual the case in which only the central stage B varies across conditions, the RTs for the interference and neutral conditions involved in the cost–benefit analysis can be represented under IDRT as

$$ {\mathbf{RT}_{\mathbf{i}}}=\left( {A+{B_i}+C} \right)\cdot \mathbf{G}+{B_i}\cdot {\varDelta_{\mathbf{i}}}+\mathbf{R}+{\mathbf{E}_{\mathbf{i}}},\;\mathrm{and} $$
(10)
$$ {\mathbf{RT}_{\mathbf{n}}}=\left( {A+{B_n}+C} \right)\cdot \mathbf{G}+{B_n}\cdot {\varDelta_{\mathbf{n}}}+\mathbf{R}+{\mathbf{E}_{\mathbf{n}}}. $$
(11)

It is most intuitive, however, to represent the mean RT in the facilitation condition as

$$ {\mathbf{RT}_{\mathbf{f}}}=\left( {A+{B_f}+C} \right)\cdot \mathbf{G}-{B_f}\cdot {\varDelta_{\mathbf{f}}}+\mathbf{R}+{\mathbf{E}_{\mathbf{f}}}, $$
(12)

with B f · Δ f subtracted from rather than added to the overall total RT f . With this definition of RT f , larger values of Δ f produce smaller values of RT f , so Δ f reflects the amount of facilitation, as one intuitively expects. Furthermore, with this definition a positive correlation of Δ f and Δ i means that larger facilitation is associated with larger inhibition, as one also intuitively expects. Using these definitions, the measured cost of interference, D i = RT i − RT n , and benefit of facilitation, D f = RT n − RT f , are

$$ {\mathbf{D}_{\mathbf{i}}}=\left( {{B_i}-{B_n}} \right)\cdot \mathbf{G}+{B_i}\cdot {\varDelta_{\mathbf{i}}}-{B_n}\cdot {\varDelta_{\mathbf{n}}}+{\mathbf{E}_{\mathbf{i}}}-{\mathbf{E}_{\mathbf{n}}},\quad \mathrm{and} $$
(13)
$$ {\mathbf{D}_{\mathbf{f}}}=\left( {{B_n}-{B_f}} \right)\cdot \mathbf{G}+{B_n}\cdot {\varDelta_{\mathbf{n}}}+{B_f}\cdot {\varDelta_{\mathbf{f}}}+{\mathbf{E}_{\mathbf{n}}}-{\mathbf{E}_{\mathbf{f}}}. $$
(14)

Figure 9 shows the correlation between observable costs and benefits, Corr[D f , D i ], as a function of the correlation of the underlying task-specific processing times in the facilitation and inhibition conditions, \( {\rho_{{{\varDelta_f}{\varDelta_i}}}} \), as well as the overall effect size indexed by central processing time in the interference condition, B i , the variability associated with the general processing time G, and the correlation of task-specific processing times in the inhibition and neutral conditions, \( {\rho_{{{\varDelta_i}{\varDelta_n}}}} \). The figure reveals that the observable Corr[D f , D i ] is linearly related to the underlying correlation of task-specific costs and benefits that is of interest, \( {\rho_{{{\varDelta_f}{\varDelta_i}}}} \), but is not generally equal to it, even approximately. Indeed, Corr[D f , D i ] and \( {\rho_{{{\varDelta_f}{\varDelta_i}}}} \) often differ in sign, so the observed value need not even capture the true direction of the correlation of interest. Furthermore, when the value of the underlying target correlation \( {\rho_{{{\varDelta_f}{\varDelta_i}}}} \) approaches the extremes of ±1, the observable Corr[D f , D i ] is far too small in absolute value. Both within and across panels, it is also clear that the other parameters have substantial effects on the observable Corr[D f , D i ] (e.g., the population variability of general processing time, σ G ), making it impossible to estimate the target \( {\rho_{{{\varDelta_f}{\varDelta_i}}}} \) from the observable Corr[D f , D i ] without precise information about the values of these other parameters. It is especially noteworthy that the observable Corr[D f , D i ] depends on the correlations of task-specific processing time in the neutral condition with the time in the inhibition condition (\( {\rho_{{{\varDelta_i}{\varDelta_n}}}} \), illustrated across panels) and with the time in the facilitation condition (\( {\rho_{{{\varDelta_f}{\varDelta_n}}}} \), not illustrated). Thus, researchers studying the relationship between facilitation and interference must allow for effects of these auxiliary neutral-condition correlations on the observable Corr[D f , D i ]. Debates about the appropriateness of different neutral conditions have previously focused on mean RT (e.g., Brown, 2011), but this analysis shows that the correlation of this condition with the facilitation and inhibition conditions must also be considered when effect-size correlations are being investigated.
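
Because Eqs. 13 and 14 express the cost and benefit as linear combinations of the underlying random variables, the predicted Corr[D f , D i ] can also be approximated by Monte Carlo simulation rather than from the exact formulas. The following sketch uses illustrative parameter values (only the sign convention \( {\rho_{{{\varDelta_f}G}}}=-.2 \) versus \( {\rho_{{{\varDelta_i}G}}}={\rho_{{{\varDelta_n}G}}}=.2 \) follows the Fig. 9 caption); with these settings, the observed correlation even has the opposite sign from the underlying \( {\rho_{{{\varDelta_f}{\varDelta_i}}}} \).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Joint distribution of (G, Delta_f, Delta_n, Delta_i).  Following the Fig. 9
# caption, Delta_f correlates -.2 with G (it is subtracted in Eq. 12) while
# Delta_n and Delta_i correlate +.2 with G; other values are illustrative.
rho_fi = 0.5  # target correlation of facilitation- and inhibition-specific times
corr = np.array([
    #  G      Df      Dn     Di
    [1.0,  -0.2,    0.2,   0.2],
    [-0.2,  1.0,    0.0,  rho_fi],
    [0.2,   0.0,    1.0,   0.0],
    [0.2,  rho_fi,  0.0,   1.0],
])
sds = np.array([0.15, 0.2, 0.2, 0.2])
cov = corr * np.outer(sds, sds)
G, Delta_f, Delta_n, Delta_i = rng.multivariate_normal(np.zeros(4), cov, size=n).T
G += 1.0

B_f, B_n, B_i = 200.0, 250.0, 300.0   # stage B durations (ms)
E_f, E_n, E_i = (15.0 * rng.standard_normal(n) for _ in range(3))

cost    = (B_i - B_n) * G + B_i * Delta_i - B_n * Delta_n + E_i - E_n   # Eq. 13
benefit = (B_n - B_f) * G + B_n * Delta_n + B_f * Delta_f + E_n - E_f   # Eq. 14

# With these settings the observed cost-benefit correlation is negative even
# though rho_{Delta_f Delta_i} = +.5.  (Fig. 10's single-mechanism case,
# rho_fi = 1, additionally requires rho_{Delta_f G} = rho_{Delta_i G} = 0
# for the correlation matrix to remain valid; see the Fig. 10 caption.)
print(np.corrcoef(benefit, cost)[0, 1], "vs target", rho_fi)
```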

Fig. 9

Correlation of mean reaction time difference scores measuring the benefit of facilitation, D f = RT n − RT f , and the cost of interference, D i = RT i − RT n , computed using a common neutral condition, RT n . Correlations are displayed as a function of σ G ; the amount of stage B processing needed in the inhibition condition, B i , which determines the mean cost relative to the fixed stage B processing amounts in the facilitation and neutral conditions (i.e., B f = 200 ms and B n = 250 ms); the correlation of the task-specific processing times in the neutral and inhibition conditions, \( {\rho_{{{\varDelta_i}{\varDelta_n}}}} \); and the correlation of the task-specific processing times in the facilitation and inhibition conditions, \( {\rho_{{{\varDelta_f}{\varDelta_i}}}} \). The value of \( {\rho_{{{\varDelta_f}G}}}=-.2 \) was assumed—in contrast to the values of \( {\rho_{{{\varDelta_i}G}}}={\rho_{{{\varDelta_n}G}}}=.2 \)—because the task-specific processing time was subtracted from the total RT in the benefit condition (Eq. 12), rather than added to it as in the cost and neutral conditions (Eqs. 10 and 11). Default values of the other parameters are shown in Table 15. Corr[D f , D i ] was computed using Eq. 23 from the covariance given in Eq. 49. The variance of D i is given by Eq. 43, with the neutral and inhibition conditions corresponding to the control and experimental conditions, respectively, and the variance of D f is given by Eq. 44

One particularly interesting special case of a correlation involving difference scores with a common neutral term arises in the analysis of Stroop effects (e.g., MacLeod, 1991). As was reviewed by Brown (2011), many models of the Stroop task posit that facilitation in the congruent condition and interference in the incongruent condition are driven by a single underlying mechanism for automatic word recognition. Within these models, one would naturally expect both a perfect correlation of the underlying facilitation and inhibition (i.e., \( {\rho_{{{\varDelta_f}{\varDelta_i}}}}=1 \)) and, consequently, a strong correlation of the measured costs and benefits (i.e., \( \mathrm{Corr}\left[ {{{\mathbf{D}}_{\mathbf{f}}},{{\mathbf{D}}_{\mathbf{i}}}} \right]\gg 0 \)).

Figure 10 shows example correlations computed for this single-mechanism special case. As is evident from the figure, the correlation of the observed RT facilitation and inhibition (i.e., Corr[D f , D i ]) can be quite low even when the single-mechanism model is correct (i.e., \( {\rho_{{{\varDelta_f}{\varDelta_i}}}}=1 \)). In fact, the correlation can even be negative, which is possible because the random error component of the neutral condition, E n , enters into the facilitation and interference measures with opposite signs (Brown, 2011). Thus, the clear implication of Fig. 10 is that researchers cannot confidently reject single-mechanism accounts of facilitation and interference based on small correlations of measured facilitation and interference. This example also illustrates another way in which IDRT can be helpful; namely, by providing exact numerical values—possibly rather unexpected ones—for correlations that might be predicted by qualitative theories such as the single-mechanism account.

Fig. 10

Correlation of mean reaction time difference scores measuring the benefit of facilitation, D f = RT n − RT f , and the cost of interference, D i = RT i − RT n , computed using a common neutral condition, RT n , for the special case in which a single mechanism is responsible for both facilitation and inhibition (i.e., \( {\rho_{{{\varDelta_f}{\varDelta_i}}}}=1 \)). Correlations are displayed as a function of the number of trials, the amount of cost indexed by B i relative to fixed values of B f = 200 ms and B n = 250 ms, the standard deviation of G, and the common standard deviation of all task-specific processing times (\( {\sigma_{{{\varDelta_f}}}}={\sigma_{{{\varDelta_n}}}}={\sigma_{{{\varDelta_i}}}}\equiv {\sigma_{\varDelta }} \)). Values of \( {\rho_{{{\varDelta_f}G}}}={\rho_{{{\varDelta_i}G}}}=0 \) were assumed, in contrast to \( {\rho_{{{\varDelta_n}G}}}=.2 \), because of two constraints inherent in the single-mechanism model. First, Δ f and Δ i necessarily have the same correlation with G if they are perfectly correlated with one another. Second, Δ f and Δ i have opposite correlations with RT f and RT i , respectively, because the facilitation term is subtracted from RT f , whereas the cost term is added to RT i (Eqs. 12 and 10). If G is to be equally correlated with RT f and RT i , then, \( {\rho_{{{\varDelta_f}G}}}={\rho_{{{\varDelta_i}G}}}=0 \) is the only possibility. Default values of the other parameters are shown in Table 15. Correlations were computed using the same equations indicated in Fig. 9

Correlations between mean reaction times and reaction time difference scores

In some situations, researchers correlate mean RTs with difference scores. As was discussed by Chapman, Chapman, Curran and Miller (1994), for example, one main motivation for such correlations is that slower individuals or groups generally show larger effects of experimental manipulations in most RT tasks. This is to be expected, they argued, because “slow subjects tend to be slow in most aspects of performance with the result that they show greater differences than fast subjects between long-latency and short-latency tasks. By analogy, slow typists tend to show a larger difference in completion times between a long manuscript and a short one than do fast typists” (p. 162). It is particularly important to understand correlations between mean RT and effect size because these correlations complicate the interpretation of different-sized effects found for groups that differ in overall ability (Chapman et al., 1994). In this section we examine the correlation of a mean RT with an RT difference score.

One simple and intuitive way to quantify the relationship between overall RT and effect size is by correlating the observed mean RT in a control condition, RT c , with the size of the experimental effect, D = RT e − RT c . On the basis of the idea that the effects of the experimental manipulation might depend on a participant’s overall processing time, one might expect Corr[RT c , D] to reflect primarily the correlation of the general processing time with the task-specific processing time in the experimental condition, \( {\rho_{{{\varDelta_e}G}}} \). On the other hand, given that RT c enters with opposite signs into the two measurements being correlated (i.e., RT c and D), one might expect the correlation to be negative. Given these two conflicting expectations, it is not surprising that the true situation is more complicated than either one suggests.

The influences of various factors on the observable Corr[RT c , D] can be studied within IDRT. Again we assume that the RTs in the control and experimental conditions are given by Eq. 7, with i = c, e, and that these two conditions differ only with respect to processing in the central stage B, so the observed experimental effect D is given by Eq. 9. Figure 11 illustrates how the correlation of the control RT and the difference score depends on several key parameters. First, there is a clear tendency for the observable Corr[RT c , D] to increase with the underlying correlation of the general processing time with the task-specific processing time in the experimental condition, \( {\rho_{{{\varDelta_e}Y}}} \) counterpart for RT—that is, \( {\rho_{{{\varDelta_e}G}}} \)—as is intuitively expected. Nonetheless, these two correlations may be quite different, as is indicated by the deviations from the positive diagonal. In fact, depending on the values of the other parameters, the observable Corr[RT c , D] can be substantially larger or smaller than the underlying target correlation \( {\rho_{{{\varDelta_e}G}}} \). For example, as is shown within each panel, the observable Corr[RT c , D] tends to increase with the population variability of general processing time, σ G , as would be expected because both the mean and the difference tend to increase with G. Not surprisingly, the observable Corr[RT c , D] also tends to increase with the size of the experimental effect (i.e., with the duration of the central stage B e ). In addition, the observable correlation increases with the correlation of task-specific processing times, \( {\rho_{{{\varDelta_c}{\varDelta_e}}}} \), which implies that larger observable correlations would be expected when differences involve common, rather than opposing, task-specific processes.
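
A simulation sketch of this scenario, once more with purely illustrative parameter values rather than the Table 15 defaults, makes the gap between the observable and target correlations concrete:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Joint distribution of (G, Delta_c, Delta_e); all values illustrative.
rho_eG, rho_cG, rho_ce = 0.4, 0.2, 0.5
corr = np.array([
    [1.0,    rho_cG, rho_eG],
    [rho_cG, 1.0,    rho_ce],
    [rho_eG, rho_ce, 1.0],
])
sds = np.array([0.15, 0.2, 0.2])
cov = corr * np.outer(sds, sds)
G, Delta_c, Delta_e = rng.multivariate_normal(np.zeros(3), cov, size=n).T
G += 1.0

A, B_c, B_e, C = 100.0, 250.0, 300.0, 100.0   # stage durations (ms)
R = 50.0 * rng.standard_normal(n)             # person-level sensory-motor residual
E_c, E_e = (15.0 * rng.standard_normal(n) for _ in range(2))

RT_c = (A + B_c + C) * G + B_c * Delta_c + R + E_c   # Eq. 7 with i = c
RT_e = (A + B_e + C) * G + B_e * Delta_e + R + E_e   # Eq. 7 with i = e
D = RT_e - RT_c                                      # Eq. 9

# The observed correlation mixes the target rho_{Delta_e G} with several
# other influences (sigma_G, B_e, rho_ce, and the shared RT_c term), so it
# can fall far from the target value.
print(np.corrcoef(RT_c, D)[0, 1], "vs target rho_eG =", rho_eG)
```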

Fig. 11

Correlation of the mean reaction time in a control condition, RT c , with the difference between means, D = RT e − RT c . Correlations are displayed as a function of the size of the experimental effect, indexed by B e , the standard deviation of G, and the correlations \( {\rho_{{{\varDelta_c}{\varDelta_e}}}} \) and \( {\rho_{{{\varDelta_e}G}}} \). Default values of the other parameters are shown in Table 15. Corr[RT c , D] was computed using Eq. 23 from the covariance given in Eq. 50 and from the variances given in Eqs. 38 and 43

A slightly different way to assess the relationship between effect size and processing time is to examine the correlation between the size of an experimental effect and the average of the RTs in the control and experimental conditions, \( \overline{\mathbf{RT}}={{\left( {\mathbf{RT}_{\mathbf{c}}+\mathbf{RT}_{\mathbf{e}}} \right)}/{2}} \) (e.g., Chapman et al., 1994; Gignac & Vernon, 2004). The analysis of this correlation shows effects that are quite similar to those shown in Fig. 11, so the situation apparently only changes minimally when using \( {{\left( {\mathbf{RT}_{\mathbf{c}}+\mathbf{RT}_{\mathbf{e}}} \right)}/{2}} \) rather than RT c to index faster versus slower participants. The main difference between the two cases is that the correlations are slightly higher using \( \overline{\mathbf{RT}} \). This makes sense, because the correlation would tend to be increased by the positive contribution of the experimental condition mean, RT e , to both of the terms being correlated. The bottom line, then, is that many factors influence the correlation between overall processing time and effect size within IDRT, whether processing time is indexed by RT c or \( {{\left( {\mathbf{RT}_{\mathbf{c}}+\mathbf{RT}_{\mathbf{e}}} \right)}/{2}} \). In particular, the model suggests that it will not be easy to determine whether a slower group shows a larger effect size just because of general slowing (i.e., changes in G). Instead, the appropriate adjustment in effect size for general slowing depends on a number of parameters, and this adjustment can be determined only within the context of a specific model.

General discussion

Analyses of correlations involving RT play a prominent role in research investigating both individual differences (e.g., Jensen, 1985, 1993; Sheppard & Vernon, 2008) and basic cognitive processes (e.g., Corballis, 2002; Stolz et al., 2005). Although researchers conducting such analyses have often considered classical psychometric testing concepts (e.g., reliability) in assessing their correlations, no attempt has been made to assess the precise meanings of these correlations within the framework of standard RT models. Therefore, the goal of this article was to investigate how RT-based correlations would be influenced by the various underlying processes within standard RT models. To achieve that, we developed a general model of individual differences in RT, called IDRT, and linked this model to psychometric concepts from classical test theory. This linkage was especially direct because IDRT involves a linear combination of random variables. We explored the consequences of this model for several different types of correlational analyses involving RTs. In particular, the model’s predictions can be determined regarding correlations involving both mean RTs and difference scores (see Table 2).

Our model is based on the simple assumption that processing from stimulus input to response output proceeds via a series of computational steps or stages whose durations sum to produce the overall RT. Within IDRT, individuals of course differ in the time needed to carry out each of the stages (e.g., Vernon, 1990). This model makes it possible to distinguish between general and task-specific processing times, and the influences of these times on various observable correlations can be assessed, thereby helping to elucidate the precise meanings of such correlations. This model is attractive because of its simplicity, generality, and extensive theoretical development (e.g., Donders, 1868/1969; Smith, 1969; Sternberg, 1969, 2001). At the same time, as was discussed in the section “The individual differences in reaction time (IDRT) model,” it seems plausible that the conclusions emerging from this simple model system would also be applicable within more detailed models providing a richer description of specific RT tasks (Hillis, 1993).

Implications regarding correlations

The most important general conclusion emerging from our analysis is that the observable correlations involving RT means and difference scores depend on many factors influencing performance within the task, including characteristics of both the task (e.g., times needed for perceptual, central, and motor processing [A, B, C]) and the population (e.g., variability of general and task-specific processing times, and their correlation [σ G , σ Δ, and ρ ΔG ]). Obviously, the fact that the observable correlations are influenced by so many parameters greatly complicates the interpretation of any particular observed correlation. This finding critically underscores the need for extreme caution in interpreting observed correlations, especially because there are cases in which correlations can be expected to be far higher or far lower than the correlations of internal parameters that they might be intuitively assumed to measure. Ultimately, this finding raises the question of just what is actually being learned about individual differences and mental processes by studying such correlations. Although the present general model provides a first step toward understanding the implications of RT-based correlations, it is clear that there is a long way to go before it will be possible to draw strong conclusions from the size or in some cases even from the direction of an RT-based correlation.

The equations for the correlations predicted by the model clearly illustrate the above general conclusion for each of the different scenarios we examined (see Table 2). For example, Eq. 42 shows that even the correlation between two mean RTs—one of the simplest cases—depends on at least ten parameters. From an observed correlation of means, then, it is impossible to estimate the value of a single parameter of interest without detailed knowledge about the other parameters. As a second example, Eq. 50 shows how the correlation between an effect size and the mean RT in the control condition depends on many parameters affecting the mean RTs in both the experimental and control conditions. Thus, although it seems intuitively reasonable to ask about such a correlation (i.e., whether an effect size increases for individuals who are slower in the control condition), the observed correlation value simply has no straightforward interpretation in terms of the underlying RT processes involved in the two conditions.

The present findings weaken many previous conclusions based on correlations of RT measures. As one example, consider the finding of low correlations among several different attentional effects reported by Fan et al. (2002), which was discussed in the earlier section “Correlation of two distinct reaction time difference scores”. It is certainly possible that these low correlations reflect truly independent neural mechanisms involved in the different attentional systems assessed via the RT difference scores, which is what Fan et al. concluded. Figure 8 shows, however, that—depending on the values of the other parameters—there could actually be a rather high correlation between the times needed for the task-specific mechanisms used in the experimental conditions (i.e., \( {\rho_{{{\varDelta_e}{\varDelta_v}}}} \)), despite the fact that the correlation of the RT difference scores is very low. Thus, it is possible that a low correlation of difference scores actually provides only illusory evidence of functional dissociations between the task-specific mechanisms under study. Before accepting a low correlation as evidence of a functional dissociation, then, it would be necessary to show that the values of the other parameters were not responsible for the low observed correlation. Unfortunately, it is not yet clear how to do this, because none of these parameters can be estimated directly.

Similarly, the present results raise doubts about the interpretation of weak correlations between RT-based measures of facilitation and interference. Some models of the Stroop (1935) task suggest, for example, that interference and facilitation are driven by the same underlying word recognition mechanism. Because the two effects are opposite sides of the same coin within these models, the models seem intuitively to predict that the effects should be strongly correlated. On that basis, findings of weak correlations have been regarded as evidence against single-mechanism models (Brown, 2011). As is illustrated in Fig. 10, however, such models need not predict a strong correlation between facilitation and interference. Depending on the values of other parameters, they may predict small or even negative correlations. Thus, the absence of a strong correlation between facilitation and interference does not actually imply the existence of separate mechanisms underlying the two effects.

Despite the complications evident in the formulas, the numerical results indicate that some types of observed correlations do sometimes provide very good estimates of underlying relationships of interest to researchers. For example, Fig. 3 shows that the correlation Corr[RT, Y] between mean RT and the score on an external (i.e., non-RT) measure, Y, is often quite similar to the correlation of general processing time, G, with that measure. This result provides encouragement that it may be possible to use mean RTs for fairly accurate assessments of correlations of external measures with general processing time, and it is therefore quite consistent with the large literature suggesting that the mean RTs of many tasks correlate well with general intelligence (e.g., Jensen, 2006).

What can be concluded from RT-based correlations?

Even if RT-based correlations are highly replicable empirical phenomena, they may be devilishly complicated to interpret. Given the multiplicity of factors influencing RT-based correlations of each type shown in Table 2, researchers must obviously be cautious in interpreting observed values of these correlations. Statistical reliability of the observed values should be assessed as usual, but the interpretations of both statistically significant and nonsignificant correlations must also take into account the many possible influences that could be responsible for the results.

Consider, for example, a significant correlation between the mean RTs of two distinct tasks, RT x and RT y . Although it may be tempting to attribute this correlation to a common task-specific central process hypothesized to be involved in both tasks (i.e., Δ x = Δ y), the correlation may actually be produced mainly by something else entirely, such as a general processing time parameter (G) common to all tasks (e.g., Jensen, 2006) or a correlation of the sensory–motor residual times (R) of the two tasks.

It seems clear that relatively sophisticated research strategies will be required to reach strong conclusions from between-task correlations of mean RTs. Specifically, researchers will need to base their conclusions on the patterns of correlations across a range of tasks—not just pairs of tasks. For example, suppose that the correlation of RT x and RT y is demonstrably higher than the correlation of either of these with a third task’s RT z (within the same sample of participants). If the third task were constructed so that RT z involved the same sensory–motor residual times as RT x and RT y , and if all three tasks appeared to depend equally strongly on general processing time G (e.g., because all correlated equally well with IQ), then the researcher would clearly be on stronger ground in attributing at least part of the RT x /RT y correlation to a hypothesized common central process contributing to RT x and RT y but not RT z .

In view of the multiplicity of influences on correlations, it is perhaps surprising that the present results do suggest that very strong correlations can have quite specific implications. Consider, for example, Iacoboni and Zaidel’s (2004) report of a .9 correlation between the crossed–uncrossed difference in simple RT (i.e., stimulus light on the same vs. opposite side of the body midline as the respond hand) and an fMRI-based measure of activity in the right superior parietal cortex. From this strong relationship, they concluded that this area has “a key role . . . in the type of interhemispheric visuo-motor integration required by [the task]” (Iacoboni & Zaidel, 2004, p. 423), but even more specific conclusions can be reached on the basis of IDRT. First, it seems clear that this RT difference must reflect opposing or unrelated task-specific processes in the two RT conditions being compared (i.e., \( {\rho_{{{\varDelta_c}{\varDelta_e}}}}\leq 0 \)), because such a strong correlation is not found with common task-specific processes (i.e., \( {\rho_{{{\varDelta_c}{\varDelta_e}}}}\gg 0 \); Fig. 7). Second, activity in the right superior parietal cortex must have been both negatively correlated with RT in the uncrossed condition (\( {\rho_{{{\varDelta_c}Y}}}\ll 0 \)) and positively correlated with RT in the crossed condition (\( {\rho_{{{\varDelta_e}Y}}}\gg 0 \)), because very strong correlations of RT difference scores with an external measure are not found unless both of these requirements are met (e.g., Fig. 7). In short, given the multiplicity of influences on RT-based correlations, extreme values near ±1 can be found only when most or all of the relevant parameters have certain required settings.

On the other hand, the interpretations of small correlations are much more poorly constrained. Consider, for example, possible interpretations of the finding that two RT-based effect sizes are only weakly correlated, as in the case of Fan et al.’s (2002) attentional effects. It would be tempting to conclude that different mechanisms are responsible for the effects in the two experimental conditions, but our analysis shows that other interpretations are possible. For instance, the correlation of the two effect sizes is also strongly influenced by the correlation of the task-specific processing times in the two control conditions (Fig. 8), and the effect sizes could be weakly correlated even when the same mechanism was responsible for both effects if the task-specific processing times in the two control conditions were negatively correlated. Indeed, given the rich set of constraints among the correlations of the four conditions involved in the difference scores (i.e., two control conditions and two experimental conditions), it seems clear that the entire set of correlations needs to be examined when assessing the mechanisms involved in producing the different experimental effects.

Again, more sophisticated research strategies can help to strengthen conclusions from correlations of RT-based effects. As an example, consider the study of Miles and Proctor (2012), who examined correlations of Simon compatibility effects obtained with three different types of stimulus materials (i.e., locations, arrows, and words). They found a significant correlation between the compatibility effects obtained with arrows and words, but no correlation between either of these effects and the compatibility effect obtained with location stimuli. This pattern of changing correlations among fairly similar tasks provides stronger support for the claim that the underlying mechanisms responsible for Simon effects with arrows and words have more in common with each other than either one does with the mechanisms responsible for location-based Simon effects. On the other hand, it is impossible to be certain about this conclusion in the absence of a complete model for the task, because it might be possible to construct models that produce unequal correlations despite having common mechanisms for all three stimulus types. The present work strongly suggests that the proper interpretation of RT correlations requires explicit models, perhaps even more so than the interpretation of mean RT results.

It should be emphasized that our conclusions about RT-based correlations apply only to situations where correlations are computed across participants—not where they are computed across trials for a given participant. In studies of the performance of two successive tasks within the psychological refractory period paradigm, for example, many researchers have examined the correlation across trials between the RTs of the two tasks (e.g., Davis, 1959; Pashler & Johnston, 1989; Sigman & Dehaene, 2006; Way & Gottsdanker, 1968). Both the task parameters and the individual-difference parameters of the present version of IDRT (e.g., A, B, G, Δ, R) would be held constant across trials within such correlations, so an elaborated version of the model would be needed for the analysis of such correlations.

Would partial correlations avoid the problems of difference scores?

As has been discussed already, the main rationale for using an RT difference score like D = RT e − RT c is usually to isolate the influence of a specific processing stage lengthened in the experimental condition, removing the effects of stages common to the experimental and control conditions. For example, the correlation of RT benefits and costs measured relative to a common neutral condition, \( \mathrm{Corr}\left[ {\mathbf{R}{{\mathbf{T}}_{\mathbf{n}}}-\mathbf{R}{{\mathbf{T}}_{\mathbf{f}}},\mathbf{R}{{\mathbf{T}}_{\mathbf{i}}}-\mathbf{R}{{\mathbf{T}}_{\mathbf{n}}}} \right]=\mathrm{Corr}\left[ {{{\mathbf{D}}_{\mathbf{f}}},{{\mathbf{D}}_{\mathbf{i}}}} \right] \), is intended to assess the relationship between the task-specific processes generating those costs and benefits, \( {\rho_{{{\varDelta_f}{\varDelta_i}}}} \). We have seen, however, that the observable correlation Corr[D f , D i ] does not accomplish its intended task (e.g., Fig. 9), because this correlation is influenced by many parameters other than the intended one.

Some readers might wonder whether partial correlations would avoid the difficulties associated with difference scores. For example, researchers could compute the partial correlation of the mean RTs in the facilitation and inhibition conditions, partialling out the mean RT in the neutral condition, \( \mathrm{Corr}\left[ {\mathbf{R}{{\mathbf{T}}_{\mathbf{f}}},\mathbf{R}{{\mathbf{T}}_{\mathbf{i}}}|\mathbf{R}{{\mathbf{T}}_{\mathbf{n}}}} \right] \). The partial correlation seems to have some intuitive appeal for this purpose, given its usual interpretation as “removing the effect of” the variable being partialled out. Thus, for example, this partial correlation might intuitively be expected to provide another way to assess the relationship between benefits and costs after removing effects that were common with the neutral condition, just as the difference score was meant to do. In fact, however, despite the intuitive similarity of the partial correlation and the difference score in removing effects of neutral condition performance, these measures are not identical (i.e., \( \mathrm{Corr}\left[ {\left. {\mathbf{R}{{\mathbf{T}}_{\mathbf{f}}},\mathbf{R}{{\mathbf{T}}_{\mathbf{i}}}} \right|\mathbf{R}{{\mathbf{T}}_{\mathbf{n}}}} \right]\ne \mathrm{Corr}\left[ {{{\mathbf{D}}_{\mathbf{f}}},{{\mathbf{D}}_{\mathbf{i}}}} \right] \)). Therefore, it is mathematically possible that the partial correlation would directly assess the desired relationship of the task-specific processing times in the facilitation and inhibition conditions, \( {\rho_{{{\varDelta_f}{\varDelta_i}}}} \), even though correlation of the difference scores, Corr[D f , D i ], does not.

Fortunately, this possibility can also be examined within IDRT. In general, a partial correlation measures the relationship between X and Y after each of these two variables has been adjusted by linear regression to remove any association with Z (i.e., partialling out Z’s contribution to X and to Y). The partial correlation is therefore defined as \( \mathrm{Corr}\left[ {\left. {\mathbf{X},\mathbf{Y}} \right|\mathbf{Z}} \right]=\mathrm{Corr}\left[ {\mathbf{U},\mathbf{V}} \right] \), where U and V are the residuals when X and Y are regressed on Z. Conveniently, the correlation of the residuals can be computed from the three pairwise correlations of the original variables X, Y, and Z. For example, the partial correlation between RT f and RT i controlling for RT n is

$$ \mathrm{Corr}\left[ {\left. {\mathbf{RT}_{\mathbf{f}},\mathbf{RT}_{\mathbf{i}}} \right|\mathbf{RT}_{\mathbf{n}}} \right]=\frac{{\mathrm{Corr}\left[ {\mathbf{RT}_{\mathbf{f}},\mathbf{RT}_{\mathbf{i}}} \right]-\mathrm{Corr}\left[ {\mathbf{RT}_{\mathbf{f}},\mathbf{RT}_{\mathbf{n}}} \right]\cdot \mathrm{Corr}\left[ {\mathbf{RT}_{\mathbf{i}},\mathbf{RT}_{\mathbf{n}}} \right]}}{{\sqrt{{\left( {1-\mathrm{Corr}{{\left[ {\mathbf{RT}_{\mathbf{f}},\mathbf{RT}_{\mathbf{n}}} \right]}^2}} \right)\cdot \left( {1-\mathrm{Corr}{{\left[ {\mathbf{RT}_{\mathbf{i}},\mathbf{RT}_{\mathbf{n}}} \right]}^2}} \right)}}}}. $$
(15)

It is possible to investigate partial correlations within IDRT in basically the same manner that we have used to investigate simple correlations. The pairwise correlation between any two mean RTs can be derived using the methods discussed in the section “Correlation of two mean reaction times,” so the full predicted correlation matrix for any set of mean RTs can be obtained by repeated pairwise applications of these methods. Then, the partial correlation of two variables controlling for one or more additional variables can be determined from these pairwise correlations (e.g., Eq. 15). In short, predicted partial correlations can be derived from IDRT because they depend only on predicted pairwise correlations.
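
For concreteness, Eq. 15 translates directly into code. The pairwise correlations supplied in the example below are illustrative placeholders; under IDRT, they would themselves be predicted from the model parameters as just described.

```python
import numpy as np

def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of X and Y controlling for Z,
    computed from the three pairwise correlations (Eq. 15)."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Illustrative pairwise correlations among RT_f, RT_i, and RT_n.
print(partial_corr(r_xy=0.60, r_xz=0.70, r_yz=0.70))  # ~= 0.216
```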

It is beyond the scope of this article to present or illustrate the partial correlations predicted by IDRT, but the overall conclusion of such an analysis is clear. Just like simple correlations, partial correlations are influenced by numerous parameters beyond the ones of interest to researchers, so they are subject to the same sorts of interpretation difficulties that plague correlations of difference scores.

Ultimately, it appears that neither correlations of difference scores nor partial correlations can measure the simple relationships of intuitive interest, because the underlying RT process is not completely additive. Consider, for example, the general processing time parameter G, which is related to a hypothesized neural processing speed influencing all stages. Within any general model of RT, this parameter has a multiplicative effect [i.e., \( \mathbf{RT}=\left( {A+B+C} \right)\times \mathbf{G}+\mathbf{R} \)] rather than a purely additive one as is assumed by the general model underlying difference scores and partial correlations. This is even true for IDRT despite the fact that it was constructed to have a relatively simple additive structure in the first place, so it is extremely doubtful that correlations would have simpler interpretations within other, less additive RT models.

Implications regarding reliability

As is well known within classical test theory and has been acknowledged by many researchers focusing on RT (e.g., Jensen, 1985), the reliabilities of RT-based measures are crucial. In general, reliability is determined by true score variance and error variance, and it is important because it places an upper limit on the correlations that can be observed.

Regarding mean RTs, IDRT provides some grounds for optimism. Although the number of observations needed for satisfactorily high reliability of a mean RT depends on the exact situation under study, as few as 15–30 trials are often enough under realistic parameter settings, and it is usually feasible to obtain at least that many trials per participant in all conditions of an RT study. On the other hand, IDRT indicates that high reliability per se is not a sufficient indication that a population has adequate variance in the population parameters of interest. As is shown in Fig. 2, reliability tends to be high if there is a reasonable amount of variability in at least one of the general processing time, G, the task-specific processing time, Δ, or the sensory–motor residual component, R. Thus, mean RT reliability could be high without any variability in the cognitive processing times that are generally of interest, G and Δ—and thus, without any opportunity for observing correlations of RT with other cognitive measures—as long as there was large variability in the residual component.

The situation is somewhat less promising with respect to the reliability of difference scores, because more observations—possibly two orders of magnitude more—are needed for a reliable difference score. The reliability of difference scores is reduced partly because the sensory–motor residual term R does not contribute to the difference (Eq. 9), reducing the true score variation. Happily, this means that when the reliability of a difference score is high—unlike with mean RTs—the researcher can be sure that there was substantial variability in at least one of the cognitive processing time parameters (i.e., G, Δ c , and Δ e ).

Critically, the reliability of a difference score depends greatly on the relation between the two conditions involved in the difference. Specifically, the reliability of the difference between the mean RTs of two tasks depends on the similarity of the underlying task-specific processes involved in those tasks. When the two tasks involve common task-specific processes, so that these processes are positively correlated across participants (i.e., \( {\rho_{{{\varDelta_c}{\varDelta_e}}}}\gg 0 \)), reliability tends to be relatively low. Difference scores computed from two conditions requiring different degrees of mental rotation, different memory loads, different display sizes in visual search, different temporal offsets between two overlapping tasks, and so on would be examples. In such cases, thousands of trials per condition might be required to obtain satisfactory levels of reliability. Not surprisingly, then, the literature contains numerous reports of low reliability for difference scores computed from such tasks (e.g., Neubauer et al., 1997).

In contrast, when the two tasks involve opposing task-specific processes, so that these processes are negatively correlated across participants (i.e., \( {\rho_{{{\varDelta_c}{\varDelta_e}}}}\ll 0 \)), reliability tends to be relatively high. A difference score computed from two conditions involving congruent versus incongruent trials in the Stroop task would be one example, and others might involve congruent versus incongruent conditions in the SNARC effect, Simon effect, flanker effect, stimulus–response compatibility tasks, and so on. In these cases, as few as 100–200 trials per condition might be sufficient to achieve adequate levels of reliability. Because correlations are inherently limited by reliability, the similarity of the underlying task-specific processes thus has strong implications not only for the reliability of difference scores, but also for their correlations with mean RTs and with external measures.
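
These trial-count claims follow from the classical formula relating reliability to the ratio of true-score variance to total variance, combined with the assumption that the error variance of a condition mean shrinks in proportion to 1/n with the number of trials n. The following sketch uses illustrative values (the true-score variance of the difference is derived from the linear form of Eq. 9, assuming a common σ Δ for both conditions); it reproduces the qualitative pattern that differences based on opposing processes reach high reliability with modest trial counts, whereas those based on common processes do not.

```python
import numpy as np

def reliability(var_true: float, var_error: float) -> float:
    """Classical-test-theory reliability: true variance over total variance."""
    return var_true / (var_true + var_error)

def diff_true_variance(B_c, B_e, sigma_G, sigma_D, rho_ce,
                       rho_cG=0.2, rho_eG=0.2):
    """True-score variance of D = RT_e - RT_c implied by the additive model,
    from D_true = (B_e - B_c)G + B_e*Delta_e - B_c*Delta_c, assuming a
    common sigma_Delta for both conditions (a sketch, not Eq. 43 itself)."""
    return ((B_e - B_c)**2 * sigma_G**2
            + (B_e**2 + B_c**2 - 2 * B_e * B_c * rho_ce) * sigma_D**2
            + 2 * (B_e - B_c) * sigma_G * sigma_D * (B_e * rho_eG - B_c * rho_cG))

sigma_trial = 100.0  # within-condition trial-to-trial SD (ms), illustrative
for rho_ce, label in [(-0.8, "opposing"), (0.8, "common")]:
    vt = diff_true_variance(250.0, 300.0, sigma_G=0.15, sigma_D=0.1, rho_ce=rho_ce)
    for n in (25, 100, 400):  # trials per condition
        # Error variance of the difference: sigma_trial^2 / n per condition mean.
        print(label, n, round(reliability(vt, 2 * sigma_trial**2 / n), 2))
```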

The practical implications of IDRT for reliability are nicely illustrated by considering a recent pair of studies examining the reliability of priming effects on word recognition. Stolz et al. (2005) found that semantic priming effects were rather unreliable across two blocks of trials, suggesting that these effects are driven by “uncoordinated processes specific to semantic memory” (Waechter et al., 2010, p. 553). In contrast, using a closely-matched experimental protocol, Waechter et al. (2010) found that repetition priming effects were noticeably more reliable than semantic priming. The higher reliability of repetition priming was taken as further support for the idea of uncoordinated semantic memory processes because it ruled out explanations of low reliability based on uncoordinated processes at presemantic (i.e., featural, lexical) levels. Within the context of IDRT, however, it is easy to imagine another possible interpretation of the higher reliability for repetition priming than for semantic priming. Specifically, the repetition priming effect was approximately twice as large as the semantic priming effect (M = 88 vs. 37 ms). Given that the reliability of a difference score increases with the effect size (e.g., effect of B e in Fig. 5), the larger effect size could have been responsible for the greater reliability of repetition priming, negating its support for the uncoordinated nature of semantic processes. Similarly, Maloney et al. (2010) found much lower reliability for numerical distance effects obtained with numbers presented in symbolic formats (e.g., “4”) than with those presented nonsymbolically (e.g., four squares), and these reliability differences may also have been due at least partly to the fact that the distance effect was much smaller with symbolic than with nonsymbolic stimuli (i.e., approximately 50 vs. 500 ms). Thus, as was the case with RT correlations, IDRT is useful in elucidating the many factors that need to be considered when interpreting changes in RT reliabilities.

Further uses of IDRT

The model developed here could also be useful in addressing various methodological questions affecting the exact choice of data analyses. Consider, for example, the issue of whether it is better to use common versus separate estimates of an RT mean that contributes to both of the two terms being correlated. As was discussed by Brown (2011), this question arises in correlating the sizes of Stroop facilitation and interference, because each of these effect sizes is estimated relative to the mean RT in a common neutral condition. Including all of the available neutral trials in a single estimate of the neutral mean has the advantage of yielding an estimate based on a larger number of trials but has the corresponding disadvantage of creating an artificial dependence between the two effect sizes being correlated. In contrast, dividing the neutral trials into two sets and computing separate estimates has the advantage of yielding independent estimates but the disadvantage of yielding estimates based on smaller numbers of trials. Exactly analogous questions arise when correlating an effect size with the control condition mean, Corr[RT c , D], or with the average of the control and experimental condition means, \( \mathrm{Corr}\left[ {\overline{\mathbf{RT}},\mathbf{D}} \right] \), because in both of these cases, at least one of the condition means, RT c and RT e , contributes to both of the terms being correlated.

Within the present model, the implications of using common versus separate RT estimates can be examined precisely under any desired set of assumptions about the task and individual-difference parameters. As was discussed in the section “Correlation of two reaction time difference scores involving the same baseline,” trial-to-trial error variance contributes to the covariance when a common RT mean is used, but not when separate RT means are used. Furthermore, the error variance of a single mean varies with the number of trials used to compute it (Eq. 2). Thus, the true underlying correlation of interest (e.g., Corr[D f , D i ]) can be computed exactly for both analysis procedures (i.e., common vs. separate estimates), allowing a fully informed choice about which is the better method under a particular configuration of parameter values.
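
The trade-off can be made concrete with a small simulation. The sketch below deliberately abstracts away from the full IDRT parameterization (the person-level condition means are simple correlated normals, and all numeric values are illustrative) and compares the cost–benefit correlation computed with one shared neutral mean against the version computed with two independent, half-sized neutral estimates.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 100_000, 50        # participants; neutral trials available per half
sigma_trial = 100.0       # within-condition trial-to-trial SD (ms)

# Person-level true condition means: a shared general-speed component plus
# condition-specific variation (deliberately simplified, illustrative values).
g = 80.0 * rng.standard_normal(n)
true_f = 450.0 + g + 20.0 * rng.standard_normal(n)
true_n = 500.0 + g + 20.0 * rng.standard_normal(n)
true_i = 560.0 + g + 20.0 * rng.standard_normal(n)

def observed_mean(true_mean, k):
    """Observed mean RT of k trials scattered around each true mean."""
    return true_mean + sigma_trial / np.sqrt(k) * rng.standard_normal(n)

rt_f = observed_mean(true_f, 2 * m)
rt_i = observed_mean(true_i, 2 * m)

# (a) One common neutral estimate based on all 2m neutral trials: its error
# enters benefit and cost with opposite signs, adding a negative covariance.
rt_n = observed_mean(true_n, 2 * m)
r_common = np.corrcoef(rt_n - rt_f, rt_i - rt_n)[0, 1]

# (b) Two independent neutral estimates of m trials each: no shared error,
# but each estimate is noisier, which attenuates the correlation instead.
rt_n1 = observed_mean(true_n, m)
rt_n2 = observed_mean(true_n, m)
r_separate = np.corrcoef(rt_n1 - rt_f, rt_i - rt_n2)[0, 1]

print(r_common, r_separate)
```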

Future directions

As an initial investigation of the psychometric properties implied by RT models, the present work has a number of limitations that should be addressed in future extensions. One is that we have only examined measures based on mean RTs and their differences. Many other summaries of RT have also been used in computing correlations, including median RT, within-condition standard deviation of RT, and parameters obtained by fitting particular models to RT distributions (e.g., Fjell, Ostby, & Walhovd, 2007; Forstmann et al., 2008; Jensen, 1992; Schmiedek, Lovden, & Lindenberger, 2009; Schmiedek, Oberauer, Wilhelm, Süß, & Wittmann, 2007). Future research could extend the present analysis to such other RT summary measures. Such extensions would appear to be quite straightforward in some cases. For example, IDRT’s predictions about the within-condition standard deviation of RT are dictated by Eq. 2, so predicted standard deviations can easily be computed from the same parameters used to compute predicted means.

A second limitation of the present work is that we have only considered the true values of the reliabilities and correlations implied by the model. In any empirical study, of course, the reliabilities and correlations would be estimated from observed data and would therefore fluctuate randomly around the true values provided by our formulas. We have not explored the implications of the IDRT model for these purely statistical fluctuations due to sampling error, and these implications could be important. For example, the reliability of a mean RT or a difference score could be estimated using a test–retest procedure, a split-half analysis, or some other technique, and it is not clear which of these would provide the reliability estimate with the best statistical properties (i.e., least bias and lowest standard error).

A third limitation of this work is that we have examined in detail only the relationships of observable correlations to the correlations between certain pairs of underlying parameters. Figure 7, for example, shows how the observable correlation of an external measure Y with an RT difference score is related to the underlying correlation of Y with the time needed for task-specific processing in the experimental condition, Δ e . One might ask, instead, how the observable correlation is related to Y’s correlation with the time needed for task-specific processing in the control condition (Δ c ), or even how it is related to Y’s correlation with the difference in task-specific processing times (i.e., \( {\varDelta^{*}}={B_e}\,{\varDelta_{\mathbf{e}}}-{B_c}\,{\varDelta_{\mathbf{c}}} \)). It is beyond the scope of this initial investigation to consider all such possible relationships, but two points can be made. First, the equations developed in Appendix 2 could be used to study IDRT’s predictions concerning any such relationship of particular interest. As one illustration, Fig. 12 shows how Corr[D, Y] is related to the underlying correlation of Y with the difference in experimental versus control central processing times just defined, \( {\rho_{{{\varDelta^{*}}Y}}} \), under various assumptions about the other parameters. Second, and in keeping with the more general conclusions from this research, the complexity of these equations strongly suggests that all relationships between observable correlations and underlying parameters will be complicated, making it difficult to reach straightforward conclusions about the correlations of underlying processing durations from observable correlations.

Fig. 12

Correlation of a mean reaction time difference score, D, and an external measure, Y, as a function of the true correlations ρ GY , \( {\rho_{{{\varDelta_c}{\varDelta_e}}}} \), and \( {\rho_{{{\varDelta^{*}}Y}}} \) and of the assumed relationship between \( {\rho_{{{\varDelta_c}Y}}} \) and \( {\rho_{{{\varDelta_e}Y}}} \). Default values of the other parameters are shown in Table 15. As in Fig. 7, Corr[D, Y] was computed using Eq. 47. Values of \( {\rho_{{{\varDelta^{*}}Y}}} \) were computed using Eq. 55

A fourth limitation of this work is that it ignores the possibility that the observed RTs are contaminated by speed–accuracy trade-offs. The possibility of such contamination plagues all RT research (e.g., Pachella, 1974), of course—not just research focusing on correlations. Within correlational RT research, it appears that the possibility of such contamination can be addressed only by using a formal model to combine RT and accuracy into a single measure of processing efficiency (e.g., Brown & Heathcote, 2005; Ratcliff, 1978; Yellott, 1971).

Finally, it might also be worthwhile to extend the present approach to other analytical techniques beyond the computation of reliabilities and correlations. As one example, an alternative approach for investigating individual differences is to divide participants into groups based on one variable (e.g., age or IQ) and then to compare mean RTs and RT differences across the groups (e.g., Der & Deary, 2006; Dickman & Meyer, 1988; Eaton & Ritchot, 1995; Ellermeier, Eigenstetter, & Zimmer, 2001; Exposito & Andres-Pueyo, 1997; Myerson, Hale, Chen, & Lawrence, 1997; Smulders & Meijer, 2008). A persistent problem within this approach is to make a fair comparison of the sizes of RT differences across groups differing in overall mean RT. For example, Chapman et al. (1994) suggested that “slower or less accurate individuals tend generally to show larger differences between pairs of scores [RTs], and this may explain the finding in many kinds of tasks that slower and less accurate groups . . . show heightened [RT] priming difference scores” (p. 160). Researchers have developed a number of ad hoc procedures to adjust RT effect sizes in order to more fairly compare effect sizes across groups (e.g., ANCOVA), but none of these procedures has been developed from an explicit model of the underlying RTs. If the IDRT model could be extended to this situation, it might be useful for comparing the effectiveness of different suggested adjustment procedures or even for finding a new model-based adjustment procedure for making the desired comparisons.