The cognitive reflection test (CRT) was introduced by Frederick (2005) to measure the construct of cognitive reflection, which he defined as “the ability or disposition to resist reporting the response that first comes to mind” (Frederick, 2005, p. 35). As is shown in Table 1, the CRT contains three mathematical problems that share a common feature: each typically triggers a quick, intuitive response that is not the correct answer. If the test taker realizes that the intuitive response is not the correct answer, finding the correct solution requires only relatively easy mathematical computations. Typically, a participant answers each problem, correctly or incorrectly, within a few minutes. Research has shown that people find these problems difficult, and that those who perform well on the CRT tend to perform well on numeracy tests and other general ability tests, and tend to avoid biases in judgment and decision-making tasks (e.g., Campitelli & Labollita, 2010; Cokely & Kelley, 2009; Frederick, 2005; Liberali, Reyna, Furlan, Stein, & Pardo, 2011; Oechssler, Roider, & Schmitz, 2009; Toplak, West, & Stanovich, 2011).

Table 1 Cognitive reflection test with correct and intuitive answers

Frederick’s definition of cognitive reflection is intriguing because it encompasses the possibility that cognitive reflection is a thinking disposition. As Toplak et al. (2011) noted, thinking dispositions are typically measured with subjective reports, which are not always reliable (e.g., Nisbett & Wilson, 1977). The CRT, in contrast, is a performance measure with an objective criterion. Thus, if the CRT indeed measures a thinking disposition, it would constitute substantial progress in the measurement of thinking dispositions.

Researchers seem to disagree about whether the CRT measures an ability or both an ability and a disposition. Cokely and Kelley (2009) associated the CRT with reflectiveness, or “careful, thorough and elaborative—but not necessarily normative—cognition” (Cokely & Kelley, 2009, p. 27). Campitelli and Labollita (2010) proposed that cognitive reflection is not only an ability or disposition to veto a prepotent response, but also an ability or disposition to initiate cognitive processes. Moreover, in line with Cokely and Kelley, they proposed that “cognitive reflection, as measured by CRT, is related to Baron’s (2008) broader concept of actively open-minded thinking” (Campitelli & Labollita, 2010, p. 188), and they suggested that the relationship between the CRT and actively open-minded thinking (AOT) could be studied using Stanovich and West’s (1998) AOT scale. Given that AOT is a thinking disposition, Cokely and Kelley (2009) and Campitelli and Labollita (2010) seem to favor the view that the CRT measures not only an ability, but also a thinking disposition.

Another group of researchers seems to view the CRT as a measure of an ability (not a disposition), but they consider this ability to be distinct from general cognitive abilities (e.g., intelligence, working memory). Toplak et al. (2011) referred to this ability as rational thinking. These authors studied the relationship between the CRT and the AOT scale, among other measures, and found a significant but weak relationship (r = .10). Therefore, they discarded the possibility that the CRT measures a thinking disposition. Instead, they proposed that the CRT directly measures rational thinking ability or, negatively framed, “the tendency toward the class of reasoning error that derives from miserly processing” (Toplak et al., 2011, p. 1284). Toplak et al. used a range of measures of rational thinking ability, including syllogistic reasoning with belief bias (Evans, Barston, & Pollard, 1983) and a number of problems from the heuristics-and-biases literature. They showed a unique covariance between the CRT and rational thinking that could not be accounted for by measures of general cognitive ability (e.g., the WASI). This “miserly processing” view is consistent with Frederick’s (2005) explanation of CRT performance based on Kahneman and Frederick’s (2002) dual-system account: people tend to use their System 1, which is quick, intuitive, and heuristic, and fail to use their System 2, which is slow, reflective, and rule-based. Using a default-interventionist conception of System 2 (Evans, 2008), Frederick explained errors on the CRT as the failure of System 2 to monitor or override System 1’s functioning. Böckenholt (2012) implemented a mathematical model entitled the “cognitive-miser response model,” which also favors the explanation of the CRT as a measure of cognitive miserliness. Liberali et al. (2011) evaluated Campitelli and Labollita’s (2010) proposal that the CRT measures an aspect of AOT (i.e., the disposition to search for alternatives) and concluded that the search for alternatives is not enough to solve the CRT problems; an ability to inhibit and edit the wrong responses is also required.

Although researchers disagree about whether the CRT measures solely an ability or both an ability and a thinking disposition, most agree that the CRT is not just a test of mathematical ability. This agreement is based on the consensus that CRT problems, unlike other mathematical problems, trigger an automatic response, which is then either inhibited or not; only if inhibition is successful would individuals use their mathematical knowledge to solve the problems. This view received some support in Liberali et al.’s (2011) study, in which a factor analysis was conducted on a set of items including the three CRT problems and other mathematical problems. The authors found that the CRT problems tended to form a factor separate from the other problems. In contrast, Weller, Dieckmann, Tusler, Mertz, Burns, and Peters (2013) included two CRT problems within their numeracy scale and discussed the CRT within a section entitled “Existing measures of numeracy.” Thus, they implied that the CRT is just a test of mathematical ability.

Summing up, three distinct views exist of what the CRT measures:

  • It is just a measure of mathematical ability.

  • It is a measure of mathematical ability and rational thinking.

  • It is a measure of mathematical ability, rational thinking, and the disposition toward actively open-minded thinking.

The goal of this article was to investigate the structure of the CRT in depth and to help determine which of these views is best supported.

Overview of the present study

In order to assess these views, we used a mathematical modeling approach similar to the one used by Böckenholt (2012). The rationale for this approach is that more traditional analyses, such as linear or logistic regression, cannot capture the hierarchical structure of the CRT (i.e., first an intuitive response occurs, then an inhibition process, and then a mathematical computation process). Moreover, as discussed later, unlike the traditional approaches, the mathematical modeling approach affords us the possibility of identifying gender differences and problem-specific differences in the estimated parameters (i.e., the probability of inhibiting a prepotent response, and the probability of using an appropriate mathematical procedure).

We developed one mathematical model for each of the views presented in the introduction, as well as a null model, and then analyzed how well each model fit the data. Given that there are gender differences in CRT performance, we conducted separate analyses for males and females. Moreover, in order to investigate the differences between CRT problems, we conducted both an analysis of the CRT as a whole and an analysis of each problem independently.

Method

Participants and procedure

After obtaining ethical approval from Edith Cowan University’s ethics committee, we used the services of MyOpinions (www.myopinions.com.au), a company that provides access to a panel of 360,000 Australians. Panel members register on a website and participate in surveys as part of a reward system. Quotas were established to ensure that the distribution of the sample across gender and age was not very different from that of the Australian population as a whole. After the survey was launched, it took approximately 10 days to obtain 2,019 responses online (47.2 % [952] were female). The average age of the sample was M = 39.8 years, SD = 11.5, range = 20–61. In terms of education, 18.8 % of the sample did not complete secondary school, 17.7 % completed secondary school, 30.8 % obtained a tertiary or trade qualification, 26.9 % obtained an undergraduate certificate or a bachelor’s degree, and 5.8 % obtained a master’s or doctoral degree.

Material

The participants completed a survey containing questions about financial behavior and questions assessing psychological variables. In this study we focused on the psychological variables only. Specifically, we examined the questions that comprise the CRT; numeracy (NUM), as a measure of mathematical ability; syllogistic reasoning with belief bias (SRBB), as a measure of rational thinking ability; and actively open-minded thinking (AOT), as the disposition toward actively open-minded thinking. Table 1 presents the CRT, and Appendix 1 shows the numeracy problems, the syllogisms with belief bias, and the items of the actively open-minded thinking scale.

Cognitive reflection test

The CRT (see Table 1) contains three problems. There is no time limit for solving the problems, and no response alternatives are provided for the participants to choose from. The total score was the number of problems solved correctly. We also classified each participant’s response to each problem as a “correct answer,” an “intuitive answer” (i.e., the answer that corresponds to the expected quick, intuitive response that first comes to mind; see Table 1), or an “other answer.”
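This scoring rule can be sketched in code (a Python illustration; the actual analyses in this study were run in R). The answer key below follows the correct and intuitive answers of Table 1, which are the well-known values from Frederick (2005): for the bat-and-ball problem, for example, the correct answer is $0.05 and the intuitive answer is $0.10.

```python
# Answer key following Table 1 (Frederick, 2005): correct and intuitive
# answers for the three CRT problems.
KEY = {
    1: {"correct": 0.05, "intuitive": 0.10},  # bat and ball (dollars)
    2: {"correct": 5, "intuitive": 100},      # widgets (minutes)
    3: {"correct": 47, "intuitive": 24},      # lily pads (days)
}

def classify(problem, answer):
    # Classify a single response as "correct", "intuitive", or "other".
    if answer == KEY[problem]["correct"]:
        return "correct"
    if answer == KEY[problem]["intuitive"]:
        return "intuitive"
    return "other"

def crt_total(answers):
    # Total CRT score = number of problems solved correctly.
    return sum(classify(p, a) == "correct" for p, a in answers.items())
```

For instance, a participant who answers $0.05, 100 minutes, and 47 days would receive one “intuitive answer” classification and a total score of 2.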

Numeracy

To measure numeracy, we used the three most difficult problems (as reported by Peters & Levin, 2008) of the 11-item numeracy scale developed by Lipkus, Samsa, and Rimer (2001). Problem 2 differed from the original question in that we provided six response alternatives to the participants; Problems 1 and 3 did not have alternatives. The total score was the number of items solved correctly. The numeracy items are presented in Appendix 1.

Syllogistic reasoning with belief bias

We constructed four “incongruent” syllogisms in which either the conclusion followed logically from the premises but contradicted a belief (e.g., the Australia Stock Exchange [ASX] always goes up) or the conclusion did not follow logically from the premises but was consistent with a belief (e.g., Visa is a credit card). We constructed these syllogisms on the basis of Sá, West, and Stanovich (1999), who, in turn, used syllogisms presented in Markovits and Nantel (1989). Following Stanovich and West (1998), Macpherson and Stanovich (2007), West, Toplak, and Stanovich (2008), and Toplak et al. (2011), we used the total number of incongruent syllogisms correctly solved as a measure of the ability to avoid belief bias. For consistency with the literature, we refer to this variable as syllogistic reasoning with belief bias (SRBB); following Toplak et al.’s classification, we used this variable as a measure of rational thinking. The syllogisms are presented in Appendix 1.

Actively open-minded thinking

Baron (1985, 2008) used the term actively open-minded thinking to refer to thinking that includes thorough search relative to the importance of a question, confidence appropriate to the amount and quality of thinking carried out, and consideration of alternatives different from the one initially favored. Stanovich and West (2007) used a 41-item actively open-minded thinking scale that evolved from previous scales: the flexible thinking scale (Stanovich & West, 1997), the openness-values facet of the Revised NEO Personality Inventory (Costa & McCrae, 1992), dogmatism (Paulhus & Reid, 1991), the categorical thinking subscale of Epstein and Meier’s (1989) constructive thinking inventory, the belief identification scale (Sá et al., 1999), and the counterfactual thinking scale (Stanovich & West, 1997). To minimize the chance of participant inattention, we selected 15 items from the 41-item scale, on the basis of a pilot study showing that those items had the highest internal consistency.

Each item consisted of a statement, and participants indicated whether they agreed strongly (scored as 6), agreed moderately (5), agreed slightly (4), disagreed slightly (3), disagreed moderately (2), or disagreed strongly (1) with the statement. The total score was obtained by summing the responses to the 15 items, after reverse-scoring the items for which disagreeing strongly (i.e., a score of 1) indicated a tendency toward actively open-minded thinking. The scale is presented in Appendix 1.
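The scoring rule can be written compactly (a Python sketch; the set of reverse-keyed item indices used here is purely hypothetical, since the actual items appear in Appendix 1):

```python
REVERSED = {2, 5, 9}  # hypothetical indices of reverse-keyed items

def aot_score(responses):
    # responses: 15 ratings from 1 (disagree strongly) to 6 (agree strongly).
    # Reverse-keyed items are flipped (7 - r) so that a higher total always
    # indicates a stronger disposition toward actively open-minded thinking.
    assert len(responses) == 15 and all(1 <= r <= 6 for r in responses)
    return sum(7 - r if i in REVERSED else r
               for i, r in enumerate(responses, start=1))
```

With this (hypothetical) key, a respondent who marks “agree strongly” on every item would score 6 on the 12 regular items and 1 on the 3 reversed items, for a total of 75.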

Analyses

We carried out traditional analyses (i.e., correlations and regressions) and then conducted a mathematical modeling analysis. The four R scripts used to run the mathematical modeling analyses in the statistical software R (R Development Core Team, 2012) and the data set are available in the Supplementary Material and at the following link: https://drive.google.com/folderview?id=0BxvQ-uHPASPvd3lwS2MzR3c0WlE&usp=sharing. We constructed four mathematical models (i.e., one for each of the views of the CRT identified in the introduction, and one null model) and fitted the four models to the data corresponding to the whole CRT. After that, we fitted the same models to the data of each of the three CRT problems separately. Given that previous research has shown gender differences in the CRT (Frederick, 2005), we fitted the models to males and females separately. Appendix 2 presents the mathematical formulas that are common to all the models and those that are model-specific. It also describes the maximum likelihood estimation and the model selection procedures.

Mathematical models

We constructed four mathematical models:

  • null model [NULL],

  • mathematical ability model [MATH],

  • rational thinking model [RAT], and

  • thinking disposition model [DISP].

NULL assumes that the CRT is not a sensitive measure and thus that everyone performs similarly: the only estimate of performance for each participant is the mean performance of the sample. This implies no variability in CRT performance, and the model requires no parameter estimation. MATH (see panel a in Fig. 1) implements the view that the CRT measures only mathematical ability. The model assumes that, after reading the instructions, participants either perform an adequate mathematical computation with probability μ, and thus produce a correct answer, or fail to do so with probability 1 – μ, and thus give an incorrect answer (i.e., either intuitive or other). The mathematical expression of this model is equivalent to a regression analysis in which CRT performance is predicted only by the score on the numeracy test.
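The MATH model can be sketched as a single link from numeracy to μ (a Python illustration, not the fitted model; we assume a logistic link, consistent with the regression analogy above, and the coefficients b0 and b1 are illustrative values rather than estimates from this study):

```python
import math

def mu_math(num_z, b0=-0.5, b1=1.0):
    # MATH model sketch: the probability mu of performing an accurate
    # mathematical computation is a logistic function of the
    # standardized numeracy score. b0 and b1 are illustrative only.
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * num_z)))

def math_model_probs(num_z):
    mu = mu_math(num_z)
    # MATH distinguishes only two outcome categories: a correct answer
    # (mu) or an incorrect one (1 - mu), pooling intuitive and other.
    return {"correct": mu, "incorrect": 1.0 - mu}
```

Note that, unlike the models described next, MATH cannot separate intuitive answers from other incorrect answers.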

Fig. 1

Graphical representation of the models. Panel a shows the MATH model, in which μ stands for the probability of using an accurate mathematical procedure. Panel b shows a representation of the RAT and DISP models. The difference between these models is that in DISP, both SRBB and AOT are used as covariates to estimate the probability of inhibition (τ), and in RAT only SRBB is used. In all the models NUM is the covariate to estimate μ

RAT implements the view that the CRT measures mathematical ability and rational thinking, and DISP implements the view that the CRT measures mathematical ability, rational thinking, and a disposition toward actively open-minded thinking. Panel b in Fig. 1 shows RAT and DISP. These models assume that reading the instructions triggers an intuitive response. This response is either inhibited, with probability τ, or not inhibited, with probability 1 – τ. If the response is not inhibited, the participant reports the intuitive response as the final answer (i.e., an intuitive answer). If the response is inhibited, the participant either uses an appropriate mathematical procedure, with probability μ, and gives a correct answer, or uses an inadequate procedure, with probability 1 – μ, and gives an “other answer,” which is incorrect but different from the intuitive answer. In RAT, the probability τ of inhibiting the intuitive response is estimated from SRBB; in DISP, it is estimated from both SRBB and AOT.
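The branching structure in panel b implies three response probabilities: an intuitive answer with probability 1 – τ, a correct answer with probability τμ, and an “other” answer with probability τ(1 – μ). A minimal sketch (Python; the logistic link and the coefficients in tau_disp are assumptions for illustration, not fitted estimates):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def response_probs(tau, mu):
    # Probability tree shared by RAT and DISP (Fig. 1, panel b): the
    # intuitive response is inhibited with probability tau; given
    # inhibition, an adequate mathematical procedure is used with mu.
    return {
        "intuitive": 1.0 - tau,         # not inhibited -> intuitive answer
        "correct": tau * mu,            # inhibited, adequate procedure
        "other": tau * (1.0 - mu),      # inhibited, inadequate procedure
    }

def tau_disp(srbb_z, aot_z, a0=0.0, a1=0.5, a2=0.2):
    # DISP estimates tau from both SRBB and AOT (z-scores); RAT would
    # simply drop the AOT term. Coefficients are illustrative only.
    return logistic(a0 + a1 * srbb_z + a2 * aot_z)
```

The three probabilities always sum to one, which is what makes the models fittable by maximum likelihood over the three response categories.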

Comparison among models

All the parameters were estimated by maximum likelihood using the function optim in the statistical software R (R Development Core Team, 2012). To select the best model, we used the Bayesian information criterion (BIC); in each analysis, the model with the lowest BIC was chosen as the best model. We used Raftery’s (1995) interpretation of differences between BIC scores in terms of strength of evidence: BIC differences between 0 and 2 denote weak evidence; between 2 and 6, positive evidence; between 6 and 10, strong evidence; and greater than 10, very strong evidence.
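The selection criterion can be made explicit (a Python sketch; BIC = −2 log L + k ln n, where k is the number of free parameters and n the sample size, with Raftery's labels applied to the difference between two models' BICs):

```python
import math

def bic(log_lik, k, n):
    # Bayesian information criterion: -2 * log-likelihood + k * ln(n).
    # Lower BIC indicates the better model.
    return -2.0 * log_lik + k * math.log(n)

def raftery_evidence(bic_difference):
    # Raftery's (1995) labels for the strength of evidence implied by
    # a BIC difference in favor of the lower-BIC model.
    d = abs(bic_difference)
    if d <= 2:
        return "weak"
    if d <= 6:
        return "positive"
    if d <= 10:
        return "strong"
    return "very strong"
```

For example, a BIC difference of 3.5 counts as positive evidence, 7.4 as strong evidence, and anything above 10 as very strong evidence, matching the comparisons reported in the Results.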

Results

Descriptive statistics

As is shown in Table 2, participants produced more intuitive answers (M = 1.61, SD = 1.04) than correct answers (M = 0.94, SD = 1.06) and other answers (M = 0.45, SD = 0.69) [χ2(2) = 1,366, p < .0001]. Males (M = 1.11, SD = 1.1) gave more correct answers than did females (M = 0.76, SD = 0.97) [t(2016.6) = 7.54, p < .0001, CI95 = .258, .439], and the opposite was true for intuitive answers (Males, M = 1.47, SD = 1.03; Females, M = 1.77, SD = 1.02) [t(1997.3) = 6.46, p < .0001, CI95 = .205, .385]. No gender differences were found in the proportion of other answers (Males, M = .42, SD = .67; Females, M = .48, SD = .7) [t(1965.8) = 1.74, p = .081, CI95 = −.007, .113]. These results are consistent with Frederick’s (2005) report of gender differences in the number of correct answers [Male M = 1.47, Female M = 1.03, p < .0001].

Table 2 Descriptive statistics: Cognitive reflection correct answers, intuitive answers, and other answers (CRT correct, CRT intuitive, CRTother), numeracy, syllogistic reasoning with belief bias (SRBB), and actively open-minded thinking (AOT)

The correlations of age with correct answers, intuitive errors, and other errors were not significant [age with correct answers, r(2018) = −.007, p = .753; age with intuitive errors, r(2018) = −.009, p = .686; age with other errors, r(2018) = .026, p = .243]. To rule out nonlinear relationships between these variables, we created four age groups (up to 30 years, 31–40 years, 41–50 years, and over 50 years) and compared their performance on the CRT. We found no age-group differences in correct answers (≤30 group, n = 533, M = .91, SD = 1.07; 31–40 group, n = 505, M = 0.99, SD = 1.06; 41–50 group, n = 545, M = 0.95, SD = 1.06; >50 group, n = 436, M = 0.91, SD = 1.03) [F(3, 2015) = 0.6, p = .623], intuitive answers (≤30 group, M = 1.67, SD = 1.07; 31–40 group, M = 1.55, SD = 1.03; 41–50 group, M = 1.59, SD = 1.02; >50 group, M = 1.61, SD = 1.02) [F(3, 2015) = 0.6, p = .623], or other answers (≤30 group, M = .41, SD = 0.67; 31–40 group, M = 0.45, SD = 0.69; 41–50 group, M = 0.46, SD = 0.7; >50 group, M = 0.47, SD = 0.68) [F(3, 2015) = 0.7, p = .539]. Given that males and females differed in the proportions of correct and intuitive answers, we ran separate analyses for females and males. Because age was not related to CRT performance, we did not split the sample into age groups.

We also analyzed the data in all the problems separately (see Fig. 2). The proportions of correct answers in Problem 2 (M = .37, SD = .48) and Problem 3 (M = .37, SD = .48) were much higher than that in Problem 1 (M = .21, SD = .41), and the proportion of intuitive answers was much higher in Problem 1 (M = .74, SD = .44) than in Problem 2 (M = .41, SD = .49) and Problem 3 (M = .47, SD = .50). The number of other answers was higher in Problem 2 (M = .23, SD = .42) than in Problem 3 (M = .17, SD = .38) and Problem 1 (M = .06, SD = .23).

Fig. 2

Proportions of types of answers for males and females in (a) CRT Problem 1, (b) CRT Problem 2, and (c) CRT Problem 3. Error bars represent standard errors

The pattern of gender differences remained the same in the three items. Males (Problem 1, M = .24, SD = .42; Problem 2, M = .43, SD = .5; Problem 3, M = .44, SD = .5) produced a higher proportion of correct answers than did females (Problem 1, M = .18, SD = .38; Problem 2, M = .29, SD = .46; Problem 3, M = .29, SD = .45) in all problems [difference between males and females in correct answers: Problem 1, t(2016.8) = 3.2, p < .005, 95 % confidence interval (CI95) = .022, .093; Problem 2, t(2015.1) = 6.47, p < .0001, CI95 = .095, .179; Problem 3, t(2016.3) = 7.3, p < .0001, CI95 = .113, .195]. On the other hand, the proportion of intuitive answers was higher in females (Problem 1, M = .76, SD = .43; Problem 2, M = .46, SD = .5; Problem 3, M = .54, SD = .5) than in males (Problem 1, M = .71, SD = .45; Problem 2, M = .36, SD = .48; Problem 3, M = .40, SD = .49) in all of the problems [difference between males and females in intuitive answers: Problem 1, t(2011.9) = 2.66, p < .005, CI95 = .014, .091; Problem 2, t(1971.8) = 4.5, p < .0001, CI95 = .055, .141; Problem 3, t(1982.9) = 6.6, p < .0001, CI95 = .102, .188].

Finally, only in Problem 2 did significant differences emerge in the proportions of other answers between females (Problem 1, M = .06, SD = .24; Problem 2, M = .25, SD = .43; Problem 3, M = .17, SD = .38) and males (Problem 1, M = .05, SD = .22; Problem 2, M = .21, SD = .41; Problem 3, M = .16, SD = .37) [difference between males and females in other answers: Problem 1, t(1967.1) = .53, p = .598, CI95 = −.015, .026; Problem 2, t(1957.7) = 2.08, p < .04, CI95 = .002, .076; Problem 3, t(1980.4) = .55, p = .585, CI95 = −.024, .042]. Given that we found differences in the behavior of participants from problem to problem, we fitted the models to the data of the whole CRT, and also to each problem separately.

Internal consistency

Cronbach’s alpha for the CRT was .66, which is higher than the values reported in two previous studies (Liberali et al., 2011, Study 2: α = .64; Weller et al., 2013: α = .60) and lower than the value in a third (Liberali et al., 2011, Study 1: α = .74). Finucane and Gullion (2010) obtained a higher internal consistency (α = .80), but with a different six-item questionnaire that included the three CRT items. Other studies that used the CRT have not reported measures of its internal consistency.

The 3-item measure of numeracy used in the present study obtained an internal consistency of α = .51. Using Schwartz, Woloshin, Black, and Welch’s (1997) three-item scale, Finucane and Gullion (2010) obtained an internal consistency of α = .53, Weller et al. (2013) obtained α = .58, and Liberali et al. (2011) obtained α = .60 in their Study 1 and α = .44 in their Study 2. Lipkus et al.’s (2001) 11-item scale obtained higher internal consistency [Liberali et al., Study 1, α = .69; Study 2, α = .59; Weller et al., α = .76], but it did not improve the relationship with the CRT. For example, Liberali et al. obtained a somewhat higher correlation with the CRT with the 11-item measure in Study 2 (11-item r = .39, 3-item r = .37), but a somewhat lower one in Study 1 (11-item r = .51, 3-item r = .55). Finucane and Gullion also justified the use of a three-item scale because it is moderately correlated with 11-item scales and reduces participant burden.

The SRBB measure used in this study obtained an internal consistency of α = .61. We are not aware of studies reporting internal consistency for this type of task. On average, the participants in our sample answered about half of the syllogisms correctly (2.07 out of 4), which is consistent with previous studies [e.g., Stanovich & West (1998), 4.4 out of 8; West et al. (2008), 6.9 out of 12; Toplak et al. (2011), 2.72 out of 5].

Finally, our decision to reduce the AOT scale to 15 items in the present study appears to be justified, because its reliability (α = .85) was slightly higher than that obtained with the full 41-item scale by Toplak et al. (2011) (α = .81) and by West et al. (2008) (α = .84). Moreover, as presented in the next section, the correlations with the CRT and with SRBB were higher than in previous studies.

Traditional analyses

Before presenting the results of the mathematical modeling analyses, we discuss the relationships between the variables in a more traditional fashion. Table 3 shows the correlations between the measures used in the present study, including each of the CRT problems as well as the overall CRT. The correlations among the CRT problems range from .35 to .42, and those of the CRT problems with the overall CRT range from .73 to .80.

Table 3 Correlation matrix

We obtained a significant correlation (r = .43) between the CRT and numeracy. This is consistent with previous studies: Cokely and Kelley (2009), r = .31; Liberali et al. (2011), r ranging from .37 to .51; Finucane and Gullion (2010), r = .53; and Weller et al. (2013), r = .43. Moreover, like Toplak et al. (2011), we obtained significant correlations of the CRT with SRBB [this study, r = .43; Toplak et al., r = .36] and with AOT [this study, r = .25; Toplak et al., r = .10]. We then regressed the overall CRT score on the three covariates. Given that CRT performance and gender are associated, we estimated separate regressions for males and females. As is shown in Tables 4 and 5, a standard deviation change in numeracy accounts for almost a third of a standard deviation change in the CRT in both males and females, and the same applies to SRBB. Moreover, a standard deviation change in AOT accounts for a .12 standard deviation change in the CRT in males, and .08 in females. Although the contribution of AOT to predicting the CRT is modest, it is statistically significant in both cases.

Table 4 Prediction of overall CRT performance: Females
Table 5 Prediction of overall CRT performance: Males

To further check whether the cognitive measures have explanatory power for the classification of CRT responses (correct, intuitive, and other), a multinomial logistic regression was estimated for each of the three CRT questions, with the three cognitive measures as explanatory variables (results not tabulated). For the individual problems, the Cragg–Uhler R² was 0.212, 0.143, and 0.307 for Problems 1, 2, and 3, respectively. Thus, the three cognitive measures have explanatory power for CRT response classification.

This initial analysis suggests that CRT has a strong mathematical and rational thinking component, and that the contribution of disposition toward actively open-minded thinking is weaker, but still important and significant. It also indicates that the relationship between the predictor variables and each of the CRT problems is significant, but the amount of variance accounted for varies among problems. Moreover, we found gender differences in CRT performance.

The mathematical modeling analyses afford us the possibility of investigating the structure of the CRT in more depth. On the basis of the results of this initial analysis, we conducted a mathematical modeling analysis not only on the whole CRT, but also on each problem. Moreover, we conducted the analyses for males and females separately.

Mathematical modeling results

Tables 6, 7, 8, and 9 show the best estimates of the probability of using an accurate mathematical procedure (μ) and of the probability of inhibiting the intuitive response (τ), along with the odds ratios given a 1 standard deviation change in the three covariates. The log-likelihood, deviance, and BIC of each model are also presented. Table 6 presents the results corresponding to the whole CRT analysis, and Tables 7, 8, and 9 show the results corresponding to the analyses of Problems 1, 2, and 3, respectively.

Table 6 Estimated probabilities, odds ratios, and goodness-of-fit measures in each model as a function of gender, for the whole CRT
Table 7 Estimated probabilities, odds ratios, and goodness-of-fit measures in each model as a function of gender, for CRT Problem 1
Table 8 Estimated probabilities, odds ratios, and goodness-of-fit measures in each model as a function of gender, for CRT Problem 2
Table 9 Estimated probabilities, odds ratios, and goodness-of-fit measures in each model as a function of gender, for CRT Problem 3

In all the analyses, NULL was the worst model. This indicates that MATH, RAT, and DISP are able to account for some of the individual differences in the CRT above and beyond chance. In all cases, the difference in BIC between NULL and each of the other models was much greater than 10; that is, very strong evidence (Raftery, 1995). The same result was found in the three problems analyzed separately. Note that, in RAT and DISP, μ is conditional on τ: it is the probability of using an appropriate mathematical procedure given that the intuitive response has been inhibited. That is why the values of μ in those models are much higher than those in the MATH models. The odds ratios for NUM in all tables can be interpreted as the increase in the odds of using an appropriate mathematical procedure given a 1 standard deviation change in numeracy. The odds ratios for SRBB and AOT reflect the increase in the odds of inhibiting the intuitive response given a 1 standard deviation change in SRBB and AOT, respectively. An odds ratio of 1 indicates no change, whereas an odds ratio of 2 indicates that the odds are doubled (a 100 % increase). The numeracy odds ratios range from 2.75 to 4.45, which confirms that mathematical ability is very important for solving the CRT problems. The odds ratios for SRBB, ranging from 1.15 to 1.17, suggest that this variable is also important. The AOT odds ratios are lower, ranging from 1.08 to 1.36.
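The arithmetic behind these interpretations can be made explicit (a Python sketch; the coefficient values are illustrative, not estimates from the tables):

```python
import math

def odds_ratio_per_sd(beta, sd=1.0):
    # In a logistic model, a 1-SD increase in a covariate multiplies
    # the odds by exp(beta * sd); this product is the reported odds ratio.
    return math.exp(beta * sd)

def percent_change_in_odds(odds_ratio):
    # An odds ratio of 2 corresponds to a 100 % increase in the odds;
    # an odds ratio of 1 corresponds to no change.
    return (odds_ratio - 1.0) * 100.0
```

For example, a numeracy odds ratio of 2.75 means the odds of an accurate mathematical procedure are multiplied by 2.75 (a 175 % increase) for each standard deviation of numeracy.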

The critical comparisons for testing the hypothesis that the CRT is merely a mathematical test are MATH versus RAT and MATH versus DISP. Both the male and female whole-CRT analyses provided very strong evidence (BIC difference > 10) in favor of RAT and DISP over MATH. Therefore, the CRT is not just another numeracy test.

The critical comparison for determining whether the CRT measures only rational thinking or both rational thinking and the thinking disposition toward actively open-minded thinking is between RAT and DISP. In females, the whole-CRT analysis provided very strong evidence in favor of RAT over DISP. On the other hand, in males, very strong evidence favored DISP over RAT. These results suggest that the disposition toward actively open-minded thinking did not play a significant role in solving the CRT in females, but it did play an important role in males.

In the individual-problem analyses, the evidence in males in favor of RAT or DISP over MATH was very strong in Problems 1 and 3, and positive (BIC difference = 3.5) in Problem 2. In females, there was very strong evidence in favor of RAT or DISP over MATH in Problems 1 and 3, whereas we found strong evidence (BIC difference = 7.4) in favor of MATH in Problem 2. These results suggest that Problem 2 is “more mathematical” than the others.

In the RAT-versus-DISP comparison, positive to strong evidence favored RAT in females for Problems 1 and 3 (because MATH was the best model for Problem 2 in females, the RAT-versus-DISP comparison is irrelevant there). In males, Problems 2 and 3 provided weak and very strong evidence, respectively, in favor of DISP, whereas in Problem 1 the evidence favored RAT.

Discussion

We presented three views on what the CRT measures: a mathematical ability (MATH model); both a mathematical ability and rational thinking ability (RAT model); or a mathematical ability, rational thinking ability, and a disposition toward actively open-minded thinking (DISP model). The results clearly show that the CRT is not just a mathematical test. However, the results do not provide clear-cut evidence to differentiate between the other two views: the overall CRT analysis showed very strong evidence in favor of DISP over RAT in males, but the opposite was true in females. Both models contain the μ parameter (i.e., the probability of using adequate mathematical procedures) and the τ parameter (i.e., the probability of inhibiting the intuitive response); the difference between them resides in how the τ parameter is estimated. In RAT, only a rational thinking variable is used (i.e., the ability to avoid belief bias), whereas DISP also uses a thinking disposition (i.e., actively open-minded thinking) to estimate τ. Thus, this result indicates that the evidence is very strongly in favor of the conception of the CRT as a test that measures mathematical ability, rational thinking, and the disposition toward actively open-minded thinking in males, and mathematical ability and rational thinking in females.

The values of the estimated parameters provide very useful information. In the overall CRT analysis, the average probability of inhibiting the intuitive response (i.e., τ) was .510 in males and .412 in females. This gender difference was apparent in all the problems. The average values of τ in males in the best-fitting model were .289, .640, and .599 in Problems 1, 2, and 3, respectively. The same pattern was observed in females: .237, .542, and .456. These results suggest that females found it more difficult to inhibit the intuitive response. Moreover, inhibiting the intuitive response was most difficult in the first problem. Given that the order of the problems was not counterbalanced in this study (the CRT has a fixed sequence of problems), it remains to be established whether this difficulty stems from idiosyncratic characteristics of Problem 1 or from a learning effect (i.e., participants got better at inhibiting the intuitive response in Problems 2 and 3).

The μ parameter also showed gender and problem differences. In the best-fitting models, the average estimate in males was .685 for the whole CRT, and .748, .657, and .677 for Problems 1, 2, and 3, respectively. In females, the average μ was .572 for the whole CRT, and .654, .532, and .563 for Problems 1, 2, and 3, respectively. Interestingly, μ was higher in Problem 1 than in the other problems in both males and females. This suggests that in Problem 1 it is very difficult to inhibit the intuitive answer (i.e., low τ), but that once the intuitive answer is inhibited, the problem becomes relatively easy (i.e., high μ).
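As an illustration only, if one makes the simplifying assumption that a correct response requires first inhibiting the intuitive answer (probability τ) and then applying adequate mathematical procedures (probability μ), the implied accuracy is the product τ × μ. The short sketch below applies this assumption to the average parameter estimates reported above; it is not the full estimation model, which derives the parameters from covariates.

```python
# Average parameter estimates reported above, by gender and problem.
# tau = probability of inhibiting the intuitive response;
# mu  = probability of using adequate mathematical procedures.
tau = {"male":   {1: 0.289, 2: 0.640, 3: 0.599},
       "female": {1: 0.237, 2: 0.542, 3: 0.456}}
mu  = {"male":   {1: 0.748, 2: 0.657, 3: 0.677},
       "female": {1: 0.654, 2: 0.532, 3: 0.563}}

# Under the simplifying two-stage assumption, implied accuracy = tau * mu.
for sex in ("male", "female"):
    for p in (1, 2, 3):
        acc = tau[sex][p] * mu[sex][p]
        print(f"{sex}, Problem {p}: implied accuracy = {acc:.3f}")
```

The product makes the Problem 1 pattern concrete: despite its high μ, its low τ keeps the implied accuracy below that of the other problems.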

One possible explanation of this finding is the following. When people try to solve the CRT problems, they tend to use a heuristic representation of the problem instead of a representation based on mathematical formulae. The bat-and-ball problem (Problem 1) differs from the others in that, if the intuitive answer is inhibited, people can still use the same representation to solve it correctly, whereas this is not possible with the other problems, which require some formal mathematical procedure. For example, when people read “A bat and a ball cost $1.10 in total,” they may represent the problem as a bat on the left-hand side and a ball on the right-hand side, both above a line that runs from $0.00 to $1.10 (with a marker at $1.00). When they then read “The bat costs $1.00 more than the ball,” they (wrongly) extend the bat’s region up to the $1.00 mark and “squeeze” the ball into the region between $1.00 and $1.10. Finally, when they read “How much does the ball cost?” they immediately respond $0.10 on the basis of this representation. However, if they realize that under this solution the bat does not cost $1.00 more than the ball, they can still use the same representation to reach the correct answer: they extend the bat’s region (and squeeze the ball’s region) until the bat reaches a price that is $1.00 higher than that of the ball.
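For completeness, the formal solution that this heuristic adjustment converges to is elementary algebra; letting b denote the cost of the ball in dollars:

```latex
% Worked solution to the bat-and-ball problem,
% with b = cost of the ball in dollars.
\begin{align*}
(b + 1.00) + b &= 1.10 \\
2b &= 0.10 \\
b &= 0.05
\end{align*}
```

That is, the ball costs $0.05 and the bat $1.05, rather than the intuitive $0.10 and $1.00 (which would make the bat only $0.90 more expensive than the ball).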

The present results are consistent with those of Frederick (2005), Campitelli and Labollita (2010), Liberali et al. (2011), Toplak et al. (2011), and Böckenholt (2012). All these studies, using different approaches, arrived at the conclusion that the CRT is not just a measure of general skills (specifically, mathematical ability), but that it measures something above and beyond them (i.e., cognitive reflection).

Campitelli and Labollita’s (2010) and Cokely and Kelley’s (2009) suggestion that the CRT measures the thinking disposition called actively open-minded thinking (Baron, 1985, 2008) received partial support in this study. In males, the model that incorporated mathematical ability, rational thinking and the disposition toward actively open-minded thinking was the best model. On the other hand, in females the model that included mathematical ability and rational thinking (but not thinking dispositions) was the best model.

Limitations of this study

The numeracy and belief bias measures were calculated over three and four items, respectively. Scales with a larger number of items might have had greater discriminative value. For the same reason, the CRT itself may need more items; indeed, S. Frederick (personal communication, October 12, 2012) is currently developing a ten-item version of the CRT. Ten items may strike a balance between the length of the test and its discriminative value. This weakness should be considered in the context of the strengths of this study: we used a very large sample of more than 2,000 participants, so the study had enough power to detect small effects.

Conclusion

Our data suggest that performance on the CRT in females is accounted for by their abilities (both mathematical and rational thinking abilities), but not by their disposition toward actively open-minded thinking. In males, on the other hand, performance on the CRT is accounted for by their abilities and by their disposition toward actively open-minded thinking. In both cases, the results indicate that the CRT is, indeed, a test of cognitive reflection, and not just a numeracy test.

The mathematical modeling approach provided more information than typical statistical analyses. We were able to estimate a parameter for the probability of inhibiting the intuitive response, and a parameter for the probability of using adequate mathematical procedures. This analysis suggests that gender differences are related to both parameters. Additionally, this approach showed parameter differences between problems. This information is very useful in view of current attempts to improve the discrimination of the test. Ideally, one should choose problems (like Problem 1) with a low probability of inhibition and a high probability of using adequate mathematical procedures. In this way, the cognitive reflection component of the test would be more important than the mathematical component of the test.

The CRT is a very easily administered psychological test. We believe that this study contributes to the understanding of what the CRT actually measures and, by doing so, provides valuable information for researchers deciding whether, and in which situations, to use it.