Introduction

The adaptive toolbox approach to judgment and decision making (Gigerenzer, Todd, & ABC Research Group, 1999) assumes that people possess a repertoire of decision strategies that are tailored to exploit helpful cue information in specific environments. According to that approach, decision makers have learned through extended experience which tool best fits which environment. In this way, decision making can adapt to a wide range of potential environments, so that in a specific decision situation, the decision maker only needs to detect the features of the current environment (i.e., the structure of the task and valid cues) and then select the most appropriate strategy. Behavior that utilizes these helpful environmental features has been termed “ecologically rational” (Todd & Gigerenzer, 2012).

We focus here on one of the simplest strategies from this toolbox, the recognition heuristic (RH; Goldstein & Gigerenzer, 2002), which presumes that whenever one recognizes only one of two objects in a to-be-compared pair, one may infer that the recognized object has the larger criterion value. The strategy is simple because it relies on only one cue, namely recognition, which is very easily accessible (cf. Pachur & Hertwig, 2006). Moreover, recognition appears to be a valid cue in many domains and is thus helpful in otherwise rather difficult tasks.

As an example, imagine you are asked which of the following two Swiss cities is more populous, Geneva or Interlaken. And imagine further that you recognize the city name Geneva, but have never heard of Interlaken. In this case, most people tend to infer that Geneva is the larger of the two (which happens to be correct). This behavior corresponds to what the RH theory predicts (Goldstein & Gigerenzer, 2002). The heuristic will yield accurate judgments so long as recognition is related to the to-be-inferred criterion, in other words, whenever the validity of recognition as a cue is larger than chance. Previous results revealed that decision makers are able to differentiate domains with different recognition validities and adjust their behavior accordingly (Gigerenzer & Goldstein, 2011). In other words, decision makers are able to capitalize on this specific feature and thus behave in an ecologically rational way (Todd & Gigerenzer, 2012).

These findings imply that decision makers have some grasp of how valid the recognition cue is. This, in turn, raises the question: What exactly determines subjective estimates of the recognition validity? One option is that individuals assess recognition validity from the set of objects they are making inferences on, that is, the set’s recognition validity (we call this the “local-optimization” hypothesis). Another option is that they refer to the larger underlying domain of all objects from which the set was drawn, that is, the domain’s recognition validity (we call this the “global-optimization” hypothesis). The first alternative appears highly effective in adapting to any specific task environment by considering only the actual items and cue validities within the set. Indeed, only through such a “local” focus can one hope to determine the accuracy offered by a strategy such as the RH in the specific task one is facing. Such an expectation is supported by findings from sampling studies showing that “judgments and decisions are often remarkably sensitive to the data given in a stimulus sample” (Fiedler, 2008, p. 186). A potential procedure for estimating the recognition validity in a given set of objects, even without feedback after individual choices, could proceed as follows. If people have at least some valid knowledge in the respective domain (as is arguably the case for the world cities used herein), they can rely on the degree to which recognition and knowledge are aligned: For any recognized object, additional knowledge might speak for or against that object being the correct choice. Thus, if recognition and knowledge repeatedly coincide in pairs with only one object recognized, one may eventually discard knowledge and rely on recognition alone (based on the insight that recognition validity is likely to be large and knowledge yields little to no additional information). In contrast, if recognition and knowledge repeatedly conflict with each other, one may hesitate to use recognition alone. That way, people might arrive at an indirect estimate of the recognition validity in the specific set of objects and adapt their behavior accordingly.
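
To make this alignment idea concrete, the sketch below tallies, across pairs with exactly one recognized object, how often additional knowledge agrees with choosing the recognized object. This is only one possible formalization of the procedure outlined above; the pair representation and the knowledge_signal coding are illustrative assumptions, not part of the original account.

```python
# Minimal sketch: estimate a set's recognition validity indirectly from the
# agreement between recognition and partial knowledge (no outcome feedback).
# The data format is hypothetical.
def estimate_alpha_from_alignment(pairs):
    """pairs: iterable of dicts with keys
       'one_recognized'   -- True if exactly one object is recognized
       'knowledge_signal' -- +1 if knowledge favors the recognized object,
                             -1 if it speaks against it, 0 if uninformative"""
    agree = conflict = 0
    for pair in pairs:
        if not pair["one_recognized"] or pair["knowledge_signal"] == 0:
            continue  # only informative one-recognized pairs carry evidence
        if pair["knowledge_signal"] > 0:
            agree += 1
        else:
            conflict += 1
    if agree + conflict == 0:
        return 0.5  # no evidence: treat recognition as a chance-level cue
    # High agreement suggests recognition is valid in this set; frequent
    # conflict suggests it is not.
    return agree / (agree + conflict)
```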

However, having to re-assess cue validities in each and every object sample from scratch would clearly be cumbersome and potentially impossible (Dougherty, Franco-Watkins, & Thomas, 2008). The second alternative is much more parsimonious in that it registers only the general domain from which the objects were drawn, retrieves the domain’s “global” cue validities, and then selects the appropriate strategies. In fact, most natural samples may be considered roughly representative or at least not severely biased so that global optimization should generally fare well. The caveat in this case is that one may select a strategy that does not actually yield the desired accuracy for the specific task at hand (or forgo use of a more accurate strategy), namely, if the set is not representative of the domain. Fiedler (2008) referred to this as “meta-cognitive myopia,” that is, the “conspicuous reluctance to consider the history and the constraints imposed on the given stimulus sample” (Fiedler, 2012, p. 42; see also Fiedler & Kutzner, in press; Hogarth & Soyer, 2015). Ignorantly presuming unbiased samples, however, could have detrimental consequences for choice performance. If recognition is not valid in the set, then sticking to the RH (because it is valid in the domain) could result in worse performance given that other strategies would fare better. Vice versa, if recognition is not valid in the domain, but highly valid in the selected set, not detecting the usefulness of recognition could again be detrimental to one’s performance. Our main question thus is whether decision makers base their strategy selection on the domain’s cue validities or are able to detect deviations in selected sets and thus let the set’s cue validities govern which strategy is chosen.

In the next section, we discuss in more detail what recognition validity is and how it possibly affects choice behavior. Then, in the main part, we report two experiments in which we manipulated the recognition validity in two different ways, such that domain and set validities diverged: In Experiment 1, we used two selected sets of objects with different recognition validities, but drawn from the same domain. In Experiment 2, we used two selected sets with the same recognition validity, but drawn from domains with different recognition validities. Finally, we discuss the implications of our findings for what has been termed “ecological rationality” in decision making (Todd & Gigerenzer, 2012).

Recognition validity

The recognition validity, termed α (Goldstein & Gigerenzer, 2002), is defined as the proportion of cases in which recognition is aligned with the correct inference, out of all cases in which recognition discriminates between the objects in a pair. For example, if recognition points to the correct choice in 70 out of 100 such pairs (and to the wrong choice in the remaining 30), then α = .70. This value is well above chance (.50), so recognition would be considered a valid, and hence helpful, cue. In other words, the recognition validity reflects the maximum proportion of correct choices one can achieve by consistently following the RH.

The recognition validity is typically approximated as follows (cf. Goldstein & Gigerenzer, 2002). First, a domain is defined, for example, all Swiss cities with more than 25,000 inhabitants (yielding 43 cities). Second, a sample of participants indicates which of these cities they recognize. Third, all possible pairs of the 43 cities are constructed, and the recognition validity is computed for each individual as the percentage of pairs consisting of one recognized and one unrecognized city in which the recognized city is in fact the correct choice. These values are then averaged across all individuals to yield the overall recognition validity for the domain. Alternatively, the Spearman rank correlation rₛ between the cities’ true size-ranks and their recognition rates can be used to determine the overall recognition validity, namely α = (1 + rₛ)/2 (see Martignon & Hoffrage, 2002; Pachur, 2010; Pachur, Todd, Gigerenzer, Schooler, & Goldstein, 2012).
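
The following sketch implements both approximation routes just described: the pairwise computation per individual with subsequent averaging, and the Spearman shortcut α = (1 + rₛ)/2. The data structures (lists of city–population tuples and per-participant sets of recognized names) are illustrative assumptions.

```python
# Sketch of the two ways to approximate a domain's recognition validity.
from itertools import combinations
from scipy.stats import spearmanr

def alpha_for_participant(cities, recognized):
    """cities: list of (name, population) tuples; recognized: set of names."""
    correct = discriminating = 0
    for (name_a, pop_a), (name_b, pop_b) in combinations(cities, 2):
        rec_a, rec_b = name_a in recognized, name_b in recognized
        if rec_a == rec_b:
            continue  # recognition does not discriminate in this pair
        discriminating += 1
        # correct if the recognized city is in fact the larger one
        correct += (pop_a > pop_b) == rec_a
    return correct / discriminating if discriminating else float("nan")

def domain_alpha(cities, recognition_sets):
    """Average the individual validities across a sample of participants."""
    alphas = [alpha_for_participant(cities, rec) for rec in recognition_sets]
    return sum(alphas) / len(alphas)

def alpha_from_spearman(populations, recognition_rates):
    """Shortcut: alpha = (1 + r_s) / 2, with r_s the rank correlation between
    the cities' populations and their recognition rates."""
    r_s, _ = spearmanr(populations, recognition_rates)
    return (1 + r_s) / 2
```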

Consequently, estimates of the recognition validity depend on several factors, most notably on the exact definition of the domain and on the sample of participants asked whether they recognize the objects.¹ The recognition validity will almost certainly change if we restrict the domain to cities with more than 50,000 inhabitants (yielding only ten cities in Switzerland), or extend it to cities with more than 10,000 inhabitants (yielding 143 cities in Switzerland). Many such definitions of domains that have been reported in the literature appear rather arbitrary and do not represent natural categories. For example, Hoffrage (2011) reported that recognition rates (for the same objects) depended on the size of the reference class. Similarly, if we ask different samples, say, German versus Canadian participants, to recognize the set of Swiss cities (see Goldstein & Gigerenzer, 2002), the resulting recognition validity will probably differ, too. Likewise, Pohl (in press) asked younger and older children to recognize cities from a list of large world cities and found in two experiments that recognition rates, and thus recognition validity, differed depending on the children’s age.

The conclusion from these observations is that a domain simply does not possess any one “true” recognition validity, because such a value is derived only from averaging across the recognition knowledge of a sample of participants, and not from external, objective sources.² Thus, recognition validity is a rather abstract entity, and we can approach it only by pooling across a large sample or averaging across studies (thus minimizing error terms). For example, for the domain of large world cities, we found recognition validities of selected sets to range from .57 to .79, M = .70 (SD = .08), across 15 published data sets. For the domain of a single country’s largest cities (like Belgium, Canada, Italy, or Switzerland), we found values from .74 to .88, M = .82 (SD = .05), across 16 published data sets. These mean values might come close to what we could consider a domain’s “true” recognition validity, but it is nonetheless clear that recognition validity cannot be pinned down to any single point value.

Nonetheless, decision makers seem to have access to a domain’s recognition validity, at least with respect to the question of whether recognition is a valid cue or not (see, e.g., Hilbig, Erdfelder, & Pohl, 2010, Data set 7; Pachur, Mata, & Schooler, 2009, Exp. 1; Pohl, 2006, Exp. 1). The corresponding results are reported in the next section. Pachur, Bröder, and Marewski (2008) went even further and asked participants for numerical estimates of the recognition validity (see also Glöckner & Bröder, 2011, 2014). They found in three experiments that subjective estimates of the validities of recognition and three other cues conformed largely to the computed values.³ Similarly, Hoffrage (2011, Exp. 3) asked his participants for subjective estimates of the validity of recognition for objects that were merely recognized versus objects for which further knowledge was available. Participants’ mean estimate was larger for objects with additional knowledge and correctly mirrored the computed recognition validities. However, when Hoffrage compared objects from small versus large reference classes, he found that subjective estimates of recognition validity were contrary to the computed values.

Effects of recognition validity

Given that participants—at least on average—are able to accurately estimate the recognition validity, albeit not in all cases (see above), one would in turn expect them to adjust their decision-making behavior accordingly. This ability to differentiate between domains with different recognition validities has been termed “environment adaptivity” (Pachur et al., 2009). The empirical evidence shows that decision makers are indeed able to adapt their behavior to features of the respective environment. When recognition is a highly valid cue, participants use the RH on average much more often than when it is an invalid cue. Table 1 summarizes three such findings with respect to adherence rates and the probability of RH use. Adherence rates give the proportion of choices of the recognized object in pairs with only one object recognized, but they are known to be a confounded measure and thus do not adequately reflect true use of the RH (see Hilbig, 2010; Hilbig et al., 2010; Horn, Pachur, & Mata, 2015; Pachur et al., 2012; Pohl, 2011). We thus also report results from model-based analyses using the r-model, a multinomial processing tree model that allows an unbiased estimate of RH use (Hilbig et al., 2010; see also Schwikert & Curran, 2014; Horn et al., 2015). Details of the model are reported below.

Table 1 Results from three studies comparing materials with different recognition validities

For example, Pohl (2006, Exp. 1) compared two between-subjects experimental conditions, using 20 large Swiss cities as materials. In one condition, he asked participants which of two Swiss cities is the more populous one (“population” group); in the other, he asked which of two Swiss cities is closer to the city of Interlaken, which lies near the geographical center of Switzerland (“distance” group). Recognition validity was large in the population group, but at chance in the distance group. Participants’ behavior in terms of adherence rates as well as model-based estimates of RH use (parameter r) perfectly reflected this difference. Pachur et al. (2009, Exp. 1) and Hilbig et al. (2010, Data set 7) found similar results.

In addition, two studies provided summaries of such findings in the form of scattergrams pitting adherence rate against recognition validity (see Gigerenzer & Goldstein, 2011, Fig. 1; Pachur et al., 2012, Fig. 5-2). Gigerenzer and Goldstein included a large set of 43 studies and reported an impressive correlation of .57 across studies.

Effects of biased selections

However, in all the studies cited above, the selected materials were more or less representative of the underlying domain (or task), so it remains unclear whether participants based their assessment of recognition validity on the domain or the selected set. To our knowledge, only three studies—two of them unintentionally—provide some insight into that question. Pohl (2006, Exp. 4) used the height of mountains as the decision domain. The recognition validity for the selected set of mountains happened to be at chance (.49), but participants nevertheless used the RH with a high probability (.80; estimated with the r-model). This behavior corresponds to the recognition validity that Goldstein and Gigerenzer (2002) had reported for that domain, namely .85, which probably is a better estimate of the true recognition validity for the height of mountains. Similarly, Pohl (in press, Exp. 2) compared children, adolescents, and adults, using large world cities as materials (which typically have a recognition validity of around .70). The selection of cities turned out to have a recognition validity close to chance (≈ .50) for all groups. Yet, all three groups used the RH rather often, as indicated by high estimates of the r-model; that is, they behaved as if the set had been representative of the underlying domain.

In a recent study, Basehore and Anderson (2016, Exp. 2) deliberately selected three training sets out of the domain of large world cities, which typically has a substantial recognition validity of around .70 (see above). The three sets had widely differing recognition validities of .80, .50, and .20 (based on pilot data), and RH use or non-use was trained by giving participants feedback on their choices in a first experimental phase. In a second phase, without feedback and without biased sets, participants showed adherence rates that differed slightly but significantly between the three training conditions, namely .95, .92, and .83, respectively, thus showing an effect of the training procedure. But again, adherence rates were rather high, reflecting the recognition validity of the domain much more than that of the trained, biased sets.

In sum, it appears that individuals are sensitive to diverging recognition validities in different domains and that they adjust their use of the RH accordingly. Further, it appears that individuals base their selection of inference strategies on the domain’s rather than the selected set’s features. However, the latter conclusion rests on scarce and largely incidental evidence so far; a thorough investigation that systematically manipulates the domain’s and the set’s recognition validities is still missing. In the following, we report two such experiments that critically test whether participants adjust their RH use to the domain’s or to the set’s recognition validity.

Overview of experiments

Both of our experiments followed the same basic procedure (cf. Goldstein & Gigerenzer, 2002; Pohl, 2006). Participants received a number of pairs of cities and were asked for each pair to infer which of the two cities is more populous.⁴ However, the chosen sets of cities were not representative of the underlying domains. Specifically, in Experiment 1, we constructed two sets (from the same domain) with recognition validities that were either below or above the domain’s recognition validity. In Experiment 2, we constructed two sets with the same recognition validity, but drawn from two different domains, one with a larger and the other with a lower recognition validity. In addition to the inference task, participants also received a recognition test, in which they judged for each city whether they recognized it or not. To continue the research discussed above (Hoffrage, 2011; Pachur et al., 2008) and as a manipulation check, we also asked participants to provide subjective estimates of the recognition validity in the respective domain.

From the individual recognition judgments, we classified all pairs of cities into knowledge pairs (i.e., both objects recognized), guessing pairs (i.e., neither object recognized), and recognition pairs (i.e., exactly one object recognized, split according to whether the recognized city was the correct or the false choice). The resulting frequencies of pairs in each of these four categories were further split according to whether participants made a correct or false inference, leading to eight disjoint categories. These frequencies are the input to the r-model, which estimates the probability of RH use and other parameters (Hilbig et al., 2010). The r-model is a multinomial processing tree model (Batchelder & Riefer, 1999; Erdfelder et al., 2009) that allows the estimation of the proportion of RH use in a direct and unbiased way (Hilbig, 2010). The r-model consists of three trees representing the three possible pair types in the comparison task. It accounts for the observed frequencies of choices in the eight categories by four latent parameters. The most important of these parameters is the probability r of applying the RH, that is, making inferences in recognition cases based on the recognition cue in isolation. In the following analyses, parameters were estimated with multiTree (Moshagen, 2010).
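
As an illustration of how the eight category frequencies that enter the r-model can be obtained, the sketch below tallies them from trial-level data. The trial format (one dictionary per pair with recognition judgments, the choice, and the objectively correct city) is an assumption for illustration; in practice these counts are passed to multiTree.

```python
# Sketch: tally the eight disjoint category frequencies that serve as input
# to the r-model. The trial representation is hypothetical.
from collections import Counter

def rmodel_frequencies(trials):
    """trials: iterable of dicts with boolean keys
       'rec_a', 'rec_b' (recognition judgments for the two cities),
       'chose_a' (participant's choice), 'a_is_correct' (objective truth)."""
    freqs = Counter()
    for t in trials:
        inference = "correct" if t["chose_a"] == t["a_is_correct"] else "false"
        if t["rec_a"] and t["rec_b"]:
            pair_type = "knowledge"            # both objects recognized
        elif not t["rec_a"] and not t["rec_b"]:
            pair_type = "guessing"             # neither object recognized
        else:
            # exactly one object recognized: split by whether the recognized
            # city is in fact the correct choice
            recognized_is_correct = t["rec_a"] == t["a_is_correct"]
            pair_type = ("recognition, recognized correct" if recognized_is_correct
                         else "recognition, recognized false")
        freqs[(pair_type, inference)] += 1
    return freqs  # 4 pair categories x correct/false inference = 8 cells
```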

The model was considered to fit the data if its fit value (G²) was not significant. Following other authors (e.g., Bayen, Erdfelder, Bearden, & Lozito, 2006; Erdfelder, 1984; Moshagen & Erdfelder, 2016), we deliberately chose a lower Type-1 error probability for the model fit to the aggregate data than usual, namely α = .01, because otherwise—given the large data sets of 36,000 data points in Experiment 1 and 9,000 in Experiment 2—even minute deviations from the model assumptions would result in model misfit. Even with α = .01, small deviations from the model (i.e., w = .10 as defined by Cohen, 1988) are detected by a G²(df = 2) goodness-of-fit test with a power exceeding (1−β) = .99 (computed with G*Power; see Faul, Erdfelder, Buchner, & Lang, 2009). To compare parameter values of the r-model between experimental conditions, we constrained the respective parameters to be equal and tested the resulting decrement in model fit (∆G²). If this decrement was significant, the parameters were considered to be different from each other. Just as in the case of the omnibus goodness-of-fit tests, α = .01 is a reasonable criterion of significance for these ∆G²(df = 1) tests, as α = .01 is associated with a power exceeding (1−β) = .99 even for small effects of size w = .10. In sum, both our G²(2) goodness-of-fit tests and the ∆G²(1) parameter comparison tests are already very sensitive to small deviations from the null hypotheses in the experiments reported here.
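
The test logic just described can be reproduced with standard chi-square routines, as sketched below: the power of the goodness-of-fit test follows from a noncentral chi-square distribution with noncentrality λ = N·w², and a parameter comparison tests the fit decrement ∆G² against a chi-square distribution with one degree of freedom. The G² values passed to the comparison function would be placeholders, not results from the reported experiments.

```python
# Sketch of the power check and the Delta-G^2 parameter comparison.
from scipy.stats import chi2, ncx2

def gof_power(n, w, df, alpha=.01):
    """Power of a G^2(df) goodness-of-fit test to detect effect size w
    with n observations (noncentrality lambda = n * w**2)."""
    crit = chi2.ppf(1 - alpha, df)            # critical value under H0
    return 1 - ncx2.cdf(crit, df, n * w**2)   # probability of exceeding it under H1

def delta_g2_test(g2_constrained, g2_unconstrained, delta_df=1, alpha=.01):
    """Fit decrement when a parameter is equated across conditions."""
    delta = g2_constrained - g2_unconstrained
    p = chi2.sf(delta, delta_df)
    return delta, p, p < alpha                # significant -> parameters differ

print(gof_power(n=36_000, w=.10, df=2))       # exceeds .99, as stated above
```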

A disadvantage of the G² significance tests is that they focus on goodness of fit only and do not take model flexibility into account. In general, one aims at parsimonious models of low complexity with few free parameters that nevertheless fit the data well. To crosscheck whether the r-model variants favored by our G² and ∆G² tests are also those that provide the most parsimonious description of the data structure when model complexity is taken into account, we additionally compared all variants of r-models (i.e., those with versus those without some or all parameters constrained to be equal across conditions) using two established model selection criteria, the Akaike information criterion (AIC) and the Fisher information approximation (FIA) to the normalized maximum likelihood (see, e.g., Heck, Moshagen, & Erdfelder, 2014; Wu, Myung, & Batchelder, 2010). The AIC combines a badness-of-fit measure with a penalty term reflecting the number of free parameters in the respective model. For a set of candidate models, AIC selects the model that provides the best approximation to the underlying data structure, in the sense that it minimizes expected information loss (in Kullback-Leibler terms) when approximating the expected frequencies implied by the true model (Wagenmakers & Farrell, 2004). In contrast to AIC, the FIA criterion accounts not only for the number of free parameters but also for model complexity due to functional form, which makes it a particularly attractive model selection criterion provided that the sample size is not too small (see Heck et al., 2014; Hilbig & Moshagen, 2014). For the data we report here, sample size is not a problem.
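
For a comparison on the same data, AIC can be computed from each candidate's G² value plus a penalty of twice its number of free parameters (the constant shared by all candidates cancels). The sketch below illustrates this selection step; the candidate labels and values are hypothetical, and FIA, which additionally penalizes functional-form complexity, is left to dedicated software such as multiTree.

```python
# Sketch of AIC-based selection among r-model variants (hypothetical values).
def aic(g2, n_free_params):
    """AIC up to a constant shared by all candidates fitted to the same data."""
    return g2 + 2 * n_free_params

def select_by_aic(candidates):
    """candidates: dict mapping model label -> (G^2, number of free parameters)."""
    return min(candidates, key=lambda label: aic(*candidates[label]))

# Illustrative comparison: equating g and r across two conditions saves two
# free parameters at the price of a slightly worse fit.
candidates = {
    "unconstrained (8 parameters)":   (7.0, 8),
    "g and r equated (6 parameters)": (7.2, 6),
}
print(select_by_aic(candidates))  # picks the model with the smallest AIC
```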

Following the scarce evidence discussed above (Basehore & Anderson, 2016, Exp. 2; Pohl, 2006, Exp. 4, and in press, Exp. 2), it is plausible to assume that participants will adjust RH use to the domain’s recognition validity, ignoring the selected set’s recognition validity. In other words, they will behave according to the global- and not the local-optimization hypothesis. Accordingly, in Experiment 1, in which two sets varying in recognition validity were drawn from the same domain, we expected no difference in RH use, whereas in Experiment 2, in which sets with the same recognition validity were drawn from two domains varying in recognition validity, we expected clear differences in RH use.

Experiment 1

Methods

Participants and design

A total of 120 participants, consisting of 82 females and 38 males, aged between 17 and 55 years (M = 23.1, SD = 5.4), were recruited at the University of Mannheim. After providing consent and demographic information, participants were randomly assigned to one of two experimental groups. One group received a material set with high recognition validity and the other a set with low recognition validity, both drawn from the same domain (see below). Note that our hypothesis was the null, that is, that both groups would show the same RH use.

Material

We used the domain of major world cities, asking participants to compare pairs of cities with respect to their population. From that domain and based on previous data (Hilbig, 2010; Hilbig et al., 2010; Hilbig, Erdfelder, & Pohl, 2012; Hilbig, Michalkiewicz, Castela, Pohl, & Erdfelder, 2015), we selected one set of 25 cities featuring a high recognition validity of approximately .80 (high-alpha set condition) and a second set of 25 cities featuring a low recognition validity of approximately .50 (low-alpha set condition; see Table 3 in Appendix A). Note that α is at chance level in the latter condition, so that recognition is not a valid cue at all. According to a study by Hilbig (2010), the domain of major world cities has a recognition validity of .64. This value is comparable to what was found in other studies using representative sets from that domain (α ≈ .70). Hence, both selected sets deviated from the underlying domain in terms of recognition validity—one upwards, the other downwards.

To rule out other possible influences, we kept all remaining aspects of the materials constant. First of all, we ensured that both sets included the same number of recognized objects, resembling the recognition proportion of the underlying decision domain (i.e., .60). At the same time, we controlled for knowledge validity. Knowledge validity, labeled β, is defined as the proportion of correct choices in pairs with both objects recognized (Goldstein & Gigerenzer, 2002), and thus indexes each participant’s quality of knowledge in a given set. We selected items such that knowledge validity would be similar across sets and, moreover, similar to the typical knowledge validity in the underlying domain (β ≈ .65, cf. Hilbig et al., 2010, 2012, 2015). Finally, to avoid an influence of any particular choice of just two specific sets, we created 15 different sets for each of the two conditions (while ensuring that they fulfilled the specific requirements with respect to recognition validity, recognition proportion, and knowledge validity). In this way, any behavioral differences between the two conditions could be attributed to differences in recognition validity only.
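
The text does not spell out how the constrained sets were assembled; one plausible way to construct such sets, sketched below under that assumption, is to repeatedly draw candidate 25-city subsets and keep those whose pilot-based recognition validity, recognition proportion, and knowledge validity fall within tolerance of the targets. The function pilot_stats and all names are hypothetical.

```python
# Hedged sketch: rejection sampling of item sets that match target values for
# recognition validity, recognition proportion, and knowledge validity.
import random

def sample_constrained_set(cities, pilot_stats, targets, tol=.03,
                           set_size=25, max_tries=100_000):
    """cities: list of city names.
    pilot_stats: callable returning (alpha, recognition_proportion, beta)
                 for a candidate set, computed from pilot data (hypothetical).
    targets: the (alpha, recognition_proportion, beta) values to match."""
    for _ in range(max_tries):
        candidate = random.sample(cities, set_size)
        stats = pilot_stats(candidate)
        if all(abs(s - t) <= tol for s, t in zip(stats, targets)):
            return candidate  # candidate meets all three constraints
    raise RuntimeError("no candidate set met the constraints")
```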

Procedure

Apart from the specific item sets, the procedure was identical for the high-alpha and the low-alpha conditions. Participants worked on the city-size task, consisting of a recognition task and a paired-comparison task in counterbalanced order. In the recognition task, participants indicated for the respective set of 25 cities whether or not they had heard of each city before the experiment. For the paired-comparison task, the 25 cities were exhaustively paired, resulting in 300 pairs. For each pair, participants were asked to judge which city is more populous. To prevent participants from simply clicking through the 300 trials, the comparison task included performance-dependent payment. Specifically, participants were informed that they would gain 3 cents for every correct answer, but also lose 3 cents for every wrong answer. However, we provided no feedback during the experiment.

After completion of the city-size task, we asked participants, among other things, to estimate the recognition validity (cf. Hoffrage, 2011; Pachur et al., 2008). To this end, the concept of validity was explained thoroughly, and participants were asked to give individual ratings of the validity of the recognition cue by providing percentage values between 50% and 100%. Finally, participants received feedback about their overall performance, were paid (M = 2.19 Euro, SD = 1.36), and were debriefed.

Results and discussion

The r-model with unconstrained parameters was applied to the aggregated data set containing 36,000 data points and fit these data with G²(2) = 7.02, p = .03. The parameter estimates for the model are given in Table 2, separately for the two experimental conditions. As expected, the estimates show a large and significant difference in recognition validity (parameter a), ∆G²(1) = 1216, p < .001. In addition, knowledge validity (parameter b) was also significantly larger in the high- than in the low-alpha set group, ∆G²(1) = 71.9, p < .001. In contrast, neither the probability of correct guessing (parameter g) nor the probability of RH use (parameter r) differed significantly between conditions, ∆G²(1) = .10, p = .72, and ∆G²(1) < .001, p = .99, respectively. Thus, the r-model fit the aggregate frequencies, with r and g parameters that do not differ between conditions, whereas both the recognition and knowledge validity parameters are larger in the high- than in the low-alpha set condition. We arrive at the same conclusion when we apply the AIC and FIA model selection measures to the hierarchy of r-model variants (summarized in Table 5 in Appendix B): Considering all possible patterns of equality constraints between conditions, the model with the best (i.e., smallest) AIC and FIA values is Model 6f, which assumes equality of the g and r parameters between conditions and does not impose any constraints on the recognition and knowledge validity parameters a and b.

Table 2 Parameter estimates of the r-model and subjective recognition-validity estimates in Experiments 1 and 2 (with standard errors in parentheses)

Importantly, to check whether aggregating frequencies across participants influenced our results, we analyzed the data with two additional methods. First, we applied the r-model to the data of each participant separately and then computed the means of the parameter estimates across participants (excluding non-fitting data sets). Second, we applied a hierarchical variant of the r-model that estimates individual and group parameters and takes correlations between parameters into account (Michalkiewicz & Erdfelder, 2016). This model is based on Klauer’s (2010) hierarchical latent-trait approach to MPT models. Both analyses, the individual and the hierarchical approach, yielded virtually the same pattern of results (see Tables 7 and 8 in Appendix C), thus confirming that the aggregate analysis reported above did not produce artificial results due to pooling response frequencies across individuals.

Arguably, the observed (but unintended) difference in knowledge validity could have affected participants’ choice behavior. Knowledge validity was significantly larger in the high- than in the low-alpha set condition. Participants in the high-alpha set condition could thus have more often employed their knowledge, reducing RH use in turn (see Bröder & Eichler, 2006; Glöckner & Bröder, 2011, 2014; Hilbig et al., 2015; Hilbig, Pohl, & Bröder, 2009; Newell & Fernandez, 2006; Richter & Späth, 2006). This is, however, unlikely here for two reasons. First, the difference in knowledge validity was relatively small (i.e., .06) compared to the difference in recognition validity (i.e., .26). Second, the hierarchical r-model analysis found that, across both conditions, knowledge validity did not correlate with RH use (r = −.04, 95% Bayesian Credible Interval BCI [−.25, .17]). We thus conclude that knowledge validity had little if any influence on RH use in this experiment.

Finally, individual ratings of the recognition validity were in line with participants’ choice behavior. Participants judged the recognition cue to be equally valid across conditions (see Table 2); t(118) = 0.39, p = .70, Cohen’s d = 0.07, BF₁₀ = 0.21.⁵

In sum, participants used the RH equally often in both conditions, irrespective of the actual recognition validity of the specific material set. Subjective estimates of recognition validity reflected that. These findings are in line with the idea that people have knowledge about global decision domains, but do not detect deviations in specifically selected (and potentially biased) subsets. In other words, participants behaved in accord with the global-optimization hypothesis, but in contrast to the local-optimization hypothesis. In the next experiment, we tested the reverse situation, using item samples with equal recognition validities, but drawn from domains with different recognition validities.

Experiment 2

Methods

Participants and design

A total of 30 participants, consisting of 14 females and 16 males, aged between 17 and 30 years (M = 21.9, SD = 3.1), were recruited at the University of Mannheim and randomly assigned to one of two groups. The groups differed only with respect to which material they received (see below). Note that we expected RH use to differ between the groups—reflecting the difference in recognition validity of the two underlying domains.

Materials

As before, we manipulated recognition validity between participants in two conditions. This time, however, we selected two sets with the same recognition validity from two domains (or, more precisely, from two tasks with the same objects) that featured widely differing recognition validities. We used the largest Italian cities, comparing them either with respect to their population (high-alpha domain condition) or with respect to their height above sea level (low-alpha domain condition). These materials had been tested in a pre-study (N = 41), in which participants first made recognition judgments for the largest Italian cities and then compared pairs of cities with respect to either population or height. Recognition validity was almost .80 for the population domain and only slightly exceeded .50 for the height domain (see also Hilbig et al., 2010, Data set 7, who found recognition validities of .87 and .53, respectively). From these domains, we selected two biased samples of 25 cities each with a recognition validity of α ≈ .65 (see Table 4 in Appendix A). As in Experiment 1, we aimed at choosing sets that mirrored the underlying domains regarding recognition proportion and knowledge validity. Again, we selected 15 different subsets for each condition, that is, a different set of cities for each participant, in order to minimize the potential effect of the choice of specific items.

Procedure

To maximize comparability, the second experiment was an exact replication of Experiment 1 in terms of procedure. Participants first worked on the paired-comparison task and the recognition task in random order, followed by individual ratings of recognition validity. As before, they received payment depending on their performance (M = 2.16 Euro, SD = 1.11).

Results and discussion

Experiment 2 was analyzed in the same way as Experiment 1. The unconstrained r-model was applied to the aggregated data set containing 9,000 data points and fit the data with G²(2) = 8.59, p = .014. The parameter estimates of the model are given in Table 2, separately for the two conditions. The estimates show a slight but significant difference in recognition validity (parameter a), ∆G²(1) = 7.80, p = .005. In addition, knowledge validity (parameter b) was also larger in the high- compared to the low-alpha domain condition, ∆G²(1) = 60.6, p < .001. As in Experiment 1, the guessing probability (parameter g) did not differ significantly between conditions, ∆G²(1) = 1.00, p = .31. In contrast, a large and significant difference in RH use (parameter r) was observed, ∆G²(1) = 531, p < .001, with r in the high-alpha condition exceeding r in the low-alpha condition. Thus, according to the ∆G² significance tests, only the g parameter can be equated between conditions in Experiment 2. AIC- and FIA-based model selection procedures led to very similar conclusions (see Table 6 in Appendix B). The best model in terms of the AIC criterion is Model 7c, which assumes equality of g between conditions and unconstrained a, b, and r parameters. In terms of the FIA criterion, both Model 7c and Model 6b (additionally assuming a₁ = a₂) performed best. In sum, whereas the manipulation affected RH use and knowledge validity, there is either no effect or a very weak effect on the guessing probability and the recognition validity.

Just as in Experiment 1, two additional analyses, namely an individual-participant analysis based on the means of the r-model parameter estimates for each participant (excluding non-fitting data sets) and a hierarchical r-model analysis (Michalkiewicz & Erdfelder, 2016), yielded mainly the same pattern of results (see Tables 7 and 8 in Appendix C), thus confirming that the aggregate analysis reported above did not produce artificial results based on pooling across individuals. There was only one deviation: The small difference in recognition validity that was significant in some of the aggregate analyses was not significant in either of the two additional analyses.

One could again argue that differences in the sets’ recognition validity and in participants’ knowledge validity affected choice behavior. Both validities were larger in the high-alpha domain condition. Concerning the recognition validity, however, the difference between conditions was relatively small (i.e., .04), for example, compared to the difference in RH use (i.e., .66). In addition, the FIA model selection favored two models, one of which assumed no difference in the sets’ recognition validity. Finally, neither the individual nor the hierarchical r-model variant confirmed a significant difference in recognition validity. The hierarchical r-model analysis also did not hint at a reliable correlation between recognition validity and RH use (r = .21, 95% BCI [−.20, .57]). In sum, it appears unlikely that the recognition validity of the chosen item sets had an impact on RH use.

The difference in knowledge validity was larger and robust across all analyses, reflecting that participants simply know substantially more about the size of Italian cities (high-alpha domain) than about the cities’ height above sea level (low-alpha domain). As discussed above, better knowledge could lead to more use of this knowledge and thus reduced reliance on the RH (see Bröder & Eichler, 2006; Glöckner & Bröder, 2011, 2014; Hilbig et al., 2009, 2015; Newell & Fernandez, 2006; Richter & Späth, 2006). Note, however, that such an effect would actually reduce the difference in RH use between our conditions: If knowledge validity were the driving factor, participants should have used the RH less often in the high-alpha domain condition (about which they had more valid knowledge) and more often in the low-alpha domain condition (about which they had less valid knowledge). In other words, the already large difference we found in RH use might, if anything, have been even larger if knowledge validity had been comparable across conditions. Thus, we conclude that the difference in RH use is certainly not artificially inflated by differences in knowledge validity. On the contrary, the “true” difference in RH use could be even larger.

Finally, we compared subjective estimates of recognition validity across conditions. In fact, participants in the high-alpha domain condition judged the recognition validity to be considerably higher than participants in the low-alpha domain condition (see Table 2); t(28) = 4.18, p < .001, Cohen’s d = 1.58, BF₁₀ = 94.9.

In sum, we found again that participants based their choice behavior on the domain’s recognition validity; that is, participants in the high-alpha domain condition used the RH considerably more often than participants in the low-alpha domain condition. Subjective estimates of recognition validity again reflected this. The results of Experiment 2 thus further support the global-optimization hypothesis, according to which people base their strategy selection on the accuracy of a strategy as determined by the global decision domain, and not by specific sets thereof (as the local-optimization hypothesis would predict).

General discussion

In this paper, we set out to investigate whether decision makers base their use of the recognition heuristic (RH; Goldstein & Gigerenzer, 2002), a simple inference strategy, on the validity of recognition as a cue in the underlying domain or in the specific set of to-be-compared objects. We referred to these two possibilities as the global- and the local-optimization hypothesis, respectively. Some evidence from previous studies reported in the Introduction is compatible with the global-optimization hypothesis (Basehore & Anderson, 2016; Pohl, 2006, in press), but conclusive experimental tests were missing. We conducted two experiments systematically varying recognition validity in the domain and in the chosen set of objects. In Experiment 1, we selected two sets with rather different recognition validities drawn from the same domain. In Experiment 2, we selected two sets with the same recognition validity drawn from two (task) domains with rather different recognition validities. The results from both experiments clearly show that it is the domain’s recognition validity that impacts RH use, whereas the specific recognition validity in the set of to-be-compared objects is essentially ignored, thus supporting the global-optimization hypothesis and refuting the local-optimization hypothesis. Subjective estimates of the recognition validity also reflected the domain’s and not the set’s validity, thus corroborating that participants did not detect (or did not even search for) deviating characteristics of the presented set of materials. We may thus conclude that participants (a) have access to the recognition validity in a domain, (b) adjust their use of the RH accordingly, and (c) treat even substantially biased item sets as representative of their underlying domain.

One conclusion from these results is that it is questionable to analyze participants’ strategic behavior with respect to the sample’s recognition validity (at least if the latter is not representative). The recognition validity in the selected set of objects is only important for checking whether the set is representative of the domain, or whenever the overall accuracy of participants’ judgments is the focus.

Our results are reminiscent of discussions regarding the role of representative design (see, e.g., Dhami, Hertwig, & Hoffrage, 2004; Gigerenzer & Hoffrage, 1995; Hoffrage & Hertwig, 2006). In the 1970s and 1980s, research focused on many biases and illusions in judgment and decision making, most prominently in the “heuristics and biases” program of Kahneman and Tversky and their collaborators (see Gilovich, Griffin, & Kahneman, 2002; Kahneman, Slovic, & Tversky, 1982; Pohl, 2017). Several of these biases were subsequently reinterpreted as reflecting not deficiencies of the human information-processing system, but rather the use of artificial, non-representative materials that fooled participants into faulty behavior (Gigerenzer, 1991; Gigerenzer & Hoffrage, 1995).

For example, the overconfidence effect (see Hoffrage, 2017, for a summary), that is, being overconfident about how good one’s evaluations of true/false assertions are, appeared only for non-representative material selections composed of overly difficult items. For representative materials, the effect was reduced or even absent (Gigerenzer, Hoffrage, & Kleinbölting, 1991; Juslin, 1994; Juslin, Winman, & Olsson, 2000; Winman, 1997). Thus, it seems that in such cases—just as in our experiments—participants simply assumed representative material sets, for which their behavior would have been well calibrated, but then received biased materials of which they were not aware, which in turn led to non-optimal behavior. As a second example, hindsight bias, that is, the tendency to overestimate one’s pre-outcome knowledge (see Pohl & Erdfelder, 2017, for a general overview), has been found to be much larger for almanac questions than for case histories (Christensen-Szalanski & Willham, 1991). As an explanation, Winman (1997) proposed that the selection of almanac items used as materials was not random, but included on average overly difficult questions. He accordingly found that hindsight bias was reduced or even absent when using random samples (see also Winman, Juslin, & Björkman, 1998). In other words, at least part of the bias lay in the material selection, not in the participants’ behavior. Further examples of phenomena arguably caused by biased samples include illusory correlations and confirmation bias (Fiedler & Kutzner, in press), biased weighting of small probabilities in risky choice (Glöckner, Hilbig, Henninger, & Fiedler, 2016; Hilbig & Glöckner, 2011), and unrealistic optimism (Harris & Hahn, 2011).

Nonetheless, one may question why participants do not detect such biased samples and then adapt their behavior; by implication, one may question whether the observed behavior should still be considered “ecologically rational” (Todd & Gigerenzer, 2012). In our two experiments, participants indifferently applied a cue that was useless in the given set or failed to exploit one that was useful there. This could indeed be considered a case of meta-cognitive myopia (Fiedler, 2008, 2012), a term that highlights shortsightedness in a variety of monitoring processes, for example, in erroneously accepting a sample as representative of the underlying domain (see also Hogarth & Soyer, 2015). However, maybe participants’ behavior is not so ignorant and the bias not so obvious. One argument could be that participants—as a default—treat all selections as if they were representative, at least as long as they have no reason to doubt that assumption. It might, moreover, be clear that a sample is never perfectly representative of the domain, even if randomly selected (as, e.g., in the two unintended conditions with recognition validity at chance level reported in the Introduction), but on average it will be close to representative. Besides, there are myriad potentially deviating features, such as number, values, distributions, correlations, and validities (Hoffrage & Hertwig, 2006), which one would need to check. Some of these may also be difficult if not impossible to observe and evaluate (Fiedler, 2012).

It thus appears straightforward and much more efficient to simply assume random (and thus approximately representative) samples, despite the caveat of some (typically minor) costs in accuracy. Such detriments in performance would probably be more than outweighed by the costs of checking all potentially relevant features. Only when the selected set of objects deviates blatantly from what one would expect, or when decision makers have reason to suspect biased sampling, might they be inclined to reflect on the sample’s features and then adapt their behavior accordingly. But in reality, such cases might be rare, so that sticking to the assumption of a representative sample appears rather efficient and thus ecologically rational after all.