We conduct visual searches all day long: looking for milk in the refrigerator, the car keys, the exit, the parking spot, the e-mail icon, and so forth. Most of that searching is easy enough that we give it no thought as we pursue our goals. The purpose of this article is to offer some empirical insight into the apparent efficiency of search for arbitrary objects in real scenes. There is a vast literature on visual search for a target item among distracting items (for reviews, see Sanders & Donk, 1996; Wolfe, 1998a, 1998b; Wolfe & Reynolds, 2008). The great bulk of this work has been done with simple stimuli, isolated on blank backgrounds (for a few examples, see Enns, 1988; Koene & Zhaoping, 2007; Olds, Graham, & Jones, 2009; Treisman, 1993). These studies have taught us a great deal about the basics of search, but there is no getting around the fact that such stimuli are highly artificial. Outside of the lab, people simply do not spend much time searching for red vertical lines, and when they do, the other items in the visual field are unlikely to be evenly divided into sets of red horizontals and green verticals. A smaller body of research has involved search for pictures of objects (e.g., Biederman, Blickle, Teitelbaum, & Klatsky, 1988; Wolfe, Horowitz, Kenner, Hyle, & Vasan, 2004; Yang & Zelinsky, 2009; with an important subset of work focused on search for face stimuli: e.g., Doi & Ueda, 2007; Hershler & Hochstein, 2005, 2006; VanRullen, 2006; Williams, Moss, Bradshaw, & Mattingley, 2005). But these studies have still involved search for isolated objects on blank backgrounds.

Natural search is conducted for objects embedded in scenes, and a growing number of studies have involved real scenes. After some early work (Enoch, 1959; Kingsley, 1932), the systematic study of search in scenes began with Biederman’s experiments (Biederman, Glass, & Stacy, 1973), which showed that coherent scene structure aids search. Wolfe (1994) used artificial aerial views to show that guidance by such basic attributes as color and orientation still operated when the targets and distractors were part of a continuous stimulus. Computational work has provided further evidence that the mechanisms of guidance, used to account for highly artificial search, could also be applied to scenes. Itti and Koch (2000, 2001) developed a model of bottom-up guidance by basic attributes that could be calculated over real scenes. Bottom-up salience, on its own, does not explain deployments of the eyes (and, presumably, of attention) over scenes (Foulsham & Underwood, 2007; Henderson, Brockmole, Castelhano, & Mack, 2007; Henderson, Malcolm, & Schandl, 2009). Subsequent work has expanded models to include top-down, user-driven guidance by basic attributes (Hamker, 2006; Navalpakkam & Itti, 2005; Zelinsky, 2008). Thus, if you know the features of what you are looking for, you can guide your attention to the parts of a random display (Motter & Belky, 1998) or of a scene (Pomplun, 2006) that contain those features.

Other computational work has picked up on Biederman et al.’s (1973) finding that scene structure is important in search. Considerable current interest focuses on the role of scene “priors” (Droll & Eckstein, 2008; Ehinger, Hidalgo-Sotelo, Torralba, & Oliva, 2009; Hidalgo-Sotelo, Oliva, & Torralba, 2005; Torralba, Oliva, Castelhano, & Henderson, 2006) because, unlike in random displays of isolated objects, in a real scene the scene itself tells you where some objects might be found. Moreover, people appear to use this information in search. Observers know that people generally appear on horizontal surfaces (Droll & Eckstein, 2008; Torralba et al., 2006), chimneys appear on roofs (Eckstein, Drescher, & Shimozaki, 2006), and pots appear on stoves (Võ & Henderson, 2009). For a useful discussion of scene prior information in linguistic terms, see Henderson and Ferreira (2004).

The purpose of the present work is to understand why mundane search in the world is often easier than one might predict, extrapolating from what we know about search in the lab. Real scenes are complex and heterogeneous, and in principle any item in a scene might serve as the target for a search. In some experiments with real or realistic scenes, the target object changes from trial to trial (Brockmole & Henderson, 2006; Henderson et al., 2009; Hollingworth & Henderson, 2002; Malcolm & Henderson, 2009). These studies have addressed questions such as the nature of the search target template and the roles of bottom-up and top-down factors in these searches. Our focus is on the apparent efficiency of search in scenes. We focus on two questions: First, is search for arbitrary objects in scenes actually efficient, as assessed with experimental rather than introspective methods? Second, since we will give an affirmative answer to the first question, what guides efficient search in scenes?

Search efficiency and scenes

When discussing search for targets among distractors on homogeneous backgrounds, the standard measure of efficiency is the slope of the function relating reaction time (RT) to set size. Traditionally, search tasks that produced slopes near 0 ms/item have been called “parallel,” while those with slopes greater than about 20–25 ms/item for target-present trials have been deemed to be the products of serial search. However, the serial/parallel distinction is a theoretical claim about underlying mechanisms, and a problematic one, at that (Townsend, 1971, 1990; Townsend & Wenger, 2004). Referring to search “efficiency” has the virtue of being neutral about such issues. It simply describes the effective rate with which items can be processed in a search task (Wolfe, 1998a, 1998b).
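To make the slope measure concrete, the following minimal Python sketch fits the RT × Set Size line; the RTs are hypothetical illustrations, not data from the experiments below:

```python
import numpy as np

# Hypothetical mean correct RTs (ms) for one trial type (e.g., target present)
# at each set size; a real analysis would average over observers first.
set_sizes = np.array([4, 8, 12, 16])
mean_rts = np.array([520, 610, 695, 790])

# Search efficiency is the slope of the best-fit line relating RT to set size.
slope, intercept = np.polyfit(set_sizes, mean_rts, 1)
print(f"efficiency: {slope:.1f} ms/item, intercept: {intercept:.0f} ms")
# ~22 ms/item here would traditionally be read as inefficient, serial-range search.
```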

We search because we cannot fully process all items in the visual field at one time. If attentional resources are deployed at random, search is inefficient. According to Guided Search (Wolfe, 2007; Wolfe, Cave, & Franzel, 1989) and similar models, search becomes more efficient when attention can be “guided” to some subset of all of the available stimuli in the scene. Following Neider and Zelinsky (2008), we call that subset the “functional set size”—the set of items that the visual system deems worth considering as targets. In laboratory search tasks, the highest efficiencies (RT × Set Size slopes near 0 ms/item) are seen when a target is defined by a unique feature among homogeneous distractors of another sufficiently distinct feature, with the features drawn from one of the basic attributes that guide visual attention in search (Wolfe & Horowitz, 2004). Thus, red will “pop out” among green, vertical among horizontals, and so forth (Nothdurft, 1993; Treisman & Gelade, 1980). One way to describe a slope of zero is to propose that guidance has reduced the functional set size to 1, and thus search is easy, regardless of the number of distractors. As the differences between targets and distractors decrease and/or the heterogeneity of distractors increases, search efficiency declines (Duncan & Humphreys, 1989). If a target is defined by multiple features, attention can be guided to the conjunction of features, and search efficiency tends to be somewhat worse than is seen for pop-out searches (Wolfe et al., 1989). Here, the functional set size is reduced to some proportion of the full set size. For instance, if half of the items are red and the target is known to be red, the functional set size will be reduced by half, as will the slope of the RT × Set Size function.
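The proportional relationship between functional set size and slope can be illustrated with a small Monte Carlo sketch. This is our own toy construction (a serial, self-terminating search with a 35-ms dwell per item), not a model from the guided search literature:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_rt(set_size, guided, n_trials=20000, dwell_ms=35, base_ms=400):
    """Serial, self-terminating search for a red target.

    Half of the items are red. With guidance, attention visits only the
    red subset (the functional set size); without it, all items.
    """
    n_candidates = set_size // 2 if guided else set_size
    # The target is equally likely to be found at any visit position.
    position = rng.integers(1, n_candidates + 1, size=n_trials)
    return base_ms + dwell_ms * position.mean()

for n in (8, 16, 24):
    print(n, round(mean_rt(n, guided=False)), round(mean_rt(n, guided=True)))
# Regressing these RTs on set size gives a guided slope about half the
# unguided slope, mirroring the halved functional set size.
```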

If no basic feature information distinguishes targets and distractors, search typically proceeds at an “inefficient” rate of 20–40 ms/item for target-present trials, and a bit more than twice that for target-absent trials (Kwak, Dagenbach, & Egeth, 1991). Search becomes even more inefficient if each item takes a significant time to identify or if eye movements and fixations on each item are required. In the latter case, the rate would be dominated by the rate of voluntary eye movements: three or four saccades per second.

Search in scenes seems to be efficient, in spite of multiple properties that would seem to militate against efficiency. Distractor features (like color) tend to be heterogeneous. Objects are less clearly displayed than they would be on a homogeneous background. Targets are diverse and change from “trial” to “trial.” All of these factors would reduce search efficiency in the lab. We will argue that these factors are balanced by scene-specific forms of guidance. In addition to guidance by basic attributes such as size and color, search in scenes is guided by several varieties of “semantic” guidance, or guidance by the structure and meaning of scenes (e.g., the target-constraining effects of physics that are not found in random displays). The functional set size in real scenes is also influenced by “episodic” guidance, or guidance specific to knowledge about this particular scene. Thus, episodic guidance may influence search for your toothbrush in your bathroom, above and beyond semantic guidance, which constrains the placement of toothbrushes in bathrooms in general.

The problem of set size

Scene-specific guidance has been discussed before (e.g., Henderson & Ferreira, 2004), but not in terms of search efficiency and not with strong ties to the classic laboratory search paradigm, because of what can be called “the problem of set size.” Efficiency is defined in terms of RT × Set Size functions, and we have no satisfactory definition of set size in real scenes (Neider & Zelinsky, 2008; Rosenholtz, Li, & Nakano, 2007). Consider the scenes in Fig. 1.

Fig. 1

What are the “set sizes” in these images?

What is the set size of the image of the bedroom on the left? Is the bed one item, or are blanket and bedstead separate items? Is each pillow an item? Is a wall an item? On the right, is each tree an item—even the ones that are small and largely occluded—or is this a forest? Intuition tells us that search time must depend on the number of searchable entities in a scene and that the set of relevant entities can be changed by the search task. Thus, the line down the middle of the road might be a surface marking and not counted in the set size, unless the search task was to find a pair of curved lines.

In this study, we have adopted a brute force approach to the problem of set size. The LabelMe tool (Russell, Torralba, Murphy, & Freeman, 2008) provides a simple method for drawing polygons around regions of images and labeling them. At this writing, the LabelMe database contains over 65,000 annotated images, with over 732,508 labeled regions of varying precision. We took a set of 100 photographic images and had them exhaustively labeled so that almost every pixel in the scene was assigned to one labeled polygon. We then used the number of labeled polygons as a surrogate for set size.
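As an illustration of this surrogate measure, the sketch below counts labeled polygons in a single annotation file. It assumes the standard LabelMe XML layout (an <object> element, holding a <name> and a <polygon>, per labeled region); the file name is hypothetical:

```python
import xml.etree.ElementTree as ET

def set_size_from_labelme(xml_path):
    """Count labeled polygons in a LabelMe annotation file.

    Assumes the usual LabelMe layout: an <annotation> root with one
    <object> element per labeled region; objects flagged
    <deleted>1</deleted> are skipped.
    """
    root = ET.parse(xml_path).getroot()
    labels = [
        (obj.findtext("name") or "").strip().lower()
        for obj in root.findall("object")
        if (obj.findtext("deleted") or "0").strip() != "1"
    ]
    return len(labels), labels

# Hypothetical usage: n, labels = set_size_from_labelme("bedroom_001.xml")
```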

For reasons like those raised above, even the seemingly simple process of exhaustively labeling scenes is fraught with difficulties. For instance, natural outdoor scenes tend to have very small numbers of labeled regions. Thus, the image on the right in Fig. 1 might be segmented into “road” and “forest,” even though it seems incorrect to consider this to be an image of set size 2. We avoided this issue by restricting ourselves to indoor scenes. We used unoccupied indoor scenes in order to sidestep the possibility that humans are somehow special items in search. We used largely domestic scenes (bedrooms, living rooms, kitchens, etc.) because our observers would be familiar with their contents, whereas this might not be true of factory floors, hospital wards, and so forth. Still, even unoccupied domestic indoor scenes present issues. Returning to Fig. 1, what do we do about the curtain and the window, two objects occupying the same place in the 2-D image? The details cannot be made out in Fig. 1, but if the picture on the wall contained images of objects, should they count as search items? What about objects seen through windows? When is each book one item, and when do they constitute a single shelf of books? Such questions may not be answerable in any general sense. Nevertheless, given a single labeler working with a relatively homogeneous class of images, we may assume that scenes containing more items will generate more labeled regions and that this brute force approach, while imperfect, will at least be positively correlated with the unknowable “true” set size. Note that the imperfections of this method are likely to lead to underestimates of the number of searchable objects, not overestimates. Thus, it is likely that set size could be reduced by labeling many books as a single item, “books.” It is much less likely that a single book would be overly labeled as “spine, lettering, pages,” and so on (though, in fact, one could search for such things).

Repeated search

In order to examine the role of episodic guidance in scenes, we need to give observers experience with scenes that could, in principle, produce episodic guidance (Võ & Wolfe, 2011). When we interact with a real scene, our search behavior typically involves multiple searches over the same scene. Sitting down to a meal in a new restaurant illustrates the distinction between episodic and semantic guidance. Where is the fork? Where is the salad, the salt, your dining companion? All of these searches are constrained by semantic guidance. The fork is on the table to the left of the plate because that is where forks can be (they don’t float), and where they should be in a restaurant. The second time you search for the fork or the salt, there is the possibility that episodic guidance about this restaurant might come into play. There are two superficially contradictory facts about such “repeated searches.” First, it seems intuitively obvious that experience with a scene speeds subsequent search in that scene. You will find the coffee maker in your kitchen faster than a visitor will find it. Moving beyond intuition, clear evidence for detailed learning of multiple objects in a scene can be found in Hollingworth’s work (Hollingworth, 2006a, 2009; Hollingworth & Henderson, 2002). In apparent contrast, we have found that when observers search repeatedly through the same set of stimuli for hundreds of trials, search efficiency is the same at the end as it was at the beginning. Thus, if observers search through small sets of letters at a rate of about 35 ms/letter when the letters change on every trial, they turn out to search at the same 35-ms/letter rate 300 trials later (Wolfe, Klempen, & Dahlen, 2000). Of course, observers have learned the location and identity of all of the letters after 300 trials. However, in this experimental situation, accessing that memory appears to take about 100 ms/item, so it turns out to be more efficient to repeat the visual search than to use that memory (Kunar, Flusberg, & Wolfe, 2008a). The same result holds for search in scenes—in one particular case, realistic cartoon scenes, drawn with architectural rendering software (Oliva, Wolfe, & Arsenio, 2004).

How can the repeated-search results be reconciled with the obvious improvement in search with experience? Returning to Neider and Zelinsky’s (2008, 2010) concept of “functional set size,” the critical factor is that the failure to improve search efficiency with repeated search occurs only when the functional set size remains fixed. Thus, if you are always searching through the same, well-learned set of 6 letters, search efficiency does not improve. However, suppose that the display contains 30 letters, of which only 6 are ever queried in the search task. Initial search will proceed through all 30 items at 35 ms/item. As you learn that the relevant set is only these 6 items, search will transition to a much faster search through 6 items (Kunar et al., 2008a). If efficiency is calculated from the physical set size of 30, the efficiency will appear to increase. If it is calculated over the queried set size of 6, the rate will be 35 ms/item. Experience with a scene thus has the effect of reducing the functional set size to the queried set size. In our earlier work on repeated search in artificial scenes (Oliva et al., 2004), observers rapidly learned to restrict search to the 3 or 6 objects in the scene that were task relevant. The rest of the objects were never targets and do not seem to have influenced search efficiency.

In the latter half of this article, we consider the situation where any object in the field can be a search target and where it is possible to have multiple searches through the same scene without repeating the search target. Referring back to Fig. 1a, we might ask: if you have searched for the curtain, the bed, and the picture, is your search for the lamp speeded by your familiarity with the scene? We might also ask whether your second search for the picture will be influenced by the first.

We addressed questions of scene guidance and search efficiency in six experiments. In Experiment 1, observers searched for an arbitrary object in a novel scene on each trial. We found that search was highly efficient, at least as defined by the slope of the RT × Set Size function, with set size defined as the number of labeled items in a scene. The design of Experiment 1 (and, indeed, the design of the world) permitted observers to make intelligent guesses based on the typicality of the target object in the scene. (There is no point in searching for a long time for stoves in the bathroom.) In Experiment 2, we controlled for typicality and found that search for arbitrary objects became somewhat less efficient. In Experiment 3, observers searched for arbitrary objects outside of a scene context, and search became much less efficient. These first three experiments point to the role of semantic guidance in the efficiency of scene search.

Experiments 4–6 allowed for the development of episodic guidance over repeated search through the same image for different targets. In these experiments, since the RT × Set Size functions were shallow, our interest was in the speeding of RT with repeated search through a scene. While there is some speeding of RT with increasing experience with a scene, there is massive speeding of RT only when observers search for a specific target for a second time. In Experiment 5, we showed that the background walls and floor need not be visible to produce the shallow RT × Set Size functions, if the objects are laid out as they would be in a complete scene. The experiment also showed that memory of the first search for a specific item in a specific location persists for hundreds of trials. In the first five experiments, word cues were used to identify the target. Experiment 6 showed that the advantage of the second search for a target is not entirely the result of learning the appearance of the specific target item on successful completion of the first search. Taken together, the last three experiments suggest that repeated search in a scene allows episodic guidance to speed search. To return to Fig. 1a, there might be 30 items in the scene, but once the target is identified as “lamp,” scene structure and observer knowledge about bedrooms provide the semantic guidance that reduces the functional set size to a fraction of the total items. With repeated search, that functional set seems to be reduced further on the second search for the same lamp. Were those hypothetical 30 items placed at random in a classic search display, search would take much longer.

Experiment 1: searching for arbitrary objects in novel scenes

Method

Stimuli

A set of 100 full-color images of indoor scenes was used. These were scenes of kitchens, living rooms, bedrooms, and so forth, such as one might find in an architectural magazine or real estate advertisement. As a result, there were no humans or animals in the scenes, and these rooms were somewhat cleaner than most occupied rooms. They were, however, “real” scenes—not scenes created in software, nor random collections of overlapping objects (Bravo & Farid, 2004). A lab employee exhaustively hand-labeled all images, drawing a polygon around each object and naming it. The names were reviewed by other lab members to reach consensus—though, as we will see, it is nearly impossible to label every object in a scene in a manner that will satisfy every viewer (“That’s not a bowl. It’s a dish.”). The number of labeled regions in each image ranged from 14 to 179 (M = 59, SD = 31, median: 53). There was a wide range of sizes of the labeled regions (area = 28 to 756,551 pixels, M = 15,023 pixels, SD = 36,624). The entire scene subtended 34 × 24 deg at an approximate viewing distance of 57.4 cm during these experiments. This viewing restriction was not imposed during labeling.

Observers

A group of 12 observers were tested. All were paid volunteers who had given informed consent. Each had at least 20/25 visual acuity and normal color vision, as assessed by the Ishihara color blindness test.

Procedure

On each trial, a different image was presented. Before the scene appeared, a word cue was presented for 500 ms at the center of the screen. On 50% of trials, the word was drawn randomly from the names of the labeled polygons in the image for that trial. Surfaces were not allowed as targets (e.g., “wall,” “floor”), because virtually all of these scenes had visible walls, floors, and/or ceilings. On the other 50% of trials, a name of a polygon from another image was chosen as the cue word. It could only be used if the same term did not appear in the list of labeled regions for the current scene. If there were multiple instances of a term (e.g., “chair”), the probability of that term being used as the cue word increased accordingly.
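One simple way to realize this cue-sampling rule is to draw uniformly from the scene's full label list, so that a term with k instances is k times as likely to be chosen. The sketch below is our own illustration with hypothetical labels, not the actual experiment code:

```python
import random

# Hypothetical label lists; duplicates mark multiple instances of a term.
scene_labels = ["chair", "chair", "chair", "lamp", "table", "plate", "plate"]
other_image_labels = ["stove", "kettle", "chair", "towel"]

def pick_cue(target_present):
    if target_present:
        # A uniform draw from the full list makes a term with k instances
        # k times as likely to serve as the cue word.
        return random.choice(scene_labels)
    # Absent cue: a label from another image that this scene lacks.
    return random.choice([w for w in other_image_labels if w not in scene_labels])

print(pick_cue(True), pick_cue(False))
```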

After a 500-ms stimulus onset asynchrony, the scene appeared. The observers pressed one key if they believed that the cued target was present and another if it was absent. After the response, observers were given feedback on the accuracy of their response. On target-present trials, the target polygon or polygons were outlined, in green for a correct response and in red for an incorrect response. On target-absent trials, a “+” was presented at the center of the screen (green for correct responses, red for incorrect responses). Observers pressed any key to initiate the next trial, at which point the scene disappeared and the next trial began after an intertrial interval of 1,000 ms.

Because it is difficult, if not impossible, to ensure that all observers will agree on the appropriate name for a given object, observers were given the opportunity to dispute the label on any given trial. Before initiating the next trial, observers had the option of pressing the space bar if they disagreed with the labeling of the target objects. This brought up a screen with the prompt, “Please click on the box that best describes the problem,” and four blue boxes with the options “Wrong Name,” “Item Was Present,” “Item Was Absent,” and “Other.” If observers selected “Other,” they were prompted to “Please briefly describe the error,” and then they typed in a free response describing the error.

After a brief period of instruction and practice, observers saw each of the 100 images 10 times, for 1,000 total trials. There were 10 blocks of 100 trials, with each image presented once per block in a randomized presentation order for each block.

Results

Trials with RTs less than 200 ms or greater than 5,000 ms were removed from analysis as outliers. This removed 2.4% of trials. Including these trials increased variance but did not substantially alter the patterns of data described below. Almost all of the excluded trials had long RTs. Observers disputed the labels on less than 3% of the trials (M = 29.6, SEM = 4.8). Of the 345 disputed trials, 334 concerned target presence: the observer reported that the requested target, deemed to be absent, was actually present (289 trials), or that it was absent when deemed present (45 trials). These trials were removed from the calculations of error rates and RTs.

Errors

Error rates were quite high in this experiment. Targets were missed on 17% of target-present trials, and false alarms occurred on 13% of target-absent trials. In typical laboratory search tasks in which RT is the dependent measure, miss errors tend to be less than 10%, and false alarm errors are very rare (Wolfe, Palmer, & Horowitz, 2010). Discriminability, d', was 2.09. All observers showed similar patterns of errors. Miss error rates varied from 10% to 23%, false alarms from 8% to 26%, and d' from 1.5 to 2.7. The subsequent RT analyses were performed on the correct responses.
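For reference, the reported d′ follows from the standard equal-variance signal-detection formula, d′ = z(hit rate) − z(false-alarm rate). A minimal check in Python, using the group rates above (the small discrepancy reflects rounding of the underlying rates):

```python
from scipy.stats import norm

def d_prime(hit_rate, fa_rate):
    """Equal-variance signal-detection d': z(hits) - z(false alarms)."""
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Group rates reported above: 17% misses -> .83 hits; .13 false alarms.
print(round(d_prime(0.83, 0.13), 2))  # ~2.08; the text reports 2.09
```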

RT data: effects of set size

Figure 2 shows average RTs as a function of set size (defined as the number of labeled regions) for the 100 images.

Fig. 2

Reaction time as a function of set size for each of the 100 images tested in Experiment 1. Each point is the average over 12 observers. Error bars indicate ±1 SEM. Green circles show correct target-present data, and pink diamonds show target-absent data. Regression lines were computed over the “main sequence” of set sizes from 25 to 85. The slopes are 5 ms/item when targets are present and 4 ms/item when they are absent

Several features of these data are worthy of comment. First, it is clear that set size, defined in these terms, is of limited use here. Standard RT × Set Size functions are monotonic and, usually, quite linear. Here, the images with the largest set sizes produced anomalously low RTs. We removed those large set sizes, and also the smallest set-size images, as outliers and computed slopes of the RT × Set Size functions for a “main sequence” from set size 25 to 85. The resulting slopes were 5 ms/item for target-present trials and 4 ms/item for target-absent trials. These were significantly greater than zero [F(1, 41) > 12, p < .001, in each case]. However, these slopes are comparable to those seen in the simplest feature searches, and it hardly seems credible to imagine that search for a bowl in a kitchen is as efficient as search for red among green. The vertical spread of points at a single set size tells us that there were significant sources of variance attributable to factors other than set size. Note that these slopes did not change significantly if the slowest trials (RT > 5,000 ms) were included.

How should we account for the pattern of RTs? Figure 3 shows four of the scenes that produced the fastest RTs.

Fig. 3

Four scenes that produce fast RTs

Clearly, these span a range of set sizes and a range of subjective clutter from quite low (Fig. 3d) to quite high (Fig. 3c). These examples also illustrate a property of real scenes that complicates the analysis of search efficiency: If you are sampling objects at random, you are going to have an easier time finding a chair in Fig. 3b than a plant, simply because there are many chairs. Could the apparent efficiency of search, shown in Fig. 2, be an artifact of the presence of multiple instances of the target item? If the goal is to be able to ask about arbitrary objects in arbitrary scenes, this is going to be a hard problem to eliminate. There are not many real scenes in which all objects are singletons. However, some objects are singletons, and it is possible to restrict analysis to trials on which the cue indicated a target that appeared only once in the scene. Figure 4 presents the results of this analysis.

Fig. 4

Effects of number of targets. Panel a shows that, as the number of instances of a target increases, average RT decreases (purple circles) and hit rate increases (beige diamonds). Panel b shows average RTs binned by set size for hits (blue circles) and correct absent trials (red squares) (essentially duplicating Fig. 2). Green diamonds show singleton hits. All error bars indicate ±1 SEM

First, Fig. 4a shows that there was an effect of the number of instances of a target type. The number of instances ranged from 1 to 59 in the data set, but we restricted analysis to the range 1–6, because there were more than 200 trials in each of those cases. This range included 84% of target-present trials. In this range, there was a speed–accuracy covariance. As the number of instances went up, observers became both faster and more accurate (hit rate increases). Figure 4b plots RT × Set Size functions for the original data, here averaged over 10-item-wide bins. The critical data are the green diamonds, showing the average RTs for those trials on which the target was a singleton. Comparing average RTs for hits across bins, singleton RTs are slower [t(6) = 4.2, p = .006]; however, the slopes of the RT × Set Size functions are not different [F(1, 10) = 0.48, p = .5]. With the high error rates seen in this experiment, one should not put too much weight on the exact values of the slopes. However, the error rate for singletons did not increase markedly as the set size increased, so any speed–accuracy trade-off would have its prime effect on the intercept, raising all mean RTs. Moreover, the singletons were also more likely to have unusual or disputable labels (e.g., the singletons beginning with the letter G are “game case,” “garlic,” “glass,” “glass jar,” “glasses,” “globe,” “golden figurine,” “grandfather clock,” and “grapes.” Compare these to the most common labels: “plate,” “napkin,” “chair,” “book,” and “wine glass.”). Trials in which multiple examples of the cued target are present in the display are easier both because of the repeated instances of the same target and because these repeating objects and their names are simply more common (in the context of indoor scenes).

Effects of size and eccentricity

If set size, defined as the number of labeled regions, is a relatively poor predictor of RT, what does predict RT in search for arbitrary objects in real scenes? The effects of two rather unsurprising factors, target size and eccentricity, are illustrated in Fig. 5.

Fig. 5

Effects of (a) target size (labeled area, in pixels) and (b) target eccentricity. Blue circles show data averaged over observers for all hit trials. Purple diamonds show singleton data. Larger sizes and eccentricities are not included because of insufficient data

As shown in Fig. 5a, RT drops dramatically as targets get bigger. The effect is roughly linear with the square root of the target size, quantified by the area labeled as belonging to the target. Dashed lines are the best-fit regression lines. The results for singletons are similar to the results for all data, though the RTs are somewhat longer and the effect of size is a bit greater. The largest sizes were removed from this analysis, because there were only a few observations for these large stimuli. Note that the change in mean RTs from the smallest to the largest sizes (600–800 ms) was much greater than the change in RTs for the smallest and largest set sizes (200–300 ms; Fig. 4b).
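To illustrate the reported square-root relationship, a short sketch (with made-up numbers, not the actual data) regresses RT on the square root of labeled target area:

```python
import numpy as np

# Hypothetical (target area in pixels, mean correct RT in ms) pairs.
areas = np.array([200, 900, 2500, 10000, 40000, 160000])
rts = np.array([1450, 1300, 1150, 950, 800, 700])

# Following the text, regress RT on the square root of labeled area
# rather than on raw area.
slope, intercept = np.polyfit(np.sqrt(areas), rts, 1)
print(f"RT ~ {intercept:.0f} + ({slope:.2f}) * sqrt(area)")
```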

Eccentricity shows a 200- to 300-ms effect over the range of target eccentricities from 0 to 11 deg. Larger eccentricities generated too little data to be meaningfully analyzed. The odd dip in RT at medium eccentricities is probably an artifact of the presence of multiple targets. The eccentricity of “chair” in Fig. 3b, for example, would be the average eccentricity of all chairs in the given scene (also true for the size value). Since we are using an unsigned eccentricity, this will be a number greater than zero. As can be seen, the dip is not present when we look only at trials with singleton targets. Note also that this eccentricity effect appears even though eye movements are unconstrained, suggesting a bias to begin search in the middle of the image (Carrasco, Evert, Chang, & Katz, 1995; Wolfe, O’Neill, & Bennett, 1998). In these experiments, the bias was reasonable, since we cued the target identity at the center of the display. In the case of multiple targets with the same label, one could argue that it might be advisable to plot average RTs as though they were associated with the smallest eccentricity and the largest size. After all, observers are most likely to have found the largest, most central example of a multiple. In the present analysis, this seems unlikely to make much difference. The conclusion would remain that big central targets are found more rapidly than small eccentric ones.

The role of typicality

When observers are asked if object X is in scene Y, they can make an assessment of how likely it is that such an object will be in such a scene. The identity of the object is given to the observer before the search begins, and observers can quickly assess the gist of a scene when it first appears (Greene & Oliva, 2008; Oliva, 2005). Since targets on absent trials were drawn from the names of all labeled items in the set of images, the design of the experiment made it likely that a target that was present would be more typical of the scene than a target that was absent. In order to assess the role of typicality in Experiment 1, a subsidiary experiment was run in which new observers rated the typicality of 3,012 object–scene pairs. Of those pairs, 2,115 had been used in Experiment 1. Fifteen raters each rated an average of 524 pairs (min 54 pairs, max 1,302). Each pair was rated two to three times on a scale from 1 (very atypical) to 9 (very typical). The large number of pairs arose from the strategy of asking about random items in random scenes. Figure 6 shows the mean ratings as a function of trial type and set size (over the “main sequence”).

Fig. 6

Average typicality ratings as a function of set size and trial type. Error bars (±1 SEM) are generally smaller than the data points

Clearly, typicality was related to trial type. The main effect of trial type was significant by a Kruskal–Wallis test (p < .0001), and all pairwise comparisons were, likewise, significant (Dunn’s multiple comparison test, p < .0001). For present purposes, the important points are:

  1. The target on a target-present trial was notably more typical of the scene than a randomly chosen target in a target-absent trial (as is, one supposes, true in the world).

  2. When observers made errors, the typicality of a miss error was lower than that of a hit, and the typicality of a false alarm was higher than that of a correct absent trial.

This suggests that guessing could be a reasonable strategy, especially at the extremes of typicality. Evidence for guessing can be seen when error rate is plotted against typicality, as in Fig. 7, where miss errors rise to over 30% at low typicality, while false alarms rise to similar levels at high typicality.

Fig. 7

Error rates as a function of typicality ratings in Experiment 1

This effect of typicality can be seen as both a “bug” and a “feature” in Experiment 1. On the one hand, it clearly indicates that the RT and slope measures shown in Figs. 2 and 4 are influenced by guessing, and one could argue that the shallow slopes are simply an artifact of typicality effects. This would be a form of speed–accuracy trade-off: If observers guess that unlikely items are absent, they will make miss errors. If they do not guess, they will have to search, and that search will take longer for larger set sizes, increasing the slope. A similar story would apply for guessing “present” for high-typicality targets. On the other hand, these results imply that intelligent use of the typicality of a target is a real part of the explanation of the efficiency of search in scenes. Real-world searches are based on an assessment of the likelihood of finding what you are looking for. Quitting rules are based on that assessment, as are judgments about ambiguous stimuli (e.g., “Is that really a pillow?” In a bedroom, yes. In the bathroom, maybe not.) Similar effects can be seen if target probability is manipulated directly (Wolfe & Van Wert, 2010).

Even if one chooses to see typicality effects as an important part of intelligent search, one would still want a version of the experiment without co-occurring typicality effects, in order to assess set-size effects in scene search. That was the purpose of Experiment 2.

Experiment 2: searching for arbitrary objects in scenes with typicality controlled

Method

Experiment 2 replicated Experiment 1, with the following changes. First, the target was present on every trial, and observers localized it with a mouse-click response. In other search experiments, we have found that this localization method produces approximately the same RT × Set Size functions as the traditional keypress (Wolfe, 2010). The localization method eliminates the value of guessing that an unusual item is absent; it will be present, and the observer needs to find it. RT was measured as the time from the appearance of the scene to the click on the presumed target. Second, we restricted the choice of scenes to those with set sizes between 20 and 80 (the “main sequence” of Exp. 1; see Fig. 2). Third, we eliminated all scene–target pairs that had been questioned by observers in Experiment 1 (though, as we will see, that did not eliminate observers’ questions). Fourth, we eliminated target items of less than 0.5 deg². This yielded 833 scene–target pairs. The average typicality rating was 7.1, with a distribution strongly skewed toward high typicality (58% rated 8 or 9). Taking set size in 10-item-wide bins, typicality varied from 6.3 to 7.8, but there was no systematic relationship of typicality and set size.

Each of the 10 observers was tested on 500 of the 833 possible trials, distributed so that each pair was seen six times across the data set. All observers were paid volunteers who had given informed consent. Each had at least 20/25 visual acuity and normal color vision, as assessed by the Ishihara test.

Results

The changes in method made Experiment 2 harder than Experiment 1. RTs were longer because it was no longer possible to quit search with an intelligent guess based on typicality and because the time to mouse-click on a target is generally longer than the time to press a key. Accordingly, in this experiment, we only discarded trials with RTs <200 ms or >10,000 ms. Moreover, even with our filtering of the scene–target pairs, observers still challenged some pairings. Of 142 challenges, 106 were caused by duplicated but unlabeled items. These are cases where, for example, one painting was labeled, but the observer clicked on another that was not labeled. On other trials, observers clicked outside of the region of the target object. These were deemed errors and not included in the RT analysis. With all of these sources of error or complaint removed, 84% of the trials remained. The number of trials removed from RT analysis (from all sources) was positively correlated with set size (R² = .17, p = .004). This was driven by low error rates for the lowest set sizes (15–25). Over the range from 25 to 85, there was no significant correlation (R² = .03, p = .13). The error rate was not correlated with the typicality rating (R² = .03, p = .11).

The RT × Set Size function generated by the correct trials in Experiment 2 is shown in Fig. 8. The resulting slope is steeper than those of Experiment 1 [unpaired t test: t(20) = 4.0, p = .0007]. However, at 9.9 ms/item, it is still quite shallow. As will be seen in Experiment 3, observers are much less efficient when they must find objects outside of the scene context. A multiple regression analysis of the effects of typicality and set size revealed main effects of both factors. However, there was no interaction between the factors.

Fig. 8

The RT × Set Size function for Experiment 2. The data are averages of 10 observers. Error bars indicate ±1 SEM

Discussion

Search for arbitrary objects in real scenes in Experiment 1 was surprisingly efficient; at least, this was the case when set size was defined by the number of labeled regions in the scene. Part of that efficiency may have come from an effective guessing or quitting strategy. When an item was unlikely to be present, observers might have guessed “absent.” On the trials on which they were wrong and committed a miss error, they eliminated a hit RT. That hit RT would have been longer when the set size was bigger. Moreover, when the guess was correct, a short RT was substituted for what would have been a longer RT. Thus, if the guessing had not occurred, the slope would have been steeper, because the longest RTs were removed from the larger set sizes. In the real world, the set of items that are likely to be in a scene differs from the set of items not likely to be in the scene. Experiment 1 captured that regularity and, we may presume, allowed for guessing and/or criterion setting on that basis. Experiment 2 eliminated the advantage of guessing based on typicality, by making the targets always present and requiring observers to click on a target. This resulted in somewhat steeper slopes.

Nonetheless, the 9.9-ms/item average slope found in Experiment 2 was still relatively efficient. What was the cause of this efficiency? Were the targets and distractors simply sufficiently distinct that the target was easy to find? Or were there additional effects of guidance due to properties of the scene? In Experiment 3, we asked observers to search for a wide range of arbitrary objects in the absence of a scene context and found that search under these conditions was far less efficient.

Experiment 3: searching for arbitrary objects in random displays

In Experiment 3, we wanted to assess the efficiency of search for arbitrary objects presented in more classic random-search displays, rather than in scenes. Most experiments involving search for naturalistic targets have used a restricted set of targets: for instance, teddy bears (Yang & Zelinsky, 2009); faces, cars, and houses (Hershler & Hochstein, 2005); animals and vehicles (Bravo & Farid, 2007); or food (Bravo & Farid, 2004). Even if search for arbitrary objects turns out to be efficient because different objects are represented in different parts of a high-dimensional space, one would not necessarily see this with a restricted set of stimuli. Presumably, even a diverse set of teddy bears live near each other in any object space. Some work has been done with larger sets of objects (Biederman et al., 1988; Newell, Brown, & Findlay, 2004; Wolfe et al., 2004). Perhaps the closest study to our goals is one by Vickery, King, and Jiang (2005) that featured a wide range of black-and-white naturalistic objects and yielded an inefficient average target-present slope of 42 ms/item.

Search for arbitrary objects in arbitrary scenes raises not only the issue of the diversity of search targets, but also the linked issues of clutter and crowding. Crowding and clutter would typically be thought to have negative effects on search (Bravo & Farid, 2004, 2007, 2009; Rosenholtz, Chan, & Balas, 2009; Rosenholtz et al., 2007; Vickery et al., 2005; Vlaskamp & Hooge, 2006), though the effects would vary with stimulus type (Reddy & VanRullen, 2007; Rosenholtz et al., 2009). Bravo and Farid (2004) obtained slopes of about 40 ms/item for “sparse” arrays of 6, 12, or 24 items, and 53 or 73 ms/item with crowded displays, depending on the complexity of the items. In their 2007 follow-up study, slopes were 75–100 ms/item with very crowded, overlapping objects. Even the relatively “sparse” displays of Bravo and Farid (2004) might be considered quite crowded, so in Experiment 3, we restricted displays to set sizes of 1–4 items, making it possible, in the uncrowded case, to have each item in a different quadrant of the field.

Method

Stimuli

We used a diverse set of 230 full-color photographic images of objects in isolation on a white background. All objects had unique labels (i.e., just one “cupcake” or “swimming pool” in the set). Objects were not scaled by relative size. Thus, a “lollipop” and a “church door” could be about the same size. (Of course, in a 3-D world, the 2-D image of a nearby lollipop could well be the size of the 2-D image of a more distant door.) Ideally, it might have been desirable to use objects extracted from the scenes of Experiments 1 and 2. However, had we used objects cropped from the scenes, they would have varied dramatically in size. Some of the smaller items would have been all but unrecognizable out of their scene context. Moreover, many objects in the scenes were partially occluded, unlike the objects here. If simply removed from the scene, they would have appeared as object fragments. Accordingly, our choice of stimuli in this experiment might be seen as giving an advantage to the isolated objects, an advantage our observers proved unable to exploit.

Procedure

Prior to the main experiment, observers were shown all objects paired with their names. Observers were asked to verify that each object was appropriately named. In pilot work, the names had been refined to avoid obscure or ambiguous terms. Nevertheless, if the name we had assigned an object differed from the name observers would use to identify it, observers were given the option to type in an alternative name. This happened very rarely. On each trial of the main search experiment, observers saw the name of one object, chosen at random from the set of 230. Targets were present on 50% of the trials. When the target was absent, the cued name was still the name of an object in the set of 230. Target names were presented for 1,000 ms, and then the word was erased and a search array of 1, 2, 3, or 4 items appeared. Items were presented on an invisible 5 × 5 grid that subtended 18.4 deg at the 57.4-cm viewing distance. Each object fit within one of the 3.6 × 3.6 deg cells of this array. In a block of trials, items could be presented in a crowded or an uncrowded mode. In the 5 × 5 array, if the cells on the vertical and horizontal midlines are disallowed, there is a 2 × 2 array of cells in each quadrant of the field. In the crowded condition, all items were presented in the same quadrant. In the uncrowded condition, each item was in a different quadrant, in the cell farthest from fixation. Note that even the crowded condition is not very crowded. Items did not overlap. The inset in Fig. 9 shows a not-to-scale representation of a set size 4, uncrowded condition. For a crowded condition, all 4 items would be in one quadrant.

Fig. 9

Search for arbitrary objects in nonscene displays. Solid lines show best-fit regressions for correct present trials; dashed lines show correct absent trials. There is little difference between the uncrowded (blue lines and diamonds) and crowded (purple lines and circles) conditions. Error bars indicate ±1 SEM. The inset is a representation, not to scale, of a set size 4, uncrowded trial

Fixation was not enforced, but, as the slopes of the RT × Set Size functions will show, observers evidently did not need to fixate each item before identifying it, nor did they choose to do so. Observers were tested in crowded and uncrowded blocks. Each block consisted of 30 practice and 300 experimental trials. Accuracy feedback was given after each trial.

Observers

A group of 10 observers were tested. All gave informed consent, had acuity of 20/25 or better, and had normal color vision as assessed by the Ishihara plates. All were paid for their time.

Results

RTs over 2,000 ms were deemed to be outliers. Two of the observers had high rates (13% and 25%) of such RTs and were removed from further analysis. For the remaining 8 observers, less than 1% of the RTs were greater than 2,000 ms. For those 8 observers, the miss error rates were 5.8% in the uncrowded condition and 4.3% in the crowded condition. This difference was statistically significant, t(7) = 3.1, p = .017. The false alarm rates were 2.2% and 2.3%, respectively, t(7) = 0.06, n.s.

Figure 9 shows the mean RT data for correct present and absent trials in crowded and uncrowded conditions.

For the present purposes, the important finding is that search for arbitrary objects in a nonscene display is inefficient. The target-present slopes of 45 and 41 ms/item are very similar to the results of Vickery et al. (2005). The main effect of set size is significant [F(3, 21) = 29.6, p < .001, ηp² = .81]. The difference between present and absent slopes, as assessed by the interaction of target presence/absence and set size, is also significant [F(3, 21) = 3.2, p < .045, ηp² = .31]. In this experiment, there was no evidence of a reliable difference between crowded and uncrowded displays [F(1, 7) = 0.00, p = .975, ηp² = .00], perhaps because the “crowded” stimuli were quite distinct and fairly well separated. None of the interactions involving the crowding variable reached statistical significance.

The slopes in Experiment 3 were reliably steeper than the slopes in Experiment 2, the less efficient of the two scene search experiments [all ts(18) ≥ 3.1, all ps < .006].

Discussion

Experiment 3 confirmed that search for arbitrary objects outside of a scene is not particularly efficient. As others have found with related methods, each additional item costs at least 40 ms. The set size range was very different from the estimated set sizes in Experiments 1 and 2. However, Vickery et al. (2005) obtained very similar slopes using set sizes of 8 and 16, so the inefficiency of search for isolated objects is not limited to very small set sizes. It would be difficult to do the present experiment with the very large set sizes of Experiments 1 and 2, because the items would either need to be small and very crowded or the field would need to be very large. In either case, it is unlikely that larger set sizes would have produced shallow slopes.

In Experiment 3, the lack of difference between the crowded and uncrowded conditions suggests that the inefficiency is not due to a need to fixate each object to eliminate crowding. If each item needed to be fixated, slopes would be much steeper: at four saccades per second, each item would cost about 250 ms, and since the target is found, on average, halfway through a target-present search, present slopes would be at least ~125 ms/item, assuming no refixations of rejected distractors. If there was a crowding effect in Experiment 3, it might be internal crowding of features within the complex objects rather than crowding between objects. Any within-object crowding effect would be the same in our “crowded” and “uncrowded” conditions.

The first three experiments showed that search in scenes is quite efficient—if efficiency is indexed by the slope of an RT × Set Size function, with set size derived from the number of labeled regions in the scene. What are the sources of that efficiency?

  1. Experiment 3 confirmed that search for arbitrary objects is not always efficient. In classic inefficient search (e.g., search for Ts among Ls), the targets and distractors are fairly similar (e.g., all letters). Without the results of Experiment 3, it could have been proposed that classic guidance by attributes like color and size could account for the efficiency of search for an object. There are perhaps 12–18 preattentive attributes that guide search (Wolfe & Horowitz, 2004). These attributes, taken together, define a high-dimensional space. Arbitrary objects would be represented very sparsely in that space (DiCarlo & Cox, 2007). It might be very easy to discriminate between a target “cup” and other “noncup” objects. However, if that were the case, then search for arbitrary objects outside of a scene context should also have been efficient. Experiment 3, along with earlier experiments (Vickery et al., 2005), falsifies this account.

  2. Experiment 2 showed that apparent efficiency is not just a typicality effect. In Experiment 1, typicality provided an excellent prior if one wanted to guess without bothering to search. If the target/scene pairing was rated 1 or 2 (low typicality), that target had less than a 20% chance of being present. If the rating was 8 or 9, the probability that this was a target-present trial was over 70%. In Experiment 2, however, this prior was rendered irrelevant by forcing observers to find a target every time. While the slopes increased, they remained quite efficient, significantly more efficient than the slopes for random object search in Experiment 3.

  3. Apparent efficiency was probably not a consequence of a poor measure of set size. Set size in simple search displays is an important factor in determining the time to find a target, yet Experiments 1 and 2 found that the number of labeled regions was a rather poor predictor of RT in scene search. It is certainly possible that the number of labeled regions in complex scenes is simply not a good stand-in for our old notion of set size. While that might be the case, it seems intuitively clear that, all else being equal, it will take longer to search through a scene containing many objects than through a scene with only a few objects, and it seems unlikely that the number of labeled regions was uncorrelated with the “true” set size, whatever that might be. Moreover, if the set-size measure were in error, it would likely be conservative. Parts of objects (e.g., table legs, cup handles) are potentially searchable but were not counted here. Consequently, it seems likely that any measure of set size will lead to the conclusion that observers are searching efficiently.

  4. Structure of the scene guides search. If search for objects outside scene contexts is not efficient, while search in a scene context is efficient, it follows that the scene itself is making an important contribution to the efficiency of the search. In random displays of items, a limited set of basic features guides the deployment of attention (Wolfe & Horowitz, 2004). If you are looking for a red cup, you will not devote much attention to items of other colors. In this example, guidance by color reduces the functional set size (Neider & Zelinsky, 2008) from the set of all objects to the set of red objects. Scenes provide sources of guidance not available in random arrays of objects. Here, we will briefly describe two proposed classes of scene guidance, with the aid of Fig. 10.

    Fig. 10

    Which box could just cover a horse?

Your task, in Fig. 10, is to find the box that was just covering the image of a horse. On the left side of the figure, your selection among the boxes on a blank background would be random. Not so on the right. The top box would be eliminated, because horses are generally found on surfaces that can support horses (Droll & Eckstein, 2008; Torralba et al., 2006). Borrowing language from the memory literature, we can label this constraint a form of “semantic” scene guidance. This is guidance based on generic knowledge about the world of horses, fields, and trees, as opposed to “episodic” information about this horse, this field, and this tree.

One could imagine that the top box hides a large tree house, capable of supporting an object as big as a horse. That is still not the correct box because, even if the physical constraints were met, horses do not appear in tree houses. Rather than the physics of the world, this subtype of semantic guidance is based on the regularities of specific types of scenes. Thus, we know, for example, that forks and knives often appear near plates (Bar, 2004) and chimneys appear on roofs (Eckstein et al., 2006).

The bottom box in Fig. 10 is eliminated by a third form of semantic guidance. We know something about the size of horses, and we can very rapidly assess the 3-D layout of a space (Greene & Oliva, 2009). Given these two pieces of information, the bottom box cannot hide a horse, because the horse would have to be too small to be plausible. The middle box would seem to be the best guess. Thus, various forms of semantic guidance can rapidly reduce the functional set size in a scene in ways that would not be possible in an array of random items.

As noted above, beyond generic knowledge about the world and about types of scenes, there is specific knowledge about specific scenes. There is clearly good memory for the placement of objects in scenes (Hollingworth, 2004, 2006a, 2006b, 2009), and it seems entirely reasonable to assume that you would search more efficiently for the coffee maker in your kitchen than in a novel kitchen. That said, there are limits on episodic guidance of search. In repeated search experiments in which observers searched through the same small set of letters hundreds of times, perfect knowledge of the locations of target letters did not make search more efficient (Wolfe et al., 2000), apparently because the costs of the memory search were greater than the costs of redoing the visual search (Kunar et al., 2008a). In scene search, we would expect to see effects of episodic guidance when the costs of repeating the visual search are greater than the costs of accessing the memory.

To summarize, we propose that search was more efficient in Experiments 1 and 2 because the three forms of semantic guidance could reduce the functional set size to well below the set size defined by the number of labeled regions. We know that attention and/or the eyes are guided by this sort of information (Ehinger et al., 2009; Neider & Zelinsky, 2006). The guidance based on scene information can be invoked with a very brief preview of the scene (Castelhano & Henderson, 2007; Võ & Henderson, 2010), and violations of semantic guidance impede search (Biederman, Mezzanotte, & Rabinowitz, 1982; Henderson, Weeks, & Hollingworth, 1999; Malcolm & Henderson, 2009). Many of the labeled regions simply could not be the requested target because they were the wrong size or in the wrong place. The present results tie these findings to the classic RT × Set Size measure of search efficiency.

Experiments 1–3 did not address the role of episodic guidance: When does information about the present scene guide search? That is the topic of Experiments 4–6.

Experiment 4: repeated search through the same scene

Method

In Experiment 1, the scene changed from trial to trial. In real life, it is more likely that an observer would perform a series of searches through the same scene, making the development of episodic guidance possible. In Experiment 4, observers searched 30 times through the same unchanging scene. Fifteen indoor scenes were used as stimuli, for a total of 450 trials per observer. Each scene had between 23 and 87 labeled items. For this experiment, we selected 15 items that each appeared only once in the image. The target was present on every trial, and the observers’ task was to move the mouse to click on the target. Each target item was cued twice during a block of 30 trials. The presentation order was random, so the lag between the first and second searches for a specific target could be anywhere between 1 and 29 trials.
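
The randomized cue order determines how soon targets repeat and at what lags. As an illustration (not the actual experimental code), a minimal simulation of this 30-trial block structure confirms that lags range from 1 to 29 and that the first repeated target appears, on average, around the sixth or seventh trial:

```python
import random
from statistics import mean

def simulate_blocks(n_targets=15, n_blocks=20_000):
    """Simulate Experiment 4's block structure: 15 targets,
    each cued twice, in random order within a 30-trial block."""
    lags, first_repeat = [], []
    for _ in range(n_blocks):
        order = list(range(n_targets)) * 2  # two cues per target
        random.shuffle(order)
        seen = {}                           # target -> trial of its first cue
        repeats = []
        for trial, target in enumerate(order, start=1):
            if target in seen:
                lags.append(trial - seen[target])
                repeats.append(trial)
            else:
                seen[target] = trial
        first_repeat.append(min(repeats))
    print(f"lag range: {min(lags)} to {max(lags)}")                 # 1 to 29
    print(f"mean trial of first repeat: {mean(first_repeat):.1f}")  # ~6.9

simulate_blocks()
```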

On each trial, the mouse pointer was positioned at the center of the scene, and a word cue was presented for 500 ms below the mouse location. The scene remained visible while the cue was presented, but the mouse could not be moved until the offset of the word cue, and RT was calculated from cue offset. After the observer clicked the target location, feedback was given for 500 ms (the target polygon was outlined in green for a correct response and in red for an incorrect response). Any click within 32 pixels (1 deg) of the target polygon was counted as a correct response. After feedback was given, the mouse pointer moved back to the center of the screen, and there was a 500-ms ISI prior to the appearance of the next word cue. The scene remained continuously visible for all 30 trials.

A group of 15 observers were tested. All gave informed consent, had acuity of 20/25 or better, and had normal color vision as assessed by the Ishihara plates. All were paid for their time.

Results

Trials with RTs longer than 7,000 ms were eliminated from the analysis, as were trials on which the mouse click fell more than 32 pixels outside the boundary of the target object. Together, these restrictions eliminated 10% of the data.

RT × Set Size functions

Figure 11 shows the RT × Set Size functions for the first and second searches for each target.

Fig. 11 RT × Set Size functions for the first and second searches for each target in Experiment 4. Purple circles show the first search for targets, and green diamonds show the second search. Error bars indicate ±1 SEM

As in Experiment 1, RT was only weakly related to set size, at least as defined by the number of labeled regions. Indeed, in Experiment 4, there was no significant correlation of RT and set size [first search: R² = .12, F(1, 13) = 1.9, p = .19; second search: R² = .04, F(1, 13) = 0.6, p = .45].
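
For readers who want to reproduce this kind of analysis, the statistics reported throughout (slope, R², and an F test on the regression) can be computed from per-scene means. The sketch below uses simulated stand-in data, since the raw RTs are not reproduced here; for simple linear regression, F(1, n − 2) is simply the square of the t statistic for the slope:

```python
import numpy as np
from scipy import stats

# Simulated stand-in data: one mean RT (ms) per scene, paired with the
# scene's number of labeled regions (the set-size surrogate).
rng = np.random.default_rng(0)
set_sizes = np.array([23, 27, 31, 36, 40, 45, 49, 54, 58, 63, 68, 72, 77, 82, 87])
mean_rts = 1500 + 3.0 * set_sizes + rng.normal(0, 150, set_sizes.size)

res = stats.linregress(set_sizes, mean_rts)
f_stat = (res.slope / res.stderr) ** 2   # F(1, n-2) = t^2 for the slope
print(f"slope = {res.slope:.1f} ms/item, R^2 = {res.rvalue**2:.2f}, "
      f"F(1, {set_sizes.size - 2}) = {f_stat:.1f}, p = {res.pvalue:.3f}")
```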

There was a highly significant difference between the first and the second search for a target [t(14) = 19.6, p < .0001]. There were two possible causes for this effect. First, observers might be faster when searching for an object that they have already found once. Second, search might become faster as the same scene is examined multiple times. The second search for an object must come after the first and, thus, the observer will have more experience with that scene. In fact, both of these factors appear to have played a role, as is shown in Fig. 12.

Fig. 12 Reaction time as a function of average position in a 30-trial block of the first (purple circles) and second (green diamonds) searches for each of the 15 targets in Experiment 4. Error bars indicate ±1 SEM

For each scene, each of the 15 targets appeared twice. Of course, the first search for the first target had to occur on the first trial. The first search for the second target could come on the second trial unless, by chance, that second trial was occupied by the second search for the first target, and so on. Figure 12 plots the RTs for each of the 15 targets as a function of that target's average position in the block. The first repetition of the first target, for example, appears in about the sixth or seventh position, on average. The effect of experience with a particular scene can be seen in the significant declines in RTs for the first search [slope = −13.4 ms/position, R² = .36; F(1, 13) = 7.4, p = .017] and the second search [slope = −5.6 ms/position, R² = .49; F(1, 13) = 12.6, p = .004].

If experience with the scene were the entire cause for the improvement in RTs, then the first and second search data would lie on the same function, which they clearly do not. Experience with the specific target of search also plays a substantial role; something about the first search for the target is remembered and can be used to speed the second search for the same target (Brockmole & Henderson, 2006; C. C. Williams, 2010). The design of this experiment allowed us to examine the short-term time course of this memory. Because the two repetitions of each target were randomly placed in the block of 30 trials, the lag from first to second search varied from 1 to 29. The number of trials at each lag decreases as lag increases, but we can examine the difference between RTs for the first and second search for a target as a function of the lag between those searches. This is shown in Fig. 13.

Fig. 13 Reaction time for the first and second searches for a target in Experiment 4 as a function of the lag between those searches. Error bars indicate ±1 SEM

The robust, 600-ms difference between the first and second searches does not change as a function of lag [R² = .006; F(1, 13) = 0.07, p = .79].

Discussion

These results suggest a two-part role for episodic guidance. To begin with, there may be a role for generalized familiarity with the scene. Returning to Fig. 12, there is a speeding of RTs over the first few searches through the scene. The regression line notwithstanding, not much improvement occurs after the first two or three trials. The more convincing effect is the large and more specific improvement seen when an object is queried for the second time. That improvement suggests that something about searching for the specific object produces strong episodic guidance. That could be memory for this object in this location or, since the cues are words, it could merely be that the observer knows what the object looks like on the second search. We will address this question in Experiment 6.

In earlier work (Wolfe et al., 2000), we found that the efficiency of repeated search through fixed displays of letters or objects did not improve, even after hundreds of searches through the same few letters. For letter search, the slope of the inferred RT × Set Size function remained quite stable at about 35–40 ms/item. In the present experiment, search appeared to be efficient at the start, and RTs improved with repetition of both scene and target. These two sets of findings are not incompatible. First, in our original repeated-search experiments, it was the slope of the RT × Set Size function that did not change; mean RTs did become faster as the task went on. Second, the failure to improve search efficiency occurs only in those cases in which all of the items in the display can be targets and are known to be potential targets. If only M of N items can be targets, search becomes apparently more efficient as observers learn to restrict their search to the relevant set of items (Kunar, Flusberg, & Wolfe, 2008b). Search through the relevant subset might be inefficient, but search through the display as a whole can appear efficient if the observer has learned to guide attention away from the bulk of the objects. Something of this sort may occur when observers search scenes. A scene might initially contain 23–89 labeled regions, but something about the scene allows the functional set size to be cut rapidly to a more manageable size. Greater exposure to the scene and to the targets in that scene makes this winnowing process more effective.

The remaining two experiments investigated the aspects of the scene that could be used to speed search.

Experiment 5: the role of the background

The scenes of Experiments 1, 2, and 4 consisted of objects and their relationships, as well as a scene “background.” In the case of our indoor scenes, the walls, floors, and so forth constituted the background. In Experiment 5, we removed that background information in order to assess the contribution of the background to the efficiency of search.

Method

Experiment 5 used the same 15 scenes used in Experiment 4. In this case, however, a scene was presented in one of three conditions, illustrated in Fig. 14. An observer could see the original scene, as in Fig. 14a. In the black-background condition, only the 15 target items were presented, while the remainder of the scene was black (Fig. 14b). In the noise-background condition, the remainder of the scene was filled with a phase-scrambled, black-and-white version of the original image (Fig. 14c). Target objects retained their relative sizes and positions in the black and noise conditions. The noise condition preserved the orientation and spatial frequency content of the background as a whole but eliminated the scene structure. As can be appreciated from Fig. 14, both the black and noise conditions preserved a somewhat schematic impression of a scene.
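
Phase scrambling can be implemented in a few lines. The sketch below shows one standard recipe (not necessarily the exact procedure used to build these stimuli): keep the image's amplitude spectrum and substitute the phase spectrum of a random noise image, which preserves orientation and spatial-frequency content while destroying scene structure:

```python
import numpy as np

def phase_scramble(img):
    """Phase-scramble a grayscale image (values in [0, 1]).
    Taking the phase from a real-valued noise image preserves the
    conjugate symmetry needed for a real-valued result."""
    amplitude = np.abs(np.fft.fft2(img))                         # keep amplitudes
    noise_phase = np.angle(np.fft.fft2(np.random.rand(*img.shape)))
    scrambled = np.fft.ifft2(amplitude * np.exp(1j * noise_phase))
    return np.clip(scrambled.real, 0.0, 1.0)

# The target objects would then be composited, at their original sizes
# and positions, on top of the scrambled (or black) background.
```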

Fig. 14 Examples of scene stimuli for Experiment 5

Each observer saw each scene twice. If a scene had one background type on its first appearance, it had the same background on the second appearance. These two appearances were generally not consecutive. More specifically, the order of scenes on first appearance was reversed on second appearance (i.e., Scenes 1, 2, 3, . . . 14, 15, followed by Scenes 15, 14, . . . 3, 2, 1). This produced a systematic variation of the lag between the first and second appearances of a scene. On each appearance of the scene, observers conducted 15 searches, 1 for each of the 15 target objects in that scene. The targets were present on every trial, and observers responded by clicking on the target. Thus, this design allowed us to look at memory for the first appearance over the longer time scale of the entire experiment, in the same way that we had looked within a 30-trial block in the previous experiment.
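
Because the second pass reverses the first, the lag between a scene's two appearances is fixed by its position in the first pass (lag = 31 − 2k blocks for the kth scene), as the following snippet illustrates:

```python
# First pass: scenes 1..15 in order; second pass: the same scenes reversed.
blocks = list(range(1, 16)) + list(range(15, 0, -1))

for scene in (15, 8, 1):
    first = blocks.index(scene) + 1                   # block of 1st appearance
    second = len(blocks) - blocks[::-1].index(scene)  # block of 2nd appearance
    print(f"scene {scene}: lag = {second - first} blocks")
# scene 15: lag = 1; scene 8: lag = 15; scene 1: lag = 29
```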

A group of 15 observers were tested. All had 20/25 acuity or better and had passed the Ishihara color vision test. All gave informed consent and were paid for their time.

Results

Trials were removed from the analysis if their RTs were less than 200 ms or greater than 7,000 ms, or if the observer failed to click on the target region. One observer was removed from analysis because of average RTs that were a full second longer than those of the other observers. For the remaining 14 observers, 88% of all trials were included in analysis.

As in the previous scene experiments, the effects of set size were small and often unreliable. For the black background, the RT × Set Size slope of 5.5 ms/item was statistically significant [R² = .28; F(1, 13) = 5.1, p = .04]. For the noise background, the 5.0-ms/item slope was marginally significant [R² = .25; F(1, 13) = 4.2, p = .06]. For the original scenes, the 2.8-ms/item slope was not significant [R² = .09; F(1, 13) = 1.2, p = .28]. Because each observer only saw five scenes in each background condition, the power of this set-size analysis was reduced. However, if removal of the background had turned the efficient scene search of Experiment 4 into the inefficient object search of Experiment 3, the experiment would have had sufficient power to detect so dramatic a change in slope.

In fact, as is shown in Fig. 15, manipulation of the background had no reliable effect on search times.

Fig. 15 Reaction time as a function of repeated search within a scene in Experiment 5. Solid symbols show the first search for a target, and open symbols show the second search. Red squares show the original scene condition, black diamonds show the black-background condition, and purple circles show the noise-background condition

As in Experiment 4, search for a target was about 600 ms faster when an observer looked for that target for a second time. Note that this second search now occurred much later than it did in Experiment 4, and other searches through other scenes had intervened. There was no clear effect of repetition within a scene. These impressions are borne out by ANOVA. The main effect of first versus second search was highly significant [F(1, 13) = 184.9, p < .00001, ηp² = .93]. The effect of background was clearly nonsignificant [F(2, 26) = 0.06, p = .94, ηp² = .005], as was the effect of repetition [F(14, 182) = 0.7, p = .77, ηp² = .07]. There was a significant interaction of background with first versus second search [F(2, 26) = 6.2, p < .006, ηp² = .32]. However, as will be discussed below, this seems to have been an artifact of imperfect counterbalancing.
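
For concreteness, a repeated-measures ANOVA of this 2 × 3 × 15 design could be run as sketched below. The data file and column names are hypothetical, not the authors' actual variables; the analysis assumes one mean RT per observer per cell of the design:

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format file: one row per observer x search (first/second)
# x background (original/black/noise) x repetition (1-15), with mean RT.
df = pd.read_csv("exp5_cell_means.csv")

anova = AnovaRM(df, depvar="rt", subject="observer",
                within=["search", "background", "repetition"]).fit()
print(anova)  # F, dfs, and p for each main effect and interaction
```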

Experiment 5 was designed to allow examination of memory for the first search for a target over a larger range of trials and over intervening search through other scenes. This can be seen in Fig. 16.

Fig. 16 Reaction time as a function of block in Experiment 5. Purple circles show the first 15 scenes, and green diamonds show their repetition in reverse order

Looking first at the data from the first 15 blocks, RT decreases [slope = −11.5 ms/block, R² = .30; F(1, 13) = 5.5, p = .035], though, in fact, most of the improvement occurs over the first four blocks. Over Scenes 5–15, there is no significant change in RT [slope = −3.0 ms/block, R² = .02; F(1, 9) = 0.2, p = .67]. Presumably, the early decrease in RT reflects the effects of training on the task, since there was nothing specific about practice with Scene 1 that should inform search through Scene 2. In contrast, the 16th block repeated the scene searched on the 15th block. There is a massive, 850-ms speeding of the search times for that block as compared to Block 15. Thereafter, RTs steadily increase [slope = 26.5 ms/block, R² = .72; F(1, 13) = 32.7, p < .0001]. Presumably, this increase reflects the waning of the memory that supports the advantage for the second search for the same target in the same scene. The shorter the lag between the first and second appearances of a scene, the larger the advantage for the second search. As it happened, the lag for scenes with the original background was slightly shorter than the lag for black backgrounds, which was slightly shorter than the lag for the noise backgrounds. This failure of perfect counterbalancing appears to have been responsible for the significant Background × Appearance interaction discussed above. The interaction does not appear to reflect any real role for the background in guiding these searches.

Discussion

At least for this restricted set of indoor scenes, the background structure of walls, floor, and so forth does not appear to have added anything to the guidance of search. Perhaps the layout of the 15 objects allowed for adequate inference of the hidden structure in the black and noise conditions, or perhaps the critical information is in the relationships between objects: Plates are near cups. Printers and monitors rest on horizontal surfaces. Whatever is providing the information, it is clear that removing the walls, floors, and such of these indoor scenes did not weaken the semantic guidance of search within them.

Experiment 5 showed that the large advantage for the second search for a particular target in a particular scene extended to target–scene pairs that had not been seen for hundreds of trials. This episodic guidance effect is evidence for a memory reminiscent of the massive memory for objects (Brady, Konkle, Alvarez, & Oliva, 2008; Konkle, Brady, Alvarez, & Oliva, 2010) and of the impressive ability to remember the details of specific objects in specific scenes (Hollingworth, 2006a, 2006b; Hollingworth & Henderson, 2002). In Experiment 5, the episodic guidance could be seen degrading over time. Nevertheless, some information gathered on the first appearance of a target–scene pair could be remembered and used to aid search, even after several hundred trials. It is worth noting that this memory seems to depend on active search for the item. If mere exposure were enough, one would expect to have seen more of an effect of the 15 repeated searches through a scene (Võ & Wolfe, 2010). In Experiment 5, there was no effect of repetition within a scene—the 15th search through the scene was no faster than the first (Fig. 15), but when 1 of the 15 targets was finally repeated, that 16th search was hundreds of milliseconds faster. The incidental examination and rejection of objects as distractors does not seem to produce episodic guidance. Search for a specific item is what produces episodic guidance for the next request to find that target, unless the entire second-search advantage was merely an artifact of the word cues. That possibility is the topic of the final experiment.

Experiment 6: picture cues

What was the nature of the episodic guidance that, in Experiments 4 and 5, produced very strong effects of repeating a specific target? It could be a rather unsurprising side effect of our use of word cues in all of the experiments presented so far. We know that exact picture cues produce shorter RTs in search through random arrays of objects (Castelhano & Heaven, 2010; Vickery et al., 2005; Wolfe et al., 2004). In the previous experiments, once the observer had found the target the first time, he knew what the target looked like. Perhaps the effective memory was simply the memory that the word “bowl” refers to this specific bowl, and the remembered visual attributes of this bowl could then speed search. In order to test this hypothesis, Experiment 6 replicated Experiment 4 using exact picture cues to supplement the word cue. If the effect was entirely due to the fact that observers did not know the details of the specific target on first search, then the exact picture cues should greatly reduce or eliminate the second-search advantage.

Method

As in Experiment 4, observers searched through the same scene 30 times. Each of 15 possible targets was shown twice in random order. There were 15 scenes, so each observer performed 450 trials. Targets were present on every trial, and observers clicked on the target to make a response. Critically, the target was identified on each trial by presenting its exact image, as well as its name, in the center of the scene for 1,500 ms. The long duration was required because some objects, cut from their context, were rather hard to identify. The size of the picture was jittered between 80% and 120% of the actual size so that the picture cue would not necessarily be an exact match of the target in the scene. In all other details, the design of Experiment 6 was the same as that of Experiment 4.

A group of 15 observers were tested. All had 20/25 or better acuity and had passed the Ishihara color test. All gave informed consent and were paid for their time.

Results

The primary result of Experiment 6 was that showing observers the exact search target improves performance on the first search but does not eliminate the large advantage of the second search. Removing all trials with RTs less than 200 ms or greater than 7,000 ms, along with all incorrect responses, left 96% of the original trials. As in the other scene search experiments, the effects of set size were small and, in this case, not statistically significant. For the first search for a target item, the slope was 3.2 ms/item [R² = .09; F(1, 13) = 1.3, p = .27], and for the second search the slope was 0.9 ms/item [R² = .05; F(1, 13) = 0.64, p = .44].

Figure 17 shows RTs as a function of position in the 30-trial block for a scene. The equivalent data for Experiment 4, from Fig. 12, are shown in purple and green open symbols, for comparison. Recall that, because the 30 trials were presented in random order, the second search for one target could occur before the first appearance of another. Thus, on average, the first repeated target occurred on the 6th or 7th trial of the 30-trial block (the leftmost open green diamond in Fig. 17).

Fig. 17 Reaction time as a function of average position in a 30-trial block of searches through a scene. Each target appears twice. Purple circles show the first search, and green diamonds show the second search. Filled symbols show data from Experiment 6, and the open symbols reproduce the data from Experiment 4 (Fig. 12), for comparison

The main finding of Experiment 6 was that the use of picture cues reduces but does not eliminate the RT advantage for the second search for a target. The reduction comes from a speeding of the first search. First-search RTs were faster in Experiment 6 (M = 1,449 ms) than in Experiment 4 (M = 1,813 ms), paired t(14) = 8.21, p < .001; this paired t test compared the same scenes and the same objects in the two experiments, albeit with different observers. The correlation between RTs for scenes in the two experiments was .65, R² = .42. Second-search RTs were not significantly faster in Experiment 6 (M = 1,068 ms) than in Experiment 4 (M = 1,104 ms), paired t(14) = 1.18, p = .26. The benefit of seeing the target in advance was absent on the second search, presumably because, as in the previous experiments, the word cue evoked a memory of the target adequate to produce the same RT.
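
The scene-level comparison between experiments amounts to a paired t test over the 15 scene means, pairing by scene. A sketch with simulated data (the means are chosen to match those reported above; everything else is invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
scene_difficulty = rng.normal(0, 200, 15)                  # shared per-scene component
rt_exp4 = 1813 + scene_difficulty + rng.normal(0, 80, 15)  # word cues
rt_exp6 = 1449 + scene_difficulty + rng.normal(0, 80, 15)  # picture cues

t, p = stats.ttest_rel(rt_exp4, rt_exp6)                   # paired by scene, df = 14
r = np.corrcoef(rt_exp4, rt_exp6)[0, 1]
print(f"paired t(14) = {t:.2f}, p = {p:.2g}; scene-to-scene r = {r:.2f}")
```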

Picture cues do not eliminate the second-search advantage [main effect of appearance: F(1, 14) = 263, p < .0001, ηp² = .94]. In addition, there was a significant effect of repeated search through the same scene for different objects [F(14, 196) = 2.9, p = .0006, ηp² = .16] and an interaction of first/second search and repeated-search effects [F(14, 196) = 2.9, p = .0004, ηp² = .17]. This reflects a significant decline in RTs as a function of the number of times one has searched through a scene, for the first search of a target [−13.4 ms/repetition, R² = .56; F(1, 13) = 16.8, p = .0013], but not when a target has already appeared once [0.3 ms/repetition, R² = .01; F(1, 13) = 0.1, p = .76]. Note, again, that the improvement occurs over the first few trials and is absent from average Position 5 onward.

We conclude that the memory supporting the second-search advantage includes information about the relationship of the target object to the scene and not merely information about the specific features of the target.

General discussion

Why is search for arbitrary objects in indoor scenes so efficient?

The efficiency of visual search has been indexed by the slope of the function relating RT to set size. A barrier to understanding the efficiency of search in real scenes is that no one knows how to count the set size in a real scene. Our effort to use the number of labeled regions in a scene as a surrogate for set size produced very shallow RT × Set Size functions. We can reject a number of possible reasons for this apparent efficiency.

Did we overestimate the number of items in the scene?

Inflating the set size would decrease the slope. However, it seems far more likely that the number of labeled regions was an underestimate of the number of items. Aggregate labels like “books” and “pillows” make single items out of multiples, and the failure to label parts of compound objects probably eliminated searchable items like picture frames and doll dresses.

Is all search for arbitrary objects reasonably efficient?

It could be proposed that objects are sparsely represented in a high-dimensional space, with the result that it is easy for the visual system to divide that space into a region containing a given type of object, such as a “cup,” and a region containing nearly all “noncup” objects. This would be the high-dimensional equivalent of the easy search that occurs when the target is linearly separable from the distractors in, for example, color space (Bauer, Jolicœur, & Cowan, 1996), and of the advantage of multidimensional feature vectors in machine classification. Experiment 3, as well as previous work reviewed earlier, showed that object search is not efficient outside of a scene context. Moreover, the efficient search for objects in scenes is still much slower than classic “pop-out” of targets defined by attributes like color, size, and so forth. Classic pop-out RTs would be on the order of 400–500 ms, whereas these scene-search RTs asymptote at about 1,100–1,200 ms.

Did observers simply develop an effective guessing strategy?

The structure of Experiment 1 would allow for effective guessing based on the typicality of the target in the scene. If you are asked about a refrigerator in the bathroom, you can probably reply “absent” without much actual search. However, controlling for typicality in Experiment 2 still produced searches that were three to four times more efficient than search for objects in nonscene arrays.

Is the set-size estimate simply worthless?

It would be easy to get a shallow slope if the number on the x-axis were not meaningful at all. However, the count of labeled regions, flawed as it may be, does seem to be meaningfully related to the number of objects in a scene. Some scenes were problematic. If you are searching for a “chair” in a scene with 100 chairs, the set size will be over 100, but the RT required to find the first chair will be very short. However, search slopes remained shallow when data analysis was restricted to singleton targets in Experiment 1 (Fig. 4b) and when all of the targets were singletons (Exps. 4–6).

We suggest that the efficiency of these scene searches reflects an ability to use scene guidance to reduce the functional set size. Consider the kitchen scene in Fig. 3d. There may be 36 labeled items, and there would be more if individual pieces of fruit were itemized. However, if the target was “oven,” the semantic guidance would dramatically reduce the number of items that could be the target. As noted earlier, semantic guidance describes multiple types of general information about the world and about scenes of this sort (Davenport & Potter, 2004; Hollingworth & Henderson, 2000; Hwang, Wang, & Pomplun, 2011; Oliva & Torralba, 2001; Võ & Henderson, 2009). In search of an oven, semantic guidance would include the understanding that ovens need support. Thus, they do not float, and they typically sit on top of surfaces rather than hanging from the ceiling (Biederman et al., 1982). Moreover, ovens are objects of a certain size in the world. Given knowledge of the layout of the scene in Fig. 3d and the 2-D size of proto-objects in that scene, search would be limited to only a few objects that could plausibly be ovens, even without needing to identify objects or the specific scene.
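
To make the idea of a reduced functional set size concrete, the toy sketch below (our illustration, not a model proposed in this article; all fields and thresholds are invented) prunes labeled regions using two of the semantic constraints described above, support and plausible real-world size:

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    supported: bool      # rests on a surface rather than floating or hanging
    est_size_m: float    # real-world size inferred from 2-D extent and layout

regions = [
    Region("cup", supported=True, est_size_m=0.10),
    Region("oven", supported=True, est_size_m=0.90),
    Region("cabinet", supported=True, est_size_m=1.00),
    Region("ceiling lamp", supported=False, est_size_m=0.40),
]

def plausible_oven(r: Region) -> bool:
    # Ovens need support and are roughly 0.5-1.5 m across.
    return r.supported and 0.5 < r.est_size_m < 1.5

candidates = [r.name for r in regions if plausible_oven(r)]
print(candidates)  # ['oven', 'cabinet']: functional set size 2, not 4
```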

The second time you are asked for the oven, episodic guidance would speed search for this oven in this scene. There is a much more modest effect on search for the oven deriving from repeated search for other objects in the same scene (Võ & Wolfe, 2011).

Scene guidance is analogous to the widely accepted concept of guided search by basic attributes like color or motion (Wolfe et al., 1989). There might be 100 letters in a display, but if you are looking for the red T, classic guidance will reduce the functional set size to the set of, say, 10 red letters (Egeth, Virzi, & Garbart, 1984). Guidance by scene information requires some analysis of the scene. However, this is not a circular proposition, where knowledge of all the objects in the scene would help you to find individual objects in the scene. There is abundant information for rapid global/parallel extraction of information about scenes without identification of the component objects. This information includes scene category (e.g., mountain, beach, urban), coarse layout (e.g., open, closed, navigable, indoor), and the presence of categories of objects (e.g., animal) (Greene & Oliva, 2008; Joubert, Rousselet, Fize, & Fabre-Thorpe, 2007; Li, VanRullen, Koch, & Perona, 2002). This sort of information contributes to the scene guidance that reduces the functional set size of search in scenes.

What is being remembered about targets in scenes?

In these experiments, the largest effects of prior exposure occurred when a specific target was searched for again in a specific scene. For this to occur, observers must have remembered the scene, the target, and the relationship of the two. It is interesting that we did not see strong evidence for the development of this sort of memory for distractor items. Incidental learning about the position of nontarget objects might account for the decline in RTs seen in the first repetitions in Figs. 12 and 17. However, the reduction in RT is much more dramatic when a target is repeated. Given the robustness of this effect, it would be interesting to examine its contents in more detail. For example, Brady et al. (2008) and Konkle et al. (2010) showed that observers are surprisingly good at remembering the pose and state of objects seen in isolation. Would we see the same massive decrease in RT on the second search for a target in the same location, but in a different pose or state (Hollingworth & Henderson, 1999)? If this were a useful memory and not a mere laboratory curiosity, one imagines that the benefit would remain. It seems unlikely that the cat needs to maintain his exact pose on the bed in order to facilitate search. On the other hand, it seems probable that a change in position in the image would disrupt the effect and that something about the nature of the spatial memory could be inferred by systematically varying the position. For instance, does the second search remain fast if the shifted target remains on the same surface?

Other types of scenes and targets

One drawback of the present set of experiments is the restricted range of scenes. The scenes were indoor domestic rooms that were without people and cleaner than average. They were also canonically posed, and one cannot rely on such a cue in real-world search. The bed is not always at the center of your view of a bedroom, nor is the whole table neatly visible in the middle of the image of the dining room. These images had the advantage of containing many objects that could be named by our observer population, but there is a danger that some of the results might have been specific to this class of scene. To return to an issue raised in connection with Fig. 1, search of outdoor scenes is not likely to be qualitatively different from search of indoor scenes, even though outdoor scenes typically have many fewer labeled regions.

The use of those labeled regions as targets is a limitation of the present study. Regions were labeled with categorical, basic-level names (“glasses,” “basket,” “chest,” “coat rack,” etc.) with a few modifiers (“potted plant,” “horse statue,” “toy horse,” etc.). This procedure missed much of the specificity of description that we use intuitively to guide the search of others. Additional basic feature information is provided (“Can you see that white bird with the curved orange beak?”), as well as position relative to other objects (“It is next to the rose bush . . .”) and relative to the observer (“. . . on the left”). Once this richer form of target specification is allowed, the differences between the searchable contents of indoor and outdoor scenes will be reduced.

In summary, the results of these six experiments show that the number of labeled items in a scene is a relatively poor predictor of the amount of time that it takes to find one of those items. The scene provides information that guides search, and semantic guidance becomes available rapidly on scene onset. Episodic guidance accumulates with more experience with the scene and speeds search when the same target is searched for a second time. Support or falsification of each of these claims awaits further research with a wider range of scenes and targets.