
Cognitive Psychology

Volume 58, Issue 2, March 2009, Pages 137-176

Recognition of natural scenes from global properties: Seeing the forest without representing the trees

https://doi.org/10.1016/j.cogpsych.2008.06.001

Abstract

Human observers are able to rapidly and accurately categorize natural scenes, but the representation mediating this feat is still unknown. Here we propose a framework of rapid scene categorization that does not segment a scene into objects and instead uses a vocabulary of global, ecological properties that describe spatial and functional aspects of scene space (such as navigability or mean depth). In Experiment 1, we obtained ground truth rankings on global properties for use in Experiments 2–4. To what extent do human observers use global property information when rapidly categorizing natural scenes? In Experiment 2, we found that global property resemblance was a strong predictor of both false alarm rates and reaction times in a rapid scene categorization experiment. To what extent is global property information alone a sufficient predictor of rapid natural scene categorization? In Experiment 3, we found that the performance of a classifier representing only these properties is indistinguishable from human performance in a rapid scene categorization task in terms of both accuracy and false alarms. To what extent is this high predictability unique to a global property representation? In Experiment 4, we compared two models that represent scene object information to human categorization performance and found that these models had lower fidelity at representing the patterns of performance than the global property model. These results provide support for the hypothesis that rapid categorization of natural scenes may not be mediated primarily through objects and parts, but also through global properties of structure and affordance.

Introduction

One of the greatest mysteries of vision is the remarkable ability of the human brain to understand novel scenes, places and events rapidly and effortlessly (Biederman, 1972, Potter, 1975, Thorpe et al., 1996). Given the ease with which we do this, a central issue in visual cognition is determining the nature of the representation that allows this rapid recognition to take place. Here, we provide the first behavioral evidence that rapid recognition of real-world natural scenes can be predicted from a collection of holistic descriptors of scene structure and function (such as a scene’s degree of openness or its potential for navigation), and we suggest that the initial scene representation can be based on such global properties, and not necessarily on the objects the scene contains.

Human observers are able to understand the meaning of a novel image if given only a single fixation (Potter, 1975). During the course of this glance, we perceive and infer a rich collection of information, from surface qualities such as color and texture (Oliva and Schyns, 2000, Rousselet et al., 2005), to objects (Biederman et al., 1982, Fei-Fei et al., 2007, Friedman, 1979, Rensink, 2000, Wolfe, 1998) and spatial layout (Biederman et al., 1974, Oliva and Torralba, 2001, Sanocki, 2003, Schyns and Oliva, 1994), to functional and conceptual properties of scene space and volume (e.g. wayfinding, Greene and Oliva, 2006, Kaplan, 1992; emotional valence, Maljkovic & Martini, 2005).

Indeed, from a short conceptual scene description such as “birthday party”, observers are able to detect the presence of an image matching that description when it is embedded in a rapid serial visual presentation (RSVP) stream and viewed for ∼100 ms (Potter, 1975, Potter et al., 2004). This short description is also known as the basic-level category for a visual scene (Rosch, 1978, Tversky and Hemenway, 1983), and refers to the most common label used to describe a place.

The seminal categorization work of Eleanor Rosch and colleagues has shown that human observers prefer to use the basic level to describe objects, and exhibit shorter reaction times when naming objects at the basic level than at subordinate or superordinate levels (Rosch, 1978). It is hypothesized that the basic level of categorization is privileged because it maximizes both within-category similarity and between-category variance (Gosselin and Schyns, 2001, Rosch, 1978). In the domain of visual scenes, members of the same basic-level category tend to have similar spatial structures and afford similar motor actions (Tversky & Hemenway, 1983). For instance, most typical environments categorized as “forests” will depict enclosed places where the observer is surrounded by trees and other foliage. An image of the same place from very close-up might be called “bark” or “moss”, and from very far away might be called “mountain” or “countryside”. Furthermore, the characteristic spatial layout of a scene constrains the actions that can be taken in the space (Gibson, 1979, Tversky and Hemenway, 1983). A “forest” affords a limited amount of walking, while a “countryside” might afford more options for navigation because the space is open. Although such functional and structural properties are inherent to scene meaning, their role in scene recognition has not yet been addressed.
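The within-/between-category structure that is hypothesized to privilege the basic level can be illustrated with a toy computation. The property vectors and category labels below are hypothetical values invented for illustration, not data from this study.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

# Hypothetical global-property vectors (e.g. [openness, navigability]) for
# two basic-level categories; the numbers are illustrative only.
scenes = {
    "forest_1": ("forest", [0.1, 0.2]),
    "forest_2": ("forest", [0.2, 0.3]),
    "field_1":  ("field",  [0.9, 0.8]),
    "field_2":  ("field",  [0.8, 0.9]),
}

def mean_pairwise_distance(pairs):
    ds = [dist(a, b) for a, b in pairs]
    return sum(ds) / len(ds)

# Split all scene pairs into same-category and different-category pairs.
within = [(v1, v2) for (c1, v1), (c2, v2) in
          combinations(scenes.values(), 2) if c1 == c2]
between = [(v1, v2) for (c1, v1), (c2, v2) in
           combinations(scenes.values(), 2) if c1 != c2]

# A privileged basic level keeps within-category distances small relative
# to between-category distances.
print(mean_pairwise_distance(within) < mean_pairwise_distance(between))  # True
```

The same comparison could be run on any candidate category partition; a partition that fails this inequality would make a poor basic level under the hypothesis above.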

Many influential models of high-level visual recognition are object-centered, treating objects and parts as the atoms of scene analysis (Biederman, 1987, Biederman et al., 1988, Bülthoff et al., 1995, Fergus et al., 2003, Marr, 1982, Pylyshyn, 1999, Riesenhuber and Poggio, 1999, Ullman, 1999). In this view, the meaning of a real-world scene emerges from the identities of a set of objects contained within it, learned through the experience of object co-occurrence and spatial arrangement (Biederman, 1981, Biederman, 1987, De Graef et al., 1990, Friedman, 1979). Alternatively, the identification of one or more prominent objects may be sufficient to activate a schema of the scene, and thus facilitate recognition (Biederman, 1981, Friedman, 1979).

Although the object-centered approach has been the keystone of formal and computational approaches to scene understanding for the past 30 years, research in visual cognition has posed challenges to this view, particularly when it comes to explaining the early stages of visual processing and our ability to recognize novel scenes in a single glance. Under impoverished viewing conditions, such as low spatial resolution (Oliva and Schyns, 1997, Oliva and Schyns, 2000, Schyns and Oliva, 1994, Torralba et al., in press) or when only sparse contours are kept (Biederman, 1981, Biederman et al., 1982, De Graef et al., 1990, Friedman, 1979, Hollingworth and Henderson, 1998, Palmer, 1975), human observers are still able to recognize a scene’s basic-level category. With these stimuli, object identity information is so degraded that it cannot be recovered locally. These results suggest that scene identity information may be obtained before a more detailed analysis of the objects is complete. Furthermore, research using change blindness paradigms demonstrates that observers are relatively insensitive to changes to local objects and regions in a scene under conditions where the meaning of the scene remains constant (Henderson and Hollingworth, 2003, Rensink et al., 1997, Simons, 2000). Last, it is not yet known whether objects that can be identified in a briefly presented scene are perceived directly, or inferred through the perception of other co-occurring visual information such as low-level features (Oliva & Torralba, 2001), topological invariants (Chen, 2005) or texture information (Walker Renninger & Malik, 2004).

An alternative account of scene analysis is a scene-centered approach that treats the entire scene as the atom of high-level recognition. Within this framework, the initial visual representation constructed by the visual system is at the level of the whole scene and not segmented objects, treating each scene as if it has a unique shape (Oliva & Torralba, 2001). Instead of local geometric and part-based visual primitives, this framework posits that global properties reflecting scene structure, layout and function could act as primitives for scene categorization.

Formal work (Oliva and Torralba, 2001, Oliva and Torralba, 2002, Torralba and Oliva, 2002, Torralba and Oliva, 2003) has shown that scenes that share the same basic-level category membership tend to have a similar spatial layout. For example, a corridor is a long, narrow space with a great deal of perspective while a forest is a place with dense texture throughout. Recent modeling work has shown success in identifying complex real-world scenes at both superordinate and basic levels from relatively low-level features (such as orientation, texture and color), or more complex spatial layout properties such as texture, mean depth and perspective, without the need for first identifying component objects (Fei-Fei and Perona, 2005, Oliva and Torralba, 2001, Oliva and Torralba, 2002, Oliva and Torralba, 2006, Torralba and Oliva, 2002, Torralba and Oliva, 2003, Vogel and Schiele, 2007, Walker Renninger and Malik, 2004). However, the extent to which human observers use such global features in recognizing scenes is not yet known.

A scene-centered approach involves both global and holistic processing. Processing is global if it builds a representation that is sensitive to the overall layout and structure of a visual scene (Kimchi, 1992, Navon, 1977). The influential global precedence effect (Navon, 1977, see Kimchi, 1992 for a review) showed that observers were more sensitive to the global shape of hierarchical letter stimuli than their component letters. Interestingly, the global precedence effect is particularly strong for stimuli consisting of many-element patterns (Kimchi, 1998) as is the case in most real-world scenes. A consequence of global processing is the ability to rapidly and accurately extract simple statistics, or summary information, from displays. For example, the mean size of elements in a set is accurately and automatically perceived (Ariely, 2001, Chong and Treisman, 2003, Chong and Treisman, 2005), as is the average orientation of peripheral elements (Parkes, Lund, Angelucci, Solomon, & Morgan, 2001); some contrast texture descriptors (Chubb, Nam, Bindman, & Sperling, 2007) as well as the center of mass of a group of objects (Alvarez & Oliva, 2008). Global representations may also be implicitly learned, as observers are able to implicitly use learned global layouts to facilitate visual search (Chun and Jiang, 1998, Torralba et al., 2006).

While all of these results highlight the importance of global structure and relations, an operational definition of globality for the analysis of real-world scenes has been missing. Many properties of natural environments could be global and holistic in nature. For example, determining the level of clutter of a room, or perceiving the overall symmetry of a space, are holistic decisions in that they cannot be made from local analysis alone, but require relational analysis of multiple regions (Kimchi, 1992).

Object and scene-centered computations are likely to be complementary operations that give rise to the perceived richness of scene identity by the end of a glance (∼200–300 ms). Clearly, as objects are often the entities that are acted on within the scene, their identities are central to scene understanding. However, some studies have indicated that the processing of local object information may require more image exposure (Gordon, 2004) than that needed to identify the scene category (Oliva and Schyns, 2000, Potter, 1975, Schyns and Oliva, 1994). In this study, we examine the extent to which a global scene-centered approach can explain and predict the early stage of human rapid scene categorization performance. Beyond the principle of recognizing the “forest before the trees” (Navon, 1977), this work seeks to operationalize the notion of “globality” for rapid scene categorization, and to provide a novel account of how human observers could identify the place as a “forest”, without first having to recognize the “trees”.

We propose a set of global properties that tap into different semantic levels of global scene description. Loosely following Gibson (1979), important descriptors of natural environments come from the scene’s surface structures and from the change of these structures with time (their constancy). These aspects directly govern the possible actions, or affordances, of the place. The global properties were therefore chosen to capture information from these three levels of scene surface description, namely structure, constancy and function.

A total of seven properties were chosen for the current study to reflect aspects of scene structure (mean depth, openness and expansion), scene constancy (transience and temperature), and scene function (concealment and navigability). A full description of each property is found in Table 1. These properties were chosen on the basis of a literature review (see below) and a pilot scene description study (see Appendix A.1), with the requirement that they reflect as much variation in natural landscape categories as possible while tapping into different levels of scene description in terms of structure, constancy and function. Critically, the set of global properties listed here is not meant to be exhaustive, as other properties such as naturalness or roughness (the grain of texture and the number and variety of surfaces in the scene) have been shown to be important descriptors of scene content (Oliva & Torralba, 2001). Rather, the goal here is to capture some of the variance in how real-world scenes vary in structure, constancy and function, and to test the extent to which this information is involved in the representation of natural scenes.

Previous computational work has shown that basic-level natural scene categories tend to have a particular spatial structure (or spatial envelope) that is well-captured in the properties of mean depth, openness and expansion (Oliva and Torralba, 2001, Torralba and Oliva, 2002). In brief, the global property of mean depth corresponds to the scale or size of the space the scene subtends, ranging from a close-up view to a panoramic environment. The degree of openness represents the magnitude of spatial enclosure whereas the degree of expansion refers to the perspective of the spatial layout of the scene. Images with similar magnitudes along these properties tend to belong to the same basic-level category: for example, a “path through a forest” scene may be represented using these properties as “an enclosed environment with moderate depth and considerable perspective”. Furthermore, these spatial properties may be computed directly from the image using relatively low-level image features (Oliva & Torralba, 2001).

The degree of scene constancy is an essential attribute of natural surfaces (Cutting, 2002, Gibson, 1979). Global properties of constancy describe how much and how fast the scene surfaces are changing with time. Here, we evaluated the role of two properties of scene constancy: transience and temperature.

Transience describes the rate at which scene surface changes occur, or alternatively stated, the probability of surface change from one glance to the next. Places with the highest transience would show actual movement such as a storm, or a rushing waterfall. The lowest transience places would change only in geologic time, such as a barren cliff. Although the perception of transience would be more naturalistically studied in a movie rather than a static image, humans can easily detect implied motion from static images (Cutting, 2002, Freyd, 1983), and indeed this implied motion activates the same brain regions as continuous motion (Kourtzi & Kanwisher, 2000). Temperature reflects the differences in visual appearance of a place during the changes of daytime and season, ranging from the intense daytime heat of a desert to a frigid snowy mountain.

The structure of scene surfaces and their change over time governs the sorts of actions that a person can execute in an environment (Gibson, 1979). The global properties of navigability and concealment directly measure two types of human–environment interactions deemed to be important to natural scene perception by previous work (Appleton, 1975, Gibson, 1958, Gibson, 1979, Kaplan, 1992, Warren et al., 2001). Insofar as human perception evolved for goal-directed action in the environment, the rapid visual estimation of possible safe paths through an environment was critical to survival (Gibson, 1958). Likewise, being able to guide search for items camouflaged by the environment (Merilaita, 2003), or to conceal oneself in the environment (Ramachandran, Tyler, Gregory, Rogers-Ramachandran, Duessing, Pillsbury & Ramachandran, 1996), has high survival value.

The goal of this study is to evaluate the extent to which a global scene-centered representation is predictive of human performance in rapid natural scene categorization. In particular, we sought to investigate the following questions: (1) are global properties utilized by human observers to perform rapid basic-level scene categorization? (2) Is the information from global properties sufficient for the basic-level categorization of natural scenes? (3) How does the predictive power of a global property representation compare to an object-centered one?

In a series of four behavioral and modeling experiments, we test the hypothesis that rapid human basic-level scene categorization can be built from the conjunctive detection of global properties. After obtaining normative ranking data on seven global properties for a large database of natural images (Experiment 1), we test the use of this global information by humans for rapid scene categorization (Experiment 2). Then, using a classifier (Experiment 3), we show that global properties are computationally sufficient to predict human performance in rapid scene categorization. Importantly, we show that the nature of the false alarms made by the classifier when categorizing novel natural scenes is statistically indistinguishable from human false alarms, and that both human observers and the classifier perform similarly under conditions of limited global property information. Critically, in Experiment 4 we compare the global property classifier to two models trained on a local region-based scene representation and observed that the global property classifier has a better fidelity in representing the patterns of performance made by human observers in a rapid categorization task.
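The classification step described above can be sketched in miniature. The following is a minimal, hand-rolled Gaussian naive Bayes over seven-dimensional property vectors with invented training data; it is not the authors' actual implementation, whose details are given in the paper, and all rankings below are hypothetical.

```python
from math import log, pi

def fit(train):
    """Per-category mean and variance for each global-property dimension.
    `train` maps category name -> list of 7-dim ranking vectors."""
    stats = {}
    for cat, vecs in train.items():
        n, d = len(vecs), len(vecs[0])
        means = [sum(v[i] for v in vecs) / n for i in range(d)]
        varis = [sum((v[i] - means[i]) ** 2 for v in vecs) / n + 1e-6
                 for i in range(d)]  # small floor avoids zero variance
        stats[cat] = (means, varis)
    return stats

def classify(stats, x):
    """Assign x to the category with the highest Gaussian log-likelihood,
    assuming uniform priors and conditionally independent dimensions."""
    def loglik(means, varis):
        return sum(-0.5 * (log(2 * pi * s) + (xi - m) ** 2 / s)
                   for xi, m, s in zip(x, means, varis))
    return max(stats, key=lambda c: loglik(*stats[c]))

# Hypothetical rankings on [depth, openness, expansion, transience,
# temperature, concealment, navigability], scaled 0-1.
train = {
    "forest": [[.4, .1, .3, .2, .4, .9, .3], [.5, .2, .4, .3, .5, .8, .4]],
    "ocean":  [[.9, .9, .2, .6, .5, .1, .2], [.8, .8, .3, .7, .4, .2, .3]],
}
stats = fit(train)
print(classify(stats, [.45, .15, .35, .25, .45, .85, .35]))  # forest
```

Because such a classifier sees nothing but property rankings, its successes and, more importantly, its confusions can only come from global-property resemblance, which is what makes the comparison to human false alarms diagnostic.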

Although strict causality between global properties and basic-level scene categorization cannot be provided here, the predictive power of the global property information and the convergence of many separate analyses with both human observers and models support the hypothesis that an initial scene representation may contain considerable global information of scene structure, constancy and function.


Observers

Observers in all experiments were 18–35 years old, with normal or corrected-to-normal vision. All gave informed consent and were given monetary compensation of $10/h.

Materials

Eight basic-level categories of scenes were chosen to represent a variety of common natural outdoor environments: desert, field, forest, lake, mountain, ocean, river and waterfall. The authors amassed a database of exemplars in these categories from a larger laboratory database of ∼22,000 (256 × 256 pixel) full-color photographs

Experiment 1: Normative rankings of global properties on natural scenes

First, we obtained normative rankings on the 500 natural scenes along the seven global properties. These normative rankings provide a description of each image and basic-level category in terms of their global structural, constancy and functional properties. Namely, each image is described by seven components, each component representing the magnitude along each global property dimension (see examples in Fig. A2 in Appendix A.2).
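The normative-ranking step can be sketched as follows: average each image's rankings across observers, then average image scores within a category to obtain that category's profile along one property. The observer rankings below are hypothetical values for illustration, not Experiment 1 data.

```python
from statistics import mean

# Hypothetical observer rankings (0-1) along one global property
# ("openness"), several images per category; one entry per observer.
rankings = {
    "field_01":  ("field",  [0.9, 0.8, 0.85]),
    "field_02":  ("field",  [0.7, 0.75, 0.8]),
    "forest_01": ("forest", [0.1, 0.2, 0.15]),
}

# Normative ranking per image: mean across observers.
image_score = {img: mean(obs) for img, (_, obs) in rankings.items()}

# Category profile along this property: mean across the category's images.
categories = {cat for cat, _ in rankings.values()}
profile = {c: mean(image_score[i] for i, (cat, _) in rankings.items()
                   if cat == c)
           for c in categories}
print(round(profile["forest"], 2))  # 0.15
```

Repeating this for all seven properties yields the seven-component description of each image and category used in Experiments 2-4.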

As Experiments 2–4 involve scene categorization using global

Experiment 2: Human use of global properties in a rapid scene categorization task

The goal of Experiment 2 was to test the extent to which global property information in natural scenes is utilized by human observers to perform rapid basic-level scene categorization. A global property-based scene representation makes the prediction that scenes from different semantic categories but with similar rankings along a global property (e.g. oceans and fields are both open environments) will be more often confused with each other in a rapid categorization task than scenes that are not

Experiment 3: The computational sufficiency of global properties for basic-level scene categorization

We have shown so far that global property information strongly modulates human performance in a rapid scene categorization task. To what extent is a global property representation sufficient to predict human rapid scene categorization performance? To answer this question, we built a conceptual naïve Bayes classifier whose only information about each scene image was the normative ranking data of Experiment 1. Therefore, the classifier is agnostic to any other visual information (color,

Experiment 4: An alternative hypothesis—comparing a global property representation to a local region representation

The global property-based classifier shows remarkable human-like performance, in terms of both quantity and fidelity, in a rapid scene categorization task. Could any reasonably informative representation achieve such high fidelity? Basic-level scene categories are also defined by the objects and regions that they contain. Here, we test the utility of a local representation for predicting human rapid natural scene categorization by creating an alternative representation of our database that

General discussion

In this paper, we have shown that a global scene-centered approach to natural scene understanding closely predicts human performance and errors in a rapid basic-level scene categorization task. This approach uses a small vocabulary of global and ecologically relevant scene primitives that describe the structural, constancy and functional aspects of scene surfaces without representing objects and parts. Beyond the principle of recognizing the “forest before the trees” (Navon, 1977), here we

Acknowledgments

The authors wish to thank George Alvarez, Timothy Brady, Barbara Hidalgo-Sotelo, Todd Horowitz, Talia Konkle, Mary Potter, Joshua Tenenbaum, Antonio Torralba, Jeremy Wolfe and three anonymous reviewers for very helpful comments and discussion.

This research is supported by an NSF graduate research fellowship awarded to M.R.G., and by an NEC Research Support in Computer and Communication, a National Science Foundation Career Award (0546262) and an NSF Grant (0705677) to A.O.

References (100)

  • A. Oliva et al. (2000). Diagnostic colors mediate scene recognition. Cognitive Psychology.
  • A. Oliva et al. (2006). Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research: Visual Perception.
  • T. Sanocki (2003). Representation and perception of spatial layout. Cognitive Psychology.
  • B. Tversky et al. (1983). Categories of environmental scenes. Cognitive Psychology.
  • J. Wolfe (1998). Visual memory: What do you know about what you saw? Current Biology.
  • G.A. Alvarez et al. (2008). The representation of simple ensemble visual features outside the focus of attention. Psychological Science.
  • J. Appleton (1975). The experience of landscape.
  • D. Ariely (2001). Seeing sets: Representation by statistical properties. Psychological Science.
  • F. Ashby et al. (1991). Predicting similarity and categorization from identification. Journal of Experimental Psychology: General.
  • M. Bar (2004). Visual objects in context. Nature Reviews: Neuroscience.
  • I. Biederman (1972). Perceiving real-world scenes. Science.
  • I. Biederman (1981). On the semantics of a glance at a scene.
  • I. Biederman (1987). Recognition by components: A theory of human image understanding. Psychological Review.
  • I. Biederman et al. (1973). Searching for objects in real-world scenes. Journal of Experimental Psychology.
  • I. Biederman et al. (1974). On the information extracted from a glance at a scene. Journal of Experimental Psychology.
  • I. Biederman et al. (1988). Object identification in nonscene displays. Journal of Experimental Psychology: Human Learning, Memory, and Cognition.
  • D.H. Brainard (1997). The psychophysics toolbox. Spatial Vision.
  • H. Bülthoff et al. (1995). How are three-dimensional objects represented in the brain? Cerebral Cortex.
  • M. Chaumon et al. (2008). Unconscious associative memory affects visual processing before 100 ms. Journal of Vision.
  • L. Chen (2005). The topological approach to perceptual organization. Visual Cognition.
  • J. Cutting (2002). Representing motion in a static image: Constraints and parallels in art, science and popular culture. Perception.
  • P. De Graef et al. (1990). Perceptual effects of scene context on object identification. Psychological Research.
  • A. Delorme et al. (2003). SpikeNET: An event-driven simulation package for modeling large networks of spiking neurons. Network: Computation in Neural Systems.
  • R. Epstein et al. (1998). A cortical representation of the local environment. Nature.
  • K. Evans et al. (2005). Perception of objects in natural scenes: Is it really attention free? Journal of Experimental Psychology: Human Perception and Performance.
  • L. Fei-Fei et al. (2005). A Bayesian hierarchical model for learning natural scene categories. IEEE Proceedings in Computer Vision and Pattern Recognition.
  • L. Fei-Fei et al. (2007). What do we perceive in a glance of a real-world scene? Journal of Vision.
  • R. Fergus et al. (2003). Object class recognition by unsupervised scale-invariant learning. IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
  • J. Freyd (1983). The mental representation of movement when static stimuli are viewed. Perception and Psychophysics.
  • A. Friedman (1979). Framing pictures: The role of knowledge in automatized encoding and memory of scene gist. Journal of Experimental Psychology: General.
  • J.J. Gibson (1958). Visually controlled locomotion and visual orientation in animals. British Journal of Psychology.
  • J.J. Gibson (1979). The ecological approach to visual perception.
  • J.O.S. Goh et al. (2004). Cortical areas involved in object, background, and object-background processing revealed with functional magnetic resonance adaptation. Journal of Neuroscience.
  • R. Gordon (2004). Attentional allocation during the perception of scenes. Journal of Experimental Psychology: Human Perception and Performance.
  • F. Gosselin et al. (2001). Why do we SLIP to the basic level? Computational constraints and their implementation. Psychological Review.
  • M.R. Greene & A. Oliva (in preparation). The briefest of glances: The time course of natural scene...
  • M.R. Greene & A. Oliva (2006). Natural scene categorization from conjunctions of ecological global properties. In...
  • M.R. Greene et al. (2008). High-level aftereffects to natural scenes. Journal of Vision.
  • S. Grossberg & T.-R. Huang (in press). ARTSCENE: A neural system for natural scene classification. Journal of...
  • J. Henderson et al. (2003). Global transsaccadic change blindness during scene perception. Psychological Science.