When observing a visual scene, at any point in time, we are conscious of only a small fraction of the available information. Given this limitation, it is important to understand the mechanisms that determine what stimuli we become conscious of, when we become conscious of them, and the nature of processing that occurs in the absence of explicit awareness. There exists a long and rich history in experimental psychology of using visual masking to explore such issues. Visual masking refers to conditions where the visibility of one to-be-reported visual stimulus (the target) is obscured by the presentation of another stimulus (the mask) that appears in close spatiotemporal proximity and does not require report. Object substitution masking (OSM) is a recent discovery in the field of visual masking and is an ideal tool for exploring questions about visual awareness because it selectively impairs the extent to which an individual becomes conscious of a visual stimulus without the extent of image-level degradation that other forms of visual masking induce. This review, therefore, focuses on the determinants and consequences of visual awareness in OSM.

Fig. 1. An illustration of the standard object substitution masking paradigm. The task is to identify whether the gap in the target “C” (the object inside the four dots) is on the left or the right side. Target perception is impaired when the offset of the four-dot mask is delayed (above), relative to when the mask offsets simultaneously with the target (below).

Here, we integrate a diverse range of literature to discern what OSM can tell us about perceptual consciousness and visual cognition in general. Specifically, we discuss the role of feedforward versus feedback processing in giving rise to OSM, whether the phenomenon is better characterized as reflecting object updating or object substitution, the similarities and differences between OSM and other forms of masking, and OSM’s relationship to other visual-cognitive phenomena. In addition, we review evidence regarding the role of attention in masking. Our aim is to address controversies in the OSM literature in depth and offer insights into how some of them may already be resolved and how the field might go about resolving those that remain. We begin by describing the emergence of OSM from the visual masking literature.

Visual masking

It was first documented in the late 19th century that the presentation of a stimulus (the mask) has the potential to interfere with the perception of a prior target stimulus when they share a common spatial location and that this impairment provides a metric of the time taken to recognize the target item (Baxt, 1871; translated in Baxt, 1982). This finding has inspired considerable research interest, chiefly as a tool for exploring the mechanisms that determine which stimuli enter perceptual consciousness and give rise to our rich visual experience. In the early 20th century, it was discovered that masking impairments extended to conditions where the target and mask occupied distinct (but usually contiguous) spatial locations if they had similar contours, which came to be known as metacontrast masking (Alpern, 1952, 1953; Breitmeyer & Ögmen, 2006; Werner, 1935). Strikingly, metacontrast masking produces a U-shaped masking function, with strongest masking typically occurring when the stimulus onset asynchrony (SOA) between the target and mask is 50–100 ms, with weaker or absent masking at shorter and longer SOAs (Breitmeyer & Ögmen, 2006). This counterintuitive departure from monotonicity has been the impetus for many investigations into masking and has been the topic of considerable theorizing (Breitmeyer & Ganz, 1976; Bridgeman, 1971, 1977; Francis, 1997; Kahneman, 1967; Weisstein, 1968).

Over a decade ago, Di Lollo, Enns, and Rensink (2000) discovered a form of masking that displayed properties that did not fit with conventional accounts: A mask that consisted of only four dots could impair the visibility of a target with which it shared little physical similarity and never spatially overlapped (see Fig. 1). This masking was characterized by the target and mask having a common onset, and the key variable was duration of the mask stimulus trailing target offset (mask duration), rather than the SOA between the target and mask (Di Lollo et al., 2000; Enns & Di Lollo, 2000). This impairment of target visibility was termed OSM (Enns & Di Lollo, 1997), and it reflected an exciting new development emerging from a rich tradition of visual masking research in experimental psychology and cognitive neuroscience. OSM is clearly distinct from backward masking by pattern, structure, or noise, where the mask spatially overlaps the target after a given SOA (Breitmeyer & Ögmen, 2006), since the four dots never even partly cover the target. OSM is more similar to metacontrast masking, and initially, there was debate about whether it was an original form of masking or merely a variant of metacontrast (Breitmeyer & Ögmen, 2000). The evidence since then, however, has demonstrated that while there are some similarities, metacontrast and OSM are experimentally dissociable. For example, Chakravarthi and Cavanagh (2009) demonstrated that “crowding,” the deficit in identifying a target in the periphery owing to overly close adjacent flankers (Pelli & Tillman, 2008), could be prevented when metacontrast masks were applied to the flankers but was not when four-dot masks were applied. This suggests that metacontrast masking has an earlier locus of suppression than crowding, whereas OSM has a later locus of suppression (Chakravarthi & Cavanagh, 2009). These similarities and differences between OSM and metacontrast masking and the functional properties of the visual system that they tap will be discussed in greater detail subsequently in this review. Presently, however, we will focus on a description of the properties that define OSM.

Since its discovery, it has been argued that one of the defining hallmarks of OSM is that the impairment in report depends on attentional resources being occupied during the presentation of the target array. Typically, masking is not obtained unless distractors are presented simultaneously with the target (Di Lollo et al., 2000) or unless attentional resources are occupied with another demanding task (e.g., mental arithmetic) (Dux, Visser, Goodhew, & Lipp, 2010). Masking is attenuated or eliminated when attention is precued to the location of the target in the array (Di Lollo et al., 2000; Neill, Hutchison, & Graves, 2002). Similarly, attention to the mask plays a role in producing masking, since deliberate attention to the mask exacerbates OSM (Tata & Giaschi, 2004). However, like the distribution of attention during target exposure, it is not necessary in order to produce masking (Neill et al., 2002).

Recently, however, it has been argued that the limited masking obtained with the presentation of a target stimulus without distractors is the result of a ceiling effect constraining performance in this condition and that the purported hallmark interaction between the number of distractors and mask duration (Di Lollo et al., 2000) is an artifact of this constraint (Argyropoulos, Gellatly, Pilling, & Carter, 2013). These authors demonstrated that when performance is shifted away from ceiling, although both set size (number of stimuli in the target array, target and distractors) and mask duration impaired target perception, their effects were additive rather than interactive. Argyropoulos et al. also obtained significant masking when the target was the only item in the target array (no distractors; set size 1). However, even in this condition, the target could appear in any one of four possible locations, meaning that there was still an element of uncertainty about the spatial location of the target and, consequently, attentional resources would have to be distributed across the potential target locations. Therefore, while it appears that OSM is not necessarily defined by an interaction between set size and mask duration, there is still no conclusive evidence against the idea that OSM depends on the prevention of focused attention on the target.
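The ceiling argument can be made concrete with a small numerical sketch. The code below is illustrative only: the latent performance scores, the size of the costs, and the hard clip at 100 % accuracy are all assumptions chosen for the example, not values from Argyropoulos et al. (2013). It shows how two strictly additive costs can produce an apparent set size × mask duration interaction once the easiest condition is compressed by a ceiling.

```python
def observed_accuracy(latent, ceiling=1.0):
    """Clip a hypothetical latent performance score to the measurable range."""
    return max(0.0, min(ceiling, latent))


def build_table(baseline, set_size_cost, mask_cost):
    """Build a 2x2 accuracy table from strictly additive latent costs."""
    latent = {
        ("small", "short"): baseline,
        ("small", "long"): baseline - mask_cost,
        ("large", "short"): baseline - set_size_cost,
        ("large", "long"): baseline - set_size_cost - mask_cost,
    }
    return {cell: observed_accuracy(score) for cell, score in latent.items()}


def interaction_contrast(acc):
    """Masking effect at the large set size minus masking effect at the small one."""
    effect_small = acc[("small", "short")] - acc[("small", "long")]
    effect_large = acc[("large", "short")] - acc[("large", "long")]
    return effect_large - effect_small


# Assumed values: identical additive costs, differing only in baseline difficulty.
near_ceiling = build_table(baseline=1.10, set_size_cost=0.15, mask_cost=0.15)
off_ceiling = build_table(baseline=0.85, set_size_cost=0.15, mask_cost=0.15)

print("interaction near ceiling:", round(interaction_contrast(near_ceiling), 3))
print("interaction off ceiling: ", round(interaction_contrast(off_ceiling), 3))
```

In this toy case, the interaction contrast is nonzero only when the easiest cell is clipped, which is the pattern that the ceiling account attributes to the original set size × mask duration interaction.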

Collectively, the results above demonstrate that there are several key physical characteristics that define OSM: (1) Attention is distributed during target exposure; (2) target exposure is relatively brief (≤100 ms); (3) the mask continues to remain physically present after target offset (or at the very least, is visible after target offset; Enns & Di Lollo, 1997; Lleras & Moore, 2003); and (4) it is characterized by a loss of visual awareness of the target. In the next section, we specify what we mean by awareness, describe some of the prominent theories put forward to explain OSM, and then analyze their ability to account for the key findings in the literature.

Awareness

For the present purposes, when we discuss visual awareness or consciousness (used interchangeably in this review) in relation to masking, we specifically refer to the ability to report either the stimulus’ presence or its identity (depending on the task). This reportability definition of visual awareness is what Block (1995, 2011) has referred to as access awareness. Block argues that this is an incomplete definition of consciousness because there is also the possibility for phenomenal consciousness—the subjective sense of being aware of something without being able to explicitly report it. Note that this is different from the failure of visual awareness that we can measure in masking: being able to report stimulus presence (detection) without being able to accurately report identity. Instead, phenomenal consciousness is the purported situation of being aware without even being able to report that one is aware of something at any level. So when we present the findings from visual masking studies that use reportability as a gauge of visual awareness and infer that failures of reportability mean failures of conscious processing, there remains the possibility that participants could have this phenomenal consciousness state of the visual stimulus.

The notion of a present but unreportable awareness is, however, not testable. We can neither rely on reportability nor establish an implicit measure. For example, in order to develop a physiological signature of phenomenal consciousness, it would first be necessary to establish its presence versus absence and then correlate this with an accompanying physiological response. This would require measuring phenomenal consciousness to begin with, and since explicit report does not suffice as a measure, this approach gets us no further ahead. In fact, it is not clear that there is any appropriate reference against which to gauge phenomenal consciousness (Kouider, de Gardelle, Sackur, & Dupoux, 2010; Kouider, Sackur, & de Gardelle, 2012). For these reasons, we will focus exclusively on access awareness.

The majority of OSM studies use identification as a measure of target awareness. For example, in Enns and Di Lollo’s (1997) task, participants were required to perform a two-alternative forced choice discrimination, where the target was a diamond with either its left or right corner cropped off. In this case, if participants could not accurately report the identity of the target, it was inferred that their visual awareness of it was impaired. A few studies, however, have also used a target detection task (e.g., Chen & Treisman, 2009) where participants make a judgment regarding whether a target is present or absent. In this task, participants can err in one of two ways: a false alarm, in which the target was absent but the participants indicate that they thought it was present, and a miss, where the target was present but they respond that it was absent. Misses represent the most severe failures of awareness. If a participant makes an error on a target identification or discrimination task, it is reasonable to assume that the participant’s awareness of the stimulus item was impaired, but it is still possible (even likely) that he or she had a sense that something was fleetingly present. That is, both target discrimination and detection are valid measures of awareness of a masked target; it is just that errors on these tasks may simply represent different levels or severities of failures of consciousness.

Of course, it is possible to correctly guess the target’s identity or presence, even in the absence of an explicit awareness of the target, and consequently, correct target responses across an experiment reflect a mix of actual target-aware trials and some guesses. The errors on such forced choice metrics, however, have a more clear-cut interpretation: impaired visual awareness. If one is aware of the target, one does not need to guess; one can simply respond accurately. Thus, trials classified as unaware on the basis of errors in forced choice target identification or detection can be more confidently taken to reflect genuine unawareness.
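The mixture of aware trials and lucky guesses can be illustrated with a simple guessing correction. This is a sketch under a high-threshold assumption that is ours rather than the reviewed studies': on every trial the observer is either aware of the target (and responds correctly) or unaware (and guesses at random among the alternatives). The function names are hypothetical.

```python
def expected_accuracy(p_aware, n_alternatives=2):
    """Proportion correct expected if unaware trials are pure random guesses."""
    guess_rate = 1.0 / n_alternatives
    return p_aware + (1.0 - p_aware) * guess_rate


def estimated_aware_proportion(p_correct, n_alternatives=2):
    """Invert the mixture to estimate the proportion of target-aware trials."""
    guess_rate = 1.0 / n_alternatives
    return (p_correct - guess_rate) / (1.0 - guess_rate)


# With two alternatives, 75 % correct implies awareness on about half the trials.
print(estimated_aware_proportion(0.75, n_alternatives=2))
```

Under these assumptions, half of the unaware trials in a two-alternative task still end in a correct response, which is why correct responses are the ambiguous category and errors are the cleaner marker of unawareness.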

One potential problem with detection tasks, however, is that of response criteria. That is, some participants might have a liberal threshold for what they deem sufficient sensory evidence to conclude that a target was present, whereas others might take a more conservative approach and deem a target “present” only in light of much stronger sensory evidence. This issue can be mitigated by using measures of sensitivity (such as d′) that take response bias into account and allow it to be disentangled from sensitivity. But such measures apply only to aggregate performance in a condition and do not resolve the problem that arises when trials are sorted on the basis of responses in order to examine the fate of a masked target. One approach to dealing with this issue is to explicitly encourage one type of criterion (e.g., if you have any sense that a stimulus was there, report that it was). Still, for these reasons, some authors prefer a discrimination metric in which participants are forced to make a response about the target, which removes any deliberation about when to respond.
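As a minimal sketch of how sensitivity and response bias are separated, the following code computes d′ and the criterion c from the four outcome counts of a detection task, using the standard equal-variance signal detection formulas. The log-linear correction (adding 0.5 to each count) is one common convention for avoiding infinite z scores and is an assumption here, as are the example counts, which are invented for illustration.

```python
from statistics import NormalDist


def sensitivity_and_criterion(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion) under the equal-variance Gaussian model."""
    # Log-linear correction keeps rates away from exactly 0 or 1 (an assumed convention).
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))  # negative = liberal, positive = conservative
    return d_prime, criterion


# Hypothetical observers: one says "present" readily, the other only reluctantly.
liberal = sensitivity_and_criterion(hits=90, misses=10, false_alarms=40, correct_rejections=60)
conservative = sensitivity_and_criterion(hits=60, misses=40, false_alarms=10, correct_rejections=90)
print("liberal:     ", liberal)
print("conservative:", conservative)
```

The two hypothetical observers differ markedly in hit and false alarm rates, yet their d′ values coincide and only the sign of the criterion differs, which is the sense in which d′ disentangles sensitivity from bias at the level of a condition.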

Another potential criticism of the reportability definition of awareness, however, is that participants may have been fleetingly aware of a visual stimulus but, due to forgetting, are subsequently unable to report it. This seems unlikely in visual masking paradigms, especially when the task requires report of the stimulus identity/presence within a fraction of a second of the stimulus itself. It nonetheless remains a possibility, but one that is impossible to gauge empirically, and thus, like phenomenal consciousness, it appears to be outside the scope of investigation. In the next section, we will describe the theories put forward to explain the failure of visual awareness in OSM.

Theories of OSM

Feedback versus feedforward-only models and OSM

It was once thought that visual perception conformed exclusively to a feedforward hierarchy of processing in which areas encode increasingly complex stimulus properties (Hubel, 1963). However, it is now widely recognized that in addition to this, the brain has extensive reentrant connections between regions, implying that perception does not exclusively conform to a hierarchy of analysis (Bullier, 2001; Zeki, 1993). Instead, it has been suggested that conscious perception is achieved via reentrant processing (Cudeiro & Sillito, 2006; Lamme & Roelfsema, 2000; Pascual-Leone & Walsh, 2001; Zeki, 2001). In this section, we will discuss the reentrant processing account that was first offered to explain OSM (Di Lollo et al., 2000), as well as other models of reentrant processing and their applicability to the phenomenon. Finally, we will examine the ability of exclusively feedforward models to account for OSM.

Mumford (1991) proposed one of the earliest reentrant models of object perception. A unique aspect of this account was that it posited that the thalamus is the destination for feedback processing and the locus for integrating input from different cortical areas, thus representing the current cohesive contents of consciousness. In this framework, the term active blackboard is used to describe how the thalamus maintains the current best reconstruction of the visual world. Specifically, perceptual processing is hypothesized to begin with a feedforward sweep through early areas that registers stimulus feature information, which then continues on to higher cortical regions (e.g., the inferior temporal [IT] cortex), where it activates multiple abstract representations (hypotheses) regarding object identity. From there, recurrent processing proceeds down a pathway from the cortex to the thalamus, where representations are compared in parallel against sensory information still reverberating from the feedforward sweep. The model employs a relaxation algorithm that chooses the hypothesis that is the closest match for the sensory information. Conscious perception of this representation then ensues (Di Lollo, 2012; Mumford, 1991).

Mumford’s (1991) model was developed as a general theory of object perception and predates the discovery of OSM. As a model of object perception, it has its merits, and it is noteworthy for being ahead of its time in terms of its emphasis on reentrant processing. Moreover, there is supporting physiological evidence; for example, activity in the (subcortical) lateral geniculate nucleus (LGN) strongly correlates with the contents of visual awareness (Wunderlich, Schneider, & Kastner, 2005). But fMRI has revealed that OSM activates frontal regions and the primary visual cortex, and the magnitude of the BOLD response changes in these areas is correlated with the effectiveness of OSM (Weidner, Shah, & Fink, 2006). Moreover, it is not just the physical properties of OSM that activate these regions but, rather, the process of object substitution itself, since an OSM display in which the target location was cued did not elicit the same magnitude of activations as when target location was unpredictable. Nor is it merely brief or difficult-to-see targets that activate these regions, since these effects were more marked in OSM than in backward pattern masking. They were still present for pattern masking, and this likely reflects a residual role of object substitution in such forms of masking (Weidner et al., 2006). This suggests that V1 is an important locus of sensory input against which perceptual hypotheses are compared, which is not accounted for by Mumford’s model.

Lamme and Roelfsema (2000) popularized the notion of reentrant processing and its critical importance in conscious object perception. According to this view, recurrent processing in early visual areas is a necessary condition for visual awareness. Whereas Mumford (1991) suggested that areas of the cortex have static functions and the thalamus is responsible for dynamic representations, according to Lamme and Roelfsema, lateral and feedback connections allow cortical neurons to contribute different analyses of a visual scene at different moments in time. Specifically, according to this model, the feedforward sweep reaches the highest levels of cortical processing within about 100 ms of stimulus onset. Recurrent processing typically exerts its influence at relatively long latencies and is necessary for refining stimulus representations and rendering them accessible to visual awareness. Within this framework, a backward mask is thought to disrupt recurrent interactions between cortical areas by creating stimulus responses to the mask in lower areas that clash with those in higher areas. In support of this, backward masking has been found to leave the feedforward sweep of processing intact, while selectively preventing feedback processing from extrastriate areas to V1 (Fahrenfort, Scholte, & Lamme, 2007; Lamme, Zipser, & Spekreijse, 2002).

Again, this model was developed as a general framework for visual processing, rather than as an explanation for OSM. However, it is consistent with the finding that an occipital P2 component is elicited about 220 ms after stimulus onset in OSM, and this component was significantly correlated with target reportability (Kotsoni, Csibra, Mareschal, & Johnson, 2007). Although the spatial resolution of ERPs is limited, the origin of this component and its timing are consistent with the notion of reactivation of the primary visual cortex in order to refine the target representation and with the notion that the greater this reactivation, the more likely it is that the visual system will discard the perceptual representation of the target in favor of the mask-alone stimulus (Kotsoni et al., 2007). However, while Lamme’s model provides a conceptual framework for understanding the dynamics of recurrent processing and its role in conscious vision, there are some aspects unique to OSM, as opposed to backward masking, that it does not specifically address, including the role of the distribution of attention and the effect of trailing mask duration. Di Lollo et al.’s (2000) model, however, was designed to do this.

According to the reentrant processing account of object substitution, OSM results from a conflict between abstract representations in higher level cortical areas and transient sensory input to V1 (Di Lollo, 2010, 2012; Di Lollo et al., 2000; Enns & Di Lollo, 1997). Under normal viewing conditions, in the first cycle of processing, input from the target array is encoded in V1, then proceeds to higher levels where tentative representations (or perceptual hypotheses) about object identity are generated. However, such perceptual hypotheses require comparison against the high-resolution and spatially specific sensory information in V1 because these high-level representations are coarse or incomplete (low resolution), sensitivity to the location of the target stimulus is reduced due to the large receptive field size of neurons in higher areas, and ambiguity about object identity is created when multiple perceptual hypotheses are activated by the feedforward sweep. The circuit then checks, via ongoing reentrant processing, for correlations between the descending codes representing multiple perceptual hypotheses and the ongoing pattern of sensory activity. Perceptual hypotheses with low correlations are discarded, and ultimately, the representation with the highest correlation wins the competition for consciousness. Given the brief presentation of the target in OSM, when the descending signals arrive at V1, there is a mismatch between the perceptual hypothesis that represents the target and mask and the sensory activity representing the mask alone. This yields a low correlation for the perceptual hypothesis containing the target, which is then discarded, and the hypothesis reflecting the mask alone is ultimately consciously perceived (Di Lollo, 2010, 2012; Di Lollo et al., 2000; Enns & Di Lollo, 1997, 2000).
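The hypothesis-matching step can be sketched with a toy example. This is not CMOS or any published implementation: the ten-element binary “display,” the decayed target trace of 0.3, and the use of a Pearson correlation are all simplifying assumptions made here to illustrate the verbal account. Each perceptual hypothesis is correlated with the sensory activity that is present when the descending signal arrives, and the hypothesis with the highest correlation is selected.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


# Toy display layout (assumed): 4 mask-dot units, 3 target units, 3 background units.
HYPOTHESES = {
    "target+mask": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    "mask alone": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
}


def winning_hypothesis(sensory):
    """Return the hypothesis whose descending code best matches the sensory activity."""
    return max(HYPOTHESES, key=lambda name: pearson(HYPOTHESES[name], sensory))


TRACE = 0.3  # assumed strength of a decaying trace after stimulus offset
common_offset = [TRACE] * 4 + [TRACE] * 3 + [0.0] * 3  # target and mask fade together
trailing_mask = [1.0] * 4 + [TRACE] * 3 + [0.0] * 3    # mask still present, target fading

print("common offset:", winning_hypothesis(common_offset))
print("trailing mask:", winning_hypothesis(trailing_mask))
```

In this sketch, the target-plus-mask hypothesis wins when both stimuli fade together, whereas the mask-alone hypothesis wins when the mask remains at full strength over a fading target trace, mirroring the substitution described above.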

This model predicts that OSM depends on set size, owing to the number of iterations required to locate the target in the search array. If fewer iterations are required, there are a greater number of reentrant loops during which stimulus information about the target will still be lingering. In contrast, when a greater number of iterations are required to identify the target among distractors (i.e., at larger set sizes), the descending signals representing the target are more likely to overlap with the mask-only sensory information, leading to a lower correlation and, thus, the selection of the perceptual hypothesis consistent with just the mask being presented (Di Lollo, 2010; Di Lollo et al., 2000).

Di Lollo et al. (2000) developed and tested a computational model of object substitution (CMOS) in which a cortical hypercolumn in the striate cortex (V1) connected to a corresponding region in an extrastriate visual area. This model posits iterative exchanges between fine-grained high spatial resolution sensory input information and more abstract pattern information that is lacking in spatial resolution, which ultimately results in conscious perception. Specifically, it stipulates three main stages of processing. First, there is the input layer (I), which, like V1, has small receptive fields, allowing for detailed and spatially precise coding of information. Second, there is an intermediate layer called the working space (W), thought to be part of the striate cortex, which has greater spatial resolution than I. Third, there is the final pattern layer (P), purported to be in the extrastriate cortex, which has large receptive field sizes, rendering it sensitive to the overarching pattern. At the onset of a stimulus, it is first encoded at I and then transferred to W, and a weighted sum of W and I is sent to P. P then outputs back to W; via this transfer, the pattern codes from P are translated into the pixel codes that predominate at W. Now W contains the pattern information that was represented in P, but it is in a form (pixel-based) that allows for direct comparison with the information at I. Then the contents of W are compared with the contents of I. Here, a “hill-climbing algorithm” searches for the highest correlation between the pattern information from P that has been translated into W, on the one hand, and the sensory information in I, on the other. This is achieved over successive iterations. Importantly, the fact that P receives input that is a weighted sum of the contents of I and W (which is being fed back the representation from P) means that the patterns represented in P change more gradually than the rapid changes in sensory input that occur in response to physical stimulation changes (Di Lollo et al., 2000).
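The consequence of feeding P a weighted sum of I and W can be illustrated with a deliberately reduced sketch. It is not the CMOS implementation: the layers are collapsed to single numbers representing target strength, and the input weight of 0.3 is an arbitrary assumption. The point is only that, when W carries P’s own previous output, P lags behind abrupt changes at I.

```python
def run_loop(input_sequence, input_weight=0.3):
    """Iterate a scalar I -> P -> W loop and return P's value on every cycle."""
    w = 0.0  # working space starts empty
    history = []
    for i in input_sequence:
        p = input_weight * i + (1.0 - input_weight) * w  # weighted sum of I and W
        w = p  # feedback: P's output is written back into W
        history.append(p)
    return history


# Assumed scenario: the target is present at I for three cycles, then removed.
target_at_input = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
for cycle, p in enumerate(run_loop(target_at_input), start=1):
    print(cycle, round(p, 3))
```

In this toy loop, P builds up over the cycles in which the target is present and declines only gradually after the input has dropped to zero, rather than tracking the input change immediately.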

CMOS performed reasonably well, with r² values between .78 and .98 being obtained when the fits of the model were compared with the psychophysical results (Di Lollo et al., 2000). However, Põder (2013) has since pointed out that CMOS is not actually a true instantiation of reentrant processing but, instead, is essentially an attentional-gating model. More critically, CMOS is limited in its ability to explain some important findings. First, this model stipulates that set size and mask duration interact, but when performance is unconstrained by ceiling effects, this interaction does not occur (Argyropoulos et al., 2013; Põder, 2013). Second, it does not readily explain why prolonged exposure of placeholder stimuli in the target array can reduce masking (Guest, Gellatly, & Pilling, 2012).

Põder (2013) developed a model that did not incorporate reentrant processing and was designed to improve on the limitations of CMOS. This model assumes a preattentive stage of processing that covers the target array, in addition to the basic mechanism of CMOS. Specifically, it assumes temporal integration of the target and mask signals, the strength of which is modulated by attention. When the two stimuli have a simultaneous offset, the signals for both stimuli are likely to be preserved up until the object perception level. However, with the trailing mask-alone stimulus, the mask signals are present for a longer integration window, degrading the target representation and strengthening the mask representation, thereby impairing visibility of the target. After some time, when attention is focused on the target location, the target is no longer accessible. The preattentive stage, however, means that some target information was acquired, and thus, performance never declines to chance. Põder’s model provides an almost perfect fit for the behavioral data he modeled: r² values of .99 were obtained when compared with Di Lollo et al.’s (2000) behavioral data, and .98 when compared with Luiga and Bachmann’s (2008) results, pooled across polarity conditions.

The core advantage of Põder’s (2013) theory is that it dovetails nicely with the important role of attention in OSM (Dux et al., 2010), while not stipulating an interaction between set size and mask duration, which is a stumbling block for CMOS (Argyropoulos et al., 2013). That said, it still does not provide a complete account for the full gamut of OSM findings. For example, there are many object-updating effects in OSM (e.g., Hirose et al., 2007), which we will discuss in the next section, on which this model is silent. Similarly, in its current form, this model does not predict the nonmonotonicity that OSM functions can produce (Di Lollo et al., 2000; Goodhew, Dux, Lipp, & Visser, 2012; Goodhew, Visser, Lipp, & Dux, 2011a): That is, with prolonged mask exposure (e.g., 640 ms), there can be an improvement in target identification accuracy, relative to intermediate mask durations (e.g., 240 ms), yielding a U-shaped function of masking across mask exposure (Goodhew et al., 2012; Goodhew et al., 2011a). It should be noted, however, that recovery does not universally occur. In one case in which an outline square mask was used and the task for participants was to detect the presence of an unbroken ring target among broken ring distractors, statistically significant recovery was not obtained (Tata & Giaschi, 2004). But there are considerable individual differences in the temporal dynamics of the OSM recovery function (Goodhew et al., 2012), and possibly averaging across participants in the Tata and Giaschi study diluted the recovery effect. Consistent with this idea, there are trends toward recovery with prolonged mask exposure in some of Tata and Giaschi’s figures (see Fig. 2, 8 mask condition; Fig. 4, no-preview condition; Fig. 6, no-preview condition). Further evidence for the existence of recovery in OSM is that Di Lollo et al. (2000) also found such nonmonotonicity with delayed offset of an annulus mask around a target with concurrent distractors in trained observers, with this pattern most pronounced at smaller set sizes (i.e., set size 1). Perhaps, therefore, recovery is observed under particular conditions of perceptual degradation of the target, which may reflect the interplay of target exposure duration, concurrent perceptual load, mask density/similarity to the target, and observer individual differences (e.g., working memory capacity). However, Põder’s model does not make any provision for the recovery effect.

One model does predict nonmonotonic OSM functions: Bridgeman (2007) modified a single-layer model that relied on distributed stimulus representations to account for metacontrast masking (Bridgeman, 1971) in order to explain OSM. The model employed lateral inhibition between adjacent neurons at recurrent 30-ms intervals, which is the latency of reciprocal lateral inhibition in the cat LGN (Singer & Creutzfeldt, 1970). The role of attention in masking is simulated in the model by varying the number of iterations over which neural net activity is collated. That is, fewer recurrent iterations occur for strongly attended items, and a larger number for less strongly attended items. This is a more plausible instantiation of attention than that employed in earlier attempts to adapt models of metacontrast masking to explain OSM, where attention was modeled via mask intensity (Francis & Hermens, 2002). While it is true that focused attention generally reduces the effect of the trailing mask, mask intensity does not seem to capture what is meant by the nuanced construct that is attention, and, furthermore, it has been established that target perceptibility is largely unaffected by the relative intensity of the target versus mask (Neill et al., 2002).

Strong evidence in favor of Bridgeman’s (2007) model is that it produces U-shaped masking functions like those seen in recovery from OSM. The explanation for this pattern of masking is based on the timing of mask offset and target signals and their interaction within the model. That is, recovery occurs when the target is most closely attended (fewer iterations), preventing the mask offset transient from interfering with the target signal. But when focused attention is prevented (more iterations), the target and mask representations merge and produce masking. This neatly explains why Di Lollo et al. (2000) obtained nonmonotonic masking across mask duration at set size 1 (where there was still spatial uncertainty about target location): Here, the target was more strongly attended, relative to larger set sizes. This theory, however, does not as readily explain why Goodhew et al. (2012; Goodhew et al., 2011a) obtained recovery at larger set sizes or why recovery can occur with prolonged mask exposure in the absence of a mask offset prior to response delay or simply with delayed responses.

Many of the models discussed above assume that feedforward and/or recurrent processing between visual and higher-level cortical regions is an essential component of conscious perception. Another prevailing view in cognitive neuroscience stipulates that the dorsal and ventral processing streams are differentially implicated in conscious and unconscious perception (Goodale, 2008; Goodale & Milner, 1992; Goodale & Westwood, 2004). That is, the dorsal stream governs accurate, visually guided motor planning behavior, which can be executed in the absence of conscious visual awareness. The ventral stream, in contrast, underlies conscious object perception. The classic example of this dissociation was provided by patient D.F., who suffered visual form agnosia after a ventral occipital lesion: She was unable to recognize objects, and yet her object-directed grasping remained intact (James, Culham, Humphreys, Milner, & Goodale, 2003). This same dissociation between visuomotor action and conscious perception has been demonstrated in OSM. That is, it has been found that normal motor reaching performance occurs even in the absence of conscious perception of the size of the target object (Binsted, Brownell, Vorontsova, Heath, & Saucier, 2007; Heath, Neely, Yakimishyn, & Binsted, 2008).

The dorsal/ventral unconscious/conscious distinction, however, is not as absolute as initially conceived. Some of the earlier models of metacontrast masking emphasized the interaction between the two processing channels, especially how the magnocellular (transient) channel, which predominantly innervates the dorsal stream (Shapley, 1990), disrupts processing of target-related information along the parvocellular (sustained) channel, which connects to the ventral stream (Breitmeyer & Ganz, 1976). More recent work, however, indicates that magnocellular input can also have a facilitatory effect on object perception (Bar et al., 2006).

Breitmeyer and Ganz (1976) proposed the most influential and, for its time, the most physiologically plausible theory of metacontrast masking: the sustained–transient dual-channel account. The feedforward, dual-channel architecture explains masking as a function of time between target and mask onsets and/or offsets. Both offsets and onsets activate the faster transient channel, which inhibits the slower sustained channel carrying information about stimulus identity (the transient/sustained distinction maps onto the magnocellular/parvocellular distinction; Sherman, Wilson, Kaas, & Webb, 1976). In metacontrast masking, the sustained activity representing the target is inhibited by the rapid mask-induced transient when the mask onsets 50–100 ms after the target. At shorter SOAs, the mask-induced transient occurs prior to the target-related sustained response and so does not interfere with it. Similarly, at longer SOAs, the target-related sustained response is consolidated and, thus, is impervious to the effect of the mask. While this model was initially designed to account for the U-shaped masking function that characterizes metacontrast masking, it is also relevant to OSM. In OSM, there is typically a constant (0-ms) target–mask SOA, and the one factor that varies is trailing mask duration, which concomitantly varies the temporal separation of mask offset from the target. This model, therefore, would predict a U-shaped function of OSM across mask duration (Breitmeyer & Ganz, 1976; Breitmeyer & Kersey, 1981; Breitmeyer & Ögmen, 2006), which has been found (Goodhew et al., 2011a). But the U-shaped pattern also occurs in the absence of mask offset or just with a response delay after mask offset (Goodhew et al., 2012), which cannot be explained within the sustained–transient framework.
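
The timing logic of the dual-channel account can be illustrated with a minimal sketch. This is not the Breitmeyer and Ganz (1976) model itself: the channel latencies, the Gaussian vulnerability window, and the maximum masking depth below are hypothetical values, chosen only so that the strongest masking falls within the 50–100 ms SOA range mentioned above. The symmetric window is a further simplification of the two separate explanations given for short and long SOAs.

```python
import math

# All four constants are illustrative assumptions, not published estimates.
TRANSIENT_LATENCY_MS = 50.0    # fast channel: delay from mask onset to transient
SUSTAINED_LATENCY_MS = 120.0   # slow channel: delay from target onset to sustained response
WINDOW_SD_MS = 30.0            # width of the window in which the sustained response is vulnerable
MAX_MASKING = 0.6              # largest possible drop in visibility


def target_visibility(soa_ms):
    """Toy target visibility (0..1) for a given target-mask SOA in ms."""
    # Arrival time of the mask-induced transient, relative to the start of
    # the target's sustained response. Zero means they coincide.
    arrival_offset = (soa_ms + TRANSIENT_LATENCY_MS) - SUSTAINED_LATENCY_MS
    inhibition = MAX_MASKING * math.exp(-arrival_offset ** 2 / (2 * WINDOW_SD_MS ** 2))
    return 1.0 - inhibition


def masking_function(soas_ms):
    """Return (SOA, visibility) pairs, with visibility rounded to 3 decimals."""
    return [(soa, round(target_visibility(soa), 3)) for soa in soas_ms]


if __name__ == "__main__":
    for soa, visibility in masking_function(range(0, 241, 40)):
        print(f"SOA {soa:3d} ms -> visibility {visibility}")
```

With these assumed latencies, visibility is lowest at an SOA of 70 ms (the difference between the two latencies) and recovers at both shorter and longer SOAs, giving the U-shaped function described in the text.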

More recently, Tapia and Breitmeyer (2011) compared behavioral responses to known response properties of magnocellular and parvocellular neurons. Observers were presented with either a visible or an invisible (rendered so by a metacontrast mask) prime arrow (pointing left or right) at different contrasts, followed by a probe arrow. Their task was to identify whether the probe arrow was pointing left or right as quickly and accurately as possible. Prime visibility was assessed in a separate identification block. For consciously perceived (visible) primes, the function relating the priming effect to prime contrast was best approximated by the known properties of magnocellular neurons, whereas for unconsciously perceived primes, it was best explained by the known properties of parvocellular neurons. These authors suggested that the role of the magnocellular channel in conscious vision is to generate feedback to IT areas that is necessary for rendering stimuli accessible to visual awareness.
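
To make the comparison with "known properties" more tangible, the sketch below contrasts two contrast-response curves. It rests on assumptions that are not taken from Tapia and Breitmeyer (2011): the Naka-Rushton equation is used here simply as a common descriptive form, and the semisaturation values are hypothetical. They were picked to reflect the general characterization of magnocellular responses as high-gain and early-saturating and of parvocellular responses as lower-gain and more nearly linear.

```python
def naka_rushton(contrast, r_max, c50, exponent=1.0):
    """Naka-Rushton contrast-response function.

    contrast : stimulus contrast in the range 0..1
    r_max    : asymptotic response
    c50      : contrast producing half of r_max (semisaturation)
    """
    if contrast < 0:
        raise ValueError("contrast must be non-negative")
    numerator = contrast ** exponent
    return r_max * numerator / (numerator + c50 ** exponent)


def magno_response(contrast):
    """Hypothetical magnocellular-like curve: low c50, saturates early."""
    return naka_rushton(contrast, r_max=1.0, c50=0.1)


def parvo_response(contrast):
    """Hypothetical parvocellular-like curve: high c50, closer to linear."""
    return naka_rushton(contrast, r_max=1.0, c50=1.0)


if __name__ == "__main__":
    for c in (0.05, 0.1, 0.2, 0.4, 0.8):
        print(f"contrast {c:.2f}: M = {magno_response(c):.3f}, P = {parvo_response(c):.3f}")
```

Under these assumptions, doubling contrast from 0.4 to 0.8 changes the magnocellular-like response far less than the parvocellular-like response. This difference in curve shape is the kind of signature against which a behavioral priming function could be compared.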

A similar idea was previously described in more detail by Bar (2003), who proposed that “hypothesis generation” in object recognition is subserved by a magnocellular input directly into the prefrontal cortex, which generates a preliminary “guess” about the identity of an object in advance of the feedforward sweep. This then initiates reentrant activity from the prefrontal cortex to object-related areas in the IT cortex, amplifying the sensory input in these regions that is consistent with the perceptual representation, which then facilitates rapid object recognition. In support of this, activity in the orbitofrontal cortex preceded (by about 50 ms) activity in more classical object-related areas (the temporal cortex, including the fusiform gyrus and LOC), and moreover, this activity was predictive of subsequent behavioral object recognition reported by the observer (Bar et al., 2006). Consistent with the magnocellular channel’s preference for the low spatial frequency content of stimuli (Derrington & Lennie, 1984), both the unfiltered and low-pass spatial frequency filtered images generated equivalent activity in the orbitofrontal cortex and induced greater synchronization between feedforward and feedback projections, whereas this was not so for high-pass-filtered images. Thus, conscious perception is not the exclusive domain of the ventral cortical stream.

In sum, both the qualitative reentrant processing account and its purported computational instantiation (CMOS) have difficulty explaining the nuances of OSM. While initial attempts to model OSM without recourse to reentrant processing fell short of accurately operationalizing important concepts like attention (Francis & Hermens, 2002), more recent developments have provided excellent predictive value in a way that is theoretically sound (Põder, 2013). Bridgeman’s (2007) model does one of the best jobs in approximating the behavioral data of OSM, but this is a model of recurrent lateral inhibition within a layer, rather than of feedback between different regions. However, the physiological reality is that the brain has substantial architecture for reentrant processing, and much evidence points to this being involved in conscious perception (Lamme & Roelfsema, 2000). Reentrant processing is therefore likely to be involved in some way in rendering the target stimulus accessible to visual awareness; it just may not be via the specific mechanisms proposed in the reentrant processing account of OSM (Di Lollo et al., 2000). For example, it may be that the visual competition between the target and mask is resolved via recurrent lateral inhibition and that reentrant processing is subsequently responsible for rendering either just the mask or both the target and mask accessible to perceptual consciousness. Furthermore, we believe that in the future, models will benefit from synthesizing growing knowledge of the role of the magnocellular/dorsal stream in conscious object perception (e.g., Bar, 2003; Bar et al., 2006). For example, it may be that the preattentive sweep proposed by Põder (2013) is a magnocellular-mediated “first guess” at target object identity.

Object updating and object substitution

According to the object-updating account, OSM reflects the conditions in which the target and mask are integrated into a single object representation, thereby obscuring the visibility of the target. As we will discuss in this section, there is a wealth of evidence that encouraging the formation of independent object representations for the target and mask reduces OSM. But in order for this to be a compelling explanation, it needs to describe how integrating the target and mask objects would render the target unavailable to awareness, since the target and mask never share a common spatial location. It may be the case that the space inside the four dots is treated not as transparent, but as part of the four-dot object in its own right. Evidence for this assertion is that when the four stimuli constituting the mask were arranged in such a way as to induce an illusory contour, masking occurred when this stimulus appeared immediately adjacent to the target, but not when it appeared elsewhere in the display (Hirose & Osaka, 2009). Furthermore, when a stereoscope is used to induce the perception that the target and mask appear in different depth planes, masking occurs only when the mask is perceived to be in front of the target (Kahan & Lichtman, 2006). This suggests that the target representation is fused not just with four dots, but with an illusory contour created by the four-dot mask, and this obscures the visibility of the target.

In support of the role of object updating in OSM, Lleras and Moore (2003) demonstrated that OSM critically depends on the mask constituting a robust continuing object representation within the time frame that allows the target to be integrated with it. Their design utilized apparent motion, where two static stimuli flashed close in space and time create the percept of motion in the absence of actual motion (Anstis, 1980). Specifically, in their experiments, the target array (presented for 17–34 ms) contained eight “Cs” (varying in orientation), arranged in a circle, with every item surrounded by four dots. The target was signaled by being a darker shade of gray than the distractors, and participants were instructed to report the orientation of the target C. In addition to a baseline condition (standard simultaneous offset condition), there were also three different delayed mask offset conditions. There was a standard delayed mask offset condition and a condition where the masks always offset simultaneously with the target and then reappeared at a new spatial location (all masks moved together out into a circle with a larger circumference) after a duration that was conducive to the percept of apparent motion (17–34 ms; short interstimulus interval [ISI]); finally, there was a similar condition where the masks reappeared after a longer delay (216–233 ms; long ISI), which was not conducive to eliciting apparent motion. Comparing target identification accuracy in each condition against the simultaneous offset baseline, it was found that masking occurred in the standard delayed offset condition and in the apparent motion condition, but not in the condition where apparent motion was not observed. This suggests that when the mask is perceived as a continuing object, it results in OSM, but when the mask appears as a new object that onsets after the target offsets, it does not.

Pilling and Gellatly (2010) recently extended Lleras and Moore’s (2003) paradigm. These authors used a condition akin to that employed in the short-ISI trials described above and assessed the effect of adding another presentation of the mask dots at an outermost location in between the two successive displays previously used in that condition. The logic here was that this should disrupt the perception that the dots were a reappearance of the mask that was presented during the target array and would, instead, encourage the percept that these were new objects. This, therefore, should allow the visual system to determine that the target and mask were separate object identities and, consequently, should reduce masking. This is indeed what was found: Masking was reduced with the additional presentation of the dots, as compared with the standard short-ISI condition (Pilling & Gellatly, 2010).

Moore and Lleras (2005) found that additional manipulations that encouraged the visual system to treat the target and mask as a single object versus two separate objects also affected masking. Specifically, in Experiment 1, the authors used motion to encourage either common or separate object representations. In one condition, the mask appeared at a different location than the target, then moved past the location of the target when it was present, and then moved on to a new location after target offset (i.e., separated target and mask objects). In the other condition, after the mask onset at a different location than the target, it then appeared to move in conjunction with the target, rather than slide past its location (i.e., target and mask objects not separate). Masking was greater for the latter than for the former condition. In Experiment 2, the independence of the target and mask objects was manipulated via coherent versus incoherent motion. The trials began with placeholders: filled circles (rather than circles with gaps, i.e., Cs), with four-dot masks around each of them (the placeholders gave no cue as to the subsequent location of the target). Then the arrays (both circles and dots) briefly moved. They did so either together (coherent motion) or independently of one another (incoherent motion). After the placeholders stopped moving, the target array was presented. Stronger masking occurred after exposure to motion implying that the four dots and the circles were a coherent object than after exposure to motion implying that the circles and dots were separate objects. Again, this suggests that a failure of target and mask object individuation contributes to OSM (Moore & Lleras, 2005).

Finally, in Experiment 3, the target and mask either were colored identically or appeared in different colors (Moore & Lleras, 2005). This was done on the assumption that when the target and mask were identically colored, they should be more likely to be treated as belonging to a single object token, as compared with when they were differently colored. It was found that masking was weaker when the target and mask were colored differently. To the extent that featural similarity acts as an object-individuation cue, this is consistent with the object-updating account (Lleras & Moore, 2003). Moreover, it has been shown that rTMS applied to V5/MT+ reduces masking (Hirose et al., 2007). This dovetails with the object-updating account, since V5/MT+ processes motion (Born & Bradley, 2005), which inherently involves updating the location of an object over time. Temporarily deactivating this function may reduce object updating, and this might explain the decrease in masking.

Mask preview findings also provide converging evidence for the crucial role of object individuation computations in OSM. In the mask-preview paradigm, the target display includes four-dot masks around all items (both target and distractors). Since the masks surround all the stimuli, they are not predictive of the location of the target among the distractors in the subsequent array and, thus, do not simply serve as a precue to target location (which is also known to attenuate masking; see Di Lollo et al., 2000). Despite this, the nonpredictive preview of the mask objects (and targets) reduces masking (Neill et al., 2002). It has been hypothesized that this occurs because the preview consolidates the representation of the mask as a separate object identity, protecting against subsequent integration of the target and mask items. In addition to mask preview, prolonged exposure of the target array (in the absence of the cue indicating which item in the array is the target) decreases masking strength (Gellatly, Pilling, Carter, & Guest, 2010). Again, it has been suggested that the preexposure of the target array aids in the individuation of the target object’s representation, thus making it resistant to being confused with that of the mask (Gellatly et al., 2010). These authors have also shown that even when placeholders are used in lieu of the target and mask objects in the preview array, this reduction in masking still occurs, illustrating that the effect is not just the result of the previewed stimuli being loaded into visual short-term memory (VSTM; Guest et al., 2012).

As was noted above, prolonged mask exposure after the target offset can also reduce masking (Goodhew et al., 2012; Goodhew et al., 2011a). How would the object-updating framework explain this? Recall that recovery of the object representation is obtained both when the mask does not offset prior to a response being made and when a blank response delay is imposed (Goodhew et al., 2012). This illustrates the importance of time in the recovery effect; with sufficient processing time (untied to any further physical stimulation), the target representation can be accessed. Crucially, both in the conditions where there is a trailing mask and in those where the screen is blank, the visual system has a period of time during which no new stimulation onsets. We suspect that this period of time allows the mask representation to be consolidated in its own right, consequently preventing it from being fused with the target.

Our suggestion is that object updating reflects a limitation in the temporal resolution of visual encoding. This means that any manipulation that transiently increases the temporal precision of encoding should thwart OSM. One way to do this is to place visual stimuli in near-hand space. It was initially proposed that near-hand space incurs enhanced attentional processing (Reed, Betz, Garza, & Roberts, 2010; Reed, Grubb, & Steele, 2006) or reduced attentional disengagement (Abrams, Davoli, Du, Knapp, & Paull, 2008). More recently, however, there has been evidence to suggest that the effect reflects an upregulated contribution of the magnocellular channel, which has greater temporal resolution but poorer spatial resolution, relative to its parvocellular counterpart (Gozli, West, & Pratt, 2012). Increased temporal resolution implies that the visual system should be especially likely to encode the target and mask as separate objects, and it has recently been demonstrated that OSM is indeed reduced at such locations (Goodhew, Gozli, Ferber, & Pratt, in press). This is consistent with the notion that OSM reflects overzealous temporal fusion of two object identities.

It has also been suggested that the object-updating account is distinct from the object substitution account, since the former predicts that the target and mask are fused into a single object representation, whereas the latter predicts that two separate object tokens (target and mask) compete for access to consciousness (Di Lollo et al., 2000; Guest, Gellatly, & Pilling, 2011). Essentially, it is a question of whether the visual system encodes the target and mask as a single or two separate events. Recently, Guest et al. (2011) argued that they had found evidence in favor of object substitution. These authors found that OSM was strongest when the four dots surrounded the target (as per conventional OSM) and was weakened when the dots were instead placed overlapping the critical target feature. These authors argued that this is because OSM reflects spatial competition for consciousness between the whole target and mask object. But this finding is also easily explained by the object-updating framework. A common alignment of the target and mask stimuli (as in the standard OSM condition) would be especially conducive to their being confused as belonging to a single object identity, as opposed to when they were spatially offset. Thus, the results of Guest et al. (2011) can be taken as evidence against the notion that the target and mask representations interact at an isolated featural level (e.g., Kahan & Enns, 2010), but they are entirely consistent with the existing evidence for the object-updating account. It remains to be seen, therefore, whether there are findings in OSM that can be explained only by object substitution, and not object updating. It is telling, however, that the neural models that fared best in predicting behavioral OSM results (Bridgeman, 2007; Põder, 2013) posit integration (as opposed to substitution) of the target and mask stimulus information as the core mechanism underlying OSM.

OSM and object correspondence

It has long been noted that object representations play a key role in OSM (e.g., Lleras & Moore, 2003; Moore & Lleras, 2005). But is there a link between OSM and one of the fundamental challenges of vision in everyday life? That is, humans make several eye movements a second (Henderson & Hollingworth, 1998), and objects themselves move in the environment. This means that the brain receives visual information that is constantly being interrupted and in a state of flux. Yet despite this, our subjective sense of the world is one of stability and continuity. This demonstrates that the brain transforms the dynamic and impoverished input into stable recognizable object representations. It would be a confusing world if the visual system treated all changes in stimulation as new objects. This would mean, for example, that making a saccade to another object in between two fixations would obliterate any recognition of persisting identity.

While maintaining object identities is important, the brain cannot indiscriminately compute continuing object identity. In some instances, objects do change: An object at one location may be replaced by another, and it is just as important to appreciate when a new object appears on the scene as it is to maintain object identities when they persist. Thus, making an inference of correspondence versus noncorrespondence under conditions of ambiguous input is a key challenge for the visual system. Some paradigms exist for approaching this question, which are discussed below, but it is plausible that OSM cuts to the core of this process, since this phenomenon involves presenting observers with two instantiations of a visual stimulus, the first of which contains the target object, as well as surrounding visual information, followed by the second, which just contains the surrounding visual information. This means that the brain is confronted with the need to draw an object correspondence (or noncorrespondence) inference and, we suggest, on masked trials, this ambiguity is resolved in favor of a judgment that just a single object appeared throughout the presentation and, consequently, the target is suppressed from visual awareness. That is, the observer has no sense of having seen the target object at all and, when forced to respond, cannot accurately identify it. According to this perspective, when the target and mask are temporally fused, it is not a flaw of the system but, rather, reflects the operation of a mechanism that allows us to maintain coherent object identity representations through time and across change.

Why, according to this perspective, does the dispersal of spatial attention play a role in OSM? One explanation is that dispersed attention lowers the resolution of the forms being encoded to such an extent that the target-plus-mask representation is more likely to be perceptually confused with the mask-alone representation. It is already known that attention increases the spatial resolution of visual encoding (Yeshurun & Carrasco, 1998, 1999, 2008). Thus, in OSM, the fact that spatial attention cannot be readily focused on the target would decrease any encoded featural dissimilarity between the target and mask. This, in turn, would increase the likelihood that they would be deemed to be two instantiations of a single object.
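
The link between encoding resolution and confusability can be illustrated with a toy one-dimensional example. Everything in it is an assumption made for illustration: the 21-position "display," the positions of the mask dots and target, the box blur standing in for coarser spatial encoding, and cosine similarity standing in for perceptual confusability. It is a sketch of the argument, not a model of the visual system.

```python
import math


def box_blur(signal, radius):
    """Average each position with neighbours within `radius` (zero padded).
    A larger radius stands in for coarser spatial encoding."""
    width = 2 * radius + 1
    n = len(signal)
    return [
        sum(signal[j] for j in range(max(0, i - radius), min(n, i + radius + 1))) / width
        for i in range(n)
    ]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def confusability(radius, length=21, mask_positions=(5, 15), target_position=10):
    """Similarity between the blurred target-plus-mask display and the
    blurred mask-alone display (hypothetical layout)."""
    mask_alone = [0.0] * length
    for position in mask_positions:
        mask_alone[position] = 1.0
    target_plus_mask = list(mask_alone)
    target_plus_mask[target_position] = 1.0
    return cosine_similarity(box_blur(target_plus_mask, radius),
                             box_blur(mask_alone, radius))


if __name__ == "__main__":
    for radius in (0, 3, 5):
        print(f"blur radius {radius}: similarity = {confusability(radius):.3f}")
```

In this sketch, the similarity between the two displays is higher at a blur radius of 5 than at a radius of 0. This is the sense in which coarser encoding would make the target-plus-mask display easier to mistake for the mask alone.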

There is another aspect of attention in OSM: attention to the mask increases masking (Neill et al., 2002) and has been argued to be necessary for OSM to be observed (Tata & Giaschi, 2004). However, it is not necessary for the mask to be a singleton and “pop-out” in the target array, since OSM is obtained even when four dots surround all target locations and the target is, instead, signaled by appearing in a different color than the distractors (e.g., Lleras & Moore, 2003). From the object correspondence perspective, attention to the mask likely serves two important purposes. First, it enhances the strength of the mask’s stimulation, increasing the likelihood that the target instantiation, with relatively less strength, will be subject to integration rather than be treated as a discrete object. When the mask surrounding the target is not a singleton, the longer duration of the mask relative to the target likely serves the same purpose. Second, when integration occurs, attention to the mask ensures that it is the stimulus that dominates in the integration. Otherwise, if the representation of the mask were as weak as that of the target, integration could occur in favor of the target. That is, the target plus mask and subsequent trailing mask could be treated as a single instance of the target plus mask object, rather than the mask-alone stimulus. If such integration occurred, it would, of course, not be evident as an impairment in target visibility but, instead, as an impairment in mask visibility. Attention to the mask may preclude this possibility, and instead, OSM reflects the integration of the target and mask into a single object representation that largely reflects the mask-alone stimulus.

Relationship between OSM and other visual-cognitive phenomena

Although OSM has unique features that are not shared with other paradigms, the basic mechanisms that produce OSM may overlap with those underlying other visual-cognitive phenomena, especially other failures of awareness (Kim & Blake, 2005).

Metacontrast masking

There are obvious superficial differences between metacontrast masking and OSM. For example, mask duration is held constant in metacontrast masking, with only SOA being varied, whereas OSM conventionally involves the common onset of the target and mask (i.e., an SOA of 0). It has been demonstrated, however, that when mask duration is varied, metacontrast masking magnitude increases over the 0- to 160-ms durations (Di Lollo, von Muhlenen, Enns, & Bridgeman, 2004), and over these durations, OSM shows the same pattern. Furthermore, inducing a delay between the stimulus and the response can reduce metacontrast masking when the SOA between the target and mask is 0 (Lachter & Durgin, 1999), and common-onset OSM shows a similar improvement with response delay (Goodhew et al., 2012). But ultimately, as described previously, there is empirical evidence that OSM and metacontrast masking are indeed dissociable, with metacontrast masking having an earlier locus of suppression (Chakravarthi & Cavanagh, 2009).

When it comes to the role of attention, however, we might think of the distinction between OSM and metacontrast as one of degree, rather than kind. Whereas OSM is critically dependent on the dispersal of attention during target exposure, metacontrast masking is not, although it is modulated by attention. That is, metacontrast masking is obtained even with a centrally presented and attended target; in fact, this is the arrangement in which metacontrast masking is traditionally used (Alpern, 1952, 1953; Breitmeyer & Ögmen, 2006). However, the magnitude of metacontrast masking increases when spatial attention is dispersed during the presentation of the target (e.g., by increasing the number of distractors in the target array) and is attenuated when rapid attention to the target is facilitated in such arrays (Shelley-Tremblay & Mack, 1999; Tata, 2002). While the label metacontrast masking was not used, this interaction between attention and metacontrast masking was first noticed decades ago. Averbach and Coriell (1961) used a partial-report paradigm (similar to Sperling, 1960) in which a 2 × 8 array of letters was presented briefly and, after a variable delay, a visual cue was used to signal the target. When the visual cue was a circle that appeared around the location that had contained the target letter, these authors found an unexpected impairment in the visibility of the target letter that conformed to a U-shaped function across target–cue SOA (Averbach & Coriell, 1961). It is noteworthy that when attention was spread over the target array, a circle could serve as a mask for letters, even though these have somewhat different contours.

Why, then, does attention affect metacontrast masking? And why does the requirement for similarity of the contours of the target and mask appear to become more flexible under conditions of distributed spatial attention? We believe it is for the same reason that four dots can mask almost any stimulus in OSM when focused attention on the target is precluded: Under these conditions, the representation of the target is of sufficiently low spatiotemporal resolution that it can be confused for a prior instantiation of the trailing mask, rather than an object in its own right. When metacontrast masking occurs centrally, the greater spatiotemporal resolution of central vision demands that the target and mask have similar contours and appear in very close spatial proximity in order for them to become perceptually confusable. Once we move into the periphery or conditions under which attention is not focused on the target, however, this facilitates the perceptual decision that the target and mask reflect two instantiations of a single object representation, and thus the requirement for precise physical similarity between the target and mask is relaxed.

Backward masking

Similarities between OSM and metacontrast masking have often been noted, given that both phenomena appear to reflect a low-resolution target representation being integrated with that of the mask. It also appears, however, that backward pattern masking, where the mask spatially overlaps the target, may at least partially tap similar mechanisms (Di Lollo et al., 2000, first proposed this link). For example, it has been found that backward masking is differentially exacerbated when the perceptual quality of the target is already degraded, such as with object occlusion (Wyatte, Curran, & O'Reilly, 2012). Further evidence for object updating in backward masking is that when the overlapping mask consists of systematically alternating black and white squares (and thus, its form is more easily discernible when the mask is overlaid), it is a less effective mask than one that has exactly the same number of black and white squares but has them arranged in a random, unsystematic way (Coltheart & Arthur, 1972). This is evidence that the target and mask are integrated into a single percept and that the mask that provides better camouflage when they are merged is more effective at obscuring target visibility. If the purpose of the mask was merely to terminate further processing of the target, the nature of the mask itself should be irrelevant. This suggests that integration of the target and mask representations plays a role in backward masking.

Change blindness

Attentional limitations that impoverish the quality of perceptual representations and render them vulnerable to integration are apparent in many other visual-cognitive phenomena. For example, change blindness refers to the finding that observers often fail to see large changes in a scene (Rensink, O'Regan, & Clark, 1997; Simons & Levin, 1998). Specifically, when two versions of a scene (an original version and a version with a significant change to an object in the scene) are presented in alternation, with an irrelevant image (e.g., a gray screen) interleaved between them, observers have difficulty in detecting the change (Rensink et al., 1997). However, the irrelevant image interleaved between the two versions of the scene is not necessary to produce change blindness, since the effect can be observed with the scene being continuously present, if visual noise stimuli (“mudsplashes”) are added to the scene concurrently with the change, even though they do not obscure the changed object (O'Regan, Rensink, & Clark, 1999). Change blindness may be related to OSM because, when attention and processing resources are occupied elsewhere, the critical event (the change to the scene in change blindness or the target in OSM) is encoded with poor spatiotemporal resolution, causing it to be missed entirely or to be susceptible to integration. In this sense, we support the notion that attention represents a necessary, but not a sufficient, condition for visual awareness (Cohen, Cavanagh, Chun, & Nakayama, 2012), rather than the alternative that spatial attention and perceptual consciousness reflect distinct operations (Koch & Tsuchiya, 2007, 2012).

Object correspondence through occlusion

If OSM does indeed reflect the operation of mechanisms involved in integrating (vs. segregating) object identities, there should be converging evidence from OSM studies and other traditional studies of object correspondence. On closer examination of the literature, such convergence is apparent. Many laboratory object correspondence tasks involve the presentation of two suprathreshold events, before and after occlusion. For example, a green circle travels across the screen (event one), disappears behind an occluder, and then reemerges on the other side (event two). The extent to which observers perceive the emerging object as the same object as the one that disappeared behind the occluder can be gauged with a subjective report of perceived object identity persistence (Burke, 1952), indirectly via an objective change detection task (Hollingworth & Franconeri, 2009), or by the end location of saccades (Richard, Luck, & Hollingworth, 2008). Alternatively, some studies have used apparent motion, but such tasks usually employ similar dependent measures, such as reports on the nature or direction of motion (Green & Odom, 1986; Hein & Moore, 2012).

What properties of the events need to match in order for a persisting object identity (rather than two discrete objects) to be perceived? Both an object’s surface features and its spatiotemporal history make independent contributions to object correspondence across occlusion by another object (Hollingworth & Franconeri, 2009), saccades (Richard et al., 2008), and apparent motion events (Hein & Moore, 2012). It is interesting, therefore, that results in the OSM paradigm converge on the same conclusion: While the mask’s spatiotemporal trajectory influences masking (e.g., Lleras & Moore, 2003), so do its features. For example, masking strength is modulated by whether the target and mask are the same color (Moore & Lleras, 2005) or the same luminance (Luiga & Bachmann, 2008), such that masking is greater when the target and mask have common features than when they have conflicting features. This convergence between traditional measures of object correspondence and OSM suggests that they may tap common operations.

Visual short-term memory

Object updating implies that on at least a proportion of trials, the first event (i.e., the target) and the second event are stored in order for a comparison to take place. What memory store mediates this process? Visual short-term memory (VSTM) is a store in which about three to four bound objects are held and are explicitly reportable (Luck & Vogel, 1997). However, masking is affected by mechanisms other than loading information into VSTM (Guest et al., 2012), and ERP evidence suggests that targets that fail to be reported in OSM are not encoded into VSTM (Prime, Pluchino, Eimer, Dell'Acqua, & Jolicoeur, 2011). Thus, VSTM does not appear to be the most likely candidate.

An alternate store, visible persistence, is a briefly lingering low-level stimulus representation that is extinguished within approximately 100–200 ms of stimulus offset (Coltheart, 1980; Di Lollo, 1980; Hogben & Di Lollo, 1974). However, manipulations that occur beyond this time window still influence OSM (Goodhew et al., 2012; Goodhew et al., 2011a; Lleras & Moore, 2003). Informational persistence (also known as iconic memory) is preserved for longer durations, and lingering informational persistence is typically gauged with a cue after target offset, which is the first point at which the to-be-reported target is differentiated from distractors (Sperling, 1960). In OSM, the location of the target is cued from the time of exposure. Informational persistence, furthermore, is purported to be an abstract representation, and yet we know that improvement with prolonged mask exposure is extinguished when a backward pattern mask overlaps the spatial location of the target, suggesting that even at long durations, it is visual, rather than informational (Goodhew et al., 2011a). Consistent with this, since OSM appears to fundamentally reflect object integration mechanisms, it would make most sense for the target representation to be a mid-level representation, like that of the object token, a marker for an identity whose location can be updated over time and whose features are irrelevant (Kahneman, Treisman, & Gibbs, 1992). Unlike the representation suggested by Kahneman et al., however, the representation here appears to retain featural information (Luiga & Bachmann, 2008; Moore & Lleras, 2005).

It is worth noting that Brockmole and colleagues (Brockmole, Irwin, & Wang, 2003; Brockmole & Wang, 2003; Brockmole, Wang, & Irwin, 2002) found that performance on a temporal integration task follows a curvilinear function much like OSM. These studies used the missing-dot paradigm, in which observers were presented with two successive matrices of dots (e.g., a 4 × 4 grid) separated by variable intervals. If the matrices were overlaid, all of the locations were filled except one. Observers’ task was to identify this location (i.e., find the missing dot). It was found that missing-dot localization accuracy was maximal at an ISI of 0 ms between the to-be-integrated images, declined across the first 100 ms or so and then steadily increased over the course of at least another 1,000 ms. Brockmole et al. suggested that this was evidence that the visual system is capable of integrating a currently maintained visual image with perceptual information and that the prolonged time scale over which this occurs is due to sluggish mechanisms for generating a consolidated first image.
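
The structure of a missing-dot trial can be made concrete with a short sketch. This is only an illustration of the logic described above, not the procedure used by Brockmole and colleagues: the function names, the roughly even split of dots between the two arrays, and the use of Python's random module are all assumptions introduced here.

```python
import random


def make_missing_dot_trial(size=4, rng=None):
    """Build one hypothetical missing-dot trial on a size x size grid.

    Returns (first_array, second_array, missing_cell). Overlaying the two
    arrays fills every cell except missing_cell. Splitting the dots roughly
    evenly between the arrays is an assumption made for this sketch.
    """
    rng = rng or random.Random()
    cells = [(row, col) for row in range(size) for col in range(size)]
    rng.shuffle(cells)
    missing_cell = cells[0]          # the one location left empty
    filled = cells[1:]               # every other location holds a dot
    half = len(filled) // 2
    first_array = set(filled[:half])
    second_array = set(filled[half:])
    return first_array, second_array, missing_cell


def find_missing_dot(first_array, second_array, size=4):
    """Return the single cell left empty when the two arrays are overlaid."""
    all_cells = {(row, col) for row in range(size) for col in range(size)}
    (missing_cell,) = all_cells - (first_array | second_array)
    return missing_cell


if __name__ == "__main__":
    first, second, missing = make_missing_dot_trial(rng=random.Random(1))
    print(find_missing_dot(first, second) == missing)
```

The point of the sketch is that neither array alone identifies the empty location. Only the union of the two does, which is why accuracy on the task is taken to index how well the two displays have been combined.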

More recent work, however, has shown that consolidation of an image in such missing-dot tasks takes about 200 ms (Jiang, 2004) and that, after 500 ms, the visual system maintains separate representations of each array in VSTM (Jiang & Kumar, 2004). This is somewhat dissimilar to the time course reported by Brockmole and colleagues but still not inconsistent with OSM. Recovery from OSM peaks by 640 ms after target exposure (Goodhew et al., 2012), so by this point, there may be separate representations for the target and mask available, which may even be stored in VSTM. But of greater interest is the nature of the target representation before this, in the form that allows for object integration. Recently, it has been suggested that the most efficient way to perform the missing-dot task is to memorize the locations in which there are no dots in the first array and compare these with the second array, meaning that the task is not tapping true perceptual integration (Hollingworth, Hyun, & Zhang, 2005; Jiang, Kumar, & Vickery, 2005), casting doubt on the ability of the results from this paradigm to inform our understanding of OSM. Thus, we have not yet identified the precise nature of the memory representation in OSM, but it has several important characteristics: It is visual in form, contains both featural and spatiotemporal information, is relatively long in duration, and, as we shall see in the next section, is capable of activating semantic-level recognition of the target.

The fate of masked targets in OSM

Focusing on the mechanism that suppresses the target from, or catapults it into, visual awareness is vital for understanding OSM and, indeed, vision in general. Equally important, however, is assessing unconscious vision—that is, the level of processing of the masked target. This final section will review the evidence regarding the fate of masked targets in OSM.

How do salient stimuli fare as targets in OSM? One example of such a stimulus category is the human face. Processing of faces is associated with increased negativity over occipito-temporal electrode sites about 130–200 ms after stimulus presentation (Bentin, Allison, Puce, Perez, & McCarthy, 1996). This component of the event-related potential (ERP) is known as the N170, and it is thought to reliably predict the conscious perception of a stimulus as a face. The delayed offset of a four-dot mask has been found to obliterate the N170 (Reiss & Hoffman, 2007), suggesting that OSM impairs early levels of visual analysis. Examination of the data presented in Reiss and Hoffman’s (2007) paper, however, reveals that while there was no difference in the amplitude of the waveforms triggered by masked face versus house stimuli in the couple of hundred milliseconds after the target, a systematic and sustained difference did appear to emerge several hundred milliseconds later, beyond the time window of analysis for the N170. This suggests that there could be an electrophysiological correlate of faces that are processed (to the level that they are differentiated from other objects, such as houses) and yet not consciously perceived due to masking. However, a limitation of this study is that the authors did not separately analyze, on a trial-by-trial basis, the trials on which participants were aware of the target and the trials on which they were not. Since masking was not complete (target identification accuracy of 70.4%, where chance is 50%), participants would have actually been aware of the target on a considerable portion of trials in the “masking” condition, and the ERP difference could be driven by target awareness on those trials. As it stands, this work is not definitive on the fate of unconscious stimuli in OSM.
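
As a rough illustration of how large that portion of aware trials might be, one can apply a standard correction for guessing to the reported accuracy. This is a back-of-the-envelope sketch only, and it rests on an assumption that is not part of the original study: a simple high-threshold model in which observers either perceive the target and respond correctly or fail to perceive it and guess at chance. The symbols $p_a$ (proportion of trials with target awareness) and $c$ (chance accuracy) are introduced here for illustration.

```latex
P(\text{correct}) = p_a + (1 - p_a)\,c
\quad\Longrightarrow\quad
p_a = \frac{P(\text{correct}) - c}{1 - c}
    = \frac{0.704 - 0.5}{1 - 0.5}
    \approx 0.41
```

Under this assumption, the target would have been consciously perceived on roughly 40% of the trials in the masking condition, which is more than enough to drive a condition-level ERP difference. Other models of the relationship between accuracy and awareness would yield different estimates.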

While it remains unclear whether faces are perceived during OSM, features of objects clearly are processed. Chen and Treisman (2009) conducted a series of experiments investigating the level of implicit processing in OSM. These authors found that when the target and mask stimuli were featurally identical (e.g., left-pointing double-headed arrows, [<<]), response times to identify the mask were facilitated, as compared with when they were different (e.g., target left-pointing and mask right-pointing arrows), even for targets that were missed on a target detection task. This demonstrates that there is processing of the basic physical features of the target, even in the absence of explicit perceptual awareness (Chen & Treisman, 2009). The authors attempted to assess the level of implicit semantic perception in subsequent experiments, using isolated letter stimuli and defining compatibility in a somewhat arbitrary manner (vowels vs. consonants) that did not reflect the way we typically categorize or process letters. Typically, we see them as parts of words and rapidly extract the meaning of those words. Thus, what Chen and Treisman essentially showed is that there is not implicit processing of the vowel/consonant distinction in OSM. But when word stimuli are used, there is evidence for implicit semantic perception of suppressed targets.

In order to capture the richness of abstract meaning, Goodhew et al. (2011b, Experiment 2) used target words (PINK, BLUE, HOUR, MAIL, colored gray) and masked them with four colored dots (all colored either pink or blue on a given trial), with (####) as distractors. Targets were present on 50% of the trials, and the mask had either a simultaneous or a delayed (200-ms) offset. Participants’ task was to (1) make a speeded mask color judgment (pink vs. blue) and (2) make an unspeeded target detection judgment. Compatibility was defined as the relationship between the semantic meaning of the target word and the color of the mask. That is, when the target word was present, on some trials, the semantic meaning of the target matched the color of the four dots (e.g., “PINK” inside pink dots; compatible); on others, the semantic meaning of the target mismatched the color of the four dots (e.g., “PINK” inside blue dots; incompatible); and on others still, there were noncolor words that were neutral with respect to the color of the four-dot mask. Crucially, the only variable that differed across the compatible, neutral, and incompatible trials was the semantic relationship between the target and the colored dots; featurally, the conditions were equivalent (four black letters inside colored dots).
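
The compatibility coding just described can be summarized in a few lines of code. This is a minimal sketch of the condition logic as stated in the text, not the authors' experiment software: the function name, the dictionary layout, and the use of None to represent a target-absent trial are assumptions made for illustration.

```python
# Target words taken from the description above; the data structures
# themselves are hypothetical.
COLOR_WORDS = {"PINK": "pink", "BLUE": "blue"}
NEUTRAL_WORDS = {"HOUR", "MAIL"}
MASK_COLORS = {"pink", "blue"}


def classify_trial(target_word, mask_color):
    """Label a trial by the semantic relation between target and mask color.

    target_word is None on target-absent trials (an assumed convention).
    """
    if mask_color not in MASK_COLORS:
        raise ValueError(f"unexpected mask color: {mask_color!r}")
    if target_word is None:
        return "target-absent"
    if target_word in NEUTRAL_WORDS:
        return "neutral"
    if target_word not in COLOR_WORDS:
        raise ValueError(f"unexpected target word: {target_word!r}")
    # Color words are compatible only when their meaning matches the mask.
    if COLOR_WORDS[target_word] == mask_color:
        return "compatible"
    return "incompatible"


if __name__ == "__main__":
    for word in ("PINK", "BLUE", "HOUR", None):
        print(word, classify_trial(word, "pink"))
```

Note that the classification depends only on word meaning and mask color. Nothing about the physical appearance of the letters enters into it, which is the property that allows any compatibility effect to be attributed to semantic processing.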

Target detection was impaired for delayed, relative to simultaneous, offset trials. On the delayed mask offset trials, when the target was correctly detected, response times were significantly faster on the compatible than on the incompatible trials, with neutral words yielding intermediate response times. When the target was missed, the opposite pattern was observed; response times were significantly faster for incompatible than for compatible trials. This negative compatibility effect (Bennett, Lleras, Oriet, & Enns, 2007; Eimer & Schlaghecken, 1998) is striking and demonstrates different processing in conscious and unconscious vision (see Goodhew et al., 2011b, for a discussion of this effect). But the most important finding was a systematic relationship between the meaning of the target and response times to the mask, indicating that the target word was processed to the level of semantics, despite failing to be detected. This demonstrates that there is implicit semantic perception in OSM (Goodhew et al., 2011b).

The conclusion that there is implicit semantic perception may seem at odds with the finding that OSM abolishes the N400 to word targets (Reiss & Hoffman, 2006), an event-related potential used as an electrophysiological signature of semantic processing (Kutas & Hillyard, 1980). However, if we examine Reiss and Hoffman’s (2006) methodology, we find that these results are also not definitive regarding the level of processing in OSM. These authors had participants read a context word followed by a semantically related or unrelated target word surrounded by dots. The participants then identified the target from multiple display options. The delayed mask offset impaired target identification performance, and while the differential amplitude of the N400 to semantically related versus unrelated words was present for the consciously perceived targets, it was abolished in the delayed offset condition. While these results point toward OSM impairing semantic processing, there are a couple of limitations to this study. First, the authors did not compare perceptually aware versus unaware trials, instead grouping all delayed mask offset trials together for the analysis. This makes it possible that any effect for unaware trials was obscured, especially if it differed qualitatively from that for aware trials. Second, there was, in fact, some behavioral evidence for implicit semantic perception: Responses were significantly more accurate for compatible trials (context and target word semantically related) than for incompatible trials. This is difficult to interpret, however, since, as the authors acknowledged, it could be attributed to a guessing bias. Thus, this study, as it stands, does not conclusively reveal the fate of the masked target in OSM.

To summarize, the evidence indicates that (1) successfully masked targets in OSM are nonetheless implicitly processed at both the basic physical feature and higher-level abstract semantic level, and (2) there is a qualitative difference in the effect of the perceived versus missed targets at the semantic level. This is not because the missed targets failed to be processed; this would result in an absence of priming on the miss trials. Instead, there was strong priming, but it differed qualitatively in nature from the pattern when the target was visible.

Conclusion

OSM has been an exciting development in visual masking techniques that are used for understanding the mechanisms of conscious visual perception. It appears to tap properties of the visual system that are common to multiple paradigms, including metacontrast masking, backward masking, change blindness, change detection, and object correspondence through occlusion, to name but a few. That is, when focal attention to the target is prevented, this stimulus is encoded with sufficiently low spatiotemporal resolution that it is perceptually confused with a prior instantiation of the mask. Ultimately, we conclude that OSM reflects a perceptual solution to a fundamental problem in visual cognition: whether stimulation in close spatiotemporal proximity reflects a persisting object identity or discrete object representations. Masking occurs when the visual system decides in favor of the former, treating the target and mask as a single persisting object and thereby suppressing target visibility. Finally, OSM selectively impairs awareness while leaving high-level processing of the target intact, making it an ideal tool for the study of unconscious perception.