Introduction

Treisman’s feature integration theory (FIT) undoubtedly transformed the way in which we think about the integration of features (e.g. Treisman & Gelade, 1980). Her key suggestion was that attention was required to bind features together into perceptual objects (see Treisman, 1982, 1986, 1988, 1996, and 1998, for reviews). Placed in the context of the early/late selection debate in which her own thinking developed, her theorizing can be seen as a neurophysiologically inspired second attempt to resolve the debate over selective attention. Treisman started her research career in Oxford University’s Department of Experimental Psychology, publishing in the area of dichotic listening (e.g. see Treisman, 1964, 1969). In her early years as an experimental psychologist, she developed an ‘Attenuator Model’ of selective listening. Importantly, however, this early cognitive (i.e. box-and-arrow) approach was soon criticized for its lack of neurophysiological plausibility (Styles, 2006; see also Driver, 2001). Her next attempt to resolve the early/late debate in selective attention would be much more firmly grounded in (or at least inspired by) the known visual neurophysiology of the period (see Cowey, 1979, 1985; Livingstone & Hubel, 1988; Zeki, 1978).

Interestingly, though, while often criticizing Treisman’s theory on various grounds, subsequent accounts of feature integration have still largely chosen to retain a relatively narrow focus on vision (e.g. see Eckstein, 2011; Quinlan, 2003; Wolfe, 1998, for reviews). That is certainly the case for the most successful successors to Treisman’s account of visual search, such as Wolfe’s influential Guided Search model (Wolfe, Cave, & Franzel, 1989) and Müller’s dimensional weighting model (Müller, Heller, & Ziegler, 1995; Müller, Krummenacher, & Heller, 2004). While such a narrow focus does perhaps make sense in light of the daunting complexity of human information processing, it nevertheless clearly fails to engage with much of our everyday experience, which involves either non-visual or multisensory information processing (and possibly also object representations). The aim of this review is therefore to provide an up-to-date critical analysis of Treisman’s FIT beyond the unimodal (or unisensory) visual case.

FIT: Key features

Several key (testable) claims were associated with Treisman’s original formulation of FIT:

1) All visual features (such as colour, form and motion) were processed in parallel.

2) A spatial attentional spotlight was required to glue, or bind, visual features together effectively at particular locations within a putative master map of locations. Neurophysiological support for the existence of such a master map subsequently emerged from data showing that parietal damage (Friedman-Hill, Robertson, & Treisman, 1995) or transcranial magnetic stimulation (TMS) over parietal areas (e.g. Ashbridge, Walsh, & Cowey, 1997) could selectively interfere with feature conjunction (see also Bichot, Rossi, & Desimone, 2005; Shulman, Astafiev, McAvoy, d'Avossa, & Corbetta, 2007).

These two claims were thought to give rise to the apparent dissociation between the parallel search for targets defined by the presence of a unique feature singleton and the serial search for those targets defined by a conjunction of features (a dissociation illustrated schematically in the sketch following this list).

3) Feature binding in the absence of (sufficient) attention was likely to give rise to illusory conjunctions (ICs; e.g. see Treisman & Schmidt, 1982). That is, features from different visual objects might inadvertently be bound together. Indeed, subsequent research revealed that ICs are more common between features that occur closer together spatially (Cohen & Ivry, 1989), or else are otherwise grouped (Prinzmetal, 1981), suggesting some role for spatial segregation, even here.
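Returning to the parallel/serial dissociation just mentioned, the following minimal sketch (with purely hypothetical timing parameters; it is not a model fitted to any data set) simulates the canonical pattern predicted by FIT: flat reaction time (RT) × set-size functions for feature search, and steep, roughly 2:1 target-absent:target-present slopes for serial, self-terminating conjunction search.

```python
import random

def simulate_search_rt(set_size, search_type, target_present,
                       base_rt=450.0, item_time=50.0, noise_sd=20.0):
    """Toy RT generator for the two search modes posited by FIT.

    Assumptions (illustrative only, not fitted to any data set):
    - Feature search: the unique feature 'pops out', so RT is roughly
      independent of the number of items in the display.
    - Conjunction search: items are inspected serially and search is
      self-terminating, so on average half the items are inspected on
      target-present trials and all of them on target-absent trials.
    """
    if search_type == "feature":
        items_inspected = 1
    else:  # serial, self-terminating conjunction search
        items_inspected = set_size if not target_present else (set_size + 1) / 2
    return base_rt + item_time * items_inspected + random.gauss(0.0, noise_sd)

for search_type in ("feature", "conjunction"):
    for present in (True, False):
        mean_rts = [round(sum(simulate_search_rt(n, search_type, present)
                              for _ in range(500)) / 500)
                    for n in (4, 8, 16)]
        print(f"{search_type:11s} target {'present' if present else 'absent '}:",
              mean_rts)
```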

Treisman’s FIT has, in the years since it was originally put forward, been criticized on a number of fronts. Some researchers have questioned whether unique features are necessarily detected in the absence of attention (e.g. Kim & Cave, 1995; Mack & Rock, 1998; see also Braun, 1998). Others have provided examples of feature conjunction targets that are seemingly detected in parallel (e.g. see Enns & Rensink, 1990; McLeod, Driver, & Crisp, 1988; Nakayama & Silverman, 1986, for a number of such early examples). Others, meanwhile, have questioned whether ICs really are genuinely perceptual in nature, as Treisman would have us believe (e.g. see Virzi & Egeth, 1984, for an early study raising just such concerns). Questions have also been raised about quite how the attentional spotlight (see Treisman & Gelade, 1980) manages to search efficiently through a scene (Wolfe et al., 1989; see also Klein & MacInnes, 1998). Then, there are those who have questioned just how clear the dichotomy between serial and parallel search really is (e.g. Duncan & Humphreys, 1989). Finally, there are those who have wanted to argue against the very notion of conjunction search as a serial process (see Palmer, 1994).

That all being said, the one aspect of FIT that most researchers working in the area have seemingly not wanted to (or at least have not thought to) question concerns its focus on only a single sense, namely vision. In hindsight, Treisman and her colleagues’ theorizing on feature integration was surprisingly narrow, in that it only really engaged with the question of how visual features might be integrated. But what about the integration of features in the other senses such as, for example, audition and touch? One might then also legitimately want to know about the integration of features from different sensory modalities (i.e. tapping into questions of multisensory integration; e.g. Kubovy & Schutz, 2010; O’Callaghan, 2016). After all, object representations, no matter how they are defined (see below for more on this problematic theme), are often specified by cues from multiple distal senses, and not just vision. Beyond the visual dominance that is such a distinctive feature of human information processing (e.g. Posner, Nissen, & Klein, 1976; Spence, Shore, & Klein, 2001), and the fact that the neuroscientific underpinnings of visual perception are better worked out than is the case for any of our other senses (e.g. Luck & Beach, 1998), one might think that there is, actually, little reason to prioritize visual object perception over that taking place in the auditory or tactile modalities, say.

On the integration of non-visual features

Feature integration surely does take place in other senses, too. For instance, think only of the sound of a musical instrument and how its different features (such as pitch, timbre, and amplitude; see Giard, Lavikainen, Reinikainen, Perrin, Bertrand, Pernier, & Näätänen, 1995) are integrated perceptually. The same presumably must also hold true for touch, where tactile cues are combined into felt objects. That said, online searches reveal little evidence to suggest that researchers have really attempted to extend Treisman’s framework into the tactile modality (see Gallace & Spence, 2014, for a review of tactile information processing in humans). Some researchers also talk about flavour objects (see Auvray & Spence, 2008, for a review). However, while mention is certainly made of perceptual objects in the visual, auditory (e.g. Cusack, Carlyon, & Robertson, 2000; Darwin & Hukin, 1999; Kubovy & Van Valkenburg, 2001; O’Callaghan, 2008; Shinn-Cunningham, 2008), tactile and olfactory (or flavour) modalities (Stevenson, 2014; Stevenson & Wilson, 2007), it is probably safer here (especially given space constraints) to restrict our discussion/critique primarily to the spatial senses of vision, audition and, to a much lesser extent, touch. After all, one of the specific problems faced as soon as one delves into the chemical senses is that it immediately becomes especially unclear what ‘basic features’ are (Stevenson, 2014). And, in the case of taste (gustatory) perception, while there are several commonly discussed basic tastes, it is by no means clear that they should be equated with features, as conceptualized in the literature on visual FIT.

However, even restricting ourselves primarily to the integration of features within/between the spatial senses, it is interesting to note how soon one runs into problems/uncertainties/challenges when adapting Treisman’s approach. While the neuroscience is reasonably clear in terms of the specific features that are processed in parallel in vision, audition and touch, what has been taken to constitute a basic feature in each of the senses is somewhat different. So, for example, while spatial location is coded from the retina onwards in vision (and hence constitutes the backdrop against which features are integrated), the analogous dimension in audition is frequency/pitch. This has led some theorists to wonder whether frequency/pitch is to hearing what space is to vision (cf. Kubovy, 1988). Furthermore, in vision, there is in any case a much clearer distinction between features and receptors, while in the case of touch and gustation, say, there would appear to be a much closer link. That said, even in the purely visual case, some have raised concerns over the lack of a clear definition of what constitutes a visual feature (see Briand & Klein, 1989, for early discussion of this issue).

Integrating features within audition and touch

Over the years, a number of researchers have tried to extend Treisman’s revolutionary ideas around feature integration beyond the confines of the visual modality. For instance, some have attempted to adapt (or extend) her FIT to help explain the constraints on the integration of auditory features (e.g. Woods & Alain, 1993; Woods, Alain, Covarrubias, & Zaidel, 1993; Woods, Alain, Diaz, Rhodes, & Ogawa, 2001; Woods, Alain, & Ogawa, 1998). Certainly, there is widespread talk of auditory objects (Bizley & Cohen, 2013), albeit not without its own controversy/philosophical intrigue (e.g. Griffiths & Warren, 2004; Matthen, 2010; Nudds, 2010; Shinn-Cunningham, 2008). As has already been mentioned, localization comes ‘later’ in hearing than in vision or touch, and hence one might wonder whether features are perhaps processed prior to their effective localization (although a spatial visual cue has been shown to help with auditory identification; Best, Ozmeral, & Shinn-Cunningham, 2007). Some have gone even further in suggesting that perhaps space itself should be considered a perceptual feature, just like pitch and timbre, say (Woods et al., 2001).

In one study by Woods et al. (1998), for example, the participants were required to detect auditory targets from a rapidly presented stream of tone pips (low, medium or high pitch) that were presented to either ear. The auditory conjunction search task was, for instance, to respond to high-pitched targets presented to the right ear, while the feature search task consisted of reporting whenever a tone of a specific pitch was heard (i.e. regardless of the ear in which it was presented). The results revealed that participants actually detected the conjunction targets more rapidly than the feature targets. Here, though, it should be noted that restricting conjunction targets to one ear may have allowed mechanisms of spatially selective attention to operate (see Kidd, Arbogast, Mason, & Gallun, 2005; Shinn-Cunningham, 2008; Spence & Driver, 1994), hence facilitating performance by means of a rather different mechanism. In fact, transposing the experimental design back to the visual modality would presumably also have given the same result – namely, faster detection responses when the target location is fixed (see Posner, 1978).
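The difference between the two search instructions in this kind of sequential design can be sketched in a few lines of code (the pitch/ear coding used here is an illustrative placeholder rather than Woods et al.’s actual stimulus set or procedure):

```python
# Each tone pip is coded by its pitch and the ear to which it is presented.
# These codes are illustrative placeholders, not Woods et al.'s stimuli.
stream = [("low", "left"), ("high", "right"), ("medium", "left"),
          ("high", "left"), ("high", "right")]

def is_feature_target(tone, target_pitch="high"):
    # Feature search: respond to the target pitch regardless of ear of entry.
    pitch, _ear = tone
    return pitch == target_pitch

def is_conjunction_target(tone, target_pitch="high", target_ear="right"):
    # Conjunction search: respond only when pitch AND ear both match.
    pitch, ear = tone
    return pitch == target_pitch and ear == target_ear

print([is_feature_target(t) for t in stream])      # [False, True, False, True, True]
print([is_conjunction_target(t) for t in stream])  # [False, True, False, False, True]
```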

Others, meanwhile, have presented spatial arrays of auditory stimuli (i.e. using a presentation protocol that is more similar to that seen in vision than the sequential presentation used by Woods et al., 2001). Using such an approach, Hall, Pastore, Acker and Huang (2000) managed to provide evidence for the role of attention in auditory feature integration. Others have reported the existence of ICs (see Thompson, 1994; Thompson & Hall, 2001). Similarly, other researchers have provided evidence suggestive of pre-attentive feature integration (of timbre and pitch) in audition – specifically for a pair of simultaneously presented but spatially distributed sounds – using the mismatch negativity response (i.e. an index of auditory deviance; see Takegata, Brattico, Tervaniemi, Varyagina, Näätänen, & Winkler, 2005).

The problematic definition of features

Another challenge that is thrown into sharp relief when one moves outside of the visual modality concerns how, exactly, ‘features’ should be defined. In FIT’s original, visual formulation, features were associated with the existence of discrete early visual processing areas (such as for colour, orientation or motion; e.g. Treisman & Schmidt, 1980). However, the idea that features are synonymous with physiologically discrete feature maps in the brain has long since been superseded (see Treisman & Gormican, 1988, p. 16; and Bartels & Zeki, 1998). According to the latter authors, a feature is similar to the concept of a neural channel (see Braddick, Campbell, & Atkinson, 1978). However, outside of the visual modality, it is not so clear that features are necessarily associated with discrete processing areas (although discrete areas have been identified, e.g., for frequency, intensity and duration in audition; see Giard et al., 1995). What is more, contemporary researchers have often questioned whether all features necessarily have a distinct neural processing area attached. That said, there is even some uncertainty about how exactly features should be defined within the visual modality (e.g. Briand & Klein, 1989; Schyns, Goldstone, & Thibaut, 1998). Despite this uncertainty, a number of authors have, over the years, been more than happy to talk about olfactory or tactile objects as feature compounds (e.g. Carvalho, 2014; Keller, 2016; Stevenson & Wilson, 2007; Thomas-Danguin, Sinding, Romagny, El Mountassir, Atanasova, Le Berre, Le Bon, & Coureaud, 2014; Yeshurun & Sobel, 2010). It should be noted here, though, that in the case of hearing, the individual identity of features (of pure tones, say) may be lost as they blend into harmonies. Furthermore, matters are more complex still in the world of olfactory feature integration, where the perceptual consequences of combining discrete odours are still not well understood (Yeshurun & Sobel, 2010).

The problem of spatial alignment

There are undoubtedly a number of important challenges for any account of feature integration as soon as one starts thinking about how to combine features from the different senses. One of the most salient of these concerns the determination of which features come from the same location, and hence should be bound together. Note only that information encoding is initially retinotopic in vision, tonotopic in hearing and somatotopic in touch (Spence & Driver, 2004). Hence, for example, the location from which a sound is presented is not given initially, but (as we have seen already) is coded later in information processing. Location is thus processed – in some sense – late in the auditory modality, whereas it is coded early in vision (see Shulman, 1990, on this point; see also Kopco, Lin, Shinn-Cunningham, & Groh, 2009).

What is more, there is no obvious immediate means of spatially aligning features in the different senses, given the different frames of reference in which spatial encoding takes place in vision, audition and touch. The computational problem here is exacerbated by the fact that the various frames of reference will immediately fall out of any kind of spatial alignment once the eyes are moved with respect to the head, say, or the head with respect to the body. Note here that in their earlier work, Spence and Driver (2004) focused on cross-modal links in spatial attention between the auditory, visual and tactile modalities under just such conditions of receptor misalignment. That said, perhaps we need to stop for a moment to consider whether Treisman’s glue really is synonymous with Posner’s spotlight, as the preceding discussion would appear to have assumed. For, according to research by Briand and Klein (1987; see also Soetens, Derrost, & Notebaert, 2003), that is by no means necessarily the case: In fact, the answer has been shown to differ for endogenous and exogenous attention. According to Briand and Klein, only exogenous attentional orienting behaves equivalently to Treisman’s glue.
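Returning to the frames-of-reference problem, a minimal illustration may help (it uses made-up, one-dimensional azimuth values and ignores the distributed neural codes actually involved): a retinocentric visual location and a head-centred auditory location only correspond once the current eye and head positions are taken into account.

```python
def visual_to_body_centred(retinal_azimuth, eye_in_head, head_on_body):
    """Map a retinocentric azimuth (degrees, positive = rightward) into
    body-centred coordinates. A drastically simplified 1-D sketch: real
    remapping involves 3-D geometry, eye torsion, and population codes."""
    return retinal_azimuth + eye_in_head + head_on_body

def auditory_to_body_centred(head_centred_azimuth, head_on_body):
    # Interaural cues initially specify location relative to the head.
    return head_centred_azimuth + head_on_body

# A single audiovisual event 20 deg to the right of the body midline,
# viewed with the eyes deviated 10 deg to the left and the head turned
# 5 deg to the right (all values purely illustrative).
eye_in_head, head_on_body = -10.0, 5.0
retinal_azimuth = 20.0 - head_on_body - eye_in_head   # 25 deg on the retina
auditory_azimuth = 20.0 - head_on_body                # 15 deg re: the head

# Only after the coordinate transforms do the two signals line up in space.
assert visual_to_body_centred(retinal_azimuth, eye_in_head, head_on_body) == \
       auditory_to_body_centred(auditory_azimuth, head_on_body) == 20.0
```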

Multisensory object representations

While one finds some researchers talking about ‘multisensory objects’ (e.g. Busse, Roberts, Crist, Weissman, & Woldorff, 2005; Turatto, Mazza, & Umiltà, 2005), a closer inspection of the literature soon reveals that the definition of what exactly constitutes a multisensory object is pretty ‘thin’ (see Spence & Bayne, 2015, on this theme). For example, Turatto et al. appear to assume that if an auditory and a visual stimulus are presented from the same location then whoever perceives that combination of cues will de facto experience a multisensory object. Much the same can be said in the case of Busse et al.’s study. The necessary and sufficient conditions for positing that an object representation has been formed are, it should be noted, not well defined even within the visual modality (see Feldman, 2003; Scholl, 2001, 2007). It is hence perhaps no wonder that the problem becomes all the more challenging as soon as one considers multisensory object representations (see also Spence & Bayne, 2015). According to Bizley, Maddox and Lee (2016, p. 74), an audiovisual object can be defined as “a perceptual construct which occur when a constellation of stimulus features are bound within the brain”. At the same time, however, while Spence and Bayne acknowledge the widespread evidence for cross-modal interactions (see, e.g., Frings & Spence, 2010; Mast, Frings, & Spence, 2014), they question whether any of the evidence that has been published to date convincingly demonstrates the occurrence of multisensory awareness, which some might take to be a necessary condition for the very existence of multisensory object representations.

Temporal constraints on the integration of signals from different sensory modalities

Separate from the problem of spatial alignment is the challenge of integrating different features that may be processed (i.e. in some sense ‘available’) at different points in time after stimulus onset. A resolution to this problem in the visual modality has been to assume that the integration of different visual features is achieved as a consequence of the feedforward progression of neuronal responses as they advance through visual areas with progressively more complex receptive fields (Bodelón, Fallah, & Reynolds, 2007).

The difficulty with ensuring the integration of the appropriate features becomes potentially more pronounced as soon as one steps outside of the visual modality, since the points in time after stimulus onset at which specific features become available for integration vary even more than in the unisensory case (see Grossberg & Grunewald, 1997; Spence & Squire, 2003). For instance, in the case of audiovisual feature integration, Fiebelkorn, Foxe and Molholm (2010, 2012) have documented the different points in time at which the processing of stimuli is seen in different neural structures. If features from different senses are to be bound appropriately, then some means of resynchronizing desynchronized signals may be needed (see Grossberg & Grunewald, 1997, for one potential computational solution to this problem in the visual modality). Multisensory integration has nevertheless been shown to take place for slightly desynchronized signals, provided that they fall within what has been termed the window of multisensory integration (the so-called ‘temporal binding window’; e.g. see Colonius & Diederich, 2004; Soto-Faraco & Alsius, 2007, 2009; Wallace & Stevenson, 2014).
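A minimal sketch of the binding-window idea might look as follows (the window limits used here are arbitrary placeholders rather than estimates from any particular study, and empirically the window varies with task and stimulus type):

```python
def within_binding_window(visual_onset_ms, auditory_onset_ms,
                          window_ms=(-150.0, 250.0)):
    """Decide whether two unisensory signals are candidates for integration.

    Illustrative only: the window is written as (auditory-leading limit,
    visual-leading limit) because it is typically reported to be asymmetric,
    but the specific numbers here are placeholders, not empirical estimates.
    """
    soa = auditory_onset_ms - visual_onset_ms  # positive = sound lags vision
    return window_ms[0] <= soa <= window_ms[1]

print(within_binding_window(0.0, 80.0))    # True: small lag, integration plausible
print(within_binding_window(0.0, 400.0))   # False: sound lags too far behind
```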

Illusory conjunctions

One of the key lines of evidence in support of Treisman’s FIT was the existence of ICs under those conditions where, for whatever reason, attention was limited. While evidence in support of the existence of ICs in vision was obtained early on (e.g. Treisman & Schmidt, 1982), one of the enduring controversies in the visual search literature has been whether ICs are genuinely perceptual in nature or whether instead they reflect nothing more than a memory error (see Virzi & Egeth, 1984). Given the theoretical importance of ICs to FIT in vision, one might legitimately ask about the existence of ICs in the other senses, not to mention between them. Researchers have documented the existence of ICs, for example between pitch and timbre in audition (Hall & Wieberg, 2003; see also Thompson, 1994; Thompson, Hall, & Pressing, 2001). That said, there would appear to have been little attempt to apply Treisman’s FIT to the integration of tactile features. Indeed, we are not aware of any reports documenting ICs within the tactile modality.

But what about ICs in the case of multisensory feature integration? Cinel, Humphreys and Poli (2002) conducted one of the most thorough (not to mention perhaps the only) studies of cross-modal ICs involving visual and tactile stimuli. These researchers presented visual and tactile textured shapes to their participants. The latter were required to report the texture of the visual stimuli. Intriguingly, however, in those conditions in which the tactile textures did not match up with the visual textures, tactile-visual conjunction errors were sometimes observed, with participants erroneously reporting the tactile texture as the visual one. As one might have expected, given the tenets of FIT, such ICs were found to be even more common under those conditions where the participant’s attention was constrained.

Integration outside of the focus of spatial attention

One of the key claims associated with FIT, as originally proposed by Treisman, is that visual features are essentially only integrated within the focus of (spatial) attention. However, in the cross-modal case, there is now plenty of evidence to suggest that multisensory integration sometimes occurs outside of the focus of (spatial) attention as well. So, for instance, in a series of studies reported by Santangelo and Spence (2007; see also Ho, Santangelo, & Spence, 2009; Santangelo, Ho, & Spence, 2008), audiovisual and audiotactile combinations of spatially co-located peripheral cues were shown to capture participants’ spatial attention regardless of the perceptual load of a central attention-demanding rapid serial visual presentation (RSVP) task. Furthermore, multisensory cues captured attention in a way that unisensory auditory, visual, or tactile cues simply failed to do (thus suggesting that multisensory integration had taken place in the absence of, or prior to, spatial attention being allocated to the cued location). Here, though, it should be borne in mind that one might want to separate out the attention-capturing capacity of a certain combination of multisensory cues (when presented from the same location, or direction, at more or less the same time) from the integration of those cues into a coherent whole (that is, a multisensory object of awareness; see Spence & Bayne, 2015, on this theme).

There is, though, a debate here, with Treisman, Sykes and Gelade (1977) originally suggesting that features enter perception first, with objects only being identified later as a result of focused attention. This view can be contrasted with Wolfe and Cave’s (1999, p. 15) suggestion that “(visual) features of an object are bundled together preattentively but that explicit knowledge of the relationship of one feature to another requires spatial attention”.

The Pip-and-Pop effect

One paradigm where this distinction between cross-modal influences and multisensory integration is brought out most clearly is the so-called ‘Pip-and-Pop’ effect (e.g. Van der Burg, Olivers, Bronkhorst, & Theeuwes, 2008; see also Klapetek, Ngo, & Spence, 2012). Van der Burg and his colleagues have conducted numerous studies over the last decade or so showing that a uniquely oriented target line segment (either horizontal or vertical) placed in amongst an array of diagonally oriented distractor line segments (i.e. in a complex visual search task) could be made to pop out (or at least that the search slopes could be made significantly less steep) simply by presenting a spatially non-predictive auditory tone in synchrony with the sudden change in colour of the visual target (note that both targets and distractors alternated randomly back and forth between red and green in this experimental paradigm). However, while it may well be tempting to suggest that such results are consistent with the claim that the auditory stimulus and the visual target are integrated in order to create some sort of multisensory object representation, it should be noted that many such cross-modal effects can be accounted for equally well in terms of the cross-modal focusing of temporal attention instead (see Spence & Ngo, 2012, for a review). According to the latter account, there is really no need to suggest that any kind of multisensory integration has taken place. What this example therefore helps to illustrate is that the mere observation of cross-modal effects in a given experimental paradigm does not necessarily guarantee that any multisensory integration of stimulus features has taken place.

Multisensory integration outside of the focus of attention

For a number of years now, it has been argued that spatio-temporal co-occurrence is key to multisensory integration (see Mast, Frings, & Spence, 2015; Spence, 2007; Stein & Meredith, 1990, 1993; Stein & Stanford, 2008). That said, as highlighted by Spence (2013), closer inspection of the literature soon reveals that spatial co-occurrence mostly only appears to be necessary in those situations where space is somehow made relevant (either explicitly or implicitly) to a participant’s task. By contrast, temporal co-occurrence really does seem to be a prerequisite for any kind of audiovisual (or rather multisensory) integration (e.g. Koelewijn, Bronkhorst, & Theeuwes, 2010; Van der Burg et al., 2008). Note here only how enhanced multisensory integration is often seen with synchronized sensory signals (e.g. Harrar, Spence, & Harris, 2017). However, it is important to stress that multisensory integration is still often observed for slightly desynchronized signals, provided, that is, that both signals fall within the temporal binding window mentioned earlier (e.g. see Colonius & Diederich, 2004; Soto-Faraco & Alsius, 2007, 2009; Wallace & Stevenson, 2014).

The width of this temporal binding window, however, changes as a function of the demands of the participant’s task, not to mention the types of stimuli used (e.g. Spence & Squire, 2003; Vatakis, Maragos, Rodomagoulakis, & Spence, 2012; see Vatakis & Spence, 2010, for a review), as well as various factors related to individual differences (e.g. see Stevenson, Siemann, Schneider, Eberly, Woynaroski, Camarata, & Wallace, 2014). Finally, it should be noted that other factors, such as the correlation between the unisensory signals (Parise, Spence, & Ernst, 2012), cross-modal perceptual grouping (see Spence, 2015, for a review), and higher-order cognitive factors, such as the ‘unity assumption’ (see Chen & Spence, 2017b, for a review), have also been shown to contribute to the multisensory integration of audiovisual stimuli.

However, at the same time as the roles of these various factors in multisensory integration are being revealed, the putative role of attention in multisensory integration remains much more ambiguous. While several published studies have demonstrated that attention is needed for successful multisensory integration, a number of other researchers have published findings suggesting that integration is automatic and seemingly independent of attention (e.g. Bertelson, Vroomen, De Gelder, & Driver, 2000; Caclin, Soto-Faraco, Kingstone, & Spence, 2002; Helbig & Ernst, 2008; Santangelo et al., 2008; Santangelo & Spence, 2007; Van der Burg et al., 2008; Vroomen, Bertelson, & De Gelder, 2001). Those working in the field of multisensory perception research have attempted to address the controversy concerning the relationship between attention and multisensory integration by introducing conceptual frameworks that define moderating factors, such as stimulus complexity, stimulus competition (e.g. Talsma, Senkowski, Soto-Faraco, & Woldorff, 2010), and perceptual load (see Navarra, Alsius, Soto-Faraco, & Spence, 2010, for a review). Furthermore, according to Chen and Spence (2017a), hemispheric asymmetries may also help to tease apart the effects of attention from those associated with integration. (Note that hemispheric asymmetry is only expected to influence perception/behaviour when dealing with attentional phenomena.) According to Talsma and his colleagues, for instance, multisensory integration is modulated by top-down attention in those situations in which the competition between stimuli is high and/or where the stimuli themselves are complex. Under such conditions, the participant’s intentions and goals may well help to determine what is integrated, and attended stimuli will likely be integrated first. On the other hand, when the stimuli are simple, and the competition between them is low, multisensory integration is thought to precede attentional selection and to operate in more of a bottom-up manner instead. In this case, pre-attentive integration may help drive attention to the source of the stimuli (see Spence & Driver, 2000).

Perceptual load also moderates the relationship between multisensory integration and attention (see Navarra et al., 2010, for a review). According to perceptual load theory (e.g. Lavie, 1995, 2005, 2010), perceptual processing resources are allocated to all available stimuli until an individual’s capacity limit is reached. Hence, the suggestion is that under conditions of low load, all stimuli are automatically processed (and hence integrated) because the limit has yet to be reached. With increasing load, however, task-relevant stimuli are processed and integrated first, and the integration of task-irrelevant stimuli then starts to depend on whatever processing resources remain.

Alsius and her colleagues demonstrated a modulation of the McGurk effect (McGurk & MacDonald, 1976) as a function of the perceptual load of a concurrent visual or tactile task (Alsius, Navarra, Campbell, & Soto-Faraco, 2005; Alsius, Navarra, & Soto-Faraco, 2007; see also Alsius, Möttönen, Sams, Soto-Faraco, & Tiippana, 2014). Note here also that a similar modulation of the audiovisual ventriloquism effect by perceptual load was reported by Eramudugolla, Kamke, Soto-Faraco and Mattingley (2011). While such results fall short of demonstrating that attention is necessary for multisensory integration, they nevertheless do show that attention may modulate it. However, problems with this intuitive perceptual load-based account include the fact that it is difficult to measure objectively either capacity or the perceptual load of a given task, as well as continuing uncertainty over whether or not resources are modality-specific (see Otten, Alain, & Picton, 2000; Rees, Frith, & Lavie, 2001).

Crucially, however, the studies that have been presented so far have all focused on the integration of target features (i.e. stimulus features that are somehow task-relevant) and the variation in the amount of attention that is devoted to them. The features that have been integrated have always more or less been in the focus of attention because the participants have always been tasked with responding to them. At the same time, however, evidence concerning the processing of multisensory distractors – that is, stimuli that are irrelevant to (or may even interfere with) the task at hand – has, until very recently at least, been scarce. One major advantage associated with investigating the multisensory integration of distractor stimuli is that it can be argued that the latter are genuinely processed outside of the focus of attention (setting aside, of course, the possibility that they may be actively inhibited; Spence et al., 2001).

In two recent studies, we investigated whether multisensory distractor features are integrated (that is, whether or not the features are processed independently of one another) or whether instead they are only processed at a unisensory level (see Jensen, Merz, Spence, & Frings, 2019). Specifically, multisensory variants of the flanker task were developed using either audiovisual (Jensen et al., 2019) or visuotactile (Merz et al., 2019) stimuli as both the targets and the distractors. Multisensory target stimuli were created by mapping specific combinations of a visual and an auditory (or tactile) feature onto a particular response. In order to respond correctly, the participants in our studies had to process both target features together (i.e. a particular tone together with a particular light colour was assigned to a specific response). Importantly, however, while responding to the multisensory target, the participants had to ignore a multisensory distractor that also comprised visual and auditory (tactile) features. Both of the distractor features could be congruent or incongruent with the target. Crucially, overt spatial attention was manipulated by varying whether the participants fixated on the location from which the distractors were presented (with the targets presented to one side), or vice versa (see Fig. 1).

Fig. 1 Summary of the experimental set-up and results from Jensen et al.’s (2019) and Merz et al.’s (2019) studies. Bird’s-eye view of the experimental set-up and key results of both studies, highlighting the dependence or independence of distractor congruency in the two sensory modalities (the interaction term of the visual and auditory [tactile] congruency RT effects in milliseconds; error bars depict the standard error of the mean; * p < .01)

The results of both studies (Jensen et al., 2019; Merz et al., 2019) revealed congruency effects for each distractor feature separately (that is, reaction times were faster, and error rates lower, when a distractor feature matched the target). Intriguingly, the two modalities only interacted when the multisensory distractor was presented at fixation – i.e. the processing of one modality was not independent of the other. So, for instance, the effect of a congruent visual distractor feature was more pronounced if the auditory (tactile) feature also happened to be congruent (at a statistical level, this is reflected in a significant interaction between the two congruency effects). By contrast, when the participants’ gaze did not fall on the distractor stimuli, each distractor modality produced congruency effects that were independent of the other modality. These results, observed both for audiovisual and for visuotactile distractors, can be taken to suggest that overt spatial attention is needed to integrate multisensory distractor features in this demanding selection situation. Without it, the distractor features are likely to be processed independently of each other. Taken together, then, these results concerning the multisensory integration of distractor stimuli fit nicely into the frameworks discussed above that define moderating factors such as stimulus complexity or the competition between stimuli (Talsma et al., 2010) and perceptual load (Koelewijn et al., 2010; Navarra et al., 2010) as key factors influencing the possible impact of attention on multisensory feature integration. That said, in a way, they also still fit with Treisman’s original FIT, as attention here (overt spatial attention, that is) might still be considered the glue that is needed to bind multisensory distractors into multisensory object representations.
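For readers who prefer a concrete illustration, the dependency claim amounts to a non-zero interaction between the two congruency effects, sketched below with invented condition means (these reaction times are illustrative placeholders, not the values reported by Jensen et al. or Merz et al.):

```python
# Mean RTs (ms) for the four distractor conditions, coded as
# (visual congruency, auditory congruency); values are purely illustrative.
rt = {("cong", "cong"): 560.0, ("cong", "incong"): 585.0,
      ("incong", "cong"): 600.0, ("incong", "incong"): 605.0}

def congruency_effect(means, modality):
    """Incongruent minus congruent RT, averaged over the other modality."""
    if modality == "visual":
        incong = (means[("incong", "cong")] + means[("incong", "incong")]) / 2
        cong = (means[("cong", "cong")] + means[("cong", "incong")]) / 2
    else:
        incong = (means[("cong", "incong")] + means[("incong", "incong")]) / 2
        cong = (means[("cong", "cong")] + means[("incong", "cong")]) / 2
    return incong - cong

# Interaction term: the visual congruency effect when the auditory distractor
# is congruent minus that same effect when it is incongruent. A non-zero value
# indicates that the two distractor modalities are not processed independently.
visual_effect_given_aud_cong = rt[("incong", "cong")] - rt[("cong", "cong")]
visual_effect_given_aud_incong = rt[("incong", "incong")] - rt[("cong", "incong")]
interaction = visual_effect_given_aud_cong - visual_effect_given_aud_incong

print(congruency_effect(rt, "visual"), congruency_effect(rt, "auditory"))  # 30.0 15.0
print(interaction)  # 20.0: larger visual congruency effect when audition is congruent
```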

Feature integration in action control

Before closing this review, it is worth highlighting the fact that in the decades since Treisman first developed her FIT, the general approach has been extended to various other aspects of cognition (i.e. beyond the purely perceptual). For instance, in the field of action control, it is nowadays assumed that stimulus and response features are somehow integrated (e.g. Frings, Koch, Rothermund, Dignath, Giesen, Hommel, et al., in press; Henson, Eckstein, Waszak, Frings, & Horner, 2014; Hommel, 1998) into stimulus-response (S-R) episodes (that can be retrieved later on, and hence may modulate a participant’s behaviour). Interestingly, the role of attention here is even more controversial than in the perceptual literature (e.g. Henson et al., 2014; Hommel, 2004; Moeller & Frings, 2014; Singh, Moeller, & Frings, 2018). Sometimes, it is assumed that attention is needed for integration, sometimes it is not. Part of the problem here is that most current approaches to action control use sequential priming paradigms. This can make it difficult to disentangle integration from retrieval processes. As such, the possible modulation by attention can be hard to pinpoint (see Frings et al., in press; Laub, Moeller, & Frings, 2018, for this argument; see also Töllner, Gramann, Müller, Kiss, & Eimer, 2008, and Zehetleitner, Rangelov, & Müller, 2012, for a related discussion in the literature on visual search).

For present purposes, however, the potential role of attention in S-R feature integration can be neglected. Instead, one can consider this kind of feature integration as giving rise to multimodal/multisensory feature compounds. Specifically, the perceptual features are integrated with the motor features and the anticipated sensory effects that they will produce (Harless, 1861; James, 1890; Lotze, 1852; see Hommel, 2009, for a more recent approach, and Stock & Stock, 2004, for an overview). In particular, if a participant is instructed to make a keypress (in a standard cognitive experimental paradigm, such as the Stroop task, say) in response to a specific feature, here colour, it is assumed that perceiving the colour will activate not only the perceptual features but also the requisite motor features, and that these features will then be integrated into an S-R episode. On the next occurrence of the particular stimulus, the previous S-R episode will then be retrieved, including the sensory effects that this episode produced (here, for instance, the tactile sensation of pressing the key), thereby directly facilitating the currently demanded behaviour. Thus, while this kind of feature integration is typically discussed in the context of action control, it can also be seen as an example of multisensory feature integration as – in the example described above – visual stimulus features are integrated with motor features and the tactile features produced by pressing the response key. Furthermore, it is worth noting that in many papers on action control that use feature integration and retrieval as basic mechanisms of behaviour, FIT is mentioned, or even discussed, as a relevant precursor (e.g. Frings & Rothermund, 2011, 2017; Henson et al., 2014; Hommel, 2004).
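The binding-and-retrieval logic just described can be caricatured in a short sketch (the feature labels, the single-episode store, and the partial-match retrieval rule are illustrative simplifications, not a formal statement of any specific action-control model):

```python
# A toy stimulus-response (S-R) episode store: stimulus features are bound
# together with the executed response and its anticipated sensory effects.
episode_memory = {}

def encode_episode(stimulus_features, response, effect_features):
    # Bind stimulus, response, and effect features into one retrievable episode.
    episode_memory[frozenset(stimulus_features)] = (response, effect_features)

def retrieve_episode(stimulus_features):
    # Re-encountering (part of) the stimulus retrieves the whole episode,
    # including the response and its predicted sensory consequences.
    for bound_features, (response, effects) in episode_memory.items():
        if bound_features & set(stimulus_features):
            return response, effects
    return None

encode_episode({"colour:red", "shape:square"}, "left keypress",
               {"tactile:key resistance", "auditory:click"})
print(retrieve_episode({"colour:red"}))
# Partial feature overlap retrieves the full episode, e.g.:
# ('left keypress', {'tactile:key resistance', 'auditory:click'})
```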

Coming back to vision and visual feature integration

Let us finally reconsider the integration of visual features, having now discussed auditory, tactile and multisensory feature integration. The central question of ‘What constitutes a feature?’ is a thorny one. It seems fair to say, as outlined above, that even when looking only at the visual modality, there is still no commonly agreed definition of feature-hood, given that purely neuronal or physiological definitions seem to be outdated (e.g. Bartels & Zeki, 1998); this becomes even clearer when looking at the other spatial senses (see, for instance, Woods et al.’s, 2001, treatment of location, or ear of entry, as a feature in their auditory sequential search study). Furthermore, while location or spatial alignment might be the glue used to integrate features in vision, location might just be a feature itself in other senses (e.g. audition) or even become irrelevant (e.g. in the case of olfactory feature binding). In the same vein, one might ask ‘What constitutes an object?’ Once again, there is quite some debate about how to define object-hood; yet, when set against the above discussion, it becomes abundantly clear that unisensory object-hood, at least in the spatial senses, such as vision, can be assumed to be spatially guided – that is, features belonging to the same object originate from the same location. This argument probably does not hold for multisensory objects. Thus, what we can say, looking at vision from a multisensory perspective, is that if one tries to define feature-hood, the current literature clearly suggests a modality-specific definition of features, perhaps even a modality-specific way of integrating those features, and ultimately a modality-specific definition of objects.

Conclusions

Returning to Treisman and her monumental contribution to the field of cognitive psychology, it is undoubtedly the case that she was, at least in her early research, interested in (not to mention publishing papers on) cross-modal attention (e.g. Treisman & Davies, 1973). It does, therefore, seem a little strange that she never really came back to multisensory issues later on in her career, at least not in the context of FIT. Still, as has hopefully been made clear here, expanding the idea of feature integration beyond the visual modality, and ultimately into the world of multisensory processing, is no easy venture. Problems soon emerge when considering how best to define features, never mind the questions raised by those who doubt whether it really makes sense to talk of auditory or olfactory object representations at all. Particularly when processing demands become more complex, as in the case of multisensory integration, or feature integration in action control, the potential role of attention as the glue needed to integrate features becomes increasingly questionable. Nevertheless, no matter what position one chooses to adopt concerning the relationship between attention and integration, it is fair to say that Treisman’s FIT laid the foundations for the modern approach and still, in many contexts, influences current research. Certainly, we have often found ourselves framing our combined research agenda on the theme of multisensory selection in terms of the theoretical framework outlined initially by Treisman some four decades ago.