Neurocomputing, Volume 73, Issues 13–15, August 2010, Pages 2522–2531 (Elsevier)

Coarse scales are sufficient for efficient categorization of emotional facial expressions: Evidence from neural computation

https://doi.org/10.1016/j.neucom.2010.06.002

Abstract

The human perceptual system performs rapid processing within the early visual system: low spatial frequency (LSF) information is processed rapidly through the magnocellular layers, whereas the parvocellular layers process all spatial frequencies more slowly. The purpose of the present paper is to test the usefulness of LSF information, compared to high spatial frequency (HSF) and broad spatial frequency (BSF) visual stimuli, in a classification task of emotional facial expressions (EFE) by artificial neural networks. The connectionist modeling results show that the LSF information extracted in the frequency domain is sufficient for a distributed neural network to classify EFE correctly, even when all the spatial information relating to these images is discarded. These results suggest that the HSF signal, which is also present in BSF faces, acts as a source of noise for classification tasks in an artificial neural system.

Introduction

The purpose of the present article is to determine whether LSF components are sufficient for the efficient categorization of EFE. This hypothesis is based on various neuroimaging and cognitive science experiments showing that the human cognitive system may have a fast way of accessing the LSF components relevant to threat recognition in the visual environment [23], [14], [38]. For instance, a neuroimaging study (functional MRI; [38]) suggests the possibility of a preferential link between the magnocellular layers of the lateral geniculate nucleus (LGN) and the amygdala. This study [38] revealed a hemodynamic response along a subcortical pathway involving the superior colliculus, the pulvinar and the amygdala when participants processed LSF pictures of faces depicting a fearful expression, compared to LSF faces displaying a neutral expression. These results therefore suggest that the transmission of the signal associated with the facial expression of fear might bypass the primary visual cortex by taking a subcortical pathway [10], [14], possibly emanating from the magnocellular layers of the LGN, which transport low spatial frequency information very quickly. This type of fast access to visual information might be of particular value for increasing sensory exposure to potentially dangerous events [37]. LSF and HSF stimuli are processed by two different visual streams at the level of the LGN. Whereas the magnocellular neurons primarily provide rapid but coarse (low spatial frequency) cues, which encode configural features as well as brightness and motion, the parvocellular neurons provide slower but higher spatial frequency information (finer visual details) about local shape features, color and texture [19].
Conversely, this fMRI study [38] also showed that faces filtered at high spatial frequencies only slightly activated the amygdala, and that the signal activated different structures in the ventral pathway (the occipito-temporal cortex and the fusiform face area). This result is corroborated by ERP studies [29].

Thus, the underlying question we address in the current paper is whether the biological structure of the human cognitive system is adapted to the computational properties of the visual environment in providing rapid access to LSF information for EFE recognition. In other words, we assume that the phylogenetic development of the human neural structures dedicated to the categorization of EFE may be such as to provide faster access to the coarse-scale information conveyed by magnocellular neurons, because this information is more efficient for the categorization of EFE.

A number of different methods for modeling artificial systems capable of performing emotional facial expression recognition tasks have been reported in the literature [3]. Among these recognition systems, the use of artificial neural networks permits an efficient classification of EFE [16]. The first step in the use of connectionist networks consists of compressing the visual information. Thus, some authors have suggested using radial basis function (RBF) networks consisting of Gaussian receptive fields which compress the information relating to various parts of an image [7], [28]. However, different studies have shown that Gabor wavelets permit a better modeling of the receptive fields of the simple cells of the primary visual cortex [5]. This research has shown that the residual error between the response profiles of V1 simple cells and Gabor filters is statistically indistinguishable from chance [11], [12], [39], [40], therefore suggesting Gabor wavelets for face recognition tasks performed by artificial systems. Using this technique, series of Gabor wavelets are convolved with specific parts of the image in order to extract the information relating to different wavelengths and different orientations. Similarly, [16] successfully used Gabor wavelets in combination with a Support Vector Machine (SVM) for the static or dynamic recognition of EFE [2]. Finally, [15], [4] have shown that convolving Gabor wavelets sensitive to different wavelengths and orientations, within a sliding window applied to the entire image, permits reliable categorization comparable to human data when the wavelets are associated with an SVM, discriminant analysis [18] or an artificial neural network [4].

The computational model that we propose in this study is very similar to the model proposed by Dailey et al. [4], but with two differences. At the level of the perceptual encoding of the stimuli, the Gabor wavelets are implemented not in the spatial domain but in the frequency domain, by means of the modulus of the discrete Fourier transform (DFT). Each wavelet is therefore multiplied by the local energy spectrum of the Fourier transform. This local energy spectrum represents the quantity of energy associated with each spatial frequency and each orientation, independently of the spatial location of the wavelet. This is convenient because (i) it avoids a subsequent visual data compression step and (ii) it makes the representation of the image (i.e. the output computed by the Gabor filters) phase invariant (like V1 complex cells [6]). The second difference in our model concerns the artificial neural network, which simulates the association between the perceptual output of the stimuli and the category label encoding each EFE category. Dailey et al. [4] used a single-layer perceptron with a softmax activation function at the output layer in order to perform non-linear categorization of their images, whereas we used the standard back-propagation algorithm on a multi-layer perceptron [20], [31]. These differences may appear small, except that our method constitutes a novel approach to visual data compression that deserves to be discussed in comparison with the methods used previously [2], [4], [15], [16].
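
To make this perceptual encoding concrete, the following is a minimal numpy sketch of frequency-domain Gabor filtering: Gaussian-shaped filters placed at log-spaced frequencies and evenly spaced orientations are multiplied with the energy spectrum of the whole image's DFT, yielding one phase-invariant value per (frequency band, orientation) pair. The filter counts and bandwidths here are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

def frequency_domain_gabor_energies(image, n_freqs=5, n_orients=8):
    """Return one phase-invariant energy value per (frequency band, orientation).

    Gaussian-shaped filters in the Fourier plane are multiplied with the
    energy spectrum (squared DFT modulus) of the whole image, so spatial
    location plays no role. Filter counts and bandwidths are illustrative
    assumptions, not the paper's exact parameters.
    """
    n = image.shape[0]
    # Centered energy spectrum; discarding the phase discards spatial layout
    power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2

    # Polar coordinates (radius = spatial frequency, angle = orientation)
    fy, fx = np.meshgrid(np.arange(n) - n // 2, np.arange(n) - n // 2, indexing="ij")
    radius = np.hypot(fx, fy)
    angle = np.arctan2(fy, fx) % np.pi          # orientation is pi-periodic

    centers = np.geomspace(2, n // 4, n_freqs)  # cycles/image, log-spaced
    thetas = np.linspace(0, np.pi, n_orients, endpoint=False)

    energies = []
    for f0 in centers:
        for t0 in thetas:
            d_theta = np.minimum(np.abs(angle - t0), np.pi - np.abs(angle - t0))
            g = (np.exp(-((radius - f0) ** 2) / (2 * (0.5 * f0) ** 2))
                 * np.exp(-(d_theta ** 2) / (2 * 0.3 ** 2)))
            energies.append(np.sum(g * power))
    return np.array(energies)

vec = frequency_domain_gabor_energies(np.random.default_rng(0).random((64, 64)))
print(vec.shape)  # (40,): 5 frequency bands x 8 orientations
```

Because the phase is discarded, circularly shifting the image leaves the perceptual vector unchanged, which is the phase invariance appealed to above.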

The main difference between our perceptual model of vision and the previous models proposed in [2], [4], [15], [16] resides at the level of visual data compression. Performing Gabor filtering in the spatial domain results in a huge perceptual vector (for example, a vector of 40,600 components in [4]). Therefore, Gabor filtering in the spatial domain requires an additional step to reduce the size of the perceptual layer for subsequent neural network processing. For instance, some authors [4] have chosen to reduce the perceptual space by means of a “gestalt layer” produced by a principal component analysis (PCA), focusing subsequent neural computation on the first 50 eigenvectors. Using this technique, they obtained efficient results for EFE categorization. However, this technique raises an important methodological problem given the objectives of the present article. It has been shown elsewhere that the first eigenvectors correlate highly with LSF information [1]. In other words, virtually all the eigenvectors are necessary to retain HSF details, with the result that this method cannot both reduce the visual information and support a balanced investigation of the role of each SF channel.
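
For comparison, the “gestalt layer” compression described above can be sketched as a projection onto the first 50 principal components. The array sizes follow the vector size reported for [4]; the data themselves are random placeholders, not Gabor outputs.

```python
import numpy as np

# Random placeholders: 96 "face images", each encoded by 40,600 Gabor outputs
# (the vector size reported for [4]); real data would go here instead.
rng = np.random.default_rng(3)
X = rng.random((96, 40600)).astype(np.float32)

Xc = X - X.mean(axis=0)                   # center each feature
# n_samples << n_features, so the economy SVD is cheap; rows of Vt are the
# principal axes (eigenvectors of the feature covariance matrix)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
gestalt = Xc @ Vt[:50].T                  # 50-dimensional "gestalt" codes
print(gestalt.shape)  # (96, 50)
```

The methodological caveat raised above applies here: truncating to the leading components preferentially preserves LSF structure, so the SF content of the compressed code is not balanced across channels.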

Another way to reduce the size of the visual information for subsequent processing in an artificial neural network is to use a feature selection algorithm such as AdaBoost [16]. AdaBoost is an efficient technique for selecting the Gabor filters that are relevant to a subsequent associative task (for example, categorizing EFE). In other words, AdaBoost selects the different Gabor filters sensitive to the spatial frequencies or orientations that are necessary to categorize the different EFE. However, we did not use this algorithm, in order to test the second main hypothesis of this paper, namely that an efficient categorization of EFE can be obtained even after the spatial location of the Gabor filters has been completely removed. This hypothesis is based on biological evidence that is described below.

Gabor filtering can be performed by convolving a Gabor kernel with the image in the spatial domain (at a specific location or within a sliding window), which is formally equivalent to multiplying the corresponding Gabor filter with the image spectrum in the Fourier domain. When this multiplication is applied to the Fourier transform of the entire image, the method resembles a form of holistic vision: each Gabor filter provides an average energy value for the whole image. The originality of this method is that it does away with spatial location altogether.
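
The spatial/frequency equivalence invoked here is the convolution theorem: circular convolution in the spatial domain equals pointwise multiplication of the spectra. A minimal numpy check, using random data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.random((32, 32))
kernel = rng.random((32, 32))  # stands in for a Gabor kernel

# Frequency domain: pointwise multiplication of the two spectra
freq = np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(kernel)).real

# Spatial domain: explicit circular convolution (slow, for checking only)
spatial = np.zeros_like(img)
for dy in range(32):
    for dx in range(32):
        spatial += kernel[dy, dx] * np.roll(np.roll(img, dy, axis=0), dx, axis=1)

print(np.allclose(freq, spatial))  # True: the two routes agree
```

Summing the multiplied spectrum over the whole Fourier plane, rather than keeping the result per location, is what collapses each filter's response to a single holistic energy value.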

At a methodological level, it has the advantage of retaining exactly the same amount of information, in quantitative and qualitative terms, for each SF channel, thus making it possible to compare the different SF channels in the most balanced way possible. It should also be noted that all the main references in the same field as our current perceptual model [2], [4], [15], [16] also use the energy spectrum of the Gabor filters, and thus discard the phase information necessary to reconstruct the spatial layout of the original image. However, using the magnitude spectrum of Gabor filters within a sliding window is only one step towards eliminating spatial information: it removes the phase information needed for the spatial reconstruction of the image, but it retains the spatial location of the filters. In this paper, we will show that we can go a step further in removing spatial location for the purposes of the efficient categorization of EFE. At a theoretical level, we assume that taking an average energy value of the Gabor filters over the whole image might be sufficient for the efficient categorization of EFE. This assumption is based in part on biological data reported in a single-cell recording study, which showed that neurons in the medial temporal lobe (MTL) respond independently of spatial location [30]. Neurons become less and less sensitive to spatial location during the bottom-up process from the retina to the temporal lobes, resulting in a completely abstract representation at the end of this process (at the level of the MTL). In other words, in this paper we propose the provocative idea that spatial location might not be necessary for an efficient categorization of complex stimuli such as EFE.

The first simulation is based on a limited but widely used database of EFE: the Pictures of Facial Affect (POFA) database [8]. The aim of this pilot simulation is to determine whether the superiority of LSF signals for the categorization of EFE emerges even with a very small number of training exemplars. Simulation 2 was then designed to confirm the results obtained in Simulation 1 using a broader database, the Karolinska Directed Emotional Faces (KDEF) [17], as well as to extend this result to the full spectrum of spatial frequency channels. We used the KDEF and POFA not only because they are two commonly employed databases, but also because they consist of very carefully controlled pictures selected for important neuroimaging, behavioral and connectionist modeling papers [8], [17], [24], [38] in the field of EFE categorization. To summarize, the aim of this paper was to test the efficiency of different SF channels on stimuli that were carefully controlled in neuroimaging and behavioral experiments, on the basis of commonly used and standardized computational methods (Gabor filters at the perceptual level and connectionist networks at the associative level).

Section snippets

Neural network

To perform our simulations, we used a database of gray-scale images of facial expressions. The size of the images was N×N (with N=256 pixels). First, we applied a Hann window to avoid boundary effects in the subsequent Fourier transform. Boundary effects could result in a bias toward an over-representation of cardinal orientations, and the Hann window is frequently used to suppress this bias. The one-dimensional Hann window of size N (i=0, ..., N−1) applied to the images is w(i) = 0.5[1 − cos(2πi/(N−1))].
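
Assuming the standard symmetric Hann definition, this windowing step can be sketched as follows; the separable 2-D application to the image is our assumption about the implementation.

```python
import numpy as np

N = 256
i = np.arange(N)
# One-dimensional Hann window: w(i) = 0.5 * (1 - cos(2*pi*i / (N - 1)))
w1d = 0.5 * (1.0 - np.cos(2.0 * np.pi * i / (N - 1)))
assert np.allclose(w1d, np.hanning(N))   # matches numpy's built-in Hann window

# Separable 2-D window applied to the image before the Fourier transform;
# borders fade to zero, suppressing the wrap-around discontinuities that
# would otherwise over-represent cardinal orientations in the spectrum.
w2d = np.outer(w1d, w1d)
image = np.random.default_rng(2).random((N, N))   # stands in for a face image
windowed = image * w2d
```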

Neural network

The neural network, parameters and procedure were strictly identical to Simulation 1 except that we used 448 stimuli, 64 stimuli×7 emotions (from the KDEF database) for the training session (instead of 9 stimuli×7 emotions in the POFA database, Simulation 1) and 1 remaining item per emotion to test the generalization property of the neural network. Categorization rate was computed across 100 runs with a new test item being selected at random for each run.
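
The train/test procedure described above can be sketched as follows; the stimulus identifiers are placeholders, not actual KDEF file names, and the classifier itself is elided.

```python
import random

# Evaluation protocol sketch: on each of 100 runs, hold out one randomly
# chosen item per emotion for testing and train on the remaining 64.
EMOTIONS = ["afraid", "angry", "disgusted", "happy", "neutral", "sad", "surprised"]
stimuli = {emo: [f"{emo}_{k}" for k in range(65)] for emo in EMOTIONS}  # 65 items each

rng = random.Random(0)
for run in range(100):
    test_set = {emo: rng.choice(items) for emo, items in stimuli.items()}
    train_set = {emo: [s for s in items if s != test_set[emo]]
                 for emo, items in stimuli.items()}
    # ... train the MLP on train_set, then score it on test_set ...

print(len(train_set["happy"]))  # 64 training items per emotion
```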

Stimuli

The stimuli used came from the

Conclusion

The current computational results suggest that LSF information could be particularly effective for the recognition of specific facial expressions. Our connectionist simulations revealed a clear superiority of LSF over HSF information in both simulations. At a computational level, these results are consistent with [16]. These authors used AdaBoost, SVM and linear discriminant analysis to select Gabor filters, in the spatial domain, that yield a higher categorization rate for each EFE.

Acknowledgments

This work was supported by the French CNRS (UMR 6024) and a grant from the French National Research Agency (ANR Grant BLAN06–2_145908, ANR Grant BLAN08-1_353820).

Martial Mermillod was born in 1973. He received his Master’s degree at Grenoble Universités, France, in 2000 and Ph.D. from the University of Liège, Belgium, in 2004. In 2004, he was awarded a post-doctoral grant by the Fyssen Foundation at Grenoble Universités. Since 2005, he has been an Associate Professor at the Université Blaise Pascal, Clermont-Ferrand, France. His current research investigates neural networks, cognitive science and cognitive neuroscience.

References (40)

  • J. Cohn et al., Use of automated facial image analysis for measurement of emotion expression
  • M.N. Dailey et al., EMPATH: a neural network that categorizes facial expressions, Journal of Cognitive Neuroscience (2002)
  • J.G. Daugman, Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters, Journal of the Optical Society of America A (1985)
  • R.L. De Valois et al., Spatial Vision (1988)
  • S. Duvdevani-Bar et al., Visual recognition and categorization on the basis of similarities to multiple class prototypes, International Journal of Computer Vision (1999)
  • P. Ekman et al., Pictures of Facial Affect (1976)
  • R.M. French, M. Mermillod, P.C. Quinn, A. Chauvin, D. Mareschal, The importance of starting blurry: simulating improved...
  • B. de Gelder et al., Non-conscious recognition of affect in the absence of striate cortex, NeuroReport (1999)
  • J.P. Jones et al., The two-dimensional spatial structure of simple receptive fields in cat striate cortex, Journal of Neurophysiology (1987)
  • J.P. Jones et al., The two-dimensional spectral structure of simple receptive fields in cat striate cortex, Journal of Neurophysiology (1987)


    Patrick Bonin was born in 1966 in Le Creusot. He obtained both his Master’s degree and Ph.D. at Université de Bourgogne (Dijon, France) in 1989 and 1995, respectively. In 1996, he acquired the title of an Associate Professor. In 2003, he became Full Professor at the Université Blaise Pascal, Clermont-Ferrand, France. Since September 2009, he has been Full Professor at the Université de Bourgogne. His current research interest is in cognitive psychology and psycholinguistics.

    Laurie Mondillon was born in 1979. She received her Master’s degree and Ph.D. from Université Blaise Pascal (Clermont-Ferrand, France) in 2003 and 2006, respectively. Since 2007, she has been an Associate Professor at the Université de Savoie, Chambery, France. Her current research focuses on social cognitive science, cognitive neuroscience and embodied knowledge, especially in the field of emotions.

    David Alleysson gained his Masters degree and Ph.D. degree from Grenoble Universités, Grenoble, France, in 1994 and 1999, respectively. Between 2000 and 2003, he worked in a post-doctoral position at the Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland. Since 2003, he has been conducting research at Grenoble Universités. He currently specializes in computer science, computer vision and color perception.

    Nicolas Vermeulen was born in Brussels, Belgium. He obtained both his Master’s degree and Ph.D. from the Université catholique de Louvain (UCL), Belgium, in 2001 and 2005, respectively. Since 2005, he has received a grant from the Belgian Fund for Scientific Research (FRS-FNRS). In 2005, he spent one year working in the Emotion Laboratory headed by Dr. Paula M. Niedenthal at the Université Blaise Pascal, Clermont-Ferrand, France. He is currently working in the fields of cognitive science, attentional processes, embodied knowledge and emotions.
