
Neural Networks

Volume 21, Issue 9, November 2008, Pages 1238-1246

Recognizing emotions expressed by body pose: A biologically inspired neural model

https://doi.org/10.1016/j.neunet.2008.05.003

Abstract

Research into the visual perception of human emotion has traditionally focused on the facial expression of emotions. Recently, researchers have turned to the more challenging field of emotional body language, i.e. emotion expression through body pose and motion. In this work, we approach the recognition of basic emotional categories from a computational perspective. In keeping with recent computational models of the visual cortex, we construct a biologically plausible hierarchy of neural detectors, which can discriminate seven basic emotional states from static views of associated body poses. The model is evaluated against human test subjects on a recent set of stimuli created for research on emotional body language.

Introduction

The expression and perception of emotions have been studied extensively in psychology and neuroscience (Ekman, 1970, Ekman, 1993, Frijda, 1986, Tomkins, 1962). A complementary body of work comes from the field of computational neuroscience, where researchers have proposed biologically plausible neural architectures for facial emotion recognition (Dailey et al., 2002, Fragopanagos and Taylor, 2005, Padgett and Cottrell, 1996). One important result, on which many (but not all, e.g. Ortony and Turner (1990) and Russell (1994)) researchers agree nowadays, is that the perception of emotion is at least to a certain degree categorical (Ekman, 1970, Izard, 1992, Kotsoni et al., 2001, Tomkins, 1962), meaning that a perceived expression is assigned to one out of a small set of categories, usually termed the “basic” or “primary” emotions (although the precise number and type of basic emotions varies between theories). Categorical perception presupposes a sharp perceptive boundary between categories, rather than a gradual transition. At this boundary, the ability to discriminate between visually similar displays on different sides of the boundary is at its peak, so that stimuli can still be assigned to one of the categories. The most widespread definition of basic emotions since the seventies is due to Ekman, and comprises the six categories anger, disgust, fear, happiness, sadness, and surprise. These seem to be universal across different cultures (Ekman, 1970); in fact, a theoretical motivation for emotion categories goes back to the notion that the same facial muscles are used to display emotions in widely different cultures.

The categorical nature of emotion recognition was established empirically, through carefully designed studies with human observers (Calder et al., 1996, de Gelder et al., 1997, Ekman, 1992). However, there is also a computational argument for this capability: if a suitable set of categories can be found (suitable in the sense that they can be distinguished with the available data), then a categorical decision can be made more quickly and more reliably, because the problem is reduced to a forced choice between few possibilities, and because only those perceptual aspects that discriminate between the categories need to be considered. In learning-theoretical terminology, categories can be represented by a discriminative model, which aims for large classification margins, rather than a generative model, which allows a complete description of all their aspects.
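To make the forced-choice argument concrete, the following sketch (our illustration, not part of the original model) casts the categorical decision as an argmax over a small number of discriminative class scores; all variable names and parameter values are hypothetical.

```python
import numpy as np

# Hypothetical illustration (not the authors' implementation): a forced choice
# between K emotion categories reduces perception to an argmax over a small
# set of discriminative class scores w_k . x + b_k, rather than fitting a
# full generative description of every stimulus.

CATEGORIES = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def forced_choice(x, W, b):
    """Assign a feature vector x to one of the categories.

    W holds one weight row per category and b one bias per category; only the
    relative scores (classification margins) matter for the decision.
    """
    scores = W @ x + b                      # one discriminative score per category
    return CATEGORIES[int(np.argmax(scores))]

# Toy usage with random parameters, purely to show the decision rule.
rng = np.random.default_rng(0)
x = rng.normal(size=128)                    # stand-in for a pose feature vector
W = rng.normal(size=(len(CATEGORIES), 128))
b = np.zeros(len(CATEGORIES))
print(forced_choice(x, W, b))
```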

Over the last decades, most studies have concentrated on emotional signals in facial expressions. Recently, researchers have also turned to emotional body language, i.e. the expression of emotions through human body pose and/or body motion (de Gelder, 2006, Grezes et al., 2007, Meeren et al., 2005, Peelen and Downing, 2007). An implicit assumption common to the work on emotional body language is that body language is only a different means of expressing the same set of basic emotions as facial expressions.

The recognition of whole-body expressions is substantially harder, because the configuration of the human body has more degrees of freedom than the face alone, and its overall shape varies strongly during articulated motion. However, in computer vision and machine learning research, recent results on object recognition have shown that even for highly variable visual stimuli, quite reliable categorical decisions can be made from dense low-level visual cues (Dalal and Triggs, 2005, Serre et al., 2006).
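As an illustration of such dense low-level cues, the sketch below computes gradient-orientation histograms in the spirit of Dalal and Triggs (2005) using scikit-image; the input image and all parameter values are illustrative, not those of the cited work or of our model.

```python
# A minimal sketch of dense gradient-orientation features in the spirit of
# Dalal and Triggs (2005), using scikit-image; the image and all parameter
# values are illustrative placeholders.
import numpy as np
from skimage.feature import hog

image = np.random.rand(128, 64)     # stand-in for a grey-scale image of a person

descriptor = hog(
    image,
    orientations=9,                 # gradient-orientation bins
    pixels_per_cell=(8, 8),         # local histogram cells
    cells_per_block=(2, 2),         # block normalisation
    feature_vector=True,
)
print(descriptor.shape)             # one dense, fixed-length descriptor per image
```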

In this work, we try to gain new insight into possible mechanisms of emotion recognition from body pose, by constructing a biologically plausible computational model for their categorical perception (plausible in terms of the high-level hierarchy, not in terms of low-level functionality such as information encoding). We stress that at present the neurophysiological data about the visual cortex is not complete enough for us to fully understand and replicate the underlying processes. Any computational model can therefore only strive not to contradict the available data, but remains in part speculative. Still, we believe that such an approach can be beneficial, both for machine vision, which is still far from reaching the capabilities of animal vision, and for neuroscience, where computational considerations can contribute new insights.1

We restrict ourselves to the analysis of body poses (form), as opposed to the dynamics of body language (optic flow). This corresponds to modeling only the perception and recognition processes typically taking place in the ventral stream (Felleman & van Essen, 1991): we focus on the question of what the categorization of single snapshots can contribute to the extraction of emotions from body pose, without including any motion information. Recent studies suggest that there are also recognition processes based on connections to areas outside the ventral stream (STS, pre-motor areas), which presumably explain sensitivity to implied motion (de Gelder, Snyder, Greve, Gerard, & Hadjikhani, 2004) (and also to action properties of objects (Mahon et al., 2007)). For the moment, we exclude these connections, as the corresponding computational mechanisms for extracting and encoding implied motion are not clear.

Using a set of emotional body language stimuli, which was originally prepared for neuroscientific studies, we show that human observers, as expected, perform very well on this task, and we construct a model of the underlying processing stream. The model is then tested on the same stimulus set. By focusing on form, we do not claim that motion processing is unimportant. The importance of motion and implied motion for the perception of human bodies is corroborated by several neurophysiological studies (Barraclough et al., 2006, Bruce et al., 1981, Jellema and Perrett, 2006, Oram and Perrett, 1994), and we have taken care to keep our computational approach compatible with models that include the dorsal stream. In particular, our model can be directly extended by adding a motion analysis channel as proposed by Giese and Poggio in their model of action perception (Giese & Poggio, 2003).

Section snippets

Stimulus set

The data we use for our study was originally created at Tilburg University for the purpose of studying human reactions to emotional body language with brain imaging methods.

The data consists of photographic still images of 50 actors (34 females, 16 males) enacting different emotions. All images are taken in a frontal position with the figure facing the camera, on a controlled white background. The stimulus set follows the list of six basic emotions originally enumerated by Ekman (1970): per

Neural model

Our model of the visual pathway for recognition is inspired by those of Riesenhuber and Poggio (1999) and Serre et al. (2006). It consists of a hierarchy of neural feature detectors, which have been engineered to fulfill the computational requirements of recognition, while being consistent with the available electrophysiological data. A schematic of the complete model is depicted in Fig. 2. As an important limitation, the model is purely feed-forward. No information is fed back from
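To illustrate this kind of hierarchy, the sketch below alternates template-matching ("S") layers and max-pooling ("C") layers in the spirit of Riesenhuber and Poggio (1999) and Serre et al. (2006); the filter bank, pooling sizes and prototypes are arbitrary placeholders, not the detectors used in our model.

```python
# Illustrative sketch of an HMAX-style feed-forward hierarchy: simple ("S")
# layers do template matching, complex ("C") layers pool with a max to gain
# invariance. All sizes and filters here are placeholders.
import numpy as np
from scipy.signal import convolve2d
from scipy.ndimage import maximum_filter

def s1_layer(image, filter_bank):
    """S1: convolve the image with a bank of orientation-selective filters."""
    return [np.abs(convolve2d(image, f, mode="same")) for f in filter_bank]

def c1_layer(s1_maps, pool=8):
    """C1: local max pooling and subsampling for each orientation channel."""
    return [maximum_filter(m, size=pool)[::pool, ::pool] for m in s1_maps]

def s2_c2(c1_map, prototypes, patch=4):
    """S2/C2: compare each local C1 patch with stored prototypes and keep the
    maximum response per prototype, yielding a position-invariant feature
    vector that a final classification stage can map to emotion categories."""
    h, w = c1_map.shape
    responses = []
    for p in prototypes:                                  # p: (patch, patch)
        best = -np.inf
        for i in range(h - patch + 1):
            for j in range(w - patch + 1):
                d = np.linalg.norm(c1_map[i:i + patch, j:j + patch] - p)
                best = max(best, np.exp(-d ** 2))         # Gaussian tuning
        responses.append(best)
    return np.array(responses)

# Toy usage with random filters and prototypes, purely to show the data flow.
rng = np.random.default_rng(1)
image = rng.random((64, 64))
filter_bank = [rng.standard_normal((7, 7)) for _ in range(4)]
prototypes = [rng.random((4, 4)) for _ in range(10)]
c1 = c1_layer(s1_layer(image, filter_bank))
c2 = np.concatenate([s2_c2(m, prototypes) for m in c1])
print(c2.shape)  # one C2 response per (channel, prototype) pair
```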

Experiments

The model has been tested on the stimulus set described in Section 2. All stimuli were used in their original orientation as well as mirrored along the vertical axis, to account for the symmetry of human body poses with respect to the sagittal plane. This gives a total of 696 images (for 2 out of the 50 actors the image for sad is missing). As explained earlier, we implicitly assume that attention has been directed to the person, because of the controlled imaging conditions (clean background,
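A minimal sketch of the mirroring step described above, assuming the stimuli are available as lists of image arrays (all names are illustrative):

```python
# Minimal sketch of the mirroring described above (our illustration): each
# stimulus is used in its original orientation and flipped about the vertical
# axis, exploiting the approximate left-right symmetry of body poses.
import numpy as np

def augment_with_mirrors(images, labels):
    """Return the original images plus horizontally mirrored copies,
    with the category labels duplicated accordingly."""
    mirrored = [np.fliplr(img) for img in images]
    return images + mirrored, labels + labels

# 50 actors x 7 categories, minus the 2 missing 'sad' images, then doubled:
# 2 * (50 * 7 - 2) = 696 stimuli in total.
```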

Discussion

We have presented a biologically inspired neural model for the form perception of emotional body language. When presented with an image showing an expression of emotional body language, the model is able to assign it to one out of seven emotional categories (the six basic emotions plus neutral). The model has been tested on the Tilburg University stimulus set, the only complete dataset of emotional body poses of which we are aware. It achieved a recognition rate of 82%, compared to 87% for human
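For completeness, a generic sketch of how such an overall recognition rate can be summarised from per-stimulus predictions, assuming lists of true and predicted category labels; this is not the exact evaluation protocol of the paper.

```python
# Generic sketch of summarising a recognition rate from predictions; the label
# lists and helper names are illustrative, not the paper's evaluation code.
import numpy as np

CATEGORIES = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def confusion_matrix(true_labels, predicted_labels):
    """Count how often each true category is assigned to each predicted one."""
    index = {c: i for i, c in enumerate(CATEGORIES)}
    M = np.zeros((len(CATEGORIES), len(CATEGORIES)), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        M[index[t], index[p]] += 1
    return M

def recognition_rate(true_labels, predicted_labels):
    """Fraction of stimuli assigned to their correct emotional category."""
    M = confusion_matrix(true_labels, predicted_labels)
    return np.trace(M) / M.sum()
```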

Acknowledgment

This project was funded in part by EU project COBOL (NEST-043403).

References (65)

  • L. Pessoa et al. Attentional control of the processing of neutral and emotional stimuli. Cognitive Brain Research (2002)
  • J.G. Taylor et al. The interaction of attention and emotion. Neural Networks (2005)
  • P. Vuilleumier et al. Effects of attention and emotion on face processing in the human brain: An event-related fMRI study. Neuron (2001)
  • D. Anguita et al. Improved neural network for SVM learning. IEEE Transactions on Neural Networks (2002)
  • J.A. Beintema et al. Perception of biological motion without local image motion. Proceedings of the National Academy of Sciences of the United States of America (2002)
  • H. Bourlard et al. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics (1988)
  • C. Bruce et al. Visual properties of neurons in a polysensory area in superior temporal sulcus of the macaque. Journal of Neurophysiology (1981)
  • A.J. Calder et al. Categorical perception of morphed facial expressions. Visual Cognition (1996)
  • C. Cortes et al. Support-vector networks. Machine Learning (1995)
  • J.G.F. Coutinho et al. Designing a posture analysis system with hardware implementation. Journal of VLSI Signal Processing (2006)
  • M.N. Dailey et al. EMPATH: A neural network that categorizes facial expressions. Journal of Cognitive Neuroscience (2002)
  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proc. 10th international...
  • B. de Gelder. Towards the neurobiology of emotional body language. Nature Reviews Neuroscience (2006)
  • B. de Gelder et al. Fear fosters flight: A mechanism for fear contagion when perceiving emotion expressed by a whole body. Proceedings of the National Academy of Sciences of the United States of America (2004)
  • B. de Gelder et al. Categorical perception of facial expressions: Categories and their internal structure. Cognition and Emotion (1997)
  • P. Downing et al. A cortical area selective for visual processing of the human body. Science (2001)
  • M. Eimer et al. An ERP study on the time course of emotional face processing. Neuroreport (2002)
  • P. Ekman. Universal facial expressions of emotion. California Mental Health Research Digest (1970)
  • P. Ekman. An argument for basic emotions. Cognition and Emotion (1992)
  • P. Ekman. Facial expression and emotion. American Psychologist (1993)
  • D.J. Felleman et al. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex (1991)
  • D.J. Field. Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A (1987)