
Neural Networks

Volume 21, Issue 9, November 2008, Pages 1238-1246

Recognizing emotions expressed by body pose: A biologically inspired neural model

https://doi.org/10.1016/j.neunet.2008.05.003

Abstract

Research into the visual perception of human emotion has traditionally focused on the facial expression of emotions. Recently, researchers have turned to the more challenging field of emotional body language, i.e. emotion expression through body pose and motion. In this work, we approach the recognition of basic emotional categories from a computational perspective. In keeping with recent computational models of the visual cortex, we construct a biologically plausible hierarchy of neural detectors, which can discriminate seven basic emotional states from static views of associated body poses. The model is evaluated against human test subjects on a recent set of stimuli created for research on emotional body language.

Introduction

The expression and perception of emotions have been studied extensively in psychology and neuroscience (Ekman, 1970, Ekman, 1993, Frijda, 1986, Tomkins, 1962). A complementary body of work comes from the field of computational neuroscience, where researchers have proposed biologically plausible neural architectures for facial emotion recognition (Dailey et al., 2002, Fragopanagos and Taylor, 2005, Padgett and Cottrell, 1996). One important result, on which many (but not all, e.g. Ortony and Turner (1990) and Russell (1994)) researchers agree nowadays, is that the perception of emotion is at least to a certain degree categorical (Ekman, 1970, Izard, 1992, Kotsoni et al., 2001, Tomkins, 1962), meaning that a perceived expression is assigned to one out of a small set of categories, usually termed the “basic” or “primary” emotions (although the precise number and type of basic emotions varies between theories). Categorical perception presupposes a sharp perceptive boundary between categories, rather than a gradual transition. At this boundary, the ability to discriminate between visually similar displays on different sides of the boundary is at its peak, so that stimuli can still be assigned to one of the categories. The most widespread definition of basic emotions since the seventies is due to Ekman, and comprises the six categories anger, disgust, fear, happiness, sadness, and surprise. These seem to be universal across different cultures (Ekman, 1970); in fact, a theoretical motivation for emotion categories goes back to the notion that the same facial muscles are used to display emotions in widely different cultures.

The categorical nature of emotion recognition was established empirically, through carefully designed studies with human observers (Calder et al., 1996, de Gelder et al., 1997, Ekman, 1992). However, there is also a computational argument for this capability: if a suitable set of categories can be found (suitable in the sense that they can be distinguished with the available data), then a categorical decision can be made more quickly and more reliably, because the problem is reduced to a forced choice between few possibilities, and because only those perceptual aspects that discriminate between the categories need to be considered. In learning-theoretical terminology, categories can be represented by a discriminative model, which aims for large classification margins, rather than a generative model, which allows a complete description of all their aspects.
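To make the forced-choice argument concrete, the following sketch (our illustration, not part of the original model) casts the categorical decision as an argmax over a small number of discriminative class scores; all variable names and parameter values are hypothetical.

```python
import numpy as np

# Hypothetical illustration (not the authors' implementation): a forced choice
# between K emotion categories reduces perception to an argmax over a small
# set of discriminative class scores w_k . x + b_k, rather than fitting a
# full generative description of every stimulus.

CATEGORIES = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def forced_choice(x, W, b):
    """Assign a feature vector x to one of the categories.

    W holds one weight row per category and b one bias per category; only the
    relative scores (classification margins) matter for the decision.
    """
    scores = W @ x + b                      # one discriminative score per category
    return CATEGORIES[int(np.argmax(scores))]

# Toy usage with random parameters, purely to show the decision rule.
rng = np.random.default_rng(0)
x = rng.normal(size=128)                    # stand-in for a pose feature vector
W = rng.normal(size=(len(CATEGORIES), 128))
b = np.zeros(len(CATEGORIES))
print(forced_choice(x, W, b))
```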

Over the last decades, most studies have concentrated on emotional signals in facial expressions. Recently, researchers have also turned to emotional body language, i.e. the expression of emotions through human body pose and/or body motion (de Gelder, 2006, Grezes et al., 2007, Meeren et al., 2005, Peelen and Downing, 2007). An implicit assumption common to the work on emotional body language is that body language is only a different means of expressing the same set of basic emotions as facial expressions.

The recognition of whole-body expressions is substantially harder, because the configuration of the human body has more degrees of freedom than the face alone, and its overall shape varies strongly during articulated motion. However, in computer vision and machine learning research, recent results on object recognition have shown that even for highly variable visual stimuli, quite reliable categorical decisions can be made from dense low-level visual cues (Dalal and Triggs, 2005, Serre et al., 2006).
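As an illustration of such dense low-level cues, the sketch below computes gradient-orientation histograms in the spirit of Dalal and Triggs (2005) using scikit-image; the input image and all parameter values are illustrative, not those of the cited work or of our model.

```python
# A minimal sketch of dense gradient-orientation features in the spirit of
# Dalal and Triggs (2005), using scikit-image; the image and all parameter
# values are illustrative placeholders.
import numpy as np
from skimage.feature import hog

image = np.random.rand(128, 64)     # stand-in for a grey-scale image of a person

descriptor = hog(
    image,
    orientations=9,                 # gradient-orientation bins
    pixels_per_cell=(8, 8),         # local histogram cells
    cells_per_block=(2, 2),         # block normalisation
    feature_vector=True,
)
print(descriptor.shape)             # one dense, fixed-length descriptor per image
```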

In this work, we try to gain new insight into possible mechanisms of emotion recognition from body pose, by constructing a biologically plausible computational model for their categorical perception (plausible in terms of the high-level hierarchy, not in terms of low-level functionality such as information encoding). We stress that at present the neurophysiological data about the visual cortex is not complete enough for us to fully understand and replicate the underlying processes. Any computational model can therefore only strive not to contradict the available data, but remains in part speculative. Still, we believe that such an approach can be beneficial, both for machine vision, which is still far from reaching the capabilities of animal vision, and for neuroscience, where computational considerations can contribute new insights.1

We restrict ourselves to the analysis of body poses (form), as opposed to the dynamics of body language (optic flow). This corresponds to modeling only the perception and recognition processes typically taking place in the ventral stream (Felleman & van Essen, 1991): we focus on the question of what the categorization of single snapshots can contribute to the extraction of emotions from body pose, without including any motion information. Recent studies suggest that there are also recognition processes based on connections to areas outside the ventral stream (STS, pre-motor areas), which presumably explain sensitivity to implied motion (de Gelder, Snyder, Greve, Gerard, & Hadjikhani, 2004) (and also to action properties of objects (Mahon et al., 2007)). For the moment, we exclude these connections, as the corresponding computational mechanisms for extracting and encoding implied motion are not clear.

Using a set of emotional body language stimuli, which was originally prepared for neuroscientific studies, we show that human observers, as expected, perform very well on this task, and we construct a model of the underlying processing stream. The model is then tested on the same stimulus set. By focusing on form, we do not claim that motion processing is unimportant. The importance of motion and implied motion for the perception of human bodies is corroborated by several neurophysiological studies (Barraclough et al., 2006, Bruce et al., 1981, Jellema and Perrett, 2006, Oram and Perrett, 1994), and we have taken care to keep our computational approach compatible with models that include the dorsal stream. In particular, our model can be directly extended by adding a motion analysis channel as proposed by Giese and Poggio in their model of action perception (Giese & Poggio, 2003).

Section snippets

Stimulus set

The data we use for our study was originally created at Tilburg University for the purpose of studying human reactions to emotional body language with brain imaging methods.

The data consists of photographic still images of 50 actors (34 females, 16 males) enacting different emotions. All images are taken in a frontal position with the figure facing the camera, on a controlled white background. The stimulus set follows the list of six basic emotions originally enumerated by Ekman (1970): per

Neural model

Our model of the visual pathway for recognition is inspired by those of Riesenhuber and Poggio (1999) and Serre et al. (2006). It consists of a hierarchy of neural feature detectors, which have been engineered to fulfill the computational requirements of recognition, while being consistent with the available electrophysiological data. A schematic of the complete model is depicted in Fig. 2. As an important limitation, the model is purely feed-forward. No information is fed back from
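To illustrate this kind of hierarchy, the sketch below alternates template-matching ("S") layers and max-pooling ("C") layers in the spirit of Riesenhuber and Poggio (1999) and Serre et al. (2006); the filter bank, pooling sizes and prototypes are arbitrary placeholders, not the detectors used in our model.

```python
# Illustrative sketch of an HMAX-style feed-forward hierarchy: simple ("S")
# layers do template matching, complex ("C") layers pool with a max to gain
# invariance. All sizes and filters here are placeholders.
import numpy as np
from scipy.signal import convolve2d
from scipy.ndimage import maximum_filter

def s1_layer(image, filter_bank):
    """S1: convolve the image with a bank of orientation-selective filters."""
    return [np.abs(convolve2d(image, f, mode="same")) for f in filter_bank]

def c1_layer(s1_maps, pool=8):
    """C1: local max pooling and subsampling for each orientation channel."""
    return [maximum_filter(m, size=pool)[::pool, ::pool] for m in s1_maps]

def s2_c2(c1_map, prototypes, patch=4):
    """S2/C2: compare each local C1 patch with stored prototypes and keep the
    maximum response per prototype, yielding a position-invariant feature
    vector that a final classification stage can map to emotion categories."""
    h, w = c1_map.shape
    responses = []
    for p in prototypes:                                  # p: (patch, patch)
        best = -np.inf
        for i in range(h - patch + 1):
            for j in range(w - patch + 1):
                d = np.linalg.norm(c1_map[i:i + patch, j:j + patch] - p)
                best = max(best, np.exp(-d ** 2))         # Gaussian tuning
        responses.append(best)
    return np.array(responses)

# Toy usage with random filters and prototypes, purely to show the data flow.
rng = np.random.default_rng(1)
image = rng.random((64, 64))
filter_bank = [rng.standard_normal((7, 7)) for _ in range(4)]
prototypes = [rng.random((4, 4)) for _ in range(10)]
c1 = c1_layer(s1_layer(image, filter_bank))
c2 = np.concatenate([s2_c2(m, prototypes) for m in c1])
print(c2.shape)  # one C2 response per (channel, prototype) pair
```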

Experiments

The model has been tested on the stimulus set described in Section 2. All stimuli were used in their original orientation as well as mirrored along the vertical axis, to account for the symmetry of human body poses with respect to the sagittal plane. This gives a total of 696 images (for 2 out of the 50 actors the image for sad is missing). As explained earlier, we implicitly assume that attention has been directed to the person, because of the controlled imaging conditions (clean background,
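A minimal sketch of the mirroring step described above, assuming the stimuli are available as lists of image arrays (all names are illustrative):

```python
# Minimal sketch of the mirroring described above (our illustration): each
# stimulus is used in its original orientation and flipped about the vertical
# axis, exploiting the approximate left-right symmetry of body poses.
import numpy as np

def augment_with_mirrors(images, labels):
    """Return the original images plus horizontally mirrored copies,
    with the category labels duplicated accordingly."""
    mirrored = [np.fliplr(img) for img in images]
    return images + mirrored, labels + labels

# 50 actors x 7 categories, minus the 2 missing 'sad' images, then doubled:
# 2 * (50 * 7 - 2) = 696 stimuli in total.
```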

Discussion

We have presented a biologically inspired neural model for the form perception of emotional body language. When presented with an image showing an expression of emotional body language, the model is able to assign it to one out of seven emotional categories (the six basic emotions plus neutral). The model has been tested on the Tilburg University stimulus set, the only complete dataset of emotional body poses of which we are aware. It achieved a recognition rate of 82%, compared to 87% for human
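For completeness, a generic sketch of how such an overall recognition rate can be summarised from per-stimulus predictions, assuming lists of true and predicted category labels; this is not the exact evaluation protocol of the paper.

```python
# Generic sketch of summarising a recognition rate from predictions; the label
# lists and helper names are illustrative, not the paper's evaluation code.
import numpy as np

CATEGORIES = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def confusion_matrix(true_labels, predicted_labels):
    """Count how often each true category is assigned to each predicted one."""
    index = {c: i for i, c in enumerate(CATEGORIES)}
    M = np.zeros((len(CATEGORIES), len(CATEGORIES)), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        M[index[t], index[p]] += 1
    return M

def recognition_rate(true_labels, predicted_labels):
    """Fraction of stimuli assigned to their correct emotional category."""
    M = confusion_matrix(true_labels, predicted_labels)
    return np.trace(M) / M.sum()
```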

Acknowledgment

This project was funded in part by EU project COBOL (NEST-043403).

References (65)

  • L. Pessoa et al. Attentional control of the processing of neutral and emotional stimuli. Cognitive Brain Research (2002)
  • J.G. Taylor et al. The interaction of attention and emotion. Neural Networks (2005)
  • P. Vuilleumier et al. Effects of attention and emotion on face processing in the human brain: An event-related fMRI study. Neuron (2001)
  • D. Anguita et al. Improved neural network for SVM learning. IEEE Transactions on Neural Networks (2002)
  • J.A. Beintema et al. Perception of biological motion without local image motion. Proceedings of the National Academy of Sciences of the United States of America (2002)
  • H. Bourlard et al. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics (1988)
  • C. Bruce et al. Visual properties of neurons in a polysensory area in superior temporal sulcus of the macaque. Journal of Neurophysiology (1981)
  • A.J. Calder et al. Categorical perception of morphed facial expressions. Visual Cognition (1996)
  • C. Cortes et al. Support-vector networks. Machine Learning (1995)
  • J.G.F. Coutinho et al. Designing a posture analysis system with hardware implementation. Journal of VLSI Signal Processing (2006)
  • M.N. Dailey et al. EMPATH: A neural network that categorizes facial expressions. Journal of Cognitive Neuroscience (2002)
  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proc. 10th international...
  • B. de Gelder. Towards the neurobiology of emotional body language. Nature Reviews Neuroscience (2006)
  • B. de Gelder et al. Fear fosters flight: A mechanism for fear contagion when perceiving emotion expressed by a whole body. Proceedings of the National Academy of Sciences of the United States of America (2004)
  • B. de Gelder et al. Categorical perception of facial expressions: Categories and their internal structure. Cognition and Emotion (1997)
  • P. Downing et al. A cortical area selective for visual processing of the human body. Science (2001)
  • M. Eimer et al. An ERP study on the time course of emotional face processing. Neuroreport (2002)
  • P. Ekman. Universal facial expressions of emotion. California Mental Health Research Digest (1970)
  • P. Ekman. An argument for basic emotions. Cognition and Emotion (1992)
  • P. Ekman. Facial expression and emotion. American Psychologist (1993)
  • D.J. Felleman et al. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex (1991)
  • D.J. Field. Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A (1987)