Abstract
With a shift in interest toward dynamic expressions, numerous corpora of dynamic facial stimuli have been developed over the past two decades. The present research aimed to test existing sets of dynamic facial expressions (published between 2000 and 2015) in a cross-corpus validation effort. For this, 14 dynamic databases were selected that featured facial expressions of the basic six emotions (anger, disgust, fear, happiness, sadness, surprise) in posed or spontaneous form. In Study 1, a subset of stimuli from each database (N = 162) was presented to human observers and machine analysis, yielding considerable variance in emotion recognition performance across the databases. Classification accuracy further varied with the perceived intensity and naturalness of the displays, with posed expressions being judged more accurately and as more intense, but less natural, than spontaneous ones. Study 2 aimed for a full validation of the 14 databases by subjecting the entire stimulus set (N = 3812) to machine analysis. A FACS-based Action Unit (AU) analysis revealed that facial AU configurations were more prototypical in posed than spontaneous expressions. The prototypicality of an expression in turn predicted emotion classification accuracy, with higher performance observed for more prototypical facial behavior. Furthermore, technical features of each database (i.e., duration, face box size, head rotation, and motion) had a significant impact on recognition accuracy. Together, the findings suggest that existing databases vary in their ability to signal specific emotions, facing a trade-off between realism and ecological validity on the one hand, and expression uniformity and comparability on the other.
Introduction
The human face is an important source of dynamic information. By conveying rich and complex action patterns, the dynamic quality of facial behavior makes it a powerful medium for emotion communication. Yet, for years the majority of research on the visual perception of emotions was dominated by static stimuli, i.e., datasets of still images of emotional expressions captured at apex (e.g., Ekman & Friesen, 1976; Biehl et al., 1997; Goeleven et al., 2008; Tottenham et al., 2009). Apart from their questionable ecological validity, which renders them untypical of the displays encountered in everyday life (Russell, 1994), static portrayals may not convey the same affective information and communicative intent as dynamic ones. There is now a growing body of evidence suggesting that the dynamics of facial expressions are crucial for the recognition (e.g., Wehrle et al., 2000; Kamachi et al., 2001) and interpretation of emotions (e.g., Ambadar et al., 2005; see Krumhuber et al., 2013; Sato et al., 2019 for reviews). Moreover, moving stimuli elicit different patterns of muscular/behavioral responses (Sato & Yoshikawa, 2007) and brain activation compared to static ones (Zinchenko et al., 2018). To capture the actual form of human behavior, facial movement thus appears essential for an accurate approximation of reality. In this vein, the last two decades have seen increased questioning and criticism of static stimuli, and a gradual shift towards research on dynamic expressions.
To meet new demands in stimulus selection that reflect the dynamic quality of facial displays, a wide range of databases have been developed in recent years. These vary widely in their scope and potential application. Furthermore, they employ a host of techniques for expression elicitation. In some databases, for example, subjects are asked to deliberately make an expression by activating certain facial muscles using the Directed Facial Action task (Ekman, 2007). Alternatively, acting techniques have been used for simulating the emotion by asking subjects to (re)produce a particular emotion. This may involve the use of labels or verbally rich scenarios (so-called vignettes) that specify the emotional content (Siedlecka & Denson, 2019). In a few databases, expressions are also elicited through mental imagery in which the person recalls a personal past event and subsequently enacts the relevant emotion using Stanislavski or method acting techniques (Scherer & Bänziger, 2010). While portrayals of the latter type may contain experiential affective elements, they are displayed with the deliberate intent to communicate the desired emotion. Hence, all of the above methods can be summarized under the umbrella of posed expression elicitation. A different approach consists of capturing spontaneous expressions by exposing naïve subjects to events expected to evoke a particular emotional state. These can be active tasks such as playing video games or touching certain objects (Cowie et al., 2005). Alternatively, databases may rely on emotion-induction techniques that are more passive, such as watching emotive pictures, movies, or listening to music (Coan & Allen, 2007). Here, subjects respond freely and in their own way, yet the induced emotional expressions occur in a controlled setting (often in the laboratory).
Up to now, most of the available dynamic databases have favored some variant of posing over spontaneous emotion elicitation. Deliberately posed expressions can be defined precisely and judged against a clear criterion set by the researcher. However, they have been argued to represent stereotypical and often exaggerated displays (Barrett, 2011). Because acted portrayals operate with an explicit intention to convey the necessary facial signals, they show higher expressivity than spontaneous emotional expressions (Hess et al., 1997). These differences are reflected in the cortical innervation of the underlying facial muscles, implying two separate neural pathways for voluntary and involuntary actions (i.e., cortical and subcortical, Morecraft et al., 2001; Rinn, 1984). Supportive evidence comes from studies showing that posed expressions have different temporal and morphological characteristics (duration, intensity, asymmetry) than spontaneous ones (Cohn & Schmidt, 2004; Krumhuber & Manstead, 2009; Namba et al., 2017). Databases in which emotions were spontaneously induced may therefore feature less salient facial behavior, which might impede recognition accuracy. In this vein, emotion agreement was found to be lower and to vary substantially across spontaneous expressions, ranging from 15% to 65% (Wagner, 1990; Kayyal & Russell, 2013; for a review see Calvo & Nummenmaa, 2016). By contrast, recognition rates are typically situated between 60% and 80% for posed expressions. While this evidence points toward generally weaker recognizability for spontaneous compared to posed facial expressions, existing findings are difficult to interpret.
Many studies have tested their own database without any comparative evaluations between different platforms. Hence, the validity of conclusions about emotion decoding accuracy depends on the specific stimulus set being used. Furthermore, study authors have utilized dissimilar procedures to assess recognition performance. For the evaluation of some databases, for example, judgment tasks have been used in which trained raters or lay observers selected an emotion label from a predetermined list of categories (varying between 6 and 24; Golan et al., 2006; Roy et al., 2007). Others have calculated interrater agreement on the emotion categories among small groups of people, often experts or annotators (Zhang et al., 2014). Besides a strict categorical approach, a few databases have obtained emotion confidence and/or intensity judgments, continuous emotion ratings, or employed open-response formats (Kaulard et al., 2012; Matuszewski et al., 2012; Meillon et al., 2010). Alternative measures have included self-reports of emotional experience, thereby relying on subjective self-assessments instead of observer-based ones (Barrett, 2006). Finally, component measures have focused on the analysis of facial actions (using the Facial Action Coding System (FACS), Ekman et al., 2002) to obtain an objective classification of the expressive behavior (Cosker et al., 2011).
Given the various methods employed for eliciting and validating dynamic facial expressions, the quantity and quality of data available on emotion recognition performance are a major issue (Küster et al., 2020). There is currently no normative standard that incorporates the diversity of approaches seen in the literature. This calls for common cross-corpus evaluations that make it possible to compare databases to each other. Such a coordinated effort may help accelerate progress in the field by providing researchers with a benchmark by which to review, compare, and contrast existing study findings. Having a comprehensive source of reference provides crucial insights into human performance and how it varies within and across databases. Moreover, it is essential for the measurement and classification of emotions by means of machine learning.
In the last two decades, significant advances have been made in automated affect recognition (Sandbach et al., 2012), including the development of commercial software for dynamic facial expression analysis. The ability to recognize a person’s expression automatically and in real-time offers unique opportunities in basic and applied research (Zeng et al., 2009). However, many systems so far have been trained and tested on limited sets of data (Pantic & Bartlett, 2007). Those typically involved posed or acted facial behavior displaying prototypical patterns of emotional expression. In this vein, machine classification performance was found to be high for deliberately posed stimuli (Beringer et al., 2019; Skiendziel et al., 2019), but was reduced when facial expressions were spontaneous and/or subtle in their appearance (Yitzhak et al., 2017; Krumhuber et al., 2020). Unless training sets encompass large stimulus collections, automatic systems may therefore fail to generalize to the wide variety of expressive displays common in everyday life.
The present research
This research aims to provide a comparative test of databases of dynamic facial expressions published between 2000 and 2015. Such cross-corpus investigation allows for the comparison and validation of dynamic stimuli that differ in a range of parameters (i.e., elicitation condition, gender, ethnicity, expression intensity, head pose). All selected sets are publicly available and feature basic emotions in visual format. A comprehensive review of the existing corpora in terms of their conceptual and practical features is given in Krumhuber et al. (2017). In the present paper, we focus on the empirical evaluation by measuring and comparing emotion recognition indices across individual databases. For this purpose, we collected data from human observers and conducted automated facial expression analysis with a software tool called FACET (iMotions). FACET has been used widely and has demonstrated superior levels of emotion classification in recent cross-classifier comparisons (Stöckli et al., 2018; Dupré et al., 2020).
In Study 1, human participants were presented with a subset of stimuli from 14 dynamic databases, featuring facial expressions of the basic six emotions that were either posed or spontaneous. Recognition performance was assessed through an emotion identification task, including ratings of expression intensity and naturalness. We also submitted the materials to automated analysis by means of FACET as an additional form of validation, and to compare the results of the machine analysis to human coding. Given the diversity of expressive stimuli in this broad set of databases, we expected considerable variance in classification accuracy across the databases. Recognition levels should further vary with the perceived intensity and naturalness of the displays, with posed expressions being judged more accurately and as more intense, but less natural, compared to spontaneous ones.
Study 2 aimed for a full validation of the 14 databases by subjecting the entire stimulus sets to automated analysis by means of FACET. We further examined the exact facial cues that contribute to expression recognition by conducting a FACS-based Action Unit (AU) analysis. Similar to the first study, posed expressions were expected to facilitate emotion classification by exhibiting prototypical facial AU configurations. Prototypicality should in turn predict accuracy in emotion identification, with increasingly better performance expected for more prototypical expressions. Aside from an emotion-based analysis, we examined the technical features of each database (i.e., duration, face box size, head rotation, and motion), and their impact on recognition accuracy. While smaller face sizes and larger head movements may pose a more challenging situation, longer video durations could positively affect machine classification.
Study 1
The aim of the first study was to provide initial validation results for a subset of stimuli from each of the 14 dynamic databases. To this end, human observers were asked to identify the expressed emotion as well as to rate the intensity and naturalness for each stimulus. We further obtained machine validation data on the same materials using commercial software for automated affect analysis.
Method
Materials
Given the practical limitations regarding the number of facial portrayals that could be rated by human observers, a subset of stimuli was selected from the 14 databases using stratified random sampling. All contained videos of dynamic facial expressions portrayed by individual encoders and featured basic emotions. Out of the 14 databases, nine showed posed facial expressions that were initiated via instructions to perform an expression/facial action or through scenario enactment techniques: ADFES, BU-4DFE, CK, D3D-FACS, DaFEx, GEMEP, MMI, MPI, and STOIC. The other five databases featured spontaneous facial expressions that were elicited in response to videos or tasks designed to induce a specific emotion: BINED, DISFA, DynEmo, FG-NET, UT Dallas. Both types of expressions had been recorded in the laboratory by the database authors. For the purpose of this research, we focused on the following six basic emotions as predefined by the dataset authors: anger, disgust, fear, happiness, sadness, and surprise.
For every database, two exemplars were randomly selected from each emotion category, yielding 12 portrayals per database. The two exceptions were DISFA and DynEmo, which contain only five and four basic emotions, respectively. This yielded a total of 162 expressions (108 posed, 54 spontaneous) from 85 female and 77 male encoders. Portrayals that exceeded a duration of 15 s (BINED, DynEmo) were edited to display the dynamic trajectory from onset, over apex, to offset of the expression (if applicable). The final stimuli lasted on average 5 s and measured approximately 642 × 482 pixels.
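The stratified draw described above can be sketched as follows; the catalog structure and clip identifiers are hypothetical stand-ins, not the databases' actual file naming.

```python
import random

# Hypothetical catalog: (database, emotion, clip_id) triples for one database.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
catalog = [("ADFES", emo, f"{emo}_{i:02d}.mp4")
           for emo in EMOTIONS for i in range(10)]

def sample_exemplars(catalog, per_cell=2, seed=1):
    """Draw `per_cell` clips at random from each (database, emotion) stratum."""
    rng = random.Random(seed)
    cells = {}
    for db, emo, clip in catalog:
        cells.setdefault((db, emo), []).append(clip)
    return {cell: rng.sample(clips, min(per_cell, len(clips)))
            for cell, clips in cells.items()}

subset = sample_exemplars(catalog)
# 6 emotion categories x 2 exemplars = 12 portrayals for this database
```

Sampling within each (database, emotion) cell, rather than from the pooled stimulus list, guarantees every cell contributes equally to the subset.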
Human observers
Participants
One hundred twenty-four participants (86 females), aged 18–45 years (M = 24.23, SD = 5.58) were recruited face-to-face or via the departmental subject pool and participated in exchange for course credit or payment of £6. All participants identified themselves as White Caucasian. A power analysis using G*Power 3.1 (Faul et al., 2007) indicated that this sample size is sufficient to detect a medium-sized effect of database or emotion (Cohen’s f = 0.25) in an ANOVA with 80% statistical power (α = 0.05). All participants provided written informed consent prior to the study. Ethical approval was granted by the Department of Experimental Psychology at University College London.
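The reported power calculation can be approximated from the noncentral F distribution. This sketch assumes a one-way between-subjects ANOVA with noncentrality λ = f²·N (Cohen's convention); G*Power's repeated-measures modules use additional correlation parameters, so the exact numbers will differ from the study's.

```python
from scipy.stats import f as f_dist, ncf

def anova_total_n(effect_f=0.25, k_groups=6, alpha=0.05, target_power=0.80):
    """Smallest total N for a one-way between-subjects ANOVA to reach
    the target power, using the noncentral F distribution."""
    n = k_groups + 2
    while True:
        df1, df2 = k_groups - 1, n - k_groups
        crit = f_dist.ppf(1 - alpha, df1, df2)          # critical F under H0
        power = 1 - ncf.cdf(crit, df1, df2, effect_f**2 * n)
        if power >= target_power:
            return n
        n += 1
```

For a medium effect (f = 0.25) and six emotion categories, this yields a total N in the low two hundreds, consistent with standard G*Power output for the between-subjects case.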
Procedure
The study was described as a test of how people perceive emotion in dynamic facial expressions, with all instructions and stimuli being presented via computer. Participants saw one out of two exemplars of each emotion category from every database, yielding 81 dynamic facial expressions per participant. Stimulus sequence was randomized using Qualtrics (Provo, UT). For each facial stimulus, participants rated their confidence (from 0 to 100%) about the extent to which the expression reflects anger, disgust, fear, happiness, sadness, surprise, other emotion, and neutral (no emotion). If they felt that more than one category applied, they could respond using multiple sliders to choose the exact confidence levels for each response category. Ratings across the eight response options had to sum up to 100%. In addition, participants evaluated each facial stimulus in terms of the intensity and naturalness of the expressed emotion, using 7-point Likert scales (1 - very weak, 7 - very intense; 1 - not natural at all, 7 - very natural). All three measures were presented on the same screen and in a fixed order, with unlimited response time.
Machine analysis
We submitted all video stimuli to automated analysis by means of the FACET classifier, which is part of the biometric software suite by iMotions (www.imotions.com, SDK v6.3). FACET is a commercial software for automatic facial expression measurement based on the Computer Expression Recognition Toolbox algorithm (CERT; Littlewort et al., 2011). It estimates facial expressions in terms of the six basic emotions as well as 20 FACS Action Units (AUs). FACET outputs per-frame evidence scores for each emotion category that represent estimates of the likelihood of an expert human coder recognizing the expression as the target category. The values are expressed on a decimal logarithmic scale centered around zero (similar to a z-score), with zero indicating a 0.5 probability, negative values indicating that an expression is likely not present, and positive values indicating the likely presence of an expression.
Importantly, these raw evidence scores do not include any specification in terms of which emotion is most probable relative to the other emotions. Hence, researchers interested in dynamic expressions need to define a metric or rule by which to aggregate the per-frame evidence, and to extract the dominant emotion categorization for each video stimulus (Dente et al., 2017). While FACET’s raw evidence scores can be averaged to determine the dominant emotion categorization (e.g., Yitzhak et al., 2017), this approach results in a linear “pooling” of evidence across frames, with probabilities that may no longer reflect the logarithmically scaled recognition odds provided by human experts. We therefore transformed the raw, non-baseline-corrected FACET evidence values first into probabilities, using the formula provided in the FACET documentation (p = 1/(1 + 10^(-evidence)); iMotions, 2016), and then into odds values (odds = 1/((1/p) − 1)). Such conversion on a scale from zero to infinity ensures that the logarithmic increase in probabilities produced by the binary classifiers is adequately reflected when averaging across all frames. We defined the dominant emotion categorization as the expression with the highest proportion of odds relative to the total amount of odds for all six basic expressions:
Confidence(x) = [Σ odds(x) / (Σ odds(x) + Σ odds(y))] × 100

For each expression (E), we computed a confidence score reflecting the proportion of the summed odds for the target expression (x) relative to the total of all odds (target expression (x) + other expressions (y)). This proportion (0–1) was subsequently converted into a percentage score by multiplying it by 100. This approach yields an odds-based percentage score for each video that allows easy identification of the dominant emotion categorization, i.e., the category with the highest score. Additionally, it provides a simple standardized metric to quantify and rank the relative confidence for each expression across videos from diverse databases. By definition, the resulting confidence scores for the six expressions add up to a total of 100.
To compute the human and machine accuracy of the multi-class categorization, we created new dummy variables to indicate the recognized expressions, and whether they matched the predicted emotion labels (true vs. false).
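The evidence-to-odds pipeline and the accuracy coding described above can be sketched as follows; the (n_frames × 6) array layout is a hypothetical stand-in for FACET's per-frame export.

```python
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def confidence_scores(evidence):
    """evidence: (n_frames, 6) array of per-frame log10 evidence values.
    Converts evidence -> probability -> odds, pools odds across frames,
    and returns percentage confidence per emotion (summing to 100)."""
    p = 1.0 / (1.0 + 10.0 ** (-np.asarray(evidence, float)))  # probability
    odds = 1.0 / (1.0 / p - 1.0)                              # = p / (1 - p)
    pooled = odds.sum(axis=0)                                 # pool over frames
    return 100.0 * pooled / pooled.sum()

def classify(evidence, target):
    """Dominant emotion = highest confidence score; returns (label, correct?)."""
    conf = confidence_scores(evidence)
    label = EMOTIONS[int(np.argmax(conf))]
    return label, label == target
```

The boolean returned by `classify` corresponds to the true/false dummy variable used in the accuracy analyses.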
Results
Rating scores were averaged across the two exemplars of each emotion category from every database, which served as the unit of analysis. For all analyses, a Greenhouse-Geisser adjustment to degrees of freedom was applied, and Bonferroni correction was used for multiple comparisons.
Emotion recognition
Recognition accuracy was significantly higher than chance (17%), in both humans, M = 65.11% (SD = 26.18), t(80) = 16.54, p < .001, Cohen’s d = 1.84, and machine, M = 65.43% (SD = 40.03), t(80) = 10.89, p < .001, Cohen’s d = 1.21. In general, expressions from posed datasets were better recognized than those from spontaneous ones in both humans, t(38.39) = 3.64, p = .001, Cohen’s d = .91, and machine, t(79) = 2.21, p = .030, Cohen’s d = .50.
Due to insufficient variance within the study cells, separate ANOVAs were conducted with the factors database (14) or emotion (6), each comparing human vs. machine performance. Results revealed a significant main effect of database, F(13, 67) = 2.52, p = .007, ηp2 = .33, with ADFES, CK, BU-4DFE, and STOIC performing best, followed by MMI, MPI, D3D-FACS, DISFA, DynEmo, GEMEP, DaFEx, UT Dallas, BINED, and finally FG-NET (see Fig. 1). The difference was statistically significant only between ADFES and FG-NET (p = .037). A significant main effect of emotion, F(5, 75) = 5.78, p < .001, ηp2 = .28, further revealed that recognition rates were highest for happiness, followed by disgust, then surprise, sadness, and anger, and finally fear. Pairwise comparisons showed that happiness was better recognized than sadness (p = .009), anger (p = .003), fear (p < .001), and marginally better than surprise (p = .059). For none of the above analyses was the human vs. machine difference significant (Fs < 0.002, ps > .977), nor was there a significant interaction between database or emotion and human vs. machine (Fs < 1.48, ps > .151).
As shown in Fig. 2, confusion rates were generally below the 25% chance level, except for fear which was sometimes confused with surprise (27.66%) in humans. The same confusion arose in machine classification (19.45%). Also, there was a tendency for both humans and machine to label anger expressions as disgust (10.35% and 20.57%, respectively). In order to quantify the similarity of confusions between machine and human, each confusion matrix was transformed into a single vector (see Kuhn et al., 2017). Correlational analyses indicated a significant overlap between both matrices (rho = .71, S = 2256, p < .001), suggesting that recognition patterns of target and non-target emotions were positively related in humans and machine.
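The matrix-to-vector comparison can be reproduced with a rank correlation; the two confusion matrices below are illustrative, not the study's data.

```python
import numpy as np
from scipy.stats import spearmanr

def confusion_similarity(m1, m2):
    """Flatten two confusion matrices (rows: target emotion, cols: chosen
    label) into vectors and correlate them with Spearman's rho."""
    v1, v2 = np.asarray(m1).ravel(), np.asarray(m2).ravel()
    rho, p = spearmanr(v1, v2)
    return rho, p

# Illustrative 3x3 matrices with a similar diagonal-dominant structure
human = np.array([[80, 15, 5], [10, 70, 20], [5, 25, 70]])
machine = np.array([[75, 20, 5], [15, 65, 20], [10, 20, 70]])
rho, p = confusion_similarity(human, machine)
```

Because the vectors include the off-diagonal cells, a high rho indicates that humans and machine not only agree on the target emotions but also make similar confusion errors.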
Intensity rating
Results yielded a significant main effect of database, F(13, 67) = 2.89, p = .002, ηp2 = .36, with GEMEP, ADFES, STOIC, and DaFEx attracting the highest scores in expression intensity, followed by CK, MMI, BU-4DFE, MPI, DynEmo, D3D-FACS, and finally DISFA, BINED, FG-NET, and UT Dallas (see Fig. 3). Pairwise comparisons showed that UT Dallas stimuli were rated as significantly less intense than those from GEMEP (p = .005), ADFES (p = .008), STOIC (p = .014), and DaFEx (p = .039). Overall, expressions from posed datasets (M = 4.53, SD = 0.76) were perceived as more intense than those from spontaneous sets (M = 3.66, SD = 0.85), t(79) = 4.64, p < .001, Cohen’s d = 1.07.
There was also a significant main effect of emotion, F(5, 75) = 2.87, p = .020, ηp2 = .16, with disgust and fear being the two most intense expressions, followed by happiness, surprise, anger, and finally sadness. Pairwise comparisons showed that sadness was rated as significantly less intense than disgust (p = .009), and marginally significantly less intense than fear (p = .076). Overall, perceived intensity significantly predicted participants’ accuracy in emotion recognition, β = .50, t(79) = 5.07, p < .001, with better performance the more intense the expression was judged to be.
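The standardized coefficient reported here is the slope obtained after z-scoring both variables; with a single predictor this equals Pearson's r. A minimal sketch with simulated ratings (the data are invented for illustration):

```python
import numpy as np
from scipy.stats import linregress

def standardized_beta(x, y):
    """Slope of y regressed on x after z-scoring both variables."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return linregress(zx, zy).slope

rng = np.random.default_rng(0)
intensity = rng.uniform(1, 7, 81)                   # simulated 7-point ratings
accuracy = 10 * intensity + rng.normal(0, 15, 81)   # simulated % correct
beta = standardized_beta(intensity, accuracy)
```

A positive beta of this kind corresponds to the pattern reported above: higher judged intensity, better recognition.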
Naturalness rating
A significant main effect of database, F(13, 67) = 15.99, p < .001, ηp2 = .76, revealed that BINED, DynEmo, GEMEP, UT Dallas, and DISFA achieved the highest scores in naturalness, with no significant differences among them (ps = 1.0) (Fig. 3). Pairwise comparisons showed that BINED and DynEmo were rated as significantly more natural than DaFEx, FG-NET, STOIC, D3D-FACS, CK, MPI, ADFES, BU-4DFE, and MMI (ps < .05). GEMEP was rated as significantly more natural than FG-NET, STOIC, D3D-FACS, CK, MPI, ADFES, BU-4DFE, and MMI (ps < .05). UT Dallas and DISFA were rated as significantly more natural than D3D-FACS, CK, MPI, ADFES, BU-4DFE, and MMI (ps < .05), with DaFEx scoring significantly higher in naturalness than MMI (p = .024). In general, expressions from posed datasets (M = 3.69, SD = 0.76) were perceived to be less natural than those from spontaneous sets (M = 4.86, SD = 0.70), t(79) = –6.66, p < .001, Cohen’s d = 1.59.
The main effect of emotion was not significant, F(5, 75) = 0.67, p = .644, ηp2 = .04. Perceived naturalness significantly predicted participants’ accuracy in emotion recognition, β = –.28, t(79) = –2.62, p = .011, with worse performance the more natural the expression was judged to be.
Discussion
The findings of the first study showed considerable variance in emotion recognition across the 14 databases ranging from 34% to 83%. On average, posed expressions were recognized better and judged as more intense (but less natural) than spontaneous ones. Intensity ratings in turn predicted recognition accuracy, with higher performance the more intense the expression. As such, posed stimuli may act as salient symbols of highly expressive and intense displays (Hess et al., 1997; Motley & Camden, 1988). Those can be easily identified, but are seen as less representative of everyday behavior (Barrett, 2011). When comparing human vs. machine performance there was strong convergence, yielding similar patterns of emotion classification and confusion between categories. This makes automated analysis a suitable tool for assessing facial expressions.
Study 2
The second study intended to go beyond the limited subset of Study 1 and achieve a full validation of the 14 dynamic databases. For this, we processed the entire databases using automated methods for measuring emotion. We further analyzed the facial (AU) cues and technical features that may contribute to expression recognition.
Method
In this study, we considered the entire stimulus array from each of the 14 databases, comprising 5591 videos in total. Out of those, 1179 videos contained non-basic emotion labels (e.g., pride), yielding a total of 3812 videos of basic emotion expressions (1624 posed, 42.60%; 2188 spontaneous, 57.40%) from 855 encoders (536 females, 319 males) that were submitted to data analysis (see Table 1). In order to examine potential physical differences between the database stimuli, the following technical features were extracted using OpenFace 2.0 (Baltrusaitis et al., 2018) or FACET (see Table 3): video duration (mean, SD), face box size (mean, range) as the relative proportion of the visible facial area in a video frame, head rotation (up-down, left-right, head-tilt), and head motion (translational, rotational). As regards the last feature, we combined information from the individual movement parameters into one index each to estimate rotational and translational head motion.
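One plausible way to collapse the six head-pose parameters into the two motion indices is mean frame-to-frame displacement; the exact aggregation used in the study is not specified, so this is an assumption. OpenFace 2.0 exports the head-pose columns pose_Tx/Ty/Tz (in mm) and pose_Rx/Ry/Rz (in radians).

```python
import numpy as np

def head_motion_indices(pose):
    """pose: (n_frames, 6) array of OpenFace head-pose estimates, ordered
    [pose_Tx, pose_Ty, pose_Tz, pose_Rx, pose_Ry, pose_Rz].
    Returns mean frame-to-frame translational (mm) and rotational (rad)
    displacement as two scalar motion indices."""
    diffs = np.diff(np.asarray(pose, float), axis=0)
    translational = np.linalg.norm(diffs[:, :3], axis=1).mean()
    rotational = np.linalg.norm(diffs[:, 3:], axis=1).mean()
    return translational, rotational
```

Differencing adjacent frames before taking the Euclidean norm makes the index reflect movement per frame rather than the absolute head position.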
Similar to Study 1, automated facial expression analysis was achieved by processing all video stimuli without baseline correction (cf., Stöckli et al., 2018). Where necessary, original videos were rotated into upright horizontal position and/or converted into Windows Media Video (.wmv) or MPEG-4 (.mp4) format to allow batch processing with FACET, while maintaining the original video resolution. Besides the classification of facial expressions in terms of the basic six emotions (anger, disgust, fear, happiness, sadness, and surprise), we analyzed the machine data at the level of the individual facial actions: AU1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 18, 20, 23, 24, 25, 26, 28, 43 (see Table 2 for AU definitions). We performed the same pre-processing steps and calculation of odds-based confidence scores for emotions/AUs as detailed in the first study.
With reference to the criteria proposed in the Facial Action Coding System (Ekman, Friesen, & Hager, 2002, p. 174; see also Table 4 in Krumhuber & Scherer, 2011), facial action (AU) configurations were further examined in association with specific basic emotions. For this, AU combinations indicative of full emotion prototypes or major variants thereof were scored as 1 or 0.75, respectively. Next, a weighted prototypicality score was computed by summing the FACET confidence scores of AUs within a combination, and multiplying the sum scores by 1 (full prototype) or 0.75 (major variant). This resulted in a total prototype score, with higher numbers reflecting greater emotional prototypicality.
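The weighted prototypicality score can be sketched as below. The AU-combination table shown is an illustrative subset only (the full prototype and variant listings are in Ekman, Friesen, & Hager, 2002, and Krumhuber & Scherer, 2011), so the specific combinations here should not be taken as the study's complete coding scheme.

```python
# Illustrative subset of AU combinations:
# weight 1.0 = full prototype, 0.75 = major variant.
PROTOTYPES = {
    "happiness": [({"AU6", "AU12"}, 1.00), ({"AU12"}, 0.75)],
    "surprise":  [({"AU1", "AU2", "AU5", "AU26"}, 1.00),
                  ({"AU1", "AU2", "AU5"}, 0.75)],
}

def prototype_score(au_confidence, emotion):
    """au_confidence: dict mapping AU label -> odds-based confidence score.
    Sum the confidences of the AUs in each combination, weight the sum by
    1.0 (full prototype) or 0.75 (major variant), and total across combinations."""
    total = 0.0
    for aus, weight in PROTOTYPES.get(emotion, []):
        total += weight * sum(au_confidence.get(au, 0.0) for au in aus)
    return total
```

Because the AU confidences themselves enter the sum, the score rises both with the number of prototype-consistent AUs present and with how strongly each is detected.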
Results
The results yielded a large positive correlation between the machine performance on the small set (Study 1) and the big set (Study 2), r(81) = .65, p < .001, indicating that classification accuracy of the full databases could be predicted from the data of the small selective set.
Emotion recognition
The overall recognition accuracy of 55.51% (SD = 49.70) for the big set was significantly higher than chance (17%), t(3811) = 47.84, p < .001, Cohen’s d = .77, with all 14 databases passing the chance level threshold, ts > 2.40, ps < .02. In general, expressions from posed datasets (M = 70.32, SD = 45.70) were better recognized than those from spontaneous ones (M = 44.52, SD = 49.71), t(3641.32) = 16.60, p < .001, Cohen’s d = .54.
A 6 (emotion) x 14 (database) ANOVA showed a significant main effect of database, F(13, 3731) = 86.70, p < .001, ηp2 = .23, with ADFES (M = 97%), CK (M = 97%), and STOIC (M = 80%) achieving the highest recognition scores, followed by BU-4DFE (M = 68%), MMI (M = 64%), UT Dallas (M = 58%), MPI (M = 55%), D3D-FACS (M = 54%), DaFEx (M = 52%), and FG-NET (M = 43%), and finally DISFA (M = 35%), GEMEP (M = 34%), BINED (M = 28%), and DynEmo (M = 26%). A main effect of emotion, F(5, 3731) = 87.99, p < .001, ηp2 = .11, further revealed that happiness was recognized best, followed by anger, sadness, disgust, surprise, and finally fear (see Fig. 4). Pairwise comparisons with Games-Howell adjustment showed that happiness was better recognized, and fear was worse recognized than all other emotions (ps < .001). In addition, surprise was more poorly recognized than anger (p = .017) and disgust (p = .038).
In addition to the two main effects, the ANOVA revealed a significant interaction between emotion and database, F(62, 3731) = 11.85, p < .001, ηp2 = .16. As shown in Fig. 5, cross-database classification performance was consistently high in the context of happiness, with recognition rates above 50%. However, there was considerable variance amongst the databases in the recognition of all other emotions. For anger and fear, the only datasets that achieved > 70% accuracy were ADFES and CK (and STOIC for anger), with markedly lower performance of the remaining datasets, e.g., DaFEx, GEMEP, and MPI. This result also applied to sets with spontaneous expressions such as BINED, DISFA, DynEmo, FG-NET and UT Dallas, whose classification scores were amongst the lowest in the context of surprise, disgust, and sadness. BU-4DFE, DaFEx, and STOIC did reasonably well in conveying the latter three emotions, although their performance indices were not as high as those of ADFES and CK (see also Table S1 in the Supplementary Materials).
Confusion rates
When analyzing confusion rates in target emotion classification, a similar pattern occurred as in Study 1 (Fig. 4). Anger was likely to be confused with disgust (19.26%), whereas fear was often confused with surprise (17.21%). Furthermore, happiness was a commonly chosen label for emotions such as surprise, fear, and disgust, which might be due to the occurrence of smiling in those expressions.
In order to group databases by the similarity of their confusion patterns, a hierarchical cluster analysis was then performed. The average silhouette approach divided the 14 databases into two main clusters (Fig. 6). Cluster 1 was composed of ADFES, CK, MMI, BU-4DFE, and STOIC, the best-performing databases with high overall accuracy scores. Cluster 2 comprised the remaining databases. ADFES and CK were further grouped into a sub-cluster characterized by accuracy rates > 83% for each predicted emotion and few confusion errors (see also Fig. S1). MMI, BU-4DFE, and STOIC made up the second sub-cluster with accuracy rates > 53% for happiness, anger, surprise, and disgust; however, anger was confused with disgust in more than 28% of all cases. With regard to DynEmo, BINED, and DISFA, individual accuracy scores were moderate (< 30%), except for happiness and anger (BINED), with surprise and fear often being confused with happiness (50–91%). The final sub-cluster consisted of DaFEx, GEMEP, D3D-FACS, MPI, FG-NET, and UT Dallas, which was characterized by inconsistent and relatively frequent confusion errors.
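The clustering step can be sketched as follows. The four profiles are invented stand-ins (overall accuracy, confusion-error rate) for the real per-database confusion patterns, and the original analysis additionally used the average silhouette criterion to choose the number of clusters:

```python
# Grouping databases by the similarity of their confusion profiles with
# average-linkage hierarchical clustering. Profiles are hypothetical.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

profiles = np.array([
    [0.97, 0.01],  # high accuracy, few confusions (ADFES-like)
    [0.95, 0.02],  # high accuracy, few confusions (CK-like)
    [0.30, 0.50],  # low accuracy, frequent confusions
    [0.26, 0.55],  # low accuracy, frequent confusions
])

Z = linkage(profiles, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
# the two high-accuracy profiles fall into one cluster, the rest into the other
```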
Facial action units
Based on a detailed FACS analysis, we examined the extent to which the classification of the basic six emotions depends on individual facial actions. For this, the relative contribution of the 20 AUs to correct identification of the target emotion was calculated using Bayesian penalized regression analyses with a regularized horseshoe prior (Piironen & Vehtari, 2017; Van Erp et al., 2019); the expected number of non-zero coefficients was set to 1–5, corresponding to the minimal number of prototype AUs for each emotion. Overall, happiness (R2 = 0.73) and disgust (R2 = 0.70) were the two best predicted emotions, followed by anger (R2 = 0.65), surprise (R2 = 0.50), sadness (R2 = 0.48), and finally fear (R2 = 0.38). When analyzing the results separately by type of facial action, it can be seen that some AUs were more predictive than others (Table 2, see also Table S2). Specifically, the predictive power was highest for AUs that are hypothesized to signal a particular emotion according to Basic Emotion Theory (Ekman et al., 2002). These were AUs 6 (cheek raiser) and 12 (lip corner puller) for happiness, AUs 9 (nose wrinkler) and 10 (upper lip raiser) for disgust, AUs 4 (brow lowerer), 7 (lid tightener), and 23 (lip tightener) for anger, AUs 2 (outer brow raiser) and 26 (jaw drop) for surprise, AUs 1 (inner brow raiser) and 15 (lip corner depressor) for sadness, and AUs 5 (upper lid raiser) and 20 (lip stretcher) for fear.
An analysis of the emotion prototype scores further showed that expressions from posed datasets (M = 50.79, SD = 37.72) were more prototypical in their facial AU patterns than those from spontaneous datasets (M = 34.32, SD = 34.81), t(3334.81) = 13.77, p < .001, Cohen’s d = .45 (see Figure S2 for results per database). A logistic regression analysis revealed that the prototypicality of an expression significantly predicted emotion recognition performance, standardized β = 1.05, p < .001, 95% CI [0.97, 1.13], explaining 26.1% of the variance.
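A logistic regression of this form can be sketched with a self-contained maximum-likelihood fit (scipy here, rather than the software used in the original analysis). The prototype scores and recognition outcomes below are simulated with a positive true slope, so the fit should recover that sign:

```python
# Logistic regression of recognition accuracy (0/1) on expression
# prototypicality, fitted by maximum likelihood. All data are simulated.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
proto = rng.uniform(0, 100, 500)                   # hypothetical prototype scores
p_true = 1 / (1 + np.exp(-(-2.0 + 0.05 * proto)))  # true positive effect
correct = rng.binomial(1, p_true)                  # 1 = emotion recognized

def nll(beta):
    """Negative log-likelihood of a logistic model (intercept, slope)."""
    z = beta[0] + beta[1] * proto
    return np.sum(np.logaddexp(0.0, z) - correct * z)

fit = minimize(nll, x0=np.zeros(2), method="BFGS")
b0, b1 = fit.x
# b1 recovers a positive slope: higher prototypicality -> higher accuracy
```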
Technical features
Table 3 lists the technical features of each database: duration (mean, SD), face box size (mean, range), head rotation (up-down, left-right, head-tilt), and head motion (translational, rotational). As can be seen, there was considerable variability across databases. On average, video recordings from spontaneous databases were longer in duration, with a smaller visible area of the face and increased head rotation and motion. To test whether accuracy rates in emotion detection vary as a function of low-level visual properties of the stimuli, we conducted a multiple regression analysis with a random intercept for database in R (3.6.1, R Core Team, 2016) using the lme4 package (Bates et al., 2015). Of all the technical features, the following four significantly predicted recognition performance: mean duration (β = 0.47, SE = 0.08, z = 5.78, p < .001), head-tilt (β = – 0.53, SE = 0.11, z = – 5.00, p < .001), translational motion (β = – 0.53, SE = 0.11, z = – 5.00, p < .001), and rotational motion (β = 0.18, SE = 0.08, z = 2.33, p < .02). The positive relationship between duration and accuracy suggests that slightly longer videos may be beneficial for classification. By contrast, head-tilt and translational motion appeared to negatively affect performance, whereas some variability in head rotation over time might be favorable by adding extra information (bold fonts in Table 3 indicate results per database).
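The original analysis used lme4 in R; as a rough Python analogue, the sketch below fits a logistic model of accuracy on two technical features, with per-database fixed intercepts standing in for the random-intercept term. All data are simulated, with feature effects chosen to mirror the reported sign pattern:

```python
# Approximate analogue of the mixed-effects analysis: logistic model of
# recognition accuracy on standardised technical features, with per-database
# fixed intercepts in place of a random intercept. Data are hypothetical.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, n_db = 600, 3
db = rng.integers(0, n_db, n)                  # database index per video
duration = rng.normal(0, 1, n)                 # standardised mean duration
head_tilt = rng.normal(0, 1, n)                # standardised head-tilt
db_offset = np.array([0.5, 0.0, -0.5])[db]     # true per-database baseline
p_true = 1 / (1 + np.exp(-(db_offset + 0.5 * duration - 0.5 * head_tilt)))
correct = rng.binomial(1, p_true)

X = np.column_stack([np.eye(n_db)[db], duration, head_tilt])

def nll(beta):
    """Negative log-likelihood of the logistic model."""
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - correct * z)

fit = minimize(nll, np.zeros(X.shape[1]), method="BFGS")
*db_intercepts, b_dur, b_tilt = fit.x
# signs mirror the reported pattern: duration helps, head-tilt hurts
```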
Discussion
The full validation of the 14 databases again revealed considerable variance in recognition performance, ranging from 26% to 97%. Similar to Study 1, posed stimuli were easier to identify in terms of their target emotion. This may be due to their saliency as idealized prototypes of affective displays, as suggested in the literature (Barrett et al., 2019). By conducting a detailed FACS analysis, we could demonstrate that AU configurations indicative of basic emotions were indeed more common in posed stimuli. The prototypicality of an expression in turn predicted classification rates, with performance increasing the more prototypical the facial behavior was. Furthermore, accuracy varied with the technical features of each database, pointing toward the modulating role of stimulus quality in expression recognition.
General discussion
Based on a growing body of research arguing that facial displays of emotion are dynamic phenomena (Krumhuber et al., 2013; Sato et al., 2019), there has been a shift in interest towards dynamic expressions over the past two decades. This has led to a proliferation of stimuli available to the scientific community, amounting to a large number of datasets varying in size and properties (Krumhuber et al., 2017). While there are isolated attempts at validating dynamic stimulus sets, no cross-corpus evaluation exists to date that would allow for a robust comparison between the databases. The aim of the present research was to test different stimulus collections of dynamic facial expressions, thereby providing common validation data that can serve as a benchmark for future researchers.
In two studies, we showed that emotion classification accuracy varied considerably amongst the 14 databases. Overall, ADFES, CK, and STOIC performed best, achieving recognition rates over 80%. All three databases contain posed expressions produced upon instruction to perform a specific expression/facial action (Van der Schalk et al., 2011; Kanade et al., 2000; Roy et al., 2007). Such standardized tasks yield distinct displays that represent clear-cut exemplars of the intended emotion. In line with previous research (Motley & Camden, 1988; Calvo & Nummenmaa, 2016), observers endorsed the predicted emotion to a greater extent when behavior was posed rather than spontaneous. Deliberately posed displays were also perceived as more intense, with intensity ratings positively predicting participants’ level of recognition. Higher facial expressivity therefore seems to facilitate emotion decoding, implying an intrinsic link between expression intensity and recognition (Hess et al., 1997; Wingenbach et al., 2016).
From the set of posed databases, we recommend ADFES and CK for studies that aim for highly recognizable and intense expressions. Both demonstrate excellent recognition rates across the six emotion categories. CK also contains a particularly large number of videos from a variety of different encoders, which makes it a diverse stimulus set. While posed databases allow for strong emotional displays, these often reflect stereotypical and exaggerated forms of behavior (Barrett, 2011). Such stylized patterns may not be representative of the facial actions seen in everyday life. In fact, emotions are typically expressed in subtle and varied ways (Fernández-Dols, 1999). Alternative choices may be DaFEx and GEMEP, which comprise intense but less directed expressions. Although their recognition levels differed between the six emotion categories, they may be suitable for studies that focus on a subset of emotions. Both databases depict relatively few encoders (i.e., actors) who enact a range of emotion scenarios, and feature audiovisual portrayals.
In the present research, participants generally indicated lower levels of perceived naturalness for deliberately posed displays. Furthermore, machine analysis revealed more prototypical facial (AU) configurations when behavior was posed. Among the available set of spontaneous databases, we recommend DISFA and UT Dallas. They achieved moderately high ratings of naturalness, with recognition rates in the acceptable range for some emotions, particularly in the case of UT Dallas. UT Dallas further contains a large number of videos from different encoders, making it a rich set of spontaneous stimuli. At the technical level, it should be noted, however, that this database features parameters (e.g., face box range, head motion) that may affect emotion classification accuracy.
Together, the findings suggest that existing databases currently face a trade-off between realism and ecological validity on one end, and expression uniformity and comparability on the other. This could be problematic in the sense that the emotional content of posed recordings, both in terms of production and perception, does not translate to real-world settings. To date, most human perception studies have utilized highly recognizable portrayals of facial expressions. Moreover, automated methods mainly focus on prototypical expressions for training and testing (Pantic & Bartlett, 2007). In order to develop stimulus sets that mirror naturally occurring human affective behavior, it will be essential for future research to simulate real-world environments as closely as possible.
An important aspect in that regard relates to the technical setup in database construction. Posed expressions are typically captured under tightly restricted conditions, with near-frontal views, little head pose variation, and uniform background. While constant recording settings minimize potential differences in the low-level visual properties of the stimuli (Beringer et al., 2019; Calvo et al., 2018), such constrained input data are not normally found in spontaneous face databases. The present research showed that spontaneous expressions (despite being recorded in the laboratory) featured a smaller visible area of the face and more head rotation and motion. Spontaneous behavior therefore involves variability in stimulus settings, which increases the complexity of recognition. This is particularly an issue for machine analysis, with many automated systems still being sensitive to the recording condition (Zeng et al., 2009). Here, we found that the technical features of each database significantly predicted performance rates. Unless posed and spontaneous portrayals satisfy the same methodological criteria, choices in corpus construction will inevitably introduce perceptual confounds in emotion recognition.
To minimize trade-offs between expression realism and recognizability, researchers should move away from ideal laboratory conditions and directed facial action tasks in which expressions are produced in the exact same manner by each encoder. Face orientation and head poses are unlikely to be steady in daily life. Instead of a fixed recording position, it might be feasible to use head-mounted cameras, thereby enabling encoders to move around more freely whilst keeping a constant viewing angle of the face. Such a setup could be part of motion capture technologies that translate the movements of the person’s face into digitally constructed displays of emotion (Zhang, Snavely, Curless, & Seitz, 2004). These have the advantage that certain features can be dealt with in post-production when building generative and/or morphable face models (e.g., Grewe, Le Roux, Pilz, & Zachow, 2018), thereby providing fine-grained control over the type and dynamics of facial actions that drive response classification. Generative approaches such as the one pursued by Yu, Garrod, and Schyns (2012) also allow for facial models that are constructed from ecologically valid facial movements, with the liberty to synthesize arbitrary facial expressions from parameterized movements.
While some of the existing databases contain high-resolution 3D scans for facial analysis and synthesis (e.g., BU-4DFE, D3D-FACS, MPI), the smaller face sizes of emotion-evoked expressions highlight potential issues with stimulus quality. At the moment, spontaneous databases often lag behind posed ones in providing high-quality, technically sound materials (Sandbach et al., 2012). To ensure high recording quality, a distinction could be made between what the camera sees and the setting in which the behavior occurs. To this end, a natural environment could be created that preserves sufficient spontaneity, while the visible area captured by the camera remains tightly controlled. Alternatively, a minimal context may be defined that describes the specific situation in which the recording is made (Bänziger & Scherer, 2007). Considerable research suggests that emotions are strongly context-dependent (Greenaway et al., 2018). Also, situational context determines the emotional meaning and significance of facial expressions (Maringer et al., 2011; Aviezer et al., 2017). For maximizing both the natural aspects of expression and recognition, integrating contextual information could thus help specify the emotional content of the recordings; an approach that mirrors human perception but also benefits automated methods, which traditionally have been context-insensitive (Calvo & D’Mello, 2010). For this, advanced annotations in the form of well-labeled data are a necessary prerequisite. To date, most databases still lack metadata about the emotion-eliciting context (i.e., utilized stimuli, environment, presence of other people, etc.). Failure to provide such metadata may contribute to the difficulty of recognizing emotions, particularly from spontaneous expressions.
In line with previous research, responses were more accurate for happy expressions, the only positive emotion in this study. By contrast, recognition rates were lowest for fear, which was often confused with surprise (Calvo & Nummenmaa, 2016), an emotion with which it shares similar patterns of facial actions (Ekman et al., 2002). While database performance was consistently high in the context of happiness, there was considerable variance for all other emotions. As such, it seems that different databases are more or less suitable for portraying specific emotions. Following traditional approaches, we targeted the basic six emotions as the most commonly used categories for stimulus collection. The view that underlies this notion is rooted in theoretical assumptions that conceptualize emotions as discrete and fundamentally different. According to Basic Emotion Theory, a small number of categorical emotions exists that are basic or primary in the sense that they form the core emotional repertoire (Ekman, 1992; Ekman & Cordaro, 2011). While the discrete perspective remains influential (Cordaro et al., 2018), the narrow focus on a few, supposedly fundamental, emotions has increasingly been criticized (Barrett et al., 2019; Kappas et al., 2013).
There is also debate about whether facial expressions are necessarily linked to emotions or to other affective, motivational, or socio-cultural factors (Fernández-Dols & Russell, 2017). Here, we focused on expressions produced in the laboratory. In real life, posed displays may occur in interpersonal contexts for a variety of reasons (e.g., to be polite, prevent conflicts, or strategically mask one’s true feelings), with spontaneous expressions being subject to the influence of multiple factors outside the emotion-eliciting situation (e.g., social presence of other people). Moreover, facial expressions typically fulfill a variety of functions (e.g., cognitive appraisals, action tendencies, social motives) and encompass a blend of affective and/or cognitive processes (Kappas et al., 2013; Parkinson, 2005) which may affect, alone or in combination, their recognizability.
To address some of these criticisms, a few promising efforts have lately aimed to extend the range of emotions and include non-basic affective states. Some of the databases examined here (i.e., DynEmo, GEMEP, MPI) reflect that approach by providing a wider array of affective displays, such as those expressing embarrassment, boredom, or admiration. Furthermore, there are tentative efforts to detect basic and compound emotions “in the wild”, featuring a wide range of natural expressions (Benitez-Quiroz, Srinivasan, & Martinez, 2016). It falls to future research to review and validate stimulus collections that go beyond the basic emotion perspective. This may be informative not only for theory advancement but also for potential applications in research using posed vs. spontaneous expressions. The present work constitutes a first step in providing cross-corpus validation data for 14 databases of dynamic facial expressions. We hope that this proves useful as a benchmark for accelerating future progress in the field.
Notes
Portrayals labelled as amusement (BINED, DynEmo) or joy (ADFES, DISFA, GEMEP) were included under the umbrella of happiness. Missing portrayals of surprise were substituted in one database (MPI) with those of disbelief, which belongs to the same emotion family (Shaver et al., 1987). Since D3D-FACS does not include any emotion labels, we opted for stimuli with Action Unit configurations that are characteristic of the six basic emotions as proposed in the Facial Action Coding System manual (FACS, Ekman et al., 2002).
The presentation format slightly differed between participants, such that some participants (N = 70) saw a fixed set of 81 portrayals, whereas others (N = 54) were presented with a randomly selected set of 81 out of the 162 portrayals. Recognition rates between the two presentation formats were highly correlated, r(81) = .910, p < .001. Also, there was no significant difference in recognition accuracy between the two portrayals of an emotion, t(160) = – 1.16, p = .249, d = .09. We therefore collapsed them into one data file and averaged across the two portrayals per emotion and database.
Given the potential risk that emotion portrayals with neutral expressive elements would be incorrectly classified as neutral by the machine, the neutral response category was left out. As no videos with intended neutral expressions were analyzed, the exclusion of this response category does not impair the dominant emotion categorization (see Frank & Stennett, 2001; Skiendziel et al., 2019). In order to account for the unequal number of response categories in humans vs. machine, we ran additional analyses in which we post-experimentally removed the ‘neutral’ and ‘other’ answer options from the human data. The pattern of results remained the same. Recognition accuracy in humans was higher for posed than spontaneous expressions, t(37.82) = 3.20, p = .003, d = .36. In ANOVAs with the factors database (14) or emotion (6), the main effect of database, F(13, 67) = 2.31, p = .013, ηp2 = .31, as well as emotion, F(5, 75) = 6.02, p < .001, ηp2 = .29, remained significant. In none of the above analyses was the human vs. machine difference significant (Fs < 2.44, ps > .123), nor was there a significant interaction between database or emotion and human vs. machine (Fs < 1.59, ps > .113).
The full sets of MMI and CK (plus its extension CK+) contain spontaneous as well as posed expressions. While portrayals of both types were analyzed from the MMI database (174 posed, 17 spontaneous), it was impossible for the dataset authors of CK to identify the relevant spontaneous expressions (N = 122) within their set (personal communication). We therefore treated all CK portrayals as posed expressions.
References
References with numbers in superscripts denote articles which describe a dynamic facial expression dataset as listed in Table 1.
Ambadar, Z., Schooler, J. W., & Cohn, J. F. (2005). Deciphering the enigmatic face: The importance of facial dynamics in interpreting subtle facial expressions. Psychological Science, 16, 403–410. https://doi.org/10.1111/j.0956-7976.2005.01548.x
Aviezer, H., Ensenberg, N., & Hassin, R. R. (2017). The inherently contextualized nature of facial emotion perception. Current Opinion in Psychology, 17, 47–54. https://doi.org/10.1016/j.copsyc.2017.06.006
Baltrusaitis, T., Zadeh, A., Lim, Y. C., & Morency, L. P. (2018). Openface 2.0: Facial behavior analysis toolkit. In 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018) (pp. 59–66). IEEE. https://doi.org/10.1109/FG.2018.00019
10Bänziger, T., Mortillaro, M., & Scherer, K. R. (2012). Introducing the Geneva Multimodal Expression corpus for experimental research on emotion perception. Emotion, 12(5), 1161–1179. https://doi.org/10.1037/a0025827
Bänziger, T., & Scherer, K. R. (2007). Using actor portrayals to systematically study multimodal emotion expression: The GEMEP corpus. In A. C. R. Paiva, R. Prada, & R. W. Picard (Eds.), Lecture notes in computer science: Vol. 4738. ACI 2007 – Affective Computing and Intelligent Interaction, Second International Conference (pp. 476–487). Springer. https://doi.org/10.1007/978-3-540-74889-2_42
Barrett, L. F. (2006). Solving the Emotion Paradox: Categorization and the Experience of Emotion. Personality and Social Psychology Review, 10(1), 20–46. https://doi.org/10.1207/s15327957pspr1001_2
Barrett, L.F. (2011). Was Darwin wrong about emotional expressions? Current Directions in Psychological Science, 20, 400–406. https://doi.org/10.1177/0963721411429125
Barrett, L. F., Adolphs, R., Marsella, S., Martinez, A. M., & Pollak, S. D. (2019). Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological Science in the Public Interest, 20, 1–68. https://doi.org/10.1177/1529100619832930
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01
6Battocchi, A., Pianesi, F., & Goren-Bar, D. (2005). DaFEx: Database of Facial Expressions. In M. Maybury, O. Stock, & W. Wahlster (Eds.), Lecture Notes in Computer Science: Vol 3814. INTETAIN 2005 – Intelligent Technologies for Interactive Entertainment, First International Conference (pp. 303–306). Springer. https://doi.org/10.1007/11590323_39
Benitez-Quiroz, C. F., Srinivasan, R., & Martinez, A. M. (2016). EmotioNet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5562–5570). IEEE. https://doi.org/10.1109/CVPR.2016.600
Beringer, M., Spohn, F., Hildebrandt, A., Wacker, J., & Recio, G. (2019). Reliability and validity of machine vision for the assessment of facial expressions. Cognitive Systems Research, 56, 119–132. https://doi.org/10.1016/j.cogsys.2019.03.009
Biehl, M., Matsumoto, D., Ekman, P., Hearn, V., Heider, K., Kudoh, T., & Ton, V. (1997). Matsumoto and Ekman's Japanese and Caucasian Facial Expressions of Emotion (JACFEE): Reliability Data and Cross-National Differences. Journal of Nonverbal Behavior, 21, 3–21. https://doi.org/10.1023/A:1024902500935
Calvo, R., & D’Mello, S. (2010). Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on Affective Computing, 1, 18–37. https://doi.org/10.1109/T-AFFC.2010.1
Calvo, M. G., & Nummenmaa, L. (2016). Perceptual and affective mechanisms in facial expression recognition: An integrative review. Cognition and Emotion, 30, 1081–1106. https://doi.org/10.1080/02699931.2015.1049124
Calvo, M. G., Fernández-Martín, A., Recio, G., & Lundqvist, D. (2018). Human observers and automated assessment of dynamic emotional facial expressions: KDEF-dyn database validation. Frontiers in Psychology, 9, 2052. https://doi.org/10.3389/fpsyg.2018.02052
Coan, J. A., & Allen, J. J. B. (2007). Handbook of emotion elicitation and assessment. Oxford University Press.
Cohn, J. F., & Schmidt, K. L. (2004). The timing of facial motion in posed and spontaneous smiles. International Journal of Wavelets, Multiresolution and Information Processing, 2, 1–12. https://doi.org/10.1142/9789812704313_0005
Cordaro, D. T., Sun, R., Keltner, D., Kamble, S., Huddar, N., & McNeil, G. (2018). Universals and cultural variations in 22 emotional expressions across five cultures. Emotion, 18, 75–93. https://doi.org/10.1037/emo0000302
5Cosker, D., Krumhuber, E., & Hilton, A. (2011). A FACS valid 3D dynamic action unit database with applications to 3D dynamic morphable facial modeling. In D. Metaxas, L. Quan, A. Sanfeliu, & L. van Gool (Eds.), Proceedings of the 13th IEEE International Conference on Computer Vision (ICCV) (pp. 2296–2303). IEEE. https://doi.org/10.1109/iccv.2011.6126510
Cowie, R., Douglas-Cowie, E., & Cox, C. (2005). Beyond emotion archetypes: Databases for emotion modelling using neural networks. Neural Networks, 18, 371–388. https://doi.org/10.1016/j.neunet.2005.03.002
Dente, P., Küster, D., Skora, L., & Krumhuber, E. G. (2017). Measures and metrics for automatic emotion classification via FACET. In J. Bryson, M. De Vos, & J. Padget (Eds.), Proceedings of the Conference on the Study of Artificial Intelligence and Simulation of Behaviour (AISB) (pp. 164–167), Bath, UK (April).
Dupré, D., Krumhuber, E. G., Küster, D., & McKeown, G. (2020). A performance comparison of eight commercially available automatic classifiers for facial affect recognition. PLOS One, 15(4): e0231968.
Ekman, P. (1992). An argument for basic emotions. Cognition & Emotion, 6, 169–200. https://doi.org/10.1080/02699939208411068
Ekman, P. (2007). The Directed Facial Action Task: Emotional responses without appraisal. In J. A. Coan & J. J. B. Allen (Eds.), Series in affective science. Handbook of emotion elicitation and assessment (p. 47–53). Oxford University Press.
Ekman, P., & Cordaro, D. T. (2011). What is meant by calling emotions basic. Emotion Review, 3, 364–370. https://doi.org/10.1177/1754073911410740
Ekman, P., & Friesen, W. V. (1976). Pictures of Facial Affect. Consulting Psychologists Press.
Ekman, P., Friesen, W. V., & Hager, J. C. (2002). Facial Action Coding System: The manual on CD ROM. Research Nexus.
Faul, F., Erdfelder, E., Lang, A. G., & Buchner, A. (2007). G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175–191. https://doi.org/10.3758/bf03193146
Fernández-Dols, J.-M. (1999). Facial expression and emotion: A situationist view. In P. Philippot, R. S. Feldman, & E. J. Coats (Eds.), The social context of nonverbal behavior (pp. 242–261). Cambridge University Press.
Fernández-Dols, J.-M., & Russell, J. A. (2017). The science of facial expression. Oxford University Press.
Frank, M. G., & Stennett, J. (2001). The forced-choice paradigm and the perception of facial expressions of emotion. Journal of Personality and Social Psychology, 80(1), 75–85. https://doi.org/10.1037/0022-3514.80.1.75
Goeleven, E., De Raedt, R., Leyman, L., & Verschuere, B. (2008). The Karolinska Directed Emotional Faces: A validation study. Cognition and Emotion, 22(6), 1094–1118. https://doi.org/10.1080/02699930701626582
Golan, O., Baron-Cohen, S., & Hill, J. (2006). The Cambridge Mindreading (CAM) Face-Voice Battery: Testing complex emotion recognition in adults with and without Asperger syndrome. Journal of Autism and Developmental Disorders, 36, 169–183. https://doi.org/10.1007/s10803-005-0057-y
Greenaway, K., Kalokerinos, E., & Williams, L. (2018). Context is everything (in emotion research). Social and Personality Psychology Compass, 12(6), e12393. https://doi.org/10.1111/spc3.12393
Grewe, M., Le Roux, G., Pilz, S.-K., & Zachow, S. (2018). Spotting the details: The various facets of facial expressions. In 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018) (pp. 286–293). IEEE. https://doi.org/10.1109/FG.2018.00049
Hess, U., Blairy, S., & Kleck, R. E. (1997). The intensity of emotional facial expressions and decoding accuracy. Journal of Nonverbal Behavior, 21, 241–257. https://doi.org/10.1023/A:1024952730333
iMotions (2016). Biometric Research Platform 5.7, Emotient FACET, iMotions A/S, Copenhagen, Denmark.
Kamachi, M., Bruce, V., Mukaida, S., Gyoba, J., Yoshikawa, S., & Akamatsu, S. (2001). Dynamic properties influence the perception of facial expressions. Perception, 30, 875–887. https://doi.org/10.1068/p3131
4Kanade, T., Cohn, J. F., & Tian, Y. (2000). Comprehensive database for facial expression analysis. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (pp. 46–53). IEEE Computer Society. https://doi.org/10.1109/afgr.2000.840611
Kappas, A., Krumhuber, E., & Küster, D. (2013). Facial behavior. In J. A. Hall & M. L. Knapp (Eds.), Nonverbal Communication (Handbooks of Communication Science, HOCS 2) (pp. 131–165). Mouton de Gruyter.
12Kaulard, K., Cunningham, D. W., Bülthoff, H. H., & Wallraven, C. (2012). The MPI facial expression database – A validated database of emotional and conversational facial expressions. PLoS ONE, 7(3), e32321. https://doi.org/10.1371/journal.pone.0032321
Kayyal, M. H., & Russell, J. A. (2013). Americans and Palestinians judge spontaneous facial expressions of emotion. Emotion, 13(5), 891–904. https://doi.org/10.1037/a0033244
Krumhuber, E. G., Kappas, A., & Manstead, A. S. R. (2013). Effects of dynamic aspects of facial expressions: A review. Emotion Review, 5, 41–46. https://doi.org/10.1177/1754073912451349
Krumhuber, E., Küster, D., Namba, S., Shah, D., & Calvo, M. G. (2020). Emotion recognition from posed and spontaneous dynamic expressions: Human observers vs. machine analysis. Emotion (forthcoming). https://doi.org/10.1037/emo0000712
Krumhuber, E, & Manstead, A. S. R. (2009). Can Duchenne smiles be feigned? New evidence on felt and false smiles. Emotion, 9, 807–820. https://doi.org/10.1037/a0017844
Krumhuber, E., & Scherer, K. R. (2011). Affect bursts: Dynamic patterns of facial expression. Emotion, 11, 825–841. https://doi.org/10.1037/a0023856
Krumhuber, E. G., Skora, L., Küster, D., & Fou, L. (2017). A review of dynamic datasets for facial expression research. Emotion Review, 9, 280–292. https://doi.org/10.1177/1754073916670022
Kuhn, L. K., Wydell, T., Lavan, N., McGettigan, C., & Garrido, L. (2017). Similar representations of emotions across faces and voices. Emotion, 17(6), 912–937. https://doi.org/10.1037/emo0000282
Küster, D., Krumhuber, E. G., Steinert, L., Ahuja, A., Baker, M., & Schultz, T. (2020). Opportunities and challenges for using automatic human affect analysis in consumer research. Frontiers in Neuroscience, 14, 400. https://doi.org/10.3389/fnins.2020.00400.
Littlewort, G., Whitehill, J., Wu, T., Fasel, I., Frank, M., Movellan, J., & Bartlett, M. (2011). The computer expression recognition toolbox (CERT). In Face and Gesture 2011 (pp. 298–305). IEEE. https://doi.org/10.1109/FG.2011.5771414
Maringer, M., Krumhuber, E. G., Fischer, A. H., & Niedenthal, P. M. (2011). Beyond smile dynamics: mimicry and beliefs in judgments of smiles. Emotion 11, 181–187. https://doi.org/10.1037/a0022596
Matuszewski, B. J., Quan, W., Shark, L. K., McLoughlin, A. S., Lightbody, C. E., Emsley, H. C. A., & Watkins, C. L. (2012). Hi4D-ADSIP 3-D dynamic facial articulation database. Image and Vision Computing, 30, 713–727. https://doi.org/10.1016/j.imavis.2012.02.002
7Mavadati, S. M., Mahoor, M. H., Bartlett, K., Trinh, P., & Cohn, J. F. (2013). DISFA: A spontaneous facial action intensity database. IEEE Transactions on Affective Computing, 4(2), 151–160. https://doi.org/10.1109/T-AFFC.2013.4
Meillon, B., Tcherkassof, A., Mandran, N., Adam, J. M., Dubois, M., Dupré, D., Benoit, A., Guérin-Dugué, A., & Caplier, A. (2010). DynEmo: A corpus of dynamic and spontaneous emotional facial expressions. In M. Kipp, J. C. Martin, P. Paggio, & D. Heylen (Eds.), Proceedings of International Workshop Series on Multimodal Corpora, Tools and Resources. Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (pp. 31–36). ELREC.
Morecraft, R.J., Louie, J. L., Herrick, J. L., & Stilwell-Morecraft, K. S. (2001). Cortical innervation of the facial nucleus in the non-human primate: a new interpretation of the effects of stroke and related subtotal brain trauma on the muscles of facial expression. Brain, 124, 176–208. https://doi.org/10.1093/brain/124.1.176
Motley, M. T., & Camden, C. T. (1988). Facial expression of emotion: A comparison of posed expressions versus spontaneous expressions in an interpersonal communication setting. Western Journal of Speech Communication, 52, 1–22. https://doi.org/10.1080/10570318809389622
Namba, S., Makihara, S., Kabir, R. S., Miyatani, M., & Nakao, T. (2017). Spontaneous facial expressions are different from posed facial expressions: Morphological properties and dynamic sequences. Current Psychology, 36(3), 593–605. https://doi.org/10.1007/s12144-016-9448-9
O'Toole, A. J., Harms, J., Snow, S. L., Hurst, D. R., Pappas, M. R., Ayyad, J. H., & Abdi, H. (2005). A video database of moving faces and people. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5), 812–816. https://doi.org/10.1109/TPAMI.2005.90
Pantic, M., & Bartlett, M. S. (2007). Machine analysis of facial expressions. In K. Delac & M. Grgic (Eds.), Face recognition (pp. 377–416). I-Tech Education and Publishing. https://doi.org/10.5772/4847
Pantic, M., Valstar, M., Rademaker, R., & Maat, L. (2005). Web-based database for facial expression analysis. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME ’05) (pp. 317–321). IEEE. https://doi.org/10.1109/ICME.2005.1521424
Parkinson, B. (2005). Do facial movements express emotions or communicate motives? Personality and Social Psychology Review, 9, 278–311. https://doi.org/10.1207/s15327957pspr0904_1
Piironen, J., & Vehtari, A. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11, 5018–5051. https://projecteuclid.org/euclid.ejs/1513306866
R Core Team (2016). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/
Rinn, W. E. (1984). The neuropsychology of facial expression: A review of the neurological and psychological mechanisms for producing facial expressions. Psychological Bulletin, 95(1), 52–77. https://doi.org/10.1037/0033-2909.95.1.52
Roy, S., Roy, C., Éthier-Majcher, C., Belin, P., & Gosselin, F. (2007). STOIC: A database of dynamic and static faces expressing highly recognizable emotions. Montréal, Canada: Université De Montréal. https://www.researchgate.net/profile/Frederic_Gosselin2/publication/242092567_STOIC_A_database_of_dynamic_and_static_faces_expressing_highly_recognizable_emotions/links/552574530cf295bf160ea80b.pdf
Russell, J. A. (1994). Is there universal recognition of emotion from facial expression? A review of the cross-cultural studies. Psychological Bulletin, 115(1), 102–141. https://doi.org/10.1037/0033-2909.115.1.102
Sandbach, G., Zafeiriou, S., Pantic, M., & Yin, J. (2012). Static and dynamic 3D facial expression recognition: A comprehensive survey. Image and Vision Computing, 30, 683–697. https://doi.org/10.1016/j.imavis.2012.06.005
Scherer, K. R., & Bänziger, T. (2010). On the use of actor portrayals in research on emotional expression. In K. R. Scherer, T. Bänziger, & E. Roesch (Eds.), A blueprint for affective computing: A sourcebook (pp. 166–178). Oxford University Press.
Siedlecka, E., & Denson, T. F. (2019). Experimental Methods for Inducing Basic Emotions: A Qualitative Review. Emotion Review, 11(1), 87–97. https://doi.org/10.1177/1754073917749016
Sato, W., Krumhuber, E. G., Jellema, T., & Williams, J. H. G. (2019). Editorial: Dynamic emotional communication. Frontiers in Psychology, 10, 2836. https://doi.org/10.3389/fpsyg.2019.02836
Sato, W., & Yoshikawa, S. (2007). Spontaneous facial mimicry in response to dynamic facial expressions. Cognition, 104, 1–18. https://doi.org/10.1016/j.cognition.2006.05.001
Shaver, P., Schwartz, J., Kirson, D., & O’Connor, C. (1987). Emotion knowledge: Further exploration of a prototype approach. Journal of Personality and Social Psychology, 52(6), 1061–1086. https://doi.org/10.1037/0022-3514.52.6.1061
Skiendziel, T., Rösch, A. G., & Schultheiss, O. C. (2019). Assessing the convergent validity between the automated emotion recognition software Noldus FaceReader 7 and Facial Action Coding System Scoring. PLoS ONE, 14(10), e0223905. https://doi.org/10.1371/journal.pone.0223905
Sneddon, I., McRorie, M., McKeown, G., & Hanratty, J. (2012). The Belfast Induced Natural Emotion Database. IEEE Transactions on Affective Computing, 3(1), 32–41. https://doi.org/10.1109/T-AFFC.2011.26
Stöckli, S., Schulte-Mecklenbeck, M., Borer, S., & Samson, A. C. (2018). Facial expression analysis with AFFDEX and FACET: A validation study. Behavior Research Methods, 50(4), 1446–1460. https://doi.org/10.3758/s13428-017-0996-1
Tcherkassof, A., Dupré, D., Meillon, B., Mandran, N., Dubois, M., & Adam, J. M. (2013). DynEmo: A video database of natural facial expressions of emotions. The International Journal of Multimedia and Its Applications, 5(5), 61–80. https://doi.org/10.5121/ijma.2013.5505
Tottenham, N., Tanaka, J. W., Leon, A. C., McCarry, T., Nurse, M., Hare, T. A., Marcus, D. J., Westerlund, A., Casey, B. J., & Nelson, C. (2009). The NimStim set of facial expressions: judgments from untrained research participants. Psychiatry Research, 168(3), 242–249. https://doi.org/10.1016/j.psychres.2008.05.006
Van der Schalk, J., Hawk, S. T., Fischer, A. H., & Doosje, B. (2011). Moving faces, looking places: Validation of the Amsterdam Dynamic Facial Expression Set (ADFES). Emotion, 11(4), 907–920. https://doi.org/10.1037/a0023853
Van Erp, S., Oberski, D. L., & Mulder, J. (2019). Shrinkage priors for Bayesian penalized regression. Journal of Mathematical Psychology, 89, 31–50. https://doi.org/10.1016/j.jmp.2018.12.004
Wagner, H. L. (1990). The spontaneous facial expression of differential positive and negative emotions. Motivation and Emotion, 14, 27–43. https://doi.org/10.1007/BF00995547
Wallhoff, F. (2004). FGnet – Facial expression and emotion database. [Online]. https://www.jade-hs.de/fileadmin/team/frank-wallhoff/feedtum.pdf
Wehrle, T., Kaiser, S., Schmidt, S., & Scherer, K. R. (2000). Studying the dynamics of emotional expression using synthesized facial muscle movements. Journal of Personality and Social Psychology, 78, 105–119. https://doi.org/10.1037/0022-3514.78.1.105
Wingenbach, T. S. H., Ashwin, C., & Brosnan, M. (2016). Validation of the Amsterdam Dynamic Facial Expression Set–Bath Intensity Variations (ADFES-BIV): A set of videos expressing low, intermediate, and high intensity emotions. PLoS ONE, 11(1), e0147112. https://doi.org/10.1371/journal.pone.0147112
Yin, L., Chen, X., Sun, Y., Worm, T., & Reale, M. (2008). A high-resolution 3D dynamic facial expression database. In Proceedings of the Eighth International Conference on Automatic Face and Gesture Recognition (pp. 1–6). IEEE. https://doi.org/10.1109/AFGR.2008.4813324
Yitzhak, N., Giladi, N., Gurevich, T., Messinger, D. S., Prince, E. B., Martin, K., & Aviezer, H. (2017). Gently does it: Humans outperform a software classifier in recognizing subtle, nonstereotypical facial expressions. Emotion, 17, 1187–1198. https://doi.org/10.1037/emo0000287
Yu, H., Garrod, O. G. B., & Schyns, P. G. (2012). Perception-driven facial expression synthesis. Computers & Graphics, 36, 152–162.
Zeng, Z., Pantic, M., Roisman, G. I., & Huang, T. S. (2009). A survey of facial affect recognition methods: Audio, visual and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 39–58. https://doi.org/10.1109/tpami.2008.52
Zhang, L., Snavely, N., Curless, B., & Seitz, S. M. (2004). Spacetime faces: High resolution capture for modeling and animation. ACM Transactions on Graphics, 23, 548–558. https://doi.org/10.1145/1015706.1015759
Zhang, X., Yin, L., Cohn, J. F., Canavan, S., Reale, M., Horowitz, A., … Girard, J. M. (2014). BP4D-Spontaneous: A high-resolution spontaneous 3D dynamic facial expression database. Image and Vision Computing, 32, 692–706. https://doi.org/10.1016/j.imavis.2014.06.002
Zinchenko, O., Yaple, Z. A., & Arsalidou, M. (2018). Brain responses to dynamic facial Expressions: A Normative Meta-Analysis. Frontiers in Human Neuroscience, 12, 227. https://doi.org/10.3389/fnhum.2018.00227
Acknowledgements
The authors would like to thank Jasmine Or, Sylvie Simons, and Gerda Storpirstyte for their help with data collection.
Open Practices Statement
The R code and data are available at https://github.com/bigpas/face_exp_analysis/ and https://osf.io/zx4at/?view_only=e05d7d3343da491a9c8d7e832e9d4871
Electronic supplementary material
ESM 1
(DOCX 615 kb)
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Krumhuber, E.G., Küster, D., Namba, S. et al. Human and machine validation of 14 databases of dynamic facial expressions. Behav Res 53, 686–701 (2021). https://doi.org/10.3758/s13428-020-01443-y