Frame vs. Turn-Level: Emotion Recognition from Speech Considering Static and Dynamic Processing

Vlasenko, Bogdan; Schuller, Björn; Wendemuth, Andreas; Rigoll, Gerhard

doi:10.1007/978-3-540-74889-2_13

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4738)

Included in the following conference series:

International Conference on Affective Computing and Intelligent Interaction

Abstract

In contrast to the predominant approach of computing turn-wise statistics of acoustic Low-Level-Descriptors followed by static classification, we re-investigate dynamic modeling directly on the frame level in speech-based emotion recognition. This seems beneficial, as it is well known that important information exists on temporal layers below the turn. Most promisingly, we integrate this frame-level information within a state-of-the-art large-feature-space emotion recognition engine. To investigate frame-level processing, we employ a typical speaker-recognition set-up tailored to emotion classification: a GMM for classification, with MFCC plus speed and acceleration coefficients as features. We also consider the use of multiple states, that is, an HMM. To fuse this information with turn-based modeling, output scores are added to a super-vector combined with static acoustic features. For the latter, a variety of Low-Level-Descriptors and functionals covering prosodic, speech quality, and articulatory aspects is considered. Starting from 1.4k features, we select optimal configurations including and excluding GMM information. The final decision is made by an SVM. Extensive test runs are carried out on two popular public databases, EMO-DB and SUSAS, to investigate acted and spontaneous data. As we face the current challenge of speaker-independent analysis, we also discuss the benefits arising from speaker normalization. The results obtained clearly emphasize the superior power of integrating diverse time levels.
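
The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the central-difference delta formula, the diagonal covariances, and the use of a mean per-frame log-likelihood as the turn score are all assumptions made for illustration. GMM parameters are taken as given (training is omitted), and the final SVM stage is left out.

```python
import math

def delta(frames):
    """Central-difference 'speed' coefficients; edge frames are repeated.
    (Assumption: a simple stand-in for a regression-based delta.)"""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def add_dynamics(frames):
    """Append speed and acceleration coefficients to each frame vector."""
    speed = delta(frames)
    accel = delta(speed)
    return [f + s + a for f, s, a in zip(frames, speed, accel)]

def gmm_frame_loglik(x, weights, means, variances):
    """log p(x) under a diagonal-covariance GMM, computed with log-sum-exp."""
    logs = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, m, v in zip(x, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        logs.append(ll)
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

def turn_score(frames, gmm):
    """Mean per-frame log-likelihood of one turn under one class GMM."""
    weights, means, variances = gmm
    total = sum(gmm_frame_loglik(x, weights, means, variances) for x in frames)
    return total / len(frames)

def fuse(static_features, frames, class_gmms):
    """Append one GMM score per emotion class to the turn-level vector."""
    scores = [turn_score(frames, class_gmms[c]) for c in sorted(class_gmms)]
    return list(static_features) + scores

# Toy usage: 1-dimensional "MFCC" frames and two hypothetical class models.
frames = add_dynamics([[0.0], [2.0], [4.0]])
gmms = {
    "anger":   ([1.0], [[2.0, 1.0, 0.0]], [[1.0, 1.0, 1.0]]),
    "neutral": ([1.0], [[0.0, 0.0, 0.0]], [[1.0, 1.0, 1.0]]),
}
super_vector = fuse([0.3, 0.7], frames, gmms)
```

The resulting super-vector holds the static turn-level features followed by one score per class, and would then be passed to the final classifier.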

Author information

Authors and Affiliations

Cognitive Systems, IESK, Otto-von-Guericke University, Magdeburg, Germany
Bogdan Vlasenko & Andreas Wendemuth
Institute for Human-Machine Communication, Technische Universität München, Germany
Björn Schuller & Gerhard Rigoll

Editor information

Ana C. R. Paiva, Rui Prada, Rosalind W. Picard

Cite this paper

Vlasenko, B., Schuller, B., Wendemuth, A., Rigoll, G. (2007). Frame vs. Turn-Level: Emotion Recognition from Speech Considering Static and Dynamic Processing. In: Paiva, A.C.R., Prada, R., Picard, R.W. (eds) Affective Computing and Intelligent Interaction. ACII 2007. Lecture Notes in Computer Science, vol 4738. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74889-2_13

DOI: https://doi.org/10.1007/978-3-540-74889-2_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74888-5
Online ISBN: 978-3-540-74889-2
eBook Packages: Computer Science, Computer Science (R0)
