Research article
DOI: 10.1145/2413097.2413100

Audio-visual speech recognition using depth information from the Kinect in noisy video conditions

Published: 06 June 2012

ABSTRACT

In this paper we build on our recent work, where we successfully incorporated facial depth data of a speaker, captured by the Microsoft Kinect device, as a third data stream in an audio-visual automatic speech recognizer. In particular, we focus on whether the depth stream provides sufficient speech information to improve system robustness to noisy audio-visual conditions, thus studying system operation beyond the traditional scenarios where noise is applied to the audio signal alone. For this purpose, we consider four realistic visual modality degradations at various noise levels, and we conduct small-vocabulary recognition experiments on an appropriate, previously collected audio-visual database. Our results demonstrate improved system performance due to the depth modality, as well as a considerable accuracy increase when using both the visual and depth modalities over audio-only speech recognition.
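The paper itself contains no code, but the abstract's core idea, degrading the visual stream at controlled noise levels and feeding audio, visual, and depth features to a single recognizer, can be illustrated with a short, self-contained sketch. Everything below is an assumption made for illustration: the function names degrade_frame and fuse_streams are hypothetical, the specific degradation types are guesses (the abstract only says that four realistic degradations are applied at various levels), and plain feature concatenation stands in for the authors' actual multi-stream recognition setup.

```python
import numpy as np
import cv2  # OpenCV, used here only for basic image operations


def degrade_frame(frame, kind, level):
    """Apply a simulated visual degradation to a single video frame (uint8 image).

    The degradation types below are illustrative guesses, not the four
    degradations used in the paper.
    """
    if kind == "gaussian_noise":
        noisy = frame.astype(np.float32) + np.random.normal(0.0, level, frame.shape)
        return np.clip(noisy, 0, 255).astype(np.uint8)
    if kind == "blur":
        k = 2 * int(level) + 1  # Gaussian kernel size must be odd
        return cv2.GaussianBlur(frame, (k, k), 0)
    if kind == "occlusion":
        out = frame.copy()
        h, w = out.shape[:2]
        s = int(level)  # side length of a black square occluder, in pixels
        y = np.random.randint(0, max(1, h - s))
        x = np.random.randint(0, max(1, w - s))
        out[y:y + s, x:x + s] = 0
        return out
    raise ValueError(f"unknown degradation: {kind}")


def fuse_streams(audio_feats, visual_feats, depth_feats):
    """Frame-synchronous concatenation of audio, visual, and depth feature matrices.

    A real multi-stream recognizer would weight the streams inside the decoder;
    simple concatenation is used here only to show where the depth stream
    plugs in as a third modality.
    """
    n = min(len(audio_feats), len(visual_feats), len(depth_feats))
    return np.hstack([audio_feats[:n], visual_feats[:n], depth_feats[:n]])
```

In practice, the feature extraction, stream weighting, and decoder configuration are what determine the reported accuracy gains; the sketch only conveys the three-stream structure and the idea of corrupting the visual input rather than the audio.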


Published in:
PETRA '12: Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments
June 2012, 307 pages
ISBN: 9781450313001
DOI: 10.1145/2413097
Copyright © 2012 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
