ABSTRACT
In this paper we build on our recent work, in which we successfully incorporated facial depth data of a speaker, captured by the Microsoft Kinect device, as a third data stream in an audio-visual automatic speech recognizer. In particular, we focus on whether the depth stream provides sufficient speech information to improve system robustness in noisy audio-visual conditions, thus studying system operation beyond the traditional scenario where noise is applied to the audio signal alone. For this purpose, we consider four realistic visual-modality degradations at various noise levels, and we conduct small-vocabulary recognition experiments on an appropriate, previously collected, audio-visual database. Our results demonstrate improved system performance due to the depth modality, as well as a considerable accuracy increase when using both the visual and depth modalities over audio-only speech recognition.
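As context for the multi-stream setup and the visual degradations described above, the following minimal sketch illustrates weighted log-likelihood fusion of the three streams (audio, visual, depth) and one simple additive-noise video degradation. This is not the paper's actual implementation; the function names, the Gaussian noise model, and the weight values are illustrative assumptions only.

```python
import numpy as np

def fuse_stream_scores(stream_log_likelihoods, stream_weights):
    """Weighted log-likelihood (decision) fusion across streams:
    sum_s lambda_s * log p_s(o_s | q), with the stream exponents
    lambda_s constrained to sum to one."""
    ll = np.asarray(stream_log_likelihoods, dtype=float)
    w = np.asarray(stream_weights, dtype=float)
    if not np.isclose(w.sum(), 1.0):
        raise ValueError("stream weights must sum to 1")
    return float(np.dot(w, ll))

def degrade_frame(frame, noise_std, rng):
    """Simulate one simple visual degradation: additive Gaussian noise
    on an 8-bit grayscale frame, clipped back to the valid pixel range."""
    noisy = frame.astype(float) + rng.normal(0.0, noise_std, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

if __name__ == "__main__":
    # Score one word hypothesis; the visual and depth streams are
    # up-weighted, as one might do under acoustic noise (weights are
    # illustrative, not tuned values from the paper).
    print(fuse_stream_scores([-10.0, -20.0, -15.0], [0.2, 0.4, 0.4]))
```

In practice the stream weights would be tuned per noise condition on held-out data, which is what makes the depth stream useful when the visual channel, rather than the audio channel, is degraded.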
Index Terms
- Audio-visual speech recognition using depth information from the Kinect in noisy video conditions