ABSTRACT
Speech recognition is a natural means of interaction between a human and a smart assistive environment. For this interaction to be effective, such a system should attain a high recognition rate even under adverse conditions. Audio-visual speech recognition (AVSR) can help in such environments, especially in the presence of audio noise. However, the impact of visual noise on AVSR performance has not been studied sufficiently in the literature. In this paper, we examine the effects of visual noise on AVSR, reporting experiments on the relatively simple task of connected digit recognition under moderate acoustic noise and a variety of types of visual noise. The latter can be caused by faulty sensors or by video signal transmission problems, both of which can occur in smart assistive environments. Our AVSR system exhibits higher accuracy than an audio-only recognizer and robust performance in most of the noisy-video cases considered.
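The abstract does not specify the exact visual-noise models used in the experiments; as a minimal illustration only, the sketch below corrupts a grayscale video frame with two degradations commonly used to emulate faulty sensors (additive Gaussian noise) and lossy transmission (salt-and-pepper impulses). Function names and parameter values are hypothetical, not taken from the paper.

```python
import numpy as np

def add_gaussian_noise(frame, sigma=20.0, rng=None):
    """Emulate sensor noise: additive zero-mean Gaussian noise on an 8-bit frame."""
    rng = rng or np.random.default_rng(0)
    noisy = frame.astype(np.float64) + rng.normal(0.0, sigma, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def add_salt_pepper_noise(frame, p=0.05, rng=None):
    """Emulate transmission errors: a fraction p of pixels forced to black or white."""
    rng = rng or np.random.default_rng(0)
    noisy = frame.copy()
    mask = rng.random(frame.shape)
    noisy[mask < p / 2] = 0          # "pepper" pixels
    noisy[mask > 1 - p / 2] = 255    # "salt" pixels
    return noisy

# Dummy mid-gray frame standing in for a mouth-region-of-interest crop.
frame = np.full((64, 64), 128, dtype=np.uint8)
g = add_gaussian_noise(frame)
sp = add_salt_pepper_noise(frame)
```

In an AVSR experiment such corruption would be applied per frame before visual feature extraction, so that recognition accuracy can be measured as a function of the noise level (e.g. `sigma` or `p`).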
Index Terms
- Audio-visual speech recognition in noisy visual environments