ABSTRACT
Speech recognition is a natural means of interaction between a human and a smart assistive environment. For this interaction to be effective, such a system should attain a high recognition rate even under adverse conditions. Audio-visual speech recognition (AVSR) can help in such environments, especially in the presence of audio noise. However, the impact of visual noise on AVSR performance has not been studied sufficiently in the literature. In this paper, we examine the effects of visual noise on AVSR, reporting experiments on the relatively simple task of connected digit recognition under moderate acoustic noise and a variety of types of visual noise. The latter can be caused by faulty sensors or by video signal transmission problems, both of which can occur in smart assistive environments. Our AVSR system exhibits higher accuracy than an audio-only recognizer and robust performance in most of the noisy-video cases considered.
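The abstract does not specify the exact visual-noise models used in the experiments; as a minimal illustration only, the sketch below corrupts a grayscale video frame with two degradations commonly used to emulate faulty sensors (additive Gaussian noise) and lossy transmission (salt-and-pepper impulses). Function names and parameter values are hypothetical, not taken from the paper.

```python
import numpy as np

def add_gaussian_noise(frame, sigma=20.0, rng=None):
    """Emulate sensor noise: additive zero-mean Gaussian noise on an 8-bit frame."""
    rng = rng or np.random.default_rng(0)
    noisy = frame.astype(np.float64) + rng.normal(0.0, sigma, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def add_salt_pepper_noise(frame, p=0.05, rng=None):
    """Emulate transmission errors: a fraction p of pixels forced to black or white."""
    rng = rng or np.random.default_rng(0)
    noisy = frame.copy()
    mask = rng.random(frame.shape)
    noisy[mask < p / 2] = 0          # "pepper" pixels
    noisy[mask > 1 - p / 2] = 255    # "salt" pixels
    return noisy

# Dummy mid-gray frame standing in for a mouth-region-of-interest crop.
frame = np.full((64, 64), 128, dtype=np.uint8)
g = add_gaussian_noise(frame)
sp = add_salt_pepper_noise(frame)
```

In an AVSR experiment such corruption would be applied per frame before visual feature extraction, so that recognition accuracy can be measured as a function of the noise level (e.g. `sigma` or `p`).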
Index Terms
- Audio-visual speech recognition in noisy visual environments