ABSTRACT
In this paper we build on our recent work, in which we successfully incorporated facial depth data of a speaker, captured by the Microsoft Kinect device, as a third data stream in an audio-visual automatic speech recognizer. In particular, we focus on whether the depth stream provides sufficient speech information to improve system robustness in noisy audio-visual conditions, thus studying system operation beyond the traditional scenario where noise is applied to the audio signal alone. For this purpose, we consider four realistic visual-modality degradations at various noise levels, and we conduct small-vocabulary recognition experiments on an appropriate, previously collected, audio-visual database. Our results demonstrate improved system performance due to the depth modality, as well as a considerable accuracy increase when using both the visual and depth modalities over audio-only speech recognition.
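As context for the multi-stream setup and the visual degradations described above, the following minimal sketch illustrates weighted log-likelihood fusion of the three streams (audio, visual, depth) and one simple additive-noise video degradation. This is not the paper's actual implementation; the function names, the Gaussian noise model, and the weight values are illustrative assumptions only.

```python
import numpy as np

def fuse_stream_scores(stream_log_likelihoods, stream_weights):
    """Weighted log-likelihood (decision) fusion across streams:
    sum_s lambda_s * log p_s(o_s | q), with the stream exponents
    lambda_s constrained to sum to one."""
    ll = np.asarray(stream_log_likelihoods, dtype=float)
    w = np.asarray(stream_weights, dtype=float)
    if not np.isclose(w.sum(), 1.0):
        raise ValueError("stream weights must sum to 1")
    return float(np.dot(w, ll))

def degrade_frame(frame, noise_std, rng):
    """Simulate one simple visual degradation: additive Gaussian noise
    on an 8-bit grayscale frame, clipped back to the valid pixel range."""
    noisy = frame.astype(float) + rng.normal(0.0, noise_std, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

if __name__ == "__main__":
    # Score one word hypothesis; the visual and depth streams are
    # up-weighted, as one might do under acoustic noise (weights are
    # illustrative, not tuned values from the paper).
    print(fuse_stream_scores([-10.0, -20.0, -15.0], [0.2, 0.4, 0.4]))
```

In practice the stream weights would be tuned per noise condition on held-out data, which is what makes the depth stream useful when the visual channel, rather than the audio channel, is degraded.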
Index Terms
- Audio-visual speech recognition using depth information from the Kinect in noisy video conditions