Recognition of isolated words using Zernike and MFCC features for audio visual speech recognition

Borde, Prashant; Varpe, Amarsinh; Manza, Ramesh; Yannawar, Pravin

doi:10.1007/s10772-014-9257-1

Published: 21 October 2014

Volume 18, pages 167–175 (2015)

International Journal of Speech Technology

Abstract

Automatic speech recognition by machine is an attractive research topic in the signal processing domain and has drawn many researchers to contribute to this area. In recent years, there have been many advances in automatic speech-reading systems that combine audio and visual speech features to recognize words under noisy conditions. The objective of an audio-visual speech recognition system is to improve recognition accuracy. In this paper we computed visual features using Zernike moments and audio features using mel-frequency cepstral coefficients (MFCC) on the visual vocabulary of independent standard words dataset, which contains a collection of isolated city names uttered by ten speakers. The visual features were normalized, and the dimension of the feature set was reduced by principal component analysis (PCA) so that isolated word utterances could be recognized in PCA space. Recognition of isolated words based on visual-only and audio-only features yielded performance of 63.88 % and 100 %, respectively.
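The visual part of the pipeline summarized above (Zernike moments, normalization, PCA) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `radial_poly`, `zernike_magnitudes`, and `pca_project`, the maximum moment order, the number of retained components, and the random arrays standing in for mouth-region frames are all assumptions made for illustration. The moments are computed up to a constant pixel-area factor, which the later normalization absorbs. The audio (MFCC) branch is not covered here.

```python
import numpy as np
from math import factorial


def radial_poly(n, m, rho):
    """Zernike radial polynomial R_n^m evaluated on an array of radii."""
    m = abs(m)
    out = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        coeff = ((-1) ** s * factorial(n - s)
                 / (factorial(s)
                    * factorial((n + m) // 2 - s)
                    * factorial((n - m) // 2 - s)))
        out += coeff * rho ** (n - 2 * s)
    return out


def zernike_magnitudes(img, max_order=4):
    """Magnitudes |Z_nm| for 0 <= n <= max_order, m >= 0, n - m even."""
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w]
    # Map pixel centres onto [-1, 1] so the image fits the unit disc.
    xn = (2 * x - (w - 1)) / (w - 1)
    yn = (2 * y - (h - 1)) / (h - 1)
    rho = np.hypot(xn, yn)
    theta = np.arctan2(yn, xn)
    inside = rho <= 1.0  # Zernike polynomials are defined on the unit disc
    feats = []
    for n in range(max_order + 1):
        for m in range(n % 2, n + 1, 2):
            # Complex conjugate of the basis function V_nm.
            basis = radial_poly(n, m, rho) * np.exp(-1j * m * theta)
            z = (n + 1) / np.pi * np.sum(img[inside] * basis[inside])
            feats.append(abs(z))
    return np.array(feats)


def pca_project(X, k):
    """Z-score each feature column, then project onto the top k principal axes."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant features
    Xn = (X - mu) / sd
    _, _, vt = np.linalg.svd(Xn, full_matrices=False)
    return Xn @ vt[:k].T


# Synthetic stand-ins for mouth-region frames (assumption: 32x32 grayscale).
rng = np.random.default_rng(0)
frames = rng.random((6, 32, 32))
features = np.stack([zernike_magnitudes(f) for f in frames])
projected = pca_project(features, k=3)
print(features.shape, projected.shape)
```

Only the magnitudes of the moments are kept because they do not change when the image is rotated, which is one common reason for using Zernike moments as shape descriptors. A classifier operating in the reduced space would then compare `projected` rows; the abstract does not state which one the authors used.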

Acknowledgments

The authors gratefully acknowledge the Department of Science and Technology (DST) for providing financial assistance for the Major Research Project sanctioned under the Fast Track Scheme for Young Scientist, vide sanction number SERB/1766/2013/14, and the authorities of Dr. Babasaheb Ambedkar Marathwada University, Aurangabad (MS), India, for providing the infrastructure for this research work.

Author information

Authors and Affiliations

Vision and Intelligent System Lab, Department of Computer Science and IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, MS, India
Prashant Borde, Amarsinh Varpe & Pravin Yannawar
Biomedical Image Processing Lab, Department of Computer Science and IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, MS, India
Ramesh Manza

Corresponding author

Correspondence to Prashant Borde.

About this article

Cite this article

Borde, P., Varpe, A., Manza, R. et al. Recognition of isolated words using Zernike and MFCC features for audio visual speech recognition. Int J Speech Technol 18, 167–175 (2015). https://doi.org/10.1007/s10772-014-9257-1

Received: 05 July 2014
Accepted: 10 October 2014
Published: 21 October 2014
Issue Date: June 2015
DOI: https://doi.org/10.1007/s10772-014-9257-1
