SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks

ABSTRACT
The availability of voice-operated digital devices is expanding rapidly. However, the settings in which voice interfaces can be used remain restricted: speaking in public places can annoy people nearby, confidential information should not be spoken aloud, and environmental noise may reduce speech recognition accuracy. To address these limitations, we propose a system that detects a user's unvoiced utterances. From internal articulatory information observed by an ultrasonic imaging sensor attached to the underside of the jaw, the proposed system recognizes utterance content without the user vocalizing. A deep neural network model converts the sequence of ultrasound images into acoustic features. We confirmed that audio signals generated by our system can control existing smart speakers. We also observed that users can adapt their oral movements through practice to improve recognition accuracy.
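The abstract describes a pipeline that maps a sequence of ultrasound images to per-frame acoustic features: a per-frame encoder followed by a temporal model. The sketch below is a hypothetical, dependency-free stand-in for that data flow, not the authors' trained network; `encode_frame` (patch mean-pooling) stands in for a convolutional encoder, and `temporal_smooth` (exponential smoothing) stands in for the recurrent stage that carries articulatory context across frames.

```python
# Schematic stand-in for an ultrasound-to-acoustic-feature pipeline.
# NOT the paper's network: encode_frame and temporal_smooth are
# illustrative placeholders for a CNN encoder and an RNN, respectively.

def encode_frame(frame, patch=4):
    """Mean-pool non-overlapping patch x patch blocks of one
    ultrasound frame into a flat feature vector."""
    h, w = len(frame), len(frame[0])
    feats = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            block = [frame[y][x]
                     for y in range(i, i + patch)
                     for x in range(j, j + patch)]
            feats.append(sum(block) / len(block))
    return feats

def temporal_smooth(frame_feats, alpha=0.7):
    """Carry context across frames with exponential smoothing,
    yielding one feature vector per input frame."""
    state = [0.0] * len(frame_feats[0])
    out = []
    for f in frame_feats:
        state = [alpha * s + (1 - alpha) * x for s, x in zip(state, f)]
        out.append(list(state))
    return out

def ultrasound_to_features(frames):
    """Full pipeline: per-frame encoding, then temporal modeling."""
    return temporal_smooth([encode_frame(f) for f in frames])

# Example: 10 frames of a 16x16 ultrasound image sequence.
frames = [[[0.5] * 16 for _ in range(16)] for _ in range(10)]
feats = ultrasound_to_features(frames)
```

In the actual system these acoustic features would be converted to an audio waveform (e.g., by a vocoder) and played to a smart speaker; the stand-in only illustrates the image-sequence-to-feature-sequence shape of the problem.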