SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks

ABSTRACT
The availability of voice-operated digital devices is expanding rapidly. However, the settings in which voice interfaces can be used remain restricted: speaking in public places can annoy people nearby, confidential information should not be spoken aloud, and environmental noise may reduce speech recognition accuracy. To address these limitations, we propose a system that detects a user's unvoiced utterances. From internal articulatory information observed by an ultrasonic imaging sensor attached to the underside of the jaw, the proposed system recognizes utterance content without the user vocalizing. A deep neural network model converts the sequence of ultrasound images into acoustic features. We confirmed that audio signals generated by our system can control existing smart speakers. We also observed that users can adapt their oral movements through practice to improve recognition accuracy.
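The abstract describes a pipeline that maps a sequence of ultrasound images to per-frame acoustic features: a per-frame encoder followed by a temporal model. The sketch below is a hypothetical, dependency-free stand-in for that data flow, not the authors' trained network; `encode_frame` (patch mean-pooling) stands in for a convolutional encoder, and `temporal_smooth` (exponential smoothing) stands in for the recurrent stage that carries articulatory context across frames.

```python
# Schematic stand-in for an ultrasound-to-acoustic-feature pipeline.
# NOT the paper's network: encode_frame and temporal_smooth are
# illustrative placeholders for a CNN encoder and an RNN, respectively.

def encode_frame(frame, patch=4):
    """Mean-pool non-overlapping patch x patch blocks of one
    ultrasound frame into a flat feature vector."""
    h, w = len(frame), len(frame[0])
    feats = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            block = [frame[y][x]
                     for y in range(i, i + patch)
                     for x in range(j, j + patch)]
            feats.append(sum(block) / len(block))
    return feats

def temporal_smooth(frame_feats, alpha=0.7):
    """Carry context across frames with exponential smoothing,
    yielding one feature vector per input frame."""
    state = [0.0] * len(frame_feats[0])
    out = []
    for f in frame_feats:
        state = [alpha * s + (1 - alpha) * x for s, x in zip(state, f)]
        out.append(list(state))
    return out

def ultrasound_to_features(frames):
    """Full pipeline: per-frame encoding, then temporal modeling."""
    return temporal_smooth([encode_frame(f) for f in frames])

# Example: 10 frames of a 16x16 ultrasound image sequence.
frames = [[[0.5] * 16 for _ in range(16)] for _ in range(10)]
feats = ultrasound_to_features(frames)
```

In the actual system these acoustic features would be converted to an audio waveform (e.g., by a vocoder) and played to a smart speaker; the stand-in only illustrates the image-sequence-to-feature-sequence shape of the problem.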