Research Article · Honorable Mention
CHI Conference Proceedings
DOI: 10.1145/3290605.3300376

SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks

Authors: Naoki Kimura, Michinari Kono, and Jun Rekimoto

Published: 2 May 2019

ABSTRACT

The availability of digital devices operated by voice is expanding rapidly, yet the situations in which voice interfaces can be used remain restricted: speaking in public places disturbs the people nearby, confidential information should not be spoken aloud, and environmental noise may reduce speech recognition accuracy. To address these limitations, we propose a system that detects a user's unvoiced utterances. From internal articulatory information observed by an ultrasonic imaging sensor attached to the underside of the jaw, the system recognizes the content of an utterance without the user vocalizing at all. Our deep neural network model converts the sequence of ultrasound images into acoustic features. We confirmed that the audio signals generated by our system can control existing smart speakers. We also observed that users can adjust their oral movements, learning over time to improve recognition accuracy.
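The pipeline sketched in the abstract, a deep neural network that maps a sequence of ultrasound images to acoustic features, which are then rendered as audio, can be illustrated with a small model. The page reproduced here contains no code, so the following Keras sketch is purely illustrative: the input resolution, window length, layer sizes, and the choice of mel-spectrogram frames as the acoustic target are all hypothetical assumptions, not the authors' architecture.

```python
# Hypothetical sketch (not the authors' model): map a sequence of ultrasound
# images of the vocal tract to acoustic feature frames. All shapes and layer
# sizes below are illustrative assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, IMG_H, IMG_W = 16, 64, 64  # frames per window, image size (assumed)
N_MELS = 32                         # acoustic feature bins per frame (assumed)

def build_model():
    frames = layers.Input(shape=(SEQ_LEN, IMG_H, IMG_W, 1))
    # Per-frame CNN encoder, shared across the time axis.
    cnn = models.Sequential([
        layers.Conv2D(16, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
    ])
    x = layers.TimeDistributed(cnn)(frames)
    # Recurrent layer models articulator motion across frames.
    x = layers.LSTM(128, return_sequences=True)(x)
    # Regress one acoustic feature vector per input frame.
    mel = layers.TimeDistributed(layers.Dense(N_MELS))(x)
    return models.Model(frames, mel)

model = build_model()
model.compile(optimizer="adam", loss="mse")

# Training would pair silently articulated ultrasound clips with acoustic
# features extracted from separately recorded voiced utterances.
dummy_x = np.random.rand(4, SEQ_LEN, IMG_H, IMG_W, 1).astype("float32")
dummy_y = np.random.rand(4, SEQ_LEN, N_MELS).astype("float32")
model.fit(dummy_x, dummy_y, epochs=1, verbose=0)

# At inference time, predicted feature frames could be rendered to a waveform
# with a phase-reconstruction vocoder (e.g., Griffin-Lim) and played back
# toward a smart speaker.
```

In such a design, the time-distributed CNN encodes tongue shape per frame while the LSTM models its motion over time; a vocoder then turns the predicted frames into audible speech.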


Supplemental Material

paper146.mp4 (MP4, 268 MB)


Published in

CHI '19: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems
May 2019, 9077 pages
ISBN: 978-1-4503-5970-2
DOI: 10.1145/3290605

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States



Acceptance Rates

CHI '19 paper acceptance rate: 703 of 2,958 submissions (24%). Overall CHI acceptance rate: 6,199 of 26,314 submissions (24%).
