
An Attentive Survey of Attention Models

Published: 22 October 2021

Abstract

The attention model has become an important concept in neural networks and has been researched across diverse application domains. This survey provides a structured and comprehensive overview of developments in modeling attention. In particular, we propose a taxonomy that groups existing techniques into coherent categories. We review salient neural architectures in which attention has been incorporated and discuss applications in which modeling attention has had a significant impact. We also describe how attention has been used to improve the interpretability of neural networks. Finally, we discuss some future research directions for attention. We hope this survey will provide a succinct introduction to attention models and guide practitioners in developing approaches for their applications.
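For readers unfamiliar with the basic computation behind the attention models the survey covers, the sketch below illustrates the widely used scaled dot-product form, attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. It is a minimal NumPy illustration, not the formulation of any specific surveyed paper; the function name, array shapes, and toy inputs are assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V and the attention weights.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # alignment scores between queries and keys
    weights = softmax(scores, axis=-1)   # one distribution over the keys per query
    return weights @ V, weights

# Toy usage: 2 queries attending over 3 key/value pairs.
rng = np.random.default_rng(0)
out, w = scaled_dot_product_attention(
    rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 8))
)
print(out.shape, w.shape)  # (2, 8) (2, 3)
```

Each output row is a convex combination of the value vectors, with weights given by how strongly the corresponding query aligns with each key; most variants discussed in the survey differ in how these scores and weights are computed.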



        • Published in

ACM Transactions on Intelligent Systems and Technology, Volume 12, Issue 5
October 2021
383 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/3484925
• Editor: Huan Liu

          Copyright © 2021 Association for Computing Machinery.

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 22 October 2021
          • Accepted: 1 May 2021
          • Revised: 1 April 2021
          • Received: 1 November 2020
Published in TIST Volume 12, Issue 5


          Qualifiers

          • research-article
          • Refereed
