Abstract
Attention has become an important concept in neural networks and has been researched across diverse application domains. This survey provides a structured and comprehensive overview of developments in modeling attention. In particular, we propose a taxonomy that groups existing techniques into coherent categories. We review salient neural architectures that incorporate attention and discuss applications in which modeling attention has had a significant impact. We also describe how attention has been used to improve the interpretability of neural networks. Finally, we discuss future research directions for attention. We hope this survey will provide a succinct introduction to attention models and guide practitioners in developing approaches for their applications.
- Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alexander A. Alemi. 2018. Watch your step: Learning node embeddings via graph attention. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 9180–9190. Google ScholarDigital Library
- Artaches Ambartsoumian and Fred Popowich. 2018. Self-attention: A better building block for sentiment analysis neural network classifiers. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. Association for Computational Linguistics, 130–139.Google ScholarCross Ref
- Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6077–6086.Google ScholarCross Ref
- Lei Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. 2014. Multiple object recognition with visual attention. In Proceedings of the International Conference on Learning Representations.Google Scholar
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations.Google Scholar
- Daniel Beck, Gholamreza Haffari, and Trevor Cohn. 2018. Graph-to-sequence learning using gated graph neural networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 273–283.Google ScholarCross Ref
- Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1442–1451.Google ScholarCross Ref
- Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, and Ilya2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33. 1877–1901.Google Scholar
- Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Lecture Notes in Computer Science, Vol. 12346). Springer, 213–229.Google Scholar
- William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. IEEE, 4960–4964.Google ScholarDigital Library
- Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. 2020. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 119. PMLR, 1691–1703.Google Scholar
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv:1904.10509. Retrieved from https://arxiv.org/abs/1904.10509.Google Scholar
- Chung-Cheng Chiu and Colin Raffel. 2017. Monotonic chunkwise attention. arXiv:1712.05382. Retrieved from https://arxiv.org/abs/1712.05382.Google Scholar
- Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. 2015. Describing multimedia content using attention-based encoder-decoder networks. IEEE Trans. Multimedia 17, 11 (2015), 1875–1886.Google ScholarDigital Library
- Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of 8th Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8). Association for Computational Linguistics, 103–111.Google ScholarCross Ref
- Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1724–1734.Google ScholarCross Ref
- Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 93–98.Google Scholar
- Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J. Colwell, and Adrian Weller. 2021. Rethinking attention with performers. In Proceedings of the International Conference on Learning Representations.Google Scholar
- Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems. MIT Press, 577–585. Google ScholarDigital Library
- Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 276–286.Google Scholar
- Limeng Cui, Haeseung Seo, Maryam Tabar, Fenglong Ma, Suhang Wang, and Dongwon Lee. 2020. DETERRENT: Knowledge guided graph attention network for detecting healthcare misinformation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Association for Computing Machinery, 492–502.Google ScholarDigital Library
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2978–2988.Google ScholarCross Ref
- Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency. Association for Computing Machinery, 120–128. Google ScholarDigital Library
- Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers. In Proceedings of the 7th International Conference on Learning Representations.Google Scholar
- Zhongfen Deng, Hao Peng, Congying Xia, Jianxin Li, Lifang He, and Philip Yu. 2020. Hierarchical bi-directional self-attention networks for paper review rating recommendation. In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, 6302–6314.Google ScholarCross Ref
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 4171–4186.Google Scholar
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations.Google Scholar
- Jun Feng, Minlie Huang, Yang Yang, and Xiaoyan Zhu. 2016. GAKE: Graph aware knowledge embedding. In Proceedings of the 26th International Conference on Computational Linguistics. The COLING 2016 Organizing Committee, 641–651.Google Scholar
- Keisuke Fujii, Naoya Takeishi, Yoshinobu Kawahara, and Kazuya Takeda. 2020. Policy learning with partial observation and mechanical constraints for multi-person modeling. arXiv:2007.03155. Retrieved from https://arxiv.org/abs/2007.03155.Google Scholar
- Andrea Galassi, Marco Lippi, and Paolo Torroni. 2020. Attention in natural language processing. IEEE Trans. Neural Netw. Learn. Syst. (2020), 1–18. Google Scholar
- Alex Graves, Greg Wayne, and Ivo Danihelka. 2014a. Neural turing machines. arXiv:1410.5401. Retrieved from https://arxiv.org/abs/1410.5401.Google Scholar
- Alex Graves, Greg Wayne, and Ivo Danihelka. 2014b. Neural turing machines. arXiv:1410.5401. Retrieved from https://arxiv.org/abs/1410.5401.Google Scholar
- Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. 2015. DRAW: A recurrent neural network for image generation.Proceedings of Machine Learning Research, Vol. 37. PMLR, 1462–1471. Google ScholarDigital Library
- Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. A survey of methods for explaining black box models. Comput. Surv. 51, 5, Article 93 (2018), 42 pages. Google ScholarDigital Library
- Xiangnan He, Zhankui He, Jingkuan Song, Zhenguang Liu, Yu-Gang Jiang, and Tat-Seng Chua. 2018. NAIS: Neural attentive item similarity model for recommendation. IEEE Trans. Knowl. Data Eng. 30, 12 (2018), 2354–2366. https://doi.org/10.1109/TKDE.2018.2831682Google ScholarDigital Library
- Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 1693–1701. Google ScholarDigital Library
- Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics, 3543–3556.Google Scholar
- Saumya Jetley, Nicholas A. Lord, Namhoon Lee, and Philip Torr. 2018. Learn to pay attention. In Proceedings of the International Conference on Learning Representations.Google Scholar
- Wang-Cheng Kang and Julian J. McAuley. 2018. Self-attentive sequential recommendation. In Proceedings of the IEEE International Conference on Data Mining. IEEE Computer Society, 197–206.Google Scholar
- Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2021. Transformers in vision: A survey. arXiv:2101.01169. Retrieved from https://arxiv.org/abs/2101.01169.Google ScholarDigital Library
- Douwe Kiela, Changhan Wang, and Kyunghyun Cho. 2018. Dynamic meta-embeddings for improved sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1466–1477.Google ScholarCross Ref
- Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In Proceedings of the 8th International Conference on Learning Representations.Google Scholar
- Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2020. Attention is not only a weight: Analyzing transformers with vector norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP’20). Association for Computational Linguistics, 7057–7075.Google ScholarCross Ref
- Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In Proceedings of the 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 48. PMLR, 1378–1387. Google ScholarDigital Library
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the International Conference on Learning Representations.Google Scholar
- Jaesong Lee, Joong-Hwi Shin, and Jun-Seok Kim. 2017. Interactive visualization and manipulation of attention-based neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, 121–126.Google ScholarCross Ref
- John Boaz Lee, Ryan Rossi, and Xiangnan Kong. 2018. Graph classification using structural attention. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, 1666–1674. Google ScholarDigital Library
- John Boaz Lee, Ryan A. Rossi, Sungchul Kim, Nesreen K. Ahmed, and Eunyee Koh. 2019. Attention models in graphs: A survey. ACM Trans. Knowl. Discov. Data 13, 6, Article 62 (2019), 25 pages. https://doi.org/10.1145/3363574 Google ScholarDigital Library
- Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 7871–7880.Google ScholarCross Ref
- Guangyu Li, Bo Jiang, Hao Zhu, Zhengping Che, and Yan Liu. 2020a. Generative attention networks for multi-agent behavioral modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 7195–7202.Google ScholarCross Ref
- Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. Understanding neural networks through representation erasure. CoRR abs/1612.08220.Google Scholar
- Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019b. VisualBERT: A simple and performant baseline for vision and language. arXiv:1908.03557. Retrieved from https://arxiv.org/abs/1908.03557.Google Scholar
- Xiaoya Li, Yuxian Meng, Mingxin Zhou, Qinghong Han, Fei Wu, and Jiwei Li. 2020b. Sac: Accelerating and structuring self-attention via sparse adaptive connection. arXiv:2003.09833. Retrieved from https://arxiv.org/abs/2003.09833.Google Scholar
- Yang Li, Lukasz Kaiser, Samy Bengio, and Si Si. 2019a. Area attention. In Proceedings of the International Conference on Machine Learning. PMLR, 3846–3855.Google Scholar
- Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In Proceedings of the International Conference on Learning Representations (2017).Google Scholar
- Zhixuan Lin, Yi-Fu Wu, Skand Vishwanath Peri, Weihao Sun, Gautam Singh, Fei Deng, Jindong Jiang, and Sungjin Ahn. 2020. SPACE: Unsupervised object-oriented scene representation via spatial attention and decomposition. In Proceedings of the 8th International Conference on Learning Representations.Google Scholar
- Shusen Liu, Tao Li, Zhimin Li, Vivek Srikumar, Valerio Pascucci, and Peer-Timo Bremer. 2018. Visual interrogation of attention-based models for natural language inference and machine comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, 36–41.Google Scholar
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692. Retrieved from https://arxiv.org/abs/1907.11692.Google Scholar
- Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, Vol. 32. Google ScholarDigital Library
- Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 289–297. Google ScholarDigital Library
- Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. arXiv:1508.04025. Retrieved from https://arxiv.org/abs/1508.04025.Google Scholar
- Thang Luong, Hieu Pham, and Christopher D. Manning. 2015b. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1412–1421.Google Scholar
- Benteng Ma, Jing Zhang, Yong Xia, and Dacheng Tao. 2020. Auto learning attention. Adv. Neural Inf. Process. Syst. 33 (2020).Google Scholar
- Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2017b. Interactive attention networks for aspect-level sentiment classification. arXiv:1709.00893. Retrieved from https://arxiv.org/abs/1709.00893. Google ScholarDigital Library
- Fenglong Ma, Radha Chitta, Jing Zhou, Quanzeng You, Tong Sun, and Jing Gao. 2017a. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, 1903–1911. Google ScholarDigital Library
- Xutai Ma, Juan Pino, James Cross, Liezl Puzon, and Jiatao Gu. 2019. Monotonic multihead attention. arXiv:1909.12406. Retrieved from https://arxiv.org/abs/1909.12406.Google Scholar
- Yukun Ma, Haiyun Peng, and Erik Cambria. 2018. Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence. AAAI Press, 5876–5883.Google Scholar
- Suraj Maharjan, Manuel Montes, Fabio A. González, and Thamar Solorio. 2018. A genre-aware attention model to improve the likability prediction of books. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 3381–3391.Google ScholarCross Ref
- Andre Martins and Ramon Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proceedings of the International Conference on Machine Learning. PMLR, 1614–1623. Google ScholarDigital Library
- André F. T. Martins, Marcos Treviso, António Farinhas, Vlad Niculae, Mário A. T. Figueiredo, and Pedro M. Q. Aguiar. 2020. Sparse and continuous attention mechanisms. arXiv:2006.07214. Retrieved from https://arxiv.org/abs/2006.07214.Google Scholar
- Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems, Vol. 32. Google ScholarDigital Library
- Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. 2020. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 5191–5198.Google Scholar
- Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2. MIT Press, 2204–2212. Google ScholarDigital Library
- Akash Kumar Mohankumar, Preksha Nema, Sharan Narasimhan, Mitesh M. Khapra, Balaji Vasan Srinivasan, and Balaraman Ravindran. 2020. Towards transparent and explainable attention models. arXiv:2004.14243. Retrieved from https://arxiv.org/abs/2004.14243.Google Scholar
- Elizbar A. Nadaraya. 1964. On estimating regression. Theory Probab. Appl. 9, 1 (1964), 141–142.Google ScholarCross Ref
- Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gulçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. Association for Computational Linguistics, 280–290.Google ScholarCross Ref
- Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image transformer.Proceedings of Machine Learning Research, Vol. 80. 4055–4064.Google Scholar
- John Pavlopoulos, Prodromos Malakasiotis, and Ion Androutsopoulos. 2017. Deeper attention to abusive user content moderation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 1125–1135.Google ScholarCross Ref
- Ben Peters, Vlad Niculae, and André FT Martins. 2019. Sparse sequence-to-sequence models. arXiv:1905.05702. Retrieved from https://arxiv.org/abs/1905.05702.Google Scholar
- Zhu Qiannan, Xiaofei Zhou, Zeliang Song, Jianlong Tan, and Li Guo. 2019. DAN: Deep attention neural network for news recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, 5973–5980.Google Scholar
- C. Qin and D. Qu. 2020. Towards understanding attention-based speech recognition models. IEEE Access 8 (2020), 24358–24369.Google ScholarCross Ref
- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018. Language models are unsupervised multitask learners. https://openai.com/blog/better-language-models/.Google Scholar
- Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. 2019. Stand-alone self-attention in vision models. arXiv:1906.05909. Retrieved from https://arxiv.org/abs/1906.05909.Google Scholar
- Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 379–389.Google ScholarCross Ref
- Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2931–2951.Google Scholar
- Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2018. Disan: Directional self-attention network for rnn/cnn-free language understanding. In Proceedings of the 27th International Joint Conference on Artificial Intelligence.Google Scholar
- Min Yang, Baocheng Li, Qiang Qu, Jialie Shen, Shuai Yu, and Yongbo Wang. 2019. NAIRS: A neural attentive interpretable recommendation system. The Web Conference (2019). Google ScholarDigital Library
- Alessandro Sordoni, Philip Bachman, Adam Trischler, and Yoshua Bengio. 2016. Iterative alternating neural attention for machine reading. arXiv:1606.02245. Retrieved from https://arxiv.org/abs/1606.02245.Google Scholar
- Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019a. Adaptive attention span in transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 331–335.Google ScholarCross Ref
- Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019b. Adaptive attention span in transformers. arXiv:1905.07799. Retrieved from https://arxiv.org/abs/1905.07799.Google Scholar
- Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 2440–2448. Google ScholarDigital Library
- Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision.Google ScholarCross Ref
- Qiang Sun and Yanwei Fu. 2019. Stacked self-attention networks for visual question answering. In Proceedings of the 2019 on International Conference on Multimedia Retrieval. Association for Computing Machinery, 207–211. Google ScholarDigital Library
- Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP’19). Association for Computational Linguistics, 5099–5110.Google ScholarCross Ref
- Duyu Tang, Bing Qin, and Ting Liu. 2016. Aspect level sentiment classification with deep memory network. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 214–224.Google ScholarCross Ref
- Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 4263–4272.Google ScholarCross Ref
- Yi Tay, Anh Tuan Luu, Aston Zhang, Shuohang Wang, and Siu Cheung Hui. 2019. Compositional de-attention networks. Adv. Neural Inf. Process. Syst. 32 (2019), 6135–6145. Google ScholarDigital Library
- Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2020. Training data-efficient image transformers & distillation through attention. arXiv:2012.12877. Retrieved from https://arxiv.org/abs/2012.12877.Google Scholar
- Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2021. Training data-efficient image transformers & distillation through attention. arXiv:2012.12877. Retrieved from https://arxiv.org/abs/2012.12877.Google Scholar
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc., 5998–6008. Google ScholarDigital Library
- Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations.Google Scholar
- Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, Vol. 28. MIT Press, 2692–2700. Google ScholarDigital Library
- Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 5797–5808.Google ScholarCross Ref
- Feng Wang and David MJ Tax. 2016. Survey on the attention based RNN model and its applications in computer vision. arXiv:1601.06823. Retrieved from https://arxiv.org/abs/1601.06823.Google Scholar
- Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. 2020e. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In Proceedings of the European Conference on Computer Vision. Springer, 108–126.Google ScholarDigital Library
- Kai Wang, Weizhou Shen, Yunyi Yang, Xiaojun Quan, and Rui Wang. 2020c. Relational graph attention network for aspect-based sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 3229–3238.Google ScholarCross Ref
- Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and H. Linformer Ma. 2020a. Self-attention with linear complexity. arXiv:2006.04768. Retrieved from https://arxiv.org/abs/2006.04768.Google Scholar
- Wenya Wang, Sinno Jialin Pan, Daniel Dahlmeier, and Xiaokui Xiao. 2017. Coupled multi-layer attentions for co-extraction of aspect and opinion terms. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. AAAI Press, 3316–3322. Google ScholarDigital Library
- Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020d. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv:2002.10957. Retrieved from https://arxiv.org/abs/2002.10957.Google Scholar
- Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019a. KGAT: Knowledge graph attention network for recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Association for Computing Machinery, 950–958. Google ScholarDigital Library
- Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S. Yu. 2019b. Heterogeneous graph attention network. In Proceedings of the World Wide Web Conference. 2022–2032. Google ScholarDigital Library
- Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 606–615.Google ScholarCross Ref
- Yue Wang, Jing Li, Michael Lyu, and Irwin King. 2020b. Cross-media keyphrase prediction: A unified framework with multi-modality multi-head attention and image wordings. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 3311–3324.Google ScholarCross Ref
- Geoffrey S. Watson. 1964. Smooth regression analysis. Ind. J. Stat. Ser. A (1964), 359–372.Google Scholar
- Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv:1410.3916. Retrieved from https://arxiv.org/abs/1410.3916.
- Chuhan Wu, Fangzhao Wu, Junxin Liu, and Yongfeng Huang. 2019. Hierarchical user and item representation with three-tier attention for recommendation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 1818–1826.
- Le Wu, Lei Chen, Richang Hong, Yanjie Fu, Xing Xie, and Meng Wang. 2020. A hierarchical attention model for social contextual image recommendation. IEEE Trans. Knowl. Data Eng. 32, 10 (2020), 1854–1867.
- Wenyi Xiao, Huan Zhao, Haojie Pan, Yangqiu Song, Vincent W. Zheng, and Qiang Yang. 2021. Social explorative attention based recommendation for content distribution platforms. Data Min. Knowl. Discov. 35, 2 (2021), 533–567.
- Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, Volume 48. JMLR.org, 2397–2406.
- Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 37. 2048–2057.
- Song Xu, Haoran Li, Peng Yuan, Youzheng Wu, Xiaodong He, and Bowen Zhou. 2020. Self-attention guided copy mechanism for abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1355–1362.
- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, Vol. 32.
- Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1480–1489.
- Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 4507–4515.
- Haochao Ying, Fuzhen Zhuang, Fuzheng Zhang, Yanchi Liu, Guandong Xu, Xing Xie, Hui Xiong, and Jian Wu. 2018a. Sequential recommender system based on hierarchical attention network. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 3926–3932.
- Haochao Ying, Fuzhen Zhuang, Fuzheng Zhang, Yanchi Liu, Guandong Xu, Xing Xie, Hui Xiong, and Jian Wu. 2018b. Sequential recommender system based on hierarchical attention network. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 3926–3932.
- Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. Recent trends in deep learning based natural language processing. IEEE Comput. Intell. Mag. 13, 3 (2018), 55–75.
- Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. 2019. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6281–6290.
- Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018. Multi-attention recurrent network for human communication comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence. 5642–5649.
- Biao Zhang, Deyi Xiong, Jun Xie, and Jinsong Su. 2020. Neural machine translation with GRU-gated attention model. IEEE Trans. Neural Netw. Learn. Syst. 31, 11 (2020), 4688–4698.
- Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. 2019a. Self-attention generative adversarial networks. In Proceedings of the International Conference on Machine Learning. 7354–7363.
- Ruochi Zhang, Yuesong Zou, and Jian Ma. 2020. Hyper-SAGNN: A self-attention based graph neural network for hypergraphs. In Proceedings of the International Conference on Learning Representations.
- Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019b. Deep learning based recommender system: A survey and new perspectives. ACM Comput. Surv. 52, 1, Article 5 (2019).
- Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. 2020. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Shenjian Zhao and Zhihua Zhang. 2018. Attention-via-attention neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Zhou Zhao, Ben Gao, Vincent W. Zheng, Deng Cai, Xiaofei He, and Yueting Zhuang. 2017. Link prediction via ranking metric dual-level attention network learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 3525–3531.
- Chuanpan Zheng, Xiaoliang Fan, Cheng Wang, and Jianzhong Qi. 2020. GMAN: A graph multi-attention network for traffic prediction. In Proceedings of the 34th AAAI Conference on Artificial Intelligence. AAAI Press, 1234–1241.
- Chang Zhou, Jinze Bai, Junshuai Song, Xiaofei Liu, Zhengchao Zhao, Xiusi Chen, and Jun Gao. 2018. ATRank: An attention-based user behavior modeling framework for recommendation. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence. AAAI Press, 4564–4571.
- Feng Zhu, Yan Wang, Chaochao Chen, Guanfeng Liu, and Xiaolin Zheng. 2020. A graphical and attentional framework for dual-target cross-domain recommendation. In Proceedings of the 29th International Joint Conference on Artificial Intelligence. 3001–3008.
- Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2021. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the International Conference on Learning Representations.