Research article
DOI: 10.1145/3366423.3380195

Attention Please: Your Attention Check Questions in Survey Studies Can Be Automatically Answered

Published: 20 April 2020

ABSTRACT

Attention check questions have become a common mechanism in online surveys published on popular crowdsourcing platforms for filtering out inattentive respondents and improving data quality. However, little research has examined the vulnerabilities of this important quality control mechanism, which can allow attackers, including irresponsible and malicious respondents, to answer attention check questions automatically and thus achieve their goals efficiently. In this paper, we perform the first study of such vulnerabilities and demonstrate that attackers can leverage deep learning techniques to pass attention check questions automatically. We propose AC-EasyPass, an attack framework with a concrete model that combines a convolutional neural network with weighted feature reconstruction to easily pass attention check questions. We construct the first attention check question dataset, consisting of both original and augmented questions, and use it to demonstrate the effectiveness of AC-EasyPass. We explore two simple defense methods that survey designers could use to mitigate the risks posed by AC-EasyPass, adding adversarial sentences and adding typos; however, both methods are fragile due to technical and usability limitations, underlining the challenging nature of defense. We hope our work will draw the research community's attention to developing more robust attention check mechanisms. More broadly, we intend to prompt the research community to seriously consider the emerging risks that the malicious use of machine learning techniques poses to the quality, validity, and trustworthiness of crowdsourcing and social computing.
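For illustration only: the abstract describes AC-EasyPass only at a high level (a convolutional neural network combined with weighted feature reconstruction for answering attention check questions), so the sketch below shows the general shape of such an approach, a shared convolutional encoder that scores question/answer pairs, with a learned per-feature weighting standing in for the weighted-feature idea. The PyTorch framing, architecture, and hyperparameters are all assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QAMatcher(nn.Module):
    """Minimal CNN-based question/answer matcher (illustrative sketch only).

    Encodes a question and a candidate answer with shared 1-D convolutions,
    then scores the pair by cosine similarity of weighted, max-pooled
    features. The learned feature weights are a stand-in for the paper's
    "weighted feature reconstruction"; details here are assumptions.
    """

    def __init__(self, vocab_size=10000, embed_dim=100,
                 num_filters=128, kernel_size=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size, padding=1)
        # Per-feature weights applied before similarity scoring.
        self.feature_weights = nn.Parameter(torch.ones(num_filters))

    def encode(self, token_ids):
        # token_ids: (batch, seq_len) -> feature vector: (batch, num_filters)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, embed_dim, seq)
        x = F.relu(self.conv(x))                   # (batch, num_filters, seq)
        x = x.max(dim=2).values                    # max-pool over time
        return x * self.feature_weights            # reweight features

    def forward(self, question_ids, answer_ids):
        q = self.encode(question_ids)
        a = self.encode(answer_ids)
        return F.cosine_similarity(q, a, dim=1)    # higher = better match

# Usage: score each answer option against the question, pick the best.
model = QAMatcher()
question = torch.randint(0, 10000, (1, 20))  # toy token ids
options = torch.randint(0, 10000, (4, 20))   # four candidate answers
scores = model(question.expand(4, -1), options)
print("predicted option:", scores.argmax().item())
```

The abstract also mentions adding typos as a defense. A toy version, again an assumption about how one might implement it rather than the paper's method, could perturb characters in content words so that exact-match embedding lookups fail:

```python
import random

def add_typos(text, rate=0.2, seed=0):
    """Swap two adjacent characters in a fraction of the words (toy defense)."""
    rng = random.Random(seed)
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < rate:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

print(add_typos("Please select strongly disagree for this question"))
```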

Published in:

WWW '20: Proceedings of The Web Conference 2020
April 2020, 3143 pages
ISBN: 9781450370233
DOI: 10.1145/3366423
Copyright © 2020 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
