ABSTRACT
Attention check questions have become a commonly used mechanism in online surveys on popular crowdsourcing platforms to filter out inattentive respondents and improve data quality. However, little research has examined the vulnerabilities of this quality-control mechanism, which can allow attackers, including irresponsible and malicious respondents, to answer attention check questions automatically and thus achieve their goals more efficiently. In this paper, we present the first study of such vulnerabilities and demonstrate that attackers can leverage deep learning techniques to pass attention check questions automatically. We propose AC-EasyPass, an attack framework with a concrete model that combines a convolutional neural network with weighted feature reconstruction to easily pass attention check questions. We construct the first attention check question dataset, consisting of both original and augmented questions, and use it to demonstrate the effectiveness of AC-EasyPass. We explore two simple defense methods, adding adversarial sentences and adding typos, that survey designers could use to mitigate the risks posed by AC-EasyPass; however, both methods are fragile due to technical and usability limitations, underscoring how challenging defense is. We hope our work will draw the research community's attention to the need for more robust attention check mechanisms. More broadly, our work is intended to prompt the research community to seriously consider the emerging risks that the malicious use of machine learning poses to the quality, validity, and trustworthiness of crowdsourcing and social computing.
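To make the attack setting concrete, the sketch below shows one way a CNN-based answer-selection model of the kind the abstract describes could score candidate answers to an attention check question. This is a minimal, hypothetical illustration, not the authors' AC-EasyPass implementation: the class name `ACMatcher`, the hyperparameters, and the per-dimension `feature_weight` stand-in for "weighted feature reconstruction" are all assumptions, since the abstract does not specify the architecture.

```python
# Hypothetical sketch of a CNN-based answer-selection model for attention
# check questions. Question and candidate answers are embedded, passed
# through a shared 1-D convolution, max-pooled into fixed-size vectors,
# re-weighted, and scored by cosine similarity; the attacker would pick
# the highest-scoring candidate. Not the paper's actual model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ACMatcher(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, n_filters=128, kernel=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel, padding=kernel // 2)
        # crude stand-in for the paper's weighted feature reconstruction
        self.feature_weight = nn.Parameter(torch.ones(n_filters))

    def encode(self, token_ids):                      # token_ids: (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)       # (batch, emb_dim, seq_len)
        x = F.relu(self.conv(x))                      # (batch, n_filters, seq_len)
        x = x.max(dim=2).values                       # max-pool over time
        return x * self.feature_weight                # re-weight pooled features

    def score(self, question_ids, candidate_ids):
        q = self.encode(question_ids)
        a = self.encode(candidate_ids)
        return F.cosine_similarity(q, a, dim=1)       # one score per pair

# Usage: choose the candidate answer most similar to the question.
model = ACMatcher()
question = torch.randint(1, 10000, (1, 20))           # toy token ids
candidates = torch.randint(1, 10000, (4, 20))         # four answer options
scores = model.score(question.expand(4, -1), candidates)
best = scores.argmax().item()
```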