ABSTRACT
Crowdsourcing is an effective way to solicit useful information from the crowd and has emerged as a popular paradigm for solving challenging tasks. However, the data provided by participating workers are not always trustworthy. In the real world, crowdsourcing systems may contain malicious workers who conduct data poisoning attacks for the purpose of sabotage or financial reward. Data aggregation methods such as majority voting are applied to workers' labels to improve data quality, but they are vulnerable to such attacks because they treat all workers equally. To capture the variation in worker reliability, the Dawid-Skene model, a sophisticated data aggregation method, has been widely adopted in practice. By conducting maximum likelihood estimation (MLE) with the expectation maximization (EM) algorithm, the Dawid-Skene model jointly estimates each worker's reliability and performs weighted aggregation, and can thus tolerate data poisoning attacks to some degree. However, the Dawid-Skene model still has weaknesses. In this paper, we study data poisoning attacks against crowdsourcing systems that employ the Dawid-Skene model. We design an intelligent attack mechanism with which the attacker can not only achieve maximum attack utility but also disguise the attacking behaviors. Extensive experiments on real-world crowdsourcing datasets verify the desirable properties of the proposed mechanism.
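To make the two aggregation schemes named in the abstract concrete, the following is a minimal sketch of majority voting and of the classic Dawid-Skene EM procedure in plain Python. It illustrates only the aggregation side, not the attack mechanism proposed in the paper. The function names, the `(task, worker, label)` tuple layout, the `smoothing` constant, and the toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def normalise(values):
    """Scale a list of positive numbers so that it sums to 1."""
    total = sum(values)
    return [v / total for v in values]


def majority_vote(labels, num_classes):
    """labels: list of (task, worker, label). Every worker counts equally."""
    counts = defaultdict(lambda: [0] * num_classes)
    for task, _worker, label in labels:
        counts[task][label] += 1
    return {t: max(range(num_classes), key=c.__getitem__)
            for t, c in counts.items()}


def dawid_skene(labels, num_classes, iterations=50, smoothing=0.01):
    """Sketch of Dawid-Skene EM. Returns (posterior, confusion).

    posterior[task][k]    : estimated probability that the true label is k
    confusion[worker][k][l]: estimated probability that the worker answers l
                             when the true label is k
    `smoothing` is an added assumption that keeps every probability positive.
    """
    tasks = sorted({t for t, _, _ in labels})
    workers = sorted({w for _, w, _ in labels})
    classes = range(num_classes)

    # Initialise the posteriors with smoothed vote fractions.
    votes = {t: [smoothing] * num_classes for t in tasks}
    for t, _, l in labels:
        votes[t][l] += 1
    posterior = {t: normalise(v) for t, v in votes.items()}
    confusion = {}

    for _ in range(iterations):
        # M-step: class prior and one confusion matrix per worker.
        prior = normalise([smoothing + sum(posterior[t][k] for t in tasks)
                           for k in classes])
        confusion = {w: [[smoothing] * num_classes for _ in classes]
                     for w in workers}
        for t, w, l in labels:
            for k in classes:
                confusion[w][k][l] += posterior[t][k]
        for w in workers:
            confusion[w] = [normalise(row) for row in confusion[w]]

        # E-step: recompute the posterior over true labels in log space.
        log_post = {t: [math.log(p) for p in prior] for t in tasks}
        for t, w, l in labels:
            for k in classes:
                log_post[t][k] += math.log(confusion[w][k][l])
        for t in tasks:
            peak = max(log_post[t])
            posterior[t] = normalise([math.exp(v - peak) for v in log_post[t]])

    return posterior, confusion


# Toy data: workers "a" and "b" always answer correctly, "c" always flips.
truth = [0, 1, 0, 1]
labels = []
for task, y in enumerate(truth):
    labels += [(task, "a", y), (task, "b", y), (task, "c", 1 - y)]

mv = majority_vote(labels, 2)
posterior, confusion = dawid_skene(labels, 2)
ds = {t: max(range(2), key=p.__getitem__) for t, p in posterior.items()}
```

On this toy data both schemes recover the true labels, but only the Dawid-Skene sketch also learns that worker `"c"` systematically flips answers, which is the per-worker reliability estimate that the abstract describes as the basis for weighted aggregation.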
Index Terms
- Attack under Disguise: An Intelligent Data Poisoning Attack Mechanism in Crowdsourcing