ABSTRACT
This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
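The abstract's first claim, that repeated labeling can improve label quality, can be illustrated with a small sketch. Assuming binary labels and independent labelers who are each correct with probability p (a simplified model; the function name and tie-breaking rule here are my own), the quality of a majority-voted "integrated" label as a function of the number of labels acquired is:

```python
from math import comb

def majority_vote_quality(p, n):
    """Probability that the majority vote over n independent labelers,
    each correct with probability p, yields the correct binary label.
    Ties (possible when n is even) are broken uniformly at random."""
    q = 0.0
    for k in range(n + 1):  # k = number of labelers who answer correctly
        prob = comb(n, k) * p**k * (1 - p)**(n - k)
        if k > n - k:        # correct labels form the majority
            q += prob
        elif k == n - k:     # tie: coin flip
            q += 0.5 * prob
    return q

# With labelers of quality p = 0.7, acquiring more labels per example
# raises integrated label quality: n = 1 gives 0.70, n = 3 gives 0.784.
for n in (1, 3, 5, 11):
    print(n, round(majority_vote_quality(0.7, n), 3))
```

Under this model, quality rises monotonically with n when p > 0.5 but falls when p < 0.5, which is consistent with the abstract's caveat that repeated labeling improves quality "but not always."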
Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers