DOI: 10.1145/1401890.1401965

Get another label? Improving data quality and data mining using multiple, noisy labelers

Published: 24 August 2008

ABSTRACT

This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it is often possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give a considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
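The core intuition behind result (i) can be illustrated with a small calculation: if each labeler is independently correct with probability p, the quality of a majority-voted label follows a binomial tail, improving with more labels when p > 0.5 but degrading when p < 0.5 (the "but not always" caveat). The sketch below is a hypothetical illustration of this arithmetic, not code from the paper; the function name and parameters are our own.

```python
from math import comb

def majority_label_quality(p: float, n: int) -> float:
    """Probability that the majority vote of n independent labelers,
    each correct with probability p, equals the true label.

    Uses an odd n to avoid ties. Quality rises with n only when
    p > 0.5; for p < 0.5 adding labels makes things worse.
    """
    assert n % 2 == 1, "use an odd number of labels to avoid ties"
    # Sum the binomial tail: at least (n//2 + 1) labelers correct.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(n // 2 + 1, n + 1))

if __name__ == "__main__":
    for n in (1, 3, 5, 11):
        print(n, round(majority_label_quality(0.7, n), 3))
```

For example, with p = 0.7 the voted-label quality rises from 0.7 (one label) to about 0.784 (three labels) and 0.837 (five labels), which is why repeated labeling can pay off even when individual labels are cheap but noisy.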


Published in

KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2008, 1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi

          Copyright © 2008 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States



Acceptance Rates

KDD '08 paper acceptance rate: 118 of 593 submissions (20%). Overall acceptance rate: 1,133 of 8,635 submissions (13%).
