DOI: 10.1145/1401890.1401965

Get another label? Improving data quality and data mining using multiple, noisy labelers

Published: 24 August 2008

ABSTRACT

This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it is often possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give a considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
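The core intuition behind result (i) can be illustrated with a small calculation: if each labeler is independently correct with probability p, the quality of a majority-voted label follows a binomial tail, improving with more labels when p > 0.5 but degrading when p < 0.5 (the "but not always" caveat). The sketch below is a hypothetical illustration of this arithmetic, not code from the paper; the function name and parameters are our own.

```python
from math import comb

def majority_label_quality(p: float, n: int) -> float:
    """Probability that the majority vote of n independent labelers,
    each correct with probability p, equals the true label.

    Uses an odd n to avoid ties. Quality rises with n only when
    p > 0.5; for p < 0.5 adding labels makes things worse.
    """
    assert n % 2 == 1, "use an odd number of labels to avoid ties"
    # Sum the binomial tail: at least (n//2 + 1) labelers correct.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(n // 2 + 1, n + 1))

if __name__ == "__main__":
    for n in (1, 3, 5, 11):
        print(n, round(majority_label_quality(0.7, n), 3))
```

For example, with p = 0.7 the voted-label quality rises from 0.7 (one label) to about 0.784 (three labels) and 0.837 (five labels), which is why repeated labeling can pay off even when individual labels are cheap but noisy.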


Published in

KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2008, 1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi

          Copyright © 2008 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States



Acceptance Rates

KDD '08 paper acceptance rate: 118 of 593 submissions (20%). Overall acceptance rate: 1,133 of 8,635 submissions (13%).
