DOI: 10.3115/992730.992783

More accurate tests for the statistical significance of result differences

Published: 31 July 2000

ABSTRACT

Statistical significance testing of differences in values of metrics like recall, precision and balanced F-score is a necessary part of empirical natural language processing. Unfortunately, we find in a set of experiments that many commonly used tests often underestimate the significance and so are less likely to detect differences that exist between different techniques. This underestimation comes from an independence assumption that is often violated. We point out some useful tests that do not make this assumption, including computationally-intensive randomization tests.
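
One test of the kind the abstract mentions, a paired approximate randomization test, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's own code: the function names `f_score` and `randomization_test`, the per-item `(true_pos, false_pos, false_neg)` data layout, and the default trial count are all hypothetical. The sketch assumes both systems were run on the same test items, so that under the null hypothesis the two outputs for any one item are interchangeable. It makes no assumption that the two systems' scores are independent of each other.

```python
import random


def f_score(counts):
    """Balanced F-score from per-item (true_pos, false_pos, false_neg) counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def randomization_test(counts_a, counts_b, trials=10000, seed=0):
    """Estimate a two-sided p-value for the F-score difference of two systems.

    counts_a[i] and counts_b[i] must describe the same test item i
    (hypothetical layout chosen for this sketch).
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    observed = abs(f_score(counts_a) - f_score(counts_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(counts_a, counts_b):
            # Under the null hypothesis either system could have produced
            # either output, so swap the pair with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(f_score(shuffled_a) - f_score(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)
```

A small returned value suggests that the observed F-score difference would be unlikely if the two systems were interchangeable. Because the F-score is recomputed from pooled counts on every shuffle, the same loop works for recall or precision by swapping in a different scoring function.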

Published in

COLING '00: Proceedings of the 18th conference on Computational linguistics - Volume 2
July 2000
549 pages

Publisher: Association for Computational Linguistics, United States

Overall Acceptance Rate: 1,537 of 1,537 submissions, 100%
