ABSTRACT
Statistical significance testing of differences in the values of metrics such as recall, precision, and balanced F-score is a necessary part of empirical natural language processing. Unfortunately, a set of experiments shows that many commonly used tests often underestimate the significance, and so are less likely to detect differences that do exist between techniques. This underestimation comes from an independence assumption that is often violated. We point out some useful tests that do not make this assumption, including computationally intensive randomization tests.
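As an illustration of the kind of randomization test the abstract refers to, here is a minimal sketch of a paired approximate randomization test on balanced F-score. It is not taken from the paper: the function names, the per-item `(tp, fp, fn)` count representation, the number of trials, and the add-one p-value estimate are all assumptions made for this example. Because each shuffle swaps the two systems' outputs on a whole test item, the two systems are never treated as independent samples.

```python
import random

def f1(counts):
    """Balanced F-score from per-item (tp, fp, fn) counts (assumed representation)."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def approx_randomization(a, b, trials=9999, seed=0):
    """Estimate a two-sided p-value for the F-score difference of systems a and b.

    a and b are equal-length lists of (tp, fp, fn) tuples, one per test item,
    aligned so that a[i] and b[i] describe the same item.
    """
    rng = random.Random(seed)
    observed = abs(f1(a) - f1(b))
    at_least = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for x, y in zip(a, b):
            # Under the null hypothesis the system labels are exchangeable,
            # so swap the pair of outputs for this item with probability 0.5.
            if rng.random() < 0.5:
                x, y = y, x
            shuffled_a.append(x)
            shuffled_b.append(y)
        if abs(f1(shuffled_a) - f1(shuffled_b)) >= observed:
            at_least += 1
    # Add-one estimate, so the reported p-value is never exactly zero.
    return (at_least + 1) / (trials + 1)
```

Calling `approx_randomization(a, b)` returns the estimated probability of seeing an F-score gap at least as large as the observed one if the two systems were interchangeable; a small value suggests the difference is unlikely to be due to chance alone.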