More accurate tests for the statistical significance of result differences

Author:
Alexander Yeh

Mitre Corp., Bedford, MA

COLING '00: Proceedings of the 18th conference on Computational linguistics - Volume 2, July 2000, pages 947–953
https://doi.org/10.3115/992730.992783

Published: 31 July 2000

ABSTRACT

Statistical significance testing of differences in values of metrics like recall, precision and balanced F-score is a necessary part of empirical natural language processing. Unfortunately, we find in a set of experiments that many commonly used tests often underestimate the significance and so are less likely to detect differences that exist between different techniques. This underestimation comes from an independence assumption that is often violated. We point out some useful tests that do not make this assumption, including computationally intensive randomization tests.
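The independence problem can be made concrete. Recall, precision and balanced F-score are computed from counts pooled over the whole test set, so the metric is not an average of independent per-item scores, and two techniques scored on the same test items produce paired rather than independent results. A randomization test avoids the assumption by asking how often a difference at least as large as the observed one appears when, item by item, the two systems' outputs are randomly exchanged. The following is a minimal Python sketch under stated assumptions: each system's output has already been reduced to hypothetical per-item `(correct, proposed, gold)` counts, and `f_score` and `approx_randomization` are illustrative names. It shows one common form of approximate randomization, not necessarily the exact procedure used in the paper.

```python
import random
from typing import List, Tuple

# Hypothetical input format (an assumption, not from the paper): for one system
# on one test item, (correct responses, responses proposed, gold-standard keys).
Counts = Tuple[int, int, int]


def f_score(items: List[Counts]) -> float:
    """Balanced F-score from counts pooled over all test items."""
    correct = sum(c for c, _, _ in items)
    proposed = sum(p for _, p, _ in items)
    gold = sum(g for _, _, g in items)
    precision = correct / proposed if proposed else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def approx_randomization(sys_a: List[Counts], sys_b: List[Counts],
                         trials: int = 10_000, seed: int = 0) -> float:
    """Estimated two-sided p-value for the F-score difference between two
    systems scored on the same test items (hypothetical helper)."""
    if len(sys_a) != len(sys_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    observed = abs(f_score(sys_a) - f_score(sys_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(sys_a, sys_b):
            # Under the null hypothesis the systems are interchangeable, so
            # swap their outputs on this item with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        diff = abs(f_score(shuffled_a) - f_score(shuffled_b))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing: the observed assignment counts as one of the trials,
    # so the estimate never reaches an impossible p = 0.
    return (at_least_as_extreme + 1) / (trials + 1)
```

Swapping outputs within each item preserves the pairing of the two systems on identical inputs, which is what lets the test avoid the independence assumption. The `trials` parameter trades run time for Monte Carlo precision; the smallest p-value it can report is 1 / (trials + 1).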

Published in
COLING '00: Proceedings of the 18th conference on Computational linguistics - Volume 2
July 2000
549 pages
Program Chair:
Martin Kay
Publisher
Association for Computational Linguistics
United States
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,537 of 1,537 submissions, 100%

Article Metrics
- Total Citations: 67
- Total Downloads: 1,010
- Downloads (Last 12 months): 39
- Downloads (Last 6 weeks): 2