
Measuring Performance in Classification Models

A chapter in Applied Predictive Modeling

Abstract

When predicting a categorical outcome, some measure of classification accuracy is typically used to evaluate the model’s effectiveness. However, there are different ways to measure classification accuracy, depending on the modeler’s primary objectives. Most classification models can produce both a continuous and a categorical prediction output. In Section 11.1, we review these outputs, demonstrate how to adjust probabilities based on calibration plots, recommend ways of displaying class predictions, and define equivocal or indeterminate zones of prediction. In Section 11.2, we review common metrics for assessing classification predictions such as accuracy, kappa, sensitivity, specificity, and positive and negative predicted values. This section also addresses model evaluation when costs are applied to false positive or false negative mistakes. Classification models may also produce predicted classification probabilities. Evaluating this type of output is addressed in Section 11.3, which includes a discussion of receiver operating characteristic curves as well as lift charts. In Section 11.4, we demonstrate how measures of classification performance can be generated in R.
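As a taste of the computations Section 11.4 walks through, the sketch below produces several of these statistics with the caret package's confusionMatrix function (one natural choice; the chapter's own code may differ). The observed and predicted classes are simulated for illustration and are not the chapter's data.

    ## Minimal sketch: accuracy, Kappa, sensitivity, specificity, and the
    ## positive/negative predicted values from a single confusion matrix.
    ## The data are simulated for illustration only.
    library(caret)

    set.seed(1)
    observed  <- factor(sample(c("event", "nonevent"), 100, replace = TRUE,
                               prob = c(0.3, 0.7)))
    ## A noisy "model" that recovers the true class 80% of the time
    predicted <- factor(ifelse(runif(100) < 0.8, as.character(observed),
                               "nonevent"),
                        levels = levels(observed))

    ## One call reports the full suite of class-prediction metrics
    confusionMatrix(data = predicted, reference = observed,
                    positive = "event")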


Notes

  1. In medical terminology, this rate is referred to as the prevalence of a disease, while in Bayesian statistics it would be the prior distribution of the event.

  2. This is true because predictive models seek to find a relationship that is concordant with the truth. A large negative Kappa would imply that there is a relationship between the predictors and the response, and the predictive model would seek to find that relationship in the correct direction.
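     To make the chance correction behind Kappa concrete, here is a small sketch of the computation itself; kappa_stat is a hypothetical helper written for this note, not a function from the chapter.

         ## Cohen's Kappa from a confusion matrix: (O - E) / (1 - E), where O
         ## is the observed accuracy and E the accuracy expected by chance.
         kappa_stat <- function(tab) {
           tab <- as.matrix(tab)
           n <- sum(tab)
           O <- sum(diag(tab)) / n                      # observed agreement
           E <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
           (O - E) / (1 - E)
         }

         ## Example 2 x 2 table (rows = predicted, columns = observed)
         tab <- matrix(c(40, 10, 5, 45), nrow = 2, byrow = TRUE)
         kappa_stat(tab)   # 0.7: agreement well beyond chance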

  3. In relation to Bayesian statistics, the sensitivity and specificity are the conditional probabilities, the prevalence is the prior, and the positive/negative predicted values are the posterior probabilities.
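     This correspondence can be written out directly with Bayes' rule; the sensitivity, specificity, and prevalence values below are hypothetical.

         ## Predicted values via Bayes' rule, with the prevalence as prior
         sens <- 0.90   # P(predicted + | truly +)
         spec <- 0.95   # P(predicted - | truly -)
         prev <- 0.10   # prior probability of the event

         ppv <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
         npv <- (spec * (1 - prev)) /
                ((1 - sens) * prev + spec * (1 - prev))

         ppv   # posterior probability of the event given a positive call
         npv   # posterior probability of a nonevent given a negative call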

  4. This depends on a few assumptions which may or may not be true. Section 20.1 discusses this aspect of the example in more detail in the context of net lift modeling.

  5. In this analysis, we have used the test set to investigate the effects of alternative thresholds. In general, a new threshold should be derived from a data set separate from those used to train the model or to evaluate its performance.
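     A minimal sketch of that recommendation in base R, assuming hypothetical vectors eval_probs and eval_class from a separate evaluation set and test_probs from the test set (none of these objects come from the chapter's code):

         ## Derive a probability threshold on a separate evaluation set,
         ## then apply it, once, to the test set.
         cutoffs <- seq(0.05, 0.95, by = 0.01)

         stats <- sapply(cutoffs, function(cut) {
           pred <- ifelse(eval_probs >= cut, "event", "nonevent")
           sens <- mean(pred[eval_class == "event"] == "event")
           spec <- mean(pred[eval_class == "nonevent"] == "nonevent")
           sens + spec - 1          # Youden's J statistic
         })

         best <- cutoffs[which.max(stats)]  # chosen on the evaluation set
         test_pred <- ifelse(test_probs >= best, "event", "nonevent")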

  6. R has a number of packages that can compute the ROC curve, including ROCR, caTools, PresenceAbsence, and others.
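     For example, a minimal ROC curve and area under the curve with the ROCR package; the class probabilities and labels below are simulated for illustration.

         ## ROC curve and AUC with the ROCR package
         library(ROCR)

         ## Simulated class probabilities and 0/1 labels
         set.seed(2)
         labels <- rbinom(200, 1, 0.4)
         scores <- ifelse(labels == 1, rbeta(200, 4, 2), rbeta(200, 2, 4))

         pred <- prediction(scores, labels)
         perf <- performance(pred, measure = "tpr", x.measure = "fpr")
         plot(perf)                               # the ROC curve itself
         performance(pred, "auc")@y.values[[1]]   # area under the curve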



Copyright information

© 2013 Springer Science+Business Media New York


Cite this chapter

Kuhn, M., Johnson, K. (2013). Measuring Performance in Classification Models. In: Applied Predictive Modeling. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6849-3_11
