
Meta-analysis of Cohen’s kappa


Abstract

Cohen’s κ is the most important and most widely accepted measure of inter-rater reliability when the outcome of interest is measured on a nominal scale. The estimates of Cohen’s κ usually vary from one study to another due to differences in study settings, test properties, rater characteristics and subject characteristics. This study proposes a formal statistical framework for meta-analysis of Cohen’s κ to describe the typical inter-rater reliability estimate across multiple studies, to quantify between-study variation and to evaluate the contribution of moderators to heterogeneity. To demonstrate the application of the proposed statistical framework, a meta-analysis of Cohen’s κ is conducted for pressure ulcer classification systems. Implications and directions for future research are discussed.
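In the standard random-effects notation for meta-analysis (e.g., Hedges and Vevea 1998; Raudenbush 2009), the framework can be sketched as

$$ \hat{\kappa}_i = \mu + u_i + e_i, \qquad u_i \sim N(0, \tau^{2}), \qquad e_i \sim N(0, v_i), $$

where \( \hat{\kappa}_i \) is the kappa estimate from study \( i \) with sampling variance \( v_i \) (Appendix A), \( \mu \) is the typical inter-rater reliability across studies, and \( \tau^{2} \) is the between-study variance; moderators are accommodated by replacing \( \mu \) with a regression term in a mixed-effects model (Appendix B).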


References

  • Allman, R.M.: Pressure ulcer prevalence, incidence, risk factors, and impact. Clin. Geriatr. Med. 13, 421–436 (1997)
  • Altman, D.G.: Practical Statistics for Medical Research. Chapman and Hall, London (1991)
  • Baugh, F.: Correcting effect sizes for score reliability: a reminder that measurement and substantive issues are linked inextricably. Educ. Psychol. Meas. 62, 254–263 (2002)
  • Banerjee, M., Capozzoli, M., McSweeny, L., Sinha, D.: Beyond kappa: a review of interrater agreement measures. Can. J. Stat. 27, 3–23 (1999)
  • Berry, K.J., Mielke, P.W.: A generalization of Cohen’s kappa agreement measure to interval measurement and multiple raters. Educ. Psychol. Meas. 48, 921–933 (1988)
  • Blackman, N.J.-M., Koval, J.J.: Interval estimation for Cohen’s kappa as a measure of agreement. Stat. Med. 19, 723–741 (2000)
  • Bloch, D.A., Kraemer, H.C.: 2×2 kappa coefficients: measures of agreement or association. Biometrics 45, 269–287 (1989)
  • Borenstein, M.: Software for publication bias. In: Rothstein, H.R., Sutton, A.J., Borenstein, M. (eds.) Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, pp. 193–220. Wiley, Chichester (2005)
  • Bours, G., Halfens, R., Lubbers, M., Haalboom, J.: The development of a National Registration Form to measure the prevalence of pressure ulcers in the Netherlands. Ostomy Wound Manage. 45, 28–40 (1999)
  • Brennan, R.L., Prediger, D.J.: Coefficient kappa: some uses, misuses, and alternatives. Educ. Psychol. Meas. 41, 687–699 (1981)
  • Brennan, P., Silman, A.: Statistical methods for assessing observer variability in clinical measures. Br. Med. J. 304, 1491–1494 (1992)
  • Buntinx, F., Beckers, H., De Keyser, G., Flour, M., Nissen, G., Raskin, T., De Vet, H.: Inter-observer variation in the assessment of skin ulceration. J. Wound Care 5, 166–170 (1996)
  • Capraro, M.M., Capraro, R.M., Henson, R.K.: Measurement error of scores on the Mathematics Anxiety Rating Scale across studies. Educ. Psychol. Meas. 61, 373–386 (2001)
  • Caruso, J.C.: Reliability generalization of the NEO personality scales. Educ. Psychol. Meas. 60, 236–254 (2000)
  • Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46 (1960)
  • Cohen, J.: Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull. 70, 213–220 (1968)
  • Cohen, J.: Weighted chi square: an extension of the kappa method. Educ. Psychol. Meas. 32, 61–74 (1972)
  • Davies, M., Fleiss, J.L.: Measuring agreement for multinomial data. Biometrics 38, 1047–1051 (1982)
  • Duval, S., Tweedie, R.: A nonparametric “trim and fill” method of accounting for publication bias in meta-analysis. J. Am. Stat. Assoc. 95(449), 89–98 (2000a)
  • Duval, S., Tweedie, R.: Trim and fill: a simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics 56, 455–463 (2000b)
  • Everitt, B.S.: Moments of the statistics kappa and weighted kappa. Br. J. Math. Stat. Psychol. 21, 97–103 (1968)
  • Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76, 378–382 (1971)
  • Fleiss, J.L.: Statistical Methods for Rates and Proportions, 2nd edn. Wiley, New York (1981)
  • Fleiss, J.L., Cohen, J., Everitt, B.S.: Large sample standard errors of kappa and weighted kappa. Psychol. Bull. 72, 323–327 (1969)
  • Fleiss, J.L., Shrout, P.E.: The effects of measurement errors on some multivariate procedures. Am. J. Public Health 67, 1188–1191 (1977)
  • Graves, N., Birrell, F.A., Whitby, M.: Modeling the economic losses from pressure ulcers among hospitalized patients in Australia. Wound Repair Regen. 13, 462–467 (2005)
  • Gross, S.T.: The kappa coefficient of agreement for multiple observers when the number of subjects is small. Biometrics 42, 883–893 (1986)
  • Guilford, J.P., Fruchter, B.: Fundamental Statistics in Psychology and Education, 6th edn. McGraw-Hill, New York (1978)
  • Hart, S., Bergquist, S., Gajewski, B., Dunton, N.: Reliability testing of the National Database of Nursing Quality Indicators pressure ulcer indicator. J. Nurs. Care Qual. 21, 256–265 (2006)
  • Healey, F.: The reliability and utility of pressure sore grading scales. J. Tissue Viability 5, 111–114 (1995)
  • Hedges, L.V.: Fitting categorical models to effect sizes from a series of experiments. J. Educ. Stat. 7, 119–137 (1982a)
  • Hedges, L.V.: Fitting continuous models to effect sizes from a series of experiments. J. Educ. Stat. 7, 245–270 (1982b)
  • Hedges, L.V.: A random effects model for effect sizes. Psychol. Bull. 93, 388–395 (1983)
  • Hedges, L.V., Vevea, J.L.: Fixed and random effects models in meta-analysis. Psychol. Methods 3, 486–504 (1998)
  • Helms, J.E.: Another meta-analysis of the White Racial Identity Attitude Scale’s Cronbach alphas: implications for validity. Meas. Eval. Couns. Dev. 32, 122–137 (1999)
  • Henson, R.K.: Understanding internal consistency reliability estimates: a conceptual primer on coefficient alpha. Meas. Eval. Couns. Dev. 34, 177–189 (2001)
  • Henson, R.K., Kogan, L.R., Vacha-Haase, T.: A reliability generalization study of the Teacher Efficacy Scale and related instruments. Educ. Psychol. Meas. 61, 404–420 (2001)
  • Hunter, J.E., Schmidt, F.L.: Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. Sage, Newbury Park (1990)
  • Huynh, Q., Howell, R.T., Benet-Martinez, V.: Reliability of bidimensional acculturation scores: a meta-analysis. J. Cross Cult. Psychol. 40, 256–274 (2009)
  • Janson, H., Olsson, U.: A measure of agreement for interval or nominal multivariate observations. Educ. Psychol. Meas. 61, 277–289 (2001)
  • Janson, H., Olsson, U.: A measure of agreement for interval or nominal multivariate observations by different sets of judges. Educ. Psychol. Meas. 64, 62–70 (2004)
  • Kottner, J., Raeder, K., Halfens, R., Dassen, T.: A systematic review of interrater reliability of pressure ulcer classification systems. J. Clin. Nurs. 18, 315–336 (2009)
  • Koval, J.J., Blackman, N.J.-M.: Estimators of kappa: exact small sample properties. J. Stat. Comput. Simul. 55, 513–536 (1996)
  • Kraemer, H.C.: Ramifications of a population model for κ as a coefficient of reliability. Psychometrika 44, 461–472 (1979)
  • Kraemer, H.C.: Extension of the kappa coefficient. Biometrics 36, 207–216 (1980)
  • Kraemer, H.C., Periyakoil, V.S., Noda, A.: Kappa coefficients in medical research. In: D’Agostino, R.B. (ed.) Tutorials in Biostatistics Volume 1: Statistical Methods in Clinical Studies, pp. 85–105. Wiley, New York (2004)
  • Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977a)
  • Landis, J.R., Koch, G.G.: An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 33, 363–374 (1977b)
  • Linacre, J.M.: Many-Facet Rasch Measurement. MESA Press, Chicago (1989)
  • Lord, F.M., Novick, M.R.: Statistical Theories of Mental Test Scores. Addison-Wesley, Reading (1968)
  • Miller, C.S., Shields, A.L., Campfield, D., Wallace, K.A., Weiss, R.D.: Substance use scales of the Minnesota Multiphasic Personality Inventory: an exploration of score reliability via meta-analysis. Educ. Psychol. Meas. 67, 1052–1065 (2007)
  • Nixon, J., Thorpe, H., Barrow, H., Phillips, A., Nelson, E.A., Mason, S.A., Cullum, N.: Reliability of pressure ulcer classification and diagnosis. J. Adv. Nurs. 50, 613–623 (2005)
  • Pedley, G.E.: Comparison of pressure ulcer grading scales: a study of clinical utility and inter-rater reliability. Int. J. Nurs. Stud. 41, 129–140 (2004)
  • Raudenbush, S.W.: Analyzing effect sizes: random-effects models. In: Cooper, H.M., Hedges, L.V., Valentine, J.C. (eds.) The Handbook of Research Synthesis and Meta-Analysis, 2nd edn., pp. 295–315. Russell Sage Foundation, New York (2009)
  • Rohner, R.P., Khaleque, A.: Reliability and validity of the Parental Control Scale: a meta-analysis of cross-cultural and intracultural studies. J. Cross Cult. Psychol. 34, 643–649 (2003)
  • Russell, L.: Pressure ulcer classification: the systems and the pitfalls. Br. J. Nurs. 11, S49–S59 (2002)
  • Schmidt, F.L., Hunter, J.E.: Development of a general solution to the problem of validity generalization. J. Appl. Psychol. 62, 529–540 (1977)
  • Shrout, P.E.: Measurement reliability and agreement in psychiatry. Stat. Methods Med. Res. 7, 301–317 (1998)
  • Sim, J., Wright, C.C.: The kappa statistic in reliability studies: use, interpretation and sample size requirements. Phys. Ther. 85, 257–268 (2005)
  • Spearman, C.E.: The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101 (1904)
  • Stemler, S.E., Tsai, J.: Best practices in interrater reliability: three common approaches. In: Osborne, J.W. (ed.) Best Practices in Quantitative Methods, pp. 29–49. Sage, Thousand Oaks (2008)
  • Stotts, N.A.: Assessing a patient with a pressure ulcer. In: Morison, M.J. (ed.) The Prevention and Treatment of Pressure Ulcers, pp. 99–115. Mosby, London (2001)
  • Suen, H.K.: Agreement, reliability, accuracy and validity: toward a clarification. Behav. Assess. 10, 343–366 (1988)
  • Sutton, A.J.: Publication bias. In: Cooper, H.M., Hedges, L.V., Valentine, J.C. (eds.) The Handbook of Research Synthesis and Meta-Analysis, 2nd edn., pp. 435–452. Russell Sage Foundation, New York (2009)
  • Thompson, B.: Guidelines for authors. Educ. Psychol. Meas. 54, 837–847 (1994a)
  • Thompson, B.: Score Reliability: Contemporary Thinking on Reliability Issues. Sage, Thousand Oaks (2002)
  • Thompson, B., Vacha-Haase, T.: Psychometrics is datametrics: the test is not reliable. Educ. Psychol. Meas. 60, 174–195 (2000)
  • Thompson, S.G.: Why sources of heterogeneity in meta-analysis should be investigated. Br. Med. J. 309, 1351–1355 (1994b)
  • Thorndike, R.M.: Measurement and Evaluation in Psychology and Education. Pearson Merrill Prentice Hall, Upper Saddle River (2005)
  • Vacha-Haase, T.: Reliability generalization: exploring variance in measurement error affecting score reliability across studies. Educ. Psychol. Meas. 58, 6–20 (1998)
  • Vacha-Haase, T., Henson, R.K., Caruso, J.C.: Reliability generalization: moving toward improved understanding and use of score reliability. Educ. Psychol. Meas. 62, 562–569 (2002)
  • Vanbelle, S., Albert, A.: Agreement between two independent groups of raters. Psychometrika 74, 477–492 (2009)
  • Vanderwee, K., Grypdonck, M., De Bacquer, D., Defloor, T.: The reliability of two observation methods of nonblanchable erythema, Grade 1 pressure ulcer. Appl. Nurs. Res. 19, 156–162 (2006)
  • Viechtbauer, W.: Conducting meta-analyses in R with the metafor package. J. Stat. Softw. 36, 1–48 (2010)
  • Viera, A.J., Garrett, J.M.: Understanding interobserver agreement: the kappa statistic. Fam. Med. 37, 360–363 (2005)
  • Viswesvaran, C., Ones, D.S.: Measurement error in “Big Five Factors” personality assessment: reliability generalization across studies and measures. Educ. Psychol. Meas. 60, 224–235 (2000)
  • von Eye, A.: An alternative to Cohen’s κ. Eur. Psychol. 11, 12–24 (2006)
  • Yin, P., Fan, X.: Assessing the reliability of Beck Depression Inventory scores: reliability generalization across studies. Educ. Psychol. Meas. 60, 201–223 (2000)
  • Zwick, R.: Another look at interrater agreement. Psychol. Bull. 103, 374–378 (1988)


Author information

Correspondence to Shuyan Sun.

Appendices

Appendix A

An example illustrating how to calculate Cohen’s κ, its variance, and a 95% confidence interval, using real data from Nixon et al. (2005)

 

                              Ward nurse
Clinical research nurse       No pressure ulcer   Pressure ulcer   Total
  No pressure ulcer           (a) 2175            (b) 35           (f1) 2210
  Pressure ulcer              (c) 42              (d) 144          (f2) 186
  Total                       (g1) 2217           (g2) 179         (n) 2396

$$ p_{0} = \frac{a + d}{n} = \frac{2175 + 144}{2396} \approx 0.97 $$
$$ p_{c} = \frac{f_{1}}{n}\cdot\frac{g_{1}}{n} + \frac{f_{2}}{n}\cdot\frac{g_{2}}{n} = \frac{2210 \times 2217 + 186 \times 179}{2396^{2}} \approx 0.86 $$
$$ \hat{\kappa} = \frac{p_{0} - p_{c}}{1 - p_{c}} = \frac{0.97 - 0.86}{1 - 0.86} \approx 0.79 $$
$$ \mathrm{Var}(\hat{\kappa}) = \frac{1}{(1 - p_{c})^{2}}\,\frac{p_{0}(1 - p_{0})}{n} = \frac{1}{(1 - 0.86)^{2}}\,\frac{0.97(1 - 0.97)}{2396} \approx 6.20 \times 10^{-4} $$

\( \hat{\kappa} \pm 1.96 \sqrt{\mathrm{Var}(\hat{\kappa})} \approx 0.79 \pm 1.96 \times 0.02 \), so the 95% confidence interval for Cohen’s κ is [0.75, 0.83].

Note that the calculation of \( \mathrm{Var}(\hat{\kappa}) \) depends on \( p_{c} \), which is usually not reported in research articles. When \( p_{0} \) and \( \hat{\kappa} \) are reported, \( p_{c} \) can be derived as \( p_{c} = \frac{p_{0} - \hat{\kappa}}{1 - \hat{\kappa}} \).
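For readers who want to check the Appendix A arithmetic in software, the following R sketch (object names are illustrative, not taken from the article) reproduces the calculation from the Nixon et al. (2005) cell counts; note that without the intermediate rounding of \( p_{0} \) and \( p_{c} \) to two decimals it gives \( \hat{\kappa} \approx 0.77 \) rather than 0.79.

# 2 x 2 table from Nixon et al. (2005): rows = clinical research nurse, columns = ward nurse
tab <- matrix(c(2175,  35,
                  42, 144),
              nrow = 2, byrow = TRUE,
              dimnames = list(clinical = c("No PU", "PU"), ward = c("No PU", "PU")))
n  <- sum(tab)                                # 2396 subjects
p0 <- sum(diag(tab)) / n                      # observed agreement
pc <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
kappa     <- (p0 - pc) / (1 - pc)
var_kappa <- p0 * (1 - p0) / (n * (1 - pc)^2)
kappa + c(-1.96, 1.96) * sqrt(var_kappa)      # approximate 95% confidence interval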

Appendix B

R code for all analyses conducted in the meta-analysis of Cohen’s κ for PU classification systems

# Install (once, if needed) and load the metafor package
# install.packages("metafor")
library(metafor)

# Derive vi, the sampling variance of each kappa estimate.
# PUdata contains, for each study i: ki (Cohen's kappa), p0i (percentage agreement),
# ni (number of subjects) and rater (the moderator).
# pci is the chance agreement derived from p0i and ki (as shown in Appendix A).
PUdata$pci <- (PUdata$p0i - PUdata$ki) / (1 - PUdata$ki)
PUdata$vi  <- PUdata$p0i * (1 - PUdata$p0i) / ((1 - PUdata$pci)^2 * PUdata$ni)

# Fit the random-effects model using rma() and get confidence intervals for parameter estimates
res <- rma(ki, vi, data = PUdata)
confint(res)

# Forest plot
forest(res, at = c(-0.2, 0, 0.2, 0.4, 0.6, 0.8, 1, 1.2), slab = paste(PUdata$Study))
op <- par(cex = 1, font = 2)
text(-1.25, 17, "Study", pos = 4)
text(1.5, 17, "Cohen's Kappa [95% CI]", pos = 4)

# Funnel plot to check for publication bias
funnel(res, main = "Random-Effects Model")

# Sort the data by the moderator variable rater for the mixed-effects model
data <- PUdata[order(PUdata$rater), ]

# Fit the mixed-effects model using rma() and get confidence intervals for parameter estimates
mix <- rma(ki, vi, mods = ~ factor(rater), data = data)
confint(mix)

# Forest plot for the mixed-effects model, with group estimates added at the bottom
forest(data$ki, data$vi, at = c(-0.2, 0, 0.2, 0.4, 0.6, 0.8, 1, 1.2),
       ylim = c(-3, 18), slab = paste(data$Study))
preds <- predict(mix, newmods = c(0, 1))   # predicted kappa for each rater group
op <- par(cex = 1.1, font = 1)
addpoly(preds$pred, sei = preds$se,
        mlab = c("Raters without training", "Raters with training"))
abline(h = 0)
abline(h = 10.5)
text(1.2, 0.4, "Raters with training")
text(1.2, 11, "Raters without training")
op <- par(cex = 1.1, font = 2)
text(-1.25, 17, "Study", pos = 4)
text(1.5, 17, "Cohen's Kappa [95% CI]", pos = 4)

# Trim-and-fill adjustment for publication bias under the random-effects model
re  <- rma(ki, vi, data = PUdata, method = "REML")
rtf <- trimfill(re)

# Trim-and-fill adjustment for publication bias under the fixed-effects model
fe  <- rma(ki, vi, data = PUdata, method = "FE")
ftf <- trimfill(fe)

# Funnel plot with the augmented (filled-in) data
funnel(ftf)
abline(v = 1)
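The listing above assumes a data frame PUdata with one row per study and columns Study, ki, p0i, ni and rater. A purely hypothetical sketch of how such a data frame could be set up and analysed is given below; the values are invented solely to show the expected structure and are not the pressure ulcer studies analysed in this article.

# Hypothetical toy data: invented values, used only to illustrate the expected
# structure of PUdata; NOT the pressure ulcer studies from the article.
PUdata <- data.frame(
  Study = c("Study A", "Study B", "Study C", "Study D"),
  ki    = c(0.62, 0.75, 0.81, 0.70),  # reported Cohen's kappa
  p0i   = c(0.88, 0.92, 0.95, 0.90),  # reported percentage agreement
  ni    = c(120, 200, 150, 90),       # number of subjects
  rater = c(0, 1, 1, 0)               # moderator: 0 = untrained raters, 1 = trained raters
)
PUdata$pci <- (PUdata$p0i - PUdata$ki) / (1 - PUdata$ki)
PUdata$vi  <- PUdata$p0i * (1 - PUdata$p0i) / ((1 - PUdata$pci)^2 * PUdata$ni)

library(metafor)
res <- rma(ki, vi, data = PUdata)  # random-effects model fitted to the toy data
summary(res)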


About this article

Cite this article

Sun, S. Meta-analysis of Cohen’s kappa. Health Serv Outcomes Res Method 11, 145–163 (2011). https://doi.org/10.1007/s10742-011-0077-3
