Abstract
Cohen’s κ is the most important and most widely accepted measure of inter-rater reliability when the outcome of interest is measured on a nominal scale. The estimates of Cohen’s κ usually vary from one study to another due to differences in study settings, test properties, rater characteristics and subject characteristics. This study proposes a formal statistical framework for meta-analysis of Cohen’s κ to describe the typical inter-rater reliability estimate across multiple studies, to quantify between-study variation and to evaluate the contribution of moderators to heterogeneity. To demonstrate the application of the proposed statistical framework, a meta-analysis of Cohen’s κ is conducted for pressure ulcer classification systems. Implications and directions for future research are discussed.
References
Allman, R.M.: Pressure ulcer prevalence, incidence, risk factors, and impact. Clin. Geriatr. Med. 13, 421–436 (1997)
Altman, D.G.: Practical Statistics for Medical Research. Chapman and Hall, London (1991)
Baugh, F.: Correcting effect sizes for score reliability: a reminder that measurement and substantive issues are linked inextricably. Educ. Psychol. Meas. 62, 254–263 (2002)
Banerjee, M., Capozzoli, M., McSweeney, L., Sinha, D.: Beyond kappa: a review of interrater agreement measures. Can. J. Stat. 27, 3–23 (1999)
Berry, K.J., Mielke, P.W.: A generalization of Cohen’s kappa agreement measure to interval measurement and multiple raters. Educ. Psychol. Meas. 48, 921–933 (1988)
Blackman, N.J.-M., Koval, J.J.: Interval estimation for Cohen’s kappa as a measure of agreement. Stat. Med. 19, 723–741 (2000)
Bloch, D.A., Kraemer, H.C.: 2×2 kappa coefficients: measures of agreement or association. Biometrics 45, 269–287 (1989)
Borenstein, M.: Software for publication bias. In: Rothstein, H.R., Sutton, A.J., Borenstein, M. (eds.) Publication Bias in Meta-Analysis—Prevention, Assessment and Adjustments, pp. 193–220. Wiley, Chichester (2005)
Bours, G., Halfens, R., Lubbers, M., Haalboom, J.: The development of a National Registration Form to measure the prevalence of pressure ulcers in the Netherlands. Ostomy Wound Manage. 45, 28–40 (1999)
Brennan, R.L., Prediger, D.J.: Coefficient kappa: some uses, misuses, and alternatives. Educ. Psychol. Meas. 41, 687–699 (1981)
Brennan, P., Silman, A.: Statistical methods for assessing observer variability in clinical measures. Br. Med. J. 304, 1491–1494 (1992)
Buntinx, F., Beckers, H., De Keyser, G., Flour, M., Nissen, G., Raskin, T., De Vet, H.: Inter-observer variation in the assessment of skin ulceration. J. Wound Care 5, 166–170 (1996)
Capraro, M.M., Capraro, R.M., Henson, R.K.: Measurement error of scores on the Mathematics Anxiety Rating Scale across studies. Educ. Psychol. Meas. 61, 373–386 (2001)
Caruso, J.C.: Reliability generalization of the NEO personality scales. Educ. Psychol. Meas. 60, 236–254 (2000)
Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46 (1960)
Cohen, J.: Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull. 70, 220–231 (1968)
Cohen, J.: Weighted Chi square: an extension of the kappa method. Educ. Psychol. Meas. 32, 61–74 (1972)
Davies, M., Fleiss, J.L.: Measurement agreement for multinomial data. Biometrics 38, 1047–1051 (1982)
Duval, S., Tweedie, R.: A nonparametric “trim and fill” method of assessing publication bias in meta-analysis. J. Am. Stat. Assoc. 95(449), 89–98 (2000a)
Duval, S., Tweedie, R.: Trim and fill: a simple funnel plot based method of testing and adjusting for publication bias in meta-analysis. Biometrics 56, 455–463 (2000b)
Everitt, B.S.: Moments of the statistics kappa and weighted kappa. Br. J. Math. Stat. Psychol. 21, 97–103 (1968)
Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76, 378–382 (1971)
Fleiss, J.L.: Statistical Methods for Rates and Proportions, 2nd edn. Wiley, New York (1981)
Fleiss, J.L., Cohen, J., Everitt, B.S.: Large sample standard errors of kappa and weighted kappa. Psychol. Bull. 72, 323–327 (1969)
Fleiss, J.L., Shrout, P.E.: The effects of measurement errors on some multivariate procedures. Am. J. Public Health 67, 1188–1191 (1977)
Graves, N., Birrell, F.A., Whitby, M.: Modeling the economic losses from pressure ulcers among hospitalized patients in Australia. Wound Repair Regen. 13, 462–467 (2005)
Gross, S.T.: The kappa coefficient of agreement for multiple observers when the number of subjects is small. Biometrics 42, 883–893 (1986)
Guilford, J.P., Fruchter, B.: Fundamental Statistics in Psychology and Education, 6th edn. McGraw-Hill, New York (1978)
Hart, S., Bergquist, S., Gajewski, B., Dunton, N.: Reliability testing of the national database of nursing quality indicators pressure ulcer indicator. J. Nurs. Care Qual. 21, 256–265 (2006)
Healey, F.: The reliability and utility of pressure sore grading scales. J. Tissue Viability 5, 111–114 (1995)
Hedges, L.V.: Fitting categorical models to effect sizes from a series of experiments. J. Educ. Stat. 7, 119–137 (1982a)
Hedges, L.V.: Fitting continuous models to effect sizes from a series of experiments. J. Educ. Stat. 7, 245–270 (1982b)
Hedges, L.V.: A random effects model for effect sizes. Psychol. Bull. 93, 388–395 (1983)
Hedges, L.V., Vevea, J.L.: Fixed and random effects models in meta-analysis. Psychol. Methods 3, 486–504 (1998)
Helms, J.E.: Another meta-analysis of the White Racial Identity Attitude Scale’s Cronbach alphas: implications for validity. Meas. Eval. Couns. Dev. 32, 122–137 (1999)
Henson, R.K.: Understanding internal consistency reliability estimates: a conceptual primer on coefficient alpha. Meas. Eval. Couns. Dev. 34, 177–189 (2001)
Henson, R.K., Kogan, L.R., Vacha-Haase, T.: A reliability generalization study of the Teacher Efficacy Scale and related instruments. Educ. Psychol. Meas. 61, 404–420 (2001)
Hunter, J.E., Schmidt, F.L.: Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. Sage, Newbury Park (1990)
Huynh, Q., Howell, R.T., Benet-Martinez, V.: Reliability of bidimensional acculturation scores: a meta-analysis. J. Cross Cult. Psychol. 40, 256–274 (2009)
Janson, H., Olsson, U.: A measure of agreement for interval or nominal multivariate observations. Educ. Psychol. Meas. 61, 277–289 (2001)
Janson, H., Olsson, U.: A measure of agreement for interval or nominal multivariate observations by different sets of judges. Educ. Psychol. Meas. 64, 62–70 (2004)
Kottner, J., Raeder, K., Halfens, R., Dassen, T.: A systematic review of inter-rater reliability of pressure ulcers classification systems. J. Clin. Nurs. 18, 315–336 (2009)
Koval, J.J., Blackman, N.J.-M.: Estimators of kappa-exact small sample properties. J. Stat. Comput. Simulat. 55, 513–536 (1996)
Kraemer, H.C.: Ramifications of a population model for κ as a coefficient of reliability. Psychometrika 44, 461–472 (1979)
Kraemer, H.C.: Extension of the kappa coefficient. Biometrics 36, 207–216 (1980)
Kraemer, H.C., Periyakoil, V.S., Noda, A.: Kappa coefficients in medical research. In: D’Agostino, R.B. (ed.) Tutorials in Biostatistics Volume 1: Statistical Methods in Clinical Studies, pp. 85–105. Wiley, New York (2004)
Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977a)
Landis, J.R., Koch, G.G.: An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 33, 363–374 (1977b)
Linacre, J.M.: Many-Facet Rasch Measurement. MESA Press, Chicago (1989)
Lord, F.M., Novick, M.R.: Statistical Theories of Mental Test Scores. Addison-Wesley, Reading (1968)
Miller, C.S., Shields, A.L., Campfield, D., Wallace, K.A., Weiss, R.D.: Substance use scales of the Minnesota Multiphasic Personality Inventory: an exploration of score reliability via meta-analysis. Educ. Psychol. Meas. 67, 1052–1065 (2007)
Nixon, J., Thorpe, H., Barrow, H., Phillips, A., Nelson, E.A., Mason, S.A., Cullum, N.: Reliability of pressure ulcer classification and diagnosis. J. Adv. Nurs. 50, 613–623 (2005)
Pedley, G.E.: Comparison of pressure ulcer grading scales: a study of clinical utility and inter-rater reliability. Int. J. Nurs. Stud. 41, 129–140 (2004)
Raudenbush, S.W.: Analyzing effect sizes: random-effects models. In: Cooper, H.M., Hedges, L.V., Valentine, J.C. (eds.) The Handbook of Research Synthesis and Meta-Analysis, 2nd edn, pp. 295–315. Russell Sage Foundation, New York (2009)
Rohner, R.P., Khaleque, A.: Reliability and validity of the Parental Control Scale: a meta-analysis of cross-cultural and intracultural studies. J. Cross Cult. Psychol. 34, 643–649 (2003)
Russell, L.: Pressure ulcer classification: the systems and the pitfalls. Br. J. Nurs. 11, S49–S59 (2002)
Schmidt, F.L., Hunter, J.E.: Development of a general solution to the problem of validity generalization. J. Appl. Psychol. 62, 529–540 (1977)
Shrout, P.E.: Measurement reliability and agreement in psychiatry. Stat. Meth. Med. Res. 7, 301–317 (1998)
Sim, J., Wright, C.C.: The Kappa statistic in reliability studies: use, interpretation and sample size requirements. Phys. Ther. 85, 257–268 (2005)
Spearman, C.E.: The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101 (1904)
Stemler, S.E., Tsai, J.: Best practices in interrater reliability: three common approaches. In: Osborne, J.W. (ed.) Best Practices in Quantitative Methods, pp. 29–49. Sage, Thousand Oaks (2008)
Stotts, N.A.: Assessing a patient with a pressure ulcer. In: Morison, M.J. (ed.) The Prevention and Treatment of Pressure Ulcers, pp. 99–115. Mosby, London (2001)
Suen, H.K.: Agreement, reliability, accuracy and validity: toward a clarification. Behav. Assess. 10, 343–366 (1988)
Sutton, A.J.: Publication bias. In: Cooper, H.M., Hedges, L.V., Valentine, J.C. (eds.) The Handbook of Research Synthesis and Meta-Analysis, 2nd edn, pp. 435–452. Russell Sage Foundation, New York (2009)
Thompson, B.: Guidelines for authors. Educ. Psychol. Meas. 54, 837–847 (1994a)
Thompson, B.: Score Reliability: Contemporary Thinking on Reliability Issues. Sage, Thousand Oaks (2002)
Thompson, B., Vacha-Haase, T.: Psychometrics is datametrics: the test is not reliable. Educ. Psychol. Meas. 60, 174–195 (2000)
Thompson, S.G.: Why sources of heterogeneity in meta-analysis should be investigated. Br. Med. J. 309, 1351–1355 (1994b)
Thorndike, R.M.: Measurement and Evaluation in Psychology and Education. Pearson Merrill Prentice Hall, Upper Saddle River (2005)
Vacha-Haase, T.: Reliability generalization: exploring variance in measurement error affecting score reliability across studies. Educ. Psychol. Meas. 58, 6–20 (1998)
Vacha-Haase, T., Henson, R.K., Caruso, J.C.: Reliability generalization: moving toward improved understanding and use of score reliability. Educ. Psychol. Meas. 62, 562–569 (2002)
Vanbelle, S., Albert, A.: Agreement between two independent groups of raters. Psychometrika 74, 477–492 (2009)
Vanderwee, K., Grypdonck, M., De Bacquer, D., Defloor, T.: The reliability of two observation methods of nonblanchable erythema, Grade 1 pressure ulcer. Appl. Nurs. Res. 19, 156–162 (2006)
Viechtbauer, W.: Conducting meta-analyses in R with the metafor package. J. Stat. Softw. 36, 1–48 (2010)
Viera, A.J., Garrett, J.M.: Understanding interobserver agreement: the Kappa statistic. Fam. Med. 37, 360–363 (2005)
Viswesvaran, C., Ones, D.S.: Measurement error in “Big Five Factors” personality assessment: reliability generalization across studies and measures. Educ. Psychol. Meas. 60, 224–235 (2000)
von Eye, A.: An alternative to Cohen’s κ. Eur. Psychol. 11, 12–24 (2006)
Yin, P., Fan, X.: Assessing the reliability of Beck Depression Inventory scores: reliability generalization across studies. Educ. Psychol. Meas. 60, 201–223 (2000)
Zwick, R.: Another look at interrater agreement. Psychol. Bull. 103, 374–378 (1988)
Appendices
Appendix A
An example illustrating how to calculate Cohen’s κ, its variance and confidence intervals using real data from Nixon et al. (2005)
| Clinical research nurse | Ward nurse: No pressure ulcer | Ward nurse: Pressure ulcer | Total |
|---|---|---|---|
| No pressure ulcer | (a) 2175 | (b) 35 | (f1) 2210 |
| Pressure ulcer | (c) 42 | (d) 144 | (f2) 186 |
| Total | (g1) 2217 | (g2) 179 | (n) 2396 |
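Using the cell labels in the table, the quantities entering Cohen’s κ are computed in the standard way (Cohen 1960): the observed agreement, the chance agreement from the marginal totals, and their chance-corrected ratio:

\( p_{0} = \frac{a + d}{n}, \qquad p_{c} = \frac{f_{1} g_{1} + f_{2} g_{2}}{n^{2}}, \qquad \hat{\kappa} = \frac{p_{0} - p_{c}}{1 - p_{c}} \)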
\( \hat{\kappa} \pm 1.96 \times \sqrt{{\text{Var}}(\hat{\kappa})} \approx 0.79 \pm 1.96 \times 0.02 \). Therefore, the 95% CI for Cohen’s κ is [0.75, 0.83].
Note that the calculation of \( {\text{Var}}(\hat{\kappa}) \) depends on \( p_{c} \), which is usually not reported in research articles. When \( p_{0} \) and \( \hat{\kappa} \) are reported, \( p_{c} \) can be derived as \( p_{c} = \frac{p_{0} - \hat{\kappa}}{1 - \hat{\kappa}} \).
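This back-derivation of \( p_{c} \), and the variance formula used in Appendix B, can be sketched in R. The numbers below are purely illustrative, not taken from any particular study:

```r
# Recover chance agreement pc from a reported p0 and kappa-hat,
# then recompute kappa-hat as a consistency check
p0 <- 0.90                             # illustrative percentage agreement
k  <- 0.60                             # illustrative Cohen's kappa estimate
n  <- 100                              # illustrative sample size

pc <- (p0 - k) / (1 - k)               # chance agreement: 0.75
k_check <- (p0 - pc) / (1 - pc)        # recovers the original kappa: 0.60

# Variance of kappa-hat, as used in Appendix B
vi <- p0 * (1 - p0) / ((1 - pc)^2 * n) # 0.0144
```

The consistency check confirms that \( p_{c} \) derived this way reproduces the reported \( \hat{\kappa} \) exactly.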
Appendix B
R code for all analyses conducted in the meta-analysis of Cohen’s κ for PU classification systems
# Load the metafor package (install once with install.packages("metafor"))
library(metafor)
# Derive vi, the variance of each kappa estimate
# p0i and ki are the percentage agreement and Cohen's kappa estimate for study i
# pci is the chance agreement derived from p0i and ki (as shown in Appendix A)
PUdata$pci <- (PUdata$p0i - PUdata$ki) / (1 - PUdata$ki)
PUdata$vi <- PUdata$p0i * (1 - PUdata$p0i) / ((1 - PUdata$pci)^2 * PUdata$ni)
# Fit the random-effects model using rma() and get confidence intervals for parameter estimates
res <- rma(ki, vi, data = PUdata)
confint(res)
# Get the forest plot
forest(res, at = c(-0.2, 0, 0.2, 0.4, 0.6, 0.8, 1, 1.2), slab = paste(PUdata$Study))
op <- par(cex = 1, font = 2)
text(-1.25, 17, "Study", pos = 4)
text(1.5, 17, "Cohen's Kappa [95% CI]", pos = 4)
# Get the funnel plot to check publication bias
funnel(res, main = "Random-Effects Model")
# Sort the data matrix by the moderator variable rater for the mixed-effects model
data <- PUdata[order(PUdata$rater), ]
# Fit the mixed-effects model using rma() and get confidence intervals for parameter estimates
mix <- rma(ki, vi, mods = ~ factor(rater), data = data)
confint(mix)
# Get the forest plot for the mixed-effects model and add group estimates at the bottom
forest(data$ki, vi = data$vi, at = c(-0.2, 0, 0.2, 0.4, 0.6, 0.8, 1, 1.2),
       ylim = c(-3, 18), slab = paste(data$Study))
preds <- predict(mix, newmods = c(0, 1))
op <- par(cex = 1.1, font = 1)
addpoly(preds$pred, sei = preds$se, mlab = c("Raters without training", "Raters with training"))
abline(h = 0)
abline(h = 10.5)
text(1.2, 0.4, "Raters with training")
text(1.2, 11, "Raters without training")
op <- par(cex = 1.1, font = 2)
text(-1.25, 17, "Study", pos = 4)
text(1.5, 17, "Cohen's Kappa [95% CI]", pos = 4)
# Use the trim-and-fill method to adjust for publication bias under the random-effects model
re <- rma(ki, vi, data = PUdata, method = "REML")
rtf <- trimfill(re)
# Use the trim-and-fill method to adjust for publication bias under the fixed-effects model
fe <- rma(ki, vi, data = PUdata, method = "FE")
ftf <- trimfill(fe)
# Get the funnel plot with the augmented data
funnel(ftf)
abline(v = 1)
Sun, S. Meta-analysis of Cohen’s kappa. Health Serv Outcomes Res Method 11, 145–163 (2011). https://doi.org/10.1007/s10742-011-0077-3