
Meta-analysis of Cohen’s kappa


Abstract

Cohen’s κ is the most important and most widely accepted measure of inter-rater reliability when the outcome of interest is measured on a nominal scale. The estimates of Cohen’s κ usually vary from one study to another due to differences in study settings, test properties, rater characteristics and subject characteristics. This study proposes a formal statistical framework for meta-analysis of Cohen’s κ to describe the typical inter-rater reliability estimate across multiple studies, to quantify between-study variation and to evaluate the contribution of moderators to heterogeneity. To demonstrate the application of the proposed statistical framework, a meta-analysis of Cohen’s κ is conducted for pressure ulcer classification systems. Implications and directions for future research are discussed.
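In the standard random-effects notation for meta-analysis (e.g., Hedges and Vevea 1998; Raudenbush 2009), the framework can be sketched as

$$ \hat{\kappa}_i = \mu + u_i + e_i, \qquad u_i \sim N(0, \tau^{2}), \qquad e_i \sim N(0, v_i), $$

where \( \hat{\kappa}_i \) is the kappa estimate from study \( i \) with sampling variance \( v_i \) (Appendix A), \( \mu \) is the typical inter-rater reliability across studies, and \( \tau^{2} \) is the between-study variance; moderators are accommodated by replacing \( \mu \) with a regression term in a mixed-effects model (Appendix B).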


References

  • Allman, R.M.: Pressure ulcer prevalence, incidence, risk factors, and impact. Clin. Geriatr. Med. 13, 421–436 (1997)
  • Altman, D.G.: Practical Statistics for Medical Research. Chapman and Hall, London (1991)
  • Baugh, F.: Correcting effect sizes for score reliability: a reminder that measurement and substantive issues are linked inextricably. Educ. Psychol. Meas. 62, 254–263 (2002)
  • Banerjee, M., Capozzoli, M., McSweeny, L., Sinha, D.: Beyond kappa: a review of interrater agreement measures. Can. J. Stat. 27, 3–23 (1999)
  • Berry, K.J., Mielke, P.W.: A generalization of Cohen’s kappa agreement measure to interval measurement and multiple raters. Educ. Psychol. Meas. 48, 921–933 (1988)
  • Blackman, N.J.-M., Koval, J.J.: Interval estimation for Cohen’s kappa as a measure of agreement. Stat. Med. 19, 723–741 (2000)
  • Bloch, D.A., Kraemer, H.C.: 2×2 kappa coefficients: measures of agreement or association. Biometrics 45, 269–287 (1989)
  • Borenstein, M.: Software for publication bias. In: Rothstein, H.R., Sutton, A.J., Borenstein, M. (eds.) Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, pp. 193–220. Wiley, Chichester (2005)
  • Bours, G., Halfens, R., Lubbers, M., Haalboom, J.: The development of a National Registration Form to measure the prevalence of pressure ulcers in the Netherlands. Ostomy Wound Manage. 45, 28–40 (1999)
  • Brennan, R.L., Prediger, D.J.: Coefficient kappa: some uses, misuses, and alternatives. Educ. Psychol. Meas. 41, 687–699 (1981)
  • Brennan, P., Silman, A.: Statistical methods for assessing observer variability in clinical measures. Br. Med. J. 304, 1491–1494 (1992)
  • Buntinx, F., Beckers, H., De Keyser, G., Flour, M., Nissen, G., Raskin, T., De Vet, H.: Inter-observer variation in the assessment of skin ulceration. J. Wound Care 5, 166–170 (1996)
  • Capraro, M.M., Capraro, R.M., Henson, R.K.: Measurement error of scores on the Mathematics Anxiety Rating Scale across studies. Educ. Psychol. Meas. 61, 373–386 (2001)
  • Caruso, J.C.: Reliability generalization of the NEO personality scales. Educ. Psychol. Meas. 60, 236–254 (2000)
  • Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46 (1960)
  • Cohen, J.: Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull. 70, 213–220 (1968)
  • Cohen, J.: Weighted chi square: an extension of the kappa method. Educ. Psychol. Meas. 32, 61–74 (1972)
  • Davies, M., Fleiss, J.L.: Measuring agreement for multinomial data. Biometrics 38, 1047–1051 (1982)
  • Duval, S., Tweedie, R.: A nonparametric “trim and fill” method of accounting for publication bias in meta-analysis. J. Am. Stat. Assoc. 95(449), 89–98 (2000a)
  • Duval, S., Tweedie, R.: Trim and fill: a simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics 56, 455–463 (2000b)
  • Everitt, B.S.: Moments of the statistics kappa and weighted kappa. Br. J. Math. Stat. Psychol. 21, 97–103 (1968)
  • Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76, 378–382 (1971)
  • Fleiss, J.L.: Statistical Methods for Rates and Proportions, 2nd edn. Wiley, New York (1981)
  • Fleiss, J.L., Cohen, J., Everitt, B.S.: Large sample standard errors of kappa and weighted kappa. Psychol. Bull. 72, 323–327 (1969)
  • Fleiss, J.L., Shrout, P.E.: The effects of measurement errors on some multivariate procedures. Am. J. Public Health 67, 1188–1191 (1977)
  • Graves, N., Birrell, F.A., Whitby, M.: Modeling the economic losses from pressure ulcers among hospitalized patients in Australia. Wound Repair Regen. 13, 462–467 (2005)
  • Gross, S.T.: The kappa coefficient of agreement for multiple observers when the number of subjects is small. Biometrics 42, 883–893 (1986)
  • Guilford, J.P., Fruchter, B.: Fundamental Statistics in Psychology and Education, 6th edn. McGraw-Hill, New York (1978)
  • Hart, S., Bergquist, S., Gajewski, B., Dunton, N.: Reliability testing of the National Database of Nursing Quality Indicators pressure ulcer indicator. J. Nurs. Care Qual. 21, 256–265 (2006)
  • Healey, F.: The reliability and utility of pressure sore grading scales. J. Tissue Viability 5, 111–114 (1995)
  • Hedges, L.V.: Fitting categorical models to effect sizes from a series of experiments. J. Educ. Stat. 7, 119–137 (1982a)
  • Hedges, L.V.: Fitting continuous models to effect sizes from a series of experiments. J. Educ. Stat. 7, 245–270 (1982b)
  • Hedges, L.V.: A random effects model for effect sizes. Psychol. Bull. 93, 388–395 (1983)
  • Hedges, L.V., Vevea, J.L.: Fixed and random effects models in meta-analysis. Psychol. Methods 3, 486–504 (1998)
  • Helms, J.E.: Another meta-analysis of the White Racial Identity Attitude Scale’s Cronbach alphas: implications for validity. Meas. Eval. Couns. Dev. 32, 122–137 (1999)
  • Henson, R.K.: Understanding internal consistency reliability estimates: a conceptual primer on coefficient alpha. Meas. Eval. Couns. Dev. 34, 177–189 (2001)
  • Henson, R.K., Kogan, L.R., Vacha-Haase, T.: A reliability generalization study of the Teacher Efficacy Scale and related instruments. Educ. Psychol. Meas. 61, 404–420 (2001)
  • Hunter, J.E., Schmidt, F.L.: Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. Sage, Newbury Park (1990)
  • Huynh, Q., Howell, R.T., Benet-Martinez, V.: Reliability of bidimensional acculturation scores: a meta-analysis. J. Cross Cult. Psychol. 40, 256–274 (2009)
  • Janson, H., Olsson, U.: A measure of agreement for interval or nominal multivariate observations. Educ. Psychol. Meas. 61, 277–289 (2001)
  • Janson, H., Olsson, U.: A measure of agreement for interval or nominal multivariate observations by different sets of judges. Educ. Psychol. Meas. 64, 62–70 (2004)
  • Kottner, J., Raeder, K., Halfens, R., Dassen, T.: A systematic review of interrater reliability of pressure ulcer classification systems. J. Clin. Nurs. 18, 315–336 (2009)
  • Koval, J.J., Blackman, N.J.-M.: Estimators of kappa: exact small sample properties. J. Stat. Comput. Simul. 55, 513–536 (1996)
  • Kraemer, H.C.: Ramifications of a population model for κ as a coefficient of reliability. Psychometrika 44, 461–472 (1979)
  • Kraemer, H.C.: Extension of the kappa coefficient. Biometrics 36, 207–216 (1980)
  • Kraemer, H.C., Periyakoil, V.S., Noda, A.: Kappa coefficients in medical research. In: D’Agostino, R.B. (ed.) Tutorials in Biostatistics Volume 1: Statistical Methods in Clinical Studies, pp. 85–105. Wiley, New York (2004)
  • Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977a)
  • Landis, J.R., Koch, G.G.: An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 33, 363–374 (1977b)
  • Linacre, J.M.: Many-Facet Rasch Measurement. MESA Press, Chicago (1989)
  • Lord, F.M., Novick, M.R.: Statistical Theories of Mental Test Scores. Addison-Wesley, Reading (1968)
  • Miller, C.S., Shields, A.L., Campfield, D., Wallace, K.A., Weiss, R.D.: Substance use scales of the Minnesota Multiphasic Personality Inventory: an exploration of score reliability via meta-analysis. Educ. Psychol. Meas. 67, 1052–1065 (2007)
  • Nixon, J., Thorpe, H., Barrow, H., Phillips, A., Nelson, E.A., Mason, S.A., Cullum, N.: Reliability of pressure ulcer classification and diagnosis. J. Adv. Nurs. 50, 613–623 (2005)
  • Pedley, G.E.: Comparison of pressure ulcer grading scales: a study of clinical utility and inter-rater reliability. Int. J. Nurs. Stud. 41, 129–140 (2004)
  • Raudenbush, S.W.: Analyzing effect sizes: random-effects models. In: Cooper, H.M., Hedges, L.V., Valentine, J.C. (eds.) The Handbook of Research Synthesis and Meta-Analysis, 2nd edn., pp. 295–315. Russell Sage Foundation, New York (2009)
  • Rohner, R.P., Khaleque, A.: Reliability and validity of the Parental Control Scale: a meta-analysis of cross-cultural and intracultural studies. J. Cross Cult. Psychol. 34, 643–649 (2003)
  • Russell, L.: Pressure ulcer classification: the systems and the pitfalls. Br. J. Nurs. 11, S49–S59 (2002)
  • Schmidt, F.L., Hunter, J.E.: Development of a general solution to the problem of validity generalization. J. Appl. Psychol. 62, 529–540 (1977)
  • Shrout, P.E.: Measurement reliability and agreement in psychiatry. Stat. Methods Med. Res. 7, 301–317 (1998)
  • Sim, J., Wright, C.C.: The kappa statistic in reliability studies: use, interpretation and sample size requirements. Phys. Ther. 85, 257–268 (2005)
  • Spearman, C.E.: The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101 (1904)
  • Stemler, S.E., Tsai, J.: Best practices in interrater reliability: three common approaches. In: Osborne, J.W. (ed.) Best Practices in Quantitative Methods, pp. 29–49. Sage, Thousand Oaks (2008)
  • Stotts, N.A.: Assessing a patient with a pressure ulcer. In: Morison, M.J. (ed.) The Prevention and Treatment of Pressure Ulcers, pp. 99–115. Mosby, London (2001)
  • Suen, H.K.: Agreement, reliability, accuracy and validity: toward a clarification. Behav. Assess. 10, 343–366 (1988)
  • Sutton, A.J.: Publication bias. In: Cooper, H.M., Hedges, L.V., Valentine, J.C. (eds.) The Handbook of Research Synthesis and Meta-Analysis, 2nd edn., pp. 435–452. Russell Sage Foundation, New York (2009)
  • Thompson, B.: Guidelines for authors. Educ. Psychol. Meas. 54, 837–847 (1994a)
  • Thompson, B.: Score Reliability: Contemporary Thinking on Reliability Issues. Sage, Thousand Oaks (2002)
  • Thompson, B., Vacha-Haase, T.: Psychometrics is datametrics: the test is not reliable. Educ. Psychol. Meas. 60, 174–195 (2000)
  • Thompson, S.G.: Why sources of heterogeneity in meta-analysis should be investigated. Br. Med. J. 309, 1351–1355 (1994b)
  • Thorndike, R.M.: Measurement and Evaluation in Psychology and Education. Pearson Merrill Prentice Hall, Upper Saddle River (2005)
  • Vacha-Haase, T.: Reliability generalization: exploring variance in measurement error affecting score reliability across studies. Educ. Psychol. Meas. 58, 6–20 (1998)
  • Vacha-Haase, T., Henson, R.K., Caruso, J.C.: Reliability generalization: moving toward improved understanding and use of score reliability. Educ. Psychol. Meas. 62, 562–569 (2002)
  • Vanbelle, S., Albert, A.: Agreement between two independent groups of raters. Psychometrika 74, 477–492 (2009)
  • Vanderwee, K., Grypdonck, M., De Bacquer, D., Defloor, T.: The reliability of two observation methods of nonblanchable erythema, Grade 1 pressure ulcer. Appl. Nurs. Res. 19, 156–162 (2006)
  • Viechtbauer, W.: Conducting meta-analyses in R with the metafor package. J. Stat. Softw. 36, 1–48 (2010)
  • Viera, A.J., Garrett, J.M.: Understanding interobserver agreement: the kappa statistic. Fam. Med. 37, 360–363 (2005)
  • Viswesvaran, C., Ones, D.S.: Measurement error in “Big Five Factors” personality assessment: reliability generalization across studies and measures. Educ. Psychol. Meas. 60, 224–235 (2000)
  • von Eye, A.: An alternative to Cohen’s κ. Eur. Psychol. 11, 12–24 (2006)
  • Yin, P., Fan, X.: Assessing the reliability of Beck Depression Inventory scores: reliability generalization across studies. Educ. Psychol. Meas. 60, 201–223 (2000)
  • Zwick, R.: Another look at interrater agreement. Psychol. Bull. 103, 374–378 (1988)


Author information

Correspondence to Shuyan Sun.

Appendices

Appendix A

An example illustrating how to calculate Cohen’s κ, its variance, and a 95% confidence interval, using real data from Nixon et al. (2005)

 

                              Ward nurse
Clinical research nurse       No pressure ulcer   Pressure ulcer   Total
  No pressure ulcer           (a) 2175            (b) 35           (f1) 2210
  Pressure ulcer              (c) 42              (d) 144          (f2) 186
  Total                       (g1) 2217           (g2) 179         (n) 2396

$$ p_{0} = \frac{a + d}{n} = \frac{2175 + 144}{2396} \approx 0.97 $$
$$ p_{c} = \frac{f_{1}}{n}\cdot\frac{g_{1}}{n} + \frac{f_{2}}{n}\cdot\frac{g_{2}}{n} = \frac{2210 \times 2217 + 186 \times 179}{2396^{2}} \approx 0.86 $$
$$ \hat{\kappa} = \frac{p_{0} - p_{c}}{1 - p_{c}} = \frac{0.97 - 0.86}{1 - 0.86} \approx 0.79 $$
$$ \mathrm{Var}(\hat{\kappa}) = \frac{1}{(1 - p_{c})^{2}}\,\frac{p_{0}(1 - p_{0})}{n} = \frac{1}{(1 - 0.86)^{2}}\,\frac{0.97(1 - 0.97)}{2396} \approx 6.20 \times 10^{-4} $$

\( \hat{\kappa} \pm 1.96 \sqrt{\mathrm{Var}(\hat{\kappa})} \approx 0.79 \pm 1.96 \times 0.02 \), so the 95% confidence interval for Cohen’s κ is [0.75, 0.83].

Note that the calculation of \( \mathrm{Var}(\hat{\kappa}) \) depends on \( p_{c} \), which is usually not reported in research articles. When \( p_{0} \) and \( \hat{\kappa} \) are reported, \( p_{c} \) can be derived as \( p_{c} = \frac{p_{0} - \hat{\kappa}}{1 - \hat{\kappa}} \).
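For readers who want to check the Appendix A arithmetic in software, the following R sketch (object names are illustrative, not taken from the article) reproduces the calculation from the Nixon et al. (2005) cell counts; note that without the intermediate rounding of \( p_{0} \) and \( p_{c} \) to two decimals it gives \( \hat{\kappa} \approx 0.77 \) rather than 0.79.

# 2 x 2 table from Nixon et al. (2005): rows = clinical research nurse, columns = ward nurse
tab <- matrix(c(2175,  35,
                  42, 144),
              nrow = 2, byrow = TRUE,
              dimnames = list(clinical = c("No PU", "PU"), ward = c("No PU", "PU")))
n  <- sum(tab)                                # 2396 subjects
p0 <- sum(diag(tab)) / n                      # observed agreement
pc <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
kappa     <- (p0 - pc) / (1 - pc)
var_kappa <- p0 * (1 - p0) / (n * (1 - pc)^2)
kappa + c(-1.96, 1.96) * sqrt(var_kappa)      # approximate 95% confidence interval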

Appendix B

R code for all analyses conducted in the meta-analysis of Cohen’s κ for PU classification systems

# Install (once, if needed) and load the metafor package
# install.packages("metafor")
library(metafor)

# Derive vi, the sampling variance of each kappa estimate.
# PUdata contains, for each study i: ki (Cohen's kappa), p0i (percentage agreement),
# ni (number of subjects) and rater (the moderator).
# pci is the chance agreement derived from p0i and ki (as shown in Appendix A).
PUdata$pci <- (PUdata$p0i - PUdata$ki) / (1 - PUdata$ki)
PUdata$vi  <- PUdata$p0i * (1 - PUdata$p0i) / ((1 - PUdata$pci)^2 * PUdata$ni)

# Fit the random-effects model using rma() and get confidence intervals for parameter estimates
res <- rma(ki, vi, data = PUdata)
confint(res)

# Forest plot
forest(res, at = c(-0.2, 0, 0.2, 0.4, 0.6, 0.8, 1, 1.2), slab = paste(PUdata$Study))
op <- par(cex = 1, font = 2)
text(-1.25, 17, "Study", pos = 4)
text(1.5, 17, "Cohen's Kappa [95% CI]", pos = 4)

# Funnel plot to check for publication bias
funnel(res, main = "Random-Effects Model")

# Sort the data by the moderator variable rater for the mixed-effects model
data <- PUdata[order(PUdata$rater), ]

# Fit the mixed-effects model using rma() and get confidence intervals for parameter estimates
mix <- rma(ki, vi, mods = ~ factor(rater), data = data)
confint(mix)

# Forest plot for the mixed-effects model, with group estimates added at the bottom
forest(data$ki, data$vi, at = c(-0.2, 0, 0.2, 0.4, 0.6, 0.8, 1, 1.2),
       ylim = c(-3, 18), slab = paste(data$Study))
preds <- predict(mix, newmods = c(0, 1))   # predicted kappa for each rater group
op <- par(cex = 1.1, font = 1)
addpoly(preds$pred, sei = preds$se,
        mlab = c("Raters without training", "Raters with training"))
abline(h = 0)
abline(h = 10.5)
text(1.2, 0.4, "Raters with training")
text(1.2, 11, "Raters without training")
op <- par(cex = 1.1, font = 2)
text(-1.25, 17, "Study", pos = 4)
text(1.5, 17, "Cohen's Kappa [95% CI]", pos = 4)

# Trim-and-fill adjustment for publication bias under the random-effects model
re  <- rma(ki, vi, data = PUdata, method = "REML")
rtf <- trimfill(re)

# Trim-and-fill adjustment for publication bias under the fixed-effects model
fe  <- rma(ki, vi, data = PUdata, method = "FE")
ftf <- trimfill(fe)

# Funnel plot with the augmented (filled-in) data
funnel(ftf)
abline(v = 1)
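The listing above assumes a data frame PUdata with one row per study and columns Study, ki, p0i, ni and rater. A purely hypothetical sketch of how such a data frame could be set up and analysed is given below; the values are invented solely to show the expected structure and are not the pressure ulcer studies analysed in this article.

# Hypothetical toy data: invented values, used only to illustrate the expected
# structure of PUdata; NOT the pressure ulcer studies from the article.
PUdata <- data.frame(
  Study = c("Study A", "Study B", "Study C", "Study D"),
  ki    = c(0.62, 0.75, 0.81, 0.70),  # reported Cohen's kappa
  p0i   = c(0.88, 0.92, 0.95, 0.90),  # reported percentage agreement
  ni    = c(120, 200, 150, 90),       # number of subjects
  rater = c(0, 1, 1, 0)               # moderator: 0 = untrained raters, 1 = trained raters
)
PUdata$pci <- (PUdata$p0i - PUdata$ki) / (1 - PUdata$ki)
PUdata$vi  <- PUdata$p0i * (1 - PUdata$p0i) / ((1 - PUdata$pci)^2 * PUdata$ni)

library(metafor)
res <- rma(ki, vi, data = PUdata)  # random-effects model fitted to the toy data
summary(res)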


About this article

Cite this article

Sun, S. Meta-analysis of Cohen’s kappa. Health Serv Outcomes Res Method 11, 145–163 (2011). https://doi.org/10.1007/s10742-011-0077-3
