Abstract
Although intercoder reliability has been considered crucial to the validity of a content study, the choice among the available reliability indices has been controversial. This study analyzed all the content studies that reported intercoder reliability published in the two major communication journals, aiming to find out how scholars conduct intercoder reliability tests. The results revealed that over the past 30 years some intercoder reliability indices have been persistently misused with respect to the level of measurement, the number of coders, and the manner of reporting reliability. The implications of misuse, disuse, and abuse were discussed, and suggestions were offered regarding the proper choice of indices in various situations.
Notes
\(^{1}\) Coders may also be called annotators, judges, raters, observers, or classifiers, among other terms, depending on the research field. Intercoder, as well as interrater, is used interchangeably throughout the paper.
When the reliability value is far lower than percent agreement, e.g., percent agreement is above 0.8 while the reliability estimate is close to or below 0, this may indicate that the marginal distribution is too skewed.
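As an illustration (hypothetical codings, not data from the reviewed studies), the following base R sketch shows how skewed marginals produce a percent agreement of 0.90 alongside a Cohen's (1960) \(\kappa \) slightly below 0:

coder1 <- c(rep(1, 18), 1, 2)   # 19 codes of "1" and 1 code of "2"
coder2 <- c(rep(1, 18), 2, 1)   # same skewed marginals; the last two units disagree

percent_agreement <- mean(coder1 == coder2)   # 18/20 = 0.90

tab <- table(factor(coder1, levels = 1:2), factor(coder2, levels = 1:2)) / length(coder1)
po  <- sum(diag(tab))                      # observed agreement: 0.90
pe  <- sum(rowSums(tab) * colSums(tab))    # chance agreement implied by the skewed marginals: 0.905
kappa <- (po - pe) / (1 - pe)              # about -0.05 despite 90% agreement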
It is identical to Bennett et al. (1954)’s \(S\) coefficient.
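For reference, a standard formulation of \(S\) corrects observed percent agreement \(p_{o}\) by a fixed chance rate of \(1/k\), where \(k\) is the number of categories: \(S = \frac{p_{o} - 1/k}{1 - 1/k}\).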
As Lombard et al. (2002) argued, the proportion of studies using percent agreement was probably underestimated, because most “NAs” would actually have adopted percent agreement.
Cohen (1968) later proposed weighted \(\kappa \) for ordinal ratings. Krippendorff’s (2004a) \(\alpha \) can be applied to all levels of measurement. Some indices, such as ICCs, are applicable only to interval ratings, while others, such as \(I_{r}\), Brennan and Prediger’s (1981) \(\kappa \), and \(\pi \), have no counterparts for higher levels of measurement.
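As a minimal sketch of the ordinal case (an illustrative base R function, not a procedure taken from the reviewed studies), Cohen’s (1968) weighted \(\kappa \) with quadratic weights can be computed directly from two coders’ ratings:

# Weighted kappa (Cohen 1968) with quadratic agreement weights; r1 and r2 are
# two coders' ratings of the same units on an ordinal scale with categories 1..k.
weighted_kappa <- function(r1, r2, k = max(c(r1, r2))) {
  tab <- table(factor(r1, levels = 1:k), factor(r2, levels = 1:k)) / length(r1)
  w   <- outer(1:k, 1:k, function(i, j) 1 - (i - j)^2 / (k - 1)^2)  # 1 on the diagonal, 0 at maximal distance
  po  <- sum(w * tab)                                # weighted observed agreement
  pe  <- sum(w * outer(rowSums(tab), colSums(tab)))  # weighted chance agreement
  (po - pe) / (1 - pe)
}

# Hypothetical example: two coders, a 5-point ordinal scale
weighted_kappa(c(1, 2, 2, 3, 4, 5, 3, 2, 4, 5), c(1, 2, 3, 3, 4, 4, 3, 2, 5, 5), k = 5)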
Although there is a consensus that percent agreement, including Holsti, generally overestimates reliability because it makes no allowance for chance agreement, its use is not considered misuse when applied to nominal-scaled codings. The rationale is explained below.
Whether standard errors should be reported for the obtained reliability value is still debated in the literature. Therefore, not reporting standard errors is, for the present, not treated as a problem.
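Should a researcher nonetheless wish to attach an uncertainty estimate, a nonparametric bootstrap standard error can be computed for virtually any index. The following sketch uses the boot package cited in the references; the codings data frame and the kappa_stat function are hypothetical illustrations:

library(boot)
set.seed(1)

# Hypothetical two-coder nominal codings of 100 units into 3 categories
codings <- data.frame(c1 = sample(1:3, 100, replace = TRUE),
                      c2 = sample(1:3, 100, replace = TRUE))

# Statistic passed to boot(): Cohen's kappa computed on a resample of units
kappa_stat <- function(d, idx) {
  d   <- d[idx, ]
  tab <- table(factor(d$c1, levels = 1:3), factor(d$c2, levels = 1:3)) / nrow(d)
  po  <- sum(diag(tab))
  pe  <- sum(rowSums(tab) * colSums(tab))
  (po - pe) / (1 - pe)
}

b <- boot(codings, kappa_stat, R = 1000)
sd(b$t[, 1])   # bootstrap standard error of kappa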
There are plenty of modeling approaches, such as log-linear, IRT (item response theory), latent class, and mixture modeling. In a separate study by the author, the log-linear modeling approach was found to be no better than most indices.
Although variables with binary outcomes belong to the nominal level of measurement, under most indices binary variables share more characteristics with interval variables.
References
Artstein, R., Poesio, M.: Inter-coder agreement for computational linguistics. Comput. Linguist. 34(4), 555–596 (2008)
Bates, D., Maechler, M., Bolker, B.: lme4: Linear mixed-effects models using S4 classes [Computer software manual], 2011, August. Retrieved from http://cran.r-project.org/web/packages/lme4/index.html
Bennett, E.M., Alpert, R., Goldstein, A.C.: Communications through limited-response questioning. Public Opin. Q. 18(3), 303–308 (1954). doi:10.1086/266520. Retrieved from http://poq.oxfordjournals.org/content/18/3/303.abstract
Brennan, R., Prediger, D.: Coefficient kappa: some uses, misuses, and alternatives. Educ. Psychol. Meas. 41(3), 687 (1981)
Byrt, T., Bishop, J., Carlin, J.B.: Bias, prevalence and kappa. J. Clin. Epidemiol. 46(5), 423–429 (1993). doi:10.1016/0895-4356(93)90018-V. Retrieved from http://www.sciencedirect.com/science/article/pii/089543569390018V
Canty, A., Ripley, B.: Boot: Bootstrap functions (1.3-4 ed.) [Computer software manual], 2012, March. Retrieved from http://cran.r-project.org/web/packages/boot/index.html
Cicchetti, D., Feinstein, A.: High agreement but low kappa: II. Resolving the paradoxes. J. Clin. Epidemiol. 43(6), 551–558 (1990)
Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20(1), 37–46 (1960). doi:10.1177/001316446002000104
Cohen, J.: Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 70(4), 213–220 (1968). doi:10.1037/h0026256. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&db=pdh&AN=bul-70-4-213&site=ehost-live
Conger, A.: Integration and generalization of kappas for multiple raters. Psychol. Bull. 88(2), 322 (1980). doi:10.1037/0033-2909.88.2.322
Cronbach, L.: Coefficient alpha and the internal structure of tests. Psychometrika 16(3), 297–334 (1951)
Feinstein, A.R., Cicchetti, D.V.: High agreement but low kappa: I. The problems of two paradoxes. J. Clin. Epidemiol. 43(6), 543–549 (1990). doi:10.1016/0895-4356(90)90158-L. Retrieved from http://www.sciencedirect.com/science/article/pii/089543569090158L
Feng, G.C.: Factors affecting intercoder reliability: a Monte Carlo experiment. Qual. Quant. 47(5), 2959–2982 (2013a). doi:10.1007/s11135-012-9745-9
Feng, G.C.: Underlying determinants driving agreement among coders. Qual. Quant. 47(5), 2983–2997 (2013b). doi:10.1007/s11135-012-9807-z
Finn, R.: A note on estimating the reliability of categorical data. Educ. Psychol. Meas. 30(1), 71–76 (1970). doi:10.1177/001316447003000106
Fleiss, J.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378–382 (1971)
Fleiss, J.L., Levin, B., Paik, M.C.: The measurement of interrater agreement. In: Statistical Methods for Rates and Proportions, 3rd ed., pp. 598–626. Wiley, New York (2004). doi:10.1002/0471445428.ch18. Retrieved from http://dx.doi.org/10.1002/0471445428.ch18
Gwet, K.: Inter-rater reliability: dependency on trait prevalence and marginal homogeneity. Stat. Methods Inter-Rater Reliab. Assess. Ser. 2, 1–9 (2002)
Gwet, K.: Computing inter-rater reliability and its variance in the presence of high agreement. Br. J. Math. Stat. Psychol. 61(1), 29–48 (2008)
Gwet, K.: Handbook of Inter-Rater Reliability—A Definitive Guide to Measuring the Extent of Agreement Among Multiple Raters. Advanced Analytics LLC, Gaithersburg (2010)
Holsti, O.: Content Analysis for the Social Sciences and Humanities. Addison-Wesley, Reading (1969)
Hughes, M.A., Garrett, D.E.: Intercoder reliability estimation approaches in marketing: a generalizability theory framework for quantitative data. J. Mark. Res. 27(2), 185–195 (1990). Retrieved from http://www.jstor.org/stable/3172845
Kolbe, R.H., Burnett, M.S.: Content-analysis research: an examination of applications with directives for improving research reliability and objectivity. J. Consum. Res. 18(2), 243–250 (1991). Retrieved from http://www.jstor.org/stable/2489559
Krippendorff, K.: Bivariate agreement coefficients for reliability of data. Sociol. Methodol. 2, 139–150 (1970). Retrieved from http://www.jstor.org/stable/270787
Krippendorff, K.: Content Analysis: An Introduction to Its Methodology, 2nd ed. Sage, Thousand Oaks (2004a)
Krippendorff, K.: Reliability in content analysis: some common misconceptions and recommendations. Hum. Commun. Res. 30(3), 411–433 (2004b). doi:10.1111/j.1468-2958.2004.tb00738.x
Krippendorff, K.: Computing Krippendorff’s alpha reliability, 2007, June. Retrieved from http://repository.upenn.edu/cgi/viewcontent.cgi?article=1043&context=ascpapers
Krippendorff, K.: Agreement and information in the reliability of coding. Commun. Methods Meas. 5(2), 93–112 (2011). doi:10.1080/19312458.2011.568376
Krippendorff, K.: A dissenting view on so-called paradoxes of reliability coefficients. In: Salmon, C.T. (ed.) Communication Yearbook, vol. 36, pp. 481–500. Routledge, New York (2012)
Light, R.J.: Measures of response agreement for qualitative data: some generalizations and alternatives. Psychol. Bull. 76(5), 365–377 (1971)
Lin, L.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45(1), 255 (1989)
Lin, L., Hedayat, A.S., Wenting, W.: A unified approach for assessing agreement for continuous and categorical data. J. Biopharm. Stat. 17(4), 629–652 (2007). doi:10.1080/10543400701376498
Lombard, M., Snyder-Duch, J., Bracken, C.C.: Content analysis in mass communication: assessment and reporting of intercoder reliability. Hum. Commun. Res. 28(4), 587–604 (2002)
Maxwell, A.E.: Coefficients of agreement between observers and their interpretation. Br. J. Psychiatry 130(1), 79–83 (1977). doi:10.1192/bjp.130.1.79. Retrieved from http://bjp.rcpsych.org/content/130/1/79.abstract
Osgood, C.: The representational model and relevant research methods. In: de Sola Pool, I. (ed.) Trends in Content Analysis, pp. 33–88. University of Illinois Press, Champaign (1959)
Perreault, W.D., Jr., Leigh, L.E.: Reliability of nominal data based on qualitative judgments. J. Mark. Res. 26(2), 135–148 (1989). Retrieved from http://www.jstor.org/stable/3172601
Potter, W.J., Levine-Donnerstein, D.: Rethinking validity and reliability in content analysis. J. Appl. Commun. Res. 27(3), 258–284 (1999). doi:10.1080/00909889909365539
Riffe, D., Lacy, S., Fico, F.: Analyzing Media Messages: Using Quantitative Content Analysis in Research. Lawrence Erlbaum Assoc Inc, New Jersey (2005)
Scott, W.: Reliability of content analysis: the case of nominal scale coding. Public Opin. Q. 19, 321–325 (1955). doi:10.1086/266577
Spiegelman, M., Terwilliger, C., Fearing, F.: The reliability of agreement in content analysis. J. Soc. Psychol. 37, 175–187 (1953)
Warrens, M.: A formal proof of a paradox associated with Cohen’s kappa. J. Classif. 27, 322–332 (2010). doi:10.1007/s00357-010-9060-x. Retrieved from https://openaccess.leidenuniv.nl/bitstream/handle/1887/16310/Warrens2010JoC27322332.2
Zhao, X.: A Reliability Index (ai) that Assumes Honest Coders and Variable Randomness. Association for Education in Journalism and Mass Communication, Chicago (2012)
Zhao, X., Liu, J.S., Deng, K.: Assumptions behind inter-coder reliability indices. In: Salmon, C.T. (ed.) Communication Yearbook, vol. 36, pp. 419–480. Routledge, New York (2012)
About this article
Cite this article
Feng, G.C. Intercoder reliability indices: disuse, misuse, and abuse. Qual Quant 48, 1803–1815 (2014). https://doi.org/10.1007/s11135-013-9956-8