Abstract
Although intercoder reliability has been considered crucial to the validity of a content study, the choice among the available reliability indices has been controversial. This study analyzed all the content studies that reported intercoder reliability published in the two major communication journals, aiming to find out how scholars conduct intercoder reliability tests. The results revealed that over the past 30 years some intercoder reliability indices have been persistently misused with respect to the level of measurement, the number of coders, and the manner of reporting reliability. The implications of misuse, disuse, and abuse were discussed, and suggestions were offered regarding the proper choice of indices in various situations.
Notes
\(^{1}\) Coders may also be called annotators, judges, raters, observers, or classifiers, among other terms, depending on the research field. Intercoder, as well as interrater, is used interchangeably throughout the paper.
When the reliability value is far lower than percent agreement, e.g., percent agreement is above 0.8 while the reliability estimate is close to or below 0, this may indicate that the marginal distribution is too skewed.
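As an illustration (hypothetical codings, not data from the reviewed studies), the following base R sketch shows how skewed marginals produce a percent agreement of 0.90 alongside a Cohen's (1960) \(\kappa \) slightly below 0:

coder1 <- c(rep(1, 18), 1, 2)   # 19 codes of "1" and 1 code of "2"
coder2 <- c(rep(1, 18), 2, 1)   # same skewed marginals; the last two units disagree

percent_agreement <- mean(coder1 == coder2)   # 18/20 = 0.90

tab <- table(factor(coder1, levels = 1:2), factor(coder2, levels = 1:2)) / length(coder1)
po  <- sum(diag(tab))                      # observed agreement: 0.90
pe  <- sum(rowSums(tab) * colSums(tab))    # chance agreement implied by the skewed marginals: 0.905
kappa <- (po - pe) / (1 - pe)              # about -0.05 despite 90% agreement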
It is identical to Bennett et al. (1954)’s \(S\) coefficient.
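For reference, a standard formulation of \(S\) corrects observed percent agreement \(p_{o}\) by a fixed chance rate of \(1/k\), where \(k\) is the number of categories: \(S = \frac{p_{o} - 1/k}{1 - 1/k}\).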
As Lombard et al. (2002) argued, the proportion of studies using percent agreement was probably underestimated, because most “NAs” would actually have adopted percent agreement.
Cohen (1968) later proposed weighted \(\kappa \) for ordinal ratings. Krippendorff’s (2004a) \(\alpha \) can be applied to all levels of measurement. Some indices, such as ICCs, are applicable only to interval ratings, while others, such as \(I_{r}\), Brennan and Prediger’s (1981) \(\kappa \), and \(\pi \), have no counterparts for higher levels of measurement.
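As a minimal sketch of the ordinal case (an illustrative base R function, not a procedure taken from the reviewed studies), Cohen’s (1968) weighted \(\kappa \) with quadratic weights can be computed directly from two coders’ ratings:

# Weighted kappa (Cohen 1968) with quadratic agreement weights; r1 and r2 are
# two coders' ratings of the same units on an ordinal scale with categories 1..k.
weighted_kappa <- function(r1, r2, k = max(c(r1, r2))) {
  tab <- table(factor(r1, levels = 1:k), factor(r2, levels = 1:k)) / length(r1)
  w   <- outer(1:k, 1:k, function(i, j) 1 - (i - j)^2 / (k - 1)^2)  # 1 on the diagonal, 0 at maximal distance
  po  <- sum(w * tab)                                # weighted observed agreement
  pe  <- sum(w * outer(rowSums(tab), colSums(tab)))  # weighted chance agreement
  (po - pe) / (1 - pe)
}

# Hypothetical example: two coders, a 5-point ordinal scale
weighted_kappa(c(1, 2, 2, 3, 4, 5, 3, 2, 4, 5), c(1, 2, 3, 3, 4, 4, 3, 2, 5, 5), k = 5)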
Although there is a consensus that percent agreement, including Holsti, generally overestimates reliability because it makes no allowance for chance agreement, its use is not considered misuse when applied to nominal-scaled codings. The rationale is explained below.
Whether standard errors should be reported for the obtained reliability value is still debated in the literature. Therefore, not reporting standard errors is, for the present, not treated as a problem.
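Should a researcher nonetheless wish to attach an uncertainty estimate, a nonparametric bootstrap standard error can be computed for virtually any index. The following sketch uses the boot package cited in the references; the codings data frame and the kappa_stat function are hypothetical illustrations:

library(boot)
set.seed(1)

# Hypothetical two-coder nominal codings of 100 units into 3 categories
codings <- data.frame(c1 = sample(1:3, 100, replace = TRUE),
                      c2 = sample(1:3, 100, replace = TRUE))

# Statistic passed to boot(): Cohen's kappa computed on a resample of units
kappa_stat <- function(d, idx) {
  d   <- d[idx, ]
  tab <- table(factor(d$c1, levels = 1:3), factor(d$c2, levels = 1:3)) / nrow(d)
  po  <- sum(diag(tab))
  pe  <- sum(rowSums(tab) * colSums(tab))
  (po - pe) / (1 - pe)
}

b <- boot(codings, kappa_stat, R = 1000)
sd(b$t[, 1])   # bootstrap standard error of kappa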
There are plenty of modeling approaches, such as log-linear, IRT (item response theory), latent class, and mixture modeling. In a separate study by the author, the log-linear modeling approach was found to be no better than most indices.
Although variables with binary outcomes belong to the nominal level of measurement, under most indices binary variables share more characteristics with interval variables.
References
Artstein, R., Poesio, M.: Inter-coder agreement for computational linguistics. Comput. Linguist. 34(4), 555–596 (2008)
Bates, D., Maechler, M., Bolker, B.: lme4: Linear mixed-effects models using S4 classes [Computer software manual], 2011, August. Retrieved from http://cran.r-project.org/web/packages/lme4/index.html
Bennett, E.M., Alpert, R., Goldstein, A.C.: Communications through limited-response questioning. Public Opin. Q. 18(3), 303–308 (1954). doi:10.1086/266520. Retrieved from http://poq.oxfordjournals.org/content/18/3/303.abstract
Brennan, R., Prediger, D.: Coefficient kappa: some uses, misuses, and alternatives. Educ. Psychol. Meas. 41(3), 687 (1981)
Byrt, T., Bishop, J., Carlin, J.B.: Bias, prevalence and kappa. J. Clin. Epidemiol. 46(5), 423–429 (1993). doi:10.1016/0895-4356(93)90018-V. Retrieved from http://www.sciencedirect.com/science/article/pii/089543569390018V
Canty, A., Ripley, B.: Boot: Bootstrap functions (1.3-4 ed.) [Computer software manual], 2012, March. Retrieved from http://cran.r-project.org/web/packages/boot/index.html
Cicchetti, D., Feinstein, A.: High agreement but low kappa: II. Resolving the paradoxes. J. Clin. Epidemiol. 43(6), 551–558 (1990)
Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20(1), 37–46 (1960). doi:10.1177/001316446002000104
Cohen, J.: Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 70(4), 213–220 (1968). doi:10.1037/h0026256. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&db=pdh&AN=bul-70-4-213&site=ehost-live
Conger, A.: Integration and generalization of kappas for multiple raters. Psychol. Bull. 88(2), 322 (1980). doi:10.1037/0033-2909.88.2.322
Cronbach, L.: Coefficient alpha and the internal structure of tests. Psychometrika 16(3), 297–334 (1951)
Feinstein, A.R., Cicchetti, D.V.: High agreement but low kappa: I. The problems of two paradoxes. J. Clin. Epidemiol. 43(6), 543–549 (1990). doi:10.1016/0895-4356(90)90158-L. Retrieved from http://www.sciencedirect.com/science/article/pii/089543569090158L
Feng, G.C.: Factors affecting intercoder reliability: a Monte Carlo experiment. Qual. Quant. 47(5), 2959–2982 (2013a). doi:10.1007/s11135-012-9745-9
Feng, G.C.: Underlying determinants driving agreement among coders. Qual. Quant. 47(5), 2983–2997 (2013b). doi:10.1007/s11135-012-9807-z
Finn, R.: A note on estimating the reliability of categorical data. Educ. Psychol. Meas. 30(1), 71–76 (1970). doi:10.1177/001316447003000106
Fleiss, J.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378–382 (1971)
Fleiss, J.L., Levin, B., Paik, M.C.: The measurement of interrater agreement. In: Statistical Methods for Rates and Proportions, 3rd ed., pp. 598–626. Wiley, New York (2004). doi:10.1002/0471445428.ch18. Retrieved from http://dx.doi.org/10.1002/0471445428.ch18
Gwet, K.: Inter-rater reliability: dependency on trait prevalence and marginal homogeneity. Stat. Methods Inter-Rater Reliab. Assess. Ser. 2, 1–9 (2002)
Gwet, K.: Computing inter-rater reliability and its variance in the presence of high agreement. Br. J. Math. Stat. Psychol. 61(1), 29–48 (2008)
Gwet, K.: Handbook of Inter-Rater Reliability—A Definitive Guide to Measuring the Extent of Agreement Among Multiple Raters. Advanced Analytics LLC, Gaithersburg (2010)
Holsti, O.: Content Analysis for the Social Sciences and Humanities. Addison-Wesley, Reading (1969)
Hughes, M.A., Garrett, D.E.: Intercoder reliability estimation approaches in marketing: a generalizability theory framework for quantitative data. J. Mark. Res. 27(2), 185–195 (1990). Retrieved from http://www.jstor.org/stable/3172845
Kolbe, R.H., Burnett, M.S.: Content-analysis research: an examination of applications with directives for improving research reliability and objectivity. J. Consum. Res. 18(2), 243–250 (1991). Retrieved from http://www.jstor.org/stable/2489559
Krippendorff, K.: Bivariate agreement coefficients for reliability of data. Sociol. Methodol. 2, 139–150 (1970). Retrieved from http://www.jstor.org/stable/270787
Krippendorff, K.: Content Analysis: An Introduction to Its Methodology, 2nd ed. Sage, Thousand Oaks (2004a)
Krippendorff, K.: Reliability in content analysis: some common misconceptions and recommendations. Hum. Commun. Res. 30(3), 411–433 (2004b). doi:10.1111/j.1468-2958.2004.tb00738.x
Krippendorff, K.: Computing Krippendorff’s alpha reliability, 2007, June. Retrieved from http://repository.upenn.edu/cgi/viewcontent.cgi?article=1043&context=ascpapers
Krippendorff, K.: Agreement and information in the reliability of coding. Commun. Methods Meas. 5(2), 93–112 (2011). doi:10.1080/19312458.2011.568376
Krippendorff, K.: A dissenting view on so-called paradoxes of reliability coefficients. In: Salmon, C.T. (ed.) Communication Yearbook, vol. 36, pp. 481–500. Routledge, New York (2012)
Light, R.J.: Measures of response agreement for qualitative data: some generalizations and alternatives. Psychol. Bull. 76(5), 365–377 (1971)
Lin, L.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45(1), 255 (1989)
Lin, L., Hedayat, A.S., Wenting, W.: A unified approach for assessing agreement for continuous and categorical data. J. Biopharm. Stat. 17(4), 629–652 (2007). doi:10.1080/10543400701376498
Lombard, M., Snyder-Duch, J., Bracken, C.C.: Content analysis in mass communication: assessment and reporting of intercoder reliability. Hum. Commun. Res. 28(4), 587–604 (2002)
Maxwell, A.E.: Coefficients of agreement between observers and their interpretation. Br. J. Psychiatry 130(1), 79–83 (1977). doi:10.1192/bjp.130.1.79. Retrieved from http://bjp.rcpsych.org/content/130/1/79.abstract
Osgood, C.: The representational model and relevant research methods. In: de Sola Pool, I. (ed.) Trends in Content Analysis, pp. 33–88. University of Illinois Press, Champaign (1959)
Perreault, W.D., Jr., Leigh, L.E.: Reliability of nominal data based on qualitative judgments. J. Mark. Res. 26(2), 135–148 (1989). Retrieved from http://www.jstor.org/stable/3172601
Potter, W.J., Levine-Donnerstein, D.: Rethinking validity and reliability in content analysis. J. Appl. Commun. Res. 27(3), 258–284 (1999). doi:10.1080/00909889909365539
Riffe, D., Lacy, S., Fico, F.: Analyzing Media Messages: Using Quantitative Content Analysis in Research. Lawrence Erlbaum Assoc Inc, New Jersey (2005)
Scott, W.: Reliability of content analysis: the case of nominal scale coding. Public Opin. Q. 19, 321–325 (1955). doi:10.1086/266577
Spiegelman, M., Terwilliger, C., Fearing, F.: The reliability of agreement in content analysis. J. Soc. Psychol. 37, 175–187 (1953)
Warrens, M.: A formal proof of a paradox associated with Cohen’s kappa. J. Classif. 27, 322–332 (2010). doi:10.1007/s00357-010-9060-x. Retrieved from https://openaccess.leidenuniv.nl/bitstream/handle/1887/16310/Warrens2010JoC27322332.2
Zhao, X.: A Reliability Index (ai) that Assumes Honest Coders and Variable Randomness. Association for Education in Journalism and Mass Communication, Chicago (2012)
Zhao, X., Liu, J.S., Deng, K.: Assumptions behind inter-coder reliability indices. In: Salmon, C.T. (ed.) Communication Yearbook, vol. 36, pp. 419–480. Routledge, New York (2012)
About this article
Cite this article
Feng, G.C. Intercoder reliability indices: disuse, misuse, and abuse. Qual Quant 48, 1803–1815 (2014). https://doi.org/10.1007/s11135-013-9956-8