Abstract
Although intercoder reliability is considered crucial to the validity of a content study, the choice among the available reliability indices remains controversial. This study analyzed all content studies that reported intercoder reliability published in two major communication journals, aiming to examine how scholars conduct intercoder reliability tests. The results revealed that over the past 30 years some intercoder reliability indices have been persistently misused with respect to the level of measurement, the number of coders, and the means of reporting reliability. The implications of such misuse, disuse, and abuse are discussed, and suggestions are offered regarding the proper choice of indices in various situations.