An Alternative to Cohen's κ
Abstract
At the level of manifest categorical variables, a large number of coefficients and models for the examination of rater agreement have been proposed and used. The most popular of these is Cohen's κ. In this article, a new coefficient, κ_s, is proposed as an alternative measure of rater agreement. Both κ and κ_s allow researchers to determine whether agreement in groups of two or more raters is significantly beyond chance. Stouffer's z is used to test the null hypothesis that κ_s = 0. In addition to evaluating rater agreement in a fashion parallel to κ, κ_s allows one to (1) examine subsets of cells in agreement tables, (2) examine cells that indicate disagreement, (3) consider alternative chance models, (4) take covariates into account, and (5) compare independent samples. Results from a simulation study are reported, which suggest that (a) the four measures of rater agreement, Cohen's κ, Brennan and Prediger's κ_n, raw agreement, and κ_s, are sensitive to the same data characteristics when evaluating rater agreement and (b) both the z-statistic for Cohen's κ and Stouffer's z for κ_s are unimodally and symmetrically distributed, but slightly heavy-tailed. Examples use data from verbal processing and applicant selection.
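For orientation, the three established measures named in the abstract can be computed from a square two-rater agreement table as in the sketch below. It uses the standard textbook definitions of raw agreement, Cohen's κ, and Brennan and Prediger's κ_n; the function names and the example table are hypothetical and are not taken from the article. The abstract does not define κ_s, so it is not reproduced here; `stouffer_z` only illustrates the generic Stouffer combination of independent z-values (their sum divided by the square root of their count), and it is an assumption that the article applies the rule in this unweighted form.

```python
import math


def agreement_measures(table):
    """Return (raw agreement, Cohen's kappa, Brennan-Prediger kappa_n).

    `table` is a square list of lists: table[i][j] counts the objects that
    rater A placed in category i and rater B placed in category j.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    # Raw agreement: proportion of objects on the main diagonal.
    p_obs = sum(table[i][i] for i in range(k)) / n

    # Cohen's chance model: independence of the raters, given their marginals.
    # (If p_chance equals 1, kappa is undefined and this raises ZeroDivisionError.)
    p_chance = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    kappa = (p_obs - p_chance) / (1 - p_chance)

    # Brennan-Prediger chance model: every category equally likely (1/k).
    kappa_n = (p_obs - 1 / k) / (1 - 1 / k)

    return p_obs, kappa, kappa_n


def stouffer_z(z_values):
    """Generic unweighted Stouffer combination of independent z-values."""
    return sum(z_values) / math.sqrt(len(z_values))


# Hypothetical 2 x 2 agreement table, for illustration only.
example = [[30, 5],
           [5, 10]]
raw, kappa, kappa_n = agreement_measures(example)
print(f"raw agreement = {raw:.3f}, kappa = {kappa:.3f}, kappa_n = {kappa_n:.3f}")
print(f"Stouffer z of [2, 2, 2, 2] = {stouffer_z([2.0, 2.0, 2.0, 2.0]):.3f}")
```

In this made-up table the raters agree on 40 of 50 objects, yet κ is lower than κ_n because Cohen's chance model, built from the skewed marginals, expects more agreement by chance than the uniform 1/k model does.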