Abstract
Differential item functioning (DIF) is an important issue of interest in psychometrics and educational measurement. Several methods have been proposed in recent decades for identifying items that function differently between two or more groups of examinees. Starting from a framework for classifying DIF detection methods and from a comparative overview of the most traditional methods, an R package for nine methods, called difR, is presented. The commands and options are briefly described, and the package is illustrated through the analysis of a data set on verbal aggression.
Article PDF
Similar content being viewed by others
References
Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67–91.
Agresti, A. (1990). Categorical data analysis. New York: Wiley.
Aguerri, M. E., Galibert, M. S., Attorresi, H. F., & Marañón, P. P. (2009). Erroneous detection of nonuniform DIF using the Breslow— Day test in a short test. Quality & Quantity, 43, 35–44.
Angoff, W. H., & Ford, S. F. (1973). Item—race interaction on a test of scholastic aptitude. Journal of Educational Measurement, 10, 95–106.
Bates, D., & Maechler, M. (2009). lme4: Linear mixed-effects models using S4 classes. R package Version 0.999375-32. Available from https://r-forge.r-project.org/R/?group_id=60.
Berk, R. A. (1982). Handbook of methods for detecting test bias. Baltimore: Johns Hopkins University Press.
Breslow, N. E., & Day, N. E. (1980). Statistical methods in cancer research: Vol. 1. The analysis of case—control studies (Scientific Publication No. 32). Lyon, France: International Agency for Research on Cancer.
Breslow, N. E., & Liang, K. Y. (1982). The variance of the Mantel— Haenszel estimator. Biometrics, 38, 943–952.
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.
Candell, G. L., & Drasgow, F. (1988). An iterative procedure for linking metrics and assessing item bias in item response theory. Applied Psychological Measurement, 12, 253–260.
Cardall, C., & Coffman, W. E. (1964). A method for comparing the performance of different groups on the items in a test (Research Bulletin 64–61). Princeton, NJ: Educational Testing Service.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement, 17, 31–44.
Clauser, B. E., Mazor, K. M., & Hambleton, R. K. (1993). The effects of purification of matching criterion on the identification of DIF using the Mantel-Haenszel procedure. Applied Measurement in Education, 6, 269–279.
Cleary, T. A., & Hilton, T. L. (1968). An investigation of item bias. Educational & Psychological Measurement, 28, 61–75.
Cook, L. L., & Eignor, D. R. (1991). NCME instructional module: IRT equating methods. Educational Measurement, 10, 37–45.
De Boeck, P., & Wilson, M. (Eds.) (2004). Explanatory item response models: A generalized linear and nonlinear approach. New York: Springer.
Dorans, N. J. (1989). Two new approaches to assessing differential item functioning. Standardization and the Mantel-Haenszel method. Applied Measurement in Education, 2, 217–233.
Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35–66). Hillsdale, NJ: Erlbaum.
Dorans, N. J., & Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23, 355–368.
Dorans, N. J., Schmitt, A. P., & Bleistein, C. A. (1992). The standardization approach to assessing comprehensive differential item functioning. Journal of Educational Measurement, 29, 309–319.
Fidalgo, Á. M., Mellenbergh, G. J., & Muñiz, J. (2000). Effects of amount of DIF, test length, and purification type on robustness and power of Mantel-Haenszel procedures. Methods of Psychological Research, 5, 43–53.
Finch, W. H., & French, B. F. (2007). Detection of crossing differential item functioning: A comparison of four methods. Educational & Psychological Measurement, 67, 565–582.
Hanson, B. A. (1998). Uniform DIF and DIF defined by differences in item response functions. Journal of Educational & Behavioral Statistics, 23, 244–253.
Hauck, W. W. (1979). The large sample variance of the Mantel-Haenszel estimator of a common odds ratio. Biometrics, 35, 817–819.
Holland, P. W., & Thayer, D. T. (1985). An alternate definition of the ETS delta scale of item difficulty (Research Report RR-85-43). Princeton, NJ: Educational Testing Service.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Erlbaum.
Ironson, G. H., & Subkoviak, M. J. (1979). A comparison of several methods of assessing item bias. Journal of Educational Measurement, 16, 209–225.
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329–349.
Kim, S.-H., & Cohen, A. S. (1992). IRTDIF: A computer program for IRT differential item functioning analysis. Applied Psychological Measurement, 16, 158.
Kim, S.-H., Cohen, A. S., & Park, T.-H. (1995). Detection of differential item functioning in multiple groups. Journal of Educational Measurement, 32, 261–276.
Lautenschlager, G. J., & Park, D.-G. (1988). IRT item bias detection procedures: Issues of model misspecification, robustness, and parameter linking. Applied Psychological Measurement, 12, 365–376.
Li, H.-H., & Stout, W. (1994). SIBTEST: A FORTRAN-V Program for Computing the Simultaneous Item Bias DIF Statistics [Computer program]. Urbana-Champaign, IL: University of Illinois, Department of Statistics.
Li, H.-H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61, 647–677.
Lord, F. M. (1976). A study of item bias, using item characteristic curve theory. Princeton, NJ: Educational Testing Service.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719–748.
Mazor, K. M., Clauser, B. E., & Hambleton, R. K. (1994). Identification of nonuniform differential item functioning using a variation of the Mantel-Haenszel procedure. Educational & Psychological Measurement, 54, 284–291.
Miller, R. G., Jr. (1981). Simultaneous statistical inference (2nd ed.). New York: Springer.
Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297–334.
Mislevy, R. J., & Bock, R. D. (1984). BILOG: Item analysis and test scoring with binary logistic models [Computer program]. Mooresville, IN: Scientific Software.
Mislevy, R. J., & Stocking, M. L. (1989). A consumer’s guide to LOGIST and BILOG. Applied Psychological Measurement, 13, 57–75.
Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78, 691–692.
Narayanan, P., & Swaminathan, H. (1996). Identification of items that show nonuniform DIF. Applied Psychological Measurement, 20, 257–274.
Osterlind, S. J., & Everson, H. T. (2009). Differential item functioning (2nd ed.). Thousand Oaks, CA: Sage.
Penfield, R. D. (2001). Assessing differential item functioning among multiple groups: A comparison of three Mantel-Haenszel procedures. Applied Measurement in Education, 14, 235–259.
Penfield, R. D. (2003). Applying the Breslow-Day test of trend in odds ratio heterogeneity to the analysis of nonuniform DIF. Alberta Journal of Educational Research, 49, 231–243.
Penfield, R. D. (2005). DIFAS: Differential item functioning analysis system. Applied Psychological Measurement, 29, 150–151.
Penfield, R. D., & Camilli, G. (2007). Differential item functioning and item bias. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Vol. 26. Psychometrics (pp. 125–167). Amsterdam: Elsevier.
Philips, A., & Holland, P. W. (1987). Estimators of the variance of the Mantel-Haenszel log-odds-ratio estimate. Biometrics, 43, 425–431.
Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495–502.
Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197–207.
Raju, N. S. (1995). DFITPU: A FORTRAN program for calculating DIF/ DTF [Computer program]. Atlanta: Georgia Institute of Technology. R Development Core Team (2008). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing.
Rizopoulos, D. (2006). ltm: An R package for latent variable modeling and item response analysis. Journal of Statistical Software, 17, 1–25.
Robins, J., Breslow, N., & Greenland, S. (1986). Estimators of the Mantel-Haenszel variance consistent in both sparse data and largestrata limiting models. Biometrics, 42, 311–323.
Rogers, H. J., Swaminathan, H., & Hambleton, R. K. (1993). DICHODIF: A FORTRAN program for DIF analysis of dichotomously scored item response data [Computer program]. Amherst, MA: University of Massachusetts.
Roussos, L. A., & Stout, W. F. (1996). Simulation studies of the effects of small sample size and studied item parameters on SIBTEST and Mantel-Haenszel Type I error performance. Journal of Educational Measurement, 33, 215–230.
Rudner, L. M., Getson, P. R., & Knight, D. L. (1980). A Monte Carlo comparison of seven biased item detection techniques. Journal of Educational Measurement, 17, 1–10.
Scheuneman, J. (1979). A method of assessing bias in test items. Journal of Educational Measurement, 16, 143–152.
Shealy, R., & Stout, W. [F.] (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DIF as well as item bias/DIF. Psychometrika, 58, 159–194.
Shepard, L. [A.], Camilli, G., & Averill, M. (1981). Comparison of procedures for detecting test-item bias with both internal and external ability criteria. Journal of Educational & Behavioral Statistics, 6, 317–375.
Smits, D. J. M., De Boeck, P., & Vansteelandt, K. (2004). The inhibition of verbally aggressive behaviour. European Journal of Personality, 18, 537–555.
Soares, T. M., Gonçalves, F. B., & Gamerman, D. (2009). An integrated Bayesian model for DIF analysis. Journal of Educational & Behavioral Statistics, 34, 348–377.
Somes, G. W. (1986). The generalized Mantel-Haenszel statistic. American Statistician, 40, 106–108.
Spielberger, C. D. (1988). State-Trait Anger Expression Inventory research edition: Professional manual. Odessa, FL: Psychological Assessment Resources.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.
Thissen, D. (2001). IRTLRDIF v.2.0b: Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning [Computer software]. Chapel Hill: University of North Carolina, L. L. Thurstone Psychometric Laboratory.
Thissen, D., Chen, W.-H., & Bock, R. D. (2003). MULTILOG 7 for Windows: Multiple-category item analysis and test scoring using item response theory [Computer software]. Lincolnwood, IL: Scientific Software International, Inc.
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group difference in trace lines. In H. Wainer & H. Braun (Eds.), Test validity (pp. 147–170). Hillsdale, NJ: Erlbaum.
Vansteelandt, K. (2000). Formal models for contextualized personality psychology. Unpublished doctoral dissertation, K.U. Leuven, Belgium.
Wang, W.-C., & Su, Y.-H. (2004). Effects of average signed area between two item characteristic curves and test purification procedures on the DIF detection via the Mantel-Haenszel method. Applied Measurement in Education, 17, 113–144.
Wang, W.-C., & Yeh, Y.-L. (2003). Effects of anchor item methods on differential item functioning detection with the likelihood ratio test. Applied Psychological Measurement, 27, 479–498.
Zumbo, B. D., & Thomas, D. R. (1997). A measure of effect size for a model-based approach for studying DIF. Prince George, Canada: University of Northern British Columbia, Edgeworth Laboratory for Quantitative Behavioral Science.
Author information
Authors and Affiliations
Corresponding author
Additional information
This research was financially supported by the Belgian Federal Science Policy (Funds IAP/P6/03), the Research Fund GOA/2005/04 of the K.U. Leuven, Belgium, a doctoral grant “Bourse à la mobilité (hors Québec) pour l’intégration à la communauté scientifique en éducation” of the UQAM, Canada, and a postdoctoral grant “Chargé de recherches” of the National Funds for Scientific Research (FNRS), Belgium.
Rights and permissions
About this article
Cite this article
Magis, D., Béland, S., Tuerlinckx, F. et al. A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods 42, 847–862 (2010). https://doi.org/10.3758/BRM.42.3.847
Received:
Accepted:
Issue Date:
DOI: https://doi.org/10.3758/BRM.42.3.847