Abstract
Although the terms item bias and DIF are often used interchangeably, the term DIF was suggested (Holland and Thayer in Wainer and Braun (eds) Test validity, pp 129–145, Erlbaum, Hillsdale, 1988) as a somewhat neutral term to refer to differences in the statistical properties of an item between groups of examinees of equal ability. Items that exhibit DIF threaten the validity of a test and may have serious consequences for groups as well as individuals, because the probability of a correct response is determined not only by the trait the test claims to measure but also by factors specific to group membership, such as ethnicity or gender. It is therefore critical to identify DIF items in a test. In this chapter, the more popular DIF detection methods, including the Mantel–Haenszel procedure, logistic regression modeling, SIBTEST, and the IRT likelihood ratio test, are described. Details of other methods, as well as some older methods not mentioned above, can be found in the overviews given by Camilli and Shepard (Methods for identifying biased test items. Sage, Thousand Oaks, 1994), Clauser and Mazor (Educ Measure Issues Pract 17:31–44, 1998), Holland and Wainer (Differential item functioning. Erlbaum, Hillsdale, 1993), Millsap and Everson (Appl Psychol Measure 17:297–334, 1993), Osterlind and Everson (Differential item functioning. Sage, Thousand Oaks, 2009), and Penfield and Camilli (Handbook of statistics: Vol. 26. Psychometrics. Elsevier, Amsterdam, pp 125–167, 2007).
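As a rough illustration of the first of these methods, the Mantel–Haenszel statistics can be computed directly from the 2 × 2 tables formed at each matched ability level (Mantel and Haenszel 1959; Holland and Thayer 1988). The sketch below assumes dichotomously scored items with examinees matched on total score; the function name and data layout are illustrative, not from the chapter.

```python
import math

def mantel_haenszel_dif(strata):
    """Mantel-Haenszel DIF statistics from 2 x 2 tables, one per matched
    ability level: (A, B, C, D) = reference-correct, reference-incorrect,
    focal-correct, focal-incorrect counts."""
    num = den = 0.0                      # numerator/denominator of the common odds ratio
    sum_a = sum_ea = sum_var = 0.0
    for a, b, c, d in strata:
        t = a + b + c + d
        if t < 2:                        # a stratum this sparse contributes nothing
            continue
        num += a * d / t
        den += b * c / t
        n_r, n_f = a + b, c + d          # reference/focal group sizes in this stratum
        m1, m0 = a + c, b + d            # correct/incorrect margins
        sum_a += a
        sum_ea += n_r * m1 / t           # expected reference-correct count
        sum_var += n_r * n_f * m1 * m0 / (t * t * (t - 1))
    alpha = num / den                    # MH common odds ratio (1.0 = no DIF)
    mh_d = -2.35 * math.log(alpha)      # MH D-DIF on the ETS delta scale
    chi2 = (abs(sum_a - sum_ea) - 0.5) ** 2 / sum_var  # 1 df, continuity-corrected
    return alpha, mh_d, chi2

# One stratum in which the reference group answers correctly far more often:
alpha, mh_d, chi2 = mantel_haenszel_dif([(30, 10, 10, 30)])
print(alpha, mh_d, chi2)  # alpha = 9.0; chi2 well above the 3.84 critical value
```

Under the ETS convention described in the DIF literature, items are flagged by the size of MH D-DIF together with the significance of the chi-square; values of alpha far from 1.0 (equivalently, MH D-DIF far from 0) favor one group after matching on ability.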
References
Agresti, A. (1996). Logit models with random effects and quasi-symmetric loglinear models. In Proceedings of the 11th International Workshop on Statistical Modelling (pp. 3–12).
Alwin, D. F., & Jackson, D. J. (1981). Applications of simultaneous factor analysis to issues of factorial invariance. In D. Jackson & E. Borgatta (Eds.), Factor analysis and measurement in sociological research: A multi-dimensional perspective (pp. 249–279). Beverly Hills: Sage.
Anderson, R. J., & Sisco, F. H. (1977). Standardization of the WISC-R performance scale for deaf children (Office of Demographic Studies Publication Series T, No. 1). Washington, DC: Gallaudet College.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246.
Bentler, P. M. (1992). On the fit of models to covariances and methodology in the Bulletin. Psychological Bulletin, 112, 400–404.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Bolt, D., & Stout, W. (1996). Differential item functioning: Its multidimensional model and resulting SIBTEST detection procedure. Behaviormetrika, 23, 67–95.
Bracken, B. A., & McCallum, R. S. (2016). Examiner’s manual: Universal nonverbal intelligence test-second edition (UNIT2). Austin, TX: PRO-ED.
Braden, J. P. (1999). Straight talk about assessment and diversity: What do we know? School Psychology Quarterly, 14, 343–355.
Brown, L., Sherbenou, R. J., & Johnsen, S. K. (2010). Test of nonverbal intelligence (TONI–4). Austin, TX: PRO-ED.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Beverly Hills, CA: Sage.
Bryk, A. (1980). Review of Bias in mental testing. Journal of Educational Measurement, 17, 369–374.
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.
Canivez, G. L., & Watkins, M. W. (in press). Review of the Wechsler intelligence scale for children-fifth edition: Critique, commentary, and independent analyses. In A. S. Kaufman, S. E. Raiford, & D. L. Coalson (Authors), Intelligent testing with the WISC-V (pp. xx–xx). Hoboken, NJ: Wiley.
Cattell, R. B. (1978). The scientific use of factor analysis in behavioral and life sciences. New York: Plenum.
Chang, H. H., Mazzeo, J., & Roussos, L. (1996). Detecting DIF for polytomously scored items: An adaptation of the SIBTEST procedure. Journal of Educational Measurement, 33, 333–353.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31–44.
Cleary, T. A. (1968). Test bias: Prediction of grades of Negro and White students in integrated colleges. Journal of Educational Measurement, 5, 115–124.
Cotter, D. E., & Berk, R. A. (1981, April). Item bias in the WISC-R using Black, White, and Hispanic learning disabled children. Paper presented at the Annual Meeting of the American Educational Research Association, Los Angeles, CA.
Diana v. the California State Board of Education. Case No. C-70-37 RFP. (N.D. Cal., 1970).
Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and Standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35–66). Hillsdale, NJ: Erlbaum.
Dorans, N. J., & Kulick, E. (1983). Assessing unexpected differential item performance of female candidates on SAT and TSWE forms administered in December 1977: An application of the standardization approach. ETS Research Report Series, 1983, i–14.
Engelhard, G., Hansche, L., & Rutledge, K. E. (1990). Accuracy of bias review judges in identifying differential item functioning on teacher certification tests. Applied Measurement in Education, 3, 347–360.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: American Council on Education & Macmillan.
Frisby, C. L. (1998). Poverty and socioeconomic status. In J. L. Sandoval, C. L. Frisby, K. F. Geisinger, J. D. Scheuneman, & J. R. Grenier (Eds.), Test interpretation and diversity: Achieving equity in assessment (pp. 241–270). Washington, DC: American Psychological Association.
Green, B. F., Crone, C. R., & Folk, V. G. (1989). A method for studying differential distractor functioning. Journal of Educational Measurement, 26, 147–160.
Hammill, D. D., Pearson, N. A., & Wiederholt, J. L. (2009). Comprehensive test of nonverbal intelligence–2 (CTONI–2). Austin, TX: PRO-ED.
Hills, J. R. (1989). Screening for potentially biased items in testing programs. Educational Measurement: Issues and Practice, 8, 5–11.
Holland, P. W. (1985). On the study of differential item performance without IRT. In Proceedings of the 27th Annual Conference of the Military Testing Association (Vol. 1, pp. 282–287). San Diego, CA.
Holland, P. W., & Thayer, D. T. (1988). Differential item functioning and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Erlbaum.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.
Hu, L., Bentler, P. M., & Kano, Y. (1992). Can test statistics in covariance structure analysis be trusted? Psychological Bulletin, 112, 351–362.
Ilai, D., & Willerman, L. (1989). Sex differences in WAIS-R item performance. Intelligence, 13, 225–234.
Jastak, J. E., & Jastak, S. R. (1964). Short forms of the WAIS and WISC vocabulary subtests. Journal of Clinical Psychology, 20, 167–199.
Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.
Jiang, H., & Stout, W. (1998). Improved Type I error control and reduced estimation bias for DIF detection using SIBTEST. Journal of Educational and Behavioral Statistics, 23, 291–322.
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329–349.
Joint Committee on Testing Practices. (2004). Code of fair testing practices in education. Washington, DC.
Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409–426.
Jöreskog, K. G., & Sörbom, D. (1989). LISREL7: A guide to the program and applications (2nd ed.). Chicago: SPSS.
Kim, S.-H., Cohen, A. S., & Park, T.-H. (1995). Detection of differential item functioning in multiple groups. Journal of Educational Measurement, 32, 261–276.
Koh, T., Abbatiello, A., & McLoughlin, C. S. (1984). Cultural bias in WISC subtest items: A response to Judge Grady’s suggestions in relation to the PASE case. School Psychology Review, 13, 89–94.
Kromrey, J. D., & Parshall, C. G. (1991, November). Screening items for bias: An empirical comparison of the performance of three indices in small samples of examinees. Paper presented at the annual meeting of the Florida Educational Research Association, Clearwater, FL.
Larry P. v. Wilson Riles, Superintendent of Public Instruction for the State of California. Case No. C-71-2270 (N.D. Cal., 1979).
Lawrence, I. M., & Curley, W. E. (1989, March). Differential item functioning of SAT-Verbal reading subscore items for males and females: Follow-up study. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Lawrence, I. M., Curley, W. E., & McHale, F. J. (1988, April). Differential item functioning of SAT-Verbal reading subscore items for male and female examinees. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Li, H.-H., & Stout, W. (1994). SIBTEST: A FORTRAN-V program for computing the simultaneous item bias DIF statistics [Computer software]. Urbana-Champaign, IL: University of Illinois, Department of Statistics.
Li, H.-H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61, 647–677.
Linn, R. L., Levine, M. V., Hastings, C. N., & Wardrop, J. L. (1981). Item bias in a test of reading comprehension. Applied Psychological Measurement, 5, 159–173.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Magis, D., Raîche, G., Béland, S., & Gérard, P. (2011). A generalized logistic regression procedure to detect differential item functioning among multiple groups. International Journal of Testing, 11, 365–386.
Maller, S. J. (1996). WISC-III Verbal item invariance across samples of deaf and hearing children of similar measured ability. Journal of Psychoeducational Assessment, 14, 152–165.
Maller, S. J., & Ferron, J. (1997). WISC-III factor invariance across deaf and standardization samples. Educational and Psychological Measurement, 57, 987–994.
Maller, S. J., Konold, T. R., & Glutting, J. J. (1998). WISC-III Factor invariance across samples of children displaying appropriate and inappropriate test-taking behavior. Educational and Psychological Measurement, 58, 467–475.
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719–748.
Mazor, K. M., Clauser, B. E., & Hambleton, R. K. (1994). Identification of nonuniform differential item functioning using a variation of the Mantel-Haenszel procedure. Educational and Psychological Measurement, 54, 284–291.
Meredith, W., & Millsap, R. E. (1992). On the misuse of manifest variables in the detection of measurement bias. Psychometrika, 57, 289–311.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education & Macmillan.
McGaw, B., & Jöreskog, K. G. (1971). Factorial invariance of ability measures in groups differing in intelligence and socio-economic status. British Journal of Mathematical and Statistical Psychology, 24, 154–168.
Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297–334.
Millsap, R. E., & Meredith, W. (1992). Inferential conditions in the statistical detection of measurement bias. Applied Psychological Measurement, 16, 389–402.
Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78, 691–692.
National Council on Measurement in Education. (1995). Code of professional responsibilities in educational measurement. Educational Measurement: Issues and Practice, 14, 17–24.
Osterlind, S. J., & Everson, H. T. (2009). Differential item functioning (2nd ed.). Thousand Oaks, CA: Sage.
Penfield, R. D. (2001). Assessing differential item functioning across multiple groups: A comparison of three Mantel-Haenszel procedures. Applied Measurement in Education, 14, 235–259.
Penfield, R. D., & Camilli, G. (2007). Differential item functioning and item bias. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Vol. 26. Psychometrics (pp. 125–167). Amsterdam: Elsevier.
Plake, B. S. (1980). A comparison of a statistical and subjective procedure to ascertain item validity: One step in the test validation process. Educational and Psychological Measurement, 40, 397–404.
Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495–502.
Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197–207.
Reynolds, C. R. (1982). The problem of bias in psychological assessment. In C. R. Reynolds & T. B. Gutkin (Eds.), The handbook of school psychology (pp. 178–208). New York: Wiley.
Rigdon, E. E. (1996). CFI versus RMSEA: A comparison of two fit indexes for structural equation modeling. Structural Equation Modeling, 3, 369–379.
Roid, G. H., & Miller, L. J. (2013). Leiter international performance scale-third edition (Leiter-3) manual. Wood Dale, IL: Stoelting.
Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and the Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105–116.
Ross-Reynolds, J., & Reschly, D. J. (1983). An investigation of item bias on the WISC-R with four sociocultural groups. Journal of Consulting and Clinical Psychology, 51, 144–146.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometrika Monograph Series No. 17). Richmond, VA: Psychometric Society.
Sandoval, J. (1979). The WISC-R and internal evidence of test bias with minority groups. Journal of Consulting and Clinical Psychology, 47, 919–927.
Sandoval, J., & Miille, M. P. W. (1980). Accuracy of judgments of WISC-R item difficulty for minority groups. Journal of Consulting and Clinical Psychology, 48, 249–253.
Sandoval, J., Zimmerman, I. L., & Woo-Sam, J. M. (1977). Cultural differences on the WISC-R verbal items. Journal of School Psychology, 21, 49–55.
Satorra, A., & Bentler, P. M. (1988). Scaling corrections for chi-square statistics in covariance structure analysis. Proceedings of the Business and Economic Statistics Section of the American Statistical Association (pp. 303–313).
Scheuneman, J. D. (1987). An experimental, exploratory study of causes of bias in test items. Journal of Educational Measurement, 24, 97–118.
Scheuneman, J. D., & Gerritz, K. (1990). Using differential item functioning procedures to explore sources of item difficulty and group performance characteristics. Journal of Educational Measurement, 27, 109–131.
Scheuneman, J. D., & Oakland, T. (1998). High stakes testing in education. In J. Sandoval, C. L. Frisby, K. F. Geisinger, J. D. Scheuneman, & J. R. Grenier (Eds.), Test interpretation and diversity: Achieving equity in assessment (pp. 77–103). Washington, DC: American Psychological Association.
Shealy, R. T., & Stout, W. F. (1993). A model based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DIF as well as item bias/DIF. Psychometrika, 58, 159–194.
Shelley-Sireci, & Sireci, S. G. (1998, August). Controlling for uncontrolled variables in cross-cultural research. Paper presented at the annual meeting of the American Psychological Association, San Francisco, CA.
Sireci, S. G., Bastari, B., & Allalouf, A. (1998, August). Evaluating construct equivalence across adapted tests. Paper presented at the annual meeting of the American Psychological Association, San Francisco, CA.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.
Thissen, D. (2001). Psychometric engineering as art. Psychometrika, 66, 473–486.
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 149–169). Hillsdale, NJ: Erlbaum.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response theory. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–114). Hillsdale, NJ: Erlbaum.
Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1–8.
Turner, R. G., & Willerman, L. (1977). Sex differences in WAIS item performance. Journal of Clinical Psychology, 33, 795–798.
Uttaro, T. (1992). Factors influencing the Mantel–Haenszel procedure in the detection of differential item functioning. Unpublished doctoral dissertation, Graduate Center, City University of New York.
Wechsler, D. (2014). Wechsler intelligence scale for children-fifth edition technical and interpretive manual. San Antonio, TX: NCS Pearson.
Wechsler, D., & Naglieri, J. A. (2006). Wechsler nonverbal scale of ability technical and interpretive manual. San Antonio, TX: Pearson.
Wild, C. L., & McPeek, W. M. (1986, August). Performance of the Mantel–Haenszel statistic in identifying differentially functioning items. Paper presented at the annual meeting of the American Psychological Association, Washington, DC.
Zieky, M. (1993). Practical questions in the use of DIF statistics in item development. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 337–364). Hillsdale, NJ: Erlbaum.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defence.
Zumbo, B. D., & Thomas, D. R. (1997). A measure of effect size for a model-based approach for studying DIF. Working Paper of the Edgeworth Laboratory for Quantitative Behavioral Science, University of Northern British Columbia, Prince George, B.C.
Zwick, R. (1990). When do item response function and Mantel-Haenszel definitions of differential item functioning coincide? Journal of Educational Statistics, 15, 185–197.
Zwick, R., Donoghue, J. R., & Grima, A. (1993). Assessing differential item functioning in performance tasks. Journal of Educational Measurement, 30, 233–251.
Copyright information
© 2017 Springer International Publishing AG
Cite this chapter
Maller, S. J., & Pei, L. K. (2017). Best practices in detecting bias in cognitive tests. In R. McCallum (Ed.), Handbook of nonverbal assessment. Cham: Springer. https://doi.org/10.1007/978-3-319-50604-3_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50602-9
Online ISBN: 978-3-319-50604-3
eBook Packages: Behavioral Science and Psychology (R0)