
Best Practices in Detecting Bias in Cognitive Tests

Chapter in Handbook of Nonverbal Assessment

Abstract

Although the terms item bias and DIF are often used interchangeably, the term DIF was suggested (Holland and Thayer in Wainer and Braun (eds) Test validity. Erlbaum, Hillsdale, pp 129–145, 1988) as a more neutral term for differences in the statistical properties of an item between groups of examinees of equal ability. Items that exhibit DIF threaten the validity of a test and may have serious consequences for groups as well as individuals, because the probability of a correct response is determined not only by the trait the test claims to measure but also by factors specific to group membership, such as ethnicity or gender. It is therefore critical to identify DIF items in a test. In this chapter, the more popular DIF detection methods, including the Mantel–Haenszel procedure, logistic regression modeling, SIBTEST, and the IRT likelihood ratio test, are described. Details of other methods, as well as some older methods not mentioned above, can be found in the overviews given by Camilli and Shepard (Methods for identifying biased test items. Sage, Thousand Oaks, 1994), Clauser and Mazor (Educ Measure Issues Pract 17:31–44, 1998), Holland and Wainer (Differential item functioning. Erlbaum, Hillsdale, 1993), Millsap and Everson (Appl Psychol Measure 17:297–334, 1993), Osterlind and Everson (Differential item functioning. Sage, Thousand Oaks, 2009), and Penfield and Camilli (Handbook of statistics: Vol. 26. Psychometrics. Elsevier, Amsterdam, pp 125–167, 2007).
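To make the first of these methods concrete: the Mantel–Haenszel procedure stratifies examinees by a matching criterion (usually total test score), cross-tabulates group membership against item response within each stratum, and pools the strata into a common odds-ratio estimate; values far from 1 suggest the item functions differently for equally able examinees. The sketch below is a minimal illustration with invented stratum counts, not code from the chapter.

```python
from math import log

def mantel_haenszel_dif(strata):
    """Mantel-Haenszel common odds-ratio estimate across score strata.

    Each stratum is a tuple (A, B, C, D):
      A = reference-group correct,  B = reference-group incorrect,
      C = focal-group correct,      D = focal-group incorrect.
    Returns (alpha_MH, delta_MH). alpha_MH > 1 means the item favors
    the reference group at matched ability; delta_MH rescales alpha
    to the ETS delta metric, where |delta| >= 1.5 is conventionally
    flagged as large DIF (ETS category C).
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    alpha = num / den
    delta = -2.35 * log(alpha)  # ETS delta scale
    return alpha, delta

# Hypothetical data for one item: three total-score strata.
strata = [(40, 10, 30, 20), (60, 15, 45, 30), (80, 5, 60, 15)]
alpha, delta = mantel_haenszel_dif(strata)
print(round(alpha, 2), round(delta, 2))  # → 2.93 -2.53
```

Here the pooled odds ratio of about 2.9 (delta ≈ −2.5) would flag the item as favoring the reference group even after matching on total score; in practice a significance test (the MH chi-square) accompanies the effect size before an item is flagged.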


References

  • Agresti, A. (1996). Logit models with random effects and quasi-symmetric loglinear models. In Proceedings of the 11th International Workshop on Statistical Modelling (pp. 3–12).

  • Alwin, D. F., & Jackson, D. J. (1981). Applications of simultaneous factor analysis to issues of factorial invariance. In D. Jackson & E. Borgatta (Eds.), Factor analysis and measurement in sociological research: A multi-dimensional perspective (pp. 249–279). Beverly Hills: Sage.

  • American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

  • Anderson, R. J., & Sisco, F. H. (1977). Standardization of the WISC-R performance scale for deaf children (Office of Demographic Studies Publication Series T, No. 1). Washington, DC: Gallaudet College.

  • Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246.

  • Bentler, P. M. (1992). On the fit of models to covariances and methodology in the Bulletin. Psychological Bulletin, 112, 400–404.

  • Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.

  • Bolt, D., & Stout, W. (1996). Differential item functioning: Its multidimensional model and resulting SIBTEST detection procedure. Behaviormetrika, 23, 67–95.

  • Bracken, B. A., & McCallum, R. S. (2016). Examiner’s manual: Universal nonverbal intelligence test-second edition (UNIT2). Austin, TX: PRO-ED.

  • Braden, J. P. (1999). Straight talk about assessment and diversity: What do we know? School Psychology Review, 14, 343–355.

  • Brown, L., Sherbenou, R. J., & Johnsen, S. K. (2010). Test of nonverbal intelligence (TONI–4). Austin, TX: PRO-ED.

  • Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Beverly Hills, CA: Sage.

  • Bryk, A. (1980). Review of Bias in mental testing. Journal of Educational Measurement, 17, 369–374.

  • Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.

  • Canivez, G. L., & Watkins, M. W. (in press). Review of the Wechsler intelligence scale for children-fifth edition: Critique, commentary, and independent analyses. In A. S. Kaufman, S. E. Raiford, & D. L. Coalson (Authors), Intelligent testing with the WISC-V (pp. xx–xx). Hoboken, NJ: Wiley.

  • Cattell, R. B. (1978). The scientific use of factor analysis in behavioral and life sciences. New York: Plenum.

  • Chang, H. H., Mazzeo, J., & Roussos, L. (1996). Detecting DIF for polytomously scored items: An adaptation of the SIBTEST procedure. Journal of Educational Measurement, 33, 333–353.

  • Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31–44.

  • Cleary, T. A. (1968). Test bias: Prediction of grades of Negro and White students in integrated colleges. Journal of Educational Measurement, 5, 115–124.

  • Cotter, D. E., & Berk, R. A. (1981, April). Item bias in the WISC-R using Black, White, and Hispanic learning disabled children. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles, CA.

  • Diana v. the California State Board of Education. Case No. C-70-37 RFP. (N.D. Cal., 1970).

  • Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35–66). Hillsdale, NJ: Erlbaum.

  • Dorans, N. J., & Kulick, E. (1983). Assessing unexpected differential item performance of female candidates on SAT and TSWE forms administered in December 1977: An application of the standardization approach. ETS Research Report Series, 1983 (pp. i–14).

  • Engelhard, G., Hansche, L., & Rutledge, K. E. (1990). Accuracy of bias review judges in identifying differential item functioning on teacher certification tests. Applied Measurement in Education, 3, 347–360.

  • Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: American Council on Education & Macmillan.

  • Frisby, C. L. (1998). Poverty and socioeconomic status. In J. L. Sandoval, C. L. Frisby, K. F. Geisinger, J. D. Scheuneman, & J. R. Grenier (Eds.), Test interpretation and diversity: Achieving equity in assessment (pp. 241–270). Washington, DC: American Psychological Association.

  • Green, B. F., Crone, C. R., & Folk, V. G. (1989). A method for studying differential distractor functioning. Journal of Educational Measurement, 26, 147–160.

  • Hammill, D. D., Pearson, N. A., & Wiederholt, J. L. (2009). Comprehensive test of nonverbal intelligence–2 (CTONI–2). Austin, TX: PRO-ED.

  • Hills, J. R. (1989). Screening for potentially biased items in testing programs. Educational Measurement: Issues and Practice, 8, 5–11.

  • Holland, P. W. (1985). On the study of differential item performance without IRT. In Proceedings of the 27th Annual Conference of the Military Testing Association (Vol. 1, pp. 282–287). San Diego, CA.

  • Holland, P. W., & Thayer, D. T. (1988). Differential item functioning and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Erlbaum.

  • Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.

  • Hu, L., Bentler, P. M., & Kano, Y. (1992). Can test statistics in covariance structure analysis be trusted? Psychological Bulletin, 112, 351–362.

  • Ilai, D., & Willerman, L. (1989). Sex differences in WAIS-R item performance. Intelligence, 13, 225–234.

  • Jastak, J. E., & Jastak, S. R. (1964). Short forms of the WAIS and WISC vocabulary subtests. Journal of Clinical Psychology, 20, 167–199.

  • Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.

  • Jiang, H., & Stout, W. (1998). Improved Type I error control and reduced estimation bias for DIF detection using SIBTEST. Journal of Educational Statistics, 23, 291–322.

  • Jodoin, M. G., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329–349.

  • Joint Committee on Testing Practices. (2004). Code of fair testing practices in education. Washington, DC.

  • Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409–426.

  • Jöreskog, K. G., & Sörbom, D. (1989). LISREL 7: A guide to the program and applications (2nd ed.). Chicago: SPSS.

  • Kim, S.-H., Cohen, A. S., & Park, T.-H. (1995). Detection of differential item functioning in multiple groups. Journal of Educational Measurement, 32, 261–276.

  • Koh, T., Abbatiello, A., & McLoughlin, C. S. (1984). Cultural bias in WISC subtest items: A response to Judge Grady’s suggestions in relation to the PASE case. School Psychology Review, 13, 89–94.

  • Kromrey, J. D., & Parshall, C. G. (1991, November). Screening items for bias: An empirical comparison of the performance of three indices in small samples of examinees. Paper presented at the annual meeting of the Florida Educational Research Association, Clearwater, FL.

  • Larry P. v. Wilson Riles, Superintendent of Public Instruction for the State of California. Case No. C-71-2270 (N.D. Cal., 1979).

  • Lawrence, I. M., & Curley, W. E. (1989, March). Differential item functioning of SAT-Verbal reading subscore items for males and females: Follow-up study. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

  • Lawrence, I. M., Curley, W. E., & McHale, F. J. (1988, April). Differential item functioning of SAT-Verbal reading subscore items for male and female examinees. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

  • Li, H.-H., & Stout, W. (1994). SIBTEST: A FORTRAN-V program for computing the simultaneous item bias DIF statistics [Computer software]. Urbana-Champaign, IL: University of Illinois, Department of Statistics.

  • Li, H.-H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61, 647–677.

  • Linn, R. L., Levine, M. V., Hastings, C. N., & Wardrop, J. L. (1981). Item bias in a test of reading comprehension. Applied Psychological Measurement, 5, 159–173.

  • Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

  • Magis, D., Raîche, G., Béland, S., & Gérard, P. (2011). A generalized logistic regression procedure to detect differential item functioning among multiple groups. International Journal of Testing, 11, 365–386.

  • Maller, S. J. (1996). WISC-III Verbal item invariance across samples of deaf and hearing children of similar measured ability. Journal of Psychoeducational Assessment, 14, 152–165.

  • Maller, S. J., & Ferron, J. (1997). WISC-III factor invariance across deaf and standardization samples. Educational and Psychological Measurement, 57, 987–994.

  • Maller, S. J., Konold, T. R., & Glutting, J. J. (1998). WISC-III factor invariance across samples of children displaying appropriate and inappropriate test-taking behavior. Educational and Psychological Measurement, 58, 467–475.

  • Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719–748.

  • Mazor, K. M., Clauser, B. E., & Hambleton, R. K. (1994). Identification of nonuniform differential item functioning using a variation of the Mantel-Haenszel procedure. Educational and Psychological Measurement, 54, 284–291.

  • McGaw, B., & Jöreskog, K. G. (1971). Factorial invariance of ability measures in groups differing in intelligence and socio-economic status. British Journal of Mathematical and Statistical Psychology, 24, 154–168.

  • Meredith, W., & Millsap, R. E. (1992). On the misuse of manifest variables in the detection of measurement bias. Psychometrika, 57, 289–311.

  • Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education & Macmillan.

  • Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297–334.

  • Millsap, R. E., & Meredith, W. (1992). Inferential conditions in the statistical detection of measurement bias. Applied Psychological Measurement, 16, 389–402.

  • Nagelkerke, N. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78, 691–692.

  • National Council on Measurement in Education. (1995). Code of professional responsibilities in educational measurement. Educational Measurement: Issues and Practice, 14, 17–24.

  • Osterlind, S. J., & Everson, H. T. (2009). Differential item functioning (2nd ed.). Thousand Oaks, CA: Sage.

  • Penfield, R. D. (2001). Assessing differential item functioning across multiple groups: A comparison of three Mantel-Haenszel procedures. Applied Measurement in Education, 14, 235–259.

  • Penfield, R. D., & Camilli, G. (2007). Differential item functioning and item bias. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Vol. 26. Psychometrics (pp. 125–167). Amsterdam: Elsevier.

  • Plake, B. S. (1980). A comparison of a statistical and subjective procedure to ascertain item validity: One step in the test validation process. Educational and Psychological Measurement, 40, 397–404.

  • Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495–502.

  • Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197–207.

  • Reynolds, C. R. (1982). The problem of bias in psychological assessment. In C. R. Reynolds & T. B. Gutkin (Eds.), The handbook of school psychology (pp. 178–208). New York: Wiley.

  • Rigdon, E. E. (1996). CFI versus RMSEA: A comparison of two fit indexes for structural equation modeling. Structural Equation Modeling, 3, 369–379.

  • Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and the Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105–116.

  • Roid, G. H., & Miller, L. J. (2013). Leiter international performance scale-third edition (Leiter-3) manual. Wood Dale, IL: Stoelting.

  • Ross-Reynolds, J., & Reschly, D. J. (1983). An investigation of item bias on the WISC-R with four sociocultural groups. Journal of Consulting and Clinical Psychology, 51, 144–146.

  • Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometrika Monograph Series No. 17). Richmond, VA: Psychometric Society.

  • Sandoval, J. (1979). The WISC-R and internal evidence of test bias with minority groups. Journal of Consulting and Clinical Psychology, 47, 919–927.

  • Sandoval, J., & Miille, M. P. W. (1980). Accuracy of judgments of WISC-R item difficulty for minority groups. Journal of Consulting and Clinical Psychology, 48, 249–253.

  • Sandoval, J., Zimmerman, I. L., & Woo-Sam, J. M. (1977). Cultural differences on the WISC-R verbal items. Journal of School Psychology, 21, 49–55.

  • Satorra, A., & Bentler, P. M. (1988). Scaling corrections for chi-square statistics in covariance structure analysis. Proceedings of the Business and Economic Statistics Section of the American Statistical Association (pp. 303–313).

  • Scheuneman, J. D. (1987). An experimental, exploratory study of causes of bias in test items. Journal of Educational Measurement, 24, 97–118.

  • Scheuneman, J. D., & Gerritz, K. (1990). Using differential item functioning procedures to explore sources of item difficulty and group performance characteristics. Journal of Educational Measurement, 27, 109–131.

  • Scheuneman, J. D., & Oakland, T. (1998). High stakes testing in education. In J. Sandoval, C. L. Frisby, K. F. Geisinger, J. D. Scheuneman, & J. R. Grenier (Eds.), Test interpretation and diversity: Achieving equity in assessment (pp. 77–103). Washington, DC: American Psychological Association.

  • Shealy, R. T., & Stout, W. F. (1993). A model based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DIF as well as item bias/DIF. Psychometrika, 58, 159–194.

  • Shelley-Sireci, & Sireci, S. G. (1998, August). Controlling for uncontrolled variables in cross-cultural research. Paper presented at the annual meeting of the American Psychological Association, San Francisco, CA.

  • Sireci, S. G., Bastari, B., & Allalouf, A. (1998, August). Evaluating construct equivalence across adapted tests. Paper presented at the annual meeting of the American Psychological Association, San Francisco, CA.

  • Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.

  • Thissen, D. (2001). Psychometric engineering as art. Psychometrika, 66, 473–486.

  • Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 149–169). Hillsdale, NJ: Erlbaum.

  • Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response theory. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–114). Hillsdale, NJ: Erlbaum.

  • Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1–8.

  • Turner, R. G., & Willerman, L. (1977). Sex differences in WAIS item performance. Journal of Clinical Psychology, 33, 795–798.

  • Uttaro, T. (1992). Factors influencing the Mantel–Haenszel procedure in the detection of differential item functioning. Unpublished doctoral dissertation, Graduate Center, City University of New York.

  • Wechsler, D. (2014). Wechsler intelligence scale for children-fifth edition technical and interpretive manual. San Antonio, TX: NCS Pearson.

  • Wechsler, D., & Naglieri, J. A. (2006). Wechsler nonverbal scale of ability technical and interpretive manual. San Antonio, TX: Pearson.

  • Wild, C. L., & McPeek, W. M. (1986, August). Performance of the Mantel–Haenszel statistic in identifying differentially functioning items. Paper presented at the annual meeting of the American Psychological Association, Washington, DC.

  • Zieky, M. (1993). Practical questions in the use of DIF statistics in item development. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 337–364). Hillsdale, NJ: Erlbaum.

  • Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense.

  • Zumbo, B. D., & Thomas, D. R. (1997). A measure of effect size for a model-based approach for studying DIF. Working Paper of the Edgeworth Laboratory for Quantitative Behavioral Science, University of Northern British Columbia, Prince George, B.C.

  • Zwick, R. (1990). When do item response function and Mantel-Haenszel definitions of differential item functioning coincide? Journal of Educational Statistics, 15, 185–197.

  • Zwick, R., Donoghue, J. R., & Grima, A. (1993). Assessing differential item functioning in performance tasks. Journal of Educational Measurement, 30, 233–251.

Author information

Corresponding author

Correspondence to Susan J. Maller.

Copyright information

© 2017 Springer International Publishing AG

Cite this chapter

Maller, S.J., Pei, L.K. (2017). Best Practices in Detecting Bias in Cognitive Tests. In: McCallum, R. (eds) Handbook of Nonverbal Assessment. Springer, Cham. https://doi.org/10.1007/978-3-319-50604-3_2
