
Best Practices in Detecting Bias in Cognitive Tests

Chapter in Handbook of Nonverbal Assessment

Abstract

Although the terms item bias and DIF are often used interchangeably, the term DIF was suggested (Holland and Thayer in Wainer and Braun (eds) Test validity. Erlbaum, Hillsdale, pp 129–145, 1988) as a more neutral term for differences in the statistical properties of an item between groups of examinees of equal ability. Items that exhibit DIF threaten the validity of a test and may have serious consequences for groups as well as individuals, because the probability of a correct response is determined not only by the trait the test claims to measure but also by factors specific to group membership, such as ethnicity or gender. It is therefore critical to identify DIF items in a test. In this chapter, the more popular DIF detection methods, including the Mantel–Haenszel procedure, logistic regression modeling, SIBTEST, and the IRT likelihood ratio test, are described. Details of other methods, as well as some older methods not mentioned above, can be found in the overviews given by Camilli and Shepard (Methods for identifying biased test items. Sage, Thousand Oaks, 1994), Clauser and Mazor (Educ Measure Issues Pract 17:31–44, 1998), Holland and Wainer (Differential item functioning. Erlbaum, Hillsdale, 1993), Millsap and Everson (Appl Psychol Measure 17:297–334, 1993), Osterlind and Everson (Differential item functioning. Sage, Thousand Oaks, 2009), and Penfield and Camilli (Handbook of statistics: Vol. 26. Psychometrics. Elsevier, Amsterdam, pp 125–167, 2007).
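To make the first of these methods concrete: the Mantel–Haenszel procedure stratifies examinees by a matching criterion (usually total test score), cross-tabulates group membership against item response within each stratum, and pools the strata into a common odds-ratio estimate; values far from 1 suggest the item functions differently for equally able examinees. The sketch below is a minimal illustration with invented stratum counts, not code from the chapter.

```python
from math import log

def mantel_haenszel_dif(strata):
    """Mantel-Haenszel common odds-ratio estimate across score strata.

    Each stratum is a tuple (A, B, C, D):
      A = reference-group correct,  B = reference-group incorrect,
      C = focal-group correct,      D = focal-group incorrect.
    Returns (alpha_MH, delta_MH). alpha_MH > 1 means the item favors
    the reference group at matched ability; delta_MH rescales alpha
    to the ETS delta metric, where |delta| >= 1.5 is conventionally
    flagged as large DIF (ETS category C).
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    alpha = num / den
    delta = -2.35 * log(alpha)  # ETS delta scale
    return alpha, delta

# Hypothetical data for one item: three total-score strata.
strata = [(40, 10, 30, 20), (60, 15, 45, 30), (80, 5, 60, 15)]
alpha, delta = mantel_haenszel_dif(strata)
print(round(alpha, 2), round(delta, 2))  # → 2.93 -2.53
```

Here the pooled odds ratio of about 2.9 (delta ≈ −2.5) would flag the item as favoring the reference group even after matching on total score; in practice a significance test (the MH chi-square) accompanies the effect size before an item is flagged.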


References

  • Agresti, A. (1996). Logit models with random effects and quasi-symmetric loglinear models. In Proceedings of the 11th International Workshop on Statistical Modelling (pp. 3–12).

  • Alwin, D. F., & Jackson, D. J. (1981). Applications of simultaneous factor analysis to issues of factorial invariance. In D. Jackson & E. Borgatta (Eds.), Factor analysis and measurement in sociological research: A multi-dimensional perspective (pp. 249–279). Beverly Hills: Sage.

  • American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

  • Anderson, R. J., & Sisco, F. H. (1977). Standardization of the WISC-R performance scale for deaf children (Office of Demographic Studies Publication Series T, No. 1). Washington, DC: Gallaudet College.

  • Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246.

  • Bentler, P. M. (1992). On the fit of models to covariances and methodology in the Bulletin. Psychological Bulletin, 112, 400–404.

  • Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.

  • Bolt, D., & Stout, W. (1996). Differential item functioning: Its multidimensional model and resulting SIBTEST detection procedure. Behaviormetrika, 23, 67–95.

  • Bracken, B. A., & McCallum, R. S. (2016). Examiner’s manual: Universal nonverbal intelligence test-second edition (UNIT2). Austin, TX: PRO-ED.

  • Braden, J. P. (1999). Straight talk about assessment and diversity: What do we know? School Psychology Review, 14, 343–355.

  • Brown, L., Sherbenou, R. J., & Johnsen, S. K. (2010). Test of nonverbal intelligence (TONI–4). Austin, TX: PRO-ED.

  • Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Beverly Hills, CA: Sage.

  • Bryk, A. (1980). Review of Bias in mental testing. Journal of Educational Measurement, 17, 369–374.

  • Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.

  • Canivez, G. L., & Watkins, M. W. (in press). Review of the Wechsler intelligence scale for children-fifth edition: Critique, commentary, and independent analyses. In A. S. Kaufman, S. E. Raiford, & D. L. Coalson (Authors), Intelligent testing with the WISC-V (pp. xx–xx). Hoboken, NJ: Wiley.

  • Cattell, R. B. (1978). The scientific use of factor analysis in behavioral and life sciences. New York: Plenum.

  • Chang, H. H., Mazzeo, J., & Roussos, L. (1996). Detecting DIF for polytomously scored items: An adaptation of the SIBTEST procedure. Journal of Educational Measurement, 33, 333–353.

  • Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31–44.

  • Cleary, T. A. (1968). Test bias: Prediction of grades of Negro and White students in integrated colleges. Journal of Educational Measurement, 5, 115–124.

  • Cotter, D. E., & Berk, R. A. (1981, April). Item bias in the WISC-R using Black, White, and Hispanic learning disabled children. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles, CA.

  • Diana v. the California State Board of Education. Case No. C-70-37 RFP. (N.D. Cal., 1970).

  • Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35–66). Hillsdale, NJ: Erlbaum.

  • Dorans, N. J., & Kulick, E. (1983). Assessing unexpected differential item performance of female candidates on SAT and TSWE forms administered in December 1977: An application of the standardization approach. ETS Research Report Series, 1983 (pp. i–14).

  • Engelhard, G., Hansche, L., & Rutledge, K. E. (1990). Accuracy of bias review judges in identifying differential item functioning on teacher certification tests. Applied Measurement in Education, 3, 347–360.

  • Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: American Council on Education & Macmillan.

  • Frisby, C. L. (1998). Poverty and socioeconomic status. In J. L. Sandoval, C. L. Frisby, K. F. Geisinger, J. D. Scheuneman, & J. R. Grenier (Eds.), Test interpretation and diversity: Achieving equity in assessment (pp. 241–270). Washington, DC: American Psychological Association.

  • Green, B. F., Crone, C. R., & Folk, V. G. (1989). A method for studying differential distractor functioning. Journal of Educational Measurement, 26, 147–160.

  • Hammill, D. D., Pearson, N. A., & Wiederholt, J. L. (2009). Comprehensive test of nonverbal intelligence–2 (CTONI–2). Austin, TX: PRO-ED.

  • Hills, J. R. (1989). Screening for potentially biased items in testing programs. Educational Measurement: Issues and Practice, 8, 5–11.

  • Holland, P. W. (1985). On the study of differential item performance without IRT. In Proceedings of the 27th Annual Conference of the Military Testing Association (Vol. 1, pp. 282–287). San Diego, CA.

  • Holland, P. W., & Thayer, D. T. (1988). Differential item functioning and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Erlbaum.

  • Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.

  • Hu, L., Bentler, P. M., & Kano, Y. (1992). Can test statistics in covariance structure analysis be trusted? Psychological Bulletin, 112, 351–362.

  • Ilai, D., & Willerman, L. (1989). Sex differences in WAIS-R item performance. Intelligence, 13, 225–234.

  • Jastak, J. E., & Jastak, S. R. (1964). Short forms of the WAIS and WISC vocabulary subtests. Journal of Clinical Psychology, 20, 167–199.

  • Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.

  • Jiang, H., & Stout, W. (1998). Improved Type I error control and reduced estimation bias for DIF detection using SIBTEST. Journal of Educational Statistics, 23, 291–322.

  • Jodoin, M. G., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329–349.

  • Joint Committee on Testing Practices. (2004). Code of fair testing practices in education. Washington, DC.

  • Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409–426.

  • Jöreskog, K. G., & Sörbom, D. (1989). LISREL 7: A guide to the program and applications (2nd ed.). Chicago: SPSS.

  • Kim, S.-H., Cohen, A. S., & Park, T.-H. (1995). Detection of differential item functioning in multiple groups. Journal of Educational Measurement, 32, 261–276.

  • Koh, T., Abbatiello, A., & McLoughlin, C. S. (1984). Cultural bias in WISC subtest items: A response to Judge Grady’s suggestions in relation to the PASE case. School Psychology Review, 13, 89–94.

  • Kromrey, J. D., & Parshall, C. G. (1991, November). Screening items for bias: An empirical comparison of the performance of three indices in small samples of examinees. Paper presented at the annual meeting of the Florida Educational Research Association, Clearwater, FL.

  • Larry P. v. Wilson Riles, Superintendent of Public Instruction for the State of California. Case No. C-71-2270 (N.D. Cal., 1979).

  • Lawrence, I. M., & Curley, W. E. (1989, March). Differential item functioning of SAT-Verbal reading subscore items for males and females: Follow-up study. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

  • Lawrence, I. M., Curley, W. E., & McHale, F. J. (1988, April). Differential item functioning of SAT-Verbal reading subscore items for male and female examinees. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

  • Li, H.-H., & Stout, W. (1994). SIBTEST: A FORTRAN-V program for computing the simultaneous item bias DIF statistics [Computer software]. Urbana-Champaign, IL: University of Illinois, Department of Statistics.

  • Li, H.-H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61, 647–677.

  • Linn, R. L., Levine, M. V., Hastings, C. N., & Wardrop, J. L. (1981). Item bias in a test of reading comprehension. Applied Psychological Measurement, 5, 159–173.

  • Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

  • Magis, D., Raîche, G., Béland, S., & Gérard, P. (2011). A generalized logistic regression procedure to detect differential item functioning among multiple groups. International Journal of Testing, 11, 365–386.

  • Maller, S. J. (1996). WISC-III Verbal item invariance across samples of deaf and hearing children of similar measured ability. Journal of Psychoeducational Assessment, 14, 152–165.

  • Maller, S. J., & Ferron, J. (1997). WISC-III factor invariance across deaf and standardization samples. Educational and Psychological Measurement, 57, 987–994.

  • Maller, S. J., Konold, T. R., & Glutting, J. J. (1998). WISC-III factor invariance across samples of children displaying appropriate and inappropriate test-taking behavior. Educational and Psychological Measurement, 58, 467–475.

  • Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719–748.

  • Mazor, K. M., Clauser, B. E., & Hambleton, R. K. (1994). Identification of nonuniform differential item functioning using a variation of the Mantel-Haenszel procedure. Educational and Psychological Measurement, 54, 284–291.

  • McGaw, B., & Jöreskog, K. G. (1971). Factorial invariance of ability measures in groups differing in intelligence and socio-economic status. British Journal of Mathematical and Statistical Psychology, 24, 154–168.

  • Meredith, W., & Millsap, R. E. (1992). On the misuse of manifest variables in the detection of measurement bias. Psychometrika, 57, 289–311.

  • Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education & Macmillan.

  • Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297–334.

  • Millsap, R. E., & Meredith, W. (1992). Inferential conditions in the statistical detection of measurement bias. Applied Psychological Measurement, 16, 389–402.

  • Nagelkerke, N. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78, 691–692.

  • National Council on Measurement in Education. (1995). Code of professional responsibilities in educational measurement. Educational Measurement: Issues and Practice, 14, 17–24.

  • Osterlind, S. J., & Everson, H. T. (2009). Differential item functioning (2nd ed.). Thousand Oaks, CA: Sage.

  • Penfield, R. D. (2001). Assessing differential item functioning across multiple groups: A comparison of three Mantel-Haenszel procedures. Applied Measurement in Education, 14, 235–259.

  • Penfield, R. D., & Camilli, G. (2007). Differential item functioning and item bias. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Vol. 26. Psychometrics (pp. 125–167). Amsterdam: Elsevier.

  • Plake, B. S. (1980). A comparison of a statistical and subjective procedure to ascertain item validity: One step in the test validation process. Educational and Psychological Measurement, 40, 397–404.

  • Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495–502.

  • Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197–207.

  • Reynolds, C. R. (1982). The problem of bias in psychological assessment. In C. R. Reynolds & T. B. Gutkin (Eds.), The handbook of school psychology (pp. 178–208). New York: Wiley.

  • Rigdon, E. E. (1996). CFI versus RMSEA: A comparison of two fit indexes for structural equation modeling. Structural Equation Modeling, 3, 369–379.

  • Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and the Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105–116.

  • Roid, G. H., & Miller, L. J. (2013). Leiter international performance scale-third edition (Leiter-3) manual. Wood Dale, IL: Stoelting.

  • Ross-Reynolds, J., & Reschly, D. J. (1983). An investigation of item bias on the WISC-R with four sociocultural groups. Journal of Consulting and Clinical Psychology, 51, 144–146.

  • Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometrika Monograph Series No. 17). Richmond, VA: Psychometric Society.

  • Sandoval, J. (1979). The WISC-R and internal evidence of test bias with minority groups. Journal of Consulting and Clinical Psychology, 47, 919–927.

  • Sandoval, J., & Miille, M. P. W. (1980). Accuracy of judgments of WISC-R item difficulty for minority groups. Journal of Consulting and Clinical Psychology, 48, 249–253.

  • Sandoval, J., Zimmerman, I. L., & Woo-Sam, J. M. (1977). Cultural differences on the WISC-R verbal items. Journal of School Psychology, 21, 49–55.

  • Satorra, A., & Bentler, P. M. (1988). Scaling corrections for chi-square statistics in covariance structure analysis. Proceedings of the Business and Economic Statistics Section of the American Statistical Association (pp. 303–313).

  • Scheuneman, J. D. (1987). An experimental, exploratory study of causes of bias in test items. Journal of Educational Measurement, 24, 97–118.

  • Scheuneman, J. D., & Gerritz, K. (1990). Using differential item functioning procedures to explore sources of item difficulty and group performance characteristics. Journal of Educational Measurement, 27, 109–131.

  • Scheuneman, J. D., & Oakland, T. (1998). High stakes testing in education. In J. Sandoval, C. L. Frisby, K. F. Geisinger, J. D. Scheuneman, & J. R. Grenier (Eds.), Test interpretation and diversity: Achieving equity in assessment (pp. 77–103). Washington, DC: American Psychological Association.

  • Shealy, R. T., & Stout, W. F. (1993). A model based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DIF as well as item bias/DIF. Psychometrika, 58, 159–194.

  • Shelley-Sireci, & Sireci, S. G. (1998, August). Controlling for uncontrolled variables in cross-cultural research. Paper presented at the annual meeting of the American Psychological Association, San Francisco, CA.

  • Sireci, S. G., Bastari, B., & Allalouf, A. (1998, August). Evaluating construct equivalence across adapted tests. Paper presented at the annual meeting of the American Psychological Association, San Francisco, CA.

  • Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.

  • Thissen, D. (2001). Psychometric engineering as art. Psychometrika, 66, 473–486.

  • Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 149–169). Hillsdale, NJ: Erlbaum.

  • Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response theory. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–114). Hillsdale, NJ: Erlbaum.

  • Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1–8.

  • Turner, R. G., & Willerman, L. (1977). Sex differences in WAIS item performance. Journal of Clinical Psychology, 33, 795–798.

  • Uttaro, T. (1992). Factors influencing the Mantel–Haenszel procedure in the detection of differential item functioning. Unpublished doctoral dissertation, Graduate Center, City University of New York.

  • Wechsler, D. (2014). Wechsler intelligence scale for children-fifth edition technical and interpretive manual. San Antonio, TX: NCS Pearson.

  • Wechsler, D., & Naglieri, J. A. (2006). Wechsler nonverbal scale of ability technical and interpretive manual. San Antonio, TX: Pearson.

  • Wild, C. L., & McPeek, W. M. (1986, August). Performance of the Mantel–Haenszel statistic in identifying differentially functioning items. Paper presented at the annual meeting of the American Psychological Association, Washington, DC.

  • Zieky, M. (1993). Practical questions in the use of DIF statistics in item development. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 337–364). Hillsdale, NJ: Erlbaum.

  • Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense.

  • Zumbo, B. D., & Thomas, D. R. (1997). A measure of effect size for a model-based approach for studying DIF. Working Paper of the Edgeworth Laboratory for Quantitative Behavioral Science, University of Northern British Columbia, Prince George, B.C.

  • Zwick, R. (1990). When do item response function and Mantel-Haenszel definitions of differential item functioning coincide? Journal of Educational Statistics, 15, 185–197.

  • Zwick, R., Donoghue, J. R., & Grima, A. (1993). Assessing differential item functioning in performance tasks. Journal of Educational Measurement, 30, 233–251.

Author information

Corresponding author

Correspondence to Susan J. Maller.

Copyright information

© 2017 Springer International Publishing AG

Cite this chapter

Maller, S.J., Pei, L.K. (2017). Best Practices in Detecting Bias in Cognitive Tests. In: McCallum, R. (eds) Handbook of Nonverbal Assessment. Springer, Cham. https://doi.org/10.1007/978-3-319-50604-3_2
