Model-Based Measures for Detecting and Quantifying Response Bias

Abstract

This paper proposes a model-based family of detection and quantification statistics to evaluate response bias in item bundles of any size. Compensatory (CDRF) and non-compensatory (NCDRF) response bias measures are developed, along with their sample realizations and large-sample variability when models are fitted using multiple-group estimation. Based on their underlying connection to item response theory estimation methodology, it is argued that these new statistics provide a powerful and flexible approach to studying response bias for categorical response data over and above methods that have previously appeared in the literature. To evaluate their practical utility, CDRF and NCDRF are compared to the closely related SIBTEST family of statistics and to a selection of likelihood-based detection methods through a series of Monte Carlo simulations. Results indicate that the new statistics provide better effect size estimates of marginal response bias than the SIBTEST family, are competitive with the likelihood-based methods when studying item-level bias, and perform best when studying differential bundle and test bias.
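
To make the compensatory/non-compensatory distinction concrete, the sketch below numerically integrates the difference between two groups' expected total-score functions over the focal-group latent density. It assumes, by analogy with the signed and unsigned DFIT indices of Raju, van der Linden, and Fleer (1995), that the compensatory measure weights the signed difference and the non-compensatory measure the squared difference; the 2PL item parameters and quadrature grid are purely illustrative and are not taken from the article.

    # Hypothetical 2PL parameters for reference (R) and focal (F) groups;
    # item 2 is given a shifted difficulty to mimic DIF
    a_R <- c(1.2, 0.8, 1.5); b_R <- c(-0.5, 0.0, 0.7)
    a_F <- c(1.2, 0.8, 1.5); b_F <- c(-0.5, 0.3, 0.7)

    # Expected total score (test response function) at each theta
    TRF <- function(theta, a, b)
        sapply(theta, function(th) sum(plogis(a * (th - b))))

    # Rectangular quadrature over the focal-group latent density
    theta <- seq(-6, 6, length.out = 201)
    w     <- dnorm(theta) * (theta[2] - theta[1])
    d     <- TRF(theta, a_R, b_R) - TRF(theta, a_F, b_F)

    CDRF  <- sum(d   * w)   # signed (compensatory) weighted area
    NCDRF <- sum(d^2 * w)   # squared (non-compensatory) weighted area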

Notes

  1. This phenomenon was also noted by Shealy and Stout (1993) when studying their SIBTEST statistic.

  2. The maximum a posteriori criterion can also be obtained using the following estimation schemes if additional prior parameter distributions are included.
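
     As a schematic illustration only (not the article's implementation), a MAP criterion augments a log-likelihood with log-prior terms before optimization. The toy objective below adds a normal log-prior on a 2PL slope; the data, abilities, and prior settings are hypothetical.

        # Toy MAP objective: one-item 2PL log-likelihood plus a normal
        # log-prior on the slope parameter a, with par = c(a, b)
        y     <- c(1, 0, 1, 1, 0)        # hypothetical item responses
        theta <- c(-1, -0.5, 0, 0.5, 1)  # hypothetical known abilities
        logpost <- function(par) {
            p <- plogis(par[1] * (theta - par[2]))
            sum(dbinom(y, 1, p, log = TRUE)) +
                dnorm(par[1], mean = 1, sd = 0.5, log = TRUE)
        }
        optim(c(1, 0), function(par) -logpost(par))$par  # MAP estimates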

  3. For an example of empirical approaches to locating invariant anchor items, see Woods (2009).

  4. Including non-studied items in the anchor item set is clearly only an issue for DIF and DBF. For DTF, there are no non-studied items because items are either anchor items or studied items that contribute to the complete test response functions.

  5. If possible, bounded parameters, such as the lower bound term in the 3PL model (Lord & Novick, 1968), should be re-parameterized so that the ACOV will better approximate the curvature of the log-likelihood or log-posterior function. For instance, in the case of the 3PL model, the lower bound parameter can be transformed using \(\gamma ^{\prime }=\log \left( \frac{\gamma }{1-\gamma }\right) \), where \(\gamma ^{\prime }\) is estimated in place of \(\gamma \).
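
     A minimal sketch of this transformation with hypothetical numbers: the model is estimated on the unbounded logit scale, after which the point estimate and, via the delta method, its standard error are mapped back to the bounded scale.

        # Logit reparameterization of the bounded lower-bound parameter
        gamma_to_logit <- function(g)  log(g / (1 - g))
        logit_to_gamma <- function(gp) 1 / (1 + exp(-gp))

        gp_hat <- gamma_to_logit(0.18)  # hypothetical estimate, unbounded scale
        se_gp  <- 0.40                  # hypothetical SE taken from the ACOV

        g_hat <- logit_to_gamma(gp_hat)
        # delta method: d(gamma)/d(gamma') = gamma * (1 - gamma)
        se_g  <- se_gp * g_hat * (1 - g_hat)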

  6. In the limiting case where both \(J\rightarrow \infty \) and \(N\rightarrow \infty \), \({\hat{\theta }}\rightarrow \theta \), in which case the DFIT estimates converge to their associated population parameters.

  7. This was rediscovered in the Monte Carlo simulation study below. Even after properly equating the groups through the multiple-group estimation method, the Type I error rate estimates were within the range 0.60–0.95 when evaluated at \(\alpha =.05\). Moreover, the cut-off values for NCDIF were clearly suboptimal as well, resulting in either highly liberal or conservative detection rates that depended on data characteristics such as sample size, test length, number of anchor items, and so on. See Chalmers (2016) for details.
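
     As a reminder of how such rates are estimated, the toy replication loop below tallies an empirical Type I error rate under a true null; an ordinary two-sample t test stands in for a DIF statistic purely for illustration.

        # Empirical Type I error at alpha = .05 under a true null
        set.seed(1)
        alpha <- 0.05
        pvals <- replicate(2000, t.test(rnorm(250), rnorm(250))$p.value)
        mean(pvals < alpha)  # near .05 for a well-calibrated statistic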

  8. The three-parameter logistic model (3PL) was also briefly considered for this simulation study. However, because SIBTEST and CSIBTEST require ad hoc techniques to accommodate lower bound parameters, and because prior parameter distributions are typically recommended for this IRT model to help with convergence in smaller samples, the 3PL was not included.

References

  • Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459.

  • Bock, R. D., & Zimowski, M. F. (1997). Multiple group IRT. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 433–448). New York: Springer.

  • Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15(2), 113–141.

  • Camilli, G., & Shepard, L. (1994). Methods for identifying biased test items. Thousand Oaks: Sage.

  • Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. https://doi.org/10.18637/jss.v048.i06.

  • Chalmers, R. P. (2016). A differential response functioning framework for understanding item, bundle, and test bias (Unpublished doctoral dissertation). Toronto: York University.

  • Chalmers, R. P. (2018). Improving the Crossing-SIBTEST statistic for detecting nonuniform DIF. Psychometrika, 83(2), 376–386. https://doi.org/10.1007/s11336-017-9583-8.

  • Chalmers, R. P. (2018). Numerical approximation of the observed information matrix with Oakes’ identity. British Journal of Mathematical and Statistical Psychology. https://doi.org/10.1111/bmsp.12127

  • Chalmers, R. P., Counsell, A., & Flora, D. B. (2016). It might not make a big DIF: Improved differential test functioning statistics that account for sampling variability. Educational and Psychological Measurement, 76(1), 114–140. https://doi.org/10.1177/0013164415584576.

  • Chalmers, R. P., Pek, J., & Liu, Y. (2017). Profile-likelihood confidence intervals in item response theory models. Multivariate Behavioral Research, 52(5), 533–550. https://doi.org/10.1080/00273171.2017.1329082.

  • Chang, H.-H., Mazzeo, J., & Roussos, L. (1996). DIF for polytomously scored items: An adaptation of the SIBTEST procedure. Journal of Educational Measurement, 33(3), 333–353.

  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale: Erlbaum.

  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38.

  • Dorans, N. J., & Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23(4), 355–368.

  • Efron, B., & Tibshirani, R. J. (1998). An introduction to the bootstrap. New York: Chapman & Hall.

  • Glas, C. A. W. (1998). Detection of differential item functioning using Lagrange multiplier tests. Statistica Sinica, 8, 647–667.

  • Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10, 255–282.

  • Hedges, L. V. (1981). Distribution theory for Glass’s estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.

  • Hedges, L. V. (1982). Estimating effect size from a series of independent experiments. Psychological Bulletin, 92, 490–499.

  • Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. New York: Routledge.

  • Jiang, H., & Stout, W. (1998). Improved Type I error control and reduced estimation bias for DIF detection using SIBTEST. Journal of Educational and Behavioral Statistics, 23(4), 291–322.

  • Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking (2nd ed.). New York: Springer.

  • Li, H.-H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61(4), 647–677.

  • Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale: Lawrence Erlbaum Associates.

  • Lord, F. M., & Novick, M. R. (1968). Statistical theory of mental test scores. Reading: Addison-Wesley.

  • Meade, A. W. (2010). A taxonomy of effect size measures for the differential functioning of items and scales. Journal of Applied Psychology, 95(4), 728–743.

  • Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092.

  • Millsap, R. E. (2011). Statistical approaches to measurement invariance. New York: Routledge.

  • Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29(2), 133–161.

  • Oakes, D. (1999). Direct calculation of the information matrix via the EM algorithm. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 61(2), 479–482.

  • Oshima, T. C., Raju, N. S., & Flowers, C. P. (1997). Development and demonstration of multidimensional IRT-based internal measures of differential functioning of items and tests. Journal of Educational Measurement, 34(3), 253–272.

  • Oshima, T. C., Raju, N. S., Flowers, C. P., & Slinde, J. A. (1998). Differential bundle functioning using the DFIT framework: Procedures for identifying possible sources of differential functioning. Applied Measurement in Education, 11(4), 353–369.

  • Oshima, T. C., Raju, N. S., & Nanda, A. O. (2006). A new method for assessing the statistical significance in the differential functioning of items and tests (DFIT) framework. Journal of Educational Measurement, 43(1), 1–17.

  • Penfield, R. D., & Camilli, G. (2007). Differential item functioning and item bias. In Handbook of statistics (Vol. 26). Amsterdam: Elsevier B.V.

  • Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495–502.

  • Raju, N. S. (1990). Determining the significance of estimated signed and unsigned area between two item response functions. Applied Psychological Measurement, 14, 197–207.

  • Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19(4), 353–368.

  • Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.

  • Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159–194.

  • Sigal, M. J., & Chalmers, R. P. (2016). Play it again: Teaching statistics with Monte Carlo simulation. Journal of Statistics Education, 24(3), 136–156. https://doi.org/10.1080/10691898.2016.1246953.

  • Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item (functioning and differential) test functioning on selection decisions: When are statistically significant effects practically important? Journal of Applied Psychology, 89(3), 497–508.

  • Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–113). Hillsdale: Lawrence Erlbaum.

  • Thissen, D., & Wainer, H. (1990). Confidence envelopes for item response theory. Journal of Educational Statistics, 15(2), 113–128.

  • Wainer, H. (1993). Model-based standardized measurement of an item’s differential impact. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 123–135). Hillsdale: Lawrence Erlbaum.

  • Woods, C. M. (2009). Empirical selection of anchors for tests of differential item functioning. Applied Psychological Measurement, 33(1), 42–57.

  • Woods, C. M. (2011). DIF testing with an empirical-histogram approximation of the latent density for each group. Applied Measurement in Education, 24(3), 256–279.

Author information

Corresponding author

Correspondence to R. Philip Chalmers.

Additional information

Special thanks to Dr. Daniel Bolt, five anonymous reviewers, and the associate editor for providing constructive comments on earlier drafts of this manuscript.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 137 KB)

About this article

Cite this article

Chalmers, R.P. Model-Based Measures for Detecting and Quantifying Response Bias. Psychometrika 83, 696–732 (2018). https://doi.org/10.1007/s11336-018-9626-9

Download citation

  • Received:

  • Revised:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11336-018-9626-9

Keywords

Navigation