Abstract
This paper proposes a model-based family of detection and quantification statistics to evaluate response bias in item bundles of any size. Compensatory (CDRF) and non-compensatory (NCDRF) response bias measures are proposed, along with their sample realizations and large-sample variability when models are fitted using multiple-group estimation. Based on the underlying connection to item response theory estimation methodology, it is argued that these new statistics provide a powerful and flexible approach to studying response bias for categorical response data over and above methods that have previously appeared in the literature. To evaluate their practical utility, CDRF and NCDRF are compared to the closely related SIBTEST family of statistics and to likelihood-based detection methods through a series of Monte Carlo simulations. Results indicate that the new statistics provide better effect size estimates of marginal response bias than the SIBTEST family, are competitive with a selection of likelihood-based methods when studying item-level bias, and perform best when studying differential bundle and test bias.
Notes
This phenomenon was also noted by Shealy and Stout (1993) when studying their SIBTEST statistic.
The maximum a posteriori criterion can also be obtained using the following estimation schemes if additional prior parameter distributions are included.
For an example of empirical approaches to locating invariant anchor items, see Woods (2009).
Including non-studied items in the anchor item set is clearly only an issue for DIF and DBF. For DTF, there are no non-studied items because items are either anchor items or studied items that contribute to the complete test response functions.
If possible, bounded parameters, such as the lower bound term in the 3PL model (Lord & Novick, 1968), should be re-parameterized so that the ACOV will better approximate the curvature of the log-likelihood or log-posterior function. For instance, in the case of the 3PL model, the lower bound parameter can be transformed using \(\gamma ^{\prime }=\log \left( \frac{\gamma }{1-\gamma }\right) \), where \(\gamma ^{\prime }\) is estimated in place of \(\gamma \).
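The logit re-parameterization described above can be sketched as follows. This is a minimal illustration, not code from the paper; the function names `to_unbounded` and `to_bounded` are chosen here for clarity.

```python
import math

def to_unbounded(gamma):
    # Logit transform: maps the bounded lower-bound (guessing)
    # parameter gamma in (0, 1) onto the unbounded real line,
    # where gamma' is estimated in place of gamma.
    return math.log(gamma / (1.0 - gamma))

def to_bounded(gamma_prime):
    # Inverse (logistic) transform: recovers gamma from gamma'.
    return 1.0 / (1.0 + math.exp(-gamma_prime))

# Example: a lower bound of 0.2 maps to an unbounded value and back.
gamma = 0.2
gamma_prime = to_unbounded(gamma)
recovered = to_bounded(gamma_prime)
```

Estimating on the unbounded scale keeps the parameter away from its boundary, so the quadratic approximation underlying the ACOV better matches the curvature of the log-likelihood or log-posterior.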
In the limiting case where both \(J\rightarrow \infty \) and \(N\rightarrow \infty \), \({\hat{\theta }}\rightarrow \theta \), in which case the DFIT estimates converge to their associated parameters.
This was rediscovered in the Monte Carlo simulation study below. Even after properly equating the groups through multiple-group estimation, the Type I error rate estimates were within the range 0.60–0.95 when evaluated at \(\alpha =.05\). Moreover, the cut-off values for NCDIF were clearly suboptimal as well, resulting in either highly liberal or conservative detection rates that depended on data characteristics such as sample size, test length, number of anchor items, and so on. See Chalmers (2016) for details.
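The Type I error rates mentioned above are empirical rejection rates computed over simulation replications. A generic sketch of that calculation, assuming uniformly distributed p-values under a well-calibrated null (this is illustrative code, not the paper's simulation design):

```python
import random

def empirical_type_i_error(n_replications=10000, alpha=0.05, seed=1):
    # Under a correctly specified null hypothesis, p-values are
    # approximately uniform on (0, 1), so the empirical rejection
    # rate at alpha should be close to alpha. Rates far above alpha,
    # such as those reported for NCDIF, indicate miscalibration.
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(n_replications)]
    return sum(p < alpha for p in p_values) / n_replications
```

With a well-calibrated statistic this returns a value near 0.05; the 0.60–0.95 range reported in the note is what the same computation yields when the reference distribution of the statistic is badly misspecified.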
The three-parameter logistic model (3PL) was also briefly considered for this simulation study. However, because SIBTEST and CSIBTEST require ad hoc techniques to accommodate lower bound parameters, and because prior parameter distributions are typically recommended for this IRT model to help with convergence in smaller samples, this model was not included.
References
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459.
Bock, R. D., & Zimowski, M. F. (1997). Multiple group IRT. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 433–448). New York: Springer.
Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15(2), 113–141.
Camilli, G., & Shepard, L. (1994). Methods for identifying biased test items. California: Sage.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. https://doi.org/10.18637/jss.v048.i06.
Chalmers, R. P. (2016). A differential response functioning framework for understanding item, bundle, and test bias (Unpublished doctoral dissertation). Toronto: York University.
Chalmers, R. P. (2018). Improving the Crossing-SIBTEST statistic for detecting nonuniform DIF. Psychometrika, 83(2), 376–386. https://doi.org/10.1007/s11336-017-9583-8.
Chalmers, R. P. (2018). Numerical approximation of the observed information matrix with Oakes’ identity. British Journal of Mathematical and Statistical Psychology. https://doi.org/10.1111/bmsp.12127
Chalmers, R. P., Counsell, A., & Flora, D. B. (2016). It might not make a big DIF: Improved differential test functioning statistics that account for sampling variability. Educational and Psychological Measurement, 76(1), 114–140. https://doi.org/10.1177/0013164415584576.
Chalmers, R. P., Pek, J., & Liu, Y. (2017). Profile-likelihood confidence intervals in item response theory models. Multivariate Behavioral Research, 52(5), 533–550. https://doi.org/10.1080/00273171.2017.1329082.
Chang, H.-H., Mazzeo, J., & Roussos, L. (1996). DIF for polytomously scored items: An adaptation of the SIBTEST procedure. Journal of Educational Measurement, 33(3), 333–353.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale: Erlbaum.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38.
Dorans, N. J., & Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23(4), 355–368.
Efron, B., & Tibshirani, R. J. (1998). An introduction to the bootstrap. New York: Chapman & Hall.
Glas, C. A. W. (1998). Detection of differential item functioning using Lagrange multiplier tests. Statistica Sinica, 8, 647–667.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10, 255–282.
Hedges, L. V. (1981). Distribution theory for Glass’s estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.
Hedges, L. V. (1982). Estimating effect size from a series of independent experiments. Psychological Bulletin, 92, 490–499.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. New York: Routledge.
Jiang, H., & Stout, W. (1998). Improved Type I error control and reduced estimation bias for DIF detection using SIBTEST. Journal of Educational and Behavioral Statistics, 23(4), 291–322.
Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking (2nd ed.). New York: Springer.
Li, H.-H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61(4), 647–677.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale: Lawrence Erlbaum Associates.
Lord, F. M., & Novick, M. R. (1968). Statistical theory of mental test scores. Reading: Addison-Wesley.
Meade, A. W. (2010). A taxonomy of effect size measures for the differential functioning of items and scales. Journal of Applied Psychology, 95(4), 728–743.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1091.
Millsap, R. E. (2011). Statistical approaches to measurement invariance. New York: Routledge.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29(2), 133–161.
Oakes, D. (1999). Direct calculation of the information matrix via the EM algorithm. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 61(2), 479–482.
Oshima, T. C., Raju, N. S., & Flowers, C. P. (1997). Development and demonstration of multidimensional IRT-based internal measures of differential functioning of items and tests. Journal of Educational Measurement, 34(3), 253–272.
Oshima, T. C., Raju, N. S., Flowers, C. P., & Slinde, J. A. (1998). Differential bundle functioning using the DFIT framework: Procedures for identifying possible sources of differential functioning. Applied Measurement in Education, 11(4), 353–369.
Oshima, T. C., Raju, N. S., & Nanda, A. O. (2006). A new method for assessing the statistical significance in the differential functioning of items and tests (DFIT) framework. Journal of Educational Measurement, 43(1), 1–17.
Penfield, R. D., & Camilli, G. (2007). Differential item functioning and item bias. In Handbook of statistics (Vol. 26). Amsterdam: Elsevier B.V.
Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495–502.
Raju, N. S. (1990). Determining the significance of estimated signed and unsigned area between two item response functions. Applied Psychological Measurement, 14, 197–207.
Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19(4), 353–368.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159–194.
Sigal, M. J., & Chalmers, R. P. (2016). Play it again: Teaching statistics with Monte Carlo simulation. Journal of Statistics Education, 24(3), 136–156. https://doi.org/10.1080/10691898.2016.1246953.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item (functioning and differential) test functioning on selection decisions: When are statistically significant effects practically important? Journal of Applied Psychology, 89(3), 497–508.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–113). Hillsdale: Lawrence Erlbaum.
Thissen, D., & Wainer, H. (1990). Confidence envelopes for item response theory. Journal of Educational Statistics, 15(2), 113–128.
Wainer, H. (1993). Model-based standardized measurement of an item’s differential impact. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 123–135). Mahwah: Erlbaum.
Woods, C. M. (2009). Empirical selection of anchors for tests of differential item functioning. Applied Psychological Measurement, 33(1), 42–57.
Woods, C. M. (2011). DIF testing with an empirical-histogram approximation of the latent density for each group. Applied Measurement in Education, 24(3), 256–279.
Special thanks to Dr. Daniel Bolt, five anonymous reviewers, and the associate editor for providing constructive comments on earlier drafts of this manuscript.
Chalmers, R.P. Model-Based Measures for Detecting and Quantifying Response Bias. Psychometrika 83, 696–732 (2018). https://doi.org/10.1007/s11336-018-9626-9