Abstract
This paper proposes a model-based family of detection and quantification statistics to evaluate response bias in item bundles of any size. Compensatory (CDRF) and non-compensatory (NCDRF) response bias measures are proposed, along with their sample realizations and large-sample variability when models are fitted using multiple-group estimation. Based on the underlying connection to item response theory estimation methodology, it is argued that these new statistics provide a powerful and flexible approach to studying response bias for categorical response data over and above methods that have previously appeared in the literature. To evaluate their practical utility, CDRF and NCDRF are compared to the closely related SIBTEST family of statistics and to likelihood-based detection methods through a series of Monte Carlo simulations. Results indicate that the new statistics provide better effect size estimates of marginal response bias than the SIBTEST family, are competitive with a selection of likelihood-based methods when studying item-level bias, and perform best when studying differential bundle and test bias.
Notes
This phenomenon was also noted by Shealy and Stout (1993) when studying their SIBTEST statistic.
The maximum a posteriori criterion can also be obtained using the following estimation schemes if additional prior parameter distributions are included.
For an example of empirical approaches to locating invariant anchor items, see Woods (2009).
Including non-studied items in the anchor item set is clearly only an issue for DIF and DBF. For DTF, there are no non-studied items because items are either anchor items or studied items that contribute to the complete test response functions.
If possible, bounded parameters, such as the lower bound term in the 3PL model (Lord & Novick, 1968), should be re-parameterized so that the ACOV will better approximate the curvature of the log-likelihood or log-posterior function. For instance, in the case of the 3PL model, the lower bound parameter can be transformed using \(\gamma ^{\prime }=\log \left( \frac{\gamma }{1-\gamma }\right) \), where \(\gamma ^{\prime }\) is estimated in place of \(\gamma \).
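The logit re-parameterization described above can be sketched as follows. This is a minimal illustration, not code from the paper; the function names `to_unbounded` and `to_bounded` are chosen here for clarity.

```python
import math

def to_unbounded(gamma):
    # Logit transform: maps the bounded lower-bound (guessing)
    # parameter gamma in (0, 1) onto the unbounded real line,
    # where gamma' is estimated in place of gamma.
    return math.log(gamma / (1.0 - gamma))

def to_bounded(gamma_prime):
    # Inverse (logistic) transform: recovers gamma from gamma'.
    return 1.0 / (1.0 + math.exp(-gamma_prime))

# Example: a lower bound of 0.2 maps to an unbounded value and back.
gamma = 0.2
gamma_prime = to_unbounded(gamma)
recovered = to_bounded(gamma_prime)
```

Estimating on the unbounded scale keeps the parameter away from its boundary, so the quadratic approximation underlying the ACOV better matches the curvature of the log-likelihood or log-posterior.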
In the limiting case where both \(J\rightarrow \infty \) and \(N\rightarrow \infty \), \({\hat{\theta }}\rightarrow \theta \), in which case the DFIT estimates converge to their associated parameters.
This was rediscovered in the Monte Carlo simulation study below. Even after properly equating the groups through multiple-group estimation, the Type I error rate estimates were within the range 0.60–0.95 when evaluated at \(\alpha =.05\). Moreover, the cut-off values for NCDIF were clearly suboptimal as well, resulting in either highly liberal or conservative detection rates that depended on data characteristics such as sample size, test length, number of anchor items, and so on. See Chalmers (2016) for details.
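The Type I error rates mentioned above are empirical rejection rates computed over simulation replications. A generic sketch of that calculation, assuming uniformly distributed p-values under a well-calibrated null (this is illustrative code, not the paper's simulation design):

```python
import random

def empirical_type_i_error(n_replications=10000, alpha=0.05, seed=1):
    # Under a correctly specified null hypothesis, p-values are
    # approximately uniform on (0, 1), so the empirical rejection
    # rate at alpha should be close to alpha. Rates far above alpha,
    # such as those reported for NCDIF, indicate miscalibration.
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(n_replications)]
    return sum(p < alpha for p in p_values) / n_replications
```

With a well-calibrated statistic this returns a value near 0.05; the 0.60–0.95 range reported in the note is what the same computation yields when the reference distribution of the statistic is badly misspecified.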
The three-parameter logistic model (3PL) was also briefly considered for this simulation study. However, because SIBTEST and CSIBTEST require ad hoc techniques to accommodate lower bound parameters, and because prior parameter distributions are typically recommended for this IRT model to help with convergence in smaller samples, this model was not included.
References
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459.
Bock, R. D., & Zimowski, M. F. (1997). Multiple group IRT. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 433–448). New York: Springer.
Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15(2), 113–141.
Camilli, G., & Shepard, L. (1994). Methods for identifying biased test items. California: Sage.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. https://doi.org/10.18637/jss.v048.i06.
Chalmers, R. P. (2016). A differential response functioning framework for understanding item, bundle, and test bias (Unpublished doctoral dissertation). Toronto: York University.
Chalmers, R. P. (2018). Improving the Crossing-SIBTEST statistic for detecting nonuniform DIF. Psychometrika, 83(2), 376–386. https://doi.org/10.1007/s11336-017-9583-8.
Chalmers, R. P. (2018). Numerical approximation of the observed information matrix with Oakes’ identity. British Journal of Mathematical and Statistical Psychology. https://doi.org/10.1111/bmsp.12127
Chalmers, R. P., Counsell, A., & Flora, D. B. (2016). It might not make a big DIF: Improved differential test functioning statistics that account for sampling variability. Educational and Psychological Measurement, 76(1), 114–140. https://doi.org/10.1177/0013164415584576.
Chalmers, R. P., Pek, J., & Liu, Y. (2017). Profile-likelihood confidence intervals in item response theory models. Multivariate Behavioral Research, 52(5), 533–550. https://doi.org/10.1080/00273171.2017.1329082.
Chang, H.-H., Mazzeo, J., & Roussos, L. (1996). DIF for polytomously scored items: An adaptation of the SIBTEST procedure. Journal of Educational Measurement, 33(3), 333–353.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale: Erlbaum.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38.
Dorans, N. J., & Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23(4), 355–368.
Efron, B., & Tibshirani, R. J. (1998). An introduction to the bootstrap. New York: Chapman & Hall.
Glas, C. A. W. (1998). Detection of differential item functioning using Lagrange multiplier tests. Statistica Sinica, 8, 647–667.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10, 255–282.
Hedges, L. V. (1981). Distribution theory for Glass’s estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.
Hedges, L. V. (1982). Estimating effect size from a series of independent experiments. Psychological Bulletin, 92, 490–499.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. New York: Routledge.
Jiang, H., & Stout, W. (1998). Improved Type I error control and reduced estimation bias for DIF detection using SIBTEST. Journal of Educational and Behavioral Statistics, 23(4), 291–322.
Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking (2nd ed.). New York: Springer.
Li, H.-H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61(4), 647–677.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale: Lawrence Erlbaum Associates.
Lord, F. M., & Novick, M. R. (1968). Statistical theory of mental test scores. Reading: Addison-Wesley.
Meade, A. W. (2010). A taxonomy of effect size measures for the differential functioning of items and scales. Journal of Applied Psychology, 95(4), 728–743.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1091.
Millsap, R. E. (2011). Statistical approaches to measurement invariance. New York: Routledge.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29(2), 133–161.
Oakes, D. (1999). Direct calculation of the information matrix via the EM algorithm. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 61(2), 479–482.
Oshima, T. C., Raju, N. S., & Flowers, C. P. (1997). Development and demonstration of multidimensional IRT-based internal measures of differential functioning of items and tests. Journal of Educational Measurement, 34(3), 253–272.
Oshima, T. C., Raju, N. S., Flowers, C. P., & Slinde, J. A. (1998). Differential bundle functioning using the DFIT framework: Procedures for identifying possible sources of differential functioning. Applied Measurement in Education, 11(4), 353–369.
Oshima, T. C., Raju, N. S., & Nanda, A. O. (2006). A new method for assessing the statistical significance in the differential functioning of items and tests (DFIT) framework. Journal of Educational Measurement, 43(1), 1–17.
Penfield, R. D., & Camilli, G. (2007). Differential item functioning and item bias. In Handbook of statistics (Vol. 26). Amsterdam: Elsevier B.V.
Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495–502.
Raju, N. S. (1990). Determining the significance of estimated signed and unsigned area between two item response functions. Applied Psychological Measurement, 14, 197–207.
Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19(4), 353–368.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159–194.
Sigal, M. J., & Chalmers, R. P. (2016). Play it again: Teaching statistics with Monte Carlo simulation. Journal of Statistics Education, 24(3), 136–156. https://doi.org/10.1080/10691898.2016.1246953.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item (functioning and differential) test functioning on selection decisions: When are statistically significant effects practically important? Journal of Applied Psychology, 89(3), 497–508.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–113). Hillsdale: Lawrence Erlbaum.
Thissen, D., & Wainer, H. (1990). Confidence envelopes for item response theory. Journal of Educational Statistics, 15(2), 113–128.
Wainer, H. (1993). Model-based standardized measurement of an item’s differential impact. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 123–135). Mahwah: Erlbaum.
Woods, C. M. (2009). Empirical selection of anchors for tests of differential item functioning. Applied Psychological Measurement, 33(1), 42–57.
Woods, C. M. (2011). DIF testing with an empirical-histogram approximation of the latent density for each group. Applied Measurement in Education, 24(3), 256–279.
Special thanks to Dr. Daniel Bolt, five anonymous reviewers, and the associate editor for providing constructive comments on earlier drafts of this manuscript.
Chalmers, R.P. Model-Based Measures for Detecting and Quantifying Response Bias. Psychometrika 83, 696–732 (2018). https://doi.org/10.1007/s11336-018-9626-9