Developing tailored instruments: item banking and computerized adaptive assessment

  • Original Paper
Quality of Life Research

Abstract

Item banks and Computerized Adaptive Testing (CAT) have the potential to greatly improve the assessment of health outcomes. This review describes the unique features of item banks and CAT and discusses how to develop item banks. In CAT, a computer selects the items from an item bank that are most relevant for and informative about the particular respondent, thus optimizing test relevance and precision. Item response theory (IRT) provides the foundation for selecting the items that are most informative for the particular respondent and for scoring responses on a common metric. The development of an item bank is a multi-stage process that requires a clear definition of the construct to be measured, good items, a careful psychometric analysis of the items, and a clear specification of the final CAT. The psychometric analysis needs to evaluate the assumptions of the IRT model, such as unidimensionality and local independence; to check that the items function the same way in different subgroups of the population; and to confirm that there is an adequate fit between the data and the chosen item response models. In addition, interpretation guidelines need to be established to support the clinical application of the assessment. Although medical research can draw upon expertise from educational testing in the development of item banks and CAT, the medical field also encounters unique opportunities and challenges.
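
The selection-and-scoring loop described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: a two-parameter logistic (2PL) model for dichotomous items, a standard normal prior with expected a posteriori (EAP) scoring on a fixed grid, maximum-information item selection, and a hypothetical five-item bank with made-up parameters and stopping rule.

```python
import math

# Hypothetical item bank: (discrimination a, location b) for each 2PL item.
ITEM_BANK = [(1.2, -1.5), (0.8, -0.5), (1.5, 0.0), (1.0, 0.7), (1.7, 1.4)]

# Quadrature grid over the latent trait theta, from -4 to 4 in steps of 0.1.
GRID = [-4 + 0.1 * k for k in range(81)]

def prob(theta, a, b):
    """2PL probability of endorsing an item at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = prob(theta, a, b)
    return a * a * p * (1.0 - p)

def eap(responses):
    """EAP score and posterior SD for {item_index: 0 or 1}, standard normal prior."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalized prior density
        for i, x in responses.items():
            p = prob(t, *ITEM_BANK[i])
            w *= p if x == 1 else 1.0 - p  # likelihood of the observed response
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def next_item(theta, administered):
    """Pick the unused item with the most information at the current estimate."""
    remaining = [i for i in range(len(ITEM_BANK)) if i not in administered]
    return max(remaining, key=lambda i: information(theta, *ITEM_BANK[i]))

def run_cat(answer, max_items=3, target_se=0.3):
    """Administer items adaptively until a length or precision limit is reached.

    `answer` is a callable mapping an item index to a 0/1 response; both
    stopping values are illustrative defaults, not recommendations.
    """
    responses = {}
    theta, se = eap(responses)
    while len(responses) < max_items and se > target_se:
        i = next_item(theta, responses)
        responses[i] = answer(i)
        theta, se = eap(responses)
    return theta, se, list(responses)
```

Calling `run_cat(lambda i: 1)` simulates a respondent who endorses every item; the returned list shows which items were chosen and in what order. Health outcome items are usually polytomous, so an operational CAT would replace the 2PL functions with a polytomous model such as the graded response or generalized partial credit model.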


Acknowledgements

This paper builds upon presentations by the authors at the conference: Advances in Health Outcomes Measurement: Exploring the Current State and the Future of Item Response Theory, Item Banks, and Computer-Adaptive Testing, Bethesda, MD, June, 2004. This work was supported in part by a grant from the Small Business Innovation Research Program of the National Institute of Neurological Disorders and Stroke, under grant title Computerized Adaptive Assessment of Headache Impact (grant no. 1R43NS047763-01) and in part by the National Institutes of Health through the NIH Roadmap for Medical Research Grant (AG015815), PROMIS Project. The authors would like to thank Howard Wainer of the National Board of Medical Examiners and three anonymous reviewers for comments on a previous version of the paper.

Author information

Corresponding author

Correspondence to Jakob Bue Bjorner.

About this article

Cite this article

Bjorner, J.B., Chang, CH., Thissen, D. et al. Developing tailored instruments: item banking and computerized adaptive assessment. Qual Life Res 16 (Suppl 1), 95–108 (2007). https://doi.org/10.1007/s11136-007-9168-6

  • DOI: https://doi.org/10.1007/s11136-007-9168-6
