Abstract
Item banks and Computerized Adaptive Testing (CAT) have the potential to greatly improve the assessment of health outcomes. This review describes the unique features of item banks and CAT and discusses how to develop item banks. In CAT, a computer selects the items from an item bank that are most relevant for and informative about the particular respondent, thus optimizing test relevance and precision. Item response theory (IRT) provides the foundation for selecting the items that are most informative for the particular respondent and for scoring responses on a common metric. The development of an item bank is a multi-stage process that requires a clear definition of the construct to be measured, good items, a careful psychometric analysis of the items, and a clear specification of the final CAT. The psychometric analysis needs to evaluate whether the assumptions of the IRT model, such as unidimensionality and local independence, hold; whether the items function the same way in different subgroups of the population; and whether there is an adequate fit between the data and the chosen item response models. Also, interpretation guidelines need to be established to support the clinical application of the assessment. Although medical research can draw upon expertise from educational testing in the development of item banks and CAT, the medical field also encounters unique opportunities and challenges.
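The item selection and scoring steps summarized above can be illustrated with a minimal sketch. It rests on several simplifying assumptions that do not come from this paper: a dichotomous two-parameter logistic (2PL) model rather than a polytomous model, maximum-information item selection, expected a posteriori (EAP) scoring with a standard normal prior on a fixed grid, and a small hypothetical item bank whose names and parameter values are invented for illustration.

```python
import math

# Hypothetical item bank (illustrative values only):
# item name -> (discrimination a, location b) under a 2PL model.
ITEM_BANK = {
    "item_1": (1.2, -1.5),
    "item_2": (1.8, 0.0),
    "item_3": (0.9, 0.5),
    "item_4": (2.0, 1.5),
}


def prob_endorse(theta, a, b):
    """2PL probability of endorsing an item at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = prob_endorse(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(theta, administered):
    """Pick the not-yet-administered item with maximum information at theta."""
    candidates = [name for name in ITEM_BANK if name not in administered]
    return max(candidates, key=lambda name: item_information(theta, *ITEM_BANK[name]))


def eap_estimate(responses, grid_points=81):
    """EAP score and posterior SD from responses (item name -> 0 or 1).

    Uses an unnormalized standard normal prior on an evenly spaced
    grid from -4 to 4.
    """
    grid = [-4.0 + 8.0 * i / (grid_points - 1) for i in range(grid_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # prior density up to a constant
        for name, x in responses.items():
            p = prob_endorse(theta, *ITEM_BANK[name])
            w *= p if x == 1 else (1.0 - p)  # likelihood of the observed response
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)
```

In a CAT loop, one would alternate `select_next_item` and `eap_estimate` until the posterior standard deviation falls below a chosen threshold or a maximum test length is reached. The stopping rule itself is a design decision that belongs to the specification of the final CAT.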
Acknowledgements
This paper builds upon presentations by the authors at the conference: Advances in Health Outcomes Measurement: Exploring the Current State and the Future of Item Response Theory, Item Banks, and Computer-Adaptive Testing, Bethesda, MD, June, 2004. This work was supported in part by a grant from the Small Business Innovation Research Program of the National Institute of Neurological Disorders and Stroke, under grant title Computerized Adaptive Assessment of Headache Impact (grant no. 1R43NS047763-01) and in part by the National Institutes of Health through the NIH Roadmap for Medical Research Grant (AG015815), PROMIS Project. The authors would like to thank Howard Wainer of the National Board of Medical Examiners and three anonymous reviewers for comments on a previous version of the paper.
Cite this article
Bjorner, J.B., Chang, CH., Thissen, D. et al. Developing tailored instruments: item banking and computerized adaptive assessment. Qual Life Res 16 (Suppl 1), 95–108 (2007). https://doi.org/10.1007/s11136-007-9168-6
DOI: https://doi.org/10.1007/s11136-007-9168-6