Two important goals when using questionnaires are (a) measurement: the questionnaire is constructed to assign numerical values that accurately represent the test taker's attribute, and (b) prediction: the questionnaire is constructed to give an accurate forecast of an external criterion. Construction methods aimed at measurement prescribe that items should be reliable. In practice, this leads to questionnaires with high inter-item correlations. By contrast, construction methods aimed at prediction typically prescribe that items have a high correlation with the criterion and low inter-item correlations. The latter approach has often been said to produce a paradox concerning the relation between reliability and validity [1–3], because it is often assumed that good measurement is a prerequisite for good prediction.
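The assumption that good measurement is a prerequisite for good prediction is commonly motivated by the classical test theory bound on criterion validity: the correlation between a test score X and a criterion Y cannot exceed the square root of the reliability of X [2],

```latex
r_{XY} \le \sqrt{r_{XX'}}
```

The paradox arises because raising inter-item correlations only raises this upper bound on validity; the validity actually attained can decrease at the same time, since highly correlated items carry largely redundant information about the criterion.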
The purpose of this study was to answer four questions: (1) Why are measurement-based methods suboptimal for questionnaires that are used for prediction? (2) How should one construct a questionnaire that is used for prediction? (3) Do questionnaire-construction methods that optimize measurement and prediction lead to the selection of different items in the questionnaire? (4) Is it possible to construct a questionnaire that can be used for both measurement and prediction?
An empirical data set consisting of the scores of 242 respondents on questionnaire items measuring mental health was used to select items by means of two methods: a method that optimizes the predictive value of the scale (i.e., forecasting a clinical diagnosis), and a method that optimizes the reliability of the scale. We show that the two methods select different sets of items, and that a scale constructed to meet one goal does not perform optimally with respect to the other goal.
The answers are as follows: (1) Because measurement-based methods tend to maximize inter-item correlations, which reduces predictive validity. (2) By selecting items that correlate highly with the criterion and weakly with the remaining items. (3) Yes, these methods may lead to different item selections. (4) For a single questionnaire: yes, but this is problematic because reliability cannot be estimated accurately. For a test battery: yes, but this is very costly. Implications for the construction of patient-reported outcome questionnaires are discussed.
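The contrast between the two construction goals can be illustrated with a small simulation. The sketch below (not the authors' procedure; the greedy forward selection, the toy data, and all names are assumptions of this illustration) selects items either by maximizing Cronbach's alpha of the scale or by maximizing the correlation of the unit-weighted sum score with an external criterion, and shows that the two objectives pick different items:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an n_persons x n_items score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def select_items(scores, criterion, n_select, goal):
    """Greedy forward selection of n_select items.

    goal="prediction":  each step adds the item that maximizes the
    correlation between the unit-weighted sum score and the criterion.
    goal="measurement": each step adds the item that maximizes
    Cronbach's alpha (the first item, for which alpha is undefined,
    is seeded by item-total correlation -- a convention of this sketch).
    """
    chosen, remaining = [], list(range(scores.shape[1]))
    total = scores.sum(axis=1)
    while len(chosen) < n_select:
        def objective(j):
            cols = scores[:, chosen + [j]]
            if goal == "prediction":
                return np.corrcoef(cols.sum(axis=1), criterion)[0, 1]
            if cols.shape[1] == 1:
                return np.corrcoef(scores[:, j], total)[0, 1]
            return cronbach_alpha(cols)
        best = max(remaining, key=objective)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: items 0-4 share a common factor (internally consistent but
# unrelated to the criterion); items 5-9 are mutually uncorrelated but
# each contributes to the criterion.
rng = np.random.default_rng(0)
n = 500
factor = rng.normal(size=n)
consistent = 0.9 * factor[:, None] + 0.4 * rng.normal(size=(n, 5))
unique = rng.normal(size=(n, 5))
criterion = unique.sum(axis=1) + 0.5 * rng.normal(size=n)
scores = np.hstack([consistent, unique])

meas_set = select_items(scores, criterion, 3, goal="measurement")
pred_set = select_items(scores, criterion, 3, goal="prediction")
```

With this data-generating process, the measurement criterion selects the internally consistent items while the prediction criterion selects the mutually uncorrelated, criterion-related items, mirroring the divergence reported for the empirical data.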
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum.
Ravens-Sieberer, U., Herdman, M., Devine, J., Otto, C., Bullinger, M., Rose, M., et al. (2014). The European KIDSCREEN approach to measure quality of life and well-being in children: Development, current application, and future advances. Quality of Life Research, 23(3), 791–803. doi: 10.1007/s11136-013-0428-3.
Pepe, M. S. (2003). The statistical evaluation of medical tests for classification and prediction. Oxford: Oxford University Press.
Food and Drug Administration. (2009). Patient-reported outcome measures: use in medical product development to support labeling claims. Guidance for industry, US Department of Health and Human Services.
Foster, C. B., Gorga, D., Padial, C., Feretti, A. M., Berenson, D., Kline, R., et al. (2004). The development and validation of a screening instrument to identify hospitalized medical patients in need of early functional rehabilitation assessment. Quality of Life Research, 13(6), 1099–1108. doi: 10.1023/B:QURE.0000031346.27185.8f.
De Vet, H. C. W., Terwee, C. B., Mokkink, L. B., & Knol, D. L. (2011). Measurement in medicine: A practical guide. Cambridge: Cambridge University Press.
Fayers, P. M., & Machin, D. (2015). Quality of life: The assessment, analysis and reporting of patient-reported outcomes. New York: Wiley.
Johnson, C., Aaronson, N., Blazeby, J. M., Bottomley, A., Fayers, P., Koller, M., et al. (2011). Guidelines for developing questionnaire modules (4th ed.). Belgium: EORTC Quality of Life Group.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.
Kim, J.-O., & Mueller, C. W. (1978). Factor analysis: Statistical methods and practical issues. Beverly Hills, CA: SAGE Publications.
Embretson, S., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
Reeve, B. B., Hays, R. D., Bjorner, J. B., Cook, K. F., Crane, P. K., Teresi, J. A., et al. (2007). Psychometric evaluation and calibration of health-related quality of life item banks: Plans for the patient-reported outcomes measurement information system (PROMIS). Medical Care, 45, S22–S31.
Guttman, L. (1941). An outline of the statistical theory of prediction. In P. Horst et al. (Eds.), The prediction of personal adjustment (Supplementary study B-1). New York: Social Science Research Council.
Guttman, L. (1971). Measurement as structural theory. Psychometrika, 36(4), 329–347.
Finkelman, M. D., Smits, N., Kulich, R. J., Zacharoff, K. L., Magnuson, B. E., Chang, H., et al. (2016). Development of short-form versions of the screener and opioid assessment for patients with pain-revised (SOAPP-R): A proof-of-principle study. Pain Medicine, 18, 1292–1302. doi: 10.1093/pm/pnw210.
Lin, A., Yung, A. R., Wigman, J. T. W., Killackey, E., Baksheev, G., & Wardenaar, K. J. (2014). Validation of a short adaptation of the mood and anxiety symptoms questionnaire (MASQ) in adolescents and young adults. Psychiatry Research, 215(3), 778–783. doi: 10.1016/j.psychres.2013.12.018.
Crocker, L. M., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Holt, Rinehart and Winston.
Mellenbergh, G. J. (2011). A conceptual introduction to psychometrics: Development, analysis and application of psychological and educational tests. The Hague: Eleven International Publishing.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(3), 635–694.
Landsheer, J. A., & Boeije, H. R. (2008). In search of content validity: Facet analysis as a qualitative method to improve questionnaire design. Quality & Quantity, 44, 59.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: Data mining, inference and prediction (2nd ed.). New York: Springer.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory. New York: McGraw-Hill.
Oosterwijk, P. R., van der Ark, L. A., & Sijtsma, K. (2017). Using confidence intervals for assessing reliability of real tests. Assessment. Advance online publication. doi: 10.1177/1073191117737375.
Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions (2nd ed.). Urbana: University of Illinois Press.
Sheehan, D. V., Lecrubier, Y., Sheehan, K. H., Amorim, P., Janavs, J., Weiller, E., et al. (1998). The mini-international neuropsychiatric interview (MINI): The development and validation of a structured diagnostic psychiatric interview for DSM-IV and ICD-10. Journal of Clinical Psychiatry, 59(suppl 20), 22–57.
Evers, A., Hagemeister, C., Höstmælingen, A., Lindley, P., Muñiz, J., & Sjöberg. (2013). EFPA review model for the description and evaluation of psychological and educational tests. Test review form and notes for reviewers, European Federation of Psychologists' Associations.
Ten Berge, J. M. F. (2005). Tau-equivalent and congeneric measurements. Wiley StatsRef: Statistics Reference Online.
Windle, C. (1954). Test-retest effect on personality questionnaires. Educational and Psychological Measurement, 14(4), 617–636.
Raykov, T., & Shrout, P. E. (2002). Reliability of scales with general structure: Point and interval estimation using a structural equation modeling approach. Structural Equation Modeling, 9(2), 195–212.
van der Ark, L. A., van der Palm, D. W., & Sijtsma, K. (2011). A latent class approach to estimating test-score reliability. Applied Psychological Measurement, 35(5), 380–392.
Cohen, R. J., Swerdlik, M. E., & Sturman, E. D. (2013). Psychological testing and assessment: An introduction to tests and measurement. New York: McGraw-Hill.
Revicki, D. A., Chen, W.-H., & Tucker, C. (2015). Developing item banks for patient-reported health outcomes. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 334–363). New York, NY: Routledge.
Zijlmans, E. A. O., Tijmstra, J., van der Ark, L. A., & Sijtsma, K. (2017). Item-score reliability in empirical-data sets and its relationship with other item indices. Educational and Psychological Measurement. Advance online publication. doi: 10.1177/0013164417728358.
Travers, R. M. W. (1951). Rational hypotheses in the construction of tests. Educational and Psychological Measurement, 11(1), 128–137.
Ware, J. E., & Sherbourne, C. D. (1992). The MOS 36-item short-form health survey (SF-36): I. Conceptual framework and item selection. Medical Care, 30(6), 473–483.
Hand, D. J. (1987). Screening vs prevalence estimation. Journal of the Royal Statistical Society. Series C (Applied Statistics), 36(1), 1–7.
Kroenke, K., & Spitzer, R. L. (2002). The PHQ-9: A new depression diagnostic and severity measure. Psychiatric Annals, 32(9), 509–515.
Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27–48.
Cronbach, L. J., & Shavelson, R. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64(3), 391–418.
Measurement versus prediction in the construction of patient-reported outcome questionnaires: Can we have our cake and eat it?
L. Andries van der Ark & Judith M. Conijn
Springer International Publishing