
Why assessment in medical education needs a solid foundation in modern test theory

  • Reflections
  • Published in Advances in Health Sciences Education

Abstract

Despite the frequent use of state-of-the-art psychometric models in medical education, a growing body of literature questions their usefulness in the assessment of medical competence. Essentially, a number of authors have raised doubts about the appropriateness of psychometric models as a guiding framework for securing and refining current approaches to the assessment of medical competence. A phenomenon known as case specificity is central to this controversy. Broadly speaking, case specificity is the finding that performance is unstable across clinical cases, tasks, or problems. Because stability of performance is, generally speaking, a central assumption of psychometric models, case specificity may limit their applicability, and it has supplied critics of psychometrics with a substantial body of apparent empirical evidence. This article aims to explain the fundamental ideas employed in psychometric theory and how they might be problematic in the context of assessing medical competence. We further aim to show why and how some critiques do not hold for the field of psychometrics as a whole, but only for specific psychometric approaches. We therefore highlight approaches that, from our perspective, offer promising possibilities when applied to the assessment of medical competence. In conclusion, we advocate a more differentiated view of psychometric models and their use.
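The core of the case-specificity finding can be illustrated with a toy simulation. The sketch below is ours, not the R scripts mentioned in the notes; the variance values are arbitrary assumptions, chosen so that the person-by-case term dominates the stable-ability term:

```python
import random
import statistics

random.seed(1)

N_PERSONS, N_CASES = 200, 8

# Toy variance-components view of an OSCE-like assessment:
#   score = person ability + case difficulty + person-by-case noise.
# A person-by-case term that dwarfs the ability term is one way to
# represent "case specificity" (all variances here are illustrative).
ability = [random.gauss(0, 1.0) for _ in range(N_PERSONS)]
difficulty = [random.gauss(0, 0.5) for _ in range(N_CASES)]
scores = [[ability[p] + difficulty[c] + random.gauss(0, 1.5)
           for c in range(N_CASES)] for p in range(N_PERSONS)]

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((len(x) - 1) * sx * sy)

# Average correlation between performances on two different cases.
# Its expected value is var(ability) / (var(ability) + var(noise))
# = 1 / (1 + 1.5**2) ~= 0.31: rankings on one case say little about
# rankings on another, even though everyone has a stable "ability".
pair_corrs = [corr([row[c1] for row in scores], [row[c2] for row in scores])
              for c1 in range(N_CASES) for c2 in range(c1 + 1, N_CASES)]
mean_corr = statistics.fmean(pair_corrs)
print(f"mean inter-case correlation: {mean_corr:.2f}")
```

Under these assumed variances, the observed inter-case correlations are low, which is precisely the pattern that generalizability studies of clinical performance report as limited generalizability across cases.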


Fig. 1
Fig. 2


Notes

  1. R scripts for this simulation are available upon request from the corresponding author.



Author information


Corresponding author

Correspondence to Stefan K. Schauber.


About this article


Cite this article

Schauber, S.K., Hecht, M. & Nouns, Z.M. Why assessment in medical education needs a solid foundation in modern test theory. Adv in Health Sci Educ 23, 217–232 (2018). https://doi.org/10.1007/s10459-017-9771-4
