Review article
Using generalizability theory for reliable learning assessments in pharmacy education

https://doi.org/10.1016/j.cptl.2014.12.003

Abstract

The value of conducting good assessment of learning is an increasingly prominent focus in pharmacy education. Having a framework for understanding learning assessments and recognizing the sources of error that contribute to unreliability in measurement are initial steps toward designing more reliable learning assessments in pharmacy education. In this article, we provide a primer on generalizability theory (G-theory), a widely accepted psychometric model used within higher education, and present original empirical findings applying G-theory to data from classroom and laboratory pharmacy education as examples. In example 1, we illustrate how the reliability of didactic course grades is affected by the length and number of examinations (i.e., more testing occasions). Our results show that a high level of reliability can be achieved with fewer total questions spread over more occasions of testing. In example 2, we demonstrate how G-theory can be used to establish the reliability of a drug information task in a laboratory-based course. Results reveal that, once again, using more occasions improves the reliability of performance assessments. We discuss how the results can be used to begin revising a rater-scoring instrument to improve reliability. This G-theory framework and the worked examples provide a clear path forward for pharmacy educators to consider when developing learning assessments.

Introduction

Testing with multiple-choice questions is widely used because multiple-choice questions can be scored objectively and quickly; they are also often assumed to produce higher reliability than other testing methods such as open-ended and short- or long-answer tests.1, 2 Despite these advantages, concerns remain that testing with multiple-choice examinations alone does not adequately assess the full scope of studentsʼ abilities developed in pharmacy education. Millerʼs3 pyramid illustrates the need to include other types of learning assessments beyond multiple-choice testing to assess higher-level learning objectives. Figure 1, adapted to pharmacy education, illustrates how a classroom courseʼs multiple-choice testing can be used to assess the “knows” and “knows how” levels of the pyramid if case-based (i.e., application-based), while alternate methods such as a pharmacy practice laboratoryʼs performance-based learning assessments are needed to evaluate “shows how” ability-based outcomes.

Colleges/schools of pharmacy have recognized a need for more “shows how” learning assessments in their programs and have begun to implement more performance-based learning assessments as a result. While many of the principles that apply to building reliable multiple-choice learning assessments also apply to these types of learning assessments, there are new variables that must be considered as well. This article highlights some of these variables using specific examples that should resonate with what many pharmacy educators experience in their own programs.

We begin by discussing some of the fundamentals of validity and reliability that apply to all types of learning assessments. While others have covered these fundamentals in more depth,4, 5 a brief summary is provided here to form a basis for the specific information that follows. Most importantly, validity is a unitary concept (i.e., there is only one validity), and the language of validity concerns the “validity of conclusions” drawn from proposed or intended uses of assessment scores. Rather than stating that a learning assessment is simply “valid” or “invalid,” we need to justify our uses of test scores by generating multiple sources of validity evidence for our specific assessment context.4 A number of sources of evidence can be brought forward to reach a validity conclusion about the use of a score, and multiple sources make for a stronger validity argument. Content and reliability are important sources of validity evidence to consider with every learning assessment5; aligning learning assessment content to course objectives is key to content evidence, while optimizing reliability is essential for each learning assessment.

We do not mean to minimize or marginalize sources of validity evidence beyond reliability, but focusing on reliability is imperative for evaluative learning assessments. For summative assessments, reliability is necessary but, on its own, insufficient evidence for validity; strong reliability is required, yet validity evidence beyond reliability alone is still needed.5 With that said, we turn our attention to reliability and how to improve it within pharmacy education learning assessments. To best improve reliability, we suggest adopting the perspective of a generalizability theory (G-theory) framework.

Overview of generalizability theory

Achieving a high level of reliability is a testing standard for any summative assessment of student learning5, 6 and is an ethical imperative.7, 8 As a notable advanced psychometric model, G-theory provides a conceptual framework to understand and account for the multiple variables that impact reliability. To that end, G-theory is a widely accepted psychometric model used within many higher education settings that employ complex assessment methods to quantify student learning.9, 10, 11, 12
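To make these concepts concrete, consider the simplest G-study design: persons crossed with items (p × i), with one score per person-item cell. The following sketch, written in Python with simulated data (not data from our studies; all sample sizes and effect magnitudes are illustrative assumptions), estimates the variance components from the expected mean squares of a two-way ANOVA and computes the generalizability (relative) and dependability (absolute) coefficients.

```python
import numpy as np

# One-facet G-study sketch: persons (p) crossed with items (i), one score
# per person-item cell. Variance components are estimated from the
# expected mean squares of a two-way ANOVA. Data are simulated for
# illustration only.
rng = np.random.default_rng(42)
n_p, n_i = 100, 40                       # persons, items (assumed)
person = rng.normal(0, 1.0, (n_p, 1))    # person (object of measurement) effects
item = rng.normal(0, 0.5, (1, n_i))      # item difficulty effects
resid = rng.normal(0, 1.2, (n_p, n_i))   # person-by-item interaction + error
X = person + item + resid                # observed score matrix

grand = X.mean()
p_mean = X.mean(axis=1, keepdims=True)
i_mean = X.mean(axis=0, keepdims=True)

ms_p = n_i * ((p_mean - grand) ** 2).sum() / (n_p - 1)
ms_i = n_p * ((i_mean - grand) ** 2).sum() / (n_i - 1)
ms_pi = ((X - p_mean - i_mean + grand) ** 2).sum() / ((n_p - 1) * (n_i - 1))

# Solve the expected-mean-square equations for the variance components.
var_p = (ms_p - ms_pi) / n_i    # universe-score (person) variance
var_i = (ms_i - ms_pi) / n_p    # item variance
var_pi = ms_pi                  # person-by-item interaction + error

# Reliability for a measurement based on n_i items: the generalizability
# coefficient counts only relative error; the dependability coefficient
# also counts item variance against absolute decisions.
g_coef = var_p / (var_p + var_pi / n_i)
phi = var_p / (var_p + (var_i + var_pi) / n_i)
print(f"var_p={var_p:.3f}  var_i={var_i:.3f}  var_pi,e={var_pi:.3f}")
print(f"G (relative)={g_coef:.3f}  Phi (absolute)={phi:.3f}")
```

The essential move is the partition of observed-score variance into person, item, and residual components; reliability is then the share of variance attributable to the persons being measured.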

Classroom testing and grading: Study A

In many pharmacy education courses students are tested on multiple occasions using a series of examinations that are combined together to produce a studentʼs single, overall course grade. One common course-level approach is to have two examinations during the semester and a cumulative final examination, with all examinations contributing to a studentʼs overall course grade. With this type of testing plan, there are two main variables that contribute to error in an overall course grade: the number of questions on each examination and the number of testing occasions.
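A decision study (D-study) takes variance components from a G-study and projects reliability under alternative testing plans. The sketch below assumes a persons × (items-within-occasions) design and uses hypothetical variance components (not the estimates from Study A) to compare plans that distribute questions across different numbers of examinations.

```python
# Hypothetical D-study for a persons x (items:occasions) design: how does
# course-grade reliability change as questions are spread across more
# examinations? The variance components below are assumed for
# illustration; in practice they come from a G-study of actual course data.
var_p = 0.40      # universe-score (person) variance
var_po = 0.10     # person-by-occasion interaction
var_pi_o = 1.00   # person-by-item-within-occasion interaction + error

def g_coefficient(n_occasions: int, n_items: int) -> float:
    """Projected generalizability coefficient for relative decisions."""
    rel_error = var_po / n_occasions + var_pi_o / (n_occasions * n_items)
    return var_p / (var_p + rel_error)

for n_o, n_i in [(2, 60), (3, 40), (4, 30), (6, 20), (4, 20)]:
    print(f"{n_o} exams x {n_i} items (total {n_o * n_i:3d}): "
          f"G = {g_coefficient(n_o, n_i):.3f}")
```

Under these assumed components, four examinations of 20 questions (80 questions in total) project a higher generalizability coefficient than two examinations of 60 questions (120 in total), illustrating the pattern noted above: fewer total questions spread over more occasions can yield higher reliability.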

Going forward

The literature and our results are clear: assessing students more often results in improved reliability.1, 8, 16, 19 In the absence of G-theory, it is hard to know how many items, raters, or occasions are needed to achieve optimal levels of reliability. It is ultimately up to individual educators to decide how to proceed given their particular set of circumstances. By employing a G-theory perspective, at the very least, they should be better equipped to make those decisions.
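As one illustration of such decision making, the sketch below projects reliability for a fully crossed persons × raters × occasions performance assessment, similar in spirit to Study B. The variance components are again hypothetical; the question posed is the practical one an educator faces: given finite resources, is a second rater or a second occasion the better investment?

```python
# Hypothetical D-study for a fully crossed persons x raters x occasions
# performance assessment. Variance components are assumed for illustration.
var_p = 0.30     # universe-score (person) variance
var_pr = 0.05    # person-by-rater interaction (rater inconsistency)
var_po = 0.12    # person-by-occasion interaction (task-to-task variation)
var_pro = 0.20   # three-way interaction + residual error

def g_coefficient(n_r: int, n_o: int) -> float:
    """Projected generalizability coefficient for relative decisions."""
    rel_error = var_pr / n_r + var_po / n_o + var_pro / (n_r * n_o)
    return var_p / (var_p + rel_error)

for n_r, n_o in [(1, 1), (2, 1), (1, 2), (2, 2), (2, 3)]:
    print(f"{n_r} rater(s) x {n_o} occasion(s): G = {g_coefficient(n_r, n_o):.3f}")
```

With these assumed components, adding a second occasion raises the projected coefficient more than adding a second rater, consistent with the finding that more occasions improve the reliability of performance assessments.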

Conclusion

Having a conceptual framework to understand how sources of error contribute to unreliability in measurement is a first step toward designing more reliable assessments in pharmacy education. We described how a G-theory framework models the reliability of course grades formed by combining multiple examination scores. We also showed how G-theory models the reliability of performance assessments that measure multiple aspects of a rubric across multiple occasions of a performance. The framework and worked examples provide a clear path forward for pharmacy educators to consider when developing learning assessments.

Author contributions

Both M.J.P. and M.K.C. reviewed the literature, drafted sections of this article, and critically reviewed the manuscript. M.K.C. accrued the data and performed the G-theory analyses for Study A and Study B. Both authors approved the final version for submission and accept overall responsibility for the accuracy of the data, its analysis, and this entire review.

References (29)

  • R.L. Brennan. Generalizability Theory (2001)

  • D.L. Streiner et al. Health Measurement Scales (2008)

  • R. Bloch et al. Generalizability theory for the perplexed: a practical introduction and guide: AMEE guide no. 68. Med Teach (2012)

  • E.H. Haertel. Reliability. In: Educational Measurement (2006)

    Note: Contents of this manuscript were presented as a Special Session during the 2014 Annual Meeting of the American Association of Colleges of Pharmacy in Grapevine, TX.