Learning bronchoscopy in the clinical setting promotes learner anxiety, exposes patients to the burden of procedure-related education [1], and results in a highly variable learning experience [2]. Clinical responsibilities often interfere with reading bronchoscopy-related material, and, in the absence of periodic assessments of bronchoscopic knowledge, trainees are unlikely to be compliant with educational endeavors they perceive as optional, especially if there are no pass/fail grading consequences [3]. Furthermore, the current bronchoscopy learning environment is less than ideal for beginners because of concerns for patient safety, fiscal constraints, and an increasing impetus to document procedural competency [4, 5].

Short postgraduate courses comprising lectures and simulation-based hands-on instruction have thus become a popular means for enhancing procedure-related learning [6, 7]. In accordance with continuing medical education (CME) guidelines, these programs identify learner objectives and provide opportunities for feedback from students regarding the perceived quality of the course. Yet, to our knowledge, no study has objectively measured how much knowledge and skill students actually gain as a result of participating in such a program.

It is generally difficult to prove that a learning gain has occurred as a result of an educational intervention. This difficulty stems partly from controversies regarding the comparative use of pre- and post-test assessments and partly from the problem of constituting control groups against which an educational intervention can be compared [8–11]. Causality is also difficult to prove because of the confounding effects of retention and decay, normal maturation, and possible ongoing training [12].

Many educators in biological science, engineering, astronomy, mathematics, and physics have turned to class-average normalized gain and related metrics to gauge a course's effectiveness [13–16]. Class-average normalized gain, written 〈g〉, measures a group's actual improvement as a fraction of the maximum improvement achievable, that is, the pre-test/post-test gain divided by the maximum possible gain [17]. Educators use this measure of performance to diminish the confounding effects of pre-course knowledge and other baseline group characteristics, thereby decreasing the need for a control group [18].
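As a purely illustrative example with hypothetical numbers (not data from this study), a class that averages 40% on the pre-test and 70% on the post-test has realized half of the improvement still available to it:

$$ \langle g \rangle = \frac{70\% - 40\%}{100\% - 40\%} = 0.5 $$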

In this study we used a pre-test/post-test model to assess the effectiveness of a one-day introductory bronchoscopy course curriculum. We demonstrated how various metrics of learning, including class-average normalized gain, can measure the acquisition of a procedural skill such as bronchoscopy. We hypothesized that the educational intervention would result in significant class and individual student gains in bronchoscopic cognitive knowledge and technical skill.

Methods

Participants

Twenty-four first-year pulmonary and critical care fellows (from now on referred to as students) and four program directors from eight training institutions in southern California participated in this study.

Course curriculum

A one-day curriculum comprising educational content, learner assessments, and teaching guides [19] was designed. It incorporated multiple components (visual, auditory, verbal, role-playing, and analytic) in order to create intentional redundancy (see supplementary Appendix 1). Educational content was composed in modular fashion using different educational media and techniques [20]; these included ten classroom-based didactic lectures, three interactive audience-participation question-and-answer (Q/A) sessions, four small-group hands-on technical skill workshops using low- and high-fidelity simulation technology, and one clinical problem-solving case scenario. Students received a syllabus containing the lectures and learning objectives, and a CD-ROM of the web-based Essential Bronchoscopist© and Bronchoscopy Step-by-Step© exercises [21].

Lectures were structured to reinforce incremental learning using a combination of didactic, interactive, Q/A, and anonymous audience participation techniques. For the small-group learning sessions, the 24 students were randomly divided into five groups that rotated through four technical skill stations (airway anatomy and bronchoscopic inspection, endobronchial brushing and biopsy, transbronchial needle aspiration, and emergent bronchoscopic intubation) and one case scenario, spending 30 min at each station (2.5 h total). Specific learning objectives were designed to keep the students mentally and physically engaged so as to achieve “hands-on and heads-on” learning [13]. Teaching guides were distributed to instructors so that they could function as professional learning coaches rather than as experts merely stating facts and opinions [22].

Cognitive knowledge assessments

A multiple-choice question (MCQ) test of cognitive knowledge was developed by the authors. It comprised 40 items evenly divided into a pre-test and a post-test of 20 questions each (maximum score 20 per test). Three of the authors (HC, MD, SM) had each written 25 Q/A sets based on information in the web-based Essential Bronchoscopist©; these items had been judged in a prior study [23] to be necessary or absolutely necessary for a test of bronchoscopic knowledge. The resulting 75 questions were pilot-tested on a group of seven second- and third-year University of California Irvine (UCI) pulmonary and critical care fellows, and questions with extreme difficulty indices (ρ > 80% or ρ < 20% correct response rate) were eliminated. The difficulty index ρ of a test item is the proportion of test-takers who answer it correctly; hence ρ = 10% denotes a very difficult question and ρ = 90% a very easy one. Questions that covered very similar material or had controversial answers were also eliminated.
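The item analysis described above can be summarized in a short sketch. The following Python fragment is illustrative only (the authors' actual processing is not published here); `pilot_responses`, a hypothetical mapping from item identifier to the pilot fellows' correct/incorrect responses, and the exact boundary handling are our assumptions.

```python
# Illustrative sketch of the item analysis: compute each item's difficulty
# index (rho, the proportion of pilot test-takers answering correctly) and
# drop items that are too easy or too hard.

def difficulty_index(responses):
    """Proportion of test-takers who answered the item correctly."""
    return sum(responses) / len(responses)

def retain_items(pilot_responses, low=0.20, high=0.80):
    """Keep items with 0.20 < rho <= 0.80; the boundaries are assumptions
    chosen to match the difficulty bands quoted in the text."""
    kept = {}
    for item_id, responses in pilot_responses.items():
        rho = difficulty_index(responses)
        if low < rho <= high:  # drops rho > 80% (too easy) and rho < 20% (too hard)
            kept[item_id] = rho
    return kept
```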

The 40 best questions were then assigned semirandomly to 24 sets of two tests (items 1–20 for the pre-test and items 21–40 for the post-test). Using a random numbers table, the items were randomly ordered from 1 to 40, 24 consecutive times, once for each of the 24 students. Each set of items 1–20 and 21–40 was reshuffled to assure that each student's pre-test and post-test contained an equal number of difficult (0.20 < ρ ≤ 0.40), intermediate (0.40 < ρ ≤ 0.60), and easy (0.60 < ρ ≤ 0.80) questions. Tests were also normalized for topic-specific questions; for example, of four questions on bronchoalveolar lavage (BAL), two appeared in the first 20 items and two in the second 20 for each of the 24 students. Through this semirandom process, each student received a different test, yet the tests were similar in difficulty and material covered. To assure that each student received the two halves of the same test, the tests were marked with random two-letter codes (AA, BB, CC, etc.) printed on each page. The same code was used on the students' identity badges and skill test sheets, assuring anonymous processing of all data.
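As an illustration of this semirandom assembly, the sketch below (hypothetical code, not the authors' implementation) deals each topic's items evenly between the two halves and reshuffles until the difficulty strata match; items are assumed to be (id, rho, topic) tuples.

```python
import random

def band(rho):
    """Difficulty strata from the text; retained items satisfy 0.20 < rho <= 0.80."""
    if rho <= 0.40:
        return "difficult"      # 0.20 < rho <= 0.40
    if rho <= 0.60:
        return "intermediate"   # 0.40 < rho <= 0.60
    return "easy"               # 0.60 < rho <= 0.80

def strata_match(pre, post):
    """True when both halves hold the same number of items per stratum."""
    tally = lambda half: sorted(band(rho) for _, rho, _ in half)
    return tally(pre) == tally(post)

def assemble_test(items, rng):
    """Split 40 items into a 20-item pre-test and a 20-item post-test,
    dividing each topic evenly (e.g., 2 of 4 BAL items per half) and
    reshuffling until the difficulty strata balance. Assumes a balanced
    deal is possible for the given item pool."""
    by_topic = {}
    for item in items:
        by_topic.setdefault(item[2], []).append(item)
    while True:
        pre, post = [], []
        for topic_items in by_topic.values():
            rng.shuffle(topic_items)
            half = len(topic_items) // 2
            pre.extend(topic_items[:half])
            post.extend(topic_items[half:])
        if strata_match(pre, post):
            return pre, post

rng = random.Random()
# One pre-test/post-test pair per student code, e.g.:
# tests = {code: assemble_test(items, rng) for code in ("AA", "BB", "CC")}
```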

Technical skills assessments

An abbreviated version of the Bronchoscopic Skills and Tasks Assessment Tool (BSTAT©; Fig. 1) [24] was used to test individuals on a low-fidelity airway model (see supplementary Appendix 2). Students were asked to navigate the bronchoscope from the larynx to two designated segments (superior segment, left lower lobe; mediobasal segment, right lower lobe). After delivery of uniform instructions, and with one-on-one supervision from an independent instructor, students were allowed up to 45 s for each segment and were scored on five criteria: time, anatomic recognition, precision, economy of movement, and posture/hand position. To minimize inter-rater variability, the four instructors/testers performing the assessments had practiced scoring during a prior 2-h session to assure uniform definitions and parameters for each criterion being tested. At the end of that 2-h session, the last three test samples were scored almost identically by all four testers.

Fig. 1 Bronchoscopy technical skills pre-test and post-test score sheet

Study protocol

Immediately following on-site registration, each student was asked to complete a short questionnaire about prior experience and perceptions regarding bronchoscopy and simulation. Individual students then underwent technical skill pre-testing, followed by administration of the multiple-choice pre-test (20 min was allowed, although no one needed the allotted time). During the day, students and program directors were asked to rate the educational content and quality of each learning session using a Likert-scale survey instrument. Post-testing was done at the end of the day after course completion. While technical skill post-testing was performed identically, the written post-test consisted of the second half of each student’s specific MCQ test (e.g., student GG received GG-1–20 as pre-test and GG-21–40 as post-test).

Three months after the educational intervention, a four-item questionnaire was emailed to each student, and an independent psychologist research interviewer conducted telephone interviews with the four program directors to inquire about the perceived impact of the course on student performance, perceived weaknesses of the course, and whether the program directors would support similar courses in the future.

Outcome assessments

We hypothesized that participation in this educational intervention 3 months into subspecialty training would result in significant class and individual student learning gains in cognitive knowledge and technical skill. For the purpose of this study, curricular effectiveness was defined as the extent to which the program produced learning gains for the group and for individual students, as demonstrated by several metrics of student performance.

A multiple-choice pre-test/post-test model of cognitive knowledge acquisition and a pre-test/post-test time-limited, low-fidelity simulation technical skill exercise were used to identify knowledge and skill gains. Pre-test and post-test scores were compared, and the effectiveness of the educational intervention for the group as a whole was determined against a predefined target of at least 0.3 for class-average normalized gain 〈g〉. This criterion is taken from Hake [13], whose 62-course 〈g〉 data suggested that 〈g〉 = 0.3 (30%) is a lower bound of what he designated as "medium" normalized gain. Individual single-student normalized gains (gi) were also calculated to identify variability between individuals, as was the average of the single-student normalized gains, g(ave).

To complement these objective measurements, the program’s perceived educational value was assessed by analyzing qualitative evaluations and Likert-scale survey measurements from each participant on-site and by reviewing results from post-course student questionnaires and recorded interviews of program directors 3 months after the educational intervention. Quantitative and qualitative analyses were performed on interview transcripts.

To determine the potential reproducibility of study results in a different cohort tested at a similar time period during the course of their subspecialty training (3 months into training), the same curriculum was delivered 1 year later to a fresh group of first-year pulmonary subspecialty trainees invited from the same institutions. Statistical analyses were performed on pre-test/post-test results using the same metrics. The study was considered exempt from UCI Institutional Review Board review.

Statistical methods

This was a quasi-experimental, one-group pre-test/post-test study designed to assess the learning gain produced by, and the effectiveness of, this single-day introductory bronchoscopy course curriculum [25]. A paired-samples t test with α = 0.05 was used to compare pre- and post-test scores, which were also plotted for descriptive purposes. Dissimilar pre-test and post-test questions were used to diminish the biasing practice effect of the pre-test on the post-test. Individual actual gains Gi (where Gi = post-test score − pre-test score) were tabulated in order to calculate percent absolute gain for the class (∆ = average Gi/maximum achievable score) and percent relative gain (C = average Gi/average pre-test score). As a measure of course effectiveness, the class-average normalized gain 〈g〉 was calculated: the average actual gain divided by the maximum possible gain, where G is the actual gain, 〈%post-test〉 and 〈%pre-test〉 are the final (post) and initial (pre) class averages, and the angle brackets 〈…〉 denote an average over the students taking the tests:

$$ \langle g \rangle = \langle \% G \rangle / \langle \% G \rangle_{\max} $$
$$ \langle g \rangle = \left[ \langle \% \text{post-test} \rangle - \langle \% \text{pre-test} \rangle \right] / \left[ 100\% - \langle \% \text{pre-test} \rangle \right] $$

A predefined target 〈g〉 of 30% was taken as defining the minimum value at which the educational intervention could be regarded as effective [13, 26].
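A minimal computational sketch of these class-level metrics follows (illustrative only; the function and variable names are ours, and scipy is assumed for the paired-samples t test described above):

```python
from statistics import mean
from scipy.stats import ttest_rel  # paired-samples t test

def class_metrics(pre, post, max_score):
    """Class-level gain metrics from per-student raw scores
    (max_score = 20 for cognitive tests, 16 for technical skill)."""
    gains = [b - a for a, b in zip(pre, post)]       # individual actual gains G_i
    pre_pct = 100 * mean(pre) / max_score            # <%pre-test>
    post_pct = 100 * mean(post) / max_score          # <%post-test>
    return {
        "absolute_gain_pct": 100 * mean(gains) / max_score,         # Delta
        "relative_gain_pct": 100 * mean(gains) / mean(pre),         # C
        "normalized_gain": (post_pct - pre_pct) / (100 - pre_pct),  # <g>
        "paired_t_p": ttest_rel(post, pre).pvalue,
    }
```

For example, with the class averages reported below (48% pre-test, 66% post-test), 〈g〉 = (66 − 48)/(100 − 48) ≈ 34%, above the predefined 30% target.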

In addition, individual single-student normalized gains (gi) were calculated for all students and averaged as:

$$ \text{g(ave)} = \left[ \sum_{i=1}^{N} g_i \right] / N $$

where N is the number of trainees taking both the pre- and post-tests. Because some post-test scores could be lower than pre-test scores (negative gain), g(ave) was also calculated after replacing negative gi values with zero and, separately, after deleting all negative-gain students. In physics education research, the first two averages, 〈g〉 and g(ave), have been found to be the same or within 5% of each other for N > 20, and this near equality is associated with a low correlation between the single-student gain gi and the single-student pre-test score [13].
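The same calculation can be sketched as follows (again illustrative rather than the authors' code; the three treatments of negative gains mirror those just described):

```python
def single_student_gains(pre_pct, post_pct):
    """g_i = (%post-test - %pre-test) / (100% - %pre-test) per student,
    from percentage scores (assumes no perfect pre-test score)."""
    return [(b - a) / (100 - a) for a, b in zip(pre_pct, post_pct)]

def g_ave(gains, negatives="keep"):
    """Average single-student normalized gain g(ave); negative gains may
    be kept, replaced by zero, or their students dropped entirely."""
    if negatives == "zero":
        gains = [max(g, 0.0) for g in gains]
    elif negatives == "drop":
        gains = [g for g in gains if g >= 0.0]
    return sum(gains) / len(gains)
```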

Results

Twenty-four of 25 eligible first-year pulmonary and critical care trainees and four program directors from eight training institutions in southern California participated in this study. The director from UCI was excluded to avoid obvious bias. Twenty-one students (88%) had assisted in 30 or fewer flexible bronchoscopies; the same proportion envisioned bronchoscopy as a strong component of their future career. Two thirds (16/24) of the students had been exposed to some form of medical simulation during residency training, of whom nine had previously used a bronchoscopy simulator. None of those exposed to medical simulation during residency training (0/16) reported a negative experience.

All 24 students took both the pre-test and the post-test (Fig. 2A, B). Mean test scores of cognitive knowledge improved significantly from 48% (9.6/20 ± 2.58) to 66% (13.2/20 ± 2.53) (p = 0.043). Absolute gain was 18% (3.5/20 ± 3.7) and relative gain was 37% (Fig. 3A). The class average normalized gain 〈g〉 was 34%, and the average of the single-student normalized gains g(ave) was 29% (SD ± 33) (Table 1).

Fig. 2 A Cognitive learning pre- and post-test score plots showing improvement (positive slopes), no change (horizontal lines), or deterioration (negative slopes); 1 = pre-test, 2 = post-test; maximum score 20. B Technical skill learning pre- and post-test score plots, displayed identically; maximum score 16

Fig. 3 A Cognitive and technical skill average learning gains with pre- and post-test scores (%) during the one-day introductory bronchoscopy course (N = 24). B Average learning gains of the five components of bronchoscopy technical skill with pre- and post-test skill scores (%). All changes statistically significant (p < 0.05)

Table 1 Pre- and post-test scores and learning gain (N = 24)

Mean test scores of technical skill also improved significantly, from 43% (6.9/16 ± 2.91) to 77% (12.4/16 ± 3.33) (p = 0.017). Absolute gain for the class was 34% (5.5/16 ± 3.7) and relative gain was 78% (Fig. 3A). The class-average normalized gain 〈g〉 for technical skills was 60%, and the average of the single-student normalized gains g(ave) was 59% (SD ± 39) (Table 1). Statistically significant absolute gains were noted in all five elements of technical skill (p < 0.05): time (−29%, reflecting faster performance), precision (27%), anatomic recognition (42%), posture/hand position (24%), and economy of movement (42%) (Fig. 3B).

Likert-scale surveys for the cognitive learning sessions received mean scores ranging from 4.65/5 to 4.94/5 (5 being the best attainable score). Likert-scale scores for each of the technical skill stations ranged from 4.81/5 to 5/5. Perceptions of the educational program were assessed 3 months later with a four-item questionnaire; the response rate was 75% (18/24 students). All but one respondent (17/18) stated they would recommend participation in this course to the following year's incoming trainees. Most (14/18) said the course had a very positive impact on their skills and performance. Fifteen said their senior colleagues in training had shown strong enthusiasm when asked whether they would like to participate in a similar course. When asked for suggestions regarding future courses, most (16/18) requested more time at the skill stations or a program lasting more than one day (Table 2).

Table 2 Four-item questionnaire to assess educational value for pulmonary trainees 3 months after course participation (n = 18)

In their Likert surveys, all four program directors scored each of the cognitive and technical skill stations 5/5 at the time of the course. Follow-up interviews 3 months later revealed unanimity regarding the positive effect of the course on their trainees' bronchoscopy skills and performance. The directors' opinions were based on feedback from trainees who participated in the course, as well as on direct observation of their trainees' ability to perform bronchoscopy during the intervening 3 months. When asked whether the program had any weaknesses, two said it was too information-dense for a single day, and one recommended greater emphasis on bronchial anatomy. All four program directors said they wanted their trainees to attend a similar course the following year.

For the cohort one year later (18 first-year pulmonary and critical care trainees), baseline knowledge and technical skill were similar to those of the earlier cohort. All measures of learning gain again increased significantly, corroborating the findings from the initial cohort (Table 3). Mean scores of cognitive knowledge increased significantly from 39% (7.8/20 ± 2.4) to 66% (13.1/20 ± 2.5) (p = 0.021). Absolute gain for the class was 27% (5.3/20 ± 2.9) and relative gain was 69%. No negative gains were noted. The class-average normalized gain 〈g〉 was 44%, and the average of the single-student normalized gains g(ave) was 60% (SD ± 21).

Table 3 Pre-/post-test scores and learning gain for two cohorts

Mean scores for technical skill also increased significantly, from 41% (6.6/16 ± 3.2) to 76% (12.2/16 ± 2.4) (p = 0.015). Absolute gain for the class was 35% (5.6/16 ± 2.9) and relative gain was 85% (Table 3). Again, no negative gains were noted. The class-average normalized gain 〈g〉 for technical skills was 60%, which, along with the large 〈g〉 for cognitive knowledge, confirmed the effectiveness of the educational intervention. The average of the single-student normalized gains g(ave) was 60% (SD ± 22).

Discussion

The competency-based paradigm is today's prevalent educational model. It requires that procedure-related learning lead to a verifiably measurable level of knowledge and technical skill [27, 28]. From an educator's perspective, identifying curricular effectiveness using objective measures of student learning that demonstrate gains in knowledge and skill is an important element of competency-oriented program development. Demonstrating learning gain can be viewed as analogous to the number-needed-to-treat metric required to prove the efficacy of new therapies [6]. It is often extremely difficult to prove the direct beneficial impact of educational interventions on clinical care. Yet measurements of curricular effectiveness can identify strengths and shortcomings of an educational intervention and help delineate an individual student's progress along the competency curve from novice to advanced beginner, competent, and proficient provider [29].

In this study we used a pre-test/post-test assessment model, several measures of learning gain, Likert-scale analyses, and post-intervention surveys to study the curricular effectiveness and perceived educational value of a one-day course for novice bronchoscopists. The true value of the pre-test/post-test assessment model has been controversial because of the effects of many extraneous variables, including the Hawthorne effect (knowing that one is being tested may affect the results), the halo effect (the human tendency to respond positively or negatively to an instructor), and the practice effect (of a pre-test on a subsequent post-test). These limitations are inherent to most measures of knowledge acquisition in social research. For this reason, quasi-experimental designs are frequently used in education research to evaluate interventions where randomization cannot be performed because of ethical considerations, practical barriers to randomization, or small available sample sizes [30]. We therefore favored a synthetic design integrating numerous variables to strengthen the internal validity of a simple analysis of pre-test/post-test gains [31].

The effectiveness of this curriculum, independent of the study group's pre-test level of knowledge, was established using measures of class-average normalized gain (〈g〉) and related metrics. This follows practice in physics education research, where 〈g〉 for courses with widely varying average pre-test scores (〈%pre〉) has been shown to be nearly independent of the pre-test score, depending primarily on the effectiveness of the instruction [16]. Wieman and Perkins [32] described the 〈g〉 metric as the fraction of concepts that students master, on average, that they did not already know at the start of the class.

To mitigate the need for a control group, the curriculum and the pre-tests and post-tests were all administered on the same day. It is not plausible that the significant gains seen over such a short time span could have occurred without the intervention. Hence, because of its short duration, our educational intervention was immune to many of the external factors that could otherwise threaten the validity of a single-group pre-test/post-test design, including history, maturation, and testing effect [25]. Furthermore, the baseline knowledge and skill level of a second cohort of novice bronchoscopists one year later was similar to the first, and participation in the course again resulted in significant learning gains in both knowledge and skill.

To diminish the skewing effect of outlier students with very high or very low pre-test scores, we calculated individual single-student normalized gains, where gi = [%post-test − %pre-test]/[100% − %pre-test]. This is the actual gain divided by the maximum gain achievable by each student and has been described by McGowan and Davis [33] as "telling us what the student achieved in tests, given what was possible for her (him) to achieve." The use of the single-student g and its related calculations has received empirical justification as an easy-to-use gauge of course effectiveness in hundreds of classroom-taught and other varying types of courses with different instructors and student populations [15]. In addition, individualized learning needs can potentially be determined from single-student normalized gain assessments. In our study, this performance measure allowed us to document changes in individual scores in addition to exploring the overall effectiveness of our one-day curriculum. As expected, the largest improvements were seen in learners with the lowest pre-test scores (Fig. 2A, B). This is a fairly obvious observation, suggesting simply that those who have more to learn do in fact learn more. Yet it provides additional justification for objective measurement of knowledge and skill acquisition in each novice trainee; when an individual's plotted learning curve does not meet established expectations, remedial intervention may be in order.

Our study has several limitations. First, our curriculum was designed for single-day delivery to a small number of participants. This was justified because our objective was to document only learning gain, not procedural competency, and it reflects the logistical difficulty of bringing together trainees from eight institutions in southern California. Second, the study was intentionally limited to assessing short-term acquisition of knowledge and skills and was not meant to follow the learners' long-term knowledge and skill retention or decay [34–36]. Third, despite the use of various skill stations, we focused only on the acquisition of bronchoscopic anatomy and inspection skills. We had two rationales for this: inspection is the basis of all bronchoscopic procedures, and testing each participant at every other station would have required a considerable time commitment. Last, as in many areas of medical education, the impact of this intervention on clinical practice and outcomes is unknown [37, 38]. Patient-related outcomes are generally not an ideal surrogate for demonstrating the effectiveness of educational interventions. Despite this, most medical education research is founded on the basic assumption that knowledge and skill acquisition eventually leads to improved patient care [6].

Conclusion

It has long been recognized that assessment drives learning and that rigorous assessment inspires learning, reinforces confidence, and reassures the public [39, 40]. In the context of procedure-based training, we submit that the pre-test/post-test model, with calculation of various measures of learning gain including class-average and single-student normalized gains, provides an objective and informative means to document learner performance and demonstrate the effectiveness of an educational intervention. Diverse opinions regarding educational methodologies [41, 42], curricular structure [43], and measures of effectiveness [44, 45] persist, necessitating further studies to confirm and build on our findings.