Original reports
Nonspecialist Raters Can Provide Reliable Assessments of Procedural Skills
Introduction
Medical education is changing rapidly, and the training of procedural skills is shifting from traditional apprenticeship and time-based learning toward competency-based attainment of skills.1, 2 Competency-based learning is favored because trainees pass when they are competent, rather than after a prescribed period of time or a certain number of procedures, neither of which necessarily reflects competence.3, 4, 5, 6 Competency-based learning requires specialist assessment of procedural skills. Despite its advantages, competency-based learning still faces challenges, e.g., the limited availability of faculty time.7, 8, 9, 10 Some studies also suggest that knowing the identity of the trainee can influence assessment.11, 12 Technology holds some promise, as video recordings of performances create more flexibility and reduce the risk of bias.13, 14, 15
Studies show that rater training is beneficial and even suggest that a 1-hour frame-of-reference training session can sufficiently prepare raters to use a simple evaluation instrument for the assessment of procedural skills.16, 17 Studies have shown that medical students can be used in teaching settings instead of professors,18, 19, 20 and this could be translated to competency-based assessment, where nonspecialist raters could be used to further reduce the time specialists spend on assessment. The common perception is that specialists must assess procedural skills, but previous studies have shown that even nonmedically trained individuals can assess surgical skills.21, 22
Using nonspecialist raters would not only decrease the workload of specialists and minimize interpersonal bias but also offer a more economical solution in settings where competency-based assessment is needed. The use of nonspecialist raters must, however, be shown to be reliable and valid before it can be implemented as part of competency-based learning assessments.
The aim of this study was to explore the validity of nonspecialist raters' assessments of video-recorded procedures.
Section snippets
Design
This study was a blinded observational trial. Novices (senior medical students) and experienced doctors were video recorded while performing 2 flexible cystoscopies each. The recordings were anonymized and placed in random order and then rated by 2 experienced cystoscopists (specialist raters) and 2 medical students (nonspecialist raters). Flexible cystoscopy was chosen as it is a simple procedural skill that is crucial to master in a resident urology program.23
Participants
The novices participating in this
Results
Twenty-three novices and 9 experienced doctors participated in this study, giving a total of 64 video recordings, all rated by a pair of specialist raters and a pair of nonspecialist raters (256 ratings in total). The internal consistency of assessments was high, Cronbach's α = 0.93 and 0.95 for nonspecialist and specialist raters, respectively (p < 0.001 for both correlations). The interrater reliability was significant (p < 0.001) with a Pearson's correlation of 0.77 for the nonspecialist and
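The two reliability metrics reported above follow standard formulas: Cronbach's α compares the summed item variances with the variance of total scores, and interrater reliability is the Pearson correlation between two raters' scores. As an illustrative sketch only (not the study's actual analysis code, and with made-up data layout assumptions), they can be computed from a ratings matrix like this:

```python
from statistics import pvariance, mean
from math import sqrt

def cronbach_alpha(items):
    """Cronbach's alpha for internal consistency.

    items: list of per-item score lists, where each inner list holds one
    checklist item's scores across all rated performances.
    alpha = (k / (k - 1)) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per performance
    item_var = sum(pvariance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_var / pvariance(totals))

def pearson_r(x, y):
    """Interrater reliability as Pearson's r between two raters' total scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

For example, two raters whose scores rise and fall together yield r close to 1, and a checklist whose items all track the same underlying skill yields α close to 1; published thresholds (e.g., 0.80 for high-stakes assessment) are then applied to these coefficients.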
Discussion
Our study showed that nonspecialist raters can provide reliable and valid assessments of video-recorded cystoscopies when their internal consistency, interrater reliability, and test-retest reliability are compared with those of specialist raters. The required strength of a reliability coefficient always depends on the purpose and consequences of the assessment.25 Overall, the interrater reliability was reasonable, approaching the 0.80 level needed for high-stakes
Conclusion
Our study suggests that nonspecialist raters can provide reliable assessments of video-recorded cystoscopies.
References (46)
- et al. Is a resident's score on a videotaped objective structured assessment of technical skills affected by revealing the resident's identity? Am J Obstet Gynecol (2003)
- et al. Computer-assisted video evaluation of surgical skills. Obstet Gynecol (1995)
- et al. Duration of faculty training needed to ensure reliable OR performance ratings. J Surg Educ (2013)
- et al. Crowd-sourced assessment of technical skills: a novel method to evaluate surgical performance. J Surg Res (2014)
- et al. The simulation centre at Rigshospitalet, Copenhagen, Denmark. J Surg Educ (2015)
- et al. Measuring to improve: peer and crowd-sourced assessments of technical skill with robot-assisted radical prostatectomy. Eur Urol (2016)
- et al. Teaching surgical skills—changes in the wind. N Engl J Med (2006)
- et al. Shifting paradigms: from Flexner to competencies. Acad Med (2002)
- et al. Three-year experience with an innovative, modular competency-based curriculum for orthopaedic training. J Bone Joint Surg Am (2013)
- et al. Time- versus competency-based residency training. Plast Reconstr Surg (2016)
- Building and assessing competence: the potential for evidence-based graduate medical education. Qual Manag Health Care
- The assessment of clinical skills is imperative in postgraduate specialty training. Ugeskr Laeger
- Implementation of competency-based medical education: are we addressing the concerns and challenges? Med Educ
- Reviewing residents' competence: a qualitative study of the role of clinical competency committees in performance assessment. Acad Med
- Reconceptualizing variable rater assessments as both an educational and clinical care problem. Acad Med
- The role of assessment in competency-based medical education. Med Teach
- Reliable and valid assessment of competence in endoscopic ultrasonography and fine-needle aspiration for mediastinal staging of non-small cell lung cancer. Endoscopy
- An integrable, web-based solution for easy assessment of video-recorded performances. Adv Med Educ Pract
- Toward reliable operative assessment: the reliability and feasibility of videotaped assessment of laparoscopic technical skills. Surg Endosc
- How faculty members experience workplace-based assessment rater training: a qualitative study. Med Educ
- Student teachers can be as good as associate professors in teaching clinical skills. Med Teach
- A new first-year course designed and taught by a senior medical student. Acad Med
- Student-led tutorials in problem-based learning: educational outcomes and students' perceptions. Med Teach