Background
Diabetes is a significant health problem and was recently estimated to affect approximately 451 million people worldwide [1]. Up to 50% of persons with diabetes are affected by diabetic peripheral neuropathy (DPN), which causes widespread sensory loss, primarily affecting the feet and legs [2–5]. DPN is associated with lower limb complications such as foot deformity [6], increased plantar pressures [7], ulceration and infection, and is implicated in 50–75% of all non-traumatic lower limb amputations [8]. Prophylactic care in people with diabetes has been shown to prevent or delay development of DPN. For example, intensive glycaemic control has been shown to reduce the incidence of neuropathy by between 25% [9] and 57% [10]. Additionally, education and routine foot care in those with DPN have been shown to reduce the risk of associated foot complications [11, 12]. Therefore, early and accurate diagnosis of DPN is paramount to mitigating the risk of associated foot complications.
Methods for conducting clinical chairside neurological tests to establish the presence and monitor the progression of DPN are varied, and assess different nerve fibre types. Current international guidelines recommend testing of protective sensation using a monofilament, as well as additional tests such as vibration perception, reflexes, pain perception and asking about neurological symptoms [13, 14]. Diminished vibration perception and diminished ability to detect a 10 g monofilament have demonstrated predictive capacity for future foot ulceration [8, 15–18], and both tests are widely used clinically and in research. Several techniques are available for testing vibration perception, including use of a neurothesiometer or similar instrument, as well as graduated and non-graduated tuning forks. Similarly, methods for testing protective sensation using a monofilament can vary clinically in terms of the location and number of sites tested. However, there are limited data available comparing the reliability of different testing methods. Reliability refers to the level of consistency of measurement results between different clinicians (inter-rater) and for the same clinician on multiple occasions (intra-rater). While there have been several small studies investigating inter- and/or intra-rater reliability of monofilament [19–21] and vibration perception testing [21–25], the results of these studies are variable, and the generalisability of their findings is limited by inconsistency of testing methods. One larger study recently compared the effectiveness of three-, four- and 10-site monofilament testing for identifying DPN in 1915 people with diabetes and, in doing so, reported a high level of agreement between testing methods (κ: 0.797 to 0.925) [26], but did not report the reliability of individual tests.
The aim of this study was to determine the inter- and intra-rater reliability of commonly used testing methods of protective sensation and vibration perception, performed by podiatrists with varying amounts of clinical experience, in people with diabetes. Specifically, the tests evaluated were a four-site and a 10-site monofilament test, as well as vibration perception as determined by neurothesiometer, graduated tuning fork and non-graduated tuning fork (dampened and conventional methods).
Methods
This study was conducted at the University of Newcastle Podiatry clinics in New South Wales, Australia. Ethics approval was obtained from the University of Newcastle Human Research Ethics Committee prior to undertaking this study, protocol code H-2012-0141. All participants involved in this study provided written informed consent prior to study commencement.
Participants
Participants were recruited on a volunteer basis, with flyers posted in university clinic consultation rooms and the waiting room directing potential recruits to register their interest. Recruitment was performed by people who were not involved in test performance, thereby ensuring blinding of raters to participant health status. Participants included in the study were required to be representative of the population in which screening for DPN is recommended [14]. Therefore, inclusion criteria were type 1 diabetes of five years or more duration, or type 2 diabetes of any duration, with or without a history of diagnosed DPN, confirmed by medical records. Participants were required to be fluent in English in order to provide informed consent. Exclusion criteria were active foot ulceration, visual evidence of recently healed foot ulceration, lower limb amputation of any kind, or diagnosed peripheral neuropathy of an origin other than diabetes.
The inter- and intra-rater reliability of 10 g monofilament testing using four-site and 10-site testing techniques, as well as of vibration perception threshold (VPT) testing using a neurothesiometer, was determined across three raters [a new graduate podiatrist (R1); a podiatrist with five years of clinical experience (R2); and a podiatrist with 10 years of clinical experience (R3)]. In addition, the inter- and intra-rater reliability of a graduated tuning fork, as well as of an on/off and a dampened method using a conventional tuning fork, was determined for two further raters: a podiatrist with one year’s clinical experience (R4) and a new graduate podiatrist (R5).
Testing protocol
In both the initial testing session and the retest, raters performed the relevant neurological tests in a pre-determined random order on every participant in separate treatment rooms. Raters were blinded to participant health status (i.e. the presence, absence or extent of DPN), though they were aware that all participants had diabetes. Raters were also blinded to each other’s results, as well as to their own results from the first testing session when undertaking the retest. The order of application of the tests was randomised using an online random number generator (www.randomizer.org). The order of raters was randomised in a manner that was not pre-determined, and the order of site application of the monofilament was randomised at the discretion of the individual raters. Participants were blinded to all results, though they were provided with a plain language summary on request at study completion. The tests were performed only on the right limb in order to satisfy the assumption of independence of data [31], with the right limb chosen rather than a random limb in order to minimise rater confusion. Participants were required to attend the retest after seven days at the same location and were required to close their eyes for each test procedure. In addition, each test was first demonstrated on the dorsal aspect of the participant’s hand and, for the vibration tests, ‘buzzing’ was differentiated from pressure sensation.
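For readers who wish to reproduce a comparable allocation without an online tool, the randomisation of test and rater order could be sketched locally as follows. This is an illustrative sketch only and not the procedure used in the study, which relied on www.randomizer.org. The function name build_schedule, the seed value, the participant numbering and the choice of one test order and one rater order per participant are all assumptions made for the example.

```python
import random

# Test and rater labels follow the study description; everything else is hypothetical.
TESTS = ["four-site monofilament", "10-site monofilament", "neurothesiometer"]
RATERS = ["R1", "R2", "R3"]


def build_schedule(participant_ids, tests, raters, seed):
    """Return {participant_id: {"raters": [...], "tests": [...]}} with shuffled orders.

    Assumption: one rater order and one test order are drawn per participant.
    """
    rng = random.Random(seed)  # fixed seed so the allocation can be regenerated and audited
    schedule = {}
    for pid in participant_ids:
        schedule[pid] = {
            # random.sample with k equal to the list length returns a shuffled copy
            "raters": rng.sample(raters, k=len(raters)),
            "tests": rng.sample(tests, k=len(tests)),
        }
    return schedule


# Hypothetical usage: 50 participants numbered 1 to 50, arbitrary seed.
schedule = build_schedule(range(1, 51), TESTS, RATERS, seed=2012)
```

Seeding the generator means the same schedule can be regenerated later, which is one way of documenting that the order was fixed before testing began.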
Statistical analysis
SPSS version 25 was used for statistical analysis. Results for all neurological tests were dichotomised as abnormal or normal, with an abnormal result being indicative of neuropathy. Intra-rater reliability was calculated using an unweighted Cohen’s kappa (κ) statistic [32]. To calculate inter-rater reliability and the effect of experience on reliability, Cohen’s κ was initially determined between the following pairs of raters: R1 and R2, R1 and R3, and R2 and R3 (monofilament and neurothesiometer); and R4 and R5 (tuning fork tests). Fleiss’ κ was then calculated to determine the overall reliability between raters R1–R3 [33]. The Cohen’s and Fleiss’ κ statistics were interpreted using the method proposed by Landis and Koch [34] (0.01–0.20 = slight, 0.21–0.40 = fair, 0.41–0.60 = moderate, 0.61–0.80 = substantial and 0.81–1.0 = almost perfect). Values below 0.4 were interpreted as clinically unacceptable for reliability of a test [35].
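The statistics described above can be illustrated with a short, dependency-free sketch. The analysis in this study was run in SPSS; the code below is only a plain Python rendering of the standard formulas for unweighted Cohen's kappa and Fleiss' kappa, together with the interpretation bands quoted in the text. The function names and the example ratings are hypothetical and do not come from the study data, and the "poor" label for values at or below zero is an assumption because that band is not listed in the text.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length sequences of category labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / n ** 2  # chance agreement
    if p_e == 1:
        return float("nan")  # undefined when both raters use a single identical category
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(rows):
    """Fleiss' kappa; rows[i] holds the labels given to subject i by every rater."""
    n_subjects, m = len(rows), len(rows[0])
    counts = [Counter(r) for r in rows]
    categories = set().union(*counts)
    # Proportion of all ratings that fall in each category
    p_j = [sum(c[cat] for c in counts) / (n_subjects * m) for cat in categories]
    # Per-subject agreement among the m raters
    p_i = [(sum(v * v for v in c.values()) - m) / (m * (m - 1)) for c in counts]
    p_bar, p_e = sum(p_i) / n_subjects, sum(p * p for p in p_j)
    if p_e == 1:
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


def landis_koch(kappa):
    """Map kappa to the bands quoted in the text."""
    if kappa <= 0:
        return "poor"  # assumption: this band is not given in the text
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


def clinically_acceptable(kappa):
    """Values below 0.4 are treated as clinically unacceptable, as stated in the text."""
    return kappa >= 0.4


# Hypothetical dichotomised results for 10 participants (1 = abnormal, 0 = normal).
rater_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
rater_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
k = cohen_kappa(rater_a, rater_b)   # about 0.52
print(round(k, 2), landis_koch(k), clinically_acceptable(k))
```

In this made-up example the two raters agree on 8 of 10 participants, yet κ is only about 0.52 because agreement expected by chance is already 0.58. This is why a kappa statistic, rather than raw percentage agreement, is used to describe reliability.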
Discussion
The results from our study indicate that the monofilament, neurothesiometer and tuning fork are acceptably reliable methods of testing protective sensation (monofilament) and vibration perception (neurothesiometer and tuning fork), with some variability demonstrated between inter- and intra-rater reliability as well as with level of clinical experience. Use of a graduated tuning fork, or the on/off method using a conventional, non-graduated tuning fork, demonstrated higher reliability than the dampened method, and these methods are therefore more appropriate for clinical use. Overall, greater clinician experience resulted in marginally increased reliability of the graduated and conventional (on/off) tuning fork methods and substantially increased reliability of the neurothesiometer. Monofilament tests overall appear to be reliable, with clinical experience possibly increasing the reliability of the four-site test. Despite the acceptable levels of reliability demonstrated by these tests, caution must be used in relying on any one test in isolation. Moderate reliability, for example, still indicates a marked margin of error in test interpretation, and it is axiomatic that clinical tests that have the potential to change clinical practice and drive treatment strategies should strive for higher reliability. When considering using these tests for diagnosis and monitoring of DPN, we support the current recommendations of using more than one test (e.g. monofilament and tuning fork) as part of a larger screening examination. In addition, we suggest that testing should be performed regularly and repeatedly. Of note, our results relate specifically to the reliability of the tests used (i.e. whether the results can be replicated), not to whether they reflect a correct diagnosis of DPN. While use of tests with high reliability is essential for effective clinical management, so too is the ability of the tests to diagnose the target condition. It has been stated that two-test combinations have > 87% sensitivity in detecting DPN [36], though further work is required to determine the combination of tests that has the highest reliability and is the most diagnostically accurate for identifying the presence of DPN.
Previous investigation into the 10 g monofilament has shown mixed reliability. A nine-site monofilament test has been shown to have excellent intra- and inter-rater reliability [20]. Meijer et al. described moderate to good intra-rater and good inter-rater reliability for a two-site test [21], while a three-site test has demonstrated fair to moderate inter- and intra-rater reliability [37]. Lastly, the level of agreement between the four- and 10-site tests in 1915 people with diabetes was recently shown to be high (κ: 0.87) [26], indicating that these tests may be similarly reliable. Our study supports the relatively high inter-rater reliability of the four- and 10-site 10 g monofilament tests previously reported. In the present study, the four- and 10-site tests demonstrated similar levels of inter-rater reliability overall, although experience improved reliability for the four-site test. The excellent intra-rater reliability previously described for the nine-site monofilament test [20] was not replicated in the four- or 10-site tests used in our study. The large range of intra-rater reliability of the monofilament (fair to substantial) was not associated with greater clinical experience. As these tests rely on subjective responses from a patient, it is possible that they will demonstrate variability regardless of the level of experience of the clinician.
The reliability of a variety of methods of assessing vibration perception was determined in this study, including an on/off and a dampened method using a conventional, non-graduated tuning fork, a graduated tuning fork and the neurothesiometer. Of these, the neurothesiometer (n = 50) demonstrated the highest intra-rater reliability and the graduated tuning fork (n = 24) the highest inter-rater reliability. The reliability demonstrated may have been affected by the comparatively low participant numbers in the tuning fork cohort. Overall, the inter-rater reliability of vibration tests was substantial. Our findings regarding the neurothesiometer are supported by two smaller studies investigating the neurothesiometer [22], and the biothesiometer and Maxivibrometer [25], respectively. In our study, intra-rater reliability of the neurothesiometer was affected by experience, with the new graduate demonstrating substantially lower reliability (κ = 0.52) than the more experienced clinicians (κ = 0.72–0.78).
While all tuning fork methods demonstrated substantial inter-rater reliability, the intra-rater reliability was moderate for all methods, and bordering on fair for the dampened method. Previous investigation by Meijer et al. reported substantial intra-rater reliability of the conventional (on/off) method (κ = 0.69) at the hallux interphalangeal joint [21]. Perkins et al. noted acceptable reliability of the conventional (on/off) method at the hallux dorsum, without reporting a kappa statistic [23]. Our findings of moderate intra-rater reliability of the graduated tuning fork are somewhat supported by Thivolet et al., who simply stated statistical significance between test and retest at p < 0.01 [24]. A slightly smaller study previously reported low, non-significant inter-rater reliability of the graduated tuning fork [22], which contradicts our findings of substantial reliability. However, the site of application and methodology were too dissimilar to those of our present study to draw any meaningful comparisons. Lastly, the graduated and on/off conventional methods were only marginally affected by experience. We therefore suggest using the graduated tuning fork or conventional on/off method of vibration perception testing as opposed to the dampened method.
Limitations
Whilst adding to the limited research investigating intra- and inter-rater reliability of vibration perception and monofilament testing in people with diabetes, the findings of this study need to be considered in light of several limitations. Though 50 participants attended for test and retest of monofilament and neurothesiometer, only 24 were involved in tuning fork testing. As n ≥ 30 is required to satisfy the assumption of normal distribution [38], larger sample studies are warranted. Our study is generalisable to people with type 2 diabetes only; however, a strength of this study is that it included people with diagnosed DPN, making it generalisable to people requiring testing and ongoing monitoring. In addition, more extensive clinician training and clearer instruction to participants may improve reliability. The findings of this study are also limited to peripheral neurological testing with neurothesiometer, tuning forks and 10 g monofilament. Other neurological tests such as pain perception, proprioception, ankle reflexes, temperature perception, light touch perception and two-point discrimination were not investigated but may be reliable and of clinical value.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.