Introduction

Lumbar surgery with fusion or disc prosthesis is being evaluated in clinical studies as treatment for patients with chronic low back pain (LBP) [1–3]. Single- or two-level disc degeneration on magnetic resonance imaging (MRI) is a proposed part of the indication for such treatment, and adjacent level and facet degeneration are important issues in these patients [3–5]. Reliable assessment of findings from MRI is crucial to decide on and plan the surgery, to assess its effects, and to study the prognostic role of MRI findings. Unreliable findings in clinical practice and research can lead to incorrect treatment, faulty assessment of adjacent level and facet degeneration, and underestimation of the findings' potential relationship to clinical features and prognosis [6, 7].

Adequate agreement on both type and prevalence of MRI findings at individual disc levels is required to study which and how many levels to treat, to assess the prevalence of any later adjacent level degeneration, and to evaluate how the localised findings may affect prognosis. Therefore, we need data not only on observer agreement (kappa values) but also on differences in reported prevalence of relevant MRI findings between observers at separate disc levels.

Previous studies have examined observer agreement for relevant MRI findings, such as Modic changes [8–12], posterior high-intensity zone (HIZ) in the disc [9, 10, 12–15], disc degeneration [9, 10, 12, 15], abnormal disc contour [9, 12, 15, 16] and facet arthropathy [10, 17, 18]. However, differences between observers in the reported prevalence of such findings have received very little attention [10, 16]. Some of the prior studies had only two observers [8, 11, 13–15, 17] and/or a modest sample size [8, 9, 11, 12, 16–18], focused on one or a few findings [8, 13, 17, 18] and/or reported combined results for several disc levels [9, 10]. Only one study concerned disc prosthesis patients, and it was restricted to facet arthropathy [18].

The aim of the present study was to assess the reliability of pretreatment lumbar spine MRI findings in chronic LBP patients who were accepted candidates for lumbar disc prosthesis. At each disc level for each MRI finding, we analysed interobserver and intraobserver agreement as well as differences in reported prevalence among experienced radiologists. Such analyses at individual levels were also done for combined findings used as MRI indication for prosthesis.

Materials and methods

The appropriate regional research ethics committee approved this study. All patients gave their informed consent prior to their inclusion in the study.

Patients

Of 173 LBP patients randomized to disc prosthesis surgery or multidisciplinary rehabilitation in a prospective national trial [3], 170 (98.3%; mean age 41 years; 82 men, 88 women) had pretreatment MRI available for this retrospective reliability study. The results of this study were not used to determine eligibility in the trial and have not been published previously. The criteria for inclusion in the trial were: age 25–55 years, LBP as main symptom for at least 1 year, insufficient effect of physiotherapy or chiropractic treatment, Oswestry Disability Index (ODI) ≥30% and the following MRI findings reported by the enrolling physicians at L4/L5 and/or at L5/S1 (levels suitable for disc prosthesis): (a) ≥40% disc height decrease compared to the nearest normal disc above and/or (b) at least two of these three findings: Modic changes type I (oedema) and/or type II (fat), posterior HIZ in the disc and dark/black nucleus pulposus on T2-weighted images. Patients were excluded if they had any of the four findings in (a) or (b) at any higher lumbar level (L1–L4) or had spondylolysis, spondylolisthesis, arthritis, osteoporosis, prior fracture L1–S1, prior spinal fusion, deformity, or symptomatic disc herniation/spinal stenosis. Facet joint degeneration was not an exclusion criterion.

Images

MRI was performed as part of clinical practice, using different protocols and magnets (1.5 T in 150 of 170 cases). All examinations included sagittal T2-weighted fast spin echo images: repetition time (TR)/echo time (TE), 2,511–4,760 ms/91–140 ms. All but two (168/170) included sagittal T1-weighted images: 159 spin echo images (TR/TE, 350–91 ms/7–22 ms) and 9 T1 fast fluid-attenuated inversion-recovery images (TR/TE, 1,984–2,130 ms/20–22 ms). Most (168/170) included axial images of the L4/L5 and L5/S1 levels: 135 T2-, 33 T1- and 21 proton density-weighted images. Few (5/170) included sagittal fat-suppression images. Typically, slice thickness was 3–5 mm, interslice gap 0.3–2.2 mm, field of view 19–38 cm for sagittal and 15–32 cm for axial images, and matrix 512 × 512 in the sagittal (115/170) and in the axial plane (89/170). Matrix varied from 160 × 256 to 640 × 640. The images were obtained directly in DICOM format or, in seven cases, as digitized printed film hard copies stored in DICOM format and were de-identified before being evaluated.

Ratings

One radiologist experienced in musculoskeletal MRI (A) and two neuroradiologists (B and C) from three different institutions rated findings on the images. Each observer had more than 10 years' experience in reporting lumbar spine MRI findings. Observers A and C viewed the images on a clinical PACS unit and observer B on a personal computer. Observers A and B used the eFilm Lite software version 2.1.2 (Merge Healthcare, Hartland, Wisconsin), while observer C used the Agfa Impax 4.5 (Agfa HealthCare, Mortsel, Belgium).

We used existing MRI rating criteria for Modic changes [11, 19–21], posterior HIZ in the disc [10, 14], nucleus pulposus signal [22], disc height (subjective and measured) [15, 23–25], disc contour [19] and facet arthropathy [10, 26] (Table 1). Facet arthropathy was rated using Fujiwara and colleagues' simple system [26] combined with illustrations from the Spine Pain Outcomes Research Trial, which had yielded better agreement than Weishaupt and colleagues' system [10]. The observers also received published illustrations of Modic changes and HIZ [10]. They selected ratings from multiple choice lists for each variable at each of the disc levels L3/L4, L4/L5 and L5/S1. The types (none, I, II, III; primary and secondary), anteroposterior (AP) extent, and craniocaudal (CC) extent of Modic changes were rated both inferiorly and superiorly to the disc. Ratings were dichotomized as shown in the “Results” section prior to the statistical analysis.

Table 1 Rating of variables on magnetic resonance imaging of the lumbar spine

Blinded to clinical data and each other's ratings, all three observers evaluated the 170 MRI examinations in random order over 3–4 months. They were asked to also rate the variables on images of suboptimal quality, since these images had been accepted on enrolment and reflected practice. Blinded to, and >3 months after, their first rating, two observers (A and B) rerated 126 examinations in a new random order. These examinations were selected because the reratings were needed for comparison purposes in a follow-up study of these patients, who were also imaged at the end of 2 years of follow-up. These 126 patients were similar to the rest (n = 44) of the 170 patients in gender (p = 0.938; chi-squared test) and ODI (p = 0.278; t test, normal distribution) and were only slightly older (mean age 41.6 vs. 38.9 years in the n = 44 group; p = 0.027; t test, normal distribution).

Pilot study

To achieve a common understanding of the rating criteria, the three observers independently assessed six pilot examinations from another study. Observers A and B then discussed ratings and criteria at a joint 2-h meeting. Observer C did not attend the meeting but compared ratings with observers A and B and discussed with the last author, who had attended.

Statistical analyses

All MRI findings were dichotomized into categories that reflected the inclusion criteria or that might be clinically relevant (see “Results” section). The prevalence of each type of dichotomized MRI finding was calculated at each rated level for each observer. As in similar studies [9, 11], only findings with a mean prevalence of 10–90% across all observers at the rated level were further analysed, since very high or low prevalence can lead to very low agreement beyond chance, despite very high actual agreement [27]. Each finding was further analysed at each rated level. MRI indication for prosthesis (yes/no) was analysed separately at L4/L5 and L5/S1 and noted as present when the observer reported ≥40% disc height decrease and/or at least two of these three findings: Modic changes type I/II (superior and/or inferior to disc), posterior HIZ and dark/black nucleus pulposus. These retrospective reports were not used in the prospective trial.
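The dichotomized indication rule and the 10–90% prevalence window described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, and the study itself used STATA and SPSS rather than this code.

```python
def mri_indication(height_decrease_40, modic_i_or_ii, posterior_hiz, dark_nucleus):
    """Apply the indication rule to one observer's ratings at one disc level.

    All four arguments are dichotomized (True/False) ratings; the names are
    hypothetical and chosen for this sketch only.
    """
    # Count how many of the three supporting findings are present.
    supporting = sum([bool(modic_i_or_ii), bool(posterior_hiz), bool(dark_nucleus)])
    # Present if disc height decrease is >=40% and/or at least two of three findings.
    return bool(height_decrease_40) or supporting >= 2


def prevalence(ratings):
    """Percentage of positive ratings in a list of dichotomized ratings."""
    return 100.0 * sum(bool(r) for r in ratings) / len(ratings)


def analysed_further(prevalences_by_observer, low=10.0, high=90.0):
    """True if the mean prevalence (in percent) across observers is within 10-90%."""
    mean = sum(prevalences_by_observer) / len(prevalences_by_observer)
    return low <= mean <= high
```

For example, a level rated with Modic type II changes and a dark nucleus pulposus but without HIZ or ≥40% height decrease would satisfy the rule, whereas Modic changes alone would not.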

Using STATA 10.0 (College Station, TX), unweighted overall kappa was computed for agreement between all observers with a 95% bias-corrected confidence interval based on bootstrapping with 1,000 repetitions. Unweighted kappa for pairwise interobserver agreement and for intraobserver agreement was calculated using SPSS 17.0 (SPSS, Chicago, IL). p values were computed for difference in the prevalence of findings across observers (fixed effects model, STATA 10.0). After Bonferroni adjustment for multiple comparisons, p < 0.002 indicated statistical significance. Kappa was interpreted as: k ≤ 0.20, poor; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, good and 0.81–1.00, very good agreement beyond chance [28].
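As an illustration of the pairwise statistic, the following minimal Python sketch computes unweighted Cohen's kappa for two observers and maps it to the interpretation categories listed above. It is not the STATA or SPSS procedure used in the study, it omits the bootstrapped confidence interval and the overall multi-observer kappa, and the function names and example ratings are hypothetical. Category upper bounds are treated as inclusive, which is an assumption for values falling between the printed ranges.

```python
def cohen_kappa(ratings_1, ratings_2):
    """Unweighted Cohen's kappa for two observers rating the same cases."""
    r1, r2 = list(ratings_1), list(ratings_2)
    if len(r1) != len(r2) or not r1:
        raise ValueError("both observers must rate the same non-empty set of cases")
    n = len(r1)
    # Observed proportion of cases on which the two observers agree.
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    # Agreement expected by chance, from each observer's marginal proportions.
    categories = set(r1) | set(r2)
    expected = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


def interpret_kappa(kappa):
    """Label kappa using the five categories given in the text."""
    for upper, label in [(0.20, "poor"), (0.40, "fair"), (0.60, "moderate"), (0.80, "good")]:
        if kappa <= upper:
            return label
    return "very good"


# Hypothetical ratings of 100 cases (1 = finding present, 0 = absent).
observer_1 = [1] * 45 + [1] * 5 + [0] * 10 + [0] * 40
observer_2 = [1] * 45 + [0] * 5 + [1] * 10 + [0] * 40
example_kappa = cohen_kappa(observer_1, observer_2)
```

In this hypothetical example the observers agree on 85 of 100 cases, chance agreement is 0.50, and kappa is therefore about 0.70, which falls in the "good" category.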

Sample size

For each comparison, if the true kappa is 0.60 and the prevalence 30%, 191 paired observations provide 80% power to give a significant result at the 5% level in a two-sided test of k = 0.40 [27]. Three observers were used in order to improve the power in this study with a fixed sample size n = 170.

Results

All observers rated all findings at L3–S1 in all 170 examinations, except for type of any Modic changes in the two examinations lacking T1 images. Observers A and B rated all findings twice in 126 cases for intraobserver analysis. Due to a mean prevalence <10% in the n = 170 sample, we did not further analyse any finding at L3/L4 or facet arthropathy at L5/S1.

Interobserver reliability

The prevalence at each rated level differed significantly (p < 0.002) but slightly across observers for most findings (Table 2). Observer C reported more Modic changes than observer B, with a prevalence twice as high at L4/L5 inferior to the disc, i.e. at the upper endplate of L5 (52.9% vs. 26.5%, Table 2). The observers noted >50% CC extent of Modic changes similarly often, except at L5/S1 inferior to the disc (Table 2). The prevalence at individual disc levels differed up to threefold between observers for posterior HIZ and for disc height judged severely reduced; it differed less for ≥40% measured disc height decrease, dark/black nucleus pulposus signal and abnormal disc contour (Table 2, Fig. 1). The difference in prevalence between observers was in a different direction for different findings (Table 2). Thus, the overall MRI indication for prosthesis did not differ significantly in prevalence across observers at either L4/L5 or L5/S1, although it tended to differ at L4/L5 (Table 2).

Table 2 Prevalence of findings in percent by reader
Fig. 1

Magnetic resonance imaging of one patient; sagittal T2-weighted images (a–e) shown in the order of patient's left to right, sagittal T1-weighted image (f) corresponding to T2-weighted image in a, and axial T2-weighted images (g–j) shown from cranially to caudally. Image plane shown in c is marked on h and vice versa (broken lines). At L5/S1, all observers agreed on Modic changes primary type II (a, f; arrowheads), grey nucleus pulposus on T2-weighted images (a–e), ≥40% measured disc height decrease compared to the normal disc above, disc herniation, and slight facet arthropathy (h–j) but not on posterior high-intensity zone (c, arrow) or severely reduced disc height judged subjectively

Overall agreement was moderate or good (k = 0.56–0.77) for presence and extent of Modic changes, but only fair (k = 0.40) for inferior CC extent at L5/S1 (Table 3), which had a low mean prevalence across observers (14.7%). Regarding HIZ, overall agreement was moderate but better at L4/L5 than L5/S1 (k = 0.58 vs. 0.46, Table 3). Overall agreement was moderate or good (k = 0.50–0.72) for dark/black nucleus pulposus signal, severely reduced disc height, ≥40% measured disc height decrease and abnormal disc contour, and fair (k = 0.24) for moderate/severe facet arthropathy at L4/L5 (Table 3), which had a mean prevalence across observers of 11.4%. The MRI indication for disc prosthesis showed good overall agreement both at L4/L5 (k = 0.70) and at L5/S1 (k = 0.66).

Table 3 Interobserver agreement measured by using the kappa statistic

Pairwise agreement ranged from fair to very good. It was fair in one pair at L5/S1 for inferior AP and CC extent of Modic changes, superior AP extent, posterior HIZ and disc contour, and in all pairs for facet arthropathy at L4/L5. It was otherwise moderate to very good (Table 3).

Intraobserver reliability

Intraobserver agreement was good or very good (k = 0.61–1.00) except in one observer at L5/S1 for inferior AP and CC extent of Modic changes (k = 0.38–0.55) and for HIZ (k = 0.60, Table 4). It was mostly very good (k = 0.67–0.87) for the indication for prosthesis (Table 4).

Table 4 Intraobserver agreement measured by using the kappa statistic

Discussion

In this study, interobserver agreement was generally moderate or good for findings included in the indication for disc prosthesis (Modic changes, HIZ, dark/black nucleus pulposus, ≥40% disc height decrease) but only fair for facet arthropathy. Intraobserver agreement was mostly good or very good. Modic changes, HIZ and severely reduced disc height judged subjectively differed up to two- or threefold in prevalence between observers at individual disc levels. The overall MRI indication for disc prosthesis showed more similar prevalence across observers and good interobserver and intraobserver agreement both at L4/L5 and at L5/S1.

Strengths and limitations

The strengths of our study included the use of three observers, a large sample (n = 170) in the interobserver analysis, the analysis of separate disc levels and the testing of disagreement on prevalence. Such disagreement (bias) cannot be assessed by means of the kappa coefficient; it reduces expected agreement by chance and actually increases the kappa values slightly [27]. Disagreement between observers on the prevalence of a finding shows that their ratings of the finding differ systematically. Systematic differences in the interpretation of important findings should be identified by appropriate methods and addressed to improve the reliability.
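The effect of bias on kappa noted above can be checked with a small worked example. The sketch below uses two hypothetical 2 × 2 tables of 100 cases with the same observed agreement (80%): in the first, both observers report the finding in 50% of cases; in the second, one observer reports it in 60% and the other in 40%. The counts and function name are assumptions made for illustration and are not data from this study.

```python
def kappa_from_table(both_yes, only_first_yes, only_second_yes, both_no):
    """Return (observed agreement, chance agreement, kappa) for a 2 x 2 table."""
    n = both_yes + only_first_yes + only_second_yes + both_no
    observed = (both_yes + both_no) / n
    # Prevalence of the finding as reported by each observer.
    first_yes = (both_yes + only_first_yes) / n
    second_yes = (both_yes + only_second_yes) / n
    # Chance agreement from the two observers' marginal proportions.
    expected = first_yes * second_yes + (1 - first_yes) * (1 - second_yes)
    return observed, expected, (observed - expected) / (1 - expected)


# Hypothetical counts: same observed agreement, different reported prevalence.
unbiased = kappa_from_table(40, 10, 10, 40)  # both observers report 50%
biased = kappa_from_table(40, 20, 0, 40)     # observers report 60% vs. 40%
```

With these counts, chance agreement drops from 0.50 to 0.48 when the observers disagree on prevalence, so kappa rises from 0.60 to about 0.62 although the observed agreement is unchanged. This is why prevalence differences were tested separately rather than inferred from kappa.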

The observers used well-defined MRI rating criteria, but they knew the patients were accepted for disc prosthesis surgery due to localised degeneration. How this may have affected their MRI ratings and agreement is not clear. The three radiologists came from different institutions, were not trained together and rated a range of findings on images obtained using different scanners and protocols. The often moderate reliability found in our study may therefore be representative of radiological subspecialty spine imaging practices.

Our results for patients accepted for disc prosthesis surgery should apply equally well to similar patients accepted for surgery with lumbar fusion. These reliability results provide a basis for further research on the role of MRI findings within both of these groups. Some of the results may also have a wider relevance. However, the reliability of the MRI indication for disc prosthesis surgery must be confirmed in chronic LBP patients not yet selected for surgery. Such patients may have a broader spectrum of MRI findings, causing more disagreement.

Discussion of results

We found clear differences in prevalence between observers for Modic changes, HIZ and subjectively rated disc height, and smaller differences for nucleus signal and abnormal disc contour, whereas Carrino et al. [10] found differences in frequency distributions between trained experts for disc degeneration (p = 0.055, Wald test) and facet arthropathy (p = 0.006) but not for Modic changes (p = 0.52) or HIZ (p = 0.22). Lurie et al. [16] found similar frequencies across readers for bulges and normal discs combined. No further comparable data exist.

It is noteworthy that the difference in prevalence between observers was in a different direction for different findings and did not add up to an even larger disagreement on the MRI indication for prosthesis. For example, observer B tended to report a lower prevalence of Modic changes and ≥40% disc height decrease than observer C but a higher prevalence of HIZ and dark/black nucleus signal and thus a more similar prevalence of the overall MRI indication (Table 2).

Disagreement on prevalence might be due to differences in interpretation and the use of rating criteria. It might also be due to differences in the observers' response bias, i.e. their tendency to prefer one or another response category (to rate up or down, particularly when in doubt), independently of the characteristics of the object [29]. Improved rating criteria might perhaps lower the number of ambiguous cases leading to differences in interpretation or response bias.

Our kappa values for interobserver and intraobserver agreement were generally similar to or higher than those in some prior studies for Modic changes [10], HIZ [9, 10, 12, 13], nucleus pulposus signal and disc height combined [9, 10, 15] and abnormal disc contour [9] but were similar [18] or lower [10, 17] for facet arthropathy. This may be partly due to non-standardized images and low prevalence of moderate/severe facet arthropathy in our sample (11.4% at L4/L5). In three studies based on standardized MRI of 40-year-olds from the normal population, kappa values were slightly higher for Modic changes [11], HIZ [15] and abnormal disc contour [12]. The observers in one of these studies had read 50 pilot examinations in consensus [15]. Overall, lumbar MRI findings show mostly moderate interobserver agreement.

There is no firm rule for when the reliability of a finding is adequate, and the use of multiple readers, e.g. in a study, might improve the rating of a finding [30]. Yet, we suggest that kappa ≤0.40 for interobserver agreement should lead to an assessment of how to improve the reliability. We found pairwise kappa ≤0.40 in one observer pair at L5/S1 for inferior extent of Modic changes, disc contour and HIZ. Agreement on HIZ might be improved by looking more closely at both axial and sagittal images and at the signal intensity compared to nucleus. It is also clear that better reliability is needed for facet arthropathy. This finding may be easier to rate on computed tomography (CT) [17, 18].

The clinical relevance of the studied MRI findings is not clear. Systematic reviews indicate that Modic changes are not yet documented to affect treatment outcome [31], that disc findings have only a weak and no clinically meaningful relation to LBP [32] and that there is no test that could identify facet joint arthropathy as source of pain [33]. Further studies are needed to clarify the relevance of such localised MRI findings for surgery with disc prosthesis.

Conclusions

Present state of the art in lumbar imaging shows mostly moderate interobserver agreement [9, 10]. In this study, the agreement was moderate to good for Modic and disc findings and only fair for facet arthropathy. Specific causes of disagreement and strategies to reduce it should be explored. The high reliability of the proposed MRI indication for prosthesis must be confirmed in unselected chronic LBP patients. Further studies are needed to assess the clinical relevance of these MRI findings in candidates for surgery with disc prosthesis or lumbar fusion.