Introduction

It is well known that some patients operated on for spinal disorders will have a poor result, regardless of the apparent technical success of the operative procedure itself [27]. This has prompted the search for risk factors and the development of pre-screening tools [1, 4, 11] to assist with both the patient selection procedure and the promotion of realistic expectations on behalf of the patient. Over the last 10–15 years, numerous studies have sought to identify the determinants of surgical outcome [15]. Despite this, there is still a lack of consensus regarding both the most important predictors and their overall predictive power (i.e. their clinical relevance). The risk factors identified in any given study are most likely contingent upon certain methodological factors—such as the study design (retrospective versus prospective), the statistical methods used (bivariate, multivariate analyses), the number and type of predictors examined, and their prevalence within the patient group examined—as well as factors such as the specific pathology or the surgical procedure being investigated [15]. Furthermore, both the proportion of positive outcomes after spinal surgery [10] and the factors identified as predictors [9, 25] depend to a large extent on the manner in which the outcome itself is assessed.

There is no single, universally accepted method for assessing the outcome of spinal surgery. In the past, many clinicians developed their own simple rating scales, using categories such as “excellent, good, moderate, and poor”, which they themselves used to judge the outcome, predominantly from a surgical or a clinical perspective. The technical success of the operation also lent itself to evaluation in terms of, for example, the accuracy of screw placement or the degree of fusion/extent of decompression achieved, as monitored by appropriate imaging modalities at follow-up. In an effort to achieve further objectivity, these measures were sometimes supplemented with physiologic measures such as range of motion or muscle strength [7]. However, in many cases, these indices proved to be only weakly associated with outcomes of relevance to the patients and to society. With the increasing awareness that the outcome should be (at least also) assessed by the patient himself/herself, the previously popular surgical outcome measures were superseded by a diverse range of patient-orientated questionnaires, assessing factors of importance to the patient, such as symptoms, disability, quality of life, and ability to work. However, the emergence of many new instruments in each of these domains and the lack of their standardized use [34] has compromised meaningful comparison among different diagnostic groups, treatment procedures, and clinical studies [2]. In recognition of this problem, a standardized set of outcome measures for use with back pain patients was proposed in 1998 by a multinational group of experts [7]. There was general consensus that the most appropriate core outcome measures should include the following domains: pain, back-specific function, generic health status (well-being), work disability, and patient satisfaction [2, 7]. Recent studies have shown that these measures, whilst related, are not interchangeable as outcome measures [8]. 
In 1998, Deyo et al. [7] developed a core set of just six questions that would cover all of these domains yet be brief enough to be practical for routine clinical use, quality management, and possibly also more formal research studies. The psychometric characteristics of the core-set were recently examined in back pain patients undergoing either surgical or conservative treatment [16, 22]. The individual core items as well as a multidimensional sum-score of all the core items showed good reliability, validity, and responsiveness [16, 22].

The present study sought to examine the factors that predict the outcome of spinal surgery as measured using this brief, multidimensional score. Some of the most important demographic/biological, psychosocial, and medical risk factors suggested by a recent review of the literature [15] were included, in order to assess their relative importance to each of the outcome domains and to the multidimensional score as a whole.

Methods

Study population

Patients were recruited (from March 1999 through to April 2000) from the spine centres of two neighbouring orthopaedic hospitals: one was a (public) University Hospital and the other was a not-for-profit Foundation Hospital with University affiliations. The inclusion criteria were fluency in the German language, back pain or leg pain due to a spinal disorder for which spinal surgery was scheduled, and willingness to complete a questionnaire booklet before and 6 months after surgery. Exclusion criteria were severe medical problems (e.g. tumour, infection, cardiovascular disease) and spinal problems as a result of trauma. Patients were recruited by research assistants, through direct personal request in the University Hospital, and via a letter in the mail in the Foundation Hospital. In the latter hospital, herniated disc patients were not invited to participate, as they were being examined as part of another study. All other consecutive patients who fulfilled the admission criteria were invited to participate; those completing the first questionnaire were sent a second postal questionnaire 6 months after surgery.

Questionnaires

The questionnaire booklet enquired, amongst other things, about the following (for further details, see Table 1):

  • Sociodemographic variables (gender and age)

  • Clinical/pain history (pain duration, previous operations on the spine)

  • Fear-avoidance beliefs [26, 31]

  • Depression (modified ZUNG self-rating depression scale [6, 14, 35])

  • The “core-set” items [7, 16]

All the questionnaires had been cross-culturally adapted for the German language in previous studies [16, 18, 26].

Table 1 Questionnaires/questionnaire items

From the core-set items (pain, function, symptom-specific well-being, general well-being, disability), a composite index score was constructed, as described previously [16]. Briefly, all scales were first linearly transformed into a 0–10 format. Pain intensity was already measured in this format, whilst function, and symptom-specific and general well-being were measured with a 1–5 point Likert scale (transformed according to the formula: [category score marked by the patient − 1] × 2.5). Disability (work) and disability (social role) were measured in days of work incapacity and restricted activity respectively over the last month and could theoretically range from 0 to 31. These were recoded into five categories, to provide a similar 1–5 point scale as for the other items: (1) 0 days, (2) 1–7 days, (3) 8–14 days, (4) 15–21 days, (5) ≥22 days.

The transformed core-item scores were averaged to form an unweighted composite core index that ranged from 0 to 10.
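The scoring rules for the composite index can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and argument names are hypothetical, the two disability items are assumed to enter the average as separate, equally weighted items, and any count of 22 days or more is assumed to fall into category 5. The original scoring instructions [16] remain the authoritative source.

```python
def likert_to_0_10(category: int) -> float:
    """Map a 1-5 Likert category onto the 0-10 range: (category - 1) * 2.5."""
    if not 1 <= category <= 5:
        raise ValueError("Likert category must be between 1 and 5")
    return (category - 1) * 2.5


def days_to_category(days: int) -> int:
    """Recode days of incapacity/restricted activity (0-31) into categories 1-5."""
    if not 0 <= days <= 31:
        raise ValueError("days must be between 0 and 31")
    if days == 0:
        return 1
    if days <= 7:
        return 2
    if days <= 14:
        return 3
    if days <= 21:
        return 4
    return 5  # assumption: 22 days or more


def composite_core_index(pain, function, symptom_wellbeing,
                         general_wellbeing, work_days, social_days):
    """Unweighted mean of the transformed items, on a 0-10 scale.

    Assumption: the two disability items are averaged in as separate items.
    """
    items = [
        float(pain),                                   # already 0-10
        likert_to_0_10(function),
        likert_to_0_10(symptom_wellbeing),
        likert_to_0_10(general_wellbeing),
        likert_to_0_10(days_to_category(work_days)),
        likert_to_0_10(days_to_category(social_days)),
    ]
    return sum(items) / len(items)


# Hypothetical patient: the worst possible answers give the maximum score.
worst = composite_core_index(10, 5, 5, 5, 31, 31)
best = composite_core_index(0, 1, 1, 1, 0, 0)
```

Because every item is first brought onto the same 0–10 range, the simple mean keeps the same range, and higher values indicate a worse status.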

Data analysis and statistics

Descriptive data on the domain single items and the composite index score, as well as their test–retest reliability, validity and sensitivity to change, were published recently [16] and are therefore not reported here.

In order to predict the outcome after surgical treatment, each index item and the composite score, in turn, served as the dependent variable in a multivariate longitudinal regression analysis. In the longitudinal hierarchical regression models, the baseline equivalent of the dependent variable was the first variable entered into the model (e.g. if pain was being predicted as the outcome measure, pain at baseline was the first variable entered into the analysis). This is the standard way of analysing longitudinal data and allows the prediction of change over and above that predicted by the association between the outcome variable and its value at baseline [12]. In the second step, the demographic variables, age and gender, were entered into the model. In the third step, the pain/medical variables were entered (duration of pain, number of previous spinal operations, number of vertebral levels treated, surgical procedure [dummy-coded variables]). We decided to include surgical procedure rather than diagnosis as a possible predictor variable in this step: these two variables are highly interrelated (and hence only one was to be entered), and we opted for surgical procedure, as it is potentially more relevant to the treatment outcome to consider what has actually been done than to consider diagnostic labels. Finally, in the fourth step, the psychosocial variables fear-avoidance beliefs (FABQ) about physical activity and about work, and depression (ZUNG) were entered into the model (for a similar multivariate regression model for predicting disability and work absence, see [31]).
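The blockwise entry of predictors can be illustrated with a small, self-contained Python sketch. It is not the analysis code used in the study: the data are synthetic and generated only for demonstration, the variable names are hypothetical, the surgical-procedure dummies and the number of treated levels are omitted for brevity, and ordinary least squares is solved with a plain normal-equations routine. The point is only to show how R² accumulates as each block is added after the baseline value of the outcome.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(rows, y):
    """R^2 of an OLS fit of y on the predictor rows (intercept added)."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, d)) for d in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def hierarchical_r2(data, outcome, blocks):
    """Enter blocks cumulatively; return (block, R^2, increase in R^2) per step."""
    y = [row[outcome] for row in data]
    entered, previous, steps = [], 0.0, []
    for name, variables in blocks:
        entered = entered + list(variables)
        rows = [[row[v] for v in entered] for row in data]
        r2 = r_squared(rows, y)
        steps.append((name, r2, r2 - previous))
        previous = r2
    return steps


# Synthetic data, for illustration only (not the study data).
rng = random.Random(0)
data = []
for _ in range(60):
    row = {
        "baseline": rng.uniform(3, 10),
        "age": rng.uniform(30, 85),
        "female": float(rng.random() < 0.5),
        "pain_months": rng.uniform(1, 120),
        "previous_ops": float(rng.randint(0, 3)),
        "fabq_work": rng.uniform(0, 42),
        "zung": rng.uniform(0, 60),
    }
    row["followup"] = (0.4 * row["baseline"] + 0.05 * row["fabq_work"]
                       + 0.04 * row["zung"] + rng.gauss(0, 1.5))
    data.append(row)

blocks = [
    ("1 baseline", ["baseline"]),
    ("2 demographics", ["age", "female"]),
    ("3 medical", ["pain_months", "previous_ops"]),
    ("4 psychosocial", ["fabq_work", "zung"]),
]
steps = hierarchical_r2(data, "followup", blocks)
for name, r2, delta in steps:
    print(f"{name}: R2 = {r2:.3f} (change {delta:+.3f})")
```

Entering the baseline score first means that the R² increments of the later blocks describe variance in the follow-up score that is not already accounted for by the pre-operative status.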

Statistical significance was accepted at the P < 0.05 level.

Results

Participants/study flow

All patients who fulfilled the admission criteria (n = 427) were invited to participate; 256 (60%) patients agreed, and completed the first questionnaire 1–2 weeks before surgery. Many patients declared that they did not want to be bothered by such things as filling out long questionnaires before an impending operation; for many “it was simply too much”. Examination of the demographics of participants versus non-participants (for the Foundation Hospital only, in which access to such information was readily available) revealed that the group of patients that participated contained a slightly lower proportion of women (but not significantly; P = 0.25) and was somewhat older (62 vs 59 years; P = 0.04) compared with the group of patients that declined participation. The distribution of diagnoses did not differ significantly between the groups, but there was a slight tendency for discopathy and instability patients to be less well represented in the group under study.

Of the 256 patients who returned the baseline questionnaire, 25 patients being treated for a herniated disc were later excluded from analysis for a combination of reasons: they were somewhat more acute in their symptoms; they were recruited from one hospital only and relatively few in number (see Methods); and recent reports in the literature [5, 15] gave reason to believe they may differ in relation to their (lesser) propensity for non-medical factors to influence outcome (although being so few in number, any attempt to confirm such a bias by means of an independent analysis of this sub-group was untenable).

The data from a further 20 patients who had completed the baseline questionnaire were also discarded for various reasons: no follow-up questionnaire was sent to 7 patients due to administrative errors in the recall system, 3 patients due to uncertainties regarding the validity of the pre-op questionnaire (completion date/language-understanding), and 3 patients due to their involvement in a bigger research project requiring the completion of a similarly long questionnaire; 3 patients did not go on to receive the foreseen spine operation (cardiovascular comorbidity in 2, hip problem in 1); 2 patients developed other confounding medical problems (1 brain tumour, 1 hiatus hernia) and 2 patients underwent further spine operations, which precluded meaningful completion of the follow-up questionnaire 6 months after the index procedure.

Of the remaining 211 patients, 48 patients failed to respond to the 6-month follow-up questionnaire or returned a blank or largely incomplete questionnaire, whilst 163 patients returned a completed questionnaire. This gave a corrected response rate at follow-up of 77.3% (163/211). The demographic characteristics of the patients in the final study group are shown in Table 2.
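The patient flow reported above can be checked with simple arithmetic. The short Python sketch below uses only the counts given in the text; the variable names and the wording of the exclusion labels are ours.

```python
invited = 427
baseline_returned = 256
herniated_disc_excluded = 25

# Further exclusions after the baseline questionnaire, as listed in the text.
other_exclusions = {
    "no follow-up sent: administrative error": 7,
    "no follow-up sent: doubtful pre-op questionnaire": 3,
    "no follow-up sent: enrolled in larger project": 3,
    "planned operation not performed": 3,
    "other confounding medical problem": 2,
    "further spine operation": 2,
}

eligible = baseline_returned - herniated_disc_excluded - sum(other_exclusions.values())
responders = 163
dropouts = eligible - responders

participation_rate = baseline_returned / invited
followup_rate = responders / eligible

print(sum(other_exclusions.values()))   # 20
print(eligible, dropouts)               # 211 48
print(round(100 * participation_rate))  # 60
print(round(100 * followup_rate, 1))    # 77.3
```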

Table 2 Sample characteristics

The data of the 48 dropouts were compared with those of the 163 responders with respect to age, gender, and all the variables that were predictors or outcome variables in the linear regression analyses. The dropout group was significantly younger than the group of responders [55.9 (SD 14.7) vs 61.4 (SD 14.3); P = 0.02] but the ratio of males to females was similar in the two groups [20:28 (41.7% males) and 69:94 (42.3% males), respectively; P = 0.94]. The dropout group showed significantly (P = 0.023) lower values for just one item of the Core index, general well-being (“How would you rate your quality of life?”). Otherwise, no differences between the groups reached statistical significance, although trends existed for the responders to have slightly higher baseline ZUNG depression and FABQ scores.

Since 97/163 of the study group were patients with spinal stenosis, a separate analysis was conducted for this group alone; the demographic details for this sub-group are shown in Table 2.

Changes in core item scores before and 6 months after surgery

The mean core-measures composite index score (score range 0–10) reduced from 7.0 (SD 1.6) before surgery to 5.0 (SD 2.4) after surgery (P < 0.001). All individual items showed highly significant improvements (P < 0.001).

Prediction of the core-measures composite index score

In predicting the core-measures composite score 6 months after surgery, the baseline (pre-operative) index score was a significant predictor when it was the only predictor variable in step 1 (β = 0.447, P < 0.001; Table 3). In step 2, age and gender were entered into the regression and showed no significant partial regression coefficients, i.e. did not contribute significantly to the predictive model. Also in step 3, when medical predictor variables entered the model, no individual predictor variable showed a significant partial regression coefficient; taken together as a block of predictors, the duration of pain, the number of previous surgeries, the number of treated levels, and the kind of treatment added 4.5% of variance explanation (ns) to the model. In step 4 of the hierarchical regression analysis, activity-related and work-related fear-avoidance beliefs and depression, as measured with the Zung scale, entered the model. FABQ work and ZUNG each showed significant beta coefficients (FABQ work: 0.206, P = 0.037; ZUNG: 0.257, P = 0.013). In the final model, the unique contribution of the block of medical and psychosocial variables was calculated by multiplying the standardized partial regression coefficients by the zero-order correlation between each predictor variable and the index, and summing these afterwards within each block. The unique contribution of medical predictors was 5.4%, whereas the psychosocial variables explained nearly 20% of the variance in the index (Table 4). Similar results were found when just the spinal stenosis patients were analyzed (2.3% medical predictor variables, 18.3% psychosocial predictor variables).
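The block-wise contribution described above (standardized coefficient multiplied by the zero-order correlation, summed within each block) can be sketched as follows. The numbers are hypothetical and chosen only to show the calculation; they are not the coefficients or correlations from Tables 3 and 4, and the function and variable names are ours. For an ordinary least-squares model, these products summed over all predictors equal the model R², which is why they can be read as a partition of the explained variance.

```python
def block_contributions(betas, correlations, blocks):
    """Sum beta * zero-order r within each block of predictors."""
    return {
        name: sum(betas[v] * correlations[v] for v in variables)
        for name, variables in blocks.items()
    }


# Hypothetical standardized coefficients and zero-order correlations
# with the outcome (illustration only).
betas = {"pain_duration": 0.10, "previous_ops": 0.05,
         "fabq_work": 0.20, "zung": 0.25}
correlations = {"pain_duration": 0.20, "previous_ops": 0.10,
                "fabq_work": 0.40, "zung": 0.40}
blocks = {
    "medical": ["pain_duration", "previous_ops"],
    "psychosocial": ["fabq_work", "zung"],
}

shares = block_contributions(betas, correlations, blocks)
for name, share in shares.items():
    print(f"{name}: {100 * share:.1f}% of variance")
```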

Table 3 Results of the hierarchical multiple regression analysis of the core measures composite index on demographic, medical, and psychosocial predictor variables

Table 4 Unique variance explained in the final regression models by sets of medical predictors, psychosocial predictors, and combined medical and psychosocial predictors

Table 4 also shows the results for prediction of the index when the single items of the index were z-standardized and the index was calculated as the mean score of z-standardized item values. Z-standardisation involves converting individual scores into measurements of standard deviations above or below the sample mean (which after transformation then becomes zero, with a SD of 1), making variables with different units or scales of measurement more directly comparable. Within the present study, it would serve to level out possible differences in the response range of the single items and might therefore be seen as a psychometrically better means of analysis. As shown in Table 4 (last column), regression of the z-standardized item-index on predictor variables showed comparable results to those for the non-standardized item-index, although the unique variance explained by psychosocial predictor variables was somewhat higher (25.7% compared with 19.4% for the combined predictor model).
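A minimal Python sketch of the z-standardized index is given below. It assumes that the sample standard deviation (n − 1 denominator) is used and that the index is the plain mean of the standardized items for each patient; the item names and scores are hypothetical.

```python
from statistics import mean, stdev


def z_standardize(values):
    """Convert scores to z-scores: (value - sample mean) / sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def z_index(items):
    """Mean of z-standardized items per patient.

    items maps an item name to a list of scores, one per patient,
    with patients in the same order for every item.
    """
    z_items = [z_standardize(scores) for scores in items.values()]
    return [mean(patient) for patient in zip(*z_items)]


# Hypothetical scores for five patients on two items with different ranges.
items = {
    "pain_0_10": [2.0, 4.0, 6.0, 8.0, 10.0],
    "function_1_5": [1.0, 2.0, 3.0, 4.0, 5.0],
}
index = z_index(items)
z_pain = z_standardize(items["pain_0_10"])
```

After the transformation each item has a mean of zero and a standard deviation of one, so an item with a wide response range no longer dominates the average.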

Altogether, the combination of baseline symptoms, medical variables, and psychosocial factors explained 34% variance in the core-index score (regardless of whether raw or z-standardized scores were used) (Table 4, combined model).

For better comparison of the sets of “medical” and “psychosocial” predictor variables, three separate regression models were constructed (Table 4). The “medical” and “psychosocial” predictor models each included just the corresponding set of predictors, controlling for demographics and the baseline value of the outcome variable in question. Each set shows its maximum predictive power in these stand-alone models. In the third model, both sets of predictor variables were entered into the model simultaneously, and they hence had to ‘compete’. In the combined model, the variance explained by each set of variables is unique, in that it is controlled for the other variables: if this unique contribution were much lower than in the corresponding stand-alone model, it would indicate that the medical and psychosocial predictor sets were not complementary in their prediction but rather were predicting very much the same inter-individual differences in outcome, i.e. delivering overlapping information. Notably, this was not the case. The variance explained by the sets in the stand-alone models was almost identical to their unique contribution in the combined model (Table 4).

Prediction of the individual domain items

Hierarchical multiple regression analyses showed that the medical variables were better in predicting pain (6.4% variance explained) and symptom-specific well-being (7.0%) than in predicting back function (3.6%), general well-being (1.8%) or disability (2.4%). The inverse pattern emerged for the psychosocial predictors. The latter explained a higher proportion of variance in back function (19.9%), general well-being (17.1%), and disability (20.3%) than pain (12.8%) or symptom-specific well-being (14.0%). This pattern did not change much when “medical” and “psychosocial” predictor sets were analysed separately (Table 4).

Discussion

The present study confirmed that the outcome after spinal surgery, as measured using the recently validated core-measures questionnaire [7, 16, 22], can be predicted by some of the variables most commonly considered in the literature to represent determinants of outcome [15]. The questionnaire has recently been incorporated into the Spine Society of Europe’s “Spine Tango Spine Surgery Registry” (http://www.spinetango.com) and is being recommended for use on a prospective basis for all patients undergoing spinal surgery in the institutions contributing to the registry. A knowledge of the factors that might influence outcome, as assessed with this instrument, is hence of great interest.

Together with the baseline symptoms, the medical and psychosocial predictor variables examined in the present study were able to explain a considerable proportion of the variance in outcome as assessed by the composite-index score (34%). In terms of effect sizes, this would be considered large (i.e. an effect size of 0.51, where the effect size f² = R²/(1 − R²), and f² values of 0.02, 0.15, and 0.35 are considered to indicate small, medium, and large effects, respectively [3]). Even when considering only the medical (5.4%) and psychosocial predictor sets (19.4%, see Table 4), these together explained 24.8% of the variation in the index, which in terms of an effect size (0.33) approaches the threshold for a large effect and is comparable with the findings of high quality prospective studies reported in the recent literature (see review in [15]). With the limitations of any biologic measurement, it is often possible to identify only 25–50% of the variance of a relationship [30]. Thus, the composite index proved itself to be a responsive instrument with respect to risk factor-related change. This complements previous findings in relation to the instrument’s responsiveness (sensitivity to change) after treatment, where effect sizes (0.95–1.30) exceeding those of many longer outcome questionnaires were found [16, 23, unpublished findings].
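The conversion from explained variance to the effect size f² is a one-line calculation, sketched here in Python with a hypothetical function name. Applied to the rounded percentages quoted in the text, it gives values of about 0.52 and 0.33; the small difference from the reported 0.51 presumably reflects rounding of R² before conversion.

```python
def cohen_f2(r_squared: float) -> float:
    """Effect size f^2 = R^2 / (1 - R^2) for a proportion of explained variance."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in the interval [0, 1)")
    return r_squared / (1.0 - r_squared)


f2_full_model = cohen_f2(0.34)            # baseline + medical + psychosocial
f2_predictor_sets = cohen_f2(0.054 + 0.194)  # medical + psychosocial sets only

print(f"{f2_full_model:.3f}")
print(f"{f2_predictor_sets:.3f}")
```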

Various factors representing the different domains (medical and psychosocial) were identified as unique, independent predictors of outcome. Age and gender did not make any significant contribution to the explanatory models for any of the outcome items or the composite score. Whilst these variables have been identified as significant predictors of outcome in some retrospective studies (in which a limited number of “readily available” risk factors are typically examined), the majority of high quality prospective studies do not support a predictive role for them [15]. One of the risk factors commonly identified in previous studies, the duration of symptoms [11, 19, 21, 24, 29, 32, 33], was not shown to be a significant predictor in the present study. This may have been due to the paucity of acute patients in the study sample that would otherwise have provided a wider spectrum of values for this independent variable; a very high proportion of the whole sample (94%) were already “chronic” pain sufferers, according to the standard definition of >3 months pain.

Many work-related factors, such as worker’s compensation, disability claims, work status, and the duration of sick leave, have been identified as predictors of surgical outcome (see [15]). These factors were not examined in detail in the present study, but we recommend their inclusion in future risk-factor studies involving the core measures, particularly in relation to the item “disability” (assessed by the number of cut-down days/time off work in the preceding month). In the present study, in the prediction of disability, 30 individuals had missing data, because they did not work at baseline and/or follow-up. In the remaining occupational group, fear-avoidance beliefs about work was the most significant predictor of “disability”, confirming the findings of previous studies in which more elaborate disability scales were used [17, 26, 31].

In the present study, depression was a consistent predictor of many of the individual outcome items and also of the composite core-index measure. Over the last two decades, the role of depression/psychological distress as a predictor of surgical outcome has been the subject of much study and discussion. Currently, the only consensus appears to be that there is no consensus with regard to the findings, with as many studies finding an association between depression and outcome as not (see [15]). Some suggest that the poor results of surgery reported in psychologically disturbed patients may reflect intervention in patients who did not have surgically remediable pathology: distress may increase the pressure for surgery, and inappropriate symptoms and signs may obscure the physical assessment, leading to a mistaken diagnosis of a surgically treatable lesion [30]. In this sense, the uncertain indication rather than psychological distress, per se, would be the factor responsible for the poor outcome, and would no doubt further exacerbate the distress of the patient after surgery [28], hence explaining the commonly reported association between outcome and post-operative psychological status (see [15]). To an extent, this theory has been verified by the recent studies of Carragee et al. (see [4]). This group showed that patients with acute and subacute sciatica in association with a clearly identifiable, severe disc herniation had a very high chance of dramatic and lasting improvement with surgery, and that standard psychological screening tests failed to predict outcome in these patients. Even severe emotional distress in patients who underwent appropriate early surgical intervention did not correlate with adverse outcomes, although the same psychometric profile in patients with chronic sciatic pain and disability did predict worse outcomes. These authors concluded that, with prolonged pain and emotional distress, adverse and possibly self-perpetuating psychological and social changes may significantly decrease the impact of surgery [4]. In the present study, more than 80% of the patients had had pain for more than 1 year, which most likely explains the significance of depression in predicting surgical outcome in this group.

The multiple predictors identified for the composite index show that: (1) the index is sensitive to the multidimensional risk factors that are most consistently described in the outcomes literature [15], and (2) the theoretical notion of building an index out of multiple relevant outcome dimensions, each of which had its own specific risk-factor profile in single item analyses, was successful in maintaining those same specific aspects within the combined index. In other words, the multi-dimensionality was reflected in a risk factor profile that constituted a combination of the profiles from the single item analyses. The potential danger in making an aggregate index is that it might equalize out the specific characteristics of the individual dimensions. However, this was clearly not the case here, and the composite index represented its individual facets very well. Nonetheless, we consider it prudent to recommend that, in future studies, the treatment effects for the individual items be examined in addition to those of the composite core measures index.

Certain limitations of the present study are worthy of mention. Only 60% of the patients who were invited to participate actually volunteered to do so. This potentially threatens the validity of the study and the ability to extrapolate the findings to the “typical” spine surgery patient. Nonetheless, we believe that those who did participate represented reasonably well the individuals seeking care in our institutions, since there were no relevant differences in their demographic characteristics compared with the non-participants, and their baseline and follow-up core index scores were comparable to those of patients with the same diagnoses that are now assessed with the core measures on a routine basis in one of the participating hospitals [unpublished findings]. Of the group that did participate, those that subsequently dropped out at follow-up were significantly younger, and rated their quality of life at baseline as significantly worse, than those who returned their follow-up questionnaire; they also had a non-significant tendency to be more depressed and fearful. However, since non-responders at follow-up are generally considered to include a higher proportion of patients with a poor outcome [13, 20], rather than challenging the validity of our conclusions this would serve instead to confirm our hypothesis that a less healthy psychological profile is associated with a poorer outcome.

In conclusion, further to our previous studies establishing the reliability, validity, and sensitivity to change of the core-index [16], we have now shown that the variance in its value after surgery could be significantly explained by the most commonly reported medical and psychosocial predictor variables. Moreover, the index was represented well by its individual items, as documented by the fact that certain predictor variables were more specific for some domains than for others: the psychosocial predictors, for example, were more specific for “disability”, “function”, and “general well-being” than for “pain” and “symptom-specific well-being”. We believe that this further substantiates the recommendation for the widespread and consistent use of the core measures index in clinical trials, multicentre studies, routine quality management, and surgical registry systems.