Introduction
Since its release in 1993, the EORTC QLQ-C30 has become a widely used “core” instrument for the study of cancer-specific health-related quality of life (HRQoL) [1–4]. It comprises 9 multi-item scales and 6 single-item measures. While the multidimensional profile generated by the QLQ-C30 is invaluable in providing a detailed picture of the impact of cancer and its treatment on patients’ HRQoL, there is also interest in developing “summary” scores that can simplify analyses and minimize the chance of Type I errors due to multiple comparisons. In addition, it might sometimes be more useful, particularly in clinical trials, to employ a single composite variable measured with greater precision [5] rather than many variables, each measured with less precision. This interest in summarizing data generated from multidimensional HRQoL profiles is reflected in the development of so-called “higher order models,” such as those available for the SF-36 Health Survey and other instruments [6–8].
To date, there have been only a limited number of analyses of the structure of the QLQ-C30, all of which relied on relatively small sample sizes (e.g., N < 200), a subset of the QLQ-C30 items, and/or exploratory techniques [9–15]. The aim of the present study was to fill this gap by examining empirically, and comparing, the statistical “fit” of a number of alternative “higher order” measurement models for the QLQ-C30, using confirmatory factor analysis in a large sample of patients [16]. The results of this study may be used to identify one or more higher order measurement models that could be used to compute simpler, summary scores for this questionnaire. The results are also of interest from a theoretical perspective, hopefully allowing us to place the pragmatically oriented QLQ-C30 in the context of a number of established, theoretical HRQoL models.
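The multiple-comparisons point can be made concrete. With the QLQ-C30’s 15 scales (9 multi-item scales plus 6 single items) each tested separately at α = 0.05, the chance of at least one false-positive finding is far above 5%. A minimal illustration, under the simplifying (and here hypothetical) assumption of independent tests — real, correlated scale scores would inflate the rate somewhat differently:

```python
# Family-wise Type I error rate for k tests, each at level alpha,
# assuming (for illustration only) that the tests are independent.
def familywise_error(k, alpha=0.05):
    return 1.0 - (1.0 - alpha) ** k

# 15 separate QLQ-C30 scale comparisons at alpha = 0.05:
print(round(familywise_error(15), 2))  # -> 0.54
```

Even under this rough approximation, roughly every second trial would report at least one spurious “effect,” which is part of the motivation for a small number of summary scores.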
Results
The characteristics of the patients included in the study are presented in Table
1. The average age of the patients was 60 years, with slightly more males than females, and more early than advanced cancer. A number of study types (clinical trials, non-randomized comparative studies, and observational studies), a wide variety of (primarily European) countries, and a range of disease sites were also represented.
Table 1
Respondent characteristics (N = 4,541)
Characteristic | n | % |
Gender |
Male | 2,511 | 55.3 |
Female | 1,906 | 42.0 |
Unknown | 124 | 2.7 |
Stage |
I–III | 1,846 | 40.7 |
IV–recurrent/metastatic | 1,765 | 38.9 |
Unknown | 930 | 20.5 |
Site |
Breast | 663 | 14.6 |
Colorectal | 245 | 5.4 |
Gynecological | 375 | 8.3 |
Head and neck | 801 | 17.6 |
Lung | 610 | 13.4 |
Esophagus/stomach | 822 | 18.1 |
Prostate | 405 | 8.9 |
Other | 620 | 13.7 |
Study type |
RCT | 1,561 | 34.4 |
Non-RCT | 1,455 | 32.0 |
Field study | 1,386 | 30.5 |
Unknown | 139 | 3.1 |
Country |
Belgium | 193 | 4.3 |
Canada | 120 | 2.6 |
France | 266 | 5.9 |
Germany | 477 | 10.5 |
Netherlands | 228 | 5.0 |
Norway | 498 | 11.0 |
Spain | 402 | 8.9 |
Sri Lanka | 438 | 9.6 |
Sweden | 202 | 4.4 |
UK | 722 | 15.9 |
USA | 157 | 3.5 |
Other | 838 | 18.5 |
No item had more than 2.6% missing observations; for most items this was less than 1%. However, all items, with the exception of the two items of the QL scale, were highly skewed; approximately half of the items had 50% or more of the responses in the lowest category (data not shown). The polychoric correlations between the 29 items were generally moderate (i.e., >0.30) to strong (>0.50) (data not shown).
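Polychoric correlations estimate the correlation between two latent normal variables assumed to underlie a pair of ordinal items, which is why they suit the highly skewed QLQ-C30 response distributions better than Pearson correlations. The software actually used in the study is not reproduced here; the following is an illustrative two-step sketch (thresholds from the marginal proportions, then a one-dimensional likelihood search for the latent correlation):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm, multivariate_normal

def polychoric(x, y):
    """Two-step ML estimate of the polychoric correlation between two
    ordinal variables coded 0, 1, 2, ... (illustrative sketch only)."""
    x, y = np.asarray(x), np.asarray(y)
    cats_x, cats_y = np.unique(x), np.unique(y)
    # Step 1: thresholds from cumulative marginal proportions
    # (+/-8 stands in for +/-infinity on the latent normal scale).
    tx = np.concatenate(([-8.0],
                         norm.ppf(np.cumsum([np.mean(x == c) for c in cats_x])[:-1]),
                         [8.0]))
    ty = np.concatenate(([-8.0],
                         norm.ppf(np.cumsum([np.mean(y == c) for c in cats_y])[:-1]),
                         [8.0]))
    counts = np.array([[np.sum((x == a) & (y == b)) for b in cats_y] for a in cats_x])

    def negloglik(rho):
        bvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
        ll = 0.0
        for i in range(len(cats_x)):
            for j in range(len(cats_y)):
                # Probability mass of the bivariate normal over this cell
                p = (bvn.cdf([tx[i + 1], ty[j + 1]]) - bvn.cdf([tx[i], ty[j + 1]])
                     - bvn.cdf([tx[i + 1], ty[j]]) + bvn.cdf([tx[i], ty[j]]))
                ll += counts[i, j] * np.log(max(p, 1e-300))
        return -ll

    # Step 2: maximize the likelihood over the single parameter rho
    return minimize_scalar(negloglik, bounds=(-0.99, 0.99), method="bounded").x
```

For ordinal data generated from a bivariate normal with ρ = 0.5, this returns an estimate close to 0.5, whereas the Pearson correlation of the categorized scores is typically attenuated.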
The fit indices for the various models are presented in Table 2. As might be anticipated given the large sample size, no model passed the stringent χ² test of model fit. However, all models were deemed to be at least “adequate” approximations to the data, as determined by the previously noted rules of thumb applied to the CFI/TLI and RMSEA indices. As expected [20], the less restricted the model, the better the model fit, with the Standard model even achieving a “good” fit. The Mental–Physical models had approximate fit indices slightly superior to all of the other higher order models. The correlations between higher order factors (in the multi-factor models) were generally quite high, often exceeding 0.95 (see Table 2). This indicates that these higher order factors were virtually indistinguishable, implying that additional factors were of limited explanatory value. The exceptions were the models positing Mental and Physical factors, which had lower correlations between these higher order factors.
Table 2
Tests^a and approximate goodness-of-fit indices for various models
Model | χ² | df | CFI/TLI | RMSEA | Comments |
1. “Standard” model | 134 | 15 | 0.96/0.98 | 0.042 | 14 latent variables, excluding FI |
2. Physical health, mental health and QL | 234 | 19 | 0.92/0.98 | 0.050 | Correlation physical health and mental health = 0.74 |
3. Physical burden, mental function and QL | 248 | 18 | 0.92/0.97 | 0.053 | Correlation physical burden and mental function = 0.81 |
4. Symptom burden, function and QL | 294 | 18 | 0.90/0.97 | 0.058 | Correlation burden and function = 0.97 |
5. HRQL and QL | 297 | 18 | 0.90/0.97 | 0.058 | |
6. Formative symptom burden (free weights), function and QL | 277 | 17 | 0.91/0.97 | 0.058 | Correlation formative burden and function = 0.96 |
7. Formative symptom burden (fixed weights), function and QL | 300 | 17 | 0.90/0.96 | 0.061 | Correlation formative burden and function = 0.95 |
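The RMSEA column of Table 2 can be reproduced from the χ² statistic, its degrees of freedom, and the sample size via the usual point-estimate formula, which also makes the rules of thumb easy to apply. A short sketch; the “good”/“adequate” cut-offs in the comment are common conventions in the CFA literature, not quoted from the paper’s reference [50]:

```python
import math

def rmsea(chi2_stat, df, n):
    """Point estimate of the root mean square error of approximation."""
    return math.sqrt(max(chi2_stat - df, 0.0) / (df * (n - 1)))

# Standard model, first row of Table 2: chi-square 134 on 15 df, N = 4,541
print(round(rmsea(134, 15, 4541), 3))  # -> 0.042

# Commonly used rules of thumb (one convention among several):
# < 0.05 suggests "good" fit, < 0.08 "adequate" fit.
```

The same formula reproduces the other RMSEA entries in Table 2 from their χ² and df columns, e.g., 0.050 for the Physical health/Mental health/QL model.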
The results of (corrected) chi-squared difference tests between pairs of models within each branch of nested models [53] are presented in Table 3. Differences between each successive pair of nested models in each branch were significant, indicating that each successive tightening of restrictions resulted in a significant decrement in model fit.
Table 3
χ² difference testing between 3 branches of nested models
Model | Branch 1 (Mental–Physical): Δχ² | Δdf | Branch 2 (Burden/Function): Δχ² | Δdf | Branch 3 (formative): Δχ² | Δdf |
1. Standard model (14 latents), incl. QL | Root node | | Root node | | Root node | |
2. Physical health, mental health and QL | 293 | 17 | – | – | – | – |
3. Physical burden, mental function and QL | 77 | 2 | – | – | – | – |
4. Symptom burden, function and QL | – | – | 377 | 15 | – | – |
5. HRQL and QL | 241 | 3 | 47 | 2 | – | – |
6. Formative symptom burden (free weights), function, and QL | – | – | – | – | 336 | 12 |
7. Formative symptom burden (fixed weights), function, and QL | – | – | – | – | 241 | 5 |
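Each entry in Table 3 is evaluated against a χ² distribution with the corresponding difference in degrees of freedom. With the estimator used here for ordinal data, the raw difference in fit statistics must first be corrected [53]; the final step, sketched below for the first comparison in the table, is the same either way:

```python
from scipy.stats import chi2

def chi2_difference_p(delta_chi2, delta_df):
    """p-value for a (corrected) chi-square difference between a
    restricted model and the less restricted model it is nested in."""
    return chi2.sf(delta_chi2, delta_df)

# Standard model vs. Physical health/Mental health/QL model:
# delta chi-square 293 on 17 df (first row of branch 1 in Table 3)
p = chi2_difference_p(293, 17)
assert p < 0.001  # a significant decrement in fit, as reported
```

A significant result means the extra restrictions of the nested model discard real structure; a non-significant result would justify the simpler model.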
The standardized regression weights (for the first-order factors on the higher order factors) for the best fitting model in each of the three branches are presented in Table 4, together with the percentage of variance in each first-order factor explained by its corresponding higher order factor. All postulated factor regression weights for the Burden/Function and the Mental Health/Physical Health models were significant, with the exception of SL on the Physical health factor. However, the percentages of explained variance for PF, EF, CF, and SL were markedly inferior for the Burden/Function model.
Table 4
(Standardized) Regression weights for first-order factors and percentage variance explained by best fitting higher order model for each of three branches of (nested) models
Scale | Physical health | Mental health | % var | Burden | Function | % var | Formative burden | Function | % var |
PF | 0.80^a | | 0.64 | | 0.76* | 0.58 | | 0.76^a | 0.59 |
RF | 0.89* | 0.04 | 0.84 | | 0.89^a | 0.79 | | 0.89* | 0.80 |
EF | | 0.72^a | 0.52 | | 0.62* | 0.38 | | 0.62* | 0.38 |
CF | | 0.90* | 0.82 | | 0.80* | 0.63 | | 0.80* | 0.62 |
SF | 0.42* | 0.46* | 0.68 | | 0.82* | 0.67 | | 0.82* | 0.67 |
FA | 0.82* | 0.19* | 0.93 | 0.97^a | | 0.95 | 0.83^a | | NA |
NV | 0.66* | | 0.43 | 0.65* | | 0.42 | 0.04 | | NA |
PA | 0.60* | 0.23* | 0.62 | 0.79* | | 0.63 | 0.16* | | NA |
DY | 0.80* | | 0.65 | 0.80* | | 0.64 | 0.03 | | NA |
SL | 0.05 | 0.77* | 0.64 | 0.77* | | 0.59 | 0.08* | | NA |
AP | 0.85* | | 0.72 | 0.84* | | 0.71 | −0.08 | | NA |
CO | 0.75* | | 0.56 | 0.73* | | 0.54 | 0.04 | | NA |
DI | 0.62* | | 0.39 | 0.62* | | 0.38 | −0.02 | | NA |
Only the hypothesized regression weights for the FA, SL, and PA symptom scales in the formative Burden/Function model (the third branch of nested models) were statistically significant. FA was the only symptom with a substantial loading on the formative Burden variable; the other symptoms contributed little. The amount of explained variance was again inferior for the PF, SF, and CF scales compared with the Mental Health/Physical Health model.
Examination of the modification indices and residuals indicated that item q22 (“worry”) was a source of ill-fit for all models. There also appeared to be some relationships between EF and the other scales not fully captured by the higher order factors (data not shown).
Discussion and conclusions
The present study tested the statistical fit of seven alternative measurement models for the QLQ-C30. This was done by using confirmatory factor analysis to compare empirically their adequacy in representing the EORTC QLQ-C30 in a sample of 4,541 cancer patients. The point of reference was the Standard model, a latent variable model which employed the architecture of the standard, 14-dimensional QLQ-C30 model (excluding the FI item).
As mentioned previously, the models studied here were organized into three independent branches of nested models: three models in the so-called Mental–Physical branch, two in the Burden/Function branch, and two in the “formative” Burden/Function branch. The Standard model stands at the apex of each of the three branches.
None of the models examined passed the stringent χ² test of model fit, indicating that none of these models captured all of the systematic variation in the data. It should be noted, however, that with 4,541 observations, a chi-square test is quite sensitive to even small deviations. Importantly, all models demonstrated at least an “adequate” approximation to the data [50]. The Standard QLQ-C30 model actually demonstrated a “good” fit to the data. Moreover, χ² “difference testing” demonstrated that each addition of restrictions in each of the successively nested models in each branch led to a statistically significant deterioration in model fit.
The Mental Health/Physical Health model, the least restricted higher order model in the first branch studied, fit significantly better than its nested alternatives and gave an adequate, albeit imperfect, approximation to the data. The Burden/Function model was the best approximation to the Standard model in the second branch. We note, however, that the Burden/Function model is only slightly superior to the simpler one-dimensional HRQL model, since its two dimensions are almost indistinguishable.
Unfortunately, the chi-square test cannot be used to compare models nested in these two different branches directly. However, the approximate fit indices indicate that the Mental Health/Physical Health model is slightly superior to the Burden/Function model. Additionally, the Mental Health/Physical Health model achieves better explanatory power for the CF, PF, EF, and SL scales than does the Burden/Function model. For these reasons, the Mental Health/Physical Health model is preferable.
A third branch of nested models, consisting of “causal” or “formative” latent variables, represents an alternative approach to the modeling of HRQL questionnaires. The model with free weights fit the data significantly better than the fixed (equally weighted) model. However, the improvements in fit indices that would be expected if the formative conceptualization were more appropriate than the reflective one [37] were not observed in the current analysis. Additionally, the only symptom that appeared to strongly predict Function was fatigue, a result also reported previously [9]. This indicates that the other symptoms may be regarded as largely irrelevant as predictors of Function for this group of patients, which may be a drastic oversimplification of the Standard model. One could argue that this result disqualifies this branch of models.
It is interesting to note that question 22 (i.e., “did you worry?”) of the QLQ-C30 emotional function scale was frequently flagged as being a source of ill-fit. This may have to do with possible ambiguity in the meaning of “worry,” either as an indication of healthy concern in a difficult situation, or as an indication of psychological distress.
Several possible limitations of this study should be noted. First, the use of pairwise deletion for the (relatively sparse) missing data in the computation of the polychoric correlations resulted in some loss of data. A second limitation concerns the possible bias introduced by the clustered sampling of data from various data sources. While we did apply a correction to the chi-square statistics and standard errors, additional corrections for the parameter estimates, possibly based on sampling weights, would arguably have been even better. Third, it would have been useful to have access to the Akaike Information Criterion (AIC) and other related statistics [60] in order to compare non-nested models across the various branches. A full-information maximum likelihood estimation procedure could have addressed all three problems simultaneously; however, the computational burden of such a procedure is prohibitive.
A fourth limitation concerns the choice of models, which was neither exhaustive of all plausible theoretical models nor sufficient to capture all of the systematic variation in the data. On the other hand, the “alternative models” approach used here is methodologically stronger than a purely exploratory approach [16]. For this reason, we refrained from “tweaking” either the standard or any of the other alternative models to achieve some improvement in fit, a practice frowned upon as potentially capitalizing on chance. Nevertheless, we recognize that there are other, more exploratory approaches that might be used. For example, causal discovery techniques and software (e.g., TETRAD) employ rigorous algorithms to locate all well-fitting models for a set of observed data, to which theory can then be applied to choose the most suitable or plausible model(s). While beyond the scope of the current paper, the utility of such approaches could be the subject of future studies [61, 62].
Summarizing, we believe that the Physical Health/Mental Health model is the most appropriate conceptualization for our goal of offering a simplified form of QLQ-C30 outcomes. This model was found to provide an “adequate” fit to the data, slightly superior to the alternative higher order models examined here. We believe that it is the best of the approximations to the Standard model considered in this study. The Physical Health/Mental Health conceptual model has also been utilized and successfully tested for other HRQoL instruments [6, 7], has been considered in a large, multi-instrument study [24], and is consistent with the PROMIS domain mapping project and the WHO framework [25–27]. For these reasons, we consider it to be the most promising of the models considered here.
Nevertheless, the “superiority” of this Physical Health/Mental Health model is modest, and it remains to be seen whether its extra complexity, as compared with, e.g., the simple HRQL model, provides tangible (clinical) benefits. We therefore intend to examine further the suitability of the Physical Health/Mental Health model by testing its measurement equivalence across sub-populations and over time. We will also attempt to use this model to predict external criteria and outcomes, and to compare it with other instruments purporting to measure similar concepts. These efforts will culminate in an algorithm for the computation of higher order factors for the QLQ-C30.
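To preview what such an algorithm might look like, the loadings in Table 4 suggest one obvious, hypothetical scoring rule: a loading-weighted mean of the 0–100 scale scores assigned to each higher order factor. The weights and scale assignments below are the significant standardized loadings from the Mental Health/Physical Health columns of Table 4; the eventual published algorithm may well differ, so treat this purely as an illustrative sketch:

```python
# Hypothetical weights: significant standardized loadings from Table 4
# (Mental Health/Physical Health model); non-significant loadings omitted.
PHYSICAL = {"PF": 0.80, "RF": 0.89, "SF": 0.42, "FA": 0.82, "NV": 0.66,
            "PA": 0.60, "DY": 0.80, "AP": 0.85, "CO": 0.75, "DI": 0.62}
MENTAL = {"EF": 0.72, "CF": 0.90, "SF": 0.46, "FA": 0.19, "PA": 0.23,
          "SL": 0.77}

def weighted_summary(scale_scores, weights):
    """Loading-weighted mean of the available 0-100 scale scores.
    Assumes all scales are oriented so that higher = better (i.e.,
    symptom scales have been reverse-scored beforehand)."""
    shared = [s for s in weights if s in scale_scores]
    total = sum(weights[s] for s in shared)
    return sum(weights[s] * scale_scores[s] for s in shared) / total
```

For example, a patient scoring 50 on every contributing scale receives a summary of exactly 50 regardless of the weights, while scales with larger loadings pull the summary more strongly toward their own values.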