Main

Quality of life (QoL) is an important outcome measure in clinical trials and is increasingly being used as an end point, especially in studies involving patients with malignant diseases (Kavadas et al, 2003; Efficace et al, 2007). Despite this trend, a limited amount of research has been published about QoL in patients with carcinoid and neuroendocrine tumours (NETs) and, to our knowledge, there are limited data on disease-specific QoL questionnaires (QLQs) for these patients (Jacobsen and Hanssen, 1995; Wymenga et al, 1999; Larsson et al, 1999a, 1999b, 2001, 2003; O’Toole et al, 2000; Wareham et al, 2002; Vinik et al, 2009).

Neuroendocrine tumours at all sites are increasing in incidence, and the reported incidence of gastrointestinal (GI) NETs is approaching 3 per 100 000 people per year, with a slight predominance in women (Modlin and Sandor, 1997; Hemminki and Li, 2001a, 2001b; Modlin et al, 2003, 2008; Ellis et al, 2010). Neuroendocrine tumours of gut origin may give rise to symptoms that are related to the local presence of tumour in the gut, pancreas or liver (e.g. pressure and pain); however, around 85% of GINETs produce biologically active substances, which may lead to broader-ranging symptoms and distinctive syndromes (Snow and Liddle, 1995). The most common set of symptoms are those of the carcinoid syndrome (skin flushing, diarrhoea and wheezing) due to excessive production of serotonin (Davis et al, 1973).

The prognosis for NET patients is dependent on both the cancer and the syndrome, and there are differences between pancreatic NETs and those of gut origin (Modlin et al, 2008). The overall 5-year survival for patients with a gastroenteropancreatic NET (following complete resection of the primary tumour and liver metastases if present) is up to 83%; however, in many cases, where it is not possible to surgically remove the primary NET and liver metastases, 5-year survival ranges from 30 to 70% (Wangberg et al, 1996; Janson et al, 1997; Modlin and Sandor, 1997; Kirshbom et al, 1998; Soga, 1998; Shebani et al, 1999; Soreide et al, 2000; Wheeler et al, 2000; Nave et al, 2001; Davies et al, 2003; Ramage et al, 2012).

Treatment should be curative where possible but is palliative in the majority of cases (Ahmed et al, 2009), and many treatments have side effects. It is therefore important when selecting therapies to weigh the benefits of treatment against its impact on the patient’s QoL (Ramage and Davies, 2003). To assess the overall benefits of therapeutic intervention, patient-reported outcome measures should be included as part of clinical studies in patients with NETs, using appropriate health-related QLQs. Generic tools to assess QoL in cancer patients, including the EORTC Core Quality of Life questionnaire (QLQ-C30) and the Functional Assessment of Cancer Therapy-General (FACT-G) scale, have been developed, and may be supplemented by disease-specific modules for different cancers (Cella et al, 1993). Recently, the disease-specific QLQ-GINET21 was devised to supplement the QLQ-C30 and include QoL issues important to patients with NETs.

This paper describes an international phase 4 psychometric validation study designed to assess the clinical and psychometric reliability, validity and responsiveness-to-change of the QLQ-GINET21 in patients with NETs. As discussed previously (Davies et al, 2006), in view of the uncommon nature of pancreatic NETs and the variety of different syndromes within this group, a pragmatic approach has been taken to develop a disease-specific module to cover all NETs.

Patients and methods

Patients

Patients were recruited into this international multicentre study between January 2006 and January 2010. Patients eligible for inclusion were those aged over 18 years with either (1) a histological diagnosis of NET or (2) radiological findings consistent with NET together with raised hormone levels in their plasma or urine indicative of NET. The sites of NET, with or without hormone secretion, included any gut-primary with metastases (including some patients who clinically had a gut primary where the primary lesion had not been definitely identified – classified as unknown primary), lung-primary with liver/abdominal metastases and pancreas with or without metastases. Of the 253 patients, 124 had non-functioning tumours, secreting only chromogranin-A; 111 had tumours that secreted 5-hydroxyindoleacetic acid, a main metabolite of 5-hydroxytryptamine; four secreted gastrin, three secreted glucagon, five secreted insulin and one secreted vasoactive intestinal peptide. In five patients, secretion status was unknown. Expected survival of at least 3 months and a planned treatment for NET were also requirements for inclusion. Patients were excluded if they had concurrent malignancy elsewhere, or had psychological, geographical or comprehension impairment that prevented completion of the questionnaire. Ethics committee approval and written informed consent were obtained and the protocol was approved by the EORTC Quality of Life Group.

Questionnaires

The study required that all patients completed two QLQs – the EORTC core questionnaire (the QLQ-C30, version 3.0; available at http://groups.eortc.be/qol/questionnaires_qlqc30.htm) and the QLQ-GINET21 (see Appendix Table A1) – at all assessment time points (baseline, and 3 and 6 months post-baseline). Details of the development and initial field testing of the QLQ-GINET21 have been published previously (Davies et al, 2006); however, in brief, the QLQ-GINET21 contains a total of 21 items: four single-item assessments relating to muscle and/or bone pain (MBP), body image (BI), information (INF) and sexual functioning (SX), together with 17 items organised into five proposed scales: endocrine symptoms (ED; three items), GI symptoms (GI; five items), treatment-related symptoms (TR; three items), social functioning (SF) of the new module (SF21; three items) and disease-related worries (DRW; three items) (Davies et al, 2006). The response format of the questionnaire is a four-point Likert scale. To help distinguish between missing items and questions that are not relevant, four questions also have a ‘not applicable’ box. Responses to the QLQ-C30 and the QLQ-GINET21 were linearly transformed to a 0–100 scale using EORTC guidelines, with higher scores reflecting more severe symptoms.
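The linear transformation to a 0–100 scale can be illustrated with a minimal sketch. It assumes the standard EORTC scoring convention (raw score = mean of the answered items; item range = 3 for four-point items); the function names and the example responses are hypothetical and are not taken from the study's SPSS syntax.

```python
def raw_score(item_responses):
    """Mean of the item responses (each coded 1-4 on the four-point scale)."""
    return sum(item_responses) / len(item_responses)


def symptom_score(item_responses, item_range=3):
    """Symptom scales and items: higher transformed score = more severe symptoms."""
    return (raw_score(item_responses) - 1) / item_range * 100


def functional_score(item_responses, item_range=3):
    """Functional scales (assumed EORTC convention): higher score = better functioning."""
    return (1 - (raw_score(item_responses) - 1) / item_range) * 100


# Hypothetical responses to a three-item symptom scale
example_items = [2, 3, 1]
print(round(symptom_score(example_items), 1))
```

For the two global health items of the QLQ-C30, which use a seven-point response format, `item_range` would be 6 under the same convention.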

The QLQ-GINET21 was translated into nine languages according to strict EORTC translation guidelines (Cull et al, 2002).

Study design

Patients were assigned to one of two groups depending on the treatment they were to receive. Group 1 consisted of 88 patients who received somatostatin (SMS) analogues or interferon therapy; Group 2 consisted of 165 patients who received other forms of therapy, namely peptide-receptor radiotherapy (102 patients), chemotherapy (23 patients), surgery (20 patients) or ablative/other therapies (20 patients). Patients completed the QLQs before commencement of treatment (baseline) and at 3 and 6 months post-baseline. Owing to the different nature of the treatments, post-baseline assessments were made during long-term continuous treatment in Group 1, while in Group 2 some of the treatments were single administrations and therefore the assessments occurred during the course of therapy in only some cases. Patients also completed a number of established debriefing questions to determine the acceptability of the module. Test–retest reliability was assessed in 48 clinically stable patients 2 weeks after the 6-month assessment.

Statistical analysis

Reliability of the QLQ-C30 and QLQ-GINET21

The reliability (i.e. internal consistency) of the multi-item questionnaire scales was assessed by Cronbach’s α coefficient (Cronbach and Warrington, 1951). As recommended by Nunnally and Bernstein (1994), internal consistency estimates of >0.70 were considered acceptable for group comparisons. The test–retest reliability approach was used to assess repeatability and reproducibility of the questionnaire in a group of 48 patients (the first 48 patients who agreed to take part). These patients completed the questionnaire again, 14 days later; to be included in the analysis, patients in this subgroup had to have stable disease with no change in symptoms within the 2-week period between questionnaires. Intraclass correlations (ICCs) for the QLQ-C30 and QLQ-GINET21 scales were calculated as a measure of test–retest reliability.
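As an illustration of the two reliability statistics, the following sketch computes Cronbach’s α for a multi-item scale and an intraclass correlation for test–retest pairs. The ICC form used in the study is not stated; the one-way random-effects form shown here is an assumption, and the function names and data layout are hypothetical.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (complete cases only)."""
    k = len(rows[0])
    # Sample variance of each item across respondents
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    # Sample variance of the summed scale score
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def icc_oneway(pairs):
    """One-way random-effects ICC (assumed form) for repeated scores per patient."""
    n, k = len(pairs), len(pairs[0])
    grand = mean(v for p in pairs for v in p)
    # Between-patient and within-patient mean squares
    msb = k * sum((mean(p) - grand) ** 2 for p in pairs) / (n - 1)
    msw = sum((v - mean(p)) ** 2 for p in pairs for v in p) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical test-retest scale scores (6-month visit, 2 weeks later)
retest = [(10, 20), (50, 40), (90, 100)]
print(round(icc_oneway(retest), 2))
```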

Validity of the QLQ-C30 and QLQ-GINET21

Multitrait scaling was employed to examine the hypothesised scale structure of the QLQ-GINET21. Item-scale convergent validity was considered supported when the correlation coefficient between an item and its own scale was ≥0.40 (Hays et al, 1988). Support for item-discriminant validity was based on a comparison of the correlation of an item with its own scale against its correlations with other scales. An item was expected to correlate significantly better (by two times the standard error) with its own scale than with other scales.
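The item-scale comparison can be sketched as follows. The sketch assumes that, when an item is correlated with its own scale, the item is first removed from the scale total (a correction for overlap that is common in multitrait scaling but is not stated in this paper); the item labels, scale assignments and responses are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_scale_correlations(data, scales, item):
    """Correlate one item with every scale total, excluding the item itself."""
    n_patients = len(data[item])
    result = {}
    for name, items in scales.items():
        others = [i for i in items if i != item]  # overlap correction (assumption)
        totals = [sum(data[i][p] for i in others) for p in range(n_patients)]
        result[name] = pearson(data[item], totals)
    return result


# Hypothetical responses from six patients to five items in two scales
data = {
    "q1": [1, 2, 3, 4, 2, 3], "q2": [1, 2, 4, 4, 2, 3], "q3": [2, 2, 3, 4, 1, 3],
    "q4": [4, 1, 2, 3, 3, 1], "q5": [3, 1, 2, 4, 3, 2],
}
scales = {"A": ["q1", "q2", "q3"], "B": ["q4", "q5"]}
r = item_scale_correlations(data, scales, "q1")
print({name: round(value, 2) for name, value in r.items()})
```

Convergence would then be judged by comparing the own-scale value with the 0.40 criterion, and discrimination by comparing it with the values for the other scales.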

Correlation among the scales of QLQ-GINET21 and QLQ-C30 was examined using Pearson’s product–moment correlation. It was expected that the scales that were conceptually related would correlate substantially with one another (Pearson’s r>0.40).

The Mann–Whitney U-test (equivalent to the Wilcoxon’s rank-sum test) was used to assess known-group comparisons that discriminate between subgroups of NET patients (e.g. functioning vs non-functioning NETs, different primary tumour sites) with differing clinical status (Karnofsky Performance Status). To evaluate whether the QLQ-GINET21 could be used for pancreatic tumours, comparisons between the mean baseline scores of patients with pancreatic tumours and those with tumours at other sites were performed.
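A minimal sketch of the known-group comparison is given below. It computes the Mann–Whitney U statistic by pairwise counting and a two-sided P-value from the normal approximation without a tie correction; both simplifications are assumptions made for illustration and will differ slightly from the output of a statistics package. The scores are hypothetical.

```python
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided P-value by normal approximation)."""
    # U counts the pairs in which the x value exceeds the y value; ties count 0.5
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    n1, n2 = len(x), len(y)
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction (assumption)
    z = (u - mu) / sigma
    p = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u, p


# Hypothetical scale scores for two performance-status groups
lower_status = [60, 70, 80, 90, 100]
higher_status = [0, 10, 20, 30, 40]
u_stat, p_value = mann_whitney_u(lower_status, higher_status)
print(u_stat, round(p_value, 4))
```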

Responsiveness of the QLQ-C30 and QLQ-GINET21 tools to clinical change of health status over time was evaluated using the three sets of questionnaires for each patient over time. The variation of mean QoL scores over time for items and scales reflects changes in QoL and performance status. This was not correlated with radiological response in this study. The mean change over time and cancer NET group was compared by repeated-measure ANOVA.
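The within-patient part of such an analysis can be sketched as a one-way repeated-measures ANOVA over the three time points. This is a simplified illustration only: it ignores the between-group factor used in the study, assumes complete data for every patient, and returns the F statistic with its degrees of freedom rather than a P-value. The scores are hypothetical.

```python
def rm_anova_f(scores):
    """scores: one row per patient, one column per time point (complete cases)."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    time_means = [sum(row[j] for row in scores) / n for j in range(k)]
    patient_means = [sum(row) / k for row in scores]
    # Partition the total sum of squares into time, patient and residual parts
    ss_time = n * sum((m - grand) ** 2 for m in time_means)
    ss_patient = k * sum((m - grand) ** 2 for m in patient_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_error = ss_total - ss_time - ss_patient
    df_time, df_error = k - 1, (k - 1) * (n - 1)
    f_stat = (ss_time / df_time) / (ss_error / df_error)
    return f_stat, df_time, df_error


# Hypothetical scale scores at baseline, 3 months and 6 months
scores = [[50, 40, 35], [60, 45, 50], [30, 25, 20], [70, 50, 55]]
print(rm_anova_f(scores))
```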

The statistical software program SPSS 18 was used for data management and for the linear transformation of responses from the QLQ-C30 and the QLQ-GINET21 to a 0–100 scale, following EORTC guidelines and using SPSS syntax programming. Stata 11 was employed for all analyses, and a conservative threshold of P<0.01 was considered statistically significant. Where items were missing, the method proposed in the EORTC scoring manual to impute missing values was used (Fayers et al, 2001).
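The handling of missing items can be sketched as follows, assuming the rule usually described for the EORTC scoring manual: a scale is scored from the mean of the answered items when at least half of its items were answered, and is otherwise treated as missing. The function name and the use of `None` for an unanswered item are hypothetical.

```python
def scale_raw_score(responses):
    """Mean of the answered items, or None if fewer than half were answered."""
    answered = [r for r in responses if r is not None]
    if 2 * len(answered) < len(responses):
        return None  # too few items answered: scale score set to missing
    return sum(answered) / len(answered)


# Hypothetical three-item scale with one unanswered item
print(scale_raw_score([2, None, 4]))
```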

Sample size

A sample of 253 patients (which covered 5% attrition), each responding to 21 items, achieves 80% power to detect the difference between the coefficient α under the null hypothesis of 0.60, and the coefficient α under the alternative hypothesis of 0.71, using a two-sided F-test with a significance level of 0.01 (Bonett, 2002). The sample size calculation is compatible with the accepted ‘rule of thumb’ that at least 10 responses per item are needed (Tabachnik and Fidel, 1993); therefore, the minimum sample would be 210.
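The figures in this calculation can be reproduced approximately with the sketch below. It assumes a normal-approximation formula of the kind given by Bonett (2002) for testing a single coefficient α, n ≈ [2k/(k−1)](z₁₋ₐ/₂ + z_power)²/ln(δ)² + 2 with δ = (1−α₀)/(1−α₁); the exact formula and software used by the authors are not stated, so the function is illustrative only.

```python
from math import ceil, log
from statistics import NormalDist


def alpha_test_sample_size(k, alpha0, alpha1, sig=0.01, power=0.80):
    """Approximate n for a two-sided test of coefficient alpha (assumed formula)."""
    z = NormalDist().inv_cdf
    z_sum = z(1 - sig / 2) + z(power)
    delta = (1 - alpha0) / (1 - alpha1)
    return ceil(2 * k / (k - 1) * z_sum ** 2 / log(delta) ** 2 + 2)


n_required = alpha_test_sample_size(k=21, alpha0=0.60, alpha1=0.71)
n_with_attrition = ceil(n_required / (1 - 0.05))  # allow for 5% attrition
n_rule_of_thumb = 10 * 21                         # 10 responses per item
print(n_required, n_with_attrition, n_rule_of_thumb)
```

Under these assumptions the approximation gives 240 patients before attrition and 253 after allowing for 5% attrition, in line with the sample size reported above.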

For the purpose of test–retest reliability, a sample size of 48 subjects with two observations per subject provides 82% power to detect a 10% difference in ICC of 0.90 using an F-test with a significance level of 0.05 (Walter et al, 1988).

Results

Participating centres and patient characteristics

Participating centres in the United Kingdom, the Netherlands, Poland, Denmark, Sweden, Italy, Germany, Spain, Israel and Greece together enrolled 253 NET patients who fulfilled the inclusion criteria (see Appendix Table A2). Some differences in patient characteristics and treatment approaches between centres were noted at baseline (Table 1).

Table 1 Baseline patient characteristics

Scale construction

The scaling analysis confirmed the aggregation of items into scales, with the exception of TR, which required revision. The TR scale had three items (question 39, side effects; question 40, repeated injections; question 46, weight gain (WG)), with WG being included to account for SMS analogue treatments. WG, however, showed low correlation with the other questions in the scale and Cronbach’s α coefficient was 0.29 (i.e. very weak); thus, WG was removed from the TR scale and analysed as a single item.

Completion rates and questionnaire acceptability

In total, 660 questionnaires were available for analysis. The completion rate was 90% at 3 months (227 out of 253 patients attended their appointment, 5 had died, 18 were lost to follow-up and 3 were too unwell) and 71% at 6 months (180 out of 253 patients attended their appointment, 8 had died, 54 were lost to follow-up, 4 did not comply and 7 were too unwell).

For the QLQ-C30, missing answers ranged from none for question 4 to 15 for question 18, with around half of the missing answers being from patients in the United Kingdom. For the QLQ-GINET21, missing answers ranged from 2 for question 44 to 26 for question 46, with most missing answers being from patients in Italy. There was no difference in mean Karnofsky Performance Status score between attendees and non-attendees at 3 months (83.5 vs 86.0; P=0.28) or 6 months (85.2 vs 86.0; P=0.72).

Debriefing results

Of the 253 patients in the study, 209 filled in the debriefing questionnaire: 94 (45%) did so at the outpatient clinic, 59 (28%) at home and 56 (27%) elsewhere. The mean reported completion time for the QLQ-C30 and QLQ-GINET21 together was about 15 min, with 107 patients (52%) saying the total time was <10 min, 67 (32%) saying it took 11–15 min and 29 (14%) reporting a >15 min completion time; 6 patients (3%) did not answer this question. Overall, 199 patients (95%) completed both questionnaires in <20 min. Thirty-nine patients (19%) needed help to fill in the questionnaires, usually (in 33 cases) from a family member.

Thirteen patients (6%) found one or more items confusing or difficult to answer. These included questions 11, 26, 27, 29–33, 36, 45, 46 and 50 (see Appendix Table A1). Thirty-one patients (15%) found at least one question to be not relevant.

QLQ-GINET21 scale structure: reliability and test–retest reliability

The internal consistency of the QLQ-C30 and QLQ-GINET21 scales was calculated overall and at each time point separately (Table 2). Cronbach’s α coefficients for all scales in the QLQ-C30 met the threshold of 0.70 (range 0.70–0.92) at baseline and improved consistently over time, with the exception of nausea and/or vomiting (NV), which decreased to 0.62 and 0.56 at the 3- and 6-month follow-ups, respectively. In the QLQ-GINET21, TR had the lowest α coefficient at baseline (0.49) but improved over time and reached the threshold of 0.70 at 6 months. DRW had an overall α coefficient of 0.87 and reached 0.93 at 6-month follow-up.

Table 2 Evaluation of internal consistency of the QLQ-GINET21 and QLQ-C30 scales

Reliability analysis was also carried out on the subgroup of patients with a pancreatic primary tumour; the Cronbach’s α coefficient for the QLQ-C30 scales ranged from 0.80 to 0.89, and for the QLQ-GINET21 it was 0.80 for the ED, 0.79 for the GI, 0.54 for the TR, 0.57 for the SF21 and 0.83 for the DRW scales.

ICCs were calculated for 48 individuals who filled out the 6-month questionnaires twice (once at the 6-month visit and again 2 weeks later). All ICCs for the QLQ-C30 were >0.90, except those for the second global functioning score (QL2; 0.82), cognitive functioning (CF; 0.86), NV (0.79), dyspnoea (DY; 0.77), sleep disturbance (SL; 0.89) and constipation (CO; 0.85). The ICCs for the QLQ-GINET21 were also >0.90, except for the single items of WG (0.84), MBP (0.87) and INF (0.87) (Table 2). Although an ICC ≥0.90 is ideal for discriminating between groups of patients in a clinical trial setting, an ICC ≥0.70 is acceptable (Fayers, 2004).

Construct validity

For both questionnaires, all items had correlations >0.40 with their own scales (0.58–0.91), supporting item-convergent validity. Correlations between items and other scales were <0.40 (−0.05 to 0.36), with the exception of the correlation of the SF21 scale with questions 34, 41, 43 and 47 (r=0.45, 0.53, 0.64 and 0.50, respectively), and the correlation of the DRW scale with question 42 (r=0.63), supporting item-discriminant validity. Item-discriminant validity was also confirmed by all new module scales having correlations <0.70 with each other.

Clinical validity

The mean QLQ-GINET21 scale scores at baseline, and at 3 and 6 months post-baseline for different NETs are shown in Table 3. The variation in mean scores over time for items and scales can be regarded as a measure of responsiveness.

Table 3 QLQ-GINET21 scale mean scores at baseline, and at 3 and 6 months post-baseline, by cancer type

Changes for each scale and symptom were assessed over time and are shown in Figure 1, according to treatment group. There was a clear improvement in the ED scale for Group 1 (SMS analogues and interferon) from baseline to 3 months, with some worsening between 3 and 6 months; Group 2 also showed some improvement. The GI scale did not change over time in Group 1 but improved in Group 2 (all other treatments). There was little change in the TR scale over time in either treatment group. Interestingly, both SF21 and DRW improved only in Group 2. SX worsened over time in Group 1 but improved in Group 2. BI was worse in Group 2 at the 6-month time point, whereas there were no notable changes over time in Group 1. Other changes over time were small. Overall, the mean change of the QLQ-GINET21 scales over time was statistically significant for ED (P=0.0012), GI (P=0.0081), SF21 (P=0.0065) and DRW (P<0.0001), and borderline for BI (P=0.0488). Differences in mean scores between treatment Group 1 (SMS analogues and interferon) and Group 2 (all other treatments) were borderline statistically significant for GI (P=0.0544), TR (P=0.0833), SF (P=0.0067), DRW (P=0.0438) and BI (P=0.0699), and not significant for all other scales and items.

Figure 1 QLQ-GINET21 scale and single-item mean scores at baseline, and at 3 and 6 months post-baseline (x axis: scale/item; y axis: mean score). The subscripts of 1 and 2 represent Group 1 and Group 2, respectively.

The ability of the QLQ-GINET21 to assess differences between groups at a specific time point was examined using Karnofsky Performance Status scores at baseline classified into two groups (<80 and ≥80). The difference in QLQ-GINET21 scale scores between these two Karnofsky groups was evaluated by the Mann–Whitney U-test. The results showed that the QLQ-GINET21 was able to assess differences between the two Karnofsky groups, showing statistically significant results for ED, GI, TR, SF21, DRW, MBP, SX and BI scales (in all cases, P<0.001), and for INF (P=0.0029), but not for WG (P=0.869). Similar significant results were found for the majority of core QLQ-C30 questionnaire scales: QL2, physical functioning (PF2), CF, SF and NV (in all cases, P<0.0001), role functioning (RF2; P=0.0002), emotional functioning (EF; P=0.0026), fatigue (FA; P=0.0194), pain (PA; P=0.001), DY (P=0.0002), SL (P=0.0004), CO (P=0.033), but not for appetite loss (AP; P=0.557), diarrhoea (DI; P=0.50) or financial difficulties (FI; P=0.105).

In addition to the known-group analyses, the mean scores of scales in the QLQ-C30 and QLQ-GINET21 at baseline, and at 3 and 6 months were described. Using repeated-measure ANOVA, the difference in the mean scores between pancreatic NETs and those of other origins were investigated (Table 3).

For the QLQ-C30, overall, only the mean score for the DI scale was statistically significantly different (P<0.0001); borderline significance was found for SL (P=0.0376) between pancreatic and other NETs. The mean change in values over time (baseline, and 3 and 6 months) was strongly significant for the scales of CO (P=0.0067), DI (P=0.0017) and NV (P=0.002), but borderline for EF (P<0.05) and AP (P=0.03).

For patients with pancreatic NETs, mean scores were higher than for patients with non-pancreatic NETs on the QLQ-C30 scales QL2 (65 vs 64), RF2 (75 vs 71), CF (83 vs 81) and SF (77 vs 74), and were lower on the QLQ-C30 scales PA (25 vs 26), DY (16 vs 20) and DI (17 vs 35) and on the QLQ-GINET21 scale and items SF21 (35 vs 38), MBP (29 vs 30), SX (32 vs 35) and INF (8 vs 12). There was no statistically significant difference between these two groups of primary sites in any of the scales in the QLQ-GINET21.

All correlations between the QLQ-C30 and the QLQ-GINET21 scales were <0.70 in absolute value, except for the correlation between the SF21 scale and the SF scale of the QLQ-C30, which was −0.70 (Table 4).

Table 4 Correlation between QLQ-C30 and QLQ-GINET21 scales

Discussion

The QLQ-GINET21 was conceived as a tool that could combine scales and single items that assess clinically meaningful QoL parameters. In practice, it might be useful to review each item separately to identify specific problems experienced by individual patients. As a research tool, however, it may be preferable to summarise the items in clinically appropriate scales to reduce the amount of data to be analysed and communicated. This validation study found the proposed scale structure of the QLQ-GINET21 questionnaire to be appropriate, with the exception of the WG item, which did not correlate well with the other items in the TR scale. Therefore, a scale structure that includes five scales and five single items was instead validated (Table 2). The scales demonstrate a good degree of internal consistency, with high Cronbach’s α coefficients for all scales, except for TR and SF21, which did not meet the threshold of 0.70 at baseline. The SF21 scale in the QLQ-GINET21 is a new type of scale, in which questions are ordered to assess how difficult SF is for the patient. The SF scale in the QLQ-C30 correlated quite highly with the SF21 scale in the QLQ-GINET21, confirming that both scales assess similar aspects of the same problem. The SF21 scale in the QLQ-GINET21, however, seems to be sufficiently different from the SF scale in the QLQ-C30 for it to provide additional information; for example, it was more responsive to changes in SF that occurred over time. None of the other scales in the QLQ-GINET21 and QLQ-C30 showed high correlation, thus confirming that the QLQ-GINET21 provides additional information to the more generic QLQ-C30 tool. Scales/items within the QLQ-GINET21 that correlated moderately with each other were DRW, SF21 and SX, and one might therefore argue that these could be combined into one scale; however, because they may respond differentially to different NET therapies, it appears reasonable to keep them separate.

There was reasonably good compliance with the questionnaires and few missing items.

Changes in scores over time are to be expected. With SMS analogues and interferon, the ED scale for Group 1 was improved at 3 months from baseline but not at 6 months, as might be expected with initial improvement followed by tolerance to the SMS analogues. The GI symptoms did not improve in Group 1, which might be expected since SMS analogues can cause some GI symptoms as well as cure them; other therapies (Group 2) seemed to improve GI symptoms. DRW and SF21 improved only in Group 2, suggesting the use of a definitive treatment other than SMS analogues improved patients’ psychosocial issues. The change in SX was different in the two groups and might imply an adverse effect of SMS analogues, although this would need further studies for confirmation.

The scales within the QLQ-GINET21 were more sensitive to differences between patient subgroups (as determined by patients’ Karnofsky Performance Scores) and more responsive to changes over time than were the scales in the QLQ-C30. These properties might be considered the most important aspects of any clinical research tool, and highlight the importance of including disease-specific QoL assessments within clinical trials. The QLQ-GINET21 is designed to be used together with the existing QLQ-C30, and jointly they may provide a reliable tool that is useful for a broad evaluation of QoL in clinical studies of patients with GINETs.

The baseline QoL values for patients, as measured by both questionnaires, showed that there were variations between centres, which could be explained by differences in patient selection and patient characteristics (primary tumour site, tumour stage and baseline Karnofsky Performance Status). At baseline, all scales and single items seemed to distinguish between subgroups of patients on the basis of Karnofsky Performance Status. There was no correlation between Karnofsky Performance Status and the single item of WG in the QLQ-GINET21, which would be expected as, in the context of cancer, WG resulting from treatment would not be related to low functional status. Similarly, in the QLQ-C30, as might be expected, there was no correlation between Karnofsky Performance Status and the items of FI and CO; however, surprisingly, there was also a lack of correlation with DI, which is common to many GINETs and would be expected to be associated with a poorer performance status.

The data on most scales and single items of the QLQ-GINET21 were skewed toward low values, but responses covered the full range of scores for most of the new scales at all evaluation time points, thus confirming the response categories are appropriate.

Importantly, the QLQ-GINET21 was well accepted by patients; across all 660 questionnaires evaluated, the rate of missing values was 2%, with the exception of question 40 (relating to repeated injection), where the rate of missing values was 7%, and question 51 (relating to sex life), where it was 5%. The fact that 90% of patients answered the questionnaires at the second time point was as expected, and the lower value of 71% at the third time point was not ideal but was similar to what has been found in other large multicentre validation studies.

The data comparing tumours of different origin suggest that the QLQ-GINET21 can be used for pancreatic as well as non-pancreatic NETs, although it is accepted that pancreatic numbers were too small for a proper validation in patients with this tumour type alone.

Conclusion

The QLQ-GINET21 has been confirmed in this phase 4 validation study to have a module structure and scales that are clinically sensitive, reliable and valid. We recommend that this disease-specific module is used in randomised clinical trials of therapies, which would give further information on its clinical utility; indeed, some such trials are already in progress.