Background
Quality of Life (QoL) domains are usually reported in terms of scores. In order to assess the effects of a new drug or intervention, researchers must determine the minimal difference in these scores that is deemed clinically important. Only by knowing this can they calculate the sample size for a trial and interpret which results are clinically meaningful.
Likewise, clinicians, patients, and policy-makers need to know what changes in QoL scores over time or differences in scores between groups are clinically relevant. If, say, we have a scale with a potential range of 0–100, and a patient had a score of 87 before surgery and 80 afterwards, would the difference be clinically relevant?
The difficulty lies in (a) how best to define these concepts and (b) how to measure them empirically. Numerous terms are used to describe the issue at hand, such as minimal important difference, minimal detectable change, and clinical significance [1]. We will use the terms minimal important difference (MID) and minimal important change (MIC), the MID being the minimal difference in QoL between patient groups that is clinically relevant and the MIC being the minimal change in QoL over time that is clinically relevant [2].
A familiar definition of the MID is the smallest difference that would lead to a change in treatment [1, 3, 4], while the MIC is defined as the minimal difference over time considered relevant by the patient [5]. Both can be measured using so-called anchor-based approaches (which map QoL scores onto an external indicator) or distribution-based approaches (which rely on statistical criteria). Several papers summarise and discuss these approaches and the various methods for deriving estimates [1, 6–9]. There is no gold-standard method for estimating the MID or MIC. Distribution-based methods alone have often been found insufficient [10] because they do not directly capture the patient's or clinician's perspective on the meaning of scores; this limitation should be addressed by combining them with anchor-based approaches. A further recommendation is to report a range of values instead of a single number, since different methods may yield different estimates [1].
A difference of 10 points is often assumed to be the appropriate MID and MIC for the EORTC QLQ-C30 (European Organisation for Research and Treatment of Cancer Quality of Life Core Questionnaire), based on the work of Osoba [5]. Cocks et al. recommended using scale-specific MICs [11]. Recently, the EORTC Quality of Life Group analysed previous EORTC trials to define various MICs for the EORTC QLQ-C30 scales [12]. Other authors have obtained MIC estimates by carrying out observational studies [13, 14] or by using existing data from past clinical trials [15]. For the EORTC disease-specific modules (as opposed to the core instrument), initial studies investigating MIDs or MICs have been published [16, 17]. Our aim is to calculate MID and MIC estimates for the recently updated head and neck module, the EORTC QLQ-HN43 [18–20]. As there is no gold-standard method for calculating the MID and MIC, we first developed a methodological approach exploring various methods, the results of which are presented in this paper.
The research questions of the current study were: (1) What methods should we use to determine the clinically relevant minimal score differences of the EORTC QLQ-HN43 scales between patient groups (the MID)? (2) What methods should we use to determine the clinically relevant minimal changes in score over time for the EORTC QLQ-HN43 scales (the MIC)?
In the current paper, we focus on the Swallowing scale, in view of the importance of swallowing difficulties for patients with head and neck cancer and the scale's wide use in clinical studies [21]. We anticipated that this work would provide a useful model for determining MIDs and MICs for the other scales of the EORTC QLQ-HN43 module.
Methods
Study design
In an international, multi-centre prospective validation study of the updated EORTC head and neck cancer module [18], patients with head and neck cancer under active treatment (Group 1) completed a questionnaire at the following time points: before the onset of treatment (t1), three months after baseline (t2), and six months after baseline (t3). Based on previous studies [22–27] and on clinical experience, we assumed that Quality of Life would deteriorate for most patients between t1 and t2 and would improve somewhat between t2 and t3. The validation study also included a group of post-treatment head and neck cancer survivors (Group 2) to determine test–retest reliability. For the determination of the MID and MIC presented in this paper, we used the data of Group 1 only.
Inclusion criteria, exclusion criteria, and data collection
Patients with the following ICD-10 codes were included: larynx (C32), lip (C00), oral cavity (C01–06), salivary glands (C07–08), oro-hypopharynx (C09–10, C12–14), nasopharynx (C11), nasal cavity (C30), nasal sinuses (C31), sarcoma in the head and neck region (C49), and lymph node metastases from an unknown primary in the head and neck area (C77, C80.0). We did not include patients with tumours of the eyes, orbit, thyroid, or skin (even if in the head and neck area), or with lymphomas in the head and neck region. Additional inclusion criteria were sufficient language proficiency and sufficient cognitive functioning (as assessed by the study coordinator), age 18 years or over, and written informed consent.
Upon admission to the hospital or clinic, eligible patients received an invitation to participate in the study, and oral and written information in accordance with ethical and governance requirements of each participating centre. All sites obtained ethical approval in accordance with regional and national requirements. Patients were given time to consider the study and ask any questions before consenting and participating.
Instruments
The EORTC QLQ-C30 [28] and the EORTC QLQ-HN43 [18, 19] questionnaires were administered at all three time points.
At t2 and t3, a subset of participants also completed the Subjective Significance Questionnaire (SSQ) [5]. In the SSQ, patients were asked to rate the extent to which their QoL had changed (improved or worsened) in the domains of swallowing, speech, dry mouth, and global quality of life compared with the previous time point. The first three domains were chosen because patients with head and neck cancer had previously rated them as having the highest priority [20], and global quality of life was included because of its general applicability. The response options for each of these items ranged from "very much worse" to "very much better" on a 7-point Likert scale. Consistent with the literature, the options "a little worse" and "a little better" defined the MIC from the patient's perspective, since these categories represent minimal change [29]. The current analyses included the Swallowing item, since this was most relevant to the EORTC QLQ-HN43 Swallowing scores.
Information on the patient's gender, age, education, tumour site, tumour stage, Karnofsky Performance Score (KPS), and treatment received was documented on a Case Report Form by study staff.
Analysis
The statistical analysis plan was developed based on information from published papers and the experience of the research clinicians involved in the study. After discussion within the group, we decided to employ a variety of methods to determine the MID and MIC, in order to examine their applicability to the EORTC QLQ-HN43 scales using the Swallowing scale as an example. The results should serve as a basis for deciding which methods to use when analysing the MID and MIC for all the other EORTC QLQ-HN43 scales. The Swallowing scale was used because, according to a systematic review, it has been applied most often in previous trials and clinical studies [21].
Descriptive analyses
The sample for the MID and MIC analyses comprised patients under active treatment who participated at least twice. The frequencies and percentages of the following variables were calculated: gender, age, education, tumour site, UICC tumour stage, Karnofsky Performance Score (KPS), and treatment received (as documented at t2).
For the EORTC QLQ-HN43 Swallowing scale, the mean change (delta, ∆) and its standard deviation, minimum, and maximum were calculated for changes between t1 and t2 and between t2 and t3.
Methods for determining the Minimal Important Difference (MID)
1. Anchor-based approach
The assumption was that patients differ clinically when they have a KPS of 60 (requires some assistance but able to care for most of own needs) vs. 70 (cares for self, unable to carry on normal activity or do active work) and when they have a KPS of 70 vs. 80 (normal activity with effort), since these are often the thresholds for participation in clinical trials and for treatment recommendations. However, it was unclear whether the KPS correlates with the various domains of the EORTC QLQ-HN43 questionnaire, which is necessary if it is to serve as a suitable anchor. The group therefore decided to calculate Spearman correlation coefficients between the KPS at t2 and the EORTC QLQ-HN43 scales at t2. If the correlation coefficient with the KPS was |r| ≥ 0.40, the following calculations were planned: the mean difference in the EORTC QLQ-HN43 scale score between patients with KPS 60 and KPS 70 (at t2), and the mean difference between patients with KPS 70 and KPS 80 (at t2). If the correlation coefficient was |r| < 0.40, we considered the KPS not to be a suitable anchor for this scale in this population [30]. The Spearman correlation coefficients between the KPS and the EORTC QLQ-HN43 scales were also calculated at t1 to investigate the robustness of the results.
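As an illustration only, the following Python sketch shows how this decision rule could be implemented; the data frame and its column names (kps_t2, swallowing_t2) are hypothetical placeholders and not part of the original analysis plan.

```python
# Minimal sketch of the anchor-based MID step described above (assumed data layout).
import pandas as pd
from scipy.stats import spearmanr

def anchor_based_mid(df: pd.DataFrame) -> dict:
    """Check the KPS anchor at t2 and, if suitable, derive MID estimates."""
    data = df.dropna(subset=["kps_t2", "swallowing_t2"])

    # Spearman correlation between the anchor (KPS) and the scale score.
    rho, _ = spearmanr(data["kps_t2"], data["swallowing_t2"])
    if abs(rho) < 0.40:
        return {"rho": rho, "note": "KPS not a suitable anchor for this scale"}

    # Mean scale score per KPS level; differences between adjacent levels
    # (60 vs. 70 and 70 vs. 80) serve as MID estimates.
    means = data.groupby("kps_t2")["swallowing_t2"].mean()
    return {
        "rho": rho,
        "mid_kps_60_vs_70": means[60] - means[70],
        "mid_kps_70_vs_80": means[70] - means[80],
    }
```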
2. Distribution-based approach
We calculated 0.5 and 0.3 of the standard deviation [7] as well as the standard error of measurement (SEM) of the Swallowing scale score at t2. The SEM was defined as SEM = SD × √(1 − Cronbach's alpha), which gives the measurement error for an individual measurement (i.e., at patient level). The values for Cronbach's alpha are published elsewhere [18]; for the Swallowing scale, Cronbach's alpha was 0.85 at t2.
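These distribution-based criteria can be expressed in a few lines; the sketch below is illustrative only, with scores_t2 as a hypothetical array of Swallowing scores at t2 and alpha set to the published Cronbach's alpha of 0.85.

```python
# Distribution-based MID criteria: 0.3 SD, 0.5 SD, and the SEM.
import numpy as np

def distribution_based_mid(scores_t2: np.ndarray, alpha: float = 0.85) -> dict:
    sd = np.nanstd(scores_t2, ddof=1)   # sample standard deviation at t2
    sem = sd * np.sqrt(1.0 - alpha)     # SEM = SD * sqrt(1 - Cronbach's alpha)
    return {"0.3_sd": 0.3 * sd, "0.5_sd": 0.5 * sd, "sem": sem}
```

The same quantities, computed from the change scores rather than the t2 scores, correspond to the distribution-based MIC estimates described further below.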
Methods for determining the Minimal Important Change (MIC)
1. Anchor-based approach
We calculated the mean delta of the Swallowing scale scores for those patients who reported that their swallowing had changed "a little", both for changes between t1 and t2 and for changes between t2 and t3. Calculations were made separately for patients with improved and with deteriorated swallowing.
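A minimal sketch of this mean-change calculation, assuming a hypothetical data frame with a change-score column (delta) and the SSQ swallowing response (ssq_swallowing), might look as follows.

```python
# Mean change score among patients reporting "a little" change on the SSQ anchor.
# Column names and response labels are hypothetical placeholders.
import pandas as pd

def mic_mean_change(df: pd.DataFrame, delta_col: str = "delta",
                    ssq_col: str = "ssq_swallowing") -> dict:
    return {
        "mic_improvement": df.loc[df[ssq_col] == "a little better", delta_col].mean(),
        "mic_deterioration": df.loc[df[ssq_col] == "a little worse", delta_col].mean(),
    }
```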
In addition, we used receiver operating characteristic (ROC) curves, as suggested by Kvam et al. [13]. The procedure was as follows:
(a) We created groups of patients with improved, unchanged, and deteriorated swallowing, using responses to the Swallowing change item of the SSQ. Responses of "very much worse," "moderately worse," and "a little worse" were classified together as "deteriorated", and responses of "very much better," "moderately better," and "a little better" were classified together as "improved". Patients responding "about the same" were classified as "unchanged".
(b) Based on step (a), two dichotomous variables were created for the SSQ swallowing anchor: improved vs. not improved (with "unchanged" and "deteriorated" considered as "not improved") and deteriorated vs. not deteriorated (with "unchanged" and "improved" considered as "not deteriorated").
(c) The area under the curve (AUC) was calculated separately for deterioration between t1 and t2 (because most patients were expected to experience a worsening of functioning during the treatment period, when toxicities are pronounced) and for improvement between t2 and t3 (because most participants were expected to report gains in functioning during the immediate post-treatment period). The cut-off point with the highest Youden index (sensitivity + specificity − 1) was considered to be the MIC [31] (see the sketch below).
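The following sketch illustrates steps (b) and (c) for the deterioration analysis, assuming scikit-learn is available; the input arrays (change scores from t1 to t2 and a 0/1 indicator derived from the dichotomised SSQ anchor) are hypothetical placeholders. On EORTC QLQ-HN43 symptom scales, a positive change score indicates deterioration.

```python
# ROC-based MIC: find the change-score cut-off with the highest Youden index.
import numpy as np
from sklearn.metrics import roc_curve

def mic_roc(change_scores: np.ndarray, deteriorated: np.ndarray) -> float:
    """change_scores: t2 minus t1 scores; deteriorated: 1 if the anchor says 'worse'."""
    fpr, tpr, thresholds = roc_curve(deteriorated, change_scores)
    youden = tpr - fpr                          # sensitivity + specificity - 1
    return float(thresholds[np.argmax(youden)])
```

For the improvement analysis (t2 to t3), the sign of the change scores would need to be flipped (or the score direction otherwise aligned with the improved/not-improved indicator) before applying the same function.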
Lastly, we applied predictive modelling to obtain MIC estimates, as suggested by Terluin [32]. Here, the MIC is defined as (ln(odds_pre) − intercept) / regression coefficient, where odds_pre is the pre-test odds of belonging to the changed group and the intercept and regression coefficient are taken from a logistic regression of the dichotomised anchor on the change score.
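As an illustrative sketch of this predictive-modelling approach (not the authors' exact implementation), the following code uses statsmodels to fit the logistic regression and applies the formula above; the input arrays are hypothetical placeholders.

```python
# Predictive-modelling MIC: MIC = (ln(odds_pre) - intercept) / regression coefficient.
import numpy as np
import statsmodels.api as sm

def mic_predictive(change_scores: np.ndarray, improved: np.ndarray) -> float:
    """improved: 1 if the dichotomised anchor classifies the patient as improved."""
    X = sm.add_constant(change_scores)    # design matrix: intercept + change score
    fit = sm.Logit(improved, X).fit(disp=0)
    intercept, slope = fit.params
    p = improved.mean()
    odds_pre = p / (1.0 - p)              # pre-test odds of improvement
    return float((np.log(odds_pre) - intercept) / slope)
```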
2. Distribution-based approach
We calculated 0.3 and 0.5 of the standard deviation as well as the SEM of the delta in Swallowing scores between t1 and t2 to determine the MIC for deterioration, and 0.3 and 0.5 of the standard deviation as well as the SEM of the delta in Swallowing scores between t2 and t3 to determine the MIC for improvement.
Discussion
In this study, we examined which methods would be useful for determining the MID and MIC for the head and neck cancer module of the EORTC questionnaire. Both distribution- and anchor-based approaches were applied to the Swallowing scale, because swallowing is an important domain of QoL in head and neck cancer patients and the corresponding scale of the EORTC instrument is the one most often used in clinical studies [21]. The aim was to explore which of the methods can later be used for determining the MID and MIC for all scales of the EORTC QLQ-HN43.
The various results were presented side by side (anchor-based vs. distribution-based; MID vs. MIC; results for deterioration vs. improvement). Although clinicians often prefer integration of results into single MID and MIC values, it is important first to understand the variety of findings and explore the applicability of the various approaches. It is also essential to keep in mind that the various estimates found in our study are based on conceptually different approaches (for example, the criterion for change can be defined by the patient or by external anchors). Researchers must determine which concept is most appropriate for their study.
The findings show that the anchor-based approach was ineffective for defining minimal important differences between patient groups (the MID), as the only external anchor available was the Karnofsky Performance Score and, according to our predefined thresholds, its correlations with the scale were poor. This result highlights the importance of verifying, rather than assuming, that potential anchors have meaningful associations with the target QoL measure. A recent review [9] reported that roughly a quarter of oncology investigations seeking to determine anchor-based MIDs for patient-reported outcome measures neglected to verify these correlations. In the current study, the modest correlation between performance status and the EORTC QLQ-HN43 Swallowing scale removed a convenient anchor. On the other hand, these results seem to bolster the discriminant validity of the EORTC QLQ-HN43 Swallowing scale and the other domains, since they were initially developed largely because clinician-rated performance measures were deemed inadequate for capturing the richness and nuances of patients' experiences. Future studies could explore whether other external anchors are more suitable. For the Swallowing scale, tools that objectively assess swallow function or feeding-tube use, such as the Functional Oral Intake Scale [33], the penetration-aspiration score [34], or the Dynamic Imaging Grade of Swallowing Toxicity [35], might be useful; for the Social Eating scale, a subjective score of functional behaviour (for example, how often the patient eats out), such as the MD Anderson Dysphagia Inventory [36] or the Mann Assessment of Swallowing Ability [37], might be used. It is likely that external anchors are scale-specific, i.e., they cannot be used to determine the MID for all scales.
Distribution-based approaches were applicable. The criteria of 0.3 and 0.5 standard deviations and the standard error of measurement yielded MIDs between 9.5 and 14.3. The advantage of the standard error of measurement is that it is relatively independent of sample size, as it is largely an attribute of the measure rather than a characteristic of the sample [2]. However, on their own, distribution-based methods are often considered suboptimal relative to anchor-based methods, as they are not intuitively understood by clinicians or patients and do not directly reflect patients' perceptions of meaningful differences [2, 8]. So what alternatives can be applied if we want to find group differences that are relevant to patients? Cocks et al. performed qualitative interviews with breast cancer patients and found that patients are able to interpret findings from the published literature and give opinions about the significance of differences found between groups [38]. Similarly, Sully et al. used qualitative interviews to explore meaningful QoL score changes among multiple myeloma patients [16]. This suggests that patients' opinions can serve as an external anchor. Although this is an interesting approach, it requires additional data collection and careful interviewing; calculations cannot simply be performed on existing data, which is why we could not use it.
The anchor-based approach using subjective patient ratings to determine minimal clinically relevant changes over time (the MIC) yielded partly usable results. Problems occurred when we applied the ROC methodology, especially when investigating improvement of quality of life: patients sometimes retrospectively rated their quality of life as improved although their module scores had actually worsened during that interval (this phenomenon was observed in both time intervals). This was an interesting observation, as both measures, the SSQ and the EORTC QLQ-HN43, were completed by the patients themselves. In the EORTC module, patients were asked to assess their current ability to swallow (solid food, pureed food, liquids, etc.), whereas in the SSQ they were asked to make a retrospective judgement on the change in their swallowing compared with the previous measurement three months earlier. Obviously, the change score required more cognitive and emotional processing: patients had to judge their current condition, recall their previous condition, and then compare the two and judge the change between them. It is likely that (dis)satisfaction with the changes additionally influences the latter. Satisfaction itself may be viewed as comprising two components: the expectations we have and the evaluation of the situation. This can lead to the so-called satisfaction paradox: if patients expect little improvement, they may be more satisfied with small improvements than if they had expected things to be much better, and vice versa [39]. In this case, perhaps some patients experienced less deterioration in swallowing than they had anticipated, or possibly an adaptive sensory response to physiological motor decline. Other processes that most likely play a role here are response shift and recall bias [6, 40, 41].
This finding emphasises that there is no 'one size fits all' approach for determining the MIC, even for patient global ratings of change. The conclusion of our study group was therefore to continue using a variety of concurrent approaches, both distribution- and anchor-based. As we move forward in future studies to determine MICs and MIDs for the other scales of the EORTC QLQ-HN43, we plan to omit the ROC analyses and the comparisons of groups based on the Karnofsky Performance Score and to apply all the other methods. It is hoped that other investigators will be able to evaluate additional clinical anchors. It should be noted, though, that these results are particular to this specific study: although not viable for the current dataset, ROC analysis has been a suitable method for estimating the MIC in other studies [13].
While developing the statistical analysis plan, we realised that many decisions needed to be taken before the results were known, and we recognised the difficulties this would entail. However, we also wanted to avoid "fishing for the best results". Consequently, we agreed to fix certain aspects beforehand and to be more explorative in others. For example, based on previous literature [22–27], we assumed that swallowing deteriorates between the time before treatment starts and three months later, and that it improves between three and six months after baseline. We therefore decided to compare scores against "a little change" in the patient ratings for these two time spans and investigated the MIC for deterioration between t1 and t2 and the MIC for improvement between t2 and t3. However, was this a good decision? At group level there was indeed an average deterioration of EORTC QLQ-HN43 Swallowing scores between t1 and t2 and an improvement between t2 and t3, but there were also some patients for whom the reverse was true. This might be related to improved symptom relief, including pain medication. Moreover, the data were considerably heterogeneous, which could have contributed to the pattern of results we observed.
Another point for discussion is that the mean change score on the EORTC QLQ-HN43 Swallowing scale was not zero for "no change" on the anchor. In future studies, a calibration could be applied in such situations, i.e., taking the difference between the mean changes on the EORTC QLQ-HN43 scale of interest for adjacent categories of the anchor measure.
Another potential limitation is that we had decided a priori to calculate the distribution-based values for the data at t2, not at t1 or t3. We did so because this time point matches one that is frequently used in clinical trials. We did not perform the calculation for all time points because we wanted to establish a method that could be applied to all scales of the module and to restrict the number of possible MID and MIC values per scale to a reasonable amount. Failure to do so could confuse clinicians and consequently lead them back to the simpler 10-point rule [5] or the 16%-of-range rule [8]. However, concentrating on only one time point carries risks. For example, if Cronbach's alpha of the instrument differs after treatment (t2) from before treatment (t1), then the SEM-based estimates differ as well. In our study, the differences in reliability were fortunately very small (Cronbach's alpha was 0.83 at t1, 0.85 at t2, and 0.85 at t3).
A further point discussed in our group was the difficulty encountered in trying to determine the MIC and MID, and we consider thresholds [42, 43] to be a potential alternative. However, we decided to continue determining MIDs and MICs because of their importance not only for researchers and clinicians but also for regulatory bodies.
Declarations
Conflict of interest
CA, IA, IB, KB, AB, PB, CB, JCS, WC, JR-D, AF, LF-G, AF, RG, EH, BBH, JI, ON-G, DS, MS, AS, KT, IT, SK-T, and NY declare they have no conflicts of interest. NK reports honoraria from Ono Pharmaceutical, Bristol Myers Squibb, Merck Biopharma, AstraZeneca, Merck Sharp & Dohme, Eisai, Bayer, and Chugai Pharmaceutical, all outside the submitted work. LL reports grants and personal fees from AstraZeneca, BMS, Boehringer Ingelheim, Debiopharm International SA, Eisai, Merck-Serono, MSD, Novartis, and Roche; personal fees from Bayer, Sobi, Ipsen, GSK, Doxa Pharma srl, Incyte Biosciences Italy, Amgen, and Nanobiotix; and grants from Celgene International, Exelixis, Hoffmann-La Roche, IRX Therapeutics, Medpace, and Pfizer, all outside the submitted work. HM reports grants from Cancer Research UK during the conduct of the study; personal fees and other support from Warwickshire Head Neck Clinic Ltd; personal fees from AstraZeneca, MSD, Sanofi Pasteur, and Merck; grants from GSK Biologicals, Sanofi Pasteur, GSK PLC, and AstraZeneca; and non-financial support from Sanofi Pasteur, MSD, and Merck, all outside the submitted work. He is a National Institute for Health Research (NIHR) Senior Investigator. The views expressed in this article are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. MP reports personal fees from Meeting&Words, outside the submitted work. CS reports grants from Roche and Intuitive, and personal fees from Pfizer, Merck, MSD, and Seattle Genetics, all outside the submitted work. SS reports personal fees from Lilly, Pfizer, and Boehringer Ingelheim, all outside the submitted work.