To understand patients’ well-being, questionnaires referred to as patient-reported outcomes (PROs) are increasingly being used to measure the patient’s perspective of his or her health, quality of life, and/or functional status [1, 2]. When administered preoperatively and postoperatively, PROs can be used to measure patients’ perspectives of surgical interventions [3]. For example, the Gastrointestinal Quality of Life Index (GQLI) measures quality of life using several domains relevant to people with gastrointestinal disorders, such as gallstones [4]. Surgical removal of the gallbladder (cholecystectomy, usually laparoscopic) is considered the most effective treatment for symptomatic gallstones [5, 6], which can produce highly uncomfortable symptoms such as severe abdominal pain, nausea, bloating, and constipation [7, 8]. English, German, Swedish, and Taiwanese versions of the GQLI have been validated for assessing the effects of gallstone disease and treatment on health-related quality of life (HRQoL) [4, 9, 10, 11], and the instrument is widely used to compare patients’ gastrointestinal-related health status before and after cholecystectomy [8, 12, 13].

Previous research has reported that age, sex, and preoperative health status are associated with GQLI values; older patients with gallstones have shown poorer GQLI scores and smaller postoperative improvements [13]. The relationship between sex and health status is less clear: women presenting with gallstones have reported worse health status than men and greater improvement after cholecystectomy in unadjusted analyses, but similar improvement when controlling for age, diagnosis, comorbidities, and pre-intervention HRQoL [13], suggesting that age and pre-intervention health status may explain sex-based differences in health gains owing to cholecystectomy. However, women have a greater tendency to develop acute pain postoperatively than males, including after cholecystectomy [14, 15].

To infer whether cholecystectomy results in meaningful improvement in HRQoL, patients’ change in PRO score can be compared to the minimal important difference (MID). The MID measures the smallest change in a PRO score that is perceived by patients as meaningful [16, 17, 18] and serves as an important reference point for evaluating the effectiveness of therapeutic interventions [19, 20]. For example, if the context-specific MID of an instrument is ten points, patients reporting a change in PRO score of five points post surgery relative to their baseline score may not have experienced a meaningful change in health due to the intervention. The MID is relevant because tests of statistical significance depend on sample size: in a sufficiently large sample, even a very small change can be statistically significant. Thus, while statistical significance describes whether a change has occurred, the change may not be meaningful from the patient’s point of view. Moreover, the MID provides a ‘benchmark’ against which to compare the effectiveness of interventions and separates the concept of statistical significance from that of meaningful clinical change as reported by the patient.

The MID can vary by population and context [1, 21], and to our knowledge no MID values of the GQLI have been reported for age, sex, or preoperative health subgroups of gallstone patients undergoing cholecystectomy, or for the English-language GQLI. To date, one study has estimated that the MIDs of GQLI domains among cholecystectomy patients range from 6.42 to 7.64 on a transformed scale of 0–100 [22]. However, this study needs very careful interpretation for several reasons. First, the MID was only calculated for improvements – not worsening – of health status. Second, the results are based on retrospective self-reported health and therefore may be subject to recall bias. Finally, the study is based on the Chinese version of the GQLI among a Taiwanese sample of patients and, thus, there may be some cultural interpretations that do not generalize to other countries or populations.

To address this gap in the literature, the primary purpose of this study is to calculate the MID for the domains and total score of the GQLI among a sample of English-speaking patients undergoing cholecystectomy for gallbladder disease. This study hypothesizes that the MID varies between sex, age, and preoperative health subgroups. Such knowledge can inform clinicians’ interpretation of GQLI scores to better identify candidates for cholecystectomy based on baseline health, expected improvement from cholecystectomy, and the appropriate subgroup MID, and can establish MID values for research or clinical trial purposes [20].

Material and methods

Patient recruitment and PRO collection

This study of the MID values of the GQLI is based on retrospective analyses of an existing cohort of patients who underwent cholecystectomy for treatment of symptomatic gallstones in Vancouver, Canada. Sequential patients scheduled for elective cholecystectomy with 12 general surgeons practicing in four hospitals were prospectively identified and contacted to participate. Exclusion criteria were being younger than 19 years of age, being unable to communicate in English, or residing in a long-term care facility.

Patients were contacted preoperatively by phone or mail between October 2014 and February 2019 [23, 24]. Participants completed a survey package which included the GQLI [4, 8, 10, 25] and the EQ-5D(3L) [26, 27] preoperatively. Participants completed the same PROs postoperatively, six months following their surgery, and postoperative data collection occurred until October 2019. In order to reduce loss to follow-up, participants who did not return their postoperative PROs were contacted via phone or email, reminding them to return their survey package.

This secondary analysis of participants’ PROs and administrative data was approved by the Behavioral Research Ethics Board of the University of British Columbia.

The GQLI

The GQLI has thirty-six items asking patients about their gastrointestinal symptoms and interference. There are five domains of measurement of the GQLI, with nineteen items asking about symptoms, five items about emotions, seven about physical functions, four about social functions, and one item that relates to medical treatment effects [22]. Each of the 36 items has five response options ranging from “all of the time” to “never” with responses coded from 0 to 4, respectively [4]. Individual item scores are summed to produce a total score ranging from 0 (worst health status) to 144 (best health status) [4]. Domain scores are calculated by dividing the sum of the domain’s items’ responses by the number of items in the domain [13].
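
The scoring rules described above can be sketched in a few lines. This is a minimal illustration rather than the instrument's official scoring procedure: only the number of items per domain comes from the text, while the positions of the items within the questionnaire (the layout built by `build_domain_items`) are a hypothetical placeholder, and responses are assumed to be already coded 0 to 4.

```python
# Item counts per domain are taken from the text; which questionnaire items
# belong to which domain is NOT given there, so the consecutive layout built
# below is a hypothetical placeholder for illustration only.
DOMAIN_SIZES = {
    "symptoms": 19,
    "emotions": 5,
    "physical_function": 7,
    "social_function": 4,
    "medical_treatment": 1,
}


def build_domain_items(sizes):
    """Assign consecutive item indices to each domain (illustrative only)."""
    items, start = {}, 0
    for name, n_items in sizes.items():
        items[name] = list(range(start, start + n_items))
        start += n_items
    return items


DOMAIN_ITEMS = build_domain_items(DOMAIN_SIZES)


def score_gqli(responses):
    """Return (total score, domain scores) for 36 responses coded 0-4."""
    if len(responses) != 36:
        raise ValueError("expected 36 item responses")
    if any(r not in (0, 1, 2, 3, 4) for r in responses):
        raise ValueError("responses must be coded 0 to 4")
    total = sum(responses)  # 0 (worst health status) to 144 (best)
    domains = {
        name: sum(responses[i] for i in idx) / len(idx)  # mean item score
        for name, idx in DOMAIN_ITEMS.items()
    }
    return total, domains


# A respondent answering with the best option on every item.
best_total, best_domains = score_gqli([4] * 36)
# A made-up respondent: 2 on the first 19 items, 4 on the remaining 17.
mixed_total, mixed_domains = score_gqli([2] * 19 + [4] * 17)
```

Under this scoring, the total score and the domain scores sit on different ranges (0 to 144 versus 0 to 4), which matters when MID values for domains and the total score are compared.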

The GQLI has been validated for assessing the effects of gallstone disease and treatment on patients’ quality of life [4, 10]. Previous studies have reported improvement in the GQLI total score and all domain values following cholecystectomy among patients with symptomatic gallstones [8, 10, 25].

The EQ-5D(3L)

The EQ-5D(3L) is a five-item instrument that measures general health status [26, 27]. The instrument measures five domains, including: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Each item’s response ranks the severity of problems: no problem, some problems, and severe problems. Combinations of the items’ responses are associated with health state utility values, a preference-based measure of health that ranges from values less than zero (worse than death) to 1 (full health) [28, 29]. Canadian-based utility values are available, derived from a sample of Canadians and independent of this study [30].

Analysis of patient demographics

Participants’ PROs were linked with hospital discharge summary data. From the hospital discharge, participants’ age, sex, and comorbidities were ascertained. Because of the associations between age and GQLI score, as well as score change [13], age was categorized into four groups ranging from “younger than 50 years” through “older than 70 years”, as summarized in Table 1. Baseline health was categorized into four subgroups based on GQLI total score quartiles, as shown in Table 1. Age and baseline health subgroups were selected to keep sample sizes approximately equal between groups. Using the comorbidities, the Charlson comorbidity index [31] was calculated as an integer-valued representation of participants’ morbidity burden.

Table 1 Demographic and clinical characteristics of participants

The demographic characteristics of the sample of participants were summarized using means and percentages. Participants’ average change in GQLI scores and standard deviation were calculated for the overall sample. Differences between baseline and postoperative scores were tested using t-tests; the distribution of the scores was assessed visually and found to be approximately normal. The MIDs of the GQLI overall score and its five domains were calculated for the overall cohort as well as age and sex subgroups, using the methods described below.
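
As an illustration of the pre/post comparison, the sketch below computes a t statistic for within-patient differences. The text does not state which form of t-test was used; a paired test is assumed here because the same participants were measured twice. The scores are made up for five hypothetical patients and are not study data, and the p-value is omitted because the Python standard library has no t distribution.

```python
import math
import statistics


def paired_t_statistic(pre, post):
    """Return (t statistic, degrees of freedom) for paired pre/post scores."""
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [after - before for before, after in zip(pre, post)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample SD uses n - 1).
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if se == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / se, len(diffs) - 1


# Made-up GQLI total scores for five hypothetical patients (not study data).
pre_scores = [90, 100, 110, 95, 105]
post_scores = [100, 112, 115, 108, 110]
t_stat, dof = paired_t_statistic(pre_scores, post_scores)
print(round(t_stat, 2), dof)
```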

Estimation of the MID

There are two common approaches to estimating the MID of a PRO instrument: anchor-based methods and distribution-based methods [1, 18, 32]. Distribution-based methods are based solely on the observed distribution of patients’ PRO values and represent observed change in the form of a standardized metric, but they provide no direct information about the MID [1]. They are commonly based on the sample’s baseline standard deviation (SD) or effect size [1, 18, 33]. In contrast, anchor-based approaches compare changes in PRO values to an external ‘anchor’ that can identify patients whose health has changed to a small but meaningful degree and which has a nontrivial association (correlation ≥ 0.30) with the PRO of interest [1]. One commonly used anchor is the level of change in an alternative PRO measure that corresponds to the minimal change perceived as meaningful in the target population (i.e., the alternative PRO’s MID) [1, 18].

Given the heterogeneity of methods, different MID estimates can be generated even within the same sample of patients [1]. To select the ‘best’ measure, or to narrow the range of plausible MID values, researchers can synthesize estimates from different estimation methods, experience from clinical trials, and conceptual understanding of the relationship between the chosen anchor and the PRO measure. When doing so, it has been suggested that anchor-based methods be assigned the highest priority since they take into account patient perception of health, even though distribution-based approaches can be a useful starting point to detect a meaningful difference [1]. Whenever possible, sensitivity analysis encompassing multiple approaches is recommended [1].

To determine a plausible range of MID values, this study used both distribution- and anchor-based approaches to calculate the MID for the GQLI overall score and its five domains [34]. One distribution-based approach we used is referred to as the effect size method. It relies entirely on the standard deviation of baseline data, and the MID is taken as either 0.2 or 0.5 times the mean standard deviation of the sample’s baseline scores [35], known, respectively, as the small effect size method and the medium effect size method. Based on some empirical and psychophysiological evidence, some researchers argue that half a standard deviation (i.e., the medium effect size) is a universal estimate of the MID [1, 18], while others acknowledge that half a standard deviation is a conservative estimate that is likely to be clinically meaningful, but does not necessarily correspond to the minimal important difference [36]. The second distribution-based approach we used is the standard error of measurement (SEM) method. Using the SEM method, the MID is taken as the product of the sample’s GQLI score standard deviation at baseline and the square root of one minus the relevant GQLI domain’s reliability [18, 37]. Estimates of reliability were extracted from the literature and used whichever of Cronbach’s alpha or the intraclass correlation was available [10].
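
The distribution-based estimates described above reduce to two short formulas, sketched below. The baseline scores and the reliability value are made up for illustration and are not study data, and the sample standard deviation (n − 1 denominator) is assumed, since the text does not specify which form was used.

```python
import math
import statistics


def effect_size_mid(baseline_scores, multiplier):
    """MID as a multiple of the baseline SD (0.2 = small, 0.5 = medium)."""
    return multiplier * statistics.stdev(baseline_scores)


def sem_mid(baseline_scores, reliability):
    """MID as baseline SD times the square root of (1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return statistics.stdev(baseline_scores) * math.sqrt(1.0 - reliability)


# Made-up baseline GQLI total scores (not study data).
baseline = [80, 90, 100, 110, 120]
# Hypothetical reliability; the study took its values from the literature.
assumed_reliability = 0.91

small_mid = effect_size_mid(baseline, 0.2)
medium_mid = effect_size_mid(baseline, 0.5)
sem_based_mid = sem_mid(baseline, assumed_reliability)
print(round(small_mid, 2), round(medium_mid, 2), round(sem_based_mid, 2))
```

All three estimates scale directly with the baseline standard deviation, so any subgroup with more variable baseline scores will receive a larger distribution-based MID regardless of how patients perceive their change in health.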

We refer to the anchor-based approach we used as the regression method as it consists of employing linear regression to estimate a line of best fit, with the anchor (change in EQ-5D(3L) score) as the independent variable and change in PRO score for which we wish to estimate the MID (GQLI) as the dependent variable [38, 39]. The change in EQ-5D(3L) utility value was chosen as the anchor because its correlation with the GQLI score change in most strata is near, or exceeds, 0.30 (results not shown) [1] and has precedents in related research [40,41,42]. The change in GQLI score when the anchor was set at the MID value of the EQ-5D(3L) was taken as an estimate of the GQLI’s MID. Because MID estimates for the EQ-5D(3L) among gastrointestinal surgery patients have not been published, we selected a value near the mid-range of MID estimates measured in patients with various conditions undergoing various interventions and took the MID of the EQ-5D(3L) to be 0.10 [34, 43]. A sensitivity analysis was performed to test whether MID estimates would differ if the EQ-5D(3L) MID was 0.04, the average of the total smallest health transitions defined by the instrument’s multi-attribute health classification system [41].
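
The regression method can be sketched as an ordinary least-squares fit followed by a prediction at the anchor's MID. This is a minimal sketch assuming a simple linear model with an intercept and a single predictor; the text describes only a line of best fit, so the exact model specification is an assumption. The paired changes below are made up and are not study data.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("anchor changes must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def regression_mid(anchor_change, pro_change, anchor_mid):
    """Predicted PRO change when the anchor change equals the anchor's MID."""
    intercept, slope = fit_line(anchor_change, pro_change)
    return intercept + slope * anchor_mid


# Made-up paired changes (not study data): EQ-5D(3L) utility change as the
# anchor and GQLI total score change as the outcome.
eq5d_change = [0.0, 0.1, 0.2, 0.3]
gqli_change = [2.0, 10.0, 18.0, 26.0]

mid_at_010 = regression_mid(eq5d_change, gqli_change, 0.10)  # primary anchor
mid_at_004 = regression_mid(eq5d_change, gqli_change, 0.04)  # sensitivity
print(round(mid_at_010, 2), round(mid_at_004, 2))
```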

Each method to estimate the MID was repeated for sex, age, and baseline health subgroups. For each sex and age subgroup, the standard error of the subgroup’s MID was estimated using a bootstrap sampling approach in which repeated samples were drawn from the original data and the MID was recalculated from each; the standard deviation of the empirical distribution of MID estimates was used as the subgroup’s standard error. To compare MID values between sex subgroups, a two-sample t-test was used to calculate the test statistic and p-value comparing the MID estimates, using the standard error derived from bootstrapping. To test for MID differences between age subgroups, one-way analysis of variance (ANOVA) and Tukey’s test with Bonferroni’s correction to account for multiple comparisons between age subgroups were used (corrected α = 0.008). The distributions of baseline and postoperative scores were assessed visually and found to be approximately normal. To test for MID differences between baseline health subgroups, ANOVA was used. MID values and standard errors for each subgroup as well as test statistics and p-values for all comparisons are reported.
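
The bootstrap and comparison steps can be sketched as follows. Several details are assumptions made for illustration: the number of resamples (`n_boot=1000`) is not stated in the text, the test statistic is taken to be the difference in MID estimates divided by the square root of the summed squared standard errors, and the corrected α is reconstructed as 0.05 divided by the six pairwise comparisons among four age subgroups. The scores are made up, and the p-value is omitted because the Python standard library has no t distribution.

```python
import math
import random
import statistics


def bootstrap_se(data, estimator, n_boot=1000, seed=0):
    """SD of the estimator across resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)


def medium_effect_mid(scores):
    """Medium effect size MID: half the SD of baseline scores."""
    return 0.5 * statistics.stdev(scores)


def two_sample_statistic(est_a, se_a, est_b, se_b):
    """Difference in estimates divided by its combined standard error."""
    combined = math.sqrt(se_a ** 2 + se_b ** 2)
    if combined == 0:
        raise ValueError("both standard errors are zero")
    return (est_a - est_b) / combined


# Made-up baseline scores for two hypothetical subgroups (not study data).
group_a = [78, 85, 92, 99, 104, 110, 118, 125]
group_b = [82, 88, 95, 101, 103, 111, 115, 120]

mid_a, mid_b = medium_effect_mid(group_a), medium_effect_mid(group_b)
se_a = bootstrap_se(group_a, medium_effect_mid)
se_b = bootstrap_se(group_b, medium_effect_mid)
statistic = two_sample_statistic(mid_a, se_a, mid_b, se_b)

# Four age subgroups give comb(4, 2) = 6 pairwise comparisons.
corrected_alpha = 0.05 / math.comb(4, 2)
print(round(corrected_alpha, 3))
```

Because `bootstrap_se` accepts any estimator, the same function could resample paired (anchor change, PRO change) tuples and recompute a regression-based MID on each resample.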

All analyses were performed using R statistical software, version 3.4.1 (R Foundation for Statistical Computing, Vienna, Austria) [44].

Results

Among the 647 cholecystectomy patients eligible to participate, 330 (51%) completed the preoperative survey. Of these 330 participants, 188 (57%) also completed the PROs postoperatively, giving 188 participants (among 647 eligible) with both preoperative and postoperative PROs. Participants were, on average, three years older than non-participants (p < 0.01; results not shown), though no other differences in observable characteristics between participants and non-participants were detected.

As shown in Table 1, the mean age was 58 years and the majority of participants were female (73%). Most females had zero comorbidities and the average comorbidity burden was higher among males (p = 0.04). The most common comorbidity was hypertension. While not statistically significant (p = 0.08), there was some evidence that males were, on average, older than females. There were no obvious differences in preoperative health between males and females, except that a greater proportion of males (30%) than females (20%) reported GQLI total scores greater than 125.

The results of Table 2 show that among the overall cohort, there was a statistically significant difference in the mean GQLI total score between preoperative and postoperative measurements, consistent with improvement (p < 0.001). Statistically significant improvements in the mean scores of each GQLI domain (p < 0.001) and in the mean EQ-5D(3L) utility value (p < 0.001) were also observed.

Table 2 Pre and postoperative GQLI mean and standard deviation (SD)

The estimated MIDs for the GQLI total score and domains using four different approaches (small effect size, medium effect size, SEM, and regression method) are shown in Table 3. The estimated MID for the GQLI total score ranged from 4.34 (in males, using the small effect size method) to 11.78 (in females, using the regression method). There were no statistically significant differences in the MID of the GQLI total score between sex subgroups (p > 0.05).

Table 3 Comparisons of the MID between females and males for the GQLI overall and domain score

Table 4 shows the estimated MID values for the GQLI total score and each domain using each of the four methods, stratified by age subgroup. No statistically significant differences in the MID of the GQLI total score were detected between age subgroups using the distribution-based approaches, and estimates were similar across age subgroups. Although the pairwise comparisons found no statistically significant difference between age subgroups’ MID estimates using the regression approach, differences between the numeric values were large (e.g., an 8.5-point difference between the MID of youngest participants and those aged 50–60 years).

Table 4 Estimated MID (standard error) for each GQLI domain and overall score by age subgroup, and results of testing for differences in the MID between age subgroups

Table 4 also shows that no statistically significant differences in MID estimates between age subgroups were found for the symptoms, physical function, social function, or medical treatment effects domains of the GQLI (p > 0.008). The largest pairwise difference in estimated MID values between age subgroups for the Emotions domain was found using the regression method.

As shown in Table 5, for the GQLI total score and all except one GQLI domain, MID estimates were largest among the subgroup of participants with the lowest GQLI scores at baseline (i.e., worst preoperative health) and decreased with improving preoperative health. The sole exception was the MID of the Emotions domain estimated using the regression method – the subgroup of patients who scored 91–112 preoperatively had a smaller MID than those who scored 113–125. The ANOVA provided evidence of statistically significant differences in MID estimates based on baseline GQLI total score using both the medium effect size and linear regression methods (p < 0.05).

Table 5 Estimated MID (standard error) by preoperative GQLI total score and results of testing for differences in MID between baseline health subgroups for the GQLI domains and overall score

The sensitivity analysis of the anchor value found that using an anchor value of 0.04 produced MID estimates between 12 and 64% smaller than when the value of 0.10 was used. Results of the sensitivity analysis showed evidence of MID differences between age but not sex subgroups (Appendices 1 and 2 in Tables 6 and 7). Differences in MID values between baseline health subgroups were still identified (Appendix 3 in Table 8).
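
One possible reading of why the reduction was not uniform: if the regression line is assumed to have an intercept a and slope b (the text describes only a line of best fit), the MID at anchor value x is a + b·x, so moving the anchor from 0.10 to 0.04 shrinks the estimate by 0.06b / (a + 0.10b). With a zero intercept the reduction would always be 60%; a positive intercept gives a smaller reduction and a negative intercept a larger one. The coefficients below are made up for illustration and are not study results.

```python
def mid_at_anchor(intercept, slope, anchor_value):
    """Predicted PRO change at a given anchor value."""
    return intercept + slope * anchor_value


def relative_reduction(intercept, slope, base_anchor=0.10, alt_anchor=0.04):
    """Fractional drop in the MID when the anchor value is lowered."""
    base = mid_at_anchor(intercept, slope, base_anchor)
    alt = mid_at_anchor(intercept, slope, alt_anchor)
    if base == 0:
        raise ValueError("MID at the base anchor is zero; ratio undefined")
    return (base - alt) / base


# Made-up regression coefficients (not study results).
no_intercept = relative_reduction(0.0, 80.0)        # line through the origin
positive_intercept = relative_reduction(2.0, 80.0)  # smaller reduction
negative_intercept = relative_reduction(-1.0, 80.0) # larger reduction
print(round(no_intercept, 2), round(positive_intercept, 2),
      round(negative_intercept, 2))
```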

Discussion

The objective of this study was to calculate the MID of the GQLI among a sample of English-speaking adult participants and to report whether the MID values differed between sex, age, and baseline health subgroups. MID values for the GQLI total score and among English speakers undergoing cholecystectomy have not been previously reported in the literature. Furthermore, methodological results regarding sex, age, and baseline health-based differences in MID values fill an important gap in understanding change among several domains of health measured by the GQLI since the MID can vary with patient population. This is important summative research since the MID provides a ‘benchmark’ for evaluating interventions and untangles statistical significance from patients’ perceptions of their change in self-reported health.

Depending on the approach taken to calculate the GQLI’s overall MID, this study found considerable variation in MID estimates. Despite the range in MID values obtained by using various estimation methods, we observed some patterns; first, the small effect size method consistently produced MID estimates smaller than those obtained using all other methods. Also, consistent with previous reports [1, 18], MID estimates obtained using the anchor-based method and medium effect size method were generally similar; this result was observed among the overall cohort and both sex subgroups for many domains as well as the GQLI total score. Estimates using these two approaches were not as similar when participants were stratified by baseline health status.

When transformed to the same scale, our estimates of the MID of most GQLI domains were consistent with Shi et al.’s [22] MID estimates (Appendix 4 in Table 9), providing additional support for our findings. One exception was the Emotions domain, for which our MID estimate was nearly two times larger. It is unlikely that baseline health status was the reason for this discrepancy since our sample’s baseline health was similar to that of Shi et al.’s [22].
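
The text does not state how scores were transformed for this comparison. A minimal sketch, assuming a simple linear rescaling of each score range onto 0 to 100, is shown below; the function name `rescale_mid` and the example MID values are hypothetical and are not study results.

```python
def rescale_mid(mid, scale_min, scale_max):
    """Express a MID as points on a 0-100 version of the same scale."""
    width = scale_max - scale_min
    if width <= 0:
        raise ValueError("scale_max must exceed scale_min")
    # A difference is unaffected by the scale's offset, so only the width
    # of the original range enters the conversion.
    return 100.0 * mid / width


# Made-up MID values (not study results).
total_mid_100 = rescale_mid(10.0, 0, 144)  # total score ranges 0 to 144
domain_mid_100 = rescale_mid(0.3, 0, 4)    # domain scores are item means, 0 to 4
print(round(total_mid_100, 2), round(domain_mid_100, 2))
```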

An important finding for clinicians and patients is that the results found no difference in MID values between sexes for the GQLI total score or the instrument’s domains. Not only were MID differences not statistically significant, but the MID estimates were similar between males and females, suggesting it is unlikely that any meaningful differences in GQLI MID values exist between males and females. This finding is relevant to clinicians as it suggests that the GQLI’s MID is robust to the sex of the respondent, and that clinicians should not need to adjust MIDs for the patient’s sex.

The findings for age subgroups were more complex to interpret; statistically, there were no differences in the GQLI’s MID between age subgroups. However, the results showed that the MID of the instrument’s total score differed by over 8 points between age subgroups, suggesting possible but inconclusive evidence of differences in MID values between age subgroups. Although this result was based on subgroups of 31 to 61 participants, it is possible that the sample size was still too small to detect statistically significant differences, that the Bonferroni adjustment was too conservative, or that participants in certain age subgroups experienced a wider range of outcomes (higher variance) than participants of other ages, attributable to other, unmeasured, effects such as symptom severity. The implications of this uncertainty by age subgroup are meaningful to clinicians; since other research has reported larger improvement in younger individuals [13], the patient’s expected benefits attributable to cholecystectomy should be assumed to differ by the patient’s age, in the direction indicated by the results presented in this study, with specific values to be determined by additional research.

We found strong evidence that the MID values of the GQLI differed by baseline GQLI values, which was consistent with expectations since previous research has reported larger quality of life gains among patients with worse health and smaller gains in patients with better preoperative health [13]. In our study, participants with higher GQLI total scores (i.e., better health) preoperatively perceived smaller changes in health as meaningful compared with participants who scored lower preoperatively (i.e., worse health).

Unexpectedly, we found that MID values estimated using the anchor-based approach among the highest scoring participants were negative, suggesting that among participants with the highest (best) preoperative health, gains in health-related quality of life may not accrue. Also, we found that the variance in GQLI scores was highest among participants with the worst baseline health; since distribution-based approaches produce MID values that are a function of the sample’s variance, this finding had a profound influence on the distribution-based MID estimates and provides support to the claim that the anchor-based approach, rooted in patients’ perspectives of their health, should be preferred when available.

Another important finding is that the value of the anchor can influence the results. In our study, lowering the EQ-5D(3L) threshold value from 0.10 to 0.04 had negligible effect on the statistical significance, though the MID estimates varied greatly, demonstrating the sensitivity of this approach. This finding underlines the importance of estimating the MID within specific patient subgroups and reinforces that further research is needed to provide a better understanding of the EQ-5D(3L)’s use among elective cholecystectomy patients.

There are limitations to this study, as the sample comprised fewer than 200 participants and follow-up for postoperative completion of PROs was less than 60%. Despite the participation rate, the only observable characteristic by which participants and non-participants differed was age, with participants three years older, on average, than non-participants. Nonetheless, there is a risk that differences in unobservable characteristics, such as disease severity, could have introduced bias into our results. Additionally, the complexities of the age subgroup results suggest further research in larger samples is warranted. Also, although this study applied a conservative approach to comparing subgroups, it did not adjust for the fact that sex and age subgroups were measured simultaneously; such an adjustment would potentially lower the threshold p-value further. In spite of these limitations, we found substantive evidence that the MID value varies by preoperative health status.

The findings of this study are important to inform the clinician’s or patient’s expectations regarding the effect of cholecystectomy on health and to inform thresholds for cholecystectomy effectiveness research. Based on our findings, we conclude that the MID is robust to sex. While our results are inconclusive on whether MID values vary by age, clinicians should note that MID values vary based on patients’ preoperative health. Since distribution-based approaches to estimating the MID are heavily influenced by the sample’s variance rather than the patient’s perspective of health, we conclude emphasis should be placed on MID estimates obtained using anchor-based approaches. Consequently, the ‘best’ MID estimates for the GQLI total score are: 23 among patients scoring less than 91 preoperatively; 10 among patients scoring between 91 and 112; 7 among patients scoring between 113 and 125; and 0 among patients scoring greater than 125.