Key Points for Decision Makers

Searching and synthesis of health state utility values (HSUVs) to populate decision models should incorporate all good-quality evidence, but the variability of utility scores by elicitation methods generates a problem for pooling values through meta-analysis.

Stricter inclusion criteria for meta-regression or meta-analysis of HSUVs may help.

There is potential for greater use of mapping algorithms between HSUVs prior to meta-analysis, although careful consideration should be given to the appropriateness of the mapping function and the additional level of uncertainty associated with mapped values.

1 Introduction

The evaluation of healthcare technologies is increasingly reliant upon decision-analytic models. Where quality-adjusted life-years (QALYs) are used as the overall outcome measure for a decision model, each health state included in the model requires a health-related quality-of-life score or health state utility value (HSUV). Good practice in parameter estimation relies on the principles of evidence-based medicine and hence aims to include all (unbiased) evidence and to employ formal evidence synthesis techniques, with systematic review and meta-analysis [1] being the highest level of evidence. That said, the diversity of methods for generating QALYs [2] and the variability across the values generated by these different methods lead to a quandary over whether meta-analysis of utility values is appropriate.

We are interpreting utility here to mean a measure of the social judgement of the value of a particular health state. Health economists use a number of different methods to extract that value, resulting in the same health state being attributed different (sometimes markedly different) utility scores. This variability arises from four factors: (1) who is asked (and when) to value health states (patients, ex-patients, or members of the public); (2) the technique used to extract preferences and estimate values [the most common being time trade-off, standard gamble (SG), visual analogue scale (VAS) and discrete choice experiment]; (3) different variants of each general method (such as the exact question wording, the mode of administration or the use of props); and (4) different preference-based measures (PBMs) or instruments with different descriptive systems, including different items and response options, valued using different methods.

Meta-analysis provides a means to pool data collected across a number of studies and produce a weighted average of the measure of interest, thereby generating a more precise estimate. Most HSUV studies report more than one mean utility value (e.g. patients may complete more than one PBM); consequently, any meta-analysis of HSUVs needs to adjust for the fact that these values will be correlated. Given the potential sources of variability of HSUVs, it is unsurprising that conventional tests find that pooled HSUVs reveal considerable heterogeneity (e.g. [3, 4]).
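As a minimal sketch of the weighted average and heterogeneity test described above, the following Python fragment pools mean utilities by inverse-variance weighting and reports Cochran's Q and the I² statistic. The function name and all input values are hypothetical and chosen only for illustration. The sketch assumes a fixed-effect model and treats every value as independent, so it does not make the adjustment for within-study correlation that the text notes is needed.

```python
from math import sqrt

def pool_fixed_effect(means, std_errors):
    """Return the inverse-variance pooled mean, its standard error,
    Cochran's Q and I^2 (as a percentage) under a fixed-effect model."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_w = sum(weights)
    pooled = sum(w * m for w, m in zip(weights, means)) / total_w
    pooled_se = sqrt(1.0 / total_w)
    # Cochran's Q: weighted squared deviations from the pooled mean
    q = sum(w * (m - pooled) ** 2 for w, m in zip(weights, means))
    df = len(means) - 1
    # I^2: share of total variation attributed to heterogeneity, floored at zero
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, pooled_se, q, i2

# Hypothetical mean utilities and standard errors for one health state
means = [0.56, 0.60, 0.69, 0.72]
std_errors = [0.03, 0.04, 0.03, 0.05]

pooled, pooled_se, q, i2 = pool_fixed_effect(means, std_errors)
print(f"pooled={pooled:.3f} se={pooled_se:.3f} Q={q:.2f} I2={i2:.0f}%")
```

The pooled standard error is smaller than that of any single input, which is the gain in precision referred to above; a large Q relative to its degrees of freedom signals the heterogeneity that makes the pooled value hard to interpret.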

2 Existing Use of Meta-Analysis and Meta-Regression for Utility Values

Meta-regressions [5] allow researchers to explore heterogeneity and the impact of different elicitation methods. Existing meta-regressions (see Table 1) on HSUVs have found substantial differences in values between elicitation methods.

Table 1 Some example coefficients on utility instruments and elicitation methods in meta-regressions

These differences are worryingly large. Indeed, Sturza [6], reporting on her meta-regression for lung cancer, argued that since methodological factors affect utility values, lung cancer researchers “should avoid direct comparisons on lung cancer utility values elicited with dissimilar methods” (p. 691).
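To make concrete what a coefficient on an elicitation method represents, the sketch below fits a weighted least-squares regression of mean utility on a single dummy variable (0 for one method, 1 for another), with inverse-variance weights. It is an illustrative simplification, not the specification used in any of the cited studies: the function name and data are hypothetical, between-study variance is ignored, and values are treated as independent.

```python
def wls_one_covariate(y, x, se):
    """Weighted least squares of y on an intercept and one covariate x,
    using weights 1/se^2. Returns (intercept, slope)."""
    w = [1.0 / s ** 2 for s in se]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope

# Hypothetical mean utilities: three from method A (dummy = 0), three from method B (dummy = 1)
utilities = [0.55, 0.57, 0.58, 0.68, 0.70, 0.71]
method_b = [0, 0, 0, 1, 1, 1]
std_errors = [0.03] * 6

intercept, slope = wls_one_covariate(utilities, method_b, std_errors)
print(f"reference-method mean={intercept:.3f}, method B shift={slope:.3f}")
```

With a single dummy, the intercept is the weighted mean for the reference method and the slope is the difference between the two weighted group means, which is the quantity reported as a method coefficient in a meta-regression table.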

Some HSUV synthesis has avoided some of these problems by only using meta-analysis on the EQ-5D (Peasgood et al. [14] for osteoporosis states; Doth et al. [15] for pain states) as this is the measure explicitly preferred by the National Institute for Health and Care Excellence (NICE) [16]. Others have conducted a separate meta-analysis for each overall method or instrument (Liem et al. [17] for renal replacement therapy states; Post et al. [18] for stroke; Mohiuddin and Payne [19] for depression). Whilst a weighted average of EQ-5D values may be adequate for NICE Health Technology Appraisal submissions, for non-NICE submissions, we are left with a decision as to which value to use to populate a decision model. This choice is likely to impact substantially upon the mean values used (e.g. Mohiuddin and Payne [19] reported a pooled SG value for mild depression of 0.69 compared with only 0.56 for the pooled EQ-5D estimate) and on the final incremental cost-effectiveness ratios [20]. Furthermore, a meta-analysis on one particular instrument or method results in considerable loss of evidence and information, which goes against the researcher’s responsibility to incorporate all high-quality evidence available.

3 Recommendations

How do we use the very best evidence under the circumstances of considerable parameter variation across methodologies? The problem may not be as bad as it at first seems. It may be that these elicitation method differences identified in meta-regressions are inflated. Firstly, some meta-regressions for HSUVs have been conducted on fairly small numbers of utility values. Secondly, meta-regressions have included values that do not appear to be measuring the same thing, i.e. the utility score on a scale of 0 (dead) to 1 (full health) representing how the relevant society views the value of a particular clinical health state.

Meta-regressions with only a few studies and considerable study heterogeneity run the risk of showing false positives [21]; hence, a dummy variable for the elicitation method may appear to be statistically significant when it is not. Whilst there are no hard and fast rules for the appropriate sample size in meta-regression, a ratio of at least ten studies to each covariate is often recommended [5]. For meta-regressions of effectiveness, a minimum of four studies in a categorical subgroup variable has been recommended [22], while more are required to conduct significance testing. Meta-regressions of HSUVs have been conducted with small numbers of utility values (e.g. McLernon et al. [3] conducted a meta-regression with nine covariates and 40 utility values), and some have very few utility values in each category (e.g. Wyld et al. [10] included a covariate for Short Form 6 dimension with only one utility value identified that used this instrument).
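The two rules of thumb cited above can be expressed as a simple pre-analysis check. The sketch below is illustrative only: the function name and the instrument labels are hypothetical, utility values are used as the unit of analysis in place of studies, and the four-per-category threshold (recommended in the text for meta-regressions of effectiveness) is applied to HSUVs as an assumption.

```python
from collections import Counter

def check_meta_regression_size(n_values, n_covariates, category_labels=(),
                               min_ratio=10, min_per_category=4):
    """Flag a meta-regression whose ratio of values to covariates, or whose
    count of values in any category, falls below the rule-of-thumb thresholds."""
    ratio = n_values / n_covariates
    counts = Counter(category_labels)
    sparse = sorted(label for label, n in counts.items() if n < min_per_category)
    return {"ratio": ratio, "ratio_ok": ratio >= min_ratio, "sparse_categories": sparse}

# Figures reported in the text for McLernon et al. [3]: 40 utility values, nine covariates
result = check_meta_regression_size(n_values=40, n_covariates=9)
print(result)

# Hypothetical instrument labels: one instrument is represented by a single value
labels = ["EQ-5D"] * 12 + ["SF-6D"]
sparse_result = check_meta_regression_size(len(labels), 1, labels)
print(sparse_result["sparse_categories"])
```

On the reported figures, the first check gives a ratio of roughly 4.4 values per covariate, well below the suggested ten.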

The pooling of utility values should only be attempted where the data are valuing the same clinical health state for the appropriate population. The breadth of the health state for which utility values are sought should be dictated by the economic model, and we should be confident that the utility values reflect the exact health state required. Vignettes, which verbally describe a particular (hypothetical) clinical health state to allow individuals who are not in that particular health state to estimate a utility score, may have a useful role in populating economic models in the absence of any other utility values. However, they introduce another layer of uncertainty and may offer no additional benefit when values on the actual desired health state are available. In the meta-regression by Sturza [6], values derived from asking members of the public to link lung cancer vignettes to an EQ-5D state are included alongside direct patient EQ-5D responses without recognition of the superiority of the latter evidence. Making a judgement on whether a study is identifying a utility for the appropriate health state requires detailed information on the exact study population (including study selection, dropout, missing values and clinical diagnosis), and this is unfortunately not always available [19]. When in doubt, preference should be given to including only studies where it is reasonable to assume that the utility refers to the desired population.

The pooling of utility values should also only include utilities anchored on the dead to full-health scale. This would exclude values where the top anchor is symptom free (which would exclude some values used in Bremner et al. [11]) or ‘normal’ rather than full health (which would exclude some values used in Peasgood et al. [23], Tengs and Lin [24, 25] and Sturza [6]). Where there is uncertainty on whether the values really are utility scores, such as when the assessment method is not stated, these should not be included (which would exclude some values used in Tengs and Lin [25]).

It is possible that some PBMs may not adequately identify important aspects of a particular clinical health state. Where there is strong psychometric evidence that a particular instrument lacks validity for the health condition of interest (e.g. see Longworth et al. [26] for a review), a synthesis that excludes those values will be useful for sensitivity analysis.

Where an economic model is to be used to support decision making in a particular country, the desired utility values are those that give the social value of the health state as judged by the relevant population from that country. Utility scores using tariffs from other countries reflect different sets of preferences, and unless it is believed that preferences should be universal, or the value sets are very similar, the rationale for pooling utilities that use different country-specific tariffs is not clear. Considerable inter-country differences in the social tariff of the EQ-5D have been identified, with differences varying across the EQ-5D distribution [27]. Including a country-specific tariff dummy, which only shifts the intercept, will not capture this variability across the distribution or differences in the weight given to different items in the instrument. To include utility data from other countries would require patient-level data to enable the appropriate social tariff to be applied, or a mapping from one country tariff to another using more sophisticated methods (e.g. [28]).

Even where we have included only utility values on the same clinical health state, the identified utility values are still likely to show variability across instruments and elicitation methods. For PBMs, it is likely that the different descriptive systems drive the variation as much as differences in valuation method [29]. Including the instrument as an intercept term in a meta-regression is a limited approach as it does not pick up the relative weights attributed to the different domains within an instrument (including zero if the item is not included at all). An alternative approach would be to use mapping between instruments, at the aggregate or, if possible, the individual patient level. Mapped values may still differ in terms of both mean and variance compared with direct values (e.g. Wyld et al. [10] found EQ-5D values mapped from Short Form 12 and Short Form 36 to have different values to direct EQ-5D values), and mapping may not be feasible where descriptive content does not substantially overlap. Nevertheless, where mapping is possible, the pooling of mapped utility values could offer a means of generating an estimate that incorporates more of the relevant evidence and has a smaller variance. That said, consideration should be given to the quality of the mapping function, particularly at the ends of the distribution [30], and the appropriateness of the population on which the mapping function was based.

In addition to generating a pooled mean value, consideration also needs to be given to an assessment of uncertainty of the parameter. Ara and Wailoo [31] note that this should incorporate the uncertainty from any mapping functions used, the uncertainty from tariff scores and uncertainty from the output of the descriptive system.
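One simple way to see how additional uncertainty from a mapping function could feed into a pooled estimate is sketched below. This is an assumption-laden illustration rather than the approach set out by Ara and Wailoo [31]: it supposes that the sampling error of a mean utility and the error attributable to the mapping function are independent, so that their variances add, and it ignores tariff and descriptive-system uncertainty. The function names, the `mapping_se` parameter and all numbers are hypothetical.

```python
from math import sqrt

def combined_se(sampling_se, mapping_se=0.0):
    """Standard error from two sources of uncertainty assumed to be independent
    (variances are summed)."""
    return sqrt(sampling_se ** 2 + mapping_se ** 2)

def inverse_variance_weights(std_errors):
    """Normalised inverse-variance weights, summing to one."""
    raw = [1.0 / se ** 2 for se in std_errors]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical inputs: one directly measured value and one mapped value,
# both with the same sampling standard error
direct_se = combined_se(0.03)
mapped_se = combined_se(0.03, mapping_se=0.04)

weights = inverse_variance_weights([direct_se, mapped_se])
print(f"direct se={direct_se:.3f}, mapped se={mapped_se:.3f}")
print(f"weights: direct={weights[0]:.2f}, mapped={weights[1]:.2f}")
```

Under these assumptions the mapped value carries a larger standard error and therefore receives less weight in an inverse-variance pooled mean, while still contributing evidence to it.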

More generally, pooling HSUVs would be aided if there were greater consistency of valuation methods between instruments. Where instruments adopt different descriptive systems, effort could still be made to generate a social tariff that adopts a standardised methodology. This would facilitate greater understanding of the source of differences between instruments.

The advantages of adopting a systematic review of utility values to populate economic models are clear: a transparent methodology to follow in terms of searching (see [32]) and transparent reporting of findings. This includes details of study characteristics that would allow modellers to select the most appropriate value [33] for both the main model and any sensitivity analysis. The advantage of including a meta-analysis or meta-regression is the use of all available good-quality evidence in generating the value to be used. Yet even with stricter inclusion criteria (excluding values that are not the appropriate utilities), we are still likely to be left with a considerable degree of heterogeneity across utility values. Higgins [34] has presented the case that in relation to study effect sizes “any amount of heterogeneity is acceptable, providing both that the predefined eligibility criteria for the meta-analysis are sound and that the data are correct” (p. 1158). Where we are aiming to measure the same thing (the social value of a particular health state), we ought to be able to combine values. More work is required on understanding sources of variation in utility values, particularly variation driven by differences in the descriptive system.

For England and Wales, the current NICE methods guide states that when it is necessary to take HSUVs from the literature “the methods of identification of the data should be systematic and transparent. The justification for choosing a particular data set should be clearly explained. When more than one plausible set of EQ-5D data is available, sensitivity analyses should be carried out to show the impact of the alternative utility values” [16]. This does not then imply a requirement for meta-analysis on EQ-5D values at present. However, given the growing number of publications that incorporate meta-analysis or meta-regression of HSUVs, this guidance may change in the future.