Digital Features for this article can be found at https://doi.org/10.6084/m9.figshare.16910563

A Comparison of PROPr and EQ-5D-5L Value Sets: A Video (MP4 21913 kb)

Key Points for Decision Makers

Both EQ-5D-5L and PROMIS are important preference-accompanied patient-reported outcomes for use in economic evaluation, but they differ in how they describe and value health.

PROPr (the values accompanying PROMIS) has a narrower range of theoretical values compared with EQ-5D-5L. PROPr has substantially lower values than EQ-5D-5L for comparable ‘mild’ health states, and higher values for more ‘severe’ health states. The patterns in utility decrements within the dimensions of each instrument, and the relative importance of dimensions, are very different, despite both representing the preferences of the US general public.

It is important for decision makers and clinical triallists to be aware of these differences and their potential implications for assessing the impact of healthcare interventions.

1 Introduction

A number of generic and condition-specific patient-reported outcome measures (PROMs) are accompanied by value sets (also referred to as utilities or preference weightings) that enable the estimation of quality-adjusted life years (QALYs). These PROMs and their accompanying utilities differ in how they describe and value health, potentially yielding different estimates of QALYs and resulting incremental cost-effectiveness ratios used in decision making [1, 2].

Both EQ-5D and PROMIS are important preference-accompanied PROMs [1]. In addition to its use in monitoring population health, the EQ-5D is widely used in cost-effectiveness analysis for Health Technology Assessment (HTA) around the world [3]. It includes five dimensions: mobility, self-care, usual activities, pain/discomfort and anxiety/depression; and each dimension has one item. The initial version (EQ-5D-3L) has three levels of problems for each dimension; the EQ-5D-5L retains the same five dimensions but increases the number of levels of problems on each to five [4]. Utilities for the EQ-5D-5L have been produced in many countries and are typically obtained using a standard protocol (the EuroQol Valuation Technology, EQ-VT) that includes both discrete choice experiments (DCE) and the composite time trade-off (cTTO), with resulting value sets based either on cTTO or a hybrid of cTTO and DCE [5].

PROMIS was developed in the US using item response theory (IRT) to calibrate a large number of questions for each health domain to create an item bank [6]. PROMIS has over 100 health domains with a calibrated ‘item bank’ of questions that can be administered via standard short forms, custom short forms, or computer adaptive testing [7]. There are three PROMIS profile measures comprising fixed sets of short forms: PROMIS-29, PROMIS-43 and PROMIS-57. Profile measures comprise 4-, 6-, or 8-item short forms for each of seven domains: Anxiety, Depression, Fatigue, Pain Interference, Physical Function, Sleep Disturbance, Ability to Participate in Social Roles and activities (from here, known as social roles), plus a 0–10 Pain Intensity numeric rating item [8]. Among the three profile measures, PROMIS-29 is the most widely used. A generic societal preference-based scoring system for PROMIS (PROPr) has been produced based on multi-attribute utility theory and eliciting stated preferences using standard gamble (SG) [9]. Seven PROMIS health domains are needed to obtain PROPr utilities: Cognitive Function—Ability (from here, described as Cognitive Function), Depression, Fatigue, Pain Interference, Physical Function, Sleep Disturbance, Social Roles. While computerised adaptive testing (CAT) administration is the recommended method to obtain the domain scores necessary to calculate PROPr, the use of PROMIS-29 and two items from Cognitive Function v2.0 (PROMIS-29 + 2 Profile v2.1, from here, described as PROMIS-29 + 2) is the recommended measure to obtain PROPr if a fixed form is required for the project [10].

Both PROPr and EQ-5D-5L are potential candidates for use in economic evaluation. However, a recent systematic review of studies comparing EQ-5D and PROMIS-29 instruments noted the lack of research comparing the measurement properties and utilities of the two, and highlighted the importance of comparing the characteristics of their respective value sets [11]. It is important for decision makers and those using the measures in clinical trials to understand the differences between instruments and their implications for assessing the impact of healthcare interventions. A full account of the differences requires an understanding of both the characteristics and measurement properties of the items and their preference weights.

The utilities that accompany each instrument are, in principle, measured on the same scale (i.e. anchored at full health and dead). This is the assumption underpinning the use of these utilities in QALY estimation, that is, that utilities ‘mean’ the same thing in terms of representing, on average, a society’s preferences about health-related quality of life (HRQoL), regardless of which descriptive system they accompany. This is important in, for example, the use of these utilities to estimate QALYs and incremental QALY gains in cost-effectiveness analysis. When decision makers use such evidence, they rely on utilities having broad comparability. For example, various PROMs (e.g. PROMIS-29, EQ-5D-5L, HUI3, SF-6Dv2) have values < 0, which should be interpretable as meaning that these states, whichever PROM describes them, are worse than being dead. The characteristics of these utilities are important because they introduce an exogenous source of variance into the analysis of the data generated by the respective descriptive systems [12]. Examining the characteristics of the utilities that accompany different PROMs is therefore an essential precursor to a full understanding of the way that measurement and valuation properties interact when analysing patient data.

This paper compares EQ-5D-5L value sets and PROPr to better understand the differences between the characteristics of the utilities generated by each. The primary focus of comparison is between the EQ-5D-5L US value set and PROPr value set, both of which were generated from the stated preferences of the US general public. Differences between any two value sets may result from differences in sample population, differences in the classification system of the instruments and differences in various aspects of the valuation method used [1]. Therefore, in order to test whether results regarding the nature of PROPr and EQ-5D-5L utilities are specific to the characteristics of the US EQ-5D-5L value set, we additionally included two further EQ-5D-5L value sets from English-speaking countries: England and Australia. The English value set was selected as it was developed using the same valuation protocol (i.e. EQ-VT) as the US value set, and the Australian value set because it was developed using a different valuation method (DCE with duration).

2 Method

For each instrument, we provide an overview of (1) the descriptive system and (2) the valuation process. We then describe the approach we used to compare the utilities for all possible health states described by each instrument.

2.1 The Two Instruments

2.1.1 EQ-5D-5L Questionnaire and Value Sets

The development of the EQ-5D was motivated in part by the aim to facilitate QALY estimates for use in economic evaluation [13]. In the EQ-5D-5L, there are five response levels for each of the five dimensions (see Table 1): no, slight, moderate, severe and unable to/extreme problems. EQ-5D-5L health states can be described as a five-number string, representing the level of each dimension in the order in which they appear in the questionnaire (as described above) [14]. The best health state is described as 11111 and the worst as 55555.

Table 1 Comparing the measurement of health in EQ-5D-5L and PROPr

The EuroQol Group has developed a protocol (EQ-VT) to create value sets for the EQ-5D-5L, which has been used in more than 25 countries [5, 15], including in the US [16] and England [17]. The Australian value set was developed using DCE with duration [18]. It includes an interaction (described as the N5 term), capturing an additional reduction in value for any health state with a level 5 problem on at least one of the dimensions.

2.1.2 PROMIS Measures and PROPr Scores

PROPr builds on PROMIS domains and is based on multi-attribute utility theory. A PROMIS domain’s item bank is an IRT-calibrated set of items for that domain. Domains are calibrated to the US general population and presented as T-scores with a mean of 50 and standard deviation (SD) of 10. A higher T-score represents more of the concept being measured [19] (Footnote 1). As shown in Table 1, the items used in PROPr each have five response levels. For PROMIS-29 + 2, which uses fixed four-item short forms, the sum raw score for each domain ranges from 4 to 20, except for Cognitive Function (two items), which ranges from 2 to 10.

The PROPr algorithm allows preference-based scores to be estimated from health states described by the seven PROMIS domains mentioned in the Introduction. Detailed information on domain selection and methods used to create PROPr have been described elsewhere [9, 20, 21]. There are different options to get the domain scores to calculate the PROPr scores, as described in Appendix A in electronic supplementary material (ESM) 1.

2.2 Analysis of Theoretical Values

As mentioned, the primary focus of analysis is comparing the EQ-5D-5L US value set and PROPr value set, because both were generated from the stated preferences of the US general public. We additionally included the EQ-5D-5L value sets for England and Australia (Footnote 2).

We chose to calculate PROPr utilities using PROMIS-29 + 2 given its wide use [10]. All the possible T-scores were obtained from the scoring manual for PROMIS-29 + 2 [19] (Footnote 3). In this section and the results section, PROPr utilities refer to the utility scores generated based on PROMIS-29 + 2 (rather than the PROPr scoring system as a whole). We generated all theoretically possible health states for both instruments using Python v3.8. We obtained the PROPr utilities based on code written by Janel Hanmer in September 2017 using SAS 9.4, which we translated into Stata code (see ESM 2). We used Stata 14.2 for the analyses. In order to check whether our results are specific to PROMIS-29 + 2, we also repeated the analyses on utilities calculated for PROMIS-29 (results provided in Appendix C in ESM 1).
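As an illustration only, the enumeration of theoretically possible health states might be sketched as follows. This is not the authors' code: the function and variable names are hypothetical, and the sketch assumes that PROMIS-29 + 2 states are enumerated at the level of domain sum raw scores, using the ranges given in Sect. 2.1.2.

```python
from itertools import product


def eq5d5l_states():
    """Return every EQ-5D-5L state as a five-digit string, from 11111 to 55555."""
    return ["".join(map(str, levels)) for levels in product(range(1, 6), repeat=5)]


# Number of items per PROPr domain in PROMIS-29 + 2 (each item is scored 1 to 5).
PROPR_DOMAIN_ITEMS = {
    "cognitive_function": 2,
    "depression": 4,
    "fatigue": 4,
    "pain_interference": 4,
    "physical_function": 4,
    "sleep_disturbance": 4,
    "social_roles": 4,
}


def raw_score_range(n_items):
    """Possible sum raw scores for a domain with n_items five-level items."""
    return range(n_items, 5 * n_items + 1)


def count_raw_score_profiles():
    """Count combinations of domain sum raw scores (assumed unit of enumeration)."""
    total = 1
    for n_items in PROPR_DOMAIN_ITEMS.values():
        total *= len(raw_score_range(n_items))
    return total


states = eq5d5l_states()
```

Under these assumptions the EQ-5D-5L descriptive system yields 5^5 = 3125 states, whereas the number of raw-score profiles for PROMIS-29 + 2 is several orders of magnitude larger, which is why utilities for the latter are computed programmatically rather than tabulated.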

2.2.1 Value Set Characteristics

We produced descriptive statistics for all ‘theoretically possible’ values for EQ-5D-5L using the three value sets and for PROPr. In each case this included the range of values, the percentage of health states having values < 0 and the number of unique values. We plotted the overall distribution of the ‘theoretical’ values using a kernel density plot.
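The descriptive statistics listed above can be computed from any list of modelled utilities. The following is a minimal sketch with hypothetical names; rounding to three decimal places before counting unique values is an assumption made here, not something stated in the paper, and the input values are arbitrary toy numbers rather than values from any real value set.

```python
def describe_value_set(utilities, decimals=3):
    """Summarise a list of utilities: range, share below zero, unique values."""
    # Rounding before counting unique values is an assumption of this sketch.
    rounded = [round(u, decimals) for u in utilities]
    n_states = len(rounded)
    return {
        "min": min(rounded),
        "max": max(rounded),
        "pct_below_zero": 100.0 * sum(u < 0 for u in rounded) / n_states,
        "n_unique": len(set(rounded)),
    }


# Toy input only; these are not values from any published value set.
toy_utilities = [1.0, 0.9, 0.9, 0.25, -0.1]
summary = describe_value_set(toy_utilities)
```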

We assessed dimension/domain importance by their ranking produced by the utility of corner states, that is, the utility of health states with the worst level in one domain and the best in all others.
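To make the corner-state ranking concrete, the sketch below builds the five EQ-5D-5L corner states and ranks dimensions by the utility of each. The names are hypothetical, and the utility function uses arbitrary placeholder weights purely so that the example runs; it does not reproduce any published value set.

```python
EQ5D_DIMENSIONS = [
    "mobility",
    "self_care",
    "usual_activities",
    "pain_discomfort",
    "anxiety_depression",
]


def corner_states(n_dims=5, worst=5, best=1):
    """States with the worst level on one dimension and the best on all others."""
    states = []
    for i in range(n_dims):
        levels = [best] * n_dims
        levels[i] = worst
        states.append("".join(map(str, levels)))
    return states


def rank_dimensions(dimension_names, utility_of):
    """Order dimensions from most to least important (lowest corner utility first)."""
    corners = corner_states(len(dimension_names))
    scored = [(name, state, utility_of(state))
              for name, state in zip(dimension_names, corners)]
    return sorted(scored, key=lambda row: row[2])


# Placeholder weights (hypothetical): NOT taken from any EQ-5D-5L value set.
TOY_WEIGHTS = [0.10, 0.20, 0.05, 0.30, 0.15]


def toy_utility(state):
    """Toy additive utility used only to demonstrate the ranking step."""
    return 1.0 - sum(w * (int(level) - 1) / 4 for w, level in zip(TOY_WEIGHTS, state))


ranking = rank_dimensions(EQ5D_DIMENSIONS, toy_utility)
```

In practice `toy_utility` would be replaced by a function that returns the modelled utility of a state under the value set of interest.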

2.2.2 Value Set Characteristics for Matched States

We carried out a comparative analysis on the health states hypothesised to be comparable in terms of severity across EQ-5D-5L and PROMIS-29 + 2. We considered ‘full health’, ‘mild’, ‘moderate’, ‘severe’ and ‘worst’ health states. We considered comparable states based on item response descriptions given that both instruments use five-level scales (Footnote 4). In EQ-5D-5L, that corresponds to 11111, 22222, 33333, 44444 and 55555. In PROMIS-29 + 2, we used five cut points on sum raw scores. The corresponding sum raw scores for full health, mild, moderate, severe and worst health states for symptom domains (e.g. Depression) are 4, 8, 12, 16 and 20, and for function domains (e.g. Physical Function) are 20, 16, 12, 8 and 4. The matched health states are presented in Sect. 3.2.
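The matching rule can be written down compactly, as in the sketch below. Two points are assumptions rather than statements from the paper: the classification of each PROPr domain as a symptom or function domain (the text gives only Depression and Physical Function as examples), and the cut points for the two-item Cognitive Function domain, which are obtained here by applying the same per-item logic (10, 8, 6, 4 and 2). All names are hypothetical.

```python
SEVERITY_LABELS = ["full health", "mild", "moderate", "severe", "worst"]

# (number of items, direction). The symptom/function split is an assumption
# for every domain except Depression and Physical Function.
PROPR_DOMAINS = {
    "cognitive_function": (2, "function"),
    "depression": (4, "symptom"),
    "fatigue": (4, "symptom"),
    "pain_interference": (4, "symptom"),
    "physical_function": (4, "function"),
    "sleep_disturbance": (4, "symptom"),
    "social_roles": (4, "function"),
}


def matched_states():
    """Pair each severity label with an EQ-5D-5L state and PROMIS raw scores."""
    rows = []
    for level, label in enumerate(SEVERITY_LABELS, start=1):
        promis_raw = {}
        for domain, (n_items, direction) in PROPR_DOMAINS.items():
            # Symptom domains: higher raw score is worse; function domains: reversed.
            item_level = level if direction == "symptom" else 6 - level
            promis_raw[domain] = n_items * item_level
        rows.append({"label": label, "eq5d5l": str(level) * 5, "promis_raw": promis_raw})
    return rows


rows = matched_states()
```

For the four-item domains this reproduces the cut points stated in the text (4, 8, 12, 16 and 20 for symptom domains, and the reverse for function domains).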

2.2.3 Differences in Utility Between Adjacent States

We carried out an analysis to understand how changes in utility reflect changes in severity on the descriptive system and domain scores in each instrument. We followed the method described in Mulhern et al. [23] to assess the values of adjacent states within the descriptive system, and compared the differences across the value sets (Footnote 5). An adjacent state pair was defined as having one dimension with a one-level difference for EQ-5D-5L or with a four-point change in sum raw score for PROMIS-29 + 2, with the other dimensions held at the best health level, moderate health level and worst health level, respectively. We compared dimension pairs that we hypothesised to overlap conceptually or those found to be highly correlated using data from the general population and patients [26, 27]. For example, for mobility in EQ-5D-5L, we report the increase in utility between 51111, 41111, 31111, 21111 and 11111; between 53333, 43333, 33333, 23333 and 13333; and between 55555, 45555, 35555, 25555 and 15555. For physical function in PROPr, we compare this with the increase in utility using domain raw scores of 4, 8, 12, 16 and 20 for physical function, with the other domains held at their best, moderate and worst scores, respectively (Footnote 6).
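The adjacent-state comparison for EQ-5D-5L can be sketched as follows. The helper names are hypothetical, and the utility function is a toy placeholder with arbitrary equal decrements, used only so that the example runs; it is not one of the value sets analysed in this paper.

```python
def adjacent_ladder(dim_index, background, n_dims=5):
    """States stepping one dimension from level 5 to level 1, others held fixed."""
    ladder = []
    for level in range(5, 0, -1):
        levels = [background] * n_dims
        levels[dim_index] = level
        ladder.append("".join(map(str, levels)))
    return ladder


def utility_steps(ladder, utility_of):
    """Increase in utility between each state and the next, less severe state."""
    values = [utility_of(state) for state in ladder]
    return [round(better - worse, 6) for worse, better in zip(values, values[1:])]


def toy_utility(state):
    """Toy placeholder: 0.05 lost per level above 1 on any dimension (hypothetical)."""
    return 1.0 - 0.05 * sum(int(level) - 1 for level in state)


# Mobility (index 0) with the other dimensions held at best, moderate and worst.
ladders = {background: adjacent_ladder(0, background) for background in (1, 3, 5)}
steps = {background: utility_steps(ladder, toy_utility)
         for background, ladder in ladders.items()}
```

With a real value set in place of `toy_utility`, the three lists in `steps` would correspond to the utility changes between adjacent mobility states of the kind plotted for each background level.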

3 Results

3.1 Comparing Value Set Characteristics

Table 2 compares the value set characteristics. Theoretical values in the US EQ-5D-5L value set range from − 0.573 to 1, whereas PROPr values range from − 0.022 to 0.954. Compared with all three EQ-5D-5L value sets, PROPr has a higher minimum value (close to 0) for the worst state. The decrement from the best to next-best health state (i.e. from full health to the mildest state) is also larger for PROPr (a change of 0.065) than for the US EQ-5D-5L (0.057). In the EQ-5D-5L value sets, pain/discomfort is either the most or the second most important dimension (proxied by the overall size of the decrement) while usual activities is the least important. However, in PROPr (based on PROMIS-29 + 2), pain is the least important (apart from sleep disturbance). Appendix E (in ESM 1) reports the utilities of each disutility corner state for each instrument.

Table 2 Overall descriptive characteristics of the four value sets (modelled values)

Figure 1 demonstrates that the distribution of theoretical values for PROPr differs substantially from the distributions for the three EQ-5D-5L value sets. EQ-5D-5L US values are symmetrically distributed over a wider range, whereas PROPr values are skewed, with a narrower and lower range. Given this narrower range, the number of unique values within PROPr, as explored by this exercise, is smaller than for the EQ-5D-5L value sets (as shown in Table 2).

Fig. 1 Density plot of PROPr and EQ-5D-5L value sets. This figure shows the density plots of theoretical values for PROPr and EQ-5D-5L (utility value as X-axis and density as Y-axis)

3.2 Value Set Characteristics for Matched States

Table 3 shows that PROPr utilities are consistently lower across the less severe health states compared with EQ-5D-5L value sets but PROPr values are consistently higher for more severe health states. The pattern remained the same whether we classified the health states in PROMIS-29 + 2 based on domain raw scores or domain T-scores (as shown in Appendix D in ESM 1).

Table 3 Value set characteristics for matched states

3.3 Differences in Utility Between Adjacent States

Figure 2 presents the changes in utilities between adjacent states in the EQ-5D-5L US value set and PROPr in comparable dimensions. When holding all other dimensions at the best health level, the decrement in utility associated with any move away from full health to a mild health state was larger in all comparable dimensions in PROPr than in EQ-5D-5L value sets. Consequently, EQ-5D-5L utilities are higher for such mild states. Notably, the mildest problem in mobility on EQ-5D-5L, with a utility of 0.904 in the US value set, corresponds with a PROPr utility of 0.729. In EQ-5D-5L, the largest change in utility occurs in the move between severe (level 4) and moderate (level 3) problems, whether holding other dimensions at their best (level 1), moderate (level 3) or worst level (level 5). Utility associated with any given dimension, considered across its best to worst levels, and holding all other dimensions constant at either best or moderate levels, is always lower in PROPr than EQ-5D-5L. However, this pattern ‘flips’ when any one dimension is considered across its best to worst levels and all other dimensions are held constant at their most severe level; in these cases, there is no (or close to no) change in utility on PROPr, and PROPr utilities are higher than EQ-5D-5L for such states. The changes in utilities between adjacent states in all EQ-5D-5L dimensions and PROPr domains are reported in Appendix F in ESM 1.

Fig. 2 Changes in utilities between adjacent states between ‘comparable’ dimensions in EQ-5D-5L US value set and PROPr, when holding other domains at best, moderate and worst level respectively

4 Discussion

We have compared ‘theoretical’ values for three EQ-5D-5L value sets (US, England and Australia) and the US PROPr utilities based on PROMIS-29 + 2. The results show that EQ-5D-5L and PROPr utilities differ systematically, in important ways. We begin by summarising the key differences and then discuss each in more detail. First, PROPr has a narrower range of values compared with EQ-5D-5L. Second, the relative importance of dimensions and the utility decrements across dimensions for each instrument are very different. The relative distance between full health and the disutility corner states in PROPr differentiates less between domains: the utility decrements across PROPr are similar between domains in contrast to EQ-5D-5L values. Third, PROPr has substantially lower values for comparable relatively mild health states while having higher values for descriptively more severe health states, compared with the EQ-5D-5L values. Taken together, this means PROPr has a narrower range of values than the EQ-5D-5L values. Further, whereas the most commonly arising values in the EQ-5D-5L value sets lie in the mid-range of values, in PROPr the most frequently arising values are much lower. Lastly, the patterns in utility decrements within the dimensions of each instrument are different. When one dimension is considered across its best to worst levels and all other dimensions are held constant at their best or moderate level, in EQ-5D-5L value sets the greatest changes in utility occurred between levels 3 and 4 (i.e. the change from moderate to severe problems); in PROPr that occurred between the most severe ‘corner’ states and their adjacent states. However, when all other dimensions are held constant at their most severe level, there is no (or close to no) change in utility on PROPr, and PROPr utilities are higher than EQ-5D-5L for such states.

The narrower range of values in PROPr is a product both of its higher minimum value and the fact that the ‘best’ state (based on PROMIS-29 + 2) has a utility < 1. It is worth noting that PROPr utility can only reach 1 when using some PROMIS measures. PROPr was designed for the wider PROMIS system, and not only for PROMIS-29 + 2. PROPr focuses on a subset of seven PROMIS domains, and the items displayed in the valuation study for each domain were chosen from the domain item banks, in order to span the health state space. As a consequence, the chosen items are not necessarily included in every PROMIS measure (as shown in Appendix G in ESM 1), so the domain score range can be reduced (Footnote 7). This leads to the truncated PROPr utilities.

The relative importance of dimensions is also very different between the EQ-5D-5L US value set and PROPr—despite both being based on the stated preferences of the US general public. For example, pain is the most important dimension in the EQ-5D-5L US value set, whereas in PROPr it is the least important, apart from sleep disturbance. It is worth noting that pain ranks third in the whole PROPr scoring system, which is the main difference in the relative importance of domains using the PROMIS-29 + 2 instrument and PROPr utilities. This is because pain is the only domain for which the worst possible T-score on PROMIS-29 + 2 cannot be achieved (as shown in Appendix G in ESM 1). More generally, it is important to note that the importance of domains in PROPr is intertwined with the domain T-score ranges determined by the specific item chosen for that domain in the valuation study.

The utility decrements across domains in PROPr differentiate less between domains than is the case for EQ-5D-5L. PROPr utilities are strongly affected by the level of problem, and much less so by what aspect of health it is experienced in. One possible explanation for this is that the participants may have had difficulty using the SG valuation process to distinguish between these states.

We have found that PROPr has lower values than EQ-5D-5L for ‘comparable’ mild states. In interpreting this finding, it is important to note that PROPr inherits the characteristics of PROMIS measures, which do not exhibit the ceiling effects observed in EQ-5D instruments. Thus, with PROPr a respondent is less likely to report being in the best health state [9] (Footnote 8). This means that mild states in PROPr are very mild compared with EQ-5D-5L mild states, so our finding that they have lower utilities is notable.

At the other end of the scale, the worst health problems described by PROMIS-29 + 2 have values close to 0 (− 0.022), whereas those described by EQ-5D-5L have larger negative values (− 0.573 in the US value set, − 0.285 and − 0.676 for English and Australian value sets, respectively). In general, PROPr utilities are also higher than EQ-5D-5L values for more severe health states. However, there is an exception to that; the utilities of corner states in PROPr are substantially lower than those in EQ-5D-5L among corresponding domains. This pattern is also reflected in our analysis of differences in utility between adjacent states (Fig. 2)—EQ-5D-5L utilities are higher than PROPr utilities when holding other domains at best or moderate level, but become lower than PROPr when holding other domains at worst level.

The characteristics of PROPr and EQ-5D-5L utilities observed in this paper are likely to be a product both of what is being measured, and fundamental differences in the stated preference methods used and how utility algorithms are estimated from them. This includes mode of administration, elicitation technique, selection of the sub-set of health states to value, display of health states, number of health states valued by each respondent, the approach to anchoring at 0 and to handling values worse than dead, quality control and data exclusion criteria, modelling approach and choice of final model. For example, for EQ-5D-5L, the TTO task bounds individual respondents’ utilities between 1 and − 1 by design [25]. In contrast, PROPr asks respondents whether they prefer the all-worst state or dead. Instead of dead = 0 being established for each respondent (as is the case in EQ-5D-5L value sets), the PROPr approach entails calculating the mean utility of dead for all respondents who considered dead the worst state; the mean utility of dead for all those who considered the all-worst state to be worst, and rescaling and combining the two, weighted by group size [29]. There are a number of unresolved methodological issues regarding the stated preference methods used to value HRQoL. Recent papers raise fundamental questions about the way such utilities are anchored at 0 [30, 31]. A different approach to anchoring at 0 would have an important effect on both PROPr and EQ-5D utilities [32], and indeed the values for any PROMs.

It is important to note that both PROPr utilities and EQ-5D-5L utilities are based on stated preference data—there is no ‘gold standard’ against which to judge the veracity of value sets and revealed preferences data cannot be used to validate them. Nevertheless, the extent of the differences between utilities for PROPr and the EQ-5D-5L is striking, given both are based on the preferences of the US general public.

The contrasting characteristics of the utilities we have reported in this paper will combine with differences in the descriptive systems to affect data and results arising from their use and application. For example, cross-sectional validation studies of PROPr based on the general population have shown that the absolute differences in utilities between different socio-demographic groups were smaller for PROPr compared with EQ-5D-5L [33]. The smaller range of PROPr utilities means the change in QALYs from HRQoL-increasing interventions might be reduced. However, the implications of these differences in utility for QALY estimates are quite complex. For example, interventions that result in improvements from the mildest states may yield larger utility gains when estimated by PROPr rather than EQ-5D-5L. This is because PROPr utilities for mild states are much lower than those of EQ-5D-5L, so any improvement from a mild state to full health will yield a relatively large incremental gain in PROPr utility. PROPr utilities for disutility corner states (i.e. health states with the worst level in one domain and the best on all others) are substantially lower than the utilities for comparable EQ-5D-5L states, so the changes in utilities between these corner states and their ‘next best’ adjacent states are larger in PROPr. However, PROPr utilities for the all-worst health states are much higher than those for EQ-5D-5L. For the all-worst state in PROPr, an intervention that leads to an improvement in just one domain (e.g. a reduction in pain) results in no change in utility (e.g. even if pain is eliminated altogether), whereas in EQ-5D-5L such improvements generate large gains in utility. It is worth noting that, as utilities will be applied to both treatment and control groups, some of these utility differences between instruments may cancel out. Furthermore, the impact of differences between utilities on QALY gains also depends on whether the change is from improving quality of life or extending length of life, or both. This requires further investigation using clinical cohort data.

There are a number of limitations of this study. First, we have not assessed in this study whether our matched states are equivalent. In our main analysis, we considered comparable PROMIS-29 + 2 states to EQ-5D-5L health states based on domain raw scores, but this ‘direct’ match did not take advantage of the IRT-based descriptive system for PROMIS-29 + 2. In the sensitivity analysis (see Appendix D in ESM 1), we classified PROMIS-29 + 2 states based on T-scores, although it is unknown where EQ-5D-5L fits on that construct. Notwithstanding these limitations, the results of both sets of analyses were qualitatively similar in the differences in utility between adjacent health states and of matched states. Second, there are different options for estimating PROPr scores. Our main analysis focused on the PROPr scores calculated by PROMIS-29 + 2, given it is the most widely used PROMIS profile measure. It is possible that some characteristics of PROPr may differ depending on the PROMIS instrument chosen (in addition to the way T-scores are obtained and the set of intervals chosen, as mentioned previously), limiting the generalisability of the results to other PROMIS instruments. To check for this, we also calculated PROPr using PROMIS-29 (results presented in Appendix C in ESM 1). Although PROPr scores differ somewhat in the range of values (− 0.021 to 0.905 when calculated by PROMIS-29) and there are minor differences in PROPr scores of matched states, the results of both sets of analyses were qualitatively similar in terms of the differences in utility between adjacent health states and of matched states when compared with EQ-5D-5L value sets and the overall key conclusions drawn about the nature of PROPr and EQ-5D-5L values. Third, we have not tested the impact of EQ-5D-5L value sets and PROPr using patient-reported data, which would enable us to directly compare QALY estimates produced from each. More research is needed using both instruments in a variety of clinical contexts and longitudinally [11, 34, 35]. Our ongoing research aims to address these gaps in evidence using patient data [26].

It is important for decision makers and clinical triallists to be aware of the substantial differences between the characteristics of utilities for these two widely used PROMs, and their potential implications for the QALY estimates used in decision making. The different dimensions included in the classification systems, combined with the non-trivial differences in the characteristics of the utilities in PROPr and EQ-5D-5L value sets reported here, mean that the choice of instrument will potentially yield different results and evidence for decision making. To date, PROPr is available only for the US. Given the growing importance and use of PROMIS measures, we suggest that international PROPr utilities are required, in different country settings, and that further comparisons of these utilities with the EQ-5D-5L and other measures are needed to understand the impact of different measures and methods on the values elicited.

5 Conclusions

There are important differences between the characteristics of EQ-5D-5L and PROPr utilities. It is important for decision makers and clinical triallists to understand these differences, as they may exert an effect on utility-weighted patient data produced by each instrument. The smaller range of PROPr utilities means the change in QALYs from HRQoL-increasing interventions might be reduced. However, interventions that result in improvements from disutility corner states and from the mildest states may yield larger QALY gains when estimated by PROPr. Of course, these differences in utilities will apply to both treatment and control groups, and the effects on estimates of QALY gains also depend on whether these come from improvements in quality of life or from extending length of life. The characteristics of PROPr may be attributable to the specific methods used to obtain stated preferences, suggesting that further research is required to understand this, and potentially to investigate the merits of alternative methods. The methods we develop in this paper potentially provide a means of comparing the properties of utilities between any two HRQoL instruments.