Key Points for Decision Makers

The German EQ-5D-5L valuation study is the first country study to use the improved EQ-5D-5L valuation protocol 2.0 developed by the EuroQol Group, including a feedback module as internal validation, and a continuous quality control.

The study provides evidence that the refined valuation protocol with its quality control process appears to be a solid basis for estimating national EQ-5D-5L value sets.

The resulting German EQ-5D-5L value set, based on the complementary composite time trade-off and discrete choice experiment data in a hybrid model, is recommended as the preferred value set for Germany.

The German EQ-5D-5L value set allows a more refined preference-based health-related quality of life measurement to describe patients’ health and enables the use of the EQ-5D-5L in a range of applications such as cost-utility analysis for health care policy and clinical assessment in Germany.

1 Introduction

The three-level version of EQ-5D (EQ-5D-3L) is one of the most commonly used generic health-related quality of life (HrQoL) instruments in Germany and other countries [1, 2]. Recently, an improved version of this instrument was introduced by the EuroQol Group to increase its sensitivity, extending the number of response levels per dimension from three to five, namely EQ-5D-5L (the five-level version of EQ-5D) [3]. This instrument consists of five HrQoL dimensions (mobility, self-care, usual activities, pain/discomfort, and anxiety/depression), with each dimension specifying five levels of severity [no (level 1), slight (level 2), moderate (level 3), severe (level 4), and extreme problems/unable (level 5)], which allows the description of 3125 health states. Available evidence on the comparative performance of EQ-5D-3L and EQ-5D-5L suggests that EQ-5D-5L (1) is a useful improvement upon the measurement properties in terms of reduced ceiling effects, and (2) provides an improved discriminative capacity with a greater ability to detect differences between groups, compared with EQ-5D-3L [4, 5].
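The five-digit notation used for health states throughout this article (e.g., 11111 or 55555) follows directly from this descriptive system. As a minimal illustration, the following Python sketch enumerates all profiles; the variable names are chosen here for illustration only.

```python
from itertools import product

# The five EQ-5D-5L dimensions, in the order used for the five-digit profile.
DIMENSIONS = ("mobility", "self-care", "usual activities",
              "pain/discomfort", "anxiety/depression")
LEVELS = (1, 2, 3, 4, 5)  # 1 = no problems ... 5 = extreme problems/unable

# One level per dimension, written as a five-digit string such as "12345".
states = ["".join(str(level) for level in profile)
          for profile in product(LEVELS, repeat=len(DIMENSIONS))]

print(len(states))            # number of profiles, 5**5
print(states[0], states[-1])  # best and worst profile
```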

In contrast to EQ-5D-3L, currently, no specific set of social health status preference valuations for the German version of the instrument exists to convert each health state described by EQ-5D-5L to a preference-based summary score on a 1 (full health) to 0 (dead) scale, which allows quality-adjusted life year (QALY) calculations. To date, there is only a crosswalk value set available as an interim solution, based on a mapping algorithm between the two instruments [6]. As a result, the EQ-5D-5L currently can only be used with limitations as an alternative for EQ-5D-3L in Germany and there is a strong need for a national value set (‘tariff’) for EQ-5D-5L.

The introduction of the new instrument was considered as an opportunity to develop an internationally standardized valuation protocol with improved methods for health state valuation and accompanying computer-based valuation software (the EuroQol-Valuation Technology, EQ-VT). This standardized protocol enables comparison of the resulting EQ-5D-5L value sets between countries, in contrast to the EQ-5D-3L value sets where inconsistent study designs and methods were applied to elicit health state preferences [2, 7]. In the first wave of EQ-5D-5L valuation studies applying the first version of the protocol (EQ-VT Version 1.0), major data issues were observed leading to EQ-VT Version 1.1, a comprehensive EQ-VT research program [8], and finally to the improved EQ-VT Version 2.0 [9].

The objective of this study was to estimate a value set for EQ-5D-5L based on the societal preferences of the German population. Germany was the first country worldwide to use the improved valuation protocol 2.0, which includes a feedback module that lets respondents check the internal validity of their answers, together with an accompanying quality control (QC) process.

2 Methods

This study used the most up-to-date EuroQol Group’s valuation protocol, the EQ-VT 2.0 [9]. Version 2.0 is a refinement of the protocol that was used in the first wave of EQ-5D-5L valuation studies [7]. Across the different versions of the protocol the valuation tasks have remained the same, but later versions pay more attention to the optimal implementation of these tasks with the introduction of a QC procedure (since EQ-VT Version 1.1) and a feedback module allowing participants to review their responses (in EQ-VT Version 2.0). At the launch of this study, these modifications were still experimental. Hence, as part of an international EQ-VT research program [8], it was decided to conduct an initial experimental pre-study to test the protocol updates in Germany. The pre-study provided evidence of the feasibility and acceptability of the protocol, confirmed the positive impact of the modifications from EQ-VT 1.0 to EQ-VT 1.1, contributed evidence for the development of EQ-VT 2.0, and built the empirical basis for this valuation study [10].

2.1 Study Design

The health state valuations were collected in computer-assisted personal interviews (CAPI). In accordance with the protocol’s sample size calculation [11], a minimum sample of 1000 respondents from the German general population aged over 18 years was targeted. Quota-based sampling with respect to age, sex, educational level, and employment status was applied using German official statistics [12]. Interviews were conducted in six cities and their surrounding areas, located in different parts of Germany to ensure a balanced geographical spread: Berlin, Leipzig, Hamburg, Bielefeld, Munich, and Frankfurt. Individuals were recruited through a mixed recruitment strategy, i.e., through personal contact and from public locations. Interviews were conducted either in a public venue (e.g., research offices) or at the participants’ or interviewers’ homes.

2.2 Valuation Interview and Methods of Eliciting Preferences

A sub-set of the 3125 health states described by EQ-5D-5L was included in two preference elicitation tasks: composite time trade-off (cTTO) and discrete choice experiment (DCE) without duration. The EQ-VT design included a set of 86 EQ-5D-5L health states, divided into ten blocks of ten health states for the cTTO tasks (in which some states were present in multiple blocks) and 196 pairs of EQ-5D-5L health states, divided into 28 blocks of seven pairs for the DCE tasks. Each respondent was randomly assigned to one of the cTTO blocks and to one of the DCE blocks [11].

Each interview consisted of the following:

  1. welcome and purpose of the study;

  2. self-reported health using EQ-5D-5L and background questions;

  3. cTTO valuation tasks (wheelchair example, three practice states, ten real tasks, debriefing questions, and feedback module);

  4. DCE valuation tasks (seven tasks, methodological add-on with six tasks, debriefing questions);

  5. comment box;

  6. further background questions.

The cTTO aims to elicit the number of remaining life years in full health at which the respondent is indifferent between a longer period of impaired health (10 years in an EQ-5D-5L health state) and a shorter life span in full health. The cTTO approach involves commencing with the ‘conventional’ time trade-off (TTO) for all health states, and shifting to a ‘lead time’ TTO when the participants’ responses indicate that they consider the health state to be worse than being dead [13, 14, 15]. The resulting cTTO values range from −1 (trading whole lead time) to 1 (trading no years in full health) in 0.05 increments. The interviewer used the example health state “being in a wheelchair” to explain the cTTO tasks. After the cTTO tasks, each respondent was presented with the rank ordering implied by their cTTO valuations in the feedback module. The respondents were asked to review their responses and to flag any health state they felt should be reconsidered. Flagged health states could not be re-valued [9, 10].
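To make the value range concrete, the following Python sketch converts a point of indifference into a cTTO value. It assumes the usual scoring of the two task types, which is not spelled out in this article: x/10 for the conventional TTO and (x − 10)/10 for the lead-time TTO, where x is the number of years in full health at indifference (0 to 10, in half-year steps). The function name `ctto_value` is hypothetical.

```python
def ctto_value(years_full_health: float, worse_than_dead: bool) -> float:
    """Return the cTTO value implied by the point of indifference.

    Assumed scoring (not taken from the article):
      conventional TTO: x / 10         -> values from 0 to 1
      lead-time TTO:    (x - 10) / 10  -> values from -1 to 0
    """
    if not 0 <= years_full_health <= 10:
        raise ValueError("years in full health must lie between 0 and 10")
    if worse_than_dead:
        # Lead-time task: 10 years of lead time in full health can be traded.
        return (years_full_health - 10) / 10
    # Conventional task: 10 years in the impaired state versus x years in full health.
    return years_full_health / 10


print(ctto_value(10, worse_than_dead=False))  # no years traded
print(ctto_value(0, worse_than_dead=True))    # whole lead time traded
```

Half-year steps in x correspond to the 0.05 increments in the resulting values.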

In the DCE tasks, respondents were presented with a pair of EQ-5D-5L health states, designated A and B. They were asked to decide which of the two states was better (forced choice). Additionally, six further paired comparisons comprising a duration attribute and a death alternative were included as a methodological add-on. These data will be reported elsewhere, as this article focusses on the estimation of the German EQ-5D-5L value set using the EQ-VT protocol.

2.3 Data Collection and Quality Control Process

Figure 1 describes the data collection and QC procedures. In total, 32 professional interviewers from a market research company were intensively trained and each interviewer performed practice interviews. Data were reviewed using the EQ-VT QC software to check the interviewers’ performance and compliance with the interview script [16] and each interviewer received detailed feedback.

Fig. 1 Data collection and quality control of interviewers

The following QC criteria were defined:

  1. time spent on the wheelchair example was too short (less than 3 min);

  2. no explanation of the worse than dead task (‘lead time’ TTO) in the wheelchair example;

  3. clear inconsistency in the cTTO ratings (i.e., 55555 is not the lowest and at least 0.5 higher than the health state with the lowest value);

  4. time spent for the ten cTTO tasks was too short (less than 5 min).

If any of the criteria were met, the interview was flagged as being of “suspect” quality. Each interviewer had to perform at least two non-flagged interviews before entering the field phase. At the start of the data collection, experienced interviewers from the pre-study [10] participated in one interview supervising each interviewer. During the entire study, the interviewers received feedback based on a daily evaluation using the QC software. Interviewers who continuously underperformed, i.e., whose share of flagged interviews exceeded the maximum of 40%, were excluded from the study. Moreover, a written debriefing of every tenth participant provided additional information on the respondent’s perception of the interview and the interviewer’s performance, thus ensuring QC in terms of formative evaluation.
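The four QC criteria can be expressed as a simple rule check. The following Python sketch is an illustration only and is not the EQ-VT QC software; the `Interview` record, its field names, and the function `qc_flags` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Interview:
    wheelchair_seconds: float   # time spent on the wheelchair example
    lead_time_explained: bool   # was the worse-than-dead task explained?
    ctto_values: dict           # health state (e.g., "55555") -> cTTO value
    ctto_task_seconds: float    # total time for the ten cTTO tasks


def qc_flags(interview: Interview) -> list:
    """Return the QC criteria met by an interview (empty list = not flagged)."""
    flags = []
    if interview.wheelchair_seconds < 3 * 60:
        flags.append("wheelchair example shorter than 3 min")
    if not interview.lead_time_explained:
        flags.append("lead-time TTO not explained")
    worst = interview.ctto_values.get("55555")
    if worst is not None:
        lowest = min(interview.ctto_values.values())
        # 55555 valued at least 0.5 above the lowest-valued state
        # (small tolerance guards against floating-point rounding).
        if worst - lowest >= 0.5 - 1e-9:
            flags.append("55555 at least 0.5 above the lowest value")
    if interview.ctto_task_seconds < 5 * 60:
        flags.append("ten cTTO tasks shorter than 5 min")
    return flags


ok = Interview(240, True, {"55555": -0.5, "21111": 0.95, "33333": 0.3}, 600)
bad = Interview(120, False, {"55555": 0.5, "21111": 0.95, "33333": -0.2}, 200)
print(qc_flags(ok))
print(qc_flags(bad))
```

An interview counts as “suspect” as soon as the returned list is non-empty.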

2.4 Data Analysis and Modeling

Descriptive analyses were used to examine the sample characteristics and the responses to the cTTO and DCE tasks (proportions for discrete variables, mean, and standard deviation for continuous variables). Statistical modeling was used to estimate values for all health states described by EQ-5D-5L, drawing on the cTTO data and DCE data, and combining both in a hybrid model. In the case of cTTO, the dependent variable was defined as 1 minus the observed cTTO value for a given health state indicating disutility and hence coefficients expressed utility decrements. The observed cTTO values for the non-flagged health states after the feedback module were used (i.e., the respondent’s flagged cTTO observations were excluded). For the DCE data, the dependent variable was the binary stated choice (i.e., 0/1 indicated the choice for each health state pair). No DCE data were excluded.

As the EQ-VT design was optimized for main effects models [11], the dependent variable was explained by 20 independent variables: four variables for each EQ-5D-5L dimension, each representing the four levels beyond level 1 (“no problems”) as the reference category. The coefficients presented the decrement from level 1 to the respective level (regular dummies). Thus, the regression equation was as follows:

$$\begin{aligned}
Y = {} & \beta_{0} \\
& + \beta_{1}\,\text{MO2} + \beta_{2}\,\text{MO3} + \beta_{3}\,\text{MO4} + \beta_{4}\,\text{MO5} \\
& + \beta_{5}\,\text{SC2} + \beta_{6}\,\text{SC3} + \beta_{7}\,\text{SC4} + \beta_{8}\,\text{SC5} \\
& + \beta_{9}\,\text{UA2} + \beta_{10}\,\text{UA3} + \beta_{11}\,\text{UA4} + \beta_{12}\,\text{UA5} \\
& + \beta_{13}\,\text{PD2} + \beta_{14}\,\text{PD3} + \beta_{15}\,\text{PD4} + \beta_{16}\,\text{PD5} \\
& + \beta_{17}\,\text{AD2} + \beta_{18}\,\text{AD3} + \beta_{19}\,\text{AD4} + \beta_{20}\,\text{AD5} + \varepsilon
\end{aligned}$$
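The 20 dummy variables in the regression equation can be derived mechanically from a five-digit health state. The following Python sketch shows one way to do this; the function name `dummy_code` is hypothetical and the code is not part of the study’s Stata analysis.

```python
# Dimension abbreviations in the order of the five-digit profile.
DIMS = ("MO", "SC", "UA", "PD", "AD")


def dummy_code(state: str) -> dict:
    """Map a health state such as "12345" to the 20 regular dummies.

    Level 1 ("no problems") is the reference category, so a dimension at
    level 1 sets none of its four dummies.
    """
    if len(state) != 5 or any(ch not in "12345" for ch in state):
        raise ValueError("state must consist of five digits between 1 and 5")
    row = {f"{dim}{level}": 0 for dim in DIMS for level in range(2, 6)}
    for dim, ch in zip(DIMS, state):
        if ch != "1":
            row[f"{dim}{ch}"] = 1
    return row


row = dummy_code("12345")
print([name for name, value in row.items() if value == 1])
```

For state 12345 the active dummies are SC2, UA3, PD4, and AD5; for 11111 all dummies are zero.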

Firstly, to illustrate how the hybrid model combined both types of data, the cTTO and DCE data were modeled separately, and the agreement between the resulting utility decrements and predicted index values was compared. A censored regression approach (Tobit) assuming normally distributed errors was used for the cTTO data. According to the EQ-VT design, the maximum lead time to be traded was restricted to 10 years (minimum value of −1). However, it cannot be ruled out that respondents would have traded off more years, so the true values may have lain in the range (−∞;1] [7, 10, 17]. The cTTO responses were therefore considered to be left-censored at −1 [10, 18], and a Tobit model was selected for the cTTO data (Model 1). Because the constant term was marginal and non-significant, it was suppressed in the models. In the case of the DCE data, a conditional logit model was used (Model 2). The values derived from the DCE valuations were estimated on a latent scale and consequently cannot be used independently as a basis for value set generation. To allow direct comparisons, the coefficients were transformed to a 1 (full health) to 0 (dead) scale by dividing them by the rescaling parameter Theta of the hybrid model 3a, under the assumption that the cTTO model coefficients were proportional to the DCE coefficients [19].
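The censoring logic of the Tobit model can be illustrated with its log-likelihood. The following Python sketch is a simplified illustration, not the Stata routine used in this study. It assumes a homoskedastic normal error with standard deviation `sigma` and no constant term. Because the dependent variable is 1 minus the cTTO value, left-censoring of the value at −1 corresponds to censoring of the disutility from above at 2. All function and variable names are hypothetical.

```python
import math

CENSOR_DISUTILITY = 2.0  # 1 - (-1): a cTTO value of -1 on the disutility scale


def norm_logpdf(z: float) -> float:
    """Log density of the standard normal distribution."""
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)


def norm_sf(z: float) -> float:
    """Upper tail probability P(Z > z) of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def tobit_loglik(beta, sigma, X, d):
    """Log-likelihood of a Tobit model for disutilities d = 1 - cTTO value.

    X holds one row of dummy variables per observation (no constant term).
    Observations at the censoring point contribute the probability that the
    latent disutility is at least 2; all others contribute the normal density.
    """
    total = 0.0
    for x_row, d_i in zip(X, d):
        mu = sum(b * x for b, x in zip(beta, x_row))
        if d_i >= CENSOR_DISUTILITY:
            total += math.log(norm_sf((CENSOR_DISUTILITY - mu) / sigma))
        else:
            total += norm_logpdf((d_i - mu) / sigma) - math.log(sigma)
    return total
```

Maximizing this function over `beta` and `sigma` would give estimates on the disutility scale, so the coefficients can be read as utility decrements.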

Secondly, the cTTO and DCE data were combined in a hybrid model (Model 3a). The assumptions of the hybrid model were the same as for the Tobit model (cTTO data) and for the conditional logit model (DCE data). The combination was based on multiplying the likelihood function of the cTTO model by the likelihood function of the DCE model (for a more detailed description of the hybrid model see [20]).
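Because multiplying likelihoods is equivalent to adding log-likelihoods, the combination can be sketched as follows. This is an illustrative formulation, assuming (in line with the rescaling described above) that the DCE coefficients equal the cTTO-scale coefficients multiplied by Theta. The symbols $\sigma$, $\theta$, $x_{A}$, and $x_{B}$ (the dummy vectors of the two states in a DCE pair) are introduced here for illustration and are not taken from [20].

$$\begin{aligned}
\ln L_{\text{hybrid}}(\beta, \sigma, \theta) = {} & \ln L_{\text{cTTO}}(\beta, \sigma) + \ln L_{\text{DCE}}(\beta, \theta) \\
P(\text{A chosen over B}) = {} & \frac{\exp\left( -\theta\, x_{A}^{\top}\beta \right)}{\exp\left( -\theta\, x_{A}^{\top}\beta \right) + \exp\left( -\theta\, x_{B}^{\top}\beta \right)}
\end{aligned}$$

The negative sign reflects that $\beta$ measures disutility, so the state with the smaller total decrement is the more likely choice.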

Thirdly, as the observed variance of the cTTO values increased with the severity of the health state [18], a test for homoskedasticity in a Tobit regression was applied [21]. Because homoskedasticity was rejected, another hybrid model censoring at −1 and accommodating heteroskedasticity was performed by relaxing the variance for each parameter (as suggested by Ramos-Goñi et al. [20]) (Model 3b), whereby the assumptions of the hybrid model 3a were kept.

The model selection was based on theoretical considerations, logical consistency of the parameter estimates (i.e., the higher the dimensional level, the higher the utility decrement), and significance of the parameters (p < 0.05). Prediction accuracy was not assessable in terms of mean absolute error because of the missing counterfactual for hybrid model predictions [18, 19]. Thus, the value range, the ranking based on the size of the coefficient for the worst level on each dimension, and the relative importance of the dimensions were compared. The predictions for the EQ-5D-5L health states between models were compared using scatterplots and Pearson product-moment correlation analysis.

The plotted kernel distributions of the 3125 possible EQ-5D-5L health states for the selected EQ-5D-5L value set, the crosswalk value set [6], and the 243 attainable values of the TTO-based EQ-5D-3L value set [22] were used to compare the different value sets for Germany.

The exclusion of individual respondents, as in previous EQ-5D-5L valuation studies, was tested. The following groups were identified: respondents who gave the same cTTO value for all health states (n = 1), non-traders (n = 6), respondents with suspect DCE response patterns (i.e., straight-liners who always clicked the same option in all DCE tasks, or variations thereof such as ABABABA) (n = 26), and respondents whose interviews were flagged by the QC software (n = 83). A sensitivity analysis compared the resulting models with and without these respondents. As there were only marginal, non-significant differences, it was decided to use all available information and not to exclude any respondents. Statistical analyses were performed using STATA Version 14 (StataCorp. 2015. Stata Statistical Software: Release 14. College Station, TX: StataCorp LP).
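The suspect DCE response patterns mentioned above can be screened for automatically. The following Python sketch covers only the two patterns named explicitly (always the same option, and strict alternation such as ABABABA); which further variations were counted in the study is not specified, so the rule set here is an assumption, and the function name is hypothetical.

```python
def is_suspect_dce_pattern(choices: str) -> bool:
    """Check a respondent's sequence of DCE choices, e.g., "ABBABAA".

    Returns True for straight-lining (same option in every task) and for
    strict alternation between the two options across all tasks.
    """
    if len(choices) < 2 or any(ch not in "AB" for ch in choices):
        raise ValueError("choices must be a string of at least two A/B answers")
    straight_lining = len(set(choices)) == 1
    alternating = all(a != b for a, b in zip(choices, choices[1:]))
    return straight_lining or alternating


for pattern in ("AAAAAAA", "ABABABA", "AABABBA"):
    print(pattern, is_suspect_dce_pattern(pattern))
```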

3 Results

In total, 1158 interviews were conducted nationwide from December 2014 until March 2015 by 19 interviewers (Fig. 2). The mean interview time was approximately 48 min. Of these interviews, 83 (7.17%) were flagged according to the defined quality criteria. The final QC report can be found in the Electronic Supplementary Material (ESM) 1.

Fig. 2 Place of residence of respondents in the German EQ-5D-5L valuation study

3.1 Characteristics of the Sample

The study sample was representative of the German population in terms of age, sex, education, and employment status (Table 1). Self-reported health using EQ-5D-5L showed that the frequencies of reported problems varied from 6.39% in self-care to 55.61% in pain/discomfort, whereas 36.4% of respondents reported no problems in any dimension (11111).

Table 1 Demographics of the respondents in the German valuation study

3.2 Data Characteristics: Composite Time Trade-Off and Discrete Choice Experiment

On average, respondents had 7.39 iterative steps before they reached their point of indifference in the cTTO tasks. The observed mean cTTO values ranged from −0.461 for health state 55555 to 0.973 for health state 21111.

In the feedback module, respondents removed a total of 6.16% of cTTO responses (n = 713) from the rank ordering. Before the feedback module, 17.69% of respondents had at least one inconsistency (i.e., health state A defined as better than health state B but A having a lower cTTO value) in their responses (3.8% involving 55555). After the feedback module, 12.6% of respondents were still inconsistent (2.33% involving 55555) (p < 0.001). In total, 2.2% of all cTTO responses were inconsistent and this was significantly reduced by 0.72% after the feedback module (p < 0.001). The following results include all cTTO valuations after the feedback module (11,580 − 713 removed observations = 10,867 observations).
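The inconsistency definition used here can be made operational as follows. The Python sketch below assumes that state A is “defined as better” than state B when A is no worse on every dimension and differs from B (logical dominance); this reading and the function names are assumptions made for illustration.

```python
from itertools import combinations


def dominates(a: str, b: str) -> bool:
    """True if state a is at least as good as b on every dimension and a != b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def inconsistent_pairs(values: dict) -> list:
    """Return (better, worse) pairs where the better state got the lower value.

    values maps five-digit health states to one respondent's cTTO values.
    """
    pairs = []
    for a, b in combinations(values, 2):
        for better, worse in ((a, b), (b, a)):
            if dominates(better, worse) and values[better] < values[worse]:
                pairs.append((better, worse))
    return pairs


responses = {"21111": 0.9, "32211": 0.95, "55555": -0.5, "12345": 0.2}
print(inconsistent_pairs(responses))
```

In this constructed example, 21111 dominates 32211 but received the lower value, so that pair is reported.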

Of the observed cTTO values, 17.3% were negative, and most of these worse than dead responses were elicited at −1 (9.65%) (Fig. 3). The proportion of values clustered at 0 was 3%. The higher the severity level (i.e., sum of levels across dimensions), the lower the mean cTTO value, whereby the standard deviation increased with the severity level (p < 0.001) (Fig. 4).

Fig. 3 Observed composite time trade-off (cTTO) value distribution

Fig. 4 Mean composite time trade-off (cTTO) value by severity level

In the DCE tasks, the proportion of the choice of A or B was strongly correlated to the difference in the severity level between the health states. The greater the difference in severity between the two states, the more likely respondents were to choose the state with the lower severity level. In total, 26 respondents (2.2%) answered with a suspect response pattern across all DCE tasks.

3.3 Modeling

Both single-method models, the Tobit model for the cTTO data (Model 1) and the conditional logit model for the DCE data (Model 2), had one inconsistent order in levels 2 and 3 (i.e., a smaller parameter estimate than the adjacent one; Table 2). The absolute difference between the estimated parameters of the cTTO- and DCE-only models was on average 0.02 (largest difference for mobility level 3 with 0.048). The estimated predictions of both single-method models were strongly correlated (r = 0.998, p < 0.001). Both hybrid models (3a and 3b) had logically consistent parameter estimates (p < 0.001 for all coefficients).

Table 2 Parameter estimates for main effects models

Comparing all the models, there were only small differences in the parameter estimates. The preference ranking of the dimensions was consistent across the four models (ordered from most to least important): (1) pain/discomfort, (2) anxiety/depression, (3) self-care, (4) mobility, (5) usual activities.

The scatterplots of the predictions of the different models suggest the compatibility of the cTTO and DCE data (Fig. 5a) and show the effect of adding the DCE data to the cTTO valuations in the hybrid model (Fig. 5b). There was still a large agreement between the predictions (mean absolute difference of parameter estimates: 0.01). Moreover, accounting for heteroskedasticity spread out the range of the predicted utilities as the decrements for extreme problems were higher (Fig. 5c).

Fig. 5 Comparison of model predictions (86 states included in the composite time trade-off design)

3.4 Preferred Model (Value Set)

Following the agreement of the cTTO and DCE data, both data types can be combined in a hybrid model. In contrast to the cTTO- and DCE-only models, both hybrid models show a logical order for all parameter estimates (p < 0.001 for all coefficients). However, the value set has been based on Model 3b as this model takes into account the heteroskedasticity of the error terms in the cTTO data, and has the highest precision of the model coefficients (i.e., smaller standard errors and best model fit).

Applying the value set, EQ-5D-5L health state utilities can be estimated by subtracting the relevant decrement for each problem on each dimension from 1. A full set of the predicted EQ-5D-5L values is available in ESM 3. For example, the predicted EQ-5D-5L index value for health state 12345 was calculated as follows:

$$U\left( {12345} \right)\; = \;1.000 - 0.000 - 0.050 - 0.049 - 0.404 - 0.356\; = \;0.141$$
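The same calculation can be written as a short Python sketch. Only the four decrements that appear in the worked example above are included; the full coefficient set is given in Table 2 and would need to be supplied for other health states. The function name `index_value` is hypothetical.

```python
# Dimension abbreviations in the order of the five-digit profile.
DIMS = ("MO", "SC", "UA", "PD", "AD")


def index_value(state: str, decrements: dict) -> float:
    """Subtract the decrement for each dimension above level 1 from 1."""
    value = 1.0
    for dim, level in zip(DIMS, state):
        if level != "1":
            # Raises KeyError if a required decrement has not been supplied.
            value -= decrements[f"{dim}{level}"]
    return value


# Decrements taken from the worked example for health state 12345.
example_decrements = {"SC2": 0.050, "UA3": 0.049, "PD4": 0.404, "AD5": 0.356}

print(round(index_value("12345", example_decrements), 3))  # 0.141
```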

The magnitude of utility decrements and the resulting relative importance of the functional dimensions (mobility, self-care, and usual activities) were similar to each other. The decrements of pain/discomfort and anxiety/depression were higher, especially levels 4 and 5. However, symptoms of pain/discomfort had by far the greatest impact on HrQoL. All moves between the five consecutive levels within each of the five dimensions were statistically significant except for the move from level 2 to 3 in the functional dimensions.

3.5 Comparison of EQ-5D-3L and EQ-5D-5L Values

In contrast to the German EQ-5D-3L value set [22], each level of this EQ-5D-5L value set can be ascribed a certain utility weight that is consistent throughout. The kernel density plot of the index values for the 3125 attainable states using the EQ-5D-5L value set shows a left-skewed distribution, whereas the EQ-5D-3L value set is characterized by two peaks (bimodal distribution). The EQ-5D-5L value set covers a larger evaluation space without a constant as a deviation from full health (−0.661 to 1) than the EQ-5D-3L value set and the crosswalk (−0.205 to 1) [Fig. 6].

Fig. 6 Kernel density plot of all possible EQ-5D-3L and EQ-5D-5L values

4 Discussion

This article presents the results of the German EQ-5D-5L valuation study. It has provided evidence of the successful completion of an EQ-5D-5L valuation study using the improved valuation protocol EQ-VT 2.0, combined with intensive interviewer training and data monitoring. The different QC steps ensured high data quality in terms of few inconsistencies, high values for mild states, little clustering of values, and high interviewer compliance compared with previous valuation studies using earlier protocol versions without QC [9, 16, 23, 24].

This EQ-5D-5L value set is based on a hybrid model judged to be theoretically preferred over a TTO-based model under the assumption that individuals each have a unique utility function that determines their health state preferences. Both techniques, cTTO and DCE, try to measure the same utility function for health, and their coefficients can be related by a linear transformation (BetaTTO = Theta*BetaDCE) [9, 13, 19]. However, values derived from both methods need not be identical and may not capture values perfectly: health states are valued against time in cTTO (matching task), whilst in DCE, health states differing in the severity levels of their dimensions are directly compared with one another (choice task). cTTO data might therefore be influenced by scale compatibility and loss aversion [7, 25], whereas lexicographic preferences and attribute non-attendance have been reported for DCE [26, 27]. The idea that the true value cannot be derived from a single valuation technique provides an argument for the use of hybrid models. In addition, the parameter estimates may be derived with greater precision when informed by two types of (complementary) data, as evidenced in this article. The utility decrements, ranking, and relative importance of the dimensions were robust across the cTTO, DCE and hybrid models. However, using all valuation data in the selected hybrid model provided the highest precision of the parameter estimates (i.e., smaller standard errors) [9, 19]. Hence the German EQ-5D-5L value set considered the complementary cTTO and DCE data in a hybrid model without excluding individual respondents due to data issues. Innovative modeling approaches representing best practice [9, 18, 20] were used, taking into account the left-censoring of the cTTO data and the heteroskedasticity of the error terms to prevent biased parameter estimates.

This German study was not troubled with data issues reported in some studies using an earlier protocol version [23, 24]. The health state descriptions in the EQ-5D-5L were matched with a valuation method that was able to discriminate between health severity levels even when the differences were subtle. This stands in sharp contrast with the German TTO-based EQ-5D-3L value set [22], where some levels were collapsed. This inspires faith in this German EQ-5D-5L value set, and it may be noted that the kind of insights that guided this valuation study did not exist in the past when the EQ-5D-3L was valued. Therefore, the EQ-5D-5L value set with its larger evaluation space and the more differentiated distribution of attainable index values than its predecessor reinforces the (theoretical) potential of the EQ-5D-5L to better discriminate between patients, especially those with mild health conditions and for small changes in health. This might improve the applicability of the EQ-5D in general and in certain disease areas using the national value set in Germany. However, further research is required to examine the practically relevant evaluation space and the discriminative properties of the German EQ-5D-5L in patient populations.

An implication of the results is that a gradual shift in the basis for utility assessment in Germany might be anticipated, moving away from the EQ-5D-3L to the EQ-5D-5L. This would allow users to benefit from a better descriptive system [3] and from a valuation point of view, as the high quality EQ-5D-5L valuation data were derived from a much larger and geographically more diverse sample. However, the sample was clustered within six regions and there was a small middle-class bias. The study sample was, nevertheless, predominantly representative in terms of age, sex, education, and employment status for the German general population [12]. In summary, the resulting value set has provided a robust and up-to-date basis for estimating EQ-5D-5L index values and should be the preferred value set for Germany.

5 Conclusions

This is the first country study to utilize the improved EQ-5D-5L valuation protocol 2.0 developed by the EuroQol Group, including a feedback module as internal validation, and a continuous QC. It provides evidence that the refined protocol with its QC process appears to be a solid basis for estimating national EQ-5D-5L value sets.

The German EQ-5D-5L value set, based on cTTO and DCE, is recommended as the preferred value set for Germany. It allows a more refined preference-based HrQoL measurement to describe patients’ health and enables the use of the EQ-5D-5L in a range of applications such as cost-utility analysis for health care policy and clinical assessment in Germany.

This set of EQ-5D-5L preference values demonstrates the relative importance placed on different HrQoL dimensions according to the German general population, where the health problems of anxiety/depression and especially pain/discomfort should receive the greatest attention. These societal preferences have implications for the assessment of treatments that affect individuals’ HrQoL as a patient-relevant outcome and should be reflected in health care decision making in Germany.