Introduction

The impact of glaucoma on vision has been shown to have implications for patients’ health-related quality of life (HRQoL).1, 2 Economic evaluations are used to assess the value of new interventions, and the need for health outcomes for use in these evaluations has made the measurement of utilities important, whether they are obtained directly via Time Trade-off (TTO),3 Standard Gamble (SG),4 or through multi-attribute questionnaires, such as the EQ-5D,5 SF-6D,6 or HUI.7 There is increasing recognition of the desirability of cost-utility and cost-effectiveness analysis to inform decision-making, and utilities are used within these analyses to calculate quality-adjusted life years (QALYs). The assessment of utilities is essential for policy makers in decision-making processes for health care interventions.8, 9 The NICE reference case10 recommends the use of the EQ-5D within these studies; however, clinical studies often do not use utilities or multi-attribute questionnaires to measure HRQoL and instead use disease-specific health profiles or non-preference-based measures. Mapping provides a statistical model to estimate health-state utilities when no preference-based measure has been used. This allows the subsequent calculation of QALYs for cost per QALY analysis.

The primary aim of this study is to estimate a mapping algorithm to predict EQ-5D and SF-6D utility values based on the vision-specific measure, the 25-item Visual Functioning Questionnaire (VFQ-25), as well as the clinical measures of visual function, that is, integrated visual field (IVF), visual acuity (VA), and contrast sensitivity (CS). A previous study using the same data set has provided algorithms for converting clinical measures of vision directly into TTO utility values.11 Mapping relationships are estimated using a range of techniques and statistical specifications. The mapping functions will then be compared across the EQ-5D, SF-6D, and TTO. Although it is difficult to make hypotheses on the relative performance of the measures given the paucity of previous research, it is considered likely that the VFQ-25 scores will have a closer relationship to clinical tests of visual function than generic preference scores, as it contains vision-specific items.

Material and methods

All data were collected from patients at the Moorfields Eye Hospital in London, between September 2005 and September 2006, by a team of researchers from the London School of Hygiene and Tropical Medicine and Moorfields Hospital. Full details on data collection methods have been published.11 The inclusion criteria for the study were a diagnosis of primary open-angle glaucoma (POAG), age ≥18 years, being English-speaking, and the absence of any other problems that could affect the results of the visual tests or interview. Patients who had undergone eye surgery in the 6 weeks before the study were excluded.

Statement of ethics

The study was approved by the Moorfields and Whittington Local Research Ethics Committee and the London School of Hygiene and Tropical Medicine Ethics Committee. Informed consent was obtained from all participants in this study. Face-to-face interviews consisting of a series of questionnaires and visual tests were conducted at the Moorfields Eye Hospital.

QOL and clinical measures

No single outcome measure has been used consistently when assessing QoL; different types of outcome measures have been reported across studies. These include clinical measures, such as VA, CS, and IVF, and patient-reported HRQoL measures, such as SF-36, VFQ-25, EQ-5D, and SF-6D. These patient-reported measures can be generic instruments (for example, SF-36, EQ-5D, and SF-6D) or vision-specific (for example, VFQ-25). The generic measures can be split into preference-based measures, in which the scoring is based on preferences and may be used to calculate QALYs (for example, EQ-5D and SF-6D), and non-preference-based measures (for example, SF-36).

Alternatively, preferences may be elicited directly using methods such as TTO.

Four HRQoL measures were administered within this study. (1) The Short-Form health survey (SF-36) consists of 36 questions and 8 different domains (physical functioning, physical role limitation, social functioning, bodily pain, emotional role limitation, mental health, vitality, and general health perception).12 The results from the SF-36 were then converted into the preference-based SF-6D, using a published algorithm.13 (2) The EQ-5D5 has five dimensions (mobility, self-care, usual activity, pain/discomfort, and anxiety/depression) and three levels of severity defining 243 health states.14 (3) The TTO3 asked subjects to trade time between two alternative health-state scenarios. The method of TTO used in this study is detailed in another paper.11 (4) The vision-specific measure, VFQ-25, developed by the National Eye Institute, was also included.15 The VFQ-25 consists of 25 vision-related questions with 12 domains (general health, general vision, ocular pain, near activities, distance activities, social functioning, mental health, role difficulties, dependency, driving, color vision, and peripheral vision). The scoring range of the VFQ-25 is from 0 to 100, with higher scores representing better visual function and well-being.
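The figure of 243 EQ-5D health states follows from three severity levels on each of five dimensions (3^5 = 243). As a minimal sketch, the states can be enumerated as below; the five-digit labels (for example, "11111") are used here purely for illustration and are an assumption of this sketch rather than something taken from the study.

```python
from itertools import product

# Dimension names as listed in the text; levels 1-3 denote increasing severity.
DIMENSIONS = ("mobility", "self-care", "usual activity",
              "pain/discomfort", "anxiety/depression")
LEVELS = (1, 2, 3)

# Every combination of one level per dimension is one health state.
health_states = ["".join(str(level) for level in combo)
                 for combo in product(LEVELS, repeat=len(DIMENSIONS))]

print(len(health_states))   # 3 ** 5 = 243
print(health_states[0])     # "11111": lowest severity on every dimension
print(health_states[-1])    # "33333": highest severity on every dimension
```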

The clinical measures used to assess the visual functioning of the patients within this study are VA, CS, and IVF. VA was measured under standardized conditions in a well-lit room, with each eye assessed separately using a back-illuminated ETDRS logMAR chart at 4 m, or at 2 m or 1 m if the patient could not easily read the letters on the top line at 4 m or 2 m, respectively.16 The binocular VA was measured using the glasses the patients normally wear for distance. Best-corrected vision in the better eye was measured with the patient wearing glasses and looking through a pinhole to obtain the best possible level of vision. A score of 1.85 logMAR was given if the patient could not read any letters at 1 m, and a value of 2.3 logMAR was given if they could only see hand movements.17 Binocular CS was measured with the patient wearing their distance glasses, using the Pelli-Robson chart,18 with front illumination under standardized conditions at a distance of 1 m. A score of 0 was assigned if a patient could not read more than one letter on the chart. The main binocular visual-field test used was the integrated visual-field score.19 Using the monocular Humphrey 24-2 full-threshold test20 in both eyes, the visual fields of the two eyes were combined to make a binocular (integrated) field. An integrated field score was then calculated, giving an overall score between 0 and 104, in which the higher the score, the worse the visual field. Further details of the clinical scoring are published in another paper.11

Models

We present descriptive statistics on the HRQoL, utility scores, and clinical characteristics of the study sample (more details of the sociodemographics of the sample will be reported elsewhere (Fiammetta Bozzani et al, paper submitted)). Pearson’s correlations between the visual, utility, and HRQoL measures were calculated to assess the degree to which the assessments were related to each other (Table 1).

Table 1 Mapping functions

The general model used was the ordinary least squares (OLS) regression analysis, to test the relative contribution of the VFQ-25 dimensions, score, and visual function tests to the HRQoL measures. The OLS regression minimises residual error and assumes the residuals have a normal distribution.21 However, the OLS approach ignores the nature of the distribution of the dependent variable, which means estimations based on OLS methods may be biased and inconsistent.22
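To make the OLS mapping idea concrete, the following is a minimal sketch of a single-predictor regression of a utility score on the overall VFQ-25 score. The function names and the data are hypothetical and made up for illustration; they are not the study data, and the fitted coefficients are not the published algorithm. The analysis in this paper was run in Stata with additional covariates.

```python
def fit_ols(xs, ys):
    """Closed-form OLS for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predicted utility for a given predictor value."""
    return intercept + slope * x


# Made-up illustrative values, NOT taken from the study.
vfq25_scores = [40.0, 55.0, 60.0, 72.0, 80.0, 95.0]
sf6d_utils = [0.52, 0.58, 0.61, 0.66, 0.70, 0.78]

intercept, slope = fit_ols(vfq25_scores, sf6d_utils)
print(round(intercept, 3), round(slope, 4))
print(round(predict(intercept, slope, 70.0), 3))
```

With an intercept in the model, the OLS residuals sum to zero, so the mean of the predicted utilities equals the mean of the observed utilities in the estimation sample.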

The second model was included to address problems of censoring that are common to the EQ-5D. Tobit regression is designed to estimate linear relationships between variables when there is some form of censoring in the dependent variable.
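As a hedged illustration of how a Tobit model treats censoring, the sketch below evaluates the log-likelihood of a single-predictor Tobit model that is right-censored at an assumed ceiling of 1.0 (the maximum EQ-5D score). The function names, candidate parameter values, and data are hypothetical; in practice the parameters are estimated by maximizing this likelihood in a statistics package.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_sf(z):
    """Standard normal upper-tail probability, P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def tobit_loglik(intercept, slope, sigma, xs, ys, upper=1.0):
    """Log-likelihood of a Tobit model right-censored at `upper`."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = intercept + slope * x
        if y >= upper:
            # Censored: we only know the latent value is at or above the ceiling.
            total += math.log(normal_sf((upper - mu) / sigma))
        else:
            # Uncensored: ordinary normal density of the residual.
            total += math.log(normal_pdf((y - mu) / sigma) / sigma)
    return total


# Made-up illustrative values, NOT taken from the study.
vfq25 = [40.0, 55.0, 60.0, 72.0, 80.0, 95.0]
eq5d = [0.55, 0.69, 0.73, 0.85, 1.0, 1.0]

# Two hypothetical candidate parameter sets; the first tracks the data more closely.
print(tobit_loglik(0.2, 0.009, 0.05, vfq25, eq5d))
print(tobit_loglik(0.8, 0.0, 0.05, vfq25, eq5d))
```

Observations at the ceiling contribute a tail probability rather than a density, which is what distinguishes the Tobit likelihood from the likelihood underlying OLS.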

The third regression technique used is the censored least absolute deviations (CLAD). This was chosen to address heteroscedasticity. Under conditions of heteroscedasticity, Tobit models produce inconsistent estimates.22 In contrast, the CLAD estimator does not depend on distributional or homoscedasticity assumptions of the errors and is robust to censoring, and it produces consistent estimates even in the face of heteroscedasticity, non-normality, and censoring.23 Although CLAD estimates the median rather than the mean, we tested the model to assess its mapping performance. Heteroscedasticity was tested for using the White test. The variance inflation factor was used to test for multicollinearity.
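The difference between targeting the median and targeting the mean can be seen in the simplest case, an intercept-only model. The sketch below illustrates only the least-absolute-deviations component and does not implement the censoring step of CLAD; the values are made up and are not study data.

```python
from statistics import mean, median


def sum_abs_dev(c, ys):
    """Least-absolute-deviations loss for a constant prediction c."""
    return sum(abs(y - c) for y in ys)


def sum_sq_dev(c, ys):
    """Least-squares loss for a constant prediction c."""
    return sum((y - c) ** 2 for y in ys)


# Made-up utility-like values with a cluster at 1.0 and one low outlier.
ys = [0.20, 0.76, 0.80, 0.85, 1.0, 1.0, 1.0]
m, med = mean(ys), median(ys)

# The median minimizes the absolute loss; the mean minimizes the squared loss.
print(sum_abs_dev(med, ys) < sum_abs_dev(m, ys))   # True
print(sum_sq_dev(m, ys) < sum_sq_dev(med, ys))     # True
```

Because the low outlier pulls the mean down but leaves the median unchanged, the two estimators can give noticeably different predictions for skewed utility data.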

We present the mean of the estimated EQ-5D and SF-6D index scores, the mean absolute error (MAE), the mean square error (MSE), and the root MSE (RMSE). The MAE is the average of absolute differences between observed and estimated scores of all individuals, whereas the MSE is the expected value of the squared difference between the observed and the estimated scores.24 Both MAE and MSE measure the average precision at the individual level; however, the MSE places greater weight on bigger errors. The lower the RMSE, the better the model is performing. The best-performing models were selected on the basis of the lowest RMSE. Performance of the selected models was then assessed using the MAE between the observed and predicted index scores, and the model fit using R2. Although the MAE, MSE, and RMSE are all criteria for evaluating model performance, we present the models that have the lowest RMSE. This is because the RMSE is measured in the same units as the data, is representative of the size of a ‘typical’ error, and is more sensitive than other measures to the occasional large error. All statistical analysis was undertaken using Stata version 11.0 (StataCorp LP, TX, USA).
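The three error measures described above can be sketched as follows. The function names and the observed and predicted values are hypothetical and chosen only to show how one large error affects the RMSE more than the MAE.

```python
import math


def mae(observed, predicted):
    """Mean absolute error between observed and predicted scores."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mse(observed, predicted):
    """Mean square error between observed and predicted scores."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error, in the same units as the scores."""
    return math.sqrt(mse(observed, predicted))


# Made-up illustrative values, NOT taken from the study.
observed = [1.0, 0.8, 0.6, 0.2]
predicted = [0.9, 0.8, 0.7, 0.6]

print(round(mae(observed, predicted), 4))
print(round(mse(observed, predicted), 4))
print(round(rmse(observed, predicted), 4))
```

In this example three of the four errors are small and one is large (0.4), so the RMSE exceeds the MAE, reflecting the greater weight that squaring gives to large errors.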

Results

A total of 132 patients met the inclusion criteria and were enrolled in the study. Completion rates for the HRQoL measures were as follows: EQ-5D 99.24% (n=131), SF-6D 93.93% (n=124), TTO 93.18% (n=123), and VFQ-25 100% (n=132), and the rates for the clinical measures were as follows: CSB 100% (n=132), VAB 100% (n=132), and IVF 93.93% (n=124). The mean age of the sample was 71.79 years (range 27.61–93.45); 54% were male, 64% described their ethnicity as White, 18% were employed, and 41.67% were married. The mean time since diagnosis was 14.30 years, and the majority of patients (87.12%) had high-tension glaucoma. In terms of the visual characteristics of the sample, the mean VAB was 0.21 (SD=0.38), the mean CSB was 1.37 (SD=0.37), and the mean IVF was 37.65 (SD=33.62).

The mean values from each of the measures are shown in Table 2. The mean utility values for the preference-based measures were 0.77 for EQ-5D and 0.65 for SF-6D. Direct elicitation values by means of TTO resulted in a value of 0.90. The mean VFQ-25 score was 72.88. Of the generic preference-based measures, the SF-6D had the largest and most significant correlations with tests of visual function. However, the VFQ-25 score produced the overall highest and most significant correlations with tests of visual function (Table 3).

Table 2 Descriptive statistics of the health status measures
Table 3 Pearson’s correlation between CS, VA, and IVF, and health status measures

The mapping functions were compared using the RMSE as a measure of performance. For the EQ-5D, the RMSE was lowest for the mapping function containing the VFQ-25 dimensions, clinical measures, and demographics. The model with the lowest RMSE was OLS. With respect to SF-6D, however, the mapping was best from the overall VFQ-25 score, with the OLS model performing the best.

The errors between the observed and predicted values for the EQ-5D and SF-6D are plotted against the observed values in Figure 1. Both the EQ-5D and SF-6D predicted values were underestimated at the upper end of the scale and overestimated at the lower end of the scale. We calculated the MAE for each OLS model; the lower the number, the better the model performance. The MAE for EQ-5D ranged between −0.003 and 0.020, and for SF-6D, the MAE ranged between 0.039 and 0.044.

Figure 1 Difference between observed and predicted EQ-5D/SF-6D scores

To minimize modeling uncertainty within this study, we used three different models for prediction. Table 4 presents the best-performing algorithms for the OLS model for the EQ-5D and SF-6D. The results showed that OLS performed the best in comparison with both Tobit and CLAD. However, the adjusted R2 values indicated a poor fit for both the EQ-5D and SF-6D models, with the SF-6D model having a very low adjusted R2. Figure 1 shows that both the EQ-5D and SF-6D models underestimated utility for the higher scores and overestimated it for the lower scores. This is common in EQ-5D mapping; for example, Wolfe et al25 found that prediction errors tend to be increasingly positive for lower EQ-5D scores and increasingly negative for higher EQ-5D scores.

Table 4 OLS results

The issue of underestimating for high scores and overestimating for low scores will have implications for QALY analysis when using these algorithms to predict utility scores. This means that any small changes in vision will result in predicted changes in the EQ-5D or SF-6D that are smaller than they actually should be, therefore affecting any potential QALY changes and thus cost-effectiveness results. This highlights the importance of mapping functions, and of taking account of uncertainty and the normality of data when estimating algorithms. Although OLS proved to be the best-performing model, these results still show that the errors are high, indicating that other methods may be better at predicting the mapping function.

A recent review of mapping studies found a wide variety of model tests being used to evaluate the models, with some studies only presenting one method.26 A study by Payakachat et al27 explored different statistical approaches to predicting EQ-5D from the VFQ-25. This paper also used OLS, Tobit, and CLAD to estimate the EQ-5D.27 They found their models to have heteroscedasticity and non-normality, and therefore, they rejected the OLS and Tobit models. They concluded that the CLAD estimates should be recommended for producing the EQ-5D when only VFQ-25 data are available. Our data were homoscedastic, and therefore, OLS and Tobit estimates in our models would not have been biased.

Discussion

The primary aim of this paper was to map the VFQ-25 onto the EQ-5D and SF-6D utility measures. This paper presents mapping functions to indirectly derive generic utility from disease-specific measures. The best-performing function when estimating the EQ-5D included the VFQ-25 dimensions and demographics. For the SF-6D models, the best-performing function in terms of RMSE included only the VFQ-25 score. For consistency, we presented only the model specifications with the lowest RMSE.

There has been limited research into the field of QoL and glaucoma, and there is an ongoing debate as to how best to measure the utility values of glaucoma patients. The EQ-5D5 has been found to be insensitive to HRQoL in this population; studies using this measure found mean scores that did not differ substantially from their respective population norms.28, 29

Within this study, the mean EQ-5D value (0.77) was consistent with the values found in the literature of 0.8029 and 0.76,28 and with the average value of 0.78 for people aged 65–75 years30 within the general population, which implies that the HRQoL of the average glaucoma patient is not severely affected. This would therefore have significant implications for resource allocation, as the treatment may only have a marginal effect on HRQoL. However, the EQ-5D was found to be only correlated with IVF at the 5% level and insignificant with all other tests of visual function, suggesting it may not be a good predictor of health status for patients with glaucoma. This was highlighted further with the mapping results showing a poor model fit with the VFQ-25 dimension scores, suggesting that the content of the EQ-5D is not sensitive to areas involving vision.

Brown et al31 and Kobelt et al29 found that utility and HRQoL were driven by the better-seeing eye, and that as vision decreases in the worse eye, the better eye takes most of the function, and therefore, utility will only decline slightly. This provides one explanation why the utility values are relatively high for individuals suffering from a chronic disease. A further explanation is that the average time since diagnosis of the patients in this study was 14.3 years, suggesting that they have had effective management of the disease, and that they have adapted to the condition. The clinical vision measures also suggest that their vision has not been greatly affected, as most are still within normal range (see below).

Not all of the measures in this study were sensitive to patients with glaucoma. The EQ-5D did not correlate well with the tests of visual function. Given that the EQ-5D does not have a vision dimension, it may not represent the true value for patients with POAG. Indeed, 26.72% of the patients recorded the maximum EQ-5D score of 1.0, indicating a significant ceiling effect within the measure. In this study, we converted the SF-36 into the SF-6D, and this measure was found to have a small but significant correlation with tests of visual function. The observed mean of 0.65 was lower than the observed population norm of 0.79, which suggests that HRQoL has been affected by the condition.32

The main limitation of this study was that the mean VA in the better-seeing eye was 0.21; with the majority of patients having a very low logMAR score, this suggests that VA in this cohort of patients has not been severely affected by the condition. This might be because VA is not affected until the late stages of the disease, therefore implying that most people in this study had their glaucoma detected at an early stage and have had effective management. In this case, we would not expect to see any effect on HRQoL. The mean value of binocular CS was 1.36 in this study; this shows slight visual problems, with values less than 1.5 signifying visual impairment when measured on the Pelli-Robson scale.33 We would, however, expect the visual fields to be affected and for the score to be relatively high. A mean score of 37.65 suggests that the visual field has not been severely affected in many patients among the study sample. The data set was based on a sample from 2005 to 2006; although this may seem like an old data set, we would not expect the relationship between the measures to be dependent upon the year of data collection. Therefore, we found it valuable to use these data.

Conclusion

Mapping allows utility scores to be obtained from studies when utility measures were not originally used; these results can then be used in cost per QALY analysis. This study has attempted to provide an estimation of mapping algorithms, which could be used in future studies using the VFQ-25 when no preference-based HRQoL measure is used. The algorithms presented in this paper would need to be validated using an external data set. The patients in our study had relatively mild POAG, and therefore, there were minimal effects on their HRQoL. Further work needs to be done with a larger sample of patients with a much broader spectrum of the disease to establish the exact pattern of the relationship between disease progression and decline in HRQoL. Accurate models of the relationship between disease and HRQoL will allow clinicians to potentially benchmark their medical or surgical intervention against the potential loss or improvement of HRQoL to the patient.