Introduction

Economic evaluation is one of the most frequently used methods for assessing health care programmes. It applies empirical techniques to cost and outcome measures to inform resource allocation in specific populations and settings [1]. Currently, the EQ-5D is one of the most widely used generic preference-based measures (GPBMs) for assessing individuals’ health-related quality of life (HRQoL) to facilitate the economic evaluation of health care interventions [2]. It has been shown to be valid in different patient groups and settings [3]. Although the use of the EQ-5D has grown in recent decades, its ability to capture and assess people’s mental HRQoL and well-being is questionable [4,5,6]. Recent studies have indicated that poor physical health is highly likely to increase the risk of impaired mental health due to insecurity, confusion and emotional isolation [7, 8], and of poor social well-being due to loss of wealth, work and school closures and a shortage of adequate medical services [9, 10]. Pfefferbaum and North further indicated that impaired mental health and well-being may result in unhealthy behaviours and worsen people’s physical health [11]. Several studies have argued that the EQ-5D, whose main focus is on certain aspects of physical health (four out of five items), may not adequately capture and measure the effectiveness of mental health, public health and social care interventions, an issue likely to arise in populations affected by both acute and chronic diseases [12, 13].

The Recovering Quality of Life-Utility Index (ReQoL-UI) is a new GPBM that aims to capture changes in mental HRQoL [14]. It was developed on the basis of a theoretical framework established with considerable input from mental health service users, which is believed to offer a different perspective on the comparability of evaluations undertaken in physical and mental health [15]. The developers indicated that the ReQoL-UI has the psychometric advantage of detecting changes in HRQoL over time and differences across treatments. An alternative framework for measuring the cost-effectiveness of social care interventions is provided by the ICEpop CAPability (ICECAP) measures, which are theoretically grounded in Amartya Sen’s capability approach [16]. They were designed to measure people’s capability (what an individual can do) rather than function (what they actually do) to highlight the importance of freedom to choose. They focus on well-being defined in a broader sense rather than health [17]. The ICECAP instruments come in different versions; among them, the adult version (ICECAP-A) has been validated in the Chinese population [18].

Although GPBMs are increasingly used to evaluate the effectiveness of health and social care interventions, there is little evidence to inform the selection of the most appropriate one for use in economic evaluations in the Hong Kong (HK) general population. Using a reliable and appropriate instrument is vital to ensure that the benefits of interventions and policies are adequately captured [19]. Thus, this study aimed to assess the psychometric properties of three GPBMs, the EQ-5D-5L, ReQoL-UI and ICECAP-A, and to compare their performance in a sample of the HK general population to inform instrument choice when conducting economic evaluations of public health and social care interventions, especially where mental health is an important component.

Methods

Sample size

For conducting psychometric analysis, a minimum of 300 respondents is required [20]. Given the possibility of missing data, in this study, a target sample size of 500 from the HK general population was considered sufficient to perform such analysis.

Participants and data collection

A telephone survey was carried out in July 2020 to recruit participants. To minimize sampling error, telephone numbers were first selected randomly from the most recently updated public telephone directories as seed numbers. Another three sets of numbers were then generated by randomizing the last two digits so that unlisted numbers could also be reached. Duplicate numbers were screened out, and the remaining numbers were mixed in random order to form the final sample. A total of 5,385 telephone numbers were sampled for the survey. The inclusion criteria were being an HK permanent resident, being aged ≥ 18 years and being able to speak Cantonese. Upon successful contact with a target household, the adult who had most recently had a birthday was selected to complete a questionnaire over the phone. The study protocol and informed consent procedure were approved by the institutional review board of the Chinese University of Hong Kong (Ref. ID: SBRE-18-671).
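The number-generation step described above can be sketched as follows. This is a minimal illustration only: the function name `build_call_list`, the eight-digit example numbers and the decision to keep the seed numbers in the final list are assumptions, not details reported for the survey.

```python
import random

def build_call_list(seed_numbers, sets_per_seed=3, rng=None):
    """Expand directory seed numbers into a de-duplicated, shuffled call list."""
    if rng is None:
        rng = random.Random()
    candidates = set(seed_numbers)  # a set screens out duplicates automatically
    for seed in seed_numbers:
        for _ in range(sets_per_seed):
            # Keep the prefix and replace the last two digits with random ones.
            candidates.add(seed[:-2] + f"{rng.randrange(100):02d}")
    call_list = sorted(candidates)  # fixed order before shuffling, for reproducibility
    rng.shuffle(call_list)          # mix the numbers in random order
    return call_list

# Hypothetical seed numbers, used purely for illustration.
sample = build_call_list(["21234567", "23456789"], rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the generated list reproducible, which may be useful when documenting the sampling frame.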

Measurements

EQ-5D-5L

The Chinese EQ-5D-5L used in this study was approved by the EuroQol Group (www.euroqol.org). The descriptive system comprises five items (mobility, self-care, usual activities, pain/discomfort and anxiety/depression), each with five levels (no problems to extreme problems) [21], which can be converted into a summary utility score between 0 (death) and 1 (full health) to facilitate cost-utility analysis. The utility score was estimated based on the HK population’s preference weights [22]. We also administered the visual analogue scale (EQ-VAS) to describe individuals’ overall health status (0 [worst]–100 [best]).

ReQoL-UI

The ReQoL-UI, which was developed based on the ReQoL-20, was administered; it comprises six mental health items (activity; belonging and relationships; choice, control and autonomy; hope; self-perception; and well-being) and one physical health item [14]. The ReQoL-UI has been translated into Chinese and adapted for use in HK with the necessary permissions [23]. In the absence of HK-specific preference weights, we used the UK preference weights to calculate the utility score in this study. The weights were estimated from a sample of 305 members of the UK general population using the time trade-off method [15]. The ReQoL-UI utility score ranges between −0.195 and 1, which reflect people’s worst and best recovered HRQoL, respectively.

ICECAP-A

The ICECAP-A is a well-being measure assessing an adult’s capability. The five attributes measured are stability, attachment, autonomy, achievement and enjoyment [16]. In this study, the utility score of the ICECAP-A was calculated using the tariffs obtained from the UK general population using the best–worst scaling method [24]. The ICECAP-A utility score ranges between 0 (no capability) and 1 (full capability). The Chinese version of the ICECAP-A was approved by the University of Birmingham and its psychometric properties were reported by Tang et al. [18].

Generalized anxiety disorder 7-item scale (GAD-7)

The GAD-7 is a self-rated scale to measure the severity of generalized anxiety disorder. It has seven items scored from zero (not at all) to three (nearly every day) [25]. The cut-off points of the GAD-7 for mild, moderate and severe anxiety are 5, 10 and 15, respectively. The psychometric properties of the Chinese GAD-7 were reported by Tong et al. [26].
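The scoring rule above can be illustrated with a short sketch. The function name `gad7_severity` and the label "minimal" for totals below 5 are assumptions made for illustration; only the item range and the cut-off points of 5, 10 and 15 come from the text.

```python
def gad7_severity(item_scores):
    """Return the GAD-7 total score and a severity band."""
    if len(item_scores) != 7 or any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("GAD-7 needs seven item scores, each between 0 and 3")
    total = sum(item_scores)
    if total >= 15:
        band = "severe"
    elif total >= 10:
        band = "moderate"
    elif total >= 5:
        band = "mild"
    else:
        band = "minimal"  # assumed label for scores below the mild cut-off
    return total, band

# Example respondent with hypothetical answers.
total, band = gad7_severity([2, 2, 2, 2, 1, 1, 0])
```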

Depression anxiety stress scales—21 (DASS-21)

The DASS-21 consists of three sub-scales to assess the emotional states of depression, anxiety and stress [27]. Scores of each item range from 0 (never applied to oneself) to 3 (very much/most of the time). Final scores are calculated by summing the scores for the relevant items and then multiplying the sum by two. The cut-off points identified for no clinical problems are 9, 7 and 14 for the three sub-scales, respectively [27]. The psychometric properties of the Chinese DASS-21 were reported by Gong et al. [28].
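As a brief sketch of this scoring rule, the snippet below doubles the summed item scores of one sub-scale and compares the result with the cut-off. It assumes that each sub-scale has seven items and that a score at or below the cut-off counts as no clinical problem; the names `CUTOFFS` and `dass21_subscale` are illustrative.

```python
# Cut-off points for "no clinical problems", as quoted in the text.
CUTOFFS = {"depression": 9, "anxiety": 7, "stress": 14}

def dass21_subscale(name, item_scores):
    """Return (final score, True if no clinical problem) for one sub-scale."""
    if len(item_scores) != 7 or any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("each sub-scale needs seven item scores between 0 and 3")
    score = 2 * sum(item_scores)  # sum the relevant items, then multiply by two
    # Assumption: the cut-off is inclusive (score <= cut-off means no problem).
    return score, score <= CUTOFFS[name]

# The same raw answers fall on different sides of the cut-off by sub-scale.
depression = dass21_subscale("depression", [1, 1, 1, 1, 0, 0, 0])
anxiety = dass21_subscale("anxiety", [1, 1, 1, 1, 0, 0, 0])
```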

Sociodemographic characteristics and other indicators

Information about respondents’ demographics (sex and age), socioeconomic status (marital status, educational level, employment, living status, government allowance and personal income), health conditions (chronic condition and cognitive ability) and social well-being (life satisfaction and social relationship) was collected.

Statistical analysis

R software was used to perform all statistical analyses [29]. The level of significance was set at p ≤ 0.05. Acceptability, reliability, discriminant and convergent validity, and the correlations and agreement among the three measures were assessed in this study.

Acceptability

We assessed the completion rate of the three measures, which we expected to be similar given their comparable length. In addition, the proportion of missing values, score ranges, and floor (percentage with the lowest possible score) and ceiling (percentage with the highest possible score) effects were reported to assess acceptability.
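The floor and ceiling percentages defined above can be computed as in the following sketch. The function name `floor_ceiling` and the example scores are hypothetical; the lowest and highest possible scores must be supplied for each measure.

```python
from math import isclose

def floor_ceiling(scores, lowest, highest):
    """Return (floor %, ceiling %) for a list of utility scores."""
    n = len(scores)
    # isclose guards against tiny floating-point differences at the bounds.
    at_floor = sum(isclose(s, lowest, abs_tol=1e-9) for s in scores)
    at_ceiling = sum(isclose(s, highest, abs_tol=1e-9) for s in scores)
    return 100 * at_floor / n, 100 * at_ceiling / n

# Hypothetical utility scores on a 0 to 1 scale.
floor_pct, ceiling_pct = floor_ceiling([1.0, 1.0, 0.8, 0.5, 0.0], 0.0, 1.0)
```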

Internal consistency and test–retest reliability

Cronbach’s alpha (α) was used to assess internal consistency reliability, where α > 0.7 was considered acceptable. A random sample of 50 respondents (10%) was invited to complete the measures two weeks later to evaluate the test–retest reliability of the measures using the intra-class correlation coefficient (ICC, two-way mixed model, > 0.7 acceptable) [30]. The measures were expected to have similar reliability given that they have similar response structures and numbers of items.
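For readers less familiar with these two statistics, a minimal sketch is given below. The text specifies a two-way mixed model but not the exact ICC form, so the single-measure consistency form used here is an assumption, as are the function names and the toy data.

```python
from statistics import mean, variance

def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (same item order)."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def icc_consistency(rows):
    """rows: one list per respondent with the score at each occasion.

    Single-measure, consistency form under a two-way mixed model (assumed).
    """
    n, k = len(rows), len(rows[0])
    grand = mean(x for row in rows for x in row)
    ss_rows = k * sum((mean(row) - grand) ** 2 for row in rows)
    ss_cols = n * sum((mean(row[j] for row in rows) - grand) ** 2 for j in range(k))
    ss_total = sum((x - grand) ** 2 for row in rows for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

# Toy data: the second column always equals the first plus one.
toy = [[1, 2], [2, 3], [3, 4]]
alpha = cronbach_alpha(toy)
icc = icc_consistency(toy)
```

In the toy data the two columns differ only by a constant shift, so both statistics reach their maximum; an absolute-agreement ICC would penalize that shift and give a lower value.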

Convergent validity and hypothesized correlations between measures

Convergent validity was evaluated by testing a priori hypothesized associations using the Pearson correlation coefficient (r ≥ 0.7, strong; r > 0.5, moderate; r > 0.2, weak) [31]. We hypothesized that the three measures would show a positive and moderate/strong association with participants’ overall health status measured by the EQ-VAS. In addition, to test convergent validity, we formulated the following hypotheses based on the concepts measured by each instrument: (a) weak correlations among the utility scores of the three measures, as the concepts they capture are very different; (b) a moderate to strong negative correlation between the EQ-5D utility score and the physical item of the ReQoL-UI, as the former has four items on physical health; (c) a moderate negative correlation between the ReQoL-UI and ICECAP-A utility scores and the anxiety and depression item of the EQ-5D; and (d) a moderate negative association between the mental health items of the ReQoL-UI and the ICECAP-A utility score.
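The correlation thresholds above can be applied as in this sketch. Applying the thresholds to the absolute value of r (so that negative correlations are classified too) and the label "negligible" for |r| ≤ 0.2 are assumptions; the function names are illustrative.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def strength(r):
    """Classify |r| with the thresholds quoted in the text."""
    size = abs(r)
    if size >= 0.7:
        return "strong"
    if size > 0.5:
        return "moderate"
    if size > 0.2:
        return "weak"
    return "negligible"  # assumed label; the text does not name this band

# Hypothetical paired scores that are perfectly linearly related.
r = pearson_r([1, 2, 3], [2, 4, 6])
```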

Discriminant validity

Discriminant validity was assessed by examining the ability of the measures to differentiate people with different mental or physical health status, socioeconomic status and social well-being. We assumed that (a) respondents with no depression and no clinical signs using the GAD-7 and DASS-21 would report a high utility score; (b) respondents with no chronic conditions, and satisfied with their cognitive ability, life satisfaction and social relationship would report a high utility score; and (c) respondents with high socioeconomic status (non-government allowance receivers, living with families, fully employed and well-paid) would report a high utility score.

Mann–Whitney U test (MW test) and Kruskal–Wallis one-way analysis of variance (KW test) were used to compare the differences between subgroups. Effect sizes (EZ) calculated from the Z score (MW test) and H statistic (KW test) were used to assess the discriminative power of the measures. For the MW test, 0.1 < EZ < 0.3, 0.31 < EZ < 0.5 and EZ ≥ 0.5 were interpreted as weak, moderate and strong, respectively; for the KW test, 0.01 < EZ < 0.059, 0.06 < EZ < 0.139 and EZ > 0.14 were interpreted as weak, moderate and strong, respectively [32, 33]. Separate multiple linear regression analyses were used to predict the utility scores of the three measures from respondents’ sociodemographic variables.
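The text does not give the effect size formulas, so the sketch below assumes two commonly used ones: r = Z/√N for the MW test and η² = (H − k + 1)/(n − k) for the KW test, where k is the number of groups. The small gaps between the quoted bands (for example between 0.3 and 0.31) are closed here for simplicity, and all function names and example inputs are illustrative.

```python
from math import sqrt

def mw_effect_size(z, n_total):
    """Assumed formula: r = Z / sqrt(N); the sign of Z is kept."""
    return z / sqrt(n_total)

def kw_effect_size(h, k_groups, n_total):
    """Assumed formula: eta squared = (H - k + 1) / (n - k)."""
    return (h - k_groups + 1) / (n_total - k_groups)

def classify_mw(ez):
    size = abs(ez)
    if size >= 0.5:
        return "strong"
    if size > 0.3:
        return "moderate"
    if size > 0.1:
        return "weak"
    return "negligible"  # assumed label below the weak band

def classify_kw(ez):
    if ez >= 0.14:
        return "strong"
    if ez >= 0.06:
        return "moderate"
    if ez > 0.01:
        return "weak"
    return "negligible"  # assumed label below the weak band

# Hypothetical test statistics for a sample of 500 respondents.
ez_mw = mw_effect_size(-9.6, 500)
ez_kw = kw_effect_size(32.0, 3, 500)
```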

Agreement between measures

Agreement between measures was determined using Bland–Altman (B–A) plots and the ICC. In a B–A plot, the y-axis represents the difference between the utility scores of two measures and the x-axis represents the mean of the utility scores of the two measures. Scores distributed closely around the mean difference of the two measures indicate good agreement. We hypothesized that agreement among the three measures would be poor given their different conceptual structures.
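The quantities plotted in a B–A plot can be derived as in the sketch below. The limits of agreement are computed with the conventional mean difference ± 1.96 standard deviations, which is an assumption because the text does not state the multiplier; the function name and example scores are hypothetical.

```python
from statistics import mean, stdev

def bland_altman(scores_a, scores_b, z=1.96):
    """Return the points and summary lines of a Bland-Altman plot."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]        # y-axis values
    means = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]  # x-axis values
    bias = mean(diffs)  # mean difference between the two measures
    spread = stdev(diffs)
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "lower": bias - z * spread,  # lower limit of agreement
        "upper": bias + z * spread,  # upper limit of agreement
    }

# Hypothetical utility scores from two measures for four respondents.
result = bland_altman([0.9, 0.8, 1.0, 0.6], [0.8, 0.8, 0.9, 0.7])
```

A wide interval between the lower and upper limits relative to the 0 to 1 utility scale would point to poor agreement between the two measures.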

Results

Respondents’ characteristics and feasibility

A total of 500 respondents completed the survey and provided valid responses. Of these, 72.2% (n = 361) were female, 60.6% (n = 303) were older than 60 years, and over one third (n = 174) had completed primary school-level education or below. Additionally, nearly 90% (n = 448) reported living with their families, 27.8% were fully employed and roughly two thirds (n = 313) reported an income of ≤ 5000 HKD (US$650) per month (Table 1). All respondents completed the EQ-5D-5L, ReQoL-UI and ICECAP-A, indicating the feasibility of administering the three measures.

Table 1 Respondents’ characteristics

Acceptability

The utility scores of the ReQoL-UI, EQ-5D-5L and ICECAP-A covered nearly the full possible range. The ICECAP-A showed a lower mean score of 0.85 (range: 0.29–1) than the other two measures (ReQoL-UI: mean = 0.92, range 0.34–1; EQ-5D-5L: mean = 0.92, range 0.01–1). Analysis at the item level showed that 88.6%, 68% and 59.8% of respondents reported no problems on the hope, belonging and relationship, and choice and autonomy items of the ReQoL-UI, respectively. Around 70.8–93.2% and 23.4–60.4% of respondents reported no problems on the individual items of the EQ-5D-5L and ICECAP-A, respectively (Table 2). No missing data were identified. The distributions of the HRQoL measurement scores are presented in Fig. 1.

Table 2 Descriptive statistics, responses and reliability
Fig. 1 Histogram of the EQ-5D-5L, ReQoL-UI and ICECAP-A utility scores

Reliability

Cronbach’s alpha values for the ReQoL-UI, EQ-5D-5L and ICECAP-A were 0.74, 0.82 and 0.77, respectively, indicating acceptable internal consistency reliability (Table 2). The ICCs for the ReQoL-UI (0.74), EQ-5D-5L (0.82) and ICECAP-A (0.77) exceeded the recommended threshold of 0.7, indicating satisfactory test–retest reliability.

Correlation and convergent validity

The three measures were significantly correlated with each other. The ReQoL-UI was moderately associated with the EQ-5D-5L and ICECAP-A utility scores (r = 0.55 and 0.49, respectively) as well as with overall health (r = 0.55). The association of the ICECAP-A with the EQ-5D-5L and overall health was weak (r = 0.35). The EQ-5D-5L utility score was moderately associated with the physical health item of the ReQoL-UI (r = −0.67), and the pain/discomfort item of the EQ-5D-5L was significantly associated with the ReQoL-UI utility score (r = −0.51). The ICECAP-A utility score exhibited weak correlations with all items of the EQ-5D-5L, and weak to moderate correlations with six out of seven ReQoL-UI items (Table 3).

Table 3 Correlations between and convergent validity of the EQ-5D, ReQoL-UI and ICECAP-A

Discriminant validity

The ReQoL-UI, EQ-5D-5L and ICECAP-A showed satisfactory discriminant validity (Table 4). Respondents without mental health problems based on the outcomes of the GAD-7 and DASS-21, and those not receiving treatment from a psychiatrist, reported higher utility scores. The EQ-5D-5L (EZ = 0.32, p = 0.007) and ReQoL-UI (EZ = 0.16, p < 0.001) exhibited a stronger discriminative ability than the ICECAP-A to differentiate respondents with/without chronic conditions and cognitive problems, respectively. The EQ-5D-5L also showed stronger discriminatory power than the other two measures regarding respondents’ government allowance status (EZ = −0.43, p < 0.001) and income levels (EZ = 0.06, p < 0.001).

Table 4 Discriminant validity of the EQ-5D-5L, ReQoL-UI and ICECAP-A

Results of multiple regression analysis

Figure 2 shows that education was a significant predictor of the utility scores of all three measures. Respondents who were highly educated showed good HRQoL and well-being. Respondents who were divorced/widowed had lower EQ-5D-5L (coefficient = −0.095, p = 0.002) and ReQoL-UI (coefficient = −0.054, p = 0.01) utility scores. Respondents with good pay tended to report a high ICECAP-A utility score (coefficient = 0.064, p = 0.04).

Fig. 2 Multiple regression models of the EQ-5D-5L, ReQoL-UI and ICECAP-A utility scores and selected sociodemographic characteristics

Agreement between measures

The agreement between the three measures was poor. The ICC of the EQ-5D-5L and ReQoL-UI was 0.5, which was higher than that of the other two pairs. The B–A plots demonstrate wide limits of agreement between measures. A systematic difference in agreement was observed at low utility scores, which indicated that respondents with poor health status/well-being were more likely to report less consistent utility scores (Fig. 3).

Fig. 3 Bland–Altman plot of the EQ-5D-5L, ReQoL-UI and ICECAP-A

Discussion

This was the first study to directly compare the psychometric properties and performance of three GPBMs, the EQ-5D-5L, ReQoL-UI and ICECAP-A, in the HK general population. All of them exhibited satisfactory feasibility, reliability and validity for assessing the population’s HRQoL and well-being related outcomes. The ICECAP-A showed a stronger ability to differentiate between people reporting different mental health status. However, the EQ-5D-5L outperformed the other two measures in subpopulations with different physical health and socioeconomic status. Given that the conceptual structures of these measures were different, the low agreement between them was expected. Overall, the psychometric properties of the three measures were relatively sound, with the EQ-5D-5L performing better than the other two measures in our sample of the HK general population.

All the measures showed good feasibility and acceptability, because no missing data were detected, no ceiling or floor effects were observed, and utility values covered nearly the full score range. The values of α confirmed that the internal consistency reliability of the three measures was acceptable: the EQ-5D-5L showed good internal consistency reliability (α = 0.82), and the ICECAP-A and ReQoL-UI exhibited acceptable reliability (α = 0.77 and 0.74, respectively). This finding was not unexpected because a great number of studies have confirmed the good performance of the EQ-5D-5L in the HK Chinese population [34,35,36,37]. However, no empirical evidence about the other two measures in the HK population was found; in particular, this is the first paper to use the ReQoL-UI in the HK population [15]. Additionally, the ICECAP-A and EQ-5D-5L exhibited acceptable to good test–retest reliability but the ICC for the ReQoL-UI was poor.

Regarding correlation and convergent validity, all the measures showed a significant association with each other and were moderately correlated with respondents’ overall health. This is consistent with findings in the literature. For example, a Hungarian study reported a correlation of 0.57 between the ICECAP-A and the EQ-5D-5L among the general public, and found that the correlation between the EQ-5D-5L and the achievement and enjoyment items of the ICECAP-A was stronger than with the other items [38]. Another multi-centre study also reported a moderate association between the EQ-5D-5L and ICECAP-A in UK (r = 0.36) and German (r = 0.35) healthy populations [39]. Nevertheless, no study of the relationship between the ICECAP-A and EQ-5D-5L in the Chinese population was found.

In this study, all three measures exhibited moderate ability to discriminate between people with different health and socioeconomic status. For instance, the EQ-5D-5L showed stronger discriminatory power than the other measures in distinguishing people with different physical health and socioeconomic status, which was consistent with the findings of previous studies [35, 40, 41]. However, some hypotheses were not confirmed. For example, although the ReQoL-UI and ICECAP-A were both mainly designed to assess people’s mental HRQoL and well-being, we found that the ICECAP-A showed a higher ability than the ReQoL-UI to differentiate between people with and without mental health problems. In addition, compared with the ICECAP-A, the EQ-5D-5L and ReQoL-UI showed stronger discriminatory power in differentiating people with different levels of satisfaction with social life, where we had assumed that the ICECAP-A would outperform the other measures. One possible explanation is that the utility scores of the ReQoL-UI and ICECAP-A were calculated based on UK preference weights. Given that both mental HRQoL and well-being are subjective concepts [42], using the UK population’s preferences may undermine the measures’ validity in the Chinese population. Further, there is evidence that country-specific weights affect GPBMs’ utility values [43, 44]. The development of local value sets for the ICECAP-A and ReQoL-UI is, therefore, recommended. Furthermore, regression analysis showed that educational attainment is an important predictor of people’s HRQoL and well-being, which was in line with previous findings [41, 45,46,47].

Although our preliminary results showed that all three measures exhibited satisfactory psychometric properties, the choice of measure depends on the concept that the study intends to measure. For instance, if physical health is the focus of an intervention, then the EQ-5D-5L, which covers most physical health aspects of HRQoL, is preferred. However, if the objective is to measure the impact of treatments that focus on improving mental health or well-being [14, 48], the other two measures may be appropriate given their conceptualization and constructs, though their performance in the HK population needs further exploration.

A strength of this study is that it directly compared the performance of three GPBMs in the same sample of the HK general population, which supports the generalizability of our findings to economic evaluations in the wider HK population. Moreover, collecting the data during the COVID-19 pandemic may have increased the sensitivity of these measures to detect differences in people’s mental HRQoL and well-being, facilitating the assessment of the reliability and validity of the ICECAP-A and ReQoL-UI. However, several limitations need to be addressed. First, utility scores of the ICECAP-A and ReQoL-UI were calculated using UK preference weights, which may have introduced bias in assessing the validity of those two measures. Second, our sample was not representative of the HK general population in terms of age, as it was hard to recruit younger people through the telephone survey. Last, although the utility scores of the three measures ranged between 0 and 1 in this study, which suggests that direct comparison between them is reasonable, the lower limits of the EQ-5D-5L and ReQoL-UI utility scores can be below 0, which may raise methodological issues when interpreting outcomes across different measures. This issue should be further explored.

Conclusions

This study confirmed that the EQ-5D-5L, ICECAP-A and ReQoL-UI performed psychometrically well in this sample of the HK general population, though the agreement between them was poor. Considering their distinct theoretical structures, the selection of a measure to facilitate economic evaluation depends on the nature and objective of the intervention. Using the EQ-5D-5L to measure health benefits from a mental health intervention may fail to capture its benefits, leading to misallocation of resources to mental health services. Additionally, studies are needed to further investigate the psychometric performance of the ICECAP-A and ReQoL-UI using preferences elicited from the HK general population.