Key Points for Decision Makers

Generic preference-based measures (PBMs) play an important role in health technology assessment in Asian countries.

The EuroQol-5 Dimensions (EQ-5D) has shown good construct validity and responsiveness in most countries and most disease groups in East and South-East Asia.

Future research should be expanded to rarely or never tested PBMs, such as the Health Utilities Index, Quality of Well-Being scale, and Assessment of Quality of Life instrument in this region.

1 Introduction

Preference-based measures (PBMs) provide a convenient approach to deriving health state values for the calculation of quality-adjusted life-years (QALYs) in cost-utility analysis [1]. The use of a PBM starts with describing health status or health-related quality of life (HRQoL) of individuals using a standardized questionnaire. The HRQoL data can then be converted into health state values using a scoring method (also known as a ‘value set’). The value sets are established using the health preferences of the general public for the health states described by the PBMs. All PBMs use a scale anchored by 0 (corresponding to dead) and 1 (corresponding to full or perfect health), with or without negative values for very poor health states.

PBMs are usually developed for use in one population or culture, and subsequently introduced to other populations after translation or cultural adaptation. Since cultural, environmental, and psychosocial factors may affect the performance of PBMs, the measurement properties of PBMs should be validated in all populations and cultures to which they are introduced. Measurement properties that are relevant to all PBMs include construct validity, test-retest reliability, and responsiveness [2, 3].

In psychometrics, construct validity refers to the extent to which a scale measures what it is supposed to measure, test-retest reliability refers to the ability of a scale to generate reproducible measurement results, and responsiveness or sensitivity to change refers to the ability of a scale to capture the change in the levels of the targeted construct [3]. The testing of all three measurement properties involves collecting individual-level data using the scale, and performing statistical analyses. Construct validity is usually assessed through hypothesis testing because of the absence of a ‘gold standard’ measure [3]. Typically, the hypotheses are that a scale should be correlated with another scale measuring a similar construct (i.e. convergent validity) or that measurement results for groups known to differ in certain characteristics should be different (known-groups validity). The more hypotheses fulfilled, the more likely a scale is valid [3]. Test-retest reliability is assessed by examining the agreement between two different measurements of the same group of individuals whose levels in the targeted construct are the same at the times of the two measurements. Depending on the nature of the scale, statistics such as intraclass correlation coefficient (ICC) can be used as the indicator of test-retest reliability. Responsiveness assessment requires longitudinal data collection of individuals whose levels of the targeted construct change over time. Statistics that can be used to indicate responsiveness include standardized effect size (SES), standardized response mean [3], and receiver operating characteristic analysis [4].
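The two change-score statistics named above can be illustrated with a minimal sketch. It assumes the commonly used definitions (SES as mean change divided by the standard deviation of baseline scores, standardized response mean as mean change divided by the standard deviation of the change scores); the function names and the patient scores are hypothetical and are not taken from any study in this review.

```python
from statistics import mean, stdev


def standardized_effect_size(baseline, follow_up):
    """Mean change divided by the sample SD of the baseline scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the sample SD of the change scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Hypothetical health state values for five patients, before and after treatment.
baseline = [0.50, 0.60, 0.70, 0.80, 0.90]
follow_up = [0.60, 0.75, 0.75, 0.95, 0.95]

ses = standardized_effect_size(baseline, follow_up)
srm = standardized_response_mean(baseline, follow_up)
print(f"SES = {ses:.2f}, SRM = {srm:.2f}")
```

The two statistics share a numerator and differ only in the denominator, so a sample with widely spread baseline scores but uniform change will show a small SES and a large standardized response mean.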

Designed for use in a wide range of therapeutic areas, generic PBMs are particularly useful in economic evaluations informing resource allocations. In the past decades, generic PBMs such as EuroQol-5 Dimensions (EQ-5D) [5] and Short Form-6 Dimensions (SF-6D) [6] have been increasingly used in Asian countries and many validation studies assessing their measurement properties in Asian populations have been published. However, the overall performance of PBMs in different countries or patient populations in this region is unknown. This is an important knowledge gap since cost-utility analysis is increasingly used to inform reimbursement decision making in Asia [7, 8].

The aim of this systematic review was to review and summarize the current evidence on the measurement properties of generic PBMs in Asian populations.

2 Methods

The COnsensus-based Standards for the selection of health Measurement Instruments (COSMIN) guideline for systematic reviews of outcome measurement instruments [4] was used to guide this review. Different from systematic review guidelines that are designed to evaluate interventional studies (e.g. the Cochrane guideline), the COSMIN guideline is specialized for evaluating measurement properties that are usually assessed in observational studies. It provides methods and tools for use in the entire process of systematic reviews, including literature search, selection and evaluation of studies, interpretation of results, and reporting of findings. In this review, two members of the review team worked independently through all phases of the review, and discrepancies were resolved via consensus meetings with the other two members of the review team. The four phases of the review process are described below.

2.1 Identification and Selection of Studies

The search was carried out using online databases, including MEDLINE (OvidSP), EMBASE (OvidSP), PsycINFO (OvidSP), and PubMed, in August 2019. Three groups of search terms were included to describe: (1) country/district, including countries/districts in South-East and East Asia: ‘China’, ‘Korea’, ‘Japan’, ‘Singapore’, ‘Taiwan’, ‘Hong Kong’, ‘Indonesia’, ‘Malaysia’, ‘Philippine’, ‘Thailand’ and ‘Vietnam’; (2) PBMs of interest, including ‘EQ-5D-3L’, ‘EQ-5D-5L’, ‘EQ-VAS’, ‘SF-6D’, ‘HUI2’, ‘HUI3’, ‘QWB’, ‘15D’, and ‘AQOL’; and (3) measurement properties, including ‘construct validity’, ‘test-retest reliability’ and ‘responsiveness’. All spelling variations, acronyms and related terms were included in the search algorithm (Appendix 1 of Supplementary file). The search filter developed by Terwee et al. [9] for the identification of reports on measurement properties of measurement instruments was adapted for use in this review. Although the EuroQol-Visual Analog Scale (EQ-VAS) is not a PBM, it was included as it is a part of EQ-5D.

A set of predefined selection criteria was applied to the hits generated by the search terms. Papers that examined the construct validity, test-retest reliability, and/or responsiveness of any PBM in any country/district of interest were included. Original research using primary data, such as interventional and observational studies, was included. Secondary research, including reviews, was excluded. Reports on mapping, reports published in a language other than English, and commentaries or conference papers (i.e. abstracts) were also excluded.

2.2 Data Extraction

The COSMIN guideline differentiates papers and studies [4]. Each hypothesis tested, ICC, or SES value reported for assessing construct validity, test-retest reliability, and responsiveness, respectively, is treated as one study. Therefore, a paper can include more than one study.

Information extracted from each study included PBM, sampling country or district, medical condition of study subjects, sample size, sample mean age, sample sex distribution, language of administration, and study design and result (see the following sections for more detail).

2.3 Assessment of Individual Studies

Each study was graded for its result and methodological quality using the methods prescribed in COSMIN [4]. The methods are briefly described below.

The result for construct validity was graded based on whether or not it was congruent with a relevant hypothesis formulated by the review team. COSMIN recommends that systematic review teams formulate a set of hypotheses for assessing known-groups and convergent validity (including direction and magnitude of correlations) [4]. This is to ensure that results from all studies included in the review are interpreted using the same criteria. In this review, the review team formulated hypotheses based on published papers and on their expert experience. Example hypotheses were ‘patients with worse symptoms would have lower PBM scores’ (for testing known-groups validity) and ‘PBM and Health Assessment Questionnaire (HAQ) scores would be negatively and strongly correlated’ (for testing convergent validity). If the results of a study supported the relevant hypothesis, a ‘positive’ rating was given; otherwise, a ‘negative’ rating was given.

Reported results on test-retest reliability (i.e. ICC value) were graded using 0.7 as the threshold [4]. A ‘positive’ rating was given if the ICC value was ≥ 0.70, otherwise a ‘negative’ rating was given. Although area under the curve (AUC) is recommended for assessing responsiveness by COSMIN, the review team used SES because all studies assessing responsiveness included in this review reported either only SES or results that could be used to calculate SES; only one study reported AUC and SES. An SES value below 0.20 has been interpreted as negligible [3, 10]. The review team assigned studies reporting an SES value < 0.20 a ‘negative’ rating, and those with an SES value ≥ 0.20 were assigned a ‘positive’ rating.
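The two cut-offs described above can be written as a short sketch. The thresholds (0.70 for ICC, 0.20 for SES) come from the text; the function and constant names are hypothetical, and the comparison is applied to the SES value exactly as reported, since the text does not say how negative values were handled.

```python
ICC_THRESHOLD = 0.70  # from the text: ICC >= 0.70 is rated 'positive'
SES_THRESHOLD = 0.20  # from the text: SES >= 0.20 is rated 'positive'


def rate_test_retest(icc):
    """Rate one test-retest reliability study from its reported ICC."""
    return "positive" if icc >= ICC_THRESHOLD else "negative"


def rate_responsiveness(ses):
    """Rate one responsiveness study from its reported (or derived) SES."""
    return "positive" if ses >= SES_THRESHOLD else "negative"


# Hypothetical reported values.
for icc in (0.65, 0.70, 0.82):
    print(f"ICC {icc:.2f}: {rate_test_retest(icc)}")
for ses in (0.10, 0.20, 0.55):
    print(f"SES {ses:.2f}: {rate_responsiveness(ses)}")
```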

Using the ‘Risk of Bias’ assessment tool, the methodological quality of all studies was rated as ‘very good’, ‘adequate’, ‘doubtful’, or ‘inadequate’ [4]. Different standards were used to assess studies of convergent validity, known-groups validity, test-retest reliability, and responsiveness. These standards targeted various aspects of the design and execution of the studies. For example, measurement properties of the comparator instrument were targeted for assessing studies of convergent validity; characteristics of the comparison groups were targeted for assessing studies of known-groups validity; and stability of patients, time interval between test and retest, and similarity between test conditions were targeted for assessing studies of test-retest reliability. All assessments were made according to COSMIN recommendations, except for one of the standards for assessing convergent validity studies and the standards for assessing responsiveness studies (the modified standards used are shown in Appendices 2 and 3 of Supplementary file).

2.4 Assessment of the Preference-Based Measures (PBMs)

Since measurement properties may vary across populations, the review team assessed the measurement properties of each PBM in different populations separately. In this review, EQ-5D-3L and EQ-5D-5L were treated as one PBM (i.e. EQ-5D), Health Utilities Index (HUI) 2 and HUI3 as HUI, and SF-6Ds derived from SF-12, SF-36, and its descriptive system were not examined separately. For each PBM, different language versions or modes of administration (i.e. self- and interviewer-administered) were not examined separately. The populations were defined first by country/district and then by disease group. The disease groups were defined by the primary medical conditions of study samples included in this review using the International Classification of Diseases, 11th Revision (ICD-11) [11]. Studies on the general population were treated as one group.

For each PBM, separate assessments were performed using relevant studies to evaluate its population-specific measurement properties. Each of the assessments had two components: the measurement property and the quality of the evidence used in the assessment. The measurement property was rated as ‘sufficient’ (if at least 75% of the relevant studies had a ‘positive’ rating), ‘inconsistent’ (if 25–74% of the relevant studies had a ‘positive’ rating), or ‘insufficient’ (if < 25% of the relevant studies had a ‘positive’ rating) [4]. Using the COSMIN Grading of Recommendation Assessment, Development, and Evaluation (GRADE), the quality of evidence was rated as ‘high’, ‘moderate’, ‘low’, or ‘very low’. To determine the grade for quality of evidence, the review team first assigned a rating of ‘high’ and then downgraded the rating based on the methodological quality of included studies (i.e. the ‘Risk of Bias’ factor) and the sample sizes of the studies (i.e. the ‘Imprecision’ factor). The review team did not apply the ‘Inconsistency’ and ‘Indirectness’ downgrading factors that COSMIN recommends [4]. In this review, inconsistency in the characteristics of the study samples was resolved by summarizing the results separately for different populations, and inconsistency in results was used to grade the quality of the PBMs. ‘Indirectness’ was not used as a downgrading factor because only studies of the populations of interest to the review team (i.e. populations from East and South-East Asia) were included (the modified GRADE criteria can be found in Appendices 4 and 5 of Supplementary file).
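The pooling rule for a measurement property can be sketched as follows. The 75% and 25% cut-offs come from the text; the function name and the example ratings are hypothetical, and shares falling between 74% and 75% are assumed here to count as ‘inconsistent’, since the text leaves that narrow band unspecified.

```python
def rate_measurement_property(study_ratings):
    """Pool per-study 'positive'/'negative' ratings for one PBM in one population."""
    if not study_ratings:
        raise ValueError("at least one study rating is required")
    positive_share = sum(r == "positive" for r in study_ratings) / len(study_ratings)
    if positive_share >= 0.75:
        return "sufficient"
    if positive_share < 0.25:
        return "insufficient"
    return "inconsistent"  # assumption: everything from 25% up to (not including) 75%


# Hypothetical set of eight construct validity studies, six of them positive.
ratings = ["positive"] * 6 + ["negative"] * 2
print(rate_measurement_property(ratings))
```

The quality-of-evidence grade is a separate component and is not covered by this sketch.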

3 Results

The search initially identified a total of 1710 papers from four databases, which was reduced to 735 upon removal of duplicates, and further reduced to 114 after assessment of titles and abstracts. After assessment of full texts, 79 papers were retained for this systematic review [12–90]. A Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram for the selection process is shown in Fig. 1.

Fig. 1 Chart for search results and selection of papers. PROMs patient-reported outcome measures

A total of 1504 individual studies were identified from the 79 retained papers. Table 1 shows the numbers of included papers and studies, organized by measurement property, PBM, and population. EQ-5D was the most studied PBM, construct validity was the most studied measurement property, Singapore and China produced the largest number of papers, and the general population was the most studied population. No relevant studies were found for Assessment of Quality of Life (AQOL), 15-Dimensional (15D), or the Philippines. A more detailed breakdown of the distribution of the papers can be found in Appendices 6 and 7 of Supplementary file.

Table 1 Included papers and studies, by category

Results were ‘positive’ in 80% of construct validity studies, 79% of test-retest reliability studies, and 57% of responsiveness studies. While 99% of the construct validity studies and 61% of the responsiveness studies were rated as having ‘very good’ or ‘adequate’ methodological quality, only a small portion of test-retest reliability studies (23%) achieved ‘very good’ or ‘adequate’ methodological quality.

A total of 729, 38, and 42 studies assessing construct validity, test-retest reliability, and responsiveness of EQ-5D, respectively, were identified. EQ-5D-3L was more commonly studied than EQ-5D-5L. For example, EQ-5D-3L had more than twice as many construct validity studies as EQ-5D-5L. The results for EQ-5D are summarized in Table 2. ‘Sufficient’ construct validity was exhibited in 6 of 10 countries/districts and 17 of 20 disease groups assessed; ‘sufficient’ test-retest reliability was exhibited in none of 8 countries/districts and 3 of 10 disease groups assessed; and ‘sufficient’ responsiveness was exhibited in 5 of 6 countries/districts and 8 of 11 disease groups assessed.

Table 2 Grading results for EQ-5D in different countries/districts and different disease groups

A total of 374, 15, and 16 studies assessing construct validity, test-retest reliability, and responsiveness of EQ-VAS, respectively, were identified. The results for EQ-VAS are summarized in Table 3. ‘Sufficient’ construct validity was exhibited in 5 of 10 countries/districts and 8 of 14 disease groups assessed; ‘sufficient’ test-retest reliability was exhibited in 4 of 6 countries/districts and 3 of 5 disease groups assessed; and ‘sufficient’ responsiveness was exhibited in all 4 countries/districts and 6 of 7 disease groups assessed.

Table 3 Measurement properties of EQ-VAS in different countries/districts and disease groups

A total of 179, 3, and 15 studies assessing construct validity, test-retest reliability, and responsiveness of SF-6D, respectively, were identified. The results for SF-6D are summarized in Table 4. ‘Sufficient’ construct validity was exhibited in 2 of 5 countries/districts and 6 of 11 disease groups assessed; ‘sufficient’ test-retest reliability was exhibited in 1 (Hong Kong) of 2 countries/districts and 1 (thyroid) of 2 disease groups assessed; and ‘sufficient’ responsiveness was exhibited in only 1 (South Korea) of 3 countries/districts and only 2 of 4 disease groups assessed.

Table 4 Measurement properties of SF-6D in different countries/districts and different disease groups

A total of 59, 5, and 7 studies assessing construct validity, test-retest reliability, and responsiveness of HUI, respectively, were identified. The results for HUI are summarized in Table 5. ‘Sufficient’ construct validity was exhibited in all 3 countries/districts and all 4 disease groups assessed; ‘sufficient’ test-retest reliability was exhibited in 1 (Thailand) of 2 countries/districts and 2 of 3 disease groups assessed; and ‘sufficient’ responsiveness was exhibited in 1 (Thailand) of 2 countries/districts and 2 of 3 disease groups assessed.

Table 5 Measurement properties of HUI in different countries/districts and different disease groups

A total of 22 studies assessing the construct validity of the Quality of Well-Being (QWB) scale were identified. ‘Sufficient’ construct validity was exhibited in both China and Japan and in both the neurological and respiratory disease groups.
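As a consistency check, the per-instrument study counts reported in this section can be summed and compared with the overall total of 1504 studies. The sketch below uses only the counts stated in the text; the zero entries for QWB test-retest reliability and responsiveness are an assumption, made because only construct validity studies are reported for that instrument.

```python
# (construct validity, test-retest reliability, responsiveness) study counts
study_counts = {
    "EQ-5D": (729, 38, 42),
    "EQ-VAS": (374, 15, 16),
    "SF-6D": (179, 3, 15),
    "HUI": (59, 5, 7),
    "QWB": (22, 0, 0),  # assumption: no reliability or responsiveness studies
}

per_instrument = {name: sum(counts) for name, counts in study_counts.items()}
per_property = [sum(column) for column in zip(*study_counts.values())]
total_studies = sum(per_instrument.values())

print(per_instrument)
print(per_property)
print(total_studies)
```

Under that assumption the counts add up to the 1504 studies stated at the start of the Results.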

4 Discussion

This systematic review targets the measurement properties of generic PBMs in East and South-East Asian countries. To the best of the review team’s knowledge, this is the first systematic review of its kind. This review found that the generic PBMs that have been tested are EQ-5D, SF-6D, HUI (i.e. HUI2 and HUI3) and QWB, and that EQ-5D (i.e. EQ-5D-3L and EQ-5D-5L) might be the preferred choice when a generic PBM is needed in Asia. First, EQ-5D has the largest body of evidence for all measurement properties and populations assessed. Second, it exhibited ‘sufficient’ construct validity and responsiveness in the largest number of populations, and ‘insufficient’ construct validity or responsiveness in none of the populations assessed. Satisfactory construct validity and responsiveness were also reported in past systematic reviews of EQ-5D in musculoskeletal [91], schizophrenia [92], skin [93], metabolic [94, 95], and respiratory diseases [96]. However, the current finding that EQ-5D is valid and responsive for patients with eye and heart diseases is at odds with the finding from a systematic review [95] that was mainly based on evidence from European populations. The contradictory findings from the two systematic reviews suggest that the measurement properties of PBMs might vary from region to region. Therefore, it might be worthwhile to perform similar reviews for other regions to better inform the selection of PBMs for use in different populations.

The test-retest reliability of EQ-5D was found to be either ‘inconsistent’ or ‘insufficient’ for almost all populations, which is largely inconsistent with past systematic reviews [91, 94, 96]. The inferior test-retest reliability of EQ-5D revealed in this review could be related to the suboptimal quality of evidence, which was attributable to imperfect study design. In many studies included in this review, the ‘test’ was conducted when subjects visited a health institution, by face-to-face interview or self-completion, while the ‘retest’ was conducted over the telephone or via post when subjects were at home. The change in the data collection mode and setting from test to retest could have negatively affected the assessment result. Moreover, the test-retest reliability of EQ-5D could have been underestimated due to the long retest interval used in those studies. Most studies included in this systematic review conducted the retest 1–2 weeks after the first test, as recommended [97]. While an interval of 1–2 weeks is appropriate for testing scales using a recall period of 1–4 weeks, it may be too long for EQ-5D because its recall period is only one day (‘today’). It is very possible that the health status of patients experiencing episodic symptoms on a particular day would change after 1 or 2 weeks, thus violating the assumption of unchanged health status needed for test-retest reliability testing, and leading to a worse test result.

The suboptimal construct validity results for EQ-VAS are not entirely surprising because a visual analogue scale is not as easy to understand or use as verbal or categorical rating scales, where each response option is attached to an explanatory label [98]. It is possible that Asians, on average, have more difficulty with the EQ-VAS than Westerners because of their relatively lower education levels [99]. The suboptimal construct validity could also be caused by the vagueness of the labels used by EQ-VAS. In a qualitative study of Asians from Singapore [100], great variations in the interpretation of ‘best imaginable health’ were observed, which casts doubt on the comparability of EQ-VAS scores across individuals. However, a ‘sufficient’ result on responsiveness suggests that the EQ-VAS can be useful in evaluating individual-level change in HRQoL.

The suboptimal construct validity results for SF-6D are somewhat surprising. The descriptive system of SF-6D is more comprehensive than that of EQ-5D, and worldwide studies comparing SF-6D and EQ-5D found the two PBMs to have comparable measurement properties. One possible explanation is the relatively lower literacy rate among elderly patients in Asia. According to UNESCO data [101], the elderly in European countries, such as Italy and Romania, have a literacy rate of > 85%. On the other hand, the literacy rate for the elderly in Asian countries, such as Thailand and Malaysia, is below 40%. The data for SF-6D are usually collected through SF-36, which contains 36 questions with relatively long sentence structures that might be difficult for some respondents with a lower literacy level [99].

This study provides some directions for future research on generic PBMs in Asia. First, future research should be expanded to rarely or never tested PBMs such as HUI, QWB, and AQOL. HUI (i.e. HUI2 and HUI3) is especially worth more research since ‘sufficient’ support has been shown for most measurement properties in all populations assessed. Second, researchers are strongly recommended to use a better design in future studies of test-retest reliability and responsiveness, such as using the same data collection mode at all time points. Last, studies should be conducted to ascertain the reasons for the suboptimal construct validity of the SF-6D and EQ-VAS, and to explore ways to improve their performance in Asian populations.

This study has three limitations. First, since some of the COSMIN methods and tools do not apply to a systematic review of multiple measures in multiple populations, it was necessary for the review team to modify the original methods. Due to these modifications, it may not be meaningful to compare the results from this review with those from other reviews that applied the original COSMIN methods. These modifications, however, are unlikely to favour any of the PBMs included in this study. The second limitation is the exclusion of papers published in non-English journals due to limited manpower and resources. There are databases in the Chinese, Japanese, and Korean languages that could include validation studies of PBMs. Therefore, the results of this review might not truly reflect the performance of the generic PBMs in China, Japan, and South Korea. Third, since different language versions were not differentiated, results from this review for Singapore and Malaysia might not be accurate for all language versions of the studied instruments. Despite the effort that has been put into translation, psychometric equivalence between source and target languages might not necessarily occur [102]. Nevertheless, studies have shown measurement equivalence between different language versions of EQ-5D and SF-6D in Singapore [103–106].

5 Conclusions

This systematic review provides a summary of the measurement properties of existing generic PBMs in Asian populations from different countries and different disease groups. The current evidence supports the use of EQ-5D as the preferred choice when a generic PBM is needed, and the continued testing of all PBMs in the region.