Introduction
Preference-based quality-of-life data is increasingly collected for economic evaluation studies, such as cost-utility analysis (CUA), to compare value for money and prioritise limited resources. Quality-adjusted life years (QALYs) are the predominant outcome measure in CUAs, calculated using a preference-based multi-attribute utility instrument (MAUI) like EQ-5D [1, 2] and AQoL-4D to measure quality of life [3].
The QALY is the preferred outcome measure used by many government funding bodies, such as the National Institute for Health and Care Excellence (NICE) in the UK [4, 5] and the Pharmaceutical Benefits Advisory Committee (PBAC) in Australia [6]. However, in many disability studies, researchers still prefer to use a non-preference-based disability-specific instrument [7]. A recent review of disability outcome measures identified 20 generic instruments and reported that the World Health Organisation Disability Assessment Schedule (WHODAS) is widely used [8]. The 12-item version of the WHODAS 2.0 has been validated with people with different types of disabilities [9–14]. However, the summary score of WHODAS 2.0 is non-preference-based and therefore does not permit the construction of QALYs [8]. Mapping analysis, which estimates a statistical relationship between preference-based and non-preference-based instruments, is the "next-best" approach to deriving health state utilities from a non-preference-based instrument [15].
No current study has developed mapping algorithms to generate health utility from WHODAS 2.0. Lokkerbol et al. [16] estimated an algorithm for mapping WHODAS 2.0 (both the 36-item and 12-item versions) to disability weights (i.e., to calculate Disability Adjusted Life Years) using data from the World Health Organisation (WHO) Multi-Country Survey Study on Health and Responsiveness (MCSS) [17]. However, their algorithm included only eight of the 12 questions in the WHODAS 2.0 short form, which may reduce prediction accuracy if the omitted items are significant predictors.
Traditional econometric methods such as ordinary least squares (OLS) and the generalised linear model (GLM) have been commonly used to create mapping algorithms [18, 19]. Specifying the optimal functional form is often difficult when complex non-linear relationships exist between the source and target instruments. Furthermore, the distribution of a MAUI is often skewed, bounded at one, and may be multinomial, which adds further complexity to identifying the optimal mapping algorithm [20]. Recently, supervised machine learning techniques have increasingly been used in mapping studies as they have the potential to select important predictors and account for non-linear relationships more efficiently and effectively than traditional approaches. Gao et al. [21] used a deep neural network method to develop mapping algorithms from the MacNew Heart Disease Quality of Life Questionnaire onto different country-specific value sets of EQ-5D-5L (n = 943). They found that the machine learning technique performed similarly to the traditional econometric methods in three out of the four countries of their sample. Another study by Aghdaee et al. [22] also found that machine learning (e.g., Lasso regression) performed marginally better if not combined with other traditional econometric methods (n = 2015). Despite these previous findings, machine learning still has the advantage of determining the nature of the relationships from the data, without researchers having to guess the possible combinations of predictors or imposing their own bias on the results by selecting a preferred functional form. With a larger dataset than the previous studies, the performance of machine learning techniques in our study may improve.
The objective of the study is to derive optimal mapping algorithms from the 12-item version of WHODAS 2.0 (hereafter 'WHODAS') to the Assessment of Quality of Life-4 Dimension (AQoL-4D), which is a validated generic health-related preference-based instrument that is widely used in disability studies [23–26]. This present study also contributes to the mapping literature by comparing results from traditional econometric models to machine learning approaches.
Methods
This study follows the Mapping onto Preference-based measures reporting standards (MAPS) from the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) to conduct and report mapping analyses [15]. To apply the machine learning techniques, we followed the steps and best practice recommendations from Doupe et al. [27]. All statistical analyses were conducted using Stata 16 except the exploratory factor analysis, which was performed using the EViews software version 12.
Data and sample
The data were obtained from the 2007 Australian National Survey of Mental Health and Wellbeing (NSMHWB). The 2007 NSMHWB was conducted throughout Australia from August to December 2007 on a nationally representative sample. Private dwellings were randomly selected using a stratified, multistage area approach, and then one person meeting the age criteria of 16–85 years was selected from each dwelling. Of the 17,352 dwellings initially selected, 14,805 were eligible; the remainder were excluded because all residents were out of scope or the dwelling was empty. Finally, 8841 complete responses were recorded from the face-to-face interviews (60% response rate).
Both the WHODAS and AQoL-4D were included in this survey. Considering WHODAS is commonly used among people with disability, in our study, we constrained the study sample to be people who reported having a disability, regardless of whether they have a current restriction in activities or not (n = 3376). We also excluded a small proportion (n = 30) if they did not answer all of the WHODAS and the AQoL-4D items. The final study sample consists of 3376 respondents. The detailed sample selection process can be found in the flowchart in the Electronic supplementary materials (ESM) (ESM 1 Fig. S1).
Instruments
Source measure: WHODAS 2.0–12
The WHODAS is a disability-specific instrument in which all items use a five-point ordinal response scale (1 = None, 2 = Mild, 3 = Moderate, 4 = Severe, and 5 = Extreme/Cannot do). This instrument has a recall period of 30 days [28].
The WHODAS captures functioning in six domains: cognitive (Items 3 and 6), mobility (Items 1 and 7), self-care (Items 8 and 9), getting along (Items 10 and 11), life activities (Items 2 and 12) and participation (Items 4 and 5). This study uses the widely used simple scoring method, in which the WHODAS summary score is the sum of the raw scores of all items [9, 14, 24]. The summary scores of the WHODAS, therefore, range from 12 (no disability) to 60 (full disability).
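As an illustration of the simple scoring method described above, the following minimal Python sketch sums the 12 raw item responses. The function name and the input checks are our own illustrative choices and are not part of the WHODAS documentation.

```python
def whodas_simple_score(responses):
    """Return the simple WHODAS summary score.

    `responses` is a sequence of 12 item responses, each coded 1 (None)
    to 5 (Extreme/Cannot do). The score is the plain sum of raw scores,
    so it ranges from 12 (no disability) to 60 (full disability).
    """
    if len(responses) != 12:
        raise ValueError("expected 12 item responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    return sum(responses)


# A respondent reporting no difficulty on any item scores 12.
print(whodas_simple_score([1] * 12))
```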
Target measure: AQoL-4D
AQoL-4D is a valid generic preference-based quality-of-life instrument [3, 29]. It contains 12 items (each with four response levels), which are evenly grouped into four dimensions (independent living, relationships, mental health and senses). This instrument has a recall period of seven days. The preference weights of AQoL-4D were developed using the time-trade-off approach among the Australian general public. The utilities range from −0.04 (worse than death) to 1 (full health).
More detailed comparisons of the domains and characteristics between the source (i.e., WHODAS) and the target measures (i.e., AQoL-4D) are presented in ESM 1 Table S1.
Statistical analysis
As previous literature suggested [30, 31], exploratory data analysis, including Pearson correlation and exploratory factor analysis (EFA), was first conducted to understand the degree of conceptual overlap between the two instruments. Maximum likelihood factor analysis was used, with the minimum average partial method determining how many factors to retain. Factors were rotated using the orthoblique promax method [32]. The EFA provides a clear view of the underlying structures shared by the two instruments, which is useful for checking the feasibility of a mapping analysis and for guiding WHODAS item selection when developing mapping algorithms using traditional econometric techniques.
We used a direct mapping approach to predict AQoL-4D from WHODAS items. Indirect response mapping, which has also been used in the mapping literature, particularly when predicting EQ-5D, was not considered. Unlike EQ-5D, in which only responses to five dimensions need to be estimated, accurately predicting AQoL-4D would require predicting responses to 12 items. This increases the chance of prediction errors, particularly because the AQoL-4D includes items not directly captured by WHODAS (see ESM Table).
We used four traditional econometric and three machine-learning techniques to develop mapping algorithms. The four traditional regression techniques have been widely used in the mapping literature [17, 19, 33–36]. They consist of (i) the ordinary least squares (OLS) estimator, which is the most widely used technique in the mapping literature; (ii) the robust MM estimator, which reduces the influence from potential outliers; (iii) the generalised linear model (GLM), which permits different combinations of distribution families and link functions, including non-normal distributions; and (iv) the beta mixture regression model which is suitable for analysing data with a continuous response variable that ranges from 0 to 1 and follows a beta distribution. Recently, the adjusted limited dependent variable mixture models (ALDVMM) have been developed that perform well with EQ-5D data [10, 37]. However, since AQoL-4D does not have a large gap between 1 and the next feasible value as EQ-5D-3L UK tariff does, this method is not applied in our study.
The following two basic model specifications were considered for predicting the AQoL utility (\(\mathrm{uAQoL}4\mathrm{D}\)) using the item-level responses of WHODAS. Instead of treating a WHODAS item as a continuous variable, it was transformed into five indicator variables corresponding to each of the response levels. Using level 1 as the reference category, the mapping function therefore included four variables for each of the 12 WHODAS items, thereby allowing for the potential non-linear effects across different levels.
$$\mathrm{uAQoL4D}_{i}=\alpha +\sum_{j}\beta_{j}\,\mathrm{WHODAS\_Item\_level}_{ij}$$
(1)
$$\mathrm{uAQoL4D}_{i}=\alpha +\sum_{j}\beta_{j}\,\mathrm{WHODAS\_Item\_level}_{ij}+\gamma_{1}\,\mathrm{Sex}_{i}+\gamma_{2}\,\mathrm{Age}_{i}$$
(2)
where \(\mathrm{uAQoL}4\mathrm{D}\) represents the predicted utility for the individual \(i\). \(\mathrm{WHODAS}\_\mathrm{Ite}{\mathrm{m}\_\mathrm{level}}_{ij}\) is the set of binary variables constructed from the response levels of the WHODAS items. For each item, there are five levels. Therefore, four dummies will be included, with the first level serving as a reference level. Age is in years, and Sex is a binary variable equal to 1 for males and 0 for females.
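The dummy coding and the linear predictors in Eqs. (1) and (2) can be sketched in Python as follows. This is an illustrative sketch only: the variable names (`item1_level3` and so on) and every coefficient value are hypothetical placeholders, not the estimates reported in this study.

```python
def dummy_code(responses, levels=(2, 3, 4, 5)):
    """Turn 12 item responses (1-5) into 48 indicator variables.

    Level 1 is the reference category, so each item contributes four
    dummies (levels 2-5). Names such as 'item3_level4' are illustrative.
    """
    row = {}
    for item, resp in enumerate(responses, start=1):
        for lvl in levels:
            row[f"item{item}_level{lvl}"] = int(resp == lvl)
    return row


def predict_utility(responses, alpha, betas, sex=None, age=None,
                    gamma_sex=0.0, gamma_age=0.0):
    """Linear predictor of Eq. (1); adds the Sex and Age terms of Eq. (2)
    when `sex` (1 = male, 0 = female) and `age` (years) are supplied."""
    x = dummy_code(responses)
    value = alpha + sum(betas.get(name, 0.0) * flag for name, flag in x.items())
    if sex is not None:
        value += gamma_sex * sex
    if age is not None:
        value += gamma_age * age
    return value


# Hypothetical coefficients, for illustration only.
example_betas = {"item1_level3": -0.05, "item7_level4": -0.12}
print(predict_utility([3] + [1] * 11, alpha=0.90, betas=example_betas))
```

Coding each level separately lets the decrement for moving from "Mild" to "Moderate" differ from the decrement for moving from "Severe" to "Extreme", which a single continuous item score would not allow.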
To ensure that the coefficients of the WHODAS items follow a monotonic pattern (i.e., more severe disability levels have larger or equal decrements compared to less severe levels), we imposed additional constraints in traditional regression models. This involved combining item levels or excluding items with positive coefficients during estimation.
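To illustrate the monotonicity requirement, the sketch below flags adjacent level pairs of a single item whose estimated coefficients break the pattern. Such pairs would be candidates for combining levels, and an item whose coefficients are positive would be a candidate for exclusion. The function name and the example coefficients are hypothetical, and the re-estimation step after combining levels is not shown.

```python
def monotonic_violations(level_coefs):
    """Return adjacent level pairs that violate monotonicity for one item.

    `level_coefs` holds the coefficients for levels 2-5; level 1 is the
    reference with an implied coefficient of 0. A pair (k, k + 1) is a
    violation when the more severe level has a smaller utility decrement,
    i.e. a larger (less negative) coefficient than the level below it.
    """
    seq = [0.0] + list(level_coefs)
    return [
        (lvl, lvl + 1)
        for lvl, (lower, higher) in enumerate(zip(seq, seq[1:]), start=1)
        if higher > lower
    ]


# Hypothetical coefficients: level 3 has a smaller decrement than level 2,
# so levels 2 and 3 would be candidates for combining.
print(monotonic_violations([-0.05, -0.03, -0.10, -0.20]))
```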
In addition, we employed three machine learning techniques: (i) Lasso regression; (ii) Support vector regression (SVR); (iii) Boosted regression (Boosting), all of which can be applied to continuous outcomes. Lasso regression selects and fits covariates in a model that minimises prediction errors, using a "shrinkage" penalty that constrains the coefficients of less important predictors towards zero. The key benefit of lasso regression is that it performs variable selection automatically, providing a simpler model with only the most relevant features. This can help reduce model complexity and improve generalisation performance on new data [37]. In support vector regression (SVR), the main objective is to find a function that best fits the data and minimises errors between the predicted and observed values. This is done by identifying a hyperplane in n-dimensional space that lies at an optimal distance from the data points that define it (the support vectors) [38]. Boosted regression enhances accuracy by combining predictions from multiple weaker models via a weighted average. It starts with base learners, which are typically shallow decision trees, and trains further weak learners iteratively. After each weak learner is trained, it is combined with the previously trained learners using a weighting scheme. Boosted regression is well suited to complex, non-linear dependencies between features and the target variable, and combining many weak learners makes the final model robust and can achieve high predictive accuracy [1, 2].
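For reference, the shrinkage idea behind Lasso regression described above can be written as a penalised least-squares problem. The notation below is a generic textbook form rather than the exact parameterisation used by the software in this study; in particular, the \(1/(2n)\) scaling is one common convention, \(y_{i}\) stands for the observed utility, \(x_{ij}\) for the predictors, and \(\lambda \ge 0\) for the penalty parameter.

$$(\hat{\alpha },\hat{\beta })=\underset{\alpha ,\beta }{\mathrm{arg\,min}}\;\frac{1}{2n}\sum_{i=1}^{n}{\left({y}_{i}-\alpha -\sum_{j}{\beta }_{j}{x}_{ij}\right)}^{2}+\lambda \sum_{j}\left|{\beta }_{j}\right|$$

Larger values of \(\lambda\) set more coefficients exactly to zero, which is how the method performs variable selection; \(\lambda = 0\) recovers ordinary least squares.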
Lasso regression assumes a linear functional form, for which we specified the overarching model structure in Eq. (3). The other techniques are data-driven and do not rely on strong assumptions about the functional form. The relationships between the variables are determined by the machine learning algorithms and could be linear or non-linear depending on the patterns and structures present in the dataset.
$$\mathrm{uAQoL4D}_{i}=\alpha +\sum_{j}\beta_{j}\,\mathrm{WHODAS\_Item\_level}_{ij}+\sum_{j}\lambda_{j}\,\mathrm{WHODAS\_Item\_level}_{ij}\#\mathrm{WHODAS\_Item\_level}_{iq}+\gamma_{1}\,\mathrm{Sex}_{i}+\gamma_{2}\,\mathrm{Age}_{i}$$
(3)
where \(\mathrm{uAQoL}4\mathrm{D}\) represents the predicted utility for the individual \(i\). \(\mathrm{WHODAS}\_\mathrm{Ite}{\mathrm{m}\_\mathrm{level}}_{ij}\) is the set of binary variables constructed from the response levels of the WHODAS items. For each item, there are five levels. Therefore, four dummies will be included, with the first level serving as a reference level. Age is in years, and Sex is a binary variable equal to 1 for males and 0 for females.
Assessing goodness-of-fit
We employed five-fold cross-validation to evaluate goodness-of-fit. The full sample was randomly split into five groups with equal observations, where 80% of the data were used for algorithm development and the remaining 20% for performance assessment. This process was repeated five times, with each group used four times for estimation and once for validation. The optimal model and method were selected based on the best goodness-of-fit test result from the pooled estimated errors in the absence of external validation data.
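The five-fold procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the fixed random seed, and the `fit`/`predict` callables are assumptions, and any estimator could be plugged in. Errors from the five held-out folds are pooled, as in the text.

```python
import random


def five_fold_indices(n, k=5, seed=0):
    """Randomly split indices 0..n-1 into k folds of (near-)equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_errors(xs, ys, fit, predict, k=5, seed=0):
    """Pool out-of-fold errors (observed minus predicted) across k folds.

    Each fold is held out once for validation while the remaining folds
    (about 80% of the data when k = 5) are used to estimate the model.
    """
    errors = []
    for fold in five_fold_indices(len(ys), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(ys)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.extend(ys[i] - predict(model, xs[i]) for i in fold)
    return errors


# Toy usage with a "model" that always predicts the training mean.
toy_y = [0.2, 0.4, 0.6, 0.8, 1.0, 0.3, 0.5, 0.7, 0.9, 0.1]
pooled = cross_validated_errors(
    xs=[None] * len(toy_y),
    ys=toy_y,
    fit=lambda X, y: sum(y) / len(y),
    predict=lambda model, x: model,
)
print(len(pooled))
```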
Three goodness-of-fit statistics were used: (i) the mean absolute error (MAE), (ii) the root mean square error (RMSE), and (iii) the intraclass correlation (ICC) using a random effect model. We selected MAE as the primary criterion because the MAE is the most natural and unambiguous measure of the average error magnitude, while the RMSE places more weight on outliers [39]. Additionally, we calculated the percentages of observations for which the difference between the observed and predicted values was larger than 0.03, which was performed in previous studies [36, 40]. We also considered the performance of predicting the lower and upper bound of AQoL-4D when selecting the optimal algorithm.
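The error-based statistics above can be computed as in the following sketch. The function names are illustrative, and we assume that the 0.03 criterion refers to the absolute difference between observed and predicted values. The ICC is omitted because it requires fitting a random effect model.

```python
import math


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error; squaring gives large errors more weight."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def share_abs_error_above(observed, predicted, threshold=0.03):
    """Proportion of observations with |observed - predicted| > threshold."""
    hits = sum(abs(o - p) > threshold for o, p in zip(observed, predicted))
    return hits / len(observed)


# Toy values, for illustration only.
obs = [0.5, 0.8, 1.0]
pred = [0.6, 0.8, 0.9]
print(mae(obs, pred), rmse(obs, pred), share_abs_error_above(obs, pred))
```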
To use the predicted utility scores in practice, it is important to ensure that the final predictions fall within the theoretical bounds of the target instrument [19, 34, 41]. In our study, predictions are truncated at −0.04 (lower bound) and 1 (upper bound) to ensure that the predicted values fall within the theoretical range.
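The truncation step amounts to clipping each prediction to the AQoL-4D range stated above. A minimal sketch, with an illustrative function name:

```python
def truncate_utility(predicted, lower=-0.04, upper=1.0):
    """Clip a predicted utility to the theoretical AQoL-4D range."""
    return max(lower, min(upper, predicted))


# Predictions outside the range are pulled back to the nearest bound.
print([truncate_utility(u) for u in (1.07, 0.55, -0.20)])
```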
The final mapping algorithms were developed using full-sample observations and are based on the method and model that perform the best in the five-fold cross-validation.
It is a common observation from previous mapping studies that mapping functions could perform relatively poorly in predicting lower utilities [19, 36, 42]. We therefore reported the performance of different mapping algorithms on people with different disability restriction levels. Results from this sub-sample analysis provide useful information for users to better understand the direction and magnitudes of potential prediction bias.
We divided our sample into five subgroups, including people with profound or severe core restrictions, moderate core restrictions, mild core restrictions, school/employment restrictions, or no specific restrictions. We then performed two types of subgroup analysis. First, we calculated the goodness-of-fit statistics for each subgroup. In addition, we calculated the between-group margins of error in the predicted differences of average utilities between a particular subgroup with restrictions versus the subgroup with no specific restrictions.
This was conducted to assess how accurately the mapping algorithms capture the difference between different severity levels, which could represent incremental utilities across different disease states in economic evaluations.
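One way to express the between-group comparison is sketched below. We assume here that the margin of error is the predicted difference in mean utility between a restricted subgroup and the no-restriction subgroup, minus the corresponding observed difference; this definition, the function names, and the toy values are illustrative assumptions rather than the study's exact calculation.

```python
def mean(values):
    return sum(values) / len(values)


def between_group_error(obs_group, pred_group, obs_ref, pred_ref):
    """Error in the predicted between-group difference of mean utilities.

    A value of zero means the algorithm reproduces the observed gap between
    the subgroup and the reference group, even if both group means are
    individually over- or under-predicted.
    """
    predicted_diff = mean(pred_group) - mean(pred_ref)
    observed_diff = mean(obs_group) - mean(obs_ref)
    return predicted_diff - observed_diff


# Toy values: the subgroup mean is over-predicted by 0.1 and the reference
# mean is predicted exactly, so the predicted gap is 0.1 too narrow.
print(between_group_error([0.4, 0.6], [0.5, 0.7], [0.8, 1.0], [0.9, 0.9]))
```

If both groups are over-predicted by the same amount, the two biases cancel and the function returns zero, which is the sense in which a consistent bias matters less for incremental utilities than for absolute ones.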
Discussion
This study investigated various methods and models to map the WHODAS 2.0-12 items to the AQoL-4D utilities, a generic and well-validated instrument whose utilities could be used in various settings. It allows the estimation of utilities when responses of only WHODAS are collected, which facilitates future economic evaluation of disability interventions.
The mapping study was conducted using people with disability in Australia aged 16–85. Therefore, the observed utility of this sample is lower than the Australian norm for AQoL (0.81, 95%CI 0.81–0.82), which is derived from the same survey we use. Notably, there is only one value set for AQoL-4D, which was developed in an Australian population. Therefore, at the moment, the same Australian-specific value set is used regardless of where the respondents are based. We acknowledge that generalising Australian preferences to other countries may not be ideal, but using the Australian value set does not imply that the instrument is valid only in the country of development. Previous research showed that between-country variations in value sets may stem from the types of respondent (e.g., proxy vs. self-reported), the methods (e.g., DCE vs. standard gamble), and the composition of the sample selected to complete the valuation tasks [43]. As with the Health Utilities Index, whose value set was developed using only a Canadian sample but has been applied in research globally [44], we believe that AQoL-4D could be applied to populations beyond Australia.
The recommended mapping algorithm identified for a general disability population uses the MM estimator in Model 2, controlling for sex and age. The MAE of the cross-validation samples for this method is 0.1325, which is comparable with the ranges of other mapping studies whose MAE falls between 0.011 and 0.19 [19, 20, 33, 35]. Although the MM estimator overpredicted the utilities, our subgroup analysis showed that, unlike other methods, which over-predicted in some groups but under-predicted in others, the MM estimator consistently over-predicted the utilities across all the subgroups with different restriction levels (hence the prediction errors on the differences between different sub-groups could be minimised). Therefore, when using this recommended algorithm to estimate utilities in economic evaluations, researchers should be aware that the actual utilities are possibly lower than the estimated ones for people with all levels of restrictions.
Economic evaluations focus on comparing a group receiving an intervention with a group that does not. As the MM estimator in Model 2 overpredicts utilities for all subgroups, the overprediction errors would be offset when incremental utilities are calculated for comparisons between subgroups. However, because the MM estimator consistently overestimated between-group differences, researchers should also be cautious when using it in economic evaluations that compare subgroups with different restriction levels, as the true differences may be smaller.
The machine learning techniques did not outperform the traditional methods, even though the data-driven approaches allow for more flexibility. This is consistent with other mapping studies using machine-learning techniques [21, 22]. Additionally, controlling for age and sex did not enhance the performance of the machine learning techniques, likely because the complexity of variable interactions is already taken into account. Some research has indicated that age and sex may not be statistically significant [34]. However, since age is statistically significant at the 5% level in some of our traditional models, including the optimal MM model, we still include it in our algorithm. Sex is only significant at the 10% level in some of our models using traditional methods, but we have also included it because it is commonly included in mapping studies [19, 22, 34], it can increase the precision of our predictions, and we are not sure whether gender differences will affect the responses to WHODAS and AQoL-4D disproportionately in other data. It should also be noted that the sex variable in our data is a binary variable consisting of female and male. Newer surveys are likely to allow individuals to classify themselves in other categories. Future studies should explore whether gender classifications with more identity choices will impact the precision of the results.
We found that different models and methods predicted different subgroups better. We recommend using the MM estimator in Model 2 to estimate utilities for a population with different disability levels or for people with disability but no specific restrictions. The GLM with the log link function and gamma family, the beta regression method, and SVR in Model 1 are recommended for sensitivity analysis if the sample is concentrated on people with profound/severe, moderate or school/employment, and mild restrictions, respectively. Researchers could use the algorithms and Stata code provided in the ESM to perform additional sensitivity analysis.
Unique items were identified in our concept mapping process for WHODAS and AQoL-4D, and these items are more likely to require combining levels to achieve monotonic correlation. For example, items related to cognition that were only asked in WHODAS required level combining and an item in WHODAS asking about dealing with unknown people was dropped due to positive correlation. This suggests that the relationship question in WHODAS was focused on a different type of relationship compared to AQoL-4D, which emphasises relationships with friends and family. It highlights the importance of considering the algorithm’s applicability when evaluating interventions.
Several limitations to this study should be acknowledged. The model performance of the study was validated using five-fold cross-validation but with data from the same study sample. Since the mapping is a data-driven exercise, the choice of response sample could affect the calibration of the mapping algorithm. Ideally, in the future, validation will also be possible using an external dataset. The second limitation of the study is that the predicted AQoL-4D utilities under-predict the highest utilities. This is a commonly reported limitation in many mapping studies [33, 35]. However, because the algorithm consistently under-predicts the highest utility in each subgroup, the issue may be attenuated when comparing different groups in an economic evaluation study. The last limitation was that information about the disability types (e.g., physical, psychosocial) was not in the data. However, given the large sample size, we saw many variations in both the source and target instruments, and we were able to perform a subgroup analysis based on the restriction level of disability.
The results from this mapping study indicate that it is reasonable to map WHODAS onto the AQoL-4D utilities. The availability of this mapping algorithm will facilitate future economic evaluation for disability interventions when only the WHODAS is used.