Introduction

Obesity is associated with significant mortality, reduced quality of life and increased risk of developing diseases such as diabetes mellitus, cardiovascular disorders and cancers [1,2,3]. Europe has the world’s second highest obesity rate (women 25%, men 22%) after North America (women 30%, men 24%) [4]. Management of obesity involves a variety of treatment options. The first-line treatment is lifestyle modification, which includes diet, physical activity, and behavioral therapy [5, 6]. This may be supplemented with adjunct pharmacotherapy [5, 7]. When these conventional treatments are only partially efficacious in achieving sustained weight loss, bariatric surgery is introduced [5, 7].

To compare treatments, cost-utility analyses are required, particularly by reimbursement agencies and national advisory bodies such as the National Institute for Health and Clinical Excellence (NICE) in the UK [8] and the Dental and Pharmaceutical Benefits Agency in Sweden [9], which request that health utility data be collected in clinical studies. Health utility is often obtained through a preference-based measure (PBM) [10, 11]. The most commonly used PBMs are the EQ-5D [1], SF-6D [2], and Health Utilities Index (HUI) [3]. For obesity, the most commonly applied PBM is the SF-6D [2].

However, not all clinical studies contain a PBM. In obesity studies, quite often only non-preference-based measures (NPBM) are used [12], such as the Obesity Problem Scale (OP), the Obesity and Weight-Loss Quality of Life instrument, and the Weight-Related Symptom Measure (WRSM). The OP scale has mostly been applied in Scandinavia [13], but has recently been recognized by the American Society for Metabolic and Bariatric Surgery [14]. In the absence of a PBM, it may be possible to map utility values indirectly from an NPBM [15].

Mapping is a relatively new research area, with most papers published after 2000 [16, 17]. For obesity, mapping algorithms have been estimated from the Moorehead-Ardelt II questionnaire (MA-II) to the SF-6D and EQ-5D [18], and from the Impact of Weight on Quality of Life-Lite to the SF-6D [19]. However, to the best of our knowledge, there is currently no mapping algorithm for the OP. In Scandinavia, both the SF-36 and the OP were applied in the Swedish Obese Subjects trial between 1987 and 2001 [20], and both have been collected since 2007 in the large national register for bariatric surgery in Sweden, the Scandinavian Obesity Surgery Registry [20,21,22]. This enables the construction of a mapping algorithm from the OP to the SF-6D utility index. As the OP mainly measures the impact of obesity on psychosocial function [23], and the obesity level is significantly reduced after bariatric surgery [24], we assumed that the relationship between the OP and the SF-6D might differ between pre- and post-surgery.

The aim of the study was to provide a mapping algorithm to estimate SF-6D utility values from the Obesity Problem Scale, which can be used to estimate utilities in subsequent analyses, such as economic evaluations reliant on data sets that include only the Obesity Problem Scale. Additionally, we explored different mapping models and tested whether the mapping algorithm was robust to the effect of bariatric surgery.

Method

Data source and study population

The Scandinavian Obesity Surgery Registry (SOReg) is a national research and quality registry for bariatric surgery in Sweden (> 97% national coverage). It is validated regularly and has been shown to have high data quality [21]. SOReg contains information on patient socio-demographic characteristics, provider characteristics, details regarding the procedure, and health outcomes including HRQoL assessed by the SF-36 and the OP. HRQoL data are reported by the patients at baseline and at 1, 2, and 5 years by filling in a paper questionnaire. Nurses take anthropometric data and collect the applicable questionnaires. Trained personnel perform data entry. For the current study, all subjects who received bariatric surgery from January 2011 to March 2019 with complete answers on the OP scale and SF-6D were included (n = 36 706), with no other exclusion criteria being applied. The Ethics Authority in Sweden granted ethical permission for this study (reference number: 2019-03666).

The development and validation of the mapping algorithm followed guidelines from ISPOR [25] and the TRIPOD checklist [26] (Supplementary material, Table S8). For internal cross-validation, data at each wave (baseline, 1-, 2- and 5-year follow-ups) were randomly split into two parts: 80% of the data were used as a training dataset for building models, and the remaining 20% were used as a validation dataset. This resulted in eight datasets in total (training and validation datasets at each time point: baseline, 29,365 and 7342; 1-year, 27,125 and 5425; 2-year, 13,911 and 3478; 5-year, 5945 and 1488) (Supplementary material, Table S3). No significant differences in patient characteristics were found between the training and validation datasets. To assess whether the performance of the mapping algorithm differed between pre- and post-surgery, we also tested the mapping algorithm from one wave on datasets from all time points. For example, for baseline data, validations were carried out on baseline, 1-, 2-, and 5-year data, respectively. The mapping model with the best predictive performance was selected as the final model.
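As an illustration of the split-sample design, the 80/20 random split within each wave could be sketched as follows. This is a minimal sketch in Python rather than the R code used in the study; the function name, the seed, and the wave sizes are assumptions made only to show the mechanics.

```python
import random

def split_train_validation(records, train_fraction=0.8, seed=2019):
    """Randomly split one wave of records into training and validation parts."""
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical wave sizes (not the study's), one split per wave.
waves = {"baseline": range(1000), "1-year": range(800)}
splits = {name: split_train_validation(ids) for name, ids in waves.items()}

for name, (train, valid) in splits.items():
    print(name, len(train), len(valid))
```

Splitting each wave separately, rather than pooling all waves first, keeps a training and a validation dataset available at every time point, which is what allows an algorithm built on one wave to be tested on all the others.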

Health outcomes measure

Short Form-36 (SF-36/RAND) and SF-6D

The SF-36 measures HRQoL in eight domains (physical functioning, role-physical, bodily pain, general health, vitality, social functioning, role-emotional and mental health) [27, 28], and the SF-36-v1 has been applied in the SOReg. The short form six-dimensions (SF-6D) was developed to derive a preference-based score from the SF-36 [29] or its 12-item version (SF-12) [30], using a standard gamble method. The six SF-6D domains are pain, mental health, physical functioning, social functioning, role limitations, and vitality, and each is described by four to six functional levels. The SF-6D utility scores in the current study were calculated using the UK tariff [29] since there is no local tariff in Sweden. Details regarding the SF-6D domains and relevant SF-36 items can be found in the supplemental material (S1 and S2).

Obesity problem scale

The Obesity Problem Scale (OP) is a validated disease-specific instrument that assesses the impact of obesity on psychosocial functioning [20, 23]. The instrument comprises eight items (private gatherings at home; private gatherings at a friend’s/relative’s home; going to restaurants; participation in community activities; holidays away from home; trying on and buying clothes; bathing in public places; intimate relations), each rated on a four-point scale (significant difficulties; some difficulties; limited difficulties; no difficulties). Based on the responses to the OP dimensions, an OP summary score ranging from 0 to 100 can be calculated, with a higher score indicating more psychosocial dysfunction [20].

Statistical methods for mapping

Descriptive analyses were used to examine the sample characteristics and the responses to the SF-6D and OP measures (proportions for discrete variables; mean and standard deviation, plus median and interquartile range, for continuous variables).

We applied multivariate analysis to predict the values of the SF-6D utility score from the OP summary score and items, with and without other covariates. Besides the commonly used ordinary least squares (OLS) method, both beta regression (accounting for the fact that the SF-6D utility score is bounded between 0 and 1) [31] and Tobit regression (accounting for the fact that the SF-6D index is censored at 0.301 and 1) [32] were used. In beta regression, to decide whether to include a link function and which link function to use, we computed AIC and BIC for models with and without link functions, based on Model 1 (Supplementary material, Table S9). The model with the Cauchit link function performed best and was applied in all beta-regression analyses in the study. To allow comparisons across OLS, Tobit and beta regression, and to ease interpretation of the results, both the transformed OP summary score \(\frac{100 - OP_{\text{raw}}}{100} \times \frac{(N - 1) + 0.5}{N}\) and the transformed SF-6D index \(SF\text{-}6D_{\text{raw}} \times \frac{(N - 1) + 0.5}{N}\), with \(N = 500\), were used, as beta regression does not allow the value of the dependent variable or of the variable used in the link function to be 0 or 1 [33]. Both the transformed SF-6D index and the transformed OP summary score ranged between 0 and 1, with a higher value indicating better health.
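The two transformations can be written out directly. The following is a minimal Python sketch that follows the formulas as printed above with N = 500; the function names are illustrative and are not taken from the study's R code.

```python
N = 500  # constant used in both transformations, as stated in the text

def transform_op(op_raw, n=N):
    """Reverse the raw OP summary score (0-100, higher = worse) onto a
    0-1 scale where higher = better, then apply the shrinkage factor."""
    return (100 - op_raw) / 100 * ((n - 1) + 0.5) / n

def transform_sf6d(sf6d_raw, n=N):
    """Apply the same shrinkage factor to the raw SF-6D index."""
    return sf6d_raw * ((n - 1) + 0.5) / n

# The best possible raw scores end up just below 1 after transformation.
print(transform_op(0))
print(transform_sf6d(1.0))
```

Reversing the OP score before rescaling puts both variables on the same direction of scale, so a positive regression coefficient means that better psychosocial function goes with higher utility.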

Five sets of models were tested: OP as the main effect (Model 1); including age and sex (Model 2); including age, sex and BMI (Model 3); including age, sex and comorbidities (Model 4); and including age, sex, BMI and comorbidities (Model 5). All the models were run on the baseline, 1-, 2-, and 5-year follow-up datasets, respectively.
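With the OP summary score as the only predictor, the OLS version of Model 1 reduces to a simple linear regression with a closed-form solution. The sketch below, in Python, illustrates this on a few synthetic points that were chosen to lie exactly on a line; the data and the function name are assumptions for illustration and do not come from SOReg.

```python
def fit_simple_ols(x, y):
    """Closed-form OLS for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic transformed OP scores and SF-6D indices (illustration only).
op_transformed = [0.2, 0.4, 0.6, 0.8]
sf6d_transformed = [0.55, 0.60, 0.65, 0.70]

intercept, slope = fit_simple_ols(op_transformed, sf6d_transformed)
print(intercept, slope)
```

Models 2 to 5 add further covariates and therefore need a multiple-regression routine, but the interpretation of the OP coefficient as the change in utility per unit change in the OP score is the same.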

Two types of independent variables were constructed for the OP measure. Type A is a simple additive model, where the transformed OP summary score was used as the independent variable. In Type B modelling, the item responses for each OP dimension were used as independent variables, and three dummy variables (reference: “no difficulty”) were created for the levels “not so bothered”, “mostly bothered” and “definitely bothered”, respectively. As there are eight OP dimensions, 24 dummy variables in total were included in the models.
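The Type B coding can be sketched as follows in Python. The item labels OP1 to OP8 and the helper name are assumptions for illustration; the level names and the reference category are those given above.

```python
OP_ITEMS = [f"OP{i}" for i in range(1, 9)]  # eight OP dimensions (labels assumed)
LEVELS = ["not so bothered", "mostly bothered", "definitely bothered"]
REFERENCE = "no difficulty"                 # reference level, gets no dummy

def dummy_code(responses):
    """Turn one patient's eight item responses into 24 indicator variables."""
    row = {}
    for item in OP_ITEMS:
        for level in LEVELS:
            row[f"{item}_{level}"] = 1 if responses[item] == level else 0
    return row

# A patient at the reference level on every item except OP7.
example = {item: REFERENCE for item in OP_ITEMS}
example["OP7"] = "mostly bothered"

coded = dummy_code(example)
print(len(coded), sum(coded.values()))
```

Because each item contributes three free coefficients, Type B does not force the levels to be equally spaced, which is the restriction the additive Type A model imposes.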

Model selection

Model goodness-of-fit was assessed using adjusted/pseudo R2 statistics in ordinary least squares (OLS)/beta regression, the Bayesian information criterion (BIC), and the Akaike information criterion (AIC). Lower BIC and AIC values indicate a better-fitting model. To examine the predictive performance of the model, the differences between the predicted and observed SF-6D values at the individual level were used to compute the mean absolute error (MAE) and root-mean-square error (RMSE). Smaller error values were indicative of better-performing models. All analyses were conducted using R 4.0.2 [34].
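The two prediction-error measures follow their standard definitions and can be sketched in Python as below. The observed and predicted values are made up for illustration and are not study results.

```python
import math

def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def root_mean_square_error(observed, predicted):
    """Square root of the average squared difference."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))

# Made-up individual-level SF-6D values (illustration only).
observed = [0.60, 0.72, 0.85, 0.55]
predicted = [0.65, 0.70, 0.80, 0.60]

mae = mean_absolute_error(observed, predicted)
rmse = root_mean_square_error(observed, predicted)
print(mae, rmse)
```

RMSE is never smaller than MAE on the same data and penalizes large individual errors more heavily, so reporting both gives a sense of how unevenly the prediction errors are distributed.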

Results

Patient characteristics

Patient characteristics are reported in Table 1. More than 76% of the patients were female, and the mean age was 41 at baseline. About 10% of the patients were current smokers. Mean BMI was 42 at baseline and decreased to 29 at follow-ups. In general, the presence of obesity-related comorbidities (sleep apnoea, hypertension, diabetes, dyslipidaemia and depression) decreased over time; it was lowest at the 1-year follow-up, followed by the 2- and 5-year follow-ups.

Table 1 Socio-demographic characteristics, at baseline, 1, 2 and 5 years follow-ups

Patient-reported health outcomes

Responses to the OP and SF-6D are reported in Table 2. HRQoL improved after surgery, with the largest improvements observed at the 1-year follow-up, followed by the 2- and 5-year follow-ups. The SF-6D index was close to a normal distribution at baseline but left-skewed at the follow-ups (Supplementary material, Figures S1 and S2). There were moderate (0.4–0.59) and high (≥ 0.6) correlations between the SF-6D index and the OP summary score at all waves, as well as with most of the OP dimensions (Supplementary material, Table S4).

Table 2 Reporting of Obesity problem scale (dimension, summary score) and SF-6D index score, at baseline, 1, 2 and 5 years follow-ups

Initial model development

Results for model goodness of fit and prediction accuracy for the initial model development at each time point (baseline, 1-, 2- and 5-year follow-up) are reported in Table 3. Details can be found in the Supplementary materials for OLS (Tables S5A and S5B), Tobit regression (Tables S6A and S6B) and beta regression (Tables S7A and S7B). Four main issues were investigated: whether to use the OP summary score or the OP items as predictors; whether to include other covariates such as age, sex, BMI and comorbidity; whether to use separate mapping algorithms for pre- and post-surgery; and whether to use OLS, Tobit or beta regression.

Table 3 Comparison of model goodness of fit and prediction accuracy, transformed SF-6D index and OP summary score used

OP summary score or item as predictor (Type A or B model)

Across the OLS, Tobit and beta regression, the application of OP dimensions instead of the OP summary score did not increase the model performance, as it had little impact on goodness-of-fit and prediction power. Furthermore, for OLS models, inconsistency was found for the dimension bathing in public places (beach, public pool, OP7), with positive coefficients at baseline.

Inclusion of age, sex, BMI and comorbidity as predictors

Conclusions from beta regression and Tobit regression were similar to those from the OLS models: the inclusion of age, sex, BMI and comorbidity variables increased model performance. In terms of goodness-of-fit, R2 increased and AIC and BIC decreased across Models 1 to 5 for each wave of data. In terms of prediction power, MAE and RMSE in the model validations also decreased across Models 1 to 5.

OLS, Tobit or beta regression

Results for the goodness of fit and prediction power are presented in Table 3. In terms of goodness of fit, OLS yielded the lowest AIC and BIC values for the mapping algorithms from baseline and the 2-year follow-up, while beta regression gave the lowest AIC and BIC values for the mapping algorithms from the 1- and 5-year follow-ups. In terms of prediction power, results were similar for the OLS and Tobit models: both yielded lower MAE and RAE values and higher RMSE and RRSE values relative to beta regression at almost all time points. Overall, the performance of the OLS and Tobit models was rather similar, and both yielded better results than beta regression.

Comparison between pre- and post-surgery algorithms

Coefficients for pre- and post-surgery algorithms showed different patterns: coefficients for the OP summary score differed between baseline (0.26) and follow-ups (0.32). Coefficients for age groups were rather stable across all the models. The coefficient for male was higher than the coefficient for female at follow-ups, but not at baseline. At baseline, coefficients for BMI were significant; however, at follow-ups, not all coefficients for BMI were significant. Coefficients for comorbidities were relatively stable from baseline to 2-year follow-up, with depression associated with the largest effect, followed by sleep apnoea and diabetes. Hypertension and dyslipidaemia had a very low impact. At the 5-year follow-up, only depression was significant.

Final mapping algorithm

Based on the above findings, we concluded that the final mapping algorithm should use the OP summary score as the main predictor, include age, sex, BMI and comorbidities, use the OLS model, and be estimated separately for pre- and post-surgery data. For comorbidities, sleep apnoea, diabetes and depression were included, as these had significant coefficients and were also confirmed by the clinicians as the most important obesity-related comorbidities. We included BMI in the algorithm for pre-surgery prediction but excluded it for post-surgery prediction, as it led to inconsistency (higher BMI was not associated with a lower SF-6D index). We ran Models 1 to 5 for baseline data, and Models 1, 2 and 4 for post-surgery data (Table 4). As beta regression was not used for deriving the final mapping algorithms, it was not necessary to use the transformed SF-6D index and OP summary score. Therefore, we ran the OLS model with the raw OP summary score (range 0–100, with a higher value indicating worse health) as the predictor and the raw SF-6D index (range 0–1, with a higher value indicating better health) as the dependent variable. We recommend Model 5 for mapping with pre-surgery data, and Model 4 for post-surgery data. When information on some predictors is not available, one may choose any algorithm from Models 1–4 for pre-surgery data, and Model 1 or 2 for post-surgery data, based on one's own needs or preferences.
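Applying a Model 1 type algorithm to a new dataset amounts to evaluating a linear equation, with the coefficient set chosen by surgery phase. The Python sketch below shows only the mechanics: the coefficients are placeholders invented for illustration and are NOT the values in Table 4, and the function and dictionary names are likewise assumptions.

```python
# PLACEHOLDER coefficients for illustration only; replace with Table 4 values.
PLACEHOLDER_MODELS = {
    "pre":  {"intercept": 0.80, "op_summary": -0.002},
    "post": {"intercept": 0.85, "op_summary": -0.003},
}

def predict_sf6d(op_raw, phase, models=PLACEHOLDER_MODELS):
    """Predict an SF-6D index from a raw OP summary score (0-100).

    `phase` selects the pre- or post-surgery coefficient set, reflecting the
    recommendation to use separate algorithms for the two periods.
    """
    if not 0 <= op_raw <= 100:
        raise ValueError("raw OP summary score must lie between 0 and 100")
    coefficients = models[phase]
    return coefficients["intercept"] + coefficients["op_summary"] * op_raw

print(predict_sf6d(50, "pre"))
print(predict_sf6d(50, "post"))
```

Models with more predictors extend the same equation with one additional term per covariate, so a user with only the OP summary score can fall back on the simpler model without changing the procedure.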

Table 4 Mapping algorithm based on OLS model, untransformed SF-6D index and OP summary score used, baseline and post-surgery data, respectively

Discussion

This study explored mapping algorithms from the OP to the SF-6D index using a large patient register. Conceptual overlap between the source measure and the target PBM should be considered before mapping is undertaken [35]. The OP was developed as a condition-specific instrument to measure the impact of obesity on psychosocial function [23]. Although the focus of the OP is on mental health and role function, dimensions such as holidays away from home, trying on and buying clothes, bathing in public places (beach, public pool, etc.), and intimate relations would also indicate the impact of obesity on physical health and pain. Therefore, we considered that there was a reasonable overlap between the OP and the SF-6D, which was also indicated by the R2 of the mapping algorithm (0.3).

One important finding of our study was that the mapping algorithm should be different for data collected before and after bariatric surgery, which is in line with a recent study [36]. We found that the effect of the OP summary score increased while the effect of gender decreased after surgery, and that the effect of BMI disappeared after surgery. A possible explanation is that pre-surgery patients had a very high BMI, so that BMI retained an effect on SF-6D utility even after controlling for the OP. However, patients who underwent bariatric surgery lost weight significantly [24], and those with a higher pre-operative BMI tend to lose a higher percentage of their total weight [37], so that all the effects of BMI were already picked up by the OP. This finding suggests that the mapping algorithm might differ between baseline and follow-ups for bariatric surgery, and one should be cautious about merging pre-operative and post-operative data to construct mapping algorithms, or about using follow-up data to examine the prediction power of a mapping algorithm based on baseline data, or vice versa. To the best of our knowledge, this is the first evidence showing that clinical interventions may affect the crosswalk between an NPBM and a PBM among patients who received bariatric surgery. Further research using data from other disease/intervention populations is needed to assess its generalizability.

We have chosen a simple additive model (with the OP summary score as the main predictor) for constructing the final mapping algorithm. This model assumed that the dimensions of the OP were equally important, and all levels carried equal weight; and response choices to each item lie on a similar interval scale. The models including all individual OP dimensions have a large number of independent variables; however, in terms of prediction ability, those did not outperform the simple additive models. Moreover, some of the coefficients were non-significant or non-monotonic. These findings were in line with previous studies using item response models or adding interaction and other terms [16]. Furthermore, in most published clinical studies, only the OP summary scores were reported. Therefore, we recommend using the simple additive model to map OP data to the SF-6D index scores.

The distributional characteristics of the SF-6D health utility data (UK v1 tariff) posed a challenge for modelling, for example, the values being bounded between 0.301 and 1, skewness, multimodality, and gaps in the values [16, 17, 38]. In our study, we tested OLS, Tobit, and beta regression. The performance of OLS and Tobit was quite similar, and both were superior to beta regression. One possible explanation might be that the SF-6D index does not suffer from the ceiling effect as much as the EQ-5D index, and in our study, the mean and median of the SF-6D were rather close at baseline. In a study focused on the application of beta regression to the SF-6D index, the author reported that the confidence intervals overlapped across OLS and beta regressions, suggesting that no model was superior to the others [33]. Although OLS has been criticized for not being appropriate for non-normally distributed data and might underestimate health utility associated with mild health states and overestimate utility for more severe health states [25, 38], there was no obvious evidence that OLS performed worse than other more complicated statistical models. Its ease of understanding and application has made OLS a popular choice for deriving mapping algorithms. The ISPOR guideline for mapping does not advocate any specific statistical methods, with the reasons being “… the performance of different methods will vary according to the characteristics of the target utility measure, the disease and patient population in question, the nature of the explanatory clinical variables, and the form of intended use in the CEA[15].” Like many investigators of mapping studies, we recommend using the OLS model in this study.

Age and sex are commonly included as predictors in mapping algorithms, and clinical outcomes such as BMI are also frequently included [16]. In the current study, we observed that in terms of goodness-of-fit and prediction power, mapping algorithms containing more predictors performed better than those with fewer predictors. However, to accommodate users with different needs, we presented algorithms with different combinations of predictors.

For estimating mapping algorithms, clinical trials are the most common source of data [17]. However, it is debatable whether it is optimal to use trial data for deriving mapping algorithms. Compared with registry data, trial data are often derived from smaller, more homogeneous patient samples, thus limiting the generalizability of the resultant mapping algorithms to the real world [17].

Although many mapping studies have applied split-sample validation, the approach has been questioned because it reduces the sample size used in the mapping estimation and might have no proven benefit [25]. However, it is quite often the case that no external dataset is available for external validation. Furthermore, unlike the majority of mapping studies, which use data from clinical trials, our study is based on a clinical registry with a rather large sample size. We therefore still consider it appropriate to apply split-sample validation.

The main strength of our study was the use of real-world data from a large national patient register and the provision of multiple mapping algorithms using different combinations of predictors. The main limitation of the study was that some surgical centres had a low response rate for HRQoL. Since most centres in Sweden have similar characteristics in patient cohorts, this is unlikely to have a significant impact on the representativeness of our study sample. Moreover, loss to follow-up was higher at 5 years than at 1 and 2 years, which might explain the non-significant results in some of the analyses. The implication of missing data needs to be investigated in future studies [39].

Conclusion

This study makes available algorithms enabling crosswalk from the Obesity Problem Scale to the SF-6D for cost-utility analyses of interventions in obesity treatment. Different mapping algorithms are recommended for the mapping of pre-operative and post-operative data.