Introduction

Health status measures are widely used in clinical practice, observational studies, clinical trials, monitoring general population health, assessing the performance of health systems and in cost-effectiveness analysis [1]. Two forms of health status measures can be distinguished as follows: condition-specific and generic [2]. Condition-specific measures have a specific target population and are able to capture a wide range of symptoms and health problems relevant to a certain condition (e.g. itching in skin diseases or bowel problems in gastrointestinal diseases). In contrast, generic health status measures incorporate health areas that are relevant across different patient populations as well as for the general public (e.g. physical functioning, pain, sleeping). These measures have the advantage of allowing comparisons across different conditions, health interventions and with general population reference values.

In general, a large number of items are needed to assess one’s health status precisely; however, long instruments may be impractical (e.g. because of time constraints and respondent burden). Therefore, short-form health assessments have gained popularity. Commonly used short generic health status measures include the EQ-5D and SF-36 [3, 4]. These instruments, however, were developed decades ago, and a common criticism is that their item development and selection did not benefit from modern psychometric methods, such as item response theory (IRT). The Patient Reported Outcomes Measurement Information System (PROMIS) initiative, funded by the National Institutes of Health in the US, aimed to develop, validate and standardize item banks to measure health outcomes across a broad range of health areas [5]. In the past two decades, over 100 PROMIS item banks and a few fixed-length short-forms have been developed using IRT methods (e.g. PROMIS Global Health, PROMIS-29, PROMIS-43, PROMIS-57) [6, 7]. The main advantages of IRT over classical test theory methods include the estimation of respondents’ location on an underlying ‘latent’ trait (e.g. health status) from any subset of items, with item parameters that do not depend on the characteristics of the population, and the possibility to assess health status adaptively using computerised adaptive testing [8, 9].

PROMIS Global Health (PROMIS-GH) is the shortest PROMIS short-form that measures five generic domains of health (physical functioning, pain, fatigue, emotional distress, social health) using 10 global health items [10]. Its validity, reliability and responsiveness have been confirmed in several populations, including patients with stroke [11], orthopaedic conditions [12,13,14,15,16], amyloidosis [17] and inflammatory bowel diseases [18], as well as pregnant women [19] and older adults [20]. The international use of PROMIS-GH has also been expanding outside the US, including studies from the UK [21], Germany [22] and the Netherlands [23]. Furthermore, two countries, the US and the Netherlands, have also established general population reference values [24, 25]. So far, PROMIS-GH has not been used in Hungary. Therefore, this study aimed to evaluate the psychometric performance of the Hungarian PROMIS-GH and to develop general population reference values in Hungary.

Methods

Study design and recruitment

The study was approved by the Research Ethics Committee of the Corvinus University of Budapest (no. KRH/343/2020). In November 2020, an online cross-sectional survey was conducted among the Hungarian adult general population. Respondents were recruited by a survey company from members of the largest Hungarian online panel. ‘Soft quotas’ were used for age, gender, education, place of living and geographical region to approximate the distribution of the general population. The inclusion criteria for this study were as follows: (i) ≥ 18 years of age; (ii) place of residence in Hungary; and (iii) giving informed consent prior to data collection.

PROMIS-GH v1.2 was administered as part of a longer survey that aimed to assess the health status and well-being among members of the general public in Hungary [26,27,28]. Respondents were also asked to complete the SF-36v1 and to identify their sociodemographic background (gender, age, education, place of residence, region, employment, household’s net monthly income, marital status, body weight and height) and if they had any chronic health conditions. All respondents first completed SF-36, followed by PROMIS-GH.

Measures

PROMIS Global Health (PROMIS-GH)

The official Hungarian version of PROMIS-GH v1.2 was used as provided by the PROMIS Health Organization. PROMIS-GH consists of 10 items, namely Global01 = general health, Global02 = quality of life, Global03 = physical health, Global04 = mental health, Global05 = satisfaction with discretionary social activities, Global06 = physical function, Global07 = pain, Global08 = fatigue, Global09 = social roles and Global10 = emotional problems [10]. It has two subscales, Global Physical Health (GPH) and Global Mental Health (GMH). GPH consists of Global03, Global06, Global07 and Global08, while GMH includes Global02, Global04, Global05 and Global10. The recall period of the items varies across ‘in general’, the ‘past seven days’ and unspecified. Each item is assessed on a scale with five response levels. For Global01, Global02, Global03, Global04, Global05 and Global09, the best response option is excellent (5), and the worst is poor (1). For Global06, options range from completely (5) to not at all (1), for Global10 from never (5) to always (1) and for Global08 from none (5) to very severe (1). An exception is Global07, which is rated from 0 to 10 (0 = no pain, 10 = worst imaginable pain). We recoded Global07 to a 5-point scale as follows: 0 = 5; 1–3 = 4; 4–6 = 3; 7–9 = 2; 10 = 1 [10]. Raw subscale scores were calculated by adding scores of individual items per subscale. We calculated standardized T-scores from raw scores using the US item calibrations [29]. T-scores are therefore expressed relative to the US general population, in which the mean is set at 50 with a standard deviation of 10 [24]. A higher T-score indicates better health status and a lower T-score indicates worse health status than in the US general population.
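The recoding and raw scoring rules described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example answers are hypothetical, and the conversion of raw scores to T-scores is not shown because it relies on the US item calibrations [29], which are not reproduced here.

```python
# Item names follow the text; function names and example answers are hypothetical.
GPH_ITEMS = ("Global03", "Global06", "Global07", "Global08")
GMH_ITEMS = ("Global02", "Global04", "Global05", "Global10")


def recode_pain(rating):
    """Map the 0-10 pain rating (Global07) onto the 5-point scale (5 = best)."""
    if not 0 <= rating <= 10:
        raise ValueError("pain rating must be between 0 and 10")
    if rating == 0:
        return 5
    if rating <= 3:
        return 4
    if rating <= 6:
        return 3
    if rating <= 9:
        return 2
    return 1


def raw_subscale_scores(responses):
    """Sum item scores per subscale; Global07 is expected as the raw 0-10 rating."""
    scored = dict(responses)
    scored["Global07"] = recode_pain(responses["Global07"])
    return {
        "GPH": sum(scored[item] for item in GPH_ITEMS),
        "GMH": sum(scored[item] for item in GMH_ITEMS),
    }


# Hypothetical respondent: all items already coded 1-5 except the 0-10 pain rating.
example = {
    "Global02": 4, "Global03": 3, "Global04": 4, "Global05": 3,
    "Global06": 5, "Global07": 2, "Global08": 4, "Global10": 3,
}
scores = raw_subscale_scores(example)
print(scores)
```

With four 5-point items per subscale, each raw score ranges from 4 to 20.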

36-item short form health survey (SF-36)

The Hungarian version of the SF-36v1 questionnaire was used in our survey with a 4-week recall period. SF-36 is a generic health status measure with 36 items that cover eight health subscales, specifically (1) physical functioning, (2) role limitations due to physical problems, (3) bodily pain, (4) general health, (5) vitality, (6) social functioning, (7) role limitations due to emotional problems and (8) mental health [4, 30]. Responses to items are transformed to range from 0 to 100, where higher scores represent better health status. Subscale scores are computed by averaging the respective item scores. SF-36 allows the generation of two summary scores, one for physical health (physical health composite) that includes the first four subscales (1–4) and the other for mental health (mental health composite) that includes the last four (5–8).

Statistical analyses

In this study, we built on the methods used in earlier psychometric investigations and reference population studies with PROMIS instruments [9, 10, 23, 25]. Data analysis was carried out in R Statistical Software (v4.1.2 Vienna, Austria). We used both classical test theory (e.g. ceiling and floor effect, convergent validity, factor analysis) and IRT methods. Before IRT modelling, we tested the following three assumptions: unidimensionality, local independence and monotonicity [31]. In addition, differential item functioning (DIF) analysis was used to examine measurement invariance. Raw item and subscale scores were used to analyse ceiling and floor effect and for the factor analysis, IRT and DIF analyses. Unweighted T-scores were used to draw histograms and estimate correlations. T-scores were weighted for age group and gender to calculate Hungarian GPH and GMH general population reference values.

Ceiling and floor effect

Ceiling and floor effects were considered present if more than 15% of respondents achieved the highest or lowest possible GPH or GMH raw subscale score, respectively [32].
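As a minimal sketch of this criterion, the share of respondents at the lowest and highest possible raw score can be compared with the 15% threshold. The sketch assumes that the criterion refers to the proportion of respondents at the scale extremes; the function name and the example scores are hypothetical.

```python
def floor_ceiling(raw_scores, min_score, max_score, threshold=0.15):
    """Return the share of respondents at each extreme and whether it exceeds the threshold."""
    n = len(raw_scores)
    if n == 0:
        raise ValueError("at least one score is required")
    floor_share = sum(1 for s in raw_scores if s == min_score) / n
    ceiling_share = sum(1 for s in raw_scores if s == max_score) / n
    return {
        "floor": floor_share,
        "ceiling": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Hypothetical raw subscale scores; four 5-point items give a range of 4 to 20.
gph_raw = [4, 9, 12, 14, 15, 16, 16, 17, 18, 20]
result = floor_ceiling(gph_raw, min_score=4, max_score=20)
print(result)
```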

Unidimensionality

Unidimensionality was tested using confirmatory factor analysis (CFA) and bifactor models. CFA was conducted for the two subscales separately (lavaan package) [33]. Goodness-of-fit was evaluated by the comparative fit index (CFI, cut-off value: > 0.95), Tucker-Lewis index (TLI, cut-off value: > 0.95), the root mean square error of approximation (RMSEA, cut-off value < 0.06), and the standardized root mean squared residual (SRMR, cut-off value: < 0.08) [34, 35]. Further, we used bifactor models to obtain Omega Hierarchical (tentative benchmark > 0.70) and explained common variance (ECV, tentative benchmark > 0.60) [36, 37]. The bifactor models were developed using the psych package [38].
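To make the two bifactor indices concrete, the sketch below computes ECV and Omega Hierarchical from standardized loadings of an orthogonal bifactor model. It is not the procedure of the psych package used in this study; the formulas are the textbook definitions, and the loadings, item grouping and function names are hypothetical.

```python
def explained_common_variance(items):
    """ECV: general-factor share of the common variance.

    `items` is a list of (general_loading, group_id, group_loading) tuples.
    """
    general = sum(g ** 2 for g, _, _ in items)
    group = sum(s ** 2 for _, _, s in items)
    return general / (general + group)


def omega_hierarchical(items):
    """Share of total-score variance attributable to the general factor."""
    general = sum(g for g, _, _ in items) ** 2
    group_sums = {}
    for _, group_id, s in items:
        group_sums[group_id] = group_sums.get(group_id, 0.0) + s
    group = sum(total ** 2 for total in group_sums.values())
    # Unique variance of each standardized item: 1 minus its communality.
    unique = sum(1 - g ** 2 - s ** 2 for g, _, s in items)
    return general / (general + group + unique)


# Hypothetical loadings: four items, two group factors "A" and "B".
loadings = [(0.7, "A", 0.3), (0.7, "A", 0.3), (0.6, "B", 0.4), (0.6, "B", 0.4)]
ecv = explained_common_variance(loadings)
omega_h = omega_hierarchical(loadings)
print(round(ecv, 3), round(omega_h, 3))
```

The resulting values would then be compared with the tentative benchmarks stated above (ECV > 0.60, Omega Hierarchical > 0.70).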

Local independence

To test local independence, we examined the residual correlation matrix resulting from the CFA for both GPH and GMH subscales. Residual correlation values between − 0.20 and 0.20 were considered acceptable supporting local independence [9].
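The check on the residual correlation matrix can be sketched as follows. The matrix values and the function name are hypothetical; only the acceptable range of − 0.20 to 0.20 is taken from the text.

```python
def flag_local_dependence(residuals, labels, limit=0.20):
    """Return item pairs whose residual correlation lies outside [-limit, limit]."""
    flagged = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            r = residuals[i][j]
            if abs(r) > limit:
                flagged.append((labels[i], labels[j], r))
    return flagged


gph_items = ["Global03", "Global06", "Global07", "Global08"]
# Hypothetical symmetric residual correlation matrix (diagonal set to 0).
gph_residuals = [
    [0.00, 0.05, -0.03, 0.02],
    [0.05, 0.00, -0.08, 0.01],
    [-0.03, -0.08, 0.00, 0.04],
    [0.02, 0.01, 0.04, 0.00],
]
flagged_pairs = flag_local_dependence(gph_residuals, gph_items)
print(flagged_pairs)
```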

Monotonicity

Monotonicity was analysed using Mokken scale analysis (mokken package). Coefficients (Hi for items, H for subscales) exceeding the cut-off value of 0.30 were considered acceptable [39, 40].

IRT model fit

Given the polytomous response options of PROMIS-GH items, a graded response model was fitted for both GPH and GMH (mirt package) [41, 42]. To detect item misfit, we used Orlando and Thissen’s S-χ². Items with a p-value < 0.001 were considered misfitting [43]. The same cut-off values were used for fit indices (CFI, TLI, RMSEA, SRMR) as for unidimensionality [34]. Item discrimination (slope, a) and item difficulties (threshold, b) were also computed. Item characteristic curves (ICC) were generated for each item of the two subscales.
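For readers less familiar with the graded response model, the sketch below shows how the discrimination (a) and threshold (b) parameters translate into response category probabilities for a given trait level. It uses the standard logistic form of the model; the parameter values and the function name are hypothetical and are not estimates from this study.

```python
import math


def grm_category_probabilities(theta, a, thresholds):
    """Category probabilities under the graded response model.

    P(X >= k) = 1 / (1 + exp(-a * (theta - b_k))); each category probability
    is the difference between adjacent cumulative probabilities. Categories
    are ordered from the lowest to the highest response level.
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in ascending order")
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical item with five response levels, hence four thresholds.
a = 2.0
thresholds = [-2.5, -1.5, -0.5, 1.0]
probs = grm_category_probabilities(0.0, a, thresholds)
print([round(p, 3) for p in probs])
```

Plotting these probabilities over a range of theta values gives the item characteristic curves for one item.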

Measurement invariance

Measurement invariance was assessed by analysing differential item functioning (DIF) using the lordif package [44]. DIF occurs when the responses of a subgroup of respondents on an item consistently differ from those of another subgroup when controlling for the underlying level of the trait measured by the scale [9]. DIF was analysed for GPH and GMH with the following subgroups: gender (female, male), median age (< 47, ≥ 47 years), education (primary, secondary, tertiary), region (Central, Western and Eastern Hungary), employment (employed, not employed), place of residence (capital, other town, village), marital status (married, not married), and income groups (quintiles, do not know, refused to answer groups). First, we used ordinal logistic regression models without an anchor to evaluate DIF. Where DIF was detected, we repeated the analysis using non-DIF items as an anchor. A pseudo-R² change ≥ 0.02 was taken as a critical value [45, 46]. The details of the DIF analysis are provided elsewhere [28].
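The decision rule can be illustrated with a small sketch that compares two nested ordinal logistic regression models for one item: one with the trait level only and one that also includes group membership terms. The text does not state which pseudo-R² measure was used, so McFadden's pseudo-R² is assumed here purely for illustration; the log-likelihood values and function names are hypothetical.

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo-R2 from the log-likelihoods of a model and the null model."""
    return 1.0 - ll_model / ll_null


def dif_flag(ll_null, ll_trait_only, ll_trait_and_group, critical=0.02):
    """Return the pseudo-R2 change between nested models and whether it reaches the critical value."""
    change = mcfadden_r2(ll_trait_and_group, ll_null) - mcfadden_r2(ll_trait_only, ll_null)
    return change, change >= critical


# Hypothetical log-likelihoods for one item.
change, flagged = dif_flag(ll_null=-2000.0, ll_trait_only=-1500.0, ll_trait_and_group=-1490.0)
print(round(change, 3), flagged)
```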

Convergent validity

Spearman’s rank order correlations were used to explore the convergent validity of the two PROMIS-GH subscales with the eight SF-36 subscales and two composite scores. Correlation coefficients (rs) were interpreted as very weak (< 0.20), weak (0.20–0.39), moderate (0.40–0.59) and strong (≥ 0.60) [47].
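A minimal sketch of this step is given below: Spearman's coefficient is computed as the Pearson correlation of average ranks, and the result is mapped onto the interpretation bands listed above. The function names and the example scores are hypothetical, and applying the bands to the absolute value of the coefficient is an assumption.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank order correlation as the Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def interpret_correlation(rs):
    """Map a coefficient onto the bands used in the text (absolute value assumed)."""
    size = abs(rs)
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    return "strong"


# Hypothetical T-scores and SF-36 composite scores for five respondents.
gph_t = [38.0, 45.5, 50.1, 50.1, 61.2]
sf36_physical = [30.0, 55.0, 62.5, 70.0, 90.0]
rs = spearman(gph_t, sf36_physical)
print(round(rs, 3), interpret_correlation(rs))
```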

Establishment of general population reference values

Mean GPH and GMH T-scores were weighted according to gender and age group to derive general population reference values using the US item calibrations [48]. Mean weighted T-scores were computed for subgroups of respondents defined by gender, age groups, education, place of residence, region, employment, income groups, marital status, health status question of SF-36 (item 1), BMI and the presence of any chronic condition. We used Taylor linearization for standard errors, and 95% confidence intervals were calculated for each group. The subgroups were compared using Mann–Whitney or Kruskal–Wallis tests, where applicable.
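The weighting step can be sketched as a simple post-stratification: each respondent receives a weight equal to the population share of their gender and age cell divided by the sample share of that cell, and the weighted mean T-score is then computed. This is an illustration under assumed inputs, not the exact weighting scheme of the study; the cells, population shares and T-scores are hypothetical, and the Taylor linearization of standard errors is not shown.

```python
def poststratification_weights(sample_cells, population_shares):
    """Weight = population share of a cell / sample share of that cell."""
    n = len(sample_cells)
    counts = {}
    for cell in sample_cells:
        counts[cell] = counts.get(cell, 0) + 1
    return [population_shares[cell] / (counts[cell] / n) for cell in sample_cells]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical sample of four respondents in two gender-by-age cells.
cells = [("female", "18-46"), ("female", "18-46"), ("female", "18-46"), ("male", "18-46")]
t_scores = [46.0, 48.0, 50.0, 52.0]
# Hypothetical population shares for the same cells.
population = {("female", "18-46"): 0.5, ("male", "18-46"): 0.5}

weights = poststratification_weights(cells, population)
weighted = weighted_mean(t_scores, weights)
unweighted = sum(t_scores) / len(t_scores)
print(round(unweighted, 2), round(weighted, 2))
```

In this toy example, men are underrepresented in the sample, so weighting pulls the mean towards their score.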

Hypotheses

Regarding the psychometric properties, we hypothesized (1) no ceiling or floor effects for any subscales, (2) unidimensionality, (3) local independence, (4) monotonicity, (5) acceptable fit to the graded response model, (6) no differential item functioning (i.e. measurement invariance) for any subgroups, and (7) moderate or strong correlations between the PROMIS-GH subscales (GPH and GMH) and their corresponding SF-36 composite scores [10, 11, 23, 49]. With regard to the reference values, we hypothesized better self-reported health in men and declining physical health with age [50].

Results

Sample characteristics (unweighted)

Overall, 2502 respondents initiated the survey, of whom 2079 consented; 379 of these quit before the end of the questionnaire, leaving a total of 1700 respondents who completed the survey. The mean age was 47.9 ± 16.3 years, and 56.3% of the respondents were female. Nearly one-third of the sample had tertiary education (32.4%). Half of the respondents were employed (50.9%), 23.5% were retired and 4.4% were students. Overall, 22.4% lived in the capital, 48.2% in other towns and 29.4% in villages. The geographical distribution of the sample was as follows: Western Hungary 29.0%, Central Hungary 33.6%, Eastern Hungary 37.4%. Overall, 67.4% of the sample reported having at least one chronic disease. The overall sample showed good representativeness of the general population in Hungary; however, respondents with a secondary education were slightly underrepresented and those who lived in the capital were somewhat overrepresented (Table 1).

Table 1 Characteristics of the study population and PROMIS Global Health reference values in Hungary

Ceiling and floor effect

The distributions of GPH and GMH T-scores are presented in Fig. 1. We found almost no floor effect and a low ceiling effect for both the GPH (0.4% and 4.1%) and GMH subscales (0.5% and 4.8%) (Table 2). Among the items, Global07 demonstrated the highest floor (29.8%). Global06 showed the highest ceiling (58.2%), followed by Global10 (38.3%), Global08 (23.9%) and Global09 (15.8%).

Fig. 1 Distribution of Global Physical Health and Global Mental Health T-scores (unweighted)

Table 2 Floor and ceiling of PROMIS Global Health items and subscales

Factor and IRT analysis

Unidimensionality

Fit indices confirmed the unidimensionality of both GPH (CFI = 0.993, TLI = 0.978, SRMR = 0.039) and GMH (CFI = 0.999, TLI = 0.997, SRMR = 0.025), with the exception of RMSEA, which exceeded the cut-off value (GPH 0.114 and GMH 0.071). Unidimensionality was further supported by the bifactor models, with ECV values higher than the tentative benchmark for both subscales (GPH 0.72 and GMH 0.78). Omega Hierarchical was above the tentative benchmark for GMH (0.73), but not for GPH (0.66) (Table 3).

Table 3 Psychometric properties of PROMIS Global Health subscales

Local independence

We found no local dependence between item pairs (Online Resource 1). Eight item pairs had negative residual correlations, but all values were above the value of − 0.20.

Monotonicity

The Mokken scale analysis resulted in coefficients higher than the cut-off value for both subscales (H = 0.531 and 0.638 for GPH and GMH) and items, ranging from Hi = 0.480 (Global08) to 0.717 (Global04) supporting monotonicity (Table 3).

Model fit

Given that unidimensionality, local independence and monotonicity were supported for both subscales, graded response models were fitted. Acceptable fit indices were found for both subscales (GPH: RMSEA = 0.008, SRMR = 0.045, TLI = 0.905, CFI = 0.968 and GMH: RMSEA = 0.012, SRMR = 0.031, TLI = 0.969, CFI = 0.990). A few items showed misfit to the graded response model, namely Global03, Global06, Global02, Global05 and Global10 (p < 0.001) (Table 3). Item difficulties (b) ranged from − 3.7 (Global08) to 1.7 (Global03) for GPH and from − 2.9 (Global10) to 1.7 (Global02) for GMH. Item discrimination (a) values ranged from 1.6 (Global08) to 2.3 (Global07) and from 1.7 (Global10) to 8.0 (Global04) for GPH and GMH, respectively. ICCs for the two subscales are displayed in Fig. 2.

Fig. 2 Item characteristic curves of items of the Global Physical Health and Global Mental Health subscales. Global02 = quality of life, Global03 = physical health, Global04 = mental health, Global05 = satisfaction with discretionary social activities, Global06 = physical function, Global07 = pain (reverse coded 5-level item), Global08 = fatigue, Global10 = emotional problems. Global Physical Health items: Global03, Global06, Global07, Global08. Global Mental Health items: Global02, Global04, Global05, Global10

Measurement invariance

After the first step (without anchors), one item (Global07) was flagged for DIF based on age groups, and two items (Global02 and Global10) were flagged for DIF by gender. After the second step (with anchors), DIF was no longer detected for age group and gender, as the pseudo-R² change was < 0.02 for each analysis. No DIF was detected for education, region, employment, place of residence, marital status or income.

Convergent validity

GMH T-score showed a strong correlation with the mental health composite score of SF-36 (rs = 0.708) and GPH T-score with the physical health composite score (rs = 0.829) (Fig. 3). Among the SF-36 subscales, the GPH T-score had the highest correlation with general health (rs = 0.740) and bodily pain (rs = 0.738), while the GMH T-score showed the strongest correlation with mental health (rs = 0.699) and vitality (rs = 0.657).

Fig. 3 Convergent validity of PROMIS Global Health subscales with SF-36 composites and subscales. p < 0.001 for all correlation coefficients (Spearman’s). PROMIS-GH = Patient Reported Outcomes Measurement Information System-Global Health, SF-36 = 36-item short form health survey

Reference values for PROMIS-GH in Hungary

Overall mean T-scores for GPH and GMH were 49.0 and 47.7, respectively (Table 1). Mean GPH and GMH T-scores of females were lower (47.8 and 46.4) than those of males (50.5 and 49.3) (p < 0.001). We found the highest mean T-scores for GPH and GMH in the 18–24 age group (GPH: 52.3 and GMH: 49.9). Mean GPH and GMH T-scores showed a decreasing trend with age (p < 0.05). Those with a higher level of education, living in towns, being a student, having a higher income and without chronic disease had higher mean T-scores for both GPH and GMH (p < 0.001). With regard to BMI, mean GPH T-scores were higher in respondents with normal weight than in those who were underweight or overweight/obese (p < 0.05). Those who reported ‘excellent’ health on the first question of the SF-36 had the highest, while those who reported ‘poor’ had the lowest mean GPH and GMH T-scores (p < 0.001).

Discussion

This study provided a psychometric assessment of the Hungarian version of PROMIS-GH and developed population reference values for its physical and mental health subscales in Hungary. We used both classical test theory and IRT methods to establish the psychometric properties of the measure. PROMIS-GH subscales showed no ceiling or floor effects. All assumptions of IRT (unidimensionality, local independence and monotonicity) were met. Although the Omega Hierarchical value was below the tentative benchmark for GPH, it is important to emphasize that PROMIS-GH is inherently a multidimensional measure, and therefore, individual subscale values within the range of 0.6 and 0.8 seem appropriate both for Omega Hierarchical and ECV [36, 37]. The goodness of fit to the graded response model was acceptable, with a few items misfitting. We found no differential item functioning for any sociodemographic characteristic, which supports measurement invariance. Strong correlations were found between corresponding PROMIS-GH subscales and SF-36 physical and mental health composite scores. Mean GPH and GMH T-scores in the Hungarian general population were 49.0 and 47.7, respectively.

It is worthwhile to compare our findings about the psychometric performance of PROMIS-GH to those of earlier psychometric studies among members of the general population in the Netherlands and the US [10, 23]. First, unidimensionality was supported with negligible deviations in each study. No local dependence was detected in the Hungarian and Dutch general population samples. The coefficients of the Mokken scale analysis showed that monotonicity was supported in the Hungarian and Dutch samples. Interestingly, in both studies the Global06 item had the smallest distance between the thresholds (Hungarian: − 2.879 to − 0.252; Dutch: − 2.668 to − 0.055). The range of item difficulty values (b) was very similar in all three general population studies, with small differences at both ends (US: − 3.0 to 1.5, Hungarian: − 3.7 to 1.7, Dutch: − 3.7 to 1.9) [10, 23]. Ranges of item discrimination parameters (a) were similar for both subscales, with slight differences between the US and Dutch studies [10, 23]. While the item discrimination parameters of the Hungarian GPH were in the same range (from 1.6 to 2.3) as in the previous two studies, the range for the Hungarian GMH (from 1.7 to 8.0) was widened by the high value for Global04; item discrimination usually ranges between 0.5 and 2.5 [31].

The Hungarian overall mean GPH and GMH T-scores (49.0 and 47.7) were slightly lower than the US reference population values (GPH: 50.0, GMH: 50.0) and higher than the Dutch values (GPH: 45.2, GMH: 44.7), suggesting that the Hungarian general population is in better health than the Dutch (Online Resource 2). By contrast, the standardized Dutch SF-36 physical (49.7) and mental health composite scores (52.1) were somewhat higher than the Hungarian scores (48.3 and 48.2), implying that the Dutch general population is in better health [51]. However, the Dutch population norm data were collected in 1996 using the SF-12, which may limit the comparison [52]. A similar pattern was observed for GPH and GMH in the Hungarian general population as in the US and Dutch samples, with mean T-scores decreasing with age and males reporting better health status than females [25, 53]. However, it should be noted that the US sample (data collected in 2006–2007) and the Dutch sample (data collected in 2016) were obtained considerably earlier than the data in this study. In addition, the US calibration sample may not be representative of European populations. Overall, the following characteristics were associated with better physical and mental health in the Hungarian sample: being younger, being male, having a higher level of education, living in towns, being a student, having a higher level of income, having no chronic diseases and reporting better self-perceived health on the first question of the SF-36.

A surprising finding of this study is that the Hungarian general population reported better overall health status than the Dutch general population. Life expectancy in the Netherlands is almost one year higher (81.5) than the weighted EU average (80.6), while life expectancy in Hungary is almost five years (75.7) behind the weighted EU average [54]. In terms of government funding, compulsory and voluntary health insurance and out-of-pocket payments, the Netherlands has one of the highest levels of per capita healthcare spending in the EU, while Hungary continues to fall behind the EU average in this regard. Perhaps the greatest contrast is that in 2019, 75% of the Dutch general public reported being in good health, whereas this figure did not reach 60% in Hungary in the same year [54]. However, the comparison of PROMIS-GH scores between these two countries is limited by the fact that the Dutch sample was not representative of the general population for some important sociodemographic and health-related characteristics, such as employment and marital status, income and the prevalence of chronic diseases [25].

This study has a few limitations. Our data were collected during the COVID-19 pandemic, which might have influenced the health status of the general population. However, a recent study has shown that the COVID-19 pandemic had a negligible impact on the health status of US patients measured by PROMIS-GH [55]. Furthermore, self-reported health status on the first question of SF-36 in our study was very similar to what had been reported in a pre-COVID online general population survey in Hungary in 2019 [56]. Selection bias might have occurred, as online panel data collections may be subject to self-selection and underrepresentation of certain groups (e.g. those without internet access) [57]. Another limitation is the cross-sectional nature of this study, which prevented us from assessing the test–retest reliability and responsiveness of PROMIS-GH.

In conclusion, this study provided an extensive psychometric analysis of the Hungarian PROMIS-GH in a large general population sample and established general population reference values for Hungary. Future research is recommended to replicate this general population study after the COVID-19 pandemic and further test psychometric properties of the Hungarian PROMIS-GH in paper-and-pencil surveys, longitudinal studies and with various patient populations.