Introduction
Rheumatoid arthritis (RA) is an inflammatory joint disease, often with a chronic course, that is known to impact patients’ quality of life in a variety of ways. Consequently, patient-reported outcomes (PROs) such as pain and physical function have a prominent role in outcome assessment in this field [1, 2]. Other PROs, particularly fatigue and social role participation, have also gained increased attention [3, 4].
A variety of measures have been developed to facilitate the measurement of such PROs. For example, the Rheumatoid Arthritis Impact of Disease (RAID) score, a patient-reported response index that combines 7 PRO domains (including fatigue, emotional well-being and sleep quality) in one measure, is now available, and evidence regarding its measurement properties has been published [5, 6]. The multidimensional Bristol Rheumatoid Arthritis Fatigue scale (BRAF-MDQ) is a patient-reported outcome measure (PROM) that provides in-depth information about fatigue. Several studies have supported its measurement properties in RA [7–9].
Recently, these PROMs were cross-culturally adapted for use in 6 European countries, using a rigorous qualitative approach that focused on their linguistic and conceptual equivalence [10]. This work ensured that item content is appropriate for use in new cultural contexts and that the intended meaning of the involved items was retained in translation [11]. For BRAF-MDQ, a subsequent study showed that the instrument yields reliable scores and that the same factor structure applied in each country [12]. These findings support the configural invariance of BRAF-MDQ scores across the countries considered, which suggests that the items measure the same constructs in each of these countries [13]. The next question to be addressed is to what extent these procedures have been successful in practice and, hence, whether BRAF-MDQ and RAID scores can be meaningfully compared across cultures. This requires that patients with the same overall fatigue or disease impact level can also be expected to have the same scores on the included items, regardless of the PROM language version administered to them [14]. Items for which this is not the case display differential item functioning (DIF). If multiple items in a scale are found to show DIF, the scale might systematically over- or underestimate between-country differences in the measured trait, and scores cannot be meaningfully compared across different language versions unless DIF is taken into account in the PROM scoring procedure [13, 14].
Item response theory (IRT) provides a framework for evaluating DIF, as well as the general scaling properties of a PROM. In IRT models for ordered polytomous data, the expected item responses for patients with different levels of the measured trait are described by an item characteristic function (ICF), which constrains the expected item scores to be monotonically increasing over the latent variable that the PROM intends to assess. Therefore, good fit of an individual item to an IRT model supports the usefulness of that item for measuring the latent variable the PROM intends to assess. Examining cross-language measurement equivalence using IRT involves testing whether the expected item scores of different language versions of a PROM item can be described using the same ICF. If so, this suggests that the item functions the same way in each language and supports that item scores can be meaningfully compared between patients who were assessed using different language versions of the scale. In cases where some PROM items are responded to differently by patients of different languages, it is usually possible to improve the fit of the model by allowing country-specific ICFs for DIF-affected items [17]. As long as there are sufficient numbers of DIF-free items, the different language versions will still be on the same IRT metric. Modeling DIF in this way is an effective way to adjust the scores for DIF and preserve comparability of scores [18]. The impact of cross-cultural DIF on the comparability of the raw scores across different language versions of the PROM can be evaluated by examining the distance between an item’s unadjusted and DIF-adjusted ICFs on the latent variable or, equivalently, the differences between the adjusted and unadjusted predicted scale scores [19].
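To make the notion of an ICF and of DIF impact concrete, the following minimal sketch computes expected item scores under a generalized partial credit model (GPCM), assumed here as the IRT model, and the gap between a pooled and a country-specific curve. All parameter values and names (`pooled`, `country`) are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math

def gpcm_probs(theta, a, thresholds):
    """Category probabilities (scores 0..m) under the GPCM."""
    # Cumulative logits: 0 for score 0, then a running sum of a * (theta - b_j).
    logits = [0.0]
    for b in thresholds:
        logits.append(logits[-1] + a * (theta - b))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(theta, a, thresholds):
    """Item characteristic function: expected item score at trait level theta."""
    return sum(k * p for k, p in enumerate(gpcm_probs(theta, a, thresholds)))

# Hypothetical parameters: (discrimination, step thresholds) for one 4-category item.
pooled = (1.2, [-1.0, 0.0, 1.0])    # ICF assumed shared by all language versions
country = (1.2, [-0.6, 0.4, 1.4])   # country-specific ICF for a DIF-affected item

# Evaluate both curves on a grid of trait levels and compare them.
grid = [x / 10 for x in range(-30, 31)]
pooled_curve = [expected_score(t, *pooled) for t in grid]
gaps = [expected_score(t, *country) - p for t, p in zip(grid, pooled_curve)]
max_gap = max(abs(g) for g in gaps)  # largest item-level difference in score points
```

In this hypothetical case the country-specific thresholds are shifted upward, so patients in that country are expected to score lower on the item than equally affected patients elsewhere; `max_gap` expresses that difference in raw score points.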
Once the items of a PROM have been successfully calibrated using an IRT model, the precision of the scores can be summarized using a marginal reliability coefficient. Precision can also be examined in detail across the different trait levels that the PROM discriminates between, using conditional reliability coefficients. This provides a more in-depth evaluation of score precision than the classical test theory-based methods that are more commonly used.
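The distinction between conditional and marginal reliability can be sketched as follows. This is one common approximation, not necessarily the formulation used in this study: it assumes a GPCM, a standard normal latent distribution, and conditional reliability defined as I(θ) / (I(θ) + 1), where I(θ) is the test information. The item parameters are hypothetical.

```python
import math

def gpcm_probs(theta, a, thresholds):
    """Category probabilities (scores 0..m) under the GPCM."""
    logits = [0.0]
    for b in thresholds:
        logits.append(logits[-1] + a * (theta - b))
    peak = max(logits)
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def item_information(theta, a, thresholds):
    """GPCM item information: a^2 times the variance of the item score at theta."""
    probs = gpcm_probs(theta, a, thresholds)
    mean = sum(k * p for k, p in enumerate(probs))
    second = sum(k * k * p for k, p in enumerate(probs))
    return a * a * (second - mean * mean)

def conditional_reliability(theta, items):
    """Reliability at a given trait level, assuming unit latent variance."""
    info = sum(item_information(theta, a, b) for a, b in items)
    return info / (info + 1.0)

def marginal_reliability(items, lo=-4.0, hi=4.0, steps=81):
    """Average conditional reliability, weighted by a standard normal density."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    dens = [math.exp(-0.5 * t * t) for t in grid]
    weighted = sum(d * conditional_reliability(t, items) for t, d in zip(grid, dens))
    return weighted / sum(dens)

# Hypothetical calibrated items: (discrimination, step thresholds).
items = [
    (1.5, [-1.0, 0.0, 1.0]),
    (1.0, [-0.5, 0.5]),
    (2.0, [-1.5, -0.5, 0.5, 1.5]),
    (1.2, [0.0, 1.0]),
]

rel_center = conditional_reliability(0.0, items)   # where the items are targeted
rel_extreme = conditional_reliability(4.0, items)  # far from the item thresholds
rel_marginal = marginal_reliability(items)
```

A single marginal coefficient hides the fact that precision drops for trait levels far from the item thresholds, which is what the conditional coefficients reveal.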
The primary aim of the present study was to examine cross-language measurement invariance of RAID and BRAF-MDQ, using data from approximately 200 patients in each of 6 European countries for which the questionnaires had been translated [20]. We examined the presence of DIF and its impact at the item and scale levels using several effect size statistics that have been proposed for these purposes. A secondary aim was to examine the measurement precision of the instruments.
Discussion
In the present study, we used IRT-based methods to evaluate the cross-language measurement equivalence and the psychometric properties of 6 European language versions of BRAF-MDQ and RAID. We found that although both instruments had a few items that exhibited language-related DIF, accounting for these differences generally led to small differences in fatigue or disease impact estimates at the total score level. The results of this study therefore support the validity of BRAF-MDQ and RAID score comparisons between the different language versions considered in this study.
The BRAF-MDQ items 1 (NRS severity), 2 (How many days did you experience fatigue during the past week?) and 18 (Have you been embarrassed because of your fatigue?) and RAID item 7 (Coping) proved to be most consistently associated with DIF across countries. However, in a subsequent analysis in which these items were allowed to have country-specific characteristics, we observed good fit of the adjusted GPCM. This finding supports the construct validity of the respective instruments: the same underlying variable (fatigue severity and disease impact, respectively) seems to apply to all items, but patients from different countries with the same level of fatigue or disease impact may have different expected item scores for some of the items.
The impact of language-related DIF on the BRAF and RAID total scores was generally small, which suggests that raw scores can be compared between different language versions at the group level in most cases. However, for BRAF Physical in Spanish, Swedish, and French patients, the scores were inflated by ≥ 3% of the maximum attainable score. With respect to the scores of individual patients, the impact of cross-language factors was again generally quite minor. Only for BRAF Physical was the impact of DIF substantial in some cases, so the interpretation of BRAF Physical scores of individual patients in a cross-cultural context should proceed with caution. Taken together, our findings provide support for the cross-cultural validity of RAID and BRAF-MDQ total scores.
However, in situations where small differences between different language versions of BRAF Physical are sought, or when considering BRAF Physical scores of an individual patient assessed using different language versions, an IRT-based scoring procedure using the item parameters of the adjusted model might be prudent. Several IRT software packages can be used to estimate fatigue or disease impact scores based on the item parameters of the adjusted model, which are provided in the Supplemental Material. The advantage of IRT-based scoring procedures is that differences in item characteristics between different language versions are statistically adjusted, so that scores become more comparable across countries. IRT-based scoring procedures are also more appropriate when individual item responses are missing. However, the total sample size in this study was somewhat limited.
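As a rough illustration of what such an IRT-based scoring procedure involves, the sketch below computes an expected a posteriori (EAP) trait estimate under a GPCM with a standard normal prior, using country-specific parameters for DIF-affected items and skipping missing responses. The item bank, item names, and the `country_x` label are hypothetical placeholders; the actual parameters are those in the Supplemental Material, and dedicated IRT software may use a different estimator.

```python
import math

def gpcm_probs(theta, a, thresholds):
    """Category probabilities (scores 0..m) under the GPCM."""
    logits = [0.0]
    for b in thresholds:
        logits.append(logits[-1] + a * (theta - b))
    peak = max(logits)
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical item bank: pooled (discrimination, thresholds) per item, plus a
# country-specific override for an item assumed to show DIF in "country_x".
ITEM_BANK = {
    "item_a": {"pooled": (1.4, [-1.0, 0.0, 1.0])},
    "item_b": {"pooled": (1.0, [-0.5, 0.8]), "country_x": (1.0, [0.0, 1.3])},
    "item_c": {"pooled": (1.8, [-0.3, 0.6, 1.5])},
}

def eap_score(responses, country, lo=-4.0, hi=4.0, steps=81):
    """EAP estimate of the latent trait; responses map item name -> score or None."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    posterior = [math.exp(-0.5 * t * t) for t in grid]  # standard normal prior
    for item, score in responses.items():
        if score is None:
            continue  # a missing response simply contributes no information
        params = ITEM_BANK[item]
        a, thresholds = params.get(country, params["pooled"])
        posterior = [w * gpcm_probs(t, a, thresholds)[score]
                     for w, t in zip(posterior, grid)]
    return sum(t * w for t, w in zip(grid, posterior)) / sum(posterior)

# Same raw responses, scored with and without the country-specific adjustment.
answers = {"item_a": 2, "item_b": 1, "item_c": None}
theta_adjusted = eap_score(answers, "country_x")
theta_pooled = eap_score(answers, "other_country")
```

Because the hypothetical override makes `item_b` harder to endorse in `country_x`, the same raw response maps to a higher trait estimate there, which is the kind of adjustment that raw sum scores cannot make.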
The psychometric properties of RAID and BRAF-MDQ have been described in several previous studies, using methods based on classical test theory. In these studies, both PROMs were found to have highly precise scores [5–8]. Our results corroborate these findings and expand on them by showing that the items are well targeted to the score levels of RA patients and that scores were precise across the spectrum of score levels. Finally, it has previously been demonstrated that the factor structure of BRAF-MDQ was stable across countries [12]. Those findings support the configural invariance of BRAF-MDQ scores across countries, i.e., that all items measure the same concepts in all countries. Our results expand on this by demonstrating, for RAID and BRAF-MDQ, that full measurement invariance was supported for both PROMs. Hence, scores can be meaningfully compared across different language versions. We also showed that all items could be described using the GPCM, which supports that the items relate to a common underlying variable. However, the analysis of IRT fit, together with the finding that the discrimination parameters varied considerably, shows that the Rasch model is not appropriate for these data. This means that the item responses contain more information about the disease impact or fatigue levels of patients than is provided by the summed scores.
A limitation of the study is that convenience sampling was used to obtain samples of patients from different countries. Consequently, patients from different countries differed to an extent with respect to their fatigue, disease impact and HAQ scores, which could have led to biased parameter estimates if a shared latent variable score distribution had been assumed to apply to patients of all countries. In an effort to avoid this, we used separate marginal distributions to characterize the scores of each group of patients [32]. Furthermore, conclusions with respect to comparability of scores based on these results should not be generalized to other language versions of these instruments. For example, it might be expected that larger differences in item response behavior would be observed for languages other than those belonging to the Indo-European family of languages and for versions intended for patients with different cultural backgrounds.
In summary, the results of this study generally support the validity of cross-cultural score comparisons using the instruments evaluated here and provide additional support for their measurement properties. Based on these results, we recommend the BRAF-MDQ and RAID total scores as well as the BRAF living, emotion and cognition subscales for cross-cultural comparisons. For those interested in using BRAF Physical in a cross-cultural context, we recommend an IRT-based scoring procedure using the item parameters provided in the Supplemental Material.