Original Article
Multidimensional item response theory models yielded good fit and reliable scores for the Short Form-12 questionnaire
Introduction
The Short Form 12 (SF-12) version 1 was developed as a shorter alternative to the SF-36 health survey, for studies in which a 36-item form was too long. Because of its brevity and good performance in clinical assessment, the SF-12 has become a widespread measure of health status and changes in health over time in large samples. It is summarized into two measures, namely the physical (PCS) and mental component summary (MCS) scores [1]. The summaries have successfully been used to detect the presence and severity of physical and mental disorders in clinically defined groups [1], [2].
The SF-12 summaries are regression estimates of the corresponding second-order scores in the SF-36 [3], [4], [5] and are computed as weighted linear composites of individual item responses, coded as dummy variables. Each weight is the regression coefficient of a response multiplied by the component loading of the item's native SF-36 subscale on its summary [6]. Under the assumption that the items contain both physical and mental information, all items participate in the estimation of both components. Another method, the RAND-12 algorithm [7], is arguably the most successful scoring method for the SF-12 based on item response theory (IRT). It is derived from the application of a Rasch-type IRT model [7], [8] to the SF-36 items to obtain eight latent traits. These traits can be summarized in two second-order health scores (physical [PHC] and mental health components [MHC]), originally derived from a two-factor oblique principal axis factor analysis. Like the SF-12, the RAND-12 scores are the product of two regressions of the 36-item PHC and MHC on two six-item subsets, weighted by component loadings. Unlike in the SF-12, weights are computed from the IRT-weighted items treated as continuous, and each item has a single regression weight in its theoretical dimension. The RAND-12 has been shown to be more discriminating than the SF-12 in clinical groups [9], [10].
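The weighted linear composite of dummy-coded responses described above can be sketched roughly as follows. This is only an illustration: the function names, the three-item layout, the weights, and the intercept are all hypothetical placeholders, not the published SF-12 scoring weights.

```python
def dummy_code(responses, n_categories):
    """Turn ordinal responses into indicator (dummy) variables.

    responses: 0-based category chosen for each item.
    n_categories: number of response alternatives of each item.
    The first category of every item is the reference and gets no indicator.
    """
    indicators = []
    for response, k in zip(responses, n_categories):
        item = [0.0] * (k - 1)
        if response > 0:
            item[response - 1] = 1.0
        indicators.extend(item)
    return indicators


def composite_score(responses, n_categories, weights, intercept):
    """Weighted linear composite: intercept + sum of weight * indicator."""
    x = dummy_code(responses, n_categories)
    if len(x) != len(weights):
        raise ValueError("one weight is needed per dummy variable")
    return intercept + sum(w * xi for w, xi in zip(weights, x))


# Hypothetical example: three items with 3, 2, and 5 alternatives.
# These weights and the intercept are made up for illustration only.
n_categories = [3, 2, 5]
physical_weights = [-7.0, -3.0, -4.5, -6.0, -4.0, -2.0, -1.0]
score = composite_score([0, 1, 2], n_categories, physical_weights, 56.0)
print(score)
```

Because every item contributes indicators to each summary, a second summary score would reuse the same dummy coding with its own (again hypothetical) weight vector and intercept.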
Both approaches pose a number of problems, stemming from the fact that, as linear composites aimed at predicting the 36-item summaries, the SF-12 and RAND-12 implicitly assume the theoretical and psychometric models of the 36-item versions [4], [11], [12]. First, score reliabilities depend on a model that is not explicitly stated or estimated. Second, although regression weights optimize prediction of the 36-item summaries, they do not necessarily optimize instrument accuracy. Because criterion validity depends on the reliability of the instruments being correlated, this is a point of major importance [13]. In the case of the SF-12, further difficulties spring from the varying number of alternatives of the 12 items, which violates classical test theory assumptions for computing the alpha coefficient [13], [14], [15]. With regard to the RAND-12, the use of a Rasch model (Masters' partial credit model) prevents taking full advantage of item information because of the equal-slope restriction [8]. More importantly, applying regression weights to IRT item weights to predict the RAND-36 PHC and MHC alters the information properties of the IRT weights.
In this article, we aim to provide a model to compute SF-12 scores without having to resort to the prediction of SF-36 summaries. We developed two bidimensional scoring algorithms based on multidimensional IRT graded response models (MGRMs) [16], [17], [18], proposing two item structures for the SF-12: items loading on just one dimension and items loading on both dimensions simultaneously. These structures mirror the implicit models of the RAND-12 and SF-12, respectively. Scores derived from these structures are compared with those of the original algorithms in terms of reliability.
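To make the two item structures concrete, the sketch below computes response-category probabilities for a single item under a two-dimensional graded response model. It assumes a logistic, slope-intercept parameterization chosen purely for illustration (the cited literature also covers the normal ogive form), and all names and parameter values are hypothetical rather than estimates from this study.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(theta, slopes, intercepts):
    """Category probabilities for one item in a two-dimensional graded model.

    theta: (physical, mental) latent trait values of a respondent.
    slopes: the item's slope on each dimension; a zero slope means the
            item does not load on that dimension.
    intercepts: K - 1 strictly decreasing intercepts for K categories.
    """
    linear = sum(a * t for a, t in zip(slopes, theta))
    # Cumulative probabilities of responding in category k or higher.
    cumulative = [1.0] + [logistic(linear + d) for d in intercepts] + [0.0]
    # Adjacent differences give the probability of each single category.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# Hypothetical four-category item loading on both dimensions.
both = category_probabilities((0.0, 0.0), (1.2, 0.5), [1.0, 0.0, -1.0])
# Hypothetical item loading on the physical dimension only.
physical_only = category_probabilities((0.3, -2.0), (1.2, 0.0), [1.0, 0.0, -1.0])
print(both)
print(physical_only)
```

In this parameterization, the structure with items loading on a single dimension corresponds to fixing one slope per item at zero, whereas the structure with items loading on both dimensions leaves both slopes free.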
We hypothesized that, within this IRT modeling framework: (1) models with multidimensional response processes at the item level would better capture the properties of the data and yield better fit and more information than models with unidimensional items, (2) IRT scores would provide individual scores and ordering similar to those of the standard scoring algorithms, and (3) IRT-based scores would show higher reliability than the scores based on the other algorithms.
Sample
The data used for this study come from the European Study of the Epidemiology of Mental Disorders (ESEMeD) [19] project. Briefly, the ESEMeD used a stratified, multistage, clustered area probability sample of the noninstitutionalized adult population (aged 18 years or older) of Belgium, France, Germany, Italy, The Netherlands, and Spain. The interviews were conducted between January 2001 and August 2003 using computer-assisted interview techniques. The focus of the study was to estimate the prevalence
Results
Table 1 shows descriptive statistics of the sample, including total sample size and weighted proportions for gender, age groups, and other sociodemographic information. Initial inspection of the data revealed marked ceiling effects in item responses (see Table 2). As the low frequency of the extreme categories in the six-alternative items induced instability in threshold parameter estimates, we collapsed responses in the first and second alternatives (indicating the worst health
Discussion
Advances in IRT modeling from a factor analytic perspective have facilitated the application of multidimensional structures to the assessment of patient-reported outcomes [42], [43], [44], [45], [46], [47]. In this article, we used an IRT confirmatory approach, which has shown promising results in modeling complex data [48], [49], [50], to propose two multidimensional structures for the SF-12v1 questionnaire and the scoring algorithms that emerge from them.
Results suggest that the models herein
Acknowledgments
C.G.F. was supported by a “Juan de la Cierva” fellowship from the Ministerio de Ciencia e Innovación FSE (JCI-2009-05486). G.V. was supported by the Fondo de Investigación Sanitaria ISCIII (ECA07/059). The European Study of the Epidemiology of Mental Disorders (ESEMeD) project (http://www.epremed.org) was funded by the European Commission (Contracts QLG5-1999-01042; SANCO2004123; EAHC 2008-1308), the Piedmont Region (Italy), Fondo de Investigación Sanitaria, Instituto de Salud Carlos III, Spain
References (61)
- et al. Use of structural equation modeling to test the construct validity of the SF-36 Health Survey in ten countries: results from the IQOLA Project. International Quality of Life Assessment. J Clin Epidemiol (1998)
- et al. The factor structure of the SF-36 Health Survey in 10 countries: results from the IQOLA Project. International Quality of Life Assessment. J Clin Epidemiol (1998)
- et al. Power analysis in randomized clinical trials based on item response theory. Control Clin Trials (2003)
- et al. A 12-Item Short-Form Health Survey: construction of scales and preliminary tests of reliability and validity. Med Care (1996)
- et al. How to score version 2 of the SF-12 Health Survey (with a supplement documenting version 1) (2002)
- et al. The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection. Med Care (1992)
- et al. The MOS 36-Item Short-Form Health Survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care (1993)
- et al. SF-36 Health Survey manual and interpretation guide (1993)
- et al. SF-12: how to score the SF-12 physical and mental health summary scales (1995)
- RAND-36 Health Status Inventory (1998)
- A Rasch model for partial credit scoring. Psychometrika
- Performance of the RAND-12 and SF-12 summary scores in type 2 diabetes. Qual Life Res
- Performance of the SF-36, SF-12 and RAND-36 summary scales in a multiple sclerosis population. Med Care
- Statistical theories of mental test scores
- On the use, misuse and the very limited usefulness of Cronbach's alpha. Psychometrika
- Congeneric and (essentially) tau-equivalent estimates of score reliability. What they are and how to use them. Educ Psychol Meas
- Estimation of latent ability using a response pattern of graded scores. Psychometrika
- Normal ogive model on the continuous response level in the multidimensional latent space. Psychometrika
- Distinguishing among parametric item response models for polychotomous ordered data. Appl Psychol Meas
- The European Study of the Epidemiology of Mental Disorders (ESEMeD) project: an epidemiological basis for informing mental health policies in Europe. Acta Psychiatr Scand Suppl
- Further empirical results on parametric versus non-parametric IRT modeling of Likert-type personality data. Multivariate Behav Res
- Some algebraic properties of the reticular action metamodel for moment structures. Br J Math Stat Psychol
- Test theory. A unified approach
- Structural equation modeling of paired-comparison and ranking data. Psychol Methods
- Estimation of IRT graded response models: limited versus full information methods. Psychol Methods
- Corrections to test statistics and standard errors in covariance structure analysis
- Comparative fit indexes in structural models. Psychol Bull
- Alternative ways of assessing model fit
- The reliability coefficient for maximum likelihood factor analysis. Psychometrika
- Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Struct Equ Modeling
All authors declare that they have no conflicts of interest and no relevant financial disclosures to report.