Original Article
Multidimensional item response theory models yielded good fit and reliable scores for the Short Form-12 questionnaire

https://doi.org/10.1016/j.jclinepi.2013.02.007

Abstract

Objectives

To propose a multidimensional item response theory (MIRT) scoring system for the Short Form 12 (SF-12) with good psychometric properties in terms of fit and reliability.

Study Design and Settings

Two models, representing physical (PCS) and mental component summary (MCS) dimensions, were fitted to SF-12 data from the European Study of the Epidemiology of Mental Disorders, a representative sample of the European adult general population (n = 21,425; response rate = 61.2%). Goodness of fit, information, reliability, and agreement of individual scores were compared with the classical SF-12 and RAND-12 algorithms.

Results

The bidimensional response process (BRP) model, in which all items are indicators of both dimensions, yielded the best fit (root mean square error of approximation = 0.057, comparative fit index = 0.95, and Tucker–Lewis index = 0.94) and agreed closely with PCS and MCS scores from the SF-12 (intraclass correlation coefficients of 0.92 and 0.88, respectively) and RAND-12 (0.88 and 0.95). Regarding reliability, the BRP yielded 0.75 and 0.77 (PCS and MCS, respectively), greater than SF-12 (0.65 and 0.66) and RAND-12 (0.65 and 0.67). As indicated by scale linking, MIRT scores can be interpreted similarly to the classical scores.

Conclusion

The MIRT models showed a clear construct structure for the PCS and MCS dimensions, defined by functional and role limitation content. Results support the use of SF-12 MIRT-based scores as a valid and reliable option to assess health status.

Introduction

The Short Form 12 (SF-12) version 1 was developed as a shorter alternative to the SF-36 health survey, for studies in which a 36-item form was too long. Because of its brevity and good performance in clinical assessment, the SF-12 has become a widespread measure of health status and changes in health over time in large samples. It is summarized into two measures, namely the physical (PCS) and mental component summary (MCS) scores [1]. The summaries have successfully been used to detect the presence and severity of physical and mental disorders in clinically defined groups [1], [2].

The SF-12 summaries are regression estimates of the corresponding second-order scores in the SF-36 [3], [4], [5] and are computed as weighted linear composites of individual item responses, coded as dummy variables. Weights are composed of regression coefficients of responses multiplied by the component loading of the item's native SF-36 subscale on its summary [6]. Under the assumption that the items contain physical and mental information, all items participate in the estimation of both components. Another method, the RAND-12 algorithm [7], is arguably the most successful scoring method for the SF-12 based on item response theory (IRT). It is derived from the application of a Rasch-type IRT model [7], [8] to the SF-36 items to obtain eight latent traits. These traits can be summarized in two second-order health scores (physical [PHC] and mental health components [MHC]), originally derived from a two-factor oblique principal axis factor analysis. Like the SF-12, the RAND-12 scores are the product of two regressions of the 36-item PHC and MHC on two six-item subsets, weighted by component loadings. Unlike the SF-12, weights are computed from the IRT-weighted items treated as continuous, and each item has a single regression weight in its theoretical dimension. The RAND-12 has been shown to be more discriminating than the SF-12 in clinical groups [9], [10].
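The idea of a weighted linear composite of dummy-coded responses can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the item names, weights, and intercepts below are hypothetical and are not the published SF-12 coefficients.

```python
from typing import Dict, List, Tuple

# Hypothetical weights for illustration only; NOT the published SF-12 values.
# Each item maps to (physical weights, mental weights), one weight per
# response category (1-based). The reference category carries weight 0.0.
EXAMPLE_WEIGHTS: Dict[str, Tuple[List[float], List[float]]] = {
    "item_a": ([-3.0, -1.5, 0.0], [1.0, 0.5, 0.0]),
    "item_b": ([0.5, 0.0], [-4.0, 0.0]),
}
EXAMPLE_CONSTANTS: Tuple[float, float] = (55.0, 60.0)  # hypothetical intercepts


def composite_scores(
    responses: Dict[str, int],
    weights: Dict[str, Tuple[List[float], List[float]]] = EXAMPLE_WEIGHTS,
    constants: Tuple[float, float] = EXAMPLE_CONSTANTS,
) -> Tuple[float, float]:
    """Return (physical, mental) composites for one respondent."""
    physical, mental = constants
    for item, category in responses.items():
        physical_w, mental_w = weights[item]
        if not 1 <= category <= len(physical_w):
            raise ValueError(f"invalid category {category} for {item}")
        # Dummy coding: only the indicator of the chosen category equals 1,
        # so only that category's weight is added to each composite.
        physical += physical_w[category - 1]
        mental += mental_w[category - 1]
    return physical, mental
```

Note that every item contributes to both composites, which mirrors the assumption that all items carry physical and mental information.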

Both approaches pose a number of problems, stemming from the fact that, as linear composites aimed at predicting the 36-item version summaries, the SF-12 and RAND-12 implicitly assume the theoretical and psychometric models of the 36-item versions [4], [11], [12]. First, score reliabilities depend on a model that is not explicitly stated or estimated. Second, although regression weights optimize prediction of the 36-item summaries, they do not necessarily optimize instrument accuracy. Because criterion validity depends on the reliability of the instruments being correlated, this is a point of major importance [13]. In the case of the SF-12, further difficulties arise from the varying number of response alternatives across the 12 items, which violates classical test theory assumptions for computing the alpha coefficient [13], [14], [15]. With regard to the RAND-12, the use of a Rasch model (Masters' partial credit model) prevents taking full advantage of item information because of the equal-slope restriction [8]. More importantly, the application of regression weights to IRT item weights to predict the RAND-36 PHC and MHC alters the information properties of the IRT weights.

In this article, we aim to provide a model to compute SF-12 scores without having to resort to the prediction of SF-36 summaries. We developed two bidimensional scoring algorithms based on multidimensional IRT graded response models (MGRMs) [16], [17], [18], proposing two item structures for the SF-12: items loading on just one dimension and items loading on both dimensions simultaneously. These structures mirror the implicit models of the SF-12 and RAND-12. Scores derived from these structures are compared with those of the original algorithms in terms of reliability.
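To make the two item structures concrete, the following minimal sketch computes category probabilities for a single item under a bidimensional graded response model. It assumes a logistic link and a slope-intercept parameterization; the link and parameterization actually used in the study may differ, and all parameter values are hypothetical.

```python
import math
from typing import List, Sequence


def category_probabilities(
    theta: Sequence[float],
    slopes: Sequence[float],
    intercepts: Sequence[float],
) -> List[float]:
    """Category probabilities for one graded item.

    theta:      latent scores, e.g. (physical, mental)
    slopes:     one slope per dimension; a zero slope means the item
                does not load on that dimension
    intercepts: strictly decreasing, one per boundary between categories
    """
    if len(theta) != len(slopes):
        raise ValueError("theta and slopes must have the same length")
    if any(nxt >= prev for prev, nxt in zip(intercepts, intercepts[1:])):
        raise ValueError("intercepts must be strictly decreasing")
    projection = sum(a * t for a, t in zip(slopes, theta))
    # Cumulative probabilities P(Y >= k), padded with 1 and 0 at the ends.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-(projection + d))) for d in intercepts]
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulatives.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]
```

An item loading on just one dimension corresponds to `slopes=(a, 0.0)`, so its probabilities do not change with the second latent score; an item loading on both dimensions has two nonzero slopes.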

We hypothesized that, using such an IRT modeling framework: (1) models with multidimensional response processes at the item level would better capture the properties of the data and yield better fit and more information than models with unidimensional items, (2) IRT scores would provide individual scores and ordering similar to the standard scoring algorithms, and (3) IRT-based scores would show higher reliability than the scores based on the other algorithms.

Sample

Data used for this study come from the European Study of the Epidemiology of Mental Disorders (ESEMeD) [19] project. Briefly, the ESEMeD used a stratified, multistage, clustered area probability sample of the noninstitutionalized adult population (aged 18 years or older) of Belgium, France, Germany, Italy, The Netherlands, and Spain. The interviews were conducted between January 2001 and August 2003 using computer-assisted interview techniques. The focus of the study was to estimate the prevalence

Results

Table 1 includes descriptive statistics of the sample. In this table, total sample size and weighted proportions are shown for gender, age groups, and other sociodemographic information. An initial inspection of the data revealed marked ceiling effects in item responses (see Table 2). As the low frequency of the extreme categories in the six-alternative items induced instability in threshold parameter estimates, we collapsed responses in the first and second alternatives (indicating the worst health
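The collapsing step described above can be illustrated with a short Python sketch. It assumes responses are coded as integers from 1 to 6, with 1 and 2 being the first and second alternatives; the function name and coding scheme are hypothetical and are not taken from the study.

```python
from typing import List


def collapse_two_lowest(responses: List[int], n_categories: int = 6) -> List[int]:
    """Merge categories 1 and 2, renumbering to consecutive codes 1..n_categories-1."""
    collapsed = []
    for response in responses:
        if not 1 <= response <= n_categories:
            raise ValueError(f"response {response} outside 1..{n_categories}")
        # Codes 1 and 2 both map to 1; every higher code shifts down by one.
        collapsed.append(max(1, response - 1))
    return collapsed
```

After collapsing, a six-alternative item has five categories and therefore four thresholds to estimate instead of five.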

Discussion

Advances in IRT modeling from a factor analytic perspective have facilitated the application of multidimensional structures to the assessment of patient-reported outcomes [42], [43], [44], [45], [46], [47]. In this article, we used an IRT confirmatory approach, which has shown promising results in modeling complex data [48], [49], [50], to propose two multidimensional structures and the scoring algorithms emerging from them for the SF-12v1 questionnaire.

Results suggest that the models herein

Acknowledgments

C.G.F. was supported by a “Juan de la Cierva” fellowship from the Ministerio de Ciencia e Innovación FSE (JCI-2009-05486). G.V. was supported by the Fondo de Investigación Sanitaria, ISCIII (ECA07/059). The European Study of the Epidemiology of Mental Disorders (ESEMeD) project (http://www.epremed.org) was funded by the European Commission (Contracts QLG5-1999-01042; SANCO2004123; EAHC 2008-1308), the Piedmont Region (Italy), Fondo de Investigación Sanitaria, Instituto de Salud Carlos III, Spain

References (61)

  • G.N. Masters. A Rasch model for partial credit scoring. Psychometrika (1982)
  • J.A. Johnson et al. Performance of the RAND-12 and SF-12 summary scores in type 2 diabetes. Qual Life Res (2004)
  • M.W. Nordvedt et al. Performance of the SF-36, SF-12 and RAND-36 summary scales in a multiple sclerosis population. Med Care (2000)
  • F. Lord et al. Statistical theories of mental test scores (1968)
  • K. Sijtsma. On the use, misuse and the very limited usefulness of Cronbach's alpha. Psychometrika (2009)
  • J.M. Graham. Congeneric and (essentially) tau-equivalent estimates of score reliability. What they are and how to use them. Educ Psychol Meas (2006)
  • F. Samejima. Estimation of latent ability using a response pattern of graded scores. Psychometrika (1969)
  • F. Samejima. Normal ogive model on the continuous response level in the multidimensional latent space. Psychometrika (1974)
  • A. Maydeu-Olivares et al. Distinguishing among parametric item response models for polychotomous ordered data. Appl Psychol Meas (1994)
  • J. Alonso et al. The European Study of the Epidemiology of Mental Disorders (ESEMeD) project: an epidemiological basis for informing mental health policies in Europe. Acta Psychiatr Scand Suppl (2004)
  • A. Maydeu-Olivares. Further empirical results on parametric versus non-parametric IRT modeling of Likert-type personality data. Multivariate Behav Res (2005)
  • J.J. McArdle et al. Some algebraic properties of the reticular action metamodel for moment structures. Br J Math Stat Psychol (1984)
  • R.P. McDonald. Test theory. A unified approach (1999)
  • A. Maydeu-Olivares et al. Structural equation modeling of paired-comparison and ranking data. Psychol Methods (2005)
  • C.G. Forero et al. Estimation of IRT graded response models: limited versus full information methods. Psychol Methods (2009)
  • A. Satorra et al. Corrections to test statistics and standard errors in covariance structure analysis
  • P.M. Bentler. Comparative fit indexes in structural models. Psychol Bull (1990)
  • M.W. Browne et al. Alternative ways of assessing model fit
  • L. Tucker et al. The reliability coefficient for maximum likelihood factor analysis. Psychometrika (1973)
  • L.T. Hu et al. Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Struct Equ Modeling (1999)
All authors declare that they have no conflicts of interest or relevant financial disclosures to report.
