Methods for Testing Data Quality, Scaling Assumptions, and Reliability: The IQOLA Project Approach

https://doi.org/10.1016/S0895-4356(98)00085-7

Abstract

Following the translation development stage, the second research stage of the IQOLA Project tests the assumptions underlying item scoring and scale construction. This article provides detailed information on the research methods used by the IQOLA Project to evaluate data quality, scaling and scoring assumptions, and the reliability of the SF-36 scales. Tests include evaluation of item and scale-level descriptive statistics; examination of the equality of item-scale correlations, item internal consistency and item discriminant validity; and estimation of scale score reliability using internal consistency and test-retest methods. Results from these tests are used to determine if standard algorithms for the construction and scoring of the eight SF-36 scales can be used in each country and to provide information that can be used in translation improvement.

Introduction

The second research stage of the IQOLA Project tests the assumptions underlying item scoring and the construction of multi-item scales, expanding on the methods of multitrait scaling previously developed in the Health Insurance Experiment and the Medical Outcomes Study [1, 2]. The goal of this stage is to arrive at item and scale scoring algorithms that satisfy scaling assumptions and achieve scales that can be compared across countries. When scaling assumptions are not met, evidence is sought to determine if this is due to translation problems or to country-specific differences in the definition or structure of health. Results from studies using translated forms are compared with those from previous studies, to gain insight into the extent to which equivalence of metrics and meaning has been achieved across countries.

The SF-36 contains 36 questions (items), which measure eight health concepts (or constructs) and health transition [3]. SF-36 scales are scored using Likert’s method of summated ratings [4, 5]. This method has been widely used in scale construction because of its simplicity and success in yielding reliable scores. In this method, a score for each question (item) is derived from a standardized set of response choices; scores for some items need to be recoded so that all items are scored in the same direction. A multi-item scale score is then computed by summing the scores assigned to item responses and linearly transforming the sum to a 0–100 scale.
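To make the scoring method concrete, here is a minimal Python sketch of Likert-style summated scoring. The item names (mh1–mh5), the 1–5 response range, and the choice of reverse-scored item are hypothetical illustrations for a generic summated ratings scale, not the published SF-36 scoring algorithm.

```python
import numpy as np
import pandas as pd

def score_summated_scale(df, items, item_range, reverse_items=()):
    """Score a summated ratings scale and transform it to the 0-100 metric.

    df            : DataFrame of raw item responses
    items         : column names of the items in the scale
    item_range    : (low, high) of the precoded response values, e.g. (1, 5)
    reverse_items : items recoded so all items are scored in the same direction
    """
    low, high = item_range
    data = df[items].astype(float).copy()
    # Recode negatively worded items so a higher score always means better health.
    for item in reverse_items:
        data[item] = low + high - data[item]
    # Likert scoring: the raw scale score is the simple sum of the item scores.
    raw = data.sum(axis=1)
    raw_min, raw_max = low * len(items), high * len(items)
    # Linear transformation of the raw sum to the 0-100 scale.
    return 100 * (raw - raw_min) / (raw_max - raw_min)

# Hypothetical five-item scale with 1-5 response choices; "mh3" is reverse-scored.
example = pd.DataFrame({f"mh{i}": np.random.randint(1, 6, size=8) for i in range(1, 6)})
scores = score_summated_scale(example, [f"mh{i}" for i in range(1, 6)], (1, 5),
                              reverse_items=("mh3",))
print(scores)
```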

Summated rating scales have the advantage of simplicity, can achieve high levels of reliability, and use scoring algorithms that require neither item weights nor the extra step of obtaining scaling judgments. However, this simplicity is based on a number of assumptions, originally proposed by Likert, that must be tested [4]. First, items in each hypothesized grouping should contain approximately the same proportion of information about the construct being measured. Second, items should have roughly equal variances so they contribute equally to the total scale score. Third, items should be substantially linearly related to the total score computed from all other items in that scale. In addition to testing these traditional Likert scaling criteria, the relationship between an item and scales representing other concepts should be examined, following the logic of the multitrait-multimethod approach described by Campbell and Fiske [6], to answer the question: why use an item to score one scale rather than another?
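The second and third assumptions lend themselves to simple empirical checks. The sketch below, assuming a pandas DataFrame of item responses for one hypothesized scale, reports each item's variance and its correlation with the sum of the remaining items; the function name and the 0.40 item-total threshold (a benchmark commonly used in this literature) are illustrative assumptions, not the project's published criteria.

```python
import pandas as pd

def check_likert_assumptions(df, items, min_item_total_r=0.40):
    """Rough checks of Likert scaling assumptions for one hypothesized scale.

    Reports each item's variance (which should be roughly equal across items)
    and its correlation with the sum of the remaining items (the corrected
    item-total correlation, which should be substantial).
    """
    data = df[items].astype(float)
    report = pd.DataFrame(index=items,
                          columns=["variance", "corrected_item_total_r"],
                          dtype=float)
    for item in items:
        # Correlate the item with the total computed from all *other* items,
        # so the item's own variance does not inflate the correlation.
        others = data[[c for c in items if c != item]].sum(axis=1)
        report.loc[item, "variance"] = data[item].var()
        report.loc[item, "corrected_item_total_r"] = data[item].corr(others)
    report["passes_r_threshold"] = report["corrected_item_total_r"] >= min_item_total_r
    return report
```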

Tests of these assumptions determine the appropriateness of including an item in a particular scale and whether it is appropriate to simply sum item scores to estimate a scale score. Once the assumptions underlying the construction of summated rating scales have been confirmed, scale scores can be calculated with the confidence that the scores will have their desired properties.

The IQOLA approach to testing scaling assumptions starts with the most basic unit, the item, and first examines item-level characteristics (extent of missing data, frequency distribution, mean, standard deviation). It then evaluates the relationship (correlation) of each item to the other items in the scale defining the concept it is hypothesized to measure, and the item’s relationship to other scales. Scale distribution characteristics (mean, standard deviation, minimum and maximum values, range) and the reliability of scale scores are examined next. Finally, the correlations between scale scores are evaluated in relation to the reliability of the scales.

The software used by the IQOLA Project to test scaling assumptions was the FORTRAN version of the Multitrait Analysis Program-Revised, or MAP-R, and the SAS version of MAP-R, known as MAP-R for Windows [7]. MAP-R evolved from ANLITH (Analysis of Item and Test Homogeneity), a FORTRAN II mainframe software program written by Thomas Tyler and Thomas Gronek (see [8]), which was first used in the health care field during the Measuring Health Concepts Project at Southern Illinois University, School of Medicine, from 1972 to 1975 [9]. The ANLITH program was reprogrammed in FORTRAN IV at the RAND Corporation and was used extensively for about 10 years to test patient-based measures of health status and other concepts during the Health Insurance Experiment (HIE) [1]. It was during the HIE that the program began to be referred to as MAP, for “Multitrait Analysis Program.” Additional improvements in MAP were made at RAND to meet the needs of the Medical Outcomes Study in the middle 1980s [10]. The expanded FORTRAN version of MAP used in the IQOLA Project, MAP-R, was developed at the Health Assessment Lab at the Health Institute, New England Medical Center, in 1991. The SAS for Windows version of MAP-R was developed as a joint project of Glaxo Wellcome and the Health Assessment Lab in 1997 [7].


Item-level descriptive statistics

The first step in evaluating a summated ratings scale is to determine the extent of missing and out-of-range data (see Table 1 for an illustration using the SF-36 Physical Functioning, Role-Physical, and Mental Health items). A summated ratings score cannot be estimated with the same degree of confidence if there is a large amount of missing data. While the overall amount of missing data will be sample and survey dependent, a large amount of missing data for a particular item may indicate a problem with that item, for example a confusing or poorly understood translation.
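A sketch of how this tabulation might be carried out in Python, alongside the item-level descriptive statistics discussed above; the function name, the (low, high) valid-range argument, and the decision to compute means and standard deviations over in-range responses only are illustrative assumptions.

```python
import pandas as pd

def item_data_quality(df, items, valid_range):
    """Tabulate percent missing and percent out-of-range responses per item.

    valid_range is the (low, high) pair of legal precoded values; anything
    outside it is counted as out of range. What counts as a 'large' amount of
    missing data is sample and survey dependent and left to the analyst.
    """
    low, high = valid_range
    n = len(df)
    rows = []
    for item in items:
        col = pd.to_numeric(df[item], errors="coerce")  # non-numeric -> missing
        in_range = col[(col >= low) & (col <= high)]
        rows.append({
            "item": item,
            "pct_missing": 100 * col.isna().sum() / n,
            "pct_out_of_range": 100 * ((col < low) | (col > high)).sum() / n,
            "mean": in_range.mean(),   # descriptive statistics computed over
            "sd": in_range.std(),      # in-range responses only (an assumption)
        })
    return pd.DataFrame(rows).set_index("item")
```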

Multitrait/multi-item correlation matrix

The multitrait/multi-item correlation matrix (Table 2) is used to examine the relationship of each item to its hypothesized scale, as well as the item’s correlations with other scales. Each row in the matrix contains correlations between the score for one item and all hypothesized item groupings (which are operationally defined as scale scores). Each column contains correlations between the score for one scale and all items in the matrix, including those items hypothesized to be part of that scale. An item's correlation with its own scale, corrected for overlap by removing the item from the scale total, is the test of item internal consistency; comparing that correlation with the item's correlations with all other scales is the test of item discriminant validity.
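A sketch of how such a matrix can be computed, assuming a DataFrame of item responses and a mapping from scale names to item columns; the function name and data layout are assumptions, while the correction for overlap (dropping an item from its own scale total before correlating) follows the logic described above.

```python
import pandas as pd

def multitrait_matrix(df, scales):
    """Build a multitrait/multi-item correlation matrix.

    scales maps a scale name to its list of item columns. Each cell is the
    correlation of one item (row) with one scale score (column); when the item
    belongs to that scale, it is removed from the scale's sum first, so the
    item-scale correlation is corrected for overlap.
    """
    all_items = [item for items in scales.values() for item in items]
    matrix = pd.DataFrame(index=all_items, columns=list(scales), dtype=float)
    for scale, items in scales.items():
        for item in all_items:
            # Corrected for overlap: drop the item from its own scale total.
            members = [i for i in items if i != item]
            total = df[members].astype(float).sum(axis=1)
            matrix.loc[item, scale] = df[item].astype(float).corr(total)
    return matrix

# Hypothetical usage with two scales:
# multitrait_matrix(df, {"PF": ["pf1", "pf2", "pf3"], "MH": ["mh1", "mh2", "mh3"]})
```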

Scale-level descriptive statistics

After item-level analyses have established that the assumptions underlying the construction of summated rating scales have been met, SF-36 scales are scored and the properties of the scales are examined for comparability across countries, focusing on scale means and standard deviations and the proportion of respondents scoring at the highest (ceiling) and lowest (floor) level (Table 5). Scale means and standard deviations indicate where along a scale continuum the majority of individuals within a sample score. Large proportions of respondents at the ceiling or floor indicate that a scale may be unable to register further improvement or deterioration, respectively, in those respondents.
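Assuming the scales have already been scored on the 0–100 metric as sketched earlier, these scale-level statistics can be tabulated as follows; the function and column names are illustrative.

```python
import pandas as pd

def scale_descriptives(scores):
    """Scale-level descriptive statistics for 0-100 scored scales.

    scores is a DataFrame with one column per scale. Floor and ceiling are the
    percentages of respondents at the minimum (0) and maximum (100) scores.
    """
    return pd.DataFrame({
        "mean": scores.mean(),
        "sd": scores.std(),
        "min": scores.min(),
        "max": scores.max(),
        "pct_floor": 100 * (scores == 0).mean(),
        "pct_ceiling": 100 * (scores == 100).mean(),
    })
```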

Reliability

Reliability of measurement refers to the extent to which the measured variance in a score reflects the true score, rather than random error; that is, the extent to which measures give consistent or accurate results (see [16] for a discussion of this model of true score and error variance). A reliability coefficient is an estimate of the proportion of total variance that is true score variance and can be expressed as [17]: Reliability = 1 − (Ve / Vt), where Ve equals the error variance and Vt equals the total variance of the scale score.
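One widely used internal-consistency estimate of reliability is Cronbach's alpha, sketched below; the listwise deletion of incomplete respondents is a simplifying assumption for illustration, not a claim about the IQOLA Project's missing-data rules.

```python
import pandas as pd

def cronbach_alpha(df, items):
    """Cronbach's alpha, an internal-consistency estimate of reliability.

    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of the total),
    where k is the number of items. Incomplete respondents are dropped here
    for simplicity (an assumption made for this sketch).
    """
    data = df[items].dropna().astype(float)
    k = len(items)
    item_variances = data.var(axis=0, ddof=1)
    total_variance = data.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
```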

Correlations between scales

To evaluate how distinct each scale is from other scales in the same matrix, correlations among all scales are computed and compared with reliability estimates (Table 6). A reliability coefficient can be thought of as a correlation between a scale and itself. To the extent that the correlation between two scales is less than their reliability coefficients, there is evidence of unique reliable variance measured by each scale [22]. When the correlation between two scales equals their reliability, the two scales cannot be empirically distinguished and may be measuring the same underlying concept.
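This comparison can be automated along the following lines; the function name, the reliabilities mapping, and the output layout are illustrative assumptions.

```python
import pandas as pd

def unique_variance_check(scale_scores, reliabilities):
    """Compare each pair of scales' correlation with their reliabilities.

    scale_scores  : DataFrame with one column per scale score
    reliabilities : dict mapping scale name -> reliability coefficient

    A pair shows evidence of unique reliable variance when its observed
    correlation is below both scales' reliability coefficients.
    """
    corr = scale_scores.corr()
    rows = []
    for a in corr.columns:
        for b in corr.columns:
            if a < b:  # visit each unordered pair once
                r = corr.loc[a, b]
                rows.append({
                    "scale_a": a,
                    "scale_b": b,
                    "r": r,
                    "unique_variance": r < min(reliabilities[a], reliabilities[b]),
                })
    return pd.DataFrame(rows)
```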

Conclusion

In the second stage of the IQOLA research process, formal psychometric tests of the assumptions underlying item scoring and construction of multi-item scales are conducted. If scoring and scaling assumptions are met, then standard scoring algorithms can be used to score the SF-36 items and eight scales [3]. If scaling assumptions are not met, then the translation is examined, revised as necessary, and retested. Establishing that the translated scales meet standards for tests of scaling assumptions justifies the use of standard scoring algorithms in a country and lays the groundwork for comparing SF-36 scale scores across countries.

References

  • J.E. Ware. Scales for measuring general health perceptions. Health Serv Res (1976).