01-12-2011 | Issue 10/2011 | Open Access
Using structural equation modeling to detect response shift in performance and health-related quality of life scores of multiple sclerosis patients
Journal: Quality of Life Research > Issue 10/2011
Important notes
An erratum to this article can be found at http://dx.doi.org/10.1007/s1113601198720
Introduction
Measurement in health research relies heavily on self-report data. Self-report data collected in longitudinal studies are often difficult to interpret because of respondents’ changing standards, values, or conceptualization of the target construct. This phenomenon is referred to as ‘response shift’. We distinguish three types of response shift: (1) recalibration of respondents’ internal standards of measurement, (2) reprioritization of respondents’ values, and (3) reconceptualization of the target construct [1]. Each of these types of response shift can be operationalized within structural equation modeling (SEM) [2, 3].
Several operationalizations of response shift have been proposed. Generally, response shift can be defined either as bias in the measurement of the attribute of interest or as bias in the explanation of the attribute [4]. In this paper, we focus only on response shift in measurement. From this perspective, bias is not considered noise but rather systematic differences in patients’ scores that are not fully explained by true differences in the attribute of interest (e.g., health-related quality of life (HRQoL)) and that also reflect differences in other variables (e.g., patient expectations, adaptation). Response shift is considered a measurement bias that changes with time of measurement in longitudinal research (see Oort et al. [2, 4, 5]).
With SEM, we can detect such measurement bias with respect to time of measurement in longitudinal designs, group membership in multigroup designs, or any other exogenous variable. For example, the effects of health state on the course of HRQoL can be modeled by dividing the sample into healthy and non-healthy subgroups, or by including an indicator of health state as an exogenous variable. If exogenous variables are included in a longitudinal model, they can be static (e.g., diagnosis) or they can vary across measurement occasions (e.g., depression scale scores). Additionally, different longitudinal structures (e.g., growth, autoregression) can be investigated with latent variables.
As explained in a companion paper by Schwartz et al. [6], this paper is one in a series investigating response shift in multiple sclerosis (MS) patients using different methods. Here, we demonstrate how SEM can be used to detect response shift. We aim to illustrate the flexibility of SEM by investigating response shift in two studies. In Study 1, we investigate the performance disabilities in MS patients by taking the first three measurement occasions with varying time lags across patients. We investigate health status by distinguishing between three predefined and known groups of MS patients (i.e., stable, progressive, and relapsing) and use these groups in a multigroup analysis. In Study 2, we investigate HRQoL in MS patients by selecting measurement occasions with fixed time lags. Health status is taken into account by introducing time-varying health status indicators as exogenous variables. In both studies, we investigate change and response shift with respect to health status.
Method
Data
Analyses in this paper utilize data from the North American Research Committee on Multiple Sclerosis (NARCOMS) project registry. The NARCOMS registry was established in 1993 to collect biannual data on MS patients’ status. The main aim of the registry is to make these data available for the wider community, in particular researchers, to increase knowledge about MS.
Study 1
Response shift in performance disability is investigated using the intake questionnaire and the two subsequent follow-up questionnaires. As the timing of the measurement occasions varies across patients, they are considered random. On average, the first two measurement occasions are 1.04 (SD = 0.79) years apart, and the second and third measurement occasions are 0.88 (SD = 0.73) years apart.
Study 2
In Study 2, three other measurement occasions are used. Since the intake survey does not include the HRQoL questionnaire, we took the first three measurement occasions that included the HRQoL questionnaire and that were evenly spaced in time (about 6 months apart). These occasions are considered as fixed. On average, the first measurement occasion in this study is 3.07 (SD = 1.97) years from intake.
Variables
In the NARCOMS registry, a number of demographic, clinical, and psychosocial measures are collected. In Study 1, we investigate change and response shift in ‘performance disability’ as measured by the Performance Scales [7]. In Study 2, we investigate change and response shift in ‘HRQoL’ as measured by the SF-12 [8]. In both studies, we include demographic variables (age and sex) and clinical variables (time since diagnosis and health state). These variables are used to investigate additional measurement bias in the observed variables of the Performance Scales and the SF-12.
Performance Disability
The Performance Scales [7] originally included eight items. As the visual disability item was not consistently included as part of the NARCOMS survey, we use seven items of disability. Items were scored on a 6-point (or 7-point in the case of mobility) Likert scale (0 = Normal; 5 = Totally disabled) and represent performance disability with respect to mobility, hand function, fatigue, cognition, bladder/bowel, sensory, and spasticity. Higher scores indicate greater disability. When fewer than three item responses were missing, values were imputed using the expectation-maximization (EM) algorithm [9].
HRQoL
The SF-12 [8] was used to measure two components of HRQoL: Mental (MENT) and Physical (PHYS). Eight scales are created from the 12 items: physical functioning (PF), role limitations because of physical health (RP), bodily pain (BP), general health (GH), vitality (VT), social functioning (SF), role limitations because of emotional problems (RE), and mental health (MH). Higher scores indicate better HRQoL. The scales, not the items, are the focus of our analysis. When fewer than five subscale values were missing, values were imputed using the EM algorithm [9].
Health State
In Study 1, three groups of patients with different health states were created based on their answers to relapse and symptom change questions at baseline and follow-up measurements. The three groups are defined as follows: ‘stable’, patients with no relapses and symptoms that remained unchanged or improved; ‘progressive’, patients who relapsed and whose symptoms continued to get worse; ‘relapsing’, those who experienced a relapse but whose symptoms remained unchanged or improved. In Study 2, two items were used to measure health state: ‘relapse in the past 6 months’ (1 = yes, 0 = no/unsure) and ‘symptoms compared to 6 months ago’ (1 = much worse; 7 = much better). Both items were administered at each measurement occasion and can thus be included as time-varying covariates.
Other Variables
‘Age’, ‘sex’, ‘newly diagnosed’, and ‘time since diagnosis’ are also included in the analyses. At the first measurement occasion of Study 1, we distinguish between patients who are newly diagnosed (diagnosis in the same year as joining the NARCOMS registry) and patients whose diagnosis year differed from the year of joining NARCOMS or whose diagnosis year is unknown. In Study 2, ‘time since diagnosis’ is treated as a continuous variable that is calculated as the difference between the first measurement occasion and the year of diagnosis. Patients with an unknown year of diagnosis are excluded.
Study 1 analysis
In both studies, the analysis has three steps: establishing a measurement model (Step 1), testing invariance of model parameters across measurement occasions (Step 2), and testing invariance with respect to exogenous variables (Step 3). Each step is outlined below, with similarities and differences between the two studies highlighted. All analyses were carried out with LISREL 8.54. See Appendix 1 for a more detailed description of the methods; syntax files are available upon request.
Step 1: Establishing a measurement model
The Performance Scales were originally reported to measure a unidimensional construct [7]. If the corresponding confirmatory factor analysis (CFA) model does not fit, then exploratory factor analysis is used to determine an alternative model, before continuing with CFA. Maximum likelihood estimation is used for parameter estimation. We assess the overall fit of our models with the chi-square test of exact fit, the root mean square error of approximation (RMSEA) [12], the expected cross-validation index (ECVI) [12], the comparative fit index (CFI) [10], and the Tucker-Lewis index (TLI) [11]. A nonsignificant chi-square value indicates good fit. However, as it is sensitive to small deviations between model and data, especially when sample size is large, we also consider approximate fit indices. RMSEA < 0.08 indicates satisfactory fit; RMSEA < 0.05 indicates close fit. ECVI cannot be used as a standalone index but can be used to compare alternative models; a smaller ECVI value indicates better model fit [12]. Finally, both the CFI and the TLI assess the improvement in fit from a null model that assumes no relationships between variables. Values > .90 for the TLI and values > .95 for the CFI indicate reasonable fit of the model to data. If a new model is specified, the change in model fit is assessed with the chi-square difference test and the ECVI difference test [12].
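As a rough illustration of how the approximate fit indices relate to the chi-square statistic, the sketch below computes RMSEA and ECVI point estimates with the common single-group textbook formulas. These formulas, the function names, and the count of 324 sample moments (24 observed scales with means modeled) are assumptions made here for illustration; the software used in the paper may compute the indices slightly differently. The inputs are the Study 2 chi-square values reported in Table 3.

```python
import math


def rmsea(chisq: float, df: int, n: int) -> float:
    """Single-group RMSEA point estimate: sqrt(max(chisq - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))


def ecvi(chisq: float, n_free_params: int, n: int) -> float:
    """ECVI point estimate: (chisq + 2 * q) / (n - 1), with q free parameters."""
    return (chisq + 2 * n_free_params) / (n - 1)


N = 1767  # Study 2 sample size

# Assumption: 8 scales x 3 occasions = 24 observed variables with means modeled,
# giving 24 * 25 / 2 + 24 = 324 sample moments, so q = moments - df.
MOMENTS = 24 * 25 // 2 + 24

for label, chisq, df in [("2.1.2", 1147.1, 192), ("2.2.1", 1303.8, 218)]:
    q = MOMENTS - df
    print(label, round(rmsea(chisq, df, N), 3), round(ecvi(chisq, q, N), 2))
```

Under these assumptions the rounded values agree with the RMSEA and ECVI entries for Models 2.1.2 and 2.2.1 in Table 3.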
Step 2: Testing invariance across measurement occasions
In this step, we take the final model of Step 1 and simultaneously constrain all factor loadings and intercepts to be equal across measurement occasions and groups. Across-occasion invariance (no measurement bias) of factor loadings and intercepts is assessed by comparing this model with the final model of Step 1 using the chi-square difference test. A significant result provides evidence for response shift. However, if the test result is not significant, we still investigate response shift, as a single, yet substantially important response shift may not cause significant deterioration in the overall model fit.
To detect measurement bias, we examine modification indices and the standardized expected parameter changes (SEPC) [13]. If both are large, we expect significant improvement in the overall model and substantial change in the parameter estimate(s). As there are a large number of modification indices to consider, we stop investigating modification indices when none are greater than a Bonferroni-adjusted critical value [14] of 12.83. For SEPC, we consider values > 0.10 significant [15]. The effect size [16] of possible response shifts will be evaluated in comparison with Cohen’s d effect sizes of observed and true change [2].
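The two-part decision rule described above can be sketched as follows. This is a minimal illustration, not the authors' LISREL procedure: the function names are hypothetical, and the only numbers taken from the text are the cut-offs 12.83 and 0.10. A modification index for a single constraint is treated as a chi-square statistic with 1 degree of freedom, whose upper-tail probability has the closed form erfc(sqrt(x / 2)).

```python
import math

MI_CRITICAL = 12.83  # Bonferroni-adjusted critical value quoted in the text
SEPC_CUTOFF = 0.10   # SEPC cut-off quoted in the text


def chi2_p_value_df1(statistic: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


def flag_constraint(mod_index: float, sepc: float) -> bool:
    """Flag an equality constraint only if both the modification index and
    the absolute SEPC exceed their cut-offs (hypothetical helper)."""
    return mod_index > MI_CRITICAL and abs(sepc) > SEPC_CUTOFF


# Per-test significance level implied by the 12.83 critical value.
print(chi2_p_value_df1(MI_CRITICAL))

# The SEPC values below are made up for illustration.
print(flag_constraint(mod_index=56.8, sepc=0.15))
print(flag_constraint(mod_index=56.8, sepc=0.05))
```

Requiring both criteria guards against freeing constraints that are statistically detectable in a large sample but too small to matter substantively.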
Step 3: Testing invariance with respect to exogenous variables
The first model in this step includes ‘age’, ‘sex’, ‘newly diagnosed’, ‘time between measurement occasions 1 and 2’, and ‘time between measurement occasions 2 and 3’ as additional exogenous variables. These five exogenous variables correlate with each other and with the common factors, but their relationship with the observed items should be fully explained through these correlations. If large modification indices and SEPCs are present, this indicates the presence of bias. In case of direct effects changing over time, we consider this measurement bias as response shift.
Study 2 analysis
In Study 2, we took the same steps as in Study 1. All analyses were carried out with Mx [17]. See Appendix 2 for a detailed description of the methods; syntax files are available upon request.
Step 1: Establishing a measurement model
The first goal is to find a satisfactory measurement model for the SF-12. We begin with the measurement model comprising two common factors: PHYS and MENT HRQoL. If this model does not fit, we use modification indices and standardized residuals [13, 18] to identify misspecification and to develop an alternative model. As in Study 1, possible model modifications are assessed using the chi-square difference test and ECVI difference test. Overall model fit is assessed using the same statistics as used in Study 1.
Step 2: Testing invariance across measurement occasions
All factor loadings and intercepts of the final model of Step 1 are simultaneously constrained to be equal across measurement occasions, as in Study 1. To detect response shift, we use a different search strategy from Study 1, where we tested individual constraints. Here, we follow the procedure outlined in King-Kallimanis et al. [19], relying on a smaller number of global tests that free multiple constraints simultaneously. In this study, we use eight global tests, one for each observed scale. The fit of each of these eight new models is compared to the fully constrained model using the chi-square difference test (at adjusted significance level) [14] and scaled observed parameter changes (OPC). After running the eight tests, the observed scale producing the largest OPC in combination with a significant chi-square difference test is interpreted as response shift. We continue iteratively, retesting the remaining scales, until no large OPC with a significant chi-square difference test is found. Corresponding to Cohen’s small effect sizes, we consider an OPC indicating a standardized difference of 0.1 between factor loadings or 0.2 between intercepts to be large [16].
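The iterative search just described can be sketched as a simple loop. This is an illustrative outline only: `run_global_tests` is a hypothetical callback standing in for refitting the SEM models, a single OPC cut-off is used instead of separate thresholds for loadings and intercepts, the Bonferroni-style level of 0.05 / 8 is an assumption, and the numbers in the stub are invented rather than taken from the paper.

```python
from typing import Callable, Dict, List, NamedTuple


class GlobalTest(NamedTuple):
    p_value: float  # p-value of the chi-square difference test for one scale
    opc: float      # largest scaled observed parameter change for that scale


def search_response_shift(
    scales: List[str],
    run_global_tests: Callable[[List[str], List[str]], Dict[str, GlobalTest]],
    alpha: float,
    opc_cutoff: float,
) -> List[str]:
    """Free, one scale at a time, the scale with the largest OPC among those
    with a significant test; stop when no scale meets both criteria."""
    freed: List[str] = []
    remaining = list(scales)
    while remaining:
        results = run_global_tests(freed, remaining)
        candidates = [
            s for s in remaining
            if results[s].p_value < alpha and abs(results[s].opc) >= opc_cutoff
        ]
        if not candidates:
            break
        best = max(candidates, key=lambda s: abs(results[s].opc))
        freed.append(best)
        remaining.remove(best)
    return freed


def fake_tests(freed: List[str], remaining: List[str]) -> Dict[str, GlobalTest]:
    # Stub with made-up numbers for illustration only, not results from the paper.
    return {
        s: GlobalTest(p_value=0.0001 if s == "RE" else 0.4,
                      opc=0.25 if s == "RE" else 0.03)
        for s in remaining
    }


SF12_SCALES = ["PF", "RP", "BP", "GH", "VT", "SF", "RE", "MH"]
print(search_response_shift(SF12_SCALES, fake_tests, alpha=0.05 / 8, opc_cutoff=0.2))
```

With the stub above, only RE satisfies both criteria in the first pass, and the second pass finds no further candidates, so the loop stops.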
Step 3: Testing invariance with respect to exogenous variables
We extend the final model of Step 2 to include ‘age’, ‘sex’, ‘time since diagnosis’, ‘relapse in the past 6 months’, and ‘symptom change in the last 6 months’ as exogenous variables. To test for response shift, we fit new models where we include direct effects of the exogenous variables on the observed scales. The impact of these direct effects is assessed with OPCs and the chi-square difference test. If the largest effects are significant, we leave these parameters free to be estimated and repeat the process until no significant improvements are found. Once any biases have been accounted for, this final model can be used to assess true change in the attribute of interest using the same formula used in Study 1 [2].
Results
Study 1 results
In the analysis of the Performance Scales items, we distinguished between 771 stable patients (26.1%), 629 progressive patients (21.3%), and 1,552 relapsing patients (52.6%). See Table 1 for sample characteristics and Table 2 for the Performance Scales item means.
Table 1
Descriptive statistics for demographic variables of Multiple Sclerosis patients for Study 1 and Study 2

| Variable | Study 1 (n = 2,952) | Study 2 (n = 1,767) |
| --- | --- | --- |
| Sex: Male | 423 (14.33%) | 303 (17.15%) |
| Sex: Female | 2,031 (68.80%) | 1,464 (82.86%) |
| Age, mean (SD) | 40.82 (9.35) | 45.56 (9.31) |
| Time since diagnosis in years, mean (SD) | NA | 3.69 (2.12) |
| Newly diagnosed: Yes | 1,054 (35.70%) | NA |
| Newly diagnosed: No/unknown | 1,898 (64.30%) | |
| Relapse: Yes (T1) | NA | 608 (34.41%) |
| Relapse: Yes (T2) | NA | 565 (31.98%) |
| Relapse: Yes (T3) | NA | 555 (31.41%) |
| Symptom change: T1 | NA | 3.64 (1.15) |
| Symptom change: T2 | NA | 3.61 (1.08) |
| Symptom change: T3 | NA | 3.64 (1.04) |
| Group membership: Stable | 771 (26.12%) | NA |
| Group membership: Actively relapsing | 1,552 (52.57%) | NA |
| Group membership: Progressing without relapsing | 629 (21.31%) | NA |
Table 2
Means and standard deviations for Performance Scale items (Study 1) and SF-12 scales (Study 2)

Study 1, Performance Scales: higher scores indicate more disability

| Measurement occasion and group | Mobility | Hand Function | Fatigue | Cognitive | Bladder/Bowel | Sensory | Spasticity |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Time 1, Relapsing | 1.52 (1.45) | 1.22 (1.08) | 2.58 (1.31) | 1.62 (1.19) | 1.18 (1.08) | 1.89 (1.19) | 1.51 (1.24) |
| Time 1, Progressive | 1.67 (1.47) | 1.02 (0.98) | 2.39 (1.26) | 1.40 (1.17) | 1.20 (1.04) | 1.65 (1.17) | 1.34 (1.22) |
| Time 1, Stable | 0.82 (1.17) | 0.63 (0.86) | 1.68 (1.22) | 0.99 (1.01) | 0.72 (0.90) | 1.31 (1.03) | 0.82 (1.05) |
| Time 2, Relapsing | 1.65 (1.50) | 1.25 (1.09) | 2.68 (1.31) | 1.74 (1.23) | 1.29 (1.15) | 1.83 (1.22) | 1.60 (1.29) |
| Time 2, Progressive | 1.79 (1.56) | 1.10 (1.05) | 2.45 (1.31) | 1.45 (1.12) | 1.24 (1.05) | 1.59 (1.13) | 1.48 (1.27) |
| Time 2, Stable | 0.78 (1.21) | 0.66 (0.89) | 1.70 (1.25) | 1.02 (0.97) | 0.75 (0.91) | 1.13 (0.98) | 0.77 (0.96) |
| Time 3, Relapsing | 1.75 (1.55) | 1.31 (1.13) | 2.70 (1.32) | 1.77 (1.21) | 1.37 (1.67) | 1.84 (1.24) | 1.65 (1.28) |
| Time 3, Progressive | 1.91 (1.67) | 1.16 (1.07) | 2.52 (1.28) | 1.52 (1.16) | 1.35 (1.10) | 1.65 (1.16) | 1.50 (1.29) |
| Time 3, Stable | 0.81 (1.24) | 0.67 (0.88) | 1.63 (1.25) | 1.02 (1.25) | 0.76 (0.91) | 1.06 (0.96) | 0.82 (0.99) |

Study 2, SF-12: higher scores indicate better health

| Measurement occasion | PF | RP | BP | GH | VT | SF | RE | MH |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Time 1 | 6.80 (2.34) | 6.26 (2.47) | 5.40 (2.52) | 4.30 (1.93) | 3.13 (2.04) | 7.17 (2.38) | 7.42 (2.28) | 5.87 (1.73) |
| Time 2 | 6.86 (2.17) | 6.34 (2.32) | 5.41 (2.54) | 4.37 (1.95) | 3.16 (1.96) | 7.03 (2.43) | 6.98 (2.38) | 5.81 (1.65) |
| Time 3 | 6.70 (2.44) | 6.23 (2.50) | 5.52 (2.68) | 4.32 (1.93) | 3.05 (2.07) | 7.13 (2.43) | 7.50 (2.31) | 5.92 (1.72) |

Step 1: Establishing a measurement model
The reported unidimensional structure of the Performance Scales yielded satisfactory fit (Table 3, Model 1.1.1). However, model fit could be improved. Exploratory factor analysis suggested a two-dimensional model, and in CFA this model yielded substantially better fit for the Performance Scales than a one-dimensional model (Table 3, Model 1.1.2). The two dimensions were interpreted as (1) Visible Disability (mobility, spasticity, bladder), describing the most visible and stigmatizing symptoms that may make one home bound; and (2) Internal Disability (hand function, fatigue, sensory, cognition), relating to an internal, more subjective experience.
Table 3
Overall goodness-of-fit and chi-square difference test results for Study 1 and Study 2

Study 1: Performance Scales

| Model | Description | CHISQ (df) | RMSEA (90% CI) | ECVI (90% CI) | Comparison models | CHISQ DIFF (df) | P | ECVI DIFF (90% CI) | CFI | TLI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.1.1 | Unidimensional measurement model | 1,675.4 (495) | 0.049 (0.047; 0.052) | 0.75 (0.71; 0.79) | | | | | 0.99 | 0.99 |
| 1.1.2 | 2 factor measurement model | 899.2 (414) | 0.035 (0.031; 0.028) | 0.54 (0.51; 0.57) | | | | | 1.00 | 0.99 |
| 1.2.1 | Across occasion constraints on factor loadings and intercepts | 1,281.6 (534) | 0.038 (0.035; 0.040) | 0.59 (0.55; 0.63) | 1.2.1 vs. 1.1.2 | 382.4 (120) | <0.0001 | 0.04 (0.03; 0.07) | 0.99 | 0.99 |
| 1.2.2 | Sensory intercept, time 1 free, stable group | 1,224.8 (533) | 0.036 (0.034; 0.039) | 0.57 (0.51; 0.58) | 1.2.2 vs. 1.2.1 | 56.8 (1) | <0.0001 | 0.02 (0.01; 0.03) | 0.99 | 0.99 |
| 1.2.3 | Sensory intercept, time 1 free, relapse group | 1,150.6 (532) | 0.034 (0.032; 0.037) | 0.54 (0.49; 0.56) | 1.2.3 vs. 1.2.2 | 74.2 (1) | <0.0001 | 0.03 (0.02; 0.04) | 1.00 | 0.99 |
| 1.3.1 | As Model 1.2.2 but with addition of exogenous variables | 1,461.8 (757) | 0.031 (0.028; 0.033) | 0.75 (0.69; 0.76) | | | | | 0.99 | 0.99 |
| 1.3.2 | Time lag restricted model | 1,579.1 (808) | 0.031 (0.029; 0.033) | 0.76 (0.72; 0.80) | 1.3.2 vs. 1.3.1 | 114.1 (50) | <0.0001 | <0.01 (−0.01; 0.01) | 0.99 | 0.99 |

Study 2: SF-12

| Model | Description | CHISQ (df) | RMSEA (90% CI) | ECVI (90% CI) | Comparison models | CHISQ DIFF (df) | P | ECVI DIFF (90% CI) | CFI | TLI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2.1.1 | Measurement model from manual | 2,305.3 (213) | 0.076 (0.073; 0.079) | 1.47 (1.37; 1.55) | | | | | 0.94 | 0.93 |
| 2.1.2 | Modified measurement model | 1,147.1 (192) | 0.053 (0.050; 0.059) | 0.80 (0.73; 0.85) | 2.1.2 vs. 2.1.1 | 1,158.2 (21) | <0.0001 | 0.63 (0.57; 0.70) | 0.97 | 0.96 |
| 2.2.1 | Across occasion constraints on factor loadings and intercepts | 1,303.8 (218) | 0.053 (0.050; 0.056) | 0.86 (0.80; 0.93) | 2.2.1 vs. 2.1.2 | 156.7 (26) | <0.0001 | 0.06 (0.04; 0.08) | 0.97 | 0.96 |
| 2.2.2 | Factor loadings and intercepts of RE free | 1,224.8 (214) | 0.052 (0.049; 0.055) | 0.82 (0.76; 0.88) | 2.2.2 vs. 2.2.1 | 79.0 (4) | <0.0001 | −0.67 (−0.70; −0.62) | 0.97 | 0.96 |
| 2.3.1 | As Model 2.2.2 but with addition of exogenous variables | 1,653.7 (376) | 0.044 (0.042; 0.046) | 1.188 (1.118; 1.262) | | | | | 0.97 | 0.95 |
| 2.3.2 | Free direct effects of age on MH | 1,595.8 (373) | 0.043 (0.041; 0.045) | 1.159 (0.090; 1.231) | 2.3.2 vs. 2.3.1 | 58.0 (3) | <0.0001 | 0.03 (0.02; 0.05) | 0.97 | 0.96 |
| 2.3.3 | Free direct effects of age on VT | 1,555.0 (370) | 0.043 (0.040; 0.045) | 1.139 (1.072; 1.211) | 2.3.3 vs. 2.3.2 | 40.8 (3) | <0.0001 | 0.02 (0.01; 0.03) | 0.97 | 0.96 |

Step 2: Testing measurement invariance across measurement occasions
The equality constraints imposed in this step led to a significant deterioration in fit, suggesting the presence of response shift (Table 3, Model 1.2.1). The modification indices and SEPCs suggested first removing the equality constraint on the intercept of ‘sensory’ for the stable group at the first measurement occasion ($\chi^2_{\text{diff}}(1) = 56.8$, $P < 0.001$) and successively the intercept of ‘sensory’ for the relapse group at the first measurement occasion ($\chi^2_{\text{diff}}(1) = 74.2$, $P < 0.001$). No further model modifications were necessary. The ECVI difference tests were in agreement with these modifications (Table 3).
As can be seen in Fig. 1a, at the second and third measurement occasions, the intercepts of ‘sensory’ appear to decrease for the stable (0.37–0.18) and relapsing groups (0.37–0.18) relative to their Visible Disability (Fig. 1b). This response shift can be interpreted as recalibration and suggests that for these groups, when overall disability increases over time, specific sensory disability did not increase as much.
Step 3: Testing measurement invariance with respect to exogenous variables
In the final step, we used Model 1.2.2 and included additional exogenous variables (Table 3, Model 1.3.1). In all groups, at all occasions, we found positive correlations between age and both disability dimensions (see Table 4). The other exogenous variables, ‘sex’, ‘newly diagnosed’, ‘time between measurement occasions 1 and 2’, and ‘time between measurement occasions 2 and 3’, had correlations less than 0.1 with disability. When inspecting the SEPCs, we found that none were above our cut point of 0.10. Therefore, we concluded that there was no bias in the Performance Scales items with respect to these variables. See Fig. 1a and Table 4 for final model estimates.
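Table 4
Final model covariance and residual variance estimates

Variance/covariances

| | Visible–T1 | Internal–T1 | Visible–T2 | Internal–T2 | Visible–T3 | Internal–T3 |
| --- | --- | --- | --- | --- | --- | --- |
| Visible–T1, Stable | 1.10 | | | | | |
| Visible–T1, Progressive | 1.17 | | | | | |
| Visible–T1, Relapsing | 1.52 | | | | | |
| Internal–T1, Stable | 0.38 | 0.52 | | | | |
| Internal–T1, Progressive | 0.35 | 0.60 | | | | |
| Internal–T1, Relapsing | 0.56 | 0.73 | | | | |
| Visible–T2, Stable | 0.99 | 0.36 | 1.26 | | | |
| Visible–T2, Progressive | 1.58 | 0.35 | 1.98 | | | |
| Visible–T2, Relapsing | 1.43 | 0.56 | 1.82 | | | |
| Internal–T2, Stable | 0.35 | 0.40 | 0.42 | 0.47 | | |
| Internal–T2, Progressive | 0.28 | 0.52 | 0.37 | 0.67 | | |
| Internal–T2, Relapsing | 0.50 | 0.63 | 0.65 | 0.77 | | |
| Visible–T3, Stable | 0.97 | 0.34 | 1.20 | 0.39 | 1.22 | |
| Visible–T3, Progressive | 1.65 | 0.26 | 2.05 | 0.25 | 2.29 | |
| Visible–T3, Relapsing | 1.41 | 0.52 | 1.77 | 0.58 | 1.93 | |
| Internal–T3, Stable | 0.32 | 0.39 | 0.38 | 0.42 | 0.40 | 0.48 |
| Internal–T3, Progressive | 0.25 | 0.52 | 0.30 | 0.61 | 0.35 | 0.69 |
| Internal–T3, Relapsing | 0.51 | 0.61 | 0.64 | 0.69 | 0.65 | 0.79 |
| Sex, Stable | −0.01 | 0.01 | −0.02 | 0.01 | −0.03 | 0.01 |
| Sex, Progressive | −0.09 | <0.01 | −0.08 | −0.01 | −0.09 | −0.01 |
| Sex, Relapsing | −0.08 | −0.02 | −0.07 | −0.01 | −0.08 | −0.01 |
| Age, Stable | 0.32 | 0.07 | 0.34 | 0.09 | 0.35 | 0.08 |
| Age, Progressive | 0.43 | 0.04 | 0.50 | 0.02 | 0.51 | 0.01 |
| Age, Relapsing | 0.37 | 0.12 | 0.43 | 0.14 | 0.43 | 0.13 |
| Newly diagnosed | −0.26 | −0.12 | −0.26 | −0.12 | −0.26 | −0.12 |
| Time between T1 and T2 | 0.10 | 0.01 | 0.10 | 0.01 | | |
| Time between T2 and T3 | 0.10 | 0.01 | | | | |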
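Residual variances

| Group and occasion | Mobility | Spasticity | Bladder/bowel | Hand function | Sensory | Fatigue | Cognitive |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Stable, T1 | 0.27 | 0.59 | 0.53 | 0.40 | 0.61 | 0.74 | 0.48 |
| Stable, T2 | 0.23 | 0.45 | 0.51 | 0.42 | 0.55 | 0.80 | 0.48 |
| Stable, T3 | 0.30 | 0.43 | 0.51 | 0.38 | 0.51 | 0.88 | 0.51 |
| Progressive, T1 | 0.41 | 0.82 | 0.77 | 0.60 | 0.82 | 0.79 | 0.75 |
| Progressive, T2 | 0.44 | 0.87 | 0.70 | 0.61 | 0.72 | 0.80 | 0.64 |
| Progressive, T3 | 0.40 | 0.81 | 0.80 | 0.60 | 0.76 | 0.80 | 0.67 |
| Relapsing, T1 | 0.57 | 0.83 | 0.76 | 0.58 | 0.76 | 0.77 | 0.72 |
| Relapsing, T2 | 0.40 | 0.83 | 0.79 | 0.58 | 0.79 | 0.74 | 0.73 |
| Relapsing, T3 | 0.43 | 0.76 | 0.86 | 0.61 | 0.78 | 0.73 | 0.67 |
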

For each group, the estimates of the common factor means of this final model (Model 1.3.2) are plotted against time in Fig. 1b and c. We see deterioration of progressive and relapsing patients (increasing visible disability scores) but no change in stable patients. There is little change within groups for internal disability; stable patients have the lowest internal disability means, and relapsing patients have the highest.
Study 1 conclusion
Only “sensory disability” showed response shift. Considering the impact of response shift and true change on observed change in ‘sensory disability’, we see that an observed change of −0.19 for stable patients is almost fully attributable to response shift (−0.22), leaving only 0.03 of true change. In the relapsing group, once we accounted for a response shift of −0.18, we see that an observed change of −0.06 underestimates a true change of 0.12. Still, we note that these effect sizes should be considered “small”.
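The arithmetic behind this decomposition can be checked directly. The sketch below assumes the simple additive relation implied by the reported figures (observed change = response shift + true change); the function name is hypothetical and the inputs are the values quoted in the paragraph above.

```python
def true_change(observed_change: float, response_shift: float) -> float:
    """True change under the additive decomposition
    observed change = response shift + true change."""
    return observed_change - response_shift


# (group, observed change, response shift) for 'sensory disability' in Study 1
for group, observed, shift in [("stable", -0.19, -0.22), ("relapsing", -0.06, -0.18)]:
    print(group, round(true_change(observed, shift), 2))
```

For the relapsing group the observed and true changes have opposite signs, which is why ignoring response shift would misstate the direction of change as well as its size.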
Study 2 results
Participants were included if they had at least six of the 12 SF-12 items completed at three consecutive measurement occasions that were 6 (±3) months apart. This yielded a final sample of 1,767 patients. See Table 1 for sample characteristics and Table 2 for SF-12 observed scale means.
Step 1: Establishing a measurement model
The SF-12 PF, RP, BP, and GH scales are associated with PHYS, and VT, SF, RE, and MH are associated with MENT [8]. When replicating this structure, this model had only marginally satisfactory fit (Table 3, Model 2.1.1). Three sources of misfit were, at all measurement occasions, as follows: (1) covariances between residuals of PF and RP ($\chi^2_{\text{diff}}(9) = 549.6$, $P < 0.001$), (2) covariances between residuals of MH and RE ($\chi^2_{\text{diff}}(9) = 452.5$, $P < 0.001$), and (3) cross-loadings of VT on PHYS (χ²diff(3) = 156.0.2, P < 0.001). These modifications produced a measurement model with satisfactory fit (Table 3, Model 2.1.2).
Step 2: Testing measurement invariance across measurement occasions
The equality constraints imposed in this step led to significantly deteriorated fit, suggesting the presence of response shift (Table 3, Model 2.2.1). The global test associated with the scale RE resulted in the largest OPCs and a significant chi-square difference test. Therefore, the parameters associated with RE were freed to be estimated (Fig. 2a). As can be seen in Fig. 2a, the intercept of RE at the second measurement occasion was lower (7.1) than at the first and third measurement occasions (7.4 and 7.5). This suggests that there is uniform recalibration for the RE scale: patients seemed less inclined to report high RE at the second measurement occasion than at the first and third measurement occasions, given similar MENT HRQoL. No further response shifts were found.
Step 3: Testing measurement invariance with respect to exogenous variables
Adding exogenous variables to Model 2.2.2 resulted in a satisfactorily fitting model (Table 3, Model 2.3.1). We found large negative correlations of age and relapse with PHYS and MENT and positive correlations of symptom change with PHYS and MENT. The correlations between sex and time since diagnosis and PHYS and MENT are considered very small (<0.01) (Table 5). With Model 2.3.1 as the comparison model, we proceeded to test for bias with respect to the exogenous variables using the global tests and OPCs. Significant direct effects of age on MH (Model 2.3.2) were found. We also found significant direct effects of age on VT (Model 2.3.3). In the next iteration, no further significant effects were found.
Table 5
Final model variance/covariances and residual variance estimates

Variance/covariances

| | PHYS HRQoL–T1 | MENT HRQoL–T1 | PHYS HRQoL–T2 | MENT HRQoL–T2 | PHYS HRQoL–T3 | MENT HRQoL–T3 |
| --- | --- | --- | --- | --- | --- | --- |
| PHYS HRQoL–T1 | 1 | | | | | |
| MENT HRQoL–T1 | 0.87 | 1 | | | | |
| PHYS HRQoL–T2 | 0.87 | 0.78 | 0.94 | | | |
| MENT HRQoL–T2 | 0.81 | 0.81 | 0.84 | 0.95 | | |
| PHYS HRQoL–T3 | 0.91 | 0.80 | 0.90 | 0.83 | 1.05 | |
| MENT HRQoL–T3 | 0.77 | 0.78 | 0.77 | 0.81 | 0.91 | 0.98 |
| Sex | −0.03 | 0.002 | −0.01 | −0.01 | −0.02 | −0.005 |
| Age | −0.25 | −0.12 | −0.21 | −0.13 | −0.26 | −0.13 |
| Time since diagnosis | −0.04 | 0.02 | 0.01 | −0.05 | −0.02 | −0.03 |
| Symptom change–T1 | 0.63 | 0.52 | 0.52 | 0.48 | 0.52 | 0.39 |
| Symptom change–T2 | 0.52 | 0.42 | 0.61 | 0.53 | 0.57 | 0.41 |
| Symptom change–T3 | 0.41 | 0.36 | 0.42 | 0.39 | 0.60 | 0.49 |
| Relapse–T1 | −0.15 | −0.15 | −0.13 | −0.12 | −0.13 | −0.12 |
| Relapse–T2 | −0.13 | −0.13 | −0.15 | −0.14 | −0.14 | −0.12 |
| Relapse–T3 | −0.12 | −0.11 | −0.13 | −0.11 | −0.17 | −0.14 |

Residual variances

| Occasion | PF | RP | BP | GH | VT | SF | RE | MH |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T1 | 2.27 | 1.66 | 3.24 | 1.68 | 1.97 | 1.31 | 2.65 | 1.76 |
| T2 | 1.99 | 1.62 | 3.28 | 1.65 | 1.82 | 1.56 | 2.61 | 1.67 |
| T3 | 2.51 | 1.60 | 3.91 | 1.64 | 2.02 | 1.62 | 2.59 | 1.79 |


We tested whether the measurement bias in MH and VT with respect to age was consistent across measurement occasions. As the inclusion of equality constraints across measurement occasions did not worsen model fit ($\chi^2_{\text{diff}}(4) = 7.80$, $P = 0.099$), we concluded that the bias is consistent and did not indicate response shift. Given the negative correlations between age and PHYS and MENT (Table 5), the bias on MH (0.20) and VT (0.19) with respect to age suggests that older patients reported better MH and VT than was expected. The estimates of the common factor means of this model (Model 2.3.3) did not show any change (Fig. 2b and c).
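The p-value of the chi-square difference test reported above can be reproduced without a statistics library, because the upper tail of a chi-square distribution with an even number of degrees of freedom has a closed form. The sketch below is an independent check written for illustration; the function name is hypothetical and it is not the routine used in the original analysis.

```python
import math


def chi2_sf_even_df(statistic: float, df: int) -> float:
    """Upper-tail probability of a chi-square statistic for even df:
    exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!"""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = statistic / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


# Chi-square difference of 7.80 on 4 degrees of freedom, as reported in the text.
print(round(chi2_sf_even_df(7.80, 4), 3))
```

The result is above the conventional 0.05 level, consistent with the conclusion that constraining the age effects to be equal across occasions does not significantly worsen fit.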
Study 2 conclusion
When we consider the impact of response shift and true change on observed change in RE, we see that the observed change of 0.23 is almost fully attributable to response shift (0.17), leaving only 0.06 of true change. However, we found an indication of response shift in RE only at the second measurement occasion, which hinders substantive interpretation. We therefore concluded that this may be a chance finding.
Discussion
We have illustrated two different ways in which SEM can be used to investigate response shift. With the present data, we found uniform recalibration response shift in the sensory disability item of the Performance Scales (Study 1), indicating that stable and relapsing patients initially overestimate their sensory disability. Apparently, in comparison with their general performance disability, they initially worry more about their sensory disability but then become accustomed to their situation, whereas progressive patients continue to deteriorate in their performance disability and, as a result, do not become accustomed to sensory disability. In a study investigating progressive MS patients, it was shown that the presence of sensory disability led to an increased length of time to reach a severe level of disability [20]. In another study comparing progressive and relapsing patients, it was found that relapsing patients had higher initial sensory disability than progressive patients [21]. It may be that the gradual progression of disease seen in progressive patients leads to a slight worsening of sensory disability over time, which is difficult to become accustomed to.
We did not find clearly interpretable response shift in the SF-12 (Study 2), nor did we find any change in HRQoL. However, two measurement biases that are not response shift, as they are constant across measurement occasions, were found: age on MH and age on VT. The correlations between age and PHYS and MENT HRQoL are negative; however, the direct effects of age on MH and VT are positive. This suggests that increased age affects MH and VT in a different way than would be expected from the correlations between age and the common factors.
A possible explanation for the limited response shift findings is that the NARCOMS registry patients are not subjected to a planned intervention, so there is no clear catalyst of health state changes, other than self-reported relapse and symptoms. Therefore, the sizes of the response shifts found are small, and accounting for these response shifts does not cause large effects on the mean change in performance disability or HRQoL. Importantly, however, the response shifts do change the interpretation of the observed changes. In Study 1 for the stable patients and in Study 2 for all patients, the response shift accounts for essentially all observed change, leaving essentially no true change. With the relapsing patients in Study 1, the observed change is underestimated, and after taking response shift into consideration, true change appears small and negative.
These two studies highlight how SEM can be used to detect measurement bias under different circumstances. The steps used are hierarchical; however, there is the flexibility (1) to account for health state by splitting the sample into subsamples or by including exogenous variables, (2) to use time-varying or time-constant exogenous variables, (3) to use different search strategies for detecting response shift, (4) to include fixed or random measurement occasions and, not discussed in this paper, (5) to investigate different longitudinal structures such as autoregressive and latent growth curve structures [22]. Some of these decisions are made based on the design of the study or the available sample size, and each decision involves trade-offs. A persistent problem is that the decision of when to stop investigating measurement bias is relatively subjective. Because of the large number of tests to consider, despite our strict criteria for what we consider response shift, we still need to guard against chance findings [18]. Still, SEM offers a useful statistical approach for response shift detection as it can be tailored to the specifics of the study design.
Acknowledgments
This work was funded in part by a Visiting Scientist Fellowship from the Consortium of Multiple Sclerosis Centers Foundation to Dr. Schwartz.
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Appendix 1
Study 1
Step 1: Establishing a measurement model
The Performance Scales were originally reported to be a unidimensional construct [7]. If the confirmatory factor analysis (CFA) used to assess the longitudinal multigroup unidimensional model does not fit, then exploratory factor analysis is used to determine an alternative model, and the fit of this model is assessed using CFA.
Maximum likelihood estimation is used for parameter estimation. We assess the overall fit of our models with the chi-square test of exact fit, the root mean square error of approximation (RMSEA) [12], the expected cross-validation index (ECVI) [12], the comparative fit index (CFI) [10], and the Tucker-Lewis index (TLI) [11]. Emphasis, however, is placed on the RMSEA and ECVI fit statistics as confidence intervals can be calculated for these fit statistics. A nonsignificant chi-square value indicates good fit. As the chi-square test is sensitive to small deviations between model and data, especially when the sample size is large, we also consider approximate fit indices that relax the stringent requirement on chi-square that the model has exact fit to the population. RMSEA < .08 indicates satisfactory fit; RMSEA < .05 indicates close fit. ECVI cannot be used as a stand-alone index but can be used to compare alternative models; a smaller ECVI value suggests better model fit [12]. Finally, both the CFI and the TLI assess the improvement in fit from a null model with no relationships assumed between variables. Values > .95 for the CFI and > .90 for the TLI indicate reasonable fit of the model to data.
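The approximate fit indices described above can be illustrated with a minimal sketch. It assumes the common single-group textbook formulas for RMSEA, CFI, and TLI; SEM software may compute these differently for multigroup models. The function names, the label "not satisfactory", and the example numbers are hypothetical and are not taken from the study.

```python
import math


def rmsea(chi2, df, n):
    """Point estimate of RMSEA (single-group formula, sample size n)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2, df, chi2_null, df_null):
    """Comparative fit index relative to a null (independence) model."""
    d_model = max(chi2 - df, 0.0)
    d_null = max(chi2_null - df_null, d_model)
    return 1.0 if d_null == 0 else 1.0 - d_model / d_null


def tli(chi2, df, chi2_null, df_null):
    """Tucker-Lewis index relative to a null (independence) model."""
    ratio_null = chi2_null / df_null
    return (ratio_null - chi2 / df) / (ratio_null - 1.0)


def rmsea_label(value):
    """Apply the cutoffs quoted in the text: < .05 close, < .08 satisfactory."""
    if value < 0.05:
        return "close"
    if value < 0.08:
        return "satisfactory"
    return "not satisfactory"  # wording of this label is an assumption


# Hypothetical model: chi2 = 120 on 60 df, N = 501; null model chi2 = 2060 on 78 df.
fit = {
    "rmsea": rmsea(120.0, 60, 501),
    "cfi": cfi(120.0, 60, 2060.0, 78),
    "tli": tli(120.0, 60, 2060.0, 78),
}
fit["rmsea_label"] = rmsea_label(fit["rmsea"])
fit["cfi_ok"] = fit["cfi"] > 0.95
fit["tli_ok"] = fit["tli"] > 0.90
```

The confidence intervals for RMSEA and ECVI that the text emphasizes are not computed here, because they require the noncentral chi-square distribution.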
If a new model is specified due to misfit as indicated by modification indices, standardized residuals or standardized expected parameter changes (SEPC), then the change in overall model fit is assessed with the chi-square difference test and the ECVI difference test [13]. The chi-square difference test is used to assess whether the alternative model fit is significantly better than the fit of the null model. A significant result indicates that the alternative model has better fit than the null model. The ECVI difference is the difference between the ECVI values of the null and alternative models. If the confidence interval of this test does not include zero, then the change is significant. The ECVI test is based on the chi-square test but it penalizes models containing more free parameters [13].
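As a sketch of the chi-square difference test for nested models, the following uses only the Python standard library. The helper `chi2_sf` evaluates the upper-tail chi-square probability from the standard closed-form series for integer degrees of freedom; the function names and the example fit values are hypothetical. The ECVI difference test is not covered, since it requires confidence intervals that are not derived in the text.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variate, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2-1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    normal_pdf = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return tail + 2.0 * normal_pdf * total


def chi2_difference_test(chi2_restricted, df_restricted, chi2_free, df_free):
    """Compare a more restricted model with a nested, less restricted one."""
    delta_chi2 = chi2_restricted - chi2_free
    delta_df = df_restricted - df_free
    return delta_chi2, delta_df, chi2_sf(delta_chi2, delta_df)


# Hypothetical fits: restricted model chi2 = 150 (70 df), freer model chi2 = 120 (60 df).
delta_chi2, delta_df, p_value = chi2_difference_test(150.0, 70, 120.0, 60)
significant = p_value < 0.05
```

A significant result here means that the less restricted model fits better than the more restricted one.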
Step 2: Testing invariance across measurement occasions
In this step, we take the final model of Step 1 and simultaneously constrain all factor loadings and intercepts to be equal across measurement occasions and groups. Across-occasion invariance (lack of measurement bias) of the factor loadings and intercepts is assessed by comparing this model with the final model of Step 1 using the chi-square difference test. A significant test provides strong evidence that response shift is present, as it indicates that the imposed equality constraints are not tenable. However, even if the test is not significant, one of the equality constraints may still be untenable. Therefore, we still search for bias, as a single yet substantively important response shift may not cause significant deterioration in the overall model fit.
To detect measurement bias, we examine modification indices and SEPCs [13]. One prerequisite for meaningful model respecification is that large modification indices are substantively interpretable; however, a large modification index alone does not ensure a substantial change in the parameter estimate, especially in large samples. Therefore, we also consider the associated SEPC. If both are large, then there is a significant improvement in the overall model fit and a substantial change in the parameter estimate. As there are a large number of modification indices to consider, we stop investigating modification indices when none are greater than 12.83 [14]. This critical value has been adjusted for the number of tests under consideration so as to maintain a family-wise Type I error rate of 5%. For the SEPCs, we consider values > 0.10 significant [15]. The impact of any response shift found is assessed by using Oort’s [2] partitioning formula to evaluate the contribution of response shift and true change in terms of Cohen’s effect size d [16].
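The screening rule above can be sketched as follows. This is an illustration under stated assumptions: each modification index is treated as a chi-square statistic with 1 degree of freedom, the SEPC criterion is applied to the absolute value, and the candidate list contains made-up numbers with placeholder labels. The number of tests behind the 12.83 cutoff is not stated in the text, so the sketch only recovers the per-test significance level that the cutoff implies.

```python
import math

MI_CRITICAL = 12.83    # critical value quoted in the text
SEPC_CRITICAL = 0.10   # SEPC cutoff quoted in the text


def implied_alpha(critical_value):
    """Per-test alpha implied by a 1-df chi-square critical value."""
    return math.erfc(math.sqrt(critical_value / 2.0))


def flag_candidates(candidates, mi_cut=MI_CRITICAL, sepc_cut=SEPC_CRITICAL):
    """Return labels with both a large modification index and a large SEPC.

    candidates: iterable of (label, modification_index, sepc) tuples.
    Flagged labels are ordered from largest to smallest modification index.
    """
    hits = [(mi, label) for label, mi, sepc in candidates
            if mi > mi_cut and abs(sepc) > sepc_cut]
    return [label for mi, label in sorted(hits, reverse=True)]


# Made-up values for illustration only.
candidates = [
    ("intercept of item A, occasion 1", 25.4, 0.21),
    ("loading of item B, occasion 2", 14.0, 0.04),     # large MI, small SEPC
    ("intercept of item C, occasion 3", 6.2, -0.15),   # small MI, large SEPC
]
flagged = flag_candidates(candidates)
per_test_alpha = implied_alpha(MI_CRITICAL)
```

Only the first candidate is flagged, because the rule requires both criteria to be met. Whether a flagged parameter is then freed still depends on substantive interpretability, which code cannot judge.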
Step 3: Testing invariance with respect to exogenous variables
Using the final model in Step 2, we now include ‘age’, ‘sex’, ‘newly diagnosed’, ‘time between measurement occasions 1 and 2’, and ‘time between measurement occasions 2 and 3’ as additional exogenous variables. These five exogenous variables correlate with each other and with the common factors, but their relationship with the observed items should be fully explained by these correlations. If large modification indices and SEPCs are present between the exogenous variables and the observed items, this is an indication of bias and requires the estimation of direct effects. If the estimates of the direct effects change over time (i.e., cannot be constrained to be equal across measurement occasions), we interpret this as response shift. However, if the direct effects do not change over time, we consider this measurement bias. Large modification indices and SEPCs will be evaluated using the same criteria as outlined in Step 2.
Appendix 2
Study 2
Step 1: Establishing a measurement model
The first goal is to find a satisfactory measurement model for the SF-12. We begin with the measurement model comprising two common factors: PHYS and MENT HRQoL. If this model does not fit, we use modification indices and standardized residuals [13, 18] to identify misspecification and develop an alternative model. As we are evaluating the measurement model longitudinally, we require the same observed variables to be associated with the same common factors across measurement occasions. If the model is modified, the changes are assessed using the chi-square difference test and ECVI difference test as explained in Study 1. Overall model fit is assessed using the same statistics as used in Study 1.
Step 2: Testing invariance across measurement occasions
Using the final measurement model from Step 1, we simultaneously constrain all factor loadings and intercepts to be equal across measurement occasions, as in Study 1. Overall model fit and the change in model fit relative to the final model of Step 1 are again assessed in the same way as in Study 1.
To detect response shift, we use a different search strategy from Study 1, where we test individual constraints. Here, we follow the procedure outlined in King-Kallimanis et al. [19], where all observed scales are tested with a smaller number of global tests that free multiple constraints simultaneously. In this study, we use eight global tests, one for each observed scale. That is, for each of the eight scales, the equality constraints on both the factor loadings and the intercepts at all three measurement occasions are removed. The fit of each of these eight new models is compared to the fully constrained model using the chi-square difference test. The impact of freeing the parameters on the estimated parameter values is assessed by calculating the observed parameter changes (OPC). The OPCs are scaled for ease of comparison, and they are the actual difference between the standardized factor loadings and intercepts of the null model, and the standardized factor loadings and intercepts of the altered model. Corresponding to Cohen’s small effect sizes, we consider an OPC indicating a difference of 0.1 between factor loadings or 0.2 between intercepts to be large [16]. We consider both the chi-square difference test and the OPCs because small deviations between the observed and expected covariance matrices may lead to significant model improvement, but not substantial parameter change.
After running the eight tests, the model specifics are checked. In particular, scales whose OPCs meet our criteria in conjunction with a significant chi-square difference test are considered to exhibit response shift. The factor loadings and intercepts of such a scale remain unconstrained, and the remaining scales are retested with an adjusted significance level. We continue iteratively retesting the remaining scales until no large OPC with a significant chi-square difference test is found.
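The decision rule for a single global test can be sketched as below. The sketch assumes that OPCs are taken as absolute differences between the standardized estimates of the constrained and freed models, and that a scale qualifies if any loading OPC reaches 0.1 or any intercept OPC reaches 0.2. All function names and numbers are hypothetical.

```python
def observed_parameter_changes(constrained, freed):
    """Absolute differences between a constrained estimate and the freed
    estimates, one per measurement occasion (standardized values)."""
    return [abs(value - constrained) for value in freed]


def opc_is_large(loading_opcs, intercept_opcs,
                 loading_cut=0.1, intercept_cut=0.2):
    """True if any loading or intercept OPC reaches its cutoff."""
    return (any(v >= loading_cut for v in loading_opcs)
            or any(v >= intercept_cut for v in intercept_opcs))


def exhibits_response_shift(p_value, alpha, loading_opcs, intercept_opcs):
    """Both criteria must hold: significant chi-square difference and large OPC."""
    return p_value < alpha and opc_is_large(loading_opcs, intercept_opcs)


# Made-up standardized estimates for one scale at three occasions.
loading_opcs = observed_parameter_changes(0.70, [0.72, 0.69, 0.55])
intercept_opcs = observed_parameter_changes(0.00, [0.03, -0.05, 0.08])

# Hypothetical chi-square difference p-value, tested at 0.05 / 8 = 0.00625.
shift = exhibits_response_shift(0.001, 0.05 / 8, loading_opcs, intercept_opcs)
```

Requiring both criteria guards against freeing parameters whose release improves fit significantly in a large sample while barely changing the estimates.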
Although there are fewer tests to consider when using the global tests, when the number of iterations increases, so does the number of tests. Therefore, when considering the significance of the chi-square difference test, we use a Bonferroni-adjusted level of significance, with the family-wise level of significance at 5% divided by the number of tests under consideration for this particular step of the analysis [14].
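The iterative search with a Bonferroni-adjusted significance level might be organized as in the following sketch. Two choices here are assumptions rather than details given in the text: the number of tests in an iteration is taken to be the number of scales still constrained, and when several scales qualify, only the one with the smallest p-value is freed before retesting. The callable `run_global_test` is a hypothetical stand-in for refitting the model in SEM software.

```python
def iterative_search(scales, run_global_test, family_alpha=0.05):
    """Free one scale per iteration until no scale meets both criteria.

    run_global_test(scale, freed) must return (p_value, opc_large) for the
    model in which `scale` is freed in addition to the scales in `freed`.
    """
    freed = []
    remaining = list(scales)
    while remaining:
        alpha = family_alpha / len(remaining)  # Bonferroni-adjusted level
        results = {s: run_global_test(s, tuple(freed)) for s in remaining}
        hits = [s for s in remaining
                if results[s][0] < alpha and results[s][1]]
        if not hits:
            break
        best = min(hits, key=lambda s: results[s][0])
        freed.append(best)
        remaining.remove(best)
    return freed


# Stand-in for model refitting: fixed, made-up results per scale.
FAKE_RESULTS = {
    "scale_1": (0.0001, True),
    "scale_2": (0.0040, True),
    "scale_3": (0.5000, False),
}


def fake_global_test(scale, freed):
    return FAKE_RESULTS[scale]


freed_scales = iterative_search(["scale_1", "scale_2", "scale_3"], fake_global_test)
first_round_alpha = 0.05 / 8  # eight global tests in the first iteration of Study 2
```

In a real analysis the p-values and OPCs of the remaining scales change after each scale is freed, which is why every iteration refits all remaining candidates.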
Step 3: Testing invariance with respect to exogenous variables
We extend the final model of Step 2 to include ‘age’, ‘sex’, ‘time since diagnosis’, ‘relapse in the past 6 months’, and ‘symptom change in the last 6 months’ as exogenous variables. We hypothesize that these variables have the potential to induce bias on the observed scales. These additional variables are free to correlate with each other and with the common factors; however, all direct effects of the exogenous variables on the observed scales are fixed to zero. To test for response shift, we fit new models where the direct effects of the exogenous variables are free to be estimated. For example, for ‘sex’ we fit eight new models, each freeing the effect of sex on one observed scale at the three measurement occasions. Each such model therefore has three additional parameters to be estimated. The impact of these direct effects is assessed with OPCs and the chi-square difference test. If the largest effects meet our criteria as in Step 2, we leave these parameters free to be estimated and start the process over again, stopping when no freed direct effects meet our criteria. Once any biases have been accounted for, this final model can be used to assess true change in the attribute of interest using the same partitioning formula we use in Study 1.
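The set of candidate models in this step can be enumerated with a short sketch. The five exogenous variables are those named above. The eight scale abbreviations are an assumption based on the standard SF-12 scale names; only MH and VT are named in this part of the text. The label format is hypothetical.

```python
from itertools import product

EXOGENOUS = [
    "age",
    "sex",
    "time since diagnosis",
    "relapse in the past 6 months",
    "symptom change in the last 6 months",
]
# Assumed standard SF-12 scale abbreviations (only MH and VT appear in the text).
SCALES = ["PF", "RP", "BP", "GH", "VT", "SF", "RE", "MH"]
OCCASIONS = (1, 2, 3)


def candidate_models(exogenous=EXOGENOUS, scales=SCALES, occasions=OCCASIONS):
    """One candidate model per (exogenous variable, scale) pair; each frees
    the direct effect at every measurement occasion."""
    return [
        {
            "exogenous": x,
            "scale": s,
            "freed": [f"{x} -> {s} (occasion {t})" for t in occasions],
        }
        for x, s in product(exogenous, scales)
    ]


models = candidate_models()
sex_models = [m for m in models if m["exogenous"] == "sex"]
```

Under these assumptions the first pass involves 40 candidate models, eight per exogenous variable, each adding three free parameters to the model from Step 2.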