## Introduction

## Operationalization and interpretation of change and response shift

### Assessment of different types of change

The SEM approach | General example | Illustrative example |
---|---|---|

Step 1: establishing a measurement model | ||

\({\text{Cov}}\left( {X,X^{\prime}} \right) = {\varvec{\Sigma}} = {\varvec{\Lambda}}{\varvec{\Phi}}{\varvec{\Lambda}}^{\prime} + {\varvec{\Theta}},\) where \(\begin{aligned} {\varvec{\Sigma}} = & \left[ {\begin{array}{cc} {\varvec{\Sigma}}_{{\bf{T}}1} & {\varvec{\Sigma}}_{{\bf{T}}1,{\bf{T}}2} \\ {\varvec{\Sigma}}_{{\bf{T}}2,{\bf{T}}1} & {\varvec{\Sigma}}_{{\bf{T}}2} \\ \end{array} } \right] \\ = & \left[ {\begin{array}{cc} {\varvec{\Lambda}}_{{\bf{T}}1} & 0 \\ 0 & {\varvec{\Lambda}}_{{\bf{T}}2} \\ \end{array} } \right]\left[ {\begin{array}{cc} {\varvec{\Phi}}_{{\bf{T}}1} & {\varvec{\Phi}}_{{\bf{T}}1,{\bf{T}}2} \\ {\varvec{\Phi}}_{{\bf{T}}2,{\bf{T}}1} & {\varvec{\Phi}}_{{\bf{T}}2} \\ \end{array} } \right]\left[ {\begin{array}{cc} {\varvec{\Lambda}}_{{\bf{T}}1} & 0 \\ 0 & {\varvec{\Lambda}}_{{\bf{T}}2} \\ \end{array} } \right]^{\prime} + \left[ {\begin{array}{cc} {\varvec{\Theta}}_{{\bf{T}}1} & {\varvec{\Theta}}_{{\bf{T}}1,{\bf{T}}2} \\ {\varvec{\Theta}}_{{\bf{T}}2,{\bf{T}}1} & {\varvec{\Theta}}_{{\bf{T}}2} \\ \end{array} } \right] \\ \end{aligned}\) and \({\text{Mean}}\left( X \right) = {\varvec{\upmu}} = {\varvec{\uptau}} + {\varvec{\Lambda}}{\varvec{\upkappa}},\) where \({\varvec{\upmu}} = \left[ {\begin{array}{c} {\varvec{\upmu}}_{{\bf{T}}1} \\ {\varvec{\upmu}}_{{\bf{T}}2} \\ \end{array} } \right] = \left[ {\begin{array}{c} {\varvec{\uptau}}_{{\bf{T}}1} \\ {\varvec{\uptau}}_{{\bf{T}}2} \\ \end{array} } \right] + \left[ {\begin{array}{cc} {\varvec{\Lambda}}_{{\bf{T}}1} & 0 \\ 0 & {\varvec{\Lambda}}_{{\bf{T}}2} \\ \end{array} } \right]\left[ {\begin{array}{c} {\varvec{\upkappa}}_{{\bf{T}}1} \\ {\varvec{\upkappa}}_{{\bf{T}}2} \\ \end{array} } \right]\) | The factor model describes the relationships between the observed variables and one or more underlying latent variables, where the underlying latent variables represent everything that the observed measures have in common (e.g., perceived health or HRQL). With longitudinal assessment, e.g., baseline (T1) and follow-up (T2), the same factor model can be applied at each measurement occasion to arrive at a longitudinal measurement model | Suppose we have patients’ scores on nine observed indicators that measure physical, mental, and social aspects of health. We use a three-factor model to represent the measurement structure of the data (see Fig. 1). The same factor model can be applied at both baseline (T1) and follow-up (T2) to arrive at a longitudinal measurement model (see Fig. 2). The specification of the full SEM matrices is provided in Online Appendix B. Example syntaxes for application of the SEM approach using the software programs Lavaan [33] and LISREL [29] are provided as supplementary material |

Step 2: overall test of response shift ^{a} | ||

Σ and \({\varvec{\upmu}}\) as in Step 1, but with: \({{\varvec{\Lambda}}}_{\bf{T}1}\) = \({{\varvec{\Lambda}}}_{\bf{T}2}\), and \({{\varvec{\uptau}}}_{\bf{T}1}\) = \({{\varvec{\uptau}}}_{\bf{T}2}\) | To test for the presence of response shift, all model parameters associated with response shift (i.e., factor loadings and intercepts) are restricted to be equal across occasions, e.g., across baseline and follow-up. When these restrictions are not tenable, we continue to investigate which variable is affected by which type of response shift in Step 3 | In the illustrative example from Fig. 2, the test for the presence of response shift entails equality restrictions on all nine factor loadings and all nine intercepts across occasions |

Step 3: response shift detection ^{a} | ||

Pattern \(({{\varvec{\Lambda}}}_{\bf{T}1})\) ≠ Pattern \(({{\varvec{\Lambda}}}_{\bf{T}2})\) | Reconceptualization: When the common factor loading of an observed indicator becomes zero or significantly different from zero at follow-up, this means that this indicator is no longer part of the measurement of the underlying latent variable or becomes part of the measurement, respectively, indicating a shift in conceptualization | Suppose that the common factor loading of the observed indicator ‘work relations’ on physical health is significantly different from zero at follow-up. This indicates that the scores on the subscale ‘work relations’ become (at least partly) indicative of physical health. It may be, for example, that ‘work relations’ is interpreted as related to physical functioning at follow-up (but not at baseline) |

\({{\varvec{\Lambda}}}_{{{\bf{T}}1}} \ne {{\varvec{\Lambda}}}_{{{\bf{T}}2}}\) | Reprioritization: When the value of a common factor loading is smaller or larger at follow-up as compared to baseline, this indicates a shift in the meaning (less or more important) of the indicator to the measurement of the underlying factor | Suppose the common factor loading of the observed indicator ‘anxiety’ is larger at follow-up as compared to baseline. This indicates that patients’ scores on ‘anxiety’ become more important to the measure of mental health. It may be, for example, that anxiety due to uncertainty about the effectiveness of treatment and/or the course of the disease plays an increasingly important role for patients’ mental health |

\({{\varvec{\uptau}}}_{{{\bf{T}}1}} \ne {{\varvec{\uptau}}}_{{{\bf{T}}2}}\) | Recalibration: When the intercept value of an observed indicator changes over time, this indicates a shift in the meaning of the response categories (internal standards) of the indicator, where the same ‘true’ state on the underlying factor corresponds to lower or higher scores on the indicator | Suppose the intercept value of the observed indicator ‘nausea’ is lower at follow-up as compared to baseline. This indicates that the same physical health leads to lower scores on nausea at follow-up as compared to baseline. It may be, for example, that patients get used to the experience of nausea, and therefore rate the same health experience lower at follow-up as compared to baseline |

Step 4: true change assessment | ||

Σ and \({\varvec{\upmu}}\) as in Step 3, including possible response shift, and assess whether \({{\varvec{\upkappa}}}_{\bf{T}1}\) ≠ \({{\varvec{\upkappa}}}_{\bf{T}2}\) | ‘True’ change: When the common factor means are lower or higher at follow-up, this indicates that patients’ HRQL decreases or increases | Suppose the common factor mean of physical health is lower at follow-up as compared to baseline. This indicates that patients’ physical health deteriorates over time |
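
The covariance and mean structure from Step 1 can be made concrete with a small numerical sketch. The example below is not taken from the article's supplementary syntax: it uses plain Python, a deliberately reduced model (one factor per occasion, two indicators per occasion), and hypothetical parameter values. Cross-occasion residual covariances are set to zero for brevity. It shows how an intercept that is lower at T2 (recalibration) lowers the model-implied indicator mean even when the factor means are equal, that is, without any 'true' change.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def implied_moments(lam, phi, theta, tau, kappa):
    """Return (Sigma, mu) with Sigma = Lam Phi Lam' + Theta and mu = tau + Lam kappa."""
    lpl = matmul(matmul(lam, phi), transpose(lam))
    sigma = [[lpl[i][j] + theta[i][j] for j in range(len(lpl))]
             for i in range(len(lpl))]
    mu = [t + sum(l * k for l, k in zip(row, kappa))
          for t, row in zip(tau, lam)]
    return sigma, mu


# Hypothetical values; rows are ordered X1_T1, X2_T1, X1_T2, X2_T2.
lam = [[1.0, 0.0],   # block-diagonal loadings: first column is the T1 factor,
       [0.8, 0.0],   # second column is the T2 factor
       [0.0, 1.0],
       [0.0, 0.8]]   # loadings are equal across occasions
phi = [[1.0, 0.6],   # factor variances and the T1-T2 factor covariance
       [0.6, 1.0]]
theta = [[0.5 if i == j else 0.0 for j in range(4)] for i in range(4)]
tau = [3.0, 2.0, 3.0, 1.5]   # intercept of X2 is lower at T2 (recalibration)
kappa = [0.0, 0.0]           # equal factor means: no 'true' change

sigma, mu = implied_moments(lam, phi, theta, tau, kappa)
print("implied means:", mu)
print("implied Cov(X1_T1, X1_T2):", sigma[0][2])
```

In this sketch the implied mean of the second indicator drops between occasions solely because of its intercept, which is the pattern that Step 3 would flag as recalibration.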

### Added value of the SEM approach

## Practical considerations in application of the SEM approach

Decisions to be made | Recommendations | |
---|---|---|

Know your measures | ||

Step 1: establishing a measurement model | • Choose one of two procedures (i.e., model all measurement occasions simultaneously or separately) to arrive at a longitudinal measurement model • Modify the measurement model when model fit is not adequate, in order to obtain a well-fitting model • Decide which and how many modifications are necessary to obtain a substantively meaningful measurement model | • Specify the measurement model based on the structure of the questionnaire, previous research, and/or theory • In case of unclear or unknown structure, use exploratory analyses and substantive considerations to arrive at an interpretable and well-fitting measurement model • Combine substantive and statistical criteria to guide (re)specification to arrive at the most parsimonious, most reasonable, and best-fitting measurement model |

Identification of possible response shift | ||

Step 2: overall test of response shift Step 3: detection of response shift | • Choose statistical criteria to guide response shift detection • Choose between competing response shifts • Decide when to stop searching for response shift | • Use the overall test of response shift to protect against false detection (i.e., type I error) • When possible, use an iterative procedure (where all possible response shifts are considered one at a time) to identify specific response shift effects • Alternatively, use statistical indices such as modification indices, expected parameter change, inspection of residuals, and/or Wald tests to guide response shift detection • Evaluate each possible response shift statistically (i.e., difference in model fit) and substantively (i.e., interpretation) in order to identify all meaningful effects • For more robust stopping criteria, use overall model-fit evaluation and evaluation of difference in model fit of the measurement model • Use different sequential decision-making practices to gain confidence in the robustness of the results |

Interpretation of response shift and ‘true’ change | ||

Step 3: detection of response shift Step 4: assessment of ‘true’ change | • Can the detected violations of invariance of model parameters be interpreted as response shifts? • Is there ‘true’ change? • What is the impact of response shift on the assessment of change? | • Detected effects can be substantively interpreted as response shift using substantive knowledge of the patient group, treatment, or disease trajectory • Compare common factor means across occasions to assess ‘true’ change in the target construct • To understand both ‘true’ change and response shift, consider possible other explanations for the detected effects and include—when available—possible explanatory variables (e.g., coping, adaptation, social comparison) • To understand the impact of response shift on change assessment, evaluate (1) the impact of response shift on observed change in the indicator variable using the decomposition of change [12], and (2) the impact of response shift on ‘true’ change by comparing the common factor means from the final model of Step 3 (while taking into account response shift) with the common factor means from Step 2 (under no response shift) |
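
The impact of response shift on observed change can be illustrated with the mean structure μ = τ + Λκ from the SEM approach. Subtracting the baseline means from the follow-up means gives the algebraic identity μ_T2 − μ_T1 = (τ_T2 − τ_T1) + (Λ_T2 − Λ_T1)κ_T2 + Λ_T1(κ_T2 − κ_T1). The sketch below computes these three parts per indicator. It is an assumption here that this particular split corresponds to the decomposition of change cited as [12]. The function name `decompose_change` and all parameter values are hypothetical.

```python
def decompose_change(tau1, tau2, lam1, lam2, kappa1, kappa2):
    """Split the implied mean change of each indicator into three additive parts.

    lam1 and lam2 are lists of loading rows (one row per indicator);
    kappa1 and kappa2 are lists of factor means.
    """
    def dot(row, vec):
        return sum(r * v for r, v in zip(row, vec))

    d_kappa = [k2 - k1 for k1, k2 in zip(kappa1, kappa2)]
    parts = []
    for t1, t2, l1, l2 in zip(tau1, tau2, lam1, lam2):
        d_lam = [b - a for a, b in zip(l1, l2)]
        intercept_part = t2 - t1            # change in intercept (recalibration)
        loading_part = dot(d_lam, kappa2)   # change in loadings
        true_part = dot(l1, d_kappa)        # change in factor means ('true' change)
        parts.append({
            "recalibration": intercept_part,
            "loading_shift": loading_part,
            "true_change": true_part,
            "total": intercept_part + loading_part + true_part,
        })
    return parts


# Hypothetical single-factor example with two indicators.
tau1, tau2 = [3.0, 2.0], [3.0, 1.5]          # second intercept drops at T2
lam1, lam2 = [[1.0], [0.8]], [[1.0], [0.8]]  # loadings unchanged
kappa1, kappa2 = [0.0], [-0.5]               # factor mean decreases

parts = decompose_change(tau1, tau2, lam1, lam2, kappa1, kappa2)
for name, part in zip(["indicator 1", "indicator 2"], parts):
    print(name, part)
```

For the second indicator the observed decrease is larger than the decrease attributable to the factor mean alone, because the intercept change adds to it.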

### Know your measures: establishing an appropriate measurement model

Description and interpretation | Advantages | Disadvantages |
---|---|---|

Fit statistic | ||

Overall goodness of model fit | ||

Chi-square test | ||

The chi-square value can be used to test the null hypothesis of ‘exact’ fit (i.e., that the specified model holds exactly in the population), where a significant chi-square value indicates a significant deviation between the model and data (and thus that the model does not hold exactly in the population) | • Has a clear interpretation • Provides a convenient decision rule | • Highly dependent on sample size, i.e., with increasing sample size and equal degrees of freedom the chi-square value increases • Tends to favor highly parameterized (i.e., complex) models |

Root-mean-square error of approximation (RMSEA) | ||

| | • Confidence intervals are available • Can be used to provide a ‘test of close fit’ • Takes into account model complexity • Less sensitive to sample size | • When N or df is relatively small, the index is expected to be uninformative [37] |

Comparative fit index (CFI) | ||

The CFI [38] gives an indication of model fit based on model comparison (compared to the independence model in which all measured variables are uncorrelated). The CFI ranges from zero to one, and as a general rule of thumb, values above 0.95 are indicative of relatively ‘good’ model fit [24]. The Tucker-Lewis index (TLI [39]; also referred to as the non-normed fit index (NNFI)) is a similar comparative fit index | • Relatively unaffected by sample size • Takes into account model complexity | • Does not provide a test of model fit • Sensitive to the size of the observed covariances |

Expected cross-validation Index (ECVI) | ||

The ECVI [40] is a measure of approximate fit that is especially useful for the comparison of different models for the same data, where the model with the smallest ECVI indicates the model with the best fit. It provides the same ranking of models as the Akaike Information Criterion (AIC; [41]); the so-called “Bayesian” Information Criterion (BIC; [42, 43]) is a related criterion with a stronger, sample-size-dependent penalty on model complexity | • Confidence intervals are available • Takes into account model complexity | • Does not provide a test of model fit • Cannot be used to evaluate model fit of a single model |

Difference in model fit | ||

The chi-square difference test | ||

The chi-square values of two nested models (i.e., where the second model can be derived from the first model by imposing restrictions on model parameters) can be compared to test the difference in ‘exact’ fit. A significant difference in chi-square values indicates that the less restricted model fits the observed data significantly better than the more restricted model, or in other words, the more restricted model leads to a significant deterioration in model fit | • Has a clear interpretation • Provides a convenient decision rule | • Highly dependent on sample size (see above) • Tends to favor highly parameterized (i.e., complex) models |

ECVI difference test | ||

The difference in ECVI values of two nested models may be used to test the equivalence in approximate model fit, where a value that is significantly larger than zero indicates that the more restricted model has significantly worse approximate fit. Similarly, the difference between two models’ AICs (or BICs) can be used | • Takes into account model complexity | • Stringent evaluation of the performance of the ECVI for the comparison of nested models is lacking |

CFI difference | ||

It has been proposed that the difference between CFI values can be used to evaluate the difference in model fit between two nested models [44]. As a rule of thumb, CFI difference values larger than 0.01 are taken to indicate that the more restricted model should be rejected | • Simple to apply | • Cannot be used to test whether the difference in model fit is significant |
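
Several of the indices in the table can be computed directly from the chi-square values that SEM software reports. The sketch below uses commonly cited formulas for the RMSEA and CFI, together with a chi-square difference test and a CFI difference. All fit values are hypothetical and are not results from the article. The helper `chi2_sf` is a simple series approximation written for this illustration so that the block needs only the standard library. Note that some programs use N rather than N − 1 in the RMSEA denominator, so values may differ slightly from software output.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate (series expansion)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    for n in range(1, 10000):
        term *= z / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))


def rmsea(chi2, df, n):
    """Point estimate of the RMSEA, using N - 1 in the denominator."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2, df, chi2_base, df_base):
    """Comparative fit index relative to the independence (baseline) model."""
    d_model = max(chi2 - df, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base


def chi2_difference(chi2_restricted, df_restricted, chi2_free, df_free):
    """Chi-square difference test for two nested models."""
    d_chi2 = chi2_restricted - chi2_free
    d_df = df_restricted - df_free
    return d_chi2, d_df, chi2_sf(d_chi2, d_df)


# Hypothetical fit results (not from the article).
n_obs = 500
chi2_base, df_base = 3000.0, 153   # independence model
chi2_free, df_free = 150.0, 102    # measurement model (Step 1)
chi2_rs, df_rs = 200.0, 114        # no-response-shift model (Step 2)

rmsea_free = rmsea(chi2_free, df_free, n_obs)
cfi_free = cfi(chi2_free, df_free, chi2_base, df_base)
cfi_rs = cfi(chi2_rs, df_rs, chi2_base, df_base)
d_chi2, d_df, p_diff = chi2_difference(chi2_rs, df_rs, chi2_free, df_free)

print("RMSEA (Step 1 model):", round(rmsea_free, 3))
print("CFI (Step 1 model):", round(cfi_free, 3))
print("chi-square difference:", d_chi2, "on", d_df, "df, p =", p_diff)
print("CFI difference exceeds 0.01:", (cfi_free - cfi_rs) > 0.01)
```

With these made-up numbers both the chi-square difference test and the CFI difference point to a deterioration in fit under the equality restrictions, which in the procedure described above would lead on to Step 3.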