## Introduction

## Methods

### Defining and estimating MWIC thresholds on the latent scale

### Estimating MWIC thresholds on the observed PROM scale

### Model assumptions and test of model fit

χ² test [24] to polytomous items. Misfit may motivate the use of a more general IRT model (such as the nominal categories model [25]) for some items. (2) The PROM item parameters are assumed to be the same at time 1 and time 2 (i.e., no item response shift). This assumption can be tested one item at a time using a likelihood ratio test comparing a model that constrains the item parameters for that item to be equal over time with a model without these constraints. If significant differences are found for some items, the item parameters for these items can be allowed to differ over time. (3) The simplest form of the LIRT model assumes local item independence across time. This can be tested by comparing models with and without local dependence using a likelihood ratio test. The magnitude of local dependence can be evaluated from the discrimination parameters for the local dependence latent variables. If significant, local dependence can be retained in the model. (4) The discrimination parameter for the TR item on \(\theta_{1}\) is assumed to be of the same magnitude as, but of opposite sign to, the discrimination parameter for the TR item on \(\theta_{2}\). This can be tested by comparing models with and without constraints on the discrimination parameters for the TR item. Differences in the magnitude of the discrimination parameters for \(\theta_{1}\) and \(\theta_{2}\) can be caused by present state bias [26]. (5) Finally, we assume that the TR item measures change in the same construct that is measured by the PROM items and that the LIRT model fits the TR item well. This can be evaluated by estimating the expected proportion of respondents indicating improvement (i.e., answering 1) on the TR item at each level of the observed PROM change score. These expected proportions can be derived by simulation based on the estimated model parameters and then compared with the observed proportions of positive answers (similar to the approach of Orlando and Thissen [24]).
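As a rough illustration of the simulation-based check in assumption (5), the expected proportions could be computed along the following lines. This is a minimal sketch, not the authors' implementation: it uses hypothetical parameter values (loosely inspired by the follow-up 1 estimates) and dichotomous 2PL items instead of the polytomous PROM items.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000          # simulated respondents
n_items = 12        # PROM items (dichotomous here for simplicity)

# Hypothetical item and structural parameters (illustration only)
a = rng.uniform(1.0, 2.5, n_items)      # discriminations, equal at both times
b = rng.uniform(-1.5, 1.5, n_items)     # difficulties
alpha_tr, beta_tr = 1.6, 0.56           # assumed TR item parameters
rho, sd2 = 0.58, 1.2                    # corr(theta1, theta2) and SD(theta2)

# Correlated latent traits at time 1 and time 2
theta1 = rng.standard_normal(n)
theta2 = sd2 * (rho * theta1 + np.sqrt(1 - rho**2) * rng.standard_normal(n))

def sum_score(theta):
    """Simulate 2PL item responses and return the PROM sum score."""
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
    return (rng.random((n, n_items)) < p).sum(axis=1)

change = sum_score(theta2) - sum_score(theta1)

# The TR item loads with opposite signs on theta1 and theta2, so its
# positive-response probability depends on the latent change
p_tr = 1.0 / (1.0 + np.exp(-alpha_tr * ((theta2 - theta1) - beta_tr)))

# Expected proportion "improved" at each observed change-score level,
# to be compared with the observed proportions in the data
for c in range(-3, 4):
    grp = p_tr[change == c]
    if grp.size:
        print(f"change = {c:+d}: expected P(TR = 1) = {grp.mean():.2f}")
```

The expected proportions increase with the observed change score; large gaps between these values and the observed proportions at some score levels would flag misfit of the TR item.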

### Analysis of example dataset

## Results

| | Follow-up 1 | | Follow-up 2 | | Follow-up 3 | |
|---|---|---|---|---|---|---|
| | Est | (95% CI) | Est | (95% CI) | Est | (95% CI) |
| *Descriptive information* | | | | | | |
| Mean PROM change score (raw score) | −0.20 | (−0.56: 0.17) | 3.45 | (2.99: 3.90) | 10.24 | (9.79: 10.7) |
| Mean PROM change score (ES) | −0.01 | (−0.03: 0.01) | 0.19 | (0.17: 0.22) | 0.58 | (0.55: 0.60) |
| % improved | 34.5% | (32.4%: 36.5%) | 51.1% | (48.9%: 53.3%) | 72.0% | (70.0%: 73.9%) |
| TR*PROM change score polychoric correlation | 0.57 | (0.53: 0.61) | 0.58 | (0.54: 0.62) | 0.61 | (0.57: 0.65) |
| *LIRT parameter estimates* | | | | | | |
| \(\theta_{1}\) mean ^{1} | 0 | | 0 | | 0 | |
| \(\theta_{1}\) SD ^{1} | 1 | | 1 | | 1 | |
| \(\theta_{2}\) mean/\(d\theta\) mean ^{2} | −0.02 | (−0.08: 0.03) | 0.57 | (0.50: 0.65) | 1.95 | (1.84: 2.07) |
| \(\theta_{2}\) SD | 1.20 | (1.15: 1.26) | 1.45 | (1.38: 1.53) | 1.70 | (1.61: 1.80) |
| \(d\theta\) SD | 1.03 | (0.98: 1.08) | 1.47 | (1.40: 1.54) | 1.81 | (1.72: 1.91) |
| \(\theta_{1}\)*\(\theta_{2}\) correlation | 0.58 | (0.54: 0.61) | 0.33 | (0.28: 0.37) | 0.18 | (0.13: 0.23) |
| \(\alpha_{TR}\) | 1.63 | (1.41: 1.86) | 0.98 | (0.86: 1.10) | 0.94 | (0.83: 1.06) |
| \(\beta_{TR}\) | 0.56 | (0.48: 0.65) | 0.50 | (0.39: 0.61) | 0.46 | (0.31: 0.61) |
| *MWIC estimates (PROM score metric) by different methods* | | | | | | |
| LIRT (Median) ^{3} | 4 | (3: 4) | 3 | (3: 4) | 3 | (2: 4) |
| LIRT (Mean) | 3.78 | (3.31: 4.44) | 3.44 | (2.61: 4.29) | 3.20 | (1.90: 4.46) |
| Mean change ^{4} | 2.74 | (1.87: 3.61) | 2.59 ^{5} | (1.44: 3.73) | 7.81 | (6.37: 9.25) |
| MWIC–ROC analyses ^{3} | 2 | (0: 2) | 4 | (3: 5) | 10 | (8: 10) |
| MWIC–Adjusted predictive model ^{4} | 1.78 | (1.40: 2.16) | 3.30 | (2.86: 3.72) | 6.71 | (6.17: 7.18) |
| *MWIC estimates by different methods – effect sizes and (% of score range)* | | | | | | |
| LIRT (Median) | 0.52 | (11%) | 0.39 | (8%) | 0.39 | (8%) |
| LIRT (Mean) | 0.49 | (11%) | 0.44 | (10%) | 0.41 | (9%) |
| Mean change | 0.35 | (8%) | 0.33 ^{5} | (7%) | 1.01 | (22%) |
| MWIC–ROC analyses | 0.26 | (6%) | 0.52 | (11%) | 1.29 | (28%) |
| MWIC–Adjusted predictive model | 0.23 | (5%) | 0.43 | (9%) | 0.87 | (19%) |
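The effect-size and percent-of-range conversions in the last panel are simple ratios of a threshold to a scale constant. A minimal sketch, assuming for illustration a PROM SD of about 7.75 points and a score range of 36 points (values consistent with the ratios in the table but not stated explicitly here):

```python
def mwic_effect_size(mwic: float, sd: float) -> float:
    """Express an MWIC threshold as an effect size (threshold / SD)."""
    return mwic / sd

def mwic_pct_of_range(mwic: float, score_range: float) -> float:
    """Express an MWIC threshold as a percentage of the score range."""
    return 100 * mwic / score_range

# Follow-up 3, ROC-based MWIC of 10 raw-score points (assumed SD and range):
print(round(mwic_effect_size(10, 7.75), 2))   # 1.29
print(round(mwic_pct_of_range(10, 36)))       # 28
```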

χ² item-level fit tests did not show significant misfit. Out of 48 item-level fit tests (12 items at 4 time points), the lowest P value was 0.027 and only two P values were below 0.05. We did not find any indication of significant item response shift. Modeling of local item dependence across time did not improve model fit at any follow-up time point and was therefore not included in the final models. Equality of the discrimination parameters for the TR item on \(\theta_{1}\) and \(\theta_{2}\) was supported at all three follow-up times. The largest difference was found at follow-up time 3, where separate estimation yielded \(\alpha_{TR1}\) = −1.05 and \(\alpha_{TR2}\) = 0.92 (likelihood ratio statistic = 3.37, DF = 1, P = 0.066).
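The reported P value for the single-constraint likelihood ratio test can be reproduced from the chi-square distribution with 1 degree of freedom. A small stdlib-only sketch (with df = 1, the chi-square tail probability equals the two-sided normal tail at the square root of the statistic, so only `erf` is needed):

```python
from math import erf, sqrt

def lr_test_pvalue(lr_stat: float) -> float:
    """P value for a likelihood ratio statistic with 1 degree of freedom."""
    z = sqrt(lr_stat)
    # two-sided standard-normal tail via the error function
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# The test for equal TR discriminations at follow-up time 3:
print(round(lr_test_pvalue(3.37), 3))  # 0.066
```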

χ² test of fit, which was used to assess TR item fit. Model fit was acceptable at each follow-up time; although the P value was 0.04 at follow-up 3, there were no indications of systematic departures from the model predictions.
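A fit statistic of this kind can be formed by comparing observed and model-expected proportions of "improved" responses within groups defined by the observed change score, in the spirit of Orlando and Thissen's approach. A sketch with entirely hypothetical group sizes and proportions (not the study's data):

```python
# Hypothetical observed/expected proportions of TR = "improved"
# within groups defined by the observed PROM change score
n_k   = [120, 240, 310, 260, 140]          # group sizes (hypothetical)
obs_k = [0.10, 0.25, 0.50, 0.75, 0.90]     # observed proportions
exp_k = [0.12, 0.24, 0.52, 0.73, 0.88]     # model-expected proportions

# Pearson-type statistic summed over score groups; each term scales the
# squared discrepancy by the binomial variance of the expected proportion
x2 = sum(n * (o - e) ** 2 / (e * (1 - e))
         for n, o, e in zip(n_k, obs_k, exp_k))
print(round(x2, 2))  # 2.14
```

The statistic would then be referred to a chi-square distribution with degrees of freedom determined by the number of groups and estimated parameters.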