## Introduction

### The monotonicity assumption in MSA

R_{−i}), where R_{−i} is the number-correct score over all items excluding item i, is used as a proxy for a person’s value on the person characteristic of interest [15]. When M holds, it follows that, apart from sampling fluctuations [3],

H_{i} is the scalability coefficient of item i, #vi denotes the number of violations (the number of times Eq. 1 does not hold), #ac denotes the total number of pairs of restscore groups that are being compared, maxvi is the size of the largest violation, sum denotes the sum of all violations, and zmax and #zsig refer to the normal deviates associated with each violation. An example of how to obtain all these quantities is available in the Online Resource.
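As an illustration of how these quantities combine into a single number, the following is a minimal sketch of a Crit computation. The weights are an assumption: they follow the weighting commonly attributed to Molenaar and Sijtsma, since the defining equation is not reproduced in this section. The function name `crit` and its argument names are hypothetical, and returning 0 for items without violations is likewise an assumption based on the values reported for the empirical example.

```python
import math


def crit(h_i, n_vi, n_ac, maxvi, total, zmax, n_zsig):
    """Combine the violation summaries of one item into a Crit value.

    The weights are assumed (see the text above); they are not taken
    from an equation in this section.
    """
    if n_vi == 0:
        # Assumption: an item without violations is reported with Crit = 0.
        return 0.0
    return (
        50 * (0.30 - h_i)            # penalty or credit for item scalability
        + math.sqrt(n_vi)            # number of violations
        + 100 * n_vi / n_ac          # violations relative to comparisons made
        + 100 * maxvi                # size of the largest violation
        + 10 * math.sqrt(total)      # sum of all violations
        + 1000 * total / n_ac        # sum relative to comparisons made
        + 5 * zmax                   # largest normal deviate
        + 10 * math.sqrt(n_zsig)     # number of significant violations
        + 100 * n_zsig / n_ac        # significant violations per comparison
    )


# Rounded inputs for the first GHQ-12 item in the empirical example.
item_1 = crit(h_i=0.51, n_vi=2, n_ac=33, maxvi=0.08, total=0.11,
              zmax=6.86, n_zsig=2)
```

With these rounded inputs the sketch lands within two points of the Crit value of 65 reported for that item; the remaining gap is plausibly due to rounding of the tabulated quantities.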

### The IIO assumption in MSA

R_{−ij} is the number-correct score over all items excluding items i and j. If Eq. 3 does not hold in the sample, then at least some items may violate the assumption of invariant item ordering.
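The restscores used in both checks can be computed in the same way. The sketch below assumes binary responses stored as one list per person; the function name `rest_scores` and the toy data are hypothetical and serve only to show the exclusion of one item (for M) or two items (for IIO).

```python
def rest_scores(responses, exclude):
    """Number-correct score per person, leaving out the items in `exclude`.

    `responses` holds one list of 0/1 item scores per person and
    `exclude` holds zero-based item indices.
    """
    skip = set(exclude)
    return [
        sum(score for k, score in enumerate(person) if k not in skip)
        for person in responses
    ]


# Toy data: three persons, four items.
responses = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
]

r_minus_i = rest_scores(responses, exclude=[0])      # restscore without item 0
r_minus_ij = rest_scores(responses, exclude=[0, 1])  # without items 0 and 1
```

Persons are then grouped on these restscores, and the item proportions are compared across groups (for M) or across the two items within groups (for IIO).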

### Aim of the study

## Simulation setup

### Independent variables

The independent variables were the type of violation, the number of assumption-violating items (I_{misfit}), scale quality, and sample size (N).

#### Type of violation

I_{misfit} items were generated to intersect with the remaining I − I_{misfit} items by setting their slope either higher or lower than the common slope of all fitting items (see the Online Resource for details and Fig. 4 for an illustration).

#### Number of assumption-violating items

I_{misfit} for both studies: 1, 3, and 5. Thus, either 10%, 30%, or 50% of the items in the scale violated either the M or the IIO assumption.

#### Scale quality

Type of violation and sample size | False positive rate ^{a}, I _{misfit} = 1 | False positive rate ^{a}, I _{misfit} = 3 | False positive rate ^{a}, I _{misfit} = 5 | True positive rate (power) ^{b}, I _{misfit} = 1 | True positive rate (power) ^{b}, I _{misfit} = 3 | True positive rate (power) ^{b}, I _{misfit} = 5 |
---|---|---|---|---|---|---|
Quadratic IRFs | | | | | | |
N = 100 | < 0.1 | < 0.1 | < 0.1 | 1.2 | 0.7 | 0.5 |
N = 500 | < 0.1 | < 0.1 | 0.1 | 12.1 | 9.8 | 6.5 |
N = 1,000 | < 0.1 | < 0.1 | < 0.1 | 10.7 | 6.7 | 4.9 |
Unimodal IRFs | | | | | | |
N = 100 | < 0.1 | < 0.1 | 0.3 | 5.2 | 4.0 | 1.8 |
N = 500 | < 0.1 | 0.2 | 5.6 | 99.2 | 97.6 | 78.7 |
N = 1,000 | < 0.1 | 0.1 | 5.4 | 99.5 | 99.6 | 91.3 |
Reversed IRFs | | | | | | |
N = 100 | < 0.1 | < 0.1 | 2.0 | 7.2 | 4.9 | 1.9 |
N = 500 | < 0.1 | 1.7 | 81.5 | 99.8 | 99.9 | 80.6 |
N = 1,000 | < 0.1 | 3.1 | 87.5 | 100.0 | 100.0 | 86.7 |

#### Sample size

### Outcome variable

I_{misfit}, and we computed the false positive and true positive (power) rates. The false positive rate was defined as the percentage of cases in which an item was generated to comply with the model but was detected as misfitting (i.e., had a Crit ≥ 80). The true positive rate, or the power of Crit to detect misfit, was defined as the percentage of cases in which an item was correctly detected as misfitting, that is, the item was generated to violate the M or the IIO assumption and had a Crit ≥ 80.
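The two rates defined above can be sketched as follows. This is a minimal illustration rather than the simulation code used in the study; the function name `detection_rates`, its arguments, and the toy values are hypothetical.

```python
def detection_rates(crit_values, is_misfit, cutoff=80):
    """Return (false positive rate, power) in percent.

    An item counts as detected when its Crit is at least `cutoff`.
    `is_misfit[k]` is True when item k was generated to violate the model.
    """
    flagged = [value >= cutoff for value in crit_values]
    fitting = [f for f, bad in zip(flagged, is_misfit) if not bad]
    misfitting = [f for f, bad in zip(flagged, is_misfit) if bad]

    def percent(flags):
        # A rate is undefined when there are no items of that kind.
        return 100 * sum(flags) / len(flags) if flags else float("nan")

    return percent(fitting), percent(misfitting)


# Toy replication: four fitting items followed by two misfitting items.
crit_values = [0, 85, 0, 12, 90, 10]
is_misfit = [False, False, False, False, True, True]
fp_rate, power = detection_rates(crit_values, is_misfit)
```

In this toy case one of the four fitting items and one of the two misfitting items reach the cutoff, giving a false positive rate of 25% and a power of 50%.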

I_{misfit}. The false positive rate was defined as the percentage of cases in which #zsig > 0 even though the item was generated to comply with the model. Power was calculated as the percentage of cases in which #zsig > 0 and the item was generated to violate M or IIO. For the analyses we set minvi equal to 0.03 and minsize equal to \(N/10\) for \(N \ge 500\) and to \(\max(N/3, 50)\) for \(N = 100\).
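The minsize rule can be written as a small helper. The function name `minsize_for` is hypothetical, and applying the second branch to every N below 500 is an assumption; the study itself only used N = 100, 500, and 1,000.

```python
MINVI = 0.03  # smallest violation that is counted, as stated in the text


def minsize_for(n):
    """Minimum restscore group size for a sample of n persons."""
    if n >= 500:
        return n / 10
    # Assumption: the rule stated for N = 100 is used for all N < 500.
    return max(n / 3, 50)


minsizes = {n: minsize_for(n) for n in (100, 500, 1000)}
```

Under this rule the minimum group size is 50 for both N = 100 and N = 500, and 100 for N = 1,000.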

### Implementation

## Simulation results

### Crit for violations of monotonicity

I_{misfit} = 0) were very low, with only 0.01% of the Crit values above 80. Moreover, the distribution of Crit, which had a median value of 0 and an interquartile range (IQR)^{1} of 0, was not affected by either scale quality or sample size. Crit values above 0 are most likely random fluctuations. Regarding the false positive rates of #zsig in the I_{misfit} = 0 conditions, we found that 0.09% of the values were larger than 0.

I_{misfit} and N.

I_{misfit} = 5, #zsig had substantially lower power than Crit for unimodal IRFs (N = 500, 1,000) and for reversed IRFs (N = 500).

### Crit for violations of IIO

I_{misfit} IRFs intersected with the IRFs of the (I − I_{misfit}) items, the latter were considered misfitting as well. This is because the Crit coefficient for item i is a summary of, among other quantities, how many times Eq. 3 does not hold in the sample for each pair formed by item i with the remaining items. This led to high false positive rates for the fitting items in the misfit conditions, so it made little sense to interpret those rates. Therefore, we only interpreted the false positive rates in the conditions with I_{misfit} = 0 (RQ2A) and the power of Crit to detect misfit in the conditions in which I_{misfit} = 1, 3, 5 (RQ2B). We compared the false positive rates and power of Crit with the values we obtained for #zsig (Table D2 in the Online Resource).

Condition | N = 100 | N = 500 | N = 1,000 |
---|---|---|---|
^{a} False positive rates, by scale quality | | | |
Unscalable items | 3.2 | 0.4 | 0.1 |
Weak scales | 2.0 | < 0.1 | < 0.1 |
Medium-strong scales | 1.7 | < 0.1 | < 0.1 |
^{b} Power, by number of violating items | | | |
I _{misfit} = 1 | 6.0 | 20.9 | 29.3 |
I _{misfit} = 3 | 5.4 | 16.2 | 22.3 |
I _{misfit} = 5 | 4.6 | 10.3 | 15.0 |

I_{misfit} as for violations of monotonicity, though the overall power for detecting violations of IIO was considerably lower (up to only 30%). Higher power was obtained in larger samples because violations became statistically significant, whereas a decrease in power with relatively many misfitting items was due to lower inter-item correlations (and thus lower H_{i} values).

I_{misfit}. Consequently, for many misfitting items (I_{misfit} = 5) and large samples (N = 500, 1,000), #zsig had considerably higher power to detect misfit than Crit. Nonetheless, the power of #zsig was still low (29.8% for N = 500 and 52.0% for N = 1,000).

## Empirical example: mental health

^{2} Records containing missing data on any of the GHQ-12 items were removed. The first column of Table 3 shows a short version of the GHQ-12 item content. We dichotomized the item responses: the scores “1” and “2” were recoded as “0” and the scores “3” and “4” were recoded as “1”. Also, to avoid issues due to nested data, we randomly sampled a single member from each household in our final dataset. Dichotomizing the item responses and selecting one member per household is an appropriate solution in this methodological context, where the aim was to illustrate some properties of the Crit coefficient on non-clustered, binary data. From a substantive perspective this approach might not be ideal, as it causes loss of information. Researchers who wish to analyze such data using Mokken scale analysis are referred to Koopman et al. [4], who proposed point estimates, standard errors, and test statistics for scalability coefficients for nested data. These authors incorporated their proposed methods into what they called a two-step, test-guided MSA procedure for scale construction.
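The preprocessing steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the record layout (a `household` identifier and an `items` list with `None` for missing responses), the function names, and the fixed seed are all hypothetical.

```python
import random


def dichotomize(score):
    """Recode a four-category item score: 1 and 2 become 0, 3 and 4 become 1."""
    if score in (1, 2):
        return 0
    if score in (3, 4):
        return 1
    raise ValueError(f"unexpected item score: {score!r}")


def prepare(records, seed=0):
    """Drop incomplete records, keep one random member per household,
    and return the dichotomized item scores of the selected members."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable

    # Remove records with missing data on any item.
    complete = [r for r in records if None not in r["items"]]

    # Group the remaining records by household.
    households = {}
    for record in complete:
        households.setdefault(record["household"], []).append(record)

    # Sample a single member per household and recode the responses.
    return [
        [dichotomize(score) for score in rng.choice(members)["items"]]
        for members in households.values()
    ]


records = [
    {"household": 1, "items": [1, 2, 3, 4]},
    {"household": 1, "items": [None, 2, 3, 4]},  # removed: missing response
    {"household": 2, "items": [4, 4, 1, 1]},
]
binary_data = prepare(records)
```

In this toy input each household retains exactly one complete record, so the result does not depend on the random draw.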

Item | ItemH | #ac | #vi | #vi/#ac | maxvi | sum | sum/#ac | zmax | #zsig | Crit |
---|---|---|---|---|---|---|---|---|---|---|
1. Able to concentrate | 0.51 | 33 | 2 | 0.06 | 0.08 | 0.11 | 0.0034 | 6.86 | 2 | 65 |
2. Loss of sleep over worry | 0.48 | 33 | 2 | 0.06 | 0.10 | 0.18 | 0.0053 | 8.74 | 2 | 81 |
3. Playing a useful role | 0.51 | 33 | 2 | 0.06 | 0.08 | 0.12 | 0.0035 | 7.47 | 2 | 69 |
4. Capable of making decision | 0.58 | 33 | 0 | 0.00 | 0.00 | 0.00 | 0.0000 | 0.00 | 0 | 0 |
5. Felt constantly under strain | 0.60 | 33 | 1 | 0.03 | 0.05 | 0.05 | 0.0014 | 5.03 | 1 | 35 |
6. Couldn’t overcome difficulties | 0.59 | 33 | 2 | 0.06 | 0.08 | 0.16 | 0.0048 | 7.47 | 2 | 67 |
7. Able to enjoy day-to-day activities | 0.56 | 33 | 0 | 0.00 | 0.00 | 0.00 | 0.0000 | 0.00 | 0 | 0 |
8. Able to face problems | 0.62 | 33 | 0 | 0.00 | 0.00 | 0.00 | 0.0000 | 0.00 | 0 | 0 |
9. Feeling unhappy and depressed | 0.64 | 33 | 1 | 0.03 | 0.05 | 0.05 | 0.0014 | 5.03 | 1 | 33 |
10. Losing confidence | 0.58 | 33 | 1 | 0.03 | 0.08 | 0.08 | 0.0023 | 6.86 | 1 | 49 |
11. Thinking of self as worthless | 0.63 | 33 | 0 | 0.00 | 0.00 | 0.00 | 0.0000 | 0.00 | 0 | 0 |
12. Feeling reasonably happy | 0.59 | 33 | 3 | 0.09 | 0.10 | 0.17 | 0.0052 | 8.74 | 3 | 85 |