Statistical analysis
The scale scores were summarized with descriptive statistics. NRSS item reduction was done by examining item characteristics using IRT analysis.
We first assessed the two assumptions of IRT: unidimensionality and local independence [20]. Unidimensionality for the 12-item NRSS was previously shown in a bifactor model [14]. Nevertheless, we retested this assumption in our sample using exploratory factor analysis (EFA) with minimum residual factoring of the polychoric correlation matrix [21]; specifically, we assessed the Kaiser–Meyer–Olkin (KMO) measure and Bartlett’s test of sphericity to confirm the appropriateness of the EFA [22]. Then, the number of factors was identified by assessing the scree plot [23]. Unidimensionality was accepted if the first factor explained more than 20–40% of the variance [24, 25] and the ratio of the eigenvalues of the first to the second unrotated factor was greater than 3 [20]. Local independence refers to independence among responses across items conditional on the corresponding latent trait [26]; its presence was accepted if the residual correlations for the items were smaller than 0.25 [27]. After confirming unidimensionality and local independence, we fitted a GPCM.
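As an illustration only, the two numerical acceptance rules described above can be sketched as follows. The eigenvalues and residual correlations are hypothetical values, not study results; the sketch approximates the variance explained by the first factor with its eigenvalue share, and applying the 0.25 cutoff to absolute residual correlations is an assumption.

```python
def unidimensionality_summary(eigenvalues):
    """Return (share of variance on the first factor, first-to-second eigenvalue ratio)."""
    ordered = sorted(eigenvalues, reverse=True)
    first_share = ordered[0] / sum(ordered)
    ratio = ordered[0] / ordered[1]
    return first_share, ratio


def flag_local_dependence(residuals, cutoff=0.25):
    """List item pairs whose absolute residual correlation reaches the cutoff (assumed rule)."""
    return [pair for pair, r in residuals.items() if abs(r) >= cutoff]


# Hypothetical eigenvalues for a 12-item scale (illustrative only).
example_eigenvalues = [6.0, 1.2, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.3, 0.2, 0.1]
first_share, ratio = unidimensionality_summary(example_eigenvalues)

# Hypothetical residual correlations between item pairs (illustrative only).
example_residuals = {
    ("item1", "item2"): 0.08,
    ("item1", "item3"): -0.31,
    ("item2", "item3"): 0.12,
}
flagged = flag_local_dependence(example_residuals)

print(f"first-factor share: {first_share:.2f}, eigenvalue ratio: {ratio:.2f}")
print("pairs suggesting local dependence:", flagged)
```

In the study itself, these quantities came from the EFA of the polychoric correlation matrix rather than from hand-entered values.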
Under the GPCM, we obtained the discrimination parameter (a_i) and the difficulty parameters (b_i) for each item. A higher discrimination parameter value indicates greater ability of the corresponding item to differentiate respondents at different trait levels. In this study, ability (the latent trait) refers to the NRS level. The discrimination parameter often ranges from 0.5 to 2.5, and items with values smaller than 0.4 are recommended for removal [28]. The number of difficulty parameters per item equals the number of response categories minus 1. Under the GPCM, a difficulty parameter is the latent trait level at which the probabilities of endorsing two adjacent response categories are equal [29]. Item characteristic curves (ICCs) were also obtained, with the curve steepness reflecting the discrimination level: greater steepness indicates greater discrimination ability [30].
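To make the role of the parameters concrete, the following minimal sketch computes GPCM category probabilities for a single item using the standard GPCM parameterization. The item parameters are hypothetical and the function name `gpcm_probs` is ours; this is not the code used in the study.

```python
import math


def gpcm_probs(theta, a, b):
    """Category probabilities for one GPCM item.

    theta: latent trait level; a: discrimination; b: list of difficulty
    (threshold) parameters, one fewer than the number of response categories.
    """
    # Cumulative logits: 0 for the lowest category, then running sums of a * (theta - b_v).
    logits = [0.0]
    for b_v in b:
        logits.append(logits[-1] + a * (theta - b_v))
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical item with 5 response categories, hence 4 difficulty parameters.
a_example = 1.2
b_example = [-1.5, -0.5, 0.4, 1.6]

# At theta equal to the second difficulty parameter, categories 1 and 2
# (counting from 0) are equally likely.
probs_at_b2 = gpcm_probs(b_example[1], a_example, b_example)
print([round(p, 3) for p in probs_at_b2])
```

Evaluating `gpcm_probs` over a grid of trait levels gives the category curves for the item; a larger `a` makes those curves steeper.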
To examine differential item functioning (DIF) by gender, that is, to determine whether any items were answered differently by male and female participants with the same trait level [31], similarity of slopes and intercepts by gender was tested using an iterative Wald approach, with a significant p-value indicating DIF [32]. Specifically, the Wald-2 approach was used to identify the anchor items, that is, items showing invariance across groups. To better control the type I error rate, we also adopted the MaxA5 method, which uses the 5 items with the largest discrimination parameters as the anchor items [33], before the Wald-1 approach was iteratively used to test for DIF items [32]. The female group was set as the reference group and the male group as the focal group.
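The logic of a Wald comparison and of MaxA5 anchor selection can be sketched in simplified form. The sketch below tests a single parameter between two independently estimated groups (1 degree of freedom), whereas the procedure described above tests slopes and intercepts jointly; all estimates, standard errors and function names are hypothetical.

```python
import math


def wald_test(est_reference, se_reference, est_focal, se_focal):
    """Wald chi-square (1 df) and p-value for one parameter compared across two groups."""
    diff = est_focal - est_reference
    w = diff * diff / (se_reference ** 2 + se_focal ** 2)
    # Upper-tail probability of a chi-square variable with 1 df.
    p = math.erfc(math.sqrt(w / 2.0))
    return w, p


def max_a_anchors(discriminations, n_anchors=5):
    """Pick the items with the largest discrimination parameters as anchors."""
    ranked = sorted(discriminations, key=discriminations.get, reverse=True)
    return ranked[:n_anchors]


# Hypothetical slope estimates for one item: reference (female) vs. focal (male) group.
w_stat, p_value = wald_test(1.20, 0.10, 1.60, 0.12)
print(f"Wald = {w_stat:.2f}, p = {p_value:.4f}")

# Hypothetical discrimination estimates for seven items.
slopes = {"i1": 1.1, "i2": 0.7, "i3": 1.9, "i4": 1.4, "i5": 0.5, "i6": 1.6, "i7": 1.2}
anchors = max_a_anchors(slopes)
print("anchor items:", anchors)
```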
In addition, we obtained item information, test information and the average standard error (SE). Test information, which is the sum of the item information for all items, provides evidence on how accurately the test estimates a latent trait over the entire range of trait levels. The more information the test provides at a particular trait level, the higher the precision of ability estimation and the higher the reliability [34]. Test information was obtained over the entire range of latent trait levels (θ), as well as over the commonly used range of (−3, 3), to avoid potentially inflated information due to the presence of participants with extremely high or low trait levels [35]. The SE refers to the standard error of latent trait estimates, which indicates the amount of information unexplained by the items being considered. It is independent of the distribution of scores in the obtained sample [36]. In this study, it was taken as the average of the SEs across latent trait levels.
We tried two item selection approaches that have been adopted in the literature. In the first approach, items with discrimination < 0.4 were removed, followed by items that showed DIF by gender. The process then continued with the assessment of test information and the SE of measurement: the item that carried the lowest item information was removed, and test information and SE were assessed; then the item with the next-lowest item information was removed, and test information and SE were assessed again. We continued removing items until there was a relatively substantial reduction in test information and increase in SE. Finally, this shortened NRSS was tested again for item discrimination and difficulty using IRT.
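The information-based removal step can be sketched as a simple loop. The text does not quantify a "relatively substantial" reduction, so the 10% stopping threshold and the minimum of 3 items used here are purely hypothetical assumptions, as are the item information values; the earlier screening on discrimination and DIF is not shown.

```python
def stepwise_reduce(item_info, max_relative_loss=0.10, min_items=3):
    """Drop the least informative item until the next removal would cost too much.

    item_info maps item name to its information (e.g. summed over a theta grid).
    max_relative_loss and min_items are hypothetical stopping rules.
    """
    kept = dict(item_info)
    removed = []
    while len(kept) > min_items:
        weakest = min(kept, key=kept.get)
        # Test information is the sum of item information, so removing an item
        # lowers test information by exactly that item's contribution.
        loss = kept[weakest] / sum(kept.values())
        if loss > max_relative_loss:
            break
        removed.append(weakest)
        del kept[weakest]
    return sorted(kept), removed


# Hypothetical item information values (illustrative only).
example_info = {
    "item1": 4.0, "item2": 3.5, "item3": 0.3,
    "item4": 0.5, "item5": 3.0, "item6": 2.7,
}
kept_items, removed_items = stepwise_reduce(example_info)
print("kept:", kept_items)
print("removed, in order:", removed_items)
```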
In the second approach, optimal test assembly (OTA) was adopted after excluding the items showing DIF. For each fixed number of items between 3 and 12, the set of NRSS items that maximized the total test information over five anchor points (θ: −3, −1, 0, 1, 3) based on the GPCM was first obtained using the branch-and-bound algorithm [37]. Then, the shortened version was taken as the smallest set of items that satisfied three criteria: (1) the correlation between the factor scores (as well as summed scores) of the shortened version and those of the 12-item version should be at least 0.95; (2) the convergent validity correlation between the factor scores (as well as summed scores) and PSQI should be within a tolerance of 0.05 when compared with that of the 12-item version; and (3) the Cronbach’s alpha should be at least 95% of that of the 12-item version.
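The selection step for a fixed number of items can be illustrated with an exhaustive search, which we substitute here for the branch-and-bound algorithm because the number of candidate subsets of a 12-item scale is small. The item information values at the five anchor points are hypothetical, and the objective is assumed to be the plain sum of information over the anchors.

```python
from itertools import combinations


def best_subset(info, size):
    """Return the subset of `size` items with the largest information summed over anchors.

    info maps item name to a list of item information values, one per anchor point.
    Exhaustive search is used here in place of branch-and-bound.
    """
    def total(subset):
        return sum(sum(info[name]) for name in subset)

    return max(combinations(sorted(info), size), key=total)


# Hypothetical item information at theta = -3, -1, 0, 1, 3 (illustrative only).
example_info = {
    "item1": [0.2, 0.9, 1.1, 0.8, 0.1],
    "item2": [0.1, 0.4, 0.6, 0.5, 0.1],
    "item3": [0.3, 1.2, 1.4, 1.0, 0.2],
    "item4": [0.1, 0.3, 0.3, 0.2, 0.1],
}

for size in (2, 3):
    print(size, best_subset(example_info, size))
```

With an unconstrained sum as the objective, the best subset is simply the items with the largest summed information; a dedicated solver becomes useful once additional constraints are imposed on the assembled form.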
The obtained shortened versions were compared with the original 12-item version in terms of test information, Cronbach’s alpha, and convergent validity.
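Two of the quantities used in this comparison, Cronbach's alpha and the correlation between summed scores of a shortened and the full version, can be computed as in the sketch below. The response matrix is hypothetical (six respondents, four items), and treating the first three items as the shortened version is an arbitrary choice for illustration.

```python
from statistics import mean, variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a respondents-by-items matrix of item scores."""
    k = len(responses[0])
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical responses: rows are respondents, columns are items.
full = [
    [0, 1, 1, 0],
    [1, 1, 2, 1],
    [2, 2, 2, 1],
    [3, 2, 3, 2],
    [4, 3, 4, 3],
    [2, 1, 2, 2],
]
short = [row[:3] for row in full]  # arbitrary shortened version: first three items

alpha_full = cronbach_alpha(full)
alpha_short = cronbach_alpha(short)
r_summed = pearson_r([sum(r) for r in short], [sum(r) for r in full])

print(f"alpha (full) = {alpha_full:.3f}, alpha (short) = {alpha_short:.3f}")
print(f"summed-score correlation = {r_summed:.3f}")
print("alpha criterion met:", alpha_short >= 0.95 * alpha_full)
print("correlation criterion met:", r_summed >= 0.95)
```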
The EFA was conducted with the package “psych” [38], while DIF was tested with the package “mirt” [39]. The R package “ltm” was used to run the OTA procedure [40], which was implemented through the package “lpSolveAPI” [41]. All of the packages were run in RStudio 1.1.383. SEs for each latent trait level were obtained in IRTPRO (4.2 Student version).