1. Introduction

Mokken scale analysis (Mokken, 1971; Sijtsma & Molenaar, 2002) is used for scaling items and measuring respondents on an ordinal scale. Mokken scale analysis consists of two parts. The first part is the evaluation of a set of items with ordered scores as a scale according to particular scaling criteria that are related to the monotone homogeneity model (Mokken, 1971; Sijtsma & Molenaar, 2002). This can be done in a confirmatory way for a set of items that are hypothesized to form a scale or in an exploratory way when an experimental set of items is analyzed to find out whether they constitute one or more scales. When none of the items satisfy the criteria of Mokken scale analysis, the result is that no scales can be constructed, but it happens more frequently that one or a few items in the set are unscalable whereas the majority of the items is scalable. The unscalable items are left out of the analysis. The scales that are produced by Mokken scale analysis are referred to as Mokken scales. The second part of Mokken scale analysis takes the scales found in the first part, and investigates several other interesting properties of the monotone homogeneity model that were not assessed explicitly in the first part of the analysis. This second part does not play a role in this study. Mokken scale analysis can be conducted using the stand-alone software package MSP5.0 for Windows (Molenaar & Sijtsma, 2000) and the R package mokken (Van der Ark, 2007).

Mokken scales are defined by means of scalability coefficients (Mokken, 1971, pp. 148–153). The first part of Mokken scale analysis involves the testing of hypotheses about these scalability coefficients and the evaluation of their numerical values. The hypotheses involve testing whether scalability coefficients satisfy the criteria for a Mokken scale (Mokken, 1971, p. 184), and testing whether scalability coefficients are equal across items or across groups. We demonstrate that currently available methods do not allow us to test several interesting hypotheses about the scalability coefficients that are relevant in Mokken scale analysis, and we propose to use the marginal modelling framework for this purpose and also for testing hypotheses for which other solutions already exist (Mokken, 1971).

The paper is organized as follows. First, the principles of marginal modelling are explained. Second, Mokken scale analysis is discussed, including the monotone homogeneity model, the scalability coefficients, and the definition of a scale. Third, the scalability coefficients are discussed and it is shown how these coefficients can be reformulated so that they can be incorporated in marginal models. For the sake of readability, several important but rather cumbersome derivations have been deferred to appendices. Fourth, we give an overview of relevant hypotheses in Mokken scale analysis and we show how these hypotheses can be tested using marginal models. As an example, the marginal models were applied to data from a cognitive balance-task test (Van Maanen, Been, & Sijtsma, 1989). Fifth, the strengths and weaknesses of the marginal modelling approach are discussed, and recommendations are given for its practical use and for future improvements.

2. Marginal Models

Assume that a test consists of J dichotomously scored items, indexed by i and j. The random variable representing the score on item j is denoted by \(X_j\), and its realization by \(x_j\) (\(x_j \in \{0, 1\}\)). A vector containing the J item-score variables is denoted \((X_1, X_2, \ldots, X_J)\). The total score on the test is denoted by \({X_ + } = \sum\nolimits_{j = 1}^J {{X_j}} \). The popularity or the easiness of an item is defined as the probability that a randomly drawn respondent from the population of interest endorses a positively worded statement or answers an item correctly, respectively, and is denoted by \(\pi_j^1\). The probability that a randomly drawn respondent does not endorse a positively worded statement or answers an item incorrectly is denoted by \(\pi_j^0\). The joint probability of scores on \(X_i\) and \(X_j\) is denoted by \(\pi_{ij}^{uv}\) [\(u, v = 0, 1\); \(\pi_{ij}^{uv}\) can assume values for four different score pairs: (0, 0), (0, 1), (1, 0), and (1, 1)]. Without loss of generality, the items are ordered by decreasing popularity or easiness and numbered accordingly, such that

$$\pi _1^1 \ge \pi _2^1 \ge \cdots \ge \pi _J^1$$

Equation (1) arbitrarily defines the most popular item to be item 1, the next most popular item to be item 2, and so on. Equation (1) does not in any way restrict the data. Finally, the test data can be collected in a J-dimensional contingency table with \(L = 2^J\) cells.

Consider the example in Table 1 (upper left-hand panel), which shows the cross classification of J = 2 items in a two-way contingency table. The observed frequencies in the contingency table are denoted by \(n_{ij}^{uv}\) (u, v = 0, 1) and the marginal frequencies are denoted by \(n_i^u\), \(n_j^v\), and n. Assuming a fixed sample size n, let \(m_{ij}^{uv}\) be the theoretically expected frequency satisfying \(m_{ij}^{uv} = n \times \pi _{ij}^{uv}\) (u, v = 0, 1), with marginal frequencies \(m_i^u\), \(m_j^v\), and m = n. Sample estimates of \(m_{ij}^{uv}\) and \(\pi_{ij}^{uv}\) are denoted by \(\hat m_{ij}^{uv}\) and \(\hat \pi_{ij}^{uv}\), respectively. Without any constraints imposed upon the data, \(\hat m_{ij}^{uv} = n_{ij}^{uv}\) and \(\hat \pi_{ij}^{uv} = n_{ij}^{uv}/n\). In Table 1 (upper left-hand panel), \(\hat \pi_i^1 = 58/178 = 0.33\) and \(\hat \pi_j^1 = 44/178 = 0.25\). Because \(\hat \pi_i^1 > \hat \pi_j^1\), item i is assumed to be more popular than item j in the population. The order of the indices i and j in the subscripts of, for example, \(n_{ij}^{uv}\), in general indicates that in the sample item i is more popular than item j.

Table 1 Example of a contingency table with observed frequencies for a dichotomous item pair (upper left-hand panel), the estimated expected frequencies under a marginal model of equal diagonal probabilities (upper right-hand panel), the estimated expected frequencies under a marginal model of homogeneous item popularity (lower left-hand panel), and the estimated expected frequencies under a marginal model with γ = .8 (lower right-hand panel).

Marginal models for categorical data (Bartolucci & Forcina, 2002; Bartolucci, Forcina, & Dardanoni, 2001; Bergsma, 1997a; Bergsma & Rudas, 2002; Lang & Agresti, 1994; Rudas & Bergsma, 2004) constitute a family of models that impose restrictions on certain marginals (i.e., subsets) of contingency tables. These restrictions can have several forms. To illustrate this, we take the contingency table in the upper left-hand panel of Table 1 as a starting point.

The first example of a marginal model imposes equality constraints on two cell frequencies by hypothesizing that \(\pi_{ij}^{00} = \pi_{ij}^{11}\). Estimation of this marginal model of equal diagonal probabilities yields estimated expected frequencies \(\hat m_{ij}^{uv}\) that are as close as possible to the observed frequencies \(n_{ij}^{uv}\) (e.g., using a maximum likelihood or least-squares criterion) but with \(\hat m_{ij}^{00} = \hat m_{ij}^{11}\). Table 1 (upper right-hand panel) shows the maximum likelihood estimates of the expected frequencies.

Throughout the paper we assume a multinomial sampling distribution that has the effect of reproducing the sample size n (here m = n = 178) in the marginal model. The fit of the marginal model is evaluated by comparing the observed and expected frequencies using commonly known fit statistics for contingency tables such as the likelihood ratio statistic, \(G^2\) (see Appendix A). Let C denote the number of nonredundant constraints on the frequencies in the contingency table. For large n, \(G^2\) approaches a chi-square distribution with C degrees of freedom (df = C). In the first example, it may be verified that \(G^2 = 64.352\); because there is one nonredundant constraint (i.e., \(m_{ij}^{00} - m_{ij}^{11} = 0\)), it follows that df = 1 and, as a result, p < .0001.
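To make the computation of \(G^2\) concrete, the following sketch (Python with NumPy; not part of the original analysis) fits the equal-diagonal-probabilities model to a hypothetical 2 × 2 table. For this particular constraint the maximum likelihood solution has a closed form: the off-diagonal cells are reproduced and the diagonal cells are replaced by their average. The cell counts are illustrative only, because Table 1 itself is not reproduced in the text, so the resulting \(G^2\) will not match the value reported above.

```python
import numpy as np

# Hypothetical observed 2x2 table (rows: X_i = 0, 1; columns: X_j = 0, 1); illustrative only.
n = np.array([[102.0, 18.0],
              [32.0, 26.0]])

# ML estimates under the single constraint m00 = m11: the off-diagonal cells are reproduced
# and the two diagonal cells are replaced by their average (closed-form solution for this model).
m = n.copy()
m[0, 0] = m[1, 1] = (n[0, 0] + n[1, 1]) / 2.0

# Likelihood ratio statistic; df = 1 because there is one nonredundant constraint.
G2 = 2.0 * np.sum(n * np.log(n / m))
print(round(G2, 3))
```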

The second example of a marginal model imposes equality constraints on the marginal frequencies in Table 1 by hypothesizing that \(\pi_i^1 = \pi_j^1\), which implies \(\pi_i^0 = \pi_j^0\). Estimation of this marginal model of homogeneous item popularity yields estimated expected frequencies \(\hat m_{ij}^{uv}\) such that \(\hat m_i^0 = \hat m_j^0\) and \(\hat m_i^1 = \hat m_j^1\). Table 1 (lower left-hand panel) shows the maximum likelihood estimates of the expected frequencies. It may be verified that \(G^2 = 3.973\); because there is one nonredundant constraint (i.e., \(m_i^0 - m_j^0 = 0\)), it follows that df = 1 and, as a result, p = .0462.

The third example of a marginal model imposes equality constraints on functions of the cell frequencies in Table 1 by restricting Goodman and Kruskal’s (1954) γ coefficient to a value that is hypothesized for the association between two variables in a particular study. This application is interesting because it allows us to illustrate marginal modelling in greater detail than the previous, simpler examples. Coefficient γ can be written as a function of the expected cell frequencies,

$$\gamma = {{m_{ij}^{00}m_{ij}^{11} - m_{ij}^{01}m_{ij}^{10}} \over {m_{ij}^{00}m_{ij}^{11} + m_{ij}^{01}m_{ij}^{10}}}$$

Bergsma and Croon (2005) described several interesting restrictions on γ that can be estimated using marginal models. A simple restriction is the arbitrary equality constraint γ = .8. For this marginal model the expected frequencies \(m_{ij}^{uv}\) (u, v = 0, 1) are estimated under the constraint that γ = .8. Table 1 (lower right-hand panel) shows the maximum likelihood estimates of the expected frequencies. It may be verified that \(G^2 = 3.207\); because there is one nonredundant constraint (i.e., γ − .8 = 0), it follows that df = 1 and, as a result, p = .0733.
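For constraints without a closed-form solution, such as γ = .8, the expected frequencies can be estimated numerically. The sketch below uses a generic constrained optimizer (SciPy's SLSQP) rather than the algorithm of Appendix A, and hypothetical cell counts, so it only illustrates the idea of maximizing the multinomial likelihood subject to g(m) = 0 and computing \(G^2\).

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical observed counts (n00, n01, n10, n11); illustrative only.
n = np.array([102.0, 18.0, 32.0, 26.0])
N = n.sum()

def gamma(p):
    # Goodman-Kruskal gamma written in the cell probabilities (or frequencies).
    return (p[0] * p[3] - p[1] * p[2]) / (p[0] * p[3] + p[1] * p[2])

def neg_loglik(p):
    # Multinomial log-likelihood (up to a constant), to be minimized.
    return -np.sum(n * np.log(p))

constraints = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},   # probabilities sum to 1
               {"type": "eq", "fun": lambda p: gamma(p) - 0.8}]  # the marginal-model constraint

res = minimize(neg_loglik, x0=n / N, bounds=[(1e-6, 1.0)] * 4,
               method="SLSQP", constraints=constraints)
m_hat = N * res.x                          # estimated expected frequencies
G2 = 2.0 * np.sum(n * np.log(n / m_hat))   # df = 1 (one nonredundant constraint)
print(np.round(m_hat, 2), round(G2, 3))
```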

In general, marginal models can be applied to multiway contingency tables with L cells. Let n be the (L × 1) vector of observed frequencies in the contingency table, and let m be the (L × 1) vector of expected frequencies given the marginal model. It is assumed that the order of the elements in both n and m corresponds to the following ordering of the item-score patterns collected in the L × J matrix R, defined as

$${\rm{R}} = \left( {\matrix{ 0 & 0 & 0 & \cdots & 0 & 0 & 0 \cr 0 & 0 & 0 & \cdots & 0 & 0 & 1 \cr 0 & 0 & 0 & \cdots & 0 & 1 & 0 \cr 0 & 0 & 0 & \cdots & 0 & 1 & 1 \cr 0 & 0 & 0 & \cdots & 1 & 0 & 0 \cr 0 & 0 & 0 & \cdots & 1 & 0 & 1 \cr 0 & 0 & 0 & \cdots & 1 & 1 & 0 \cr \vdots & \vdots & \vdots & {} & \vdots & \vdots & \vdots \cr 1 & 1 & 1 & \cdots & 1 & 1 & 0 \cr 1 & 1 & 1 & \cdots & 1 & 1 & 1 \cr } } \right)$$
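As an aside (not part of the original text), the pattern matrix R in equation (2) is easily generated for arbitrary J; the sketch below lists all \(2^J\) item-score patterns in the required order, with the last column varying fastest.

```python
import itertools
import numpy as np

def pattern_matrix(J):
    """All 2**J dichotomous item-score patterns, ordered as in equation (2):
    the last column (least popular item) varies fastest."""
    return np.array(list(itertools.product([0, 1], repeat=J)))

print(pattern_matrix(3))
# [[0 0 0]
#  [0 0 1]
#  [0 1 0]
#  [0 1 1]
#  [1 0 0]
#  [1 0 1]
#  [1 1 0]
#  [1 1 1]]
```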

Given the ordering with respect to popularity or easiness in equation (1), the scores in the first column of R correspond to the most popular item, the scores in the second column to the next most popular item, and so on, and the scores in the last column correspond to the least popular item. Suppose that the marginal model consists of C nonredundant equality constraints, which are functions of m. The first equality constraint is denoted by g 1(m), the second by g 2(m), and the last by g C (m). Setting each function equal to zero yields g 1(m) = 0, g 2(m) = 0, …, g C (m) = 0. In vector notation these equality constraints can be written as

$${\rm{g}}({\rm{m}}) = \left( {\matrix{ {{g_1}({\rm{m}})} \cr \vdots \cr {{g_C}({\rm{m}})} \cr } } \right) = 0$$

For the first example with respect to equal diagonal probabilities (Table 1, upper right-hand panel), equation (3) equals \({\rm{g(m) = }}{{\rm{g}}_{\rm{1}}}{\rm{(m) = }}m_{ij}^{00} - m_{ij}^{11} = 0\); for the second example with respect to homogeneous item popularity (Table 1, lower left-hand panel), equation (3) equals \({\rm{g(m) = }}{{\rm{g}}_{\rm{1}}}{\rm{(m) = }}m_i^1 - m_j^1 = \left( {m_{ij}^{10} + m_{ij}^{11}} \right) - \left( {m_{ij}^{01} + m_{ij}^{11}} \right) = m_{ij}^{10} - m_{ij}^{01} = 0\); and for the third example that imposes restriction γ = .8 (Table 1, lower right-hand panel), equation (3) equals \({\rm{g(m) = }}{{\rm{g}}_{\rm{1}}}{\rm{(m)}} = {\rm{ }}\left( {m_{ij}^{00}m_{ij}^{11} - m_{ij}^{01}m_{ij}^{10}} \right)/\left( {m_{ij}^{00}m_{ij}^{11} + m_{ij}^{01}m_{ij}^{10}} \right) - .8 = 0\).

Bergsma (1997b) developed syntax for Mathematica (Wolfram, 1999) that produces maximum likelihood estimates and asymptotic standard errors for m. In the process of maximum likelihood estimation, the Jacobian of g(m) with respect to log(m) must be computed (see Appendix A). For different marginal models this Jacobian can have very different forms. Bergsma (1997a, p. 66) proposed to write the constraints in equation (3) in a single general matrix formula using a recursive exp-log notation (see also Kritzer, 1977). Once written in recursive exp-log notation, the derivation of the Jacobian is straightforward (Bergsma, 1997a, p. 68; see also Appendix A), and a simple recursive algorithm, which can be easily implemented in software, suffices to compute the Jacobian irrespective of the marginal model.

Given that A 1,…,A q are q design matrices, the general form of the recursive exp-log notation of a marginal model is

$${\rm{g}}({\rm{m}}) = {{\rm{A}}_q}\exp \left( {{{\rm{A}}_{q - 1}}\log \left( {{{\rm{A}}_{q - 2}} \ldots \exp \left( {{{\rm{A}}_2}\log \left( {{{\rm{A}}_1}{\rm{m}}} \right)} \right)} \right)} \right)$$

For a particular marginal model, the appropriate design matrices must be derived in order to write g(m) in a recursive exp-log notation. There are no explicit rules for deriving design matrices, and the same marginal model can often be written in different recursive exp-log notations. Finding the most parsimonious recursive exp-log notation may require some effort.

For the three examples of marginal models in Table 1, the expected frequencies are collected in the vector \({\rm{m}} = {\left( {m_{ij}^{00},m_{ij}^{01},m_{ij}^{10},m_{ij}^{11}} \right)^{\rm{T}}}\) (the superscript T denotes the transpose). The first example concerning equal diagonal probabilities has one design matrix, which is \({{\rm{A}}_1} = \left( {1\;\;0\;\;0\;\; - 1} \right)\), and the recursive exp-log notation of the model constraints in equation (3) is equal to

$${\rm{g}}\left( {\rm{m}} \right) = {{\rm{g}}_1}\left( {\rm{m}} \right) = {{\rm{A}}_1}{\rm{m}} = \left( {\matrix{ 1 & 0 & 0 & { - 1} \cr } } \right)\left( {\matrix{ {m_{ij}^{00}} \cr {m_{ij}^{01}} \cr {m_{ij}^{10}} \cr {m_{ij}^{11}} \cr } } \right) = m_{ij}^{00} - m_{ij}^{11} = 0$$

The second example with respect to homogeneous item popularity also has one design matrix, which is \({{\rm{A}}_1} = \left( {0\;\;1\;\; - 1\;\;0} \right)\). The recursive exp-log notation of equation (3) is \({{\rm{A}}_1}{\rm{m}} = 0\), which results in \(m_{ij}^{01} - m_{ij}^{10} = 0\), or equivalently \(m_i^1 - m_j^1 = 0\).

For the third example that imposes γ = .8 upon the table, the design matrices were derived by Bergsma and Croon (2005), who showed that γ = A 5.exp(A 4.log(A 3.exp(A 2.log(A 1.m)))), with

$${{\rm{A}}_1} = {{\rm{I}}_{4 \times 4}},{{\rm{A}}_2} = \left( {\matrix{ 1 & 0 & 0 & 1 \cr 0 & 1 & 1 & 0 \cr } } \right),{{\rm{A}}_3} = \left( {\matrix{ 1 & 0 \cr 0 & 1 \cr 1 & 1 \cr } } \right),{{\rm{A}}_4} = \left( {\matrix{ 1 & 0 & { - 1} \cr 0 & 1 & { - 1} \cr } } \right),{{\rm{A}}_{\rm{5}}} = \left( {\matrix{ 1 & { - 1} \cr } } \right)$$

Hence the recursive exp-log notation of equation (3) is

$${\rm{g}}({\rm{m}}) = {{\rm{g}}_{\rm{1}}}({\rm{m}}) = {{\rm{A}}_{\rm{5}}}.\exp \left( {{{\rm{A}}_{\rm{4}}}.\log \left( {{{\rm{A}}_{\rm{3}}}.\exp \left( {{{\rm{A}}_{\rm{2}}}.\log \left( {{{\rm{A}}_{\rm{1}}}.{\rm{m}}} \right)} \right)} \right)} \right) - {\rm{0}}.{\rm{8}} = {\rm{0}}$$

In Appendix A it is shown how maximum likelihood estimates of m are obtained subject to the constraints in equation (3), when these constraints are written in the recursive exp-log notation of equation (4).
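The design matrices in equation (4) are easy to check numerically. The sketch below (illustrative m only) evaluates γ through the recursive exp-log expression with the matrices given above for the third example and compares the result with the direct formula for γ; the two values coincide.

```python
import numpy as np

# Design matrices for gamma (Bergsma & Croon, 2005), as given in the text.
A1 = np.eye(4)
A2 = np.array([[1, 0, 0, 1],
               [0, 1, 1, 0]])
A3 = np.array([[1, 0],
               [0, 1],
               [1, 1]])
A4 = np.array([[1, 0, -1],
               [0, 1, -1]])
A5 = np.array([[1, -1]])

def gamma_explog(m):
    """gamma = A5 exp(A4 log(A3 exp(A2 log(A1 m))))."""
    return (A5 @ np.exp(A4 @ np.log(A3 @ np.exp(A2 @ np.log(A1 @ m)))))[0]

def gamma_direct(m):
    m00, m01, m10, m11 = m
    return (m00 * m11 - m01 * m10) / (m00 * m11 + m01 * m10)

m = np.array([102.0, 18.0, 32.0, 26.0])  # hypothetical (m00, m01, m10, m11)^T
print(gamma_explog(m), gamma_direct(m))  # identical values
```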

3. Mokken Scale Analysis

The main purpose of this study is to use marginal models and the recursive exp-log notation to test hypotheses about scalability coefficients in the context of Mokken scale analysis (Mokken, 1971; Sijtsma & Molenaar, 2002). Before we explain this application, we introduce, in turn, the monotone homogeneity model, the scalability coefficients, the relationships between the monotone homogeneity model and the scalability coefficients, the definition of a scale, two types of Mokken scale analysis, and some existing results for the distribution of the scalability coefficients.

3.1. The Monotone Homogeneity Model

The monotone homogeneity model (Mokken, 1971, Chap. 4; Sijtsma & Molenaar, 2002, pp. 22–23; Sijtsma & Meijer, 2007) is a nonparametric item response theory (IRT) model for ordinal person measurement (related theory was developed, e.g., by Molenaar, 1997; Ramsay, 1991; Scheiblechner, 2007; and Stout, 1990). Before we discuss the assumptions of this model, first we introduce some notation. Let θ denote the latent variable underlying performance on each of the items in the test. Let the probability of obtaining score x j on item j be denoted by P(X j = x j |θ). This conditional response probability is known as the item response function (IRF). Further, let the joint probability of a particular score pattern on the J items in the test be denoted by P(X 1 = x 1, …, X J = x J |θ). The monotone homogeneity model is based on the following three assumptions.

Unidimensionality. The responses to the items are driven by a unidimensional latent variable denoted θ.

Local Independence. The joint distribution of the item scores conditional on θ can be written as the product of the J conditional marginal distributions: \(P({X_1} = {x_1}, \ldots ,{X_J} = {x_J}\mid\theta ) = \prod\nolimits_{j = 1}^J {P({X_j} = {x_j}\mid\theta )} \).

Monotonicity. As latent variable θ increases, the probability of a positive response to an item increases or stays the same across intervals of θ; that is, for two values of θ, say, θ a and θ b , and arbitrarily assuming that θ a < θ b , monotonicity means that \(P({X_j} = 1\mid\theta = {\theta _a}) \le P({X_j} = 1\mid\theta = {\theta _b})\) for j = 1, …, J.

For dichotomous items, the monotone homogeneity model implies the stochastic ordering of latent variable θ by total score X +; that is, for an arbitrary value t of θ, the probability P(θ > t|X + = x +) is nondecreasing in x + (Hemker, Sijtsma, Molenaar, & Junker, 1997; also, see Grayson, 1988). This property guarantees an ordinal person scale: Persons with higher X + scores on average have higher θ values.

Mokken (1971, pp. 119–120) showed that for a J-item test the monotone homogeneity model implies that all interitem covariances or, equivalently, all interitem product-moment correlations, are nonnegative. Let σ ij denote the covariance between items i and j; then, the monotone homogeneity model implies

$${\sigma _{ij}} \ge 0{\rm{\ \ for\ all\ }}i < j$$

Equation (5) is used throughout. Nonnegative interitem covariance is a special case of a more general interitem covariance result, known as conditional association, and proven to be true by Holland and Rosenbaum (1986) under more general conditions—multidimensional latent variables and continuous item scores, and local independence and monotonicity adapted to these conditions. In Holland and Rosenbaum’s (1986) conditional association framework, nonnegative interitem covariance in equation (5) is referred to as pairwise nonnegative association (Ellis & Van den Wollenberg, 1993). Other observable consequences, such as manifest monotonicity (Junker & Sijtsma, 2000), can be used to test the monotonicity assumption, but like conditional association (except pairwise nonnegative association) they do not play a role in this study.
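As a simple data check (not part of the original text), the observable consequence in equation (5) can be inspected directly: all interitem covariances in the sample should be nonnegative (small negative values may still occur by sampling error). The sketch below simulates data under logistic IRFs, which satisfy the monotone homogeneity model, and checks the sample covariances.

```python
import numpy as np

def all_covariances_nonnegative(X, tol=1e-8):
    """Check equation (5) in the sample: every interitem covariance of the
    n x J binary data matrix X should be nonnegative (up to tolerance)."""
    S = np.cov(X, rowvar=False)
    off_diagonal = S[~np.eye(S.shape[0], dtype=bool)]
    return bool(np.all(off_diagonal >= -tol))

rng = np.random.default_rng(1)
theta = rng.normal(size=500)                          # latent variable
deltas = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])        # hypothetical item locations
P = 1.0 / (1.0 + np.exp(-(theta[:, None] - deltas)))  # monotone IRFs
X = (rng.uniform(size=P.shape) < P).astype(int)
print(all_covariances_nonnegative(X))
```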

3.2. Scalability Coefficients

The Guttman (1950) model is the basis of the scalability coefficients H ij , H j , and H (Mokken, 1971; cf. Loevinger, 1948). Given an ordering of the J items according to decreasing popularity (equation (1)), the Guttman model assumes that a respondent who endorses the less popular item in a pair of items also endorses the more popular item. Thus, if \(\pi _i^1 > \pi _j^1\), for any respondent the Guttman model excludes the item-score pattern (X i , X j ) = (0, 1). This item-score pattern is called a Guttman error, and the other three item-score patterns [(0, 0), (1, 0), and (1, 1)] are called conformal patterns. Data that do not contain Guttman errors are in agreement with the Guttman model.

In a 2 × 2 contingency table for the scores on items i and j (with \(\pi _i^1 > \pi _j^1\) and sample size n), the expected number of Guttman errors, denoted \(F_{ij}\), equals \({F_{ij}} = n \times \pi _{ij}^{01}\), and the expected number of Guttman errors under marginal independence, denoted by \(E_{ij}\), equals \({E_{ij}} = n \times \pi _i^0 \times \pi _j^1\). The scalability coefficient for items i and j, denoted by \(H_{ij}\), is computed from

$${H_{ij}} = 1 - {{{F_{ij}}} \over {{E_{ij}}}} = 1 - {{\pi _{ij}^{01}} \over {\pi _i^0 \times \pi _j^1}} = 1 - {{n \times m_{ij}^{01}} \over {m_i^0 \times m_j^1}}$$

For the example in Table 1 (upper left-hand panel), \({\hat F_{ij}} = 18\) and \({\hat E_{ij}} = 29.663\), yielding \({\hat H_{ij}} = .3932\). To facilitate its interpretation, coefficient H ij can be written as a normed covariance (e.g., Sijtsma & Molenaar, 2002, p. 55). Let \(\sigma _{ij}^{\max }\) be the maximum covariance between items i and j, given the marginal distributions of X i and X j . Given that items i and j have positive variance, equation (6) is equal to

$${H_{ij}} = {{{\sigma _{ij}}} \over {\sigma _{ij}^{\max }}}$$

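To illustrate equations (6) and (7), \(H_{ij}\) can be computed directly from a 2 × 2 table of counts; the sketch below is illustrative only, and the diagonal split of the table is just one table consistent with the marginals used in the text.

```python
import numpy as np

def H_ij_from_counts(n00, n01, n10, n11):
    """Item-pair scalability coefficient H_ij (equation (6)) for a 2x2 table in which
    item i (rows) is the more popular item; n01 counts Guttman errors (X_i, X_j) = (0, 1)."""
    N = n00 + n01 + n10 + n11
    F = n01                            # observed number of Guttman errors
    E = (n00 + n01) * (n01 + n11) / N  # expected number of errors under marginal independence
    return 1.0 - F / E

# With the marginals used in the text (n_i^1 = 58, n_j^1 = 44, n = 178) and 18 Guttman errors,
# this reproduces H_ij = 1 - 18/29.663 = .3932; the diagonal split (102, 26) is hypothetical.
print(round(H_ij_from_counts(102, 18, 32, 26), 4))
```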

The scalability coefficient for an individual item j, denoted H j , j = 1, …, J (Mokken, 1971, p. 151), is defined as

$${H_j} = 1 - {{\sum\nolimits_{i \ne j} {{F_{ij}}} } \over {\sum\nolimits_{i \ne j} {{E_{ij}}} }} = 1 - {{n\left( {\sum\nolimits_{i = 1}^{j - 1} {m_{ij}^{01} + \sum\nolimits_{i = j + 1}^J m _{ji}^{01}} } \right)} \over {\sum\nolimits_{i = 1}^{j - 1} {m_i^0m_j^1 + \sum\nolimits_{i = j + 1}^J {m_j^0m_i^1} } }}$$

Coefficient H j can also be written in terms of interitem covariances and corresponding maximum covariances, given the marginal distributions of the item scores, as

$${H_j} = {{\sum\nolimits_{i \ne j} {{\sigma _{ij}}} } \over {\sum\nolimits_{i \ne j} {\sigma _{ij}^{\max }} }}$$

Let rest score R (j) be defined as the total score on the J − 1 items excluding item j; then one can also write (Sijtsma & Molenaar, 2002, p. 57)

$${H_j} = {{{\sigma _{{X_j}{R_{(j)}}}}} \over {\sigma _{{X_j}{R_{(j)}}}^{\max }}}$$

Equation (9) shows that coefficient H j expresses the strength of the relationship between item j and the other items in the test, comparable with a regression coefficient in a regression model.

For a set of J items, Mokken (1971, p. 149) proposed the total-scale coefficient H, which is defined as

$$H = 1 - {{\sum\nolimits_{i = 1}^{J - 1} {\sum\nolimits_{j = i + 1}^J {{F_{ij}}} } } \over {\sum\nolimits_{i = 1}^{J - 1} {\sum\nolimits_{j = i + 1}^J {{E_{ij}}} } }} = 1 - {{n\left( {\sum\nolimits_{i = 1}^{J - 1} {\sum\nolimits_{j = i + 1}^J {m_{ij}^{01}} } } \right)} \over {\sum\nolimits_{i = 1}^{J - 1} {\sum\nolimits_{j = i + 1}^J {m_i^0m_j^1} } }}$$

Coefficient H can also be written in terms of interitem covariances and item rest-score covariances, which results in

$$H = {{\sum\nolimits_{i = 1}^{J - 1} {\sum\nolimits_{j = i + 1}^J {{\sigma _{ij}}} } } \over {\sum\nolimits_{i = 1}^{J - 1} {\sum\nolimits_{j = i + 1}^J {\sigma _{ij}^{\max }} } }} = {{\sum\nolimits_{j = 1}^J {{\sigma _{{X_j}{R_{(j)}}}}} } \over {\sum\nolimits_{j = 1}^J {\sigma _{{X_j}{R_{(j)}}}^{\max }} }}$$

If the data obey a perfect Guttman scalogram, H = 1, but this value is never found in practice.

Sijtsma and Molenaar (2002, Theorem 4.2; see also Hemker, Sijtsma, & Molenaar, 1995) showed that H ij , H j , and H are related such that

$$\mathop {\min }\limits_{i,j} \left( {{H_{ij}}} \right) \le \mathop {\min }\limits_j \left( {{H_j}} \right) \le H \le \mathop {\max }\limits_j \left( {{H_j}} \right) \le \mathop {\max }\limits_{i,j} \left( {{H_{ij}}} \right)$$

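For a complete data matrix, the coefficients and inequality (11) can be checked directly from the covariance formulation. The sketch below uses simulated data and the standard expression for the maximum covariance of two dichotomous items given their marginals, \(\min (\pi _i^1,\pi _j^1) - \pi _i^1\pi _j^1\), which is not spelled out above; it returns all \(H_{ij}\), \(H_j\), and H.

```python
import numpy as np

def scalability_coefficients(X):
    """H_ij, H_j, and H for an n x J binary data matrix X, using the covariance
    formulation with population (biased) covariances."""
    n, J = X.shape
    p = X.mean(axis=0)
    S = (X.T @ X) / n - np.outer(p, p)               # interitem covariances sigma_ij
    Smax = np.minimum.outer(p, p) - np.outer(p, p)   # maximum covariances given the marginals
    np.fill_diagonal(S, 0.0)
    np.fill_diagonal(Smax, 0.0)
    iu = np.triu_indices(J, 1)
    Hij = S[iu] / Smax[iu]                           # K = J(J-1)/2 item-pair coefficients
    Hj = S.sum(axis=1) / Smax.sum(axis=1)            # item coefficients
    H = S[iu].sum() / Smax[iu].sum()                 # total-scale coefficient
    return Hij, Hj, H

rng = np.random.default_rng(0)
theta = rng.normal(size=1000)
deltas = np.array([-1.0, 0.0, 1.0])                  # hypothetical item locations
X = (rng.uniform(size=(1000, 3)) < 1 / (1 + np.exp(-(theta[:, None] - deltas)))).astype(int)
Hij, Hj, H = scalability_coefficients(X)
print(Hij, Hj, H)   # equation (11): min(Hij) <= min(Hj) <= H <= max(Hj) <= max(Hij)
```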

3.3. Relationships Between the Monotone Homogeneity Model and the Scalability Coefficients

The monotone homogeneity model implies observable consequences with respect to the scalability coefficients H ij , H j , and H. These observable consequences are used in data analysis to investigate whether the data support the fit of the monotone homogeneity model (Mokken, 1971; Sijtsma & Molenaar, 2002; Sijtsma & Meijer, 2007).

In particular, Mokken (1971, pp. 148–153; see also Sijtsma & Molenaar, 2002, Theorem 4.3) showed that the monotone homogeneity model implies that

$$\matrix{ {0 \le {H_{ij}} \le 1} & {{\rm{for all }}i < j,} \cr {0 \le {H_j} \le 1} & {{\rm{for all }}j{\rm{, and }}} \cr {0 \le H \le 1} & {} \cr } $$

Thus, negative scalability coefficients are in conflict with the monotone homogeneity model. These observable consequences are the basis of Mokken scale analysis.

3.4. Definition of a Scale and Two Types of Mokken Scale Analysis

3.4.1. Definition of a Scale

A set of items is a scale (Mokken, 1971, p. 184; Molenaar & Sijtsma, 2000; Sijtsma & Molenaar, 2002, p. 68), in this study called a Mokken scale, if, for product-moment correlation ρ and for any constant value 0 < c ≤ 1,

$${\rho _{ij}} > 0{\rm{ }}({\rm{or, equivalently}},{H_{ij}} > 0){\rm{ for all }}i < j,{\rm{ and}}$$
$${H_j} \ge c > 0{\rm{ for all }}j$$

Equation (13) is the first criterion of a Mokken scale, and equation (14) is the second criterion of a Mokken scale. Compared to equations (5) and (12), strict inequality is not crucial here due to continuity of the scales of ρ and H j . Except for the strict inequalities, the monotone homogeneity model implies both equation (13) and H j > 0 (which is part of equation (14)).

However, the monotone homogeneity model does not imply a specific positive value of c. Thus, the inclusion of positive c in the definition of a Mokken scale can be a source of confusion and needs to be explained. To understand the role of positive c, one may note that the monotone homogeneity model, and special cases of this model such as the one-, two-, and three-parameter logistic models, allow items in a scale which have (nearly) flat IRFs. Such items contribute little, if anything, to a reliable person ordering and may even attenuate the reliability of this ordering; thus, these items are unwanted in a scale. The inclusion of a positive c in the definition of a Mokken scale prevents the selection of such items in a scale by rejecting items with H j s which are smaller than c. Thus, Mokken scale analysis aims to produce “high-quality” scales, the definition of which depends on the researcher’s choice of lower bound c.

Mokken (1971, p. 184) proposed to always set c at least to .3. One may note that equation (11) implies that H ≥ min j (H j ); thus, for lower bound c = .3, the total-scale H ≥ .3. The choice of c controls the quality of the individual items in the scale and of the total scale and, therefore, of the total-scale score X + for ordering persons on latent variable θ. Mokken (1971, p. 185) proposed the following rules of thumb for the interpretation of H. A set of items is unscalable for all practical purposes if H < .3; and a scale is considered weak if .3 ≤ H < .4, moderate if .4 ≤ H < .5, and strong if H ≥ .5.

3.4.2. Two Types of Mokken Scale Analysis

Mokken scale analysis can have two forms (Mokken, 1971, pp. 187–199). The first possibility is that the researcher evaluates a given set of J items with respect to the definition of a scale for a chosen value of c. This is confirmatory Mokken scale analysis. The second possibility is to use an automated item selection algorithm (Mokken, 1971, pp. 190–199; Sijtsma & Molenaar, 2002, Chap. 5). This algorithm selects items one by one to obtain one or more scales (depending on the data structure) that agree with the definition of a Mokken scale. In each selection step, the item chosen from the items not already selected is the one that not only agrees with equations (13) and (14) but also produces the greatest total-scale H coefficient with the items already selected in previous steps. This is exploratory Mokken scale analysis.

In the remainder of this paper we discuss the use of marginal modelling for testing hypotheses about the scalability coefficients. The term Mokken scale analysis refers to the use of scalability coefficients for scale construction both in a confirmatory and in an exploratory context.

3.5. Results for the Distribution of the Scalability Coefficients

Results for the distribution of the scalability coefficients are available for the null case (which refers to the null hypothesis that H = 0) and the nonnull case (which refers to the null hypothesis that H = w, where w is some positive constant) (Mokken, 1971, pp. 160–169). Results for the null case (Mokken, 1971, pp. 160–164) are the following. Let S ij be the sample covariance of items i and j, and let S i and S j be the sample standard deviations of items i and j, respectively; then for large n, in the null case, the statistics

$${Z_{ij}} = {{{S_{ij}}} \over {{S_i}{S_j}}}\sqrt {n - 1} $$

,

$${Z_j} = {{\sum\nolimits_{i \ne j} {{S_{ij}}} } \over {{S_j}\sum\nolimits_{i \ne j} {{S_i}} }}\sqrt {n - 1} $$

, and

$$Z = {{\sum\nolimits_{i = 1}^{J - 1} {\sum\nolimits_{j = i + 1}^J {{S_{ij}}} } } \over {\sum\nolimits_{i = 1}^{J - 1} {\sum\nolimits_{j = i + 1}^J {{S_i}{S_j}} } }}\sqrt {n - 1} $$

, converge to a standard normal distribution. In the available software for Mokken scale analysis, H ij = 0 is tested against the alternative that H ij > 0 to decide whether items satisfy the first criterion of a Mokken scale that ρ ij > 0 (equation (13)). Results for the nonnull case yield asymptotic standard errors for ^H (Mokken, 1971, pp. 164–169). These results are not available in current software for Mokken scale analysis.

4. A Marginal Modelling Approach to the Scalability Coefficients

Coefficient H ij can be written in the recursive exp-log notation, which is useful for testing hypotheses involving H ij . Let \({\rm{m}} = {\left( {m_{ij}^{00},m_{ij}^{01},m_{ij}^{10},m_{ij}^{11}} \right)^{\rm{T}}}\), and let A 1 and A 2 be the following design matrices:

$${{\rm{A}}_1} = \left( {\matrix{ 1 & 1 & 1 & 1 \cr 1 & 1 & 0 & 0 \cr 0 & 1 & 0 & 1 \cr 0 & 1 & 0 & 0 \cr } } \right){\rm{ and }}{{\rm{A}}_{\rm{2}}}{\rm{ = }}\left( {\matrix{ 1 & { - 1} & { - 1} & 1 \cr } } \right)$$

Then, H ij in equation (6) equals

$${H_{ij}} = 1 - \exp \left( {{{\rm{A}}_2}\log \left( {{{\rm{A}}_1}{\rm{m}}} \right)} \right)$$

This can be verified by writing the term log(A 1 m) in equation (15) as

$$\log ({{\rm{A}}_1}{\rm{m}}) = \log \left[ {\left( {\matrix{ 1 & 1 & 1 & 1 \cr 1 & 1 & 0 & 0 \cr 0 & 1 & 0 & 1 \cr 0 & 1 & 0 & 0 \cr } } \right) \cdot \left( {\matrix{ {m_{ij}^{00}} \cr {m_{ij}^{01}} \cr {m_{ij}^{10}} \cr {m_{ij}^{11}} \cr } } \right)} \right] = \log \left( {\matrix{ n \cr {m_i^0} \cr {m_j^1} \cr {m_{ij}^{01}} \cr } } \right)$$

, and noting that

$$\exp \left( {{{\rm{A}}_2}\log \left( {{{\rm{A}}_1}{\rm{m}}} \right)} \right) = \exp \left[ {\left( {\matrix{ 1 & { - 1} & { - 1} & 1 \cr } } \right)\log \left( {\matrix{ n \cr {m_i^0} \cr {m_j^1} \cr {m_{ij}^{01}} \cr } } \right)} \right] = {{n \times m_{ij}^{01}} \over {m_i^0 \times m_j^1}}$$

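The equivalence of equation (15) with equation (6) can also be checked numerically; the sketch below (hypothetical expected frequencies) evaluates both expressions with the design matrices A 1 and A 2 given above.

```python
import numpy as np

A1 = np.array([[1, 1, 1, 1],
               [1, 1, 0, 0],
               [0, 1, 0, 1],
               [0, 1, 0, 0]])
A2 = np.array([[1, -1, -1, 1]])

def H_ij_explog(m):
    """Equation (15): H_ij = 1 - exp(A2 log(A1 m)), with m = (m00, m01, m10, m11)^T."""
    return 1.0 - np.exp(A2 @ np.log(A1 @ m))[0]

def H_ij_direct(m):
    """Equation (6), written directly in the expected frequencies."""
    m00, m01, m10, m11 = m
    return 1.0 - m.sum() * m01 / ((m00 + m01) * (m01 + m11))

m = np.array([102.0, 18.0, 32.0, 26.0])  # hypothetical expected frequencies
print(H_ij_explog(m), H_ij_direct(m))    # identical values
```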

In the case of J items, there are K = J(J − 1)/2 item pairs; hence, there are K coefficients H ij . The recursive exp-log notation for the vector H ij = (H 12 , H 13 , …, H J−1,J )T containing all K item-pair coefficients H ij (i < j) is derived in Appendix C.

Based on previous results, a researcher may have reason to believe that for two particular key items in a test H ij = w, with 0 ≤ w < 1, and (s)he may wish to test this hypothesis on a sample from another population. Using the recursive exp-log notation for H ij (equation (15)), it may be verified that the marginal model imposing H ij = w on the contingency table has one nonredundant constraint, which can be written in terms of equation (3) as

$${g_1}(m) = 1 - w - \exp \left( {{{\rm{A}}_2}\log \left( {{{\rm{A}}_1}{\rm{m}}} \right)} \right) = 0$$

For the observed frequencies in Table 1 (upper left-hand panel), choosing w = .5 as an example, the marginal model with constraint H ij = .5 yields the estimated expected frequencies shown in Table 2. This results in \(G^2 = 1.2207\), df = 1, and p = .2692.

Table 2 Estimated expected frequencies for the data in Table 1 under the marginal model imposing H ij = .5 on the table.

Item coefficient H j can be written in a recursive exp-log notation, which is derived in Appendix D for the vector H j = (H 1,H 2,…, H J )T containing all H j s. Total-scale coefficient H can be written in a recursive exp-log notation, which is derived in Appendix E.

5. Hypotheses in Mokken Scale Analysis

The use of marginal modelling for testing hypotheses in Mokken scale analysis is illustrated by means of the binary data from 484 children who were administered a 25-item balance-task test (Van Maanen et al., 1989). It was hypothesized that the tasks could be divided into five dimensionally different subscales based on the type of task. The subscales are named Distance, Weight, Conflict Weight, Conflict Balance, and Conflict Distance. For a convenient presentation in the tables, in each of the five scales the items are numbered 1, …, 5. Table 3 shows the proportions-correct (i.e., the \(\hat \pi _j^1\) values) of the 25 items.

Table 3 \(\hat \pi _j^1\) values for each of the five balance-task scales.

5.1. Testing the First Criterion of a Mokken Scale

The first criterion of a Mokken scale is ρ ij > 0 for all i < j, which is identical to H ij > 0 for all i < j (equation (13)). In this section it is explained how marginal modelling can be used to test the global hypothesis that all K item-pair coefficients H ij are 0. This global test is a novel statistical tool in Mokken scale analysis. To appreciate its usefulness, first we discuss the exploratory analysis and then the confirmatory analysis. In doing this, we only discuss details of exploratory Mokken scale analysis that are relevant here, and skip many other details.

For exploratory Mokken scale analysis, assuming that already r − 1 items have been selected into a scale (and without worrying how this has been accomplished; for the details, see Mokken, 1971, pp. 190–199; Sijtsma & Molenaar, 2002, Chap. 5), the rth candidate item for selection must have positive correlations (or, equivalently, positive pairwise scalability coefficients) with each of the r − 1 items already selected (Mokken, 1971, p. 192, third step). This requirement assures us that the first criterion (equation (13)) of a scale is satisfied for the r items selected thus far. If, for the rth item, each of the r − 1 item-pair coefficients is significantly greater than 0, the first criterion is satisfied, and if this result is also found for other candidate items, each of these items remains in competition to be included in the scale (which of these candidates eventually is the rth item to be selected depends on the second criterion (equation (14)) and other decision rules not discussed here).

The tests of H ij = 0 against H ij > 0 are conducted by testing the marginal independence of X i and X j . This is a simple procedure which can be done with little computational effort. The type I error rate is controlled by a Bonferroni correction, which is very conservative here because the test statistics are dependent, and because tests are accumulated across different steps in the automated item selection algorithm (Mokken, 1971, pp. 196–198).

In confirmatory Mokken scale analysis, the researcher has to test the first criterion for each item pair separately, but here we propose to use a marginal model to test for all K H ij coefficients simultaneously whether they are equal to zero, thus circumventing the Bonferroni correction. Formally, H ij = (H 12, H 13, …, H J−1,J )T contains all K coefficients H ij (i < j). If the global null hypothesis that H ij = 0 is rejected, the researcher has to check next whether the sample values of the item-pair scalability coefficients are positive; that is, whether ^H ij > 0. Only the combination of a rejected global null hypothesis and positive sample H ij s leads to the conclusion that the first criterion (equation (13)) of a Mokken scale is satisfied. If not all sample H ij s are positive, the next step is to identify items that may be rejected from the scale. This is done in the same way as when the global null hypothesis that H ij = 0 is not rejected. We suggest identifying candidate items for rejection by testing for separate item pairs H ij = 0 against the alternative that H ij > 0, just as with the exploratory procedure. Item pairs for which the null hypothesis is not rejected are identified, and for each item involved in such a pair it is counted how often it is involved in negative ^H ij s with other items. Items that are frequently involved in negative sample item-pair scalability coefficients are candidates for removal from the test. We now concentrate on the new global test, based on marginal modelling, that H ij = 0.

Let u K denote a vector of length K that contains 1s, and let A 1 and A 2 be design matrices (derived in Appendix C). In Appendix C it is shown that

$${{\rm{H}}_{ij}} = {{\rm{u}}_K} - \exp \left( {{{\rm{A}}_2}\log \left( {{{\rm{A}}_1}{\rm{m}}} \right)} \right)$$

Hence, the recursive exp-log notation of the K restrictions (see equation (3)) for marginal model H ij = 0 is

$${\rm{g}}({\rm{m}}) = {{\rm{u}}_K} - \exp \left( {{{\rm{A}}_2}\log \left( {{{\rm{A}}_1}{\rm{m}}} \right)} \right) = {0_K}$$

If the marginal model in equation (17) is rejected and if in the sample ^H ij > 0 for all i < j, then the first criterion (equation (13)) for a Mokken scale is met for all J items.

One advantage of this global test is that it does not require a Bonferroni correction. Another advantage is that it allows the first criterion for a Mokken scale to be strengthened, for example, by requiring that all H ij s are greater than a positive value d so as to avoid values of H ij close to 0. Values close to 0 may allow undesirable multidimensionality in a scale, and are not excluded by the second criterion for a Mokken scale, H j ≥ c > 0 for all j (equation (14)). What is a reasonable choice for d? Because, by equation (11), we have that min i,j (H ij ) ≤ H j ≤ max i,j (H ij ), it seems reasonable to choose an a priori lower bound d for H ij smaller than c. In this example, we arbitrarily set d = .1.

Let d K be a vector of length K with all elements equal to d. Then the marginal model equals H ij = d K . Using the recursive exp-log notation for H ij in equation (16), it may be verified that the recursive exp-log notation of the K restrictions (see equation (3)) for this marginal model is

$${\rm{g}}({\rm{m}}) = {{\rm{u}}_K} - \exp \left( {{{\rm{A}}_2}\log \left( {{{\rm{A}}_1}{\rm{m}}} \right)} \right) - {{\rm{d}}_K} = {0_K}$$


The marginal model with d = 0 (equation (17)) and the stronger marginal model with d = .1 (equation (18)) were tested on the balance-scale data. For each balance scale, the ^H ij s and their standard errors, and the likelihood ratio statistic G 2 and corresponding p-value, are shown in Table 4. For d = 0, using α = .05 the null model was rejected for all scales and, in addition, all sample ^H ij s were found to be greater than zero. Thus, the first criterion of the Mokken scale (ρ ij > 0; equation (13)) was assumed to be satisfied. For d = .1, implying the statistical test that simultaneously all H ij > .1, four scales were found to satisfy this more demanding criterion but for the Conflict Balance scale the marginal model in equation (18) was not rejected.

Table 4 Estimated scalability coefficients ^H ij with standard errors between parentheses for each of the five balance-task scales (upper panel); fit statistics (G 2, p-value) for the marginal model defining H ij = 0 for i = 1, …, 4; j = i + 1, …, 5 (middle panel); and fit statistics for the marginal model defining H ij = .1 for i = 1, …, 4; j = i + 1, …, 5 (lower panel).

5.2. Testing the Second Criterion of a Mokken Scale

The second criterion of a Mokken scale is that H j ≥ c > 0 for all j = 1, …, J (equation (14)). The current practice is that for each item the null hypothesis is tested that H j = 0. When this null hypothesis is rejected, it is checked in the data whether ^H j exceeds lower bound c. If for each item the null hypothesis is rejected and ^H j > c for all j, the second criterion for a Mokken scale is assumed to be satisfied. Currently, there is no test available for the null hypothesis that H j = c against the alternative that H j > c and, sometimes, when the automated item selection procedure is used, an item scalability coefficient is greater than c when the item enters the scale, but then drops below c as subsequent items enter the scale (e.g., Sijtsma & Molenaar, 2002, pp. 79–80).

The marginal modelling approach offers a solution. A marginal model may be tested in which, simultaneously, all H j = c. Let H j = (H 1 , …, H J )T contain all H j s, and let c J be a vector of length J with all elements equal to lower bound c. The marginal model is then H j = c J . If the marginal model is rejected and all sample ^H j s exceed c, the second criterion is assumed to be satisfied.

Let A 1, A 2, A 3, and A 4 be design matrices (derived in Appendix D). Appendix D shows that

$${{\rm{H}}_j} = {{\rm{u}}_J} - \exp \left( {{{\rm{A}}_4}\log \left( {{{\rm{A}}_3}\exp \left( {{{\rm{A}}_2}\log \left( {{{\rm{A}}_1}{\rm{m}}} \right)} \right)} \right)} \right)$$

Using the recursive exp-log notation for H j in equation (19), it may be verified that the recursive exp-log notation of the J restrictions (see equation (3)) for the marginal model is

$${\rm{g}}({\rm{m}}) = {{\rm{u}}_J} - \exp \left( {{{\rm{A}}_4}\log \left( {{{\rm{A}}_3}\exp \left( {{{\rm{A}}_2}\log \left( {{{\rm{A}}_1}{\rm{m}}} \right)} \right)} \right)} \right) - {{\rm{c}}_J} = {0_J}$$


The marginal model in equation (20) with c = .3 (which is the default value in software for Mokken scale analysis) and the marginal model with the more demanding criterion c = .4 were tested on the balance-task data. For each balance-task scale, Table 5 shows the estimates of the H j s and their standard errors, and the likelihood ratio statistic G 2 and corresponding p-value. For lower bound c = .3, for four scales the marginal model was rejected. In addition, all the ^H j s exceeded .3. Thus, the four scales meet the second criterion of a Mokken scale. The exception was the Conflict Balance scale, for which the marginal null model was not rejected. Thus, Conflict Balance does not meet the second criterion of a Mokken scale.

Table 5 Estimated scalability coefficients ^H j with standard errors between parentheses for each of the five balance-task scales (upper panel); fit statistics (G 2, p-value) for the marginal model defining H j = .3 for j = 1, …, 5 (middle panel); and fit statistics for the marginal model defining H j = .4 for j = 1, …, 5 (lower panel).

For c = .4, for the Distance scale the marginal model was not rejected, and for the Conflict Balance scale this marginal model was rejected but all ^H j s were smaller than .4. Thus, for these two scales the more demanding second criterion of a Mokken scale was not satisfied. For the other three scales, the null model was rejected and all ^H j s exceeded .4; hence, the more demanding second criterion of a Mokken scale was satisfied.

5.3. Testing the Strength of the Scale

Testing the strength of the scale can be considered equivalent with testing for the total-scale coefficient that H ≤ c against the alternative that H > c. If the null model is rejected for c = .3 and if in the sample ^H > .3, then the scale can be considered to be at least a weak scale; if the null model is rejected for c = .4 and if ^H > .4, then the scale can be considered to be at least a moderate scale; and if the null model is rejected for c = .5 and if ^H > .5, then the scale can be considered to be a strong scale. The statistical test can be performed using the asymptotic standard errors derived by Mokken (1971, pp. 164–169). From the asymptotic standard errors a (1 − α)% confidence interval is constructed, and if c exceeds the upper bound of the confidence interval, the null hypothesis is rejected. This test is not available in the current software.

Alternatively, the test may be conducted using a marginal model. Let A 1, A 2, A 3, and A 4 be design matrices. These matrices are derived in Appendix E. Appendix E shows that H can be written as

$$H = 1 - \exp \left( {{{\rm{A}}_4}\log \left( {{{\rm{A}}_3}\exp \left( {{{\rm{A}}_2}\log \left( {{{\rm{A}}_1}{\rm{m}}} \right)} \right)} \right)} \right)$$

Using equation (21) it can be verified that the recursive exp-log notation of the restriction (see equation (3)) in the null model is

$${{\rm{g}}_1}({\rm{m)}} = 1 - \exp \left( {{{\rm{A}}_4}\log \left( {{{\rm{A}}_3}\exp \left( {{{\rm{A}}_2}\log \left( {{{\rm{A}}_1}{\rm{m}}} \right)} \right)} \right)} \right) - c = 0$$

It may be noted that, in principle, in equation (22) lower bound c may be replaced by any constant w > 0.

The marginal models with c = .3, c = .4, and c = .5 (equation (22)) were tested on the balance-scale data. For each scale, Table 6 shows the estimate of coefficient H and its standard error, and the likelihood ratio statistic G 2 and corresponding p-value. Using the rules of thumb for the interpretation of values of H, Weight and Conflict Distance were strong scales, Conflict Weight a moderate scale, Distance a weak scale, and Conflict Balance was found to be unscalable.

Table 6 For each of the five scales of the balance-task test: the estimated scalability coefficient ^H with standard error between parentheses (first row); and the fit statistics (G 2, p-value) for the marginal models defining H = .3, H = .4, and H = .5.

5.4. Testing Equality of Item Coefficients

Coefficient H j expresses the contribution of item j to the ordering of respondents by means of total score X +. Thus, it can be argued that coefficient H j is a nonparametric IRT analogue to the discrimination power of an item (Van Abswoude, Van der Ark, & Sijtsma, 2004). The marginal modelling framework can be used to test whether the H j s of different items are equal. This may be interesting when one wants to know whether the items are different with respect to their contribution to the accuracy of the person ordering. Large differences may also provide the researcher with indications that different latent variables may drive the responses to different items (Sijtsma & Meijer, 2007). Currently, such a test is not available.

A statistical test for the null hypothesis “H 1 = … = H J ” requires a slight modification of the marginal model in equation (20). Let A 5 be a (J − 1) × J matrix with element (j, j) equal to 1 for j = 1, …, J − 1; and element (j, j +1) equal to −1 for j = 1, …, J −1; the remaining elements are equal to 0. Using equation (19), it may be verified that

$${{\rm{A}}_5}{{\rm{H}}_j} = \left( {\matrix{ {{H_1} - {H_2}} \cr {{H_2} - {H_3}} \cr \vdots \cr {{H_{J - 1}} - {H_J}} \cr } } \right)$$

, which should be equal to 0 J−1 if all H j s are equal. Then using the design matrices A 1, …, A 4 from equation (20) (see Appendix D), the marginal model for equal item coefficients is

$${\rm{g}}({\rm{m}}) = {{\rm{A}}_5}\left( {\exp \left( {{{\rm{A}}_4}\log \left( {{{\rm{A}}_3}\exp \left( {{{\rm{A}}_2}\log \left( {{{\rm{A}}_1}{\rm{m}}} \right)} \right)} \right)} \right)} \right) = {0_{J - 1}}$$

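The only new ingredient in equation (23) relative to equation (20) is the difference matrix A 5. A small sketch (illustrative; it simply implements the definition given above) constructs it for arbitrary J:

```python
import numpy as np

def difference_matrix(J):
    """(J-1) x J matrix A5 with element (j, j) = 1 and (j, j+1) = -1, so that
    A5 @ (H_1, ..., H_J)^T gives the successive differences H_j - H_{j+1}."""
    A5 = np.zeros((J - 1, J))
    idx = np.arange(J - 1)
    A5[idx, idx] = 1.0
    A5[idx, idx + 1] = -1.0
    return A5

print(difference_matrix(5))
# Under the null hypothesis H_1 = ... = H_5, difference_matrix(5) @ Hj is the zero vector.
```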

Using the marginal model in equation (23), the null hypothesis that H 1 = H 2 = H 3 = H 4 = H 5 was tested for each of the five balance-task scales. It may be noted that if all H j s are equal, then equation (11) implies that H = H j . For each scale, Table 7 shows the estimated total-scale H and its standard error, under the marginal model of equal H j s, and the likelihood ratio statistic G 2 and corresponding p-value. For Conflict Weight the null model of equal H j s was rejected. For the other four scales the null model was not rejected, thus providing support for equal item contributions to the person ordering.

Table 7 For the marginal model defining H 1 = … = H 5: For each of five balance-task scales, estimated coefficient H with standard error between parentheses (upper panel); and fit statistic G2, p-value (lower panel).

5.5. Multiple-Group Hypotheses

Mokken (1971, pp. 164–169) provided the asymptotic sampling theory for testing the null hypothesis that the H values for the same test in different groups are equal. Under this null hypothesis, the same test orders respondents from different groups with equal accuracy. For example, the balance-task test was administered to both boys and girls, and it may be interesting to test if the test orders boys and girls equally well. MSP (Molenaar & Sijtsma, 2000) allows the possibility to compare the H j and H values of different groups, but not to test hypotheses about (in-)equality of H in different groups.

Assume that there are G groups, and let superscript g index these groups. Then the null hypothesis of interest is “H 1 = … = H G”. The recursive exp-log notation requires the following definitions. Let \(A_1^1, \ldots ,A_1^G\), A 2 , A 3 , A 4 , and A 5 be design matrices (derived in Appendix F). Let m* be a vector of length LG in which the vectors of expected frequencies from groups 1, …, G are stacked, such that m* = (m 1, m 2, …, m G). The symbol ⊕ indicates the direct product (see Appendix F). In Appendix F it is shown that

$$ \left( {\begin{array}{*{20}c} {H^1 } \\ {H^2 } \\ \vdots \\ {H^G } \\ \end{array} } \right) = u_G - \exp \left( {\mathop \oplus \limits_{g = 1}^G A_4 \log \left( {\mathop \oplus \limits_{g = 1}^G A_3 \exp \left( {\mathop \oplus \limits_{g = 1}^G A_2 \log \left( {\mathop \oplus \limits_{g = 1}^G A_1^g m*} \right)} \right)} \right)} \right) $$

In Appendix F it is also shown that the recursive exp-log notation for the marginal model with “H 1 = H 2 = … = H G” is

$$ \begin{gathered} g\left( m \right) = \left( {\begin{array}{*{20}c} {H^1 - H^2 } \\ {H^2 - H^3 } \\ \vdots \\ {H^{G - 1} - H^G } \\ \end{array} } \right) \hfill \\ = A_5 \exp \left( {\mathop \oplus \limits_{g = 1}^G A_4 \log \left( {\mathop \oplus \limits_{g = 1}^G A_3 \exp \left( {\mathop \oplus \limits_{g = 1}^G A_2 \log \left( {\mathop \oplus \limits_{g = 1}^G A_1^g m*} \right)} \right)} \right)} \right) = 0_{G - 1} \hfill \\ \end{gathered} $$

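Appendix F is not reproduced here, but assuming that the symbol ⊕ in equations (24) and (25) stacks the group-specific design matrices into one block-diagonal matrix acting on the stacked vector m*, the construction can be sketched as follows (hypothetical single-group matrix, two groups):

```python
import numpy as np
from scipy.linalg import block_diag

def direct_sum(matrices):
    """Block-diagonal matrix formed from a list of design matrices, one block per group."""
    return block_diag(*matrices)

A4 = np.array([[1.0, -1.0]])           # hypothetical single-group design matrix
G = 2
A4_all_groups = direct_sum([A4] * G)   # acts on the stacked group-specific vectors
print(A4_all_groups)
# [[ 1. -1.  0.  0.]
#  [ 0.  0.  1. -1.]]
```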

The marginal model in equation (25) was used to test equal H for boys (indexed g = 1) and girls (g = 2); that is, H 1 = H 2. For each balance-task scale, Table 8 shows coefficient H g and its standard error, and the likelihood ratio statistic G 2 and corresponding p-value. For each scale, the sample ^H value was higher for girls than for boys but only for Conflict Distance was the difference significant. Notice that for G = 2, if estimated standard errors are available, this result can be approximated using a t-test.

Table 8 For the marginal model defining H 1 = H 2: For each of the five balance-task scales, scalability coefficients H for boys and girls with standard error between parentheses (upper panel); and fit statistics G2, p-value (lower panel).

A generalization of the multigroup hypothesis to coefficients H j and H ij is straightforward if the item ordering is the same in all subgroups. If the item ordering is different for some subgroups, the design matrices must be adapted.

6. Discussion

Marginal modelling offers a framework for testing many interesting hypotheses relevant to Mokken scale analysis that could not be tested before. In particular, new and exciting possibilities of the marginal modelling approach are:

  1. The availability of global tests that evaluate all interitem scalability coefficients H ij simultaneously and all item-scalability coefficients H j simultaneously. This offers new opportunities for assessing item and test quality.

  2. The possibility to test whether scalability coefficients are equal to a particular value. This is important for ascertaining item and test quality at a level deemed necessary by the researcher. This result also offers the possibility to test hypotheses about expected values of scalability coefficients (such as those derived from previous research).

  3. The comparison of scalability coefficients between different groups. This provides the opportunity to assess whether the measurement quality of a test is the same in different groups.

This paper has presented several useful examples but the array of possibilities has not yet been fully explored. Exploring these possibilities and implementing the most useful ones in user-friendly software is the first topic for future research.

One possible limitation of the marginal modelling approach is that for the global tests assessing all scalability coefficients simultaneously, and to a lesser degree for tests of coefficient H alone, the size of the matrices can grow rapidly as the number of items increases. The experience accumulated thus far did not reveal computational problems for tests up to J = 15. Matrix R (equation (2)), which is required to solve the marginal modelling problem, has \(L = 2^{15} = 32768\) rows. For larger J, the maximum likelihood estimation of the models becomes impractical. One solution may be to use an estimation procedure that only evaluates the observed item-score patterns so that the size of vector m does not exceed n. An example is the minimum discrimination information approach (e.g., Kullback, 1971; Read & Cressie, 1988, pp. 34–40). Applying alternative estimation procedures to marginal modelling of the scalability coefficients for Mokken scale analysis is the second topic for future research.

The methods presented here are only applicable to dichotomous items. Thus, a useful generalization is to Mokken scale analysis for polytomous items. Whereas, for dichotomous items, some of the interesting hypotheses tested in Mokken scale analysis could also be tested without the use of marginal models, this is often not possible for polytomous items. Examples are the computation of standard errors and testing the strength of the scale. The generalization of results for dichotomous items to polytomous items has proven to be problematic in many ways (e.g., Hemker et al., 1997; Sijtsma & Meijer, 2007), and this may also be true in the marginal modelling framework. The derivation of the design matrices for marginal models is more complicated and the magnitude of the computational problems is more troublesome. The generalization of the methods to polytomous items is the third topic for future research.

The syntax files for the marginal models used here are available upon request from the first author. Currently, researchers wishing to apply the marginal models presented in this paper need to have Mathematica installed on their computer.