A Rasch analysis of Raven’s standard progressive matrices

https://doi.org/10.1016/S0191-8869(99)00177-4

Abstract

Unidimensionality was investigated for Raven’s Standard Progressive Matrices, one of the most widely used intelligence tests in the world. The test was administered as part of a research project devoted to the identification of highly gifted children. Unidimensionality was tested by means of the Rasch model, which was applied to subsets A–E separately. The Rasch model was not rejected for sets A, C and D. It was rejected for sets B and E, meaning that the items of these sets measure at least two different dimensions. It was hypothesized that these dimensions are Gestalt continuation and analogical reasoning for set B, and analogical reasoning and coping for set E. In the case of set C, Rasch homogeneity could be improved considerably by assuming a second factor, apart from analogical reasoning, which was identified as lack of resistance to perceptual distractors. Splitting set B into appropriate subsets yielded two unidimensional subsets, B1 and B2. Splitting set E yielded one unidimensional subset, E1, and a heterogeneous, multidimensional subset, E2. Set C was redefined by disregarding some of its items. At the level of the newly defined subset scores, the factor analogical reasoning is common to all subsets. The factor Gestalt continuation is common to the subsets A and B1. However, the reliabilities of these subsets were very low, implying that this factor might be too weak to be distinguishable in a factor analysis. The factors coping and lack of resistance to perceptual distractors are both unique. Therefore, one might expect only one factor to emerge from a factor analysis of all newly defined subsets. However, factor analysis of the newly defined subsets yielded two factors. Further inspection of the factor plot showed that the emergence of a second factor could be considered an artefact of the skewness of the subset scores.

Introduction

This paper addresses the unidimensionality of Raven’s Standard Progressive Matrices (RSPM), one of the most widely used intelligence tests in the world. Unidimensionality, in an intuitive sense, means that responses to all test items depend on the same underlying trait or ability. Such a test may also be called “unifactorial” or “homogeneous”, although the latter word has many different meanings (Dubois, 1970). The question of whether a test is unidimensional has important consequences for (1) the construct validity of the test, and (2) the practical scoring scheme of the test. Suppose, for example, that the RSPM contained two clusters of items, where responses to the first cluster depend mainly on the subject’s “completeness of Gestalt continuation” whereas responses to the second cluster depend mainly on “completeness of analogical reasoning”. A subject’s total score would then reflect an unknown mixture of both abilities. Thus, a subject with an average total score may be high on Gestalt continuation but low on analogical reasoning, or low on Gestalt continuation but high on analogical reasoning. There is no way to infer the adequate theoretical interpretation from the total score. Therefore, the construct validity of the test, as defined by the Standards (APA, 1985), would be seriously undermined. Moreover, in practical test administrations each subject’s test behavior would be summarized more accurately with two different subtest scores instead of one total score. For these reasons it is crucial to assess whether the RSPM is unidimensional.

The manual of the RSPM does not blindly prescribe the use of total scores. The RSPM is divided into several subsets, and according to the manual, if the subset scores deviate too much from each other, they should not be combined into a total score. Thus the manual acknowledges that the subsets may measure different dimensions. However, the question of unidimensionality applies equally well at the level of the subsets. Each subset may contain two or more clusters of items, in which case it would be impossible to infer the adequate theoretical interpretation from the subset scores. Unidimensionality of each subset is a logical prerequisite for unidimensionality of the total test. Therefore the unidimensionality of the RSPM will be investigated primarily for each of the subsets separately.

This paper assesses the unidimensionality of the RSPM by means of the Rasch model, and in particular with the R1 and R2 statistics of Glas (Glas, 1988; Glas & Verhelst, 1995). Although in principle many other methods, such as factor analysis, could be used to assess dimensionality, it is the considered opinion of the authors that the Rasch model offers the most suitable method. This will be elaborated below.

In the remainder of the introduction a few words will be devoted to the relation between construct validity and unidimensionality. Next, the main arguments for choosing the Rasch model as the appropriate statistical method to assess unidimensionality will be discussed. Subsequently, the main features of the Rasch model will be presented. In the method section, the RSPM and the subject sample will be described. Finally, the results of the Rasch analysis for the various subsets of the RSPM will be presented. This will also be done for the test as a whole, and the latter analysis will be completed with a confirmatory factor analysis of the subset scores.

It is not uncommon in test construction to perform a factor analysis in order to determine the factorial structure of the test items (Allen & Yen, 1979). Ideally, items of the same subtest should load on the same factor and only on that factor. This is essentially the hypothesis of unidimensionality. The present paper deviates from this approach only in that it uses a more appropriate and rigorous statistical method to test this hypothesis.

Why is unidimensionality important? Unidimensionality can be viewed as construct-related evidence of the validity of the test. Recall that construct validity means that an adequate theoretical interpretation can be given to test scores (APA, 1985). This implies that theories about the underlying construct should be tested and corroborated. Many construct validation studies consider theories in which subtest scores or item scores are related to each other by means of factor analysis or structural equation modelling (e.g., Craighead et al., 1998; Endler et al., 1998). Studies where item scores are related to each other may be called “studies of internal structure”. Studies of internal structure are explicitly mentioned by Cronbach and Meehl (1955) in their seminal paper on construct validity. Assessment of unidimensionality is essentially a study of internal structure, as it considers the relations between the items within a test. The “theory” tested here is simply that all items measure the same construct. Note that this theory does not say what the nature of the construct is. Nor does it have any strict logical implication for the relation between the total score and external variables. However, it seems unlikely that lasting and parsimonious theories can hold for the total score if that score were in fact a variable mixture of different abilities.

The number of latent abilities (or “dimensions”, “traits”, or “factors”) underlying the item responses will be analyzed with an Item Response Theory (IRT) model instead of a classical factor analysis. Use of an IRT model is required here because the item responses are binary (0 or 1). Ordinary factor analysis and other classical test theory methods, if applied to binary variables, may yield too many factors, some of which are related to the item difficulties (Hattie, 1985; Green et al., 1977; McDonald & Ahlawat, 1974; McDonald, 1981). This problem has a history dating back to Spearman (1927) and Hertzman (1936). In the case of binary items, differences in item difficulty also cause differences in skewness, and it is precisely these differences in skewness which may cause the emergence of artificial factors. Note that the mean or standard deviation as such cannot be the cause of artificial results, as factor analysis is based on standardized scores from which the effects of the means and standard deviations have been partialled out. Several factor analytic models have been developed to deal with binary variables (Christoffersson, 1975; Bartholomew, 1980; Muthén, 1978). However, their statistical tests regarding the number of factors are always based on the assumption that the underlying latent trait is normally distributed, and this assumption may be invalid. IRT models, on the other hand, are specially developed for binary observed variables and do not need the assumption that the latent trait is normally distributed. More information on IRT can be found at an introductory level in Allen and Yen (1979), at a more advanced level in Lord (1953, 1980), and at a more general level in van der Linden and Hambleton (1997).
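As an aside, the skewness artefact mentioned above can be illustrated with a small simulation. The following sketch is not part of the original study and uses arbitrary parameter values: it generates strictly unidimensional Rasch data with one cluster of easy and one cluster of hard items, and shows that the ordinary product-moment (phi) correlations are higher within the difficulty clusters than between them, the kind of pattern a factor analysis of binary data can misread as a second factor.

```python
# Minimal simulation sketch (illustrative only): unidimensional Rasch data with
# two clusters of item difficulties can show a "difficulty factor" pattern in
# the ordinary (phi) correlation matrix.
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 2000
abilities = rng.normal(0.0, 1.0, n_subjects)

# Five easy items (beta = -2) and five hard items (beta = +2); one latent trait.
betas = np.array([-2.0] * 5 + [2.0] * 5)

# Rasch response probabilities and simulated 0/1 responses.
logits = abilities[:, None] - betas[None, :]
p = 1.0 / (1.0 + np.exp(-logits))
responses = (rng.random(p.shape) < p).astype(int)

# Ordinary product-moment (phi) correlations between the binary items.
r = np.corrcoef(responses, rowvar=False)

within_easy = r[:5, :5][np.triu_indices(5, k=1)].mean()
within_hard = r[5:, 5:][np.triu_indices(5, k=1)].mean()
between = r[:5, 5:].mean()
print(f"mean phi within easy items: {within_easy:.2f}")
print(f"mean phi within hard items: {within_hard:.2f}")
print(f"mean phi between clusters:  {between:.2f}")
# Within-cluster correlations exceed between-cluster correlations even though a
# single trait generated the data, so a factor analysis of this matrix can
# suggest a spurious second ("difficulty") factor.
```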

Among IRT models, the Rasch model (Rasch, 1960; Fischer & Molenaar, 1995) will be used in this paper, for several reasons. The first reason is that the statistical theory of the model is well developed and straightforward in comparison to other IRT models (e.g. Fischer, 1974; Glas & Verhelst, 1995). Possible competing models would be the normal ogive model and the 2-parameter logistic model. For the normal ogive model, however, no statistical tests exist regarding the number of factors (van der Linden & Hambleton, 1997), apart from tests in which a normally distributed trait is assumed and factor analysis of binary data is applied. For the 2-parameter logistic model, several test statistics exist, but these lack the rigorous mathematical foundation that some statistics of the Rasch model have. More importantly, for the 2-parameter logistic model and the normal ogive model no well-developed test statistic exists that is based on bivariate frequencies, whereas this is where violations of unidimensionality are most likely to show up (van den Wollenberg, 1982).

Another, more substantive reason to use the Rasch model, which will be discussed below, is that it is the only model in which a subject’s number-correct score is a “sufficient statistic” for the underlying trait (Fischer, 1974, 1995), and it is exactly this score which is used to evaluate a subject’s results on each separate subset.

In the Rasch model it is assumed that the probability p_{ij} that a given subject i gives a correct response to a given item j satisfies the formula

p_{ij} = \frac{e^{\alpha_i - \beta_j}}{1 + e^{\alpha_i - \beta_j}} \qquad (1)

Here \alpha_i is the subject’s underlying true ability, \beta_j is the item’s underlying true difficulty relative to the other items, and e is the base of the natural logarithm. The ability should not be confused with the observed test score, which is only an estimate of the true ability. Similarly, the relative difficulty should not be confused with the observed p-value (proportion correct) of the item. The right-hand side of formula (1) is known as a logistic function. The probability is plotted for two distinct items in Fig. 1. These curves are called Item Characteristic Curves (ICCs).
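For concreteness, the following small fragment (an illustrative sketch, not code from the study; the item difficulties are arbitrary) evaluates formula (1) for two hypothetical items at a range of ability values.

```python
# Minimal sketch of formula (1): Rasch probability of a correct response,
# evaluated for two hypothetical items of different difficulty.
import numpy as np

def rasch_probability(ability, difficulty):
    """P(correct) = exp(ability - difficulty) / (1 + exp(ability - difficulty))."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

abilities = np.linspace(-4, 4, 9)
easy_item, hard_item = -1.0, 1.0          # hypothetical difficulties
for a in abilities:
    print(f"ability {a:+.1f}:  easy item {rasch_probability(a, easy_item):.2f}"
          f"   hard item {rasch_probability(a, hard_item):.2f}")
# At every ability level the easier item has the higher probability, and for
# both items the probability rises monotonically with ability; the two curves
# never cross, in line with the parallel, non-intersecting ICCs of the model.
```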

One can see in this figure that the probability of a correct response has the following characteristics:

  • 1. Monotone in ability: The probability increases with the ability of the subject.

  • 2. Monotone in difficulty: The probability decreases with the difficulty of the item.

This is as it should be. In the Rasch model it is assumed that formula (1) holds for all subjects and all items. In addition, unidimensionality is assumed, and this can be split into two further assumptions, closely related to 1 and 2 above:
  • 1′. Unidimensional ability: Each subject’s ability remains constant throughout the test, that is: a subject uses the same ability in all test items.

  • 2′. Unidimensional difficulty: Each item’s relative difficulty remains constant throughout the sample, that is: the ICCs are horizontally parallel and hence non-intersecting. Consequently, items can be arranged in order of difficulty by means of the p-values, and this order is independent of the abilities of the subjects. Smart and dull subjects will yield the same order.
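A simple empirical consequence of assumption 2′ can be checked directly in data. The fragment below is an illustrative sketch (not part of the original analysis): it splits the sample into a lower and an upper half on the total score and compares the order of the item p-values in the two halves; under the Rasch model the two orders should agree up to sampling error.

```python
# Minimal check (illustrative only) of assumption 2': under the Rasch model the
# rank order of item p-values should be the same for low- and high-scoring
# subjects. Here ability is proxied by the total score.
import numpy as np

def difficulty_order_by_group(responses):
    """responses: (subjects x items) 0/1 matrix.
    Returns the item order (easiest first) in the lower and upper score halves."""
    totals = responses.sum(axis=1)
    median = np.median(totals)
    low, high = responses[totals <= median], responses[totals > median]
    order_low = np.argsort(-low.mean(axis=0))    # items sorted by proportion correct
    order_high = np.argsort(-high.mean(axis=0))
    return order_low, order_high

# If the two orders differ substantially (beyond sampling error), the items do
# not share a common difficulty ordering across ability levels, which violates
# the non-intersecting-ICC assumption of the Rasch model.
```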

It is legitimate to ask why one would assume precisely the logistic curve and not some other nicely shaped curve, e.g. the normal ogive. Note that a logistic curve is certainly more reasonable than a straight line, which would imply that the probability may become larger than 1 or smaller than 0; the S-curve accommodates the fact that the observed responses have a floor (0) and a ceiling (1). In addition, there is a rigorous mathematical reason to single out the logistic function. It can be shown (Fischer, 1974, 1995) that the assumption of a logistic function implies the following (the parallel assumption 1″ is omitted here, as it is considered less important):
  • 2″. Sufficiency of the total score. Each subject’s observed total score is a “sufficient statistic” for the underlying true ability. This means that the total score contains all information about the subject’s true ability.
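The following is a brief sketch of the standard derivation behind this claim, along the lines of the sources just cited; the abbreviation \varepsilon_j = e^{-\beta_j} is introduced here purely for readability. For a subject with ability \alpha answering k items with responses x_1, \ldots, x_k,

P(x_1, \ldots, x_k \mid \alpha) = \prod_{j=1}^{k} \frac{e^{x_j(\alpha - \beta_j)}}{1 + e^{\alpha - \beta_j}} = \frac{e^{r\alpha} \prod_j \varepsilon_j^{x_j}}{\prod_j \bigl(1 + e^{\alpha - \beta_j}\bigr)}, \qquad r = \sum_j x_j .

Dividing this by the probability of obtaining total score r gives

P(x_1, \ldots, x_k \mid r, \alpha) = \frac{\prod_j \varepsilon_j^{x_j}}{\gamma_r(\varepsilon_1, \ldots, \varepsilon_k)},

where \gamma_r denotes the elementary symmetric function of order r. The ability \alpha has cancelled from the right-hand side: once the total score r is known, the particular response pattern provides no further information about \alpha, which is precisely what sufficiency means.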

Sufficiency of the total score guarantees that the total score is the best summary of the subject’s response pattern: no relevant information is lost if one reports only the total score for each subject. In particular, the following situation, described earlier in the introduction, will not occur.

Suppose, for example, that the RSPM contained two clusters of items, where responses to the first cluster depend mainly on the subject’s “completeness of Gestalt continuation” whereas responses to the second cluster depend mainly on “completeness of analogical reasoning”. A subject’s total score would then reflect an unknown mixture of both abilities. Thus, a subject with an average total score may be high on Gestalt continuation but low on analogical reasoning, or low on Gestalt continuation but high on analogical reasoning.

Since the total score (number correct) is ordinarily used for the Raven, it is important to demonstrate that it is indeed sufficient. Otherwise the actual testing practice of using total scores would systematically neglect relevant dimensions of the subject’s test behavior. Note that the choice of the Rasch model is not motivated by psychological theory, but rather by measurement-theoretical considerations of what a good test should be like: such a test yields total scores that are sufficient, and this requires logistic item characteristic curves.

Two test statistics will be used to evaluate whether the Rasch model holds for the RSPM, namely the R1 and R2 statistics of Glas (Glas, 1988; Glas & Verhelst, 1995). The R1 statistic is particularly sensitive to violations of assumptions 2, 2′ and 2″. The R2 statistic is particularly sensitive to violations of assumptions 1 and 1″ (note that the numbering of the R statistics is opposite to that of the assumptions). Therefore R2 may be considered somewhat more important.
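The R1 and R2 statistics of Glas have a precise foundation in the multinomial distribution; the following fragment is only a rough, self-contained sketch, not Glas’s actual statistics, of the kind of observed-versus-expected comparison within total-score groups that such fit statistics formalize. All data and difficulty values below are simulated for illustration.

```python
# Rough illustration (not Glas's actual R1/R2 statistics): group subjects by
# total score and compare observed item proportions correct in each group with
# the Rasch-model prediction.
import numpy as np

rng = np.random.default_rng(1)

def rasch_p(theta, betas):
    return 1.0 / (1.0 + np.exp(-(theta - betas)))

def theta_for_score(score, betas, lo=-6.0, hi=6.0):
    """Ability at which the expected number correct equals the observed score (bisection)."""
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if rasch_p(mid, betas).sum() < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Simulate Rasch data for 12 hypothetical items.
betas = np.linspace(-2.0, 2.0, 12)
theta = rng.normal(0.0, 1.0, 3000)
responses = (rng.random((3000, 12)) < rasch_p(theta[:, None], betas[None, :])).astype(int)

totals = responses.sum(axis=1)
for score in range(1, 12):                       # perfect and zero scores are uninformative
    group = responses[totals == score]
    if len(group) < 30:
        continue
    observed = group.mean(axis=0)                # observed proportion correct per item
    expected = rasch_p(theta_for_score(score, betas), betas)
    print(f"score {score:2d} (n={len(group):4d}): max |obs - exp| = "
          f"{np.abs(observed - expected).max():.3f}")
# Under the Rasch model these discrepancies stay small; large systematic
# deviations within score groups signal misfit of the item characteristic curves
# or multidimensionality showing up in the bivariate frequencies.
```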


Raven’s Standard Progressive Matrices

The results reported below are based upon data obtained with the Standard Progressive Matrices test (RSPM). The test was administered as part of a research project devoted to the identification of highly gifted children (see Mönks, van Boxtel, Roelofs & Sanders, 1986). The present sample is, on average, not particularly gifted; it was used only to select such children. The Standard Progressive Matrices may be described as consisting of a number of visual analogy problems, each having

Rasch analysis of sets A–E

The results of the Rasch analysis based upon all items of each set (except for items answered correctly by all subjects) can be found in Table 2.

The number of subjects involved in each analysis (N) varies because subjects who answer all items correctly or all items incorrectly are excluded from the analysis. The results in Table 2 clearly indicate that the Rasch model generally holds for sets A, C and D, but not for sets B and E. In particular, the assumption of

Discussion

Unidimensionality was investigated for Raven’s Standard Progressive Matrices, one of the most widely used intelligence tests in the world. Unidimensionality was tested by means of the Rasch model, applied to subsets A–E separately. The Rasch model was not rejected for sets A, C and D. In the case of set B the significance of R2 could be explained by taking into account that the first seven items of set B (similar to the items of set A) require Gestalt continuation, whereas items 8–12 of set B

References

  • M.J. Allen et al. (1979). Introduction to measurement theory.
  • APA (1985). American Psychological Association, American Educational Research Association, & National Council on...
  • R. Amthauer (1973). Intelligenz-Struktur-Test I.S.T. 70. Handanweisung für die Durchführung und Auswertung.
  • E.B. Andersen (1973). A goodness of fit test for the Rasch model. Psychometrika.
  • D.J. Bartholomew (1980). Factor analysis for categorical data. Journal of the Royal Statistical Society, Series B.
  • H.W. van Boxtel et al. (1986). De identificatie van begaafde leerlingen in het voortgezette onderwijs en een beschrijving van hun situatie, Vol. I and II.
  • A. Christoffersson (1975). Factor analysis of dichotomized variables. Psychometrika.
  • W.E. Craighead et al. (1998). Factor analysis of the children’s depression inventory in a community sample. Psychological Assessment.
  • L.J. Cronbach et al. (1955). Construct validity in psychological tests. Psychological Bulletin.
  • P.H. Dubois (1970). Varieties of psychological test homogeneity. American Psychologist.
  • N.S. Endler et al. (1998). Coping with health problems: developing a reliable and valid multidimensional measure. Psychological Assessment.
  • G.H. Fischer (1974). Einführung in die Theorie psychologischer Tests.
  • G.H. Fischer (1995). Derivations of the Rasch model.
  • G.H. Fischer et al. (1970). Algorithmen und Programmen für das probabilistische Testmodel von Rasch. Psychologische Beiträge.
  • C.A.W. Glas (1988). The derivation of some tests for the Rasch model from the multinomial distribution. Psychometrika.
  • C.A.W. Glas et al. (1993). Rasch scaling program.
  • C.A.W. Glas et al. (1995). Testing the Rasch model.
  • Ch.M. des Granges (1964). Pensées.
  • S.B. Green et al. (1977). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement.
  • J.A. Hattie (1985). Methodology review: assessing unidimensionality of tests and items. Applied Psychological Measurement.