Original article
Long-term stability of the French WISC-IV: Standard and CHC index scores
Stabilité à long terme des indices standards et CHC du WISC-IV

https://doi.org/10.1016/j.erap.2016.10.001

Abstract

Introduction

The assumption of the stability of intelligence is the source of the predictive value of the Intelligence Quotient (e.g., Full Scale IQ). However, few studies have investigated the long-term stability of one of the most frequently used tests in the field of cognitive assessment: the Wechsler Intelligence Scale for Children – 4th edition (WISC-IV).

Objective

For a deeper understanding and a better use of intelligence test scores, this study examined the long-term stability of the standard index scores and five CHC composite scores of the French WISC-IV.

Method

A test–retest procedure was used, with an average retest interval of 1.77 years (SD = 0.56 years). The sample comprised 277 French-speaking Swiss children aged between 7 and 12 years. Three types of stability analysis were conducted: (a) mean-level change, (b) rank-order consistency and change, and (c) individual-level change.

Results

The observed pattern of mean-level changes suggested normative mean-level stability for the Verbal Comprehension Index (VCI), the Perceptual Reasoning Index (PRI), the General Ability Index (GAI), Comprehension-Knowledge (Gc), and Visual Processing (Gv). Regarding the stability of individual differences, only the FSIQ and the GAI reached the reliability of .80 required for making decisions about individuals. Individual-level stability was examined using a confidence interval of two standard errors of measurement (± 2 SEM). Results indicated that more than 70% of the children presented stable performances for the GAI, Gc, and Gv scores.

Conclusion

Together, nomothetic and idiographic perspectives suggested that the GAI, Gc, and Gv were the most stable scores in our non-clinical sample.

Résumé

Introduction

The assumption that intelligence is stable underlies the predictive value of the Intelligence Quotient (e.g., Full Scale IQ). Yet few studies have examined the long-term stability of scores from one of the most widely used batteries in cognitive assessment: the fourth edition of the Wechsler Intelligence Scale for Children (WISC-IV).

Objective

To foster a deeper understanding and a better use of intelligence test scores, this study examines the long-term stability of the standard index scores and of five CHC index scores estimated from the French adaptation of the WISC-IV.

Method

The long-term stability of the scores was assessed with a test–retest procedure, with a mean interval of 1.77 years (SD = 0.56 years) between the two administrations. The sample comprises 277 French-speaking Swiss children aged 7 to 12 years. Score stability was assessed from three angles: (a) stability of the group mean level, (b) differential stability, and (c) intraindividual stability.

Results

Comparisons of means between the two administrations suggest mean-level stability for the Verbal Comprehension Index (VCI), the Perceptual Reasoning Index (PRI), the General Ability Index (GAI), and the Comprehension-Knowledge (Gc) and Visual Processing (Gv) scores. Regarding differential stability, only the Full Scale IQ and the GAI reach the reliability threshold of .80 recommended for individual-level decisions. Intraindividual stability is examined by defining a confidence interval of two standard errors of measurement (± 2 SEM). The results show that more than 70% of the children present stable performances for the GAI, Gc, and Gv scores.

Conclusion

Overall, the nomothetic and idiographic perspectives suggest that the GAI, Gc, and Gv are the most stable scores in our non-clinical sample.

Introduction

Previous studies suggested that intelligence is a steady and enduring trait across time (e.g., Deary et al., 2013, Deary et al., 2000, Hertzog and Schaie, 1986, McCall, 1977). Indeed, apart from temporary fluctuations occurring in intellectual development, the cognitive performances of individuals are assumed to be relatively stable from childhood through adulthood. The stability of individual differences in intelligence confers a predictive value to the Full Scale Intelligence Quotient (FSIQ). Hence, intelligence tests like the Wechsler Scales are commonly used for diagnostic and intervention purposes. Because high-stakes decisions (e.g., grade skipping or admission to special education programs) are frequently based on the FSIQ and the index scores, it is essential to formulate diagnostic hypotheses and interventions based on reliable and stable intelligence test scores.

Reliability, and internal consistency in particular, is routinely assessed for intelligence test scores. According to classical test theory, reliability/precision is the foundation for the validity of test score interpretation (AERA, APA, & NCME, 2014). Typically, a test–retest procedure is used to assess the reliability/precision of intelligence test scores across time (i.e., longitudinal studies with at least two assessments). With this procedure, the same test is administered twice to the same individuals after a defined retest interval, and test–retest correlations are computed to assess stability.
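As a minimal sketch of the test–retest correlation described above, the following Python computes a Pearson correlation between two assessments of the same individuals. The function name `pearson_r` and all scores are hypothetical illustrations, not data or code from any of the cited studies.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations around each mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five children at Time 1 and Time 2.
time1 = [95, 102, 110, 88, 120]
time2 = [98, 100, 115, 90, 118]
stability = pearson_r(time1, time2)

# A uniform gain of 5 points for every child leaves the correlation at 1.
uniform_gain = pearson_r(time1, [score + 5 for score in time1])
```

Because a uniform gain does not alter the correlation, the coefficient captures rank-order consistency only; mean-level change has to be examined separately.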

Most longitudinal studies indicated that when individuals are tested again after an interval of several days or several years (with the same measure or alternate forms), their performance at the second assessment is higher than at the first (e.g., Calamia et al., 2012, Hausknecht et al., 2007, Salthouse et al., 2004). For instance, studies conducted with a retest interval of 1 year or less reported retest gains between 0.10 and 0.60 Time 1 SD (see review in Benedict and Zgaljardic, 1998, Salthouse et al., 2004). Furthermore, these studies indicated that retest gains varied with tasks. Typically, crystallized abilities demonstrated higher stability than fluid abilities (Schwartzman, Gold, Andres, Arbuckle, & Chaikelson, 1987). These studies also demonstrated that tests with problem-solving components were subject to greater practice effects than those with fewer such demands (Calamia et al., 2012, Dikmen et al., 1999). Similarly, with a short retest interval (from 3 to 6 months), retest gains tended to be greater for simple speed tasks (e.g., processing speed subtests such as Coding or Symbol Search) than for verbal ones (e.g., verbal comprehension subtests such as Vocabulary or Information; Calamia et al., 2012, Estevis et al., 2012). In a longitudinal study of adults (aged 18 to 58 years) tested twice after an interval of a few days to 35 years, Salthouse and colleagues (2004) demonstrated that seven or more years were needed to remove the positive retest effects.

Two main factors may explain the longitudinal change: age and retest effects (i.e., practice effects; Ferrer et al., 2005, Salthouse et al., 2004). While age effects refer to maturation (aging processes), “retest effects refer to influences on the difference in performance between the first and a subsequent measurement occasion that are attributable to the previous assessment” (Salthouse, 2009, p. 509). According to Salthouse and colleagues (2004), four types of influences (specific and general retest factors) could contribute to the retest effects: (1) test-specific factors (e.g., remembering items or answers); (2) familiarity with the testing situation, which could reduce anxiety; (3) an increase in the cognitive ability assessed by the test during the test–retest interval; and (4) changes that occur in the environment of the individual. These authors assumed that the fourth influence is more relevant for general information or vocabulary tests.

Several methods are used to estimate age and retest effects. For instance, comparing the performance of children tested twice with that of same-age children tested once allows retest effects to be assessed. However, because some longitudinal studies demonstrated that retest effects could contaminate age effects, it is necessary to distinguish effects due to age from those due to retesting (Ferrer et al., 2005, Salthouse et al., 2004). Indeed, with adult samples, Salthouse and colleagues (2004) suggested that positive retest effects could obscure negative age effects. One method to distinguish these effects is to vary the retest interval among participants, so that the increase in age is no longer perfectly correlated with the increase in retest interval. This procedure has only rarely been used. In the present study, because the retest interval varies among children, we will be able to decompose age and retest effects.
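The logic of varying the retest interval can be illustrated with a deliberately simplified sketch. It assumes, purely for illustration, that each child's raw-score change is the sum of a constant retest gain and a maturation gain proportional to the interval; this linear model, the function `fit_line`, and all numbers are hypothetical and are not the analysis reported in the study. Under that assumption, regressing change on interval recovers the retest gain as the intercept and the yearly age effect as the slope, which is only possible when the interval varies.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        # A fixed interval gives the predictor no variance, so the
        # retest gain and the age effect cannot be told apart.
        raise ValueError("predictor has no variance: effects cannot be separated")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical, noise-free illustration (not study data).
TRUE_RETEST_GAIN = 3.0    # assumed gain from having taken the test before
TRUE_GAIN_PER_YEAR = 2.0  # assumed maturation effect per year

intervals = [1.0, 1.5, 2.0, 2.5, 3.0]  # retest interval in years, varies by child
changes = [TRUE_RETEST_GAIN + TRUE_GAIN_PER_YEAR * t for t in intervals]

retest_effect, age_effect_per_year = fit_line(intervals, changes)
```

With every child retested after the same interval, `fit_line` raises an error, mirroring the perfect confounding of age and retest in a fixed-interval design.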

To our knowledge, the distinction between “short-term” and “long-term” retest intervals is not clearly defined in the literature. Sattler (2008) considers a period of less than one year to be a short time interval (<1 year). Similarly, for Watkins and Canivez (2004), a long time interval is a retest interval of more than one year (>1 year). In line with these definitions, we consider a period of one year or more to be a long-term interval (≥1 year).

To date, very few studies have investigated the long-term stability of the Wechsler Intelligence Scale for Children – fourth edition (WISC-IV), and as far as we know, none with the French version. More than short-term stability, long-term stability is needed to provide complete evidence to support the predictive value of high-stakes decisions based on test scores (i.e., decision consistency). The stability of the previous editions of the WISC has been explored with several test–retest intervals and with various groups of children (non-clinical or clinical groups). Most longitudinal studies conducted with the U.S. WISC/WISC-R/WISC-III indicated that the FSIQ was fairly stable (i.e., r > .70) with clinical samples (e.g., Bauman, 1991, Canivez and Watkins, 1998, Canivez and Watkins, 2001, Oakman and Wilson, 1988, Stavrou, 1990, Truscott et al., 1994, Vance et al., 1981). Because the WISC-IV introduced many changes, these earlier findings cannot be assumed to apply to it. The clinical interpretation of this fourth edition is currently based on a Full Scale Intelligence Quotient (FSIQ) and four index scores: the Verbal Comprehension Index (VCI), the Perceptual Reasoning Index (PRI), the Working Memory Index (WMI), and the Processing Speed Index (PSI).

Table 1 reports results for some short- and long-term longitudinal studies conducted with the WISC-IV. It gives stability coefficients corrected for the variability of the WISC-IV normative sample (Allen and Yen, 1979, Guilford and Fruchter, 1978, Magnusson, 1967) and stability coefficients corrected for the combined or additive effect of content and time sampling error (Macmann & Barnett, 1997). The standardized mean difference (i.e., d) is the difference between the two test means divided by the pooled standard deviation (Cohen, 1977).
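The two quantities described above can be sketched as follows. The pooled standard deviation assumes equal group sizes, as when the same children are assessed twice. The variability correction uses a commonly cited range-restriction formula, r_c = r(S/s) / sqrt(1 − r² + r²(S/s)²), where S is the normative SD and s the sample SD; that this is the exact formula behind the coefficients in Table 1 is an assumption. Function names and input values are hypothetical.

```python
import math


def cohens_d(mean1, sd1, mean2, sd2):
    """Standardized mean difference (Time 2 minus Time 1) over the pooled SD.

    Assumes equal group sizes, as in a test-retest design.
    """
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (mean2 - mean1) / pooled_sd


def correct_for_variability(r, sample_sd, norm_sd=15.0):
    """Adjust a correlation for a sample SD that differs from the normative SD.

    Assumed formula: r_c = r*(S/s) / sqrt(1 - r^2 + r^2*(S/s)^2).
    """
    ratio = norm_sd / sample_sd
    return (r * ratio) / math.sqrt(1 - r ** 2 + (r ** 2) * ratio ** 2)


# Hypothetical values, not taken from Table 1.
d = cohens_d(mean1=100.0, sd1=15.0, mean2=103.0, sd2=15.0)
r_corrected = correct_for_variability(r=0.70, sample_sd=12.0)
r_unchanged = correct_for_variability(r=0.70, sample_sd=15.0)
```

When the sample SD is smaller than the normative SD of 15, the corrected coefficient is larger than the observed one; when the two SDs are equal, the coefficient is unchanged.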

In the technical and interpretative manual of the French WISC-IV (Wechsler, 2005b), short-term stability of the scores was evaluated with a sample of 93 non-clinical children, who were tested twice one month apart (mean test–retest interval = 27 days). The corrected test–retest coefficients ranged from .78 (WMI) to .91 (FSIQ) for the index scores, and from .64 (Picture Concepts) to .83 (Symbol Search) for the subtest scores (see Table 1). The short-term stability of the U.S. WISC-IV scores was evaluated in a sample of 243 children with a retest interval between 13 and 63 days (mean test–retest interval = 32 days). As reported by Williams, Weiss, and Rolfhus (2003), corrected stability coefficients ranged from .86 (PSI) to .93 (VCI and FSIQ). Regarding the subtest scores, corrected stability coefficients ranged from .76 (Picture Concepts) to .92 (Vocabulary) (see Table 2 in Williams et al., 2003). These results indicated that the French WISC-IV corrected stability coefficients were slightly lower than those reported for the U.S. WISC-IV scores.

A longer retest interval (11 months) was considered by Ryan, Glass, and Bartels (2010), who investigated the stability of the U.S. WISC-IV scores with a sample of 43 voluntary children from a private school (see Table 1). Except for the PRI (r = .68) and the PSI (r = .54), corrected stability coefficients of index scores were above .70. Stability coefficients of the subtest scores ranged from .26 (Picture Concepts) to .84 (Vocabulary). These results indicated that the stability coefficients of subtest scores were lower than those of the composite scores. At an individual level, Ryan and colleagues found that the FSIQ of 42% of the children changed by more than 5 points between the two assessments.

As mentioned before, most short-term stability studies conducted on intelligence tests have revealed retest effects. For the WISC-IV, several studies have shown that the practice effects were more pronounced for the PRI and the PSI than for the WMI and the VCI (Flanagan and Kaufman, 2009, Ryan et al., 2010, Wechsler, 2005b). Flanagan and Kaufman (2009) also observed that “practice effects are largest for ages 6 to 7 and become smaller with increasing age” (p. 32). In addition, Ryan et al. (2010) found that children with higher performances benefited more from the second testing than children with lower performances. However, because this study was conducted with a small sample and with a short retest interval (<1 year), these findings must be interpreted with caution and cannot be extrapolated to all children.

As far as we know, only three studies have investigated the stability of the WISC-IV scores with a long test–retest interval (i.e., ≥1 year). First, a long-term stability study was conducted by Lander (2010). This study involved a sample of 131 children with learning disabilities. The test–retest interval was 2.89 years. Except for the FSIQ, the uncorrected long-term stability coefficients were lower than .70 (see Table 1). Concerning the subtest scores, the long-term stability coefficients ranged from .28 (Symbol Search) to .62 (Block Design). In addition, Lander found a significant mean decrease from the first to the second assessment for the PSI (−2.14 points), but the associated effect size was weak (d = −.18). To examine intraindividual stability, Lander analyzed the change in individual scores between test and retest by computing a confidence interval of ±2 standard errors of measurement (±2 SEM). Lander found that 78% (VCI and FSIQ), 73% (PRI), 70% (WMI), and 73% (PSI) of the children remained stable across time with this ±2 SEM confidence interval. Thus, Lander stated that “there were many individuals who changed more than would be expected due to error” (p. 75). These results might be explained by the fact that selecting a specific sample of children with learning disabilities restricted the range and hence lowered test–retest correlations.
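A ±2 SEM classification of this kind can be sketched as below, using the classical formula SEM = SD × sqrt(1 − reliability). The reliability value, the choice to center the interval on the Time 1 score, the function names, and the score pairs are all assumptions made for illustration; the source does not specify how Lander set these details.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def is_stable(score_t1, score_t2, sd=15.0, reliability=0.90, n_sem=2):
    """True if the retest score lies within +/- n_sem SEM of the Time 1 score."""
    return abs(score_t2 - score_t1) <= n_sem * sem(sd, reliability)


def percent_stable(pairs, **kwargs):
    """Percentage of (Time 1, Time 2) score pairs classified as stable."""
    stable = sum(is_stable(t1, t2, **kwargs) for t1, t2 in pairs)
    return 100.0 * stable / len(pairs)


# Hypothetical score pairs; sd and reliability are assumed values.
pairs = [(100, 104), (95, 108), (110, 103), (88, 90)]
band = 2 * sem(15.0, 0.90)  # half-width of the interval, about 9.49 points
share = percent_stable(pairs, sd=15.0, reliability=0.90)
```

In this toy example, only the child whose score moved by 13 points falls outside the band, so three of the four pairs count as stable.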

The second long-term stability study on the WISC-IV scores was conducted by Watkins and Smith (2013). Three hundred and forty-four children evaluated for special education eligibility were tested twice with a retest interval of around 3 years (M = 2.84 years, SD = 0.75 years). Except for the PSI (r = .65), corrected stability coefficients ranged from .70 (WMI) to .84 (FSIQ; see Table 1). Again, corrected stability coefficients for the subtest scores were lower than those for the composite scores. Regarding intraindividual stability, Watkins and Smith found that 71% and 75% of the children had test–retest differences less than or equal to 9 points for the VCI and the FSIQ, respectively. These percentages were 61%, 63%, and 56% for the PRI, WMI, and PSI, respectively. Therefore, Watkins and Smith concluded, “even the most reliable WISC-IV score, the FSIQ, may not be sufficiently stable for longitudinal individual decisions” (p. 4).

The third long-term stability study on the WISC-IV scores was conducted by Bartoi et al. (2015). Participants in this study were 51 clinically referred children aged from 8 to 16 years. The average retest interval was 1.84 years (SD = 0.50 years). The uncorrected stability coefficients ranged from .58 (PSI) to .86 (FSIQ) for index scores, and from .35 (Letter-Number Sequencing) to .81 (Vocabulary) for subtest scores (see Table 1). Individual variation in scores showed that 78.4% of the children had test–retest differences less than or equal to 9 points for the FSIQ; similarly, 68.6%, 56.9%, 54.9%, and 54.9% of the children had test–retest differences less than or equal to 9 points for the VCI, PRI, WMI, and PSI, respectively. Overall, these results were consistent with those reported by Watkins and Smith (2013).

Aims of the study

Despite the increasing use of the Wechsler Intelligence Scales, there is a general lack of research investigating the long-term stability of subtest and composite scores. Thus, the specific objective of the present research was to examine the long-term stability of the French WISC-IV scores with young children. The most interesting features of the present study were the large non-clinical French-speaking Swiss sample and the consideration of both perspectives, nomothetic (group level) and

Participants

The sample was composed of French-speaking Swiss children aged from seven to twelve years, attending school in the canton of Geneva, Switzerland. Children were tested twice with the WISC-IV during school hours. Participant selection was restricted to primary students because in the Geneva school system, students change schools during the transition from primary to secondary school. The participation in this academic clinical study,

Results

Descriptive statistics (means, SDs, mean score differences, and Cohen's d) and long-term stability coefficients of the WISC-IV index and subtest scores are reported in Table 2. First and as expected, IQ scores and subtest scores are close to the theoretical means (i.e., 100 and 10) and to the theoretical standard deviations (i.e., 15 and 3). For the first assessment, the mean IQs ranged from 95.18 (WMI) to 104.90 (VCI), with standard deviations between 13.88 (PSI) and 15.23 (VCI). For the

Discussion

The present study investigated the long-term stability of the French WISC-IV scores with non-clinical children tested twice. To our knowledge, only three studies have examined the long-term stability of the WISC-IV scores: Lander (2010), Watkins and Smith (2013), and Bartoi et al. (2015). These studies were conducted with U.S. clinical samples. In contrast, our sample involves non-clinical French-speaking Swiss children. Thus, the present study provides useful information about the

Conclusion

The current study has important implications for psychological practice. Because the GAI is relatively stable at an interindividual level and at an individual level, this index score might be the most useful for predictions. In contrast, our results showed that the FSIQ was less stable at the individual level (idiographic perspective). However, more studies are required to demonstrate the predictive validity of the GAI. Based on an idiographic perspective, it could be argued that two

Funding

This work was supported by Grant 100014_135406 awarded by the Swiss National Science Foundation (Long-term stability of the WISC-IV: Standard and CHC composite scores; main applicant: T. Lecerf; co-applicants: N. Favez & J. Rossier).

Disclosure of interest

The authors declare that they have no competing interest.

References (64)

  • M.J. Allen et al. Introduction to measurement theory (1979)
  • American Educational Research Association et al. Standards for educational and psychological testing (2014)
  • A.G. Barnett et al. Regression to the mean: What it is and how to deal with it. International Journal of Epidemiology (2005)
  • M. Bartoi et al. Attention problems and stability of WISC-IV scores among clinically referred children. Applied Neuropsychology: Child (2015)
  • E.E. Bauman. Stability of WISC-R scores in children with learning difficulties. Psychology in the Schools (1991)
  • R.H.B. Benedict et al. Practice effects during repeated administrations of memory tests with and without alternate forms. Journal of Clinical and Experimental Neuropsychology (1998)
  • D. Bremner et al. WISC-IV GAI and CPI in psychoeducational assessment. Canadian Journal of School Psychology (2011)
  • M. Calamia et al. Scoring higher the second time around: Meta-analyses of practice effects in neuropsychological assessment. The Clinical Neuropsychologist (2012)
  • G.L. Canivez et al. Long-term stability of the Wechsler Intelligence Scale for Children – Third Edition. Psychological Assessment (1998)
  • G.L. Canivez et al. Long-term stability of the Wechsler Intelligence Scale for Children – Third Edition among students with disabilities. School Psychology Review (2001)
  • R.A. Charter et al. Meaning of reliability in terms of correct and incorrect clinical decisions: The art of decision making is still alive. Journal of Clinical and Experimental Neuropsychology (2001)
  • H.-Y. Chen et al. What does the WISC-IV measure? Validation of the scoring and CHC-based interpretative approaches. Journal of Research in Education Sciences (2009)
  • J. Cohen. Statistical power analysis for the behavioral sciences (rev.) (1977)
  • J. Cohen. A power primer. Psychological Bulletin (1992)
  • I.J. Deary et al. The stability of intelligence from age 11 to age 90 years: The Lothian birth cohort of 1921. Psychological Science (2013)
  • S.S. Dikmen et al. Test–retest reliability and practice effects of Expanded Halstead–Reitan Neuropsychological Test Battery. Journal of the International Neuropsychological Society (1999)
  • E. Estevis et al. Effects of practice on the Wechsler Adult Intelligence Scale-IV across 3- and 6-month intervals. The Clinical Neuropsychologist (2012)
  • E. Ferrer et al. Multivariate modeling of age and retest in longitudinal studies of cognitive abilities. Psychology and Aging (2005)
  • D.P. Flanagan et al. Essentials of WISC®-IV assessment (2009)
  • J. Gaetano. Holm–Bonferroni sequential correction: An EXCEL calculator-ver 1.2 (2013)
  • Ghisletta, P., & Lecerf, T. (2015). Intelligence, Crystallized. In S. K. Whitbourne (Ed.), The Encyclopedia of...
  • P. Golay et al. Further insights on the French WISC-IV factor structure through Bayesian structural equation modeling. Psychological Assessment (2013)