Measuring theory of mind across middle childhood: Reliability and validity of the Silent Films and Strange Stories tasks

https://doi.org/10.1016/j.jecp.2015.07.011

Abstract

Recent years have seen a growth of research on the development of children’s ability to reason about others’ mental states (or “theory of mind”) beyond the narrow confines of the preschool period. The overall aim of this study was to investigate the psychometric properties of a task battery composed of items from Happé’s Strange Stories task and Devine and Hughes’ Silent Film task. A sample of 460 ethnically and socially diverse children (211 boys) between 7 and 13 years of age completed the task battery at two time points separated by 1 month. The Strange Stories and Silent Film tasks were strongly correlated even when verbal ability and narrative comprehension were taken into account, and all items loaded onto a single theory-of-mind latent factor. The theory-of-mind latent factor provided reliable estimates of performance across a wide range of theory-of-mind ability and showed no evidence of differential item functioning across gender, ethnicity, or socioeconomic status. The theory-of-mind latent factor also exhibited strong 1-month test–retest reliability, and this stability did not vary as a function of child characteristics. Taken together, these findings provide evidence for the validity and reliability of the Strange Stories and Silent Film task battery as a measure of individual differences in theory of mind suitable for use across middle childhood. We consider the methodological and conceptual implications of these findings for research on theory of mind beyond the preschool years.

Introduction

How children learn to use mental states, such as desires, knowledge, and beliefs, to predict and explain others’ behavior (commonly referred to as the acquisition of a “theory of mind”) is a topic that has attracted extensive theorizing and empirical research for nearly four decades (for recent reviews, see Hughes and Devine, 2015, Wellman, 2014). Most of this research has centered on a single task, the false belief task (Wimmer & Perner, 1983), in which an object is moved in an agent’s absence, such that children need to recognize that the agent has a mistaken belief in order to predict or explain his or her behavior. More complex tasks measure children’s ability to attribute beliefs to an agent about another agent’s beliefs (i.e., “second-order” false beliefs) (Perner & Wimmer, 1985) or to attribute emotional states to others on the basis of false beliefs (e.g., Harris, Johnson, Hutton, Andrews, & Cooke, 1989). These tasks have been used to study both individual differences and age-related changes in theory of mind (ToM) during the preschool and early school years (e.g., Wellman, Cross, & Watson, 2001).

Over the past decade, the developmental scope of ToM research has been greatly increased by the design of new tasks for use with infants (e.g., Luo & Baillargeon, 2010) and with adults (e.g., Apperly, Samson, & Humphreys, 2009). With some notable exceptions, including early research on the development of children’s understanding of the interpretive nature of knowledge during early middle childhood (e.g., Carpendale & Chandler, 1996) and evidence for meaningful individual differences in preadolescents’ ability to reason about characters’ mental states (e.g., Bosacki & Astington, 1999), the developmental period of middle childhood has been largely overlooked. However, over recent years this developmental period has begun to attract research attention (e.g., Apperly et al., 2011, Banerjee et al., 2011, Devine and Hughes, 2013, Dumontheil et al., 2010).

Middle childhood (the developmental period between 6 and 12 years of age) is a particularly interesting period in which to study ToM. From a sociocultural perspective, it is worth noting that in primary school children are exposed to increasingly sophisticated forms of knowledge (e.g., fictional literature) and also spend increasing amounts of time outside the home interacting with their peers in a greater variety of contexts (e.g., Del Giudice, 2014, Eccles, 1999). Understanding how these new experiences shape and are shaped by individual differences in ToM presents a novel opportunity for researchers. Indeed, recent work in the field has demonstrated that individual differences in ToM during this period are related to important social and academic outcomes (e.g., Banerjee et al., 2011, Lecce et al., 2011). From a neuropsychological perspective, there is evidence of continued structural changes in the frontal and parietal lobes (specifically, gray matter volume increases in these regions across middle childhood; e.g., Giedd et al., 1999) and related gains in cognitive performance in domains such as executive function (EF) across middle childhood (e.g., Davidson, Amso, Anderson, & Diamond, 2006). Research on ToM across middle childhood could shed light on the correlates and consequences of these neuropsychological changes. Indeed, researchers have now begun to examine the developmental links between ToM and EF across middle childhood in order to understand the factors underpinning the continued development of ToM during this period (e.g., Bock et al., 2015, Lagattuta et al., 2010, Lagattuta et al., 2014).
In an effort to contribute to this budding new field of research, the focus of the current study was to examine the validity and reliability of two tasks that appear promising as developmentally appropriate and useful indicators of ToM across middle childhood: the Strange Stories task (Happé, 1994) and a more recent analogue task using brief clips from a classic silent film (Devine & Hughes, 2013).

Validity refers to whether a test measures the construct that it purports to measure (e.g., Rust & Golombok, 2009). Test validity is established through the accumulation of evidence about whether the test conforms to expectations and hypotheses about the construct being measured (Carmines and Zeller, 1979, Rust and Golombok, 2009). Tasks that purport to measure a particular construct (e.g., false belief understanding) should be related to tasks that measure the same or similar constructs (convergent validity) and unrelated to tasks that measure different constructs (discriminant validity). Ideally, tests should also show evidence of correlations with real-life outcomes (criterion validity). Test validity can be established by examining the correlations between concurrent measures and longitudinal outcomes and by assessing group differences (e.g., 3-year-old vs. 4-year-old children, typical vs. atypical groups) (Cronbach and Meehl, 1955, Messick, 1995).

Four sources of evidence support the validity of the false belief task. First, children’s performance on the different versions of the false belief task shows moderate to strong concurrent correlations, suggesting that the various versions measure a single construct (e.g., Hughes et al., 2000, Hughes et al., 2014). That is, various false belief tasks show convergent validity. Second, there is now a growing body of evidence supporting the criterion validity of false belief tasks; individual differences in performance on the false belief task among typically developing children can be predicted by early social experiences (e.g., parent–child talk about mental states) and in turn correlate with important social outcomes (Hughes and Devine, 2015, Slaughter et al., 2015). Third, the false belief task is sensitive to development; between 2 and 5 years of age, children’s performance on this task improves dramatically (Wellman et al., 2001). Finally, children with known impairments in social competence (e.g., children with autism spectrum disorder [ASD], “hard-to-manage” preschoolers) show marked deficits in performance on the false belief task relative to children matched in age and verbal ability (e.g., Baron-Cohen et al., 1985, Hughes et al., 1998). In sum, there is evidence that the false belief task shows convergent, discriminant, and criterion validity as a measure of ToM.

Test validity hinges on the precision and repeatability of a measurement, that is, the reliability of the test (e.g., Carmines & Zeller, 1979). Test reliability can be established through examining the dimensionality of a set of items that comprise a test (i.e., the internal consistency of items within a test) and through examining the stability of test scores over time (i.e., the test–retest reliability of a measure) (e.g., Rust & Golombok, 2009). These two forms of reliability testing enable researchers to establish the precision with which individual test items measure the construct of interest and the extent to which test scores can be reproduced with repeated measurement. Test reliability is vital to the study of individual differences and developmental change. If error variance is not accounted for, it is difficult to know whether observed correlations or test score changes reflect genuine associations with or changes in the construct of interest. Evidence for the reliability of the false belief task has grown over the past two decades. In an early study, Hughes and colleagues (2000) demonstrated that a battery of first- and second-order false belief tasks exhibited good internal consistency and strong 1-month test–retest reliability. Importantly, by examining the interaction between initial task performance and individual differences in verbal ability, Hughes and colleagues found that the test–retest reliability of the task battery was stable across different levels of verbal ability, suggesting that the battery of tasks could be used reliably with children of varying levels of ability.
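The internal-consistency notion described above is most commonly quantified with Cronbach’s alpha. The sketch below is purely illustrative (it is not code from the study, and the score matrices in the usage example are invented); it shows the standard computation over a respondents-by-items score matrix:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: perfectly consistent items yield alpha = 1.0
perfect = np.array([[1., 1.], [2., 2.], [3., 3.]])
mixed = np.array([[1., 2.], [2., 1.], [3., 3.]])
```

When the items are perfectly correlated (as in `perfect`), alpha reaches 1.0; less consistent response patterns (as in `mixed`) pull it down.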

Over recent years, researchers have begun to apply modern psychometric approaches when assessing the validity and reliability of tests of ToM for young children (e.g., Wellman & Liu, 2004). Confirmatory factor analysis (CFA) is a flexible way in which to assess the psychometric properties of test batteries. CFA is hypothesis-driven and permits researchers to test a measurement model against data using multiple fit indices. CFA enables researchers to tackle the task impurity problem inherent in cognitive research by partitioning the variance that is common between a set of items or tasks (i.e., the true score variance) from the variance associated with a specific task or item and measurement error (i.e., the residual variance) (e.g., Miyake et al., 2000). Importantly for test development, CFA enables researchers to examine the stability of a measurement model (or “measurement invariance”) across different groups (e.g., gender, ethnic groups) and over time (Brown, 2006). Establishing measurement invariance is an important step in studying the fairness of a test. Differences in test performance should reflect genuine differences in the variable of interest and not group differences in the psychometric properties of the test (Knight and Zerr, 2010, Millsap, 2010). Differential item functioning (DIF) occurs when groups differ in their performance on a particular item because the item taps abilities other than the one it was intended to measure; because such differences do not reflect the target ability, DIF can undermine the fairness of a test (Walker, 2011). Using multiple-group CFA, it is possible to assess whether items exhibit DIF.
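The study itself probes DIF with multiple-group CFA, but the underlying logic of a DIF screen can be illustrated with a simpler, classical alternative: the Mantel-Haenszel procedure, which compares a reference and a focal group on one item within strata of matched ability (typically total test score). The sketch below is a hedged illustration with invented 0/1 responses, not the authors’ method:

```python
import numpy as np

def mantel_haenszel_or(ref, foc, strata_ref, strata_foc):
    """Mantel-Haenszel common odds ratio for one item.

    ref/foc: 0/1 item responses for the reference and focal groups;
    strata_*: matching-stratum labels (e.g., total test score) for each
    respondent. A ratio near 1.0 indicates no DIF on this item.
    """
    num = den = 0.0
    for s in set(strata_ref.tolist()) | set(strata_foc.tolist()):
        r = ref[strata_ref == s]
        f = foc[strata_foc == s]
        if len(r) == 0 or len(f) == 0:
            continue
        n = len(r) + len(f)
        a, b = r.sum(), len(r) - r.sum()  # reference: correct / incorrect
        c, d = f.sum(), len(f) - f.sum()  # focal: correct / incorrect
        num += a * d / n
        den += b * c / n
    return num / den
```

With identical response patterns in both groups at every ability stratum, the odds ratio is exactly 1.0, i.e., no evidence of DIF.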

CFA has been applied extensively in the study of EF and has been used to analyze the psychometric properties of test batteries designed to measure EF across early childhood. For example, Willoughby and colleagues have used CFA and item response theory (IRT) to examine the measurement structure, precision, and test–retest reliability of a novel battery of EF tasks for young children (e.g., Willoughby and Blair, 2011, Willoughby et al., 2010). These studies demonstrate the flexibility of using CFA as a means to assess the psychometric properties of task batteries, and researchers have begun to apply CFA to assess individual differences in performance on measures of false belief understanding (Hughes et al., 2011, Hughes et al., 2014). These findings have revealed that false belief task batteries load onto a single latent factor and are invariant across gender and partially invariant across cultures. In summary, the false belief task appears to provide a valid and reliable measure of ToM for young children. A parallel body of evidence that assesses the psychometric properties of ToM tasks for use across middle childhood is now needed to support research on individual differences and change in ToM during this developmental period.

Over the past decade, researchers have devised a diverse range of tasks purporting to measure different aspects of ToM, using a variety of stimuli such as vignettes, cartoons, audio recordings, and film clips. Beyond this variety of stimuli, the tests appear to measure distinct aspects of ToM, such as emotion understanding, perspective taking, understanding the interpretive nature of mind, attribution of intention, and explanation of behavior with reference to beliefs, knowledge, and desires (e.g., Baron-Cohen et al., 1997, Carpendale and Chandler, 1996, Castelli et al., 2000, Dumontheil et al., 2010, Dziobek et al., 2006, Golan et al., 2006, Happé, 1994). Supporting the validity of these tasks, adults with ASD and schizophrenia have been shown to have difficulties on these tasks relative to matched “neurotypical” controls (Chung, Barch, & Strube, 2014). Crucially, these limitations in performance are specific to test items centered on mentalistic content and not simply on narrative understanding or nonmental content (e.g., White, Hill, Happé, & Frith, 2009). Although there is some evidence for the validity of these “advanced” tasks, less is known about the precision and stability of these measures. With few exceptions, little effort has been made to evaluate the psychometric properties of these tasks (e.g., Dziobek et al., 2006, Fernandez-Abascal et al., 2013).

In an effort to develop age-appropriate ToM tasks to study individual differences and age-related changes in ToM across middle childhood, Devine and Hughes (2013) administered Happé’s (1994) vignette-based Strange Stories task alongside a novel Silent Film task to 230 middle-class children between 8 and 13 years of age. Successful performance on both of these tasks required children to explain a character’s behavior with reference to the character’s knowledge, beliefs, and desires. The findings from this initial study revealed that both the Strange Stories and Silent Film tasks were sensitive to age-related differences in performance, with neither task exhibiting marked ceiling effects. There were strong concurrent associations between the two tasks, supporting the convergent validity of the Silent Film task with the widely used Strange Stories task. More recently, longitudinal findings have shown that performance on a battery of false belief tasks at 6 years of age was significantly correlated with later performance on both the Strange Stories and Silent Film tasks at 10 years of age (Devine, White, Ensor, & Hughes, 2015). These findings provide further evidence for the convergent validity of these advanced ToM tasks. Supporting the criterion validity of these tasks, low scores in girls were associated with self-reported loneliness and low scores in boys were associated with self-reported peer exclusion. Despite differences in the modality of each task, CFA revealed that a unidimensional ToM latent factor underpinned performance on the diverse items of the Strange Stories and Silent Film tasks. This ToM latent factor exhibited measurement invariance in boys and girls, with no evidence of DIF. In sum, the Strange Stories and Silent Film task battery is a promising way in which to measure ToM across middle childhood. That said, further work is needed to investigate the validity, precision, and reliability of these tasks. 
The purpose of our study was to investigate the psychometric properties of the Strange Stories and Silent Film task battery in a large ethnically and socially diverse sample of children between 7 and 13 years of age.

Our first aim was to examine further the convergent, discriminant, and construct validity of the Strange Stories and Silent Film tasks. Although it is tempting to claim that a ToM latent factor underpins participants’ performance on the items of the Strange Stories and Silent Film tasks, the relations between the items may simply reflect common variance due to another variable, for example, the ability to comprehend a narrative sequence rather than the ability to reason about mental states per se. To rule out this alternative interpretation, the participants completed three “control” stories (matched in length and linguistic complexity) that described scenarios involving human characters but contained no mental state content (White et al., 2009). These control stories made it possible to determine whether the correlation between performance on the Strange Stories task (mental state items) and the Silent Film task items persisted once individual differences in story or narrative comprehension were taken into account.

The second aim of our study was to examine the precision and measurement invariance of the Strange Stories and Silent Film task battery. Using IRT models, it was possible to compute standard errors that are conditional on a certain trait or “theta” level and so assess the reliability of the Strange Stories and Silent Film task battery at different levels of latent ability (Embretson and Reise, 2000, Hays et al., 2000). Extending findings about the measurement invariance of the Strange Stories and Silent Film task battery across boys and girls, the diverse sample recruited for the current study also made it possible to assess measurement invariance across different ethnic and socioeconomic groups.
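The conditional-standard-error logic behind this second aim can be sketched with a two-parameter logistic (2PL) IRT model: the information an item contributes depends on the respondent’s latent ability (theta), so the test’s standard error varies across the ability range. The item parameters below are invented for illustration and are not estimates from the study:

```python
import numpy as np

def two_pl_prob(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def conditional_se(theta, a, b):
    """Test standard error at theta: 1 / sqrt(total item information)."""
    p = two_pl_prob(theta, a, b)
    info = (a ** 2) * p * (1 - p)  # Fisher information per 2PL item
    return 1.0 / np.sqrt(info.sum())

# Hypothetical item parameters (not estimated from the study's data)
a = np.array([1.2, 0.9, 1.5, 1.1])   # discriminations
b = np.array([-1.0, 0.0, 0.5, 1.0])  # difficulties
```

Because the hypothetical difficulties cluster around zero, measurement is most precise (smallest standard error) for children of average latent ability and degrades toward the extremes of the trait range.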

The third aim of our study was to investigate the test–retest reliability of the Strange Stories and Silent Film task battery. To date, no published studies have sought to examine the short-term stability of measures of ToM across middle childhood and adolescence. Latent variable modeling with CFA provides a particularly robust way in which to examine test–retest reliability. Typically, researchers estimate the correlation between initial and retest scores. Because this approach does not account for item-specific variance and measurement error, the correlations between test scores might not provide accurate estimates of the stability of performance on the latent variable. In one pioneering study, Willoughby and Blair (2011) examined the 1-month test–retest reliability of a battery of EF tasks for preschool children using a latent variable approach. By accounting for the potential instability of item-specific variance, Willoughby and Blair found that, in contrast to the moderate test–retest correlations between specific items, the correlation between the latent factors approached unity. We adopted the same analytic strategy in the current study to examine the 1-month test–retest reliability of the Strange Stories and Silent Film tasks. Given the large and diverse sample, we also assessed whether test–retest stability was moderated by child characteristics such as age, gender, ethnicity, socioeconomic status (SES), and verbal ability. Building on the analysis used by Hughes and colleagues (2000) when studying the test–retest stability of the false belief task battery, nonsignificant interaction effects between child characteristics and test stability would provide evidence to support the applicability of the Strange Stories and Silent Film task battery across a diverse range of children.
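The gap between observed retest correlations and latent stability follows from classical test theory: measurement error attenuates the observed correlation, and the classical correction for attenuation recovers an estimate of the latent correlation. The numbers below are hypothetical, chosen only to show how modest observed stability can imply near-perfect latent stability:

```python
import math

def disattenuated_r(r_observed, rel_t1, rel_t2):
    """Classical correction for attenuation: estimated latent correlation."""
    return r_observed / math.sqrt(rel_t1 * rel_t2)

# Hypothetical values: moderate observed stability, moderately reliable scores
r_latent = disattenuated_r(0.63, 0.70, 0.70)  # 0.63 / 0.70 = 0.90
```

This is the same intuition that the latent variable approach formalizes: once item-specific variance and error are modeled, the stability of the latent factor can approach unity even when raw score correlations are only moderate.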

To summarize, our study had three primary aims. Our first aim was to examine the convergent, discriminant, and construct validity of the Silent Film and Strange Stories tasks as measures of ToM suitable for use across middle childhood. Our second aim was to assess the precision and measurement invariance of the Silent Film and Strange Stories tasks. Our third aim was to assess the test–retest reliability of the Silent Film and Strange Stories tasks and examine whether 1-month test–retest stability varied as a function of individual differences in child characteristics.

Section snippets

Participants

Participants were recruited from eight socioeconomically and ethnically diverse state schools in the South East of England. The eight schools involved in this study were average or above average in terms of total number of pupils (i.e., >263 pupils for primary schools and >978 pupils for secondary schools), and all were based in urban areas (Office for Standards in Education (OFSTED), 2014). Of the 565 children in the classes approached, 38 were not eligible to take part because teachers

Analytic strategy

The data were analyzed using a latent variable framework in Mplus Version 7 (Muthén & Muthén, 2012). Given the categorical nature of our data, we used a mean- and variance-adjusted weighted least squares estimator (rather than a maximum likelihood estimator) in each of our models (Brown, 2006, Kline, 2011). For each model, we evaluated fit using Brown’s (2006) four recommended criteria: a nonsignificant chi-square (χ2) test, root mean square error of approximation (RMSEA) ≤ .08, comparative fit
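The fit indices named in this snippet have standard textbook definitions that can be computed directly from model and baseline chi-square values. The sketch below uses those standard formulas; the chi-square values in the test are invented (only the sample size, N = 460, mirrors the study):

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation from a model chi-square."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative fit index of a model relative to the baseline (null) model."""
    d_m = max(chi2_m - df_m, 0.0)          # model misfit beyond chance
    d_b = max(chi2_b - df_b, d_m)          # baseline misfit (at least d_m)
    return 1.0 - d_m / d_b if d_b > 0 else 1.0
```

By these definitions a model whose chi-square equals its degrees of freedom has RMSEA of zero, and RMSEA ≤ .08 (as in the criterion quoted above) marks acceptable approximate fit.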

Discussion

This investigation of the psychometric properties of the Strange Stories and Silent Film task battery involved 460 children between 7 and 13 years of age and yielded three sets of findings. First, scores on both tasks were strongly correlated even when verbal ability and narrative comprehension were taken into account. Replicating previous findings (Devine & Hughes, 2013), the ToM latent factor was sensitive to effects of age and gender. Second, the Strange Stories and Silent Film task battery

Acknowledgments

We thank all of the principals, teachers, and pupils at the participating schools in Cambridge, Nottingham, and London, England. We also thank Abby Furniss and Sakshi Rathi for their assistance with data collection and coding. R.T. Devine was funded by the Isaac Newton Trust (Cambridge, UK). For more information about using the Silent Film task in research, contact R.T. Devine by e-mail ([email protected]).

References (73)

  • Apperly, I. A. (2012). What is theory of mind? Concepts, cognitive processes, and individual differences. Quarterly Journal of Experimental Psychology.
  • Apperly, I. A., et al. (2009). Studies of adults can inform accounts of theory of mind development. Developmental Psychology.
  • Apperly, I. A., et al. (2011). Developmental continuity in theory of mind: Speed and accuracy of belief–desire reasoning in children and adults. Child Development.
  • Banerjee, R., et al. (2011). Peer relations and understanding of faux pas: Longitudinal evidence for bidirectional associations. Child Development.
  • Baron-Cohen, S., et al. (1997). Another advanced test of theory of mind: Evidence from very high functioning adults with autism or Asperger syndrome. Journal of Child Psychology and Psychiatry.
  • Bock, A. M., et al. (2015). Specifying the links between executive functioning and theory of mind during middle childhood: Cognitive flexibility predicts social understanding. Journal of Cognition and Development.
  • Bosacki, S., et al. (1999). Theory of mind in preadolescence: Relations between social understanding and social competence. Social Development.
  • Bradley, R. H., et al. (2002). Socioeconomic status and child development. Annual Review of Psychology.
  • British Psychological Society (2010). Code of human research ethics.
  • Brown, T. A. (2006). Confirmatory factor analysis for applied research.
  • Calero, C., et al. (2013). Age and gender dependent development of theory of mind in 6 to 8 year old children. Frontiers in Human Neuroscience.
  • Carmines, E. G., et al. (1979). Reliability and validity assessment.
  • Carpendale, J., et al. (1996). On the distinction between false belief understanding and subscribing to an interpretive theory of mind. Child Development.
  • Chung, Y. S., et al. (2014). A meta-analysis of mentalizing impairments in adults with schizophrenia and autism spectrum disorder. Schizophrenia Bulletin.
  • Conger, R. D., et al. (2007). An interactionist perspective on the socioeconomic context of human development. Annual Review of Psychology.
  • Cronbach, L. J., et al. (1955). Construct validity in psychological tests. Psychological Bulletin.
  • Cutting, A. L., et al. (1999). Theory of mind, emotion understanding, language, and family background: Individual differences and interrelations. Child Development.
  • Del Giudice, M. (2014). Middle childhood: An evolutionary–developmental synthesis. Child Development Perspectives.
  • Devine, R. T., & Hughes, C. (2015). Family correlates of false-belief understanding: A meta-analytic review. Unpublished...
  • Devine, R. T., White, N., Ensor, R., & Hughes, C. (2015). Theory of mind in middle childhood: Longitudinal associations...
  • Devine, R. T., et al. (2013). Silent films and strange stories: Theory of mind, gender, and social experiences in middle childhood. Child Development.
  • Devine, R. T., et al. (2014). Relations between false-belief understanding and executive function in early childhood: A meta-analysis. Child Development.
  • Dumontheil, I., et al. (2010). Online usage of theory of mind continues to develop in late adolescence. Developmental Science.
  • Dziobek, I., et al. (2006). Introducing MASC: A movie assessment of social cognition. Journal of Autism and Developmental Disorders.
  • Eccles, J. S. (1999). The development of children ages 6 to 14. The Future of Children.
  • Embretson, S. E., et al. (2000). Item response theory for psychologists.