Measuring theory of mind across middle childhood: Reliability and validity of the Silent Films and Strange Stories tasks
Introduction
How children learn to use mental states, such as desires, knowledge, and beliefs, to predict and explain others’ behavior (commonly referred to as the acquisition of a “theory of mind”) is a topic that has attracted extensive theorizing and empirical research for nearly four decades (for recent reviews, see Hughes & Devine, 2015; Wellman, 2014). Most of this research has centered on a single task, the false belief task (Wimmer & Perner, 1983), in which an object is moved in an agent’s absence, such that children need to recognize that the agent has a mistaken belief in order to predict or explain his or her behavior. More complex tasks measure children’s ability to attribute beliefs to an agent about another agent’s beliefs (i.e., “second-order” false beliefs; Perner & Wimmer, 1985) or to attribute emotional states to others on the basis of false beliefs (e.g., Harris, Johnson, Hutton, Andrews, & Cooke, 1989). These tasks have been used to study both individual differences and age-related changes in theory of mind (ToM) during the preschool and early school years (e.g., Wellman, Cross, & Watson, 2001).
Over the past decade, the developmental scope of ToM research has been greatly increased by the design of new tasks for use with infants (e.g., Luo & Baillargeon, 2010) and with adults (e.g., Apperly, Samson, & Humphreys, 2009). With some notable exceptions, including early research on the development of children’s understanding of the interpretive nature of knowledge during early middle childhood (e.g., Carpendale & Chandler, 1996) and evidence for meaningful individual differences in preadolescents’ ability to reason about characters’ mental states (e.g., Bosacki & Astington, 1999), the developmental period of middle childhood has been largely overlooked. However, over recent years this developmental period has begun to attract research attention (e.g., Apperly et al., 2011; Banerjee et al., 2011; Devine & Hughes, 2013; Dumontheil et al., 2010).
Middle childhood (roughly 6 to 12 years of age) is a particularly interesting period in which to study ToM. From a sociocultural perspective, primary school children are exposed to increasingly sophisticated forms of knowledge (e.g., fictional literature) and spend increasing amounts of time outside the home interacting with peers in a greater variety of contexts (e.g., Del Giudice, 2014; Eccles, 1999). Understanding how these new experiences shape and are shaped by individual differences in ToM presents a novel opportunity for researchers. Indeed, recent work in the field has demonstrated that individual differences in ToM during this period are related to important social and academic outcomes (e.g., Banerjee et al., 2011; Lecce et al., 2011). From a neuropsychological perspective, there is evidence of continued structural changes in the frontal and parietal lobes (specifically, gray matter volume increases in these regions across middle childhood; e.g., Giedd et al., 1999) and related gains in cognitive performance in domains such as executive function (EF) across middle childhood (e.g., Davidson, Amso, Anderson, & Diamond, 2006). Research on ToM across middle childhood could shed light on the correlates and consequences of these neuropsychological changes. Indeed, researchers have now begun to examine the developmental links between ToM and EF across middle childhood in order to understand the factors underpinning the continued development of ToM during this period (e.g., Bock et al., 2015; Lagattuta et al., 2010, 2014).
In an effort to contribute to this budding new field of research, the focus of the current study was to examine the validity and reliability of two tasks that appear promising as developmentally appropriate and useful indicators of ToM across middle childhood: the Strange Stories task (Happé, 1994) and a more recent analogue task using brief clips from a classic silent film (Devine & Hughes, 2013).
Validity refers to whether a test measures the construct that it purports to measure (e.g., Rust & Golombok, 2009). Test validity is established through the accumulation of evidence about whether the test conforms to expectations and hypotheses about the construct being measured (Carmines & Zeller, 1979; Rust & Golombok, 2009). Tasks that purport to measure a particular construct (e.g., false belief understanding) should be related to tasks that measure the same or similar constructs (convergent validity) and unrelated to tasks that do not (discriminant validity). Ideally, tests should also show evidence of correlations with real-life outcomes (criterion validity). Test validity can be established by examining the correlations between concurrent measures and longitudinal outcomes and by assessing group differences (e.g., 3-year-old vs. 4-year-old children, typical vs. atypical groups) (Cronbach & Meehl, 1955; Messick, 1995).
Four sources of evidence support the validity of the false belief task. First, children’s performance on the different versions of the false belief task shows moderate to strong concurrent correlations and so these versions appear to measure a single construct (e.g., Hughes et al., 2000, 2014). That is, various false belief tasks show convergent validity. Second, there is now a growing body of evidence supporting the criterion validity of false belief tasks; individual differences in performance on the false belief task among typically developing children can be predicted by early social experiences (e.g., parent–child talk about mental states) and in turn correlate with important social outcomes (Hughes & Devine, 2015; Slaughter et al., 2015). Third, the false belief task is sensitive to development; between 2 and 5 years of age, children’s performance on this task improves dramatically (Wellman et al., 2001). Finally, children with known impairments in social competence (e.g., children with autism spectrum disorder [ASD], “hard-to-manage” preschoolers) show marked deficits in performance on the false belief task relative to children matched in age and verbal ability (e.g., Baron-Cohen et al., 1985; Hughes et al., 1998). In sum, there is evidence that the false belief task shows convergent, discriminant, and criterion validity as a measure of ToM.
Test validity hinges on the precision and repeatability of a measurement, that is, the reliability of the test (e.g., Carmines & Zeller, 1979). Test reliability can be established by examining the dimensionality of the set of items that make up a test (i.e., the internal consistency of items within a test) and by examining the stability of test scores over time (i.e., the test–retest reliability of a measure) (e.g., Rust & Golombok, 2009). These two forms of reliability testing enable researchers to establish the precision with which individual test items measure the construct of interest and the extent to which test scores can be reproduced with repeated measurement. Test reliability is vital to the study of individual differences and developmental change. If error variance is not accounted for, it is difficult to know whether observed correlations or test score changes reflect genuine associations with or changes in the construct of interest. Evidence for the reliability of the false belief task has grown over the past two decades. In an early study, Hughes and colleagues (2000) demonstrated that a battery of first- and second-order false belief tasks exhibited good internal consistency and strong 1-month test–retest reliability. Importantly, by examining the interaction between initial task performance and individual differences in verbal ability, Hughes and colleagues found that the test–retest reliability of the task battery was stable across different levels of verbal ability, suggesting that the battery of tasks could be used reliably with children of varying levels of ability.
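To make the internal consistency half of this picture concrete, the sketch below computes Cronbach’s alpha, the most common index of internal consistency, for a small set of item scores. The data and function name are illustrative only and are not taken from any of the studies discussed here.

```python
# Minimal sketch of internal consistency: Cronbach's alpha.
# The item data below are hypothetical, not from the paper.
def cronbach_alpha(item_scores):
    """item_scores: one list of scores per item, aligned across respondents."""
    k = len(item_scores)
    n = len(item_scores[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    # Total test score for each respondent
    totals = [sum(item[j] for item in item_scores) for j in range(n)]
    # Alpha rises as the items' shared variance dominates their unique variance
    return (k / (k - 1)) * (1 - sum(var(item) for item in item_scores) / var(totals))

# Three hypothetical ToM items scored 0-2 for six children:
items = [
    [2, 1, 2, 0, 1, 2],
    [2, 1, 1, 0, 1, 2],
    [1, 1, 2, 0, 0, 2],
]
alpha = cronbach_alpha(items)  # higher alpha -> items covary more strongly
```

Because the three hypothetical items rank the children similarly, alpha here is high; a set of unrelated items would drive it toward zero.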
Over recent years, researchers have begun to apply modern psychometric approaches when assessing the validity and reliability of tests of ToM for young children (e.g., Wellman & Liu, 2004). Confirmatory factor analysis (CFA) is a flexible way in which to assess the psychometric properties of test batteries. CFA is hypothesis driven and permits researchers to test a measurement model against data using multiple fit indices. CFA enables researchers to tackle the task impurity problem inherent in cognitive research by partitioning the variance that is common between a set of items or tasks (i.e., the true score variance) from the variance associated with a specific task or item and measurement error (i.e., the residual variance) (e.g., Miyake et al., 2000). Importantly for test development, CFA enables researchers to examine the stability of a measurement model (or “measurement invariance”) across different groups (e.g., gender, ethnic groups) and over time (Brown, 2006). Establishing measurement invariance is an important step in studying the fairness of a test. Differences in test performance should reflect genuine differences in the variable of interest and not group differences in the psychometric properties of the test (Knight & Zerr, 2010; Millsap, 2010). Differential item functioning (DIF) occurs when groups differ in their performance on a particular item because that item taps abilities other than the one it was intended to measure and the groups differ on those other abilities; DIF can therefore undermine the fairness of a test (Walker, 2011). Using multiple-group CFA, it is possible to assess whether items exhibit DIF.
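The variance partitioning that CFA performs can be illustrated with the model-implied covariance structure of a unidimensional factor model: under a single-factor model with unit factor variance, the covariance between two distinct items equals the product of their loadings, and each item’s variance splits into common (true score) and residual parts. The loadings below are made-up values for illustration, not estimates from the ToM battery.

```python
# Sketch of the measurement model behind a one-factor CFA: each observed
# item i is modelled as x_i = lambda_i * theta + e_i, so the model-implied
# covariance between distinct items i and j is lambda_i * lambda_j, and each
# item's variance is common variance (lambda_i**2) plus residual variance.
def implied_cov(loadings, residual_vars):
    """Model-implied covariance matrix for a unidimensional factor model
    with unit factor variance."""
    n = len(loadings)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            cov[i][j] = loadings[i] * loadings[j]
            if i == j:
                cov[i][j] += residual_vars[i]
    return cov

# Illustrative (made-up) loadings for five standardized ToM items:
loadings = [0.7, 0.6, 0.8, 0.5, 0.65]
residuals = [1 - l ** 2 for l in loadings]  # standardized items: var = 1
cov = implied_cov(loadings, residuals)

# Share of item 0's variance that is common ("true score") variance:
communality = loadings[0] ** 2
```

Fitting a CFA amounts to finding loadings and residual variances whose implied covariance matrix reproduces the observed one; the residual terms absorb the task-specific variance and measurement error that the task impurity problem refers to.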
CFA has been applied extensively in the study of EF and has been used to analyze the psychometric properties of test batteries designed to measure EF across early childhood. For example, Willoughby and colleagues have used CFA and item response theory (IRT) to examine the measurement structure, precision, and test–retest reliability of a novel battery of EF tasks for young children (e.g., Willoughby & Blair, 2011; Willoughby et al., 2010). These studies demonstrate the flexibility of using CFA as a means to assess the psychometric properties of task batteries, and researchers have begun to apply CFA to assess individual differences in performance on measures of false belief understanding (Hughes et al., 2011, 2014). These findings have revealed that false belief task batteries load onto a single latent factor and are invariant across gender and partially invariant across cultures. In summary, the false belief task appears to provide a valid and reliable measure of ToM for young children. A parallel body of evidence that assesses the psychometric properties of ToM tasks for use across middle childhood is now needed to support research on individual differences and change in ToM during this developmental period.
Over the past decade, researchers have devised a diverse range of tasks purported to measure different aspects of ToM, using a variety of stimuli such as vignettes, cartoons, audio recordings, and film clips. In addition to the wide range of stimuli employed, these tests appear to tap distinct aspects of ToM, such as emotion understanding, perspective taking, understanding the interpretive nature of mind, attribution of intention, and explanation of behavior with reference to beliefs, knowledge, and desires (e.g., Baron-Cohen et al., 1997; Carpendale & Chandler, 1996; Castelli et al., 2000; Dumontheil et al., 2010; Dziobek et al., 2006; Golan et al., 2006; Happé, 1994). Supporting the validity of these tasks, adults with ASD and schizophrenia have been shown to have difficulties on these tasks relative to matched “neurotypical” controls (Chung, Barch, & Strube, 2014). Crucially, these limitations in performance are specific to test items centered on mentalistic content and not simply on narrative understanding or nonmental content (e.g., White, Hill, Happé, & Frith, 2009). Although there is some evidence for the validity of these “advanced” tasks, less is known about the precision and stability of these measures. With few exceptions, little effort has been made to evaluate the psychometric properties of these tasks (e.g., Dziobek et al., 2006; Fernandez-Abascal et al., 2013).
In an effort to develop age-appropriate ToM tasks to study individual differences and age-related changes in ToM across middle childhood, Devine and Hughes (2013) administered Happé’s (1994) vignette-based Strange Stories task alongside a novel Silent Film task to 230 middle-class children between 8 and 13 years of age. Successful performance on both of these tasks required children to explain a character’s behavior with reference to the character’s knowledge, beliefs, and desires. The findings from this initial study revealed that both the Strange Stories and Silent Film tasks were sensitive to age-related differences in performance, with neither task exhibiting marked ceiling effects. There were strong concurrent associations between the two tasks, supporting the convergent validity of the Silent Film task with the widely used Strange Stories task. More recently, longitudinal findings have shown that performance on a battery of false belief tasks at 6 years of age was significantly correlated with later performance on both the Strange Stories and Silent Film tasks at 10 years of age (Devine, White, Ensor, & Hughes, 2015). These findings provide further evidence for the convergent validity of these advanced ToM tasks. Supporting the criterion validity of these tasks, low scores in girls were associated with self-reported loneliness, and low scores in boys were associated with self-reported peer exclusion. Despite differences in the modality of each task, CFA revealed that a unidimensional ToM latent factor underpinned performance on the diverse items of the Strange Stories and Silent Film tasks. This ToM latent factor exhibited measurement invariance across boys and girls, with no evidence of DIF. In sum, the Strange Stories and Silent Film task battery is a promising way in which to measure ToM across middle childhood. That said, further work is needed to investigate the validity, precision, and reliability of these tasks.
The purpose of our study was to investigate the psychometric properties of the Strange Stories and Silent Film task battery in a large ethnically and socially diverse sample of children between 7 and 13 years of age.
Our first aim was to examine further the convergent, discriminant, and construct validity of the Strange Stories and Silent Film tasks. Although it is tempting to claim that a ToM latent factor underpins participants’ performance on the items of the Strange Stories and Silent Film tasks, the relations between the items may simply reflect common variance due to another variable, for example, the ability to comprehend a narrative sequence rather than mental states per se. To rule out this alternative interpretation, participants completed three “control” stories (matched in length and linguistic complexity) that described scenarios involving human characters but contained no mental state content (White et al., 2009). This made it possible to determine whether the correlation between performance on the Strange Stories (mental state) items and the Silent Film items persisted once individual differences in story or narrative comprehension were taken into account.
The second aim of our study was to examine the precision and measurement invariance of the Strange Stories and Silent Film task battery. Using IRT models, it was possible to compute standard errors that are conditional on a certain trait or “theta” level and so assess the reliability of the Strange Stories and Silent Film task battery at different levels of latent ability (Embretson & Reise, 2000; Hays et al., 2000). Extending findings about the measurement invariance of the Strange Stories and Silent Film task battery across boys and girls, the diverse sample recruited for the current study also made it possible to assess measurement invariance across different ethnic and socioeconomic groups.
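The idea of conditional standard errors can be sketched with a standard two-parameter logistic (2PL) IRT model: each item contributes Fisher information that peaks near its difficulty, and the standard error of measurement at a given theta is the reciprocal square root of the summed information. The item parameters below are invented for illustration and are not estimates from the Strange Stories or Silent Film items.

```python
import math

# Sketch of trait-level-dependent precision under a 2PL IRT model.
# Item parameters (a = discrimination, b = difficulty) are illustrative only.
def item_info(theta, a, b):
    """Fisher information of a 2PL item at trait level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def conditional_se(theta, items):
    """Standard error of measurement at theta: 1 / sqrt(test information)."""
    info = sum(item_info(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(info)

items = [(1.2, -1.0), (0.9, 0.0), (1.5, 0.5), (1.1, 1.2)]  # (a, b) pairs

# Precision is highest (SE lowest) where the item difficulties cluster
# and degrades toward the extremes of the trait range:
se_mid = conditional_se(0.3, items)
se_low = conditional_se(-2.5, items)
```

This is why an IRT analysis can show that a battery measures well in the middle of the ability range yet poorly at the extremes, something a single reliability coefficient cannot reveal.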
The third aim of our study was to investigate the test–retest reliability of the Strange Stories and Silent Film task battery. To date, no published studies have sought to examine the short-term stability of measures of ToM across middle childhood and adolescence. Latent variable modeling with CFA provides a particularly robust way in which to examine test–retest reliability. Typically, researchers estimate the correlation between initial and retest scores. Because this approach does not account for item-specific variance and measurement error, the correlations between test scores might not provide accurate estimates of the stability of performance on the latent variable. In one pioneering study, Willoughby and Blair (2011) examined the 1-month test–retest reliability of a battery of EF tasks for preschool children using a latent variable approach. By accounting for the potential instability of item-specific variance, Willoughby and Blair found that, in contrast to the moderate test–retest correlations between specific items, the correlation between the latent factors approached unity. We adopted the same analytic strategy in the current study to examine the 1-month test–retest reliability of the Strange Stories and Silent Film tasks. Given the large and diverse sample, we also assessed whether test–retest stability was moderated by child characteristics such as age, gender, ethnicity, socioeconomic status (SES), and verbal ability. Building on the analysis that Hughes and colleagues (2000) used to study the test–retest stability of their false belief task battery, we reasoned that nonsignificant interaction effects between child characteristics and test stability would provide evidence to support the applicability of the Strange Stories and Silent Film task battery across a diverse range of children.
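The gap between moderate observed test–retest correlations and near-unity latent correlations reflects attenuation due to measurement error, which Spearman’s classic correction makes explicit. The numbers below are purely illustrative and are not results from any of the studies cited.

```python
import math

# Spearman's disattenuation formula: the correlation between latent (true)
# scores equals the observed correlation divided by the geometric mean of
# the two occasions' reliabilities. Values here are hypothetical.
def disattenuated_r(r_obs, rel_t1, rel_t2):
    """Estimate the latent-level correlation from an observed correlation
    and the reliability of the measure at each occasion."""
    return r_obs / math.sqrt(rel_t1 * rel_t2)

# A moderate observed correlation (.55) between two noisy measurements
# (reliability .60 at each occasion) is consistent with high latent stability:
r_latent = disattenuated_r(0.55, 0.60, 0.60)
```

Latent variable modeling achieves the same end more rigorously, estimating the stability of the construct itself once item-specific variance and error are partitioned out, which is why the latent test–retest correlation can approach unity even when individual items correlate only moderately across occasions.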
To summarize, our study had three primary aims. Our first aim was to examine the convergent, discriminant, and construct validity of the Silent Film and Strange Stories tasks as measures of ToM suitable for use across middle childhood. Our second aim was to assess the precision and measurement invariance of the Silent Film and Strange Stories tasks. Our third aim was to assess the test–retest reliability of the Silent Film and Strange Stories tasks and examine whether 1-month test–retest stability varied as a function of individual differences in child characteristics.
Section snippets
Participants
Participants were recruited from eight socioeconomically and ethnically diverse state schools in the South East of England. The eight schools involved in this study were average or above average in terms of total number of pupils (i.e., >263 pupils for primary schools and >978 pupils for secondary schools), and all were based in urban areas (Office for Standards in Education [OFSTED], 2014). Of the 565 children in the classes approached, 38 were not eligible to take part because teachers
Analytic strategy
The data were analyzed using a latent variable framework in Mplus Version 7 (Muthén & Muthén, 2012). Given the categorical nature of our data, we used a mean- and variance-adjusted weighted least squares estimator (rather than a maximum likelihood estimator) in each of our models (Brown, 2006; Kline, 2011). For each model, we evaluated fit using Brown’s (2006) four recommended criteria: a nonsignificant chi-square (χ2) test, root mean square error of approximation (RMSEA) ≤ .08, comparative fit
Discussion
This investigation of the psychometric properties of the Strange Stories and Silent Film task battery involved 460 children between 7 and 13 years of age and yielded three sets of findings. First, scores on both tasks were strongly correlated even when verbal ability and narrative comprehension were taken into account. Replicating previous findings (Devine & Hughes, 2013), the ToM latent factor was sensitive to effects of age and gender. Second, the Strange Stories and Silent Film task battery
Acknowledgments
We thank all of the principals, teachers, and pupils at the participating schools in Cambridge, Nottingham, and London, England. We also thank Abby Furniss and Sakshi Rathi for their assistance with data collection and coding. R.T. Devine was funded by the Isaac Newton Trust (Cambridge, UK). For more information about using the Silent Film task in research, contact R.T. Devine by e-mail ([email protected]).
References (73)
The extreme male brain theory of autism. Trends in Cognitive Sciences (2002).
Does the autistic child have a theory of mind? Cognition (1985).
Movement and mind: A functional imaging study of perception and interpretation of complex intentional movement patterns. NeuroImage (2000).
Development of cognitive control and executive functions from 4 to 13 years: Evidence from manipulations of memory, inhibition, and task switching. Neuropsychologia (2006).
Intelligence and educational achievement. Intelligence (2007).
Individual differences in false belief understanding are stable from 3 to 6 years of age and predict children’s mental state talk with school friends. Journal of Experimental Child Psychology (2011).
Does sensitivity to criticism mediate the relationship between theory of mind and academic achievement? Journal of Experimental Child Psychology (2011).
The unity and diversity of executive functions and their contributions to complex “frontal lobe” tasks: A latent variable analysis. Cognitive Psychology (2000).
“John thinks that Mary thinks that …”: Attribution of second-order beliefs by 5- to 10-year-old children. Journal of Experimental Child Psychology (1985).
Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children’s understanding of deception. Cognition (1983).
What is theory of mind? Concepts, cognitive processes, and individual differences. Quarterly Journal of Experimental Psychology.
Studies of adults can inform accounts of theory of mind development. Developmental Psychology.
Developmental continuity in theory of mind: Speed and accuracy of belief–desire reasoning in children and adults. Child Development.
Peer relations and understanding of faux pas: Longitudinal evidence for bidirectional associations. Child Development.
Another advanced test of theory of mind: Evidence from very high functioning adults with autism or Asperger syndrome. Journal of Child Psychology and Psychiatry.
Specifying the links between executive functioning and theory of mind during middle childhood: Cognitive flexibility predicts social understanding. Journal of Cognition and Development.
Theory of mind in preadolescence: Relations between social understanding and social competence. Social Development.
Socioeconomic status and child development. Annual Review of Psychology.
Code of human research ethics.
Confirmatory factor analysis for applied research.
Age and gender dependent development of theory of mind in 6 to 8 year old children. Frontiers in Human Neuroscience.
Reliability and validity assessment.
On the distinction between false belief understanding and subscribing to an interpretive theory of mind. Child Development.
A meta-analysis of mentalizing impairments in adults with schizophrenia and autism spectrum disorder. Schizophrenia Bulletin.
An interactionist perspective on the socioeconomic context of human development. Annual Review of Psychology.
Construct validity in psychological tests. Psychological Bulletin.
Theory of mind, emotion understanding, language, and family background: Individual differences and interrelations. Child Development.
Middle childhood: An evolutionary–developmental synthesis. Child Development Perspectives.
Silent films and strange stories: Theory of mind, gender, and social experiences in middle childhood. Child Development.
Relations between false-belief understanding and executive function in early childhood: A meta-analysis. Child Development.
Online usage of theory of mind continues to develop in late adolescence. Developmental Science.
Introducing MASC: A movie assessment of social cognition. Journal of Autism and Developmental Disorders.
The development of children ages 6 to 14. The Future of Children.
Item response theory for psychologists.