A critical review of outcome measures used to evaluate the effectiveness of comprehensive, community based treatment for young children with ASD

https://doi.org/10.1016/j.rasd.2015.12.009

Highlights

  • There is a lack of measurement consistency in outcomes for community treatment of preschoolers with ASD.

  • Standardized outcome tests are summarized and evaluated for ethical test use and reporting requirements.

  • Cognitive and adaptive tests are the primary outcomes reported, whereas other tests are under-represented.

  • Reporting strengths are use of multiple measures, clear sample descriptions, and use of specialized tools for ASD.

  • Reporting weaknesses are assessment bias, test substitution, and under-reporting of test modifications.

Abstract

This review critically evaluates the reporting and use of standardized measures to assess community-based treatments for young children with Autism Spectrum Disorder (ASD). The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999), a best practice framework for reporting standardized test results, guides the evaluation. Fifty-three different outcome measures are identified across 45 studies representing twelve countries. Adaptive behavior measures, specifically the Vineland Adaptive Behavior Scales, and cognitive measures continue to be the primary outcome tools, despite a lack of clear fit to core ASD diagnostic constructs. Behavioral, ASD-specific, language, social communication, and family wellness tools are under-represented. Reporting strengths are the use of multiple measures, clear sample descriptions, and the use of specialized tools for ASD. Reporting weaknesses are assessment bias, test substitution, and under-reporting of test modifications. Clinical and research implications are discussed.

Introduction

Autism Spectrum Disorder (ASD) is diagnosed at younger ages and with increasing frequency, with current estimates that approximately 1% of school-aged children (Blumberg et al., 2012) meet the diagnostic criteria of qualitative impairment in communication and socialization skills, as well as the presence of repetitive behavioral mannerisms that interfere with daily functioning (American Psychiatric Association, 2013). As the number of children with ASD increases, so has public pressure to provide evidence-based treatment. However, though evidence-based treatments are established within research settings (e.g., Makrygianni & Reed, 2010), the gap between research and practice is large (Dingfelder & Mandell, 2011; Kasari & Smith, 2013), and despite the significant costs associated with treatments in community settings, little is known about how well treatments developed in research settings generalize into the community. For example, Amendah, Grosse, Peacock, and Mandell (2012) estimate costs of $25,099 to $60,000+ per person, per year for behavioral therapies. In Canada, provincial governments are spending up to $40,000 per child, per year on therapies for children with ASD (Madore & Pare, 2006). Lifetime costs are even greater, with recent estimates of $2.4 million in the United States and £1.5 million in the United Kingdom (Buescher, Cidav, Knapp, & Mandell, 2014).

The distinction between treatment efficacy and treatment effectiveness has important implications for bridging research and clinical practice. Treatment efficacy is demonstrated through the completion of replicable studies in highly controlled research settings, whereas treatment effectiveness is the demonstration that efficacious treatments generalize into community settings (Greenberg, 2004). Our understanding of efficacious treatments for ASD has been consolidated through several extensive and systematic reviews (e.g., National Autism Center, 2009; Wong et al., 2014), yet the evidence base for ASD treatment effectiveness is still emerging. In addition, predicting which intervention will work best for an individual child, the specificity of intervention targets, and the responsiveness of individual ASD symptomatology to treatment remain unknown, particularly as knowledge is generalized out of university contexts into the community (Minjarez, Williams, Mercier, & Hardan, 2011). Moreover, in the context of implementation science, the lag between the development of an efficacious practice and its eventual adoption is still estimated to be as high as 20 years (Walker, 2004).

One contributing factor in the slow adoption of efficacious practice is the lack of consensus on measurement tools and the wide usage of different instruments (Bolte & Diehl, 2013). Matson (2007) reports that measures of intelligence and adaptive functioning are used most frequently, though a sole focus on these two constructs in the measurement of ASD treatment response can be problematic. For example, regarding cognition, Matson (2007) reports that (1) children often age out of Intelligence Quotient (IQ) measures from pre- to post-test, forcing substitution of a different IQ instrument normed on an older population, (2) it is not clear whether ASD intervention results in increased scores due to increased compliance, attention, motivation, or ability, (3) IQ tests are less reliable at predicting future performance for young children, (4) comorbid psychopathologies may interfere with measurement of the underlying constructs, and (5) IQ tests are not normed on an ASD population. Adaptive measures, though valuable, are normed primarily on the typically developing population (Sattler, 2006), are not designed specifically for ASD, and consequently only provide a reference point for identifying delays and strengths in the ASD population.

According to Gould, Dixon, Najdowski, Smith, and Tarbox (2011), measurement outcomes of intensive ASD programs should (1) be comprehensive, (2) target early childhood development, (3) consider behavior function, (4) directly link assessment items to curriculum targets, and (5) track child progress over time. Gould et al. (2011) indicate that a combination of direct observation and indirect assessment (e.g., rating scales and checklists) is the ideal way to track outcomes. However, after reviewing 27 different tools that may be used to measure ASD intervention progress, they were not able to identify any specific tool that met all five criteria. Four tools identified as being ‘of promise’ were the Verbal Behavior Milestones Assessment and Placement Program [VB-MAPP] (Sundberg, 2008), the Brigance Diagnostic Inventory of Early Development-II [Brigance IED-II] (Brigance, 2004), the Vineland Adaptive Behavior Scales, Second Edition [VABS-II] (Sparrow, Cicchetti, & Balla, 2005), and the Brigance Diagnostic Comprehensive Inventory of Basic Skills-Revised [CIBS-R] (Brigance, 1999). The VABS was described as “by far the most popular assessment” (p. 998). To strengthen these tools, Gould et al. (2011) recommended simplifying administration of the VB-MAPP, increasing psychometric evaluation of the VB-MAPP and Brigance IED-II, and linking the content of the VABS and CIBS-R more clearly to a curriculum.

Bolte and Diehl (2013) reflected this lack of consensus on measurement tool selection in their review of 195 prospective ASD treatment trials published from 2001 to 2010. They identified 289 unique measurement tools, of which the vast majority (61.6%) were used only once. The five most frequently used tools in that review were the Aberrant Behavior Checklist [5%] (Aman, Singh, Stewart, & Field, 1985), the Clinical Global Impressions scale [4.6%] (Guy, 1976), the VABS [3.9%] (Sparrow, Balla, & Cicchetti, 1984), investigator-designed video observations [1.9%], and the Bayley Scales of Infant Development [1.7%] (Bayley, 1993). Bolte and Diehl (2013) concluded that “greater consistency in the use of measurement tools in ASD clinical trials is a worthwhile and achievable goal” (p. 2499), as the sheer number of tools and tool-symptom combinations makes comparison between studies difficult. Improved consistency in test use is important because it would allow for more nuanced and detailed comparisons across program models and funding jurisdictions, support a better understanding of how different treatment models influence different areas of child functioning, and improve the efficiency of test selection and completion for research participants. In summary, based on this review of the literature, one can conclude that no single assessment measure currently captures all aspects of an intensive ASD treatment program, and a combination of consistently used outcome tools is of value.

Measurement challenges have been documented for decades in the ASD treatment literature. For example, much of the ASD treatment literature began with a seminal study by Lovaas (1987), who reported that up to 47% of preschool children diagnosed with ASD could achieve average scores on standardized measures of intelligence after receiving 40 hours per week of behaviorally based intensive intervention for at least 2 years. This model has been identified as the UCLA Young Autism Project (UCLA YAP) model and still influences much of the current ASD literature (Reichow & Wolery, 2009).

Lovaas (1987) used multiple instruments to evaluate treatment effectiveness, including four different measures of intelligence that were combined into an IQ “estimate” of mental age, direct recording of behavior and language, and post-intervention classroom placement at 6 or 7 years of age. The pre-treatment measures of intelligence were diverse, and the Vineland Social Maturity Scale (Doll, 1953), an earlier version of the VABS, was used to estimate mental age for participants deemed untestable. Post-treatment measures were equally diverse and included up to six different cognitive measures, with unclear rules for allowable test substitution.

Jordan, Jones, and Murray (1998) and Howlin (1997) identified numerous measurement shortcomings in Lovaas (1987), including (1) the use of different measures before and after treatment, (2) measures that may not reflect important areas of difficulty in ASD, (3) non-adherence to standard assessment protocols, (4) lengthy delays between program delivery and outcome assessment, and (5) the use of prorated mental age, a psychometrically weak metric. Eikeseth (2001) responded to these criticisms by identifying (1) that no single IQ test covers the age range needed for lengthy interventions, (2) that there is high overlap of up to 75% between ASD and mental retardation (citing Lord & Schopler, 1989), (3) that the standardized process used would have penalized higher IQ scores, biasing against the intervention, (4) that the lengthy delay between intervention and final testing would also bias against the intervention, and (5) that ratio IQ is a conservative measure of this construct. Despite these methodological and practical concerns, the same tools and methodological challenges continue to be present in many current studies, and researchers continue to report benefits for children with ASD who participate in the UCLA YAP model (e.g., Cohen, Amerine-Dickens, & Smith, 2006; Howard, Sparkman, Cohen, Green, & Stanislaw, 2005).

In addition to the behaviorally based UCLA YAP model, developmentally influenced ASD models also exist, though they are less prominent in the research literature. These models include the Treatment and Education of Autistic and Related Communication Handicapped Children [TEACCH] (Mesibov, Shea, & Schopler, 2005), the Early Start Denver Model [ESDM] (Dawson et al., 2010), Joint Attention and Symbolic Play/Engagement and Regulation [JASP/ER] (Kasari, Freeman, & Paparella, 2006), and inclusive classroom models such as Learning Experiences and Alternative Program for Preschoolers and Their Parents [LEAP] (Strain & Bovey, 2011) and the Children’s Toddler School [CTS] (Stahmer & Ingersoll, 2004). No review or critique of the outcome measures used to evaluate these models has established whether measurement concerns parallel those found in the behaviorally oriented intervention research.

This review builds on the previous body of work on measurement issues in the ASD treatment effectiveness literature by systematically identifying and recording the degree to which documented measurement concerns persist in published ASD studies evaluating treatment effects. Ethical practice guidelines established by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (AERA, APA, & NCME, 1999) are used to identify strengths and weaknesses in clinical assessment and reporting. We focus specifically on comprehensive, community-based (i.e., referencing a preschool, nursery, home, or other community setting), group-design effectiveness studies of ASD treatment for young children, as this is where treatment effectiveness is ultimately demonstrated. Given the high volume of ASD studies, those that use single-subject designs, or that are instrument-related reviews, parent education programs, diagnostic studies, or general program descriptions, are excluded. Specific aims are to: (1) identify whether the earlier criticisms of instrument usage (Howlin, 1997; Jordan et al., 1998) in initial ASD efficacy studies have been resolved, (2) identify the dominant instruments used to measure ASD treatment gains and evaluate their construct validity as primary measures for ASD, and (3) develop a standardized checklist to evaluate strengths and weaknesses in test administration, use, and reporting for ASD program effectiveness, based on a best practice framework for evaluating tool selection, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999).
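To make the idea of a standardized reporting checklist concrete, a minimal sketch in Python follows. The criterion names are hypothetical placeholders loosely inspired by the Standards (AERA, APA, & NCME, 1999); they are not the exact checklist items used in this review.

```python
from dataclasses import dataclass, field

# Hypothetical reporting criteria loosely inspired by the Standards
# (AERA, APA, & NCME, 1999); illustrative labels only, not the exact
# checklist items used in this review.
CRITERIA = [
    "names_test_and_edition",
    "reports_examiner_qualifications",
    "reports_test_modifications",
    "justifies_test_substitution",
    "reports_reliability_evidence",
    "reports_validity_for_asd_sample",
]

@dataclass
class StudyChecklist:
    """Reporting criteria met (True) or not met (False) by one outcome study."""
    study_id: str
    met: dict = field(default_factory=dict)

    def score(self) -> float:
        """Proportion of checklist criteria explicitly satisfied in the report."""
        return sum(bool(self.met.get(c)) for c in CRITERIA) / len(CRITERIA)

# Usage sketch: rate one hypothetical study and print its compliance score.
example = StudyChecklist(
    study_id="Study_A",
    met={"names_test_and_edition": True, "reports_test_modifications": False},
)
print(f"{example.study_id}: {example.score():.0%} of criteria met")
```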

Section snippets

Best practices framework for evaluating tool selection and use

To guide the measurement evaluation process, we adopted a best practice framework for selecting psychological and educational instruments and evaluating their utility for program outcomes: the fourth edition of the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999). The Standards, jointly published since 1966, provide minimum guidelines around test development and test use to ensure ethical assessment, to improve ethical test use and…

Instrument classifications

For the first level of analysis, each study was reviewed and all of the outcome instruments were identified, resulting in a total of 53 different standardized outcome instruments. However, similar to the findings of Bolte and Diehl (2013), 24 of 53 (45.28%) instruments appeared in only one study. As the purpose of this review is to provide an overview of the tools commonly used in the ASD effectiveness literature, only the 21 instruments that appeared in a minimum of three studies are listed in…
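The tallying and filtering step described above can be sketched in Python as follows. The study-to-instrument mapping shown here is hypothetical and serves only to illustrate the frequency threshold; the actual counts come from the 45 reviewed studies.

```python
from collections import Counter

# Hypothetical mapping of reviewed studies to the standardized outcome
# instruments each one reports; the real mapping comes from the 45
# studies included in this review.
study_instruments = {
    "Study_A": ["VABS-II", "Bayley", "ABC"],
    "Study_B": ["VABS-II", "Bayley"],
    "Study_C": ["VABS-II", "CGI"],
    "Study_D": ["Brigance IED-II"],
}

# Count how many studies each instrument appears in.
counts = Counter(tool for tools in study_instruments.values() for tool in tools)

single_use = [tool for tool, n in counts.items() if n == 1]
commonly_used = {tool: n for tool, n in counts.items() if n >= 3}  # reporting threshold

print(f"{len(single_use)}/{len(counts)} instruments appeared in only one study")
print("Instruments appearing in three or more studies:", commonly_used)
```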

Discussion

The purpose of this paper was to provide a critical measurement review of recent community-based effectiveness studies of treatments for young children with ASD in order to (1) identify whether early criticisms of instrument usage in ASD efficacy studies have been resolved, (2) report on the dominant instruments used to measure ASD treatment gains and evaluate their construct validity as primary measures for ASD, and (3) develop a standardized checklist to evaluate strengths and weaknesses in ethical…

References (66)

  • American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing.

  • American Psychiatric Association (2013). Diagnostic and statistical manual of mental disorders.

  • N. Bayley (1993). Bayley scales of infant development.

  • N. Bayley (2006). Bayley scales of infant development: Administration manual.

  • S.J. Blumberg et al. (2012). Changes in prevalence of parent-reported autism spectrum disorder in school-aged U.S. children: 2007 to 2011–2012. National Health Statistics Reports, 65.

  • E.E. Bolte et al. (2013). Measurement tools and target symptoms/skills used to assess treatment response for individuals with autism spectrum disorder. Journal of Autism and Developmental Disorders.

  • B. Bracken et al. (2004). Clinical assessment of behavior.

  • A.H. Brigance (1999). Brigance diagnostic comprehensive inventory of basic skills-revised.

  • A.H. Brigance (2004). Brigance diagnostic inventory of early development-II.

  • R.H. Bruininks et al. (1996). Scales of independent behavior-revised: Comprehensive manual.

  • A.V.S. Buescher et al. (2014). Costs of autism spectrum disorders in the United Kingdom and United States. Journal of the American Medical Association Pediatrics.

  • A.S. Carter et al. (1998). The Vineland Adaptive Behavior Scales: Supplementary norms for individuals with autism. Journal of Autism and Developmental Disorders.

  • *H. Cohen et al. (2006). Early intensive behavioral treatment: Replication of the UCLA model in a community setting. Developmental & Behavioral Pediatrics.

  • I.L. Cohen et al. (2005). The PDD behavior inventory.

  • J.N. Constantino et al. (2012). Social responsiveness scale, 2nd edition (SRS-2).

  • *G. Dawson et al. (2010). Randomized, controlled trial of an intervention for toddlers with autism: The Early Start Denver Model. Pediatrics.

  • H.E. Dingfelder et al. (2011). Bridging the research-to-practice gap in autism intervention: An application of diffusion of innovation theory. Journal of Autism and Developmental Disorders.

  • E.A. Doll (1953). The measurement of social competence.

  • S. Eikeseth (2001). Recent critiques of the UCLA young autism project. Behavioral Interventions.

  • L. Fenson et al. (1993). The MacArthur communicative development inventories: User’s guide and technical manual.

  • M.T. Greenberg (2004). Current and future challenges in school-based prevention: The researcher perspective. Prevention Science.

  • W. Guy (1976). The clinical global impression scale. In ECDEU assessment manual for psychopharmacology-revised.

  • H.R. Hall et al. (2010). Parenting challenges in families of children with autism: A pilot study. Issues in Comprehensive Pediatric Nursing.