Introduction

Heterogeneity presents a unique challenge within the field of autism research, as individuals with Autism Spectrum Disorder (ASD) exhibit significant variability in the kind and extent of symptomatology. Green et al. (2006) surveyed parents and found that they use a wide range of treatments with their children, with more treatments used for younger children and for children with more severe symptoms. Green and colleagues also reported that the most commonly used treatments included those without empirical support.

Applied Behavior Analysis (ABA) is an applied science that focuses on socially significant behavior change (Baer, Wolf, & Risley, 1968; Sigafoos & Schlosser, 2008). While most published psychological research is based on between group research designs, ABA typically examines behavior at the level of the individual and generally utilizes single-case research designs (SCD), thus permitting a scientifically valid conclusion to be drawn from the intensive investigation of an individual (Blampied, 1999). Interventions based on such research have been used extensively in working with participants with ASD since the early 1980s.

Systematic reviews and meta-analyses of SCD literature are becoming increasingly important to a variety of stakeholders and have been conducted within the academic literature, by government agencies, and by health service providers, both to address the need for evidence-based practice guidelines and to inform decisions at a policy level. Across the broader field of healthcare, the PRISMA Statement (2009) sets forth a checklist of 27 items that should be addressed in systematic reviews or meta-analyses of literature (Moher, Liberati, Tetzlaff, Altman, & The PRISMA Group, 2009). The PRISMA Statement (2009) has been used to guide reviews that are ultimately read by clinicians to inform practice, by granting agencies to fund future research, and by other stakeholders. Such reviews may include between-group design research and SCD research. It has been acknowledged that one limitation of meta-analytic research is that historically SCD research has often been omitted (Allison & Gorman, 1993).

Results from single-case meta-analyses have recently been used by health insurance providers both to support and to deny the necessity of intensive behavioral intervention for individuals with ASD (Campbell, 2013). Campbell reported that in 2011, CIGNA companies concluded that behavioral intervention is an effective therapy, while in the same year, United Healthcare Services, Inc. used similar evidence to justify its policy that the same treatment is not a medical necessity. Separately, the National Autism Center’s National Standards Report (2009) has assessed the existing SCD published peer-reviewed literature base for participants under 22 years of age diagnosed with ASD. The report categorized 11 ABA-based interventions as established, 21 as emerging, and five as unestablished treatments. Established treatments were antecedent package, behavioral package, comprehensive behavioral treatment for young children, joint attention intervention, modeling, naturalistic teaching strategies, peer training, pivotal response treatment, schedules, self-management, and story-based intervention package. Emerging treatments were augmentative and alternative communication devices, cognitive behavioral intervention package, developmental relationship-based treatments, exercise, exposure package, imitation-based interaction, initiation training, language training (production), massage/touch therapy, multi-component package, music therapy, peer-mediated instructional arrangements, the picture exchange communication system, reductive package, scripting, sign instruction, social communication intervention, social skills package, structured teaching, technology-based treatment, and theory of mind training. Unestablished treatments were academic interventions, auditory integration training, facilitated communication, gluten- and casein-free diets, and sensory integrative packages. The report also highlighted the importance of using meta-analytic procedures to identify ineffective or harmful treatments, although in the 2009 review, no studies were identified that met criteria. At the time of writing, the National Association of Insurance Commissioners and Top Health Insurance (2013) ranked United Healthcare as the number one insurer in the USA, providing services to an estimated 70 million Americans. Given the current ASD prevalence estimate of one in 88 (CDC MMWR, 2012), approximately 795,000 individuals are affected by such policy decisions in the USA alone.

Variations in literature synthesis, such as those described above, have wide-reaching implications for individuals on the autism spectrum and may determine whether an individual is able to access support services. However, despite interest across the broader educational field in how best to calculate and report the strength of treatment effects, no agreement on how to measure and interpret the strength of treatment effects in SCD research studies has yet been achieved.

SCD researchers in the field of ABA have traditionally relied on visual analysis as the principal method of determining intervention effects (Kratochwill & Levin, 2014; Matyas & Greenwood, 1990; Shadish, 2014). Visual analysis can be used to document experimental control and determine the overall effectiveness of an intervention by assessing all conditions within a design, with graphical inspection involving the evaluation of time series data in terms of systematic changes in level, trend, and variability, both within and across intervention phases (Horner et al., 2005). Historically, visual methods have been favored over statistical approaches on the basis that they are less likely to report false positive treatment outcomes (Shadish, 2014). While visual analysis has wide appeal, it is not without criticism. In particular, unreliability of judgment across raters has been frequently reported (Campbell, 2013; Parker & Brossart, 2003; Scruggs, Mastropieri, & Casto, 1987), though it has also been noted that critics have seldom addressed consistency in visual analysis beyond two phases (Horner, Swaminathan, Sugai, & Smolkowski, 2012). Horner and colleagues have argued for the continued use of visual analysis in the absence of agreement on a statistical measure to determine treatment effect.

However, while much support is acknowledged for the continued use of visual analysis, we argue that such an approach to literature synthesis may not adequately address several important and topical issues in ASD research and treatment. Firstly, the American Psychological Association (APA) Taskforce on Statistical Inference (1999) argued in support of the earlier APA (1994) publication manual’s suggestion to include a treatment effect size in research reports, claiming that a treatment effect size permits the evaluation of the stability of findings across samples and is important to future meta-analyses. At that time, the earlier guideline was formalized as a requirement for research publications (Leland Wilkinson and the Taskforce on Statistical Inference, 1999). While some (APA Taskforce on Statistical Inference, 1999) have argued that, with improvements in state-of-the-art statistical analysis software, statistics are commonly reported without an understanding of the computational methods or of what the statistics mean, Kratochwill and Levin (2014) recently highlighted a growing number of nonoverlap methods that can be calculated by hand, which may be advantageous as a supplement to visual analysis. Arguments for the retention of simple calculation methods such as the percentage of nonoverlapping data (PND) have also been reported (Scruggs & Mastropieri, 2013).

Secondly, the evidence-based best practice movement across the broader field of educational psychology highlights the importance of calculating a treatment effect score when evaluating and synthesizing literature. A treatment effect score is considered essential to informing an evidence base. The U.S. Department of Education’s What Works Clearinghouse (WWC) has produced a procedures and standards handbook to assess the quality of both group and SCD studies and has developed protocols to evaluate their effectiveness in order to establish a scientific evidence base for educational research (Kratochwill et al., 2010). The WWC SCD Pilot Version 1.0 guidelines suggested a preference for regression-based calculations, advising specifically against the adoption of the PND calculation. Elsewhere in the earlier literature, researchers have also stated a preference for a regression-based approach (Allison & Gorman, 1993; Parker & Brossart, 2003). By contrast, the most recent update, WWC Version 3.0, has reverted to recommending visual analysis as the primary procedure for determining the strength of a treatment effect (Kratochwill et al., 2013). In this version, the panel predicts that at some future point, when the field has achieved greater consensus about appropriate quantitative techniques, new standards for effect demonstration will be developed. At the time of writing, however, the WWC guidelines do not specify a metric to use when calculating a treatment effect size.

SCDs are unique in that significant design decisions regarding the length of baseline data collection and when to implement or withdraw treatment are not determined in advance, depending instead on the participant data that are collected. Typically, phase changes are made once the data within a phase are considered stable, as characterized by the absence of slope and no more than a small level of performance variability within the phase (Kazdin, 1978). Ethical consideration of a participant’s circumstances and of the behavior under investigation may also contribute to determining how many data points are collected. If a participant has a severe skill deficit, or if a behavior is harmful to the self or others, it may not be socially valid or ethically acceptable to prolong baseline data collection. Such scenarios are common in studies involving individuals with ASD. Even in cases where the intervention does not directly target a reduction in problem behavior, challenging behavior may still be present and must be considered.

Debate regarding the most appropriate methodological approach to interpreting SCD research dates back as far as the early 1970s (Kratochwill & Brody, 1978). Effect size may be calculated using regression-based estimators, standardized mean difference, or nonparametric methods. Of these, the most extensively adopted method is the nonparametric calculation PND, developed by Scruggs, Mastropieri, and Casto (1987). However, PND has been criticized for misrepresenting treatment effects, for being insufficiently sensitive to changes in slope, for producing an invalid outcome when outlier data points in baseline obscure true intervention effects, for being ineffective as a discriminator for powerful treatment effects, and for allowing the number of baseline observations to distort outcomes (Scruggs & Mastropieri, 1998).

A variety of statistical approaches for use in SCD meta-analyses are currently under development, including procedures for modeling trend, determining an estimate of treatment effect size, and investigating statistical methods to improve estimates for small data samples (Shadish, 2014). In a recent special series of articles exploring emerging approaches to calculating treatment effect, it has been suggested that what exists is a group of communities, and that the perspectives of researchers from these sub-communities may differ greatly on the role of statistical analysis in research interpretation and reporting (Shadish, 2014). In light of this claim, we argue for the significance of examining SCD data specifically in the context of participants with ASD.

Data Requirements for Treatment Effect Calculations

A review of existing literature was conducted to identify minimum data requirements for both regression-based and nonparametric approaches to calculating a treatment effect score. Parker et al. (2005) noted that a minimum of six data points per phase, and at least 14 data points across a phase A and B comparison, are required for a regression-based effect size calculation. In their research, based upon a convenience sample of 77 published AB datasets, Parker and colleagues reported a median of 23 data points per graph (counting only A and B phases), with a median length of 10 for phase A and 11 for phase B.

In addition, our search identified three nonparametric approaches that can be calculated by hand: PND (Scruggs et al., 1987), percentage of all nonoverlapping data (PAND; Parker et al., 2007), and nonoverlap of all pairs (NAP; Parker & Vannest, 2009). PND was developed specifically to supplement visual analysis. While PND does not specify a minimum number of data points, outlier ceiling or floor effects in baseline data can result in the calculation of a zero score. Baseline variability may mean that a PND score should not be calculated, and Scruggs and Mastropieri (2013) have reiterated their original advice against calculating an effect score in cases where the result would be inconsistent with visual examination.
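To make the hand calculation concrete, the following minimal sketch computes PND for a behavior-increase target and reproduces the ceiling-outlier problem just described. This is our illustration only; the function name and data are hypothetical, and the original sources prescribe no particular implementation.

```python
from typing import Sequence

def pnd(baseline: Sequence[float], treatment: Sequence[float],
        increase_desired: bool = True) -> float:
    """Percentage of treatment-phase points that exceed every baseline point
    (or fall below every baseline point, when a reduction is the goal)."""
    if increase_desired:
        cutoff = max(baseline)
        nonoverlapping = sum(1 for x in treatment if x > cutoff)
    else:
        cutoff = min(baseline)
        nonoverlapping = sum(1 for x in treatment if x < cutoff)
    return 100.0 * nonoverlapping / len(treatment)

# A single ceiling outlier in baseline forces PND to zero,
# even though the treatment phase shows a clear change in level:
print(pnd([2, 3, 10], [8, 9, 9, 8]))  # 0.0
```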

PAND presents an approach designed to address PND’s vulnerability to outlier baseline data. For the nonparametric calculation PAND, Parker et al. (2007) reported that a minimum threshold of 20 data points in total is necessary. Parker and colleagues demonstrated the suitability of PAND for an initial AB comparison using a dataset of multiple baseline design samples in which 60 to 80 data points were typical.
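Parker et al. (2007) define PAND as the percentage of data remaining after removing the fewest points needed to eliminate all overlap between phases. The sketch below is one straightforward reading of that definition, not the authors’ own procedure; it sweeps every candidate cut point between the phases, and the data shown are hypothetical.

```python
def pand(baseline: list, treatment: list, increase_desired: bool = True) -> float:
    """Percentage of all nonoverlapping data: remove the fewest points needed
    to leave the two phases completely separated, then report the kept share."""
    if not increase_desired:  # reductions: negate so "higher is better" holds
        baseline = [-x for x in baseline]
        treatment = [-x for x in treatment]
    n_total = len(baseline) + len(treatment)
    # For each cut c, keep baseline points below c and treatment points at or
    # above c; points on the wrong side of the cut are the removals.
    candidates = sorted(set(baseline) | set(treatment))
    candidates.append(candidates[-1] + 1)
    fewest = min(
        sum(a >= c for a in baseline) + sum(b < c for b in treatment)
        for c in candidates
    )
    return 100.0 * (n_total - fewest) / n_total

# 20 data points in total (the reported minimum); removing the single
# overlapping treatment point (the 4) fully separates the phases:
print(pand([2, 3, 4, 3, 2, 5], [4, 6, 7, 8, 7, 9, 8, 7, 6, 8, 9, 7, 8, 6]))  # 95.0
```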

NAP offers a calculation that utilizes all data points in a pairwise comparison. Parker and Vannest (2009) examined the performance of NAP using 200 AB contrasts and reported that the median length of a full data series in their sample was 18 data points. In particular, phase A had a median of 8 data points, and phase B had a median of 9 data points.
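Because NAP compares every baseline point with every treatment point, the number of comparisons grows with the product of the phase lengths, which is what makes the hand calculation more laborious than PND or PAND. A minimal sketch under the standard definition (ties count as half credit; the function name and data are ours):

```python
def nap(baseline: list, treatment: list, increase_desired: bool = True) -> float:
    """Nonoverlap of all pairs: the share of (baseline, treatment) pairs in
    which the treatment point improves on the baseline point, expressed here
    as a percentage; tied pairs contribute half credit."""
    if not increase_desired:
        baseline, treatment = [-x for x in baseline], [-x for x in treatment]
    n_pairs = len(baseline) * len(treatment)
    credit = sum(1.0 if b > a else 0.5 if b == a else 0.0
                 for a in baseline for b in treatment)
    return 100.0 * credit / n_pairs

# Nine pairwise comparisons; one tied pair (4 vs 4) yields 8.5 / 9:
print(nap([2, 3, 4], [4, 5, 6]))  # approximately 94.4
```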

Purpose of Study

A preferred method of evaluating SCD research for participants with ASD has yet to be agreed upon by experts across the field, and different advisory panels within the broader educational psychology community have made contradictory recommendations, suggesting either calculating a treatment effect score or conducting visual analysis. Given the significant discrepancies in how leading US healthcare policy makers have interpreted meta-analytic reports on ABA-based treatment research for participants with ASD, we argue that it is important for the ASD research community to examine SCD data further, with a view to determining how best to calculate treatment effects while improved procedures are being developed. In this interim, the ASD community requires a suitable method of evaluating SCD intervention research to inform an evidence base, policy guidelines, and educational and clinical practice.

Accordingly, the purpose of this research project was to gauge the feasibility of calculating a treatment effect score using SCD data specifically in the context of participants with ASD. In its literature review, the National Standards Report (2009) classified treatments as established, emerging, or unestablished. To compare the data collection trends of researchers working with participants on the autism spectrum, we selected self-management interventions as an example of an established treatment and physical activity as an example of an emerging treatment. The following research questions were developed:

1. Do studies report a sufficient number of baseline and intervention data points to enable the calculation of a treatment effect score?

2. Are there trends in these data suggesting this pattern is changing?

Method

Locating Studies

Studies were located by conducting a systematic search of peer-reviewed literature published prior to November 2013. Both the PsycINFO and ERIC databases were queried using the keywords “autism*” and “Asperger’s syndrome.” For self-management interventions, the following terms were queried: “self-management,” “self-regulation,” “self-regulate,” “self-monitoring,” “self-recording,” “self-reinforcement,” “self-evaluation,” “self-advocacy,” “self-observation,” “self-instruction,” “empowerment,” “self-determination,” and “self-control.” For physical activity interventions, the following terms were queried: “physical activity,” “exercise,” and “fitness.” In addition, a hand search of the reference lists of existing systematic reviews on both self-management and exercise was undertaken.

The abstract of each article was examined to determine whether an article met inclusion criteria for further review, and the original article was retrieved and reviewed when necessary. No age limits were placed upon participants. Inclusion criteria required that:

1. Participants had an existing diagnosis of ASD or Asperger’s syndrome (AS). In instances in which several participants with various conditions were included in a single article, only participants with an ASD or AS diagnosis were included for further review.

2. The study utilized a single subject research design such as a multiple baseline, reversal, changing criterion, or alternating treatment design.

3. The study presented data from each phase in graphical format for each participant, thereby enabling the calculation of a treatment effect.

4. Components of self-management or exercise were included throughout the intervention.

5. Articles were published in English in a peer-reviewed journal.

This search procedure identified 38 articles that utilized SCDs in self-management interventions and a further eight articles that utilized SCDs in exercise interventions. Two studies appeared in both the self-management and exercise searches, as their interventions targeted participation in physical activity using self-management procedures (Todd & Reid, 2006; Todd, Reid, & Butler-Kisber, 2010).

Data Requirements for Treatment Effect Calculations

1. Regression-based approach: a minimum of six data points per baseline or treatment phase and at least 14 data points in a phase A and B comparison

2. PND: no minimum number of data points required; however, baseline stability must be evident

3. PAND: a minimum of 20 data points across baseline and treatment phases required

4. NAP: no minimum number of data points specified

Reliability of Data

Inter-coder agreement was calculated with the first author and one of the co-authors separately coding each study before comparing results. Initial trial coding was performed on three studies to ensure consistency between assessors.

A 30 % random sample of abstracts from all search results was reviewed independently, to determine the reliability of the article selection process. Inter-coder agreement was determined by dividing the number of agreements by the total number of agreements plus disagreements and multiplying by 100. Inter-coder agreement was 97 % for self-management interventions and 100 % for exercise interventions.

Once the data set was developed, the accuracy of the data point count procedure was checked. The first author randomly selected a 50 % sample of included studies, and a co-author independently counted data points in baseline, treatment, and any subsequent phases. These results were then compared to the first author’s counts. Inter-coder agreement was 98 % for self-management interventions and 100 % for exercise interventions.
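As a worked illustration of the agreement formula described above (the counts shown are hypothetical, chosen only to show the arithmetic):

```python
def percent_agreement(agreements: int, disagreements: int) -> float:
    """Inter-coder agreement: agreements / (agreements + disagreements) * 100."""
    return 100.0 * agreements / (agreements + disagreements)

# e.g., two coders agreeing on 29 of 30 sampled abstracts:
print(percent_agreement(29, 1))  # 96.67, which rounds to 97 %
```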

Results

Thirty-eight self-management intervention articles included in the data set reported treatment data from a variety of behaviors and settings for 102 participants. Given that many treatments were repeated across either behaviors or settings, a total of 215 data series were included in these graphs. A further eight exercise intervention articles reported treatment data for an additional 20 participants, and the corresponding graphs included 43 data series.

Hand counts of the number of data points reported in each baseline and treatment phase were conducted for each of the data series included in the graphs. These tallies were recorded manually in an Excel spreadsheet and subsequently compared to the advised minimum baseline and treatment phase data requirements for regression-based procedures, PND, PAND, and NAP to determine the feasibility of using these procedures with these data sets. Table 1 provides a summary of the results of each comparison.

Table 1 Feasibility of treatment effect score calculation for studies

The feasibility of applying regression-based calculations was determined by adherence to a minimum of six data points in the first baseline phase, six data points in the first treatment phase, and a minimum total of 14 data points across the first AB comparison. With this threshold, 97 of the 215 self-management data series (45.1 %) provided sufficient data. This is illustrated graphically in Fig. 1. In order to calculate a treatment effect for an entire study, each individual data series must meet the minimum data threshold. Given this constraint, only nine self-management articles (23.7 %) had sufficient data for a regression-based calculation. When considered on a participant basis, we identified 36 individuals (35.3 %) for whom a sufficient volume of data was reported in each data series to permit a regression-based treatment effect calculation per participant.
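A minimal sketch of this screening rule (the thresholds are those of Parker et al., 2005, but the function and the tallies shown are our hypothetical illustration):

```python
def regression_feasible(n_baseline: int, n_treatment: int) -> bool:
    """A first AB comparison supports a regression-based effect size only if
    each phase has at least 6 points and the comparison totals at least 14."""
    return n_baseline >= 6 and n_treatment >= 6 and n_baseline + n_treatment >= 14

# An article qualifies only if every one of its data series qualifies:
series_counts = [(6, 8), (5, 9), (7, 7)]  # (baseline, treatment) tallies per series
print(all(regression_feasible(a, b) for a, b in series_counts))
# False: the 5-point baseline fails the per-phase minimum
```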

Fig. 1 Self-management data points

For exercise interventions, 11 of the 43 data series (25.6 %) met the minimum total of data points required across the first AB comparison. This is illustrated graphically in Fig. 2. When the exercise intervention articles were viewed in their entirety, a single article (12.5 %) included sufficient data to enable the calculation of a treatment effect for the entire study. When considered on a participant basis, we identified four individuals (20.0 %) for whom sufficient data were reported to enable a regression-based treatment effect calculation per participant.

Fig. 2 Exercise data points

Within the self-management studies, ceiling or floor data points in baseline occurred in three data series, resulting in a 0 % PND calculation. Variability in baseline data for one study that included two participants indicated that a PND should not be coded. Accordingly, data from 34 of the 38 studies (89.5 %) appear sufficient for a PND calculation. When viewed at a participant level, data from 96 of 102 participants (94.1 %) appear sufficient for a PND calculation.

For the exercise interventions, floor data points in baseline occurred in one study that reported one data series for a single participant; for this study, a 0 % PND would be calculated. Overall, seven of the eight studies (87.5 %) report sufficient data for a PND calculation. At a participant level, data sufficient for a PND calculation are available for 19 of 20 participants (95.0 %).

The calculation of PAND requires a minimum of 20 data points per data series across baseline and treatment phases. Using this guideline, 117 self-management data series (54.4 %) reported a sufficient number of data points. When each article was considered overall, 22 self-management articles (57.9 %) included data series of an adequate length for the application of PAND. Examination of the self-management data on a per participant basis revealed that PAND was appropriate to apply to 53 individuals (52.0 %).

For exercise interventions, 26 data series (60.5 %) reported a sufficient number of data points. Viewed overall by article, five exercise articles (62.5 %) included data series of an adequate length to apply the PAND calculation. Examination of the exercise data on a per participant basis revealed that PAND was appropriate to apply to 13 individuals (65.0 %).

Although no minimum threshold has been reported for the application of the NAP calculation, Parker and Vannest (2009) reported that the median length of the full data series across their 200 selected AB contrasts was 18 data points. By way of comparison, the self-management data set of 215 data series had a median length of 13.75 data points for the first AB comparison and a median length of 25.5 data points across the full data series. For exercise interventions, the 43 data series had a median length of 13 data points for the first AB comparison and 25 data points across the full data series.

The length of each data series, including baseline and treatment data, reported in each self-management and exercise article was plotted over time, and a line of best fit was calculated for the data. A split-middle line of progress (Cooper, Heron, & Heward, 2007) was plotted and indicated a declining trend over time in the number of data points collected (see Fig. 3).

Fig. 3 Total data points collected over time for self-management and exercise interventions

A line of best fit was also plotted using Excel and produced a trend line that closely paralleled the one calculated using the quarter-intersect and split-middle line of progress methods, corroborating the declining trend. The Excel calculation derived the equation y = −0.1339x + 49.558 for this trend line. The graph illustrates that older research papers reported a greater volume of observational data than more recent articles.
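A compact sketch of both trend calculations, using hypothetical (year, total data points) pairs rather than the study’s actual series: the quarter-intersect step passes a line through the (median x, median y) point of each half of the data (the split-middle method proper then shifts that line vertically until half the points fall on each side, a step omitted here), while the least-squares fit corresponds to what Excel’s trendline computes.

```python
import statistics
import numpy as np

def quarter_intersect(xs: list, ys: list) -> tuple:
    """Slope and intercept of the line through the (median x, median y)
    of each half of a series -- the first step of the split-middle method."""
    mid = len(xs) // 2
    x1, y1 = statistics.median(xs[:mid]), statistics.median(ys[:mid])
    x2, y2 = statistics.median(xs[mid:]), statistics.median(ys[mid:])
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

# Hypothetical (publication year, total data points) pairs:
years = [1990, 1994, 1999, 2004, 2008, 2012]
totals = [55, 48, 44, 37, 33, 30]
qi_slope, _ = quarter_intersect(years, totals)
ls_slope, _ = np.polyfit(years, totals, 1)  # ordinary least-squares trend
print(qi_slope, ls_slope)  # both slopes are negative: a declining trend
```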

Discussion

Our results show that for 64 of the 102 participants included in the self-management interventions, and 16 of the 20 participants included in the exercise interventions, the number of data points in the first AB phase comparison was below the required minimum for regression-based estimates of treatment effect sizes.

While not always described as the target behavior of an intervention, a variety of problem behaviors or unacceptable performances were described for these participants. Examples of these behaviors included physical aggression towards others, elopement for the purpose of engaging in ritualistic behaviors, self-injury, non-compliance, loud screaming, psychotic speech, inappropriate touching or hugging, destruction of property, head banging, placing non-edibles in mouth, tantrums, threats towards others, prolonged crying or body rocking, and social withdrawal. Unacceptable levels of classroom engagement or academic performance were also described. In some instances, collateral reduction in these challenging behaviors was recorded or observed in interventions that targeted the development of a skill.

Problem behaviors such as those described present an ethical dilemma to researchers as the collection of lengthy baseline data for these participants might be considered unacceptable practice. Significantly, the data series collected in both established and emerging treatments for participants with autism are often limited in length. The nature of the behaviors frequently under investigation with this population is at odds with the ability to apply a regression-based calculation to determine the strength of treatment effects. Furthermore, for the studies included in this dataset, we have identified a declining trend in the volume of data points that are collected in baseline and treatment phases. This downward trend, in addition to the nature of behaviors often described among participants on the autism spectrum, suggests that it is unlikely that regression-based calculation methods will be appropriate in future research. Alternative methods that are appropriate where shorter data sets are the norm appear advantageous.

Previous criticism of PND has included claims that outlying data in baseline phases result in calculations that do not accurately reflect the success of an original study. However, examination of the extant self-management and exercise data sets reveals relatively few occasions in which a 0 % PND is calculated or in which baseline variability results in an inability to code a study. Our findings support recent claims by Scruggs and Mastropieri (2013) that PND continues to produce treatment effect scores that accurately represent the findings described by original authors.

By contrast, PAND appears appropriate for slightly less than two-thirds of the studies included in this review. While the literature has suggested merit in adopting this calculation as an alternative to PND, this benefit appears to be offset by the method’s reduced utility given the limited volume of data typically collected for participants with ASD.

NAP is not constrained by the volume of data points collected in intervention and consequently is appropriate to apply to all studies included in both self-management and exercise SCD research. However, as the calculation is based on a comparison of all pairs contained within baseline and treatment data, a somewhat more cumbersome hand calculation is required than that of PND or PAND.

Conclusion

We have explored the volume of SCD data collected in both an established and an emerging treatment for participants on the autism spectrum. Relatively short data series were frequently reported, and results of this study suggest a declining trend in the length of data series reported over time, with older studies including a greater volume of data when compared to more recent studies. Behavioral challenges were described for many of the participants; consequently, the collection of additional data, particularly extended baseline phases, may pose an ethical dilemma to researchers, clinicians, or other stakeholders. It appears unlikely that data yielded from such applied research in the future will provide the longer data series necessary for more complex treatment effect calculations. Accordingly, while a regression-based calculation is arguably more accurate, the nature of the data examined in this study appears at odds with this method. A nonparametric approach may be preferable in the calculation of effect sizes in research involving participants with ASD.

The feasibility of three nonparametric hand calculations was explored. Both PAND and NAP are considered by many to offer a potentially superior calculation to PND. However, our findings suggest that PAND cannot always be calculated given the volume of data points that are typically reported, and excluding studies with limited baseline data points from effect size calculations may distort meta-analysis findings. While the NAP procedure is unrestricted in this regard, its more cumbersome calculation may present a barrier to adoption for many stakeholders in terms of the time involved, the increased chance of calculation error, and the risk of erroneous interpretation of the resulting treatment effect score. The simpler PND calculation appears appropriate for the majority of data sets included in this study, and we identified relatively few instances in which outlier ceiling or floor data points in baseline obscure true treatment effects.

Further, while acknowledging the inability of PND to differentiate between demonstrably powerful treatment effects, we observe that baseline variability systematically reduces the result of a PND calculation, making it an inherently conservative measure of the strength of treatment effects. A calculation that is complementary to visual examination and does not require extensive additional training is highly desirable. The continued use of PND is arguably advantageous given that it has been widely applied and hence facilitates comparison with previous research findings. The PND calculation is also advantageous in that it can be calculated and interpreted with limited additional training by a variety of stakeholders, including clinicians, teachers, and parents. In light of the findings from this study, the continued use of PND appears justified, though further research leading to the development of more robust and, at the same time, more sensitive procedures is clearly warranted.