Introduction

In 1994, the American Psychological Association (APA) publication manual encouraged the inclusion of a treatment effect size in research reports. Subsequently, the APA Taskforce on Statistical Inference (1999) argued that a treatment effect size permits the evaluation of the stability of findings across samples and is important to future meta-analyses. In some quarters, treatment effect scores have since been considered a requirement for research publication (Wilkinson and the Taskforce on Statistical Inference 1999).

The initial report developed by the APA Taskforce on Statistical Inference (1996) warned that, with advances in statistical analysis software, statistics are commonly reported without an understanding of the computational methods or even of what the statistics mean. Parker and colleagues have also argued that treatment effect calculations reported in meta-analyses of single-case design (SCD) research should be interpretable by a variety of stakeholder groups (Parker et al. 2005), a point well illustrated by autism intervention research, where parents, educators and policy makers, as well as clinicians and researchers, all need to understand reports on treatment effects.

Given the increasing demand to develop an evidence base in educational psychology, quality assessment guidelines have been developed by the US Department of Education What Works Clearinghouse (WWC) (Kratochwill et al. 2010; Kratochwill et al. 2013a; Kratochwill et al. 2013b). Various methods for determining treatment effects have been proposed for use in meta-analyses of SCD research, although the merits of these different computation methods remain a matter of debate (Horner and Kratochwill 2012; Horner et al. 2012; Kratochwill et al. 2013b; Scruggs and Mastropieri 2013). Unlike the situation for group research designs, a generally accepted method for calculating treatment effect size in SCD research has yet to be established. The initial version of the WWC SCD guidelines indicated a preference for regression-based procedures for calculating effect sizes, although the WWC panel subsequently suggested conducting a sensitivity analysis of treatment effect scores using several indices. Most recently, the WWC has moved away from the use of a treatment effect score and reverted to visual analysis until a general consensus on the most appropriate method has been reached (Kratochwill et al. 2013b).

There is an extensive body of literature examining approaches to the evaluation of the strength of treatment effects in SCD research. Shadish (2014) has reported that a number of new methods are currently in development, including standardised mean difference approaches, multilevel analysis and Bayesian generalised least squares procedures. In a recent review of SCD research conducted with students with a broad array of disabilities, Maggin et al. (2011a) reported percentage of nonoverlapping data (PND) as the most frequently used treatment effect score, appearing in 55 % of the 84 effect sizes garnered from 68 literature syntheses. Maggin and colleagues also reported that nearly 40 % of authors did not describe the method they used for comparing data from various phases within studies when estimating effect sizes. Of the studies that did include these details, several approaches were described: taking an arithmetic mean across all baseline and treatment phases, selecting only certain phases (e.g., A1B2) and consolidating baseline and treatment phases (e.g., A1A2B1B2).

The PND (Scruggs et al. 1987) was developed as a method for synthesising SCD literature that could be easily calculated and readily and meaningfully interpreted. Several positive features of PND have been described, including ease of calculation from graphical rather than raw data, a high degree of inter-rater reliability, applicability to any SCD design type and ease of interpretation (Campbell 2013; Parker et al. 2007). The continued utility of the procedure has recently been argued by the original developers (Scruggs and Mastropieri 2013). However, PND is not without its critics, and major limitations of the procedure include the following: (i) PND requires its own interpretation guidelines as it does not correspond to an accepted effect size; (ii) it lacks sensitivity in discriminating treatment effectiveness as the calculated score approaches 100 %; (iii) PND relies on a single extreme data point in baseline, and all other baseline data are excluded from the calculation; and (iv) as PND has no known sampling distribution, confidence intervals cannot be calculated (Parker et al. 2007).

All current alternative procedures have their own limitations, however. Maggin et al. (2011b) compared 11 commonly used effect size measures: three parametric methods (the interrupted time series analysis procedure (ITSACORR), piecewise regression and hierarchical linear modelling), seven nonparametric methods (PND, percentage of all nonoverlapping data (PAND), percent of zero data points (PZD), pairwise data overlap squared (PDO2), percentage exceeding the median (PEM), percentage exceeding the median trend line (PEM-T) and improvement rate difference (IRD)) and the standardised mean difference. Of the nonparametric methods assessed, PAND received the most favourable assessment.

PAND (Parker et al. 2007) has been presented as an alternative to PND, with the developers recommending it for documentation and accountability purposes in schools and clinics, in addition to applications in meta-analyses and academic research reports. Their method was illustrated with sample data that typically contained between 60 and 80 data points, and the authors noted that the method was not well suited to data series that contained fewer than 20–25 data points. Parker et al. (2011) reported that PAND has been adopted in two meta-analyses along with a phi correlation coefficient, which is analogous to an R² score frequently reported in large-N studies.

Recently, Parker and Vannest (2009) developed the nonoverlap of all pairs (NAP) procedure, suggesting that this method offers an improvement on both PND and PAND. These researchers anticipated several advantages, notably that the calculation uses all data points and as such should yield a more representative treatment effect score. Unlike other nonparametric indices, NAP is not based on means or medians, and it has been suggested that the calculation should relate more closely to the regression term R². Importantly for stakeholders within the autism spectrum disorder (ASD) community, NAP can be calculated by hand. NAP was omitted from the effect size comparison conducted by Maggin and colleagues (2011b); however, Parker and colleagues (2011) reported that NAP has been used in several recent meta-analyses.

Although greater consensus is evident among researchers regarding how to calculate treatment effects in group design research (Kratochwill et al. 2013b), it has been noted that the interpretation of these treatment effect scores can also be problematic (Brossart et al. 2006). Brossart and colleagues observed that the literature lacks a basis for comparing treatment effect sizes obtained using different calculations, arguably making this task challenging for clinicians and other stakeholders. These researchers noted that simple methods tend to yield different effect sizes than regression-based methods and that even ballpark interpretation guidelines for R² drawn from large-N group research in social science (e.g., “large” R² = .25, “medium” R² = .09 and “small” R² = .01) vary depending on the field of investigation.

Guidelines for the interpretation of derived scores have been clearly defined for PND. However, other than the original developers describing a phi correlation coefficient for PAND, based upon a Pearson r computed from a 2 × 2 contingency table (Parker et al. 2007), it appears that an interpretation scale for PAND is not yet available in the literature. Parker and Vannest (2009) did provide a tentative interpretation scale for NAP, analogous to that used for PND, based on a process of expert judgement of 200 data sets.

In a recent exploration of the characteristics of SCD data for participants with ASD, Carr et al. (2014) reported a declining trend over time in the volume of data gathered in both baseline and treatment phases of SCD studies, in both an established treatment field (self-management) and an emerging one (physical exercise), as classified by the National Standards Report (2009). Only 23.7 % of the studies included in the review reported a sufficient volume of data for a regression-based calculation. Carr and colleagues also explored the applicability of three readily hand-calculated nonparametric procedures for estimating effect sizes. PND was selected because of the frequency with which the procedure is reported, and both PAND and NAP because they were developed to address limitations evident in PND. The authors concluded that a NAP calculation, which is restricted neither by the volume of data points nor by the presence of ceiling or floor points in baseline, appeared appropriate for all studies included in the review and that PND was applicable to 90 % of the studies sampled. Conversely, PAND, which can only be applied when a minimum of 20 data points is presented, appeared applicable to only 54 % of the studies.

The purpose of the current study was to conduct a sensitivity analysis on treatment effect scores for use by the variety of stakeholders working with the ASD community. Accordingly, a primary requirement of the procedures included was the ability to perform all calculations by hand. PND was selected as the basis of comparison in the sensitivity analysis, as it has been widely adopted in the published literature. Burns et al. (2012) have recently recommended that additional research on new overlap approaches, particularly PAND and NAP, is warranted. The literature review for the current study also found support for the suggestion made by Burns and colleagues, with NAP identified on the basis of anticipated improvements and PAND on the basis of prior favourable review. The current study sought to explore the advantages and limitations of PND, PAND and NAP. In addition, it was noted in the literature that previous research on newer calculations has been limited to AB designs (Brossart et al. 2006; Parker and Vannest 2009). As such, this study included data from all phases.

The following research questions were investigated:

1. Do estimated effect sizes calculated using PND, PAND and NAP differ significantly from each other?

2. What benefits or limitations are evident in estimating treatment effect size using PAND or NAP when compared to the PND method?

3. How do calculated treatment effect scores compare with each other using available interpretation scales?

Method

Data Set Creation

Studies were located by conducting a systematic search of peer-reviewed literature prior to November 2013. Both PsycINFO and ERIC databases were queried using the keywords “autism*” and “Asperger’s syndrome” which were combined with the following terms typically associated with self-management: “self-management”, “self-regulation”, “self-regulate”, “self-monitoring”, “self-recording”, “self-reinforcement”, “self-evaluation”, “self-advocacy”, “self-observation”, “self-instruction”, “empowerment”, “self-determination” and “self-control”. In addition, a hand search of the reference lists of existing systematic reviews of self-management studies was undertaken.

The abstract of each article was examined to determine whether the article was likely to meet inclusion criteria for further review. The original article was retrieved and reviewed when further clarification appeared necessary. No age limits were placed upon participants. Inclusion criteria required that

1. Participants had an existing diagnosis of ASD or AS (for studies that included participants with differing conditions, only participants with ASD or AS were included for further review).

2. The study utilised a single-subject research design such as a multiple baseline, reversal, changing criterion or alternating treatment design.

3. Data for each phase and for each participant was presented in graphical format, thus enabling calculation of PND, PAND and/or NAP.

4. Components of self-management were included throughout the intervention.

5. Articles were published in an English language peer-reviewed journal.

This search procedure identified 38 articles that were included for further review.

Calculating Treatment Effect

A treatment effect score was calculated for each participant included in each study using each of the three methods described below.

PND (Scruggs et al. 1987) was calculated by counting the number of treatment data points that exceeded the most extreme baseline data point in the expected direction, determined by whether an increase or decrease in the target behaviour was desired. This count was then divided by the total number of treatment phase data points and expressed as a percentage. Scruggs and colleagues have advised against coding a study when baseline stability has not been established and have additionally noted that, for cases in which ceiling or floor baseline data points yield a 0 % PND, the variation between the treatment effect score and the original research findings should be described.
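To make the computation concrete, a minimal Python sketch follows; the function name, argument names and sample data are illustrative assumptions rather than material drawn from the reviewed studies.

    def pnd(baseline, treatment, increase_desired=True):
        # Most extreme baseline point in the therapeutic direction.
        extreme = max(baseline) if increase_desired else min(baseline)
        # Count treatment points beyond that extreme point.
        if increase_desired:
            nonoverlap = sum(1 for x in treatment if x > extreme)
        else:
            nonoverlap = sum(1 for x in treatment if x < extreme)
        return 100.0 * nonoverlap / len(treatment)

    # Example: 4 of 5 treatment points exceed the baseline maximum of 3.
    print(pnd([2, 3, 1], [4, 5, 3, 6, 7]))  # 80.0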

PAND (Parker et al. 2007) was calculated by determining the minimum number of data points that needed to be removed from the baseline and/or treatment phases to eliminate all overlap. The number of remaining data points was then divided by the total number of data points across the baseline and treatment phases and multiplied by 100 to express the nonoverlap as a percentage.
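A minimal Python sketch of this procedure is given below. It assumes that, for an increase target, overlap is eliminated once every retained baseline point lies strictly below every retained treatment point (i.e., ties count as overlap); this tie-handling convention, like the names and sample data, is an assumption for illustration.

    def pand(baseline, treatment, increase_desired=True):
        if not increase_desired:
            # Flip the series so that "higher is better" in a single code path.
            baseline = [-x for x in baseline]
            treatment = [-x for x in treatment]
        # Try every candidate cut point; keeping baseline points below the cut
        # and treatment points at or above it, the best cut retains the most
        # data, i.e. removes the minimum number of points.
        candidates = sorted(set(baseline + treatment))
        candidates.append(candidates[-1] + 1)  # cut that keeps all of baseline
        best_kept = 0
        for cut in candidates:
            kept = (sum(1 for x in baseline if x < cut)
                    + sum(1 for x in treatment if x >= cut))
            best_kept = max(best_kept, kept)
        total = len(baseline) + len(treatment)
        return 100.0 * best_kept / total

    # Example: removing the single overlapping treatment point (the 3)
    # leaves 7 of 8 data points, so PAND = 87.5.
    print(pand([2, 3, 1], [4, 5, 3, 6, 7]))  # 87.5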

NAP (Parker and Vannest 2009) was calculated by counting all nonoverlapping pairs. In practice, this is often achieved most quickly by scoring the overlapping pairs and subtracting from the total possible pairs to obtain the nonoverlap count. The total possible pairs are determined by multiplying the number of data points in the baseline phase by the number of data points in the treatment phase. Each pairwise comparison is scored, with an overlap counting as one point, a tie as half a point and a nonoverlap as zero. These overlap scores are summed, and the total is subtracted from the total possible pairs. The result is in turn divided by the total possible pairs and multiplied by 100 to derive the NAP treatment effect score as a percentage.
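The pairwise comparison can be sketched in Python as follows. This version scores nonoverlapping pairs directly, which is algebraically equivalent to the overlap-counting shortcut described above; names and data are again illustrative only.

    def nap(baseline, treatment, increase_desired=True):
        nonoverlap = 0.0
        for a in baseline:
            for b in treatment:
                if b == a:
                    nonoverlap += 0.5  # tie: half a point
                elif (b > a) == increase_desired:
                    nonoverlap += 1.0  # pair moving in the therapeutic direction
        total_pairs = len(baseline) * len(treatment)
        return 100.0 * nonoverlap / total_pairs

    # Example: 14 improving pairs plus one tie out of 15 possible pairs.
    print(round(nap([2, 3, 1], [4, 5, 3, 6, 7]), 1))  # 96.7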

Various approaches to determining treatment effect scores beyond an initial AB phase comparison were identified in previous research on treatment effect calculation methods. Skiba et al. (1985) argued in favour of an effect size based solely on the first AB phase comparison, claiming that treatment effects beyond the first treatment tested may be confounded by multiple-treatment interference and that failure to revert to baseline levels in subsequent baseline phases may be attributable to a lack of experimental control or to powerful treatment effects. Other approaches were based on a combination of comparable phases prior to calculation of an arithmetic mean (Scruggs et al. 1987) and on a comparison of the first A phase with the last B phase (Allison and Gorman 1993). The methodology adopted by Scruggs and colleagues was selected as preferable, based on consistency with the widely published PND metric.

Interpretation of Treatment Effect Scores

The scales provided by the respective original authors of each method have been adopted to interpret treatment effect scores and are summarised in Table 1. Scruggs and colleagues have suggested the following ranges for the interpretation of PND scores: 0–50 % ineffective, 50–70 % questionable, 70–90 % effective and 90 % or greater very effective (Scruggs and Mastropieri 1998). Parker and colleagues (2007) presented PAND as an alternative to PND; however, their original paper does not describe an interpretation scale analogous to that of PND. While a phi correlation coefficient can be derived using a 2 × 2 table of proportions, an interpretation scale for the output of this computation has not been described by the developers. Consequently, an interpretation scale for PAND or phi has been omitted from Table 1. In their more recent research that compared treatment effect scores with expert visual judgements made on 200 published AB phase comparisons, Parker and colleagues have proposed the following tentative ranges for the interpretation of NAP: 0–65 % weak effect, 66–92 % medium effect and 93–100 % strong effect (Parker and Vannest 2009).
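For convenience, the two published scales can be expressed as simple lookup functions. Because the printed ranges share their endpoints, the assignment of scores falling exactly on a band boundary is an assumption here, as are the function names.

    def interpret_pnd(score):
        # Ranges from Scruggs and Mastropieri (1998); boundary assignment assumed.
        if score < 50:
            return "ineffective"
        if score < 70:
            return "questionable"
        if score < 90:
            return "effective"
        return "very effective"

    def interpret_nap(score):
        # Tentative ranges from Parker and Vannest (2009).
        if score <= 65:
            return "weak effect"
        if score <= 92:
            return "medium effect"
        return "strong effect"

    print(interpret_pnd(78.8), interpret_nap(93.2))  # effective strong effect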

Table 1 Interpretation scale for strength of treatment effect score

Inter-Observer Agreement

Reliability of computations was verified by conducting inter-observer checks, and an initial trial coding was performed using four randomly selected studies. The author and a Senior Professor within the Faculty of Education separately hand-counted data points for each phase of each data series and recorded tallies on a coding sheet. PND, PAND and NAP were then calculated independently by hand by each coder and recorded on the coding sheet for all studies that reported sufficient data. The two coders then met to discuss any variations in results, and 100 % agreement was achieved for these four studies.

Subsequently, a further 14 articles (36.8 %) were selected at random, and each coder independently tallied the data points for each phase of each data series and calculated PND, PAND and NAP for each AB phase comparison. The two coders then met again to compare the scores each had calculated independently. Overall, 172 agreements were recorded from a total of 179 treatment effect calculations, giving an overall inter-coder agreement of 96.7 % across the 47 % of studies checked from the total data set. When calculated separately by computational procedure, inter-observer agreement (IOA) was 98.3 % for PND, 96.6 % for PAND and 93.3 % for NAP.

Subjective assessments were made of the consistency of interpretation of the treatment effect scores reported in Table 2. Both assessors met to discuss a method for determining consistency between interpretation scales, and consistency was operationalised using the scales provided in Table 1 as follows:

Table 2 Participant treatment effect scores

1. One rating of “ineffective” paired with any other rating was considered a disagreement.

2. Ratings in the same band as each other were considered an agreement.

3. Ratings one or more bands apart were considered a disagreement.

Both assessors separately rated each participant treatment effect score and then compared their findings. A total of 58 agreements and 43 disagreements between the PND and NAP interpretations were recorded, and a further two scores were deemed not applicable for this procedure. The two assessors' ratings matched in every case, yielding an IOA of 100 % for this process.

Results

A data set was developed from 38 articles that reported treatment data gathered across a variety of behaviours and settings for 103 participants. Hand counts of the number of data points reported in each baseline and treatment phase were conducted for each of the 215 data series included in 177 graphs. These tallies were entered manually into an Excel spreadsheet and subsequently compared against the calculation guidelines to determine the suitability of applying the PND, PAND and NAP calculations.

Baseline variability that included numerous ceiling data points was identified in one study (Koegel and Frea 1993), for which a PND result could not be calculated. Another three studies included ceiling data points in baseline that resulted in a 0 % PND for the respective participants (Koegel et al. 1992; Shearer et al. 1996; Stahmer and Schreibman 1992). Because 15 studies reported too few data points, a PAND treatment effect score could be calculated for only 23 studies. A NAP treatment effect score was calculated for all 38 studies.

Treatment effect scores were calculated for all 103 participants for whom sufficient data was provided (see Table 2). Variations between PND, PAND and NAP scores are summarised, and treatment effect scores are interpreted for PND and NAP.

To examine the variation between treatment effect score methods, the data set was reduced to include only studies that met the requirements for all three treatment effect score calculations. Twenty-two studies were identified, providing data for 57 participants. These 22 studies are indicated by an asterisk in Table 2.

Mean treatment effect scores for the 57 participants were PND 78.8 %, PAND 92.7 % and NAP 93.2 %. The mean NAP score was thus 14.4 percentage points greater than the mean PND score and 0.5 percentage points greater than the mean PAND score, while the mean PAND score was 13.9 percentage points greater than the mean PND score.

A single-factor ANOVA was used to test the null hypothesis that the mean treatment effect scores produced by the three methods are equal (H0: the three method means are equal). A summary of the means and variances for the 57 participants and the ANOVA results is presented in Table 3.
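This test can be reproduced with standard software. The following sketch uses scipy's one-way ANOVA; the short score lists are made-up placeholders standing in for the 57 participant scores per method, not the study's data.

    from scipy import stats

    # Hypothetical illustrative scores, one list per method.
    pnd_scores = [72.0, 85.0, 60.0, 90.0, 77.5, 88.0]
    pand_scores = [88.0, 95.0, 84.0, 97.0, 91.0, 93.5]
    nap_scores = [90.0, 96.5, 85.0, 98.0, 93.0, 94.0]

    # Single-factor ANOVA: H0 is that the three method means are equal.
    f_stat, p_value = stats.f_oneway(pnd_scores, pand_scores, nap_scores)
    print(f"F = {f_stat:.3f}, p = {p_value:.4f}")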

Table 3 Analysis of variance: n = 22 studies, 57 participants

The single-factor ANOVA resulted in a calculated F value that exceeded the critical F value, and the null hypothesis of equal means was therefore rejected. The difference was statistically significant (F(2, 168) = 13.259, p < .005). The mean PND score of 78.8 % for the 22 studies indicates an effective treatment. By comparison, the mean NAP score of 93.2 % indicates a strong treatment effect for the same 22 studies.

PND and NAP treatment effect scores were interpreted using the scales provided by the original authors and subjectively assessed for consistency of interpretation across all 103 participant scores. These scales yielded a consistent strength of treatment effect for 58 scores (56.3 %), were inconsistent for 43 scores (41.7 %), and a comparison was not applicable for the remaining two scores (1.9 %).

Discussion

The purpose of this research was to explore the suitability of three nonparametric calculation methods to estimate treatment effect size with SCD studies, with a specific focus on the needs of participants with ASD. In particular, PND, PAND and NAP hand calculation techniques were selected, and published data from self-management interventions conducted with participants diagnosed on the autism spectrum was used to test the calculations.

For each calculation method, a mean of all participant scores was calculated. Results from the ANOVA suggested that the mean treatment effect scores derived using the three calculation methods differ significantly, with PND producing the most conservative estimates of effect size.

Benefits and limitations are apparent for each scoring method. A main criticism of PND is the weighting it places on extreme, possibly outlying, data points in baseline phases (Parker et al. 2007). Scruggs and colleagues have defended their procedure, stating that these potential problems have rarely been encountered in the research literature and, when they are encountered, can easily be addressed, as noted in their original conventions, by acknowledging such discrepancies in the research report (Scruggs and Mastropieri 2013). Consistent with their conclusion, the current data set included relatively few instances in which outlying data points in baseline skewed the resultant treatment effect score.

Both PAND and NAP address the acknowledged weakness of PND by integrating additional baseline data points into the algorithm. PAND incorporates additional baseline data, although it eliminates all overlapping data across the baseline and treatment phases. This study identified a significant proportion of participant data that did not meet the minimum threshold of 20 data points. Of further concern, the original developers proposed a minimum of 20–25 data points. Had the upper level of their suggested threshold, 25 data points, been adopted, it is likely that even more studies would have been deemed unsuitable for the PAND methodology.

The NAP calculation appeared to offer the greatest advantage in this regard, as the algorithm incorporates a pairwise comparison of all data points included in the data set, thus utilising every data point recorded. NAP is not restricted by a minimum number of data points and, in that sense, is preferable to PAND, as a treatment effect score can be calculated for all studies and with greater precision. However, as NAP requires a more complex calculation than PND or PAND, it is more error-prone and potentially problematic with longer data series when calculated by hand.

The treatment effect scores calculated for both PND and NAP have been interpreted using rating scales available in the published literature; however, no such scale is available for PAND. Using PND, the mean effect size for self-management interventions was described as effective, the second-highest category on that scale. By contrast, using NAP, the mean effect size for the same interventions was described as strong, the highest category on its scale. The difference in interpretative guidelines, in addition to the observed difference in the scores derived by the procedures, suggests that PND reports a more conservative strength of treatment effect than NAP. This discrepancy underscores the warning by Brossart and colleagues (2006) that clinicians, as well as other interested stakeholders, face difficulty when interpreting studies that use different treatment effect calculation methods.

Conclusion

Fifteen years ago, Wilkinson and the Taskforce on Statistical Inference (1999) emphasised the importance of understanding how a given statistical measure is calculated and how to interpret the statistic. Findings from this study suggest that both issues remain of concern in SCD research conducted with participants with ASD.

Of the three treatment effect scores that were reviewed, PAND may cautiously be considered the least applicable for stakeholders in the autism community, for two reasons. First, a significant percentage of the articles included in this review did not include sufficient data points to permit a PAND-based analysis, and as observed elsewhere (Carr et al. 2014), evidence suggests that researchers are not collecting the additional data that would be required were PAND to be adopted in the future. Second, interpretation of the PAND score, or its associated phi or phi² correlation coefficient, is difficult in the absence of a conversion scale like those available for PND and NAP. Moreover, as reported by Brossart and colleagues (2006), differences across research fields in the interpretation of a phi or phi² term may further compromise interpretation of these statistics. Further research into the development of a scale to interpret a PAND/phi calculation, with a particular focus on participants with ASD, is justified.

Few studies in this sample included lengthy baselines, and consequently the NAP calculation was relatively straightforward to apply to the studies included in this review. Importantly, NAP utilised all data points reported for each participant. Given that a scale for interpreting strength of treatment effect has also been proposed, these factors arguably support the adoption of NAP as a potential improvement over PND in research involving participants with ASD. However, given the greater treatment effect scores calculated by NAP compared with those of PND, the adoption of NAP as a new standard should be treated cautiously.

The PND metric currently dominates the SCD literature, and the present data show it to yield a relatively conservative result, with the strength of treatment effect for self-management intervention procedures described as effective under PND and strong under NAP. Such calibration differences across methods for calculating treatment effects, were they found to be generalisable, are unlikely to contribute positively to our understanding of the relative effectiveness of intervention procedures. Arguably, this issue is relevant to the ASD community, and research reports using newer alternative treatment effect scores should be treated with caution to avoid presenting potentially misleading information to ASD research stakeholders. Importantly, this study has indicated that PND is widely applicable to the data gathered for participants with ASD, and its continued use appears justified.