Analysis and meta-analysis of single-case designs with a standardized mean difference statistic: A primer and applications☆
Introduction
If results from single-case designs (SCDs) are to contribute to the evidence-based practice debate about what works, it would be useful to have an effect size that is in the same metric as that used in between-subjects design (BSD) research like randomized experiments. Such effect sizes are important for at least five reasons. First, many reviews of effective practices consider evidence from both SCDs and BSDs—whether they combine those results together or report them separately. The evidence-based practice community often uses standardized effect size estimates as the common denominator for comparing such studies. If evidence from SCDs is to be considered along with evidence from BSDs, as already occurs in some reviews (e.g., Cicerone et al., 2000, Cicerone et al., 2005, Cicerone et al., 2011, Eikeseth, 2009, Fabiano et al., 2009, Kennedy et al., 2008, Rao et al., 2008, Rogers and Graham, 2008), it is essential to be able to represent the evidence from SCDs on the same scale used in BSDs. A common effect size measure for SCDs and BSDs would thus provide comparability in the reporting and synthesizing of evidence.
Second, rational planning of SCDs should depend on power calculations to ensure adequate design sensitivity, just as it does in BSDs. Methods for statistical power analysis that depend on a standardized effect size metric would allow the SCD researcher to rationally plan (and justify in grant proposals) their designs in terms of the same effect size parameters used by BSD researchers. Such methods do not currently exist as applied to SCDs, except for those we have developed for our d-statistics (Hedges and Shadish, 2013a, Hedges and Shadish, 2013b, Shadish and Marso, 2013).
Third, the past decade has seen growing interest in meta-analysis of SCDs (Maggin, O'Keeffe and Johnson, 2011, Shadish and Rindskopf, 2007), but most effect sizes that have been proposed for SCDs lack a formal statistical development. Without plausible distribution theory (e.g., knowledge of the sampling variance of an effect size statistic), these effect sizes cannot be used with common meta-analytic tools, such as forest plots, diagnostic plots such as radial plots and residual plots, cumulative meta-analysis, regression tests, or publication bias analysis. A formally developed standardized effect size estimate would allow researchers to take full advantage of conventional statistical methods in meta-analysis. Research synthesis methods for use with SCDs are less standardized and generally less developed by comparison.
Fourth, prompted by organizations like the American Psychological Association (Wilkinson & the Task Force on Statistical Inference, 1999), results from BSDs are now commonly summarized using effect sizes and accompanying confidence intervals. In SCD research, although statistics are rarely used, when they are the most common choice is one of the effect size estimates from the overlap tradition (Parker, Vannest, & Davis, 2011). This usage suggests some affinity on the part of SCD researchers for the simple elegance of an effect size as a summary statistic.
Finally, the size of treatment effects from SCD research relative to effects from other methodologies is of both intellectual and practical interest. Intellectually, recent interest in within-study comparisons (Cook, Shadish, & Wong, 2008) has generated work on the conditions under which various kinds of nonrandomized experiments can approximate the results of randomized experiments (Shadish et al., 2008, Shadish et al., 2011). SCDs are a form of time-series design, but little empirical evidence addresses how their results compare to those from randomized experiments. Practically, neither practitioners nor policy makers can carry out randomized experiments to examine every causal question, so they want to know what they are getting from alternative designs. These comparisons cannot be made without statistics in the same metric for all of these designs.
Hedges et al., 2012, Hedges et al., 2013 developed just such an effect size, a standardized mean difference statistic (d). One version of d is for multiple baseline designs, and one version is for phase reversal designs (e.g., AB and ABAB); together, these SCDs are the two most widely used (Shadish & Sullivan, 2011). Both effect sizes are similar conceptually but differ in algebraic details to accommodate the different features of the two kinds of designs. Hedges and Shadish, 2013a, Hedges and Shadish, 2013b also developed power analyses for these two effect size statistics. Calculations for all of these methods can be performed using SPSS macros available from the first author; a manual for using the macros is also available.1
This article has two main purposes. The first is to compare the d-statistic to alternatives such as earlier d-statistics, overlap statistics, and multilevel modeling, in order to begin to understand the relative advantages and disadvantages of all of these analyses. The second purpose is to demonstrate the application of d to real SCD data from two contexts: first, the analysis of multiple cases within a study, and second, the meta-analysis of effect sizes over studies. Consequently, while we present some of the statistical features of this work, we focus mostly on practical details, especially snapshots of computer program screens of input and output, and sample syntax that researchers can use to implement these analyses.
Before we present the d-statistic that we have developed for SCDs, it is helpful to put that d into context by comparing its strengths and weaknesses to those of the three other major approaches to analyzing SCDs—previously developed SCD d-statistics, overlap statistics, and multilevel modeling.
In order to do a quantitative summary of the results of many studies that use different outcome measures, one needs a way of putting diverse outcome measures on the same metric. Glass (1976) first suggested a d-statistic that addresses this need by expressing treatment effects in terms of outcome standard deviations. In a simple BSD, the d-statistic is defined as

$$d = \frac{M_t - M_c}{S},$$

where Mt is the mean of the treatment group on the dependent variable, Mc is the mean of the comparison group, and S is an estimate of the standard deviation of the outcome variable, typically but not always the pooled standard deviation

$$S = \sqrt{\frac{(n_t - 1)S_t^2 + (n_c - 1)S_c^2}{n_t + n_c - 2}}.$$

The pooled standard deviation is a weighted average of the between-person variances of the two groups (S2), where the weights are the degrees of freedom for each (sample size, n, minus one). For example, in a BSD, an effect size of d = 0.5 means the treatment group performed one half standard deviation better than the comparison group on the outcome. Just as in BSD research, SCD studies in the same research field frequently use diverse outcome measures, and so standardized effect sizes are needed.
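Purely as an illustrative sketch, the two formulas above can be computed in a few lines of Python (the function name `cohen_d` and the example data are ours, not part of the SPSS macros discussed later):

```python
import statistics

def cohen_d(treatment, comparison):
    """Standardized mean difference (M_t - M_c) / S, with S the pooled SD."""
    nt, nc = len(treatment), len(comparison)
    # Pooled variance: df-weighted average of the two sample variances.
    s2_pooled = ((nt - 1) * statistics.variance(treatment)
                 + (nc - 1) * statistics.variance(comparison)) / (nt + nc - 2)
    return (statistics.mean(treatment) - statistics.mean(comparison)) / s2_pooled ** 0.5

# Group means 10 and 9, pooled SD 2 -> d = 0.5 (half an SD of improvement)
print(cohen_d([8, 10, 12], [7, 9, 11]))  # 0.5
```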
Many previous SCD researchers have tried to create a similar d-statistic for use with SCDs. However, earlier proposals are all based on standardizing using within-case variation alone, making them incomparable to the d-statistic described above that is used in BSDs. In SCD research, Busk and Serlin (1992) proposed a standardized mean difference formula to compute ES for each case:

$$ES = \frac{M_t - M_b}{S_b},$$

where Mb is the mean of a case's baseline observations, Mt is the mean of the treatment observations, and Sb is the within-case standard deviation of the baseline observations (see Gorman-Smith & Matson, 1985, for a similar approach). Busk and Serlin (1992) argued that this ES can be pooled with the usual BSD d-estimator, which might suggest to some readers that the two estimators are comparable. However, the denominator from Busk and Serlin's effect size is based on within-case variation rather than the variation across individuals, as in the d from a BSD. The two effect size statistics do not estimate the same parameter, and so we would not recommend pooling the Busk and Serlin d with d from a BSD.
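To make the contrast with the BSD d concrete, here is a minimal sketch of the Busk and Serlin within-case computation (the function name is ours; note in the comment why the result is not on the BSD metric):

```python
import statistics

def busk_serlin_es(baseline, treatment):
    """Within-case effect size: (treatment mean - baseline mean) / baseline SD.
    Standardized by within-case variation, so it does NOT estimate the same
    parameter as the between-person d from a BSD."""
    return ((statistics.mean(treatment) - statistics.mean(baseline))
            / statistics.stdev(baseline))

# Baseline mean 4 (SD 2), treatment mean 10 -> ES = 3.0 baseline SDs
print(busk_serlin_es([2, 4, 6], [8, 10, 12]))  # 3.0
```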
Many related effect sizes also standardize within-case. Beretvas and Chung (2008) proposed a d that accounts for trends but is still based on within-case variation. Van den Noortgate and Onghena (2008) described a d-statistic but stated explicitly that it is not the same as the BSD d. The Maggin et al. (2011) d is based on a generalized least squares regression that accounts for both trends and autocorrelation, but it too is standardized within-case rather than between-cases. Finally, Parker, Hagan-Burke, and Vannest (2007) proposed to compute a phi correlation coefficient from SCD data and then convert it to a d-statistic, again standardized within-case.
It is not that standardizing an effect size within case is wrong. Indeed, SCD research needs effect sizes that are computed within case. The reason is that most SCD researchers are trained, and train their students, to examine each case individually, one case at a time. A study-wide summary over cases is not useful for that purpose, as useful as it may be for other purposes. SCD research needs to continue both approaches to effect size computation.
Starting with the Percent Nonoverlapping Data (PND) statistic more than 25 years ago (Scruggs and Mastropieri, 2013, Scruggs et al., 1987), SCD researchers have been exceptionally creative in developing effect size estimators that have come to be called the overlap statistics (sometimes nonoverlap statistics). Parker, Vannest, and Davis (2011) provide a recent review of the growing collection of this family of effect sizes. These statistics are all based on the concept of distributional overlap. Intuitively, in the simplest case in which the baseline phase does not exhibit trend, a more effective treatment would result in a data distribution from the treatment phase that overlaps less with the data distribution in the baseline phase. Compared to options like within-case d statistics, regression models, or multilevel models, overlap statistics are more often used in practice (Maggin, Swaminathan, et al., 2011).
Overlap statistics share one similarity with the d statistic developed in this article: just as the model behind the d statistic assumes that baseline phase data are stationary (lacking a time trend), most overlap statistics do not account for trends.2 Beyond this one point of commonality, overlap statistics differ markedly from d in the specificity of modeling assumptions, comparability with BSD effect sizes, and suitability for synthesis via conventional meta-analysis techniques.
The most fundamental difference is that the overlap statistics have been developed without reference to parametric, distributional modeling assumptions. Instead, many of the overlap statistics are inspired by and draw from the literature on nonparametric statistical tests, leading Parker, Vannest, and Davis (2011) to characterize them as “not requiring parametric assumptions about data distribution or scale type” (p. 304). Due to the close connection with nonparametric tests, the behavior of overlap statistics can be understood quite well when no intervention effect is present. However, little is known about the behavior of overlap statistics as measures of effect size magnitude when the intervention effect is nonzero. Allison and Gorman (1994) demonstrated that the magnitude of the most commonly used overlap statistic, PND (Scruggs et al., 1987), is confounded by the number of observations in the baseline series, which is clearly an undesirable property. More recent overlap statistics such as the non-overlap of all pairs (NAP, Parker & Vannest, 2009) do not suffer from this specific deficiency, but factors influencing their magnitude are not well understood and need to be studied further.3 In contrast, the parametric modeling approach used to define our d statistic, with its clear and explicitly stated assumptions, makes it much easier to characterize the factors influencing its magnitude and to study its behavior through mathematical analysis and simulation.
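As an illustration of how an overlap statistic is computed, the following sketch implements NAP as the proportion of all baseline-treatment pairs in which the treatment observation improves on the baseline observation, with ties counted as half (this is our simplified rendering, assuming higher scores indicate improvement; see Parker & Vannest, 2009, for the full development):

```python
def nap(baseline, treatment):
    """Nonoverlap of All Pairs: share of (baseline, treatment) pairs in which
    the treatment observation exceeds the baseline one; ties count as 0.5.
    Assumes higher scores indicate improvement."""
    pairs = [(b, t) for b in baseline for t in treatment]
    wins = sum(1.0 if t > b else 0.5 if t == b else 0.0 for b, t in pairs)
    return wins / len(pairs)

# Complete nonoverlap: every treatment point beats every baseline point
print(nap([1, 2, 3], [4, 5, 6]))  # 1.0
# Partial overlap: 2 wins, 1 loss, 1 tie out of 4 pairs
print(nap([1, 3], [2, 3]))  # 0.625
```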
A second difference between the overlap statistics and the d-statistic in this article is that the d-statistic is in a metric that is the same as that used in BSD research. This comparability to BSD research has all the advantages outlined in the introduction. The comparability of the overlap statistics to any of the BSD effect size estimators is largely unknown (but worth exploring). Lack of demonstrable comparability will probably tend to limit the overlap statistics to use within the SCD community, which will not help much with integrating SCD research into systematic reviews outside that community that also include BSD research.
A final difference pertains to meta-analysis. Because they do not make explicit distributional assumptions, overlap statistics do not have known sampling variances (except in the special case when treatment effects are zero). Lacking a measure of how much variability in the effect size estimate is due to pure sampling error, the effect sizes cannot be used with conventional meta-analytic techniques. The d-statistic does have such a sampling variance, and we show how it can be used to great advantage later.
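To show why a known sampling variance matters, here is the classical inverse-variance (fixed-effect) pooling step that a known variance makes possible; this is the textbook formula, not the article's macro:

```python
def pool_fixed(effects, variances):
    """Inverse-variance (fixed-effect) pooled estimate and its variance.
    Each effect is weighted by the reciprocal of its sampling variance."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    var_pooled = 1.0 / sum(weights)  # variance of the pooled estimate
    return pooled, var_pooled

# Two effects with equal variances get equal weight: pooled d is their mean
print(pool_fixed([0.5, 0.3], [0.04, 0.04]))  # (0.4, 0.02)
```

Without a sampling variance for each effect size, neither the weights nor the confidence interval of the pooled estimate can be computed, which is exactly the limitation of the overlap statistics noted above.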
As we explain in more detail elsewhere (Shadish et al., in press), the motivating assumptions for the d statistic can be understood in terms of a relatively simple multilevel model. The first level of the model describes the pattern of change over time (across phases) in the outcome measurements for a single case. These repeated measurements are nested within the case, and so the second level of the model describes variation across cases. The multilevel formulation is central to demonstrating that the d statistic for SCDs is on the same metric as the standardized mean difference based on a BSD.
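In simplified notation (ours, not the exact specification in Hedges et al.), the two-level structure can be sketched as follows, where $T_{ij}$ indicates whether measurement $i$ of case $j$ falls in a treatment phase:

```latex
% Level 1: repeated measurements within case j
Y_{ij} = \beta_{0j} + \beta_{1j} T_{ij} + \varepsilon_{ij},
    \qquad \varepsilon_{ij} \sim N(0, \sigma^2)
% Level 2: variation across cases
\beta_{0j} = \mu_0 + \eta_{0j}, \qquad \beta_{1j} = \mu_1 + \eta_{1j}
% Effect size: average treatment effect standardized by total
% (within- plus between-case) variation
\delta = \frac{\mu_1}{\sqrt{\sigma^2 + \tau^2}}
```

Standardizing $\mu_1$ by the total variation across individuals at a point in time, rather than by within-case variation alone, is what places $\delta$ on the same metric as the BSD standardized mean difference.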
The use of multilevel models for analyzing SCDs is advancing rapidly, as evidenced by several articles in this special issue. Some of these advances involve using different approaches to fit models that are identical or closely related to the model behind the d statistic. For instance, Swaminathan, Rogers, and Horner (2014-this issue) take a Bayesian approach to estimation; Moeyaert, Ferron, Beretvas, and Van den Noortgate (2014-this issue) use likelihood-based estimation; and we ourselves are currently exploring different approaches to estimating the same underlying model and effect size parameter (Pustejovsky, 2013). Relative to these other approaches, the effect size presented in this article has the advantages that it is based on a relatively robust approach to estimation (as we explain later in this article) and is implemented in a documented, easy-to-use SPSS macro. In comparison, results across various multilevel software implementations may vary modestly depending on optimization algorithms, procedural details, and differing capabilities to do things like model autocorrelations (Shadish, Kyse, & Rindskopf, 2013).
Recent work has explored multilevel models that have considerably more complex features than the relatively simple model on which the d estimator is based. These extensions include multilevel models for SCDs with non-normally distributed outcome variables (Shadish et al., 2013) and explorations of nonlinear time trends (Shadish, Zuur, & Sullivan, 2014-this issue). However, these more advanced models are limited by the fact that they do not yield effect sizes that are known to be comparable to the well-known effect sizes from BSD, such as standardized mean differences, odds ratios, or relative risk ratios. Defining such effect sizes presents certain challenges and remains an area for further research. Without them, the usefulness of many recently proposed multilevel models is primarily in the analysis of multiple cases within a study and not in synthesis of multiple studies.
Section snippets
The Hedges, Pustejovsky and Shadish d-statistic for SCDs
As we have discussed in other work (Shadish, Hedges, Pustejovsky, Boyajian, et al., 2013), a statistical model that is broad enough to encompass both a BSD and a SCD with replications across individuals is needed in order to define a common effect size parameter. Hedges et al., 2012, Hedges et al., 2013 developed such models, making it possible to identify a parameter that is a conventional effect size (the standardized mean difference) in the BSD. They then proposed a d-statistic to estimate
Example one: Analysis of cases within a study
Here we analyze the Lambert, Cartledge, Heward, and Lo (2006) data using the SPSS macro and then test whether the resulting effect size is sensitive to the linearity assumption. Fig. 1 shows an appropriate input dataset in the SPSS Data Editor. The ten input variables are as follows:
- SID (study identification number),
- PID (case identification number),
- DVID (dependent variable identification number),
- DVDir (0 = outcome decreases if treatment works or 1 = increases if treatment works),
- SessIDX (session
Statistical power of g for SCDs
Hedges and Shadish, 2013a, Hedges and Shadish, 2013b have developed methods to compute design sensitivity (power) for g for ABk designs and g for multiple baseline designs. Power is useful to plan a study that can detect a specified effect, a valuable feature for grant proposals in particular.
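The Hedges and Shadish power methods are design-specific, but the underlying logic can be illustrated with a generic normal-approximation power calculation for an effect estimate with a known standard error (our illustrative sketch, not the published procedure):

```python
from statistics import NormalDist

def power_two_sided(delta, se, alpha=0.05):
    """Approximate power of a two-sided z-test for a true effect `delta`
    whose estimator has standard error `se` (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value
    ncp = delta / se                          # noncentrality: effect in SE units
    return (1 - NormalDist().cdf(z - ncp)) + NormalDist().cdf(-z - ncp)

# With no true effect, "power" equals the Type I error rate alpha
print(power_two_sided(0.0, 1.0))   # 0.05
# A half-SD effect measured with SE = 0.1 is detected almost surely
print(power_two_sided(0.5, 0.1))   # ~0.999
```

In practice, the standard error of g for an ABk or multiple baseline design depends on the numbers of cases and observations and on the autocorrelation, which is why the design-specific methods of Hedges and Shadish are needed to plan a study.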
Meta-analysis using the d-statistic
Our purpose in this section is to demonstrate how the d-statistic can be used to meta-analyze SCD studies. Not all SCD researchers agree that quantitative analysis of SCD research is desirable, much less quantitative, meta-analytic syntheses of SCD research (Baer, 1977, Perone, 1999). Elsewhere we have described why we believe that SCD researchers will begin to use quantitative work more often (Shadish, Hedges, Pustejovsky, Rindskopf, et al., in press, Shadish, 2014). Much of what we said there
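Because each d comes with a sampling variance, SCD effect sizes can be combined with standard random-effects machinery. As one common example (the classical DerSimonian-Laird estimator, shown here only as a sketch; the analyses reported below may use different software and estimators):

```python
def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate using the DerSimonian-Laird
    method-of-moments estimator of between-study variance tau^2."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    fixed = sum(wi * di for wi, di in zip(w, effects)) / sum(w)
    # Q statistic measures heterogeneity around the fixed-effect estimate
    q = sum(wi * (di - fixed) ** 2 for wi, di in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)        # truncated at zero
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * di for wi, di in zip(w_star, effects)) / sum(w_star)
    return pooled, tau2

# Perfectly homogeneous effects: tau^2 = 0 and the pooled d is their mean
print(dersimonian_laird([0.4, 0.4, 0.4], [0.1, 0.1, 0.1]))  # (0.4, 0.0)
```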
Discussion
This article argued that SCD research would benefit from a standardized effect size estimator in the same metric as those used for BSDs and suggested the d-statistic in this article could serve that need. We reviewed past efforts to develop a d-statistic, noting that nearly all of them were standardized using within-case variability, so that they estimate a different parameter than either the d-statistic we have developed or the effect size estimates commonly used in the BSD literature. We also
References (80)
- et al. (1994). "Making things as simple as possible, but no simpler": A rejoinder to Scruggs and Mastropieri. Behaviour Research and Therapy.
- et al. (2000). Evidence-based cognitive rehabilitation: Recommendations for clinical practice. Archives of Physical Rehabilitation Medicine.
- et al. (2005). Evidence-based cognitive rehabilitation: Update of the literature from 1998 through 2002. Archives of Physical Rehabilitation Medicine.
- et al. (2011). Evidence-based cognitive rehabilitation: Updated review of the literature from 2003 through 2008. Archives of Physical Rehabilitation Medicine.
- (2009). Outcome of comprehensive psycho-educational interventions for young children with autism. Research in Developmental Disabilities.
- et al. (2009). A meta-analysis of behavioral treatments for attention-deficit/hyperactivity disorder. Clinical Psychology Review.
- et al. (2007). Using the natural language paradigm (NLP) to increase vocalizations of older adults with cognitive impairments. Research in Developmental Disabilities.
- et al. (2011). A generalized least squares regression approach for computing effect sizes in single-case research: Application examples. Journal of School Psychology.
- et al. (2009). An improved effect size for single-case research: Nonoverlap of all pairs. Behavior Therapy.
- et al. (2011). Combining nonoverlap and trend for single-case research: Tau-U. Behavior Therapy.
- Assessment and treatment of aggressive behavior without a clear social function. Research in Developmental Disabilities.
- Brief report: Toward refinement of a predictive behavioral profile for treatment outcome in children with autism. Research in Autism Spectrum Disorders.
- Conducting meta-analysis using SAS.
- Perhaps it would be better not to know everything. Journal of Applied Behavior Analysis.
- An evaluation of modified R2-change effect size indices for single-subject experimental designs. Evidence-Based Communication Assessment and Intervention.
- Comprehensive meta-analysis, version 2.
- Introduction to meta-analysis.
- Meta-analysis for single-case research.
- Applied meta-analysis with R.
- A parsimonious weight function for modeling publication bias.
- Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management.
- The trim and fill method.
- Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics.
- A nonparametric "trim and fill" method of accounting for publication bias in meta-analysis. Journal of the American Statistical Association.
- How to do a meta-analysis. British Journal of Mathematical and Statistical Psychology.
- An R companion to applied regression.
- Graphical display of estimates having differing standard errors. Technometrics.
- A note on graphical presentation of estimated odds ratios from several clinical trials. Statistics in Medicine.
- Some applications of radial plots. Journal of the American Statistical Association.
- A review of treatment research for self-injurious and stereotyped responding. Journal of Mental Deficiency Research.
- Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics.
- A standardized mean difference effect size for single-case designs. Research Synthesis Methods.
- A standardized mean difference effect size for multiple baseline designs across individuals. Research Synthesis Methods.
- Power analysis for single case designs.
- Power analysis for multiple baseline designs.
- Estimating effect size under publication bias: Small sample properties and robustness of a random effects selection model. Journal of Educational and Behavioral Statistics.
- Fixed- and random-effects models in meta-analysis. Psychological Methods.
- Intervention for executive functions after traumatic brain injury: A systematic review, meta-analysis and clinical recommendations. Neuropsychological Rehabilitation.
- Improved tests for a random effects meta-regression with a single covariate. Statistics in Medicine.
- Single-case designs technical documentation.
☆ This research was supported in part by grants R305D100046 and R305D100033 from the Institute for Educational Sciences, U.S. Department of Education, and by a grant from the University of California Office of the President to the University of California Educational Evaluation Consortium. The opinions expressed are those of the authors and do not represent views of the University of California, the Institute for Educational Sciences, or the U.S. Department of Education.