Analysis and meta-analysis of single-case designs with a standardized mean difference statistic: A primer and applications

https://doi.org/10.1016/j.jsp.2013.11.005

Abstract

This article presents a d-statistic for single-case designs that is in the same metric as the d-statistic used in between-subjects designs such as randomized experiments and offers some reasons why such a statistic would be useful in SCD research. The d has a formal statistical development, is accompanied by appropriate power analyses, and can be estimated using user-friendly SPSS macros. We discuss both advantages and disadvantages of d compared to other approaches such as previous d-statistics, overlap statistics, and multilevel modeling. It requires at least three cases for computation and assumes normally distributed outcomes and stationarity, assumptions that are discussed in some detail. We also show how to test these assumptions. The core of the article then demonstrates in depth how to compute d for one study, including estimation of the autocorrelation and the ratio of between case variance to total variance (between case plus within case variance), how to compute power using a macro, and how to use the d to conduct a meta-analysis of studies using single-case designs in the free program R, including syntax in an appendix. This syntax includes how to read data, compute fixed and random effect average effect sizes, prepare a forest plot and a cumulative meta-analysis, estimate various influence statistics to identify studies contributing to heterogeneity and effect size, and do various kinds of publication bias analyses. This d may prove useful for both the analysis and meta-analysis of data from SCDs.

Introduction

If results from single-case designs (SCDs) are to contribute to the evidence-based practice debate about what works, it would be useful to have an effect size that is in the same metric as that used in between-subjects design (BSD) research like randomized experiments. Such effect sizes are important for at least five reasons. First, many reviews of effective practices consider evidence from both SCDs and BSDs—whether they combine those results together or report them separately. The evidence-based practice community often uses standardized effect size estimates as the common denominator for comparing such studies. If evidence from SCDs is to be considered along with evidence from BSDs, as already occurs in some reviews (e.g., Cicerone et al., 2000, Cicerone et al., 2005, Cicerone et al., 2011, Eikeseth, 2009, Fabiano et al., 2009, Kennedy et al., 2008, Rao et al., 2008, Rogers and Graham, 2008), it is essential to be able to represent the evidence from SCDs on the same scale used in BSDs. A common effect size measure for SCDs and BSDs would thus provide comparability in the reporting and synthesizing of evidence.

Second, rational planning of SCDs should depend on power calculations to ensure adequate design sensitivity, just as it does in BSDs. Methods for statistical power analysis that depend on a standardized effect size metric would allow the SCD researcher to rationally plan (and justify in grant proposals) their designs in terms of the same effect size parameters used by BSD researchers. Such methods do not currently exist as applied to SCDs, except for those we have developed for our d-statistics (Hedges and Shadish, 2013a, Hedges and Shadish, 2013b, Shadish and Marso, 2013).

Third, the past decade has seen growing interest in meta-analysis of SCDs (Maggin, O'Keeffe and Johnson, 2011, Shadish and Rindskopf, 2007), but most effect sizes that have been proposed for SCDs lack a formal statistical development. Without plausible distribution theory (e.g., knowledge of the sampling variance of an effect size statistic), these effect sizes cannot be used with common meta-analytic tools, such as forest plots, diagnostic plots such as radial plots and residual plots, cumulative meta-analysis, regression tests, or publication bias analysis. A formally developed standardized effect size estimate would allow researchers to take full advantage of conventional statistical methods in meta-analysis. Research synthesis methods for use with SCDs are less standardized and generally less developed by comparison.

Fourth, prompted by organizations like the American Psychological Association (Wilkinson & the Task Force on Statistical Inference, 1999), results from BSDs are now commonly summarized using effect sizes and accompanying confidence intervals. In SCD research, statistics are rarely used; when they are, the most common choice is one of the effect size estimates from the overlap tradition (Parker, Vannest, & Davis, 2011). This usage suggests some affinity on the part of SCD researchers for the simple elegance of an effect size as a summary statistic.

Finally, the size of treatment effects from SCD research relative to effects from other methodologies is of both intellectual and practical interest. Intellectually, recent interest in within-study comparisons (Cook, Shadish, & Wong, 2008) has generated work on the conditions under which various kinds of nonrandomized experiments can approximate the results from randomized experiments (Shadish et al., 2008, Shadish et al., 2011). SCDs are a form of time-series design, but there is little empirical evidence on how their results compare to results from randomized experiments. Practically, practitioners and policy makers cannot always carry out randomized experiments to examine every causal question, so they want to know what they are getting from alternative designs. These comparisons cannot be made without statistics in the same metric for all these designs.

Hedges et al., 2012, Hedges et al., 2013 developed just such an effect size, a standardized mean difference statistic (d). One version of d is for multiple baseline designs, and one version is for phase reversal designs (e.g., AB and ABAB); together, these SCDs are the two most widely used (Shadish & Sullivan, 2011). Both effect sizes are similar conceptually but differ in algebraic details to accommodate the different features of the two kinds of designs. Hedges and Shadish, 2013a, Hedges and Shadish, 2013b also developed power analyses for these two effect size statistics. Calculations for all of these methods can be performed using SPSS macros available from the first author; a manual for using the macros is also available.1

This article has two main purposes. The first is to compare the d-statistic to alternatives such as earlier d-statistics, overlap statistics, and multilevel modeling, in order to begin to understand the relative advantages and disadvantages of all of these analyses. The second purpose is to demonstrate the application of d to real SCD data from two contexts: first, the analysis of multiple cases within a study, and second, the meta-analysis of effect sizes over studies. Consequently, while we present some of the statistical features of this work, we focus mostly on practical details, especially snapshots of computer program screens of input and output, and sample syntax that researchers can use to implement these analyses.

Before we present the d-statistic that we have developed for SCDs, it is helpful to put that d into context by comparing its strengths and weaknesses to those of the three other major approaches to analyzing SCDs—previously developed SCD d-statistics, overlap statistics, and multilevel modeling.

In order to do a quantitative summary of the results of many studies that use different outcome measures, one needs a way of putting diverse outcome measures on the same metric. Glass (1976) first suggested a d-statistic that addresses this need by expressing treatment effects in terms of outcome standard deviations. In a simple BSD, the d-statistic is defined as

$$d = \frac{M_t - M_c}{S},$$

where $M_t$ is the mean of the treatment group on the dependent variable, $M_c$ is the mean of the comparison group, and $S$ is an estimate of the standard deviation of the outcome variable, typically but not always the pooled standard deviation

$$S_p = \sqrt{\frac{(n_t - 1)S_t^2 + (n_c - 1)S_c^2}{n_t + n_c - 2}}.$$

The pooled standard deviation is a weighted average of the between-person variances of the two groups ($S_t^2$ and $S_c^2$), where the weights are the degrees of freedom for each group (sample size, n, minus one). For example, in a BSD, an effect size of d = 0.5 means the treatment group performed one half standard deviation better than the comparison group on the outcome. Just as in BSD research, SCD studies in the same research field frequently use diverse outcome measures, and so standardized effect sizes are needed.
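As a concrete illustration, the following minimal R sketch computes the BSD d from invented summary statistics (all values and variable names are ours, purely for illustration):

```r
# Minimal sketch of the BSD standardized mean difference, using invented
# summary statistics purely for illustration.
m_t <- 22.4; s_t <- 5.1; n_t <- 30   # treatment group mean, SD, and sample size
m_c <- 19.8; s_c <- 4.7; n_c <- 28   # comparison group mean, SD, and sample size

# Pooled standard deviation: df-weighted average of the two group variances
s_p <- sqrt(((n_t - 1) * s_t^2 + (n_c - 1) * s_c^2) / (n_t + n_c - 2))

# Standardized mean difference
d <- (m_t - m_c) / s_p
d
```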

Many previous SCD researchers have tried to create a similar d-statistic for use with SCDs. However, earlier proposals are all based on standardizing using within-case variation alone, making them incomparable to the d-statistic described above that is used in BSDs. In SCD research, Busk and Serlin (1992) proposed a standardized mean difference formula to compute an effect size (ES) for each case:

$$ES = \frac{M_b - M_t}{S_b},$$

where $M_b$ is the mean of a case's baseline observations, $M_t$ is the mean of the treatment observations, and $S_b$ is the within-case standard deviation of the baseline observations (see Gorman-Smith & Matson, 1985, for a similar approach). Busk and Serlin (1992) argued that this ES can be pooled with the usual BSD d-estimator, which might suggest to some readers that the two estimators are comparable. However, the denominator of Busk and Serlin's effect size is based on within-case variation rather than the variation across individuals, as in the d from a BSD. The two effect size statistics do not estimate the same parameter, and so we would not recommend pooling the Busk and Serlin d with d from a BSD.
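For contrast with the BSD computation above, here is a minimal R sketch of this within-case effect size for a single case, following the formula as just given and using invented data for an outcome that decreases when treatment works (data and variable names are ours):

```r
# Within-case effect size in the style of Busk and Serlin (1992), computed from
# invented observations for one case on an outcome that decreases when treatment works.
baseline  <- c(12, 14, 13, 15, 12, 14)   # baseline-phase observations
treatment <- c( 8,  7,  9,  6,  8,  7)   # treatment-phase observations

es_within <- (mean(baseline) - mean(treatment)) / sd(baseline)
es_within
```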

Many related effect sizes also standardize within-case. Beretvas and Chung (2008) proposed a d that accounts for trends but is still based on within-case variation. Van den Noortgate and Onghena (2008) described a d-statistic but stated explicitly that it is not the same as the BSD d. The d-statistic of Maggin et al. (2011) is based on a generalized least squares regression that accounts for both trends and autocorrelation, but it is also standardized within cases rather than between cases. Finally, Parker, Hagan-Burke, and Vannest (2007) proposed to compute a phi correlation coefficient from SCD data and then convert it to a d-statistic, again standardized within-case.

It is not that standardizing an effect size within case is wrong. Indeed, SCD research needs effect sizes that are computed within case. The reason is that most SCD researchers are trained, and train their students, to examine each case individually, one case at a time. A study-wide summary over cases is not useful for that purpose, as useful as it may be for other purposes. SCD research needs to continue both approaches to effect size computation.

Starting with the Percent Nonoverlapping Data (PND) statistic more than 25 years ago (Scruggs and Mastropieri, 2013, Scruggs et al., 1987), SCD researchers have been exceptionally creative in developing effect size estimators that have come to be called the overlap statistics (sometimes nonoverlap statistics). Parker, Vannest, and Davis (2011) provide a recent review of the growing collection of this family of effect sizes. These statistics are all based on the concept of distributional overlap. Intuitively, in the simplest case in which the baseline phase does not exhibit trend, a more effective treatment would result in a data distribution from the treatment phase that overlaps less with the data distribution in the baseline phase. Compared to options like within-case d statistics, regression models, or multilevel models, overlap statistics are more often used in practice (Maggin, Swaminathan, et al., 2011).

Overlap statistics share one similarity with the d statistic developed in this article: just as the model behind the d statistic assumes that baseline phase data are stationary (lacking a time trend), most overlap statistics do not account for trends.2 Beyond this one point of commonality, overlap statistics differ markedly from d in the specificity of modeling assumptions, comparability with BSD effect sizes, and suitability for synthesis via conventional meta-analysis techniques.

The most fundamental difference is that the overlap statistics have been developed without reference to parametric, distributional modeling assumptions. Instead, many of the overlap statistics are inspired by and draw from the literature on nonparametric statistical tests, leading Parker, Vannest, and Davis (2011) to characterize them as “not requiring parametric assumptions about data distribution or scale type” (p. 304). Due to the close connection with nonparametric tests, the behavior of overlap statistics can be understood quite well when no intervention effect is present. However, little is known about the behavior of overlap statistics as measures of effect size magnitude when the intervention effect is nonzero. Allison and Gorman (1994) demonstrated that the magnitude of the most commonly used overlap statistic, PND (Scruggs et al., 1987), is confounded by the number of observations in the baseline series, which is clearly an undesirable property. More recent overlap statistics such as the non-overlap of all pairs (NAP, Parker & Vannest, 2009) do not suffer from this specific deficiency, but factors influencing their magnitude are not well understood and need to be studied further.3 In contrast, the parametric modeling approach used to define our d statistic, with its clear and explicitly stated assumptions, makes it much easier to characterize the factors influencing its magnitude and to study its behavior through mathematical analysis and simulation.
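To make the contrast with a parametric effect size concrete, here is a minimal R sketch of two overlap statistics, PND and NAP, for a single case. The data are invented and the outcome is assumed to increase when treatment works; both the data and the variable names are illustrative, not from the article:

```r
# Invented data for a single case; higher scores indicate improvement here.
baseline  <- c(3, 5, 4, 6, 5)
treatment <- c(7, 9, 6, 10, 8, 9)

# PND: percentage of treatment-phase points exceeding the highest baseline point.
pnd <- 100 * mean(treatment > max(baseline))

# NAP: proportion of all baseline-treatment pairs in which the treatment
# observation is higher, counting ties as half.
pairs <- outer(treatment, baseline, ">") + 0.5 * outer(treatment, baseline, "==")
nap <- mean(pairs)

c(PND = pnd, NAP = nap)
```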

A second difference between the overlap statistics and the d-statistic in this article is that the d-statistic is in the same metric as that used in BSD research. This comparability to BSD research has all the advantages outlined in the introduction. The comparability of the overlap statistics to any of the BSD effect size estimators is largely unknown (but worth exploring). Lacking demonstrable comparability, the overlap statistics will likely remain limited to use within the SCD community, which will not help much with integrating SCD research into the broader systematic reviews that also include BSD research.

A final difference pertains to meta-analysis. Because they do not make explicit distributional assumptions, overlap statistics do not have known sampling variances (except in the special case when treatment effects are zero). Lacking a measure of how much variability in the effect size estimate is due to pure sampling error, the effect sizes cannot be used with conventional meta-analytic techniques. The d-statistic does have such a sampling variance, and we show how it can be used to great advantage later.
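Because each study's d comes with an estimated sampling variance, it can be fed directly into standard meta-analytic software. A minimal R sketch using the metafor package (the package choice, effect sizes, and variances are our assumptions for illustration; the article's appendix supplies its own syntax) might look like this:

```r
# Minimal sketch: conventional meta-analysis of per-study d estimates that come
# with known sampling variances. Effect sizes and variances below are invented.
library(metafor)

dat <- data.frame(
  study = paste("Study", 1:5),
  yi = c(0.42, 0.65, 0.30, 0.80, 0.55),        # d estimate from each study
  vi = c(0.020, 0.035, 0.015, 0.050, 0.025)    # sampling variance of each d
)

fe <- rma(yi, vi, data = dat, method = "FE")    # fixed-effect average effect size
re <- rma(yi, vi, data = dat, method = "REML")  # random-effects average effect size

forest(re, slab = dat$study)         # forest plot
cumul(re, order = order(dat$vi))     # cumulative meta-analysis, most precise studies first
influence(re)                        # influence statistics for individual studies
funnel(re)                           # funnel plot
regtest(re)                          # regression test for funnel-plot asymmetry
trimfill(re)                         # trim-and-fill publication bias adjustment
```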

As we explain in more detail elsewhere (Shadish et al., in press), the motivating assumptions for the d statistic can be understood in terms of a relatively simple multilevel model. The first level of the model describes the pattern of change over time (across phases) in the outcome measurements for a single case. These repeated measurements are nested within the case, and so the second level of the model describes variation across cases. The multilevel formulation is central to demonstrating that the d statistic for SCDs is on the same metric as the standardized mean difference based on a BSD.

The use of multilevel models for analyzing SCDs is advancing rapidly, as evidenced by several articles in this special issue. Some of these advances involve using different approaches to fit models that are identical or closely related to the model behind the d statistic. For instance, Swaminathan, Rogers, and Horner (2014-this issue) take a Bayesian approach to estimation; Moeyaert, Ferron, Beretvas, and Van den Noortgate (2014-this issue) use likelihood-based estimation; and we ourselves are currently exploring different approaches to estimating the same underlying model and effect size parameter (Pustejovsky, 2013). Relative to these other approaches, the effect size presented in this article has the advantages that it is based on a relatively robust approach to estimation (as we explain later in this article) and is implemented in a documented, easy-to-use SPSS macro. In comparison, results across different multilevel software implementations may vary modestly depending on optimization algorithms, procedural details, and differing capabilities, such as the ability to model autocorrelation (Shadish, Kyse, & Rindskopf, 2013).
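To make the multilevel formulation concrete, here is a minimal R sketch of one way a two-level model of this general kind could be fit with the nlme package: a phase mean shift at level 1, a random intercept for cases at level 2, and AR(1) errors within cases. The simulated data, variable names, and package choice are our assumptions, and this is not the estimation approach used for the d-statistic in this article:

```r
# Sketch of a two-level SCD model with a treatment-phase shift, a case-level
# random intercept, and AR(1) autocorrelation within cases. Data are simulated.
library(nlme)

set.seed(1)
starts <- c(5, 8, 11, 14)   # invented treatment start sessions for 4 cases
scd_data <- do.call(rbind, lapply(seq_along(starts), function(i) {
  session <- 1:16
  phase <- as.integer(session >= starts[i])
  data.frame(case = factor(i), session = session, phase = phase,
             y = 10 + 2 * phase + rnorm(1) + rnorm(16))   # case effect + noise
}))

fit <- lme(y ~ phase,                                      # level 1: phase mean shift
           random = ~ 1 | case,                            # level 2: variation across cases
           correlation = corAR1(form = ~ session | case),  # autocorrelation within cases
           data = scd_data)
summary(fit)
```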

Recent work has explored multilevel models with considerably more complex features than the relatively simple model on which the d estimator is based. These extensions include multilevel models for SCDs with non-normally distributed outcome variables (Shadish et al., 2013) and explorations of nonlinear time trends (Shadish, Zuur, & Sullivan, 2014-this issue). However, these more advanced models are limited by the fact that they do not yield effect sizes known to be comparable to the familiar effect sizes from BSDs, such as standardized mean differences, odds ratios, or risk ratios. Defining such effect sizes presents certain challenges and remains an area for further research. Without them, the usefulness of many recently proposed multilevel models lies primarily in the analysis of multiple cases within a study rather than in the synthesis of multiple studies.

Section snippets

The Hedges, Pustejovsky and Shadish d-statistic for SCDs

As we have discussed in other work (Shadish, Hedges, Pustejovsky, Boyajian, et al., 2013), a statistical model that is broad enough to encompass both a BSD and a SCD with replications across individuals is needed in order to define a common effect size parameter. Hedges et al., 2012, Hedges et al., 2013 developed such models, making it possible to identify a parameter that is a conventional effect size (the standardized mean difference) in the BSD. They then proposed a d-statistic to estimate

Example one: Analysis of cases within a study

Here we analyze the Lambert, Cartledge, Heward, and Lo (2006) data using the SPSS macro and then test whether the resulting effect size is sensitive to the linearity assumption. Fig. 1 shows an appropriate input dataset in the SPSS Data Editor. The ten input variables are as follows:

  • SID (study identification number),

  • PID (case identification number),

  • DVID (dependent variable identification number),

  • DVDir (0 = outcome decreases if treatment works or 1 = increases if treatment works),

  • SessIDX (session

Statistical power of g for SCDs

Hedges and Shadish, 2013a, Hedges and Shadish, 2013b have developed methods to compute design sensitivity (power) for g for (AB)^k designs and g for multiple baseline designs. Power is useful to plan a study that can detect a specified effect, a valuable feature for grant proposals in particular.

Meta-analysis using the d-statistic

Our purpose in this section is to demonstrate how the d-statistic can be used to meta-analyze SCD studies. Not all SCD researchers agree that quantitative analysis of SCD research is desirable, much less quantitative, meta-analytic syntheses of SCD research (Baer, 1977, Perone, 1999). Elsewhere we have described why we believe that SCD researchers will begin to use quantitative work more often (Shadish, Hedges, Pustejovsky, Rindskopf, et al., in press, Shadish, 2014). Much of what we said there

Discussion

This article argued that SCD research would benefit from a standardized effect size estimator in the same metric as those used for BSDs and suggested the d-statistic in this article could serve that need. We reviewed past efforts to develop a d-statistic, noting that nearly all of them were standardized using within-case variability, so that they estimate a different parameter than either the d-statistic we have developed or the effect size estimates commonly used in the BSD literature. We also

References (80)

  • J.E. Ringdahl et al. Assessment and treatment of aggressive behavior without a clear social function. Research in Developmental Disabilities (2008).

  • L. Schreibman et al. Brief report: Toward refinement of a predictive behavioral profile for treatment outcome in children with autism. Research in Autism Spectrum Disorders (2009).

  • W. Arthur et al. Conducting meta-analysis using SAS (2001).

  • D.M. Baer. Perhaps it would be better not to know everything. Journal of Applied Behavior Analysis (1977).

  • N. Beretvas et al. An evaluation of modified R²-change effect size indices for single-subject experimental designs. Evidence-Based Communication Assessment and Intervention (2008).

  • M. Borenstein et al. Comprehensive meta-analysis, version 2 (2005).

  • M. Borenstein et al. Introduction to meta-analysis (2009).

  • P.L. Busk et al. Meta-analysis for single-case research.

  • D.-G. Chen et al. Applied meta-analysis with R (2013).

  • M. Citkowicz et al. A parsimonious weight function for modeling publication bias.

  • T.D. Cook et al. Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management (2008).

  • S.J. Duval. The trim and fill method.

  • S.J. Duval et al. Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics (2000).

  • S.J. Duval et al. A nonparametric "trim and fill" method of accounting for publication bias in meta-analysis. Journal of the American Statistical Association (2000).

  • A.P. Field et al. How to do a meta-analysis. British Journal of Mathematical and Statistical Psychology (2010).

  • J. Fox et al. An R companion to applied regression (2011).

  • R.F. Galbraith. Graphical display of estimates having differing standard errors. Technometrics (1988).

  • R.F. Galbraith. A note on graphical presentation of estimated odds ratios from several clinical trials. Statistics in Medicine (1988).

  • R.F. Galbraith. Some applications of radial plots. Journal of the American Statistical Association (1994).

  • D. Gorman-Smith et al. A review of treatment research for self-injurious and stereotyped responding. Journal of Mental Deficiency Research (1985).

  • L.V. Hedges. Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics (1981).

  • L.V. Hedges et al. A standardized mean difference effect size for single-case designs. Research Synthesis Methods (2012).

  • L.V. Hedges et al. A standardized mean difference effect size for multiple baseline designs across individuals. Research Synthesis Methods (2013).

  • L.V. Hedges et al. Power analysis for single case designs (2013).

  • L.V. Hedges et al. Power analysis for multiple baseline designs (2013).

  • L.V. Hedges et al. Estimating effect size under publication bias: Small sample properties and robustness of a random effects selection model. Journal of Educational and Behavioral Statistics (1996).

  • L.V. Hedges et al. Fixed- and random-effects models in meta-analysis. Psychological Methods (1998).

  • M. Kennedy et al. Intervention for executive functions after traumatic brain injury: A systematic review, meta-analysis and clinical recommendations. Neuropsychological Rehabilitation (2008).

  • G. Knapp et al. Improved tests for a random effects meta-regression with a single covariate. Statistics in Medicine (2003).

  • T.R. Kratochwill et al. Single-case designs technical documentation.


This research was supported in part by grants R305D100046 and R305D100033 from the Institute of Education Sciences, U.S. Department of Education, and by a grant from the University of California Office of the President to the University of California Educational Evaluation Consortium. The opinions expressed are those of the authors and do not represent views of the University of California, the Institute of Education Sciences, or the U.S. Department of Education.
