Mediation models are often applied in psychological research to discover the mechanism by which an independent variable affects a dependent variable. A third variable—an intervening variable or mediator—intervenes between the independent and dependent variable. Methods to ascertain whether a mediating variable transmits the effects of an independent variable to a dependent variable are widely used in many substantive areas. Some examples of mediational hypotheses are that the effect of exposure to information on behavior is transmitted by understanding the information, that attitudes affect behavior through intentions, and that psychotherapy leads to catharsis that promotes mental health (MacKinnon, 2008).

The single-mediator model is the focus of this article, as it is the simplest example of mediation. This model is depicted in a path diagram in Fig. 1 and is specified in terms of Eqs. 1, 2, and 3:

$$ Y = {\beta_{01}} + \tau X + \varepsilon, $$
(1)
$$ Y = {\beta_{02}} + \tau' X + \beta M + {\varepsilon_Y}, $$
(2)
$$ M = {\beta_{03}} + \alpha X + {\varepsilon_M}. $$
(3)
Fig. 1

Path diagram for the single-mediator model

In these equations, Y is the outcome variable, X is the independent variable, and M is the mediator; τ represents the total effect of X on Y, τ′ represents the relation between X and Y adjusted for M (the direct effect), β represents the relation between M and Y adjusted for X, α represents the relation between X and M, \( \beta_{0i} \) is the intercept for Eq. i, and ε, \( \varepsilon_Y \), and \( \varepsilon_M \) are residuals. The mediated effect is the product of α from Eq. 3 and β from Eq. 2. The corresponding sample estimates of α, β, τ, and τ′ are a, b, c, and c′.

Although several outstanding methods for statistical significance testing and confidence interval estimation for mediation have been identified, even the best tests do not have ideal Type I error rates, statistical power, and confidence limit coverage. MacKinnon, Lockwood, Hoffman, West, and Sheets (2002) described 15 different tests of mediation that had been proposed at different times. They compared these methods in terms of their Type I error rates and their power to reject false null hypotheses. The tests varied in their ability to control Type I error at the nominal rate, and even those that did control Type I error often had very low statistical power. As MacKinnon et al. (2002) detailed, a major difficulty in testing for mediation is that the sampling distribution of the mediated effect, ab, is typically not normal, as many tests of mediation assume. The same is true for the c − c′ estimator of the mediated effect, which is equivalent to the ab estimator when the regressions in Eqs. 1, 2, and 3 are estimated using ordinary least squares (OLS; MacKinnon & Dwyer, 1993).

Under conditions in which the assumptions of classical statistical methods are violated, such as a nonnormal distribution, resampling methods often outperform classical methods because the resampling methods require fewer assumptions (Manly, 1997). Bootstrapping is one such resampling method that has been found to perform well in terms of Type I error control, power, and coverage, and it has therefore been widely recommended as an ideal approach to testing mediation (MacKinnon, Lockwood, & Williams, 2004; Preacher & Hayes, 2004; Shrout & Bolger, 2002), for more complex mediational models as well as for the single-mediator model (Cheung, 2007; Preacher & Hayes, 2008; Taylor, MacKinnon, & Tein, 2008; Williams & MacKinnon, 2008). Briefly, bootstrapping involves drawing many samples from the original sample with replacement (meaning that the same case may be included more than once in a bootstrap sample), estimating the mediated effect in each bootstrap sample, and using the distribution of these estimates to find a confidence interval for the true mediated effect. For the simplest bootstrap method, the percentile bootstrap, the (ω/2) × 100 and (1 − ω/2) × 100 percentiles are chosen as the limits of the confidence interval, where ω is the nominal Type I error rate. Other methods, such as the bias-corrected bootstrap, make adjustments to which percentiles from the bootstrap distribution are chosen as the confidence limits (Efron & Tibshirani, 1993; MacKinnon et al., 2004). Another resampling method that has not as yet been applied to testing for mediation is the permutation test (also called the randomization test). Like bootstrap methods, permutation tests make fewer assumptions than do classical statistical methods. MacKinnon (2008) suggested that the permutation test may be used to test mediation and described how such a test might be conducted. 
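To make the percentile bootstrap described above concrete, the following is a minimal sketch in Python with NumPy. It is illustrative only, not the implementation used in the cited work; the function name and arguments are our own.

```python
import numpy as np

def percentile_bootstrap_ab(x, m, y, n_boot=2000, alpha=0.05, rng=None):
    """Percentile-bootstrap confidence interval for the mediated effect ab.

    In each bootstrap sample, a is the slope from regressing M on X (Eq. 3)
    and b is the partial slope for M from regressing Y on X and M (Eq. 2).
    The (alpha/2) and (1 - alpha/2) percentiles of the a*b estimates are
    returned as the confidence limits.
    """
    rng = np.random.default_rng(rng)
    n = len(x)
    ab_boot = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)          # resample cases with replacement
        xb, mb, yb = x[idx], m[idx], y[idx]
        a = np.polyfit(xb, mb, 1)[0]         # slope of M on X
        design = np.column_stack([np.ones(n), xb, mb])
        coefs, *_ = np.linalg.lstsq(design, yb, rcond=None)
        ab_boot[i] = a * coefs[2]            # a * b for this bootstrap sample
    lo, hi = np.percentile(ab_boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

The null hypothesis of no mediation is rejected when the resulting interval excludes zero.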
The purpose of this article is to describe and evaluate four permutation-based tests for mediation—the one proposed by MacKinnon (2008) and three others—and to compare them to the best-performing existing mediation tests. Two of the proposed methods also allow for the forming of confidence intervals. To the best of our knowledge, permutation-based confidence intervals have rarely been presented and have not been described for the mediated effect. To introduce permutation tests, we describe their use in comparing two means and in regression; we then describe the proposed applications of permutation tests to testing for mediation in the single-mediator model.

Permutation tests

The permutation test was proposed by Fisher (1935), who used it to demonstrate the validity of the t test. Unlike a classical statistical test, for which a test statistic calculated from the data is compared to a known sampling distribution such as a t or an F distribution, a permutation test compares the test statistic from the data to an empirical sampling distribution formed by permuting the observed scores. Like the sampling distribution used in a classical statistical test, this permutation-based distribution holds if the null hypothesis is true; if the calculated test statistic is extreme in this distribution, the null hypothesis is rejected.

Comparing two group means

The case of testing the difference between the means of two independent groups, which is typically done using an independent-samples t test, provides a straightforward example of the application of the permutation test. The test works by first finding the difference between the observed means. The data are then permuted, meaning that the cases are reallocated to the two groups in all possible combinations (with the constraint that the group sizes are held constant at their observed values). Permutation is done repeatedly to create all possible samples that could have resulted from assigning the cases to the two groups. Each sample based on reallocation provides an estimated difference between the group means that might have arisen if the null hypothesis were true. The rationale is that if the null hypothesis is true, cases in both groups come from the same population with the same group mean, so the cases could have just as easily been found in either of the two groups. The differences between group means found for each permuted sample provide estimates of differences that might arise by chance alone. In other words, they form a sampling distribution for the difference given that the null hypothesis is true. The observed difference between group means, based on the original, unpermuted data, is compared to this distribution in the same way as in any other null hypothesis test. If the observed value is extreme in the distribution, typically in the lowest or highest (ω/2) × 100% of the distribution for a two-tailed test, the null hypothesis of no difference is rejected. This permutation test is considered an exact test of the difference between two groups.
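The exact permutation test just described is small enough to implement directly. The sketch below (standard library only; the function name is our own) enumerates every reallocation of the pooled scores to groups of the observed sizes and returns a two-tailed p value:

```python
from itertools import combinations

def exact_permutation_test(group1, group2):
    """Exact two-tailed permutation test for a difference between two means.

    Enumerates all reallocations of the pooled scores to two groups of the
    observed sizes; the p value is the proportion of reallocations whose
    mean difference is at least as extreme as the observed one.
    """
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    observed = sum(group1) / n1 - sum(group2) / len(group2)
    count = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        g1 = [pooled[i] for i in chosen]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = sum(g1) / len(g1) - sum(g2) / len(g2)
        total += 1
        if abs(diff) >= abs(observed) - 1e-12:
            count += 1
    return count / total
```

For fully separated groups such as [1, 2, 3] and [10, 11, 12], only the original split and its mirror image are as extreme as the observed difference, so the exact p value is 2/20 = .10.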

One difficulty of the permutation test is that the number of possible ways of reassigning scores to the two groups is extremely large, even for small samples. For two groups of size n 1 and n 2, the number of ways of reassigning the scores to the groups (i.e., the number of possible permuted samples, or N p) is equal to the number of combinations of n 1 + n 2, taken n 1 at a time or n 2 at a time:

$$ N_{\text{p}} = \binom{n_1 + n_2}{n_1} = \binom{n_1 + n_2}{n_2} = \frac{(n_1 + n_2)!}{n_1!\, n_2!} $$
(4)

For example, for a two-group design with 10 scores in each group, \( N_{\text{p}} = 20!/(10!\,10!) = 184{,}756 \); calculating a test statistic for every one of these permuted samples can be quite time consuming. Therefore, rather than creating every possible permuted sample, most applications of the permutation test examine only a subset (of size \( n_{\text{p}} \)) of the \( N_{\text{p}} \) possible permuted samples (Edgington, 1969, 1995). Tests for which \( n_{\text{p}} < N_{\text{p}} \) are called approximate permutation tests; tests that use the entire set of permuted samples are called exact permutation tests. For all further applications of the permutation test, we will discuss only the approximate version.
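The approximate version can be sketched by drawing random reallocations instead of enumerating all of them. In this illustrative sketch (standard library only; names ours), the observed arrangement is counted as one member of the reference distribution, a common convention for Monte Carlo permutation tests:

```python
import math
import random

def approx_permutation_test(group1, group2, n_perm=1999, seed=0):
    """Approximate (Monte Carlo) permutation test for a mean difference.

    Draws n_perm random reallocations rather than enumerating all
    math.comb(n1 + n2, n1) possibilities; the observed sample is counted
    as one member of the reference distribution.
    """
    rng = random.Random(seed)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    observed = sum(group1) / n1 - sum(group2) / len(group2)
    extreme = 1                      # count the observed arrangement itself
    for _ in range(n_perm):
        rng.shuffle(pooled)          # random reallocation to the two groups
        diff = (sum(pooled[:n1]) / n1
                - sum(pooled[n1:]) / (len(pooled) - n1))
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    return extreme / (n_perm + 1)
```

With 1,999 random permutations plus the observed data, the smallest attainable two-tailed p value is 1/2,000 = .0005.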

Testing a regression coefficient

Permutation tests have been applied in several ways to tests of regression coefficients (Anderson & Legendre, 1999; Manly, 1997; ter Braak, 1992). The approach described here is known as the permutation of raw data (Manly, 1997). This application to regression analysis is similar to the two independent-group tests described above. Rather than a single variable defining group membership, there are potentially multiple predictors. In the case of a single predictor, W, predicting an outcome variable, Z, this is the regression equation:

$$ Z = {g_0} + {g_1}W + {e_Z}. $$
(5)

To perform a permutation test of the null hypothesis that the true coefficient for W, γ 1, equals zero, the model is first estimated for the original data to find g 1. To form the permutation-based sampling distribution, scores on the outcome Z are then permuted and reassigned to scores on the predictor W a large number of times. To distinguish them from the unpermuted Z scores, the permuted scores are labeled Z +. The regression model is reestimated, predicting Z + from W in each permuted sample; the resulting estimate of the coefficient for W in each sample is labeled \( g_1^+ \). The g 1 coefficient from the original, unpermuted data is compared to the sampling distribution of \( g_1^+ \) obtained from the permuted samples to test the null hypothesis that γ 1 = 0.

In multiple regression, the procedure is largely the same. The model is first estimated for the unpermuted data:

$$ Z = {g_0} + {g_1}{W_1} + {g_2}{W_2} + {e_Z}. $$
(6)

Scores on the dependent variable Z are then permuted and reassigned a large number of times to unpermuted scores on the predictors W 1 and W 2. Because the null hypothesis being tested for each predictor is that its partial association with the outcome variable is zero, it is important to maintain the associations among the predictor variables (Anderson & Legendre, 1999). Therefore, scores on the predictors are not permuted and reassigned; only the outcome variable is. The model is reestimated for each permuted sample, allowing a null-hypothesis-true sampling distribution to be formed for each coefficient. Observed coefficient values based on the original data are then compared to their corresponding permutation-based sampling distributions to test the null hypothesis that each has a true value of zero.
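The permutation-of-raw-data test for regression slopes might be sketched as follows. Only the outcome is permuted, so the associations among the predictors are preserved; the function and its arguments are illustrative assumptions, not code from the sources cited.

```python
import numpy as np

def permutation_test_coef(Z, W, n_perm=1999, seed=0):
    """Permutation-of-raw-data test for each slope in a multiple regression.

    W is an (n, k) matrix of predictors. Only the outcome Z is permuted and
    reassigned, preserving the associations among the predictors. Returns
    two-tailed p values for the k slopes, counting the observed data as one
    member of each reference distribution.
    """
    rng = np.random.default_rng(seed)
    n, k = W.shape
    X = np.column_stack([np.ones(n), W])     # add intercept column

    def slopes(z):
        return np.linalg.lstsq(X, z, rcond=None)[0][1:]

    observed = slopes(Z)
    extreme = np.ones(k)                     # count the observed data itself
    for _ in range(n_perm):
        g_plus = slopes(rng.permutation(Z))  # slopes for permuted outcome
        extreme += (np.abs(g_plus) >= np.abs(observed)).astype(float)
    return extreme / (n_perm + 1)
```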

A confidence interval for a regression coefficient

In addition to null hypothesis testing of regression coefficients, the permutation method can also be used to find a confidence interval for a regression coefficient (Manly, 1997). The permutation methods described above estimate a sampling distribution given that the null hypothesis is true; the observed statistic is compared to this distribution to test the null hypothesis. Creating a confidence interval, on the other hand, requires estimating the actual sampling distribution of the statistic. Because the sampling distribution to be estimated varies around the observed value of the statistic rather than around zero, permutation confidence interval estimation requires a different approach than permutation null hypothesis tests. Instead of permuting scores on the outcome variable, finding a confidence interval for a regression coefficient requires permuting residuals, an approach proposed by ter Braak (1992) for null hypothesis testing and extended to estimating confidence intervals by Manly (1997). For a one-predictor regression, the model is first estimated for the original, unpermuted data, as in Eq. 5, and the predicted values \( \widehat{Z} \) and residuals e Z are calculated. The residuals are then permuted and reassigned to unpermuted data (which includes scores on the predictor and outcome and predicted scores), after which the residuals are labeled \( e_Z^* \). This process is repeated many times to create a large number of permuted samples. Following the form of Eq. 5, new permutation-based values of the outcome variable, Z*, are calculated in each permuted sample as the original predicted score plus the permuted residual, \( Z^* = \widehat{Z} + e_Z^* \). These permutation-based outcome variables are then regressed on the predictor in each permuted sample, yielding permutation-based estimates of the coefficient g 1, labeled \( g_1^* \):

$$ Z^* = g_0^* + g_1^* W + {e_{\left( {{Z^* }} \right)}}. $$
(7)

Note that the residuals in this regression are labeled e (Z*) to distinguish them from the original residuals e Z and the permuted residuals \( e_Z^* \).

The \( g_1^* \) values form an estimated sampling distribution for g 1. Confidence limits for g 1 are taken as the (ω/2) × 100 and (1 − ω/2) × 100 percentiles of the distribution. This confidence interval may also be used to perform a null hypothesis test: If zero is not included in the interval, the null hypothesis that γ 1 = 0 can be rejected.
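The permutation-of-residuals interval can be sketched as below for the one-predictor case. This is a hedged sketch in Python/NumPy under the steps described above, not an implementation from Manly (1997); the function name is ours.

```python
import numpy as np

def perm_residual_ci(Z, W, n_perm=1999, alpha=0.05, seed=0):
    """Permutation-of-residuals confidence interval for a single slope g1.

    Fits Z = g0 + g1*W, permutes the residuals, rebuilds the outcome as
    Z* = Zhat + e*_Z, refits, and takes percentiles of the g1* distribution
    as the confidence limits.
    """
    rng = np.random.default_rng(seed)
    g1, g0 = np.polyfit(W, Z, 1)             # sample slope and intercept
    fitted = g0 + g1 * W                     # Zhat
    resid = Z - fitted                       # e_Z
    g1_star = np.empty(n_perm)
    for i in range(n_perm):
        z_star = fitted + rng.permutation(resid)   # Z* = Zhat + e*_Z
        g1_star[i] = np.polyfit(W, z_star, 1)[0]
    return tuple(np.percentile(g1_star, [100 * alpha / 2,
                                         100 * (1 - alpha / 2)]))
```

Because the g 1* distribution is centered near the sample estimate rather than near zero, the resulting interval can be used both for estimation and, by checking whether it contains zero, for a null hypothesis test.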

An iterative search for a confidence interval for a regression coefficient

Another approach to finding confidence limits for a regression coefficient, proposed by Manly (1997), requires a separate iterative search for each of the two confidence limits. We describe the process only for the upper confidence limit; it generalizes straightforwardly to the lower confidence limit. This approach is largely similar to the noniterative approach, except that it uses the current estimate of the confidence limit in place of the sample estimate of the regression coefficient to calculate the predicted values and residuals. It begins by estimating the regression model for the original, unpermuted data and finding the usual, normal-theory upper confidence limit for g 1, \( g_{1({\text{ucl}})} = g_1 + t_{\omega /2,\; df = n - 2}\, s_{g_1} \), to use as a starting value, where \( s_{g_1} \) is the standard error of g 1. Predicted values and residuals are then calculated for the original data, but rather than finding predicted values in the usual way, using the coefficients g 0 and g 1, this approach uses \( g_{1({\text{ucl}})} \) in place of g 1 in the calculation. Predicted values are calculated as \( \widehat{Z}_{({\text{ucl}})} = g_0 + g_{1({\text{ucl}})}W \), and the residuals are calculated as \( e_{Z({\text{ucl}})} = Z - \widehat{Z}_{({\text{ucl}})} \). As in the noniterative approach, the residuals are then permuted and reassigned to the unpermuted data, after which they are labeled \( e_{Z({\text{ucl}})}^* \). This process is repeated many times to create a large number of permuted samples. The permuted residuals are combined, as in the noniterative approach, with the original predicted scores to calculate new outcome variable scores: \( Z_{({\text{ucl}})}^* = \widehat{Z} + e_{Z({\text{ucl}})}^* \). These permutation-based outcome variable scores are then regressed on the predictor in each permuted sample, as in Eq. 7:

$$ Z_{({\text{ucl}})}^* = g_{0({\text{ucl}})}^* + g_{1({\text{ucl}})}^* W + e_{(Z_{({\text{ucl}})}^*)}. $$
(8)

When the sampling distribution is formed from the \( g_{{1}({\text{ucl}})}^* \) values from the different permuted samples, rather than taking confidence limits from it directly, as in the noniterative approach, this approach checks whether the estimated confidence limit g 1(ucl) has the desired percentile rank of (1 − ω/2) × 100 in the permuted distribution. If it does, iteration ends, and g 1(ucl) is taken as the upper confidence limit. If it does not, g 1(ucl) is adjusted—downward if the percentile rank was too high, or upward if the percentile rank was too low—and another iteration is run. The process is repeated until a value of g 1(ucl) is found that yields the desired (1 − ω/2) × 100 percentile rank in the sampling distribution of values of \( g_{{1}({\text{ucl}})}^* \).

Permutation tests for mediation

We describe four applications of permutation tests to testing the single-mediator model. All are generalizations or extensions of the tests described above. One, the permutation test of ab, was proposed previously by MacKinnon (2008), but the other three are new.

The permutation test of ab

MacKinnon (2008, Sec. 12.6) proposed a permutation test for mediation that makes use of the permutation-of-raw-data approach described above for testing a regression coefficient (Manly, 1997). We refer to this method as the permutation test of ab. Applying this method requires, first, that the regression models in Eqs. 2 and 3 be estimated for the original, unpermuted data to find the values of a and b. Values of the outcome variable, Y, are then permuted a large number of times and reassigned to unpermuted scores on the predictor, X, and mediator, M, to create many permuted samples. The permuted Y values, labeled Y +, are then regressed on the unpermuted X and M values in each permuted sample (as in Eq. 2 above), and the coefficient for M in each permuted sample is labeled b +. Similarly, values of the mediator, M, are permuted a large number of times and reassigned to values of the predictor, X, to create many permuted samples. The permuted M values, labeled M +, are regressed on X in each permuted sample (as in Eq. 3), and the coefficient for X in each permuted sample is labeled a +. Finally, corresponding pairs of a + and b + values are multiplied to yield a + b +, and ab, the estimate of the mediated effect from the original data, is compared to the distribution of a + b + to perform a test of the null hypothesis of no mediation.
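The steps above can be sketched as follows. This is an illustrative Python/NumPy rendering of the procedure, not MacKinnon's (2008) own code; the function name and its arguments are assumptions.

```python
import numpy as np

def perm_test_ab(x, m, y, n_perm=1999, seed=0):
    """Permutation test of ab: two-tailed p value for the mediated effect.

    In each permuted sample, a+ is the slope from regressing permuted M on
    unpermuted X, and b+ is the partial slope for M from regressing permuted
    Y on unpermuted X and M. The observed ab is compared to the a+b+
    distribution, counting the observed data as one member of it.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    design = np.column_stack([np.ones(n), x, m])
    a = np.polyfit(x, m, 1)[0]                               # Eq. 3 slope
    b = np.linalg.lstsq(design, y, rcond=None)[0][2]         # Eq. 2 M slope
    ab = a * b
    extreme = 1
    for _ in range(n_perm):
        a_plus = np.polyfit(x, rng.permutation(m), 1)[0]
        b_plus = np.linalg.lstsq(design, rng.permutation(y),
                                 rcond=None)[0][2]
        if abs(a_plus * b_plus) >= abs(ab):
            extreme += 1
    return extreme / (n_perm + 1)
```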

The permutation test of joint significance

A second application of the permutation test to the single-mediator model is based on the joint significance test, as discussed by MacKinnon et al. (2002; see also James & Brett, 1984; Kenny, Kashy, & Bolger, 1998). The joint significance test for mediation is similar to the well-known approach proposed by Baron and Kenny (1986), except that it does not require that c, the sample estimate of τ in Eq. 1, be significant. To perform it, the regression models in Eqs. 2 and 3 are estimated; to reject the null hypothesis of no mediation, both a (the estimate of α in Eq. 3) and b (the estimate of β in Eq. 2) must be significant. The permutation test of joint significance has the same requirements to find significant mediation. It differs only in that it tests the coefficients a and b using permutation of raw data, as described above, rather than the usual t tests of regression coefficients. Practically, this means that the steps in performing this test are nearly identical to the steps for the permutation test of ab. The difference occurs in the final step, where the a + and b + values are used for two separate null hypothesis tests rather than being multiplied together in pairs to create a sampling distribution of a + b +. For the first test, the sample estimate a is compared against the distribution of a +. For the second, the sample estimate b is compared against the distribution of b +. If both null hypotheses are rejected, the permutation test of joint significance rejects the null hypothesis of no mediation.

A confidence interval for the mediated effect

The permutation-of-residuals method described above for finding a confidence interval for a regression coefficient may also be applied to finding a confidence interval for the mediated effect. To find the confidence interval, the method is applied separately to the regression models used to estimate the mediated effect, Eqs. 2 and 3. For Eq. 2, the model is first estimated, and predicted values \( \widehat{Y} \) and residuals e Y are calculated. The residuals are then permuted and reassigned a large number of times to unpermuted scores on X and M, after which the residuals are labeled \( e_Y^* \). New permutation-based values of Y, which are labeled Y*, are calculated in each permuted sample as the original predicted score plus the permuted residual, \( Y^* = \widehat{Y} + e_Y^* \). These permutation-based Y* values are then regressed on X and M in each permuted sample, yielding permutation-based estimates of b, labeled b*:

$$ Y^* = b_{02}^* + c^{\prime *} X + b^* M + e_{(Y^*)}. $$
(9)

Similarly, for Eq. 3, the model is estimated, and predicted values \( \widehat{M} \) and residuals e M are calculated. The residuals are permuted and reassigned a large number of times to unpermuted scores on X, after which the residuals are labeled \( e_M^* \). New permutation-based values of M, which are labeled M*, are calculated in each permuted sample as the original predicted score plus the permuted residual, \( M^* = {\widehat{M}} + e_M^* \). These permutation-based M* values are regressed on X in each permuted sample, yielding permutation-based estimates of a, labeled a*:

$$ M^* = b_{03}^* + {a^* }X + {e_{({M^* })}}. $$
(10)

Corresponding values of a* and b* are multiplied, to yield a*b*. The distribution of values of a*b* is an estimate of the sampling distribution of ab. Confidence limits for the mediated effect are the (ω/2) × 100 and (1 − ω/2) × 100 percentiles of the distribution. The confidence interval may also be used to test the null hypothesis of no mediated effect.
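A minimal sketch of this permutation confidence interval for ab, under the same hedges as the earlier sketches (Python/NumPy, names ours), is:

```python
import numpy as np

def perm_residual_ci_ab(x, m, y, n_perm=1999, alpha=0.05, seed=0):
    """Permutation-of-residuals confidence interval for the mediated effect.

    Residuals from Eq. 3 (M on X) and Eq. 2 (Y on X and M) are permuted
    separately; outcomes are rebuilt as fitted value plus permuted residual
    (M* = Mhat + e*_M, Y* = Yhat + e*_Y); both models are refit; and the
    percentiles of a*b* give the confidence limits for ab.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    a, int3 = np.polyfit(x, m, 1)                       # Eq. 3 fit
    m_hat = int3 + a * x
    e_m = m - m_hat
    design = np.column_stack([np.ones(n), x, m])
    coefs = np.linalg.lstsq(design, y, rcond=None)[0]   # Eq. 2 fit
    y_hat = design @ coefs
    e_y = y - y_hat
    ab_star = np.empty(n_perm)
    for i in range(n_perm):
        m_star = m_hat + rng.permutation(e_m)           # M* = Mhat + e*_M
        y_star = y_hat + rng.permutation(e_y)           # Y* = Yhat + e*_Y
        a_star = np.polyfit(x, m_star, 1)[0]
        b_star = np.linalg.lstsq(design, y_star, rcond=None)[0][2]
        ab_star[i] = a_star * b_star
    return tuple(np.percentile(ab_star, [100 * alpha / 2,
                                         100 * (1 - alpha / 2)]))
```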

An iterative search for a confidence interval for the mediated effect

The iterative-search approach to finding a confidence interval for a regression coefficient, described above, may also be extended to finding a confidence interval for the mediated effect. As in the case of the regression coefficient, a separate search is required for each of the two confidence limits. We describe the process only for the upper confidence limit; the process largely generalizes to the lower confidence limit, and we note points where it differs. The regression models in Eqs. 2 and 3 are first estimated for the original, unpermuted data, and the mediated effect ab is calculated. The first-order standard error (Sobel, 1982) is used to calculate the starting value for the upper confidence limit: \( a{b_{\left( {\text{ucl}} \right)}} = ab + 1.96\sqrt {{{a^2}s_b^2 + {b^2}s_a^2}} \), where s a and s b are the standard errors of a and b. Because ab is the product of two regression coefficients rather than a single regression coefficient, this estimate of the upper confidence limit cannot be used directly to calculate predicted scores and residuals, as in the iterative-search approach for the confidence interval for a regression coefficient. The estimated confidence limit must be analyzed into two components, one for a and one for b, which, when multiplied together, yield ab (ucl). We label these components a (ucl) and b (ucl), but note that they are not the same as the upper confidence limits for a and b. Because there are infinitely many pairs of values of a (ucl) and b (ucl) that can be multiplied to yield a particular value of ab (ucl), constraints must be applied to find a unique pair. We apply two constraints: First, a (ucl) and b (ucl) are required to be equidistant from a and b, respectively, in units of their respective standard errors. Second, for an upper confidence limit, a (ucl) and b (ucl) must be on the same side (positive or negative) of a and b, respectively.
(For a lower confidence limit, a (ucl) and b (ucl) must be on opposite sides of a and b, respectively.) Although these constraints are somewhat arbitrary, the first is based on the goal of making the confidence limit be equally a function of both components, and the second is used because, for mediated effects near zero, it will correctly choose a negative value for the lower confidence limit and a positive value for the upper. These constraints yield two possible pairs of values for the components; in our application of the method, we always selected the pair that were closer to a and b. Appendix A gives details of how these constraints are used to analyze ab (ucl) into a (ucl) and b (ucl), as well as how the estimated lower confidence limit, ab (lcl), is analyzed into its components, a (lcl) and b (lcl). Once the estimated confidence limit has been analyzed into its components, the remainder of the procedure is similar to the iterative search for a confidence interval for a single regression coefficient. Each confidence limit component is used in place of its corresponding coefficient to calculate predicted values and residuals. For a (ucl), the predicted values are calculated as \( {\widehat{M}_{({\text{ucl}})}} = {b_{03}} + {a_{({\text{ucl}})}}X \), and residuals are calculated as \( {e_{M({\text{ucl}})}} = M - {\widehat{M}_{({\text{ucl}})}} \). For b (ucl), the predicted values are calculated as \( {\widehat{Y}_{({\text{ucl}})}} = {b_{02}} + c\prime X + {b_{\left( {\text{ucl}} \right)}}M \), and residuals are calculated as \( {e_{Y({\text{ucl}})}} = Y - {\widehat{Y}_{({\text{ucl}})}} \). To create permuted samples, both sets of residuals are then permuted and reassigned a large number of times to their corresponding unpermuted predictors. Values of e M(ucl) are permuted and reassigned to unpermuted values of X, after which they are labeled \( e_{M\left( {\text{ucl}} \right)}^* \). 
Values of e Y(ucl) are permuted and reassigned to unpermuted values of X and M, after which they are labeled \( e_{Y\left( {\text{ucl}} \right)}^* \). In each permuted sample, new outcome variable scores are calculated as the sum of the original predicted value and the permuted residual. The new outcome for M is \( M_{\left( {\text{ucl}} \right)}^* = \widehat{M} + e_{M\left( {\text{ucl}} \right)}^* \), and the new outcome for Y is \( Y_{\left( {\text{ucl}} \right)}^* = \widehat{Y} + e_{Y\left( {\text{ucl}} \right)}^* \). Finally, these new permutation-based outcome variables are regressed on their corresponding predictors, as in Eqs. 9 and 10:

$$ Y_{({\text{ucl}})}^* = b_{02({\text{ucl}})}^* + c_{({\text{ucl}})}^{\prime *} X + b_{({\text{ucl}})}^* M + e_{(Y_{({\text{ucl}})}^*)}, $$
(11)
$$ M_{({\text{ucl}})}^* = b_{03({\text{ucl}})}^* + a_{({\text{ucl}})}^* X + e_{(M_{({\text{ucl}})}^*)}. $$
(12)

Pairs of values of \( a_{({\text{ucl}})}^* \) and \( b_{\left( {\text{ucl}} \right)}^* \) are multiplied, and the estimated upper confidence limit ab (ucl) is compared to the distribution of values of \( a_{({\text{ucl}})}^* b_{\left( {\text{ucl}} \right)}^* \) to check whether it has the desired percentile rank of (1 − ω/2) × 100. If it does, iteration ends, and ab (ucl) is taken as the upper confidence limit. If it does not, ab (ucl) is adjusted—downward if the percentile rank is too high, or upward if the percentile rank is too low—and another iteration is run.
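The decomposition of ab (ucl) under the two constraints reduces to a quadratic: writing the components as a + k·s a and b ± k·s b (same sign of k for an upper limit, opposite signs for a lower one) and requiring their product to equal the estimated limit gives two solutions for k, of which the one smaller in absolute value corresponds to the pair closer to a and b. The sketch below follows only the constraints stated in the text (see Appendix A for the full treatment); the function name and the error handling are our own.

```python
import math

def split_confidence_limit(a, b, s_a, s_b, target, upper=True):
    """Analyze an estimated confidence limit for ab into two components.

    Components are a + k*s_a and b + k*sb, where sb = s_b for an upper limit
    (same side as a and b) and sb = -s_b for a lower limit (opposite sides).
    Solving (a + k*s_a)(b + k*sb) = target gives a quadratic in k; the root
    smaller in absolute value yields the pair closer to a and b.
    """
    sb = s_b if upper else -s_b
    A = s_a * sb
    B = a * sb + b * s_a
    C = a * b - target
    disc = B * B - 4 * A * C
    if disc < 0:
        raise ValueError("no real solution under these constraints")
    roots = [(-B + math.sqrt(disc)) / (2 * A),
             (-B - math.sqrt(disc)) / (2 * A)]
    k = min(roots, key=abs)          # choose the pair closer to a and b
    return a + k * s_a, b + k * sb
```

For example, with a = b = .5, s a = s b = .1, and an estimated upper limit of .36, the chosen components are (.6, .6), whose product recovers .36.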

Method

Four permutation methods for testing mediation were evaluated: the permutation test of ab, the permutation test of joint significance, the permutation confidence interval for ab, and the iterative permutation confidence interval for ab. The purpose of the present study was to examine the performance of these methods in terms of their Type I error, power, and coverage. For purposes of comparison, four of the best-performing methods of testing for mediation recommended on the basis of previous research were also included: the joint significance test, the asymmetric-distribution-of-the-product test using the PRODCLIN program (MacKinnon, Fritz, Williams, & Lockwood, 2007), the percentile bootstrap, and the bias-corrected bootstrap (Efron & Tibshirani, 1993).

The eight methods of testing for mediation were evaluated in a Monte Carlo study. Data were generated and the methods of testing for mediation were performed using SAS 9.2 (SAS Inc., 2007), with the exception of the asymmetric-distribution-of-the-product test, which was done using the PRODCLIN program (MacKinnon et al., 2007). The predictor (X) was simulated to be normally distributed. The mediator (M) and the outcome (Y) were generated using Eqs. 2 and 3. Residuals were simulated to be normally distributed, and the intercepts were simulated to be zero. Four factors were varied in the study. The sizes of α and β in Eqs. 2 and 3 either were set to be zero or were varied to correspond to Cohen’s (1988) small, medium, and large effects (as in MacKinnon et al., 2002). As most methods of testing for mediation have been found to be relatively insensitive to the size of τ′, it was varied at only two levels: zero and large. Because resampling methods such as permutation tests and bootstrapping typically show the largest differences from classical methods in smaller samples, sample size was set to be 25, 50, 100, and 200 in different conditions. The entire design consisted of 128 conditions: 4 (α) × 4 (β) × 2 (τ′) × 4 (sample size). In each condition, 4,000 replications were run, and the eight methods of testing for mediation were all applied. All permutation methods were used in their approximate form, using 1,999 permuted samples for each (with the original, unpermuted data also included, for a total of 2,000 samples); for the bootstrap methods, 2,000 bootstrap samples were drawn.
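The data-generation step of the simulation can be sketched as follows. This is an illustrative Python/NumPy rendering rather than the SAS code actually used; the coefficient values for small, medium, and large effects (.14, .39, and .59) follow the parameterization of MacKinnon et al. (2002).

```python
import numpy as np

def generate_mediation_sample(n, alpha_coef, beta_coef, tau_prime, rng):
    """Generate one replication for the simulation design described above.

    X is standard normal; M and Y follow Eqs. 3 and 2, with standard-normal
    residuals and intercepts of zero.
    """
    x = rng.normal(size=n)
    m = alpha_coef * x + rng.normal(size=n)                 # Eq. 3
    y = tau_prime * x + beta_coef * m + rng.normal(size=n)  # Eq. 2
    return x, m, y
```

Each of the eight tests would then be applied to the (x, m, y) arrays from every replication, and rejections and interval coverage tallied by condition.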

The methods were compared using three criteria: Type I error, power, and coverage. The Type I error for each method was the proportion of replications in a condition for which the null hypothesis of no mediation was true (i.e., αβ = 0) yet the method rejected the null hypothesis. The power for each method was the proportion of replications in a condition for which the null hypothesis of no mediation was false (i.e., αβ ≠ 0) and the method did reject the null hypothesis. A nominal Type I error rate of ω = .05 was used for all of the hypothesis tests. Coverage was used to compare only the methods that allow for estimation of a confidence interval. This included five of the methods: the permutation confidence interval for ab, the iterative permutation confidence interval for ab, the asymmetric-distribution-of-the-product test, and the percentile and bias-corrected bootstrap methods. The coverage for each method was the proportion of replications within a condition for which the confidence interval estimated using the method included the true mediated effect αβ. A nominal coverage level of 95% was used for all confidence intervals.

Results

Across the three criteria for comparing the methods’ performance, the results were very similar for conditions in which α and β took on particular values, regardless of which coefficient took on which value. For example, the results for α = 0 and β = .14 (small) were similar to those for β = 0 and α = .14. Therefore, for simplicity, the results are presented averaged across such pairs of conditions. The patterns of results were also largely similar across the two levels of τ′, so only results for the τ′ = 0 conditions are presented. In a very small number of replications (less than 0.1%, and no more than 11 of the 4,000 in any condition), the asymmetric-distribution-of-the-product test failed to find one or both of the confidence limits. These replications are therefore excluded from the calculation of Type I error, power, and coverage for this method.

Type I error

Type I error rates are shown in Table 1. Most methods had Type I error rates well below the nominal level when both α and β were zero. The Type I error rates increased with the size of the nonzero coefficient and were generally near the nominal level in conditions in which the nonzero coefficient was large. This increase from near zero to about the nominal level occurred more quickly, with increasing size of the nonzero coefficient, in larger samples than in smaller ones. There were two exceptions to this pattern. First, the permutation test of ab had a Type I error rate near the nominal level when both α and β were zero, but its rate increased far beyond the nominal level (as high as .769) as the size of the nonzero coefficient increased. Second, the bias-corrected bootstrap also had some elevated Type I error rates in smaller samples; its rate peaked at .083, with rates of at least .070 in four other conditions. Other than these two methods, and one condition in which the asymmetric-distribution-of-the-product test had a Type I error rate of .061, no method had a Type I error rate as high as .060 in any condition.

Table 1 Type I error rates by method, sample size, and the values of α and β

The Type I error rates for each method in each condition in which the null hypothesis was true were tested against the nominal Type I error rate of .05. This was done by finding the standard error for the proportion of replications in which the null hypothesis was rejected (i.e., for the observed Type I error rate), forming a 95% confidence interval for the proportion, and checking whether .05 was in the confidence interval. As is shown in Table 2, the permutation test of ab had Type I error rates significantly above .05 in 82% of the null-true conditions, and the bias-corrected bootstrap had Type I error rates significantly above .05 in half of the null-true conditions. No other methods had difficulty with excess Type I error.
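The check described in the preceding paragraph can be sketched in a few lines of Python. This is our own minimal illustration (the function names and the Wald form of the interval are our assumptions), using the study's 4,000 replications per condition:

```python
import math

def proportion_ci(successes, n, z=1.96):
    """95% Wald confidence interval for a proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def differs_from_nominal(rejections, n_reps, nominal=0.05):
    """True if the nominal rate falls outside the 95% CI
    for the observed rejection rate."""
    lo, hi = proportion_ci(rejections, n_reps)
    return not (lo <= nominal <= hi)

# e.g., 240 rejections in 4,000 replications (observed rate .060)
# differs_from_nominal(240, 4000) -> True
```

Because the standard error of a proportion shrinks with the number of replications, 4,000 replications per condition make even modest departures from .05 detectable.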

Table 2 Percentages of conditions in which the methods performed poorly on each criterion

Power

Power levels are shown in Table 3. The permutation test of ab is excluded from the table because of its dramatically inflated Type I error rates. Differences between methods in power were most pronounced in conditions with midrange coefficient and effect sizes: when coefficients and effects were small, all methods had low power, and when they were large, all methods had high power. Across conditions, the bias-corrected bootstrap was consistently the most powerful method; in a few conditions, the difference between its power and that of the second most powerful method exceeded .050. The asymmetric-distribution-of-the-product test was usually the second most powerful method. It was followed by a group of methods that performed very similarly: in descending order of power, the permutation confidence interval for ab, the percentile bootstrap, the joint significance test, and the permutation joint significance test. The iterative permutation confidence interval for ab nearly always had the least power of any of the tests.

Table 3 Power levels by method, sample size, and the values of α and β

Unlike Type I error, there is no a priori power level that methods are expected to achieve. Power was therefore assessed by comparing the methods against each other. In each condition, the method having the maximum power was found, and all other methods’ power levels were tested against it using a z test of the difference between proportions. This comparison was done twice (see the second and third rows of Table 2). In the first comparison, only the permutation test of ab was excluded because of its excess Type I error. In the second, both the permutation test of ab and the bias-corrected bootstrap were excluded: although the bias-corrected bootstrap's excess Type I error was far less severe than that of the permutation test of ab, it still had Type I error rates significantly greater than the nominal level in half of the null-true conditions. In the first analysis, the bias-corrected bootstrap never had significantly less power than the most powerful method (indeed, except in conditions in which more than one method had power of 1, it was itself the most powerful method in every condition). Among the remaining methods, the asymmetric-distribution-of-the-product test was most likely to have power not significantly lower than that of the most powerful method. In the second analysis, with the bias-corrected bootstrap excluded, the asymmetric-distribution-of-the-product test never had power significantly lower than the most powerful method. It was followed by the permutation confidence interval for ab, which had significantly lower power in 8% of conditions.
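The z test of the difference between two proportions used in these comparisons can be sketched as follows; this is a standard pooled-estimate form, and the helper name is ours:

```python
import math

def z_test_two_proportions(x1, n1, x2, n2):
    """Two-sample z test for the difference between proportions,
    using the pooled proportion to estimate the standard error."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Compare a method's power (e.g., .74) to the maximum observed
# power (e.g., .80), each estimated from 4,000 replications;
# |z| > 1.96 indicates significantly lower power at the .05 level.
z = z_test_two_proportions(2960, 4000, 3200, 4000)
```

With 4,000 replications per condition, a power difference of about .03 near these power levels is already statistically detectable by this test.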

Coverage

Coverage is applicable only to the methods used to form confidence intervals: the asymmetric-distribution-of-the-product test, the percentile and bias-corrected bootstraps, and the noniterative and iterative permutation confidence intervals for ab. For conditions in which the null hypothesis is true, coverage is simply one minus the Type I error rate when a 100 × (1 − ω)% confidence interval is used, as it was in the present study: a Type I error indicates that a confidence interval did not include zero, and because zero is the true value (αβ = 0), this also indicates a failure of coverage. Coverage results for these conditions are therefore inferable from the values in Table 1, and they mirror the Type I error results. In null-hypothesis-true conditions, all of the confidence interval methods had coverage that was too high (greater than .95) for the smallest nonzero coefficient and sample sizes, but coverage fell to near the nominal level for larger nonzero coefficients and sample sizes. The bias-corrected bootstrap was the only method whose coverage fell as low as .917, and its coverage was below .930 in several other conditions. Among the other methods, the only instance of coverage below .940 was the one condition in which the asymmetric-distribution-of-the-product test had coverage of .939.

The coverage results for 95% confidence intervals in null-hypothesis-false conditions are shown in Table 4. Across methods, undercoverage was generally worst for smaller coefficient sizes and improved as the coefficient sizes increased. There was no comparably clear pattern for sample size: some undercoverage occurred in the smallest samples, but some appeared only in larger-sample conditions. For example, the asymmetric-distribution-of-the-product test had good coverage in the α small, β small condition with n = 25 and 50, but poor coverage (.897 and .916) at larger ns. The bias-corrected bootstrap had coverage as low as .904, with a few other conditions below .930, but its coverage generally improved with increasing sample size. The other three methods (the percentile bootstrap and the two permutation-confidence-interval-for-ab methods) had little difficulty with coverage in any condition. The iterative permutation confidence interval for ab performed particularly well, with a minimum coverage of .944.

Table 4 Coverage of 95% confidence intervals in null-hypothesis-false conditions by method, sample size, and the values of α and β

As with Type I error, the coverage levels for each method in each condition were tested against the nominal coverage level of .95. This was done by finding the standard error for the proportion of replications in which the confidence interval included αβ (i.e., for the observed coverage level), forming a 95% confidence interval for this proportion, and checking whether .95 was within the confidence interval. As is shown in Table 2, the two permutation confidence interval methods performed best on this criterion, with the iterative permutation confidence interval performing particularly well. The bias-corrected bootstrap performed most poorly, with coverage significantly below .95 in over half of the conditions.

Discussion

This article has introduced four methods of testing for mediation using a permutation approach and, in a Monte Carlo study, compared their performance to that of the best-performing existing approaches to testing for mediation. The permutation test of ab performed poorly, with Type I error rates far beyond the nominal level in conditions in which one of α and β was nonzero. The permutation joint significance test performed similarly to, but no better than, the joint significance test; particularly in the smallest samples and when τ′ was large, it had less power. The permutation confidence interval for ab lagged behind the two best-performing methods (the bias-corrected bootstrap and the asymmetric-distribution-of-the-product test) in power, but it had better Type I error control than the bias-corrected bootstrap and better coverage than both. The iterative permutation confidence interval for ab had the least power of any method tested, but also the best coverage.

As in previous research (MacKinnon et al., 2004), the results of this study suggest that mediation is better tested by directly estimating the sampling distribution of the statistic being tested than by estimating the sampling distribution that would hold if the null hypothesis were true and comparing the observed statistic to that distribution, as is done in most hypothesis testing. Other than the causal-step methods, such as the joint significance test, methods of testing for mediation estimate the sampling distribution of the mediated effect ab. The permutation test of ab estimates the sampling distribution of ab that holds if α = β = 0 and tests ab against that distribution. The method controls Type I error when α = β = 0, but when the null hypothesis is true and α or β is nonzero, it rejects the null hypothesis at far beyond the nominal rate. In this respect, it performs similarly to Freedman and Schatzkin’s (1992) approach, as tested by MacKinnon et al. (2002). Some other methods tested by MacKinnon et al. (2002), such as a test using the first-order standard error (Sobel, 1982), control Type I error but have far less power than the best-performing methods: they estimate a null-true sampling distribution that holds when αβ = 0 but that misrepresents the shape of the sampling distribution when the null is false. The best-performing methods, in contrast, do not estimate the sampling distribution of ab under the null hypothesis; rather, they directly estimate the sampling distribution of ab around the observed sample value. The asymmetric-distribution-of-the-product test estimates the shape of the sampling distribution by taking the product of assumed normal sampling distributions for a and b. The bootstrap methods resample the data to estimate the shape of the sampling distribution, and the permutation confidence interval methods permute the data to the same end.
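The percentile bootstrap illustrates this direct-estimation strategy concretely. The following is a minimal sketch of our own (it is not the macros provided with this article): OLS estimates of a and b are computed from closed-form normal equations for the single-mediator model, cases are resampled with replacement, and the confidence limits are taken as percentiles of the resulting ab* distribution.

```python
import random

def ab_estimate(x, m, y):
    """OLS estimate of ab for the single-mediator model:
    a from M = b03 + a*X, and b (the coefficient of M) from
    Y = b02 + c'*X + b*M, via centered cross-products."""
    n = len(x)
    mx, mm, my = sum(x) / n, sum(m) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    smm = sum((mi - mm) ** 2 for mi in m)
    sxm = sum((xi - mx) * (mi - mm) for xi, mi in zip(x, m))
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    smy = sum((mi - mm) * (yi - my) for mi, yi in zip(m, y))
    a = sxm / sxx
    b = (sxx * smy - sxm * sxy) / (sxx * smm - sxm ** 2)
    return a * b

def percentile_bootstrap_ci(x, m, y, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for ab: resample cases with
    replacement and take percentiles of the ab* distribution."""
    rng = random.Random(seed)
    n = len(x)
    boots = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        boots.append(ab_estimate([x[i] for i in idx],
                                 [m[i] for i in idx],
                                 [y[i] for i in idx]))
    boots.sort()
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Because the interval is built from the empirical ab* distribution, it inherits whatever asymmetry that distribution has, rather than imposing the symmetric shape that a normal-theory test would assume.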

The superior performance of the methods that directly estimate the sampling distribution on the basis of the sample value ab demonstrates a case in which testing a null hypothesis with a confidence interval is superior to testing the same null hypothesis using a null-hypothesis-true sampling distribution. Confidence intervals have been widely recommended (Cohen, 1994; Wilkinson & the Task Force on Statistical Inference, 1999), and our results provide further motivation for this change in reporting research results. In most familiar cases, in which only the location, but not the shape, of the sampling distribution of the statistic of interest varies with the value of the parameter (e.g., a t test for the difference between group means), a confidence interval and a test against a null-hypothesis-true sampling distribution necessarily yield the same decision regarding the status of the null hypothesis. But in situations such as mediation, in which both the location and the shape of the sampling distribution vary with the value of the parameter, the conventional approach of estimating a null hypothesis sampling distribution and shifting its mean to estimate the confidence interval is not optimal. A confidence interval estimated using the shape of the sampling distribution estimated from the data is not only a superior confidence interval; it also yields a superior null hypothesis test.

Recommendations

The findings of the present study echo previous research in suggesting that the distribution-of-the-product test and bootstrap tests are the best performers for testing mediation. The bias-corrected bootstrap, in particular, had the greatest power of any method tested, although it also had difficulty with excess Type I error in some conditions, again replicating previous research (Cheung, 2007; Fritz, Taylor, & MacKinnon, 2011). Among the proposed permutation methods for testing mediation, the noniterative and iterative permutation confidence intervals for ab show the most promise. Although in most cases researchers are likely to be more interested in a test of the null hypothesis of no mediation, in situations where estimating a confidence interval for the mediated effect is of primary interest, these permutation confidence interval methods are ideal. Setting aside the bias-corrected bootstrap because of its Type I error difficulties, the permutation confidence interval for ab had less difficulty with undercoverage than any of the other most powerful methods, and although the iterative permutation confidence interval for ab was noticeably less powerful than the most powerful methods, it had the best coverage of any method. Therefore, we recommend that researchers studying mediation continue to use the distribution-of-the-product test or percentile bootstrap when a test of mediation is of primary concern, but use the permutation confidence interval methods when estimating a confidence interval is the major goal. To facilitate the application of these methods, we provide SPSS and SAS macros in Appendices B and C that estimate the permutation confidence interval for ab and the iterative permutation confidence interval for ab.

Limitations and future directions

Our Monte Carlo study was simplified in several respects. For example, the predictor and the residuals of the mediator and outcome variables were all simulated to follow normal distributions, and the data were simulated to be free of measurement error. Future research should consider less optimal situations in which these simplifications are replaced by conditions more typical of observed data. For example, as Biesanz, Falk, and Savalei (2010) studied, data might be simulated that violate the ordinary least squares assumption of normally distributed residuals. Such situations could highlight the strengths of the permutation methods introduced here, as resampling methods often outperform classical methods when assumptions are violated, although bootstrap methods would likely perform similarly well, as Biesanz et al. found. Future research might also examine the performance of permutation tests of mediation with variables measured with error.