In any scientific discipline, it is common to find units that are nested in higher-level clusters. An example in educational research is the nesting of children in classrooms. Children from the same classroom are exposed to common stimuli that might make their behavior in general more alike than the behavior of children from different classrooms. Examples of clustered data structures in biology or medicine are animals clustered in phylogenetic families, or patients nested within hospitals. This nesting of observations within higher-level clusters implies possible dependency among observations. That is, each observation does not provide fully unique information, and not taking this interdependency into account may lead to an overestimation of the available information and hence to inflated Type I error rates. Multilevel models are methods that appropriately account for dependency among observations (Hox, 2002).

One application of multilevel models is meta-analysis (Hox, 2002; Raudenbush & Bryk, 2002). Meta-analysis refers to the set of statistical tools that enable the combination of evidence from different studies that address the same research question (Glass, 1976). Once effect sizes are extracted from primary studies, they are pooled by applying a fixed or random effects model. In a fixed effect model, it is assumed that there is only one underlying population effect size, and that the observed effect sizes deviate from this population effect due to sampling variation only. A random effects model assumes that each study has its own population effect; that is, effect sizes vary due to sampling variation and also due to systematic differences across studies. Under this model, not only the combined effect size is estimated, but also the variance of the population effects across studies. Often, the meta-analyst is also interested in finding (moderator) variables that explain the variation of effect sizes across studies.

Raudenbush and Bryk (1985) proposed the use of multilevel models for performing meta-analysis. Meta-analytic data indeed have a hierarchical structure: study participants are nested within studies. Because raw data are rarely available, effect sizes that summarize the raw data from the study participants are typically used. Raudenbush and Bryk (1985) differentiated between the within-study model and the between-study model. In multilevel notation, the within-study model refers to the variation of effect sizes due to sampling variation at level 1:

$$ d_k = \gamma_k + r_k \qquad (1) $$

where dk represents the effect size reported in study k, γk refers to the population effect size of study k, and rk is a normally distributed random residual with mean 0 and variance \( \sigma_{r_k}^2 \). This variance is the sampling variance, which for commonly used effect sizes can be calculated before the meta-analysis and is therefore assumed to be known. The between-studies model describes the variation of effect sizes across studies and constitutes the second level of the multilevel model:

$$ \gamma_k = \delta_0 + u_k \qquad (2) $$

where δ0 is the combined effect size, and uk is a random residual with expected value zero. Typically, the study residuals are assumed to follow a normal distribution with variance \( \sigma_u^2 \), which therefore represents the between-studies variance. The model can also be written as a single equation:

$$ d_k = \delta_0 + u_k + r_k \qquad (3) $$

This two-level model is equivalent to the traditional random effects model (Hedges & Olkin, 1985), meaning that all random effects meta-analyses are, by nature, multilevel models (Pastor & Lazowski, 2018). The random effects model can be extended by including predictor variables in an attempt to explain the heterogeneity between studies, that is, to find variables that moderate the effect. Note that if the between-studies variance equals zero, then the random effects model reduces to a fixed effect model.
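As a concrete illustration, the random effects model of Eq. 3 can be fitted in R with the metafor package, which (as reported in the Results) is the software most often used in the meta-analyses reviewed here. The sketch below is a minimal example; the data frame and column names (yi for the observed effect sizes, vi for their known sampling variances, study for a study identifier) are hypothetical.

```r
library(metafor)

# dat: one row per study, with (hypothetical) columns
#   yi = observed effect size d_k, vi = known sampling variance, study = study ID

# Traditional random effects model (Eq. 3), fitted directly ...
fit_re <- rma(yi, vi, data = dat, method = "REML")

# ... or, equivalently, specified as a two-level model with a random study effect
fit_ml <- rma.mv(yi, vi, random = ~ 1 | study, data = dat)

fit_re$tau2    # between-studies variance estimate
fit_ml$sigma2  # the same quantity in the multilevel parameterization
```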

Multilevel models have been shown to be as effective and accurate as other traditional random effects approaches, such as the methods of DerSimonian and Laird (1986) or Hedges and Olkin (1985), for estimating the parameters of interest in meta-analysis, with the additional advantage that multilevel models are more flexible (Van den Noortgate & Onghena, 2003). For instance, multiple predictors can easily be incorporated in the model to explain heterogeneity among effect sizes across studies. Furthermore, additional random effects can be added to the meta-analytic model to deal with various kinds of dependency among effect sizes within and between studies.

In primary studies, it is quite common to report multiple effect sizes because, for instance, the construct of interest is measured using different instruments, and/or several treatment groups are compared to a common control group, and/or different sub-samples are used. The effect sizes that stem from the same study are likely to be more similar, because they are typically obtained from the same sample, from the same study procedures, and/or in a similar context. In other words, these effect sizes are dependent. It is well known that ignoring the correlation between effect sizes may lead to flawed inferences due to underestimation of standard errors, which in turn leads to an increase of the likelihood of false positives (Becker, 2000; Hedges, 2009). To appropriately account for dependency among effect sizes within studies, traditional two-level models can be extended to three-level models (Cheung, 2014; Van den Noortgate, López-López, Marín-Martínez, & Sánchez-Meca, 2013, 2015), adding an intermediate level that explicitly accounts for the variation among effect sizes within studies:

$$ d_{jk} = \delta_{00} + r_{jk} + w_{jk} + u_k \qquad (4) $$

where djk refers to the effect size of outcome j in study k. Whereas in the traditional random effects model (Eq. 3) there is only one random effect whose variance has to be estimated (i.e., the between-studies variance), in this extended model there are two random effects, both of which are assumed to follow a normal distribution with zero mean. The first random effect, wjk, refers to the deviation of population effect j in study k from the mean population effect in study k, whereas the second random effect, uk, refers to the deviation of the study mean from the overall mean population effect. Therefore, \( \sigma_w^2 \) represents the variability between population effect sizes within the same study. Previous papers have referred to this variance as the between-outcomes or within-study variance. The variance \( \sigma_u^2 \) refers to the between-studies variance or, in other words, to the variability of study means around the grand population mean (level 3). An advantage of this model is that it assumes that each observed effect size estimates its own population value, so the types of outcomes reported can differ greatly across and within primary studies. Furthermore, unlike in other approaches (e.g., the multivariate approach of Kalaian & Raudenbush, 1996), with a three-level model it is not necessary to estimate the covariances between effect sizes in advance, which are rarely available. Therefore, the application of multilevel models in meta-analysis is especially advantageous when the outcomes reported are very different from study to study, and when studies do not report enough information to estimate the covariance among effect sizes.
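To illustrate, the three-level model of Eq. 4 can be specified in metafor by nesting a unique effect-size identifier within studies; as before, the data frame and column names are hypothetical.

```r
library(metafor)

# dat: one row per effect size, with (hypothetical) columns
#   yi = effect size d_jk, vi = known sampling variance,
#   study = study ID, es_id = unique effect size ID

# Three-level model (Eq. 4): sampling variation (level 1),
# between-outcomes variation within studies (level 2, sigma_w^2),
# and between-studies variation (level 3, sigma_u^2)
fit3 <- rma.mv(yi, vi, random = ~ 1 | study/es_id, data = dat)
fit3$sigma2  # between-studies variance, then between-outcomes (within-study) variance
```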

Dependency can also occur across studies: studies carried out by the same research group will probably lead to more similar (and therefore dependent) effect sizes, and effect sizes from studies done in the same country will correlate more strongly with each other. Therefore, three-level models can also be applied to account for dependencies across studies (Konstantopoulos, 2011). In fact, more levels can be added to account for additional sources of dependency. For instance, the two scenarios presented before could be combined: multiple effect sizes (level 2) might be nested within studies (level 3), and studies, in turn, might be nested within countries (level 4). In this example, a four-level model should therefore be specified instead of a three-level model. In this four-level model, there are three variances to be estimated: the between-outcomes variance (level 2), the between-studies variance (level 3), and the between-countries variance (level 4).
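Continuing the hypothetical example above, such a four-level model could be specified by adding a country identifier as an additional outer grouping factor:

```r
library(metafor)

# Four-level model: outcomes (level 2) within studies (level 3) within countries (level 4);
# 'country' is an additional (hypothetical) identifier in dat
fit4 <- rma.mv(yi, vi, random = ~ 1 | country/study/es_id, data = dat)
fit4$sigma2  # between-countries, between-studies, and between-outcomes variances
```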

Other multilevel, but not strictly hierarchical, models have also been proposed for meta-analysis, such as cross-classified random effects models (CCREMs; Fernández-Castilla et al., 2018), showing again the flexibility of the multilevel approach. CCREMs are appropriate when effect sizes are nested within two (or more) types of higher-level units (i.e., random factors) that are not nested within each other, but rather crossed. For instance, if studies give multiple effect sizes for multiple countries, studies are not nested within countries (because one study can give effect sizes for multiple countries), nor are countries nested within studies (because one country can be studied in multiple studies). In other words, effect sizes are nested in a combination (a cross-classification) of countries and studies. Although a model with crossed random effects is the most appropriate approach under some scenarios (see Fernández-Castilla et al., 2018, for more examples), it is difficult to find meta-analyses that apply these models.
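In metafor, such crossed random effects can be specified as a list of separate random-effects formulas rather than a single nested formula; as before, the data frame and column names are hypothetical.

```r
library(metafor)

# dat: one row per effect size, with (hypothetical) columns yi, vi, study, country, es_id.
# Effect sizes are nested within studies, but studies and countries are crossed:
# each study can contribute effect sizes for several countries, and each country
# can appear in several studies.
fit_ccrem <- rma.mv(yi, vi,
                    random = list(~ 1 | study/es_id, ~ 1 | country),
                    data = dat)
fit_ccrem$sigma2  # variances for the study, study/es_id, and country random effects
```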

Although multilevel models are very flexible, we suspect that applied researchers do not take full advantage of their possibilities. Therefore, two goals of this study are to first describe how multilevel models (with more than one random effect) are typically applied in meta-analysis, and then illustrate how, in some meta-analyses, more sophisticated models could have been applied that better account for the (non-)hierarchical data structure.

Besides multilevel modeling, other techniques exist for dealing with multiple effect sizes, such as multivariate methods (Kalaian & Raudenbush, 1996; Raudenbush, Becker, & Kalaian, 1988) and the Robust Variance Estimation method (RVE; Hedges, Tipton, & Johnson, 2010). For a more comprehensive overview of the suitability of these methods in different circumstances, we refer to López-López, Page, Lipsey, and Higgins (2018). Roughly, the multivariate approach can be applied when there is information about the correlation between the different variables included in a primary study, which is unfortunately rarely the case. Furthermore, if there are many different types of outcomes across studies (see, for instance, Geeraert, Van den Noortgate, Grietens, & Onghena, 2004, where there are up to 52 outcomes in one study), the implementation of this method is even more cumbersome, as many more correlations need to be available in order to calculate the sampling covariance between all pairs of effect sizes. If there are many different types of outcomes across studies, an alternative is to apply a three- (or more) level model or the RVE method, as in these methods it is not necessary to know the correlation between outcome variables. Simulation studies have shown that these methods perform similarly (Moeyaert et al., 2017). The main difference is that RVE is typically used with a standard random effects model (and therefore gives an estimate of the total variance only) but robust standard errors are estimated, whereas in multilevel models additional random effects are included to account for the dependency, resulting in separate variance estimates for each random effect.
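For comparison with the multilevel sketches above, a minimal RVE example using the robumeta package is shown below, assuming the same hypothetical columns (yi, vi, study) and a correlated-effects working model with an assumed within-study correlation of .80.

```r
library(robumeta)

# Robust variance estimation with a correlated-effects working model:
# one overall random effects structure plus cluster-robust standard errors,
# so no separate within-study variance component is estimated.
fit_rve <- robu(formula = yi ~ 1,
                data = dat,
                studynum = study,       # clustering variable (studies)
                var.eff.size = vi,      # known sampling variances
                modelweights = "CORR",  # correlated-effects weights
                rho = 0.8)              # assumed within-study correlation
print(fit_rve)
```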

Several simulation studies have explored and compared the performance of these methods (e.g., Fernández-Castilla et al., 2018; Hedges et al., 2010; Lee, 2014; Moeyaert et al., 2017; Park & Beretvas, 2018; Tipton, 2013, 2015; Van den Noortgate et al., 2013, 2015). The conditions generated in these simulation studies were to a certain extent based on the results of real meta-analyses (see a summary of the selected conditions in Table 1). However, a systematic study of what values can be expected is lacking. Therefore, the third goal of this study is to give a full description of the main characteristics of meta-analyses of multiple outcomes (that use multilevel modeling techniques to carry out the synthesis). This information will be given as a function of the research field, so that future simulation studies can generate and explore realistic scenarios. The only information available about the characteristics of these meta-analyses has been given by Park and Beretvas (2018). However, the main drawback of that review is that the authors extracted the information from meta-analyses published in a specific journal in the field of education, meaning that the search was not systematic and only applies to that field. There is another systematic review that describes characteristics of meta-analyses of clinical psychology treatments (Rubio-Aparicio, Marín-Martínez, Sánchez-Meca, & López-López, 2018), but the authors only selected the most relevant effect size per primary study, meaning that this review does not include information about the distribution of the number of outcomes per study.

Table 1 Characteristics of published simulation studies that explore the performance of several methods for dealing with multiple effect sizes

Method

Search procedure

Meta-analyses applying a multilevel model with more than one random effect (i.e., three-, four-, or five-level models and CCREMs) were searched for in July 2018. First, we systematically searched for meta-analyses in six electronic databases: Web of Science, Science Direct, Medline PubMed, Psychology Database, Scopus, and ERIC. The search strings used were “three-level meta-analysis” OR “multilevel meta-analysis” OR “multilevel meta-analytic review”. Second, we looked at meta-analyses citing studies that introduced the use of multilevel techniques for meta-analysis, namely the studies of Hox and de Leeuw (2003) and Raudenbush and Bryk (1985). Third, we searched for meta-analyses that referenced one of the studies that specifically explain and illustrate the use of three-level models in meta-analysis for dealing with dependent effect sizes: Cheung (2014), Konstantopoulos (2011), and Van den Noortgate et al. (2013, 2015). The second and third steps were carried out using Google Scholar. Afterward, the title, abstract, or full text was screened to check whether the meta-analyses met the inclusion criteria.

Inclusion/Exclusion criteria

No date restriction was imposed in our search. A meta-analysis was included in this review if: (a) results from group studies were combined (meta-analyses of single-case experimental studies were excluded); (b) empirical effect sizes were combined using a model with more than one type of random effect, nested and/or crossed (meta-analyses using a traditional random effects model were therefore excluded, because they include only one type of random effect); (c) the study was reported in a journal article, dissertation, or conference paper (books, posters, or any other format were excluded); and (d) the paper was written in English, Spanish, or Dutch.

Data extraction

An MS Excel file was created in which several characteristics of the meta-analyses were coded. First, we coded the name of the first author, the country where their research group or institute was located, and the research field based on the categorization given by the Web of Science. If the meta-analysis was categorized in several research domains, we selected the first category. This classification was sometimes too specific, making it difficult to group studies into broader categories. Therefore, we later used the classification of the Research Foundation of Flanders (FWO), which groups disciplines within five large categories: 1. Behavioral and social sciences, 2. Biological sciences, 3. Cultural sciences, 4. Medical sciences, and 5. Science and technology. If the meta-analysis was not found in the Web of Science or if the article was labeled as ‘multidisciplinary’, the first author classified the paper into one of these five domains according to the content of the article.

Afterward, we coded the effect size synthesized and the model fitted: either a three-level model, four-level model, five-level model, or two- or three-level CCREM. The number of units for each type of random effect was also registered (e.g., the number of countries). We were also interested in the sources of dependency among effect sizes; that is, why effect sizes were dependent within and between studies (i.e., multiple outcomes that measure a common construct, multiple treatment groups, multiple follow-ups, etc.).

Next, common characteristics of meta-analyses were coded. In this block, we first coded the number of independent meta-analyses performed in each study. If more than one independent meta-analysis was performed, we selected the one that included the most studies, so only one meta-analysis per paper was registered. Then, we coded the number of studies analyzed, the total sample size used, the value of the pooled effect size, and the variance of each random component, together with the I2 index and the intraclass correlation coefficient (ICC) of each level (Cheung, 2014). In order to give information about common sample size values and typical numbers of effect sizes reported in primary studies, we also coded, in another MS Excel document, the sample size and the number of outcomes reported in each primary study within each meta-analysis. Sometimes, within a primary study, it was not possible to know whether the same sample of participants was used for all reported outcomes (related samples) or whether different samples were used for different outcomes (independent samples). In those cases, we assumed that these samples were related and computed the median sample size of each primary study.

We also coded whether a moderator analysis was performed, how many moderators were tested, the strategy used to analyze them (e.g., moderators can be added to the regression model one by one, in blocks, or simultaneously), and whether interaction terms were tested. When important information was not reported (e.g., the model fitted) or when inconsistencies were found (e.g., the number of outcomes that the authors reported did not match the number reported in the supplementary material), the corresponding authors were contacted. In total, we contacted 20 authors, and 16 of them replied.

All studies were coded by the first author; the second author independently coded 20% of the studies. Afterward, interrater agreement was calculated by dividing the number of agreements by the sum of the number of agreements plus the number of disagreements. Disagreements were solved through discussion.

All analyses were done in R, and the datasets are available at https://osf.io/znc68/

Results

Characteristics of meta-analyses with multiple random factors

Table 2 shows how many meta-analyses were initially retrieved, how many of them were eliminated after a first screening (and the reasons why), and how many were kept after a second, more comprehensive screening of the full text. The main reasons for excluding an article were that it was not a meta-analysis, that it was a methodological paper, that it was not a journal article, dissertation, or conference paper, that it was already included in another article, or that the model fitted was a traditional random effects model. A total of 178 meta-analyses were finally selected for this review. The list of excluded studies is available upon request from the first author.

Table 2 Number of studies excluded in each phase and reason for exclusion

General characteristics

The first three-level meta-analysis was published in 2002 (i.e., Beretvas, Meyers, & Leite, 2002). As can be seen in Fig. 1, there was a substantial increase in the number of published multilevel meta-analyses with more than one random factor in 2016. There is a clear increasing tendency: in the first six months of 2018, as many meta-analyses with more than one random effect were published as in the whole of 2017.

Fig. 1 Number of published meta-analyses with multiple random effects by year. Note that the search was done in July 2018, so the 2018 bar only includes data from January to July

The interrater agreement resulting from the coding was 71%. The characteristic with most disagreements was the sample size of each primary study within each meta-analysis. As mentioned before, sometimes it was very difficult to know whether, within studies, the sample sizes reported for each outcome were independent or related. All disagreements were discussed until the two judges agreed.

Of the 178 meta-analyses retrieved, 121 belonged to the behavioral and social sciences field (e.g., psychology, economics, law), 33 to biological sciences (e.g., biology, ecology, nutrition), 20 to medical sciences (e.g., medicine, biomedicine), two to cultural sciences (e.g., languages, history), and two to science and technology (e.g., computer sciences). Due to the small number of studies belonging to these last two disciplines, they will not be discussed in the Results section.

Regarding the software used, most meta-analyses were performed in R. Specifically, 82 studies used the package metafor, 28 used metaSEM, two used the MCMCglmm package, one used the brms package, one used the nlme package, and one used the rstan package. After R, the most frequently used software was SAS (25), followed by HLM (7), Stata (4), S-Plus (4), MLwiN (2), WinBUGS (1), and Mplus (1). Nineteen meta-analyses did not mention the software used. The model most commonly fitted was a three-level model: 162 meta-analyses used a three-level model, eight meta-analyses fitted a four-level model, six studies fitted a CCREM, and two studies used five-level models.

Figure 2 shows how many times each type of effect size was used to synthesize study results. As shown, the Pearson correlation coefficient was the effect size most commonly combined, followed by Cohen’s d / Hedges’ g (for independent samples), Cohen’s d for related samples, Fisher’s z, and odds ratios. A variety of other effect sizes or summary statistics (e.g., percentages, standardized means, rates, or R2) were less commonly used. In order to simplify the following analyses, we put Hedges’ g and Cohen’s d (pre-post) in the same category as Cohen’s d. The eight Fisher’s z values were back-transformed to Pearson correlation coefficients and are discussed together with these correlation coefficients.

Fig. 2 Effect sizes and summary statistics synthesized in each meta-analysis

Number of studies

In the fields of behavioral and social sciences and biological sciences, the range of the number of primary studies included in the meta-analyses was quite wide (see Table 3). Furthermore, these were the fields in which, on average, the most primary studies were included in the meta-analyses, and the distribution of the number of studies was positively skewed with a heavy tail. In contrast, in the field of medical sciences, the number of primary studies analyzed varied less and was on average smaller. In this field, the median and the mean were very similar, indicating a more symmetric distribution.

Table 3 Characteristics of the 178 meta-analyses included in this systematic review

Sample size of primary studies

Regarding the distribution of the sample sizes of the primary studies within each meta-analysis (Table 3), there was one (outlying) primary study in behavioral and social sciences that included 1,957,491 observations. Also, there was a large discrepancy between the mean and the median: the median was much smaller than the mean in all disciplines, indicating that the distribution of the sample sizes of primary studies was strongly positively skewed; in other words, small sample sizes were more common than large sample sizes. In medical sciences and in behavioral and social sciences, the sample sizes were larger than in biological sciences.

Number of effect sizes

Regarding the distribution of outcomes within studies (Table 3), the median number of outcomes was quite small and ranged between 1 and 3 across research disciplines. In the fields of biological sciences and behavioral and social sciences, the maximum number of outcomes reported in a primary study reached very high values compared to the maximum number of outcomes observed in the medical sciences. Figure 3 shows the percentage of primary studies in each field that included 1, 2, 3, 4, or 5 or more outcomes per study.

Fig. 3 Percentage of primary studies that include 1, 2, 3, 4, or 5 or more effect sizes, by discipline

Types of dependencies modeled

Of the 178 papers, 115 meta-analyses (64.61%) reported that (only) one source of dependency between effect sizes existed. In 49 meta-analyses (27.53%), the authors reported two reasons why effect sizes were dependent, and in 13 (7.30%) and one (0.56%) meta-analyses, the authors reported three and four sources of dependency, respectively.

The type of dependency most commonly reported was the existence of multiple effect sizes related to a common construct (127 meta-analyses). In 32 meta-analyses, several effect sizes were reported within primary studies because several treatment groups were compared to a common control group. In 25 papers, several sub-samples were used within primary studies. Another source of dependency that was reported in 22 studies was the existence of multiple follow-ups, that is, two groups were repeatedly compared at different time points. In 13 meta-analyses, authors reported that multiple effect sizes were extracted because primary studies used several instruments to measure a common construct. Finally, the authors of 12 meta-analyses said that due to the existence of several experiments and conditions within primary studies, more than one effect size could be retrieved.

In meta-analyses where dependency occurred across studies, the most common type of dependency was that papers were nested within countries (eight meta-analyses). In six papers, studies were dependent because they belonged to the same author or were performed by the same research group, and in ten other meta-analyses, studies were grouped in different higher-level factors, such as phylogenetic families, years in which studies were performed, etc.

Pooled effect distribution

Table 4 shows descriptive information on the overall effect size values as a function of the type of effect size and scientific discipline. Looking at the pooled Cohen’s d, the mean and median were very similar in all disciplines, indicating a fairly symmetric distribution. Most available data came from behavioral and social sciences, where effect sizes were in general quite small compared to the cutoffs proposed by Cohen (1988). In the field of biological sciences, the values were slightly higher, but still small if we take Cohen’s values as a reference. In medical sciences, the range of pooled Cohen’s d values was smaller than in the other disciplines.

Table 4 Descriptives for each type of effect size as a function of discipline

Looking at Pearson correlations, the means and the medians were also similar. In the three disciplines, the median pooled correlations could be considered small to medium (using the rules of thumb of Cohen, 1988). For the studies using the odds ratio, little information was available. As shown in Table 4, the pooled odds ratio was larger in biological sciences; in medical sciences there was higher variability, whereas in biological sciences and in behavioral and social sciences the combined odds ratio was in general homogeneous across meta-analyses. Finally, little information was available regarding standardized mean changes, and only for behavioral and social sciences and medical sciences. The median pooled standardized mean change was in general very small and homogeneous.

Heterogeneity of effects

Table 5 shows the level 2 and level 3 variance estimates and the ICCs of the meta-analyses that fitted a three-level model. The ICCs were calculated following the indications of Cheung (2014) and using the median sampling variance. We did not include standardized mean changes in these results because we did not have enough data. A common result across disciplines and types of effect sizes was that the median variance was, in general, smaller than the mean, indicating that the distributions of the variances at all levels were positively skewed.
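To make these quantities concrete, the sketch below shows how the ICC and the multilevel I² indexes can be obtained from a fitted three-level model, following Cheung (2014) and using the median sampling variance as the ‘typical’ level 1 variance; the model object and column names are hypothetical and continue the earlier examples.

```r
library(metafor)

# fit3: a three-level model as sketched earlier, e.g.
# fit3 <- rma.mv(yi, vi, random = ~ 1 | study/es_id, data = dat)

sigma2_study   <- fit3$sigma2[1]   # level 3: between-studies variance
sigma2_outcome <- fit3$sigma2[2]   # level 2: between-outcomes (within-study) variance
v_typical      <- median(dat$vi)   # 'typical' level 1 (sampling) variance

total <- v_typical + sigma2_outcome + sigma2_study

# Multilevel I^2 (Cheung, 2014): proportion of total variation at levels 2 and 3
I2_level2 <- sigma2_outcome / total
I2_level3 <- sigma2_study   / total

# ICC: proportion of the variance in the true effects that is at the study level
icc_study <- sigma2_study / (sigma2_outcome + sigma2_study)
```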

Table 5 Descriptive information of variance components estimates for three-level models

Regarding Cohen’s d, variances were larger in biological sciences and in medical sciences than in behavioral and social sciences. However, the maximum variances were observed in behavioral and social sciences, indicating that the skewed distribution of the variances had a heavier tail in this field. For behavioral and social sciences and for medical sciences, we could calculate the median I2 index using the median sampling variance, the second level variance, and the third level variance. The resulting \( I_{(2)}^2 \) and \( I_{(3)}^2 \) indexes for behavioral and social sciences were 17.0% and 60.9%, respectively, representing the proportion of the total variation of the effect sizes due to second and third level heterogeneity. Therefore, the ratio of variances (median sampling variance : level 2 variance : level 3 variance) was approximately 2:2:6. In medical sciences, both I2 indexes equaled 45.8%, and the ratio of variances was therefore approximately 1:4:5.

Regarding Pearson correlations, in the three disciplines the between-studies variances were larger than the within-study variances. In behavioral and social sciences and in medical sciences, the range of values that the three variances took was quite narrow, while in biological sciences the range of values was wider for the level 3 variance. We could calculate the I2 indexes for the behavioral and social sciences field: \( I_{(2)}^2 \) was 32.3%, and \( I_{(3)}^2 \) was 50%. Therefore, the ratio of variances was 2:3:5.

Looking at odds ratios, the variance values in medical sciences were relatively larger than those in behavioral and social sciences. The \( I_{(2)}^2 \) in behavioral and social sciences was 25.8%, and the \( I_{(3)}^2 \) was 47.6%. Therefore, the ratio of variances was 2:3:5. Of the eight studies that fitted a four-level model, only three studies from the behavioral and social sciences field reported values for the variance component estimates. Two studies reported Cohen’s d, and one reported Pearson correlations. Information about the values of these variances is reported in Table 6.

Table 6 Descriptive information of variance components estimates for four-level models

Finally, of the six studies that fitted CCREMs, only two reported the variance component estimates. One belonged to the discipline of science and technology (reporting a Pearson correlation) and the other to biological sciences (reporting a Cohen’s d). Both fitted a three-level model with a cross-classification at the third level. In the study from science and technology, the second level variance equaled 0.12, and the variances of the crossed factors at the third level were 0.14 and 0.19. In the meta-analysis from biological sciences, the second level variance was 0.13, and the variances of the crossed random factors at the third level were 0.07 and 0.10.

Moderators and strategies

Table 3 shows the distribution of the number of moderators analyzed in each discipline. In behavioral and social sciences, the effect of eight moderators was tested on average, whereas in the other disciplines the number was slightly smaller, between five and six. The distribution of the number of moderators was especially positively skewed in behavioral and social sciences and in biological sciences, where the maximum number of moderators tested was around 30. In medical sciences, the maximum number of moderators was much smaller. Also, in 32 meta-analyses, interactions among moderators were tested in the multilevel model.

The most common strategy (used in 41.01% of the studies) was to perform separate analyses for each moderator. The second most popular strategy was to introduce all moderators simultaneously (15.16%). Another strategy was to introduce the moderators one by one and then fit a final meta-regression with all moderators that turned out to be significant (11.23%). The fourth most common strategy was to introduce separate blocks of moderators (7.86%). There was a relationship between these four strategies and the number of moderators tested in the meta-analysis (F = 3.42, p < .05). Specifically, the strategy of introducing moderators one by one in the model and then testing all significant moderators simultaneously was more often used when many moderators were tested (mean number of moderators = 11.80, standard deviation = 7.95) than the strategy of analyzing them all simultaneously (mean = 6.70, standard deviation = 5.15).

Other strategies that were used to a lesser extent were: introducing moderators one by one in the regression and then, in a second step, introducing all moderators simultaneously, including the ones that were not significant in the first step (2.80%); introducing all moderators simultaneously and then removing the ones that were not significant (2.80%); and introducing moderators one by one and then in separate blocks (1.12%). Only one meta-analysis used the MetaForest technique (Van Lissa, 2017), which applies the random forest technique to meta-analysis, and one meta-analysis (Hijbeek, van Ittersum, ten Berge, Gort, Spiegel, & Whitmore, 2017) used a technique called multi-model dredging.

Description of the use of multilevel models in meta-analysis

As noted above, the three-level model was by far the most commonly fitted model (162 meta-analyses), followed by four-level models (eight), CCREMs (six), and five-level models (two). Of the six meta-analyses that applied a CCREM, two applied a two-level CCREM. Two other studies applied a three-level CCREM in which random factors were crossed at the third level. Another meta-analysis fitted a four-level model with a cross-classification at the second level. Finally, in the study of Golivets and Wallin (2018), up to eight random components were included in the model, although it was not completely clear how many levels there were. At least four of these random effects were crossed at level 2.

Of the meta-analyses applying a three-level model, 123 (69.10%) selected ‘studies’ as the third-level factor; that is, these meta-analyses modeled the dependency of effect sizes that belong to the same study. Fifteen meta-analyses (8.43%) and six meta-analyses (3.37%) modeled the dependency among effect sizes within ‘samples’ and within ‘countries’, respectively. Other factors at the third level were: authors, datasets, experiments, laboratories, societies, species, etc.

Four-level models were typically applied to treat additional sources of dependency within studies, such as dependency among comparisons between several treatment groups (e.g., Krieger, 2010; Lehtonen, Soveri, Laine, Järvenpää, de Bruin & Antfolk, 2018; Schoenfeld, Ogborn, & Krieger, 2015) or dependency of effect sizes across sub-scores and tasks (Fradkin, Strauss, Pereg, & Huppert, 2018), and to treat dependency across studies. For instance, studies could be grouped within higher-level clusters such as culture (e.g., Kende, Phalet, Van den Noortgate, Kara, & Fischer, 2017), laboratory (e.g., Martineau, Ouellet, Kebreab, & Lapierre, 2016), or sub-region (e.g., De Noordhout et al., 2014). The meta-analysis of Kende et al. (2017), in which a four-level model was fitted, provides another example of how flexible multilevel models can be in their specification. Instead of estimating a total between-studies variance at level 3, the authors decided to model separately the between-studies variance (or within-culture variance) of studies carried out in the United States and the between-studies variance of studies carried out outside the United States, because they expected greater variation across studies done in the United States. Indeed, they found that the within-United States variance (0.018) and the within-other-countries variance (0.011) differed slightly.

Two meta-analyses from Scharfen, Blum, and Holling (2018) and Scharfen, Peters, and Holling (2018) accounted for five sources of dependency, leading to the application of a five-level model: effect sizes (level 1) were nested within multiple comparisons (level 2), because primary studies could report the comparison of a group across different time points. Comparisons were then nested within different outcomes (level 3), that were nested within sub-samples (level 4) that, finally, were nested within studies (level 5).

There are several examples of the application of CCREMs. For instance, O’Shea and Dickens (2016) fitted a four-level model with a crossed-classification at level 2: effect sizes were nested within outcomes and within instruments. The same outcome could be measured with several instruments, and the same instrument was used to measure different outcomes. Hence, these two random factors were cross-classified. Outcomes and instruments, in turn, were nested within studies (level 3), meaning that they assumed that the types of outcomes and instruments differed across studies. Finally, studies were nested within authors (level 4).

Another interesting example is the meta-analysis of Francis (2016), where a two-level CCREM was fitted with a cross-classification at the second level. In this case, there were several effect sizes within studies, that were obtained using different instruments. In other words, within studies effect sizes could be obtained from several instruments, and at the same time some instruments were used across studies. Because across studies some effect sizes were obtained from the same instrument, these two factors were not purely nested. In fact, effect sizes were nested within studies, and at the same time effect sizes were nested within instruments, meaning that ‘instruments’ and ‘studies’ constituted two crossed factors at level 2.

Finally, a special case is the meta-analysis of Golivets and Wallin (2018), who specified up to eight random components to account for several dependencies in their data. Most of these random effects were crossed. This meta-analysis studied competition between plant species, so each effect size referred to two types of plants: the target plant and the neighbor plant. The variability of the effect sizes due to the existence of different target species and of different neighbor species was modeled with two crossed factors. In addition, species that belonged to the same phylogenetic family were expected to be more alike, so two other random factors were added to model the phylogeny of the target and of the neighbor species. This is an example of how sophisticated a model can become as a function of the different sources of dependency.

Discussion

Considerations for specifying random effects

In the Introduction, we mentioned the advantages of using multilevel models to deal with dependent effect sizes in meta-analysis: the correlation between effect sizes from the same study is not needed (unlike in the multivariate approach), separate estimates of the different variance components (e.g., between-studies and within-study) are obtained (unlike in the RVE method), and moderator effects can be allowed to vary across studies (or across any other higher-level cluster).

However, when using multilevel models to deal with dependent effect sizes, it is very important to correctly specify the model according to the (non-)hierarchical structure of the data in order to obtain appropriate parameter estimates. In other words, all relevant random effects must be included in the model to avoid biased estimates (McNeish, Stapleton, & Silverman, 2017), and sometimes it is difficult to decide whether the (moderator) effect of a variable should be considered as fixed or as random. Snijders and Bosker (2012) give some guidelines for making such a decision. A variable can be considered as random if its categories can be seen as a random sample from a population of (interchangeable) units. Moreover, these authors mention that, as a rule of thumb, at least 20 categories are necessary to properly estimate the variance of a random effect. As an example, let us imagine that studies are nested within 20 different countries. If the meta-analyst is not interested in the separate effect for each of these countries but only wants to estimate the extent to which effect sizes vary due to country differences, the country effect can be specified as random. An optional step here would be to try to explain this variance (i.e., the variability of effect sizes due to differences between countries) by including variables with a fixed effect (e.g., a variable that indicates whether studies are from the United States or from Europe). Sometimes, however, a variable does not have enough categories to consider it as a random effect (e.g., when studies are nested within countries, but there are only five different countries), or researchers might be interested in the separate overall effect size for each country. In this scenario, the variable “country” can be introduced in the model as a moderator with a fixed effect (i.e., estimating one separate effect for each of the five countries, or taking one country as a reference and estimating the contrast between the effect in this country and the effects in the other countries).
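The two options can be contrasted in a short metafor sketch (with the same hypothetical column names as before): country as a random factor when there are enough roughly interchangeable countries, versus country as a fixed moderator when there are only a few countries of substantive interest.

```r
library(metafor)

# Option 1: country as a random factor (e.g., 20+ countries seen as a sample)
fit_random <- rma.mv(yi, vi,
                     random = ~ 1 | country/study/es_id,
                     data = dat)

# The between-countries variance could then be (partly) explained by a fixed
# predictor, e.g. a (hypothetical) indicator of US vs. European studies
fit_random_mod <- rma.mv(yi, vi,
                         mods = ~ us_vs_europe,
                         random = ~ 1 | country/study/es_id,
                         data = dat)

# Option 2: country as a fixed moderator (e.g., only five countries of interest);
# dropping the intercept yields a separate mean effect per country
fit_fixed <- rma.mv(yi, vi,
                    mods = ~ factor(country) - 1,
                    random = ~ 1 | study/es_id,
                    data = dat)
```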

As mentioned before, a few studies have found that the RVE method performs similarly to multilevel models. An advantage of applying RVE is that it only requires the correct specification of the higher-level cluster (McNeish et al., 2017). Therefore, this method is a good alternative when the researcher is unsure of the correct model specification. However, this method only gives an estimate of the total variance, and not separate estimates for each random effect. An interesting approach was recently proposed by Tipton, Pustejovsky, and Ahmadi (2019): first applying a multilevel model, and therefore getting separate estimates for the variance components, and then applying the RVE method to get robust standard errors. Although this approach seems promising, its performance still has to be investigated in simulation studies.
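A sketch of this combined strategy, using metafor together with the clubSandwich package (again with hypothetical data and column names), could look as follows:

```r
library(metafor)
library(clubSandwich)

# Step 1: multilevel model, yielding separate variance component estimates
fit3 <- rma.mv(yi, vi, random = ~ 1 | study/es_id, data = dat)
fit3$sigma2

# Step 2: cluster-robust (CR2) standard errors and tests, clustering on studies,
# to guard against misspecification of the random-effects structure
coef_test(fit3, vcov = "CR2", cluster = dat$study)
```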

Examples of alternative specification of multilevel models in meta-analysis

In this section, we identify five common situations in which models with more random effects than the basic hierarchical three-level model would have been more appropriate given the (non-)hierarchical data structure. We only give some prototypical examples for illustrative purposes, but more meta-analyses included in the systematic review fit into one of the following categories. First, a fourth level could have been added to model dependency within studies. For instance, Acar and Sen (2013), who studied the relationship between creativity and schizotypy, specified a three-level model in which effect sizes (level 1: 268 effect sizes) were nested within studies (level 2: 45 studies), and studies were nested within authors (level 3: 34 authors). This model ignores that, within studies, effect sizes might not only vary due to the known sampling variance, but also because they represent different population outcomes. The omission of this between-outcomes (or within-study) variability could lead to biased standard error estimates of the combined effect and of the moderator variables that refer to the outcome variables (Van den Noortgate, Opdenakker, & Onghena, 2005). Adding an additional second level that models the between-outcomes variance (or the variance between population effect sizes) could have been statistically more appropriate: effect sizes (level 1, sampling level: 268 effect sizes) are nested within outcomes or population effect sizes (level 2: 268 outcomes), outcomes (or population effect sizes) are nested within studies (level 3: 45 studies), and studies are nested within authors (level 4: 34 authors).

Second, within studies, more levels could have been specified to deal with different types of dependency. The meta-analysis of Soveri, Antfolk, Karlsson, Salo, and Laine (2017) tested the efficacy of working memory training. Looking specifically at the analyses done on the outcome ‘fluid intelligence’, the authors fitted a meta-analytic three-level model because observed effect sizes (level 1, sampling level: 133 effect sizes) referred to specific population effect sizes or outcomes (level 2: 133 outcomes), which were at the same time nested within studies (level 3: 25 studies). However, primary studies sometimes used more than one treatment group, and the comparison of these treatment groups with a common control group on a specific outcome led to multiple effect sizes. The authors decided to perform a fixed-effect meta-analysis on these multiple effect sizes within comparisons, with the aim of having just one single effect size per outcome. This strategy of summarizing effect sizes within a higher-level unit was proposed by Cooper (2015) and is called the ‘shifting unit of analysis’ strategy. The main drawbacks of this approach are that (1) it involves a loss of information and a reduction of power; (2) moderators that refer to the effect sizes within studies cannot be included; and (3) simulation studies have shown that the between-studies variance estimate is artificially reduced when this strategy is implemented (i.e., Moeyaert et al., 2017). A valid alternative is to add a level in the multilevel model that accounts for dependency between effect sizes due to several comparisons: effect sizes (level 1, sampling level) are nested within outcomes or population effect sizes (level 2), which are nested within comparisons (level 3), which are nested within studies (level 4). This approach allows the use of all effect sizes and the incorporation of moderator variables that refer to characteristics of the effect sizes within studies.

Third, a fourth level could have been added to take into account dependency across samples. Some meta-analyses considered effect sizes belonging to different samples from the same study to be independent. Lebuda, Zabelina, and Karwowski (2016) explored the link between mindfulness and creativity. In their meta-analysis, 89 effect sizes were nested within 20 samples. These samples were nested within 13 studies, but this nesting of samples within studies was not modeled. Lebuda et al. (2016) made their dataset available, so we were able to re-analyze the data and specify a four-level model instead of a three-level model. When a three-level model was fitted (ignoring that samples were nested within studies), the combined effect size was 0.219 with a standard error of 0.065, the between-outcomes variance was 0.029, and the between-samples variance was 0.066. When a four-level model was fitted, the pooled effect became a little larger (0.239) and the standard error slightly increased (0.070). The between-studies variance was 0.014, the between-samples variance decreased (0.054), and the between-outcomes variance remained equal (0.029). Although the conclusions of the meta-analysis did not change, we can clearly see how the standard error was somewhat shrunken due to the omission of the upper study level. Also, it is important to note that 13 studies might not be enough to properly estimate the between-studies variance.
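A re-analysis along these lines could be set up as in the sketch below; the column names (yi, vi, study, sample, es_id) are hypothetical and do not necessarily match those in the original dataset.

```r
library(metafor)

# dat: one row per effect size, with (hypothetical) columns
#   yi = effect size, vi = sampling variance,
#   study, sample, es_id = identifiers

# Three-level model ignoring the study level: effect sizes nested within samples
fit3 <- rma.mv(yi, vi, random = ~ 1 | sample/es_id, data = dat)

# Four-level model: effect sizes nested within samples nested within studies
fit4 <- rma.mv(yi, vi, random = ~ 1 | study/sample/es_id, data = dat)

# Compare the pooled estimate, its standard error, and the variance components
summary(fit3)
summary(fit4)
fit3$sigma2  # between-samples and between-outcomes variances
fit4$sigma2  # between-studies, between-samples, and between-outcomes variances
```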

Fourth, a five-level model could have been applied to model additional within-study and/or between-study dependencies. The meta-analysis of Rabl, Jayasinghe, Gerhart, and Kühlmann (2014) explored country differences in the relationship between high-performance work systems and business performance. A three-level model was fitted, in which several effect sizes were nested within 156 studies (level 2), which were nested within 30 countries (level 3). Because there were several effect sizes within studies, the authors decided to calculate a linear composite correlation of these within-study effect sizes to avoid dependency. Another option would have been to add an additional level that models the variance between these correlations, and in this way preserve all data. Furthermore, some studies used the same dataset, and the authors averaged the effect sizes of studies that used the same dataset. Instead of averaging these effects across studies, an additional, fifth level could have been added to account for the between-datasets variance. In summary, the following five-level model could have been fitted: effect sizes (level 1) nested within outcomes (level 2), nested within studies (level 3), nested within datasets (level 4), nested within countries (level 5).

Fifth and last, CCREMs could have been applied instead of three-level models. Pearce (2017) explored whether exposure to scary TV was related to internalizing behaviors in children (e.g., anxiety, stress, depression). Within primary studies, several effect sizes referring to different internalizing behaviors were reported, as well as several effect sizes that referred to the same behavior but were measured with different instruments. A three-level model was fitted to account for these dependent outcomes. The author wanted to control for the fact that different instruments were used to measure the same behavior within studies (p. 70). However, there were too many different instruments to consider this variable as a moderator (the author mentions nine). Furthermore, some instruments were used for only one effect size, again making it difficult to use this variable as a moderator. Therefore, the variable “scale” was not used in the analyses. An alternative would have been to consider “scales” as a random effect, crossed with studies: effect sizes (level 1) are nested within outcomes (level 2), and outcomes are nested simultaneously within studies and within instruments (level 3). Instruments were not nested within studies because they could have been used in several studies; likewise, studies were not nested within instruments because within one study, several instruments were used. The description of a similar example can be found in Fernández-Castilla et al. (2018). Considering “scales” as a random factor would have allowed the researcher to obtain a measure of how much effect sizes vary due to the use of different instruments.

Final discussion

This study describes for the first time how multilevel models are applied in meta-analysis and what their most common characteristics are as a function of the research discipline. The first conclusion of this systematic review is that three-level models are often applied by default, although other, more complex and sophisticated models are sometimes more appropriate given the meta-analytic data structure. The most likely reason for this is that the methodological papers that explain and illustrate the use and application of multilevel methods for meta-analysis focus only on the three-level model (e.g., Assink & Wibbelink, 2016; Cheung, 2014; Konstantopoulos, 2011; Van den Noortgate et al., 2013, 2015). In this article, we have given several examples of how some meta-analytic data actually have a four- or five-level structure, or even a cross-classified structure. We recommend that meta-analysts carefully study and explore the structure of their data in order to apply a proper model. Furthermore, all these more sophisticated models can easily be fitted with the most commonly used package for performing meta-analysis, namely metafor in R (Viechtbauer, 2010). An important warning here is that not all variables can be considered as random factors. For instance, in one of the examples above, the type of sample within studies was considered as a random effect. We assume that there were many different subsamples (i.e., children, adults, students, workers, etc.) and that the subsamples greatly differed across studies. However, if there were not many different subsamples and the subsamples reported within studies were almost always the same (e.g., men and women), ‘subsamples’ should not have been considered a random factor, but a fixed factor.

The second aim of this study was to describe the main characteristics of these meta-analyses with multiple random factors. In this systematic review, we have disaggregated this information by research discipline. Results show that in the field of behavioral and social sciences, meta-analyses normally include more primary studies (compared to the other disciplines), these primary studies report larger sample sizes, the number of outcomes is more unbalanced across studies, and more moderator variables are tested. Also, the values for the combined Cohen’s d are, on average, smaller than the cutoffs originally proposed by Cohen (1988). In the medical sciences field, fewer studies are normally included in the meta-analysis, and the number of outcomes reported in primary studies is more balanced. The variability within and between studies is slightly larger than in the field of behavioral and social sciences, although in behavioral and social sciences the distribution of the variance components is more skewed. Meta-analyses in the field of biological sciences are similar to those in behavioral and social sciences, except for the sample size of primary studies, which is on average smaller. One limitation of this systematic review is that we found only a few studies from the fields of cultural sciences and science and technology, so the results for these two fields should be interpreted with caution.

Simulation studies that explore the performance of methods that deal with dependent effect sizes should acknowledge these differences in the characteristics of meta-analyses across disciplines. In fact, another goal of this systematic review was to check whether the simulation factor conditions selected in these methodological papers actually represent the characteristics of published meta-analyses. Regarding the effect size value, most simulation studies done on standardized mean differences have selected values of 0, 0.20, and 0.40–0.50. According to this systematic review, these values are fairly close to the 1st quartile, median, and 3rd quartile, respectively, of all disciplines except cultural sciences. Looking at the number of studies, simulation studies have typically generated 10, 20, 30, 40, 50, or 60 primary studies. However, the minimum number of primary studies in behavioral and social sciences, biological sciences, and medical sciences is below 10, and this situation has not been considered in simulation studies. Even so, the median number of studies included in meta-analyses ranges between 20 and 39 across disciplines, and these values are indeed represented in the simulations. Future simulations should also include conditions in which the number of studies is smaller than ten. This is especially important when multilevel models are applied, because previous research has pointed out that when the number of units at the highest level, typically ‘studies’, is below ten, multilevel models can lead to inflated Type I error rates (Van den Noortgate & Onghena, 2003).

The number of outcomes reported within primary studies is commonly unbalanced, especially in the fields of behavioral and social sciences and biological sciences. Some simulation studies have only generated balanced data (e.g., Lee, 2014; Moeyaert et al., 2017; Van den Noortgate et al., 2013), which is a very unrealistic scenario. Other simulation studies have generated unbalanced data (e.g., Park & Beretvas, 2018; Van den Noortgate et al., 2015), but in a way in which, from a range of possible values for the number of outcomes, all values were equally likely (for instance, for a range from 1 to 5, primary studies were equally likely to report 1, 2, 3, 4, or 5 outcomes). However, this systematic review has shown that the largest percentage of primary studies report only one outcome (see Fig. 3). This factor condition is only generated in the simulation study of Tipton (2013), which included one condition in which most studies reported only one outcome and some studies reported ten outcomes. This is a realistic condition, especially in the fields of behavioral and social sciences and biological sciences. In medical sciences, the number of outcomes in primary studies is more balanced, as the majority of primary studies include up to four outcomes.

The sample size of the primary studies has been shown to have little impact on the parameter estimates (e.g., Hedges et al., 2010). Even so, most simulation studies have simulated balanced scenarios in which the sample size was equal across primary studies (e.g., Hedges et al., 2010; Van den Noortgate et al., 2013, 2015), while in reality the sample size can vary considerably across primary studies. Furthermore, the sample sizes selected in simulation studies are in general representative of the behavioral and social sciences and medical sciences fields, where the median number of observations reported within studies is close to 100. However, the median sample size in studies that belong to biological sciences and cultural sciences is close to 30, which is smaller than the values commonly included in the simulation studies. Only the simulation studies of Hedges et al. (2010) and Tipton (2013, 2015) have considered such small sample sizes.

Simulation studies have selected the values for the variance components in two different ways. Some studies have directly selected a specific value for the between-studies and within-study variances (e.g., Lee, 2014; Moeyaert et al., 2017), whereas other studies have used ratios to generate these factors (e.g., Park & Beretvas, 2018; Tipton, 2013). Among the first type of simulation studies, the total variance commonly used ranged from 0.10 to 0.30. The median total variance observed in this systematic review (obtained by summing the medians of the variances at the second and third levels) for Cohen’s d ranged from 0.12 to 0.26 across disciplines, so the ‘average’ total variance is actually represented in the simulation studies. A similar range was observed for odds ratios (0.11–0.25). Regarding Pearson correlations, the total median variances were smaller and ranged from 0.034 to 0.19. Among the second type of simulation studies, Park and Beretvas (2018), for instance, simulated two ratios for Cohen’s d, 5:3:2 and 2:3:5, where the first number refers to the I2 index of the sampling variance, the second to the I2 index of the second level variance, and the third to the I2 index of the third level variance. This last ratio is similar to the one found in this systematic review for behavioral and social sciences (i.e., 2:2:6 and 2:3:5 across types of effect sizes), but it is not representative of medical sciences (i.e., 1:4:5 for Cohen’s d).

Finally, regarding the number of moderators, four simulation studies have tested whether the moderator effect of only one variable (and its standard error) is correctly recovered. Only Study 2 of Tipton (2015) explores the performance of the robust variance estimation method when up to four moderator variables are tested in the regression. However, our results indicate that at least 15.16% of the meta-analyses test several moderators at the same time. Specifically, among the meta-analyses that introduced all moderator variables at the same time, the average number of moderators tested was seven. Furthermore, none of the simulation studies explores the recovery of interaction effects, whereas 32 meta-analyses included in this systematic review do test for interaction effects. It is important that future simulation studies take these scenarios into account.

In sum, the application of multilevel models in the field of meta-analysis offers many opportunities, especially for the treatment of dependent effect sizes. However, researchers should start considering the application of multilevel models other than the three-level model when the meta-analytic data require it. Furthermore, we hope that researchers will take into account the information reported in this systematic review when designing future simulation studies in this field, so that their results can be generalized to real settings.