One-way analysis of variance (ANOVA) is one of the most common statistical techniques for testing the equality of three or more means in educational and behavioral research (Keselman et al., 1998; Kieffer, Reese, & Thompson, 2001), although its use has decreased in recent years (Skidmore & Thompson, 2010). The F-test assumes that the outcome variable is normally and independently distributed and that the samples come from populations with a common variance. However, empirical evidence from real data extracted from reviews of several scientific journals indicates that these assumptions are not always met (Blanca, Arnau, López-Montiel, Bono, & Bendayan, 2013; Micceri, 1989; Ruscio & Roche, 2012).

Specifically, with regard to homogeneity of variance, research reveals that group variances are often unequal (Erceg-Hurn & Mirosevich, 2008; Grissom, 2000; Keselman et al., 1998; Ruscio & Roche, 2012; Wilcox, 1987). This inequality may be due to a priori differences in groups that are naturally formed or to an effect of experimental treatment that produces differences not only in means but also in variances (Bryk & Raudenbush, 1988; Erceg-Hurn & Mirosevich, 2008; Grissom, 2000; Grissom & Kim, 2001; Sawilowsky, 2002; Sawilowsky & Blair, 1992). Several indexes have been proposed to measure the amount of heterogeneity, namely the coefficient of variance variation (Box, 1954; Rogan, Keselman, & Breen, 1977; Ruscio & Roche, 2012), the standardized variance heterogeneity index (Ruscio & Roche, 2012), and the variance ratio (Keselman et al., 1998; Ruscio & Roche, 2012). The variance ratio, which is the simplest measure of heterogeneity, is defined as the ratio of the largest variance to the smallest variance of the groups. This is the index most commonly used in Monte Carlo studies (e.g., Box, 1954; Cribbie, Wilcox, Bewell, & Keselman, 2007; Fan & Hancock, 2012; Hsu, 1938; Kang, Harring, & Li, 2015; Mendes & Pala, 2004; Mickelson, 2013; Moder, 2007, 2010; Scheffé, 1959; Tomarken & Serling, 1986; Wilcox, Charlin, & Thompson, 1986; Zijlstra, 2004). With real data, Keselman et al. (1998) found that the average value of the variance ratio was 2.0 (SD = 2.6), with a median of 1.5 and a maximum ratio of 23.8. Recently, Ruscio and Roche (2012) found variance heterogeneity in more than 50% of examined cases, with the mean variance ratio being equal to 2.51 when there were two groups, 3.95 when there were three groups, and 8.84 when there were four groups in the design.

As the abovementioned studies show, variance heterogeneity is frequently observed in real data. The question that follows logically from this is how heterogeneity affects the robustness of the F-test. Robustness, which has been extensively addressed in the literature, refers to a statistical test’s insensitivity under violations of its assumptions, specifically in terms of its Type I error rates (Box, 1953). Type I error is the probability of rejecting a null hypothesis when it is actually true. The robustness of a statistical test can be evaluated via Monte Carlo simulation techniques, and in order to ensure the comparability of results from Monte Carlo studies a standard criterion to assess robustness must be established. Bradley’s (1978) liberal criterion is considered the most appropriate (e.g., Keselman, Algina, Kowalchuk, & Wolfinger, 1999; Kowalchuk, Keselman, Algina, & Wolfinger, 2004). According to this criterion, a statistical test is considered robust if the empirical Type I error rate is between .025 and .075 for a nominal alpha level of .05. When the rate is above .075 the test is considered liberal, increasing the risk of declaring mean differences that do not exist. When the rate is below .025 the test is considered conservative, such that the researcher is assuming an alpha level below the nominal.
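
Bradley's liberal criterion as described above can be sketched in a few lines of Python. This is only an illustration: the function name `bradley_category` is hypothetical, and the extension to other alpha levels (bounds of 0.5α and 1.5α) follows Bradley's (1978) general formulation, which the text above states only for α = .05. The example rates are the ones quoted later in this introduction.

```python
def bradley_category(empirical_rate, nominal_alpha=0.05):
    """Classify an empirical Type I error rate with Bradley's liberal criterion.

    The bounds are 0.5 * alpha and 1.5 * alpha, i.e. .025 and .075 for alpha = .05.
    """
    lower = 0.5 * nominal_alpha
    upper = 1.5 * nominal_alpha
    if empirical_rate < lower:
        return "conservative"  # rejects a true null less often than alpha implies
    if empirical_rate > upper:
        return "liberal"       # rejects a true null more often than alpha implies
    return "robust"


# Type I error rates quoted in the text for previous studies.
for rate in (0.068, 0.079, 0.062, 0.083):
    print(rate, bradley_category(rate))
```

Applied to the quoted rates, .068 and .062 fall inside the interval, whereas .079 and .083 exceed the upper bound of .075.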

The first Monte Carlo studies that examined F-test robustness to violations of its assumptions were carried out between 1930 and 1960 and were summarized by Glass, Peckham, and Sanders (1972). With regard to variance heterogeneity, early studies (Box, 1954; David & Johnson, 1951; Horsnell, 1953; Norton, 1952, cit. Lindquist, 1953; Hsu, 1938; Scheffé, 1959) suggest two main conclusions: (1) F-test is robust when the groups have equal sample sizes and the group size is not very small (e.g., greater than 7; Kohr & Games, 1974); and (2) F-test tends not to be robust when the groups have unequal sample sizes, in which case the effect of heterogeneity on Type I error depends on the pairing of variance with group size. F-test tends to be conservative when the pairing is positive, that is, when the group with the largest sample size also has the largest variance and the group with the smallest sample size has the smallest variance. Conversely, it tends to be liberal when the pairing is negative, namely when the group with the largest sample size has the smallest variance and the group with the smallest sample size has the largest variance. Based on these studies, many classical handbooks on research methods in education and psychology recommend using equal sample sizes as protection against the effect of heterogeneity (e.g., Glass & Stanley, 1970; Hays, 1981; Keppel, 1991; Maxwell & Delaney, 1990; Winer, 1971).

The issue of F-test robustness to variance heterogeneity has continued to be studied from 1970 to the present day (for a review, see Harwell, Rubinstein, Hayes, & Olds, 1992; Lix, Keselman, & Keselman, 1996). However, research to date with equal sample sizes provides contradictory results, there being both evidence that F-test is robust to variance heterogeneity (Lee & Ahn, 2003; Patrick, 2007; Yiǧit & Gökpinar, 2010) and evidence against this (Alexander & Govern, 1994; Büning, 1997; Harwell et al., 1992; Lix et al., 1996; Moder, 2010; Rogan & Keselman, 1977; Tomarken & Serling, 1986; Wilcox et al., 1986). This inconsistency in the results may be due to several factors.

First, most of the cited studies did not use a standard criterion to assess robustness. Results were usually interpreted based on the comparison between empirical and nominal alpha without following any standard criterion: If the difference was small, F-test was said to be robust. The problem here is that the meaning of “small” is ambiguous and does not allow a clear decision to be made. Indeed, expressions such as “modest inflation” (Harwell et al., 1992) or “slightly increase” (Glass et al., 1972) are frequently used when referring to Type I error rates. Had Bradley’s criterion of robustness been adopted, many of these results would have been interpreted differently.

Second, the studies in question used different measures to quantify variance heterogeneity, thus making it difficult to draw general conclusions. Some studies used the coefficient of variance variation (Lix et al., 1996; Rogan & Keselman, 1977), some used their own indexes (e.g., Patrick, 2007; Ruscio & Roche, 2012), and others used the variance ratio (e.g., Alexander & Govern, 1994; Box, 1954; Hsu, 1938; Moder, 2010; Scheffé, 1959; Tomarken & Serling, 1986; Wilcox et al., 1986; Zijlstra, 2004).

Third, the simulated conditions (e.g., variance values, number of groups, group sizes, pattern of variance, number of replications) were so varied that it is almost impossible to compare studies. In this context, the pattern of heterogeneity that is simulated appears to be the most relevant variable. The pattern of heterogeneity refers to the way in which the values of the group variances can be ordered. Thus, the group variances can monotonically increase (e.g., \( \sigma_1^2 < \sigma_2^2 < \sigma_3^2 \)) or decrease (e.g., \( \sigma_1^2 > \sigma_2^2 > \sigma_3^2 \)) or follow another arbitrary pattern (e.g., \( \sigma_1^2 = \sigma_2^2 > \sigma_3^2 \)). Research to date has included a wide variety of these patterns. In general, some studies have found that F-test is robust, according to Bradley’s liberal criterion, with a monotonic pattern (Lee & Ahn, 2003; Tomarken & Serling, 1986; Wilcox et al., 1986), whereas others have found that it is liberal (Alexander & Govern, 1994; Büning, 1997). For example, Wilcox et al. (1986), who considered four groups with a variance ratio equal to 4 and a monotonic pattern of variance of 1:2:3:4 with equal sample sizes (n = 11), found that F-test was robust (Type I error rate = .068), whereas Alexander and Govern (1994) found it to be liberal with a pattern of 1:2:4:6 (Type I error rate = .079). Büning (1997) found that F-test was robust with group size equal to 10 and a pattern of 1:2:4 (Type I error rate = .062), but liberal with a pattern of 1:3:7 (Type I error rate = .083). With arbitrary patterns of heterogeneity involving a set of groups with similar variances and one with extreme variance (e.g., 1:1:1:6 and 1:1:30) the test has been found to be non-robust (Alexander & Govern, 1994; Lee & Ahn, 2003; Moder, 2010; Rogan & Keselman, 1977; Wilcox et al., 1986).
Overall, these findings suggest that F-test robustness with equal group sizes is more affected by a pattern where the variance of one group is very different to that of the other groups. However, F-test robustness with monotonic patterns of variance is still unclear, and further research is needed to determine under which types of these patterns the test can be used.

The sensitivity of F-test to violations of the variance homogeneity assumption when sample sizes are unequal has been reported more consistently (Gamage & Weerahandi, 1998; Kohr & Games, 1974; Lee & Ahn, 2003; Moder, 2010; Patrick, 2007; Tomarken & Serling, 1986; Yiǧit & Gökpinar, 2010; Zijlstra, 2004). The empirical evidence indicates that its robustness depends on the pairing of variance with group size, as was found in early studies. However, despite the large body of research, the specific conditions under which F-test is robust have yet to be established, and a number of questions remain unanswered. For example, what values of the variance ratio are associated with correct/invalid inferences? How much inequality of group sizes can be assumed in order to ensure that F-test controls Type I error rate? What other types of pairing between variance and group size can be defined and how do they affect F-test robustness?

Regarding the first and second questions, some authors have suggested several rules of thumb, namely that variance homogeneity can probably be assumed when the variance ratio is not greater than 3 (Dean & Voss, 1999; Keppel, Saufley, & Tokunaga, 1992; Kirk, 2013), is less than 4 or 5 (Wuensch, 2017), or is even as high as 10 provided that the ratio of the largest to smallest sample size does not exceed 4 (Tabachnick & Fidell, 2007, 2013).

Regarding the pairing between variance and group size, previous Monte Carlo studies have usually included a perfect pairing with monotonic patterns of variance. For example, considering five groups with sample sizes equal to 32, 36, 40, 44, and 48 and variances equal to 1, 2, 3, 4, and 5, respectively, the pairing between these variables is perfect and positive. If the variances were 5, 4, 3, 2, and 1, respectively, the pairing would be perfect and negative. However, other types of pairing are also possible. If the pairing is defined by the correlation between group size and variance, then different values of this variable can be obtained. For example, if the same group sample sizes were associated with variance values of 1, 4, 2, 5, and 3, respectively, the pairing would be equal to .50, while for values of 3, 5, 2, 4, and 1 it would be equal to −.50. Thus, different values of the pairing could be considered in Monte Carlo studies in order to extend our understanding of how F-test robustness is affected by the type of pairing. As mentioned, previous research does not provide consistent results about the robustness of F-test with monotonic patterns, and it does not consider other possible types of pairing.
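
The pairing values in the example above can be checked with a short Python sketch that computes the Pearson correlation between group sample sizes and group variances. The helper name `pearson_r` is an illustrative choice, and the use of the ordinary Pearson coefficient is an assumption consistent with the values quoted in the text (.50 and −.50); the group sizes and variances are the ones given in the paragraph above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Equal group sizes or equal variances: the correlation is undefined.
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


sizes = [32, 36, 40, 44, 48]  # group sample sizes from the example above
for variances in ([1, 2, 3, 4, 5], [5, 4, 3, 2, 1],
                  [1, 4, 2, 5, 3], [3, 5, 2, 4, 1]):
    print(variances, round(pearson_r(sizes, variances), 2))
```

The four orderings give correlations of 1, −1, .50, and −.50, matching the perfect positive, perfect negative, and intermediate pairings described in the text.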

In this context, the main aim of this study is to systematically examine the robustness of F-test, in terms of Type I error, to violations of variance homogeneity, considering a wide range of conditions representative of real data in educational and psychological research (Golinski & Cribbie, 2009; Keselman et al., 1998; Ruscio & Roche, 2012). To this end, a series of Monte Carlo simulation studies are performed for a one-way design with equal and unequal sample sizes and monotonic patterns of variance. The variance ratio is used as a measure of heterogeneity, the coefficient of sample size variation as a measure of the amount of inequality in group size, and the correlation between variance and group sample size as an indicator of different values of pairing. Our goal, based on the results of this study, is to offer a guideline to help applied researchers decide whether they can use the F-test when their data do not meet the variance homogeneity assumption under certain conditions.

Method

With the aim of systematically examining the robustness of F-test to violations of variance homogeneity, we conducted a series of Monte Carlo simulation studies for a one-way design with equal and unequal sample sizes and monotonic patterns of variance. Simulation studies use computer-intensive procedures to assess the appropriateness and accuracy of a variety of statistical methods in relation to the known truth (Angelis & Young, 1998), and they are especially suitable for evaluating a test’s robustness when the underlying assumptions are not fulfilled. For this reason, they are widely used by researchers in the health and social sciences (Burton, Altman, Royston, & Holder, 2006).

In order to examine the isolated effects of variance heterogeneity on F-test robustness, and considering a one-way design, data were assumed to be normally distributed. Normal data were generated using a series of macros created ad hoc in SAS 9.4 (SAS Institute, 2013). The group effect was set to zero in the population model. The following variables were manipulated:

  1. Equal and unequal group sample sizes and number of groups. Data analytic practices for ANOVA show that unbalanced designs are more common than balanced designs (Golinski & Cribbie, 2009; Keselman et al., 1998). We considered designs with three, four, five, and six groups with balanced cells, and three and five groups with unbalanced cells.

  2. Group sample size and total sample size. A wide range of group sample sizes which can be frequently found in real research were considered, enabling us to study small, medium, and large sample sizes. With balanced designs, the group sizes were set to three, five, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, and 100. With unbalanced designs, group sizes were set between three and 170, with a mean group size from five to 100. Total sample size ranged from nine to 600, depending on the number of groups considered, this being the result of multiplying the number of groups by the minimum and maximum group sample size (e.g., with five groups the total sample size ranged from 15 to 500).

  3. Coefficient of sample size variation (Δn), which represents the amount of inequality in group sizes. This was computed by dividing the standard deviation of the group sample size by its mean. Different degrees of variation were considered and were grouped as low, medium, and high. A low Δn was fixed at approximately 0.16 (0.141–0.178), a medium coefficient at 0.33 (0.316–0.334), and a high value at 0.50 (0.491–0.521). Keselman et al. (1998) showed that the ratio of the largest to the smallest group size was greater than 3 in 43.5% of cases. With Δn = 0.16 this ratio was equal to 1.5, with Δn = 0.33 it was equal to either 2.3 or 2.5, and with Δn = 0.50 it ranged from 3.3 to 5.7.

  4. Ratio of the largest to the smallest variance. For one-way designs Keselman et al. (1998) found that the average value of the ratio of the largest to the smallest standard deviation was 2.0 (SD = 2.6), with a median of 1.5 and a maximum ratio of 23.8. Ruscio and Roche (2012), despite the enormous range in variance ratio found, showed that the ratio exceeded 3 in 23.18% of reviewed studies, with a median of 1.64 and a range for the middle 50% of cases of between 1.23 and 2.76. In addition, for three groups they found a mean value of 3.95. Based on these findings, the values of variance ratio selected for the present study were 1.5, 2, 3, 5, and 9 for balanced designs, and 1.5, 1.6, 1.7, 1.8, 2, 3, 5 and 9 for unbalanced designs.

  5. Patterns of variance and pairing of variance with group sample size. Monotonic patterns of variance were considered, and are presented in Table 1 for each variance ratio. The type of pairing between variance and group size indicates the relationship or association between the two. Pairing is positive when the largest group size is associated with the largest value of the variance and the smallest group size is associated with the smallest value of variance. Pairing is negative when the largest group size is associated with the smallest value of variance, and vice versa. Unpairing occurs when there is no association between group size and variance. This happens, for example, with equal group sizes and/or equal variances, but it can also appear with unequal group sizes. In order to consider conditions of pairing which can represent real data (Keselman et al., 1998; Ruscio & Roche, 2012), we calculated the correlation between group sample size and variance value. Correlations approximately equal to 1, .50, 0, −.50, and −1 were considered for unbalanced designs. The value of 0 was not included for three groups because it cannot be obtained in that case. These correlation values were obtained by associating each group sample size with different values of variance for the monotonic pattern. Thus, if the groups are ordered as a function of their sample sizes, different values of this correlation are obtained by changing the value of their variance. Table 2 shows the order of variance associated with the group sample sizes, from the smallest sample size to the largest one.

    Table 1 Patterns of variance considered in relation to variance ratio and number of groups
    Table 2 Order of variance associated with the groups according to pairing and number of groups
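
The design descriptors defined in the list above (the coefficient of sample size variation, the ratio of the largest to the smallest group size, and the variance ratio) can be illustrated with a brief Python sketch. The function names are hypothetical. The text does not state which form of the standard deviation was used for Δn; the sketch assumes the population form (dividing by the number of groups). The group sizes and variances reuse the five-group example from the introduction.

```python
import math


def sample_size_cv(group_sizes):
    """Coefficient of sample size variation: SD of the group sizes / their mean.

    Assumption: population SD (divisor = number of groups), not stated in the text.
    """
    k = len(group_sizes)
    mean_n = sum(group_sizes) / k
    sd_n = math.sqrt(sum((n - mean_n) ** 2 for n in group_sizes) / k)
    return sd_n / mean_n


def size_ratio(group_sizes):
    """Ratio of the largest to the smallest group size."""
    return max(group_sizes) / min(group_sizes)


def variance_ratio(variances):
    """Ratio of the largest to the smallest group variance."""
    return max(variances) / min(variances)


sizes = [32, 36, 40, 44, 48]   # example group sizes from the introduction
variances = [1, 2, 3, 4, 5]    # example monotonic pattern from the introduction
print(round(sample_size_cv(sizes), 3))  # 0.141
print(size_ratio(sizes))                # 1.5
print(variance_ratio(variances))        # 5.0
```

Under the stated assumption, these sizes give a coefficient of about 0.141 and a size ratio of 1.5, which coincides with the lower end of the low-variation range and the size ratio reported for it above.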

To ensure reliable results 10,000 replications of each combination of the above conditions were performed at a significance level of .05, recording the empirical Type I error rate (Bendayan, Arnau, Blanca, & Bono, 2014; Robey & Barcikowski, 1992).
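
The logic of one simulation condition can be sketched in Python as follows. This is not the authors' SAS code, only a minimal standard-library illustration under stated assumptions: it handles three groups only, because for a numerator df of 2 the upper-tail probability of the F distribution has the closed form (1 + 2F/d)^(−d/2), with d the within-groups df; it uses 2,000 rather than 10,000 replications to keep the run short; and the group sizes, variances, seed, and function names are hypothetical.

```python
import math
import random


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups (lists of floats)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def p_value_three_groups(f_value, df_within):
    """Upper-tail probability of F(2, df_within); valid for three groups only."""
    return (1.0 + 2.0 * f_value / df_within) ** (-df_within / 2.0)


def type_i_error_rate(sizes, variances, replications=2000, alpha=0.05, seed=1):
    """Proportion of replications in which a true null hypothesis is rejected."""
    if len(sizes) != 3 or len(variances) != 3:
        raise ValueError("this sketch handles exactly three groups")
    rng = random.Random(seed)
    df_within = sum(sizes) - 3
    rejections = 0
    for _ in range(replications):
        # All population means are zero, so every rejection is a Type I error.
        groups = [[rng.gauss(0.0, math.sqrt(v)) for _ in range(n)]
                  for n, v in zip(sizes, variances)]
        if p_value_three_groups(f_statistic(groups), df_within) < alpha:
            rejections += 1
    return rejections / replications


sizes = [10, 20, 30]  # hypothetical unbalanced design
rate_equal = type_i_error_rate(sizes, [1, 1, 1])     # homogeneous variances
rate_positive = type_i_error_rate(sizes, [1, 5, 9])  # largest group, largest variance
rate_negative = type_i_error_rate(sizes, [9, 5, 1])  # largest group, smallest variance
print(rate_equal, rate_positive, rate_negative)
```

With a variance ratio of 9, the empirical rate would be expected to fall below the nominal level under positive pairing and above it under negative pairing, in line with the pattern described in the introduction; the exact values depend on the seed and the number of replications.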

Results

The empirical Type I error rates associated with the F-test of the group effect were analyzed for each combination. The results for equal and unequal sample size are shown in Appendices 1 and 2, respectively. Bradley’s liberal criterion (1978) was used to assess the robustness of the procedure. To summarize the results, based on Bradley’s criterion the empirical Type I error rates were dichotomized into a binary variable with two categories, robust (Type I error rate between .025 and .075) and not robust (Type I error rate below .025 or above .075). Chi-square tests were then performed to examine the association between robustness and the variables of interest. Results are presented according to equal and unequal sample sizes.
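
As an illustration of the analysis step described above, the following Python sketch computes the Pearson chi-square statistic and its degrees of freedom for a table that crosses a design factor with the dichotomized robustness variable. The counts are invented for illustration and are not taken from the study, the function name is hypothetical, and the p value is not computed because the standard library has no chi-square distribution.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and df for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, df


# Hypothetical counts: rows = levels of a factor, columns = (robust, not robust).
counts = [[48, 2], [40, 10], [25, 25]]
statistic, df = chi_square_statistic(counts)
print(round(statistic, 2), df)
```

Because the robustness variable has two categories, the df equal the number of factor levels minus one, which is consistent with the df of 12 (13 categories) and 7 (eight categories) reported in the following subsections.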

Equal sample sizes

As can be seen in Tables 11, 12, 13 and 14 (Appendix 1), all Type I error rates were inside the boundary of Bradley’s liberal criterion. Thus, the results show that F-test is robust for three, four, five, and six groups in 100% of cases, regardless of the total sample size and variance ratio.

Unequal sample sizes

Total sample size

The association between total sample size and categorical Type I error rate was not statistically significant for any condition of variance ratio and number of groups, considering 13 categories of the first variable and two of the second. Moreover, the association between group sample size mean and categorical Type I error rate, collapsed across all variance ratios, was not significant for either three groups, χ²(12) = 1.47, p = .99, or five groups, χ²(12) = 0.38, p = .99. The percentages of F-test robustness are shown in Table 3.

Table 3 Percentage of F-test robustness according to group sample size mean and number of groups

Variance ratio

The relationship between variance ratio (eight categories) and categorical Type I error rate was significant both for three groups, χ²(7) = 283.59, p < .001, and for five groups, χ²(7) = 288.57, p < .001. In general, the percentage of robustness decreased as variance ratio increased, with F-test being more robust with five groups. Table 4 shows the percentages of F-test robustness.

Table 4 Percentage of F-test robustness according to variance ratio and number of groups

Pairing of variance with group size

Overall, for both three and five groups there was a significant association between categorical Type I error rate and pairing of variance with group size for ratios higher than 1.5. Tables 5 and 6 show the percentage of robustness according to variance ratio and pairing. With a ratio of 1.5, F-test was robust in all conditions. With a ratio from 1.6 to 2 it was robust except when the pairing was equal to −1. With a ratio of 3 or higher, F-test was robust with pairing equal to 0 or .50 and non-robust with pairing equal to 1, −.5, and −1. Negative pairing had more of an effect than did positive pairing, with the percentage of robustness decreasing as the amount of negative pairing increased; it even reached zero with pairing equal to −1 and a variance ratio of 9. In addition, when F-test was not robust with positive pairing it was always conservative, whereas with negative pairing it was always liberal.

Table 5 Percentage of F-test robustness for three groups according to the pairing of variance with group size and the variance ratio
Table 6 Percentage of F-test robustness for five groups according to the pairing of variance with group size and the variance ratio

Coefficient of sample size variation

For both three and five groups there was a significant association between categorical Type I error rate and the coefficient of sample size variation (three categories) for each ratio higher than 1.5, with F-test being less robust with the highest values of this coefficient. Tables 7 and 8 show the percentage of robustness according to this coefficient. For ratios of 2 or higher, the more inequality between groups the less robust F-test was. The largest coefficient of sample size variation had an enormous effect on the percentage of robustness when the variance ratio was 3 or higher, decreasing it by as much as three-quarters (to 24%) in the cases for three groups.

Table 7 Percentage of F-test robustness for three groups according to the coefficient of sample size variation and the variance ratio
Table 8 Percentage of F-test robustness for five groups according to the coefficient of sample size variation and the variance ratio

All studied conditions

As can be seen in Tables 15 and 16 (Appendix 2) there was a similar pattern of Type I error rates for three and five groups. The results are summarized in Table 9. In general, it appears that robustness depends on the variance ratio, the pairing of variance with group size, and the coefficient of sample size variation, with the procedure being more robust when variance ratios were small, the pairing of variance was either zero or positive, and the coefficient of sample size variation was smaller.

Table 9 Percentage of F-test robustness according to the coefficient of sample size variation, the pairing of variance with group size, and the variance ratio

More specifically:
  • Variance ratio of 1.5. As stated above, F-test was robust for all the studied conditions, regardless of the pairing or the coefficient of sample size variation.

  • Variance ratios from 1.6 to 1.8. F-test was robust for all the considered conditions, except when the pairing was equal to −1 and the coefficient of sample size variation was equal to 0.50, in which case it tended to be liberal.

  • Variance ratio of 2. F-test was robust for all the considered conditions, except when the pairing was equal to −1 and the coefficient of sample size variation was equal to 0.33 or 0.50; in both these cases it was liberal, and in the 0.50 condition it was liberal in 100% of cases.

  • Variance ratio of 3. F-test was robust for all the considered conditions, except when the pairing of variance with group size was:

    • Equal to 1 and the coefficient of sample size variation was equal to 0.50, in which case it tended to be conservative.

    • Equal to −.5 and the coefficient of sample size variation was equal to 0.50, in which case it was liberal in almost 100% of cases.

    • Equal to −1 and the coefficient of sample size variation was equal to 0.33 or 0.50, in which case it was liberal in 100% of the considered conditions.

  • Variance ratios of 5 and 9. The pattern of results here was similar to that for a variance ratio of 3, although robustness decreased. Specifically, F-test was not robust when the pairing of variance with group size was:

    • Equal to 1 and the coefficient of sample size variation was equal to 0.33 or 0.50, the test being conservative in the latter condition in 100% of cases.

    • Equal to −.5 and the coefficient of sample size variation was equal to 0.33, in which case it was liberal in fewer than 50% of cases for a variance ratio of 5 and in 100% of them for a variance ratio of 9. When the coefficient of sample size variation was equal to 0.50, F-test was liberal in 100% of cases.

    • Equal to −1 and the coefficient of sample size variation was equal to 0.16, 0.33, or 0.50, the test being liberal in the latter two conditions in 100% of cases.

Discussion

The aim of this paper was to present a systematic examination of F-test robustness, in terms of Type I error, to violations of variance homogeneity with monotonic patterns of variance in one-way balanced and unbalanced designs. We used the variance ratio as a measure of heterogeneity, the coefficient of sample size variation as a measure of the amount of inequality in group size, and the correlation between variance and group sample size as an indicator of different values of pairing of variance with group sizes. The studied variables cover a wide range of conditions (2,972 conditions), our goal being to provide a guideline that would help applied researchers decide whether they can trust F-test results under heterogeneity. Several main conclusions can be drawn from the results.

First, F-test is robust with monotonic patterns of variance when the group sample sizes are equal, regardless of the number of groups, of the ratio between the largest and smallest variance, and of the total sample size. With a variance ratio as large as 9, F-test can, at least for the number of groups and sample sizes considered here, still be used without the Type I error rate being affected by heterogeneity when the design is balanced.

Second, F-test is not robust with unequal sample sizes under certain conditions. The results showed that, in general, robustness depends on the variance ratio, the pairing of variance with group size, and the coefficient of sample size variation, with the procedure being more robust when variance ratios are small, the pairing of variance is either zero or positive, and the coefficient of sample size variation is smaller. These conditions can be specified as follows:

  1. The percentages of robustness tend to be lower for three groups than for five groups. This may indicate that the number of groups is a variable that has to be considered: the smaller the number of groups, the greater the effect of heterogeneity on F-test.

  2. The total sample size does not influence F-test robustness under heterogeneity. The use of a large sample size does not, therefore, protect against the effect of heterogeneity.

  3. When the pairing of variance with group size is equal to .5 for three groups and equal to 0 or .5 for five groups, F-test is not affected by heterogeneity under any considered condition. However, F-test tends to be conservative with positive pairing and liberal with negative pairing, the latter being the most influential variable. Consequently, researchers should pay particular attention when the pairing is negative in their data.

  4. The ratio of the largest to the smallest variance, which represents the measure of heterogeneity, determines F-test robustness. Its robustness decreases as the variance ratio increases; in other words, robustness decreases as the homogeneity assumption is violated more severely. With a ratio of 1.5, F-test is robust in all studied conditions.

  5. For a ratio higher than 1.5 there are two variables that have to be considered: the coefficient of sample size variation and the pairing of variance with group size. In general:

    • The coefficient of sample size variation, which represents the amount of inequality in group sizes, affects F-test robustness. In several cases its robustness decreases as the coefficient of variation increases; in other words, robustness decreases as the group sizes become more unequal.

    • When pairing is equal to 1, F-test tends to be conservative, whereas when pairing is negative (equal to −.5 or −1) the procedure tends to be liberal, depending on the variance ratio and the coefficient of sample size variation.

    • With a ratio higher than 1.5 and lower than 2, F-test is only affected by heterogeneity when pairing is equal to −1 and the coefficient of sample size variation is 0.5.

    • With a ratio equal to 2, F-test is only affected by heterogeneity when pairing is equal to −1 and the coefficient of sample size variation is as high as 0.33 or 0.5.

    • With a ratio of 3 or higher, F-test tends to be conservative with pairing equal to 1 and a coefficient of sample size variation of 0.5. With a ratio of 5 or 9 it is conservative in 100% of the studied conditions with this pairing and coefficient. Likewise, F-test tends to be liberal with pairing equal to −.5 or −1 under several conditions of sample size variation. The more unequal the sample sizes, the less robust the F-test is.

In general, the results regarding equal sample sizes are consistent with early studies (e.g., Box, 1954; Glass et al., 1972; Hsu, 1938; Scheffé, 1959), as well as with more recent ones (Lee & Ahn, 2003; Patrick, 2007; Yiǧit & Gökpinar, 2010). Specifically, our findings are consistent with the early research suggesting that balanced designs can be used as protection against the effect of variance heterogeneity. However, the results of the present study go further, since they show that this recommendation is accurate – even with small samples and with a variance ratio as high as 9 – when there is a monotonic pattern of variance in the groups, that is, when the values of group variance increase or decrease monotonically so that the groups can be ordered as a function of their respective variances. Other researchers have found that F-test is not robust with a balanced design when the pattern of heterogeneity involves a set of groups with similar variances and one with extreme variance (e.g., Alexander & Govern, 1994; Lee & Ahn, 2003; Moder, 2010; Rogan & Keselman, 1977; Wilcox et al., 1986). This finding highlights the relevance of knowing the pattern of variance in the data when performing F-test.

With regard to unequal sample sizes, our results appear to be consistent with previous findings, showing that Type I error rates vary depending on the degree of variance heterogeneity and the pairing of variance with group sample size (Box, 1954; Gamage & Weerahandi, 1998; Harwell et al., 1992; Horsnell, 1953; Hsu, 1938; Kohr & Games, 1974; Lee & Ahn, 2003; Moder, 2010; Patrick, 2007; Scheffé, 1959; Tomarken & Serling, 1986; Yiǧit & Gökpinar, 2010; Zijlstra, 2004). Specifically, with positive pairing, F-test tends to be conservative, with the empirical level of alpha being less than the nominal. With negative pairing, F-test tends to be liberal, with the empirical level of alpha being higher than the nominal, such that the risk of declaring mean differences that do not exist is increased. However, the present study extends the findings of previous studies and provides further information about F-test robustness under heterogeneity in a wide range of conditions that applied researchers may encounter in their data, taking into account specific variables such as different values of the pairing of variance with group size, several ratios of variance, and different values of the coefficient of sample size variation.

Furthermore, the results of this study enable us to offer researchers a specific guideline regarding whether or not F-test will be sensitive to departures from the homogeneity assumption that may be present in their data. When a monotonic pattern of variance is found in the groups, as was the case here, there are three steps that researchers can follow:

  1. Calculate the variance ratio, dividing the value of the largest variance of the groups by the smallest variance. If this ratio is equal to or less than 1.5, F-test can be performed with confidence. If this ratio is higher than 1.5, then continue with step 2.

  2. Calculate the correlation between group sample size and the values of variance in order to determine the amount of pairing of variance with group sample size. If this correlation is either 0 or 0.5, proceed with F-test. Otherwise, continue with step 3.

  3. Calculate the coefficient of sample size variation, dividing the standard deviation of the group sample sizes by its mean in order to determine the amount of inequality in group sample sizes.

    • If the pairing is equal to 1 and the coefficient of sample size variation is high (close to .50), it is not possible to trust the results of F-test for ratios higher than 2 because the actual Type I error may be much lower than the nominal alpha of .05, even reaching .01. Table 10 shows the specific conditions in which F-test is not robust.

    • If the pairing is equal to −.50 and the coefficient of sample size variation is close to 0.33 or higher, results from F-test for ratios higher than 2 are not reliable because the actual Type I error may be much higher than the nominal alpha of .05. Thus, there is an increased likelihood of declaring mean differences that do not actually exist. The highest value of Type I error found in this condition was .10.

    • If the pairing is equal to −1, and in this case for the majority of sample size coefficients for high variance ratios, results from F-test are distorted because the actual Type I error may be much higher than the nominal alpha of .05, even reaching .20 (see Table 10 for specific conditions).

    Table 10. Conditions under which F-test is not robust, in terms of Type I error, against violation of the homogeneity assumption, according to variance ratio, the pairing of variance with group sample size, and the coefficient of sample size variation
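
The three indices used in the steps above can be computed directly from the group sample sizes and group variances. The following Python sketch is only illustrative and is not the code used in this study. It assumes that the pairing is a Pearson correlation and that the standard deviation of the group sizes is the population standard deviation (dividing by the number of groups); both choices, and all function names, are assumptions made for the example.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # balanced design or equal variances: no pairing
    return sxy / (sxx * syy) ** 0.5


def heterogeneity_indices(sizes, variances):
    """Return (variance ratio, pairing, coefficient of sample size variation)."""
    ratio = max(variances) / min(variances)   # step 1
    pairing = pearson(sizes, variances)       # step 2 (assumed Pearson r)
    cv_n = pstdev(sizes) / mean(sizes)        # step 3 (assumed population SD)
    return ratio, pairing, cv_n


# Hypothetical design: the smallest group has the largest variance.
ratio, pairing, cv_n = heterogeneity_indices([4, 10, 16], [9.0, 5.0, 1.0])
print(f"variance ratio = {ratio:.2f}, pairing = {pairing:.2f}, "
      f"sample size variation = {cv_n:.2f}")
```

With real data the pairing will rarely equal exactly 0, ±.50, or ±1, so the computed values should be compared with the nearest simulated condition rather than treated as exact cut-offs.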

One of the biggest advantages of following these steps is that applied researchers do not need to use any traditional homogeneity tests (e.g., Bartlett, 1937; Cochran, 1941; Hartley, 1950; Levene, 1960), which are known to rely on other assumptions that might not be met (Bhat, Badade, & Aruna Rao, 2002; Conover, Johnson, & Johnson, 1981; Harwell et al., 1992; Moder, 2007; Sharma & Kibria, 2013; Zimmerman, 2004). Moreover, researchers can locate the specific variance conditions and characteristics of their data in the tables provided and see directly if F-test is robust or not.

To sum up, this study has two main strengths. First, its systematic approach covers the largest variety of conditions simulated to date when exploring F-test robustness to variance heterogeneity, including conditions representative of real data in educational and psychological research. Second, the results yield an easy guideline that can be followed by applied researchers from any background, making it easier for them to decide whether F-test can reliably be used when variances are not equal between the groups. Moreover, the guideline provided makes this process fast and straightforward, avoiding the need for traditional homogeneity tests, which cannot be used in a number of conditions. It should be noted, however, that this study has only analyzed the effect of monotonic patterns of variance on the Type I error rate of F-test. Future studies should therefore aim to examine power and other patterns of variance besides those considered here. A further potential limitation of this paper is that it aimed to explore the isolated effect of heterogeneity on F-test, without considering other assumptions such as normality. An interesting line of future research would be to explore whether or not the violation of normality increases the effect of heterogeneity.

The results of this study suggest that the traditional variance ratio should be used as a measure of the degree of heterogeneity, and indicate that special attention should be paid when the design is unbalanced, the pairing is negative, and the ratio is higher than 1.5. Furthermore, a variance ratio higher than 1.5 may be established as a rule of thumb for considering a potential threat to F-test robustness under heterogeneity with unequal sample sizes. This rule of thumb is much more restrictive than the previously recommended maximums of three (Dean & Voss, 1999; Keppel et al., 1992; Kirk, 2013), four or five (Wuensch, 2017), or 10 (Tabachnick & Fidell, 2007, 2013). This paper shows that these criteria may lead, under certain conditions, to incorrect inferences.

The next problem to be tackled is how to address heterogeneity of variance when F-test is not robust. Although a detailed analysis of this issue is beyond the scope of the present study, we would like to offer some general recommendations. A first, practical recommendation is that researchers should, if possible, design their study with equal group sample sizes, or, at least, with low sample size variation. However, this is not always possible and there may be disagreement over whether the study design or the data collection procedure should be driven by the statistical analysis.

Some authors have also recommended using a more stringent alpha level in the condition under which an inflated alpha is expected, for example, .025 instead of .05 (Keppel et al., 1992; Keppel & Wickens, 2004; Tabachnick & Fidell, 2007, 2013), or .01 with severe violation (Tabachnick & Fidell, 2007, 2013). This is the simplest procedure for researchers since they may still use F-test while maintaining control of Type I error. For illustrative purposes, and in order to examine which alpha level may be used, we conducted simulations under those conditions for which F-test is liberal for three groups with a nominal alpha of .05, and considering other more restricted alpha levels (results are shown in Appendix 3). Overall, a nominal alpha level of .025 controls the Type I error rate within the bounds of Bradley’s criterion for .05 in the conditions associated with Type I error rates around .10, while a nominal alpha level of .01 achieves this control in the conditions associated with Type I error rates above .10. However, in some conditions with Type I error rates above .15, the level of alpha has to be restricted to .005 to maintain empirical Type I error rates within the bounds of Bradley’s criterion for .05. Consequently, researchers can adjust the nominal alpha level depending on the specific characteristics of their data, bearing in mind that a severe violation of homogeneity requires a more restricted level of alpha.
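
To make explicit what "within the bounds of Bradley's criterion for .05" means, and assuming that the liberal form of the criterion is the one intended (which is consistent with the .075 cut-off used elsewhere in this discussion), a test is regarded as robust at nominal level $\alpha$ when its empirical Type I error rate $\hat{\alpha}$ satisfies:

$$
0.5\,\alpha \le \hat{\alpha} \le 1.5\,\alpha
\quad\Longrightarrow\quad
.025 \le \hat{\alpha} \le .075 \quad \text{for } \alpha = .05 .
$$

Under this reading, an adjusted nominal level such as .025, .01, or .005 is judged against the interval for .05 rather than against its own interval: the adjustment succeeds when the empirical rejection rate obtained at the stricter level falls between .025 and .075.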

Another common recommendation for meeting the assumption of variance homogeneity is to transform the response variable (e.g., Montgomery, 1991; Tabachnick & Fidell, 2007, 2013; Winer, Brown, & Michels, 1991). However, it is often difficult to determine which transformation is appropriate for a specific set of data, and results are usually difficult to interpret when data transformations are adopted.

The comparison of means using alternative statistical procedures which have been found to provide more robust results has also been proposed (e.g., Alexander & Govern, 1994; Brown & Forsythe, 1974; Brunner, Dette, & Munk, 1997; Chen & Chen, 1998; James, 1951; Krishnamoorthy, Lu, & Mathew, 2007; Kruskal & Wallis, 1952; Lee & Ahn, 2003; Li, Wang, & Liang, 2011; Lix & Keselman, 1998; Weerahandi, 1995; Welch, 1951; Wilcox, 1995; Wilcox, Keselman, & Kowalchuk, 1998). Below we focus on the most common ones.

The non-parametric Kruskal-Wallis test (Kruskal & Wallis, 1952) is one of the most widely recommended tests in classic handbooks on methodology and statistics. However, the Kruskal-Wallis test has several disadvantages: (1) it converts quantitative continuous data into rank-ordered data, with a consequent loss of information; (2) its null hypothesis differs from that of F-test, unless the distribution of groups has exactly the same shape (see Maxwell & Delaney, 2004); and (3) some Monte Carlo studies have shown that its Type I error is also affected by variance heterogeneity, being liberal (rates greater than .075) with negative pairing (Cribbie et al., 2007; Tomarken & Serlin, 1986).

Another common proposal has been to use parametric modifications of F-test, such as the Brown-Forsythe (1974) and Welch (1951) tests. Both seem to provide better control over Type I error rates than does F-test under heteroscedasticity. With variance patterns similar to those used here, Tomarken and Serlin (1986) recommended using the Welch test with normal populations, while Clinch and Keselman (1982) recommended the Brown-Forsythe test under both heterogeneity and non-normality. More recently, the results obtained by Parra-Frutos (2014) suggested that both tests perform well with normal data, although the Brown-Forsythe test offers better control of the Type I error rate under several non-normality conditions. Another recently proposed alternative is to use the F-test, Brown-Forsythe or Welch tests with bootstrapping in order to obtain distributions of the statistics instead of using their theoretical distribution (Krishnamoorthy et al., 2007; Parra-Frutos, 2014). Parra-Frutos (2014) showed that the bootstrapped F-test and the bootstrapped Brown-Forsythe test exhibit similar and exceptionally good behavior under heteroscedasticity and non-normality.
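
As an illustration of the Welch (1951) test mentioned above, the following Python sketch computes the statistic and its degrees of freedom using the standard library only. It is a sketch for illustration, not code from this study. The function names and the data are hypothetical, and the p-value helper relies on a closed form of the F distribution that holds only when the numerator degrees of freedom equal 2, that is, for three groups; other designs would need a general F distribution routine from a statistics library.

```python
from statistics import mean, variance


def welch_test(groups):
    """Return (F, df1, df2) for Welch's heteroscedastic one-way test."""
    k = len(groups)
    n = [len(g) for g in groups]
    m = [mean(g) for g in groups]
    # Precision weights n_i / s_i^2 (sample variances with n_i - 1 in the denominator).
    w = [ni / variance(g) for ni, g in zip(n, groups)]
    w_total = sum(w)
    weighted_mean = sum(wi * mi for wi, mi in zip(w, m)) / w_total
    numerator = sum(wi * (mi - weighted_mean) ** 2 for wi, mi in zip(w, m)) / (k - 1)
    lam = sum((1 - wi / w_total) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    f_stat = numerator / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    return f_stat, k - 1, (k ** 2 - 1) / (3 * lam)


def p_value_df1_equals_2(f_stat, df2):
    """Upper-tail F probability; this closed form is valid only for df1 == 2."""
    return (1 + 2 * f_stat / df2) ** (-df2 / 2)


# Hypothetical scores for three groups with unequal sizes and spreads.
scores = [[12, 15, 11, 14], [10, 22, 5, 17, 13, 19], [30, 2, 18, 25, 9, 14, 21, 6]]
f_stat, df1, df2 = welch_test(scores)
print(f_stat, df1, df2, p_value_df1_equals_2(f_stat, df2))
```

Because each group mean is weighted by the inverse of its own estimated error variance, no pooled variance is required, which is why the test does not rely on the homogeneity assumption.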

Finally, methods using robust estimators of location and robust measures of scale have also been proposed to compare trimmed means. For example, Lix and Keselman (1998), Wilcox (1995), and Wilcox et al. (1998) suggested that the best option was the Welch test on trimmed means and Winsorized variance, although the bootstrap procedure proposed by Krishnamoorthy et al. (2007), used in conjunction with a robust approach, has been shown to provide better control of Type I error under heteroscedasticity (Cribbie, Fiksenbaum, Keselman, & Wilcox, 2012).

Whatever the case, we encourage researchers to analyze the specific characteristics of their design and the data obtained and, if their data do not meet the assumption of variance homogeneity, to choose the best alternative in order to obtain valid results. To this end, the best approach is to perform a simulation study involving the specific conditions of the real data so as to determine whether or not F-test is robust in the situation being considered. We are aware, however, that applied researchers are not usually familiar with this procedure.
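
For readers unfamiliar with such simulations, the following Python sketch shows the general idea under simplifying assumptions. It is not the procedure used in this study. It assumes three groups, normal populations with equal means, and a chosen set of group sizes and variances. All names are hypothetical, and the p-value uses a closed form of the F distribution that is valid only with two numerator degrees of freedom (three groups).

```python
import random
from statistics import mean


def f_test_p_value(groups):
    """P-value of the one-way ANOVA F-test for exactly three groups."""
    if len(groups) != 3:
        raise ValueError("closed-form p-value below requires three groups")
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df2 = n_total - 3
    f_stat = (ss_between / 2) / (ss_within / df2)
    return (1 + 2 * f_stat / df2) ** (-df2 / 2)  # exact upper tail for df1 = 2


def type1_error(sizes, variances, alphas=(0.05, 0.025, 0.01), reps=5000, seed=1):
    """Empirical rejection rate of a true null hypothesis at each nominal alpha."""
    rng = random.Random(seed)
    sds = [v ** 0.5 for v in variances]
    rejections = {a: 0 for a in alphas}
    for _ in range(reps):
        # All population means are 0, so every rejection is a Type I error.
        groups = [[rng.gauss(0.0, sd) for _ in range(n)] for n, sd in zip(sizes, sds)]
        p = f_test_p_value(groups)
        for a in alphas:
            if p < a:
                rejections[a] += 1
    return {a: count / reps for a, count in rejections.items()}


# Hypothetical condition: negative pairing, variance ratio of 9.
print(type1_error(sizes=[4, 10, 16], variances=[9.0, 5.0, 1.0]))
```

Replacing the sizes and variances with those observed in a given data set gives a rough estimate of how far the empirical Type I error rate departs from the nominal level in that situation, and whether a stricter nominal alpha would bring it back within acceptable bounds.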