Clustered sampling designs are frequently used in the behavioral, educational, and social sciences for investigating the hierarchical nature of individual and group influences. The class of hierarchical linear models has become the most widely accepted tool for multilevel analysis. The relevant methodological and practical issues associated with hierarchical linear models can be found in Alferes and Kenny (2009); Courrieu, Brand-D’Abrescia, Peereman, Spieler, and Rey (2011); Goldstein (2002); Gulliford, Ukoumunne, and Chinn (1999); Hedges and Hedberg (2007); Hoffman and Rovine (2007); Hofmann (2002); O’Connor (2004); Raudenbush and Bryk (2002); and Snijders and Bosker (2012). Because of these multilevel phenomena, measurements on individuals (e.g., employees, students, patients) within the same group (e.g., organization, classroom, clinic) are presumably more similar than measurements on individuals in different groups. The clustering can be expressed as a correlation among the measurements on individuals within the same group, and this correlation must be appropriately accounted for in multilevel research. Accordingly, the intraclass correlation coefficient (ICC) has been extensively used to measure reliability or the degree of resemblance among cluster members. It is essential to note that different conceptual frameworks and modeling formulations ultimately lead to distinct definitions of the ICC. Each form of ICC arguably has its own theoretical implications and is appropriate for specific situations. In particular, Bartko (1976), McGraw and Wong (1996), and Shrout and Fleiss (1979) emphasized the proper use and interpretation of various ICCs for assessing interrater reliability in one-way and two-way analysis of variance (ANOVA) models.

In view of the practical applications of multilevel modeling in different fields of research, the one-way random effects model has been the most common formulation for clustering studies. For this simplest nested design, the ICC(1) index, proposed by Fisher (1938), is a well-adopted measure of intraclass correlation. Notably, Donner (1986) provided a comprehensive discussion of alternative point estimators of the ICC. Also, Shieh (2012) recently presented a comparison of the prevailing ICC(1) with the corrected eta-squared estimator (Bliese & Halverson, 1998). These results show that although ICC(1) is the best known, it may not always be the best choice, especially when the underlying population ICC is small. On the other hand, McGraw and Wong (1996) presented an excellent discussion of ANOVA F procedures for constructing significance tests of the ICC in the one-way random effects model. McGraw and Wong noted that the common approach to conducting significance tests concerning nonzero values of the ICC has been to use Fisher’s normalizing transformation. Despite the general acceptance of Fisher’s technique, recommendations and findings about this issue are diverse and often contradictory. Weinberg and Patel (1981) demonstrated that Fisher’s approximation is highly accurate for studies with a moderately large number of groups. However, the empirical results of Mian, Shoukri, and Tracy (1989) showed that the test procedure based on Fisher’s transformation overstates the significance levels when the null ICC value is greater than zero. Moreover, McGraw and Wong explicitly emphasized the superiority of the exact F test procedure over Fisher’s Z approximation method in terms of computational ease and power performance. Hence, it is prudent to ensure that applications and extensions of Fisher’s method are carefully examined with technical clarification and numerical justification.

In practice, power analyses and sample size calculations are often critical for investigators to credibly address specific research hypotheses and confirm differences. The planning of sample size should be an integral part of the research design of reliability studies. In the context of the one-way random effects model, Donner and Eliasziw (1987) considered exact power and sample size calculations for significance tests of the ICC. Alternatively, Walter, Eliasziw, and Donner (1998), using Fisher’s transformation, presented a closed-form formula for determining the sample size necessary to achieve a given level of power. More importantly, they concluded that the simplified technique agrees closely with the exact calculations and can be applied without intensive numerical computation. The simulation study of Zou (2012) also supported the accuracy of the sample size formula derived from Fisher’s Z approximation in Walter et al. (1998). These results imply that the simple approximation maintains excellent performance while avoiding involved computation, and thus that there is no reason to consider an exact approach. However, given the asymptotic nature of the approximate distribution and the poor control of the Type I error rate of Fisher’s Z test, the accuracy of the corresponding sample size formula has never been fully evaluated.

Despite the arguments and findings in Walter et al. (1998) and Zou (2012), the following caveats about their numerical investigations should be noted. First, the empirical illustration in Walter et al. (1998, Table 1) showed that the exact approach and the simplified method produce similar results over nine model configurations. In addition, to further justify the overall usefulness of their simple sample size formula, they claimed close agreement between the two alternative procedures under a variety of parameter settings considered in the exact computations of Donner and Eliasziw (1987). However, it is important to note that the ANOVA F test and the approximate Fisher’s Z test rely on distinct distribution functions, and the corresponding power functions are fundamentally different. Because sample size calculations are performed with the power functions of the two distinct test procedures, the resulting sample sizes needed to attain the designated power are not necessarily equivalent. The emphasis is on the inherent discrepancy between the developed techniques in terms of the power function and sample size determination. Accordingly, the appraisal of Walter et al. (1998) may have demonstrated an interesting correspondence between the sample size techniques, but it does not assure the actual performance of their sample size formula.

Table 1 Numbers of groups, simulated power, and estimated power of the exact approach and approximate method for H0: ρ = ρ0 versus H1: ρ > ρ0, with α = .05 and nominal power 1 − β = .50

Second, although the empirical assessments in Zou (2012, Table 2) were conducted with a precision terminology and interpretation, they can be viewed as simulation results for the sample size formula of Walter et al. (1998) under a power consideration. At first sight, the simple approximation seems to give acceptable outcomes in several situations. However, a closer inspection of Zou’s (2012) numerical results reveals that the discrepancy between the nominal power and the simulated power (the estimate of true power) is sizeable for cases with a small number of groups. In addition, the chosen true ICCs are confined to the three values .6, .7, and .8. According to the benchmarks suggested in Landis and Koch (1977), agreement values are characterized as slight (.00–.20), fair (.21–.40), moderate (.41–.60), substantial (.61–.80), and almost perfect (.81–1.00). It appears that Zou considered only substantial levels of reliability, which may not represent the general settings likely to be encountered with real data. Furthermore, the selected null ICC values of the right-sided tests are smaller than the alternative ICC specifications by only .10 or .15. Consequently, the explication of Zou did not cover a full range of possible model configurations and is not detailed enough to elucidate the essential features of the approximate sample size procedure.

Table 2 Numbers of groups, simulated power, and estimated power of the exact approach and approximate method for H0: ρ = ρ0 versus H1: ρ > ρ0, with α = .05 and nominal power 1 − β = .80

For the ultimate aim of selecting the most appropriate methodology, it is vital to ensure that the underlying properties of a sample size procedure are well understood before the approach is applied by researchers to reliability studies with potentially diverse ICC configurations. The purpose of the present article is to contribute to the literature on optimal sample sizes for the design of reliability studies in two ways. First, this article describes research designs with participant allocation constraints that may prove useful in a given application. The allocation schemes include (1) the number of subjects in each group is fixed and (2) the number of groups is given. The aforementioned studies of Donner and Eliasziw (1987), Walter et al. (1998), and Zou (2012) focused on the first case, with different definitions and notation. Extensive numerical examinations were conducted to illuminate the advantages of the exact sample size approach of Donner and Eliasziw over the approximate transformation method of Walter et al. and Zou under a wide range of parameter configurations and sample size structures.

Second, design strategies are extended to take into account the impact of project funding while maintaining adequate power. The related cost issues in the design of reliability studies can be found in Flynn, Whitley, and Peters (2002); Shoukri, Asyali, and Donner (2004); Shoukri, Asyali, and Walter (2003); and the references therein. Specifically, the cost implications suggest optimally assigning subjects (1) to meet a designated power level at the least cost or (2) to attain maximum power for a fixed cost. The suggested exact sample size calculations were expanded to accommodate these budgetary considerations. For the practical problem of determining the optimal sample sizes for a nominal power so that the total sample size is minimized, the appraisals not only showed that the approximate Fisher-transformed procedure of Walter et al. (1998) is not guaranteed to give correct optimal sample sizes, but also found that some of the optimal sample sizes computed with the method of Lagrange multipliers in Eliasziw and Donner (1987) are actually suboptimal. Thus, the methodology of exact sample size calculation is of great potential use and should be properly recognized. Corresponding SAS computer programs have also been developed to facilitate the recommended sample size procedures in the research design of reliability studies.

Test procedures

Within the context of the one-way random effects model, the response variable is measured on each of N individuals within each of G groups that arise from a two-level study design assuming the following form:

$$ {Y}_{ij}=\mu +{\gamma}_i+{\varepsilon}_{ij},\quad i=1,\dots, G;\ j=1,\dots, N, $$
(1)

where Yij is the individual-level outcome, μ is the grand mean, and γi and εij are independent random variables with γi ∼ N(0, σ²γ) and εij ∼ N(0, σ²ε). The variance of Yij is then σ²γ + σ²ε, where σ²γ represents the between-group variance and σ²ε represents the within-group variance. Accordingly, the ICC ρ is defined as the ratio of the between-group variability to the total variability:

$$ \rho =\frac{\sigma_{\gamma}^2}{\sigma_{\gamma}^2+{\sigma}_{\varepsilon}^2}. $$
(2)

The ICC can also be interpreted as a simple correlation coefficient Corr(Yij, Yij′) between any two observations, Yij and Yij′, in the same group, with j ≠ j′. From the one-way random effects model defined in Eq. 1, the so-called ICC(1) index is the most frequently adopted estimator of ρ and is denoted by

$$ \widehat{\rho}=\frac{MSB-MSW}{MSB+\left(N-1\right)MSW}=\frac{F^{*}-1}{F^{*}+N-1}, $$
(3)

where MSB is the between-group mean square, MSW is the within-group mean square, and F* = MSB/MSW. Accordingly, the ANOVA F test statistic has the distribution

$$ {F}^{*}\sim \tau F\left[G-1,G\left(N-1\right)\right], $$
(4)

where τ = 1 + Nρ/(1 − ρ) and F[G − 1, G(N − 1)] is the F distribution with G − 1 and G(N − 1) degrees of freedom. The main problem is to detect the magnitude of the ICC in terms of the hypotheses H0: ρ = ρ0 versus H1: ρ > ρ0, where ρ0 is a constant and 0 ≤ ρ0 < 1. Hence, H0 is rejected at significance level α if F* > τ0·F(G−1), G(N−1), α, where τ0 = 1 + Nρ0/(1 − ρ0) and F(G−1), G(N−1), α is the upper (100·α)th percentile of the F distribution F[G − 1, G(N − 1)]. It is important to note that ρ0 is a specified magnitude that corresponds to some threshold for identifying minimal or substantial research findings. The related considerations of testing substantive significance by specifying a nonzero-effect null hypothesis were emphasized in Fowler (1985) and Murphy and Myors (1999).
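The article’s companion programs are written in SAS/IML; purely as an illustrative sketch, the rejection rule above and the ICC(1) estimator of Eq. 3 can be expressed in Python with `scipy.stats` (the function names here are hypothetical, not part of the article):

```python
from scipy.stats import f

def icc1(F_star, N):
    """ICC(1) point estimate from the ANOVA F statistic (Eq. 3)."""
    return (F_star - 1) / (F_star + N - 1)

def icc_f_test(F_star, G, N, rho0=0.0, alpha=0.05):
    """Exact one-sided F test of H0: rho = rho0 versus H1: rho > rho0.

    F_star = MSB/MSW from a one-way random effects ANOVA with G groups
    and N subjects per group; returns True when H0 is rejected.
    """
    tau0 = 1 + N * rho0 / (1 - rho0)
    crit = tau0 * f.ppf(1 - alpha, G - 1, G * (N - 1))
    return F_star > crit
```

For instance, with ρ0 = 0 the critical value reduces to the upper 100·αth percentile of F[G − 1, G(N − 1)], since τ0 = 1.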

Under the alternative hypothesis with ρ = ρ 1 (> ρ 0), the associated power function is of the form

$$ {\pi}_F\left({\rho}_1,{\rho}_0,G,N\right)=1-{\varPhi}_F\left\{\left({\tau}_0/{\tau}_1\right){F}_{\left(G-1\right),G\left(N-1\right),\alpha}\right\}, $$
(5)

where ΦF(·) is the cumulative distribution function of the F distribution F[G − 1, G(N − 1)] and τ1 = 1 + Nρ1/(1 − ρ1).
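Equation 5 translates directly into code. As a sketch only (assuming `scipy`; this is not the article’s SAS/IML program):

```python
from scipy.stats import f

def power_exact(rho1, rho0, G, N, alpha=0.05):
    """Exact power (Eq. 5) of the one-sided ANOVA F test of
    H0: rho = rho0 against the alternative rho = rho1 > rho0."""
    tau0 = 1 + N * rho0 / (1 - rho0)
    tau1 = 1 + N * rho1 / (1 - rho1)
    fcrit = f.ppf(1 - alpha, G - 1, G * (N - 1))
    return 1 - f.cdf((tau0 / tau1) * fcrit, G - 1, G * (N - 1))
```

A convenient sanity check: when ρ1 = ρ0, the scaling factor τ0/τ1 equals 1 and the expression reduces exactly to α.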

Alternatively, Fisher (1938) showed that the Z* = ln(F*)/2 transformation has an asymptotic normal distribution with mean μ Z and variance σ 2 Z

$$ {Z}^{\ast}\ \overset{\cdot}{\sim}\ N\left({\mu}_Z,{\sigma}_Z^2\right), $$
(6)

where

$$ {\mu}_Z=\frac{ \ln \left(\tau \right)}{2}\quad \mathrm{and}\quad {\sigma}_Z^2=\frac{N}{2\left(G-1\right)\left(N-1\right)}. $$

Hence, one may perform the hypothesis test of H0: ρ = ρ0 versus H1: ρ > ρ0 by rejecting H0 if Z* > μZ0 + zα·σZ, where μZ0 = ln(τ0)/2, σZ = (σ²Z)^(1/2), and zα is the upper 100(α)th percentile of the standard normal distribution. The corresponding power function of Fisher’s Z* test can be expressed as

$$ {\pi}_Z\left({\rho}_1,{\rho}_0,G,N\right)=1-{\varPhi}_Z\left\{{z}_{\alpha }-\left({\mu}_{Z1}-{\mu}_{Z0}\right)/{\sigma}_Z\right\}, $$
(7)

where ΦZ(·) is the cumulative distribution function of the standard normal distribution and μZ1 = ln(τ1)/2.
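The approximate power function in Eq. 7 can likewise be sketched in Python (illustrative only; the function name is mine, and `scipy.stats.norm` supplies the normal quantiles):

```python
import math
from scipy.stats import norm

def power_fisher_z(rho1, rho0, G, N, alpha=0.05):
    """Approximate power (Eq. 7) of Fisher's Z* test, using the
    asymptotic mean ln(tau)/2 and variance N / (2(G - 1)(N - 1))."""
    tau0 = 1 + N * rho0 / (1 - rho0)
    tau1 = 1 + N * rho1 / (1 - rho1)
    sigma_z = math.sqrt(N / (2 * (G - 1) * (N - 1)))
    delta = 0.5 * math.log(tau1 / tau0)  # mu_Z1 - mu_Z0
    return 1 - norm.cdf(norm.ppf(1 - alpha) - delta / sigma_z)
```

As with the exact version, setting ρ1 = ρ0 makes the mean shift vanish, so the function returns α.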

To determine sample sizes for assessing the extent of reliability within the one-way random effects model, the power functions πF and πZ given in Eqs. 5 and 7, respectively, can be employed to calculate the sample sizes (G, N) needed to attain the specified power 1 − β for the chosen significance level α and the alternative and null ICC values (ρ1, ρ0). To enhance the applicability of the sample size methodology, the subsequent sections discuss four design configurations that allow for different allocation and cost controls, including settings not described in previous investigations.

Allocation schemes

Since there may be several possible choices of sample sizes in terms of the number of groups and the number of subjects in each group that satisfy the chosen power level in the process of sample size calculations, it is constructive to consider an appropriate design that leads to a unique and optimal result. The following two allocation schemes are discussed because of their potential usefulness. First, the number of subjects in each group is fixed in advance, so the goal is to find the minimum number of groups required to attain the selected power level. Second, the number of groups may be preassigned, and it is desired to seek the smallest number of subjects per group that satisfies the designated power performance.

Study I: The number of subjects in each group is fixed

Assume the number of subjects N per group is determined beforehand. Then the exact power function π F is a monotone function of the number of groups G when all other quantities are treated as constants. It only requires a simple incremental search to find the optimal size G F needed to achieve the target power 1 – β for the chosen significance level α and parameter values (ρ 1, ρ 0). Specifically, G F is the least integer G so that the following inequality is valid

$$ 1-{\varPhi}_F\left[\left({\tau}_0/{\tau}_1\right){F}_{\left(G-1\right),G\left(N-1\right),\alpha}\right]\ge 1-\beta . $$
(8)

Because the cumulative distribution function of the F distribution is embedded in all major statistical packages, the exact computations can be readily conducted with current computing capabilities. The SAS/IML (SAS Institute, 2012) program employed to perform the corresponding sample size calculation is available in the supplementary files.
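The incremental search itself is simple enough to sketch outside SAS/IML. A Python illustration under the assumption that the exact power is monotone in G, as described above (`min_groups_exact` and the `G_max` guard are my own conventions):

```python
from scipy.stats import f

def min_groups_exact(rho1, rho0, N, alpha=0.05, power=0.80, G_max=100_000):
    """Smallest number of groups G satisfying the exact power
    inequality of Eq. 8 for a fixed group size N."""
    ratio = (1 + N * rho0 / (1 - rho0)) / (1 + N * rho1 / (1 - rho1))
    for G in range(2, G_max + 1):
        fcrit = f.ppf(1 - alpha, G - 1, G * (N - 1))
        if 1 - f.cdf(ratio * fcrit, G - 1, G * (N - 1)) >= power:
            return G
    raise ValueError("target power not attainable with G <= G_max")
```

The returned G is the exact requirement GF; by construction, the achieved power at G is at least 1 − β while the power at G − 1 falls short.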

In contrast, Walter et al. (1998) suggested a simple alternative to exact sample size calculations. They showed that the optimal sample size G Z computed by the approximate power function π Z would be the smallest integer G that satisfies the inequality

$$ G\ge 1+\frac{2N{\left({z}_{\alpha }+{z}_{\beta}\right)}^2}{\left(N-1\right){\left\{ \ln \left({\tau}_1/{\tau}_0\right)\right\}}^2}. $$
(9)
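For comparison, Eq. 9 is available in closed form. A hedged Python sketch (the function name is illustrative; the result is rounded up to the next integer, since G must be a whole number of groups):

```python
import math
from scipy.stats import norm

def min_groups_fisher(rho1, rho0, N, alpha=0.05, power=0.80):
    """Number of groups from the closed-form approximation (Eq. 9)
    attributed to Walter et al. (1998)."""
    tau0 = 1 + N * rho0 / (1 - rho0)
    tau1 = 1 + N * rho1 / (1 - rho1)
    za, zb = norm.ppf(1 - alpha), norm.ppf(power)
    g = 1 + 2 * N * (za + zb) ** 2 / ((N - 1) * math.log(tau1 / tau0) ** 2)
    return math.ceil(g)
```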

The suggested formula is explicit and at first sight straightforward to apply. However, computational simplicity is not the only concern in sample size planning. To the author’s knowledge, no research to date has examined the approximate sample size procedure based on Fisher’s transformation in greater detail. Because the numerical justifications in Walter et al.’s (1998) and Zou’s (2012) studies are problematic and incomplete, it is worthwhile to clarify the issues surrounding the adequacy of the technique. To demonstrate the performance of the alternative procedures described in Eqs. 8 and 9, two Monte Carlo simulation studies were conducted. The first reexamined both methods for the model configurations in Zou’s study, and the second extended the appraisal to an extensive set of parameter settings not considered by Zou.

First, the model settings were chosen as those in Zou (2012, Table 2), in which the six combinations of alternative and null ICCs are (ρ1, ρ0) = (.60, .50), (.60, .45), (.70, .60), (.70, .55), (.80, .70), and (.80, .65). Moreover, the four chosen numbers of subjects per group are N = 2, 3, 5, and 10. Hence, a total of 24 different model settings were obtained. With these specifications, the required sample sizes GF and GZ were computed for the two approaches with the chosen power value and significance level. Throughout this empirical investigation, the significance level was set at α = .05. Accordingly, the resulting numbers of groups are presented in Tables 1, 2, and 3 for the nominal powers 1 − β = .5, .8, and .9, respectively. In addition, the achieved or estimated powers were also calculated with the computed sample sizes. Because of the underlying metric of integer sample sizes, the estimated powers are marginally larger than the nominal level for both methods.

Table 3 Numbers of groups, simulated power, and estimated power of the exact approach and approximate method for H0: ρ = ρ0 versus H1: ρ > ρ0, with α = .05 and nominal power 1 − β = .90

Then, estimates of the true power associated with a given sample size (GF or GZ) and parameter configuration (ρ1, ρ0, N) were computed via Monte Carlo simulation of 10,000 independent data sets. For each replicate, the test statistic F* was generated with F* ~ τ1·F[G − 1, G(N − 1)]. Next, the simulated power of the exact F test was the proportion of the 10,000 replicates whose test statistics F* exceeded the corresponding critical value τ0·F(G−1), G(N−1), α. On the other hand, the simulated power for the approximate Z test was the proportion of the 10,000 replicates whose test statistics Z* = ln(F*)/2 were larger than the associated critical value μZ0 + zα·σZ. The adequacy of the power and sample size calculations is determined by the difference between the simulated power and the estimated power computed earlier. The simulated powers and errors are also summarized in Tables 1, 2, and 3 for the three designated power levels.
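A minimal Python sketch of this simulation scheme for the exact F test (illustrative only; the seed and function name are my own conventions, with the number of replicates as in the article):

```python
import numpy as np
from scipy.stats import f

def simulated_power_F(rho1, rho0, G, N, alpha=0.05, reps=10_000, seed=1):
    """Monte Carlo power of the exact F test: draw
    F* ~ tau1 * F[G - 1, G(N - 1)] and count rejections at the
    tau0-scaled critical value."""
    rng = np.random.default_rng(seed)
    tau0 = 1 + N * rho0 / (1 - rho0)
    tau1 = 1 + N * rho1 / (1 - rho1)
    f_star = tau1 * rng.f(G - 1, G * (N - 1), size=reps)
    crit = tau0 * f.ppf(1 - alpha, G - 1, G * (N - 1))
    return float(np.mean(f_star > crit))
```

The Z test’s simulated power follows the same pattern, with the rejection criterion replaced by ln(F*)/2 > μZ0 + zα·σZ.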

An inspection of the reported sample sizes in Tables 1, 2, and 3 reveals that, in general, the approximate method gives sample sizes identical to or marginally larger than those of the exact approach. Also, the computed sample sizes of both methods decrease with an increasing number of subjects per group. More importantly, it can be seen from the discrepancies between the simulated and estimated powers that the exact approach performs very well for the range of model specifications considered here. In particular, the absolute errors are less than .01 for all but three cases. The resulting errors of the three exceptions are .0140, −.0156, and −.0131 for (ρ1, ρ0, N) = (.6, .45, 3) and (.8, .65, 5) in Table 1, and for (ρ1, ρ0, N) = (.8, .65, 5) in Table 2, respectively.

However, the approximate method of Walter et al. (1998) is problematic, because only 28 out of 72 cases had an absolute error less than or equal to .01 (7, 10, and 11 cases in Tables 1, 2, and 3, respectively). Specifically, the case with (ρ1, ρ0, N) = (.8, .65, 10) incurs the largest power differences of −.0590, −.0361, and −.0254 in Tables 1, 2, and 3, respectively. In contrast, the corresponding errors of the exact approach are −.0052, −.0050, and −.0031 for the three setups. This poor performance implies that their technique fails to produce accurate sample sizes under most of the conditions examined here. Moreover, the numerical results suggest a clear pattern: when all other factors remain constant, the accuracy of the simple sample size formula deteriorates as the number of groups decreases. Although the examined parameter settings are exactly the same as those in Zou (2012, Table 2), this undesirable property was not addressed in his assessment of the approximate method. In short, the sample size formula of Walter et al. (1998) defined in Eq. 9 may be too simple to account for the essential features embedded in the model configurations.

To demonstrate that the contrasting behaviors of the two sample size procedures persist in other situations, further numerical investigations were conducted with a wide range of model configurations. In the second study, the focus is on the sample size calculations (GF and GZ) and power evaluations for a nominal power of .80. Unlike the previous study, with its rather restrictive combinations of ρ1 and ρ0, the possible magnitudes of ρ1 were extended to the range of .1 to .9 with an increment of .1, and for each ρ1 the matching values of ρ0 were set as .0 to (ρ1 − .1) with an increment of .1. Overall, these considerations result in a total of 45 joint settings of ρ1 and ρ0. Similar to the preceding assessment, the computed numbers of groups (GF and GZ), simulated powers, estimated powers, and errors were obtained for N = 2, 3, 4, 5, 10, and 20. Essentially, these model configurations are identical to those in Walter et al. (1998, Table 2), in which the sample sizes GZ computed with the simple procedure were reported. To conserve space, only the results for N = 2, 5, and 20 are presented in Tables 4, 5, and 6, respectively. In addition, Table 7 summarizes the maximum absolute error over the 45 combined configurations (ρ1, ρ0) for each fixed value of N = 2, 3, 4, 5, 10, and 20.

Table 4 Numbers of groups, simulated power, and estimated power of the exact approach and approximate method for H0: ρ = ρ0 versus H1: ρ > ρ0, with α = .05, nominal power 1 − β = .80, and number of subjects per group N = 2
Table 5 Numbers of groups, simulated power, and estimated power of the exact approach and approximate method for H0: ρ = ρ0 versus H1: ρ > ρ0, with α = .05, nominal power 1 − β = .80, and number of subjects per group N = 5
Table 6 Numbers of groups, simulated power, and estimated power of the exact approach and approximate method for H0: ρ = ρ0 versus H1: ρ > ρ0, with α = .05, nominal power 1 − β = .80, and number of subjects per group N = 20
Table 7 Maximum absolute error between the simulated power and estimated power of the exact approach and approximate method for H0: ρ = ρ0 versus H1: ρ > ρ0, with α = .05, nominal power 1 − β = .80, and numbers of subjects per group N = 2, 3, 4, 5, 10, and 20

According to the extensive numerical results, there is an undesirable phenomenon: the accuracy of the approximate sample size formula varies with the computed size G, especially when the number of subjects per group N is large. The discrepancies between the simulated and estimated powers for N = 3, 4, 5, 10, and 20 reflect the asymptotic nature of the approximate method, in that the accuracy of the simple formula improves with larger sizes G. However, the induced errors also indicate that the performance of the approximate method is noticeably unstable and in several cases disturbing. Specifically, the computed results for N = 20 in Table 6 show that the errors of the approximate method are −.0296, −.0190, −.0226, and −.0204 when (ρ1, ρ0, G) = (.3, .2, 61), (.4, .3, 79), (.5, .4, 91), and (.6, .5, 87), respectively. Note that the corresponding total sample sizes are GN = 1,220, 1,580, 1,820, and 1,740; this evidence confirms that the approximate method can be questionable even for large sample sizes (>1,000). Thus, the results for N = 2 are the only generally acceptable cases, for the obvious reason that the required sample sizes G are relatively larger than those in the other cases. Moreover, it is clear from Table 7 that the maximum absolute errors of the approximate method are substantially larger than those of the exact approach. In contrast, the simulated powers of the exact procedure remain close to the estimated levels in all situations. Even when the number of groups G is as small as 2, the incurred errors all fall within the small range of −.0089 to .008. It can be concluded that although computation is slightly more involved for the exact procedure, the extra complexity is outweighed by its superior accuracy.

Study II: The number of groups is fixed

Consider the situation in which the number of groups G is fixed in advance. The task is then to identify the number of subjects N required to attain the desired power level. With the power function πF given in Eq. 5, the minimum sample size NF needed for the exact F* test to achieve the specified power 1 − β can be found by a simple iterative search for the chosen significance level α and parameter values (ρ1, ρ0, G). This scenario does not appear to have been previously studied in Walter et al. (1998) or Zou (2012). Interestingly, because both the asymptotic mean μZ and variance σ²Z are functions of N, the approximate Fisher’s Z* test does not yield a closed-form expression for the present problem of sample size calculation.
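The iterative search over N can be sketched in Python (illustrative only; the `N_max` cap and the `None` return value for unattainable cases are my own conventions, not the article’s):

```python
from scipy.stats import f

def min_subjects_exact(rho1, rho0, G, alpha=0.05, power=0.80, N_max=10_000):
    """Smallest number of subjects per group N attaining the target
    power for a fixed number of groups G; returns None when no
    N <= N_max suffices (the power can plateau below the target)."""
    for N in range(2, N_max + 1):
        ratio = (1 + N * rho0 / (1 - rho0)) / (1 + N * rho1 / (1 - rho1))
        fcrit = f.ppf(1 - alpha, G - 1, G * (N - 1))
        if 1 - f.cdf(ratio * fcrit, G - 1, G * (N - 1)) >= power:
            return N
    return None
```

Unlike the fixed-N case, the search may fail: for small G, increasing N alone cannot push the power past the target, which is why a cap and a failure value are needed.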

For illustration, when α = .05 and 1 − β = .80, the sample sizes N are presented in Table 8 for selected values of (ρ1, ρ0) and G = 10, 20, 30, 40, and 50. It is apparent from the reported results that the sample sizes N increase with a decreasing value of G or with a decreasing distance between ρ1 and ρ0 when all other factors are fixed. The determination of the optimal sample size NF involves an iterative algorithm and requires the same power computation as in Eq. 8. However, the number of groups G plays a dominant role in the power function πF, and the search does not always furnish an optimum NF, or does not give practically useful results, when G is small and/or ρ1 is near ρ0. Accordingly, “NA” is used to denote such occurrences in the table. It should be noted that the influence of each of the components (α, ρ1, ρ0, G) on the power behavior not only differs but also depends on the concurrent impact of the other factors. Without a detailed appraisal, one may unknowingly adopt a miscomputed sample size that leads to distorted power performance and an unsatisfactory research outcome for the planned study. To enhance the usefulness of the recommended exact methodologies, the sample size procedures for the left-tailed test of H0: ρ = ρ0 versus H1: ρ < ρ0 and the two-tailed test of H0: ρ = ρ0 versus H1: ρ ≠ ρ0 can be readily established, but the details are not given here. The corresponding SAS/IML (SAS Institute, 2012) computer programs are available as supplementary materials.

Table 8 Numbers of subjects per group of the exact approach for H0: ρ = ρ0 versus H1: ρ > ρ0, with α = .05, nominal power 1 − β = .80, and numbers of groups G = 10, 20, 30, 40, and 50

Cost schemes

In addition to the prescribed design schemes of participant allocations, it is often sensible to consider the issue of statistical power in the presence of funding constraints. The cost of a reliability study can be represented by the overhead costs and sampling costs through the following linear cost function (Eliasziw & Donner, 1987)

$$ C={C}_O+{C}_GG+{C}_NN+{C}_{GN} GN, $$
(10)

where CO is the overhead cost of the study, CG reflects costs proportional to the number of groups, CN denotes costs proportional to the number of subjects per group, and CGN stands for costs proportional to both the number of groups and the number of subjects per group. It is important to note that the total number of subjects can be viewed as a special case of the cost function C, with CO = CG = CN = 0 and CGN = 1. Two questions then arise naturally in choosing the optimal sample sizes. First, what is the least cost for an investigation to maintain its desired power level? Second, how can the maximum power be achieved in a study with a limited budget? In the following, the two cost implications associated with conducting reliability studies are demonstrated.
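In code, Eq. 10 is a one-liner; in this illustrative sketch the default coefficients reduce the cost to the total-number-of-subjects special case noted above (parameter names are mine):

```python
def linear_cost(G, N, C_O=0.0, C_G=0.0, C_N=0.0, C_GN=1.0):
    """Linear cost of a reliability study (Eq. 10). The defaults
    (C_O = C_G = C_N = 0, C_GN = 1) reduce C to the total number
    of subjects GN."""
    return C_O + C_G * G + C_N * N + C_GN * G * N
```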

Study I: Target power is fixed and total cost needs to be minimized

To develop a systematic search for the optimal solution that ensures the nominal power performance while minimizing the cost function C defined in Eq. 10, detailed power calculations and cost evaluations are conducted. First, the previous algorithm is applied to find the optimal number of groups Gmax needed to achieve the target power 1 − β for the specified significance level α and parameter values (ρ1, ρ0) with the number of subjects per group N = 2. Second, a sequence of sample size calculations is performed to determine the optimal numbers of subjects per group, denoted by (N2, . . ., NGmax−1), required to achieve the target power 1 − β for the specified significance level α and parameter values (ρ1, ρ0) for G = 2 to (Gmax − 1), respectively. Again, this is a straightforward application of the exact approach described earlier in the second case of allocation schemes. Then the optimal solution (G*, N*) is the pair of values (G, N) giving the smallest cost among all combinations (G, N) = {(2, N2), (3, N3), . . ., (Gmax − 1, NGmax−1), (Gmax, 2)}. In cases in which more than one combination yields the same least cost, the one producing the largest power is reported. Accordingly, a special-purpose computer program was developed to perform the necessary computation.
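The steps above can be sketched in Python (an illustration under stated assumptions: `power_exact` evaluates Eq. 5, `cost` is any function of (G, N) such as Eq. 10, and the caps on G and N are practical guards, not part of the article’s procedure):

```python
from scipy.stats import f

def power_exact(rho1, rho0, G, N, alpha=0.05):
    """Exact power of the one-sided ANOVA F test (Eq. 5)."""
    tau0 = 1 + N * rho0 / (1 - rho0)
    tau1 = 1 + N * rho1 / (1 - rho1)
    fcrit = f.ppf(1 - alpha, G - 1, G * (N - 1))
    return 1 - f.cdf((tau0 / tau1) * fcrit, G - 1, G * (N - 1))

def least_cost_design(rho1, rho0, cost, alpha=0.05, power=0.80,
                      G_cap=1000, N_cap=1000):
    """Cheapest (G, N) attaining the target power; ties are broken
    by the larger power."""
    # Step 1: fewest groups Gmax attaining the power at N = 2
    Gmax = next((G for G in range(2, G_cap + 1)
                 if power_exact(rho1, rho0, G, 2, alpha) >= power), None)
    if Gmax is None:
        return None
    # Step 2: for each G < Gmax, the smallest workable N (if any)
    candidates = [(Gmax, 2)]
    for G in range(2, Gmax):
        for N in range(3, N_cap + 1):
            if power_exact(rho1, rho0, G, N, alpha) >= power:
                candidates.append((G, N))
                break
    # Step 3: minimize cost, breaking ties by the maximal power
    return min(candidates,
               key=lambda gn: (cost(gn[0], gn[1]),
                               -power_exact(rho1, rho0, gn[0], gn[1], alpha)))
```

Passing `cost = lambda G, N: G * N` recovers the minimum-total-subjects problem discussed in the text.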

Alternatively, Eliasziw and Donner (1987) presented a two-step procedure to find the optimal result. The first step of their procedure is to fit a polynomial regression function of G on N from the power contours in Donner and Eliasziw (1987). In the second step, the optimal solution is obtained by employing the method of Lagrange multipliers to minimize the cost function subject to the fitted polynomial function derived in the first step. An obvious limitation of their method is its reliance on a power contour in the solution of the polynomial regression equation. To arrive at a final selection, the procedure entails laborious manipulation and incorporates some degree of subjective judgment. More importantly, the following numerical investigations show that their method does not always produce the optimal sample size. In addition, Walter et al. (1998) also considered the optimal design problem in which the total number of subjects was minimized. Specifically, they applied the simple formula in Eq. 9 to compute the required number of groups G over a range of values of N, and then identified the optimal combination from the minimum total number of subjects GN. However, their optimization procedure is still approximate in nature and possesses the same drawbacks described in the preceding section.

To demonstrate the advantage and importance of the suggested technique, a numerical examination of optimal sample size calculations was performed in which the total number of subjects is minimized for the model settings of Walter et al. (1998, Table 4), with ρ1 = ρ0 + .2 to .8 in increments of .2 for ρ0 = .0, .2, .4, and .6. The optimal combinations (G, N) of the suggested approach are reported in Table 9, with α = .05 and 1 − β = .80, for the ten resulting settings of (ρ0, ρ1). Note that the sample sizes (G, N) listed in Table 9 for the procedures of Eliasziw and Donner (1987) and Walter et al. (1998) are exactly the values presented in Walter et al. (1998, Table 4). It can readily be seen from Table 9 that substantial discrepancies exist among the three procedures. For the Lagrange multiplier method of Eliasziw and Donner (1987), only five of their ten reported sample sizes coincide with the optimal solutions of the proposed approach. For the case (ρ0, ρ1) = (.2, .4), the total number of subjects of their optimal selection (G, N) = (36, 5) is the same as that of the proposed optimal combination (G, N) = (45, 4). However, the estimated powers of (G, N) = (36, 5) and (45, 4) are .8009 and .8015, respectively. Although the difference is not sizable, it shows that their method is not fine-grained enough to resolve this subtle distinction. More troubling, the computed sample sizes (G, N) = (7, 3), (4, 2), and (9, 3) do not even reach the target power level .80 for (ρ0, ρ1) = (0, .6), (0, .8), and (.4, .8), respectively. On the other hand, the simple formula of Walter et al. (1998) provides correct results on just four occasions, and the computed total number of subjects overestimates the optimal value on the other six. Hence, the existing techniques of Eliasziw and Donner (1987) and Walter et al. (1998) are not guaranteed to give the optimal sample sizes and should not be used indiscriminately. Consequently, it is worthwhile to conduct the suggested sample size computations.

Table 9 Optimal sample sizes (G, N) and total numbers of subjects of different procedures when the total number of subjects is to be minimized for H0: ρ = ρ0 versus H1: ρ > ρ0, with target power 1 − β = .80 and α = .05

Study II: Total cost is fixed and actual power needs to be maximized

In contrast to the previous situation, in which power was prechosen, both power and cost can be accommodated by finding the allocation that maximizes power when the total cost is fixed. Accordingly, the results may be useful for researchers in justifying their preliminary design and financial support. Due to the discrete character of sample size, the optimal allocation can be found by screening the sample size combinations that attain the maximum power subject to the cost constraint. First, when N = 2, the maximum number of groups Gmax is computed as Gmax = Floor{(C − CO − 2CN)/(CG + 2CGN)} for the specified total cost C and cost coefficients (CO, CG, CN, CGN), where the function Floor(a) returns the largest integer less than or equal to a. Then a detailed power calculation and comparison is performed for the sample size combinations (G, N) with N = Floor{(C − CO − CG·G)/(CN + CGN·G)} for G = 2 to Gmax. Ultimately, the optimal sample size allocation is the one giving the largest power.
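The arithmetic of this screening step can be sketched as follows. The cost function C(G, N) = CO + CG·G + CN·N + CGN·G·N is the form implied by the two Floor expressions in the text, and the helper name is hypothetical, not from the supplemental program.

```python
from math import floor

def affordable_designs(C, c_o, c_g, c_n, c_gn):
    """Enumerate the candidate (G, N) pairs that exhaust the budget C under
    the assumed cost function C(G, N) = c_o + c_g*G + c_n*N + c_gn*G*N."""
    # Largest number of groups affordable with the minimum group size N = 2.
    g_max = floor((C - c_o - 2 * c_n) / (c_g + 2 * c_gn))
    designs = []
    for G in range(2, g_max + 1):
        # Largest per-group size affordable for this number of groups.
        N = floor((C - c_o - c_g * G) / (c_n + c_gn * G))
        if N >= 2:
            designs.append((G, N))
    return designs
```

Each candidate is then scored with the exact power function, and the pair yielding the largest power is retained. With CO = CG = CN = 0 and CGN = 1, the budget reduces to a cap on the total number of subjects GN, which is the setting of Table 10.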

When CO = CG = CN = 0 and CGN = 1, empirical results for the optimal sample size allocation, total number of subjects, and actual power are given in Table 10 for the model configurations α = .05, 1 − β = .80, and a fixed total number of subjects C = 1,000. Examination of the results in Table 10 reveals that the actual power increases as the difference between ρ1 and ρ0 increases, and that the specific parameter values (ρ0, ρ1) have an important impact on the attainable power for a given total cost. The proposed optimal algorithms under the two cost considerations can be readily extended to the left-tailed test of H0: ρ = ρ0 versus H1: ρ < ρ0 and the two-tailed test of H0: ρ = ρ0 versus H1: ρ ≠ ρ0. Accordingly, these options are accommodated in the supplemental programs.
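The two-tailed extension is straightforward under the same exact-F framework: the test rejects when the scaled F statistic falls beyond either tail. The following sketch is an illustration of that idea, not the supplemental program itself; it assumes the standard λ-scaling of the F statistic and uses SciPy for the F distribution.

```python
# Illustrative sketch of two-tailed exact power, H0: rho = rho0 vs H1: rho != rho0.
from scipy.stats import f

def exact_power_two_tailed(G, N, rho0, rho1, alpha=0.05):
    """Assumes (MSB/MSW)/lambda ~ F(G-1, G(N-1)),
    with lambda = (1 + (N - 1) * rho) / (1 - rho)."""
    df1, df2 = G - 1, G * (N - 1)
    lam0 = (1 + (N - 1) * rho0) / (1 - rho0)
    lam1 = (1 + (N - 1) * rho1) / (1 - rho1)
    ratio = lam0 / lam1
    upper = f.ppf(1 - alpha / 2, df1, df2)  # upper critical value under H0
    lower = f.ppf(alpha / 2, df1, df2)      # lower critical value under H0
    # Rejection probability under rho1: mass beyond either critical value.
    return f.sf(ratio * upper, df1, df2) + f.cdf(ratio * lower, df1, df2)
```

As a sanity check, the power at ρ1 = ρ0 equals the significance level α, since the ratio λ0/λ1 is then one and the two tail masses sum to α.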

Table 10 Optimal sample sizes (G, N) and estimated power of the proposed procedure when the maximum number of subjects is C = 1,000 for H0: ρ = ρ0 versus H1: ρ > ρ0, with α = .05

Conclusions

In the advance planning of reliability studies, the issue of sample size requirements for ensuring adequate statistical power has received considerable attention. As an alternative to exact sample size calculations, the simplified procedure based on Fisher’s transformation is often recommended for its convenient closed-form expression and computational ease. This study reevaluated the approximate method and compared its performance with that of the exact approach under various design and cost considerations. The empirical examinations showed that the sample size formula constructed from Fisher’s transformation provides satisfactory results only in situations with large sample sizes and small differences between the null and true intraclass correlation coefficients. Moreover, the simple approximation may not yield the optimal sample size allocations that attain the desired power with the least total sample size. To remedy the deficiencies of computational shortcuts in sample size determination, computer programs are provided to facilitate use of the suggested exact procedures for the optimal design of reliability studies. Consequently, this research presents a comprehensive and updated treatment of the exact and approximate sample size techniques, in such a way that the findings not only reveal the fundamental limitations of the existing methods but also offer well-supported solutions to sample size calculation in the context of the one-way random effects model.