
Ecological Modelling

Volume 211, Issues 1–2, 24 February 2008, Pages 1–10

Review
A review and comparison of four commonly used Bayesian and maximum likelihood model selection tools

https://doi.org/10.1016/j.ecolmodel.2007.10.030

Abstract

Many tools are available to biologists for evaluating competing ecological models: models may be judged on their fit to data alone (e.g. likelihood), or more formal statistical criteria may be used. Because each tool carries implicit assumptions, model selection criteria should be chosen a priori for the problem at hand; a model that is considered ‘good’ in its explanatory power may not be the best choice for a problem that requires prediction. In this paper, I review the behavior and assumptions of four of the most commonly used statistical criteria (Akaike's Information Criterion, AIC; the Schwarz or Bayesian Information Criterion, BIC; the Deviance Information Criterion, DIC; and Bayes factors). Second, I illustrate differences among these model selection tools by applying the four criteria to thousands of simulated abundance trajectories. With the simulation model known, I examine whether each criterion is useful in selecting models to evaluate simple questions, such as whether time series support evidence of density dependent population growth. Across simulations, the maximum likelihood criteria consistently favored simpler population models than did the Bayesian criteria. Among the Bayesian criteria, the Bayes factor favored the correct simulation model more frequently than the Deviance Information Criterion. Although there was considerable uncertainty in the ability of the Bayes factor to discriminate between models, this tool selected the simulation model slightly more frequently than the other approaches.

Introduction

The purpose of mathematical modeling in biology is to provide a mechanism for evaluating scientific and statistical hypotheses using observational data (Lewin-Koh et al., 2004). Models may serve as both explanatory and predictive tools; if a model appears to realistically track variation in abundance through time, it may be used to predict future population sizes, set harvest quotas, or assess extinction risk. Population models are often complex, involving age- and sex-structured dynamics, predator–prey interactions, effects of both time and space, and potentially time lags. In developing statistical models for inference, researchers must decide a priori which, if any, of these factors may be important in the population being modeled. A second decision researchers must make is how model performance should be assessed. Model performance is defined broadly as the ability of a model to meet some specified objective (e.g. how well a model explains observed data, makes predictions, or minimizes risk). Statistical tools allow models to be compared relative to one another, but each criterion differs in what is considered a ‘good’ model. Differences between model selection criteria are rooted in both the philosophy of science and statistics (Maurer, 2004). Criteria may view good models to be those that minimize Type I and Type II error rates (Weiss, 1997), or those that involve the simplest explanations possible, minimizing the tradeoff between bias and variance (the principle of parsimony or Occam's razor; Forster, 2000, Burnham and Anderson, 2002). Because these tools are so widely used in the biological sciences (Johnson and Omland, 2004), it is important to understand the differences in the assumptions and performance of each criterion.

Over the last 20 years, the most widely used model selection tool in ecology has been Akaike's Information Criterion (AIC; Akaike, 1973). AIC began to see widespread use by biologists in the 1990s after several papers illustrated applications to capture–recapture analyses (Anderson et al., 1994, Burnham et al., 1995). AIC is computed as $\mathrm{AIC} = D(\hat{\theta}_{\mathrm{MLE}}) + 2K$, where $D(\cdot)$ represents the deviance function (twice the negative log-likelihood), $\hat{\theta}_{\mathrm{MLE}}$ the vector of maximum likelihood parameter estimates (MLEs), and $K$ the number of model parameters. The second term in the AIC calculation has been interpreted as a measure of model order or complexity (Burnham and Anderson, 2002). A popular variant of AIC is the small-sample AICc, where the complexity term $2K$ is replaced by $2Kn/(n - K - 1)$ (Hurvich and Tsai, 1995), where $n$ represents the sample size. With small sample sizes, AICc favors models with fewer parameters compared to AIC, and as the sample size becomes relatively large, the behavior of AICc and AIC converges (Fig. 1). AIC has inspired several additional variants to account for overdispersion (QAIC), correlations between model parameters (CAICF), and situations where models might not be good approximations to truth (TIC; Burnham and Anderson, 2002). These criteria are beyond the scope of this review, as they are not as commonly used as AIC or AICc.
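To make the two penalty terms concrete, the short sketch below computes AIC and AICc from a hypothetical deviance value; the numbers are illustrative only.

```python
def aic(deviance, k):
    """AIC = D(theta_MLE) + 2K, where deviance = -2 * log-likelihood at the MLE."""
    return deviance + 2 * k

def aicc(deviance, k, n):
    """Small-sample AICc: the 2K penalty is replaced by 2Kn / (n - K - 1)."""
    return deviance + 2 * k * n / (n - k - 1)

# Hypothetical deviance of 100 for a 5-parameter model:
print(aic(100.0, 5))         # 110.0
print(aicc(100.0, 5, 20))    # 114.29 -- heavier penalty at n = 20
print(aicc(100.0, 5, 2000))  # 110.03 -- AICc converges to AIC as n grows
```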

A second important model selection criterion that is frequently compared to AIC is the Bayesian Information Criterion (BIC, or Schwarz Information Criterion; Schwarz, 1978). While the two criteria may appear similar, BIC has a completely independent derivation (Burnham and Anderson, 2004). Like AIC, the calculation of BIC involves two terms: $\mathrm{BIC} = D(\hat{\theta}_{\mathrm{MLE}}) + K\ln(n)$. The first term is identical to that used in the calculation of AIC (representing the model fit to data); the complexity term, however, is slightly different, being a function of both the number of parameters ($K$) and the sample size ($n$). When $n < 7.4$ (so that $\ln(n) < 2$), BIC assigns more weight to complex models than does AIC, but as the sample size increases, BIC assigns more weight to simpler models when compared to AIC (Fig. 2; Raftery, 1995, Forster, 2000, Burnham and Anderson, 2002). Generally, BIC is interpreted as a rough approximation to the logarithm of the Bayes factor (Kass and Raftery, 1995). BIC exists in somewhat of a grey area: while the computation of the criterion is not Bayesian, there are some situations where the BIC model weights may be interpreted as posterior model probabilities (Raftery, 1999, Link and Barker, 2005). Although BIC does not require the explicit specification of a prior distribution on model parameters, the implicit prior assumed by BIC is a multivariate normal distribution centered on the MLE (with a covariance matrix equal to the inverse of the Fisher information matrix). This prior is also known as the unit information prior, because it contributes approximately the same information as one additional data point (Raftery, 1995). BIC assumes equal priors on each model, so that if M models are considered, the prior on each is 1/M. In cases where the implicit prior is similar to the prior used in the Bayes factor calculation, BIC can be used to calculate posterior probabilities for each model being considered (posterior probabilities could be computed using normalized BIC weights; Burnham and Anderson, 2002). BIC is known to be poorly suited for some problems (Stone, 1979), particularly when the multivariate normal distribution is not a reasonable prior and few data points exist (skewed parameters may include variance parameters or ratios, and non-linear density dependence or scaling parameters). It should be noted that AIC weights have recently been interpreted in a Bayesian setting, using the logic of BIC (Burnham and Anderson, 2004).
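A minimal sketch of the BIC calculation and its normalized weights follows. The deviance values are hypothetical, and the interpretation of the weights as posterior model probabilities holds only under the equal-prior and unit information prior conditions described above.

```python
import numpy as np

def bic(deviance, k, n):
    """BIC = D(theta_MLE) + K * ln(n)."""
    return deviance + k * np.log(n)

def bic_weights(bic_values):
    """Normalized BIC weights; under equal 1/M model priors (and when the
    implicit unit information prior is reasonable) these approximate
    posterior model probabilities."""
    b = np.asarray(bic_values, dtype=float)
    rel_lik = np.exp(-0.5 * (b - b.min()))
    return rel_lik / rel_lik.sum()

# BIC's penalty per parameter is ln(n): lighter than AIC's 2 when
# n < e^2 (about 7.4), heavier for larger samples.
models = [bic(100.0, 2, 30), bic(96.0, 4, 30)]
print(bic_weights(models))
```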

One downside to the conventional forms of AIC and BIC is that they are not well suited for complicated ecological models that include hidden states with non-Gaussian errors, or hierarchical parameters (Burnham and Anderson, 2002, Vaida and Blanchard, 2005). Spiegelhalter et al. (2002) proposed using the Deviance Information Criterion (DIC) as a Bayesian equivalent of AIC for hierarchical models. Like BIC and AIC, the calculation of DIC involves a measure of model fit and a measure of model complexity: $\mathrm{DIC} = D(\hat{\theta}) + 2p_D$, where $D(\hat{\theta})$ represents the deviance evaluated at some point estimate of the joint posterior, and $p_D$ represents the effective number of model parameters. The DIC equation closely resembles the calculation of AIC, and in the absence of informative priors on model parameters, the two criteria are expected to be equal (Ellison, 2004). The term $p_D$ can also be expressed as the deviance evaluated at the expected posterior values of the parameters subtracted from the mean deviance across all sampled parameter vectors, $p_D = \overline{D(\theta)} - D(\hat{\theta})$. The mean deviance and mean parameter values were originally proposed by Spiegelhalter et al. (2002) as plug-in estimates (and are currently implemented as the default option in WinBUGS; Spiegelhalter et al., 2003), however the posterior median or mode may also be appropriate (e.g. Celeux et al., 2006). Proponents of DIC argue that it is a Bayesian equivalent of AIC (Spiegelhalter et al., 2002), however the similarity between the two methods is still being investigated.
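Given MCMC output, the DIC calculation reduces to a few lines. The sketch below assumes the deviance has already been evaluated at each posterior draw and at the chosen plug-in point estimate (the posterior mean by default).

```python
import numpy as np

def dic(deviance_draws, deviance_at_point):
    """DIC = D(theta_hat) + 2 * pD, with pD = mean(D(theta)) - D(theta_hat).

    deviance_draws    : deviance evaluated at each MCMC draw of theta
    deviance_at_point : deviance at a posterior point estimate (the
                        posterior mean is the WinBUGS default; the
                        median or mode may also be used)
    """
    p_d = np.mean(deviance_draws) - deviance_at_point  # effective parameters
    return deviance_at_point + 2 * p_d                 # = mean deviance + pD
```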

A final model selection criterion that has seen increased use by ecologists in the last decade is the Bayes factor (Jeffreys, 1935, Kass, 1993, Kass and Raftery, 1995). The Bayes factor for two models ($M_1$ and $M_2$) is the ratio of their marginal likelihoods, equivalent to the ratio of posterior odds to prior odds: $\mathrm{BF} = \frac{P(\underline{x}|M_1)}{P(\underline{x}|M_2)} = \frac{P(M_1|\underline{x})/P(M_2|\underline{x})}{P(M_1)/P(M_2)}$ (Good, 1958). Because the Bayes factor updates the prior distribution upon observing the data, it is essentially a measure of how much a researcher learns by observing data (Bernardo and Smith, 2000). If more than two models are being compared, and all models are given equal prior weight, the Bayes factor in favor of model $j$ becomes $\mathrm{BF}_j = P(\underline{x}|M_j)/\sum_i P(\underline{x}|M_i)$. There are several approaches for approximating the marginal likelihood of the data, $P(\underline{x}|M_j)$, all of which are computationally intensive. For this analysis, I adopted the approach proposed by Gelfand and Dey (1994): $P(\underline{x}|M_j) = \left[\frac{1}{n}\sum_{i=1}^{n} \frac{h(\theta_i)}{P(\theta_i|\underline{x})}\right]^{-1}$, where $n$ is the number of Markov chain Monte Carlo (MCMC) samples, $P(\theta_i|\underline{x})$ is the posterior probability of parameter vector $i$ for model $j$, and $h(\theta_i)$ represents an importance function evaluated at parameter vector $i$ (a multivariate normal density centered at the posterior mode, with a covariance matrix estimated from the MCMC chain). Aside from computational challenges, the primary problem with implementing Bayes factors is the specification of prior distributions. Bayes factors are known to be unstable and sensitive to the choice of priors (Kadane and Lazar, 2004), and only proper priors may be considered (valid probability distributions that integrate to 1.0; Kass and Raftery, 1995). If no information about model parameters is known, a default option is to construct non-informative prior distributions such as the Jeffreys prior (Kass and Wasserman, 1996). What makes the Bayes factor unique relative to the other criteria discussed here is that it does not explicitly include a term that quantifies model complexity; the penalty still exists, however, because overly complex models are penalized in the marginal likelihood calculation (e.g. Dawid's discussion in Spiegelhalter et al., 2002).
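The Gelfand and Dey (1994) estimator can be sketched as follows. This version assumes the MCMC draws and the unnormalized log posterior at each draw are available; it centers the importance function on the sample mean rather than the posterior mode used in this analysis, and works on the log scale for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_gelfand_dey(samples, log_posterior):
    """Gelfand and Dey (1994) estimate of log P(x | M).

    samples       : (n, d) array of MCMC draws of the parameter vector
    log_posterior : (n,) unnormalized log posterior (log-likelihood plus
                    log-prior) at each draw
    """
    # Importance function h: multivariate normal fitted to the chain
    mu = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    log_h = multivariate_normal.logpdf(samples, mean=mu, cov=cov)
    # 1 / P(x) ~ (1/n) * sum_i h(theta_i) / P(theta_i | x), done in logs
    log_ratio = log_h - log_posterior
    log_inv_marginal = np.logaddexp.reduce(log_ratio) - np.log(len(log_ratio))
    return -log_inv_marginal

# The Bayes factor for M1 vs. M2 is then
# np.exp(log_marginal_gelfand_dey(s1, lp1) - log_marginal_gelfand_dey(s2, lp2))
```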

There are several important differences between AIC, BIC, DIC, and Bayes factors (summarized in Table 1). All maximum likelihood and Bayesian criteria are alike in that model selection is linked to parameter estimation. In the calculation of AIC and BIC, parameter estimation is done by maximizing the likelihood in an attempt to find the single best point estimate. Parameters for DIC and Bayes factors are estimated using Bayesian methods, which integrate rather than maximize over the parameter space (Hobbs and Hilborn, 2006). While both DIC and Bayes factors incorporate parameter uncertainty and correlation in the sampling of the joint posterior distribution, one criticism of AIC and BIC is that neither considers parameter uncertainty in its calculations. This is not a maximum likelihood versus Bayesian issue; several maximum likelihood criteria, including the Information Complexity Criterion (ICOMP, Table 1; Bozdogan, 2000), do include parameter correlation and uncertainty, although these methods have seen little use in biology. In principle, the calculation of ICOMP is similar to AIC: $\mathrm{ICOMP} = D(\hat{\theta}_{\mathrm{MLE}}) + \ln(n)\,C(\Sigma)$, where $C(\cdot)$ represents a complexity function based on the parameter variance–covariance matrix ($\Sigma$).
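The complexity function $C(\cdot)$ is not specified above; purely as an illustration, the sketch below uses Bozdogan's C1 measure (the log ratio of the arithmetic to the geometric mean of the eigenvalues of $\Sigma$), which is one common choice but an assumption here.

```python
import numpy as np

def icomp(deviance, cov, n):
    """Sketch of ICOMP = D(theta_MLE) + ln(n) * C(Sigma), assuming
    Bozdogan's C1: C1(Sigma) = (s/2) ln(tr(Sigma)/s) - (1/2) ln det(Sigma).
    C1 is zero for uncorrelated, equal-variance parameters and grows with
    parameter correlation and unequal uncertainty."""
    s = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    c1 = 0.5 * s * np.log(np.trace(cov) / s) - 0.5 * logdet
    return deviance + np.log(n) * c1
```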

From a philosophical point of view, there are also differences in the objectives of these model selection criteria. Myung (2000) divided model selection tools into two groups: generalization-based criteria and explanation-based criteria. Generalization-based criteria (AIC and ICOMP) seek the model that best fits both the current data and hypothetical future data that may be observed from the same process that generated the original sample. Explanation-based criteria (BIC and Bayes factors) are concerned with identifying the process from which the data arose, and are not influenced by hypothetical future observations. Many biologists, particularly those working in natural resource management, may be more interested in predicting future quantities than in explanation (e.g. Fried and Hilborn, 1988). Prediction-based Bayesian model selection tools exist for these types of problems (Madigan and Raftery, 1995, Gelman et al., 1995, Bernardo and Smith, 2000), but will not be discussed here because the current analysis is only concerned with explanation.

A final difference between model selection tools is that, other than Bayes factors, none of the criteria presented were ever intended to be used as decision tools, yet they are still used to provide advice to managers making decisions. An argument could be made that BIC or AIC are compatible with decision making because they may be viewed as approximations to the Bayes factor (Raftery, 1999, Link and Barker, 2005). Proponents of both AIC and DIC strongly encourage the use of these tools for making inferences, not decisions (Burnham and Anderson, 2002, Spiegelhalter et al., 2002). Richards (2005) illustrated potential pitfalls of using AIC weights, and other authors have noted statistical flaws in such approaches (see discussion of Forster and Sober, 2004). Another reason not to use these tools to make decisions about ‘significance’ is that there may be implicit costs that are not obvious. For example, assume that two nested models are to be compared. One approach for evaluating these models might be to use a likelihood ratio test (LRT) as a decision rule. The LRT rejects model $M_1$ when $L(\underline{x}|M_1)/L(\underline{x}|M_2) \le C$, with $C$ representing the critical region of the test ($C$ can also be expressed as the ratio of the costs of a Type I error to a Type II error; Bernardo and Smith, 2000). The significance level of the LRT ($\alpha$, the probability of rejecting $M_1$ when $M_1$ is actually true) is usually set a priori at 0.05. Suppose that instead of using an LRT, however, a researcher uses AIC to make a decision about competing models. The significance level of AIC is not transparent, because it varies as a function of the difference in complexity between the models considered (with 1 degree of freedom, the significance level is fixed at 0.157 rather than 0.05; Forster, 2000, Burnham and Anderson, 2002, Kuha, 2004). The implication of this difference is that AIC will give more weight to complex models relative to LRTs (Forster, 2000). As a third option, suppose the researcher uses BIC as a decision tool: the significance level of BIC becomes even more complex, because it is a function of both the difference in model complexity and the sample size. With small sample sizes (n ≈ 7.4), BIC gives results similar to AIC, but the significance level decreases rapidly with increasing sample size or an increasing difference in complexity (BIC eventually favoring simpler models than would be chosen by LRTs). If a researcher in this scenario chooses a model selection tool arbitrarily, simply based on which is easiest to calculate, he or she will be ignorant of the implied Type I and Type II error rates.
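The implied significance levels quoted above follow directly from the chi-squared distribution of the LRT statistic; the sketch below treats ‘lower AIC’ as the rejection rule for nested models differing by ΔK parameters.

```python
from scipy.stats import chi2

# Choosing the more complex of two nested models whenever it lowers AIC
# is an LRT whose critical value for the deviance drop is 2 * (delta K),
# so the implied Type I error rate depends on the complexity difference:
for delta_k in (1, 2, 5):
    alpha = chi2.sf(2 * delta_k, df=delta_k)
    print(f"delta K = {delta_k}: implied alpha = {alpha:.3f}")
# delta K = 1 gives alpha = 0.157, not the conventional 0.05. Replacing
# the penalty 2 with ln(n) (the BIC rule) makes alpha shrink as n grows.
```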

The purpose of this analysis is to address the performance of four model selection criteria (AICc, BIC, DIC and Bayes factors) applied to simulated population abundance data. The majority of previous studies evaluating the performance of model selection criteria may not be completely applicable to ecological data, either because of the criteria compared or the sample sizes involved. For example, Myung (2000) conducted numerous simulations for polynomial models, but did not consider the small-sample variant of AIC (AICc), which is more applicable to ecological problems (Burnham and Anderson, 2002). Other simulations have focused on sample sizes of more than 100 (Kuha, 2004), which is often unrealistic for the natural sciences; many biologists work with sample sizes an order of magnitude smaller. Model selection for biological models with few samples has been explored in a maximum likelihood framework (Shono, 2000), but few studies to date have bridged the gap between maximum likelihood and Bayesian model selection criteria (e.g. A’mar, 2004, Ellison, 2004).

All four criteria will be compared across multiple scenarios and evaluated in their ability to (1) detect density dependent processes, (2) detect non-linear density dependence, and (3) detect Allee effects (or inverse density dependence; Courchamp et al., 1999). In Monte Carlo comparisons of model selection performance, the true model is almost always included among the candidate models (McQuarrie and Tsai, 1998, Forster, 2000, Myung, 2000). While the goal of Monte Carlo comparisons is to estimate the frequency of selecting the true model, the goal of model selection applied to real data is to find the model that best approximates truth; if the Monte Carlo standard of success (selecting the true model) were applied to inference in the real world, all criteria would be wrong 100% of the time, because the true model is never among the candidates. Despite these issues, understanding the performance of model selection over many data sets is extremely valuable: not only can the performance of alternative criteria be compared, but it is also possible to examine the influence of different kinds of error on model selection.

Section snippets

Methods

In this analysis, I considered the ability of four model selection tools to select among four discrete time population models. The first four population models represent single-stage models: (1) the geometric model $N_{t+1} = N_t(1 + r)$; (2) the logistic model $N_{t+1} = N_t + rN_t(1 - N_t/K)$; (3) the theta-logistic model $N_{t+1} = N_t + rN_t(1 - (N_t/K)^{\phi})$ (Gilpin and Ayala, 1973); and (4) a model with decreased growth rate at low density, $N_{t+1} = N_t + rN_t(N_t - a)(K - N_t)/K^2$ (Lewis and Kareiva, 1993). In these models, the parameter r represents
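As a minimal sketch of these single-stage models, the simulator below adds lognormal process error to each deterministic step; the error structure, parameter values, and initial abundance are illustrative assumptions, not the settings used in this analysis.

```python
import numpy as np

def step(n, model, r, K=None, theta=None, a=None):
    """One deterministic time step for each single-stage model."""
    if model == "geometric":
        return n * (1 + r)
    if model == "logistic":
        return n + r * n * (1 - n / K)
    if model == "theta-logistic":
        return n + r * n * (1 - (n / K) ** theta)
    if model == "allee":
        return n + r * n * (n - a) * (K - n) / K ** 2
    raise ValueError(model)

def simulate(model, n_years=25, n0=10.0, sigma=0.1, seed=0, **pars):
    """Simulate one abundance trajectory with lognormal process error."""
    rng = np.random.default_rng(seed)
    N = [n0]
    for _ in range(n_years - 1):
        det = max(step(N[-1], model, **pars), 1e-6)  # guard against N <= 0
        N.append(det * np.exp(rng.normal(0.0, sigma)))
    return np.array(N)

traj = simulate("theta-logistic", r=0.3, K=100.0, theta=2.0)
```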

Results

In all comparisons, the true simulation model was not considered as an estimation model (simulation models included both process error and observation error). The first comparison I examined was the ability of each model selection criterion to detect density dependence, and the rate at which each was subject to Type II errors. In this comparison, a Type II error represents instances where more weight is assigned to an estimation model that includes density dependence when the simulation model is

Discussion

Because each model selection tool has different intrinsic assumptions and behavior, it is crucial to understand the differences between them. Before using any of the model selection criteria discussed in this paper, ecologists should consider the following questions. First, is the purpose of the analysis to make predictions, or to decide which model best represents reality (Ghosh and Samanta, 2001)? While AICc may have better predictive ability than BIC, order-consistent criteria (Taper, 2004)

References (53)

  • N.G. Best et al. (1995). CODA Manual version 0.30.
  • K.P. Burnham et al. (2002). Model selection and multimodel inference: a practical information-theoretic approach.
  • K.P. Burnham et al. (2004). Multimodel inference: understanding AIC and BIC in model selection. Sociol. Methods Res.
  • K.P. Burnham et al. (1995). Model selection strategy in the analysis of capture–recapture data. Biometrics.
  • G. Celeux et al. (2006). Deviance information criteria for missing data models. Bayesian Analysis.
  • A.E. Ellison (2004). Bayesian inference in ecology. Ecol. Lett.
  • M.R. Forster et al. (2004). Why likelihood? In: The Nature of Scientific Evidence.
  • D.A. Fournier (1996). AUTODIFF: a C++ array language extension with automatic differentiation for use in nonlinear modeling and statistics.
  • S.M. Fried et al. (1988). In-season forecasting of Bristol Bay, Alaska, sockeye salmon (Oncorhynchus nerka) abundance using Bayesian probability theory. Can. J. Fish. Aquat. Sci.
  • A.E. Gelfand et al. (1994). Bayesian model choice: asymptotics and exact calculations. J. Roy. Stat. Soc. Ser. B.
  • A.B. Gelman et al. (1995). Bayesian Data Analysis.
  • J.K. Ghosh et al. (2001). Model selection—an overview. Curr. Sci.
  • M.E. Gilpin et al. (1973). Global models of growth and competition. Proc. Natl. Acad. Sci.
  • I.J. Good (1958). Significance tests in parallel series. J. Am. Stat. Assoc.
  • N.T. Hobbs et al. (2006). Alternatives to statistical hypothesis testing in ecology: a guide to self teaching. Ecol. Appl.
  • J.A. Hoeting et al. (1999). Bayesian model averaging: a tutorial. Stat. Sci.