The golden rule is that there are no golden rules: A commentary on Paul Barrett’s recommendations for reporting model fit in structural equation modelling

https://doi.org/10.1016/j.paid.2006.09.023

Abstract

Paul Barrett offers a challenging and timely call for a re-examination of fit assessment strategies in structural equation modelling (SEM). He points out that widely accepted cutoff values for approximate fit indices have come to be treated as if they were test statistics. Paul cites four recent studies of the behaviour of fit indices under varying data conditions which demonstrate that universal indicative cutoff values cannot be trusted. Based upon these studies, Paul advocates the abandonment of approximate fit indices and greater reliance on the chi square test and a broader assessment strategy that includes predictive accuracy. I share Paul’s concerns about the lax standards often adopted in model testing and I agree with most of his arguments. However, the authors he cites in support of his recommendation to abandon approximate fit indices do not reach the same conclusion as Paul. In my response to Paul’s article, I discuss some conditions under which it could be legitimate to accept a model which has failed the chi square test and I contend that approximate fit indices can play a useful part in a multi-faceted strategy for determining model adequacy, provided they are not elevated to the status of golden rules.

Introduction

I welcome this opportunity to comment on Paul Barrett’s discussion of model fit assessment in structural equation modelling (SEM) and confirmatory factor analysis (CFA), and his recommendations for authors submitting papers to Personality and Individual Differences and for the Journal’s editors and reviewers. I have for some time shared Paul’s view that the Journal’s Associate Editors should seek at least a broad consensus amongst themselves on the assessment-of-fit issue so that submissions to the Journal will be met with a consistent position. Also, like Paul, I have become frustrated with the slippery use of language, the shifting of goal-posts, and the selective adoption of often dubiously supported criteria that many authors employ in order to present a rosy picture of the fit of their models. All too often it is apparent that authors adopt fit criteria that ensure that they can describe their models as fitting rather than criteria that assess whether or not their models fit. In writing this response I should say from the outset that I do not consider myself to be a methodologist or a statistical expert. Rather, my comments come from the position of an occasional user and teacher of structural equation modelling who is relatively well-informed about the exact versus close fit debate, an Associate Editor with responsibility for making recommendations about the fate of submissions to the Journal, and a reviewer for other journals that frequently send me SEM-based manuscripts.

In order to gauge what is actually happening within the Journal, before starting this commentary I conducted a search for papers describing studies using SEM and/or CFA procedures published or accepted for publication in PAID since the beginning of 2006. I found 28 such papers. In only two papers did the authors declare that the fit of their models was unacceptable. All 28 reported the chi square values and degrees of freedom for the models, but only 10 reported the chi square probabilities. Of these, only three models had non-significant chi squares, although there appeared to be an error in one paper: a table showed a non-significant chi square for one of seven alternative models tested, but the text reported that chi square was significant in all cases. In this instance the apparently exact-fitting model was highly parameterised and the authors rejected it in favour of one with a significant chi square on the grounds of parsimony. In the other two cases of non-significant chi squares, the models were also highly parameterised, yielding very few degrees of freedom (in one case only two). When reviewing or editing papers I frequently find that the authors seem to be oblivious to the fact that the fewer the degrees of freedom a model has, the lower its degree of disconfirmability, and that finding a good fit for a model with few opportunities for disconfirmation is scientifically of little use (MacCallum, 1995). In one paper a chi square of 332.19 with 48 degrees of freedom was described as ‘small’, with the conclusion that the model was acceptable. One paper even declared a ‘good fit’ for a model with a comparative fit index of only .619. This brief and admittedly unsystematic examination of the state of play within the Journal clearly shows a considerable degree of both leniency in accepting ill-fitting models and inconsistency in what is reported and considered acceptable. If such inconsistency were evident in, say, sentencing practices in our criminal courts, the tabloid newspaper editors would be screaming that ‘something must be done’. Thus Paul Barrett’s call for action is, I believe, timely.
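To put some numbers on that last example, here is a minimal sketch (in Python with scipy; the tooling is my own choice for illustration, not anything prescribed in the papers under discussion) of what the exact fit test actually says about a chi square of 332.19 on 48 degrees of freedom. Under the null hypothesis the statistic’s expected value is simply its degrees of freedom, so a value nearly seven times larger than 48 is anything but ‘small’.

```python
# Illustrative check of the 'small' chi square claim discussed above.
from scipy.stats import chi2

chisq, df = 332.19, 48        # values as reported in the paper in question
p = chi2.sf(chisq, df)        # survival function: P(X >= chisq) under H0
print(f"chi square = {chisq}, df = {df}, p = {p:.3g}")
# The p-value is vanishingly small, far below any conventional alpha,
# so the exact-fit hypothesis is decisively rejected.
```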

In commenting on Paul’s article I would first like to make some general observations and then I will deal with each of his specific recommendations in turn. As I read it, Paul’s position centres on two main issues: first, that the chi square test is the only statistical test currently available for SEM and, second, that it is neither possible nor indeed logical to set universal indicative thresholds for approximate fit indices. On both these issues I am in agreement with Paul, but my conclusions regarding the latter do differ. It is of course literally true that the chi square test is the only statistical test available with regard to testing the null hypothesis of no difference between the model-implied population covariances and the observed sample covariances. What is at issue, then, is whether or not it is the only useful statistical criterion for assessing the adequacy of a model.

It bears emphasising that, as Paul acknowledges, the chi square test is not actually an exact fit test because the acceptance or rejection of the null hypothesis depends on adopting an arbitrary (conventional) alpha level. In this respect the chi square test is just as subjective as approximate fit indices and does not provide the golden rule that we all would like to have. Indeed Hayduk, probably the most ardent critic of approximate fit indices, argues cogently that the conventional alpha of .05 is far too liberal in SEM because, as one is seeking a non-significant chi square, it favours acceptance of the null hypothesis (Hayduk, 1996; Hayduk & Glaser, 2000). Instead, Hayduk (1996, p. 77) advocates that researchers should seek probabilities in the region of .75, a recommendation that Hayduk and Glaser (2000, p. 24) describe as a “fuzzier boundary” that acknowledges the desire to balance the potential for incorrect rejection of acceptable models against the need to place the onus on the researcher to demonstrate convincingly that their data do not depart significantly from the model. Whatever the merits of the argument for a more stringent alpha, and whatever alpha one chooses, the decision rule is ultimately subjective.
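To make the subjectivity of the decision rule concrete, the following sketch (again Python with scipy; the model values are hypothetical, chosen only for illustration) shows the same model being retained under the conventional alpha of .05 but rejected under Hayduk’s suggestion that one should seek exact-fit probabilities in the region of .75.

```python
# Hypothetical illustration: the verdict depends entirely on the chosen bar.
from scipy.stats import chi2

def exact_fit_verdict(chisq: float, df: int, p_required: float) -> str:
    """Retain the model only if the exact-fit p-value meets the chosen bar."""
    p = chi2.sf(chisq, df)
    decision = "retain" if p >= p_required else "reject"
    return f"p = {p:.3f}, bar = {p_required}: {decision}"

# A hypothetical model yielding chi square 52.0 on 40 degrees of freedom:
print(exact_fit_verdict(52.0, 40, 0.05))  # conventional alpha: retained
print(exact_fit_verdict(52.0, 40, 0.75))  # Hayduk's stricter bar: rejected
```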

In addition, as Paul also goes some way towards acknowledging with reference to Gigerenzer (2004), the very practice of making a dichotomous (accept/reject) decision based on the null hypothesis significance test has been severely and persuasively criticised by many authorities for many years (e.g., Cohen, 1994; Gigerenzer, 2004; Rosnow & Rosenthal, 1989). Such criticisms stem largely from near-universal misunderstandings of what statistical inference in general, and the probability that the null hypothesis is false in particular, actually mean. In the context of SEM, the acceptance of the null hypothesis does not mean that the model is in any sense proven to be true (Guttman, 1976); it merely provides evidence for its tenability (Bentler & Bonett, 1980; Bullock et al., 1994). Furthermore, for any given model there will be a number of alternatives that produce exactly the same fit to the data and, moreover, unless the model is saturated and therefore untestable anyway, there are always other models that would reproduce the data even better (albeit at the potential expense of theoretical meaningfulness). Thus even a model with a non-significant chi square test needs to have a serious health warning attached to it. Clearly Paul is not promoting the notion that one can prove the null hypothesis, nor sole reliance on acceptance/rejection of the null hypothesis in SEM, as shown by his advocacy of a broader assessment strategy involving the examination of models in terms of their predictive accuracy with respect to theoretically consistent external criteria. I believe that few if any authorities would disagree with this position or argue that a good fit (judged by any internal criteria) is sufficient to establish the validity of a model.

With regard to cutoff thresholds for approximate fit indices, the criteria proposed by Hu and Bentler (1999), which are generally more stringent than those previously widely adopted, have come to be regarded as golden rules by most SEM users and many journal editors. This is despite the fact that Hu and Bentler themselves were at pains to warn against this. Paul cites recent work by Beauducel and Wittmann (2005), Fan and Sivo (2005), Marsh et al. (2004) and Yuan (2005) on the performance of the Hu and Bentler criteria under varying data conditions as support for his case against their use. I agree that these papers make rather grim reading for advocates of definitive golden rules that can be applied across all models of varying degrees of complexity, with different parameter values, sample sizes and data distributions. However, none of the authors cited by Paul actually advocates the abandonment of approximate fit indices. Beauducel and Wittmann’s (2005) study was not concerned with cutoff criteria per se but investigated the performance of selected indices in detecting misfit in simulated CFA data with varying degrees of departure from simple structure. They found some heterogeneity among fit indices but concluded that this supported a strategy of reporting a range of fit indices. They also found that the standardized root mean square residual (SRMR) and root mean square error of approximation (RMSEA) were not sensitive to some types of simple structure distortion. However, they argued that this should not lead to the conclusion that RMSEA and SRMR should not be reported, because this would lead researchers to be too conservative with regard to model acceptance and would “be more of a problem than a stimulation for further research in these areas” (Beauducel & Wittmann, 2005, p. 73). Fan and Sivo (2005) examined the performance of Hu and Bentler’s (1998, 1999) two-index strategy (SRMR plus another index). They concluded that the strategy has questionable validity and that the use of multiple fit indices (i.e., not just two) therefore “makes good sense” (p. 366). Similarly, although Yuan (2005) concluded that definitive cutoff values for fit indices are of questionable validity at best, he declared that “Although most of our results are on the negative side of current practice, fit indices are still meaningful measures of model fit/misfit” (p. 142). Yuan goes on to state that “We want to emphasize that the purpose of the article is not to abandon cutoff values but to point out the misunderstanding of fit indices and their cutoff values” (pp. 142–143). Paul states that each of these four papers demonstrates that single-value indicative thresholds are impossible to set without some models being incorrectly identified as fitting acceptably when in fact they are misspecified. In fact, in their replication of Hu and Bentler’s (1999) study, Marsh et al. (2004) also found the opposite to be the case: they showed that many of Hu and Bentler’s misspecified models should have been deemed acceptable, even using the more stringent criteria proposed by Hu and Bentler themselves. Although Marsh et al. (2004) warn against overgeneralising Hu and Bentler’s (1999) results and treating their proposed criteria as golden rules, and argue that the application of a hypothesis testing rationale to setting cutoff values is in any case illogical, they do not advocate the abandonment of approximate fit indices.
On the contrary, they present a case that in the context of CFA even the older and more lenient cutoff conventions (e.g., incremental fit indices >.90) may be too conservative if one wants measures with a sufficient number of items to achieve good construct validity. Furthermore, Marsh et al. (2004) found that all of the indices they examined were good at differentiating between degrees of model misspecification and were therefore useful in comparing the adequacy of alternative nested models of the same data. Thus the overall tone of the papers by Beauducel and Wittmann (2005), Fan and Sivo (2005), Yuan (2005) and Marsh et al. (2004) is certainly that indicative threshold values cannot be trusted, but also that approximate indices are still useful in model fit assessment, at least to the extent that they can facilitate progression in a research field.
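Since much of this debate turns on how an approximate fit index is computed and then compared with a cutoff, a short sketch may help. The RMSEA below follows the standard Steiger–Lind formula; the sample sizes are hypothetical, as N was not reported for the chi square quoted earlier in this commentary, and the .06 bound is Hu and Bentler’s (1999) suggestion, which the papers reviewed above caution against treating as a golden rule.

```python
# Hedged illustration of computing RMSEA from the chi square statistic.
from math import sqrt

def rmsea(chisq: float, df: int, n: int) -> float:
    """Steiger-Lind RMSEA: sqrt(max(0, (chisq/df - 1) / (N - 1)))."""
    return sqrt(max(0.0, (chisq / df - 1.0) / (n - 1)))

# The chi square reported earlier in this commentary, at hypothetical Ns:
for n in (200, 500):
    print(f"N = {n}: RMSEA = {rmsea(332.19, 48, n):.3f}")
# At either N the value sits far above the oft-cited .06 bound -- further
# evidence against that model, though the bound itself is no golden rule.
```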


Paul Barrett’s recommendations

I will now turn my attention to Paul’s specific recommendations for authors, referees and editors of the Journal.

Conclusion

In conclusion, I applaud Paul’s efforts in seeking greater rigour in the reporting of SEM/CFA studies. I do not think that he and I are very far apart on most of the issues in principle. Like Marsh et al. (2004) I believe that golden rules are elusive and will probably never be realised. Instead, the assessment of model adequacy should be a multifaceted enterprise comprising consideration of model fit, empirical adequacy and substantive meaningfulness. Where Paul and I differ fundamentally, …

References (17)

  • Beauducel, A., & Wittmann, W. W. (2005). Simulation study on fit indices in confirmatory factor analysis based on data with slightly distorted simple structure. Structural Equation Modeling.
  • Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in covariance structures. Psychological Bulletin.
  • Bullock, H. E., et al. (1994). Causation issues in structural equation modeling research. Structural Equation Modeling.
  • Cohen, J. (1994). The earth is round (p < .05). American Psychologist.
  • Fan, X., & Sivo, S. A. (2005). Sensitivity of fit indices to misspecified structural or measurement model components: Rationale of two-index strategy revisited. Structural Equation Modeling.
  • Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics.
  • Guttman, L. (1976). What is not what in statistics. The Statistician.
  • Hayduk, L. A. (1996). LISREL issues, debates and strategies.


The title quotation is from George Bernard Shaw (1903), Man and Superman.
