Abstract
With complex models becoming increasingly popular in the social sciences, many researchers have begun using latent variable modeling in multiple steps: saving, estimating, or otherwise extracting factor scores from a confirmatory factor analysis (CFA) for use in a second, inferential analysis. When a CFA identifies two or more factors, few practical guidelines exist for how researchers should proceed. In Study 1, we examine two common practices for CFAs with two or more factors: fitting separate CFAs or allowing the factors to correlate in the model used for extraction. We provide a simulation study to demonstrate the bias introduced by each of the two approaches. In Study 2, we demonstrate that the bias in the between-factor correlation can be mitigated by using a different estimator: ten Berge estimation shows near-zero bias on the critical correlations between factors. Finally, we illustrate these findings with an example dataset.
Funding
The authors did not receive support from any organization for the submitted work.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Ethics approval
This study did not involve human subjects and was not subject to ethics approval.
Informed consent
Informed consent was not applicable for this study.
Appendices
Appendix 1
1.1 Data generation
The data generation model was based on Eq. 4 (data model) and Eq. 5 (implied covariance form), where \(y_{i}\) is the p × 1 observed data vector for the ith observation, with p the number of indicators (here, 8); \({\Sigma }\) is the implied covariance matrix for y; \(\mu\) is the p × 1 intercept vector, fixed to 0; \({\Lambda }\) is the p × m matrix of (fixed) factor loadings, with m the number of factors (here, 2); \(\eta_{i}\) is the m × 1 vector of “true” factor scores for the ith observation, following a multivariate normal distribution, MVN \(\left( {0,{\Phi }} \right)\); and \(\epsilon_{i}\) is the p × 1 residual vector for the ith observation, following a multivariate normal distribution, MVN \(\left( {0,{\Psi }} \right)\). Note that \({\Phi }\) is an m × m matrix representing the covariance matrix of the factors, and \({\Psi }\) is a p × p diagonal matrix representing the error covariance matrix, with the error variances set so that each element of y has a variance of 1.
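For reference, the definitions above match the standard CFA data model and its implied covariance. Assuming Eq. 4 and Eq. 5 take this usual form (they appear in the main text, not in this appendix), they can be restated as:

\[
y_{i} = \mu + {\Lambda }\eta_{i} + \epsilon_{i}, \qquad {\Sigma } = {\Lambda }{\Phi }{\Lambda }^{\prime } + {\Psi }.
\]

Under this form, with \(\mu = 0\), the unit-variance constraint on y implies that the jth error variance is \(\psi_{jj} = 1 - \left( {{\Lambda }{\Phi }{\Lambda }^{\prime }} \right)_{jj}\).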
In addition, we introduced an external variable z that correlated at 0.5 with factor 1 only. We assume that z is normally distributed with a variance of 1. The joint covariance matrix of the factors and the variable z is therefore
\[
{\Phi }^{*} = \begin{pmatrix} {\Phi } & {\upgamma } \\ {\upgamma }^{\prime } & 1 \end{pmatrix},
\]
where \({\upgamma }^{\prime }\) = [0.5, 0, …, 0] has dimension 1 × m.
For each replication, data were generated in two steps: (1) the true factor scores (\(\eta\)) and the external variable (z) were drawn randomly from the multivariate normal distribution MVN(0, \({\Phi }^{*}\)), and the residuals \(\upepsilon\) were drawn from MVN(0, \({\Psi }\)); and (2) the observed data matrix Y of dimension p × n (n = sample size) was calculated based on Eq. 4.
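The two steps can be sketched in plain Python as follows. This is a minimal illustration and not the simulation code used in the study: the function names, the loading value of 0.7, and the assumption of simple structure with four indicators per factor are all hypothetical, and the standard data model y = Λη + ε with μ = 0 is assumed.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L' = a, for a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def generate(n, rho, loadings, gamma1=0.5, seed=1):
    """Draw n observations of (y, eta, z) for a two-factor model.

    loadings holds one list of loadings per factor (hypothetical values).
    Rows of the returned ys are observations, i.e. the transpose of the p x n matrix Y.
    """
    rng = random.Random(seed)
    # Step 1: joint covariance of (eta1, eta2, z), with z related to factor 1 only
    phi_star = [[1.0, rho, gamma1],
                [rho, 1.0, 0.0],
                [gamma1, 0.0, 1.0]]
    chol = cholesky(phi_star)
    ys, etas, zs = [], [], []
    for _ in range(n):
        u = [rng.gauss(0.0, 1.0) for _ in range(3)]
        eta1, eta2, z = [sum(chol[i][k] * u[k] for k in range(i + 1)) for i in range(3)]
        # Step 2: y = lambda * eta + epsilon, residual variance chosen so Var(y) = 1
        y = []
        for eta, lams in zip((eta1, eta2), loadings):
            for lam in lams:
                psi = 1.0 - lam ** 2
                y.append(lam * eta + rng.gauss(0.0, math.sqrt(psi)))
        ys.append(y)
        etas.append((eta1, eta2))
        zs.append(z)
    return ys, etas, zs

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# One hypothetical replication: n = 500, true between-factor correlation 0.5
ys, etas, zs = generate(n=500, rho=0.5, loadings=[[0.7] * 4, [0.7] * 4])
print(round(corr([e[0] for e in etas], zs), 2))  # sample correlation of factor 1 with z
```

In a single replication the printed correlation will vary around 0.5 because of sampling error; averaging over replications is what yields the means reported in the appendices.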
Appendix 2
2.1 Code to calculate ten Berge factor scores
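The original code listing is not reproduced here. As an illustration only, the sketch below computes correlation-preserving factor score weights in a commonly cited form of the ten Berge approach: with L = ΛΦ^{1/2} and C = R^{-1/2} L (L′R^{-1}L)^{-1/2}, the weight matrix is W = R^{-1/2} C Φ^{1/2}, so that W′RW = Φ. It is written in plain Python rather than R so that it runs without packages; all function names and the loadings in the usage note are hypothetical, and this is not the authors' implementation.

```python
import math

def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def sym_eig(a, sweeps=100, tol=1e-24):
    """Cyclic Jacobi eigendecomposition of a symmetric matrix -> (values, vectors)."""
    n = len(a)
    a = [[float(x) for x in row] for row in a]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        if sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of a and v
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(n):  # rotate rows p and q of a
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    return [a[i][i] for i in range(n)], v

def sym_power(a, power):
    """Matrix power of a symmetric positive-definite matrix via its eigendecomposition."""
    vals, vecs = sym_eig(a)
    n = len(vals)
    scaled = [[vecs[i][k] * vals[k] ** power for k in range(n)] for i in range(n)]
    return matmul(scaled, transpose(vecs))

def implied_corr(lam, phi):
    """Model-implied correlation matrix, with error variances set so each Var(y) = 1."""
    r = matmul(matmul(lam, phi), transpose(lam))
    for j in range(len(r)):
        r[j][j] = 1.0
    return r

def ten_berge_weights(r, lam, phi):
    """Weights W (p x m) such that W' R W = Phi; scores are standardized data times W."""
    phi_half = sym_power(phi, 0.5)
    r_inv_half = sym_power(r, -0.5)
    load = matmul(lam, phi_half)                          # L = Lambda Phi^(1/2)
    r_inv = matmul(r_inv_half, r_inv_half)
    inner = matmul(matmul(transpose(load), r_inv), load)  # L' R^(-1) L
    c = matmul(matmul(r_inv_half, load), sym_power(inner, -0.5))
    return matmul(matmul(r_inv_half, c), phi_half)

def factor_scores(z_rows, w):
    """Scores for standardized observations (one row per observation)."""
    return matmul(z_rows, w)
```

As a usage example, take hypothetical loadings of 0.8, 0.7, 0.6, and 0.5 on each of two factors correlated at 0.5, build `r = implied_corr(lam, phi)`, and compute `w = ten_berge_weights(r, lam, phi)`. The product `transpose(w)`, `r`, `w` then reproduces `phi` up to numerical error, which is the property that keeps the between-factor correlation of the extracted scores equal to the model-estimated value. With real data, `r` would be the sample correlation matrix of the indicators, and `lam` and `phi` the estimates from the fitted CFA.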
Appendix 3
Means (with standard deviations in parentheses) of the between-factor correlation estimates from which the bias estimates presented in Table 3 were computed.
True corr | n | Orthogonal extraction: High | Orthogonal extraction: High mixed | Orthogonal extraction: Mixed | Correlated extraction: High | Correlated extraction: High mixed | Correlated extraction: Mixed |
---|---|---|---|---|---|---|---|
0 | 100 | 0 (0.10) | 0 (0.10) | 0 (0.10) | 0 (0.13) | 0 (0.14) | 0 (0.15) |
0 | 250 | 0 (0.06) | 0 (0.06) | 0 (0.06) | 0 (0.08) | 0 (0.09) | 0 (0.09) |
0 | 500 | 0 (0.04) | 0 (0.04) | 0 (0.04) | 0 (0.06) | 0 (0.06) | 0 (0.07) |
0.3 | 100 | 0.25 (0.09) | 0.25 (0.09) | 0.23 (0.10) | 0.33 (0.12) | 0.34 (0.13) | 0.34 (0.14) |
0.3 | 250 | 0.27 (0.06) | 0.25 (0.06) | 0.23 (0.06) | 0.34 (0.07) | 0.34 (0.09) | 0.35 (0.09) |
0.3 | 500 | 0.26 (0.04) | 0.25 (0.04) | 0.23 (0.04) | 0.34 (0.05) | 0.34 (0.06) | 0.35 (0.06) |
0.5 | 100 | 0.43 (0.08) | 0.40 (0.09) | 0.38 (0.09) | 0.55 (0.10) | 0.56 (0.11) | 0.57 (0.13) |
0.5 | 250 | 0.44 (0.05) | 0.41 (0.05) | 0.39 (0.05) | 0.55 (0.06) | 0.57 (0.07) | 0.59 (0.07) |
0.5 | 500 | 0.44 (0.04) | 0.42 (0.04) | 0.39 (0.04) | 0.56 (0.04) | 0.57 (0.05) | 0.59 (0.05) |
0.8 | 100 | 0.70 (0.05) | 0.65 (0.06) | 0.61 (0.07) | 0.87 (0.05) | 0.88 (0.06) | 0.88 (0.06) |
0.8 | 250 | 0.70 (0.03) | 0.66 (0.04) | 0.63 (0.04) | 0.87 (0.03) | 0.88 (0.03) | 0.89 (0.04) |
0.8 | 500 | 0.70 (0.02) | 0.66 (0.02) | 0.63 (0.03) | 0.87 (0.02) | 0.88 (0.02) | 0.89 (0.03) |
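Table 3 reports bias rather than these means. The exact bias definition is not restated in this appendix, so the sketch below assumes two common ones: raw bias (mean estimate minus true value) and relative bias (raw bias divided by the true value). The function names are illustrative; the values are copied from the n = 500, High loading rows of the table above.

```python
def raw_bias(mean_estimate, true_value):
    """Raw bias: mean estimate minus the true value (assumed definition)."""
    return mean_estimate - true_value

def relative_bias(mean_estimate, true_value):
    """Raw bias as a proportion of the true value; undefined when the true value is 0."""
    if true_value == 0:
        return None
    return (mean_estimate - true_value) / true_value

# (true correlation, orthogonal extraction mean, correlated extraction mean), n = 500, High
rows = [
    (0.3, 0.26, 0.34),
    (0.5, 0.44, 0.56),
    (0.8, 0.70, 0.87),
]

for true_r, orthogonal, correlated in rows:
    print(
        true_r,
        round(raw_bias(orthogonal, true_r), 2),
        round(raw_bias(correlated, true_r), 2),
        round(relative_bias(orthogonal, true_r), 3),
        round(relative_bias(correlated, true_r), 3),
    )
```

Under either definition the pattern in the table is the same: orthogonal extraction underestimates the between-factor correlation and correlated extraction overestimates it.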
Appendix 4
Mean estimates (with standard deviations in parentheses) of the correlation between the extracted factor and an external variable (z). This information is presented as bias estimates in Table 4.
True corr | n | Orthogonal extraction: High | Orthogonal extraction: High mixed | Orthogonal extraction: Mixed | Correlated extraction: High | Correlated extraction: High mixed | Correlated extraction: Mixed |
---|---|---|---|---|---|---|---|
0 | 100 | 0.47 (0.08) | 0.47 (0.08) | 0.44 (0.08) | 0.47 (0.08) | 0.47 (0.08) | 0.44 (0.08) |
0 | 250 | 0.47 (0.05) | 0.46 (0.05) | 0.44 (0.05) | 0.47 (0.05) | 0.46 (0.05) | 0.44 (0.05) |
0 | 500 | 0.47 (0.03) | 0.47 (0.04) | 0.44 (0.04) | 0.47 (0.03) | 0.47 (0.04) | 0.44 (0.04) |
0.3 | 100 | 0.47 (0.08) | 0.47 (0.08) | 0.44 (0.08) | 0.46 (0.08) | 0.46 (0.08) | 0.43 (0.08) |
0.3 | 250 | 0.46 (0.05) | 0.46 (0.05) | 0.44 (0.05) | 0.46 (0.05) | 0.46 (0.05) | 0.44 (0.05) |
0.3 | 500 | 0.47 (0.04) | 0.47 (0.04) | 0.44 (0.04) | 0.46 (0.04) | 0.46 (0.04) | 0.44 (0.04) |
0.5 | 100 | 0.47 (0.08) | 0.47 (0.08) | 0.44 (0.08) | 0.45 (0.08) | 0.45 (0.08) | 0.42 (0.09) |
0.5 | 250 | 0.47 (0.05) | 0.47 (0.05) | 0.44 (0.05) | 0.45 (0.05) | 0.45 (0.05) | 0.42 (0.05) |
0.5 | 500 | 0.47 (0.04) | 0.47 (0.04) | 0.45 (0.04) | 0.45 (0.04) | 0.45 (0.04) | 0.42 (0.04) |
0.8 | 100 | 0.46 (0.08) | 0.46 (0.08) | 0.44 (0.08) | 0.40 (0.09) | 0.41 (0.09) | 0.36 (0.10) |
0.8 | 250 | 0.47 (0.05) | 0.47 (0.05) | 0.44 (0.05) | 0.40 (0.05) | 0.41 (0.05) | 0.36 (0.06) |
0.8 | 500 | 0.47 (0.04) | 0.47 (0.03) | 0.44 (0.04) | 0.40 (0.04) | 0.41 (0.04) | 0.36 (0.04) |
Cite this article
Logan, J.A.R., Jiang, H., Helsabeck, N. et al. Should I allow my confirmatory factors to correlate during factor score extraction? Implications for the applied researcher. Qual Quant 56, 2107–2131 (2022). https://doi.org/10.1007/s11135-021-01202-x
DOI: https://doi.org/10.1007/s11135-021-01202-x