Abstract
Higher education researchers using survey data often face decisions about handling missing data. Multiple imputation (MI) is considered by many statisticians to be the most appropriate technique for addressing missing data in many circumstances. In particular, it has been shown to be preferable to listwise deletion, which has historically been a commonly employed method for quantitative research. However, our analysis of a decade of higher education research literature reveals that the field has yet to make substantial use of this technique despite common employment of quantitative analysis, and that in research where MI is used, many recommended MI reporting practices are not being followed. We conclude that additional information about the technique and recommended reporting practices may help improve the quality of the research involving missing data. In an attempt to address this issue, we develop a set of reporting recommendations based on a synthesis of the MI methodological literature and offer a discussion of these recommendations oriented toward applied researchers. The recommended MI reporting practices involve describing the nature and structure of any missing data, describing the imputation model and procedures, and describing any notable imputation results.
Notes
Maximum likelihood (ML) is another modern technique that is theoretically an excellent choice for handling missing data: it is always fully efficient (although MI can come close to full efficiency in practice with a high number of imputations, a bar that no longer requires unusual computing capacity), involves fewer implementation decisions than MI, and eliminates the possibility of conflicting imputation and analysis models (Allison 2012). However, ML also has some practical limitations. For example, it is more frequently implemented in statistical packages for structural equation modeling (SEM) than for other forms of analysis, and even in SAS, which implements ML more widely than other software, it is not yet possible to estimate a logistic regression model with ML. A full discussion of ML (Allison 2002; Cox et al. 2014; Enders 2010) is beyond the scope of this research note.
Annotated Stata code that uses MI with publicly available data from the National Center for Education Statistics and that illustrates the recommended practices we discuss in this paper is available from the authors upon request or from the UMass Amherst ScholarWorks website at http://works.bepress.com/ryan_wells/21/.
To gauge the impact of uncertainty from missing data, examine the “missing information,” a concept clearly explained by McKnight et al. (2007). Missing information (γ) measures the influence of missing data on the results of a statistical analysis for a given number of imputations, given the known correlations between variables. Convergence should be investigated with particular care for variables with a high fraction of missing information.
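As a rough illustration of how missing information can be quantified, the sketch below pools one parameter across m imputed datasets using Rubin's (1987) combining rules and reports two common versions of the fraction of missing information. The function name pool_rubin and the toy inputs are hypothetical, and statistical packages may apply further small-sample adjustments, so the output should be read as an approximation and not as a reproduction of any particular program's results.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets (Rubin's rules).

    estimates: point estimates of the parameter, one per imputed dataset
    variances: squared standard errors of those estimates (all assumed > 0)
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching lengths")

    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b      # total variance

    # Share of the total variance attributable to missing data
    # (large-sample approximation to the fraction of missing information).
    lam = (1 + 1 / m) * b / t

    # Relative increase in variance due to missingness.
    r = (1 + 1 / m) * b / w

    if r > 0:
        nu = (m - 1) * (1 + 1 / r) ** 2      # Rubin's degrees of freedom
        gamma = (r + 2 / (nu + 3)) / (1 + r)  # FMI adjusted for finite m
    else:
        gamma = 0.0  # identical estimates: no between-imputation variance

    return {"estimate": q_bar, "total_var": t, "lambda": lam, "gamma": gamma}


# Toy example with three imputations (values are made up for illustration).
result = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(result)
```

A parameter whose gamma is high relative to the others is one for which the imputation model deserves closer scrutiny.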
The number of burn-in iterations is the number of times the imputation process is repeated before the first complete dataset is saved to memory (e.g. as m = 1). For example, the default number of burn-in iterations is 10 for Stata’s mi impute chained command and 100 for mi impute mvn. For MVN, the number of between-imputation iterations may also be set (Stata’s default is 100); this is the number of times the imputation process is iterated between saving one complete dataset to memory and the next (e.g. between m = 1 and m = 2). This aspect of convergence should also be investigated (Enders 2010). The researcher should know the default number in the software used and evaluate its adequacy.
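The relationship between burn-in and between-imputation iterations can be made concrete with a small sketch. It assumes a single chain from which imputed datasets are saved at intervals, as described above for MVN; the function name saved_iterations is hypothetical, and the sketch only illustrates the arithmetic, not the internals of any software.

```python
def saved_iterations(m, burn_in, between):
    """Iteration numbers at which each of the m imputed datasets is saved.

    Assumes one chain: the first dataset is saved after `burn_in`
    iterations, and each later dataset after a further `between` iterations.
    """
    if m < 1 or burn_in < 0 or between < 1:
        raise ValueError("m and between must be >= 1, burn_in must be >= 0")
    return [burn_in + k * between for k in range(m)]


# Five imputations with 100 burn-in and 100 between-imputation iterations.
schedule = saved_iterations(m=5, burn_in=100, between=100)
print(schedule)      # iteration at which m = 1, 2, ... is saved
print(schedule[-1])  # total iterations the chain must run
```

Under these assumptions the chain runs burn_in + (m - 1) × between iterations in total, which shows why increasing either setting lengthens computation roughly in proportion to the number of imputations.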
Another rule of thumb for reproducibility is to have m ≥ 100 × the largest FMI (fraction of missing information), that is, at least as many imputations as the largest FMI expressed as a percentage (StataCorp 2011). Since the FMI reflects more than just the missing data rate, it is an even better guide. However, since it is more complex to calculate and is not known prior to analysis, m ≥ percent of missing data is a good starting place.
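The rule of thumb can be sketched as a one-line calculation. The sketch assumes the rule is read as m being at least 100 times the relevant fraction, so that a fraction of 0.32 calls for at least 32 imputations; the function name suggested_m is hypothetical. The same function serves both versions of the rule: pass the overall fraction of missing data before analysis, then the largest estimated FMI once a first analysis has been run.

```python
import math


def suggested_m(fraction):
    """Smallest whole number of imputations m with m >= 100 * fraction.

    `fraction` is either the largest FMI or, as a starting point before
    any analysis, the fraction of missing data (both between 0 and 1).
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must lie between 0 and 1")
    # Round first so floating-point noise (e.g. 7.000000000000001)
    # does not push the ceiling up by one.
    return max(1, math.ceil(round(100 * fraction, 6)))


print(suggested_m(0.20))  # starting point from a 20% missing data rate
print(suggested_m(0.32))  # revised after finding a largest FMI of 0.32
```

If the second figure exceeds the number of imputations already used, the rule suggests rerunning the imputation with a larger m.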
References
Allison, P. D. (2002). Missing data. Thousand Oaks, CA: Sage.
Allison, P. D. (2012). Handling missing data by maximum likelihood. Paper presented at the SAS Global Forum, Orlando, FL. http://www.statisticalhorizons.com/wp-content/uploads/MissingDataByML.pdf. Accessed 18 April 2013.
Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40–49. doi:10.1002/mpr.329.
Buhi, E. R., Goodson, P., & Neilands, T. B. (2008). Out of sight, not out of mind: Strategies for handling missing data. American Journal of Health Behavior, 32(1), 83–92.
Burton, A., & Altman, D. G. (2004). Missing covariate data within cancer prognostic studies: A review of current reporting and proposed guidelines. British Journal of Cancer, 91, 4–8. doi:10.1038/sj.bjc.6601907.
Collins, L. M., Schafer, J. L., & Kam, C.-M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6(4), 330–351. doi:10.1037/1082-989X.6.4.330.
Cox, B. E., McIntosh, K., Reason, R. D., & Terenzini, P. T. (2014). Working with missing data in higher education research: A primer and real-world example. The Review of Higher Education, 37(3), 377–402.
Craig, L. E., Wu, O., Gilmour, H., Barber, M., & Langhorne, P. (2011). Developing and validating a predictive model for stroke progression. Cerebrovascular Diseases Extra, 1(1), 105–114. doi:10.1159/000334473.
Enders, C. K. (2010). Applied missing data analysis. New York: Guilford Press.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549–576. doi:10.1146/annurev.psych.58.110405.085530.
Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8, 206–213.
Heeringa, S., West, B. T., & Berglund, P. A. (2010). Applied survey data analysis. Boca Raton, FL: Chapman & Hall.
Hutchinson, S. R., & Lovell, C. D. (2004). A review of methodological characteristics of research published in key journals in higher education: Implications for graduate research training. Research in Higher Education, 45(4), 383–403. doi:10.1023/B:RIHE.0000027392.94172.d2.
Jelicic, H., Phelps, E., & Lerner, R. M. (2009). Use of missing data methods in longitudinal studies: The persistence of bad practices in developmental psychology. Developmental Psychology, 45(4), 1195–1199. doi:10.1037/A0015665.
Kenward, M. G., & Carpenter, J. R. (2007). Multiple imputation: Current perspectives. Statistical Methods in Medical Research, 16(3), 199–218. doi:10.1177/0962280206075304.
Klebanoff, M. A., & Cole, S. R. (2008). Use of multiple imputation in the epidemiologic literature. American Journal of Epidemiology, 168(4), 355–357. doi:10.1093/Aje/Kwn071.
Lee, K. J., & Carlin, J. B. (2010). Multiple imputation for missing data: Fully conditional specification versus multivariate normal imputation. American Journal of Epidemiology, 171(5), 624–632. doi:10.1093/Aje/Kwp425.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data. Hoboken, NJ: Wiley.
McKnight, P. E., McKnight, K. M., Sidani, S., & Figueredo, A. J. (2007). Missing data: A gentle introduction. New York: Guilford Press.
Peugh, J. L., & Enders, C. K. (2004). Missing data in educational research: A review of reporting practices and suggestions for improvement. Review of Educational Research, 74(4), 525–556. doi:10.3102/00346543074004525.
Royston, P. (2004). Multiple imputation of missing values. Stata Journal, 4(3), 227–241.
Royston, P., & White, I. R. (2011). Multiple imputation by chained equations (MICE): Implementation in Stata. Journal of Statistical Software, 45(4), 1–20.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8, 3–15.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177. doi:10.1037/1082-989X.7.2.147.
Social Science Computing Cooperative (2012). Multiple imputation in Stata: Introduction. University of Wisconsin, Madison. http://www.ssc.wisc.edu/sscc/pubs/stata_mi_intro.htm. Accessed 27 September 2012.
StataCorp. (2011). Stata multiple-imputation reference manual: Release 12. College Station, TX: Stata Press.
Sterne, J. A. C., White, I. R., Carlin, J. B., Spratt, M., Royston, P., Kenward, M. G., et al. (2009). Multiple imputation for missing data in epidemiological and clinical research: Potential and pitfalls. British Medical Journal. doi:10.1136/bmj.b2393.
Treiman, D. J. (2009). Quantitative data analysis: Doing social research to test ideas. San Francisco: Jossey-Bass.
van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3), 219–242. doi:10.1177/0962280206074463.
van Buuren, S. (2012). Flexible imputation of missing data. Boca Raton, FL: CRC Press.
van Buuren, S., Brand, J. P. L., Groothuis-Oudshoorn, C. G. M., & Rubin, D. B. (2006). Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation, 76(12), 1049–1064. doi:10.1080/10629360600810434.
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67.
White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30, 377–399. doi:10.1002/sim.4067.
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8), 594–604. doi:10.1037/0003-066X.54.8.594.
Cite this article
Manly, C.A., Wells, R.S. Reporting the Use of Multiple Imputation for Missing Data in Higher Education Research. Res High Educ 56, 397–409 (2015). https://doi.org/10.1007/s11162-014-9344-9
DOI: https://doi.org/10.1007/s11162-014-9344-9