Abstract
Most datasets one is likely to encounter contain missing data. Before discussing some of the problems of analyzing data in which some variables are missing for some subjects, we define some nomenclature.
Notes
- 1.
This may work if values are “missing” because they are “not applicable”; e.g., one has a measure of marital happiness, dichotomized as high or low, but the sample contains some unmarried people. One could then use a 3-category variable with values high, low, and unmarried (Paul Allison, IMPUTE e-mail list, 4 Jul 2009).
- 2.
Predictors of the target variable include all the other Xs along with auxiliary variables that are not included in the final outcome model, as long as they precede the variable being imputed in the causal chain (unlike with multiple imputation).
- 3.
Thus when modeling binary or categorical targets one can frequently take least squares shortcuts in place of maximum likelihood for binary, ordinal, or multinomial logistic models.
- 4.
Reference [662] discusses an alternative method based on choosing a donor observation at random from the q closest matches (q = 3, for example).
- 5.
To use the bootstrap to correctly estimate variances of regression coefficients, one must repeat the imputation process and the model fitting perhaps 100 times using a resampling procedure [174, 566] (see Section 5.2). Still, the bootstrap can estimate the right variance for the wrong parameter estimates if the imputations are not done correctly.
- 6.
The dataset is on the book’s dataset wiki and may be automatically fetched over the internet and loaded using the Hmisc package’s command getHdata(support).
- 7.
You can use the R command subset(support, is.na(totcst) | totcst > 0). The is.na condition tells R that it is permissible to include observations having missing totcst without setting all columns of such observations to NA.
- 8.
We are anti-logging predicted log costs, and we assume that log cost has a symmetric distribution.
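As an illustration of the subsetting described in note 7, the following R sketch may help. It does not use the real support dataset: the data frame d and its id and totcst values are made up for this example. It compares three ways of filtering on totcst > 0 when some totcst values are missing.

```r
# Toy stand-in for the support data frame; all values are hypothetical
d <- data.frame(id = 1:5, totcst = c(100, NA, 0, 250, NA))

# Form used in note 7: keep rows where totcst is missing or positive,
# so only the zero-cost row (id 3) is dropped
kept <- subset(d, is.na(totcst) | totcst > 0)

# Without is.na(), subset() treats an NA condition as FALSE,
# so the rows with missing totcst are dropped as well
dropped <- subset(d, totcst > 0)

# Bracket indexing with an NA condition instead returns rows
# in which every column is NA
bracket <- d[d$totcst > 0, ]

print(kept$id)     # ids 1, 2, 4, 5
print(dropped$id)  # ids 1, 4
print(bracket)     # 4 rows, two of them entirely NA
```

In this toy case, only the version with the explicit is.na() condition retains the observations with missing totcst intact, which is what is needed if those values are to be imputed later.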
References
P. D. Allison. Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-136. Sage, Thousand Oaks CA, 2001.
J. Barnard and D. B. Rubin. Small-sample degrees of freedom with multiple imputation. Biometrika, 86:948–955, 1999.
S. A. Barnes, S. R. Lindborg, and J. W. Seaman. Multiple imputation techniques in small sample clinical trials. Stat Med, 25:233–245, 2006.
F. Barzi and M. Woodward. Imputations of missing values in practice: Results from imputations of serum cholesterol in 28 cohort studies. Am J Epi, 160:34–45, 2004.
S. F. Buck. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. J Roy Stat Soc B, 22:302–307, 1960.
S. van Buuren. Flexible imputation of missing data. Chapman & Hall/CRC, Boca Raton, FL, 2012.
T. G. Clark and D. G. Altman. Developing a prognostic model in the presence of missing data: an ovarian cancer case study. J Clin Epi, 56:28–37, 2003.
S. L. Crawford, S. L. Tennstedt, and J. B. McKinlay. A comparison of analytic methods for non-random missingness of outcome data. J Clin Epi, 48:209–219, 1995.
D’Agostino, Jr and D. B. Rubin. Estimating and using propensity scores with partially missing data. J Am Stat Assoc, 95:749–759, 2000.
Donders, G. J. M. G. van der Heijden, T. Stijnen, and K. G. M. Moons. Review: A gentle introduction to imputation of missing values. J Clin Epi, 59:1087–1091, 2006.
A. Donner. The relative effectiveness of procedures commonly used in multiple regression analysis for dealing with missing values. Am Statistician, 36:378–381, 1982.
B. Efron. Missing data, imputation, and the bootstrap (with discussion). J Am Stat Assoc, 89:463–479, 1994.
J. W. Graham, A. E. Olchowski, and T. D. Gilreath. How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prev Sci, 8:206–213, 2007.
S. Greenland and W. D. Finkle. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epi, 142:1255–1264, 1995.
O. Harel and X. Zhou. Multiple imputation: Review of theory, implementation and software. Stat Med, 26:3057–3077, 2007.
Y. He and A. M. Zaslavsky. Diagnosing imputation models by applying target analyses to posterior replicates of completed data. Stat Med, 31(1):1–18, 2012.
N. J. Horton and K. P. Kleinman. Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. Am Statistician, 61(1):79–90, 2007.
N. J. Horton and S. R. Lipsitz. Multiple imputation in practice: Comparison of software packages for regression models with missing variables. Am Statistician, 55:244–254, 2001.
S. Hunsberger, D. Murray, C. Davis, and R. R. Fabsitz. Imputation strategies for missing data in a school-based multi-center study: the Pathways study. Stat Med, 20:305–316, 2001.
K. J. Janssen, A. R. Donders, F. E. Harrell, Y. Vergouwe, Q. Chen, D. E. Grobbee, and K. G. Moons. Missing covariate data in medical research: To impute is better than to ignore. J Clin Epi, 63:721–727, 2010.
M. P. Jones. Indicator and stratification methods for missing explanatory variables in multiple linear regression. J Am Stat Assoc, 91:222–230, 1996.
L. Joseph, P. Belisle, H. Tamim, and J. S. Sampalis. Selection bias found in interpreting analyses with missing data for the prehospital index for trauma. J Clin Epi, 57:147–153, 2004.
G. Kalton and D. Kasprzyk. The treatment of missing survey data. Surv Meth, 12:1–16, 1986.
W. A. Knaus, F. E. Harrell, J. Lynn, L. Goldman, R. S. Phillips, A. F. Connors, N. V. Dawson, W. J. Fulkerson, R. M. Califf, N. Desbiens, P. Layde, R. K. Oye, P. E. Bellamy, R. B. Hakim, and D. P. Wagner. The SUPPORT prognostic model: Objective estimates of survival for seriously ill hospitalized adults. Ann Int Med, 122:191–203, 1995.
M. J. Knol, K. J. M. Janssen, R. T. Donders, A. C. G. Egberts, E. R. Heerding, D. E. Grobbee, K. G. M. Moons, and M. I. Geerlings. Unpredictable bias when using the missing indicator method or complete case analysis for missing confounder values: an empirical example. J Clin Epi, 63:728–736, 2010.
P. W. Lavori, R. Dawson, and D. Shera. A multiple imputation strategy for clinical trials with truncation of patient data. Stat Med, 14:1913–1925, 1995.
D. Y. Lin and Z. Ying. Semiparametric regression analysis of longitudinal data with informative drop-outs. Biostatistics, 4:385–398, 2003.
S. R. Lipsitz, L. P. Zhao, and G. Molenberghs. A semiparametric method of multiple imputation. J Roy Stat Soc B, 60:127–144, 1998.
R. Little and H. An. Robust likelihood-based analysis of multivariate data with missing values. Statistica Sinica, 14:949–968, 2004.
R. J. Little. Missing Data. In Ency of Biostatistics, pages 2622–2635. Wiley, New York, 1998.
R. J. A. Little. Missing-data adjustments in large surveys. J Bus Econ Stat, 6:287–296, 1988.
R. J. A. Little. Regression with missing X’s: A review. J Am Stat Assoc, 87:1227–1237, 1992.
R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, New York, second edition, 2002.
G. Marshall, B. Warner, S. MaWhinney, and K. Hammermeister. Prospective prediction in the presence of missing data. Stat Med, 21:561–570, 2002.
X. Meng. Multiple-imputation inferences with uncongenial sources of input. Stat Sci, 9:538–558, 1994.
M. E. Miller, T. M. Morgan, M. A. Espeland, and S. S. Emerson. Group comparisons involving missing data in clinical trials: a comparison of estimates and power (size) for some simple approaches. Stat Med, 20:2383–2397, 2001.
K. G. M. Moons, R. A. R. T. Donders, T. Stijnen, and F. E. Harrell. Using the outcome for imputation of missing predictor values was preferred. J Clin Epi, 59:1092–1101, 2006.
P. C. O’Brien, D. Zhang, and K. R. Bailey. Semi-parametric and non-parametric methods for clinical trials with incomplete data. Stat Med, 24:341–358, 2005.
M. Reilly and M. Pepe. The relationship between hot-deck multiple imputation and weighted likelihood. Stat Med, 16:5–19, 1997.
J. S. Roberts and G. M. Capalbo. A SAS macro for estimating missing values in multivariate data. In Proceedings of the Twelfth Annual SAS Users Group International Conference, pages 939–941, Cary, NC, 1987. SAS Institute, Inc.
D. Rubin and N. Schenker. Multiple imputation in health-care data bases: An overview and some applications. Stat Med, 10:585–598, 1991.
D. B. Rubin. Multiple Imputation for Nonresponse in Surveys. Wiley, New York, 1987.
J. L. Schafer and J. W. Graham. Missing data: Our view of the state of the art. Psych Meth, 7:147–177, 2002.
M. Schemper and G. Heinze. Probability imputation revisited for prognostic factor studies. Stat Med, 16:73–80, 1997.
M. Schemper and T. L. Smith. Efficient evaluation of treatment effects in the presence of missing covariate values. Stat Med, 9:777–784, 1990.
J. Shao and R. R. Sitter. Bootstrap for imputed survey data. J Am Stat Assoc, 91:1278–1288, 1996.
J. Siddique. Multiple imputation using an iterative hot-deck with distance-based donor selection. Stat Med, 27:83–102, 2008.
N. H. Timm. The estimation of variance-covariance and correlation matrices from incomplete data. Psychometrika, 35:417–437, 1970.
J. Twisk, M. de Boer, W. de Vente, and M. Heymans. Multiple imputation of missing values was not necessary before performing a longitudinal mixed-model analysis. J Clin Epi, 66(9):1022–1028, 2013.
W. Vach. Logistic Regression with Missing Values in the Covariates, volume 86 of Lecture Notes in Statistics. Springer-Verlag, New York, 1994.
W. Vach. Some issues in estimating the effect of prognostic factors from incomplete covariate data. Stat Med, 16:57–72, 1997.
W. Vach and M. Blettner. Logistic regression with incompletely observed categorical covariates—Investigating the sensitivity against violation of the missing at random assumption. Stat Med, 14:1315–1329, 1995.
W. Vach and M. Blettner. Missing Data in Epidemiologic Studies. In Ency of Biostatistics, pages 2641–2654. Wiley, New York, 1998.
W. Vach and M. Schumacher. Logistic regression with incompletely observed categorical covariates: A comparison of three approaches. Biometrika, 80:353–362, 1993.
S. van Buuren, H. C. Boshuizen, and D. L. Knook. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med, 18:681–694, 1999.
S. van Buuren, J. P. L. Brand, C. G. M. Groothuis-Oudshoorn, and D. B. Rubin. Fully conditional specification in multivariate imputation. J Stat Computation Sim, 76(12):1049–1064, 2006.
G. J. M. G. van der Heijden, Donders, T. Stijnen, and K. G. M. Moons. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: A clinical example. J Clin Epi, 59:1102–1109, 2006.
P. T. von Hippel. Regression with missing ys: An improved strategy for analyzing multiple imputed data. Soc Meth, 37(1):83–117, 2007.
R. Wang, J. Sedransk, and J. H. Jinn. Secondary data analysis when there are missing observations. J Am Stat Assoc, 87:952–961, 1992.
I. R. White and J. B. Carlin. Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Stat Med, 29:2920–2931, 2010.
I. R. White and P. Royston. Imputing missing covariate values for the Cox model. Stat Med, 28:1982–1998, 2009.
I. R. White, P. Royston, and A. M. Wood. Multiple imputation using chained equations: Issues and guidance for practice. Stat Med, 30(4):377–399, 2011.
A. M. Wood, I. R. White, and S. G. Thompson. Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals. Clin Trials, 1:368–376, 2004.
R. M. Yucel and A. M. Zaslavsky. Using calibration to improve rounding in imputation. Am Statistician, 62(2):125–129, 2008.
X. Zhou, G. J. Eckert, and W. M. Tierney. Multiple imputation in public health research. Stat Med, 20:1541–1549, 2001.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Harrell, F.E. (2015). Missing Data. In: Regression Modeling Strategies. Springer Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-19425-7_3
DOI: https://doi.org/10.1007/978-3-319-19425-7_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19424-0
Online ISBN: 978-3-319-19425-7
eBook Packages: Mathematics and Statistics (R0)