Missing Data

doi:10.1007/978-3-319-19425-7_3

Frank E. Harrell Jr.

Part of the book series: Springer Series in Statistics (SSS)

Abstract

There are missing data in the majority of datasets one is likely to encounter. Before discussing some of the problems of analyzing data in which some variables are missing for some subjects, we define some nomenclature.

Notes

1. This may work if values are “missing” because of “not applicable”, e.g., one has a measure of marital happiness, dichotomized as high or low, but the sample contains some unmarried people. One could have a 3-category variable with values high, low, and unmarried (Paul Allison, IMPUTE e-mail list, 4Jul09).
2. Predictors of the target variable include all the other Xs along with auxiliary variables that are not included in the final outcome model, as long as they precede the variable being imputed in the causal chain (unlike with multiple imputation).
3. Thus when modeling binary or categorical targets one can frequently take least squares shortcuts in place of maximum likelihood for binary, ordinal, or multinomial logistic models.
4. [662] discusses an alternative method based on choosing a donor observation at random from the q closest matches (q = 3, for example).
5. To use the bootstrap to correctly estimate variances of regression coefficients, one must repeat the imputation process and the model fitting perhaps 100 times using a resampling procedure [174, 566] (see Section 5.2). Still, the bootstrap can estimate the right variance for the wrong parameter estimates if the imputations are not done correctly.
6. The dataset is on the book’s dataset wiki and may be automatically fetched over the internet and loaded using the Hmisc package’s command getHdata(support).
7. You can use the R command subset(support, is.na(totcst) | totcst > 0). The is.na condition tells R that it is permissible to include observations having missing totcst without setting all columns of such observations to NA.
8. We are anti-logging predicted log costs and we assume log cost has a symmetric distribution.
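
The row selection described in note 7 can be tried on a small made-up data frame. The sketch below is only an illustration: support_toy and its values are hypothetical stand-ins for the real support dataset, and the comparison with plain bracket indexing and with subset() lacking the is.na() term is added here to show why the extra condition matters.

```r
# Toy stand-in for the SUPPORT data; the values are made up for illustration.
support_toy <- data.frame(
  id     = 1:5,
  totcst = c(1200, NA, 0, 350, NA)
)

# Plain logical indexing: an NA in the condition yields rows that are all NA.
bracket <- support_toy[support_toy$totcst > 0, ]

# subset() without is.na(): NA conditions count as FALSE, so rows with
# missing totcst are silently dropped.
dropped <- subset(support_toy, totcst > 0)

# The command from note 7: rows with missing totcst are kept intact and
# only the zero-cost row is removed.
kept <- subset(support_toy, is.na(totcst) | totcst > 0)

print(kept$id)
```

Only the third form both removes the zero-cost observation and leaves the observations with missing totcst available for later imputation.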
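
Note 4 mentions choosing a donor at random from the q closest matches. A minimal sketch of that idea follows, under the assumption that closeness is measured by absolute distance between predicted values; the function draw_donor, its arguments, and the toy numbers are hypothetical and are not taken from the cited method.

```r
# Hypothetical helper: impute one value by sampling a donor among the q
# complete cases whose predicted values lie closest to the target's prediction.
draw_donor <- function(pred_target, pred_donors, y_donors, q = 3) {
  stopifnot(length(pred_donors) == length(y_donors), q >= 1)
  q <- min(q, length(y_donors))
  # Indices of the q donors with the smallest absolute distance in predictions
  closest <- order(abs(pred_donors - pred_target))[seq_len(q)]
  # Pick one of them at random; sample.int avoids sample()'s scalar shortcut
  chosen <- closest[sample.int(length(closest), 1)]
  y_donors[chosen]
}

set.seed(1)
pred_donors <- c(2.0, 2.9, 3.1, 3.4, 7.5)  # predictions for complete cases
y_donors    <- c(10, 20, 30, 40, 50)       # their observed values
imputed <- draw_donor(pred_target = 3.0, pred_donors, y_donors, q = 3)
print(imputed)
```

With these toy inputs the three closest donors are the second, third, and fourth, so the imputed value is one of their observed values; repeating the draw gives the random variation that distinguishes this approach from always taking the single nearest match.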

References

P. D. Allison. Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-136. Sage, Thousand Oaks CA, 2001.
J. Barnard and D. B. Rubin. Small-sample degrees of freedom with multiple imputation. Biometrika, 86:948–955, 1999.
S. A. Barnes, S. R. Lindborg, and J. W. Seaman. Multiple imputation techniques in small sample clinical trials. Stat Med, 25:233–245, 2006.
F. Barzi and M. Woodward. Imputations of missing values in practice: Results from imputations of serum cholesterol in 28 cohort studies. Am J Epi, 160:34–45, 2004.
S. F. Buck. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. J Roy Stat Soc B, 22:302–307, 1960.
S. van Buuren. Flexible Imputation of Missing Data. Chapman & Hall/CRC, Boca Raton, FL, 2012.
T. G. Clark and D. G. Altman. Developing a prognostic model in the presence of missing data: an ovarian cancer case study. J Clin Epi, 56:28–37, 2003.
S. L. Crawford, S. L. Tennstedt, and J. B. McKinlay. A comparison of analytic methods for non-random missingness of outcome data. J Clin Epi, 48:209–219, 1995.
R. B. D’Agostino, Jr. and D. B. Rubin. Estimating and using propensity scores with partially missing data. J Am Stat Assoc, 95:749–759, 2000.
A. R. T. Donders, G. J. M. G. van der Heijden, T. Stijnen, and K. G. M. Moons. Review: A gentle introduction to imputation of missing values. J Clin Epi, 59:1087–1091, 2006.
A. Donner. The relative effectiveness of procedures commonly used in multiple regression analysis for dealing with missing values. Am Statistician, 36:378–381, 1982.
B. Efron. Missing data, imputation, and the bootstrap (with discussion). J Am Stat Assoc, 89:463–479, 1994.
J. W. Graham, A. E. Olchowski, and T. D. Gilreath. How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prev Sci, 8:206–213, 2007.
S. Greenland and W. D. Finkle. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epi, 142:1255–1264, 1995.
O. Harel and X. Zhou. Multiple imputation: Review of theory, implementation and software. Stat Med, 26:3057–3077, 2007.
Y. He and A. M. Zaslavsky. Diagnosing imputation models by applying target analyses to posterior replicates of completed data. Stat Med, 31(1):1–18, 2012.
N. J. Horton and K. P. Kleinman. Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. Am Statistician, 61(1):79–90, 2007.
N. J. Horton and S. R. Lipsitz. Multiple imputation in practice: Comparison of software packages for regression models with missing variables. Am Statistician, 55:244–254, 2001.
S. Hunsberger, D. Murray, C. Davis, and R. R. Fabsitz. Imputation strategies for missing data in a school-based multi-center study: the Pathways study. Stat Med, 20:305–316, 2001.
K. J. Janssen, A. R. Donders, F. E. Harrell, Y. Vergouwe, Q. Chen, D. E. Grobbee, and K. G. Moons. Missing covariate data in medical research: To impute is better than to ignore. J Clin Epi, 63:721–727, 2010.
M. P. Jones. Indicator and stratification methods for missing explanatory variables in multiple linear regression. J Am Stat Assoc, 91:222–230, 1996.
L. Joseph, P. Belisle, H. Tamim, and J. S. Sampalis. Selection bias found in interpreting analyses with missing data for the prehospital index for trauma. J Clin Epi, 57:147–153, 2004.
G. Kalton and D. Kasprzyk. The treatment of missing survey data. Surv Meth, 12:1–16, 1986.
W. A. Knaus, F. E. Harrell, J. Lynn, L. Goldman, R. S. Phillips, A. F. Connors, N. V. Dawson, W. J. Fulkerson, R. M. Califf, N. Desbiens, P. Layde, R. K. Oye, P. E. Bellamy, R. B. Hakim, and D. P. Wagner. The SUPPORT prognostic model: Objective estimates of survival for seriously ill hospitalized adults. Ann Int Med, 122:191–203, 1995.
M. J. Knol, K. J. M. Janssen, R. T. Donders, A. C. G. Egberts, E. R. Heerding, D. E. Grobbee, K. G. M. Moons, and M. I. Geerlings. Unpredictable bias when using the missing indicator method or complete case analysis for missing confounder values: an empirical example. J Clin Epi, 63:728–736, 2010.
P. W. Lavori, R. Dawson, and D. Shera. A multiple imputation strategy for clinical trials with truncation of patient data. Stat Med, 14:1913–1925, 1995.
D. Y. Lin and Z. Ying. Semiparametric regression analysis of longitudinal data with informative drop-outs. Biostatistics, 4:385–398, 2003.
S. R. Lipsitz, L. P. Zhao, and G. Molenberghs. A semiparametric method of multiple imputation. J Roy Stat Soc B, 60:127–144, 1998.
R. Little and H. An. Robust likelihood-based analysis of multivariate data with missing values. Statistica Sinica, 14:949–968, 2004.
R. J. Little. Missing Data. In Ency of Biostatistics, pages 2622–2635. Wiley, New York, 1998.
R. J. A. Little. Missing-data adjustments in large surveys. J Bus Econ Stat, 6:287–296, 1988.
R. J. A. Little. Regression with missing X’s: A review. J Am Stat Assoc, 87:1227–1237, 1992.
R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, New York, second edition, 2002.
G. Marshall, B. Warner, S. MaWhinney, and K. Hammermeister. Prospective prediction in the presence of missing data. Stat Med, 21:561–570, 2002.
X. Meng. Multiple-imputation inferences with uncongenial sources of input. Stat Sci, 9:538–558, 1994.
M. E. Miller, T. M. Morgan, M. A. Espeland, and S. S. Emerson. Group comparisons involving missing data in clinical trials: a comparison of estimates and power (size) for some simple approaches. Stat Med, 20:2383–2397, 2001.
K. G. M. Moons, R. A. R. T. Donders, T. Stijnen, and F. E. Harrell. Using the outcome for imputation of missing predictor values was preferred. J Clin Epi, 59:1092–1101, 2006.
P. C. O’Brien, D. Zhang, and K. R. Bailey. Semi-parametric and non-parametric methods for clinical trials with incomplete data. Stat Med, 24:341–358, 2005.
M. Reilly and M. Pepe. The relationship between hot-deck multiple imputation and weighted likelihood. Stat Med, 16:5–19, 1997.
J. S. Roberts and G. M. Capalbo. A SAS macro for estimating missing values in multivariate data. In Proceedings of the Twelfth Annual SAS Users Group International Conference, pages 939–941, Cary, NC, 1987. SAS Institute, Inc.
D. Rubin and N. Schenker. Multiple imputation in health-care data bases: An overview and some applications. Stat Med, 10:585–598, 1991.
D. B. Rubin. Multiple Imputation for Nonresponse in Surveys. Wiley, New York, 1987.
J. L. Schafer and J. W. Graham. Missing data: Our view of the state of the art. Psych Meth, 7:147–177, 2002.
M. Schemper and G. Heinze. Probability imputation revisited for prognostic factor studies. Stat Med, 16:73–80, 1997.
M. Schemper and T. L. Smith. Efficient evaluation of treatment effects in the presence of missing covariate values. Stat Med, 9:777–784, 1990.
J. Shao and R. R. Sitter. Bootstrap for imputed survey data. J Am Stat Assoc, 91:1278–1288, 1996.
J. Siddique. Multiple imputation using an iterative hot-deck with distance-based donor selection. Stat Med, 27:83–102, 2008.
N. H. Timm. The estimation of variance-covariance and correlation matrices from incomplete data. Psychometrika, 35:417–437, 1970.
J. Twisk, M. de Boer, W. de Vente, and M. Heymans. Multiple imputation of missing values was not necessary before performing a longitudinal mixed-model analysis. J Clin Epi, 66(9):1022–1028, 2013.
W. Vach. Logistic Regression with Missing Values in the Covariates, volume 86 of Lecture Notes in Statistics. Springer-Verlag, New York, 1994.
W. Vach. Some issues in estimating the effect of prognostic factors from incomplete covariate data. Stat Med, 16:57–72, 1997.
W. Vach and M. Blettner. Logistic regression with incompletely observed categorical covariates—Investigating the sensitivity against violation of the missing at random assumption. Stat Med, 14:1315–1329, 1995.
W. Vach and M. Blettner. Missing Data in Epidemiologic Studies. In Ency of Biostatistics, pages 2641–2654. Wiley, New York, 1998.
W. Vach and M. Schumacher. Logistic regression with incompletely observed categorical covariates: A comparison of three approaches. Biometrika, 80:353–362, 1993.
S. van Buuren, H. C. Boshuizen, and D. L. Knook. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med, 18:681–694, 1999.
S. van Buuren, J. P. L. Brand, C. G. M. Groothuis-Oudshoorn, and D. B. Rubin. Fully conditional specification in multivariate imputation. J Stat Computation Sim, 76(12):1049–1064, 2006.
G. J. M. G. van der Heijden, A. R. T. Donders, T. Stijnen, and K. G. M. Moons. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: A clinical example. J Clin Epi, 59:1102–1109, 2006.
P. T. von Hippel. Regression with missing Ys: An improved strategy for analyzing multiply imputed data. Soc Meth, 37(1):83–117, 2007.
R. Wang, J. Sedransk, and J. H. Jinn. Secondary data analysis when there are missing observations. J Am Stat Assoc, 87:952–961, 1992.
I. R. White and J. B. Carlin. Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Stat Med, 29:2920–2931, 2010.
I. R. White and P. Royston. Imputing missing covariate values for the Cox model. Stat Med, 28:1982–1998, 2009.
I. R. White, P. Royston, and A. M. Wood. Multiple imputation using chained equations: Issues and guidance for practice. Stat Med, 30(4):377–399, 2011.
A. M. Wood, I. R. White, and S. G. Thompson. Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals. Clin Trials, 1:368–376, 2004.
R. M. Yucel and A. M. Zaslavsky. Using calibration to improve rounding in imputation. Am Statistician, 62(2):125–129, 2008.
X. Zhou, G. J. Eckert, and W. M. Tierney. Multiple imputation in public health research. Stat Med, 20:1541–1549, 2001.

Author information

Authors and Affiliations

Department of Biostatistics, School of Medicine Vanderbilt University, Nashville, TN, USA
Frank E. Harrell Jr.

About this chapter

Cite this chapter

Harrell, F.E. (2015). Missing Data. In: Regression Modeling Strategies. Springer Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-19425-7_3

DOI: https://doi.org/10.1007/978-3-319-19425-7_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19424-0
Online ISBN: 978-3-319-19425-7
eBook Packages: Mathematics and Statistics (R0)
