Part of the book series: Springer Series in Statistics (SSS)

Abstract

There are missing data in the majority of datasets one is likely to encounter. Before discussing some of the problems of analyzing data in which some variables are missing for some subjects, we define some nomenclature.

Notes

  1. This may work if values are “missing” because they are “not applicable”; e.g., one has a measure of marital happiness, dichotomized as high or low, but the sample contains some unmarried people. One could then use a 3-category variable with values high, low, and unmarried (Paul Allison, IMPUTE e-mail list, 4 Jul 2009).

  2. Predictors of the target variable include all the other Xs along with auxiliary variables that are not included in the final outcome model, as long as they precede the variable being imputed in the causal chain (unlike with multiple imputation).

  3. Thus when modeling binary or categorical targets one can frequently take least squares shortcuts in place of maximum likelihood for binary, ordinal, or multinomial logistic models.

  4. Reference 662 discusses an alternative method based on choosing a donor observation at random from the q closest matches (q = 3, for example).

  5. To use the bootstrap to correctly estimate variances of regression coefficients, one must repeat the imputation process and the model fitting perhaps 100 times using a resampling procedure (references 174 and 566; see Section 5.2). Still, the bootstrap can estimate the right variance for the wrong parameter estimates if the imputations are not done correctly.

  6. The dataset is on the book’s dataset wiki and may be automatically fetched over the internet and loaded using the Hmisc package’s command getHdata(support).

  7. You can use the R command subset(support, is.na(totcst) | totcst > 0). The is.na condition tells R that it is permissible to include observations having missing totcst without setting all columns of such observations to NA.

  8. We are anti-logging predicted log costs, and we assume that log cost has a symmetric distribution.
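
The row selection described in Note 7 can be illustrated with a minimal sketch in base R. The data frame `toy` and its values are invented stand-ins for the `support` data, not taken from the chapter; only the column name `totcst` and the `subset()` expression come from the note.

```r
# Hypothetical stand-in for the support data frame (values are invented).
toy <- data.frame(id = 1:4, totcst = c(1200, 0, NA, 350))

# Bracket indexing with an NA condition returns a row whose columns are all NA.
by_bracket <- toy[toy$totcst > 0, ]

# subset() treats an NA condition as FALSE, so the missing-cost row is dropped.
positive_only <- subset(toy, totcst > 0)

# Adding is.na() keeps the missing-cost row intact and drops only zero cost.
keep_missing <- subset(toy, is.na(totcst) | totcst > 0)

print(keep_missing)
```

In this sketch `keep_missing` retains the observation with missing `totcst` together with its other columns, which is the behavior one would want before imputing total cost.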

References

  1. P. D. Allison. Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-136. Sage, Thousand Oaks CA, 2001.

  2. J. Barnard and D. B. Rubin. Small-sample degrees of freedom with multiple imputation. Biometrika, 86:948–955, 1999.

  3. S. A. Barnes, S. R. Lindborg, and J. W. Seaman. Multiple imputation techniques in small sample clinical trials. Stat Med, 25:233–245, 2006.

  4. F. Barzi and M. Woodward. Imputations of missing values in practice: Results from imputations of serum cholesterol in 28 cohort studies. Am J Epi, 160:34–45, 2004.

  5. S. F. Buck. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. J Roy Stat Soc B, 22:302–307, 1960.

  6. S. van Buuren. Flexible imputation of missing data. Chapman & Hall/CRC, Boca Raton, FL, 2012.

  7. T. G. Clark and D. G. Altman. Developing a prognostic model in the presence of missing data: an ovarian cancer case study. J Clin Epi, 56:28–37, 2003.

  8. S. L. Crawford, S. L. Tennstedt, and J. B. McKinlay. A comparison of analytic methods for non-random missingness of outcome data. J Clin Epi, 48:209–219, 1995.

  9. R. B. D’Agostino, Jr and D. B. Rubin. Estimating and using propensity scores with partially missing data. J Am Stat Assoc, 95:749–759, 2000.

  10. A. R. T. Donders, G. J. M. G. van der Heijden, T. Stijnen, and K. G. M. Moons. Review: A gentle introduction to imputation of missing values. J Clin Epi, 59:1087–1091, 2006.

  11. A. Donner. The relative effectiveness of procedures commonly used in multiple regression analysis for dealing with missing values. Am Statistician, 36:378–381, 1982.

  12. B. Efron. Missing data, imputation, and the bootstrap (with discussion). J Am Stat Assoc, 89:463–479, 1994.

  13. J. W. Graham, A. E. Olchowski, and T. D. Gilreath. How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prev Sci, 8:206–213, 2007.

  14. S. Greenland and W. D. Finkle. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epi, 142:1255–1264, 1995.

  15. O. Harel and X. Zhou. Multiple imputation: Review of theory, implementation and software. Stat Med, 26:3057–3077, 2007.

  16. Y. He and A. M. Zaslavsky. Diagnosing imputation models by applying target analyses to posterior replicates of completed data. Stat Med, 31(1):1–18, 2012.

  17. N. J. Horton and K. P. Kleinman. Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. Am Statistician, 61(1):79–90, 2007.

  18. N. J. Horton and S. R. Lipsitz. Multiple imputation in practice: Comparison of software packages for regression models with missing variables. Am Statistician, 55:244–254, 2001.

  19. S. Hunsberger, D. Murray, C. Davis, and R. R. Fabsitz. Imputation strategies for missing data in a school-based multi-center study: the Pathways study. Stat Med, 20:305–316, 2001.

  20. K. J. Janssen, A. R. Donders, F. E. Harrell, Y. Vergouwe, Q. Chen, D. E. Grobbee, and K. G. Moons. Missing covariate data in medical research: To impute is better than to ignore. J Clin Epi, 63:721–727, 2010.

  21. M. P. Jones. Indicator and stratification methods for missing explanatory variables in multiple linear regression. J Am Stat Assoc, 91:222–230, 1996.

  22. L. Joseph, P. Belisle, H. Tamim, and J. S. Sampalis. Selection bias found in interpreting analyses with missing data for the prehospital index for trauma. J Clin Epi, 57:147–153, 2004.

  23. G. Kalton and D. Kasprzyk. The treatment of missing survey data. Surv Meth, 12:1–16, 1986.

  24. W. A. Knaus, F. E. Harrell, J. Lynn, L. Goldman, R. S. Phillips, A. F. Connors, N. V. Dawson, W. J. Fulkerson, R. M. Califf, N. Desbiens, P. Layde, R. K. Oye, P. E. Bellamy, R. B. Hakim, and D. P. Wagner. The SUPPORT prognostic model: Objective estimates of survival for seriously ill hospitalized adults. Ann Int Med, 122:191–203, 1995.

  25. M. J. Knol, K. J. M. Janssen, R. T. Donders, A. C. G. Egberts, E. R. Heerding, D. E. Grobbee, K. G. M. Moons, and M. I. Geerlings. Unpredictable bias when using the missing indicator method or complete case analysis for missing confounder values: an empirical example. J Clin Epi, 63:728–736, 2010.

  26. P. W. Lavori, R. Dawson, and D. Shera. A multiple imputation strategy for clinical trials with truncation of patient data. Stat Med, 14:1913–1925, 1995.

  27. D. Y. Lin and Z. Ying. Semiparametric regression analysis of longitudinal data with informative drop-outs. Biostatistics, 4:385–398, 2003.

  28. S. R. Lipsitz, L. P. Zhao, and G. Molenberghs. A semiparametric method of multiple imputation. J Roy Stat Soc B, 60:127–144, 1998.

  29. R. Little and H. An. Robust likelihood-based analysis of multivariate data with missing values. Statistica Sinica, 14:949–968, 2004.

  30. R. J. Little. Missing Data. In Ency of Biostatistics, pages 2622–2635. Wiley, New York, 1998.

  31. R. J. A. Little. Missing-data adjustments in large surveys. J Bus Econ Stat, 6:287–296, 1988.

  32. R. J. A. Little. Regression with missing X’s: A review. J Am Stat Assoc, 87:1227–1237, 1992.

  33. R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, New York, second edition, 2002.

  34. G. Marshall, B. Warner, S. MaWhinney, and K. Hammermeister. Prospective prediction in the presence of missing data. Stat Med, 21:561–570, 2002.

  35. X. Meng. Multiple-imputation inferences with uncongenial sources of input. Stat Sci, 9:538–558, 1994.

  36. M. E. Miller, T. M. Morgan, M. A. Espeland, and S. S. Emerson. Group comparisons involving missing data in clinical trials: a comparison of estimates and power (size) for some simple approaches. Stat Med, 20:2383–2397, 2001.

  37. K. G. M. Moons, R. A. R. T. Donders, T. Stijnen, and F. E. Harrell. Using the outcome for imputation of missing predictor values was preferred. J Clin Epi, 59:1092–1101, 2006.

  38. P. C. O’Brien, D. Zhang, and K. R. Bailey. Semi-parametric and non-parametric methods for clinical trials with incomplete data. Stat Med, 24:341–358, 2005.

  39. M. Reilly and M. Pepe. The relationship between hot-deck multiple imputation and weighted likelihood. Stat Med, 16:5–19, 1997.

  40. J. S. Roberts and G. M. Capalbo. A SAS macro for estimating missing values in multivariate data. In Proceedings of the Twelfth Annual SAS Users Group International Conference, pages 939–941, Cary, NC, 1987. SAS Institute, Inc.

  41. D. Rubin and N. Schenker. Multiple imputation in health-care data bases: An overview and some applications. Stat Med, 10:585–598, 1991.

  42. D. B. Rubin. Multiple Imputation for Nonresponse in Surveys. Wiley, New York, 1987.

  43. J. L. Schafer and J. W. Graham. Missing data: Our view of the state of the art. Psych Meth, 7:147–177, 2002.

  44. M. Schemper and G. Heinze. Probability imputation revisited for prognostic factor studies. Stat Med, 16:73–80, 1997.

  45. M. Schemper and T. L. Smith. Efficient evaluation of treatment effects in the presence of missing covariate values. Stat Med, 9:777–784, 1990.

  46. J. Shao and R. R. Sitter. Bootstrap for imputed survey data. J Am Stat Assoc, 91:1278–1288, 1996.

  47. J. Siddique. Multiple imputation using an iterative hot-deck with distance-based donor selection. Stat Med, 27:83–102, 2008.

  48. N. H. Timm. The estimation of variance-covariance and correlation matrices from incomplete data. Psychometrika, 35:417–437, 1970.

  49. J. Twisk, M. de Boer, W. de Vente, and M. Heymans. Multiple imputation of missing values was not necessary before performing a longitudinal mixed-model analysis. J Clin Epi, 66(9):1022–1028, 2013.

  50. W. Vach. Logistic Regression with Missing Values in the Covariates, volume 86 of Lecture Notes in Statistics. Springer-Verlag, New York, 1994.

  51. W. Vach. Some issues in estimating the effect of prognostic factors from incomplete covariate data. Stat Med, 16:57–72, 1997.

  52. W. Vach and M. Blettner. Logistic regression with incompletely observed categorical covariates—Investigating the sensitivity against violation of the missing at random assumption. Stat Med, 14:1315–1329, 1995.

  53. W. Vach and M. Blettner. Missing Data in Epidemiologic Studies. In Ency of Biostatistics, pages 2641–2654. Wiley, New York, 1998.

  54. W. Vach and M. Schumacher. Logistic regression with incompletely observed categorical covariates: A comparison of three approaches. Biometrika, 80:353–362, 1993.

  55. S. van Buuren, H. C. Boshuizen, and D. L. Knook. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med, 18:681–694, 1999.

  56. S. van Buuren, J. P. L. Brand, C. G. M. Groothuis-Oudshoorn, and D. B. Rubin. Fully conditional specification in multivariate imputation. J Stat Computation Sim, 76(12):1049–1064, 2006.

  57. G. J. M. G. van der Heijden, A. R. T. Donders, T. Stijnen, and K. G. M. Moons. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: A clinical example. J Clin Epi, 59:1102–1109, 2006.

  58. P. T. von Hippel. Regression with missing ys: An improved strategy for analyzing multiple imputed data. Soc Meth, 37(1):83–117, 2007.

  59. R. Wang, J. Sedransk, and J. H. Jinn. Secondary data analysis when there are missing observations. J Am Stat Assoc, 87:952–961, 1992.

  60. I. R. White and J. B. Carlin. Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Stat Med, 29:2920–2931, 2010.

  61. I. R. White and P. Royston. Imputing missing covariate values for the Cox model. Stat Med, 28:1982–1998, 2009.

  62. I. R. White, P. Royston, and A. M. Wood. Multiple imputation using chained equations: Issues and guidance for practice. Stat Med, 30(4):377–399, 2011.

  63. A. M. Wood, I. R. White, and S. G. Thompson. Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals. Clin Trials, 1:368–376, 2004.

  64. R. M. Yucel and A. M. Zaslavsky. Using calibration to improve rounding in imputation. Am Statistician, 62(2):125–129, 2008.

  65. X. Zhou, G. J. Eckert, and W. M. Tierney. Multiple imputation in public health research. Stat Med, 20:1541–1549, 2001.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Harrell, F.E. (2015). Missing Data. In: Regression Modeling Strategies. Springer Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-19425-7_3
