Abstract
Incomplete data are common in biomedical and other types of research, especially in longitudinal studies. During the last three decades, a vast amount of work has been done in the area. This has led, on the one hand, to a rich taxonomy of missing-data concepts, issues, and methods and, on the other hand, to a variety of data-analytic tools. Elements of the taxonomy include: missing-data patterns, mechanisms, and modeling frameworks; inferential paradigms; and sensitivity analysis frameworks. These are described in detail, and a variety of concrete modeling devices is presented. To illustrate the methods, two case studies are considered. The first concerns quality of life among breast cancer patients, while the second examines data from the Muscatine children’s obesity study.
Additional information
This invited paper is discussed in the comments available at: http://dx.doi.org/10.1007/s11749-009-0139-9, http://dx.doi.org/10.1007/s11749-009-0140-3, http://dx.doi.org/10.1007/s11749-009-0141-2, http://dx.doi.org/10.1007/s11749-009-0142-1, http://dx.doi.org/10.1007/s11749-009-0143-0.
Cite this article
Ibrahim, J.G., Molenberghs, G. Missing data methods in longitudinal studies: a review. TEST 18, 1–43 (2009). https://doi.org/10.1007/s11749-009-0138-x
DOI: https://doi.org/10.1007/s11749-009-0138-x
Keywords
- Expectation-maximization algorithm
- Incomplete data
- Missing completely at random
- Missing at random
- Missing not at random
- Pattern-mixture model
- Selection model
- Sensitivity analyses
- Shared-parameter model