Introduction
What is new?
- Dropping a variable with missing data from the analyses or conducting a complete case analysis more often leads to biased effect estimates, decreased coverage of the confidence intervals, and a decreased discriminative ability of the multivariable model, compared with multiple imputation.
- To “provide” data according to the strict methodology of multiple imputation seems a better alternative than to give up or delete valuable observed data.
No matter how hard researchers try to prevent it, missing data occur frequently in medical research [1]. Commonly, researchers simply discard all the data of patients with missing values because this is what standard software packages do when the data are analyzed (complete case analysis). Because this leads to a smaller dataset, it comes at the price of at least a loss of power. Complete case analysis does not necessarily lead to biased results. Under the condition that the missing values are missing completely at random (MCAR), meaning that the cause of missingness is pure coincidence, complete case analysis will not lead to biased results. As an alternative to complete case analysis, researchers tend to drop a variable from the analysis when it has missing values. However, both methods discard valuable observed data.
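The MCAR argument above can be illustrated with a minimal simulation sketch. It is not based on the data used in this study: the variables `x` and `y`, the true slope of 2.0, the sample size, and the 30% missingness rate are all hypothetical choices made for illustration. Under these assumptions, the complete case estimate stays close to the true slope, but it is based on fewer patients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: one covariate x and a continuous outcome y.
n = 5000
true_slope = 2.0
x = rng.normal(size=n)
y = true_slope * x + rng.normal(size=n)

# MCAR: each x value is set to missing with probability 0.3,
# independently of x, y, and everything else.
missing = rng.random(n) < 0.3
x_obs = np.where(missing, np.nan, x)

# Complete case analysis: keep only rows in which x is observed.
complete = ~np.isnan(x_obs)
n_complete = int(complete.sum())

# Least-squares slope on the full data and on the complete cases.
slope_full = np.polyfit(x, y, 1)[0]
slope_cc = np.polyfit(x_obs[complete], y[complete], 1)[0]

print(f"full data:      n={n}, slope={slope_full:.3f}")
print(f"complete cases: n={n_complete}, slope={slope_cc:.3f}")
```

If missingness were instead made to depend on the outcome or on other variables, the complete case slope in such a simulation would no longer be guaranteed to stay near the true value.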
Multiple imputation is a statistical technique that uses all observed data to fill in plausible values for the missing values [2], [3], [4], [5], [6], [7], [8]. This method receives increasing attention in the medical literature [9], [10], [11], [12], [13], [14], [15], [16]. Nevertheless, many researchers seem unaware of or uncertain about this approach to dealing with missing values and still perform a complete case analysis or drop variables with missing values from the analysis [17]. The extent and sort of bias related to these approaches depend on the type of study. Diagnostic or prognostic studies often examine the contribution of covariates (eg, patient characteristics and test results) to the prediction of a particular outcome by estimating the predictors' regression coefficients. For example, one may study the predictive effect of body mass index (BMI), age, gender, the intake of saturated fat, and other lifestyle factors on the risk of cardiovascular diseases (CVD). Sometimes, these studies are aimed at developing a multivariable prediction model or risk score and at estimating the ability of such a model to distinguish between patients at high and low risk of CVD. In etiologic studies, usually the effect of a specific etiologic factor on the outcome of interest is studied, corrected for the influence of other covariates (confounders). Following the previous example, the regression coefficient of BMI could be the parameter of interest, corrected for the confounders age, gender, intake of saturated fat, and other lifestyle factors.
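As a rough sketch of how multiple imputation works in practice, the following example imputes a partly missing covariate several times, fits the analysis model on each completed dataset, and pools the results with the standard combining rules usually attributed to Rubin. It is not the procedure used in this study: the simulated variables `x1`, `x2`, and `y`, the linear imputation model, the bootstrap step, and the number of imputations `m` are all assumptions made for illustration.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients, their covariance matrix, and the residual variance."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, cov, sigma2

rng = np.random.default_rng(1)
n, m = 2000, 20  # hypothetical sample size and number of imputations

# Hypothetical simulated data; the true coefficient of x2 is 1.0.
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# 30% of the x2 values are treated as missing (MCAR); their true values
# are never used below.
missing = rng.random(n) < 0.3
observed_rows = np.flatnonzero(~missing)

ones = np.ones(n)
Z = np.column_stack([ones, x1, y])  # predictors in the imputation model

estimates, variances = [], []
for _ in range(m):
    # Bootstrap the complete cases so that uncertainty in the imputation
    # model itself is carried into the imputed values.
    idx = rng.choice(observed_rows, size=observed_rows.size, replace=True)
    gamma, _, s2 = ols(Z[idx], x2[idx])

    # Impute: predicted value plus random residual noise.
    x2_imp = x2.copy()
    noise = rng.normal(scale=np.sqrt(s2), size=int(missing.sum()))
    x2_imp[missing] = Z[missing] @ gamma + noise

    # Fit the analysis model on the completed dataset.
    beta, cov, _ = ols(np.column_stack([ones, x1, x2_imp]), y)
    estimates.append(beta[2])
    variances.append(cov[2, 2])

# Pooling: average the estimates; total variance is the within-imputation
# variance plus (1 + 1/m) times the between-imputation variance.
pooled_estimate = np.mean(estimates)
within_var = np.mean(variances)
between_var = np.var(estimates, ddof=1)
total_var = within_var + (1 + 1 / m) * between_var

print(f"pooled coefficient of x2: {pooled_estimate:.3f}")
print(f"pooled standard error:    {np.sqrt(total_var):.3f}")
```

The outcome `y` is included in the imputation model on purpose: leaving it out would weaken the relation between the imputed covariate and the outcome. The between-imputation variance is what distinguishes multiple imputation from filling in a single value, because it makes the pooled standard error reflect the uncertainty about the missing data.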
We used empirical data from a previous study on deep venous thrombosis (DVT) to quantify the effect of different analyses in the presence of missing covariate data, both for prediction and for etiologic research purposes. We studied the effect of complete case analysis, dropping covariates with missing values, and multiple imputation on individual regression coefficients and on the predictive ability of a multivariable model for various proportions of missing covariate values.