Special Series: Missing Data
Review: A gentle introduction to imputation of missing values

https://doi.org/10.1016/j.jclinepi.2006.01.014

Abstract

In most situations, simple techniques for handling missing data (such as complete case analysis, overall mean imputation, and the missing-indicator method) produce biased results, whereas imputation techniques yield valid results without complicating the analysis once the imputations are carried out. Imputation techniques are based on the idea that any subject in a study sample can be replaced by a new randomly chosen subject from the same source population. Imputing missing data on a variable means replacing each missing value with a value drawn from an estimate of the distribution of this variable. In single imputation, only one estimate is used. In multiple imputation, various estimates are used, reflecting the uncertainty in the estimation of this distribution. Under the general conditions of so-called missing at random and missing completely at random, both single and multiple imputation result in unbiased estimates of study associations. However, single imputation yields estimated standard errors that are too small, whereas multiple imputation yields correctly estimated standard errors and confidence intervals. In this article we explain why this is the case and use a simple simulation study to demonstrate our explanations. We also explain and illustrate why two frequently used methods to handle missing data, i.e., overall mean imputation and the missing-indicator method, almost always result in biased estimates.

Introduction

Missing data are a common problem in all types of medical research. There are various methods of handling missing data. Simple and frequently used methods include complete or available case analysis, the missing-indicator method [1], and overall mean imputation. However, these methods lead to inefficient analyses and, more seriously, commonly produce severely biased estimates of the association(s) investigated [2], [3], [4], [5], [6]. There are more sophisticated (imputation) techniques to handle missing data, such as multiple imputation, that give much better results [2], [3], [4], [5], [6]. With these techniques, missing data for a subject are imputed by a value that is predicted using the subject's other, known characteristics. Presently, these sophisticated techniques are easily accessible and available in standard software such as SAS and S-Plus. Nevertheless, there seems to be a general lack of understanding that has limited their use in epidemiological research.

In this short report we give a gentle introduction to the logic behind these sophisticated imputation techniques for missing data. We will not go into technical details, nor into details on how to perform these analyses. For these we refer the reader to the literature [2], [3], [4], [5], [6], [7], [8]. Instead, to assist medical researchers in their future data analyses, we aim to clarify in simple terms why (more sophisticated) imputation is a better, more valid method than the simple and frequently used techniques for handling missing data. We start with a brief introduction to the different types of missing data and the principles of imputation in general, followed by an explanation of single and multiple imputation and of why frequently used methods fail. All this is illustrated using data from a simple simulation study.

Section snippets

Types of missing data

If subjects who have missing data are a random subset of the complete sample of subjects, missing data are called missing completely at random (MCAR) [9]. Typical examples of MCAR are when a tube containing a blood sample of a study subject is broken by accident (such that the blood parameters cannot be measured) or when a questionnaire of a study subject is accidentally lost. The reason for missingness is completely random, i.e., the probability that an observation is missing is not related
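The MCAR mechanism described in this snippet can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `make_mcar`, the 30% missingness probability, and the simulated measurements (normal, mean 120, SD 15) are all hypothetical and do not come from the article.

```python
import random
import statistics


def make_mcar(values, missing_prob, seed=0):
    """Return a copy of `values` in which every entry is set to None with the
    same probability, independent of its own value or of any other variable."""
    rng = random.Random(seed)
    return [None if rng.random() < missing_prob else v for v in values]


# Hypothetical data: 10,000 simulated measurements (assumed, not from the article).
data_rng = random.Random(1)
full = [data_rng.gauss(120.0, 15.0) for _ in range(10000)]

with_missing = make_mcar(full, missing_prob=0.3)
observed = [v for v in with_missing if v is not None]

full_mean = statistics.mean(full)
observed_mean = statistics.mean(observed)
missing_fraction = 1 - len(observed) / len(full)
```

Because the deletion ignores the values themselves, the subjects with observed data remain a random subset of the sample, so `observed_mean` should stay close to `full_mean` while only precision is lost.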

Imputation is replacement

We start this section by noting that in the classical (frequentist) statistical view, conclusions drawn from any study should not depend on the sample that is involved in the study. Should the study be repeated with a different sample, nearly identical results should be obtained. The conclusions do not depend on the given set of subjects in the sample. This implies that every subject in a randomly chosen sample can be replaced by a new subject that is randomly chosen from the same source

Single imputation

Direct replacement of subjects by new subjects from an identifiable source population based on observed subject characteristics may be feasible when the number of study variables is limited, as in our diagnostic example study where only two variables, the test result and disease status, were used. Commonly, however, the number of covariates is large. Suppose a nondiseased male subject, aged 39, with a body mass index of 24.5, and a systolic blood pressure of 110 has a missing test result. If
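As a rough illustration of single imputation as the abstract defines it (replacing a missing value with one draw from a single estimate of the variable's distribution), the sketch below imputes a missing test result from a normal distribution estimated within the subject's disease-status group. The normal model, the function name `single_impute_by_group`, and the example data are assumptions made for illustration; the article's own procedure is not shown in this snippet.

```python
import random
import statistics


def single_impute_by_group(values, groups, seed=0):
    """Fill each None in `values` with one random draw from a normal
    distribution whose mean and SD are estimated from the observed values
    of subjects in the same group (assumed model, for illustration only)."""
    rng = random.Random(seed)
    completed = list(values)
    for g in sorted(set(groups)):
        observed = [v for v, k in zip(values, groups) if k == g and v is not None]
        mu = statistics.mean(observed)
        sigma = statistics.stdev(observed)  # needs at least two observed values
        for i, (v, k) in enumerate(zip(values, groups)):
            if k == g and v is None:
                completed[i] = rng.gauss(mu, sigma)
    return completed


# Hypothetical data: disease status (1 = diseased) and a continuous test result.
disease = [1, 1, 1, 0, 0, 0, 1, 0]
test_result = [5.1, None, 4.7, 2.2, None, 2.9, 5.5, 2.4]

completed = single_impute_by_group(test_result, disease)
```

Analysing `completed` as if it were fully observed treats the estimated distribution as known, which is the reason the abstract gives for single imputation producing standard errors that are too small.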

Multiple imputation

To obtain correct estimates of the standard errors and P-values, we should take into account the imprecision caused by the fact that the distribution of the variables with missing values is estimated. This can be done by creating not a single imputed data set, but several or multiple imputed data sets in which different imputations are based on a random draw from different estimated underlying distributions [4], [5]. There are various approaches to creating these multiple imputed data sets.
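Once several imputed data sets have been analysed, the separate results have to be combined into one estimate and one standard error. A minimal sketch of the standard combining rules (commonly called Rubin's rules) is given below. That the article uses exactly these rules is an assumption here, and the function name `pool_estimates` and the example numbers are hypothetical.

```python
import math
import statistics


def pool_estimates(estimates, variances):
    """Combine one estimate and one squared standard error per imputed data set.

    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputed data sets with matching variances")
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)        # average sampling variance
    between = statistics.variance(estimates)   # spread across imputations (divides by m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical regression coefficients and squared standard errors from m = 3 imputed data sets.
pooled, pooled_se = pool_estimates([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
```

The between-imputation term is what carries the extra uncertainty from estimating the distribution of the missing values. With a single imputed data set that term cannot be computed, so the pooled standard error would be understated.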

Simulation study

We performed a simulation study based on our diagnostic example to illustrate that single imputation yields unbiased estimates with too narrow confidence intervals and multiple imputation indeed yields unbiased estimates with correct standard errors and P-values. We simulated 1,000 samples of 500 subjects using R [14]. The samples were drawn from a population consisting of equal numbers of diseased and nondiseased subjects. The true regression coefficient in a logistic regression model linking

Indicator method

A still popular method for handling missing values is the so-called missing-indicator method [1]. For each independent variable with missing values a new dummy or indicator (0/1) variable is created with “1” indicating a missing value on the original variable and “0” indicating an observed value. For the original variable the missing values are recoded as “0.” For (original) categorical variables this in fact means creating an extra value category for the missing values. When estimating the
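The recoding step described in this snippet can be written out as a short Python sketch. It shows only how the two variables are constructed, not a recommendation: the abstract states that the method almost always gives biased estimates. The function name `missing_indicator_recode` and the example values are hypothetical.

```python
def missing_indicator_recode(values):
    """Return (recoded, indicator) for one independent variable.

    indicator[i] is 1 where the original value is missing and 0 where it is observed;
    recoded[i] is the original value, with missing values replaced by 0.
    """
    indicator = [1 if v is None else 0 for v in values]
    recoded = [0 if v is None else v for v in values]
    return recoded, indicator


# Hypothetical variable with two missing values.
recoded, indicator = missing_indicator_recode([2.5, None, 4.0, None])
```

Both `recoded` and `indicator` would then enter the regression model together in place of the original variable.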

Final comments

Our purpose was to provide insight into how sophisticated imputation works, to facilitate the understanding and cooperation between medical researchers and statisticians, and to make the data analysis a success. Complete and available case analyses provide inefficient though valid results when missing data are MCAR, but biased results when missing data are MAR, which is the more common form of missingness in epidemiological research. Other frequently used methods to handle missing data such as

Acknowledgments

We gratefully acknowledge the support by The Netherlands Organization for Scientific Research (ZON-MW 904-10-006 and 917-46-360).
