Recursive partitioning for missing data imputation in the presence of interaction effects☆
Introduction
Today’s state-of-the-art solution for handling missing data is multiple imputation. Different methods are available for implementing multiple imputation, each using the information in the data at hand (Van Buuren, 2012). The common element in these methods is that they model the relations between variables. It is particularly important that the imputation model reflects the structure of the data, since otherwise parameter estimates under multiple imputation will be biased. Caution is therefore needed when data contain nonlinear structures such as a quadratic relation. Approaches to implementing multiple imputation, such as Multiple Imputation by Chained Equations (MICE; Van Buuren, 2007), do not automatically incorporate nonlinear relations. We focus on a special case of nonlinear relations, namely interaction effects. For the purpose of this study, both cross-products and quadratic terms are referred to as interactions.
MICE is a popular approach for implementing multiple imputation because of its flexibility. In MICE, multivariate missing data are imputed on a variable-by-variable basis, an approach called fully conditional specification (Van Buuren, 2007). Imputations are thus created per variable, so that each incomplete variable requires its own specified imputation model. In these imputation models, interactions can be modelled in two ways: first, by manually specifying models that include interaction effects, and second, by imputing subgroups of the data separately. For example, one could create distinct imputation models for males and females. Both approaches are somewhat cumbersome, and they are often unusable because the structure of the data is usually unknown before imputation. Therefore, models should preferably be fitted to the data automatically, without unnecessary user involvement.
A technique that can handle interactions with ease is recursive partitioning (Burgette and Reiter, 2010; Hand, 1997). One of the first implementations of recursive partitioning is called Automatic Interaction Detection (Morgan and Sonquist, 1963). Recursive partitioning models the interaction structure in the data by sequentially splitting a dataset into increasingly homogeneous subsets (Breiman et al., 1984). Essentially, it finds the split that is most predictive of the response variable by searching through all predictor variables (Merkle and Schaffer, 2011). Within the subgroups created from one predictor variable, the algorithm goes on to partition the data based on other variables or on other splits of the same predictor. The resulting series of splits can be represented by a tree structure like Fig. 1, to which we will return in Section 2. Since splits are conditional on previous splits, the variables used may indicate interaction effects. By constructing models in this manner, possible interactions are automatically taken into account.
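To make the split search concrete, the following is a minimal sketch and not the CART implementation used in this study. It assumes a numeric response, binary splits at midpoints between adjacent observed predictor values, and the residual sum of squares as the homogeneity criterion; the function names `sse` and `best_split` and the toy data are illustrative only.

```python
from typing import List, Optional, Sequence, Tuple


def sse(values: Sequence[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(
    X: List[Sequence[float]], y: Sequence[float]
) -> Optional[Tuple[int, float, float]]:
    """Search all predictors for the binary split that minimizes the
    summed within-group SSE of the response.

    Returns (predictor index, threshold, SSE after split), or None when
    no predictor has two distinct values to split on.
    """
    best = None
    for j in range(len(X[0])):
        levels = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between adjacent observed values.
        for lo, hi in zip(levels, levels[1:]):
            threshold = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[j] > threshold]
            total = sse(left) + sse(right)
            # Strict comparison: ties keep the first split found.
            if best is None or total < best[2]:
                best = (j, threshold, total)
    return best


# Toy data with a pure cross-product effect: y is nonzero only when
# both predictors equal 1.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 10]

first = best_split(X, y)

# Recurse into the subgroup on the right-hand side of the first split.
j, t, _ = first
subgroup = [(row, yi) for row, yi in zip(X, y) if row[j] > t]
second = best_split([row for row, _ in subgroup], [yi for _, yi in subgroup])

print(first, second)
```

In this toy example the first split is made on the first predictor, and the second predictor only becomes informative within the subgroup where the first predictor equals 1. This conditional dependence between successive splits is how a tree can represent a cross-product without it being specified in advance.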
Others have worked on this idea of combining recursive partitioning with imputation methods, e.g., Burgette and Reiter (2010), Iacus and Porro (2007, 2008), Nonyane and Foulkes (2007), Stekhoven and Bühlmann (2012), and Van Buuren (2012, p. 83). The main shortcoming of most of the proposed methods is that recursive partitioning is combined with single imputation instead of multiple imputation. Therefore, they cannot be used for making appropriate statistical inferences. Another shortcoming is that, except for Burgette and Reiter, the performance of these methods has not been investigated on data containing interaction effects. In the current study, we aim to overcome these shortcomings by providing a framework for connecting recursive partitioning techniques with multiple imputation. Multiple imputation takes into account the uncertainty associated with the missing data (Rubin, 1996), which results in parameter estimates with appropriate confidence intervals.
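How multiple imputation carries missing-data uncertainty into the final inference can be illustrated with the standard pooling rules, commonly known as Rubin's rules. The sketch below is an illustration under stated assumptions, not code from this study: it assumes a scalar parameter estimated once on each of m imputed datasets, and the function name `pool_estimates` and the example numbers are hypothetical.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_estimates(
    estimates: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float]:
    """Pool m per-imputation estimates of a scalar parameter.

    estimates: point estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations with matching inputs")
    pooled = mean(estimates)
    within = mean(variances)          # average sampling variance
    between = variance(estimates)     # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from m = 3 imputed datasets.
pooled, total = pool_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total)
```

The between-imputation term is what single imputation lacks: with only one completed dataset it cannot be estimated, so the variance of the estimate is understated.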
The purpose of our study is to gain insight into whether the use of recursive partitioning in multiple imputation (i.e., MICE) is a convenient way to preserve interaction effects. We consider two main questions. First, which recursive partitioning techniques create appropriate variability between repeated imputations? Second, what are the statistical properties (e.g., bias, coverage, confidence interval width) of estimates of the interaction parameters? In addressing these questions, distinctions will be made between different types of interactions. In addition, the two questions will be considered for both continuous and categorical data. Burgette and Reiter (2010) embarked on the implementation of recursive partitioning in MICE and demonstrated the performance of the method on a single model with continuous predictor and response variables. We want to elaborate on the work of Burgette and Reiter and, to be complete, also consider categorical predictor and response variables. Different results are expected for the two types of data, since recursive partitioning techniques are known to perform especially well for data with interactions between categorical variables (Dusseldorp et al., 2010).
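The statistical properties named above can be computed from repeated simulation runs along the following lines. This is a sketch using the usual simulation-study definitions, which are assumed here rather than taken from the paper: bias as the mean estimate minus the true value, coverage as the proportion of confidence intervals containing the true value, and width as the mean interval length. The function name `summarize_simulation` and the example numbers are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def summarize_simulation(
    results: Sequence[Tuple[float, float, float]], true_value: float
) -> Dict[str, float]:
    """Summarize simulation replications of one parameter.

    results: one (estimate, ci_lower, ci_upper) triple per replication
    true_value: the parameter value used to generate the data
    """
    n = len(results)
    if n == 0:
        raise ValueError("no replications supplied")
    bias = sum(est for est, _, _ in results) / n - true_value
    covered = sum(1 for _, lo, hi in results if lo <= true_value <= hi)
    width = sum(hi - lo for _, lo, hi in results) / n
    return {"bias": bias, "coverage": covered / n, "ci_width": width}


# Hypothetical replications for a true interaction coefficient of 1.0.
runs = [(1.0, 0.5, 1.5), (1.2, 0.7, 1.7), (0.8, 0.3, 1.3), (2.0, 1.5, 2.5)]
summary = summarize_simulation(runs, true_value=1.0)
print(summary)
```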
This paper is organized as follows. Section 2 first elaborates on MICE and then considers two main recursive partitioning techniques, namely Classification And Regression Trees (CART; Breiman et al., 1984) and random forests (Breiman, 2001). Subsequently, the incorporation of recursive partitioning in the MICE framework is presented. Section 3 discusses the different interaction types that are examined in answering the research questions. We then distinguish between predictor and response variables being either continuous (Section 4) or categorical (Section 5). Sections 4 and 5 each describe a simulation study, carried out to investigate which of the discussed methods are convenient for preserving interaction effects, followed by its results. The results from both simulation studies are discussed in Section 6, which closes with some final conclusions.
Section snippets
Multiple imputation by chained equations
Imagine a set of variables, some or all of which have missing values. Handling these data using MICE comprises three main steps: generating multiple imputations, analyzing the imputed data, and pooling the analysis results (Van Buuren, 2007). The main idea is to impute each incomplete variable using its own imputation model. All missing values are initially filled in at random. The first variable with at least one missing value is then regressed on the remaining variables,
Interaction types
To gain insight into whether the use of recursive partitioning in multiple imputation is a convenient way to preserve interaction effects, we distinguish different types of interactions. First, interactions may vary with regard to the correlation between the variables that interact. This correlation may range from −1.0 to 1.0. Second, the effect size of an interaction effect, which implies the strength of the relation between an interaction effect and another (pair of) variable(s),
Simulation study
The performance of CART, Forest-boot and Forest-RI was compared with the default imputation method in mice for continuous data, i.e., predictive mean matching (denoted by Pmm). The simulation study, used to gain insight into the performance of the four imputation methods in the presence of interaction effects, can be described on the basis of five components.
Component 1: Data generation model. Data were generated using three different regression models, where each model included a two-way
Categorical predictor and response variables
In this section, the performance of CART, Forest-boot and Forest-RI as imputation methods in mice is investigated for categorical predictor and response variables.
Discussion
We implemented three recursive partitioning techniques that incorporate interaction effects in the data as imputation methods in mice: CART, restricted random forests using bootstrapping only, and random forests using a combination of bootstrapping and random input selection. We studied the bias and coverage of parameter estimates after imputation by these methods. In doing this, we replicated and extended the study of Burgette and Reiter (2010), who examined the performance of CART in MICE for
References (34)
- et al. Missing data imputation, matching and other applications of random recursive partitioning. Comput. Statist. Data Anal. (2007)
- et al. Multiple Regression: Testing and Interpreting Interactions (1991)
- Random forests. Mach. Learn. (2001)
- et al. Classification and Regression Trees (1984)
- et al. Multiple imputation for missing data via sequential regression trees. Amer. J. Epidemiol. (2010)
- Statistical Power Analysis for the Behavioral Sciences (1988)
- et al. A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol. Methods (2001)
- et al. Combining an additive and tree-based regression model simultaneously: STIMA. J. Comput. Graph. Statist. (2010)
- Multivariate adaptive regression splines. Ann. Statist. (1991)
- et al. How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Sci. (2007)
- Using odds ratios as effect sizes for meta-analysis of dichotomous data: a primer on methods and issues. Psychol. Methods
- Construction and Assessment of Classification Rules
- The Elements of Statistical Learning; Data Mining, Inference, and Prediction
- Unbiased recursive partitioning: a conditional inference framework. J. Comput. Graph. Statist.
- Invariant and metric free proximities for data matching: an R package. J. Stat. Softw.
- An exploratory technique for investigating large quantities of categorical data. J. R. Stat. Soc.
☆ Supplementary materials related to the implementation of proposed methods are available online (see Appendix C).