Recursive partitioning for missing data imputation in the presence of interaction effects

https://doi.org/10.1016/j.csda.2013.10.025

Abstract

Standard approaches to implement multiple imputation do not automatically incorporate nonlinear relations like interaction effects. This leads to biased parameter estimates when interactions are present in a dataset. With the aim of providing an imputation method that preserves interactions in the data automatically, the use of recursive partitioning as an imputation method is examined. Three recursive partitioning techniques are implemented in the multiple imputation by chained equations framework. Using simulated data, it is investigated whether recursive partitioning creates appropriate variability between imputations and unbiased parameter estimates with appropriate confidence intervals. It is concluded that, when interaction effects are present in a dataset, substantial gains are possible by using recursive partitioning for imputation compared to standard applications. In addition, it is shown that the potential of recursive partitioning imputation approaches depends on the relevance of a possible interaction effect, the correlation structure of the data, and the type of possible interaction effect present in the data.

Introduction

Today’s state-of-the-art solution for handling missing data is multiple imputation. In approaches to implement multiple imputation, different methods are available to use the information from the data at hand (Van Buuren, 2012). The common element in these methods is that they model the relations between variables. In doing so, it is particularly important to reflect the structure of the data, since otherwise parameter estimates under multiple imputation will be biased. Caution is therefore needed when data contain nonlinear structures like a quadratic relation. Approaches to implement multiple imputation, like Multiple Imputation by Chained Equations (MICE; Van Buuren, 2007), do not automatically incorporate nonlinear relations. We focus on a special case of nonlinear relations, namely interaction effects. For the purpose of this study, both cross-products and quadratic terms are referred to as interactions.

MICE is a popular approach for implementing multiple imputation because of its flexibility. In MICE, multivariate missing data are imputed on a variable by variable basis, called fully conditional specification (Van Buuren, 2007). This means that per variable imputations are created, such that for each incomplete variable a specified imputation model is required. In these imputation models, interactions can be modelled in two ways: first, by specifying models including interaction effects manually and second by imputing subgroups of the data separately. For example, one could create distinct imputation models for males and females. Besides the fact that both approaches are somewhat cumbersome, they are often unusable as the structure of the data is usually unknown before imputation. Therefore, models should preferably be fitted to the data in an automatic fashion without unnecessary user involvement.
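The variable-by-variable scheme described above can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the mice implementation: it handles exactly two numeric columns, uses a simple linear model per variable, and adds a randomly drawn observed residual instead of drawing model parameters from their posterior. The names `fit_line` and `chained_imputation` are hypothetical.

```python
import random
from statistics import fmean


def fit_line(x, y):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx if sxx else 0.0
    return my - b * mx, b


def chained_imputation(data, n_iter=5, seed=0):
    """Impute a two-column dataset (None = missing) one variable at a time."""
    rng = random.Random(seed)
    n_cols = 2  # assumption: exactly two columns, each predicting the other
    missing = [[row[j] is None for row in data] for j in range(n_cols)]
    filled = [list(row) for row in data]

    # Initialise every gap with a randomly chosen observed value of that column.
    for j in range(n_cols):
        observed = [row[j] for row in data if row[j] is not None]
        for i, row in enumerate(filled):
            if missing[j][i]:
                row[j] = rng.choice(observed)

    for _ in range(n_iter):
        for j in range(n_cols):
            k = 1 - j  # the other column serves as predictor
            obs = [i for i in range(len(filled)) if not missing[j][i]]
            a, b = fit_line([filled[i][k] for i in obs],
                            [filled[i][j] for i in obs])
            residuals = [filled[i][j] - (a + b * filled[i][k]) for i in obs]
            for i in range(len(filled)):
                if missing[j][i]:
                    # Prediction plus a residual drawn from the observed cases.
                    filled[i][j] = a + b * filled[i][k] + rng.choice(residuals)
    return filled


# Illustrative data only; different seeds give different completed datasets.
data = [[1.0, 2.1], [2.0, None], [3.0, 6.2], [None, 8.1], [5.0, 9.8], [6.0, None]]
imputations = [chained_imputation(data, seed=s) for s in range(5)]
```

Because each per-variable model here is additive, a sketch like this would not reproduce an interaction unless the interaction term were added by hand, which is the limitation the paragraph above points to.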

A technique that can handle interactions with ease is recursive partitioning (Burgette and Reiter, 2010; Hand, 1997). One of the first implementations of recursive partitioning is called Automatic Interaction Detection (Morgan and Sonquist, 1963). The recursive partitioning technique models the interaction structure in the data by sequentially splitting a dataset into increasingly homogeneous subsets (Breiman et al., 1984). Essentially, recursive partitioning finds the split that is most predictive of the response variable by searching through all predictor variables (Merkle and Schaffer, 2011). Within the subgroups created from one predictor variable, the algorithm goes on to partition the data based on other variables or other splits of the same predictor. The resulting series of splits can be represented by a tree structure like Fig. 1, to which we will return in Section 2. Since splits are conditional on previous splits, the variables used may indicate interaction effects. By constructing models in this manner, possible interactions are automatically taken into account.
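The split search described above can be illustrated with a small regression-tree sketch. It assumes a numeric response and uses the within-node sum of squares as the homogeneity criterion, which is one common choice and not necessarily the criterion used in the study. The function names (`sse`, `best_split`, `grow`) and the toy data are hypothetical.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(X, y):
    """Search all predictors and all midpoints between distinct values for
    the binary split with the smallest total within-node sum of squares."""
    best = None  # (total_sse, predictor_index, threshold)
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            total = sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, j, t)
    return best


def grow(X, y, depth=2):
    """Recursively partition the data; leaves hold the mean response."""
    split = best_split(X, y) if depth > 0 and sse(y) > 0 else None
    if split is None:
        return fmean(y)
    _, j, t = split
    left = [(row, yi) for row, yi in zip(X, y) if row[j] <= t]
    right = [(row, yi) for row, yi in zip(X, y) if row[j] > t]
    return {
        "predictor": j,
        "threshold": t,
        "left": grow([r for r, _ in left], [v for _, v in left], depth - 1),
        "right": grow([r for r, _ in right], [v for _, v in right], depth - 1),
    }


# Toy data with a pure cross-product effect: y = 10 * x0 * x1.
X = [(0, 0), (0, 1), (1, 0), (1, 1)] * 2
y = [10 * a * b for a, b in X]
tree = grow(X, y)
```

In this toy example the second split (on the other predictor) appears only inside the subgroup created by the first split, which is how a tree encodes an interaction without the term being specified in advance.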

Others have worked on this idea of combining recursive partitioning with imputation methods, e.g., Burgette and Reiter (2010), Iacus and Porro (2007, 2008), Nonyane and Foulkes (2007), Stekhoven and Bühlmann (2012), and Van Buuren (2012, p. 83). The main shortcoming of most of the proposed methods is that recursive partitioning is combined with single imputation instead of multiple imputation. Therefore, they cannot be used for making appropriate statistical inferences. Another shortcoming is that, except for Burgette and Reiter, the performance of these methods is not investigated on data containing interaction effects. In the current study, we would like to overcome these shortcomings by providing a framework for connecting recursive partitioning techniques with multiple imputation. This type of imputation takes into account the uncertainty associated with the missing data (Rubin, 1996), which results in parameter estimates with appropriate confidence intervals.
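How multiple imputation carries missing-data uncertainty into the final inference can be shown with the standard pooling rules (Rubin's rules): the total variance is the average within-imputation variance plus an inflated between-imputation variance. The excerpt does not spell these formulas out, so the sketch below is an illustration with a hypothetical function name `pool` and made-up numbers.

```python
from statistics import fmean, variance


def pool(estimates, variances):
    """Combine m point estimates and their squared standard errors.

    Returns the pooled estimate and the total variance
    T = U_bar + (1 + 1/m) * B, where U_bar is the mean within-imputation
    variance and B is the between-imputation variance of the estimates.
    """
    m = len(estimates)
    q_bar = fmean(estimates)
    u_bar = fmean(variances)
    b = variance(estimates)  # sample variance with divisor m - 1
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total


# Made-up estimates from m = 3 imputed datasets.
pooled_estimate, total_variance = pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

A single imputation has no between-imputation component, so its standard errors ignore the term `(1 + 1/m) * b` and tend to be too small.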

The purpose of our study is to gain insight into whether the use of recursive partitioning in multiple imputation (i.e., MICE) is a convenient way to preserve interaction effects. We consider two main questions: which recursive partitioning techniques create appropriate variability between repeated imputations? What are the statistical properties (e.g., bias, coverage, confidence interval width) of estimates of the interaction parameters? In gaining insight into these questions, distinctions will be made between different types of interactions. In addition, the two questions will be considered for both continuous and categorical data. Burgette and Reiter (2010) embarked on the implementation of recursive partitioning in MICE and demonstrated the performance of the method on a single model with continuous predictor and response variables. We want to elaborate on the work of Burgette and Reiter and, to be complete, also consider categorical predictor and response variables. Different results are expected for both types of data since recursive partitioning techniques are known to perform especially well for data with interactions between categorical variables (Dusseldorp et al., 2010).
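The statistical properties named above (bias, coverage, confidence interval width) are typically computed over simulation replications. The following sketch assumes the usual definitions and uses a hypothetical helper `summarize` with made-up numbers; it is not taken from the study's code.

```python
from statistics import fmean


def summarize(true_value, estimates, intervals):
    """Return (bias, coverage, mean interval width) over replications.

    estimates: one pooled point estimate per replication.
    intervals: one (lower, upper) confidence interval per replication.
    """
    bias = fmean(estimates) - true_value
    coverage = fmean([1.0 if lo <= true_value <= hi else 0.0
                      for lo, hi in intervals])
    width = fmean([hi - lo for lo, hi in intervals])
    return bias, coverage, width


# Made-up results from three replications with a true parameter of 1.0.
bias, coverage, width = summarize(
    1.0,
    [0.9, 1.1, 1.3],
    [(0.5, 1.3), (0.7, 1.5), (1.1, 1.5)],
)
```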

This paper is organized as follows. In Section 2, MICE is first described in more detail, after which two main recursive partitioning techniques are considered, namely Classification And Regression Trees (CART; Breiman et al., 1984) and random forests (Breiman, 2001). Subsequently, the incorporation of recursive partitioning in the MICE framework is presented. In Section 3, the interaction types that are considered in answering the research questions are discussed. We then distinguish between predictor and response variables being either continuous (Section 4) or categorical (Section 5). In both Sections 4 and 5, a simulation study is described that was carried out to investigate which of the discussed methods are convenient for preserving interaction effects, followed by the results of that simulation study. The results from both simulation studies are discussed in Section 6, which closes with some final conclusions.

Section snippets

Multiple imputation by chained equations

Imagine a set of variables, y1, …, yj, some or all of which have missing values. Handling these data using MICE comprises three main steps: generating multiple imputations, analyzing the imputed data, and pooling the analysis results (Van Buuren, 2007). The main idea is to impute each incomplete variable using its own imputation model. All missing values are initially filled in at random. The first variable with at least one missing value, say y1, is then regressed on the remaining variables, y2, …,

Interaction types

In gaining insight into whether the use of recursive partitioning in multiple imputation is a convenient way to preserve interaction effects, we will distinguish different types of interactions. First, interactions may vary with regard to the correlation (r) between the variables that interact. This correlation may range from −1.0 to 1.0. Second, the effect size of an interaction effect, which implies the strength of the relation between an interaction effect and another (pair of) variable(s),

Simulation study

The performance of CART, Forest-boot and Forest-RI was compared with that of the default imputation method in mice for continuous data, i.e., predictive mean matching (denoted by Pmm). The simulation study used to gain insight into the performance of the four imputation methods in the presence of interaction effects can be described on the basis of five components.

Component  1: Data generation model. Data were generated using three different regression models, where each model included a two-way

Categorical predictor and response variables

In this section, the performance of CART, Forest-boot and Forest-RI as imputation methods in mice is investigated for categorical predictor and response variables.

Discussion

We implemented three recursive partitioning techniques that incorporate interaction effects in the data as imputation methods in mice: CART, restricted random forests using bootstrapping only, and random forests using a combination of bootstrapping and random input selection. We studied the bias and coverage of parameter estimates after imputation by these methods. In doing this, we replicated and extended the study of Burgette and Reiter (2010), who examined the performance of CART in MICE for

References (34)

  • S.M. Iacus et al., Missing data imputation, matching and other applications of random recursive partitioning, Comput. Statist. Data Anal. (2007)
  • L.S. Aiken et al., Multiple Regression: Testing and Interpreting Interactions (1991)
  • L. Breiman, Random forests, Mach. Learn. (2001)
  • L. Breiman et al., Classification and Regression Trees (1984)
  • L.F. Burgette et al., Multiple imputation for missing data via sequential regression trees, Amer. J. Epidemiol. (2010)
  • J. Cohen, Statistical Power Analysis for the Behavioral Sciences (1988)
  • L.M. Collins et al., A comparison of inclusive and restrictive strategies in modern missing data procedures, Psychol. Methods (2001)
  • E. Dusseldorp et al., Combining an additive and tree-based regression model simultaneously: STIMA, J. Comput. Graph. Statist. (2010)
  • J.H. Friedman, Multivariate adaptive regression splines, Ann. Statist. (1991)
  • J.W. Graham et al., How many imputations are really needed? Some practical clarifications of multiple imputation theory, Prevention Sci. (2007)
  • C. Haddock et al., Using odds ratios as effect sizes for meta-analysis of dichotomous data: a primer on methods and issues, Psychol. Methods (1998)
  • D.J. Hand, Construction and Assessment of Classification Rules (1997)
  • T. Hastie et al., The Elements of Statistical Learning; Data Mining, Inference, and Prediction (2001)
  • T. Hothorn et al., Unbiased recursive partitioning: a conditional inference framework, J. Comput. Graph. Statist. (2006)
  • Hothorn, T., Zeileis, A., 2013. Partykit: a toolkit for recursive partitioning. R package version 0.1-6. URL:...
  • S.M. Iacus et al., Invariant and metric free proximities for data matching: an R package, J. Stat. Softw. (2008)
  • G.V. Kass, An exploratory technique for investigating large quantities of categorical data, J. R. Stat. Soc. (1980)

    Supplementary materials related to the implementation of proposed methods are available online (see Appendix C).
