
Neurocomputing

Volume 118, 22 October 2013, Pages 65-78

Locally linear reconstruction based missing value imputation for supervised learning

https://doi.org/10.1016/j.neucom.2013.02.016

Abstract

Most learning algorithms generally assume that data are complete, i.e., every attribute of every instance is filled with a valid value. However, missing values are very common in real datasets for various reasons. In this paper, we propose a new single imputation method based on locally linear reconstruction (LLR) that improves the prediction performance of supervised learning (classification and regression) with missing values. First, we investigate how missing values degrade the prediction performance at various missing ratios. Next, we compare the proposed LLR imputation method with six well-known single imputation methods for five different learning algorithms on 13 classification and nine regression datasets. The experimental results showed that (1) all imputation methods helped to improve the prediction accuracy, even the very simple ones; (2) the proposed LLR imputation method enhanced the modeling performance more than all the other imputation methods, irrespective of the learning algorithm and the missing ratio; and (3) LLR stood out when the missing ratio was relatively high, achieving prediction accuracy close to that obtained with the complete dataset.

Highlights

  • A new missing value imputation method based on LLR is proposed.

  • A comparative study is performed to verify the effectiveness of LLR imputation.

  • The LLR imputation method outperforms other methods in classification and regression.

  • The performance of LLR imputation is as good as that of the complete datasets.

Introduction

Supervised learning algorithms such as classification and regression in data mining and machine learning generally assume that the training and test datasets are complete, i.e., no attribute of any instance is missing and every attribute is filled with a value. However, real datasets are often incomplete, containing a proportion of missing values for various reasons, such as the death of a patient, equipment malfunctions, and a lack of responses [2]. The presence of missing values can lead to critical problems during the learning process, such as a loss of efficiency, a biased data structure, analytical difficulties, and prediction performance degradation [3], [14], [19]. According to Acuna and Rodriguez [1], less than 1% missing data generally does not affect the prediction performance, while 1–5% is manageable. However, 5–15% missing data requires sophisticated handling methods, and more than 15% can severely degrade the prediction performance of learning algorithms. In order to handle missing values, several imputation techniques have been proposed in a wide range of data mining and machine learning domains [21], [22], [27], [41], [45]. The aim of missing value imputation is to enhance the functionality of learning algorithms and to improve their prediction accuracy by replacing missing attributes with real values based on information extracted from the non-missing data. The treatment of missing values depends on their type, as follows [29], [36].

  • Missing completely at random (MCAR): this is the highest level of randomness. The probability of an instance having a missing value for an attribute depends on neither the observed data nor the missing attribute itself. In this case, imputation rarely distorts the distribution of the original data.

  • Missing at random (MAR): an intermediate level of randomness. The probability of an instance having a missing value for an attribute may depend on the known values, but not on the value of the missing data itself. For example, let us assume that two attributes, gender and pregnancy, are collected together. If gender is recorded as ‘male’, we can easily deduce that pregnancy is ‘no’ although it is missing [38].

  • Not missing at random (NMAR): this is the lowest level of randomness. The probability of an instance having a missing value for an attribute may depend on the value of that attribute. For example, ex-convicts are likely to leave the criminal record attribute missing when they respond to a survey.

If missing values are MAR or NMAR, imputation can be conducted by domain experts based on appropriate background knowledge. Therefore, most missing value imputation techniques focus on missing values that are MCAR [2], [13], [38].

Depending on the learning algorithm and the number of repetitions, the handling of MCAR values can be divided into two main groups. The first group comprises learning algorithms that handle missing values during the learning process. Classification and regression trees (CART) simply ignore missing values when growing a tree [6]. CART iteratively computes the information gain for a large number of split candidates to select the best split point (the attribute and its split value). During CART learning, instances with missing values for an attribute are discarded if a candidate split point uses that attribute, whereas they are used if other, non-missing attributes are selected for a split candidate. The naive Bayesian classifier treats missing values in a similar way [26]. When estimating the distribution of each attribute, instances with missing values for that attribute are discarded and the parameters of the distribution are approximated using the non-missing instances only. When computing the distance between two instances in k-nearest neighbor (k-NN) learning, zero is assigned to an attribute if both instances have missing values; if only one is missing, k-NN assigns the maximum distance for that attribute [40]. The second group includes all imputation techniques that work independently of learning algorithms. This group can be divided into two subgroups: single imputation (SI) and multiple imputation (MI). SI replaces a missing value with a single value, whereas MI replaces it with several different values; thus, MI transforms a single incomplete dataset into a number of complete datasets.
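To make the k-NN rule of the first group concrete, the following minimal sketch computes the distance between two instances with missing values. The function name is ours, and treating 1.0 as the per-attribute maximum distance assumes attributes scaled to [0, 1], which is our assumption rather than a detail from [40].

```python
import numpy as np

def knn_distance_with_missing(a, b):
    """Euclidean distance between two instances that may contain np.nan.
    Rule from the text: an attribute contributes zero if it is missing in
    BOTH instances, and the maximum attribute distance if it is missing in
    only one. Assumes attributes are scaled to [0, 1], so the maximum is 1.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    both_missing = np.isnan(a) & np.isnan(b)
    one_missing = np.isnan(a) ^ np.isnan(b)
    observed = ~(both_missing | one_missing)
    diff = np.zeros_like(a)             # both-missing attributes contribute 0
    diff[observed] = a[observed] - b[observed]
    diff[one_missing] = 1.0             # maximum distance for that attribute
    return float(np.sqrt(np.sum(diff ** 2)))
```

Some representative single imputation techniques are as follows.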

  • Mean (mode) imputation (MEI): this is a simple but fairly effective method in practice. MEI fills the missing values of an attribute with the mean (continuous) or mode (nominal/ordinal) of the non-missing values for the same attribute [13], [15].

  • k-nearest neighbor (k-NN): if an instance has a certain attribute missing, k-NN finds the k most similar instances using its non-missing attributes. The values of the missing attribute among the k neighbors are combined based on a predefined rule or kernel function, such as a simple average or an exponential kernel, and the result replaces the missing value (a sketch of this scheme is given after this list). The k-NN imputation method is also known as ‘Hot Deck’ if k=1 is used, and a number of other variations have been proposed based on modifications of the kernel function [13], [15], [20], [30], [41], [46].

  • Expectation conditional maximization (ECM): this approach assumes that the entire dataset is derived from a multivariate Gaussian distribution. Initially, the distribution parameters (mean vector and covariance matrix) are estimated from the data without missing values. The expectation–maximization (EM) algorithm is then conducted as follows. During the expectation (E) step, each missing value is imputed based on the mean value of its attribute; during the maximization (M) step, the distribution parameters are updated based on the imputed values. After iterating the E and M steps, the distribution parameters converge to the optimal values, and the missing values are imputed using values that are consistent with the distribution [10], [16], [31], [35], [38].

  • Clustering-based imputation: when clustering-based imputation is applied to an instance with a missing attribute, the entire dataset is grouped into a number of clusters using the non-missing attributes. The attribute values of the members of the cluster nearest to the instance are then used for imputation. Clustering algorithms such as K-means clustering (KMC) or a mixture of Gaussian distributions (MoG) are widely used [11], [28], [32], [42], [44].

  • Model-based imputation: in model-based imputation, missing value imputation is reformulated as a supervised learning problem where the missing attribute becomes the dependent (target) variable and the non-missing attributes become the independent (explanatory) variables. Thus, the learning task becomes classification if the missing attribute is nominal, and regression if it is continuous. For each instance with a missing attribute, a machine learning algorithm is trained on the instances without missing values, and the non-missing values of the instance are fed to the model to predict the missing attribute value. Multiple linear regression, artificial neural networks (ANN), naive Bayesian classification, decision trees, and support vector machines (SVM) are commonly used for model-based imputation [12], [13], [18], [21], [37], [45].
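As an illustration of the k-NN scheme above, here is a minimal sketch of k-NN single imputation with a simple-average kernel. The donor-selection rule (requiring donors to be fully observed on the relevant attributes) and all names are our simplifications, not details from the cited methods.

```python
import numpy as np

def knn_impute(X, k=5):
    """k-NN single imputation with a simple-average kernel.
    X: 2-D float array where np.nan marks a missing entry."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    miss = np.isnan(X)
    for i in np.where(miss.any(axis=1))[0]:
        obs = ~miss[i]
        # Donors must observe both the attributes row i observes (for the
        # distance) and the attributes row i is missing (for the donation).
        ok = ~miss[:, miss[i]].any(axis=1) & ~miss[:, obs].any(axis=1)
        ok[i] = False
        donors = np.where(ok)[0]
        if donors.size == 0:
            continue                     # no usable donor; leave as-is
        d = np.sqrt(((X[donors][:, obs] - X[i, obs]) ** 2).sum(axis=1))
        nearest = donors[np.argsort(d)[:k]]
        out[i, miss[i]] = X[nearest][:, miss[i]].mean(axis=0)
    return out
```

With k=1 this reduces to the ‘Hot Deck’ behavior described above; replacing the unweighted mean with, e.g., exponentially decaying weights gives the kernel-based variations.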

In contrast to single imputation methods, multiple imputation methods impute a set of possible values rather than a single value for the missing attribute of an instance [43], [34]. Thus, multiple imputation methods generate a number of different datasets where the complete instances are identical but the incomplete instances have different values for the missing attributes. Some representative multiple imputation methods are as follows.

  • Multivariate imputation by chained equations (MICE) [39]: if missing values occur in more than one attribute for an instance, MICE employs a chained equation to fill the missing value of each attribute. MICE can generate various imputation results by modifying the imputation sequence of the missing attributes or the imputation algorithm for each attribute.

  • Boosting [14]: this multiple imputation method has three modules: mean pre-imputation, application of confidence intervals, and boosting. In the first module, missing values are pre-imputed using a base imputation method; the pre-imputed values are then filtered using confidence intervals generated from Student's t-statistics. Based on these confidence intervals, boosting is performed to deliver high-quality imputed values.
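To make the chained-equations idea concrete, the sketch below runs one MICE-style chain under strong simplifications: each incomplete attribute is modeled by plain linear regression on the others, and Gaussian residual noise stands in for the posterior predictive draws that real MICE implementations use. All names are ours, and every attribute is assumed to have at least some observed values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def mice_chain(X, n_iter=10, seed=None):
    """One chain of a simplified MICE-style imputation.
    X: 2-D float array where np.nan marks a missing entry."""
    X = np.asarray(X, dtype=float).copy()
    miss = np.isnan(X)
    rng = np.random.default_rng(seed)
    # Initialize with mean imputation.
    col_means = np.nanmean(X, axis=0)
    X[miss] = col_means[np.where(miss)[1]]
    cols = np.arange(X.shape[1])
    for _ in range(n_iter):
        for j in cols[miss.any(axis=0)]:        # visit incomplete attributes
            others = cols[cols != j]
            obs = ~miss[:, j]
            model = LinearRegression().fit(X[obs][:, others], X[obs, j])
            resid = X[obs, j] - model.predict(X[obs][:, others])
            pred = model.predict(X[miss[:, j]][:, others])
            # Crude stand-in for a proper posterior predictive draw.
            X[miss[:, j], j] = pred + rng.normal(0.0, resid.std(), pred.size)
    return X
```

Running the chain several times with different seeds (or different attribute visiting orders) produces the multiple completed datasets that MI methods require.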

These missing value imputation methods have advantages and limitations. Imputation mechanisms built into learning algorithms require no additional data preprocessing, but they are usually too simple because most of them just discard instances with missing attributes. This allows the learning algorithms to function, but their prediction performance cannot be guaranteed. Single imputation methods can be applied before any learning algorithm. However, the performance improvement may be limited (e.g., mean imputation), or the computational burden may increase because of an additional parameter optimization process (model-based imputation). Multiple imputation methods may improve the prediction performance more than single imputation methods, but they significantly increase the computational cost, not only by repeating the imputation steps, but also by repeating model learning on each imputed dataset. Therefore, multiple imputation methods may have difficulty handling large amounts of data in real-time processing.

In this paper, we propose a new, efficient single imputation method based on locally linear reconstruction (LLR) [24], [25] to improve the prediction performance of supervised learning. LLR is a structured approach that determines the two parameters of k-NN learning, i.e., the number of nearest neighbors (k) and the weights given to the neighbors. In LLR, an optimization problem is formulated to minimize the difference between the test instance and its projection onto the neighborhood space. Solving this optimization problem is equivalent to assigning optimal weights to the neighbor instances: if a neighbor is necessary to span the neighborhood space, it receives a non-zero weight; otherwise, it receives zero. Therefore, LLR can identify the critical neighbors and assign appropriate weights to them. By adopting LLR, k and the neighbor weights are no longer user-specified model parameters; they are determined in a structured manner. To verify the effectiveness of the LLR imputation method, we first analyzed the extent to which classification and regression algorithms are affected by diverse missing ratios when no imputation is conducted. Next, we compared (1) the accuracy improvement over the incomplete data and (2) the accuracy recovery relative to the complete data (the ideal case) of LLR imputation against other well-known single imputation methods. In order to generalize our conclusions, we conducted extensive experiments using six single imputation methods as benchmarks and five classification and regression algorithms with 13 different missing ratios, based on 13 classification and nine regression datasets.
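The exact formulation is given in Section 3 and in [24], [25]; as a rough sketch of the idea, the reconstruction weights can be obtained from a least-squares problem with a nonnegativity constraint, which drives the weights of unnecessary neighbors to exactly zero. The sum-to-one normalization and the function name below are our assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def llr_weights(x_new, X_neighbors):
    """Reconstruct x_new as a weighted combination of candidate neighbors.
    Nonnegativity forces unnecessary neighbors to weight zero, so the
    effective number of neighbors is decided by the optimization rather
    than chosen by the user.
    x_new: (d,) query instance; X_neighbors: (k, d) candidate neighbors."""
    # Solve  min_w || X_neighbors.T @ w - x_new ||^2  subject to  w >= 0.
    w, _ = nnls(np.asarray(X_neighbors, dtype=float).T,
                np.asarray(x_new, dtype=float))
    s = w.sum()
    return w / s if s > 0 else w       # normalize so the weights sum to 1

# The k-NN prediction for the query is then the weighted average of the
# neighbors' target values: y_hat = llr_weights(x, Xn) @ y_neighbors.
```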

The remainder of this paper is organized as follows. In Section 2, we review representative work on the comparative analysis of various imputation methods and discuss its findings and limitations. In Section 3, we describe the proposed LLR imputation method. In Section 4, we explain the experimental settings, i.e., the data descriptions, missing data ratios, benchmark imputation algorithms, the classification and regression algorithms along with their user-specified parameters, and the performance measures. In Section 5, we compare the LLR imputation method with the benchmark imputation methods in various scenarios. In Section 6, we provide some concluding remarks and discuss areas of future work.


Related work

In this section, we review, in chronological order, some representative works that have focused on comparative studies of missing value imputation methods for supervised learning.

  • Grzymala-Busse and Hu [18]: In this study, nine imputation methods for nominal datasets were examined. The imputation methods were grouped into three categories: (1) two mode imputation-based methods where the key distinction was whether class information was used or not; (2) two methods based on the assignment of all

Missing value imputation based on locally linear reconstruction

Locally linear reconstruction (LLR) was proposed to simultaneously determine the critical neighbors and assign their optimal weights in k-NN learning [24]. The input–output structure of LLR is given by

$\hat{y}_{n+1} = \mathrm{LLR}(\mathbf{x}_{n+1}, \mathbf{X}_{\mathrm{ref}}, \mathbf{y}_{\mathrm{ref}}, k),$

where $\mathbf{x}_{n+1}$ is the $1 \times d$ input vector of a new test instance, $\hat{y}_{n+1}$ is the predicted target value of the test instance, $\mathbf{X}_{\mathrm{ref}}$ is an $n \times d$ input matrix in which each row corresponds to a reference instance, $\mathbf{y}_{\mathrm{ref}}$ is the $n \times 1$ vector of the target values of the reference instances, and $k$ is the number of nearest neighbors considered.
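Following this interface, LLR imputation can be set up like model-based imputation: for each incomplete instance, the missing attribute plays the role of the target and complete instances serve as the reference set. The sketch below reuses llr_weights from the previous section; the preselection of k candidates by Euclidean distance on the observed attributes, and all names, are our assumptions rather than the paper's exact procedure.

```python
import numpy as np

def llr_impute_one(x_incomplete, X_complete, k=10):
    """Impute the missing attributes of one instance with LLR weights.
    x_incomplete: (d,) vector with np.nan in missing positions.
    X_complete:   (n, d) reference instances without missing values."""
    x = np.asarray(x_incomplete, dtype=float)
    X_complete = np.asarray(X_complete, dtype=float)
    miss = np.isnan(x)
    obs = ~miss
    # Preselect k nearest references on the observed attributes.
    d = np.sqrt(((X_complete[:, obs] - x[obs]) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    # Optimal reconstruction weights on the observed attributes...
    w = llr_weights(x[obs], X_complete[nearest][:, obs])
    # ...applied to the neighbors' values of the missing attributes.
    filled = x.copy()
    filled[miss] = w @ X_complete[nearest][:, miss]
    return filled
```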

Experimental settings

To verify the effectiveness of LLR imputation, we designed the following experiments. First, we generated synthetic incomplete datasets from the original complete dataset with various missing-instance ratios. Five classification and regression algorithms were then trained without any imputation method to determine the detrimental effects of missing values. The prediction accuracies with the original complete dataset and with the missing dataset without imputation, and the resulting prediction power degradation, were then measured.

Classification results

Fig. 4 shows how the classification accuracy was affected by the existence of missing values. Note that the relative accuracy on the y-axis is the ratio of two accuracies, i.e., the classification accuracy on the missing data without imputation (discarding missing instances when training the classifiers) divided by the classification accuracy on the original complete data. Each point in the graph is the average over the five classification algorithms. As expected, missing values degraded the classification accuracy, and the degradation grew as the missing ratio increased.

Conclusion

In this paper, we propose a new single imputation method based on locally linear reconstruction (LLR). Experiments verified that increasing the missing ratio severely degraded the performance of all learning algorithms, which confirmed that missing values should be treated with caution. By taking the local structure of the data into account, the proposed LLR imputation handled missing values more effectively than other parametric or global structure-based imputation methods. Carefully designed experiments on 13 classification and nine regression datasets supported these findings.

Acknowledgments

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science, and Technology (MEST) (No. 2011-0021893).

Pilsung Kang is an assistant professor in the Department of Industrial and Information Systems Engineering, College of Business and Technology, Seoul National University of Science and Technology (SeoulTech), Seoul, South Korea. He received his B.S. and Ph.D. in Industrial Engineering from the College of Engineering, Seoul National University, Seoul, South Korea. His research interests include instance-based learning, learning kernel machines, novelty detection, learning algorithms under class imbalance, and non-linear dimensionality reduction. He is also interested in a wide range of applications, such as keystroke dynamics-based authentication, fault detection in manufacturing processes, and customer relationship management. He has published a number of papers on these topics in international journals and conferences.

References (46)

  • J. Bernard et al., Applications of multiple imputation in medical studies: from AIDS to NHANES, Stat. Methods Med. Res. (1999)

  • C. Bishop, Pattern Recognition and Machine Learning (2006)

  • D. Böhning, Multinomial logistic regression algorithm, Ann. Inst. Stat. Math. (1992)

  • L. Breiman et al., Classification and Regression Trees (1984)

  • C. Broyden, Quasi-Newton methods and their application to functional minimisation, Math. Comput. (1967)

  • P. Clark et al., The CN2 induction algorithm, Mach. Learn. (1988)

  • T. Cover et al., Nearest neighbor pattern classification, IEEE Trans. Inf. Theory (1967)

  • A. Dempster et al., Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B (1977)

  • C. Ennett, M. Frize, C. Walker, Imputation of missing values by integrating neural networks and case-based reasoning, ...

  • A. Farhangfar et al., A novel framework for imputation of missing values in databases, IEEE Trans. Syst. Man Cybern. A Syst. Hum. (2007)

  • Z. Ghahramani, M. Jordan, Supervised learning from incomplete data via an EM approach, in: Advances in NIPS 6, Morgan ...

  • W. Grzymala-Busse, A new version of the rule induction system LERS, Fundam. Inf. (1997)

  • W. Grzymala-Busse, M. Hu, A comparison of several approaches to missing attribute values in data mining, in: Lecture ...

