
Neurocomputing

Volume 118, 22 October 2013, Pages 65-78

Locally linear reconstruction based missing value imputation for supervised learning

https://doi.org/10.1016/j.neucom.2013.02.016

Abstract

Most learning algorithms generally assume that data are complete, i.e., every attribute of every instance is filled with a valid value. However, missing values are very common in real datasets for various reasons. In this paper, we propose a new single imputation method based on locally linear reconstruction (LLR) that improves the prediction performance of supervised learning (classification and regression) with missing values. First, we investigate how missing values degrade the prediction performance at various missing ratios. Next, we compare the proposed LLR imputation method with six well-known single imputation methods for five different learning algorithms on 13 classification and nine regression datasets. The experimental results showed that (1) all imputation methods helped to improve the prediction accuracy, even the very simple ones; (2) the proposed LLR imputation method enhanced the modeling performance more than all the other imputation methods, irrespective of the learning algorithm and the missing ratio; and (3) LLR stood out when the missing ratio was relatively high, achieving prediction accuracy close to that obtained with the complete dataset.

Highlights

  • A new missing value imputation method based on LLR is proposed.

  • A comparative study is performed to verify the effectiveness of LLR imputation.

  • The LLR imputation method outperforms other methods in classification and regression.

  • The performance of LLR imputation is as good as that of the complete datasets.

Introduction

Supervised learning algorithms such as classification and regression in data mining and machine learning generally assume that the training and test datasets are complete, i.e., no attribute of any instance is missing and every attribute is filled with a value. However, real datasets are often incomplete, containing a proportion of missing values for various reasons, such as the death of a patient, equipment malfunctions, and a lack of responses [2]. The presence of missing values can lead to critical problems during the learning process, such as a loss of efficiency, a biased data structure, analytical difficulties, and prediction performance degradation [3], [14], [19]. According to Acuna and Rodriguez [1], less than 1% missing data generally does not affect the prediction performance, while 1–5% is manageable. However, 5–15% missing data requires sophisticated handling methods, and more than 15% can severely degrade the prediction performance of learning algorithms. In order to handle missing values, several imputation techniques have been proposed in a wide range of data mining and machine learning domains [21], [22], [27], [41], [45]. The aim of missing value imputation is to enhance the functionality of learning algorithms and to improve their prediction accuracy by replacing missing attributes with real values based on information extracted from the non-missing data. The treatment of missing values depends on their type, as follows [29], [36].

  • Missing completely at random (MCAR): this is the highest level of randomness. The probability of an instance having a missing value for an attribute depends on neither the observed data nor the missing attribute itself. In this case, imputation rarely distorts the distribution of the original data.

  • Missing at random (MAR): an intermediate level of randomness. The probability of an instance having a missing value for an attribute may depend on the known values, but not on the value of the missing data itself. For example, let us assume that two attributes, gender and pregnancy, are collected together. If gender is recorded as ‘male’, we can easily deduce that pregnancy is ‘no’ although it is missing [38].

  • Not missing at random (NMAR): this is the lowest level of randomness. The probability of an instance having a missing value for an attribute may depend on the value of that attribute. For example, ex-convicts are likely to leave the criminal record attribute missing when they respond to a survey.

If missing values are MAR or NMAR, imputation can be conducted by domain experts based on appropriate background knowledge. Therefore, most missing value imputation techniques focus on missing values that are MCAR [2], [13], [38].

Depending on the learning algorithm and the number of repetitions, the handling of MCAR values can be divided into two main groups. The first group comprises learning algorithms that handle missing values during the learning process. Classification and regression trees (CART) simply ignore missing values when growing a tree [6]. CART iteratively computes the information gain for a large number of split candidates to select the best split point (the attribute and its split value). During CART learning, instances with missing values for an attribute are discarded if a candidate split point uses that attribute, whereas they are used if other, non-missing attributes are selected for a split candidate. The naive Bayesian classifier treats missing values in a similar way [26]. When estimating the distribution of each attribute, instances with missing values for that attribute are discarded and the parameters of the distribution are approximated using the non-missing instances only. When computing the distance between two instances in k-nearest neighbor (k-NN) learning, zero is assigned to an attribute if both instances have missing values; if only one is missing, k-NN assigns the maximum distance for that attribute [40]. The second group includes all imputation techniques that work independently of learning algorithms. This group can be divided into two subgroups: single imputation (SI) and multiple imputation (MI). SI replaces a missing value with a single value, whereas MI replaces it with several different values; thus, MI transforms a single incomplete dataset into a number of complete datasets.
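To make the k-NN rule of the first group concrete, the following minimal sketch computes the distance between two instances with missing values. The function name is ours, and treating 1.0 as the per-attribute maximum distance assumes attributes scaled to [0, 1], which is our assumption rather than a detail from [40].

```python
import numpy as np

def knn_distance_with_missing(a, b):
    """Euclidean distance between two instances that may contain np.nan.
    Rule from the text: an attribute contributes zero if it is missing in
    BOTH instances, and the maximum attribute distance if it is missing in
    only one. Assumes attributes are scaled to [0, 1], so the maximum is 1.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    both_missing = np.isnan(a) & np.isnan(b)
    one_missing = np.isnan(a) ^ np.isnan(b)
    observed = ~(both_missing | one_missing)
    diff = np.zeros_like(a)             # both-missing attributes contribute 0
    diff[observed] = a[observed] - b[observed]
    diff[one_missing] = 1.0             # maximum distance for that attribute
    return float(np.sqrt(np.sum(diff ** 2)))
```

Some representative single imputation techniques are as follows.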

  • Mean (mode) imputation (MEI): this is a simple but fairly effective method in practice. MEI fills the missing values of an attribute with the mean (continuous) or mode (nominal/ordinal) of the non-missing values for the same attribute [13], [15].

  • k-nearest neighbor (k-NN): if an instance has a certain attribute missing, k-NN finds the k most similar instances using its non-missing attributes. The values of the missing attribute among the k neighbors are combined based on a predefined rule or kernel function, such as a simple average or an exponential kernel, and the result replaces the missing value (a sketch of this scheme is given after this list). The k-NN imputation method is also known as ‘Hot Deck’ if k=1 is used, and a number of other variations have been proposed based on modifications of the kernel function [13], [15], [20], [30], [41], [46].

  • Expectation conditional maximization (ECM): this approach assumes that the entire dataset is derived from a multivariate Gaussian distribution. Initially, the distribution parameters (mean vector and covariance matrix) are estimated from the data without missing values. The expectation–maximization (EM) algorithm is then conducted as follows. During the expectation (E) step, each missing value is imputed based on the mean value of its attribute; during the maximization (M) step, the distribution parameters are updated based on the imputed values. After iterating the E and M steps, the distribution parameters converge to the optimal values, and the missing values are imputed using values that are consistent with the distribution [10], [16], [31], [35], [38].

  • Clustering-based imputation: when clustering-based imputation is applied to an instance with a missing attribute, the entire dataset is grouped into a number of clusters using the non-missing attributes. The attribute values of the members of the cluster nearest to the instance are then used for imputation. Clustering algorithms such as K-means clustering (KMC) or a mixture of Gaussian distributions (MoG) are widely used [11], [28], [32], [42], [44].

  • Model-based imputation: in model-based imputation, missing value imputation is reformulated as a supervised learning problem where the missing attribute becomes the dependent (target) variable and the non-missing attributes become the independent (explanatory) variables. Thus, the learning task becomes classification if the missing attribute is nominal, and regression if it is continuous. For each instance with a missing attribute, a machine learning algorithm is trained on the instances without missing values, and the non-missing values of the instance are fed to the model to predict the missing attribute value. Multiple linear regression, artificial neural networks (ANN), naive Bayesian classification, decision trees, and support vector machines (SVM) are commonly used for model-based imputation [12], [13], [18], [21], [37], [45].
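As an illustration of the k-NN scheme above, here is a minimal sketch of k-NN single imputation with a simple-average kernel. The donor-selection rule (requiring donors to be fully observed on the relevant attributes) and all names are our simplifications, not details from the cited methods.

```python
import numpy as np

def knn_impute(X, k=5):
    """k-NN single imputation with a simple-average kernel.
    X: 2-D float array where np.nan marks a missing entry."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    miss = np.isnan(X)
    for i in np.where(miss.any(axis=1))[0]:
        obs = ~miss[i]
        # Donors must observe both the attributes row i observes (for the
        # distance) and the attributes row i is missing (for the donation).
        ok = ~miss[:, miss[i]].any(axis=1) & ~miss[:, obs].any(axis=1)
        ok[i] = False
        donors = np.where(ok)[0]
        if donors.size == 0:
            continue                     # no usable donor; leave as-is
        d = np.sqrt(((X[donors][:, obs] - X[i, obs]) ** 2).sum(axis=1))
        nearest = donors[np.argsort(d)[:k]]
        out[i, miss[i]] = X[nearest][:, miss[i]].mean(axis=0)
    return out
```

With k=1 this reduces to the ‘Hot Deck’ behavior described above; replacing the unweighted mean with, e.g., exponentially decaying weights gives the kernel-based variations.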

In contrast to single imputation methods, multiple imputation methods impute a set of possible values rather than a single value for the missing attribute of an instance [43], [34]. Thus, multiple imputation methods generate a number of different datasets where the complete instances are identical but the incomplete instances have different values for the missing attributes. Some representative multiple imputation methods are as follows.

  • Multivariate imputation by chained equations (MICE) [39]: if missing values occur in more than one attribute for an instance, MICE employs a chained equation to fill the missing value of each attribute. MICE can generate various imputation results by modifying the imputation sequence of the missing attributes or the imputation algorithm for each attribute.

  • Boosting [14]: this multiple imputation method has three modules: mean pre-imputation, application of confidence intervals, and boosting. In the first module, missing values are pre-imputed using a base imputation method; the pre-imputed values are then filtered using confidence intervals generated from Student's t-statistics. Based on these confidence intervals, boosting is performed to deliver high-quality imputed values.
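To make the chained-equations idea concrete, the sketch below runs one MICE-style chain under strong simplifications: each incomplete attribute is modeled by plain linear regression on the others, and Gaussian residual noise stands in for the posterior predictive draws that real MICE implementations use. All names are ours, and every attribute is assumed to have at least some observed values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def mice_chain(X, n_iter=10, seed=None):
    """One chain of a simplified MICE-style imputation.
    X: 2-D float array where np.nan marks a missing entry."""
    X = np.asarray(X, dtype=float).copy()
    miss = np.isnan(X)
    rng = np.random.default_rng(seed)
    # Initialize with mean imputation.
    col_means = np.nanmean(X, axis=0)
    X[miss] = col_means[np.where(miss)[1]]
    cols = np.arange(X.shape[1])
    for _ in range(n_iter):
        for j in cols[miss.any(axis=0)]:        # visit incomplete attributes
            others = cols[cols != j]
            obs = ~miss[:, j]
            model = LinearRegression().fit(X[obs][:, others], X[obs, j])
            resid = X[obs, j] - model.predict(X[obs][:, others])
            pred = model.predict(X[miss[:, j]][:, others])
            # Crude stand-in for a proper posterior predictive draw.
            X[miss[:, j], j] = pred + rng.normal(0.0, resid.std(), pred.size)
    return X
```

Running the chain several times with different seeds (or different attribute visiting orders) produces the multiple completed datasets that MI methods require.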

These missing value imputation methods have advantages and limitations. Imputation mechanisms built into learning algorithms require no additional data preprocessing, but they are usually too simple because most of them just discard instances with missing attributes. This allows the learning algorithms to function, but their prediction performance cannot be guaranteed. Single imputation methods can be applied before any learning algorithm. However, the performance improvement may be limited (e.g., mean imputation), or the computational burden may increase because of an additional parameter optimization process (model-based imputation). Multiple imputation methods may improve the prediction performance more than single imputation methods, but they significantly increase the computational cost, not only by repeating the imputation steps, but also by repeating model learning on each imputed dataset. Therefore, multiple imputation methods may have difficulty handling large amounts of data in real-time processing.

In this paper, we propose a new, efficient single imputation method based on locally linear reconstruction (LLR) [24], [25] to improve the prediction performance of supervised learning. LLR is a structured approach that determines the two parameters of k-NN learning, i.e., the number of nearest neighbors (k) and the weights given to the neighbors. In LLR, an optimization problem is formulated to minimize the difference between the test instance and its projection onto the neighborhood space. Solving this optimization problem is equivalent to assigning optimal weights to the neighbor instances: if a neighbor is necessary to span the neighborhood space, it receives a non-zero weight; otherwise, it receives zero. Therefore, LLR can identify the critical neighbors and assign appropriate weights to them. By adopting LLR, k and the neighbor weights are no longer user-specified model parameters; they are determined in a structured manner. To verify the effectiveness of the LLR imputation method, we first analyzed the extent to which classification and regression algorithms are affected by diverse missing ratios when no imputation is conducted. Next, we compared (1) the accuracy improvement over the incomplete data and (2) the accuracy recovery relative to the complete data (the ideal case) of LLR imputation against other well-known single imputation methods. In order to generalize our conclusions, we conducted extensive experiments using six single imputation methods as benchmarks and five classification and regression algorithms with 13 different missing ratios, based on 13 classification and nine regression datasets.
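The exact formulation is given in Section 3 and in [24], [25]; as a rough sketch of the idea, the reconstruction weights can be obtained from a least-squares problem with a nonnegativity constraint, which drives the weights of unnecessary neighbors to exactly zero. The sum-to-one normalization and the function name below are our assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def llr_weights(x_new, X_neighbors):
    """Reconstruct x_new as a weighted combination of candidate neighbors.
    Nonnegativity forces unnecessary neighbors to weight zero, so the
    effective number of neighbors is decided by the optimization rather
    than chosen by the user.
    x_new: (d,) query instance; X_neighbors: (k, d) candidate neighbors."""
    # Solve  min_w || X_neighbors.T @ w - x_new ||^2  subject to  w >= 0.
    w, _ = nnls(np.asarray(X_neighbors, dtype=float).T,
                np.asarray(x_new, dtype=float))
    s = w.sum()
    return w / s if s > 0 else w       # normalize so the weights sum to 1

# The k-NN prediction for the query is then the weighted average of the
# neighbors' target values: y_hat = llr_weights(x, Xn) @ y_neighbors.
```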

The remainder of this paper is organized as follows. In Section 2, we review representative work on the comparative analysis of various imputation methods and discuss its findings and limitations. In Section 3, we describe the proposed LLR imputation method. In Section 4, we explain the experimental settings, i.e., the data descriptions, missing data ratios, benchmark imputation algorithms, the classification and regression algorithms along with their user-specified parameters, and the performance measures. In Section 5, we compare the LLR imputation method with the benchmark imputation methods in various scenarios. In Section 6, we provide some concluding remarks and discuss areas of future work.


Related work

In this section, we review, in chronological order, some representative works that have focused on comparative studies of missing value imputation methods for supervised learning.

  • Grzymala-Busse and Hu [18]: In this study, nine imputation methods for nominal datasets were examined. The imputation methods were grouped into three categories: (1) two mode imputation-based methods where the key distinction was whether class information was used or not; (2) two methods based on the assignment of all

Missing value imputation based on locally linear reconstruction

Locally linear reconstruction (LLR) was proposed to simultaneously determine the critical neighbors and assign their optimal weights in k-NN learning [24]. The input–output structure of LLR is given by

$\hat{y}_{n+1} = \mathrm{LLR}(\mathbf{x}_{n+1}, \mathbf{X}_{\mathrm{ref}}, \mathbf{y}_{\mathrm{ref}}, k),$

where $\mathbf{x}_{n+1}$ is the $1 \times d$ input vector of a new test instance, $\hat{y}_{n+1}$ is the predicted target value of the test instance, $\mathbf{X}_{\mathrm{ref}}$ is an $n \times d$ input matrix in which each row corresponds to a reference instance, $\mathbf{y}_{\mathrm{ref}}$ is the $n \times 1$ vector of the target values of the reference instances, and $k$ is the number of nearest neighbors considered.
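Following this interface, LLR imputation can be set up like model-based imputation: for each incomplete instance, the missing attribute plays the role of the target and complete instances serve as the reference set. The sketch below reuses llr_weights from the previous section; the preselection of k candidates by Euclidean distance on the observed attributes, and all names, are our assumptions rather than the paper's exact procedure.

```python
import numpy as np

def llr_impute_one(x_incomplete, X_complete, k=10):
    """Impute the missing attributes of one instance with LLR weights.
    x_incomplete: (d,) vector with np.nan in missing positions.
    X_complete:   (n, d) reference instances without missing values."""
    x = np.asarray(x_incomplete, dtype=float)
    X_complete = np.asarray(X_complete, dtype=float)
    miss = np.isnan(x)
    obs = ~miss
    # Preselect k nearest references on the observed attributes.
    d = np.sqrt(((X_complete[:, obs] - x[obs]) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    # Optimal reconstruction weights on the observed attributes...
    w = llr_weights(x[obs], X_complete[nearest][:, obs])
    # ...applied to the neighbors' values of the missing attributes.
    filled = x.copy()
    filled[miss] = w @ X_complete[nearest][:, miss]
    return filled
```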

Experimental settings

To verify the effectiveness of LLR imputation, we designed the following experiments. First, we generated synthetic incomplete datasets from the original complete dataset with various missing-instance ratios. Five classification and regression algorithms were then trained without any imputation method to determine the detrimental effects of missing values. The prediction accuracies with the original complete dataset and with the missing dataset without imputation, and the resulting prediction power degradation, were then measured.

Classification results

Fig. 4 shows how the classification accuracy was affected by the existence of missing values. Note that the relative accuracy on the y-axis is the ratio of two accuracies, i.e., the classification accuracy on the missing data without imputation (discarding missing instances when training the classifiers) divided by the classification accuracy on the original complete data. Each point in the graph is the average over the five classification algorithms. As expected, missing values degraded the classification accuracy, and the degradation grew as the missing ratio increased.

Conclusion

In this paper, we propose a new single imputation method based on locally linear reconstruction (LLR). Experiments verified that increasing the missing ratio severely degraded the performance of all learning algorithms, which confirmed that missing values should be treated with caution. By taking the local structure of the data into account, the proposed LLR imputation handled missing values more effectively than other parametric or global structure-based imputation methods. Carefully designed experiments on 13 classification and nine regression datasets supported these findings.

Acknowledgments

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science, and Technology (MEST) (No. 2011-0021893).

Pilsung Kang is an assistant professor in the Department of Industrial and Information Systems Engineering, College of Business and Technology, Seoul National University of Science and Technology (SeoulTech), Seoul, South Korea. He received his B.S. and Ph.D. in Industrial Engineering from the College of Engineering, Seoul National University, Seoul, South Korea. His research interests include instance-based learning, learning kernel machines, novelty detection, learning algorithms under class imbalance, and non-linear dimensionality reduction. He is also interested in a wide range of applications, such as keystroke dynamics-based authentication, fault detection in manufacturing processes, and customer relationship management. He has published a number of papers on these topics in international journals and conferences.

References (46)

  • J. Bernard et al., Applications of multiple imputation in medical studies: from AIDS to NHANES, Stat. Methods Med. Res. (1999)

  • C. Bishop, Pattern Recognition and Machine Learning (2006)

  • D. Böhning, Multinomial logistic regression algorithm, Ann. Inst. Stat. Math. (1992)

  • L. Breiman et al., Classification and Regression Trees (1984)

  • C. Broyden, Quasi-Newton methods and their application to functional minimisation, Math. Comput. (1967)

  • P. Clark et al., The CN2 induction algorithm, Mach. Learn. (1988)

  • T. Cover et al., Nearest neighbor pattern classification, IEEE Trans. Inf. Theory (1967)

  • A. Dempster et al., Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B (1977)

  • C. Ennett, M. Frize, C. Walker, Imputation of missing values by integrating neural networks and case-based reasoning, ...

  • A. Farhangfar et al., A novel framework for imputation of missing values in databases, IEEE Trans. Syst. Man Cybern. A Syst. Hum. (2007)

  • Z. Ghahramani, M. Jordan, Supervised learning from incomplete data via an EM approach, in: Advances in NIPS 6, Morgan ...

  • W. Grzymala-Busse, A new version of the rule induction system LERS, Fundam. Inf. (1997)

  • W. Grzymala-Busse, M. Hu, A comparison of several approaches to missing attribute values in data mining, in: Lecture ...

