Elsevier

Neurocomputing

Volume 205, 12 September 2016, Pages 152-164

Missing data imputation using fuzzy-rough methods

https://doi.org/10.1016/j.neucom.2016.04.015

Highlights

  • We propose three missing value imputation methods based on fuzzy-rough nearest neighbors: FRNNI, OWANNI and VQNNI.

  • The proposed methods compare favorably with 11 state-of-the-art imputation methods.

  • FRNNI outperforms the other two proposed methods.

  • OWANNI performs comparably to FRNNI, though FRNNI outperforms it in some cases.

Abstract

Missing values exist in many datasets generated in science, so utilizing missing data imputation methods is a common and important practice. These methods are a kind of treatment for the uncertainty and vagueness existing in datasets. On the other hand, methods based on fuzzy-rough sets provide excellent tools for dealing with uncertainty, possessing highly desirable properties such as robustness and noise tolerance. Furthermore, they can find minimal representations of data and do not need potentially erroneous user inputs. As a result, utilizing fuzzy-rough sets for imputation should be an effective approach. In this paper, we propose three missing value imputation methods based on fuzzy-rough sets and their recent extensions; namely, implicator/t-norm based fuzzy-rough sets, vaguely quantified rough sets and ordered weighted average (OWA) based fuzzy-rough sets. These methods are compared against 11 state-of-the-art imputation methods implemented in the KEEL data mining software on 27 benchmark datasets. Non-parametric statistical analysis of the results shows that the proposed methods exhibit excellent performance in general.

Introduction

In today's world, the behavior of phenomena can be understood by analyzing relevant datasets. These analyses may target classification, regression or time series data. Perhaps the most beneficial part of this work is the prediction of parameter values of high importance. Thus, gathering data related to a phenomenon is widely carried out. Many different datasets are generated every day across most fields of science, and they are constructed in different ways. A number of factors can affect the gathered data; for example, power system failures, noise, environmental factors (such as humidity, temperature, etc.), a lack of response in scientific experiments, human error in measurements, problems of data transfer in digital systems, respondents' unwillingness to answer survey questions, low-quality sensors and many more [1], [2], [3], [4]. Hence, the existence of missing values in gathered datasets is somewhat inevitable. However, the presence of missing values can dramatically degrade the results of interpreting datasets, which is usually carried out with the aid of machine learning techniques [5]. Therefore, dealing with missing values is an important issue in the data mining and machine learning communities [5], [6], [7], [8], [9].

There are several ways to deal with missing values in datasets. Deleting or ignoring the affected instances is the simplest approach; replacing the missing values with zero or the mean of the attribute is another. These simple treatments have a major drawback: they discard or distort information present in the instances containing missing values, which degrades the quality of estimation and can bias its results. Hence, another option is to estimate the missing values themselves. Such methods are usually called imputation methods.
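To make the drawback concrete, here is a minimal sketch of mean imputation; the Python list layout and the use of None as the missing marker are assumptions of this illustration:

```python
def mean_impute(column):
    """Replace missing entries (None) in a column with the column mean."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

# Every missing entry receives the same value, ignoring any relationship
# between attributes; this is the source of the bias discussed above.
print(mean_impute([1.0, None, 3.0, None]))  # -> [1.0, 2.0, 3.0, 2.0]
```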

Based on the process used to produce a dataset and the nature of the data itself, there are three types of missing values: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR) [10]. When missing values are of type MCAR, the fact that a value is missing is independent of all variables, observed or not. If the values are of type MAR, missingness depends only on the observed values, so the missing entries can be estimated from the other variables (i.e. the mechanism by which values are missing can be ignored). Finally, if the values are of type NMAR, missingness depends on the unobserved values themselves; the mechanism by which values are missing must then be modeled for effective imputation, and the missing values cannot be estimated from the existing variables alone. Most approaches assume missing values of type MAR.
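The three mechanisms can be illustrated with a toy sketch; the two-feature layout, thresholds and probabilities below are illustrative assumptions, not taken from the paper:

```python
import random

random.seed(0)
# 1000 instances with two independent standard-normal features (x, y).
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]

def mask_mcar(row):
    # MCAR: y goes missing with fixed probability, independent of any value.
    return random.random() < 0.3

def mask_mar(row):
    # MAR: missingness of y depends only on the observed value x.
    return row[0] > 0.5

def mask_nmar(row):
    # NMAR: missingness of y depends on the unobserved value of y itself.
    return row[1] > 0.5

# The mean of the y values that remain observed is distorted under NMAR,
# because deletion is driven by y itself.
for name, mask in [("MCAR", mask_mcar), ("MAR", mask_mar), ("NMAR", mask_nmar)]:
    kept = [y for (x, y) in data if not mask((x, y))]
    print(name, round(sum(kept) / len(kept), 2))
```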

Missing value imputation methods can be categorized by the technique used to approximate missing values, giving two broad types [11]. First, there are approaches that use mathematical or statistical methods to predict missing values. These range from very simple methods, which replace missing values with the mean or mode of the feature, to more sophisticated methods based on advanced statistical techniques. Second, there are methods that use machine learning techniques to impute missing values. Methods in this category build a model, or a combination of models, from the information available in the dataset, and then predict the missing values using this model. Sometimes missing values are predicted during the training phase and iteratively amended to achieve the best values. This category can itself be divided into subcategories, such as methods using neural networks [12], [13], [14], [15], [16], support vector machines [17], nearest neighbor based methods [18], [19], and methods based on unsupervised learning, e.g. clustering [7], [8].

Methods based on nearest neighbors predict missing values from complete instances located in the proximity of the instances with missing values. These methods are accurate, but the main reason to propose and use them is that they are intuitive and simple. They also have drawbacks: the user must supply the number of neighbors, every instance must be compared to find the nearest neighbors (which results in high time complexity), and their local nature makes them prone to local optima.
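A minimal sketch of nearest-neighbor imputation as described above; the list-of-lists layout, None as the missing marker and Euclidean distance are assumptions of this illustration:

```python
import math

def knn_impute(instances, target, missing_idx, k=3):
    """Impute instances[target][missing_idx] as the mean of the k nearest
    complete instances, measured on the attributes that are present."""
    row = instances[target]
    known = [j for j in range(len(row)) if j != missing_idx]

    def dist(other):
        return math.sqrt(sum((row[j] - other[j]) ** 2 for j in known))

    # Every other complete instance is compared, which is the source of the
    # high time complexity noted above; k must be supplied by the user.
    donors = [r for i, r in enumerate(instances)
              if i != target and r[missing_idx] is not None]
    donors.sort(key=dist)
    return sum(r[missing_idx] for r in donors[:k]) / k

data = [[1.0, 1.0, 10.0],
        [1.1, 0.9, 11.0],
        [5.0, 5.0, 50.0],
        [1.05, 0.95, None]]
print(knn_impute(data, target=3, missing_idx=2, k=2))  # -> 10.5
```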

The use of simple statistical models can cause the resulting imputed data to become biased; models built on such data will be biased in turn. Methods based on statistical learning, on the other hand, are typically more complex. They cannot deal with missing values directly, so they all require an initial guess to be refined later, usually the mean of the feature values. Most of them describe the data using eigenvectors and discard the less important ones; this loses information and can ultimately degrade prediction accuracy. They also require the user to specify the number of retained eigenvectors, which may demand expert knowledge to produce good results. Since these methods are iterative, they further need a user-supplied threshold as a stopping criterion. Moreover, their performance is sensitive to the type of data being analyzed; SVDI [19], for instance, usually performs better on time series data. Most of them are essentially forms of linear regression.
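The eigenvector-based scheme can be sketched as iterative low-rank reconstruction in the spirit of SVDI; the NaN missing marker, column-mean initial guess and parameter names are assumptions of this illustration, not the method's exact formulation:

```python
import numpy as np

def svd_impute(X, rank=1, n_iter=50, tol=1e-6):
    """Iteratively refill missing entries (NaN) with a rank-`rank` SVD
    reconstruction; `rank` and `tol` are the user-supplied inputs
    criticized above."""
    X = X.astype(float)
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)  # initial guess

    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # drop minor eigenvectors
        new = np.where(missing, low_rank, X)              # keep observed entries
        if np.max(np.abs(new - filled)) < tol:            # stopping threshold
            filled = new
            break
        filled = new
    return filled

# A rank-1 matrix with one entry removed is recovered almost exactly.
X = np.outer([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
X[1, 2] = np.nan
print(round(svd_impute(X, rank=1)[1, 2], 2))  # -> 12.0
```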

Methods based on neural networks are another option for imputing missing values. This family of methods usually defines an error measure and iteratively tries to minimize it. A main drawback of these algorithms is their time complexity; for example, Multi Task Learning (MTL) neural networks [20] minimize a quadratic error. They also need a predefined threshold to stop. Methods based on clustering are alternatives, but have their own drawbacks: higher time complexities, user-specified thresholds to terminate, and in some cases a user-specified number of clusters at the start. Most of them converge to a local minimum; converging to a global minimum requires several repetitions, which is very computationally expensive.

In contrast, fuzzy-rough set theory provides an excellent framework for dealing with uncertainty and has several characteristics that make it a good choice for the imputation problem. Fuzzy-rough methods are not essentially optimization problems; they do not iterate through algorithm steps, so they need no stopping criterion, which matters because finding a good one is usually hard. They also do not need user-specified parameter values, which could be erroneous. Another reason to use fuzzy-rough techniques in this domain is their simplicity and understandability: they calculate the fuzzy similarities of instances and make decisions based on these. They work easily and effectively in the presence of noise, can deal with missing data directly (as demonstrated in this paper), and hence do not need initial guesses for missing values. Furthermore, they can easily and effectively deal with imprecise data, which the missing value imputation domain requires. These reasons make fuzzy-rough techniques suitable for the imputation of missing values.

The nearest neighbor algorithm has proven to be a simple yet very accurate method. In this paper we introduce several methods for imputing missing values based on fuzzy-rough sets combined with the nearest neighbor algorithm. In this way, one can benefit from both the simplicity and accuracy of nearest neighbor prediction and the robustness and noise tolerance of fuzzy-rough sets. The accuracy of fuzzy-rough nearest neighbor methods has already been demonstrated in the classification domain [21]. We use three types of fuzzy-rough sets; namely, implicator/t-norm based fuzzy-rough sets, vaguely quantified rough sets and ordered weighted average based fuzzy-rough sets. The latter two have been shown to be more robust in the presence of noise.

The rest of the paper is organized as follows: in the next section we review the literature in the area, focusing on the major approaches. Section 3 describes the theoretical background necessary to understand the proposed methods. Afterwards, the proposed algorithms are introduced. These methods are then applied to benchmark data and evaluated in Section 5 using non-parametric statistical analysis. The last section concludes the paper and outlines future work.


Literature review

There are many imputation methods in the literature based on different approaches. Although many such algorithms exist, they are often proposed for specific domains or even specific datasets, e.g. transportation [22], meteorology [23] and others [24]. Hence, they are not widely used. In contrast, several imputation methods are used in many domains. In this section, we focus mainly on these general algorithms, as the focus of this paper is a general imputation approach.

Theoretical background

This section outlines the basics of fuzzy and rough set theories and ways of combining them to generate a more advanced model, called fuzzy-rough sets.

Proposed methods

In Jensen et al. [21], the authors introduced two algorithms for prediction based on the KNN algorithm which use the fuzzy-rough lower and upper approximations. These algorithms are used to predict continuous or discrete decision feature values of datasets. The algorithm is shown in Algorithm 1 where d(z) denotes the value of instance z for the decision attribute.

In this algorithm, the lower and upper approximations are defined as follows:

(R↓Rd z)(y) = inf_{t ∈ N} I(R(y, t), Rd(t, z))

(R↑Rd z)(y) = sup_{t ∈ N} T(R(y, t), Rd(t, z))

where N is the neighbor set, I is an implicator and T is a t-norm.
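A minimal sketch of computing these two memberships, assuming the Kleene-Dienes implicator I(a, b) = max(1 - a, b), the minimum t-norm, and toy one-dimensional similarity relations (all assumptions of this illustration, not the paper's exact configuration):

```python
def lower_upper(y, z, neighbors, R, Rd):
    """Lower/upper approximation memberships of instance y, taken over a
    neighbor set, for similarity relation R and decision relation Rd."""
    implicator = lambda a, b: max(1.0 - a, b)  # Kleene-Dienes implicator
    lower = min(implicator(R(y, t), Rd(t, z)) for t in neighbors)  # inf over N
    upper = max(min(R(y, t), Rd(t, z)) for t in neighbors)         # sup, min t-norm
    return lower, upper

# Toy similarity on 1-D instances: closer values are more similar.
R = lambda a, b: max(0.0, 1.0 - abs(a - b))
lower, upper = lower_upper(0.5, 0.5, neighbors=[0.4, 0.6, 0.9], R=R, Rd=R)
print(round(lower, 2), round(upper, 2))  # -> 0.6 0.9
```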

Experimentation

In this section we conduct experiments to compare the proposed methods against existing imputation methods. Initially, the best values are found for the parameter k in our algorithms. Then, we compare our algorithms with several existing methods and use non-parametric statistical tests to validate the results.

Conclusion

In this paper, three new missing value imputation methods were introduced based on fuzzy-rough sets: implicator/t-norm based fuzzy-rough sets, OWA-based fuzzy-rough sets and vaguely quantified rough sets. The obtained results show that FRNNI performs better than the other two methods and outperforms most of the other imputation methods considered in this study. OWANNI performs similarly to FRNNI, with FRNNI slightly better. Finally, the performance of VQNNI is inferior to the other two methods.

Acknowledgment

Mehran Amiri would like to thank Dr. Mahdi Eftekhari for his help, support and encouragement.


References (42)

  • W. Qiao et al., Continuous On-line Identification of Nonlinear Plants in Power Systems With Missing Sensor Measurements (2005)
  • I.B. Aydilek et al., A novel hybrid approach to estimating missing values in databases using k-nearest neighbors and neural networks, Int. J. Innov. Comput. Inf. Control (2012)
  • D. Li et al., Towards Missing Data Imputation: A Study of Fuzzy k-Means Clustering Method (2004)
  • Z. Liao et al., Missing Data Imputation: A Fuzzy K-means Clustering Algorithm Over Sliding Window (2009)
  • R.J.A. Little et al., Statistical Analysis with Missing Data (2014)
  • P.J. García-Laencina et al., Pattern classification with missing data: a review, Neural Comput. Appl. (2010)
  • P.K. Sharpe et al., Dealing with missing values in neural network-based diagnostic systems, Neural Comput. Appl. (1995)
  • S. Nordbotten, Neural Network Imputation Applied to the Norwegian 1990 Population Census Data (1996)
  • A. Gupta et al., Estimating missing values using neural networks, J. Oper. Res. Soc. (1996)
  • Y. Bengio et al., Recurrent neural networks for missing or asynchronous data, Adv. Neural Inf. Process. Syst. (1996)
  • D. Pyle, Data Preparation for Data Mining (1999)

    Mehran Amiri graduated from Razi University, Kermanshah, Iran in 2008 as a software engineer. Pursuing his education in AI and data mining, he obtained an M.Sc. in Artificial Intelligence. He taught computer courses at several universities, then worked as an algorithm designer at Parseh Intelligent Surgical Systems Co. for two years. Currently he is with Behsazan Mellat Company as a software designer and data mining consultant. He has published papers in journals and conferences, and is a reviewer for journals including Applied Soft Computing and the Iranian Journal of Fuzzy Systems. His research interests include machine learning, data mining, instance selection and especially fuzzy-rough data mining.

    Richard Jensen received the B.Sc. degree in computer science from Lancaster University, U.K., and the M.Sc. and Ph.D. degrees in artificial intelligence from the University of Edinburgh, U.K. He is a Lecturer with the Department of Computer Science at Aberystwyth University, working in the Advanced Reasoning Group. His research interests include rough and fuzzy set theory; data mining; feature selection; and swarm intelligence. He has published over 70 peer-refereed articles in these areas, including a recent best paper award winner. He authored the research monograph 'Computational Intelligence and Feature Selection: Rough and Fuzzy Approaches', published jointly by IEEE/Wiley. He was the program co-chair of the International Conference on Rough Sets and Current Trends in Computing 2010 and is on the editorial board of Transactions on Rough Sets amongst others, as well as on the advisory board of the International Rough Set Society. He has organized several special sessions on fuzzy-rough sets for the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) and related conferences.
