Categorical missing data imputation for software cost estimation by multinomial logistic regression
Introduction
The importance of software cost estimation (SCE) as one of the most crucial phases of software development has long been recognized. Attempts to estimate the effort and time involved in the development of a software product usually involve the construction of one or more cost estimation models (Shepperd and Schofield, 1997; Jeffery et al., 2000) by applying statistical methods to historical data sets of completed software projects. Most commonly, cost models are obtained by applying regression methods (Angelis et al., 2001; Strike et al., 2001).
A major problem in building such a model arises from the fact that missing values are often encountered in these historical data sets (Strike et al., 2001; Briand et al., 1992). The lack of values in several important project attributes is a common phenomenon, and it may produce misleading results regarding a model's accuracy and predictive ability. Most software databases suffer from this problem, which stems from several difficulties inherent in the demanding process of collecting adequate data: for a company, data collection requires consistency, experience, time, cost and a methodology. Furthermore, when the model is based on multi-organizational data, missing values are also caused by the different methods the various companies use to measure and record their data.
There are various techniques for dealing with missing data. The most common one, known as listwise deletion (LD), simply ignores the cases with missing observations. The major advantage of the method is its simplicity and the ability to perform statistical calculations on a common sample base of cases. Its disadvantages are the dramatic loss of information in data sets with high percentages of missing values and the possible bias introduced in the data. These drawbacks are almost always apparent, especially when there is some pattern in the missing data, i.e., when the distribution of missing values depends on certain valid observations in the data.
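The information loss caused by listwise deletion is easy to see in a minimal pandas sketch (the project records below are made up for illustration and are not from the ISBSG data set):

```python
import pandas as pd

# Toy project records (illustrative only; not the ISBSG data).
df = pd.DataFrame({
    "effort":   [520, 310, 880, 150, 640, 275],
    "language": ["C", None, "Java", "C", None, "Cobol"],
    "platform": ["PC", "MF", None, "PC", "MF", "PC"],
})

# Listwise deletion: discard every case with at least one missing value.
complete = df.dropna()
print(f"{len(df)} cases -> {len(complete)} after listwise deletion")
# -> 6 cases -> 3 after listwise deletion
```

Here holes in only two categorical attributes already remove half of the sample, which is exactly the scenario in which deletion becomes unacceptable.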
According to some other techniques, the missing values are replaced by estimates obtained from statistical procedures. The complete data set resulting from such a process is then analyzed by standard statistical methods (for example, regression analysis). These techniques are commonly known as imputation methods. The problem is that most imputation methods generally produce continuous estimates, which are not realistic replacements for missing values of categorical variables. Since the majority of the variables in software data sets are categorical with many missing values, it is reasonable to use an imputation method that produces categorical values in order to fill the incomplete data set and then use it for constructing a prediction model.
In this paper we investigate the possibility of using the statistical procedure known as multinomial logistic regression (MLR) as an imputation method for categorical variables. The data we used for experimentation and comparisons come from the International Software Benchmarking Standards Group (ISBSG) multi-organizational software database (ISBSG, 2001). MLR was applied to a set of carefully selected software projects that had been pre-processed by statistical analysis, and was compared with four other missing data techniques: listwise deletion (LD), mean imputation (MI), expectation maximization (EM) and regression imputation (RI).
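The core idea of MLR imputation can be sketched as follows: fit a multinomial model on the cases where the categorical attribute is observed, then predict the missing levels. The sketch below uses synthetic data and scikit-learn's LogisticRegression rather than SPSS, so the variable names and data-generating setup are our own assumptions, not the paper's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for a software data set: two observed predictors
# and one categorical attribute related to them.
size = rng.normal(0.0, 1.0, n)
team = rng.integers(1, 5, n).astype(float)
cat = np.where(size > 0.5, "C", np.where(size > -0.5, "B", "A"))

# Knock out 30% of the categorical values completely at random (MCAR).
missing = rng.random(n) < 0.3

X = np.column_stack([size, team])
mlr = LogisticRegression(max_iter=1000)   # multinomial over the 3 levels
mlr.fit(X[~missing], cat[~missing])       # train on complete cases only

imputed = cat.copy()
imputed[missing] = mlr.predict(X[missing])  # categorical fills, not fractional means
```

Unlike mean or regression imputation, every filled-in value is one of the legitimate levels of the variable, so the completed data set can be fed directly to a cost model that expects categorical predictors.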
In order to compare the efficiency of the above methods in various incompleteness situations, the data set we used was originally complete (no missing values) and the missing values were created by simulating three different mechanisms: missing completely at random (MCAR), missing at random (MAR) and non-ignorable missingness (NIM).
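The three mechanisms can be mimicked on a complete variable roughly as follows. This is a sketch with synthetic data; the paper's actual generation procedure may differ in its details:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

effort = rng.lognormal(6.0, 1.0, n)                 # fully observed covariate
language = rng.choice(["C", "Java", "Cobol"], n)    # variable to be corrupted
p = 0.2                                             # target missing rate

# MCAR: missingness is independent of all data values.
mcar = rng.random(n) < p

# MAR: the probability of a hole in `language` depends only on the
# observed `effort` (here, high-effort projects lose the value).
mar = rng.random(n) < np.where(effort > np.median(effort), 2 * p, 0.0)

# NIM: the probability depends on the unobserved value of `language` itself.
nim = rng.random(n) < np.where(language == "Cobol", 0.45, 0.075)

for name, mask in [("MCAR", mcar), ("MAR", mar), ("NIM", nim)]:
    print(name, round(mask.mean(), 2))  # all three rates are close to 0.2
```

Keeping the overall missing rate comparable across the three masks, as above, is what makes the downstream accuracy comparison attributable to the mechanism rather than to the amount of missing data.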
The comparisons were conducted on the basis of the predictive accuracy of a regression model fitted to the data after applying each missing data technique. The results of the study indicate that for small percentages of missing values, MLR gives results as satisfactory as those of the other methods. However, as the percentage of missing values increases, MLR tends to outperform the other methods.
The structure of the paper is the following: In Section 2, we outline related work in the area. In Section 3, we describe the different mechanisms used to create missing data and the most common techniques for handling them. In Section 4, we present the data set used in the analysis and the statistical methods in detail. In Section 5, we give the results of the statistical analysis conducted using the prediction accuracy outputs as data, and finally, in Section 6, we conclude by discussing the findings as well as some directions for future work.
Section snippets
Related work
The problem of handling missing data has been treated extensively in various real-world data sets. Several statistical methods have been developed since the early 1970s, when the manipulation of complicated numerical calculations became feasible with the advance of computers. Some of the most important review papers on the subject are Afifi and Elashoff (1966), Hartley and Hocking (1971), Dempster et al. (1977), Little and Rubin (1983) and Little and Rubin (2002).
In the field of software engineering
Missing data mechanisms
The methods of handling missing data are directly related to the mechanisms that caused the incompleteness. Generally, these mechanisms fall into three classes (Little and Rubin, 2002):
1. Missing completely at random (MCAR): The missing values in a variable are unrelated to the values of any other variables, whether missing or valid.
2. Non-ignorable missingness (NIM): NIM can be considered as the opposite of MCAR in the sense that the probability of having missing values in a variable depends on the
Research methodology
Our approach to comparing MLR with the other four MDTs was based on studying the impact of each MDT on the predictive accuracy of a cost estimation model. Below we describe the data set used, the preprocessing performed to obtain a final complete data set, the cost model derived from the data and the accuracy measure employed for the comparisons. For the application of all statistical methods in this paper, we used the SPSS statistical package.
Results
As mentioned above, the purpose of the paper is to present a comparative study of five MDTs by experimentation with various missing data patterns. These patterns were generated so as to take into account variations in (a) randomness, (b) percentages and (c) variables. The descriptive statistics in Table 3 and the box-plots in Fig. 1 show the overall performance of the five methods with respect to the SD criterion, which in this section is considered as the dependent variable affected by the
Conclusion and future work
In this paper we investigated the use of MLR for estimating the missing values of categorical variables, predictors in software cost models. We experimented by considering a complete data set of software projects, by generating artificial missing values in four categorical predictor variables with three different mechanisms and finally by handling the missing data by MLR along with four other well-known methods. Subsequent statistical analysis including analysis of variance, confidence
Acknowledgment
We wish to thank the anonymous referees for their valuable comments which helped us to improve the paper.
References (19)
- Jeffery et al., 2000. A comparative study of two software development cost modelling techniques using multi-organizational and company-specific data. Inform. Software Technol.
- et al., 2005. Software productivity and effort prediction with ordinal regression. Inform. Software Technol.
- Afifi and Elashoff, 1966. Missing observations in multivariate statistics: review of the literature. J. Am. Statist. Assoc.
- Angelis, L., Stamelos, I., Morisio, M., 2001. Building a software cost estimation model based on categorical data. In: ...
- Briand et al., 1992. A pattern recognition approach for software engineering data analysis. IEEE Trans. Software Eng.
- Cartwright, M.H., Shepperd, M.J., Song, Q., 2003. Dealing with missing software project data. In: Proc. METRICS, pp. ...
- Dempster et al., 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. B.
- et al., 2000. Validating the ISO/IEC 15504 measure of software requirements analysis process capability. IEEE Trans. Software Eng.
- et al., 2003. A simulation study of the model evaluation criterion MMRE. IEEE Trans. Software Eng.