Categorical missing data imputation for software cost estimation by multinomial logistic regression
Introduction
The importance of software cost estimation (SCE) as one of the most crucial phases of software development has long been recognized. Attempts to estimate the effort and time involved in the development of a software product usually involve the construction of one or more cost estimation models (Shepperd and Schofield, 1997; Jeffery et al., 2000) by applying statistical methods to historical data sets of completed software projects. Most commonly, cost models are obtained by applying regression methods (Angelis et al., 2001; Strike et al., 2001).
A major problem in building such a model arises from the fact that missing values are often encountered in these historical data sets (Strike et al., 2001; Briand et al., 1992). The lack of values in several important project attributes is a common phenomenon, and it may produce misleading results regarding a model's accuracy and predictive ability. Most software databases suffer from this problem, which stems from several difficulties inherent in the demanding process of collecting adequate data: for a company, data collection requires consistency, experience, time, cost and a methodology. Furthermore, when the model is based on multi-organizational data, missing values are also caused by the different methods the various companies use to measure and record their data.
There are various techniques for dealing with missing data. The most common one, known as listwise deletion (LD), simply ignores the cases with missing observations. The major advantage of the method is its simplicity and the ability to perform statistical calculations on a common sample base of cases. Its disadvantages are the dramatic loss of information in data sets with high percentages of missing values and the possible bias introduced in the data. These drawbacks are almost always apparent, especially when there is some pattern in the missing data, i.e., when the distribution of missing values depends on certain valid observations in the data.
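The information loss caused by listwise deletion is easy to see in a minimal pandas sketch (the project records below are made up for illustration and are not from the ISBSG data set):

```python
import pandas as pd

# Toy project records (illustrative only; not the ISBSG data).
df = pd.DataFrame({
    "effort":   [520, 310, 880, 150, 640, 275],
    "language": ["C", None, "Java", "C", None, "Cobol"],
    "platform": ["PC", "MF", None, "PC", "MF", "PC"],
})

# Listwise deletion: discard every case with at least one missing value.
complete = df.dropna()
print(f"{len(df)} cases -> {len(complete)} after listwise deletion")
# -> 6 cases -> 3 after listwise deletion
```

Here holes in only two categorical attributes already remove half of the sample, which is exactly the scenario in which deletion becomes unacceptable.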
According to some other techniques, the missing values are replaced by estimates obtained from statistical procedures. The complete data set resulting from such a process is then analyzed by standard statistical methods (for example, regression analysis). These techniques are commonly known as imputation methods. The problem is that most imputation methods generally produce continuous estimates, which are not realistic replacements for missing values of categorical variables. Since the majority of the variables in software data sets are categorical with many missing values, it is reasonable to use an imputation method that produces categorical values in order to fill the incomplete data set and then use it for constructing a prediction model.
In this paper we investigate the possibility of using the statistical procedure known as multinomial logistic regression (MLR) as an imputation method for categorical variables. The data we used for experimentation and comparisons come from the International Software Benchmarking Standards Group (ISBSG) multi-organizational software database (ISBSG, 2001). MLR was applied to a set of carefully selected software projects that had been pre-processed by statistical analysis, and was compared with four other missing data techniques: listwise deletion (LD), mean imputation (MI), expectation maximization (EM) and regression imputation (RI).
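The core idea of MLR imputation can be sketched as follows: fit a multinomial model on the cases where the categorical attribute is observed, then predict the missing levels. The sketch below uses synthetic data and scikit-learn's LogisticRegression rather than SPSS, so the variable names and data-generating setup are our own assumptions, not the paper's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for a software data set: two observed predictors
# and one categorical attribute related to them.
size = rng.normal(0.0, 1.0, n)
team = rng.integers(1, 5, n).astype(float)
cat = np.where(size > 0.5, "C", np.where(size > -0.5, "B", "A"))

# Knock out 30% of the categorical values completely at random (MCAR).
missing = rng.random(n) < 0.3

X = np.column_stack([size, team])
mlr = LogisticRegression(max_iter=1000)   # multinomial over the 3 levels
mlr.fit(X[~missing], cat[~missing])       # train on complete cases only

imputed = cat.copy()
imputed[missing] = mlr.predict(X[missing])  # categorical fills, not fractional means
```

Unlike mean or regression imputation, every filled-in value is one of the legitimate levels of the variable, so the completed data set can be fed directly to a cost model that expects categorical predictors.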
In order to compare the efficiency of the above methods in various incompleteness situations, the data set we used was originally complete (no missing values) and the missing values were created by simulating three different mechanisms: missing completely at random (MCAR), missing at random (MAR) and non-ignorable missingness (NIM).
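The three mechanisms can be mimicked on a complete variable roughly as follows. This is a sketch with synthetic data; the paper's actual generation procedure may differ in its details:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

effort = rng.lognormal(6.0, 1.0, n)                 # fully observed covariate
language = rng.choice(["C", "Java", "Cobol"], n)    # variable to be corrupted
p = 0.2                                             # target missing rate

# MCAR: missingness is independent of all data values.
mcar = rng.random(n) < p

# MAR: the probability of a hole in `language` depends only on the
# observed `effort` (here, high-effort projects lose the value).
mar = rng.random(n) < np.where(effort > np.median(effort), 2 * p, 0.0)

# NIM: the probability depends on the unobserved value of `language` itself.
nim = rng.random(n) < np.where(language == "Cobol", 0.45, 0.075)

for name, mask in [("MCAR", mcar), ("MAR", mar), ("NIM", nim)]:
    print(name, round(mask.mean(), 2))  # all three rates are close to 0.2
```

Keeping the overall missing rate comparable across the three masks, as above, is what makes the downstream accuracy comparison attributable to the mechanism rather than to the amount of missing data.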
The comparisons were conducted on the basis of the predictive accuracy of a regression model fitted to the data after applying each missing data technique. The results of the study indicate that for small percentages of missing values, MLR gives results as satisfactory as those of the other methods. However, as the percentage of missing values increases, MLR tends to outperform the other methods.
The structure of the paper is the following: In Section 2, we outline related work in the area. In Section 3, we describe the different mechanisms used to create missing data and the most common techniques for handling them. In Section 4, we present the data set used in the analysis and the statistical methods in detail. In Section 5, we give the results of the statistical analysis conducted using the prediction accuracy outputs as data, and finally, in Section 6, we conclude by discussing the findings as well as some directions for future work.
Section snippets
Related work
The problem of handling missing data has been treated extensively in various real-world data sets. Several statistical methods have been developed since the early 1970s, when the manipulation of complicated numerical calculations became feasible with the advance of computers. Some of the most important review papers on the subject are Afifi and Elashoff (1966), Hartley and Hocking (1971), Dempster et al. (1977), Little and Rubin (1983) and Little and Rubin (2002).
In the field of software engineering
Missing data mechanisms
The methods of handling missing data are directly related to the mechanisms that caused the incompleteness. Generally, these mechanisms fall into three classes (Little and Rubin, 2002):
1. Missing completely at random (MCAR): The missing values in a variable are unrelated to the values of any other variables, whether missing or valid.
2. Non-ignorable missingness (NIM): NIM can be considered as the opposite of MCAR in the sense that the probability of having missing values in a variable depends on the
Research methodology
Our approach to comparing MLR with the other four MDTs was based on studying the impact of each MDT on the predictive accuracy of a cost estimation model. Below we describe the data set used, the preprocessing performed to obtain a final complete data set, the cost model derived from the data and the accuracy measure employed for the comparisons. For the application of all statistical methods in this paper, we used the SPSS statistical package.
Results
As mentioned above, the purpose of the paper is to present a comparative study of five MDTs by experimentation with various missing data patterns. These patterns were generated so as to take into account variations in (a) randomness, (b) percentages and (c) variables. The descriptive statistics in Table 3 and the box-plots in Fig. 1 show the overall performance of the five methods with respect to the SD criterion, which in this section is considered as the dependent variable affected by the
Conclusion and future work
In this paper we investigated the use of MLR for estimating the missing values of categorical variables, predictors in software cost models. We experimented by considering a complete data set of software projects, by generating artificial missing values in four categorical predictor variables with three different mechanisms and finally by handling the missing data by MLR along with four other well-known methods. Subsequent statistical analysis including analysis of variance, confidence
Acknowledgment
We wish to thank the anonymous referees for their valuable comments which helped us to improve the paper.
References (19)
- Jeffery et al., 2000. A comparative study of two software development cost modelling techniques using multi-organizational and company-specific data. Inform. Software Technol.
- et al., 2005. Software productivity and effort prediction with ordinal regression. Inform. Software Technol.
- Afifi and Elashoff, 1966. Missing observations in multivariate statistics: review of the literature. J. Am. Statist. Assoc.
- Angelis, L., Stamelos, I., Morisio, M., 2001. Building a software cost estimation model based on categorical data. In: ...
- Briand et al., 1992. A pattern recognition approach for software engineering data analysis. IEEE Trans. Software Eng.
- Cartwright, M.H., Shepperd, M.J., Song, Q., 2003. Dealing with missing software project data. In: Proc. METRICS, pp. ...
- Dempster et al., 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. B.
- et al., 2000. Validating the ISO/IEC 15504 measure of software requirements analysis process capability. IEEE Trans. Software Eng.
- et al., 2003. A simulation study of the model evaluation criterion MMRE. IEEE Trans. Software Eng.