Categorical missing data imputation for software cost estimation by multinomial logistic regression

https://doi.org/10.1016/j.jss.2005.02.026

Abstract

A common problem in software cost estimation is the handling of incomplete or missing data in the databases used to develop prediction models. In such cases, the simplest and most popular way of handling missing data is to ignore either the projects or the attributes with missing observations. This technique causes the loss of valuable information and may therefore lead to inaccurate cost estimation models. On the other hand, there are various imputation methods used to estimate the missing values in a data set. These methods are applied mainly to numerical data and produce continuous estimates. However, it is well known that the majority of cost data sets contain software projects described mostly by categorical attributes with many missing values. It is therefore reasonable to use an estimation method that produces categorical rather than continuous values. The purpose of this paper is to investigate the possibility of using such a method for estimating categorical missing values in software cost databases. Specifically, the method known as multinomial logistic regression (MLR) is suggested for imputation and is applied to projects of the ISBSG multi-organizational software database. Comparisons of MLR with other techniques for handling missing data, such as listwise deletion (LD), mean imputation (MI), expectation maximization (EM) and regression imputation (RI), under different patterns and percentages of missing data show the high efficiency of the proposed method.

Introduction

The importance of software cost estimation (SCE) as one of the most crucial phases of software development has long been recognized. Attempts to estimate the effort and time involved in the development of a software product usually involve the construction of one or more cost estimation models (Shepperd and Schofield, 1997, Jeffery et al., 2000) by applying statistical methods to historical data sets of completed software projects. Most commonly, cost models are obtained by applying regression methods (Angelis et al., 2001, Strike et al., 2001).

A major problem in building such a model arises from the fact that missing values are often encountered in these historical data sets (Strike et al., 2001, Briand et al., 1992). The lack of data values in several important project attributes is a common phenomenon, which may cause misleading results regarding the models’ accuracy and prediction ability. Most software databases suffer from this problem, which stems from several factors related to the demanding process of collecting adequate data: the collection requires consistency, experience, time, cost and a well-defined methodology from a company. Furthermore, when the model is based on multi-organizational data, missing values also arise from the different methods the various companies use to measure and record their data.

There are various techniques dealing with missing data. The most common one, known as listwise deletion (LD), simply ignores the cases with missing observations. The major advantages of the method are its simplicity and the ability to carry out statistical calculations on a common sample base of cases. The disadvantages of the method are the dramatic loss of information in data sets with high percentages of missing values and the possible bias in the data. These drawbacks are more or less always present, especially when there is some type of pattern in the missing data, i.e., when the distribution of missing values depends on certain valid observations in the data.
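For illustration, a minimal sketch of listwise deletion (in Python with pandas; the column names below are hypothetical, not the actual ISBSG fields) shows how quickly complete-case analysis can shrink a data set:

```python
# A minimal sketch of listwise deletion with pandas (illustrative columns,
# not the ISBSG schema).
import numpy as np
import pandas as pd

projects = pd.DataFrame({
    "effort":    [520.0, 1100.0, np.nan, 300.0, 870.0],
    "team_size": [4.0, np.nan, 6.0, 3.0, 5.0],
    "language":  ["Java", "C", "COBOL", None, "Java"],
})

# Listwise deletion: keep only the projects with no missing attribute at all.
complete_cases = projects.dropna()
print(complete_cases)                       # only 2 of the 5 projects remain
print(len(complete_cases) / len(projects))  # fraction of information retained
```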

According to some other techniques, the missing values are replaced by estimates obtained from statistical procedures. The complete data set resulting from such a process is then analyzed by standard statistical methods (for example regression analysis). These techniques are commonly known as imputation methods. The problem is that most imputation methods generally produce continuous estimates, which are not realistic replacements of the missing values when the variables are categorical. Since the majority of the variables in software data sets are categorical with many missing values, it is reasonable to use an imputation method producing categorical values in order to fill the incomplete data set and then to use it for constructing a prediction model.
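A small hypothetical example of this mismatch (assuming categories coded numerically, as is common in cost data sets): mean imputation of a coded categorical attribute yields a value that corresponds to no actual category, whereas a categorical method must choose one of the existing levels.

```python
# Hypothetical example: "org_type" is a categorical attribute coded 1..4.
import numpy as np
import pandas as pd

org_type = pd.Series([1, 3, np.nan, 4, 2, 3], dtype="float")

# Mean imputation produces a continuous estimate that is not a category ...
print(org_type.fillna(org_type.mean()).iloc[2])  # 2.6 -- no such category exists

# ... whereas a categorical replacement (here simply the mode) stays within
# the observed levels; the paper proposes predicting the level with MLR instead.
print(org_type.mode().iloc[0])                   # 3.0, a legitimate category
```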

In this paper we investigate the possibility of using the statistical procedure known as multinomial logistic regression (MLR) as an imputation method for categorical variables. The data we used for experimentation and comparisons come from the International Software Benchmarking Standards Group (ISBSG) multi-organizational software database (ISBSG, 2001). MLR was applied to a set of carefully selected software projects that had been pre-processed by statistical analysis and was compared with four other missing data techniques: listwise deletion (LD), mean imputation (MI), expectation maximization (EM) and regression imputation (RI).
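The imputation idea itself is straightforward: fit a multinomial logistic regression on the projects where the categorical attribute is observed, then predict the most probable category for the projects where it is missing. The paper carries this out in SPSS; the sketch below only illustrates the idea in Python with scikit-learn, using hypothetical column names rather than the ISBSG variables.

```python
# Sketch of MLR-based imputation of one categorical attribute (illustrative
# only; the paper's own analysis was performed in SPSS).
import pandas as pd
from sklearn.linear_model import LogisticRegression

def mlr_impute(df, target, predictors):
    """Fill missing categories of `target` using a multinomial logistic
    regression fitted on the rows where `target` is observed."""
    observed = df[target].notna()
    if observed.all():
        return df.copy()
    # For multi-class targets the default lbfgs solver fits a multinomial
    # (softmax) model, i.e. MLR.
    model = LogisticRegression(max_iter=1000)
    model.fit(df.loc[observed, predictors], df.loc[observed, target])
    filled = df.copy()
    filled.loc[~observed, target] = model.predict(df.loc[~observed, predictors])
    return filled

# Toy data: impute a missing "language_type" from project size and team size.
data = pd.DataFrame({
    "size_kloc":     [12, 45, 7, 60, 33, 18],
    "team_size":     [3, 8, 2, 10, 6, 4],
    "language_type": ["3GL", "4GL", "3GL", "4GL", None, "3GL"],
})
print(mlr_impute(data, "language_type", ["size_kloc", "team_size"]))
```

In practice one would repeat such a procedure for each incomplete categorical variable, using the other project attributes as covariates.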

In order to compare the efficiency of the above methods in various incompleteness situations, the data set we used was originally complete (no missing values) and the missing values were created by simulating three different mechanisms: missing completely at random (MCAR), missing at random (MAR) and non-ignorable missingness (NIM).
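As a rough illustration of how such mechanisms can be simulated (the rates, thresholds and column names below are assumptions for illustration, not the generation rules used in the study):

```python
# Sketch of injecting missing values into a complete data set under the three
# mechanisms; rates and rules are illustrative, not those of the paper.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def make_missing(df, target, mechanism, rate=0.2, driver=None):
    df = df.copy()
    n = len(df)
    if mechanism == "MCAR":
        # Missingness is independent of every variable: a pure random sample.
        drop = rng.random(n) < rate
    elif mechanism == "MAR":
        # Missingness depends on an *observed* variable (`driver`): projects
        # with large driver values are more likely to lose the target value.
        drop = (df[driver] > df[driver].median()) & (rng.random(n) < 2 * rate)
    elif mechanism == "NIM":
        # Non-ignorable: missingness depends on the unobserved values of the
        # target itself, e.g. one particular category is preferentially lost.
        drop = (df[target] == df[target].mode().iloc[0]) & (rng.random(n) < 2 * rate)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    df.loc[drop, target] = np.nan
    return df

# e.g. incomplete = make_missing(projects, "language", "MAR", driver="effort")
```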

The comparisons were conducted on the basis of the predictive accuracy of a regression model fitted to the data after applying each missing data technique. The results of the study indicate that for small percentages of missing values, MLR gives results as satisfactory as the other methods. However, as the percentage of missing values increases, MLR tends to outperform the other methods.

The structure of the paper is as follows: In Section 2, we outline related work in the area. In Section 3, we describe the different mechanisms used to create missing data and the most common techniques for handling them. In Section 4, we present the data set used in the analysis and the statistical methods in detail. In Section 5, we give the results of the statistical analysis conducted using the prediction accuracy outputs as data and finally, in Section 6, we conclude by discussing the findings as well as some directions for future work.

Section snippets

Related work

The problem of handling missing data has been studied extensively for various real-world data sets. Several statistical methods have been developed since the early 1970s, when complicated numerical calculations became feasible with the advance of computers. Some of the most important review papers on the subject are Afifi and Elashoff, 1966, Hartley and Hocking, 1971, Dempster et al., 1977, Little and Rubin, 1983, Little and Rubin, 2002.

In the field of software engineering

Missing data mechanisms

The methods of handling missing data are directly related to the mechanisms that caused the incompleteness. Generally, these mechanisms fall into three classes (Little and Rubin, 2002):

  1. Missing completely at random (MCAR): The missing values in a variable are unrelated to the values of any other variables, whether missing or valid.

  2. Non-ignorable missingness (NIM): NIM can be considered as the opposite of MCAR in the sense that the probability of having missing values in a variable depends on the

Research methodology

Our approach to comparing MLR with the other four missing data techniques (MDTs) was to study the impact of each MDT on the predictive accuracy of a cost estimation model. Below we describe the data set used, the preprocessing performed to obtain a final complete data set, the cost model derived from the data and the accuracy measure employed for the comparisons. For the application of all statistical methods in this paper, we used the statistical package SPSS.
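Although the study itself was carried out in SPSS, the comparison logic can be outlined in code. The sketch below is an assumption-laden outline: the log-linear cost model, the MMRE accuracy measure and the helpers `make_missing` (from the earlier sketch) and `handle` are illustrative stand-ins, not the paper's exact procedure or its accuracy criterion.

```python
# Outline of the comparison loop: inject missing values, apply a missing-data
# technique (MDT), refit a cost model on the treated data and record its
# predictive accuracy.  Cost model and accuracy measure are assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

def mmre(actual, predicted):
    """Mean magnitude of relative error on the raw effort scale."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.mean(np.abs(actual - predicted) / actual))

def evaluate_mdt(complete_df, mechanism, rate, target_cat, predictors, handle):
    # 1. Create an incomplete copy of the originally complete data set.
    incomplete = make_missing(complete_df, target_cat, mechanism, rate)
    # 2. Apply the missing-data technique (LD, MI, EM, RI or MLR imputation)
    #    to obtain a complete data set again.
    treated = handle(incomplete)
    # 3. Refit the cost model and measure its cross-validated accuracy.
    X = pd.get_dummies(treated[predictors], drop_first=True)
    y = np.log(treated["effort"])
    predictions = cross_val_predict(LinearRegression(), X, y, cv=5)
    return mmre(np.exp(y), np.exp(predictions))
```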

Results

As mentioned above, the purpose of the paper is to present a comparative study of five MDTs by experimentation with various missing data patterns. These patterns were generated so as to take into account variations in (a) randomness, (b) percentages and (c) variables. The descriptive statistics in Table 3 and the box-plots in Fig. 1 show the overall performance of the five methods with respect to the SD criterion which in this section is considered as the dependent variable affected by the

Conclusion and future work

In this paper we investigated the use of MLR for estimating the missing values of categorical variables that serve as predictors in software cost models. We experimented by considering a complete data set of software projects, generating artificial missing values in four categorical predictor variables under three different mechanisms, and finally handling the missing data with MLR along with four other well-known methods. Subsequent statistical analysis including analysis of variance, confidence

Acknowledgment

We wish to thank the anonymous referees for their valuable comments which helped us to improve the paper.

References

  • R. Jeffery et al., 2000. A comparative study of two software development cost modelling techniques using multi-organizational and company-specific data. Inform. Software Technol.
  • P. Sentas et al., 2005. Software productivity and effort prediction with ordinal regression. Inform. Software Technol.
  • A.A. Afifi et al., 1966. Missing observations in multivariate statistics: review of the literature. J. Am. Statist. Assoc.
  • Angelis, L., Stamelos, I., Morisio, M., 2001. Building a software cost estimation model based on categorical data. In: ...
  • L. Briand et al., 1992. A pattern recognition approach for software engineering data analysis. IEEE Trans. Software Eng.
  • Cartwright, M.H., Shepperd, M.J., Song, Q., 2003. Dealing with missing software project data. In: Proc. METRICS, pp. ...
  • A.P. Dempster et al., 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. B.
  • K.E. Emam et al., 2000. Validating the ISO/IEC 15504 measure of software requirements analysis process capability. IEEE Trans. Software Eng.
  • T. Foss et al., 2003. A simulation study of the model evaluation criterion MMRE. IEEE Trans. Software Eng.
