Genetic programming outperformed multivariable logistic regression in diagnosing pulmonary embolism
Introduction
In the past decade there has been increased interest in medical prediction research to answer prognostic and diagnostic questions. Generally, such research aims to develop a so-called prediction rule that predicts a particular outcome as accurately as possible, preferably with a minimum of information or predictors. In diagnostic prediction research the outcome is the presence of a disease, and in prognostic prediction research it is the future occurrence of a certain event. With the increasing availability of electronic patient records, the interest in medical prediction research will increase further, because electronic records facilitate the application of prediction rules in medical practice.
The most widely used method to develop prediction rules or models in clinical epidemiology is multivariable logistic regression [1], [2], [3], [4], [5], [6], [7]. In the past decade, new methods such as classification and regression trees (CART) and neural networks have been introduced for this purpose. However, it has repeatedly been shown that neither method produces prediction rules that achieve higher predictive accuracy than rules developed by multivariable logistic regression [8], [9], [10], [11], [12], [13]. Recently, the technique of genetic programming has emerged. Genetic programming is a search method inspired by the process of natural evolution, and may be used to model complex associations between large numbers of variables [14], [15], [16]. This feature also makes genetic programming suitable for prediction research, where the mutual correlations between various predictors and the outcome must be estimated.
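To make the idea of an evolution-inspired search concrete, the following is a minimal toy sketch, not the procedure used in this study. It assumes a tiny function set (addition, subtraction, multiplication), mutation as the only source of variation, and classification accuracy at a score threshold of zero as the fitness measure; all names (`random_tree`, `mutate`, `fitness`, `evolve`) are hypothetical.

```python
import operator
import random

# Assumed, minimal function set: binary operators a tree node may apply.
FUNCTIONS = [operator.add, operator.sub, operator.mul]

def random_tree(n_vars, depth, rng):
    """Grow a random expression tree; leaves are predictors or constants."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("var", rng.randrange(n_vars))
        return ("const", rng.uniform(-1.0, 1.0))
    return ("op", rng.choice(FUNCTIONS),
            random_tree(n_vars, depth - 1, rng),
            random_tree(n_vars, depth - 1, rng))

def evaluate(tree, x):
    """Compute the score of one patient record x under an expression tree."""
    if tree[0] == "var":
        return x[tree[1]]
    if tree[0] == "const":
        return tree[1]
    return tree[1](evaluate(tree[2], x), evaluate(tree[3], x))

def mutate(tree, n_vars, rng):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if tree[0] != "op" or rng.random() < 0.3:
        return random_tree(n_vars, 2, rng)
    if rng.random() < 0.5:
        return ("op", tree[1], mutate(tree[2], n_vars, rng), tree[3])
    return ("op", tree[1], tree[2], mutate(tree[3], n_vars, rng))

def fitness(tree, X, y):
    """Fraction of records classified correctly when score > 0 means disease."""
    hits = sum((evaluate(tree, x) > 0) == bool(label) for x, label in zip(X, y))
    return hits / len(y)

def evolve(X, y, pop_size=50, generations=30, seed=0):
    """Keep the better half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    n_vars = len(X[0])
    population = [random_tree(n_vars, 3, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, X, y), reverse=True)
        survivors = population[: pop_size // 2]              # selection
        children = [mutate(rng.choice(survivors), n_vars, rng)
                    for _ in range(pop_size - len(survivors))]  # variation
        population = survivors + children
    return max(population, key=lambda t: fitness(t, X, y))
```

Because the better half of each generation is carried over unchanged, the best fitness in the population cannot decrease from one generation to the next. Practical genetic programming systems typically add crossover between trees and some control on tree size, both of which this sketch omits.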
Genetic programming is not restricted to any fixed model structure. Therefore, it may theoretically result in a model achieving higher predictive accuracy than a model obtained by ordinary logistic regression analysis. The flexibility of a logistic model can, however, also be increased by including cubic splines for continuous variables (rather than only the linear terms) and interaction terms, potentially enhancing the model's predictive accuracy [4], [6], [17]. This is not commonly done, as it often decreases the interpretability of such a model.
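As an illustration of how spline and interaction terms extend a logistic model, the sketch below expands two continuous predictors into a richer set of model terms. It assumes a truncated power basis for the cubic spline (restricted cubic splines, which constrain the tails to be linear, are a common alternative), and the predictor names `age` and `heart_rate`, the knot positions, and the coefficients are hypothetical rather than taken from the study.

```python
import math

def truncated_cubic(x, knot):
    """Truncated power term: (x - knot)^3 when x exceeds the knot, else 0."""
    d = x - knot
    return d ** 3 if d > 0 else 0.0

def design_row(age, heart_rate, knots):
    """Expand two continuous predictors into flexible logistic model terms."""
    row = [1.0, age, age ** 2, age ** 3]              # intercept and cubic polynomial
    row += [truncated_cubic(age, k) for k in knots]   # one spline term per knot
    row += [heart_rate, age * heart_rate]             # linear term and interaction
    return row

def logistic(lp):
    """Numerically stable inverse logit of the linear predictor lp."""
    if lp >= 0:
        return 1.0 / (1.0 + math.exp(-lp))
    e = math.exp(lp)
    return e / (1.0 + e)

def predicted_probability(row, coefficients):
    """Predicted probability of disease for one expanded predictor row."""
    return logistic(sum(b * v for b, v in zip(coefficients, row)))
```

With two knots the two predictors become eight model terms, which shows why the added flexibility comes at the cost of interpretability: no single coefficient describes the effect of age any more.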
Like neural networks, genetic programming originates from the field of artificial intelligence and machine learning. But contrary to neural networks, genetic programming requires fewer prior restrictions on the structure of the model. Nevertheless, an often-cited disadvantage of both genetic programming and neural networks is the complexity of the developed prediction model (“black-box character”). Genetic programming has been used in medical research for myoelectrical signal recognition, echocardiography, and medical imaging, but its value for medical prediction has not yet been documented.
Our aim was to compare genetic programming and multivariable logistic regression in the development of a diagnostic prediction model using empirical data from a study on the diagnosis of pulmonary embolism (PE). We developed one prediction model using genetic programming and one using multivariable logistic regression, and compared the two methods on their predictive ability in an independent data set. The feasibility of applying both prediction models in clinical practice is discussed, as well as the differences between genetic programming and neural networks.
Patients: description of the empirical data set
For the present analysis, data were used from a prospective diagnostic study among 398 patients aged 18 years or older in secondary care who were suspected of PE. As the data are used for illustration purposes only, we refer to the literature for details on the design and main results of the study [18], [19], [20]. Briefly, all patients underwent a systematic patient history and physical examination, followed by blood gas analysis, chest radiography, leg ultrasound, ventilation-perfusion lung scanning
Descriptives
There were no major differences in patient characteristics between the derivation and validation set (Table 1). PE was diagnosed in 42.6% of the patients in the derivation set and in 42.9% of those in the validation set. Table 2 shows the univariable associations and distribution of the 10 predictors across patients with and without PE in the derivation set. “History of collapse” and “previous deep venous thrombosis” were the strongest predictors.
Logistic regression
The overall logistic model yielded a ROC area of 0.77
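For readers less familiar with the ROC area, it can be read as the probability that a randomly chosen patient with PE receives a higher predicted probability than a randomly chosen patient without PE, with ties counted as one half. The following minimal sketch computes it by comparing all such pairs; the function name `roc_area` and the example values in the usage note are illustrative and not study data.

```python
def roc_area(scores, labels):
    """ROC area as the share of (diseased, non-diseased) pairs ranked correctly.

    scores: predicted probabilities; labels: 1 = disease present, 0 = absent.
    Tied pairs count as half a correct ranking.
    """
    diseased = [s for s, y in zip(scores, labels) if y == 1]
    healthy = [s for s, y in zip(scores, labels) if y == 0]
    if not diseased or not healthy:
        raise ValueError("both outcome classes are needed to compute a ROC area")
    correct = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                correct += 1.0
            elif d == h:
                correct += 0.5
    return correct / (len(diseased) * len(healthy))
```

For example, `roc_area([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])` ranks three of the four pairs correctly and returns 0.75. A value of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination.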
Discussion
To our knowledge, this is the first study to address the value of genetic programming for medical prediction purposes compared to the well-known and widely applied logistic regression technique. Given that the amount of overoptimism in discriminative value was similar for both models as estimated from the bootstrap, the discriminative value of the genetic programming model in the validation set was significantly larger than that of the logistic regression model. Before any form of recalibration
Acknowledgements
We gratefully acknowledge the support of The Netherlands Organization for Scientific Research (ZON-MW904-66-112).
References (28)
Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. J Clin Epidemiol (1996)
Comparison of logistic regression and neural networks to predict rehospitalization in patients with stroke. J Clin Epidemiol (2001)
Simplified risk score models accurately predict the risk of major in-hospital complications following percutaneous coronary intervention. Am J Cardiol (2001)
A normal perfusion lung scan in patients with clinically suspected pulmonary embolism: frequency and clinical validity. Chest (1995)
Multivariate analyses-based prediction rule for pulmonary embolism. Thromb Res (2000)
Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol (2001)
Probabilistic prediction in patient management and clinical trials. Stat Med (1986)
Applied logistic regression. (1989)
Statistical aspects of prognostic factor studies in oncology. Br J Cancer (1994)
Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med (1996)
Clinical prediction rules. A review and suggested modifications of methodological standards. JAMA
Regression modeling strategies.
Diagnostic studies as multivariable, prediction research. J Epidemiol Community Health
A comparison of performance of mathematical predictive methods for medical diagnosis: identifying acute cardiac ischemia among emergency department patients. J Investig Med