Original Articles
Genetic programming outperformed multivariable logistic regression in diagnosing pulmonary embolism

https://doi.org/10.1016/j.jclinepi.2003.10.011

Abstract

Objective

Genetic programming is a search method that can be used to model complex associations between large numbers of variables. It has been used, for example, for myoelectrical signal recognition, but its value for medical prediction, as in diagnostic and prognostic settings, has not been documented.

Study design and setting

We compared genetic programming and the commonly used logistic regression technique in the development of a prediction model using empirical data from a study on diagnosis of pulmonary embolism. Using part (67%) of the data, we developed and internally validated (using bootstrapping techniques) a diagnostic prediction model by genetic programming and by logistic regression, and compared both on their predictive ability in the remaining data (validation set).

Results

In the validation set, the area under the ROC curve of the genetic programming model was significantly larger (0.73; 95% CI: 0.64–0.82) than that of the logistic regression model (0.68; 95% CI: 0.59–0.77). The calibration of both models was similar, indicating a similar amount of overoptimism.

Conclusion

Although the interpretation of a genetic programming model is less intuitive and this is the first empirical study quantifying its value for medical prediction, genetic programming seems a promising technique to develop prediction rules for diagnostic and prognostic purposes.

Introduction

In the past decade there has been an increased interest in medical prediction research to answer prognostic and diagnostic questions. Generally, such research aims to develop a so-called prediction rule that predicts a particular outcome as accurately as possible, preferably with a minimum of information or predictors. In diagnostic prediction research the outcome is the presence of a disease, and in prognostic prediction research it is the future occurrence of a certain event. With the increasing availability of electronic patient records, the interest in medical prediction research will increase further, because electronic records facilitate the application of prediction rules in medical practice.

The most widely used method to develop prediction rules or models in clinical epidemiology is multivariable logistic regression [1], [2], [3], [4], [5], [6], [7]. In the past decade, new methods such as classification and regression trees (CART) and neural networks have been introduced for this purpose. However, it has repeatedly been shown that neither method produces prediction rules that achieve higher predictive accuracy than rules developed by multivariable logistic regression [8], [9], [10], [11], [12], [13]. Recently, the technique of genetic programming has emerged. Genetic programming is a search method inspired by the process of natural evolution, and may be used to model complex associations between large numbers of variables [14], [15], [16]. This feature also makes genetic programming suitable for prediction research, in which the mutual correlations between various predictors and the outcome must be estimated.
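As a rough illustration of the evolutionary search idea described above, the following is a minimal toy sketch and not the implementation used in this study. It assumes candidate models are small arithmetic expression trees over the predictors, that fitness is the area under the ROC curve in the derivation data, and that new candidates arise by subtree mutation only (no crossover). All function names, the function set, and the parameter defaults are hypothetical.

```python
import random

# Hypothetical function set: (name, arity, implementation).
FUNCTIONS = [
    ("add", 2, lambda a, b: a + b),
    ("sub", 2, lambda a, b: a - b),
    ("mul", 2, lambda a, b: a * b),
]

def random_tree(n_vars, depth, rng):
    """Grow a random expression tree over predictors x[0]..x[n_vars-1]."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("var", rng.randrange(n_vars))
        return ("const", rng.uniform(-1.0, 1.0))
    name, arity, _ = rng.choice(FUNCTIONS)
    return (name,) + tuple(random_tree(n_vars, depth - 1, rng) for _ in range(arity))

def evaluate(tree, x):
    """Compute the score that a tree assigns to one patient's predictor values."""
    kind = tree[0]
    if kind == "var":
        return x[tree[1]]
    if kind == "const":
        return tree[1]
    fn = next(f for n, _, f in FUNCTIONS if n == kind)
    return fn(*(evaluate(child, x) for child in tree[1:]))

def mutate(tree, n_vars, rng):
    """Replace one randomly chosen subtree with a freshly grown subtree."""
    if tree[0] in ("var", "const") or rng.random() < 0.2:
        return random_tree(n_vars, 2, rng)
    children = list(tree[1:])
    i = rng.randrange(len(children))
    children[i] = mutate(children[i], n_vars, rng)
    return (tree[0],) + tuple(children)

def auc(scores, labels):
    """ROC area: chance that a random case outscores a random non-case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fitness(tree, X, y):
    return auc([evaluate(tree, x) for x in X], y)

def evolve(X, y, pop_size=60, generations=20, seed=0):
    """Keep the best quarter each generation and refill with mutated copies."""
    rng = random.Random(seed)
    n_vars = len(X[0])
    population = [random_tree(n_vars, 3, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, X, y), reverse=True)
        survivors = population[: pop_size // 4]
        children = [mutate(rng.choice(survivors), n_vars, rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=lambda t: fitness(t, X, y))
```

Because the best candidates are carried over unchanged, the best fitness in the derivation data can never decrease from one generation to the next. This is also why apparent performance tends to be overoptimistic and needs checking in independent data.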

Genetic programming is not restricted to any fixed model structure. Therefore, it may theoretically result in a model achieving higher predictive accuracy than a model obtained by ordinary logistic regression analyses. The flexibility of a logistic model can also be increased, however, by including cubic splines for continuous variables (rather than only the linear terms) and interaction terms, potentially enhancing the model's predictive accuracy [4], [6], [17]. This is not commonly done, as it often decreases the interpretability of such a model.
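To make the idea of a more flexible logistic model concrete, the sketch below expands one continuous predictor into a cubic spline basis and adds an interaction term before applying the logistic function. It is an illustration under stated assumptions: it uses the truncated power basis, which is only one common way to construct cubic splines (restricted cubic splines add further constraints and differ in detail), and the predictor names `age_decades` and `prior_dvt`, the knot positions, and the coefficients are hypothetical rather than taken from the study.

```python
import math

def cubic_spline_basis(x, knots):
    """Truncated power basis: x, x^2, x^3, plus (x - k)^3 above each knot k."""
    return [x, x ** 2, x ** 3] + [max(0.0, x - k) ** 3 for k in knots]

def expand_row(age_decades, prior_dvt, knots):
    """Design row: intercept, spline terms for age, a binary predictor,
    and the age-by-predictor interaction (all names are hypothetical)."""
    return ([1.0]
            + cubic_spline_basis(age_decades, knots)
            + [float(prior_dvt), age_decades * prior_dvt])

def predicted_probability(row, coefficients):
    """Logistic function of the linear predictor, written to avoid overflow."""
    lp = sum(b * v for b, v in zip(coefficients, row))
    if lp >= 0:
        return 1.0 / (1.0 + math.exp(-lp))
    e = math.exp(lp)
    return e / (1.0 + e)
```

With two knots, a single continuous predictor and one binary predictor already require eight coefficients instead of three, which illustrates why such models become harder to interpret.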

Like neural networks, genetic programming originates from the field of artificial intelligence and machine learning. But contrary to neural networks, genetic programming requires fewer prior restrictions on the structure of the model. Nevertheless, an often-cited disadvantage of both genetic programming and neural networks is the complexity of the developed prediction model (“black-box character”). In medical research, genetic programming has been used for myoelectrical signal recognition, echocardiography, and medical imaging, but its value for medical prediction has not yet been documented.

Our aim was to compare genetic programming and multivariable logistic regression in the development of a diagnostic prediction model using empirical data from a study on diagnosis of pulmonary embolism (PE). We developed one prediction model using genetic programming and one using multivariable logistic regression, and compared both methods on their predictive ability in an independent data set. The feasibility of applying both prediction models in clinical practice is discussed, as well as the differences between genetic programming and neural networks.
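The split-sample comparison described above can be sketched as follows. This is an illustrative outline, not the authors' analysis code: the random split, the Mann-Whitney estimate of the ROC area, and the Hanley and McNeil approximation for its confidence interval are assumptions about how such a comparison might be set up, and the function names are hypothetical.

```python
import math
import random

def split_sample(rows, derivation_fraction=0.67, seed=0):
    """Randomly split rows into a derivation set and a validation set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(derivation_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def roc_area(scores, labels):
    """Mann-Whitney estimate of the ROC area (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def hanley_mcneil_ci(area, n_pos, n_neg, z=1.96):
    """Approximate confidence interval for a ROC area (Hanley & McNeil)."""
    q1 = area / (2.0 - area)
    q2 = 2.0 * area ** 2 / (1.0 + area)
    variance = (area * (1.0 - area)
                + (n_pos - 1) * (q1 - area ** 2)
                + (n_neg - 1) * (q2 - area ** 2)) / (n_pos * n_neg)
    half_width = z * math.sqrt(variance)
    return area - half_width, area + half_width
```

In use, each model would be fitted on the derivation rows only, its predicted scores computed for the validation rows, and `roc_area` evaluated on those held-out scores. The interval width depends strongly on the number of patients with and without the outcome in the validation set.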

Section snippets

Patients: description of the empirical data set

For the present analysis, we used data from a prospective diagnostic study among 398 patients in secondary care, aged 18 years or older, who were suspected of having PE. As the data are used for illustration purposes only, we refer to the literature for details on the design and main results of the study [18], [19], [20]. Briefly, all patients underwent a systematic patient history and physical examination, followed by blood gas analysis, chest radiography, leg ultrasound, ventilation-perfusion lung scanning

Descriptives

There were no major differences in patient characteristics between the derivation and validation sets (Table 1). PE was diagnosed in 42.6% of the patients in the derivation set and in 42.9% of those in the validation set. Table 2 shows the univariable associations and distribution of the 10 predictors across patients with and without PE in the derivation set. “History of collapse” and “previous deep venous thrombosis” were the strongest predictors.

Logistic regression

The overall logistic model yielded a ROC area of 0.77

Discussion

To our knowledge, this is the first study to address the value of genetic programming for medical prediction purposes compared to the well-known and widely applied logistic regression technique. Given that the amount of overoptimism in discriminative value was similar for both models as estimated from the bootstrap, the discriminative value of the genetic programming model in the validation set was significantly larger than that of the logistic regression model. Before any form of recalibration

Acknowledgements

We gratefully acknowledge the support of The Netherlands Organization for Scientific Research (ZON-MW904-66-112).

References (28)

  • A. Laupacis et al. Clinical prediction rules. A review and suggested modifications of methodological standards. JAMA (1997)
  • F.E. Harrell. Regression modeling strategies (2001)
  • K.G. Moons et al. Diagnostic studies as multivariable, prediction research. J Epidemiol Community Health (2002)
  • H.P. Selker et al. A comparison of performance of mathematical predictive methods for medical diagnosis: identifying acute cardiac ischemia among emergency department patients. J Investig Med (1995)

Cited by (11)

  • Development and validation of clinical prediction models: Marginal differences between logistic regression, penalized maximum likelihood estimation, and genetic programming

    2012, Journal of Clinical Epidemiology

    Citation excerpt: However, in medical data, it has been shown that these types of prediction models often do not achieve higher predictive accuracy [13–18]. Genetic programming, however, is a more novel and promising search method that may improve the selection and transformation of predictors, and it may lead to models with good predictive accuracy in new patients [19–22]. The modeling process starts with a large number of candidate prediction models that are stepwise optimized by selecting the best models and adding random variations (see also the Methods section).
