References
Breiman L, Friedman JH, Stone CJ, Olshen RA (1993) Classification and regression trees. Chapman and Hall, New York
Cardoen S, Van Huffel X, Berkvens D, Quoilin S, Ducoffre G, Saegerman C, Speybroeck N, Imberechts H, Herman L, Ducatelle R, Dierick K (2009) Evidence-based semi-quantitative methodology for prioritization of food-borne zoonoses. Foodborne Pathog Dis 6:1083–1096
Havelaar AH, van Rosse F, Bucura C, Toetenel MA, Haagsma JA, Kurowicka D, Heesterbeek JH, Speybroeck N, Langelaar MF, van der Giessen JW, Cooke RM, Braks MA (2010) Prioritizing emerging zoonoses in the Netherlands. PLoS One 5:e13965
Hothorn T, Hornik K, Zeileis A (2006) Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat 15:651–674
Kim H, Loh W (2001) Classification trees with unbiased multiway splits. J Am Stat Assoc 96:589–604
Protopopoff N, Van Bortel W, Speybroeck N, D’Alessandro U, Coosemans M (2009) Ranking malaria risk factors to guide malaria control efforts in African Highlands. PLoS One 4:e8022
Rosicova K, Geckova AM, Rosic M, Speybroeck N, Groothoff JW, van Dijk JP (2011) Socioeconomic factors, ethnicity and alcohol-related mortality in regions in Slovakia. What might a tree analysis add to our understanding? Health Place 17:701–709
Saegerman C, Speybroeck N, Roels S, Vanopdenbosch E, Thiry E, Berkvens D (2004) Decision support tools in clinical diagnosis in cows with suspected bovine spongiform encephalopathy. J Clin Microbiol 42:172–178
Speybroeck N, Berkvens D, Mfoukou-Ntsakala A, Aerts M, Hens N, Van Huylenbroeck G, Thys E (2004) Classification trees versus multinomial models in the analysis of urban farming systems in Central Africa. Agric Syst 80:133–149
Thang ND, Erhart A, Speybroeck N, Hung LX, Thuan LK, Hung TK, Van Ky P, Coosemans M, D’Alessandro U (2008) Malaria in Central Vietnam: analysis of risk factors by multivariate analysis and classification tree models. Malar J 7:28
White A, Liu W (1994) Bias in information based measures in decision tree induction. Mach Learn 15:321–329
Yewhalaw D, Legesse W, Van Bortel W, Gebre-Selassie S, Kloos H, Duchateau L, Speybroeck N (2009) Malaria and water resource development: the case of Gilgel-Gibe hydroelectric dam in Ethiopia. Malar J 8:21
Acknowledgments
I would like to express my thanks to the Reviewer for the constructive and interesting comments.
Appendix: R code to run the decomposition (Comments in different font)
The R software is free of charge and can be downloaded from http://www.r-project.org. The R package rpart can handle several types of outcomes and generates classification and regression trees. As an example, we show how a CT analyzing the relation between malaria (infected/non-infected) and its determinants in Vietnam (Thang et al. 2008) can be generated with the rpart package. After installing the package rpart into R, the following code (in different font) can be copied into R and used immediately once the variables have been adapted to the user's needs.
library(rpart)
# To grow a tree, use the command below. The variables must exist in the R workspace
# or in a data frame passed through the data argument of rpart.
fit <- rpart(Malaria ~ Age + Forrest + Education + Income + Bednet + Housetype + Ethnicity + Gender, method = "class")
# with Age, Forrest (forest activity), Education, Income, Bednet (bednet use), Housetype (house structure),
# Ethnicity and Gender the explanatory variables [these variables were used in Thang et al. (2008)]
# method can be e.g. "class" for classification trees, "anova" for regression trees, "poisson" for count data.
# detailed summary of splits
summary(fit)
# prune the tree and select the minimal error tree (i.e., with the smallest cross-validated error)
pfit <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])
# detailed summary of the pruned tree
summary(pfit)
# plot the final tree and label its nodes
plot(pfit)
text(pfit)
A simplified version of the resulting tree in Thang et al. (2008) is shown in Fig. 2. The tree starts with a root node containing all 3023 individuals in the sample, with a malaria prevalence (pr) of 14%. The root node is first split into two subgroups according to wealth status, with the malaria prevalence in the poorer subgroup being higher (pr = 16%) than in the richer subgroup (pr = 9%). The richer subgroup is split again into a subgroup engaged in regular forest activity (pr = 31%) and a subgroup not engaged in regular forest activity (pr = 8%). The latter subgroup is split according to bednet use, with bednet users showing a lower prevalence (pr = 7%) than non-bednet users (pr = 26%).
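The prevalence in a parent node is the size-weighted average of the prevalences in its two child nodes, so the percentages quoted above loosely indicate how the sample divides at a split. The sketch below illustrates this for the first split. The helper implied_share is hypothetical (it is not part of rpart), and its inputs are the rounded percentages from the text, so the resulting share is only a rough indication and not a figure reported by Thang et al. (2008).

```r
# Hypothetical helper: share of a parent node that falls into child A,
# solved from  parent = share * child_a + (1 - share) * child_b
implied_share <- function(parent, child_a, child_b) {
  (parent - child_b) / (child_a - child_b)
}

# First split: root (pr = 14%) into poorer (pr = 16%) and richer (pr = 9%).
# Inputs are rounded percentages, so the share is approximate.
share_poorer <- implied_share(parent = 0.14, child_a = 0.16, child_b = 0.09)
print(round(share_poorer, 2))

# Check: recombining the two child prevalences recovers the root prevalence
recombined <- share_poorer * 0.16 + (1 - share_poorer) * 0.09
print(all.equal(recombined, 0.14))
```

Because each prevalence is rounded to a whole percent, small changes in the inputs shift the implied share noticeably. The exact subgroup sizes should be taken from the reference.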
The example, simplified for the sake of brevity (see the reference for more information), indicates that CaRTs can be powerful tools for the analysis of complex public health data.
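For readers without access to the original data, the rpart workflow of this appendix can be tried end to end on simulated data. The following is a minimal sketch under stated assumptions: the data frame sim, its four predictors and the risk model are invented for illustration and do not reproduce the data or results of Thang et al. (2008). Only the variable names mirror the formula used above.

```r
library(rpart)

set.seed(1)
n <- 2000

# Simulated predictors (illustrative only, not the Vietnam survey data)
sim <- data.frame(
  Age     = sample(1:80, n, replace = TRUE),
  Forrest = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  Income  = factor(sample(c("poorer", "richer"), n, replace = TRUE)),
  Bednet  = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.3, 0.7)))
)

# Invented risk model: forest activity, low income and no bednet raise the risk
risk <- 0.05 +
  0.60 * (sim$Forrest == "yes") +
  0.10 * (sim$Income == "poorer") +
  0.15 * (sim$Bednet == "no")
sim$Malaria <- factor(ifelse(runif(n) < risk, "infected", "non-infected"))

# Grow a classification tree; data = sim tells rpart where to find the variables
fit <- rpart(Malaria ~ Age + Forrest + Income + Bednet, data = sim, method = "class")

# Table of complexity parameters with cross-validated errors
printcp(fit)

# Prune back to the tree with the smallest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pfit <- prune(fit, cp = best_cp)

# Text representation of the pruned tree
print(pfit)
```

Calling plot(pfit) followed by text(pfit) draws the pruned tree. Passing the data frame through the data argument avoids relying on variables stored loose in the workspace.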
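Conditional inference trees can be created via the function ctree (see Hothorn et al. 2006 for additional background).
# The party package provides conditional inference trees.
library(party)
# To grow a conditional inference tree, use the command below. As for rpart, the variables must exist
# in the R workspace or in a data frame passed through the data argument of ctree.
cfit <- ctree(Malaria ~ Age + Forrest + Education + Income + Bednet + Housetype + Ethnicity + Gender)
# plot the conditional inference tree
plot(cfit)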
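The ctree call can be tried in the same way on simulated data. This is a minimal sketch under stated assumptions: the data frame sim and its risk model are invented for illustration and are unrelated to the results of Thang et al. (2008), and the code runs the tree only if the party package is installed.

```r
set.seed(1)
n <- 2000

# Simulated data (illustrative only); ctree expects factors, not character vectors
sim <- data.frame(
  Age     = sample(1:80, n, replace = TRUE),
  Forrest = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  Income  = factor(sample(c("poorer", "richer"), n, replace = TRUE)),
  Bednet  = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.3, 0.7)))
)
risk <- 0.05 +
  0.60 * (sim$Forrest == "yes") +
  0.10 * (sim$Income == "poorer") +
  0.15 * (sim$Bednet == "no")
sim$Malaria <- factor(ifelse(runif(n) < risk, "infected", "non-infected"))

pred <- NULL
if (requireNamespace("party", quietly = TRUE)) {
  library(party)

  # Splits are chosen by permutation-based significance tests,
  # so no separate pruning step is applied here
  cfit <- ctree(Malaria ~ Age + Forrest + Income + Bednet, data = sim)
  print(cfit)

  # Fitted classes on the training data, cross-tabulated against the observed classes
  pred <- predict(cfit)
  print(table(observed = sim$Malaria, predicted = pred))
}
```

In contrast to the rpart example, the tree size is controlled by the significance threshold of the tests (argument mincriterion of ctree_control) and not by cost-complexity pruning.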
Cite this article
Speybroeck, N. Classification and regression trees. Int J Public Health 57, 243–246 (2012). https://doi.org/10.1007/s00038-011-0315-z