Clustering nominal data using unsupervised binary decision trees: Comparisons with the state of the art methods
Introduction
Clustering, or unsupervised learning, aims to classify objects into different groups, i.e., to partition the data into clusters such that the observations in each cluster are as homogeneous as possible with respect to some dissimilarity measure. Most clustering methods construct a partition of a set of n observations into k clusters, where k may be specified a priori or determined by the method. There are two main types of clustering methods: hierarchical and non-hierarchical. Hierarchical algorithms [31] construct a set of nested clusters that may be represented by a tree. The root of the tree corresponds to the entire set of observations. The tree is often binary, with each internal node corresponding to the union of two clusters. A recent algorithm of this type is DIVCLUS-T [6]. Like classical hierarchical cluster analysis (HCA) with Ward's criterion, it is based on minimizing the inertia criterion. The main difference from the latter is that it yields a natural interpretation of the clusters obtained, using cut-off values on the variables that explain the clusters.
Non-hierarchical methods define the k clusters simultaneously. There are two main types: partitional approaches, the most popular of which is k-means [26], and density-based methods. k-means searches for a k-cluster partition that minimizes the within-cluster variance. Density-based methods construct high-density regions of points separated by low-density regions; they are particularly useful in the presence of noise and outliers. A well-known algorithm of this type is DBSCAN [11]. Other methods based on different principles exist (see, for example, [36]), such as the probabilistic model-based approach, in which the clusters are modeled by a mixture of parametric distributions, allowing membership probabilities (see, for example, [8]), as in clustering using latent class analysis (LCA) [37]. Similarly, there are fuzzy clustering methods, in which each observation may belong to more than one cluster (see, for example, the fuzzy c-means algorithm [10]).
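To make the partitional criterion concrete, the within-cluster variance that k-means minimizes is the sum of squared Euclidean distances from each observation to its cluster centroid. A minimal NumPy-based sketch (illustrative only, not tied to any particular implementation discussed here):

```python
import numpy as np

def within_cluster_variance(X, labels):
    """Sum of squared Euclidean distances from each observation to its
    cluster centroid: the criterion that k-means tries to minimize."""
    total = 0.0
    for k in np.unique(labels):
        cluster = X[labels == k]
        total += ((cluster - cluster.mean(axis=0)) ** 2).sum()
    return total

# Two well-separated groups of points in the plane.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
good = np.array([0, 0, 1, 1])   # matches the true grouping
bad = np.array([0, 1, 0, 1])    # mixes the two groups
```

The partition that respects the two groups has a far smaller criterion value than the mixed one, which is why minimizing this quantity recovers compact clusters.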
In this work, we focus on hierarchical clustering methods based on decision trees. To our knowledge, the first proposal for using decision trees for clustering came with the algorithm C0.5 [9], which uses top-down induction of logical decision trees [3]. Trees of this type are called "predictive clustering trees". They are based on Quinlan's classification algorithm C4.5 [33] and have been widely used in the literature, particularly in machine learning. Another tree-based algorithm, CLTree [25], is able to ignore outliers by introducing "non-existing" points into the data space, but it is limited to continuous data. A more recent algorithm designed exclusively for nominal data is CG-CLUS [39], which has good efficiency and performance properties compared to the well-known COBWEB algorithm [12]. All of these methods are related to "conceptual clustering" (a common machine learning task), which aims to cluster objects represented by attribute-value pairs (i.e., nominal descriptions) and does not take into account any ordering of the variable levels. The main advantage of this approach is the direct interpretability of the output tree. In this framework, the quality of a partition is often assessed with an information-theoretic measure, the category utility [7], which is formally equivalent to the mutual information [15].
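Since the category utility is formally equivalent to the mutual information between the partition and the nominal attributes [15], the empirical mutual information for a single attribute can be sketched as follows (an illustrative stdlib-only computation, not the authors' code):

```python
import math
from collections import Counter

def mutual_information(labels, attribute):
    """Empirical mutual information I(C; X), in nats, between cluster
    labels C and one nominal attribute X observed on the same objects."""
    n = len(labels)
    nc = Counter(labels)            # cluster counts
    nx = Counter(attribute)         # attribute-level counts
    joint = Counter(zip(labels, attribute))
    mi = 0.0
    for (c, x), ncx in joint.items():
        # p(c, x) * log( p(c, x) / (p(c) * p(x)) )
        mi += (ncx / n) * math.log(ncx * n / (nc[c] * nx[x]))
    return mi
```

A partition that exactly follows a balanced binary attribute yields I = log 2, while a partition independent of the attribute yields 0, so larger values indicate clusters that better "explain" the attribute.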
CUBT [13] is a top-down hierarchical clustering method inspired by CART [5], and it consists of three stages. The first step grows a "maximal tree" by recursively splitting the dataset into subsets of observations so as to minimize a heterogeneity criterion, the deviance, within the final clusters. The deviance is the trace of the covariance matrix of each subset of observations. The second step prunes the tree: for each pair of sibling terminal nodes (i.e., leaves with the same parent node), a dissimilarity between the two nodes is computed, and if it is lower than a fixed threshold mindist, the two nodes are aggregated into a single node (i.e., their parent node). The dissimilarity measure used by CUBT is based on the Euclidean distance. The final step, joining, is also a node-aggregation step, but one in which node adjacency is not required for aggregation. For the joining step, either the deviance or the dissimilarity measure may be used.
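The two criteria above can be sketched for the continuous case. The deviance of a node (the trace of its within-node covariance matrix) is, up to normalization, the sum of squared deviations of its observations from the node mean; the centroid distance below is a simplified, hypothetical stand-in for the paper's Euclidean-distance-based pruning measure:

```python
import numpy as np

def deviance(X):
    """Node heterogeneity: sum of squared deviations from the node mean;
    equal to (n - 1) times the trace of the sample covariance matrix."""
    return float(((X - X.mean(axis=0)) ** 2).sum())

def split_reduction(X, mask):
    """Decrease in total deviance obtained by splitting a node in two."""
    return deviance(X) - deviance(X[mask]) - deviance(X[~mask])

def sibling_dissimilarity(left, right):
    """Illustrative stand-in for the pruning measure: Euclidean distance
    between the centroids of two sibling leaves (hypothetical variant)."""
    return float(np.linalg.norm(left.mean(axis=0) - right.mean(axis=0)))
```

During growing, the split with the largest deviance reduction is retained; during pruning, two sibling leaves whose dissimilarity falls below mindist would be merged back into their parent.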
CUBT shares various advantages with CART. It is flexible and efficient, i.e., it produces good partitions for a large family of data structures; it is interpretable (because of the binary labeled splits); and it has good convergence properties. In their original work [13], the authors compared CUBT with several clustering methods and showed that it produces quite satisfactory results. Several applications of CUBT have been undertaken, mainly in the social sciences and medicine; see, for example, [28], [29]. This approach has allowed clinicians to improve interpretability and decision making with regard to the cut-off scores that define the membership of individuals in the different clusters.
One limitation of CUBT is that the criteria used to grow and prune a tree (the deviance and the dissimilarity measure) are specific to continuous data. Herein, we present an extension of CUBT to nominal data: for each step of CUBT, namely growing and pruning, we propose a new criterion based either on mutual information or on entropy, and we provide heuristics for selecting the parameters used at each stage. Section 2 describes the nominal version of CUBT. Section 3 presents some simulation models and comparisons with other clustering methods for nominal data. Section 4 provides two real-data examples illustrating the use of CUBT. The final section concludes and gives some ideas for future work.
Section snippets
CUBT for nominal data
We describe the three steps of CUBT using the new criteria for nominal data. The first step grows the maximal tree, while the second and third steps (pruning and joining) simplify it.
Let X = (X.1, ..., X.p) be a random p-dimensional nominal vector with coordinates X.j; let mj be the number of levels of the jth variable and Mj its set of levels (or categories). We have a set S of n random vectors, denoted Xi, i = 1, ..., n. Finally, Xij is the value taken by the ith observation on the jth variable.
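With this notation, the growing step for nominal data can be illustrated by scoring candidate binary splits of the form "X.j takes a level in L", where L is a subset of the levels of variable j. The sketch below is a brute-force search using one plausible entropy-based heterogeneity (the size-weighted sum of per-variable Shannon entropies); it is illustrative only, not the paper's exact algorithm:

```python
import math
from collections import Counter
from itertools import combinations

def entropy(values):
    """Shannon entropy (nats) of one nominal variable."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def heterogeneity(rows):
    """Sum of the entropies of the p nominal coordinates of a node."""
    return sum(entropy(col) for col in zip(*rows))

def best_binary_split(rows):
    """Exhaustive search for the split 'variable j takes a level in L'
    that most reduces the size-weighted within-node heterogeneity."""
    n, parent = len(rows), heterogeneity(rows)
    best = None
    for j in range(len(rows[0])):
        levels = sorted({r[j] for r in rows})
        for size in range(1, len(levels)):
            for L in combinations(levels, size):
                left = [r for r in rows if r[j] in L]
                right = [r for r in rows if r[j] not in L]
                gain = parent - (len(left) / n) * heterogeneity(left) \
                              - (len(right) / n) * heterogeneity(right)
                if best is None or gain > best[0]:
                    best = (gain, j, set(L))
    return best
```

On a toy sample where variable 0 separates two homogeneous groups while variable 1 is noisy, the search recovers the split on variable 0; real implementations would of course avoid the exponential enumeration of level subsets.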
Experiments
In this section, we present some simulations using different models. We compare the results of CUBT with those of six other methods suitable for nominal data: hierarchical clustering (HCA) [31], k-modes [19], DBSCAN [11], COBWEB [12], DIVCLUS-T [6] and LCA [37]. The misclassification error [13] and the adjusted Rand index [20] are used for these comparisons. We have updated the existing CUBT package [14] in R to implement the criteria described in the previous section.
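The adjusted Rand index compares a recovered partition with the true grouping while correcting for chance agreement: it equals 1 for identical partitions (up to relabeling) and is near 0 for random ones. A stdlib-only sketch (illustrative; it ignores the degenerate case where both partitions have a single cluster):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings of the same n objects."""
    n = len(a)

    def pair_count(labels):
        # number of pairs of objects placed together by this labeling
        return sum(comb(v, 2) for v in Counter(labels).values())

    together = sum(comb(v, 2) for v in Counter(zip(a, b)).values())
    pa, pb = pair_count(a), pair_count(b)
    expected = pa * pb / comb(n, 2)   # chance-level agreement
    max_index = (pa + pb) / 2
    return (together - expected) / (max_index - expected)
```

Cluster names do not matter: relabeling a partition leaves the index unchanged, which is why it is preferred to raw label agreement when comparing clustering methods.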
Real data applications
We present two applications on real data sets from a supervised learning framework [23], where a grouping variable is also available. The first concerns soybean diseases and the second the Tic-Tac-Toe endgame dataset.
Conclusion
We have presented an extension of the CUBT algorithm to nominal data, using suitable criteria, based on the Shannon entropy and the mutual information, to handle this type of data. We have compared this approach to other classical methods using several data-simulation models. Among the compared methods, CUBT exhibits excellent performance and outperforms most of them, mainly on test samples. In all cases, CUBT still has at least three practical advantages over
Acknowledgments
The authors are very grateful to Pascal Auquier, director of the public health unit at Aix-Marseille University, for his financial support, and to Claude Deniau for his valuable and helpful comments on this work. This work is partially supported by the program ECOS-Sud U14E02.
Badih Ghattas is a senior statistician with forty-five publications in the field of statistics. He has written several high-level theoretical papers on classification, clustering and density estimation, and several applied papers in medicine (public health, neurology, oncology, MRI), ecology and bioinformatics.
References (39)
Incremental Constructive Induction: An Instance-based Approach. Proceedings of the Eighth International Workshop on Machine Learning (1991).
Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (1977).
Categorical Data Analysis (2002).
Top-down induction of first-order logical decision trees. Artif. Intell. (1998).
Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika (1972).
Classification and Regression Trees (1984).
DIVCLUS-T: a monothetic divisive hierarchical clustering method. Comput. Stat. Data Anal. (2007).
Explaining basic categories: feature predictability and information. Psychol. Bull. (1992).
Using Logical Decision Trees for Clustering. Proceedings of the 7th International Workshop on Inductive Logic Programming (1997).
A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J. Cybern. (1973).
A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
Knowledge acquisition via incremental conceptual clustering. Mach. Learn.
Interpretable clustering using unsupervised binary trees. Adv. Data Anal. Classif.
Information, uncertainty and the utility of categories. Proceedings of the Seventh Annual Conference of the Cognitive Science Society.
Entropy and Information Theory.
The WEKA data mining software: an update. SIGKDD Explorations.
Fpc: Flexible Procedures for Clustering. R package version 2.1-9.
Extensions to the k-modes algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov.
Pierre Michel is a PhD student supervised by a statistician and an epidemiologist. He works mainly in the domain of public health and contributes to the development of new approaches to clustering and variable selection in the context of computerized adaptive tests for quality-of-life questionnaires.
Laurent Boyer is a doctor in psychiatry who works mainly on schizophrenia and multiple sclerosis. He has contributed to the development of advanced tools for measuring the quality of life of patients suffering from such chronic diseases. He is part of the MusiQoL team within the public health unit at the Marseille faculty of medicine.