Pattern Recognition

Volume 67, July 2017, Pages 177-185

Clustering nominal data using unsupervised binary decision trees: Comparisons with the state of the art methods

https://doi.org/10.1016/j.patcog.2017.01.031

Highlights

  • An extension of clustering using binary decision trees (CUBT) is presented for nominal data.

  • New heuristics are given for tuning the parameters of CUBT.

  • CUBT outperforms many of the existing approaches for nominal datasets.

  • The tree structure helps for the interpretation of the obtained clusters.

  • The method can be used for direct prediction.

  • The method may be used with parallel computing and is thus suitable for big data.

Abstract

In this work, we propose an extension of CUBT (clustering using unsupervised binary trees) to nominal data. For this purpose, we primarily use heterogeneity criteria and dissimilarity measures based on mutual information, entropy and the Hamming distance. We show that for this type of data, CUBT outperforms most of the existing methods. We also provide and justify some guidelines and heuristics to tune the parameters of CUBT. Extensive comparisons with other well-known approaches are carried out using simulations, and two applications to real datasets are given.

Introduction

Clustering, or unsupervised learning, aims to classify objects into different groups, or to partition the data into clusters such that the observations in each cluster are as homogeneous as possible with respect to some dissimilarity measure. Most clustering methods construct a partition of a set of n observations into k clusters, where k may be specified a priori or be determined by the method. There are two main types of clustering methods: hierarchical and non-hierarchical. Hierarchical algorithms [31] construct a set of clusters that may be represented by a tree. The root of the tree corresponds to the entire set of observations. The tree is often binary, with each node corresponding to the union of two clusters. A recent algorithm of this type is DIVCLUS-T [6]. It is based on the minimization of the inertia criterion, as in classical hierarchical cluster analysis (HCA) with Ward’s criterion. The main difference from the latter is that it produces a natural interpretation of the clusters obtained, by using cut-off values on the variables that explain the clusters.

Non-hierarchical methods attempt to define the k clusters simultaneously. There are two main types of non-hierarchical methods: partitional approaches, among which the most popular is k-means [26], and density-based methods. k-means searches for a k-cluster partition that minimizes the within-cluster variance. Density-based methods attempt to construct high-density regions of points separated by low-density regions; they are particularly useful in the presence of noise and outliers. A well-known algorithm of this type is DBSCAN [11]. Other methods exist (see, for example, [36]), based on different principles, such as the probabilistic model-based approach, in which the clusters are modeled by a mixture of parametric distributions, allowing membership probabilities (see, for example, [8]), such as clustering using latent class analysis (LCA, [37]). Similarly, there are fuzzy clustering methods, where each observation may belong to more than one cluster (see, for example, the fuzzy c-means algorithm [10]).

In this work, we focus on hierarchical clustering methods based on decision trees. To our knowledge, the first proposal for using decision trees for clustering came with the algorithm CO.5 [9], which uses top-down induction of logical decision trees [3]. These types of trees are called “predictive clustering trees”. They are based on Quinlan’s classification algorithm C4.5 [33] and have been widely used in the literature, particularly in machine learning. Another tree-based algorithm, CLTree [25], is able to ignore outliers by introducing “non-existing” points into the data space, but it is limited to continuous data. A more recent algorithm designed exclusively for nominal data is CG–CLUS [39]. This algorithm has good efficiency and performance compared with the well-known COBWEB algorithm [12]. All of these methods are related to “conceptual clustering” (a common machine learning task), which aims to cluster objects represented by attribute–value pairs (i.e., nominal descriptions) and does not take into account any ordering of the variable levels. The main advantage of this approach is the direct interpretation of the output tree. In this framework, a common information-theoretic measure is often used to assess the quality of a partition, namely the category utility [7], which is formally equivalent to the mutual information [15].
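To make this information-theoretic quality measure concrete, the following minimal Python sketch (an illustration written for this summary; the function name and toy data are not taken from the paper or its R package) computes the empirical mutual information between a cluster assignment and a single nominal attribute. The category utility aggregates quantities of this kind over all attributes.

```python
import numpy as np
from collections import Counter

def mutual_information(clusters, attribute):
    """Empirical mutual information I(C; X), in bits, between cluster labels
    and one nominal attribute, both given as equal-length sequences."""
    n = len(clusters)
    joint = Counter(zip(clusters, attribute))
    n_c = Counter(clusters)
    n_x = Counter(attribute)
    mi = 0.0
    for (c, x), n_cx in joint.items():
        # p(c, x) * log2( p(c, x) / (p(c) * p(x)) )
        mi += (n_cx / n) * np.log2(n_cx * n / (n_c[c] * n_x[x]))
    return mi

# Toy example: a two-cluster partition that separates the two levels perfectly
clusters  = [0, 0, 0, 1, 1, 1]
attribute = ["a", "a", "a", "b", "b", "b"]
print(mutual_information(clusters, attribute))  # 1.0 bit
```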

CUBT [13] is a top-down hierarchical clustering method inspired by CART [5], and it consists of three stages. The first step grows a “maximal tree” by recursively splitting the dataset into subsets of observations so as to minimize a heterogeneity criterion, the deviance, within the resulting nodes. The deviance criterion is the trace of the covariance matrix within each subset of observations. The second step prunes the tree. For each pair of sibling terminal nodes (i.e., leaves with the same parent node), a measure of dissimilarity between the two nodes is computed. If this dissimilarity measure is lower than a fixed threshold mindist, then the two nodes are aggregated into a single node (i.e., their parent node). The dissimilarity measure used by CUBT is based on the Euclidean distance. The final step, joining, is also a node aggregation step, in which node adjacency is no longer required for cluster aggregation. For the joining step, either the deviance or the dissimilarity measure may be used.
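As an illustration of the growing stage in the original, continuous setting, here is a minimal deviance-based split search. It is a sketch, not the authors' implementation; the stopping parameters minsize and mindev are illustrative placeholders for the tuning parameters of CUBT.

```python
import numpy as np

def deviance(X):
    """Within-node deviance: number of observations times the trace of the
    empirical covariance matrix, i.e. the within-node sum of squares."""
    if len(X) < 2:
        return 0.0
    return len(X) * np.trace(np.cov(X, rowvar=False))

def best_split(X):
    """Exhaustive search over axis-aligned splits x[:, j] <= t for the one
    giving the largest deviance reduction.  Returns (j, t, reduction)."""
    parent = deviance(X)
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left, right = X[X[:, j] <= t], X[X[:, j] > t]
            gain = parent - deviance(left) - deviance(right)
            if gain > best[2]:
                best = (j, t, gain)
    return best

def grow(X, minsize=10, mindev=1e-3):
    """Recursively grow the maximal tree; leaves are returned as data blocks."""
    j, t, gain = best_split(X)
    if len(X) <= minsize or j is None or gain <= mindev:
        return [X]                                   # node becomes a leaf
    return grow(X[X[:, j] <= t], minsize, mindev) + \
           grow(X[X[:, j] > t], minsize, mindev)
```

In the same spirit, the pruning and joining stages would then aggregate leaves whose pairwise dissimilarity falls below the mindist threshold, as described above.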

CUBT shares several advantages with CART. It is flexible and efficient, i.e., it produces good partitions for a large family of data structures; it is interpretable (because of the binary labeled splits); and it has good convergence properties. In their original work [13], the authors compared CUBT with several clustering methods and showed that it produces quite satisfactory results. Several applications of CUBT have been undertaken, mainly in the social sciences and in medicine; see [28], [29], for example. This approach has allowed clinicians to improve interpretability and decision making with regard to the cut-off scores that define the membership of individuals in the different clusters.

One of the limitations of CUBT is that the criteria used to grow and prune a tree (deviance and dissimilarity measure) are specific to continuous data. Herein, we present an extension of CUBT to nominal data. For each step of CUBT, namely growing and pruning, we propose a new criterion based either on mutual information or on entropy. We also provide some heuristics for selecting the parameters used at each stage. Section 2 describes the nominal version of CUBT. Section 3 presents some simulation models and comparisons with other clustering methods for nominal data. Section 4 provides two real data examples to illustrate the use of CUBT. The final section gives the conclusion and some ideas for future work.

Section snippets

CUBT for nominal data

We describe the three steps of CUBT using the new criteria for nominal data. The first step grows the maximal tree, while the second and third steps (pruning and joining) prune the maximal tree.

Let $X \in E = \prod_{j=1}^{p} \mathcal{X}_j$ be a random $p$-dimensional nominal vector with coordinates $X^{j}$, $j \in \{1, \dots, p\}$, let $m_j \in \mathbb{N}$ be the number of levels of the $j$th variable, and let $\mathcal{X}_j$ be the set of levels (or categories) of the $j$th variable. We have a set $S$ of $n$ random vectors, denoted $X_i$ with $i \in \{1, \dots, n\}$. Finally, $X_i^j$ is the $i$th
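To fix ideas on how an entropy-based heterogeneity criterion can replace the deviance for nominal data, the sketch below sums the empirical Shannon entropy of each variable within a node. This is an illustration written for this summary under the assumption of a per-variable entropy criterion; it is not necessarily the exact formula proposed in Section 2.

```python
import numpy as np
from collections import Counter

def node_entropy(rows):
    """Sum over the p nominal variables of the empirical Shannon entropy
    of the levels observed in this node (rows: list of p-tuples of levels)."""
    n, p = len(rows), len(rows[0])
    total = 0.0
    for j in range(p):
        counts = Counter(row[j] for row in rows)
        probs = np.array([c / n for c in counts.values()])
        total -= np.sum(probs * np.log2(probs))
    return total

# A node that is pure on the second variable contributes entropy
# only through the first one.
node = [("red", "yes"), ("blue", "yes"), ("red", "yes"), ("blue", "yes")]
print(node_entropy(node))   # 1.0 bit
```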

Experiments

In this section, we present some simulations using different models. We compare the results of CUBT with those of six other methods suitable for nominal data, namely hierarchical clustering (HCA) [31], k-modes [19], DBSCAN [11], COBWEB [12], DIVCLUS-T [6] and LCA [37]. The misclassification error [13] and the adjusted Rand index [20] are used for these comparisons. We have updated the existing CUBT package [14] in R to account for the criteria described in the previous section.
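For reference, here is a minimal sketch of the two comparison scores mentioned above, using scikit-learn's adjusted Rand index and a Hungarian matching of cluster labels to classes for the misclassification error; the exact definition used in [13] may differ in detail.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, confusion_matrix

def misclassification_error(y_true, y_pred):
    """Error rate after optimally matching predicted cluster labels to the
    true classes (Hungarian algorithm on the confusion matrix)."""
    cm = confusion_matrix(y_true, y_pred)
    rows, cols = linear_sum_assignment(-cm)          # maximize matched counts
    return 1.0 - cm[rows, cols].sum() / len(y_true)

y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 0, 0, 2, 2, 1]                    # cluster ids are arbitrary
print(misclassification_error(y_true, y_pred))       # 0.125
print(adjusted_rand_score(y_true, y_pred))           # label-permutation invariant
```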

Real data applications

We present here two applications on real datasets coming from a supervised learning framework [23], where a grouping variable is also available. The first concerns the soybean disease dataset and the second the Tic-Tac-Toe endgame dataset.

Conclusion

We have presented an extension of the CUBT algorithm to nominal data, which uses suitable criteria to handle this type of data, based on the Shannon entropy and mutual information. We have compared this approach to other classical methods using several data simulation models. CUBT exhibits excellent performance and outperforms most of the methods it is compared with, mainly on test samples. In all cases, CUBT still has at least three practical advantages over

Acknowledgments

The authors are very grateful to Pascal Auquier, director of the public health unit at Aix Marseille University, for his financial support, and to Claude Deniau for his valuable and helpful comments concerning this work. This work is partially supported by the program ECOS-sud U14E02.


References (39)

  • D.W. Aha

    Incremental Constructive Induction: An Instance-based Approach.

    Proceedings of the Eighth International Workshop on Machine Learning

    (1991)
  • A.P. Dempster et al.

    Maximum likelihood from incomplete data via the EM algorithm

    J. R. Stat. Soc. Ser. B

    (1977)
  • A. Agresti

    Categorical Data Analysis

    (2002)
  • H. Blockeel et al.

    Top-down induction of first-order logical decision trees

    Artif. Intell.

    (1998)
  • R.D. Bock

    Estimating item parameters and latent ability when responses are scored in two or more nominal categories

    Psychometrika

    (1972)
  • L. Breiman et al.

    Classification and Regression Trees

    (1984)
  • M. Chavent et al.

    DIVCLUS-T: a monothetic divisive hierarchical clustering method

    Comput. Stat. Data Anal.

    (2007)
  • J.E. Corter et al.

    Explaining basic categories: feature predictability and information

    Psychol. Bull.

    (1992)
  • L. De Raedt et al.

    Using Logical Decision Trees for Clustering

    Proceedings of the 7th International Workshop on Inductive Logic Programming

    (1997)
  • J.C. Dunn

    A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters

    J. Cybern.

    (1973)
  • M. Ester et al.

    A density-based algorithm for discovering clusters in large spatial databases with noise

    Proceedings of the Second International Conference on Knowledge Discovery and Data Mining

    (1996)
  • D.H. Fisher

    Knowledge acquisition via incremental conceptual clustering

    Mach. Learn.

    (1987)
  • R. Fraiman et al.

    Interpretable clustering using unsupervised binary trees

    Adv. Data Anal. Classif.

    (2013)
  • B. Ghattas et al.

    R-package for Interpretable Clustering Using Binary Trees

    (2013)
  • M.A. Gluck et al.

    Information, uncertainty and the utility of categories

    Proceedings of the Seventh Annual Conference of the Cognitive Science Society

    (1985)
  • R. Gray

    Entropy and Information Theory

    (1990)
  • M. Hall et al.

    The WEKA data mining software: an update

    SIGKDD Explorations

    (2009)
  • C. Hennig

    Fpc: Flexible Procedures for Clustering

    R package version 2.1–9

    (2014)
  • Z. Huang

    Extensions to the k-modes algorithm for clustering large data sets with categorical values

    Data Min. Knowl. Discov.

    (1998)

Badih Ghattas is a senior statistician with forty-five publications in the field of statistics. He has written several high-level theoretical papers on classification, clustering and density estimation, and several papers with applications in medicine (public health, neurology, oncology, MRI), ecology and bioinformatics.

Pierre Michel is a PhD student supervised by a statistician and an epidemiologist. He works mainly in the domain of public health and contributes to the development of new approaches to clustering and variable selection in the context of computerized adaptive tests for quality of life questionnaires.

Laurent Boyer is a doctor in psychiatry working mainly on schizophrenia and multiple sclerosis. He has contributed to the development of advanced tools for measuring the quality of life of patients suffering from such chronic diseases. He is part of the MusiQoL team within the public health unit at the faculty of medicine in Marseille.
