
Interpretable clustering using unsupervised binary trees


Abstract

We herein introduce a new method of interpretable clustering that uses unsupervised binary trees. It is a three-stage procedure, the first stage of which entails a series of recursive binary splits to reduce the heterogeneity of the data within the new subsamples. During the second stage (pruning), consideration is given to whether adjacent nodes can be aggregated. Finally, during the third stage (joining), similar clusters are joined together, even if they do not share the same parent originally. Consistency results are obtained, and the procedure is used on simulated and real data sets.
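
As a reading aid, the following is a minimal Python sketch of the forward (splitting) stage described above. It is illustrative only, not the authors' implementation; the within-node sum-of-squares heterogeneity measure and the simple size-based stopping rule are assumptions consistent with the Appendix (where the quantity R(t) and the parameter minsize appear).

```python
# A minimal sketch of the forward splitting stage: each node is split along a
# single coordinate so as to maximise the reduction of within-node
# heterogeneity, and a node is not split further once it is too small.
# Illustrative only -- the heterogeneity measure and stopping rule are assumptions.
import numpy as np

def heterogeneity(X):
    """Within-node sum of squared deviations from the node mean."""
    return ((X - X.mean(axis=0)) ** 2).sum()

def best_split(X):
    """Coordinate, threshold and gain of the split maximising the heterogeneity reduction."""
    parent, best = heterogeneity(X), (None, None, 0.0)
    for j in range(X.shape[1]):
        for a in np.unique(X[:, j])[:-1]:
            left, right = X[X[:, j] <= a], X[X[:, j] > a]
            gain = parent - heterogeneity(left) - heterogeneity(right)
            if gain > best[2]:
                best = (j, a, gain)
    return best

def grow(X, minsize=10):
    """Recursively split the sample, returning a nested dict describing the binary tree."""
    j, a, _ = best_split(X)
    if j is None or len(X) < 2 * minsize:
        return {"leaf": True, "n": len(X)}
    return {"leaf": False, "var": j, "cut": a,
            "left": grow(X[X[:, j] <= a], minsize),
            "right": grow(X[X[:, j] > a], minsize)}
```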



Author information

Correspondence to Marcela Svarc.

Appendix

1.1 Proof of Lemma 3.1

We first observe that, because \(t_l\) and \(t_r\) are disjoint, \(\mathbb{E}(X_{t_l \cup t_r}) = \gamma \mu_l + (1 - \gamma ) \mu_r\), where \(\gamma = P(\mathbf{X} \in t_l \vert \mathbf{X} \in t_l \cup t_r)\). For \(j = 1, \ldots , p\), write \( M^{(j)}_{2i} = \int _{t_i} x(j)^2\, dF(x)\), \(i=l,r\), where \(F\) stands for the distribution function of the vector \(\mathbf{X}\).

It then follows that \(\mathbb{E}(X_{t_l \cup t_r}(j)^2) = \gamma M^{(j)}_{2l}+ (1 - \gamma )M^{(j)}_{2r}\), and therefore that

$$\begin{aligned} var( X_{t_l \cup t_r}(j))&= \gamma \, var(X_{t_l}(j)) + (1-\gamma )\, var(X_{t_r}(j)) \nonumber \\&+\, \gamma (1-\gamma )( \mu _l(j) - \mu _r(j))^2. \end{aligned}$$
(7)

Summing over \(j\) yields the desired result.
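
The decomposition in Eq. (7) can be verified numerically with empirical moments; the sketch below uses two subsamples standing in for \(t_l\) and \(t_r\) (illustrative only; the sample sizes and distributions are arbitrary assumptions).

```python
# Numerical check of the variance decomposition in Eq. (7): two subsamples
# play the roles of t_l and t_r, and gamma is the empirical analogue of
# P(X in t_l | X in t_l U t_r).
import numpy as np

rng = np.random.default_rng(0)
xl = rng.normal(0.0, 1.0, size=5_000)    # observations falling in t_l
xr = rng.normal(3.0, 2.0, size=15_000)   # observations falling in t_r
gamma = len(xl) / (len(xl) + len(xr))

lhs = np.var(np.concatenate([xl, xr]))
rhs = (gamma * np.var(xl) + (1 - gamma) * np.var(xr)
       + gamma * (1 - gamma) * (xl.mean() - xr.mean()) ** 2)
print(lhs, rhs)   # the two values agree up to floating-point error
```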

1.2 Proof of Theorem 3.1

Let \(\mathbb{T}\) be the family of polygons in \(\mathbb{R}^p\) with faces orthogonal to the axes, and fix \(i \in \{1, \ldots , p\}\) and \(t \in \mathbb{T}\). For \(a \in \mathbb{R}\), let \(t_l = \{ x \in t: x(i) \le a\}\) and \(t_r = t \setminus t_l\). We define \(r(t,i,a) = R(t) - R(t_l) - R(t_r)\) and its empirical version \(r_n(t,i,a) = R_n(t) - R_n(t_l) - R_n(t_r)\).

We start by showing the uniform convergence

$$\begin{aligned} \sup _{a \in \mathbb{R }} \sup _{t \in \mathbb{T }}\vert r_n(t,i,a)-r(t,i,a)\vert \rightarrow 0 \ \ a.s. \end{aligned}$$
(8)

By Lemma 3.1,

$$\begin{aligned} \alpha _t r(t,i,a) = \alpha _{t_l}\alpha _{t_r} \Vert \mu _l(a) - \mu _r(a)\Vert ^2, \end{aligned}$$
(9)

where \(\alpha _A = P(\mathbf{X} \in A)\) and \( \mu _j(a) = \mathbb{E }(X_{t_j}),{\quad }j=l,r\). Then, the pairs \((i_{jn}, a_{jn})\) and \((i_{j}, a_{j})\) are the arguments that maximise the right-hand side of equation (9) with respect to the measures \(P_n\) and \(P\) respectively. We observe that the right-hand side of equation (9) equals

$$\begin{aligned} \alpha _{t_r} \int _{t_l} \Vert x \Vert ^2dP(x) + \alpha _{t_l} \int _{t_r} \Vert x \Vert ^2dP(x) - 2 \left\langle \int _{t_l} xdP(x) , \int _{t_r} xdP(x)\right\rangle . \end{aligned}$$
(10)
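
In empirical terms, the quantity maximised by the splitting step is, by Eq. (9) and up to the factor \(\alpha_t\), of the form \(\alpha_{t_l}\alpha_{t_r}\Vert \mu_l(a) - \mu_r(a)\Vert^2\). Below is a minimal sketch of the corresponding empirical criterion for a fixed coordinate \(i\) (illustrative only; the function name and the simulated sample are assumptions).

```python
# Sketch of the empirical counterpart of Eq. (9): for a fixed coordinate i,
# scan candidate thresholds a and evaluate
#   alpha_l * alpha_r * ||mean_l(a) - mean_r(a)||^2,
# which, up to the factor alpha_t, is the quantity r_n(t, i, a).
import numpy as np

def empirical_split_criterion(X, i):
    """Threshold a maximising alpha_l * alpha_r * ||mean_l(a) - mean_r(a)||^2 along coordinate i."""
    n = len(X)
    best_a, best_val = None, -np.inf
    for a in np.unique(X[:, i])[:-1]:
        left, right = X[X[:, i] <= a], X[X[:, i] > a]
        alpha_l, alpha_r = len(left) / n, len(right) / n
        val = alpha_l * alpha_r * np.sum((left.mean(axis=0) - right.mean(axis=0)) ** 2)
        if val > best_val:
            best_a, best_val = a, val
    return best_a, best_val

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal([4.0, 0.0], 1.0, (200, 2))])
print(empirical_split_criterion(X, 0))  # the chosen threshold falls between the two group means
```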

In order to prove Eq. (8) it is sufficient to show that:

  1. \(\sup _{a \in \mathbb{R}} \sup _{t \in \mathbb{T}}\vert P_n(t_j) - P(t_j)\vert \rightarrow 0 \ \ a.s.,{\quad }j=l,r;\)

  2. \(\sup _{a \in \mathbb{R}} \sup _{t \in \mathbb{T}}\vert \int _{t_j} \Vert x \Vert ^2\, dP_n(x) - \int _{t_j} \Vert x \Vert ^2\, dP(x) \vert \rightarrow 0 \ \ a.s.,{\quad }j=l,r;\)

  3. \(\sup _{a \in \mathbb{R}} \sup _{t \in \mathbb{T}}\vert \int _{t_j} x(i)\, dP_n(x) - \int _{t_j} x(i)\, dP(x) \vert \rightarrow 0 \ \ a.s.,{\quad }j=l,r,\ i=1, \ldots , p.\)

Since \(\mathbb{T}\) is a Vapnik–Chervonenkis class, (1) holds. Observe now that the conditions for uniform convergence over families of sets still hold when dealing with finite signed measures. Therefore, considering the finite measure \(\Vert x \Vert ^2\, dP(x)\) and the finite signed measure \(x(i)\, dP(x)\), (2) and (3) also hold.
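
The kind of uniform convergence invoked in (1)–(3) can be illustrated numerically: over the class of half-lines \(\{x \le a\}\), the supremum distance between the empirical and the true measure shrinks as \(n\) grows. A small Monte Carlo sketch, under an assumed standard normal distribution:

```python
# Monte Carlo illustration of the Glivenko-Cantelli type convergence behind
# conditions (1)-(3): sup over half-lines {x <= a} of |P_n - P|, computed for
# a standard normal sample of increasing size.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
for n in (100, 1_000, 10_000, 100_000):
    x = np.sort(rng.normal(size=n))
    cdf = norm.cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)
    d_minus = np.max(cdf - np.arange(0, n) / n)
    print(n, max(d_plus, d_minus))   # decreases roughly like 1/sqrt(n)
```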

Since

$$\begin{aligned} \lim _{a \rightarrow \infty } \alpha _{t_l}\alpha _{t_r} \Vert \mu _l(a) - \mu _r(a)\Vert ^2 = \lim _{a \rightarrow -\infty } \alpha _{t_l}\alpha _{t_r} \Vert \mu _l(a) - \mu _r(a)\Vert ^2= 0, \end{aligned}$$
(11)

we have that

$$\begin{aligned} \inf {\{argmax_{a \in \mathbb{R}} r_n(t,i,a)\}} \rightarrow \inf {\{argmax_{a \in \mathbb{R}} r(t,i,a)\}} \quad a.s. \end{aligned}$$

In the first step of the algorithm, \(t=\mathbb{R }^p\) and we obtain \( i_{n1} = i_1\) for \(n\) large enough and \(a_{n1} \rightarrow a_1\) a.s. In the next step, we observe that the empirical procedure begins to work with \(t_{nl}\) and \(t_{nr}\), while the population algorithm will do so with \(t_{l}\) and \(t_{r}\). However, we have that

$$\begin{aligned}&\sup _{a \in \mathbb{R}} \vert r_n(t_{nj}, i, a) - r(t_j, i, a) \vert \nonumber \\&\quad \le \sup _{a \in \mathbb{R}} \sup _{t \in \mathbb{T}}\vert r_n(t_{nj}, i, a) - r(t_{nj}, i, a) \vert + \sup _{a \in \mathbb{R}} \vert r(t_{nj}, i, a) - r(t_j, i, a) \vert , \end{aligned}$$
(12)

for \(j=l,r\).

We already know that the first term on the right-hand side of equation (12) converges to zero almost surely. In order to show that the second term also converges to zero, it is sufficient to show that:

  1. \( \sup _{a \in \mathbb{R}} \vert P(t_{nj})- P(t_j)\vert \rightarrow 0{\quad }a.s.{\quad }j=l,r;\)

  2. \( \sup _{a \in \mathbb{R}}\vert \int _{t_j} \Vert x \Vert ^2\, dP(x) - \int _{t_{nj}} \Vert x \Vert ^2\, dP(x) \vert \rightarrow 0{\quad }a.s.{\quad }j=l,r;\)

  3. \( \sup _{a \in \mathbb{R}} \vert \int _{t_j} x(i)\, dP(x) - \int _{t_{nj}} x(i)\, dP(x) \vert \rightarrow 0{\quad }a.s.{\quad }j=l,r, \ i=1, \ldots , p,\)

which follows from the assumption that \(\Vert x \Vert ^2 f(x)\) is bounded. This concludes the proof since \(minsize/n \rightarrow \tau \).

1.3 Proof of Theorem 3.2

We need to show that we have consistency in both steps of the backward algorithm.

(i) Convergence of the pruning step. Let \(\{t^*_{1n}, \ldots , t^*_{mn}\}\) be the output of the forward algorithm. The partition produced by the pruning step converges to the corresponding population version as a consequence of:

  • the conclusions of Theorem 3.1.

  • the fact that the random variables \(W_{lr}, \tilde{d}_l, \tilde{d}_r\) are positive.

  • the uniform convergence of the empirical quantile function to its population version.

  • the Lebesgue dominated convergence theorem.

(ii) Convergence of the joining step. The proof is essentially the same as that of (i).

Cite this article

Fraiman, R., Ghattas, B. & Svarc, M. Interpretable clustering using unsupervised binary trees. Adv Data Anal Classif 7, 125–145 (2013). https://doi.org/10.1007/s11634-013-0129-3

