
Interpretable clustering using unsupervised binary trees


Abstract

We herein introduce a new method of interpretable clustering that uses unsupervised binary trees. It is a three-stage procedure, the first stage of which entails a series of recursive binary splits to reduce the heterogeneity of the data within the new subsamples. During the second stage (pruning), consideration is given to whether adjacent nodes can be aggregated. Finally, during the third stage (joining), similar clusters are joined together, even if they do not share the same parent originally. Consistency results are obtained, and the procedure is used on simulated and real data sets.
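
As a reading aid, the following is a minimal Python sketch of the forward (splitting) stage described above. It is illustrative only, not the authors' implementation; the within-node sum-of-squares heterogeneity measure and the simple size-based stopping rule are assumptions consistent with the Appendix (where the quantity R(t) and the parameter minsize appear).

```python
# A minimal sketch of the forward splitting stage: each node is split along a
# single coordinate so as to maximise the reduction of within-node
# heterogeneity, and a node is not split further once it is too small.
# Illustrative only -- the heterogeneity measure and stopping rule are assumptions.
import numpy as np

def heterogeneity(X):
    """Within-node sum of squared deviations from the node mean."""
    return ((X - X.mean(axis=0)) ** 2).sum()

def best_split(X):
    """Coordinate, threshold and gain of the split maximising the heterogeneity reduction."""
    parent, best = heterogeneity(X), (None, None, 0.0)
    for j in range(X.shape[1]):
        for a in np.unique(X[:, j])[:-1]:
            left, right = X[X[:, j] <= a], X[X[:, j] > a]
            gain = parent - heterogeneity(left) - heterogeneity(right)
            if gain > best[2]:
                best = (j, a, gain)
    return best

def grow(X, minsize=10):
    """Recursively split the sample, returning a nested dict describing the binary tree."""
    j, a, _ = best_split(X)
    if j is None or len(X) < 2 * minsize:
        return {"leaf": True, "n": len(X)}
    return {"leaf": False, "var": j, "cut": a,
            "left": grow(X[X[:, j] <= a], minsize),
            "right": grow(X[X[:, j] > a], minsize)}
```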



Author information

Correspondence to Marcela Svarc.

Appendix

1.1 Proof of Lemma 3.1

We first observe that, because \(t_l\) and \(t_r\) are disjoint, \(\mathbb{E}(X_{t_l \cup t_r}) = \gamma \mu_l + (1 - \gamma ) \mu_r\), where \(\gamma = P(\mathbf{X} \in t_l \vert \mathbf{X} \in t_l \cup t_r)\). For \(j = 1, \ldots , p\), write \( M^{(j)}_{2i} = \int _{t_i} x(j)^2\, dF(x)\), \(i=l,r\), where \(F\) stands for the distribution function of the vector \(\mathbf{X}\).

It then follows that \(\mathbb{E}(X_{t_l \cup t_r}(j)^2) = \gamma M^{(j)}_{2l}+ (1 - \gamma )M^{(j)}_{2r}\), and therefore that

$$\begin{aligned} var( X_{t_l \cup t_r}(j))&= \gamma \, var(X_{t_l}(j)) + (1-\gamma )\, var(X_{t_r}(j)) \nonumber \\&+\, \gamma (1-\gamma )( \mu _l(j) - \mu _r(j))^2. \end{aligned}$$
(7)

Summing over \(j\) yields the desired result.
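
The decomposition in Eq. (7) can be verified numerically with empirical moments; the sketch below uses two subsamples standing in for \(t_l\) and \(t_r\) (illustrative only; the sample sizes and distributions are arbitrary assumptions).

```python
# Numerical check of the variance decomposition in Eq. (7): two subsamples
# play the roles of t_l and t_r, and gamma is the empirical analogue of
# P(X in t_l | X in t_l U t_r).
import numpy as np

rng = np.random.default_rng(0)
xl = rng.normal(0.0, 1.0, size=5_000)    # observations falling in t_l
xr = rng.normal(3.0, 2.0, size=15_000)   # observations falling in t_r
gamma = len(xl) / (len(xl) + len(xr))

lhs = np.var(np.concatenate([xl, xr]))
rhs = (gamma * np.var(xl) + (1 - gamma) * np.var(xr)
       + gamma * (1 - gamma) * (xl.mean() - xr.mean()) ** 2)
print(lhs, rhs)   # the two values agree up to floating-point error
```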

1.2 Proof of Theorem 3.1

Let \(\mathbb{T}\) be the family of polygons in \(\mathbb{R}^p\) with faces orthogonal to the axes, and fix \(i \in \{1, \ldots , p\}\) and \(t \in \mathbb{T}\). For \(a \in \mathbb{R}\), let \(t_l = \{ x \in t: x(i) \le a\}\) and \(t_r = t \setminus t_l\). We define \(r(t,i,a) = R(t) - R(t_l) - R(t_r)\) and its empirical version \(r_n(t,i,a) = R_n(t) - R_n(t_l) - R_n(t_r)\).

We start by showing the uniform convergence

$$\begin{aligned} \sup _{a \in \mathbb{R }} \sup _{t \in \mathbb{T }}\vert r_n(t,i,a)-r(t,i,a)\vert \rightarrow 0 \ \ a.s. \end{aligned}$$
(8)

By Lemma 3.1,

$$\begin{aligned} \alpha _t r(t,i,a) = \alpha _{t_l}\alpha _{t_r} \Vert \mu _l(a) - \mu _r(a)\Vert ^2, \end{aligned}$$
(9)

where \(\alpha _A = P(\mathbf{X} \in A)\) and \( \mu _j(a) = \mathbb{E }(X_{t_j}),{\quad }j=l,r\). Then, the pairs \((i_{jn}, a_{jn})\) and \((i_{j}, a_{j})\) are the arguments that maximise the right-hand side of equation (9) with respect to the measures \(P_n\) and \(P\) respectively. We observe that the right-hand side of equation (9) equals

$$\begin{aligned} \alpha _{t_r} \int _{t_l} \Vert x \Vert ^2dP(x) + \alpha _{t_l} \int _{t_r} \Vert x \Vert ^2dP(x) - 2 \left\langle \int _{t_l} xdP(x) , \int _{t_r} xdP(x)\right\rangle . \end{aligned}$$
(10)
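
In empirical terms, the quantity maximised by the splitting step is, by Eq. (9) and up to the factor \(\alpha_t\), of the form \(\alpha_{t_l}\alpha_{t_r}\Vert \mu_l(a) - \mu_r(a)\Vert^2\). Below is a minimal sketch of the corresponding empirical criterion for a fixed coordinate \(i\) (illustrative only; the function name and the simulated sample are assumptions).

```python
# Sketch of the empirical counterpart of Eq. (9): for a fixed coordinate i,
# scan candidate thresholds a and evaluate
#   alpha_l * alpha_r * ||mean_l(a) - mean_r(a)||^2,
# which, up to the factor alpha_t, is the quantity r_n(t, i, a).
import numpy as np

def empirical_split_criterion(X, i):
    """Threshold a maximising alpha_l * alpha_r * ||mean_l(a) - mean_r(a)||^2 along coordinate i."""
    n = len(X)
    best_a, best_val = None, -np.inf
    for a in np.unique(X[:, i])[:-1]:
        left, right = X[X[:, i] <= a], X[X[:, i] > a]
        alpha_l, alpha_r = len(left) / n, len(right) / n
        val = alpha_l * alpha_r * np.sum((left.mean(axis=0) - right.mean(axis=0)) ** 2)
        if val > best_val:
            best_a, best_val = a, val
    return best_a, best_val

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal([4.0, 0.0], 1.0, (200, 2))])
print(empirical_split_criterion(X, 0))  # the chosen threshold falls between the two group means
```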

In order to prove Eq. (8) it is sufficient to show that:

  1. \(\sup _{a \in \mathbb{R}} \sup _{t \in \mathbb{T}}\vert P_n(t_j) - P(t_j)\vert \rightarrow 0 \ \ a.s.,{\quad }j=l,r;\)

  2. \(\sup _{a \in \mathbb{R}} \sup _{t \in \mathbb{T}}\vert \int _{t_j} \Vert x \Vert ^2\, dP_n(x) - \int _{t_j} \Vert x \Vert ^2\, dP(x) \vert \rightarrow 0 \ \ a.s.,{\quad }j=l,r;\)

  3. \(\sup _{a \in \mathbb{R}} \sup _{t \in \mathbb{T}}\vert \int _{t_j} x(i)\, dP_n(x) - \int _{t_j} x(i)\, dP(x) \vert \rightarrow 0 \ \ a.s.,{\quad }j=l,r,\ i=1, \ldots , p.\)

Since \(\mathbb{T}\) is a Vapnik–Chervonenkis class, (1) holds. Observe now that the conditions for uniform convergence over families of sets still hold when dealing with finite signed measures. Therefore, considering the finite measure \(\Vert x \Vert ^2\, dP(x)\) and the finite signed measure \(x(i)\, dP(x)\), (2) and (3) also hold.
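
The kind of uniform convergence invoked in (1)–(3) can be illustrated numerically: over the class of half-lines \(\{x \le a\}\), the supremum distance between the empirical and the true measure shrinks as \(n\) grows. A small Monte Carlo sketch, under an assumed standard normal distribution:

```python
# Monte Carlo illustration of the Glivenko-Cantelli type convergence behind
# conditions (1)-(3): sup over half-lines {x <= a} of |P_n - P|, computed for
# a standard normal sample of increasing size.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
for n in (100, 1_000, 10_000, 100_000):
    x = np.sort(rng.normal(size=n))
    cdf = norm.cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)
    d_minus = np.max(cdf - np.arange(0, n) / n)
    print(n, max(d_plus, d_minus))   # decreases roughly like 1/sqrt(n)
```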

Since

$$\begin{aligned} \lim _{a \rightarrow \infty } \alpha _{t_l}\alpha _{t_r} \Vert \mu _l(a) - \mu _r(a)\Vert ^2 = \lim _{a \rightarrow -\infty } \alpha _{t_l}\alpha _{t_r} \Vert \mu _l(a) - \mu _r(a)\Vert ^2= 0, \end{aligned}$$
(11)

we have that

$$\begin{aligned} \inf {\{argmax_{a \in \mathbb{R}} r_n(t,i,a)\}} \rightarrow \inf {\{argmax_{a \in \mathbb{R}} r(t,i,a)\}} \quad a.s. \end{aligned}$$

In the first step of the algorithm, \(t=\mathbb{R }^p\) and we obtain \( i_{n1} = i_1\) for \(n\) large enough and \(a_{n1} \rightarrow a_1\) a.s. In the next step, we observe that the empirical procedure begins to work with \(t_{nl}\) and \(t_{nr}\), while the population algorithm will do so with \(t_{l}\) and \(t_{r}\). However, we have that

$$\begin{aligned}&\sup _{a \in \mathbb{R}} \vert r_n(t_{nj}, i, a) - r(t_j, i, a) \vert \nonumber \\&\quad \le \sup _{a \in \mathbb{R}} \sup _{t \in \mathbb{T}}\vert r_n(t_{nj}, i, a) - r(t_{nj}, i, a) \vert + \sup _{a \in \mathbb{R}} \vert r(t_{nj}, i, a) - r(t_j, i, a) \vert , \end{aligned}$$
(12)

for \(j=l,r\).

We already know that the first term on the right-hand side of equation (12) converges to zero almost surely. In order to show that the second term also converges to zero, it is sufficient to show that:

  1. \( \sup _{a \in \mathbb{R}} \vert P(t_{nj})- P(t_j)\vert \rightarrow 0{\quad }a.s.{\quad }j=l,r;\)

  2. \( \sup _{a \in \mathbb{R}}\vert \int _{t_j} \Vert x \Vert ^2\, dP(x) - \int _{t_{nj}} \Vert x \Vert ^2\, dP(x) \vert \rightarrow 0{\quad }a.s.{\quad }j=l,r;\)

  3. \( \sup _{a \in \mathbb{R}} \vert \int _{t_j} x(i)\, dP(x) - \int _{t_{nj}} x(i)\, dP(x) \vert \rightarrow 0{\quad }a.s.{\quad }j=l,r, \ i=1, \ldots , p,\)

which follows from the assumption that \(\Vert x \Vert ^2 f(x)\) is bounded. This concludes the proof since \(minsize/n \rightarrow \tau \).

1.3 Proof of Theorem 3.2

We need to show that we have consistency in both steps of the backward algorithm.

(i) Convergence of the pruning step. Let \(\{t^*_{1n}, \ldots , t^*_{mn}\}\) be the output of the forward algorithm. The partition produced by the pruning step converges to the corresponding population version as a consequence of:

  • the conclusions of Theorem 3.1.

  • the fact that the random variables \(W_{lr}, \tilde{d}_l, \tilde{d}_r\) are positive.

  • the uniform convergence of the empirical quantile function to its population version.

  • the Lebesgue dominated convergence theorem.

(ii) Convergence of the joining step. The proof is essentially the same as that of (i).

Cite this article

Fraiman, R., Ghattas, B. & Svarc, M. Interpretable clustering using unsupervised binary trees. Adv Data Anal Classif 7, 125–145 (2013). https://doi.org/10.1007/s11634-013-0129-3

