On Clustering Validation Techniques

Halkidi, Maria; Batistakis, Yannis; Vazirgiannis, Michalis

doi:10.1023/A:1012801612483

On Clustering Validation Techniques

Published: December 2001

Volume 17, pages 107–145, (2001)
Cite this article

Journal of Intelligent Information Systems Aims and scope Submit manuscript

Maria Halkidi¹,
Yannis Batistakis¹ &
Michalis Vazirgiannis¹

7628 Accesses
1694 Citations
7 Altmetric
Explore all metrics

Abstract

Cluster analysis aims at identifying groups of similar objects and, therefore helps to discover distribution of patterns and interesting correlations in large data sets. It has been subject of wide research since it arises in many application domains in engineering, business and social sciences. Especially, in the last years the availability of huge transactional and experimental data sets and the arising requirements for data mining created needs for clustering algorithms that scale and can be applied in diverse domains.

This paper introduces the fundamental concepts of clustering while it surveys the widely known clustering algorithms in a comparative way. Moreover, it addresses an important issue of clustering process regarding the quality assessment of the clustering results. This is also related to the inherent features of the data set under concern. A review of clustering validity measures and approaches available in the literature is presented. Furthermore, the paper illustrates the issues that are under-addressed by the recent algorithms and gives the trends in clustering process.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

References

Berry, M.J.A. and Linoff, G. (1996). Data Mining Techniques For Marketing, Sales and Customer Support. John Wiley & Sons, Inc., USA.
Google Scholar
Bezdeck, J.C., Ehrlich, R., and Full, W. (1984). FCM: Fuzzy C-Means Algorithm. Computers and Geoscience, 10(2-3), 191-203.
Google Scholar
Dave, R.N. (1996). Validating Fuzzy Partitions Obtained Through c-Shells Clustering. Pattern Recognition Letters, 17, 613-623.
Google Scholar
Davies, D.L. and Bouldin, D.W. (1979). A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2), 224-227.
Google Scholar
Dunn, J.C. (1974). Well Separated Clusters and Optimal Fuzzy Partitions. J. Cybern., 4, 95-104.
Google Scholar
Ester, M., Kriegel, H-P., Sander, J., and Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceeding of 2nd Int. Conf.On Knowledge Discovery and Data Mining, Portland (pp. 226-23).
Ester, M., Kriegel, H.-P., Sander, J.,Wimmer, M., and Xu, X. (1998). Incremental Clustering for Mining in a Data Warehousing Environment. In Proceedings of 24th VLDB Conference, New York, USA.
Fayyad, M.U., Piatesky-Shapiro, G., Smuth P., Uthurusamy, R. (1996). Advances in Knowledge Discovery and Data Mining. AAAI Press.
Gath I. and Geva A.B. (1989). Unsupervised Optimal Fuzzy Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7), 773-781.
Google Scholar
Guha, S., Rastogi, R., and Shim K. (1998). CURE: An Efficient Clustering Algorithm for Large Databases. In Proceedings of the ACM SIGMOD Conference.
Guha, S, Rastogi, R., and Shim K. (1999). ROCK: A Robust Clustering Algorithm for Categorical Attributes. In Proceedings of the IEEE Conference on Data Engineering.
Halkidi, M., Vazirgiannis, M., and Batistakis, I. (2000). Quality Scheme Assessment in the Clustering Process. In Proceedings of PKDD, Lyon, France.
Han, J. and Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, USA.
Google Scholar
Hinneburg, A. and Keim, D. (1998). An Efficient Approach to Clustering in Large Multimedia Databases with Noise. In Proceedings of KDD Conference.
Huang, Z. (1997). A Fast Clustering Algorithm to Cluster very Large Categorical Data Sets in Data Mining. DMKD.
Jain, A.K., Murty, M.N., and Flyn, P.J. (1999). Data Clustering: A Review. ACM Computing Surveys, 31(3), 264-323.
Google Scholar
Krishnapuram, R., Frigui, H., and Nasraoui, O. (1993). Quadratic Shell Clustering Algorithms and the Detection of Second-Degree Curves. Pattern Recognition Letters, 14(7), 545-552.
Google Scholar
MacQueen, J.B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of 5th Berkley Symposium on Mathematical Statistics and Probability, Volume I: Statistics, pp. 281-297.
Milligan, G.W. and Cooper, M.C. (1985). An Examination of Procedures for Determining the Number of Clusters in a Data Set. Psychometrika, 50, 159-179.
Google Scholar
Milligan, G.W., Soon, S.C., and Sokol, L.M. (1983). The Effect of Cluster Size, Dimensionality and the Number of Clusters on Recovery of True Cluster Structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 40-47.
Google Scholar
Mitchell, T. (1997). Machine Learning.McGraw-Hill, USA.
Google Scholar
Ng, R. and Han, J.(1994). Effecient and Effictive Clustering Methods for Spatial Data Mining. In Proceeding's of the 20th VLDB Conference, Santiago, Chile.
Pal, N.R. and Biswas, J. ( 1997). Cluster Validation Using Graph Theoretic Concepts. Pattern Recognition, 30(6), 847-857.
Google Scholar
Rezaee, R., Lelieveldt, B.P.F., and Reiber, J.H.C. (1998). A New Cluster Validity Index for the Fuzzy c-Mean. Pattern Recognition Letters, 19, 237-246.
Google Scholar
Sharma, S.C. (1996). Applied Multivariate Techniques.John Wiley and Sons.
Sheikholeslami, C., Chatterjee, S., and Zhang, A. (1998). WaveCluster: A-MultiResolution Clustering Approach for Very Large Spatial Database. In Proceedings of 24th VLDB Conference, New York, USA.
Smyth, P. (1996). Clustering using Monte Carlo Cross-Validation. In Proceedings of KDD Conference.
Theodoridis, S. and Koutroubas, K. (1999). Pattern Recognition.Academic Press.
Theodoridis, Y. (1999). Spatial Datasets: An “unofficial” collection. http://dias.cti.gr/~ytheod/research/ datasets/spatial.html Wang, W., Yang, J., and Muntz, R. (1997). STING: A Ststistical Information Grid Approach to Spatial Data Mining. In Proceedings of 23rd VLDB Conference.
Xie, X.L. and Beni, G. (1991). A Validity Measure for Fuzzy Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4), 841-846.
Google Scholar
Zhang, T., Ramakrishnman, R., and Linvy, M. (1996). BIRCH: An Efficient Method for Very Large Databases. ACM SIGMOD, Montreal, Canada.

Download references

Author information

Authors and Affiliations

Department of Informatics, Athens University of Economics & Business, Patision 76, 10434, Athens, Greece (Hellas)
Maria Halkidi, Yannis Batistakis & Michalis Vazirgiannis

Authors

Maria Halkidi
View author publications
You can also search for this author in PubMed Google Scholar
Yannis Batistakis
View author publications
You can also search for this author in PubMed Google Scholar
Michalis Vazirgiannis
View author publications
You can also search for this author in PubMed Google Scholar

Rights and permissions

Reprints and permissions

About this article

Cite this article

Halkidi, M., Batistakis, Y. & Vazirgiannis, M. On Clustering Validation Techniques. Journal of Intelligent Information Systems 17, 107–145 (2001). https://doi.org/10.1023/A:1012801612483

Download citation

Issue Date: December 2001
DOI: https://doi.org/10.1023/A:1012801612483

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

On Clustering Validation Techniques

Abstract

Access this article

Similar content being viewed by others

Effective Data Clustering Algorithms

Data clustering: application and trends

An empirical comparison and characterisation of nine popular clustering methods

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Navigation

On Clustering Validation Techniques

Abstract

Access this article

Similar content being viewed by others

Effective Data Clustering Algorithms

Data clustering: application and trends

An empirical comparison and characterisation of nine popular clustering methods

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation