Abstract
Cluster analysis divides data into groups (clusters) for the purposes of summarization or improved understanding. For example, cluster analysis has been used to group related documents for browsing, to find genes and proteins that have similar functionality, or as a means of data compression. While clustering has a long history and a large number of clustering techniques have been developed in statistics, pattern recognition, data mining, and other fields, significant challenges still remain. In this chapter we provide a short introduction to cluster analysis, and then focus on the challenge of clustering high dimensional data. We present a brief overview of several recent techniques, including a more detailed description of recent work of our own which uses a concept-based clustering approach.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan: `Automatic subspace clustering of high-dimensional data for data mining applications’, In: ACM SIG-MOD Conference on Management of Data ( ACM Press, New York 1998 )
R. Agrawal, R. Srikant• `Fast Algorithms for Mining Association Rules’, In: Proceedings of the 20 th VLDB Conference, ( Santiago, Chile 1997 ) pp. 487–499
C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, Jong Park: `Fast algorithms for projected clustering’, In: ACM SIGMOD Conference, ( ACM Press, New York 1999 )
M.R. Anderberg: Cluster Analysis for Applications ( Academic Press, New York and London 1973 )
R. Bellman: Adaptive Control Processes: A Guided Tour, ( Princeton University Press, Princeton 1961 )
S. Brin: `Near Neighbor Search in Large Metric Spaces’, Proceedings of the 21st International Conference on Very Large Databases (VLDB-1995), ( Morgan Kaufmann, Los Gatos 1995 ) pp. 574–584
K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft: `When is `nearest neighbor’ meaningful?’, In Proceedings of 7th International Conference on Database Theory (ICDT-1999) ( Jerusalem, Israel 1999 ) pp. 217–235
T.H. Cormen, C.E. Leiserson, R.L. Rivest: Introduction to Algorithms ( Prentice Hall, Englewood Cliffs 1990 )
R.O. Duda, P.E. Hart, D.G. Stork: Pattern Recognition ( Wiley, New York 2000 )
I.S. Dhillon, D.S. Modha: Machine Learning 42 143 (2001)
D.L. Donoho: `High Dimensional Data Analysis: The Curses and Blessings of Dimensionality’ American Math. Society Conference: Mathematical Challenges of the 21st Century Los Angeles CA August 6–11 (2000). (Currently only available on the Web at http://www-stat.stanford.edu/ donoho/Lectures/AMS2000/AMS2000.html)
M. Ester, H.P. Kriegel, J. Sander, X. Xu: ‘A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise’, In Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD 96) (Portland, Oregon 1996) pp. 226–231
L. Ertöz, M. Steinbach, V. Kumar: `Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach’, In Proceeding of Text Mining Workshop First International SIAM Data Mining Conference (Chicago, IL 2001)
A. EI-Hamdouchi, P. Willet: The Computer Journal 32 (3) (1989)
C. Fraley, A.E. Raferty: ‘How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis’, Technical Report No. 329, Department of Statistics, University of Washington, Seattle, Washington (1998)
K.C. Gowda, G. Krishna: Pattern Recognition 10, 105 (1978)
S. Guha, R. Rastogi, K. Shim: `ROCK: A Robust Clustering Algorithm for Categorical Attributes’, In Proceedings of the 15th International Conference on Data Engineering (ICDE ‘89) (1999) pp. 512–521
A. Hinneburg, C. Aggarwal, D.A. Keim: `What is the nearest neighbor in high dimensional spaces?’ In Proceedings 26th International Conference on Very Large Data Bases (VLDB-2000)(Morgan Kaufmann, San Francisco 2000) pp. 506–515
A. Hinneburg, D.A. Keim: `An Efficient Approach to Clustering in Large Multimedia Databases with Noise’, In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining(New York 1998) pp. 58–65
A. Hinneburg, D.A. Keim: `Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering’, In Proceedings of 25th International Conference on Very Large Data Bases (VLDB-1999)(Morgan Kaufmann, San Francisco 1999) pp. 506–517
E.-H. Han, G. Karypis, V. Kumar, B. Mobasher: `Clustering In a High-Dimensional Space Using Hypergraph Models’, Technical Report TR-97–063, Department of Computer Science, University of Minnesota, Minneapolis, Minnesota (1997)
F. Hoppner, F. Klawonn, R. Kruse, T. Runkler Fuzzy Cluster Analysis: Methods for Classification Data Analysis and Image Recognition John Wiley and Sons, New York 1999
A.K. Jain, R.C. Dubes: Algorithms for Clustering Data ( Prentice Hall, Englewood Cliffs 1988 )
A.K. Jain, M.N. Murty, P.J. Flynn: ACM Computing Surveys 31 264 (1999)
R.A. Jarvis, E.A. Patrick: IEEE Transactions on Computers, C-22, 1025 (1973)
G. Karypis, E.-H. Han: `Concept Indexing: A Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval and Categorization’, In: Ninth International Conference on Information and Knowledge Management (CIKM 2000) (McLean 2000) 7. G. Karypis, E.-H. Han, V. Kumar: IEEE Computer 32, 68 (1999)
G. Karypis, V. Kumar: `hMETIS 1.5: A hypergraph partitioning package’, Technical report, Department of Computer Science, University of Minnesota (1998)
L. Kaufman, P.J. Rousseeuw: Finding Groups in Data: An Introduction to Cluster Analysis ( John Wiley and Sons, New York 1990 )
T. Mitchell: Machine Learning ( McGraw Hill, New York 1997 )
F. Murtagh, J.-L. Starck, M.W. Berry: The Computer Journal 43, 107 (2000)
H. Nagesh, S. Goil, Alok Choudhary: `MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets’, Technical Report Number CPDC-TR9906–019, Center for Parallel and Distributed Computing, Northwestern University (1999)
C.J. Van Rijsbergen: Information Retrieval 2nd Ed. ( Butterworth, London 1979 )
S.M. Savaresi, D.L. Boley: `On the Performance of Bisecting K-Means and PDDP’, In Proceedings of the First International SIAM Data Mining Conference, ( Chicago, IL 2001 )
G. Strang: Linear Algebra and its Applications third edition ( Harcourt Brace Jovanovich, New York 1986 )
G. Sheikholeslami, S. Chatterjee, Aidong Zhang: `Wavecluster: A multi-resolution clustering approach for very large spatial databases’, In Proceedings of the 24th VLDB Conference (1998)
M. Steinbach, G. Karypis, V. Kumar: A Comparison of Document Clustering Algorithms’, In Proceedings of the Text Mining Workshop for The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (KDD 2000) ( Boston, MA 2000 )
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Steinbach, M., Ertöz, L., Kumar, V. (2004). The Challenges of Clustering High Dimensional Data. In: Wille, L.T. (eds) New Directions in Statistical Physics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-08968-2_16
Download citation
DOI: https://doi.org/10.1007/978-3-662-08968-2_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-07739-5
Online ISBN: 978-3-662-08968-2
eBook Packages: Springer Book Archive