The Challenges of Clustering High Dimensional Data

Steinbach, Michael; Ertöz, Levent; Kumar, Vipin

doi:10.1007/978-3-662-08968-2_16

Michael Steinbach,
Levent Ertöz &
Vipin Kumar

919 Accesses
176 Citations

Abstract

Cluster analysis divides data into groups (clusters) for the purposes of summarization or improved understanding. For example, cluster analysis has been used to group related documents for browsing, to find genes and proteins that have similar functionality, or as a means of data compression. While clustering has a long history and a large number of clustering techniques have been developed in statistics, pattern recognition, data mining, and other fields, significant challenges still remain. In this chapter we provide a short introduction to cluster analysis, and then focus on the challenge of clustering high dimensional data. We present a brief overview of several recent techniques, including a more detailed description of recent work of our own which uses a concept-based clustering approach.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 129.00; Price excludes VAT (USA)

Softcover Book: USD 169.99; Price excludes VAT (USA)

Hardcover Book: USD 169.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan: `Automatic subspace clustering of high-dimensional data for data mining applications’, In: ACM SIG-MOD Conference on Management of Data ( ACM Press, New York 1998 )
Google Scholar
R. Agrawal, R. Srikant• `Fast Algorithms for Mining Association Rules’, In: Proceedings of the 20 th VLDB Conference, ( Santiago, Chile 1997 ) pp. 487–499
Google Scholar
C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, Jong Park: `Fast algorithms for projected clustering’, In: ACM SIGMOD Conference, ( ACM Press, New York 1999 )
Google Scholar
M.R. Anderberg: Cluster Analysis for Applications ( Academic Press, New York and London 1973 )
MATH Google Scholar
R. Bellman: Adaptive Control Processes: A Guided Tour, ( Princeton University Press, Princeton 1961 )
MATH Google Scholar
S. Brin: `Near Neighbor Search in Large Metric Spaces’, Proceedings of the 21st International Conference on Very Large Databases (VLDB-1995), ( Morgan Kaufmann, Los Gatos 1995 ) pp. 574–584
Google Scholar
K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft: `When is `nearest neighbor’ meaningful?’, In Proceedings of 7th International Conference on Database Theory (ICDT-1999) ( Jerusalem, Israel 1999 ) pp. 217–235
Google Scholar
T.H. Cormen, C.E. Leiserson, R.L. Rivest: Introduction to Algorithms ( Prentice Hall, Englewood Cliffs 1990 )
MATH Google Scholar
R.O. Duda, P.E. Hart, D.G. Stork: Pattern Recognition ( Wiley, New York 2000 )
Google Scholar
I.S. Dhillon, D.S. Modha: Machine Learning 42 143 (2001)
Article MATH Google Scholar
D.L. Donoho: `High Dimensional Data Analysis: The Curses and Blessings of Dimensionality’ American Math. Society Conference: Mathematical Challenges of the 21st Century Los Angeles CA August 6–11 (2000). (Currently only available on the Web at http://www-stat.stanford.edu/ donoho/Lectures/AMS2000/AMS2000.html)
Google Scholar
M. Ester, H.P. Kriegel, J. Sander, X. Xu: ‘A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise’, In Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD 96) (Portland, Oregon 1996) pp. 226–231
Google Scholar
L. Ertöz, M. Steinbach, V. Kumar: `Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach’, In Proceeding of Text Mining Workshop First International SIAM Data Mining Conference (Chicago, IL 2001)
Google Scholar
A. EI-Hamdouchi, P. Willet: The Computer Journal 32 (3) (1989)
Google Scholar
C. Fraley, A.E. Raferty: ‘How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis’, Technical Report No. 329, Department of Statistics, University of Washington, Seattle, Washington (1998)
Google Scholar
K.C. Gowda, G. Krishna: Pattern Recognition 10, 105 (1978)
Article MATH Google Scholar
S. Guha, R. Rastogi, K. Shim: `ROCK: A Robust Clustering Algorithm for Categorical Attributes’, In Proceedings of the 15th International Conference on Data Engineering (ICDE ‘89) (1999) pp. 512–521
Google Scholar
A. Hinneburg, C. Aggarwal, D.A. Keim: `What is the nearest neighbor in high dimensional spaces?’ In Proceedings 26th International Conference on Very Large Data Bases (VLDB-2000)(Morgan Kaufmann, San Francisco 2000) pp. 506–515
Google Scholar
A. Hinneburg, D.A. Keim: `An Efficient Approach to Clustering in Large Multimedia Databases with Noise’, In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining(New York 1998) pp. 58–65
Google Scholar
A. Hinneburg, D.A. Keim: `Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering’, In Proceedings of 25th International Conference on Very Large Data Bases (VLDB-1999)(Morgan Kaufmann, San Francisco 1999) pp. 506–517
Google Scholar
E.-H. Han, G. Karypis, V. Kumar, B. Mobasher: `Clustering In a High-Dimensional Space Using Hypergraph Models’, Technical Report TR-97–063, Department of Computer Science, University of Minnesota, Minneapolis, Minnesota (1997)
Google Scholar
F. Hoppner, F. Klawonn, R. Kruse, T. Runkler Fuzzy Cluster Analysis: Methods for Classification Data Analysis and Image Recognition John Wiley and Sons, New York 1999
Google Scholar
A.K. Jain, R.C. Dubes: Algorithms for Clustering Data ( Prentice Hall, Englewood Cliffs 1988 )
MATH Google Scholar
A.K. Jain, M.N. Murty, P.J. Flynn: ACM Computing Surveys 31 264 (1999)
Article Google Scholar
R.A. Jarvis, E.A. Patrick: IEEE Transactions on Computers, C-22, 1025 (1973)
Google Scholar
G. Karypis, E.-H. Han: `Concept Indexing: A Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval and Categorization’, In: Ninth International Conference on Information and Knowledge Management (CIKM 2000) (McLean 2000) 7. G. Karypis, E.-H. Han, V. Kumar: IEEE Computer 32, 68 (1999)
Google Scholar
G. Karypis, V. Kumar: `hMETIS 1.5: A hypergraph partitioning package’, Technical report, Department of Computer Science, University of Minnesota (1998)
Google Scholar
L. Kaufman, P.J. Rousseeuw: Finding Groups in Data: An Introduction to Cluster Analysis ( John Wiley and Sons, New York 1990 )
Book Google Scholar
T. Mitchell: Machine Learning ( McGraw Hill, New York 1997 )
MATH Google Scholar
F. Murtagh, J.-L. Starck, M.W. Berry: The Computer Journal 43, 107 (2000)
Article Google Scholar
H. Nagesh, S. Goil, Alok Choudhary: `MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets’, Technical Report Number CPDC-TR9906–019, Center for Parallel and Distributed Computing, Northwestern University (1999)
Google Scholar
C.J. Van Rijsbergen: Information Retrieval 2nd Ed. ( Butterworth, London 1979 )
Google Scholar
S.M. Savaresi, D.L. Boley: `On the Performance of Bisecting K-Means and PDDP’, In Proceedings of the First International SIAM Data Mining Conference, ( Chicago, IL 2001 )
Google Scholar
G. Strang: Linear Algebra and its Applications third edition ( Harcourt Brace Jovanovich, New York 1986 )
Google Scholar
G. Sheikholeslami, S. Chatterjee, Aidong Zhang: `Wavecluster: A multi-resolution clustering approach for very large spatial databases’, In Proceedings of the 24th VLDB Conference (1998)
Google Scholar
M. Steinbach, G. Karypis, V. Kumar: A Comparison of Document Clustering Algorithms’, In Proceedings of the Text Mining Workshop for The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (KDD 2000) ( Boston, MA 2000 )
Google Scholar

Download references

Authors

Michael Steinbach
View author publications
You can also search for this author in PubMed Google Scholar
Levent Ertöz
View author publications
You can also search for this author in PubMed Google Scholar
Vipin Kumar
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Department of Physics, Florida Atlantic University, 777 Glades Road, 33431, Boca Raton, FL, USA
Luc T. Wille

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Steinbach, M., Ertöz, L., Kumar, V. (2004). The Challenges of Clustering High Dimensional Data. In: Wille, L.T. (eds) New Directions in Statistical Physics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-08968-2_16

Download citation

DOI: https://doi.org/10.1007/978-3-662-08968-2_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-07739-5
Online ISBN: 978-3-662-08968-2
eBook Packages: Springer Book Archive

Publish with us

Policies and ethics