Abstract
Cohort studies are becoming essential tools in epidemiological research. In these studies, measurements are not restricted to single variables but can be seen as trajectories. Statistical methods used to identify homogeneous groups of patient trajectories fall into two families: model-based methods (such as Proc Traj) and partitional clustering (non-parametric algorithms such as k-means). KmL is a new implementation of k-means designed specifically for longitudinal data. It can handle missing values, and it runs the algorithm several times, varying the starting conditions and/or the number of clusters sought; its graphical interface helps the user choose an appropriate number of clusters when the classic criterion is not efficient. To assess the efficiency of KmL, we compare its performance with that of Proc Traj on both artificial and real data. The two techniques produce very similar clusterings when trajectories follow polynomial curves. KmL gives much better results on non-polynomial trajectories.
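The strategy described in the abstract (k-means on whole trajectories, rerun from several starting conditions and for several numbers of clusters, with a criterion used to pick among the resulting partitions) can be sketched as follows. This is a minimal illustration, not the KmL implementation: it assumes complete trajectories measured at the same time points (no missing values), a plain Euclidean distance, and the Calinski-Harabasz index as the selection criterion, which is an assumption because the abstract does not name the criterion. All function names and the toy data are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_traj(trajs):
    """Pointwise mean of a non-empty list of trajectories."""
    return [sum(col) / len(trajs) for col in zip(*trajs)]


def kmeans(trajs, k, rng, max_iter=100):
    """One k-means run from a random start; returns one label per trajectory."""
    centers = rng.sample(trajs, k)  # k distinct trajectories as initial centers
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(t, centers[c]))
                      for t in trajs]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [t for t, lab in zip(trajs, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = mean_traj(members)
    return labels


def calinski_harabasz(trajs, labels):
    """Between/within variance ratio; higher suggests a better partition."""
    n = len(trajs)
    clusters = sorted(set(labels))
    k = len(clusters)
    if k < 2 or k >= n:
        return float("-inf")
    overall = mean_traj(trajs)
    between = within = 0.0
    for c in clusters:
        members = [t for t, lab in zip(trajs, labels) if lab == c]
        center = mean_traj(members)
        between += len(members) * sq_dist(center, overall)
        within += sum(sq_dist(t, center) for t in members)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


def best_partition(trajs, k_values=(2, 3, 4), n_starts=20, seed=0):
    """Rerun k-means over several k and several starts; keep the best score."""
    rng = random.Random(seed)
    best = (float("-inf"), None, None)
    for k in k_values:
        for _ in range(n_starts):
            labels = kmeans(trajs, k, rng)
            score = calinski_harabasz(trajs, labels)
            if score > best[0]:
                best = (score, len(set(labels)), labels)
    return best


# Toy data (hypothetical): 15 flat and 15 rising trajectories, 6 time points.
data_rng = random.Random(42)
flat = [[data_rng.gauss(0, 0.5) for _ in range(6)] for _ in range(15)]
rising = [[5 + t + data_rng.gauss(0, 0.5) for t in range(6)] for _ in range(15)]
trajs = flat + rising

score, k, labels = best_partition(trajs)
print("selected number of clusters:", k)
```

Because a single k-means run can stop in a local optimum, keeping the best-scoring partition across restarts is what makes the result less dependent on the starting conditions. When the criterion is flat or ambiguous across values of k, inspecting the candidate partitions visually, as the abstract suggests, remains necessary.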
Cite this article
Genolini, C., Falissard, B. KmL: k-means for longitudinal data. Comput Stat 25, 317–328 (2010). https://doi.org/10.1007/s00180-009-0178-4
DOI: https://doi.org/10.1007/s00180-009-0178-4