KmL: k-means for longitudinal data

Genolini, Christophe; Falissard, Bruno

doi:10.1007/s00180-009-0178-4

Original Paper
Published: 28 November 2009

Computational Statistics, volume 25, pages 317–328 (2010)

Abstract

Cohort studies are becoming essential tools in epidemiological research. In these studies, measurements are not restricted to single variables but can be seen as trajectories. Statistical methods used to determine homogeneous patient trajectories fall into two families: model-based methods (such as Proc Traj) and partitional clustering (non-parametric algorithms such as k-means). KmL is a new implementation of k-means designed specifically for longitudinal data. It can handle missing values, and it runs the algorithm several times, varying the starting conditions and/or the number of clusters sought; its graphical interface helps the user choose an appropriate number of clusters when the classic criterion is not effective. To assess the efficiency of KmL, we compare its performance with that of Proc Traj on both artificial and real data. The two techniques give very similar clusterings when trajectories follow polynomial curves. KmL gives much better results on non-polynomial trajectories.
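
The procedure summarised in the abstract (k-means applied to whole trajectories, rerun from several starting conditions and for several numbers of clusters) can be illustrated with a minimal Python sketch. This is not the KmL package's code. It assumes Euclidean distance between trajectories, complete data with no missing values, and the Calinski-Harabasz index as the selection criterion; the abstract only says "the classic criterion". All function names and the toy data are hypothetical.

```python
import numpy as np

def kmeans_trajectories(traj, k, rng, n_iter=100):
    """Plain Lloyd k-means; each row of `traj` is one individual's trajectory."""
    n = traj.shape[0]
    # Random start: pick k distinct observed trajectories as initial centers.
    centers = traj[rng.choice(n, size=k, replace=False)].copy()
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every trajectory to every center.
        dist = ((traj[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the run has converged
        labels = new_labels
        for j in range(k):
            members = traj[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels, centers

def calinski_harabasz(traj, labels):
    """Between-cluster over within-cluster dispersion (assumed criterion)."""
    n = traj.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    if k < 2:
        return -np.inf
    overall = traj.mean(axis=0)
    between = within = 0.0
    for j in clusters:
        members = traj[labels == j]
        center = members.mean(axis=0)
        between += len(members) * ((center - overall) ** 2).sum()
        within += ((members - center) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))

def best_partition(traj, k_values=range(2, 7), n_restarts=20, seed=0):
    """Rerun k-means for each k and each restart; keep the best-scoring run."""
    rng = np.random.default_rng(seed)
    best = None
    for k in k_values:
        for _ in range(n_restarts):
            labels, _ = kmeans_trajectories(traj, k, rng)
            score = calinski_harabasz(traj, labels)
            if best is None or score > best[0]:
                best = (score, k, labels)
    return best

# Toy data (hypothetical): 30 flat and 30 rising trajectories, 10 time points.
data_rng = np.random.default_rng(42)
times = np.arange(10.0)
flat = data_rng.normal(0.0, 0.3, size=(30, 10))
rising = times + data_rng.normal(0.0, 0.3, size=(30, 10))
trajectories = np.vstack([flat, rising])

score, k, labels = best_partition(trajectories)
print("selected number of clusters:", k)
```

Keeping the best run over all restarts and all values of k mirrors the idea of varying both the starting conditions and the number of clusters; a graphical inspection step, as the abstract describes, would sit on top of this when the criterion is inconclusive.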

Author information

Authors and Affiliations

Inserm, U669, Paris, France
Christophe Genolini & Bruno Falissard
Modal’X, Univ Paris Ouest Nanterre La Défense, Paris, France
Christophe Genolini
Univ Paris-Sud and Univ Paris Descartes, UMR-S0669, Paris, France
Bruno Falissard
Département de santé publique, AP-HP, Hôpital Paul Brousse, Villejuif, France
Bruno Falissard

Corresponding author

Correspondence to Christophe Genolini.

About this article

Cite this article

Genolini, C., Falissard, B. KmL: k-means for longitudinal data. Comput Stat 25, 317–328 (2010). https://doi.org/10.1007/s00180-009-0178-4

Received: 05 May 2009
Accepted: 12 November 2009
Published: 28 November 2009
Issue Date: June 2010
DOI: https://doi.org/10.1007/s00180-009-0178-4
