Abstract
This chapter covers advanced parametric clustering techniques based on mixture distributions. The first section introduces mixture distributions from a general perspective and then presents two popular clustering applications: normal mixture models (latent profile analysis) for metric input variables and multinomial mixture models (latent class analysis) for categorical variables. These ideas are subsequently extended to inputs with mixed scale levels. The following section embeds the mixture distribution concept in a regression framework: in mixture regression models, clustering and estimation of the regression parameters are performed simultaneously. Dirichlet process regression adds another layer of complexity by letting the algorithm determine the optimal number of clusters. The final section covers latent Dirichlet allocation, a topic model for clustering text data.
Notes
1. Note that in mclust a maximum-BIC strategy is used; that is, the higher the BIC, the better the fit.
2. In practice, the user should again try out different numbers of clusters and pick the one with the lowest BIC.
3. For an overview see the corresponding task view on CRAN (URL: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html).
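Notes 1 and 2 refer to two sign conventions for the BIC. mclust reports 2 log L - p log n (higher is better), whereas note 2 presumably relies on the classical form -2 log L + p log n (lower is better). The following is a minimal sketch in Python rather than R. The log-likelihoods, parameter counts, and sample size are hypothetical and not taken from the chapter, and the function names are illustrative rather than part of mclust or any other package. It suggests that the two conventions differ only in sign, so they select the same number of clusters.

```python
import math


def bic_mclust(loglik, n_params, n_obs):
    """BIC in the sign convention reported by mclust: larger is better."""
    return 2.0 * loglik - n_params * math.log(n_obs)


def bic_classical(loglik, n_params, n_obs):
    """BIC in the classical sign convention: smaller is better."""
    return -2.0 * loglik + n_params * math.log(n_obs)


# Hypothetical fits (assumed values, for illustration only):
# number of clusters -> (maximized log-likelihood, number of free parameters)
fits = {1: (-520.0, 2), 2: (-480.0, 5), 3: (-478.0, 8)}
n_obs = 200  # assumed sample size

# Maximize the mclust-style BIC, minimize the classical BIC.
best_mclust = max(fits, key=lambda k: bic_mclust(*fits[k], n_obs))
best_classical = min(fits, key=lambda k: bic_classical(*fits[k], n_obs))

print(best_mclust, best_classical)
```

Because one criterion is the negative of the other, the only practical risk is mixing up the direction of the comparison when switching between packages.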
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Mair, P. (2018). Parametric Cluster Analysis and Mixture Regression. In: Modern Psychometrics with R. Use R!. Springer, Cham. https://doi.org/10.1007/978-3-319-93177-7_12
DOI: https://doi.org/10.1007/978-3-319-93177-7_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93175-3
Online ISBN: 978-3-319-93177-7
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)