Robust Cluster Analysis via Mixture Models

Geoffrey J. McLachlan; Shu-Kay Ng; Richard Bean

doi:10.17713/ajs.v35i2&3.363

Authors

Geoffrey J. McLachlan, University of Queensland, Australia
Shu-Kay Ng, University of Queensland, Australia
Richard Bean, University of Queensland, Australia

Abstract

Finite mixture models are being increasingly used to model the distributions of a wide variety of random phenomena and to cluster data sets. In this paper, we focus on the use of normal mixture models to cluster data sets of continuous multivariate data. As normality-based methods of estimation are not robust, we review the use of t component distributions. With the t mixture model-based approach, the normal distribution for each component in the mixture model is embedded in a wider class of elliptically symmetric distributions with an additional parameter called the degrees of freedom. The advantage of the t mixture model is that, although the number of outliers needed for breakdown is almost the same as with the normal mixture model, the outliers have to be much larger. We also consider the use of the t distribution for the robust clustering of high-dimensional data via mixtures of factor analyzers. The latter enable a mixture model to be fitted to data which have high dimension relative to the number of data points to be clustered.
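
As a rough illustration of why t components are less sensitive to outliers than normal ones, the following minimal sketch uses the standard weight from EM fitting of a t distribution, u = (ν + p) / (ν + d), where d is the squared standardized distance of an observation from the component location, p is the dimension, and ν is the degrees of freedom. The function names, the fixed scale, the choice ν = 4, and the toy data are all assumptions made for illustration; this is a single-component, univariate sketch, not the procedure of the paper.

```python
def t_weight(y, mu, sigma2, nu, p=1):
    """Weight u = (nu + p) / (nu + d) for a univariate observation,
    where d is the squared standardized distance from mu.
    Large distances give small weights; as nu grows, u tends to 1
    (the normal case, where every point counts equally)."""
    d = (y - mu) ** 2 / sigma2
    return (nu + p) / (nu + d)


def robust_location(data, nu=4.0, sigma2=1.0, iters=50):
    """Iteratively reweighted mean under a t model.
    Assumption for illustration: scale and nu are held fixed."""
    mu = sum(data) / len(data)  # start from the ordinary sample mean
    for _ in range(iters):
        w = [t_weight(y, mu, sigma2, nu) for y in data]
        mu = sum(wi * yi for wi, yi in zip(w, data)) / sum(w)
    return mu


# Hypothetical toy data: six points near zero and one gross outlier.
data = [-0.5, -0.2, 0.0, 0.1, 0.3, 0.4, 50.0]
sample_mean = sum(data) / len(data)   # pulled strongly toward the outlier
t_location = robust_location(data)    # stays close to the bulk of the data

print(f"sample mean: {sample_mean:.3f}")
print(f"t-based location: {t_location:.3f}")
```

In this sketch the outlier receives a weight far below that of the other points, so the t-based location estimate stays near the main body of the data, while the sample mean does not.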

