Model-based clustering of high-dimensional data streams with online mixture of probabilistic PCA

Research paper by Anastasios Bellas, Charles Bouveyron, Marie Cottrell, Jérôme Lacaille

Indexed on: 25 May '13
Published on: 25 May '13
Published in: Advances in Data Analysis and Classification


Model-based clustering is a popular tool, renowned for its probabilistic foundations and its flexibility. However, model-based clustering techniques usually perform poorly on high-dimensional data streams, which are now a common data type. To overcome this limitation, we propose an online inference algorithm for the mixture of probabilistic PCA model. The proposed algorithm relies on an EM-based procedure and on a probabilistic, incremental version of PCA. Model selection is also addressed in the online setting through parallel computing. Numerical experiments on simulated and real data demonstrate the effectiveness of our approach and compare it with state-of-the-art online EM-based algorithms.
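To give a rough feel for what an online EM-based procedure looks like, the sketch below updates a mixture model one observation at a time using running sufficient statistics and a decaying step size. It is not the algorithm of the paper: for brevity it swaps the mixture of probabilistic PCA for a much simpler mixture of isotropic Gaussians, and it omits the incremental PCA step and the model selection. The class name `OnlineIsotropicGMM`, the `step_power` parameter, and the step-size schedule are all assumptions made for illustration.

```python
import numpy as np


class OnlineIsotropicGMM:
    """Illustrative online EM for a mixture of isotropic Gaussians.

    Simplified stand-in, NOT the mixture of probabilistic PCA from the paper.
    """

    def __init__(self, means, var=1.0, step_power=0.6):
        self.means = np.array(means, dtype=float)            # shape (K, d)
        n_comp, dim = self.means.shape
        self.weights = np.full(n_comp, 1.0 / n_comp)
        self.vars = np.full(n_comp, float(var))
        # Running sufficient statistics: E[r_k], E[r_k x], E[r_k ||x||^2].
        self.s0 = self.weights.copy()
        self.s1 = self.weights[:, None] * self.means
        self.s2 = self.weights * (dim * self.vars + np.sum(self.means ** 2, axis=1))
        self.step_power = step_power                         # assumed schedule exponent
        self.n_seen = 0

    def responsibilities(self, x):
        """E-step for one observation: posterior probability of each component."""
        dim = x.shape[0]
        sq_dist = np.sum((x - self.means) ** 2, axis=1)
        log_p = (np.log(self.weights)
                 - 0.5 * dim * np.log(2.0 * np.pi * self.vars)
                 - 0.5 * sq_dist / self.vars)
        log_p -= log_p.max()                                 # numerical stability
        resp = np.exp(log_p)
        return resp / resp.sum()

    def partial_fit(self, x):
        """Process a single observation and refresh the parameters."""
        x = np.asarray(x, dtype=float)
        dim = x.shape[0]
        self.n_seen += 1
        gamma = (self.n_seen + 1) ** (-self.step_power)      # decaying step size in (0, 1)
        resp = self.responsibilities(x)

        # Stochastic-approximation update of the sufficient statistics.
        self.s0 += gamma * (resp - self.s0)
        self.s1 += gamma * (resp[:, None] * x - self.s1)
        self.s2 += gamma * (resp * np.dot(x, x) - self.s2)

        # M-step: parameters as closed-form functions of the statistics.
        self.weights = self.s0 / self.s0.sum()
        self.means = self.s1 / self.s0[:, None]
        second_moment = self.s2 / self.s0
        self.vars = np.maximum(
            (second_moment - np.sum(self.means ** 2, axis=1)) / dim, 1e-6)
        return resp
```

Each call to `partial_fit` touches only the current observation, so memory use does not grow with the length of the stream. In a mixture of probabilistic PCA, the isotropic variance would be replaced by a low-rank-plus-noise covariance per component, which is where an incremental PCA update would come in.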