Clustering Large Datasets Using Nonsmooth Optimization: The Clust-Splitter Algorithm – Jenni Lampainen (Department of Mathematics and Statistics)
Clustering is a fundamental task in data mining and machine learning, aiming to group data points into clusters based on their similarity. The rapid growth of data, along with improvements in computer hardware, has made it possible to store and process massive datasets containing millions of data points and attributes. While this development makes large-scale clustering both possible and essential, it also introduces major challenges: many existing algorithms either converge to poor local minima and thus produce suboptimal clusterings, or require excessive computational resources. There is therefore a significant need for clustering methods that produce accurate results within a reasonable time on very large datasets. This talk presents a novel incremental clustering method, Clust-Splitter, for large-scale minimum sum-of-squares clustering problems, in which data points are partitioned into clusters by minimizing the sum of squared distances from each point to its nearest cluster center. The method is based on a nonsmooth optimization approach and incorporates a new data splitting strategy to generate effective starting points.
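To make the minimum sum-of-squares objective concrete, the following minimal sketch evaluates it for a fixed set of centers. It illustrates only the objective function, not the Clust-Splitter algorithm itself; the function name `mssc_objective` and the toy data are assumptions introduced here, and some formulations additionally divide the sum by the number of points. The inner `min` over centers is what makes the objective nonsmooth.

```python
def mssc_objective(points, centers):
    """Sum, over all points, of the squared Euclidean distance
    to the nearest center (hypothetical helper for illustration)."""
    total = 0.0
    for p in points:
        # The min over centers is the source of nonsmoothness.
        total += min(
            sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            for c in centers
        )
    return total


# Toy data (assumed): two well-separated pairs of points in the plane.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]

print(mssc_objective(points, one_center))   # 104.0
print(mssc_objective(points, two_centers))  # 4.0
```

In this toy case, adding a second center reduces the objective from 104 to 4, which is the kind of improvement an incremental method looks for when it adds clusters one at a time.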