2022 Annual Meeting

(10h) Global Optimal K-Center Clustering with Millions of Samples

Authors

Cao, Y. - Presenter, University of British Columbia
Hua, K., University of British Columbia
Clustering is a fundamental unsupervised machine learning task that plays a vital role in various fields of applications, such as customer grouping [1], data summarization [2], and facility location determination [3]. In this talk, we concentrate on one of the most fundamental centroid-based clustering problems called the k-center problem. Given a dataset, the K-center problem aims to select k samples from the dataset as centers of clusters that minimize the maximum within-cluster distance of the dataset [4]. Although many heuristic algorithms are developed for k-center problem, none of these algorithms can guarantee a global optimal solution.

This work provides a practical global optimization algorithm for this task based on a reduced-space spatial branch and bound scheme. This algorithm can guarantee convergence to the global optimum by only branching on the centers of clusters, which is independent of the dataset’s cardinality. In addition, a set of feasibility-based bounds tightening techniques are proposed to determine the assignment of samples, narrow down the domain of centers, and significantly accelerate the convergence. To demonstrate the capacity of this algorithm, we present computational results on 26 UCI datasets [5]. Notably, for the dataset with 4 million samples ((i.e., 1000 times larger than the state-of-the-art method [6] in the literature) and 18 features, the serial implementation of the algorithm can attain the global optimum to an optimality gap of 0.1% within 2 hours.

[1] Aggarwal, C. C., Wolf, J. L., and Yu, P. S.-l. Method for targeted advertising on the web based on accumulated self-learning data, clustering users and semantic node graph techniques, March 30 2004. US Patent 6,714,975.

[2] Kleindessner, M., Awasthi, P., and Morgenstern, J. Fair k-center clustering for data summarization. In International Conference on Machine Learning, pp. 3448–3457. PMLR, 2019.

[3] Hansen, P., Brimberg, J., UroˇseviÅLc, D., and MladenoviÅLc, N. Solving large p-median clustering problems by primal–dual variable neighborhood search. Data Mining and Knowledge Discovery, 19(3):351–375, 2009.

[4] Kaufman, L. and Rousseeuw, P. J. Finding groups in data: an introduction to cluster analysis , volume 344. JohnWiley & Sons, 2009.

[5] Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

[6] Duong, K.-C., Vrain, C., et al. Constrained clustering by constraint programming. Artificial Intelligence , 244:70–94, 2017.