2025 AIChE Annual Meeting
(593e) GO-Clustering.Jl: A Julia Package for Global Optimal Centroid-Based Clustering
Centroid-based clustering methods, such as k-means, k-medoids, and k-center, are essential in chemical engineering for analyzing large datasets and uncovering underlying data structures. These methods assign data to clusters depending on the data points’ proximity to the centers of the clusters, which may either be data points, as in k-medoids and k-center, or calculated averages, as in k-means. They are particularly useful for process monitoring, fault detection, and chemical compound analysis, where interpretability and robustness are critical [1, 2]. However, traditional heuristic algorithms for these clustering methods often yield suboptimal solutions, while traditional exact methods guarantee optimality but struggle to scale with large datasets [3, 4, 5].
To address these challenges, we developed GO-Clustering.jl, an open-source Julia package that computes globally optimal solutions for centroid-based clustering problems. GO-Clustering.jl implements tailored spatial branch-and-bound (sBB) algorithms that reformulate the clustering problems as two-stage optimization problems. The sBB scheme ensures convergence by branching only on the low-dimensional space of cluster center regions. Efficiency is further enhanced through analytically derived, closed-form lower bounds and advanced bound tightening techniques as well as parallel execution.
GO-Clustering.jl could deliver provably global optimal solutions for k-means (minimizing sum-of-distance to arbitrary points as centers), k-medoids (minimizing sum-of-distance to data points as centers), and k-center (minimizing maximum distance to data points as centers). Extensive computational studies demonstrate unprecedented scalability: it produces high-quality global solutions for k-means problems with up to 210,000 samples, achieves small optimality gaps for k-medoids problems with up to one million samples, and solves k-center problems with up to one billion samples. Compared to state-of-the-art heuristics, GO-Clustering.jl significantly reduces objective function values (e.g., a 25.8% average reduction for k-center). Compared to traditional exact methods, GO-Clustering.jl is orders of magnitude faster and more scalable (e.g., capable of solving datasets 200,000 times larger for k-center).
These advancements in GO-Clustering.jl provide the chemical engineering community with a robust, scalable, and accessible tool for solving fundamental clustering tasks with global optimality. By overcoming the scalability challenges of previous exact methods, it enables reliable analysis of large, complex datasets common in modern process industries. Its guarantees of global optimality ensure solution quality, making it indispensable for safety-critical applications and robust decision-making in process design, monitoring, control, and optimization. This work bridges the gap between heuristic efficiency and exactness, advancing large-scale data analysis in chemical engineering.
References
[1] Aldrich, C., & Auret, L. (2013). Unsupervised process monitoring and fault diagnosis with machine learning methods (Vol. 16, No. 3, pp. 593-606). London: Springer.
[2] Backman, T. W., Cao, Y., & Girke, T. (2011). ChemMine tools: an online service for analyzing and clustering small molecules. Nucleic acids research, 39, W486-W491.
[3] Hua, K., Shi, M., & Cao, Y. (2021). A Scalable deterministic global optimization algorithm for clustering problems. In International Conference on Machine Learning (pp. 4391-4401). PMLR.
[4] Ren, J., Hua, K., & Cao, Y. (2022). Global optimal k-medoids clustering of one million samples. Advances in Neural Information Processing Systems, 35, 982-994.
[5] Ren, J., You, N., Hua, K., Ji, C., & Cao, Y. (2022). A Global Optimization Algorithm for K-Center Clustering of One Billion Samples. Management Science, accepted.