Clustering algorithms aim to partition data into meaningful groups, revealing underlying structures within large and complex datasets. This capability is particularly valuable in chemical engineering. For instance, clustering can provide an unsupervised method to automatically group normal process operating data, enabling effective fault detection and diagnosis during online monitoring [1]. Clustering algorithms are also employed in drug discovery to identify groups of similar chemical compounds, which is crucial for understanding structure-activity relationships and optimizing lead candidates [2]. In such applications, the interpretability of the cluster centers is crucial; these centers must correspond to actual data points representing physically meaningful entities, such as specific chemical compounds or process states.
K-Means is among the most widely used clustering methods, assigning data points to clusters based on proximity to the cluster's mean (centroid). However, because its centroids are calculated means, they may not correspond to actual data points, thus limiting their direct interpretability. In contrast, K-Medoids clustering selects actual data points as cluster medoids and assigns each observation to its nearest medoid [3]. The use of actual data points as centers makes K-Medoids inherently more interpretable and robust to outliers compared to K-Means.
Despite its advantages, K-Medoids clustering presents a significant computational challenge as the problem is NP-hard. Consequently, traditional algorithms like Partitioning Around Medoids (PAM) often rely on heuristics that can yield suboptimal solutions [4]. On the other hand, exact methods guarantee optimality but are limited to small datasets with thousands of samples [5]. To address this scalability issue, we investigate the deterministic global optimization of the K-Medoids clustering problem. This work proposes a branch-and-bound (BB) framework that utilizes a tailored Lagrangian relaxation method, first introduced in the 1970s [6], to compute lower bounds efficiently at each node of the BB tree. This lower bounding method provides a strong bound at the root node. Notably, a closed-form solution for the lower bound can be derived analytically, eliminating the need to explicitly solve optimization subproblems, and its computation is easily parallelizable. Furthermore, this bounding strategy guarantees finite convergence to the global optimum by branching only on the regions of medoids. We also introduce several tailored bound tightening techniques to reduce the search space and computational cost. Extensive computational studies on 28 benchmarking datasets demonstrate that our algorithm can provide a provable global optimal solution with an optimality gap of 0.1% within 4 hours on datasets with up to one million samples, which is 250 times larger than previous exact methods with better objective values compared to heuristics methods.
This work provides a scalable and efficient algorithm for solving the K-Medoids clustering problem to global optimality. By ensuring both interpretability and optimality for large-scale datasets relevant to chemical engineering, our method enhances the reliability of clustering results, facilitating enhanced process understanding and data-driven decision-making across various applications.
References
[1] Aldrich, Chris, and Lidia Auret. Unsupervised process monitoring and fault diagnosis with machine learning methods. Vol. 16. No. 3. London: Springer, 2013.
[2] Backman, T. W., Cao, Y., & Girke, T. (2011). ChemMine tools: an online service for analyzing and clustering small molecules. Nucleic acids research, 39, W486-W491.
[3] Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A. Y., ... & Bouras, A. (2014). A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE transactions on emerging topics in computing, 2(3), 267-279.
[4] Kaufman, L. (1990). Partitioning around medoids (program pam). Finding groups in data, 344, 68-125.
[5] Elloumi, S. (2010). A tighter formulation of the p-median problem. Journal of combinatorial optimization, 19(1), 69-83.
[6] Cornuejols, G., Fisher, M. L., & Nemhauser, G. L. (1977). Location of bank accounts to optimize float: An analytic study of exact and approximate algorithms. Management science, 23(8), 789-810.