Global optimization of nonconvex nonlinear programs can be a hard problem due to the potentially large number of local optima that may fail to be global. As a divide-and-conquer approach, Branch and Bound (B&B) methods rigorously search the whole solution space and have therefore been widely used for solving optimization problems to global optimality [1]. Since B&B methods can be time-consuming when solving complex problems, multiple CPU-parallelization schemes [2-5] have been proposed to accelerate the solution. In contrast, although GPUs are leveraged for massive parallelism in several computing tasks, including the training of artificial intelligence models, there is not yet a mature scheme for parallelizing B&B methods on GPUs due to their Single Instruction Multiple Threads (SIMT) architecture and the intrinsic imbalance of the B&B search [6]. A recent approach by Gottlieb et al. [7] proposed to calculate the lower bounds of numerous B&B nodes in parallel on the GPU, using McCormick relaxations via automatic source code generation. Processing multiple B&B nodes simultaneously is a direct source of parallelism and enabled significant speedups. However, two aspects can limit the performance of this method: On the one hand, the McCormick relaxations evaluated on one B&B node can differ from those evaluated on others, which implies potential imbalance and thus warp divergence between GPU threads. On the other hand, during the start-up phase of B&B methods, only a few B&B nodes have been generated, so the massive GPU parallelism cannot be fully utilized. Even though one could partition the original domain to produce a large number of nodes at the very beginning, such an extensive a priori partitioning may increase the size of the B&B tree disproportionately, thus slowing down the solution.
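To make the divide-and-conquer idea concrete, the following is a minimal, self-contained sketch of an interval-based B&B loop for a one-dimensional problem. The objective, its hand-coded interval lower bound, and all names are illustrative assumptions of ours, not the MAiNGO implementation:

```python
# Minimal B&B sketch for 1D global minimization: split the domain,
# lower-bound each node by interval arithmetic, and prune nodes whose
# lower bound exceeds the best known upper bound (the incumbent).

def f(x):
    """Example nonconvex objective (illustrative choice)."""
    return x**4 - 3 * x**2 + x

def interval_lower_bound(lo, hi):
    """Valid lower bound of f on [lo, hi] via a natural interval extension."""
    p4 = 0.0 if lo <= 0 <= hi else min(lo**4, hi**4)  # min of x**4 on [lo, hi]
    q = -3 * max(lo**2, hi**2)                        # min of -3*x**2 on [lo, hi]
    return p4 + q + lo                                # min of x on [lo, hi] is lo

def branch_and_bound(lo, hi, tol=1e-4):
    ub = min(f(lo), f(hi), f((lo + hi) / 2))  # incumbent from point evaluations
    stack = [(lo, hi)]
    while stack:
        a, b = stack.pop()
        if interval_lower_bound(a, b) > ub:   # prune: cannot contain the optimum
            continue
        m = (a + b) / 2
        ub = min(ub, f(m))                    # try to improve the incumbent
        if b - a > tol:                       # branch by bisection
            stack += [(a, m), (m, b)]
    return ub
```

Because the natural interval extension overestimates the range of f on wide boxes, most of the work here happens near the global minimizer; this is exactly the bounding step that the methods discussed above seek to tighten and parallelize.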
Instead of processing multiple B&B nodes in parallel, we propose to parallelize the lower bounding solver itself, thereby avoiding potential warp divergence and an a priori increase in the size of the B&B tree. In our approach, the domain of a B&B node is only temporarily divided into a large number of smaller subintervals, on which interval bounds of the objective function and constraints are calculated in parallel on the GPU using interval arithmetic [8-10]. The overall bounds for the whole B&B node are then obtained as the hull of these subinterval bounds, i.e., the weakest lower and upper bounds among them. Because interval bounds get tighter when they are constructed over smaller domains, the bounds from our method are much tighter than those obtained by applying interval arithmetic directly over the whole B&B node, yet they remain fast to compute thanks to the massive parallelism offered by the GPU. We implemented this GPU-parallel lower bounding method within our open-source solver MAiNGO [11], where the GPU parts were implemented using CUDA [12] in two different ways: (a) wrapping all GPU code into a single kernel function, and (b) distributing the GPU code over a CUDA graph object, each node of which is either a kernel function or a memory communication operation.
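The subinterval bounding step can be sketched on the CPU as follows; the example function and names are hypothetical, and in the actual method the loop over subdomains is what the GPU threads execute in parallel:

```python
# Sketch of subinterval-based bounding (not the MAiNGO/CUDA implementation):
# partition a node's domain into n subdomains, enclose the function on each
# by interval arithmetic, and take the hull of the per-subdomain enclosures.

def interval_f(lo, hi):
    """Natural interval extension of f(x) = x**2 - x on [lo, hi]."""
    sq_lo = 0.0 if lo <= 0 <= hi else min(lo * lo, hi * hi)
    sq_hi = max(lo * lo, hi * hi)
    return sq_lo - hi, sq_hi - lo  # [x^2] - [x], endpoint-wise

def hull_bound(lo, hi, n):
    """Enclosure of f on [lo, hi] as the hull of n uniform subdomain bounds."""
    w = (hi - lo) / n
    bounds = [interval_f(lo + i * w, lo + (i + 1) * w) for i in range(n)]
    # Hull: weakest lower and upper bounds among the subdomain enclosures.
    return min(b[0] for b in bounds), max(b[1] for b in bounds)
```

For f(x) = x**2 - x on [0, 2] (exact range [-0.25, 2]), n = 1 reproduces the plain interval bound [-2, 4], while n = 64 already yields an enclosure close to the exact range, illustrating how more subdomains tighten the bounds.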
We present multiple case studies to test the performance of the proposed method. The results illustrate that using more subintervals indeed produces tighter bounds on a given B&B node and thus results in fewer B&B iterations. In terms of wall-clock time, our GPU parallelization reaches speedups of up to 4000× over evaluating standard interval arithmetic serially on the CPU. Regarding the different implementations, the one based on the CUDA graph attains higher efficiency as the optimization problems become more complex. For some case studies involving artificial neural network surrogate models, the presented method outperforms the default bounding method in MAiNGO, which is based on McCormick relaxations [13-15] evaluated serially on the CPU, in terms of tightness of bounds (thanks to the evaluation over multiple subintervals), number of B&B iterations, and wall-clock time. Overall, the empirical results demonstrate the promising potential of GPU-parallel lower bounding solvers for accelerating B&B methods.
Acknowledgments
Hongzhen Zhang and Dominik Bongartz gratefully acknowledge funding through Internal Funds KU Leuven (STG/22/060).
References
1. Locatelli, M., & Schoen, F. (2013). Global Optimization: Theory, Algorithms, and Applications. Society for Industrial and Applied Mathematics.
2. Pardalos, P. M. (1989). Parallel search algorithms in global optimization. Applied Mathematics and Computation, 29(3), 219-229.
3. Gendron, B., & Crainic, T. G. (1994). Parallel branch-and-bound algorithms: Survey and synthesis. Operations Research, 42(6), 1042-1066.
4. Bader, D. A., Hart, W. E., & Phillips, C. A. (2005). Parallel algorithm design for branch and bound. In Tutorials on Emerging Methodologies and Applications in Operations Research: Presented at INFORMS 2004, Denver, CO, 5-1.
5. Archibald, B., Maier, P., McCreesh, C., Stewart, R., & Trinder, P. (2018). Replicable parallel branch and bound search. Journal of Parallel and Distributed Computing, 113, 92-114.
6. Vu, T. T., & Derbel, B. (2016). Parallel Branch-and-Bound in multi-core multi-CPU multi-GPU heterogeneous environments. Future Generation Computer Systems, 56, 95-109.
7. Gottlieb, R. X., Xu, P., & Stuber, M. D. (2024). Automatic source code generation for deterministic global optimization with parallel architectures. Optimization Methods and Software, 1-39. https://doi.org/10.1080/10556788.2024.2396297
8. Hansen, E. (1980). Global optimization using interval analysis—the multi-dimensional case. Numerische Mathematik, 34(3), 247-270.
9. Baumann, E. (1988). Optimal centered forms. BIT Numerical Mathematics, 28(1), 80-87.
10. Kulisch, U. W. (2009). Complete interval arithmetic and its implementation on the computer. In Numerical Validation in Current Hardware Architectures: International Dagstuhl Seminar, Dagstuhl Castle, Germany, January 6-11, 2008. Revised Papers (pp. 7-26). Springer Berlin Heidelberg.
11. Bongartz, D., Najman, J., Sass, S., & Mitsos, A. (2018). MAiNGO - McCormick-based Algorithm for mixed-integer Nonlinear Global Optimization. Technical report, Process Systems Engineering (AVT.SVT), RWTH Aachen University. http://permalink.avt.rwth-aachen.de/?id=729717
12. NVIDIA. (2024). CUDA Toolkit 12.4. Retrieved from https://developer.nvidia.com/cuda-toolkit
13. McCormick, G. P. (1976). Computability of global solutions to factorable nonconvex programs: Part I—Convex underestimating problems. Mathematical Programming, 10(1), 147-175.
14. Mitsos, A., Chachuat, B., & Barton, P. I. (2009). McCormick-based relaxations of algorithms. SIAM Journal on Optimization, 20(2), 573-601.
15. Beckers, M., Mosenkis, V., & Naumann, U. (2012). Adjoint mode computation of subgradients for McCormick relaxations. In Recent Advances in Algorithmic Differentiation (pp. 103-113). Springer Berlin Heidelberg.