2025 AIChE Annual Meeting

(60f) GPU-Accelerated Solution of Block-Tridiagonal Systems for Large-Scale Optimization

Authors

David Jin - Presenter, Massachusetts Intitute of Technology
Sungho Shin, Uninveristy of Wisconsin-Madison
Efficiently solving large-scale linear systems is crucial in process system engineering such as energy system modeling, optimal process control, and supply chain optimization [1-4]. Many of these systems exhibit inherent block-structured sparsity patterns, which can be strategically exploited to enhance computational efficiency [5, 6]. Traditional general-purpose linear solvers often overlook such specialized structures, leading to suboptimal performance.

The existing works in the literature have focused on employing Schur-complement-based approaches for numerical linear algebra within nonlinear optimization frameworks. Wright contributed a partitioned dynamic programming method that applies domain decomposition for linear-quadratic optimal control problems, offering effective parallelism across various computing architectures [7]. Zavala, Laird, and Biegler introduced an interior-point decomposition approach that utilizes Schur complements to parallelize the solution of large-scale nonlinear parameter estimation problems [8]. Kang et al. developed methods using implicit Schur-complement decompositions tailored for block-structured nonlinear programs, while Chiang, Petra, and Zavala presented PIPS-NLP, a software library for large-scale energy system optimization using these techniques on HPC platforms [9, 10]. Petra et al. introduced an augmented incomplete factorization technique to compute the Schur complement in stochastic optimization problems, enhancing the scalability of solving large-scale systems on high-performance computers [11]. Although these methods have enabled the solution of very large-scale problems on distributed systems, they do not specifically target GPU architectures. Modern GPUs offer exceptional memory bandwidth and dense linear algebra throughput, making them ideal for medium-scale problems. This potential is demonstrated in several works by Shin and Pacaud [12-14]. Especially, Cole, Shin, and Pacaud have contributed specialized numerical linear algebra libraries tailored for optimal control, collectively highlighting the growing opportunity for GPU-based numerical linear algebra [15].

In this work, we present a parallelizable algorithm based on Schur’s complement tailored specifically for block tridiagonal linear systems. The proposed method partitions the system into independent subproblems, enabling effective parallel execution. To further optimize computational throughput, the algorithm employs GPU acceleration via custom CUDA kernels designed explicitly for block-wise factorization and solve operations. The solver is implemented in Julia, chosen for its superior computational efficiency and flexibility in scientific computing contexts. Essential features include automatic detection and exploitation of block structures, optimized memory usage to minimize cache misses, and batched LAPACK/BLAS operations to enhance GPU performance. Additionally, seamless integration with established optimization frameworks, such as MadNLP [1, 16], is provided.

Benchmarking against state-of-the-art solvers, including MA57 [18], CHOLMOD [19], LDLFactorizations [20, 21], and cuDSS [22], demonstrates substantial performance gains. In typical tests involving nonlinear optimization problems with approximately 100,000 variables, our solver achieves 20–50× faster factorization and solve times compared to MA57 and CHOLMOD on a single Nvidia V100 GPU. It also outperforms cuDSS by a factor of three in the factorization phase while matching its performance during the solve phase. The algorithm exhibits excellent scalability, with near-linear performance growth on problems reaching hundreds of thousands of variables.

Our solver provides considerable performance advantages across critical applications including battery modeling, model predictive control, and nonlinear optimization frameworks. For example, our solver can significantly benefit battery state estimation tasks, such as predicting the state of charge and state of health using Nonlinear AutoRegressive models with eXogenous inputs (NARX) [17]. Additionally, for real-time optimization scenarios, our algorithm can significantly reduce computational latency, facilitating more efficient online optimization solutions [23].

References:

[1] Shin, S., Pacaud, F., & Anitescu, M. (2024). Accelerating optimal power flow with GPUs: SIMD abstraction of nonlinear programs and condensed-space interior-point methods. arXiv. https://arxiv.org/abs/2307.16830

[2] Severson, K. A., Attia, P. M., Jin, N., Perkins, N., Jiang, B., Yang, Z., Chen, M. H., Aykol, M., & Braatz, R. D. (2019). Data-driven prediction of battery cycle life before capacity degradation. Nature Energy, 4(5), 383–391. https://doi.org/10.1038/s41560-019-0356-8

[3] Fujiwara, M., Nagy, Z. K., Chew, J. W., & Braatz, R. D. (2005). First-principles and direct design approaches for the control of pharmaceutical crystallization. Journal of Process Control, 15(5), 493–504. https://doi.org/10.1016/j.jprocont.2004.08.003

[4] Perea-López, E., Ydstie, B. E., & Grossmann, I. E. (2003). A model predictive control strategy for supply chain optimization. Computers & Chemical Engineering, 27(8), 1201–1218. https://doi.org/10.1016/S0098-1354(03)00047-4

[5] Coon, A. B., & Stadtherr, M. A. (1995). Generalized block-tridiagonal matrix orderings for parallel computation in process flowsheeting. Computers & Chemical Engineering, 19(6), 787–805. https://doi.org/10.1016/0098-1354(94)00081-6

[6] Gondzio, J., & Grothey, A. (2009). Exploiting structure in parallel implementation of interior point methods for optimization. Computational Management Science, 6(2), 135–160. https://doi.org/10.1007/s10287-008-0084-2

[7] Wright, S. J. (1991). Partitioned dynamic programming for optimal control. SIAM Journal on Optimization, 1(4), 620–642. https://doi.org/10.1137/0801037

[8] Zavala, V. M., Laird, C. D., & Biegler, L. T. (2008). Interior-point decomposition approaches for parallel solution of large-scale nonlinear parameter estimation problems. Chemical Engineering Science, 63(19), 4834–4845. https://doi.org/10.1016/j.ces.2007.05.022

[9] Kang, J., Cao, Y., Word, D. P., & Laird, C. D. (2014). An interior-point method for efficient solution of block-structured NLP problems using an implicit Schur-complement decomposition. Computers & Chemical Engineering, 71, 563–573. https://doi.org/10.1016/j.compchemeng.2014.09.013

[10] Chiang, N., Petra, C. G., & Zavala, V. M. (2014). Structured nonconvex optimization of large-scale energy systems using PIPS-NLP. In 2014 Power Systems Computation Conference (pp. 1–7). IEEE. https://doi.org/10.1109/PSCC.2014.7038374

[11] Petra, C. G., Schenk, O., Lubin, M., & Gäertner, K. (2014). An augmented incomplete factorization approach for computing the Schur complement in stochastic optimization. SIAM Journal on Scientific Computing, 36(2), C139–C162. https://doi.org/10.1137/130908737

[12] Shin, S., Pacaud, F., & Anitescu, M. (2024). Accelerating optimal power flow with GPUs: SIMD abstraction of nonlinear programs and condensed-space interior-point methods. arXiv. https://arxiv.org/abs/2307.16830

[13] Pacaud, F., Shin, S., Schanen, M., Anitescu, M., & Balaprakash, P. (2024). Accelerating condensed interior-point methods on SIMD/GPU architectures. Journal of Optimization Theory and Applications, 202, 184–203. https://doi.org/10.1007/s10957-022-02129-5

[14] Pacaud, F., Schanen, M., Shin, S., Maldonado, D. A., & Anitescu, M. (2024). Parallel interior-point solver for block-structured nonlinear programs on SIMD/GPU architectures. Optimization Methods and Software, 39(4), 874–897. https://doi.org/10.1080/10556788.2024.2329646

[15] Cole, D., Shin, S., Pacaud, F., Zavala, V. M., & Anitescu, M. (2023). Exploiting GPU/SIMD architectures for solving linear-quadratic MPC problems. In Proceedings of the 2023 American Control Conference (ACC) (pp. 3995–4000). IEEE. https://doi.org/10.23919/ACC55779.2023.10155791

[16] Shin, S., Coffrin, C., Sundar, K., & Zavala, V. M. (2021). Graph-based modeling and decomposition of energy infrastructures. IFAC-PapersOnLine, 54(3), 693–698. https://doi.org/10.1016/j.ifacol.2021.08.278

[17] Leontaritis, I. J., & Billings, S. A. (1985). Input-output parametric models for non-linear systems Part I: deterministic non-linear systems. International Journal of Control, 41(2), 303–328. https://www.tandfonline.com/doi/abs/10.1080/0020718508961129

[18] Duff, I. S. (2004). MA57—a code for the solution of sparse symmetric definite and indefinite systems. ACM Transactions on Mathematical Software, 30(2), 118–144. https://doi.org/10.1145/992200.992202

[19] Chen, Y., Davis, T. A., Hager, W. W., & Rajamanickam, S. (2008). Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate. ACM Transactions on Mathematical Software, 35(3), Article 22. https://doi.org/10.1145/1391989.1391995

[20] Davis, T. A. (2005). Algorithm 849: A concise sparse Cholesky factorization package. ACM Transactions on Mathematical Software, 31(4), 587–591. https://doi.org/10.1145/1114268.1114277

[21] Orban, D., & contributors. (2023). LDLFactorizations.jl: Factorization of Symmetric Matrices (Version 0.10.1) [Software]. Zenodo. https://zenodo.org/records/10016111

[22] NVIDIA Corporation. (n.d.). cuDSS: CUDA Direct Sparse Solvers. NVIDIA Developer. https://developer.nvidia.com/cudss

[23] Wang, Y., & Boyd, S. (2010). Fast model predictive control using online optimization. IEEE Transactions on Control Systems Technology, 18(2), 267–278. https://doi.org/10.1109/TCST.2009.2017934