The advent of machine learning (ML) technologies has driven rapid strides in the modelling and optimization of complex systems in recent years. In mathematical optimization, mathematical programming formulations have been proposed for several machine-learning models, including artificial neural networks (1,2), Gaussian processes (3), decision trees (4,5), graph neural networks (6,7), and support vector machines (8). These formulations enable verification of the predictions of machine-learning models (9) and serve as surrogate models that improve the tractability of process optimization (10). They have also inspired software packages such as MeLOn (1,8), OMLT (11), Gurobi Machine Learning (12), PySCIPOpt-ML (13), and MathOptAI (14) for embedding trained machine-learning models in optimization problems.
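To illustrate the kind of formulation these packages generate, the following sketch checks the standard big-M mixed-integer encoding of a single ReLU unit, y = max(0, w·x + b), which underlies the neural-network formulations cited above (1,2). The function name, bounds, and numerical values are illustrative, not from any specific package.

```python
import numpy as np

def relu_bigM_feasible(x, w, b, y, z, L, U, tol=1e-9):
    """Check the standard big-M MILP encoding of y = max(0, w @ x + b).

    z is the binary indicator (1 if the pre-activation is positive);
    L <= w @ x + b <= U are valid pre-activation bounds.
    """
    a = w @ x + b  # pre-activation value
    return (
        y >= a - tol                      # y >= w@x + b
        and y >= -tol                     # y >= 0
        and y <= a - L * (1 - z) + tol    # y <= w@x + b - L(1 - z)
        and y <= U * z + tol              # y <= U z
        and z in (0, 1)
    )

# The true ReLU output, with the matching indicator, satisfies all constraints.
w, b = np.array([1.0, -2.0]), 0.5
x = np.array([0.3, 1.0])
a = w @ x + b                  # -1.2, so the ReLU output is 0
y, z = max(0.0, a), int(a > 0)
print(relu_bigM_feasible(x, w, b, y, z, L=-4.0, U=4.0))  # True
```

A solver explores the binary variable z; for each fixed z, the four inequalities collapse to the correct linear branch of the ReLU.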
A machine-learning model that has particularly stood out in recent years is the transformer, built on the attention mechanism (15). The attention mechanism forms the backbone of generative pre-trained transformer models such as ChatGPT, which are now ubiquitous. Transformer models have also been applied to chemical engineering tasks, such as modelling chemical reactors (16) and crystallization processes (17). To extend the applicability of transformer models to systematic decision-making in chemical engineering, we propose a mixed-integer nonlinear programming formulation for optimizing over a trained transformer model.
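For reference, the scaled dot-product attention of Vaswani et al. (15), Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, can be sketched in a few lines of NumPy; this is the nonlinearity (matrix products composed with a softmax) that any optimization formulation of a transformer must encode:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, np.allclose(attn.sum(axis=1), 1.0))  # (4, 8) True
```

Each row of the attention matrix is a convex combination of the value vectors, which is why the softmax normalization appears as a nonconvex equality constraint in the mathematical programming formulation.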
We implement the proposed formulation in the Pyomo (18) and GurobiPy (19) frameworks to embed transformer models trained using the Keras, PyTorch, and HuggingFace libraries. We test the effectiveness of the formulation on multiple case studies, including optimal trajectory problems, verification of transformer predictions, and optimization of catalytic reactors. Moreover, we conduct extensive numerical experiments to assess how the number of embedding dimensions (the size of the input vector to the transformer) and the depth of trained transformer models affect the performance of the formulation. Our results indicate that optimization over small transformer models, with up to 12 embedding dimensions and one encoder layer, can be completed in under three minutes. For larger transformer models, with more than 12 embedding dimensions or more than one encoder layer, the problem quickly becomes intractable, highlighting the need for further research to improve the tractability of the formulation.
References:
- Schweidtmann AM, Mitsos A. Deterministic Global Optimization with Artificial Neural Networks Embedded. J Optim Theory Appl [Internet]. 2019 Mar 15 [cited 2025 Apr 7];180(3):925–48. Available from: https://link.springer.com/article/10.1007/s10957-018-1396-0
- Fischetti M, Jo J. Deep neural networks and mixed integer linear optimization. Constraints [Internet]. 2018 Jul 1 [cited 2025 Apr 7];23(3):296–309. Available from: https://link.springer.com/article/10.1007/s10601-018-9285-6
- Schweidtmann AM, Bongartz D, Grothe D, Kerkenhoff T, Lin X, Najman J, et al. Deterministic global optimization with Gaussian processes embedded. Math Program Comput [Internet]. 2021 Sep 1 [cited 2025 Apr 7];13(3):553–81. Available from: https://link.springer.com/article/10.1007/s12532-021-00204-y
- Mistry M, Letsios D, Krennrich G, Lee RM, Misener R. Mixed-Integer Convex Nonlinear Optimization with Gradient-Boosted Trees Embedded. INFORMS Journal on Computing [Internet]. 2020 Nov 18 [cited 2025 Apr 7];33(3):1103–19. Available from: https://pubsonline.informs.org/doi/abs/10.1287/ijoc.2020.0993
- Ammari BL, Johnson ES, Stinchfield G, Kim T, Bynum M, Hart WE, et al. Linear model decision trees as surrogates in optimization of engineering applications. Comput Chem Eng. 2023 Oct 1;178:108347.
- McDonald T, Tsay C, Schweidtmann AM, Yorke-Smith N. Mixed-integer optimisation of graph neural networks for computer-aided molecular design. Comput Chem Eng. 2024 Jun 1;185:108660.
- Zhang S, Campos JS, Feldmann C, Walz D, Sandfort F, Mathea M, et al. Optimizing over trained GNNs via symmetry breaking. Adv Neural Inf Process Syst. 2023 Dec 15;36:44898–924.
- Schweidtmann AM, Weber JM, Wende C, Netze L, Mitsos A. Obey validity limits of data-driven models through topological data analysis and one-class classification. Optimization and Engineering [Internet]. 2022 Jun 1 [cited 2025 Apr 7];23(2):855–76. Available from: https://link.springer.com/article/10.1007/s11081-021-09608-0
- Tsay C, Kronqvist J, Thebelt A, Misener R. Partition-Based Formulations for Mixed-Integer Optimization of Trained ReLU Neural Networks. Adv Neural Inf Process Syst. 2021 Dec 6;34:3068–80.
- Misener R, Biegler L. Formulating data-driven surrogate models for process optimization. Comput Chem Eng. 2023 Nov 1;179:108411.
- Ceccon F, Jalving J, Haddad J, Thebelt A, Tsay C, Laird CD, et al. OMLT: Optimization & Machine Learning Toolkit. Journal of Machine Learning Research [Internet]. 2022 [cited 2025 Apr 7];23(349):1–8. Available from: http://jmlr.org/papers/v23/22-0277.html
- Gurobi Machine Learning Manual [Internet]. [cited 2025 Apr 7]. Available from: https://gurobi-machinelearning.readthedocs.io/en/stable/
- Turner M, Chmiela A, Koch T, Winkler M. PySCIPOpt-ML: Embedding Trained Machine Learning Models into Mixed-Integer Programs. 2023 Dec 13 [cited 2025 Apr 7]; Available from: https://arxiv.org/abs/2312.08074v2
- Parker RB, Dowson O, LoGiudice N, Garcia M, Bent R. Formulations and scalability of neural network surrogates in nonlinear optimization problems. 2024 Dec 16 [cited 2025 Apr 7]; Available from: https://arxiv.org/abs/2412.11403v1
- Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, et al. Attention is All you Need. Adv Neural Inf Process Syst. 2017;30.
- Lastrucci G, Theisen MF, Schweidtmann AM. Physics-informed neural networks and time-series transformer for modeling of chemical reactors. Computer Aided Chemical Engineering. 2024 Jan 1;53:571–6.
- Sitapure N, Sang-Il Kwon J. Introducing Hybrid Modeling with Time-Series-Transformers: A Comparative Study of Series and Parallel Approach in Batch Crystallization. Ind Eng Chem Res [Internet]. 2023 Dec 13 [cited 2025 Apr 4];62(49):21278–91. Available from: https://pubs.acs.org/doi/full/10.1021/acs.iecr.3c02624
- Bynum ML, Hackebeil GA, Hart WE, Laird CD, Nicholson BL, Siirola JD, et al. Pyomo — Optimization Modeling in Python [Internet]. Springer; 2021 [cited 2025 Apr 4]. Available from: http://link.springer.com/10.1007/978-3-030-68928-5
- Gurobi Optimizer Reference Manual [Internet]. [cited 2025 Apr 4]. Available from: https://docs.gurobi.com/projects/optimizer/en/11.0/