2025 AIChE Annual Meeting

(531d) Foundation Model-Guided Optimization of Chemical Reaction Spaces for Autonomous Experimentation

Authors

Jonggeol Na, Carnegie Mellon University
The optimization of chemical reactions is inherently complex, as it requires navigating a high-dimensional design space composed of both discrete and continuous variables, including reactants, solvents, catalysts, reaction temperature, and concentration [1]. Owing to practical limitations of time and cost, only a small fraction of the entire combinatorial space can be explored experimentally, and in most cases experimental conditions are selected based on literature precedents and expert intuition [2]. Consequently, intelligent optimization strategies are essential for maximizing experimental efficiency. In closed-loop autonomous experimentation systems, optimization serves as a key decision-making component, enabling the exploration of large experimental spaces and the identification of optimal conditions. The performance of such optimization processes depends strongly on the representation of the search space, including the input variables and their respective domains. One-hot encoding (OHE), which carries no chemically meaningful information, has been widely adopted in previous studies, but it suffers from considerable limitations, such as performance degradation due to sparsity and poor generalization in high-dimensional spaces [3].
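The limitation of OHE noted above can be illustrated with a minimal sketch. The solvent list is a hypothetical example chosen for illustration and is not taken from the study; the point is only that every pair of distinct categories ends up equally far apart, so the encoding cannot express chemical similarity.

```python
import numpy as np

def one_hot(categories):
    """Return a dict mapping each category label to its one-hot vector."""
    labels = sorted(set(categories))
    eye = np.eye(len(labels))
    return {label: eye[i] for i, label in enumerate(labels)}

# Hypothetical solvent choices, for illustration only.
solvents = ["DMF", "DMSO", "toluene", "THF"]
ohe = one_hot(solvents)

# Euclidean distance between every pair of distinct solvents.
# Each pair differs in exactly two positions, so every distance is sqrt(2):
# the encoding treats DMF as no closer to DMSO than to toluene.
distances = {
    (a, b): float(np.linalg.norm(ohe[a] - ohe[b]))
    for a in solvents
    for b in solvents
    if a < b
}

for pair, d in sorted(distances.items()):
    print(pair, round(d, 4))
```

The vector length also grows with the number of categories, which is the sparsity issue the text refers to.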

To overcome these limitations, it is critical to reformulate the search space by integrating chemical knowledge and physical theories [4]. This enables the extraction of meaningful features from black-box objective functions, thereby improving the interpretability of the experimental designs. Such strategies provide intuitive insights into experimental systems, reduce unnecessary computations, and focus the search on experimentally meaningful regions. As part of this reformulation, we explored two approaches: (i) representing molecules using physicochemical descriptors obtained from first-principles calculations such as density functional theory (DFT), and (ii) using artificial intelligence (AI) models to compress molecular features into low-dimensional vector embeddings that retain the key characteristics of the molecules. In particular, we employed several pre-trained models to derive latent molecular representations, including molecular language transformer (MoLFORMER) [5], geometry-enhanced molecular representation learning (GEM) [6], fractional denoising (Frad) [7], and knowledge-guided pre-training of graph transformer (KPGT) [8]; other models can be readily incorporated into the framework. The resulting embeddings were then used as input variables for optimization. In this study, these representations demonstrated superior optimization performance compared with OHE-based inputs. For example, in C–N cross-coupling and Suzuki reactions, MoLFORMER achieved yields exceeding 90% within 20 iterations out of 100 virtual experiments, reaching optimal conditions significantly faster than OHE-based inputs. On the other hand, representations with excessively high embedding dimensions, such as those derived from MoLFORMER and KPGT, tended to require more iterations to reach optimal conditions or showed unstable convergence in certain reactions, including the Buchwald–Hartwig and chiral phosphoric acid-catalyzed thiol addition reactions.

In addition, reaction fingerprints such as the differential reaction fingerprint (DRFP) [9] and data-driven reaction fingerprints (RXNFP) [10] have gained attention as promising approaches because they capture the overall characteristics of a reaction by integrating information on the reactants, the products, and the transformation between them. Following this trend, future work will focus on developing reaction representations that more precisely reflect the nonlinear interactions among reactants and the geometrical structures of molecules.
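How embeddings can serve as input variables for optimization is sketched below. The abstract does not state which optimizer was used, so this is only one plausible setup: a Gaussian-process surrogate with an upper-confidence-bound rule over a fixed candidate set. The embeddings are random vectors standing in for model outputs, and `synthetic_yield`, `optimize`, and all parameter values are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, length_scale):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_posterior(X_train, y_train, X_query, length_scale=2.0, noise=1e-4):
    """Posterior mean and std of a zero-mean, unit-variance GP at X_query."""
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_query, length_scale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.clip(1.0 - np.sum(v**2, axis=0), 1e-12, None)
    return mean, np.sqrt(var)

def optimize(embeddings, run_experiment, n_init=5, n_iter=20, beta=2.0, seed=0):
    """Pick n_init random candidates, then n_iter candidates by upper confidence bound."""
    rng = np.random.default_rng(seed)
    tried = [int(i) for i in rng.choice(len(embeddings), size=n_init, replace=False)]
    yields = [float(run_experiment(i)) for i in tried]
    for _ in range(n_iter):
        y = np.array(yields)
        y_std = (y - y.mean()) / (y.std() + 1e-9)  # standardize to match the unit-variance prior
        mean, std = gp_posterior(embeddings[tried], y_std, embeddings)
        score = mean + beta * std
        score[tried] = -np.inf  # never repeat an experiment
        nxt = int(np.argmax(score))
        tried.append(nxt)
        yields.append(float(run_experiment(nxt)))
    return tried, yields

# Stand-in data: 100 candidate conditions with 4-dimensional "embeddings".
data_rng = np.random.default_rng(42)
embeddings = data_rng.normal(size=(100, 4))
target = embeddings[17]  # arbitrary candidate chosen as the synthetic optimum

def synthetic_yield(i):
    """Hypothetical yield (%) that decays with distance from the target embedding."""
    return 100.0 * np.exp(-0.1 * np.sum((embeddings[i] - target) ** 2))

tried, yields = optimize(embeddings, synthetic_yield)
print("best yield found:", round(max(yields), 1))
```

Swapping the random `embeddings` for vectors from a pre-trained model, or for one-hot vectors, changes only the input matrix, which is what makes comparing representations under a fixed optimizer straightforward.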

However, rigorously comparing different combinations of reaction representations and optimization algorithms under varying experimental scenarios remains a time- and resource-intensive challenge [11]. To address this, we developed a unified and user-friendly platform that automatically benchmarks diverse optimization strategies and encoding schemes across a range of organic reaction scenarios. The platform is designed to collect large-scale high-throughput experimental data from organic reactions and enables a comparative analysis of the representations generated by various foundation models. In addition to classical optimization algorithms, the platform incorporates a range of state-of-the-art optimization strategies [12-14], allowing quantitative performance evaluation across a broad spectrum of approaches. A built-in reaction yield prediction model, trained on reaction data, is also provided to facilitate pre-evaluation and simulation of candidate experimental conditions. Furthermore, the platform is not limited to organic chemistry and can be extended to other domains involving black-box optimization problems, such as process systems, material design, and numerical optimization. It supports user-defined objective functions, allowing the design of application-specific optimization workflows. The platform is designed for potential compatibility with automated workflows, supporting closed-loop optimization cycles in which the proposed conditions can be automatically executed, analyzed, and used to inform subsequent iterations. Ultimately, the proposed framework provides a scalable foundation for addressing the complexity of reaction optimization while advancing the feasibility of autonomous experimental systems and the scientific insight they can provide.
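The kind of benchmarking loop described above might look like the following sketch. It is not the platform's actual interface, which the abstract does not describe. All function names, the two toy optimizers, the synthetic yield table, and the 90% threshold are assumptions made for illustration.

```python
import itertools
import numpy as np

def random_search(features, evaluate, budget, seed):
    """Baseline that ignores the features and samples candidates at random."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))[:budget]
    return [evaluate(int(i)) for i in order]

def greedy_neighbor(features, evaluate, budget, seed):
    """Start at a random candidate, then try the untried candidate nearest the best so far."""
    rng = np.random.default_rng(seed)
    tried = [int(rng.integers(len(features)))]
    trace = [evaluate(tried[0])]
    while len(tried) < budget:
        best = tried[int(np.argmax(trace))]
        dist = np.linalg.norm(features - features[best], axis=1)
        dist[tried] = np.inf  # exclude candidates already evaluated
        nxt = int(np.argmin(dist))
        tried.append(nxt)
        trace.append(evaluate(nxt))
    return trace

def iterations_to_threshold(trace, threshold):
    """1-based index of the first experiment reaching the threshold, or None."""
    for k, y in enumerate(trace, start=1):
        if y >= threshold:
            return k
    return None

def benchmark(yields, encodings, optimizers, budget, threshold, seeds):
    """Run every (encoding, optimizer) pair over all seeds on one yield table."""
    evaluate = lambda i: float(yields[i])
    results = {}
    for (e_name, feats), (o_name, opt) in itertools.product(encodings.items(), optimizers.items()):
        results[(e_name, o_name)] = [
            iterations_to_threshold(opt(feats, evaluate, budget, seed), threshold)
            for seed in seeds
        ]
    return results

# Synthetic scenario: 30 candidates whose yield depends on a 3-D descriptor.
rng = np.random.default_rng(0)
n = 30
descriptors = rng.normal(size=(n, 3))
yields = 100.0 * np.exp(-0.2 * np.sum((descriptors - descriptors[0]) ** 2, axis=1))

encodings = {"one_hot": np.eye(n), "descriptor": descriptors}
optimizers = {"random_search": random_search, "greedy_neighbor": greedy_neighbor}
results = benchmark(yields, encodings, optimizers, budget=10, threshold=90.0, seeds=(0, 1, 2))

for key, hits in results.items():
    print(key, hits)
```

Keeping encodings and optimizers as independent dictionaries means a new representation or algorithm can be added without touching the loop, and a user-defined objective only needs to replace `evaluate`.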

[References]

  1. Taylor, C.J., et al., A Brief Introduction to Chemical Reaction Optimization. Chemical Reviews, 2023. 123(6): p. 3089-3126.
  2. Shields, B.J., et al., Bayesian reaction optimization as a tool for chemical synthesis. Nature, 2021. 590(7844): p. 89-96.
  3. Ranković, B., et al., Bayesian optimisation for additive screening and yield improvements – beyond one-hot encoding. Digital Discovery, 2024. 3(4): p. 654-666.
  4. Häse, F., et al., Gryffin: An algorithm for Bayesian optimization of categorical variables informed by expert knowledge. Applied Physics Reviews, 2021. 8(3).
  5. Ross, J., et al., Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence, 2022. 4(12): p. 1256-1264.
  6. Fang, X., et al., Geometry-enhanced molecular representation learning for property prediction. Nature Machine Intelligence, 2022. 4(2): p. 127-134.
  7. Ni, Y., et al., Pre-training with fractional denoising to enhance molecular property prediction. Nature Machine Intelligence, 2024. 6(10): p. 1169-1178.
  8. Li, H., et al., A knowledge-guided pre-training framework for improving molecular representation learning. Nature Communications, 2023. 14(1): p. 7568.
  9. Probst, D., P. Schwaller, and J.-L. Reymond, Reaction classification and yield prediction using the differential reaction fingerprint DRFP. Digital Discovery, 2022. 1(2): p. 91-97.
  10. Schwaller, P., et al., Mapping the space of chemical reactions using attention-based neural networks. Nature Machine Intelligence, 2021. 3(2): p. 144-152.
  11. Velasco, P.Q., K. Hippalgaonkar, and B. Ramalingam, Emerging trends in the optimization of organic synthesis through high-throughput tools and machine learning. Beilstein Journal of Organic Chemistry, 2025. 21: p. 10-38.
  12. Rajabi-Kochi, M., et al., Adaptive representation of molecules and materials in Bayesian optimization. Chemical Science, 2025. 16(13): p. 5464-5474.
  13. Beck, A.G., et al., Paddy: Evolutionary Optimization Algorithm for Chemical Systems and Spaces. Digital Discovery, 2025.
  14. Xie, Y., et al., BoGrape: Bayesian optimization over graphs with shortest-path encoded. arXiv preprint arXiv:2503.05642, 2025.