2025 AIChE Annual Meeting

(588cc) Machine Learning Models for Predicting Pharmaceutically Relevant C–N Coupling Reactivity from High-Throughput Data

We report machine learning models that predict the reactivity of pharmaceutically relevant palladium-catalyzed C–N coupling reactions. A dataset of 4,200 unique products was generated de novo using nanomole-scale, automation-compatible high-throughput experimentation with LiOTMS as the base. Reactivity was modeled as a classification problem to account for inherent experimental noise in the data. The size and diversity of this dataset enabled robust model development and systematic benchmarking under five distinct data-splitting strategies, designed to evaluate both interpolation and extrapolation across the substrate space. The resulting models showed high predictive accuracy across all splits and performed strongly in prospective predictions on external validation libraries containing previously unseen substrates. We also show that the models can be trained effectively on substantially smaller datasets, provided the substrate space is well represented. These findings highlight the potential of integrating such predictive models into medicinal chemistry workflows to increase the proportion of successful C–N couplings. Accurate in silico prediction of reaction performance would enable focused allocation of time and resources and accelerate the discovery process.
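
The abstract does not name its five splitting strategies, so the following is only a generic sketch of the distinction it draws between interpolation and extrapolation. It contrasts a random split, where test substrates usually also appear in training, with a leave-substrates-out split, where whole substrates are withheld. The function names, the record fields (`amine`, `halide`), and the toy 10 x 10 library are all hypothetical and are not taken from the work itself.

```python
import random
from itertools import product


def random_split(records, test_fraction=0.2, seed=0):
    """Interpolation-style split: reactions are shuffled and divided at random,
    so most test substrates also appear somewhere in the training set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def leave_substrates_out_split(records, key, test_fraction=0.2, seed=0):
    """Extrapolation-style split: hold out whole substrates, so no test
    reaction shares its `key` substrate with any training reaction."""
    rng = random.Random(seed)
    substrates = sorted({r[key] for r in records})
    rng.shuffle(substrates)
    n_held = max(1, int(round(len(substrates) * test_fraction)))
    held_out = set(substrates[:n_held])
    train = [r for r in records if r[key] not in held_out]
    test = [r for r in records if r[key] in held_out]
    return train, test


# Hypothetical toy library: 10 amines x 10 aryl halides = 100 reactions.
amines = [f"amine_{i}" for i in range(10)]
halides = [f"halide_{j}" for j in range(10)]
records = [{"amine": a, "halide": h} for a, h in product(amines, halides)]

train_r, test_r = random_split(records)
train_g, test_g = leave_substrates_out_split(records, key="amine")

# In the grouped split, the held-out amines never occur in training.
unseen = {r["amine"] for r in test_g} - {r["amine"] for r in train_g}
print(len(test_r), len(test_g), len(unseen))
```

Accuracy measured on the grouped split is typically the more conservative estimate of performance on new substrates, which is the setting the prospective validation libraries are described as probing.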