Small-molecule solubility is a critically important property that affects the atom efficiency, environmental impact, and phase behavior of synthetic processes. In pharmaceutical development, organic solubility complicates synthesis and purification [1], and aqueous solubility determines in vivo efficacy [2]. However, experimental methods for determining solubility are notoriously time- and resource-intensive [3]. A priori estimation of log S (the logarithm of solubility) has therefore long been of immense interest to the chemical sciences. While neural network-based models have achieved high predictive accuracy [4], poor interpretability limits their utility in molecular design. Here, we introduce causal-fastsolv, a causal machine learning framework for solubility prediction built on the chemprop [5] and DAGMA [6] architectures. This model matches state-of-the-art prediction accuracy while unlocking model interpretability. Trained on the BigSolDB dataset, causal-fastsolv demonstrates strong extrapolative performance on the benchmark Leeds and SolProp datasets via intervention-based inference. Counterfactual inference also supports human-in-the-loop optimization of molecular structure, which we demonstrate by predicting the solubility of a non-nucleoside reverse transcriptase inhibitor (NNRTI) seed structure and its molecular derivatives conceptualized by Cisneros et al. [7]. Finally, we integrate causal-fastsolv with the molecular optimization algorithm EvoMol [8] to perform inverse molecular design, yielding both constrained and unconstrained soluble analogs of the NNRTI seed structure. To our knowledge, this is the first application of causal machine learning to molecular property prediction. While this application focuses on organic solubility, we believe the approach can be generalized to diverse molecular property prediction and molecular optimization tasks.
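For readers unfamiliar with DAGMA, the sketch below illustrates the log-determinant acyclicity score introduced by Bello et al., h(W) = -log det(sI - W∘W) + d log s, which is zero when the weighted adjacency matrix W describes a directed acyclic graph and positive when it contains a cycle (within the domain where sI - W∘W is an M-matrix). This is only a minimal NumPy illustration. The function name `dagma_acyclicity` and the example matrices are hypothetical, and how causal-fastsolv incorporates this constraint is not specified here.

```python
import numpy as np


def dagma_acyclicity(W: np.ndarray, s: float = 1.0) -> float:
    """Log-det acyclicity score h(W) = -log det(s*I - W∘W) + d*log(s).

    Hypothetical helper for illustration; not the DAGMA package API.
    """
    d = W.shape[0]
    # W * W is the elementwise (Hadamard) square of the adjacency weights.
    M = s * np.eye(d) - W * W
    sign, logabsdet = np.linalg.slogdet(M)
    # A non-positive determinant means W is outside the valid domain.
    # (This check is necessary but not sufficient for the M-matrix condition.)
    if sign <= 0:
        raise ValueError("W lies outside the domain of the log-det score")
    return float(-logabsdet + d * np.log(s))


# Hypothetical 3-node chain x0 -> x1 -> x2 (strictly upper triangular, acyclic).
W_dag = np.array([
    [0.0, 0.8, 0.0],
    [0.0, 0.0, -0.5],
    [0.0, 0.0, 0.0],
])

# Hypothetical 2-node graph with a cycle x0 <-> x1.
W_cyc = np.array([
    [0.0, 0.5],
    [0.5, 0.0],
])

h_dag = dagma_acyclicity(W_dag)  # approximately 0 for an acyclic graph
h_cyc = dagma_acyclicity(W_cyc)  # strictly positive for a cyclic graph
print(f"h(W_dag) = {h_dag:.4f}, h(W_cyc) = {h_cyc:.4f}")
```

In structure learning, a score of this kind is typically driven toward zero during optimization so that the learned graph is acyclic, which is what permits the intervention-based and counterfactual queries described above.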
References
1. Tzschucke, C. C., Markert, C., Bannwarth, W., Roller, S., Hebel, A., & Haag, R. (2002). Angewandte Chemie International Edition, 41(21), 3964-4000.
2. Barrett, J. A., Yang, W., Skolnik, S. M., Belliveau, L. M., & Patros, K. M. (2022). Drug Discovery Today, 27(5), 1315-1325.
3. Murdande, S. B., Pikal, M. J., Shanker, R. M., & Bogner, R. H. (2011). Pharmaceutical Development and Technology, 16(3), 187-200.
4. Attia, L., Burns, J. W., Doyle, P. S., & Green, W. H. (2024). ChemRxiv.
5. Heid, E., Greenman, K. P., Chung, Y., Li, S. C., Graff, D. E., Vermeire, F. H., ... & McGill, C. J. (2023). Journal of Chemical Information and Modeling, 64(1), 9-17.
6. Bello, K., Aragam, B., & Ravikumar, P. (2022). Advances in Neural Information Processing Systems, 35, 8226-8239.
7. Cisneros, J. A., Robertson, M. J., Mercado, B. Q., & Jorgensen, W. L. (2017). ACS Medicinal Chemistry Letters, 8(1), 124-127.
8. Leguy, J., Cauchy, T., Glavatskikh, M., Duval, B., & Da Mota, B. (2020). Journal of Cheminformatics, 12, 1-19.