2024 AIChE Annual Meeting
(203h) Pre-Training Large Language Models for Solvent-Solute Predictions
Authors
McGill, C. - Presenter, Massachusetts Institute of Technology
Williams, M., University of Rhode Island
Solubility is an important property for many industries, including petrochemicals, environmental chemistry, and pharmaceuticals. Knowing the solubility limit of a substance in a chosen solvent system is critical to chemical process design. Traditional design requires a lengthy screening process to determine the best solvents for a chemical process. Machine learning has the potential to significantly reduce the time required for screening and, thereby, costs. Methods for solubility prediction already exist, but their effectiveness is often hindered by limited access to experimental data. Existing solubility datasets contain a diverse set of solutes, but the selection of solvents is more limited.
To overcome these database deficiencies, this study explores solubility prediction using transfer learning, a machine learning approach in which a model first acquires general knowledge from broad data and is later adapted to a specialized task. We use six pre-trained models as the basis for this study: three large language models and three graph-based models. These models were initially trained on extensive unlabeled chemical databases. We fine-tuned each pre-trained model on common chemical benchmark datasets to assess general model capability. Molformer-XL was the most capable of the benchmarked models, so we used it as the basis for our exploration of transfer learning in solubility.
Solubility is a thermodynamic property, so models of related steps in a thermodynamic cycle can be combined to predict it. Data are relatively abundant in the literature for both aqueous solubility and gas-solution solvation energy, but sparse for solubility in organic solvents. Here we use separate fine-tuned models for aqueous solubility and gas-solution solvation energy to create a composite prediction of solubility in organic solvents. We then evaluate the composite prediction against the available organic solubility data.
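The abstract does not spell out the thermodynamic cycle, so the following is only a minimal sketch of one common form of it, not the authors' implementation. It assumes that the solid-state contribution of the solute is the same in every solvent and therefore cancels, that solutions are dilute enough to ignore activity coefficients, and that solvation free energies use a 1 mol/L standard state in both phases. Under those assumptions, log10 S_org = log10 S_aq + (ΔG_solv,aq − ΔG_solv,org) / (RT ln 10). The function name, argument names, and units are hypothetical, and the inputs stand in for the outputs of the two fine-tuned models.

```python
import math

# Gas constant in kcal/(mol*K); units are an assumption of this sketch.
R_KCAL = 1.987204e-3


def composite_log_s_organic(log_s_aq, dg_solv_aq, dg_solv_org, temperature=298.15):
    """Combine two predictions into an organic-solvent solubility estimate.

    log_s_aq    : predicted log10 aqueous solubility (mol/L)
    dg_solv_aq  : predicted solvation free energy in water (kcal/mol)
    dg_solv_org : predicted solvation free energy in the organic solvent (kcal/mol)
    Returns the estimated log10 solubility (mol/L) in the organic solvent.
    """
    rt_ln10 = R_KCAL * temperature * math.log(10.0)
    # A more negative solvation energy in the organic solvent than in water
    # raises the estimated solubility relative to the aqueous value.
    return log_s_aq + (dg_solv_aq - dg_solv_org) / rt_ln10


# Illustrative inputs only; these are not measured or reported values.
estimate = composite_log_s_organic(log_s_aq=-3.0, dg_solv_aq=-4.0, dg_solv_org=-6.0)
print(round(estimate, 2))
```

With these placeholder inputs, a solute that is 2 kcal/mol better solvated in the organic solvent than in water gains roughly 1.5 log units of solubility at 298.15 K, giving an estimate of about -1.53.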
We demonstrate that using pre-trained models improves generalizability to new solutes and solvents. We also show that allowing the model to learn different representations for solutes and solvents improves model performance. We further discuss possible model architectures that can incorporate other thermodynamic properties to increase performance further.
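Generalizability to new solutes or solvents is typically measured by holding out whole compounds rather than individual data points. The abstract does not describe the splitting procedure used, so the sketch below is an assumption about how such an evaluation could be set up. The function name, the record layout, and the placeholder values are all hypothetical.

```python
import random


def split_by_held_out(records, key, test_fraction=0.2, seed=0):
    """Split records so that test-set entities never appear in training.

    records : list of dicts with "solute" and "solvent" SMILES strings
    key     : "solute" or "solvent", the entity type to hold out
    """
    entities = sorted({r[key] for r in records})
    rng = random.Random(seed)  # seeded for a reproducible split
    rng.shuffle(entities)
    n_test = max(1, round(len(entities) * test_fraction))
    held_out = set(entities[:n_test])
    train = [r for r in records if r[key] not in held_out]
    test = [r for r in records if r[key] in held_out]
    return train, test


# Placeholder records: the log_s values are dummies, not measurements.
solutes = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCCC"]
solvents = ["O", "CO"]
records = [{"solute": s, "solvent": v, "log_s": 0.0} for s in solutes for v in solvents]

train, test = split_by_held_out(records, key="solute")
print(len(train), len(test))
```

Passing `key="solvent"` instead holds out entire solvents, which probes the scarcer axis of the data noted above.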