Per- and polyfluoroalkyl substances (PFAS) have been widely used in various industrial and consumer products—such as water-repellent coatings, food packaging, and cleaning agents—due to their exceptional surface activity and chemical stability [1]. However, their extreme environmental persistence and bioaccumulative nature have led to increasing regulatory restrictions [2], prompting global efforts to develop safer alternatives [3]. Designing non-PFAS surfactants is particularly challenging, as it requires the simultaneous optimization of surface tension, critical micelle concentration (CMC), and solubility—properties that are difficult to balance through experimental approaches alone.
Artificial intelligence (AI) has become an increasingly important tool in molecular design. In particular, machine learning-based property prediction models and inverse design models enable rapid screening of chemical space and generation of candidate molecules that meet predefined property targets [4, 5]. Despite these advances, the performance of AI-driven design depends strongly on the availability of high-quality, well-curated datasets. For surfactants, where large-scale public data are scarce, domain-specific data curation and tailored model architectures are essential for building reliable molecular design frameworks.
To this end, we developed an integrated inverse design framework that combines reinforcement learning (RL) with combinatorial chemistry to explore vast chemical spaces and generate novel non-PFAS surfactant candidates. Specifically, we constructed a dataset of concentration-dependent surface tension values for approximately 3,000 molecules using density functional theory (DFT) calculations, and supplemented it with around 400 literature-derived CMC values along with additional computationally estimated CMC values. Using this dataset, we benchmarked several foundation models pre-trained on large-scale molecular datasets, including MolFormer [6], GEM [7], and Frad [8], to identify architectures that remain robust for surfactant property prediction in data-scarce settings. Based on this evaluation, we fine-tuned selected models to obtain predictors for both CMC (R² ≈ 0.91) and surface tension (R² ≈ 0.80). These predictive models were then incorporated into our RL-based combinatorial chemistry framework [9], which was adapted in this study to generate molecular structures optimized for multiple target properties, including low surface tension (<30 mN/m), acceptable CMC, and solubility suitable for real-world applications.
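The multi-property target described above can be illustrated with a minimal sketch of a reward signal, assuming the RL agent receives predicted properties for each generated molecule. Only the surface tension limit (<30 mN/m) comes from the text; the CMC and solubility thresholds, the log10 units, and the names `PredictedProperties`, `meets_all_targets`, and `reward` are hypothetical placeholders, and the reward actually used in the framework may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictedProperties:
    """Model-predicted properties for one candidate (field names are hypothetical)."""
    surface_tension_mn_per_m: float  # predicted surface tension, mN/m
    log_cmc: float                   # assumed log10 of CMC in mol/L
    log_solubility: float            # assumed log10 of solubility in mol/L


# Only the surface tension limit is stated in the text.
SURFACE_TENSION_MAX = 30.0
# Placeholder thresholds, chosen purely for illustration.
LOG_CMC_MAX = -2.0
LOG_SOLUBILITY_MIN = -4.0


def target_checks(p: PredictedProperties) -> list[bool]:
    """Evaluate each property target independently."""
    return [
        p.surface_tension_mn_per_m < SURFACE_TENSION_MAX,
        p.log_cmc <= LOG_CMC_MAX,
        p.log_solubility >= LOG_SOLUBILITY_MIN,
    ]


def meets_all_targets(p: PredictedProperties) -> bool:
    """True only when every target is satisfied simultaneously."""
    return all(target_checks(p))


def reward(p: PredictedProperties) -> float:
    """Fraction of satisfied targets, in [0, 1]; one simple way to shape a reward."""
    checks = target_checks(p)
    return sum(checks) / len(checks)
```

A candidate counts as a hit only when `meets_all_targets` returns `True`, while the fractional `reward` gives the agent partial credit for molecules that satisfy some but not all targets.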
The trained RL framework generated 21 candidate molecules that simultaneously satisfied all target property requirements. Five of them matched known surfactant structures, supporting the validity of the model. The remaining 16 structures are novel and surfactant-like, and warrant experimental validation. Our framework not only accelerates surfactant discovery but also expands the accessible chemical space and improves understanding of structure–property relationships, paving the way for more efficient and sustainable materials innovation.
[1] Glüge, J., et al., An overview of the uses of per- and polyfluoroalkyl substances (PFAS). Environmental Science: Processes & Impacts, 2020. 22(12): p. 2345-2373.
[2] Cousins, I.T., et al., The high persistence of PFAS is sufficient for their management as a chemical class. Environmental Science: Processes & Impacts, 2020. 22(12): p. 2307-2312.
[3] Kwiatkowski, C.F., et al., Scientific basis for managing PFAS as a chemical class. Environmental Science & Technology Letters, 2020. 7(8): p. 532-543.
[4] Butler, K.T., et al., Machine learning for molecular and materials science. Nature, 2018. 559(7715): p. 547-555.
[5] Elton, D.C., et al., Deep learning for molecular design—a review of the state of the art. Molecular Systems Design & Engineering, 2019. 4(4): p. 828-849.
[6] Ross, J., et al., Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence, 2022. 4(12): p. 1256-1264.
[7] Fang, X., et al., Geometry-enhanced molecular representation learning for property prediction. Nature Machine Intelligence, 2022. 4(2): p. 127-134.
[8] Ni, Y., et al., Pre-training with fractional denoising to enhance molecular property prediction. Nature Machine Intelligence, 2024. 6(10): p. 1169-1178.
[9] Kim, H., et al., Materials discovery with extreme properties via reinforcement learning-guided combinatorial chemistry. Chemical Science, 2024. 15(21): p. 7908-7925.