2025 AIChE Annual Meeting

(121j) Reproducible Free Energy Surfaces from Machine-Learned Nucleation Collective Variables

Authors

Matteo Salvalaglio - Presenter, University College London
Florian Dietrich, University College London
The efficient calculation of collective variables (CVs) is key to deploying enhanced sampling methods to study the mechanisms, thermodynamics, and kinetics of complex molecular processes. In particular, the computation of nucleation CVs represents a significant bottleneck for applying enhanced sampling methods to investigate nucleation processes in realistic environments. Traditional CVs for nucleation often involve complex combinations of local atomic environment descriptors, requiring substantial computational resources for on-the-fly evaluation. Recently, we addressed this issue by introducing a graph neural network (GNN) approach that approximates nucleation CVs by constructing a molecular graph directly from atomic coordinates [1].
This approach achieves orders-of-magnitude gains in computational efficiency compared to the direct calculation of classical CVs, both in post-processing and in on-the-fly biasing via pulling, umbrella sampling, and metadynamics simulations. By learning the relevant structural features directly from atomic coordinates, the GNN bypasses the computational cost associated with explicitly evaluating roto-translationally and permutationally invariant symmetry functions [1].
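As a minimal illustration of this idea (not the architecture published in [1]), the sketch below uses PyTorch to map atomic coordinates to a differentiable scalar CV through a crude message-passing scheme; the cutoff, layer widths, and mean pooling are placeholder assumptions, and the biasing force follows from automatic differentiation.

import torch
import torch.nn as nn

class MessagePassingCV(nn.Module):
    """Toy permutation-invariant graph model: atomic coordinates -> scalar CV."""
    def __init__(self, hidden=32, n_layers=2, cutoff=1.5):
        super().__init__()
        self.cutoff = cutoff
        self.embed = nn.Linear(1, hidden)                    # embed per-atom coordination counts
        self.updates = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * hidden, hidden), nn.SiLU())
            for _ in range(n_layers)
        ])
        self.readout = nn.Linear(hidden, 1)                  # pooled node features -> scalar CV

    def forward(self, pos):                                  # pos: (N, 3) atomic coordinates
        dist = torch.cdist(pos, pos)                         # pairwise distances
        adj = ((dist < self.cutoff) & (dist > 0.0)).float()  # molecular-graph adjacency
        h = self.embed(adj.sum(dim=1, keepdim=True))         # initial node features
        for update in self.updates:
            msg = adj @ h                                    # aggregate neighbour features
            h = update(torch.cat([h, msg], dim=-1))          # update node states
        return self.readout(h.mean(dim=0)).squeeze()         # mean pooling -> differentiable CV

# The model CV is differentiable with respect to the coordinates, so biasing
# forces can be obtained on the fly by automatic differentiation:
pos = (4.0 * torch.rand(64, 3)).requires_grad_(True)         # placeholder configuration
cv = MessagePassingCV()(pos)
forces = -torch.autograd.grad(cv, pos)[0]                    # -dCV/dx, usable in a bias potential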

Extending these concepts, machine learning models also enable the approximation of non-differentiable molecular descriptors as CVs. By training differentiable surrogate models on non-differentiable descriptors, the resulting model CVs can be used for enhanced sampling, and the free energy surfaces obtained can then be reweighted back to the space of the original descriptor. This effectively opens up the design space for CVs that match physical intuition more closely, and it allows powerful existing structural descriptors, previously relegated to the post-processing of trajectories, to be deployed in biasing applications for the efficient calculation of free energy surfaces (FES).
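A minimal sketch of this workflow is given below, with hypothetical array names and random placeholder data standing in for trajectory output: a differentiable surrogate is fitted to a non-differentiable (here, integer-valued) descriptor, the surrogate CV is biased during sampling, and the unbiased FES in the original descriptor space is recovered by standard exponential reweighting of the biased frames.

import numpy as np
import torch
import torch.nn as nn

# 1) Fit a smooth, differentiable surrogate s_ML(x) to a non-differentiable descriptor q(x).
features = torch.rand(5000, 10)                              # placeholder differentiable inputs
descriptor = torch.round(features.sum(dim=1, keepdim=True))  # step-like, non-differentiable target
surrogate = nn.Sequential(nn.Linear(10, 64), nn.SiLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(surrogate(features), descriptor)
    loss.backward()
    opt.step()

# 2) After biasing s_ML in an enhanced-sampling run, reweight the sampled frames
#    back to the original descriptor space to recover its unbiased FES.
kT = 2.494                                                   # kJ/mol at 300 K
q_traj = np.random.rand(10000)                               # descriptor along the biased trajectory
bias = np.random.rand(10000)                                 # bias potential V(s_ML) per frame, kJ/mol
weights = np.exp(bias / kT)                                  # exp(+beta V) reweighting factors
hist, edges = np.histogram(q_traj, bins=50, weights=weights, density=True)
fes = -kT * np.log(np.clip(hist, 1e-12, None))               # FES in the original descriptor space
fes -= fes.min()

In practice the descriptor values and the bias potential would come from the simulation engine and the enhanced-sampling run; the random arrays here only stand in for that data.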

Despite these computational advantages, machine-learned CVs (MLCVs) introduce challenges for FES reproducibility due to the inherent variability of the training process, the choice of ML hyperparameters, and the dependence on the training data. Even with a consistent model architecture, different training instances can lead to variations in the learned CV and, crucially, in the resulting free energy surfaces. We show that these effects can be significantly mitigated by adopting the geometric free energy as a standard representation of equilibrium probability distributions and by normalizing the MLCV gradients [2].
These measures facilitate interpreting and comparing results obtained with different MLCVs and training procedures, and they ensure that the physical interpretation of calculations based on MLCVs is independent of the specific training instance and hyperparameter choices.
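The sketch below illustrates the practical difference, under one common (co-area-type) definition in which the geometric distribution weights each sampled frame by the norm of the CV gradient in addition to the usual bias-reweighting factor; this is an assumed convention for illustration, and all array names and data are placeholders.

import numpy as np

kT = 2.494                                   # kJ/mol at 300 K
s_traj = np.random.rand(10000)               # MLCV value for each trajectory frame
grad_norm = 0.5 + np.random.rand(10000)      # |ds_ML/dx| per frame, e.g. from autograd
bias = np.random.rand(10000)                 # bias potential V(s_ML) per frame, kJ/mol

w_std = np.exp(bias / kT)                    # standard bias reweighting
w_geo = w_std * grad_norm                    # geometric reweighting: extra |grad s| factor

def fes(values, weights, bins=50):
    hist, edges = np.histogram(values, bins=bins, weights=weights, density=True)
    f = -kT * np.log(np.clip(hist, 1e-12, None))
    return f - f.min(), 0.5 * (edges[:-1] + edges[1:])

f_standard, centers = fes(s_traj, w_std)     # depends on how the MLCV is parametrized
f_geometric, _ = fes(s_traj, w_geo)          # tied to the CV level sets rather than to the parametrization

In this picture, the geometric estimate depends on the level sets of the CV rather than on its specific parametrization, which is what makes results from independently trained MLCVs comparable.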

[1] Dietrich, F. M., Advincula, X. R., Gobbo, G., Bellucci, M. A., & Salvalaglio, M. (2024). Machine learning nucleation collective variables with graph neural networks. Journal of Chemical Theory and Computation, 20(4), 1600-1611.
[2] Dietrich, F. M., & Salvalaglio, M. (2025). On the Reproducibility of Free Energy Surfaces in Machine-Learned Collective Variable Spaces.