In recent years, machine learning (ML) methods have transformed computational chemistry and materials research. ML algorithms rely on learned representations that serve as a “mathematical proxy” for the underlying chemistry. Molecular featurization—how we transform atoms and molecules into mathematical signals suitable for machine-learning thermodynamic quantities—plays an important role in our ability to learn material properties and observable quantities. There are many ways to encode raw chemical data, including the popular SMILES strings, symmetrized correlation functions, and implicit representations learned by deep model architectures. Unfortunately, while these representations have demonstrated unparalleled success in predictive modeling, their high dimensionality often makes it difficult to extract meaningful scientific hypotheses or conclusions from their performance.
In this talk, I will focus on how we assess and interpret models built on such molecular representations, and in particular on how to extract actionable chemical and physical principles from models trained on chemical data, a task traditionally approached through unsupervised analyses such as principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE). However, these methods only ask, “What makes these data points similar?” rather than “In what ways does my model see these points as similar?” The latter question, particularly in the context of supervised ML models, is more powerful and informative for establishing structure–property relationships. Our results show that this multi-objective framing, with its inherent interpretability, reveals underlying trends across many ML tasks, from materials classification to the construction of machine-learning potentials to non-linear regression.
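The distinction between the two questions can be illustrated with a toy sketch. Below, on synthetic data with an assumed linear structure–property relationship (all feature names, dimensions, and the linear model are illustrative assumptions, not the methods of the talk), the leading PCA direction reflects only variance in the features, while the weights of a fitted supervised model recover the feature direction that is actually relevant to the property:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical molecular featurization: 200 molecules, 5 isotropic features.
X = rng.normal(size=(200, 5))
# Assumed ground truth: only the first feature controls the property y.
w_true = np.array([3.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=200)

# Unsupervised view ("what makes these points similar?"):
# PCA finds the direction of maximal feature variance, blind to y.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]

# Supervised view ("in what ways does my model see them as similar?"):
# a least-squares fit exposes the feature direction the model uses.
w_fit, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)
model_dir = w_fit / np.linalg.norm(w_fit)

# Alignment of each direction with the property-relevant axis;
# the supervised direction should align closely, PC1 need not.
true_dir = w_true / np.linalg.norm(w_true)
print("PCA alignment:       ", abs(pc1 @ true_dir))
print("supervised alignment:", abs(model_dir @ true_dir))
```

Because the features here are isotropic, PC1 is essentially arbitrary, while the fitted weights align almost perfectly with the property-relevant direction; this is the sense in which interrogating a supervised model can be more informative than unsupervised embeddings alone.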