Accurate machine learning force fields (MLFFs) enable large-scale atomistic simulations with near first-principles accuracy. However, their development remains computationally demanding due to the need for training datasets that are broad, diverse, and sufficiently redundant to capture local atomic variations. In this work, we present a data-driven framework that leverages unsupervised learning and deep learning to efficiently select representative atomic configurations from Molecular Dynamics (MD) trajectories of metal nanoparticles.
Building on our previous study of structural classification in Ag and Cu nanoparticles (100–200 atoms) using dimensionality reduction and unsupervised clustering based on Common Neighbor Analysis (CNA) features, we extend our methodology to generate optimal training sets for MLFF development. We found that both K-Means and Gaussian Mixture Models (GMM) effectively identified physically meaningful structural classes at varying levels of detail. The optimal number of classes was determined using evaluation metrics such as the silhouette score and the gap statistic. This classification behavior aligns well with the requirements for constructing MLFF training datasets, which must be structurally diverse and locally redundant, yet small enough to avoid unnecessary computational cost.
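As an illustration of how the number of structural classes might be chosen with the silhouette score, the following is a minimal Python sketch using only the standard library. It is not the implementation used in this work: the toy 2D points stand in for CNA-derived feature vectors, the helper names `kmeans`, `silhouette`, and `choose_k` are hypothetical, and the farthest-point initialization is an assumption made only to keep the example deterministic.

```python
import math

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster happens to become empty.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    # Final assignment against the final centers.
    return [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0 by convention."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def choose_k(points, candidates):
    """Return the candidate k with the highest mean silhouette, plus all scores."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores

# Toy data (assumption): three well-separated groups of five 2D points each.
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5), (0.0, 0.0)]
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

best_k, scores = choose_k(points, range(2, 6))
print(best_k, scores)
```

On this toy set the silhouette score peaks at three classes. For real feature vectors, a library implementation such as scikit-learn's `KMeans` and `silhouette_score` would typically replace these helpers.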
Specifically, we apply K-Means, GMM, and Graph Neural Network (GNN)-based embeddings to thousands of fixed-size atomic structures sampled from MD trajectories. These techniques consistently identify 10–20 core structural classes, capturing both global shape diversity and local configurational variance. This clustering-based approach enables the selection of a minimal yet representative subset of configurations, significantly reducing the cost of subsequent labeling with high-level quantum calculations.
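One simple way to turn cluster assignments into a compact training subset is to keep, for each class, the few configurations closest to the class centroid. The sketch below assumes that feature vectors and cluster labels are already available; the function name `select_representatives`, the parameter `n_per_cluster`, and the toy data are hypothetical, and nearest-to-centroid selection is only one possible criterion, not necessarily the one used in this work.

```python
import math
from collections import defaultdict

def select_representatives(features, labels, n_per_cluster=1):
    """Return sorted indices of the configurations closest to each cluster centroid."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    selected = []
    for idxs in members.values():
        dim = len(features[idxs[0]])
        # Centroid of the cluster in feature space.
        centroid = [sum(features[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        # Rank cluster members by distance to the centroid and keep the closest ones.
        ranked = sorted(idxs, key=lambda i: math.dist(features[i], centroid))
        selected.extend(ranked[:n_per_cluster])
    return sorted(selected)

# Toy data (assumption): six configurations described by 2D features, two classes.
features = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (5.0, 5.0), (5.1, 5.0), (5.4, 5.0)]
labels = [0, 0, 0, 1, 1, 1]

one_each = select_representatives(features, labels)                   # one per class
two_each = select_representatives(features, labels, n_per_cluster=2)  # some local redundancy
print(one_each, two_each)
```

Raising `n_per_cluster` trades a larger labeling budget for more local redundancy within each structural class.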
Our ongoing work incorporates hyperparameter tuning, local redundancy sampling, and energy-aware refinement to further enhance coverage of the potential energy surface. We envision this data selection strategy as a critical step toward autonomous, closed-loop workflows in materials modeling. This contribution highlights how machine learning can accelerate simulation-ready data generation, paving the way for efficient and scalable force field development.