The success of a closed-loop molecular discovery effort hinges on the quality and quantity of the data used to train the machine learning models. In chemistry and materials science, data for a given target property can often be collected or generated at several distinct levels of quality, such as experimental measurements, computational approximations, or simulations at different levels of theory. Each of these methods has a different fidelity to the true property being optimized, as well as a different cost: high-fidelity methods tend to be slower and more expensive, while low-fidelity methods are faster and cheaper but introduce more noise and/or bias into the results. Several strategies have been proposed and implemented for integrating high- and low-fidelity data when both are available, including computational funnels (tiered screens), multi-target learning, transfer learning, and delta machine learning (Δ-ML). However, the choice among these multi-fidelity strategies for a closed-loop molecular discovery effort is often rather arbitrary, because the quality and quantity of data needed at each level of fidelity is rarely well understood a priori. In this work, we propose several quantitative metrics in both chemical space and property space to guide this choice of modeling strategy. Alongside more common metrics for overlap and correlation, we show how the Kullback-Leibler (KL) divergence from the low-fidelity to the high-fidelity distribution in chemical fingerprint space, and the roughness index of the difference between low- and high-fidelity property values, can be used to characterize multi-fidelity datasets.
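To make the chemical-space metric concrete, the sketch below shows one simple way such a KL divergence could be estimated. The Morgan-fingerprint featurization and the independent per-bit Bernoulli approximation of each dataset's fingerprint distribution are illustrative assumptions, not necessarily the estimator used in this work; the roughness index would be computed analogously on the per-molecule differences between high- and low-fidelity property values.

```python
# Minimal sketch of a fingerprint-space KL divergence between a low-fidelity
# and a high-fidelity dataset. Assumptions (illustrative only): RDKit Morgan
# fingerprints and an independent per-bit Bernoulli model of each dataset's
# fingerprint distribution.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_matrix(smiles_list, radius=2, n_bits=2048):
    """Stack Morgan fingerprints for a list of SMILES into an (N, n_bits) array."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # skip unparsable SMILES
            continue
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.float64)
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
    return np.vstack(rows)

def fingerprint_kl(low_smiles, high_smiles, eps=1e-6):
    """D_KL(low || high), summed over per-bit Bernoulli bit frequencies."""
    p = morgan_matrix(low_smiles).mean(axis=0).clip(eps, 1 - eps)
    q = morgan_matrix(high_smiles).mean(axis=0).clip(eps, 1 - eps)
    return float(np.sum(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))))
```

Under this approximation, a small divergence indicates that the low-fidelity library occupies roughly the same region of fingerprint space as the high-fidelity one, while a large divergence flags a chemical-space mismatch between the two fidelity levels.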
We conducted a comprehensive benchmark of multi-fidelity methods on both synthetic and real-world datasets and evaluated the relationship between our proposed metrics and modeling performance in each case. The synthetic dataset let us systematically add noise and bias in known quantities, decoupling the two effects and isolating the impact of each on multi-fidelity modeling performance. The real-world datasets, namely optical properties (experiments and time-dependent density functional theory calculations), solubility (experiments and COSMO-RS calculations), and drug efficacy/potency (single-dose and dose-response measurements), validated the practical utility of the insights gained from the synthetic data. We found that Δ-ML and multi-target learning performed better when the low-fidelity dataset had high noise or bias relative to the corresponding high-fidelity dataset; the exception was testing on out-of-domain samples, where greater noise/bias in the low-fidelity data helped prevent overfitting for the transfer learning and surrogate-Δ-ML strategies. Finally, we demonstrated scenarios in which multi-fidelity methods offer a concrete performance and/or cost improvement over their single-fidelity alternatives. The quantitative metrics and insights developed in this work will enable better, data-driven choices for more time- and cost-efficient closed-loop molecular discovery.
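For concreteness, the sketch below shows how noise and bias in known quantities can be injected to build a synthetic low-fidelity signal, and how a Δ-ML correction model can be fit on top of it. The linear bias model, the random-forest regressor, and the function names are illustrative assumptions, not the benchmark's exact configuration.

```python
# Minimal sketch of the synthetic-data setup and a Δ-ML baseline.
# Assumptions (illustrative only): a linear bias model for the synthetic
# low-fidelity labels and a random forest for the Δ correction; the
# benchmark's actual models and noise/bias schedules may differ.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_low_fidelity(y_high, noise_sd, bias_slope=1.0, bias_offset=0.0, seed=0):
    """Synthesize low-fidelity labels with controlled noise and bias."""
    rng = np.random.default_rng(seed)
    return bias_slope * y_high + bias_offset + rng.normal(0.0, noise_sd, y_high.shape)

def fit_delta_ml(X, y_low, y_high):
    """Δ-ML: learn the correction Δ = y_high - y_low on paired samples."""
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X, y_high - y_low)
    return model

def predict_high(model, X_new, y_low_new):
    """Predict high-fidelity values as low-fidelity values plus the learned Δ."""
    return y_low_new + model.predict(X_new)
```

Note that plain Δ-ML requires a low-fidelity value for every new sample at prediction time; when that value is itself predicted by a learned model rather than computed or measured, one obtains the surrogate-Δ-ML strategy evaluated above.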