2022 Annual Meeting

(416c) Projecting the Effectiveness of Deep Ensembles

Authors

McGill, C. - Presenter, Massachusetts Institute of Technology
Green, W., Massachusetts Institute of Technology

Deep ensembling is a core practice in building robust machine learning models, improving prediction performance by combining independent submodels. In this work, we statistically analyze the practice of ensembling in chemical systems in order to quantify the contribution of model variance error, evaluate the use of ensemble metrics to estimate model uncertainty, and project the performance of models using different ensemble sizes. This ability to project ensemble performance addresses the key tension of ensemble training: the diminishing returns from increasingly large ensembles and the related uncertainty about how many ensemble submodels to train.

Here we introduce a method of Bayesian inference to characterize the distribution of potential model predictions from which we sample during ensembling. This approach is applied entirely after model training, requiring no alteration of the original model architecture. By characterizing this distribution, we are able to separate the portion of model prediction errors that is due to model variance from the portion that would remain even with an infinite number of ensemble submodels. This separation lets us identify which model regimes are subject to errors correctable by ensembling and which are not. Further, we are able to use this distribution to estimate with good accuracy the expected performance of a model with a larger number of ensemble submodels. We demonstrate the robustness of these projections on common benchmark datasets as well as on an artificially constructed dataset in which the level of data noise can be controlled. We also evaluate the cases in which the variance of ensemble predictions is useful as a metric of uncertainty and show the limitations of this metric, whether calibrated or uncalibrated, in predicting nonvariance errors.
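The projection idea above can be sketched with a simple variance decomposition: if submodel predictions at a point scatter around a mean with variance σ², then an M-model ensemble average has expected squared error (μ − y)² + σ²/M. The function below is an illustrative minimal sketch under that assumption, not the authors' Bayesian inference method; the name `project_ensemble_mse` is hypothetical.

```python
import numpy as np

def project_ensemble_mse(preds, y, target_size):
    """Project the MSE of an ensemble averaged over `target_size` submodels.

    preds: array of shape (n_models, n_points), predictions from trained submodels
    y: array of shape (n_points,), true targets
    target_size: hypothetical ensemble size M to project to
    """
    n = preds.shape[0]
    mean = preds.mean(axis=0)          # n-model ensemble prediction
    var = preds.var(axis=0, ddof=1)    # per-point variance across submodels
    # E[(mean_n - y)^2] = (mu - y)^2 + sigma^2 / n, so subtract the variance
    # inflation of the observed n-model ensemble, then add back the smaller
    # variance term sigma^2 / M for the projected M-model ensemble.
    nonvariance_error = (mean - y) ** 2 - var / n
    return float(np.mean(nonvariance_error + var / target_size))
```

Projecting to `target_size=n` recovers the observed ensemble MSE exactly, and projections shrink toward the nonvariance (bias plus data noise) floor as M grows, mirroring the diminishing returns discussed above.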