2019 AIChE Annual Meeting
(370n) Constructing Interpretable and Accurate Model Combining Decision Tree and Random Forest
Machine learning is now widely used to estimate the properties and activities of compounds. Quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) models estimate activities and properties, respectively, from the chemical structures of compounds. Numerous model construction methods have been developed to give QSAR and QSPR models high accuracy and predictive ability, for example partial least squares (PLS) [1], support vector regression (SVR) [2] and artificial neural networks (ANN) [3]. In 2004, the Organisation for Economic Co-operation and Development (OECD) set out five principles for the validation of QSAR models [4]: (i) a defined endpoint, (ii) an unambiguous algorithm, (iii) a defined domain of applicability, (iv) appropriate measures of goodness-of-fit, robustness and predictivity, and (v) a mechanistic interpretation, if possible. Principles (i) to (iv) have been investigated extensively, and it is possible to build highly accurate models for samples within the applicability domain; however, highly accurate models are difficult to interpret. Model interpretation means discovering, from a model, the rules and mechanisms that govern specific physical properties and activities. The interpretability of highly accurate models therefore remains open to research. Although attempts to discover new mechanisms by interpreting models have been reported [5-8], three problems remain: the interpretation is complicated, the mechanisms do not provide guidelines for synthesis, and pursuing high interpretability lowers accuracy. We aim to construct models with both high predictive performance and high interpretability, resolving the trade-off between accuracy and interpretability. To this end, we develop a new method that combines an interpretable decision tree with regression methods.
Method
Decision tree (DT) [9] is a nonparametric regression method based on recursive partitioning of the data using the explanatory variables X. At each partition, the data are divided into two subsets at the value of an X variable that makes the objective variable y as homogeneous as possible within each subset. The simple tree-shaped model structure enables visual interpretation. However, because the model structure of a decision tree is so simple, its prediction accuracy is not very high.
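The partitioning step can be illustrated with a minimal sketch for a single X variable. It assumes that "as homogeneous as possible" means minimizing the sample-weighted variance of y in the two subsets, which is a common criterion for regression trees but is not stated in this abstract. The function name best_split and the toy data are hypothetical.

```python
from statistics import pvariance


def best_split(x, y):
    """Return (threshold, score) for the split of one variable x that
    minimizes the sample-weighted variance of y in the two subsets.
    The variance criterion is an assumption, not taken from the abstract."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical x values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        score = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
        if score < best_score:
            # place the threshold midway between the two neighbouring x values
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold, best_score


# Toy data: y jumps once x exceeds roughly 0.4
x = [0.0, 0.1, 0.2, 0.6, 0.7, 0.8]
y = [5.0, 6.0, 5.5, 90.0, 92.0, 91.0]
threshold, score = best_split(x, y)
```

A full tree repeats this search over every X variable and then recurses into each subset until a stopping rule is met.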
Random forest (RF) [10] is an ensemble machine learning method that consists of many decision trees. The y value is estimated by aggregating the results of all the DTs. Each DT in an RF model is constructed independently, using randomly selected training samples and randomly selected X variables. An advantage of RF is that it can calculate feature importance, which indicates the impact of each feature on the estimation. The importance of a feature is measured by how much the estimation accuracy decreases when the values of that feature are permuted.
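A permutation-based importance of this kind can be sketched with scikit-learn, which is assumed here only for illustration; the abstract does not say which software was used. Note that sklearn.inspection.permutation_importance permutes each feature on the data passed to it, whereas the original RF formulation uses out-of-bag samples. The synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data (hypothetical): only the first of three features drives y
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 10.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Shuffle each feature in turn and record the drop in the model score
result = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
importances = result.importances_mean
most_important = int(np.argmax(importances))
```

Here the first feature should receive by far the largest importance, since shuffling it destroys almost all of the information about y.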
The proposed method combines a DT with RF and is named DT-RF. First, the dataset is divided into sub-datasets with a DT, and then a local RF model is constructed for each sub-dataset. In a DT-RF model, the DT can be interpreted visually, while the local RF models increase estimation accuracy. In addition, the local RF models can be interpreted in detail using the importance of the X variables.
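The two-stage idea can be sketched as follows, assuming scikit-learn. This is an illustration of the description above, not the authors' implementation: the class name DTRF, the tree settings (max_leaf_nodes, min_samples_leaf) and the synthetic data are all hypothetical choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor


class DTRF:
    """Sketch of DT-RF: a shallow DT partitions the data and one local
    RF is fitted per terminal node (leaf). Settings are illustrative."""

    def __init__(self, max_leaf_nodes=4, min_samples_leaf=20, random_state=0):
        self.tree = DecisionTreeRegressor(
            max_leaf_nodes=max_leaf_nodes,
            min_samples_leaf=min_samples_leaf,
            random_state=random_state,
        )
        self.random_state = random_state
        self.local_models = {}

    def fit(self, X, y):
        self.tree.fit(X, y)
        leaf_ids = self.tree.apply(X)  # leaf index of every training sample
        self.local_models = {}
        for leaf in np.unique(leaf_ids):
            mask = leaf_ids == leaf
            rf = RandomForestRegressor(n_estimators=100, random_state=self.random_state)
            self.local_models[leaf] = rf.fit(X[mask], y[mask])
        return self

    def predict(self, X):
        leaf_ids = self.tree.apply(X)  # route each sample to its leaf
        y_pred = np.empty(len(X))
        for leaf, rf in self.local_models.items():
            mask = leaf_ids == leaf
            if mask.any():
                y_pred[mask] = rf.predict(X[mask])
        return y_pred


# Synthetic data (hypothetical): a step in the first variable plus a local trend
rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 2))
y = np.where(X[:, 0] > 0.5, 50.0, 5.0) + 10.0 * X[:, 1]

model = DTRF().fit(X, y)
y_pred = model.predict(X)
```

In this sketch the DT itself can be inspected with sklearn.tree.plot_tree, and each local model exposes feature_importances_ (scikit-learn's impurity-based importance) for a closer look at one sub-dataset.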
Results and discussion
Through case studies using superconductor data, we examined the estimation accuracy and interpretability of models constructed with the proposed method, as well as the validity of the interpretation. The superconductor data, extracted from the SuperCon [11] database, comprise 15542 inorganic compounds for which the critical temperature (Tc), at which superconductivity appears, has been measured. We used the elemental composition, cross-terms of the elemental composition and average molecular weight (AWM) as X variables to estimate Tc.
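One plausible way to build such X variables is sketched here. It assumes that the elemental composition is expressed as the atomic fraction of each element and that a cross-term is the product of two fractions (suggested by variable names such as Hg×O, but not defined in this abstract). The function name, the "A*B" naming scheme and the example formula are hypothetical, and the average molecular weight is omitted because its exact definition is not given.

```python
from itertools import combinations


def composition_features(formula, elements):
    """Return atomic fractions and pairwise cross-terms for one compound.

    formula  : dict mapping element symbol -> amount in the formula
    elements : ordered list of all elements used as X variables
    Cross-terms are assumed to be products of two atomic fractions.
    """
    total = sum(formula.values())
    ratios = {el: formula.get(el, 0.0) / total for el in elements}
    features = dict(ratios)
    for a, b in combinations(elements, 2):
        features[f"{a}*{b}"] = ratios[a] * ratios[b]
    return features


# Example: HgBa2CuO4 (8 atoms per formula unit)
elements = ["Hg", "Ba", "Cu", "O"]
features = composition_features({"Hg": 1, "Ba": 2, "Cu": 1, "O": 4}, elements)
```

With four elements this yields four fractions and six cross-terms; for the example compound the oxygen fraction is 0.5 and the Hg*O cross-term is 0.125 × 0.5 = 0.0625.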
Interpreting the constructed DT gave the following relationships between Tc and the X variables:
(a) If not is equal to 0, a ratio of high Tc compounds is high.
(b) Compounds whose composition ratio of oxygen is more than 0.5 do not have very high Tc.
(c) If is more than 0.011, a ratio of high Tc compounds is high.
In previous research, La-Ba-Cu-O, Ba-Y-Cu-O, Bi-Sr-Ca-Cu-O, Tl-Ba-Ca-Cu-O and Hg-Ba-Cu-O alloy-based superconductors were discovered as high-temperature superconductors (Tc > 70 K) [12, 13]. Interpretation results (a) and (c) are consistent with the high-temperature superconductors that have been discovered. Estimation accuracy was evaluated using the coefficient of determination (r2) and the mean absolute error (MAE). r2 and MAE in were 0.782 and 9.88, and r2 and MAE in cross-validation (CV) were 0.732 and 11.0, respectively.
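For reference, the two accuracy measures are written out here under the assumption that the standard definitions are meant; the abstract does not give them explicitly. Here y_i is the measured Tc of sample i, \hat{y}_i the estimated value, \bar{y} the mean of the measured values and n the number of samples.

```latex
r^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
               {\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
```

Under these definitions MAE carries the unit of Tc (K), while r2 is dimensionless and equals 1 for a perfect estimate.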
Local RF models were constructed with the terminal subsets of the DT model, and the predictive ability of the local models was evaluated by double cross-validation (DCV) [13], a method for evaluating the performance of an estimator on external data. The DCV procedure is as follows:
(1) Divide the data at random into groups; use one group for the accuracy test and the other groups for model construction.
(2) Optimize the hyperparameters by CV (inner-CV) and construct a model using the groups for model construction.
(3) Estimate the y values of the test group using the model built in (2).
(4) Repeat (1) to (3) until every group has been used as the test group (outer-CV).
(5) Compare the estimated y values obtained in (4) with the measured y values to evaluate accuracy.
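Steps (1) to (5) can be sketched with scikit-learn by wrapping a grid search (inner-CV) inside cross_val_predict (outer-CV). This is a minimal sketch: the number of folds, the hyperparameter grid (max_features) and the synthetic data are assumptions made for illustration and are not taken from this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict

# Synthetic data (hypothetical) standing in for one sub-dataset
rng = np.random.default_rng(0)
X = rng.uniform(size=(150, 3))
y = 20.0 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(scale=0.5, size=150)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # step (2)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # steps (1) and (4)

# Inner-CV: choose hyperparameters using only the model-construction groups
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_features": [1, 2, 3]},
    cv=inner_cv,
    scoring="neg_mean_absolute_error",
)

# Outer-CV: each group is estimated by a model that never saw it (step (3))
y_dcv = cross_val_predict(search, X, y, cv=outer_cv)

# Step (5): compare the estimated and measured y values
r2_dcv = r2_score(y, y_dcv)
mae_dcv = mean_absolute_error(y, y_dcv)
```

Because the grid search is refitted inside every outer fold, the test group of each fold plays no part in hyperparameter selection.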
When a model is constructed from a small number of samples and CV is used for model optimization, the hyperparameters are chosen to improve the accuracy of the y values in CV. The model may then fit only the samples used for model construction, and its predictive ability for external data may not be adequately evaluated. DCV, on the other hand, consists of two nested CV loops, referred to as inner-CV and outer-CV, and can therefore evaluate predictive ability more adequately than CV. When the samples of a terminal sub-dataset of the DT are used for model construction, the number of samples is small and overfitting may occur. DCV is therefore effective for adequately evaluating the predictive ability of the model.
The DCV estimation results were MAE = 7.10 and r2 = 0.866 for DT-RF, and MAE = 7.51 and r2 = 0.823 for RF. Thus, the local RF models fit the sub-datasets in more detail than the global RF model does. In particular, the samples with high Tc were estimated well. These results indicate that DT-RF can be used effectively to discover new alloy materials with high Tc. In the local RF model of the highest-Tc group in the DT, Hg×O was selected as the most important X variable.
Conclusion
In this research, we developed a QSAR/QSPR model construction method with high interpretability and estimation accuracy. Instead of a single global model, local models were constructed using the sub-datasets divided with a DT. This made it possible to interpret the DT model visually and to predict y values with high accuracy, because the local models fit the sub-datasets in more detail than a global model. Since a small number of samples for model construction often raises the possibility of overfitting, we adopted DCV to adequately evaluate the predictive ability of the models. Analyzing superconductor data with the proposed method gave high prediction accuracy, and the interpretability and its validity were high because the results of interpretation using the DT corresponded to the superconductors that have already been discovered. By interpreting the local RF models, we obtained important X variables that appear to be deeply involved in raising Tc. We plan to search for new superconductors with high critical temperature by inverse analysis.
References
[1] P. Geladi; B. R. Kowalski, Analytica Chimica Acta, 1986
[2] R. Tibshirani, Journal of the Royal Statistical Society, 1996
[3] D.C. Park ; M.A. El-Sharkawi ; R.J. Marks ; L.E. Atlas ; M.J. Damborg, IEEE, 1991
[4] OECD Homepage Validation of (Q)SAR Models, http://www.oecd.org/chemicalsafety/risk-assessment/validationofqsarmodels.htm , 21 Oct 2018
[5] Supratik Kar; Juganta K. Roy; Danuta Leszczynska; Jerzy Leszczynski, MDPI, 2017
[6] Valentin Stanev; Corey Oses; A. Gilad Kusne; Efrain Rodriguez; Johnpierre Paglione; Stefano Curtarolo; Ichiro Takeuchi, Machine learning modeling of superconducting critical temperature, ACS journal, 2018
[7] Vinicius M. Alves; Alexander Golbraikh; Stephan J. Capuzzi; Kammy Liu; Wai In Lam; Daniel Robert Korn; Diane Pozefsky; Carolina Horta Andrade; Eugene N. Muratov; Alexander Tropsha, JCIM, 2018, 58, 1214-1223
[8] Ieda Maria dos Santos; Joao Pedro Gomes Agra; Thiego Gustavo Cavalcante de Carvalho; Gabriela de Azevedo Maia; Edilson Beserra de Alencar Filho, Springer, 2018, 29, 1287-1297
[9] L. Breiman, Machine Learning, 1996
[10] L. Breiman, Kluwer Academic Publishers, Machine Learning, 2001, 45, 5-32
[11] Corey Oses, Cormac Toher, and Stefano Curtarolo, MRS Bulletin, Volume 43, September 2018
[12] S. S. P. Parkin, V. Y. Lee, E. M. Engler, A. I. Nazzal, T. C. Huang, C. Gorman, R. Savoy, R. Beyers, Phys. Rev. Lett., 60, 2539, 1988
[13] A. Schilling, M. Cantoni, J. D. Guo, H. R. Ott, Nature, 363, 56, 1993
