2023 AIChE Annual Meeting
(28w) Artificial Intelligence-Based Parametrization of Next Generation Systems Biology Models
The construction of such a mathematical model requires a high level of expertise. In addition, a very large number of kinetic parameters, such as turnover numbers or Michaelis–Menten constants, have to be properly estimated for the models to be feasible. It is well known that most turnover numbers have not yet been quantified by wet lab experiments. To date, the turnover numbers on which the training and test sets rely are taken from the BRENDA (Jeske et al., 2019) and SABIO-RK (Wittig et al., 2012) databases. Besides providing the training basis, these databases supply the NGSB models with measured turnover numbers, leading to more accurate predictions. Experimental determination of unknown turnover numbers is time-consuming, which often prevents this parametrization from being completed.
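To illustrate how the two kinds of kinetic parameters named above enter a model, the following is a minimal sketch of the irreversible single-substrate Michaelis–Menten rate law, v = kcat * [E] * [S] / (Km + [S]). The function name, the units, and the numerical values are illustrative assumptions and are not taken from the NGSB models.

```python
def michaelis_menten_rate(kcat, enzyme_conc, substrate_conc, km):
    """Return the rate v = kcat * [E] * [S] / (Km + [S]).

    kcat is the turnover number (1/s), km the Michaelis-Menten constant,
    and both concentrations share the units of km (assumed here: mM).
    """
    if km <= 0:
        raise ValueError("Km must be positive")
    if min(kcat, enzyme_conc, substrate_conc) < 0:
        raise ValueError("kcat and concentrations must be non-negative")
    return kcat * enzyme_conc * substrate_conc / (km + substrate_conc)


# Hypothetical parameter values, chosen only to show the behaviour.
kcat = 100.0        # 1/s
enzyme_conc = 0.01  # mM
km = 0.5            # mM

# At [S] = Km the rate is half of the maximal rate kcat * [E].
v_half = michaelis_menten_rate(kcat, enzyme_conc, substrate_conc=km, km=km)
v_max = kcat * enzyme_conc
```

In a genome-scale model every enzymatic reaction needs its own kcat (and often Km), which is why missing measurements quickly become the bottleneck.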
Consequently, training and test sets are built from the information in the relevant databases, and Deep Neural Network (DNN) models are then trained and fine-tuned to estimate turnover numbers. The preprocessed, integrated dataset contains the building blocks for describing each enzyme and reaction component, including enzyme sequences, molecular fingerprints, and other chemical attributes derived from mol files. The endogenous metabolites participating in each reaction are described to the model through MACCS and PubChem fingerprints, combined with Tanimoto similarity indices. Mol files are obtained from the KEGG database (Kanehisa, 2002). The construction of the dataset is followed by the training process.
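As a rough illustration of the Tanimoto similarity index mentioned above, the sketch below compares two binary fingerprints represented as sets of "on" bit positions. The bit positions are made up for the example; in practice MACCS or PubChem fingerprints would be computed from the mol files with a cheminformatics toolkit, and the abstract does not specify how the resulting similarities are fed to the model.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto index |A & B| / |A | B| of two sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union_size = len(a | b)
    if union_size == 0:
        # Convention assumed here: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / union_size


# Hypothetical fingerprints (on-bit indices), not real MACCS keys.
fp_substrate = {1, 4, 7, 9}
fp_product = {1, 4, 8, 9, 12}

# Three shared bits out of six distinct bits in total.
similarity = tanimoto(fp_substrate, fp_product)
```

The index ranges from 0 (no shared bits) to 1 (identical fingerprints), which makes it a convenient bounded input feature.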
TensorFlow (Abadi et al., 2016) and Keras (Chollet, 2015) are currently used for DNN model development. These tools provide a user-friendly interface and integrate readily with other machine and deep learning tools. Optimization was carried out by trial and error, combined with a multicore process developed in-house, which allowed a well-optimized model to be obtained through the selection of a proper set of parameters. The biological information for each enzyme is supplied through FASTA files produced from its sequence; by applying the algorithm of Alley et al. (2019) together with Natural Language Processing (NLP) techniques, numerical vectors representing the structure of each enzyme were produced. The model we developed predicts turnover numbers with an R² of 0.56. The methodology is independent of the organism for which kinetic parameters are predicted.
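For readers unfamiliar with the reported metric, the sketch below computes the coefficient of determination in its standard form, R² = 1 - SS_res / SS_tot. This is an assumption about the definition used: the abstract does not state whether R² was computed on raw or transformed turnover numbers, or on which data split. The data values are invented for the example.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical measured vs. predicted values (arbitrary units).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]

# SS_res = 1.0 and SS_tot = 5.0 for these values.
score = r_squared(measured, predicted)
```

Under this definition, an R² of 0.56 means the predictions account for roughly 56% of the variance in the reference values.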
The procedure described is expected to provide solutions for the parametrization of systems biology and NGSB models, as well as for industrial and pharmaceutical applications involving enzymatic processes that can benefit from these types of models. An online version of these models will soon be made available, so that users can build their own applications with the already trained models by providing the sequences of the enzymes they are interested in. A GitHub repository will also be provided, from which users will be able to access the pre-trained models and use them in their own code.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., & Devin, M. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M., & Church, G. M. (2019). Unified rational protein engineering with sequence-based deep representation learning. Nature Methods, 16(12), 1315-1322.
Chollet, F. (2015). Keras. GitHub. Seattle, WA, USA. https://keras.io
Jeske, L., Placzek, S., Schomburg, I., Chang, A., & Schomburg, D. (2019). BRENDA in 2019: a European ELIXIR core data resource. Nucleic Acids Res, 47(D1), D542-D549.
Kanehisa, M. (2002). The KEGG database. 'In Silico' Simulation of Biological Processes: Novartis Foundation Symposium 247,
Wittig, U., Kania, R., Golebiewski, M., Rey, M., Shi, L., Jong, L., Algaa, E., Weidemann, A., Sauer-Danzwith, H., & Mir, S. (2012). SABIO-RK: database for biochemical reaction kinetics. Nucleic Acids Res, 40(D1), D790-D796. https://doi.org/10.1093/nar/gkr1046