2022 Annual Meeting
(362e) A Comparative Evaluation of Machine Learning Algorithms in Predicting Syngas Fermentation Outcomes Using Limited Experimental Data
Authors
Methods. Time course concentration data from Clostridium fermentations [2] were used to predict individual product production rates and time course concentration curves. For each time point, the state of the fermentation (gas composition and extracellular metabolite concentrations) was paired with the production rates of acetate, ethanol, butyrate, and butanol. Via data augmentation, a database of 836 time points was constructed for supervised learning. This database was split into training and test sets and used to train six ML algorithms: neural networks (NNs), support vector machines (SVMs), random forests (RFs), elastic nets (ENs), lasso (LA), and k-nearest neighbors (kNN). Additionally, the rate-predicting models were used to generate time course concentration data by starting from initial conditions and iteratively calculating the concentration of each product at the next time point.
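The iterative time course generation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a forward-Euler update, assumes that features the model does not predict (here the gas fraction) stay fixed, and uses a hypothetical `toy_rate_model` with made-up numbers in place of a trained regressor.

```python
from typing import Callable, Dict, List

State = Dict[str, float]


def finite_difference_rates(times: List[float], conc: List[float]) -> List[float]:
    """Forward-difference production rate between consecutive samples."""
    return [
        (conc[i + 1] - conc[i]) / (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    ]


def rollout(
    rate_model: Callable[[State], State],
    initial_state: State,
    times: List[float],
) -> List[State]:
    """Forward-Euler rollout: c(t + dt) = c(t) + rate(state at t) * dt."""
    trajectory = [dict(initial_state)]
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        state = trajectory[-1]
        rates = rate_model(state)
        next_state = dict(state)  # unpredicted features are carried over unchanged
        for product, rate in rates.items():
            # Clamp at zero because concentrations cannot be negative.
            next_state[product] = max(0.0, state[product] + rate * dt)
        trajectory.append(next_state)
    return trajectory


def toy_rate_model(state: State) -> State:
    """Hypothetical stand-in for a trained rate regressor (illustrative numbers)."""
    return {"acetate": 0.5 * (1.0 - state["acetate"] / 10.0)}


times = [0.0, 1.0, 2.0, 3.0]
trajectory = rollout(toy_rate_model, {"acetate": 0.0, "CO": 0.4}, times)
acetate_curve = [s["acetate"] for s in trajectory]
recovered_rates = finite_difference_rates(times, acetate_curve)
```

Because each step feeds the previous prediction back in as input, errors in the rate model can accumulate along the curve; `finite_difference_rates` shows the inverse step of turning a measured curve into rate targets for training.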
Results. On unseen test data, predictions of acid productivity (acetate and butyrate) were more accurate than predictions of alcohol productivity (ethanol and butanol). Predictions for two-carbon products were more accurate than those for four-carbon products. A trend in our findings is that products requiring more enzymatic steps or more cofactors have less accurate predictions.
For test set rate predictions, RF performed best, with SVM a close second. Both algorithms had average R² values of ~0.35. EN and LA had moderate performance with average R² values of ~0.30, while NN and kNN showed the worst average performance with R² values of ~0.22. EN and LA are relatively simple algorithms with fewer fitted parameters than the other ML methods. The fact that they outperformed kNN and NN suggests that kNN and NN were likely overfit. Despite NN's poor overall performance, it offered the best predictions for ethanol production rate. This indicates that the performance of an ML model will not be uniform across syngas fermentation products, and therefore the selection of ML algorithms should be made only after testing multiple options.
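To make the comparison concrete, the sketch below computes the coefficient of determination and picks the best model for each product. It is an illustration only: the model and product names and all scores in `placeholder_scores` are invented placeholders, not results from this study.

```python
from typing import Dict, Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def average_r2(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Mean R2 across products for each model."""
    return {m: sum(p.values()) / len(p) for m, p in scores.items()}


def best_model_per_product(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """scores[model][product] holds an R2 value; return the top model per product."""
    products = next(iter(scores.values())).keys()
    return {
        product: max(scores, key=lambda model: scores[model][product])
        for product in products
    }


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(y_true, y_true)                  # exact predictions
mean_only = r2_score(y_true, [2.5, 2.5, 2.5, 2.5])  # always predicting the mean

# Placeholder values chosen so both models share the same average R2.
placeholder_scores = {
    "model_A": {"product_1": 0.7, "product_2": 0.1},
    "model_B": {"product_1": 0.5, "product_2": 0.3},
}
averages = average_r2(placeholder_scores)
best = best_model_per_product(placeholder_scores)
```

In this toy case the two models tie on average R² yet each wins on a different product, which is the situation that motivates testing several algorithms per product rather than ranking them by a single average.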
Interestingly, the time course curves generated by the production rate models were more accurate than the rate predictions themselves. SVM, RF, EN, and LA were the most accurate, with test set R² values of ~0.80. NN and kNN performed less well, with test R² values around 0.5. One possible reason is that kNN relies heavily on the training set, because the algorithm uses the most similar points in the training data to predict test data. As a result, in this study kNN models tended to be less "generalizable" than the other models. NN's lower performance is likely because neural networks have many fitted parameters and can therefore overfit. Both issues could be resolved with a larger training data set, or by using a training set that more closely resembles the test set.
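The dependence of kNN on its training set can be illustrated with a minimal kNN regressor. This is a sketch under stated assumptions (Euclidean distance, unweighted averaging, a one-feature toy data set), not the configuration used in the study.

```python
import math
from typing import List, Sequence, Tuple


def knn_predict(
    train: List[Tuple[Sequence[float], float]],
    query: Sequence[float],
    k: int = 3,
) -> float:
    """Average the targets of the k training points closest to the query."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Toy data: rate = 2 * feature, sampled only for feature values 0..4.
train = [((float(x),), 2.0 * x) for x in range(5)]

# Query inside the sampled range: neighbours are x = 1, 2, 3.
inside = knn_predict(train, (2.0,), k=3)

# Query far outside the sampled range: neighbours are x = 2, 3, 4,
# so the prediction stays near the training targets (true value would be 20.0).
outside = knn_predict(train, (10.0,), k=3)
```

The prediction for the out-of-range query cannot exceed the largest training targets, which is one way a kNN model fails to generalize when the test conditions differ from the training conditions.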
The trained random forest models were used to determine the relative influence of each gas component on the production rates of the four products. The feature importance of a gas for a product's production rate was determined by averaging the impurity reduction obtained when the value of that gas was used to split the decision trees. The analysis shows that butyrate's production rate is heavily dependent on the concentration of CO in the feed gas, and that butanol's production rate is mainly dependent on the concentration of H2. This agrees with previous findings, since CO is both a carbon source and an energy source, while H2 offers strong reducing power. H2 is the most influential substrate for butanol production since butanol synthesis requires more reducing cofactors than the other products. These feature analyses show how machine learning methods can use limited experimental data to "relearn" and "redesign" biosynthesis patterns.
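The idea behind impurity-based importance can be shown on a single split. The sketch below is a deliberate simplification: it scores each feature by the best variance reduction of one threshold split at the root, whereas a random forest averages such reductions over every split in every tree. The gas fractions and rates are invented toy values. The abstract does not name a library; in scikit-learn, for instance, the forest-level quantity is exposed as `feature_importances_` on a fitted `RandomForestRegressor`.

```python
from typing import Dict, List, Sequence


def variance(values: Sequence[float]) -> float:
    """Population variance, used here as the regression impurity."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_impurity_reduction(feature: Sequence[float], target: Sequence[float]) -> float:
    """Largest variance reduction achievable by one threshold split on a feature."""
    n = len(target)
    parent = variance(target)
    best = 0.0
    # Try every distinct value except the largest as a "<= threshold" split.
    for threshold in sorted(set(feature))[:-1]:
        left = [t for f, t in zip(feature, target) if f <= threshold]
        right = [t for f, t in zip(feature, target) if f > threshold]
        weighted_child = (len(left) * variance(left) + len(right) * variance(right)) / n
        best = max(best, parent - weighted_child)
    return best


def normalized_importance(
    features: Dict[str, List[float]], target: List[float]
) -> Dict[str, float]:
    """Scale the per-feature reductions so that they sum to 1."""
    raw = {name: best_impurity_reduction(col, target) for name, col in features.items()}
    total = sum(raw.values()) or 1.0  # avoid division by zero if nothing helps
    return {name: value / total for name, value in raw.items()}


# Toy data (illustrative only): the rate follows CO strongly and H2 weakly.
gas = {
    "CO": [0.2, 0.2, 0.6, 0.6],
    "H2": [0.1, 0.5, 0.1, 0.5],
}
rate = [1.0, 1.2, 3.0, 3.2]
importance = normalized_importance(gas, rate)
```

Here splitting on CO removes almost all of the variance in the toy rate, so CO receives nearly all of the normalized importance.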
Implications. Syngas fermentations are highly dynamic and nonlinear, which makes them ideal targets for ML-based Model Predictive Control (MPC). This study evaluated the ability of six ML algorithms to predict syngas fermentation production rates based on limited fermentation tests. SVM and RF performed best in this study, while kNN and NN performed worst. In contrast, the simpler ML algorithms, EN and LA, are "safer" options because they have fewer fitted parameters. In this study, EN and LA outperformed NN likely because of the limited training set size rather than because linear methods are more applicable to syngas fermentation. Generally, ML methods were more accurate for acid production rates than for alcohol production rates, indicating that there are unknown features not captured for alcohol production (e.g., metabolic shifts or other intrinsic biological factors). Time course predictions based on rate predictions were more accurate than direct rate predictions. Additionally, feature importance reaffirmed guidelines for how gas composition can be used to control product profiles. Future studies can build on this work by increasing the amount of syngas fermentation data, by including new features to capture cell regulation and stress responses to bioreactor conditions, or by applying ensemble machine learning approaches.
[1] Beltramo, T., Ranzan, C., Hinrichs, J., & Hitzmann, B. (2016). Artificial neural network prediction of the biogas flow rate optimized with an ant colony algorithm. Biosystems Engineering, 143, 68-78.
[2] Wan, N., Sathish, A., You, L., Tang, Y. J., & Wen, Z. (2017). Deciphering Clostridium metabolism and its responses to bioreactor mass transfer during syngas fermentation. Scientific Reports, 7, 10090.