2022 Annual Meeting
Evaluation of Fingerprinting Techniques for Random Forest Classification of Large-Set Reaction Data
Reaction classification is valuable for many applications, particularly retrosynthesis and synthesis planning. Recently, demand has grown for data-driven machine learning (ML) classifiers that can make more effective use of large reaction data sets. In this study, we examine the performance of several available techniques for representing chemical reactions and reaction agents in order to develop an ML-based reaction classification model. The model, a random forest classifier, was trained and validated using a large set (50,000 total reactions) of labeled reaction data mined from a publicly available U.S. patent database. Different fingerprinting methods were used to train the random forest classifier and categorize the reactions into 50 reaction classes as proposed in the patent data set. Robustness of the model was analyzed using cross-validation (CV), and the performance on the training data was measured using three scoring metrics: precision, recall, and F1-score. We found that difference reaction fingerprints (FPs), which gave a CV accuracy of over 0.9, were significantly more effective than structural reaction FPs, which gave a CV accuracy of about 0.7. Additionally, inclusion of reaction agents generally improved the model, with the most effective method being concatenation of an agent feature array. Varying the molecular FP type did not significantly change model performance, although Atom Pairs performed slightly worse in combination with the agent feature array. The most successful models achieved a macro-averaged F1-score of 0.979 on the hold-out test data. The results suggest that Topological Torsions, Morgan, Pattern, and RDKit FPs are all suitable for this type of modeling. Finally, we explore the use of our model on a set of unlabeled reaction data and propose possible classifications for 250 of these reactions.
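To make the feature construction and the headline metric concrete, the following is a minimal sketch, not the authors' implementation. It assumes a difference reaction FP is formed as the summed product count vectors minus the summed reactant count vectors, with an agent feature array concatenated at the end, and it computes a macro-averaged F1-score as the unweighted mean of per-class F1. The function `toy_count_fp` (hashed counts of SMILES character pairs) is a hypothetical stand-in for a real molecular FP such as Morgan or Topological Torsions, and all function names and the 64-element length are illustrative assumptions.

```python
import zlib

N_BITS = 64  # assumed toy length; real fingerprints are usually much longer


def toy_count_fp(smiles, n_bits=N_BITS, k=2):
    """Hypothetical stand-in for a molecular count FP: hashed character k-gram counts."""
    vec = [0] * n_bits
    for i in range(len(smiles) - k + 1):
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        idx = zlib.crc32(smiles[i:i + k].encode()) % n_bits
        vec[idx] += 1
    return vec


def sum_fps(smiles_list, n_bits=N_BITS):
    """Element-wise sum of the count FPs of several molecules."""
    total = [0] * n_bits
    for smi in smiles_list:
        for j, v in enumerate(toy_count_fp(smi, n_bits)):
            total[j] += v
    return total


def difference_fp(reactants, products, n_bits=N_BITS):
    """Difference reaction FP: summed product FPs minus summed reactant FPs."""
    r = sum_fps(reactants, n_bits)
    p = sum_fps(products, n_bits)
    return [pj - rj for pj, rj in zip(p, r)]


def split_molecules(part):
    """Split a dot-separated SMILES group, ignoring empty entries."""
    return [m for m in part.split(".") if m]


def reaction_features(reaction_smiles, n_bits=N_BITS):
    """Difference FP concatenated with an agent feature array.

    Expects reaction SMILES of the form 'reactants>agents>products'.
    """
    reactants, agents, products = reaction_smiles.split(">")
    diff = difference_fp(split_molecules(reactants), split_molecules(products), n_bits)
    agent_array = sum_fps(split_molecules(agents), n_bits)
    return diff + agent_array  # list concatenation, length 2 * n_bits


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Example: acid-catalyzed esterification written as reaction SMILES
features = reaction_features("CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC")
```

In practice, feature vectors like `features` would be stacked into a matrix and passed to a random forest classifier. Because the difference FP subtracts reactant counts from product counts, substructures that are unchanged by the reaction cancel out, leaving a signal concentrated on what the reaction altered.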