2021 Annual Meeting
(291a) Naming, Classifying, and Comparing Polymers in the Era of Data Science
Recently, we developed BigSMILES, a stochastic line notation that captures polymer structures in a way directly analogous to chemical structure drawings while retaining the advantages of, and full compatibility with, the SMILES line notation for small molecules. However, BigSMILES, like a chemical structure drawing, defines only the set of possible molecules. Defining the probabilities of those molecules requires characterization data. To address this, we have put forward the PolyDAT schema, which links characterization data to the line notation and thereby provides a complete chemical definition of a polymer. Together, BigSMILES and PolyDAT make it possible to address several challenges. First, we demonstrate how polymer structures can be canonicalized, both using empirical rules and through analogy to automata in computer science. Second, we show how BigSMILES can be used to drive polymer vectorization. Third, we show how BigSMILES can form the basis of polymer similarity comparisons.
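As a rough illustration of the similarity idea, the sketch below pulls the repeat units out of a simple BigSMILES stochastic object and compares two polymers by the Jaccard overlap of those units. It is not the method presented in this talk: the helper names `repeat_units` and `jaccard` are hypothetical, the parser handles only a single non-nested `{...}` object, repeat units are compared as raw strings without canonicalization, and composition probabilities (the information PolyDAT is meant to supply) are ignored.

```python
import re

# Bonding descriptors such as [$], [<], [>], [$1]; "[]" is an empty terminal descriptor.
# Bracket atoms like [Si] or [O-] do not match and are left untouched.
DESCRIPTOR = re.compile(r"\[[$<>]\d*\]|\[\]")


def repeat_units(stochastic_object: str) -> set:
    """Return the repeat-unit fragments of one simple, non-nested stochastic object.

    Hypothetical helper for illustration only; not a full BigSMILES parser.
    """
    body = stochastic_object.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("expected a single {...} stochastic object")
    body = body[1:-1]
    # End groups, if any, follow a ';'. Keep only the repeat-unit list before it.
    units_part = body.split(";", 1)[0]
    units = set()
    for unit in units_part.split(","):
        stripped = DESCRIPTOR.sub("", unit).strip()
        if stripped:
            units.add(stripped)
    return units


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


polyethylene = "{[][$]CC[$][]}"
ethylene_butene = "{[][$]CC[$],[$]CC(CC)[$][]}"

pe_units = repeat_units(polyethylene)
eb_units = repeat_units(ethylene_butene)
similarity = jaccard(pe_units, eb_units)
print(sorted(pe_units), sorted(eb_units), similarity)
```

Because the comparison is on raw strings, two different spellings of the same repeat unit would be counted as different, which is one reason canonicalization matters before any similarity measure is applied.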
Building on the initial BigSMILES grammar, we have also developed BigSMARTS, an extension of SMARTS that enables searching of polymer structures. We have further demonstrated that BigSMILES is compatible with the concepts put forth in SELFIES, so that polymers can be written in a form more amenable to use in genetic algorithms. Finally, the stochastic nature of BigSMILES makes it inherently compatible with non-covalent bonds, an advantage over deterministic line notations. We use this feature to extend BigSMILES to a wide variety of molecular constructs useful in colloidal and supramolecular materials.