2024 AIChE Annual Meeting

Developing Support Vector Machine Models to Enable Precision Oncology

Machine learning is growing alongside cancer research and treatment development. Our goal is to develop a machine learning model that aids treatment selection, guided by an individual's genetic sequence. Data are taken from the 50% growth inhibition (GI50) data set of the NCI-60 database, which covers 60 human tumor cell lines spanning multiple cancer types. A support vector machine (SVM) is applied to find the most important features of drug compounds for each cell line; the cell line with the most entries, A549-ATCC, is the subset used for training and testing. The drug compounds are encoded as Simplified Molecular Input Line Entry System (SMILES) strings, which are converted to Morgan fingerprints with the Python package RDKit. The fingerprints serve as the training data for the SVM, since each bit in a fingerprint represents a substructure of the compound. Compounds were labeled as active or inactive based on the concentration used in the GI50 trials. The SVM kernel is built by clustering the compounds according to pairwise similarity scores computed with the Tanimoto coefficient. Finally, an in-house function, MISTIC (Model Informed Feature Selection Through Importance and Contribution), is used to guide feature selection with the aim of accurate drug-response prediction.
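
The Tanimoto similarity behind the kernel can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `tanimoto` and `tanimoto_kernel`, the toy fingerprints, and the convention for empty fingerprints are all assumptions made for illustration. Each fingerprint is represented as the set of its on-bit indices, and the similarity of two fingerprints is the number of shared on-bits divided by the number of bits set in either one.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / union


def tanimoto_kernel(fps: List[Set[int]]) -> List[List[float]]:
    """Symmetric matrix of pairwise Tanimoto similarities."""
    n = len(fps)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Fill both halves at once, since similarity is symmetric.
            matrix[i][j] = matrix[j][i] = tanimoto(fps[i], fps[j])
    return matrix


# Hypothetical toy fingerprints, not taken from the NCI-60 data.
toy_fps = [{1, 5, 9, 12}, {1, 5, 9, 40}, {2, 7}]
kernel = tanimoto_kernel(toy_fps)
```

Here the first two toy compounds share three of five distinct on-bits, so their similarity is 0.6, while the third shares none with either and scores 0. In practice the on-bit sets could come from RDKit Morgan fingerprints, and a matrix of this form could be passed to an SVM that accepts a precomputed kernel, such as scikit-learn's `SVC(kernel="precomputed")`; the abstract does not state which of these the authors used.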