The early stages of drug discovery are critical for determining the success of downstream development, yet they are often plagued by inefficiencies and high attrition rates [1]. A significant proportion of drug candidates fail in later stages due to poor pharmacokinetic or toxicity profiles that could have been identified earlier with more robust screening methodologies. Traditional machine learning approaches, while helpful, often rely on manual feature engineering and hand-crafted molecular descriptors, requiring separate models for each predictive task [2]. This fragmentation limits scalability and slows down the pipeline. Transformers—originally developed for natural language processing—have recently emerged as powerful tools in chemistry and materials science due to their ability to model complex relationships and dependencies in sequential data. Their attention-based architecture enables them to learn rich molecular representations directly from SMILES strings [3], making them well-suited for property prediction tasks in drug discovery. In this work, we present a unified transformer-based screening pipeline designed specifically for early-stage drug discovery [4]. The pipeline utilizes a 12-layer encoder-only transformer model trained on over 1.8 billion SMILES strings, using masked language modeling to learn rich, context-aware representations of molecular structures. Unlike descriptor-based approaches, our model directly learns from raw SMILES inputs, eliminating the need for molecular fingerprints or handcrafted features. The architecture integrates Rotary Positional Embedding (RoPE) to capture spatial dependencies in molecular sequences and employs linear attention via orthogonal random feature mapping to ensure computational efficiency, particularly when processing large-scale datasets.
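The efficiency claim for linear attention can be made concrete with a small sketch. The snippet below approximates softmax attention with positive random features in the style of kernelized (Performer-type) attention, avoiding the quadratic attention matrix; for brevity it draws i.i.d. Gaussian projections rather than the orthogonal random features used in the actual model, and all shapes and inputs are illustrative, not the paper's.

```python
import numpy as np

def feature_map(x, W):
    # Positive random features for the softmax kernel:
    # phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m), so E[phi(u) . phi(v)] = exp(u . v)
    m = W.shape[0]
    return np.exp(x @ W.T - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)

def softmax_attention(Q, K, V):
    # Exact attention for reference: materializes the O(n^2) score matrix
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    a = np.exp(s - s.max(axis=-1, keepdims=True))
    return (a / a.sum(axis=-1, keepdims=True)) @ V

def linear_attention(Q, K, V, n_features=256, seed=0):
    # Kernelized attention: phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1),
    # computed without ever forming the n x n attention matrix.
    d = Q.shape[-1]
    # i.i.d. Gaussian projections for brevity; orthogonal random features,
    # as used in the paper's model, reduce the estimator's variance.
    W = np.random.default_rng(seed).standard_normal((n_features, d))
    q = feature_map(Q / d**0.25, W)   # the 1/d^(1/4) scaling matches QK^T/sqrt(d)
    k = feature_map(K / d**0.25, W)
    return (q @ (k.T @ V)) / (q @ k.sum(axis=0))[:, None]

# Sanity check on small random inputs: the kernel estimate tracks exact attention
rng = np.random.default_rng(1)
Q = 0.3 * rng.standard_normal((10, 16))
K = 0.3 * rng.standard_normal((10, 16))
V = 0.3 * rng.standard_normal((10, 16))
err = np.mean(np.abs(linear_attention(Q, K, V) - softmax_attention(Q, K, V)))
```

Because the keys and values are summarized once in `k.T @ V`, cost grows linearly in sequence length, which is what makes screening million-compound SMILES libraries tractable.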
Once trained, the transformer generates general-purpose molecular embeddings, which are reused across multiple downstream tasks. These embeddings are fed into separate, lightweight feed-forward neural networks for the prediction of drug-like properties and ADME-T (absorption, distribution, metabolism, excretion, and toxicity) characteristics. This modular architecture supports both regression tasks, such as predicting lipophilicity, aqueous solubility, volume of distribution, and acute toxicity, and classification tasks, such as blood-brain barrier permeability, CYP450 enzyme inhibition, and mutagenicity. The shared embedding structure ensures consistency across tasks while maintaining high task-specific accuracy. We validate the pipeline using a case study focused on identifying potential inhibitors of HIV-1 integrase, a critical enzyme in the HIV replication cycle. Beginning with a library of over 1.04 million compounds, we apply the pipeline to filter candidates through a comprehensive three-stage process. The first stage involves filtering based on drug-likeness and predicted bioactivity. The second stage evaluates ADME-T properties to remove compounds with suboptimal pharmacokinetic or toxicity profiles. The final stage predicts IC50 values and calculates binding efficiency indices to rank candidates by potency and molecular efficiency. This process ultimately narrows the initial library to just 143 highly promising molecules, each of which demonstrates strong therapeutic potential and favorable pharmacological characteristics.
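The final ranking step can be illustrated with the standard binding efficiency index, BEI = pIC50 / MW (in kDa); the sketch below assumes this common definition, and the compound names, IC50 values, and molecular weights are purely hypothetical, not results from the study.

```python
import math

def pIC50(ic50_nM):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    return -math.log10(ic50_nM * 1e-9)

def binding_efficiency_index(ic50_nM, mw_daltons):
    """BEI = pIC50 per kilodalton of molecular weight: rewards potency
    achieved with a small molecule rather than raw potency alone."""
    return pIC50(ic50_nM) / (mw_daltons / 1000.0)

# Hypothetical candidates: (name, predicted IC50 in nM, molecular weight in Da)
candidates = [
    ("cand_A", 12.0, 480.0),
    ("cand_B", 55.0, 310.0),
    ("cand_C",  3.0, 720.0),
]

# Rank by BEI, highest first: a weaker but much lighter compound can outrank
# a more potent, heavier one.
ranked = sorted(candidates,
                key=lambda c: binding_efficiency_index(c[1], c[2]),
                reverse=True)
```

Here the 55 nM, 310 Da compound outranks the 3 nM, 720 Da one, which is exactly the "molecular efficiency" behavior the third screening stage exploits.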
Our model achieves high performance across all tasks, with regression models reaching R² values above 0.96 and classification tasks yielding precision, recall, and F1 scores consistently above 0.97. Importantly, the modular design of the pipeline allows for easy adaptation to new properties, therapeutic targets, or regulatory criteria by simply retraining the downstream prediction layers using the same transformer-generated embeddings. This adaptability is particularly valuable in real-world drug discovery, where project goals and constraints often vary widely. In summary, this work introduces a scalable, accurate, and adaptable transformer-based screening pipeline that consolidates early-stage drug discovery into a single, cohesive framework. By replacing fragmented, descriptor-driven processes with a unified model that learns directly from molecular sequences, our approach accelerates the identification of viable drug candidates, reduces development costs, and significantly mitigates the risk of late-stage failures. The framework sets a new benchmark for how deep learning and attention-based models can transform the front end of drug development pipelines.
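The adaptability argument, retraining only the downstream prediction layers on the shared embeddings, can be sketched as follows. A single logistic layer stands in for the paper's lightweight feed-forward heads, and both the "frozen transformer embeddings" and the binary labels are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: frozen 256-d "transformer embeddings" for 400 molecules
# and binary labels for a hypothetical endpoint (e.g. BBB permeability).
emb = rng.standard_normal((400, 256))
w_hidden = rng.standard_normal(256)
labels = (emb @ w_hidden > 0).astype(float)

# Task-specific head: a single logistic layer as a minimal stand-in for the
# paper's lightweight feed-forward networks. The transformer stays frozen;
# only these weights are (re)trained when a new property is added.
w = np.zeros(256)
b = 0.0
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(emb @ w + b)))   # predicted probabilities
    g = (p - labels) / len(labels)             # BCE gradient w.r.t. logits
    w -= lr * emb.T @ g                        # full-batch gradient step
    b -= lr * g.sum()

preds = 1.0 / (1.0 + np.exp(-(emb @ w + b))) > 0.5
train_acc = np.mean(preds == (labels == 1.0))
```

Swapping in a new endpoint only means supplying new labels and rerunning this cheap head-training loop; the expensive pretrained encoder and its embeddings are untouched.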
References:
[1] J. Drews, Drug Discovery: A Historical Perspective, Science 287 (2000) 1960–1964. https://doi.org/10.1126/science.287.5460.1960.
[2] G. Sliwoski, S. Kothiwale, J. Meiler, E.W. Lowe, Computational Methods in Drug Discovery, Pharmacol Rev 66 (2014) 334–395. https://doi.org/10.1124/pr.112.007336.
[3] A. Khambhawala, C.H. Lee, S. Pahari, P. Nancarrow, N.A. Jabbar, M.M. El-Halwagi, J.S.-I. Kwon, Advanced transformer models for structure-property relationship predictions of ionic liquid melting points, Chemical Engineering Journal 503 (2025) 158578. https://doi.org/10.1016/j.cej.2024.158578.
[4] A. Khambhawala, C.H. Lee, S. Pahari, J.S.-I. Kwon, Minimizing late-stage failure in drug development with transformer models: Enhancing drug screening and pharmacokinetic predictions, Chemical Engineering Journal (2025) 160423. https://doi.org/10.1016/j.cej.2025.160423.