2025 AIChE Annual Meeting

(394af) LLM-Powered Knowledge Graph Framework - the Analysis of Biochemical Networks

Checkout Do you already own this? Log in to access this content.

Pricing

Individuals

AIChE Pro Members	150.00
AIChE Emeritus Members	105.00
AIChE Graduate Student Members	Free
AIChE Undergraduate Student Members	Free
AIChE Explorer Members	225.00
Non-Members	225.00

Authors

Nikos Trokanas - Presenter

Stefanos Xenios, National Technical University of Athens

Konstantinos Mexis, National Technical University of Athens

Franjo Cecelja, University of Surrey

Antonios Kokosis, National Technical University of Athens

This work presents a modular computational framework that leverages the capabilities of open-source large language models (LLMs) to convert genome-scale metabolic models (GSMs) into interactive knowledge graphs (KGs) for semantic querying and visualization of biochemical networks. Centered around a detailed case study of the Escherichia coli core metabolic model, the system is implemented entirely in Python and uses GraphDB as the RDF triple store for persistent KG storage. The primary objective is to enable domain experts to engage with complex biological models through a conversational interface, lowering the barrier to systems-level metabolic analysis.

The pipeline begins with the parsing of SBML-encoded GSMs to extract reaction stoichiometries, gene–protein–reaction (GPR) associations, metabolite roles, and compartmentalization. These entities are mapped to our custom meta ontology and structured as RDF triples for ingestion into the KG. The current implementation encodes explicit biochemical and topological relationships, including enzyme-catalyzed reactions, substrate–product flows, and regulatory constraints. Future versions of the framework will incorporate learned vector embeddings—generated via graph-based and language model-based techniques—to represent metabolic components in a continuous latent space. These embeddings will facilitate semantic similarity search, entity disambiguation, and relevance scoring within a retrieval-augmented generation (RAG) architecture, improving both the precision and interpretability of system responses.

At present, the system uses general-purpose, pre-trained LLMs such as Llama3 and CodeLlama in a zero-shot configuration. These models are integrated via an inference API layer to process user prompts, guide the SPARQL query generation, and generate explanatory responses from the structured KG outputs. While no fine-tuning has yet been performed, we anticipate that domain adaptation—via continual pretraining or supervised fine-tuning on corpora from sources such as BioCyc, KEGG, and EcoCyc—will significantly enhance biological reasoning capabilities. In future iterations, fine-tuning will be paired with advanced prompting strategies, including retrieval-conditioned prompts, chain-of-thought (CoT) generation, and dynamic prompt templating based on KG context.

User interaction is mediated through a multi-stage pipeline: natural language queries are parsed and interpreted by the LLM, which then produces formal SPARQL queries targeting relevant entities and relations within the KG. The resulting query outputs are processed and distilled into biologically coherent responses, rendered in natural language and complemented by interactive network visualizations. Visual layers include reaction-centric views, metabolite-centric subnetworks, and pathway-level abstractions, each driven by user-defined filters or constraints.

This architecture supports a new paradigm for exploratory metabolic modeling, offering a human-in-the-loop interface for interrogating complex biological systems without requiring deep familiarity with query languages or formal model structures. By coupling symbolic representations (via RDF and ontologies) with the generative and interpretive capabilities of LLMs, the system enables a hybrid reasoning environment that supports hypothesis generation, data interpretation, and model curation.

Ongoing and future development efforts include the benchmarking of LLM-generated queries and responses against ground-truth pathway annotations and expert-curated outputs, extension of the framework to incorporate regulatory and signaling models beyond GSMs, optimization of KG query latency for real-time interaction, and integration of quantitative modeling layers including flux balance analysis (FBA) and kinetic rate laws. Additionally, embedding-based indexing will be introduced to support semantic search, enabling the LLM to retrieve and rank contextually relevant subgraphs to support high-fidelity RAG workflows. Together, these enhancements aim to establish a biologically grounded, computationally scalable platform for LLM-assisted metabolic network analysis.

Breadcrumb

2025 AIChE Annual Meeting

(394af) LLM-Powered Knowledge Graph Framework - the Analysis of Biochemical Networks

Authors