Finite State Machines (FSMs) have long served as a foundational modelling formalism for complex engineered systems, particularly in domains such as process control, automation, and embedded system design [1]. FSMs offer a powerful way to decompose system behaviour into discrete operational modes (states) governed by distinct dynamics or control laws. Each state in an FSM represents a specific regime—such as nominal operation, degraded performance, or fault conditions—while the transitions encode logic for how the system evolves based on triggers like sensor readings, control actions, or environmental changes. This abstraction provides clarity, modularity, and traceability for systems where behaviour is mode-dependent.
In the context of industrial control systems, FSMs are commonly used to model fault modes and associated recovery strategies. A well-structured FSM allows engineers to predefine what should happen when certain conditions are met—such as entering a low-pressure state or detecting a valve malfunction—and what actions should be taken to bring the system back to a safer or more profitable regime. However, despite their widespread use and formal rigor, FSMs face critical challenges when applied to modern fault-tolerant systems [2].
A key limitation lies in the combinatorial explosion of states and transitions. In real-world systems, the same observable state (e.g., high temperature in a reactor) can often be reached through multiple fault trajectories. For instance, a cooling failure, a sensor misreading, or a volume change could all produce the same high-temperature reading, yet the recovery strategy for each case might differ. Traditional FSMs require these differences to be explicitly encoded, effectively multiplying the number of states and transitions. This leads to increased complexity, maintenance overhead, and brittle designs that handle only predefined fault conditions gracefully. Furthermore, adding new fault types often requires a complete reconfiguration of the FSM structure, hindering scalability and adaptability.
To address this limitation, we introduce a novel methodology that uses Large Language Models (LLMs) as dynamic decision-makers [3] to traverse FSMs under fault-induced scenarios. Rather than predefining every possible fault path and recovery sequence, we allow the LLM to reason about the current system state and its trajectory to decide what actions should be taken to recover or stabilize the system. This approach shifts the FSM’s role from being a fully enumerated decision tree to a semantic substrate that informs and constrains LLM-based reasoning. The LLM acts as a reactive planner within the FSM structure—selecting transitions, suggesting control actions, and adjusting plans on the fly based on context.
Our proposed framework extends a previously published agentic architecture that integrates LLMs for autonomous control [4]. In this architecture, the LLM is part of a feedback loop composed of five key agents: a monitoring agent, which detects faults or abnormal conditions; an action agent (an LLM), which determines the next control move; a digital twin agent, which simulates the feasibility of the proposed action; a validation agent (which may or may not be an LLM), which assesses the safety of the proposed action; and a reprompting agent (an LLM), which refines or rejects actions that lead to unsafe or infeasible outcomes. This closed-loop structure enables the system to self-correct in the face of hallucinations or invalid transitions, a known limitation of LLMs in high-stakes environments.
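To make the loop concrete, the following is a minimal Python sketch of how the five agents could be chained. The function bodies, thresholds, and the Proposal structure are illustrative placeholders (the LLM calls are stubbed out), not the implementation published in [4].

from dataclasses import dataclass

@dataclass
class Proposal:
    action: str
    setpoint: float

def monitoring_agent(measurements):
    # Flag any measured variable that leaves its (hypothetical) nominal band [0, 1].
    return {name: value for name, value in measurements.items()
            if not 0.0 <= value <= 1.0}

def action_agent(fault_report, fsm_state):
    # Stub for an LLM call: map the fault report and current FSM state to a control proposal.
    return Proposal(action="increase_pump_setpoint", setpoint=0.6)

def digital_twin_agent(proposal):
    # Stub for a plant simulation of the proposed action; returns predicted key variables.
    return {"flowrate": 0.3 + 0.5 * proposal.setpoint}

def validation_agent(predicted):
    # Accept the action only if the simulated flowrate lies in a safe, productive range.
    return 0.6 <= predicted["flowrate"] <= 1.0

def reprompting_agent(proposal, predicted):
    # Stub for an LLM call: refine a rejected proposal using the simulation feedback.
    return Proposal(action=proposal.action, setpoint=min(proposal.setpoint + 0.1, 1.0))

def control_loop(measurements, fsm_state, max_iterations=5):
    faults = monitoring_agent(measurements)
    if not faults:
        return None  # nominal operation, no intervention needed
    proposal = action_agent(faults, fsm_state)
    for _ in range(max_iterations):
        predicted = digital_twin_agent(proposal)
        if validation_agent(predicted):
            return proposal  # safe and feasible: hand over to the plant
        proposal = reprompting_agent(proposal, predicted)
    return None  # no safe action found within the budget; escalate to an operator

print(control_loop({"flowrate": -0.2}, fsm_state="FillingTank1"))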
For benchmarking and validation, we rely on the case studies provided in the AI Benchmark for Diagnosis, Reconfiguration, and Planning repository developed by [5]. This benchmark contains a range of realistic plant models implemented in the OpenModelica framework. These systems vary in size and complexity, allowing us to test the capabilities and limitations of LLM-based reasoning across a spectrum of control problems. Empirical tests show that with naïve prompting, current off-the-shelf LLMs can reliably solve FSMs with up to 25 states, although hallucinations and infeasible actions become more frequent beyond that scale.
A central aspect of this work is the modeling interface—i.e., how FSMs and system information are presented to the LLM. We explore four primary modalities: (1) verbal descriptions, where the FSM and state dynamics are described in natural language; (2) state-transition diagrams, encoded in text format; (3) knowledge graphs, capturing the causal and structural dependencies between components; and (4) code representations, using OpenModelica or Python to encode logic and transitions. Each of these formats offers trade-offs between interpretability, grounding, and reasoning accuracy. For instance, verbal descriptions are flexible but prone to ambiguity, while code-based logic provides precision but may overwhelm the model without proper abstraction. Part of our ongoing work is to systematically evaluate which formats best support LLM-based reasoning in FSM traversal tasks.
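As a concrete illustration of modalities (2) and (4), the hypothetical sketch below encodes the same small FSM fragment both as a Python transition table and as a plain-text transition list suitable for inclusion in a prompt; the states and triggers are invented for illustration and are not taken from the benchmark models.

# Hypothetical FSM fragment: the same content as a code-level transition table (modality 4)
# and as a text-based state-transition listing (modality 2).
FSM = {
    ("Nominal", "temperature_high"): "Overheating",
    ("Overheating", "cooling_restored"): "Nominal",
    ("Overheating", "cooling_failed"): "EmergencyShutdown",
}

def fsm_to_text(fsm):
    """Render the transition table as a textual state-transition list for an LLM prompt."""
    return "\n".join(
        f"From state '{src}', on trigger '{trigger}', go to state '{dst}'."
        for (src, trigger), dst in fsm.items()
    )

print(fsm_to_text(FSM))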
Classical control approaches often treat recovery as returning to a nominal setpoint or steady-state. However, in industrial practice, it is often sufficient for the system to reach any state that is safe, stable, and capable of sustaining profitable operation. This broader definition of recovery allows our framework to explore alternative paths, including those that stabilize the plant under suboptimal but acceptable conditions. For example, rather than restoring full heating capacity in a thermal system, the LLM may identify a lower operating temperature that meets production thresholds while avoiding risky interventions. This flexible reasoning is critical in high-dimensional or poorly modelled systems where full recovery may be either infeasible or unnecessary.
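One way to operationalize this broader notion of recovery is as an acceptance predicate over candidate operating points, rather than as a distance to a single nominal setpoint. The sketch below is purely illustrative; the variable names and thresholds are assumptions, not values from any of the benchmark plants.

def is_acceptable_recovery(state):
    """Accept any state that is safe, stable, and economically viable, rather than
    requiring a return to the original nominal setpoint (thresholds are hypothetical)."""
    safe = state["temperature"] < 390.0 and state["pressure"] < 2.5
    stable = abs(state["temperature_rate"]) < 0.1
    profitable = state["production_rate"] >= 0.7  # fraction of nominal throughput
    return safe and stable and profitable

# A reduced-temperature operating point can still count as recovered.
candidate = {"temperature": 355.0, "temperature_rate": 0.02,
             "pressure": 1.8, "production_rate": 0.75}
print(is_acceptable_recovery(candidate))  # True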
An illustrative example involves a system with multiple tanks (three feed tanks and one reactor). The case study aims to fill the three feed tanks sequentially and then empty them into the reactor sequentially. This process is subject to multiple anomalies, such as a clogged valve, a leaking valve, or a more general valve malfunction. A classical FSM would require predefined transitions for each fault type and corresponding controller actions. Our LLM-based framework, however, dynamically proposes actions such as changing the pump setpoint in order to overcome the clogging fault, with every action validated through a digital twin. If the action is deemed unsafe (e.g., the flowrate is still too low), the reprompting agent triggers an iterative reasoning loop to explore safer alternatives. This capability transforms the FSM from a rigid execution map into an interactive search space, where transitions are contextually constructed, pruned, and validated.
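The feedback step in this example can be illustrated by how a rejection from the digital twin might be turned into a follow-up prompt for the reprompting agent; the message wording and fields below are assumptions made for illustration, not the prompts used in [4].

def build_reprompt(previous_action, twin_feedback):
    """Construct a follow-up prompt after the digital twin rejects a proposed action.
    The phrasing and fields are illustrative assumptions."""
    return (
        "The previously proposed action was rejected by the digital twin.\n"
        f"Proposed action: {previous_action}\n"
        f"Simulation feedback: {twin_feedback}\n"
        "Propose an alternative action that restores an acceptable flowrate "
        "while respecting the pump's maximum setpoint."
    )

print(build_reprompt(
    previous_action="Increase pump P1 setpoint to 60%",
    twin_feedback="Simulated flowrate remained below the minimum acceptable value.",
))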
A major technical challenge in this setup is ensuring that the LLM does not make infeasible or unsafe suggestions. While the digital twin helps to filter out such actions, reliance on reprompting can increase latency and computational overhead. Future improvements will focus on incorporating constraint-aware reasoning, where physical or logical constraints are embedded directly into the LLM prompt or model architecture. Another direction involves fine-tuning LLMs on structured control problems to reduce hallucinations and improve reasoning fidelity.
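One possible form of such constraint-aware prompting is to place the hard physical and logical constraints directly in the action agent's prompt, so that clearly infeasible suggestions are discouraged before simulation. The constraints and prompt layout below are hypothetical examples, not part of the current implementation.

# Hypothetical hard constraints embedded in the action agent's prompt.
CONSTRAINTS = [
    "Pump setpoints must stay between 0% and 100%.",
    "Valves V1 and V2 must never be open at the same time.",
    "Reactor temperature must remain below 390 K at all times.",
]

def build_constrained_prompt(fault_description, fsm_text):
    """Assemble a prompt pairing the fault report and FSM text with explicit hard constraints."""
    constraint_block = "\n".join(f"- {c}" for c in CONSTRAINTS)
    return (
        f"Current fault: {fault_description}\n\n"
        f"Finite state machine:\n{fsm_text}\n\n"
        f"Hard constraints (any action violating these is invalid):\n{constraint_block}\n\n"
        "Propose a single control action and the FSM transition it corresponds to."
    )

print(build_constrained_prompt(
    "Valve V1 appears to be clogged.",
    "From state 'FillingTank1', on trigger 'valve_clogged', go to state 'FaultRecovery'.",
))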
The broader implications of this work extend beyond individual plants or benchmarks. The framework can generalize to other cyber-physical systems—including power grids, robotic swarms, and autonomous vehicles—where hybrid dynamics, discrete modes, and fault management are critical. Moreover, it offers a pathway to scalable autonomy: by offloading decision-making from static models to language-based agents, engineers can focus on specifying goals, constraints, and safety envelopes rather than exhaustively coding all transitions. This paradigm could also play a pivotal role in safety verification, where LLMs can reason about edge cases, simulate failure trajectories, and propose recovery actions in digital environments before being deployed in physical systems.
In conclusion, this work presents a transformative approach to FSM-based control by embedding LLMs as dynamic reasoning agents capable of real-time fault diagnosis, recovery planning, and safe action execution. By integrating this reasoning loop with formal modeling, validation via digital twins, and feedback through reprompting, we enable control systems that are both flexible and robust. While many challenges remain—including scaling to larger state spaces, reducing hallucinations, and improving interpretability—this work lays the groundwork for a new generation of autonomous industrial controllers that can learn, adapt, and recover without explicit enumeration of all failure modes.
References:
1. Cassandras, C. G., & Lafortune, S. (Eds.). (2008). Introduction to discrete event systems. Boston, MA: Springer US.
2. Sampath, M., Sengupta, R., Lafortune, S., Sinnamohideen, K., & Teneketzis, D. (1995). Diagnosability of discrete-event systems. IEEE Transactions on Automatic Control, 40(9), 1555-1575.
3. Huang, W., Abbeel, P., Pathak, D., & Mordatch, I. (2022, June). Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning (pp. 9118-9147). PMLR.
4. Vyas, J., & Mercangöz, M. (2024). Autonomous Industrial Control using an Agentic Framework with Large Language Models. arXiv preprint arXiv:2411.05904.
5. Ehrhardt, J., Ramonat, M., Heesch, R., Balzereit, K., Diedrich, A., & Niggemann, O. (2022, September). An AI benchmark for diagnosis, reconfiguration & planning. In 2022 IEEE 27th International Conference on Emerging Technologies and Factory Automation (ETFA) (pp. 1-8). IEEE.