2025 AIChE Annual Meeting

(594c) Offline Reinforcement Learning for Grade Change Operations in Chemical Process Plants

Authors

Alex Durkin and Mehmet Mercangöz

Imperial College London

Abstract:
Grade change operations arise in multiproduct chemical manufacturing processes where the system must transition from producing one product specification to another. These transitions are typically planned based on customer demand, production scheduling, or inventory levels and can occur in polymerization, refining, and specialty chemical processes. The challenge lies in achieving the new product specifications quickly and efficiently, while minimizing the production of off-spec material, energy usage, and wear on equipment. Unlike steady-state operation, grade changes involve highly dynamic behaviour and often require coordinated manipulation of multiple process variables. Due to their complexity, such transitions are traditionally managed by experienced operators using heuristics or partially automated procedures. The operational strategies applied during grade changes can significantly impact economic performance and product quality, making them a crucial aspect of plantwide optimization and advanced control [1].

With digitalization now well established across many industrial sectors, process plants increasingly collect detailed operational data, including process measurements and operator actions during grade changes. These datasets offer a valuable opportunity for data-driven control approaches that can learn from historical decisions.

In this context, offline reinforcement learning (offline RL) has emerged as a promising framework. Unlike imitation learning or behavioural cloning, which simply attempt to replicate historical operator behaviour, offline RL seeks to improve upon historical strategies by learning optimal policies from fixed offline datasets while ensuring constraint-aware, safe operation. This is particularly relevant for grade change problems, where past operator trajectories may be suboptimal or vary in effectiveness [2,3].
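
To make the distinction concrete, the two learning problems can be written schematically as follows; the notation is ours and is intended only as an illustration of the general constrained offline RL setting, not as the exact formulations of the cited algorithms.

```latex
% Schematic objectives (illustrative notation, not taken verbatim from [2-5]).
% Behavioural cloning: imitate the logged operator actions in dataset \mathcal{D}.
\[
  \pi_{\mathrm{BC}} \;=\; \arg\max_{\pi} \;
  \mathbb{E}_{(s,a)\sim\mathcal{D}} \big[ \log \pi(a \mid s) \big]
\]
% Constrained offline RL: improve the expected return under a cost budget \kappa,
% with a regularizer D(\cdot\|\cdot) keeping \pi close to the behaviour policy
% \pi_\beta implicit in the dataset (the source of conservatism in these methods).
\[
  \pi^{\ast} \;=\; \arg\max_{\pi} \;
  \mathbb{E}_{\pi}\!\Big[\textstyle\sum_{t}\gamma^{t} r(s_t,a_t)\Big]
  \;-\; \alpha\, D\big(\pi \,\|\, \pi_{\beta}\big)
  \quad \text{s.t.} \quad
  \mathbb{E}_{\pi}\!\Big[\textstyle\sum_{t}\gamma^{t} c(s_t,a_t)\Big] \;\le\; \kappa
\]
```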

In this study, we investigate the application of offline RL to a benchmark multiproduct reactor system, where the task is to transition the system to a target concentration profile corresponding to a new product grade. To generate a realistic training dataset, we simulate numerous grade change scenarios using a suite of hand-designed and randomly parameterized control strategies that reflect the variability observed in operator behaviours. These synthetic transitions are initiated from a range of initial conditions and differ in performance, with some trajectories exhibiting overshoots or undershoots that lead to constraint violations. We design a reward function that penalizes deviations from the target profile and incorporates constraints directly, labelling undesirable transitions both through the reward structure and through explicit constraint indicators.
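
The sketch below illustrates, under stated assumptions, how such a dataset could be generated: the toy reactor model, the randomly parameterized operator-like policies, and the reward and cost definitions are placeholders chosen for illustration rather than the benchmark system used in the study.

```python
# Minimal, self-contained sketch of the dataset-generation and reward-labelling idea
# described above. All names and numbers (simulate_step, sample_operator_policy,
# C_TARGET, C_MAX, ...) are illustrative placeholders, not the study's actual setup.
import numpy as np

rng = np.random.default_rng(0)
C_TARGET = 0.65            # target concentration of the new grade (illustrative)
C_MAX = 0.90               # bound whose violation marks an unsafe transition
N_EPISODES, HORIZON = 500, 200

def simulate_step(c, u):
    """Toy stand-in for the benchmark reactor: first-order response plus noise."""
    return c + 0.1 * (np.clip(u, -1.0, 1.0) - c) + rng.normal(0.0, 0.005)

def sample_operator_policy():
    """Randomly parameterized heuristic reflecting variability in operator behaviour."""
    gain = rng.uniform(0.5, 3.0)
    hold = rng.integers(0, 40)        # an aggressive initial move held for `hold` steps
    def policy(c, t):
        if t < hold:
            return 1.0                # push hard toward the new grade (may overshoot C_MAX)
        return C_TARGET + gain * (C_TARGET - c)
    return policy

def reward_and_cost(c_next, u):
    """Penalize deviation from the target profile and flag constraint violations."""
    reward = -abs(c_next - C_TARGET) - 0.01 * abs(u)
    cost = float(c_next > C_MAX)      # constraint indicator used by the constrained agents
    return reward, cost

dataset = []                          # (state, action, reward, cost, next_state) tuples
for _ in range(N_EPISODES):
    policy = sample_operator_policy()
    c = rng.uniform(0.2, 0.5)         # varied initial condition / outgoing grade
    for t in range(HORIZON):
        u = policy(c, t)
        c_next = simulate_step(c, u)
        r, cost = reward_and_cost(c_next, u)
        dataset.append((c, u, r, cost, c_next))
        c = c_next
```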

Using this dataset, we train and compare several offline RL agents, focusing on algorithms capable of handling constraints, such as Conservative Q-Learning (CQL) [2], Implicit Q-Learning (IQL) [3], Constrained Offline Policy Optimization [4], and COptiDICE [5], a recent algorithm that learns cost-conservative policies through stationary distribution correction estimation. COptiDICE has shown particular promise in learning safe policies in offline settings while ensuring constraint satisfaction. We evaluate the performance of these agents across multiple metrics, including transition time, constraint satisfaction, and robustness across initial conditions. In addition, we study the sensitivity of the learning process to dataset size and quality, providing insight into the data efficiency and practical viability of offline RL in this domain.
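
As a rough illustration of the evaluation protocol, the following sketch rolls out a policy from a range of initial conditions on the same toy model as above and aggregates transition time and constraint-violation rate; the dynamics, thresholds, and stand-in policy are assumptions made for illustration, not the study's actual evaluation code.

```python
# Minimal sketch of how trained agents could be compared on the metrics above
# (transition time, constraint satisfaction, robustness across initial conditions).
# The toy dynamics and the stand-in policy are illustrative only.
import numpy as np

C_TARGET, C_MAX, TOL = 0.65, 0.90, 0.02   # same illustrative constants as in the data sketch
HORIZON, N_STARTS = 200, 20

def evaluate(policy, rng):
    """Roll out a policy from varied initial conditions and aggregate the metrics."""
    settle_times, violations = [], 0
    for _ in range(N_STARTS):
        c = rng.uniform(0.2, 0.5)
        settled_at = HORIZON                           # worst case: never settles
        for t in range(HORIZON):
            u = policy(c)
            c = c + 0.1 * (np.clip(u, -1.0, 1.0) - c)  # same toy reactor step as before
            violations += int(c > C_MAX)
            if settled_at == HORIZON and abs(c - C_TARGET) < TOL:
                settled_at = t                         # first time within tolerance of the target
        settle_times.append(settled_at)
    return float(np.mean(settle_times)), violations / (N_STARTS * HORIZON)

rng = np.random.default_rng(1)
# In the study this dictionary would hold the trained CQL, IQL, and other offline RL
# policies; here a single hand-written stand-in keeps the sketch self-contained.
agents = {"stand_in_policy": lambda c: C_TARGET + 2.0 * (C_TARGET - c)}
for name, pi in agents.items():
    t_avg, viol_rate = evaluate(pi, rng)
    print(f"{name}: mean transition time = {t_avg:.1f} steps, violation rate = {viol_rate:.3f}")
```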

Our results demonstrate the potential of offline RL to improve the performance of grade change operations in chemical process plants while maintaining safety and constraint satisfaction — even when only historical or simulated operational data is available. This work lays the foundation for safer, more efficient, and data-driven operation of process transitions in modern industrial systems.

References

  1. MacGregor, J. F., & Kourti, T. (1992). Statistical process control of multivariate processes. Computers & Chemical Engineering, 16(4), 489–500. https://doi.org/10.1016/0098-1354(92)80066-6
  2. Kumar, A., Zhou, A., Tucker, G., & Levine, S. (2020). Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems (Vol. 33, pp. 1179–1191). https://arxiv.org/abs/2006.04779
  3. Kostrikov, I., Nair, A., & Levine, S. (2022). Offline reinforcement learning with implicit Q-learning. In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2110.06169
  4. Polosky, N., da Silva, B. C., Fiterau, M., & Jagannath, J. (2022). Constrained offline policy optimization. In Proceedings of the 39th International Conference on Machine Learning (ICML) (pp. 17761–17784). https://proceedings.mlr.press/v162/polosky22a.html
  5. Lee, J., Paduraru, C., Mankowitz, D. J., Heess, N., Precup, D., Kim, K.-E., & Guez, A. (2022). COptiDICE: Offline constrained reinforcement learning via stationary distribution correction estimation. In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2204.08957