Reinforcement Learning (RL) has significantly expanded the field of control in the past decade, enabling the optimal operation of highly complex dynamic systems. This expansion has largely been driven by rapid developments in machine learning (ML), which enable the use of neural networks (NNs) as function approximators [1]. RL has been exceptionally useful in situations where it is challenging to create a mechanistic model or to develop a high-fidelity surrogate representation, which often requires large amounts of data. It thus provides an alternative when advanced model-based control techniques, such as model predictive control (MPC), may not be practical [2, 3]. RL has proven instrumental in robotics [4], game-playing [5], and hierarchical decision-making [6], but has been difficult to implement in the direct control of chemical and energy systems [7]. This difficulty stems from inherent safety and stability concerns and from the large amount of training time needed to reach acceptable performance, both of which arise primarily during the exploration phases of RL algorithms [8]. Several recent works address these issues, but they often rely on a system model [9], require a control invariant set [11], or are only applicable to linear systems [10]. These requirements make such methods hard to deploy in practice when little to no information about the true dynamics is available. There is therefore an essential need for a systematic RL algorithm that eliminates unsafe and time-consuming exploration and provides reliable control actions throughout the entire training process, without relying on a complex system model.
Toward this direction, we present a novel transfer-learned RL algorithm that leverages Y-wise Affine Neural Networks (YANNs) to initialize the actor and critic networks. YANNs are a specialized NN architecture we have developed that can exactly represent the explicit control policy from multi-parametric MPC (mp-MPC) through targeted weight and bias selection. This serves as a warm start and transfers the mp-MPC knowledge to the RL controller. We also show that the critic network can be formulated to be equivalent to the MPC objective function for a one-step control and operating horizon. These initializations accelerate the learning process and eliminate the need for early exploration in both the actor and critic networks. To obtain the initial mp-MPC control laws, an approximate linear system model can be developed from a small amount of measurement data, e.g., via system identification, enabling use of the approach in model-free environments.
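To make the initialization targets concrete, the minimal Python sketch below evaluates a hypothetical two-region explicit mp-MPC law, u(x) = K_i x + c_i on polyhedral critical regions A_i x <= b_i, together with a one-step quadratic stage cost of the form x'Qx + u'Ru. All region data and weighting matrices are illustrative placeholders, and the sketch shows only the piecewise-affine map and cost that the actor and critic are initialized to reproduce; it is not the YANN weight-and-bias construction itself.

import numpy as np

# Hypothetical explicit mp-MPC solution: a piecewise-affine law
#   u(x) = K_i x + c_i   whenever   A_i x <= b_i   (critical region i).
# All region and gain data below are placeholders; in practice they come
# from the multi-parametric programming solution.
regions = [
    {"A": np.array([[ 1.0, 0.0]]), "b": np.array([0.0]),   # region 1: x1 <= 0
     "K": np.array([[-0.5, -0.2]]), "c": np.array([0.0])},
    {"A": np.array([[-1.0, 0.0]]), "b": np.array([0.0]),   # region 2: x1 >= 0
     "K": np.array([[-0.8, -0.1]]), "c": np.array([0.0])},
]

def explicit_mpc_policy(x):
    """Evaluate the explicit (piecewise-affine) mp-MPC control law at state x."""
    for r in regions:
        if np.all(r["A"] @ x <= r["b"] + 1e-9):
            return r["K"] @ x + r["c"]
    # Fallback outside the stored critical regions (for this sketch only).
    return regions[0]["K"] @ x + regions[0]["c"]

# One-step MPC-style quadratic stage cost of the kind used to seed the critic:
#   q(x, u) = x' Q x + u' R u   (weighting matrices are placeholders).
Q = np.diag([1.0, 1.0])
R = np.diag([0.1])

def one_step_cost(x, u):
    return float(x @ Q @ x + u @ R @ u)

x0 = np.array([0.4, 0.2])
u0 = explicit_mpc_policy(x0)          # control action from the explicit law
print(u0, one_step_cost(x0, u0))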
The identified model is used to develop an initialized policy network with desirable guarantees, such as recursive feasibility and stability, for the linear approximation. This is possible because YANNs are functionally equivalent to the mp-MPC policy and thus retain these mathematical properties, giving a higher level of confidence in system safety and stability during the learning process. The optimal control policy for the nonlinear dynamics is then learned through the accelerated RL algorithm, which updates the actor and critic networks based on interactions with the environment (sketched below), guiding the policy from the suboptimal explicit solution toward the true optimum. We demonstrate the computational and practical advantages of the proposed methodology through comparisons with existing RL algorithms on two case studies involving the control of model-free dynamic systems. Of particular interest, one of the case studies concerns a safety-critical process conceptualized from a real-world incident [12]. We show that the proposed technique accelerates the learning of an optimal control policy, provides better intermediate control action trajectories throughout the learning process, and eliminates unsafe exploration. This work provides promising results for employing YANNs in RL for control and presents a state-of-the-art algorithm that learns optimal policies far more quickly and safely than existing approaches.
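As a rough illustration of the resulting workflow, the sketch below first regresses a generic actor network onto a placeholder linear control law, standing in for the exact YANN-based initialization, and then outlines in comments how standard actor-critic updates from plant interactions would refine the policy. The network sizes, the discount factor gamma, and the DDPG-style update form are assumptions made for illustration and are not the specific algorithm proposed here.

import torch
import torch.nn as nn

# Generic actor and critic networks (sizes are placeholders).
actor = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
critic = nn.Sequential(nn.Linear(3, 32), nn.Tanh(), nn.Linear(32, 1))  # input: [state, action]
a_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

# Stage 1: transfer the explicit-MPC knowledge into the actor.
# (With YANNs this initialization is exact by construction; here it is
# approximated by regression onto a placeholder linear law for illustration.)
states = torch.randn(256, 2)                          # sampled states
targets = states @ torch.tensor([[-0.5], [-0.2]])     # placeholder explicit law
for _ in range(200):
    a_opt.zero_grad()
    nn.functional.mse_loss(actor(states), targets).backward()
    a_opt.step()

# Stage 2: refine on the true (model-free, nonlinear) plant. For each observed
# transition (x, u, r, x_next), a DDPG-style update would take the form
#   critic loss: (critic([x, u]) - (r + gamma * critic([x_next, actor(x_next)])))^2
#   actor  loss: -critic([x, actor(x)])
# so that the policy moves from the suboptimal explicit solution toward the
# true optimum without unsafe early exploration.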
References:
[1] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
[2] Faria, R. de R., Capron, B. D. O., Secchi, A. R., & de Souza, M. B. (2022). Where Reinforcement Learning Meets Process Control: Review and Guidelines. Processes, 10(11), Article 11.
[3] Petsagkourakis, P., Sandoval, I. O., Bradford, E., Zhang, D., & del Rio-Chanona, E. A. (2020). Reinforcement learning for batch bioprocess optimization. Computers & Chemical Engineering, 133, 106649.
[4] Kaufmann, E., Bauersfeld, L., Loquercio, A., Müller, M., Koltun, V., & Scaramuzza, D. (2023). Champion-level drone racing using deep reinforcement learning. Nature, 620(7976), 982–987.
[5] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
[6] Dogru, O., Velswamy, K., Ibrahim, F., Wu, Y., Sundaramoorthy, A. S., Huang, B., Xu, S., Nixon, M., & Bell, N. (2022). Reinforcement learning approach to autonomous PID tuning. Computers & Chemical Engineering, 161, 107760.
[7] Shin, J., Badgwell, T. A., Liu, A., & Lee, J. H. (2019). Reinforcement Learning – Overview of recent progress and implications for process control. Computers & Chemical Engineering, 127, 282–294.
[8] Nian, R., Liu, J., & Huang, B. (2020). A review on reinforcement learning: Introduction and applications in industrial process control. Computers & Chemical Engineering, 139, 106886.
[9] Wang, Y., Zhu, X., & Wu, Z. (2025). A tutorial review of policy iteration methods in reinforcement learning for nonlinear optimal control. Digital Chemical Engineering, 15, 100231.
[10] Marvi, Z., & Kiumarsi, B. (2022). Reinforcement Learning With Safety and Stability Guarantees During Exploration For Linear Systems. IEEE Open Journal of Control Systems, 1, 322–334.
[11] Bo, S., Agyeman, B. T., Yin, X., & Liu, J. (2023). Control invariant set enhanced safe reinforcement learning: Improved sampling efficiency, guaranteed stability and robustness. Computers & Chemical Engineering, 179, 108413.
[12] U.S. Chemical Safety Board. T2 Laboratories Inc. reactive chemical explosion. www.csb.gov/t2-laboratories-inc-reactive-chemical-explosion/.