2024 AIChE Annual Meeting
(732e) Synergistic Integration of Reinforcement Learning with Conventional Process Control
In this work, we propose a control structure that augments existing conventional process control (CPC) methods with a reinforcement learning (RL) agent implemented in parallel. Because RL typically suffers from slow learning rates and high exploration requirements, the existing conventional process controller (e.g., PID, MPC) continues to compute its own control action, which is used to accelerate the learning of the RL agent [5]. A weighted sum of the RL and CPC control actions is computed and applied to the plant [6]; the resulting states and actions are then used to supplement the RL agent’s learning. The proposed algorithm avoids direct actuation by a naive RL agent, which may yield unacceptable performance and may even be unsafe under worst-case scenarios. Algorithms are developed for an adaptive weighting function based on measures of instantaneous and historical performance. The performance of both the RL and CPC methods is assessed over a moving horizon with time-decaying weights, so that more recent actions count for more than older ones. In addition, short-term performance trends are derived to allow for rapid transitions. In this way, the RL agent can take over control when its performance exceeds that of the CPC method; if the RL agent’s performance begins to deteriorate, the conventional controller again assumes full control before performance degrades significantly.
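As a minimal illustration of this parallel arrangement, the sketch below blends the RL and CPC actions with a weight adapted from time-decayed tracking-error measures and a short-term trend. The class name, decay factor, trend window, and update rule are assumptions made for illustration only, not the exact weighting function developed in this work.

import numpy as np

class AdaptiveBlender:
    """Blend RL and CPC actions with a weight adapted from recent tracking performance.
    All names, gains, and thresholds are illustrative assumptions, not the paper's exact rule."""

    def __init__(self, decay=0.95, trend_window=10, gain=0.5):
        self.decay = decay            # time-decay factor of the moving performance horizon
        self.trend_window = trend_window
        self.gain = gain              # how aggressively authority shifts toward the better controller
        self.rl_score = 0.0           # discounted cumulative cost attributed to the RL action
        self.cpc_score = 0.0          # discounted cumulative cost attributed to the CPC action
        self.rl_recent = []           # short history of RL errors for trend estimation
        self.w = 0.0                  # fraction of control authority given to RL (0 = pure CPC)

    def update(self, err_rl, err_cpc):
        """Update performance measures from the instantaneous tracking errors
        attributed to each controller's proposed action."""
        self.rl_score = self.decay * self.rl_score + abs(err_rl)
        self.cpc_score = self.decay * self.cpc_score + abs(err_cpc)
        self.rl_recent.append(abs(err_rl))
        if len(self.rl_recent) > self.trend_window:
            self.rl_recent.pop(0)
        # Short-term trend: positive slope means the RL agent's error is currently rising.
        trend = (np.polyfit(range(len(self.rl_recent)), self.rl_recent, 1)[0]
                 if len(self.rl_recent) > 2 else 0.0)
        # Shift weight toward RL when its discounted cost is lower and not rising sharply.
        advantage = (self.cpc_score - self.rl_score) / (self.cpc_score + self.rl_score + 1e-8)
        self.w = float(np.clip(self.w + self.gain * (advantage - max(trend, 0.0)), 0.0, 1.0))

    def blend(self, u_rl, u_cpc):
        """Weighted sum of the RL and CPC actions that is actually sent to the plant."""
        return self.w * np.asarray(u_rl) + (1.0 - self.w) * np.asarray(u_cpc)

In use, the blended action would be applied to the plant, and the resulting transition (state, blended action, reward, next state) stored in the RL agent’s replay buffer, so that learning continues even while the CPC retains most of the control authority.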
The algorithm is demonstrated on a dynamic process model of a solid oxide cell (SOC) plant for H2 and power production [7]. The RL algorithm used is the twin-delayed deep deterministic policy gradient (TD3), applied to temperature regulation at the outlet of the SOC stack. A notable advantage of TD3 is its ability to handle continuous action spaces, which allows a direct one-to-one comparison with the control actions of the conventional method. For temperature regulation of the SOC system, the CPC is a series of PID controllers arranged in cascade loops. Because of the complex dynamics associated with SOC mode-switching operation between hydrogen and power production, the performance of the PID controllers can be poor, whereas the actor-critic structure of the RL algorithm is intended to capture the nonlinear dynamics accurately. For this case study, the RL agent is proposed to augment, and eventually phase out, the cascaded PID loops. Episodic learning for the RL-CPC arrangement consists of a series of hydrogen production set-point changes; mode switching from maximum hydrogen production to maximum power production and back to maximum hydrogen production is considered as well. Although learning is episodic, the states are continuous across episodes, providing a consistent measure of performance improvement.
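For context on the CPC baseline, the fragment below sketches a cascaded PID arrangement of the kind described above, where an outer temperature loop sets the reference of a faster inner loop. The loop pairing (stack outlet temperature driving an air-flow set-point, which in turn drives a blower command), the gains, the limits, and the units are hypothetical and do not correspond to the actual tuning of the SOC plant model.

class PID:
    """Textbook PI(D) controller; gains and signal names are illustrative only."""
    def __init__(self, kp, ki, kd=0.0, u_min=-1e9, u_max=1e9):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.u_min, self.u_max = u_min, u_max
        self.integral = 0.0
        self.prev_err = None

    def step(self, setpoint, measurement, dt):
        err = setpoint - measurement
        self.integral += err * dt
        deriv = 0.0 if self.prev_err is None else (err - self.prev_err) / dt
        self.prev_err = err
        u = self.kp * err + self.ki * self.integral + self.kd * deriv
        return min(max(u, self.u_min), self.u_max)

# Cascade: the outer temperature loop sets the reference for a faster inner loop
# (e.g., cathode air flow); the inner loop drives the actuator (blower command).
outer = PID(kp=0.8, ki=0.05, u_min=0.0, u_max=2.0)   # stack outlet temperature -> air-flow set-point
inner = PID(kp=2.0, ki=0.5, u_min=0.0, u_max=1.0)    # air flow -> blower command

def cpc_action(T_sp, T_out, flow_meas, dt=1.0):
    """One sampling step of the cascaded PID structure (hypothetical signals and units)."""
    flow_sp = outer.step(T_sp, T_out, dt)
    return inner.step(flow_sp, flow_meas, dt)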
Specific contributions of this work include:
- An algorithm is developed for the parallel implementation of RL alongside conventional process control, allowing for transition of control from CPC to RL based on current and past performance. By leveraging short-term and long-term projections of control performance, this algorithm facilitates effective switching without degrading control.
- Online training and implementation of a direct RL algorithm is demonstrated for process control of systems with complex, nonlinear, continuous dynamics. Degradation of RL performance throughout training is shown to be limited to the level expected of the in-place conventional control.
- It is observed that the RL-CPC algorithm can learn from, and surpass, the sub-optimal policy demonstrated by the in-place conventional controller, eventually arriving at a policy superior to that of the conventional method.
- It is observed that the RL-CPC arrangement arrives at an optimal policy faster than traditional online RL methods, with limited performance degradation.
- It is demonstrated that when the RL agent encounters an unknown operating condition that degrades control performance, the control system reverts to the in-place conventional controller, thereby limiting potential error and poor performance, as illustrated in the sketch following this list.
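To make the reversion behavior in the last bullet concrete, the fragment below sketches one possible degradation guard layered on the blending sketch shown earlier. The averaging test, the margin, and the function name are hypothetical and are not the degradation criterion used in this work.

def revert_if_degraded(blender, err_rl_recent, cpc_baseline_err, margin=1.5):
    """Hand full control authority back to the CPC if the RL agent's recent tracking
    error grows well beyond the CPC's historical baseline.
    `blender` is the AdaptiveBlender sketched earlier; `margin` and the mean-error
    test are illustrative assumptions only."""
    if len(err_rl_recent) == 0:
        return
    mean_rl_err = sum(abs(e) for e in err_rl_recent) / len(err_rl_recent)
    if mean_rl_err > margin * cpc_baseline_err:
        blender.w = 0.0  # full authority returns to the conventional controller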
[1] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” 4th Int. Conf. Learn. Represent. ICLR 2016 - Conf. Track Proc., Sep. 2016, [Online]. Available: http://arxiv.org/abs/1509.02971
[2] S. Fujimoto, H. Van Hoof, and D. Meger, “Addressing Function Approximation Error in Actor-Critic Methods,” in 35th International Conference on Machine Learning, ICML 2018, Feb. 2018, pp. 2587–2601. [Online]. Available: http://arxiv.org/abs/1802.09477
[3] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” J. Mach. Learn. Res., vol. 16, pp. 1437–1480, 2015.
[4] O. Dogru et al., “Reinforcement Learning in Process Industries: Review and Perspective,” IEEE/CAA J. Autom. Sin., vol. 11, no. 2, pp. 1–19, 2024, doi: 10.1109/JAS.2024.124227.
[5] J. A. Clouse, “On Integrating Apprentice Learning and Reinforcement Learning,” University of Massachusetts, 1996.
[6] M. T. Rosenstein and A. G. Barto, “Reinforcement learning with supervision by a stable controller,” Proc. Am. Control Conf., vol. 5, pp. 4517–4522, 2004, doi: 10.1109/ACC.2004.182663.
[7] D. A. Allan et al., “NMPC for Setpoint Tracking Operation of a Solid Oxide Electrolysis Cell System,” Found. Comput. Aided Process Oper. / Chem. Process Control (FOCAPO/CPC 2023), pp. 1–6, 2023, [Online]. Available: https://www.netl.doe.gov/projects/files/NMPCforSetpointTrackingOperatio…