2025 AIChE Annual Meeting

(259a) Continuous Learning of the Value Function Utilizing Deep Reinforcement Learning and Its Use As the Objective in Model Predictive Control

Authors

Elijah Hedrick, West Virginia University
Debangsu Bhattacharyya, West Virginia University
Reinforcement learning (RL), a machine learning technique, and model predictive control (MPC) possess an inherent synergy in how they operate. Many examples of integrating RL and MPC can be found in the open literature. The most common methods apply high-level or meta RL in conjunction with MPC rather than implementing RL at the level of control execution [1]. For such high-level implementations, the focus is often on tuning the internal weights and hyperparameters of the controller structure itself [2], [3]. RL can also be used to modify the objective function of the MPC. In this way, an approximation of an infinite-horizon MPC may be derived with the value function of the RL algorithm (a linear approximation rather than an ANN) acting as a terminal cost [4], [5]. This approach allows a multi-step return to be used in the update procedure for the RL algorithm, exploiting the structural similarity between the MPC prediction horizon and the return of the RL value function. This also speaks to a broader problem, even in simple cases: the sample inefficiency of RL, which is discussed widely in the literature [6].
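For concreteness, a generic form of this combination (our illustrative notation, not taken verbatim from the cited works; sign and cost/reward conventions vary across formulations) replaces the usual terminal penalty of the MPC with a learned value function, and the optimized horizon supplies a multi-step target for updating that value function:

```latex
% Finite-horizon MPC cost with a learned terminal value function V_\theta
% (stage cost \ell, dynamics f, and constraints g are problem-specific)
\min_{u_0,\dots,u_{N-1}} \; \sum_{k=0}^{N-1} \gamma^{k}\,\ell(x_k,u_k) \;+\; \gamma^{N} V_\theta(x_N)
\quad \text{s.t.} \quad x_{k+1}=f(x_k,u_k), \;\; g(x_k,u_k)\le 0 .

% n-step temporal-difference target assembled along the horizon,
% with \theta^- denoting target-network parameters
G_t^{(n)} = \sum_{k=0}^{n-1}\gamma^{k} r_{t+k} + \gamma^{n} V_{\theta^-}(x_{t+n}),
\qquad \theta \leftarrow \theta + \alpha\left[G_t^{(n)} - V_\theta(x_t)\right]\nabla_\theta V_\theta(x_t).
```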

In this work, we propose a form of meta-RL embedded within the controller structure of a classic MPC. It is assumed that a model of the system is known, along with the associated constraints and bounds. An RL action-value function is proposed to replace the typical MPC objective function. A radial basis function expansion is evaluated as the feature vector, and a deep neural network is also explored. The proposed value-function model predictive controller (VFMPC) uses the predicted horizons to refine the return and employs a target value function to enable off-policy temporal-difference learning.
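As an illustration of the learning component only, the sketch below shows a linear value-function approximation over radial basis features with a one-step temporal-difference update against a target parameter vector. The feature counts, widths, and learning rates are hypothetical placeholders, not the settings used in this work.

```python
import numpy as np

# Hypothetical RBF centers and width, chosen for illustration only.
CENTERS = np.random.uniform(-1.0, 1.0, size=(25, 2))   # 25 centers in a 2-state space
WIDTH = 0.5

def rbf_features(x):
    """Gaussian RBF feature vector phi(x) for state x."""
    d2 = np.sum((CENTERS - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * WIDTH ** 2))

def value(theta, x):
    """Linear value-function approximation V_theta(x) = theta^T phi(x)."""
    return theta @ rbf_features(x)

def td0_update(theta, theta_target, x, r, x_next, gamma=0.99, alpha=1e-2):
    """One-step TD update that bootstraps from a (slowly updated) target parameter vector."""
    target = r + gamma * value(theta_target, x_next)   # off-policy bootstrap with target VF
    delta = target - value(theta, x)                   # TD error
    return theta + alpha * delta * rbf_features(x)
```

In the actual VFMPC, the bootstrapped term and the multi-step return are taken along the MPC's optimized trajectory rather than from a single observed transition, as described in the contributions below.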

To retain the MPC policy, the value-function neural network must be cast in a form that allows a nonlinear constrained optimization problem to be solved over the prediction horizon. In this way, the form and policy of a standard MPC algorithm are maintained while RL adapts the controller to the changing dynamics of the system. Most continuous RL algorithms, such as DDPG [7] or TD3 [8], adopt an actor-critic structure to circumvent this value-function optimization. In this work, the ANN model is instead converted into a gradient-boosted tree surrogate model by adapting the existing open-source package OMLT (Optimization & Machine Learning Toolkit) [9]. The resulting surrogate can then be optimized over the horizon using the Pyomo package and nonlinear solvers such as IPOPT [10]–[12].
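A minimal sketch of how such a surrogate objective might be embedded in a Pyomo horizon problem is shown below. The OMLT class names (OmltBlock, GBTBigMFormulation, GradientBoostedTreeModel) reflect our reading of that package's gradient-boosted-tree interface and, like the file name, dynamics, bounds, and dimensions, should be treated as assumptions rather than the exact implementation used here.

```python
import onnx
import pyomo.environ as pyo
from omlt import OmltBlock
from omlt.gbt import GBTBigMFormulation, GradientBoostedTreeModel  # assumed OMLT GBT interface

N = 10                                         # prediction horizon (illustrative)
m = pyo.ConcreteModel()
m.k = pyo.RangeSet(0, N)
m.x = pyo.Var(m.k, bounds=(-5.0, 5.0))         # single state, hypothetical bounds
m.u = pyo.Var(m.k, bounds=(-1.0, 1.0))         # single input, hypothetical bounds

# Placeholder process model x_{k+1} = f(x_k, u_k); a linear model is used only for illustration.
def _dyn(m, k):
    if k == N:
        return pyo.Constraint.Skip
    return m.x[k + 1] == 0.9 * m.x[k] + 0.5 * m.u[k]
m.dyn = pyo.Constraint(m.k, rule=_dyn)

# Attach the learned value-function surrogate (exported to ONNX) and tie it to the terminal state.
m.vf = OmltBlock()
gbt = GradientBoostedTreeModel(onnx.load("value_function_gbt.onnx"))  # hypothetical file
m.vf.build_formulation(GBTBigMFormulation(gbt))
m.link = pyo.Constraint(expr=m.vf.inputs[0] == m.x[N])

# Stage cost plus learned terminal value; signs depend on the cost/reward convention.
m.obj = pyo.Objective(
    expr=sum(m.x[k] ** 2 + 0.1 * m.u[k] ** 2 for k in m.k if k < N) + m.vf.outputs[0],
    sense=pyo.minimize,
)

pyo.SolverFactory("ipopt").solve(m)   # solver choice depends on the surrogate formulation
```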

The algorithm is first applied to a simple double integrator using the previously mentioned linear value-function approximation. It is then applied to a selective catalytic reduction (SCR) unit, a non-square problem, to control NOx emissions and mitigate ammonia slip [2]. Here we also examine the effect of catalyst decay, which yields a time-varying system for which conventional MPC struggles to achieve good control performance. For this case, the use of an ANN as the objective function is also evaluated.
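For reference, a common discrete-time double-integrator benchmark can be simulated as below; the sampling time and quadratic reward are illustrative assumptions, not necessarily the settings used in this work.

```python
import numpy as np

DT = 0.1                                       # assumed sampling time
A = np.array([[1.0, DT], [0.0, 1.0]])          # position-velocity dynamics
B = np.array([0.5 * DT ** 2, DT])

def step(x, u):
    """One step of the discrete-time double integrator x_{k+1} = A x_k + B u_k."""
    return A @ x + B * u

def reward(x, u):
    """Illustrative quadratic reward (negative cost) penalizing state and input."""
    return -(x @ x + 0.1 * u ** 2)
```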

In light of these points, the main contributions of this work are:

  • We propose a combination of RL and MPC in which the learned value function is used as the cost function of the MPC. This formulation yields a constrained MPC policy that is optimal with respect to the reward function, without the need for extensive tuning or discrete updates of the MPC. This is a clear advantage over a standard MPC formulation, where both would be required for the scenarios considered here. The use of MPC as the policy also offers an advantage over standard RL, since the actions are subject to the constraints imposed over the receding horizon.
  • A key contribution is the use of the optimized trajectory from the MPC to accelerate learning, along with an analysis of how the search depth along the trajectory affects the rate of learning.
  • We propose two algorithms employing these concepts: VFMPC(0), which uses the one-step return to learn the cost function, and VFMPC(n), which uses the optimal trajectory to learn from the n-step return subject to the dynamics of the process model (see the sketch following this list).
  • The introduction of neural networks into the VFMPC(n) algorithm addresses controller performance issues caused by slowly changing dynamics and plant-model mismatch, without the significant manual MPC update that would be needed in a standard MPC formulation.
  • In line with reducing plant-model mismatch, the linear approximation of the value function is replaced with an ANN model combined with an appropriate surrogate-model conversion. With this formulation, it is shown that a superior policy can be found even when an approximate model is used within the MPC policy.
  • These algorithms and their performance are demonstrated on a benchmark process control example as well as on a model of an industrial SCR unit undergoing catalyst decay.
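As referenced above, the distinction between VFMPC(0) and VFMPC(n) lies in how far along the optimized trajectory the learning target looks. The sketch below illustrates that idea in simplified, self-contained notation (generic rewards and a caller-supplied target value function); it is not the exact update used in this work.

```python
def n_step_target(traj_states, traj_rewards, n, value_fn, gamma=0.99):
    """n-step return assembled along the MPC's optimized (predicted) trajectory.

    traj_states[k] and traj_rewards[k] are the predicted state and reward at step k
    of the horizon; value_fn is the target value-function approximation.
    With n = 1 this reduces to the one-step target used by VFMPC(0).
    """
    g = sum(gamma ** k * traj_rewards[k] for k in range(n))
    return g + gamma ** n * value_fn(traj_states[n])
```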

References

[1] O. Dogru et al., “Reinforcement Learning in Process Industries: Review and Perspective,” IEEE/CAA J. Autom. Sin., vol. 11, no. 2, pp. 1–19, 2024, doi: 10.1109/JAS.2024.124227.

[2] E. Hedrick, K. Hedrick, D. Bhattacharyya, S. E. Zitney, and B. Omell, “Reinforcement learning for online adaptation of model predictive controllers: Application to a selective catalytic reduction unit,” Comput. Chem. Eng., vol. 160, p. 107727, 2022, doi: 10.1016/j.compchemeng.2022.107727.

[3] S. Gros and M. Zanon, “Data-driven economic NMPC using reinforcement learning,” IEEE Trans. Automat. Contr., vol. 65, no. 2, pp. 636–648, Feb. 2020, doi: 10.1109/TAC.2019.2913768.

[4] Y. Yang and S. Lucia, “Multi-step greedy reinforcement learning based on model predictive control,” IFAC-PapersOnLine, vol. 54, no. 3, pp. 699–705, 2021, doi: 10.1016/j.ifacol.2021.08.323.

[5] X. Pan, X. Chen, Q. Zhang, and N. Li, “Model Predictive Control: A Reinforcement Learning-Based Approach,” J. Phys. Conf. Ser., vol. 2203, no. 1, p. 012058, 2022, doi: 10.1088/1742-6596/2203/1/012058.

[6] R. Nian, J. Liu, and B. Huang, “A review on reinforcement learning: Introduction and applications in industrial process control,” Comput. Chem. Eng., vol. 139, p. 106886, Aug. 2020, doi: 10.1016/j.compchemeng.2020.106886.

[7] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” 4th Int. Conf. Learn. Represent. ICLR 2016 - Conf. Track Proc., Sep. 2016, [Online]. Available: http://arxiv.org/abs/1509.02971

[8] S. Fujimoto, H. Van Hoof, and D. Meger, “Addressing Function Approximation Error in Actor-Critic Methods,” in 35th International Conference on Machine Learning, ICML 2018, Feb. 2018, pp. 2587–2601. [Online]. Available: http://arxiv.org/abs/1802.09477

[9] F. Ceccon et al., “OMLT: Optimization & Machine Learning Toolkit,” J. Mach. Learn. Res., vol. 23, pp. 1–8, Feb. 2022, [Online]. Available: http://arxiv.org/abs/2202.02414

[10] W. E. Hart, J.-P. Watson, and D. L. Woodruff, “Pyomo: modeling and solving mathematical programs in Python,” Math. Program. Comput., vol. 3, no. 3, pp. 219–260, 2011.

[11] A. Wächter and L. T. Biegler, “On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming,” Math. Program., vol. 106, no. 1, pp. 25–57, May 2006, doi: 10.1007/s10107-004-0559-y.

[12] M. L. Bynum et al., Pyomo–Optimization Modeling in Python, 3rd ed., vol. 67. Springer Science & Business Media, 2021.