Reinforcement learning (RL) and model predictive control (MPC) are both successful strategies for solving optimal control problems, each leveraging its own specialized theory and tools. RL takes an iterative approach to learning control policies through interaction with an environment, benefiting from great flexibility and scalability [1]. The tools of RL are often designed around the Bellman equation, learning a Q-function from which an optimal policy can be derived [2]. In contrast, MPC is a widely adopted strategy for obtaining tractable optimization-based control policies, which design actions online through model-based predictions in a receding-horizon fashion [3]. With different priorities than RL, MPC theory has generally focused on guarantees of safe system operation, such as constraint satisfaction, robustness, and stability [4]. Given their complementary strengths, there is increasing interest in combining the tool sets of RL and MPC [5]–[7], and several ways in which the two can successfully interact have been identified [8], [9].
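In generic notation (not that of [13]), the two toolsets are organized around the following standard objects: RL methods built on the Bellman equation learn the optimal Q-function and act greedily with respect to it, while MPC repeatedly solves a finite-horizon optimal control problem at the current state and applies the first action of the minimizing sequence.

```latex
% Bellman optimality equation and greedy policy (RL side)
Q^{\ast}(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(\cdot\mid s,a)}\!\Big[\max_{a'} Q^{\ast}(s',a')\Big],
\qquad \pi^{\ast}(s) \in \arg\max_{a} Q^{\ast}(s,a)

% Finite-horizon problem solved by MPC at each measured state s (control side)
\min_{a_0,\dots,a_{N-1}} \; \sum_{k=0}^{N-1} \ell(s_k,a_k) + V_f(s_N)
\quad \text{s.t.} \quad s_0 = s,\;\; s_{k+1} = f(s_k,a_k),\;\; (s_k,a_k) \in \mathcal{Z}
```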
However, RL and MPC have generally been developed in isolation from each other, making the integration of their sophisticated software tools (e.g., [10]–[12]) challenging and limited. Furthermore, existing approaches for integrating the two do not resolve the significant computational cost of running and differentiating MPC within an RL algorithm, which scales with the number of time steps, update iterations, and batch size. As a consequence, how to best leverage the scalability and flexibility of RL while retaining the theoretical properties of MPC remains an open question.
With this in mind, we propose MPCritic [13]: an architecture and associated learning framework for the seamless integration of machine learning and MPC tools, readily incorporating MPC theory in its design. The architecture shares the interpretable structure of MPC, including the dynamic model, cost, and constraints, to define an RL “critic network”. However, unlike other combined MPC and RL frameworks, iteratively training the critic does not require solving the MPC problem. Rather, within MPCritic, a “fictitious” controller that approximates the MPC optimization enables computationally inexpensive critic evaluation and differentiation, directly addressing the scalability issues in learning. Furthermore, because the MPC structure is preserved, the learned MPC can still be solved for online control, where its theoretical properties, such as robust constraint satisfaction and stability, matter most. This is achieved by simply discarding the fictitious controller during online control.
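To make the idea concrete, the following PyTorch-style code is a minimal sketch, not the actual MPCritic API; the class and component names (MPCCriticSketch, stage_cost, the linear model, the fictitious policy network) are illustrative assumptions. It only conveys that, with a fictitious controller supplying the tail of the action sequence, the critic value is a model-based rollout cost, so evaluating and differentiating it is an ordinary forward/backward pass rather than an embedded optimization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MPCCriticSketch(nn.Module):
    """Hypothetical MPC-structured critic: Q(s, a) = -(predicted N-step MPC cost)."""

    def __init__(self, nx, nu, horizon=10):
        super().__init__()
        self.horizon = horizon
        # Learnable linear model s' = A s + B a (any differentiable model works).
        self.A = nn.Parameter(0.1 * torch.randn(nx, nx))
        self.B = nn.Parameter(0.1 * torch.randn(nx, nu))
        # Learnable quadratic stage-cost weights, kept positive via softplus.
        self.q = nn.Parameter(torch.zeros(nx))
        self.r = nn.Parameter(torch.zeros(nu))
        # "Fictitious" controller: a cheap approximation of the inner MPC minimizer.
        self.pi = nn.Sequential(nn.Linear(nx, 64), nn.Tanh(), nn.Linear(64, nu))

    def stage_cost(self, s, a):
        return (s.pow(2) * F.softplus(self.q)).sum(-1) + (a.pow(2) * F.softplus(self.r)).sum(-1)

    def forward(self, s, a):
        # Roll the model forward; after the first (given) action, the fictitious
        # controller supplies the remaining actions, so no optimizer is called.
        cost = self.stage_cost(s, a)
        s = s @ self.A.T + a @ self.B.T
        for _ in range(self.horizon - 1):
            a = self.pi(s)
            cost = cost + self.stage_cost(s, a)
            s = s @ self.A.T + a @ self.B.T
        return -cost  # critic value = negative predicted cost
```

At deployment, the fictitious controller is simply dropped and the learned model, cost, and constraints are handed to an MPC solver (e.g., via CasADi or do-mpc [10], [11]), so constraint satisfaction is enforced by the online optimization rather than by the approximation.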
The modular design of MPCritic admits a range of learning configurations, depending on which strengths of RL and MPC best suit a given application. At one extreme, individual MPC components, such as the dynamic model and cost, can be designed to ensure theoretical properties of the online MPC. At the other extreme, all components can be learned in unison as a more general and flexible RL function approximator. We validate the proposed learning framework through comparison with standard MPC formulations and deep RL approaches on classical control benchmarks: offline learning of the theoretically optimal MPC for the linear quadratic regulator (LQR), extension to the online setting with constraints, and learning a stochastic RL “actor” guided by the fictitious controller for improved performance and constraint satisfaction in a nonlinear environment.
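Continuing the sketch above (same hypothetical names), the two extremes might differ only in which parameters are trainable: a previously identified model can be frozen to preserve properties tied to it, or every component can be updated by the RL loss.

```python
import torch  # MPCCriticSketch as defined in the previous sketch

critic = MPCCriticSketch(nx=4, nu=2)

# (a) Fix an identified dynamic model; learn only the cost and fictitious controller.
critic.A.requires_grad_(False)
critic.B.requires_grad_(False)

# (b) Or learn all components in unison as a generic RL function approximator.
trainable = [p for p in critic.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```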
In offline learning for LQR, the framework (approximately) learns the optimal dynamic model and the corresponding solution of the discrete algebraic Riccati equation (a reference computation is sketched below) for systems with over 100 state and action dimensions. Compared to differentiable MPC [8], the framework benefits from computationally cheap evaluation and differentiation, as well as favorable scaling. These attributes, combined with the retained MPC structure, lessen the constraints that computational cost places on the user’s choice of RL algorithm for learning MPC. In the online setting, the MPCritic agent solves the exact MPC optimization during its interactions with the environment, updating its MPC structure afterwards with the TD3 algorithm [14]. Relative to deep RL, the learned MPCritic agent demonstrates improvements in performance, constraint satisfaction, and sample efficiency. These benefits also translate to the maximum entropy RL setting [15]. By learning the stochastic actor and RL critic with the SAC algorithm [16] in conjunction with system identification, the MPCritic parameterization effectively balances the goal of the MDP with the constraints of the MPC formulation, whereas traditional RL and MPC approaches can neglect one in favor of the other. These case studies demonstrate the theoretical connection, scalability, and versatility of MPCritic as an algorithmic framework for seamlessly integrating advanced MPC and RL tools, inviting further investigation of the framework’s theoretical properties, extension to increasingly complex MPC formulations, and utility as a general inductive bias in RL.
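For context on the LQR study, the target that the learned components are compared against is the standard closed-form solution; the snippet below is a generic reference computation (placeholder A, B, Q, R, not the systems used in [13]) using SciPy's discrete algebraic Riccati equation solver.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Placeholder double-integrator-like system; swap in any stabilizable (A, B).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

P = solve_discrete_are(A, B, Q, R)                 # DARE solution
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # optimal gain, a* = -K s

def q_star(s, a):
    """Optimal undiscounted Q-function: stage cost plus value of the successor state."""
    s_next = A @ s + B @ a
    return s @ Q @ s + a @ R @ a + s_next @ P @ s_next
```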
References:
[1] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Nashua, NH: Athena Scientific, 1996.
[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 2018.
[3] J. B. Rawlings, D. Q. Mayne, and M. Diehl, Model Predictive Control: Theory, Computation, and Design. Santa Barbara, CA: Nob Hill Publishing, 2017.
[4] D. Q. Mayne, J. B. Rawlings, C. V. Rao, and P. O. Scokaert, “Constrained model predictive control: Stability and optimality,” Automatica, vol. 36, no. 6, pp. 789–814, 2000.
[5] A. Mesbah et al., “Fusion of Machine Learning and MPC under Uncertainty: What Advances Are on the Horizon?” in Proceedings of the American Control Conference, Atlanta, 2022, pp. 342–357.
[6] R. Reiter et al., “Synthesis of model predictive control and reinforcement learning: Survey and classification,” 2025, arXiv:2502.02133.
[7] T. Banker, N. P. Lawrence, and A. Mesbah, “Local-Global learning of interpretable control policies: The interface between MPC and reinforcement learning,” 2025, arXiv:2503.13289.
[8] B. Amos, I. D. J. Rodriguez, J. Sacks, B. Boots, and J. Z. Kolter, “Differentiable MPC for end-to-end planning and control,” 2019, arXiv:1810.13400.
[9] S. Gros and M. Zanon, “Data-Driven Economic NMPC Using Reinforcement Learning,” IEEE Transactions on Automatic Control, vol. 65, no. 2, pp. 636–648, 2020.
[10] J. A. E. Andersson, J. Gillis, G. Horn, J. B. Rawlings, and M. Diehl, “CasADi – A software framework for nonlinear optimization and optimal control,” Mathematical Programming Computation, vol. 11, no. 1, pp. 1–36, 2019.
[11] F. Fiedler et al., “do-mpc: Towards FAIR nonlinear and robust model predictive control,” Control Engineering Practice, vol. 140, p. 105676, 2023.
[12] S. Huang et al., “CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms,” Journal of Machine Learning Research, vol. 23, no. 274, pp. 1–18, 2022.
[13] N. P. Lawrence, T. Banker, and A. Mesbah, “MPCritic: A plug-and-play MPC architecture for reinforcement learning,” 2025, arXiv:2504.01086.
[14] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80, Stockholm, 2018, pp. 1587–1596.
[15] S. Levine, “Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review,” 2018, arXiv:1805.00909.
[16] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80, Stockholm, 2018, pp. 1861–1870.