2024 AIChE Annual Meeting

(118g) A Perspective on Learning Interpretable Control Policies: Thinking Local and Global

Authors

Banker, T. - Presenter, The Ohio State University
Mesbah, A., University of California, Berkeley
In this work, we provide a perspective on how existing methods pose the problem of control policy learning and how common challenges therein may be addressed through the fusion of machine learning and optimal control, explicitly balancing the exploitation of local state-space knowledge with exploration of globally uncertain regions. Modern control engineering problems, such as integrating renewable energy sources into power grids, operating robotic systems that interact with humans, and controlling medical devices for personalized care, exhibit unprecedented complexity [1]. With recent advances in computing, sensing, and communication, alongside developments in machine learning, focus has shifted towards data-driven, automated controller design and adaptation to reduce manual tuning in these complex problems [2]. At its core, learning control policies is cast as a constrained optimization problem, comprising black-box objectives and constraints that can only be queried as noisy observations of a closed-loop system [3].
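As a concrete illustration (the notation below is ours, introduced for clarity rather than taken from the cited works), the learning problem over a parameterized policy \pi_\theta can be written as

    \min_{\theta} \; J(\theta) \quad \text{s.t.} \quad g_i(\theta) \le 0, \quad i = 1, \dots, m,

where the closed-loop objective J(\theta) and constraints g_i(\theta) have no known analytical form and are only accessible through noisy evaluations \hat{J} = J(\theta) + \epsilon obtained by running the policy on the system.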

Under the mathematical formalism of a Markov decision process (MDP), reinforcement learning (RL) is a powerful approach to this learning problem, combining elements of machine learning and optimal control [4]. Model-free RL attempts to solve the optimization problem using (i) value-based methods, which aim to satisfy the Bellman optimality conditions with a learned Q-function approximation, and (ii) policy search methods, which directly optimize the parameters of some class of policies. Achieving success in complex domains requires function approximators of sufficient capacity, such as deep neural networks (NNs), but such approximators are data-intensive and challenging to interpret [5]. The many learnable NN parameters pose challenges in data-limited settings, and although injecting prior knowledge may reduce the amount of required training data, how to integrate priors into NNs, and how to efficiently leverage imperfect models in the case of model-based RL, remain major challenges [6]. Compounding these challenges, the limited data obtained from the control policy interacting with the system typically fails to balance exploration and exploitation without significant reward shaping, making it difficult to efficiently cover the task-relevant state space in complex, high-dimensional problems [7].
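For reference (standard notation following [4], reproduced here for clarity), value-based methods seek a Q-function satisfying the Bellman optimality condition

    Q^*(s, a) = \mathbb{E}\left[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \right],

so that the greedy policy \pi^*(s) = \arg\max_a Q^*(s, a) is optimal, while policy search methods directly maximize the expected return \mathbb{E}\left[\sum_t \gamma^t r(s_t, a_t)\right] over the parameters \theta of \pi_\theta.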

To address these challenges in learning, RL can take inspiration from optimization-based policies in optimal control (OC), such as model predictive control (MPC), which plan locally optimal actions within a given prediction horizon according to an underlying predictive model [8]. It has been shown that optimization-based policies, when used as planners during policy learning, can guide sampling towards high-reward regions when solving the MDP problem [9]. It has also been theorized that MPC can be cast as a Q-function approximator to be learned with model-free RL algorithms and, under mild conditions, can provide an exact model of the optimal Q-function of the underlying MDP even in cases of imperfect models [10]. We argue that many of the challenges associated with learning deep approximators for optimal control policies may be addressed through the use of an optimization-based policy that interacts with a learned value function reflecting global uncertainty. Compared to a black-box NN policy, an optimization-based policy can significantly reduce the number of parameters, leading to a more interpretable policy capable of respecting constraints. Additionally, including a plant model within the policy allows prior physics knowledge to be encoded and then improved in a performance-oriented manner through experience. By guiding an optimization-based policy with global uncertainty, planned trajectories can balance exploitation of local costs with exploration of globally uncertain regions in a directed manner, using temporally coordinated actions.
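One way to make this concrete (an illustrative sketch in our own notation, not a formulation quoted from [8] or [10]) is an optimization-based policy of the form

    \pi(s) = u_0^*, \qquad u_{0:N-1}^* = \arg\min_{u_{0:N-1}} \; \sum_{k=0}^{N-1} \ell(x_k, u_k) + V_f(x_N)
    \quad \text{s.t.} \quad x_{k+1} = f(x_k, u_k), \;\; x_0 = s, \;\; (x_k, u_k) \in \mathcal{Z},

where f is the (possibly imperfect) plant model encoding prior physics knowledge, \ell is the stage cost, \mathcal{Z} collects state and input constraints, and the terminal cost V_f is where a learned, uncertainty-aware value function can enter to supply the global information that the local plan lacks.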

In this work, we demonstrate an optimization-based policy capable of balancing global exploration with local costs on the well-studied inverted pendulum problem, as well as on the Hopper and Walker2D MuJoCo environments. We study an optimization-based policy with a control objective defined by fixed action costs, and we formulate the terminal cost in terms of an upper-confidence bound on a learned value function, following the principle of optimism in the face of uncertainty [11]. We compute the upper-confidence bound from a value function learned by minimizing the residual Bellman error, together with its posterior variance derived from an uncertainty Bellman equation estimate [12]. Using this formulation, we investigate the trade-off between exploration and exploitation of global uncertainty in trajectory optimization and its effects on controller performance, comparing against a nominal MPC controller and well-studied RL algorithms, including Q-learning and soft actor-critic (SAC). We further explore how such a policy definition behaves with different horizon lengths, sparse rewards, and varying degrees of model mismatch.
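The sketch below illustrates the spirit of this formulation on the inverted pendulum. It is a minimal, hypothetical Python example using a nominal pendulum model, random-shooting trajectory optimization, and placeholder value mean and standard-deviation models; it is not the solver, value-learning procedure, or implementation used in this work.

```python
import numpy as np

# Illustrative sketch: an optimization-based policy whose terminal cost is an
# upper-confidence bound on a (placeholder) learned value function, following
# optimism in the face of uncertainty. All models below are assumptions.

def pendulum_step(x, u, dt=0.05, g=10.0, m=1.0, l=1.0):
    """Nominal inverted-pendulum dynamics used as the policy's internal model."""
    th, thdot = x
    thdot = thdot + dt * (3 * g / (2 * l) * np.sin(th) + 3.0 / (m * l**2) * u)
    th = th + dt * thdot
    return np.array([th, thdot])

def stage_cost(x, u):
    """Fixed state/action costs of the control objective."""
    th, thdot = x
    return th**2 + 0.1 * thdot**2 + 0.001 * u**2

def ucb_terminal_cost(x, value_mean, value_std, beta=1.0):
    """Terminal cost = -(mean + beta * std) of the learned value estimate, so
    uncertain, potentially high-value terminal states are treated optimistically."""
    return -(value_mean(x) + beta * value_std(x))

def plan(x0, value_mean, value_std, horizon=15, n_samples=256, beta=1.0, u_max=2.0):
    """Random-shooting trajectory optimization over the nominal model."""
    best_cost, best_u0 = np.inf, 0.0
    for _ in range(n_samples):
        u_seq = np.random.uniform(-u_max, u_max, horizon)
        x, cost = np.array(x0, dtype=float), 0.0
        for u in u_seq:
            cost += stage_cost(x, u)
            x = pendulum_step(x, u)
        cost += ucb_terminal_cost(x, value_mean, value_std, beta)
        if cost < best_cost:
            best_cost, best_u0 = cost, u_seq[0]
    return best_u0  # receding horizon: apply the first action, then replan

# Placeholder value model: in this work, the mean comes from Bellman residual
# minimization and the standard deviation from an uncertainty Bellman equation
# estimate [12]; simple analytic stand-ins are used here for illustration only.
value_mean = lambda x: -np.sum(np.square(x))
value_std = lambda x: 0.1 * np.linalg.norm(x)

u0 = plan(np.array([np.pi, 0.0]), value_mean, value_std)
print(f"first planned action: {u0:.3f}")
```

Increasing beta weights globally uncertain terminal states more heavily (exploration), while beta = 0 recovers a purely exploitative plan against the learned value mean; this is the trade-off examined in our study.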

[1] K. P. Wabersich, A. J. Taylor, J. J. Choi, K. Sreenath, C. J. Tomlin, A. D. Ames, and M. N. Zeilinger, “Data-driven safety filters: Hamilton-Jacobi reachability, control barrier functions, and predictive methods for uncertain systems,” IEEE Control Systems Magazine, vol. 43, no. 5, pp. 137–177, 2023.
[2] Y. Wen, J. Si, A. Brandt, X. Gao, and H. H. Huang, “Online reinforcement learning control for the personalization of a robotic knee prosthesis,” IEEE Transactions on Cybernetics, vol. 50, no. 6, pp. 2346–2356, 2019.
[3] J. A. Paulson, F. Sorourifar, and A. Mesbah, “A Tutorial on Derivative-Free Policy Learning Methods for Interpretable Controller Representations,” in 2023 American Control Conference (ACC), 2023, pp. 1295–1306.
[4] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
[5] B. Recht, “A tour of reinforcement learning: The view from continuous control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 253–279, 2019.
[6] M. Janner, J. Fu, M. Zhang, and S. Levine, “When to trust your model: Model-based policy optimization,” 2021.
[7] K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch, “Plan online, learn offline: Efficient learning and exploration via model-based control,” 2019.
[8] J. Rawlings, D. Mayne, and M. Diehl, Model Predictive Control: Theory, Computation, and Design. Nob Hill Publishing, 2017.
[9] S. Levine and V. Koltun, “Guided Policy Search,” in Proceedings of the 30th International Conference on Machine Learning, vol. 28, no. 3. PMLR, 2013, pp. 1–9.
[10] A. B. Kordabad, D. Reinhardt, A. S. Anand, and S. Gros, “Reinforcement Learning for MPC: Fundamentals and Current Challenges,” IFAC-PapersOnLine, vol. 56, no. 2, pp. 5773–5780, 2023.
[11] P. Auer and R. Ortner, “Logarithmic online regret bounds for undiscounted reinforcement learning,” in Advances in Neural Information Processing Systems, B. Schölkopf, J. Platt, and T. Hoffman, Eds., vol. 19. MIT Press, 2006.
[12] C. E. Luis, A. G. Bottero, J. Vinogradska, F. Berkenkamp, and J. Peters, “Model-based uncertainty in value functions,” 2023.