2021 Annual Meeting

(346n) Reinforcement Learning with Neural Feedback Policies

Authors

Petsagkourakis, P. - Presenter, University College London
del Rio Chanona, A., Imperial College London
Sandoval Cárdenas, I. O., Imperial College London
In this work we find an optimal feedback control policy for continuous nonlinear control problems. We explore the novel use of a single neural network as a closed-loop feedback controller that takes the full state of a system as input and outputs a multivariate constrained control. The strategy closely resembles the policy parametrizations commonly used in reinforcement learning, and can be considered a direct form of policy search when a dynamical model is available. The novel characteristic is the exploitation of a white-box dynamical system model, which allows the policy gradients to be computed efficiently and directly from the ODE sensitivities. The network architecture is posed as part of the ODE definition, substituting the neural network for the original controls, thereby allowing the weights of the network to be treated as the controllable parameters of the modified system. In this way, the approach connects long-studied ideas in optimal control with recent approaches focused on neural policies and reinforcement learning.
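As a rough illustration of this substitution (not the authors' code), the sketch below embeds a small neural policy in the right-hand side of an ODE so that the network weights become the parameters of the closed-loop system; the dynamics f(x, u), the layer sizes, and the tanh squashing used to enforce control bounds are all placeholder assumptions.

```python
# Sketch: a neural feedback policy substituted for the control input of an ODE,
# so the network weights become the controllable parameters of the modified system.
import jax
import jax.numpy as jnp

def init_policy(key, sizes=(2, 16, 16, 1)):
    """Initialize a small MLP mapping state -> control (sizes are an assumption)."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (n_in, n_out)) * jnp.sqrt(2.0 / n_in)
        params.append((w, jnp.zeros(n_out)))
    return params

def policy(params, x, u_lo=-1.0, u_hi=1.0):
    """Feedback law u = pi_theta(x); tanh squashing keeps the control in [u_lo, u_hi]."""
    h = x
    for w, b in params[:-1]:
        h = jnp.tanh(h @ w + b)
    w, b = params[-1]
    u_raw = h @ w + b
    return u_lo + 0.5 * (u_hi - u_lo) * (jnp.tanh(u_raw) + 1.0)

def f(x, u):
    """Placeholder open-loop dynamics dx/dt = f(x, u), not the paper's case study."""
    x1, x2 = x
    return jnp.array([x2, -x1 + u[0] * (1.0 - x1 ** 2)])

def controlled_rhs(params, x):
    """Closed-loop dynamics: the original control is replaced by the neural policy,
    so the modified ODE depends only on the state and the network weights."""
    return f(x, policy(params, x))
```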


The control policy is composed of a single neural network that receives the state of the system as input and outputs the corresponding control actions. The controller is trained in a closed-loop fashion using gradient-based optimization via discrete adjoint sensitivities of the dynamic model with respect to the neural network parameters. The architecture is inspired by the REINFORCE algorithm from reinforcement learning, but given the explicit availability of the system's dynamics, sensitivities may be used directly to calculate the gradient of the loss function over the parameter space of the controller. A discretize-then-optimize approach, which leverages underlying reverse-mode automatic differentiation for the correct estimation of sensitivities, is used to avoid the instability problems of alternative formulations.
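A minimal sketch of this discretize-then-optimize step is given below, reusing the hypothetical controlled_rhs and policy helpers from the previous snippet: the horizon is discretized with a fixed-step RK4 integrator, and reverse-mode automatic differentiation through that rollout supplies the discrete adjoint sensitivities of the loss with respect to the network weights. The integrator, step size, and quadratic cost are assumptions for illustration only.

```python
import jax
import jax.numpy as jnp

def rk4_step(params, x, dt):
    """One fixed-step RK4 step of the closed-loop dynamics."""
    k1 = controlled_rhs(params, x)
    k2 = controlled_rhs(params, x + 0.5 * dt * k1)
    k3 = controlled_rhs(params, x + 0.5 * dt * k2)
    k4 = controlled_rhs(params, x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def rollout(params, x0, dt, n_steps):
    """Integrate the discretized closed-loop system over the fixed horizon."""
    def step(x, _):
        x_next = rk4_step(params, x, dt)
        return x_next, x_next
    _, traj = jax.lax.scan(step, x0, None, length=n_steps)
    return traj  # shape (n_steps, state_dim)

def loss(params, x0, dt=0.02, n_steps=100):
    """Example quadratic running + terminal cost over the discretized trajectory."""
    traj = rollout(params, x0, dt, n_steps)
    u_traj = jax.vmap(lambda x: policy(params, x))(traj)
    running = dt * jnp.sum(traj ** 2) + 0.1 * dt * jnp.sum(u_traj ** 2)
    terminal = jnp.sum(traj[-1] ** 2)
    return running + terminal

# Reverse-mode AD through the discretized rollout: the gradient of the loss
# with respect to the policy parameters (the discrete adjoint sensitivities).
grad_fn = jax.grad(loss)
```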


Gradient-based optimization is applied directly over the parameterization of the neural network controller to minimize a running and terminal cost over a fixed time interval. As a result, we construct a policy that can handle continuous nonlinear optimal control problems in the same spirit as, but orders of magnitude more efficiently than, standard policy gradient learning, whenever a dynamical model is available.
We test the proposed technique on challenging nonlinear optimal control problems from process engineering where the governing dynamical system is available.
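For completeness, a sketch of how such a training loop could look is shown below, assuming the loss and init_policy helpers from the earlier snippets; plain gradient descent is used here purely as a stand-in, as the abstract does not specify the optimizer.

```python
import jax
import jax.numpy as jnp

def train(x0, n_iters=500, lr=1e-2, seed=0):
    """Minimize the running + terminal cost directly over the policy parameters."""
    params = init_policy(jax.random.PRNGKey(seed))
    value_and_grad = jax.jit(jax.value_and_grad(loss))
    for i in range(n_iters):
        value, grads = value_and_grad(params, x0)
        # Gradient step directly on the neural network weights.
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
        if i % 100 == 0:
            print(f"iter {i:4d}  cost {value:.4f}")
    return params

trained_params = train(jnp.array([1.0, 0.0]))
```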