2019 AIChE Annual Meeting
(371aa) Disruptive Artificial Intelligence (Reinforcement Learning) Based Predictive Control
To handle continuous or stochastic action spaces, policy-based algorithms (e.g., REINFORCE with policy gradients) optimize the policy directly, without using a value function. A hybrid method, Advantage Actor-Critic (A2C), consists of two distinct deep neural networks: a critic that measures the quality of the action taken (value-based) and an actor that controls how the agent behaves (policy-based); this combination stabilizes learning compared with pure policy-gradient methods. An extension of A2C, the Asynchronous Advantage Actor-Critic (A3C) algorithm, published by Volodymyr Mnih, Google DeepMind, 2016, executes a set of environments in parallel, with the policy-gradient updates computed from the advantage function.

To improve the stability, convergence, and sample efficiency of stochastic policy-gradient methods, Proximal Policy Optimization (PPO), published by John Schulman, OpenAI, 2017, applies a clipped surrogate objective to the policy update (see the sketch after this paragraph). Trust Region Policy Optimization (TRPO), published by John Schulman, UC Berkeley, 2015, enforces a Kullback–Leibler divergence constraint on the size of the policy update at each iteration. Actor-Critic using Kronecker-Factored Trust Region (ACKTR), published by Yuhuai Wu, University of Toronto, 2017, applies Kronecker-factored approximate curvature (K-FAC) to the gradient updates of both the critic and the actor. Soft Actor-Critic (SAC), published by Tuomas Haarnoja, UC Berkeley, 2018, is an off-policy actor-critic model that integrates the entropy of the policy into the reward to steer exploration.
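To make the clipping concrete, the following is a minimal NumPy sketch of PPO's clipped surrogate objective; the function name, the epsilon value, and the toy inputs are illustrative assumptions, not details from this work.

import numpy as np

def ppo_clipped_objective(log_prob_new, log_prob_old, advantages, epsilon=0.2):
    """Clipped surrogate objective L^CLIP (Schulman et al., 2017).

    ratio  = pi_new(a|s) / pi_old(a|s)
    L^CLIP = E[ min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) ]
    """
    ratio = np.exp(log_prob_new - log_prob_old)           # probability ratio
    unclipped = ratio * advantages                        # ordinary policy-gradient term
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))        # pessimistic (clipped) bound

# Toy usage: three transitions with hypothetical log-probabilities and advantages.
log_new = np.array([-0.9, -1.2, -0.3])
log_old = np.array([-1.0, -1.0, -1.0])
adv = np.array([1.5, -0.5, 0.8])
print(ppo_clipped_objective(log_new, log_old, adv))

The min with the clipped term removes any incentive to move the new policy far from the old one in a single update, which is the source of PPO's stability.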
The algorithms described above model the policy as a probability distribution over actions for a known current state (stochastic). Deterministic Policy Gradients (DPG), published by David Silver, Google DeepMind, 2014, instead models the policy as a deterministic function rather than a stochastic one. Deep Deterministic Policy Gradient (DDPG), published by Lillicrap, Google DeepMind, 2015, combines DPG with DQN and learns a stable Q-function through experience replay and a fixed target network; it learns a deterministic policy and extends it to continuous action spaces within the actor-critic framework. In Distributed Distributional Deep Deterministic Policy Gradient (D4PG), published by Gabriel Barth-Maron, Google DeepMind, 2018, a distributional critic estimates the expected Q-value as a random variable, multiple distributed actors gather experience in parallel, and Prioritized Experience Replay (PER) is employed. D4PG is a model-free, off-policy, actor-critic algorithm that learns policies in high-dimensional, continuous action spaces.
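As an illustration of the fixed-target-network idea in DDPG, here is a minimal NumPy sketch of the soft (Polyak) target update and of the critic's TD regression target; representing network parameters as plain weight vectors and the tau and gamma values are illustrative assumptions.

import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return tau * online_params + (1.0 - tau) * target_params

def td_target(reward, done, q_next_target, gamma=0.99):
    """Critic regression target y = r + gamma * Q_target(s', mu_target(s'))."""
    return reward + gamma * (1.0 - done) * q_next_target

# Toy usage: one soft target-network update and TD targets for a sampled batch.
theta_target = np.zeros(4)
theta_online = np.ones(4)
theta_target = soft_update(theta_target, theta_online)
y = td_target(reward=np.array([1.0, 0.0]),
              done=np.array([0.0, 1.0]),
              q_next_target=np.array([5.0, 3.0]))
print(theta_target, y)

Because the target parameters move slowly, the regression target y changes slowly as well, which is what keeps the learned Q-function stable.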
The Augmented Random Search (ARS) algorithm, published by Horia Mania, UC Berkeley, 2018, is a random-search method for training linear policies on continuous control problems; it augments the basic random search method and is computationally faster than the baseline RL algorithms it is compared against (see the illustrative sketch at the end of this abstract). Evolution Strategies (ES), a subset of black-box optimization methods, have been applied as a competitive alternative for training function approximators, namely deep neural networks, for reinforcement learning. ES is a model-agnostic optimization approach that learns the optimal solution by imitating Darwin's theory of the evolution of species by natural selection. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) and genetic algorithms are likewise utilized for function approximation.

Deep Recurrent Q-Learning for Partially Observable MDPs, published by Matthew Hausknecht, University of Texas at Austin, 2015, overcomes the memory limitation of RL agents. Distributional Reinforcement Learning with Quantile Regression, published by Will Dabney, Google DeepMind, 2017, examines distinct ways of learning the value distribution rather than the traditional value function. GAN Q-learning, published by Thang Doan, McGill University, 2018, utilizes generative adversarial networks (GANs) as an alternative way of leveraging the distributional methodology in reinforcement learning, to better learn the function approximator.

Artificial-intelligence-based cognitive autonomous agents are now ready for real-time monitoring and predictive control. State-of-the-art results are obtained for a Multi-Input Multi-Output (MIMO), real-time, industrial-scale problem. The implemented algorithms, their architectures, and the results obtained will be discussed in comparison with a baseline of traditional model-based optimal control, with the extensive computations enabled largely by GPU-backed machines.
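As a sketch of an ARS-style update for a linear policy (action = M @ state), the minimal NumPy example below uses antithetic perturbations. The rollout function is a hypothetical toy objective standing in for an environment episode, and the step size, noise scale, and number of directions are illustrative assumptions; ARS's reward normalization and state whitening are omitted for brevity.

import numpy as np

rng = np.random.default_rng(0)

def rollout(M):
    """Hypothetical placeholder: return the episode reward of policy M.
    A quadratic toy objective stands in for the environment here."""
    M_star = np.array([[1.0, -2.0]])
    return -np.sum((M - M_star) ** 2)

def ars_update(M, step_size=0.5, noise=0.1, n_directions=8):
    """One random-search step: perturb M in +/- directions, weight each
    perturbation by its reward difference, and move M along the average."""
    grad = np.zeros_like(M)
    for _ in range(n_directions):
        delta = rng.standard_normal(M.shape)
        r_plus = rollout(M + noise * delta)    # reward with +perturbation
        r_minus = rollout(M - noise * delta)   # reward with -perturbation
        grad += (r_plus - r_minus) * delta
    return M + step_size / n_directions * grad

M = np.zeros((1, 2))
for _ in range(200):
    M = ars_update(M)
print(M)  # converges toward the toy optimum [[1, -2]]

No gradients of the policy or environment are required, which is why such random-search and evolutionary methods parallelize so cheaply across workers.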