2025 AIChE Annual Meeting

(594d) Application of Unsupervised Machine Learning Methods to Reinforcement Learning Actor-Critic Structures for Training and Online Implementation

Authors

Debangsu Bhattacharyya, West Virginia University
One of the fundamental obstacles to implementing reinforcement learning (RL) in an online setting is the sample inefficiency that plagues training when the environment is continuous, as in chemical process systems. The problem is two-fold: both the quality of the data and its selection for training are often difficult to quantify. In common actor-critic structures, the standard update procedure amounts to an essentially arbitrary selection of a given batch size, while exploration is typically reward-based with an arbitrary level of exploration noise [1], [2]. While these choices are sufficient for learning in many circumstances, they leave much room for improvement, especially for continuous and time-varying systems.

Approaches to exploration in RL differ widely, but they can broadly be classified as reward-based or reward-free and memory-based or memory-free. With these classifications in hand, the nature of exploration at a given timestep can be evaluated quantitatively, beyond basic approaches that consider only the simplest available quantities, such as a greedy policy [3] or the predicted output of the value function [4]. This is what defines the exploration policy and, if the algorithm is off-policy, differentiates it from the policy being learned.

Within standard deterministic actor-critic structures, such as deep deterministic policy gradient (DDPG) or its derivatives (TD3, D4PG, etc.), exploration is dictated by a reward-based action produced by the actor neural network with a level of Gaussian noise added [1], [2]. The action thus follows an essentially greedy policy, with a distribution around the optimal action that tightens as the actor is refined. Other structures, such as soft actor-critic (SAC), follow an entropy-maximization policy for exploration [5]. Neither approach possesses predictive capability to minimize the possibility of performance degradation, nor does either adapt learning to the quality of the data received.
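For reference, a minimal sketch of this reward-based exploration pattern is shown below, assuming a continuous action space with known bounds; the function and argument names (`actor`, `noise_std`, `action_low`, `action_high`) are illustrative and not taken from the source.

```python
import numpy as np

def ddpg_exploratory_action(actor, state, noise_std, action_low, action_high):
    """Greedy action from the actor with additive Gaussian exploration noise.

    `actor` is assumed to map a state to a deterministic action (e.g., the
    actor neural network); `noise_std` sets the arbitrary exploration noise
    level that this work seeks to replace with a more informed criterion.
    """
    greedy_action = np.asarray(actor(state))                 # reward-based (greedy) choice
    noise = np.random.normal(0.0, noise_std, size=greedy_action.shape)
    return np.clip(greedy_action + noise, action_low, action_high)
```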

The goal of this work is to leverage unsupervised learning tools, such as Gaussian mixture models (GMMs), to contribute to both the exploration policy and the training/learning performed by the algorithm. Such algorithms are often considered for characterizing the quality of exploration in terms of the density of visited data points [4], or for the uncertainty they provide when used as a value function. In this work, however, given the data accumulated over the course of the training period, these ML algorithms are used to evaluate the data both to form a prediction of the current state of the system and to sort data by its quality for learning. Such a method allows potentially risky, unsafe, or otherwise undesired actions to be screened without requiring an internal prediction model within the RL algorithm. In addition, high-reward samples can be reinforced during the update by a similar GMM that accounts for the output of the action-value function, allowing more nuanced clusters of data to be sampled than a simple spatial-density approach permits.
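A minimal sketch of how these two roles might look in practice is given below, assuming scikit-learn's GaussianMixture; the function names, the log-density threshold, and the softmax-style cluster weighting are illustrative assumptions rather than the method reported here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Exploration side: screen a candidate action against a GMM fit on historical
# (state, action) pairs, e.g. gmm_sa = GaussianMixture(n_components=8).fit(sa_history).
def screen_action(gmm_sa, state, candidate_action, log_density_threshold):
    """Accept the candidate only if the (state, action) pair lies in a region
    that the historical data supports with sufficient likelihood."""
    x = np.hstack([state, candidate_action]).reshape(1, -1)
    return gmm_sa.score_samples(x)[0] >= log_density_threshold

# Learning side: bias replay sampling toward clusters with high action-value,
# using a GMM fit on replay features augmented with the critic's Q estimates.
def prioritized_indices(gmm_q, features, q_values, batch_size, rng):
    resp = gmm_q.predict_proba(features)                        # soft cluster memberships
    denom = np.maximum(resp.sum(axis=0), 1e-12)
    cluster_q = (resp * q_values[:, None]).sum(axis=0) / denom  # mean Q per cluster
    sample_score = resp @ cluster_q                             # expected cluster value per sample
    weights = np.exp(sample_score - sample_score.max())         # softmax-style weighting
    weights /= weights.sum()
    return rng.choice(len(features), size=batch_size, p=weights)
```

In this sketch, high-Q clusters receive larger sampling weights, so their transitions are replayed more often than a uniform or purely spatial-density draw would allow.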

In summary, the main contributions of this work are:

  • The proposal to use unsupervised machine learning methods, such as Gaussian mixture models, as a form of prediction when selecting exploratory actions.
  • The goal of minimizing the risk of selecting a poor action that could result in increased error, performance degradation, safety violations, or other undesired state evolution, while still prioritizing exploration for more informative learning.
  • The use of clustering to identify regions of high-performing data to accelerate learning, so that high-reward data is reinforced more often than low-performing data.
  • The splitting of the algorithm into training and prediction components, tested both independently and in conjunction for comparison against traditional RL exploration methods.
  • Application to a selective catalytic reduction (SCR) unit model, representing a non-square system.

[1] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” 4th Int. Conf. Learn. Represent. ICLR 2016 - Conf. Track Proc., Sep. 2016, [Online]. Available: http://arxiv.org/abs/1509.02971

[2] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing Function Approximation Error in Actor-Critic Methods,” in 35th International Conference on Machine Learning, ICML 2018, Feb. 2018, pp. 2587–2601. [Online]. Available: http://arxiv.org/abs/1802.09477

[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA: The MIT Press, 2020.

[4] S. Amin, M. Gomrokchi, H. Satija, H. van Hoof, and D. Precup, “A Survey of Exploration Methods in Reinforcement Learning,” Aug. 2021, [Online]. Available: http://arxiv.org/abs/2109.00157

[5] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor,” 35th Int. Conf. Mach. Learn. ICML 2018, vol. 5, pp. 2976–2989, Jan. 2018, [Online]. Available: http://arxiv.org/abs/1801.01290