2024 AIChE Annual Meeting

(372af) Scheduling of State Task Network Under Uncertainty Via a Hybrid Reinforcement Learning Agent with Partial Observability

Authors

Ricardez-Sandoval, L. - Presenter, University of Waterloo
Rangel Martinez, D., University of Waterloo
Scheduling optimization problems in chemical engineering have gained attention because they can improve economics, sustainability, and customer satisfaction. Their inherent complexity and the need for online implementation pose major challenges, especially when scheduling decisions are subject to parametric uncertainty. Common approaches to these problems include stochastic programming and two-stage optimization, which often exhibit large computational times and risk losing problem tractability [1]. Reinforcement Learning (RL) is a Machine Learning (ML) method in which an intelligent agent equipped with a policy is trained to act in an environment. Attractive features of a policy developed with RL include a) the reduction in computational time when the trained policy is deployed in the process; b) the insight the policy can gain into the process through RL exploration of complex environments; and c) the capacity to handle multiple events in the environment, e.g., parametric uncertainty and disturbances. Applications of these techniques to process scheduling have mostly focused on problems with discrete decisions, for which Deep Q-Networks (DQN) can be considered; other methods include distributional RL and Proximal Policy Optimization (PPO), e.g., [2], [3]. Results reported in those studies highlight the adaptability of the agents to uncertainty and their capacity to reach near-optimal solutions, while the limitations include the challenge of designing the reward function used to develop the policy. A common assumption in previous scheduling works is that the process is Markovian, i.e., the information in the present state of the environment is sufficient to decide the optimal action at a given time. Since scheduling decisions are strongly influenced by previous time intervals, a decision model that accounts for past states would be expected to extend RL methods to a broader spectrum of problems, for instance, those where information becomes available at different times over the horizon or where information is not provided explicitly and must be collected through experience. Such a model is the Partially Observable Markov Decision Process (POMDP), which can be handled in a general way using Recurrent Neural Networks (RNNs).
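To make the last point concrete, the short sketch below (a minimal illustration, not the authors' implementation) shows one common way to expose past states to the agent: a Gym wrapper that returns the last few observations as a history instead of only the current state, which is the minimal ingredient a history-based (POMDP) policy needs. The window length and the underlying observation layout are placeholders.

```python
# Minimal sketch of exposing an observation history to the agent, assuming
# the Gym 0.26 reset/step API used later in the text. The window length and
# the STN observation layout are illustrative placeholders.
from collections import deque

import gym
import numpy as np


class ObservationWindow(gym.Wrapper):
    """Return the last `window` observations so a recurrent policy can
    condition on past states rather than on the current state alone."""

    def __init__(self, env: gym.Env, window: int = 8):
        super().__init__(env)
        self.window = window
        self._history = deque(maxlen=window)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._history.clear()
        self._history.extend([obs] * self.window)  # pad with the initial state
        return np.stack(self._history), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._history.append(obs)
        return np.stack(self._history), reward, terminated, truncated, info
```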

In this work, a methodology to generate an RL agent whose policy can make online decisions to schedule production in a State-Task Network (STN) under uncertainty is presented. The approach adopts the POMDP framework to account for past information. A key feature of this approach is that the agent works on a parameterized action space, i.e., more than one action is defined at every time interval. To the authors’ knowledge, an approach that combines these features for online scheduling of STNs is not available in the literature. The agent builds the schedule according to the state of the process and the current realizations of the uncertain parameters, which are described using a set of uncertain scenarios. The objective of the agent is to maximize economic profit over a given scheduling horizon. A PPO method with a recurrent actor and a recurrent critic is used in this work. In this approach, two actions are taken simultaneously: the first is a discrete action that defines which task is initialized, whereas the second is a continuous action that specifies the required capacity of the selected task, e.g., initialize a reactor at 73% capacity. A categorical distribution is used to define the first action and a multivariate normal distribution is considered for the latter. The agent’s policy is embodied as an RNN with Long Short-Term Memory (LSTM) cells to handle the partial observability. The output of the agent is divided into two branches, one for each of the two actions. The action at each time under the policy is obtained by collecting the outputs of both branches, i.e., the pair formed by the selected task and its capacity level. The agent is required to maximize economic profit by shaping the schedule online according to market prices that are subject to uncertainty. The environment in which the agent acts is subject to constraints on capacity resources, time, and machine availability that the agent must meet during operation. A reward-shaping method is used during training to steer the agent away from constraint violations and to guide the search for an economically optimal operation. A set of sub-rewards is defined to reinforce the behavior of the policy when profit is improved and constraints are satisfied. The proposed framework also implements a masking technique that accelerates learning by restricting the agent to actions that lead to feasible scenarios, thus avoiding infeasible situations. This technique also simplifies the action space by dividing the process into campaigns that restrict production to specific products. An observation window is defined as the sequence of information packages collected from the environment over the preceding time intervals; its length is defined by the user and depends on the problem. The elements of the observation window carry information on the states and tasks in the network, i.e., the current state of occupancy, the production at the current time, and the time itself. These elements are correlated through the LSTM cells and then transformed into the pair of actions for the next time interval. To guarantee the feasibility of the schedule, a heuristic method that can suspend an action pair is executed to filter out those that would incur a constraint violation. The framework for training the RL agent was built with the PyTorch package (version 2.1.0) and the environment was built with the Gym toolkit (version 0.26.2).
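The hybrid action described above can be pictured with the short PyTorch sketch below. It is an illustration under assumptions, not the authors’ implementation: the layer sizes are arbitrary, the feasibility mask is assumed to be supplied by the environment, and a per-task univariate Gaussian is used for the capacity head where the abstract describes a multivariate normal. The LSTM summarizes the observation window, one branch returns a categorical distribution over candidate tasks (with infeasible tasks masked out), and the other returns a distribution over the capacity fraction of the selected task.

```python
# Illustrative sketch (assumed architecture, not the authors' code) of a
# recurrent actor for a parameterized action space: a discrete task choice
# plus a continuous capacity level, with infeasible tasks masked out.
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal


class RecurrentHybridActor(nn.Module):
    def __init__(self, obs_dim: int, n_tasks: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.task_head = nn.Linear(hidden, n_tasks)    # discrete branch
        self.cap_mean = nn.Linear(hidden, n_tasks)     # continuous branch
        self.cap_logstd = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, obs_window, feasible_mask, hidden_state=None):
        # obs_window: (batch, window, obs_dim); feasible_mask: (batch, n_tasks) bool
        out, hidden_state = self.lstm(obs_window, hidden_state)
        h = out[:, -1]                                 # summary of the window

        logits = self.task_head(h)
        logits = logits.masked_fill(~feasible_mask, -1e9)   # action masking
        task_dist = Categorical(logits=logits)
        task = task_dist.sample()

        mean = torch.sigmoid(self.cap_mean(h))         # capacity fraction in (0, 1)
        std = self.cap_logstd.exp()
        cap_dist = Normal(mean.gather(1, task.unsqueeze(1)).squeeze(1), std[task])
        capacity = cap_dist.sample().clamp(0.0, 1.0)

        log_prob = task_dist.log_prob(task) + cap_dist.log_prob(capacity)
        return (task, capacity), log_prob, hidden_state
```

Under this sketch, a recurrent critic sharing the same LSTM encoder would estimate the value of the observation window, and PPO’s clipped surrogate objective would use the combined log-probability of the two branches.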

The proposed framework was tested on two STN case studies adapted from the literature [4], [5]. Both cases correspond to multipurpose batch plants manufacturing two and three products, respectively. Results from the studies showed that the agent was able to learn the relations among the sequential elements presented in the observation window. Information that was not provided in the observation window, for instance the duration of the tasks, was inferred by the agent. The schedules showed that the agent sought to increase rewards while avoiding penalizations (i.e., constraint violations). Although the agent was guided toward the optimal values, it is well known that optimality is not guaranteed and depends on the problem restrictions [6]. Insights from the case studies indicate that the agent exhibited a preventive behavior toward constraint violations and infeasible scenarios, which confirmed the effectiveness of the reward function. The performance of the discrete and continuous actions showed that the agent aimed for the largest rewards. In the case of the continuous action, the agent left a small gap from the limiting values that, if exceeded, would lead to infeasible scenarios. That is, the agent found economically attractive schedules that can remain feasible in the presence of parametric uncertainty.

References

[1] Z. Li and M. Ierapetritou, “Process scheduling under uncertainty: Review and challenges,” Computers & Chemical Engineering, vol. 32, no. 4–5, Apr. 2008, doi: 10.1016/j.compchemeng.2007.03.001.

[2] M. Mowbray, D. Zhang, and E. A. D. R. Chanona, “Distributional Reinforcement Learning for Scheduling of Chemical Production Processes,” arXiv:2203.00636, Mar. 2022. [Online]. Available: http://arxiv.org/abs/2203.00636

[3] T. Altenmüller, T. Stüker, B. Waschneck, A. Kuhnle, and G. Lanza, “Reinforcement learning for an intelligent and autonomous production control of complex job-shops under time constraints,” Prod. Eng. Res. Devel., vol. 14, no. 3, Jun. 2020, doi: 10.1007/s11740-020-00967-8.

[4] E. Kondili, C. C. Pantelides, and R. W. H. Sargent, “A general algorithm for short-term scheduling of batch operations—I. MILP formulation,” Computers & Chemical Engineering, vol. 17, no. 2, Feb. 1993, doi: 10.1016/0098-1354(93)80015-F.

[5] L. G. Papageorgiou and C. C. Pantelides, “Optimal Campaign Planning/Scheduling of Multipurpose Batch/Semicontinuous Plants. 2. A Mathematical Decomposition Approach,” Ind. Eng. Chem. Res., vol. 35, no. 2, pp. 510–529, Jan. 1996, doi: 10.1021/ie950082d.

[6] S. Kakade and J. Langford, “Approximately Optimal Approximate Reinforcement Learning,” in Proceedings of the Nineteenth International Conference on Machine Learning, in ICML ’02. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., Jul. 2002, pp. 267–274.