2024 AIChE Annual Meeting
(372af) Scheduling of State Task Network Under Uncertainty Via a Hybrid Reinforcement Learning Agent with Partial Observability
In this work, a methodology to generate an RL agent with a policy that can make online decisions to schedule production on a State-Task Network (STN) under uncertainty is presented. The approach adopts the partially observable Markov decision process (POMDP) framework to account for past information. A key feature of this approach is that the agent operates on a parameterized action space, i.e., more than one action is defined at every time interval. To the authors' knowledge, an approach that combines these features for online scheduling of STNs is not available in the literature. The agent builds the schedule according to the state of the process and the current realizations of the uncertain parameters, which are described by a set of uncertain scenarios. The objective of the agent is to maximize economic profit over a given scheduling horizon.

A Proximal Policy Optimization (PPO) method with a recurrent actor and a recurrent critic is used in this work. Two actions are taken simultaneously: the first is a discrete action that defines which task is initialized, whereas the second is a continuous action that describes the required capacity of the selected task, e.g., initialize a reactor at 73% capacity. A categorical distribution is used to define the first action and a multivariate normal distribution is considered for the latter. The agent's policy is embodied as a recurrent neural network (RNN) with Long Short-Term Memory (LSTM) cells to handle the partial observability. The output of the agent is divided into two branches, one for each of the two actions, and the action at each time interval under the policy is formed by collecting the outputs of both branches, i.e., the selected task together with its capacity level.

The agent is required to maximize economic profit by shaping the schedule online according to market prices that are subject to uncertainty. The environment in which the agent operates is subject to constraints on resource capacities, time, and machine availability that the agent must meet during operation. A reward-shaping method is used during training to guide the agent toward avoiding constraint violations and toward an economically optimal operation. A set of sub-rewards reinforces the behavior of the policy when profit improves and the constraints are satisfied. The proposed framework also implements a masking technique that accelerates learning by restricting the agent to actions that lead to feasible scenarios, thus avoiding infeasible situations. This technique also simplifies the action space by dividing the process into campaigns that restrict production to specific products. An observation window is defined as the sequence of information packages collected from the environment at each time interval; its length is defined by the user and depends on the problem. The elements of the observation window carry information on the states and tasks in the network, i.e., the current occupancy, the production at the current time, and the time itself. These elements are correlated through the LSTM cells and then transformed into the pair of actions for the next time interval. To guarantee the feasibility of the schedule, a heuristic method that can suspend an action pair is executed to filter out pairs that would incur a constraint violation. The framework for training the RL agent was built with the PyTorch package, version 2.1.0, and the environment was built with the Gym toolkit, version 0.26.2.
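To illustrate the two-branch recurrent policy described above, the following is a minimal PyTorch sketch, not the authors' implementation: an LSTM summarizes the observation window, a categorical head selects the task to initialize, and a Gaussian head samples its capacity level. Layer sizes, variable names, and the masking interface are illustrative assumptions, and a univariate normal stands in for the multivariate case since only a single capacity value is sampled here.

```python
# Minimal sketch (assumed architecture) of a recurrent actor with a parameterized
# action space: a discrete branch chooses which task to start and a continuous
# branch sets its capacity fraction. Sizes and names are illustrative.
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class RecurrentActor(nn.Module):
    def __init__(self, obs_dim, n_tasks, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.task_head = nn.Linear(hidden_dim, n_tasks)   # discrete branch: task selection
        self.cap_mean = nn.Linear(hidden_dim, 1)          # continuous branch: capacity level
        self.cap_log_std = nn.Parameter(torch.zeros(1))   # state-independent log std

    def forward(self, obs_window, hidden=None, task_mask=None):
        # obs_window: (batch, window_length, obs_dim) sequence of past observations
        out, hidden = self.lstm(obs_window, hidden)
        h = out[:, -1]                                    # summary of the observation window
        logits = self.task_head(h)
        if task_mask is not None:                         # mask tasks that would be infeasible
            logits = logits.masked_fill(~task_mask, float('-inf'))
        task_dist = Categorical(logits=logits)
        cap_dist = Normal(torch.sigmoid(self.cap_mean(h)), self.cap_log_std.exp())
        return task_dist, cap_dist, hidden

# Sampling the action pair (task, capacity) for one decision interval:
actor = RecurrentActor(obs_dim=12, n_tasks=5)
obs = torch.randn(1, 8, 12)                               # window of 8 past observations
mask = torch.tensor([[True, True, False, True, True]])    # e.g., the unit for task 3 is busy
task_dist, cap_dist, _ = actor(obs, task_mask=mask)
task = task_dist.sample()                                 # e.g., "initialize reactor task 1"
capacity = cap_dist.sample()                              # e.g., "at 73% capacity"
log_prob = task_dist.log_prob(task) + cap_dist.log_prob(capacity).squeeze(-1)
capacity = capacity.clamp(0.0, 1.0)                       # keep the executed action within limits
```

In this sketch, masking is applied by setting the logits of infeasible tasks to negative infinity before sampling, which corresponds to restricting the agent to actions that lead to feasible scenarios.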
The proposed framework was tested on two STN case studies adapted from the literature [4], [5]. Both cases correspond to multipurpose batch plants manufacturing two and three products, respectively. The results showed that the agent was able to learn the relations among the sequential elements presented in the observation window. Information not given explicitly in the observation window, such as the duration of the tasks, was inferred by the agent. The resulting schedules showed that the agent sought to increase its rewards while avoiding penalties (i.e., constraint violations). Although the agent was guided toward the optimal values, it is well known that optimality is not guaranteed and depends on the problem restrictions [6]. Insights from the case studies indicate that the agent exhibited preventive behavior with respect to constraint violations and infeasible scenarios, which demonstrated the effectiveness of the reward function. The behavior of the discrete and continuous actions showed that the agent aimed for the largest rewards. In the case of the continuous action, the agent left a small margin from the limiting values that, if exceeded, would lead to infeasible scenarios. That is, the agent found economically attractive schedules that remain feasible in the presence of parametric uncertainty.
References
[1] Z. Li and M. Ierapetritou, “Process scheduling under uncertainty: Review and challenges,” Computers & Chemical Engineering, vol. 32, no. 4–5, Apr. 2008, doi: 10.1016/j.compchemeng.2007.03.001.
[2] M. Mowbray, D. Zhang, and E. A. D. R. Chanona, “Distributional Reinforcement Learning for Scheduling of Chemical Production Processes,” no. arXiv:2203.00636. arXiv, Mar. 09, 2022. Accessed: Jun. 22, 2022. [Online]. Available: http://arxiv.org/abs/2203.00636
[3] T. Altenmüller, T. Stüker, B. Waschneck, A. Kuhnle, and G. Lanza, “Reinforcement learning for an intelligent and autonomous production control of complex job-shops under time constraints,” Prod. Eng. Res. Devel., vol. 14, no. 3, Jun. 2020, doi: 10.1007/s11740-020-00967-8.
[4] E. Kondili, C. C. Pantelides, and R. W. H. Sargent, “A general algorithm for short-term scheduling of batch operations—I. MILP formulation,” Computers & Chemical Engineering, vol. 17, no. 2, Feb. 1993, doi: 10.1016/0098-1354(93)80015-F.
[5] L. G. Papageorgiou and C. C. Pantelides, “Optimal Campaign Planning/Scheduling of Multipurpose Batch/Semicontinuous Plants. 2. A Mathematical Decomposition Approach,” Ind. Eng. Chem. Res., vol. 35, no. 2, pp. 510–529, Jan. 1996, doi: 10.1021/ie950082d.
[6] S. Kakade and J. Langford, “Approximately Optimal Approximate Reinforcement Learning,” in Proceedings of the Nineteenth International Conference on Machine Learning, in ICML ’02. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., Jul. 2002, pp. 267–274.