2023 AIChE Annual Meeting
(430e) A Reinforcement Learning Strategy with Recurrent Neural Networks for Optimal Scheduling of Job-Shop Batch Chemical Plants Under Uncertainty
In this work, a methodology to develop an RL agent that acts as an online scheduler for a job-shop chemical plant subject to uncertainty is presented. A key feature of this formulation is that processing times and product demands are described using a discrete set of uncertain scenarios. A Deep Recurrent Q-Learning (DRQN) method is used to train the agent, which is designed using a Recurrent Neural Network (RNN). These networks generalize Hidden Markov Models (HMMs) in sequential decision processes, a structure typical of scheduling problems. To the authors' knowledge, this method has not previously been applied to the scheduling of job-shop chemical batch plants under uncertainty.

The proposed DRL formulation assumes that there are several routes for producing predesigned batches of a given number of products. The formulation also imposes zero-wait restrictions during processing; however, storage is available for completed products. The agent was trained to build an online schedule of the initialization of these routes in the chemical batch plant. The resulting schedule must satisfy product demands given at the beginning of the process as well as demand realizations that take place at specific times during operation. Since those demands are not known a priori, they are treated as discrete uncertain parameters in the proposed DRQN framework. The agent is motivated through a set of rewards to first complete the demand for every product and subsequently to fill the available storage for each product without exceeding its capacity. Moreover, the agent aims to minimize the makespan of the process subject to a set of user-defined process constraints, e.g., allocation and mass conservation balance constraints. These objectives are enforced using a reward shaping strategy.

For each processing route in the chemical plant there is a set of machines whose processing times are uncertain parameters to the agent; no information about the characteristics of these times is given to the agent a priori. The HMM is used to infer the true value of the uncertainty realization in the processing times by gathering information from present and recent past events. That is, the agent acquires knowledge of the processing times and their possible deviations through the sequence of events provided as input in the observation window. Although the literature has shown methods to address uncertainty using MDPs, this model assumes that the system is fully observable and that all the information needed to take the next decision is available at the present time, which is often not the case in real applications. Uncertainty effects are propagated through time in the environment. In the present DRQN framework, these features are captured through the observation window that the agent uses to select the next action. The observation vectors in the window that are inputs to the agent contain information about time intervals, demand satisfaction for each product, and machine availability.
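To make the agent structure concrete, the sketch below shows a minimal recurrent Q-network of the kind used in DRQN, implemented in PyTorch. All dimensions, layer sizes, the feature layout of the observation vectors, and the "wait" action are illustrative assumptions, not details taken from the abstract.

```python
# Minimal DRQN-style recurrent Q-network sketch (illustrative only).
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Maps a window of observation vectors to Q-values over discrete
    scheduling actions (initiate one of the processing routes, or wait)."""
    def __init__(self, obs_dim: int, num_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs_window: torch.Tensor) -> torch.Tensor:
        # obs_window: (batch, window_length, obs_dim); each step holds
        # time-interval, demand-satisfaction, and machine-availability features.
        out, _ = self.lstm(obs_window)
        return self.head(out[:, -1, :])  # Q-values from the last hidden state

# Hypothetical sizes: 3 products + 4 machines + 1 time feature,
# 5 processing routes plus a "wait" action, window of 8 past events.
obs_dim, num_actions, window_len = 3 + 4 + 1, 5 + 1, 8
q_net = RecurrentQNetwork(obs_dim, num_actions)

def select_action(obs_window: torch.Tensor, epsilon: float = 0.1) -> int:
    """Epsilon-greedy action selection over the current observation window."""
    if torch.rand(1).item() < epsilon:
        return torch.randint(num_actions, (1,)).item()
    with torch.no_grad():
        return q_net(obs_window.unsqueeze(0)).argmax(dim=1).item()

# Example call with a dummy (all-zero) observation window.
action = select_action(torch.zeros(window_len, obs_dim))
```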
The proposed RL framework was tested on a case study involving the scheduling of a job-shop chemical facility with several products, where each product has different processing routes. A scheduling horizon was set for the agent to satisfy the products' demands. At the beginning of the process there is an initial demand for each product; these individual demands are updated at specific (user-defined) times during the operation. The processing times of the machines in the plant and the demand realizations are uncertain and can take different values that are set in the plant model but are not revealed to the agent. For this case study, the agent trained with the DRQN method was able to return online schedules of the batch units depending on the previous states of the plant and to satisfy the demands within the scheduling time horizon. Moreover, if time remained and it was economically attractive, the agent would proceed to fill the storage tanks.

The results showed that the agent designed with RNNs was able to extract from the instance the knowledge needed to build a model that accounts for the constraints, the objective function, and the uncertainty in the processing times. The agent was also able to react to changes in the demands by adjusting the scheduling policies online. The uncertainty in both demands and processing times was handled by the agent through a preventive behaviour, i.e., the agent takes conservative actions by not activating processes between specific products in order to prevent possible overlapping in the machines. In the proposed framework, the agent is able to produce attractive online schedules for different uncertainty realizations. Although training may take time, the response of a trained agent takes less than a second, since it consists of evaluating a neural network. This feature is particularly attractive for large-scale scheduling applications under uncertainty, since current optimization algorithms may need considerable time to produce an action at every sampling interval.
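The sub-second response time comes from the fact that each online decision is a single forward pass of the trained network. The loop below is a hedged sketch of how such a trained agent could be queried at every sampling interval; the environment interface (get_observation_window, apply_action) is hypothetical and not part of the original work.

```python
# Illustrative online-dispatching loop for a trained DRQN scheduler.
import time
import torch

def run_online_schedule(q_net, env, horizon_steps: int) -> None:
    for _ in range(horizon_steps):
        obs_window = env.get_observation_window()   # (window_len, obs_dim) tensor
        start = time.perf_counter()
        with torch.no_grad():
            action = q_net(obs_window.unsqueeze(0)).argmax(dim=1).item()
        elapsed = time.perf_counter() - start       # typically well under a second
        env.apply_action(action, decision_time=elapsed)
```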