2023 AIChE Annual Meeting

(430e) A Reinforcement Learning Strategy with Recurrent Neural Networks for Optimal Scheduling of Job-Shop Batch Chemical Plants Under Uncertainty

Authors

Ricardez-Sandoval, L. - Presenter, University of Waterloo
Rangel-Martinez, D., University of Waterloo
In recent years, Deep Reinforcement Learning (DRL) has emerged as an alternative for optimizing scheduling problems in industrial facilities. These methods aim to design an intelligent agent that can make decisions and learn how to perform a task, e.g., a short-term schedule for manufacturing production or the execution of a specific response when unexpected events occur in a system, e.g., a change in product demands. The features of DRL agents that have become attractive for scheduling applications under uncertainty include their unique way of exploring solutions in complex systems, their fast adaptation to unexpected deviations in process parameters, and their ability to provide immediate responses through a trained policy. In the context of optimal process scheduling, DRL has only recently started to emerge as an option alongside current methods for short-term scheduling of job-shop and flow-shop chemical manufacturing plants. Although research in this field is at an early stage, existing implementations have shown promising results in creating schedules through different methods such as multi-agent systems and Q-Learning applications [1]–[5]. These implementations have shown that DRL agents can handle uncertainty in parameters related to the schedule formulation, such as machine availability and demand realizations. Challenges in these implementations mainly lie in the adequate use of information during training, which is critical to obtain reliable and robust scheduling policies, and in the effective translation of the optimization problem into a system of penalties and rewards. In the literature, the common decision-making model for DRL applications has been the Markov Decision Process (MDP), which uses only information from the present state to take the next action. Since process scheduling often involves planning a historical sequence of intercorrelated events, MDPs may not be the best alternative for processing observations from the instance. On the other hand, Hidden Markov Models (HMMs) are typically used for time series in which the process to be modelled is a sequence of interconnected time steps. These models use a set of consecutive observations, called an observation window, to take the next action in the plant model.
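
To illustrate the distinction discussed above, the following minimal sketch contrasts the single current state an MDP-style agent would act on with an observation window of recent states used by a recurrent (HMM-style) agent. All names and the window length are illustrative assumptions, not details from the original work.

```python
from collections import deque
import numpy as np

WINDOW = 8  # assumed length of the observation window

# rolling history of plant observations; each observation is a fixed-length vector
history = deque(maxlen=WINDOW)

def mdp_input(current_obs):
    # an MDP-style agent acts on the present state only
    return np.asarray(current_obs, dtype=np.float32)

def window_input(current_obs):
    # a recurrent agent acts on a sequence of the most recent states
    history.append(np.asarray(current_obs, dtype=np.float32))
    pad = [np.zeros_like(history[0])] * (WINDOW - len(history))
    return np.stack(pad + list(history))  # shape: (WINDOW, obs_dim)
```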

In this work, a methodology to develop an RL agent that acts as an online scheduler for a job-shop chemical plant subject to uncertainty is presented. A key feature of this formulation is that processing times and product demands are described using a discrete set of uncertain scenarios. A Deep Recurrent Q-Learning (DRQN) method is used to train the agent, which is designed using a Recurrent Neural Network (RNN). These networks generalize HMMs to sequential decision processes, a typical characteristic of scheduling problems. To the authors’ knowledge, this method has not been considered to address the scheduling of job-shop chemical batch plants under uncertainty. The proposed DRL formulation assumes that there are several routes for producing predesigned batches of a given number of products. Also, the formulation includes zero-wait restrictions during the process; however, storage is available for completed products. The agent was trained to build an online schedule of the initialization of these routes in the chemical batch plant. The resulting schedule is required to satisfy product demands given at the beginning of the process as well as demand realizations that take place at specific times during operation. Note that those demands are not known a priori; hence, they are treated as discrete uncertain parameters in the proposed DRQN framework. Also, the agent is motivated through a set of rewards to first complete the demands of every product and subsequently fill the available storage for each product without exceeding its capacity. Moreover, the agent aims to minimize the makespan of the process subject to a set of user-defined process constraints, e.g., allocation and mass conservation balance constraints. These objectives are enforced using a reward shaping strategy. For each processing route in the chemical plant, there is a set of machines whose processing times are uncertain parameters to the agent. Hence, no information about the characteristics of these times is given to the agent a priori. The HMM is used to infer the true value of the uncertainty realization in the processing times by gathering information from present and recent past events. That is, the agent acquires knowledge of the processing times and their possible deviations through the sequence of events given as input in the observation window. Although the literature has shown methods that approach uncertainty using MDPs, this model assumes that the system is fully observable and that all the information needed to take the next decision is available at the present time, which may not be the case in real applications. Uncertainty effects are propagated through time in the environment. In the present DRQN framework, these features are captured through the observation window that the HMM uses to take the next action. The observation vectors in the window that are inputs to the agent contain information on time intervals, demand satisfaction for each product, and machine availability.
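
A minimal sketch of how such a recurrent Q-network could be set up is given below. The network sizes, the composition of the observation vector (time interval, per-product demand satisfaction, machine availability), the action space (one action per candidate route plus a "wait" action), and all names are illustrative assumptions rather than the authors’ actual implementation.

```python
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """LSTM-based Q-network: maps a window of observation vectors
    to Q-values over route-initialization actions (sketch only)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_window):
        # obs_window: (batch, window_len, obs_dim); each vector holds the
        # time interval, demand-satisfaction flags, and machine availability
        _, (h_n, _) = self.lstm(obs_window)
        return self.head(h_n[-1])           # (batch, n_actions) Q-values

# Illustrative usage: an assumed 12-dimensional observation vector and
# 6 actions (5 candidate processing routes plus a "wait" action).
q_net = RecurrentQNetwork(obs_dim=12, n_actions=6)
window = torch.zeros(1, 8, 12)              # window of 8 past observations
q_values = q_net(window)
action = int(q_values.argmax(dim=1))        # greedy action for this window
```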

The proposed RL framework was tested on a case study involving the scheduling of a job-shop chemical facility with a number of products, where each product has different processing routes. A scheduling horizon was set for the agent to satisfy the products’ demands. At the beginning of the process there is an initial demand for each product; these individual demands are updated at specific (user-defined) times during the operation. The processing times of the machines in the plant and the demand realizations are uncertain and can take different values that are set in the plant model but are not given to the agent. For the present case study, the agent trained with the DRQN method was able to return online schedules of the batch units based on the previous states of the plant and satisfy the demands within the scheduling time horizon. Moreover, if there was time left and it was economically attractive, the agent would proceed to fill the storage tanks. The results showed that the agent designed with RNNs was able to extract from the instance the knowledge needed to create a model that accounts for the constraints, the objective function, and the uncertainty in the processing times. Also, the agent was able to react to changes in the demands by adjusting the scheduling policies online. The uncertainty in both demands and processing times was handled by the agent through a preventive behaviour, i.e., the agent takes conservative actions by not activating processes between specific products in order to prevent possible overlapping in the machines. In the proposed framework, the agent is able to produce attractive online schedules for different uncertainty realizations. Although training may take time, the response of a trained agent takes less than a second since it only requires evaluating a neural network. This feature is particularly attractive for large-scale scheduling applications under uncertainty, since current optimization algorithms might need considerable time to produce an action at every sampling interval.

References

[1] C. D. Hubbs, C. Li, N. V. Sahinidis, I. E. Grossmann, and J. M. Wassick, ‘A deep reinforcement learning approach for chemical production scheduling’, Computers & Chemical Engineering, vol. 141, p. 106982, Oct. 2020, doi: 10.1016/j.compchemeng.2020.106982.

[2] C. Hubbs, A. Kelloway, J. Wassick, N. Sahinidis, and I. Grossmann, ‘An Industrial Application of Deep Reinforcement Learning for Chemical Production Scheduling’, 2020.

[3] C. D. Paternina-Arboleda and T. K. Das, ‘A multi-agent reinforcement learning approach to obtaining dynamic control policies for stochastic lot scheduling problem’, Simulation Modelling Practice and Theory, vol. 13, no. 5, pp. 389–406, Jul. 2005, doi: 10.1016/j.simpat.2004.12.003.

[4] B. Waschneck et al., ‘Optimization of global production scheduling with deep reinforcement learning’, Procedia CIRP, vol. 72, pp. 1264–1269, Jan. 2018, doi: 10.1016/j.procir.2018.03.212.

[5] T. Altenmüller, T. Stüker, B. Waschneck, A. Kuhnle, and G. Lanza, ‘Reinforcement learning for an intelligent and autonomous production control of complex job-shops under time constraints’, Prod. Eng. Res. Devel., vol. 14, no. 3, pp. 319–328, Jun. 2020, doi: 10.1007/s11740-020-00967-8.