Article

Factory Simulation of Optimization Techniques Based on Deep Reinforcement Learning for Storage Devices

by Ju-Bin Lim and Jongpil Jeong
1 Department of Smart Factory Convergence, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon 16419, Gyeonggi-do, Republic of Korea
2 AI Machine Vision Smart Factory Lab, LG Innotek, 111 Jinwi2sandan-ro, Jinwi-myeon, Pyeongtaek-si 17708, Gyeonggi-do, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(17), 9690; https://doi.org/10.3390/app13179690
Submission received: 2 August 2023 / Revised: 16 August 2023 / Accepted: 23 August 2023 / Published: 27 August 2023

Abstract

In this study, reinforcement learning (RL) was used in factory simulation to optimize storage devices for use in Industry 4.0 and digital twins. Industry 4.0 increases productivity and efficiency in manufacturing through automation, data exchange, and the integration of new technologies. Innovative technologies such as the Internet of Things (IoT), artificial intelligence (AI), and big data analytics smartly automate manufacturing processes and integrate data with production systems to monitor and analyze production data in real time and optimize factory operations. A digital twin is a digital model of a physical product or process in the real world. It is built on data and real-time information collected through sensors and accurately simulates the behavior and performance of a real manufacturing floor. With a digital twin, data can be leveraged at every stage of product design, development, manufacturing, and maintenance to predict, solve, and optimize problems. First, we defined an RL environment, modeled it, and validated its ability to simulate a real physical system. Subsequently, we introduced a method for calculating reward signals and applying them to the environment so that the behavior of the RL agent aligns with the task objective. Traditional approaches tune the behavior of RL agents with simple reward functions that issue rewards according to predefined rules and often rely on reward signals that are unrelated to the task goal. In this study, the reward calculation was instead designed around the task goal and the characteristics of the physical system, yielding more realistic and meaningful rewards. This method reflects the complex interactions and constraints that occur while optimizing the storage device and generates more accurate episodes of agent behavior; unlike a traditional simple reward function, it captures the complexity and realism of the storage optimization task, making the reward more sophisticated and effective. A stocker simulation model, which represents a storage device handling logistics in a manufacturing production area, was used to validate the effectiveness of RL. The results revealed that RL is a useful tool for automating and optimizing complex logistics systems, increasing the applicability of RL in logistics. We propose a method for training an agent with the proximal policy optimization algorithm and optimizing it by configuring various learning options. Applying RL improved the evaluated performance metrics by 30–100%, and the method can be extended to other fields.

1. Introduction

Factory automation and logistics optimization are crucial for enhancing the efficiency and productivity of factories. Logistics automation is a dynamic area of research [1]. Advanced technologies such as intelligent stockers have been introduced to automate logistics in factories and improve the efficiency of movement and storage of goods in factories. Optimizing the route selection, goods placement, storage volume, and work order of the robot is crucial for effectively using these technologies [2]. Reinforcement learning (RL) is an effective approach for solving complex problems by learning through trial and error. In particular, using RL in factory simulations is highly promising in optimizing logistics automation [3].
Considering industry trends that focus on streamlining production processes for new products and automating logistics is critical. These trends indicate that logistics systems in factories are becoming increasingly organized and that automation technologies are advancing [4]. Logistics automation is a critical component of factory operations, and an efficient logistics system considerably affects the productivity and competitiveness of a factory. Studies have focused on optimizing logistics automation systems using RL based on artificial intelligence [5]. RL algorithms can improve and optimize logistics automation processes in factory simulations. However, automating logistics in factories can be expensive. The optimization of suitable requirements, the implementation of real-time autonomous judgment and autonomous driving technologies, and the use of RL, a sequential decision-making technique, are crucial methodologies for addressing these concerns. Nevertheless, challenges such as insufficient training data, the high cost of trial and error, and the potential risk of accidents resulting from trial and error remain [6]. Such systems can increase the efficiency of a factory's logistics system; however, many factories have yet to use them properly [7].
This methodology for optimizing storage in logistics automation combines factory simulation with artificial intelligence (AI) technology. The method predicts various problems that may occur on the production line to improve efficiency [8]. An RL model simulates various scenarios and continuously adjusts to determine the optimal solution and optimize storage. Thus, analyses of the data required for logistics operations can be performed to determine the optimal response. The RL model analyzes the quantity, type, and quality of goods in the production line. Based on this result, data related to the operational patterns of logistics automation robots are collected and continuously adjusted. RL models can be used to determine optimized routes that efficiently use various resources required for logistics tasks, reduce work time, and prevent problems from occurring on the production line [9].
Complex processes from real factory operations can be modeled in a digital environment by combining factory simulation with RL to improve work efficiency and logistics processes within the factory [10]. This approach can improve overall factory productivity. Moreover, optimizing logistics automation can save labor and time and cut down expenses. In particular, in logistics automation, we propose optimizing the total capacity of the stocker, a storage device, which is expected to show savings of approximately 30% or more and increase the overall efficiency of the logistics system.
The contributions of this paper are as follows. Optimizing logistics automation is critical to increasing productivity and getting products to market. RL algorithms can improve and optimize the logistics automation process in factory simulations because they apply trial and error until an optimal solution for logistics automation is found. This work can serve as a guideline for developing effective simulation solutions that improve logistics efficiency.
The remainder of this paper is organized as follows. Section 2 discusses factory simulation, RL, and storage devices and presents the proposed optimization technique. Section 3 describes the proposed model for optimizing storage using RL. We describe the techniques and roles of RL, virtual factory processes, and AI applications in factory simulation. Section 4 describes the modeling process for simulating a factory, including implementation, learning, result calculation, and hypothesis validation using optimization AI. Finally, Section 5 summarizes the optimization techniques, implementation, and test results presented in this paper and discusses potential avenues for future research.

2. Related Work

2.1. Factory Simulation

Factory simulation is a crucial field that is used to model and optimize manufacturing and operational processes.
Figure 1 shows that simulation models play a crucial role in manufacturing planning and control because they allow companies to assess and enhance their operational strategies. Simulation models provide a digital representation of a manufacturing system, which enables decision makers to experiment with various scenarios, test hypotheses, and determine the optimal course of action. Optimization-based simulation combines simulation models with optimization techniques to determine the optimal solution for manufacturing planning and control problems.
These models incorporate mathematical optimization algorithms, such as linear programming or genetic algorithms, to optimize performance metrics such as production throughput, inventory levels, and makespan. They can be used to resolve complex scheduling, routing, or resource allocation problems [11]. A virtual model of a factory was created, and its operations were simulated to analyze and optimize its capacity, efficiency, and performance. Manufacturers can use factory simulations to test various scenarios and make informed decisions regarding capacity planning, production scheduling, and resource allocation. Manufacturers can also identify bottlenecks, optimize workflows, and improve overall productivity [12]. Discrete-event simulation can assist manufacturers in identifying bottlenecks, optimizing workflows, and enhancing productivity. By simulating the performance of manufacturing systems, manufacturers can identify areas for improvement and optimize operations before implementing them in the real world [13]. Agent-based modeling and simulation can be used to predict the effect of changes in a system, such as alterations in demand, production processes, or market conditions. Manufacturers and businesses can use this technology to proactively adjust their operations to reduce the risk of downtime, delays, and other disruptions, to model and analyze complex systems, and to optimize their operations [14].
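To make the idea of optimization-based simulation concrete, the toy sketch below solves a made-up two-product throughput problem as a linear program with SciPy; the products, coefficients, and machine-hour limits are illustrative placeholders, not data from the paper.

```python
# Toy linear program: choose production quantities x1, x2 of two product types
# to maximize throughput subject to machine-hour limits. All numbers are made up.
from scipy.optimize import linprog

# Maximize 3*x1 + 2*x2  ->  linprog minimizes, so negate the objective.
c = [-3.0, -2.0]
# Machine-hour constraints: 2*x1 + 1*x2 <= 100 and 1*x1 + 3*x2 <= 90.
A_ub = [[2.0, 1.0], [1.0, 3.0]]
b_ub = [100.0, 90.0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("optimal quantities:", res.x, "maximum throughput:", -res.fun)
```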

2.2. RL

RL is a machine learning method that involves rewarding a behavior in a specific environment to determine its effectiveness and subsequently training to maximize the reward through repetition.
Figure 2 shows that reinforcement learning consists of two components: the environment and the agent. RL is formulated as a Markov decision process (MDP), which rests on the assumption that the state at any point in time is affected only by the state immediately preceding it [15].
The Markov assumption [16] is expressed as $P(s_t \mid s_1, s_2, \ldots, s_{t-1}) = P(s_t \mid s_{t-1})$.
An MDP extends a Markov process (MP) with an action set $A$, a reward function $R$, and a discount factor $\gamma$.
An MDP can be summarized as follows [16]:
1. $\mathrm{MDP} = (S, A, P, R, \gamma)$;
2. The state set $S = \{s_1, s_2, \ldots, s_{|S|}\}$ is the set of all possible states in the MDP;
3. The action set $A = \{a_1, a_2, \ldots, a_{|A|}\}$ is the set of all actions that an agent can take;
4. The state transition probability $P_{ss'}^{a} = P(s' \mid s, a) = P(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability of moving from state $s$ to state $s'$ when taking action $a$;
5. The reward function $R_{t+1} = R(s_t, a_t)$ gives the reward the environment provides to the agent for taking action $a_t$ in state $s_t$.
Unlike an MP, an MDP adds a decision-making process, and thus the agent must determine its behavior in each state:
1. The agent must decide which action $a$ to take in any state $s_t$; this mapping is called a policy. A policy is defined as follows [16]:
$\pi(a \mid s) = P(a_t = a \mid s_t = s)$
2. RL learns a policy through trial and error to maximize the reward. Whether an action is good or bad is judged by the sum of the rewards it leads to, which is defined as the return $G_t$.
The discount factor $\gamma$ determines the weight of each reward in the return, which is the weighted sum of rewards:
$G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t} R_{T+1}$
The discount factor $\gamma$ takes a value between 0 and 1; a value closer to 1 places more weight on future rewards. In RL, maximizing the reward means maximizing the return $G_t$, not just a single reward [16].
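As a minimal illustration (not code from the paper), the following Python sketch computes the return $G_t$ for a finite reward sequence using the discount factor $\gamma$ defined above.

```python
# Compute the discounted return G_t = R_{t+1} + gamma*R_{t+2} + ... for a
# finite episode, iterating backward over the reward sequence.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: 1.0 + 0.9*0.0 + 0.9**2 * 2.0 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```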
In RL, an agent receives a reward for performing a specific action in a particular state and learns to optimize this reward. Through trial and error, the agent selects actions that maximize its reward, and this iterative process optimizes its behavior. Unlike supervised learning, RL does not require pre-prepared data and can be trained with small amounts of data. To address challenges such as accurately detecting the positions and orientations of objects that robots encounter during pick-and-place tasks, deep learning techniques are used to predict their locations and orientations: the robot uses its camera to identify objects, and a deep learning model anticipates their positions and alignments. To address the problem of a robot inaccurately picking objects during a pick-and-place task, RL has been used to train the robot to grasp objects accurately; the robot is rewarded for correct pickups and penalized for incorrect ones. Studies have revealed that this method exhibits high accuracy and stability when robots perform pick-and-place tasks [17]. Deep Q-learning (DQL) enables a robot to collect goods in a warehouse and automatically fulfill orders. The robot receives a list of orders, collects the necessary goods, and uses DQL algorithms to determine the most efficient path. The optimal behavior is selected by approximating the Q-function with a deep learning model. The robot thus determines the most efficient route for collecting goods, which reduces the processing time [18].
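For readers unfamiliar with Q-learning, the sketch below shows the tabular update rule that DQL approximates with a neural network; the states, actions, and hyperparameters are illustrative placeholders rather than the cited warehouse or pick-and-place systems.

```python
# Tabular Q-learning: an epsilon-greedy policy plus the one-step update
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
# Deep Q-learning replaces the table Q with a neural network.
import random
from collections import defaultdict

actions = ["pick", "move", "place"]      # placeholder action set
Q = defaultdict(float)                   # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def choose_action(state):
    # Explore with probability epsilon, otherwise act greedily.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Single illustrative transition.
update("s0", "pick", 1.0, "s1")
print(choose_action("s1"))
```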

2.3. Storage Devices

A device that stores products and materials on a manufacturing floor is called a stocker. Stockers automate the storage and movement of raw materials and finished goods both in and out of a facility. Stockers operate in conjunction with a logistics automation system to streamline the supply chain process and can perform additional functions such as splitting, merging, and flipping as required. These functions allow the stocker to efficiently manage logistics throughout the manufacturing process.
As shown in Figure 3, stockers were initially managed by people, and logistics were transported manually. To optimize the number and use of stockers, several factors, including space utilization, demand forecasting, inventory turnover, and operational efficiency, should be considered. Depending on the type of product and the requirements of the site, either the first-in, first-out (FIFO) or last-in, first-out (LIFO) method is applied [19]. FIFO reduces the risk of goods becoming obsolete by using older inventory first. By contrast, LIFO may be appropriate for products with a longer shelf life [20].
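As a toy illustration of these two retrieval policies (not a stocker implementation), the snippet below contrasts FIFO and LIFO removal from the same buffer using a double-ended queue; the lot names are placeholders.

```python
# FIFO vs. LIFO retrieval from a buffer that received lots A, B, C in that order.
from collections import deque

buffer = deque(["lot-A", "lot-B", "lot-C"])

print("FIFO ships:", buffer[0])    # oldest lot (lot-A) leaves first
print("LIFO ships:", buffer[-1])   # newest lot (lot-C) leaves first

# Consuming the whole buffer in FIFO order:
while buffer:
    print("ship", buffer.popleft())   # A, then B, then C
```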
As shown in Figure 4, stocker automation has been adopted for the intelligent storage and management of products and materials, improving the efficiency and accuracy of storage operations in factories.
An overview of the various types of storage racks is presented here. Storage devices are selected according to the production environment.
Figure 5 shows single- and double-deep racks, which are both types of selective pallet racks and are among the most popular options for inventory storage in use today. Single-deep assemblies are installed back to back to create aisle ways for forklifts to access the palletized inventory. Each inventory pallet is easily accessible from the aisle using standard material handling equipment. Figure 6 shows a drive-in and drive-thru storage system, which is a free-standing, self-supporting rack that enables vehicles to drive in and access the stored products. Drive-in racks exhibit a higher storage capacity within the same cubic space compared with other conventional racking styles and reduce the cost per square foot of pallets. Drive-thru racks are similar to drive-in racks, except that vehicles can access the rack from either side, rendering them versatile and efficient. Figure 7 shows a live storage system for picking, also called a carton flow rack, which permits high-density storage of cartons and light products, saving space and improving stock turnover control. This picking system ensures perfect product rotation by following the FIFO principle and avoids interference by separating the loading and unloading areas.
Figure 8 illustrates pallet flow racking, which is also known as pallet live storage, gravity racking, and the FIFO racking system. This storage solution is designed for high-density storage of pallets, and it allows loads to move using gravity. Palletized loads are inserted at the highest point of the channel and moved by gravity to the opposite end, where they can be easily removed. A flow rack system does not have intermediate aisles, which enhances its storage capacity. Figure 9 shows a push-back racking system, which is a pallet storage method that enables storage of pallets from two to six deep on each side of an aisle. This method provides a higher storage density than other forms of racking. A push-back rack system consists of a set of inclined rails and a series of nesting carts that run on these rails. Figure 10 shows a mobile rack, which is a shelving system on wheels designed to increase space efficiency in warehouses and logistics operations. Unlike typical fixed racks, the mobile rack sits on movable tracks and can be easily relocated as desired. This mobility enables optimal utilization of the warehouse space, which increases its efficiency.
Figure 11 shows how a rack master can be a valuable asset in warehousing and logistics systems, particularly when the space is limited inside the warehouse. Because the rack master is customized to satisfy the specific needs of each customer, it can provide numerous functions that cater to those needs. The rack master is widely used in warehouse and logistics systems, rendering it an essential component in this field.

3. RL-Based Storage Optimization

3.1. Overall Architecture

Factory simulation involves using a computer-based virtual environment to model and simulate the operations of an actual factory.
Figure 12 illustrates how simulation optimizes factory operations and enhances the efficiency of production lines. Furthermore, the components of a factory, including production lines, equipment, workers, and materials, can be virtually arranged to replicate various operational scenarios and evaluate the outcomes [21]. This simulation improves the efficiency of a real factory’s operations, optimizes production planning, and enhances overall productivity.
Figure 13 shows that factory simulation and RL can be combined to model the complex processes of real-world factory operations in a digital environment for experimentation and analysis [22]. RL algorithms are used as agents in factory simulations to learn logistics processes. These algorithms enable them to optimize the logistics processes and automation systems of the factory. By implementing this approach, the complexity of real-world factory operations can be effectively managed to improve the efficiency of logistics processes [23].
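The coupling sketched in Figure 13 follows the standard agent-environment interaction loop. The minimal example below shows that loop with the Gymnasium API; a built-in environment stands in for the factory model, and a random action stands in for the trained policy.

```python
# Generic RL interaction loop: observe the state, choose an action, receive a
# reward and the next state, and reset when an episode ends.
import gymnasium as gym

env = gym.make("CartPole-v1")               # stand-in for a factory simulation model
obs, info = env.reset(seed=0)

for step in range(1000):
    action = env.action_space.sample()      # a trained policy would choose here
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```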

3.2. Virtual Factory Processes

Virtual factories are used in situations in which a physical factory has not yet been constructed or when an existing factory is to be analyzed or enhanced but performing such analysis in a real-world setting is challenging. In these cases, a factory can be established in a virtual space, which provides room for various analyses and evaluations [24]. Factory simulation utilizes three-dimensional (3D) simulation software to analyze the logistics flow of the factory, the equipment utilization rate, and potential losses such as line-of-balance and bottleneck issues. The optimal factory structure when introducing new equipment or transportation devices can also be simulated [25].
Figure 14 illustrates that virtual factory processes involve simulating and mimicking real-life factory operations. To improve the efficiency, productivity, and resource utilization of factory operations, numerous processes are undertaken. These processes include design and modeling, data collection, simulation execution, performance evaluation and optimization, decision making, strategy formulation, and system integration. This resource provides a comprehensive understanding of factory operations and insights into the construction and functioning of an actual factory [26].
Figure 15 shows how, through the systematic use of virtual verification, factory simulation helps manufacturers predict and address various problems that may emerge during the production process. By modeling the behavior of systems in advance, strategies can be developed for production and operations that increase efficiency, safety, and profitability. Additionally, analyzing the expected results of adjusting and changing various variables can further improve these outcomes.

3.3. Applying AI Technology

One can effectively optimize complex simulations by exploring optimal design directions using AI.
The accuracy of the data enables clear visibility and precise location and dimensioning during the transition from 2D to 3D, and efficient resource utilization maximizes production by optimizing the resource input. Figure 16 shows how AI algorithms are used in production optimization [27] to efficiently manage the logistics and resources of a factory and to optimize areas of logistics automation. When the proximal policy optimization (PPO) algorithm is applied to a factory simulation, PPO functions as an RL agent that learns optimal policies by interacting with the simulated environment. This approach helps agents make decisions about factory operations; for instance, the PPO algorithm can be used to solve decision-making problems such as storage management, production planning, and resource allocation, which can considerably improve the efficiency of factory operations [28]. The functions of real-time scheduling (RTS) and real-time dispatching (RTD) are being strengthened in production environments. Previously, rule-based production systems were operated, but recently, more efficient operation has been achieved by applying RTS and RTD AI models. In this simulation application, whereas work orders were previously defined by rules, the stocker (STK) operation optimizes work distribution with a model learned by applying RTD's AI.
The loss function for the PPO algorithm is expressed as follows [29]:
$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$
The motivation for this objective is as follows. The first term inside the min is the surrogate objective of trust region policy optimization, $r_t(\theta)\hat{A}_t$, where $r_t(\theta)$ is the probability ratio between the new and old policies and $\hat{A}_t$ is the advantage estimate. The second term modifies this surrogate by clipping the probability ratio, which removes the incentive for moving $r_t$ outside the interval $[1-\epsilon, 1+\epsilon]$. Taking the minimum of the clipped and unclipped terms makes the final objective a lower bound on the unclipped objective, so a change in the probability ratio is only taken into account when it worsens the objective.
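As an illustrative aid (not the paper's code), the sketch below evaluates the clipped surrogate objective above with NumPy; the new and old log-probabilities and advantage estimates are placeholder inputs that a real PPO implementation would draw from its rollout buffer and differentiate through an automatic-differentiation framework.

```python
# Monte Carlo estimate of the PPO clipped surrogate objective L^CLIP.
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    ratio = np.exp(log_prob_new - log_prob_old)                  # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))               # expectation estimate

# Placeholder rollout data for illustration only.
rng = np.random.default_rng(0)
value = ppo_clip_objective(rng.normal(size=8), rng.normal(size=8), rng.normal(size=8))
print(value)
```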

4. Simulation and Results

4.1. Simulation Environment

The simulation environment validated in this study is a component of a production line that produces battery management systems. Figure 17 displays the production system design, in which we compared the performance indicators of the conventional (FIFO) approach and the RL approach for distributing tasks, based on a model that optimizes storage capacity. The production system was modeled using FlexSim simulation software for validation [30]. To implement RL, we used the PPO algorithm from the Stable-Baselines3 library [31], which is known for its effectiveness across a wide range of problems. PPO is a policy gradient method that directly optimizes an agent's policy function in an RL setting; the objective of the agent is to acquire a policy that maximizes the total reward it obtains over a given horizon [32]. PPO controls the difference between the old and new policies, enforcing the "closeness" of policy updates during the policy renewal phase, which increases the stability of learning by preventing large or unstable policy updates. PPO uses a clipping technique to limit the size of policy updates, preventing them from becoming too large and improving convergence speed and stability. It is used in a variety of applications because of its stable policy learning, parallelism and scalability, and limited need for parameter tuning. Factory simulation and RL were coupled using the FlexSim simulator. In the simulation with RL, the task-assignment variables were defined based on the process time of the production line. RL mainly determines the behavior of the agent in the state space, and the average waiting time is evaluated against the average production. We defined the actions that the RL agent could perform within the simulation; by iteratively learning the lead time and average waiting time, the agent continuously checks the reward signal to find a solution that optimizes the average waiting time. To use RL, the observation and action parameters that drive learning must be set; FlexSim 2022 was used to set these parameters. In this paper, the type of the most recently processed item at Routers 1–3 was set as the observation parameter, and the state taken by the agent through the observation parameter is changed through the action parameter: the type of item to be pulled by Routers 1–3 was set as the action parameter.
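To make the setup concrete, the sketch below shows, under stated assumptions, how such an agent could be trained with Stable-Baselines3. The FlexSim coupling is not reproduced here: StockerDispatchEnv is a hypothetical Gymnasium stand-in whose observation is the last item type processed by Routers 1–3 and whose action is the item type each router pulls next, with placeholder dynamics and reward; only the PPO and learn calls reflect the library's actual API.

```python
# Hypothetical Gymnasium wrapper standing in for the FlexSim model, trained with
# the Stable-Baselines3 PPO implementation used in the paper.
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO


class StockerDispatchEnv(gym.Env):
    """Placeholder environment; a real setup would exchange these signals with FlexSim."""

    def __init__(self):
        super().__init__()
        # Observation: last processed item type (1-4, encoded 0-3) at Routers 1-3.
        self.observation_space = gym.spaces.MultiDiscrete([4, 4, 4])
        # Action: item type (1-4, encoded 0-3) to pull from the stocker next.
        self.action_space = gym.spaces.Discrete(4)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return np.zeros(3, dtype=np.int64), {}

    def step(self, action):
        obs = self.np_random.integers(0, 4, size=3)   # placeholder dynamics
        reward = 0.0                                  # a real reward follows Section 4.2
        return obs, reward, False, False, {}


model = PPO("MlpPolicy", StockerDispatchEnv(), verbose=1)
model.learn(total_timesteps=10_000)   # matches the 10,000 time steps used in the paper
```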

4.2. Performance Metrics

Items were categorized into four types, namely 1, 2, 3, and 4. Each type had an equal probability of occurring, with a 1/4 chance of a particular item being type i (where i = 1, 2, 3, or 4), and the type of a specific work piece was not influenced by the types of its preceding or succeeding work pieces. Board Assemblies 1 and 2 and Routers 1–3 could process all types of jobs, but the processing time for each type varied, as presented in Table 1. For Routers 1–3, if the type of a particular job differed from the type of the previous job, a setup time of 20 s was incurred, with the following factors considered:
- Average production per type;
- Average stocker load;
- Average buffer latency per type.
Table 1 shows the board assembly and router processing times for each of the four types, while the following factors were considered:
- Router set-up time: 20 s;
- Item generation: Types 1–4 with uniform probability;
- Run time: 10 days (warm-up time of 3 days);
- RL algorithm: PPO;
- Time steps (number of training iterations): 10,000;
- Reward definition, given below.
The reward formula is (router's average process time)/(Item Sink 1 entry time − item stocker exit time). To implement the model shown in Figure 17, which directs items entering the stocker to the optimal target among Routers 1–3, the RL components are defined as follows. The state is $S_t = [\text{Router1 Type}_{t-1}, \text{Router2 Type}_{t-1}, \text{Router3 Type}_{t-1}]$, where $\text{Type}_{t-1}$ denotes the type number of the $(t-1)$th job processed by each of Routers 1–3 at the current time $t$. The reward is $R = K/[T_t(\text{Sink 1 entry}) - T_t(\text{stocker exit})]$, where $K$ is a constant and $T_t(E)$ denotes the time at which event $E$ occurs for the item.
The behavior of the agent in RL involves selecting a specific item number for Routers 1–3. The constant $K$ in the agent's reward is 22, the average processing time for each type of item across Routers 1–3. In the simulation with RL, we defined the task-assignment variables based on the process time of the production line. RL mainly determines the behavior of the agent in the state space, and the average waiting time was evaluated against the average production. We defined the actions that the RL agent could perform within the simulation, and by iteratively learning the lead time and average waiting time, the agent continuously checked the reward signal to find a solution that optimized the average waiting time.
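For illustration, a minimal sketch of this reward signal is given below, assuming event times in seconds as logged by the simulation; the function and example values are placeholders rather than the paper's implementation.

```python
# Reward R = K / (Sink 1 entry time - stocker exit time), with K = 22 s, the
# average router processing time over the four item types (Table 1).
K_AVG_ROUTER_PROCESS_TIME = 22.0

def reward(sink_entry_time: float, stocker_exit_time: float) -> float:
    flow_time = sink_entry_time - stocker_exit_time   # time from stocker exit to sink entry
    return K_AVG_ROUTER_PROCESS_TIME / flow_time

# An item that flows through in exactly the average processing time earns 1.0;
# longer waits shrink the reward toward 0.
print(reward(sink_entry_time=1022.0, stocker_exit_time=1000.0))   # -> 1.0
```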

4.3. Results

To evaluate the performance of the model trained in this experiment, we compared the conventional FIFO method of distributing items with the method of distributing items using the trained RL model. We subsequently assessed the performance evaluation and metrics.
Figure 18 shows the simulation results by type after 10 iterations using FIFO. Figure 19 illustrates the results of a type-specific simulation with 10 iterations using RL.
Table 2 gives the results by type for a typical simulation using the FIFO method. The average latency for each type was similar.
Table 3 shows the result for FIFO to be 7304 inputs per day, with an average waiting time of 3348 s. The average waiting quantity was 272 pieces. Next, we applied RL and evaluated the outcomes after 10 iterations. Table 2 and Table 3 display the daily production, average lead time, stocker average wait time, and stocker average wait time when items were distributed using the FIFO method.
Table 4 shows the simulation results by type using RL. The application of RL resulted in a reduction of more than 30% in the average waiting time for each type.
Table 5 shows the results of applying RL, which yielded 7058 inputs per day and an average waiting time of 145 s. The average number of items in the queue was 12. Table 4 and Table 5 display the daily production, average lead time, average stocker waiting time, and average stocker waiting quantity when items were allocated using the trained RL method. Upon comparing Table 2 and Table 3 with Table 4 and Table 5, the daily production increased while the average stocker waiting time and the average waiting quantity decreased when the RL model was used. Therefore, the performance evaluation metrics improved in the model that applied RL to allocate items compared with the model in which FIFO was used. When RL was used to manage the supply of products from the stocker to the routers, the average production of items increased and the lead time decreased compared with the conventional FIFO method. On average, approximately 273 items were queued in the stocker in the FIFO scenario; this number decreased to approximately 12 when using RL.
In applying reinforcement learning, convergence of training was a key metric in this study. Convergence is the point at which the model trains reliably and reaches the desired performance; as shown in Table 6, the model enters the convergence window at 3000 training iterations and the stabilized convergence window at 8000 training iterations.
Compared with the first-in, first-out model, the average stocker queue was reduced by minimizing the number of product changes while maintaining a similar daily output level. As a result, the average lead time and stocker waiting time decreased. Figure 20 shows a plot of the trend line for the average stocker queue as a function of the number of reinforcement training sessions. The figure converges to a constant stocker average wait time as training progresses.

4.4. Discussion

Our results show that the use of RL in factory simulation is highly promising [3]. In [31], the PPO algorithm, a type of RL, was utilized for modular production control; in this paper, we used the PPO implementation of the Stable-Baselines3 library together with the FlexSim simulator. PPO has the following advantages. PPO learns by updating the new policy within a certain distance of the existing policy, so each update does not deviate far from the current policy. It can also use many samples from the environment to improve the policy, which makes it well suited to parallelization and distributed learning. In addition, it increases the stability of learning by preventing large or unstable policy updates. When applying RL to production, we found that the capacity of the stocker was further optimized and the productivity of the machines increased. In other words, supplying the same product type to the same machine from the buffer (storage) that exists between the separated processes, so as to reduce the number of set-ups, which are non-value-added operations, can maximize productivity and minimize the number of products waiting in the stocker. RL algorithms can improve and optimize the logistics automation process in factory simulation. In addition, factory simulation with RL can be trained on larger datasets to provide more accurate results.

5. Conclusions

The efficiency of logistics stockers' operations was maximized by using RL in a factory simulation. Optimizing logistics automation is crucial for predicting and efficiently addressing various problems that arise on the production line. To optimize logistics automation, RL models simulate behaviors in various situations and continually adjust to determine the optimal outcome. The data necessary for logistics operations were analyzed, and the optimal course of action was determined. For example, to optimize logistics automation, RL models analyze the quantity, type, transfer, storage, and quality of products on a production line. The system then collects data on the operation of a logistics automation robot and continuously adjusts it to perform logistics tasks optimally. The RL model identifies optimized routes to efficiently use the various resources required for logistics tasks. By optimizing the paths of logistics automation robots, work time can be reduced and problems on the production line can be prevented. By using RL in this manner, the diverse logistics tasks emerging on the production line can be managed, and production efficiency can be enhanced.

Applying RL improved the evaluated performance metrics by 30–100%, and optimizing logistics automation saves labor, time, and costs. Factory simulations can clearly be improved with RL. Traditional rule-based simulators produce different results depending on the user's parameter settings, whereas AI-powered modeling can optimize across multiple simulation outcomes. RL can be applied to factory simulations to identify optimal solutions for logistics automation equipment, which improves the efficiency of production operations. Intelligent logistics solutions can reduce the production lead time by calculating optimal routes through machine learning and eliminating human handling of products and data, promoting unmanned factories.

Factory simulation with RL has seen many advances in recent years, and these advances bring challenges. Advances in data collection and storage technologies have produced rich datasets for factory simulation models, and advances in deep learning have enabled deep reinforcement learning (DRL) algorithms to improve the performance of RL agents in complex environments. However, RL requires a large number of samples, which can be difficult to collect in a factory environment, and the collected data may generalize only to a specific environment. Designing the right reward function is a key issue in RL, and setting it correctly can be challenging, especially in complex factory simulation environments. Factory simulation environments can also be time-varying and unstable, so RL agents need to respond to changes in this dynamic environment.

Potential directions for future research include the development of more sophisticated RL models, their application to a wider variety of logistics task scenarios, improvements in data collection and analysis methods, and examination of real-world applications. These studies suggest that the combination of factory simulation and RL can further improve efficiency in the field of logistics automation.

Author Contributions

Conceptualization, J.-B.L.; methodology, J.-B.L.; software, J.-B.L.; validation, J.-B.L.; formal analysis, J.-B.L.; investigation, J.-B.L.; resources, J.-B.L.; data curation, J.-B.L.; writing—original draft preparation, J.-B.L.; writing—review and editing, J.-B.L.; visualization, J.-B.L.; supervision, J.-B.L.; project administration, J.J.; funding acquisition, J.-B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the SungKyunKwan University and the BK21 FOUR (Graduate School Innovation) and funded by the Ministry of Education (MOE, Korea) and National Research Foundation of Korea (NRF).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

This research was supported by the SungKyunKwan University and the BK21 FOUR (Graduate School Innovation) and funded by the Ministry of Education (MOE, Korea) and National Research Foundation of Korea (NRF). Moreover, this research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2023-2018-0-01417) and supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kamble, S.S.; Gunasekaran, A.; Parekh, H. Digital twin for sustainable manufacturing supply chains: Current trends, future perspectives, and an implementation framework. Technol. Forecast. Soc. Chang. 2022, 176, 121448. [Google Scholar] [CrossRef]
  2. Fragapane, G. Increasing flexibility and productivity in Industry 4.0 production networks with autonomous mobile robots and smart intralogistics. Ann. Oper. Res. 2022, 308, 125–143. [Google Scholar] [CrossRef]
  3. Xue, T.; Zeng, P.; Yu, H. A reinforcement learning method for multi-AGV scheduling in manufacturing. In Proceedings of the 2018 IEEE International Conference on Industrial Technology (ICIT), Lyon, France, 20–22 February 2018. [Google Scholar]
  4. Paritala, P.K.; Manchikatla, S.; Yarlagadda, P.K.D.V. Digital Manufacturing- Applications Past, Current, and Future Trends. Procedia Eng. 2017, 174, 982–991. [Google Scholar] [CrossRef]
  5. Arents, J.; Greitans, M. Smart Industrial Robot Control Trends, Challenges and Opportunities within Manufacturing. Appl. Sci. 2022, 12, 937. [Google Scholar] [CrossRef]
  6. Werth, B.; Karder, J.; Beham, A. Simulation-based Optimization of Material Requirements Planning Parameters. Procedia Comput. Sci. 2023, 217, 1117–1126. [Google Scholar] [CrossRef]
  7. Dotoli, M. An overview of current technologies and emerging trends in factory automation. Int. J. Prod. Res. 2019, 57, 5047–5067. [Google Scholar] [CrossRef]
  8. Wu, J.; Peng, Z.; Cui, D.; Li, Q.; He, J. A multi-object optimization cloud workflow scheduling algorithm based on reinforcement learning. In Proceedings of the International Conference on Intelligent Computing, Wuhan, China, 15–18 August 2018; Springer: Cham, Switzerland, 2018; pp. 550–559. [Google Scholar]
  9. Feldkamp, N.; Bergmann, S.; Strassburger, S. Simulation-based Deep Reinforcement Learning for Modular Production Systems. In Proceedings of the 2020 Winter Simulation Conference, Orlando, FL, USA, 14–18 December 2020; pp. 1596–1607. [Google Scholar]
  10. Tao, F.; Xiao, B.; Qi, Q. Digital twin modeling. J. Manuf. Syst. 2022, 64, 372–389. [Google Scholar] [CrossRef]
  11. Serrano-Ruiz, J.C.; Mula, J. Development of a multidimensional conceptual model for job shop smart manufacturing scheduling from the Industry 4.0 perspective. J. Manuf. Syst. 2022, 63, 185–202. [Google Scholar] [CrossRef]
  12. Cavalcante, I.M.; Frazzon, E.M.; Forcellini, F.A.; Ivanov, D. A supervised machine learning approach to data-driven simulation of resilient supplier selection in digital manufacturing. Int. J. Inf. Manag. 2019, 49, 86–97. [Google Scholar] [CrossRef]
  13. Huerta-Torruco, V.A.; Hernandez-Uribe, O.; Cardenas-Robledo, L.A.; Rodriguez-Olivares, N.A. Effectiveness of virtual reality in discrete event simulation models for manufacturing systems. Comput. Ind. Eng. 2022, 168, 108079. [Google Scholar] [CrossRef]
  14. de Ferreira, W.; Armellini, F.; de Santa-Eulalia, L.A. Extending the lean value stream mapping to the context of Industry 4.0: An agent-based technology approach. J. Manuf. Syst. 2022, 63, 1–14. [Google Scholar] [CrossRef]
  15. Available online: https://stevebong.tistory.com/4 (accessed on 1 August 2023).
  16. Available online: https://minsuksung-ai.tistory.com/13 (accessed on 1 August 2023).
  17. Chen, Y.-L.; Cai, Y.-R.; Cheng, M.-Y. Vision-Based Robotic Object Grasping—A Deep Reinforcement Learning Approach. Machines 2023, 11, 275. [Google Scholar] [CrossRef]
  18. Peyas, I.S.; Hasan, Z.; Tushar, M.R.R. Autonomous Warehouse Robot using Deep Q-Learning. In Proceedings of the TENCON 2021—2021 IEEE Region 10 Conference (TENCON), Auckland, New Zealand, 7–10 December 2021. [Google Scholar]
  19. Available online: https://velog.io/@baekgom/LIFO-%EC%84%A0%EC%9E%85%EC%84%A0%EC%B6%9C-FIFO-%ED%9B%84%EC%9E%85%EC%84%A0%EC%B6%9C (accessed on 1 August 2023).
  20. Utami, M.C.; Sabarkhah, D.R.; Fetrina, E.; Huda, M.Q. The Use of FIFO Method For Analysing and Designing the Inventory Information System. In Proceedings of the 2018 6th International Conference on Cyber and IT Service Management (CITSM), Parapat, Indonesia, 7–9 August 2018. [Google Scholar]
  21. Zhang, M.; Tao, F. Digital Twin Enhanced Dynamic Job-Shop Scheduling. J. Manuf. Syst. 2021, 58 Pt B, 146–156. [Google Scholar] [CrossRef]
  22. Yildiz, E.; Møller, C.; Bilberg, A. Virtual Factory: Digital Twin Based Integrated Factory Simulations. Procedia CIRP 2020, 93, 216–221. [Google Scholar] [CrossRef]
  23. Fahle, S.; Prinz, C.; Kuhlenkötter, B. Systematic review on machine learning (ML) methods for manufacturing processes—Identifying artificial intelligence (AI) methods for field application. Procedia CIRP 2020, 93, 413–418. [Google Scholar] [CrossRef]
  24. Jain, S.; Lechevalier, D. Standards based generation of a virtual factory model. In Proceedings of the 2016 Winter Simulation Conference (WSC), Washington, DC, USA, 11–14 December 2016. [Google Scholar]
  25. Leon, J.F.; Marone, P. A Tutorial on Combining Flexsim with Python for Developing Discrete-Event Simheuristics. In Proceedings of the 2022 Winter Simulation Conference (WSC), Singapore, 11–14 December 2022. [Google Scholar]
  26. Sławomir, L.; Vitalii, I. A simulation study of Industry 4.0 factories based on the ontology on flexibility with using FlexSim® software. Manag. Prod. Eng. Rev. 2020, 11, 74–83. [Google Scholar]
  27. Belsare, S.; Badilla, E.D. Reinforcement Learning with Discrete Event Simulation: The Premise, Reality, and Promise. In Proceedings of the 2022 Winter Simulation Conference (WSC), Singapore, 11–14 December 2022. [Google Scholar]
  28. Park, J.-S.; Kim, J.-W. Developing Reinforcement Learning based Job Allocation Model by Using FlexSim Software. In Proceedings of the Winter Conference of the Korea Computer Information Society, Daegwallyeong, Republic of Korea, 8–10 February 2023; Volume 31. [Google Scholar]
  29. Available online: https://ropiens.tistory.com/85 (accessed on 1 August 2023).
  30. Krenczyk, D.; Paprocka, I. Integration of Discrete Simulation, Prediction, and Optimization Methods for a Production Line Digital Twin Design. Materials 2023, 16, 2339. [Google Scholar] [CrossRef] [PubMed]
  31. Mayer, S.; Classen, T.; Endisch, C. Modular production control using deep reinforcement learning: Proximal policy optimization. J. Intell. Manuf. 2021, 32, 2335–2351. [Google Scholar] [CrossRef]
  32. Zhang, Y.; Zhu, H. Dynamic job shop scheduling based on deep reinforcement learning for multi-agent manufacturing systems. Robot. Comput.-Integr. Manuf. 2022, 78, 102412. [Google Scholar] [CrossRef]
Figure 1. Factory simulation processes.
Figure 2. Reinforcement learning (RL) diagram.
Figure 3. Manual storage devices.
Figure 4. Automated storage devices.
Figure 5. Single- and double-deep racks.
Figure 6. Drive-in and drive-thru rack.
Figure 7. Carton flow rack.
Figure 8. Pallet flow rack.
Figure 9. Push-back rack.
Figure 10. Mobile rack.
Figure 11. Rack Master.
Figure 12. Factory simulation modeling.
Figure 13. RL structure for factory simulations.
Figure 14. Virtual factory processes.
Figure 15. Organized virtual verification.
Figure 16. Factory simulation reinforcement learning application block diagram.
Figure 17. Factory simulation system diagram.
Figure 18. FIFO type.
Figure 19. RL application type.
Figure 20. Average stocker waiting quantity over training.
Table 1. Process time (s).

Item      Board Assembly    Router
Type 1    20                19
Type 2    17                25
Type 3    14                23
Type 4    15                21
Table 2. Results by FIFO.

Item      Input    Output    Average Lead Time (s)    Average Wait Time (s)
Type 1    1766     1757      3351                     3347
Type 2    1760     1754      3406                     3349
Type 3    1756     1749      3400                     3348
Type 4    1754     1748      3400                     3349
Table 3. FIFO average results.

Item      Input    Output    Average Lead Time (s)    Average Wait Time (s)    Average Waiting Quantity
Result    7034     7009      3402                     3348                     272
Table 4. Results for RL application type.

Item      Input    Output    Average Lead Time (s)    Average Wait Time (s)
Type 1    1767     1767      159                      106
Type 2    1760     1767      233                      176
Type 3    1756     1756      204                      152
Type 4    1766     1766      198                      148
Table 5. RL application average results.

Item      Input    Output    Average Lead Time (s)    Average Wait Time (s)    Average Waiting Quantity
Result    7058     7058      198                      145                      12
Table 6. RL metrics for convergence.

Item               1     2      3      4      5      6      7      8      9      10     11
Time Step          0     1000   2000   3000   4000   5000   6000   7000   8000   9000   10,000
Average STK W/Q    278   68     68     24     23     23     20     17     13     9      9