Applied Sciences
  • Article
  • Open Access

28 January 2021

Mobile Robot Path Optimization Technique Based on Reinforcement Learning Algorithm in Warehouse Environment

Department of Smart Factory Convergence, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon 16419, Korea
Author to whom correspondence should be addressed.

Abstract

This paper reports on the use of reinforcement learning for optimizing mobile robot paths in a warehouse environment with automated logistics. First, we compared the results of experiments conducted using two basic algorithms to identify the fundamentals required for planning the path of a mobile robot and for applying reinforcement learning to path optimization. The algorithms were tested in a path optimization simulation of a mobile robot under the same experimental environment and conditions. Thereafter, we attempted to improve on the earlier experiments and conducted additional experiments to confirm the improvement. The experimental results clarified the characteristics of, and differences between, the reinforcement learning algorithms. The findings of this study will facilitate understanding of the basic concepts of reinforcement learning for further studies on more complex and realistic path optimization algorithms.

1. Introduction

The fourth industrial revolution and digital innovation have led to the increased interest in the use of robots across industries; thus, robots are being used in various industrial sites. Robots can be broadly categorized as industrial or service robots. Industrial robots are automated equipment and machines used in industrial sites such as factory lines and are used for assembly, machining, inspection, and measurement as well as pressing and welding [1]. The use of service robots is expanding in manufacturing environments and other areas such as industrial sites, homes, and institutions. Service robots can be broadly divided into personal service robots and professional service robots. In particular, professional service robots are robots used by workers to perform tasks, and mobile robots are one of the professional service robots [2].
A mobile robot needs the following elements to perform tasks: localization, mapping, and path planning. First, the location of the robot must be known in relation to the surrounding environment. Second, a map is needed to identify the environment and move accordingly. Third, path planning is necessary to find the best path for performing a given task [3]. Path planning is of two types: point-to-point and complete coverage. Complete coverage path planning is performed when all positions must be visited, such as for a cleaning robot. Point-to-point path planning is performed for moving from a start position to a target position [4].
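As an illustrative sketch (not code from this paper), point-to-point planning on a grid map can be expressed as a breadth-first search that returns a shortest obstacle-free path; the grid encoding below (1 = obstacle, 0 = free) is an assumption.

```python
from collections import deque

def point_to_point_path(grid, start, goal):
    """Shortest start-to-goal path on a 4-connected grid via breadth-first search.

    grid[y][x] == 1 marks an obstacle; cells are (x, y) tuples.
    Returns the path as a list of cells, or None if the goal is unreachable.
    """
    height, width = len(grid), len(grid[0])
    frontier = deque([start])
    came_from = {start: None}          # also serves as the visited set
    while frontier:
        cur = frontier.popleft()
        if cur == goal:                # reconstruct the path backwards
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if 0 <= nx < width and 0 <= ny < height \
                    and grid[ny][nx] == 0 and nxt not in came_from:
                came_from[nxt] = cur
                frontier.append(nxt)
    return None
```

Because breadth-first search expands cells in order of distance from the start, the first time it reaches the goal it has found a shortest path, which makes it a common baseline against which learned paths are compared.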
To review the use of reinforcement learning as a path optimization technique for mobile robots in a warehouse environment, the following points must be understood. First, it is necessary to understand the warehouse environment and the movement of mobile robots. To do this, we must check the configuration and layout of the warehouse in which the mobile robot is used as well as the tasks and actions that the mobile robot performs in this environment. Second, it is necessary to understand the recommended research steps for the path planning and optimization of mobile robots. Third, it is necessary to understand the connections between layers and modules as well as their operation when they are integrated with mobile robots, environments, hardware, and software. With this basic information, a simulation program for real experiments can be implemented. Because reinforcement learning can be tested only when a specific experimental environment is prepared in advance, the researcher must invest substantial effort in preparing the environment early in the study. Subsequently, the reinforcement learning algorithm to be tested is integrated with the simulation environment for the experiment.
This paper discusses an experiment performed using a reinforcement learning algorithm as a path optimization technique in a warehouse environment. Section 2 briefly describes the warehouse environment, the concept of path planning for mobile robots, the concept of reinforcement learning, and the algorithms to be tested. Section 3 briefly describes the structure of the mobile robot path-optimization system, the main logic of path search, and the idea of dynamically adjusting the reward value according to the moving action. Section 4 describes the construction of the simulation environment for the experiments, the parameters and reward values to be used for the experiments, and the experimental results. Section 5 presents the conclusions and directions for future research.

3. Reinforcement Learning-Based Mobile Robot Path Optimization

3.1. System Structure

In the mobile robot framework structure in Figure 6, the supervisor controller directs the agent, and the agent runs the robot controller so that the robot operates in the warehouse environment while the necessary information is observed and collected [25,26,27].
Figure 6. Warehouse mobile robot framework structure.

3.2. Operation Procedure

The mobile robot framework works as follows. At the beginning, the simulator is set to its initial state, and the mobile robot collects the first observation information and sends it to the supervisor controller. The supervisor controller verifies the received information and communicates the necessary action to the agent; at this stage, the supervisor controller can obtain additional information if necessary, and it can communicate job information to the agent. The agent conveys the action to the mobile robot through the robot controller, and the mobile robot performs the received action using its actuators. After the action is performed, the changed position and status information is transmitted to the agent and the supervisor controller, and the supervisor controller calculates the reward value for the performed action. Subsequently, the action, reward value, and status information are stored in memory. These procedures are repeated until the mobile robot achieves the defined goal or the specified number of epochs/episodes is reached [25].
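The observe-act-reward-store cycle above can be sketched as follows; all class and function names here (`WarehouseSim`, `run_episode`, and so on) are illustrative assumptions, not the framework's actual API.

```python
class WarehouseSim:
    """Minimal stand-in for the warehouse simulator (assumed interface)."""
    def __init__(self, goal=(24, 24)):
        self.goal = goal
        self.pos = (0, 0)

    def reset(self):
        """Set the simulator to its initial state; return the first observation."""
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        """Apply an actuator command and return the changed position."""
        moves = {"up": (0, -1), "down": (0, 1),
                 "left": (-1, 0), "right": (1, 0)}
        dx, dy = moves[action]
        x = min(24, max(0, self.pos[0] + dx))
        y = min(24, max(0, self.pos[1] + dy))
        self.pos = (x, y)
        return self.pos

def run_episode(sim, choose_action, compute_reward, max_steps=800):
    """One episode of the observe -> act -> reward -> store cycle."""
    memory = []                        # (state, action, reward, next_state)
    state = sim.reset()
    for _ in range(max_steps):
        action = choose_action(state)             # supervisor decides
        next_state = sim.step(action)             # robot actuates
        reward = compute_reward(next_state, sim.goal)
        memory.append((state, action, reward, next_state))
        if next_state == sim.goal:                # defined goal achieved
            break
        state = next_state
    return memory
```

A real controller would replace `choose_action` with the learning agent's policy and feed `memory` back into the learning update, but the control flow mirrors the procedure described above.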
In Figure 7, we show an example class that is composed of the essential functions and main logic required by the supervisor controller for path planning.
Figure 7. Required methods and logic of path finding [25].

3.3. Dynamic Reward Adjustment Based on Action

The simplest type of reward for the movement of a mobile robot in a grid-type warehouse environment specifies a negative reward value for an obstacle and a positive reward value for the target location. The direction of the mobile robot is determined by its current and target positions. The path planning algorithm requires significant time to find the best path because it repeats many random path-finding attempts. Under the basic reward method introduced earlier, an agent searches randomly for a path regardless of whether it is located far from or close to the target position. To address this inefficiency, we considered the reward method shown in Figure 8.
Figure 8. Dynamic adjustment of reward values for the moving action.
The suggestions are as follows. First, when the travel path and target position are determined, a partial area close to the target position is selected (the reward adjustment area). Second, the path positions within this area are assigned a higher reward value than under the basic reward method. Third, whenever the target position changes, a new area is selected in the same way and its positions are reassigned the higher reward value. We call this dynamic reward adjustment based on action; Algorithm 3 summarizes the process.
Algorithm 3: Dynamic reward adjustment based on action.
1. Get the new destination to move to.
2. Select a partial area near the target position.
3. Assign a higher positive reward value to the selected area, excluding the target point and obstacles (the target position itself receives the highest positive reward value).
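A minimal sketch of this procedure in Python, using the reward values later given in Section 4.3 (obstacle −1, target 10, near-target area 0.3); the Manhattan `radius` that defines the "partial area near the target" is an assumed parameter, and the function would be re-run whenever the destination changes.

```python
def build_reward_map(grid, target, radius=3,
                     obstacle_r=-1.0, target_r=10.0, near_r=0.3):
    """Dynamic reward adjustment based on action (illustrative sketch).

    grid[y][x] == 1 marks an obstacle; target is an (x, y) cell.
    Returns a reward map of the same size as the grid.
    """
    height, width = len(grid), len(grid[0])
    tx, ty = target
    rewards = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if grid[y][x] == 1:
                rewards[y][x] = obstacle_r    # obstacles keep a negative reward
            elif (x, y) == (tx, ty):
                rewards[y][x] = target_r      # highest positive value at the goal
            elif abs(x - tx) + abs(y - ty) <= radius:
                rewards[y][x] = near_r        # boosted area near the goal
    return rewards
```

Because the boosted area is recomputed from the current destination, the agent's random exploration is biased toward the neighborhood of the goal rather than being uniform over the whole grid.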

4. Test and Results

4.1. Warehouse Simulation Environment

To create the simulation environment, we referred to [23] and built a virtual experiment environment based on the previously described warehouse environment. A total of 15 inventory pods were included in a 25 × 25 grid layout. A storage trip of a mobile robot from the workstation to an inventory pod was simulated. A single mobile robot was placed at a defined starting position, and the experiment sought the optimal path to the target position while avoiding the inventory pods.
Figure 9 shows the target position for the experiments.
Figure 9. Target position of simulation experiment.

4.2. Test Parameter Values

Q-Learning and Dyna-Q algorithms were tested in a simulation environment.
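For reference, the two algorithms differ only in that Dyna-Q adds model-based planning steps on top of the Q-Learning update. The sketch below is a generic tabular implementation, not the paper's code, and the hyperparameter values are placeholders rather than those of Table 1.

```python
import random
from collections import defaultdict

def q_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """One Q-Learning backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_q_step(Q, model, s, a, r, s2, actions, n_planning=5):
    """Dyna-Q: apply the direct update, record the transition in the model,
    then replay n_planning simulated transitions drawn from the model."""
    q_update(Q, s, a, r, s2, actions)
    model[(s, a)] = (r, s2)                       # deterministic world model
    for _ in range(n_planning):
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        q_update(Q, ps, pa, pr, ps2, actions)
```

The extra planning replays are why Dyna-Q can converge in fewer episodes while spending more wall-clock time per episode, which is consistent with the results reported in Section 4.4.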
Table 1 shows the parameter values used in the experiment.
Table 1. Parameter values for experiment.

4.3. Test Reward Values

The experiment was conducted with the following reward methods.
  • Basic method: The obstacle was set as −1, the target position as 1, and the other positions as 0.
  • Dynamic reward adjustment based on action: To check the efficacy of the proposal, an experiment was conducted designating set values that were similar to those in the proposed method as shown in Figure 10 [28,29,30]. The obstacle was set as −1, the target position as 10, the positions in the selected area (area close to target) as 0.3, and the other positions as 0.
    Figure 10. Selected area for dynamic reward adjustment based on action.

4.4. Results

First, we tested two basic reinforcement learning algorithms: Q-Learning and Dyna-Q. Because the results may deviate between runs, the experiment was performed five times for each algorithm under the same conditions and environment. The following points were verified and compared using the experimental results: the optimization efficiency of each algorithm was checked based on the length of the final path, and the path optimization performance was confirmed by the path search time. We were also able to identify learning patterns from the number of learning steps per episode.
The path search time of the Q-Learning algorithm was 19.18 min for the first test, 18.17 min for the second test, 24.28 min for the third test, 33.92 min for the fourth test, and 19.91 min for the fifth test. It took about 23.09 min on average. The path search time of the Dyna-Q algorithm was 90.35 min for the first test, 108.04 min for the second test, 93.01 min for the third test, 100.79 min for the fourth test, and 81.99 min for the fifth test. The overall average took about 94.83 min.
The final path length of the Q-Learning algorithm was 65 steps for the first test, 34 steps for the second test, 60 steps for the third test, 91 steps for the fourth test, 49 steps for the fifth test, and the average was 59.8 steps. The final path length of the Dyna-Q algorithm was 38 steps for the first test, 38 steps for the second test, 38 steps for the third test, 34 steps for the fourth test, 38 steps for the fifth test, and the average was 37.2 steps.
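The averages quoted above can be reproduced directly from the per-test values; note that the Dyna-Q mean comes out as 94.84 min, which matches the quoted 94.83 min up to rounding.

```python
# Per-test values as reported in Section 4.4.
q_times = [19.18, 18.17, 24.28, 33.92, 19.91]       # Q-Learning, minutes
dyna_times = [90.35, 108.04, 93.01, 100.79, 81.99]  # Dyna-Q, minutes
q_lengths = [65, 34, 60, 91, 49]                    # Q-Learning, steps
dyna_lengths = [38, 38, 38, 34, 38]                 # Dyna-Q, steps

def avg(values):
    return sum(values) / len(values)

print(f"Q-Learning: {avg(q_times):.2f} min, {avg(q_lengths):.1f} steps")
print(f"Dyna-Q:     {avg(dyna_times):.2f} min, {avg(dyna_lengths):.1f} steps")
```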
The Q-Learning algorithm was faster than Dyna-Q in terms of path search time, whereas Dyna-Q selected shorter and simpler final paths than Q-Learning. Figure 11 and Figure 12 show the learning pattern through the number of training steps per episode. Q-Learning kept repeating many steps per episode across all 5000 episodes, whereas Dyna-Q stabilized after finding the optimal path between 1500 and 3000 episodes. In both algorithms, the number of steps rarely exceeded 800. Every test ran 5000 episodes. For Q-Learning, the second and fourth tests each exceeded 800 steps 5 times (0.1% each), so over the five tests Q-Learning exceeded 800 steps at a total rate of 0.04%. For Dyna-Q, the first test exceeded 800 steps once (0.02%), the third test 7 times (0.14%), and the fifth test 3 times (0.06%), for a total rate of 0.044% over the five tests. To relate the path search time to the learning pattern, we compared test (e) of the two algorithms: the path search time was 19.91 min for Q-Learning and 81.99 min for Dyna-Q. As the learning patterns in Figure 11 and Figure 12 show, Q-Learning executed 200 to 800 steps per episode until the end of the 5000 episodes, whereas Dyna-Q executed 200 to 800 steps per episode only until about episode 2000. Nevertheless, Dyna-Q required roughly four times the total search time of Q-Learning, confirming that Dyna-Q spends longer learning within a single episode than Q-Learning.
Figure 11. Path optimization learning progress status of Q-Learning: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.
Figure 12. Path optimization learning progress status of Dyna-Q: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.
Subsequently, additional experiments were conducted by applying ‘dynamic reward adjustment based on action’ to the Q-Learning algorithm. The path search time of this experiment was 19.04 min for the first test, 15.21 min for the second, 17.05 min for the third, 17.51 min for the fourth, and 17.38 min for the fifth, about 17.23 min on average. The final path length was 38 steps for the first test and 34 steps for each of the second through fifth tests, 34.8 steps on average. The experimental results are shown in Figure 13, together with the results of the Q-Learning and Dyna-Q experiments conducted previously. With ‘dynamic reward adjustment based on action’ applied, the Q-Learning algorithm’s path search time was slightly faster than that of plain Q-Learning, and its final path length improved to a level similar to that of Dyna-Q. Figure 14 compares the learning pattern of the additional experiments with that of Q-Learning; no notable difference was observed. Figure 15 shows, on the map, the paths found by the Q-Learning algorithm with ‘dynamic reward adjustment based on action’.
Figure 13. Comparison of experiment results: (a) Total elapsed time; (b) Length of final path.
Figure 14. Path optimization learning progress status of Q-Learning with action based reward: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.
Figure 15. Path results of Q-Learning with action based reward: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.

5. Conclusions

In this paper, we reviewed the use of reinforcement learning as a path optimization technique in a warehouse environment. We built a simulation environment to test path navigation in a warehouse and tested the two most basic reinforcement learning algorithms, Q-Learning and Dyna-Q. We then compared the experimental results in terms of path search times and final path lengths. Averaged over the five experiments, the path search time of the Q-Learning algorithm was about 4.1 times faster than the Dyna-Q algorithm’s, and the final path length of the Dyna-Q algorithm was about 37% shorter than the Q-Learning algorithm’s. Thus, each algorithm was unsatisfactory in either path search time or final path length. To improve the results, we conducted various experiments using the Q-Learning algorithm and found that it does not perform well when searching for paths in a complex environment with many obstacles, because Q-Learning explores paths at random. To minimize this random path searching, once the target position for movement was determined, higher reward values were set around the target position; with this ‘dynamic reward adjustment based on action’, the performance and efficiency of the agent improved, because agents mainly explore locations with high reward values. In the further experiments, the path search time was about 5.5 times faster than the Dyna-Q algorithm’s, and the final path length was about 42% shorter than the Q-Learning algorithm’s. In other words, the path search time was slightly faster than that of the basic Q-Learning algorithm, and the final path length improved to the same level as that of the Dyna-Q algorithm.
Meeting both minimal path search time and high path search accuracy is a difficult goal. If the learning rate is too high, the algorithm will not converge to the optimal value function; if it is set too low, learning may take too long. In this paper, we experimentally confirmed a simple technique for reducing inefficient exploration that improves path search accuracy while maintaining path search time.
In this study, we examined reinforcement learning for a single agent and a path optimization technique for a single path, but the warehouse environment in the field is much more complex and dynamically changing. In an automated warehouse, multiple mobile robots work simultaneously. This basic study is therefore an example of improving a basic algorithm under a single-path, single-agent condition and is insufficient for application to an actual warehouse environment. Based on the obtained results, an in-depth study of improved reinforcement learning algorithms is needed, considering highly challenging multi-agent environments, to solve more complex and realistic problems.

Author Contributions

Conceptualization, H.L. and J.J.; methodology, H.L.; software, H.L.; validation, H.L. and J.J.; formal analysis, H.L.; investigation, H.L.; resources, J.J.; data curation, H.L.; writing–original draft preparation, H.L.; writing—review and editing, J.J.; visualization, H.L.; supervision, J.J.; project administration, J.J.; funding acquisition, J.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2018-0-01417) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

Acknowledgments

This work was supported by the Smart Factory Technological R&D Program S2727115 funded by Ministry of SMEs and Startups (MSS, Korea).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hajduk, M.; Koukolová, L. Trends in Industrial and Service Robot Application. Appl. Mech. Mater. 2015, 791, 161–165.
  2. Belanche, D.; Casaló, L.V.; Flavián, C. Service robot implementation: A theoretical framework and research agenda. Serv. Ind. J. 2020, 40, 203–225.
  3. Fethi, D.; Nemra, A.; Louadj, K.; Hamerlain, M. Simultaneous localization, mapping, and path planning for unmanned vehicle using optimal control. SAGE 2018, 10, 168781401773665.
  4. Zhou, P.; Wang, Z.-M.; Li, Z.-N.; Li, Y. Complete Coverage Path Planning of Mobile Robot Based on Dynamic Programming Algorithm. In Proceedings of the 2nd International Conference on Electronic & Mechanical Engineering and Information Technology (EMEIT-2012), Shenyang, China, 7 September 2012; pp. 1837–1841.
  5. Azadeh, K.; Koster, M.; Roy, D. Robotized and Automated Warehouse Systems: Review and Recent Developments. INFORMS 2019, 53, 917–1212.
  6. Koster, R.; Le-Duc, T.; Roodbergen, K.J. Design and control of warehouse order picking: A literature review. EJOR 2007, 182, 481–501.
  7. Roodbergen, K.J.; Vis, I.F.A. A model for warehouse layout. IIE Trans. 2006, 38, 799–811.
  8. Larson, T.N.; March, H.; Kusiak, A. A heuristic approach to warehouse layout with class-based storage. IIE Trans. 1997, 29, 337–348.
  9. Kumar, N.V.; Kumar, C.S. Development of collision free path planning algorithm for warehouse mobile robot. In Proceedings of the International Conference on Robotics and Smart Manufacturing (RoSMa2018), Chennai, India, 19–21 July 2018; pp. 456–463.
  10. Sedighi, K.H.; Ashenayi, K.; Manikas, T.W.; Wainwright, R.L.; Tai, H.-M. Autonomous local path planning for a mobile robot using a genetic algorithm. In Proceedings of the 2004 Congress on Evolutionary Computation, Portland, OR, USA, 19–23 June 2004; pp. 456–463.
  11. Lamballais, T.; Roy, D.; De Koster, M.B.M. Estimating performance in a Robotic Mobile Fulfillment System. EJOR 2017, 256, 976–990.
  12. Merschformann, M.; Lamballais, T.; de Koster, M.B.M.; Suhl, L. Decision rules for robotic mobile fulfillment systems. Oper. Res. Perspect. 2019, 6, 100128.
  13. Zhang, H.-Y.; Lin, W.-M.; Chen, A.-X. Path Planning for the Mobile Robot: A Review. Symmetry 2018, 10, 450.
  14. Han, W.-G.; Baek, S.-M.; Kuc, T.-Y. Genetic algorithm based path planning and dynamic obstacle avoidance of mobile robots. In Proceedings of the 1997 IEEE International Conference on Systems, Man, and Cybernetics, Orlando, FL, USA, 12–15 October 1997.
  15. Nurmaini, S.; Tutuko, B. Intelligent Robotics Navigation System: Problems, Methods, and Algorithm. IJECE 2017, 7, 3711–3726.
  16. Johan, O. Warehouse Vehicle Routing Using Deep Reinforcement Learning. Master’s Thesis, Uppsala University, Uppsala, Sweden, November 2019.
  17. Géron, A. Hands-On Machine Learning with Scikit-Learn and TensorFlow. Available online: https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/ch01.html (accessed on 22 December 2020).
  18. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018; pp. 1–528.
  19. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed.; Wiley: New York, NY, USA, 1994; pp. 1–649.
  20. Deep Learning Wizard. Available online: https://www.deeplearningwizard.com/deep_learning/deep_reinforcement_learning_pytorch/bellman_mdp/ (accessed on 22 December 2020).
  21. Meerza, S.I.A.; Islam, M.; Uzzal, M.M. Q-Learning Based Particle Swarm Optimization Algorithm for Optimal Path Planning of Swarm of Mobile Robots. In Proceedings of the 1st International Conference on Advances in Science, Engineering and Robotics Technology, Dhaka, Bangladesh, 3–5 May 2019.
  22. Hsu, Y.-P.; Jiang, W.-C. A Fast Learning Agent Based on the Dyna Architecture. J. Inf. Sci. Eng. 2014, 30, 1807–1823.
  23. Sichkar, V.N. Reinforcement Learning Algorithms in Global Path Planning for Mobile Robot. In Proceedings of the 2019 International Conference on Industrial Engineering Applications and Manufacturing, Sochi, Russia, 25–29 March 2019.
  24. Jang, B.; Kim, M.; Harerimana, G.; Kim, J.W. Q-Learning Algorithms: A Comprehensive Classification and Applications. IEEE Access 2019, 7, 133653–133667.
  25. Kirtas, M.; Tsampazis, K.; Passalis, N. Deepbots: A Webots-Based Deep Reinforcement Learning Framework for Robotics. In Proceedings of the 16th IFIP WG 12.5 International Conference AIAI 2020, Marmaras, Greece, 5–7 June 2020; pp. 64–75.
  26. Altuntas, N.; Imal, E.; Emanet, N.; Ozturk, C.N. Reinforcement learning-based mobile robot navigation. Turk. J. Electr. Eng. Comput. Sci. 2016, 24, 1747–1767.
  27. Glavic, M.; Fonteneau, R.; Ernst, D. Reinforcement Learning for Electric Power System Decision and Control: Past Considerations and Perspectives. In Proceedings of the 20th IFAC World Congress, Toulouse, France, 14 July 2017; pp. 6918–6927.
  28. Wang, Y.-C.; Usher, J.M. Application of reinforcement learning for agent-based production scheduling. Eng. Appl. Artif. Intell. 2005, 18, 73–82.
  29. Mehta, N.; Natarajan, S.; Tadepalli, P.; Fern, A. Transfer in variable-reward hierarchical reinforcement learning. Mach. Learn. 2008, 73, 289.
  30. Hu, Z.; Wan, K.; Gao, X.; Zhai, Y. A Dynamic Adjusting Reward Function Method for Deep Reinforcement Learning with Adjustable Parameters. Math. Probl. Eng. 2019, 2019, 7619483.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
