A Simulator and First Reinforcement Learning Results for Underwater Mapping
Abstract
1. Introduction
1.1. Related Work
1.2. Open Issues and Contribution
1.3. Structure of the Paper
2. Simulator
- Environment: The map is represented as a voxel grid, created from a heat-map image in which the temperature encodes the height of the seafloor. When creating the environment, the length, width and height of the grid are passed as arguments, and some areas can be marked as litter. The resulting 3D map stores, for every voxel, whether it is free, occupied by the seafloor, or litter (Figure 1). A reduced-complexity 2.5D representation is also created, consisting of two matrices: one stores the seafloor height for every 2D coordinate (i, j), and the other, denoted by m, stores the litter labels. The ground-truth litter map is defined by m(i, j) = 1 if the seafloor at (i, j) contains litter, and m(i, j) = 0 otherwise. After the 3D and 2.5D representations are created, the agent can be placed at any position in the environment.
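To make the 2.5D data structures concrete, the following minimal Python sketch builds the height matrix and the ground-truth litter map m from a heat-map image; the function name, the scaling of the heat-map values and the list of litter cells are illustrative assumptions, not the simulator's actual code.

```python
import numpy as np

def build_2_5d(heatmap, litter_cells, grid_height):
    """Build the 2.5D representation: a seafloor-height matrix and a binary
    ground-truth litter map m, with m[i, j] = 1 where litter lies on the
    seafloor at (i, j) and m[i, j] = 0 otherwise.

    heatmap      : 2D array of heat-map intensities in [0, 1]
    litter_cells : iterable of (i, j) grid coordinates marked as litter
    grid_height  : number of voxels along the vertical axis
    """
    # Scale the heat-map intensities to integer voxel heights.
    seafloor_height = np.round(heatmap * (grid_height - 1)).astype(int)

    # Ground-truth litter map m: 1 at litter cells, 0 elsewhere.
    m = np.zeros_like(seafloor_height, dtype=np.uint8)
    for i, j in litter_cells:
        m[i, j] = 1
    return seafloor_height, m

# Example: a 4x4 map with two litter cells.
rng = np.random.default_rng(0)
seafloor_height, m = build_2_5d(rng.random((4, 4)), [(0, 1), (2, 3)], grid_height=10)
```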
- Map representation: The real map must be distinguished from the beliefs about the map. This difference is caused by the limited range of, and the errors in, the sensor (see the sensor model below). In particular, the sensor can give different measurement outputs when measuring the same object from the same or from a different pose. The recovery of a spatial world model from sensor data is therefore best modeled as an estimation-theory problem.
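To illustrate the separation between the true map and the belief, the sketch below maintains a per-cell probability that is updated from (possibly erroneous) sensor readings; the log-odds form and the sensor accuracy value are common textbook choices shown as assumptions, not the exact estimator used in the paper.

```python
import numpy as np

def update_belief(belief, observations, p_hit=0.8):
    """Bayesian log-odds update of a per-cell belief map.

    belief       : 2D array of probabilities, initialized to 0.5 (unknown)
    observations : list of ((i, j), detected) pairs from one sensor reading
    p_hit        : assumed probability that a single reading is correct
    """
    log_odds = np.log(belief / (1.0 - belief))
    l_hit = np.log(p_hit / (1.0 - p_hit))
    for (i, j), detected in observations:
        # Positive evidence pushes the belief up, negative evidence down;
        # repeated observations of the same cell gradually remove uncertainty.
        log_odds[i, j] += l_hit if detected else -l_hit
    return 1.0 / (1.0 + np.exp(-log_odds))

belief = np.full((4, 4), 0.5)
belief = update_belief(belief, [((1, 2), True), ((0, 0), False)])
```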
- Robot state: The pose of the agent consists of a vector that stores its position and a rotation matrix that stores its attitude relative to the world coordinate frame.
- Action and transitions: The translation actions are performed in the agent’s coordinate frame. Before the agent performs an action, a check is made as to whether the action is legal, i.e., whether it would cause the agent to leave the environment or bump against the bottom. If the action is illegal, a collision signal is sent (a minimal sketch of this check is given after the list below). The actions are defined as follows:
- Translate: forward, backward, left, right, up and down.
- Rotate: clockwise or counterclockwise around each axis.
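As referenced above, a minimal sketch of the legality check for a translation given in the agent's body frame is shown below; the function boundaries and variable names are assumptions made for the example.

```python
import numpy as np

def try_translate(pos, R, action_body, grid_shape, seafloor_height):
    """Attempt a translation expressed in the agent's coordinate frame.

    pos             : current position as an integer (x, y, z) array
    R               : 3x3 rotation matrix, body frame -> world frame
    action_body     : unit step in body coordinates, e.g. [1, 0, 0] for forward
    grid_shape      : (X, Y, Z) size of the voxel grid
    seafloor_height : 2D array of seafloor heights in voxels

    Returns (new_pos, collided); the move is rejected and a collision signal
    is returned if the agent would leave the grid or bump against the bottom.
    """
    step_world = np.round(R @ np.asarray(action_body)).astype(int)
    candidate = pos + step_world

    if np.any(candidate < 0) or np.any(candidate >= np.asarray(grid_shape)):
        return pos, True                     # would leave the environment
    if candidate[2] <= seafloor_height[candidate[0], candidate[1]]:
        return pos, True                     # would bump against the bottom
    return candidate, False

pos = np.array([2, 2, 8])
R = np.eye(3)                                # agent aligned with the world frame
floor = np.zeros((5, 5), dtype=int)
pos, collided = try_translate(pos, R, [1, 0, 0], (5, 5, 10), floor)
```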
- Sensor model: The sensor model is based on a multi-beam sonar, in which an overall opening angle is covered by the combined angles of the K beams. Each beam has its own opening angle, which is represented by L rays, see Figure 3.
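As an illustration of this beam geometry, the sketch below spreads K beams evenly over the overall opening angle and represents each beam's own opening angle by L rays; the planar (single-plane) fan layout and the angle values are simplifying assumptions.

```python
import numpy as np

def sonar_ray_directions(K, L, total_angle_deg, beam_angle_deg):
    """Unit direction vectors for a simplified planar multi-beam sonar:
    K beams cover total_angle_deg, and each beam's own opening angle
    beam_angle_deg is sampled by L rays."""
    beam_centers = np.linspace(-total_angle_deg / 2, total_angle_deg / 2, K)
    ray_offsets = np.linspace(-beam_angle_deg / 2, beam_angle_deg / 2, L)
    directions = np.empty((K, L, 3))
    for b, center in enumerate(beam_centers):
        for r, offset in enumerate(ray_offsets):
            a = np.deg2rad(center + offset)
            # Rays fan out in the x-z plane of the sensor frame.
            directions[b, r] = [np.sin(a), 0.0, np.cos(a)]
    return directions

rays = sonar_ray_directions(K=8, L=3, total_angle_deg=120, beam_angle_deg=15)
```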
- Simulated sensor and map updates: To read the sensor data, the simulator checks whether the voxels within the agent’s sensor range are occupied. A distance variable is defined along each ray’s z coordinate and is increased in a loop with a step of u voxels until the maximum sensor range is reached. At every step, the ray’s rotation matrix, taken from a 4D array of per-ray rotations, is multiplied with the distance vector to obtain the current position along the ray; if the voxel at that position is not occupied, the distance is increased. This continues until an occupied voxel is hit or the sensor range is reached.
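The ray-marching loop described above can be sketched as follows; the step size u and the per-ray rotation matrices follow the description, while the function interface is an assumption.

```python
import numpy as np

def cast_ray(origin, R_ray, occupied, max_range, u=3):
    """March along one ray until an occupied voxel or the sensor range is reached.

    origin    : sensor position in world coordinates (3-vector)
    R_ray     : 3x3 rotation matrix of this ray (one entry of the 4D rotation array)
    occupied  : 3D boolean array of voxel occupancy
    max_range : maximum sensor range in voxels
    u         : step of the distance variable along the ray's z coordinate
    Returns the coordinates of the hit voxel, or None if nothing is hit in range.
    """
    d = u
    while d <= max_range:
        # Distance along the ray's z axis, rotated into world coordinates.
        point = origin + R_ray @ np.array([0.0, 0.0, float(d)])
        i, j, k = np.round(point).astype(int)
        inside = (0 <= i < occupied.shape[0] and 0 <= j < occupied.shape[1]
                  and 0 <= k < occupied.shape[2])
        if inside and occupied[i, j, k]:
            return (i, j, k)
        d += u
    return None

occ = np.zeros((10, 10, 10), dtype=bool)
occ[5, 5, 6] = True
hit = cast_ray(np.array([5.0, 5.0, 9.0]), np.diag([1.0, 1.0, -1.0]), occ, max_range=9)
```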
- Speed of the simulator: The main motivation for building the simulator was speed. Figure 5 shows how the speed depends on the number of rays and on the map size. In this figure, the step u is 3, and the ray length/range is 11. The tests were performed while choosing random agent actions, on a computer running Ubuntu 20.04 with an Intel Core i7-8565U CPU and 16 GiB of RAM. The lowest speed is around 100 steps per second, which means 10,000 s (about 3 h) are needed to simulate one million samples. This is acceptable for DRL.
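A throughput measurement of the kind reported above can be reproduced with a loop like the one below; the Gym-style env interface with reset/step and the 12-action count (6 translations plus 6 rotations) are placeholders assumed for the example.

```python
import random
import time

def steps_per_second(env, n_steps=10_000, n_actions=12):
    """Estimate simulator speed by taking random actions, as in the tests above."""
    env.reset()
    start = time.perf_counter()
    for _ in range(n_steps):
        env.step(random.randrange(n_actions))
    return n_steps / (time.perf_counter() - start)

# At roughly 100 steps/s, one million samples take about 1e6 / 100 = 10,000 s (~3 h).
```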
3. Background on DRL
- Deep reinforcement learning uses deep neural networks (deep NNs) to estimate the Q-values. Denote by θ the parameters of the NN and by Q_θ the corresponding approximate Q-value function. In the case of the DDDQN [3] and Rainbow [4] algorithms that will be used, two networks are employed. The reason is that taking the action with the highest noisy Q-value in the maximization within the TD target, using the “normal” parameters θ, would lead to overestimation of the Q-values. To solve this, the algorithms select the maximizing action with the online parameters θ but evaluate it with another set of parameters θ⁻, which define the so-called target network [23]: y = r + γ Q_{θ⁻}(s′, argmax_a Q_θ(s′, a)).
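A minimal PyTorch sketch of this double-estimator target is given below; the toy networks in the usage example are placeholders, but the target itself follows the standard double DQN rule of selecting the action with the online parameters and evaluating it with the target parameters.

```python
import torch

@torch.no_grad()
def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """TD targets y = r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    # Action selection with the online ("normal") parameters theta ...
    best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
    # ... but evaluation with the target parameters theta^-, which avoids
    # overestimation caused by maximizing over noisy Q-value estimates.
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q

# Usage with toy linear Q-networks over 8-dimensional states and 12 actions.
online, target = torch.nn.Linear(8, 12), torch.nn.Linear(8, 12)
y = double_dqn_targets(online, target,
                       rewards=torch.zeros(4),
                       next_states=torch.randn(4, 8),
                       dones=torch.zeros(4))
```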
4. Application of DRL to Underwater Mapping
- State: The state s of the agent is a tuple composed of the pose, the belief, the entropy, and the height.
- The state is normalized between −1 and 1 to help the neural network learn. To do so, the pose is split into its position component and its rotation component, which are scaled separately.
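As an assumed illustration of this normalization (the exact ranges used in the paper are not reproduced here), the sketch below scales the position by the grid size, keeps the rotation-matrix entries (already in [−1, 1]), and rescales the belief, entropy and height maps.

```python
import numpy as np

def make_state(position, rotation, belief, entropy, height, grid_shape, grid_height):
    """Assemble the state (pose, belief, entropy, height), scaled to [-1, 1].

    position : (x, y, z) voxel coordinates
    rotation : 3x3 rotation matrix (its entries already lie in [-1, 1])
    belief   : array of probabilities in [0, 1]
    entropy  : array of per-cell binary entropies in [0, 1]
    height   : 2D array of seafloor heights in voxels
    """
    pos_n = 2.0 * np.asarray(position) / (np.asarray(grid_shape) - 1) - 1.0
    rot_n = np.asarray(rotation).ravel()
    belief_n = 2.0 * belief - 1.0
    entropy_n = 2.0 * entropy - 1.0
    height_n = 2.0 * height / (grid_height - 1) - 1.0
    return pos_n, rot_n, belief_n, entropy_n, height_n

state = make_state((2, 3, 5), np.eye(3), np.full((4, 4), 0.5),
                   np.ones((4, 4)), np.zeros((4, 4)), (4, 4, 10), 10)
```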
- Actions: The actions are those defined in the robot model of Section 2.
- Reward: The goal of the agent is to find the litter as fast as possible. In general, rewards can be defined based on the belief and the entropy of the map. To help the agent learn, rewards are provided both for finding litter and for exploring the map.
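One way to combine these two reward sources, shown purely as an assumed illustration (the weights and exact terms of the paper's reward are not reproduced here), is a weighted sum of newly confirmed litter and of the entropy removed from the belief map, with a penalty for collisions.

```python
def reward(new_litter_found, entropy_before, entropy_after, collided,
           c_litter=1.0, c_entropy=0.1, collision_penalty=1.0):
    """Illustrative reward: pay for litter found and for entropy removed from
    the map, and penalize collisions. All weights are placeholders."""
    r = c_litter * new_litter_found + c_entropy * (entropy_before - entropy_after)
    if collided:
        r -= collision_penalty
    return r

r = reward(new_litter_found=2, entropy_before=950.0, entropy_after=930.0, collided=False)
```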
- Entropy-dependent exploration: In this environment, states are similar at the beginning; e.g., for the first state of every trajectory, the entropy and belief are nearly the same (a uniform belief with high entropy). As the trajectories of the agent get longer and it discovers more of the map at various locations, the states become more unique. This poses an exploration problem. For this reason, instead of only exploring at the beginning and decreasing the exploration rate ε linearly as in traditional DRL [24], exploration is also made dependent on the entropy left in the map.
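One possible way to realize this, shown only as an assumed illustration (the paper defines its own schedule), is to keep the usual linear decay but never let ε fall below a term proportional to the fraction of entropy still left in the map.

```python
def epsilon(step, entropy_left, entropy_initial,
            eps_start=1.0, eps_end=0.05, decay_steps=1_000_000, c_entropy=0.3):
    """Exploration rate depending on training progress and on remaining entropy."""
    # Standard linear decay over training steps.
    frac = min(step / decay_steps, 1.0)
    eps_time = eps_start + frac * (eps_end - eps_start)
    # Entropy-dependent term: keep exploring while much of the map is unknown.
    eps_entropy = c_entropy * entropy_left / entropy_initial
    return min(max(eps_time, eps_entropy, eps_end), eps_start)

eps = epsilon(step=800_000, entropy_left=700.0, entropy_initial=1000.0)
```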
- Modified PER: Unlike classic DRL tasks (e.g., Atari games), here the agent receives nonzero rewards in almost every state. At the beginning of a trajectory, as the locations that are easy to discover are found, high Q-values are seen, which then decrease progressively. As a consequence, the TD error at the beginning of a trajectory is also larger, which increases the probability that PER chooses an early sample of a trajectory to train the NN, at the expense of later samples that may actually be more unique and, therefore, more relevant. To avoid this, the TD error is normalized by the Q-value before it is used for PER.
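A sketch of the modified priority computation is given below; the prioritization exponent and the small constant that keeps the ratio well defined are assumptions.

```python
import numpy as np

def per_priorities(td_errors, q_values, alpha=0.6, eps=1e-6):
    """PER priorities from TD errors normalized by the Q-values, so that the
    naturally larger errors early in a trajectory do not dominate sampling."""
    normalized = np.abs(td_errors) / (np.abs(q_values) + eps)
    return (normalized + eps) ** alpha

priorities = per_priorities(np.array([2.0, 0.5]), np.array([20.0, 1.0]))
```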
- Collision-free exploration: By default, the agent can collide with the seafloor or with the borders of the map (the operating area). Any such collision terminates the trajectory with a poor return, which is hypothesized to discourage the agent from revisiting positions where collisions are possible. This means both that the Q-values are estimated poorly for such positions and that, in the final application of the learned policy, those positions are visited too rarely to map them accurately.
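A minimal sketch of the resulting mechanism, under assumed replay-buffer and masking interfaces: illegal actions are masked out during selection so that trajectories never terminate in a collision, while the would-be collision transitions are still stored so that their (low) values can be learned, as described in Section 4.

```python
import numpy as np

def select_safe_action(q_values, legal_mask, replay, state, collision_reward=-1.0):
    """Pick the best legal action and store masked-out collision transitions.

    q_values   : 1D array of Q-values for all actions in `state`
    legal_mask : boolean array, True where the action does not collide
    replay     : buffer with an add(state, action, reward, next_state, done) method
    """
    for a in np.flatnonzero(~legal_mask):
        # The collision transition is kept in the replay memory (same state,
        # assumed penalty, terminal flag) even though it is never executed.
        replay.add(state, int(a), collision_reward, state, True)
    masked = np.where(legal_mask, q_values, -np.inf)
    return int(np.argmax(masked))
```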
5. Experiments and Discussion
5.1. DDDQN Results and Discussion
- Normalization of the TD error in PER: For this experiment only, several simplifications are made. Only newly discovered voxels are rewarded and a penalty is given when the agent crashes, while the litter component is ignored, meaning that only a coverage problem is solved. Two agents are run, one with the normalized TD errors (31) and one without, over the same number of steps (10 million). Collisions are allowed during validation.
- DDDQN versus LM: Figure 12 shows the comparison between the LM and DDDQN agents, this time for the full litter-discovery problem. As shown in Table 1, the DRL agent finds around 56% of the litter at the end of the trajectory, while the LM agent finds 98%. On the other hand, after 50 steps the DDDQN agent has found more litter on average (16.9 versus 14.5 items). Moreover, the large difference in variance is remarkable: in its worst trajectory, after 120 steps (half of the trajectory) the DDDQN agent had found at least 11 litter items, whereas the LM agent had found none.
5.2. Rainbow Experiments and Discussion
- LM baseline: With the new configuration, the LM needs 350 steps to finish its trajectory. Figure 14 shows that, on average, 52.7 litter items are found by the LM. During the first 31 steps, the agent does not find much litter. The reason is the poorer sensor, which must observe the same region again before it becomes confident that certain voxels are litter.
- The impact of entropy in the reward function: An investigation was conducted to determine whether the entropy component is needed in the reward function. Figure 15a–c show results with no entropy component (the entropy parameter in (25) set to zero), with a low influence of entropy, and with a larger influence of entropy. The litter parameter is always 1.
- Deep versus shallow network: To check whether the complexity of the network in Figure 16 is justified, a shallower network is used here, shown in Figure 17. This network structure is appropriate for Atari games, the usual DRL benchmark; it has fewer layers with larger convolutional kernels. The results in Figure 18 and Table 1 show that the shallow network finds only 11.88 litter items on average after 50 steps and 36.78 on average after 350 steps, whereas the deeper network finds 19.3 after 50 steps and 46.25 after 350 steps.
- Collision-free exploration: For the final experiment, the following question was asked: could the collision- and oscillation-avoidance measures, applied so far during validation, also help during training? Recall that, instead of simply avoiding collisions, the agent must additionally learn about them; therefore, the collision transitions are added to the PER as explained in Section 4.
Table 1. Litter found by each agent after 50 steps and at the end of the trajectory, in items and as a percentage of the total litter.

Agent | Litter (50 Steps) | % (50 Steps) | Litter (End of Trajectory) | % (End of Trajectory)
---|---|---|---|---
LM | 14.5 | 23% | 62.0 | 98%
DDDQN | 16.9 | 27% | 35.5 | 56%
Deep Rainbow; no coll. | 30.1 | 48% | 56.4 | 90%
LM | 4.5 | 7% | 52.7 | 84%
Deep Rainbow, no entropy | 17.2 | 27% | 39.3 | 62%
Deep Rainbow, low entropy weight | 15.6 | 25% | 48.2 | 77%
Deep Rainbow, larger entropy weight | 19.3 | 31% | 46.3 | 73%
Shallow Rainbow | 11.9 | 19% | 36.8 | 58%
Deep Rainbow; no coll. | 24.4 | 39% | 55.4 | 88%
6. Conclusions and Future Work
6.1. Summary and Main Findings
6.2. Limitations and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Singh, A.; Krause, A.; Guestrin, C.; Kaiser, W.J. Efficient informative sensing using multiple robots. J. Artif. Intell. Res. 2009, 34, 707–755.
- Stachniss, C.; Grisetti, G.; Burgard, W. Information Gain-based Exploration Using Rao-Blackwellized Particle Filters. Robot. Sci. Syst. 2005, 2, 65–72.
- Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York City, NY, USA, 19–24 June 2016; pp. 1995–2003.
- Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
- Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952.
- Ando, T.; Iino, H.; Mori, H.; Torishima, R.; Takahashi, K.; Yamaguchi, S.; Okanohara, D.; Ogata, T. Collision-free Path Planning on Arbitrary Optimization Criteria in the Latent Space through cGANs. arXiv 2022, arXiv:2202.13062.
- Xue, Y.; Sun, J.Q. Solving the Path Planning Problem in Mobile Robotics with the Multi-Objective Evolutionary Algorithm. Appl. Sci. 2018, 8, 1425.
- Dijkstra, E.W. A note on two problems in connexion with graphs. Numer. Math. 1959, 1, 269–271.
- Hitz, G.; Galceran, E.; Garneau, M.È.; Pomerleau, F.; Siegwart, R. Adaptive continuous-space informative path planning for online environmental monitoring. J. Field Robot. 2017, 34, 1427–1449.
- Popović, M.; Vidal-Calleja, T.; Hitz, G.; Chung, J.J.; Sa, I.; Siegwart, R.; Nieto, J. An informative path planning framework for UAV-based terrain monitoring. Auton. Robot. 2020, 44, 889–911.
- Bottarelli, L.; Bicego, M.; Blum, J.; Farinelli, A. Orienteering-based informative path planning for environmental monitoring. Eng. Appl. Artif. Intell. 2019, 77, 46–58.
- Zimmermann, K.; Petricek, T.; Salansky, V.; Svoboda, T. Learning for active 3D mapping. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1539–1547.
- Wei, Y.; Zheng, R. Informative path planning for mobile sensing with reinforcement learning. In Proceedings of the IEEE INFOCOM 2020-IEEE Conference on Computer Communications, Toronto, ON, Canada, 6–9 July 2020; pp. 864–873.
- Barratt, S. Active robotic mapping through deep reinforcement learning. arXiv 2017, arXiv:1712.10069.
- Hung, S.M.; Givigi, S.N. A Q-learning approach to flocking with UAVs in a stochastic environment. IEEE Trans. Cybern. 2016, 47, 186–197.
- Li, Q.; Zhang, Q.; Wang, X. Research on Dynamic Simulation of Underwater Vehicle Manipulator Systems. In Proceedings of the OCEANS 2008-MTS/IEEE Kobe Techno-Ocean, Kobe, Japan, 8–11 April 2008; pp. 1–7.
- Manhães, M.M.M.; Scherer, S.A.; Voss, M.; Douat, L.R.; Rauschenbach, T. UUV Simulator: A Gazebo-based package for underwater intervention and multi-robot simulation. In Proceedings of the OCEANS 2016 MTS/IEEE Monterey, Monterey, CA, USA, 19–23 September 2016.
- Mo, S.M. Development of a Simulation Platform for ROV Systems. Master’s Thesis, NTNU, Trondheim, Norway, 2015.
- Hausi, A.D. Analysis and Development of Generative Algorithms for Seabed Surfaces. Bachelor’s Thesis, Technical University of Cluj-Napoca, Cluj-Napoca, Romania, 2021.
- 2010 Salton Sea Lidar Collection. Distributed by OpenTopography. 2012. Available online: https://portal.opentopography.org/datasetMetadata?otCollectionID=OT.032012.26911.2 (accessed on 21 February 2022).
- Elfes, A. Occupancy grids: A stochastic spatial representation for active robot perception. arXiv 2013, arXiv:1304.1098.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
- Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Bellemare, M.G.; Dabney, W.; Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 449–458.
- Fortunato, M.; Azar, M.G.; Piot, B.; Menick, J.; Osband, I.; Graves, A.; Mnih, V.; Munos, R.; Hassabis, D.; Pietquin, O.; et al. Noisy networks for exploration. arXiv 2017, arXiv:1706.10295.
- Ota, K.; Jha, D.K.; Kanezaki, A. Training larger networks for deep reinforcement learning. arXiv 2021, arXiv:2102.07920.
- Santurkar, S.; Tsipras, D.; Ilyas, A.; Madry, A. How does batch normalization help optimization? In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Volume 31.
- Gogianu, F.; Berariu, T.; Rosca, M.C.; Clopath, C.; Busoniu, L.; Pascanu, R. Spectral normalisation for deep reinforcement learning: An optimisation perspective. In Proceedings of the International Conference on Machine Learning, Shenzhen, China, 18–24 July 2021; pp. 3734–3744.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).