1. Introduction
The fields of planning and control are areas of continued exploration within robotics research. A problem shared by these fields is the need for accurate, timely information. Since the early days of robotics, occupancy grids have played a crucial role in providing robotic agents with information about an environment [1]. While 2-dimensional (2D) occupancy grids have been predominantly superseded by their 3-dimensional (3D) counterparts, occupancy-map-based solutions remain a core component of robotic systems. Many planning and control solutions still utilise some form of occupancy map for determining the safe regions of an environment [2]. However, with the development of higher-fidelity planning and control software, the need for more detailed maps has grown. This increase in desired map resolution, combined with an increase in the resolution of sensors, can cause even modern occupancy map solutions to become information bottlenecks: populating occupancy grids with dense point clouds is computationally expensive and remains an open problem [3].
This issue is further compounded in fields where agent size and computing power are restricted. These low-powered robotic agents are often forced to sacrifice some combination of map resolution, effective sensor resolution, or sensor update rate in order to process the sensor information in real time. Within the growing field of aerial robotics, such compromises are routinely made in both software and hardware to enable operation despite considerable platform limitations [4,5,6,7].
A common 3D occupancy mapping solution used in modern robotics is Octomap, proposed by Hornung et al. [8]. The Octomap data structure has become a well-known and often-used mapping solution for robotics problems because of its probabilistic nature and its performance. Octomap enables the construction of 3D maps that contain measures of the uncertainty of the observed environment, and it leverages octrees to reduce the memory footprint of its maps. Unfortunately, the memory benefits of the hierarchical octree data structure are offset by the speed of operations on the data, creating bottlenecks for sensors with high update rates.
De Gregorio and Di Stefano [9] produced SkiMap in an attempt to improve the performance of 3D occupancy mapping by parallelising data operations. It succeeds in cases using dense sensors such as red–green–blue–depth (RGB-D) cameras, but struggles compared to Octomap when dealing with sensors that produce large quantities of sparse data.
Besselmann et al. [10] propose VDB-Mapping, a 3D occupancy mapping solution that utilises the OpenVDB data structure to accelerate 3D occupancy mapping. It outperforms both Octomap and SkiMap and is a promising approach for robotic mapping, the OpenVDB data structure having been used for years to manage and manipulate sparse data sets in the visual effects industry. Despite its promise, it still struggles on low-powered platforms, as it is a serial, CPU-bound approach. Macenski et al. [11] also propose the use of the OpenVDB data structure to improve the information available in a 3D occupancy map and thereby improve robotic planning and control outcomes.
Finally, Jia et al. [12] propose a hardware-accelerated approach. They developed a dedicated 3D mapping accelerator that drastically accelerates mapping operations when using Octomap as the base data structure. While promising, this solution is not easily accessible to the wider robotics community.
This work extends the concept of probabilistic mapping using the OpenVDB data structure presented by Besselmann et al. [10], off-loading the computationally expensive work of filtering and ray-casting points to the highly parallel GPU while additionally leveraging useful aspects of the OpenVDB data structure to further accelerate CPU performance during map insertion. The work is evaluated on multiple platforms against the existing VDB-Mapping code provided by Besselmann et al. and against Octomap. Additionally, the proposed NanoMap library provides methods to very quickly simulate the construction of a map using user-defined sensor views within an existing OpenVDB grid; this enables training and validation of planning and control solutions prior to implementation on hardware.
The structure of this paper is as follows. Section 2 describes the NanoMap software solution and details working with a graphics processing unit (GPU)-based occupancy map approach; this section also outlines the simulation component of the solution. Section 3 contains the evaluation and analysis carried out on a number of computing platforms, comparing the overall performance of the solutions. Section 4 concludes, outlining goals for future work and additions to be made to the library.
2. NanoMap
The type and quality of sensors are key considerations when outfitting robotic agents. While some robotic agents use light detection and ranging (LIDAR) systems, many mobile robotic agents use RGB-D or stereo cameras, as these are often less expensive and more lightweight than their LIDAR counterparts. The NanoMap library provides functionality for both kinds of sensor; however, the most significant performance advantages arise with dense sensors, referred to within the library and this work as Frustum Sensors. Like many other occupancy-mapping solutions, NanoMap requires an accurate pose and the corresponding point cloud for processing; it additionally requires the user to supply some sensor information when creating the mapping instance. This information is used to allocate the necessary memory for interacting with the GPU prior to operation.
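As a concrete illustration, the sketch below shows the kind of sensor information such a mapping instance needs up front; the type and member names are hypothetical stand-ins rather than NanoMap's actual API, whose interface is defined by the library's configuration files.

```cpp
// Hypothetical sketch of configuring a GPU-backed mapping instance.
// None of these identifiers are taken from NanoMap itself; they only
// illustrate the sensor information the library needs before the
// first point cloud arrives, so it can size its GPU buffers once.
struct FrustumSensorConfig {
    int   width  = 640;          // horizontal resolution in rays
    int   height = 480;          // vertical resolution in rays
    float vfov   = 60.0f;        // vertical field of view in degrees
    float maxRange = 5.0f;       // maximum ray length in metres
    float logOddsHit  = 0.85f;   // update applied to occupied endpoints
    float logOddsMiss = -0.4f;   // update applied to traversed voxels
};

int main() {
    FrustumSensorConfig cfg;
    // A real mapping instance would use width * height to pre-allocate
    // the pinned host buffer and matching device buffer for the cloud,
    // and maxRange to bound the ray-casting volume.
    const size_t cloudBytes =
        size_t(cfg.width) * cfg.height * 3 * sizeof(float);
    (void)cloudBytes; // allocation elided in this sketch
}
```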
The core operation of NanoMap is very similar to that of its predecessors. Take an input point cloud and an agent pose. For each point in the cloud, create a ray with a start location at the sensor pose and an end location at the point. Then, step along each ray over a temporary grid, encoding into the occupancy map all the cells encountered along the ray as empty and the cell containing the end of the ray as occupied.
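A minimal sketch of this per-ray traversal is shown below. It uses a simple fixed-step march for brevity and is not NanoMap's implementation; an exact version would use a DDA-style voxel-walking algorithm (e.g., Amanatides–Woo) to visit every traversed voxel exactly once.

```cpp
#include <cmath>

// Illustrative per-ray free/occupied marking over a uniform grid.
struct Vec3 { float x, y, z; };

void castRay(Vec3 origin, Vec3 endpoint, float voxelSize,
             void (*markFree)(int, int, int),
             void (*markOccupied)(int, int, int))
{
    Vec3 d{endpoint.x - origin.x, endpoint.y - origin.y, endpoint.z - origin.z};
    float len = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
    int steps = int(len / (0.5f * voxelSize)); // oversample to avoid gaps
    for (int i = 0; i < steps; ++i) {
        float t = float(i) / steps;            // t in [0, 1)
        Vec3 p{origin.x + t * d.x, origin.y + t * d.y, origin.z + t * d.z};
        markFree(int(std::floor(p.x / voxelSize)),
                 int(std::floor(p.y / voxelSize)),
                 int(std::floor(p.z / voxelSize)));
    }
    // The voxel containing the ray's end point is marked occupied.
    markOccupied(int(std::floor(endpoint.x / voxelSize)),
                 int(std::floor(endpoint.y / voxelSize)),
                 int(std::floor(endpoint.z / voxelSize)));
}
```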
Obviously, accessing these cell locations within a map would be time-consuming without some form of accelerated data structure. Octomap, as mentioned in the introduction, solves this issue using octrees. Within an octree, each node of the tree contains 2^3 = 8 children. While this is sufficient for reducing the memory usage of a volumetric grid, it can be problematic when dealing with high-resolution grids, as the number of layers in the tree can grow to a size that makes insertion and lookup comparatively costly. OpenVDB side-steps this issue with a state-of-the-art data structure designed to place a hard limit on data manipulation times for sparse grids. Additionally, NanoVDB is a module of the OpenVDB library that focuses on working with and using OpenVDB structures on the GPU.
Consequently, this work follows in the footsteps of Besselmann et al. [10] and Macenski et al. [11] in using the OpenVDB data structure as the core data structure for our occupancy grids. It also parallelises the data processing operations, as is the case with SkiMap [9], but leverages access to the hugely parallel GPUs now available on many robotics platforms; this removes the need for the dedicated hardware accelerator proposed by Jia et al. [12].
The following subsections outline the details and advantages of OpenVDB, introduce the GPU-focused NanoVDB module and the difficulties faced when interfacing such data structures with a GPU, and finally detail the approach used by NanoMap to leverage the GPU to accelerate occupancy mapping.
2.1. OpenVDB and NanoVDB
OpenVDB [13] is a data structure developed for the efficient representation and manipulation of sparse volumetric grids. Originally developed by DreamWorks Animation to improve performance during the simulation and rendering of particle systems, the library is useful for storing any spatially organised information thanks to its impressive speed and flexibility.
The VDB grids supplied by the OpenVDB project are freely configurable tree structures that share a number of similarities with B+ trees. Some have consequently referred to VDB grids as “Volumetric Dynamic B+ tree grids”, but the simple truth is that VDB itself is just a name. A key advantage of the VDB data structure that sets it apart from octrees is that a VDB grid has a large branching factor and user-programmable depth. The high branching factor of VDBs means that lookup and insertion operations on the grid have a small maximum number of steps, because there is a small, fixed number of parents for any given voxel.
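For instance, the default OpenVDB grid configuration is a shallow four-level tree, expressed directly in OpenVDB's public type definitions:

```cpp
#include <openvdb/openvdb.h>
#include <type_traits>

// The default OpenVDB float tree has a root node, two internal levels
// with 32^3 (2^5) and 16^3 (2^4) children respectively, and leaf nodes
// of 8^3 (2^3) voxels. Any voxel is therefore at most four hops from
// the root, regardless of grid extent.
using DefaultFloatTree = openvdb::tree::Tree4<float, 5, 4, 3>::Type;
static_assert(std::is_same<openvdb::FloatTree, DefaultFloatTree>::value,
              "openvdb::FloatTree is a Tree4 with log2 branching 5, 4, 3");
```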
Furthermore, VDB grids implement caching during look-ups and insertions, meaning that for a group of look-ups or insertions that are spatially coherent, only the initial operation has to traverse the entire tree; each subsequent lookup references the cache, often finding the target location in memory for the next operation immediately. In the case of an initial cache miss, the tree is traversed in reverse order from the cached location, meaning that time savings can still occur even when the data points are not strictly adjacent. These optimisations make spatially coherent look-ups and insertions extremely efficient when done in bulk compared to traditional data structures such as octrees. Furthermore, as demonstrated by Besselmann et al. [10], these features make VDB grids ideal for use as an occupancy map data structure.
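This cached traversal is exposed through OpenVDB's value accessors; the short example below uses the standard OpenVDB API to perform spatially coherent insertions through an accessor rather than through the grid directly.

```cpp
#include <openvdb/openvdb.h>

int main() {
    openvdb::initialize();
    openvdb::FloatGrid::Ptr grid = openvdb::FloatGrid::create(/*background=*/0.0f);

    // A ValueAccessor caches the path from the root to the most recently
    // visited leaf node. Consecutive accesses to nearby voxels re-use the
    // cached nodes instead of re-traversing the tree from the root.
    openvdb::FloatGrid::Accessor acc = grid->getAccessor();
    for (int i = 0; i < 8; ++i) {
        // These coordinates fall within the same 8^3 leaf node, so after
        // the first call the accessor resolves each write from its cache.
        acc.setValue(openvdb::Coord(i, 0, 0), 0.85f);
    }
    return 0;
}
```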
Unfortunately, while impressively quick, there are still aspects of the VDB data structure and the point cloud insertion problem that are limited by the poor parallel capabilities of a central processing unit (CPU). While threading is an option, CPUs still lack the number of threads needed to meaningfully accelerate the processing of dense point clouds with tens to hundreds of thousands of points. This is where NanoVDB [14] and Nvidia’s Compute Unified Device Architecture (CUDA) [15] become relevant to the problem.
Given the original rendering and simulation focus of the OpenVDB project, it was only a matter of time until key components of the project were developed to use the Nvidia CUDA GPU API, leveraging the impressive parallel throughput of graphics processors. This development came in the form of the NanoVDB module.
NanoVDB is a module developed primarily to enable improved rendering of OpenVDB data structures. Due to the nature of interfacing with GPUs, the sparse OpenVDB data structure can be loaded into and read from the GPU to enhance render speeds, but only as a read-only object; dynamically manipulating memory structures on the GPU to insert new entries into the grid is not possible. The exception is a dense grid representation of an OpenVDB grid, because all voxel locations in the grid are already allocated, but using one completely removes the advantages of the OpenVDB data structure and would have extreme memory requirements. As a result, the explicit NanoVDB grid structure is only used by the NanoMap library during simulation. However, the NanoVDB module still provides useful objects and tools that can be used on the GPU and are essential to the operation of the NanoMap library. The only other situation in which NanoVDB grids might be directly utilised on the GPU and CPU at the same time is on systems with unified system memory, such as the Nvidia Jetson devices; this is an approach to be explored in future work.
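For context, the snippet below sketches how an OpenVDB grid is typically converted and uploaded as a read-only NanoVDB grid. The entry points shown (openToNanoVDB, GridHandle::deviceUpload) exist in the NanoVDB module, though their exact names and header locations vary between OpenVDB releases, so treat this as illustrative.

```cpp
#include <openvdb/openvdb.h>
#include <nanovdb/util/OpenToNanoVDB.h>     // header location varies by release
#include <nanovdb/util/CudaDeviceBuffer.h>

// Convert a sparse OpenVDB grid into a flat, pointer-free NanoVDB buffer
// and upload it to the GPU. Kernels can then read (but never modify) it.
void uploadReadOnlyGrid(const openvdb::FloatGrid& srcGrid, cudaStream_t stream)
{
    auto handle = nanovdb::openToNanoVDB<nanovdb::CudaDeviceBuffer>(srcGrid);
    handle.deviceUpload(stream, /*sync=*/true);
    const nanovdb::FloatGrid* devGrid = handle.deviceGrid<float>();
    // devGrid is a device pointer to an immutable grid, suitable for
    // passing to CUDA kernels (e.g., for simulated sensor ray-casting).
    (void)devGrid;
}
```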
2.2. GPU-Accelerated Mapping
Given that sparse NanoVDB grid objects are read-only on traditional systems with discrete GPUs, how does NanoMap accelerate the mapping process?
Memory management between the CPU and the GPU was the core problem in implementing NanoMap. Firstly, the input point cloud needs to be copied to the GPU, which is a relatively trivial operation using the CUDA application programming interface (API); few optimisations can be made here, because the point cloud is a fixed size and must be recopied every loop. Thankfully, copying even a million points between pinned memory on the CPU and the GPU is fast on modern systems. For extremely large sensor clouds, the cloud could be copied over in a reduced format, using only distance readings that map to a particular ray in the sensor model; this would shrink the size of copy operations, but so far it has not been necessary.
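The sketch below shows the standard CUDA pattern for this step, pinned host memory plus an asynchronous copy on a stream; buffer names and sizes are illustrative rather than taken from NanoMap.

```cpp
#include <cuda_runtime.h>

// Standard CUDA pattern for streaming a fixed-size point cloud to the
// GPU: pinned (page-locked) host memory enables DMA transfers, and an
// asynchronous copy lets the transfer overlap with other work.
int main() {
    const size_t numPoints = 640 * 480;                  // e.g., one depth frame
    const size_t bytes = numPoints * 3 * sizeof(float);  // xyz per point

    float* hostCloud = nullptr;
    float* devCloud  = nullptr;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocHost(&hostCloud, bytes);  // pinned host buffer, allocated once
    cudaMalloc(&devCloud, bytes);       // device buffer, allocated once

    // Per frame: fill hostCloud from the sensor driver, then copy.
    cudaMemcpyAsync(devCloud, hostCloud, bytes,
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);      // wait before launching kernels

    cudaFree(devCloud);
    cudaFreeHost(hostCloud);
    cudaStreamDestroy(stream);
    return 0;
}
```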
The key issue with memory management was maintaining a reasonable memory footprint and copy time for the data structures that contain the ray-casting results. The solution provided by this work leverages the tree structure of the VDB grid to initially ray cast at the leaf node level of the grid to calculate which leaf nodes have rays pass through them. Algorithm 1 provides an overview of the algorithm used by the NanoMap library with the steps described further in this section.
In the case of an OpenVDB grid, the leaf nodes are the nodes of the tree that contain voxels. Using the default grid configuration, leaf nodes in NanoMap are 8 voxels on each side, meaning each leaf node contains 512 voxels. A leaf node that has a ray pass through it is considered active. Once the active leaf nodes have been determined, they are assigned indices ranging from 0 up to the number of currently active leaf nodes. The library then populates a buffer with the assigned index of each node and the origin of that node relative to the agent, mapping each active node to 3D space. This means that even with a large sensor model, only the nodes that contain new information are copied to CPU memory.
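With 8^3 leaf nodes, mapping a voxel coordinate to its parent leaf is a simple bit operation, as the sketch below illustrates (the helper names are ours, not NanoMap's):

```cpp
// For 8^3 leaf nodes (log2 dim = 3), the origin of a voxel's parent
// leaf is found by clearing the low three bits of each coordinate
// component, and its position inside the leaf by keeping them.
constexpr int kLeafLog2 = 3;                     // 2^3 = 8 voxels per side
constexpr int kLeafMask = (1 << kLeafLog2) - 1;  // 0b111

struct Coord { int x, y, z; };

Coord leafOrigin(Coord v) {        // origin of the containing leaf node
    return {v.x & ~kLeafMask, v.y & ~kLeafMask, v.z & ~kLeafMask};
}

int voxelOffsetInLeaf(Coord v) {   // linear offset in [0, 512)
    return ((v.x & kLeafMask) << (2 * kLeafLog2)) |
           ((v.y & kLeafMask) << kLeafLog2) |
            (v.z & kLeafMask);
}
```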
Figure 1 shows how this initial process would work for a 2D case.
Algorithm 1 Processing a point cloud using NanoMap

if GPU enabled then
    Allocate memory on GPU and CPU at initialisation
    while there exists an unprocessed point cloud do
        Copy cloud to GPU memory
        Calculate sensor offset from the currently occupied leaf node
        Calculate the offset of the aforementioned leaf node from the origin of the grid
        Ray-cast cloud in parallel, making note of each node a ray passes through
        Index each active node and store its position relative to the origin
        if size of voxel arrays not sufficient for active nodes then
            Re-allocate voxel-level arrays
            Zero voxel arrays
        else if size of voxel arrays sufficient for active nodes then
            Zero voxel arrays
        end if
        Ray-cast at voxel level, populating voxel array according to sensor properties
        Reduce the voxel array from a float (4-byte) array to a single-byte array
        Copy the voxel byte array and the leaf index array to CPU memory
        Using the leaf index array and voxel byte array, update the occupancy grid
    end while
else if CPU enabled then
    while there exists an unprocessed point cloud do
        Ray-cast the cloud in serial, populating a temporary occupancy grid during the ray-casting operation
        Iterate over the temporary grid and update the primary occupancy grid as required
    end while
end if
Once each node is assigned an index, a float array is allocated on the GPU with a size equal to the active leaf node count multiplied by the volume of a leaf node in voxels (512 by default). If an array of appropriate size already exists, it is zeroed rather than reallocated. The GPU then ray-casts a second time, iterating at the voxel level. Whenever a new node is traversed by a ray, the index of that node is fetched, and all voxels traversed while inside that node are accumulated into the buffer at the location corresponding to the node index, with an offset determined by the XYZ location of each voxel within the node. When a ray passes through a voxel but does not terminate within it, the sensor’s log-odds miss value is added to that location; the log-odds hit value is added for voxels that contain the end point of a ray. The atomic add function provided by CUDA is used so that additions to the buffer remain correct even when performed in parallel.
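A minimal CUDA sketch of this accumulation step follows; the buffer layout matches the description above, but the function itself is our illustration of the technique rather than NanoMap's source.

```cpp
#include <cuda_runtime.h>

// Illustrative device function for the voxel-level accumulation: each
// (ray, voxel) visit looks up the active index of the leaf node that
// contains the voxel, then atomically adds the log-odds update into
// that node's 512-entry slab of the shared float buffer.
__device__ void accumulateVisit(float* voxelBuffer,
                                const int* leafIndexOfNode, // node -> active index
                                int nodeId,
                                int voxelOffset,            // [0, 512) within leaf
                                bool isEndpoint,
                                float logOddsHit,
                                float logOddsMiss)
{
    const int   slab   = leafIndexOfNode[nodeId];
    const float update = isEndpoint ? logOddsHit : logOddsMiss;
    // Many rays may touch the same voxel concurrently; atomicAdd keeps
    // the accumulated log-odds update correct under that contention.
    atomicAdd(&voxelBuffer[slab * 512 + voxelOffset], update);
}
```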
Figure 2 shows how the process from Figure 1 is continued at the voxel level.
The values in the voxel array are then copied by the GPU from the float array to an equivalent single-byte integer array. This is carried out to further shrink the array and thereby improve the copy speed from the GPU to the CPU. Prior to copying the float values to the byte buffer, they are multiplied by one hundred (100) to preserve their first two decimal places and then capped at −128 and 127, meaning that the range of cell updates during a single update is limited to between −1.28 and 1.27. Because NanoMap uses a log-odds style update similar to OctoMap and other probabilistic mapping techniques, the slight loss in precision is acceptable. Values in the occupancy grid are capped between −0.87 and 1.5, allowing relatively large updates to still be made to each voxel at each time step.
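In code, this quantisation step amounts to a scale, clamp, and narrowing cast, roughly as follows (our own illustrative version):

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative version of the float -> int8 reduction described above:
// scale by 100 to keep two decimal places, clamp to the int8 range,
// and narrow. A value of 0.85 becomes 85; -1.5 saturates to -128.
int8_t quantiseLogOdds(float update)
{
    const float scaled  = update * 100.0f;
    const float clamped = std::min(127.0f, std::max(-128.0f, scaled));
    return static_cast<int8_t>(clamped);
}

// On the CPU side, the byte is mapped back before applying it to the map.
float dequantiseLogOdds(int8_t stored)
{
    return static_cast<float>(stored) / 100.0f;
}
```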
At this point, the ray casting data are on the CPU, and the primary GPU functionality of NanoMap ceases until the next point cloud is received.
The CPU then begins to process the retrieved data. There are two main approaches currently implemented: one that works directly at the voxel level, and another that works at the node level. Depending on the characteristics of the problem and the input point clouds, either approach can be applicable. The voxel approach is simple; NanoMap iterates over the input buffer and updates the occupancy grid whenever it encounters a non-zero value. The node-level approach leverages the fact that the data received from the GPU arrive as a buffer of nodes, each containing 512 voxels; sometimes it makes sense to avoid voxel-level interactions with the existing map when dealing with extremely dense information. This approach involves probing the existing map at a given leaf-node location. If no leaf node exists, the leaf node provided by the GPU is processed by itself and inserted as a new node into the map. If a node does exist, the values of the existing node are considered when processing the new data, and the whole node is replaced with the new processed node. In doing this, the number of VDB grid queries and insertions is significantly reduced. A further benefit of this process is that the leaf nodes can be handled concurrently, using Intel’s Threading Building Blocks to perform the population of the leaf nodes in parallel.
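The sketch below illustrates this node-level pattern using OpenVDB's public tree API (probeLeaf and addLeaf are real entry points); the merge logic is simplified to a plain addition, so treat it as a sketch of the idea rather than NanoMap's implementation.

```cpp
#include <openvdb/openvdb.h>

// Simplified node-level update: probe the map for an existing leaf at
// the target origin; if absent, build and insert a whole new leaf; if
// present, merge the GPU results into it in place.
void updateLeaf(openvdb::FloatTree& mapTree,
                const openvdb::Coord& leafOrigin,
                const float* gpuValues /* 512 voxel updates */)
{
    using LeafT = openvdb::FloatTree::LeafNodeType; // 8x8x8 by default

    LeafT* leaf = mapTree.probeLeaf(leafOrigin);
    if (!leaf) {
        // No data here yet: create a fresh leaf and hand it to the tree.
        leaf = new LeafT(leafOrigin, /*value=*/0.0f, /*active=*/false);
        mapTree.addLeaf(leaf);
    }
    for (openvdb::Index i = 0; i < LeafT::SIZE; ++i) {  // SIZE == 512
        if (gpuValues[i] != 0.0f) {
            leaf->setValueOn(i, leaf->getValue(i) + gpuValues[i]);
        }
    }
}
```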
In addition to the regular ray-casting operation defined above, NanoMap provides a GPU-accelerated voxelisation filter. This filter maps all points in the point cloud to their corresponding terminal voxels before ray-casting. Then, instead of ray-casting each point, ray-casting is performed only once for each active voxel. There are a number of modes available for this discrete voxelisation filter, each with its own trade-offs. The key difference between filtering modes is the type of data stored and how those data are stored.
Table 1 outlines the four available filter types: the minimum CUDA architecture each is compatible with, its memory consumption as a percentage of the memory used by Filter Mode 0, whether the filter is “precise” or “simple”, and the data type used by the filter. The precise filter modes track the average offset of all rays that terminate within a voxel and use it as the endpoint for ray-casting, while the simple filter modes simply count the number of rays that terminate within a voxel and use the centre of the voxel as the termination point, trading accuracy for memory consumption.
The number of points and the average of their locations are maintained within each voxel to preserve the correct information for the ray-casting and probabilistic calculations. This optimisation significantly increases processing speed in cases where the grid resolution or point cloud density results in multiple points occupying a single voxel. The downside to the filter is its increased memory usage, which results from needing to pre-allocate an array that encompasses the entire information space of a given sensor. This makes the voxelisation filter potentially unusable for long-range LIDAR sensors. Additionally, because of the sparse nature of the information provided by LIDARs, they do not benefit from the ray reduction provided by the voxelisation filter.
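A precise-mode accumulation step might look like the following CUDA fragment, which is our illustration of the technique (per-voxel position sums plus a count, averaged before ray-casting) rather than NanoMap's code:

```cpp
#include <cuda_runtime.h>

// Illustrative precise-mode filter step: every point adds its position
// into its terminal voxel's running sums and bumps a count. Averaging
// sum / count later yields the representative endpoint for the single
// ray cast per active voxel. Layout and names are our own.
__device__ void accumulatePoint(float4* voxelSums,  // xyz sums + count
                                int voxelIdx,       // precomputed voxel slot
                                float px, float py, float pz)
{
    atomicAdd(&voxelSums[voxelIdx].x, px);
    atomicAdd(&voxelSums[voxelIdx].y, py);
    atomicAdd(&voxelSums[voxelIdx].z, pz);
    atomicAdd(&voxelSums[voxelIdx].w, 1.0f);  // ray count for this voxel
}

// Later, one thread per active voxel computes the mean endpoint:
__device__ float3 averageEndpoint(const float4 s)
{
    return make_float3(s.x / s.w, s.y / s.w, s.z / s.w);
}
```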
When it comes to laser or LIDAR functionality, NanoMap currently only provides CPU-based processing. Due to the sparseness and often lower resolution of LIDAR sensors, they do not currently benefit from GPU acceleration during ray-casting, as any performance gained from parallel processing is lost to overhead. Investigating potential GPU-based optimisations for LIDAR sensors within NanoMap is an area for future work. The optimisations provided by NanoMap in its current form greatly accelerate the processing of a dense cloud of information in which rays produced by the point cloud often traverse the same space. This makes NanoMap particularly good at accelerating sensors with a frustum view shape; RGB-D and stereo cameras in particular benefit significantly from the use of the GPU.
2.3. Simulation and Robotic Navigation
In addition to the mapping component of NanoMap, the simulation component benefits from the acceleration of the GPU and the NanoVDB grid data structure, making it quite useful for simulating agent sensing and mapping within an existing OpenVDB grid. A method is provided for generating randomised voxelised caves using 3D cellular automata for use in simulations and deep reinforcement learning. The only difference in the operation of the mapping during simulation is that the point cloud is generated on the GPU from a sensor model and a provided grid of the simulation environment, rather than copied from an input. The simulation component of NanoMap was developed to enable the validation of planning and control methodologies that rely on building occupancy maps, and to enable the rapid training of deep reinforcement learning agents by reducing the step time taken to generate sensor and mapping information.
Additionally, packages have been written that interface the NanoMap library with the Robot Operating System (ROS1) and ROS2; the library has been tested on Melodic, Noetic, and Galactic. Both simulation and server functionality are provided. The simulation functionality takes an external pose input and sensor information and generates a point cloud against a user-provided environment, before processing that point cloud into a published occupancy grid. The server functionality provides a ROS node that listens for a pose input and a point cloud and publishes an occupancy grid; it still requires the sensor to be defined in a configuration file.
A benchmarking package is also available for ROS1 and ROS2, with functionality that compares the performance of an optimised CPU-only version of the approach presented by Besselmann et al. [10], GPU NanoMap with a variety of operation types, and Octomap with and without the discrete optimisation. This benchmarking package was used for the evaluation contained in Section 3.
Finally, an ROS2 visualisation (RVIZ2) plugin compatible with ROS2 Galactic has been written to enable streaming and rendering of OpenVDB grids, avoiding the need to first convert them into point clouds.
Figure 3 shows the map created by a simulated agent operating within a randomly generated “cave” environment.
A video (https://youtu.be/UBrlLRqY_E4) is provided that shows simulation runs using a Frustum Sensor, a Laser Sensor, and a combination of the Frustum and Laser Sensors. These simulations were carried out using the NanoMap ROS2 Galactic package and the NanoMap RVIZ2 render plugin.
4. Conclusions
This work provides the C++ software library NanoMap (https://github.com/ViWalkerDev/NanoMap) and the ROS packages required to create OpenVDB grids using frustum and laser-based sensors on GPU-enabled systems. Evaluation shows that by utilising a GPU, the ray-casting of a point cloud for use in occupancy mapping can be greatly accelerated in comparison to current CPU-only approaches, even on limited single-board GPU-enabled computers. On a Jetson Nano, the approach outperforms Octomap by a factor of 5.5 in the provided frustum-based test case at 0.1 m mapping resolution and by a factor of 7.6 at 0.025 m mapping resolution. Even in the case of CPU-only operation, when handling sparse point cloud inputs such as those provided by LIDAR sensors, NanoMap outperforms both the non-filtered and filtered OctoMap methods. With access to a more powerful GPU on the Asus G14 test platform, the difference grows to a factor of 7.5 at 0.1 m mapping resolution and a factor of 50.8 at 0.01 m mapping resolution.
The library also shows exceptional performance for the simulation and construction of occupancy maps on an Asus G14 laptop, making it useful for reinforcement learning tasks and validation of tasks involving occupancy mapping.
Future work aims to expand the capabilities of the library and provide additional functions utilising the GPU for processing and analysing occupancy maps in real time for use with planning and control tasks and other robotics applications.