The input for onboard 3D reconstruction consists of a sequence of RGB-D frames captured in real time by RGB-D cameras, denoted by $\{(C_i, D_i)\}_{i=1}^{N}$, where $C_i$ and $D_i$ represent the RGB and depth images, respectively. The output is the surface reconstruction $S$ of the captured scene and the 6 Degrees of Freedom (6DoF) camera pose trajectory $\{T_{g,i} = (R_{g,i}, t_{g,i})\}_{i=1}^{N}$, where $R_{g,i} \in SO(3)$ and $t_{g,i} \in \mathbb{R}^3$ represent the 3D rotation and translation in the global coordinate system. We employ our method, OwlFusion, within a framework of randomized optimization [4], which is the de facto method for large-scale, high-quality online dense 3D reconstruction on low-computational hardware platforms. The key challenge is to estimate the 6DoF pose of each frame and to fuse the captured surface data. The randomized optimization framework allows for fast camera motion tracking in low-light conditions. To reduce the computational cost and accelerate the convergence of randomized optimization on low-computational hardware, we introduce planar constraints based on a sparse representation of the scene surface, which is our key contribution.
Figure 2 provides a block diagram overview of our method.
Our method consists of three main parts: surface measurement, surface reconstruction, and pose estimation. In the surface measurement step, we preprocess each input depth frame by computing the vertex map $V_i$ and normal map $N_i$ and generating a Partition Normal Map (PNM) $P_i$, which we use to introduce plane constraints. In the surface reconstruction step, we adaptively allocate voxel blocks of different resolutions in GPU memory for surface representation based on the correlation between the measured depth values $D_i(u)$ and the vertex normals, achieving a sparse representation of the scene. Based on the allocated voxel space and the pose $T_{g,i}$ of the depth images, we continuously weight and fuse the measured depth frames into the volume to reconstruct the scene surface $S$. The quality of the scene reconstruction heavily relies on the accuracy of the pose estimation. In the pose estimation step, we evaluate the fitness of random pose particles based on the reconstructed sparse surface $S$ and the PNM $P_i$, which serves as a constraint that alleviates the computational burden and accelerates the search for the optimal pose. Finally, we use the classical ray-casting method [1] to extract the scene surface.
3.1. Surface Measurement
Surface measurement is the first step in our method, and it takes the raw depth image $D_i$ as input. We use $u = (x, y)$ to represent a two-dimensional pixel of $D_i$. Given the camera intrinsic parameter matrix $K$, we convert each depth measurement $D_i(u)$ into the three-dimensional position $v_i(u)$ of a vertex in the camera coordinate system using
$$ v_i(u) = D_i(u)\, K^{-1} [u^\top, 1]^\top, $$
which forms the vertex map $V_i$ corresponding to the depth map. We then determine the normal map $N_i$ at each vertex in the vertex map by computing
$$ n_i(u) = \nu\bigl[\, d_h(u) \times d_v(u) \,\bigr], \qquad d_h(u) = v_i(x+1, y) - v_i(x-1, y), \qquad d_v(u) = v_i(x, y+1) - v_i(x, y-1), $$
where $d_h(u)$ and $d_v(u)$ represent the direction vectors between the three-dimensional points on either side of the point $u$ horizontally and vertically, respectively, and $\nu[\cdot]$ normalizes the cross-product result to obtain the unit normal vector $n_i(u)$ at the current point. With this formulation, the normal vector points away from the camera center, so we flip its direction to point towards the camera center. Points located at the edge of the depth image are not used to calculate normal vectors. Here, $v_i(u)$ and $n_i(u)$ represent the elements of the vertex map $V_i$ and normal map $N_i$, respectively.
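For concreteness, the following is a minimal NumPy sketch of this surface measurement step. The function names (`compute_vertex_map`, `compute_normal_map`) and the pinhole intrinsics `fx, fy, cx, cy` are illustrative assumptions, not OwlFusion's actual implementation, which runs these per-pixel operations on the GPU.

```python
import numpy as np

def compute_vertex_map(depth, fx, fy, cx, cy):
    """Back-project a depth image (H x W, meters) into a vertex map (H x W x 3)."""
    h, w = depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (xs - cx) / fx * z
    y = (ys - cy) / fy * z
    return np.stack([x, y, z], axis=-1)

def compute_normal_map(vertex_map):
    """Normals from central differences of the vertex map; border pixels stay invalid (zero)."""
    normals = np.zeros_like(vertex_map)
    d_h = vertex_map[1:-1, 2:] - vertex_map[1:-1, :-2]   # horizontal neighbours
    d_v = vertex_map[2:, 1:-1] - vertex_map[:-2, 1:-1]   # vertical neighbours
    n = np.cross(d_h, d_v)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    n = np.divide(n, norm, out=np.zeros_like(n), where=norm > 0)
    # Flip normals so they point towards the camera center (the origin of the camera frame).
    view = -vertex_map[1:-1, 1:-1]
    flip = np.sum(n * view, axis=-1, keepdims=True) < 0
    normals[1:-1, 1:-1] = np.where(flip, -n, n)
    return normals
```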
In OwlFusion, a PNM $P_i$ is generated from the normal map $N_i$ to introduce planar constraints, as shown in Figure 3. The PNM can be regarded as the result of grouping pixels of the normal map $N_i$ by the similarity of their normal directions: adjacent pixels are clustered, and the clustering result is visualized as regions of different colors, each representing a plane. We obtain the PNM $P_i$ with a region-growing method. Specifically, we choose a pixel of the normal map as the seed pixel and calculate the normal angle and the Manhattan distance between it and its neighboring pixels. If both evaluation criteria are below given thresholds, the neighboring pixel is accepted as a region-growth unit. Based on experience, we set the normal angle threshold to 0.6° and the Manhattan distance threshold to the sum of the image's width and height. After growth, if the number of pixels in a segmented region is less than 4% of the total number of image pixels, the region is rejected as too small. During region growing, the seed pixel and the pixels already assigned to a segmented region are not processed again. Through the segmented normal map, we can quickly drive the particle swarm towards the vicinity of the optimal pose, thereby accelerating the convergence of the optimization. Compared to extracting scene planes directly from the depth map, the normal-map-based region-growing method is more robust to depth measurement noise and faster for parallel computing on graphics processing units.
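The sketch below illustrates this region-growing procedure on a normal map. It is a sequential CPU version for clarity (OwlFusion performs the computation in parallel on the GPU), and the helper name `grow_pnm` and the raster-order seed scanning are assumptions.

```python
import numpy as np
from collections import deque

def grow_pnm(normals, angle_thresh_deg=0.6, min_region_frac=0.04):
    """Segment a normal map (H x W x 3, unit normals) into planar regions (the PNM).

    A neighbour joins a region if the angle between its normal and the seed normal is
    below the threshold and its Manhattan distance to the seed is below width + height.
    Returns an int label map: > 0 planar region id, -1 rejected (too small) region.
    """
    h, w = normals.shape[:2]
    dist_thresh = w + h                              # Manhattan distance threshold
    cos_thresh = np.cos(np.deg2rad(angle_thresh_deg))
    labels = np.zeros((h, w), dtype=np.int32)
    next_label = 1
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx] != 0:
                continue                             # pixel already processed
            seed_n = normals[sy, sx]
            region = [(sy, sx)]
            labels[sy, sx] = next_label
            queue = deque(region)
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] == 0:
                        if (np.dot(normals[ny, nx], seed_n) > cos_thresh
                                and abs(ny - sy) + abs(nx - sx) < dist_thresh):
                            labels[ny, nx] = next_label
                            region.append((ny, nx))
                            queue.append((ny, nx))
            if len(region) < min_region_frac * h * w:
                for y, x in region:                  # region too small: reject it
                    labels[y, x] = -1
            else:
                next_label += 1
    return labels
```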
3.2. Surface Reconstruction
The design requirements for adaptive-resolution fast surface reconstruction are efficient hierarchical voxel allocation and a noise-robust depth data fusion method. To meet them, we build upon previous work that uses a fixed number of $L$ resolution levels to store surface voxels [19] and implement efficient access to the voxels of each level using a hash table [7,18,19]. However, instead of allocating voxels uniformly, we selectively allocate voxel blocks to pixels in different regions using the PNM as a mask. For pixels in planar regions, we allocate voxel memory at the coarsest resolution level, while for non-planar regions, we skip the coarsest level and allocate voxel memory directly at the next coarsest level. In the coarsest-level hash table entry, we implicitly refer to the positions of the sub-blocks at finer resolution levels by a special marker.
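As an illustration, the following sketch shows PNM-masked allocation into per-level hash tables keyed by integer block coordinates. The plain Python dictionaries, the two-level layout, and the names `allocate_blocks` and `HAS_CHILDREN` are simplifying assumptions standing in for the GPU hash tables and the special marker described above.

```python
import numpy as np

BLOCK_SIZE = {0: 0.32, 1: 0.16}    # assumed metric block edge length per level (coarse, fine)
HAS_CHILDREN = "has_children"      # marker in a coarse entry that refers to finer sub-blocks

def block_coord(point, level):
    """Integer block coordinate of a 3D point at the given resolution level."""
    return tuple(np.floor(point / BLOCK_SIZE[level]).astype(int))

def allocate_blocks(vertex_map, pnm_labels, pose, hash_tables):
    """Allocate voxel blocks for one depth frame, using the PNM as a mask.

    vertex_map: H x W x 3 camera-frame vertices; pnm_labels: H x W plane labels (> 0 = planar).
    pose: (R, t) of the frame; hash_tables: {level: dict mapping block coord -> block data}.
    """
    R, t = pose
    h, w = pnm_labels.shape
    for y in range(h):
        for x in range(w):
            v = vertex_map[y, x]
            if v[2] <= 0:                              # invalid depth measurement
                continue
            p = R @ v + t                              # vertex in the global frame
            level = 0 if pnm_labels[y, x] > 0 else 1   # planar -> coarsest, else next level
            key = block_coord(p, level)
            hash_tables[level].setdefault(key, {"tsdf": None, "weight": 0.0})
            if level == 1:                             # mark the enclosing coarse entry
                coarse = hash_tables[0].setdefault(block_coord(p, 0), {})
                coarse[HAS_CHILDREN] = True
    return hash_tables
```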
To reconstruct the surface, we first dynamically allocate memory for the voxels within the camera's field of view and then use voxel splitting and merging to achieve a hierarchical representation of the surface. When a new depth frame $D_i$ is input into our system, we construct a ray $r_u$ from the camera center through each depth measurement $D_i(u)$:
$$ r_u(\lambda) = t_{g,i-1} + \lambda\, R_{g,i-1}\, \frac{v_i(u)}{\lVert v_i(u) \rVert_2}, \qquad \lambda \ge 0. $$
Here, $T_{g,i-1} = (R_{g,i-1}, t_{g,i-1})$ is the transformation from frame $i-1$ in the camera coordinate system to the global frame $g$, where $R_{g,i-1}$ is the rotation matrix and $t_{g,i-1}$ is the translation vector of the previous frame. Given the truncation range $\mu$ of the TSDF [6], we create a line segment along the direction of ray $r_u$ within the truncation band from $\lambda = \lVert v_i(u) \rVert_2 - \mu$ to $\lambda = \lVert v_i(u) \rVert_2 + \mu$. For the voxels that intersect this line segment, we create corresponding entries in the hash table and allocate memory for the still unallocated voxel blocks on the GPU [7,18,19]. After voxel memory allocation is complete, we compute the roughness $\rho$ of the surface on which the voxels reside, which serves as the criterion for voxel splitting and merging. We define the roughness of the surface using the correlation of its normals, calculated as
$$ \rho(B) = 1 - \frac{1}{n} \sum_{v \in \Phi(B)} \langle n_v, \bar{n} \rangle, \qquad n_v = \frac{\nabla \Phi(v)}{\lVert \nabla \Phi(v) \rVert_2}, \qquad \bar{n} = \frac{1}{n} \sum_{v \in \Phi(B)} n_v, $$
where $\Phi(B)$ represents the part of the voxel block $B$ that stores TSDF values and $n$ is the number of TSDF values. $\nabla$ is the gradient operator, $n_v$ represents the normal at voxel $v$, and $\bar{n}$ represents the average normal of the voxels in $B$. It should be noted that we use the gradient of the TSDF values to calculate voxel normals rather than directly using the normals of the normal map derived from the depth measurements. This is because the normal map from the depth measurements contains sensor noise, whereas the TSDF values in the voxels are the result of multiple weighted fused observations and thus have high credibility. When the roughness is greater than the splitting threshold $\tau_{\mathrm{split}}$, the current-level voxel block is split; when the roughness is less than the merging threshold $\tau_{\mathrm{merge}}$, the current-level voxel blocks are merged, and the resolution level of each voxel is recorded.
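The snippet below sketches this roughness test on one voxel block. The block is represented as a dense array of TSDF values, the normals are obtained from `np.gradient`, and the threshold names `tau_split` and `tau_merge` with their default values are illustrative assumptions consistent with the description above.

```python
import numpy as np

def block_roughness(tsdf_block):
    """Roughness of a voxel block from the correlation of TSDF-gradient normals.

    tsdf_block: K x K x K array of TSDF values stored in the block.
    Returns a value in [0, 2]; 0 means all voxel normals agree with the mean normal.
    """
    gx, gy, gz = np.gradient(tsdf_block)                 # gradient of the TSDF field
    normals = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
    lengths = np.linalg.norm(normals, axis=1)
    mask = lengths > 1e-8
    normals = normals[mask] / lengths[mask][:, None]
    if len(normals) == 0:
        return 0.0
    mean_n = normals.mean(axis=0)
    mean_n /= np.linalg.norm(mean_n) + 1e-12
    return float(1.0 - np.mean(normals @ mean_n))        # 1 - average normal correlation

def split_or_merge(tsdf_block, tau_split=0.1, tau_merge=0.02):
    """Decide whether a block should be split into finer sub-blocks or merged."""
    rho = block_roughness(tsdf_block)
    if rho > tau_split:
        return "split"
    if rho < tau_merge:
        return "merge"
    return "keep"
```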
After all voxel blocks within the truncation region have been allocated, the current depth frame is fused into the reconstructed visible surface voxels. To perform the TSDF fusion efficiently, we first access all entries of the hash table before fusing the depth frame and select the hash entries that point to visible voxel blocks within the camera view frustum [7], thereby avoiding empty voxel blocks in the hash table. These hash entries are then processed in parallel to update the TSDF values. The global fusion of all depth maps in the volume is formed as the weighted average of the TSDFs calculated individually from each depth frame, which can be viewed as denoising the global TSDF from multiple noisy TSDF measurements. We adopt the TSDF fusion framework of [18] but redefine the weight $w_i(u)$ of the TSDF. Considering the effect of sensor noise, we define
$$ w_i(u) = \eta_{\mathrm{img}}(u)\, \eta_{\mathrm{scan}}(u)\, \exp\!\left(-\frac{d_r^{2}}{2\sigma^{2}}\right), $$
where $d_r$ is the normalized radial distance of the current depth measurement $D_i(u)$ from the camera center and $\sigma$ is derived empirically. We define an imaging validity factor and a scan validity factor, respectively, as
$$ \eta_{\mathrm{img}}(u) = \max\!\left(0,\; 1 - \frac{\max\bigl(0,\, D_i(u) - d_{\max}\bigr)}{d_{\max}}\right), \qquad \eta_{\mathrm{scan}}(u) = \cos\theta, $$
where $d_{\max}$ is the upper bound of the valid scanning distance and $\theta$ is the angle between the ray direction of the depth pixel $u$ and the normal measurement of the corresponding surface point in the local frame. If the depth measurement is within the valid distance range and the scan angle of a visible surface point in the depth map is 0°, the validity of the point is maximal (equal to 1); it decreases as the scan distance exceeds the valid range or the scan angle deviates from 0°. As the reconstructed surface may extend beyond or revisit the camera view frustum while the system is running, a bidirectional GPU–host data streaming scheme [7,18] is used to store the reconstructed voxels that fall outside the current camera view frustum on the host, allowing the system to make full use of the limited GPU memory and performance and enabling unbounded reconstruction. When the camera returns to a previously reconstructed position, the voxels stored on the host for that region are streamed back to the GPU for fusion and reuse.
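Below is a minimal sketch of the per-voxel weighted TSDF update with such a weight. The weight formula mirrors the factors described above, but its exact functional form, the default values, and the names `measurement_weight` and `fuse_voxel` are assumptions for illustration, not OwlFusion's exact definition.

```python
import numpy as np

def measurement_weight(depth, radial_dist_norm, cos_scan_angle, d_max=4.0, sigma=0.6):
    """Weight of one depth measurement: imaging validity x scan validity x radial falloff.

    depth: metric depth of the pixel; radial_dist_norm: normalized radial distance of the
    measurement from the camera center; cos_scan_angle: cosine of the angle between the
    pixel ray and the measured surface normal. d_max and sigma are assumed empirical values.
    """
    eta_img = max(0.0, 1.0 - max(0.0, depth - d_max) / d_max)   # decays beyond the valid range
    eta_scan = max(0.0, cos_scan_angle)                          # maximal for head-on scans
    return eta_img * eta_scan * np.exp(-radial_dist_norm**2 / (2.0 * sigma**2))

def fuse_voxel(tsdf_old, w_old, tsdf_new, w_new, w_max=128.0):
    """Running weighted average of TSDF observations stored in one voxel."""
    w = w_old + w_new
    tsdf = (w_old * tsdf_old + w_new * tsdf_new) / max(w, 1e-12)
    return tsdf, min(w, w_max)   # clamp the weight so old geometry can still be updated
```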
3.3. Pose Estimation
Accurate image pose estimation is crucial for surface reconstruction using depth images. However, traditional methods [3,14,33,47,48,49,50,51] for pose estimation on MAV platforms can lose track of the camera under fast camera motion, resulting in incorrect pose estimates. To address this challenge, our work uses Particle Swarm Template (PST)-based random optimization [4]. Unlike previous work, we introduce planar constraints with the help of the PNM (Section 3.1), based on the hierarchical sparse surface representation (Section 3.2). Our method first performs Candidate Particle Set (CPS) filtering before particle fitness evaluation, and subsequent iterations of the particle optimization select particles from the CPS. Compared to the Advantage Particle Set (APS) defined in [4], the CPS is a much smaller, strictly constrained particle set, which reduces the computational cost of the random optimization and accelerates the convergence of the pose optimization iterations.
To accurately reflect the alignment between the current and the previous frame, we need to determine the overlapping area between the planar regions of the two frames. To identify the set of overlapping pixels $\Omega_P$ between the PNMs $P_i$ and $P_{i-1}$, we adopt an unproject-and-reproject approach:
$$ \Omega_P = \bigl\{\, u \in P_i \;\big|\; \pi_{i-1}\bigl(T_{g,i-1}\, \dot v_i(u)\bigr) \in P_{i-1} \,\bigr\}, $$
where $\pi_{i-1}$ is the projection matrix of frame $i-1$ under the camera pose $T_{g,i-1}$; a pixel $u$ in a planar region of $P_i$ belongs to $\Omega_P$ if its unprojected vertex, reprojected by $\pi_{i-1}$, falls inside a planar region of $P_{i-1}$. Based on $\Omega_P$, the CPS is calculated by projecting the overlapping pixels into the volume and normalizing according to whether the corresponding voxel lies in the coarsest voxel level:
$$ \mathrm{CPS} = \left\{\, x_j \in \mathrm{PST} \;\middle|\; \frac{m(x_j)}{|\Omega_P|} \ge \tau \,\right\}, $$
where $x_j$ represents any pose particle in the PST and $m(x_j)$ is the number of overlapping pixels projected onto the coarsest voxel level. Due to the uncertainty of edge pixels in the PNM plane segmentation, we relax the required percentage of $m(x_j)$ relative to the total number of overlapping pixels to 96% (i.e., $\tau = 0.96$), based on the empirical values obtained in our experiments, to ensure coverage of the optimal solution. Meanwhile, we use the same method to identify the set of overlapping pixels $\Omega_D$ between the depth frames $D_i$ and $D_{i-1}$:
$$ \Omega_D = \bigl\{\, u \in D_i \;\big|\; \pi_{i-1}\bigl(T_{g,i-1}\, \dot v_i(u)\bigr) \in D_{i-1} \,\bigr\}. $$
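The following sketch illustrates the idea of CPS filtering: given an already computed overlap set, each pose particle is kept only if enough overlapping pixels, transformed by that particle, land in coarsest-level (planar) voxel blocks. The function name `filter_cps`, the block-coordinate lookup, and the threshold variable `tau` are assumptions used to make the sketch self-contained.

```python
import numpy as np

def filter_cps(pst_particles, overlap_pixels, vertex_map, coarse_blocks, block_size=0.32, tau=0.96):
    """Keep only pose particles whose overlapping pixels mostly hit coarsest-level voxel blocks.

    pst_particles: list of (R, t) candidate poses; overlap_pixels: list of (y, x) pixels in
    the PNM overlap set; coarse_blocks: set of integer block coordinates allocated at the
    coarsest level; tau: required fraction of overlapping pixels on the coarsest level.
    """
    cps = []
    for R, t in pst_particles:
        hits = 0
        for y, x in overlap_pixels:
            p = R @ vertex_map[y, x] + t                       # pixel unprojected to the global frame
            key = tuple(np.floor(p / block_size).astype(int))  # enclosing coarse voxel block
            if key in coarse_blocks:
                hits += 1
        if overlap_pixels and hits / len(overlap_pixels) >= tau:
            cps.append((R, t))
    return cps
```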
In each iteration $k$ during the optimization of $T_{g,i}$, $\Omega_D$ is used as the valid pixel set to evaluate the particle fitness:
$$ f(x_j) = \frac{1}{|\Omega_D|} \sum_{u \in \Omega_D} \left( 1 - \frac{\bigl| \Phi\bigl(R_j\, v_i(u) + t_j\bigr) \bigr|}{\mu} \right), $$
where $R_j$ and $t_j$ represent the rotation and translation of the pose particle $x_j$, and $\Phi$ represents the TSDF constructed so far. Note that the inter-frame overlap is deliberately maintained at a non-negligible level to avoid over-evaluating poses with minimal overlap. Our PST scaling scheme ensures that the transformation sampled relative to the pose of the previous step remains within a controlled range in translation and rotation. This constraint ensures that the evaluations remain within appropriate bounds and accurately reflect the relevant transformations. In each iteration, we use the method in [4] to scale and move the PST until we find the optimal pose $T^{*}_{g,i}$, i.e.,
$$ T^{*}_{g,i} = \operatorname*{arg\,max}_{x_j \in \mathrm{CPS}} f(x_j). $$
The experiments will demonstrate that our method requires significantly fewer iterations to find the optimal solution than the method in [4], while also reducing the time and resource consumption of pose estimation.
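To make the evaluation concrete, here is a minimal sketch of the per-particle fitness computation and pose selection over the CPS, assuming a callable `query_tsdf(point)` that returns the TSDF value at a 3D point (or `None` outside allocated space). The function names and this interface are illustrative assumptions, not OwlFusion's actual implementation.

```python
import numpy as np

def particle_fitness(particle, overlap_pixels, vertex_map, query_tsdf, mu=0.1):
    """Fitness of one pose particle: high when the transformed surface points lie near the
    zero crossing of the TSDF constructed so far (|TSDF| small relative to truncation mu)."""
    R, t = particle
    score, count = 0.0, 0
    for y, x in overlap_pixels:
        p = R @ vertex_map[y, x] + t          # measured vertex transformed by the candidate pose
        phi = query_tsdf(p)                   # signed distance sampled from the global volume
        if phi is None:                       # point falls in unallocated space
            continue
        score += 1.0 - min(abs(phi) / mu, 1.0)
        count += 1
    return score / count if count else 0.0

def select_best_pose(cps_particles, overlap_pixels, vertex_map, query_tsdf, mu=0.1):
    """Return the candidate pose with the highest fitness, as in the argmax above."""
    return max(cps_particles,
               key=lambda p: particle_fitness(p, overlap_pixels, vertex_map, query_tsdf, mu))
```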