The system flow of UE-SLAM is shown in Figure 1. The system consists of two main modules: the tracking module and the mapping module. When an RGB image is input into the system, depth estimation and pose recovery are first performed in the tracking thread to initialize the camera pose. Then, in the mapping module, segmentation is carried out, and the segmented semantic vectors are added to the tri-plane feature decomposition for synchronized rendering. The rendering process includes semantic, color, and depth fusion losses. The mapping thread likewise uses the depth obtained from depth estimation for rendering, and the final multi-view geometric reconstruction is obtained. We present our method in three parts: Section 3.1 introduces the semantic-oriented tri-plane feature decomposition, Section 3.2 presents the depth fusion strategy, and Section 3.3 introduces our semantic-oriented hybrid rendering loss for single objects.
3.1. Tri-Plane Feature Representation
To efficiently represent and reconstruct complex 3D scenes while integrating semantic information, we propose a semantic encoding method based on tri-plane features. This approach leverages the strengths of DINOv2 semantic features and the tri-plane feature representation by separately encoding the scene's geometric, appearance, and semantic information into axis-aligned tri-plane features, yielding a compact and efficient semantic representation of the scene. The specific steps are as follows:
For each point $p$, its geometric feature $f_g$, appearance feature $f_a$, and semantic feature $f_s$ are computed through the following process:
1. Projection: The point $p$ is projected onto three orthogonal planes, one for each of the geometric, appearance, and semantic components, to obtain its coordinates on each of these planes:
$$p_g = \Pi_g(p), \qquad p_a = \Pi_a(p), \qquad p_s = \Pi_s(p),$$
where $\Pi_g$, $\Pi_a$, and $\Pi_s$ denote the projection operators for the geometric, appearance, and semantic planes, respectively, mapping the 3D point onto its corresponding 2D representations.
2. Bilinear interpolation: After obtaining the projected coordinates on each plane, we perform bilinear interpolation to extract the feature value at each projection. On each plane, the feature $f(p)$ is interpolated from the four nearest neighboring feature points:
$$f(p) = (1-u)(1-v)\,f_{00} + u(1-v)\,f_{10} + (1-u)v\,f_{01} + uv\,f_{11},$$
where $f_{00}$, $f_{10}$, $f_{01}$, and $f_{11}$ are the feature values at the four nearest neighbors, and $u$ and $v$ are the interpolation factors given by the relative position of the point $p$ within the grid of the plane.
3. Feature fusion: To obtain a robust representation, we combine coarse and fine features separately to create a coarse output and a fine output, which enhances the multi-scale feature representation. The final fused feature vector for point $p$ is then constructed by concatenating the geometric, appearance, and semantic features:
$$f(p) = f_g(p) \,\|\, f_a(p) \,\|\, f_s(p),$$
where $\|$ denotes concatenation, and $f_g(p)$, $f_a(p)$, and $f_s(p)$ are the features obtained from the geometric, appearance, and semantic planes, respectively (a code sketch of this query procedure is given below).
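To make the three steps above concrete, the following is a minimal PyTorch sketch of the projection, bilinear interpolation, and concatenation. The `TriPlane` container, channel sizes, plane resolution, and the choice of summing the three per-plane samples before concatenating across components are illustrative assumptions, not the exact parameterization used in UE-SLAM.

```python
import torch
import torch.nn.functional as F

class TriPlane(torch.nn.Module):
    """One tri-plane (three axis-aligned 2D feature grids). Sizes are illustrative."""
    def __init__(self, channels=32, resolution=128):
        super().__init__()
        # One learnable grid per axis-aligned plane: xy, xz, yz.
        self.planes = torch.nn.ParameterList([
            torch.nn.Parameter(0.01 * torch.randn(1, channels, resolution, resolution))
            for _ in range(3)
        ])

    def forward(self, pts):  # pts: (N, 3), assumed normalized to [-1, 1]
        # Project each 3D point onto the xy, xz, and yz planes.
        coords = [pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]]
        feats = []
        for plane, uv in zip(self.planes, coords):
            grid = uv.view(1, -1, 1, 2)                       # (1, N, 1, 2) sample grid
            # grid_sample performs the bilinear interpolation over the 4 nearest texels.
            f = F.grid_sample(plane, grid, mode='bilinear', align_corners=True)
            feats.append(f.view(plane.shape[1], -1).t())      # (N, C)
        return sum(feats)                                      # aggregate the three plane samples

# Separate tri-planes for geometry, appearance, and semantics; concatenate per point.
geo_plane, app_plane, sem_plane = TriPlane(), TriPlane(), TriPlane(channels=64)
pts = torch.rand(1024, 3) * 2 - 1                              # random points in [-1, 1]^3
f_g, f_a, f_s = geo_plane(pts), app_plane(pts), sem_plane(pts)
fused = torch.cat([f_g, f_a, f_s], dim=-1)                     # f = f_g || f_a || f_s
```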
Inspired by the feature fusion method in SNI-SLAM, we utilize a cross-attention mechanism to fuse the semantic features with the geometric and appearance features, which allows the model to focus on the most relevant information shared between these feature sets. The attention scores are computed from the geometric features $f_g$, the appearance features $f_a$, and the semantic features $f_s$ extracted by DINOv2, with the scores normalized by $\|f_a\|$, the $\ell_2$-norm of the appearance feature vector.
Next, the semantic feature $f_s$ is fused with the appearance feature $f_a$ using a second attention step. Finally, the fused semantic feature $f_s'$ and the fused appearance feature $f_a'$ are concatenated with the geometric feature $f_g$ to form the final complete feature vector:
$$f = f_g \,\|\, f_a' \,\|\, f_s'.$$
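Since the exact attention formulation is not reproduced here, the sketch below shows one plausible single-head cross-attention fusion consistent with the description: semantic and appearance features attend to each other, with the dot products normalized by the appearance-feature norm, and the results are concatenated with the geometric features. The channel sizes, the attention ordering, and the normalization choice are assumptions.

```python
import torch

def cross_attention(query, key, value):
    """Single-head cross-attention; dot products scaled by the key-vector L2 norm."""
    scale = key.norm(dim=-1, keepdim=True).clamp_min(1e-6)     # ||f_a||-style normalizer
    scores = (query @ key.t()) / scale.t()                     # (Nq, Nk) attention scores
    weights = torch.softmax(scores, dim=-1)
    return weights @ value                                     # attended features

# Toy per-point features (channel sizes are illustrative).
f_g = torch.randn(1024, 32)   # geometric
f_a = torch.randn(1024, 32)   # appearance
f_s = torch.randn(1024, 32)   # semantic (e.g., distilled from DINOv2)

# Semantic features attend to appearance features, then vice versa (assumed ordering).
f_s_fused = cross_attention(f_s, f_a, f_a)
f_a_fused = cross_attention(f_a, f_s, f_s)

# Final feature: fused semantic and appearance features concatenated with geometry.
f_final = torch.cat([f_g, f_a_fused, f_s_fused], dim=-1)
```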
During the rendering stage, we employ volume rendering to convert the fused tri-plane features into 2D depth maps and RGB images. The rendering process computes the contribution of each point sampled along a ray to the final rendered image. For a ray $r$ emitted from the camera origin $O$, we sample $N$ points $\{p_i\}_{i=1}^{N}$ along the ray and decode, from the fused feature of each sampled point $p_i$, its occupancy value $o_i$ and color value $c_i$.
We then compute the weight $w_i$ of each sample using alpha compositing, so that the weight expresses the contribution of each sample to the final rendering:
$$w_i = o_i \prod_{j=1}^{i-1} \left(1 - o_j\right).$$
The contributions of all samples are aggregated to obtain the final depth and color values:
$$\hat{D} = \sum_{i=1}^{N} w_i\, d_i, \qquad \hat{C} = \sum_{i=1}^{N} w_i\, c_i,$$
where $d_i$ and $c_i$ denote the depth and color values of the sampled point $p_i$, respectively, and $w_i$ is the weight of each sample based on the accumulated occlusion.
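A minimal sketch of this occupancy-based compositing, assuming per-sample occupancies in [0, 1] and a decoder that has already produced per-sample colors and depths; tensor shapes and names are illustrative.

```python
import torch

def render_ray(occupancy, color, depth):
    """Alpha-composite N samples along one ray into a depth and a color value.

    occupancy: (N,) values in [0, 1];  color: (N, 3);  depth: (N,) sample depths.
    """
    # Accumulated transmittance: probability the ray reaches sample i unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - occupancy[:-1] + 1e-10]), dim=0)
    weights = occupancy * trans                  # w_i = o_i * prod_{j<i} (1 - o_j)
    rendered_depth = (weights * depth).sum()     # D_hat = sum_i w_i d_i
    rendered_color = (weights[:, None] * color).sum(dim=0)  # C_hat = sum_i w_i c_i
    return rendered_depth, rendered_color, weights

# Example with 64 samples along a single ray.
o = torch.rand(64)
c = torch.rand(64, 3)
d = torch.linspace(0.1, 5.0, 64)
D_hat, C_hat, w = render_ray(o, c, d)
```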
By utilizing this semantic encoding and rendering method based on tri-plane features, we achieve high-quality 3D scene reconstruction and semantic representation while maintaining computational efficiency. The tri-plane feature representation allows for the effective integration of geometric, appearance, and semantic information, which is crucial for accurate scene reconstruction and real-time applications.
3.2. Fusion of Proxy Depth and RGB-D Integration
To improve mapping precision, we fuse proxy depth from Depth-Anything-v2 with tri-plane features and address edge inaccuracies using semantic segmentation from DINOv2. This combined approach enhances depth estimation, especially at object edges.
First, we use the DINOv2 semantic segmentation model to process the input image and identify object boundaries and edge regions. This segmentation enables us to separate object interiors from edge pixels, which is crucial for depth-estimation fusion, and based on this classification we apply different depth-estimation strategies to the two kinds of regions. For object interiors, we directly use the depth $D_{est}$ estimated by Depth-Anything-v2. For object edges, however, we fuse $D_{est}$ with the depth rendered from the tri-plane feature representation, denoted $\hat{D}$. The fusion is performed as follows:
$$D_{fused} = \alpha\, D_{est} + (1 - \alpha)\, \hat{D},$$
where $\alpha$ is a weighting parameter that balances the contributions of the two depth estimates, $D_{fused}$ is the fused depth, $D_{est}$ is the depth estimated by Depth-Anything-v2, and $\hat{D}$ is the depth rendered from the tri-plane feature representation. The parameter $\alpha$ is tuned experimentally to achieve the best fusion performance.
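The following sketch illustrates the edge-aware fusion: a boundary mask derived from the semantic label map selects the pixels where the rendered depth is blended with the monocular estimate. The local label-disagreement test used to extract boundaries, the dilation width, and the value of α are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def edge_mask_from_labels(labels, dilation=2):
    """Mark pixels near semantic label boundaries (labels: (H, W) integer map)."""
    lab = labels.float()[None, None]                  # (1, 1, H, W)
    # A pixel lies on a boundary if the label changes within its 3x3 neighborhood.
    local_max = F.max_pool2d(lab, 3, stride=1, padding=1)
    local_min = -F.max_pool2d(-lab, 3, stride=1, padding=1)
    edges = (local_max != local_min).float()
    if dilation > 0:                                  # widen the boundary band slightly
        edges = F.max_pool2d(edges, 2 * dilation + 1, stride=1, padding=dilation)
    return edges[0, 0] > 0                            # (H, W) boolean mask

def fuse_depth(d_est, d_render, labels, alpha=0.6):
    """Interior pixels keep the monocular estimate; edge pixels are blended."""
    fused = d_est.clone()
    edges = edge_mask_from_labels(labels)
    fused[edges] = alpha * d_est[edges] + (1.0 - alpha) * d_render[edges]
    return fused

# Toy example: 120x160 depth maps and a two-region label map.
d_est = torch.rand(120, 160) * 4 + 0.5
d_render = d_est + 0.05 * torch.randn_like(d_est)
labels = torch.zeros(120, 160, dtype=torch.long)
labels[:, 80:] = 1
d_fused = fuse_depth(d_est, d_render, labels)
```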
Since depth estimation is inherently noisy, we further introduce a noise-modeling step. We assume that the fused depth $D_{fused}$ follows a Laplace distribution with heteroscedastic noise, so the observed depth $D$ is sampled from the following probability density function:
$$p\!\left(D \mid D_{fused}, \sigma\right) = \frac{1}{2\sigma}\exp\!\left(-\frac{\left|D - D_{fused}\right|}{\sigma}\right),$$
where $\sigma$ is the scale of the Laplace noise, proportional to the standard deviation of the depth measurement. Here, $D$ denotes the observed depth and $D_{fused}$ the fused depth. The parameter $\sigma$ depends on the uncertainty of the depth estimate and adjusts the confidence placed in the fused depth. For a collection of depth measurements along a ray, we assume the measurements are independent, which yields the joint probability density function
$$p\!\left(D_1, \dots, D_N \mid D_{fused}\right) = \prod_{i=1}^{N} \frac{1}{2\sigma_i}\exp\!\left(-\frac{\left|D_i - D_{fused}\right|}{\sigma_i}\right).$$
To find the optimal depth estimate, we perform maximum-likelihood estimation (MLE), which is equivalent to minimizing the negative log-likelihood and leads to the following optimization problem:
$$\min_{\alpha,\,\sigma}\; \sum_{i=1}^{N}\left(\frac{\left|D_i - D_{fused}\right|}{\sigma_i} + \log\left(2\sigma_i\right)\right).$$
Solving this problem yields the optimal fusion parameter $\alpha$ (governing how the different depth estimates are combined) and the noise-model parameter $\sigma$, which together improve the accuracy of the final depth map.
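A minimal sketch of the heteroscedastic Laplace negative log-likelihood that this MLE minimizes. Parameterizing the per-pixel scale through its logarithm and optimizing it jointly with α by gradient descent are implementation assumptions made for numerical convenience.

```python
import torch

def laplace_nll(observed, fused, log_sigma):
    """Negative log-likelihood of a heteroscedastic Laplace depth noise model.

    observed, fused: (N,) depth values;  log_sigma: (N,) log of the per-pixel scale.
    Per-pixel NLL = |D - D_fused| / sigma + log(2 * sigma).
    """
    sigma = torch.exp(log_sigma)
    return (torch.abs(observed - fused) / sigma + torch.log(2.0 * sigma)).mean()

# Jointly optimize the fusion weight alpha and the noise scales on toy data.
d_est = torch.rand(1000) * 4 + 0.5
d_render = d_est + 0.05 * torch.randn(1000)
d_obs = d_est + 0.02 * torch.randn(1000)           # stand-in for observed depth

alpha = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.zeros(1000, requires_grad=True)
opt = torch.optim.Adam([alpha, log_sigma], lr=1e-2)
for _ in range(200):
    d_fused = alpha * d_est + (1 - alpha) * d_render
    loss = laplace_nll(d_obs, d_fused, log_sigma)
    opt.zero_grad()
    loss.backward()
    opt.step()
```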
By incorporating semantic segmentation to refine depth estimation and modeling noise appropriately, our fusion approach significantly improves mapping accuracy and robustness. This method effectively handles depth discontinuities at object edges while maintaining consistency in object interiors, leading to more reliable 3D reconstruction results.
Although our system is primarily based on monocular RGB input, we can still emulate multi-sensor fusion by integrating proxy depth. More specifically, we assume that the fused proxy depth $D_{fused}$ is sampled from multiple virtual sensors, where each sensor's depth observation follows an independent and identically distributed (i.i.d.) assumption. The virtual sensors mimic the behavior of different types of depth sensors, and each contributes to the depth estimate as follows:
$$D_{fused} = \sum_{m=1}^{M} w_m\, D_m,$$
where $w_m$ is the weight of the $m$-th virtual sensor, $D_m$ is its depth observation, and $D_{fused}$ denotes the final fused depth.
To model this, we define a generalized fusion loss function that captures the uncertainty in the depth estimates and ensures that the fusion process accounts for discrepancies between sensors:
$$\mathcal{L}_{fusion} = \sum_{i}\left(\frac{\left|D_{fused,i} - \hat{D}_i\right|}{\sigma_i} + \log \sigma_i\right),$$
where $D_{fused}$ is the fused proxy depth, $\hat{D}$ is the depth obtained via tri-plane feature-based rendering, and $\sigma_i$ is the uncertainty parameter of pixel $i$. This loss penalizes high uncertainty, implicitly learning the weights between the different depth estimates and ensuring that more reliable estimates receive higher priority during fusion.
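The sketch below combines the virtual-sensor weighted fusion with the uncertainty-weighted fusion loss. Treating the sensor weights as softmax-normalized learnable parameters and optimizing them by gradient descent is an assumption made for illustration.

```python
import torch

def fuse_virtual_sensors(depths, weight_logits):
    """Weighted fusion of M virtual depth sensors: D_fused = sum_m w_m * D_m."""
    w = torch.softmax(weight_logits, dim=0)           # (M,) weights summing to 1
    return (w[:, None] * depths).sum(dim=0)           # depths: (M, N) -> (N,)

def fusion_loss(d_fused, d_render, log_sigma):
    """Uncertainty-weighted fusion loss: |D_fused - D_hat| / sigma + log sigma."""
    sigma = torch.exp(log_sigma)
    return (torch.abs(d_fused - d_render) / sigma + log_sigma).mean()

# Three virtual sensors observing 1000 pixels (toy data).
depths = torch.rand(3, 1000) * 4 + 0.5
d_render = depths.mean(dim=0) + 0.05 * torch.randn(1000)

weight_logits = torch.zeros(3, requires_grad=True)
log_sigma = torch.zeros(1000, requires_grad=True)
opt = torch.optim.Adam([weight_logits, log_sigma], lr=1e-2)
for _ in range(200):
    d_fused = fuse_virtual_sensors(depths, weight_logits)
    loss = fusion_loss(d_fused, d_render, log_sigma)
    opt.zero_grad()
    loss.backward()
    opt.step()
```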
For RGB-D fusion, we further enhance the accuracy of the reconstructed map by incorporating both depth and RGB information. The total loss function is defined as
$$\mathcal{L}_{total} = \mathcal{L}_{geo} + \mathcal{L}_{rgb}.$$
The geometric loss term $\mathcal{L}_{geo}$ measures the difference between the fused depth $D_{fused}$ and the depth $\hat{D}$ obtained through tri-plane feature-based rendering:
$$\mathcal{L}_{geo} = \sum_{i}\frac{\left|D_{fused,i} - \hat{D}_i\right|}{\sigma_{D,i}},$$
where $\sigma_D$ is the uncertainty parameter associated with the depth sensor.
The RGB loss term $\mathcal{L}_{rgb}$ enforces consistency between the reconstructed RGB image $\hat{C}$ and the observed RGB image $C$:
$$\mathcal{L}_{rgb} = \sum_{i}\frac{\left|C_i - \hat{C}_i\right|}{\sigma_{C,i}},$$
where $C_i$ is the observed pixel value of the RGB image, $\hat{C}_i$ is the RGB value rendered from the tri-plane features, and $\sigma_C$ is the uncertainty parameter for the RGB sensor.
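A sketch of the combined geometric and RGB loss under the uncertainty weighting described above; the equal weighting of the two terms and the per-sensor uncertainty parameterization are assumptions.

```python
import torch

def rgbd_loss(d_fused, d_render, c_obs, c_render, log_sigma_d, log_sigma_c):
    """L_total = L_geo + L_rgb, each term down-weighted by its sensor uncertainty."""
    l_geo = (torch.abs(d_fused - d_render) / torch.exp(log_sigma_d)).mean()
    l_rgb = (torch.abs(c_obs - c_render) / torch.exp(log_sigma_c)).mean()
    return l_geo + l_rgb

# Toy per-pixel quantities for a batch of 1000 sampled pixels.
d_fused = torch.rand(1000) * 4 + 0.5
d_render = d_fused + 0.05 * torch.randn(1000)
c_obs = torch.rand(1000, 3)
c_render = (c_obs + 0.02 * torch.randn(1000, 3)).clamp(0, 1)

log_sigma_d = torch.zeros(1000)        # depth-sensor uncertainty (here: fixed)
log_sigma_c = torch.zeros(1000, 3)     # RGB-sensor uncertainty (here: fixed)
loss = rgbd_loss(d_fused, d_render, c_obs, c_render, log_sigma_d, log_sigma_c)
```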
The combined loss function ensures that both depth and RGB information contribute to the final reconstruction, leading to more accurate and robust 3D models.
3.3. Proxy Depth Mapping and Tracking
Neural Radiance Fields (NeRFs) achieve 3D scene rendering by modeling the scene as a continuous function that maps spatial coordinates to color and density. More specifically, NeRF employs a neural network $F_\Theta$ to map each spatial point $x$, together with the ray direction $d$, to its color $c$ and density $\sigma$:
$$F_\Theta : (x, d) \mapsto (c, \sigma).$$
For each ray $r$, we sample multiple points $\{x_i\}_{i=1}^{N}$ along its path and compute their color and density to render the final pixel color $\hat{C}(r)$:
$$\hat{C}(r) = \sum_{i=1}^{N} T_i\left(1 - \exp\left(-\sigma_i \delta_i\right)\right) c_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right),$$
where $T_i$ is the accumulated transmittance from the ray origin to sample point $i$, and $\delta_i$ is the distance between adjacent sample points.
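For completeness, a sketch of this density-based quadrature (the standard NeRF compositing); it differs from the occupancy compositing of Section 3.1 only in how the per-sample weights are formed. Shapes and names are illustrative.

```python
import torch

def nerf_render(sigma, color, z_vals):
    """Composite N density samples along one ray: C_hat = sum_i T_i (1 - exp(-sigma_i * delta_i)) c_i."""
    delta = torch.cat([z_vals[1:] - z_vals[:-1], torch.tensor([1e10])])   # sample spacing
    alpha = 1.0 - torch.exp(-sigma * delta)                               # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1] + 1e-10]), dim=0)
    weights = trans * alpha
    return (weights[:, None] * color).sum(dim=0), (weights * z_vals).sum()

# Example: 64 samples along one ray.
sigma = torch.rand(64) * 2
color = torch.rand(64, 3)
z_vals = torch.linspace(0.1, 5.0, 64)
C_hat, D_hat = nerf_render(sigma, color, z_vals)
```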
Our mapping process is similar to NeRF but incorporates a revised loss function that integrates the fused depth-estimation results. More specifically, we introduce a proxy depth map $D_{fused}$ obtained by combining neural-network-based depth estimation with multi-sensor fusion. The mapping loss is defined as
$$\mathcal{L}_{map} = \sum_{i}\left|D_{fused,i} - \hat{D}_i\right|,$$
where $D_{fused}$ is the fused proxy depth and $\hat{D}$ is the depth obtained through tri-plane feature-based rendering. By incorporating the DINOv2 semantic segmentation results and fusing the depth estimates from Depth-Anything-v2 and radiance-field rendering, our method handles depth information at object edges more accurately, significantly improving the accuracy and robustness of mapping.
During the tracking stage, we utilize the fused proxy depth to optimize the camera pose. More specifically, we define a tracking loss function that accounts for the uncertainty in the depth estimate:
$$\mathcal{L}_{track} = \frac{1}{N_p}\sum_{i=1}^{N_p}\frac{\left|D_{fused,i} - \hat{D}_i\right|}{\sigma_{D,i}},$$
where $D_{fused}$ is the fused proxy depth, $\hat{D}$ is the depth obtained through tri-plane feature-based rendering, $\sigma_D$ is the standard deviation of the depth estimate and serves as the per-pixel uncertainty parameter, and $N_p$ is the number of pixels sampled during tracking. By minimizing this loss function, we optimize the extrinsic camera parameters, i.e., the rotation and translation $\{R, t\}$.
To further improve tracking accuracy, we incorporate an additional regularization term that penalizes rapid motion and drastic changes in depth. This helps prevent large jumps in the camera pose between successive frames, which are often caused by ambiguous depth estimates. The tracking loss function is thus augmented as
$$\mathcal{L}_{track}' = \mathcal{L}_{track} + \lambda\,\frac{1}{N_p}\sum_{i=1}^{N_p}\left|\Delta_t \hat{D}_i\right|,$$
where $\Delta_t \hat{D}$ is the temporal difference in depth between successive frames, and $\lambda$ is a regularization parameter that controls the trade-off between pose accuracy and depth smoothness.
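A sketch of the augmented tracking objective: an uncertainty-weighted depth residual between the fused proxy depth and the rendered depth, plus a temporal smoothness penalty. The default λ, the per-pixel uncertainty values, and the toy inputs are assumptions; in the real system the rendered depth is re-rendered from the current camera pose at every optimization step so that gradients flow to the pose parameters.

```python
import torch

def tracking_loss(d_fused, d_render, d_render_prev, sigma, lam=0.1):
    """L_track' = (1/N_p) sum_i |D_fused - D_hat| / sigma_D  +  lam * (1/N_p) sum_i |Delta_t D_hat|."""
    data_term = (torch.abs(d_fused - d_render) / sigma).mean()
    smooth_term = torch.abs(d_render - d_render_prev).mean()   # temporal depth difference
    return data_term + lam * smooth_term

# Toy evaluation on 500 sampled pixels.
d_fused = torch.rand(500) * 4 + 0.5
d_render = d_fused + 0.05 * torch.randn(500)
d_render_prev = d_render + 0.02 * torch.randn(500)
sigma = torch.full((500,), 0.1)
loss = tracking_loss(d_fused, d_render, d_render_prev, sigma)
```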
By minimizing this augmented tracking loss, the optimization process becomes more robust to noisy depth measurements, ensuring stable and accurate pose estimates even under challenging conditions, such as occlusions and varying lighting.
In summary, our method improves the performance of depth mapping and tracking by integrating semantic segmentation, multi-sensor depth fusion, and geometry-aware rendering techniques, resulting in more accurate depth estimation and robust camera tracking in dynamic environments.