Semantic scene completion aims to deduce the full geometric and semantic structure of a scene from partial and sparse input data. In this paper, a semantic scene completion network is designed that maps a given sparse input point cloud to a regular voxel space of dimensions $L \times W \times H$, where $L$, $W$, and $H$ represent the length, width, and height of the voxel space, respectively. The network then generates a label for each voxel, indicating whether it is empty or assigned to one of the classes in the semantic category space $C$. The output is the completed scene with its semantic information. Specifically, the proposed network is a learnable generative model based on a discrete diffusion probabilistic model for scene completion, as depicted in
Figure 1. The network operates in a discretized voxel space and learns the semantic scene data distribution of 3D scenes through a weighted
K-nearest neighbor uniform transition kernel based on feature distance (
Section 3.2 and
Section 3.3). To account for the geometric information lost during voxelization, the network adopts a point feature compensation module (
Section 3.1). This module learns detailed geometric features from the LiDAR data and integrates them into the corresponding voxel space, thereby enhancing the granularity of the completion. Meanwhile, a combined loss function consisting of a diffusion loss and a semantic loss is introduced for network training, which not only considers the overall layout but also emphasizes local details (
Section 3.4).
3.1. Point Feature Compensation
Existing research on semantic scene completion for point clouds, such as LMSCNET [
11] and SCPNET [
35], mostly involves transforming the data into voxel representations so that large-scale scene data can be handled and convolutions applied efficiently. However, due to voxel discretization, fine geometric details of the point cloud are often lost, and multiple points, or even small objects, may be merged into a single grid cell or fragmented across several cells, making them indistinguishable. In this case, smaller objects (e.g., pedestrians and cyclists) are disadvantaged compared to larger objects (e.g., cars). To address these issues, a point feature compensation module is designed before diffusion training, as depicted in the lower part of
Figure 1.
Point Cloud Voxelization: Based on the size of a single scene in the given dataset, the point cloud is converted into a corresponding regular space by dividing the 3D space into evenly spaced voxels. Specifically, the coordinate value of each point is divided by the voxel’s spatial size and then rounded to ensure that each point is mapped to the nearest voxel grid. The point’s coordinates are thus converted into the index within the voxel grid, mapping the continuous space to a discrete voxel space. If a point’s mapping result exceeds the grid boundaries, it is clamped to the grid’s limits. During voxelization, the value of each voxel does not simply indicate whether it is occupied but rather records the number of points within the voxel, serving as the initial voxel input.
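To make the mapping concrete, the following minimal NumPy sketch illustrates this voxelization step; the function name, the voxel size of 0.2 m, and the grid dimensions are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def voxelize(points, voxel_size, grid_dims):
    """Map points (N, 3) to voxel indices and count points per voxel.

    points:     (N, 3) XYZ coordinates, assumed shifted so the scene's
                minimum corner lies at the origin.
    voxel_size: edge length of a cubic voxel, e.g. 0.2 m (assumption).
    grid_dims:  (L, W, H) number of voxels along each axis.
    """
    # Divide coordinates by the voxel size and round to the nearest grid index.
    indices = np.round(points / voxel_size).astype(np.int64)
    # Clamp indices that fall outside the grid to its limits.
    indices = np.clip(indices, 0, np.asarray(grid_dims) - 1)
    # The initial voxel value is the number of points inside each voxel.
    counts = np.zeros(grid_dims, dtype=np.int64)
    np.add.at(counts, tuple(indices.T), 1)
    return indices, counts

# Example: a random scene of 10,000 points in a 51.2 m x 51.2 m x 6.4 m volume.
pts = np.random.rand(10000, 3) * np.array([51.2, 51.2, 6.4])
idx, vox = voxelize(pts, voxel_size=0.2, grid_dims=(256, 256, 32))
```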
Point Grouping: Points are grouped based on the voxel they belong to. Due to factors such as distance, occlusion, uneven sampling, and the highly variable point density across the entire space, a voxel after grouping may contain a varying number of points. As shown in the feature compensation section of
Figure 1, the number of points in the four voxel grids differs. Points within the same voxel space are considered as a group, and local features are aggregated and extracted to serve as the point feature compensation for that voxel.
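As a simple illustration of the grouping step (the helper name is hypothetical and continues the voxelization sketch above), points that share a voxel index can be collected as follows:

```python
from collections import defaultdict
import numpy as np

def group_points_by_voxel(points, indices):
    """Group points by their voxel index; groups naturally vary in size."""
    groups = defaultdict(list)
    for point, index in zip(points, indices):
        groups[tuple(index)].append(point)
    # Each value is the (variable-length) array of points falling in one voxel.
    return {k: np.stack(v) for k, v in groups.items()}
```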
Point-Voxel Feature Fusion: The feature compensation module in
Figure 1 demonstrates a hierarchical fusion process of point-voxel features, with the key innovation being the point-voxel feature fusion (PVFF) layer. Specifically, a voxel $V = \{p_1, p_2, \ldots, p_t\}$ represents a voxel space containing $t$ points, where the point count $t$ serves as the initial voxel representation of $V$ and each point $p_i$ gives the coordinates of the $i$-th point. To obtain the features of each voxel in the initial voxel input, a sparse convolution approach is employed. Sparse convolution reduces computational load and memory consumption by performing convolution operations only on non-empty voxels while maintaining a high level of feature extraction capability. This method leverages the sparsity of the voxel grid, restricting convolution to the occupied regions to extract feature information for each voxel.
Then, the points within the voxel are fed into the PVFF layer, where they are aggregated into a local feature that is fused with the voxel feature. The output, a point feature enriched with voxel information, serves as input to the next PVFF layer. After two rounds of point feature processing through the PVFF layers, the features are passed through a multilayer perceptron (MLP) and max pooling, which selects the most prominent responses as the final voxel feature.
Point-Voxel Feature Fusion Layer: The PVFF layer is crucial for the fusion of point features and voxel features, as shown in
Figure 2. The original points are processed by several MLP blocks, each consisting of a linear layer, a batch normalization (BN) layer, and a ReLU activation layer. After the per-point feature representations are obtained, max pooling is applied to extract the point aggregation feature of the voxel space $V$. To fuse the point aggregation feature with the voxel feature, both are first passed through a linear layer to match their dimensionality. The voxel feature is then passed through a sigmoid activation and multiplied by the aggregated point feature to obtain the point-voxel fused feature. Finally, this fused feature is concatenated with the preceding per-point features to yield the final point-voxel feature.
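A minimal PyTorch sketch of one PVFF layer is given below, assuming illustrative channel sizes and module names; it mirrors the steps described above (per-point MLP, max pooling, linear projections, sigmoid gating, and concatenation) but is not the authors' implementation.

```python
import torch
import torch.nn as nn

class PVFFLayer(nn.Module):
    """Point-voxel feature fusion: aggregate per-point features by max pooling,
    gate them with the voxel feature, and concatenate back onto each point."""

    def __init__(self, in_dim, out_dim, voxel_dim):
        super().__init__()
        # Shared per-point MLP: linear -> batch norm -> ReLU.
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.BatchNorm1d(out_dim),
            nn.ReLU(inplace=True),
        )
        # Project the aggregated point feature and the voxel feature
        # to a common dimensionality before fusion.
        self.point_proj = nn.Linear(out_dim, out_dim)
        self.voxel_proj = nn.Linear(voxel_dim, out_dim)

    def forward(self, point_feats, voxel_feat):
        # point_feats: (t, in_dim) features of the t points in one voxel.
        # voxel_feat:  (voxel_dim,) feature of that voxel (e.g., from sparse conv).
        per_point = self.point_mlp(point_feats)              # (t, out_dim)
        aggregated = per_point.max(dim=0).values             # (out_dim,)
        # Sigmoid-gated fusion of the voxel feature and the aggregated feature.
        gate = torch.sigmoid(self.voxel_proj(voxel_feat))    # (out_dim,)
        fused = gate * self.point_proj(aggregated)           # (out_dim,)
        # Attach the fused feature to every per-point feature.
        fused = fused.unsqueeze(0).expand(per_point.size(0), -1)
        return torch.cat([per_point, fused], dim=-1)         # (t, 2 * out_dim)
```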
3.2. Discrete Diffusion for Semantic Scene Completion
The discrete diffusion process for semantic scene completion is illustrated in
Figure 3. It consists of a forward diffusion process and a reverse generation process. During the forward diffusion process, the original data distribution is gradually disrupted until the data become uniform noise. In contrast, the reverse generation process learns to recover the original data distribution from the noise. Semantic scene completion is treated as a conditional generation task: in the reverse generation process, the method infers a complete scene with corresponding semantic labels, given the sparse input $s$ as the condition.
In the discrete diffusion process, the semantic variables at each time step belong to the semantic class space $C$. The forward process can be represented by a transition matrix $Q_t$, where each entry $[Q_t]_{ij}$ denotes the probability of transitioning from state $i$ to state $j$. The state vector $x$ is one-hot encoded, which ensures that the states are mutually exclusive. The forward transition probability is formulated as follows:
$$q(x_t \mid x_{t-1}) = \mathrm{Cat}\left(x_t;\, p = x_{t-1} Q_t\right),$$
where $\mathrm{Cat}(x; p)$ [31] is a categorical distribution over the one-hot row vector $x$ with probabilities given by the row vector $p$, and $x_{t-1} Q_t$ is to be understood as a row vector-matrix product. This means that at time $t$, the transition from state $i$ to state $j$ is determined by the corresponding entry $[Q_t]_{ij}$ of the matrix $Q_t$.
In the forward process, the original scene $x_0$ is gradually corrupted into the noisy scene $x_t$; each forward step is represented by $q(x_t \mid x_{t-1})$ and is defined as in the equation above. Based on the Markov chain property [36], the $t$-step noisy scene $x_t$ can be derived directly from the original scene $x_0$ and the cumulative transition matrix [31] $\bar{Q}_t = Q_1 Q_2 \cdots Q_t$, which converges to a known stationary distribution:
$$q(x_t \mid x_0) = \mathrm{Cat}\left(x_t;\, p = x_0 \bar{Q}_t\right).$$
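For concreteness, a short PyTorch sketch of sampling $x_t$ directly from $x_0$ with a precomputed cumulative transition matrix is given below; the shapes and names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def q_sample(x0_onehot, Q_bar_t):
    """Sample x_t ~ q(x_t | x_0) = Cat(x_t; p = x_0 * Q_bar_t).

    x0_onehot: (N, C) one-hot rows for N voxels over C categories.
    Q_bar_t:   (C, C) cumulative transition matrix Q_1 Q_2 ... Q_t.
    """
    probs = x0_onehot @ Q_bar_t                    # row-vector / matrix product
    idx = torch.multinomial(probs, num_samples=1)  # categorical sampling per voxel
    return F.one_hot(idx.squeeze(-1), probs.size(-1)).float()
```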
In the reverse process, a set of learnable parameters $\theta$ is used to denoise the noisy scene and reconstruct the original semantic scene. Specifically, the learnable model $p_\theta$ utilizes the sparse input $s$ to guide the denoising process. A reparameterization technique [18] is adopted to let the network infer the denoised mapping $\tilde{p}_\theta(\tilde{x}_0 \mid x_t, s)$ and then obtain the reverse probability distribution $p_\theta(x_{t-1} \mid x_t, s)$:
$$p_\theta(x_{t-1} \mid x_t, s) = \sum_{\tilde{x}_0} q(x_{t-1} \mid x_t, \tilde{x}_0)\, \tilde{p}_\theta(\tilde{x}_0 \mid x_t, s);$$
here, $q(x_{t-1} \mid x_t, x_0)$ can be expressed as follows using Bayes’ theorem:
$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}.$$
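The sketch below illustrates one common way such a reverse step is implemented in discrete-diffusion codebases: the predicted distribution over $\tilde{x}_0$ is folded into the $x_0$-dependent factor of the Bayes posterior and the result is renormalized. It is an illustrative approximation, not the authors' exact implementation.

```python
import torch

def reverse_step_probs(xt_onehot, x0_probs, Q_t, Q_bar_tm1):
    """Approximate p_theta(x_{t-1} | x_t, s) from the predicted x_0 distribution.

    xt_onehot: (N, C) one-hot current states x_t.
    x0_probs:  (N, C) network prediction of the clean state x_0.
    Q_t:       (C, C) one-step transition matrix at time t.
    Q_bar_tm1: (C, C) cumulative transition matrix up to time t-1.
    """
    # x_t-dependent factor of the Bayes posterior: x_t Q_t^T.
    fact1 = xt_onehot @ Q_t.T                       # (N, C)
    # x_0-dependent factor, with the predicted x_0 distribution folded in.
    fact2 = x0_probs @ Q_bar_tm1                    # (N, C)
    probs = fact1 * fact2
    return probs / probs.sum(dim=-1, keepdim=True)  # normalize per voxel
```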
3.3. Weighted K-Nearest Neighbor Uniform Transition Kernel
The discrete diffusion model can control the data corruption and denoising processes by choosing the transition matrix $Q_t$ [6], thereby altering the Markov transition kernel. This stands in stark contrast with continuous diffusion, which primarily relies on additive Gaussian noise. Hoogeboom et al. [32] later extended binary random variables to categorical variables and employed a uniform diffusion transition matrix $Q_t$, in which the probability of transitioning from each state to any other state is the same; the forward diffusion process can therefore be presented as follows:
$$q(x_t \mid x_{t-1}) = \mathrm{Cat}\!\left(x_t;\, p = (1 - \beta_t)\, x_{t-1} + \frac{\beta_t}{|C|}\mathbb{1}\right),$$
where $\beta_t$ is the noise schedule at step $t$ and $|C|$ is the number of semantic categories.
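As an illustration, a uniform transition matrix of this form can be constructed as follows; beta_t denotes the noise-schedule value at step t, and the function name is hypothetical.

```python
import torch

def uniform_transition_matrix(num_classes, beta_t):
    """Uniform kernel: keep the current state with probability 1 - beta_t,
    otherwise jump to any of the num_classes states uniformly."""
    eye = torch.eye(num_classes)
    uniform = torch.full((num_classes, num_classes), 1.0 / num_classes)
    return (1.0 - beta_t) * eye + beta_t * uniform
```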
In the uniform diffusion process, global uniform interpolation is typically used to update the state. This method relies on a simple average-weighting strategy, estimating the updated value by linearly weighting the current state globally. Such an update assumes that all voxels contribute equally to the state transition, resulting in a smooth state transition. However, this uniform diffusion assumption has limitations, especially when dealing with scene voxel data that exhibit spatial correlations and local patterns, and it may fail to capture the complex local structure inherent in the data. In real-world scenarios, the relationships between voxels are often not uniform: a single object may be split across multiple adjacent voxels and is mainly influenced by similar voxels within its local region. Therefore, a uniform update strategy may not fully reflect the true relationships between voxels. To better capture the local structure of the data, a diffusion transition method based on weighted K-nearest neighbors is proposed. In this method, the update of a voxel does not rely solely on a global linear combination but rather on the feature similarity between the voxel and its neighboring voxels. Specifically, the feature distances between each voxel and the other voxels are computed first, and its K nearest neighbors are selected. The update is then weighted according to the feature distances to these neighboring voxels, so that voxels with smaller feature distances have a greater influence on the current voxel’s state.
Weighted K-Nearest Neighbor: In weighted K-nearest neighbor diffusion, the first step is to calculate the feature distance between a voxel $v_i$ and the other voxels to assess their similarity. Next, the K nearest voxels are selected, and a weighting scheme is applied based on the feature distances between them. There are various possible designs for the weighting function, all of which encode the inverse relationship between feature distance and influence. The weighting function designed in this paper takes the form of a Gaussian kernel:
$$w_{ij} = \exp\!\left(-\frac{d(v_i, v_j)^2}{2\sigma^2}\right),$$
in this equation, $d(v_i, v_j)$ represents the feature distance between voxel $v_i$ and its neighbor $v_j$, and $\sigma$ is a hyperparameter that controls the smoothness of the weighting function.
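A small PyTorch sketch of computing these weights is given below; the per-voxel normalization of the weights is an assumption added for numerical convenience and may differ from the paper's exact formulation.

```python
import torch

def knn_gaussian_weights(features, k, sigma):
    """Feature-distance-based K-nearest-neighbor weights (Gaussian kernel).

    features: (N, D) per-voxel feature vectors.
    Returns neighbor indices (N, k) and weights (N, k) for every voxel.
    """
    dists = torch.cdist(features, features)                # (N, N) pairwise distances
    # Take k + 1 nearest and drop the first column (each voxel's self-match).
    knn_dists, knn_idx = dists.topk(k + 1, largest=False)
    knn_dists, knn_idx = knn_dists[:, 1:], knn_idx[:, 1:]
    # Closer neighbors (smaller feature distance) receive larger weights.
    weights = torch.exp(-knn_dists.pow(2) / (2 * sigma ** 2))
    weights = weights / weights.sum(dim=-1, keepdim=True)  # assumed normalization
    return knn_idx, weights
```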
Transition Matrix: Using the weighted K-nearest neighbor method, the state $x^{(i)}$ of voxel $v_i$ at time $t$ can be updated as follows:
$$\tilde{x}_t^{(i)} = \sum_{j \in \mathcal{N}_K(i)} w_{ij}\, x_{t-1}^{(j)},$$
where $\mathcal{N}_K(i)$ represents the K nearest neighbors most similar to $v_i$, $w_{ij}$ is the corresponding weight coefficient, and $C$ denotes the semantic category space over which the one-hot states are defined. Through this method, the state of the current voxel is strongly influenced by similar voxels within its neighborhood and adjusted according to the feature distances, thereby better reflecting the local patterns.
Building upon the weighted K-nearest neighbor transition kernel, the state transition process can be further represented in the discrete diffusion probabilistic model. Specifically, the forward diffusion distribution $q(x_t^{(i)} \mid x_{t-1})$ for the transition from $x_{t-1}$ at time $t-1$ to $x_t^{(i)}$ at time $t$ can be expressed as follows:
$$q\big(x_t^{(i)} \mid x_{t-1}\big) = \mathrm{Cat}\!\left(x_t^{(i)};\; p = (1 - \beta_t) \sum_{j \in \mathcal{N}_K(i)} w_{ij}\, x_{t-1}^{(j)} + \frac{\beta_t}{|C|}\mathbb{1}\right),$$
where $\mathrm{Cat}(\cdot)$ is a multivariate categorical distribution and $\beta_t$ is a hyperparameter. Through this transition model, each voxel not only depends on its state at the previous time step $t-1$, but also incorporates local structural information via the weighted K-nearest neighbor strategy, thereby more accurately capturing the spatial relationships and local patterns between voxels.
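Under the reconstruction above, one forward step of the weighted K-nearest neighbor kernel could be sketched as follows; mixing the neighbor-weighted distribution with uniform noise via beta_t is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def weighted_knn_forward_step(x_prev, knn_idx, knn_weights, beta_t):
    """One forward step with a weighted K-NN uniform kernel (sketch).

    x_prev:      (N, C) one-hot voxel states at time t-1.
    knn_idx:     (N, k) indices of each voxel's K nearest neighbors.
    knn_weights: (N, k) feature-distance weights (rows sum to 1).
    beta_t:      probability mass moved to the uniform noise term (assumption).
    """
    num_classes = x_prev.size(-1)
    # Neighbor-weighted class distribution for every voxel.
    neighbor_states = x_prev[knn_idx]                                   # (N, k, C)
    local_probs = (knn_weights.unsqueeze(-1) * neighbor_states).sum(dim=1)
    # Mix the local structure with uniform noise, as in the uniform kernel.
    probs = (1.0 - beta_t) * local_probs + beta_t / num_classes
    idx = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return F.one_hot(idx, num_classes).float()
```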
3.4. Training Objective
When training the proposed semantic scene completion network, a combined loss function is designed. It accounts for the similarity of the global semantic scene distribution during the diffusion process, as well as a cross-entropy loss for detail optimization. Specifically, for the discrete diffusion generation process, the loss consists of the Kullback–Leibler (KL) divergence [20] between the forward process $q(x_{t-1} \mid x_t, x_0)$ and the reverse process $p_\theta(x_{t-1} \mid x_t, s)$, as well as the KL divergence between the original scene $x_0$ and the reconstructed scene $\tilde{p}_\theta(\tilde{x}_0 \mid x_t, s)$, which is formalized as follows:
$$\mathcal{L}_{\mathrm{diff}} = D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t, s)\big) + \lambda\, D_{\mathrm{KL}}\big(q(x_0)\,\|\,\tilde{p}_\theta(\tilde{x}_0 \mid x_t, s)\big),$$
where $\lambda$ is an auxiliary loss weight. Then, the cross-entropy loss [37] $\mathcal{L}_{\mathrm{ce}}$ is employed to constrain the divergence between the generated semantic scene and the true semantic scene, optimizing the completion details. Therefore, the overall training objective is represented as follows:
$$\mathcal{L} = \mathcal{L}_{\mathrm{diff}} + \mathcal{L}_{\mathrm{ce}}.$$
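A hedged sketch of this combined objective is shown below; the auxiliary weight value and the use of per-voxel categorical KL terms are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def combined_loss(q_posterior, p_posterior, x0_onehot, x0_pred_logits, lam=0.1):
    """Diffusion loss (two KL terms) plus a semantic cross-entropy term (sketch).

    q_posterior:    (N, C) forward posterior q(x_{t-1} | x_t, x_0).
    p_posterior:    (N, C) learned reverse distribution p_theta(x_{t-1} | x_t, s).
    x0_onehot:      (N, C) ground-truth semantic labels (one-hot).
    x0_pred_logits: (N, C) network prediction of the clean scene x_0.
    lam:            auxiliary loss weight (illustrative value).
    """
    eps = 1e-8
    # KL between the forward posterior and the learned reverse distribution.
    kl_reverse = (q_posterior * (torch.log(q_posterior + eps)
                                 - torch.log(p_posterior + eps))).sum(-1).mean()
    # KL between the original scene and the reconstructed scene distribution.
    x0_pred = F.softmax(x0_pred_logits, dim=-1)
    kl_recon = (x0_onehot * (torch.log(x0_onehot + eps)
                             - torch.log(x0_pred + eps))).sum(-1).mean()
    diffusion_loss = kl_reverse + lam * kl_recon
    # Cross-entropy on the predicted semantics to sharpen local details.
    ce_loss = F.cross_entropy(x0_pred_logits, x0_onehot.argmax(dim=-1))
    return diffusion_loss + ce_loss
```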