With a set of multi-view images and known poses at our disposal, our objective is to reconstruct surfaces that combine the benefits of neural surface rendering and volume rendering, without relying on mask supervision. We extract the scene's surface as the zero-level set of a signed distance function (SDF) and optimize the SDF through rendering. Firstly, we present a novel 3D geometric appearance constraint, image appearance embedding: feature information is extracted directly from the images and fed into the color MLP, improving the disambiguation of geometric structures. Secondly, we perform interpolation on the sampling points of the volume rendering and apply weight regularization to eliminate the color bias, as discussed in detail in
Section 3.3, enhancing the overall rendering quality. Thirdly, we introduce explicit SDF optimization, which is instrumental in achieving geometric consistency across the reconstructed scene and contributes to the overall accuracy of the 3D model. Lastly, we present an automatic geometric filtering approach aimed at refining the reconstructed surfaces, which plays a crucial role in enhancing the precision and visual fidelity of the 3D model. An overview of our approach is shown in
Figure 1.
4.1. Appearance Embedding
To mitigate the sparse feature bias discussed in
Section 3.2 and account for potential variations in environmental conditions during data capture [
41], we extract appearance latent features from each image to subsequently optimize the color MLP. This process is illustrated in
Figure 2.
In our model, the initial MLP, denoted $f$, predicts the SDF for a spatial position $\mathbf{x}$. The network also generates a feature vector, which is combined with the viewing direction $\mathbf{v}$ and an appearance embedding $\boldsymbol{\ell}$. These components are then fed into a second MLP, denoted $c$, which produces the color of the given point. The appearance embedding therefore further enriches the color information of the neural surface rendering, preparing for more accurate reconstruction.
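To make the data flow concrete, a minimal PyTorch sketch of this two-MLP arrangement follows; the layer widths, the 256-dimensional geometric feature, and the 32-dimensional appearance embedding are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class SDFNetwork(nn.Module):
    """Predicts the SDF value and a geometric feature vector for a 3D point."""
    def __init__(self, feat_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 1 + feat_dim),  # SDF value + feature vector
        )

    def forward(self, x):                      # x: (N, 3)
        out = self.net(x)
        return out[:, :1], out[:, 1:]          # sdf: (N, 1), feature: (N, feat_dim)

class ColorNetwork(nn.Module):
    """Maps point, view direction, geometric feature, and appearance embedding to RGB."""
    def __init__(self, feat_dim=256, app_dim=32, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 3 + feat_dim + app_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, x, view_dir, feature, app_embedding):
        return self.net(torch.cat([x, view_dir, feature, app_embedding], dim=-1))

# Usage: the appearance embedding is a per-image latent vector (here random for illustration).
sdf_net, color_net = SDFNetwork(), ColorNetwork()
x = torch.randn(1024, 3)                       # sampled points along rays
view_dir = torch.randn(1024, 3)
app = torch.randn(1, 32).expand(1024, 32)      # one embedding per image, broadcast to points
sdf, feat = sdf_net(x)
rgb = color_net(x, view_dir, feat, app)        # (1024, 3)
```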
During model training, considering that latent features typically diminish after repeated convolutions, ResNet-50 is employed to counteract this effect. Unlike conventional plain networks, ResNet-50 continuously reinjects earlier latent features through its residual connections, which also eases gradient flow during backpropagation [
40,
42], thereby enhancing the global representation of features.
In addition, compared with ResNet-18 and ResNet-34, ResNet-50 improves the model's accuracy while its bottleneck design keeps the parameter count and computation manageable. We did not choose ResNet-101 or ResNet-152 because they require considerably more memory. In the field of feature extraction, DenseNet [
43] and MobileNet [
44] have also produced impressive results. DenseNet concatenates feature maps from different layers to achieve feature reuse and improve efficiency, which is also its main difference from ResNets; however, its inherent disadvantage is high memory consumption, which makes it difficult to handle more complex images. In addition, the accuracy of MobileNet v3 Large may decrease when dealing with complex scenarios, while MobileNet v3 Small has a comparatively simple design, making it difficult to apply in complex scenarios. In summary, we chose ResNet-50 to extract deep image features.
Consequently, we crop each multi-view image of the scene to 224 × 224 and feed the cropped image into ResNet-50 to extract useful features; the output is a feature vector denoted $\boldsymbol{\ell}$. This vector is then fed into the color MLP to accomplish the appearance embedding. The convolutional stages that each image passes through in the ImageNet-pretrained ResNet-50 are detailed in
Table 2, and a bottleneck block of ResNet-50 is illustrated in
Figure 3.
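A possible realization of this feature-extraction step is sketched below using torchvision's ImageNet-pretrained ResNet-50; the linear projection to a 32-dimensional appearance embedding is an illustrative assumption, since the embedding size is a design choice.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# ImageNet-pretrained ResNet-50 with the classification head removed,
# leaving the 2048-dimensional global average-pooled feature.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()
backbone.eval()

# Crop/resize every input view to 224 x 224 and normalize with ImageNet statistics.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Project the 2048-d ResNet feature to the appearance-embedding size fed into the color MLP.
to_embedding = nn.Linear(2048, 32)

def extract_appearance_embedding(image_path: str) -> torch.Tensor:
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
    with torch.no_grad():
        feat = backbone(img)                                              # (1, 2048)
    return to_embedding(feat)                                             # (1, 32)
```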
ResNet-50 introduces a “bottleneck” structure within its residual blocks to reduce the number of parameters: several small convolutions replace a single large one. The bottleneck first applies a 1 × 1 convolutional kernel, then a 3 × 3 convolutional kernel, and finally another 1 × 1 convolutional kernel. Concretely, a 256-dimensional input passes through a 1 × 1 × 64 convolutional layer, followed by a 3 × 3 × 64 convolutional layer, and finally a 1 × 1 × 256 convolutional layer, with a ReLU activation after each convolution. The total parameter count is 256 × 1 × 1 × 64 + 64 × 3 × 3 × 64 + 64 × 1 × 1 × 256 = 69,632.
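This parameter count can be verified directly. The sketch below builds the same three-convolution bottleneck (convolution biases omitted, as in ResNet where batch normalization follows each convolution; the normalization layers and the residual addition are left out for brevity) and reproduces the 69,632 figure.

```python
import torch.nn as nn

# Bottleneck of ResNet-50: 1x1 reduce -> 3x3 -> 1x1 expand, each followed by ReLU.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1, bias=False),             # 256 * 1 * 1 * 64 = 16,384
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),   # 64 * 3 * 3 * 64 = 36,864
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, kernel_size=1, bias=False),              # 64 * 1 * 1 * 256 = 16,384
    nn.ReLU(inplace=True),  # in the full ResNet this ReLU follows the residual addition
)

total = sum(p.numel() for p in bottleneck.parameters())
print(total)  # 69632
```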
We assessed the surface reconstruction and view synthesis performance of NeuS and of NeuS with embedded appearance features on the BlendedMVS dataset, as shown in
Figure 4,
Figure 5,
Table 3, and
Table 4. Surface reconstruction was evaluated with the Chamfer distance, defined in
Section 5.1.2, and view synthesis was evaluated with PSNR and SSIM (higher is better) and LPIPS (lower is better), also described in
Section 5.1.2.
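For reference, these view-synthesis metrics can be computed with standard packages; the sketch below (using scikit-image and the lpips package) reflects the usual way such scores are obtained and is not the exact evaluation script used for our experiments.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def view_synthesis_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: HxWx3 float arrays in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)

    # LPIPS expects NCHW tensors scaled to [-1, 1].
    loss_fn = lpips.LPIPS(net="alex")
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1).unsqueeze(0).float() * 2 - 1
    lp = loss_fn(to_tensor(pred), to_tensor(gt)).item()

    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```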
4.2. Volume Rendering Interpolation and Color Weight Regularization
To eliminate the color bias $\epsilon$ caused by the sampling operation mentioned in
Section 3.3, we first identify two neighboring sampling points near the surface. Beginning at the camera position $\mathbf{o}$, we move along the ray direction $\mathbf{v}$, so that the sample points are $\mathbf{p}(t_i) = \mathbf{o} + t_i \mathbf{v}$. Two neighboring samples bracket the surface when their SDF values satisfy
$$
f(\mathbf{p}(t_i)) \cdot f(\mathbf{p}(t_{i+1})) < 0.
$$
The first point of intersection between the ray and the surface, denoted $\hat{\mathbf{p}} = \mathbf{o} + \hat{t}\,\mathbf{v}$, is approximated through linear interpolation as
$$
\hat{t} = \frac{f(\mathbf{p}(t_i))\, t_{i+1} - f(\mathbf{p}(t_{i+1}))\, t_i}{f(\mathbf{p}(t_i)) - f(\mathbf{p}(t_{i+1}))}.
$$
Then, we incorporate the interpolated point set $\{\hat{\mathbf{p}}\}$ into the initial point set $\{\mathbf{p}_i\}$, resulting in a new point set. This combined set is utilized to generate the final volume rendering color:
$$
\hat{C} = \sum_{i=1}^{n} w_i\, c_i + \hat{w}\,\hat{c},
$$
where $w_i$ represents the weight of $\mathbf{p}_i$, $c_i$ represents the pixel value of $\mathbf{p}_i$, $\hat{w}$ represents the weight of $\hat{\mathbf{p}}$, $\hat{c}$ represents the pixel value of $\hat{\mathbf{p}}$, and $n$ denotes the number of points in the initial set. Following interpolation, the color bias becomes the residual $\epsilon'$ introduced by linear interpolation; importantly, $\epsilon'$ is at least two orders of magnitude smaller than $\epsilon$.
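As a concrete illustration of this step, the sketch below finds the SDF sign changes along a ray, computes the interpolated crossing $\hat{t}$, and merges it into the sample set before the colors are composited; tensor shapes and variable names are illustrative assumptions rather than our actual implementation.

```python
import torch

def insert_surface_samples(t: torch.Tensor, sdf: torch.Tensor) -> torch.Tensor:
    """
    t:   (n,) sorted sample distances along a ray p(t) = o + t * v
    sdf: (n,) SDF values f(p(t_i)) at those samples
    Returns the augmented, re-sorted sample distances including the
    linearly interpolated surface crossings where the SDF changes sign.
    """
    # A surface crossing lies between consecutive samples whose SDF values
    # have opposite signs: f(p_i) * f(p_{i+1}) < 0.
    cross = sdf[:-1] * sdf[1:] < 0

    # Linear interpolation of the zero crossing:
    # t_hat = (f_i * t_{i+1} - f_{i+1} * t_i) / (f_i - f_{i+1})
    f_i, f_j = sdf[:-1][cross], sdf[1:][cross]
    t_i, t_j = t[:-1][cross], t[1:][cross]
    t_hat = (f_i * t_j - f_j * t_i) / (f_i - f_j)

    # Merge the interpolated points into the original sample set; the final
    # color is then composited over this augmented set of samples.
    t_new, _ = torch.sort(torch.cat([t, t_hat]))
    return t_new
```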
Meanwhile, we also alleviate the weight bias by regularizing the weight distribution with a loss term $\mathcal{L}_{\text{reg}}$.
$\mathcal{L}_{\text{reg}}$ is utilized to suppress anomalous weight distributions, specifically weights located far from the surface yet carrying substantial values, which indirectly promotes the convergence of the weight distribution toward the surface. Theoretically, as the weight distribution approaches a delta distribution centered at the surface intersection $\hat{t}$, $\mathcal{L}_{\text{reg}}$ tends towards 0.
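As an illustration only, the sketch below shows one plausible form of such a regularizer, penalizing rendering weights in proportion to their distance from the interpolated surface point $\hat{t}$; this is an assumption for exposition, not necessarily the exact loss used in our method.

```python
import torch

def weight_regularization(w: torch.Tensor, t: torch.Tensor, t_hat: torch.Tensor) -> torch.Tensor:
    """
    Hypothetical regularizer (for illustration; not necessarily the loss used here):
    penalize weights located far from the estimated surface point t_hat.
    w:     (n,) rendering weights along a ray
    t:     (n,) sample distances along the ray
    t_hat: scalar interpolated surface crossing for this ray
    """
    # Weights far from the surface contribute large |t - t_hat| terms;
    # the loss is 0 only when all weight collapses onto t_hat.
    return (w * (t - t_hat).abs()).sum()
```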
4.4. Point Cloud Coarse Sampling
In most scenarios, the majority of a scene is characterized by open space. In consideration of this, our objective is to strategically identify the broad 3D regions of interest before engaging in the reconstruction of intricate details and view-dependent effects, which typically demand substantial computational resources. This approach allows for a significant reduction in the volume of points queried along each ray during the subsequent fine-stage processing.
In the handling of input datasets, conventional methods involve manual filtration to eliminate irrelevant point clouds. In contrast, DVGO [
17] automatically selects the point cloud region of interest, a notable advancement in streamlining this process. To determine the bounding box, rays are emitted from each camera and the box is fitted to the nearest and farthest points they reach in the scene, as shown in
Figure 6.
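A schematic version of this bounding-box construction is sketched below (an illustrative sketch, not DVGO's actual implementation): every camera ray is evaluated at its near and far bounds, and the axis-aligned box enclosing all of those points becomes the region of interest.

```python
import numpy as np

def bbox_from_cameras(origins: np.ndarray, directions: np.ndarray,
                      near: float, far: float):
    """
    origins:    (C, 3) camera positions
    directions: (C, R, 3) unit ray directions for each camera
    Returns the (min, max) corners of the axis-aligned bounding box spanned
    by the nearest and farthest points reached along every ray.
    """
    near_pts = origins[:, None, :] + near * directions   # (C, R, 3)
    far_pts = origins[:, None, :] + far * directions     # (C, R, 3)
    pts = np.concatenate([near_pts, far_pts], axis=1).reshape(-1, 3)
    return pts.min(axis=0), pts.max(axis=0)
```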
Due to the limitations and excessive size of the 3D point cloud regions selected by DVGO, precise localization of fine scene structures is not achieved. We therefore introduce a novel automatic point cloud filtering method. Leveraging the camera pose information, we identify the point cloud center and compute the average distance from this center to the camera positions. Using this average distance as the radius, we select a point cloud region of interest encompassing 360° around the center. The radius $r$
defining the surrounding region is determined according to the camera's capture mode, i.e., whether it captures a panoramic view or covers a distant scene, as shown in
Figure 7.
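The filtering procedure can be summarized by the sketch below, which assumes the camera centers are taken from the poses and, as one possible choice, estimates the scene center from them; the radius_scale parameter is a hypothetical knob standing in for the capture-mode-dependent choice of $r$ described above.

```python
import numpy as np

def filter_point_cloud(points: np.ndarray, cam_positions: np.ndarray,
                       radius_scale: float = 1.0) -> np.ndarray:
    """
    points:        (N, 3) reconstructed point cloud
    cam_positions: (C, 3) camera centers taken from the known poses
    Keeps only points inside a sphere around the estimated scene center,
    whose radius r is the mean center-to-camera distance (optionally scaled
    according to the capture mode, e.g. panoramic vs. distant scenes).
    """
    center = cam_positions.mean(axis=0)   # one possible estimate of the point cloud center
    r = radius_scale * np.linalg.norm(cam_positions - center, axis=1).mean()
    keep = np.linalg.norm(points - center, axis=1) <= r
    return points[keep]
```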