Article

Evaluating Radiance Field-Inspired Methods for 3D Indoor Reconstruction: A Comparative Analysis

Shuyuan Xu, Jun Wang, Jingfeng Xia and Wenchi Shou
1 School of Civil Engineering and Architecture, Zhejiang Sci-Tech University, Hangzhou 310018, China
2 School of Engineering, Design and Built Environment, Western Sydney University, Kingswood, NSW 2745, Australia
3 North China Municipal Engineering Design & Research Institute Co., Ltd., Hangzhou 310012, China
* Author to whom correspondence should be addressed.
Buildings 2025, 15(6), 848; https://doi.org/10.3390/buildings15060848
Submission received: 24 January 2025 / Revised: 21 February 2025 / Accepted: 1 March 2025 / Published: 7 March 2025
(This article belongs to the Special Issue Intelligence and Automation in Construction Industry)

Abstract

An efficient and robust solution for 3D indoor reconstruction is crucial for various managerial operations in the Architecture, Engineering, and Construction (AEC) sector, such as indoor asset tracking and facility management. Conventional approaches, primarily relying on SLAM and deep learning techniques, face certain limitations. With the recent emergence of radiance field (RF)-inspired methods, such as Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS), it is worthwhile to evaluate their capability and applicability for reconstructing built environments in the AEC domain. This paper aims to compare different RF-inspired methods with conventional SLAM-based methods and to assess their potential use for asset management and related downstream tasks in indoor environments. Experiments were conducted in university and laboratory settings, focusing on 3D indoor reconstruction and semantic asset segmentation. The results indicate that 3DGS and Nerfacto generally outperform other NeRF-based methods. In addition, this study provides guidance on selecting appropriate reconstruction approaches for specific use cases.

1. Introduction

Three-dimensional indoor reconstruction has broad applicability throughout the lifecycle of buildings and infrastructure, supporting managerial operations from construction through the operation and maintenance (O&M) phases. On construction sites, such models help project managers monitor progress [1], optimize schedule decisions, and enhance visualization. Within built environments, 3D indoor reconstruction techniques facilitate inspection tasks such as identifying defects [2,3], documenting conditions [4], evaluating building performance (e.g., energy assessments) [5], and supporting post-disaster assessment [6]. Their significance further extends to enabling facility management and asset tracking, serving as the foundation for advanced digital twins [7]. This is especially critical for facilities such as laboratories, warehouses, and complex office buildings, where inventories and assets must be continuously monitored.
Early 3D indoor reconstruction efforts relied heavily on expensive laser scanners and other sophisticated equipment. The associated high costs, substantial computing requirements, and need for trained operators limited widespread adoption. With rapid advances in computer vision and photogrammetry, a broader range of methods and algorithms have emerged to achieve more affordable and efficient 3D indoor reconstruction using image-based techniques. However, conventional approaches—often SLAM-based and sometimes supported by deep learning—still encounter challenges, including reduced accuracy in unobserved areas, discrete surface representations, and sparse modeling. Additionally, the requirement of depth information to enhance reconstruction quality further constrains their practicality in real-world projects.
Radiance field (RF)-inspired methods represent a promising alternative. Since their introduction, these approaches have garnered significant academic interest. By leveraging novel 3D representations that enable fast, realistic rendering of indoor scenes from novel viewpoints, RF-based methods offer continuous surface modeling, scene inpainting, and improved noise handling—advances that benefit the 3D reconstruction community. However, most research on RF-inspired methods remains within the computer science domain [8,9], focusing on algorithmic improvements validated against commonly used public datasets [10,11,12]. Few studies have explored their implementation in the construction management field, where priorities differ due to specific downstream tasks and managerial objectives. To the authors’ knowledge, neither the applicability nor the performance of these cutting-edge RF-inspired methods, as compared to conventional 3D reconstruction, has been rigorously evaluated within AEC-related use cases.
To address this gap, this paper provides a systematic comparison between RF-based techniques and conventional 3D reconstruction approaches by applying state-of-the-art RF-inspired methods to a real-life indoor built environment to validate their applicability for managerial operations. The contributions of this work are twofold. First, we offer a direct and side-by-side comparison between various 3D reconstruction methods and provide practical insights on how different reconstruction methods respond to variations in input data. Second, we introduce an application map that categorizes these methods based on their suitability for different AEC tasks, offering procedural suggestions from data collection through segmentation for researchers and practitioners and addressing critical trade-offs between reconstruction fidelity and efficiency. The remainder of the paper is organized as follows: Section 2 reviews related work on 3D reconstruction, while Section 3 describes the research methods employed. Section 4 presents the experimental setup and results, followed by sensitivity analyses and discussion in Section 5. Finally, Section 6 provides concluding remarks.

2. Related Work

This section provides an overview of prior studies on 3D reconstruction. Input data modalities range from RGB images and RGB-D images to raw LiDAR-based point clouds. Accordingly, we categorize 3D reconstruction methods into three groups: LiDAR-based methods, RGB-D-based methods, and purely photogrammetry-based methods.

2.1. LiDAR-Based Methods

LiDAR-based approaches typically start from raw point cloud data obtained via high-precision laser scanners, LiDAR systems, or related devices. These systems can be mounted on various platforms—such as aerial, terrestrial, mobile, and unmanned—to achieve detailed indoor point cloud capture. Recently, the integration of LiDAR sensors into mobile phones and tablets has made data collection more accessible. Commercial apps, such as 3D Scanner [13] or Record 3D [14], allow quick data acquisition and pre-processing for reconstruction algorithms. However, the quality of the raw point cloud may still be insufficient for downstream tasks like object detection and tracking [15]. Pre-processing steps—e.g., establishing topological relationships, removing irrelevant points, and applying denoising and down-sampling—remain essential [16]. The Random Sample Consensus (RANSAC) algorithm is often applied to remove outliers. For instance, Abdollahi and Arefi [17] addressed occluded and cluttered data to reconstruct major structural components in detail. These refined point clouds enable scan-to-BIM workflows, further enhanced by deep learning methods like ArrangementNet [18].

2.2. RGB-D-Based Methods

RGB-D cameras (e.g., Microsoft Kinect) capture color images enriched with depth information, providing valuable input for 3D reconstruction. Simultaneous Localization and Mapping (SLAM) methods originally combined multiple sensor inputs for environmental modeling, and RGB-D SLAM has been extensively explored since the early 2010s [19]. Typical RGB-D SLAM pipelines involve camera tracking and local mapping: the former estimates real-time camera poses, and the latter updates and optimizes the map at a slower rate. For camera tracking, feature-less approaches rely on the photometric consistency of images to estimate camera motion, exemplified by KinectFusion [20], RGBDTAM [21], and ID-RGBDO [22]. Feature-based approaches, in contrast, extract and match features by minimizing geometric errors such as 2D point-to-point, 3D point-to-point, and 3D point-to-plane errors; representative methods include ORB-SLAM2 [23] and Plane-Edge-SLAM [24]. Hybrid methods that combine the two approaches and leverage both geometric and photometric information for camera pose estimation include CPA-SLAM [25] and BundleFusion [26]. As for local mapping, mainstream approaches either construct point clouds directly or represent the scene implicitly with voxel volumes. Publicly available RGB-D datasets include ICL-NUIM [27], ETH3D [28], and TUM RGB-D [29].

2.3. Photogrammetry-Based Methods

Over recent decades, visual-only SLAM (vSLAM), which relies solely on RGB images, has gained popularity. The vSLAM pipeline resembles that of RGB-D SLAM systems but without additional depth data. To determine the camera pose and the 3D structure of an unknown environment, initialization of the global coordinate system, tracking, and mapping are performed in sequence, assisted by re-localization and global map optimization techniques where applicable [30]. Under this category, feature-based approaches utilize feature points for tracking and mapping, while feature-less approaches rely on the photometric consistency of images.
Several groups of researchers have reviewed image-based 3D reconstruction methods in both the civil engineering and computer science sectors. These reviews show that, before 2019, visual geometry methods dominated the civil engineering field, covering procedures such as point cloud generation, point cloud processing, surface reconstruction, and parametric modeling [31,32,33]. Conventional image-based 3D reconstruction typically involves three steps: Structure-from-Motion (SfM), multi-view stereo (MVS), and surface reconstruction. Specifically, SfM estimates the camera poses and depth maps of different perspectives, from which a sparse point cloud can be obtained. Several projects focus on SfM pipelines, such as OpenSfM and OpenMVG. Subsequent procedures, including dense reconstruction and mesh reconstruction, can be accomplished by open-source projects like OpenMVS, which takes camera pose information and sparse point clouds as input. COLMAP is a representative pipeline covering the entire procedure.
Since 2019, deep learning approaches have significantly influenced image-based 3D reconstruction. A variety of network architectures have been proposed and validated, including encoder–decoder networks, depth estimation networks, and implicit neural representations [34,35,36]. A more recent study [37] reviewed the techniques proposed for image-based 3D reconstruction over the past decade up to 2022 and mapped previous studies into a knowledge framework consisting of three axes: essential elements, use phases, and reconstruction scales. Regarding input data modality, deep learning-based methods can be divided into monocular-image and stereo-image categories. For example, deep learning-based MVS algorithms have been proposed to handle special indoor scenes that are weakly textured [38] or spatially large. Scene understanding in point clouds can likewise benefit from deep learning algorithms for 3D object recognition and segmentation [39].
In summary, Table 1 compares the above three categories of 3D reconstruction methods in terms of their advantages and disadvantages. Given the limitations of these conventional methods, this paper explores the potential of novel RF-based methods for 3D reconstruction.

3. Radiance Field (RF)-Inspired Methods

Existing 3D representations—such as meshes, voxels, point clouds, occupancy fields, and signed distance fields—have certain drawbacks. In contrast, radiance fields (RF) are defined to describe the behavior and distribution of light in 3D space [7]. Both NeRF and 3D Gaussian Splatting (3DGS) are grounded in this concept, as described below.
(1) NeRF and its variants
The essence of NeRF-based methods is to represent a scene as a continuous volumetric function, i.e., a neural radiance field (NeRF). Instead of storing explicit 3D models (e.g., meshes or point clouds), NeRF encodes a scene in a multi-layer perceptron (MLP) that learns the relationship between 3D points and their appearance. Specifically, it takes a 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)) as input and outputs the volume density σ and color (R, G, B) at that location [8]. The main steps in a typical NeRF workflow are (i) generating sample points along the ray cast through each pixel, (ii) computing the local color and density of each sample point using the MLP, and (iii) synthesizing new view images. Volume rendering techniques are used in the third step to project the outputs (i.e., colors and densities) into images with known camera poses. Collectively, the 3D space can be reconstructed. Figure 1 illustrates the core of the NeRF reconstruction method.
As indicated in Figure 1, when a camera ray r is cast through the 3D space, the color of the corresponding pixel, C(r), can be calculated using the volume rendering equation, Equation (1).
C(\mathbf{r}) = \int_{t_1}^{t_2} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt    (1)
T(t) = \exp\left(-\int_{t_1}^{t} \sigma(\mathbf{r}(s))\,ds\right)    (2)
where σ(r(t)) and c(r(t), d) represent the volume density and color at the point r(t) along the ray, and T(t) denotes the transmittance accumulated from t1 to t, calculated using Equation (2). By integrating along the ray from t1 to t2 over each differential step dt, the scene can be rendered and, collectively, reconstructed.
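To make the discretization used in practice concrete, the sketch below composites a single ray with the standard quadrature approximation of Equations (1) and (2); it is an illustration rather than the authors' implementation, and the densities and colors are random placeholders for MLP outputs.

```python
import numpy as np

def composite_ray(sigmas, colors, t_vals):
    """Approximate Equations (1) and (2) along one ray by numerical quadrature.

    sigmas : (N,)   volume densities at the N sample points (MLP outputs)
    colors : (N, 3) RGB colors at the N sample points (MLP outputs)
    t_vals : (N,)   depths of the sample points along the ray
    """
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)     # spacing between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)                # opacity of each ray segment
    # Accumulated transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j)
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))
    weights = trans * alphas                               # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)         # pixel color C(r)

# Toy example: 64 hypothetical samples on one camera ray
rng = np.random.default_rng(0)
t = np.linspace(2.0, 6.0, 64)
print(composite_ray(rng.uniform(0.0, 5.0, 64), rng.uniform(0.0, 1.0, (64, 3)), t))
```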
Since the first appearance of NeRF, many researchers have proposed NeRF-based variants to speed up the algorithm, extend its applicable scenarios, or further improve its reconstruction accuracy. Inference time can be largely decreased by speeding up the rendering process [9], while training time can be reduced with more efficient encoding methods [10]. To handle large-scene 3D reconstruction, Hu et al. [11] used a pair of parent and child NeRFs to hierarchically represent volumetric scenes. With regard to indoor environments, Park et al. [12] proposed a two-phase learning approach that first conducts holistic surface learning and then object surface learning.
(2) 3D Gaussian Splatting (3DGS) and its variants
Unlike NeRF, 3DGS achieves explicit scene representation with state-of-the-art (SOTA) visual quality and competitive training times [13]. Its core novelty compared to NeRF-based methods lies in the 3D Gaussian representation: a scene is modeled as a collection of anisotropic 3D Gaussian splats. Table 2 summarizes the differences between NeRF and 3DGS at the theoretical level. Each 3D Gaussian is described by a center position μ, a covariance matrix Σ, an opacity α, and a color, as defined in Equation (3). Specifically, the spatial covariance Σ is parameterized by a scaling matrix S and a rotation matrix R, as shown in Equation (4). Instead of the differential volumetric representation used in NeRF (i.e., sample points), 3D Gaussian splats are placed along the camera ray. By blending the 3D Gaussian splats that overlap a given pixel, the color C of that pixel can be computed using Equation (5). These parameters are optimized with adaptive density control using stochastic gradient descent and accelerated by custom GPU kernels. Furthermore, a tile-based rasterizer for Gaussian splats was proposed to allow α-blending and thereby speed up the whole rendering process. As a result, the splatting method enables real-time rendering while maintaining visual quality.
G(\mathbf{x}) = e^{-\frac{1}{2}\mathbf{x}^{T}\Sigma^{-1}\mathbf{x}}    (3)
\Sigma = R\,S\,S^{T}R^{T}    (4)
C = \sum_{i \in N} c_{i}\,\alpha_{i} \prod_{j=1}^{i-1}\left(1-\alpha_{j}\right)    (5)
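As a minimal illustration of the blending in Equation (5), assuming splats that have already been projected and rasterized to a pixel (the values below are hypothetical), the splats are sorted front to back and alpha-composited just as a tile-based rasterizer would do per pixel.

```python
import numpy as np

def blend_pixel(depths, colors, alphas):
    """Front-to-back alpha compositing of the splats covering one pixel (Equation (5))."""
    order = np.argsort(depths)                  # sort splats front to back
    accumulated, transmittance = np.zeros(3), 1.0
    for i in order:
        accumulated += transmittance * alphas[i] * colors[i]
        transmittance *= (1.0 - alphas[i])      # running product of (1 - alpha_j)
        if transmittance < 1e-4:                # early termination used by tile rasterizers
            break
    return accumulated

# Three hypothetical splats overlapping one pixel
depths = np.array([2.5, 1.2, 3.8])
colors = np.array([[0.9, 0.1, 0.1], [0.1, 0.8, 0.2], [0.2, 0.2, 0.9]])
alphas = np.array([0.6, 0.3, 0.8])
print(blend_pixel(depths, colors, alphas))
```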
Since its advent, 3DGS has been applied in a variety of fields, from autonomous driving to the Metaverse. Extensive attempts have been made in academia to further improve 3DGS in terms of visual quality, efficiency, cost, and applicability to more complex scenarios. To enhance visual quality, some researchers focus on rendering finer details and eliminating aliasing effects [14,15], while others aim to produce more realistic illumination [16,17,18,19]. As for efficiency, the training of 3DGS is, on average, faster than conventional 3D reconstruction schemes, typically completing in minutes. To improve it further, compressing the Gaussians through vector quantization, octree-based algorithms, or other memory-reduction techniques is beneficial, leading to algorithms such as Scaffold-GS [20], CompGS [21], EAGLES [22], and LightGaussian [23]. Rendering efficiency can also be improved by identifying unnecessary 3D Gaussians [24]. The cost of applying 3DGS in the real world, in terms of reality-capture devices, is rather low compared to LiDAR systems and laser scanners, and requirements on the training data can be further relaxed through few-shot training [25,26,27,28,29] and even rendering from monocular images [30,31]. The applicability of 3DGS can be expanded to large-scene rendering [32,33], dynamic scene reconstruction [34,35,36], and non-rigid or deformable targets [37].
In summary, while 3D reconstruction methods have evolved for decades and have been extensively studied in the AEC sector, the application and validation of RF-inspired methods remain largely unexplored within real AEC projects. Furthermore, their suitability for supporting managerial operations in built environments has not yet been examined, a gap this paper seeks to address.

4. Research Method

This paper aims to validate and compare various algorithms for indoor 3D reconstruction and to discuss their applicability and potential challenges in real-world facility management scenarios.

4.1. Methods for Performance Comparison

This section outlines the comparison scheme used in this study. As summarized in Table 3, three main categories of comparisons are considered: novel view synthesis, 3D point cloud generation, and 3D point cloud segmentation. For novel view synthesis, we only consider RF-based methods, as traditional photogrammetry-based approaches cannot generate novel views. Both conventional 3D reconstruction methods and RF-based methods are evaluated for their ability to produce 3D point clouds. Finally, among a variety of downstream tasks such as detection and tracking, we test the 3D point cloud segmentation task as a means of comparing and validating the reconstruction methods, using both MVS-derived point clouds and the RF-based method that achieves the highest view synthesis quality.
The algorithms selected for comparison meet the following criteria:
  • They are recognized as classic algorithms in their respective fields;
  • They have publicly available open-source GitHub repositories;
  • We exclude the latest specialized algorithmic variants designed for unique conditions (e.g., poor lighting or large-scale environments).
Based on these criteria, SfM and MVS methods are chosen to represent conventional 3D reconstruction approaches, providing sparse and dense reconstructions, respectively. The sparse model generated by COLMAP also serves as the input for RF-based methods. The RF-based methods tested include both NeRF and 3DGS. Variants of NeRF are assessed via Nerfstudio [38], an integrated platform for training and evaluating NeRF-like algorithms. The selected NeRF-based methods include Vanilla NeRF, InstantNGP [10], Mip-NeRF [39], and Nerfacto [40]. For the 3DGS category, we use the Hugging Face implementation [13], chosen for its high-quality visualization and real-time rendering capability.
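For reference, the Nerfstudio variants are launched through the ns-train command; the short sketch below wraps those calls in Python, with a hypothetical dataset path and method names as registered in the Nerfstudio release used at the time of writing.

```python
import subprocess

DATA_DIR = "data/laboratory"  # hypothetical path to the COLMAP-processed dataset

# Train each selected NeRF variant on the same dataset via Nerfstudio's CLI.
for method in ["vanilla-nerf", "mipnerf", "instant-ngp", "nerfacto"]:
    subprocess.run(["ns-train", method, "--data", DATA_DIR], check=True)
```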
To further validate the applicability of RF-based methods, we conduct a downstream task: semantic segmentation of the reconstructed 3D point clouds. Mainstream segmentation approaches can be classified into model-driven methods (e.g., RANSAC and Hough transform) and data-driven methods (e.g., clustering or deep learning models). In this study, we compare MVS outputs to the RF-based method with the best view synthesis performance. The reconstructed 3D point clouds are exported and segmented using a pre-trained PointNet model. We set the export to include 1,000,000 points and use Open3D for normalization.
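A minimal sketch of this pre-processing, assuming a hypothetical export path: Open3D removes stray outliers, and the cloud is centered and scaled to the unit sphere before being passed to the pre-trained PointNet.

```python
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("exports/point_cloud.ply")   # hypothetical export path

# Drop statistical outliers left over from the radiance-field export
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

# Normalize: center on the origin and scale into the unit sphere,
# matching the input convention assumed for the pre-trained PointNet
points = np.asarray(pcd.points)
points -= points.mean(axis=0)
points /= np.linalg.norm(points, axis=1).max()
pcd.points = o3d.utility.Vector3dVector(points)

o3d.io.write_point_cloud("exports/point_cloud_normalized.ply", pcd)
```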

4.2. Performance Evaluation Metrics

To quantitatively evaluate reconstruction quality—especially point cloud quality—we refer to various metrics used in previous work. Depending on the availability of a reference point cloud, point cloud quality assessment (PCQA) metrics can be full-reference (FR), reduced-reference (RR), or no-reference (NR). Common FR metrics include p2point [41], p2plane [42], PSNR_yuv [43], and PointSSIM [44]. Other metrics account for color distortions, such as MPED [45], GraphSIM [46], and PCQM [47]. Researchers have also explored how to assess point clouds with fewer references, proposing metrics such as PCM_RR [48]. Under no-reference circumstances, only the distorted samples are used to extract and learn point cloud features; Liu et al. [49] proposed ResSCNN, a sparse CNN-based metric that assesses both the geometry and color information of point clouds. Jarząbek-Rychard and Maas [50] further explored how geometric uncertainties propagate through the scan-to-BIM process, relating model quality to input data quality.
In this study, we select widely used evaluation metrics, as listed in Table 3. RF-based algorithms inherently rely on radiance field representations rather than explicit 3D geometry, so traditional geometric accuracy metrics (e.g., Chamfer Distance, Hausdorff Distance, RMSE) are not directly applicable without an additional surface reconstruction step. Instead, PSNR, SSIM, and LPIPS effectively capture differences in visual fidelity, which directly affect downstream AEC applications; we therefore use them to assess the quality of view synthesis. To extend beyond view synthesis, semantic segmentation accuracy is used as an additional evaluation criterion, ensuring that the reconstructed models are useful for subsequent analysis and decision-making in AEC workflows. Table 4 summarizes the key metrics and their definitions.
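For reproducibility, the three view-synthesis metrics can be computed with off-the-shelf libraries; the sketch below uses scikit-image (0.19 or newer for the channel_axis argument) and the lpips package, and the image file names are hypothetical.

```python
import torch
import lpips
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rendered = io.imread("render.png")[..., :3]         # synthesized novel view (hypothetical file)
reference = io.imread("ground_truth.png")[..., :3]  # held-out test image of the same size

psnr = peak_signal_noise_ratio(reference, rendered)
ssim = structural_similarity(reference, rendered, channel_axis=-1)

# LPIPS expects NCHW float tensors scaled to [-1, 1]
to_tensor = lambda img: torch.from_numpy(img).permute(2, 0, 1)[None].float() / 127.5 - 1.0
lpips_value = lpips.LPIPS(net="alex")(to_tensor(rendered), to_tensor(reference)).item()

print(f"PSNR={psnr:.2f} dB, SSIM={ssim:.4f}, LPIPS={lpips_value:.4f}")
```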

4.3. Data Collection

To validate the performance of various RF-inspired methods, we collected data from multiple indoor environments, including a university laboratory and a computer-aided classroom. These spaces reflect typical office-like environments where facility management is crucial. Table 5 summarizes the experimental setup, and Figure 2 shows the indoor spaces used in the experiments. An iPhone 12 was used to capture video and images, which were then organized and formatted into an indoor scene dataset. Different video-capturing and data pre-processing strategies were tested for sensitivity analyses. In total, 148 photographs and one video were collected.
Video frames were extracted at 2 fps to create a sequence of images. COLMAP [5,51] was then used for feature extraction, feature matching, and automatic reconstruction to produce a conventional photogrammetry-based 3D model for baseline comparison. Camera parameters were retrieved and stored in a transform.json file. The final dataset includes an image folder, COLMAP processing results, and transform.json, as illustrated in Figure 3.
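The frame extraction and COLMAP pre-processing described above can be scripted end to end; the sketch below is a hedged example with hypothetical file and directory names, calling the standard ffmpeg and COLMAP command-line tools through Python.

```python
import os
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

os.makedirs("frames", exist_ok=True)
os.makedirs("sparse", exist_ok=True)

# 1. Extract frames from the walk-around video at 2 fps (file names are hypothetical)
run(["ffmpeg", "-i", "lab_walkaround.mp4", "-vf", "fps=2", "frames/frame_%04d.jpg"])

# 2. COLMAP: feature extraction, exhaustive matching, and sparse reconstruction
run(["colmap", "feature_extractor", "--database_path", "colmap.db", "--image_path", "frames"])
run(["colmap", "exhaustive_matcher", "--database_path", "colmap.db"])
run(["colmap", "mapper", "--database_path", "colmap.db",
     "--image_path", "frames", "--output_path", "sparse"])
```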
All 3D reconstruction experiments—view synthesis and downstream tasks—were conducted on a PC with an Intel i5 CPU and an NVIDIA RTX 4090 GPU. We used PyTorch 2.0 as the deep learning framework and CUDA 11.8 for GPU acceleration. The Anaconda environment facilitated Python package management, and an OpenGL-based viewer allowed real-time rendering of trained models.

5. Results and Performance Comparison

Table 6 compares the performance of various radiance field-inspired methods, including both NeRF-based and 3D Gaussian Splatting approaches. We evaluate each method in terms of view synthesis quality, the richness of reconstructed 3D models, and computational efficiency. Specifically, SSIM, PSNR, and LPIPS are used to measure view synthesis quality. The richness of the reconstructed scene is assessed by counting rays for RF-inspired methods and comparing point densities for SfM-based methods. Efficiency is measured by the training and rendering time required. For these tests, we used a dataset from a university laboratory (Figure 2a), captured using a handheld camera during a walk-around session. A total of 66 images were included for reconstruction.

5.1. View Synthesis

PSNR, SSIM, and LPIPS (including their variants) were used to quantify view synthesis quality. The results were visualized using the Weights and Biases platform (wandb).
  • PSNR: Higher PSNR values indicate better image quality. 3DGS achieves 28.2287 dB, followed by Instant-NGP and Nerfacto (~19 dB), demonstrating near-realistic indoor scene rendering.
  • SSIM: SSIM measures structural similarity and ranges from 0 to 1. Instant-NGP achieved the highest SSIM (0.7923), closely followed by 3DGS (0.7559). Mip-NeRF performed least satisfactorily in terms of structure retention.
  • LPIPS: LPIPS evaluates perceptual similarity. Vanilla-NeRF excelled here (0.8332), generating images closely matching human perception. By contrast, 3DGS lagged behind in LPIPS at 0.4405.
Figure 4 shows rendered results from various methods, confirming that RF-inspired methods can produce satisfactory novel view images. Among RF-inspired methods, 3DGS generates superior PSNR and SSIM due to its explicit spatial information, where Gaussian primitives store finer geometric and textural details without relying on neural function interpolation. However, 3DGS underperforms in the LPIPS metric. This issue is likely due to the lack of fine-grained color representation in 3DGS compared to NeRF-based methods, which model continuous radiance fields rather than voxelized structures. Prior studies have demonstrated that neural voxel-based approaches tend to trade off texture fidelity for improved efficiency and rendering speed [52]. Overall, the RF-inspired methods produce high-quality novel views, with 3DGS leading in general image fidelity, Instant-NGP in structural similarity, and Vanilla-NeRF in perceptual similarity.
In addition, variations in algorithm performance under adverse lighting conditions (i.e., strong illumination near the window) are observed. Among all tested methods, 3DGS exhibits greater robustness to lighting variations, likely due to its grid-based representation and more stable feature extraction process. In contrast, NeRF-based approaches are inherently more sensitive to lighting changes, as they encode both geometry and appearance within the radiance field. This sensitivity results in noticeable blurring artifacts, particularly in Instant-NGP and Nerfacto-generated novel views, as shown in Figure 4b,c.

5.2. 3D Point Cloud Generation

Conventional Structure-from-Motion (SfM) methods generate both sparse and dense point clouds directly. In contrast, RF-based methods focus primarily on view synthesis, and an additional export process is required to produce point clouds. To assess the “richness” of the reconstructed scene, we count the number of points for conventional methods and the number of rays for RF-based methods. Table 7 presents the results of various approaches. Using SfM (COLMAP), the initial sparse reconstruction yields 14,422 points, and after dense reconstruction, this increases to 962,847 points. For RF-inspired methods, point counts refer to the exported point clouds from the trained model. A threshold of 1,000,000 points was set during export. During training, RF-based methods produce rays at approximately 50,000 per second, totaling 12,192,768 rays.
After view synthesis, tools like NerfStudio can convert and export multiple 3D representations, including point clouds and meshes. Due to the predefined export threshold, the number of points for NeRF-based methods is slightly lower than that of 3DGS. Figure 5 shows point clouds generated by various methods. While conventional SfM produces relatively clean reconstructions, it fails to capture part of the room properly (Figure 5a). In contrast, 3DGS provides highly detailed point clouds, modeling objects such as air conditioners, projector curtains, and lighting fixtures—though without color data (Figure 5d). Nerfacto focuses predominantly on centrally located objects, capturing fewer details near the room’s edges (Figure 5c).
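For reference, this export step can be scripted through Nerfstudio's command-line interface; the sketch below is a hedged example in which the config path is hypothetical and the flag names are assumed to match the Nerfstudio release used here.

```python
import subprocess

# Export a point cloud from a trained Nerfacto model with the 1,000,000-point threshold.
# The config path is hypothetical; ns-export flag names may differ between Nerfstudio versions.
subprocess.run([
    "ns-export", "pointcloud",
    "--load-config", "outputs/laboratory/nerfacto/config.yml",
    "--output-dir", "exports/nerfacto",
    "--num-points", "1000000",
], check=True)
```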

5.3. Reconstruction Efficiency

We compare efficiency based on training and/or rendering times. NeRF-based methods require per-scene optimization, which is computationally intensive due to the sampling, volume rendering, and backpropagation performed over many iterations; consequently, traditional NeRF-based methods take an extremely long time to train (e.g., Vanilla-NeRF took around 17 h). Mip-NeRF, at ~14 h, is similarly impractical for time-sensitive scenarios and is better suited to static use cases such as construction training or educational demonstrations. Nerfacto and Instant-NGP are designed for fast training through hash-grid encoding and hierarchical sampling, making them suitable for scenarios where timely updates are essential—such as active construction sites with frequent changes. In contrast, 3DGS demonstrates comparable, even superior, efficiency to Instant-NGP and Nerfacto, completing view synthesis in about 30–60 min. This is mainly because 3DGS directly optimizes explicit Gaussian parameters rather than learning an implicit function, making it more practical for applications that require rapid scene adaptation.

5.4. Downstream Tasks

We applied a pre-trained PointNet [53] model to both MVS and RF-based point clouds for semantic segmentation. The Stanford 3D Indoor Scene Dataset (S3DIS), a benchmark dataset for indoor semantic segmentation [54], was used for model pre-training. The dataset covers six large-scale indoor areas annotated with 13 semantic categories. The point clouds generated by the 3DGS method were first pre-processed through outlier point removal and rotation and then used for validation. The experiment was conducted on Ubuntu 20.04 with the TensorFlow 1.15.5 framework and CUDA 11.4. An RTX 4090 GPU with 24 GB RAM was used for training and validation.
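As a hedged illustration of how a reconstructed cloud could be prepared for a PointNet checkpoint trained on S3DIS, the framework-agnostic sketch below partitions the scene into 1 m x 1 m blocks of 4096 points with a common 9-dimensional feature layout (XYZ, RGB, room-normalized XYZ); the block size, point count, and feature order are assumptions that must match the checkpoint actually used.

```python
import numpy as np

def make_blocks(points, colors, block=1.0, num_point=4096):
    """Partition a room-scale cloud into S3DIS-style blocks for PointNet inference.

    points : (N, 3) XYZ coordinates; colors : (N, 3) RGB in [0, 255] (zeros if unavailable).
    Returns an array of shape (num_blocks, num_point, 9).
    """
    points = points - points.min(axis=0)          # shift the room to the origin
    room_max = points.max(axis=0)
    blocks = []
    for x0 in np.arange(0.0, room_max[0], block):
        for y0 in np.arange(0.0, room_max[1], block):
            mask = ((points[:, 0] >= x0) & (points[:, 0] < x0 + block) &
                    (points[:, 1] >= y0) & (points[:, 1] < y0 + block))
            idx = np.where(mask)[0]
            if idx.size == 0:
                continue
            # Sample a fixed number of points per block (with replacement if sparse)
            choice = np.random.choice(idx, num_point, replace=bool(idx.size < num_point))
            feats = np.hstack([points[choice],              # block XYZ
                               colors[choice] / 255.0,      # normalized RGB
                               points[choice] / room_max])  # room-normalized XYZ
            blocks.append(feats)
    return np.stack(blocks)
```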
Figure 6 presents the segmentation results, including accuracy and visual interpretation. Nerfacto outperforms 3DGS in identifying central objects (e.g., tables). However, objects like windows, doors, and boards are less accurately segmented due to insufficient reconstruction quality at the room's periphery. While 3DGS excels in view synthesis, its lack of color information in point clouds impairs segmentation performance. Interestingly, for objects where color is not a critical feature—like windows—the model performs relatively well. In summary, Nerfacto prioritizes a more stable and structured feature representation over render fidelity, resulting in a more balanced semantic segmentation accuracy. By contrast, 3DGS, which models scenes using discrete Gaussian primitives, is optimized for high-quality renders; however, the interpolation across Gaussians may introduce smooth blending effects, making it more challenging to delineate clear object boundaries for segmentation. This underscores the significance of both geometric detail and color cues in achieving effective semantic segmentation of indoor scenes. Potential improvements to RF-inspired methods could involve integrating depth priors or adopting hybrid feature extraction techniques to enhance segmentation performance.

6. Discussion

6.1. Sensitivity Analysis on Data Collection Strategies

In this section, we evaluate RF-inspired methods using two distinct data collection strategies applied to the university laboratory scenario: the walk-around strategy and the pivot strategy. The walk-around approach involves the camera operator moving around the indoor space while keeping the camera focused on the room’s center. By contrast, the pivot strategy requires the operator to stand at a central point and rotate in place to capture the room from different angles. Figure 7 illustrates these two data collection methods, and Table 8 presents the corresponding view synthesis results.
Contrary to some guidelines in the literature, no single data collection strategy proves universally superior. In terms of PSNR, the walk-around dataset yields better results for Instant-NGP and Nerfacto, whereas the pivot dataset is more advantageous for Vanilla-NeRF, MipNeRF, and 3DGS. Models trained on the pivot dataset generally achieve higher SSIM, while those trained on the walk-around dataset produce images closer to human perception (LPIPS), except in the case of MipNeRF. The discrepancy in 3DGS performance under the pivot strategy (i.e., a high PSNR of 35.9607 and a low LPIPS) suggests that while the pivot approach enhances structural reconstruction, it may limit the diversity of sampled textures, potentially affecting the perceptual similarity adversely.
Figure 8 shows view synthesis results and 3D point clouds obtained from different video-capturing strategies. While RF-inspired methods trained on both datasets can produce realistic novel views, the point clouds generated from the walk-around dataset are substantially more detailed than those derived from the pivot dataset.

6.2. Sensitivity Analysis of Dataset Size

We also examined the impact of dataset size by training on two sets of images: 41 images versus 860 images. We tested MipNeRF, Instant-NGP, Nerfacto, and 3DGS to assess their sensitivity to dataset size. Table 9 presents the results after training on these two different datasets.
In general, increasing the number of training images improves view synthesis quality. For instance, the 3DGS model's PSNR increases from 30.7624 to 32.1640, and Nerfacto's SSIM improves from 0.7099 to 0.8090. However, more data sometimes leads to reduced perceptual similarity (LPIPS), as seen in Instant-NGP's drop from 0.7255 to 0.5054. MipNeRF behaves anomalously, performing worse in PSNR and SSIM and scoring higher in LPIPS with more data.
Figure 9 presents reconstructed scenes and point clouds trained on datasets of different sizes. While increasing the dataset size by a factor of 20 significantly improves the quality of point clouds generated by conventional SfM methods (from 1,203,383 points to 24,881,631 points), the improvement for RF-inspired models is more modest (around 1.05–1.5 times). Thus, the number of training images does not significantly influence the performance of RF-inspired methods in terms of training and rendering.
Overall, the walk-around data collection strategy is recommended for real-life projects due to its ability to produce more detailed point clouds. While increasing the training dataset size yields more realistic and similar novel views, these improvements are limited.

6.3. Applicability of RF-Inspired Methods in Indoor Scenarios

Based on the findings, Figure 10 provides a guide for selecting suitable methods for 3D indoor reconstruction. To fully leverage RF-inspired methods in practical applications—such as BIM modeling and digital twins—we must consider the choice of algorithm, data collection strategy, and output formats.
From an algorithmic perspective, 3DGS is generally preferable if high-quality novel views are the priority. However, its generated point clouds lack color information. Nerfacto, on the other hand, strikes a balance by producing richer point clouds with RGB data and realistic novel images. Larger furnishings and infrastructure (e.g., tables, computers, projectors) are well-reconstructed and can be accurately segmented for downstream tasks. Smaller objects (e.g., keyboards, mice) tend to be excluded when the entire indoor space is the focus. In such cases, capturing images at closer range may be necessary.
As illustrated in Figure 8, using the walk-around strategy generally results in richer point clouds, especially for items located along the edges and corners of the room. While the pivot-trained models struggle with peripheral objects, the walk-around-trained models preserve these details. When the walk-around approach is adopted, central items are reconstructed with more detail, though corner-located objects may still have fewer points in NeRF-based models. Conversely, the pivot approach struggles with both central and corner objects due to occlusion issues. Therefore, for large indoor spaces, conducting multiple walk-around passes can lead to satisfactory 3D reconstructions. In addition, there have been attempts to use scene geometry priors or scene partitioning to generate hierarchical representations (e.g., octrees or voxel grids) of indoor spaces to cope with multi-room settings or large-scale indoor scenarios [55].
Regarding output formats, RF-inspired methods like NeRF and 3DGS are inherently designed for novel view synthesis and 3D rendering. NeRF and its variants use implicit neural representations, while 3DGS employs explicit scene representations (3D Gaussian splats). Unlike SLAM algorithms that map the environment and track the camera location, RF-inspired methods require post-processing to generate structured geometric representations like point clouds and meshes, which remains a challenge for real-life scan-to-BIM workflows. Seamless, direct object recognition and classification of AEC elements (e.g., walls, beams, HVAC systems) from RF-inspired reconstruction results are also still infeasible, even though such capabilities are essential for managerial operations like facility management. The advantage of RF-inspired methods lies in their capacity for semantically meaningful tasks and scene manipulation. For example, 3DGS supports 3D scene editing by manipulating individual splats [56]. Although this study focuses on O&M phases, extending RF-inspired approaches to construction phases—such as construction resource tracking and construction progress monitoring—remains a promising area for future research.
Integrating other data sources or methods is promising because RF-inspired methods have a few inherent limitations. Vanilla RF-inspired methods rely on SfM for camera pose estimation, which can be noisy and computationally expensive, and since they are trained on static scenes, they inherently lack the capability for monitoring tasks over time. In hybrid models, SLAM-based methods offer real-time pose tracking and drift-free camera trajectories, addressing these limitations by enabling more stable and temporally consistent scene updates and making the combined approach more suitable for long-term monitoring applications, e.g., construction progress monitoring [57]. Parallel processing of tracking and implicit representation can improve convergence speed [58], and the color information in the SLAM system also benefits reconstruction fidelity. When assisted by LiDAR, hybrid models outperform baseline models for large-scale outdoor scenes and multi-room settings thanks to the geometric priors (e.g., depth information) provided [59,60].

7. Conclusions

Image-based 3D reconstruction methods have attracted growing interest in both industry and academia. This study evaluates several RF-inspired approaches—from NeRF variants to 3DGS—compares them with conventional SLAM-based methods, and examines their applicability in real-life asset management. Experiments in university settings demonstrate that Nerfacto and 3DGS outperform other methods in terms of view quality and reconstruction efficiency. Additionally, point clouds generated by RF-inspired methods are sufficient for subsequent tasks like semantic segmentation. To enhance their practicality, adopting a walk-around data collection strategy is recommended, and increasing dataset size does not necessarily improve rendering performance. This research also highlights future directions for the AEC community to harness RF-inspired methods for more cost-efficient, accurate, and effective asset and facility management.

Author Contributions

Conceptualization, S.X., J.W. and W.S.; methodology, S.X. and J.W.; software, J.X.; validation, J.X.; formal analysis, S.X. and J.W.; investigation, J.W. and J.X.; resources, J.X.; data curation, J.X. and W.S.; writing—original draft preparation, S.X.; writing—review and editing, J.W. and W.S.; visualization, S.X.; supervision, J.W. and W.S.; project administration, W.S.; funding acquisition, S.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research work was supported by the National Natural Science Foundation of China (Grant No. 72301246) and the Zhejiang Office of Philosophy and Social Science, China (Grant No. 24NDQN178YBM).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

Author Jingfeng Xia was employed by the company North China Municipal Engineering Design & Research Institute Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Kim, M.; Kim, H. Optimal Pre-processing of Laser Scanning Data for Indoor Scene Analysis and 3D Reconstruction of Building Models. Ksce J. Civ. Eng. 2024, 28, 1–14. [Google Scholar] [CrossRef]
  2. Newcombe, R.A.; Izadi, S.; Hilliges, O.; Molyneaux, D.; Kim, D.; Davison, A.J.; Kohi, P.; Shotton, J.; Hodges, S.; Fitzgibbon, A. Kinectfusion: Real-time dense surface mapping and tracking. In Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality, Basel, Switzerland, 26–29 October 2011. [Google Scholar]
  3. Mur-Artal, R.; Tardós, J.D. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  4. Dai, A.; Nießner, M.; Zollhöfer, M.; Izadi, S.; Theobalt, C. Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Trans. Graph. (ToG) 2017, 36, 1. [Google Scholar] [CrossRef]
  5. Schonberger, J.L.; Frahm, J.-M. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  6. Chen, J.; Kira, Z.; Cho, Y.K. Deep learning approach to point cloud scene understanding for automated scan to 3D reconstruction. J. Comput. Civ. Eng. 2019, 33, 04019027. [Google Scholar] [CrossRef]
  7. Tosi, F.; Zhang, Y.; Gong, Z.; Sandström, E.; Mattoccia, S.; Oswald, M.R.; Poggi, M. How NeRFs and 3D Gaussian Splatting are Reshaping SLAM: A Survey. arXiv 2024, arXiv:2402.13255. [Google Scholar]
  8. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
  9. Garbin, S.J.; Kowalski, M.; Johnson, M.; Shotton, J.; Valentin, J. Fastnerf: High-fidelity neural rendering at 200fps. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  10. Müller, T.; Evans, A.; Schied, C.; Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. (TOG) 2022, 41, 1–15. [Google Scholar] [CrossRef]
  11. Hu, X.; Xiong, G.; Zang, Z.; Jia, P.; Han, Y.; Ma, J. PC-NeRF: Parent-Child Neural Radiance Fields Using Sparse LiDAR Frames in Autonomous Driving Environments. arXiv 2024, arXiv:2402.09325. [Google Scholar] [CrossRef]
  12. Park, M.; Do, M.; Shin, Y.; Yoo, J.; Hong, J.; Kim, J.; Lee, C. H2O-SDF: Two-phase Learning for 3D Indoor Reconstruction using Object Surface Fields. arXiv 2024, arXiv:2402.08138. [Google Scholar]
  13. Kerbl, B.; Kopanas, G.; Leimkuehler, T.; Drettakis, G. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 2023, 42, 1–14. [Google Scholar] [CrossRef]
  14. Yu, Z.; Chen, A.; Huang, B.; Sattler, T.; Geiger, A. Mip-splatting: Alias-free 3d gaussian splatting. arXiv 2023, arXiv:2311.16493. [Google Scholar]
  15. Yan, Z.; Low, W.F.; Chen, Y.; Lee, G.H. Multi-scale 3d gaussian splatting for anti-aliased rendering. arXiv 2023, arXiv:2311.17089. [Google Scholar]
  16. Jiang, Y.; Tu, J.; Liu, Y.; Gao, X.; Long, X.; Wang, W.; Ma, Y. GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces. arXiv 2023, arXiv:2311.17977. [Google Scholar]
  17. Liang, Z.; Zhang, Q.; Feng, Y.; Shan, Y.; Jia, K. Gs-ir: 3d gaussian splatting for inverse rendering. arXiv 2023, arXiv:2311.16473. [Google Scholar]
  18. Yao, Y.; Zhang, J.; Liu, J.; Qu, Y.; Fang, T.; McKinnon, D.; Tsin, Y.; Quan, L. Neilf: Neural incident light field for physically-based material estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  19. Ma, L.; Agrawal, V.; Turki, H.; Kim, C.; Gao, C.; Sander, P.; Zollhöfer, M.; Richardt, C. SpecNeRF: Gaussian Directional Encoding for Specular Reflections. arXiv 2023, arXiv:2312.13102. [Google Scholar]
  20. Lu, T.; Yu, M.; Xu, L.; Xiangli, Y.; Wang, L.; Lin, D.; Dai, B. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. arXiv 2023, arXiv:2312.00109. [Google Scholar]
  21. Navaneet, K.; Meibodi, K.P.; Koohpayegani, S.A.; Pirsiavash, H. Compact3d: Compressing gaussian splat radiance field models with vector quantization. arXiv 2023, arXiv:2311.18159. [Google Scholar]
  22. Girish, S.; Gupta, K.; Shrivastava, A. Eagles: Efficient accelerated 3d gaussians with lightweight encodings. arXiv 2023, arXiv:2312.04564. [Google Scholar]
  23. Fan, Z.; Fan, Z.; Wang, K.; Wen, K.; Zhu, Z.; Xu, D.; Wang, Z. Lightgaussian: Unbounded 3d gaussian compression with 15x reduction and 200+ fps. arXiv 2023, arXiv:2311.17245. [Google Scholar]
  24. Jo, J.; Kim, H.; Park, J. Identifying Unnecessary 3D Gaussians using Clustering for Fast Rendering of 3D Gaussian Splatting. arXiv 2024, arXiv:2402.13827. [Google Scholar]
  25. Chung, J.; Oh, J.; Lee, K.M. Depth-regularized optimization for 3d gaussian splatting in few-shot images. arXiv 2023, arXiv:2311.13398. [Google Scholar]
  26. Zhu, Z.; Fan, Z.; Jiang, Y.; Wang, Z. FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting. arXiv 2023, arXiv:2312.00451. [Google Scholar]
  27. Charatan, D.; Li, S.L.; Tagliasacchi, A.; Sitzmann, V. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. arXiv 2023, arXiv:2312.12337. [Google Scholar]
  28. Xiong, H. Sparsegs: Real-time 360° sparse view synthesis using gaussian splatting. arXiv 2023, arXiv:2312.00206. [Google Scholar]
  29. Zou, Z.-X.; Yu, Z.; Guo, Y.; Li, Y.; Liang, D.; Cao, Y.; Zhang, S. Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers. arXiv 2023, arXiv:2312.09147. [Google Scholar]
  30. Das, D.; Wewer, C.; Yunus, R.; Ilg, E.; Lenssen, J.E. Neural parametric gaussians for monocular non-rigid object reconstruction. arXiv 2023, arXiv:2312.01196. [Google Scholar]
  31. Szymanowicz, S.; Rupprecht, C.; Vedaldi, A. Splatter image: Ultra-fast single-view 3d reconstruction. arXiv 2023, arXiv:2312.13150. [Google Scholar]
  32. Chen, Y.; Gu, C.; Jiang, J.; Zhu, X.; Zhang, L. Periodic vibration gaussian: Dynamic urban scene reconstruction and real-time rendering. arXiv 2023, arXiv:2311.18561. [Google Scholar]
  33. Lin, J.; Li, Z.; Tang, X.; Liu, J.; Liu, S.; Liu, J.; Lu, Y.; Wu, X.; Xu, S.; Yan, Y.; et al. VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction. arXiv 2024, arXiv:2402.17427. [Google Scholar]
  34. Kratimenos, A.; Lei, J.; Daniilidis, K. Dynmf: Neural motion factorization for real-time dynamic view synthesis with 3d gaussian splatting. arXiv 2023, arXiv:2312.00112. [Google Scholar]
  35. Lin, Y.; Dai, Z.; Zhu, S.; Yao, Y. Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle. arXiv 2023, arXiv:2312.03431. [Google Scholar]
  36. Yang, Z.; Yang, H.; Pan, Z.; Zhang, L. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. arXiv 2023, arXiv:2310.10642. [Google Scholar]
  37. Duisterhof, B.P.; Mandi, Z.; Yao, Y.; Liu, J.W.; Shou, M.Z.; Song, S.; Ichnowski, J. Md-splatting: Learning metric deformation from 4d gaussians in highly deformable scenes. arXiv 2023, arXiv:2312.00583. [Google Scholar]
  38. Tancik, M.; Weber, E.; Ng, E.; Li, R.; Yi, B.; Wang, T.; Kristoffersen, A.; Austin, J.; Salahi, K.; Ahuja, A.; et al. Nerfstudio: A modular framework for neural radiance field development. In Proceedings of the ACM SIGGRAPH 2023 Conference Proceedings, Los Angeles, CA, USA, 6–10 August 2023. [Google Scholar]
  39. Barron, J.T.; Mildenhall, B.; Tancik, M.; Hedman, P.; Martin-Brualla, R.; Srinivasan, P.P. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  40. Zhang, X.; Srinivasan, P.P.; Deng, B.; Debevec, P.; Freeman, W.T.; Barron, J.T. Nerfactor: Neural factorization of shape and reflectance under an unknown illumination. ACM Trans. Graph. (ToG) 2021, 40, 1–18. [Google Scholar] [CrossRef]
  41. Mekuria, R.; Li, Z.; Tulvan, C.; Chou, P. Evaluation criteria for pcc (point cloud compression). ISO/IEC JTC 2016, 1, N16332. Available online: https://mpeg.chiariglione.org/standards/mpeg-i/point-cloud-compression/evaluation-criteria-pcc.html (accessed on 2 February 2025).
  42. Pavez, E.; Chou, P.A.; de Queiroz, R.L.; Ortega, A. Dynamic polygon clouds: Representation and compression for VR/AR. APSIPA Trans. Signal Inf. Process. 2018, 7, e15. [Google Scholar] [CrossRef]
  43. Tian, D.; Ochimizu, H.; Feng, C.; Cohen, R.; Vetro, A. Geometric distortion metrics for point cloud compression. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017. [Google Scholar]
  44. Alexiou, E.; Ebrahimi, T. Towards a point cloud structural similarity metric. In Proceedings of the 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), London, UK, 6–10 July 2020. [Google Scholar]
  45. Yang, Q.; Zhang, Y.; Chen, S.; Xu, Y.; Sun, J.; Ma, Z. MPED: Quantifying point cloud distortion based on multiscale potential energy discrepancy. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 6037–6054. [Google Scholar] [CrossRef] [PubMed]
  46. Yang, Q.; Ma, Z.; Xu, Y.; Li, Z.; Sun, J. Inferring point cloud quality via graph similarity. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 3015–3029. [Google Scholar] [CrossRef]
  47. Meynet, G.; Nehmé, Y.; Digne, J.; Lavoué, G. PCQM: A full-reference quality metric for colored 3D point clouds. In Proceedings of the 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX), Athlone, Ireland, 26–28 May 2020. [Google Scholar]
  48. Viola, I.; Cesar, P. A reduced reference metric for visual quality evaluation of point cloud contents. IEEE Signal Process. Lett. 2020, 27, 1660–1664. [Google Scholar] [CrossRef]
  49. Liu, Y.; Yang, Q.; Xu, Y.; Yang, L. Point cloud quality assessment: Dataset construction and learning-based no-reference metric. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–26. [Google Scholar] [CrossRef]
  50. Jarząbek-Rychard, M.; Maas, H.G. Modeling of 3D geometry uncertainty in Scan-to-BIM automatic indoor reconstruction. Autom. Constr. 2023, 154, 105002. [Google Scholar] [CrossRef]
  51. Schönberger, J.L.; Zheng, E.; Frahm, J.-M.; Pollefeys, M. Pixelwise view selection for unstructured multi-view stereo. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016. Proceedings, Part III 14. [Google Scholar]
  52. Liu, L.; Gu, J.; Zaw Lin, K.; Chua, T.S. Neural sparse voxel fields. Adv. Neural Inf. Process. Syst. 2020, 33, 15651–15663. [Google Scholar]
  53. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  54. Armeni, I.; Sener, O.; Zamir, A.R.; Jiang, H.; Brilakis, I.; Fischer, M.; Savarese, S. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  55. Wang, S.; Zhang, W.; Gasperini, S.; Wu, S.C.; Navab, N. VoxNeRF: Bridging voxel representation and neural radiance fields for enhanced indoor view synthesis. arXiv 2023, arXiv:2311.05289. [Google Scholar]
  56. Chen, Y.; Chen, Z.; Zhang, C.; Wang, F.; Yang, X.; Wang, Y.; Cai, Z.; Yang, L.; Liu, H.; Lin, G. Gaussianeditor: Swift and controllable 3d editing with gaussian splatting. arXiv 2023, arXiv:2311.14521. [Google Scholar]
  57. Jeon, Y.; Kulinan, A.S.; Tran, D.Q.; Park, M.; Park, S. Nerf-con: Neural radiance fields for automated construction progress monitoring. In Proceedings of the International Symposium on Automation and Robotics in Construction (ISARC), Lille, France, 3–5 June 2024. [Google Scholar]
  58. Zhu, Z.; Peng, S.; Larsson, V.; Cui, Z.; Oswald, M.R.; Geiger, A.; Pollefeys, M. Nicer-slam: Neural implicit scene encoding for rgb slam. In Proceedings of the 2024 International Conference on 3D Vision (3DV), Davos, Switzerland, 18–21 March 2024. [Google Scholar]
  59. Zhang, J.; Zhang, F.; Kuang, S.; Zhang, L. Nerf-lidar: Generating realistic lidar point clouds with neural radiance fields. Proc. AAAI Conf. Artif. Intell. 2024, 38, 7178–7186. [Google Scholar] [CrossRef]
  60. Chang, M.; Sharma, A.; Kaess, M.; Lucey, S. Neural radiance field with lidar maps. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023. [Google Scholar]
Figure 1. NeRF for 3D reconstruction.
Figure 2. Indoor spaces for experiment.
Figure 3. Directory structure of the dataset.
Figure 4. Render results using RF-inspired methods. (a) Original picture of the laboratory, (b) render results using Instant-NGP, (c) render results using Nerfacto, (d) render results using 3DGS.
Figure 5. Point clouds of reconstructed indoor scene.
Figure 6. Segmentation results by pre-trained PointNet (Front View).
Figure 7. Two video-capturing strategies used in this study.
Figure 8. Reconstructed scenes comparison among different video-capturing methods.
Figure 9. Reconstructed scenes comparison among different dataset sizes.
Figure 10. Application map for choosing the most suitable algorithm for certain tasks.
Table 1. Summary of conventional 3D reconstruction methods.
Category | Pros | Cons | Exemplified Methods
LiDAR-based | High accuracy; suitable for large-scale scenes; robust to environmental changes | High cost; difficult to operate; large volume of collected data; requires additional processing to generate dense models | Commercially available applications; [1]
RGB-D-based | Cost effective; easy to operate | Medium accuracy; sensitive to environmental changes | [2,3,4]
Photogrammetry-based | Cost effective; high resolution with color information | High sensitivity to environmental changes (e.g., lighting); requires complex algorithms to process | [5,6]
Table 2. Comparison between NeRF and 3DGS.
Methods | NeRF | 3DGS
Scene representation type | Implicit | Explicit
Scene representation | A deep neural network (MLP) | Center position μ(x, y, z); covariance matrix Σ; color and opacity (R, G, B, α)
Rendering method | Volume rendering | Gaussian rasterization
Optimization | Differentiable rendering | Gradient descent
Table 3. Performance comparison schemes.
Tasks | Methods Used for Comparison | Evaluation Metrics
Novel view synthesis | NeRF, NeRF's variants, and 3DGS | PSNR, SSIM, LPIPS
3D point cloud generation | SfM, MVS, NeRF, NeRF's variants, and 3DGS | Number of points, efficiency
3D point cloud segmentation | MVS, RF-based method with best view synthesis performance | Accuracy
Table 4. Quality evaluation metrics.
Category | Evaluation Metric | Explanation
Quality of 3D reconstruction | Accuracy | The average distance between two models
View synthesis | Peak Signal-to-Noise Ratio (PSNR) | Ratio of maximum pixel value to the root mean squared error
View synthesis | Structural Similarity Index Measure (SSIM) | Measures similarity in luminance, contrast, and structure between two images
View synthesis | Learned Perceptual Image Patch Similarity (LPIPS) | Measures feature similarity between image patches using a pre-trained network
Table 5. Configurations of experiment setup.
Room's Name | Laboratory | Computer-aided classroom
Floor plan shape | Rectangular | Heteromorphism
Room area | 20 m² | 25 m²
Data collection | 107 photos | 41 photos and 1 video
Image resolution | 3024 × 4032 | 1920 × 1080
Table 6. Performance comparison of radiance field-inspired methods.
Category | Reconstruction Method | PSNR | SSIM | LPIPS | No. Rays per Second | Training/Rendering Time
NeRF-based | Vanilla-NeRF | 12.9516 | 0.6390 | 0.8332 | 50,935 | ~17 h
NeRF-based | Mip-NeRF | 11.1231 | 0.6330 | 0.7073 | 40,526 | ~14 h
NeRF-based | Instant-NGP | 19.0023 | 0.6818 | 0.7090 | 56,520 | ~1 h
NeRF-based | Nerfacto | 18.9293 | 0.6955 | 0.6080 | 1,338,020 | ~30 min
3D Gaussian-based | 3DGS | 28.2287 | 0.7559 | 0.4405 | -- | ~30 min
Table 7. Performance comparison in terms of 3D point cloud generation.
Reconstruction Method | SfM (Sparse) | MVS (Dense) | MipNeRF | Nerfacto | Nerfacto (Meshed Poisson) | 3DGS
Number of Points | 14,422 | 962,847 | 100,028 | 31,001,213 | 992,125 | 1,574,318
Number of Rays | N/A (SfM) | N/A (MVS) | 12,192,768 (NeRF-based methods) | N/A (3DGS)
The "number of rays" indicator is only applicable for NeRF-based methods and thus is N/A for SfM, MVS, and 3DGS.
Table 8. View synthesis results comparison using different data capturing methods.
RF-Inspired Method | Data Capturing Method | PSNR | SSIM | LPIPS | No. of Rays per Second | Training/Rendering Time
Vanilla-NeRF | Walk-around | 12.9516 | 0.6390 | 0.8332 | 50,935 | ~9 h
Vanilla-NeRF | Pivot | 14.3469 | 0.6650 | 0.7088 | 48,667 | ~16.3 h
MipNeRF | Walk-around | 11.1231 | 0.6330 | 0.7073 | 40,526 | ~10.5 h
MipNeRF | Pivot | 13.9983 | 0.5907 | 0.7571 | 40,753 | ~15 h
Instant-NGP | Walk-around | 19.0023 | 0.6818 | 0.7090 | 56,520 | ~30 min
Instant-NGP | Pivot | 14.8730 | 0.7053 | 0.6039 | 44,270 | ~50 min
Nerfacto | Walk-around | 18.9293 | 0.6955 | 0.6080 | 1,338,020 | ~13 min
Nerfacto | Pivot | 17.9385 | 0.7617 | 0.3499 | 1,200,194 | ~32 min
3DGS | Walk-around | 28.2287 | 0.7559 | 0.4405 | N/A * | ~10 min
3DGS | Pivot | 35.9607 | 0.9746 | 0.0955 | N/A | ~20 min
* The No. of Rays per Second indicator is only applicable for NeRF-based methods and thus is N/A for 3DGS.
Table 9. View synthesis results comparison using different sizes of datasets.
RF-Inspired Method | Dataset Size | PSNR | SSIM | LPIPS | Training/Rendering Time
MipNeRF | 41 images | 12.4190 | 0.6852 | 0.6875 | ~16 h
MipNeRF | 860 images | 12.3500 | 0.5393 | 0.8302 | ~16 h
Instant-NGP | 41 images | 15.0179 | 0.6904 | 0.7255 | ~35 min
Instant-NGP | 860 images | 22.9552 | 0.7615 | 0.5054 | ~40 min
Nerfacto | 41 images | 17.0927 | 0.7099 | 0.5847 | ~10 min
Nerfacto | 860 images | 22.6608 | 0.8090 | 0.2677 | ~12 min
3DGS | 41 images | 30.7624 | 0.9246 | 0.2539 | ~10 min
3DGS | 860 images | 32.1640 | 0.9548 | 0.1268 | ~11 min
