1. Introduction
Differentiable rendering techniques have emerged as a popular research topic in the areas of virtual/augmented reality, computer vision, and computer graphics. The rendering process can be understood as a function that maps the parameters of geometries, materials, lights, and cameras to the image pixels’ intensities; i.e., the data flows from scene parameters to image pixels. In contrast, differentiable rendering techniques aim to propagate the gradients of image pixels to the scene parameters in the opposite direction; i.e., the data flows from image pixels to scene parameters.
The gradients obtained through differentiable rendering techniques can be paired with various optimization algorithms to solve inverse-rendering (also known as analysis-by-synthesis) problems such as reflectance and lighting estimation [1] and 3D reconstruction from one or multiple input photographs [2,3]. The gradients can also be propagated further into neural networks to learn the scene’s neural representations [4], enabling applications such as novel view synthesis [5,6], scene relighting [7], and scene editing [8].
When we have challenging inverse rendering problems at hand, high-quality gradients are required to ensure that the corresponding optimization process converges to a valid result. Note that the differentiable rendering techniques we use to obtain gradients are largely determined by the primary rendering process. Therefore, high-quality gradients come from a high-quality primary rendering process, and the rendering algorithms of the highest quality are undoubtedly global illumination algorithms.
Global illumination algorithms, obeying the rules of geometric optics, use Monte Carlo methods to solve the path integral and then render photorealistic images. However, the differentiable rendering techniques corresponding to global illumination algorithms, known as physics-based differentiable rendering methods [9,10,11], are much more complicated due to the high-order discontinuities contained in the integrand of the path integral, which combines the parameters of geometries, materials, lights, and cameras into a single entity.
Directly applying automatic differentiation techniques to global illumination algorithms will not yield physically correct gradients. Just like the Leibniz integral rule in calculus, the derivative of the path integral contains an additional boundary term due to the movement of discontinuities when the scene’s geometric parameters change.
There are two main methods to handle the movement of geometric discontinuities in the physics-based differentiable rendering community: boundary sampling methods [9,11,12,13,14] and reparameterization methods [10,15,16]. Boundary sampling methods aim to explicitly calculate the boundary term by importance sampling paths in the boundary sample space. Reparameterization methods, on the other hand, aim to avoid explicitly sampling boundary paths. They reparameterize the integral domain by tracing auxiliary rays such that the geometric discontinuities remain static when scene parameters change. Then, automatic differentiation techniques can be applied to obtain the physically correct gradients.
The main drawbacks of physics-based differentiable rendering techniques are the relatively long time required to obtain a low-variance estimate of the gradients and the significantly more complex theory compared to other local illumination-based techniques.
NeRF [5], a volume-rendering-based method, is one such technique: it ignores scattering in participating media and retains only emission and absorption. NeRF-like methods represent scenes using a volume function in the form of neural networks and render this volume using traditional volume rendering methods, except that they now obtain samples through neural networks. The original NeRF exhibits a series of inefficiencies, such as slow rendering speeds that also hinder the training process [17] and poor generalization to new scenes [18]. Many algorithms have been proposed to address these inefficiencies.
Inspired by the idea of representing scenes using volumes, 3D Gaussian splatting (3DGS) [6] has been proposed, which uses a discrete set of 3D Gaussians as base primitives. By avoiding queries to a neural network and leveraging the advantageous projection properties of 3D Gaussians, this technique achieves much higher rendering frame rates, which also considerably accelerates the training process. The main shortcomings of 3DGS include the large storage space required [19] and the difficulty in converting the discrete scene representation into more editable meshes [20].
This paper is organized as follows: Section 2 presents a classification based on the primary rendering algorithms employed. Next, the physics-based differentiable rendering methods, NeRF-like methods, and methods based on 3DGS are introduced and analyzed in Section 3, Section 4, and Section 5, respectively. Section 7 presents the conclusions and open research problems.
2. Algorithms
From the perspective of the primary rendering algorithms used, we provide a brief taxonomy of existing differentiable rendering methods, as illustrated in Figure 1. Note that we provide only a coarse taxonomy here, which is sufficient for the topics covered in this paper, i.e., physics-based, NeRF-based, and 3DGS-based differentiable rendering. Moreover, we list the representative works in these categories in Figure 2.
Besides NeRF-like methods, there are many other approaches included in the “neural-network-based” branch in Figure 1, such as [21,22,23,24], which are based on voxel or point cloud representations. Under the “rasterization-based” branch, there are also many other techniques that are based on traditional rasterizers, such as [25,26]. A survey of these methods is likewise beyond the scope of this paper.
Figure 1. A brief taxonomy of existing differentiable rendering methods. Figures come from [5,6,10,11].
Figure 2. Selected representative works in the area of differentiable rendering [9,10,11,14,15,16,18,19,20,27,28,29,30,31,32,33,34,35,36,37,38,39,40].
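We start with the most elaborate formula used in primary rendering algorithms and classify the review topics covered in this paper based on how they relate to this formula. The formula is the equation of transfer that governs the behavior of light in participating media containing surfaces:
$$L_i(p, \omega) = \int_0^D T_r(p' \to p)\, \sigma_s(p') \int_{S^2} P(p', -\omega, \omega')\, L_i(p', \omega')\, \mathrm{d}\omega'\, \mathrm{d}t + L_s(p, \omega), \tag{1}$$
where $L_i(p, \omega)$ is the incoming radiance at point $p$ in direction $\omega$, $p' = p + t\omega$ is a point along the ray, $T_r$ is the beam transmittance, which is the fraction of radiance transmitted between two points, $D$ is the distance from point $p$ to the medium’s boundary in the direction of $\omega$, $\sigma_s$ is the medium’s scattering coefficient, $P$ is the phase function, which describes the angular distribution of scattered radiation at a point, and $L_s$ is the source term defined by:
$$L_s(p, \omega) = \int_0^D T_r(p' \to p)\, \sigma_a(p')\, L_e(p', -\omega)\, \mathrm{d}t + T_r(p_0 \to p) \left[ \int_{S^2} f(p_0, -\omega, \omega')\, L_i(p_0, \omega')\, \mathrm{d}\omega' + L_e^{s}(p_0, -\omega) \right], \tag{2}$$
where $\sigma_a$ is the absorption coefficient, $L_e$ is the medium’s radiant emission, $p_0 = p + D\omega$ is the intersection point where the ray originating from point $p$ and traveling in direction $\omega$ meets the medium’s boundary, $f$ is the cosine-weighted BSDF, and $L_e^{s}$ is the interfacially emitted radiance.
The formulas above take both volume scattering and surface interactions into account and are the basis of the physics-based differentiable rendering techniques that will be reviewed in Section 3. If we ignore the contributions from surface interactions and scattering in participating media, the resulting volume rendering formula is:
$$L_i(p, \omega) = \int_0^D T_r(p' \to p)\, \sigma_a(p')\, L_e(p', -\omega)\, \mathrm{d}t, \tag{3}$$
which forms the basis of NeRF-like methods and will be reviewed in Section 4. If the scene representation we use is no longer volumetric but is instead based on discrete points $p_1, \ldots, p_N$ ordered by increasing distance from $p$, each covering a ray segment of length $\Delta t_k$, then the aforementioned formula can be rewritten as follows:
$$L_i(p, \omega) = \sum_{k=1}^{N} T_r(p_k \to p)\, \sigma_a(p_k)\, L_e(p_k, -\omega)\, \Delta t_k. \tag{4}$$
We can further rewrite the above formula by utilizing the multiplicative property of the beam transmittance term $T_r$:
$$T_r(p_k \to p) = \prod_{j=1}^{k-1} T_j,$$
where $T_j$ is the transmittance across the segment associated with $p_j$. Then we can obtain:
$$L_i(p, \omega) = \sum_{k=1}^{N} c_k\, \alpha_k \prod_{j=1}^{k-1} (1 - \alpha_j), \tag{5}$$
where we replace the transmittance term $T_j$ with $1 - \alpha_j$, with $\alpha_k$ the alpha blending coefficient standing in for $\sigma_a(p_k)\, \Delta t_k$ and $c_k = L_e(p_k, -\omega)$ the emitted color. Then we obtain the formula utilized by 3DGS methods, which will be reviewed in Section 5.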
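The alpha blending form used by 3DGS methods can be illustrated with a few lines of Python. This is only a minimal sketch, assuming the standard front-to-back compositing rule in which each primitive contributes its color weighted by its own opacity and by the transmittance left over from the primitives in front of it; the function name `composite_front_to_back` and the sample values are hypothetical.

```python
def composite_front_to_back(colors, alphas):
    """Blend colors sorted from nearest to farthest.

    Computes sum_k c_k * alpha_k * prod_{j<k} (1 - alpha_j) and also
    returns the transmittance that remains behind the last primitive.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked
    for color, alpha in zip(colors, alphas):
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


# Illustrative input: a half-opaque red primitive in front of a half-opaque green one.
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
alphas = [0.5, 0.5]
pixel, remaining = composite_front_to_back(colors, alphas)
```

Because the loop only multiplies and adds, every step is differentiable with respect to both the colors and the opacities, which is what makes this form convenient for gradient-based optimization.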
3. Physics-Based Differentiable Rendering
Physics-based forward rendering aims to synthesize photorealistic images by simulating light transport in a manner consistent with physical laws using Equation (1). In contrast, physics-based differentiable rendering (PBDR) aims to propagate the derivatives of image pixels back to the scene parameters, which is a challenging problem due to the high-order discontinuities inherent in the scene function. Given the gradients with respect to these scene parameters, one of the most important applications of PBDR is inverse rendering, i.e., reconstructing geometries, materials, or emitters from a set of input photographs, as illustrated in Figure 3.
Note that there are several works [12,13] using the full form of Equation (1), but the main difficulties encountered in PBDR arise from the rendering part that involves geometric surfaces. The rendering part involving participating media is fully continuous and thus can be differentiated easily. Therefore, we will focus on scenes composed entirely of surfaces. In the following subsections, we will first briefly introduce the theory behind physics-based rendering. Then, we will discuss two categories of methods in PBDR, which are classified based on their approach to handling discontinuities: boundary sampling methods and reparameterization methods.
3.1. Physics-Based Rendering Preliminaries
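Since the main difficulties in PBDR arise from geometric surfaces rather than the participating media, we will focus exclusively on scenes composed entirely of surfaces. Adding participating media back is straightforward [12,13], as the corresponding rendering part is continuous. Ignoring the participating media term in Equation (1) gives the rendering equation:
$$L_o(p, \omega_o) = L_e(p, \omega_o) + \int_{S^2} f(p, \omega_o, \omega_i)\, L_i(p, \omega_i)\, \mathrm{d}\omega_i, \tag{6}$$
which describes the local behavior of light, such as the reflection or transmission, at a point on the geometric surface. Since no medium is present, $L_i(p, \omega_i)$ equals the outgoing radiance $L_o$ at the first surface point visible from $p$ in direction $\omega_i$, so the above equation can be solved by iteratively replacing $L_i$ in the right-hand side with the function in the left-hand side. The final solution can be expressed as a Neumann series:
$$L_o(p, \omega_o) = \sum_{N=0}^{\infty} L_N(p, \omega_o), \tag{7}$$
where $L_N$ stands for the radiance contribution from paths of length $N$:
$$L_N(p, \omega_o) = \int_{S^2} \cdots \int_{S^2} \left[ \prod_{k=0}^{N-1} f(x_k, -\omega_k, \omega_{k+1}) \right] L_e(x_N, -\omega_N)\, \mathrm{d}\omega_1 \cdots \mathrm{d}\omega_N, \tag{8}$$
with $x_0 = p$, $\omega_0 = -\omega_o$, and $x_{k+1}$ the first surface point visible from $x_k$ in direction $\omega_{k+1}$.
The resulting radiance function $L_o$, written as $L(u)$ for the surface point seen through the image-plane position $u$, is then convolved with a pixel reconstruction filter function $h_j$ to obtain the final intensity of pixel $j$:
$$I_j = \int_{\mathbb{R}^2} h_j(u)\, L(u)\, \mathrm{d}u. \tag{9}$$
The rendering equation can also be expressed as a surface form or an integral on object surfaces $\mathcal{M}$:
$$L_o(p' \to p) = L_e(p' \to p) + \int_{\mathcal{M}} f(p'' \to p' \to p)\, L_o(p'' \to p')\, G(p'' \leftrightarrow p')\, \mathrm{d}A(p''), \tag{10}$$
where $G$ is the geometric term. Expanding the formula again, we obtain the widely used path integral:
$$I_j = \int_{\Omega} f_j(\bar{x})\, \mathrm{d}\mu(\bar{x}), \tag{11}$$
where $\Omega$ is the space of light paths $\bar{x}$ of all lengths, $f_j$ is the measurement contribution function of pixel $j$, and $\mu$ is the area-product measure.
Given the physics-based rendering formulas above, we still need a numerical integration method to calculate the corresponding integrals. Since these integrals tend to be very high-dimensional, Monte Carlo methods are preferred over other numerical integration methods. Specifically, Monte Carlo methods use random numbers to evaluate the integral:
$$I = \int_{\Omega} g(x)\, \mathrm{d}x. \tag{12}$$
Employing the Monte Carlo estimator, we have:
$$\langle I \rangle = \frac{1}{M} \sum_{i=1}^{M} \frac{g(X_i)}{p(X_i)}. \tag{13}$$
Note that the points $X_i$ need to be sampled according to the probability density function $p$. In practice, to make the Monte Carlo estimator converge as quickly as possible, we need to match the shape of $p$ with that of $g$ as closely as possible.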
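The effect of matching the sampling density to the integrand can be sketched on a one-dimensional toy problem. The example below is an illustration under stated assumptions rather than a rendering algorithm: the integrand 3x^2 on [0, 1], the helper `mc_estimate`, and the sample counts are all hypothetical choices.

```python
import math
import random


def mc_estimate(g, sample, pdf, n, rng):
    """Average g(X) / p(X) over n samples X drawn from the density p."""
    total = 0.0
    for _ in range(n):
        x = sample(rng)
        total += g(x) / pdf(x)
    return total / n


def g(x):
    """Toy integrand on [0, 1]; its exact integral is 1."""
    return 3.0 * x * x


rng = random.Random(0)

# Uniform sampling: p(x) = 1 on [0, 1].
uniform = mc_estimate(g, lambda r: r.random(), lambda x: 1.0, 2000, rng)

# Importance sampling with p(x) = 2x, drawn by inverting the CDF x^2.
# 1 - r.random() lies in (0, 1], so the density is never zero at a sample.
importance = mc_estimate(
    g,
    lambda r: math.sqrt(1.0 - r.random()),
    lambda x: 2.0 * x,
    2000,
    rng,
)
```

Both estimators are unbiased, but the second one averages the ratio 1.5x, which varies far less than 3x^2 and therefore yields a lower-variance estimate for the same number of samples.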
With these tools at hand, the next task is to calculate the corresponding differentials. However, automatic differentiation techniques cannot be applied directly due to the discontinuities in the integrand of Equations (8), (9) and (11). Based on how the discontinuities in these high-dimensional integrals are handled, existing PBDR methods can be classified into two categories: boundary sampling methods and reparameterization methods, which will be reviewed in the following subsections.
When comparing the performance of two PBDR methods, the commonly used metrics are the variance of the Monte Carlo estimators and the rendering time in seconds; notably, there are generally no common datasets used in the field of PBDR, as the scenes for comparison are typically built by the authors themselves.
3.2. Boundary Sampling Methods
Li et al. [9] presented the first physics-based differentiable rendering technique for scenes composed of triangle meshes using an edge sampling method. They realized that the automatic differentiation technique cannot be used directly to compute the gradient of pixel intensity resulting from global illumination techniques such as path tracing due to the discontinuity of the integrand. The Heaviside step function is utilized to partition the discontinuous integrand into many small parts, ensuring that the corresponding function is continuous within each small domain. Then, the pixel intensity $I$ can be written as:
$$I = \iint \sum_i \theta(\alpha_i(x, y))\, f_i(x, y)\, \mathrm{d}x\, \mathrm{d}y, \tag{14}$$
where $f_i$ is the scene function, $\theta$ is the Heaviside step function, $\alpha_i$ is the edge equation corresponding to the discontinuity’s location in the scene function, and the summation is over the subdivided small domains.
Recall that the derivative of the Heaviside step function is the Dirac delta function, so the derivative of pixel intensity $I$ with respect to scene parameter $\pi$ can be written as:
$$\frac{\partial I}{\partial \pi} = \sum_i \iint \left[ \theta(\alpha_i(x, y))\, \frac{\partial f_i(x, y)}{\partial \pi} + \delta(\alpha_i(x, y))\, \frac{\partial \alpha_i(x, y)}{\partial \pi}\, f_i(x, y) \right] \mathrm{d}x\, \mathrm{d}y, \tag{15}$$
where $\delta$ is the Dirac delta function.
A 2D integral containing the Dirac delta function in the integrand is actually a 1D integral over the domains where the Dirac delta function has non-zero values, so the first physically correct formula for the derivative of the pixel intensity is obtained as follows:
$$\frac{\partial I}{\partial \pi} = \sum_i \iint \theta(\alpha_i(x, y))\, \frac{\partial f_i(x, y)}{\partial \pi}\, \mathrm{d}x\, \mathrm{d}y + \sum_i \int_{\alpha_i(x, y) = 0} \frac{\partial \alpha_i(x, y) / \partial \pi}{\lVert \nabla_{x, y}\, \alpha_i(x, y) \rVert}\, f_i(x, y)\, \mathrm{d}\sigma(x, y), \tag{16}$$
where the second integral is over the triangle edges corresponding to the discontinuity’s location in the scene function.
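Why the boundary term matters can be seen on a one-dimensional analogue, where the edge integral collapses to a single point evaluation. The sketch below is a toy example and not the estimator of Li et al.: it assumes the integrand H(theta - x) * x * theta on [0, 1], whose integral is theta^3 / 2, and the helper names are hypothetical.

```python
import random


def interior_term(theta, n, rng):
    """Monte Carlo estimate of the integral of H(theta - x) * d/dtheta (x * theta)."""
    total = 0.0
    for _ in range(n):
        x = rng.random()
        if x < theta:      # Heaviside factor H(theta - x)
            total += x     # derivative of the smooth part x * theta
    return total / n


def boundary_term(theta):
    """Jump of the integrand across the edge x = theta times the edge velocity."""
    inside = theta * theta  # integrand just left of the edge
    outside = 0.0           # integrand just right of the edge
    edge_velocity = 1.0     # d(edge position) / d(theta)
    return (inside - outside) * edge_velocity


theta = 0.6
rng = random.Random(1)
naive = interior_term(theta, 20000, rng)   # what differentiating only the integrand gives
full = naive + boundary_term(theta)        # interior term plus boundary term
exact = 1.5 * theta ** 2                   # analytic derivative of theta^3 / 2
```

The interior term alone converges to theta^2 / 2, which is only a third of the true derivative here; adding the boundary contribution recovers the correct value.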
For scenes composed of triangle meshes, the edges that cause discontinuities in the scene function come from three sources, as shown in Figure 4. The boundary edges belong to the topological boundary of the triangle mesh. If the mesh has no topological boundary, i.e., it is closed, then it has no boundary edges that may contribute to the discontinuities of the scene function. The silhouette edges correspond to the occlusion of one mesh over another mesh or self-occlusion, causing the shading to change suddenly from one side of the edge to the other. Finally, the sharp edges result from the discontinuous face normals on the mesh if smooth shading using interpolated normals is disabled.
Given the physically correct formula Equation (16) for obtaining gradients of pixel colors rendered using global illumination algorithms, the next step is to design a Monte Carlo estimator and an efficient importance sampling scheme to put the theory into practice.
To reduce the variance of the corresponding Monte Carlo estimator as much as possible, the shape of the probability density function needs to closely match the shape of the contribution function.
For the former part of Equation (16), traditional importance sampling methods such as next-event estimation and multiple importance sampling techniques can be applied directly. Recall that arbitrarily long light paths need to be evaluated for global illumination algorithms. Therefore, the contribution of the latter part of Equation (16) can be further divided into two parts: one corresponding to the primary visibility, i.e., the first segment of light paths, and the other corresponding to higher-order visibility, i.e., the subsequent segments of light paths.
We can precompute all the triangle edges that may contribute to geometric discontinuities and importance sample them for the primary visibility case. The main challenge comes from the higher-order visibility case.
In this case, we need to importance sample geometric discontinuities viewed from arbitrary shading points in the scene, which is a much more complicated task than the primary visibility case.
Li et al. [9] employ a 6D Hough tree, which takes both vertex positions and normals into account, for the importance sampling task. However, this pioneering importance sampling scheme does not scale well to scenes with high complexity and does not match the shape of the contribution function closely. This results in a Monte Carlo estimator with relatively large variance.
A series of methods have been proposed to improve upon the above technique. Zhang et al. [12] extend the edge sampling method to scenes that contain participating media, supporting arbitrary surface and volumetric configurations. However, their approach remains confined to the framework of sampling edges from given shading points, which is considered inefficient from a modern viewpoint.
The next milestone in the development of boundary sampling techniques is attributed to the work of Zhang et al. [11]. They utilize a so-called transport relation that originated in fluid mechanics to establish a mathematical framework that is used in modern techniques. They also introduce a multi-directional form of the boundary integral, allowing for silhouette paths to be generated “from the middle”, which significantly reduces the variance of the gradient estimator with less computation time, as illustrated in Figure 5.
Zhang et al. [13] extend the aforementioned method to scenes containing participating media. Yan et al. [41] employ adaptive data structures to guide the sampling process within the boundary sample space. They also propose an edge-sorting algorithm to reorganize the boundary sample space to further improve the sampling efficiency, as illustrated in Figure 6. Zhang et al. [14] re-derive the local formulation of the perimeter and propose the first local formulation of the interior. They also find that the calculation of the boundary term can greatly benefit from information gathered during the interior term’s simulation. Specifically, they project the rays that intersect the shape that is being differentiated onto the corresponding geometry boundary. Then, they calculate these projected rays’ contributions to the boundary integral to initialize the guiding structure, thereby reducing the overall variance, as illustrated in Figure 7.
3.3. Reparameterization Methods
The main inefficiency of the edge sampling method proposed by Li et al. [9] arises from the challenging task of importance sampling of geometric discontinuities viewed from arbitrary shading points in the scene. Reparameterization methods instead try to avoid sampling these discontinuities explicitly by reparameterizing the integral domain.
Recall that the pixel intensity $I$ is the result of an integral that typically contains discontinuities due to the discontinuous visibility function. Moreover, the geometric discontinuities move in the integral domain when the geometric parameter $\pi$ changes.
The core idea of reparameterization methods can be understood as subdividing the integral domain into small parts such that the scene function is continuous over each part and then reparameterizing the scene function over each subdomain. Note that the boundary of each small integral domain needs to match the moving geometric discontinuities to ensure the validity of the reparameterization. Then, we can rewrite the pixel intensity as follows:
$$I = \sum_i \int_{D_i(\pi)} f(\mathbf{x}, \pi)\, \mathrm{d}\mathbf{x}, \tag{17}$$
where $D_i(\pi)$ represents the subdivided subdomains whose boundaries match the movement of the geometric discontinuities.
Suppose we have a change-of-variables scheme at hand that maps a fixed domain $\hat{D}_i$ onto $D_i(\pi)$:
$$\mathbf{x} = T_i(\mathbf{y}, \pi), \quad \mathbf{y} \in \hat{D}_i, \tag{18}$$
then we have:
$$I = \sum_i \int_{\hat{D}_i} f(T_i(\mathbf{y}, \pi), \pi)\, \left| \det J_{T_i}(\mathbf{y}, \pi) \right| \mathrm{d}\mathbf{y}, \tag{19}$$
where $J_{T_i}$ is the Jacobian of $T_i$ with respect to $\mathbf{y}$. Although the discontinuities still exist in the integrand, they no longer move when the geometric parameter $\pi$ changes. Therefore, we can now apply automatic differentiation techniques to compute the derivative of pixel intensities.
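The reparameterization idea can be sketched on a one-dimensional toy integrand, H(theta - x) * x * theta on [0, 1], whose derivative with respect to theta is 1.5 * theta^2. Substituting x = theta * y pins the discontinuity at y = 1, after which plain automatic differentiation of the integrand suffices. The tiny forward-mode `Dual` class and the function names below are hypothetical helpers written for this illustration; they are not part of any published method.

```python
import random


class Dual:
    """Minimal forward-mode AD number: a value and its derivative w.r.t. theta."""

    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

    __rmul__ = __mul__


def reparam_integrand(y, theta):
    """Integrand after x = theta * y: H(1 - y) * (x * theta) * |dx/dy|."""
    if y >= 1.0:          # the discontinuity sits at y = 1 for every theta
        return Dual(0.0)
    x = theta * y
    jacobian = theta      # dx/dy
    return x * theta * jacobian


def derivative_estimate(theta_value, n, rng):
    """Average the AD derivative of the reparameterized integrand over y in [0, 1)."""
    theta = Dual(theta_value, 1.0)   # seed d(theta)/d(theta) = 1
    total = 0.0
    for _ in range(n):
        y = rng.random()             # the integrand vanishes for y >= 1 when theta < 1
        total += reparam_integrand(y, theta).dot
    return total / n


estimate = derivative_estimate(0.6, 20000, random.Random(2))
exact = 1.5 * 0.6 ** 2
```

No boundary term is evaluated anywhere: the Jacobian factor introduced by the change of variables carries the contribution that the moving discontinuity would otherwise require.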
In practice, we need to trace many rays of arbitrary lengths to estimate the derivative. This would result in a huge computational overhead if we explicitly subdivide the integral domain for each ray segment. Therefore, all existing reparameterization methods propose their own reparameterization schemes without knowing the boundaries of the subdomains in Equation (19).
The first work in a series of reparameterization studies was proposed by Loubet et al. [15]. They found that the movement of the geometric discontinuities in a small spherical integral domain can be well-approximated with a simple spherical rotation transform, as illustrated in Figure 8. For scene functions with large supports, they convolved the function with a convolution kernel and used spherical rotations to approximate the movement of geometric discontinuities within the small support of the kernel function at the cost of increased variance.
Given the small domain to reparameterize the convolved scene function, auxiliary rays need to be emitted to determine the suitable changes of variables. The raw results obtained through their reparameterization scheme exhibit high variance. Control variates and partially correlated pairs of paths are used to reduce the overall variance.
The main drawback of this reparameterization method is that the final Monte Carlo estimator is biased due to the approximation the authors used.
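The first unbiased reparameterization method for physics-based differentiable rendering was proposed by Bangaru et al. [10]. To better understand their work, we need to go further with Equation (19). Writing $T_i$ for the change of variables and $\hat{D}_i$ for its fixed domain, we have:
$$\left. \frac{\partial I}{\partial \pi} \right|_{\pi_0} = \sum_i \int_{\hat{D}_i} \left[ \frac{\partial}{\partial \pi} f(T_i(\mathbf{y}, \pi), \pi) + f(T_i(\mathbf{y}, \pi), \pi)\, \frac{\partial}{\partial \pi} \left| \det J_{T_i}(\mathbf{y}, \pi) \right| \right]_{\pi = \pi_0} \mathrm{d}\mathbf{y}, \tag{20}$$
where we assume that the absolute value of the Jacobian determinant in Equation (19) is 1 when evaluated at $\pi_0$, which is true in existing reparameterization methods.
By using the result in Magnus et al. [42], we can rewrite the derivative of the above Jacobian determinant as the divergence of the underlying mapping:
$$\left. \frac{\partial}{\partial \pi} \left| \det J_{T_i}(\mathbf{y}, \pi) \right| \right|_{\pi = \pi_0} = \nabla_{\mathbf{y}} \cdot V_i(\mathbf{y}), \quad V_i(\mathbf{y}) = \left. \frac{\partial T_i(\mathbf{y}, \pi)}{\partial \pi} \right|_{\pi = \pi_0}, \tag{21}$$
where $V_i$ is referred to as the warp field in the work of Bangaru et al. [10].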
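The step from the Jacobian determinant to a divergence can be spelled out with Jacobi’s formula for the derivative of a determinant. The short derivation below is a sketch that assumes the mapping $T$ reduces to the identity at $\pi_0$, so that its Jacobian $J = \partial T / \partial \mathbf{y}$ is the identity matrix there; the symbols $T$, $J$, $V$, and $\pi_0$ are notation chosen for this illustration.

```latex
\frac{\partial}{\partial \pi} \det J
  = \det J \, \operatorname{tr}\!\left( J^{-1} \frac{\partial J}{\partial \pi} \right)
\quad \Longrightarrow \quad
\left. \frac{\partial}{\partial \pi} \det J \right|_{\pi = \pi_0}
  = \operatorname{tr}\!\left( \frac{\partial}{\partial \mathbf{y}}
      \left. \frac{\partial T}{\partial \pi} \right|_{\pi = \pi_0} \right)
  = \nabla_{\mathbf{y}} \cdot V,
\qquad
V = \left. \frac{\partial T}{\partial \pi} \right|_{\pi = \pi_0}.
```

The trace of the spatial Jacobian of a vector field is exactly its divergence, which is why only the velocity field $V$, and not the full mapping $T$, needs to be estimated.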
Instead of directly seeking suitable changes of variables, they choose to estimate the warp field using a Monte Carlo estimator. Recall that to ensure the validity of the reparameterization, the boundaries of subdomains need to match the movement of the geometric discontinuities. The same requirement applies to the induced warp field.
Applying automatic differentiation techniques directly to the ray tracing process can generate a warp field. However, this warp field does not satisfy the above requirement in the case of silhouette edges. Imagine one object blocking another; the projected silhouette edges move with velocities that are determined by the occluder. However, the warp field obtained through the ray tracing process is determined entirely by the object being blocked.
To achieve valid reparameterization, an additional operation is needed to transform it into a warp field that matches the movement of the silhouette edges.
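Bangaru et al. [10] convolve the warp field $V$ obtained through the ray tracing process with a weight function $w$, which is estimated by tracing additional rays, to obtain the final warp field:
$$\hat{V}(\mathbf{y}) = \frac{\int w(\mathbf{y}, \mathbf{y}')\, V(\mathbf{y}')\, \mathrm{d}\mathbf{y}'}{\int w(\mathbf{y}, \mathbf{y}')\, \mathrm{d}\mathbf{y}'}, \tag{22}$$
where the weight function $w$ has the property:
$$\lim_{\mathbf{y} \to \mathbf{y}_b} \frac{w(\mathbf{y}, \mathbf{y}')}{\int w(\mathbf{y}, \mathbf{y}'')\, \mathrm{d}\mathbf{y}''} = \delta(\mathbf{y}' - \mathbf{y}_b), \tag{23}$$
where $\mathbf{y}_b$ is a point on a geometric discontinuity and $\delta$ is the Dirac delta function. After the convolution, we obtain a valid warp field, and Equation (21) can be used to calculate the derivatives of scene parameters in an unbiased way, as illustrated in Figure 9. Xu et al. [16] extend the above method to more advanced primary Monte Carlo rendering techniques, such as bidirectional path tracing, based on a new formulation for reparameterized differential path integrals. They also introduce a new distance function to further reduce the variance of the final gradient estimator, as illustrated in Figure 10.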
4. Neural Radiance Field
Compared to physics-based differentiable rendering techniques, which simulate light transport between scene surfaces for every detail, neural radiance fields (NeRFs) [5] use an approximate method to model the real world. Specifically, NeRF-like methods model scenes primarily composed of 2D surfaces with participating media only. In these methods, the rendering integral is completely continuous, so the hard-to-handle discontinuities encountered in PBDR methods disappear. Due to the approximations used in NeRF-like methods, the single-object reconstruction results do not achieve the same level of precision as PBDR methods. However, thanks to these approximations, NeRF-like methods can reconstruct scenes with many complex geometries or even in-the-wild scenes using a few photographs, which is something that current PBDR methods cannot accomplish.
The core idea of NeRFs is to employ a neural network to implicitly represent the radiance field of a three-dimensional scene. Specifically, NeRFs utilize a multilayer perceptron (MLP) to map a pair of a position and direction to a color and volume density corresponding to the emission term and absorption term in the participating media. Points along the ray that originate from sensors are then sampled to calculate the rendering integral, resulting in the final pixel intensity.
NeRFs have successfully modeled scenes as radiance fields, enabling the synthesis of high-quality images from new perspectives for scenes with complex geometries and appearances. These representations have been rapidly extended and have been applied to numerous graphics and vision tasks, including generative modeling, surface reconstruction, appearance editing, and motion capture. These advancements facilitate applications in various fields such as robotics, autonomous navigation, scene inpainting [43], and virtual/augmented reality.
In the following subsections, we will first briefly introduce the theory behind NeRFs. Then, we will successively discuss the improvements made since the original NeRF work. Note that there is an enormous amount of research focusing on improving NeRF in various aspects such as view synthesis under fuzzy input conditions [44], thin structures [45], reconstruction of outdoor scenes [46], drone-captured scenes [47], large-scale scenes [48], and RGB-D-captured scenes [49]. Since there are already thorough surveys of NeRF-based methods [50,51,52,53], we will focus primarily on PBDR and only briefly review NeRF-based methods from three aspects for completeness: lowering acquisition costs, increasing rendering speeds, and enhancing generalization capabilities.
4.1. NeRF Preliminaries
The original NeRF represents a scene as a 5D vector-valued function described by the following mapping:
$$F_{\Theta} : (\mathbf{x}, \mathbf{d}) \to (\mathbf{c}, \sigma). \tag{24}$$
Here, $\mathbf{x}$ represents the 3D coordinates of a point in the scene, and $\mathbf{d}$ is the camera’s viewing direction. The function outputs $\mathbf{c}$, the color emitted by the point $\mathbf{x}$ in direction $\mathbf{d}$, and $\sigma$, which is the volume density and is a quantity related to absorption in the participating media. The function $F_{\Theta}$ is defined by a deep, fully connected neural network known as a multilayer perceptron (MLP). Specifically, the volume density $\sigma$ is only related to the position $\mathbf{x}$, whereas the color $\mathbf{c}$ is affected by both the position $\mathbf{x}$ and the direction $\mathbf{d}$. To help the neural network better capture and represent high-frequency details in the scene, the positional encoding technique [54] is used to map low-dimensional input coordinates to a high-dimensional space. In this way, NeRF is able to generate high-quality and realistic images across different viewing angles by using the following rendering process.
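The encoding step can be sketched as follows. This is a minimal illustration assuming the sinusoidal form commonly used with NeRF, in which each coordinate is expanded into sine and cosine features at frequencies that double from one band to the next; the function name and the number of bands are hypothetical.

```python
import math


def positional_encoding(p, num_freqs):
    """Map a scalar p to [sin(2^k * pi * p), cos(2^k * pi * p)] for k = 0..num_freqs-1."""
    features = []
    for k in range(num_freqs):
        angle = (2.0 ** k) * math.pi * p
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


# Each input coordinate becomes 2 * num_freqs features.
encoded = positional_encoding(0.25, 4)
```

In practice the encoding would be applied to every component of the position and direction separately, and the concatenated features would be fed to the MLP instead of the raw coordinates.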
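Given a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ passing through an image pixel in the world coordinate space, with near and far bounds $t_n$ and $t_f$, we use the following integral to calculate the corresponding intensity:
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, \mathrm{d}t, \quad T(t) = \exp\left( -\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, \mathrm{d}s \right), \tag{25}$$
where $T(t)$ represents the fraction of energy retained when a photon travels from $t_n$ to $t$. Then, points along the ray are sampled, and the corresponding color and volume density are obtained through the neural network. The final pixel intensity is calculated using Equation (25). The whole pipeline is illustrated in Figure 11.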
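The numerical evaluation of this integral along one ray can be sketched as below. The sketch assumes the usual piecewise-constant quadrature, in which each sample's density is converted into a segment opacity; the function `render_ray` and its arguments are hypothetical names, and in a real pipeline the densities and colors would come from the trained network rather than from fixed lists.

```python
import math


def render_ray(sigmas, colors, deltas):
    """Quadrature of the emission-absorption integral along a single ray.

    sigmas: densities at the samples, ordered from near to far
    colors: RGB tuples at the samples
    deltas: distances between consecutive samples
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light surviving up to the current sample
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha            # equals exp(-sigma * delta)
    return pixel


# Empty space followed by a nearly opaque blue sample.
pixel = render_ray([0.0, 50.0], [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)], [1.0, 1.0])
```

Samples with zero density contribute nothing and leave the transmittance untouched, so the pixel takes the color of the first dense region the ray reaches.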
When comparing the performance of two NeRF-based methods, commonly used datasets include the DTU dataset [55] (carefully calibrated camera poses), LLFF dataset [56] (handheld cellphone images), NeRF Synthetic dataset [5] (Blender-generated), Redwood dataset [57] (over ten thousand 3D scans), Mip-NeRF 360 dataset [58] (scenes containing a complex central object), Tanks and Temples dataset [59] (high-quality laser-scanned scenes), ICL-NUIM dataset [60] (synthetic indoor scenes), and ScanNet dataset [61] (over 2.5 million indoor scene views with semantic labels), while commonly used visual quality assessment metrics include PSNR (an approximation of human perception of reconstruction quality), SSIM [62] (a measure of structural similarity), and LPIPS [63] (a measure based on deep features).
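Of these metrics, PSNR is simple enough to state in a few lines. The sketch below assumes the standard definition, 10 * log10(MAX^2 / MSE), applied to flattened pixel values in [0, 1]; the function name and the `max_value` parameter are hypothetical.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and equally sized")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
score = psnr([0.0, 0.0, 0.0], [0.1, 0.1, 0.1])
```

Higher values indicate a closer match; SSIM and LPIPS require windowed statistics and a pretrained network, respectively, and are therefore not reproduced here.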
4.2. NeRF with Lower Acquisition Costs
Although a NeRF can synthesize high-quality images, it requires a large number of images to learn the 3D structure and lighting information of a scene. If the number of input views is insufficient, the NeRF may fail to fully capture the complex details and variations of the scene, resulting in a decrease in the quality of the synthesized images.
DS-NeRF [27] achieves enhanced training efficiency and improved rendering quality in sparse view settings by introducing a novel loss function for learning radiance fields that takes advantage of readily available depth supervision. This approach utilizes the sparse 3D points generated by structure-from-motion (SFM) as a source of “free” depth information. During training, this additional supervision signal helps the model learn more accurate scene geometries. A novel loss function is introduced to align the distribution of a ray’s terminating depth with specified 3D keypoints, incorporating depth uncertainty to anchor the learning process to the 3D geometry. The comparison results in Figure 12, Figure 13, and Table 1 demonstrate that DS-NeRF delivers higher-quality view synthesis with fewer input views. Note that the compared methods, MetaNeRF [64] and PixelNeRF [28], both use data-driven priors recovered from a domain of training scenes to fill in missing information from test scenes, allowing them to use fewer images to recover the scene.
RegNeRF [65] finds that errors in estimated scene geometry and the divergent behavior at the start of training lead to lower rendering quality with sparse input views. The authors then propose to regularize the geometry and appearance of patches rendered from unobserved viewpoints to improve rendering quality. Additionally, a normalizing flow model is used to regularize the color of unobserved viewpoints, which also plays an important role in the final rendering results.
FreeNeRF [29] finds that frequency plays an important role in NeRF’s training under the few-shot setting. This paper introduces an innovative frequency regularization technique that significantly improves NeRF’s performance with fewer training views. The core of the technique lies in two key regularization strategies: First, frequency regularization controls the range of visible frequencies in NeRF’s input through a simple, linearly increasing frequency mask. This prevents overfitting to high-frequency signals during the early stages of training. Second, occlusion regularization mitigates the “floating artifacts” in novel view synthesis by penalizing high-density regions near the camera, effectively reducing these artifacts. The comparison results in Figure 14, Figure 15, and Table 2 demonstrate that FreeNeRF consistently outperforms state-of-the-art methods across multiple datasets. Moreover, some methods, like NeuralLift-360 [66], Sherf [67], and Dip-NeRF [68], combine NeRF with additional information, such as depths, to reconstruct scenes with few input views.
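The linearly increasing frequency mask mentioned for FreeNeRF can be illustrated with a simplified schedule. The sketch below is an assumption-laden illustration, not the authors' exact formula: it opens the frequency bands of the input encoding one after another in proportion to training progress, and the function name and arguments are hypothetical.

```python
def frequency_mask(step, total_steps, num_freqs):
    """Per-band weights in [0, 1] that open up linearly with training progress."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    visible = num_freqs * progress  # number of bands opened so far, possibly fractional
    # Band k is fully visible once visible >= k + 1 and partially visible in between.
    return [min(max(visible - k, 0.0), 1.0) for k in range(num_freqs)]


# Halfway through training, the two lowest of four bands are visible.
mask = frequency_mask(50, 100, 4)
```

Multiplying the encoded features of each band by its weight hides the high-frequency inputs early in training and reveals them gradually, which is the behavior the regularization relies on.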
4.3. NeRF with Faster Rendering Speeds
Although NeRF can achieve photorealistic view synthesis, it requires frequent evaluation of the neural network at all point samples along each ray at runtime, which limits its capability for real-time rendering applications.
SNeRG [69] precomputes and stores the trained NeRF in a sparse voxel grid with learned feature vectors, enabling real-time rendering on commodity hardware. This method is 3000 times faster than the original implementation, significantly enhancing its performance. Mobile-NeRF [30] tries to combine NeRF with the traditional polygon rasterization pipeline to increase rendering speed. The method utilizes a set of polygons with textures representing binary opacities and feature vectors to model the scene. The output of the rasterization pipeline is pixels representing features, which are then interpreted by a lightweight MLP running in a GLSL fragment shader to render images. This approach not only maintains high-quality image output but also significantly increases rendering speed and reduces memory requirements, enabling real-time rendering on a wide range of computing platforms, including mobile phones. The comparison results in Table 3 and Table 4 demonstrate that Mobile-NeRF is around 10 times faster than SNeRG.
DIVeR [
31] tries to accelerate the rendering process by using deterministic rather than stochastic estimates of the volume rendering integral. This method involves jointly optimizing a feature voxel grid and a decoder MLP to reconstruct the scene. Each ray from the camera is segmented into intervals corresponding to each voxel. Components of the volume rendering integral are decoded by an MLP for each interval to generate density and color. As a result, DIVeR outperforms previous methods in terms of quality, especially for thin translucent structures, while maintaining comparable rendering speed. Moreover, Instant-NGP [
32] proposes a learned parametric multiresolution hash encoding to greatly reduce training time and uses an occupancy grid to accelerate inference speed.
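To make the idea of a multiresolution hash encoding more concrete, the following sketch looks up feature vectors from several hashed grids of increasing resolution. It is a simplified illustration rather than the Instant-NGP implementation: it uses a nearest-vertex lookup instead of trilinear interpolation, the feature tables are random instead of learned, and the hashing constants, function names, and parameters are assumptions.

```python
import numpy as np

PRIMES = (1, 2654435761, 805459861)  # assumed per-axis hashing constants

def level_resolutions(n_min: int, n_max: int, num_levels: int) -> list:
    """Geometric progression of grid resolutions from about n_min to about n_max."""
    growth = np.exp((np.log(n_max) - np.log(n_min)) / (num_levels - 1))
    return [int(np.floor(n_min * growth ** level)) for level in range(num_levels)]

def hash_index(cell: np.ndarray, table_size: int) -> np.ndarray:
    """Hash non-negative integer grid coordinates of shape (N, 3) into [0, table_size)."""
    cell = cell.astype(np.uint64)
    h = np.zeros(cell.shape[0], dtype=np.uint64)
    for axis, prime in enumerate(PRIMES):
        h ^= cell[:, axis] * np.uint64(prime)  # unsigned arithmetic wraps modulo 2**64
    return (h % np.uint64(table_size)).astype(np.int64)

def encode(points: np.ndarray, tables: list, resolutions: list) -> np.ndarray:
    """Concatenate one feature vector per level for points assumed to lie in [0, 1)^3."""
    feats = []
    for table, res in zip(tables, resolutions):
        cell = np.floor(points * res).astype(np.int64)  # nearest lower grid vertex
        feats.append(table[hash_index(cell, table.shape[0])])
    return np.concatenate(feats, axis=-1)

rng = np.random.default_rng(0)
resolutions = level_resolutions(n_min=16, n_max=512, num_levels=6)
# In a trained model these tables would be optimized; random values stand in here.
tables = [rng.normal(scale=1e-2, size=(2 ** 14, 2)) for _ in resolutions]
pts = rng.uniform(0.0, 1.0, size=(5, 3))
features = encode(pts, tables, resolutions)  # shape (5, 12): 6 levels x 2 features
```

Because each table has a fixed size regardless of the grid resolution, fine levels reuse entries through hash collisions, which keeps memory bounded while the small decoder network resolves the ambiguity.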
4.4. NeRF with Better Generalization
The original NeRF requires several hours of training for each new scene. In contrast, a generalizable NeRF can render multiple scenes directly using a pre-trained neural network without the need for retraining.
IBRNet [
18] synthesizes novel views of complex scenes by interpolating a sparse set of nearby views: a common strategy in the field of image-based rendering. It employs a neural network to learn a generic view interpolation function that generalizes to new scenes. MVSNeRF [
70] utilizes a higher-dimensional cost volume to represent the scene, resulting in faster processing and improved generalization. NeRFusion [
33] combines the advantages of NeRF- and TSDF-based fusion, achieving state-of-the-art quality for both large-scale indoor scenes and small-scale object scenes. It predicts per-frame local radiance fields from an input image sequence via direct network inference and fuses these into a global sparse scene representation in real time, which can be further fine-tuned. NeRFusion outperforms baseline NeRF [
5], IBRNet [
18], and MVSNeRF [
70] on the NeRF Synthetic [
5] and DTU [
55] datasets, as shown in
Table 5.
Point-NeRF [
34] combines volumetric neural rendering and deep multi-view stereo by using neural 3D point clouds to make NeRFs generalizable to new scenes. Point-NeRF can use the result of direct inference from a deep network pre-trained across scenes to produce an initial neural point cloud. This neural point cloud can then be rendered with a ray-marching-based pipeline or further fine-tuned to improve visual quality. Point-NeRF outperforms IBRNet [
18] and MVSNeRF [
70] on the NeRF Synthetic [
5] and DTU [
55] datasets, as shown in
Figure 16.
5. 3D Gaussian Splatting
Despite various improvements made by researchers to address the shortcomings of NeRFs, the neural-network-based implicit representation still struggles to balance training and rendering efficiency with scene reconstruction quality, limiting its further development. To address this issue, 3D Gaussian splatting (3DGS) [
6] has recently emerged; 3DGS represents a scene as a collection of 3D Gaussians, each with its own attributes such as position, rotation, scale, opacity, and color. Given a camera ray, the corresponding intensity is calculated as the alpha blending result of the 3D Gaussians intersecting with the ray. The scene is reconstructed by optimizing the attributes of 3D Gaussians so that the rendered images match the input images. Unlike coordinate-based implicit 3D scene representation methods like NeRF, 3DGS draws inspiration from point-based rendering methods [
71] and is a purely explicit representation. It is highly parallelized, is capable of converging in approximately 30 min, and achieves real-time rendering at over 30 FPS at 1080p resolution. Similar to NeRF, 3DGS is used in diverse applications such as robotics, autonomous navigation, urban mapping, and virtual/augmented reality. Since a thorough survey of 3DGS-based methods already exists [
72], we will focus primarily on PBDR and only briefly review 3DGS-based methods from three aspects for completeness. In the following subsections, we will first briefly introduce the theory behind 3DGS. Then, we will successively discuss the improvements in 3DGS regarding quality enhancement, compression and regularization, and 3D geometry reconstruction.
5.1. 3DGS Preliminaries
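A scene is represented as millions of 3D anisotropic balls in 3DGS, with each modeled using a 3D Gaussian distribution:
$$G(\mathbf{x}) = \exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right),$$
where $\boldsymbol{\mu} \in \mathbb{R}^{3}$ is the mean position of the anisotropic ball and $\Sigma \in \mathbb{R}^{3\times 3}$ is the corresponding covariance. Further, each 3D Gaussian has an opacity $\alpha$ and spherical harmonics parameters $\mathbf{c} \in \mathbb{R}^{k}$ (k is the degrees of freedom) for modeling density and view-dependent radiance for each anisotropic ball. For regularizing optimization, the covariance matrix is further decomposed into rotation matrix R and scaling matrix S:
$$\Sigma = R S S^{\top} R^{\top}.$$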
Given the camera pose, a novel view is rendered by splatting the 3D Gaussians onto the 2D image plane. Finally, the color of every pixel is computed by alpha compositing the opacities and colors of all the 3D Gaussians overlapping that pixel in depth order.
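The following sketch illustrates these preliminaries under stated assumptions: the covariance is assembled from a rotation and a scaling as Σ = R S Sᵀ Rᵀ (the standard 3DGS factorization), the rotation is parameterized by a unit quaternion (an implementation choice not stated above), and the per-pixel colors and opacities passed to `composite` are assumed to be already projected and sorted from near to far. All function names are hypothetical.

```python
import numpy as np

def quat_to_rotmat(q: np.ndarray) -> np.ndarray:
    """Rotation matrix from a quaternion (w, x, y, z); the input is normalized first."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Sigma = R S S^T R^T, symmetric positive semi-definite by construction."""
    R = quat_to_rotmat(q)
    S = np.diag(scale)
    return R @ S @ S.T @ R.T

def gaussian(x: np.ndarray, mean: np.ndarray, cov: np.ndarray) -> float:
    """Unnormalized Gaussian falloff exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))."""
    d = x - mean
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)))

def composite(colors: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """Front-to-back alpha compositing of per-Gaussian colors sorted near to far."""
    pixel = np.zeros(3)
    transmittance = 1.0
    for color, alpha in zip(colors, alphas):
        pixel += transmittance * alpha * color
        transmittance *= 1.0 - alpha
    return pixel

mean = np.array([0.0, 0.0, 1.0])
cov = covariance(np.array([0.9, 0.1, 0.2, 0.3]), np.array([0.5, 1.0, 2.0]))
falloff = gaussian(np.array([0.2, 0.0, 1.1]), mean, cov)
pixel = composite(np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]), np.array([0.5, 0.5]))
```

Optimizing a rotation and a scaling instead of the raw covariance entries keeps the matrix a valid covariance throughout gradient descent, which is the regularizing effect of the decomposition.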
When comparing the performance of two 3DGS-based methods, the commonly used datasets and visual quality assessment metrics are almost the same as those used by NeRF-based methods, as listed in
Section 4.1.
5.2. 3DGS with Quality Enhancement
Although 3DGS can render realistic images, there is still room for improving the rendering quality. Due to the mismatch between the signal frequency and the sampling rate, 3DGS produces aliasing when zooming in. To address this issue, Mip-Splatting [
35] first adopts a 2D Mip filter inspired by EWA-Splatting [
73] to alleviate aliasing. Mip-Splatting also constrains the maximal frequency of the 3D Gaussians according to the sampling rate of the training views. MS 3DGS [
74] also addresses the issue of aliasing. It proposes a multi-scale Gaussian splatting representation that selects different scales of 3D Gaussians based on the rendering resolution level. From another perspective, SA-GS [
75] proposes a training-free method and maintains scale consistency using a 2D scale-adaptive filter to improve the anti-aliasing performance of 3DGS; it outperforms Mip-Splatting [
35] and the original 3DGS [
6], as illustrated in
Figure 17.
Due to the poor performance of spherical harmonics for fitting high-frequency radiance distributions, some works have proposed improvements based on the intrinsic properties and rendering formula of 3DGS. GaussianShader [
76] proposes a simplified shading function on 3D Gaussians to enhance novel view synthesis results in scenes with reflective surfaces. VDGS [
36] proposes using a neural network, similar to NeRF, to model the view-dependent radiance and opacity, thereby enhancing 3DGS’s capability to model high-frequency information. A visual comparison between VDGS [
36] and the original 3DGS [
6] using the Tanks and Temples [
59] and Mip-NeRF 360 [
58] datasets is shown in
Figure 18.
Researchers have found that the spatial distribution of 3D Gaussians significantly impacts the final rendering quality. Therefore, some works improve the densification and pruning mechanisms of vanilla 3DGS to enhance rendering quality. GaussianPro [
77] guides the densification of the 3D Gaussians with a proposed progressive propagation strategy. RadSplat [
37] develops a ray-contribution-based pruning technique to reduce the overall point count while maintaining photo-realistic rendering quality. LightGS [
78] detects Gaussians that have minimal impact on scene reconstruction and employs a process of pruning and recovery, thereby reducing the number of redundant Gaussians while maintaining visual quality. To enhance the rendering quality in non-textured regions such as walls, ceilings, and furniture surfaces, GeoGaussian [
79] introduces a novel pipeline to initialize thin Gaussians aligned with the surfaces, as shown in
Figure 19, and achieves better novel view synthesis results, as shown in
Figure 20.
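As a rough illustration of contribution-based pruning, the sketch below scores each Gaussian by its largest alpha-blending weight over a set of rays and discards Gaussians whose score falls below a threshold. It only conveys the general idea; the scoring rule, the threshold, the data layout, and all names are assumptions and do not reproduce the exact criteria of RadSplat or LightGS.

```python
import numpy as np

def blending_weights(alphas: np.ndarray) -> np.ndarray:
    """w_i = alpha_i * prod_{j<i} (1 - alpha_j) for Gaussians sorted near to far."""
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    return alphas * transmittance

def importance_scores(rays, num_gaussians: int) -> np.ndarray:
    """Maximum blending weight of every Gaussian over all rays.

    `rays` is a list of (ids, alphas) pairs: the Gaussians hit by a ray,
    sorted near to far, and their opacities along that ray.
    """
    scores = np.zeros(num_gaussians)
    for ids, alphas in rays:
        weights = blending_weights(np.asarray(alphas, dtype=float))
        np.maximum.at(scores, np.asarray(ids), weights)  # unbuffered in-place maximum
    return scores

def prune(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Indices of the Gaussians that are kept."""
    return np.flatnonzero(scores >= threshold)

rays = [
    ([0, 1, 2], [0.9, 0.5, 0.5]),  # Gaussian 0 occludes most of Gaussians 1 and 2
    ([2, 1], [0.02, 0.8]),         # Gaussian 2 is nearly transparent on this ray
]
scores = importance_scores(rays, num_gaussians=3)
kept = prune(scores, threshold=0.1)
```

In this toy setup, Gaussian 2 never contributes noticeably to any ray and is removed, whereas Gaussian 1 survives because it dominates the second ray even though it is mostly occluded on the first.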
5.3. 3DGS with Data Compression
Despite the clear advantages of 3DGS over NeRF-based methods in terms of training and rendering speed, 3DGS requires significantly more storage space. A vanilla NeRF representation of a typical scene requires only about 5 MB, while 3DGS often requires an order of magnitude more. Scaffold-GS [
19] uses anchor points to distribute local 3D Gaussians and generates a large number of 3D Gaussians around these anchor points to reduce the storage requirements significantly, as shown in
Table 6. Many works have made improvements based on Scaffold-GS, including the introduction of level-of-detail strategies [
80] and the use of adaptive quantization modules for further compression [
81].
Others have combined 3DGS with vector quantization: a method widely used in the field of signal processing. EAGLES [
38] quantizes implicit attributes to reduce the storage memory of 3D Gaussians and uses a coarse-to-fine rendering resolution approach during training to ensure rendering quality. Compact3D [
39] performs vector quantization of Gaussian parameters during the training process. By treating each Gaussian as a vector, K-means clustering is executed to achieve compression. The resulting reduction in the number of Gaussians is illustrated in
Table 7.
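The principle behind vector quantization of Gaussian parameters can be sketched with a plain K-means codebook: each Gaussian stores a small integer index instead of its full parameter vector. This is a generic illustration on synthetic data, not the training-time procedure of Compact3D or EAGLES; the vector dimension, codebook size, and function names are assumptions.

```python
import numpy as np

def kmeans(vectors: np.ndarray, k: int, iters: int = 20, seed: int = 0):
    """Plain K-means; returns the codebook (k, d) and the index of each vector."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = vectors[assign == j]
            if len(members) > 0:           # keep the old center if a cluster is empty
                codebook[j] = members.mean(axis=0)
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook, dists.argmin(axis=1)

# Synthetic stand-in for per-Gaussian attributes (e.g., color coefficients).
rng = np.random.default_rng(1)
params = rng.normal(size=(2000, 8)).astype(np.float32)

codebook, assign = kmeans(params, k=16)
indices = assign.astype(np.uint8)          # 16 codes fit into one byte per Gaussian
reconstructed = codebook[indices]          # lossy reconstruction used at render time

original_bytes = params.nbytes
compressed_bytes = codebook.nbytes + indices.nbytes
```

The storage saving comes from replacing eight 32-bit floats per Gaussian with a one-byte index plus a shared codebook; the price is the quantization error between `params` and `reconstructed`, which methods in this family reduce by quantizing during training.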
5.4. 3DGS with Geometry Reconstruction
Since 3DGS represents scenes using discrete 3D anisotropic balls, reconstructing explicit geometries such as meshes is not as straightforward as with physics-based differentiable rendering methods. A few works have conducted preliminary explorations on how to utilize 3DGS for geometry reconstruction. SuGaR [
20] proposes a constraint term on the scene surface to improve the geometric reconstruction effect of 3DGS and uses Poisson reconstruction to extract meshes.
Figure 21 shows the reconstruction and scene editing results of SuGaR.
Most recently, 2DGS [
40] introduces 2D Gaussians to replace 3D Gaussians for scene representation and proposes a low-pass filter to prevent 2D Gaussians from generating line projections. However, these methods still fall short in reconstruction accuracy compared to NeRF-based implicit methods, let alone physics-based differentiable rendering methods.
6. Discussion
Differentiable rendering acts as a crucial link between image creation and analysis, delivering effective solutions for diverse applications in computer graphics and computer vision due to its capability of supplying gradients for optimization. This paper categorizes existing differentiable rendering methods based on the primary rendering algorithms they employ. Specifically, this work reviews physics-based, NeRF-based, and 3DGS-based differentiable rendering methods. It is noteworthy that while there are several existing reviews for NeRF-based and 3DGS-based methods, almost no reviews for PBDR exist. Therefore, our primary focus is on PBDR, with NeRF-based and 3DGS-based methods covered from several aspects to ensure completeness.
In this section, we summarize advancements and suggest potential directions for future research in physics-based, NeRF-based, and 3DGS-based differentiable rendering methods. A comparison of and conclusions related to these three categories of methods are presented in
Section 7.
Physics-based differentiable rendering: The ultimate goal of PBDR is to reduce the variance of the gradient estimator as much as possible within the same computation time, allowing the inverse rendering algorithms to converge more quickly and efficiently.
For boundary sampling methods, state-of-the-art techniques have abandoned the strategy of explicitly finding silhouette edges for a given shading point. Instead, they first generate a path segment tangent to the geometry being differentiated, then complete it to a full path in a bidirectional manner. To importance sample these tangent segments, a guiding structure is built during the precomputation process. Currently, only a small portion of the seed rays used to build this guiding structure has a positive contribution. Thus, a new strategy is needed to build this guiding structure more efficiently. Moreover, the parameterization used for the boundary sampling space for triangle meshes exhibits high-frequency and sparse features, which hinder the importance sampling process. A new parameterization scheme promising a smoother distribution needs to be developed.
For reparameterization methods, state-of-the-art techniques use a warp field to reparameterize the rendering integral, ensuring that the discontinuities remain fixed in the newly reparameterized domain as the scene parameters change. The Monte Carlo technique is used to estimate the warp field at a specific point, which requires auxiliary rays to be emitted. This Monte Carlo technique introduces additional variance in the interior region, and the ray tracing process for auxiliary rays is also time-consuming. To reduce the variance and computation time for estimating the warp field, a new reparameterization scheme needs to be developed. Ideally, this new reparameterization scheme should be analytic so that the variance in the reparameterization step is reduced to zero.
Boundary sampling methods have advantages over reparameterization methods, as they produce a cleaner gradient in the interior region. This is because the Monte Carlo process used by reparameterization methods to estimate the warp field introduces additional variance in the interior region. The advantage of reparameterization methods over boundary sampling methods is that they require neither precomputation nor a guiding process before the main differentiable rendering process. Additionally, reparameterization methods can still produce good results in scenarios where boundary sampling methods struggle to build an effective guiding structure, such as when an area emitter is encapsulated within a transparent surface with low roughness.
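The difference between the two families can be seen on a one-dimensional toy problem, which is an illustration constructed here and not taken from any of the reviewed methods. Consider I(θ) = ∫₀¹ 1[x < θ] x² dx = θ³/3, whose derivative is θ². The discontinuity at x = θ moves with the parameter, so differentiating the integrand pointwise yields zero. The sketch compares this naive estimate with a boundary term and with an analytic reparameterization x = θu; all function names are hypothetical.

```python
import numpy as np

def naive_gradient(samples: np.ndarray) -> float:
    """Pointwise derivative of 1[x < theta] * x^2 w.r.t. theta at fixed samples.

    It is zero wherever x != theta, so the estimate misses the moving boundary.
    """
    return float(np.mean(np.zeros_like(samples)))

def boundary_gradient(theta: float) -> float:
    """Interior term (zero here) plus the jump of the integrand at the boundary.

    The jump is f(theta^-) - f(theta^+) = theta^2 - 0, multiplied by the boundary
    velocity d(theta)/d(theta) = 1. In rendering, this term is an integral over
    silhouette edges that has to be sampled; in 1D it is a single point.
    """
    interior = 0.0
    jump = theta ** 2 - 0.0
    return interior + jump * 1.0

def reparam_gradient(theta: float, n: int, rng: np.random.Generator) -> float:
    """Reparameterize x = theta * u so the discontinuity sits at the fixed point u = 1.

    I(theta) = integral over u in [0, 1] of (theta * u)^2 * theta, where the last
    factor is the Jacobian; its derivative 3 * theta^2 * u^2 is smooth in theta.
    """
    u = rng.uniform(0.0, 1.0, n)
    return float(np.mean(3.0 * theta ** 2 * u ** 2))

rng = np.random.default_rng(0)
theta = 0.6
g_naive = naive_gradient(rng.uniform(0.0, 1.0, 1000))
g_boundary = boundary_gradient(theta)
g_reparam = reparam_gradient(theta, 200_000, rng)
```

Here the warp x = θu is known in closed form, which corresponds to the ideal analytic case mentioned above; in practical renderers the warp field itself must be estimated with auxiliary rays, and that estimate is the source of the additional interior variance.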
NeRF-based differentiable rendering: Numerous technical improvements focusing on different aspects of the original NeRF have been proposed. Since our main focus is on PBDR, we will only review NeRF-based methods from several aspects.
The core problem for sparse-view and generalizable NeRF models is how to deduce the missing information while avoiding overfitting to individual views. Some methods first extract neural features from the input views and then aggregate them into the target novel view, optionally incorporating depth or geometry regularization during the training process. Others aim to infer a low-dimensional latent code for the scene, which is then decoded by a hypernetwork for final shading. Combining NeRF models with large vision models that are capable of introducing stronger scene priors, which are helpful for deducing the missing scene information, is a promising research direction.
One major challenge for NeRF-based methods is their prolonged training and inference time. Some methods embed neural features or small MLPs into voxels or on mesh surfaces, then pair them with additional small MLPs to mitigate the catastrophic forgetting phenomenon of MLPs, resulting in faster convergence during training. Others cache the outputs of pre-trained networks into grids or tree structures, thereby eliminating the need for costly network queries and accelerating the inference process. Techniques such as early ray termination and empty space skipping are also usually adopted by these methods. Looking ahead, an avenue worth exploring is the use of more sophisticated data structures or more expressive neural features to enhance inference speed.
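Early ray termination and empty space skipping can be illustrated with a small quadrature of the volume rendering integral along a single ray. In this sketch, `sigma_fn` and `color_fn` stand in for network queries, the boolean array `occupied` stands in for a precomputed occupancy grid, and the threshold `min_transmittance` is an assumed parameter; none of these names come from a specific method.

```python
import numpy as np

def render_ray(sigma_fn, color_fn, t_vals, occupied=None, min_transmittance=0.0):
    """Accumulate color along one ray and count the density queries that were needed."""
    pixel = np.zeros(3)
    transmittance = 1.0
    queries = 0
    for i in range(len(t_vals) - 1):
        if occupied is not None and not occupied[i]:
            continue                          # empty space skipping: no query at all
        if transmittance < min_transmittance:
            break                             # early ray termination: ray is saturated
        delta = t_vals[i + 1] - t_vals[i]
        sigma = sigma_fn(t_vals[i])
        queries += 1
        alpha = 1.0 - np.exp(-sigma * delta)  # opacity of this ray segment
        pixel += transmittance * alpha * color_fn(t_vals[i])
        transmittance *= 1.0 - alpha
    return pixel, queries

def sigma_fn(t):
    """Toy density: a dense slab between t = 2 and t = 3, empty elsewhere."""
    return 50.0 if 2.0 <= t < 3.0 else 0.0

def color_fn(t):
    """Toy constant color."""
    return np.array([1.0, 0.5, 0.2])

t_vals = np.linspace(0.0, 6.0, 121)
# Stand-in for a cached occupancy grid evaluated at the sample positions.
occupied = np.array([sigma_fn(t) > 0.0 for t in t_vals[:-1]])

dense_pixel, dense_queries = render_ray(sigma_fn, color_fn, t_vals)
fast_pixel, fast_queries = render_ray(sigma_fn, color_fn, t_vals, occupied, 1e-3)
```

Both calls produce nearly the same pixel color, but the second one queries the density only for the few samples inside the slab that are reached before the transmittance drops below the threshold.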
3DGS-based differentiable rendering: Like NeRF, various technical enhancements have been proposed for the original 3DGS. Given that our primary focus is on PBDR, we will limit our review of 3DGS-based methods to several specific aspects.
There are still several artifacts in the rendering results of 3DGS-based methods that need to be reduced. These include aliasing, which is typically mitigated by multi-scale approaches, and floater artifacts, which are usually addressed through filtering and pruning. Additionally, while surface normal regularization can enhance rendering quality for scenes with high-frequency reflections, there is still significant room for improvement.
Reducing memory usage without sacrificing too much rendering quality is also an essential improvement for 3DGS-based methods. A smaller memory footprint not only leads to faster rendering but also enables quicker transmission and deployment on mobile devices. Existing methods usually adopt the strategy of filtering or pruning insignificant Gaussians, often followed by clustering or vector quantization to further reduce memory storage. However, extending these methods to dynamic scenes remains an under-explored area of research.
The problem of mesh reconstruction for 3DGS-based scene representation remains unresolved. Existing methods often employ a regularization term to align Gaussians with geometric surfaces, but the results still lack the precision found in NeRF-based geometry reconstruction, let alone the accuracy achieved by physics-based geometry reconstruction techniques.
7. Conclusions
As sub-branches of differentiable rendering, all three categories of methods (physics-based, NeRF-based, and 3DGS-based) aim to propagate gradients from image pixel intensities to explicit or neural scene parameters. These gradients can then be used in optimization algorithms to reconstruct the scene representation. The difference between these categories of methods lies in the primary rendering algorithms used, which in turn determine the differentiable rendering process.
The biggest difference between PBDR and the other two categories of methods is that its primary rendering process rigorously obeys the physical laws of light transport in the real world, whereas the processes used in NeRF-based and 3DGS-based methods are only approximations. Thus, the inverse rendering results of PBDR are more precise than those of the other two categories, but this comes at the cost of increased computation time. However, current PBDR theory is still underdeveloped. Existing methods mainly focus on the reconstruction of single objects with known emitters in purely virtual environments. An interesting research direction is to explore how to apply PBDR to scenes with various geometries in the real world or even in the wild, similar to how NeRF-based or 3DGS-based methods are used.
While both NeRF-based and 3DGS-based methods use rendering algorithms based on approximations of geometric optics, they differ significantly in form: NeRF is based on neural representations, whereas 3DGS is based on explicit representations. Thanks to their explicit representations, 3DGS-based methods offer faster training and inference than NeRF-based methods while maintaining competitive rendering quality. However, these explicit representations also result in a much larger memory footprint. Surface extraction results from NeRF-based methods are smoother than those from 3DGS-based methods due to NeRF’s underlying continuous representation. Moreover, a promising research direction is to explore how to incorporate more physics into NeRF-based and 3DGS-based methods to make them more physically correct and, consequently, achieve better reconstruction quality.