Article

Single-View Encoding of 3D Light Field Based on Editable Field of View Gaussian Splatting

by Shizhou Shi 1, Chaoqun Ma 2, Jing Liu 2,*, Changpei Ma 1, Feng Zhang 1 and Xiaoyu Jiang 3,*

1 State Key Laboratory of Information Photonics and Optical Communications, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 Ministry of Education Key Laboratory for Intelligent Networks and Network Security, Faculty of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China
3 Department of Information Communication, Army Academy of Armored Forces, Beijing 100072, China
* Authors to whom correspondence should be addressed.
Photonics 2025, 12(3), 279; https://doi.org/10.3390/photonics12030279
Submission received: 26 February 2025 / Revised: 15 March 2025 / Accepted: 17 March 2025 / Published: 18 March 2025

Abstract: The paper presents an efficient light field image synthesis method based on single-viewpoint images, which can directly generate high-quality light field images from single-viewpoint input images. The proposed method integrates light field image encoding with the tiled rendering technique of 3DGS. In the construction of the rendering pipeline, a viewpoint constraint strategy is adopted to optimize rendering quality, and a sub-pixel rendering strategy is implemented to improve rendering efficiency. Experimental results demonstrate that 8K light field images with 96 viewpoints can be generated in real time from end to end. The research presented in the paper provides a new approach for the real-time generation of high-resolution light field images, advancing the application of light field display technology in low-cost environments.

1. Introduction

Three-dimensional light field display technology, with its exceptional spatial light modulation capabilities and immersive visual effects, demonstrates broad application potential in fields such as virtual reality (VR), augmented reality (AR), and holographic displays [1,2,3]. Among these, autostereoscopic display devices can reproduce the light field of 3D objects without the need for head-mounted devices, achieving realistic 3D visual effects and enabling users to view and interact with scenes from different viewpoints [4,5]. The advantages of the technology are reflected not only in its unique display effects but also in its strong technical support for a variety of application scenarios, such as medical imaging, digital entertainment, and remote collaboration. However, despite significant progress in light field display technology, the generation of image content still faces numerous challenges.
Creating high-quality light field images requires high-resolution images from multiple viewpoints. To ensure a dense viewpoint distribution in the light field image, a large number of sampling cameras are needed [6,7]. For example, to create an 8K resolution light field image, it is necessary to capture a large number of high-resolution images from different viewpoints, adjust the camera arrangement based on the specific parameters of the display, and possibly recalibrate the cameras. This high-density sampling mode places significant demands on hardware, making it difficult to adopt in low-cost environments. On the other hand, while 3D mesh models can be used to render images from different viewpoints, the process is both time-consuming and computationally expensive when rendering high-resolution images.
To reduce hardware and computational costs, current solutions involve capturing low-resolution disparity images, enhancing their resolution using interpolation techniques, and then generating light field images using light field image encoding methods [8]. However, the use of low-resolution disparity images often results in poor light field image quality [9]. In theory, the best display quality can only be achieved when the resolution of images at each viewpoint matches the resolution of the display [10,11]. Therefore, to address the core challenges of light field image generation, two major technical bottlenecks must be overcome: first, how to infer sufficient light field information from a small number of captured images or even a single captured image, i.e., achieving dense novel view synthesis (NVS) from a limited number of images, thus reducing reliance on high-cost hardware; and, second, how to improve the rendering efficiency of light field images, ensuring high display quality while reducing rendering time and computational overhead.
In recent years, with the rapid development of deep learning technologies, novel view synthesis methods based on radiance fields have attracted significant attention. Neural Radiance Fields (NeRFs) learn a volumetric representation of 3D scenes using multi-layer perceptrons (MLPs), introducing importance sampling and positional encoding to enhance quality, and rendering via volumetric ray tracing [12]. While NeRFs achieve excellent rendering results, the use of large MLPs negatively impacts speed. The success of NeRFs has inspired numerous follow-up works, which have developed various NeRF variants focusing on aspects such as training speed, rendering speed, rendering quality, and sparsity [13,14]. Notable examples include Mip-NeRF360, which excels in novel view synthesis quality [15], Instant NGP (Instant Neural Graphics Primitives), which significantly improves training speed and rendering quality using multi-resolution hash encoding [16], and Plenoxels, which interpolate continuous density fields using sparse voxel grids and completely abandon the use of neural networks, achieving remarkable results [17].
Three-dimensional Gaussian splatting (3DGS) introduced an explicit 3D Gaussian representation in 2023, combined with tile-based rasterization acceleration techniques, successfully achieving real-time rendering and significantly improving rendering efficiency [18]. This makes 3DGS a leading technology in novel view synthesis. However, 3DGS still requires a lengthy per-scene optimization, with each scene taking several minutes to reconstruct, and it lacks strong generalization. To address these issues, subsequent research has proposed more efficient solutions. GPS-Gaussian+ enables real-time portrait reconstruction with backgrounds, generating high-quality novel views from just a pair of viewpoint images [19]. Meanwhile, Splatter Image innovates the training process, allowing 360° 3D Gaussian scene models to be reconstructed from a single viewpoint image, demonstrating strong generalization capabilities [20]. This progress makes it possible to generate light field images from single-frame images, significantly reducing the cost of light field image generation.
These radiance field-based 3D reconstruction methods provide new approaches for the efficient encoding of 3D light fields. Radiance field technology eliminates the need for expensive light field cameras or dense camera arrays to obtain 3D light field information. Instead, it requires only sparse views to infer sufficient light field information, which can then be used to generate light field images with dense viewpoints.
Benefiting from the outstanding performance of NeRFs and 3DGS in novel view synthesis, recent research has continuously advanced the progress of light field image generation technologies. For example, Cutoff-NeRF combines off-axis pixel encoding methods with the volumetric rendering process of NeRFs, significantly improving the rendering efficiency of light field images [21]; YANG et al. proposed DirectL, a new radiance field rendering paradigm, which improves rendering efficiency by combining light field image encoding algorithms with the volumetric rendering process [22]. However, these methods still have certain limitations: Cutoff-NeRF requires a large number of viewpoint images to train the implicit radiance field, and each model training process is complex with limited generalization; and, although DirectL improves rendering efficiency, it is still more suitable for NeRFs’ volumetric rendering and not applicable to the Gaussian splatting process of 3DGS.
To further improve the efficiency of light field image generation, the paper proposes a method for generating 3D light fields from single-viewpoint images based on Gaussian splatting (GS). We design an improved Splatter Image method for light field generation. The original Splatter Image method, with its innovative neural network architecture, can reconstruct a complete 360° scene from a single viewpoint image. The method is based on the U-Net neural network architecture [23], where a set of parameters is encoded into the output channels corresponding to each pixel. These parameters are then transformed into different properties of the 3DGS model (including opacity, offset, depth, scale, rotation, and color) to construct a complete 3D scene. Our improved method combines light field encoding algorithms with GS’s tile-based rendering acceleration technique, enabling direct rendering of light field images and significantly increasing rendering speed. Additionally, we further enhance the rendering quality by introducing viewpoint constraints. In engineering applications, we also propose a memory optimization method that makes light field image generation more efficient and reduces dependence on device performance. Experimental results show that our method can generate light field images required for 96-viewpoint 8K stereoscopic displays directly from a single viewpoint image, achieving a rendering speed nearly 30 times faster than that of traditional methods while maintaining rendering quality. Furthermore, our method exhibits excellent generalization. Our main contributions are three-fold: firstly, we propose a pixel-by-pixel GS-based light field image rendering method to improve rendering speed; secondly, we design an improved Splatter Image method to enable rapid light field image generation from single-frame images; and, thirdly, we propose a refined memory reclamation mechanism to reduce the method’s reliance on device performance.
The paper is organized as follows. Section 2 briefly introduces the light field encoding method for slanted lens grating autostereoscopic displays and the principle of single-frame image 3D light field reconstruction. Section 3 presents the single-view image-based GS method for generating 8K light field images, and Section 4 discusses experimental details and results.

2. 3D Light Field Encoding from a Single Image

2.1. Light Field Image Coding Based on Autostereoscopic Display

Lenticular lens-based autostereoscopic displays are among the most widely used devices in the field of naked-eye 3D displays. They project light field images from the display panel to multiple display regions at different angles through the light control properties of the lenticular lenses. Viewers standing in different display regions can observe distinct visual content, thereby creating a three-dimensional visual experience.
Generating light field images that are adapted to the lenticular lens characteristics requires the capture of images from multiple viewpoints of the same 3D scene, resulting in a series of viewpoint images. These images are then arranged into the light field image through an encoding algorithm, where sub-pixels from each viewpoint are interwoven. The light field image not only contains light field information from different viewpoints but also needs to be precisely matched with the physical parameters of the lenticular lens to achieve a high-quality 3D display.
As shown in Figure 1, the parameters of the slanted grating autostereoscopic display used in the encoding algorithm include the tilt angle, $\theta$, of the grating relative to the vertical direction, and the sub-pixel width in the horizontal direction covered by one grating unit, also known as the “line number”, $P_e$. The line number can be calculated using Equation (1):
$$P_e = \frac{(L + L_g)\,P_l}{L \cos\theta}, \qquad (1)$$
where $P_l$ is the width of the cross-section of the grating unit, $L_g$ is the distance between the 2D display panel and the grating, and $L$ represents the vertical distance from the viewing position to the grating plane. In the synthesized image, the horizontal distance, $D_p$, from the left edge of any sub-pixel $(i, j, k)$ to the left edge of the leftmost grating unit and the distance, $A_p$, from the left edge of the sub-pixel to the left edge of the corresponding grating unit can be calculated using Equations (2) and (3):
$$D_p = 3 \times (j - 1) + 3 \times (i - 1) \times \tan\theta + (k - 1), \qquad (2)$$
$$A_p = D_p \bmod P_e \qquad (3)$$
The viewpoint number, $n$, of the sub-pixel $(i, j, k)$ is:
$$n = A_p / d, \qquad (4)$$
where $d = P_e / N$ is the width of the sub-interval and $N$ is the number of viewpoints of the lenticular stereoscopic display.
Using the aforementioned formula, the encoding algorithm ensures that each sub-pixel of the light field image is accurately mapped to the correct spatial region, allowing viewers in different positions to see different image information. The encoding method not only fully utilizes the resolution of the display panel but also optimizes the matching with the characteristics of the lenticular lens, significantly improving the quality of the 3D display.
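For concreteness, the mapping defined by Equations (1)–(4) can be sketched in a few lines of NumPy. The sketch below is only an illustration of the formulas; the tilt angle, line number, and resolution passed in the example call are placeholders, not the calibration of the display used in this work.

```python
import numpy as np

def subpixel_viewpoint_map(height, width, theta_deg, line_number, num_views):
    """Map every sub-pixel (i, j, k) of the light field image to a viewpoint
    index following Equations (2)-(4). Rows i, columns j, and color channels k
    are 1-based in the equations, hence the offsets below."""
    theta = np.deg2rad(theta_deg)
    i = np.arange(1, height + 1).reshape(-1, 1, 1)   # pixel row
    j = np.arange(1, width + 1).reshape(1, -1, 1)    # pixel column
    k = np.arange(1, 4).reshape(1, 1, -1)            # R, G, B sub-pixel
    # Eq. (2): horizontal distance from the leftmost grating unit, in sub-pixel widths
    D_p = 3 * (j - 1) + 3 * (i - 1) * np.tan(theta) + (k - 1)
    # Eq. (3): offset inside the covering grating unit
    A_p = np.mod(D_p, line_number)
    # Eq. (4): viewpoint number, with sub-interval width d = P_e / N;
    # floor() turns the ratio into a 0-based integer viewpoint index.
    d = line_number / num_views
    n = np.floor(A_p / d).astype(np.int32) % num_views
    return n  # shape (height, width, 3), values in [0, num_views)

# Placeholder parameters for a small test image; for the 8K panel the same
# call applies with height=3840, width=7680 and the measured theta and P_e.
view_map = subpixel_viewpoint_map(1080, 1920, theta_deg=9.0,
                                  line_number=11.6, num_views=96)
```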

2.2. Reconstructing the 3D Light Field from a Single Image

In the process of generating light field images for lenticular grating-based stereoscopic displays, light field information is typically obtained through the acquisition and encoding of multiple disparity images. However, the method of obtaining multiple disparity images relies on complex hardware setups and multi-viewpoint synchronous acquisition systems. In practical applications, it has issues such as high equipment costs and complex data collection. Therefore, exploring a method to directly reconstruct the 3D light field from a single image can not only significantly reduce hardware requirements but also provide a more flexible and efficient solution for autostereoscopic displays.
In the 3D reconstruction task of a single image, how to efficiently represent and render a 3D scene is one of the key issues. Gaussian splatting (GS), as an advanced 3D scene representation method, uses a Gaussian distribution to represent each 3D point in the scene. Each point is described by a set of parameters (position $\mu \in \mathbb{R}^3$, scaling vector $s \in \mathbb{R}_+^3$, rotation vector $r \in \mathbb{R}^4$, color $c \in \mathbb{R}^k$ defined by SH coefficients, and opacity $\alpha \in [0, 1]$). It can significantly reduce the computational complexity while maintaining scene details. The point cloud in a 3D scene can be represented by the Gaussian distribution function as:
$$G(x) = A \exp\!\left(-\tfrac{1}{2}(x - \mu)^{T}\,\Sigma^{-1}(x - \mu)\right)$$
In order to perform effective optimization through gradient descent, the covariance matrix, $\Sigma$, can be decomposed into a scaling matrix, $S$, and a rotation matrix, $R$. The Gaussian is then projected from 3D space onto the 2D image plane using the view transformation, $W$, and the affine-approximated Jacobian, $J$, of the projective transformation:
$$\Sigma = R S S^{T} R^{T}$$
$$\Sigma' = J W \Sigma W^{T} J^{T}$$
The color of each pixel is obtained through $\alpha$-blended rendering; the concept is derived from the volumetric rendering process of NeRFs:
$$C_{\mathrm{color}} = \sum_{i \in N} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$$
In this way, each 3D point in the scene can be modeled as a Gaussian ellipsoid, featuring continuity and sparsity. Additionally, 3DGS employs a tile-based rasterizer. It divides the screen space into 16 × 16 tiles, binds each Gaussian point to a tile, and sorts them according to the depth of the Gaussian points and the IDs of the bound tiles. This greatly accelerates the rasterization speed and enables efficient rendering from different viewpoints.
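As a concrete illustration of the formulas above, the following sketch builds the 3D covariance from $R$ and $S$, projects it to the image plane, and alpha-composites the depth-sorted splats covering one pixel. It is a plain NumPy reference written for clarity, not the tile-based CUDA rasterizer itself.

```python
import numpy as np

def project_covariance(R, S, W, J):
    """Sigma = R S S^T R^T, then Sigma' = J W Sigma W^T J^T.
    R: 3x3 rotation, S: 3x3 diagonal scaling, W: 3x3 world-to-camera rotation,
    J: 2x3 affine-approximated Jacobian of the projective transformation."""
    Sigma = R @ S @ S.T @ R.T
    return J @ W @ Sigma @ W.T @ J.T          # 2x2 covariance on the image plane

def composite_pixel(colors, alphas, depths):
    """Front-to-back alpha blending of all splats covering one pixel,
    i.e. C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j)."""
    order = np.argsort(depths)                # sort the splats front to back
    C, T = np.zeros(3), 1.0                   # accumulated color and transmittance
    for idx in order:
        C += colors[idx] * alphas[idx] * T
        T *= 1.0 - alphas[idx]
        if T < 1e-4:                          # early termination once nearly opaque
            break
    return C
```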
The Splatter Image reconstructs 3D scenes using Gaussian representation. By training a simple neural network, it can transform the input monocular image into a 3D Gaussian scene. To reconstruct a 3D light field from a single image, a large-scale dataset based on multi-view disparity images is required to train the neural network. The essence of the training process is to fit the Gaussian representation of the 3D point cloud through the neural network. The network model learns the distribution characteristics of the point cloud under viewpoint changes, such as the 3D offset $(\Delta x, \Delta y, \Delta z)$, depth, $d$, opacity, $\sigma$, shape, $\Sigma$, and color, $c$ (represented using SH functions), corresponding to each pixel of the input image. This provides strong prior knowledge for single-view 3D reconstruction. Multiple disparity images are captured from the same 3D scene to construct a multi-view dataset that covers the entire range of viewpoints. Each disparity image provides light field information from one perspective, including disparity and depth characteristics. A neural network is used to fit the light field distribution of the multi-view disparity images, representing the point cloud in the 3D scene as a set of Gaussian ellipsoids. By optimizing objective functions such as position error, shape error, and rendering loss, the network can gradually converge. Using this network, novel views can be obtained from various viewpoints.
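The per-pixel prediction can be pictured as splitting the channels of the U-Net output into Gaussian attributes. The channel split and activation functions in the sketch below are illustrative assumptions in the spirit of Splatter Image, not the exact layout of the released implementation.

```python
import torch
import torch.nn.functional as F

def decode_gaussians(feat, rays_o, rays_d):
    """feat: U-Net output of shape (B, 15, H, W); rays_o / rays_d: per-pixel
    camera ray origins and directions, shape (B, 3, H, W). Returns one 3D
    Gaussian per input pixel. The channel layout (1 depth + 3 offset +
    1 opacity + 3 scale + 4 rotation + 3 color) is assumed for illustration."""
    depth, offset, opacity, scale, rot, color = torch.split(
        feat, [1, 3, 1, 3, 4, 3], dim=1)
    mu = rays_o + rays_d * F.softplus(depth) + offset   # place along the pixel ray
    opacity = torch.sigmoid(opacity)                    # alpha in (0, 1)
    scale = F.softplus(scale)                           # positive scaling vector s
    rot = F.normalize(rot, dim=1)                       # unit quaternion r
    color = torch.sigmoid(color)                        # DC color term (full SH omitted)
    return mu, opacity, scale, rot, color
```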
The research on reconstructing 3D light fields from a single image not only expands the possibilities of light field image generation but also provides a new data source for glasses-free 3D display technology. Compared with traditional methods that rely on multi-viewpoint capture, the single-image reconstruction method offers the following advantages:
  • Low hardware cost: it eliminates the need for complex multi-viewpoint capture equipment, requiring only a single camera for data acquisition.
  • High flexibility: it is suitable for various application scenarios, such as real-time 3D reconstruction on mobile devices and fast capture of dynamic scenes.
  • Strong compatibility: the generated 3D light field data can be seamlessly integrated into existing raster encoding algorithms, directly enabling glasses-free 3D displays.
By leveraging the reasoning ability and strong generalization of neural networks, 3D light field reconstruction is achieved from a single image, providing innovative technical support for future glasses-free 3D display technologies.

3. Method

Generating 3D light fields from a single image in real time presents several challenges, such as the design of a real-time rendering pipeline for high-resolution light field images, mapping light field Gaussian ellipsoid sets from a single image, and performing high-resolution multi-view inference under limited memory. To address these challenges, we propose a single-viewpoint image encoding method for 3D light fields based on editable field-of-view Gaussian splatting, enabling the real-time generation of a single high-resolution light field image. The overall framework of the method is shown in Figure 2.
A single frame image is input into the Splatter Image network, and a 3D Gaussian model is inferred through the network. This network combines the nonlinear feature extraction capability of deep learning with Gaussian kernel-based geometric shape modeling, effectively capturing the local features and global distribution relationships of the input view, thereby generating high-quality image representations. Subsequently, based on the geometric properties of the input view and Gaussian distribution parameters, we design and implement a parallel multi-viewpoint rendering module based on an off-axis camera model. This module accurately simulates the propagation process of the light field, enabling efficient mapping from multi-viewpoint input to light field image output, while also supporting flexible adjustments of off-axis views, significantly enhancing the adaptability and generality of the algorithm.
To further optimize the rendering quality of light field images, we introduce an off-axis pixel encoding method that directly encodes the sub-pixels of the light field image. This method effectively enhances the light field reconstruction and preserves fine details. Furthermore, to improve the light field reconstruction, we perform parameter optimization and acceleration iterations for GS reconstruction within a specific viewpoint range. Based on this, by combining light field rendering with the optimized reconstruction strategy, we achieve higher reconstruction fidelity and more natural depth perception within the field of view.
To ensure the scalability and practicality of the method, we conduct an in-depth analysis of the memory usage and computational efficiency of the proposed model on different hardware platforms. In the rendering module, a memory-blocking and pipelined parallel strategy is employed, and a memory reclamation mechanism is established, greatly reducing computational resource requirements. By closely integrating the algorithm with the hardware resources, the method not only ensures real-time performance for high-precision Gaussian model scenes but also significantly improves the overall robustness of the system. Below, we discuss these three aspects in detail.

3.1. Design of the Gaussian Splatting Real-Time Rendering Pipeline for 8K Light Field Images

To achieve real-time 8K light field image generation, our initial approach is to improve computational efficiency by parallelizing the multi-view pixel rendering process. However, the method still faces many challenges in practical applications. The generation of an 8K light field image typically requires rendering dozens or even hundreds of high-resolution images from different viewpoints. This demand places a tremendous computational load on the device and consumes a large amount of video memory, making it difficult for existing hardware platforms to support efficient real-time rendering. Therefore, relying solely on traditional rendering optimizations cannot meet the stringent requirements for high resolution and real-time generation of light field images.
From the perspective of optical encoding, the generation of light field images essentially involves extracting target sub-pixel values from a large number of viewpoints according to optical theory and filling them into the 8K light field image. However, during this process, most of the rendered sub-pixels are discarded, leading to a substantial waste of computational resources. Based on this observation, we propose an efficient rendering strategy tailored for light field images, rendering only the sub-pixels that are actually needed in the light field image. This method eliminates frame-by-frame rendering of full-view images, and by precisely matching sub-pixels it significantly reduces computational complexity and video memory usage.
We integrate the light field image encoding algorithm into the rasterization acceleration method of 3DGS, directly generating the final light field image. As shown in Figure 3, in the original 3D Gaussian splatting rendering kernel, the image is first divided into tiles of 16 × 16 pixels, and each Gaussian ellipsoid is bound to the tiles that its projected footprint covers. Each (ellipsoid, tile) pair is then assigned a 64-bit key and sorted by that key.
However, fixed-size tiles are not suitable for light field image rendering. As the number of viewpoints increases, more and more sub-pixels in each tile do not participate in the final image presentation, resulting in unnecessary computational overhead and memory usage. To address the issue, we propose an adaptive tile partitioning strategy. The strategy dynamically adjusts the tile size based on the number of viewpoints to accommodate the rendering needs of different viewpoints. In multi-view rendering scenarios, each view contains a large amount of redundant information, which becomes particularly prominent as the number of viewpoints increases. By adaptively adjusting the tile size, we ensure that each tile contains only the sub-pixels that need to be rendered, significantly reducing redundant computation and memory usage. The optimization strategy not only significantly reduces computational complexity but also fully utilizes hardware resources, ensuring real-time rendering of light field images in high-resolution scenes.
Specifically, the method first calculates the optimal tile size based on the number and distribution of viewpoints. In the dynamic adjustment process, the tile size increases as the number of viewpoints increases, ensuring that each tile contains enough effective sub-pixels to fill a CUDA block, thus avoiding memory wastage. The core of the adaptive strategy is to dynamically adjust tile division based on the sub-pixel requirements of different numbers of viewpoints to achieve a more efficient rendering process. In this way, we significantly improve computational efficiency and memory usage while maintaining rendering accuracy.
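One simple way to realize such a rule is to grow the tile edge with the number of viewpoints so that each tile still supplies roughly one CUDA block of effective sub-pixels per viewpoint. The 256-thread block size and the square-root scaling below are illustrative assumptions, not the exact heuristic of the implementation.

```python
import math

def adaptive_tile_edge(num_views, block_threads=256):
    """With N interleaved viewpoints, only about 1/N of a tile's sub-pixels
    belong to any single viewpoint, so the tile area is scaled up by N to keep
    roughly `block_threads` effective sub-pixels per viewpoint per tile."""
    return math.ceil(math.sqrt(block_threads * num_views))

print(adaptive_tile_edge(96))   # -> 157, far larger than the default 16x16 tile
```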
In our specific implementation, we encode each sub-pixel based on viewpoint, tile ID, column, and row to create a unique identifier, as shown in Figure 3. The encoding method efficiently organizes and manages the sub-pixels of different views, preparing them for parallel computation. Then, based on the viewpoint number and tile ID of each sub-pixel, we sort the encoded sub-pixels. The sorting process ensures that all sub-pixels of the same viewpoint remain contiguous in the data structure, while also ensuring that sub-pixels from different viewpoints and tiles are evenly distributed. The strategy not only optimizes the locality of data storage but also reduces the likelihood of memory access conflicts.
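A possible packing of the (viewpoint, tile ID, column, row) identifier into a single sortable key is sketched below. The field widths are assumptions chosen only to cover the value ranges of a 96-view 8K light field image; the paper specifies the field order but not the exact bit layout.

```python
import numpy as np

def encode_and_sort_subpixels(view_id, tile_id, col, row):
    """Pack (viewpoint, tile ID, column, row) into one 64-bit key per sub-pixel
    and sort, so that sub-pixels of the same viewpoint and tile end up
    contiguous in memory. Inputs are integer NumPy arrays of equal shape.
    Assumed widths: 13 bits row, 15 bits column, 20 bits tile ID, and the
    remaining high bits for the viewpoint."""
    key = (view_id.astype(np.uint64) << np.uint64(48)) \
        | (tile_id.astype(np.uint64) << np.uint64(28)) \
        | (col.astype(np.uint64) << np.uint64(13)) \
        | row.astype(np.uint64)
    order = np.argsort(key, kind="stable")    # viewpoint-major, then tile-major
    return key[order], order
```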
The efficient integration of encoding and sorting mechanisms allows each tile to be rendered independently and in parallel within a single CUDA block. The mechanism fully leverages the parallel computation power of the GPU, with each thread focusing on the computation of a single sub-pixel. More importantly, by accurately filtering and eliminating unnecessary redundant sub-pixels, we significantly reduce the computational burden and memory usage. The tile-based parallel rendering approach not only improves the efficiency of light field image rendering but also effectively preserves the sub-pixel mapping relationship from views to light fields.
To further address the potential resource bottlenecks in high-resolution light field image rendering across multiple viewpoints, we propose an optimization strategy based on sub-pixel priority sorting. Specifically, based on the contribution of each sub-pixel within the tile to the final light field image, we prioritize high-priority sub-pixels and, through dynamic CUDA thread allocation, achieve precise matching and optimal use of computational resources. The optimization strategy ensures high-quality rendering of critical sub-pixels while reducing computational overhead in low-priority areas, thereby improving overall rendering efficiency.
In summary, our adaptive tile partitioning significantly accelerates the rendering process while maintaining the precision required for high-resolution light field image generation. The technological breakthrough provides a novel solution for the efficient generation of real-time 8K light field images, fully demonstrating the capability of deep GPU architecture optimization and the potential of combining optical encoding with computer graphics.

3.2. Mapping the Light Field Gaussian Ellipsoid Set Based on a Single Image

In the training process of the Splatter Image network, the original method performs a 360° reconstruction of the entire scene. However, for light field 3D display technology, such a comprehensive scene reconstruction is unnecessary. The core of 3D light field displays is to reconstruct the light field within a specific field of view, not to fully recreate the 3D structure of the entire scene. Therefore, 3D light field displays focus on the distribution of 3D Gaussian points observed from a specific viewpoint range.
Based on the analysis above, we propose an efficient 3D reconstruction strategy aimed at specific field-of-view angles. This strategy optimizes the training process of the Splatter Image network by limiting the viewpoint range of the training data. Specifically, we use a set of images captured within a specific viewpoint range to construct a dataset for training the improved U-Net network. This method avoids interference from redundant data across the full 360° view, allowing the network to focus on the 3D Gaussian point inference task within the target field of view.
In full light field reconstruction, wide field-of-view angles often lead to mismatches in the positions of 3D Gaussian points across different viewpoints, causing blurry or inconsistent rendering results. However, by adopting a strategy with smaller field-of-view angles, we can effectively use the Gaussian ellipsoid set to mitigate such mismatches.
As shown in Figure 4, when focusing on a small-angle range for 3D reconstruction, the depth information of the Gaussian ellipsoid set inferred by the network becomes more concentrated on the side of the reconstructed object facing the camera. In small field-of-view angles, the network is able to capture the 3D structure of the target area more precisely, thus effectively avoiding the mismatches caused by full field-of-view inference. Small field-of-view reconstruction not only improves rendering quality but also enhances the robustness of the model, particularly in complex lighting or occlusion scenarios, where the network still maintains high stability and accuracy.
The data distribution characteristics of small field-of-view angles are more suitable for the demands of light field reconstruction, which enhances the network’s ability to learn locally and ultimately improves the rendering quality of the views. Due to the limitation of the field of view range, the network can better learn the most important local features of the scene and reduce interference from background information, demonstrating better robustness. Additionally, the network can improve its adaptability to specific scenes, effectively avoiding rendering distortion and mismatching problems that may arise during full light field generalization.
To further validate the effectiveness of the method, we conduct experimental analyses. The results show that, with the same training duration and data scale, the network model trained on specific field-of-view angles significantly outperforms the model trained on the entire scene in terms of rendering quality within the target viewpoint range. The rendered light field images are not only more precise in the spatial frequency domain but also demonstrate excellent depth perception and viewpoint consistency.

3.3. Memory Optimization for 8K High-Resolution Multi-Viewpoint Inference

We establish memory analysis curves and analyze the memory usage of the rendering module. Although we adopt a per-pixel light field image rendering method and skip the disparity map generation process, the model still needs to project in multiple directions due to reliance on multiple virtual viewpoints, which causes the memory requirements to increase exponentially. Therefore, optimizing memory allocation becomes key to achieving efficient rendering.
We propose a memory reclamation mechanism aimed at minimizing memory usage and optimizing computational resource utilization. The core idea of this mechanism is to precisely control the memory lifecycle during the rendering process, combining on-demand allocation and reclamation strategies to dynamically manage the memory consumption associated with each viewpoint and tile. By performing fine-grained memory analysis of the rendering process, we identify the fluctuation patterns of memory requirements at various stages, enabling on-demand allocation and intelligent reclamation. During each viewpoint’s rendering cycle, memory is dynamically allocated based on the number of GS points that need to be processed for the current viewpoint. Once the sorting operation is completed, we immediately reclaim the intermediate results and unused memory blocks, preventing long-term memory occupation. This process follows a delayed reclamation strategy, ensuring that memory is only released when it is no longer in use, thereby maximizing memory utilization.
In addition, to address the issue of high memory usage during the 2D Gaussian reconstruction kernel sorting process, we introduce an allocation and reclamation mechanism based on lifecycle management. During the rendering process of each viewpoint, the necessary memory space is first allocated for the viewpoint and is promptly reclaimed after the processing is completed. Particularly during the sorting process, we implement hierarchical reclamation of memory blocks corresponding to the sorted GS point keys and their associated data. Through effective memory pool management and reuse mechanisms, we avoid memory fragmentation. To ensure the efficiency and safety of the reclamation process, we use modern memory management techniques such as reference counting and smart pointers to ensure that each memory block is reclaimed without causing memory leaks or access conflicts.
In the specific implementation, we combine asynchronous memory management and stream data processing techniques, enabling memory reclamation operations to run in parallel with rendering tasks, thus reducing performance bottlenecks caused by memory management. This memory reclamation mechanism not only significantly reduces memory usage but also effectively improves resource scheduling efficiency during the rendering process, ensuring the real-time generation of high-resolution light field images.
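Conceptually, the per-viewpoint lifecycle looks like the following PyTorch-level loop. Here `renderer` and its methods are hypothetical stand-ins for the CUDA rasterizer entry points, so this is a sketch of the reclamation order rather than the actual implementation.

```python
import torch

def render_lightfield(gaussians, viewpoints, renderer):
    """Render each virtual viewpoint in turn, releasing intermediate buffers
    (the analogue of BinningState) as soon as they are no longer needed."""
    outputs = []
    for cam in viewpoints:
        with torch.no_grad():
            # Build and sort the (key, Gaussian index) pairs for this viewpoint.
            keys, gauss_idx = renderer.build_and_sort_keys(gaussians, cam)
            # The sorted keys are only needed to derive the per-tile ranges.
            ranges = renderer.compute_tile_ranges(keys)
            del keys                                   # eager reclamation
            outputs.append(renderer.blend_tiles(gaussians, gauss_idx, ranges, cam))
            del gauss_idx, ranges                      # release before the next viewpoint
    torch.cuda.empty_cache()    # return cached blocks to the driver if required
    return outputs
```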
To further explain, we analyze the rendering module code. Memory allocation mainly involves three core components: GeometryState, ImageState, and BinningState. GeometryState stores parameters of the Gaussian reconstruction kernel (e.g., depth, size, color), and its memory consumption is proportional to the number of Gaussian ellipsoids and viewpoints; ImageState stores data related to images, and its memory usage is proportional to the resolution of the final rendered image; and BinningState is used for encoding and sorting the ellipsoids, and this memory consumption is directly related to the number of ellipsoids.
For example, for a 3D model represented by 128 × 128 Gaussian ellipsoids, when rendering 96 viewpoints for an 8K light field image, the memory requirements for the three components are 82 MB (GeometryState), 265 MB (ImageState), and 1166 MB (BinningState), totaling 1513 MB. However, when the number of ellipsoids increases to 512 × 512, the memory requirements for the three components increase by nearly 16 times, with total memory requirements expected to reach 24 GB, exceeding the memory limit of the RTX 3090 GPU, which results in program crashes.
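The reported growth can be checked directly from the ratio of ellipsoid counts; treating the total as scaling with that count, as the figures above do, gives a quick estimate:
$$\frac{512 \times 512}{128 \times 128} = 16, \qquad 1513~\text{MB} \times 16 \approx 24{,}200~\text{MB} \approx 24~\text{GB},$$
which already reaches the nominal 24 GB capacity of the RTX 3090 before the network weights and framework overhead are counted.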
We found that the largest memory usage occurs in BinningState, approximately 18.7 GB. This phenomenon arises because each viewpoint contains multiple tiles, and each tile is associated with a large number of Gaussian points, leading to very high memory usage. We dynamically allocate memory based on the number of Gaussian ellipsoids that need to be rendered for each viewpoint, and reclaim unused memory promptly after the sorting operation. Additionally, after calculating the ranges (which store the start and end GS point indices for each tile) based on the keys, the intermediate results generated during the sorting process are also discarded. Using this strategy, the memory usage of BinningState is reduced by about half.
Furthermore, benefiting from the larger tile size introduced in Section 3.1, the memory optimization effect is further enhanced. By increasing the tile size, we significantly reduce the number of tiles associated with each ellipsoid. This is because larger tiles cover more ellipsoids, reducing the number of projections per ellipsoid. This approach effectively reduces memory usage while decreasing the number of Gaussian ellipsoids to be splatted during rendering, further improving rendering efficiency.
In conclusion, by combining memory reclamation mechanisms with the larger tile size strategy, we achieve memory optimization, enabling us to infer light field images of a larger scale.

4. Results

In this section, we evaluate the inference speed and light field image synthesis quality of the single-viewpoint image encoding 8K light field image method based on Gaussian splatting (GS). We use a 65-inch autostereoscopic light field display with a resolution of 7680 × 3840 and 96 viewpoints to load the light field images. The computing platform configuration is as follows: an RTX 3090 GPU with 24 GB of VRAM, an Intel Core i9-12900K CPU, and 64 GB of host memory.

4.1. Comparison of 96-Viewpoint 8K Light Field Image Rendering Speed

The rendering quality and rendering speed of the light field images are used to analyze the performance of the proposed method. The light field images are obtained using our method and the classic multi-view interleaving method, respectively (Figure 5). It is worth noting that, for the classic method, we only consider the rendering time for 96 views and do not include the time for subsequent interleaving of the light field images. We use 10 images as input and perform 10 repeated experiments with both methods, calculating the weighted average. As shown in Table 1 and Figure 6, the average rendering time of our method is 0.0416 s, while the traditional method’s average rendering time is 1.3101 s, achieving a nearly 30-fold speedup.
It is important to emphasize that the rendering time for Gaussian splatting depends on the number of Gaussian points and the proximity of the viewpoints. Our test times were recorded under the same viewpoint and Gaussian model conditions. Our experiments demonstrate that our method is able to significantly accelerate the rendering speed of light field images.
Rendering effect comparison. As shown in Figure 6, experimental observations reveal that the display quality of both methods is nearly identical, which aligns with our predicted results.

4.2. Light Field Rendering Quality for a Specified Field of View

We use the same network training configuration as the original method. Our model is trained with L2 reconstruction loss using the Adam optimizer, with a learning rate of and a batch size of 4. The training is conducted for 800,000 iterations.
We compare the rendering image quality of our light field Splatter Image method with the original Splatter Image method. We compare the PSNR and SSIM of the viewpoint images generated by both methods, and also demonstrate the actual display effects of the two methods on an 8K light field display. As there is currently no unified quantitative evaluation standard for 3D light field display effects, we provide a visual comparison through multiple sets of photos to intuitively present the differences in actual display effects between the two methods.
We compare the differences in rendering image quality between the light field Splatter Image method and the native Splatter Image method, using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) as the quantitative evaluation metrics (Table 2).
We present the variations of PSNR and SSIM during the training process in Figure 7.
We compare the reconstruction effects of our method and the original method, as shown in Figure 8. It can be observed that, when the viewpoint is shifted, our method is able to infer more accurate model details, while the original method exhibits errors in shape and color, with these errors becoming more severe as the viewpoint angle increases. This indicates that our method achieves better rendering quality.
In addition, as shown in Figure 9, we also present a comparison between the original image set used for testing and the results of our training.
Finally, we observe the display effect on a 65-inch autostereoscopic display, and our method achieved a better display effect, as shown in Figure 10.

4.3. Memory Optimization Method

We compare the memory usage before and after the memory optimization process. The VRAM usage before and after optimization is shown in Table 3. Our experimental results demonstrate that the memory usage is reduced by nearly half. The significant reduction in memory consumption not only enhances the overall efficiency of the system but also enables the handling of larger-scale data during the light field rendering process. These improvements are critical for achieving scalability and efficiency in memory-limited environments.

5. Discussion and Conclusions

In summary, we propose a single-viewpoint image encoding 3D light field generation method based on Gaussian splatting. The integration of the light field image encoding algorithm with the Gaussian splatting process improves the light field image rendering speed; the viewpoint constraint method is adopted to enhance the rendering quality; and, finally, a memory reclamation mechanism is designed to further reduce the performance requirements for rendering light field images on the device. Experimental results validate the feasibility of the proposed method. Our approach can rapidly generate high-resolution light field images from a single view. However, due to the current limitations of the experimental platform, we are unable to train larger-scale networks to infer scenes with more Gaussian representations. In the next steps, we plan to upgrade the computing platform to complete the task and further improve the method’s performance. Additionally, our future work will focus on enhancing the network’s global perception by introducing a global attention mechanism, giving higher weight to pixels near the center of the field of view to ensure that the network learns regions with the most significant contributions to the rendering result. Furthermore, we plan to optimize training data and network structures, incorporating multi-scale feature fusion modules and attention mechanisms to enhance the network’s ability to capture local features and understand global structures. This will address the adaptability issues of Gaussian ellipsoid sets in complex scenes, making our light field image generation method more general and robust.

Author Contributions

Conceptualization, S.S. and C.M. (Chaoqun Ma); data curation, F.Z.; formal analysis, S.S.; funding acquisition, J.L.; investigation, C.M. (Changpei Ma); methodology, S.S.; project administration, J.L. and X.J.; software, S.S.; supervision, J.L. and X.J.; validation, C.M. (Chaoqun Ma), C.M. (Changpei Ma) and F.Z.; writing—original draft, S.S.; writing—review and editing, C.M. (Chaoqun Ma). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in GitHub at https://github.com/ShiSzzz/Single-View-Encoding-of-3D-Light-Field; accessed on 16 March 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shen, S.; Xing, S.; Sang, X.; Yan, B.; Chen, Y. Virtual stereo content rendering technology review for light-field display. Displays 2023, 76, 102320. [Google Scholar] [CrossRef]
  2. Wu, G.; Masia, B.; Jarabo, A.; Zhang, Y.; Wang, L.; Dai, Q.; Chai, T.; Liu, Y. Light field image processing: An overview. IEEE J. Sel. Top. Signal Process. 2017, 11, 926–954. [Google Scholar] [CrossRef]
  3. Urey, H.; Chellappan, K.V.; Erden, E.; Surman, P. State of the art in stereoscopic and autostereoscopic displays. Proc. IEEE 2011, 99, 540–555. [Google Scholar] [CrossRef]
  4. Efrat, N.; Didyk, P.; Foshey, M.; Matusik, W.; Levin, A. Cinema 3D: Large scale automultiscopic display. ACM Trans. Graph. (TOG) 2016, 35, 1–12. [Google Scholar] [CrossRef]
  5. Sang, X.; Gao, X.; Yu, X.; Xing, S.; Li, Y.; Wu, Y. Interactive floating full-parallax digital three-dimensional light-field display based on wavefront recomposing. Opt. Express 2018, 26, 8883–8889. [Google Scholar] [CrossRef] [PubMed]
  6. Wilburn, B.; Joshi, N.; Vaish, V.; Talvala, E.V.; Antunez, E.; Barth, A.; Adams, A.; Horowitz, M.; Levoy, M. High performance imaging using large camera arrays. ACM Trans. Graph. 2005, 24, 765–776. [Google Scholar] [CrossRef]
  7. Liu, D.; Huang, X.; Zhan, W.; Ai, L.; Zheng, X.; Cheng, S. View synthesis-based light field image compression using a generative adversarial network. Inf. Sci. 2021, 545, 118–131. [Google Scholar] [CrossRef]
  8. Fink, L.; Strobel, S.; Franke, L.; Stamminger, M. Efficient rendering for light field displays using tailored projective mappings. Proc. ACM Comput. Graph. Interact. Tech. 2023, 6, 1–17. [Google Scholar] [CrossRef]
  9. Woods, A.J. Crosstalk in stereoscopic displays: A review. J. Electron. Imaging 2012, 21, 040902. [Google Scholar] [CrossRef]
  10. Li, D.; Zang, D.; Qiao, X.; Wang, L.; Zhang, M. 3D synthesis and crosstalk reduction for lenticular autostereoscopic displays. J. Disp. Technol. 2015, 11, 939–946. [Google Scholar] [CrossRef]
  11. He, J.; Yu, X.; Gao, X.; Yan, B.; Tong, Y.; Xie, X.; Zhang, H.; Shi, K.; Hu, X.; Sang, X. Assessment of the definition varying with display depth for three-dimensional light field displays. Opt. Commun. 2024, 563, 130623. [Google Scholar] [CrossRef]
  12. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
  13. Nguyen, T.A.Q.; Bourki, A.; Macudzinski, M.; Brunel, A.; Bennamoun, M. Semantically-aware neural radiance fields for visual scene understanding: A comprehensive review. arXiv 2024, arXiv:2402.11141. [Google Scholar]
  14. Rabby, A.; Zhang, C. BeyondPixels: A comprehensive review of the evolution of neural radiance fields. arXiv 2023, arXiv:2306.03000. [Google Scholar]
  15. Barron, J.T.; Mildenhall, B.; Verbin, D.; Srinivasan, P.P.; Hedman, P. Mip-NeRF 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5470–5479. [Google Scholar]
  16. Müller, T.; Evans, A.; Schied, C.; Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. (TOG) 2022, 41, 102. [Google Scholar] [CrossRef]
  17. Fridovich-Keil, S.; Yu, A.; Tancik, M.; Chen, Q.; Recht, B.; Kanazawa, A. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5501–5510. [Google Scholar]
  18. Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 2023, 42, 139. [Google Scholar] [CrossRef]
  19. Zhou, B.; Zheng, S.; Tu, H.; Shao, R.; Liu, B.; Zhang, S.; Nie, L.; Liu, Y. GPS-Gaussian+: Generalizable Pixel-wise 3D Gaussian Splatting for Real-Time Human-Scene Rendering from Sparse Views. arXiv 2024, arXiv:2411.11363. [Google Scholar]
  20. Szymanowicz, S.; Rupprecht, C.; Vedaldi, A. Splatter Image: Ultra-fast single-view 3D reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 10208–10217. [Google Scholar]
  21. Chen, S.; Yan, B.; Sang, X.; Chen, D.; Wang, P.; Yang, Z.; Guo, X.; Zhong, C. Fast virtual view synthesis for an 8K 3D light-field display based on Cutoff-NeRF and 3D voxel rendering. Opt. Express 2022, 30, 44201–44217. [Google Scholar] [CrossRef] [PubMed]
  22. Yang, Z.; Liu, B.; Song, Y.; Yi, L.; Xiong, Y.; Zhang, Z.; Yu, X. DirectL: Efficient Radiance Fields Rendering for 3D Light Field Displays. ACM Trans. Graph. (TOG) 2024, 43, 1–19. [Google Scholar] [CrossRef]
  23. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
Figure 1. Light field image encoding process. Firstly, images from different viewpoints of the scene are captured. Then, based on the parameters of the autostereoscopic display device, the light field image is synthesized. The refraction of light through the cylindrical lens array is utilized to bend the light rays from different disparity images in different directions, forming viewpoints and generating stereoscopic vision.
Figure 2. Overall framework. A single-viewpoint image is input. The U-net network will infer the 3D Gaussian representation corresponding to each pixel based on the pixel information contained in the image. Then, through Gaussian splatting and light field encoding, the direct rendering of the light field image is achieved.
Figure 3. Firstly, calculate the sub-pixel arrangement index matrix according to the hardware parameters to obtain the view ID of each sub-pixel in the light field image, which indicates the viewpoint from which the sub-pixel is sourced. Then, divide the light field image into tiles. Based on the tile in which the sub-pixel is located, as well as its row and column numbers, encode each sub-pixel as (viewpoint, tile ID, column number, row number). Next, sort the sub-pixels according to this encoding; the sorted sub-pixels are stored as a one-dimensional sequence ordered by viewpoint and tile. Finally, pad the sorted sequence so that the number of sub-pixels in each tile of each viewpoint fills a complete CUDA block; the padded sub-pixels are marked with row and column (−1, −1), indicating that they do not actually exist.
Figure 4. Unlike the objective of the 360° inference model in 3D reconstruction tasks, light field display only requires reconstructing the frontal light field. By imposing constraints on the viewing angles, the viewing angle of the output viewpoint image is always within a certain range of the viewing angle of the input viewpoint image. We train the neural network in a specific direction, which enables more depth information of the Gaussian ellipsoids to be correctly predicted on the frontal surface, thereby rendering images with higher quality.
Figure 5. Comparison of rendering speeds between the two methods.
Figure 6. Comparison of the display effects of light field images generated by the two methods on an autostereoscopic display device.
Figure 7. Changes in PSNR and SSIM during the training process. It can be observed that during the training process both PSNR and SSIM gradually increase.
Figure 8. Comparison of multi-view rendering results between our method and the original method.
Figure 9. Comparison between our method and the original test set.
Figure 10. Comparison of light field reconstruction effects between our method and the original method on a 65-inch autostereoscopic display device.
Table 1. Rendering speeds of the two methods (seconds).
Run       1      2      3      4      5      6      7      8      9      10
Ours      0.052  0.032  0.044  0.049  0.029  0.060  0.035  0.027  0.032  0.056
Classic   1.476  1.102  1.287  1.582  1.069  1.661  1.255  0.976  1.202  1.491
Table 2. PSNR and SSIM.
Method     PSNR (dB)↑   SSIM↑
Original   26.01        0.94
Ours       29.77        0.97
Table 3. Comparison of video memory sizes.
                Before     After
Video memory    1513 MB    944 MB
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
