1. Introduction
Integral imaging is a popular method for achieving a 3D display without the need to wear glasses. This approach provides full parallax and all depth cues, enabling a more immersive experience [1,2,3,4]. Previously, both pickup and display relied on a lenslet array [5]. With the advancement of computers, computer-generated and computational reconstruction techniques have progressed rapidly, and many efforts have been devoted to realistic, high-quality 3D images, including increasing the sampling rate to improve the resolution of each view [6], changing rendering algorithms to accelerate the generation of high-quality 3D images [7,8], and using convolutional neural networks (CNNs) to eliminate noise in 3D images [9].
Computational integral imaging reconstruction (CIIR) is based on optical mapping [10,11,12], which maps elemental images to a 3D space and combines them at a certain depth plane to generate a 3D scene. Optical mapping-based CIIR methods have been extensively studied for improving 3D imagery because of their straightforward ray-optics model [13,14]. These methods map the pixels in elemental images to the 3D space, either directly or after applying windowing or convolution. Pixel mapping methods, in which each pixel is individually mapped through a lens array, effectively reduce computational cost and improve the visual quality of reconstructed images [15,16,17,18,19]. Windowing methods weight the elemental images with a window function before mapping [20], which helps eliminate blurring and lens-array artifacts, thereby improving the quality of the reconstructed image [21,22]. Convolutional methods have recently been introduced to extract depth information through the use of convolutions and delta functions; this approach has yielded improved image quality and greater control over depth resolution [23,24,25,26,27,28,29,30].
Nevertheless, despite the valuable insights offered by the above research on 3D image generation and reconstruction, reconstructing super high-resolution 3D images from low optical sampling remains a substantial challenge.
In recent years, deep learning has become a popular technique in computer vision and has made significant progress in areas such as speech and image recognition and classification [31], novel view synthesis [32], and image super-resolution (SR) [33]. Image SR restores high-resolution images with more detail from low-resolution images; it aims to recover lost details, improve image clarity, and enhance visual quality, making images more informative and realistic. In surveillance systems, it can improve video quality and aid target recognition and tracking. In medical imaging, it can assist doctors in diagnosing patients more accurately [34,35]. In satellite imaging and related fields, it can enhance image clarity and analytical capability, helping scientists better understand the earth, weather, and astronomical phenomena. Single image super-resolution (SISR) has been the subject of extensive research over the past few decades and serves as the foundation for multi-image super-resolution [36]. The super-resolution convolutional neural network (SRCNN) uses an end-to-end learning approach to generate high-resolution images directly from low-resolution ones, eliminating the multiple stages of traditional methods [37]. Very deep super-resolution (VDSR) [38] and residual dense networks (RDNs) [39] further explore the mechanisms and effectiveness of deep convolutional neural networks: VDSR, consisting of 20 convolutional layers, better captures minute image details, while RDNs use residual skip connections and dense connections to handle complex textures and structures, producing realistic images. The super-resolution generative adversarial network (SRGAN) applies generative adversarial networks (GANs) to image generation to produce photo-realistic high-resolution images [40]. Compared with traditional methods, GAN-based methods are better at capturing complex patterns and creating intricate textures. Similarly, the enhanced super-resolution generative adversarial network (ESRGAN) uses GANs to generate realistic high-resolution images with improved detail and sharpness [41].
Computational efficiency is also a concern in image SR. The fast super-resolution convolutional neural network (FSRCNN) simplifies the network structure [42], achieving faster speeds than each camera viewpoint independent rendering (ECVIR) [43]. Another approach [44] simplifies the generator to only two convolutions, an activation function, and a deconvolution. While this accelerates computation, an overly simple generator limits the ability to capture texture details.
The traditional SRGAN network utilizes convolutional layers with batch normalization (BN) to extract feature maps during training. BN normalizes the data along the channel dimension for each batch, ensuring that the values of each layer are within a specific range. This helps alleviate the problem of vanishing gradients during the training process. However, BN also introduces additional computational complexity and memory usage. Furthermore, BN only considers the statistical characteristics of the internal batch data during normalization and ignores the inter-batch correlations, which can lead to artifacts and impact the generalization ability of the network [
45]. Therefore, in this study, we chose to remove the BN layers to reduce computational complexity and improve the network’s generalization ability. Additionally, addition and subtraction operations are added between the basic blocks to increase the inter-block correlation and enhance the reconstruction effect.
This paper proposes a CIIR method based on GAN super-resolution. First, a low-resolution 3D image is reconstructed by pixel mapping, which yields a view image in less time; then a lite GAN is employed to super-resolve the low-resolution 3D image. Removing the BN layers from SRGAN improves generalization and increases high-frequency texture detail while reducing computational complexity and memory usage. The residual network layers are then added together to form a dense residual network, which allows greater connectivity and more communication between residual blocks and improves the overall performance of the network. After processing by the lite GAN, super high-resolution reconstructed 3D images are obtained. The experimental results show that the proposed method achieves SSIM above 0.95 on different 3D models and reduces network training time by about 20%.
This article is structured as follows: Section 2 introduces the fundamental principles of the algorithm. Section 2.1 provides an overview of ray tracing and pixel mapping, explaining how to generate low-resolution images from elemental image arrays. Section 2.2 focuses on high-resolution 3D image generation based on GAN, explaining how to enhance the effect of the reconstructed image. Section 3 presents the experimental results and engages in comprehensive discussions. Finally, Section 4 draws conclusions based on the research conducted.
2. Principle
The proposed method is shown in Figure 1. Firstly, a sparse camera array is established to quickly obtain the elemental image array of the 3D model. Then, view images are generated by pixel mapping. Due to the limited number of cameras, only low-resolution view images can be reconstructed. Therefore, a lite GAN network is utilized for super-resolution of the view images. This enables high-resolution integral imaging reconstruction to be achieved with fewer cameras.
2.1. Ray Tracing and Pixel Mapping
The difference between computational integral imaging and all-optical integral imaging is that the 3D image is reconstructed in the computer using a simulated lens array (or pinhole array or camera array), rather than a combination of a display device and a physical lens array. At present, there are two kinds of methods to reconstruct the 3D image in integral imaging [13]: one synthesizes a pinhole array model in the computer and reconstructs the 3D image by ray tracing; the other extracts and models the depth of the collected 3D information.
Ray tracing is a fundamental process to generate an elemental image array, as shown in Figure 2a. Any point on the surface of a 3D object is imaged through the lenslet array. Each elemental image represents the imaging mapping of the 3D object through the corresponding optical center. The elemental images record the information of the rays emitted from every point on the surface of a 3D object. Therefore, a ray can be determined by selecting any point in the elemental image array. We can use ray tracing to analyze the recording process of integral imaging.
According to the dual-plane light field parameterization method [46], the imaging range of each lenslet is located behind it, and all optical axes of the lenslets are perpendicular to the elemental image array. The optical center of each lenslet corresponds to its elemental image. Based on geometric optics, from Figure 2a we can parameterize each ray by the coordinates $(u, v)$ at which it passes through the lenslet array plane and the coordinates $(x, y)$ at which it is imaged on the elemental image array. The light field distribution of the elemental image array can then be written as $E(x, y, u, v)$, and the light field information recorded by integral imaging is $L(u, v, x, y)$. Therefore, the recorded $E(x, y, u, v)$ is equivalent to $L(u, v, x, y)$, and the 3D light field can be reconstructed from the elemental image array.
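Under the pinhole-array model described above, the recording process can be sketched in Python as follows; the pitch, gap, and resolution values here are hypothetical illustration parameters, not the settings used in our experiments.

import numpy as np

# Hypothetical pinhole-array parameters (illustrative only).
pitch = 1.0      # lenslet pitch (mm)
gap = 3.0        # distance from lenslet plane to elemental image plane (mm)
num_lens = 10    # lenslets per side
ei_res = 32      # pixels per elemental image (per side)

def record_point(px, py, pz, ei_array):
    """Project a 3D point (px, py, pz) through every pinhole onto its elemental image."""
    for i in range(num_lens):
        for j in range(num_lens):
            # Optical center of lenslet (i, j) on the lens-array plane (z = 0).
            u = (i - num_lens / 2 + 0.5) * pitch
            v = (j - num_lens / 2 + 0.5) * pitch
            # Similar triangles: the ray through (u, v) reaches the elemental
            # image plane (z = -gap) at (x, y).
            x = u + (u - px) * gap / pz
            y = v + (v - py) * gap / pz
            # Convert to pixel indices inside elemental image (i, j).
            col = int(((x - u) / pitch + 0.5) * ei_res)
            row = int(((y - v) / pitch + 0.5) * ei_res)
            if 0 <= row < ei_res and 0 <= col < ei_res:
                ei_array[i, j, row, col] = 1.0   # mark the recorded ray

ei_array = np.zeros((num_lens, num_lens, ei_res, ei_res))
record_point(0.0, 0.0, 50.0, ei_array)   # a point 50 mm in front of the array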
To reconstruct view images, pixel mapping is used to generate images from different perspectives, as shown in Figure 2b. According to the optical mapping between elemental images and view images [47], parallel rays with the same incident angle (e.g., the red or green rays in Figure 2b) are refracted by the lens array and produce the corresponding view image at the imaging plane for that angle; different angles of light generate view images from different perspectives. Using pixel mapping, the pixels with the same viewpoint can therefore be extracted from the elemental image array to reconstruct a view image. The pixel mapping between elemental image $E_{i,j}$ and view image $V_{p,q}$ can be expressed as

$V_{p,q}(i, j) = E_{i,j}(p, q),$

where the subscripts $(i, j)$ are the coordinates of the elemental image within the array and $(p, q)$ are the pixel coordinates within each elemental image. This mapping shows that the resolution of the reconstructed view image is equal to the number of elemental images, which means that increasing the resolution of view images requires an increase in the number of cameras. For example, a 50 × 50 camera array can only reconstruct a 3D image with a resolution of 50 × 50. To improve the rendering efficiency, it is necessary to avoid increasing the number of cameras. Therefore, a lite GAN is used to convert the low-resolution 3D image to super resolution.
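A minimal sketch of this pixel mapping is shown below. It assumes the elemental image array is stored as a 4D NumPy array indexed by elemental image coordinates (i, j) and intra-image pixel coordinates (p, q); this layout and the array sizes are illustrative assumptions rather than our exact implementation.

import numpy as np

def reconstruct_view(ei_array, p, q):
    """Extract the view image for viewpoint (p, q).

    ei_array: elemental images with shape (I, J, P, Q); the view image
    V_{p,q}(i, j) = E_{i,j}(p, q), so its resolution equals the number
    of elemental images (I x J).
    """
    return ei_array[:, :, p, q]

def reconstruct_all_views(ei_array):
    """Generate every view image by pixel mapping."""
    I, J, P, Q = ei_array.shape[:4]
    return [[reconstruct_view(ei_array, p, q) for q in range(Q)] for p in range(P)]

# Example: a 50 x 50 camera array, each elemental image 100 x 100 pixels
ei_array = np.random.rand(50, 50, 100, 100)
view = reconstruct_view(ei_array, 50, 50)   # one low-resolution (50 x 50) view image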
2.2. High-Resolution 3D View Image Generation Based on GAN
Based on the method described in Section 2.1, it is possible to generate a low-resolution view image. In order to achieve efficient generation of super high-resolution 3D images, a refined SRGAN network is proposed.
Traditional CNN-based super-resolution reconstruction algorithms usually use mean squared error (MSE) as the loss function, which cannot guarantee the perceptual consistency between the generated images and the real images. SRGAN introduces perceptual loss, which calculates the feature differences between the generated images and the real images using a pre-trained feature extraction network, thereby better preserving the perceptual details of the images. In addition, the discriminator of the SRGAN network adopts a deep convolutional neural network, which can effectively learn the differences between the generated images and the real images. This enables SRGAN to generate more realistic and natural high-resolution images, avoiding the issues of excessive smoothness or over-sharpening.
To prevent gradient diffusion and degradation as the number of network layers increases, the generator incorporates a skip-connection residual network structure. Each residual block includes two convolutional layers, each followed by batch normalization and a leaky rectified linear unit (LReLU) activation function. Furthermore, the visual geometry group (VGG) network is employed to compute the content loss instead of the traditional MSE. All convolutional kernels have a 3 × 3 structure with 64 feature maps, aiding the extraction of richer feature information.
A low-resolution image is input to a convolutional layer with parameters 3 × 3 × 64, meaning 64 convolution kernels of size 3 × 3 with a stride of 1, and LReLU is used as the activation function. The features then pass through 6 residual blocks before entering a further convolutional layer. Finally, the image is upscaled by a factor of 2 using a scaling convolution: nearest-neighbor interpolation upscales the image by 2×, the result is passed through a convolutional layer, a final convolution operation is performed, and the high-resolution image is output.
This article is based on an improved SRGAN network that removes the BN layer from the previous SRGAN. The BN layer is implemented by normalizing each batch of data in the channel dimension by subtracting the mean of the batch and dividing it by the standard deviation of the feature values of each sample. In this way, the feature values of each layer are in a specific range, which reduces the problem of gradient disappearance during training. However, since the BN layer only considers the statistical properties of the current batch data, the correlation between different batches is ignored. This means that if there are differences in the data distribution between batches, the BN layer may not be able to completely eliminate these differences, introducing an artifact problem. Such artifacts can manifest themselves as unnatural textures or color deviations in the generated images, among others. Moreover, since the BN layer only exploits the statistical properties of the local batch data, it may affect the generalization ability of the network. Because the generalization ability requires the network to be able to adapt to different distributions of data, the normalization process of the BN layer is only carried out on the local batch data, which cannot fully consider the global characteristics of the entire data set. Therefore, removing BN layers can reduce this artifact problem and improve the generalization ability of the network by increasing the correlation between the basic blocks. Such improvement measures help to improve the accuracy of the reconstruction task and the quality of the generated images.
The residual network layers are added together to form a dense residual network, allowing for more connections between each residual block, as shown in Figure 3.
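The PyTorch sketch below illustrates the kind of BN-free residual block and dense residual trunk described above (6 residual blocks, 3 × 3 convolutions with 64 feature maps, LReLU, and nearest-neighbor ×2 upscaling). The class names, channel counts, and exact way the block outputs are accumulated are illustrative assumptions, not the reference implementation.

import torch
import torch.nn as nn

class ResidualBlockNoBN(nn.Module):
    """Residual block with two 3x3 convolutions and LReLU; the BN layers are removed."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))

class LiteGenerator(nn.Module):
    """Lite generator: 6 BN-free residual blocks with dense (additive) connections,
    followed by nearest-neighbor x2 upscaling."""
    def __init__(self, channels=64, num_blocks=6):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1),
                                  nn.LeakyReLU(0.2, inplace=True))
        self.blocks = nn.ModuleList([ResidualBlockNoBN(channels) for _ in range(num_blocks)])
        self.body_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.upsample = nn.Sequential(nn.Upsample(scale_factor=2, mode='nearest'),
                                      nn.Conv2d(channels, channels, 3, padding=1),
                                      nn.LeakyReLU(0.2, inplace=True))
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x):
        feat = self.head(x)
        dense = feat
        out = feat
        for block in self.blocks:
            out = block(out)
            dense = dense + out             # add block outputs for denser inter-block connections
        out = self.body_conv(dense) + feat  # global residual connection
        return self.tail(self.upsample(out))

# Example: super-resolve a 50 x 50 view image to 100 x 100
sr = LiteGenerator()(torch.rand(1, 3, 50, 50))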
We use relatively few basic blocks to compose the generative network, which reduces the training time and the time required to generate high-resolution images. The GAN uses content loss and adversarial loss to improve the realism of the output image and to generate high-frequency information. The loss function is expressed as follows:

$l^{SR} = l^{SR}_{MSE} + l^{SR}_{VGG} + \lambda \, l^{SR}_{Gen},$

where $l^{SR}_{Gen}$ is the adversarial loss, i.e., the loss incurred when the discriminator identifies the images produced by the generator as natural images. The content loss is divided into two parts: $l^{SR}_{MSE}$, which is the MSE loss of the generator network, and $l^{SR}_{VGG}$, which is the feature loss obtained by inputting the generated images into the VGG network. The weight of the adversarial loss is $\lambda = 10^{-3}$.
The MSE loss is a commonly used loss function to measure the difference between the high-resolution images generated by the generative network and the real images. If the loss function consists only of the MSE loss, it can be represented as:

$l^{SR}_{MSE} = \dfrac{1}{r^{2}WH}\sum_{x=1}^{rW}\sum_{y=1}^{rH}\left(I^{HR}_{x,y} - G_{\theta_G}\!\left(I^{LR}\right)_{x,y}\right)^{2},$

where $W$ and $H$ represent the width and height of the image, $r$ is the magnification factor for super-resolution, $I^{HR}_{x,y}$ is the pixel value of the original high-resolution image $I^{HR}$, and $G_{\theta_G}(I^{LR})$ represents the reconstructed image output by the generative network. The VGG loss function utilizes the feature vectors from a pre-trained VGG-19 network to measure the perceptual similarity between the generated image and the original image. This loss is calculated by comparing the feature values produced by the VGG network for both images. The VGG loss can be expressed as:

$l^{SR}_{VGG/i,j} = \dfrac{1}{W_{i,j}H_{i,j}}\sum_{x=1}^{W_{i,j}}\sum_{y=1}^{H_{i,j}}\left(\phi_{i,j}\!\left(I^{HR}\right)_{x,y} - \phi_{i,j}\!\left(G_{\theta_G}\!\left(I^{LR}\right)\right)_{x,y}\right)^{2},$

where $\phi_{i,j}$ represents the feature map output after the $j$-th convolutional layer and before the $i$-th max pooling layer, i.e., the feature values obtained through the VGG network, and $W_{i,j}$ and $H_{i,j}$ are the dimensions of that feature map.
The purpose of the generative network is to generate realistic images that deceive the discriminative network and thus produce high-quality images. Therefore, the adversarial loss is used to measure the similarity between the generated image and a real image. The adversarial loss is:

$l^{SR}_{Gen} = \sum_{n=1}^{N} -\log D_{\theta_D}\!\left(G_{\theta_G}\!\left(I^{LR}\right)\right),$

where $D_{\theta_D}(G_{\theta_G}(I^{LR}))$ is the probability that the discriminator classifies the image generated by the generator as a real image.
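As an illustration, a minimal PyTorch sketch of this combined objective is given below; the choice of VGG-19 feature layer, the 1e-3 weighting, and the class and variable names are assumptions following the standard SRGAN formulation rather than a transcript of our training code.

import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    """SRGAN-style objective: MSE content loss + VGG feature loss + weighted adversarial loss."""
    def __init__(self, adv_weight=1e-3):
        super().__init__()
        # Frozen pre-trained VGG-19 used as a perceptual feature extractor (layer choice is illustrative).
        self.vgg = vgg19(weights='DEFAULT').features[:36].eval()
        for p in self.vgg.parameters():
            p.requires_grad = False
        self.mse = nn.MSELoss()
        self.adv_weight = adv_weight

    def forward(self, sr, hr, disc_pred_on_sr):
        mse_loss = self.mse(sr, hr)                           # pixel-wise content loss
        vgg_loss = self.mse(self.vgg(sr), self.vgg(hr))       # VGG feature (perceptual) loss
        adv_loss = -torch.log(disc_pred_on_sr + 1e-8).mean()  # generator adversarial loss
        return mse_loss + vgg_loss + self.adv_weight * adv_loss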
The VOC2012 dataset was used to train the network; it contains a large number of high-quality images and has wide applicability, making it suitable for SR training [48]. The training dataset consists of images with varying complexity captured using different sets of lenses, enabling the network to have better generalization ability. To expand the dataset, the images can be cropped, rotated, mirrored, etc., to increase the number of samples. During training, the high-resolution images were first down-sampled using bicubic interpolation to obtain low-resolution images, which were then upscaled and compared with the original images to train the network parameters. To generate low-resolution and high-resolution image pairs, all images in the dataset were first cropped to 88 × 88 pixels and then downsampled to the corresponding low-resolution images. Adaptive moment estimation (Adam) with a learning rate of $10^{-4}$ was used for gradient descent during training. The network was trained for 100 epochs with a batch size of 16.
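A minimal sketch of this data preparation and optimizer setup is shown below. The image folder path, augmentation choices, and the LiteGenerator class (the generator sketched in the previous section) are placeholders and assumptions, not our exact training script.

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
import glob

class SRPairDataset(Dataset):
    """Crop 88 x 88 HR patches and bicubic-downsample them to form LR/HR training pairs."""
    def __init__(self, image_dir, crop=88, scale=2):
        self.paths = glob.glob(image_dir + '/*.jpg')
        self.hr_tf = transforms.Compose([transforms.RandomCrop(crop),
                                         transforms.RandomHorizontalFlip(),
                                         transforms.ToTensor()])
        self.to_lr = transforms.Compose([transforms.ToPILImage(),
                                         transforms.Resize(crop // scale,
                                                           interpolation=transforms.InterpolationMode.BICUBIC),
                                         transforms.ToTensor()])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        hr = self.hr_tf(Image.open(self.paths[idx]).convert('RGB'))
        return self.to_lr(hr), hr   # (low-resolution, high-resolution) pair

loader = DataLoader(SRPairDataset('VOC2012/JPEGImages'), batch_size=16, shuffle=True)
generator = LiteGenerator()                                    # generator sketched in Section 2.2
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)  # Adam, learning rate 1e-4
# Training loop (100 epochs): for each batch, compute the perceptual loss above and step the optimizer.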
3. Experimental Results
In the experiments, the PC hardware configuration consists of an Intel(R) Core(TM) i7-11800H CPU @ 2.30 GHz with 16 GB RAM and an NVIDIA GeForce RTX 3060 laptop GPU. The parameters of the ray tracing are shown in Table 1.
The test objects encompass a simple textured dolphin, a complex textured warcraft, and a hamburger.
Figure 4 shows the elemental image array generated by ray tracing, as well as upper, lower, left, and right view images resulting from pixel mapping.
To obtain high-resolution 3D images quickly, a light generative network and a deep discriminative network can be used to achieve better super-resolution results while reducing the time the generator needs to produce high-resolution images. Whereas SRGAN uses 16 basic blocks, which increases network complexity, this paper uses fewer basic blocks to improve training speed while reducing complexity, as shown in Table 2.
Table 2 shows that the structural similarity index measure (SSIM) and peak signal to noise ratio (PSNR) obtained using 16 basic blocks in SRGAN are not significantly different from those obtained using 6 basic blocks, but the training time is greatly reduced by about 20%. On the other hand, the training time difference between 6 basic blocks and 3 basic blocks is not significant, yet the SSIM and PSNR values are much lower than those obtained using 16 basic blocks. This indicates that while maintaining the same image output quality, appropriately reducing the number of basic blocks can lead to faster training of network parameters. To enhance the training speed while ensuring that the image reconstruction quality remains unaffected, a comparison was also made between the traditional SRGAN and the proposed method. The training results of traditional SRGAN showed a PSNR of 23.9041 dB and an SSIM of 0.7091, while the training results of the proposed method showed a PSNR of 23.8863 dB and an SSIM of 0.7102. It can be observed that there is not a significant difference in the training results between the two methods. This indicates that the proposed method can train the network quickly without compromising on the image reconstruction quality. In summary, the proposed method in this paper offers both time efficiency and quality preservation.
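For reference, PSNR and SSIM can be computed with scikit-image as in the short sketch below; the images compared here are random placeholders standing in for a super-resolved view and its ground-truth high-resolution view.

from skimage.metrics import peak_signal_noise_ratio, structural_similarity
import numpy as np

# Placeholder images: the super-resolved view and the ground-truth high-resolution view.
sr_img = np.random.rand(100, 100, 3)
hr_img = np.random.rand(100, 100, 3)

psnr = peak_signal_noise_ratio(hr_img, sr_img, data_range=1.0)
ssim = structural_similarity(hr_img, sr_img, channel_axis=-1, data_range=1.0)
print(f'PSNR = {psnr:.2f} dB, SSIM = {ssim:.4f}')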
Due to the limited number of cameras, through pixel mapping we can only reconstruct the view images with lower resolution. Therefore, super-resolution is used to improve these reconstructed images; the results are shown in Figure 5.
To verify the feasibility of the proposed method, a comparison of reconstructed 3D images is made between our method, SRCNN [37], bicubic interpolation [37], and pixel mapping [13]. For simplicity, we select a specific viewpoint for super-resolution reconstruction. From the local zoom-in effect, it can be observed that the proposed method provides reconstructed images with sharper texture details and better subjective image quality than the others.
In addition, SSIM is used as an objective quantitative metric for evaluating the reconstructed image quality, as shown in Table 2.
From Figure 5, Table 2, and Table 3, it can be observed that the proposed method achieves better view effects for the three 3D models of different complexity. It also exhibits better generalization ability, displays more high-frequency information, and ensures a better 3D imaging effect as perceived by the human eye. The proposed method not only improves SSIM but also reduces the training time by about 20%.
In summary, the improved lite GAN network can quickly super-resolve the low-resolution view images and improve the 3D display quality, better representing detail and texture information. Removing the BN layer allows better extraction of high-frequency information, since BN may introduce unnecessary smoothing of high-frequency details; without it, the network preserves image details and texture features more effectively. Furthermore, reducing the number of basic blocks brings two benefits: first, it reduces the number of parameters and the computational load, thereby accelerating network training and high-resolution image generation; second, it prevents the network from becoming overly complex and helps avoid overfitting, resulting in generated images with better generalization ability. Increasing inter-block operations enhances connectivity between layers, allowing the network to better propagate and utilize information. This improves the network's ability to capture image details, leading to more realistic and clearer high-resolution images.
The VOC2012 dataset was used for training; it includes a large number of scene images covering various types such as people and landscapes. The dataset was augmented using techniques such as rotation and scaling to increase its diversity, which improves the model's ability to generalize to different scenes and enhances its performance. Although the training time is long, once the model is trained, subsequent super-resolution can be applied directly without further training; it takes only about 1 s to super-resolve a new input image using the trained model, as shown in Table 2. Therefore, it is feasible to use super-resolution for computational integral imaging reconstruction. Reference [44] also indicates that super-resolution methods can improve the reconstruction speed of integral imaging. To verify the feasibility of the proposed method, three models with different complexities were evaluated, and image quality metrics such as PSNR and SSIM were used for comparison. The results showed that the proposed method exhibited higher PSNR and SSIM values while reducing training time, which means it can better preserve image details, improve image fidelity, and significantly enhance reconstruction quality. Through extensive testing on a large amount of data, we confirmed that the performance difference of the proposed method is significant and not merely due to random error; therefore, we conclude that the proposed improvements are statistically significant and are effectively reflected in the image quality metrics.
It should be noted that the proposed method performs well in SR for view images with relatively high resolution. However, if the resolution of the view images is too small, the method may suffer from distortion; in particular, its performance on images of size 25 × 25 pixels is not ideal. Ongoing research and advances in SR algorithms hold promise for handling such small image sizes, so future work could focus on refining the proposed method to improve its performance even for images with extremely low resolution.