In recent years, three main approaches have been used for LFI compression: frequency-domain transformation, inter-frame prediction, and deep learning. Transform-based compression techniques, such as JPEG, JPEG 2000 [6], and JPEG Pleno [7], use frequency-domain transforms like the discrete cosine transform (DCT) and discrete wavelet transform (DWT) to remove pixel redundancy. On the DCT side, Aggoun et al. [8] introduced the idea of grouping images from adjacent microlens units into 3D pixel blocks and applying a 3D-DCT to reduce spatial redundancy. Sgouros et al. [9] followed up by exploring different scanning orders for assembling microlens views into 3D blocks, finding that the Hilbert scan gave the best rate-distortion performance. More recently, Carvalho et al. [10] presented a 4D-DCT-based light-field codec, MuLE, which partitions the light field into 4D blocks for the 4D-DCT, groups the transform coefficients using a hexadeca-tree, and encodes the resulting stream with an adaptive arithmetic coder. On the DWT side, Zayed et al. [11] applied a 3D-DWT to stacks of sub-aperture images, performing a 1D-DWT along each of the three dimensions to produce eight subbands, followed by dead-zone scalar quantization and encoding with set partitioning in hierarchical trees (SPIHT) [12]. A comparison of two DWT-based schemes (JPEG 2000 and SPIHT) against the DCT-based JPEG standard, measured by the average PSNR and SSIM of the reconstructed lens images, showed that SPIHT outperforms JPEG in RD performance, while JPEG 2000 surpasses both in PSNR and SSIM [13]. However, despite their efficacy on microlens-captured images, these methods are less suitable for the high-resolution multi-view images captured by multi-camera arrays: the computational complexity of the DCT and DWT on high-dimensional data can be prohibitive for real-time processing and rapid-response applications, and the irreversible detail loss introduced during compression can harm subsequent uses of the light-field data.
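To make the 3D-DCT blocking idea of Aggoun et al. [8] concrete, the sketch below stacks a few adjacent microlens views into a 3D block and applies a separable DCT-II along each axis. The block dimensions and the orthonormal normalization are illustrative assumptions, not the parameters of any cited codec.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def dct3(block):
    """Separable 3D-DCT: apply the 1D transform along each of the 3 axes."""
    for axis in range(3):
        d = dct_matrix(block.shape[axis])
        block = np.moveaxis(np.tensordot(d, block, axes=(1, axis)), 0, axis)
    return block

# Hypothetical 3D block: 4 adjacent microlens views of 8x8 pixels each.
rng = np.random.default_rng(0)
views = rng.standard_normal((4, 8, 8))
coeffs = dct3(views)
# The orthonormal transform preserves energy (Parseval), so redundancy
# removal comes from the energy compacting into few low-frequency coefficients.
assert np.isclose(np.sum(views**2), np.sum(coeffs**2))
```

In an actual codec the coefficients would then be quantized and entropy coded; the sketch stops at the transform stage.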
Compression methods based on inter-frame prediction can be divided into pseudo-video-sequence (PVS) and multi-view prediction approaches. A PVS, formed by ordering the sub-aperture views, exhibits strong correlation similar to a video sequence, so the inter- and intra-frame prediction tools of a video encoder can remove redundant information from the light-field data. Early efforts include Olsson et al.'s [14] use of the H.264 encoder to compress SAIs by assembling them into video sequences. Subsequently, Dai et al. [15] explored converting lenslet images into sub-aperture images, comparing rotary and raster scanning for compression efficiency. Vieira et al. [16] furthered this research by evaluating HEVC configurations for encoding PVSs obtained from various scanning sequences. Similarly, Hariharan et al. [17] employed HEVC to encode a similar topological scan structure. Zhao et al. [18] introduced a combined U-shaped and serpentine scan sequence, using JEM for encoding and achieving better results than traditional methods such as JPEG. Jia et al. [19] enhanced block-based illumination compensation and adaptive filtering for reconstructed sub-aperture images, achieving notable bit savings over existing methods. While video encoders using these methods achieve effective compression, they retain the one-dimensional character of video sequences: regardless of scan order, some views lack reference information from adjacent views, constraining full use of inter-view correlation. Multi-view prediction overcomes this limitation by exploiting correlations along multiple view directions. Liu et al. [20] proposed a two-dimensional hybrid coding framework that divides the view array into seven layers with different prediction directions. Li et al. [21] improved on this by dividing the view array into four regions to enhance random access, coupled with motion-vector transformation for better compression. Amirpour et al. [22] also proposed an MVI structure for random access, dividing the views into three layers and enabling parallel processing to reduce the time complexity of multi-core codecs. In addition, Ahmad et al. [23] and Zhang et al. [24] at Mid Sweden University, and Khoury's team [25] at Columbia University, treated the SAIs as a multi-view video sequence and proposed different prediction schemes based on the multi-view extension of HEVC (MV-HEVC). Currently, schemes based on inter-frame prediction are predominant owing to the versatility of their prediction structures, which offer more potential for exploiting LF correlation than direct lenslet-image coding.
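The raster and serpentine scan orders discussed above can be sketched as simple index generators over a U×V grid of sub-aperture views; the function names and the 3×3 grid are illustrative, not taken from any cited work.

```python
def raster_scan(rows, cols):
    """Row-by-row, left-to-right ordering of view indices."""
    return [(r, c) for r in range(rows) for c in range(cols)]

def serpentine_scan(rows, cols):
    """Boustrophedon order: alternate direction each row so that
    consecutive frames in the pseudo-video sequence stay spatially
    adjacent, which helps the encoder's inter-frame prediction."""
    order = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in cs)
    return order

print(serpentine_scan(3, 3))
# [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (1, 0), (2, 0), (2, 1), (2, 2)]
```

The views would then be fed to the video encoder in the returned order; note how the serpentine order never jumps across the grid, unlike the raster order at each row boundary.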
In recent years, with the rapid development of deep learning, many models for LFI compression have been proposed. Notably, Shin et al. [26] introduced the EPINET model, leveraging convolutional neural networks (CNNs) to improve depth estimation across the light-field array through analysis of epipolar plane images (EPIs), facilitating the reconstruction of high-quality LF images at the decoder. Similarly, Hedayati et al. [27] developed JPEG-Hance Net and Depth-Net, built on JPEG, which estimate a depth map from the central view and then reconstruct the complete LF, showing marked improvements over HEVC compression. Furthermore, Bakir et al. [28] and Jia et al. [29] explored sparse sampling of sub-aperture images (SAIs), devising compression schemes that combine generative adversarial networks (GANs) with conventional encoders to restore the full light field from the sampled and unsampled parallax views. Concurrently, Yang et al. [30] sampled the images into sparse viewpoint arrays, computed the differences between adjacent microimages, and reconstructed the complete holographic image from the residual viewpoints and parallax. In a similar vein, Liu et al. [31] applied sparse sampling to the dense LF SAIs, transmitting only the sparse SAIs and employing a multi-stream view reconstruction network (MSVRNet) at the decoder to reconstruct the dense LF sampling area, with commendable results.
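The sparse-sampling schemes above transmit only a subset of the SAIs and leave the rest to be synthesized at the decoder. A minimal sketch of the encoder-side selection step, keeping every `stride`-th view in both angular dimensions (the grid size and stride are illustrative assumptions):

```python
import numpy as np

def sparse_sample(sai_grid, stride):
    """Keep every `stride`-th view along both angular axes; a
    decoder-side network would reconstruct the dropped views."""
    mask = np.zeros(sai_grid.shape[:2], dtype=bool)
    mask[::stride, ::stride] = True
    sampled = sai_grid[::stride, ::stride]
    return sampled, mask

# Hypothetical 9x9 dense light field of 16x16-pixel grayscale SAIs.
dense = np.zeros((9, 9, 16, 16))
sampled, mask = sparse_sample(dense, stride=4)
# Only 9 of the 81 views are transmitted; the rest are synthesized.
assert sampled.shape[:2] == (3, 3)
assert mask.sum() == 9
```

The boolean mask would be signaled to the decoder so it knows which angular positions must be reconstructed.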
Although these methods achieve a certain level of compression efficiency, their drawbacks and challenges cannot be overlooked. The inherently multi-dimensional nature of light-field images produces a substantial volume of data, which imposes heavy hardware and time costs during model training. Furthermore, while these techniques perform excellently on their training datasets, their ability to generalize across diverse scene changes remains a critical concern.