Article

Preprocessing for Multi-Dimensional Enhancement and Reconstruction in Neural Video Compression

School of Electric and Electronic Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 8626; https://doi.org/10.3390/app14198626
Submission received: 23 July 2024 / Revised: 18 August 2024 / Accepted: 23 September 2024 / Published: 25 September 2024

Abstract

The surge in ultra-high-definition video content has intensified the demand for advanced video compression techniques. Video encoding preprocessing can improve coding efficiency while ensuring a high degree of compatibility with existing codecs. However, existing video encoding preprocessing methods are limited in their ability to fully exploit redundant features in video data and recover high-frequency details, and their network architectures often lack compatibility with neural video encoders. To address these challenges, we propose a Multi-Dimensional Enhancement and Reconstruction (MDER) preprocessing method to improve the efficiency of deep learning-based neural video encoders. Firstly, our approach integrates a degradation compensation module to mitigate encoding noise and boost feature extraction efficiency. Secondly, a lightweight fully convolutional neural network is employed, which utilizes residual learning and knowledge distillation to refine key features and suppress irrelevant ones across spatial and channel dimensions. Furthermore, to maximize the use of redundant information, we incorporate Dense Blocks, which enhance and reconstruct important features in the video data during preprocessing. Finally, the preprocessed frames are mapped from pixel space to feature space through the Dense Feature-Enhanced Video Compression (DFVC) module, which improves motion estimation and compensation accuracy. The experimental results show that, compared to the baseline neural video encoder, the MDER method reduces bits per pixel (Bpp) by 0.0714 and 0.0536 under equivalent PSNR and MS-SSIM conditions, respectively. These results indicate significant improvements in compression efficiency and reconstruction quality, highlighting the effectiveness of the MDER preprocessing method and its compatibility with neural video codec workflows.

1. Introduction

The proliferation of online ultra-high-definition video has significantly increased data volumes, challenging existing bandwidth resources. Enhancing video coding efficiency while maintaining visual quality has therefore become a crucial research priority. Traditional video encoders rely on heuristic rules and human visual system characteristics to minimize temporal and spatial redundancies. With advancements in deep learning, deep neural networks (DNNs) are increasingly applied in image and video processing. Initially, efforts centered on replacing specific modules of traditional video encoders with neural networks, including prediction [1], residual [2], mode decision [3], and discrete cosine transform (DCT) [4] modules. Despite these integrations, this approach often struggles to achieve an optimal balance between coding efficiency and visual fidelity, which has prompted the development of fully neural network-based models.
Lu et al. [5] introduced Deep Video Compression (DVC), a fully deep learning-based framework utilizing optical flow estimation and autoencoder-based networks for motion and residual information compression. However, DVC’s single reference frame prediction limits its utilization of temporal correlations. To address this, Lin et al. [6] proposed the Multiple Frames Prediction for Learned Video Compression (M-LVC) model, which enhances temporal correlations through multi-reference frame motion compensation. Further improvements were achieved by Li et al. [7], whose DCVC model converts the predictive coding framework into a conditional coding framework, taking rich spatio-temporal contextual information as conditional input and employing context encoding to overcome the limitations of residual coding. Subsequently, DCVC-HEM [8] introduced a powerful parallel-friendly entropy model for improved probability distribution prediction. Further advancements by Li et al. with the DCVC-DC model [9] demonstrated superior performance over traditional codecs in both RGB and YUV420 color spaces.
Despite the successes of DCVC models in contextual coding, they typically require large changes to the coding framework, which may limit their applicability to existing video coding standards. To address this, Hu et al. [10] proposed Feature Space Deep Video Compression (FVC) to project video frames from the pixel space to feature space. Compared to conditional coding frameworks, feature space mapping enables a more efficient representation, which significantly reduces pixel-domain redundancy and improves compression efficiency while maintaining sensitivity and adaptability to visual content.
Although these methods have advanced video-coding algorithms, only a few frameworks optimize encoding from an end-to-end perspective. For instance, Lu et al. [5] proposed an end-to-end video encoding method which employs two neural networks to compress both motion and residual information. Similarly, Samarathunga et al. [11] introduced a semantic communication-based hybrid video codec, which leverages the intra coding capabilities of Versatile Video Coding (VVC) to encode key frames. These key frames provide context for semantic communication, while residuals are encoded to enhance the fidelity of the output frames.
These methods excel at reducing redundancy and enhancing compression efficiency, but they may lack sufficient flexibility to adapt to varying video content and encoding requirements. Video preprocessing techniques, which apply a series of preliminary operations to video data before coding, play a crucial role in this optimization. Recent advancements in learning-based video encoding preprocessing techniques have shown promising results. Talebi et al. [12] built a Pre-Editing Network (PEN), which uses a ResNet cascade and a patch-based Spatial Transformer Network to smooth the image. Chadha et al. [13] proposed deep perceptual preprocessing (DPP), which employs dilated convolutions with varying dilation rates at each layer. This architecture enlarges the receptive field to capture more comprehensive feature information and produce higher-quality, more stable coding features. Ma et al. [14] introduced a rate-perception optimized preprocessing (RPP) method for video coding, which integrates perceptual enhancement mechanisms to optimize video frames for both bit-rate efficiency and perceptual quality. These methods leverage the self-learning capabilities of neural networks to enhance compression efficiency.
Despite these advancements, current learning-based approaches often fall short in adequately filtering redundant features and lack a robust mechanism to recover high-frequency details in video data. Furthermore, the existing network architectures are not sufficiently lightweight to be deployed on CPU-based end-user devices, which highlights the potential for further research in preprocessing techniques to enhance video coding efficiency. To address these challenges, we propose a Multi-Dimensional Enhancement and Reconstruction (MDER) preprocessing method aimed at optimizing video compression efficiency. First, we introduce a degradation compensation module to remove encoding noise from the original video and alleviate the frame quality degradation caused by transmission, which improves the efficiency of feature extraction. Second, we construct a lightweight fully convolutional neural network with multi-dimensional deredundancy. This network utilizes residual learning and knowledge distillation to enhance and refine key features from both spatial and channel dimensions, which suppresses irrelevant features. In addition, we incorporate Dense Blocks during the feature extraction phase to maximize the utilization of redundant information, making the network more efficient at processing residual information and further enhancing and reconstructing important features. Finally, the preprocessed frames are converted from pixel space to feature space through the feature extraction module of Dense Feature-Enhanced Video Compression (DFVC), which enables more accurate motion estimation and motion compensation. Overall, this comprehensive preprocessing approach allows the DFVC neural video encoder to learn essential coding features more efficiently and improve coding efficiency without compromising visual quality.
The contributions of this paper are summarized as follows:
(1)
We propose a Multi-Dimensional Enhancement and Reconstruction (MDER) preprocessing method for video coding which effectively removes encoding noise and enhances details to reconstruct frames for video coding.
(2)
Dense Blocks are integrated to further maximize the utilization of redundant information, allowing the network to process and exploit this information more effectively and improving the efficiency of residual information processing.
(3)
Our proposed method can improve coding efficiency while maintaining consistent visual quality and is more easily deployed on devices with limited processing power.

2. Materials and Methods

The deployment workflow of the MDER preprocessing method is presented in Figure 1. Each frame of the original video sequence is processed individually through MDER. By utilizing the output of the MDER as the input for the feature extraction module, we have performed end-to-end optimization of the DFVC. The MDER serves as a plug-and-play module that can be easily integrated into other neural encoders, enabling their end-to-end optimization. The MDER preprocessing method is presented in Figure 2.

2.1. Multi-Dimensional Enhancement and Reconstruction (MDER)

The MDER preprocessing method, as illustrated in Figure 2, is designed to extract and enhance essential video image features across multiple dimensions for the subsequent encoding process. Firstly, the Degradation Compensation Model (DCM) is applied to eliminate coding noise and restore degradations such as blurring, scaling, and lossy compression that may occur during transmission. Next, a lightweight fully convolutional neural network, MDFRNet, is utilized. This network integrates residual learning and feature distillation principles [16] to extract and reconstruct key coding features from both spatial and channel dimensions [17]. Because the video data are first processed by the DCM, MDFRNet is better guided to focus on the core features, which accelerates network convergence and yields more representative feature representations that enhance encoding performance. In summary, each frame to be encoded is first passed through the DCM to remove encoding noise and restore image degradation, improving the efficiency and accuracy of feature extraction; the processed data are then forward-propagated through the trained MDFRNet, where encoding-relevant features are extracted and reconstructed to produce the preprocessed output image.
However, the denoising process may lead to the loss of high-frequency details such as edges, which are critical to maintaining video quality. To mitigate this issue, we incorporate a sharpening filter to restore high-frequency details and introduce an Adaptive Discrete Cosine Transform (ADCT) loss function, which specifically optimizes the retention of different types of frequency information during neural network training. Recognizing that the separated high-frequency components also contain details critical to human visual perception, we designed an adaptive adjustment strategy to preserve these important high-frequency details. Together, these steps significantly reduce the volume of video data, which improves video encoding efficiency and performance without compromising image quality. Additionally, the MAE loss function is employed to ensure that the network possesses fundamental image reconstruction capabilities, and the full-reference image quality metric MS-SSIM is also integrated into the loss function. By combining these preprocessing techniques, we improve the coding efficiency of neural video encoders without sacrificing visual quality.
DCM. The model comprises four parts: the Blur Degradation module, the Noise Degradation module, the Resize Degradation module, and the JPEG Degradation module, each addressing a specific type of video degradation. By combining these different types of filtering, the DCM removes high-frequency noise and the degradations introduced during transmission, providing high-quality input features to MDFRNet, which accelerates network convergence and enhances compression efficiency. Although applying the DCM reduces the fidelity of the input frame and greatly reduces its high-frequency content, this is precisely what suppresses high-frequency noise and highlights the overall content and structure of the frame. This strategy guides MDFRNet to pay more attention to intrinsic features during the training phase.
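For illustration, the sketch below chains a blur, noise, resize, and JPEG stage in the order described above. It is a minimal stand-in for the DCM, assuming uint8 BGR frames read with OpenCV; the kernel sizes, noise level, scale factor, and JPEG quality are placeholder values, not the paper's settings.

```python
# Minimal sketch of a DCM-style degradation pipeline (blur -> noise -> resize -> JPEG).
# Parameters are illustrative assumptions; frames are uint8 BGR arrays.
import cv2
import numpy as np

def blur_degradation(frame: np.ndarray, ksize: int = 5, sigma: float = 1.2) -> np.ndarray:
    # Gaussian blur suppresses high-frequency noise and fine detail.
    return cv2.GaussianBlur(frame, (ksize, ksize), sigma)

def noise_degradation(frame: np.ndarray, std: float = 2.0) -> np.ndarray:
    # Mild additive Gaussian noise models sensor/transmission noise.
    noisy = frame.astype(np.float32) + np.random.normal(0.0, std, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def resize_degradation(frame: np.ndarray, scale: float = 0.5) -> np.ndarray:
    # Down- and up-scaling discards detail the encoder would otherwise spend bits on.
    h, w = frame.shape[:2]
    small = cv2.resize(frame, (int(w * scale), int(h * scale)), interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)

def jpeg_degradation(frame: np.ndarray, quality: int = 75) -> np.ndarray:
    # A JPEG round-trip emulates lossy-compression artifacts.
    ok, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

def degradation_compensation(frame: np.ndarray) -> np.ndarray:
    # Chain the four modules; the smoothed output highlights content and structure
    # and serves as the input guiding MDFRNet toward intrinsic features.
    for fn in (blur_degradation, noise_degradation, resize_degradation, jpeg_degradation):
        frame = fn(frame)
    return frame
```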
MDFRNet. First, a desub-pixel convolution layer [18] is introduced to downsample the input features and increase channel depth, enhancing the network’s capacity to capture intricate video details. Subsequently, multiple Residual Local Feature Blocks (RLFBs) [19] are cascaded to extract detailed features. Each RLFB employs a modest number of stacked convolution (Conv) and rectified linear unit (ReLU) layers for local feature extraction; specifically, each feature refinement module in the RLFB contains a single 3 × 3 convolution layer followed by a ReLU activation layer. The RLFBs extract features through feature distillation: as they are stacked progressively, redundant information is effectively filtered out, and high-level semantic information that is critical to subsequent coding is retained.
Second, Spatial and Channel (SC) convolution integrates and refines features across spatial and channel dimensions, which improves their discriminative quality. Skip connections within MDFRNet enrich deep features, compensating for potential information loss in high-level abstract features. These skip connections also mitigate vanishing gradients, improve the efficiency of signal backpropagation, and reduce computational resource utilization. Finally, a sub-pixel convolution layer [20] performs spatial up-sampling of the feature maps with minimal computational overhead. This method avoids the artifacts and detail loss common in conventional interpolation techniques, ensuring that the predicted outputs retain semantic content and refined spatial details.
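The following PyTorch sketch mirrors this layout: space-to-depth (desub-pixel) downsampling, a cascade of residual local feature blocks, a long skip connection, and sub-pixel upsampling. Channel widths, block counts, and the plain convolutions standing in for SCConv are assumptions for illustration, not the paper's configuration.

```python
# A minimal sketch of an MDFRNet-style backbone (scale factor 2, 3-channel input).
import torch
import torch.nn as nn

class RLFBLike(nn.Module):
    """Residual local feature block: a few Conv3x3+ReLU layers plus a local skip."""
    def __init__(self, ch: int, depth: int = 3):
        super().__init__()
        layers = []
        for _ in range(depth):
            layers += [nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)
        self.fuse = nn.Conv2d(ch, ch, 1)   # stands in for SCConv feature refinement

    def forward(self, x):
        return x + self.fuse(self.body(x))

class MDFRNetSketch(nn.Module):
    def __init__(self, ch: int = 48, n_blocks: int = 4):
        super().__init__()
        # "Desub-pixel" downsampling: space-to-depth, then a conv to set channel width.
        self.down = nn.Sequential(nn.PixelUnshuffle(2), nn.Conv2d(3 * 4, ch, 3, padding=1))
        self.blocks = nn.Sequential(*[RLFBLike(ch) for _ in range(n_blocks)])
        # Sub-pixel upsampling back to the input resolution.
        self.up = nn.Sequential(nn.Conv2d(ch, 3 * 4, 3, padding=1), nn.PixelShuffle(2))

    def forward(self, x):
        feat = self.down(x)
        feat = feat + self.blocks(feat)    # long skip connection enriches deep features
        return self.up(feat)

# frame = torch.rand(1, 3, 128, 128); out = MDFRNetSketch()(frame)  # out: (1, 3, 128, 128)
```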
LOSS. The Discrete Cosine Transform (DCT) is a widely used transformation that converts an image from the spatial domain to the frequency domain, separating its low-frequency components from its high-frequency components. However, DCT-based compression often results in the loss of critical high-frequency structural details, which are essential for image quality. To address this issue, we introduce the Adaptive Discrete Cosine Transform (ADCT) loss, an extension of the traditional DCT that incorporates adaptive threshold adjustment and detail-preserving quantization techniques. This loss effectively retains critical high-frequency details in images while achieving a high compression ratio. By reducing spatial redundancy in video content without compromising image quality, it ensures that the compressed video closely resembles the original in terms of visual characteristics. When optimizing the MDER preprocessing method, we must consider both the amount of video data and the visual quality of the video. In our paper, we optimize the model parameters by jointly minimizing the DCT loss $L_{\mathrm{DCT}}$, the reconstruction loss $L_R$, and the perceptual loss $L_P$. The Multi-Scale Structural Similarity Index (MS-SSIM) is integrated as the perceptual loss to preserve luminance, contrast, and structural integrity in high-frequency image regions.
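A hedged sketch of such a joint objective is shown below: an L1 penalty in the DCT domain as a simplified stand-in for the adaptive ADCT term, an MAE reconstruction term, and 1 − MS-SSIM as the perceptual term. The loss weights and the third-party pytorch_msssim dependency are assumptions, not the paper's exact formulation.

```python
# Sketch of a joint loss L = w1*L_DCT + w2*L_R + w3*L_P (weights are illustrative).
import math
import torch
from pytorch_msssim import ms_ssim  # third-party package, assumed available

def dct_matrix(n: int, device=None) -> torch.Tensor:
    # Orthonormal DCT-II basis: C[k, i] = s_k * cos(pi * (2i + 1) * k / (2n)).
    i = torch.arange(n, device=device).float()
    k = i.view(-1, 1)
    c = torch.cos(math.pi * (2 * i + 1) * k / (2 * n)) * math.sqrt(2.0 / n)
    c[0] = math.sqrt(1.0 / n)
    return c

def dct2(x: torch.Tensor) -> torch.Tensor:
    # 2D DCT of a (B, C, H, W) tensor via separable matrix products.
    ch = dct_matrix(x.shape[-2], x.device)
    cw = dct_matrix(x.shape[-1], x.device)
    return ch @ x @ cw.t()

def mder_loss(pred: torch.Tensor, target: torch.Tensor,
              w_dct: float = 1.0, w_rec: float = 1.0, w_per: float = 0.1) -> torch.Tensor:
    l_dct = (dct2(pred) - dct2(target)).abs().mean()      # frequency-domain fidelity
    l_rec = (pred - target).abs().mean()                  # MAE reconstruction term
    # Smaller MS-SSIM window so the term also works on 128x128 training patches.
    l_per = 1.0 - ms_ssim(pred, target, data_range=1.0, win_size=7)
    return w_dct * l_dct + w_rec * l_rec + w_per * l_per
```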

2.2. Dense Feature-Enhanced Video Compression (DFVC)

The objective of the Feature-space Video Coding (FVC) network is to generate high-quality reconstructed video frames $\hat{X}_t$ at any given bit rate. As shown in Figure 3, all components of the FVC framework, including the deformable compensation, residual compression, and multi-frame feature fusion modules, operate within the feature space. For a given input frame $X_t$, FVC first performs feature extraction to generate input features $F_t$. The deformable compensation module involves three steps: motion estimation, motion compression, and motion compensation. Subsequently, the residual features $R_t$ between the input features $F_t$ and the predicted features $\bar{F}_t$ are compressed in the residual compression module. The initial reconstructed features $\tilde{F}_t$ are then fused with the previous reference features $F^{ref}_{t-1}$, $F^{ref}_{t-2}$, $F^{ref}_{t-3}$ in the multi-frame feature fusion module to generate the final reconstructed features $\hat{F}_t$. Finally, the reconstructed frame $\hat{X}_t$ is generated by passing $\hat{F}_t$ through the frame reconstruction module.
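The structural sketch below traces this per-frame flow with each stage reduced to a placeholder convolution; the real motion estimation, entropy coding, and deformable compensation networks are far more elaborate, so this is only an illustration of the data path, not the FVC/DFVC implementation.

```python
# Schematic sketch of the per-frame feature-space coding flow (placeholder modules).
import torch
import torch.nn as nn

class FVCSketch(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        conv = lambda cin, cout: nn.Conv2d(cin, cout, 3, padding=1)
        self.feat_extract = conv(3, ch)            # pixel space -> feature space (F_t)
        self.motion_est = conv(2 * ch, ch)         # offsets from (F_t, reference features)
        self.motion_codec = conv(ch, ch)           # stand-in for motion compression
        self.compensate = conv(2 * ch, ch)         # warped prediction \bar{F}_t
        self.residual_codec = conv(ch, ch)         # stand-in for residual compression (R_t)
        self.fusion = conv(2 * ch, ch)             # multi-frame feature fusion -> \hat{F}_t
        self.frame_recon = conv(ch, 3)             # feature space -> pixel space (\hat{X}_t)

    def forward(self, frame: torch.Tensor, ref_feat: torch.Tensor):
        f_t = self.feat_extract(frame)
        motion = self.motion_codec(self.motion_est(torch.cat([f_t, ref_feat], 1)))
        f_bar = self.compensate(torch.cat([motion, ref_feat], 1))   # predicted features
        f_tilde = f_bar + self.residual_codec(f_t - f_bar)          # add coded residual
        f_hat = self.fusion(torch.cat([f_tilde, ref_feat], 1))      # fuse with references
        return self.frame_recon(f_hat), f_hat                       # frame and new reference
```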
In the FVC framework, ResBlocks play a central role in managing the transition between pixel space and feature space through the feature extraction module and frame reconstruction module. ResBlocks preserve detail and texture in video frames, ensuring natural appearance and high quality in compressed video playback. They also prioritize computational efficiency, in order to reduce resource requirements while maintaining high-quality reconstruction.
During encoding, ResBlocks are crucial in the deformable motion compression and residual compression modules. Additionally, their residual connections alleviate gradient issues, so as to optimize deep network learning. Despite their effectiveness in enhancing video processing efficiency and compression quality, ResBlocks do not fully exploit video data redundancy. Their design prioritizes facilitating information flow and learning residual representations over directly leveraging spatial and temporal redundancies in video data.
To further maximize the use of redundant information, we incorporate Dense Blocks. The original ResBlock structure is replaced with a Dense Block configuration, which enhances and reconstructs important features in the video data. The feature extraction and frame reconstruction modules are shown in Figure 4a and Figure 4b, respectively, with Figure 4c illustrating the specific design of the Dense Block module. The Dense Block consists of four consecutive convolutional (Conv) layers, each followed by ReLU activation functions, which are arranged in a densely connected manner. The Conv+ReLU units in the Dense Block incrementally refine and augment feature information. The convolutional layers also perform feature downscaling and tuning, which are utilized to ensure accurate and robust feature representation for subsequent processing stages. The dense connections between feature maps enable each convolutional layer to utilize features from previous layers, which preserves information richness and enhances the network’s understanding of image context.
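A minimal PyTorch sketch of such a Dense Block is given below: four Conv+ReLU units whose inputs are the concatenation of all preceding feature maps, followed by a 1 × 1 convolution that fuses the stack back to the working channel width. The growth rate and channel counts are illustrative assumptions.

```python
# Sketch of a densely connected block with four Conv+ReLU units and a fusion conv.
import torch
import torch.nn as nn

class DenseBlockSketch(nn.Module):
    def __init__(self, channels: int = 64, growth: int = 32, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = channels
        for _ in range(n_layers):
            # Each unit sees the concatenation of the block input and all earlier outputs.
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, 3, padding=1),
                nn.ReLU(inplace=True),
            ))
            in_ch += growth
        # Fuse the dense feature stack back to `channels` for the next stage.
        self.fuse = nn.Conv2d(in_ch, channels, 1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return self.fuse(torch.cat(feats, dim=1))

# x = torch.rand(1, 64, 32, 32); y = DenseBlockSketch()(x)  # y: (1, 64, 32, 32)
```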

2.3. Preprocessing Module for Neural Video Encoding

In this section, we introduce a preprocessing module within the DFVC to explore and evaluate preprocessing techniques for neural video encoders. Similar to preprocessing strategies in traditional video encoders, the objective of integrating a preprocessing module before the neural video encoder is to smooth the video signal to be encoded, which removes high-frequency noise while preserving the high-frequency details critical to human visual quality. This preprocessing module is inserted between the current frame input and the feature extraction module, which ensures that the video signal is preprocessed before feature extraction. Subsequently, the preprocessed video is passed through the DFVC feature extraction module, transforming it from pixel space to feature space. The video signal then proceeds through a series of operations, including the deformable compensation module, residual compression module, multi-frame feature fusion module, and frame reconstruction module, for encoding and decoding.
By processing raw video through the preprocessing module, DFVC can more effectively identify and learn key encoding features. The removal of redundant high-frequency information significantly reduces the data volume that needs to be processed during encoding, which enhances encoding efficiency. Additionally, high-frequency information often accompanies irregular noise, which can increase the complexity of feature extraction and compromise the stability of these features. The preprocessing module smooths the video image by eliminating such high-frequency noise, which allows the feature extraction process to focus more on the primary structure and content of the image. This method also improves the stability and reliability of the features.
However, the removal of high-frequency information may result in the loss of important visual details, such as edges. To optimize the efficiency and relevance of the neural encoder’s features, the MDER preprocessing method is employed as the preprocessing module for DFVC. The output of MDER is fed as input to the feature extraction module, which enables end-to-end optimization of DFVC. MDER is designed as a plug-and-play module, functioning as an independent preprocessing step before the encoder network’s training. By providing input video data with reduced high-frequency noise, MDER contributes to saving encoding bitrates and improving overall encoding performance. This integration does not require modifications to the underlying end-to-end codec architecture, which makes it compatible with existing end-to-end systems. The deployment process of MDER within the neural encoder is illustrated in Figure 1.
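To make the plug-and-play deployment concrete, the short sketch below chains a preprocessing stage in front of a neural codec; `mder` and `dfvc` are placeholder modules standing for the preprocessing network and the feature-space codec, not the released implementations.

```python
# Schematic wrapper: MDER-style preprocessing applied before a neural video codec.
import torch
import torch.nn as nn

class PreprocessedCodec(nn.Module):
    def __init__(self, mder: nn.Module, dfvc: nn.Module):
        super().__init__()
        self.mder = mder   # preprocessing (DCM + MDFRNet)
        self.dfvc = dfvc   # neural video codec operating in feature space

    def forward(self, frame: torch.Tensor, ref_features):
        clean = self.mder(frame)               # remove noise, enhance coding-relevant details
        return self.dfvc(clean, ref_features)  # encode/decode the preprocessed frame
```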

3. Experiments and Results

3.1. Datasets

In our paper, the DIV2K dataset and the Flickr2K dataset [20] were utilized as the training set for MDER. The DIV2K dataset comprises 1000 high-resolution 2K images covering diverse scenes, contents, and styles. The Flickr2K dataset is an extension of DIV2K, containing 2650 high-resolution 2K images of various scenes, themes, and styles sourced from the Flickr platform. For performance evaluation, Classes B–D of the VVC Common Test Conditions (CTC) [21] were used as the test set.

3.2. Implementation Details

We use the Adam optimizer [22] to train the MDER preprocessing method, with a learning rate of 1 × 10−3 and a batch size of 64. To enhance the dataset’s diversity and improve the model’s generalization capabilities, we applied data augmentation prior to training. Specifically, the original images from the training set were randomly cropped into patches of size 128 × 128, followed by random flipping, with an equal probability of 25% for no flipping, vertical flipping, horizontal flipping, and mirror flipping. The entire MDER framework is implemented in PyTorch [23] and requires about 3 days of training on an NVIDIA GeForce RTX 2080 Ti (NVIDIA, Santa Clara, CA, USA). During the deployment phase, each coding frame is processed by MDER, and the preprocessed frame is then coded by the neural video codec. Our MDER preprocessing method achieves an inference performance of 1080p@49 FPS and 720p@62 FPS on a single NVIDIA GeForce RTX 2080 Ti.
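The sketch below reproduces this training configuration (random 128 × 128 crops, equal-probability flip augmentation, Adam with a learning rate of 1 × 10−3, batch size 64). The dataset path, ImageFolder layout, and the tiny stand-in model are assumptions; in the full pipeline the model is MDFRNet and the objective combines the ADCT, MAE, and MS-SSIM terms described in Section 2.1.

```python
# Sketch of the MDER training setup (crop, flip augmentation, Adam optimizer).
import random
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import torchvision.transforms.functional as TF

def random_flip(img):
    # Equal 25% probability for: no flip, vertical, horizontal, or both (mirror).
    choice = random.randrange(4)
    if choice == 1:
        img = TF.vflip(img)
    elif choice == 2:
        img = TF.hflip(img)
    elif choice == 3:
        img = TF.hflip(TF.vflip(img))
    return img

train_tf = transforms.Compose([
    transforms.RandomCrop(128),
    transforms.Lambda(random_flip),
    transforms.ToTensor(),
])

# Assumes the DIV2K/Flickr2K images are arranged under "data/train" in ImageFolder layout.
train_set = datasets.ImageFolder("data/train", transform=train_tf)
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

model = nn.Sequential(                       # small placeholder standing in for MDFRNet
    nn.Conv2d(3, 48, 3, padding=1), nn.ReLU(inplace=True), nn.Conv2d(48, 3, 3, padding=1)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.L1Loss()                      # MAE term only; see the loss sketch above

for images, _ in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), images)  # reconstruction against the original patch
    loss.backward()
    optimizer.step()
```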

3.3. Evaluation Methods

Rate–distortion performance is assessed following the Bjontegaard Delta-rate (BD-rate) methodology, with the coding rate expressed in bits per pixel (Bpp). To evaluate the distortion between the reconstructed frames and the original frames, the Peak Signal-to-Noise Ratio (PSNR), Multi-Scale Structural Similarity Index Measure (MS-SSIM), and Video Multi-method Assessment Fusion (VMAF) metrics were employed. Based on the obtained PSNR and MS-SSIM results, we calculated the average PSNR and average MS-SSIM of all reconstructed frames and plotted the RD curves to visualize the overall performance of DFVC at different bit rates.
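For clarity, the following sketch shows one way the per-sequence statistics behind the RD curves could be computed; the function names are illustrative, and MS-SSIM and VMAF would come from their respective reference implementations.

```python
# Sketch of the per-frame distortion (PSNR) and rate (Bpp) statistics used in RD curves.
import math
import numpy as np

def psnr(ref: np.ndarray, rec: np.ndarray, peak: float = 255.0) -> float:
    # Peak Signal-to-Noise Ratio between the original and reconstructed frame.
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def bits_per_pixel(stream_bytes: int, width: int, height: int, n_frames: int) -> float:
    # Average coding rate: total bits divided by the number of encoded pixels.
    return (stream_bytes * 8.0) / (width * height * n_frames)
```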

4. Results and Discussion

4.1. Results

To verify the effectiveness of the MDER method when used with neural video encoders, we deployed it as shown in Figure 1 and compared the experimental results before and after MDER deployment. Class B–D YUV420P sequences from the VVC CTC were tested. The rate–distortion (RD) curves for each category are presented in Figure 5. During inference, the H.266 encoder generated I-frames with Constant Rate Factor (CRF) values of 20, 23, 26, and 29. These I-frames served as reference frames for each GOP in DFVC, with the GOP size set to 12.
As shown in Figure 5, the modified DFVC achieved reductions in Bpp of 0.0522 and 0.0602 under the same PSNR and MS-SSIM conditions, respectively, compared to the original FVC. This indicates the advantage of using Dense Blocks over Residual Blocks, which is especially evident in feature extraction, reconstruction, and the compression of motion and residual information, and these improvements directly enhance coding efficiency. When MDER preprocessing was applied to FVC and to DFVC, Bpp was reduced by 0.0578 and 0.0714, respectively, under the same PSNR and MS-SSIM conditions. Whether deployed on the original FVC or the enhanced DFVC, the integration of the MDER preprocessing method consistently achieves significant gains in compression efficiency for neural video encoders, highlighting its effectiveness in improving compression efficiency.
Additionally, a further comparison between MDER + FVC and MDER + DFVC showed that MDER + DFVC achieved savings of 0.0488 and 0.0415 Bpp, respectively, under identical PSNR and MS-SSIM conditions compared to MDER + FVC. This highlights that incorporating a preprocessing module and performing end-to-end optimization in the improved neural video encoder results in a more substantial improvement in compression efficiency than using a standalone enhanced encoder or merely adding a preprocessing module. The results in Figure 5 clearly demonstrate the substantial effectiveness of our MDER preprocessing method in enhancing the coding efficiency of neural video encoders while maintaining high-quality video output.
To further validate the superiority of the proposed MDER method in neural video encoders, a comparative analysis was performed against other existing methods. We deployed MDER, DPP [13], and RPP [14] within the improved DFVC neural video encoder and plotted RD curves for the three methods to facilitate a comparative evaluation. The results show that, compared with DPP and RPP, respectively, MDER reduces Bpp by 0.0192 and 0.0133 at equivalent PSNR values, by 0.0377 and 0.0183 at equivalent MS-SSIM values, and by an average of 0.049 and 0.045 at equivalent VMAF values. Furthermore, after incorporating the MDER preprocessing method before DFVC, the average Bpp was reduced by 0.0488, 0.0415, and 0.114 at equivalent PSNR, MS-SSIM, and VMAF values, respectively. These results confirm that, while each preprocessing method enhances DFVC’s encoding efficiency, the MDER method offers the largest improvement in video coding efficiency while maintaining high-quality output, which demonstrates its clear advantages.

4.2. Discussion

The results of this study underscore the significant advancements made by the MDER preprocessing method in the realm of neural video encoders. Previous research has either substituted traditional video encoder modules with neural networks or developed entirely neural-based models. However, these approaches often struggle to fully leverage video data redundancies and address noise interference. Our MDER method allows the neural encoder to learn the essential features required for coding, which enhances and reconstructs important features within the video data. The experimental results show that the reduction in Bpp translates directly to more efficient video compression, which is crucial for reducing bandwidth consumption and storage requirements. This improvement is particularly beneficial in applications such as video streaming, where bandwidth efficiency is paramount, and in storage solutions, where reducing file size can lead to significant cost savings.
Specifically, the MDER preprocessing module enhances video encoding by utilizing residual learning and feature distillation techniques. This approach reduces data redundancy while preserving high-frequency details, which are crucial for maintaining visual fidelity. Additionally, the unique dense connectivity properties of Dense Blocks ensure direct connections between each layer and all preceding layers. This enhances the representation capability of the current layer and contributes to improved compression efficiency. This approach significantly reduces redundant data and enhances the stability and reliability of the features by removing high-frequency noise through the preprocessing module.
Dense Blocks provide each network layer with access to all preceding layers’ features through dense connections in order to enhance feature extraction. This approach allows for more comprehensive capture and utilization of feature information. Compared to DPP’s dilated convolutions and RPP’s residual networks, MDER more thoroughly extracts and enhances important features, which improves feature richness and accuracy. While DPP utilizes dilated convolutions to expand the receptive field, and RPP employs residual connections to handle high-frequency details, neither method adequately filters redundant video data.
As shown in Figure 6, deploying these preprocessing methods before DFVC improves encoding efficiency, and our preprocessing method demonstrates significant advantages in improving video encoding efficiency while maintaining high-quality video output. Although no single unified metric is available to objectively quantify the computational complexity of preprocessing modules, a qualitative comparison shows that MDFRNet, which employs knowledge distillation to compress the network scale and substitutes SCConv for traditional convolution layers in feature refinement, is more lightweight than DPP, which relies on dilated convolutions to expand the receptive field, and RPP, which incorporates residual connections to handle high-frequency details. While DPP and RPP are effective, their designs do not fully filter out redundant video data and are less lightweight, which makes them more challenging to deploy on CPU-based devices. Table 1 and Figure 6 show that MDER outperforms DPP and RPP in terms of processing speed and efficiency. Specifically, MDER achieves an inference performance of 1080p@49 FPS on a single NVIDIA GeForce RTX 2080 Ti, which exceeds the frame rates achieved by DPP and RPP. This performance advantage is particularly relevant in scenarios such as live streaming and video conferencing, where real-time processing with high visual fidelity is essential. The lower computational complexity of MDER also suggests that it can be more easily deployed on devices with limited processing power, further expanding its applicability.
The results of this study have several important implications. Firstly, the reduction in data volume alleviates bandwidth constraints in video transmission, which optimizes network resource utilization. Secondly, the improved video quality contributes to a superior user experience, particularly in real-time applications such as live streaming and teleconferencing. Furthermore, the techniques developed are versatile and applicable across various video content types and domains. MDER can be integrated as a plug-and-play module into other neural encoders, serving as an independent preprocessing step before the training of encoding networks. MDER is utilized to provide input video data with reduced high-frequency noise, which helps to save encoding bitrates and enhances encoding performance. This integration is achieved without necessitating changes to the underlying end-to-end codec architecture, which facilitates its incorporation into existing systems. Additionally, expanding the application of this approach to handle diverse video content types and varying streaming conditions could provide valuable insights into its practical applicability and effectiveness. Future research could explore additional preprocessing methods combined with neural video encoders to further improve coding efficiency.

5. Conclusions

In this paper, we propose a Multi-Dimensional Enhancement and Reconstruction (MDER) preprocessing method for neural video compression. A neural network is utilized to preprocess the video, producing reconstructed frames with reduced data and enhanced details for video coding. Our MDER method can be integrated as an independent preprocessing module, which enables the video encoder to achieve end-to-end optimization. The experimental results validate that MDER significantly improves coding efficiency while maintaining high visual quality, which enhances the compression performance of neural video encoders.

Author Contributions

Conceptualization, J.W. and H.Z.; data curation, Q.Z.; formal analysis, J.W.; funding acquisition, G.W.; methodology, J.W.; project administration, H.Z.; resources, H.Z. and G.W.; software, J.W. and G.W.; supervision, H.Z. and X.S.; validation, J.W. and Q.Z.; visualization, Q.Z.; writing—original draft, J.W.; writing—review & editing, J.W., Q.Z. and X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shang, X.; Wang, G.; Liang, J. Color-Sensitivity-Based Rate-Distortion Optimization for H.265/HEVC. IEEE Trans. Circuits Syst. Video Technol. (TCSVT) 2022, 32, 802–812. [Google Scholar] [CrossRef]
  2. Alexandre, D.; Hang, H.M.; Peng, W.H.; Domański, M. Deep Video Compression for Interframe Coding. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 2124–2128. [Google Scholar]
  3. Shang, X.; Li, G.; Zhao, X. Low complexity inter coding scheme for Versatile Video Coding (VVC). J. Vis. Commun. Image Represent. 2023, 90, 103683. [Google Scholar] [CrossRef]
  4. Tsai, Y.-H.; Liu, M.-Y.; Sun, D.; Yang, M.-H.; Kautz, J. Learning binary residual representations for domain-specific video streaming. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; p. 32. [Google Scholar]
  5. Lu, G.; Ouyang, W.; Xu, D.; Zhang, X.; Cai, C.; Gao, Z. Dvc: An end-to-end deep video compression framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11006–11015. [Google Scholar]
  6. Lin, J.; Liu, D.; Li, H.; Wu, F. M-LVC: Multiple Frames Prediction for Learned Video Compression. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3543–3551. [Google Scholar]
  7. Li, J.; Li, B. Deep contextual video compression. Adv. Neural Inf. Process. Syst. 2021, 34, 18114–18125. [Google Scholar]
  8. Li, J.; Li, B.; Lu, Y. Hybrid spatial-temporal entropy modelling for neural video compression. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 1503–1511. [Google Scholar]
  9. Li, J.; Li, B.; Lu, Y. Neural video compression with diverse contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22616–22626. [Google Scholar]
  10. Hu, Z.; Lu, G.; Xu, D. FVC: A New Framework towards Deep Video Compression in Feature Space. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 1502–1511. [Google Scholar]
  11. Samarathunga, P.; Ganearachchi, Y.; Fernando, T.; Jayasingam, A.; Alahapperuma, I.; Fernando, A. A Semantic Communication and VVC Based Hybrid Video Coding System. IEEE Access 2024, 12, 79202–79224. [Google Scholar] [CrossRef]
  12. Talebi, H.; Kelly, D.; Luo, X.; Dorado, I.G.; Yang, F.; Milanfar, P.; Elad, M. Better Compression with Deep Pre-Editing. IEEE Trans. Image Process. 2021, 30, 6673–6685. [Google Scholar] [CrossRef] [PubMed]
  13. Chadha, A.; Andreopoulos, Y. Deep Perceptual Preprocessing for Video Coding. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 14847–14856. [Google Scholar]
  14. Ma, C.; Wu, Z. Rate-perception optimized preprocessing for video coding. arXiv 2023, arXiv:2301.10455. [Google Scholar]
  15. Huang, G.; Liu, Z.; Van, L. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar]
  16. Liu, J.; Tang, J.; Wu, G. Residual feature distillation network for lightweight image super-resolution. In Proceedings of the Computer Vision–ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Part III 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 41–55. [Google Scholar]
  17. Li, J.; Wen, Y.; He, L. SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6153–6162. [Google Scholar]
  18. Vu, T.; Nguyen, C.V.; Pham, T.X. Fast and Efficient Image Quality Enhancement via Desubpixel Convolutional Neural Networks. In Proceedings of the Computer Vision—ECCV 2018 Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
  19. Kong, F.; Li, M.; Liu, S.; Liu, D.; He, J.; Bai, Y.; Chen, F.; Fu, L. Residual Local Feature Network for Efficient Super-Resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 765–775. [Google Scholar]
  20. Shi, W.; Caballero, J.; Huszár, F. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  21. Xue, T.; Chen, B.; Wu, J. Video enhancement with task-oriented flow. Int. J. Comput. Vis. 2019, 127, 1106–1125. [Google Scholar] [CrossRef]
  22. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  23. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 12. [Google Scholar]
Figure 1. The deployment workflow of the MDER preprocessing method: single-pass input frames of the original video sequence and application to neural video codec.
Figure 2. Overall architecture of the MDER preprocessing method.
Figure 3. FVC framework diagram.
Figure 4. Feature Extraction Module and Frame Reconstruction Module in DFVC. (a) Feature extraction module; (b) frame reconstruction module; and (c) Dense Block.
Figure 5. The rate–distortion curves for the VVC Class B–D dataset on PSNR and MS-SSIM.
Figure 6. Comparison of experimental results for VVC Class B–D datasets on PSNR and MS-SSIM.
Table 1. Complexity comparison of preprocessing methods for VVC Class B 1080p video on an NVIDIA GeForce RTX 2080 Ti.
Attribute                             MDER               DPP        RPP
Model Complexity                      Low to Moderate    High       Moderate
Total Processing Time                 45.41 s            59.01 s    56.25 s
Average Processing Time per Frame     0.0086 s           0.0092 s   0.0079 s
Average FPS                           49                 45         48
