Article

Versatile Video Coding-Post Processing Feature Fusion: A Post-Processing Convolutional Neural Network with Progressive Feature Fusion for Efficient Video Enhancement

1 Department of Electronics and Information Convergence Engineering, Kyung Hee University, Yongin 17104, Republic of Korea
2 Department of Electronic Engineering, Kyung Hee University, Yongin 17104, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8276; https://doi.org/10.3390/app14188276
Submission received: 25 August 2024 / Revised: 11 September 2024 / Accepted: 12 September 2024 / Published: 13 September 2024

Abstract: Advanced video codecs such as High Efficiency Video Coding/H.265 (HEVC) and Versatile Video Coding/H.266 (VVC) are vital for streaming high-quality online video content, as they compress and transmit data efficiently. However, these codecs can occasionally degrade video quality by adding undesirable artifacts such as blockiness, blurriness, and ringing, which can detract from the viewer's experience. To ensure a seamless and engaging video experience, it is essential to remove these artifacts, which improves viewer comfort and engagement. In this paper, we propose a deep feature fusion-based convolutional neural network (CNN) post-processing architecture (VVC-PPFF) to further enhance the performance of VVC. The proposed network, VVC-PPFF, harnesses the power of CNNs to enhance decoded frames, significantly improving the coding efficiency of the state-of-the-art VVC video coding standard. By combining deep features from early and later convolution layers, the network learns to extract both low-level and high-level features, resulting in more generalized outputs that adapt to different quantization parameter (QP) values. The proposed VVC-PPFF network achieves outstanding performance, with Bjøntegaard Delta Rate (BD-Rate) improvements of 5.81% and 6.98% for the luma component in the random access (RA) and low-delay (LD) configurations, respectively, while also boosting the peak signal-to-noise ratio (PSNR).

1. Introduction

The proliferation of video data has become an integral part of our digital landscape, with footage being captured and generated at an unprecedented rate of millions of hours daily. The emergence of video streaming has dramatically transformed the way we engage with visual content, turning the digital world into a lively and engaging space. As high-speed internet, smartphones, and smart devices became widespread, video streaming became a crucial aspect of daily life, offering easy access to a vast selection of entertainment, educational, and informative content at users' fingertips. Through social media platforms, online movie theaters, video conferencing tools, and live event broadcasts, the growth of video streaming has created new opportunities for creators, viewers, and businesses, reshaping the way we connect, learn, and enjoy ourselves. The surge in video streaming applications has led to a substantial escalation in bandwidth consumption, imposing considerable strain on network infrastructure, data centers, and devices. The growing popularity of high-definition and 4K video content has, in turn, triggered unprecedented demand for bandwidth, exerting immense pressure on networks. The resulting bandwidth constraints can lead to slower speeds, buffering, and poor video quality. The need to deliver high-quality video content over networks with limited bandwidth underscores the significance of advanced video compression technologies. By reducing the amount of data required to transmit video, these solutions enable seamless video streaming even in environments where bandwidth is scarce, relying on sophisticated algorithms that compress video data, minimizing the strain on networks while maintaining an optimal viewing experience.
Several lossy video compression technologies, including High Efficiency Video Coding/H.265 (HEVC) [1] and Versatile Video Coding/H.266 (VVC) [2], have been developed to attain a balance between bitrate and distortion, enabling efficient compression of video data. VVC offers a substantial compression gain, reducing the bitrate by half while preserving the same level of video quality as HEVC, which results in a notable enhancement in compression performance. VVC retains the conventional block-based hybrid video coding framework but incorporates numerous innovative coding tools, each with its own unique compression capabilities, to enhance overall coding efficiency. The VVC framework consists of a range of essential components, including transform, quantization, entropy coding, intra prediction, inter prediction, and loop filtering, which work together to enable efficient video compression. The transform unit plays a crucial role in VVC, converting video frames from the spatial domain to the frequency domain, thereby compacting energy into the low-frequency regions, which facilitates more efficient compression. The quantization module serves to limit the dynamic range of the image, thereby addressing the primary source of video coding distortion, which is essential for achieving efficient compression. Entropy coding converts data into compact binary streams, enabling efficient storage and transmission by reducing the amount of data required to represent the information. Intra- and inter-prediction methods are used to eliminate spatial and temporal redundancies, respectively. Additionally, a loop filter module is applied during video coding to improve video frame quality and optimize compression efficiency, comprising three key components: a deblocking filter (DBF), a sample adaptive offset (SAO), and an adaptive loop filter (ALF).
Despite its superior performance, VVC-encoded videos still exhibit visual quality issues, indicating that the advanced tools and algorithms used in VVC are not enough to eliminate visual degradation in reconstructed videos. The block-based design of VVC can compromise video quality through various forms of distortion. The block-based approach in VVC can result in visible block boundaries (i.e., blockiness), which appear as a grid-like pattern on the video. This occurs when blocks are misaligned or when the compression algorithm creates discontinuities between blocks, and it is more noticeable in areas with smooth textures or uniform colors. In VVC, the Discrete Cosine Transform (DCT) is employed to transform blocks of pixels from the spatial domain into the frequency domain, where they are represented as coefficients, enabling efficient compression and processing of the video data. The DCT transformation in VVC can sometimes produce unwanted visual effects, including oscillations or ringing artifacts around edges, which appear as concentric circles, and a buzzing or fizzing effect, causing distractions and annoyances in the video. Quantizing the DCT coefficients in VVC can result in two main issues: firstly, the loss of fine details and textures due to the reduction in coefficient values, causing a decrease in image clarity; and secondly, a narrowing of the signal bandwidth, which leads to the loss of high-frequency components, resulting in a softer, more blurred image. The intra-prediction method in VVC, which uses neighboring blocks to predict block values, has two drawbacks: it can introduce directional artifacts, such as streaks or patterns, due to inaccurate predictions, and it can also cause over-smoothing, leading to a loss of details and textures in the decoded video. Inter-prediction, which predicts block values based on previous frames, can have two negative consequences: it can cause motion artifacts, such as blur, jerkiness, or ghosting, due to inaccurate motion estimation, and it can also lead to temporal artifacts, like flickering or flashing, which can be distracting and annoying.
In VVC, traditional in-loop filters are employed to enhance visual quality by reducing artifacts and improving overall video clarity. DBF reduces blocking artifacts, which appear as visible block boundaries in the video. These artifacts occur when the video is divided into blocks for compression, and the boundaries between blocks become visible. SAO reduces ringing artifacts, which appear as oscillations or ringing around edges in the video. SAO also helps to reduce mosquito noise, which is a type of noise that appears as a buzzing or fizzing effect around edges. ALF reduces blurring artifacts, which occur when the video becomes overly smoothed or loses its sharpness.
Despite being designed to enhance visual quality, traditional in-loop filters in VVC, such as ALF, DBF, and SAO, have inherent limitations that prevent them from fully eliminating decoded frame artifacts, resulting in residual distortions and compromised image quality. The limitations of traditional in-loop filters in VVC can be attributed to their imperfect artifact detection, which leads to incomplete distortion removal. Overly aggressive filtering can have the opposite effect, causing image blur and loss of details, thereby introducing new artifacts. Furthermore, these filters often struggle to adapt to diverse content, such as intricate textures or motion, resulting in inadequate artifact suppression. Moreover, the focus of the traditional in-loop filters is blocking and ringing artifacts, meaning that other types of distortions, including motion and temporal artifacts that can persist in decoded frames, can be ignored. Despite their widespread use, traditional filters in video coding still have untapped potential for enhancement. There are opportunities to further refine and optimize these filters to achieve better reconstruction quality, even with existing algorithms and architectures.
The past few years have witnessed a remarkable surge in the adoption of deep learning-based filter approaches, driven primarily by their exceptional performance and efficacy. These innovative methods have garnered significant attention and acclaim, establishing themselves as a go-to solution in various domains. The rise of deep learning has spurred the development of powerful image and video quality enhancement techniques, prominently featuring Convolutional Neural Networks (CNNs). This has opened new avenues for significantly enhancing visual quality, thereby expanding the possibilities in image and video processing. Several CNN-based image enhancement techniques have been proposed to eliminate artifacts and blocking effects in Joint Photographic Experts Group (JPEG) compressed images, as seen in references [3,4,5,6,7,8]. Several approaches, including those in [9,10,11,12,13,14], have been proposed to enhance the quality of HEVC-compressed videos. Recently, efforts have also been made to boost the quality of VVC-compressed visual content, targeting better image quality [15,16,17,18,19,20,21]. Despite the encouraging outcomes of previous studies, significant limitations and opportunities for enhancement remain, necessitating continued investigation and innovation to overcome these challenges and unlock further advancements. The sophistication of previous architectures has not translated to meaningful advancements, highlighting a disconnect between design complexity and practical benefits.
The innovation of neural network-based post-processing filters resides in their ability to apply deep learning principles to upgrade decoded video quality without requiring changes to the video coding framework. In contrast, the in-loop filter approach has several drawbacks. In-loop filters often rely on motion estimation, which can lead to over-smoothing and the loss of important details, especially in complex scenes, resulting in decreased video quality and the introduction of artifacts. They can also cause temporal inconsistencies, where the quality of the filtered video varies across frames. The use of pre-defined parameters and thresholds can lead to suboptimal filtering for certain types of video content. Furthermore, the aggressive application of in-loop filters can increase the bitrate of the video, leading to larger file sizes and reduced encoding efficiency. Neural network-based post-processing filters work independently of the video encoding and decoding process, making them easy to implement and flexible to use. Unlike traditional filters, they do not impact the video's bitrate or encoding efficiency, which means videos can be compressed to a smaller size without losing quality. These filters are particularly effective at enhancing the quality of complex or dynamic video content. Additionally, they can be easily customized to work with different video formats, resolutions, and frame rates, making them a more versatile solution than traditional filters. In this paper, we introduce a Convolutional Neural Network-based Post-Processing Filter (VVC-PPFF) that leverages the integration of dense features and quantization parameter (QP) maps to achieve improved results. The goal of VVC-PPFF is to learn and extract valuable residual details, thereby enhancing the quality of the compressed input image. The reconstructed image is augmented with the QP map [22], which serves as an additional input channel to the CNN network, providing it with explicit information about the compression level and artifact distribution. A feature fusion mechanism is integrated into the proposed network to facilitate more effective feature sharing and propagation, thereby enhancing the overall representation power of the network. We also utilize a skip connection between the input reconstructed image and the network residuals.
The main contributions of this paper can be summarized as follows:
  • We leveraged a feature fusion mechanism that combines the benefits of local and global features by integrating earlier layer features into deeper layers, thereby preserving spatial information and preventing feature loss during network output processing.
  • The inclusion of the QP map as prior knowledge with the reconstructed images allows the CNN architecture to concentrate on identifying features that are most closely tied to compression artifacts, resulting in a more targeted and efficient feature extraction process.
  • We leveraged skip connections between the input reconstructed image and the network residuals, allowing the network to effectively reuse features from the input. This design choice is particularly beneficial for tasks where the output should closely resemble the input, but with enhanced quality.
  • In video coding, both random access (RA) and low delay (LD) modes rely on inter-frame prediction to enable efficient decoding and display. RA allows decoding from any frame, while LD demands rapid display after decoding. To overcome the reconstruction quality issues inherent in inter-frame prediction, the proposed network is employed to improve the accuracy and fidelity of the reconstructed video in both RA and LD scenarios. We designed a versatile, single CNN-based post-processing network capable of adapting to coding scenarios, including RA and LD, while accommodating a wide range of QP values, from low to high.
The remainder of this paper is structured as follows: a review of recent studies on traditional artifact removal techniques and deep learning-based in-loop filtering and post-processing techniques is presented in Section 2. Our proposed approach is detailed in Section 3. The performance evaluation and analysis are discussed in Section 4, and the paper concludes with a summary of the study in Section 5.

2. Related Works

Extensive research has been undertaken in the area of video compression artifact removal, encompassing both traditional methodologies and deep learning-based solutions, to address the degradation of video quality caused by compression. This section provides an overview of various research studies that have employed both conventional and deep learning-based approaches to tackle video compression artifact removal, highlighting their respective strengths and limitations.

2.1. Conventional Artifact Removal Methods

Conventional techniques were initially designed to alleviate the artifacts introduced by image compression, with the goal of restoring the original image quality. In [23], Chen et al. introduced a novel post-processing approach that takes into account the masking effect of the human visual system (HVS), incorporating an adaptive weighting mechanism to optimize the post-filtering process and improve image quality. This approach utilizes a large window to effectively reduce artifacts in low-activity regions, where blocking artifacts are more visually prominent. To preserve image details, a small mask and a large central weight are applied to high-activity blocks, where the human visual system’s masking ability helps conceal blocking artifacts amidst complex local backgrounds. Shen et al. [24] proposed an innovative deblocking algorithm that leverages an adaptive non-local means filter to effectively eliminate blocking artifacts in images compressed using Block Discrete Cosine Transform (BDCT). Blocks are used as the fundamental processing unit, and their values are estimated by taking a weighted sum of neighboring blocks. The weights are dynamically adjusted based on the content of each block and the level of quantization noise present, allowing for adaptive processing. Foi et al. [25] introduced an approach to image filtering, utilizing the Shape-Adaptive Discrete Cosine Transform (SA-DCT) to effectively remove blocking and ringing artifacts from images compressed using block-DCT methods. This approach involves using the threshold or attenuated SA-DCT coefficients to generate a localized approximation of the original signal, confined within a dynamically determined support region that adapts to the signal’s shape. In [26], Liew et al. developed a non-iterative deblocking algorithm that leverages wavelet technology to simultaneously mitigate both blocking and ringing artifacts in compressed images. The proposed algorithm is grounded in a theoretical understanding of blocking artifacts, allowing it to effectively address the statistical properties of block discontinuities and the scale-based behavior of wavelet coefficients for various image features, thereby reducing both blocking and ringing artifacts. Liu et al. [27] presents an innovative method for restoring JPEG compressed images, which leverages the residual redundancies present in JPEG code streams and the sparse nature of underlying images, to effectively recover the original image. The proposed approach tackles image restoration by directly recovering the DCT coefficients of the original image, thereby preventing the propagation of quantization errors into the pixel domain. Additionally, it utilizes online machine-learned local spatial features to constrain the solution of the inverse problem, ensuring a more accurate restoration. In [28], Ren et al. introduces a new approach to alleviate blocking artifacts in block-coded images by combining patch clustering and low-rank minimization. This method leverages both local and non-local sparse representations in a single framework, effectively reducing the visibility of block boundaries and improving overall image quality. The compressed image is split into small segments, and then similar segments are grouped together through clustering. Next, these grouped segments are jointly refined using a low-rank minimization technique known as the Singular Value Thresholding (SVT) algorithm, which helps to enhance image quality. Zhang et al. 
[29] presents an innovative technique to alleviate compression artifacts by leveraging non-local similarities to estimate transform coefficients for overlapping blocks. The proposed method estimates DCT coefficients for each block by combining two predictions: the quantized coefficients from the compressed bitstream and a weighted average of coefficients from similar non-local blocks. The weights are determined by the similarity between blocks in the transform domain, allowing the method to prioritize the most effective coefficients for prediction. To optimize the process, the overlapped blocks are divided into non-overlapping subsets, each covering the entire image, and optimized separately.

2.2. Deep Learning-Based Artifact Removal Methods

Deep learning has made significant strides in various fields, and its application in image and video quality enhancement has gained popularity among researchers, yielding promising results. In [17], Chen et al. introduced a Dense Residual Convolutional Neural Network (DRN) as an in-loop filter for VVC, which leverages residual learning, dense shortcuts, and bottleneck layers to address gradient vanishing, promote feature reuse, and reduce computational costs. Huang et al. [15] introduced a Variable CNN (VCNN) in-loop filter for VVC, capable of handling compressed videos with varying quality parameters (QPs) and frame types (FTs) using a single model. The filter features an attention module that adaptively adjusts features based on QPs or FTs, which is integrated into the residual block to leverage informative features. Additionally, a residual feature aggregation module (RFA) is employed to minimize information loss and enhance feature extraction efficiency. In [16], Li et al. proposed a CNN filter designed to improve the quality of VVC intra-coded frames, utilizing additional information such as partitioning and prediction data as inputs. For chroma, the auxiliary information also incorporates luma samples. The objective is to enhance the visual quality of VVC intra-coded frames by leveraging this supplementary information in the CNN-based filter. In [30], Kathariya et al. proposed a CNN-based in-loop filter for VVC to reduce compression artifacts. This filter uses CNN features from DCT-transformed input to extract high-frequency components and introduces long-range correlation into spatial CNN features through multi-stage feature fusion. Zhang et al. [20] introduced a new CNN-based post-processing method for VVC in the RA configuration. To optimize performance, the network was trained on a large dataset of VVC-compressed videos at various resolutions and quality settings. The trained network is applied at the decoder to enhance the reconstruction quality of VVC-encoded videos. In [21], Ma et al. introduced MFRNet, a CNN-based network for post-processing and in-loop filtering in video compression. MFRNet consists of four multi-level feature review residual dense blocks (MFRBs), which extract features from multiple convolutional layers using a residual learning structure with dense connections. This design enables the reuse of high-dimensional features from previous blocks, improving feature reuse and capturing information flow between blocks. In [18], Bonnineau et al. explored a learning-based approach as a post-processing step to improve the quality of decoded VVC videos. They employed multitask learning to develop a single network that can simultaneously perform quality enhancement and super-resolution, optimized to handle various levels of degradation. In [31], Santamaria et al. proposed a content-adaptive CNN-based post-processing filter to enhance the quality of VVC decoded videos. The filter was initially trained on a general dataset of video sequences and then fine-tuned on the specific test video sequence to adapt to its unique content. Pham et al. [32] introduced a deep learning-based approach called Spatial-Temporal In-Loop Filtering (STILF) that leverages coding information to enhance VVC in-loop filtering. The proposed method learns to map reconstructed video frames to their original counterparts, utilizing a combination of VVC default filtering, a self-enhancement CNN with a CU map (SEC), and a reference-based enhancement CNN with optical flow (REO) for each CTU. 
In [33] Lim et al. proposed a generalized in-loop filter that builds upon the ALF in VVC by integrating a CNN. In this method, kernels are dynamically selected from a set of trained kernels based on fixed classifications across multiple layers, thereby reducing the computational complexity of the CNN-based in-loop filter. Zhang et al. [34] proposed a CNN-based in-loop filter for VVC intra coding, featuring a lightweight and efficient design. The filter consists of two main modules: residual attention block (RAB) and weakly connected attention block (WCAB) for feature extraction and refinement. Additionally, a lightweight feature extraction head (LFEH) with parallel convolutional layers is used to extract shallow features from multiple inputs. The network is trained using a multi-stage strategy based on progressive learning to maximize its learning ability. Lim et al. [35] proposed an in-loop filter that calculates a weighted sum of FIR-filtered versions of the reconstructed frame, with sample-adaptive weighting factors determined by a CNN. This approach replaces the hard classification of ALF with a soft, CNN-computed classification, where multiple classes can overlap at each sample position. Notably, it reduces the number of multiplications per luma pixel compared to other CNN-based in-loop filtering methods. Liu et al. [36] proposed a frequency and spatial QP-adaptive mechanism (FSQAM) to eliminate the need for separate models for each quantization parameter (QP) band, which is impractical due to storage limitations. FSQAM consists of a frequency-domain FQAM that incorporates the quantization step (Qstep) into the convolution, and a spatial-domain SQAM that compensates for FQAM. This approach enables any CNN filter to handle varying quantization noise by adding Qstep-related influence factors, and utilizes octave convolution for better frequency decomposition, ultimately improving the CNN filter’s performance through the interaction between FSQAM and convolutional decomposition. Wang et al. [37] proposed an efficient video coding scheme that involves pre-processing degradation and post-processing restoration. The pre-processing step uses an edge-preserving filter to remove high-frequency information before encoding, while the post-processing module employs a CNN-based approach, specifically a modified very deep super-resolution (VDSR), to restore degraded frames and correct coding artifacts like quantization error. The pre-processing bilateral filter is designed to control degradation and optimize rate-distortion performance, and the post-processing network is trained on-the-fly using reconstructed frames from the encoder to learn sequence-specific and coding configuration-specific model parameters.

2.3. Comparative Analysis of Prior Approaches

Traditional approaches to artifact removal and video quality enhancement rely on mathematical models and algorithms that are transparent and easy to understand. However, they are not versatile and cannot effectively handle diverse video content and the various types of artifacts it exhibits. In contrast, deep learning-based methods can simplify the artifact removal process by automating it, thereby eliminating the need for manual adjustments and fine-tuning of parameters.
There has been a growing trend towards the use of deep learning-based techniques for artifact removal in recent times. By training on a vast collection of paired input–output images, a CNN network develops the ability to recognize patterns and correlations between the two, allowing it to effectively extrapolate and perform well on novel, unseen data. Employing a hierarchical feature extraction method, CNN networks begin by detecting low-level features (e.g., edges, textures) in early layers and progress to high-level features (e.g., objects, patterns) in later layers, empowering them to identify and eliminate a diverse array of artifacts throughout the image-processing pipeline. With their vast learning capacity, CNN networks can capture and model intricate patterns, empowering them to eliminate artifacts that are too complex or intractable for traditional filters to handle, thereby achieving superior artifact removal capabilities.
In previous deep learning-based VVC post-processing filters, almost all studies adopted a residual-style model that estimates the residual between the compressed frame and the original frame. We follow the same approach while keeping the residual blocks as lightweight as possible to obtain the optimum bitrate saving. Our proposed network structure utilizes both shallow and deep features through feature fusion, whereas most other studies use either very simple or very complex structures. Regarding the network input, most previous research focuses only on the compressed frames. Since the QP value is one of the deciding factors in the compression level, we use a QP map as an additional input alongside the compressed frame.

3. Proposed Method

Figure 1 depicts the post-processing pipeline, where a CNN-based network is introduced as a refinement step after frame decoding, aiming to boost the reconstruction quality of the output video. This advanced technique utilizes artificial intelligence to optimize the decoded video, offering a visually superior experience through the refinement of the reconstructed video. The original video sequence is first compressed by the VVC encoder, which uses advanced methods such as intra-prediction, inter-prediction, transform coding, and entropy coding to reduce the data size of the raw video. The compressed video is then represented as a bitstream, which is fed into the VVC decoder. The decoder reverses the compression process, reconstructing the video from the bitstream. However, this reconstructed video may still contain imperfections and artifacts due to the lossy compression. To address this, the reconstructed video is then fed into a deep neural network (DNN)-based post-processing module. This AI-powered module analyzes the video, detects imperfections, and applies advanced algorithms to refine the video, reducing or eliminating these imperfections. By leveraging the power of deep learning, the post-processing module significantly improves the video quality. The final output video, refined and enhanced by the DNN module, is then displayed on the screen, providing a high-quality and visually stunning video experience. By integrating a deep learning-based post-processing approach into the VVC coding workflow, the reconstructed video is substantially improved, resulting in a better viewing experience for the end-user.

3.1. Insight of Data Ingestion

Figure 2 shows the step-by-step process for training and testing by transforming MP4 videos into YUV format and subsequently reconstructing the original video using the open source-based VVC encoder (VVenC) [38] and decoder (VVdeC) [39]. Our experiment is based on the BVI-DVC [40] dataset, which is stored in an MP4 format. While we recognize that using the original non-compressed YUV format is common practice in video coding experiments, our research focused on a scenario where MP4-encoded videos are more representative of the conditions in real-world applications. BVI-DVC [40] is currently utilized by the Joint Video Exploration Team (JVET) standardization organization for developing neural network-based video coding technology. The key benefit of YUV is that it decouples luminance (i.e., brightness) from chrominance (i.e., color), which is not possible with RGB. This decoupling enables more effective compression, as the human visual system is more sensitive to brightness variations than color changes. By reducing the resolution of YUV color components, the amount of data required to represent color information is lowered, resulting in a more efficient use of bandwidth for data transmission. The widespread use of YUV format in real-world video data transmission makes it an ideal choice for training datasets, ensuring better compatibility with practical applications. Moreover, the Joint Video Exploration Team (JVET) neural network-based video coding (NNVC) common test condition (CTC) [41] test sequences, a benchmark for video coding network development, are also in YUV format, making it a sensible decision to convert training data to YUV for more effective testing and analysis.
The preprocessing stage, depicted in Figure 3, involves converting (a) original videos into their constituent frames, known as original images, using FFmpeg (version 6.1.1) [42], and (b) applying the same process to reconstructed videos to obtain reconstructed images, which are then input into the neural network for subsequent analysis. Deep learning models, typically designed for 2D image processing, can be adapted for video analysis by breaking down videos into individual frames. This approach unlocks several advantages, including the creation of a larger training dataset from the numerous frames, streamlined preprocessing tasks like data augmentation and feature extraction, and improved computational efficiency, leading to accelerated training and inference speeds.
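For reproducibility, the frame extraction step can be scripted around FFmpeg as in the minimal sketch below. The output naming pattern, directory layout, and use of PNG are our own illustrative assumptions; raw YUV reconstructions additionally require the resolution and pixel format to be specified on the FFmpeg command line.

```python
import subprocess
from pathlib import Path


def extract_frames(video_path: str, out_dir: str) -> None:
    """Split a video into individual frames with FFmpeg (cf. Figure 3)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # PNG output and the frame_%05d naming pattern are illustrative choices.
    subprocess.run(
        ["ffmpeg", "-i", video_path, str(Path(out_dir) / "frame_%05d.png")],
        check=True,
    )


# Example: original MP4 source; reconstructed raw YUV streams need extra
# input flags (resolution, pixel format) that are omitted here for brevity.
extract_frames("original.mp4", "frames/original")
```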
Figure 4 describes the process of up-sampling the YUV 4:2:0 format data to convert it into the YUV 4:4:4 format, which involves interpolating and expanding the chroma components to match the resolution of the luma component. Most image processing neural networks are designed to process input data with identical resolutions for all color channels. However, YUV 4:2:0 data fall short of this requirement, which can lead to issues during training and inference. To avoid these problems, up-sampling YUV 4:2:0 to YUV 4:4:4 ensures that the input data meet the neural network expectations, resulting in smoother and more accurate processing. The chroma subsampling in YUV 4:2:0 data leads to a lower color resolution, as the UV channels are downscaled relative to the Y channel. This reduction in color fidelity can degrade the performance of neural networks. Conversely, YUV 4:4:4 data retain the full resolution for all components, thereby preserving color accuracy and facilitating optimal neural network outcomes.
Figure 5 illustrates the process of reducing the resolution of the chroma components in the neural network output, transforming it from the YUV 4:4:4 format to the YUV 4:2:0 format. This involves discarding every other chroma sample, both horizontally and vertically, resulting in a 2:1 reduction in chroma resolution. The luminance component remains unchanged, while the chroma blue and chroma red components are subsampled, reducing the overall data size and preparing the output for efficient storage, transmission, and display.
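As a concrete illustration of the conversions in Figures 4 and 5, the sketch below operates on NumPy planes. It is a minimal sketch under our own assumptions: the paper does not state which interpolation filter is used, so simple sample replication (up-sampling) and decimation (down-sampling) are shown.

```python
import numpy as np


def yuv420_to_yuv444(y: np.ndarray, u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Up-sample the chroma planes to the luma resolution (Figure 4).

    y has shape (H, W); u and v have shape (H/2, W/2). Nearest-neighbour
    replication is used here purely for illustration.
    """
    u_full = u.repeat(2, axis=0).repeat(2, axis=1)  # (H, W)
    v_full = v.repeat(2, axis=0).repeat(2, axis=1)  # (H, W)
    return np.stack([y, u_full, v_full], axis=0)    # (3, H, W)


def yuv444_to_yuv420(frame: np.ndarray):
    """Drop every other chroma sample horizontally and vertically (Figure 5)."""
    y, u_full, v_full = frame[0], frame[1], frame[2]
    return y, u_full[::2, ::2], v_full[::2, ::2]
```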

3.2. Influence of QP Map

The quantization parameter (QP) is a critical setting in video compression that balances video quality and file size. It is a single value that determines how much data is kept or discarded during compression. A lower QP setting preserves more video details, resulting in a higher quality video, but at the cost of a larger file size. On the other hand, a higher QP setting reduces the file size, but at the expense of video quality, as more data are discarded during compression.
A QP map is a technique that dynamically adjusts the quantization parameter based on the unique features of each region within the video. This adaptive approach enables more effective compression, as it allocates more bits to areas of high importance, such as faces or textures, and fewer bits to less critical regions, for example backgrounds or skies. This results in the better preservation of details in areas that matter most, while maintaining an optimal balance between compression efficiency and visual quality. By feeding a QP map into the neural network, it can intelligently adjust the compression ratio based on the unique features of each image or video region. This enables the network to prioritize bit allocation, devoting more resources to critical areas like edges, textures, and faces, while assigning fewer bits to less important regions. The QP map can act as a guide for the neural network to identify and extract features that are resilient to compression-related distortions. By integrating the QP map into the feature extraction process, the network can learn to emphasize features that are less susceptible to QP-induced artifacts, ultimately leading to more reliable and accurate representations of the image or video content. By integrating the QP map into the neural network design, it can develop strategies to mitigate compression artifacts tied to specific QP values. For instance, the network can learn to minimize blocking effects in areas with high compression (i.e., high QP) or alleviate ringing effects in regions with low compression (i.e., low QP), resulting in a more visually pleasing and artifact-free output. By leveraging the QP map, the neural network can acquire a compression-agnostic understanding of the image or video, where the representation of the content is decoupled from the specific QP value used. This enables the network to produce more reliable and consistent results, unaffected by variations in compression settings, and ultimately leading to improved performance and accuracy.
The QP map fed into the neural network has the same spatial dimensions as the compressed input image, allowing the network to process the compression information in a spatially corresponding manner. The QP value can vary at the frame level, slice level, or block level. However, in our proposed solution, we use the base QP value, which indicates the primary QP (i.e., 22, 27, 32, etc.) set during the encoding and decoding process. The QP map values are scaled to a standard range, as specified by Equation (1).
$$QP_{map}(x, y) = \frac{QP(x, y)}{QP_{max}} \tag{1}$$
where $x$ represents the horizontal pixel coordinate, $y$ represents the vertical pixel coordinate, and the maximum quantization parameter $QP_{max}$ is set to 63. Here, $x$ ranges from 1 to W (i.e., the frame width) and $y$ ranges from 1 to H (i.e., the frame height). In the context of VVC, $QP_{max}$ determines the maximum level of compression that can be applied to each coding unit within a frame.
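A minimal sketch of Equation (1) is given below. Since a single base QP is used per sequence in this work, the map reduces to a constant plane; the function name is an illustrative assumption.

```python
import numpy as np

QP_MAX = 63  # maximum quantization parameter in VVC


def build_qp_map(base_qp: int, height: int, width: int) -> np.ndarray:
    """Normalized QP map with the same spatial size as the frame (Equation (1))."""
    return np.full((1, height, width), base_qp / QP_MAX, dtype=np.float32)


# Example: a 240 x 240 training patch encoded with base QP 37.
qp_channel = build_qp_map(37, 240, 240)  # shape (1, 240, 240), values 37/63
# This channel is later concatenated with the reconstructed YUV 4:4:4 patch.
```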

3.3. Feature Fusion

The learned representations of CNNs, known as deep features, retain vital details and intrinsic patterns from input images, effectively summarizing their visual content. The CNN feature extraction process unfolds in a hierarchical manner, with initial layers detecting primitive visual cues such as edges and lines, and subsequent layers building upon these to recognize more sophisticated patterns, including shapes, textures, and objects.
In CNNs, a distinction exists between the features extracted by early layers and deep layers, which are shaped by the varying intensity of pixels in an image. The early layers focus on low-level details, such as edges and textures, which are sensitive to local pixel intensity changes. In contrast, deep layers capture higher-level features (e.g., objects and patterns), which are more robust to pixel intensity variations. This disparity in feature extraction arises from the different ways early and deep layers process visual information, with early layers emphasizing local details and deep layers recognizing more global structures. As a result, the features extracted by early layers are more localized, detailed, and sensitive to pixel intensity variations, whereas the features extracted by deep layers are more global, abstract, and robust to pixel intensity variations.
By merging early and deep layer features, the model can create a more complete and nuanced representation of the input image, blending fine-grained details with broader contextual patterns. The sensitivity of early layer features to pixel intensity fluctuations is compensated for by the robustness of deep layer features, mitigating the impact of intensity changes on feature extraction. This fusion yields a more diverse and informative set of features, enabling the model to more accurately differentiate between objects, scenes, and patterns. By harnessing the complementary strengths of early and deep layer features, the model can reduce the disparity between them, resulting in a more appropriate and effective feature extraction process.

3.4. Philosophy of Model Design

The core design principle is to harness the strengths of diverse feature representations to generate a more thorough and resilient output. By integrating the QP map and fusing features from early to deep layers of the CNN, the post-processing filter can produce a refined image that boasts enhanced visual quality and significantly reduced compression artifacts and noise. The integration of the QP map and feature fusion in convolutional layers empowers the network to develop a richer image representation, mitigates the effects of intensity variations, and yields enhanced output quality in post-processing tasks. The QP map serves as a spatial guide, indicating the compression level applied to each image region. By leveraging this map, the network can dynamically compensate for varying compression levels, thereby mitigating the effects of pixel intensity fluctuations. The QP map’s spatial adaptivity enables the network to tailor its correction to the unique compression artifacts and noise patterns present in each local image area. The feature fusion process integrates the QP map with the diverse features extracted from the convolutional layers, which encode various image attributes such as edges, textures, and patterns. By merging these features, the network can generate a more holistic and detailed representation of the image, encompassing both the compression-induced distortions and the intrinsic image content. The QP map plays a crucial role in mitigating pixel intensity variations by providing a spatially adaptive representation of compression artifacts and noise. The feature fusion process takes this a step further by combining the QP map with convolutional layer features, which are inherently less sensitive to intensity fluctuations. This fusion enables the network to become more resilient to pixel intensity variations, ultimately leading to enhanced post-processing performance and improved overall image quality.

3.5. Proposed Network Architecture

The diagram in Figure 6 illustrates the proposed network architecture, which is designed to enhance the quality of reconstructed frames. The network takes as input the reconstructed frames and the corresponding QP map information. The network uses a QP map that is the same size as the reconstructed frame, and then concatenates the two before feeding them into the network for processing. The QP map supplies the network with prior knowledge about the frame quality variations caused by the QP used during compression, providing insight into the quality of the reconstructed frames when combined with the images.
The concatenated QP map and reconstructed frame are then fed into the first convolutional layer, which applies a 1 × 1 kernel with 128 output channels to process the input. The 1 × 1 convolution kernel compresses the feature maps, reducing their number and dimensionality while preserving crucial information. The 128-channel configuration strikes a balance between feature extraction and computational efficiency, enabling the model to learn a robust representation of the input data without incurring excessive computational costs. The parametric rectified linear unit (PReLU) activation function, applied after the initial convolutional layer, learns to tailor its rectification parameters to the input data, resulting in improved model performance. The resulting feature maps are subsequently passed into the feature extraction blocks for further processing.
The network can use multiple feature extraction blocks to balance complexity and performance. We used 16 blocks in our experiments to optimize feature extraction while keeping the network simple and efficient. Every feature extraction block consists of a convolutional layer featuring a 3 × 3 kernel, 128 output channels, and a PReLU activation function. The 3 × 3 kernel size enables it to effectively identify small-scale features such as edges, lines, and textures, while also reducing the computational load, thereby speeding up training and inference processes. PReLU was used to aid convergence in deeper network layers, as it allows for varying degrees of nonlinearity, which is particularly important in deeper layers where nonlinearity tends to increase.
A hierarchical feature fusion approach is employed across 16 feature extraction blocks. In each block, features from previous blocks are refined through a 1 × 1 convolutional layer, which learns to capture complex patterns. The refined features are then combined with the previous features to produce the final output of each block. This process is repeated, with each block aggregating features from its predecessors. Ultimately, the first block receives the collective output from all 16 blocks, resulting in a rich and integrated feature representation.
The feature fusion process is mathematically represented by Equations (2) and (3), which describe how the features are combined and integrated to produce the final output.
$$P'_{N} = F_{conv1\times1}\left(P_{N}\right) \tag{2}$$
$$\hat{P}_{N} = P_{N} + \sum_{i=1}^{15} F_{conv1\times1}\left(P_{N-i}\right) \tag{3}$$
where $N$ represents the number of feature extraction blocks, $P_N$ represents the output of the convolution and activation function of the $N$-th feature extraction block, $P'_N$ denotes its features after refinement by the 1 × 1 convolution, $P_{N-i}$ represents the output of each preceding block with $i$ ranging from 1 to 15, $\hat{P}_N$ is the result of combining the features of the current block with those of the previous blocks, and $F_{conv1\times1}$ denotes the feature refinement performed by a 1 × 1 convolution layer.
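The PyTorch sketch below gives one plausible reading of Equations (2) and (3) and of the block structure described above. It is not the authors' released implementation; in particular, the way the accumulated features are propagated from block to block is our own assumption.

```python
import torch
import torch.nn as nn


class FeatureExtractionBlock(nn.Module):
    """3 x 3 convolution with 128 channels followed by PReLU (Section 3.5)."""

    def __init__(self, channels: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.PReLU(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.conv(x))


class ProgressiveFeatureFusion(nn.Module):
    """Progressive fusion: each block's output is refined by a 1 x 1 convolution
    and added to the features accumulated from the preceding blocks."""

    def __init__(self, num_blocks: int = 16, channels: int = 128):
        super().__init__()
        self.blocks = nn.ModuleList(
            [FeatureExtractionBlock(channels) for _ in range(num_blocks)])
        self.fuse = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1) for _ in range(num_blocks)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = x
        for block, fuse_1x1 in zip(self.blocks, self.fuse):
            features = block(fused)             # P_N: conv 3x3 + PReLU
            fused = fused + fuse_1x1(features)  # add 1x1-refined features, Eq. (3)
        return fused
```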
The final feature from the first block is then fed into a convolutional layer with a 3 × 3 kernel size, which performs localized feature aggregation. By combining features from a small, focused region of the feature map, this layer refines and distills the information extracted by previous layers, resulting in a more robust and concentrated representation of the most critical features.
In the final stage of the network, a convolutional layer with a 1 × 1 kernel size is employed, followed by a hyperbolic tangent (Tanh) activation function. The Tanh activation function is chosen for its ability to provide a more consistent gradient flow, which helps mitigate the vanishing gradient problem. This, in turn, enables faster and more stable convergence during the training process, ultimately leading to improved network performance.
The proposed network adopts a residual learning approach, which involves adding a skip connection that directly links the input reconstructed frames to the final extracted residuals. This connection enables the gradients to flow unimpeded, facilitating the smooth transmission of information from earlier layers to later ones, and thereby promoting more effective learning and representation.
The relationship between the input data, consisting of the reconstructed frame and QP map, and the output, which is the enhanced frame, is formally defined as Equation (4).
$$Y = Z_{\theta}\left(\hat{X} \,(||)\, QP_{map}\right) \oplus \hat{X} \tag{4}$$
where $\hat{X}$ represents the reconstructed input frames, $Y$ denotes the enhanced output frame, $Z_{\theta}$ symbolizes the operation performed by the CNN architecture, $(||)$ denotes the concatenation operation, and $\oplus$ symbolizes element-wise addition.
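Putting the pieces together, the following sketch mirrors Equation (4) and Figure 6 at a high level, reusing the ProgressiveFeatureFusion module sketched above. The layer counts and channel widths follow the text; the class name and the 4-channel input layout (YUV 4:4:4 plus one QP-map channel) are assumptions for illustration.

```python
import torch
import torch.nn as nn


class VVCPPFF(nn.Module):
    """Sketch of the overall network: QP map concatenation, 1x1 head, progressive
    feature fusion body, 3x3 + 1x1 tail with Tanh, and a global residual skip."""

    def __init__(self, num_blocks: int = 16, channels: int = 128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(3 + 1, channels, kernel_size=1),  # YUV 4:4:4 + QP map
            nn.PReLU(channels),
        )
        self.body = ProgressiveFeatureFusion(num_blocks, channels)
        self.tail = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, 3, kernel_size=1),
            nn.Tanh(),
        )

    def forward(self, rec: torch.Tensor, qp_map: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rec, qp_map], dim=1)             # X_hat (||) QP_map
        residual = self.tail(self.body(self.head(x)))   # Z_theta(.)
        return rec + residual                           # element-wise addition with X_hat
```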

3.6. Model Optimization Settings

The BVI-DVC [40] dataset, comprising 800 videos with resolutions from 270p to 2160p, was utilized for training a model. This diverse set of training data was compressed under the JVET NNVC CTC using the RA and LD configurations. The videos, initially in MP4 format, were converted to YUV format with chroma sampling 4:2:0 and a bit depth of 10. For ease of processing, 10 frames were extracted from each video, yielding a total of 8000 frames for training. Given that the proposed network cannot manage different input sizes, the chroma channels were up-sampled by a factor of two to match the Luma channel’s spatial resolution. Finally, a random patch of size 240 × 240 was chosen from each frame to serve as the network input.
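For concreteness, co-located 240 × 240 patches can be cropped as in the short sketch below; the (C, H, W) array layout and the function name are our own assumptions.

```python
import numpy as np


def random_patch(rec_frame, orig_frame, qp_map, size: int = 240):
    """Crop co-located patches from the reconstructed frame, the original frame,
    and the QP map, all stored as (C, H, W) arrays (Section 3.6)."""
    _, h, w = rec_frame.shape
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    window = (slice(None), slice(top, top + size), slice(left, left + size))
    return rec_frame[window], orig_frame[window], qp_map[window]
```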
Five separate models were trained using the same network architecture, but with different QP values, specifically 22, 27, 32, 37, and 42, as per the JVET NNVC CTC guidelines. These models were then evaluated for different base QP values, with the same model generation strategy applied to the RA and LD configurations. Each CNN model underwent training for 200 epochs, utilizing the Adam optimizer with a learning rate of $10^{-4}$. The hyper-parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$ were employed to calculate the gradient averages during the learning process.
The following Equation (5) illustrates the models developed for different QP ranges.
$$CNN\ Models = \begin{cases} M_{QP=22}, & \text{if } QP < 24.5 \\ M_{QP=27}, & \text{if } 24.5 \le QP < 29.5 \\ M_{QP=32}, & \text{if } 29.5 \le QP < 34.5 \\ M_{QP=37}, & \text{if } 34.5 \le QP < 39.5 \\ M_{QP=42}, & \text{if } QP \ge 39.5 \end{cases} \tag{5}$$
where $M_{QP}$ denotes the model associated with each training QP value (i.e., 22, 27, 32, 37, 42). Each model is selected according to the range in which the base QP value falls.
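Equation (5) translates directly into a small selection routine; the dictionary-based interface below is an illustrative assumption.

```python
def select_model(base_qp: float, models: dict):
    """Pick the trained model for a given base QP according to Equation (5).

    `models` maps the five training QPs {22, 27, 32, 37, 42} to loaded networks.
    """
    if base_qp < 24.5:
        return models[22]
    if base_qp < 29.5:
        return models[27]
    if base_qp < 34.5:
        return models[32]
    if base_qp < 39.5:
        return models[37]
    return models[42]
```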
The L2 loss function, also referred to as the mean squared error (MSE) loss, is a widely used loss function in machine learning. It calculates the average of the squared differences between the predicted outcomes and the actual values, providing a measure of the model accuracy.
The model utilizes the L2 loss function, specified by Equation (6), to measure its performance.
$$L_2\ (MSE) = \frac{1}{k}\sum_{j=1}^{k}\left(Y_j - \hat{Y}_j\right)^2 \tag{6}$$
where $Y_j$ denotes the original pixel value, $\hat{Y}_j$ represents the corresponding enhanced pixel value resulting from the application of the CNN filter, and $k$ is the number of pixels.
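A minimal training-loop sketch combining the settings above (Adam with a learning rate of 10⁻⁴, β₁ = 0.9, β₂ = 0.999, 200 epochs, and the L2 loss of Equation (6)) is given below. The synthetic data loader and the VVCPPFF class from the earlier sketch are stand-ins, not the authors' code.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = VVCPPFF().to(device)                      # sketched in Section 3.5
criterion = nn.MSELoss()                          # L2 loss, Equation (6)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

# Synthetic stand-in for the real BVI-DVC patch loader, for illustration only:
# tuples of (reconstructed patch, QP map, original patch), each 240 x 240.
rec = torch.rand(8, 3, 240, 240)
qp = torch.full((8, 1, 240, 240), 37 / 63)
orig = torch.rand(8, 3, 240, 240)
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(rec, qp, orig), batch_size=4)

for epoch in range(200):
    for rec_patch, qp_map, orig_patch in train_loader:
        rec_patch = rec_patch.to(device)
        qp_map = qp_map.to(device)
        orig_patch = orig_patch.to(device)

        enhanced = model(rec_patch, qp_map)
        loss = criterion(enhanced, orig_patch)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```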

4. Experimental Evaluations and Discussion

4.1. Testing Environment Configuration

To assess the performance of the proposed method, we selected the JVET NNVC CTC sequences for evaluation, which were not part of the training dataset. These 22 sequences, categorized into classes A1, A2, B, C, D, and E, were used to evaluate the RA and LD configurations according to the JVET NNVC CTC guidelines. Specifically, we tested RA and LD using classes A1, A2, B, C, and D and classes B, C, D, and E, respectively. The test QP values for all configurations were 22, 27, 32, 37, and 42.

4.2. Quality Metric

The final video quality was evaluated using two assessment methods: peak signal-to-noise ratio (PSNR) and the Bjøntegaard Delta Rate (BD-Rate) [43] measurements. PSNR, a widely accepted quality metric in the video compression community, was used to assess video quality. Additionally, the performance of the proposed approach was compared to the VVC codec using the BD-Rate, which measures the bitrate savings of the new approach compared to the reference codec.
  • PSNR
The proposed method was evaluated by comparing the quality of its output with that of the decoded frames produced by the VVenC v1.10.0 software, using PSNR as the evaluation metric. The PSNR calculation is given by Equation (7), which quantifies the difference between the original and reconstructed frames.
$$P = 10\log_{10}\frac{\left(255 \ll (\beta - 8)\right)^2}{M} \tag{7}$$
In this equation, $\beta$ specifies the number of bits used to encode each pixel color value in an image, and $M$ denotes the mean of the squared differences between the original and reconstructed pixel values. The left-shift operator ($\ll$) is applied to compute the maximum possible pixel value based on the bit depth.
  • BD-Rate
BD-Rate is a crucial metric in video coding that measures the compression efficiency of a video codec by comparing its bitrate to a reference codec or technique. It is calculated as the average percentage difference in bitrate across various quality levels or quantization parameters. A lower BD-Rate indicates better compression performance. Typically expressed as a percentage, the BD-Rate provides a concise summary of the efficiency of coding performance, making it a valuable tool for the video coding community to evaluate and compare the performance of different codecs and techniques.
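Both metrics can be computed with a few lines of NumPy. The sketch below follows Equation (7) for PSNR and the standard Bjøntegaard procedure (a cubic fit of log-bitrate over PSNR, integrated over the overlapping quality range) for BD-Rate; it is an illustrative implementation, not the exact JVET reference script.

```python
import numpy as np


def psnr(original: np.ndarray, enhanced: np.ndarray, bit_depth: int = 10) -> float:
    """PSNR as in Equation (7): peak value 255 << (bit_depth - 8), MSE in M."""
    peak = 255 << (bit_depth - 8)
    mse = np.mean((original.astype(np.float64) - enhanced.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test) -> float:
    """BD-Rate of a test codec against an anchor (negative = bitrate saving)."""
    log_ra, log_rt = np.log(rate_anchor), np.log(rate_test)
    poly_a = np.polyfit(psnr_anchor, log_ra, 3)   # log-rate as a cubic in PSNR
    poly_t = np.polyfit(psnr_test, log_rt, 3)

    lo = max(min(psnr_anchor), min(psnr_test))    # overlapping PSNR range
    hi = min(max(psnr_anchor), max(psnr_test))

    int_a = np.polyval(np.polyint(poly_a), hi) - np.polyval(np.polyint(poly_a), lo)
    int_t = np.polyval(np.polyint(poly_t), hi) - np.polyval(np.polyint(poly_t), lo)

    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0
```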

4.3. Test Bed

The experiment was run using PyTorch 2.3.0 [44] as the deep learning framework on an Ubuntu-based system. The system was equipped with a powerful hardware setup, featuring two 10-core Intel Xeon Silver 4210R CPUs (Intel, Santa Clara, CA, USA) and four high-performance NVIDIA TITAN RTX GPUs (Nvidia, Santa Clara, CA, USA).

4.4. Quantitative Analysis

The compression performance of the proposed architecture is detailed in Table 1 and Table 2, which highlight its efficiency in both RA and LD configurations. JVET utilizes the BD-Rate [43] metric to evaluate the effectiveness of bit rate reduction, where a lower BD-Rate value signifies better coding efficiency. The RA configuration results in Table 1 demonstrate that the proposed method yields overall coding gains of 5.81%, 14.52%, and 16.19% for the Luma (Y) component and the two Chroma (U and V) components, respectively, when compared to VVC compression. Specifically, for D-class sequences with the lower WQVGA resolution (416 × 240), the proposed method achieves significant average BD-Rate reductions of 7.69% for the Luma component and 14.43% and 17.98% for the Chroma components.
Table 2 presents the coding performance of the proposed architecture for the LD configuration, achieving overall BD-Rate reductions of 6.98%, 24.68%, and 26.74% for the Luma and the two Chroma components, respectively. The LD configuration excels on the lower-resolution WQVGA (416 × 240) class D sequences and also demonstrates improved performance on the high-resolution 1080i/p (1920 × 1080) class B sequences. The average BD-Rate savings for the Luma component are 5.25% for class B sequences and 8.56% for class D sequences. The compression technique demonstrates a substantial boost in performance for lower-resolution sequences in both the RA and LD scenarios, which are typically difficult to improve, and exhibits impressive results for high-resolution sequences. Interested readers can refer to Tables S1 and S2 in the supplementary document for detailed PSNR values.
To assess the effectiveness of the proposed method, we compared it to the latest research in [45] and present the results in Table 3 and Table 4. Specifically, we evaluated the method in the RA and LD configurations, consistent with the JVET CTC. In terms of coding efficiency measured by BD-Rate, the proposed method outperformed [45] with values of −5.81% and −6.98% for the RA and LD scenarios, respectively, while [45] reported BD-Rate values of −2.81% and −3.81% for the Y channel. Notably, the strength of the proposed method was evident in the results for class D, which represents the most challenging class due to its low resolution (416 × 240). In this case, our proposed model surpassed [45], achieving Y channel bit reductions of 7.69% and 8.56% compared to the reported rates of 4.76% and 4.72% for the RA and LD configurations, respectively. Furthermore, it is worth mentioning that our proposed method comprehensively addresses all three channels (i.e., Y, U, and V) using a single network, whereas the network presented in [45] can only handle the Y component.

4.5. Qualitative Analysis

Figure 7 and Figure 8 illustrate a visual comparison of the reconstructed frames between the VVC-compressed content and the proposed network output, with a high QP value of 42. The network output in Figure 7 demonstrates less distortion than the reconstructed frame and a 0.18 dB luma PSNR improvement for the DaylightRoad2 sequence. In Figure 8, the LD configuration preserves clearer details of the FourPeople sequence, with a 0.51 dB luma PSNR gain. Notably, the CNN-post-processed frames exhibit fewer artifacts and higher subjective quality compared to the original VVC decoder output for both RA and LD scenarios.

4.6. Rate Distortion Plot Analysis

In video coding, a rate-distortion (RD) plot is a graphical representation that illustrates the tradeoff between the bitrate and the quality of a video compression algorithm. Here, the horizontal axis represents the bitrate, which is the amount of data required to represent the compressed video. The vertical axis represents the distortion introduced by the compression algorithm which is measured using PSNR metric. The RD plot is a useful tool for evaluating the performance of video compression algorithms, as it allows the comparison of the efficiency of different compression algorithms, the optimization of the compression settings to achieve a desired balance between bitrate and quality, and the analysis of the impact of different compression techniques on the video quality.
The RD curves in Figure 9 and Figure 10 compare the RA and LD configurations, highlighting the tradeoff between bitrate and PSNR. The proposed method consistently achieves better RD performance than the VVC anchor in both RA and LD scenarios, with improvements maintained across a range of quantization parameter (QP) values and sequence resolutions.

4.7. Ablation Studies

The ablation study compares the performance of the network with different numbers of feature extraction blocks. Table 5 and Table 6 report the PSNR of the proposed method with 8 and 16 feature extraction blocks in the RA and LD configurations. With 8 blocks, the average Y, U, and V PSNR values are 29.78 dB, 37.24 dB, and 37.60 dB for the RA configuration and 28.81 dB, 36.52 dB, and 36.91 dB for the LD configuration. PSNR values without the proposed VVC-PPFF are provided in Tables S3 and S4 of the supplementary document for the RA and LD configurations, respectively.
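As a rough sketch of how the block count enters the design, a configurable stack of feature extraction blocks in PyTorch [44] might look as follows. This is not the exact architecture of Figure 6: the block internals, channel width, and fusion layout below are our own assumptions, and only the overall pattern (QP map as a fourth input channel, early/late feature fusion, a global residual skip, and a tunable number of blocks) mirrors the described method.

```python
import torch
import torch.nn as nn

class FeatureBlock(nn.Module):
    """Simple residual convolutional block, a stand-in for the paper's
    feature extraction block (internals here are assumed, not the paper's)."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class PostFilter(nn.Module):
    """Post-processing CNN with a configurable number of feature blocks,
    early/late feature fusion, and a global residual skip connection."""
    def __init__(self, num_blocks=16, channels=64, in_channels=4):
        super().__init__()
        # 3 reconstructed YUV channels + 1 QP-map channel (see the QP map study below).
        self.head = nn.Conv2d(in_channels, channels, 3, padding=1)
        self.blocks = nn.ModuleList([FeatureBlock(channels) for _ in range(num_blocks)])
        # Fuse the early (head) features with the deepest block output.
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x):
        early = self.head(x)
        feat = early
        for block in self.blocks:
            feat = block(feat)
        fused = self.fuse(torch.cat([early, feat], dim=1))
        # Residual learning: predict a correction to the decoded frame.
        return x[:, :3] + self.tail(fused)

# The ablation variants (8, 12, or 16 blocks) are selected via num_blocks.
model = PostFilter(num_blocks=8)
out = model(torch.rand(1, 4, 64, 64))  # (N, YUV + QP map, H, W)
print(out.shape)                       # torch.Size([1, 3, 64, 64])
```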
A visual quality comparison is presented in Figure 11, showcasing the proposed method's performance with 8 feature extraction blocks in RA and LD scenarios at QP 42 for the MarketPlace and PartyScene sequences. The results indicate that using 16 feature extraction blocks yields average luma PSNR gains of 0.25 dB and 0.03 dB over 8 blocks for the RA and LD scenarios, respectively.
Table 7 and Table 8 present a PSNR comparison of the proposed method with 12 and 16 feature extraction blocks in the RA and LD configurations. With 12 blocks, the average Y, U, and V PSNR values are 29.82 dB, 37.28 dB, and 37.73 dB for the RA scenario and 28.83 dB, 36.56 dB, and 36.92 dB for the LD scenario, respectively.
The impact of the QP map as an additional input can be observed in the quantitative analysis of Table 9 and Table 10. Table 9 reports PSNR with and without the QP map for the RA configuration at QP 42, while Table 10 covers the LD configuration at QP 22. Both results are for class D, which is the hardest to improve due to its low-resolution frames; even so, the proposed QP-map-based solution outperforms the variant trained without the QP map.
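For illustration, a minimal sketch of how a QP map can be supplied as an extra input channel is shown below. The constant-valued plane and the normalization by the maximum VVC QP of 63 are our assumptions for this example, not a statement of the exact preprocessing used in the paper.

```python
import torch

def make_qp_map(qp, height, width, qp_max=63.0):
    """Constant-valued QP plane, normalized to [0, 1], matching the frame size.
    VVC QPs range from 0 to 63; the normalization constant is an assumption."""
    return torch.full((1, 1, height, width), float(qp) / qp_max)

# Hypothetical decoded 4:4:4 frame tensor (N, 3, H, W) scaled to [0, 1]:
decoded = torch.rand(1, 3, 240, 416)
qp_plane = make_qp_map(qp=42, height=240, width=416)
network_input = torch.cat([decoded, qp_plane], dim=1)  # (1, 4, 240, 416)
print(network_input.shape)
```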
Figure 12 illustrates a visual quality comparison of the proposed method with 12 feature extraction blocks for RA and LD scenarios at QP 42, using the RitualDance and Cactus sequences. Notably, the results show that increasing the number of feature extraction blocks to 16 yields PSNR gains of 0.38 dB, 0.15 dB, and 0.09 dB for the RA scenario and 0.01 dB, 0.02 dB, and 0.54 dB for the LD scenario, respectively.

4.8. Discussion

The VVC-PPFF architecture demonstrates its robustness by achieving significant coding gains across a wide range of QP values, from low to high, for both RA and LD scenarios. In video compression, inter-frame prediction encodes the changes between adjacent frames rather than each frame in full, which reduces storage and transmission costs but can compromise video quality. To address this, the proposed network is tailored to enhance reconstruction quality in two distinct scenarios, each with its own challenges: RA, where any frame can be accessed randomly, and LD, where real-time transmission is crucial. This consistency is important because it ensures that the proposed method reduces encoding artifacts and improves video quality regardless of the specific encoding settings used.

The proposed method is particularly effective at improving PSNR while reducing bitrate for lower-resolution sequences, which are typically harder to enhance because their reduced spatial resolution makes compression artifacts more pronounced. At the same time, the architecture maintains a good coding gain for higher-resolution sequences, so it can handle a wide range of video content without compromising performance. By consistently delivering coding gains across QP ranges and video resolutions, the proposed approach provides superior reconstruction quality: the reduced artifacts and improved PSNR yield a more accurate and detailed representation of the original content. This enhanced visual quality is essential for applications such as video streaming, broadcasting, and storage, where the reconstructed video must remain visually appealing with minimal distortion.

5. Conclusions

This paper introduces VVC-PPFF, a novel CNN-based post-processing filter that uses the QP map as prior knowledge, combining it with the reconstructed input frames and fusing features from earlier and deeper convolution layers. The network architecture is designed to accommodate variability in QP values and image sequence sizes, enabling improved image quality. In addition, a skip connection between the input and the learned residual allows the reconstruction to closely match the original video. The proposed approach demonstrates substantial coding gains for both RA and LD scenarios across a range of QP values, from low to high.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app14188276/s1, Table S1: The compression performance with and without proposed method for RA configuration; Table S2: The compression performance with and without proposed method for LD configuration; Table S3: PSNR comparison of proposed method vs without proposed method for 8 and 16 feature extraction blocks in RA configuration; Table S4: PSNR comparison of proposed method with 8 and 16 feature extraction blocks in LD configuration.

Author Contributions

Investigation, validation, methodology, writing—original draft, T.D.; conceptualization, investigation, X.L.; supervision, project administration, funding acquisition, K.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant from Kyung Hee University in 2023 (KHU-20230874) and by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2023-00220204, Development of Standard Technologies on next-generation media compression for transmission and storage of Metaverse contents).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in [40,41].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sullivan, G.J.; Ohm, J.-R.; Han, W.-J.; Wiegand, T. Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668. [Google Scholar] [CrossRef]
  2. Bross, B.; Wang, Y.-K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G.J.; Ohm, J.-R. Overview of the Versatile Video Coding (VVC) Standard and its Applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764. [Google Scholar] [CrossRef]
  3. Dong, C.; Deng, Y.; Loy, C.C.; Tang, X. Compression Artifacts Reduction by a Deep Convolutional Network. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 576–584. [Google Scholar] [CrossRef]
  4. Zhang, K.; Zuo, W.; Chen, Y.; Meng, D.; Zhang, L. Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising. IEEE Trans. Image Process. 2017, 26, 3142–3155. [Google Scholar] [CrossRef] [PubMed]
  5. Tai, Y.; Yang, J.; Liu, X.; Xu, C. MemNet: A Persistent Memory Network for Image Restoration. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4549–4557. [Google Scholar] [CrossRef]
  6. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar] [CrossRef]
  7. Qi, Z.; Jung, C.; Xie, B. Subband Adaptive Image Deblocking Using Wavelet Based Convolutional Neural Networks. IEEE Access 2021, 9, 62593–62601. [Google Scholar] [CrossRef]
  8. Xie, B.; Zhang, H.; Jung, C. WCDGAN: Weakly Connected Dense Generative Adversarial Network for Artifact Removal of Highly Compressed Images. IEEE Access 2022, 10, 1637–1649. [Google Scholar] [CrossRef]
  9. Wang, J.; Xu, M.; Deng, X.; Shen, L.; Song, Y. MW-GAN+ for Perceptual Quality Enhancement on Compressed Video. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 4224–4237. [Google Scholar] [CrossRef]
  10. Ding, D.; Kong, L.; Chen, G.; Liu, Z.; Fang, Y. A Switchable Deep Learning Approach for In-Loop Filtering in Video Coding. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 1871–1887. [Google Scholar] [CrossRef]
  11. Lin, W.; He, X.; Han, X.; Liu, D.; See, J.; Zou, J.; Xiong, H.; Wu, F. Partition-Aware Adaptive Switching Neural Networks for Post-Processing in HEVC. IEEE Trans. Multimed. 2020, 22, 2749–2763. [Google Scholar] [CrossRef]
  12. Huang, H.; Schiopu, I.; Munteanu, A. Frame-Wise CNN-Based Filtering for Intra-Frame Quality Enhancement of HEVC Videos. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 2100–2113. [Google Scholar] [CrossRef]
  13. Guan, Z.; Xing, Q.; Xu, M.; Yang, R.; Liu, T.; Wang, Z. MFQE 2.0: A New Approach for Multi-Frame Quality Enhancement on Compressed Video. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 949–963. [Google Scholar] [CrossRef]
  14. Yang, R.; Xu, M.; Liu, T.; Wang, Z.; Guan, Z. Enhancing Quality for HEVC Compressed Videos. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 2039–2054. [Google Scholar] [CrossRef]
  15. Huang, Z.; Sun, J.; Guo, X.; Shang, M. One-for-all: An Efficient Variable Convolution Neural Network for In-loop Filter of VVC. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 2342–2355. [Google Scholar] [CrossRef]
  16. Li, Y.; Zhang, L.; Zhang, K. Convolutional Neural Network Based In-Loop Filter for VVC Intra Coding. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 2104–2108. [Google Scholar] [CrossRef]
  17. Chen, S.; Chen, Z.; Wang, Y.; Liu, S. In-Loop Filter with Dense Residual Convolutional Neural Network for VVC. In Proceedings of the 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Shenzhen, China, 6–8 August 2020; pp. 149–152. [Google Scholar] [CrossRef]
  18. Bonnineau, C.; Hamidouche, W.; Travers, J.-F.; Sidaty, N.; Deforges, O. Multitask Learning for VVC Quality Enhancement and Super-Resolution. arXiv 2021, arXiv:2104.08319. [Google Scholar] [CrossRef]
  19. Jia, C.; Wang, S.; Zhang, X.; Wang, S.; Liu, J.; Pu, S.; Ma, S. Content-Aware Convolutional Neural Network for In-Loop Filtering in High Efficiency Video Coding. IEEE Trans. Image Process. 2019, 28, 3343–3356. [Google Scholar] [CrossRef] [PubMed]
  20. Zhang, F.; Feng, C.; Bull, D.R. Enhancing VVC Through CNN-Based Post-Processing. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; pp. 1–6. [Google Scholar] [CrossRef]
  21. Ma, D.; Zhang, F.; Bull, D.R. MFRNet: A New CNN Architecture for Post-Processing and In-loop Filtering. IEEE J. Sel. Top. Signal Process. 2021, 15, 378–387. [Google Scholar] [CrossRef]
  22. Wang, M.-Z.; Wan, S.; Gong, H.; Ma, M.-Y. Attention-Based Dual-Scale CNN In-Loop Filter for Versatile Video Coding. IEEE Access 2019, 7, 145214–145226. [Google Scholar] [CrossRef]
  23. Chen, T.; Wu, H.R.; Qiu, B. Adaptive postfiltering of transform coefficients for the reduction of blocking artifacts. IEEE Trans. Circuits Syst. Video Technol. 2001, 11, 594–602. [Google Scholar] [CrossRef]
  24. Shen, S.; Fang, X.; Wang, C. Adaptive non-local means filtering for image deblocking. In Proceedings of the 2011 4th International Congress on Image and Signal Processing, Shanghai, China, 15–17 October 2011; pp. 656–659. [Google Scholar] [CrossRef]
  25. Foi, A.; Katkovnik, V.; Egiazarian, K. Pointwise Shape-Adaptive DCT for High-Quality Denoising and Deblocking of Grayscale and Color Images. IEEE Trans. Image Process. 2007, 16, 1395–1411. [Google Scholar] [CrossRef]
  26. Liew, A.W.-C.; Yan, H. Blocking Artifacts Suppression in Block-Coded Images Using Overcomplete Wavelet Representation. IEEE Trans. Circuits Syst. Video Technol. 2004, 14, 450–461. [Google Scholar] [CrossRef]
  27. Liu, X.; Wu, X.; Zhou, J.; Zhao, D. Data-driven sparsity-based restoration of JPEG-compressed images in dual transform-pixel domain. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5171–5178. [Google Scholar] [CrossRef]
  28. Ren, J.; Liu, J.; Li, M.; Bai, W.; Guo, Z. Image Blocking Artifacts Reduction via Patch Clustering and Low-Rank Minimization. In Proceedings of the 2013 Data Compression Conference, Snowbird, UT, USA, 20–22 March 2013; p. 516. [Google Scholar] [CrossRef]
  29. Zhang, X.; Xiong, R.; Fan, X.; Ma, S.; Gao, W. Compression Artifact Reduction by Overlapped-Block Transform Coefficient Estimation with Block Similarity. IEEE Trans. Image Process. 2013, 22, 4613–4626. [Google Scholar] [CrossRef]
  30. Kathariya, B.; Li, Z.; Wang, H.; Van Der Auwera, G. Multi-stage Locally and Long-range Correlated Feature Fusion for Learned In-loop Filter in VVC. In Proceedings of the 2022 IEEE International Conference on Visual Communications and Image Processing (VCIP), Suzhou, China, 13–16 December 2022; pp. 1–5. [Google Scholar] [CrossRef]
  31. Santamaria, M.; Lam, Y.-H.; Cricri, F.; Lainema, J.; Youvalari, R.G.; Zhang, H.; Hannuksela, M.M.; Rahtu, E.; Gabbouj, M. Content-adaptive convolutional neural network post-processing filter. In Proceedings of the 2021 IEEE International Symposium on Multimedia (ISM), Naples, Italy, 29 November–1 December 2021; pp. 99–106. [Google Scholar] [CrossRef]
  32. Pham, C.D.K.; Fu, C.; Zhou, J. Deep Learning based Spatial-Temporal In-loop filtering for Versatile Video Coding. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June 2021; pp. 1861–1865. [Google Scholar] [CrossRef]
  33. Lim, W.-Q.; Stallenberger, B.; Pfaff, J.; Schwarz, H.; Marpe, D.; Wiegand, T. Simplified CNN In-Loop Filter with fixed Classifications. In Proceedings of the 2024 Picture Coding Symposium (PCS), Taichung, Taiwan, 12–14 June 2024; pp. 1–5. [Google Scholar] [CrossRef]
  34. Zhang, H.; Jung, C.; Liu, Y.; Li, M. Lightweight CNN-Based in-Loop Filter for VVC Intra Coding. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 1635–1639. [Google Scholar] [CrossRef]
  35. Lim, W.Q.; Pfaff, J.; Stallenberger, B.; Erfurt, J.; Schwarz, H.; Marpe, D.; Wiegand, T. Adaptive Loop Filter with a CNN-Based Classification. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; Available online: https://ieeexplore.ieee.org/document/9897666 (accessed on 5 September 2024).
  36. Liu, C.; Sun, H.; Katto, J.; Zeng, X.; Fan, Y. QA-Filter: A QP-Adaptive Convolutional Neural Network Filter for Video Coding. IEEE Trans. Image Process. 2022, 31, 3032–3045. [Google Scholar] [CrossRef] [PubMed]
  37. Wang, H.; Jia, L.; Ren, H.; Jia, K. Video Coding Scheme Using Pre-Processing Degradation and Post-Processing Restoration. In Proceedings of the 2023 2nd International Conference on Image Processing and Media Computing (ICIPMC), Xi’an, China, 26–28 May 2023; pp. 113–117. [Google Scholar] [CrossRef]
  38. Wieckowski, A.; Brandenburg, J.; Hinz, T.; Bartnik, C.; George, V.; Hege, G.; Helmrich, C.; Henkel, A.; Lehmann, C.; Stoffers, C.; et al. VVenC: An Open and Optimized VVC Encoder Implementation. In Proceedings of the 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shenzhen, China, 5–9 July 2021; pp. 1–2. [Google Scholar] [CrossRef]
  39. Wieckowski, A.; Hege, G.; Bartnik, C.; Lehmann, C.; Stoffers, C.; Bross, B.; Marpe, D. Towards A Live Software Decoder Implementation for The Upcoming Versatile Video Coding (VVC) Codec. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 3124–3128. [Google Scholar] [CrossRef]
  40. Ma, D.; Zhang, F.; Bull, D.R. BVI-DVC: A Training Database for Deep Video Compression. IEEE Trans. Multimed. 2022, 24, 3847–3858. [Google Scholar] [CrossRef]
  41. Alshina, E.; Liao, R.-L.; Liu, S.; Segall, A. JVET Common Test Conditions and Evaluation Procedures for Neural Network Based Video Coding Technology; JVET-AC2016-v1; Joint Video Experts Team (JVET): Geneva, Switzerland, 2023. [Google Scholar]
  42. FFmpeg Documentation. Available online: https://ffmpeg.org/ffmpeg.html (accessed on 22 March 2023).
  43. Bjontegaard, G. Calculation of Average PSNR Differences between RD-Curves; VCEG-M33; Video Coding Experts Group (VCEG): Geneva, Switzerland, 2001. [Google Scholar]
  44. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Available online: https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html (accessed on 4 March 2023).
  45. Zhang, H.; Jung, C.; Zou, D.; Li, M. WCDANN: A Lightweight CNN Post-Processing Filter for VVC-Based Video Compression. IEEE Access 2023, 11, 83400–83413. [Google Scholar] [CrossRef]
Figure 1. Enhancing video quality with CNN based post-processing in conventional VVC coding workflow.
Figure 2. MP4 to YUV conversion and reconstruction using VVenC and VVdeC.
Figure 3. Illustration of video-to-image conversion process: (a) original videos converted to original images using FFmpeg, and (b) reconstructed videos converted to reconstructed images using FFmpeg.
Figure 4. Illustration of the conversion process from YUV 4:2:0 format to YUV 4:4:4 format before feeding data into the deep learning network.
Figure 5. Illustration of down-sampling process of neural network output from YUV 4:4:4 to YUV 4:2:0 format.
Figure 6. Architecture of the proposed CNN-based post-filtering method, integrating multiple feature extractions for enhanced output refinement.
Figure 7. Comparative visualization of (b) reconstructed frames from anchor VVC and (c) proposed methods for DaylightRoad2 sequence at QP 42 for RA configuration, alongside the (a) original uncompressed reference frame.
Figure 8. Comparative visualization of (b) reconstructed frames from anchor VVC and (c) proposed methods for FourPeople sequence at QP 42 for LD configuration, alongside the (a) original uncompressed reference frame.
Figure 9. RD curve performance comparison for five different test sequences in RA configuration.
Figure 10. RD curve performance comparison for four different test sequences in LD configuration.
Figure 11. Visual quality comparison of proposed method with 8 feature extraction blocks for RA and LD scenarios at QP 42: (a) MarketPlace Sequence and (b) PartyScene Sequence.
Figure 12. Visual quality comparison of proposed method with 12 feature extraction blocks for RA and LD scenarios at QP 42: (a) RitualDance Sequence and (b) Cactus Sequence.
Table 1. The compression performance of proposed method for RA configuration (BD-Rate, %).
Class | Sequence | Y | U | V
A1 | Tango2 | −4.34% | −19.45% | −18.88%
A1 | FoodMarket4 | −3.06% | −8.08% | −9.10%
A1 | Campfire | −5.74% | −6.38% | −18.76%
A1 | Average | −4.38% | −11.31% | −15.58%
A2 | CatRobot | −6.52% | −22.98% | −21.03%
A2 | DaylightRoad2 | −7.47% | −18.10% | −16.31%
A2 | ParkRunning3 | −1.91% | −2.18% | −0.60%
A2 | Average | −5.30% | −14.42% | −12.65%
B | MarketPlace | −3.08% | −17.64% | −14.65%
B | RitualDance | −5.55% | −15.33% | −16.73%
B | Cactus | −4.75% | −12.96% | −11.42%
B | BasketballDrive | −5.97% | −18.57% | −22.98%
B | BQTerrace | −6.07% | −10.87% | −8.39%
B | Average | −5.08% | −15.08% | −14.83%
C | BasketballDrill | −6.92% | −13.77% | −22.00%
C | BQMall | −6.86% | −19.92% | −22.51%
C | PartyScene | −6.23% | −12.64% | −9.11%
C | RaceHorses | −5.11% | −19.38% | −23.22%
C | Average | −6.28% | −16.43% | −19.21%
D | BasketballPass | −6.46% | −10.55% | −12.83%
D | BQSquare | −13.10% | −8.62% | −21.47%
D | BlowingBubbles | −4.49% | −14.81% | −12.71%
D | RaceHorses | −6.70% | −23.57% | −24.91%
D | Average | −7.69% | −14.43% | −17.98%
Overall | | −5.81% | −14.52% | −16.19%
Table 2. The compression performance of proposed method for LD configuration (BD-Rate, %).
Class | Sequence | Y | U | V
B | MarketPlace | −3.10% | −31.06% | −25.53%
B | RitualDance | −4.54% | −21.87% | −30.52%
B | Cactus | −5.56% | −24.13% | −23.42%
B | BasketballDrive | −5.28% | −25.01% | −26.59%
B | BQTerrace | −7.77% | −28.08% | −23.87%
B | Average | −5.25% | −26.03% | −23.95%
C | BasketballDrill | −6.71% | −25.45% | −31.29%
C | BQMall | −9.17% | −26.18% | −29.91%
C | PartyScene | −6.94% | −28.71% | −25.51%
C | RaceHorses | −5.67% | −28.84% | −34.67%
C | Average | −7.12% | −27.29% | −30.35%
D | BasketballPass | −6.01% | −29.03% | −28.02%
D | BQSquare | −14.85% | −27.14% | −40.74%
D | BlowingBubbles | −5.43% | −31.59% | −25.97%
D | RaceHorses | −7.96% | −32.28% | −34.24%
D | Average | −8.56% | −30.01% | −32.24%
E | FourPeople | −8.50% | −15.09% | −19.36%
E | Johnny | −8.32% | −9.08% | −22.02%
E | KristenAndSara | −5.84% | −11.27% | −16.39%
E | Average | −7.56% | −11.81% | −19.26%
Overall | | −6.98% | −24.68% | −26.74%
Table 3. BD-Rate comparison between the proposed method and [45] for RA configuration.
Class | WCDANN [45] Y | WCDANN [45] U | WCDANN [45] V | Proposed Y | Proposed U | Proposed V
A1 | −2.23% | N/A | N/A | −4.38% | −11.31% | −15.58%
A2 | −2.70% | N/A | N/A | −5.30% | −14.42% | −12.65%
B | −2.73% | N/A | N/A | −5.08% | −15.08% | −14.83%
C | −3.43% | N/A | N/A | −5.11% | −16.43% | −19.21%
D | −4.76% | N/A | N/A | −7.69% | −14.43% | −17.98%
E | - | - | - | - | - | -
Overall | −2.81% | N/A | N/A | −5.81% | −14.52% | −16.19%
Table 4. BD-Rate comparison between the proposed method and [45] for LD configuration.
Class | WCDANN [45] Y | WCDANN [45] U | WCDANN [45] V | Proposed Y | Proposed U | Proposed V
A1 | −2.53% | N/A | N/A | - | - | -
A2 | −2.92% | N/A | N/A | - | - | -
B | −3.49% | N/A | N/A | −5.25% | −26.03% | −23.95%
C | −3.95% | N/A | N/A | −7.12% | −27.29% | −30.35%
D | −4.72% | N/A | N/A | −8.56% | −30.01% | −32.24%
E | −6.33% | N/A | N/A | −7.56% | −11.81% | −19.26%
Overall | −3.81% | N/A | N/A | −6.98% | −24.68% | −26.74%
Table 5. PSNR comparison of proposed method with 8 and 16 feature extraction blocks in RA configuration (all values in dB).
Sequence | Y (8 blocks) | U (8 blocks) | V (8 blocks) | Y (16 blocks) | U (16 blocks) | V (16 blocks)
MarketPlace | 31.78 | 39.06 | 40.05 | 30.01 | 39.75 | 40.29
RitualDance | 31.95 | 40.91 | 40.08 | 32.34 | 41.07 | 40.41
Cactus | 30.31 | 36.70 | 37.59 | 30.73 | 37.13 | 38.09
BasketballDrive | 31.29 | 39.08 | 37.83 | 31.47 | 39.28 | 38.18
BQTerrace | 30.35 | 38.38 | 40.99 | 30.66 | 38.76 | 41.30
BasketballDrill | 29.67 | 35.61 | 34.92 | 29.88 | 35.81 | 35.38
BQMall | 29.40 | 37.41 | 37.67 | 29.59 | 37.67 | 38.24
PartyScene | 26.08 | 34.42 | 34.26 | 26.16 | 34.48 | 34.67
RaceHorses | 27.22 | 33.59 | 35.01 | 27.44 | 33.99 | 35.46
Average | 29.78 | 37.24 | 37.60 | 30.03 | 37.55 | 38.00
Table 6. PSNR comparison of proposed method with 8 and 16 feature extraction blocks in LD configuration (all values in dB).
Sequence | Y (8 blocks) | U (8 blocks) | V (8 blocks) | Y (16 blocks) | U (16 blocks) | V (16 blocks)
MarketPlace | 30.81 | 38.65 | 39.04 | 30.83 | 38.73 | 39.14
RitualDance | 31.55 | 40.17 | 39.43 | 31.58 | 40.13 | 39.45
Cactus | 29.36 | 36.27 | 36.79 | 29.39 | 36.32 | 36.87
BasketballDrive | 30.60 | 38.09 | 37.13 | 30.63 | 38.22 | 37.18
BQTerrace | 28.69 | 37.72 | 40.29 | 28.72 | 37.68 | 40.27
BasketballDrill | 29.07 | 34.97 | 34.46 | 29.09 | 34.90 | 34.55
BQMall | 28.04 | 36.32 | 36.81 | 28.06 | 36.39 | 36.85
PartyScene | 24.35 | 33.30 | 33.68 | 24.35 | 33.21 | 33.60
RaceHorses | 26.88 | 33.15 | 34.56 | 26.89 | 33.24 | 34.61
Average | 28.81 | 36.52 | 36.91 | 28.84 | 36.54 | 36.95
Table 7. PSNR comparison of proposed method with 12 and 16 feature extraction blocks in RA configuration (all values in dB).
Sequence | Y (12 blocks) | U (12 blocks) | V (12 blocks) | Y (16 blocks) | U (16 blocks) | V (16 blocks)
MarketPlace | 31.68 | 38.80 | 39.89 | 30.01 | 39.75 | 40.29
RitualDance | 31.96 | 40.92 | 40.32 | 32.34 | 41.07 | 40.41
Cactus | 30.38 | 36.94 | 37.70 | 30.73 | 37.13 | 38.09
BasketballDrive | 31.32 | 39.01 | 37.79 | 31.47 | 39.28 | 38.18
BQTerrace | 30.43 | 38.52 | 41.13 | 30.66 | 38.76 | 41.30
BasketballDrill | 29.74 | 35.67 | 35.06 | 29.88 | 35.81 | 35.38
BQMall | 29.43 | 37.41 | 37.87 | 29.59 | 37.67 | 38.24
PartyScene | 26.13 | 34.56 | 34.72 | 26.16 | 34.48 | 34.67
RaceHorses | 27.33 | 33.69 | 35.06 | 27.44 | 33.99 | 35.46
Average | 29.82 | 37.28 | 37.73 | 30.03 | 37.55 | 38.00
Table 8. PSNR comparison of proposed method with 12 and 16 feature extraction blocks in LD configuration (all values in dB).
Sequence | Y (12 blocks) | U (12 blocks) | V (12 blocks) | Y (16 blocks) | U (16 blocks) | V (16 blocks)
MarketPlace | 30.82 | 38.69 | 39.06 | 30.83 | 38.73 | 39.14
RitualDance | 31.57 | 40.20 | 39.42 | 31.58 | 40.13 | 39.45
Cactus | 29.38 | 36.33 | 36.83 | 29.39 | 36.32 | 36.87
BasketballDrive | 30.62 | 38.19 | 37.11 | 30.63 | 38.22 | 37.18
BQTerrace | 28.73 | 37.72 | 40.22 | 28.72 | 37.68 | 40.27
BasketballDrill | 29.08 | 35.01 | 34.55 | 29.09 | 34.90 | 34.55
BQMall | 28.06 | 36.40 | 36.84 | 28.06 | 36.39 | 36.85
PartyScene | 24.35 | 33.29 | 33.66 | 24.35 | 33.21 | 33.60
RaceHorses | 26.89 | 33.20 | 34.58 | 26.89 | 33.24 | 34.61
Average | 28.83 | 36.56 | 36.92 | 28.84 | 36.54 | 36.95
Table 9. PSNR differences with and without QP map as input to the proposed solution for RA configuration, class D, and QP 42 (all values in dB; Diff. = with QP map − without QP map).
Video | With QP map Y | With QP map U | With QP map V | Without QP map Y | Without QP map U | Without QP map V | Diff. Y | Diff. U | Diff. V
BasketBallPass | 27.8515 | 35.8552 | 33.4582 | 27.8076 | 35.8208 | 33.3941 | 0.0439 | 0.0343 | 0.0641
BQSquare | 27.1305 | 37.8534 | 38.8090 | 27.0558 | 37.8771 | 38.9540 | 0.0746 | −0.0237 | −0.1450
BlowingBubbles | 26.5069 | 34.0143 | 34.4027 | 26.4869 | 34.0737 | 34.3561 | 0.0200 | −0.0593 | 0.0465
RaceHorses | 26.6916 | 33.1223 | 33.9343 | 26.6440 | 33.0459 | 33.8659 | 0.0476 | 0.0763 | 0.0684
Average | 27.0451 | 35.2113 | 35.1510 | 26.9986 | 35.2044 | 35.1425 | 0.0465 | 0.0069 | 0.0085
Table 10. PSNR differences with and without QP map as input to the proposed solution for LDB configuration, class D, and QP 22 (all values in dB; Diff. = with QP map − without QP map).
Video | With QP map Y | With QP map U | With QP map V | Without QP map Y | Without QP map U | Without QP map V | Diff. Y | Diff. U | Diff. V
BasketBallPass | 40.3176 | 43.9128 | 43.5004 | 40.2249 | 43.8300 | 43.3209 | 0.0926 | 0.0828 | 0.1795
BQSquare | 37.2240 | 43.0319 | 44.5432 | 37.1296 | 42.9724 | 44.2991 | 0.0943 | 0.0594 | 0.2440
BlowingBubbles | 37.0140 | 40.8094 | 41.6349 | 36.9599 | 40.7638 | 41.4932 | 0.0541 | 0.0455 | 0.1417
RaceHorses | 38.8813 | 41.0287 | 42.4459 | 38.8106 | 40.9196 | 42.2640 | 0.0706 | 0.1091 | 0.1818
Average | 38.3592 | 42.1957 | 43.0311 | 38.2813 | 42.8443 | 43.0311 | 0.0779 | 0.0742 | 0.1867