Versatile Video Coding-Post Processing Feature Fusion: A Post-Processing Convolutional Neural Network with Progressive Feature Fusion for Efficient Video Enhancement
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper proposes a novel CNN-based post-processing method that effectively reduces artifacts in VVC-compressed videos, significantly enhancing visual quality and improving the viewer experience. However, there are several issues:
The introduction is poorly written and fails to establish the necessity of this research.
Although the paper introduces a new CNN architecture (VVC-PPFF) for video quality enhancement, the literature review does not sufficiently explain the technical or theoretical uniqueness of the proposed method compared to existing approaches.
The comparative analysis is inadequate; the experiments mainly compare the proposed method with traditional VVC decoders, but do not sufficiently benchmark it against other recent deep learning-based video enhancement methods, leaving the superiority or weaknesses of the proposed method unclear.
The logical structure of the paper is somewhat unclear in places, and some paragraphs are overly verbose, which may make it difficult for readers to quickly grasp the key points. For instance, the description of the feature fusion section is overly complex and could be simplified and more directly focused on explaining the technical implementation and its advantages.
The references are outdated and need to be significantly updated with literature from the past three years.
Comments on the Quality of English Language
The English requires further revision.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript proposes a post-processing solution to enhance the quality of decoded video frames using a CNN. The manuscript has a major deficiency: the lack of PSNR comparisons with and without the proposed solution.
The experimental results show comparisons between the RA and LD compression configurations, which are not directly related to the proposed solution.
The authors must also add results showing the PSNR difference of the CNN solution with and without the use of the QP map.
The CNN training is performed on 240×240 images; how is the model used to enhance the PSNR of videos with larger resolutions?
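One common answer to this question is tiled inference: the full-resolution frame is split into 240×240 patches, each patch is enhanced independently, and the outputs are stitched back together. The sketch below illustrates that idea under stated assumptions; `model` is a hypothetical callable (not from the manuscript) mapping a patch to an enhanced patch of the same shape, and edge padding handles frames whose dimensions are not multiples of the patch size.

```python
import numpy as np

PATCH = 240  # training patch size mentioned in the review comment

def enhance_frame(frame, model, patch=PATCH):
    """Tile a 2-D frame into patch-by-patch blocks, enhance each, stitch back."""
    h, w = frame.shape
    ph = (patch - h % patch) % patch  # bottom padding to a patch multiple
    pw = (patch - w % patch) % patch  # right padding to a patch multiple
    padded = np.pad(frame, ((0, ph), (0, pw)), mode="edge")
    out = np.zeros_like(padded)
    for y in range(0, padded.shape[0], patch):
        for x in range(0, padded.shape[1], patch):
            out[y:y + patch, x:x + patch] = model(padded[y:y + patch, x:x + patch])
    return out[:h, :w]  # crop back to the original resolution

# Identity "model" stands in for the CNN, just to exercise the tiling logic.
frame = np.random.rand(1080, 1920).astype(np.float32)
restored = enhance_frame(frame, lambda p: p)
```

Note that a fully convolutional network can often process arbitrary resolutions directly, in which case tiling is only needed to bound memory use; overlapping tiles with blending would further reduce seam artifacts.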
Other comments:
Section 3.1:
Why is the input in Figure 2 MP4-encoded? In video coding experiments, the input is always in the original, uncompressed YUV format. What QP is used in this MP4 compression?
The fact that your input is already distorted affects your final results and makes them questionable. You do not have to use this particular dataset for training, as it is already MP4-compressed; otherwise, convince the reader that the MP4 compression did not affect your training and testing.
Line 336: conversion between YUV and RGB is possible and simple; please add a note on that.
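For reference, the conversion the reviewer alludes to is a fixed linear transform. A minimal sketch assuming full-range BT.601 coefficients (the exact matrix depends on the colour standard, e.g. BT.709, and on full versus limited range, neither of which is specified in the manuscript excerpt):

```python
import numpy as np

def yuv_to_rgb(y, u, v):
    """Full-range BT.601 YUV (8-bit planes) to an interleaved RGB array."""
    y = y.astype(np.float32)
    u = u.astype(np.float32) - 128.0  # centre chroma around zero
    v = v.astype(np.float32) - 128.0
    r = y + 1.402 * v
    g = y - 0.344136 * u - 0.714136 * v
    b = y + 1.772 * u
    return np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)

# Mid-grey: Y=128 with neutral chroma (U=V=128) maps to RGB (128, 128, 128).
neutral = np.full((2, 2), 128, np.uint8)
grey = yuv_to_rgb(neutral, neutral, neutral)
```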
Figure 2:
In Figure 1, you mention that you use VVenC for compression, but here you mention FFmpeg. Please note that VVenC has been integrated into FFmpeg since June 2024.
This figure is confusing. What is the input: compressed video or YUV sequences?
Figures 4 and 5:
What filtering is used in downsampling and upsampling?
Section 3.2
The QP changes within a video frame and also from one frame to the next. Do you create a QP map per video frame? Please clarify.
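In many CNN post-filter designs, the QP map is simply a plane of the same size as the frame, filled with the (normalised) frame-level QP and concatenated to the input as an extra channel, yielding one map per frame. A minimal sketch under that assumption (the manuscript's actual construction may differ, and VVC QPs range up to 63):

```python
import numpy as np

def qp_map(height, width, qp, qp_max=63.0):
    """Constant-valued QP plane for one frame, normalised to [0, 1]."""
    return np.full((height, width), qp / qp_max, dtype=np.float32)

# One map per frame: three frames coded at QPs 32, 37, 42 give three planes.
maps = [qp_map(240, 240, qp) for qp in (32, 37, 42)]
```

A per-block QP map (reflecting intra-frame QP variation, which the reviewer raises) would instead fill each region with its block-level QP rather than a single frame-wide value.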
Section 3.3:
How is the discussion about HR and LR patches in CNN training relevant to your proposed solution?
Section 3.4:
Is the output of the CNN a filter? Please clarify.
Line 500: “The combined QP map and reconstructed frame”: how are they combined?
Eq. (4): define the “+” sign.
Figure 6:
Modify the figure to show that you have a sequence of QP maps, one per video frame.
Section 4.1:
If the QP is fixed, please explain to the reader how this value changes in the QP map.
Table 1:
Your proposed solution is a post-process applied to decoded YUV frames; hence BD-Rate is irrelevant, as the video bitstream is unaffected by the proposed solution.
What matters in your proposal is the PSNR only. Please add PSNR values to Table 1, with and without the use of the proposed solution.
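The quantity the reviewer asks for is standard: PSNR of the decoded frame versus the original, compared against PSNR of the enhanced frame versus the original. A minimal sketch of the metric itself, using synthetic 8-bit data in place of the manuscript's actual sequences:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """PSNR in dB between a reference frame and a test frame (8-bit data)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
original = rng.integers(0, 256, (64, 64), dtype=np.uint8)
# Synthetic "decoded" frame: original plus small uniform noise, stand-in
# for compression distortion.
decoded = np.clip(original.astype(int) + rng.integers(-8, 9, (64, 64)), 0, 255)
psnr_without = psnr(original, decoded)
```

The "with post-processing" column would then report `psnr(original, enhanced)` for each sequence and QP, and the per-row difference is the gain the reviewer wants to see.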
Table 2:
What matters in your proposal is the PSNR only. Please add PSNR values to Table 2, with and without the use of the proposed solution.
Figure 9 and 10:
The comparison between LD and RA is irrelevant to your proposed solution. Justify the relevance of these figures to your solution
Tables 3 and 4:
I want to see the PSNR with and without the proposed solution.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This paper proposes an approach to improve the compression performance of VVC by utilizing feature fusion, QP-map information, and skip connections in a CNN-based post-filter. The proposed method offers intriguing results for readers interested in integrating neural network technology into video compression. I recommend that this paper be published as submitted.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
This paper can be accepted.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors did a good job in the revision process. However, the PSNR results show that the average enhancement is less than 0.5 dB. In the video coding community, such an enhancement is known to be visually insignificant. Hence, I weakly accept the paper for publication.