Versatile Video Coding-Post Processing Feature Fusion: A Post-Processing Convolutional Neural Network with Progressive Feature Fusion for Efficient Video Enhancement
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper proposes a novel CNN-based post-processing method that effectively reduces artifacts in VVC-compressed videos, significantly enhancing visual quality and improving the viewer experience. However, there are several issues:
The introduction is poorly written and fails to establish the necessity of this research.
Although the paper introduces a new CNN architecture (VVC-PPFF) for video quality enhancement, the literature review does not sufficiently explain the technical or theoretical uniqueness of the proposed method compared to existing approaches.
The comparative analysis is inadequate; the experiments mainly compare the proposed method with traditional VVC decoders, but do not sufficiently benchmark it against other recent deep learning-based video enhancement methods, leaving the superiority or weaknesses of the proposed method unclear.
The logical structure of the paper is somewhat unclear in places, and some paragraphs are overly verbose, which may make it difficult for readers to quickly grasp the key points. For instance, the description of the feature fusion section is overly complex and could be simplified and more directly focused on explaining the technical implementation and its advantages.
The references are outdated and need to be significantly updated with literature from the past three years.
Comments on the Quality of English Language
The English requires further revision.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript proposes a post-processing solution to enhance the quality of decoded video frames using a CNN. The manuscript has a major deficiency: the lack of PSNR comparisons with and without the proposed solution.
The experimental results show comparisons between the RA and LD compression configurations, which are not directly related to the proposed solution.
The authors must also add results showing the PSNR difference of the CNN solution with and without the use of the QP map.
The CNN training is performed on 240×240 images; how is the model used to enhance the PSNR of videos with larger resolutions?
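One common answer to this question is tiled inference: the full-resolution frame is split into 240×240 patches, each patch is enhanced independently, and the outputs are stitched back together. The sketch below illustrates that idea under stated assumptions; `model` is a hypothetical callable (not from the manuscript) mapping a patch to an enhanced patch of the same shape, and edge padding handles frames whose dimensions are not multiples of the patch size.

```python
import numpy as np

PATCH = 240  # training patch size mentioned in the review comment

def enhance_frame(frame, model, patch=PATCH):
    """Tile a 2-D frame into patch-by-patch blocks, enhance each, stitch back."""
    h, w = frame.shape
    ph = (patch - h % patch) % patch  # bottom padding to a patch multiple
    pw = (patch - w % patch) % patch  # right padding to a patch multiple
    padded = np.pad(frame, ((0, ph), (0, pw)), mode="edge")
    out = np.zeros_like(padded)
    for y in range(0, padded.shape[0], patch):
        for x in range(0, padded.shape[1], patch):
            out[y:y + patch, x:x + patch] = model(padded[y:y + patch, x:x + patch])
    return out[:h, :w]  # crop back to the original resolution

# Identity "model" stands in for the CNN, just to exercise the tiling logic.
frame = np.random.rand(1080, 1920).astype(np.float32)
restored = enhance_frame(frame, lambda p: p)
```

Note that a fully convolutional network can often process arbitrary resolutions directly, in which case tiling is only needed to bound memory use; overlapping tiles with blending would further reduce seam artifacts.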
Other comments:
Section 3.1:
Why is the input in Figure 2 MP4-encoded? In video coding experiments, the input is always in the original, uncompressed YUV format. What QP is used in this MP4 compression?
The fact that your input is already distorted affects your final results and makes them questionable. You do not have to use this particular dataset for training, as it is already MP4-compressed; otherwise, convince the reader that the MP4 compression did not affect your training and testing.
Line 336: conversion between YUV and RGB is possible and simple; please add a note on that.
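For reference, the conversion the reviewer alludes to is a fixed linear transform. A minimal sketch assuming full-range BT.601 coefficients (the exact matrix depends on the colour standard, e.g. BT.709, and on full versus limited range, neither of which is specified in the manuscript excerpt):

```python
import numpy as np

def yuv_to_rgb(y, u, v):
    """Full-range BT.601 YUV (8-bit planes) to an interleaved RGB array."""
    y = y.astype(np.float32)
    u = u.astype(np.float32) - 128.0  # centre chroma around zero
    v = v.astype(np.float32) - 128.0
    r = y + 1.402 * v
    g = y - 0.344136 * u - 0.714136 * v
    b = y + 1.772 * u
    return np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)

# Mid-grey: Y=128 with neutral chroma (U=V=128) maps to RGB (128, 128, 128).
neutral = np.full((2, 2), 128, np.uint8)
grey = yuv_to_rgb(neutral, neutral, neutral)
```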
Figure 2:
In Figure 1, you mention that you use VVenC for compression, but here you mention FFmpeg. Please note that VVenC has been integrated into FFmpeg since June 2024.
This figure is confusing. What is the input: compressed video or YUV sequences?
Figures 4 and 5:
What filtering is used in downsampling and upsampling?
Section 3.2
The QP changes within a video frame and also from one frame to the next. Do you create a QP map per video frame? Please clarify.
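In many CNN post-filter designs, the QP map is simply a plane of the same size as the frame, filled with the (normalised) frame-level QP and concatenated to the input as an extra channel, yielding one map per frame. A minimal sketch under that assumption (the manuscript's actual construction may differ, and VVC QPs range up to 63):

```python
import numpy as np

def qp_map(height, width, qp, qp_max=63.0):
    """Constant-valued QP plane for one frame, normalised to [0, 1]."""
    return np.full((height, width), qp / qp_max, dtype=np.float32)

# One map per frame: three frames coded at QPs 32, 37, 42 give three planes.
maps = [qp_map(240, 240, qp) for qp in (32, 37, 42)]
```

A per-block QP map (reflecting intra-frame QP variation, which the reviewer raises) would instead fill each region with its block-level QP rather than a single frame-wide value.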
Section 3.3:
How is the discussion about HR and LR patches in CNN training relevant to your proposed solution?
Section 3.4:
Is the output of the CNN a filter? Please clarify.
Line 500: “The combined QP map and reconstructed frame”: how are they combined?
Eq. (4): define the “+” sign.
Figure 6:
Modify the figure to show that you have a sequence of QP maps, one per video frame.
Section 4.1:
If the QP is fixed, please explain to the reader how this value changes in the QP map.
Table 1:
Your proposed solution is a post-process applied to decoded YUV frames; hence BD-Rate is irrelevant, as the video bitstream is unaffected by the proposed solution.
What matters in your proposal is the PSNR only. Please add PSNR values to Table 1, with and without the use of the proposed solution.
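The quantity the reviewer asks for is standard: PSNR of the decoded frame versus the original, compared against PSNR of the enhanced frame versus the original. A minimal sketch of the metric itself, using synthetic 8-bit data in place of the manuscript's actual sequences:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """PSNR in dB between a reference frame and a test frame (8-bit data)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
original = rng.integers(0, 256, (64, 64), dtype=np.uint8)
# Synthetic "decoded" frame: original plus small uniform noise, stand-in
# for compression distortion.
decoded = np.clip(original.astype(int) + rng.integers(-8, 9, (64, 64)), 0, 255)
psnr_without = psnr(original, decoded)
```

The "with post-processing" column would then report `psnr(original, enhanced)` for each sequence and QP, and the per-row difference is the gain the reviewer wants to see.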
Table 2:
What matters in your proposal is the PSNR only. Please add PSNR values to Table 2, with and without the use of the proposed solution.
Figure 9 and 10:
The comparison between LD and RA is irrelevant to your proposed solution. Justify the relevance of these figures to your solution
Tables 3 and 4:
I want to see the PSNR with and without the proposed solution.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This paper proposes an approach to improve the compression performance of VVC by utilizing feature fusion, QP-map information, and skip connections in a CNN-based post-filter. The proposed method offers intriguing results for readers interested in integrating neural network technology into video compression. I recommend that this paper be published as submitted.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
This paper can be accepted.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors did a good job in the revision process. However, the PSNR results show that the average enhancement is less than 0.5 dB. In the video coding community, such an enhancement is known to be visually insignificant. Hence, I weakly accept the paper for publication.