Learning Adaptive Quantization Parameter for Consistent Quality Oriented Video Coding

Tien Huu Vu; Minh Ngoc Do; Sang Quang Nguyen; Huy PhiCong; Thipphaphone Sisouvong; Xiem HoangVan

doi:10.3390/electronics12244905

Abstract

In the Industry 4.0 era, video applications such as visual surveillance systems, video conferencing, and video broadcasting play a vital role. In these applications, for manipulating and tracking objects in decoded video, the quality of the decoded video should be consistent because it largely affects the performance of machine analysis. To cope with this problem, we propose a novel perceptual video coding (PVC) solution in which a full-reference quality metric named video multimethod assessment fusion (VMAF) is employed together with a deep convolutional neural network (CNN) to obtain consistent quality while still achieving high compression performance. First, to meet the consistent quality requirement, we propose a CNN model that takes an expected VMAF as input to adaptively adjust the quantization parameter (QP) of each coding block. Afterwards, to increase the compression performance, the Lagrange coefficient of the rate-distortion optimization (RDO) mechanism is adaptively computed according to rate-QP and quality-QP models. The experimental results show that the proposed PVC solution achieves two targets simultaneously: the quality of the video sequence is kept consistent with an expected quality level, and the bitrate saving of the proposed method is higher than that of traditional video coding standards and the relevant benchmark, notably with around 10% bitrate saving on average.

Keywords: video quality consistency; adaptive QP; perceptual-based RDO

1. Introduction

With the growth of video traffic on telecommunication networks, delivering video services with consistent quality to viewers has become an urgent requirement. In video streaming applications, image quality stability is one of the most important criteria [1]. An experiment in [2] indicated that the Mean Opinion Score (MOS) in a video streaming application decreases significantly as the frequency of quality changes increases. The reason is that quality changes among frames in a video sequence are typically annoying to human visual perception (see Figure 1). Ensuring consistent video quality is also essential for networks such as visual sensor networks or surveillance camera networks. In these networks, servers receive video signals from a large number of camera nodes and process them for tasks such as intrusion detection and abnormal behavior analysis in smart surveillance camera systems. Therefore, consistent video quality is necessary for increasing the efficiency of the network [3,4].

Several rate control algorithms have been proposed to improve the consistency of video quality. In [5], a method is proposed to provide smooth quantization under constant bitrate (CBR) encoding. In this method, the quantization parameters produced by traditional rate control algorithms are smoothed by a low-pass filtering mechanism. In [6], a sequential rate control algorithm is proposed for real-time video coding. However, both of these methods measure visual quality with the mean squared error (MSE), which does not accurately reflect human perceptual quality. Thus, in [7], a new visual quality metric (VQM) is proposed. However, this metric still uses MSE, alongside the motion information content, in computing VQM.

Figure 1. Example of video with inconsistent quality [8]. (a–d) Frames of Foreman video sequence. (e–h): Difference with the original images.

To address the causes of quality fluctuation in RDO, in [9], the Lagrange multiplier $\lambda$ is adjusted according to the video content in the RDO process in order to ensure that the reconstructed video always achieves a stable quality level. In particular, the Lagrange multiplier and quantization parameter (QP) for each frame are computed so that the difference in quality between the predicted frame and the corresponding decoded frame is minimized. With the goal of achieving a constant quality level among frames, the method in [10] uses the probability density function of the transform coefficients to estimate the depth of the coding tree. From there, the quality of the coding blocks is adjusted to keep a stable quality level. In [11], the content-adapted quality-distortion model for the H.264 encoding standard is used to estimate the distortion between the original frame and the decoded frame. Based on the estimated distortion value, the QP for each frame is estimated to achieve the desired quality level.

Another issue related to visual quality in video coding is the quality assessment metric. In general, conventional objective quality assessment metrics are preferable in practical applications since they offer a specific computational formula and may be easily implemented at the encoder. The peak signal to noise ratio (PSNR) and MSE are the two most commonly used objective measures. In [12], a PSNR-based method is proposed to keep the quality of the reconstructed video sequence constant. In this method, to keep the video quality constant in terms of PSNR, the QP of each frame is adjusted according to the average PSNR of the previous frames. If the average PSNR is less than the PSNR target, the QP of the current frame is reduced, and vice versa. However, it has been demonstrated that PSNR and MSE have only a weak relationship with the human visual system (HVS). Therefore, Netflix introduced a metric called VMAF, which uses a machine-learning model trained on user feedback to reflect the viewer’s viewpoint [13]. Using Support Vector Machine (SVM) regression, this metric combines several fundamental metrics such as Visual Information Fidelity (VIF) [14], Detail Loss Metric (DLM) [15], and motion. In practice, the industry uses the VMAF metric extensively because of its superior accuracy compared to traditional metrics [16,17,18].

Because of its benefits, VMAF has been proposed as a replacement for conventional metrics in works such as [19,20]. In these methods, the relationship between the sum of squared differences (SSD) and VMAF is built. Consequently, the Lagrange multiplier in the RDO function is computed based on VMAF instead of MSE. Also using perceptual visual quality to improve RDO, the methods proposed in [21,22] used a neural network to predict the QP value. In particular, the authors in [21] proposed a perceptual adaptive quantization based on a VGG-16 model for high-efficiency video coding (HEVC) for bitrate reduction while maintaining subjective visual quality. In [22], the proposed method used a CNN model to predict the visibility threshold for each image patch and then estimated the QP value based on this visibility threshold. However, both of these methods only focus on improving the rate-distortion (RD) performance of the encoder, while the stability of the video quality at the frame level is not considered. To overcome the drawbacks of the previous methods, we proposed a VMAF-based method to predict QP by using a CNN model in [23]. However, that method applies only to intra-mode encoding and to low-resolution video sequences. To develop a method that can be applied to a variety of video resolutions in both intra-mode and inter-mode, in this paper, we propose: (1) An estimation of the rate-quantization parameter and distortion-quantization parameter functions based on the VMAF metric instead of PSNR. (2) A CNN-based algorithm to estimate the QP value at the block level in order to achieve the target quality for the overall frame.

The rest of the paper is organized as follows. In Section 2, the background works on RDO modeling and perceptual RDO for quality consistency are introduced. Then, the framework of the proposed system is illustrated in Section 3. The experimental parameters and simulation results are presented in Section 4. Finally, Section 5 concludes the contributions of this work.

2. Background Works

In this section, we review the original RDO modeling adopted in video coding standards such as H.264/AVC and HEVC and the perceptual RDO models for video quality consistency.

2.1. RDO Modeling

Initially introduced in the H.264/AVC standard [24], the RDO model brings a significant RD performance improvement compared to the predecessor video codecs [25]. The RDO model helps the encoder to select an optimal mode among a large number of coding options. The target of RDO is to minimize the distortion for a given rate $R_{c}$ by appropriately selecting the coding parameters, namely,

$$\min D \quad \text{subject to} \quad R \leq R_{c} \tag{1}$$

where R and D are the bitrate and distortion computed for a coding unit, which may be a macroblock, a frame, or even a group of frames. To solve the above problem, Lagrangian optimization is used. Problem (1) is then converted to the following form:

$$\min J \quad \text{where} \quad J = D + \lambda \times R \tag{2}$$

where J is the Lagrange cost function and $\lambda$ is the Lagrange multiplier. When the RD curve is convex and both R and D are differentiable everywhere, the function J is minimized when its derivative equals zero, i.e.,

$$\frac{dJ}{dR} = \frac{dD}{dR} + \lambda = 0 \tag{3}$$
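To make the role of $\lambda$ in (2) concrete, the following is a minimal sketch of Lagrangian selection among candidate coding options. The `CodingOption` container, the option names, and all numbers are hypothetical illustrations and are not taken from any encoder or from this paper's experiments.

```python
from typing import NamedTuple, Sequence


class CodingOption(NamedTuple):
    """One candidate coding choice with its measured cost (hypothetical container)."""
    name: str
    distortion: float  # D, e.g. SSE of the reconstructed block
    rate: float        # R, bits needed to code the block


def lagrangian_cost(option: CodingOption, lam: float) -> float:
    """Evaluate J = D + lambda * R from Equation (2)."""
    return option.distortion + lam * option.rate


def select_option(options: Sequence[CodingOption], lam: float) -> CodingOption:
    """Return the candidate with the smallest Lagrangian cost J."""
    return min(options, key=lambda opt: lagrangian_cost(opt, lam))


# Illustrative numbers only (not taken from the paper).
candidates = [
    CodingOption("intra_16x16", distortion=120.0, rate=40.0),
    CodingOption("inter_16x16", distortion=150.0, rate=18.0),
    CodingOption("skip", distortion=400.0, rate=1.0),
]
best_low_lambda = select_option(candidates, lam=0.5)
best_high_lambda = select_option(candidates, lam=20.0)
```

With a small $\lambda$ the low-distortion option wins, while a large $\lambda$ penalizes rate and favors the cheapest option, which is the trade-off the multiplier is meant to control.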

In the RDO process, the Lagrange multiplier is used to balance the relationship between the distortion and the bitrate. In other words, the Lagrange multiplier helps the encoder to find out an R-D optimal point in the RD convex hull to minimize the Lagrange cost function. To compute the Lagrange multiplier, R-Q and D-Q functions are established. In [24], the rate distortion model is represented by:

$$R(D) = a \times \log_{2}\left(\frac{b}{D}\right) \tag{4}$$

where a and b are constants. The D-Q function is represented by:

$$D = \frac{QP^{2}}{3} \tag{5}$$

where $QP$ is the quantization parameter. By putting (4) and (5) into (3), $\lambda$ can be derived as:

$$\lambda = -\frac{dD}{dR} = c \times QP^{2} \tag{6}$$

where c is set to 0.136 in the H.264/AVC standard.
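The substitution step can be written out explicitly. The worked derivation below is a sketch that uses only (3), (4), and (5); the closed form of $c$ in terms of $a$ is inferred from those equations and is not a statement taken from [24].

```latex
\begin{aligned}
R(D) &= a \log_{2}\!\left(\frac{b}{D}\right)
  \;\Rightarrow\; \frac{dR}{dD} = -\frac{a}{D \ln 2}, \\
\lambda &= -\frac{dD}{dR} = \frac{D \ln 2}{a}
  = \frac{\ln 2}{3a}\, QP^{2} = c \times QP^{2},
  \qquad c = \frac{\ln 2}{3a}.
\end{aligned}
```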

Standard video encoders usually use objective distortion metrics such as PSNR or MSE to build the distortion model, although these metrics do not reflect human visual distortion well. In [26,27], the structural similarity index (SSIM) [28] is used to establish an adaptive Lagrange multiplier in RDO. Based on the observed relationship between the QP and SSIM values, a distortion model is derived as:

$$D_{SSIM} = 10^{-4} \times e^{\frac{QP + 11.804}{6.8625}} \tag{7}$$

The Lagrange multiplier is computed in [26] as:

$$\lambda = 2.39 \times e^{\frac{QP + 11.804}{6.8625}} \tag{8}$$

and in [27] as:

$$\lambda = \frac{10^{-7} \times 4.04}{\sigma_{sd} - 11.50} \times e^{\frac{QP + 11.804}{6.8625}} \tag{9}$$

in which $\sigma_{sd}$ is the standard deviation of the transformed residuals for one frame, computed as:

$$\sigma_{sd} = \sqrt{E(x^{2}) - [E(x)]^{2}} \tag{10}$$

where x is the DCT coefficient of the frame and E is the expectation.
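As a small illustration of (8)–(10), the sketch below evaluates both multipliers and the coefficient standard deviation. The function names are hypothetical, and the guard against $\sigma_{sd} = 11.50$ is an added assumption to avoid the division by zero in (9); it is not described in [27].

```python
import math
from typing import Sequence


def lambda_ssim_fixed(qp: float) -> float:
    """Lagrange multiplier of Equation (8)."""
    return 2.39 * math.exp((qp + 11.804) / 6.8625)


def coefficient_std(coeffs: Sequence[float]) -> float:
    """sigma_sd of Equation (10): sqrt(E[x^2] - (E[x])^2)."""
    n = len(coeffs)
    mean = sum(coeffs) / n
    mean_sq = sum(c * c for c in coeffs) / n
    # Clamp tiny negative values caused by floating-point round-off.
    return math.sqrt(max(mean_sq - mean * mean, 0.0))


def lambda_ssim_adaptive(qp: float, sigma_sd: float) -> float:
    """Lagrange multiplier of Equation (9); undefined when sigma_sd == 11.50."""
    if sigma_sd == 11.50:
        raise ValueError("Equation (9) has a pole at sigma_sd = 11.50")
    return (4.04e-7 / (sigma_sd - 11.50)) * math.exp((qp + 11.804) / 6.8625)
```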

Because VMAF is considered a better perceptual distortion metric than PSNR and SSIM, Ref. [29] proposed a method that uses VMAF to replace objective metrics in RDO. In particular, VMAF is estimated as a function of several visual factors, including brightness adaptability, texture complexity, contrast masking, and temporal masking. After that, the R-D cost function is computed based on the estimated VMAF. In [30], a CNN is used to estimate the perceptual distortion in terms of the VMAF score between the original frame and the reconstructed frame. However, because VMAF does not have a closed-form computational formula, these methods established an approximate relationship between VMAF and SSE in the RDO function [20,27,30]. To avoid computing VMAF via another objective score, in our method, the distortion function is estimated directly from the VMAF score, and the Lagrange multiplier is computed according to the new R-D model. Then, the optimal QP value in the Lagrange cost function is used as a label to train a CNN model aimed at replacing the RDO process.

2.2. Perceptual RDO for Video Quality Consistency

To maintain consistency in video coding, some previous methods have been proposed to minimize the variance of video quality among frames. Due to scene changes in consecutive frames, QP values are estimated at the frame level or MB level to control the distortion of each frame. However, intervening in the RDO process to recompute the QP value may affect the performance of the encoder. Specifically, the target of RDO is to estimate the QP that satisfies the optimal point of rate and distortion. Meanwhile, keeping consistency in video quality may require a QP value that is different from the QP value in RDO. Therefore, the problem here is to find an optimized QP value that achieves the two goals simultaneously: optimizing rate and distortion while still achieving the expected quality for the output video. In [11], to control quality at the frame level, a distortion-quality model is proposed to assign a suitable QP value to each frame. In particular, before coding the $k^{th}$ frame, an SSE value of the frame is estimated. Based on the proposed model, the encoder selects a suitable $QP_{k}^{*}$ value from a set of considered QP values to minimize the difference between the distortion and SSE as follows:

$$QP_{k}^{*} = \underset{QP_{k} \in Q}{\arg\min} \left\{ \left| D_{T} - D_{p}(QP_{k}) \right| \right\} \tag{11}$$

where $Q$ is the set of considered QP values, $D_{T}$ is the corresponding distortion target of the $k^{th}$ frame, and $D_{p}(QP_{k})$ is the frame-level distortion predicted by using $QP_{k}$.
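The selection rule in (11) is an exhaustive search over the candidate set. A minimal sketch follows; `toy_distortion_model` is a hypothetical stand-in for $D_{p}$, since the actual content-adaptive model of [11] is not reproduced here, and the QP range is chosen only for illustration.

```python
from typing import Callable, Iterable


def select_frame_qp(
    target_distortion: float,
    predicted_distortion: Callable[[int], float],
    candidate_qps: Iterable[int],
) -> int:
    """Equation (11): pick the QP whose predicted distortion is closest to D_T."""
    return min(
        candidate_qps,
        key=lambda qp: abs(target_distortion - predicted_distortion(qp)),
    )


def toy_distortion_model(qp: int) -> float:
    """Hypothetical stand-in for D_p(QP); the model of [11] is not reproduced here."""
    return qp * qp / 3.0


# Illustrative target only: the toy model reaches 300.0 exactly at QP = 30.
chosen_qp = select_frame_qp(300.0, toy_distortion_model, range(20, 46))
```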

Similar to [11], the algorithm in [9] also tries to minimize the difference between the estimated reconstruction quality and the target reconstruction quality. However, besides estimating the quantization step size, this method fine-tunes $\lambda$ in RDO for better quality consistency.

A common feature of the above methods is that they try to find QP values at the frame level. However, assigning a fixed QP value to the entire frame wastes bitrate in the coding process because macroblocks (MBs) with different contents may require different numbers of coded bits. In addition, in video quality assessment, some MBs in a frame may be less important than others. Thus, coding the MBs in a frame with different QP values helps improve the coding performance. In our proposed method, the CNN model is used to estimate the QP value for each MB in a frame. Moreover, RDO is integrated into the proposed CNN model to achieve the two goals stated above simultaneously: improving the performance of video coding while achieving the expected quality of the reconstructed video sequence.

3. Proposed Method

This section describes the learning adaptive quantization parameter (LAQP) method, in which a CNN model is used to predict QP values for the MBs in a video frame to achieve an expected VMAF score.

3.1. Overall Coding Framework

Figure 2 illustrates the encoding process of a video sequence to achieve an expected VMAF score for the reconstructed video sequence by using the trained CNN model to predict the QP value. Initially, the current frame of the video sequence is split into macroblocks with a size of 16 × 16. MBs, along with the classified VMAF score, are fed into the trained CNN model to estimate the optimal QP values. Then, these QP values are used by x.264 to encode the current MBs.

Figure 2. Framework of the proposed method.
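The first stage of the framework, splitting a frame into 16 × 16 macroblocks, can be sketched as follows. This is an illustrative helper, not the encoder's own partitioning code; it assumes a single-channel frame stored as a list of rows whose dimensions are multiples of the block size, and it does not handle padding.

```python
from typing import List, Tuple

Frame = List[List[int]]  # rows of luma samples


def split_into_macroblocks(frame: Frame, size: int = 16) -> List[Tuple[int, int, Frame]]:
    """Split a frame into size x size blocks in raster order.

    Returns (mb_row, mb_col, block) tuples. Assumes the frame dimensions are
    multiples of `size`; padding for other dimensions is not handled here.
    """
    height, width = len(frame), len(frame[0])
    if height % size or width % size:
        raise ValueError("frame dimensions must be multiples of the block size")
    blocks = []
    for top in range(0, height, size):
        for left in range(0, width, size):
            block = [row[left:left + size] for row in frame[top:top + size]]
            blocks.append((top // size, left // size, block))
    return blocks
```

Each returned block, paired with the expected VMAF score of its frame, corresponds to one input sample of the CNN model described in this section.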

To generate the trained CNN model, the proposed LAQP algorithm includes the following steps: generating data, labeling data, and training the model. Figure 3 illustrates the steps in the LAQP algorithm. In the first step, 15 original video sequences are encoded with different QP values. These are standard video sequences used widely in video coding [31]. Then, the quality of the decoded video sequences is measured with the VMAF metric. In the second step, based on the VMAF metric, each frame is classified into a specific quality level. At each quality level, an optimal QP map (as shown in Figure 4) is estimated by using the proposed VMAF-based rate-distortion model, and each QP value in the QP map is used as a label for an MB of the frame. In other words, this QP value is used to encode the MB to achieve the expected quality level for the reconstructed frame. In the third step, a CNN model is trained with inputs including an MB of the frame and the quality level corresponding to a label QP.

Figure 3. Steps in generating CNN model.

Figure 4. A sample of QP map.

3.2. VMAF-Based RDO Modeling

In LAQP, the CNN model is used to replace the RDO process in the x.264 encoder. To train the CNN model, a VMAF-based RDO model is proposed to estimate the optimal QP values, which are the labels for MBs. In previous perceptual video coding methods, D-Q and R-Q functions in the RDO model are built to compute the Lagrange multiplier and to estimate the optimal QP value that minimizes the Lagrange cost function. However, it is difficult to integrate a subjective metric into RDO because a specific formula for the subjective metric does not exist. Therefore, to build a perceptual-based RDO model, some studies proposed an objective function that approximates the subjective metric. Then, the D-Q and R-Q functions are derived via this approximate function [20,27,30]. In this work, to propose a perceptual-based RDO model, the D-Q and R-Q functions are built directly on the VMAF metric. In particular, based on the hypothesis that the distortion is inversely proportional to the quality, the distortion D of the frame is computed by the following equation:

$$D = \frac{1}{VMAF} \tag{12}$$

To derive the functions R and D in terms of QP, 10 video sequences with a length of 50 frames are encoded. After encoding, we obtain the fitting curves describing the distortion function and the rate function in terms of QP. Figure 5 shows the fitting curves of functions R and D for the “City” video sequence encoded in the Inter coding mode. The blue dots are the actual data, and the red line is the estimation function that fits the actual data. As shown in the figure, the fitting curves of the R-Q and D-Q functions are third-degree polynomials with R-squared values of 0.96 and 0.90, respectively. For the other sequences, the fitting curves are also third-degree polynomials, with the R-squared values shown in Table 1. The averages of the R-squared values of the 10 R-Q and D-Q fitting curves are 0.93 and 0.91, respectively. Based on the R-Q and D-Q fitting curves of the 10 video sequences, two general R-Q and D-Q functions for all video sequences are established, in which the coefficients of the polynomial are the average values of the coefficients of the 10 fitting curves. In particular, for the I frame, the R-Q and D-Q functions are cubic functions of QP, as shown in Table 2.

Figure 5. The fitting curve of “City” sequence for rate and distortion function.
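The curve-fitting step can be illustrated with a small, dependency-free sketch: distortion is obtained from VMAF through (12), a cubic is fitted by ordinary least squares, and the R-squared value is computed. This is not the fitting tool used in the experiments; the function names and sample measurements are hypothetical, the fit includes a constant term, and it is carried out in a scaled variable $t$ (an implementation choice for numerical stability), so the returned coefficients are expressed in $t$ rather than in QP.

```python
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float, float]  # (coeffs in t, center, scale)


def fit_cubic(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Least-squares cubic fit of ys against xs (needs at least 4 distinct xs).

    The fit uses t = (x - center) / scale to keep the normal equations well
    conditioned; coeffs[k] multiplies t**k.
    """
    center = (max(xs) + min(xs)) / 2.0
    scale = (max(xs) - min(xs)) / 2.0 or 1.0
    ts = [(x - center) / scale for x in xs]
    n = 4
    # Normal equations A c = b, with A[i][k] = sum t^(i+k) and b[i] = sum y * t^i.
    a = [[sum(t ** (i + k) for t in ts) for k in range(n)] for i in range(n)]
    b = [sum(y * t ** i for t, y in zip(ts, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs, center, scale


def predict(model: Model, x: float) -> float:
    """Evaluate the fitted cubic at x."""
    coeffs, center, scale = model
    t = (x - center) / scale
    return sum(c * t ** k for k, c in enumerate(coeffs))


def r_squared(xs: Sequence[float], ys: Sequence[float], model: Model) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Illustrative measurements only (not the paper's data): VMAF per QP.
qps = [20, 25, 30, 35, 40, 45]
vmafs = [96.0, 93.0, 88.0, 80.0, 70.0, 58.0]
distortions = [1.0 / v for v in vmafs]  # Equation (12)
d_model = fit_cubic(qps, distortions)
d_fit_quality = r_squared(qps, distortions, d_model)
```

Averaging the coefficients of several such per-sequence fits, as described above, would require all fits to be expressed in the same variable.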

Table 1. The R-squared of functions $R(QP)$ and $D(QP)$ of 10 video sequences.

Table 2. The coefficients of R-Q and D-Q functions.

With the average values of the coefficients, the R-Q and D-Q functions are established as follows:

$$R_{VMAF_I} = -1.49 \times QP^{3} + 185.3 \times QP^{2} - 7716 \times QP \tag{13}$$

$$D_{VMAF_I} = 3 \times 10^{-4} \times QP^{3} - 3 \times 10^{-2} \times QP^{2} + 0.82 \times QP \tag{14}$$

Based on (3), the new Lagrange multiplier is computed as follows:

$$\lambda_{VMAF_I} = -\frac{9 \times 10^{-4} \times QP^{2} - 6 \times 10^{-2} \times QP + 0.82}{-4.47 \times QP^{2} + 370.6 \times QP - 7716} \tag{15}$$
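Equation (15) follows from (3) through the chain rule, $\lambda = -\frac{dD}{dR} = -\frac{dD/dQP}{dR/dQP}$, applied to (13) and (14). The sketch below reproduces that step numerically for the I-frame coefficients; the helper names are hypothetical, and the zero-slope guard is an added assumption.

```python
from typing import List, Sequence

# Coefficients of Equations (13) and (14), lowest degree first: [c0, c1, c2, c3].
R_I = [0.0, -7716.0, 185.3, -1.49]
D_I = [0.0, 0.82, -3e-2, 3e-4]


def poly_derivative(coeffs: Sequence[float]) -> List[float]:
    """Differentiate a polynomial given lowest-degree-first coefficients."""
    return [k * c for k, c in enumerate(coeffs)][1:]


def poly_eval(coeffs: Sequence[float], x: float) -> float:
    """Evaluate the polynomial with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def lagrange_multiplier(
    d_coeffs: Sequence[float], r_coeffs: Sequence[float], qp: float
) -> float:
    """lambda = -(dD/dQP) / (dR/dQP), i.e. Equation (3) with the chain rule."""
    d_slope = poly_eval(poly_derivative(d_coeffs), qp)
    r_slope = poly_eval(poly_derivative(r_coeffs), qp)
    if r_slope == 0.0:
        raise ZeroDivisionError("dR/dQP vanishes at this QP")
    return -d_slope / r_slope
```

The same function applies to the P-frame and B-frame models by swapping in their coefficient lists.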

Similarly, for the P frame, the R-Q function, D-Q function, and Lagrange multiplier are computed as below:

$$R_{VMAF_P} = 0.1 \times QP^{3} - 9.55 \times QP^{2} + 268.44 \times QP \tag{16}$$

$$D_{VMAF_P} = 5.85 \times QP^{3} - 3 \times 10^{-3} \times QP^{2} + 0.03 \times QP \tag{17}$$

$$\lambda_{VMAF_P} = -\frac{17.55 \times QP^{2} - 6 \times 10^{-3} \times QP + 0.03}{0.3 \times QP^{2} - 19.1 \times QP + 288.44} \tag{18}$$

For the B frame:

$$R_{VMAF_B} = -0.22 \times QP^{3} + 26.08 \times QP^{2} - 1039 \times QP \tag{19}$$

$$D_{VMAF_B} = 7.57 \times QP^{3} - 6 \times 10^{-3} \times QP^{2} + 0.19 \times QP \tag{20}$$

$$\lambda_{VMAF_B} = -\frac{22.71 \times QP^{2} - 12 \times 10^{-3} \times QP + 0.19}{-0.66 \times QP^{2} + 52.16 \times QP - 1039} \tag{21}$$

Based on the above D-Q and R-Q functions, the minimum of the Lagrange cost function in (2) is computed to select the optimal QP values for macroblocks. However, in the proposed method, a CNN is used to predict the QP values instead of running the RDO process at encoding time. Therefore, the RDO process is used offline to generate the dataset for training the CNN. The dataset generation and the architecture of the CNN model are described in the following section.

3.3. CNN Model for QP Map Prediction

To achieve an expected VMAF score for all frames in a video sequence, a CNN model is proposed to predict the optimal QP value for each MB in a frame. The input of the CNN model includes an MB accompanied by the expected VMAF score of the current frame. The output of the CNN model is an optimal QP value for that MB. In this case, the proposed CNN model replaces the RDO process in estimating the optimal QP value. Besides optimizing the rate-distortion performance as in RDO, the second task of the CNN model is to estimate the QP values for MBs so that the frame achieves the target quality. Consequently, the proposed method can maintain consistent quality for the whole video sequence. To train the proposed CNN model, a dataset of MBs labeled with optimal QP values for each expected quality level is generated. The following sections describe the details of the data generation and the training of the CNN model.

3.3.1. Dataset Collecting and Labeling

The flow chart in Figure 6 describes the process of generating the labels $QPmap^{*}$, which include the QP values of each MB in a frame corresponding to each quality level. In the first step, a frame is encoded with different values of the constant rate factor (crf) from 20 to 45. After encoding and decoding, the quality of the reconstructed frame is measured with the VMAF metric and classified according to the quality level. Because of the similarity of consecutive VMAF scores, six consecutive VMAF scores are grouped into a quality level. To generate a dataset for training the model, 15 video sequences with resolutions of 352 × 280 and 1280 × 720 are encoded. Each video sequence includes 50 frames, and the group of pictures (GOP) configuration is IBBBPPBBBPP. After measuring the quality of the reconstructed frames, we observe that the VMAF values range from 55 to 100. Therefore, the VMAF scores are grouped into nine groups, as described in Figure 6. It is assumed that there are n values of VMAF in the $i^{th}$ quality level as follows:

$$VMAF_{i} = \{VMAF_{i1}, VMAF_{i2}, \dots, VMAF_{in}\} \tag{22}$$

Figure 6. Steps in dataset generation process.

In the second step, the value of the Lagrange cost function $J_{i}^{j}$ corresponding to the $j^{th}$ crf in the $i^{th}$ quality level is computed by the following equation:

$$J_{i}^{j} = D_{VMAF}^{j} + \lambda_{VMAF}^{j} \cdot R_{VMAF}^{j} \tag{23}$$

where $j \in \{20, 21, \dots, 45\}$ and $D_{VMAF}^{j}$, $\lambda_{VMAF}^{j}$, $R_{VMAF}^{j}$ are computed as shown in Equations (12)–(21), depending on the frame type (I, P, or B). In the third step, the minimum $J_{i}^{*}$ at the $i^{th}$ quality level is selected as follows:

$$J_{i}^{*} = \min_{j = \overline{1, n}} J_{i}^{j} \tag{24}$$

Finally, the $QPmap_{i}^{*}$ corresponding to the $j^{th}$ crf that attains this minimum at the $i^{th}$ quality level is considered as the optimal QP map for the current frame to achieve the $i^{th}$ quality level. The QP values in this $QPmap_{i}^{*}$ are used as the labels for MBs in the current frame.
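The second and third labeling steps amount to a per-level minimum search over the trial encodes. A minimal sketch follows, assuming each trial encode has already been assigned a quality level and its $D$, $\lambda$, and $R$ values; the `Encode` record layout and all numbers are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple


class Encode(NamedTuple):
    """Result of one trial encode of a frame (hypothetical record layout)."""
    crf: int
    level: int         # quality level index i assigned from the measured VMAF
    distortion: float  # D_VMAF for this encode
    lam: float         # lambda_VMAF for this encode
    rate: float        # R_VMAF for this encode


def cost(e: Encode) -> float:
    """Equation (23): J = D + lambda * R."""
    return e.distortion + e.lam * e.rate


def best_encode_per_level(encodes: Iterable[Encode]) -> Dict[int, Encode]:
    """Equation (24): keep, for each quality level, the encode with minimum J."""
    best: Dict[int, Encode] = {}
    for e in encodes:
        if e.level not in best or cost(e) < cost(best[e.level]):
            best[e.level] = e
    return best


# Illustrative values only (not from the paper).
trials = [
    Encode(crf=20, level=8, distortion=0.0102, lam=0.001, rate=900.0),
    Encode(crf=21, level=8, distortion=0.0104, lam=0.001, rate=700.0),
    Encode(crf=30, level=6, distortion=0.0120, lam=0.002, rate=300.0),
    Encode(crf=31, level=6, distortion=0.0123, lam=0.002, rate=310.0),
]
labels = best_encode_per_level(trials)
```

The QP map recorded for the winning encode of each level would then supply the per-MB labels for that level.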

3.3.2. Training CNN Model

The QP map for a frame with an expected VMAF score is predicted by the proposed CNN model, as illustrated in Figure 7. The proposed CNN architecture is inspired by the VGG-16 model [32]. However, in this case, the input of the proposed model is a 16 × 16 macroblock instead of a large image as in the original VGG-16 model. Therefore, the number of layers of VGG-16 is reduced to six. In the training strategy, we first set the number of layers and kernels to a high value. Then, the number of layers and kernels is reduced until the training loss and validation loss converge to a minimum value. After that, the model is tuned by changing the hyperparameters of the convolution layers. Finally, the optimal model is obtained with the Adam optimizer, a learning rate of 0.0001, the ReLU activation function, and the MAE loss function. The configuration of the model with the highest accuracy is shown in Table 3.

Figure 7. The architecture of the proposed CNN model.

Table 3. The configuration of the proposed CNN model.

Preprocessing layers: The pixels of the 16 × 16 input MB are preprocessed by converting them into grayscale and then normalizing them to values between 0 and 1.
Convolutional layers: The output of the preprocessing layers is convolved with 4 × 4 kernels at the first convolutional layer and with 2 × 2 kernels to extract higher-level features. In addition, a batch normalization layer is used to normalize the feature map in order to stabilize the learning process and reduce the number of epochs. After the convolutional layers, a pooling layer is added to reduce the size of each feature map. Moreover, a dropout layer is used to drop features randomly with a probability of 20%.
Fully connected layers: The feature maps at the output of the convolutional layers are concatenated and then flattened into a column vector. Then, the column vector is fed to three fully connected layers that combine the extracted features to form the final output, the QP value. Because the target VMAF score is a requirement for the reconstructed output video, the target VMAF score is appended as an external feature to the feature vector for the fully connected layers.

In the proposed model, the Mean Absolute Error (MAE) is used to measure the accuracy and is computed by the following equation:

$$MAE = \frac{1}{n} \sum_{j = 1}^{n} \left| y_{j} - \hat{y}_{j} \right| \tag{25}$$

where $\hat{y}_{j}$ is the estimated QP value, $y_{j}$ is the ground truth QP value of the $j^{th}$ MB, and n is the number of MBs. The minimum MAE that the proposed model achieved is 1.26 after 100 epochs.
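For reference, (25) can be computed as in the short sketch below. The function name and the sample QP values are illustrative only and are unrelated to the reported MAE of 1.26.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Equation (25): average absolute difference between label and predicted QP."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative QP values only.
example_mae = mean_absolute_error([30, 32, 28, 35], [31, 30, 28, 36])
```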

4. Performance Evaluation

4.1. Test Methodology

In the test methodology, the compression performance of the proposed method is compared with the standard video codec x.264 [24] and a relevant method, content adaptive distortion–quantization (CADQ), proposed in [11], in terms of BD-VMAF and BD-Rate [33]. In this work, the practical x.264 video coding reference software was selected due to its low complexity and widespread use. It should be noted that the proposed method can be integrated into x.265 [34] or VVC [35] in future work. The BD-VMAF metric is used to evaluate the effectiveness of the proposed algorithm in controlling the quality level, while the BD-Rate reflects the performance of the proposed method in saving bitrate when compared with the other methods. Six popular video test sequences with resolutions of 352 × 280 and 1280 × 720 are used and encoded with four crf values: 29, 32, 35, and 37.

In addition, the quality level expectation and quality consistency of three methods are also evaluated. The quality level expectation reflects the ability of methods in achieving the quality level as expected, while the quality consistency measures the smoothness of the quality between frames in a video sequence.

The overall testing process is shown in Figure 8. In the first step, the video test sequence is encoded with a standard video codec, i.e., x.264. Let the quality of the reconstructed video sequence measured in the VMAF metric and in the PSNR metric be VMAF_x.264 and PSNR_x.264, respectively, and let the bitrate of the encoding process be BR_x.264 bps. In the second step, the video test sequence is fed into the CNN model, accompanied by VMAF_x.264, to predict the QP map. In this case, VMAF_x.264 is used as the expected VMAF score for the CNN model. Then, the predicted QP map is applied to the video encoder to encode the frames of the video test sequence. Let the quality score of the reconstructed video sequence in the second step be VMAF_LAQP and the bitrate be BR_LAQP.

Figure 8. Architecture of test methodology for the proposed method.

Similarly, PSNR_x.264 is considered as the expected quality level for the encoder when using CADQ to encode the video sequence. The quality score of the reconstructed video sequence is VMAF_CADQ, and the bitrate is BR_CADQ. Finally, the parameters, including the VMAF score and bitrate of three reconstructed video sequences, are compared to evaluate the effectiveness of the methods.

4.2. RD Performance Evaluation and Discussion

The BD-Rate and BD-VMAF comparison between the methods is shown in Table 4. As shown in the results, when compared to x.264 and CADQ, the proposed method can save bitrate up to 3.36% and 10.03%, respectively. Meanwhile, the VMAF score of the proposed method gains 1.59% and 2.16%. The LAQP method can achieve lower bitrate and higher quality because the R-Q and D-Q functions of LAQP are based on VMAF and the quality of the reconstructed video sequences in this experiment is measured in VMAF instead of MSE. In other words, the R-Q and D-Q functions of LAQP reflect the rate-quantization and distortion-quantization relationships more precisely than those of x.264 and CADQ, which use the MSE metric to generate the R-Q and D-Q functions. Consequently, the RDO process of LAQP can estimate the QP value more effectively than the others; therefore, the LAQP method can achieve higher RD performance than x.264 and CADQ.

Table 4. BD-Rate and BD-VMAF comparison.

4.3. Quality Level Expectation Assessment

The quality levels of 24 reconstructed sequences (six video sequences × four cases of crf) in the three methods are shown in Figure 9, in which the quality level of the x.264 codec is considered as the expected quality level for the other two methods. As shown in the figure, the quality of the reconstructed video in the proposed method is the same as the expected level in almost all cases, except “Coastguard” and “Crew” with crf 32, and “Silent” with crf 35. Meanwhile, in CADQ, the output quality level differs from the expected quality in the cases of “Coastguard” and “Crew” with crf 32; “Silent”, “Tempete”, “Crew”, and “Vydio3” with crf 35; and “Coastguard”, “Crew”, and “Vydio3” with crf 37.

Figure 9. Comparison of quality level between methods.

The proposed LAQP method can achieve the expected quality more precisely than CADQ because CADQ estimates QP according to the PSNR variation between the current frame and the previous frames. However, as stated above, PSNR does not correlate with human perception as well as VMAF does. Consequently, the quality of the output video may not reach the expected level. Meanwhile, LAQP uses the CNN model to estimate QP according to the content of the MB and the expected quality level. Therefore, LAQP achieves the expected quality level in terms of the VMAF metric better than CADQ.

4.4. Quality Consistency Evaluation

Besides achieving the expected quality, the smoothness of quality between frames in a sequence is also considered. The smoothness is measured by the variance of the VMAF scores of the frames in a video sequence. Table 5 shows the quality variance of the six output video sequences for the x.264 codec, the CADQ method, and the proposed method, while Figure 10 plots the quality curves of the “Coastguard” sequence for the three methods. As shown in the results, the average quality variance of x.264 is 8.19, while those of CADQ and the proposed method are 6.07 and 4.02, respectively. Notably, for crf 29 and 32, the quality variance of the “Silent” sequence in the proposed method is 0.
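As an illustration of this smoothness measure, the sketch below computes the variance of a list of per-frame VMAF scores. The function name `quality_variance` and the sample scores are hypothetical, and the use of the population variance (rather than the sample variance) is an assumption, since the text does not specify which one is used.

```python
import numpy as np


def quality_variance(vmaf_per_frame):
    """Variance of per-frame VMAF scores; lower means smoother quality."""
    scores = np.asarray(vmaf_per_frame, dtype=float)
    # Assumption: population variance (ddof=0); the paper does not say which.
    return float(np.var(scores))


# Hypothetical per-frame scores, not measured data.
steady = [90.0, 90.0, 90.0, 90.0]
fluctuating = [88.0, 92.0, 88.0, 92.0]
var_steady = quality_variance(steady)            # constant quality
var_fluctuating = quality_variance(fluctuating)  # quality swings by +/- 2
```

A sequence whose frames all receive the same VMAF score has zero variance, which is how an entry of 0.00 in Table 5 can be read.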

Table 5. Quality variance comparison between methods.

Figure 10. Comparison of quality level between methods.

Figure 11 gives a visual illustration of the quality of the reconstructed frames of the three methods for the “Coastguard” sequence. As shown in the figure, the quality levels of the frames in x.264 and CADQ are not as smooth as those in LAQP. These quality levels are also consistent with the curves depicted in Figure 10.

Figure 11. Reconstructed frames in three methods: (a) x.264, (b) CADQ, and (c) LAQP.

The quality consistency of LAQP is higher than that of x.264 because the RDO process of x.264 focuses only on optimizing rate and distortion, while the CNN model of LAQP predicts a QP map that optimizes rate and distortion at an expected quality level. In CADQ, the quality of the reconstructed video sequence fluctuates more because this method controls quality with the PSNR metric instead of VMAF. Consequently, the quality variation of CADQ in terms of VMAF is higher than that of LAQP.

4.5. Discussion

As shown in the performance assessment, x.264 video coding with the proposed LAQP method achieves promising quality consistency for reconstructed videos at a reduced bitrate. This feature is important in many real-world scenarios, including streaming media services, IoT multimedia communications, and video broadcasting, where both the quality fluctuation of videos and the compression rate are of utmost importance. In fact, quality consistency is also an important factor in maintaining the quality of experience (QoE) of video viewing in many other applications.

However, the strong dependence of the proposed model on the training data remains a problem to be solved in the future. In fact, the parameters of the current model are obtained from a set of training videos that may not fully cover the visual characteristics of video scenes or the range of coding methodologies. Hence, a method for quality-consistent video coding that does not depend on training data would be a promising direction for future research.

5. Conclusions

In this paper, a CNN-based method is proposed to estimate the QP value for video coding to achieve an expected VMAF score. The inputs of the CNN model include a macroblock of the current frame and the expected VMAF score of that frame at the decoder output. The output of the CNN model is an estimated QP value for that macroblock. The experimental results show that, for an expected quality level, the proposed method saves up to 5.82% of the bitrate and improves the quality by up to 4.28% when compared to the conventional x.264 codec. In addition, the proposed method saves up to 27.65% of the bitrate and improves the quality by up to 4.11% when compared with the relevant method in [11]. Besides improving RD performance, the proposed method also achieves smoother quality than the other methods.

In this work, the proposed method is implemented in the x.264 codec, which does not support high-resolution video such as 2K or 4K video. In future work, we will consider integrating the proposed method into an up-to-date standard video codec such as H.266/VVC. In addition, the VMAF metric will be replaced by a more advanced perceptual quality metric. With these changes, the proposed method can be deployed in applications that deliver high-resolution video content with stable quality.

Author Contributions

Funding acquisition, T.H.V.; conceptualization, T.H.V. and X.H.; methodology, T.H.V. and X.H.; project administration, T.H.V.; software, M.N.D. and S.Q.N.; validation, H.P. and T.S.; visualization, H.P. and T.S.; writing—original draft, M.N.D. and S.Q.N.; and writing—review and editing, T.H.V. and X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by the Research Collaboration Project between PTIT and Naver Corp. under grant number 01-PTIT-NAVER-2022.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available on request.

Acknowledgments

This work has been supported by the Research Collaboration Project between PTIT and Naver Corp.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Brunnström, K.; Beker, S.A.; de Moor, K.; Dooms, A.; Egger, S.; Garcia, M.-N.; Hossfeld, T.; Jumisko-Pyykkö, S.; Keimel, C.; Larabi, M.-C.; et al. Qualinet White Paper on Definitions of Quality of Experience. 2013. hal-00977812. Available online: https://hal.science/hal-00977812/document (accessed on 10 August 2023).
2. Hoßfeld, T.; Seufert, M.; Sieber, C.; Zinner, T. Assessing effect sizes of influence factors towards a QoE model for HTTP adaptive streaming. In Proceedings of the 2014 Sixth International Workshop on Quality of Multimedia Experience (QoMEX), Singapore, 18–20 September 2014; pp. 111–116.
3. Chen, X.; Hwang, J.N.; Meng, D.; Lee, K.H.; Queiroz, R.L.D.; Yeh, F.M. A quality-of-content-based joint source and channel coding for human detections in a mobile surveillance cloud. IEEE Trans. Circuits Syst. Video Technol. 2017, 27, 19–31.
4. Milani, S.; Bernardini, R.; Rinaldo, R. A saliency-based rate control for people detection in video. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 2016–2020.
5. He, Z.; Zeng, W.; Chen, C.W. Low-pass filtering of rate-distortion functions for quality smoothing in real-time video communication. IEEE Trans. Circuits Syst. Video Technol. 2005, 15, 973–981.
6. Xie, B.; Zeng, W. A sequence-based rate control framework for consistent quality real-time video. IEEE Trans. Circuits Syst. Video Technol. 2006, 16, 56–71.
7. Xu, L.; Li, S.; Ngan, K.N.; Ma, L. Consistent visual quality control in video coding. IEEE Trans. Circuits Syst. Video Technol. 2013, 23, 975–989.
8. Trieu Duong, D.; Phi Cong, H.; Hoang Van, X. A Novel Consistent Quality Driven for JEM Based Distributed Video Coding. Algorithms 2019, 12, 130.
9. Cai, Q.; Chen, Z.; Wu, D.O.; Huang, B. Real-time constant objective quality video coding strategy in high efficiency video coding. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2215–2228.
10. Seo, C.W.; Moon, J.H.; Han, J.K. Rate control for consistent objective quality in high efficiency video coding. IEEE Trans. Image Process. 2013, 22, 2442–2454.
11. Wu, C.-Y.; Su, P.-C. A Content-Adaptive Distortion–Quantization Model for H.264/AVC and its Applications. IEEE Trans. Circuits Syst. Video Technol. 2014, 24, 113–126.
12. Vito, F.D.; Martin, J.C.D. PSNR control for GOP-level constant quality in H.264 video coding. In Proceedings of the Fifth IEEE International Symposium on Signal Processing and Information Technology, Athens, Greece, 18–21 December 2005; pp. 612–617.
13. Li, Z.; Aaron, A.; Katsavounidis, I.; Moorthy, A.; Manohara, M. Toward a Practical Perceptual Video Quality Metric. Netflix Blog. 2016. Available online: http://techblog.netflix.com/2016/06/toward-practical-perceptual-video.html (accessed on 11 August 2023).
14. Sheikh, H.R.; Bovik, A.C. Image information and visual quality. IEEE Trans. Image Process. 2006, 15, 430–444.
15. Li, S.; Zhang, F.; Ma, L.; Ngan, K.N. Image quality assessment by separately evaluating detail losses and additive impairments. IEEE Trans. Multimed. 2011, 13, 935–949.
16. Rassool, R. VMAF reproducibility: Validating a perceptual practical video quality metric. In Proceedings of the 2017 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), Cagliari, Italy, 7–9 June 2017; pp. 1–2.
17. Lee, C.; Woo, S.; Baek, S.; Han, J.; Chae, J.; Rim, J. Comparison of objective quality models for adaptive bit-streaming services. In Proceedings of the 2017 8th International Conference on Information, Intelligence, Systems & Applications (IISA), Larnaca, Cyprus, 27–30 August 2017; pp. 1–4.
18. Barman, N.; Schmidt, S.; Zadtootaghaj, S.; Martini, M.G.; Möller, S. An evaluation of video quality assessment metrics for passive gaming video streaming. In Proceedings of the 23rd Packet Video Workshop, Amsterdam, The Netherlands, 12–15 June 2018; pp. 7–12.
19. Deng, S.; Han, J.; Xu, Y. VMAF Based Rate-Distortion Optimization for Video Coding. In Proceedings of the 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), Tampere, Finland, 21–24 September 2020; pp. 1–6.
20. Luo, Z.; Zhu, C.; Huang, Y.; Xie, R.; Song, L.; Kuo, C.-C.J. VMAF Oriented Perceptual Coding Based on Piecewise Metric Coupling. IEEE Trans. Image Process. 2021, 30, 5109–5121.
21. Marzuki, I.; Sim, D. Perceptual adaptive quantization parameter selection using deep convolutional features for HEVC encoder. IEEE Access 2020, 8, 37052–37065.
22. Alam, M.M.; Nguyen, T.D.; Hagan, M.T.; Chandler, D.M. A perceptual quantization strategy for HEVC based on a convolutional neural network trained on natural images. Appl. Digit. Image Process. 2015, 9599, 959918.
23. Vu, T.H.; Cong, H.P.; Sisouvong, T.; HoangVan, X.; NguyenQuang, S.; DoNgoc, M. VMAF based quantization parameter prediction model for low resolution video coding. In Proceedings of the 2022 International Conference on Advanced Technologies for Communications (ATC), Ha Noi, Vietnam, 20–22 October 2022; pp. 364–368.
24. Wiegand, T.; Sullivan, G.; Luthra, A. Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC). 2003, pp. 7–14. Available online: http://ip.hhi.de/imagecom_G1/assets/pdfs/JVT-G050.pdf (accessed on 11 August 2023).
25. Sullivan, G.J.; Wiegand, T. Rate-distortion optimization for video compression. IEEE Signal Process. Mag. 1998, 15, 74–90.
26. Yang, C.-L.; Leung, R.-K.; Po, L.-M.; Mai, Z.-Y. An SSIM-optimal H.264/AVC inter frame encoder. In Proceedings of the 2009 IEEE International Conference on Intelligent Computing and Intelligent Systems, Shanghai, China, 20–22 November 2009; pp. 291–295.
27. Wang, X.; Su, L.; Huang, Q.; Liu, C. Visual perception based Lagrangian rate distortion optimization for video coding. In Proceedings of the 2011 18th IEEE International Conference on Image Processing, Brussels, Belgium, 11–14 September 2011; pp. 1653–1656.
28. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
29. Tong, X.; Zhu, C.; Xie, R.; Xiong, J.; Song, L. A VMAF Directed Perceptual Rate Distortion Optimization for Video Coding. In Proceedings of the 2020 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), Paris, France, 27–29 October 2020; pp. 1–5.
30. Zhu, C.; Huang, Y.; Xie, R.; Song, L. HEVC VMAF-oriented Perceptual Rate Distortion Optimization using CNN. In Proceedings of the 2021 Picture Coding Symposium (PCS), Bristol, UK, 29 June–2 July 2021; pp. 1–5.
31. Xiph.org. Xiph.org Video Test Media. 2017. Available online: https://media.xiph.org/video/derf/ (accessed on 10 September 2023).
32. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; pp. 1–14.
33. Bjontegaard, G. Calculation of Average PSNR Differences between RD-Curves. 2001. Available online: https://api.semanticscholar.org/CorpusID:61598325 (accessed on 10 September 2023).
34. x265 Documentation. Available online: https://x265.readthedocs.io/en/master/ (accessed on 10 September 2023).
35. ISO/IEC 23090-3; Versatile Video Coding. ISO: Geneva, Switzerland, 2020.

Figure 2. Framework of the proposed method.

Figure 3. Steps in generating CNN model.

Figure 4. A sample of QP map.

Figure 5. The fitting curve of “City” sequence for rate and distortion function.

Figure 6. Steps in dataset generation process.

Figure 7. The architecture of the proposed CNN model.

Figure 8. Architecture of test methodology for the proposed method.


Table 1. The R-squared of functions $R(QP)$ and $D(QP)$ of 10 video sequences.

Video Sequences	R-Squared of $R (QP)$	R-Squared of $D (QP)$
Hall	0.95	0.93
City	0.96	0.90
Foreman	0.90	0.90
Crew	0.84	0.94
Four-people	0.92	0.94
Ice	0.94	0.91
Kris	0.89	0.91
Mobile	0.99	0.88
Soccer	0.91	0.97
Waterfall	0.98	0.83
Average	0.93	0.91
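The R-squared values in Table 1 measure how well a fitted function explains the measured points. The sketch below shows how such a value could be computed for a cubic polynomial in QP, the form used in Table 2. The function name `cubic_fit_r2` and the sample data are hypothetical, and the least-squares fit via `numpy.polyfit` is an assumption about the fitting procedure rather than a description of the authors' tooling.

```python
import numpy as np


def cubic_fit_r2(qp, values):
    """Fit values ~ cubic(QP) by least squares; return (coefficients, R^2)."""
    qp = np.asarray(qp, dtype=float)
    values = np.asarray(values, dtype=float)

    # Coefficients are ordered from the QP^3 term down to the constant term.
    coeffs = np.polyfit(qp, values, 3)
    predicted = np.polyval(coeffs, qp)

    ss_res = np.sum((values - predicted) ** 2)      # unexplained variation
    ss_tot = np.sum((values - values.mean()) ** 2)  # total variation
    return coeffs, float(1.0 - ss_res / ss_tot)


# Hypothetical measurements: an exact cubic plus a small deterministic ripple.
qp_values = np.arange(20, 45, dtype=float)
exact = 0.5 * qp_values**3 - 2.0 * qp_values**2 + 3.0 * qp_values + 7.0
rippled = exact + 50.0 * np.sin(qp_values)

_, r2_exact = cubic_fit_r2(qp_values, exact)      # essentially 1
_, r2_rippled = cubic_fit_r2(qp_values, rippled)  # slightly below 1
```

An R-squared close to 1, as in most rows of Table 1, indicates that the cubic form captures nearly all of the variation of rate or distortion with QP for that sequence.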

Table 2. The coefficients of R-Q and D-Q functions.

Video	$R_{VMAF_I} = - a_{1} \cdot {QP}^{3} + b_{1} \cdot {QP}^{2} + c_{1} \cdot QP + d_{1}$				$D_{VMAF_I} = - a_{2} \cdot {QP}^{3} + b_{2} \cdot {QP}^{2} + c_{2} \cdot QP + d_{2}$
Video	a1	b1	c1	d1	a2	b2	c2	d2
Hall	−0.78	102.06	−4632.9	74,387	0.0001	−0.008	0.2214	−0.9966
City	−0.51	94.62	−5625.4	108,880	0.0006	−0.054	1.595	−14.522
Foreman	−0.87	118.40	−5492.3	87,700	0.0002	−0.025	1.193	−2.5509
Crew	−3.29	416.98	−17,942	26,502	0.0002	−0.02	1.179	−4.4885
Four-people	−0.86	175.53	−6968	440,751	0.0002	−0.014	1.106	−2.7634
Ice	−3.53	308.97	−5563	185,914	0.0002	−0.014	0.277	−2.4193
Kris	−2.81	187.33	−2901.1	342,303	0.0001	−0.014	0.344	−2.2143
Mobile	−0.60	102.02	−6010.9	141,990	0.0003	−0.03	0.6236	−8.5126
Soccer	−0.81	206.00	−14,340	30,425	0.0003	−0.034	0.6596	−5.1124
Waterfall	−0.84	140.77	−5980.7	140,316	0.0004	−0.036	1.0176	−8.4162
Average	−1.49	185.3	−7716.02	157,916.8	0.0003	−0.03	0.8216	−5.1996

Table 3. The configuration of the proposed CNN model.

Layer (Type)	Output Size	Number of Parameters	Activation Function
Convolution 1	16 × 16 × 32	544	Relu
Batch Normalization 1	16 × 16 × 32	128
Convolution 2	16 × 16 × 32	16,416	Relu
Batch Normalization 2	16 × 16 × 32	128
Max Pooling 1	8 × 8 × 32	0
Convolution 3	8 × 8 × 64	8256	Relu
Batch Normalization 3	8 × 8 × 64	256
Convolution 4	8 × 8 × 64	16,448	Relu
Batch Normalization 4	8 × 8 × 64	256
Max Pooling 2	4 × 4 × 64	0
Convolution 5	4 × 4 × 128	32,896	Relu
Batch Normalization 5	4 × 4 × 128	512
Convolution 6	4 × 4 × 128	65,664	Relu
Batch Normalization 6	4 × 4 × 128	512
Max Pooling 3	2 × 2 × 128	0
Fully connected 1	1024	526,336	Relu
Fully connected 2	512	524,800	Relu
Fully connected 3	192	98,496	Relu
Linear 4	1	193	Linear
Total		1,291,841
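The parameter counts in Table 3 can be cross-checked with a short calculation. The sketch below is one configuration that reproduces every row; the kernel sizes (4 × 4 for the first two convolutions, 2 × 2 for the rest), the single-channel input, the four parameters per batch-normalization channel, and the extra input to the first fully connected layer (interpreted here as the expected VMAF score) are all inferred from the counts and are assumptions, not details confirmed by the text.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus one bias per output channel of a square 2D convolution."""
    return (kernel * kernel * in_ch + 1) * out_ch


def bn_params(channels):
    """Scale, shift, running mean, and running variance per channel."""
    return 4 * channels


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit of a fully connected layer."""
    return (n_in + 1) * n_out


# Assumed kernel sizes and input shapes, inferred from the counts in Table 3.
layers = [
    ("Convolution 1", conv_params(4, 1, 32)),
    ("Batch Normalization 1", bn_params(32)),
    ("Convolution 2", conv_params(4, 32, 32)),
    ("Batch Normalization 2", bn_params(32)),
    ("Convolution 3", conv_params(2, 32, 64)),
    ("Batch Normalization 3", bn_params(64)),
    ("Convolution 4", conv_params(2, 64, 64)),
    ("Batch Normalization 4", bn_params(64)),
    ("Convolution 5", conv_params(2, 64, 128)),
    ("Batch Normalization 5", bn_params(128)),
    ("Convolution 6", conv_params(2, 128, 128)),
    ("Batch Normalization 6", bn_params(128)),
    # 2 x 2 x 128 flattened features plus one assumed scalar input.
    ("Fully connected 1", dense_params(2 * 2 * 128 + 1, 1024)),
    ("Fully connected 2", dense_params(1024, 512)),
    ("Fully connected 3", dense_params(512, 192)),
    ("Linear 4", dense_params(192, 1)),
]

total_params = sum(count for _, count in layers)
for name, count in layers:
    print(f"{name}: {count}")
print("Total:", total_params)
```

Max-pooling layers are omitted from the list because they contribute no parameters.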

Table 4. BD-Rate and BD-VMAF comparison.

Video-Sequence	crf	x.264 codec		CADQ		Our LAQP		LAQP vs. x.264		LAQP vs. CADQ
Video-Sequence	crf	BR_x.264	VMAF_x.264	BR_CADQ	VMAF_CADQ	BR_LAQP	VMAF_LAQP	BD-Rate	BD-VMAF	BD-Rate	BD-VMAF
Coastguard 352 × 288	29	339.26	96	469.73	92	524.46	100	−2.15	0.53	−16.21	3.55
	32	241.02	84	280.23	85	275.41	90
	35	110.28	73	153.42	76	130.91	75
	37	75.22	65	102.87	69	80.09	64
Container 352 × 288	29	99.51	100	142.14	100	98.56	100	1.28	0.72	−27.65	4.11
	32	63.21	99	81.15	96	63.6	99
	35	43.56	93	50.5	92	44.85	95
	37	34.9	87	39.19	87	35.52	89
Silent 352 × 288	29	131.84	100	143.89	100	107.64	100	−4.08	4.28	−1.60	2.20
	32	91.84	98	94.58	98	85.24	100
	35	63.91	88	60.5	91	61.45	93
	37	50.22	80	46.85	84	45.23	85
Tempete 352 × 288	29	283.87	98	382.01	98	306.59	100	−4.74	0.89	−7.88	1.57
	32	187.94	91	217.23	92	217.2	95
	35	126.62	80	123.33	80	118.57	79
	37	98.92	72	88.69	72	83.34	71
Crew 1280 × 720	29	502.26	97	518.7	98	537.59	97	−4.64	1.44	−3.37	1.09
	32	348.33	88	313.61	95	376.43	91
	35	245.98	77	254.21	81	249.18	80
	37	194.04	67	195.07	71	185.69	69
Vidyo3 1280 × 720	29	512.69	100	499.98	100	495.39	100	−5.82	1.65	−3.51	0.41
	32	362.5	97	398.7	99	372.61	98
	35	255.7	88	253.45	90	241.93	90
	37	201.24	80	204.13	80	205.69	80
Average								−3.36	1.59	−10.03	2.16

Table 5. Quality variance comparison between methods.

Video Sequence	crf	x.264	CADQ	LAQP
Coastguard	29	5.57	6.68	0.59
	32	4.98	6.18	4.83
	35	5.41	5.34	5.10
	37	4.24	5.81	4.51
Container	29	0.08	0.15	0.11
	32	2.83	0.47	1.11
	35	4.49	0.88	1.41
	37	5.05	0.99	0.89
Silent	29	0.11	0.74	0.00
	32	4.94	1.09	0.00
	35	12.94	1.59	1.34
	37	13.10	2.30	2.12
Tempete	29	2.94	1.58	0.3
	32	6.93	3.57	1.59
	35	5.84	4.71	3.67
	37	6.22	5.43	3.09
Crew	29	10.15	15.67	10.60
	32	20.53	18.73	12.04
	35	26.30	23.73	15.32
	37	37.39	26.90	18.89
Vidyo3	29	1.24	1.53	1.19
	32	2.99	2.92	2.44
	35	7.23	5.21	2.49
	37	5.12	3.56	2.76
Average		8.19	6.07	4.02

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
