Article

Video Multi-Scale-Based End-to-End Rate Control in Deep Contextual Video Compression

School of Urban Railway Transportation, Shanghai University of Engineering Science, Shanghai 201620, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5573; https://doi.org/10.3390/app14135573
Submission received: 19 May 2024 / Revised: 22 June 2024 / Accepted: 25 June 2024 / Published: 26 June 2024

Abstract: In recent years, video data have increased in size, which results in enormous transmission pressure. Rate control plays an important role in stabilizing video stream transmissions by balancing the rate and distortion of video compression. To achieve high-quality videos through low-bandwidth transmission, video multi-scale-based end-to-end rate control is proposed. First, to reduce video data, the original video is processed using multi-scale bicubic downsampling as the input. Then, the end-to-end rate control model is implemented. By fully using the temporal coding correlation, a two-branch residual-based network and a two-branch regression-based network are designed to obtain the optimal bit rate ratio and Lagrange multiplier λ for rate control. For restoring high-resolution videos, a hybrid efficient distillation SISR network (HEDS-Net) is designed to build low-resolution and high-resolution feature dependencies, in which a multi-branch distillation network, a lightweight attention LCA block, and an upsampling network are used to transmit deep extracted frame features, enhance feature expression, and improve image detail restoration abilities, respectively. The experimental results show that the PSNR and SSIM BD rates of the proposed multi-scale-based end-to-end rate control are −1.24% and −0.50%, respectively, with 1.82% rate control accuracy.

1. Introduction

In recent years, in order to achieve high-quality, high-resolution video storage and transmission, the Video Coding Experts Group (VCEG) of the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) and the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) have released a series of international video coding standards: the H.26x standards and MPEG-x. ITU-T formulated H.261, H.262, H.263, H.263+, and H.264, which mainly target real-time video communication, while ISO/IEC formulated MPEG-1, MPEG-2, and MPEG-4, which mainly target video storage and video streaming applications. In addition, ITU-T and ISO/IEC established a joint working group to formulate MPEG-2/H.262, H.264/advanced video coding (AVC), H.265/high-efficiency video coding (HEVC), and the newest H.266/versatile video coding (VVC). Rate control is a critical part of every video coding standard, especially in bandwidth-limited tasks such as live broadcasting. Many excellent rate control algorithms have been proposed for the different video coding standards: the Variable Bit Rate (VBR) model [1] for MPEG-1; Test Model 5 (TM5) [2] for MPEG-2, which consists of bit allocation, rate control, and modulation; scalable rate control (VM8) [3] for MPEG-4, which uses a second-order rate distortion (RD) model; Reference Model 8 (RM8) [4] for H.261; Test Model Near-term 8 (TMN8) [5] for H.263; and JVT-F086 [6] for H.264/AVC, all of which adopt a hybrid coding framework. To further improve rate control performance, Unified Rate Quantization (URQ) [7] was first developed for H.264/AVC, in which the Quantization Parameter (QP) is regarded as the critical factor in determining the bit rate; the URQ model was then carried over to H.265/HEVC. However, Li et al. [8] suggest that the bit rate is determined by QP only when all coding parameters other than QP are not too flexible, which indicates that the URQ model cannot work for increasingly flexible coding schemes. Therefore, the Lagrange multiplier λ, which is derived from RD optimization to balance the bit rate and coding distortion, was adopted as the most critical factor for controlling the bit rate, and the R-λ rate control model [9] was proposed. By considering the distortion effect in the temporal domain, Yang et al. [10] provided a coding distortion propagation equation for calculating the optimal λ of every frame. Owing to its excellent coding performance, the R-λ model is also adopted in H.266/VVC. In [11], an improved R-λ model with a quality-dependent relationship is proposed to achieve better coding performance. Since a convolutional neural network (CNN) [12] can obtain prior knowledge by extracting image textures, many rate control algorithms use CNNs to obtain prior information for coding, such as predicting the optimal λ [13], estimating the RD performance [14], and deriving rate–quantization (R-Q) and distortion–quantization (D-Q) relationships [15]. However, CNN-based rate control methods are always carried out on the hybrid coding framework, which limits further coding performance improvement. With the development of deep learning, researchers have observed that all modules of the traditional coding framework can be replaced by neural networks while still achieving good coding performance; this neural-network-based coding framework is called end-to-end coding.
In [16], an end-to-end image compression framework is proposed that outperforms the traditional image coding standards. In [17], a more advanced entropy model is adopted for end-to-end image compression. In [18], the first end-to-end framework for deep video compression (DVC) is proposed, in which the key components of traditional video compression are replaced by neural networks. To further improve end-to-end video compression, Li et al. [19] propose deep contextual video compression (DCVC), which leverages a high-dimensional context to carry rich information for high-frequency content and exhibits higher video coding quality. Wang et al. [20] propose an end-to-end strategy for surveillance videos. Since bit allocation directly affects RD performance, Çetin et al. [21] exploited a gain unit to control frame-level bit allocation in end-to-end hierarchical bidirectional video compression. However, allocating bits to every frame alone cannot find a suitable λ to decrease the RD cost, which is still far from a feasible rate control scheme for deep learning video compression. Li et al. [22] provided learned rate control according to a derived R-D-λ relationship in DVC. However, the bit allocation and λ are still obtained via predictive coding, and some coding parameters do not suit the coding features. It is therefore necessary to develop an end-to-end rate control model that enables a paradigm shift from predictive coding to conditional coding.
In this paper, video multi-scale-based end-to-end rate control is proposed. The major contributions of this paper are as follows: (1) To achieve optimal end-to-end rate control performance, a two-branch residual-based network and a two-branch regression-based network are designed to obtain a suitable bit rate ratio and the optimal λ according to the temporal encoded features. (2) To obtain a suitable video multi-scale model, multi-scale bicubic downsampling operation is used, and the hybrid efficient distillation SISR network (HEDS-Net), which contains the multi-branch distillation network, the lightweight attention LCA block, and the upsampling network, is designed to restore high-resolution video.
The rest of this paper is organized as follows. Section 2 presents a brief analysis of related works from the literature. In Section 3, the proposed algorithm is described in detail. The experimental results are demonstrated in Section 4. Section 5 provides the conclusion.

2. Background and Related Work

2.1. Bit Allocation and λ Decision of URQ and R-λ Rate Control

URQ rate control
For the URQ model, the target bit allocation R_t is calculated via the quantization step size Q_s, which can be modeled as a quadratic function:

$$R_t = X_1 \frac{MAD}{Q_s^{2}} + X_2 \frac{MAD}{Q_s} \tag{1}$$

where X_1 and X_2 are the model parameters, and MAD indicates the mean absolute difference, which is used to measure the distortion between the reconstructed pixels and the original pixels. Then, the Lagrange multiplier λ is calculated using

$$\lambda = a_1 \cdot 2^{\frac{QP-12}{3}} \cdot \max\!\left(2,\ \min\!\left(4,\ \frac{QP-12}{6}\right)\right) \tag{2}$$

where a_1 is a predefined factor. According to Equation (2), the bit rate is ultimately controlled by QP.
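As a worked example of Equation (2), the following Python sketch computes λ from QP. The value 0.85 for a_1 is a common choice in H.264 reference implementations and is our assumption here, not a value given in this paper.

```python
def urq_lambda(qp: int, a1: float = 0.85) -> float:
    """Lagrange multiplier of Equation (2): lambda grows exponentially with QP."""
    base = a1 * 2.0 ** ((qp - 12) / 3.0)
    # The extra weighting term is clipped to the range [2, 4], as in Equation (2).
    weight = max(2.0, min(4.0, (qp - 12) / 6.0))
    return base * weight

# Example: QP = 30 gives roughly 0.85 * 2^6 * 3 = 163.2.
print(urq_lambda(30))
```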
R-λ rate control
For the R-λ model, the target bit rate is allocated level by level according to the coding hierarchy. At the group-of-pictures (GoP) level, the bit rate is allocated via

$$R_{GoP} = \frac{\frac{T}{FR}\left(N_c + SW\right) - R_c}{SW} \cdot N_f \tag{3}$$

where FR is the frame rate; T is the target bit rate; N_c and R_c are the number of encoded frames and their bit cost, respectively; SW is the smooth window; and N_f is the number of frames in a GoP. Then, the frame-level bit rate is allocated via

$$R_f = \left(R_{GoP} - R_f^{cost}\right) \cdot \omega_f \tag{4}$$

where R_f^cost is the total bit rate already used by the encoded frames of the GoP, and ω_f is the predefined weight factor of the current frame. Finally, the bit allocation of a coding tree unit (CTU), which is modeled in terms of bits per pixel (bpp), is calculated according to

$$bpp = \frac{R_f - R_{CTU}^{cost}}{Pixel_{CTU}} \cdot \omega_{CTU} \tag{5}$$

where R_CTU^cost is the total bit rate already used by the encoded CTUs of the frame, and ω_CTU is a weight factor obtained from the MAD of every CTU. The RD characteristic is modeled as a hyperbolic function, which is defined as

$$D = C \cdot R^{-K} \tag{6}$$

where C and K are the model parameters. Then, λ is derived by taking the derivative with respect to R:

$$\lambda = -\frac{\partial D}{\partial R} = C K \cdot R^{-K-1} \triangleq \alpha \cdot bpp^{\beta} \tag{7}$$

Therefore, according to Equation (7), the bit rate is ultimately controlled by bpp and the parameters α and β.
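As a brief illustration of Equations (6) and (7), the sketch below evaluates the hyperbolic R-D model and the λ mapping. The default α and β are the initial values used in the HEVC reference software's R-λ rate control and serve only as placeholders; in practice they are updated adaptively after each coded frame or CTU.

```python
def rd_distortion(bpp: float, C: float, K: float) -> float:
    """Hyperbolic R-D model of Equation (6): D = C * R^(-K)."""
    return C * bpp ** (-K)

def r_lambda(bpp: float, alpha: float = 3.2003, beta: float = -1.367) -> float:
    """Equation (7): lambda = alpha * bpp^beta (alpha and beta are model parameters)."""
    return alpha * bpp ** beta

# Example: a budget of 0.05 bpp maps to lambda of roughly 192; fewer bits imply a
# larger lambda, i.e., the encoder trades distortion for rate more aggressively.
print(r_lambda(0.05))
```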

2.2. Video Multi-Scale Super-Resolution

In recent years, learning-based super-resolution techniques have been widely used. For example, in [23], the synergy of supervised learning and super-resolution technology is exploited to enable low-overhead beam and power allocation. For video coding, super-resolution techniques assume that the relationship between low-resolution and high-resolution frames can be learned from a training set that contains several low-resolution frames and their corresponding high-resolution frames. For the traditional single-frame super-resolution approach called patch-based or dictionary-based super-resolution [24], a low-resolution input frame is segmented into small patches. Then, each patch is compared against the high-resolution patches in the training set to find its best match. Finally, an input low-resolution patch is replaced with the corresponding high-resolution patch of its best match. Since the residual network structure adopts a skip connection in which the input of the current network is the difference between the input and output of the previous network, it can effectively solve the gradient disappearance problem to improve the learning ability of the network. The residual network is widely used to extract high-level features in super-resolution techniques [25]. To extend the super-resolution application, single-frame super-resolution is developed for videos where a low-resolution video is segmented into spatiotemporal patches [26]. Learning-based multi-frame super-resolution methods are also introduced in videos by leveraging the temporal correlation between video frames to reconstruct an accurate high-resolution video [27,28].

2.3. End-to-End Video Compression

End-to-end video coding is increasingly being studied to exploit the effectiveness of deep learning methods. DVC [18] is a neural-network-based end-to-end video coding framework in which all modules of the traditional hybrid video framework are replaced by neural networks. A comparison of the traditional coding framework and the end-to-end coding framework is shown in Figure 1. In Figure 1a, the traditional hybrid coding framework adopts the predict–transform architecture. In Figure 1b, the end-to-end coding framework has a one-to-one correspondence with the traditional coding framework using deep learning methods.
In [29], the DVC model is further refined and exhibits better coding performance. Following a framework similar to DVC, Hu et al. [30] considered rate-distortion optimization when encoding motion vectors (MVs). In [31], the single reference frame is extended to multiple reference frames. Recently, Yang et al. [32] proposed an RNN-based MV/residue encoder and decoder. In [33], the residue is adaptively scaled using a learned parameter. To further improve the end-to-end video coding framework, DCVC is designed as a conditional coding-based framework in which motion estimation and motion compensation guide the network to generate contextual features. The higher-dimensional context can then be leveraged to carry rich information for both the encoder and the decoder, which helps reconstruct the high-frequency content of high-resolution videos.

3. The Proposed Algorithm

3.1. Framework

The proposed algorithm has three main steps. In the first step, the original video is processed using multi-scale bicubic downsampling to reduce the amount of data. The interpolated pixel I(i + u, j + v) can be expressed as

$$I(i+u,\ j+v) = A \cdot B \cdot C \tag{8}$$

where i and j are the integer parts of the interpolated pixel position along the horizontal and vertical axes, respectively, and u and v are the corresponding fractional parts. The arrays A, B, and C are given by

$$A = \begin{bmatrix} S(1+u) & S(u) & S(1-u) & S(2-u) \end{bmatrix} \tag{9}$$

$$B = \begin{bmatrix}
I(i-1,j-1) & I(i-1,j) & I(i-1,j+1) & I(i-1,j+2) \\
I(i,j-1)   & I(i,j)   & I(i,j+1)   & I(i,j+2)   \\
I(i+1,j-1) & I(i+1,j) & I(i+1,j+1) & I(i+1,j+2) \\
I(i+2,j-1) & I(i+2,j) & I(i+2,j+1) & I(i+2,j+2)
\end{bmatrix} \tag{10}$$

$$C = \begin{bmatrix} S(1+v) & S(v) & S(1-v) & S(2-v) \end{bmatrix}^{T} \tag{11}$$

where B is the 4 × 4 array of the 16 integer-position pixels surrounding the interpolated position in the original image, and S(·) in arrays A and C is the interpolation kernel, defined as

$$S(n) = \begin{cases}
1 - 2|n|^{2} + |n|^{3}, & |n| < 1 \\
4 - 8|n| + 5|n|^{2} - |n|^{3}, & 1 \le |n| < 2 \\
0, & |n| \ge 2
\end{cases} \tag{12}$$

Therefore, according to Equation (8), the frames of the original video can be downsampled at multiple scales; in this paper, ×2, ×3, and ×4 downsampling is mainly used.
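To make the downsampling step concrete, the following Python sketch applies the kernel of Equation (12) to resample a single-channel frame at an integer scale. The function names, the edge padding, and the output-grid alignment are our illustrative choices, not details specified in the paper.

```python
import numpy as np

def bicubic_kernel(n: np.ndarray) -> np.ndarray:
    """Piecewise cubic kernel S(n) of Equation (12)."""
    n = np.abs(n)
    out = np.zeros_like(n, dtype=np.float64)
    near, far = n < 1, (n >= 1) & (n < 2)
    out[near] = 1 - 2 * n[near] ** 2 + n[near] ** 3
    out[far] = 4 - 8 * n[far] + 5 * n[far] ** 2 - n[far] ** 3
    return out

def bicubic_downsample(frame: np.ndarray, scale: int) -> np.ndarray:
    """Resample a single-channel frame at an integer scale (x2, x3, x4) using the
    4x4 neighborhood of Equations (8)-(11); no anti-aliasing prefilter is applied."""
    h, w = frame.shape
    oh, ow = h // scale, w // scale
    padded = np.pad(frame.astype(np.float64), 2, mode="edge")
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            # Map the output pixel to a (possibly fractional) source position.
            sy, sx = y * scale + (scale - 1) / 2, x * scale + (scale - 1) / 2
            j, i = int(np.floor(sy)), int(np.floor(sx))
            v, u = sy - j, sx - i
            # 4x4 block of source pixels around (i, j); the +1/+5 offsets
            # compensate for the 2-pixel edge padding.
            block = padded[j + 1:j + 5, i + 1:i + 5]
            wv = bicubic_kernel(np.array([1 + v, v, 1 - v, 2 - v]))
            wu = bicubic_kernel(np.array([1 + u, u, 1 - u, 2 - u]))
            out[y, x] = wv @ block @ wu
    return out
```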
For the second step, the low-resolution frame is placed in the end-to-end rate control model for coding. For end-to-end rate control, a two-branch residual-based network is designed to obtain the optimal bit rate ratio for every frame by considering the temporal coding feature. With the optimal frame bit rate ratio, the bit rate can be controlled accurately and allocated reasonably to every frame. A two-branch regression-based network is used to predict the optimal λ to control bit streams by balancing the bit rate and distortion. Then, λ is input into the encoder of DCVC to generate the bit stream. The bit cost will be stored in a bit buffer to adjust the bit allocation of the next frame. Finally, the decoder will reconstruct the low-resolution frame from the bit stream. Therefore, with end-to-end rate control, the low-resolution frame can be encoded under a limited bit rate and reconstructed from the bit stream.
For the third step, HEDS-Net is designed to restore the high-resolution frame from the reconstructed low-resolution frame, which mainly consists of the multi-branch distillation network, the lightweight attention LCA block, and the upsampling network. The multi-branch distillation network aims to extract deep features. The lightweight attention LCA block aims to enhance learning and expression abilities to rebuild detail features. The upsampling network aims to enhance the resolution and rebuild low-frequency features. The framework of the proposed algorithm is shown in Figure 2.

3.2. End-to-End Rate Control

To make full use of the temporal correlation, a two-branch residual-based network for bit rate ratio prediction is designed, and it is shown in Figure 3.
In Figure 3, the residual block is designed to extract high-level semantic features. R_F(n−1), D_F(n−1), and λ_F(n−1) represent the bit rate, distortion, and Lagrange multiplier of the previously encoded frame, respectively, and R_G represents the target bit rate of the current GoP. W is the output of the network, which represents the predicted bit rate ratio of every frame. The up branch extracts deep features correlated with the content of the original frame, while the down branch builds a learning vector from the coding parameters of the previously encoded frame. Since the outputs of the two branches have a strong temporal correlation, a multiplication operation is used to fuse the output features. The fused features are then further extracted by a residual block and finally converted to the bit rate ratio W.
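A minimal PyTorch sketch of this two-branch structure is given below. The layer sizes, the residual block design, and the way the coding parameters are embedded are illustrative assumptions on our part; only the overall flow — a content branch on the current frame, a parameter branch on R_F(n−1), D_F(n−1), λ_F(n−1), and R_G, multiplicative fusion, and a head that outputs W — follows the text, and the exact architecture is defined in Figure 3.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block used to extract high-level semantic features."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class BitRatioNet(nn.Module):
    """Two branches: frame content (up) and coding parameters (down), fused by
    multiplication and mapped to the bit rate ratio W of the current frame."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.up = nn.Sequential(           # content branch on the current frame
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            ResBlock(ch), nn.AdaptiveAvgPool2d(1))
        self.down = nn.Sequential(         # parameter branch: R_F(n-1), D_F(n-1), lambda_F(n-1), R_G
            nn.Linear(4, ch), nn.ReLU(inplace=True), nn.Linear(ch, ch))
        self.head = nn.Sequential(nn.Linear(ch, ch), nn.ReLU(inplace=True),
                                  nn.Linear(ch, 1), nn.Sigmoid())

    def forward(self, frame, params):
        f = self.up(frame).flatten(1)      # (B, ch) content features
        p = self.down(params)              # (B, ch) parameter embedding
        return self.head(f * p)            # predicted ratio W in (0, 1)

# Usage sketch: w = BitRatioNet()(frames, torch.stack([r_prev, d_prev, lam_prev, r_gop], dim=1))
```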
From the GoP bit allocation, R_G can be expressed as

$$R_G = \frac{R_{target}\left(n_{encoded} + N_{SW}\right) - R_{encoded}}{N_{SW}} \cdot N_G \tag{13}$$

where R_target and R_encoded are the target bit rate and the total bits already used, respectively, N_G is the number of frames in the GoP, and N_SW is the smooth window, which is set to 40. Then, the bit allocation of frame n can be expressed as

$$R_F(n) = \frac{R_G - R_{encoded\text{-}G}}{\sum_{i=n}^{N_G} W_i} \cdot W_n \tag{14}$$

where R_encoded-G is the bit rate already used by the encoded frames of the GoP, and W_n is the bit rate ratio of frame n, which is predicted by the two-branch residual-based network in Figure 3. The loss function is defined as

$$Loss_{BTR} = \frac{1}{N} \sum_{i=1}^{N} \left(W_i - \hat{W}_i\right)^{2} \tag{15}$$

where Ŵ_i is the actual bit rate ratio and N is the number of frames used for training.
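A small sketch of the bit allocation in Equations (13)–(15) follows; the variable names and the assumption that R_target denotes the target bits per frame are ours.

```python
def gop_bit_budget(r_target: float, n_encoded: int, r_encoded: float,
                   n_gop: int, n_sw: int = 40) -> float:
    """Equation (13): remaining GoP budget, smoothed over a window of n_sw frames.
    r_target is assumed to be the target bits per frame."""
    return (r_target * (n_encoded + n_sw) - r_encoded) / n_sw * n_gop

def frame_bit_budget(r_gop: float, r_encoded_gop: float, ratios: list) -> float:
    """Equation (14): ratios[0] is W_n of the current frame; ratios[1:] are the
    predicted ratios of the frames in the GoP that are not encoded yet."""
    return (r_gop - r_encoded_gop) / sum(ratios) * ratios[0]

def loss_btr(pred: list, actual: list) -> float:
    """Equation (15): mean squared error between predicted and actual bit rate ratios."""
    return sum((w - w_hat) ** 2 for w, w_hat in zip(pred, actual)) / len(pred)
```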
To predict the Lagrange multiplier λ of the current frame, a two-branch regression network is designed, as shown in Figure 4. Since the residual of a frame, i.e., the difference between the predicted and original frames, indicates the correlation between adjacent frames, the residual frame is used as the input of the up branch. The bit allocation R_F(n) of the current frame calculated via Equation (14), together with the bit cost R_F(n−1), distortion D_F(n−1), and λ_F(n−1) of the previously encoded frame, is used as the input of the down branch. The fused features of the two branches are then fed into the regression block, and finally the optimal λ of the current frame is predicted by the network.
Different from the two-branch residual-based network, the two-branch regression-based network for λ is trained with a multitask loss function, defined as

$$Loss_{\lambda} = \gamma \left(\frac{R_F - \hat{R}_F(\lambda)}{R_F}\right)^{2} + \left(1 - \gamma\right) \hat{D}_F(\lambda) \tag{16}$$

where γ is set to 0.4 empirically, R_F is the allocated bit budget from Equation (14), and R̂_F(λ) and D̂_F(λ) are the actual bits and distortion, respectively.
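A PyTorch sketch of the multitask loss in Equation (16) is shown below. Obtaining the actual bits R̂_F(λ) and distortion D̂_F(λ) requires a differentiable rate and distortion estimate from the codec, so here they are simply passed in as tensors.

```python
import torch

def loss_lambda(r_alloc: torch.Tensor, r_actual: torch.Tensor,
                d_actual: torch.Tensor, gamma: float = 0.4) -> torch.Tensor:
    """Equation (16): weighted sum of the relative rate error and the distortion."""
    rate_term = ((r_alloc - r_actual) / r_alloc) ** 2
    return (gamma * rate_term + (1.0 - gamma) * d_actual).mean()
```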

3.3. Video Restoration

To restore the high-resolution frame from the reconstructed frame, HEDS-Net is designed. The structure of HEDS-Net is similar to the skip connection of a residual network, where two parallel processes restore different features. In one process, the multi-branch distillation network and the upsampling network are connected to build the low-frequency features of the high-resolution frame. In the other process, the lightweight attention LCA block is used to build the detailed features of the high-resolution frame. The two processes are then merged to generate the final reconstructed high-resolution frame.
Multi-branch distillation network
The multi-branch distillation network aims to learn and extract detailed features with multi-layer interaction. The structure of the multi-branch distillation network is shown in Figure 5.
In Figure 5, it can be observed that three residual blocks are used for extracting detailed features. The depthwise separable convolution in the residual block enhances the feature learning ability. Different residual blocks are connected by the concatenation operation to accumulate feature information for improving multi-scale expression.
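A minimal PyTorch sketch of the residual blocks with depthwise separable convolutions and their concatenation, as described above, is given below. The channel width, the number of blocks, and the 1 × 1 fusion convolution are illustrative assumptions; the exact topology is defined in Figure 5.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: per-channel 3x3 followed by 1x1 mixing."""
    def __init__(self, ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.pointwise = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class DistillBlock(nn.Module):
    """Residual block of the distillation branch."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(DSConv(ch), nn.ReLU(inplace=True), DSConv(ch))

    def forward(self, x):
        return x + self.body(x)

class MultiBranchDistill(nn.Module):
    """Three residual blocks whose outputs are concatenated and fused by a 1x1
    convolution so that the features of every block are accumulated."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.blocks = nn.ModuleList([DistillBlock(ch) for _ in range(3)])
        self.fuse = nn.Conv2d(3 * ch, ch, 1)

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return self.fuse(torch.cat(feats, dim=1))
```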
Lightweight attention LCA block
The lightweight attention LCA block makes the main contribution to generating the detailed pixels of the high-resolution frame by combining local and global features. In addition, a residual recursion mechanism reuses the lightweight attention LCA block to improve prediction accuracy while continuously optimizing network resources. The lightweight attention LCA block is shown in Figure 6.
Upsampling network
The upsampling network is mainly used to aggregate the low-frequency frame features. Two branches are used. The left branch uses sub-pixel convolution to carry out pixel rearrangement, and the right branch uses transposed convolution to implement the feature transform. Fusing the two branches improves the diversity of the upsampled features. The upsampling network is shown in Figure 7.
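A PyTorch sketch of this two-branch upsampling idea is shown below: a sub-pixel (PixelShuffle) branch and a transposed-convolution branch whose outputs are fused. The kernel sizes and the fusion by addition are our assumptions for illustration.

```python
import torch.nn as nn

class TwoBranchUpsampler(nn.Module):
    """Fuses a sub-pixel convolution branch (pixel rearrangement) with a
    transposed convolution branch (feature transform)."""
    def __init__(self, ch: int, scale: int = 2):
        super().__init__()
        self.subpixel = nn.Sequential(                  # left branch
            nn.Conv2d(ch, ch * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))
        # Kernel size and padding are chosen so that the output is exactly
        # scale times the input size for scale = 2, 3, or 4.
        self.transposed = nn.ConvTranspose2d(           # right branch
            ch, ch, kernel_size=scale + 2 * (scale // 2),
            stride=scale, padding=scale // 2)

    def forward(self, x):
        return self.subpixel(x) + self.transposed(x)    # fuse the two branches
```

For scale 2, the transposed branch becomes ConvTranspose2d(ch, ch, 4, stride=2, padding=1), which doubles the height and width and therefore matches the PixelShuffle branch, so the two outputs can be added directly.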
As mentioned above, the loss function of HEDS-Net is defined as

$$loss_{HEDS\text{-}Net} = \frac{1}{N} \sum_{n=1}^{N} \left[ \left(x_n - \hat{x}_n\right)^{2} + \left(\omega_1 \nabla\left(x_n - \hat{x}_n\right)_{(i,j)} - \omega_2\right)^{2} \right] \tag{17}$$

where N is the number of training samples, x_n and x̂_n are the predicted frame and the label frame, respectively, ω_1 ∈ [0.1, 1] and ω_2 ∈ [0.2, 0.8] are weighting factors, and ∇(x_n − x̂_n)_(i,j) denotes the gradient of (x_n − x̂_n) at position (i, j).
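A PyTorch sketch of Equation (17) follows. The paper does not specify the gradient operator, so a simple finite difference along the height and width is assumed here, and the default ω_1 and ω_2 are arbitrary values inside the stated ranges.

```python
import torch

def heds_loss(pred: torch.Tensor, label: torch.Tensor,
              w1: float = 0.5, w2: float = 0.5) -> torch.Tensor:
    """Equation (17): pixel-wise MSE plus a weighted penalty on the gradient of
    the error map (w1 in [0.1, 1], w2 in [0.2, 0.8])."""
    diff = pred - label
    mse = diff.pow(2).mean()
    # Finite-difference gradients of the error map along height and width.
    gy = diff[..., 1:, :] - diff[..., :-1, :]
    gx = diff[..., :, 1:] - diff[..., :, :-1]
    grad_term = (w1 * gy - w2).pow(2).mean() + (w1 * gx - w2).pow(2).mean()
    return mse + grad_term
```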

4. Experimental Results

The proposed algorithm is embedded into the DCVC software, which is run with its official default configuration and model. The structural similarity (SSIM), PSNR, and BD-rate indexes are used to evaluate coding quality. The experimental platform is shown in Table 1.

4.1. Performance Analysis of End-to-End Rate Control

The Vimeo-90k [34] and BVI-DVC [35] datasets are used to train the two-branch residual-based network and the two-branch regression network. The number of iterations is set to 2 × 10⁵, and the initial learning rate is set to 1 × 10⁻⁴; after 8 × 10⁴ iterations, the learning rate decays exponentially to 1 × 10⁻⁵. For comparison with the proposed end-to-end rate control, Li et al. [22], who derive the allocated bits from a novel R-D-λ model for learned rate control, and Li et al. [9], who use the Lagrange multiplier λ to control the bit rate, are selected. In total, 100 frames of every test sequence are encoded. DCVC is used as the anchor, and four rate-distortion (RD) points are selected: λ = 256, 512, 1024, and 2048. The bit rate accuracy is defined as

$$M = \frac{\left| R - \hat{R} \right|}{R} \tag{18}$$

where R is the target bit rate and R̂ is the actual bit rate. The bit rate accuracy results are shown in Table 2.
It is observed from Table 2 that the average bit rate accuracy results of DCVC, Li et al. [22], Li et al. [9], and the proposed end-to-end rate control are 2.62%, 3.89%, 5.93%, and 2.25%, respectively. The control accuracy of the proposed algorithm is better than that of the others. Since controlling the bit rate is a highly challenging task for end-to-end coding, the deviation from the target bit rate remains noticeable for all four algorithms. The coding quality comparisons are shown in Table 3.
In Table 3, the average BD-rate (PSNR) indexes of Li et al. [22], Li et al. [9], and the end-to-end rate control are −0.69, −0.35, and −0.84, respectively, which means that the proposed end-to-end rate control achieves the largest bit rate savings at the same objective quality. For the BD-rate (SSIM) indexes, the proposed algorithm achieves −0.35, whereas Li et al. [22] and Li et al. [9] achieve −0.24 and −0.17, respectively, which means that the proposed end-to-end rate control also improves subjective coding quality the most. Since temporal coding information is used to train the networks, the bit rate is allocated more reasonably to match the changing frame features, and λ can be selected more effectively to decrease the RD cost. The RD comparisons are shown in Figure 8.
Figure 8 shows the RD comparisons of DCVC, Li et al. [22], Li et al. [9], and the proposed algorithm. It can be observed that the RD performance of the proposed end-to-end rate control is better than the others, which indicates the effectiveness of the proposed end-to-end rate control. To summarize, the proposed end-to-end rate control can improve both objective and subjective coding performance with good control accuracies.

4.2. Performance Analysis of HEDS-Net

For training HEDS-Net, the 800 images of DIV2K [36] are used. The low-resolution images are obtained via bicubic downsampling, and the downsampling rates are set to ×2, ×3, and ×4. To enhance the diversity of the training set, the images are rotated by 90, 180, and 270 degrees. The cosine annealing method is used to decrease the learning rate automatically: the initial learning rate is set to 1 × 10⁻³ and decays gradually to 1 × 10⁻⁷ over 1 × 10⁵ iterations. PSNR and SSIM are used to evaluate the difference between the HEDS-Net output and the original image. The Manga109 [37], BSD100 [38], Set5 [39], and Set14 [40] datasets are used for testing, and Bicubic, SRCNN [41], VDSR [42], EDSR [43], and RCAN [44] are used for comparison. The results are shown in Table 4.
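The learning-rate schedule described above can be reproduced with PyTorch's built-in cosine annealing scheduler, as sketched below; the choice of the Adam optimizer is our assumption, while the initial rate of 1 × 10⁻³, the 1 × 10⁵ iterations, and the floor of 1 × 10⁻⁷ follow the text.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)        # placeholder for HEDS-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100_000, eta_min=1e-7)          # anneal from 1e-3 to 1e-7 over 1e5 iterations

for step in range(100_000):
    # ... forward pass, heds_loss, loss.backward(), then:
    optimizer.step()
    scheduler.step()                                 # update the learning rate every iteration
```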
From Table 4, it can be observed that HEDS-Net achieves better PSNR and SSIM indexes than the other algorithms at every scale. All algorithms perform best at the ×2 scale, and restoration performance at the ×4 scale is the worst. Since a low-resolution image obtained by ×4 downsampling loses most of the detailed features of the original image, restoring the original image is more difficult than from ×2 or ×3 downsampled images; that is, a ×2 downsampled image preserves more detailed information, so the features of the high-resolution image can be restored more easily than from ×3 and ×4 downsampled images. HEDS-Net has 0.267 M parameters, and its complexity, measured in floating-point operations (FLOPs), is 15.96 G. It can also be observed in Table 4 that the parameter count and complexity of HEDS-Net are only slightly greater than those of SRCNN, which means that HEDS-Net is a lightweight network that still achieves good restoration performance.
The visualization map of the PSNR indexes for Set5 at the ×4 scale is shown in Figure 9. From Figure 9, it can be seen that HEDS-Net has the best restoration performance with low complexity and a few parameters. This also verifies the high computational efficiency of HEDS-Net.

4.3. Performance Analysis of Multi-Scale-Based End-to-End Rate Control

For the multi-scale-based end-to-end rate control experiment, four RD points, λ = 256, 512, 1024, and 2048, are selected, and ×2 bicubic downsampling is used. The high-resolution test classes, comprising Classes A1, A2, B, and E, are used for comparison. The bit rate accuracy results are shown in Table 5.
In Table 5, the average accuracies of DCVC, Li et al. [22], Li et al. [9], and the proposed algorithm are 2.67%, 3.82%, 5.69%, and 1.82%, respectively. Therefore, the proposed algorithm has the best control accuracy. Since the proposed algorithm processes the original video via the ×2 bicubic downsampling operation, reducing video data will be helpful for controlling the bit rate. On the other hand, end-to-end rate control can allocate suitable bit rates and λ for every frame based on the temporal coding feature; these will improve the bit rate control’s accuracy. The experimental results of the PSNR and SSIM BD-rates are shown in Table 6.
In Table 6, the BD-rate (PSNR) indexes of Li et al. [22], Li et al. [9], and the proposed algorithm are −1.09%, −0.46%, and −1.24%, respectively, and the BD-rate (SSIM) indexes are −0.48%, −0.28%, and −0.50%, respectively. Therefore, the proposed algorithm has the best coding performance. Since the end-to-end rate control and HEDS-Net are trained separately, the proposed algorithm can adapt to diverse coding features, and it improves coding performance significantly with high rate control accuracy. To measure coding time complexity, the test sequences in Table 5 are used, and 100 frames are encoded. It should be noted that the proposed algorithm first applies the ×2 bicubic downsampling operation to the original frames. The coding time comparisons of DCVC, Li et al. [22], Li et al. [9], and the proposed algorithm are shown in Table 7.
In Table 7, the coding time of DCVC is used as the anchor. The coding time indexes of Li et al. [22], Li et al. [9], and the proposed algorithm are 126%, 111%, and 98%, respectively. Because Li et al. [22] increased the coding complexity of the end-to-end coding model, more time was used with respect to DCVC. For Li et al. [9], the hybrid coding framework uses a substantial amount of time on prediction coding. Therefore, Li et al. [9] required more coding time compared to DCVC. For the proposed algorithm, because the ×2 bicubic downsampling operation of the original frame is used first, a substantial decrease in the amount of coding data occurs. Even though some types of neural networks are used in the proposed algorithm, which will increase coding times, data reduction will affect coding times more. Subjective comparisons of BasketballDrive and Cactus are shown in Figure 10.
From Figure 10, it can be observed that the subjective quality of the basket region in Figure 10(b-1) is clearer than the others in Figure 10(c-1), (d-1), and (e-1). Similarly, the subjective qualities of the yellow flower regions presented in Figure 10(c-2), (d-2), and (e-2) are worse than that of the image in Figure 10(b-2). Therefore, we can easily conclude that the proposed algorithm exhibits better subjective performances than the other algorithms.

5. Conclusions

In this paper, a frame super-resolution-based end-to-end rate control is proposed in DCVC. Different from the traditional hybrid coding framework, the key coding parameters of the multi-scale-based end-to-end rate control are predicted using various convolutional neural networks. Firstly, the original video is processed by employing the multi-scale bicubic downsampling operation to greatly reduce video data. For end-to-end rate control, the suitable bit rate ratio for every frame is predicted using the two-branch residual-based network according to the temporally encoded parameters. Moreover, the optimal λ can be obtained using the two-branch regression-based network according to the temporally encoded feature in order to balance distortion and the bit rate. Finally, for restoring the high-resolution frame, HEDS-Net is designed, which contains a multi-branch distillation network, the lightweight attention LCA block, and the upsampling network, to generate the detailed features of the upsampling frame. The experimental results show that the proposed algorithm can achieve a PSNR BD-rate of −1.24% and SSIM BD-rate of −0.50%, with 1.82% rate control accuracy.
Since changes in video content generate fluctuating bit rates that affect rate control performance, dynamic data analysis techniques for video streams are a promising direction for future rate control research. Applying analysis and reasoning to dynamic video data will help the encoder understand and analyze video content, video features, and video structure. Therefore, dynamic data analysis techniques will be useful for further improving the coding performance and control accuracy of rate control in future work.

Author Contributions

Conceptualization, L.W., Z.Y. and H.Z.; methodology, L.W. and Z.Y.; software, L.W., Z.Y., H.Z., X.L., W.D. and Y.Z.; validation, X.L., W.D. and Y.Z.; formal analysis, Z.Y.; investigation, L.W., Z.Y. and H.Z.; data curation, H.Z., X.L., W.D. and Y.Z.; writing—original draft preparation, L.W. and Z.Y.; writing—review and editing, L.W., Z.Y. and H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the major project of Zhangjiang (Grant No. ZJ2020-ZD-009).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Doulamis, N.D.; Konstantoulakis, G.; Stassinopoulos, G. Efficient modeling of VBR MPEG-1 coded video sources. IEEE Trans. Circuits Syst. Video Technol. 2000, 10, 93–112. [Google Scholar] [CrossRef]
  2. Wang, L. Rate control for MPEG video coding. Signal Process. Image Commun. 2000, 15, 493–511. [Google Scholar] [CrossRef]
  3. Lee, H.-J.; Chiang, T.; Zhang, Y.-Q. Scalable rate control for MPEG-4 video. IEEE Trans. Circuits Syst. Video Technol. 2000, 10, 878–894. [Google Scholar]
  4. CCITT SG XV. Description of Reference Model 8 (RM8); Document 525; Specialists Group on Coding for Visual Telephony: Geneva, Switzerland, 1989. [Google Scholar]
  5. Tsai, J.-C.; Shieh, C.-H. Modified TMN8 rate control for low-delay video communications. IEEE Trans. Circuits Syst. Video Technol. 2004, 14, 864–868. [Google Scholar] [CrossRef]
  6. Ma, S. Proposed Draft Description of Rate Control on JVT Standard; Doc. JVT-F086, Tech. Rep.; Joint Video Team: Geneva, Switzerland, 2002. [Google Scholar]
  7. Choi, H.; Yoo, J.; Nam, J.; Sim, D.; Bajić, I.V. Pixel-wise unified rate-quantization model for multi-level rate control. IEEE J. Sel. Top. Signal Process. 2013, 7, 1112–1123. [Google Scholar] [CrossRef]
  8. Li, B.; Li, H.; Li, L.; Zhang, J. Rate control by R-lambda model for HEVC. In Proceedings of the 11th Meeting on JCTVC-K0103, JCTVC of ISO/IEC and ITU-T, Shanghai, China, 10–19 October 2012. [Google Scholar]
  9. Li, B.; Li, H.; Li, L.; Zhang, J. λ domain rate control algorithm for high efficiency video coding. IEEE Trans. Image Process. 2014, 23, 3841–3854. [Google Scholar] [CrossRef] [PubMed]
  10. Yang, Z.; Wang, G.; Li, G.; Zhu, W. Distortion propagation-based optimal λ decision for random access rate control in HEVC. J. Electron. Imaging 2020, 29, 013002. [Google Scholar] [CrossRef]
  11. Li, Y.; Liu, Z.; Chen, Z. Rate Control for Versatile Video Coding. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; IEEE: New York, NY, USA, 2020. [Google Scholar]
  12. Hu, Y.; Luo, D.; Hua, K.; Zhang, X. Overview on deep learning. CAAI Trans. Intell. Syst. 2019, 14, 9–19. [Google Scholar]
  13. Yang, Z.; Luo, Y.; Lin, Y.; Wei, L.; Zhang, H. Convolutional neural network-based optimal R-λ intra rate control in Versatile Video Coding. J. Electron. Imaging 2022, 31, 063011. [Google Scholar] [CrossRef]
  14. Wang, J.; Shang, X.; Zhao, X.; Zhang, Y. A convolutional neural network-based rate control algorithm for VVC intra coding. Displays 2024, 82, 102652. [Google Scholar] [CrossRef]
  15. Mao, Y.; Wang, M.; Ni, Z.; Wang, S.; Kwong, S. Neural network based rate control for versatile video coding. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6072–6085. [Google Scholar] [CrossRef]
  16. Jiang, F.; Tao, W.; Liu, S.; Ren, J.; Guo, X.; Zhao, D. An end-to-end compression framework based on convolutional neural networks. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 3007–3018. [Google Scholar] [CrossRef]
  17. Minnen, D.; Ballé, J.; Toderici, G. Joint autoregressive and hierarchical priors for learned image compression. arXiv 2018, arXiv:1809.02736. [Google Scholar]
  18. Lu, G.; Ouyang, W.; Xu, D.; Zhang, X.; Cai, C.; Gao, Z. DVC: An end-to-end deep video compression framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11006–11015. [Google Scholar]
  19. Li, J.; Li, B.; Lu, Y. Deep contextual video compression. Adv. Neural Inf. Process. Syst. 2021, 34, 18114–18125. [Google Scholar]
  20. Wang, S.; Zhao, Y.; Gao, H.; Ye, M.; Li, S. End-to-end video compression for surveillance and conference videos. Multimed. Tools Appl. 2022, 81, 42713–42730. [Google Scholar] [CrossRef]
  21. Çetin, E.; Yılmaz, M.A.; Tekalp, A.M. Flexible-rate learned hierarchical bi-directional video compression with motion refinement and frame-level bit allocation. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: New York, NY, USA, 2022; pp. 1206–1210. [Google Scholar]
  22. Li, Y.; Chen, X.; Li, J.; Wen, J.; Han, Y.; Liu, S.; Xu, X. Rate control for learned video compression. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 2829–2833. [Google Scholar]
  23. Cao, Y.; Ohtsuki, T.; Maghsudi, S.; Quek, T.Q.S. Deep Learning and Image Super-Resolution-Guided Beam and Power Allocation for mmWave Networks. IEEE Trans. Veh. Commun. 2023, 72, 15080–15085. [Google Scholar] [CrossRef]
  24. Singh, A.; Singh, J. Survey on single image based super-resolution—Implementation challenges and solutions. Multimed. Tools Appl. 2019, 79, 1641–1672. [Google Scholar] [CrossRef]
  25. Nasrollahi, K.; Moeslund, T.B. Super-resolution: A comprehensive survey. Mach. Vis. Appl. 2014, 25, 1423–1468. [Google Scholar] [CrossRef]
  26. Shahar, O.; Faktor, A.; Irani, M. Space-time super-resolution from a single video. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 20–25 June 2011; pp. 3353–3360. [Google Scholar]
  27. Kawulok, M.; Benecki, P.; Piechaczek, S.; Hrynczenko, K.; Kostrzewa, D.; Nalepa, J. Deep learning for multiple-image super resolution. IEEE Geosci. Remote Sens. Lett. 2020, 17, 1062–1066. [Google Scholar] [CrossRef]
  28. Salvetti, F.; Mazzia, V.; Khaliq, A.; Chiaberge, M. Multi-image super resolution of remotely sensed images using residual attention deep neural networks. Remote Sens. 2020, 12, 2207. [Google Scholar] [CrossRef]
  29. Lu, G.; Zhang, X.; Ouyang, W.; Chen, L.; Gao, Z.; Xu, D. An end-to-end learning framework for video compression. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3292–3308. [Google Scholar] [CrossRef] [PubMed]
  30. Hu, Z.; Chen, Z.; Xu, D.; Lu, G.; Ouyang, W.; Gu, S. Improving deep video compression by resolution-adaptive flow coding. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 193–209. [Google Scholar]
  31. Lin, J.; Liu, D.; Li, H.; Wu, F. M-LVC: Multiple frames prediction for learned video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  32. Yang, R.; Mentzer, F.; Gool, L.V.; Timofte, R. Learning for video compression with hierarchical quality and recurrent enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  33. Yang, R.; Yang, Y.; Marino, J.; Mandt, S. Hierarchical autoregressive modeling for neural video compression. In Proceedings of the 9th International Conference on Learning Representations, ICLR, Virtually, 3–7 May 2021. [Google Scholar]
  34. Xue, T.; Chen, B.; Wu, J.; Wei, D.; Freeman, W.T. Video enhancement with task-oriented flow. Int. J. Comput. Vis. 2019, 127, 1106–1125. [Google Scholar] [CrossRef]
  35. Ma, D.; Zhang, F.; Bull, D.R. BVI-DVC: A training database for deep video compression. IEEE Trans. Multimed. 2021, 24, 3847–3858. [Google Scholar] [CrossRef]
  36. Agustsson, E.; Timofte, R. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 126–135. [Google Scholar]
  37. Matsui, Y.; Ito, K.; Aramaki, Y.; Fujimoto, A.; Ogawa, T.; Yamasaki, T.; Aizawa, K. Sketch-based manga retrieval using manga109 dataset. Multimed. Tools Appl. 2017, 76, 21811–21838. [Google Scholar] [CrossRef]
  38. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision, Vancouver, BC, Canada, 7–14 July 2001; Volume 2, pp. 416–423. [Google Scholar]
  39. Marco, B.; Roumy, A.; Guillemot, C.M.; Alberi-Morel, M.-L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the British Machine Vision Conference, Surrey, UK, 3–7 September 2012. [Google Scholar]
  40. Zeyde, R.; Elad, M.; Protter, M. On single image scale-up using sparse-representations. In Proceedings of the 7th International Conference of Curves and Surfaces, Avignon, France, 24–30 June 2012; pp. 711–730. [Google Scholar]
  41. Guo, R.; Shi, X.-P.; Jia, D.-K. Learning a deep convolutional network for image super-resolution reconstruction. J. Eng. Heilongjiang Univ. 2018, 9, 52–59. [Google Scholar]
  42. Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using very Deep Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  43. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
  44. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
Figure 1. Coding frameworks: (a) traditional hybrid coding framework; (b) end-to-end coding framework.
Figure 2. Frame super-resolution-based end-to-end rate control.
Figure 3. Two-branch residual-based network.
Figure 4. Two-branch regression-based network.
Figure 5. Multi-branch distillation network.
Figure 6. Lightweight attention LCA block.
Figure 7. Upsampling network.
Figure 8. RD curve comparisons of (A): the proposed algorithm, (B): DCVC, (C): Li et al. [22] and (D): Li et al. [9].
Figure 9. Visualization map of PSNR indexes for SRCNN, VDSR, EDSR, RCAN, and HEDS-Net.
Figure 10. Subjective comparisons of BasketballDrive (second frame @ 2320.8 kbps) and Cactus (second frame @ 4261.38 kbps). (a-1,a-2) are the ground truth images; (b-1,b-2) are the images from the proposed algorithm; (c-1,c-2) are the images from DCVC; (d-1,d-2) are the images from Li et al. [22]; (e-1,e-2) are the images from Li et al. [9].
Table 1. Experimental platform configuration.

| Item | Configuration |
|---|---|
| Operating system | Windows 11 |
| CPU | Intel Core i7-14700KF |
| GPU | GeForce RTX 4080 Super |
| RAM | 32 GB |
| Deep learning framework | PyTorch |
| CUDA version | 12.4 |
Table 2. Bit rate accuracy comparisons of DCVC, Li et al. [22], Li et al. [9], and the end-to-end rate control.

| Class | DCVC M (%) | Li et al. [22] M (%) | Li et al. [9] M (%) | Proposed M (%) |
|---|---|---|---|---|
| Class A1 | 4.21 | 5.41 | 7.60 | 2.13 |
| Class A2 | 4.18 | 5.38 | 7.56 | 2.12 |
| Class B | 1.43 | 3.24 | 6.27 | 3.36 |
| Class C | 1.31 | 2.75 | 4.71 | 2.80 |
| Class D | 2.65 | 3.24 | 3.83 | 1.76 |
| Class E | 1.95 | 3.30 | 5.60 | 1.35 |
| Average | 2.62 | 3.89 | 5.93 | 2.25 |
Table 3. Experimental comparisons of Li et al. [22], Li et al. [9], and the end-to-end rate control.

| Class | Sequence | Li et al. [22] BD-Rate (PSNR) | Li et al. [22] BD-Rate (SSIM) | Li et al. [9] BD-Rate (PSNR) | Li et al. [9] BD-Rate (SSIM) | Proposed BD-Rate (PSNR) | Proposed BD-Rate (SSIM) |
|---|---|---|---|---|---|---|---|
| Class A1 | Tango2 | −3.63 | −0.77 | −0.20 | −0.03 | −0.42 | −0.27 |
| | FoodMarket4 | 0.83 | 0.10 | 0.08 | 0.02 | 0.74 | 0.07 |
| | Campfire | −2.02 | −0.65 | −0.32 | −0.20 | −1.66 | −0.42 |
| Class A2 | CatRobot1 | −2.27 | −0.68 | −0.32 | −0.03 | −0.63 | −0.40 |
| | DaylightRoad2 | −0.82 | −0.12 | −0.22 | −0.06 | −0.07 | −0.12 |
| | ParkRunning3 | 0.40 | 0.04 | 0.34 | 0.05 | 0.29 | −0.05 |
| Class B | MarketPlace | 1.03 | 0.12 | 0.14 | −0.02 | −1.60 | −0.67 |
| | RitualDance | −0.93 | −0.31 | −0.82 | −0.36 | −1.30 | −0.58 |
| | Cactus | −0.49 | −0.19 | −0.40 | −0.18 | −1.13 | −0.51 |
| | BasketballDrive | −2.52 | −0.68 | −1.07 | −0.67 | −2.56 | −0.72 |
| | BQTerrace | −1.83 | −0.60 | −1.16 | −0.65 | −1.52 | −0.60 |
| Class C | BasketballDrill | −0.06 | −0.01 | −0.51 | −0.32 | −1.14 | −0.43 |
| | BQMall | 0.15 | 0.00 | −0.22 | −0.03 | −0.50 | −0.34 |
| | PartyScene | 0.23 | −0.02 | 0.18 | 0.02 | −1.96 | −0.50 |
| | RaceHorses | −0.59 | −0.08 | −0.61 | −0.28 | −1.12 | −0.46 |
| Class D | BasketballPass | −0.89 | −0.30 | −1.13 | −0.44 | −0.05 | −0.08 |
| | BQSquare | 1.03 | 0.13 | −0.93 | −0.41 | −0.80 | −0.30 |
| | BlowingBubbles | 1.02 | 0.10 | 1.04 | 0.32 | −0.01 | −0.06 |
| | RaceHorses | −0.45 | −0.20 | −0.10 | −0.01 | −0.46 | −0.22 |
| Class E | FourPeople | −1.33 | −0.43 | −0.69 | −0.23 | −1.29 | −0.37 |
| | Johnny | −0.43 | −0.12 | −0.17 | −0.10 | −0.43 | −0.33 |
| | KristenAndSara | −1.59 | −0.61 | −0.50 | −0.21 | −0.82 | −0.38 |
| Average | | −0.69 | −0.24 | −0.35 | −0.17 | −0.84 | −0.35 |
Table 4. Comparisons of Bicubic, SRCNN, VDSR, EDSR, RCAN, and HEDS-Net.

| Algorithm | Parameters (M) | FLOPs (G) | Manga109 PSNR/SSIM | BSD100 PSNR/SSIM | Set5 PSNR/SSIM | Set14 PSNR/SSIM |
|---|---|---|---|---|---|---|
| Scale ×2 | | | | | | |
| Bicubic | \ | \ | 30.79/0.932 | 29.56/0.843 | 33.65/0.928 | 30.23/0.867 |
| SRCNN [41] | 0.009 | 6.1 | 35.59/0.965 | 31.35/0.887 | 36.65/0.953 | 32.44/0.905 |
| VDSR [42] | 0.667 | 70.5 | 38.87/0.972 | 32.09/0.898 | 37.89/0.955 | 33.69/0.912 |
| EDSR [43] | 1.381 | 8679 | 39.49/0.978 | 32.31/0.901 | 38.10/0.959 | 33.91/0.918 |
| RCAN [44] | 1.582 | 120 | 39.43/0.977 | 32.40/0.903 | 38.26/0.960 | 34.11/0.921 |
| HEDS-Net | 0.267 | 15.96 | 39.51/0.979 | 32.54/0.907 | 38.40/0.966 | 34.39/0.929 |
| Scale ×3 | | | | | | |
| Bicubic | \ | \ | 26.94/0.855 | 27.20/0.737 | 30.38/0.865 | 27.54/0.773 |
| SRCNN [41] | 0.009 | 6.1 | 30.47/0.911 | 28.40/0.786 | 32.74/0.908 | 29.29/0.820 |
| VDSR [42] | 0.667 | 70.5 | 31.87/0.918 | 29.19/0.801 | 34.48/0.921 | 30.22/0.840 |
| EDSR [43] | 1.381 | 8679 | 32.01/0.920 | 29.24/0.808 | 34.64/0.927 | 30.51/0.846 |
| RCAN [44] | 1.582 | 120 | 34.03/0.926 | 29.31/0.810 | 34.72/0.929 | 30.64/0.848 |
| HEDS-Net | 0.267 | 15.96 | 34.17/0.930 | 29.40/0.825 | 34.79/0.932 | 30.77/0.850 |
| Scale ×4 | | | | | | |
| Bicubic | \ | \ | 26.39/0.760 | 25.95/0.667 | 28.41/0.810 | 26.01/0.713 |
| SRCNN [41] | 0.009 | 6.1 | 27.58/0.851 | 26.90/0.710 | 30.48/0.862 | 27.58/0.755 |
| VDSR [42] | 0.667 | 70.5 | 30.83/0.858 | 27.69/0.720 | 31.15/0.880 | 27.85/0.760 |
| EDSR [43] | 1.381 | 8679 | 31.02/0.860 | 27.75/0.721 | 31.35/0.883 | 28.01/0.768 |
| RCAN [44] | 1.582 | 120 | 31.10/0.910 | 27.84/0.742 | 32.62/0.900 | 28.85/0.787 |
| HEDS-Net | 0.267 | 15.96 | 31.13/0.912 | 27.97/0.751 | 32.69/0.901 | 28.92/0.790 |
Table 5. Bit rate accuracy comparisons of DCVC, Li et al. [22], Li et al. [9], and the proposed algorithm.

| Class | DCVC | Li et al. [22] | Li et al. [9] | Proposed |
|---|---|---|---|---|
| A1 | 3.87% | 4.41% | 7.30% | 2.01% |
| A2 | 3.43% | 5.38% | 7.56% | 2.11% |
| B | 1.69% | 2.13% | 2.91% | 1.72% |
| E | 1.67% | 3.35% | 4.97% | 1.42% |
| Average | 2.67% | 3.82% | 5.69% | 1.82% |
Table 6. Coding performance comparisons of DCVC, Li et al. [22], Li et al. [9], and the proposed algorithm.

| Class | Sequence | Li et al. [22] BD-Rate (PSNR) | Li et al. [22] BD-Rate (SSIM) | Li et al. [9] BD-Rate (PSNR) | Li et al. [9] BD-Rate (SSIM) | Proposed BD-Rate (PSNR) | Proposed BD-Rate (SSIM) |
|---|---|---|---|---|---|---|---|
| A1 | Tango2 | −3.59% | −0.77% | −0.20% | −0.03% | −0.42% | −0.27% |
| | FoodMarket4 | 0.78% | 0.10% | 0.08% | 0.02% | 0.74% | 0.07% |
| | Campfire | −2.01% | −0.65% | −0.32% | −0.20% | −1.66% | −0.42% |
| A2 | CatRobot1 | −2.37% | −0.68% | −0.32% | −0.03% | −0.63% | −0.40% |
| | DaylightRoad2 | −0.92% | −0.12% | −0.22% | −0.06% | −0.07% | −0.12% |
| | ParkRunning3 | 0.43% | 0.04% | 0.34% | 0.05% | 0.29% | −0.05% |
| B | MarketPlace | −0.13% | −0.15% | −0.64% | −0.17% | −2.48% | −0.87% |
| | RitualDance | −1.83% | −0.58% | −1.60% | −0.51% | −2.18% | −0.78% |
| | Cactus | −1.39% | −0.46% | −1.18% | −0.33% | −2.01% | −0.71% |
| | BasketballDrive | −0.42% | −0.95% | −0.35% | −0.82% | −3.44% | −0.92% |
| | BQTerrace | −0.73% | −0.87% | −0.34% | −0.80% | −0.80% | −0.90% |
| E | FourPeople | −1.30% | −0.60% | −0.47% | −0.38% | −1.67% | −0.57% |
| | Johnny | −0.33% | −0.49% | −0.95% | −0.25% | −1.31% | −0.53% |
| | KristenAndSara | −1.49% | −0.55% | −0.30% | −0.36% | −1.70% | −0.58% |
| Average | | −1.09% | −0.48% | −0.46% | −0.28% | −1.24% | −0.50% |
Table 7. Coding time comparisons of DCVC, Li et al. [22], Li et al. [9], and the proposed algorithm.

| Class | DCVC | Li et al. [22] | Li et al. [9] | Proposed |
|---|---|---|---|---|
| A1 (average) | 29,267.82 | 37,577.63 | 32,325.63 | 29,135.9 |
| A2 (average) | 43,911.46 | 57,200.19 | 49,623.23 | 42,136.14 |
| B (average) | 21,357.6 | 25,076.49 | 23,282.05 | 21,235.88 |
| E (average) | 7418.42 | 8609.4 | 7939.47 | 7408.27 |
| Total average | 100% | 126% | 111% | 98% |
