Article

Learning-Based Rate Control for High Efficiency Video Coding

1
Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok 10330, Thailand
2
Multimedia Data Analytics and Processing Research Unit, Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok 10330, Thailand
3
Chitose Institute of Science and Technology, Chitose 066-8655, Japan
*
Author to whom correspondence should be addressed.
Sensors 2023, 23(7), 3607; https://doi.org/10.3390/s23073607
Submission received: 17 February 2023 / Revised: 24 March 2023 / Accepted: 27 March 2023 / Published: 30 March 2023

Abstract

High efficiency video coding (HEVC) has dramatically enhanced coding efficiency compared to the previous video coding standard, H.264/AVC. However, the existing rate control updates its parameters according to a fixed initialization, which can cause errors in the prediction of bit allocation to each coding tree unit (CTU) in frames. This paper proposes a learning-based mapping method between rate control parameters and video contents to achieve an accurate target bit rate and good video quality. The proposed framework contains two main structural codings, including spatial and temporal coding. We initiate an effective learning-based particle swarm optimization for spatial and temporal coding to determine the optimal parameters at the CTU level. For temporal coding at the picture level, we introduce semantic residual information into the parameter updating process to regulate the bit correctly on the actual picture. Experimental results indicate that the proposed algorithm is effective for HEVC and outperforms the state-of-the-art rate control in the HEVC reference software (HM-16.10) by 0.19 dB on average and up to 0.41 dB for low-delay P coding structure.

1. Introduction

Multimedia technology has been upgraded from one generation to another to serve everyday devices such as televisions, telephones, computers, and robots. Numerous multimedia applications are in use, including digital versatile disc (DVD), digital television (TV) broadcasting, video telephony, video teleconferencing, video games, and other forms of video-on-demand. According to [1], the resolution of television broadcasting has been upgraded from standard-definition television (SDTV) to 8K ultra high definition (UHD), which requires a very high bit rate to transmit or store. Furthermore, the share of video in internet traffic is increasing, according to the statistical report in the "Cisco Annual Internet Report (2018–2023)" white paper [2]. Thus, an effective video coding technique is strongly needed to reduce the network traffic load while delivering good visual quality at a lower bit rate.
In general, video properties have four redundancy criteria: spatial redundancy, temporal redundancy, perceptual redundancy, and statistical redundancy, which can be eliminated by the video coding standard [3]. High efficiency video coding (HEVC) [4], an advanced video coding standard released in 2013 by ITU-T and ISO/IEC, can effectively remove the digital video redundancies and achieve a bit rate saving of about fifty percent at the same visual quality by comparing with the previous standard (H.264/AVC [3,5,6]). HEVC is built following the structure of the successful block-based hybrid video coding approach [7], the same as the H.264/AVC video coding standard. In addition, several advanced techniques are applied in HEVC to get efficient compressions, such as flexible partitioning using quad-tree structure, prediction modes [8], sample adaptive offset (SAO) [9], and the cutting-edge interpolation technique [10].
Moreover, HEVC needs a functional encoder control, known as rate control, to determine the codec parameters that minimize the rate–distortion (RD) cost [11]. These codec parameters include mode selection, quad-tree structure, motion estimation, and the quantization parameter (QP). Commonly, rate control algorithms [11,12] determine the bit allocation and QP while fixing the other parameters, so as to meet the target bits with consistent visual quality. Specifically, rate control distributes the bits available under a constant bit rate (CBR) to each coding level, including the group of pictures (GOP) level, the picture level, and the basic unit level (macroblocks (MBs) in H.264/AVC). The QP is then regulated to achieve the pre-allocated bits for each coding level, where a larger QP leads to fewer allocated bits and vice versa. Encoder controls typically implement a uniform bit allocation in a GOP structure and initialize fixed encoding parameters for any video content to preserve a short-term constant output bit rate in the CBR channel. As a result, such an implementation cannot accurately adjust the encoding parameters for each frame in a GOP. Accordingly, if the target bits are fewer than the output bits, the encoded bits accumulate in the encoder buffer, causing a buffer overflow. Conversely, if the target bits exceed the output bits, the buffer underflows. Hence, controlling the relationship between bit rate and QP is essential for maintaining picture quality throughout the video sequence, as buffer overflows and underflows cause undesired fluctuations in video quality. Q-domain rate control is a direct estimation that attempts to model a correlation function between bit rate and quantization; the bit allocation computed from the QP accounts for residual information but not for non-residual information.
This model can work well when the coding parameters are not very flexible. Another rate control algorithm, called ρ-domain rate control, was developed [12,13] by introducing a linear function that predicts the coding bit rate from the percentage of zeros among the quantized transform coefficients. The model is effective only if the size of the transform is fixed. Both Q-domain and ρ-domain rate controls assume a high correlation between bit rate and quantization. This assumption does not hold for current video codecs because they have become increasingly flexible [4]. Thus, a robust rate control [11], named R–λ rate control, has been released to achieve the best balance between bit rate and distortion. This rate control attempts to improve the coding efficiency and rate control accuracy by using the Lagrange multiplier, λ, for rate–distortion optimization (RDO).
Although R–λ rate control aims to enhance the coding efficiency of HEVC compared with the conventional methods, two difficulties remain in the HEVC reference software [14]: inaccurate bit allocation and inaccurate λ estimation. For the bit allocation part, the bit consumption of each CU of the first picture is computed by applying one set of initial encoder parameters to all basic units. In other words, all CUs are encoded using the same rate control parameters as the picture level. In such a case, the rate control causes a bit consumption imbalance among CUs, because each CU has its own spatial characteristics, and the resulting error in the bit distribution affects the overall quality control. In addition, the inaccurate bit consumption at each coding level affects the λ adjustment needed to meet the frame bit budget, because λ and the bit allocation are highly correlated. Specifically, because the encoder parameters are derived empirically from the previous encoding results and the statistical characteristics of the input source data, they are inaccurate at scene changes, resulting in performance degradation.
Based on these considerations, we propose a learning-based mapping method between R–λ parameters and video content to achieve accurate target bit rates and preserve good video quality. We use a feedback re-encoding method for the intra-picture and inter-picture to distribute R–λ parameters adaptively in response to picture pattern changes. Additionally, a convolutional neural network (CNN) model [15] is used to capture a powerful spatial representation of the local coding tree units (CTUs). This CNN model is trained on the ImageNet dataset [16]. By incorporating the CNN model into the R–λ rate control algorithm, we can accurately obtain the expected number of bits per CTU. Our problem is a constrained optimization problem: find the encoder control parameters that minimize the distortion subject to the constraint that the actual bit rate consumption is less than the target bit rate. Two families of optimization methods can solve such a problem, namely the gradient-based method [17,18] and the non-gradient-based method (known as the evolutionary algorithm) [19,20,21,22,23,24]. The gradient-based method is effective only when the constraints and the objective or penalty function are differentiable. Since our model aims to map the high-dimensional feature space of the CTU to the R–λ parameters with the goal of RD optimization, and the gradient information cannot be derived directly from the penalty function, an evolutionary algorithm (EA) is chosen to optimize the parameters of our model. There are several EAs, such as evolution strategies (ES) [19], simulated annealing (SA) [20], genetic algorithm (GA) [21], and particle swarm optimization (PSO) [22]. Among these EAs, PSO stands out for its simplicity and convergence speed [24] and has been successfully applied to various constrained optimization problems [25,26,27,28].
In essence, PSO needs only the value of the objective function and uses primitive mathematical operators to move the model parameters, which behave as particles following a social behavior. Therefore, PSO is implemented in our model to find the best solution for mapping the characteristics of a CTU to rate control parameters. Furthermore, we feed in the semantic residue information to adjust the rate control parameters that are updated across pictures. The main contributions of this paper can be summarized in three aspects:
(i) We propose a learning-based neural network to define the mapping between video contents and rate control parameters to assign CTU budgets correctly;
(ii) We introduce a particle swarm optimization algorithm to finalize the optimal parameters at the basic unit level to maintain the bit budget and obtain good visual video quality;
(iii) We enhance the rate control parameter updating by considering the semantic residue information of the actual inter-picture into rate control.
The rest of the paper is organized as follows. In the next section, we briefly summarize related work. Then, the learning-based parameters of R λ are described. After that, the experimental results are given. Finally, concluding remarks are provided.

2. Related Works

In this section, we briefly review the existing rate control models: the RQ model, the ρ-domain-based Rate-GOP, the R–λ model, and deep learning-based rate control.

2.1. RQ Model

The RQ model [29] has been extended to HEVC encoder control as a pixel-wise unified RQ model (URQ). The quadratic RQ model is defined as in (1):
$$ R = a \, Q^{-1} + b \, Q^{-2} \tag{1} $$
where $R$ is the target bit rate, $Q$ is the quantization parameter, and $a$ and $b$ are parameters related to the video characteristics. The bit allocation of the URQ model is similar to that of the rate control model in H.264/AVC, where the target bits are computed based on the mean absolute difference (MAD) corresponding to the quantization step. As a result, compared with the earlier version of the HEVC reference software (HM6.0) [14], the visual quality of the URQ model is slightly improved. However, some issues have been raised regarding Q-domain rate control [30,31]; for example, QP is an integer and therefore may not be adjustable finely enough to meet a bit budget.
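As a small illustration of how the quadratic model in (1) can be used, the sketch below evaluates the rate for a given Q and inverts the model to find the Q that meets a target rate. The function names and the sample values of a and b are hypothetical and chosen only for the example; the inversion assumes a, b ≥ 0 and a positive target rate.

```python
import math

def rq_rate(q, a, b):
    """Quadratic RQ model of (1): R = a/Q + b/Q^2."""
    return a / q + b / (q * q)

def rq_solve_q(target_rate, a, b):
    """Invert (1) for Q: R*Q^2 - a*Q - b = 0, taking the positive root.

    Assumes a >= 0, b >= 0 and target_rate > 0 (illustrative assumption).
    """
    disc = a * a + 4.0 * b * target_rate
    return (a + math.sqrt(disc)) / (2.0 * target_rate)

# Hypothetical model parameters, not taken from the paper.
q = rq_solve_q(target_rate=1000.0, a=5000.0, b=20000.0)
print(q, rq_rate(q, 5000.0, 20000.0))
```

In an encoder the continuous Q found this way still has to be mapped to an integer QP, which is the accuracy issue noted above.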

2.2. ρ-Domain-Based Rate-GOP

The enhanced RQ model known as the ρ-domain-based Rate-GOP is proposed in [32]; it introduces a one-to-one relationship between the quantized transform coefficients and the target bit rate. It is formulated as in (2):
$$ R_i = \theta_i \, (1 - \rho_i) \tag{2} $$
where $\theta_i$ and $\rho_i$ denote a parameter related to the video pattern and the percentage of zero transform coefficients of frame $i$, respectively. Additionally, the mapping between non-zero transform coefficients and QP follows a quadratic function so that bits are properly allocated to non-zero transform units. Consequently, the ρ-domain-based Rate-GOP achieves significantly better video quality than the Q-domain rate control. Although this indirect relationship between R and Q is advantageous, it is still difficult to adapt its estimation to the variable block size transform in HEVC.
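The linear model in (2) can be illustrated with a few lines of code. The sketch below measures ρ as the fraction of zero quantized coefficients, fits θ from one observed frame, and predicts the rate for another ρ. The helper names and the numbers are hypothetical; the fitting step is a simple assumption for illustration rather than the procedure of [32].

```python
def zero_fraction(coeffs):
    """rho: fraction of quantized transform coefficients that are zero."""
    return sum(1 for c in coeffs if c == 0) / len(coeffs)

def rho_rate(theta, rho):
    """Linear rho-domain model of (2): R = theta * (1 - rho)."""
    return theta * (1.0 - rho)

def estimate_theta(observed_bits, rho):
    """Fit theta from one coded frame (assumes rho < 1)."""
    return observed_bits / (1.0 - rho)

# Hypothetical quantized coefficients and bit count for one frame.
rho = zero_fraction([0, 0, 3, 0, -1, 0, 0, 0])   # 6 of 8 coefficients are zero
theta = estimate_theta(observed_bits=2000.0, rho=rho)
print(rho, theta, rho_rate(theta, 0.9))
```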

2.3. R–λ Model

To overcome the limitations of the RQ model mentioned above, a new type of encoder control with hierarchical bit allocation for every picture in a GOP, called R–λ rate control, is proposed in [11]. First, the authors proposed a hyperbolic function to model the RD relationship, as in (3):
$$ D(R) = C \cdot R^{-K} \tag{3} $$
where $C$ and $K$ are parameters related to video content. Then, to minimize (3), λ is determined as the slope of the model in (4).
$$ \lambda = -\frac{\partial D}{\partial R} = C \cdot K \cdot R^{-K-1} \tag{4} $$
$$ \lambda = \alpha \cdot R^{\beta} \triangleq \gamma \cdot D^{\tau} \tag{5} $$
Therefore, λ indicates the trade-off between bit rate and distortion. If λ is large, the bit rate is lower and the distortion is higher. On the other hand, a small λ results in a higher bit rate with lower distortion. In addition, a hierarchical bit allocation method [33] is used to assign different picture weights according to each picture position in the GOP to improve coding efficiency. Furthermore, the QP can be computed from λ for each coding level as in (6).
$$ QP = 4.2005 \cdot \ln(\lambda) + 13.7122 \tag{6} $$
The rate control can obtain stable buffer occupancy and codec improvements through the hierarchical bit allocation method and the novel relationship between λ and R. As a result, R–λ rate control is generally used in the advanced video coding standard. However, the R–λ model mainly considers the bit rate and ignores the characteristics of the video content. Furthermore, the model initializes its parameters by sharing the same fixed constants from the frame level with all CTUs. These aspects can cause video quality degradation.
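To make the chain from bit budget to QP concrete, the following sketch computes λ from a bits-per-pixel budget with the power model λ = α·R^β and then maps λ to QP with (6). The default values of α and β are the initial values commonly cited for the HM reference software and should be treated as assumptions here, as should the rounding and the clipping of QP to the range 0 to 51.

```python
import math

def lambda_from_bpp(bpp, alpha=3.2003, beta=-1.367):
    """Power model lambda = alpha * R^beta, with R in bits per pixel.

    alpha and beta defaults are assumed initial values, not from this paper.
    """
    return alpha * bpp ** beta

def qp_from_lambda(lam):
    """QP = 4.2005 * ln(lambda) + 13.7122, rounded and clipped to [0, 51]."""
    qp = 4.2005 * math.log(lam) + 13.7122
    return max(0, min(51, round(qp)))

# A smaller bit budget gives a larger lambda and therefore a larger QP.
for bpp in (0.5, 0.1, 0.02):
    lam = lambda_from_bpp(bpp)
    print(bpp, lam, qp_from_lambda(lam))
```

The loop shows the behavior described in the text: as the budget shrinks, λ grows and the encoder quantizes more coarsely.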
A distortion-based Lagrange multiplier is proposed in [34] to enhance the compressed video quality in HEVC. The authors used the relationship between distortion D and λ instead of the R–λ relationship. Two main objective functions control the λ adjustment: mean square error (MSE) and absolute error. MSE is calculated from the original and reconstructed video content, while the absolute error is the difference between the actual and target bit budgets. This technique is designed for the non-hierarchical structure of rate control. It enhances the video quality by an average of 0.23 dB in the low-delay P configuration compared with non-hierarchical R–λ rate control. The R–λ model with a hierarchical structure achieves 0.26 dB higher video quality than the R–λ model without a hierarchical structure [11]. This advantage of the hierarchical structure makes R–λ the default approach in the HEVC general test conditions [35]. A video quality enhancement built on R–λ with a hierarchical structure is proposed in [36]. The authors introduced a simple rate control parameter-sharing scheme in a GOP structure (PS-GOP), achieving 0.07 dB higher video quality on average and up to 0.17 dB compared to the default HEVC reference software (HM-16.10) [14].
An inter-block dependency-based CTU-level rate control for HEVC, known as the RCA model, is established in [37]. RCA is inspired by temporal-dependent RDO and is formulated as the fusion of inter-block dependency and R–D characteristics. This model has achieved a considerable PSNR enhancement. However, the spatial coding units have not been taken into consideration, which can result in parameter propagation errors at the early stage.

2.4. Deep Learning-Based Rate Control

A deep reinforcement learning-based rate control for the dynamic video sequences is designed in [38] to capture the experience gained from the various factors, including brightness, variance, and gradient of each coding unit during the coding process. The proposed model is structured following the Markov decision process in a continuous discrete space to obtain better PSNR and lower-quality fluctuation. Nevertheless, the reinforcement approach has limitations, including a high number of interactions required to learn an optimal policy and difficulty generalizing to new, unseen environments.
Under a random access configuration, a deep convolution features-driven rate control for the HEVC encoders is proposed [39]. The method involves utilizing a pre-trained VGG-16 model to extract perceptual features, which addresses the problem of the rate control estimation. However, the model has not generalized the visual characteristic mapping to the rate control parameter.
Hence, we propose effective R–λ parameters associated with the video content to improve the compressed video quality and maintain the bit budgets at the encoder side. The following section presents the proposed framework in detail.

3. Learning-Based Rate Control

This section introduces a learning-based rate control algorithm, which creates a regression map for the R–λ parameters. The proposed framework is shown in Figure 1. The green boxes represent the modified rate control model, which uses the feature translation technique and the convolution feature map. First, the input video is fed into the convolution feature map to extract the high-dimensional feature space, which contains essential features representing the CTU in the scene. Then, the proposed model learns to translate the input feature space to rate control parameters to get the optimal trade-off between the target bit rate and distortion. Additionally, the dashed lines from the inter- and intra-prediction blocks indicate that the convolution feature representation is sent to the Encoder Control block together with the coding mode (intra- or inter-prediction). Figure 2 shows the convolution feature map module and the regression map representation module, which are constructed to generate the R–λ parameters. The regression map is designed as learning-based particle swarm optimization (LB-PSO). Furthermore, the parameter updating for inter-coding is performed by considering residue information. The details of each part are presented in the following subsections.

3.1. Convolutional Feature Map

The convolutional feature map (fully convolutional networks, FCNs) is introduced at the first stage to obtain a meaningful spatial representation of CTU pictures as the input of our LB-PSO model. In general, the early convolution layers of deep convolutional networks capture local or low-level feature information of the input image. In contrast, the deeper convolution layers capture high-level feature information that provides more global information about the image [40]. Additionally, the last fully connected (FC) layer of deep nets is designed to map the high-level feature information to object classes. Since FCNs do not include the FC layer, the relationship between the input image and the final feature output layer is preserved and can be viewed as data compression, which encodes the raw-pixel representation of the input image into high-level information. This information provides the global feature G representing the input image characteristics. G is fed into our LB-PSO model to generate the R–λ parameters. A pre-trained residual network (ResNet) [15] model without the FC layer is used to extract the convolutional feature. However, the original input size of ResNets is incompatible with the maximum size of CTUs. Adaptive average pooling (AAP) is therefore applied to the last convolution layer to ensure the compatibility of input and output dimensions. Figure 2 demonstrates the overall layout of our convolutional feature map architecture.
Suppose the $t$-th frame contains a total of $K$ CTUs; then $G^t = \{g_0, g_1, \ldots, g_K\}^t$. Precisely, $G$ is a parameter representing the high-dimensional features required as input to the proposed LB-PSO model. To obtain $G$ for re-feedback coding of each coding structure in HEVC, i.e., intra- or inter-pictures, we define its elements as in (7):
$$ g_k^t = \begin{cases} S_k^t, & \text{if intra-picture}, \\ \left| S_k^t - S_k^{t-N_{GOP}} \right| (t \bmod c), & \text{otherwise}, \end{cases} \tag{7} $$
where $k \in K$, and $c$ ($c > 0$) is a constant to determine the frame index for re-feedback coding on $(t \bmod c)$. $N_{GOP}$ is the total number of pictures in a GOP. $S_k^t$ and $S_k^{t-N_{GOP}}$ represent the convolutional feature information (spatial representation) of the $k$-th CTU obtained from the original frame $f_{org}$ at position $t$ and from the reconstructed frame $f_{rec}$ at position $t - N_{GOP}$, respectively.
Specifically, if the encoding mode is intra coding, the spatial representation is input directly to the LB-PSO model. Otherwise, we compute the semantic residue information as the absolute difference between the current spatial representation $S_k^t$ of the original CTU and the previous spatial representation $S_k^{t-N_{GOP}}$ of the reconstructed CTU, and feed it to the LB-PSO model so that the rate control parameters reflect the changes between the two CTUs. The reconstructed frame at $t - N_{GOP}$ is chosen because a group of pictures allows the temporal redundancy in the video to be exploited. The proposed model can thus be adapted to the chosen $N_{GOP}$.
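The choice of input feature described above can be sketched as follows. The function takes the feature vector of the current original CTU and, for inter coding, the feature vector of the co-located reconstructed CTU one GOP earlier, and returns either the features themselves or their element-wise absolute difference. The function name and the flat-list representation are hypothetical, and the sketch deliberately leaves out the $(t \bmod c)$ term of (7), whose exact role is not assumed here.

```python
def ctu_feature(s_cur, s_prev_rec=None, intra=True):
    """Input feature g for one CTU.

    s_cur      -- features of the original CTU at frame t
    s_prev_rec -- features of the reconstructed CTU at frame t - N_GOP
    intra      -- True for intra-pictures, False for inter-pictures
    """
    if intra:
        return list(s_cur)
    if s_prev_rec is None or len(s_prev_rec) != len(s_cur):
        raise ValueError("inter coding needs a previous feature vector of equal length")
    # Semantic residue: element-wise absolute difference of the two feature vectors.
    return [abs(a - b) for a, b in zip(s_cur, s_prev_rec)]

# Toy three-dimensional features (real CNN features are far larger).
print(ctu_feature([1.0, 0.5, 2.0]))
print(ctu_feature([1.0, 0.5, 2.0], [0.5, 0.5, 3.0], intra=False))
```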

3.2. Learning-Based Particle Swarm Optimization Network

3.2.1. LB-PSO Estimator

Our LB-PSO is proposed to define the optimal mapping $\phi$ from the spatial–temporal representation $g_k$ of a CTU to the rate control parameters $y_k$, with $y_k = \{\alpha, \beta\}_k$. We introduce a feedforward network with one hidden layer to determine $y_k$. This feedforward network can be computed as in (8):
$$ y_k = \phi(h_k; W_\phi, b_\phi) = W_\phi^{T} h_k + b_\phi \tag{8} $$
where $W_\phi$ contains the weights of the mapping function $\phi$, $b_\phi$ is a bias, and $h_k$ represents the output of the hidden layer. Precisely, $h_k$ is obtained by applying a rectified linear activation function to the output of a linear transformation with weights $W_h$ and bias $b_h$, which makes the transformation non-linear. Thus, $h_k$ can be derived as in (9):
$$ h_k = \max\{0, \; W_h^{T} g_k + b_h\} \tag{9} $$
From (8) and (9), our complete mapping model can be reformulated as in (10):
$$ y_k = W_\phi^{T} \max\{0, \; W_h^{T} g_k + b_h\} + b_\phi \tag{10} $$
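The mapping in (10) is a one-hidden-layer network with a rectified linear activation, and a forward pass can be sketched in a few lines. The helper names, the nested-list weight layout (one row per input dimension, one column per output unit), and the toy sizes below are assumptions made for the example; real CTU features have many more dimensions.

```python
def relu_layer(W, b, x):
    """h = max(0, W^T x + b); W has len(x) rows and len(b) columns."""
    return [max(0.0, sum(W[i][j] * x[i] for i in range(len(x))) + b[j])
            for j in range(len(b))]

def linear_layer(W, b, x):
    """y = W^T x + b; W has len(x) rows and len(b) columns."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def lbpso_estimate(g, W_h, b_h, W_phi, b_phi):
    """Map a CTU feature vector g to the pair (alpha, beta) as in (10)."""
    h = relu_layer(W_h, b_h, g)
    alpha, beta = linear_layer(W_phi, b_phi, h)
    return alpha, beta

# Toy weights: 2 input features, 2 hidden units, 2 outputs.
g = [1.0, 2.0]
W_h, b_h = [[1.0, -1.0], [0.5, 0.0]], [0.0, 0.0]
W_phi, b_phi = [[1.0, -0.5], [3.0, 3.0]], [0.5, -1.0]
print(lbpso_estimate(g, W_h, b_h, W_phi, b_phi))
```

In this toy case the second hidden unit is negative before activation and is clipped to zero, so only the first hidden unit contributes to the outputs.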
The model parameters $M = \{W_\phi, W_h, b_\phi, b_h\}$ are optimized by swarm intelligence, in which particles exchange information about the RD cost function $J$. In other words, each particle regulates its trajectory with respect to its own best previous position and the best previous position reached by any member of its neighborhood. To guide the swarm, the cost function $J$ combines two objectives: a reconstruction error (MSE) for visual quality and a smooth $L_1$ error for bit allocation. The cost function $J$ can be defined as in (11):
$$ J = \frac{1}{N} \sum_{j=0}^{N-1} \left( f_{org}^{j} - f_{rec}^{j} \right)^2 + \eta \, \mathrm{smooth}_{L_1}(R_T - R_A) \tag{11} $$
$$ \mathrm{smooth}_{L_1}(U) = \begin{cases} \dfrac{U^2}{2}, & \text{if } |U| < 1, \\ |U| - \dfrac{1}{2}, & \text{otherwise}, \end{cases} \tag{12} $$
where $N$ is the total number of pixels in a picture and $\eta$ is a penalty coefficient. $R_T$ and $R_A$ are the target and actual bits at the picture level, respectively.
According to the cost function design, the model parameters are updated after all CTUs are fully encoded. This cost function drives the model to learn the trade-off between distortion and bit allocation. The next section introduces the complete process of the parameter update.
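A minimal sketch of the cost in (11) and (12) is given below, with pictures represented as flat lists of pixel values. The function names and the value of the penalty coefficient η are illustrative assumptions; the source does not state the η used in the experiments.

```python
def smooth_l1(u):
    """Smooth L1 of (12): quadratic near zero, linear elsewhere."""
    a = abs(u)
    return 0.5 * u * u if a < 1.0 else a - 0.5

def rd_cost(orig, rec, target_bits, actual_bits, eta=1.0):
    """J of (11): picture MSE plus eta times the smooth L1 bit error."""
    mse = sum((o - r) ** 2 for o, r in zip(orig, rec)) / len(orig)
    return mse + eta * smooth_l1(target_bits - actual_bits)

# Toy three-pixel picture, overshooting the bit budget by 3 bits.
print(rd_cost([10, 20, 30], [11, 18, 30], target_bits=100, actual_bits=103, eta=0.1))
```

The smooth L1 term penalizes large budget errors linearly rather than quadratically, so a single badly mispredicted picture does not dominate the distortion term.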

3.2.2. Parameter Updating

In this subsection, we present the parameter update of the encoder controller for the intra and inter coding modes. The inter coding mode is further classified into two sets of coding frames: core frames and common frames. A core frame is encoded by activating the re-feedback coding to adjust the bit budget at the CTU coding level. In contrast, a common frame is coded by applying the default Lagrangian multiplier to determine the bit budget at the CTU coding level. For both intra coding and core frames of inter coding, the bit budget at the CTU coding level is computed using Equations (4) and (10). Additionally, each particle carries its own instance of the model parameters $M$ in Equation (10) and adjusts it according to its movement in the search space.
Let $P$ be the total size of the population, $V_i$ be the velocity (position change) of the $i$-th particle, $B_i$ be the best previous model parameters of the $i$-th particle, and $B_g$ be the best model parameters in the swarm. Then the swarm is manipulated on each iteration $n$ according to the following two equations:
$$ V_i^{n+1} = a V_i^{n} + c_1 r_{i1}^{n} \left( B_i^{n} - M_i^{n} \right) + c_2 r_{i2}^{n} \left( B_g^{n} - M_i^{n} \right), \tag{13} $$
$$ M_i^{n+1} = M_i^{n} + V_i^{n+1}, \tag{14} $$
where $i = 1, 2, \ldots, P$, and $a$ is the inertia weight of the velocity $V$, which controls the trade-off between the swarm's global and local exploration capabilities. $c_1$ and $c_2$ are two positive acceleration constants, named the PSO's cognitive and social parameters, respectively. $r_{i1}$ and $r_{i2}$ are random numbers drawn from a uniform distribution within the range $[0, 1]$. The performance of each model parameter set $M_i$ in the swarm is measured according to the cost function $J$; a lower cost indicates a better $M_i$. After the best $M_i$, which yields the minimal cost $J$ at the CTU coding level, has been found, the CTU is encoded.
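The two update rules above can be sketched as a generic particle swarm minimizer over flat parameter vectors. Everything specific in the sketch is an assumption for illustration: the function name, the inertia and acceleration values, the initialization range, and the sphere function used in place of the RD cost J (which would require running the encoder). Note that Python's `random()` draws from [0, 1) rather than the closed interval.

```python
import random

def pso_minimize(cost, dim, n_particles=20, n_iters=100,
                 a=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize cost(M) with the velocity and position updates of (13) and (14)."""
    rng = random.Random(seed)
    M = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    B = [m[:] for m in M]                      # personal best positions
    B_cost = [cost(m) for m in M]
    g = min(range(n_particles), key=B_cost.__getitem__)
    Bg, Bg_cost = B[g][:], B_cost[g]           # swarm best position and cost
    history = [Bg_cost]
    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            for d in range(dim):
                V[i][d] = (a * V[i][d]
                           + c1 * r1 * (B[i][d] - M[i][d])
                           + c2 * r2 * (Bg[d] - M[i][d]))
                M[i][d] += V[i][d]
        # Update the bests only after every particle has moved in this iteration.
        for i in range(n_particles):
            j = cost(M[i])
            if j < B_cost[i]:
                B[i], B_cost[i] = M[i][:], j
        g = min(range(n_particles), key=B_cost.__getitem__)
        if B_cost[g] < Bg_cost:
            Bg, Bg_cost = B[g][:], B_cost[g]
        history.append(Bg_cost)
    return Bg, Bg_cost, history

sphere = lambda m: sum(x * x for x in m)       # stand-in for the RD cost J
best, best_cost, history = pso_minimize(sphere, dim=3)
print(best_cost, history[0])
```

Because the swarm best is replaced only when a strictly lower cost is found, the recorded best cost never increases from one iteration to the next.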
For the picture level of inter coding, the rate control parameters are adjusted by considering the residue score of the semantic residue information. The probability of residue score Q t on a picture at time t can be computed as
Q t = k K j S k A k t ( j ) S k t ( j )
A k t ( j ) = 0 , if t N G O P 0 S k t ( j ) S k t × t N G O P ( j ) , otherwise
where . represents the rounded result. Additionally, because the spatiotemporal information of the video sequence varies within the GOP, the picture levels generally have different pairs of encoder controller coefficients $\alpha_p$ and $\beta_p$. Therefore, the rate control parameters can be updated by (17)–(21). The Lagrangian multiplier, λ, is defined as
$$ \lambda = \alpha_p^{old} \cdot R^{\beta_p^{old}} \tag{17} $$
If the $GOP_{id}$ equals 0, a pair of rate control parameters can be formulated as in (18) and (20).
$$ \alpha_p^{new} = \alpha_p^{old} + \delta_\alpha \cdot \ln\left( \frac{\lambda_r}{\lambda_c} \right) \cdot \alpha_p^{old} + \zeta Q^t $$
β p n e w = β p o l d + δ β · ( ln ( λ r λ c ) ) · ln ( b p p r ) + ζ 2 Q t
Otherwise, a pair of rate control parameters can be computed as in (19) and (20).
$$ \alpha_p^{new} = \alpha_p^{old} + \zeta Q^t $$
β p n e w = β p o l d + ζ 2 Q t
where $\delta_\alpha$ and $\delta_\beta$ are the default constants in the HEVC reference software, $\lambda_r$ represents the real λ value, $\lambda_c$ is the λ value computed from the real cost $bpp_r$ with the previous rate control parameters $\alpha_p^{old}$ and $\beta_p^{old}$ at the picture level, and $\zeta$ is the residue penalty constant.
The quantization parameter (QP) can then be determined as in (21).
$$ QP = 4.2005 \cdot \ln(\lambda) + 13.7122 \tag{21} $$
Figure 3 provides the flowchart of the learning-based PSO method, named LB-PSO. LB-PSO first randomly initializes the parameters of the group of particles. Then, the rate control coefficients are computed using the LB-PSO estimator. Subsequently, the best local and global parameters of the LB-PSO model are updated if the current position is better than the stored position according to the cost function $J$. After that, the velocity $V$ and position $M$ are calculated following Equations (13) and (14). Finally, the best particle of the LB-PSO model is selected to generate the best rate control coefficients for the current input CTU context.

4. Experimental Results

To evaluate the performance of the proposed learning-based particle swarm optimization, the experiments are conducted on various videos, including static and dynamic scenes.

4.1. Experiment Setting

In the experiment, the proposed algorithm is implemented on the HEVC reference software [14] and is compared with PS-GOP [36] and the state-of-the-art R–λ rate control (RC-HEVC) [11]. According to the HEVC common parameter setting [3], the largest CTU size yields high-efficiency coding performance. Specifically, the largest feasible size of a CTU in HEVC is a 64 × 64 block. We have also designed the model to adapt the bit allocation of CTUs to their spatial information, which is extracted using a pre-trained CNN model. Since we apply CNN feature extraction to the largest CTU size in HEVC, we transform the YUV420 format into a true color (64 × 64 × 3) CTU as the input of the feature extraction block. The proposed algorithm and baseline methods are simulated in the same reference software, HM-16.10. Precisely, the experiments are conducted under the low-delay P main profile configuration, and the encoder parameters are set according to the standard setting in [35] with rate control enabled. In addition, the proposed LB-PSO runs 100 iterations in every decision-making process for each prediction of the rate control parameters. There are fifteen test video sequences with five video resolutions, including two videos of 240p (wide quarter video graphics array, WQVGA) [41], three videos of 480p (wide video graphics array, WVGA) [41], five videos of 720p (HD) [42], three videos of 1080p (full HD) [41], and two videos of 4k resolution [43]. Table 1 briefly summarizes the characteristics of the test video sequences. In addition, each test video sequence is encoded at four target bit rates corresponding to the video resolution.
Since the goal of rate control is not only to improve the visual quality of the video for a given bit rate but also to achieve the bit rate closest to the target bit rate, both peak signal-to-noise ratio (PSNR) and bit rate error (BRE) are used as the criteria for determining the performance of the rate control algorithm. The PSNR and BRE can be computed as in (22) and (23).
$$ PSNR = 10 \log_{10} \frac{(2^n - 1)^2}{\dfrac{1}{N} \sum_{j=0}^{N-1} \left( f_{org}^{j} - f_{rec}^{j} \right)^2} \tag{22} $$
$$ BRE = \frac{(R_T - R_A)}{R_T} \times 100\% \tag{23} $$
where $n$ represents bit depth.
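Both criteria are straightforward to compute; the sketch below follows (22) and (23) for pictures given as flat lists of pixel values. The function names are hypothetical, identical pictures are mapped to infinite PSNR by choice, and the BRE is returned with its sign as written in (23), so a caller interested only in its magnitude would take the absolute value.

```python
import math

def psnr(orig, rec, bit_depth=8):
    """PSNR in dB as in (22); returns infinity for identical pictures."""
    mse = sum((o - r) ** 2 for o, r in zip(orig, rec)) / len(orig)
    if mse == 0:
        return math.inf
    peak_sq = (2 ** bit_depth - 1) ** 2
    return 10.0 * math.log10(peak_sq / mse)

def bit_rate_error(target_rate, actual_rate):
    """BRE in percent as in (23), signed."""
    return (target_rate - actual_rate) / target_rate * 100.0

# Toy two-pixel picture and a 1 kbps shortfall on a 1000 kbps target.
print(psnr([0, 255], [0, 254]))
print(bit_rate_error(1000.0, 999.0))
```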

4.2. Experimental Results and Analysis

(1) R–D performance and Bit Rate Accuracy: The first experiment was conducted on the low video resolution (WQVGA), which contains two video sequences with different frame rates, including BlowingBubbles and BQSquare. These two videos have various dynamic characteristics, such as a moving camera, moving objects, and illumination changes. Table 2 describes the proposed method’s P S N R and B R E performance compared with the baseline methods. Our learning-based method outperforms all the baseline methods as we achieve the highest P S N R value with the same bit rate.
Specifically, our method’s average P S N R enhancement is 0.23 dB and 0.12 dB compared with RC-HEVC and PS-GOP, respectively. Our approach also performs the maximum P S N R improvement ( m a x ) of 0.30 dB and 0.20 dB compared to RC-HEVC and PS-GOP. Figure 4a illustrates the RD performance curve of the BQSquare test sequence. The learning-based approach obtains a better RD performance than the baselines method. In addition, the average B R E of RC-HEVC, PS-GOP, and our methods are 0.01%, indicating that all approaches can effectively achieve the target bit rate. However, the proposed method has the lowest B R E at a lower target bit rate (256 kbps). It is noticed that the RC-HEVC has poor visual quality on these WQVGA with dynamic scenes compared to all approaches. As a result, even if the scene has dynamic properties, our algorithm can constructively achieve the target bit rate with the good visual quality of the WQVGA sequence.
Next, the WVGA sequences BasketballDrillText, PartyScene, and BQMall were tested. The scene properties are similar to those of the previous experiment, but these WVGA sequences are more challenging than WQVGA because they involve multi-object movement, camera movement, and higher resolution. The PSNR and BRE results are summarized in Table 3, where the proposed learning-based method shows a larger advantage: it improves PSNR by up to 0.41 dB over RC-HEVC and up to 0.33 dB over PS-GOP. On average, our approach has a BRE of 0.00% and a PSNR that is 0.23 dB and 0.16 dB higher than RC-HEVC and PS-GOP, respectively. The RD curve of our proposed method also lies clearly above those of the competing methods, as shown in Figure 4b. Based on the results of all approaches in Table 2 and Table 3, the R-λ rate control and PS-GOP are less suitable for such dynamic scenes and cameras, which suggests that their λ adjustment and quality control are not correctly estimated.
After testing the WVGA sequences, the HD videos containing video conferencing and online teaching test sequences were simulated. The HD videos are FourPeople, KristenAndSara, Vidyo1, Vidyo3, and Vidyo4. These videos have the characteristics of a static camera with multiple objects moving. Figure 4c shows the RD curve of FourPeople from the low bit rate to the high bit rate. Although the scene is captured with a static camera, the proposed method’s RD performance is noticeably better than that of the competing methods. Additionally, the PSNR and BRE evaluations of these HD video sequences are recorded in Table 4, where the average PSNR improvement of our method is approximately 0.17 dB (max = 0.30 dB) and 0.08 dB (max = 0.21 dB) in comparison with RC-HEVC and PS-GOP, respectively.
The last experiment was applied to full HD and 4k video test sequences. The first three videos, ParkScene, Cactus, and BQTerrace, were used for the full HD experiment. The last two sequences, HoneyBee and Jocky, were used for the 4k experiment. This last test contained all types of scenarios. The ParkScene and Jocky videos have a moving camera and multiple object motions, while the BQTerrace video combines camera motion with static-camera shots. Furthermore, the Cactus video consists of a static camera and rotating objects, and the HoneyBee video has multiple object motions and a static camera. According to Table 5, the PSNR gain of the proposed method at a low bit rate is largest on the BQTerrace sequence (0.35 dB over RC-HEVC at 1000 kbps), whereas at a high bit rate it is largest on the ParkScene sequence (0.14 dB over RC-HEVC at 4000 kbps). The reason is that scenes containing a dynamic camera have significant movement changes; thus, the state-of-the-art R-λ rate control cannot update the encoding controller correctly. In addition, PS-GOP shares parameters within a GOP, which is not enough to adapt the encoder parameters to the characteristics of each frame. For this reason, our method establishes a novel mapping between frame features and R-λ coefficient parameters. We provide a computationally feasible solution using LB-PSO to produce optimal RD for good visual quality and to maintain the target bit rate. Figure 4 shows the overall RD curves at different video resolutions. Overall, our method achieves the highest average PSNR of all the compared methods. Across Table 2 to Table 5, the average PSNR improvement is 0.19 dB (max = 0.41 dB) and 0.10 dB (max = 0.33 dB) compared with RC-HEVC and PS-GOP, respectively.
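The role an optimizer plays in choosing rate control coefficients can be pictured with a plain particle swarm optimizer. The sketch below is not the authors' LB-PSO, which is coupled to convolutional feature maps; it is a generic global-best PSO. It assumes the widely used R-λ form λ = α · bpp^β, synthetic (bpp, λ) samples generated from made-up coefficients, and illustrative search bounds, inertia weight, and acceleration constants.

```python
import math
import random


def fit_r_lambda_pso(samples, n_particles=20, n_iters=100, seed=0):
    """Fit lambda = alpha * bpp**beta to (bpp, lambda) samples with a basic PSO.

    Particles live in (ln alpha, beta) space; the loss is the squared error
    in the log domain. Returns (alpha, beta, best-loss history).
    """
    rng = random.Random(seed)

    def loss(p):
        ln_alpha, beta = p
        return sum((ln_alpha + beta * math.log(bpp) - math.log(lam)) ** 2
                   for bpp, lam in samples)

    bounds = [(-5.0, 5.0), (-3.0, 0.0)]  # assumed search ranges for ln(alpha), beta
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0, 0.0] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_loss = [loss(p) for p in pos]
    g = min(range(n_particles), key=pbest_loss.__getitem__)
    gbest, gbest_loss = pbest[g][:], pbest_loss[g]
    history = [gbest_loss]

    w, c1, c2 = 0.7, 1.5, 1.5  # inertia, cognitive and social weights (assumed)
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp to the search range
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            cur = loss(pos[i])
            if cur < pbest_loss[i]:
                pbest[i], pbest_loss[i] = pos[i][:], cur
                if cur < gbest_loss:
                    gbest, gbest_loss = pos[i][:], cur
        history.append(gbest_loss)
    return math.exp(gbest[0]), gbest[1], history


# Synthetic samples from made-up coefficients (alpha = 3.2, beta = -1.37)
samples = [(bpp, 3.2 * bpp ** -1.37) for bpp in (0.02, 0.05, 0.1, 0.2)]
alpha, beta, history = fit_r_lambda_pso(samples)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, final loss = {history[-1]:.3e}")
```

Because the global best is only replaced by a strictly better candidate, the recorded loss never increases from one iteration to the next; how close it gets to the generating coefficients depends on the swarm size and iteration budget.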
The PSNR performance of our proposed model is extensively compared with other state-of-the-art rate control methods for both the dynamic scene and interview scene as shown in Table 6. Our proposed model achieves the highest PSNR for all bit rates in both types of video sequences. This indicates that the inter coding approach should not only consider the inter-block dependency coding structure but also the rate control coefficient.
Additionally, Figure 5 shows a graph of the PSNR difference between consecutive frames. The plot shows that the performance of the proposed method adaptively achieves better results on frame reconstruction from the start of encoding compared to RC-HEVC and PS-GOP. This demonstrates the effective interaction of spatiotemporal features in the rate control model and the crossed LB-PSO model to decide on appropriate rate control coefficients to acquire the target bit rate and perform well in PSNR. Furthermore, Figure 6 indicates the details of the rate fluctuation performance of the proposed method compared to the baselines. This rate fluctuation describes successive frames’ historical bit allocation performance to understand the bit flow in the video codec. Therefore, LB-PSO can control bit allocation better than the baselines, and it can carry out lower bit allocation and produce higher PSNR in most consecutive frames, as shown in Figure 5 and Figure 6.
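As a rough illustration of the two per-frame quantities discussed above, the sketch below computes the PSNR difference between consecutive frames and a simple measure of rate fluctuation. The exact definitions behind Figure 5 and Figure 6 are not spelled out in the text, so the signed difference and the mean absolute change used here are assumptions, and the per-frame numbers are hypothetical.

```python
def consecutive_differences(values):
    """Signed change between each value and its predecessor."""
    return [b - a for a, b in zip(values, values[1:])]


def mean_abs(values):
    """Mean absolute value; 0.0 for an empty sequence."""
    return sum(abs(v) for v in values) / len(values) if values else 0.0


# Hypothetical per-frame measurements for one encoder run
frame_psnr_db = [36.0, 35.5, 35.8, 35.6]
frame_bits = [12000, 9000, 9500, 9200]

psnr_diff = consecutive_differences(frame_psnr_db)
bit_diff = consecutive_differences(frame_bits)

print("PSNR difference per frame:", [round(d, 2) for d in psnr_diff])
print("Mean |PSNR difference| (dB):", round(mean_abs(psnr_diff), 3))
print("Mean |bit change| (bits):", round(mean_abs(bit_diff), 1))
```

Smaller mean absolute changes correspond to smoother quality and a steadier bit flow across successive frames.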
(2) Bit Heatmaps and Visual Quality: To indicate the performance of bit allocation at the CTU level, the heatmap visualization and the subjective results of the reconstructed frame are illustrated in Figure 7 and Figure 8. Since PS-GOP does not modify intra coding, Figure 7 compares only the state-of-the-art RC-HEVC with our proposed learning-based approach. The bit consumption of each CTU is highlighted by the intensity of the red color, while the blue acts as a mask covering the frame. A low red intensity means that the CTU consumes few of the allocated bits. A patch is extracted from the frame to illustrate the greatest difference in CTU-level bit consumption between RC-HEVC and our proposed method. Figure 7b,c reveal that RC-HEVC allocates slightly too many bits to CTUs in flat regions, leaving less of the bit budget for the CTUs that need it. In contrast, our proposed method allocates bits more smoothly in unimportant regions of the image (low-frequency components), leaving more of the budget for CTUs with important features. Additionally, the human face in the intra picture coded by the proposed learning-based approach shows more detail with a smoother look than that of RC-HEVC, as shown in the green box of Figure 7c,d. According to these results, our LB-PSO obtains better bit allocation by mapping the encoder control parameters to the input convolutional feature map of each spatial CTU instead of relying on the fixed initialization of R-λ rate control.
For inter coding, PS-GOP is added to the comparison, and the color representation is the same as for intra coding. Regarding the bit heatmaps, Figure 8b shows that RC-HEVC has a problem with bit allocation on the essential features. Because of the hand movement, RC-HEVC should allocate more bits to these parts; instead, it allocates fewer bits to these blocks. PS-GOP attempts to allocate more of the bit budget to the hand movement area to keep the visual quality of the action consistent. However, the bit budget for the blocks with large hand motion is still small, as shown in Figure 8c.
By using residual semantic information, our proposed method correctly regulates the bit budget in response to the motion information in the scene, as illustrated in Figure 8d. In other words, our proposed method allocates bits to each CTU accurately according to its spatial–temporal characteristics. The visual quality of this hand movement is shown in Figure 8e–g. In particular, RC-HEVC has considerable distortion in the hand movement area. PS-GOP is slightly better than RC-HEVC but still has higher distortion than our proposed method. As a result, the proposed method reconstructs the shapes of the hand and cup better than the competing methods. From our experimental results, we conclude that the proposed learning-based R-λ parameters outperform the other competing methods by achieving the highest PSNR while maintaining the target bit rate.
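The CTU-level bit heatmaps described above can be approximated with a few lines of code. The sketch below is only an assumption about how such a map might be built: it min-max normalizes the bits spent per CTU to a red intensity in [0, 1] and blends from a blue mask toward red. The function names, the linear color blend, and the sample bit counts are all hypothetical.

```python
def red_intensity(ctu_bits):
    """Min-max normalize a 2D grid of per-CTU bit counts to [0, 1]."""
    flat = [b for row in ctu_bits for b in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    # A uniform grid has no contrast, so every CTU maps to 0.0
    return [[(b - lo) / span if span else 0.0 for b in row] for row in ctu_bits]


def overlay_color(intensity):
    """RGB color for one CTU: blue mask at 0.0, fully red at 1.0."""
    red = round(255 * intensity)
    blue = round(255 * (1.0 - intensity))
    return (red, 0, blue)


# Hypothetical bit counts for a frame of 2 x 2 CTUs
ctu_bits = [[100, 400],
            [250, 700]]

for row in red_intensity(ctu_bits):
    print([overlay_color(v) for v in row])
```

With this mapping, the CTU that consumes the most bits in the frame appears fully red and the one that consumes the fewest remains the color of the blue mask.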
(3) Computational Complexity: We compare the computational time of the proposed method with that of RC-HEVC and PS-GOP, measured as the average number of seconds per frame. As indicated in Table 7, our LB-PSO takes 53.30 s/frame, 97.79 s/frame, and 351.10 s/frame at WVGA, HD, and full HD resolution, respectively, which is higher than the baseline methods. This is because our framework is designed as online training that integrates the forward pass of the network with particle swarm optimization. In return, we obtain a higher PSNR while still achieving the target bit rate, and our bit allocation is more accurate than that of the baseline approaches.
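To put the timing figures in proportion, the overhead relative to HM-16.10 can be computed directly from Table 7. The sketch below is simple arithmetic on the reported values; the dictionary layout and function name are illustrative.

```python
# Seconds per frame from Table 7: (HM-16.10, proposed method)
seconds_per_frame = {
    "WVGA": (24.10, 53.30),
    "HD": (45.18, 97.79),
    "Full HD": (166.15, 351.10),
}


def overhead_ratios(timings):
    """Ratio of proposed-method time to HM-16.10 time per resolution."""
    return {name: ours / base for name, (base, ours) in timings.items()}


for name, ratio in overhead_ratios(seconds_per_frame).items():
    print(f"{name}: {ratio:.2f}x the HM-16.10 encoding time")
```

On these numbers the proposed method takes a little over twice as long per frame as the reference encoder at every tested resolution.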

5. Conclusions

In this paper, we proposed novel learning-based R-λ parameters for HEVC. The proposed framework combines a deep convolutional neural network feature map with LB-PSO, which benefits the estimation of rate control parameters for spatial–temporal CTUs. LB-PSO is designed to find a feasible solution for the rate control coefficient parameters that optimizes the RD relationship. Experimental results show that our proposed learning-based approach achieves an accurate target bit rate with a PSNR improvement of 0.19 dB on average (0.41 dB maximum) over the state-of-the-art RC-HEVC and 0.10 dB on average (0.33 dB maximum) over PS-GOP. Owing to its bit allocation, our algorithm achieves an effective bit distribution to each CTU in both intra and inter coding. In other words, our method is effective and robust for determining the bit budget for the CTUs of a frame. For future work, CTU partitioning will be considered together with the R-λ parameters to increase coding efficiency.

Author Contributions

Conceptualization, S.A.; Methodology, S.C.; Software, S.C.; Validation, S.A.; Formal analysis, S.C.; Investigation, S.C.; Resources, S.A.; Data curation, S.C.; Writing—original draft, S.C.; Writing—review & editing, S.A. and Y.M.; Visualization, S.C.; Supervision, S.A. and Y.M.; Project administration, S.A.; Funding acquisition, S.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research work is supported by the Collaborative Research Project entitled Video Processing and Transmission by the JICA Project for AUN/SEED-Net, Japan, Thailand Science research and Innovation Fund Chulalongkorn University (CU_FRB65_ind (9)_157_21_23), (IND66210019), the NSRF via the Program Management Unit for Human Resources Institutional Development, Research and Innovation [grant number B04G640053], and Ratchadaphiseksomphot Endowment Fund (Multimedia Data Analytics and Processing Research Unit).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Owens, J. Television Production; CRC Press: Boca Raton, FL, USA, 2015.
  2. Cisco. Cisco Annual Internet Report—Cisco Annual Internet Report (2018–2023) White Paper. 9 March 2020. Available online: https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual-internet-report/white-paper-c11-741490.html (accessed on 11 February 2021).
  3. Wiegand, T.; Sullivan, G.J.; Bjontegaard, G.; Luthra, A. Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol. 2003, 13, 560–576.
  4. Sullivan, G.J.; Ohm, J.-R.; Han, W.-J.; Wiegand, T. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668.
  5. Correa, G.; Assuncao, P.; Agostini, L.; da Silva Cruz, L.A. Performance and computational complexity assessment of high-efficiency video encoders. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1899–1909.
  6. Grois, D.; Marpe, D.; Mulayoff, A.; Itzhaky, B.; Hadar, O. Performance comparison of H.265/MPEG-HEVC, vp9, and H.264/MPEG-AVC encoders. In Proceedings of the 2013 Picture Coding Symposium (PCS), San Jose, CA, USA, 8–11 December 2013; pp. 394–397.
  7. Sze, V.; Budagavi, M.; Sullivan, G.J. High efficiency video coding (HEVC). In Integrated Circuit and Systems, Algorithms and Architectures; Springer: Berlin/Heidelberg, Germany, 2014; Volume 39, p. 40.
  8. Lainema, J.; Bossen, F.; Han, W.-J.; Min, J.; Ugur, K. Intra coding of the HEVC standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1792–1801.
  9. Fu, C.M.; Alshina, E.; Alshin, A.; Huang, Y.W.; Chen, C.Y.; Tsai, C.Y.; Hsu, C.W.; Lei, S.M.; Park, J.H.; Han, W.J. Sample adaptive offset in the HEVC standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1755–1764.
  10. Lv, H.; Wang, R.; Xie, X.; Jia, H.; Gao, W. A comparison of fractional-pel interpolation filters in HEVC and H.264/AVC. In Proceedings of the 2012 Visual Communications and Image Processing, San Diego, CA, USA, 27–30 November 2012; pp. 1–6.
  11. Lv, H.; Wang, R.; Xie, X.; Jia, H.; Gao, W. λ domain rate control algorithm for High Efficiency Video Coding. IEEE Trans. Image Process. 2014, 23, 3841–3854.
  12. He, Z.; Kim, Y.K.; Mitra, S.K. Low-delay rate control for DCT video coding via ρ-domain source modeling. IEEE Trans. Circuits Syst. Video Technol. 2001, 11, 928–940.
  13. Liu, M.; Guo, Y.; Li, H.; Chen, C.W. Low-complexity rate control based on ρ-domain model for scalable video coding. In Proceedings of the 2010 IEEE International Conference on Image Processing, Hong Kong, China, 26–29 September 2010; pp. 1277–1280.
  14. HEVC Reference Software. 2014. Available online: http://hevc.kw.bbc.co.uk/trac/browser/jctvc-hm/tags (accessed on 16 February 2023).
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  16. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
  17. Mohamed, A.W.; Sabry, H.Z. Constrained optimization based on modified differential evolution algorithm. Inf. Sci. 2012, 194, 171–208.
  18. Bertsekas, D.P. Constrained Optimization and Lagrange Multiplier Methods; Academic Press: Cambridge, MA, USA, 2014.
  19. Beyer, H.-G.; Schwefel, H.-P. Evolution strategies—A comprehensive introduction. Nat. Comput. 2002, 1, 3–52.
  20. Dowsland, K.A.; Thompson, J. Simulated annealing. In Handbook of Natural Computing; Springer: Berlin/Heidelberg, Germany, 2012; pp. 1623–1655.
  21. Mirjalili, S. Genetic algorithm. In Evolutionary Algorithms and Neural Networks; Springer: Berlin/Heidelberg, Germany, 2019; pp. 43–55.
  22. Parsopoulos, K.E.; Vrahatis, M.N. Recent approaches to global optimization problems through particle swarm optimization. Nat. Comput. 2002, 1, 235–306.
  23. Du, K.-L.; Swamy, M. Particle swarm optimization. In Search and Optimization by Metaheuristics; Springer: Berlin/Heidelberg, Germany, 2016; pp. 153–173.
  24. Tang, B.; Zhu, Z.; Luo, J. A framework for constrained optimization problems based on a modified particle swarm optimization. Math. Probl. Eng. 2016, 2016, 8627083.
  25. Khalili-Damghani, K.; Abtahi, A.-R.; Tavana, M. A new multi-objective particle swarm optimization method for solving reliability redundancy allocation problems. Reliab. Eng. Syst. Saf. 2013, 111, 58–75.
  26. Zhang, Y.; Wu, L.; Wang, S. UCAV path planning by fitness-scaling adaptive chaotic particle swarm optimization. Math. Probl. Eng. 2013, 2013, 705238.
  27. Xu, L.; Wang, J.; Li, Y.-P.; Li, Q.; Zhang, X. Resource allocation algorithm based on hybrid particle swarm optimization for multiuser cognitive OFDM network. Expert Syst. Appl. 2015, 42, 7186–7194.
  28. Darwish, A.; Ezzat, D.; Hassanien, A.E. An optimized model based on convolutional neural networks and orthogonal learning particle swarm optimization algorithm for plant diseases diagnosis. Swarm Evol. Comput. 2020, 52, 100616.
  29. Choi, H.; Yoo, J.; Nam, J.; Sim, D.; Bajić, I.V. Pixel-wise unified rate-quantization model for multi-level rate control. IEEE J. Sel. Top. Signal Process. 2013, 7, 1112–1123.
  30. Hu, S.; Wang, H.; Kwong, S. Adaptive quantization-parameter clip scheme for smooth quality in H.264/AVC. IEEE Trans. Image Process. 2011, 21, 1911–1919.
  31. Wang, S.; Ma, S.; Wang, S.; Zhao, D.; Gao, W. Quadratic ρ-domain based rate control algorithm for HEVC. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 1695–1699.
  32. Wang, S.; Ma, S.; Wang, S.; Zhao, D.; Gao, W. Rate-GOP based rate control for high efficiency video coding. IEEE J. Sel. Top. Signal Process. 2013, 7, 1101–1111.
  33. Sanz-Rodríguez, S.; Schierl, T. A rate control algorithm for HEVC with hierarchical GOP structures. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 1719–1723.
  34. Wang, M.; Ngan, K.N.; Li, H. Low-delay rate control for consistent quality using distortion-based Lagrange multiplier. IEEE Trans. Image Process. 2016, 25, 2943–2955.
  35. Bossen, F. Common Test Conditions and Software Reference Configurations. JCTVC-J1100 2012, 12.
  36. Chen, S.; Aramvith, S.; Miyanaga, Y. Encoder Control Enhancement in HEVC Based on R-Lambda Coefficient Distribution. In Proceedings of the 2019 International Symposium on Multimedia and Communication Technology (ISMAC), Quezon City, Philippines, 19–21 August 2019; pp. 1–4.
  37. Guo, H.; Zhu, C.; Xu, M.; Li, S. Inter-block dependency-based CTU level rate control for HEVC. IEEE Trans. Broadcast. 2019, 66, 113–126.
  38. Zhou, M.; Wei, X.; Kwong, S.; Jia, W.; Fang, B. Rate control method based on deep reinforcement learning for dynamic video sequences in HEVC. IEEE Trans. Multimed. 2020, 23, 1106–1121.
  39. Marzuki, I.; Lee, J.; Wiratama, W.; Sim, D. Deep convolutional feature-driven rate control for the HEVC encoders. IEEE Access 2021, 9, 162018–162034.
  40. Guo, Y.; Liu, Y.; Oerlemans, A.; Lao, S.; Wu, S.; Lew, M.S. Deep learning for visual understanding: A review. Neurocomputing 2016, 187, 27–48.
  41. ISO/IEC-JCT1/SC29/WG11; Common Test Conditions of 3DV Core Experiments. ISO: San Jose, CA, USA, 2014.
  42. Xiph.org. Xiph.org Video Test Media. 2017. Available online: https://media.xiph.org/video/derf/ (accessed on 16 February 2023).
  43. Mercat, A.; Viitanen, M.; Vanne, J. UVG dataset: 50/120fps 4K sequences for video codec analysis and development. In Proceedings of the 11th ACM Multimedia Systems Conference, Istanbul, Turkey, 8–11 June 2020; Available online: https://ultravideo.fi/#testsequences (accessed on 16 February 2023).
  44. Li, L.; Li, B.; Li, H.; Chen, C.W. λ-domain optimal bit allocation algorithm for High Efficiency Video Coding. IEEE Trans. Circuits Syst. Video Technol. 2016, 28, 130–142.
  45. Guo, H.; Zhu, C.; Li, S.; Gao, Y. Optimal bit allocation at frame level for rate control in HEVC. IEEE Trans. Broadcast. 2018, 65, 270–281.
Figure 1. Learning-Based Rate Control Diagram for High Efficiency Video Coding.
Figure 2. Overview of proposed learning-based particle swarm optimization.
Figure 3. Learning-based particle swarm optimization flowchart.
Figure 4. Rate–Distortion curves: (a) BQSquare, (b) PartyScene, (c) FourPeople, (d) ParkScene.
Figure 5. Comparison of PSNR difference between consecutive frames.
Figure 6. Rate fluctuation performance comparison.
Figure 7. Bit heatmaps and reconstructed frame of intra coding at 384 kbps: (a) original frame, (b,d) RC-HEVC, and (c,e) proposed method.
Figure 8. Bit heatmaps and reconstructed frame of inter coding at 384 kbps: (a) original frame, (b,e) RC-HEVC, (c,f) PS-GOP, and (d,g) proposed method.
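Table 1. Characteristics of Test Video Sequences and Bit Rate.

| Resolution | Name of Video Sequence | Frame Rate (fps) | Bit Rate (kbps) |
|---|---|---|---|
| 3840 × 2160 | HoneyBee | 120 | 1000, 2000, 3000, 4000 |
| 3840 × 2160 | Jocky | 120 | 1000, 2000, 3000, 4000 |
| 1920 × 1080 | ParkScene | 24 | 1000, 2000, 3000, 4000 |
| 1920 × 1080 | Cactus | 50 | 1000, 2000, 3000, 4000 |
| 1920 × 1080 | BQTerrace | 60 | 1000, 2000, 3000, 4000 |
| 1280 × 720 | FourPeople | 60 | 384, 512, 850, 1200 |
| 1280 × 720 | KristenAndSara | 60 | 384, 512, 850, 1200 |
| 1280 × 720 | Vidyo1 | 60 | 384, 512, 850, 1200 |
| 1280 × 720 | Vidyo3 | 60 | 384, 512, 850, 1200 |
| 1280 × 720 | Vidyo4 | 60 | 384, 512, 850, 1200 |
| 832 × 480 | BasketballDrillText | 50 | 384, 512, 768, 1200 |
| 832 × 480 | PartyScene | 50 | 384, 512, 768, 1200 |
| 832 × 480 | BQMall | 60 | 384, 512, 768, 1200 |
| 416 × 240 | BlowingBubbles | 50 | 256, 384, 512, 1200 |
| 416 × 240 | BQSquare | 60 | 256, 384, 512, 1200 |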
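Table 2. The Performance of PSNR and BRE of Video Sequence with Resolution of 416 × 240.

| Name of Video Sequence | Target Bit Rate (kbps) | RC-HEVC Bit Rate (kbps) | RC-HEVC PSNR (dB) | RC-HEVC BRE (%) | PS-GOP Bit Rate (kbps) | PS-GOP PSNR (dB) | PS-GOP BRE (%) | Proposed Bit Rate (kbps) | Proposed PSNR (dB) | Proposed BRE (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| BlowingBubbles | 256 | 256.06 | 29.69 | −0.02 | 256.08 | 29.79 | −0.03 | 256.02 | 29.99 | −0.01 |
| BlowingBubbles | 384 | 384.05 | 31.14 | −0.01 | 384.00 | 31.26 | 0.00 | 384.02 | 31.44 | −0.01 |
| BlowingBubbles | 512 | 512.06 | 32.26 | −0.01 | 512.05 | 32.38 | −0.01 | 512.04 | 32.51 | −0.01 |
| BlowingBubbles | 1200 | 1200.18 | 35.64 | −0.02 | 1200.05 | 35.71 | 0.00 | 1200.15 | 35.73 | −0.01 |
| BQSquare | 256 | 256.04 | 30.31 | −0.02 | 256.01 | 30.42 | −0.01 | 256.02 | 30.60 | −0.01 |
| BQSquare | 384 | 384.03 | 31.53 | −0.01 | 384.03 | 31.67 | −0.01 | 384.03 | 31.78 | −0.01 |
| BQSquare | 512 | 512.03 | 32.45 | −0.01 | 512.03 | 32.56 | −0.01 | 512.02 | 32.64 | 0.00 |
| BQSquare | 1200 | 1200.06 | 35.20 | 0.00 | 1200.04 | 35.33 | 0.00 | 1200.04 | 35.37 | 0.00 |
| Average | | | 32.28 | −0.01 | | 32.39 | −0.01 | | 32.51 | −0.01 |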
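Table 3. The Performance of PSNR and BRE of Video Sequence with Resolution of 832 × 480.

| Name of Video Sequence | Target Bit Rate (kbps) | RC-HEVC Bit Rate (kbps) | RC-HEVC PSNR (dB) | RC-HEVC BRE (%) | PS-GOP Bit Rate (kbps) | PS-GOP PSNR (dB) | PS-GOP BRE (%) | Proposed Bit Rate (kbps) | Proposed PSNR (dB) | Proposed BRE (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| BasketballDrillText | 384 | 384.03 | 30.82 | −0.01 | 383.99 | 30.93 | 0.00 | 384.02 | 30.99 | −0.01 |
| BasketballDrillText | 512 | 512.05 | 31.94 | −0.01 | 512.00 | 32.01 | 0.00 | 511.99 | 32.08 | 0.00 |
| BasketballDrillText | 768 | 768.04 | 33.46 | −0.01 | 768.04 | 33.52 | −0.01 | 768.05 | 33.60 | −0.01 |
| BasketballDrillText | 1200 | 1200.10 | 35.15 | −0.01 | 1200.07 | 35.20 | −0.01 | 1200.07 | 35.32 | −0.01 |
| PartyScene | 384 | 384.01 | 26.40 | 0.00 | 384.00 | 26.49 | 0.00 | 383.97 | 26.80 | 0.01 |
| PartyScene | 512 | 512.02 | 27.27 | 0.00 | 512.01 | 27.37 | 0.00 | 511.96 | 27.68 | 0.01 |
| PartyScene | 768 | 768.09 | 28.61 | −0.01 | 768.02 | 28.68 | 0.00 | 768.02 | 29.01 | 0.00 |
| PartyScene | 1200 | 1200.06 | 30.15 | −0.01 | 1200.02 | 30.20 | 0.00 | 1200.03 | 30.53 | 0.00 |
| BQMall | 384 | 384.01 | 30.68 | 0.00 | 384.13 | 30.77 | −0.03 | 384.00 | 30.85 | 0.00 |
| BQMall | 512 | 512.01 | 31.86 | 0.00 | 512.05 | 31.92 | −0.01 | 512.03 | 32.00 | −0.01 |
| BQMall | 768 | 768.01 | 33.50 | 0.00 | 768.01 | 33.59 | 0.00 | 768.01 | 33.66 | 0.00 |
| BQMall | 1200 | 1200.04 | 35.28 | 0.00 | 1200.03 | 35.33 | 0.00 | 1200.01 | 35.39 | 0.00 |
| Average | | | 31.26 | −0.01 | | 31.33 | −0.01 | | 31.49 | 0.00 |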
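Table 4. The Performance of PSNR and BRE of Video Sequence with Resolution of 1280 × 720.

| Name of Video Sequence | Target Bit Rate (kbps) | RC-HEVC Bit Rate (kbps) | RC-HEVC PSNR (dB) | RC-HEVC BRE (%) | PS-GOP Bit Rate (kbps) | PS-GOP PSNR (dB) | PS-GOP BRE (%) | Proposed Bit Rate (kbps) | Proposed PSNR (dB) | Proposed BRE (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| FourPeople | 384 | 383.97 | 37.02 | 0.01 | 383.99 | 37.12 | 0.00 | 383.99 | 37.32 | 0.00 |
| FourPeople | 512 | 511.97 | 38.10 | 0.01 | 512.00 | 38.24 | 0.00 | 511.99 | 38.38 | 0.00 |
| FourPeople | 850 | 849.98 | 39.84 | 0.00 | 849.99 | 39.94 | 0.00 | 849.98 | 40.06 | 0.00 |
| FourPeople | 1200 | 1200.08 | 40.81 | −0.01 | 1199.96 | 40.87 | 0.00 | 1200.05 | 40.97 | 0.00 |
| KristenAndSara | 384 | 384.06 | 39.17 | −0.02 | 384.08 | 39.32 | −0.02 | 384.12 | 39.37 | −0.03 |
| KristenAndSara | 512 | 512.07 | 40.03 | −0.01 | 512.09 | 40.17 | −0.02 | 512.11 | 40.20 | −0.02 |
| KristenAndSara | 850 | 850.12 | 41.31 | −0.01 | 850.09 | 41.43 | −0.01 | 850.12 | 41.47 | −0.01 |
| KristenAndSara | 1200 | 1200.18 | 42.04 | −0.01 | 1200.16 | 42.12 | −0.01 | 1200.16 | 42.16 | −0.01 |
| Vidyo1 | 384 | 384.00 | 38.95 | 0.00 | 383.98 | 39.06 | 0.01 | 384.00 | 39.11 | 0.00 |
| Vidyo1 | 512 | 512.01 | 39.86 | 0.00 | 511.93 | 39.95 | 0.01 | 511.99 | 40.01 | 0.00 |
| Vidyo1 | 850 | 849.96 | 41.19 | 0.00 | 849.88 | 41.26 | 0.01 | 850.01 | 41.32 | 0.00 |
| Vidyo1 | 1200 | 1200.00 | 41.93 | 0.00 | 1199.96 | 42.00 | 0.00 | 1200.01 | 42.07 | 0.00 |
| Vidyo3 | 384 | 384.01 | 37.85 | 0.00 | 384.00 | 38.00 | 0.00 | 384.02 | 38.01 | −0.01 |
| Vidyo3 | 512 | 512.02 | 38.82 | 0.00 | 512.01 | 38.95 | 0.00 | 512.01 | 38.97 | 0.00 |
| Vidyo3 | 850 | 850.01 | 40.22 | 0.00 | 850.01 | 40.33 | 0.00 | 850.01 | 40.37 | 0.00 |
| Vidyo3 | 1200 | 1200.02 | 41.00 | 0.00 | 1200.03 | 41.08 | 0.00 | 1200.00 | 41.12 | 0.00 |
| Vidyo4 | 384 | 384.01 | 38.68 | 0.00 | 384.01 | 38.73 | 0.00 | 384.01 | 38.86 | 0.00 |
| Vidyo4 | 512 | 512.02 | 39.47 | 0.00 | 512.01 | 39.53 | 0.00 | 512.02 | 39.67 | 0.00 |
| Vidyo4 | 850 | 850.02 | 40.67 | 0.00 | 850.01 | 40.74 | 0.00 | 850.02 | 40.86 | 0.00 |
| Vidyo4 | 1200 | 1200.02 | 41.39 | 0.00 | 1200.05 | 41.45 | 0.00 | 1200.02 | 41.54 | 0.00 |
| Average | | | 39.92 | 0.00 | | 40.02 | 0.00 | | 40.09 | 0.00 |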
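Table 5. The Performance of PSNR and BRE of Video Sequence with Resolution of 1920 × 1080 and 4k.

| Name of Video Sequence | Target Bit Rate (kbps) | RC-HEVC Bit Rate (kbps) | RC-HEVC PSNR (dB) | RC-HEVC BRE (%) | PS-GOP Bit Rate (kbps) | PS-GOP PSNR (dB) | PS-GOP BRE (%) | Proposed Bit Rate (kbps) | Proposed PSNR (dB) | Proposed BRE (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| ParkScene | 1000 | 999.96 | 33.20 | 0.00 | 999.84 | 33.21 | 0.02 | 999.86 | 33.32 | 0.01 |
| ParkScene | 2000 | 2000.01 | 35.30 | 0.00 | 1999.89 | 35.41 | 0.01 | 2000.10 | 35.49 | 0.00 |
| ParkScene | 3000 | 2999.95 | 36.60 | 0.00 | 2999.91 | 36.68 | 0.00 | 2999.98 | 36.76 | 0.00 |
| ParkScene | 4000 | 4000.11 | 37.52 | 0.00 | 4000.09 | 37.57 | 0.00 | 4000.11 | 37.66 | 0.00 |
| Cactus | 1000 | 1000.01 | 31.62 | 0.00 | 1000.02 | 31.75 | 0.00 | 1000.02 | 31.74 | 0.00 |
| Cactus | 2000 | 2000.04 | 33.77 | 0.00 | 2000.03 | 33.85 | 0.00 | 2000.03 | 33.87 | 0.00 |
| Cactus | 3000 | 3000.09 | 34.96 | 0.00 | 3000.03 | 35.01 | 0.00 | 3000.03 | 35.04 | 0.00 |
| Cactus | 4000 | 4000.06 | 35.70 | 0.00 | 3999.95 | 35.77 | 0.00 | 4000.07 | 35.81 | 0.00 |
| BQTerrace | 1000 | 1000.05 | 31.62 | −0.01 | 1000.01 | 31.73 | 0.00 | 1000.17 | 31.97 | −0.02 |
| BQTerrace | 2000 | 2000.13 | 33.03 | −0.01 | 2000.02 | 33.11 | 0.00 | 2000.04 | 33.25 | 0.00 |
| BQTerrace | 3000 | 3000.15 | 33.67 | 0.00 | 3000.01 | 33.78 | 0.00 | 3000.08 | 33.82 | 0.00 |
| BQTerrace | 4000 | 4000.53 | 34.10 | −0.01 | 4000.05 | 34.20 | 0.00 | 4000.04 | 34.15 | 0.00 |
| HoneyBee | 1000 | 1000.01 | 38.24 | 0.00 | 1000.00 | 38.25 | 0.00 | 1000.03 | 38.31 | 0.00 |
| HoneyBee | 2000 | 2000.01 | 38.63 | 0.00 | 2000.00 | 38.65 | 0.00 | 2000.01 | 38.66 | 0.00 |
| HoneyBee | 3000 | 3000.01 | 38.75 | 0.00 | 3000.01 | 38.78 | 0.00 | 3000.01 | 38.78 | 0.00 |
| HoneyBee | 4000 | 4000.40 | 38.81 | −0.01 | 4000.01 | 38.83 | 0.00 | 4000.02 | 38.83 | 0.00 |
| Jocky | 1000 | 999.98 | 32.30 | 0.00 | 1000.01 | 32.40 | 0.00 | 1000.00 | 32.40 | 0.00 |
| Jocky | 2000 | 2000.03 | 35.55 | 0.00 | 2000.01 | 35.60 | 0.00 | 2000.00 | 35.61 | 0.00 |
| Jocky | 3000 | 3000.00 | 36.95 | 0.00 | 3000.04 | 36.97 | 0.00 | 3000.06 | 36.99 | 0.00 |
| Jocky | 4000 | 4000.00 | 37.68 | 0.00 | 3999.99 | 37.69 | 0.00 | 4000.02 | 37.71 | 0.00 |
| Average | | | 35.40 | 0.00 | | 35.46 | 0.00 | | 35.51 | 0.00 |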
Table 6. PSNR Comparisons at different bit rates with other state-of-the-art rate control schemes.

| Name of Video Sequence | Bit Rate (kbps) | BA [44] PSNR (dB) | BAF [45] PSNR (dB) | RCA [37] PSNR (dB) | Proposed Method PSNR (dB) |
|---|---|---|---|---|---|
| FourPeople | 384 | 36.30 | 36.81 | 37.07 | 37.32 |
| FourPeople | 512 | 37.49 | 38.19 | 38.31 | 38.38 |
| FourPeople | 850 | 39.76 | 39.98 | 40.03 | 40.06 |
| FourPeople | 1200 | 40.52 | 40.69 | 40.89 | 40.97 |
| BasketballDrillText | 384 | 30.82 | 30.81 | 30.89 | 30.99 |
| BasketballDrillText | 512 | 31.87 | 31.86 | 31.91 | 32.08 |
| BasketballDrillText | 850 | 33.41 | 33.44 | 33.52 | 33.60 |
| BasketballDrillText | 1200 | 34.91 | 34.96 | 35.19 | 35.32 |
Table 7. Computational Complexity (measured on an Intel Core i9-7960X CPU @ 2.80 GHz).

| Resolution | HM-16.10 (s/frame) | PS-GOP (s/frame) | Proposed Method (s/frame) |
|---|---|---|---|
| WVGA | 24.10 | 23.75 | 53.30 |
| HD | 45.18 | 44.92 | 97.79 |
| Full HD | 166.15 | 165.47 | 351.10 |
| Average | 78.48 | 78.04 | 167.40 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
