To evaluate the performance of the proposed learning-based particle swarm optimization, the experiments are conducted on various videos, including static and dynamic scenes.
4.1. Experiment Setting
In the experiment, the proposed algorithm is implemented on the HEVC reference software [14] and is compared with PS-GOP [36] and the state-of-the-art R–λ rate control (RC-HEVC) [11]. According to the HEVC common parameter setting [3], the largest CTU size yields the highest coding efficiency; in HEVC, the largest feasible CTU is a 64 × 64 block. We have also designed the model to adapt the bit allocation of each CTU to its spatial information, which is extracted using a pre-trained CNN model. Since CNN feature extraction is applied to the largest CTU size in HEVC, we transform each YUV420 CTU into a true-color (64 × 64 × 3) CTU as the input to the feature extraction block. The proposed algorithm and the baseline methods are simulated in the same reference software, HM-16.10. The experiments are conducted under the low-delay P main profile configuration, and the encoder parameters are set according to the standard setting in [35] with rate control enabled. In addition, the proposed LB-PSO runs 100 iterations in every decision-making process for each rate control parameter prediction. There are fifteen test video sequences with five video resolutions, including two videos of 240p (wide quarter video graphics array, WQVGA) [41], three videos of 480p (wide video graphics array, WVGA) [41], five videos of 720p (HD) [42], three videos of 1080p (full HD) [41], and two videos of 4k resolution [43].
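The transformation of a YUV420 CTU into a 64 × 64 × 3 true-color input mentioned above can be sketched as follows. This is only an illustration: it assumes 8-bit samples, nearest-neighbor chroma upsampling, and full-range BT.601 conversion coefficients, none of which are specified in the text, and the function name `ctu_yuv420_to_rgb` is hypothetical.

```python
import numpy as np

def ctu_yuv420_to_rgb(y, u, v):
    """Convert one 8-bit YUV420 CTU to an RGB array of shape (64, 64, 3).

    y: (64, 64) luma block; u, v: (32, 32) chroma blocks (4:2:0 subsampling).
    Assumptions (not from the paper): nearest-neighbor chroma upsampling
    and full-range BT.601 conversion coefficients.
    """
    y = y.astype(np.float64)
    # Upsample chroma to luma resolution by repeating each sample 2x2,
    # then remove the 128 offset of 8-bit chroma.
    u = np.repeat(np.repeat(u, 2, axis=0), 2, axis=1).astype(np.float64) - 128.0
    v = np.repeat(np.repeat(v, 2, axis=0), 2, axis=1).astype(np.float64) - 128.0
    r = y + 1.402 * v
    g = y - 0.344136 * u - 0.714136 * v
    b = y + 1.772 * u
    rgb = np.stack([r, g, b], axis=-1)
    # Round and clip back to the valid 8-bit range.
    return np.clip(np.rint(rgb), 0, 255).astype(np.uint8)
```

The resulting array has the three-channel layout that image CNNs typically expect, so it could be passed to a pre-trained feature extractor after whatever normalization that network requires.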
Table 1 briefly summarizes the characteristics of the test video sequences. In addition, each test video sequence is encoded at four target bit rates corresponding to its resolution. Since the goal of rate control is not only to improve the visual quality of the video for a given bit rate but also to achieve a bit rate as close as possible to the target, both peak signal-to-noise ratio (PSNR) and bit rate error are used as the criteria for evaluating the rate control algorithm. They are computed as in (22) and (23), where n represents the bit depth.
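As a rough sketch, the two criteria can be computed with their conventional definitions. The code below assumes that PSNR is derived from the mean squared error with a peak value of 2^n − 1, and that the bit rate error is the absolute deviation from the target expressed as a percentage; the exact formulation used in the paper may differ, and the function names are hypothetical.

```python
import numpy as np

def psnr(reference, reconstructed, bit_depth=8):
    """Conventional PSNR in dB for samples with the given bit depth n."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    peak = 2 ** bit_depth - 1  # 255 for 8-bit video
    return float(10.0 * np.log10(peak ** 2 / mse))

def bit_rate_error(target_kbps, actual_kbps):
    """Absolute deviation from the target bit rate, in percent (assumed form)."""
    return abs(actual_kbps - target_kbps) / target_kbps * 100.0
```

Under this assumed definition, an encoder that produces 256.0256 kbps for a 256 kbps target has a bit rate error of about 0.01%.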
4.2. Experimental Results and Analysis
(1) R–D performance and Bit Rate Accuracy: The first experiment was conducted on the low video resolution (WQVGA), which contains two video sequences with different frame rates, including BlowingBubbles and BQSquare. These two videos have various dynamic characteristics, such as a moving camera, moving objects, and illumination changes.
Table 2 reports the PSNR and bit rate error performance of the proposed method compared with the baseline methods. Our learning-based method outperforms all the baseline methods, achieving the highest PSNR value at the same bit rate. Specifically, the average PSNR enhancement of our method is 0.23 dB and 0.12 dB compared with RC-HEVC and PS-GOP, respectively. Our approach also achieves a maximum PSNR improvement of 0.30 dB and 0.20 dB compared with RC-HEVC and PS-GOP, respectively.
Figure 4a illustrates the R–D performance curve of the BQSquare test sequence. The learning-based approach obtains better R–D performance than the baseline methods. In addition, the average bit rate error of RC-HEVC, PS-GOP, and our method is 0.01%, indicating that all approaches can effectively achieve the target bit rate. However, the proposed method has the lowest bit rate error at a lower target bit rate (256 kbps). RC-HEVC has the poorest visual quality among all approaches on these WQVGA sequences with dynamic scenes. As a result, even when the scene has dynamic properties, our algorithm can achieve the target bit rate with good visual quality on the WQVGA sequences.
Next, the WVGA sequences BasketballDrillText, PartyScene, and BQMall were tested. The scene properties are similar to those in the above experiment, but these WVGA sequences are more challenging than the WQVGA ones because they involve multi-object movement, camera movement, and higher resolution. The PSNR and bit rate error results are summarized in Table 3, where the proposed learning-based method performs considerably better: its visual quality is up to 0.41 dB and 0.33 dB higher than that of RC-HEVC and PS-GOP, respectively. On average, our approach has no bit rate error and achieves 0.23 dB and 0.16 dB higher PSNR than RC-HEVC and PS-GOP, respectively. As shown in Figure 4b, the R–D curve of our proposed method is also significantly higher than those of the competing methods. Based on the results of all approaches in Table 2 and Table 3, the R–λ rate control and PS-GOP are unsuitable for such dynamic scenes and cameras. This indicates that the adjustment and quality control are not correctly estimated.
After testing the WVGA sequences, the HD videos, which contain video conferencing and online teaching test sequences, were simulated. The HD videos are FourPeople, KristenAndSara, Vidyo1, Vidyo3, and Vidyo4. These videos are characterized by a static camera with multiple moving objects. Figure 4c shows the overall R–D curve of FourPeople from the low bit rate to the high bit rate. Although the scene is captured with a static camera, the R–D performance of the proposed method is noticeably better than that of the competing methods. Additionally, the PSNR and bit rate error evaluations of these HD video sequences are recorded in Table 4, where the average PSNR enhancement of our method is approximately 0.17 dB (maximum 0.30 dB) and 0.08 dB (maximum 0.21 dB) in comparison with RC-HEVC and PS-GOP, respectively.
The last experiment was conducted on full HD and 4k video test sequences. The first three videos, ParkScene, Cactus, and BQTerrace, were used for the full HD experiment. The last two sequences, HoneyBee and Jockey, were used for the 4k experiment. This last test covered all types of scenarios. The ParkScene and Jockey videos have a moving camera and multiple object motions, while the BQTerrace video combines camera motion with static-camera shots. Furthermore, the Cactus video consists of a static camera and rotating objects, and the HoneyBee video has multiple object motions and a static camera. According to Table 5, the overall evaluation of the proposed method on the BQTerrace sequence at a low bit rate is the highest compared to the other sequences. In contrast, the ParkScene sequence has the highest value at a high bit rate. The reason is that scenes containing a dynamic camera have significant movement changes; thus, the state-of-the-art R–λ rate control cannot update the encoding controller correctly. In addition, PS-GOP shares parameters within a GOP, which is not enough to adapt the encoder parameters to the frame characteristics. For this reason, our method establishes a novel mapping between frame features and the R–λ coefficient parameters. We provide a computationally feasible solution using LB-PSO to produce optimal R–D performance for good visual quality while maintaining the target bit rate.
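To make the search procedure concrete, the following is a generic particle swarm optimizer with a fixed budget of 100 iterations. It is not the authors' LB-PSO: the objective function, swarm size, inertia weight, and acceleration coefficients are illustrative assumptions, and `pso_minimize` is a hypothetical name. In the setting described above, the objective would be a cost derived from the rate control parameters, which is not reproduced here.

```python
import random

def pso_minimize(objective, bounds, n_particles=20, n_iters=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Generic PSO: minimize `objective` over a box given by `bounds`."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the bounds, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # personal best positions
    pbest_val = [objective(p) for p in pos]     # personal best costs
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # global best so far

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + cognitive pull + social pull.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Move the particle and clamp it to the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

For instance, `pso_minimize(lambda p: (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2, [(-10, 10), (-10, 10)])` searches a two-parameter toy cost whose minimum lies at (3, −1). The fixed iteration budget keeps the per-decision cost bounded, at the price of repeated objective evaluations.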
Figure 4 shows the overall R–D curves for the different video resolutions. Our method achieves the best results among all competing methods. Across Table 2 to Table 5, the average PSNR improvement is 0.19 dB (maximum 0.41 dB) and 0.10 dB (maximum 0.33 dB) compared with RC-HEVC and PS-GOP, respectively.
The PSNR performance of our proposed model is also compared with other state-of-the-art rate control methods for both the dynamic scene and the interview scene, as shown in Table 6. Our proposed model achieves the highest PSNR for all bit rates in both types of video sequences. This indicates that the inter coding approach should consider not only the inter-block dependency coding structure but also the rate control coefficient.
Additionally, Figure 5 plots the PSNR difference between consecutive frames. The plot shows that the proposed method adaptively achieves better frame reconstruction from the start of encoding compared to RC-HEVC and PS-GOP. This demonstrates the effective interaction between the spatiotemporal features in the rate control model and the crossed LB-PSO model in deciding on appropriate rate control coefficients to reach the target bit rate and perform well in PSNR. Furthermore, Figure 6 details the rate fluctuation performance of the proposed method compared to the baselines. The rate fluctuation describes the bit allocation history over successive frames and helps in understanding the bit flow in the video codec. As shown in Figure 5 and Figure 6, LB-PSO controls bit allocation better than the baselines: it allocates fewer bits and produces higher PSNR in most consecutive frames.
(2) Bit Heatmaps and Visual Quality: To show the performance of bit allocation at the CTU level, heatmap visualizations and subjective results of the reconstructed frames are illustrated in Figure 7 and Figure 8. Since PS-GOP does not modify intra coding, Figure 7 compares only the state-of-the-art RC-HEVC with our proposed learning-based approach. The bit consumption of each CTU is highlighted by the intensity of the red color, while blue acts as a mask covering the frame. A low red intensity means that the CTU consumes fewer bits. The image patch is extracted from the frame to illustrate the greatest difference in CTU-level bit consumption between RC-HEVC and our proposed method. Figure 7b,c reveal that RC-HEVC allocates slightly too many bits to plain-area CTUs, leaving less of the bit budget for the spatially important CTUs. On the contrary, our proposed method obtains smoother bit allocation on non-important spatial regions (low-frequency components), providing more budget to important CTU features. Additionally, the human face in the intra-picture coded by the proposed learning-based approach shows more detail with a smoother look than that of RC-HEVC, as shown in the green box of Figure 7c,d. According to these results, our LB-PSO obtains better bit allocation by mapping the encoder control parameters to the input convolutional feature map of each spatial CTU instead of using the fixed initialization of the R–λ rate control.
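A heatmap of this kind can be sketched from a grid of per-CTU bit counts. The snippet below is only an illustration of the color convention described above (red intensity proportional to bit consumption, a constant blue layer as the mask); the normalization by the maximum value, the blue level, and the name `ctu_bit_heatmap` are assumptions rather than details from the paper.

```python
import numpy as np

def ctu_bit_heatmap(ctu_bits, ctu_size=64, blue_level=128):
    """Build an RGB heatmap image from a 2-D grid of per-CTU bit counts.

    Red intensity scales with the bits spent on each CTU (normalized by the
    largest value in the frame, an assumed choice); blue is a constant mask.
    """
    bits = np.asarray(ctu_bits, dtype=np.float64)
    peak = bits.max()
    norm = bits / peak if peak > 0 else np.zeros_like(bits)
    red = np.rint(norm * 255).astype(np.uint8)
    # Expand each CTU value to a ctu_size x ctu_size block of pixels.
    red_px = np.kron(red, np.ones((ctu_size, ctu_size), dtype=np.uint8))
    heat = np.zeros(red_px.shape + (3,), dtype=np.uint8)
    heat[..., 0] = red_px       # red channel: bit consumption
    heat[..., 2] = blue_level   # blue channel: uniform mask
    return heat
```

The output covers a whole number of CTUs, so for frame sizes that are not multiples of the CTU size it would need to be cropped before being blended with the frame.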
For inter coding, PS-GOP is added to the comparison. The color representation is the same as for intra coding. Regarding the bit heatmaps, Figure 8b shows that RC-HEVC has a problem with bit allocation on the essential features. Because of the hand movement, RC-HEVC should allocate more bits to these important regions; instead, it allocates fewer bits to these blocks. PS-GOP attempts to allocate part of the bit budget to the hand movement area to keep the visual quality of the action consistent. However, the bit budget for the blocks with large hand motion is still small, as shown in Figure 8c.
Regarding residual semantic information, our proposed method correctly regulates the bit budget in response to the motion information in the scene, as illustrated in Figure 8d. That is, our proposed method obtains accurate bit allocation for each CTU according to its spatial–temporal characteristics. Furthermore, the visual quality of this hand movement is shown in Figure 8e–g. In particular, RC-HEVC has considerable distortion in the hand movement area. PS-GOP is slightly better than RC-HEVC but still has higher distortion than our proposed method. As a result, the proposed method reconstructs the hand and cup shapes better than the competing methods. According to our experimental results, we can conclude that the proposed learning-based R–λ parameter prediction outperforms the other competing methods by achieving the highest PSNR while maintaining the target bit rate.
(3) Computational Complexity: We compare the computational time of the proposed method with that of RC-HEVC and PS-GOP. In terms of average computational time in seconds per frame, as indicated in Table 7, our LB-PSO takes 53.30 s/frame, 97.79 s/frame, and 351.10 s/frame at WVGA, HD, and full HD resolution, respectively. Table 6 shows that our computational time is higher than that of the baseline methods. This is because our framework is designed for online training that integrates the forward pass of the network with particle swarm optimization. However, we obtain a significantly higher PSNR value and achieve the target bit rate. Furthermore, our bit allocation is assigned more accurately than in the baseline approaches.