Reducing Video Coding Complexity Based on CNN-CBAM in HEVC

Li, Huayu; Wei, Geng; Wang, Ting; Bui, ThiOanh; Zeng, Qian; Wang, Ruliang

doi:10.3390/app131810135

Open AccessArticle

Reducing Video Coding Complexity Based on CNN-CBAM in HEVC

by

Huayu Li

¹

,

Geng Wei

^1,*,

Ting Wang

¹

,

ThiOanh Bui

¹,

Qian Zeng

¹ and

Ruliang Wang

^1,2

¹

School of Physics and Electronics, Nanning Normal University, Nanning 530100, China

²

Technology College of Xiangsihu School, GuangXi Minzu University, Nanning 530100, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(18), 10135; https://doi.org/10.3390/app131810135

Submission received: 31 July 2023 / Revised: 1 September 2023 / Accepted: 6 September 2023 / Published: 8 September 2023

Download

Browse Figures

Versions Notes

Abstract

:

High-efficiency video coding (HEVC) outperforms H.264 in coding efficiency. However, the rate–distortion optimization (RDO) process in coding tree unit (CTU) partitioning requires an exhaustive exploration of all possible quad-tree partitions, resulting in high encoding complexity. To simplify this process, this paper proposed a convolution neural network (CNN) based optimization algorithm combined with a hybrid attention mechanism module. Firstly, we designed a CNN compatible with the current coding unit (CU) size to accurately predict the CU partitions. In addition, we also designed a convolution block to enhance the information interaction between CU blocks. Then, we introduced the convolution block attention module (CBAM) into CNN, called CNN-CBAM. This module concentrates on important regions in the image and attends to the target object correctly. Finally, we integrated the CNN-CBAM into the HEVC coding framework for CU partition prediction in advance. The proposed network was trained, validated, and tested using a large scale dataset covering various scenes and objects, which provides extensive samples for intra-frame CU partition prediction in HEVC. The experimental findings demonstrate that our scheme can reduce the coding time by up to 64.05% on average compared to a traditional HM16.5 encoder, with only 0.09 dB degradation in BD-PSNR and a 1.94% increase in BD-BR.

Keywords:

HEVC; intra-frame; CNN; CBAM; CU partition

1. Introduction

To transport huge volumes of video data effectively, some new versions of flexible video standards like H.264 [1], H.265 [2], and H.266 [3] are continually being developed. Each of these standards provides many efficient coding methods, greatly increasing the code complexity. For example, to obtain the best CU, versatile video coding (VVC) uses highly adaptable block coding (i.e., quad tree-based merge and transform (QTMT) structure), increasing the intra-frame coding complexity by an average of 18 times compared to HEVC [4]. It will become considerably more difficult if the rate distortion optimization (RDO) is also used. The VVC standard’s practical application is constrained by its high level of complexity, particularly in real-time settings. Given that both VVC and HECV use block-based encoding constructions with some variations in specific technologies as well as the fact that deep learning-based accelerating methods typically require the power of a graphics processing unit (GPU), HEVC is more developed and widely used than VVC in this regard. Thus, to estimate the performances of the various methods and validate the efficacy of the suggested algorithm, this work utilized HEVC without losing its universality.

Plenty of approaches to simplify HEVC coding have been developed during the past 10 years. It has been demonstrated that the standard reference software HM’s recursive CU partition procedure consumes up to 80% of the intra-frame encoding time. Therefore, reducing the complexity of quad-tree CU partitioning without compromising coding efficiency is a much-anticipated problem in the field of video compression. Currently, complexity reduction methods are mainly categorized into heuristics and learning-based methods, and there are also individual papers that have proposed other ways to reduce the coding time, like in [5,6].

The RDO search for CU partitions can be streamlined using several middle characteristics in heuristic strategies [7,8,9,10,11,12,13,14,15]. The approach by Shen et al. [7] circumvented the large CU size of the close-by coding tree unit (CTU) and offered an early CU size determination method based on the texture homogeneity measured by the mean absolute distortion. Quantization parameters (QPs), RD cost distributions, and video content were used to update an adaptive early termination threshold in Zhang et al.’s [8] statistical modeling-based CU partition technique. Lim et al. [9] developed a quick joint approach to address the issue of high complexity when the PU crosses the RDO. In particular, it performs jump and split termination judgments on prediction units (PUs) using Bayes’ rule and freshly set decision parameters, avoiding RD processing. Shen et al. [10] originally used a class conditional probability density function using one-two essential features and prior experience to identify whether or not the CU is divided. They then developed a Bayesian threshold method to ensure that the loss of the large coding unit (LCU) partition was kept to a minimum. The technique is limited in its ability to encode various scenarios because it lacks statistical features that are simple to compute. Zhang et al. [11] overcame the constraints of [10] by presenting a statistical-based model solution to the high-complexity of the CU partition and PU mode decision-making process. Using statistical qualities, the method extracted as many key features of the image as possible, minimizing the coding time to a greater extent.

Learning-based approaches have gradually replaced traditional heuristics to extract more features and lower the complexity of the CU partition in today’s environment, where artificial intelligence technologies are constantly encouraged [16,17,18,19,20]. To achieve more efficient results, we used machine learning (ML) methods such as support vector machines (SVMs) [16,17,18], random forests (RFs) [19], or Bayesian decision rule [20] classifiers (not the block similarity metrics), which are used in statistical-based methods to simplify the complexity of the CU decision module. Unlike the ML approaches, the CNN methods have been proven to be effective in automatic feature extraction [21,22,23,24,25,26,27,28,29,30,31,32]. CNN-based approaches, in comparison to heuristic-based methods, exhibit superior generalization ability and automatic feature learning. They can be effectively employed in various video coding and decoding scenarios, thereby enhancing the universality of the proposed algorithm. Consequently, CNN-based methods are increasingly gaining attention and finding diverse applications in the field of video coding and decoding.

Although the complexity can be greatly reduced using both heuristic and learning-based approaches, heuristic methods often struggle to extract sufficient features from video content, and the learning-based methods lead to a significant increase in complexity at deeper CTU depths. Therefore, when designing the network structure, not only must sufficient features be extracted, but the encoding time must be considerably reduced within the acceptable range of model calculation. In this paper, we propose a fast partition algorithm in CU based on machine learning and heuristic methods. Firstly, a CNN that can quickly predict all CU partition results was proposed to adapt to three CU sizes of 16 × 16, 32 × 32, and 64 × 64. In addition, the convolution blocks were designed to enhance the connection between the blocks in the intra-frame space. Secondly, a CBAM was introduced into the proposed CNN to construct CNN-CBAM with better performance, consequently refining the feature map while improving the network performance. Finally, the experimental data revealed that the CNN-CBAM significantly reduced the intra-frame coding time while somewhat degrading the RD performance.

To conclude, our primary contribution is in the three categories listed below:

We propose a CNN for variable CU sizes to properly fetch the global information of the current CTU. The developed CNN can accurately anticipate the size of the HEVC coding units, resulting in a reduced intra-frame coding time.
Embedding a CBAM into CNN leads to the development of the first CNN-CBAM, which is capable of predicting CU partitions more accurately in HEVC, so that it can obtain the key information of the different sequences.
The CNN-CBAM is embedded in the HEVC coding framework to predict the CU partition in advance. This method can avoid the repeated iterations of traditional RDO to extract the optimal CU depth, therefore, it can remarkably decrease the coding complexity while sustaining good coding performance.

The rest of this paper is organized as follows. Section 2 describes the proposed CNN-CBAM algorithm in detail. Section 3 reports the experimental results. Section 4 summarizes the paper and discusses some future work.

2. The Proposed Methods

2.1. Traditional CU Partition

The high complexity of HEVC is mainly attributed to the CU partition in which RDO is the core process, as previously mentioned. According to the image content of each LCU, each is divided into four CU sizes with varying depths, and the specific CU sizes of the four are shown in Figure 1. As already noted, the RDO technique is utilized to traverse in order to find the ideal CU sizes, and the procedure consumes the majority of the intra-frame coding time, realizing a full LCU delineation. However, only one CU partition from 1 (maximum size 64 × 64 CU are not partitioned) to 64 (CTU are fully partitioned into 8 × 8 CU with depth = 3) is chosen in the final CU partition, which significantly increases the amount of redundant coding time.

2.2. Proposed Algorithm

2.2.1. Proposed Structure of CNN

Deep learning has been brought to video coding to reduce the time overhead of intra-frame coding by training a large number of parameters to learn the encoder CU partition rules. These methods can be broadly classified into two kinds: single CNN [21,22,23,24,25,26,27] and single CNN with some auxiliary data [28,29,30,31,32], For example, CNNs were initially used in video codecs by Liu et al. [19], who presented a deep CNN to minimize the amount of CU/PU candidate patterns. Furthermore, size-specific down-sampling for different CU/PU sizes resulted in texture loss variances for different CU/PU layers, which had an effect on the anticipated results. To better understand the CU partition rules in the encoder, Li et al. [22] suggested a deep CNN method trained on numerous parameters to address the problem of redundant computation during intra-frame CU partition. Fan et al. [28] introduced an efficient block partitioning CNN method that not only improved on the CNN, but also incorporated an adaptive threshold technique for the accurate management of CNN prediction errors. Zhang et al. [32] investigated the correlation between texture complexity, quantization parameters, and CU depth and proposed a CNN scheme based on texture classification aiming at accelerating CU partition. However, the huge reduction in the complexity of these methods based on CNN comes at the expense of a minor drop in RD performance. The latest study [28] showed a more significant improvement in both the complexity reduction and compression efficiency. However, there are still some shortcomings. For example, in [28], although the different CU sizes could share the same convolution structure, the extracted global information was limited for larger CU sizes. In addition, this method used RDO in uncertain regions with a threshold of approximately 0.5, further increasing the computational complexity. Inspired by [21,28], we propose a novel CNN-based approach to further reduce the coding complexity while sustaining the compression performances without a loss of generality.

Unlike in [21,28], the proposed CNN considers the global image information and designs a convolution operation that is compatible with the current CU size, thus the feature extraction is realized between adjacent scaling CU and the information interaction can be carried out in the same path. In addition, a convolution block with superposition performance was designed to extract the detailed features of the image blocks. Thus, the effective learning of the global information is achieved. The proposed CNN not only accurately predicts the CU partition to replace unnecessary encoding processes, but also strengthens the connection between CNN and different CU sizes.

The proposed CNN is shown in Figure 2. Note that we selected original image luminance information as input to the network since it contained more visual information. There are three different branches of the proposed CNN, namely B1, B2, and B3, corresponding to the different CU sizes. The different branches are first pre-processed separately, and then the pre-processed CU information is passed through the convolution, concatenation, and fully connection layers in sequence until the CU partition decision results are output. The specific structures are described as follows:

Convolution Layer: The initial 64 × 64 luminance pixel matrix serving as the input to this network is the corresponding preprocessing process before entering each of the three branches, with the preprocessed CU block serving as the input to the convolutional layer of that branch alone. Noted that in this paper, we fixed the size of the convolution kernel to be same as the step of this layer.

For a further understanding of deeper features, the convolution block was designed and set in the convolution path

C_{i - j} (i = 1, 2, 3; j = 3, 4)

and

C_{3 - 2}

. Figure 3 depicts the structure of the convolution block. Firstly, the input blocks are convolved along principal and lateral branch paths to obtain two feature maps, Con-1 × 1 and Con-2 × 2, respectively, then to obtain the convolution blocks, the feature maps of the two different paths are stacked in the channel dimension, which can help the network learn about the many combinations of CU sizes. The convolution kernels of the main and side branch are 1 × 1 and 2 × 2, respectively. Meanwhile, the step size of both convolutions is set to 1, which aims to ensure that the feature maps of Con-1 × 1 and Con-2 × 2 have the same size to facilitate feature fusion. Those features contain rich texture information for learning CU partition. Secondly, the convolution kernels of the

C_{i - 1} (i = 1, 2, 3)

are set to half of the size of the current CU to obtain the global information of the image as much as possible. These features are closely related to the current CU partition. Finally, to fuse the feature maps of the two paths, the

C_{i - 4} (i = 1, 2, 3)

and

C_{i - 1} (i = 1, 2, 3)

channels are set to 64 and 128, respectively. In addition, a non-overlapping 2 × 2 max pooling, computation with a step size 2 is used to filter the features in the main channel

C_{i - j} (i = 1, 2, 3; j = 2, 3, 4)

. This drastically decreases the number of parameters and the risk of over-fitting while preserving the main features.

Concatenation Layer: The fused feature maps obtained by the three branches are turned into three one-dimensional vectors respectively in the concatenation layer. The one-dimensional vector contains all the key information features of the current CU.

Fully connected layer: In each of the three branches, the connection layer’s three one-dimensional vectors are used as inputs to the fully connected layer. The input contains local features collected from various layers for each category. Furthermore, because QP is critical in the CU partition process, it is spliced into the fully connected layer of each branch as supplementary features. These above features will be integrated and categorized in this layer until the learning of all of the CTU partition rules is complete, and then the output results will be used to judge the current CU.

Re LU and Sigmoid are used as the convolutional and output layer activation functions, respectively. Furthermore, the method proposed employs an early termination strategy to eliminate superfluous CU partition operations during coding. Table 1 displays the parameters of each layer of the proposed CNN, where full convolution, convolution, maximum pooling, and step size are abbreviated as FCon, Con, MP, and S, respectively.

2.2.2. Structure of CBAM

The proposed convolution block attention module (CBAM) by [33] can improve the network model performance by learning to focus or suppress key feature information, hence facilitating network information transmission. Figure 4 depicts the CBAM’s detailed construction.

For the given input feature map

F \in R^{C \times H \times W}

, where C, H, and W denote the number of channels, height, and width of the feature map, respectively. The CBAM sequentially contains

M_{C} \in R^{C \times 1 \times 1}

(channel attention map) and

M_{S} \in R^{1 \times H \times W}

(spatial attention map). The overall CBAM can be characterized as:

F^{'} = M_{C} (F) \otimes F,

(1)

F^{″} = M_{S} (F^{'}) \otimes F^{'},

(2)

where

\otimes

means the elements of the feature map and the two attention modules to produce the corresponding weight product operation. The two attention modules are detailed below.

Channel attention module: First, in

F \in R^{C \times H \times W}

, maximum pooling and average pooling are carried out to summarize the spatial information of the feature map and produce two distinct spatial context descriptions, i.e.,

F_{m a x}^{c}

(maximum pooling) and

F_{a v g}^{c}

(average pooling). Second, the

F_{a v g}^{c}

and

F_{m a x}^{c}

are passed to a hidden layer and multi-layer perceptron (MLP) to produce a channel attention map

M_{c} \in R^{C \times 1 \times 1}

. In addition, we set the drop channel size of the hidden layer to

R^{\frac{c}{r} \times 1 \times 1}

to reduce the model parameter overhead, where

\frac{c}{r}

is an integer. Then, after implementing a shared network for each descriptor, two new descriptors are generated (i.e.,

F_{a v g'}^{c}

and

F_{m a x'}^{c})

. Finally, we utilize an element-by-element summing approach to obtain F’ as the output. Therefore, the channel attention module is computed as:

M_{C} (F) = α (M L P (A v g p o o l (F)) + M L P (M a x p o o l (F))), = α (W_{1} (W_{0} (F_{a v g}^{c})) + W_{1} (W_{0} (F_{m a x}^{c}))),

(3)

where α represents the sigmoid function and

W_{0} \in R^{C / r \times C}

and

W_{1} \in R^{C \times C / r}

are the MLP weights.

Spatial attention module: Firstly, the first attention module’s output result F’ is used as the second attention module’s input. Secondly, the maximum pooling and average pooling are performed along the channel, and the pooling results are concatenated to yield 2D valid feature descriptors (i.e.,

F_{a v g}^{s} \in R^{1 \times H \times W}

(average pooling) and

F_{m a x}^{s} \in R^{1 \times H \times W}

(maximum pooling)). Finally, for cascade-based feature descriptors, a convolution layer is applied to produce

M_{s} (F) \in R^{H \times W}

(spatial attention map). Therefore, the spatial attention module is computed as:

M_{S} (F) = α (f^{7 \times 7} ([A v g p o o l (F); M a x p o o l (F)])), = α (f^{7 \times 7} ([F_{a v g}^{s}; F_{m a x}^{s}])),

(4)

where

f^{7 \times 7}

represents a 7 × 7 convolution operation and α is the sigmoid function. The results by [33] showed that the filter size of 7 × 7 had good performance, so this paper selected this filter without loss of generality.

In summary, the CBAM introduces the spatial and channel attention from the spatial and channel dimensions to notice important information about the object’s content and thus better describe the key features of the image. Secondly, adding CBAM to the network induces the original network to spotlight the target objects, thus improving the generalization and characterization ability of the network. More importantly, CBAM is a lightweight network that improves network effectiveness without adding additional computational complexity. Based on the above advantages, we introduced the CBAM into the proposed CNN and then achieved a better texture representation performance.

2.2.3. Proposed CNN-CBAM Algorithm

As above-mentioned, to increase the link between CUs of various sizes and the convolution structure, the size of the convolution kernel in

C_{i - 1} (i = 1, 2, 3)

is half of the current CU size. The feature map obtained from the convolution is a global description of the current CU information, which is critical in the CU partition of the output layer. However, when the CNN performs feature mapping, the weights assigned to each channel are the same that result in the poor description of the key information. In light of this, as shown in Figure 5, we introduced CBAM in the CNN structure to better characterize key features of the image by adaptively assigning the weights within the channels according to the content of the image and focusing on important regions of the image content.

2.2.4. Loss Function of Proposed CNN-CBAM

Cross entropy is derived from the concept of information entropy in information theory and is commonly used to measure the disparity between different probability distributions. Specifically, it treats the ground-truth labels as one probability distribution (referring to the true CU partition labels of the video sequence by HM16.5), while the model’s predicted output can be treated as another probability distribution (CNN-CBAM model). The cross-entropy loss function is calculated as follows:

H (p, q) = - \sum_{i = 1}^{N} p_{i} \log (q_{i}),

(5)

where

p_{i}

denotes the probability of category i in the ground-truth label and

q_{i}

denotes the probability of category i in the model’s predicted output.

This loss function is not only the most often used function for classification tasks, but it also sets the neural network’s direction of convergence, allowing the neural network to perform well during the initial training. On this premise, the CNN-CBAM network model was trained in this research using the cross-entropy loss function. Meanwhile, applying the Sigmoid function to the gradient descent algorithm can effectively solve the problem of a reduced learning rate.

Supposing that there are M training samples, the prediction values in the B1, B2, and B3 branches are shown as

{\{y_{B 1}^{*} (k)\}}_{k = 1}^{4}

,

{\{y_{B 2}^{*} (k)\}}_{k = 1}^{4}

, and

{\{y_{B 3}^{*} (k)\}}_{k = 1}^{4}

, respectively. The corresponding true values are denoted as

{\{y_{B 1} (k)\}}_{k = 1}^{4}

,

{\{y_{B 2} (k)\}}_{k = 1}^{4}

, and

{\{y_{B 3} (k)\}}_{k = 1}^{4}

. The samples’ loss function

L r

is expressed as follows:

L r = \sum_{k = 1}^{1} H (y_{B 1}^{*} (k), y_{B 1} (k)) + \sum_{k = 1}^{4} H (y_{B 2}^{*} (k), y_{B 2} (k)) + \sum_{k = 1}^{16} H (y_{B 3}^{*} (k), y_{B 3} (k)),

(6)

by minimizing the loss function, the CNN-CBAM can be trained in this manner.

L = \frac{1}{M} \sum_{m = 1}^{M} L r .

(7)

2.3. Proposed CNN-CBAM Algorithm and Flowcharts

The overall CNN-CBAM algorithm is shown as Algorithm 1. The algorithm contains the two most important parts. (1) Training the CNN-CBAM model to obtain the partition labels of CU in the image. (2) We embed the trained model into HEVC to predict the CU partition in advance.

Algorithm 1 (CNN-CBAM)

1:: Input: current image information
2:: Output: return a set of split CU with RDO costs
3:: Processed processing: I-pro
4:: {// train CNN-CBAM model to obtain current CU split_flag
5:: If W = 64 and H = 64 then
6:: If split_flag = 0 then
7:: skip CU partition
8:: Else W = 32 and H = 32 then
9:: If split_flag = 0 then
10:: skip CU partition
11:: Else W = 16 and H = 16 then
12:: If split_flag = 0 then
13:: skip CU partition
14:: Else W = 8 and H = 8 then
15:: split_flag ← 0
16:: End if
17:: End if
18:: End if
19:: End if
20:: }
21:: frontier ← stack (LCU)
22:: decision results ← ∅
23:: While is empty (frontier) is false do
24:: CU block ← POP (frontier)
25:: If execute Model (CNN-CBAM, CU block) is not split do
26:: decision results ← decision results ∪ CU block
27:: Else
28:: Sub CU ← GET (Sub CU, CU block)
29:: Sub CU RDO ← ∅
30:: For sub-CU in Sub CU do
31:: If is leaf (sub-CU) is true then
32:: Sub CU RDO ← Sub CU RDO ∪ calculate (RDO, sub-CU)
33:: Else
34:: PUSH (frontier, sub-CU)
35:: End if
36:: End for
37:: Parent RDO ← calculate (RDO, CU block)
38:: If parent RDO < sum of 4 sub-CU RDO then
39:: decision results ← decision results ∪ CU block
40:: Else
41:: decision results ← decision results ∪ Sub CU
42:: End if
43:: End if
44:: End while
45:: Return decision results

The framework of the CNN-CBAM is displayed in Figure 6. The proposed method firstly avoids unnecessary redundant computation of violent RDO compared with the traditional CU classification method; secondly, the proposed CNN, which is compatible with different levels of CU, can predict all the CU partition results at one time, significantly reducing the coding complexity. In addition, the convolution block is designed to enhance the information exchange of the same path of the CU to obtain richer texture features. Finally, CBAM is introduced into CNN to build CNN-CBAM. The CNN-CBAM is embedded in the HEVC coding framework to predict the partition of CU in advance. The CBAM can effectively improve the intermediate features by dynamically suppressing or emphasizing the feature map through the attention mechanism, thus improving the representational ability of the CNN-CBAM network without increasing the computational complexity.

3. Experimental Results

3.1. Experimental Conditions

Experimental data: The database for the CU partition of HEVC(CPH) [34] (2000 ultra-high definition (UHD) images were selected from the original image datasets) was used to test the network presented. First, these UHD images were randomly divided into validation, testing, and training, which was 5%, 10%, and 85% of the entire dataset, respectively. Generally speaking, the more the training data, the more accurate the predicted CTU partition. Second, each group contains four subsets equally (i.e., 768 × 512, 1536 × 1024, 4928 × 3264, and 2880 × 1920). Finally, the CPH video sequences are encoded with HM16.5 to different QPs to obtain the containing CU partition after training, testing, and validation of the data labels.

In addition, all trainable parameters are randomly initialized when training the model. The specific hyper parameter settings are shown in Table 2.

Standard experimental configuration: In this study, to encode the standard video test sequence, four QP values {22, 27, 32, 37} and encoder_intra_main.cfg [35] were chosen using the Benchmark software HM16.5 as the experimental platform. Table 3 shows the specific experimental conditions.

Objective evaluation indicators: Validation studies were carried out on 18 reference sequences with varied resolutions provided by JCT-VC in this publication for fair comparison. Simultaneously, to evaluate the overall effectiveness of the suggested strategy, we settled on four criteria (i.e.,

∆ T

,

∆ T_{a v g}

, BD-PSNR [36], and BD-BR). The specific calculation formulas are as follows:

∆ T = \frac{T_{p r o p o s e d} - T_{H M 16.5}}{T_{H M 16.5}} \times 100 %,

(8)

∆ T_{a v g} = \frac{1}{4} \sum_{i = 1}^{4} \frac{T_{p r o p o s e d} (Q P_{i}) - T_{H M 16.5} (Q P_{i})}{T_{H M 16.5} (Q P_{i})} \times 100 %,

(9)

B D - P S N R = \frac{1}{r_{h} - r_{l}} \int_{r_{l}}^{r_{h}} (D_{p r o p o s e d} (r) - D_{H M 16.5} (r)) d r,

(10)

B D - B R = \frac{1}{D_{h} - D_{l}} \int_{D_{l}}^{D_{h}} (r_{p r o p o s e d} - r_{H M 16.5}) d D,

(11)

where

T_{H M 16.5}

and

T_{p r o p o s e d}

denote the coding time by HM16.5 and the proposed method, respectively;

D_{p r o p o s e d}

and

D_{H M 16.5}

are the two corresponding RD curves;

r_{l}

and

r_{h}

are the low and high ends of the output bit rate range, respectively;

r_{H M 16.5}

and

r_{p r o p o s e d}

are the corresponding bit rates of the original one in HM16.5 and the proposed method, respectively;

D_{l}

and

D_{h}

are the low and high ends of the output RD curve range, respectively.

3.2. Ablation Experiments

To evaluate the method of the CNN-CBAM and CNN, three ablation experiments were performed by using the same configurations for the first 20 frames of each sequence under the four QP values. The results are displayed in Figure 7, where the point values indicate the average of the five types (A–E) of sequences at the four QPs.

Figure 7a indicates that the coding time saved by CNN-CBAM and CNN was very similar and that there was almost no difference. This is because the introduced CBAM is a lightweight network and does not significantly increase the complexity of the network. From Figure 7b,c, we can see that the CNN-CBAM had a different degree of improvement in RD performance compared to the CNN, which is because the CNN-CBAM network can refine the feature map after adding CBAM, thus improving the representational capability of the CNN. Therefore, the RD performance of CNN-CBAM is more advantageous compared with the CNN.

3.3. Analyzing Experimental Outcomes

3.3.1. Performance Comparison with HM16.5 Encoder

To evaluate the proposed CNN-CBAM algorithm, we performed it and the original one in HM16.5. It has already been mentioned how the four indicators were obtained (i.e., BD-BR, BD-PSNR,

∆ T

, and

∆ T_{a v g})

.

Table 4 indicates that the computational complexity of the suggested method was significantly lower than that of the method in HM16.5. On the one hand, the proposed approach saved an average 64.05% of time. The complexity reduction was especially evident for class B with gentle motion, reaching 74.24%, and even for class C sequences with vigorous motion, reaching 53.21%. For sequences such as Kimono, which has a background with high texture complexity and slow-moving foreground targets, the average time saving even reached 87.32%. The proposed CNN-CBAM method for the CU partition’s accurate prediction accounted for the proposed algorithm’s low complexity, thereby avoiding unnecessary redundant iterations. On the other hand, although there was a decrease in RD performance per sequence, it was not significant. The average loss of video quality was only 0.09. There was only a 1% average bit rate increase. Undeniably, the proposed algorithm, in general, maintained a good rate–distortion performance.

To display the proposed algorithm’s RD performance, we took the two sequences with the greatest difference in performance loss among the test sequences as an example. Figure 8 depicts the RD curves of the Basketball Drive and BQ Square sequences. The figure indicates that the suggested algorithm’s RD performance lines up with the original approach in HM16.5. The experimental results showed that various sequences with varied video properties had comparable characteristics. In addition, the average loss of the video quality was only 0.09 dB, with a maximum drop of 0.15 dB (for the Four People sequence) and a minimum drop of 0.02 dB (for the BQ Square sequence), which was almost negligible. Undeniably, the network structure we devised efficiently decreased the coding complexity while maintaining the coding quality.

3.3.2. Comparison of Subjective Quality

To assess the subjective quality, we compared the experimental results of the proposed algorithm and the one in HM16.5 on sequences with different characteristics. For example, the Class A Traffic sequence with a flat mean, the Class B Cactus and Class E Kristen & Sara sequence with a complex background, and the Class C BQ Mall sequence with complex heterogeneous texture. To make it easier to see the image loss details, we provide a local magnification area for subjective quality comparison, as shown in Figure 9. The subjective effects of the suggested method and the one in HM16.5 were almost identical. The PSNR loss of Traffic, Cactus, BQ Mall, and Kristen & Sara in the suggested method was 0.0181, 0.0239, 0.0068, and 0.0317, respectively, when compared to the methodology in HM16.5. Thus, the method proposed has an excellent performance in both the image quality and coding efficiency.

Moreover, to further demonstrate why the suggested method is of low complexity, we compared the CU partition results of the proposed algorithm to that of HM16.5. Figure 10a,b show a zoomed-in demonstration of the two algorithms in specific regions (r1–r6). On the one hand, as illustrated in Figure 10a, although HM16.5 exhibited strong results as a high-efficiency algorithm for CU partition, this efficiency may come at the cost of increased computational complexity. However, compared with Figure 10a, the CU partitioning of our scheme was very consistent with that of HM16.5, which is one of the important reasons for the high coding performance obtained by our method. On the other hand, from a subjective perspective, the HM16.5 method divides some regions that do not need CU partition, while the proposed method does not divide these areas, which better reflects the subjective feelings. Specifically, Figure 10a,b highlights yellow regions that are not divided in our method but are divided in HM16.5. This is one of the reasons for the low complexity of our scheme. It is noteworthy that similar results were obtained for other test sequences of JCT-VT, which proves the robustness and effectiveness of the proposed algorithm.

3.3.3. Comparison of the Proposed Algorithm with Three Existing Algorithms

We compared the proposed method to the competing algorithms including the heuristic-based scheme by Zhang et al. [11], the CNN-based algorithm by Liu et al. [21], and the CNN-based scheme by Fan et al. [28] to further illustrate its superior performance, as shown in Table 5. The meanings of BD-BR (%), BD-PSNR (dB), and

∆ T_{a v g}

(%) were the same as noted above. On the one hand, for the coding complexity, we noted that the proposed scheme was reduced on average by 64.05%, better than both the 59.69% in [21] and 60.35% in [28], and even about 10% higher compared to [11]. The method has such a low complexity because it can extract all features of the current CTU pixels from both the global information and refined texture features, while the encoding process does not require any intermediate features, and the CNN-CBAM can predict all of the CU partition results at one time, skipping more encoding unit depth computation processes, and thus effectively removing redundant CU partition computations. In addition, as above-mentioned, for some regions, especially for the regions with gentle movement and simple textures, the CU partition of the CNN-CBAM better reflected the subjective feelings than the method in HM 16.5. For these regions, the CNN-CBAM terminated the CU partition early, which further reduced the intra encoding time.

On the other hand, for the RD performance, all four methods in Table 5 showed a slight loss compared with the method in HM16.5; the proposed method was superior to the methods in [11,21,28] both in terms of the code rate and video quality with 2.18% and −0.11 dB, 2.71% and −0.13 dB, and 2.03% and −0.09 dB, respectively. Our method had a better generalization ability compared with competing algorithms. Meanwhile, the CBAM algorithm could better identify the features of the current block and induce the CNN to focus on the target object correctly, thus improving the network prediction performance.

In summary, compared to heuristic-based methods [11], the CNN-CBAM based scheme excels in feature extraction and adaptively adjusts to scene characteristics, optimizing the network structure. In contrast to CNN-based schemes [21,28], the distinctive convolutional block design in CNN-CBAM strengthens the interaction of information between blocks, leading to the acquisition of more abundant texture features. Moreover, the incorporation of CBAM enhances image quality without incurring additional computational complexity. It is undeniable that our method has an advantage over competing methods.

In Figure 11, the results of the comparison between the other three algorithms and the proposed algorithm are presented in the form of three-dimensional visualization graphs, which vividly show the differences between the four algorithms on the three metrics. Note that the values of BD-PSNR (dB),

∆ T_{a l l}

(%), and BD-BR (%) were the average of all 18 test sequences. Undeniably, the proposed algorithm is more advantageous in terms of the overall criteria for assessment compared with the three methods in [11,21,28]. On the one hand, the proposed method has significant advantages in reducing the encoding complexity since the proposed CNN-CBAM can classify blocks more comprehensively and faster, and consequently reduce many RDO candidates. On the other hand, the feature refinement process of the CBAM enables the CNN-CBAM to effectively utilize the given features, which enhances the neural network’s capacity for generalization and maintains good coding performance.

We further used a metric (i.e., figures of merit (FoM) mentioned in [37]) to more objectively show the complexity–RD performance of the proposed method. The FoM is calculated as follows:

F o M = |\frac{B D - B R}{Δ T_{a v g}}| \times 100

(12)

the FoM can be used to compare the strengths and weaknesses of different algorithms and to find the optimal balance between low RD performance loss and high complexity reduction (i.e., the better the performance of the method, the smaller the FoM value, and vice versa).

Table 6 presents the FoMs of the different approaches including the proposed one and the existing methods in [11,21,28]. A lower value for FoM is desirable as it indicates a better compromise between RD efficiency and reduced computational complexity. Table 6 shows that the proposed method’s FoM was 3.0312, which was significantly lower than the other methods in [11,21,28]. The result demonstrates that the proposed CNN-CBAM attained a favorable balance between the low encoding complexity and high RD performance.

4. Conclusions and Feature Works

In this paper, we proposed an optimization algorithm for CNN based on the hybrid attention mechanism (i.e., CNN-CBAM), which aimed to accelerate CU partitioning in the HEVC. Firstly, we extracted the global information and local texture features of the video sequence using CNN. Then, we introduced CBAM to CNN to construct CNN-CBAM, which was then embedded into the coding framework of HECV to predict the CU partition in advance. CBAM mainly extracted the nonuniform fusion of the fetched multi-channel features, which in turn improved the information representation of the fused features. Finally, the experimental data showed that the CNN-CBAM could reduce the intra-frame coding time to 64.05%. Despite a slight performance loss in BD-PSNR and BD-BR (with a BD-PSNR and BD-BR of −0.09 dB and 1.94%, respectively), relatively good video quality was maintained.

Although some findings have been made in this study, there is room for further refinement and expansion in the future. We can further optimize and refine the CNN-CBAM model by combining the rules of inter-frame coding. In addition, we will further explore how to apply the optimized CNN-CBAM to VVC. With such improvements, we expect to achieve better results and advance the technology in the field of video compression.

In conclusion, the algorithm presented in this paper offers a promising solution for accelerating CU partition in the HEVC. However, future research endeavors should focus on refining and extending the algorithm to achieve even better results.

Author Contributions

Conceptualization, H.L., G.W. and T.W.; Methodology, H.L., G.W., T.W., Q.Z. and T.B.; Software, H.L., G.W. and T.W.; Validation, H.L., G.W., T.W. and R.W.; Formal analysis, H.L., G.W. and T.W.; Investigation, H.L., G.W. and T.W.; Resources, H.L.; Data curation, H.L.; Writing—original draft preparation, H.L. and G.W.; Writing—review and editing, H.L., G.W., T.W., Q.Z., T.B. and R.W.; Visualization, H.L. and G.W.; Supervision, G.W.; Project administration, G.W.; Funding acquisition, G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Guangxi Province (Grants No. 2020GXNSFAA297184, No. 2020GXNSFBA297097), the National Natural Science Foundation of China (Grant. No. 62161031), and the Science and Technology Planning Project of Guangxi Province (Grant. No. AD21238038).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the finding of the study are available from Dr. Wei at wei_geng@nnnu.edu.cn upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript.

Abbreviations	Full Name
CNN	Convolution neural network
CU	Coding unit
HEVC	High-efficiency video coding
CBAM	Convolution block attention module
CNN-CBAM	Convolution neural network convolution block attention module
BD-PSNR	Bjøntegaard delta PSNR(Peak Signal-to-Noise Ratio)
BD-BR	Bjøntegaard delta bitrate
VVC	Versatile video coding
QTMT	Quad tree-based merge and transform
RDO	Rate–distortion optimization
GPU	Graphics processing unit
HM	High-efficiency video coding test model
CTU	Coding tree unit
QP	Quantization parameter
RD	Rate–distortion
PU	Prediction unit
LCU	Large coding unit
ML	Machine learning
SVMs	Support vector machines
RFs	Random forests
FCon	Full convolution
Con	Convolution
MP	Maximum pooling
S	Stride
2D	Two dimensional
MLP	Multi-layer perceptron
CPH	Coding unit partition of high efficiency video coding
UHD	Ultra-high definition
JCT-VC	Joint collaborative team on video coding
FoM	Figures of merit.

References

Wiegand, T.; Sullivan, G.J.; Bjontegaard, G.; Luthra, A. Overview of the h. 264/avc video coding standard. IEEE Trans. Circuits Syst. Video Technol. 2003, 13, 560–576. [Google Scholar] [CrossRef]
Sullivan, G.J.; Ohm, J.R.; Han, W.J.; Wiegand, T. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668. [Google Scholar] [CrossRef]
Bross, B.; Wang, Y.K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G.J.; Ohm, J.R. Overview of the versatile video coding (V-VC) standard and its applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764. [Google Scholar] [CrossRef]
Wu, S.; Shi, J.; Chen, Z. HG-FCN: Hierarchical Grid Fully Convolutional Network for Fast VVC Intra Coding. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5638–5649. [Google Scholar] [CrossRef]
Chauvet, A.; Sugaya, Y.; Miyazaki, T.; Omachi, S. Optical Flow-Based Fast Motion Parameters Estimation for Affine Motion Compensation. Appl. Sci. 2020, 10, 729. [Google Scholar] [CrossRef]
Gutiérrez-Aguado, J.; Peña-Ortiz, R.; Garcia-Pineda, M.; Claver, J.M. A Cloud-Based Distributed Architecture to Accelerate Video Encoders. Appl. Sci. 2020, 10, 5070. [Google Scholar] [CrossRef]
Shen, L.; Zhang, Z.; Liu, Z. Effective CU Size Decision for HEVC Intracoding. IEEE Trans. Image Process. 2014, 23, 4232–4241. [Google Scholar] [CrossRef]
Zhang, Y.; Kwong, S.; Zhang, G.J.; Pan, Z.Q.; Yuan, H.; Jiang, G.Y. Low complexity HEVC INTRA coding for High-quality mobile video communication. IEEE Trans. Ind. Inform. 2015, 11, 1492–1504. [Google Scholar] [CrossRef]
Lim, K.; Lee, J.; Kim, S.; Lee, S. Fast PU skip and Split termination algorithm for HEVC Intra Prediction. IEEE Trans. Circuits Syst. Video Technol. 2015, 25, 1335–1346. [Google Scholar]
Shen, X.L.; Yu, L.; Chen, J. Fast coding unit size selection for HEVC based on Bayesian decision rule. In Proceedings of the 2012 Picture Coding Symposium, Krakow, Poland, 7–9 May 2012. [Google Scholar]
Zhang, Y.; Li, N.; Kwong, S.; Jiang, G.; Zeng, H.Q. Statistical Early Termination and Early Skip Models for Fast Mode Decision in HEVC INTRA Coding. ACM Trans. Multimed. Comput. Commun. Appl. 2019, 15, 1–23. [Google Scholar] [CrossRef]
Leng, J.; Sun, L.; Ikenaga, T.; Sakaida, S. Content Based Hierarchical Fast Coding Unit Decision Algorithm for HEVC. In Proceedings of the 2011 International Conference on Multimedia and Signal Processing, Guilin, China, 14–15 May 2011; pp. 56–59. [Google Scholar]
Dang-Nguyen, D.-T.; Pasquini, C.; Conotter, V.; Boato, G. Raise: A raw images dataset for digital image forensics. ACM Multimed. Syst. Conf. 2015, 6, 219–224. [Google Scholar]
Cho, S.; Kim, M. Fast CU splitting and pruning for suboptimal CU partitioning in HEVC intra coding. IEEE Trans. Circuits Syst. Video Technol. (TSCVT) 2013, 23, 1555–1564. [Google Scholar] [CrossRef]
Gu, J.; Tang, M.; Wen, J.; Han, Y. Adaptive intra candidate selection with early depth decision for fast intra prediction in HEVC. IEEE Signal. Process. Lett. 2018, 25, 159–163. [Google Scholar] [CrossRef]
Yin, J.; Yang, X.; Lin, J.; Chen, Y.; Fang, R. A fast block partitioning algorithm based on SVM for HEVC intra coding. In Proceedings of the 2018 2nd International Conference on Video and Image Processing (ICVIP), Hong Kong, China, 29–31 December 2018. [Google Scholar]
Liu, X.G.; LI, Y.Y.; Liu, D.Y.; Wang, P.C.; Yang, L.T. An adaptive CU size decision algorithm for HEVC intra prediction based on complexity classification using machine learning. IEEE Trans. Circuits Syst. Video Technol. 2017, 29, 144–155. [Google Scholar] [CrossRef]
Liu, D.Y.; Liu, X.G.; Li, Y.Y. Fast CU size decisions for HEVC intra frame coding based on support vector machines. In Proceedings of the 2016 IEEE 14th International Conference on Dependable, Autonomic and Secure Computing, 14th Intl 552Conf on Pervasive Intelligence and Computing, 2nd International Conference on Big Data Intelligence and Computing and 553Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), Auckland, New Zealand, 8–12 August 2016; pp. 594–597. [Google Scholar]
Du, B.; Siu, W.-C.; Yang, X.F. Fast CU partition strategy for HEVC intra-frame coding using learning approach via random forests. In Proceedings of the 2015 Asia-Pacifific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Hong Kong, China, 16–19 December 2015; pp. 1085–1090. [Google Scholar]
Kuang, W.; Chan, Y.L.; Tsang, S.H.; Siu, W.C. Online-learning based Bayesian decision rule for fast intra mode and CU partitioning algorithm in HEVC screen content coding. IEEE Trans. Image Process. 2020, 29, 170–185. [Google Scholar] [CrossRef] [PubMed]
Liu, Z.Y.; Yu, X.Y.; Gao, Y.; Chen, S.L.; Ji, X.Y.; Wang, D.S. Cu partition mode decision for HEVC hardwired intra encoder using convolution neural network. IEEE Trans. Image Process. 2016, 25, 5088–5103. [Google Scholar] [CrossRef]
Li, T.Y.; Xu, M.; Deng, X. A deep convolutional neural network approach for complexity reduction on intra-mode HEVC. IEEE Int. Conf. Multimed. Expo (ICME) 2017, 7, 1255–1260. [Google Scholar]
Xu, M.; Li, T.Y.; Wang, Z.L.; Deng, X.; Yang, R.; Guan, Z.Y. Reducing Complexity of HEVC: A Deep Learning Approach. IEEE Trans. Image Process. 2018, 27, 5044–5059. [Google Scholar] [CrossRef]
Kim, K.; Ro, W.W. Fast CU depth decision for HEVC using neural networks. IEEE Trans. Circuits Syst. Video Technol. (TCSVT) 2019, 29, 1462–1473. [Google Scholar] [CrossRef]
Wang, Z.X.; Li, F. Convolutional neural network based low complexity HEVC intra encoder. Multimed. Tools Appl. 2021, 80, 2441–2460. [Google Scholar] [CrossRef]
Schiopu, I.; Huang, H.; Munteanu, A. CNN-based intra-prediction for lossless HEVC. IEEE Trans. Circuits Syst. Video Technol. 2019, 99, 1816–1828. [Google Scholar] [CrossRef]
Qin, L.M.; Zhu, Z.J.; Bai, Y.Q.; Liao, G.L.; Liu, N.L. A Complexity-Reducing HEVC Intra-Mode Method Based on VGGNet. J. Comput. 2022, 33, 57–67. [Google Scholar]
Fan, J.Y.; Song, L.F. Fast Intra-frame Prediction Algorithm for HEVC Based on Neural Networks and Adaptive Threshold. In Proceedings of the 6th International Conference on Video and Image Processing (ICVIP 2022), Shanghai, China, 23–26 December 2022; ACM: New York, NY, USA, 2022. [Google Scholar]
Feng, A.; Gao, C.S.; Li, L.; Liu, D.; Wu, F. CNN-based depth map prediction for fast block partition in HEVC intra coding. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021. [Google Scholar]
Ting, H.C.; Fang, H.L.; Wang, J.S. Complexity Reduction on HEVC Intra Mode Decision with modified LeNet-5. In Proceedings of the IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Hsinchu, Taiwan, 18–20 March 2019. [Google Scholar]
Fu, B.; Zhang, Q.Q.; Hu, J.L. Fast prediction mode selection and CU partition for HEVC coding. IET Image Process. 2020, 14, 1892–1900. [Google Scholar] [CrossRef]
Zhang, Y.F.; Wang, G.; Tian, R.; Xu, M.; Kuo, C.C. Texture-classification accelerated CNN scheme for fast intra CU partition in HEVC. In Proceedings of the Data Compression Conference (DCC), Snowbird, UT, USA, 26–29 March 2019. [Google Scholar]
Woo, S.; Park, J.C.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Lecture Notes in Computer Science, Proceedings of the Computer Vision—ECCV 2018; Munich, Germany, 8–14 September 2018, Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Swtzerland, 2018; Volume 11211. [Google Scholar]
CPH-Intra. Available online: https://github.com/HEVC-Projects/CPH (accessed on 3 October 2018).
Bossen, F. Common test conditions and software reference configurations. In Proceedings of the Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, 5th Meeting, Geneva, Switzerland, 16–23 March 2011. [Google Scholar]
Bjøntegaard, G. Calculation of Average PSNR Differences Between RD-Curves. In Proceedings of the 13th VCEG Meeting, Austin, TX, USA, 2–4 April 2001. [Google Scholar]
Najafabadi, N.; Ramezanpour, M. Mass center directionbased decision method for intra prediction in HEVC standard. J. Real-Time Image Process. 2020, 17, 153–1168. [Google Scholar] [CrossRef]

Figure 1. Traditional CU partition (1, 2, and 3 denote different CU size of 32 × 32, 16 × 16, 8 × 8 respectively).

Figure 2. Structure of the CNN.

Figure 3. Structure of the convolution block.

Figure 4. Structure of the convolution block attention module.

Figure 5. Structure of CNN-CBAM.

Figure 6. Comparison of the two flowcharts.

Figure 7. Comparison results between CNN-CBAM and CNN in average of the A-E classes. (a) comparison of

∆ T_{a v g}

(%); (b) comparison of BD-BR (%); (c) comparison of BD-PSNR (dB).

Figure 7. Comparison results between CNN-CBAM and CNN in average of the A-E classes. (a) comparison of

∆ T_{a v g}

(%); (b) comparison of BD-BR (%); (c) comparison of BD-PSNR (dB).

Figure 8. Comparison results of the RD curves. (a) Basketball Drive; (b) BQ Square.

Figure 9. Subjective results for Frame 3 of four sequences on QP = 37. (a) Traffic; (b) Cactus; (c) BQ Mall; (d) Kristen & Sara.

Figure 10. CU partition results for Frame 3 of the BQ Mall sequences on QP = 37. (a) CU partition by HM 16.5 (r1–r3 are zoomed-in demonstration); (b) CU partition by CNN-CBAM (r4–r6 are zoomed-in demonstration).

Figure 11. Comparison of the coding performance (

∆ T_{a l l}

, BD-BR, and BD-PSNR) with other algorithms on standard test sequences [11,21,28].

Figure 11. Comparison of the coding performance (

∆ T_{a l l}

, BD-BR, and BD-PSNR) with other algorithms on standard test sequences [11,21,28].

Table 1. The parameters of the proposed CNN.

Branch	Layer	Output Size	Proposed CNN Configuration
B1	C_1-1	2 × 2 × 128	S-32, Con-32 × 32
	C_1-2	16 × 16 × 64	S-4, Con-4 × 4
	C_1-3	16 × 16 × 64	S-1, Con-[2 × 2, 1 × 1]
	pooling	8 × 8 × 64	S-2, MP-2 × 2
	C_1-3	8 × 8 × 128	S-1, Con-[2 × 2, 1 × 1]
	pooling	4 × 4 × 128	S-2, MP-2 × 2
	Concatenation	/	2560
	FCon-1	/	512
	FCon-2	/	256
B2	C₂_-1	8 × 8 × 64	S-16, Con-16 × 16
	C_2-2	8 × 8 × 64	S-4, Con-4 × 4
	pooling	4 × 4 × 64	S-2, MP-2 × 2
	C_2-3	4 × 4 × 128	S-1, Con-[2 × 2, 1 × 1]
	pooling	2 × 2 × 128	S-2, MP-2 × 2
	C_2-4	/	S-1, Con-[2 × 2, 1 × 1]
	pooling	/	S-2, MP-2 × 2
	Concatenation	/	1024
	FCon-1	8 × 8 × 64	256
	FCon-2	8 × 8 × 64	16
B3	C₂_-1	4 × 4 × 64	S-8, Con-8 × 8
	C_2-2	4 × 4 × 128	S-1, Con-[2 × 2, 1 × 1]
	pooling	8 × 8 × 64	S-2, MP-2 × 2
	C_2-3	8 × 8 × 64	S-1, Con-[2 × 2, 1 × 1]
	pooling	4 × 4 × 64	S-2, MP-2 × 2
	C_2-4	4 × 4 × 128	S-1, Con-[2 × 2, 1 × 1]
	pooling	2 × 2 × 128	S-2, MP-2 × 2
	Concatenation	/	1024
	FCon-1	/	256
	FCon-2	/	16

Table 2. Specific hyper parameter settings of CNN-CBAM.

Hyper Parameters	Value
Model initialization	Random initialization
Distribution type	Truncated normal distribution
Mean	0
Standard deviation	0.1
Batch size (R)	64
Momentum	0.9
Initial learning rate	0.01
Learning rate decay	Decrease by 1% every 2000 iterations
Total iterations	1,000,000

Table 3. Experimental conditions.

Name	Model
Operating system	Window 11
python	3.6
TensorFlow	1.8
Processor	AMD Ryzen 7 5800 H with Radeon Graphics @3.20 GHz
Running memory	8.00 GB RAM
Graphics Card Adapter	NVIDIA GeForce GTX1650

Table 4. Results for the A–E sequences.

Resolution	BD-BR	BD-PSNR	$∆ T (%)$				$∆ T_{a v g}$
Resolution	(%)	(dB)	QP = 22	QP = 27	QP = 32	QP = 37	(%)
People on Street	2.11	−0.12	−62.59	−68.02	−65.43	−64.56	−65.15
Traffic	2.47	−0.13	−67.50	−66.77	−69.64	−73.14	−69.26
Average Class A	2.29	−0.13	−65.05	−67.40	−67.54	−68.85	−67.21
Basketball Drive	3.86	−0.09	−71.37	−77.46	−80.00	−82.38	−77.80
BQ Terrace	1.37	−0.09	−65.25	−64.76	−74.41	−67.51	−67.98
Cactus	2.02	−0.05	−60.66	−64.91	−68.34	−71.48	−66.35
Kimono	1.47	−0.05	−83.50	−85.35	−89.99	−90.42	−87.32
Park Scene	2.15	−0.09	−74.99	−67.27	−70.63	−74.17	−71.77
Average Class B	2.17	−0.07	−71.15	−71.95	−76.67	−77.19	−74.24
Basketball Drill	2.94	−0.14	−46.77	−51.38	−68.23	−74.97	−60.34
Party Scene	0.86	−0.07	−42.50	−40.93	−42.54	−45.13	−42.78
Race Horses C	1.64	−0.10	−54.81	−54.63	−57.61	−62.58	−56.51
BQ Mall	1.18	−0.07	−46.86	−50.37	−53.94	−58.06	−52.33
Average Class C	1.65	−0.10	−47.53	−49.10	−55.35	−59.97	−52.99
Basketball Pass	1.96	−0.11	−62.21	−72.57	−75.99	−78.17	−72.24
Blowing Bubbles	0.74	−0.04	−37.07	−40.11	−45.55	−51.86	−43.76
BQ Square	0.24	−0.02	−64.49	−44.78	−46.21	−47.60	−50.77
Race Horses	1.04	−0.07	−45.57	−47.00	−49.34	−53.55	−48.87
Average Class D	1.00	−0.06	−52.44	−51.18	−54.20	−57.81	−53.91
Four People	2.61	−0.15	−68.61	−75.95	−66.71	−76.08	−71.84
Johnny	3.39	−0.14	−71.93	−74.21	−76.60	−78.31	−75.26
Kristen & Sara	2.82	−0.14	−70.09	−71.45	−73.58	−75.15	−72.57
Average Class E	2.94	−0.14	−70.21	−73.87	−72.30	−76.51	−73.22
Average	1.94	−0.09	−60.91	−62.07	−65.20	−68.02	−64.05

Table 5. The BD-BR, BD-PSNR, and complexity comparison between the proposed algorithm and three algorithms of Refs. [11,21,28].

Sequence	Algorithm [11]			Algorithm [21]			Algorithm [28]			Proposed Algorithm
	BD-BR	BD-PSNR	$∆ T_{a v g}$	BD-BR	BD-PSNR	$∆ T_{a v g}$	BD-BR	BD-PSNR	$∆ T_{a v g}$	BD-BR	BD-PSNR	$∆ T_{a v g}$
	(%)	dB	(%)	(%)	dB	(%)	(%)	dB	(%)	(%)	dB	(%)
People on Street	1.77	−0.10	−53.34	2.31	−0.12	−62.30	2.84	−0.15	−57.56	2.11	−0.12	−65.15
Traffic	2.38	−0.12	−58.51	2.35	−0.11	−60.80	2.64	−0.13	−65.53	2.47	−0.13	−69.26
Average Class A	2.08	−0.11	−55.93	2.33	−0.12	−61.55	2.74	−0.14	−61.55	2.29	−0.13	−67.21
Basketball Drive	2.18	−0.06	−57.53	3.67	−0.09	−71.00	3.91	−0.11	−72.26	3.86	−0.09	−77.80
BQ Terrace	1.47	−0.07	−56.21	2.32	−0.12	−62.40	1.99	−0.09	−58.21	1.37	−0.09	−67.98
Cactus	2.20	−0.07	−58.25	2.57	−0.09	−59.90	2.18	−0.07	−65.95	2.02	−0.07	−66.35
Kimono	1.82	−0.06	−67.30	3.07	−0.10	−59.80	2.84	−0.10	−80.90	1.47	−0.05	−87.31
Park Scene	2.35	−0.10	−61.97	1.82	−0.07	−59.30	2.57	−0.11	−65.09	2.15	−0.09	−71.77
Average Class B	2.00	−0.07	−60.25	2.69	−0.09	−62.48	2.70	−0.10	−68.32	2.17	−0.08	−74.24
Basketball Drill	2.06	−0.10	−48.93	3.98	−0.18	−56.60	2.27	−0.11	−53.75	2.94	−0.14	−60.34
Party Scene	1.37	−0.11	−48.44	1.95	−0.13	−50.60	0.66	−0.05	−42.83	0.86	−0.07	−42.78
Race Horses C	2.50	−0.14	−47.77	2.03	−0.12	−54.20	1.87	−0.10	−60.06	1.64	−0.10	−56.51
BQ Mall	3.15	−0.17	−56.31	2.77	−0.15	−57.30	2.31	−0.12	−55.79	1.18	−0.07	−52.33
Average Class C	2.27	−0.13	−50.36	2.68	−0.15	−54.68	1.78	−0.10	−53.11	1.66	−0.10	−52.99
Basketball Pass	2.72	−0.15	−54.25	2.92	−0.16	−59.60	1.52	−0.09	−57.30	1.96	−0.11	−71.84
Blowing Bubbles	1.43	−0.09	−46.92	2.44	−0.13	−53.00	0.64	−0.04	−40.32	0.74	−0.04	−43.76
BQ Square	1.20	−0.10	−45.47	1.39	−0.11	−53.40	0.52	−0.04	−46.04	0.24	−0.02	−50.77
Race Horses	/	/	/	2.41	−0.14	−52.70	1.45	−0.08	−52.01	1.04	−0.07	−48.87
Average Class D	1.78	−0.11	−48.88	2.29	−0.14	−54.68	1.03	−0.06	−48.92	1.00	−0.06	−53.91
Four People	2.71	−0.15	−59.89	3.06	−0.16	−59.60	1.76	−0.10	−64.78	2.61	−0.15	−71.84
Johnny	2.48	−0.10	−64.67	4.58	−0.18	−72.70	2.31	−0.09	−75.44	3.39	−0.14	−75.26
Kristen & Sara	3.25	−0.16	−62.36	3.10	−0.15	−69.30	2.21	−0.11	−73.34	2.82	−0.14	−72.57
Average Class E	2.81	−0.14	−62.31	3.58	−0.16	−67.20	2.09	−0.10	−71.19	2.94	−0.14	−73.22
Average	2.18	−0.11	−55.77	2.71	−0.13	−59.69	2.03	−0.09	−60.35	1.94	−0.09	−64.05

Table 6. Comparison of the FoM of the four methods.

Parameter	Algorithm [11]	Algorithm [21]	Algorithm [28]	Proposed
FoM	3.9089	4.5401	3.3637	3.0312

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, H.; Wei, G.; Wang, T.; Bui, T.; Zeng, Q.; Wang, R. Reducing Video Coding Complexity Based on CNN-CBAM in HEVC. Appl. Sci. 2023, 13, 10135. https://doi.org/10.3390/app131810135

AMA Style

Li H, Wei G, Wang T, Bui T, Zeng Q, Wang R. Reducing Video Coding Complexity Based on CNN-CBAM in HEVC. Applied Sciences. 2023; 13(18):10135. https://doi.org/10.3390/app131810135

Chicago/Turabian Style

Li, Huayu, Geng Wei, Ting Wang, ThiOanh Bui, Qian Zeng, and Ruliang Wang. 2023. "Reducing Video Coding Complexity Based on CNN-CBAM in HEVC" Applied Sciences 13, no. 18: 10135. https://doi.org/10.3390/app131810135

APA Style

Li, H., Wei, G., Wang, T., Bui, T., Zeng, Q., & Wang, R. (2023). Reducing Video Coding Complexity Based on CNN-CBAM in HEVC. Applied Sciences, 13(18), 10135. https://doi.org/10.3390/app131810135

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Reducing Video Coding Complexity Based on CNN-CBAM in HEVC

Abstract

1. Introduction

2. The Proposed Methods

2.1. Traditional CU Partition

2.2. Proposed Algorithm

2.2.1. Proposed Structure of CNN

2.2.2. Structure of CBAM

2.2.3. Proposed CNN-CBAM Algorithm

2.2.4. Loss Function of Proposed CNN-CBAM

2.3. Proposed CNN-CBAM Algorithm and Flowcharts

3. Experimental Results

3.1. Experimental Conditions

3.2. Ablation Experiments

3.3. Analyzing Experimental Outcomes

3.3.1. Performance Comparison with HM16.5 Encoder

3.3.2. Comparison of Subjective Quality

3.3.3. Comparison of the Proposed Algorithm with Three Existing Algorithms

4. Conclusions and Feature Works

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI