2.2.1. Proposed Structure of CNN
Deep learning has been brought to video coding to reduce the time overhead of intra-frame coding by training a large number of parameters to learn the encoder's CU partition rules. These methods can be broadly classified into two kinds: a single CNN [21,22,23,24,25,26,27] and a single CNN with auxiliary data [28,29,30,31,32]. For example, CNNs were initially used in video codecs by Liu et al. [19], who presented a deep CNN to reduce the number of CU/PU candidate patterns. However, the size-specific down-sampling for different CU/PU sizes resulted in different degrees of texture loss across the CU/PU layers, which affected the prediction results. To better learn the CU partition rules in the encoder, Li et al. [22] suggested a deep CNN trained on numerous parameters to address the problem of redundant computation during intra-frame CU partition. Fan et al. [28] introduced an efficient block-partitioning CNN method that not only improved on the CNN, but also incorporated an adaptive threshold technique for the accurate management of CNN prediction errors. Zhang et al. [32] investigated the correlation between texture complexity, quantization parameters, and CU depth and proposed a CNN scheme based on texture classification aimed at accelerating the CU partition. However, the large complexity reduction achieved by these CNN-based methods comes at the expense of a minor drop in RD performance. The latest study [28] showed a more significant improvement in both complexity reduction and compression efficiency, but there are still some shortcomings. For example, in [28], although the different CU sizes could share the same convolution structure, the extracted global information was limited for larger CU sizes. In addition, this method used RDO in uncertain regions with a threshold of approximately 0.5, which further increased the computational complexity. Inspired by [21,28], we propose a novel CNN-based approach to further reduce the coding complexity while maintaining the compression performance without a loss of generality.
Unlike [21,28], the proposed CNN considers the global image information and designs convolution operations that are compatible with the current CU size, so that feature extraction is realized between adjacent CU scales and information interaction can be carried out along the same path. In addition, a convolution block with a superposition structure was designed to extract the detailed features of the image blocks, so that the global information is learned effectively. The proposed CNN not only accurately predicts the CU partition to replace unnecessary encoding processes, but also strengthens the connection between the CNN and the different CU sizes.
The proposed CNN is shown in
Figure 2. Note that we selected the original image luminance information as the input to the network, since it contains more visual information. The proposed CNN has three different branches, namely B1, B2, and B3, corresponding to the different CU sizes. Each branch is first pre-processed separately, and the pre-processed CU information is then passed through the convolution, concatenation, and fully connected layers in sequence until the CU partition decision results are output. The specific structures are described as follows:
Convolution Layer: The initial 64 × 64 luminance pixel matrix that serves as the input to the network first undergoes the corresponding preprocessing before entering each of the three branches, and the preprocessed CU block alone then serves as the input to the convolutional layer of that branch. Note that, in this paper, the size of the convolution kernel is fixed to be the same as the stride of this layer.
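To make this operation concrete, the following minimal PyTorch sketch applies a convolution whose kernel size equals its stride to a 64 × 64 luminance block; the channel count and kernel size here are placeholders rather than the values listed in Table 1:

```python
import torch
import torch.nn as nn

# Non-overlapping convolution: the kernel size equals the stride, so each
# output value summarizes one disjoint 4 x 4 region of the luminance block.
# The channel count (16) and kernel size (4) are placeholders, not the
# exact values reported in Table 1.
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=4, stride=4)

luma_block = torch.randn(1, 1, 64, 64)   # one 64 x 64 luminance input
features = conv(luma_block)
print(features.shape)                    # torch.Size([1, 16, 16, 16])
```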
To learn deeper features, a convolution block was designed and placed in the corresponding convolution paths. Figure 3 depicts the structure of the convolution block. Firstly, the input blocks are convolved along the principal and lateral branch paths to obtain two feature maps, Con-1 × 1 and Con-2 × 2, respectively; the feature maps of the two paths are then stacked along the channel dimension to form the convolution block, which helps the network learn the many combinations of CU sizes. The convolution kernels of the main and side branches are 1 × 1 and 2 × 2, respectively, and the stride of both convolutions is set to 1 so that the Con-1 × 1 and Con-2 × 2 feature maps have the same size and can be fused. These features contain rich texture information for learning the CU partition. Secondly, the corresponding convolution kernels are set to half of the size of the current CU to capture as much global image information as possible; these features are closely related to the current CU partition. Finally, to fuse the feature maps of the two paths, the channel numbers of the two paths are set to 64 and 128, respectively. In addition, a non-overlapping 2 × 2 max pooling operation with a stride of 2 is used to filter the features in the main channel. This drastically decreases the number of parameters and the risk of over-fitting while preserving the main features.
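The following PyTorch sketch illustrates the idea of the convolution block; the channel counts, the padding used to keep the Con-1 × 1 and Con-2 × 2 maps the same size, and the exact position of the max pooling are our assumptions, not parameters taken from Figure 3 or Table 1:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """Sketch of the convolution block in Figure 3: a 1 x 1 main branch and a
    2 x 2 side branch, both with stride 1, whose feature maps are stacked
    along the channel dimension."""

    def __init__(self, in_ch: int, main_ch: int = 64, side_ch: int = 64):
        super().__init__()
        self.main = nn.Conv2d(in_ch, main_ch, kernel_size=1, stride=1)
        self.side = nn.Conv2d(in_ch, side_ch, kernel_size=2, stride=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        main_map = self.main(x)                          # Con-1 x 1 feature map
        # Pad one row/column so the 2 x 2 convolution keeps the spatial size;
        # the exact alignment strategy is an assumption, not from the paper.
        side_map = self.side(F.pad(x, (0, 1, 0, 1)))     # Con-2 x 2 feature map
        return torch.cat([main_map, side_map], dim=1)    # channel-wise stacking

block = ConvBlock(in_ch=16)
fused = block(torch.randn(1, 16, 16, 16))                # -> (1, 128, 16, 16)
# Non-overlapping 2 x 2 max pooling with a stride of 2; applying it to the
# fused map (rather than only the main path) is also an assumption here.
pooled = F.max_pool2d(fused, kernel_size=2, stride=2)    # -> (1, 128, 8, 8)
```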
Concatenation Layer: In the concatenation layer, the fused feature maps obtained by the three branches are each flattened into a one-dimensional vector. Each one-dimensional vector contains all of the key feature information of the current CU.
Fully connected layer: In each of the three branches, the one-dimensional vector from the concatenation layer is used as the input to the fully connected layer. This input gathers the local features collected from the various layers for each category. Furthermore, because the QP is critical to the CU partition process, it is spliced into the fully connected layer of each branch as a supplementary feature. These features are integrated and classified in this layer until all of the CTU partition rules are learned, and the output results are then used to judge the current CU.
ReLU and Sigmoid are used as the activation functions of the convolutional and output layers, respectively. Furthermore, the proposed method employs an early termination strategy to eliminate superfluous CU partition operations during coding.
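As a rough illustration of the concatenation and fully connected stages, the following sketch flattens the fused feature map, splices in the QP, and outputs a Sigmoid partition probability; the layer sizes and the single output per branch are assumptions made for illustration (the actual dimensions are given in Table 1):

```python
import torch
import torch.nn as nn

class BranchHead(nn.Module):
    """Sketch of the concatenation and fully connected stage of one branch:
    the fused feature map is flattened, the QP is spliced in as an extra
    feature, and a Sigmoid output gives the partition probability."""

    def __init__(self, feat_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim + 1, hidden_dim),   # +1 for the spliced QP
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),                          # output-layer activation
        )

    def forward(self, feat_map: torch.Tensor, qp: torch.Tensor) -> torch.Tensor:
        vec = feat_map.flatten(start_dim=1)             # concatenation layer
        vec = torch.cat([vec, qp.view(-1, 1)], dim=1)   # splice the QP in
        return self.fc(vec)                             # CU partition probability

head = BranchHead(feat_dim=128 * 2 * 2)
prob = head(torch.randn(1, 128, 2, 2), torch.tensor([32.0]))
```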
Table 1 displays the parameters of each layer of the proposed CNN, where full convolution, convolution, maximum pooling, and step size are abbreviated as FCon, Con, MP, and S, respectively.
2.2.2. Structure of CBAM
The convolutional block attention module (CBAM) proposed in [33] can improve the performance of a network model by learning to emphasize or suppress key feature information, thereby facilitating the transmission of information through the network. Figure 4 depicts the detailed structure of the CBAM.
For a given input feature map $F \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the number of channels, the height, and the width of the feature map, respectively, the CBAM sequentially infers a channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and a spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$. The overall CBAM can be characterized as:

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F'$$

where $\otimes$ denotes the element-wise product between the feature map and the weights produced by the corresponding attention module. The two attention modules are detailed below.
Channel attention module: First, maximum pooling and average pooling are applied to the input feature map $F$ to aggregate its spatial information and produce two distinct spatial context descriptors, i.e., $F^{c}_{max}$ (maximum pooling) and $F^{c}_{avg}$ (average pooling). Second, $F^{c}_{max}$ and $F^{c}_{avg}$ are passed to a shared multi-layer perceptron (MLP) with one hidden layer to produce the channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$. In addition, the hidden layer size is set to $C/r$ to reduce the model parameter overhead, where $r$ is an integer reduction ratio. Then, after the shared network is applied to each descriptor, two new descriptors are generated. Finally, the two descriptors are merged by element-wise summation to form the channel attention, and the refined feature map $F'$ is obtained as the output. Therefore, the channel attention module is computed as:

$$M_c(F) = \alpha\left(W_1\left(W_0\left(F^{c}_{avg}\right)\right) + W_1\left(W_0\left(F^{c}_{max}\right)\right)\right)$$

where $\alpha$ represents the sigmoid function, and $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are the MLP weights.
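A minimal PyTorch sketch of this module is given below; the reduction ratio $r = 16$ follows the default in [33] and is an assumption about the configuration used here:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the CBAM channel attention [33]: average- and max-pooled
    descriptors share an MLP whose hidden size is C // r, their outputs are
    summed element-wise, and a sigmoid produces the channel attention map."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),   # W0: C -> C/r
            nn.ReLU(),
            nn.Linear(channels // r, channels),   # W1: C/r -> C
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg_desc = self.mlp(x.mean(dim=(2, 3)))   # from F_avg^c
        max_desc = self.mlp(x.amax(dim=(2, 3)))   # from F_max^c
        m_c = torch.sigmoid(avg_desc + max_desc).view(b, c, 1, 1)
        return x * m_c                            # F' = M_c(F) (x) F

refined = ChannelAttention(channels=64)(torch.randn(1, 64, 8, 8))
```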
Spatial attention module: Firstly, the output $F'$ of the first attention module is used as the input to the second attention module. Secondly, maximum pooling and average pooling are performed along the channel dimension, and the pooling results are concatenated to yield two 2D feature descriptors, i.e., $F^{s}_{avg}$ (average pooling) and $F^{s}_{max}$ (maximum pooling). Finally, a convolution layer is applied to the concatenated descriptors to produce the spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$. Therefore, the spatial attention module is computed as:

$$M_s(F') = \alpha\left(f^{7 \times 7}\left(\left[F^{s}_{avg}; F^{s}_{max}\right]\right)\right)$$

where $f^{7 \times 7}$ represents a 7 × 7 convolution operation and $\alpha$ is the sigmoid function. The results in [33] showed that a filter size of 7 × 7 performed well, so this paper selected this filter size without loss of generality.
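A matching sketch of the spatial attention module is shown below:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the CBAM spatial attention [33]: channel-wise average and
    max pooling are concatenated and passed through a 7 x 7 convolution,
    and a sigmoid yields the spatial attention map."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = x.mean(dim=1, keepdim=True)     # F_avg^s, shape 1 x H x W
        max_map = x.amax(dim=1, keepdim=True)     # F_max^s, shape 1 x H x W
        m_s = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * m_s                            # F'' = M_s(F') (x) F'

output = SpatialAttention()(torch.randn(1, 64, 8, 8))
```

Applying the channel-attention sketch above and then this module reproduces the overall CBAM refinement $F'' = M_s(F') \otimes F'$ with $F' = M_c(F) \otimes F$.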
In summary, the CBAM introduces attention in both the spatial and channel dimensions so as to highlight the important content of an object and thus better describe the key features of the image. In addition, adding the CBAM to a network induces the original network to focus on the target objects, thereby improving the generalization and characterization ability of the network. More importantly, the CBAM is a lightweight module that improves network effectiveness with negligible additional computational complexity. Based on the above advantages, we introduced the CBAM into the proposed CNN and thereby achieved a better texture representation performance.
2.2.4. Loss Function of Proposed CNN-CBAM
Cross entropy is derived from the concept of information entropy in information theory and is commonly used to measure the disparity between different probability distributions. Specifically, it treats the ground-truth labels as one probability distribution (the true CU partition labels of the video sequence obtained with HM16.5), while the model's predicted output is treated as another probability distribution (the output of the CNN-CBAM model). The cross-entropy loss function is calculated as follows:
$$H(p, q) = -\sum_{i} p_i \log q_i$$

where $p_i$ denotes the probability of category $i$ in the ground-truth label and $q_i$ denotes the probability of category $i$ in the model's predicted output.
This loss function is not only the most commonly used loss for classification tasks, but it also sets the direction of convergence of the neural network, allowing the network to perform well during the early stages of training. On this premise, the CNN-CBAM network model was trained in this work using the cross-entropy loss function. Meanwhile, combining the Sigmoid output with this loss in the gradient descent algorithm effectively alleviates the problem of a reduced learning rate when the Sigmoid saturates.
Supposing that there are $M$ training samples, the predicted values in the B1, B2, and B3 branches are denoted as $\hat{y}_1^m$, $\hat{y}_2^m$, and $\hat{y}_3^m$, respectively, and the corresponding true values are denoted as $y_1^m$, $y_2^m$, and $y_3^m$. The loss function $L$ over the samples is expressed as follows:

$$L = -\frac{1}{M} \sum_{m=1}^{M} \sum_{k=1}^{3} \left[ y_k^m \log \hat{y}_k^m + \left(1 - y_k^m\right) \log\left(1 - \hat{y}_k^m\right) \right]$$

By minimizing this loss function, the CNN-CBAM can be trained in this manner.
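As an illustration only, the following sketch computes a per-branch binary cross-entropy averaged over the three branches; the equal branch weighting and the tensor shapes are assumptions, not details taken from the paper:

```python
import torch
import torch.nn.functional as F

def cnn_cbam_loss(preds, labels):
    """Sketch of the training loss: one binary cross-entropy term per branch,
    averaged over the B1, B2, and B3 outputs. Equal weighting of the three
    branches is an assumption."""
    return sum(F.binary_cross_entropy(p, y) for p, y in zip(preds, labels)) / len(preds)

# Hypothetical batch of M = 8 samples with one Sigmoid decision per branch.
preds = [torch.rand(8, 1).clamp(1e-6, 1 - 1e-6) for _ in range(3)]
labels = [torch.randint(0, 2, (8, 1)).float() for _ in range(3)]
loss = cnn_cbam_loss(preds, labels)   # scalar tensor to back-propagate
```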