1. Introduction
Digital video coding has important applications in video storage and communications. Raw videos are composed of a sequence of uncompressed images that require massive storage space and cannot be used for video communications. Video coding standards compress these raw videos with a compression ratio of up to 1000:1, making the storage and communication of the videos feasible. Such compression ratios are possible because video coding algorithms apply lossy compression, mainly through the many-to-one quantization of DCT coefficients. This can be combined with a reduction in the spatial and temporal resolutions of the input videos, all performed under the constraint of maintaining the highest possible subjective quality of the reconstructed video.
Briefly, the first usable video coding standard was H.261, which was created in 1988 [
1]. This was followed by MPEG-1 in 1992 [
2], which introduced bidirectional frame coding. Soon after, MPEG-2 was finalized in 1995 [
3], which introduced half-pixel motion estimation. MPEG-2 became very popular as it was the digital format for TV signals broadcast over the air, as well as cable and satellite TV systems. MPEG-2 also introduced scalable or multilayer video coding, which never became popular as it generates inferior quality to the simpler alternative based on bit stream switching. In 1996, H.263 [
4] was published for use in low-bitrate communications. H.263 introduced quarter-pixel motion estimation. Additionally, MPEG-4 was created in 2003 [
5], and it included object-based compression, which did not become popular simply because image segmentation was inaccurate at that time. However, AVC, which is also known as H.264 [
6], has become very popular, being used in streaming platforms. A decade later, HEVC was introduced [
7]; it went back to basic block-based DCT compression and discarded object-based coding. It introduced the concept of large coding units with adaptive recursive splits, making it suitable for ultra HD streams. Lastly, VVC was introduced in 2021 [
8], where the maximum coding unit size became double that of HEVC, allowing for greater flexibility in adaptive recursive splitting according to the video content.
Many authors have proposed solutions for enhancing the quality of coded video. The challenge here is to enhance the quality whilst remaining compliant with the syntax of the underlying video codec. There are two approaches for doing so: in the first approach, the enhancement algorithms are performed as pre- and/or post-processes with respect to video coding; in the second approach, coding algorithms that are already compliant with the video codec syntax are enhanced. These algorithms include bitrate control, two-pass encoding, in-loop filtering, motion estimation, and the selection of sizes and coding modes for coding units.
For video coding enhancements based on two-pass encoding, the work in [
9] proposed an efficient low-latency two-pass encoding solution for live video streaming applications. In the first pass, feature variables are extracted to predict the bitrate optimal constant rate factor to be used in the second-pass constrained variable bitrate coding. Compared to single-pass encoding, the solution resulted in bitrate savings of up to 11%. Additionally, the work in [
10] proposed a two-pass encoding rate–distortion optimization method to improve the coding efficiency of HEVC. In the first pass, a video frame is HEVC-encoded to obtain the rate–distortion model of its coding units and the number of bits allocated for the frame. In the second pass, an optimization equation based on this rate–distortion model is used to determine the Lagrange multiplier and quantization parameter for each coding unit. This results in a rate–distortion performance improvement of up to 5.6% in comparison to HEVC coding. More recently, the study in [
11] proposed a constrained two-pass per-title VBR encoding scheme, which is an optimized bitrate ladder approach for live video streaming that reduces storage and delivery costs while improving quality of experience. Evaluated with the HEVC encoder, the solution achieved an average bitrate reduction of 18.80% compared to standard HTTP live-streaming CBR encoding. Additionally, it results in a 68.96% reduction in storage space and an 18.58% reduction in encoding time, demonstrating its efficiency for adaptive live streaming. Moreover, the work in [
12] addressed complexity reduction in two-pass rate control for the versatile video encoder. The authors propose spatial and temporal sub-sampling during the first encoding pass to speed up the overall process. The encoding process achieves an 18% speedup while incurring only a 0.48% loss in coding efficiency.
For video coding enhancements using bitrate control algorithms, the work in [
13] proposed the utilization of entropy-based visual saliency models within the framework of HEVC. Consequently, the quantization parameters are adjusted according to visual saliency relevance at the coding tree unit level. Efficient rate control is achieved by allocating bits to salient and non-salient coding tree units by manipulating the quantization parameters according to their perceptual weighted map. Bitrate reductions of up to 6.6% in comparison to HEVC are reported. The work in [
14] proposed SSIM–MSE distortion models at the coding tree unit level to enable SSIM-based rate–distortion optimization with a simpler R-DMSE cost scaled by the SSIM-based Lagrangian parameter. Compared to HEVC encoding, the proposed solution results in bitrate savings of 5%, 11%, and 17% at the same SSIM in the all-intra, hierarchical, and non-hierarchical low-delay-B configurations, respectively. The work in [
15] proposed a novel bitrate control method that takes into account the distortion characteristics of inter-frame coding when updating the parameters of bitrate control. The paper also proposed a low-complexity I-frame quantization parameter decision strategy for low-delay scenarios, in which estimated distortion characteristics and previous quantization parameters are exploited. The proposed method resulted in a 2.6% bitrate saving in comparison to HEVC coding. More recently, the work in [
16] proposed a CTU-level bit allocation improvement scheme for intra-mode rate control. A dataset is created using natural images, and various metrics are applied to determine the significance and complexity of each coding unit. The most important coding units are weighted differently, and their optimal adjustment values are incorporated into the dataset. A PLS regression model is then used to refine the bit allocation weights. The proposed method improves rate control accuracy by 0.453%, Y-PSNR by 0.05 dB, BD rate savings by 0.33%, and BD-PSNR by 0.03 dB compared to the standard video coding rate control algorithm. Moreover, the work in [
17] presented a neural network-based rate control method for intra-frame coding. A neural network-based model is developed to predict bit allocation by mapping video content features to estimated bit usage at both the frame and CTU levels. Additionally, an improved parameter updating algorithm is introduced at the frame level. Experimental results show that ENNRC achieves 7.23% BD rate savings while providing more accurate bit allocation compared to VVC’s default rate control algorithm.
For video coding enhancements using the pre/post-processing of input and decoded images, the work in [
18] proposed a frame-level filtering solution based on CNNs for enhancing the decoded video quality of HEVC. This is achieved using a deep neural network architecture for post-filtering the decoded all-intra videos. The proposed solution serves as an alternative to the HEVC in-loop filtering for intra-coded frames. The solution resulted in a BD rate saving of 11.1% compared with the HEVC reference model. The work in [
19] introduced CNN-based up-sampling for intra-frame coding in HEVC, which down-sampled blocks before coding and then used a custom CNN to up-sample them. Implemented into HEVC, a BD rate saving of up to 9.0% is reported. Additionally, the work in [
20] presented a quality enhancement CNN (QE-CNN) for HEVC that improved the quality of locally decoded images without altering the encoder. The reported BD rate enhancements averaged 8.31%. In [
21], a post-processing solution was proposed which enhances the quality of the locally decoded images of a VVC codec by training a deep CNN that receives the locally decoded images, their prediction errors, and quantization maps. When deployed, the network enhances the quality of locally decoded images, which are consequently used for the prediction of future images, thus reducing the BD rate by 1.52% without increasing the computational complexity. Other similar solutions that work on enhancing locally decoded images include the work reported in [
22] using a squeeze-and-excitation CNN, resulting in 10.05% BD rate savings. Lastly, in [
23], the authors proposed a patch-wise spatial–temporal quality enhancement network which extracts and fuses spatial–temporal features. Using the HEVC baseline under LDP configuration, the work resulted in a BD rate savings of 17.24%.
On the other hand, the objective of this work is to enhance video coding through a pre- and post-processing approach, where input images undergo a content-symmetrical multidimensional transpose (CSMT) before compression. The transpose is content-symmetrical, meaning the video content remains unchanged but its spatial axes are permuted. As explained in later sections, the proposed method is particularly effective for the all-intra configuration in video coding. After decoding, the images are restored to their original form using the same CSMT. We analyze the impact of this transformation on the homogeneity of raw images and its influence on the coding process by examining the percentage of coding unit (CU) splits. Additionally, we propose a lightweight two-pass encoding approach, where the suitability of the video sequence for CSMT is assessed in the first pass before proceeding with full encoding in the second pass.
The rest of the paper is organized as follows:
Section 2 provides an overview of the proposed CSMT of input videos. In
Section 3, we introduce the overall system architecture.
Section 4 discusses the limitations of the proposed system,
Section 5 presents the experimental results, and
Section 6 concludes the paper.
2. Proposed Solution
The proposed solution relies on the content-symmetrical multidimensional transpose (CSMT) of YUV video images as a pre-process to video coding and as a post-process to video decoding. The CSMT results in video images of temporal slices composed of temporal information from many images over a small spatial area, which is one line from each image. This transpose is illustrated in
Figure 1, where the upper part represents input images and the lower part represents output images, where each is a temporal slice composed of individual spatial lines of input images.
Mathematically, a sequence of images can be represented as a three-dimensional matrix as follows:

I = [I(x, y, z)], (1)

where I is a sequence of images; I(x, y, z) are pixel values at coordinates (x, y) in the image at index z; and w, h, and f represent the number of rows, the number of columns, and the total number of frames, respectively. The following constraints apply: 1 ≤ x ≤ w, 1 ≤ y ≤ h, and 1 ≤ z ≤ f.
The CSMT of I can be represented through swaps in the indices and in the dimensions as follows [24]:

I_T = T(2,3)(I), where I_T(x, z, y) = I(x, y, z). (2)
Applying the multidimensional transpose twice on the set of images returns them to their original form. This is mathematically expressed as follows:

T(2,3)(T(2,3)(I)) = I, (3)

where T(2,3) represents swapping the second and third axes (z and y) to restore the original order of the images.
Note that this approach can be used on a group of images, with a total number of images in each group equal to the number of rows in each image. This guarantees that the transposed images have the same spatial dimensions as the input images.
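To make the operation concrete, the sketch below implements the CSMT as a single NumPy axis permutation. The function name `csmt` and the frames-first array layout are our illustrative choices, not the authors' code; the layout is equivalent to the paper's (x, y, z) indexing up to axis ordering.

```python
import numpy as np

def csmt(video: np.ndarray) -> np.ndarray:
    """Content-symmetrical multidimensional transpose.

    video: array of shape (frames, rows, cols). Swapping the frame and row
    axes builds output image i from row i of every input frame, so each
    output image is a temporal slice. Applying the function twice restores
    the original sequence.
    """
    return np.transpose(video, (1, 0, 2))

# Toy group with frames == rows, so the transposed images keep the same
# spatial dimensions as the input, as required above.
video = np.random.randint(0, 256, size=(8, 8, 12), dtype=np.uint8)
slices = csmt(video)
assert slices.shape == video.shape                # same dimensions as input
assert np.array_equal(csmt(slices), video)        # CSMT is its own inverse
assert np.array_equal(slices[0], video[:, 0, :])  # slice 0 = row 0 of each frame
```

With 480-row images, for instance, a group of 480 frames would be transposed at once; this grouping is also the source of the initial delay discussed later in the limitations.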
Figure 2 shows example images transposed using CSMT, belonging to different video sequences.
The figure lists both positive examples, where the proposed solution was suitable, and negative examples, where it was unsuitable. Suitability here refers to whether or not the proposed transpose resulted in video compression enhancements in terms of BD rate and BD-PSNR. Such suitability of the proposed solution to various sequences can be detected prior to full encoding, as shall be elaborated upon in later sections.
In general, it was experimentally observed that on average, the images transposed using CSMT have lower spatial variance and entropy than the original images. More specifically, using the video sequences listed in the experimental results section, it was observed that the average spatial variance per image dropped from 1624 to 1466, and the average entropy per image dropped from 7.03 bits to 6.86 bits, which represent 9.73% and 2.42% drops in variance and entropy, respectively.
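For reproducibility, these per-image statistics can be computed as in the following sketch (the helper names are ours; entropy here is the first-order entropy of the 8-bit intensity histogram):

```python
import numpy as np

def image_variance(img: np.ndarray) -> float:
    """Spatial variance of the pixel intensities of one image."""
    return float(np.var(img))

def image_entropy(img: np.ndarray) -> float:
    """First-order entropy in bits, from the 8-bit intensity histogram."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(np.sum(-p * np.log2(p)))

# A two-tone image with equally likely intensities 0 and 255 has an
# entropy of exactly 1 bit and a variance of 127.5^2 = 16256.25.
half = np.zeros((64, 64), dtype=np.uint8)
half[:, 32:] = 255
print(image_entropy(half))   # 1.0
print(image_variance(half))  # 16256.25
```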
These statistics indicate that the images transposed using CSMT are good candidates for the all-intra coding configuration of the HEVC video codec. Clearly, such images are not good candidates for inter-frame coding as the transpose increases the spatial homogeneity at the expense of the temporal resemblance.
Therefore, one research question raised in this work is whether or not the content-symmetrical multidimensional transpose of images enhances the efficiency of video coding using the all-intra configuration. Another research question is whether or not individual video sequences can be examined to determine the suitability of the proposed solution prior to full encoding. These questions are addressed in the sections to follow.
3. Overall System Architecture
To address the two research questions posed in the previous section, we propose the system architecture illustrated in
Figure 3.
The first block in the figure pertains to the decision logic in which a decision is made on whether or not the proposed solution is suitable for a particular video sequence. The details of this decision logic are presented later in this section. If a video sequence is deemed suitable for the proposed solution, then the input images are subjected to CSMT prior to video encoding. After that, the compressed video is either transmitted and/or stored. This is followed by video decoding, where CSMT takes place again prior to video display. As noted in the previous section and illustrated in Equation (3), transposing the transposed sequence puts it back to its original non-transposed form.
On the other hand, if a video sequence is deemed unsuitable for the proposed solution, then it goes through the typical video coding pipeline illustrated in the figure. Note that the figure is illustrated in such a way as to emphasize that the proposed solution is a pre- and a post-process to video coding, without the need for video syntax amendments or modifications to the compression algorithm itself.
The decision logic block in
Figure 3 is elaborated upon in
Figure 4. Basically, this can be considered a lightweight first encoding pass in which the input images are temporally subsampled prior to encoding. One valid approach is to use systematic sampling in which every kth image from the input sequence is retained. Formally, with N images, to retain n images only, the value of k is calculated as follows:

k = ⌊N/n⌋. (4)

Consequently, the images to retain will be at the following indices:

p, p + k, p + 2k, …, p + (n − 1)k, (5)

where p is a number between 1 and k.
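The index computation is small enough to sketch directly (the function name is ours; indices are 1-based to match the text):

```python
def systematic_sample_indices(N: int, n: int, p: int = 1) -> list:
    """1-based indices of the n frames retained out of N, using the step
    k = N // n and a starting offset p with 1 <= p <= k."""
    k = N // n
    if not 1 <= p <= k:
        raise ValueError("offset p must lie within the first sampling interval")
    return [p + i * k for i in range(n)]

# Retaining 4 frames out of 100 gives k = 25 and indices 1, 26, 51, 76.
print(systematic_sample_indices(100, 4))  # [1, 26, 51, 76]
```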
In
Figure 4, the systematic temporal sub-sampling step makes sense because, in the all-intra coding configuration, temporal dependencies between video frames are not taken into account; rather, each video frame is compressed as an intra-frame in isolation from the surrounding video frames. As such, temporal sub-sampling does not affect the coding efficiency of the underlying video.
After the systematic temporal sub-sampling, the images are subjected to CSMT and encoded. Since the encoder results in locally decoded images, there is no need to run the decoder separately. The PSNR of the locally decoded images and the resultant bitrate of the encoder are compared against the case of encoding without CSMT. A decision on whether or not to use the proposed solution is then made. These statements are revisited and further elaborated upon in the experimental results section.
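The quality comparison relies on the usual PSNR definition for 8-bit content; a minimal sketch follows (this is the standard formula, not code from the paper):

```python
import numpy as np

def psnr(ref: np.ndarray, dec: np.ndarray, peak: float = 255.0) -> float:
    """PSNR in dB between a raw frame and its locally decoded version."""
    mse = np.mean((ref.astype(np.float64) - dec.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

# One pixel off by 16 in a 16x16 frame gives MSE = 1, i.e., PSNR ~ 48.13 dB.
ref = np.full((16, 16), 100, dtype=np.uint8)
dec = ref.copy()
dec[0, 0] += 16
print(round(psnr(ref, dec), 2))  # 48.13
```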
4. Limitations of the Proposed System
In the experimental results, the proposed solution shows a clear enhancement over ordinary video coding in terms of bitrate and PSNR. However, for completeness, this section lists the limitations of the proposed solution.
First, and as mentioned previously, the proposed solution works for the all-intra coding configuration only. This is because the CSMT of input images typically results in more spatially homogeneous images at the expense of reducing the temporal resemblance. This statement can be verified experimentally by comparing the entropy of predicted images with and without the use of the proposed solution. A predicted image in this context is the difference between an image and its motion-compensated predecessor, which is represented as follows:

P(x, y, f) = I(x, y, f) − I(x + MVx(f), y + MVy(f), f − 1), 1 ≤ x ≤ w, 1 ≤ y ≤ h, (6)

where MVx(f) and MVy(f) are the x and y motion vector components of the fth image, and w and h are the width and height of the image, respectively.
Having calculated the entropy of the predicted images with and without CSMT, averaged over all images, the results are 4.1 bits and 3.99 bits, respectively. Since CSMT increases the entropy of the predicted images, this indicates that the proposed solution reduces the temporal resemblance; hence, it is not suitable for inter-frame coding.
The second limitation is that the proposed solution is more suitable for off-line coding than for compressing a live stream of images. This is because, to perform the CSMT and create images with the same dimensions as the input (rows × columns), access is needed to a number of images equal to the number of rows, since each transposed image takes one row of pixels from every image in the group. For example, if the images have 480 rows and 832 columns, then access to 480 images is needed to perform the CSMT, resulting in images with the same dimensions as the input, which are 480 rows and 832 columns. This creates an initial delay of 480 images, which might not be suitable for encoding a live stream of images.
One can alleviate such an initial delay by creating transposed images of size 64 × 832 instead of 480 × 832, where 64 is chosen as it is the size of the largest coding unit in HEVC. While such a solution reduces the initial delay, it splits the input into many smaller images. When compressed using video coding, each smaller image will have its own frame header, which contains many syntax elements according to the underlying video coding standard. These extra headers increase the coding bits and thus reduce the efficiency of the proposed solution. Reducing the initial delay is therefore not practical; thus, the proposed solution is more suitable for off-line coding, which is the second limitation of this work.
5. Experimental Results
In this section, the video sequences reported in
Table 1 are used to generate the experimental results. The diverse characteristics of the sequences guarantee that the proposed solution is not suitable for all of them; however, the proposed decision logic introduced in
Section 3 will be applied to identify such sequences.
In digital video compression, a typical experimental setup involves reporting the bitrate and PSNR using four quantization parameters (QPs). In HEVC video coding, these QPs are 22, 27, 32, and 37 [
25]. As mentioned previously, the proposed solution is for the all-intra configuration of HEVC.
In
Section 2, it was mentioned that the proposed multidimensional transpose of input videos reduces variance and entropy on average, making them more suitable for the all-intra coding profile. These statistics are calculated on the raw YUV images. However, additional statistics can be derived from the actual all-intra coding process to further analyze the impact of the proposed transpose on coding unit (CU) splitting. It is well known that CU splits are applied automatically by the encoder to achieve smaller block sizes that are more homogeneous in content. Therefore, a higher percentage of CU splits indicates that the input video frames are less homogeneous. The experiments are repeated four times, once for each of the quantization parameters (QP) used (22, 27, 32, and 37). The results are presented in
Table 2.
In the tables, the last row represents the average percentage of CU splits per QP, and the last column represents the average per video sequence. As can be seen in both parts of
Table 2, on average, the percentage of coding unit splits is reduced for all QPs when applying the proposed multidimensional transpose of the input videos. This indicates that the proposed solution, on average, results in more homogeneous video content suitable for the all-intra profile. Additionally, while the percentage of coding unit splits is reduced for most video sequences, it is not reduced for all of them, hence the need for the two-pass encoding, as illustrated in
Figure 4 of
Section 3.
In the following experiments, the bitrate and PSNR are reported for both the proposed solution and the standard, unmodified HEVC encoding, which serves as a benchmark in this study. These results, obtained using four QP values, are presented in
Table 3.
As can be seen from the results, on average, the proposed solution reduced the bitrate and enhanced the PSNR of the coded videos at all four quantization parameters. However, looking at the results of individual videos, we notice some exceptions. To clearly point out the video sequences that were suitable or unsuitable for the proposed solution, the results in
Table 3 are further summarized as percentage bitrate decreases and PSNR differences, as reported in
Table 4.
Here, the percentage bitrate decrease is calculated as follows:

Bitrate decrease (%) = 100 × (Bitrate_HEVC − Bitrate_proposed) / Bitrate_HEVC, (7)

and the PSNR differences (in dB) are calculated as follows:

ΔPSNR = PSNR_proposed − PSNR_HEVC. (8)
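Both summary metrics are simple arithmetic; the sketch below uses hypothetical rate and PSNR values, not figures from the tables:

```python
def bitrate_decrease_pct(rate_hevc: float, rate_proposed: float) -> float:
    """Percentage bitrate decrease of the proposed solution relative to HEVC."""
    return 100.0 * (rate_hevc - rate_proposed) / rate_hevc

def psnr_difference(psnr_proposed: float, psnr_hevc: float) -> float:
    """PSNR difference in dB; positive values favor the proposed solution."""
    return psnr_proposed - psnr_hevc

print(bitrate_decrease_pct(1000.0, 860.0))    # 14.0 (a 14% saving)
print(round(psnr_difference(38.5, 38.1), 1))  # 0.4 (dB)
```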
As can be seen in the table, the proposed solution resulted in average bitrate decreases of around 14%, 13%, 11%, and 9% for the four QPs, respectively. The best bitrate decrease is obtained at the lowest QP, which is 22 in this work; the savings then decrease slightly as the QP increases, which is a well-understood behavior in video encoding. The reported percentage decreases in bitrate are remarkable given that the PSNR also increased on average. This indicates that the proposed solution of applying a multidimensional image sequence transpose as a pre- and post-process in all-intra video coding works remarkably well.
However, the results are not uniformly positive. Taking a closer look at the results reported for individual video sequences in both
Table 3 and
Table 4, one can observe that when using the proposed solution, video sequences 4 and 7 consistently generate worse results than standard HEVC encoding. These sequences are BQMall and City, sample images of which are displayed in
Figure 2 above. Hence, the proposed lightweight two-pass encoding solution is needed to identify if a sequence is suitable or not for the proposed CSMT, as shall be elaborated upon in this section. For further visualization of the reported results, we also display samples of the rate–distortion curves of both positive and negative results (sequences 4 and 7) in
Figure 5.
Figure 5a presents sample sequences where the proposed solution worked well, and
Figure 5b presents the curves for the two sequences (4 and 7) that generated negative results. Note that the rest of the sequences that generate positive results are not added to
Figure 5a as their curves overlap with existing ones.
Moreover, the areas between the curves of the reference HEVC and the proposed solution can be quantified by means of BD-PSNR and BD rate [
26]. A positive BD-PSNR indicates that, on average, the PSNR of the proposed solution is higher than that of the reference HEVC encoder, while a negative BD rate indicates that, on average, the bitrate of the proposed solution is lower than that of the reference HEVC encoder. In summary, in this work, it is desired to report a positive BD-PSNR and a negative BD rate. These results are reported in
Table 5.
As can be seen in the table, all sequences generated the desired values for the BD-PSNR and BD rate, except for sequences 4 and 7. The average BD-PSNR and BD rate, excluding these two sequences, are 2.72 dB and −31.91%, respectively.
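For completeness, the Bjøntegaard delta rate can be computed as in the following sketch (our implementation of the standard method: a cubic fit of log10-rate against PSNR, averaged over the overlapping quality range; the rate and PSNR values shown are synthetic):

```python
import numpy as np

def bd_rate(rates_ref, psnrs_ref, rates_test, psnrs_test):
    """Bjontegaard delta rate (%): average bitrate difference between the
    test and reference rate-distortion curves over their overlapping PSNR
    range. Negative values mean the test curve needs fewer bits."""
    p_ref = np.polyfit(psnrs_ref, np.log10(rates_ref), 3)
    p_test = np.polyfit(psnrs_test, np.log10(rates_test), 3)
    lo = max(min(psnrs_ref), min(psnrs_test))
    hi = min(max(psnrs_ref), max(psnrs_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return float((10.0 ** avg_log_diff - 1.0) * 100.0)

# Sanity check: a curve that always spends 10% fewer bits at the same
# PSNR points should report a BD rate close to -10%.
psnrs = [34.0, 36.0, 38.0, 40.0]
rates_hevc = [1000.0, 2000.0, 4000.0, 8000.0]
rates_csmt = [0.9 * r for r in rates_hevc]
print(round(bd_rate(rates_hevc, psnrs, rates_csmt, psnrs), 2))  # -10.0
```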
To further investigate the proposed solution, we present its effect on the encoding process in terms of the percentage of splits applied to coding units during the compression. It is known in video coding that video frames are divided into blocks of pixels with a typical size of 64 × 64 pixels, referred to as largest coding units (LCUs). These coding units are then recursively divided into smaller units according to the spatial characteristics of the pixels or their prediction residuals. The smallest size of a coding unit is 4 × 4 pixels.
Table 6 lists the average percentage of coding unit splits per video sequence per QP value for both the proposed solution and regular HEVC coding.
It is interesting to observe that on average, the percentage of coding unit splits is lower in the proposed solution for all of the four quantization parameter values. This indicates that the proposed CSMT of the input image sequences results in more spatially homogeneous regions and, consequently, the decreased need for splitting the coding units. In turn, this results in fewer coded elements and lower syntax overhead in the output video bit stream, which also justifies the decrease in bitrate reported in
Table 3 and
Table 4 above.
Another interesting finding from the results in
Table 6 is related to the percentages of splits reported for sequences 4 and 7. In general, the encoding of these sequences using the proposed solution resulted in a higher percentage of splits for the coding units, which can also help in justifying why they resulted in higher bitrates compared to the regular HEVC encoding process.
Lastly, we present the results of the proposed lightweight two-pass solution, which is used to detect if a video sequence is suitable for the proposed CSMT prior to full encoding or not. As mentioned in
Section 3, systematic sampling is used for the temporal sub-sampling of the video frames prior to encoding. In this work, we retain one video frame out of every 25 frames. This is a suitable step size as it encodes one or two frames per second depending on the temporal resolution of the input video, thus reducing the number of frames to be encoded by a factor of 25. Again, as mentioned in
Section 3, this arrangement works for the all-intra configuration as there are no temporal dependencies between frames; thus, systematic sampling applies.
In
Table 7, we report the BD-PSNR and BD rates resulting from this lightweight first-pass encoding. For completeness, we also replicate the BD-PSNR and BD rates reported in
Table 5 for the second-pass encoding, which is the full-pass encoding.
It is observed that the BD-PSNR and BD rate results are very similar in both encoding passes. The averages of the first pass are 1.69 dB and −16.21%, while in the second pass, the averages are 1.62 dB and −15.12% for BD-PSNR and BD rates, respectively. This indicates that the first lightweight pass can be used to make a decision on the suitability of the proposed solution for the underlying sequence prior to full encoding. In the table, it is shown that the BD-PSNR and BD rate results are positive and negative, respectively, except for sequences 4 and 7. In summary, in the first lightweight encoding pass, if the BD-PSNR and BD rate values are positive and negative, respectively, then the proposed solution of CSMT in the all-intra encoding configuration is a viable solution.
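This decision rule can be stated compactly (our formulation of the criterion above; the example values are the reported first-pass averages and a hypothetical negative case):

```python
def csmt_is_suitable(bd_psnr: float, bd_rate: float) -> bool:
    """First-pass decision: adopt CSMT only when the lightweight pass shows a
    quality gain (positive BD-PSNR) and a bitrate saving (negative BD rate)."""
    return bd_psnr > 0.0 and bd_rate < 0.0

print(csmt_is_suitable(1.69, -16.21))  # True
print(csmt_is_suitable(-0.2, 3.5))     # False
```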
For completeness, the computational time for the first- and second-pass encodings are reported in
Table 8. The reported times are averaged over all sequences using the four quantization parameters. The results are generated using a laptop with Windows 10 OS with a 10th generation Intel Core i9 processor, 32 GB RAM, and a NVIDIA Quadro T2000 GPU.
The results in the table show that the first pass is indeed a lightweight encoding pass, as it requires only a fraction of the time needed for full encoding. More specifically, the required time is reduced by a factor of around 24 to 25. This is expected, as the systematic temporal subsampling retained only every 25th image of the input sequence.
Table 9 reports the additional time required to perform the proposed multidimensional transpose as a pre- and a post-process to video compression. The results are reported according to the spatial resolution used, as reported in
Table 1.
As shown in the table, the required time for the multidimensional transpose as a pre- and post-process depends on the spatial resolution of the input video. On average, the required time per frame varied from 0.002 to 0.047 s, and the required time per 25 frames varied from 0.059 to 1.18 s. These results correspond to spatial resolutions varying from 416 × 240 to 1280 × 768, respectively.
There are no similar existing solutions to compare the results of this work against; nonetheless, we compare its bitrate savings against solutions that enhanced the quality of HEVC as a pre/post-process or using two-pass encoding. As reported in the introduction, ref. [
18] used a CNN-based filtering solution for locally decoded images, and ref. [
19] down-sampled blocks as a pre-process and used CNN for up-sampling them as a post-process. Additionally, ref. [
20] used a CNN-based enhancement approach for locally decoded images. The work in [
11] used two-pass encoding, and the work in [
17] proposed a solution for enhancing the video quality of intra-frame coding.
Table 10 lists the BD rate savings of these solutions in comparison to the proposed one.
In the table, the BD rate of the proposed solution is presented in two variants: with and without the two-pass solution. In the latter, the bitrate reductions are higher, as detailed in
Table 7.