Article

Content-Symmetrical Multidimensional Transpose of Image Sequences for the High Efficiency Video Coding (HEVC) All-Intra Configuration

Department of Computer Science and Engineering, American University of Sharjah, Sharjah P.O. Box 26666, United Arab Emirates
Symmetry 2025, 17(4), 598; https://doi.org/10.3390/sym17040598
Submission received: 8 March 2025 / Revised: 27 March 2025 / Accepted: 13 April 2025 / Published: 15 April 2025
(This article belongs to the Section Computer)

Abstract

Enhancing the quality of video coding whilst maintaining compliance with the syntax of video coding standards is challenging. In the literature, many solutions have been proposed that apply mainly to two-pass encoding, bitrate control algorithms, and enhancements of locally decoded images in the motion-compensation loop. This work proposes a pre- and post-coding solution using the content-symmetrical multidimensional transpose of raw video sequences. The content-symmetrical multidimensional transpose results in images composed of slices of the temporal domain whilst preserving the video content. Such slices have higher spatial homogeneity at the expense of reducing the temporal resemblance. As such, an all-intra configuration is an excellent choice for compressing such images. Prior to displaying the decoded images, a content-symmetrical multidimensional transpose is applied again to restore the original form of the input images. Moreover, we propose a lightweight two-pass encoding solution in which we apply systematic temporal subsampling on the multidimensional transposed image sequences prior to the first-pass encoding. This noticeably reduces the complexity of the encoding process of the first pass and gives an indication as to whether or not the proposed solution is suitable for the video sequence at hand. Using the HEVC video codec, the experimental results revealed that the proposed solution results in a lower percentage of coding unit splits in comparison to regular HEVC coding without the multidimensional transpose of image sequences. This finding supports the claim of there being increasing spatial coherence as a result of the proposed solution. Additionally, using four quantization parameters, and in comparison to regular HEVC encoding, the resulting BD rate is −15.12%, which indicates a noticeable bitrate reduction. The BD-PSNR, on the other hand, was 1.62 dB, indicating an enhancement in the quality of the decoded images. 
Despite all of these benefits, the proposed solution has limitations, which are also discussed in the paper.

1. Introduction

Digital video coding has important applications in video storage and communications. Raw videos are composed of a sequence of uncompressed images that require massive storage space and impractically high bandwidth for video communications. Video coding standards compress these raw videos with compression ratios of up to 1000:1, making the storage and communication of videos feasible. Such compression ratios are possible because video coding algorithms employ lossy compression, mainly through the many-to-one quantization of DCT coefficients. This can be combined with a reduction in the spatial and temporal resolutions of the input videos, all performed under the constraint of maintaining the highest possible subjective quality of the reconstructed video.
Briefly, the first usable video coding standard was H.261, which was created in 1988 [1]. This was followed by MPEG-1 in 1992 [2], which introduced bidirectional frame coding and half-pixel motion estimation. Soon after, MPEG-2 was finalized in 1995 [3], adding support for interlaced video. MPEG-2 became very popular as the digital format for TV signals broadcast over the air, as well as for cable and satellite TV systems. MPEG-2 also introduced scalable, or multilayer, video coding, which never became popular as it generated inferior quality compared to the simpler alternative based on bit stream switching. In 1996, H.263 [4] was published for use in low-bitrate communications, introducing improved motion compensation features such as overlapped block motion compensation. MPEG-4 Part 2 followed [5], including object-based compression, which did not become popular simply because image segmentation was inaccurate at that time. However, AVC, also known as H.264 [6], has become very popular and is widely used on streaming platforms. A decade later, HEVC was introduced [7]; it went back to basic block-based DCT compression and discarded object-based coding, introducing the concept of large coding units with adaptive recursive splits that makes it suitable for ultra HD streams. Lastly, VVC was introduced in 2021 [8], in which the largest coding units are double the size of their HEVC counterparts, allowing greater flexibility in adaptive recursive splitting according to the video content.
Many authors have proposed solutions for enhancing the quality of coded video. The challenge is to enhance the quality whilst remaining compliant with the syntax of the underlying video codec. There are two approaches for doing so: in the first, the enhancement algorithms are applied as pre- and/or post-processes around video coding; in the second, coding algorithms that are already compliant with the video codec syntax are enhanced. Such algorithms include bitrate control, two-pass encoding, in-loop filtering, motion estimation, and the selection of sizes and coding modes for coding units.
For video coding enhancements based on two-pass encoding, the work in [9] proposed an efficient low-latency two-pass encoding solution for live video streaming applications. In the first pass, feature variables are extracted to predict the bitrate-optimal constant rate factor to be used in the second-pass constrained variable bitrate coding. Compared to single-pass encoding, the solution resulted in bitrate savings of up to 11%. Additionally, the work in [10] proposed a two-pass rate–distortion optimization method to improve the coding efficiency of HEVC. In the first pass, a video frame is HEVC-encoded to obtain the rate–distortion model of the coding units and the number of bits allocated to the frame. In the second pass, an optimization equation combining the rate–distortion model and the bit allocation is used to determine the Lagrange multiplier and quantization parameter for each coding unit. This results in a rate–distortion performance improvement of up to 5.6% in comparison to HEVC coding. More recently, the study in [11] proposed a constrained two-pass per-title VBR encoding scheme, an optimized bitrate ladder approach for live video streaming that reduces storage and delivery costs while improving quality of experience. Evaluated with the HEVC encoder, the solution achieved an average bitrate reduction of 18.80% compared to standard HTTP live-streaming CBR encoding. Additionally, it resulted in a 68.96% reduction in storage space and an 18.58% reduction in encoding time, demonstrating its efficiency for adaptive live streaming. Moreover, the work in [12] addressed complexity reduction in two-pass rate control for the versatile video encoder. The authors proposed spatial and temporal sub-sampling during the first encoding pass to speed up the overall process, achieving an 18% encoding speedup while incurring only a 0.48% loss in coding efficiency.
For video coding enhancements using bitrate control algorithms, the work in [13] proposed the utilization of entropy-based visual saliency models within the framework of HEVC. Consequently, the quantization parameters are adjusted according to visual saliency relevance at the coding tree unit level. Efficient rate control is achieved by allocating bits to salient and non-salient coding tree units by manipulating the quantization parameters according to their perceptually weighted map. Bitrate reductions of up to 6.6% in comparison to HEVC are reported. The work in [14] proposed SSIM–MSE distortion models at the coding tree unit level to enable SSIM-based rate–distortion optimization with a simpler R-DMSE cost scaled by the SSIM-based Lagrangian parameter. Compared to HEVC encoding, the proposed solution results in bitrate savings of 5%, 11%, and 17% at the same SSIM in the all-intra, hierarchical, and non-hierarchical low-delay-B configurations, respectively. The work in [15] proposed a novel bitrate control method that takes into account the distortion characteristics of inter-frame coding when updating the bitrate control parameters. The paper also proposed a low-complexity I-frame quantization parameter decision strategy for low-delay scenarios, which makes use of estimated distortion characteristics and previous quantization parameters. The proposed method resulted in a 2.6% bitrate saving in comparison to HEVC coding. More recently, the work in [16] proposed a CTU-level bit allocation improvement scheme for intra-mode rate control. A dataset is created using natural images, and various metrics are applied to determine the significance and complexity of each coding unit. The most important coding units are weighted differently, and their optimal adjustment values are incorporated into the dataset. A PLS regression model is then used to refine the bit allocation weights.
The proposed method improves rate control accuracy by 0.453%, Y-PSNR by 0.05 dB, BD rate savings by 0.33%, and BD-PSNR by 0.03 dB compared to the standard video coding rate control algorithm. Moreover, the work in [17] presented a neural network-based rate control method for intra-frame coding. A neural network-based model is developed to predict bit allocation by mapping video content features to estimated bit usage at both the frame and CTU levels. Additionally, an improved parameter updating algorithm is introduced at the frame level. Experimental results show that ENNRC achieves 7.23% BD rate savings while providing more accurate bit allocation compared to VVC’s default rate control algorithm.
For video coding enhancements using the pre/post-processing of input and decoded images, the work in [18] proposed a frame-level filtering solution based on CNNs for enhancing the decoded video quality of HEVC. This is achieved using a deep neural network architecture for post-filtering the decoded all-intra videos. The proposed solution serves as an alternative to HEVC in-loop filtering for intra-coded frames and resulted in a BD rate saving of 11.1% compared with the HEVC reference model. The work in [19] introduced CNN-based up-sampling for intra-frame coding in HEVC, which down-sampled blocks before coding and then used a custom CNN to up-sample them. Implemented in HEVC, a BD rate saving of up to 9.0% is reported. Additionally, the work in [20] presented a quality enhancement CNN (QE-CNN) for HEVC that improved the quality of locally decoded images without altering the encoder. The reported BD rate enhancements averaged 8.31%. In [21], a post-processing solution was proposed that enhances the quality of the locally decoded images of a VVC codec by training a deep CNN that receives the locally decoded images, their prediction errors, and quantization maps. When deployed, the network enhances the quality of locally decoded images, which are consequently used for the prediction of future images, thus reducing the BD rate by 1.52% without increasing the computational complexity. Other similar solutions that work on enhancing locally decoded images include the work reported in [22] using a squeeze-and-excitation CNN, resulting in 10.05% BD rate savings. Lastly, in [23], the authors proposed a patch-wise spatial–temporal quality enhancement network which extracts and fuses spatial–temporal features. Using the HEVC baseline under the LDP configuration, the work achieved a BD rate saving of 17.24%.
On the other hand, the objective of this work is to enhance video coding through a pre- and post-processing approach, where input images undergo a content-symmetrical multidimensional transpose (CSMT) before compression. The transpose is content-symmetrical, meaning the video content remains unchanged but its spatial axes are permuted. As explained in later sections, the proposed method is particularly effective for the all-intra configuration in video coding. After decoding, the images are restored to their original form using the same CSMT. We analyze the impact of this transformation on the homogeneity of raw images and its influence on the coding process by examining the percentage of coding unit (CU) splits. Additionally, we propose a lightweight two-pass encoding approach, where the suitability of the video sequence for CSMT is assessed in the first pass before proceeding with full encoding in the second pass.
The rest of the paper is organized as follows: Section 2 provides an overview of the proposed CSMT of input videos. In Section 3, we introduce the overall system architecture. Section 4 discusses the limitations of the proposed system, Section 5 presents the experimental results, and Section 6 concludes the paper.

2. Proposed Solution

The proposed solution relies on the content-symmetrical multidimensional transpose (CSMT) of YUV video images as a pre-process to video coding and as a post-process to video decoding. The CSMT produces images that are temporal slices: each output image gathers temporal information from many input images over a small spatial area, namely one line from each image. This transpose is illustrated in Figure 1, where the upper part represents the input images and the lower part represents the output images, each of which is a temporal slice composed of individual spatial lines of the input images.
Mathematically, a sequence of images can be represented as a three-dimensional matrix as follows:
$$I = \left[ i_{xyz} \right]_{w \times h \times f}$$
where $I$ is a sequence of images; $i_{xyz}$ is the pixel value at coordinates $(x, y)$ in the image at index $z$; and $w$, $h$, and $f$ represent the number of rows, the number of columns, and the total number of frames, respectively. The following constraints apply: $1 \le x \le w$, $1 \le y \le h$, and $1 \le z \le f$.
The CSMT of I can be represented through swaps in indices and swaps in the dimensions as follows [24].
$$I^{T} = \left[ i_{xzy} \right]_{w \times f \times h}$$
Applying multidimensional transpose twice on the set of images returns them to their original form. This is mathematically expressed as follows:
$$\left( I^{T} \right)^{T(2,3)} = I$$
where T(2,3) represents swapping the second and third axes (z and y) to restore the original order of the images.
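The transpose and its restoring property can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the paper's implementation; the array shapes follow the (rows, columns, frames) convention used above.

```python
import numpy as np

def csmt(group: np.ndarray) -> np.ndarray:
    """Content-symmetrical multidimensional transpose: permute the
    second and third axes of a (w, h, f) group of frames. The pixel
    values (the content) are untouched; only the axes are swapped."""
    return np.transpose(group, (0, 2, 1))

# Small illustrative group: 4 rows, 6 columns, 8 frames.
rng = np.random.default_rng(0)
group = rng.integers(0, 256, size=(4, 6, 8), dtype=np.uint8)

transposed = csmt(group)
print(transposed.shape)                  # (4, 8, 6): columns and frames swapped

# Applying the transpose twice restores the original sequence.
restored = csmt(transposed)
print(np.array_equal(restored, group))   # True
```

Because the operation is a pure axis permutation, it is lossless and involutive, which is what allows the decoder side to recover the original frames by simply applying it again.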
Note that this approach can be used on a group of images, with a total number of images in each group equal to the number of rows in each image. This guarantees that the transposed images have the same spatial dimensions as the input images. Figure 2 shows example images transposed using CSMT, belonging to different video sequences.
The figure lists both positive examples, where the proposed solution was suitable, and negative examples, where it was unsuitable. Suitability here refers to whether or not the proposed transpose resulted in video compression enhancements in terms of BD rate and BD-PSNR. Such suitability of the proposed solution for various sequences can be detected prior to full encoding, as shall be elaborated upon in Section 3.
In general, it was experimentally observed that on average, the images transposed using CSMT have lower spatial variance and entropy than the original images. More specifically, using the video sequences listed in the experimental results section, it was observed that the average spatial variance per image dropped from 1624 to 1466, and the average entropy per image dropped from 7.03 bits to 6.86 bits, which represent 9.73% and 2.42% drops in variance and entropy, respectively.
These statistics indicate that the images transposed using CSMT are good candidates for the all-intra coding configuration of the HEVC video codec. Clearly, such images are not good candidates for inter-frame coding as the transpose increases the spatial homogeneity at the expense of the temporal resemblance.
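The kind of per-image statistics quoted above can be reproduced with a short sketch; the helper names below are illustrative, and the sequences are assumed to be 8-bit grayscale frames.

```python
import numpy as np

def image_entropy(img: np.ndarray) -> float:
    """Shannon entropy (in bits) of an 8-bit image, from its histogram."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                              # ignore empty histogram bins
    return float(-(p * np.log2(p)).sum())

def mean_variance_and_entropy(frames: np.ndarray):
    """Average per-image spatial variance and entropy over a (w, h, f)
    sequence; comparing these with and without CSMT gives the kind of
    statistics reported above."""
    f = frames.shape[2]
    variances = [float(frames[:, :, z].var()) for z in range(f)]
    entropies = [image_entropy(frames[:, :, z]) for z in range(f)]
    return float(np.mean(variances)), float(np.mean(entropies))

# Sanity checks: a constant image carries no information, while a
# half-black/half-white image has exactly 1 bit of entropy.
flat = np.zeros((8, 8), dtype=np.uint8)
half = np.array([[0, 255], [0, 255]], dtype=np.uint8)
assert image_entropy(flat) == 0.0
assert image_entropy(half) == 1.0
```

Lower average variance and entropy suggest that intra prediction and transform coding will find the transposed frames easier to compress, which motivates the all-intra configuration.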
Therefore, one research question raised in this work is whether or not the content-symmetrical multidimensional transpose of images enhances the efficiency of video coding using the all-intra configuration. Another research question is whether or not individual video sequences can be examined to determine the suitability of the proposed solution prior to full encoding. These questions are addressed in the sections that follow.

3. Overall System Architecture

To address the two research questions posed in the previous section, we propose the system architecture illustrated in Figure 3.
The first block in the figure pertains to the decision logic in which a decision is made on whether or not the proposed solution is suitable for a particular video sequence. The details of this decision logic are presented later in this section. If a video sequence is deemed suitable for the proposed solution, then the input images are subjected to CSMT prior to video encoding. After that, the compressed video is either transmitted and/or stored. This is followed by video decoding, where CSMT takes place again prior to video display. As noted in the previous section and illustrated in Equation (3), transposing the transposed sequence puts it back to its original non-transposed form.
On the other hand, if a video sequence is deemed unsuitable for the proposed solution, then it goes through the typical video coding pipeline illustrated in the figure. Note that the figure is illustrated in such a way as to emphasize that the proposed solution is a pre- and a post-process to video coding, without the need for video syntax amendments or modifications to the compression algorithm itself.
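The pipeline for a suitable sequence can be summarized as follows. This is a sketch only: `encode` and `decode` are hypothetical placeholders standing in for an HEVC all-intra encoder and decoder, and no codec syntax is modified.

```python
import numpy as np

def csmt(frames: np.ndarray) -> np.ndarray:
    # Permute the second and third axes; the content is unchanged,
    # and applying the transpose twice restores the original frames.
    return np.transpose(frames, (0, 2, 1))

def code_with_csmt(frames, encode, decode):
    """Sketch of the Figure 3 pipeline: CSMT before encoding and
    again after decoding, with `encode`/`decode` as placeholders."""
    bitstream = encode(csmt(frames))         # pre-process, then encode
    return csmt(decode(bitstream))           # decode, then post-process

# With a lossless (identity) stand-in codec, the displayed frames
# equal the input frames exactly.
frames = np.arange(2 * 3 * 4, dtype=np.uint8).reshape(2, 3, 4)
out = code_with_csmt(frames, encode=lambda x: x, decode=lambda x: x)
assert np.array_equal(out, frames)
```

With a real lossy codec the output would differ from the input only by the usual coding distortion; the transpose itself introduces no loss.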
The decision logic block in Figure 3 is elaborated upon in Figure 4. Basically, this can be considered a lightweight first encoding pass in which the input images are temporally subsampled prior to encoding. One valid approach is to use systematic sampling in which every kth image from the input sequence is retained. Formally, with N images, to retain n images only, the value of k is calculated as follows.
$$k = \left\lfloor \frac{N}{n} \right\rfloor$$
Consequently, the images to retain will be at the following indices:
Retained image indices $= p,\ p + k,\ p + 2k,\ \ldots,\ p + (n-1)k$
where p is a number between 1 and k.
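The systematic sampling rule above can be sketched in a few lines; the function name is illustrative.

```python
def systematic_sample_indices(N: int, n: int, p: int = 1) -> list:
    """1-based indices of the n frames retained out of N by systematic
    temporal sub-sampling with step k and starting offset p."""
    k = N // n                       # step size k = floor(N / n)
    if not 1 <= p <= k:
        raise ValueError("p must lie between 1 and k")
    return [p + j * k for j in range(n)]

# Example: retain 10 frames out of 250, i.e., 1 frame out of every 25.
indices = systematic_sample_indices(N=250, n=10)
print(indices)   # [1, 26, 51, 76, 101, 126, 151, 176, 201, 226]
```

Since every retained index is at most p + (n − 1)k ≤ N, the sampled set never runs past the end of the sequence.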
In Figure 4, the systematic temporal sub-sampling step makes sense as in the all-intra coding configuration, temporal dependencies between video frames are not taken into account; rather, each video frame is compressed as an intra-frame in isolation from the surrounding video frames. As such, temporal sub-sampling does not affect the coding efficiency of the underlying video.
After the systematic temporal sub-sampling, the images are subjected to CSMT and encoded. Since the encoder results in locally decoded images, there is no need to run the decoder separately. The PSNR of the locally decoded images and the resultant bitrate of the encoder are compared against the case of encoding without CSMT. A decision on whether or not to use the proposed solution is then made. These statements are revisited and further elaborated upon in the experimental results section.
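The resulting decision rule can be expressed compactly. This is a sketch under the assumption that the first-pass comparison is summarized as BD-PSNR and BD rate values, as done in the experimental results section.

```python
def csmt_is_suitable(bd_psnr: float, bd_rate: float) -> bool:
    """Decision rule after the lightweight first pass: adopt CSMT only
    when BD-PSNR is positive (quality gain) and BD rate is negative
    (bitrate saving) relative to encoding without CSMT."""
    return bd_psnr > 0.0 and bd_rate < 0.0

# A sequence with a quality gain and a bitrate saving is accepted;
# one that loses quality and gains bitrate is rejected.
assert csmt_is_suitable(1.69, -16.21)
assert not csmt_is_suitable(-0.4, 5.3)
```

Sequences rejected by this rule simply fall back to the regular coding pipeline of Figure 3.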

4. Limitations of the Proposed System

In the experimental results, the proposed solution shows a clear enhancement over ordinary video coding in terms of bitrate and PSNR. However, for completeness, this section lists the limitations of the proposed solution.
First, as mentioned previously, the proposed solution works for the all-intra coding configuration only. This is because the CSMT of input images typically results in more spatially homogeneous images at the expense of reduced temporal resemblance. This claim can be verified experimentally by comparing the entropy of predicted images with and without the use of the proposed solution. A predicted image in this context is the difference between an image and the motion-compensated version of its predecessor, which is represented as follows:
$$\mathrm{Predicted\_img}^{f}_{x,y} = \mathrm{img}^{f+1}_{x,y} - \mathrm{img}^{f}_{\,x - V_x^{f}(x,y),\; y - V_y^{f}(x,y)}, \qquad f = 1 \ldots N-1,\ x = 1 \ldots w,\ y = 1 \ldots h$$
where $V_x^{f}$ and $V_y^{f}$ are the $x$ and $y$ motion vector components of the $f$th image, and $w$ and $h$ are the numbers of rows and columns of the image, respectively.
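A minimal sketch of this residual computation is given below, assuming dense per-pixel motion vectors; clipping out-of-range displaced coordinates to the image borders is an assumption made here for simplicity, not part of the paper.

```python
import numpy as np

def predicted_image(img_next: np.ndarray, img_cur: np.ndarray,
                    vx: np.ndarray, vy: np.ndarray) -> np.ndarray:
    """Prediction residual: frame f+1 minus frame f displaced by the
    per-pixel motion vectors (vx, vy). Displaced coordinates falling
    outside the image are clipped to the borders (an assumption)."""
    w, h = img_cur.shape                       # w rows, h columns
    xs, ys = np.mgrid[0:w, 0:h]
    src_x = np.clip(xs - vx, 0, w - 1)
    src_y = np.clip(ys - vy, 0, h - 1)
    return img_next.astype(np.int16) - img_cur[src_x, src_y].astype(np.int16)

# With zero motion and identical frames, the residual is all zeros,
# and its entropy would be zero as well.
frame = np.arange(16, dtype=np.uint8).reshape(4, 4)
zero_mv = np.zeros((4, 4), dtype=np.int64)
residual = predicted_image(frame, frame, zero_mv, zero_mv)
assert not residual.any()
```

Feeding such residuals to an entropy measure, averaged over the sequence, reproduces the kind of comparison described in the next paragraph.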
Having calculated the entropy of the predicted images with and without CSMT, averaged over all images, the results are 4.1 bits and 3.99 bits, respectively. Since the entropy is increased, this indicates that the proposed solution reduces the temporal resemblance; hence, it is not suitable for inter-frame coding.
The second limitation of the proposed solution is that it is more suitable for off-line coding as opposed to compressing a live stream of images. This is because, to perform the CSMT and create images with the same dimensions as the input (rows × columns), we need access to a number of images equal to the number of rows. Again, each transposed image contains one row of pixels from each input image, up to the number of rows in the input images. For example, if the images have 480 rows and 832 columns, then we need access to 480 images to perform the CSMT, resulting in images with the same dimensions as the input, namely 480 rows and 832 columns. This creates an initial delay of 480 images, which might not be suitable for encoding a live stream.
One can alleviate such an initial delay by creating transposed images of size (64 × 832) instead of (480 × 832), and 64 is chosen as it is the size of the largest coding unit in HEVC. While such a solution reduces the initial delay, it results in splitting images into many smaller images. When compressed using video coding, each smaller image will have its own frame header, which contains many syntax elements according to the underlying video coding standard used. These extra headers increase the coding bits and thus reduce the efficiency of the proposed solution. Reducing the initial delay is therefore not feasible; thus, the proposed solution is more suitable for off-line coding, which is the second limitation of this work.

5. Experimental Results

In this section, the video sequences reported in Table 1 are used to generate the experimental results. The diverse characteristics of the sequences guarantee that the proposed solution is not suitable for all of them; however, the proposed decision logic introduced in Section 3 will be applied to identify such sequences.
In digital video compression, a typical experimental setup involves reporting the bitrate and PSNR using four quantization parameters (QPs). In HEVC video coding, these QPs are 22, 27, 32, and 37 [25]. As mentioned previously, the proposed solution is for the all-intra configuration of HEVC.
In Section 2, it was mentioned that the proposed multidimensional transpose of input videos reduces variance and entropy on average, making them more suitable for the all-intra coding profile. These statistics are calculated on the raw YUV images. However, additional statistics can be derived from the actual all-intra coding process to further analyze the impact of the proposed transpose on coding unit (CU) splitting. It is well known that CU splits are applied automatically by the encoder to achieve smaller block sizes that are more homogeneous in content. Therefore, a higher percentage of CU splits indicates that the input video frames are less homogeneous. The experiments are repeated four times, once for each of the quantization parameters (QP) used (22, 27, 32, and 37). The results are presented in Table 2.
In both parts of Table 2, the last row represents the average percentage of CU splits per QP, while the last column represents the average per video sequence. As can be seen, on average, the percentage of CU splits is lower when the proposed multidimensional transpose is applied, indicating that the proposed solution yields more homogeneous video content that is better suited to the all-intra profile. However, while the CU split percentage decreases for most video sequences, it does not do so for all of them, which further justifies the need for the two-pass encoding illustrated in Figure 4 of Section 3.
In the following experiments, the bitrate and PSNR are reported for both the proposed solution and the standard, unmodified HEVC encoding, which serves as a benchmark in this study. These results, obtained using four QP values, are presented in Table 3.
As can be seen from the results, on average, the proposed solution reduced the bitrate and enhanced the PSNR of the coded videos at all four quantization parameters. However, looking at the results of individual videos, we notice some exceptions. To clearly point out the video sequences that were suitable or unsuitable for the proposed solution, the results in Table 3 are further summarized as percentage bitrate decreases and PSNR differences, as reported in Table 4.
Here, the percentage decrease in bitrate (with bitrates measured in kbit/s) is calculated as follows:
$$\mathrm{Decrease\ percent} = \frac{\mathrm{bitrate}(\mathrm{HEVC}) - \mathrm{bitrate}(\mathrm{Proposed})}{\mathrm{bitrate}(\mathrm{HEVC})} \times 100$$
The PSNR differences are calculated as follows:
$$\Delta \mathrm{PSNR} = \mathrm{PSNR}(\mathrm{proposed}) - \mathrm{PSNR}(\mathrm{HEVC})$$
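Both metrics are straightforward to compute; the numbers below are hypothetical and for illustration only.

```python
def bitrate_decrease_percent(bitrate_hevc: float, bitrate_proposed: float) -> float:
    """Percentage bitrate decrease of the proposed solution relative
    to HEVC (bitrates in kbit/s); positive values mean a saving."""
    return (bitrate_hevc - bitrate_proposed) / bitrate_hevc * 100.0

def delta_psnr(psnr_proposed: float, psnr_hevc: float) -> float:
    """PSNR difference in dB; positive values favor the proposal."""
    return psnr_proposed - psnr_hevc

# Hypothetical example: 750 kbit/s vs 1000 kbit/s is a 25% decrease,
# and 38.5 dB vs 37.0 dB is a +1.5 dB gain.
assert bitrate_decrease_percent(1000.0, 750.0) == 25.0
assert delta_psnr(38.5, 37.0) == 1.5
```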
As can be seen in the table, the proposed solution resulted in average bitrate decreases of around 14%, 13%, 11%, and 9% at the four QPs. The largest bitrate decrease is obtained at the lowest QP, which is 22 in this work; the savings then decrease slightly as the QP increases, which is expected in video encoding. The reported percentage decreases in bitrate are considered remarkable given that the PSNR also increased on average. This indicates that the proposed solution of applying a multidimensional image sequence transpose as a pre- and post-process in all-intra video coding works remarkably well.
However, the results are not uniformly positive. A closer look at the results for individual video sequences in Table 3 and Table 4 shows that video sequences 4 and 7 consistently generate worse results than standard HEVC encoding when the proposed solution is used. These sequences are BQMall and City, sample images of which are displayed in Figure 2 above. Hence, the proposed lightweight two-pass encoding solution is needed to identify whether a sequence is suitable for the proposed CSMT, as shall be elaborated upon later in this section. For further visualization of the reported results, we also display samples of the rate–distortion curves of both positive and negative results (sequences 4 and 7) in Figure 5.
Figure 5a presents sample sequences where the proposed solution worked well, and Figure 5b presents the curves for the two sequences (4 and 7) that generated negative results. Note that the rest of the sequences that generate positive results are not added to Figure 5a as their curves overlap with existing ones.
Moreover, the areas between the curves of the reference HEVC and the proposed solution can be quantified by means of the BD-PSNR and BD rate [26]. A positive BD-PSNR indicates that, on average, the PSNR of the proposed solution is higher than that of the reference HEVC encoder, while a negative BD rate indicates that, on average, the bitrate of the proposed solution is lower than that of the reference HEVC encoder. In summary, in this work, it is desired to report a positive BD-PSNR and a negative BD rate. These results are reported in Table 5.
As can be seen in the table, all sequences generated the desired values for the BD-PSNR and BD rate, except for sequences 4 and 7. The average BD-PSNR and BD rate, excluding these sequences, are 2.72 dB and −31.91%, respectively.
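For reference, the Bjøntegaard delta rate can be computed by fitting each rate–distortion curve with a cubic polynomial and integrating over the overlapping PSNR range. The sketch below follows the common log-domain formulation of [26]; the rate–PSNR points are hypothetical.

```python
import numpy as np

def bd_rate(rates_ref, psnrs_ref, rates_test, psnrs_test):
    """Bjontegaard delta rate: fit log10(rate) as a cubic in PSNR for
    each codec, integrate both fits over the overlapping PSNR range,
    and convert the mean log-rate gap to a percentage. Negative values
    mean the test codec needs fewer bits on average."""
    p_ref = np.polyfit(psnrs_ref, np.log10(rates_ref), 3)
    p_test = np.polyfit(psnrs_test, np.log10(rates_test), 3)
    lo = max(min(psnrs_ref), min(psnrs_test))
    hi = min(max(psnrs_ref), max(psnrs_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0

# Four hypothetical QP points: identical curves give 0%, and a curve
# with a 10% lower rate at every PSNR gives about -10%.
rates = [100.0, 200.0, 400.0, 800.0]
psnrs = [32.0, 35.0, 38.0, 41.0]
assert abs(bd_rate(rates, psnrs, rates, psnrs)) < 1e-9
lower = [0.9 * r for r in rates]
assert abs(bd_rate(rates, psnrs, lower, psnrs) + 10.0) < 1e-5
```

BD-PSNR is computed analogously by fitting PSNR as a function of log-rate and integrating over the overlapping rate range.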
To further investigate the proposed solution, we present its effect on the encoding process in terms of the percentage of splits applied to coding units during the compression. It is known in video coding that video frames are divided into blocks of pixels with a typical size of 64 × 64 pixels, referred to as largest coding units (LCUs). These coding units are then recursively divided into smaller units according to the spatial characteristics of the pixels or their prediction residuals. The smallest size of a coding unit is 4 × 4 pixels. Table 6 lists the average percentage of coding unit splits per video sequence per QP value for both the proposed solution and regular HEVC coding.
It is interesting to observe that on average, the percentage of coding unit splits is lower in the proposed solution for all of the four quantization parameter values. This indicates that the proposed CSMT of the input image sequences results in more spatially homogeneous regions and, consequently, the decreased need for splitting the coding units. In turn, this results in fewer coded elements and lower syntax overhead in the output video bit stream, which also justifies the decrease in bitrate reported in Table 3 and Table 4 above.
Another interesting finding from the results in Table 6 is related to the percentages of splits reported for sequences 4 and 7. In general, the encoding of these sequences using the proposed solution resulted in a higher percentage of splits for the coding units, which can also help in justifying why they resulted in higher bitrates compared to the regular HEVC encoding process.
Lastly, we present the results of the proposed lightweight two-pass solution, which is used to detect whether or not a video sequence is suitable for the proposed CSMT prior to full encoding. As mentioned in Section 3, systematic sampling is used for the temporal sub-sampling of the video frames prior to encoding. In this work, we retain one video frame out of every 25 frames. This is a suitable step size as it encodes one or two frames per second, depending on the temporal resolution of the input video, thus reducing the number of frames to be encoded by a factor of 25. Again, as mentioned in Section 3, this arrangement works for the all-intra configuration as there are no temporal dependencies between frames; thus, systematic sampling applies.
In Table 7, we report the BD-PSNR and BD rates resulting from this lightweight first-pass encoding. For completeness, we also replicate the BD-PSNR and BD rates reported in Table 5 for the second-pass encoding, which is the full-pass encoding.
It is observed that the BD-PSNR and BD rate results are very similar in both encoding passes. The averages of the first pass are 1.69 dB and −16.21%, while in the second pass, the averages are 1.62 dB and −15.12% for BD-PSNR and BD rates, respectively. This indicates that the first lightweight pass can be used to make a decision on the suitability of the proposed solution for the underlying sequence prior to full encoding. In the table, it is shown that the BD-PSNR and BD rate results are positive and negative, respectively, except for sequences 4 and 7. In summary, in the first lightweight encoding pass, if the BD-PSNR and BD rate values are positive and negative, respectively, then the proposed solution of CSMT in the all-intra encoding configuration is a viable solution.
For completeness, the computational times for the first- and second-pass encodings are reported in Table 8. The reported times are averaged over all sequences using the four quantization parameters. The results were generated on a laptop running Windows 10 with a 10th-generation Intel Core i9 processor, 32 GB of RAM, and an NVIDIA Quadro T2000 GPU.
The results in the table show that the first pass is indeed a lightweight encoding pass, as it requires a fraction of the time needed for full encoding. More specifically, the required time is reduced by a factor of approximately 24 to 25. This is expected, as the systematic temporal subsampling retains only every 25th image of the input sequence.
Table 9 reports the additional time required to perform the proposed multidimensional transpose as a pre- and a post-process to video compression. The results are reported according to the spatial resolution used, as reported in Table 1.
As shown in the table, the required time for the multidimensional transpose as a pre- and post-process depends on the spatial resolution of the input video. On average, the required time per frame varied from 0.002 to 0.047 s, and the required time per 25 frames varied from 0.059 to 1.18 s. These results correspond to spatial resolutions varying from 416 × 240 to 1280 × 768, respectively.
No directly comparable solutions exist in the literature; nonetheless, we compare the bitrate savings of this work against solutions that enhance the quality of HEVC as a pre-/post-process or through two-pass encoding. As reported in the introduction, ref. [18] used a CNN-based filtering solution for locally decoded images, and ref. [19] down-sampled blocks as a pre-process and used a CNN to up-sample them as a post-process. Additionally, ref. [20] used a CNN-based enhancement approach for locally decoded images. The work in [11] used two-pass encoding, and the work in [17] proposed a solution for enhancing the video quality of intra-frame coding. Table 10 lists the BD rate savings of these solutions in comparison to the proposed one.
In the table, the BD rate of the proposed solution is presented in two variants: without and with the two-pass solution. In the latter, the bitrate reductions are higher, as detailed in Table 7.

6. Discussion

As mentioned previously, in addition to enhancing the bitrate savings, the proposed solution has an additional advantage related to its simplicity as it is purely based on pre- and post-processes of CSMT of image sequences and does not require deep learning or other similar predictive solutions.
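The simplicity claim can be made concrete with a short sketch. Assuming the sequence is stored as a (frames, rows, cols) array (the axis convention and function name here are illustrative assumptions, not the paper's specification), the CSMT amounts to a single axis permutation that is its own inverse:

```python
import numpy as np

def csmt(video: np.ndarray) -> np.ndarray:
    """Swap the temporal and row axes of a (frames, rows, cols) sequence.
    Every output image is a slice through time; pixel values are only
    rearranged, never altered, so the transform is its own inverse."""
    return np.transpose(video, (1, 0, 2))

rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(250, 48, 64), dtype=np.uint8)
slices = csmt(video)       # 48 images of 250 x 64 samples (time x cols)
assert slices.shape == (48, 250, 64)
assert np.array_equal(csmt(slices), video)  # post-process restores the input
```

Because the transform is lossless and content-preserving, the only cost is the memory traffic of the rearrangement, which matches the per-frame timings reported in Table 9.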
In general, real-world video coding scenarios can be divided into two categories: those that require live streaming, and those that rely on pre-encoded video. The former includes applications such as video conferencing, live broadcasting, and live surveillance, while the latter encompasses video streaming services, Blu-ray disc content, and video-on-demand platforms.
While the proposed solution is not suitable for live broadcasting, it is well-suited for scenarios that rely on pre-encoded video. This is because the additional time required for the first lightweight encoding pass and the pre- and post-processing steps (involving the multidimensional transpose) can be performed offline, without real-time constraints.
Future work directions include reducing the latency of the proposed solution by applying the multidimensional transpose at the CU level instead of the frame level. A future study will be required to examine the impact of this approach on syntactical video overhead. Additionally, latency can be further reduced by utilizing deep learning to predict whether a video sequence is suitable for the multidimensional transpose without requiring the first lightweight encoding pass. Reducing the latency can help in implementing this solution in live video streaming. Another research direction involves experimenting with a modified multidimensional transpose, where video frames are constructed using planes of pixels arranged from right to left instead of from top to bottom.
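As a hedged sketch of that last direction, the column-based variant would simply permute a different pair of axes. The function below is a hypothetical illustration under the same assumed (frames, rows, cols) layout, not a specification from this work:

```python
import numpy as np

def csmt_columns(video: np.ndarray) -> np.ndarray:
    """Hypothetical variant: swap the temporal and column axes of a
    (frames, rows, cols) sequence. The permutation (2, 1, 0) is an
    involution, so applying it twice restores the original sequence."""
    return np.transpose(video, (2, 1, 0))

clip = np.arange(2 * 3 * 4).reshape(2, 3, 4)
assert csmt_columns(clip).shape == (4, 3, 2)
assert np.array_equal(csmt_columns(csmt_columns(clip)), clip)
```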

7. Conclusions

In this work, the enhancement of the quality of all-intra video coding using a content-symmetrical multidimensional transpose (CSMT) of image sequences as a pre- and post-process to video coding was proposed. Having applied the proposed CSMT to various video sequences, it was found that the average spatial variance per image dropped by 9.73% and the average entropy per image dropped by 2.42%, indicating the suitability of the proposed solution for the all-intra configuration of the HEVC codec. In the context of inter-frame coding, however, the entropy of the predicted images increased from 3.99 bits without the proposed solution to 4.1 bits with it. This indicates that the proposed solution reduces the temporal resemblance; hence, it is not suitable for inter-frame coding.
In the experimental results, it was observed that the proposed solution is not suitable for all video sequences; therefore, the proposed two-pass solution was used. The BD-PSNR and BD rate results of the first pass with temporal subsampling are very similar to those of the full encoding; thus, such lightweight first-pass encoding can be used to detect the suitability of the proposed solution prior to the full encoding of the underlying video sequence. This arrangement worked well as temporal subsampling can be applied to the all-intra configuration where temporal dependencies are irrelevant.
Furthermore, across various video sequences and four different quantization parameters, the proposed solution achieved a BD rate of −15.12% and a BD-PSNR of 1.62 dB. This indicates a clear reduction in bitrate and an increase in video quality. With the two-pass encoding, which eliminates the video sequences that are not suitable for the proposed solution, the two measures improved to −31.9% and 2.72 dB, respectively.
Future work will involve reducing latency by applying the multidimensional transpose at the CU level instead of the frame level. Additional reductions can be achieved by using deep learning to predict the suitability of the solution for individual video sequences, replacing the two-pass coding solution.

Funding

The work in this paper was supported, in part, by the Open Access Program of the American University of Sharjah, award number OPFY25-3152-OE2558. This paper represents the opinions of the author and is not meant to represent the position or opinions of the American University of Sharjah.

Data Availability Statement

The video test sequences used in this work belong to the ISO/IEC MPEG group and can be requested from them.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HEVC   High Efficiency Video Coding
CSMT   Content-Symmetrical Multidimensional Transpose
CNN    Convolutional Neural Network
QP     Quantization Parameter
PSNR   Peak Signal-to-Noise Ratio
LCU    Largest Coding Unit

References

  1. ITU-T. H.261; Video Codec for Audiovisual Services at p x 384 Kbit/s—Recommendation H.261 (11/88). ITU: Geneva, Switzerland, 1988.
  2. ISO/IEC 11172-2 (MPEG-1); Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5 Mbit/s Part 2: Video. ISO: Geneva, Switzerland, 1993.
  3. ISO/IEC 13818-2 (MPEG-2 Video); ITU-T and ISO/IEC JTC 1, Generic Coding of Moving Pictures and Associated Audio Information—Part 2: Video, ITU-T Rec. H.262, Version 1. ITU: Geneva, Switzerland, 1994.
  4. Rijkse, K. H.263: Video coding for low-bit-rate communication. IEEE Commun. Mag. 1996, 34, 42–45. [Google Scholar] [CrossRef]
  5. ISO/IEC 14496-2 (MPEG-4 Visual Version 1); ISO/IEC, Coding of Audio-Visual Objects—Part 2: Visual. ISO: Geneva, Switzerland, 1999.
  6. ISO/IEC 14496-10 (AVC); ITU-T and ISO/IEC JTC 1, Advanced Video Coding for Generic Audiovisual Services. ITU-T Rec. H.264, Version 1. ITU: Geneva, Switzerland, 2003.
  7. Sullivan, G.; Ohm, J.-R.; Han, W.; Wiegand, T. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1648–1667. [Google Scholar] [CrossRef]
  8. ISO/IEC 23090-3:2022; Information Technology—Coded Representation of Immersive Media—Part 3: Versatile Video Coding. ISO: Geneva, Switzerland, 2022.
  9. Menon, V.V.; Amirpour, H.; Ghanbari, M.; Timmerer, C. ETPS: Efficient Two-Pass Encoding Scheme for Adaptive Live Streaming. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 1516–1520. [Google Scholar] [CrossRef]
  10. Guo, H.; Xiao, T.; Fan, X.; Liu, Y.; Liu, S. Rate-Distortion Optimization Based on Two-Pass Encoding for HEVC. IEEE Access 2021, 9, 146888–146899. [Google Scholar] [CrossRef]
  11. Menon, V.V.; Rajendran, P.T.; Feldmann, C.; Schoeffmann, K.; Ghanbari, M.; Timmerer, C. JND-Aware Two-Pass Per-Title Encoding Scheme for Adaptive Live Streaming. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 1281–1294. [Google Scholar] [CrossRef]
  12. Henkel, A.; Helmrich, C.; Hinz, T.; Brandenburg, J.; Wieckowski, A.; Bross, B. Fast First Pass in Two-Pass Video Encoding Using Sub-Sampling. In Proceedings of the 2024 Picture Coding Symposium (PCS), Taichung, Taiwan, 12–14 June 2024; pp. 1–5. [Google Scholar] [CrossRef]
  13. Zeeshan, M.; Majid, M. High Efficiency Video Coding Compliant Perceptual Video Coding Using Entropy Based Visual Saliency Model. Entropy 2019, 21, 964. [Google Scholar] [CrossRef]
  14. Li, Y.; Mou, X. Joint Optimization for SSIM-Based CTU-Level Bit Allocation and Rate Distortion Optimization. IEEE Trans. Broadcast. 2021, 67, 500–511. [Google Scholar] [CrossRef]
  15. Zhao, F.; Ku, C.; Xiang, G.; Jia, H.; Cui, Y.; Li, Y.; Xie, X. A Novel Quality Enhanced Low Complexity Rate Control Algorithm for HEVC. In Proceedings of the 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP), Macau, China, 1–4 December 2020; pp. 278–280. [Google Scholar] [CrossRef]
  16. Jin, X.; Sun, H.; Zhang, Y. Research on VVC Intra-Frame Bit Allocation Scheme Based on Significance Detection. Appl. Sci. 2024, 14, 471. [Google Scholar] [CrossRef]
  17. Zhao, Z.; He, X.; Xiong, S.; Bi, X.; Chen, H. An Efficient Neural Network Based Rate Control for Intra-frame in Versatile Video Coding. IEEE Trans. Veh. Technol. 2024, 1–5. [Google Scholar] [CrossRef]
  18. Huang, H.; Schiopu, I.; Munteanu, A. Frame-Wise CNN-Based Filtering for Intra-Frame Quality Enhancement of HEVC Videos. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 2100–2113. [Google Scholar] [CrossRef]
  19. Li, Y.; Liu, D.; Li, H.; Li, L.; Wu, F.; Zhang, H.; Yang, H. Convolutional Neural Network-Based Block Up-sampling for Intra Frame Coding. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 2316–2330. [Google Scholar] [CrossRef]
  20. Yang, R.; Xu, M.; Liu, T.; Wang, Z.; Guan, Z. Enhancing Quality for HEVC Compressed Videos. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 2039–2054. [Google Scholar] [CrossRef]
  21. Nasiri, F.; Hamidouche, W.; Morin, L.; Dhollande, N.; Cocherel, G. A CNN-Based Prediction-Aware Quality Enhancement Framework for VVC. IEEE Open J. Signal Process. 2021, 2, 466–483. [Google Scholar] [CrossRef]
  22. Bouaafia, S.; Khemiri, R.; Messaoud, S.; Ahmed, O.; Sayadi, F. Deep learning-based video quality enhancement for the new versatile video coding. Neural Comput. Appl. 2022, 34, 14135–14149. [Google Scholar] [CrossRef] [PubMed]
  23. Ding, Q.; Shen, L.; Yu, L.; Yang, H.; Xu, M. Patch-Wise Spatial-Temporal Quality Enhancement for HEVC Compressed Video. IEEE Trans. Image Process. 2021, 30, 6459–6472. [Google Scholar] [CrossRef] [PubMed]
  24. Solo, A. Multidimensional Matrix Mathematics: Multidimensional Matrix Transpose, Symmetry, Antisymmetry, Determinant, and Inverse, Part 4 of 6. In Proceedings of the World Congress on Engineering 2010 Vol. III—WCE 2010, London, UK, 30 June–2 July 2010. [Google Scholar]
  25. Bossen, F. JCTVC-L1100: Common HM Test Conditions and Software Reference Configurations. JCT-VC Document Management System (April 2013). 2013. Available online: https://www.itu.int/wftp3/av-arch/jctvc-site/2013_01_L_Geneva/ (accessed on 13 October 2024).
  26. Bjontegaard, G. Calculation of Average PSNR Differences Between RD-Curves; VCEG-M33; Video Coding Experts Group (VCEG): Geneva, Switzerland, 2001. [Google Scholar]
Figure 1. Illustration of the content-symmetrical multidimensional transpose of input images. Each color represents a row index.
Figure 2. Example images transposed using CSMT, belonging to different video sequences. The first two columns are sample images from video sequences, and the last column is an example transposed image, which is a slice from a sequence of input images.
Figure 3. Block diagram of overall proposed system with CSMT of images as pre- and post-processes to video coding using the all-intra configuration.
Figure 4. Block diagram of the proposed lightweight first-pass encoding.
Figure 5. Rate–distortion curves for positive and negative results. The x-axis represents the bitrate in Kbit/s, and the y-axis represents the PSNR in dB.
Table 1. Video sequences used and their resolutions.
Seq. ID | Name            | W × H      | Num. of Frames
1       | BasketballDrill | 832 × 480  | 500
2       | BasketballPass  | 416 × 240  | 500
3       | BlowingBubbles  | 416 × 240  | 500
4       | BQMall          | 832 × 480  | 600
5       | BQSquare        | 416 × 240  | 600
6       | Cactus          | 1280 × 768 | 500
7       | CITY            | 704 × 576  | 300
8       | ParkScene       | 1280 × 768 | 240
9       | PartyScene      | 832 × 480  | 500
Table 2. Percentage of coding unit splits across different QPs.
Seq. ID | QP22 | QP27 | QP32 | QP37 | Avg.
1    | 0.83 | 0.74 | 0.56 | 0.39 | 0.63
2    | 0.66 | 0.56 | 0.48 | 0.39 | 0.52
3    | 0.89 | 0.83 | 0.75 | 0.60 | 0.77
4    | 0.69 | 0.63 | 0.56 | 0.46 | 0.59
5    | 0.91 | 0.84 | 0.79 | 0.75 | 0.82
6    | 0.71 | 0.56 | 0.47 | 0.36 | 0.52
7    | 0.59 | 0.59 | 0.58 | 0.49 | 0.56
8    | 0.63 | 0.53 | 0.43 | 0.30 | 0.47
9    | 0.90 | 0.87 | 0.82 | 0.75 | 0.83
Avg. | 0.76 | 0.68 | 0.60 | 0.50 | 0.64
(a) Original video sequences without multidimensional transpose.

Seq. ID | QP22 | QP27 | QP32 | QP37 | Avg.
1    | 0.54 | 0.49 | 0.43 | 0.36 | 0.45
2    | 0.79 | 0.72 | 0.62 | 0.51 | 0.66
3    | 0.79 | 0.75 | 0.66 | 0.53 | 0.68
4    | 0.65 | 0.55 | 0.48 | 0.40 | 0.52
5    | 0.82 | 0.76 | 0.65 | 0.59 | 0.70
6    | 0.53 | 0.44 | 0.39 | 0.32 | 0.42
7    | 0.72 | 0.72 | 0.67 | 0.53 | 0.66
8    | 0.75 | 0.65 | 0.53 | 0.35 | 0.57
9    | 0.66 | 0.65 | 0.61 | 0.53 | 0.61
Avg. | 0.70 | 0.64 | 0.56 | 0.46 | 0.59
(b) Video sequences with the proposed multidimensional transpose.
Table 3. Bitrate in kbit/s and PSNR in dB for the proposed solution and the standard HEVC encoding at four different QPs for the all-intra coding configuration.
Seq. | Proposed Bitrate | Proposed PSNR | Ref. HEVC Bitrate | Ref. HEVC PSNR
1    | 9194.06   | 42.90 | 11,308.06 | 41.77
2    | 2214.90   | 43.75 | 2374.41   | 42.60
3    | 2773.23   | 41.60 | 4257.62   | 41.00
4    | 11,841.29 | 41.50 | 10,698.50 | 41.80
5    | 3150.51   | 42.04 | 5150.29   | 41.38
6    | 29,508.62 | 40.52 | 36,861.16 | 40.61
7    | 21,742.35 | 41.56 | 15,690.92 | 42.09
8    | 24,842.78 | 41.70 | 30,625.08 | 41.66
9    | 15,579.65 | 41.69 | 24,631.49 | 41.07
Avg. | 13,427.49 | 41.92 | 15,733.06 | 41.55
(a) QP 22

Seq. | Proposed Bitrate | Proposed PSNR | Ref. HEVC Bitrate | Ref. HEVC PSNR
1    | 5726.02   | 39.80 | 6156.61   | 38.39
2    | 1449.26   | 39.97 | 1406.72   | 39.00
3    | 1644.99   | 38.20 | 2521.83   | 36.90
4    | 6802.44   | 38.33 | 6399.47   | 38.98
5    | 2047.15   | 38.31 | 3450.51   | 36.95
6    | 12,756.76 | 37.80 | 17,324.99 | 37.44
7    | 13,741.02 | 37.23 | 9608.79   | 38.23
8    | 12,931.56 | 38.75 | 16,597.48 | 38.52
9    | 9730.40   | 38.14 | 15,504.52 | 36.87
Avg. | 7425.51   | 38.50 | 8774.55   | 37.92
(b) QP 27

Seq. | Proposed Bitrate | Proposed PSNR | Ref. HEVC Bitrate | Ref. HEVC PSNR
1    | 3476.53 | 36.55 | 3315.22 | 35.78
2    | 895.36  | 36.25 | 788.12  | 35.60
3    | 944.59  | 34.70 | 1388.92 | 33.20
4    | 3822.91 | 35.10 | 3750.57 | 35.96
5    | 1288.78 | 34.62 | 2260.92 | 33.04
6    | 7123.97 | 35.65 | 9242.55 | 34.99
7    | 7664.29 | 33.13 | 5461.26 | 34.62
8    | 6762.86 | 35.92 | 8531.70 | 35.53
9    | 5830.74 | 34.57 | 9356.95 | 33.05
Avg. | 4201.11 | 35.17 | 4899.58 | 34.64
(c) QP 32

Seq. | Proposed Bitrate | Proposed PSNR | Ref. HEVC Bitrate | Ref. HEVC PSNR
1    | 2038.09 | 33.30 | 1825.51 | 32.80
2    | 522.79  | 32.70 | 426.73  | 32.46
3    | 520.21  | 31.50 | 704.44  | 29.90
4    | 2059.26 | 31.85 | 2105.14 | 32.90
5    | 808.83  | 31.16 | 1416.00 | 29.15
6    | 3903.78 | 33.13 | 4908.85 | 32.49
7    | 3613.21 | 29.50 | 2828.63 | 31.28
8    | 3476.59 | 33.19 | 4086.85 | 32.72
9    | 3288.26 | 31.05 | 5090.69 | 29.31
Avg. | 2247.89 | 31.93 | 2599.21 | 31.45
(d) QP 37
Table 4. Summary of percentage bitrate decreases and PSNR differences between the proposed solution and HEVC encoding.
     |         QP22          |         QP27          |         QP32          |         QP37
Seq. | Bitrate Decr. | ΔPSNR | Bitrate Decr. | ΔPSNR | Bitrate Decr. | ΔPSNR | Bitrate Decr. | ΔPSNR
1    | 18.69%   | 1.13  | 6.99%    | 1.41  | −4.87%   | 0.77  | −11.65%  | 0.50
2    | 6.72%    | 1.15  | −3.02%   | 0.97  | −13.61%  | 0.65  | −22.51%  | 0.24
3    | 34.86%   | 0.60  | 34.77%   | 1.30  | 31.99%   | 1.50  | 26.15%   | 1.60
4    | −10.68%  | −0.30 | −6.30%   | −0.65 | −1.93%   | −0.86 | 2.18%    | −1.05
5    | 38.83%   | 0.67  | 40.67%   | 1.36  | 43.00%   | 1.58  | 42.88%   | 2.01
6    | 19.95%   | −0.08 | 26.37%   | 0.37  | 22.92%   | 0.66  | 20.47%   | 0.64
7    | −38.57%  | −0.53 | −43.00%  | −1.00 | −40.34%  | −1.49 | −27.74%  | −1.78
8    | 18.88%   | 0.05  | 22.09%   | 0.23  | 20.73%   | 0.39  | 14.93%   | 0.47
9    | 36.75%   | 0.62  | 37.24%   | 1.27  | 37.69%   | 1.52  | 35.41%   | 1.74
Avg. | 13.94%   | +0.37 | 12.87%   | +0.59 | 10.62%   | +0.53 | 8.90%    | +0.49
Table 5. BD-PSNR and BD rate results for the proposed solution in comparison to regular HEVC encoding.
Sequence ID       | BD-PSNR (dB) | BD Rate (%)
1                 | 1.09         | −18.61
2                 | 0.31         | −3.43
3                 | 3.80         | −46.11
4                 | −0.97        | 18.94
5                 | 6.24         | −50.06
6                 | 1.49         | −31.76
7                 | −3.51        | 68.32
8                 | 1.35         | −25.64
9                 | 4.77         | −47.76
Avg.              | 1.62         | −15.12
Avg. exc. 4 and 7 | 2.72         | −31.91
Table 6. Percentage of coding unit splits per video sequence per QP value for both the proposed solution and regular HEVC coding.
     |     QP 22     |     QP 27     |     QP 32     |     QP 37
Seq. | Prop. | HEVC  | Prop. | HEVC  | Prop. | HEVC  | Prop. | HEVC
1    | 0.54  | 0.83  | 0.52  | 0.74  | 0.47  | 0.56  | 0.40  | 0.39
2    | 0.79  | 0.66  | 0.72  | 0.56  | 0.62  | 0.48  | 0.51  | 0.39
3    | 0.79  | 0.89  | 0.75  | 0.83  | 0.66  | 0.75  | 0.53  | 0.60
4    | 0.65  | 0.69  | 0.66  | 0.60  | 0.59  | 0.53  | 0.51  | 0.43
5    | 0.82  | 0.91  | 0.76  | 0.84  | 0.65  | 0.79  | 0.59  | 0.75
6    | 0.53  | 0.71  | 0.41  | 0.56  | 0.35  | 0.47  | 0.27  | 0.36
7    | 0.72  | 0.59  | 0.73  | 0.61  | 0.71  | 0.59  | 0.59  | 0.50
8    | 0.75  | 0.63  | 0.64  | 0.53  | 0.52  | 0.43  | 0.37  | 0.30
9    | 0.66  | 0.90  | 0.67  | 0.85  | 0.63  | 0.80  | 0.55  | 0.71
Avg. | 0.70  | 0.76  | 0.65  | 0.68  | 0.58  | 0.60  | 0.48  | 0.49
Table 7. BD-PSNR and BD rate results for the proposed solution in comparison to regular HEVC encoding. The first set of results belongs to the lightweight first pass, and the second set of results belongs to the second pass or the full-encoding pass.
                  |   Lightweight First Pass   | Full-Encoding Second Pass
Seq.              | BD-PSNR (dB) | BD Rate (%) | BD-PSNR (dB) | BD Rate (%)
1                 | 1.42         | −23.22      | 1.09         | −18.61
2                 | 0.57         | −7.65       | 0.31         | −3.43
3                 | 3.78         | −45.90      | 3.80         | −46.11
4                 | −0.80        | 15.59       | −0.97        | 18.94
5                 | 6.11         | −49.35      | 6.24         | −50.06
6                 | 1.45         | −30.77      | 1.49         | −31.76
7                 | −3.53        | 69.04       | −3.51        | 68.32
8                 | 1.34         | −25.48      | 1.35         | −25.64
9                 | 4.84         | −48.13      | 4.77         | −47.76
Avg.              | 1.69         | −16.21      | 1.62         | −15.12
Avg. exc. 4 and 7 | 2.79         | −32.93      | 2.72         | −31.91
Table 8. The computational time required for the first- and second-pass encodings averaged over all sequences using the four quantization parameters.
         | First Pass (s) | Second Pass (s) | Pass 2 / Pass 1
Proposed | 61.42          | 1497.15         | 24.37
HEVC     | 62.32          | 1504.80         | 24.15
Table 9. The computational time required for the proposed pre- and post-multidimensional transpose.
Seq. ID | Resolution | Num. Frames | Seconds | Per Frame | Per 25 Frames
1       | 832 × 480  | 500         | 8.98    | 0.018     | 0.449
2       | 416 × 240  | 500         | 1.17    | 0.002     | 0.059
3       | 416 × 240  | 500         | 1.17    | 0.002     | 0.059
4       | 832 × 480  | 600         | 10.78   | 0.018     | 0.449
5       | 416 × 240  | 600         | 1.41    | 0.002     | 0.059
6       | 1280 × 768 | 500         | 23.59   | 0.047     | 1.180
7       | 704 × 576  | 300         | 6.42    | 0.021     | 0.535
8       | 1280 × 768 | 240         | 11.33   | 0.047     | 1.180
9       | 832 × 480  | 500         | 8.98    | 0.018     | 0.449
Avg.    |            |             |         | 0.020     | 0.491
Table 10. BD rate savings of existing HEVC quality enhancement solutions based on pre/post-processes.
Solution | Ref. [18] | Ref. [19] | Ref. [20] | Ref. [11] | Ref. [17] | Proposed One-Pass | Proposed Two-Pass
BD Rate  | −11.1%    | −9.0%     | −8.31%    | −28.25%   | −7.23%    | −15.12%           | −31.91%

Shanableh, T. Content-Symmetrical Multidimensional Transpose of Image Sequences for the High Efficiency Video Coding (HEVC) All-Intra Configuration. Symmetry 2025, 17, 598. https://doi.org/10.3390/sym17040598
