1. Introduction
With the recent development of video capture, display and processing capabilities, there is a growing market demand for high-definition, high-quality video services. In response, the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG) have developed successive video coding standards such as MPEG-2, H.264/AVC and High Efficiency Video Coding (HEVC) [1, 2, 3]. In addition, codecs developed by private companies, such as VP8 and VP9 [4, 5], among others, have appeared. As the codecs have evolved, their coding performance has improved by a bit-saving factor of two for the same visual quality. These advances have been made possible by adding a number of new tools with high computational complexity. Many tools have been developed for and added to intra frames. However, because these coding tools are interdependent in order to reduce the spatial redundancy of the frame, it is difficult to accelerate the video codecs in real time with a minimal computational load.
In today’s world, we are flooded with video content from multiple media services through broadcasting, over-the-top (OTT) media services, the internet, and so on. Users need an easy way to choose among this content, and thumbnail displays are used as part of user interfaces to let users visually select the content they wish to engage with. However, video resolution is also increasing rapidly, up to 8K or higher, resulting in a significant increase in the hardware requirements for fully decoding video content. Several attempts have been made to improve the decoding speed, such as parallel decoding and decoder implementations using single instruction, multiple data (SIMD) instructions [6, 7, 8, 9]. However, it is still almost impossible to fully decode multiple 4K or 8K videos simultaneously on limited hardware to enable thumbnail display. In addition, the amount of memory used for thumbnail processing must be reduced on embedded systems because of their memory limitations. Some attempts have been made to extract thumbnail images in the frequency domain [10, 11] based on Chen’s transform domain intra prediction method [12]. However, these methods require additional look-up tables, which increase memory use, and they are very hard to apply to other codecs. As such, a partial decoding method has been proposed [13]. This method restores only the right and bottom boundary pixels in 4 × 4 units, which is the minimum transform unit (TU) size of HEVC. It can be easily applied to other codecs; however, it always operates in 4 × 4 units regardless of the block size. Therefore, it restores unnecessary pixels that are used neither for the thumbnail output nor as reference pixels for intra prediction.
Figure 1 shows the pixels used as reference pixels or for thumbnail output (gray) and the unnecessarily reconstructed pixels (yellow) of the method [12]. The number of unnecessarily restored pixels for an N × N TU block is as follows:
In this paper, we present a fast thumbnail decoding method for intra frames of H.264/AVC, HEVC and VP9 that uses a small amount of memory and performs partial decoding according to the prediction block size, with minimal visual quality loss and without any error propagation. The proposed partial decoding method restores only the pixels constituting the thumbnail and the reference pixels used in intra prediction by replacing the full inverse transform and intra prediction with a partial inverse transform and partial intra prediction. The computational load and memory usage are greatly reduced by omitting both the reconstruction and the storage of the other pixels. HEVC employs large transforms, with dimensions such as 32 × 32; thus, we reconstruct several pixels inside such blocks to preserve the visual quality of the thumbnails. The memory structure for the proposed partial decoding method uses a minimal thumbnail buffer and reference line buffers rather than the decoded picture buffer (DPB). Memory is not allocated for pixels whose restoration is omitted, and reference pixels that are no longer required are overwritten by the restored reference pixels of the next block, thereby reducing the memory allocation required. In addition, because the thumbnail pixels for output are stored directly in the thumbnail buffer, no down-sampling process is performed, which reduces computational complexity, memory usage and memory access. The H.264/AVC, HEVC and VP9 codecs have different coding structures and transforms; however, the proposed algorithm can be applied to them in the same manner. In order to evaluate the performance of the proposed method, we implemented it for H.264/AVC, HEVC and VP9. We found that the thumbnail extraction time of the proposed method decreased by 66% for H.264/AVC, 52% for HEVC and 48% for VP9 compared to the full decoding method.
This paper is organized as follows:
Section 2 explains the proposed partial decoding method for fast thumbnail extraction and an efficient memory structure.
Section 3 shows the experimental results of the proposed algorithm and the existing implementation of open software in terms of running times and visual quality.
Section 4 concludes the paper.
2. Proposed Partial Decoding of H.264/AVC, HEVC and VP9 for Thumbnail Extraction
A conventional thumbnail extraction method includes a video decoding process and a down-sampling process. The video decoding process consists of entropy decoding, inverse transformation, intra prediction and in-loop filtering for intra frames. During thumbnail extraction, entropy decoding, inverse transformation and intra prediction are performed to reconstruct the full image frame, and the frame is then down-sampled to the thumbnail size. However, these decoding and down-sampling processes have high computational complexity and memory usage. In this paper, we propose a partial decoding method according to the prediction block size and a memory structure for high-speed thumbnail extraction and compact memory usage with minimal visual degradation.
The proposed thumbnail extraction method replaces the inverse transform and intra prediction in the existing video decoding process with a partial inverse transform and partial intra prediction. The proposed algorithm uses a low-capacity thumbnail buffer optimized for the pixels reconstructed in the partial decoding process, along with two line buffers.
Figure 2 shows the block diagram of the proposed method for fast thumbnail extraction. In this diagram, the original inverse transform and intra prediction are replaced with the partial inverse transform and partial intra prediction, and the restored pixels are stored in the line buffers and the thumbnail buffer so that the down-sampling process can be omitted.
In this section, we describe a partial decoding method that reduces the computational complexity of the inverse transform and intra prediction during decoding, with minimal visual degradation. The proposed method restores only the pixels necessary for thumbnail extraction. These are not only the pixels to be output to the thumbnail but also those needed to avoid error propagation in intra-picture prediction. Of all the pixels in a transform block, those required for intra prediction of subsequent blocks are the right-boundary and lower-boundary pixels. If these pixels have errors or are not restored, the errors propagate and accumulate across the entire image, greatly reducing the visual quality of the reconstructed image.
For videos with a resolution of less than Full HD, the decoding speed is sufficiently fast even in a limited hardware environment; thus, we targeted videos with a resolution of 4K or higher. As demonstrated by the experiments, the subjective quality was good enough for thumbnail display even when the thumbnails were generated at 1/64 the size of the original resolution from 4K or higher UHD videos. This paper is based on the 1/64 size thumbnail extraction method; to extract thumbnails at the 1/ size of the original resolution according to the user’s preference, the method can be applied similarly by restoring the right and bottom boundary pixels and one pixel for each block.
In order to extract a thumbnail with a size corresponding to 1/64 of the original image resolution, one pixel is required for each 8 × 8 block. Therefore, every pixel output to the thumbnail is the bottom-right pixel of an 8 × 8 unit block. However, HEVC and other codecs employ larger transform sizes; thus, several pixels inside such transform blocks must be reconstructed for better visual quality. The H.264/AVC, HEVC and VP9 video compression standards have various prediction and transform sizes and shapes (from 4 × 4 to 32 × 32, square and non-square). When the proposed partial decoding method is applied to these standards, the restored pixel positions within the block are as shown in Figure 3.
Table 1 lists the numbers of reconstructed pixels in a block according to the transform block size when the proposed method is applied. The reference pixels in the table are the pixels to be reconstructed and stored for intra prediction of the next block, and the pixels inside the block are those reconstructed for output as the thumbnail. The reconstruction pixel ratio is the ratio of the pixels reconstructed by the proposed method to the total number of pixels in each block. According to the table, the reconstruction ratio of the proposed algorithm ranges from 44% down to as low as 7%, depending on the block size. Because the reconstruction ratio decreases as the block size increases, thumbnail extraction requires less computation when an image contains larger blocks. In addition, the memory usage can be reduced for large blocks. The method requires partial inverse transformations and partial intra prediction, which are described in the following subsections along with the minimal memory structure. The proposed partial decoding for chroma samples can be derived in the same way, even for the 4:2:0 format.
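The pixel counts discussed above can be checked with a short calculation. The following is a minimal sketch, assuming that the reference pixels are the bottom row and right column of the block and that the inner pixels are the bottom-right pixels of the 8 × 8 sub-blocks not already on that boundary; the function name and this way of splitting the counts are illustrative assumptions and may not match how Table 1 tabulates them.

```python
def reconstructed_pixels(width, height, step=8):
    """Count the pixels restored in one block (hypothetical helper).

    Returns (reference, inner), where `reference` is the bottom row plus
    the right column and `inner` is the number of step x step sub-block
    corners that are not already part of that boundary.
    """
    # Bottom row and right column share one corner pixel.
    reference = width + height - 1
    # Sub-block corners in the last row/column are already counted above.
    inner = max(width // step - 1, 0) * max(height // step - 1, 0)
    return reference, inner


for n in (4, 8, 16, 32):
    ref, inner = reconstructed_pixels(n, n)
    ratio = 100.0 * (ref + inner) / (n * n)
    print(f"{n}x{n}: reference={ref}, inner={inner}, ratio={ratio:.1f}%")
```

Under these assumptions, a 4 × 4 block restores 7 of 16 pixels (about 44%) and a 32 × 32 block restores 72 of 1024 pixels (about 7%), which is in line with the range quoted above.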
2.1. Partial Transformations
The inverse transformation process converts the frequency-domain transform coefficients obtained through entropy decoding into pixel-domain values by performing a 2D inverse discrete cosine transform (IDCT). In order to reduce computational complexity, the inverse transformation in H.264/AVC, HEVC and VP9 decoders uses a butterfly structure in which the 2D IDCT is separated into vertical and horizontal 1D IDCT operations, each of which is computed with additions and multiplications [14, 15].
The proposed transform inversely transforms only the lowermost and rightmost pixels, which serve as reference pixels, in order to avoid error propagation. In addition, if a transform block is larger than 8 × 8 pixels, one pixel is recovered per 8 × 8 sub-block inside the larger block to avoid interpolation in the thumbnail. We employ a two-stage inverse transformation based on the separability of the transform. The vertical 1D IDCT must be fully performed before the horizontal 1D transformation can be carried out. However, the horizontal 1D transform needs to be performed only partly, for the reference pixels and one pixel per internal 8 × 8 sub-block. As shown in Figure 4, the 16 × 16 block is inversely transformed only for the yellow pixels in the second stage.
The partial horizontal 1D-IDCT is performed by removing some part of the butterfly structure. Depending on the transformation sizes, the 8th, 16th, 24th and 32nd pixels should be reconstructed.
Figure 5 shows the operations required to restore the last (16th) pixel in the original 16-point 1D-IDCT. Reconstructing only this last pixel of the horizontal 16-point 1D-IDCT requires 15 additions and 24 multiplications, whereas the commonly used butterfly structure of the 1D-IDCT requires 64 additions and 72 multiplications.
Table 2 shows the numbers of additions and multiplications for the proposed partial reconstruction depending on the transformation block sizes. As shown in the table, blocks that are larger and have greater horizontal lengths can be accelerated more.
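The two-stage idea of this subsection can be illustrated with a small floating-point model. The sketch below is not the decoder implementation: it assumes an orthonormal DCT rather than the integer transforms of the codecs, and it evaluates each needed output as a direct dot product instead of pruning a butterfly structure. All function names are hypothetical.

```python
import math


def idct_basis(n):
    """basis[k][x]: weight of coefficient k at output sample x (orthonormal IDCT)."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
                     for x in range(n)])
    return rows


def needed_columns(y, n, step=8):
    """Columns restored in row y: the whole bottom row, otherwise the
    right edge plus the sub-block corners on every step-th row."""
    if y == n - 1:
        return list(range(n))
    cols = {n - 1}
    if y % step == step - 1:
        cols.update(range(step - 1, n, step))
    return sorted(cols)


def vertical_pass(coeff, basis):
    """Stage 1: full vertical 1D IDCT; tmp[y][u] for every row y, column u."""
    n = len(coeff)
    return [[sum(coeff[v][u] * basis[v][y] for v in range(n)) for u in range(n)]
            for y in range(n)]


def partial_idct2d(coeff, step=8):
    """Return {(y, x): residual} for the needed pixels only."""
    n = len(coeff)
    basis = idct_basis(n)
    tmp = vertical_pass(coeff, basis)
    out = {}
    # Stage 2: horizontal 1D IDCT evaluated only at the needed columns.
    for y in range(n):
        for x in needed_columns(y, n, step):
            out[(y, x)] = sum(tmp[y][u] * basis[u][x] for u in range(n))
    return out


def full_idct2d(coeff):
    """Reference full 2D IDCT, used here only to check the partial result."""
    n = len(coeff)
    basis = idct_basis(n)
    tmp = vertical_pass(coeff, basis)
    return [[sum(tmp[y][u] * basis[u][x] for u in range(n)) for x in range(n)]
            for y in range(n)]
```

For a 16 × 16 block, `partial_idct2d` returns 32 values (the bottom row, the right column and one inner sub-block corner), and each value equals the corresponding sample of `full_idct2d`, so the reference pixels carry no additional error.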
2.2. Partial Intra Prediction
The intra prediction process is designed to improve compression efficiency by eliminating redundancy among adjacent pixels in an image. In the H.264/AVC, HEVC and VP9 decoders, reference pixels of neighbouring blocks are filtered, and the filtered pixel values are then used to form the predicted signals. The predicted signals and the residual signals from the inverse transformation are then added.
For thumbnail extraction, partial prediction can be performed from the filtered reference samples. In a manner similar to the partial inverse transformation, only the bottom and rightmost pixel lines and one pixel per inner 8 × 8 sub-block are predicted. As a result, the memory copy operations for the pixels that are unnecessary for the thumbnail can be omitted.
Figure 6 shows the necessary pixels to be predicted and reconstructed for the thumbnail extraction process for a 16 × 16 block.
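The following is a minimal sketch of partial prediction for the pixel positions described above. It assumes only three simplified modes (DC, vertical, horizontal) with no reference filtering or edge smoothing, so it is a stand-in for the codec-specific predictors rather than any decoder's actual routine; the function names and the dictionary-based layout are hypothetical.

```python
def needed_positions(n, step=8):
    """(y, x) positions to restore in an n x n block."""
    pos = {(n - 1, i) for i in range(n)}       # bottom row
    pos |= {(i, n - 1) for i in range(n)}      # right column
    pos |= {(y, x)                             # one pixel per inner sub-block
            for y in range(step - 1, n, step)
            for x in range(step - 1, n, step)}
    return sorted(pos)


def partial_intra_predict(mode, top, left, n):
    """Predict only the needed pixels from the top/left reference samples."""
    if mode not in ("dc", "vertical", "horizontal"):
        raise ValueError(f"unsupported mode in this sketch: {mode}")
    # Rounded mean of the 2n reference samples, used by the DC mode.
    dc = (sum(top[:n]) + sum(left[:n]) + n) // (2 * n)
    pred = {}
    for y, x in needed_positions(n):
        if mode == "vertical":
            pred[(y, x)] = top[x]      # copy the sample above the column
        elif mode == "horizontal":
            pred[(y, x)] = left[y]     # copy the sample left of the row
        else:
            pred[(y, x)] = dc
    return pred


def reconstruct(pred, residual, bit_depth=8):
    """Add residuals to predictions and clip to the valid sample range."""
    max_val = (1 << bit_depth) - 1
    return {p: min(max(v + residual.get(p, 0), 0), max_val)
            for p, v in pred.items()}
```

Because prediction and reconstruction loop over `needed_positions` only, no values are computed or copied for the remaining pixels of the block.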
2.3. Memory Structures and Memory Access
In real-time thumbnail extraction, it is important to reduce memory access and memory requirements. This section describes the memory structure of the proposed partial decoding. In the proposed partial decoding method, only a very small number of pixels are restored relative to the number of pixels in the original image, and only some of the restored pixels are used for the thumbnail image. Therefore, storing the reconstructed pixels in the original memory structure would be inefficient, because it would leave a large amount of memory unused. The proposed memory structure does not allocate memory for pixels whose restoration is omitted. Because reference pixels are not re-used after the blocks that refer to them have been decoded, they can be overwritten by a new reference pixel line. For the proposed thumbnail extraction, one thumbnail buffer and two reference line buffers are employed rather than a full reconstruction frame buffer. The thumbnail buffer resolution is 1/64 of the original. The reference line buffers are the left and top line buffers. The length of the left line buffer equals the maximum block height, and the length of the top line buffer equals the width of the original image.
After the partial intra prediction and the partial inverse transformation are performed, the predicted and residual signals are summed, thereby restoring the thumbnail and reference pixels. Among the restored pixels, the thumbnail pixels, one per 8 × 8 sub-block, are extracted and stored in the thumbnail buffer for inclusion in the output thumbnail image. The reference pixels serve as inputs to the intra prediction and are stored separately in the two reference line buffers: the restored rightmost pixels of the block are stored in the left reference line buffer, and the lowermost pixels are stored in the top reference line buffer. The reference line buffers are used for the reconstruction of subsequent blocks.
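The storage step described above can be sketched as a small data structure. This is an illustration under stated assumptions, not the authors' implementation: the class and method names are hypothetical, samples are kept in plain Python lists, and the left line buffer is indexed by the row position modulo the maximum block height, which is one possible addressing scheme.

```python
class PartialDecodeBuffers:
    """One thumbnail buffer plus top and left reference line buffers."""

    def __init__(self, width, height, step=8, max_block=64):
        self.step = step
        self.max_block = max_block
        # Thumbnail buffer: one sample per step x step area of the picture.
        self.thumb = [[0] * (width // step) for _ in range(height // step)]
        # Top line buffer spans the picture width; left spans one block height.
        self.top = [0] * width
        self.left = [0] * max_block

    def store_block(self, x0, y0, n, recon):
        """Store restored samples of an n x n block whose origin is (x0, y0).

        `recon` maps block-local (y, x) positions to restored sample values.
        """
        last = self.step - 1
        for (y, x), value in recon.items():
            ay, ax = y0 + y, x0 + x
            # Bottom-right pixel of a step x step area goes to the thumbnail.
            if ay % self.step == last and ax % self.step == last:
                self.thumb[ay // self.step][ax // self.step] = value
            # Lowermost pixels overwrite the top reference line.
            if y == n - 1:
                self.top[ax] = value
            # Rightmost pixels overwrite the left reference line.
            if x == n - 1:
                self.left[ay % self.max_block] = value
```

Because the thumbnail sample is written directly at its final position, no separate down-sampling pass over a full frame buffer is needed, and each new block simply overwrites the reference samples that are no longer required.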
Figure 7 shows an example of the proposed memory structure when extracting thumbnails from video with a resolution of 3840 × 2160. The thumbnail buffer has a resolution of 480 × 270 (129,600 pixels). The top and left reference line buffers hold 3840 and 64 pixels, respectively; therefore, the memory required for the restored pixels is reduced by 98%, from 8,294,400 to 133,504 pixels.
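The figures in this example follow from simple arithmetic, reproduced in the sketch below. The function name and the default values (an 8 × 8 thumbnail step and a maximum block height of 64) are assumptions chosen to match the numbers above, and sizes are counted in pixels rather than bytes.

```python
def buffer_pixels(width, height, step=8, max_block=64):
    """Pixel counts of the full frame buffer and of the partial-decoding buffers."""
    thumbnail = (width // step) * (height // step)
    partial = thumbnail + width + max_block  # thumbnail + top line + left line
    full = width * height
    return {
        "full": full,
        "thumbnail": thumbnail,
        "top": width,
        "left": max_block,
        "partial": partial,
        "saving_percent": 100.0 * (full - partial) / full,
    }


sizes = buffer_pixels(3840, 2160)
print(sizes["full"], sizes["partial"], f"{sizes['saving_percent']:.1f}%")
```

For 3840 × 2160 this gives 8,294,400 pixels for the full buffer and 133,504 pixels for the three partial-decoding buffers, a saving of about 98.4%.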
3. Experimental Results and Discussion
In order to evaluate the performance of the fast thumbnail extraction method proposed in this paper, the proposed algorithm for H.264/AVC, HEVC and VP9 was implemented on FFmpeg version 4.2.2 [16]. We also used ffmpegthumbnailer [17] to fairly evaluate the performance of thumbnail extraction with the FFmpeg decoder. The open software ffmpegthumbnailer drives the FFmpeg decoders and down-sampler for thumbnail extraction, as shown in Figure 8. The proposed algorithm is therefore also driven through this open software, so that both methods share the same interface and can be compared fairly.
The experiment was conducted in a virtual Linux environment with a 3.40 GHz processor, 16.0 GB of memory and Windows 10 64-bit operating system, as shown in Table 3.
For test sequences, six 4K video sequences from classes A1 and A2 of the common test conditions of Versatile Video Coding (VVC) were selected [18], and they were coded by H.264/AVC, HEVC and VP9 encoders.
Table 4 shows the bit rates of the test bitstreams.
For performance evaluation, the thumbnail of the first frame of each test sequence was extracted for H.264/AVC, HEVC and VP9 bitstreams, and the extraction times were compared. The time saving (TS) for measuring the thumbnail extraction speed comparison was calculated, as defined by:
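The time saving is taken here to be the usual relative reduction in extraction time with respect to the conventional method, TS = (T_conv − T_prop) / T_conv × 100%. The symbols T_conv and T_prop and the function below are assumptions introduced for illustration, chosen to be consistent with the percentages reported in this paper.

```python
def time_saving(t_conventional, t_proposed):
    """Relative reduction in extraction time, in percent.

    t_conventional: extraction time of full decoding plus down-sampling.
    t_proposed: extraction time of the partial decoding method.
    Both must use the same unit; t_conventional must be positive.
    """
    if t_conventional <= 0:
        raise ValueError("t_conventional must be positive")
    return 100.0 * (t_conventional - t_proposed) / t_conventional
```

With this definition, a proposed extraction that takes 34 ms where the conventional one takes 100 ms corresponds to a time saving of 66%.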
Table 5, Table 6 and Table 7 compare the thumbnail extraction time of the proposed method with that of the conventional method for each codec. The time saving of the proposed thumbnail extraction algorithm was 66% for H.264/AVC bitstreams, 52% for HEVC bitstreams and 48% for VP9 bitstreams. The time saving differs among the codecs because the proposed algorithm focuses on removing parts of the inverse transformation and intra prediction, and the computing time of these two stages was higher for H.264/AVC than for the other codecs. In addition, the speed-up differed depending on image characteristics, because the ratios of transform skip or zero residuals influenced the thumbnail extraction computing time.
Table 8 shows the peak signal-to-noise ratio (PSNR) values of the thumbnails compared with those of the conventional full decoding method for each codec. The results show that the PSNR values differ significantly depending on the sequence. This is because the proposed method stores only the pixels necessary for the thumbnail in the thumbnail buffer, and the down-sampling process is removed. Therefore, sequences containing complex textures suffer from aliasing, and their PSNR can be lower than that of the others. Nevertheless, the thumbnails generated by the proposed method had a sufficient level of visual quality for commercial use in thumbnail applications.
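For reference, PSNR between two thumbnails of equal size can be computed as in the sketch below. It uses the standard definition 10·log10(peak² / MSE) and assumes 8-bit samples stored as nested lists; the function name and data layout are illustrative and are not taken from the evaluation software used in this paper.

```python
import math


def psnr(reference, test, peak=255):
    """PSNR in dB between two equally sized images given as lists of rows."""
    if len(reference) != len(test):
        raise ValueError("images must have the same number of rows")
    squared_error = 0
    count = 0
    for ref_row, test_row in zip(reference, test):
        if len(ref_row) != len(test_row):
            raise ValueError("rows must have the same length")
        for r, t in zip(ref_row, test_row):
            squared_error += (r - t) ** 2
            count += 1
    if squared_error == 0:
        return math.inf  # identical images
    mse = squared_error / count
    return 10.0 * math.log10(peak * peak / mse)
```

Here `reference` would be the thumbnail obtained by full decoding followed by down-sampling and `test` the thumbnail from partial decoding.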
Figure 9, Figure 10 and Figure 11 show the thumbnails extracted by the proposed method alongside those from the conventional method (the existing software), which performs down-sampling after decoding, for the ‘Tango2’, ‘Campfire’ and ‘ParkRunning3’ sequences, respectively. These are 4K sequences of 3840 × 2160 size, and the width and height of the thumbnails were each 1/8 of the original. Because intra prediction is performed with the reconstructed pixels of the upper and left neighboring boundaries, errors at the boundary pixels propagate to subsequent blocks. In the worst case, an error can propagate up to the last coding block of the slice or picture. The proposed algorithm was therefore designed to reconstruct the boundary pixels with the same inverse transforms and prediction as full decoding. The reconstructed pixels were efficiently stored in the down-sampled reference buffers. In addition, because the thumbnail is a low-resolution image, the degradation of the image quality in the proposed method was insignificant, and it was difficult to see any visual difference.
4. Conclusions
The conventional thumbnail extraction method consists of decoding and down-sampling stages. However, the decoding and down-sampling processes have high computational complexity and memory usage. In this paper, we proposed a partial decoding method and a memory structure for high-speed thumbnail extraction. The proposed partial decoding method replaces the inverse transform and intra prediction in the decoding process with a partial inverse transform and partial intra prediction. The computational complexity is reduced by restoring only one pixel per inner 8 × 8 sub-block along with the rightmost and lowermost pixels. In addition, we designed a memory structure suitable for the partial decoding process. The proposed memory structure reduces the restoration buffer required for 4K videos by 98% by replacing it with a low-resolution thumbnail buffer and two reference line buffers for intra prediction. In order to evaluate the performance of the proposed fast thumbnail extraction method, we implemented the proposed algorithm in the FFmpeg H.264/AVC, HEVC and VP9 decoders and compared its speed with that of the conventional thumbnail extraction algorithm implemented on FFmpeg. For 4K resolution videos, we compared running times by extracting the thumbnail of the first frame of the test sequences. With the proposed method, the processing time was reduced by 66% for H.264/AVC, 52% for HEVC and 48% for VP9. In addition, we reduced the amount of required memory with minimal visual quality loss. The proposed algorithm was commercialized and implemented on an ARM processor for 2019 and 2020 LG televisions.