Article

Transform-Based Feature Map Compression Method for Video Coding for Machines (VCM)

by Minhun Lee, Seungjin Park, Seoung-Jun Oh, Younhee Kim, Se Yoon Jeong, Jooyoung Lee and Donggyu Sim
1 Department of Computer Engineering, Kwangwoon University, Seoul 01897, Republic of Korea
2 Department of Electronic Engineering, Kwangwoon University, Seoul 01897, Republic of Korea
3 Broadcasting and Media Research Laboratory, Electronics and Telecommunications Research Institute, Daejeon 34129, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2023, 12(19), 4042; https://doi.org/10.3390/electronics12194042
Submission received: 23 August 2023 / Revised: 20 September 2023 / Accepted: 22 September 2023 / Published: 26 September 2023
(This article belongs to the Special Issue Image/Video Processing and Encoding for Contemporary Applications)

Abstract

The burgeoning field of machine vision has led the Moving Picture Experts Group (MPEG) to develop a new type of compression technology called video coding for machines (VCM), which aims to enhance machine recognition through video information compression. This research proposes a principal component analysis (PCA)-based compression method for multi-level feature maps extracted from the feature pyramid network (FPN) structure. Unlike existing PCA-based studies that carry out an independent PCA for each feature map, our approach employs a generalized basis matrix and mean vector derived from inter-channel correlations by a generalized PCA process, eliminating the need to run PCA at the encoder. Further compression is achieved by merging the low-level feature maps with the largest spatial dimensions, capitalizing on the spatial redundancy within these multi-level feature maps. As a result, the proposed VCM encoder forgoes the PCA process, and, because the generalized data are predefined at both the encoder and decoder, they incur no compression loss. The encoder only needs to compress the coefficients for each feature map using versatile video coding (VVC). Experimental results demonstrate that our method outperforms all feature anchors for each machine vision task specified by the MPEG-VCM common test conditions, as well as previous PCA-based feature map compression methods. Notably, it achieves an 89.3% BD-rate reduction for the instance segmentation task.

1. Introduction

Nowadays, the accuracy of machine vision tasks has improved considerably owing to advances in deep-learning technology. This has facilitated the widespread integration of deep-learning-based models into the IoT devices that we encounter in our day-to-day lives [1,2]. As a result, an exponential surge in machine-to-machine (M2M) communication, where devices autonomously communicate and process tasks, has been observed. In M2M communication, the typical process involves compressing and transmitting input image/video data captured by one device before another device analyzes them. Over the years, conventional video coding standards such as high-efficiency video coding (H.265/HEVC) [3] and versatile video coding (H.266/VVC) [4] have been developed; these techniques prioritize human visual characteristics in terms of subjective and objective quality. However, their effectiveness in performing efficient visual signal-level compression for image/video analysis that is primarily carried out by machines rather than humans is debatable. Thus, there is a pressing need for a new compression scheme that efficiently preserves the data needed for machine vision while maintaining human visual quality and ensuring accurate recognition of image/video information [5,6].
In response to this emerging need, the Moving Picture Experts Group (MPEG) initiated a new standardization activity in 2019 called video coding for machines (VCM). The MPEG-VCM group highlighted use cases such as smart cities, intelligent transportation, content management, and surveillance as prime applications for the VCM standard technology [6]. These applications typically involve machine vision tasks such as object detection, instance segmentation, and object tracking. Accordingly, these were identified as the main tasks, and three deep-learning networks were selected for VCM standardization: Faster R-CNN X101-FPN, Mask R-CNN X101-FPN, and JDE-1088 × 608. MPEG-VCM aims to establish an efficient bitstream from an encoded video, descriptors, or feature maps extracted from a video by considering both the bitrate and the performance of the machine vision task after decoding [6]. Therefore, MPEG-VCM is divided into two tracks based on the compression target: feature map domain compression and video domain compression. Given that most intelligent edge devices are limited in storage capacity and computational power, cloud servers with deep-learning networks have been introduced to share these burdens and to support analytics on the data transmitted from such devices [7,8]. A collaborative intelligence (CI) paradigm can be utilized to effectively distribute the computational workload between edge devices and cloud servers by compressing and transmitting intermediate feature maps [9,10]. This is particularly significant for edge devices, such as mobile and embedded devices, as deep-learning network architectures continue to grow in complexity. Moreover, feature map compression techniques developed for MPEG-VCM can be effectively integrated within these CI paradigms. Consequently, this study concentrates on feature map compression for VCM.
Multi-level feature maps extracted from the feature pyramid network (FPN) structure commonly employed in the aforementioned deep-learning networks were established as the compression targets for feature map domain compression in MPEG-VCM. Specifically, the P-layer feature maps (P2, P3, P4, and P5) are targeted for object detection and instance segmentation, while the L-layer feature maps (L0, L1, and L2) are used for object tracking, as depicted in Figure 1. However, the total data volume of the multi-level FPN feature maps is significantly larger than that of the input image [11]. Therefore, several methods have been proposed to remove redundancy within and between the multi-level feature maps using a transform-based mechanism. Lee et al. [11] proposed a receptive block-based principal component analysis (RPCA) for feature map compression. While superior to discrete cosine transform-based compression methods, this approach is limited when compressing the low-level feature maps, which constitute more than half of the data among the multi-level feature maps. In addition, various PCA-based multi-level feature map compression methods have been proposed, such as those by Park et al. [12], Lee et al. [13], and Rosewarne and Nguyen [14]. These methods commonly employ a dimension-reduction mechanism to eliminate redundancy within each feature map and between the low-level feature maps, leveraging the properties of the FPN structure. All of these methods utilize PCA to extract basis kernels that efficiently reduce the dimensionality of the multi-level feature maps, eliminating redundancy between correlated data in each feature map. However, this requires executing an independent PCA for one or more multi-level feature maps for each image or video frame and compressing the basis matrices, coefficients, and mean vectors needed for feature map reconstruction. These components have distinct characteristics and may require different codecs to achieve sufficient compression efficiency. Depending on the spatial redundancy of each component, Park et al. [12] and Lee et al. [13] utilized VVC and DeepCABAC [15], while Rosewarne and Nguyen [14] allocated these three components to different areas of a picture, each defined as a sub-picture, and compressed the sub-picture with the VVC quantizer at a lower quantization parameter (QP) than the QP used for the other components. However, this approach is not ideal for applications that require low bitrates.
To circumvent these issues, a transform-based feature map compression method employing a predefined generalized basis matrix and mean vector is proposed. In this method, the PCA process is not incorporated in the proposed VCM encoder. Instead, a generalized PCA (GPCA) process [16] is executed separately, prior to encoding. This process generates the generalized basis matrix and mean vector using several images that do not overlap with the test datasets, eliminating the need to iterate through all feature maps of every input image. Generally, each channel within a feature map exhibits a strong correlation with the others, irrespective of the input data provided to the deep-learning network [17,18]. Thus, the generalized basis matrix and mean vector (TGB and TGM) can be utilized to reduce the redundancy between the channels of each feature map, instead of input-dependent basis matrices and mean vectors derived by performing PCA on all feature maps of each input. Moreover, by capitalizing on the spatial redundancy of the multi-level feature maps, a generalized common basis matrix and mean vector (TGCB and TGCM) for the low-level feature maps are obtained via the GPCA process. As TGB, TGM, TGCB, and TGCM are predefined on both the encoder and decoder sides, there is no need to send bitstreams representing them to the decoder side. Therefore, the proposed method only needs to compress and transmit the coefficients required for feature map reconstruction. This allows efficient compression while preserving the accuracy of machine vision tasks. This work extends our contribution to the 140th MPEG meeting [19]. The remainder of this paper is organized as follows. Section 2 elaborates on the proposed feature map compression method. Section 3 presents and discusses the experimental results of the proposed method, validating its effectiveness. Finally, Section 4 concludes the paper.

2. Proposed Transform-Based Feature Map Compression Method

This section provides an in-depth description of the proposed transform-based feature map compression method. Initially, a brief introduction to the proposed scheme’s outline is provided, followed by a more detailed explanation of each component in the remaining subsections. Figure 2 showcases the overall pipeline of the proposed method. Similar to previous works [11,12,13,14], our method utilizes PCA to reduce the dimensions of the multi-level feature maps intended for compression. Previous methods pose certain drawbacks: they necessitate performing an independent PCA for one or more feature maps for each input datum during the encoding process, and they compress the basis matrices, coefficients, and mean vectors with distinct characteristics for feature map reconstruction.
To mitigate these limitations, our method employs the predefined TGB, TGM, TGCB, and TGCM, obtained by running PCA on multiple feature maps extracted from a set of images that are not included in the test data of any machine vision task. This process occurs separately, before encoding. Consequently, a generalized basis matrix and mean vector that capture the strong inter-channel correlation of each feature map can be extracted for each deep-learning network. Hence, the overall pipeline of the proposed method does not incorporate a PCA process at the encoder side.
The proposed method compresses the multi-level feature maps, i.e., the P- or L-layer feature maps, depending on the machine vision task. Table 1 outlines the dimensions of each multi-level feature map. As indicated in the table, the lower-level feature maps (P2 and P3, L0 and L1) account for over 85% of the total feature map data owing to their large spatial dimensions [12].
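As a quick check of this proportion, the element counts implied by Table 1 can be tallied directly; the short Python sketch below (variable names are ours) reproduces the claim.

```python
# Share of the two lowest-level maps in the total multi-level feature map data,
# computed from the dimensions in Table 1 (elements per map = W * H * C).
p_dims = {"P2": (480, 272, 256), "P3": (240, 136, 256),
          "P4": (120, 68, 256),  "P5": (60, 34, 256)}
l_dims = {"L0": (136, 76, 256), "L1": (68, 38, 512), "L2": (34, 19, 1024)}

def low_level_share(dims, low_keys):
    sizes = {name: w * h * c for name, (w, h, c) in dims.items()}
    return sum(sizes[k] for k in low_keys) / sum(sizes.values())

print(f"P2+P3 share of P-layer data: {low_level_share(p_dims, ['P2', 'P3']):.1%}")  # about 94%
print(f"L0+L1 share of L-layer data: {low_level_share(l_dims, ['L0', 'L1']):.1%}")  # about 86%
```

Both shares exceed 85%, consistent with the statement above.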
The deep-learning networks used for each machine vision task commonly employ an FPN structure that generates multi-level feature maps based on a top-down architecture as the backbone network [20]. There exists spatial redundancy between these feature maps. Therefore, to achieve effective compression, our method combines two low-level feature maps, P2 and P3, as well as L0 and L1, by leveraging the nearest-neighbor down-sampling used in the FPN structure and concatenating along the channel dimension. This process is represented in Figure 3a. Consequently, the feature map count is reduced by one, and the feature maps with the largest dimensions are eliminated. The decoded combined feature map is then reconstructed into each low-level feature map by performing an inverse process, as shown in Figure 3b. Consequently, one or two high-level feature maps, such as P4, P5, and L2, as well as the combined feature maps, are compressed. Furthermore, we use an independent set of TGB, TGM, TGCB, and TGCM for each feature map to be compressed, tailored to the specific machine vision task.
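As a rough illustration of this step, the sketch below shows one plausible reading of the feature connector and splitter (function names are ours), assuming a channel-first (C, H, W) layout and a factor-of-two resolution gap between the two low-level maps; the exact module structure is defined by Figure 3, not by this sketch.

```python
import numpy as np

def feature_connector(p2: np.ndarray, p3: np.ndarray) -> np.ndarray:
    """Combine two low-level maps: nearest-neighbour down-sample the larger map
    to the smaller map's resolution, then concatenate along the channel axis."""
    p2_down = p2[:, ::2, ::2]                      # nearest-neighbour 2x down-sampling
    return np.concatenate([p2_down, p3], axis=0)   # (C_p2 + C_p3, H3, W3)

def feature_splitter(combined: np.ndarray, c_p2: int):
    """Inverse process: split the channels and up-sample the first part back."""
    p2_down, p3 = combined[:c_p2], combined[c_p2:]
    p2 = p2_down.repeat(2, axis=1).repeat(2, axis=2)  # nearest-neighbour 2x up-sampling
    return p2, p3

# Toy example with the P2/P3 spatial resolutions from Table 1
# (8 channels instead of 256 to keep the example light).
p2 = np.random.randn(8, 272, 480).astype(np.float32)
p3 = np.random.randn(8, 136, 240).astype(np.float32)
combined = feature_connector(p2, p3)        # (16, 136, 240)
p2_rec, p3_rec = feature_splitter(combined, 8)
```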
Because a task-specific generalized basis matrix and mean vector are predefined for each feature map on both the encoder and decoder sides, they do not need to be compressed or transmitted, which avoids any compression degradation of the generalized data. As illustrated in Figure 2, the proposed encoder generates coefficients for each feature map by centering with the TGM and TGCM and transforming with the TGB and TGCB. To compress the coefficients with VVC, quantization and packing processes are performed, and the result is encoded by VVC to generate a bitstream. On the receiving end, the proposed decoder unpacks and dequantizes the coefficients decoded with VVC. The multi-level feature maps are then reconstructed through mean compensation and an inverse transform using the generalized basis matrix and mean vector. Finally, the reconstructed multi-level feature maps are fed into the machine vision inference network, as shown in Figure 2, to produce inference results for each task.
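Conceptually, the centering and transform steps amount to a per-spatial-point projection onto the generalized basis. The sketch below (our own naming; the basis matrix is assumed to store eigenvectors as columns) illustrates this forward and inverse transform; quantization, packing, and VVC coding of the coefficients then follow as described in Section 2.2.

```python
import numpy as np

def encode_feature_map(fmap: np.ndarray, basis: np.ndarray, mean: np.ndarray, k: int) -> np.ndarray:
    """Center each spatial point of a (C, H, W) feature map with the generalized mean
    vector and project it onto the first k columns of the generalized basis matrix."""
    c, h, w = fmap.shape
    samples = fmap.reshape(c, -1).T              # (H*W, C): one sample per spatial point
    coeffs = (samples - mean) @ basis[:, :k]     # (H*W, k) transform coefficients
    return coeffs.T.reshape(k, h, w)             # k coefficient planes of size H x W

def decode_feature_map(coeffs: np.ndarray, basis: np.ndarray, mean: np.ndarray) -> np.ndarray:
    """Inverse transform followed by mean compensation."""
    k, h, w = coeffs.shape
    samples = coeffs.reshape(k, -1).T @ basis[:, :k].T + mean   # (H*W, C)
    return samples.T.reshape(-1, h, w)           # reconstructed (C, H, W) feature map
```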

2.1. Generalized Basis Matrix and Mean Vector

This subsection elaborates on the GPCA process used to derive each task’s generalized basis matrix and mean vector. Generally, the feature map of an intermediate layer of a deep-learning network is made up of channels that are extracted using convolution filters with fixed, trained parameters. Notably, the channels of the feature map of a given layer are generated by the same convolution filters regardless of the input data fed into the network. Consequently, the relationships between the channels of the compression target feature map can be viewed as relatively similar for any input data [17,18]. Leveraging this characteristic, we obtain the generalized basis matrix and mean vector for each compression target feature map through the GPCA process before encoding. These are predefined in the proposed VCM encoder and decoder and are used to reduce inter-channel redundancy by transforming each feature map. The generalized basis matrix and mean vector for each high-level feature map and the combined feature map are derived for each machine vision task as follows: considering each spatial point (1 × 1 × C) of a feature map (W × H × C) as a sample and performing the PCA process, a basis matrix (C × C) consisting of eigenvectors sorted according to their eigenvalues and a mean vector (C × 1) are obtained. These are then used to reduce the correlation between the feature map channels.
The aforementioned process relies on the strong correlation among the channels of a feature map generated in the same layer, regardless of the input data. To leverage this correlation for dimensionality reduction and compression, each three-dimensional feature map extracted from N data points is reshaped into two dimensions, as shown in Figure 4. This allows the concatenation of feature maps with different spatial resolutions. The GPCA is then performed on each sample set, consisting of one or more feature maps. These sets are obtained from the reshaped feature maps, or from the reshaped combined feature maps formed by concatenation along the channel axis, as illustrated in Figure 4. This process yields the TGB and TGM for each high-level feature map (P4, P5, and L2) and the TGCB and TGCM for the combined feature map composed of two low-level feature maps (P2 and P3, L0 and L1), as shown in Figure 5. These processes are performed separately for each task, obtaining two sets of TGB and TGM and one set of TGCB and TGCM for instance segmentation and object detection. For object tracking, one set of TGB and TGM and one set of TGCB and TGCM are derived. These sets are predefined in both the proposed VCM encoder and decoder for coefficient compression for each task. The proposed method uses the same dataset across all tasks to derive the generalized basis matrices and mean vectors. Furthermore, the number of coefficients to be compressed was determined experimentally, as discussed in detail in the next section.
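As an illustration of this derivation, the following numpy sketch (our own code, not the authors' implementation) derives a generalized basis matrix and mean vector from feature maps of one layer collected over N training images, following the reshaping and concatenation of Figure 4; the same routine would be applied to the reshaped combined feature maps to obtain the TGCB and TGCM.

```python
import numpy as np

def generalized_pca(feature_maps):
    """Derive a generalized basis matrix (C x C) and mean vector (C,) from feature maps
    of the same layer (or combined low-level maps) extracted from N training images.
    Each map has shape (C, H, W); spatial resolutions may differ between images."""
    samples = np.concatenate(
        [f.reshape(f.shape[0], -1).T for f in feature_maps], axis=0)   # (sum(H*W), C)
    mean = samples.mean(axis=0)
    cov = np.cov(samples - mean, rowvar=False)                          # (C, C) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]   # eigenvectors sorted by decreasing eigenvalue
    return eigvecs[:, order], mean
```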

2.2. Quantization and Packing

The coefficients are initially represented as 32-bit floating-point values. For compression with VVC, these values must be quantized to an integer representation. There are various techniques to perform quantization, with the goal of reducing the margin of error as much as possible. One such method is Lloyd–Max quantization [21]. However, prior studies have indicated that the impact on performance from errors in the feature map domain due to quantization is not substantial [11]. Therefore, the previous feature map compression methods and the MPEG-VCM [22] feature anchor generation process adopted a uniform integer quantization method, relying only on the minimum and maximum values and bit depth [11,12,13,14].
Figure 6 shows a feature map and the corresponding coefficients obtained using the TGB and TGM. As shown in Figure 6a, the feature map captures the object edges in the input data. As the coefficients represent the principal components of the feature map, they exhibit characteristics similar to the object edges in the input data, as demonstrated in Figure 6b. Therefore, the proposed VCM codec uses the following uniform integer quantization process:
$$C_Q(i,j,k) = \operatorname{round}\!\left(\frac{\left(C(i,j,k) - C_{\min}\right)\times\left(2^{n}-1\right)}{C_{\max} - C_{\min}}\right),$$
where $C$ and $C_Q$ represent the original and quantized coefficients, respectively; $n$ is the bit-depth for quantization; and $C_{\min}$ and $C_{\max}$ denote the minimum and maximum values of all coefficients for the input image or video, respectively. The dequantization process performed on the decoder side is meant to reconstruct an approximation of the original coefficient values. It can be expressed as follows:
$$C_{DQ}(i,j,k) = C_Q(i,j,k) \times \frac{C_{\max} - C_{\min}}{2^{n}-1} + C_{\min},$$
where $C_Q$ represents the decoded coefficients. The dequantized coefficients $C_{DQ}$ are derived by multiplying by the quantization step size and adding the minimum value, yielding floating-point values in the original range.
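A direct implementation of the two expressions above, with the 8-bit depth used in the experiments as the default (function names are ours), could look as follows:

```python
import numpy as np

def quantize(c: np.ndarray, c_min: float, c_max: float, n: int = 8) -> np.ndarray:
    """Uniform integer quantization of the coefficients to n-bit values."""
    cq = np.round((c - c_min) * (2 ** n - 1) / (c_max - c_min))
    return cq.astype(np.uint16 if n > 8 else np.uint8)

def dequantize(cq: np.ndarray, c_min: float, c_max: float, n: int = 8) -> np.ndarray:
    """Reconstruct an approximation of the original coefficient values."""
    return cq.astype(np.float32) * (c_max - c_min) / (2 ** n - 1) + c_min
```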
In the proposed VCM encoder, a packing process is carried out after the quantization process, as illustrated in Figure 2. The dimensions of each coefficient plane ($W_F \times H_F$) match the spatial resolution of the corresponding feature map. The number of coefficients to be compressed is determined by the retained ratio of the TGB and TGCB. The quantized coefficients can be rearranged using spatial or temporal packing for efficient compression. However, the gain of a spatiotemporal arrangement is not significant compared with a spatial arrangement, as reported in [23]. Thus, the packing process is performed in a spatial tiling fashion in z-scan order, as demonstrated in Figure 7. Finally, the packed and quantized coefficients for each feature map are encoded using VVC.
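A simplified packing sketch is shown below; it tiles the k quantized coefficient planes into a single picture for VVC encoding, but uses raster order rather than the z-scan order of Figure 7 for brevity (the function name is ours).

```python
import numpy as np

def pack_spatial(coeffs: np.ndarray, tiles_per_row: int) -> np.ndarray:
    """Tile k quantized coefficient planes of shape (k, H_F, W_F) into one 2D picture.
    Unfilled tiles (when k is not a multiple of tiles_per_row) are left as zeros."""
    k, h, w = coeffs.shape
    rows = -(-k // tiles_per_row)  # ceiling division
    picture = np.zeros((rows * h, tiles_per_row * w), dtype=coeffs.dtype)
    for i in range(k):
        r, c = divmod(i, tiles_per_row)
        picture[r * h:(r + 1) * h, c * w:(c + 1) * w] = coeffs[i]
    return picture
```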

3. Experimental Results

Experimental results of the proposed feature map compression method for feature anchors [24,25] under the MPEG-VCM common test condition (CTC) [22] are presented. The experimental setup is described in detail, and then the objective performance of the proposed method is evaluated for each machine vision task.
Table 2 outlines the experimental setup, detailing the deep-learning network, test dataset, and metrics for each machine vision task, as defined in the MPEG-VCM CTC. The experiments employed three pretrained deep-learning networks: Faster R-CNN X101-FPN and Mask R-CNN X101-FPN from Detectron2 [26], and JDE-1088 × 608 built on the DarkNet-53 ImageNet pretrained model [27]. Experiments were conducted on four test sets from three datasets: two 5k-image subsets of the OpenImage V6 dataset [28], one for object detection and one for instance segmentation; the SFU-HW-Objects-v1 dataset [29] for object detection; and three video sequences from the Tencent video dataset (TVD) [30] for object tracking. The performance of the proposed scheme was evaluated using mean average precision (mAP) for object detection and instance segmentation and multiple object tracking accuracy (MOTA) for object tracking. Bitrate was measured in bits per pixel (bpp) for the image datasets and in kbps for the video datasets. The Bjøntegaard-delta (BD)-rate [31] metric was also used to evaluate average bitrate savings at equivalent machine vision task performance.
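For reference, the BD-rate between two rate/accuracy curves can be computed with the standard Bjøntegaard procedure [31], with mAP or MOTA taking the role of PSNR. The sketch below (our own code) follows the usual cubic-fit-and-integrate formulation.

```python
import numpy as np

def bd_rate(rate_anchor, acc_anchor, rate_test, acc_test):
    """Average bitrate difference (in percent) of the test curve relative to the
    anchor curve at equal task accuracy; negative values mean bitrate savings."""
    log_r_a, log_r_t = np.log10(rate_anchor), np.log10(rate_test)
    # Fit log-rate as a cubic polynomial of accuracy for both curves.
    p_a = np.polyfit(acc_anchor, log_r_a, 3)
    p_t = np.polyfit(acc_test, log_r_t, 3)
    lo = max(min(acc_anchor), min(acc_test))
    hi = min(max(acc_anchor), max(acc_test))
    # Integrate both fits over the overlapping accuracy interval.
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100
```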
For evaluating the proposed method, three sets of the generalized basis matrix and mean vector are required for object detection and instance segmentation, and two sets for object tracking. These were obtained through the GPCA process described in Section 2.1. The dataset used for this purpose comprised 344 frames randomly sampled from the TVD training videos, none of which overlapped with the test data of any task. Furthermore, the quantities of TGB and TGCB were determined at the saturation point for each machine vision task by comparing the amount of data to be compressed against the corresponding performance (mAP and MOTA) as a function of the TGB ratio. As illustrated in Figure 8, the quantity of TGCB for P2 and P3 was set at 15%, and the quantity of TGB for P4 and P5 was set at 30%; at this point, mAP reached saturation in object detection. The TGB ratio reached saturation in instance segmentation at the same point as in object detection. Additionally, MOTA reached saturation in object tracking when the quantity of TGCB for L0 and L1 was set at 30% and the quantity of TGB for L2 was set at 20%.
We also employed the VVC test model (VTM12.0 [32]) for compression of coefficients. For all tasks, coefficients were quantized by setting the quantization bit-depth n to 8. Furthermore, the quantized coefficients were encoded with six QPs per task, using an all-intra (AI) configuration for object detection and instance segmentation, and a random access (RA) configuration for object tracking as per the VVC common test conditions [33].
Table 3, Table 4 and Table 5 outline the performance of the proposed method compared with the feature anchor on each machine vision task’s test dataset. Experiments were carried out at six evaluation rate points (QPs or rate points) for each dataset, and the BD-rate was subsequently calculated. The corresponding rate performance (RP) curves are shown in Figure 9. The six evaluation rate points for each dataset were determined as follows. For the OpenImage V6 and TVD datasets, experiments were first run over all available QPs; the six QPs whose mAP (or MOTA) was most similar to that of each QP in the feature anchor results were then selected. For the 14 sequences of the SFU-HW-Objects-v1 dataset, the same six QPs per sequence as in the feature anchor results were used.
As indicated in Table 3, the proposed method showed superior performance compared with the feature anchor across all evaluation points. Specifically, it achieved 85.35%, 70.33%, 85.15%, and 72.37% BD-rate reductions compared with the feature anchor for the OpenImage V6 dataset and each class of the SFU-HW-Objects-v1 dataset, respectively. Moreover, the proposed method also demonstrated improved performance in instance segmentation compared with the feature anchor, obtaining an 89.27% BD-rate reduction as illustrated in Table 4. In terms of object tracking, the experimental results summarized in Table 5 show that the proposed method achieved BD-rate reductions of 50.49%, 72.47%, and 79.09% for TVD-01, TVD-02, and TVD-03, respectively. It displayed a comparable MOTA performance at a lower bitrate than the feature anchor at each rate point and achieved a 64.91% BD-rate gain.
Table 6 shows the BD-rate reductions achieved by our method against the previous PCA-based feature map compression methods [12,13,14] in object detection and instance segmentation for the OpenImage V6 dataset. In the object detection task, the proposed method achieved an 85.35% BD-rate reduction against the feature anchor and 69.20%, 65.29%, and 49.87% BD-rate reductions against the previous methods [12], [13], and [14], respectively. In the instance segmentation task, it achieved an 89.27% BD-rate reduction against the feature anchor and a 41.90% BD-rate reduction against [14].
Table 5 reveals that the MOTA performance of the proposed method at each QP is similar to that of the feature anchor for TVD-03. However, for TVD-01 and TVD-02, the MOTA performance of the proposed method is lower than that of the feature anchor, particularly at high QPs. A similar discrepancy appears in some sequences of the SFU-HW-Objects-v1 dataset. To analyze these results, we examined the ground truth of all frames in the TVD dataset, as depicted in Figure 10. As shown in the figure, nearly all ground-truth bounding boxes in TVD-01 and TVD-02 are small relative to the video resolution, whereas those in TVD-03 are considerably larger. The same phenomenon appears in the SFU-HW-Objects-v1 sequences for which the proposed algorithm showed lower performance than the feature anchor at high bitrates. Nevertheless, the proposed feature map compression method delivers superior overall performance, including high-bitrate accuracy, for sequences mainly comprising medium or large objects. We therefore attribute the somewhat lower accuracy at higher bitrates in some sequences to reduced small-object detection accuracy. More specifically, the proposed method achieves higher compression efficiency by compressing the combined low-level feature maps using the characteristics of the FPN structure, but this process may degrade the information in the low-level feature maps on which small-object detection relies. Despite this, the proposed algorithm generally outperforms all feature anchors of the MPEG-VCM, as shown in Table 3, Table 4 and Table 5 and Figure 9. In particular, the proposed algorithm obtained superior accuracy at low bitrates on all machine vision tasks and datasets, which is important for deploying MPEG-VCM technology on low-power devices. As a result, the proposed compression of multi-level feature maps can offer high-accuracy machine vision task performance at a reduced bit cost, particularly for medium or large-sized objects.

4. Conclusions

In this paper, we presented a transform-based feature map compression approach for VCM. The compression target feature maps for each VCM task exhibit spatial similarities, as they are extracted from the FPN structure with its top-down architecture. Additionally, each channel within a feature map generated by the same layer is strongly correlated with the others, irrespective of the input data fed to the deep-learning network. Given these properties, the proposed method conducts the GPCA process separately from encoding to derive the TGB, TGCB, TGM, and TGCM for effective compression. These elements are predefined in the proposed VCM encoder and decoder and are used to reduce the cross-channel redundancy of the compression target feature maps through transforms. Consequently, the PCA process is excluded from the VCM encoder, and the generalized data incur no compression loss. The encoder compresses only the coefficients required for reconstructing each feature map, using VVC. Experimental evaluations conducted under the MPEG-VCM common test conditions across various machine vision tasks demonstrated the effectiveness of the proposed method. For object detection on the OpenImage V6 and SFU-HW-Objects-v1 datasets, our method attained BD-rate reductions of 85.35%, 70.33%, 85.15%, and 72.37% compared with the feature anchor. In instance segmentation, the method yielded an 89.27% BD-rate reduction, and for object tracking, it achieved a 64.91% BD-rate reduction relative to the feature anchor. Moreover, the proposed algorithm outperformed the previous PCA-based feature map compression methods. We plan to enhance the proposed algorithm to ensure strong performance even for small objects, enabling our approach to be applied across a broader range of use cases.

Author Contributions

Conceptualization, M.L., S.P., S.-J.O. and D.S.; methodology, M.L. and S.P.; software, S.P.; validation, Y.K., S.Y.J. and J.L.; formal analysis, M.L.; investigation, M.L. and S.P.; writing—original draft preparation, M.L.; writing—review and editing, M.L.; visualization, M.L.; supervision, S.-J.O. and D.S.; project administration, S.-J.O. and D.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Institute of Information and Communications Technology Planning and Evaluation (IITP) under a grant funded by the Korean Government through the Ministry of Science and ICT (MSIT) (Grant No. 2020-0-00011 for the project Video Coding for Machine). Additional funding was provided by the Basic Science Research Program, overseen by the National Research Foundation of Korea (NRF) and funded by the MSIT (Grant No. NRF-2021R1A2C2092848). The Information Technology Research Center (ITRC) support program, supervised by the IITP and funded by MSIT, Korea, also provided significant support (Grant No. IITP-2023-RS-2022-00156225).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shafique, K.; Khawaja, B.A.; Sabir, F.; Qazi, S.; Mustaqim, M. Internet of things (IoT) for next-generation smart systems: A review of current challenges, future trends and prospects for emerging 5G-IoT scenarios. IEEE Access 2020, 8, 23022–23040. [Google Scholar] [CrossRef]
  2. Li, H.; Ota, K.; Dong, M. Learning IoT in edge: Deep learning for the internet of things with edge computing. IEEE Netw. 2018, 32, 96–101. [Google Scholar] [CrossRef]
  3. Sullivan, G.J.; Ohm, J.-R.; Han, W.-J.; Wiegand, T. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668. [Google Scholar] [CrossRef]
  4. Lee, M.; Song, H.; Park, J.; Jeon, B.; Kang, J.; Kim, J.-G.; Lee, Y.-L.; Kang, J.-W.; Sim, D. Overview of versatile video coding (H.266/VVC) and its coding performance analysis. IEIE Trans. Smart Process. Comput. 2023, 12, 122–154. [Google Scholar] [CrossRef]
  5. Chen, S.; Jin, J.; Meng, L.; Lin, W.; Chen, Z.; Chang, T.-S.; Li, Z.; Zhang, H. A new image codec paradigm for human and machine uses. arXiv 2021, arXiv:2112.10071. [Google Scholar]
  6. Hollmann, C.; Liu, S.; Rafie, M.; Zhang, Y. Use Cases and Requirements for Video Coding for Machines. Moving Picture Experts Group (MPEG) of ISO/IEC JTC 1/SC 29/WG 2, Doc. N00190. In Proceedings of the 138th MPEG Meeting, Online, 25–29 April 2022. [Google Scholar]
  7. Cao, J.; Yao, X.; Zhang, H.; Jin, J.; Zhang, Y.; Ling, B.W.-K. Slimmable multi-task image compression for human and machine vision. IEEE Access 2023, 11, 29946–29958. [Google Scholar] [CrossRef]
  8. Chen, Z.; Fan, K.; Wang, S.; Duan, L.; Lin, W.; Kot, A.C. Toward intelligent sensing: Intermediate deep feature compression. IEEE Trans. Image Process. 2020, 29, 2230–2243. [Google Scholar] [CrossRef] [PubMed]
  9. Bajić, I.V.; Lin, W.; Tian, Y. Collaborative intelligence: Challenges and opportunities. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toronto, ON, Canada, 11 June 2021; pp. 8493–8497. [Google Scholar]
  10. Choi, H.; Bajić, I.V. Deep feature compression for collaborative object detection. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 3743–3747. [Google Scholar]
  11. Lee, M.; Choi, H.; Kim, J.; Do, J.; Kwon, H.; Jeong, S.Y.; Sim, D.; Oh, S.-J. Feature map compression for video coding for machines based on receptive block based principal component analysis. IEEE Access 2023, 11, 26308–26319. [Google Scholar] [CrossRef]
  12. Park, S.; Lee, M.; Choi, H.; Kim, M.; Oh, S.-J.; Kim, Y.; Do, J.; Jeong, S.Y.; Sim, D. A feature map compression method for multi-resolution feature map with PCA-based transformation. J. Broadcast Eng. 2022, 27, 56–68. [Google Scholar]
  13. Lee, M.; Choi, H.; Park, S.; Kim, M.; Oh, S.-J.; Sim, D.; Kim, Y.; Lee, J.; Do, J.; Jeong, S.Y. [VCM track1] Advanced feature map compression based on optimal transformation with VVC and DeepCABAC. Moving Picture Experts Group (MPEG) of ISO/IEC JTC 1/SC 29/WG 2, Doc. m58787. In Proceedings of the 137th MPEG Meeting, Online, 17–21 January 2022. [Google Scholar]
  14. Rosewarne, C.; Nguyen, R. [VCM Track 1] Tensor compression using VVC. Moving Picture Experts Group (MPEG) of ISO/IEC JTC 1/SC 29/WG 2, Doc. m59591. In Proceedings of the 138th MPEG Meeting, Online, 25–29 April 2022. [Google Scholar]
  15. Wiedemann, S.; Kirchhoffer, H.; Matlage, S.; Haase, P.; Marban, A.; Marinč, T.; Neumann, D.; Nguyen, T.; Schwarz, H.; Wiegand, T.; et al. DeepCABAC: A universal compression algorithm for deep neural networks. IEEE J. Sel. Top. Signal Process. 2020, 14, 700–714. [Google Scholar] [CrossRef]
  16. Vidal, R.; Ma, Y.; Sastry, S. Generalized principal component analysis (GPCA). IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1945–1959. [Google Scholar] [CrossRef] [PubMed]
  17. Gwak, D.-G.; Kim, C.; Lim, J. [VCM track 1] Feature data compression based on generalized PCA for object detection. Moving Picture Experts Group (MPEG) of ISO/IEC JTC 1/SC 29/WG 2, Doc. m58785. In Proceedings of the 137th MPEG Meeting, Online, 17–21 January 2022. [Google Scholar]
  18. Wang, Y.; Xu, C.; Xu, C.; Tao, D. Beyond filter: Compact feature map for portable deep model. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 3703–3711. [Google Scholar]
  19. Lee, M.; Park, S.; Oh, S.-J.; Sim, D.; Kim, Y.; Lee, J.; Jeong, S.Y. [VCM Track 1] Response to CfE: A transformation-based feature map compression method. Moving Picture Experts Group (MPEG) of ISO/IEC JTC 1/SC 29/WG 2, Doc. m60788. In Proceedings of the 140th MPEG Meeting, Mainz, Germany, 24–28 October 2022. [Google Scholar]
  20. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 June 2017; pp. 2117–2125. [Google Scholar]
  21. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
  22. Moving Picture Experts Group (MPEG). Common test conditions and evaluation methodology for video coding for machines. Moving Picture Experts Group (MPEG) of ISO/IEC JTC 1/SC 29/WG 2, Doc. N00231. In Proceedings of the 139th MPEG Meeting, Online, 18–22 July 2022. [Google Scholar]
  23. Suzuki, S.; Takagi, M.; Takeda, S.; Tanida, R.; Kimata, H. Deep feature compression with spatio-temporal arranging for collaborative intelligence. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 25–28 October 2020; pp. 3099–3103. [Google Scholar]
  24. Moving Picture Experts Group (MPEG). Call for Evidence for Video Coding for Machines. Moving Picture Experts Group (MPEG) of ISO/IEC JTC 1/SC 29/WG 2, Doc. N00215. In Proceedings of the 139th MPEG Meeting, Online, 18–22 July 2022. [Google Scholar]
  25. Rosewarne, C.; Nguyen, R. [FCVCM] SFU object detection feature anchor update. Moving Picture Experts Group (MPEG) of ISO/IEC JTC 1/SC 29/WG 2, Doc. m62503. In Proceedings of the 142nd MPEG Meeting, Antalya, Turkey, 24–28 April 2023. [Google Scholar]
  26. Detectron2 Pretrained Model. Available online: https://github.com/facebookresearch/detectron2 (accessed on 11 October 2019).
  27. DarkNet-53 ImageNet Pretrained Model. Available online: https://github.com/Zhongdao/Towards-Realtime-MOT (accessed on 5 October 2019).
  28. OpenImage V6 Dataset. Available online: https://storage.googleapis.com/openimages/web/download_v6.html (accessed on 26 February 2020).
  29. SFU-HW-Objects-v1 Dataset. Available online: https://data.mendeley.com/datasets/hwm673bv4m (accessed on 2 December 2020).
  30. Xu, X.; Liu, S.; Li, Z. Tencent video dataset (TVD): A video dataset for learning-based visual data compression and analysis. arXiv 2021, arXiv:2105.05961. [Google Scholar]
  31. Bjøntegaard, G. Calculation of average PSNR differences between RD-Curves. ITU-T SG16/Q6, Doc. VCEG-M33. In Proceedings of the 13th Moving Video Coding Experts Group (VCEG), Austin, Texas, USA, 2–4 April 2001. [Google Scholar]
  32. VVC Test Model (VTM12.0). Available online: https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/-/tree/VTM-12.0?ref_type=tags (accessed on 16 February 2021).
  33. Chao, Y.-H.; Sun, Y.-C.; Xu, J.; Xu, X. JVET common test conditions and software reference configurations for non-4:2:0 colour formats. Moving Picture Experts Group (MPEG) of ISO/IEC JTC 1/SC 29/WG 5, Doc. JVET-T2013. In Proceedings of the 132nd MPEG Meeting, Online, 12–16 October 2020. [Google Scholar]
Figure 1. Network architecture of feature extractor for MPEG-VCM main tasks. (a) Feature extractor of Faster R-CNN X101-FPN and Mask R-CNN X101-FPN for object detection and instance segmentation. (b) Feature extractor of JDE-1088 × 608 for object tracking.
Figure 2. Overall pipeline of the proposed feature map compression method.
Figure 3. Block diagram of the (a) feature connector and (b) feature splitter modules in Figure 2.
Figure 4. Block diagram of the process for generating reshaped feature map. (a) Reshaped feature map for each high-level feature map (P4, P5, and L2). (b) Reshaped combined feature map for low-level feature maps (P2 and P3, L0 and L1).
Figure 5. Block diagram of the process for obtaining generalized basis matrix and mean vector. (a) Generalized basis matrix and mean vector for each high-level feature map (P4, P5, and L2). (b) Generalized common basis matrix and mean vector for low-level feature maps (P2 and P3, L0 and L1).
Figure 6. Example of feature map and the corresponding coefficients derived by using TGB and TGM. (a) Five channels of feature map, (b) five coefficients.
Figure 7. Example of packed and quantized coefficients.
Figure 8. Object detection performance of the proposed method according to the compressed data ratio on OpenImage V6 dataset (x axis: proportion of data to be compressed depending on generalized basis matrix ratio; y axis: mAP performance).
Figure 9. RP curves for each machine vision task (x axis: bpp or bitrate, y axis: mAP or MOTA). (ad) Object detection; (e) instance segmentation; (fi) object tracking. (a) OpenImage V6 dataset (b) Class AB of SFU-HW-Objects-v1 dataset; (c) Class C of SFU-HW-Objects-v1 dataset; (d) Class D of SFU-HW-Objects-v1 dataset; (e) OpenImage V6 dataset; (f) TVD-01 of the TVD dataset; (g) TVD-02 of the TVD dataset; (h) TVD-03 of the TVD dataset; and (i) overall for the TVD dataset [12,13,14].
Figure 10. Ground truth of the TVD dataset (a) Example of TVD-01. (b) Example of TVD-03. (c) Histogram of bounding box size (width, height) of all ground truth for the TVD dataset.
Table 1. Dimension for each multi-level feature map (input data dimension (W × H × C): 1920 × 1080 × 3).

| P-Layer Feature Maps | Dimension (W × H × C) | L-Layer Feature Maps | Dimension (W × H × C) |
|---|---|---|---|
| P2 | 480 × 272 × 256 | L0 | 136 × 76 × 256 |
| P3 | 240 × 136 × 256 | L1 | 68 × 38 × 512 |
| P4 | 120 × 68 × 256 | L2 | 34 × 19 × 1024 |
| P5 | 60 × 34 × 256 | - | - |
Table 2. Experimental setup to evaluate performance of the proposed method for each machine vision task.

| | Object Detection | Instance Segmentation | Object Tracking |
|---|---|---|---|
| Deep-learning network | Faster R-CNN X101-FPN | Mask R-CNN X101-FPN | JDE-1088 × 608 |
| Test dataset | 5k images from the OpenImage V6 dataset [28]; SFU-HW-Objects-v1 dataset [29] | 5k images from the OpenImage V6 dataset [28] | TVD-01, TVD-02, and TVD-03 (from the TVD dataset [30]) |
| Metric (task performance) | mAP (%) | mAP (%) | MOTA (%) |
| Metric (bitrate measurement) | bpp (bits) for image; bitrate (kbps) for video | bpp (bits) | bitrate (kbps) |
Table 3. Performance of the proposed method in object detection.

OpenImage V6 dataset:

| Feature Anchor QP | bpp (bits) | mAP (%) | Proposed QP | bpp (bits) | mAP (%) | BD-Rate (%) |
|---|---|---|---|---|---|---|
| 35 | 1.37 | 78.74 | 32 | 0.43 | 78.31 | −85.35% |
| 37 | 0.94 | 78.18 | 35 | 0.27 | 77.68 | |
| 39 | 0.65 | 76.84 | 40 | 0.12 | 76.33 | |
| 41 | 0.44 | 74.65 | 45 | 0.05 | 72.81 | |
| 43 | 0.29 | 69.49 | 47 | 0.03 | 69.23 | |
| 45 | 0.20 | 60.11 | 49 | 0.02 | 64.74 | |

SFU-HW-Objects-v1 dataset, Class AB:

| Feature Anchor Rate Point | bitrate (kbps) | mAP (%) | Proposed Rate Point | bitrate (kbps) | mAP (%) | BD-Rate (%) |
|---|---|---|---|---|---|---|
| 0 | 7761.06 | 52.72 | 0 | 5266.12 | 49.19 | −70.33% |
| 1 | 4324.12 | 51.22 | 1 | 2010.33 | 49.38 | |
| 2 | 2392.21 | 44.35 | 2 | 758.44 | 47.87 | |
| 3 | 1244.20 | 38.15 | 3 | 276.88 | 39.56 | |
| 4 | 681.53 | 30.78 | 4 | 100.44 | 28.62 | |
| 5 | 365.85 | 19.12 | 5 | 42.65 | 8.51 | |

SFU-HW-Objects-v1 dataset, Class C:

| Feature Anchor Rate Point | bitrate (kbps) | mAP (%) | Proposed Rate Point | bitrate (kbps) | mAP (%) | BD-Rate (%) |
|---|---|---|---|---|---|---|
| 0 | 9333.60 | 48.12 | 0 | 8037.45 | 44.07 | −85.15% |
| 1 | 5324.62 | 45.43 | 1 | 2822.83 | 44.53 | |
| 2 | 3033.83 | 42.26 | 2 | 1022.01 | 42.81 | |
| 3 | 1662.77 | 35.41 | 3 | 367.73 | 36.43 | |
| 4 | 897.39 | 24.15 | 4 | 128.14 | 27.13 | |
| 5 | 551.83 | 15.91 | 5 | 49.71 | 8.53 | |

SFU-HW-Objects-v1 dataset, Class D:

| Feature Anchor Rate Point | bitrate (kbps) | mAP (%) | Proposed Rate Point | bitrate (kbps) | mAP (%) | BD-Rate (%) |
|---|---|---|---|---|---|---|
| 0 | 6745.37 | 43.82 | 0 | 6986.45 | 41.92 | −72.37% |
| 1 | 3993.40 | 42.01 | 1 | 2456.03 | 40.50 | |
| 2 | 2391.73 | 39.13 | 2 | 892.89 | 37.98 | |
| 3 | 1373.15 | 35.32 | 3 | 321.95 | 33.05 | |
| 4 | 759.90 | 26.02 | 4 | 115.27 | 24.28 | |
| 5 | 470.17 | 11.84 | 5 | 47.02 | 11.19 | |
Table 4. Performance of the proposed method in instance segmentation.

OpenImage V6 dataset:

| Feature Anchor QP | bpp (bits) | mAP (%) | Proposed QP | bpp (bits) | mAP (%) | BD-Rate (%) |
|---|---|---|---|---|---|---|
| 35 | 1.38 | 80.63 | 35 | 0.30 | 80.09 | −89.27% |
| 37 | 0.95 | 79.10 | 37 | 0.22 | 80.04 | |
| 39 | 0.65 | 77.32 | 42 | 0.09 | 77.62 | |
| 41 | 0.44 | 72.85 | 44 | 0.07 | 75.48 | |
| 43 | 0.29 | 64.89 | 47 | 0.04 | 69.37 | |
| 45 | 0.20 | 50.95 | 50 | 0.02 | 56.88 | |
Table 5. Performance of the proposed method in object tracking.

TVD-01:

| Feature Anchor QP | bitrate (kbps) | MOTA (%) | Proposed QP | bitrate (kbps) | MOTA (%) | BD-Rate (%) |
|---|---|---|---|---|---|---|
| 22 | 6576.74 | 35.46 | 26 | 1697.18 | 22.60 | −50.49% |
| 24 | 4394.81 | 31.94 | 28 | 1172.08 | 21.42 | |
| 26 | 2861.28 | 27.08 | 29 | 957.98 | 19.28 | |
| 29 | 1388.16 | 16.72 | 30 | 767.96 | 16.07 | |
| 31 | 788.09 | 9.38 | 32 | 491.34 | 12.36 | |
| 33 | 411.88 | 2.14 | 37 | 139.00 | 0.77 | |

TVD-02:

| Feature Anchor QP | bitrate (kbps) | MOTA (%) | Proposed QP | bitrate (kbps) | MOTA (%) | BD-Rate (%) |
|---|---|---|---|---|---|---|
| 23 | 5325.82 | 54.41 | 23 | 3572.15 | 48.23 | −72.47% |
| 25 | 3445.63 | 52.18 | 25 | 2499.01 | 48.18 | |
| 27 | 2176.04 | 46.14 | 27 | 1707.19 | 47.05 | |
| 28 | 1725.76 | 41.46 | 32 | 598.09 | 43.19 | |
| 30 | 1026.00 | 29.52 | 35 | 298.08 | 36.19 | |
| 31 | 787.87 | 15.17 | 38 | 146.39 | 29.16 | |

TVD-03:

| Feature Anchor QP | bitrate (kbps) | MOTA (%) | Proposed QP | bitrate (kbps) | MOTA (%) | BD-Rate (%) |
|---|---|---|---|---|---|---|
| 25 | 4011.16 | 67.86 | 27 | 1425.79 | 67.46 | −79.09% |
| 26 | 3106.91 | 66.10 | 28 | 1171.78 | 67.37 | |
| 27 | 2460.80 | 63.65 | 30 | 731.59 | 65.70 | |
| 29 | 1458.38 | 59.69 | 32 | 464.49 | 63.59 | |
| 30 | 1101.87 | 55.43 | 34 | 283.96 | 59.23 | |
| 31 | 828.46 | 50.43 | 35 | 223.17 | 56.23 | |

Overall (TVD dataset):

| Feature Anchor Rate Point | bitrate (kbps) | MOTA (%) | Proposed Rate Point | bitrate (kbps) | MOTA (%) | BD-Rate (%) |
|---|---|---|---|---|---|---|
| 0 | 5440.45 | 49.84 | 0 | 1790.82 | 42.47 | −64.91% |
| 1 | 3790.18 | 47.11 | 1 | 1313.32 | 41.80 | |
| 2 | 2631.71 | 43.18 | 2 | 949.29 | 39.91 | |
| 3 | 1451.58 | 35.76 | 3 | 631.22 | 37.12 | |
| 4 | 936.11 | 29.47 | 4 | 389.68 | 32.97 | |
| 5 | 614.80 | 22.81 | 5 | 172.70 | 25.17 | |
Table 6. BD-rate comparison with previous works for the OpenImage V6 dataset.

| | Object Detection | Object Detection | Object Detection | Instance Segmentation |
|---|---|---|---|---|
| | Park et al. [12] | Lee et al. [13] | Rosewarne and Nguyen [14] | Rosewarne and Nguyen [14] |
| BD-rate (%) | −69.20% | −65.29% | −49.87% | −41.90% |
