1. Introduction
Since Shannon proposed information entropy in 1948 [1], image compression has been a popular research field. Before the advent of computer vision, all images were created for human perception, so the goal of image compression was to make the compressed image visually identical to the original. In recent years, with the continuous development of object detection techniques, images can now also be “seen” by machines. When the final recipient of an image is an object detection model, the goal of compression therefore shifts to making the detection results on the compressed representation as close as possible to those on the original image. Traditional compression methods do not take the requirements of detection tasks into account: every part of the image is compressed uniformly according to a preset compression rate. As a result, information needed for detection is discarded at high compression rates, leading to a drastic drop in accuracy. We therefore need a compression method capable of preserving the information that is crucial for detection tasks.
Handcrafted image codecs [2,3,4,5] reduce image size by transforming the image into the frequency domain through mathematical transforms [6,7] and then removing high-frequency components to which the human eye is insensitive. Learning-based image compression methods [8,9,10], on the other hand, use neural networks to perform a nonlinear transform of the image. Such methods learn to preserve key information by training with a mean squared error (MSE) loss between the original and reconstructed images. Some of them [11,12,13,14,15,16,17] even surpass advanced handcrafted methods (such as the still image coding of VVC [18]) in terms of PSNR (peak signal-to-noise ratio) and MS-SSIM [19]. Both kinds of compression methods fundamentally discard information that is less perceptible to the human eye, reducing the size of the image while preserving visual quality.
Currently, in many edge–cloud collaborative scenarios, it is challenging to deploy complex image processing models because of the limited computational capability of edge devices. One way to solve this problem is to compress the models directly [20]; another is to compress the images captured by edge devices and transmit them to a powerful cloud server for processing. This paper focuses on the latter approach, and this scenario has motivated a growing body of research on preserving the performance of downstream task models on compressed data. One kind of method [21,22,23] adjusts the backend model to work with compressed input; e.g., Chan et al. [22] improved the performance of downstream tasks on compressed data by conducting transfer learning on backend DNNs with compressed data acquired from vehicle sensors. Another type of method makes adjustments at the input side to minimize the impact of compression on downstream models. For example, for video compression, Huang et al. [24] proposed a Learned Semantic Representation (LSR) method that extracts semantic information between temporally adjacent frames, which can be used both for signal reconstruction observable by humans and for visual analysis understandable by machines. In this paper, we present a feature compression approach designed for integration with current neural-network-based image compression frameworks; it can be incorporated smoothly into these frameworks to improve their efficiency on object detection tasks. Furthermore, our method considers the computational power imbalance between the edge and the cloud. During edge-side encoding, a lightweight, high-recall module identifies potential target areas in the extracted features. The features are then encoded and transmitted to the cloud, where they are decoded and a powerful detection model performs the final detection. Compression is thus performed efficiently at the edge while high-accuracy detection is carried out in the cloud, addressing both the need for high detection performance and the constraints of edge computing.
The idea behind our method is that, for image compression aimed at detection, we can remove information to which the detection model is less sensitive. Each spatial position in the image affects the detection results differently; regions containing objects are more influential than the rest. Therefore, a key issue of compression for detection is how to effectively retain the parts containing the objects. On the edge device, a lightweight model can be used to determine the regions where objects are located. Based on this idea, we design a masked feature compression method. First, a feature extractor processes the image to extract low-level features, and a mask generator then creates an object mask that selects the target regions. We further enlarge the mask with information from the objects’ vicinity through a “neighborhood convolution” step. Through complexity and feasibility analysis, we find that directly compressing the features obtained from the feature extractor is more efficient than compressing the input image. The encoder then compresses the masked features into latent representations. At the decoding stage, a decoder reconstructs the features. Experimental results show that, at identical compression rates, our method outperforms other compression methods in detection accuracy, while achieving faster encoding and decoding than several up-to-date neural-network image compression methods. The key contributions of this paper can be summarized as follows:
We explore the feasibility of applying generated masks to low-level features and reduce the model’s time complexity by directly compressing the features. The model’s encoding speed surpasses that of current DNN image compression models.
We design a lightweight mask generator that can generate an object mask in one forward pass, and perform compression on the masked feature to save bits while ensuring the accuracy of backend object detection tasks.
The proposed framework can easily be integrated with existing neural network compression frameworks, enhancing the compression performance for object detection tasks.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 explores the feasibility of masked feature compression. Section 4 provides a detailed description of the proposed method. Section 5 presents and discusses the experimental results. Finally, Section 6 concludes the paper.
3. Feasibility Analysis of Masked Feature Compression
To utilize object-related information during the encoding process, we need to perform a preliminary detection on the input image to identify potential target areas before the actual compression. Two different pipelines can then be adopted to integrate the acquired information with encoding: the first uses the original image as the encoder input, while the second feeds the extracted features directly into the encoder for compression. The structures of the two pipelines are illustrated in Figure 2.
Feature extraction reduces the resolution of the input image, and for convolution operations the resolution significantly affects speed. The computational complexity of a convolution is shown in Equation (4):

$$S^2 \cdot K^2 \cdot C_{in} \cdot C_{out}, \qquad (4)$$

where $S$ is the side length of the output feature map, $K$ is the kernel size, $C_{in}$ is the number of input channels, and $C_{out}$ is the number of output channels. Clearly, using the image as the encoder input does not take advantage of the resolution reduction achieved by feature extraction; directly compressing the extracted features is therefore the more efficient choice for encoding.
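To make Equation (4) concrete, the short Python sketch below (the helper name and the example layer settings are ours, purely for illustration) computes the multiply-accumulate estimate and shows how strongly the output resolution drives the cost:

```python
def conv_complexity(out_side: int, kernel: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulate estimate of one convolution layer, Equation (4):
    S^2 * K^2 * C_in * C_out."""
    return out_side ** 2 * kernel ** 2 * c_in * c_out

# The same 3x3, 64->128-channel convolution applied at full resolution ...
full_res = conv_complexity(out_side=512, kernel=3, c_in=64, c_out=128)
# ... and after the 8x spatial reduction performed by the feature extractor.
low_res = conv_complexity(out_side=64, kernel=3, c_in=64, c_out=128)
print(full_res / low_res)  # 64.0: cost scales with the square of the resolution
```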
Through the detection part shown in Figure 2, potential target areas are identified in advance. Subsequently, a 0–1 mask, where 1 corresponds to the targets and 0 to the background, can be used to filter the content to be compressed. The prerequisite for applying the mask to the extracted features is that the features and the image share a consistent spatial structure. Previous studies [39,40] have shown that low-level features preserve the spatial information of the original image. In this paper, we train the complete YOLOv5l [37] structure as the powerful detection model (Section 5.2) and use its first four layers as the feature extractor. We visualize the features produced by the feature extractor; some results are shown in Figure 3. It can be observed that the features retain the spatial information of the original image, i.e., the relative positions of the objects are consistent between the features and the original image. This allows the mask generated for the original image to be applied directly to the features.
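A consequence of this spatial consistency is that a mask defined on the image grid only needs to be resampled to the feature resolution before it is applied. The sketch below is a minimal illustration in PyTorch, assuming a full-resolution binary mask (in the proposed pipeline, the mask generator already outputs the mask at feature resolution):

```python
import torch
import torch.nn.functional as F

def apply_mask_to_feature(feature: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Zero out background positions of a low-level feature map.

    feature: (N, C, H/8, W/8) tensor from the feature extractor
    mask:    (N, 1, H, W) binary object mask defined on the original image
    Because the low-level feature preserves the spatial layout of the image,
    nearest-neighbour downsampling keeps the mask aligned with the feature.
    """
    mask_lr = F.interpolate(mask.float(), size=feature.shape[-2:], mode="nearest")
    return feature * mask_lr
```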
4. Proposed Method
Figure 4 shows the workflow of the proposed method. The encoding process is performed on the edge device. First, the feature extractor extracts features from the image, and the mask generator then outputs an object mask based on these features. The neighborhood convolution adds the regions near the objects to the mask. The encoder uses the mask to remove the background and compresses the feature into a latent representation. In the decoding process on the high-performance cloud server, a decoder recovers the feature from the latent representation, and a powerful detector finally outputs the detection results from the recovered feature.
4.1. The Feature Extractor, Mask Generator, and Powerful Detector
The feature extractor can be regarded as a module shared by the mask generator and the powerful detector; Figure 5 integrates it into the mask generator. After masking, the extracted features are compressed and transmitted. During feature extraction, the width and height of the input image are halved three times, while the channel dimension increases three times accordingly. Consequently, for an input of size $H \times W \times 3$ (height × width × channels), the final extracted feature has size $\frac{H}{8} \times \frac{W}{8} \times 256$.
To generate a masked feature, an intuitive approach is to use a lightweight detection network (e.g., YOLOv5n) to perform pre-detection on the extracted feature and keep all values inside the detected objects while discarding the others. A major problem with this approach, however, is that it must iterate through each bounding box to obtain coordinates, which significantly slows down encoding. To address this issue, we introduce the mask generator, which directly outputs a mask, as shown in Figure 5. Its structure is similar to that of detection models; the difference is that a detection model's neck outputs features at three different scales, intended for detecting large, medium, and small objects, respectively, whereas the mask generator only needs to output a single object mask with the same size as the feature produced by the feature extractor.
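For illustration, a much-simplified mask head is sketched below; the layer widths and the plain convolutional stack are our own placeholders (the actual mask generator follows a detection-style design as described above), but it conveys the key point: the mask is produced in a single forward pass, with no per-box iteration.

```python
import torch
import torch.nn as nn

class MaskGeneratorSketch(nn.Module):
    """Toy mask head: maps an extracted feature (N, 256, h, w) to a
    single-channel object mask (N, 1, h, w) in one forward pass."""

    def __init__(self, in_channels: int = 256, hidden: int = 64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, feature: torch.Tensor) -> torch.Tensor:
        # Sigmoid keeps values in [0, 1]; thresholding yields the 0-1 mask.
        return torch.sigmoid(self.head(feature))
```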
The labels for training the mask generator are obtained from the detection labels. Since the receptive field of the low-level features is small, the extracted feature retains a spatial structure similar to that of the original image, and an object's position in the original image aligns with its position in the feature. Assuming the height and width of the feature to be compressed are $h$ and $w$, the mask label should also be of size $h \times w$. As Figure 5 shows, we use the bounding-box coordinates from the detection labels to set the mask values inside objects to 1 and the rest to 0. The training loss is the MSE loss between the mask label and the generated mask, weighted toward the object regions:

$$L_{mask} = \frac{1}{hw}\left(\lambda \sum_{(i,j)\in \Omega_{obj}} \left(m_{ij} - \hat{m}_{ij}\right)^2 + \sum_{(i,j)\in \Omega_{bg}} \left(m_{ij} - \hat{m}_{ij}\right)^2\right),$$

where $h$ and $w$ are the height and width of the mask, $\lambda$ is the weight of the object region loss, $m$ is the mask label, $\hat{m}$ is the generated mask, $\Omega_{obj}$ denotes the object region, and $\Omega_{bg}$ denotes the background.
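A minimal sketch of label construction and this loss is given below; the helper names, the box format $(x_1, y_1, x_2, y_2)$ in image pixels, and the default weight value are our own illustrative choices, not values reported in the paper.

```python
import torch

def make_mask_label(boxes, h: int, w: int, img_h: int, img_w: int) -> torch.Tensor:
    """Rasterize detection boxes (x1, y1, x2, y2 in image pixels) into an
    h x w binary mask label aligned with the extracted feature."""
    label = torch.zeros(h, w)
    sy, sx = h / img_h, w / img_w
    for x1, y1, x2, y2 in boxes:
        r1, r2 = int(y1 * sy), max(int(y1 * sy) + 1, round(y2 * sy))
        c1, c2 = int(x1 * sx), max(int(x1 * sx) + 1, round(x2 * sx))
        label[r1:r2, c1:c2] = 1.0
    return label

def mask_loss(pred: torch.Tensor, label: torch.Tensor, obj_weight: float = 5.0) -> torch.Tensor:
    """MSE loss with errors inside object regions up-weighted, so sparse
    objects are not drowned out by the dominant background."""
    sq_err = (pred - label) ** 2
    weights = 1.0 + (obj_weight - 1.0) * (label > 0.5).float()
    return (weights * sq_err).mean()
```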
The powerful detector is employed in the decoding stage to perform the final detection. Since the decoding stage is deployed on high-computing cloud servers, high-accuracy large models can be used. In this paper, we adopt YOLOv5l as the powerful detector.
4.2. Neighborhood Convolution
The mask generator can identify object positions, yet object detection models require surrounding pixel information for more accurate predictions. Hence, selecting a suitable neighborhood range around objects is vital. Too large a range increases file size, whereas too small a range compromises accuracy.
We could calculate the distance of each pixel from the targets and set a threshold to define the neighborhood; however, iterating over every pixel drastically reduces the coding speed. Given that current machine learning frameworks [42,43] provide highly optimized convolution operations, we propose the neighborhood convolution to speed up the neighborhood determination. Details are illustrated in Figure 6. First, we define a convolutional kernel with all values set to 1. Then, we convolve this kernel over the generated mask (to keep the output size unchanged, the padding should be $\frac{K-1}{2}$ for a kernel of size $K$ and the stride should be 1). Each value in the output is the count of object pixels within the kernel window. Finally, we set all nonzero values in the output to 1 to obtain a new mask. By adjusting the kernel size, we can control the size of the areas retained in the final mask and thereby fine-tune the file size of the compressed representation.
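One straightforward way to implement this operation in PyTorch is sketched below (the function name is ours; the all-ones kernel, padding, and binarization follow the description above):

```python
import torch
import torch.nn.functional as F

def neighborhood_convolution(mask: torch.Tensor, kernel_size: int = 11) -> torch.Tensor:
    """Enlarge a binary object mask to include the objects' neighborhoods.

    mask: (N, 1, h, w) tensor with values in {0, 1}
    kernel_size: odd neighborhood size K; padding (K - 1) / 2 keeps h x w unchanged
    """
    kernel = torch.ones(1, 1, kernel_size, kernel_size, device=mask.device)
    # Each output value counts the object pixels inside the K x K window.
    counts = F.conv2d(mask.float(), kernel, stride=1, padding=(kernel_size - 1) // 2)
    # Any window that touches an object is kept in the new mask.
    return (counts > 0).float()
```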
Prior to feature compression, it is crucial to understand how the mask affects detection outcomes. To this end, we combine neighborhood convolution with a series of exploratory experiments on the VisDrone dataset, following these specific steps:
Table 2 indicates that when the kernel size of the neighborhood convolution is 11, the difference in mAP0.5 between the masked and unmasked settings is only 0.6%. Therefore, during feature compression, we can compress and transmit the masked features instead of the full features; correspondingly, during decoding, we only need to restore the masked features rather than the complete ones.
4.3. Feature Compression Model
The feature compression model performs the compression and decompression of the masked feature. It consists of three parts: an encoder, a decoder, and an entropy model, and its goal is to make the decompressed feature consistent with the input. Current learning-based image compression networks [11,13,14,26] achieve good compression performance on images. Since both the feature compression model and the image compression model essentially aim to restore their input, we modify the mean-scale hyperprior compression model [14] to make it suitable for feature compression. We do not use the context model here, as it would slow down decoding; the masking operation has already reduced the amount of information in the feature, so the compression performance is satisfactory even without it. Note that feature compression differs from image compression only in the input and output dimensions, so other learning-based image compression models can also be adapted for feature compression with a few modifications. In the ablation study in Section 5.4, we experiment with other compression models and demonstrate that the proposed framework integrates easily with current compression networks and enhances their compression performance.
The differences between our feature compression model and the original image compression models are shown in Table 3. The image encoder takes an input image of size $H \times W \times 3$ and outputs a latent representation of size $\frac{H}{16} \times \frac{W}{16} \times 192$, transforming the image into a more compact latent representation. To achieve the same effect, the feature encoder should also output a latent representation of size $\frac{H}{16} \times \frac{W}{16} \times 192$. After feature extraction, the feature size is $\frac{H}{8} \times \frac{W}{8} \times 256$, so the feature encoder needs to reduce the number of channels. We set the feature encoder's intermediate channel number to 224 and its output channel number to 192; the reduction in width and height occurs only in the first convolution. The feature decoder implements the inverse process of the encoder. Since image compression and feature compression share the same latent representation size, no changes to the entropy model are needed.
4.4. Computational Complexity Analysis
Based on Equation (4), we can estimate the ratio of the time complexities of the image encoder and the feature encoder in Table 3. Assuming the size of the input image is $H \times W \times 3$, we estimate the complexity of the image encoder as:

The feature size after extraction is $\frac{H}{8} \times \frac{W}{8} \times 256$, so the complexity of the feature encoder is:

Note that these two values are used only to estimate the ratio of time complexities between the feature encoder and the image encoder and do not represent the actual amount of computation. We conclude that the complexity of encoding images is nearly twice that of encoding features, which shows that compressing features directly can speed up the compression model. In Section 5.3, we compare the encoding speeds of different compression methods; the experimental results demonstrate that our feature compression method achieves faster coding than image compression models with comparable compression performance. Although the total number of elements increases from $H \times W \times 3$ for the input image to $\frac{H}{8} \times \frac{W}{8} \times 256$ for the extracted feature, after processing by the feature encoder the dimension is reduced to $\frac{H}{16} \times \frac{W}{16} \times 192$, matching the case where the input is an image.
6. Conclusions and Future Work
Differing from traditional image compression methods that discard high-frequency signals to which human eyes are insensitive, we propose a masked feature compression method. It utilizes a mask generator to produce an object mask and removes most of the background information that is irrelevant to detection tasks during compression. Compared to existing methods, our approach achieves superior detection performance across various compression levels. Experimental results indicate that it has outstanding real-time performance and low computational demands. Through an ablation study, we demonstrate that the proposed mask generator significantly speeds up the encoding process, and that the neighborhood convolution markedly improves the compression performance. Based on the above analysis, our method holds significant potential for cloud–edge collaborative scenarios. Lightweight edge devices such as drones and smart cameras can use the masked feature compression scheme to obtain compressed representations for transmission. Once these compressed representations are transmitted to the cloud (the decoding end) and decoded, they can be fed into the detection model for automatic detection. Compared to traditional compression methods, our method enhances the accuracy of the detection model on the compressed representations.
The RA curves on the VisDrone and COCO datasets show that the masked feature compression method proposed in this paper has a more pronounced advantage on the VisDrone dataset. This is because VisDrone consists mainly of small objects and is characterized by target sparsity, so the mask can filter out a large amount of background information. In contrast, the COCO dataset contains many large objects that occupy most of the image area, leaving limited background regions that can be discarded; as a result, the compression gain brought by the mask is limited. In recent years, masked image modeling has been widely applied in unsupervised learning. Its underlying idea is to mask certain regions of an image and reconstruct the entire image from the visible regions. Considering that not all parts of large objects contribute equally to the detection task (e.g., when the detection target is a person, facial features matter far more than the color pattern of the clothing), we can mask the regions that do not contain key features and reconstruct them at the decoding end from the visible parts. Through this approach, we aim to make the masked compression scheme applicable to images with a high proportion of large objects.