Article

MRACNN: Multi-Path Residual Asymmetric Convolution and Enhanced Local Attention Mechanism for Industrial Image Compression

1 Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266100, China
2 School of Mathematical Sciences, Ocean University of China, Qingdao 266100, China
* Author to whom correspondence should be addressed.
Symmetry 2024, 16(10), 1342; https://doi.org/10.3390/sym16101342
Submission received: 28 August 2024 / Revised: 29 September 2024 / Accepted: 8 October 2024 / Published: 10 October 2024
(This article belongs to the Special Issue Symmetry/Asymmetry in Neural Networks and Applications)

Abstract

The rich information and complex backgrounds of industrial images make high-ratio image compression a challenging task. Current learning-based image compression methods mostly use customized convolutional neural networks (CNNs), which struggle to cope with the complex production backgrounds of industrial images. Useful information is therefore lost among an abundance of irrelevant data, making it difficult to accurately extract important features during the feature extraction stage. To address this, a Multi-path Residual Asymmetric Convolutional Compression Network (MRACNN) is proposed. Firstly, a Multi-path Residual Asymmetric Convolution Block (MRACB) is introduced, comprising the Multi-path Residual Asymmetric Convolution Down-sampling Module, used in the encoder to down-sample and extract key features, and the Multi-path Residual Asymmetric Convolution Up-sampling Module, used in the decoder to up-sample, recover details, and reconstruct the image. This feature transfer and information flow enable better capture of image details and important information, thereby improving the quality and efficiency of image compression and decompression. Furthermore, a two-branch enhanced local attention mechanism and a channel-squeezing entropy model based on the Compressed Enhanced Local Attention Block are proposed to further improve compression performance. Extensive experimental evaluations demonstrate that the proposed method outperforms state-of-the-art techniques, achieves superior rate–distortion performance, and excels in preserving local details.

1. Introduction

In recent years, industrial image compression has been a significant research topic in the field of signal processing, aiming at achieving high-quality image compression and effective storage. Classical image compression techniques such as JPEG [1], JPEG2000 [2], WebP [3], HEVC [4] have achieved remarkable success. However, with the rapid development of industrialization, the requirements for the effectiveness and efficiency of image compression coding have increased. Traditional hybrid image encoders and decoders have limitations and are not suitable for all types of images. For example, transform quantization using image blocks can result in block effects. In addition, due to the limitation of network bandwidth, the traditional hybrid image encoder and decoder methods often lead to the loss of details when implementing low bit rate encoding, resulting in blur and distortion of the image.
With the development of deep learning techniques in computer vision, Ballé et al. [5] first proposed a learning-based image compression method built on a variational autoencoder (VAE). Compared to traditional lossy image compression techniques, this method exhibits superior rate–distortion (RD) performance [6], significantly outperforming other methods in metrics such as peak signal-to-noise ratio and the multi-scale structural similarity index [7]. Subsequent advancements introduced nonlinear transformation techniques such as convolution combined with generalized divisive normalization (GDN); quantization methods such as uniform noise approximation quantization (UNAQ) and soft-to-hard quantization [8]; and conditional probability models following the Bayesian generative rule, including the hyperprior [9], the hyperprior based on 2D PixelCNN [10], and the joint autoregressive context model [11]. These innovations have progressively improved performance. Besides the common MSE or MS-SSIM loss functions, other loss functions such as feature loss and adversarial loss have been employed to enhance image quality, particularly subjective quality [12,13,14].
Current deep learning image compression models perform well on natural scenes but face challenges when compressing industrial images, and thus cannot fully meet the demands of intelligent manufacturing. The primary issue is that industrial environment images often contain numerous details, complex structures, and various noises and disturbances such as lighting changes, dust, and vibrations. These factors can degrade image quality. Convolutional neural networks (CNNs) can eliminate redundant spatial information from global data to extract key features for image compression. During compression, images typically undergo multiple up-sampling and down-sampling operations to reduce size and remove redundant information. Traditional learning-based image compression models commonly use symmetric convolution for these sampling operations, employing a convolution kernel of the same size in both the horizontal and vertical directions. While effective at extracting image features, symmetric convolution kernels limit the model's ability to capture details in specific directions and may not efficiently capture feature differences in complex image structures. This limitation is particularly pronounced when dealing with the intricate structures and details of industrial images. To address this issue, we propose a Multi-path Residual Asymmetric Convolution Block (MRACB) for up-sampling and down-sampling in image compression models. The MRACB combines small-size convolution kernels with strip convolution kernels to better capture details and structures at different scales and orientations. This asymmetric convolution design allows the model to more effectively capture details and structures in industrial images, enhancing the performance of the compression model in complex industrial scenes.
Attention mechanisms play a crucial role in computer vision tasks, and numerous attention modules [15,16,17] have been proposed to enhance image compression. Studies have shown that non-local attention mechanisms can guide the adaptive processing of latent features, helping compression algorithms allocate more bits to complex regions (e.g., edges, textures) for improved rate–distortion (RD) performance. The Swin Transformer [18] leverages the attention mechanism to capture global dependencies, but global semantic information contributes less to image compression than to other computer vision tasks. In this context, we propose the Enhanced Local Attention Mechanism module (ELAM) to further enhance image compression performance. ELAM captures detailed information more efficiently by focusing on local features and combining them with asymmetric convolution blocks. This approach not only improves the compression rate but also maintains high image quality.
In addition to encoder and decoder design, optimizing the entropy model is crucial for image compression. Directly adding attention modules to the entropy model can enhance performance but introduces numerous parameters. Therefore, we propose an attention-based entropy model with channel squeezing, using the Compressed Enhanced Local Attention Block (CLAB) to achieve effective channel compression. We reduce the number of slices from 10 to 5 to minimize parameters while avoiding the latency caused by too many slices, thereby balancing running speed and RD performance.
Our main contributions are summarized below:
  • To address the challenge of accurately extracting crucial feature information in industrial images during the feature extraction stage, we propose a Multi-path Residual Asymmetric Convolution Block (MRACB) comprising two main modules. The Multi-path Residual Asymmetric Convolution Down-sampling (MRACD) module is used for encoder down-sampling operations and key feature extraction, using asymmetric convolution to focus on different directions and enhance detail extraction; it also incorporates residual learning to improve the feature extraction process. The Multi-path Residual Asymmetric Convolution Up-sampling (MRACU) module replaces the standard convolutional layer with a transposed convolution (TConv) layer for decoder up-sampling operations. This approach effectively reuses previously extracted high-level features, enhancing accuracy and detail recovery in image reconstruction. Furthermore, the block can be seamlessly integrated with existing image compression models.
  • To better capture the correlation between neighboring elements in industrial image spaces, we propose a flexible Enhanced Local Attention Mechanism (ELAM). This mechanism allocates higher bit rates to target regions, effectively improving the reconstruction quality of compressed image targets. Compared to other learning-based methods, our method produces sharper results in texture details.
  • Addressing the lack of attention modules tailored to the channel entropy model in current image compression networks, we design a parameter-efficient attention module, the Compressed Enhanced Local Attention Block (CLAB), for the entropy model. This module incorporates channel compression to acquire localized information, thereby better capturing specific image details. Extensive experimental evaluations demonstrate that our proposed method is highly effective, outperforming previous image compression methods.

2. Related Work

In recent years, image compression models based on deep learning have become a hot research area. Previously, most deep learning-based image compression models [5,9,11,15,19,20,21,22] utilized symmetric convolutional neural networks to extract features. It was shown that CNN-based image compression methods outperform traditional methods such as JPEG [1], JPEG2000 [2], WebP [3] and HEVC [4] in terms of performance metrics, highlighting their potential to enhance the efficiency and effectiveness of image compression. Despite the significant results achieved by image compression methods based on symmetric convolution, their limitations, such as the loss of details in Figure 1, have gradually become apparent.
With the deepening of research, more and more researchers have focused on asymmetric convolution techniques, exploring their unique roles in image processing. Asymmetric convolution enhances a model's flexibility and expressiveness by using different kernel sizes in the horizontal and vertical directions, which also reduces the number of parameters. Hu et al. [23] introduced asymmetric convolution to capture the relationships between different features, learning the weight of each pixel on RGB and depth features. Szegedy et al. [24] used convolution kernels of varying sizes to reduce the number of parameters in an image classification model while maintaining effective extraction of important features. Lo et al. [25] fused asymmetric convolutions and dense modules to accelerate model training in semantic segmentation tasks, achieving overall performance improvements despite some information loss. Tian et al. [26] utilized asymmetric convolution kernels, such as 3 × 1 and 1 × 3, in convolutional layers to capture both horizontal and vertical features in images, thereby enhancing the prominence of locally salient features. Tang et al. [27] combined graph attention mechanisms and asymmetric convolutional neural networks to improve image compression results, showcasing the potential of asymmetric convolution in this field.
Both asymmetric convolutions and the square convolutions (also known as cubic convolutions) used in most convolutional networks fall under the category of rectangular convolutions. Both achieve effective feature maps through local convolutions over the image; the difference lies in the shape of the kernels: asymmetric convolutions use rectangular kernels, whereas square convolutions use square kernels. Figure 2 illustrates the asymmetric convolution method. Networks built on asymmetric convolutions [23,24] have demonstrated their effectiveness. In deep neural networks, a convolutional layer slides an N × M kernel over the image with a defined stride, using padding to ensure complete coverage. Local convolutions in this way capture higher-level features, and stacking multi-path convolutional layers enhances this capability.
When N = M, the convolution kernel is square, extracting features equally in the horizontal and vertical directions. When N ≠ M, the kernel becomes asymmetric, capturing more global features in one direction and thus improving the receptive field and focus. Many methods introduce asymmetric convolution, using one-dimensional kernels such as (1 × N) and (N × 1) to better extract local key features; combining these with a square convolution integrates global and local information, reducing model parameters and accelerating training.
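To make this combination concrete, the following minimal PyTorch sketch pairs (1 × N) and (N × 1) kernels with an N × N kernel. The module name, channel widths, and the additive fusion are illustrative assumptions for the demonstration rather than details of any specific network discussed above.

```python
import torch
import torch.nn as nn

# Illustrative sketch: pairing one-dimensional (1 x n) and (n x 1) kernels with an
# n x n square kernel. Channel widths and additive fusion are assumptions for the demo.
class AsymmetricConvGroup(nn.Module):
    def __init__(self, channels: int, n: int = 3):
        super().__init__()
        pad = n // 2
        self.horizontal = nn.Conv2d(channels, channels, kernel_size=(1, n), padding=(0, pad))
        self.vertical = nn.Conv2d(channels, channels, kernel_size=(n, 1), padding=(pad, 0))
        self.square = nn.Conv2d(channels, channels, kernel_size=n, padding=pad)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The two strip convolutions emphasize horizontal/vertical detail; the square
        # convolution integrates them with local context.
        return self.horizontal(x) + self.vertical(x) + self.square(x)

x = torch.randn(1, 64, 32, 32)
print(AsymmetricConvGroup(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```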
The attention module is designed to mimic the biological process of observation by focusing attentional resources on key regions to gather more detailed information, helping the learning model prioritize important areas for finer details. Widely used in natural language processing tasks, attention mechanisms have also been introduced into the field of image compression. Non-local attention mechanisms [28] have proven beneficial in various visual tasks. Liu et al. [16] integrated non-local attention into the VAE structure for image compression, employing it to create implicit importance masks that guide the adaptive processing of latent features. This method captures local and global correlations, generating an attention mask that adapts bit allocation to reduce the proportion of unimportant pixels. Li et al. [21] and Mentzer et al. [29] used an importance-map approach, adaptively assigning information to the quantized latent features by giving more bits to texture regions and fewer bits to other regions, resulting in better visual quality at similar bit rates. Inspired by [30], a cascade of non-local modules and regular convolutional layers is used to generate an attention mask [31], simplifying the attention mechanism by removing the non-local blocks. Additionally, Zou et al. [17] improved image compression with a window-based attention module, while Koyuncu et al. [32] combined the spatial and channel dimensions for an efficient attention mechanism. These approaches further enhanced compression efficiency and image quality.

3. Methodology

In this section, we first describe the architecture of the Multi-path Residual Asymmetric Convolution and Enhanced Local Attention Mechanism-based Image Compression Model (MRACNN). Subsequently, we introduce the proposed Multi-path Residual Asymmetric Convolution Block (MRACB) and discuss its design principle and mechanism of action in detail. Next, we introduce the Enhanced Local Attention Mechanism (ELAM) and explore its application in networks and its effect on performance improvement. Finally, we introduce the squeezed channel entropy model and the CLAB module applied to it, and illustrate their significance for the optimization of image compression models.

3.1. The Proposed Model

This paper proposes an end-to-end image compression framework based on asymmetric convolution and introduces a hyperprior encoding and decoding module.
The hyperprior encoding and decoding module proposed in this paper includes the primary encoding and decoding model, the hyperprior encoding and decoding model, the quantization module, the arithmetic coding module, and the channel autoregressive entropy model. The hyperprior encoding and decoding model introduces hyperpriors as side information, enabling the capture of hierarchical structural information of the latent features and thereby establishing a more accurate entropy model. The framework of the hyperprior image autoencoder is shown in Figure 3.
In the image compression process, the input image is first passed through the primary encoder to obtain spatially varying standard deviation responses, which are the latent features derived from the encoder. These features are then quantized and encoded to produce the output, with entropy coding applied using the probability model of the quantized data. However, because the actual distribution is unknown, there is a discrepancy between the probability model and the actual distribution. To minimize this discrepancy, we introduce a new variable, which is then quantized, compressed, and transmitted as side information to achieve a more accurate probability model estimation.
During the image compression transformation, the encoder $g_a$ maps a given image $x$ to its latent features $y$. After passing through the quantizer $Q$, these latent features are represented discretely. The decoder $g_s$ then uses this discrete representation to generate the reconstructed image. The primary process is as follows:
$$y = g_a(x; \varphi), \qquad \hat{y} = Q(y)$$
In the formulas, $x$ represents the input raw image data and $\varphi$ represents the optimization parameters. $g_a$ is an encoder network designed to efficiently extract valuable feature information from the input image; it contains several key modules, such as MRACD and ELAM. These modules extract more representative features by progressively decreasing the spatial size of the input and increasing the number of channels. After this series of transformations, $y$ is the feature representation output by the encoder, which condenses the important information of the original image, so the performance of the encoder directly affects the quality of subsequent image reconstruction. $Q$ denotes quantization, defined as follows:
$$Q(x) = \begin{cases} x + \mathcal{U}\!\left(-\tfrac{1}{2}, \tfrac{1}{2}\right) & \text{training} \\ \mathrm{round}(x) & \text{inference} \end{cases}$$
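As a concrete reading of this definition, a minimal PyTorch sketch of the quantizer (ours, not the released implementation) is given below; the training/inference distinction it encodes is explained in the following paragraph.

```python
import torch

def quantize(y: torch.Tensor, training: bool) -> torch.Tensor:
    # Additive uniform noise U(-0.5, 0.5) as a differentiable proxy during training;
    # hard rounding at inference.
    if training:
        return y + torch.empty_like(y).uniform_(-0.5, 0.5)
    return torch.round(y)
```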
During the training phase, we add uniform noise $\mathcal{U}(-0.5, 0.5)$ as the quantization function. For the inference step, we employ a rounding function to quantize the latent representation. The quantized latent representation $\hat{y}$ is encoded into a bitstream using arithmetic entropy coding (AE). When the image needs to be reconstructed, arithmetic entropy decoding (AD) is used to decode the bitstream, and the decoding module reconstructs the compressed image. The decoder consists of a regular deconvolution layer, three MRACB up-sampling modules, and three ELAM modules. The reconstructed image $\hat{x}$ is obtained as follows:
$$\hat{x} = g_s(\hat{y}; \theta)$$
Here, $\theta$ represents the optimization parameters of the decoder. This approach introduces a new variable $z$ to capture the dependencies among the elements of $y$:
$$z = h_a(y; \varphi_h), \qquad \hat{z} = Q(z), \qquad p_{\hat{y}|\hat{z}}(\hat{y} \mid \hat{z}) \leftarrow h_s(\hat{z}; \theta_h)$$
$h_a$ and $h_s$ represent the analysis and synthesis transforms of the hyperprior autoencoder, where $\varphi_h$ and $\theta_h$ denote their optimization parameters. $p_{\hat{y}|\hat{z}}(\hat{y} \mid \hat{z})$ is the distribution estimated from $\hat{z}$. The entropy decoding result is then fed into the main decoder $g_s$ to reconstruct the source image.
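Putting the transforms together, the following sketch wires up the flow $x \to y \to \hat{y} \to \hat{x}$ with the hyperprior path $y \to z \to \hat{z}$. The internals of $g_a$, $g_s$, $h_a$, and $h_s$ (MRACB, ELAM, and the entropy model) are left as pluggable modules, so this is a structural outline under our own naming rather than the paper's code.

```python
import torch
import torch.nn as nn

class HyperpriorCodec(nn.Module):
    # Structural outline of Figure 3: main transforms g_a/g_s plus hyperprior transforms
    # h_a/h_s. Arithmetic coding of y_hat and z_hat is omitted.
    def __init__(self, g_a: nn.Module, g_s: nn.Module, h_a: nn.Module, h_s: nn.Module):
        super().__init__()
        self.g_a, self.g_s, self.h_a, self.h_s = g_a, g_s, h_a, h_s

    @staticmethod
    def _quantize(t: torch.Tensor, training: bool) -> torch.Tensor:
        return (t + torch.empty_like(t).uniform_(-0.5, 0.5)) if training else torch.round(t)

    def forward(self, x: torch.Tensor, training: bool = True):
        y = self.g_a(x)                        # latent features y
        z = self.h_a(y)                        # hyper-latent z (side information)
        y_hat = self._quantize(y, training)
        z_hat = self._quantize(z, training)
        entropy_params = self.h_s(z_hat)       # parameters of p(y_hat | z_hat)
        x_hat = self.g_s(y_hat)                # reconstructed image
        return x_hat, y_hat, z_hat, entropy_params
```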

3.2. Multi-Path Residual Asymmetric Convolution Block

This paper proposes a Multi-path Residual Asymmetric Convolution Block (MRACB) to capture long-range contextual information and enhance local key features at a low cost, thereby emphasizing image details. The structure of MRACB is shown in Figure 4. It includes both up-sampling and down-sampling modules. In the down-sampling module, (Conv, s = 2) uses a 3 × 3 convolution layer to capture local information, where the parameter ‘s’ adjusts the spatial size of the feature maps. These feature maps are then fed into the asymmetric block, which consists of three convolution layers. The (Conv, 1 × n) and (Conv, n × 1) are one-dimensional asymmetric convolutions with 1 × 3 and 3 × 1 kernels, respectively. The (Conv, n × n) is a 3 × 3 square convolution kernel. Residual learning is influenced by (Conv, 1 × n) and (Conv, n × 1), refining the extracted local key features and enhancing the image compression network’s ability to represent local details. Additionally, a Generalized Divisive Normalization (GDN) follows for normalization. The feature extraction process is defined as follows:
$$y_h = C(x, \theta_h), \qquad y_v = C(x, \theta_v), \qquad y_{3\times3} = C(x, \theta_{3\times3})$$
$$y_{acb} = [\,y_h,\; y_v,\; y_{3\times3}\,]$$
$$y = \frac{y_{acb} - \mu}{\sqrt{\sigma^2 + \varepsilon}}\,\gamma + \beta = \frac{C(x, \theta_h) + C(x, \theta_v) + C(x, \theta_{3\times3}) - \mu}{\sqrt{\sigma^2 + \varepsilon}}\,\gamma + \beta$$
Here, $x$ denotes the input feature map, $\mu$ is the mean of the feature map, $\sigma$ is its standard deviation, $\varepsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are the learnable parameters. $y_h$ denotes the feature map in the horizontal direction and $\theta_h$ the parameters of the horizontal convolution; $y_v$ denotes the feature map in the vertical direction and $\theta_v$ the parameters of the vertical convolution; $y_{3\times3}$ denotes the feature map of the 3 × 3 convolution and $\theta_{3\times3}$ its parameters.
Residual networks help mitigate the vanishing gradient problem, so we insert residual paths into the MRACB. As shown in Figure 4, to obtain the up-sampling counterpart, the 3 × 3 (Conv, s = 2) module in both the main path and the residual path of the Multi-path Residual Asymmetric Convolution Down-sampling (MRACD) module is replaced with a transposed convolution layer (TConv, s = 2), and the GDN layer is replaced with an IGDN layer. We refer to this new structure as the Multi-path Residual Asymmetric Convolution Up-sampling (MRACU) module.
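The sketch below shows one way to realize the MRACD module as we read Figure 4: a stride-2 3 × 3 convolution, the three-path asymmetric block, a residual path, and GDN. The GDN layer is taken from the third-party CompressAI library, and the exact channel widths and the additive fusion of the three paths are our assumptions; the paper's released configuration may differ.

```python
import torch
import torch.nn as nn
from compressai.layers import GDN  # third-party GDN/IGDN implementation (assumed dependency)

class MRACD(nn.Module):
    # Sketch of the down-sampling block: stride-2 conv, multi-path asymmetric convs,
    # residual path, and GDN normalization.
    def __init__(self, in_ch: int, out_ch: int, n: int = 3):
        super().__init__()
        pad = n // 2
        self.down = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.conv_1xn = nn.Conv2d(out_ch, out_ch, kernel_size=(1, n), padding=(0, pad))
        self.conv_nx1 = nn.Conv2d(out_ch, out_ch, kernel_size=(n, 1), padding=(pad, 0))
        self.conv_nxn = nn.Conv2d(out_ch, out_ch, kernel_size=n, padding=pad)
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)  # residual path
        self.gdn = GDN(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.down(x)
        f = self.conv_1xn(f) + self.conv_nx1(f) + self.conv_nxn(f)  # multi-path asymmetric block
        return self.gdn(f + self.skip(x))

# The MRACU counterpart would replace the stride-2 convolutions with
# nn.ConvTranspose2d(..., kernel_size=3, stride=2, padding=1, output_padding=1)
# and GDN(out_ch) with GDN(out_ch, inverse=True).
```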

3.3. Enhanced Local Attention Module

The Enhanced Local Attention Module (ELAM), as illustrated on the left side of Figure 5, consists of two branches. The first branch includes three residual convolution blocks, which generate the feature map. The second branch comprises the ELSA module, three residual convolution blocks, a 1 × 1 convolution layer, and a Sigmoid activation function, and is responsible for generating the attention mask. The key technique in the ELSA module is the Hadamard attention, where the Hadamard product (⊙) effectively produces attention while maintaining high-order mapping relationships. The Ghost head combines attention with a static matrix to increase channel capacity [33]. These two advantages allow the ELSA Block to better embed local details.
ELSA can be expressed as follows:
$$F_m = \sum_{n} G(H_{nm})\, V_n$$
$H_{nm}$ represents the Hadamard attention value, $G(\cdot)$ denotes the Ghost head mapping, $F_m$ is the output feature at pixel $m$, and $V_n$ is the value vector at pixel $n$. Hadamard attention can be expressed as follows, where $\odot$ denotes the Hadamard product, $q_m \odot k_m$ and $q_n \odot k_n$ are obtained by element-wise multiplication of the query and key feature maps, and $r^{k}_{nm}$ and $r^{q}_{nm}$ are equivalently implemented using 1 × 1 convolution filters.
$$H_{nm} = \mathrm{Softmax}_n\!\left(q_m \odot k_m \cdot r^{k}_{nm} + r^{q}_{nm} \cdot q_n \odot k_n + r^{b}_{nm}\right)$$
Here, $R(\cdot)$ denotes the operations of the mask branch: the output features of the ELSA module pass through three cascaded residual convolution blocks and a 1 × 1 convolution, where RB denotes a single residual convolution block and Conv denotes the 1 × 1 convolution that enhances inter-channel information exchange. $\odot$ denotes element-wise multiplication between corresponding pixels of two feature maps. A Sigmoid activation function is applied to generate an importance mask $M$, enabling the learned model to focus more on complex regions and thus improving encoding performance. $M$ takes real values between 0 and 1; multiplying $M$ element-wise with the feature map from the first branch yields $Y$. A residual structure is then used to accelerate convergence during training.
$$R(X) = \mathrm{Conv}\!\left(\mathrm{RB}^3(F_m)\right), \qquad M = \mathrm{Sigmoid}\!\left(R(F_m)\right), \qquad Y = M \odot \mathrm{RB}^3(F_m)$$
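The following sketch illustrates how we read the two branches of Figure 5 and the equations above. The ELSA block (Hadamard attention plus ghost head) is abstracted as a pluggable module, and the residual-block layout is a plain two-convolution design of our own choosing, so this is an interpretation rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # Plain residual convolution block; the paper's exact block layout is not specified here.
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)

class ELAM(nn.Module):
    # Feature branch: RB^3. Mask branch: ELSA -> RB^3 -> 1x1 conv -> Sigmoid.
    # Output: element-wise product of mask and features, wrapped in a residual connection.
    def __init__(self, ch: int, elsa: nn.Module):
        super().__init__()
        self.feature_branch = nn.Sequential(*[ResBlock(ch) for _ in range(3)])
        self.elsa = elsa
        self.mask_branch = nn.Sequential(*[ResBlock(ch) for _ in range(3)],
                                         nn.Conv2d(ch, ch, kernel_size=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.feature_branch(x)                          # RB^3(F_m)
        mask = torch.sigmoid(self.mask_branch(self.elsa(x)))   # M = Sigmoid(R(F_m))
        return x + mask * feat                                 # residual structure around M * RB^3(F_m)
```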

3.4. Channel-Aware Squeezing Entropy Model

In the encoder–decoder process, the model must estimate the parameters of a spatial autoregressive (AR) model, which can increase decoding time. The channel modulation model is initially built on [9], where the hyperprior model typically employs a conditional Gaussian model parameterized by variance and mean. The most effective models combine information from the hyperprior (forward adaptation) with the spatial autoregressive model (backward adaptation) before predicting entropy parameters μ and σ [31,34]. Therefore, it is crucial to balance the accuracy of the hyperprior information and the spatial autoregressive model parameters to ensure the model’s efficiency and precision. Inspired by [35,36], in Figure 6, we propose a channel-aware squeezing entropy model.
The overall architecture processes the latent representation y generated by the main encoder. On one hand, y is fed into the hyperprior module for entropy estimation; on the other hand, channel-level separation is performed, dividing the latent representation into several roughly equal slices along the channel dimension and estimating the mean and variance of each slice separately through the autoregressive module, also taking into account the mean and variance predicted for the previously decoded slices. Subsequently, the various components of the autoregressive bitstream are integrated through the network and fed into the decoder for decoding. Furthermore, to enhance the parallelism of the framework, the entropy parameters of each slice are adjusted based on previously decoded slices. Meanwhile, research on the channel squeezing entropy model [37] has demonstrated its effectiveness in image compression. We design an attention module (CLAB) for the entropy model, shown in Figure 7, which captures local information to better grasp specific details of the image.
$$\hat{x} = g_s(\hat{y}; \theta)$$
where $\hat{y}$ represents the discrete representation of the latent features, $g_s$ is the decoder, and $\theta$ denotes the optimization parameters.
Since the number of feature channels fed to the network of slice $y_i$ increases with the slice index $i$, the input channel count $C_{y_i}$ can be expressed as:
$$C_{y_i} = i \times (M \;//\; a) + M$$
$M$ represents the number of channels of the latent variables, and $a$ denotes the total number of slices. Referring to the settings in [38], we fix the total number of slices to 5. Following the setup in [35], we compress the input channels of all slices to 128, setting the output channels of the first 1 × 1 convolutional layer to 128. Finally, we use an un-squeeze operation to expand the output channels back to the original number, setting the output channels of the last 1 × 1 convolutional layer to $i \times (M \;//\; a) + M$. Additionally, as mentioned in [39], the hyperprior path of an image compression network contains a large number of redundant parameters, so channel compression can significantly reduce parameters without affecting performance; the main path, however, is sensitive to its parameter count, so the same operation cannot be applied there. Furthermore, the CLAB extracts not only local features but also attention maps.
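To make the slice arithmetic concrete, the sketch below computes the per-slice input channel count for M = 320 and five slices, and wraps a per-slice network in the 1 × 1 squeeze/un-squeeze convolutions described above. The helper names and the way the inner network (e.g., CLAB plus parameter prediction) is plugged in are our own; the paper's exact wiring may differ.

```python
import torch.nn as nn

M = 320          # channels of the latent y (Section 4.1)
NUM_SLICES = 5   # total number of slices a
SQUEEZED = 128   # squeezed channel width inside each slice network

def slice_input_channels(i: int, m: int = M, a: int = NUM_SLICES) -> int:
    # Channels fed to slice i: the i previously decoded slices (M // a channels each)
    # plus the M-channel conditioning features.
    return i * (m // a) + m

def make_slice_branch(i: int, inner: nn.Module) -> nn.Module:
    # Squeeze to 128 channels, apply the inner block (e.g., CLAB + parameter prediction),
    # then un-squeeze back to the original width.
    c_in = slice_input_channels(i)
    return nn.Sequential(
        nn.Conv2d(c_in, SQUEEZED, kernel_size=1),
        inner,
        nn.Conv2d(SQUEEZED, c_in, kernel_size=1),
    )

print([slice_input_channels(i) for i in range(NUM_SLICES)])  # [320, 384, 448, 512, 576]
```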

3.5. Loss Function

Distortion between the source image and the reconstructed image is utilized to construct a loss function for end-to-end global optimization.
$$L = \lambda \cdot D + R$$
The equation comprises two components, where D represents the distortion between the reconstructed image and the original image. It can be defined as:
$$D = \frac{1}{CHW} \sum_{c=0}^{C} \sum_{i=0}^{H} \sum_{j=0}^{W} \left(x_{c,i,j} - \hat{x}_{c,i,j}\right)^2$$
Here, $C$, $H$, and $W$ represent the channels, height, and width of the image, respectively.
$R$ represents the compression rate of the overall framework, and the coefficient $\lambda$ controls the trade-off between rate and distortion. In this work, the bit rate is composed of the bitstreams of $\hat{y}$ and $\hat{z}$, which can be defined as:
$$R = R(\hat{z}) + R(\hat{y})$$
where
$$R(\hat{z}) = \mathbb{E}\left[-\log_2 p_{\hat{z}|\psi}(\hat{z} \mid \psi)\right], \qquad R(\hat{y}) = \mathbb{E}\left[-\log_2 p_{\hat{y}|\hat{z}}(\hat{y} \mid \hat{z})\right]$$
In the formulas, $p_{\hat{z}|\psi}(\hat{z} \mid \psi)$ and $p_{\hat{y}|\hat{z}}(\hat{y} \mid \hat{z})$ are the probability density models of $\hat{z}$ and $\hat{y}$, respectively.
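A compact sketch of this objective is given below. It assumes the entropy model returns per-element likelihoods for $\hat{y}$ and $\hat{z}$ (as CompressAI-style models do) and normalizes the rate by the number of pixels; both conventions are ours rather than details stated in the paper.

```python
import torch

def rd_loss(x, x_hat, y_likelihoods, z_likelihoods, lam: float = 0.0067):
    # L = lambda * D + R, with D the MSE over all channels and pixels and R the sum of
    # the estimated bit rates of y_hat and z_hat in bits per pixel. The default lambda
    # is one of the MSE training values listed in Section 4.1.
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    distortion = torch.mean((x - x_hat) ** 2)
    rate_y = -torch.log2(y_likelihoods).sum() / num_pixels
    rate_z = -torch.log2(z_likelihoods).sum() / num_pixels
    return lam * distortion + rate_y + rate_z
```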

4. Experimental Results and Analysis

4.1. Training Details

In our experiments, we used an Intel (R) Xeon (R) Platinum 8124M CPU and an NVIDIA GeForce RTX 3090Ti GPU, with CUDA 12.1, PyTorch 1.10.0, and Python 3.8. To ensure fair comparisons, all learning-based compression models used in this study were trained with the same training strategy.
In this section, we compare our method with various learning-based approaches and traditional compression standards using the Kodak dataset [40] and the CLIC-PRO dataset [41]. The Kodak dataset consists of 24 color images of different subjects, each with a resolution of 768 × 512 pixels. The CLIC-PRO dataset includes a collection of images with diverse characteristics and challenges, such as high dynamic range (HDR) images, images taken under low-light conditions, and images with complex textures and structures. These two datasets are frequently used to evaluate the performance and quality of image compression algorithms.
For training, we randomly selected 310K images from the OpenImages [42] dataset, randomly cropping them to 256 × 256 during training. The batch size for all model training was set to 4, the initial learning rate was set to 0.00001, and the Adam [43] optimizer was used to train the network. Our models were optimized under two quality metrics (MSE and MS-SSIM). When optimizing with MSE, the λ values were {0.0018, 0.0035, 0.0067, 0.0130, 0.025, 0.05}; when optimizing with MS-SSIM, the λ values were {3, 5, 8, 16, 32, 60}.
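For reference, the training configuration above can be summarized as follows; `model` is a placeholder for the MRACNN network, which is not reproduced here.

```python
import torch

LAMBDAS_MSE = [0.0018, 0.0035, 0.0067, 0.0130, 0.025, 0.05]   # lambda values for MSE training
LAMBDAS_MS_SSIM = [3, 5, 8, 16, 32, 60]                        # lambda values for MS-SSIM training
BATCH_SIZE = 4
CROP_SIZE = 256            # random crop applied to the OpenImages training patches
LEARNING_RATE = 1e-5       # initial learning rate

def make_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    return torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
```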
For the MRACU and MRACD modules in our primary encoder $g_a$ and decoder $g_s$, the kernel size was set to 3. For the ELAM modules in the primary encoder $g_a$ and decoder $g_s$, the number of heads in the main path was set to 8 and the kernel size was set to 7. For the remaining up-sampling and down-sampling convolution modules, we set the convolution kernel to 3, and we did the same for the convolution modules in the hyperprior encoder and decoder. The ELAT modules were configured similarly. The channel count $M$ of the latent feature $y$ was set to 320, and the channel count of $z$ was set to 192. Other hyperparameters in the entropy model followed the settings in [38].

4.2. Evaluation

In this paper, we used the following evaluation metrics: bpp, PSNR, SSIM, and MS-SSIM. The bitrate (bits per pixel, bpp) represents the average bitstream resource consumption and is calculated as follows:
$$bpp = \frac{\mathrm{Bitstream\ Size}}{\mathrm{Image\ Height} \times \mathrm{Image\ Width}}$$
Image quality degradation reflects the performance loss caused by compression algorithms at a given bitrate. Therefore, a reasonable image quality evaluation standard is crucial for assessing the effectiveness of compression algorithms.
The formula for calculating P S N R is as follows:
$$PSNR = 10 \log_{10} \frac{MAX^2}{MSE}$$
MAX represents the maximum value of the image pixels, and MSE (mean squared error) is the mean squared difference between pixel values of the reconstructed image and the original image. A higher PSNR indicates smaller errors between the reconstructed and original images and thus better image quality. PSNR is currently one of the most widely used standards for evaluating image quality.
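Both quantities can be computed directly; the sketch below assumes the bitstream size is given in bits and that images are 8-bit (MAX = 255).

```python
import numpy as np

def bpp(bitstream_bits: int, height: int, width: int) -> float:
    # Average number of bits spent per pixel of the original image.
    return bitstream_bits / (height * width)

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```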
The SSIM [44] (Structural Similarity Index Measure) is a metric used to measure the similarity between two images. It is a perception-based model that treats changes in perceived structural information as image degradation, typically measured over a local window. It comprehensively considers luminance, contrast, and structural similarity, computed from the means, standard deviations, and covariance of pixel values within a window. The specific calculation is as follows:
$$SSIM(x, y) = [l(x, y)]^{\alpha}\,[c(x, y)]^{\beta}\,[s(x, y)]^{\gamma}$$
where
$$l(x, y) = \frac{2\mu_x \mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1}, \qquad c(x, y) = \frac{2\sigma_x \sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2}, \qquad s(x, y) = \frac{\sigma_{xy} + c_3}{\sigma_x \sigma_y + c_3}$$
The exponents $\alpha$, $\beta$, and $\gamma$ weight the luminance, contrast, and structure comparisons between the windows $x$ and $y$; $\mu$ and $\sigma$ represent the mean and standard deviation, respectively, $\sigma_{xy}$ represents the covariance, and $c_1$, $c_2$, $c_3$ are constants. In practice, $\alpha = \beta = \gamma = 1$ and $c_3 = c_2/2$ are typically set, simplifying the formula to:
$$SSIM(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
The window is moved pixel by pixel each time, and the local structural similarity index is computed at each position until all positions are covered. The average of these indices represents the structural similarity between the two images.
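A per-window sketch of the simplified formula is shown below. The constants $c_1 = (0.01 \cdot MAX)^2$ and $c_2 = (0.03 \cdot MAX)^2$ are the conventional choices from the SSIM literature rather than values specified in this paper, and the full metric averages this quantity over sliding windows.

```python
import numpy as np

def ssim_window(x: np.ndarray, y: np.ndarray, max_val: float = 255.0) -> float:
    # Simplified SSIM over a single window (alpha = beta = gamma = 1, c3 = c2 / 2).
    c1 = (0.01 * max_val) ** 2
    c2 = (0.03 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```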
MS-SSIM (Multi-Scale SSIM) is an improvement over SSIM. It down-samples the images by factors of $2^{M-1}$ and computes the structural similarity at multiple scales, where $M = 1$ corresponds to the original image size.
$$MS\text{-}SSIM(x, y) = [l_M(x, y)]^{\alpha_M} \prod_{j=1}^{M} [c_j(x, y)]^{\beta_j}\,[s_j(x, y)]^{\gamma_j}$$
The luminance comparison is performed only at scale $M$.

4.3. Rate–Distortion Performance

We validated the rate–distortion performance of MRACNN against traditional image compression methods and other CNN- and transformer-based compression methods on different datasets. In Figure 8 and Figure 9, we present the performance of our model on the Kodak dataset and the CLIC PRO dataset, measured with Multi-Scale Structural Similarity (MS-SSIM) and Peak Signal-to-Noise Ratio (PSNR). To facilitate comparison with other methods, we converted MS-SSIM to $-10\log_{10}(1 - \text{MS-SSIM})$ for clearer comparisons. From the figures, it can be observed that the rate–distortion performance of deep learning-based image compression methods is slightly better than that of traditional image compression methods. Moreover, our proposed Multi-Path Residual Asymmetric Convolutional Compression Network (MRACNN) outperforms all other methods, including deep learning-based and traditional classical image compression methods. Furthermore, our method achieved state-of-the-art compression performance in terms of PSNR and MS-SSIM, validating the effectiveness of our proposed approach. The multi-path residual asymmetric convolution feature module better extracts important feature information from images, while the dual-branch enhanced local attention mechanism focuses more on high-contrast areas, allocating more bits to them. Additionally, the comparison results on the CLIC PRO validation dataset demonstrate the robustness of our model, which maintains excellent performance even in complex scenarios.

4.4. Visualization

In image compression tasks, feature maps should focus more on regions of the image that contain important details and structures, as these areas are crucial for preserving image quality and key features during reconstruction. In Figure 10, we present the average feature maps of the latent feature y produced by different models from down-sampled industrial production scene images. It is evident that our model directs more energy towards regions where workers interact with equipment, capturing crucial information such as details of production machinery, workers' actions, and significant elements related to the production process. This effectiveness is attributed to our multi-path residual asymmetric convolution block, which is designed to focus on regions with important details and structures during the encoder's down-sampling operations. The multi-path residual asymmetric convolution block further refines the extraction and processing of detailed information in the image by using convolution kernels of varying sizes in different directions. Compared to traditional symmetric convolution methods, this asymmetric convolution approach is more effective at capturing directional features, especially those critical in industrial production scenarios. By ensuring that these important regions are better preserved, we prevent the loss of key features during image reconstruction. The enhanced local attention mechanism further optimizes bit allocation. This mechanism mimics the human visual system by focusing more resources on critical areas of the image, allowing the encoder to allocate more bits to these regions while reducing bit allocation in less significant areas. As a result, the quality of the compressed image is significantly improved, with clearer visual details and better preservation of essential information from the original image, leading to a more complete reconstruction in terms of detail and structure.
Figure 11 illustrates the reconstructed images of industrial production scenes under similar bit rates, evaluated using the PSNR distortion metric. The first image shows the original image, followed by results from various traditional and other compression methods. It is evident that our proposed method achieves the best performance at the lowest bit rate compared to the other reconstructed images of industrial production scenes. For instance, our method provides better visual quality than traditional image compression methods: it improves upon the traditional JPEG algorithm by 10.472 dB, the JPEG2000 algorithm by 4.101 dB, and the BPG algorithm by 1.411 dB, all at similar bits per pixel (bpp). Additionally, compared to the baseline model WACNN, our method achieves superior PSNR and MS-SSIM at lower bpp values. It is noteworthy that previous learned compression methods perform less effectively, whereas our approach outperforms them. From the locally focused images, it is clear that our MRACNN model retains more details and achieves more pleasing visual quality than other learned methods and traditional codecs. For example, noticeable block artifacts from the JPEG algorithm are evident in the circuit board texture. The enlarged images reveal that the optimized model maintains clearer texture information around the disassembled parts and shows more distinct white grooves on the disassembled objects. Furthermore, the traditional JPEG2000 algorithm exhibits color shifts on the circuit board, whereas our method preserves the structure of the circuit board more accurately than other algorithms. The enlarged images on the right also demonstrate a noticeable improvement in quality.

4.5. Time Analysis

Next, we tested the runtime of our proposed method on the CPU and compared it with the execution times of traditional image compression methods such as VTM, AV1, WebP, BPG, JPEG, and JPEG2000. We then report the runtime of our method on the GPU and compare it with other deep learning-based image compression methods. The comparison in Table 1 reveals that, due to the high complexity of CNN models, our method runs slower on the CPU than some traditional codecs. However, it performs better than VVC, AV1, and JPEG2000.
With GPU support, the model proposed in this paper achieves faster encoding and decoding speeds, as demonstrated in Table 2, when compared with several learning-based image compression models. Despite slightly lagging behind SwinTChARM in both encoding and decoding speeds, our method still outperforms other deep learning-based image compression methods. The MRACB module incurs minimal computational overhead while yielding performance gains. Furthermore, our method exhibits excellent scalability and flexibility, making it suitable for large-scale image compression scenarios. Although most deep learning-based image compression methods show shorter encoding times than traditional algorithms with GPU acceleration, the strictly sequential nature of the decoder results in significantly longer decoding times compared to traditional algorithms.

4.6. Ablation Study

(1)
Contributions of MRACB, ELAM, and CLAB to MRACNN: To demonstrate that our proposed MRACB, ELAM, and CLAB modules significantly enhance image compression performance and improve RD (rate–distortion) performance, we conducted ablation experiments on the Kodak dataset. The experimental results are presented in Table 3.
In this context, Baseline + MRACB refers to replacing only the convolution and GDN modules in the codec with our multi-path residual asymmetric convolution for up-sampling and down-sampling. For a fair comparison, Baseline + ELAM denotes substituting the WAM module in the Baseline with the ELAM module. Lastly, Baseline + CLAB retains the WACNN framework overall but replaces only the entropy model with CLAB.
From the table, it is evident that our MRACB module achieves the most significant RD improvements, with a 0.16 dB increase in PSNR compared to the baseline. The ELAM and CLAB modules each contribute an improvement of around 0.01 dB. Notably, as shown in Table 2, the time analysis of different methods indicates that our MRACNN only adds an additional 0.02 s of time cost compared to previous models, remaining within the same order of magnitude. This demonstrates that the MRACB, ELAM, and CLAB modules effectively enhance model performance while maintaining a low time cost. The introduction of these modules offers new insights and methods for the field of image compression and is expected to see broader application in the future.
(2)
Hyperparameter experiments:
As shown in Figure 12, we conducted ablation experiments on the MRACB to quantify the role of its asymmetric convolution in the structure. First, we tested the MRACB without the three GDN layers (green triangle). Additionally, we evaluated the performance of the MRACB module without using asymmetric convolution (red rectangle). Finally, we assessed the case where GDN layers were replaced with BN layers. The results indicated that both the introduction of the three GDN layers and the asymmetric convolution significantly enhanced the model’s performance, as evidenced by reduced bit rates and improved PSNR. Specifically, the three GDN layers effectively lowered the bit rate, while the asymmetric convolution significantly increased the PSNR. This improvement is attributed to the flexibility and adaptability of the asymmetric convolution in feature extraction, allowing the model to better capture image details and thereby optimize reconstruction quality.
To investigate the impact of different convolution kernel sizes on the performance of the MRACB module, we conducted multiple sets of experiments using various kernel dimensions. Figure 13 indicates that as the kernel size increases, both the bpp and PSNR values decrease correspondingly. This suggests that larger convolution kernels may lead to some loss of information, thereby improving the compression ratio. Overall, the results show that compression performance is optimal when the kernel size is 3. This outcome can be attributed to the ability of smaller kernels to capture detailed features in the images more effectively, thus maintaining higher image quality during compression. Furthermore, k = 3 strikes a good balance between computational efficiency and model complexity. Therefore, selecting an appropriate convolution kernel size is crucial for enhancing the performance of the MRACB module.
We also investigated the impact of different kernel sizes and numbers of heads on the performance of the ELAM module, selecting BD-rate as the evaluation metric and using Minnen2020 + WAM as the baseline. We tested kernel sizes of [3, 5, 7] and numbers of heads of [4, 8, 16], recording the performance of each combination; the results are summarized in Table 4. When the kernel size was 7 and the number of heads was 8, the ELAM module showed particularly outstanding performance, achieving a 1.67% reduction in bit rate. Notably, while increasing the number of heads may intuitively seem beneficial, the results indicate that a larger number of heads does not always lead to improvement and can instead incur additional computational cost. The larger receptive field allows the ELAM to better capture long-range dependencies in images during training, enhancing rate–distortion performance, especially when handling high-resolution images, and effectively reducing information loss. Therefore, in our experiments, the kernel size for both ELAM and CLAB in MRACNN is set to 7, as the larger receptive field yields better rate–distortion performance.
(3)
Impact of Each Module on Industrial Image Compression: Additionally, we further discuss the impact of each module on the compression and reconstruction of industrial workshop images. Figure 14 provides a comparison of the multi-path residual asymmetric convolution module (MRACB), ELAM, CLAB, and other modules in industrial image compression reconstruction. Compared to the traditional WACNN method, MRACB demonstrates superior compression performance and better image quality in industrial production image reconstruction. Specifically, MRACB achieves a 0.429 dB improvement in high-resolution industrial images at lower compression ratios. MRACB is particularly effective in preserving key details and structures, reducing information loss during compression, and thus providing clearer and more accurate visual results during image reconstruction. Notably, methods using ELAM and CLAB also retain more details at lower bits per pixel (bpp).

5. Conclusions

This paper introduces an end-to-end image compression architecture (MRACNN) by incorporating the multi-path residual asymmetric convolution block (MRACB), an enhanced local attention mechanism (ELAM) and a channel-aware squeezing entropy model. This architecture demonstrates exceptional performance in deep convolutional neural networks. Compared to traditional context-adaptive models, MRACNN not only offers significant advantages in compression quality but also minimizes the need for serial processing. Experimental results confirm that integrating the multi-path residual asymmetric convolution module and the enhanced local attention mechanism into the network structure can improve rate–distortion performance to a certain extent. In future research, we plan to explore additional factors that affect local detail reconstruction in image compression, such as entropy parameter modules, to better optimize rate–distortion performance. This research introduces new perspectives and methods to the field of industrial image compression, providing strong support for enhancing compression effectiveness and maintaining image quality. We believe that with further research, MRACNN and its related technologies will find broad applications and make substantial contributions to the development of image compression algorithms.

Author Contributions

Conceptualization, P.L.; methodology, Z.Y. and X.W.; resources, X.W. and P.L.; data curation, H.G. and X.M.; investigation, X.M.; validation, H.G., X.M. and X.H.; writing—original draft preparation, Z.Y.; writing—review and editing, P.L., X.W. and Z.Y.; and supervision, P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China, grant number 2022YFB3305300. The funder had the following involvement with the study: data curation and validation.

Data Availability Statement

Data are available in a publicly accessible repository. The data presented in this study are openly available in [40,41,42].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wallace, G.K. The JPEG still picture compression standard. Commun. ACM 1991, 34, 30–44. [Google Scholar] [CrossRef]
  2. Taubman, D.S.; Marcellin, M.W. JPEG2000: Image Compression Fundamentals, Standards and Practice; Springer Science+Business Media: New York, NY, USA, 2002; Volume 11, pp. 286–287. [Google Scholar]
  3. Ginesu, G.; Pintus, M.; Giusto, D.D. Objective assessment of the WebP image coding algorithm. Signal Process. Image Commun. 2012, 27, 867–874. [Google Scholar] [CrossRef]
  4. Sullivan, G.J.; Ohm, J.R.; Han, W.J.; Wiegand, T. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668. [Google Scholar] [CrossRef]
  5. Ballé, J.; Laparra, V.; Simoncelli, E.P. End-to-end optimized image compression. arXiv 2016, arXiv:1611.01704. [Google Scholar]
  6. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 2001, 5, 3–55. [Google Scholar] [CrossRef]
  7. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; pp. 1398–1402. [Google Scholar]
  8. Mentzer, F.; Agustsson, E.; Tschannen, M.; Timofte, R.; Van Gool, L. Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4394–4402. [Google Scholar]
  9. Ballé, J.; Minnen, D.; Singh, S.; Hwang, S.J.; Johnston, N. Variational image compression with a scale hyperprior. arXiv 2018, arXiv:1802.01436. [Google Scholar]
  10. Van Den Oord, A.; Kalchbrenner, N.; Kavukcuoglu, K. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1747–1756. [Google Scholar]
  11. Minnen, D.; Ballé, J.; Toderici, G.D. Joint autoregressive and hierarchical priors for learned image compression. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
  12. Liu, H.; Chen, T.; Shen, Q.; Yue, T.; Ma, Z. Deep image compression via end-to-end learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2575–2578. [Google Scholar]
  13. Agustsson, E.; Tschannen, M.; Mentzer, F.; Timofte, R.; Gool, L.V. Generative adversarial networks for extreme learned image compression. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 221–231. [Google Scholar]
  14. Huang, C.; Liu, H.; Chen, T.; Shen, Q.; Ma, Z. Extreme image coding via multiscale autoencoders with generative adversarial optimization. In Proceedings of the 2019 IEEE Visual Communications and Image Processing (VCIP), Sydney, NSW, Australia, 1–4 December 2019; pp. 1–4. [Google Scholar]
  15. Cheng, Z.; Sun, H.; Takeuchi, M.; Katto, J. Learned image compression with discretized gaussian mixture likelihoods and attention modules. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 7939–7948. [Google Scholar]
  16. Liu, H.; Chen, T.; Guo, P.; Shen, Q.; Cao, X.; Wang, Y.; Ma, Z. Non-local attention optimized deep image compression. arXiv 2019, arXiv:1904.09757. [Google Scholar]
  17. Zou, R.; Song, C.; Zhang, Z. The devil is in the details: Window-based attention for image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17492–17501. [Google Scholar]
  18. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  19. Johnston, N.; Vincent, D.; Minnen, D.; Covell, M.; Singh, S.; Chinen, T.; Hwang, S.J.; Shor, J.; Toderici, G. Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4385–4393. [Google Scholar]
  20. Lin, C.; Yao, J.; Chen, F.; Wang, L. A spatial rnn codec for end-to-end image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 13269–13277. [Google Scholar]
  21. Li, M.; Zuo, W.; Gu, S.; Zhao, D.; Zhang, D. Learning convolutional networks for content-weighted image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3214–3223. [Google Scholar]
  22. Dong, C.; Deng, Y.; Loy, C.C.; Tang, X. Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 576–584. [Google Scholar]
  23. Hu, X.; Yang, K.; Fei, L.; Wang, K. Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1440–1444. [Google Scholar]
  24. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2818–2826. [Google Scholar]
  25. Lo, S.-Y.; Hang, H.-M.; Chan, S.-W.; Lin, J.-J. Efficient dense modules of asymmetric convolution for real-time semantic segmentation. In Proceedings of the 1st ACM International Conference on Multimedia in Asia, New York, NY, USA, 15–18 December 2019; pp. 1–6. [Google Scholar]
  26. Tian, C.; Xu, Y.; Zuo, W.; Lin, C.W.; Zhang, D. Asymmetric CNN for image superresolution. IEEE Trans. Syst. Man Cybern. Syst. 2021, 52, 3718–3730. [Google Scholar] [CrossRef]
  27. Tang, Z.; Wang, H.; Yi, X.; Zhang, Y.; Kwong, S.; Kuo, C.C.J. Joint graph attention and asymmetric convolutional neural network for deep image compression. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 421–433. [Google Scholar] [CrossRef]
  28. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  29. Zhang, Y.; Li, K.; Li, K.; Zhong, B.; Fu, Y. Residual non-local attention networks for image restoration. arXiv 2019, arXiv:1903.10082. [Google Scholar]
  30. Zhou, J.; Wang, P.; Wang, F.; Liu, Q.; Li, H.; Jin, R. Elsa: Enhanced local self-attention for vision transformer. arXiv 2021, arXiv:2112.12786. [Google Scholar]
  31. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  32. Koyuncu, A.B.; Jia, P.; Boev, A.; Alshina, E.; Steinbach, E. Efficient contextformer: Spatio-channel window attention for fast context modeling in learned image compression. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7498–7511. [Google Scholar] [CrossRef]
  33. Lee, J.; Cho, S.; Beack, S.K. Context-adaptive entropy model for end-to-end optimized image compression. arXiv 2018, arXiv:1809.10452. [Google Scholar]
  34. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 3146–3154. [Google Scholar]
  35. Liu, J.; Sun, H.; Katto, J. Learned image compression with mixed transformer-cnn architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14388–14397. [Google Scholar]
  36. Ren, G.; Cao, T.; Kong, F.; Tang, J. Channel-Stationary Entropy Model for Multispectral Image Compression. In Proceedings of the International Conference in Communications, Signal Processing, and Systems, Xi’an, China, 25–27 October 2022; pp. 213–219. [Google Scholar]
  37. He, D.; Yang, Z.; Peng, W.; Ma, R.; Qin, H.; Wang, Y. Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5718–5727. [Google Scholar]
  38. Minnen, D.; Singh, S. Channel-wise autoregressive entropy models for learned image compression. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Virtual, 25–28 October 2020; pp. 3339–3343. [Google Scholar]
  39. Luo, A.; Sun, H.; Liu, J.; Katto, J. Memory-efficient learned image compression with pruned hyperprior module. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 3061–3065. [Google Scholar]
  40. Kodak, E.J.U. Kodak Lossless True Color Image Suite (PhotoCD PCD0992). Available online: https://r0k.us/graphics/kodak/ (accessed on 7 October 2024).
  41. Toderici, G.; Shi, W.; Timofte, R.; Theis, L.; Balle, J.; Agustsson, E.; Johnston, N.; Mentzer, F. Workshop and challenge on learned image compression (clic2020). In Proceedings of the CVPR, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  42. Krasin, I.; Duerig, T.; Alldrin, N.; Ferrari, V.; Abu-El-Haija, S.; Kuznetsova, A.; Rom, H.; Uijlings, J.; Popov, S.; Veit, A.; et al. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset 2017, 2, 18. [Google Scholar]
  43. Kingma, D.P. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  44. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  45. Xie, Y.; Cheng, K.L.; Chen, Q. Enhanced invertible encoding for learned image compression. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 162–170. [Google Scholar]
  46. Zhu, Y.; Yang, Y.; Cohen, T. Transformer-based transform coding. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
Figure 1. Comparison between the original image and the compressed image based on symmetric convolution.
Figure 2. Schematic diagram of symmetric and asymmetric convolution structures.
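To make the parameter asymmetry in Figure 2 concrete, the short sketch below (our own illustration in PyTorch with arbitrary channel counts, not code from this work) compares a k × k kernel with its 1 × k plus k × 1 factorization: per input–output channel pair the weight count drops from k² to 2k, e.g., from 9 to 6 for k = 3 and from 49 to 14 for k = 7.

import torch
import torch.nn as nn

k, channels = 3, 64
symmetric = nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2)
asymmetric = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=(1, k), padding=(0, k // 2)),
    nn.Conv2d(channels, channels, kernel_size=(k, 1), padding=(k // 2, 0)),
)

def count_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, channels, 32, 32)
print(symmetric(x).shape, asymmetric(x).shape)       # both keep the 32 x 32 resolution
print(count_params(symmetric), count_params(asymmetric))  # 36928 vs. 24704 weights for k = 3, C = 64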
Figure 3. The end-to-end image compression model (MRACNN). MRACD↓2 denotes the Multi-Path Residual Asymmetric Convolution Down-Sampling module and MRACU↓2 the Multi-Path Residual Asymmetric Convolution Up-Sampling module. ELAM denotes the Enhanced Local Attention Module, Q denotes quantization, and AE and AD denote the arithmetic encoder and arithmetic decoder, respectively.
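For readers who want to map the Figure 3 caption onto code, the minimal sketch below mirrors the overall flow (analysis transform, quantization, synthesis transform); the placeholder convolutions, channel sizes, and the omitted entropy coder are our own simplifications and do not reproduce the MRACD/MRACU or ELAM modules.

import torch
import torch.nn as nn

class TinyCompressor(nn.Module):
    """Structural stand-in for Figure 3: encoder -> Q -> (AE/AD omitted) -> decoder."""
    def __init__(self, n=128, m=192):
        super().__init__()
        self.encoder = nn.Sequential(                       # plays the role of the MRACD stack (g_a)
            nn.Conv2d(3, n, 5, stride=2, padding=2), nn.GELU(),
            nn.Conv2d(n, m, 5, stride=2, padding=2),
        )
        self.decoder = nn.Sequential(                       # plays the role of the MRACU stack (g_s)
            nn.ConvTranspose2d(m, n, 5, stride=2, padding=2, output_padding=1), nn.GELU(),
            nn.ConvTranspose2d(n, 3, 5, stride=2, padding=2, output_padding=1),
        )

    def forward(self, x):
        y = self.encoder(x)                                 # latent representation
        y_hat = y + (torch.round(y) - y).detach()           # Q: rounding with a straight-through gradient
        # A real codec would arithmetic-encode (AE) and decode (AD) y_hat
        # under the learned entropy model before reconstruction.
        return self.decoder(y_hat)

print(TinyCompressor()(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 3, 256, 256])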
Figure 4. Multi-Path Residual Asymmetric Convolution Block (MRACB). Above is the Multi-Path Residual Asymmetric Convolution Up-Sampling module (MRACU), and below is the Multi-Path Residual Asymmetric Convolution Down-Sampling module (MRACD).
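As a rough structural sketch of the multi-path residual idea in Figure 4 (our own assumption about how the paths could be combined, not the paper's MRACD/MRACU design): several asymmetric-convolution paths with different kernel sizes run in parallel, a 1 × 1 convolution fuses them, and a residual shortcut preserves the input.

import torch
import torch.nn as nn

class MultiPathAsymResBlock(nn.Module):
    """Illustrative multi-path residual asymmetric convolution block (not the exact MRACB)."""
    def __init__(self, channels=128, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.paths = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2)),
                nn.GELU(),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0)),
            )
            for k in kernel_sizes
        ])
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, kernel_size=1)  # 1x1 fusion

    def forward(self, x):
        fused = self.fuse(torch.cat([path(x) for path in self.paths], dim=1))
        return x + fused                                     # residual shortcut

print(MultiPathAsymResBlock()(torch.randn(1, 128, 64, 64)).shape)  # torch.Size([1, 128, 64, 64])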
Figure 5. The Enhanced Local Attention Module.
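For orientation only, a toy two-branch local-attention pattern (a deliberate simplification with invented layer choices, not the ELAM of Figure 5): one branch extracts features, the other derives a local gating map from a depthwise convolution over a window-sized neighborhood, and the two are combined multiplicatively before a residual addition.

import torch
import torch.nn as nn

class ToyLocalAttention(nn.Module):
    """Two-branch toy: features gated by a local, window-sized attention map."""
    def __init__(self, channels=128, window=7):
        super().__init__()
        self.features = nn.Conv2d(channels, channels, 3, padding=1)   # feature branch
        self.gate = nn.Sequential(                                    # local-attention branch
            nn.Conv2d(channels, channels, window, padding=window // 2, groups=channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x + self.features(x) * self.gate(x)                    # gated local aggregation + residual

print(ToyLocalAttention()(torch.randn(1, 128, 32, 32)).shape)         # torch.Size([1, 128, 32, 32])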
Figure 6. Channel-aware Squeezing Entropy Model. ELAT represents the Enhanced Local Attention Module applied to the channel entropy model, E and D represent arithmetic encoding and decoding, μ and σ represent the mean and variance, respectively, and LRP represents the latent residual prediction.
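The channel-conditional idea behind Figure 6 can be sketched in the style of the channel-wise autoregressive model of Minnen and Singh [38]; the layer sizes and the parameter network below are invented, and the ELAT and LRP components are omitted. Each latent channel slice's Gaussian mean and scale are predicted from hyperprior features together with the slices decoded so far.

import torch
import torch.nn as nn
import torch.nn.functional as F

def predict_slice_params(hyper_feats, decoded_slices, slice_ch=32):
    # Toy parameter network: conditions the next channel slice on hyperprior features
    # plus previously decoded slices, and regresses per-element mean and scale.
    in_ch = hyper_feats.shape[1] + sum(s.shape[1] for s in decoded_slices)
    net = nn.Sequential(                                     # rebuilt per call here; a real model would reuse it
        nn.Conv2d(in_ch, 128, 3, padding=1), nn.GELU(),
        nn.Conv2d(128, 2 * slice_ch, 3, padding=1),
    )
    mu, scale = net(torch.cat([hyper_feats, *decoded_slices], dim=1)).chunk(2, dim=1)
    return mu, F.softplus(scale)                             # keep the scale positive

hyper = torch.randn(1, 64, 16, 16)                           # hyperprior features (invented size)
first_slice = torch.randn(1, 32, 16, 16)                     # an already-decoded channel slice
mu, scale = predict_slice_params(hyper, [first_slice])
print(mu.shape, scale.shape)                                 # torch.Size([1, 32, 16, 16]) each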
Figure 7. Compressed Enhanced Local Attention Block (CLAB) in the Channel-aware Squeezing Entropy Model.
Figure 8. Rate–distortion curves for different methods on the Kodak dataset. (a) Evaluations on the Kodak dataset in terms of PSNR. (b) Evaluations on the Kodak dataset in terms of MS-SSIM.
Figure 9. Rate–distortion curves for different methods on the CLIC-PRO dataset. (a) Evaluations on the CLIC-PRO dataset in terms of PSNR. (b) Evaluations on the CLIC-PRO dataset in terms of MS-SSIM.
Figure 10. Feature maps of the latent features of industrial production scene images after down-sampling.
Figure 11. Visualization of reconstructed images from industrial production scenes. The metrics are [bpp↓/PSNR↑/MS-SSIM↑].
Figure 12. The ablation study on the MRACB.
Figure 13. The impact of the number of convolution kernel sizes on the MRACB module.
Figure 14. Visual comparison of reconstructed images from industrial production workshops. The metrics are [bpp↓/PSNR↑/MS-SSIM↑].
Table 1. Time comparison of different methods on the CPU. EncT and DecT denote the encoding and decoding times on the Kodak dataset.
Method     EncT      DecT
Ours       1.18 s    0.21 s
VTM        129.21 s  0.14 s
AV1        22.24 s   0.037 s
WebP       0.059 s   0.007 s
BPG        0.66 s    0.17 s
JPEG       0.012 s   0.006 s
JPEG2000   0.50 s    0.48 s
Table 2. BD-rate (%) comparison for different methods in terms of PSNR and MS-SSIM on different datasets. “CNN” means convolutional neural networks. “TF” means transformers. BD-rate is computed with VVC as the anchor method. Para. represents the number of parameters of each model. FLOPs means the number of floating-point operations. Running time means the averaged encoding and decoding time of the different models on the Kodak dataset using GPUs.
Method                Type  BD-Rate (%)               Para. (M)  FLOPs (G)  EncT (s)  DecT (s)
                            Kodak    CLIC     Avg.
VVC                   /     0.00     0.00     0.00     /          /          91.57     0.09
JPEG                  /     253.55   /        /        /          /          /         /
BPG                   /     20.29    26.53    23.41    /          /          /         /
Bmshj2018-factorized  CNN   28.03    34.64    31.34    36.61      62.23      0.69      0.88
Bmshj2018-hyperprior  CNN   20.10    25.65    22.88    98.42      44.44      0.07      0.06
Minnen2018            CNN   16.25    15.92    16.09    20.15      179.83     0.03      0.03
Minnen2020            CNN   10.37    11.65    11.01    26.42      17.66      2.5       2.3
Cheng2020             CNN   5.44     10.52    7.98     28.62      58.53      8.32      14.23
Entroformer           STF   2.75     4.22     3.49     45.0       194.5      4.59      8.38
Xie [45]              CNN   −0.72    −1.79    −1.26    50.34      27.08      2.35      5.21
SwinTChARM [46]       STF   −3.12    −5.01    −4.07    60.82      30.30      0.13      0.08
STF                   STF   −4.28    −3.76    −4.02    99.9       208.45     0.18      0.17
WACNN                 CNN   −5.01    −1.18    −3.10    57.52      33.39      0.13      0.13
ELIC                  CNN   −7.16    −5.23    −6.20    38.23      40.40      0.23      0.16
Ours                  CNN   −9.35    −10.09   −9.72    69.77      287.39     0.14      0.14
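For readers unfamiliar with the BD-rate figures reported in Table 2, the sketch below shows a generic Bjøntegaard delta-rate calculation with made-up rate–PSNR points (not the evaluation script used for this paper): fit cubic polynomials of log-rate against PSNR for the anchor and the test codec, integrate both over the overlapping quality range, and report the average rate difference in percent; negative values mean bitrate savings over the anchor.

import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    # Bjontegaard delta rate: average bitrate change (%) at equal quality.
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)   # log-rate as a cubic in PSNR
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))               # overlapping quality interval
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100                      # negative = bitrate saving

# Hypothetical rate (bpp) / PSNR (dB) points for an anchor and a test codec.
anchor = ([0.25, 0.40, 0.60, 0.90], [31.0, 33.0, 35.0, 37.0])
test = ([0.23, 0.37, 0.56, 0.85], [31.1, 33.2, 35.1, 37.2])
print(f"BD-rate: {bd_rate(*anchor, *test):.2f}%")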
Table 3. Ablation study; the baseline is WACNN.
Method             λ = 0.0067        λ = 0.0130        λ = 0.025
                   Bpp↓     PSNR↑    Bpp↓     PSNR↑    Bpp↓     PSNR↑
Baseline           0.309    32.26    0.449    34.15    0.649    35.91
Baseline + MRACB   0.303    32.42    0.447    34.28    0.633    35.98
Baseline + ELAM    0.308    32.32    0.450    34.23    0.645    35.97
Baseline + CLAB    0.305    32.37    0.444    34.22    0.635    35.91
MRACNN             0.302    32.56    0.445    34.35    0.630    36.07
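The λ values in Table 3 are the usual rate–distortion trade-off weights. As a reminder of the objective commonly minimized in learned image compression (written generically below with the CompressAI-style 255² scaling; this is an assumption, not necessarily the paper's exact loss), larger λ weights distortion more heavily and pushes the model toward higher bitrates.

import torch

def rd_loss(bpp, x, x_hat, lam=0.0130):
    # Generic objective L = R + lambda * D, with images in [0, 1] and the common
    # 255^2 scaling so that lambda values match those quoted in Table 3 (an assumption).
    mse = torch.mean((x - x_hat) ** 2)
    return bpp + lam * (255 ** 2) * mse

x = torch.rand(1, 3, 64, 64)
x_hat = (x + 0.01 * torch.randn_like(x)).clamp(0, 1)
print(rd_loss(torch.tensor(0.45), x, x_hat).item())          # bitrate term plus weighted distortion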
Table 4. BD-rate comparison of ELAM modules with different parameters on the Kodak dataset.
Dataset  Method               Kernel_Size  Num_Heads  BD-Rate (%)
Kodak    Minnen2020 + WAM     /            8          0
         Minnen2020 + ELAM    3            4          −0.73
         Minnen2020 + ELAM    3            8          −0.78
         Minnen2020 + ELAM    3            16         −0.69
         Minnen2020 + ELAM    5            4          −0.96
         Minnen2020 + ELAM    5            8          −1.06
         Minnen2020 + ELAM    5            16         −0.99
         Minnen2020 + ELAM    7            4          −1.58
         Minnen2020 + ELAM    7            8          1.67
         Minnen2020 + ELAM    7            16         −1.61
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
