Article

A Low-Bit-Rate Image Semantic Communication System Based on Semantic Graph

The School of Electronics and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(12), 2366; https://doi.org/10.3390/electronics13122366
Submission received: 28 April 2024 / Revised: 6 June 2024 / Accepted: 11 June 2024 / Published: 17 June 2024

Abstract

In research on semantic communication, most efforts have focused on optimizing the signal-to-noise ratio (SNR), while the design and optimization of the bit rate required for transmission have been relatively neglected. To address this issue, this study introduces a low-bit-rate image semantic communication system model that reconstructs images through the exchange of semantic information rather than traditional symbol transmission. The model employs an image feature extraction and optimized reconstruction framework, achieving visually satisfactory and semantically consistent reconstruction at extremely low bit rates (below 0.03 bits per pixel (bpp)). Unlike previous methods that use pixel accuracy as the standard for measuring distortion, this work introduces multiple perceptual metrics to train and evaluate the proposed image semantic encoding model, aligning more closely with the fundamental purpose of semantic communication. Experimental results demonstrate that, compared to WebP, JPEG, and deep learning-based image codecs, our model not only produces more visually pleasing reconstructions but also significantly reduces the required bit rate while maintaining semantic consistency.

1. Introduction

With the rapid advancement of high-resolution visual applications such as streaming media, virtual reality (VR), and smart surveillance, the efficient compression and transmission of images and videos generated by these applications have become a pressing issue. Current lossy codecs, such as JPEG [1], JPEG2000 [2], WebP [3], and BPG [4] based on HEVC, achieve data compression primarily by removing redundant information in images and employing certain approximate representations. Unlike lossless image codecs, lossy codecs may introduce some loss of information during the restoration of images. However, through careful design, significant data compression can be achieved without noticeably impacting visual perception.
At extremely low bit rates, however, codecs designed around traditional algorithms can significantly degrade image quality, introducing blocking artifacts and other distortions. This issue arises from the block partitioning and per-block compression these algorithms apply when processing images, which can produce visible block boundaries and color distortions. The underlying cause is that traditional evaluation standards tend to favor the preservation of pixel-level local structures rather than semantic information. Image compression methods based on deep learning can also produce blurry, checkerboard-like artifacts, and when optimized for MS-SSIM and MSE metrics, their texture reconstruction performance is generally poor [5].
Semantic communication, which emphasizes the fidelity of semantics rather than minimizing bit or symbol error rates, is gaining increasing attention and has become an important research direction in the field of communication [6,7,8]. Semantic communication has shown broad application prospects in various industries, including brain computer interfaces [9], virtual reality [10], augmented reality [11], and mixed reality [12]. Although existing research on semantic communication covers text, voice, image, and even video information, the bit rate usage during the semantic transmission process has rarely been addressed.
Inspired by these considerations, this study introduces a low-bit-rate semantic communication system for image transmission, aimed at capturing the semantic information of images through a feature extraction network and a semantic segmentation network. The system transmits compressed semantic information during the communication process to reduce the bit rate required for transmission, and reconstructs the image on the semantic level at the receiver end using prior knowledge.
The principal contributions of our work are as follows:
-
We propose a deep learning-based feature extraction and compression network, built on the ResNet [13] architecture and combined with multiple attention mechanisms, to achieve efficient semantic feature extraction. Through the network’s feature extraction capabilities, we capture partial semantic information of images at extremely low bit rates.
-
We employ a conditional diffusion model for image reconstruction. Unlike traditional diffusion models, the conditional diffusion model incorporates conditional information, making the generation process more controllable. This implies that the model’s generation is influenced not only by random noise but also by given conditional information. We use semantic information as this conditional input, thus aligning the generated images more closely with our requirements.
-
We optimize and evaluate the semantic exchange efficiency of the proposed semantic encoding method using a combination of perceptual metrics. The reconstruction results are quantitatively assessed using the Fréchet Inception Distance (FID) [14], mean Intersection over Union (mIoU) [15], and the classification accuracy of the reconstructed images, in line with human perception.

2. Related Work

2.1. Semantic Communication

After Claude Shannon introduced information theory, he and Warren Weaver further developed and refined the theory and its model. In this process, they recognized the importance of semantics in communication and proposed three levels of communication: syntactic communication, semantic communication, and pragmatic communication [16]. Semantic communication follows a task-oriented “understand first, transmit later” strategy: features of the original signal are selectively extracted, compressed, and transmitted, and communication then proceeds at the semantic level. Semantic communication systems for text transmission have been introduced in references [17,18], achieving intelligent communication between humans and machines, as well as between machines. The theory of the lossless compression of semantic data, which asserts that data can be significantly compressed at the semantic level, was explored in reference [19]. Building on this, we propose an image semantic encoding method for multimedia semantic communication, utilizing prior information such as image distribution models.

2.2. Image Compression

Early image compression techniques primarily utilized entropy coding directly to reduce the redundancy in image encoding, such as Huffman coding [20], arithmetic coding [21], and context-adaptive binary arithmetic coding [22]. In the late 1960s, compression methods based on image transformation were introduced; they compress images by transforming them from the spatial domain to the frequency domain and encoding them in the frequency domain. The transformation techniques used in transform coding mainly include the Fourier transform [23], Hadamard transform [24], and Discrete Cosine Transform (DCT) [25]. In addition to eliminating data redundancy through entropy coding and transformation techniques, prediction and quantization techniques were also proposed to further reduce spatial redundancy and psychovisual redundancy in images.
In recent years, image compression techniques based on deep learning have been extensively studied and applied, including methods based on autoencoders, convolutional neural networks, and Generative Adversarial Networks (GANs). Toderici et al. [26] pioneered an end-to-end optimized image compression method that reconstructs images using Recurrent Neural Networks (RNNs). Ballé et al. [27] proposed an image compression framework based on the Variational Autoencoder (VAE) structure. Rippel et al. [28] were the first to introduce adversarial loss functions in an end-to-end framework to enhance the visual quality of images. To further reduce the number of bits required to store images, researchers have explored various effective probability models for entropy coding, including hierarchical priors and autoregressive models with context.

2.3. Image Generation

In the field of image generation, Generative Adversarial Networks (GANs) and diffusion models have made rapid progress, focusing on learning the distribution of natural images rather than pursuing precision in local reconstruction.
GANs, as pioneers in image generation technology, were introduced by Ian Goodfellow and colleagues in 2014 [29]. The core concept of GANs is to generate new data instances through adversarial training between two networks: a generator network (G) and a discriminator network (D). This approach has been demonstrated by Agustsson et al. [30] and He et al. [31], who employed advanced generative networks to provide exceptional reconstruction quality at extremely low bit rates, showcasing the potential of adversarial learning.
The conditional diffusion model [32] extends the diffusion model by introducing conditional information for more precise control over generation. Compared to the instabilities that can arise while training GANs, such as mode collapse and non-convergence, the training process of conditional diffusion models is more stable. Since 2022, GANs have gradually been supplanted by conditional diffusion models. Conditional diffusion models can impose multimodal constraints and support low-bit-rate compression through contrastive language-image pre-training encodings [33], as well as through denoising diffusion implicit models, latent diffusion, and stable diffusion [34] reconstruction models. Compared to GANs, conditional diffusion models can reduce the randomness of generated images to a certain extent through reasonable parameter adjustment and conditional control, thereby improving the stability of the generated images.

3. Low-Bit-Rate Image Semantic Communication System

This section introduces the overall architecture of the low-bit-rate image semantic communication system proposed in this paper, followed by a detailed discussion of the designed feature extraction module, multi-attention module, and conditional diffusion image generation module.

3.1. Overall Architecture

The system model for semantic communication at extremely low bit rates proposed in this paper is composed of a multiscale feature extraction module, an image reconstruction module, a channel, and a knowledge background shared between the sender and receiver, as illustrated in Figure 1. The multiscale feature extraction module comprises two parts: a multiscale feature extraction network and a semantic segmentation network. The multiscale feature extraction network is responsible for extracting feature information from images, while the semantic segmentation network extracts the semantic distribution of images. The image reconstruction module includes a semantic generation network, which reconstructs images using a conditional diffusion model, with the received image feature information and semantic distribution serving as conditions.
Under the lowest bpp compression configuration proposed in this article, image feature information is compressed at a very high compression rate (<0.01 bpp), and the semantic distribution map is down-sampled to a resolution of 128 × 128 and transmitted as a grayscale image, achieving a satisfactory compression rate (<0.02 bpp). These two pieces of information are transmitted to the receiver end through low-bit-rate lossless encoding.
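To put these rates in perspective, the following back-of-the-envelope calculation (ours, not a figure reported in the experiments) translates the stated bpp budgets into approximate byte counts for a 256 × 256 source image:

```latex
\begin{aligned}
\text{pixels} &= 256 \times 256 = 65{,}536\\
\text{feature stream } (<0.01\ \text{bpp}) &< 0.01 \times 65{,}536 \approx 655\ \text{bits} \approx 82\ \text{bytes}\\
\text{segmentation map } (<0.02\ \text{bpp}) &< 0.02 \times 65{,}536 \approx 1{,}311\ \text{bits} \approx 164\ \text{bytes}\\
\text{total } (<0.03\ \text{bpp}) &< 246\ \text{bytes per image}
\end{aligned}
```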

3.2. Multiscale Feature Extraction Module

As shown in Figure 2, the proposed multiscale feature extraction network follows the ResNet50 structure and is composed of four stages, each consisting of multiscale residual blocks and attention blocks.
The details of the multiscale residual module are depicted in Figure 3, where the convolutional layers utilize 3 × 3 and 5 × 5 sized depthwise separable convolutions [35]. Compared to conventional convolutional methods, this approach significantly reduces computational complexity and the number of parameters, while still maintaining effective feature extraction capability from the input.
In the multiscale residual module, the input is first processed by 3 × 3 and 5 × 5 depthwise separable convolution layers in two parallel branches, with the SiLU activation function, to obtain features at different scales [36]. The extracted features are then concatenated and fed crosswise into 5 × 5 and 3 × 3 depthwise separable convolution layers to further expand the receptive field and enhance information extraction. The concatenated features are passed through a 1 × 1 conventional convolution layer to adjust the dimensions before being added to the original residual input to produce the final output. Specifically, the multiscale residual module can be expressed as follows:
$$X_1 = \sigma\big(f^{d1}_{3\times 3}(w)\big),$$
$$Y_1 = \sigma\big(f^{d1}_{5\times 5}(w)\big),$$
$$X_2 = \sigma\big(f^{d2}_{3\times 3}([X_1, Y_1])\big),$$
$$Y_2 = \sigma\big(f^{d2}_{5\times 5}([Y_1, X_1])\big),$$
$$w' = f^{n3}_{1\times 1}([X_2, Y_2]) + w$$
where $f$ denotes a convolutional layer whose superscript indicates the convolution type ($d$ for depthwise separable convolution, $n$ for ordinary convolution) followed by the layer index, and whose subscript gives the convolution kernel size; $\sigma$ denotes the SiLU activation function; and $[\cdot,\cdot]$ denotes the fusion operation, i.e., concatenation along the channel dimension, which corresponds to C in Figure 3.
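To make the data flow concrete, below is a minimal PyTorch sketch of a multiscale residual block following these equations. The class names, channel handling, and the use of channel concatenation for the fusion step are our assumptions for illustration, not the authors’ released code.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution followed by a pointwise (1x1) convolution."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class MultiscaleResidualBlock(nn.Module):
    """Sketch of the multiscale residual block described by the equations above."""
    def __init__(self, channels):
        super().__init__()
        self.conv3_1 = DepthwiseSeparableConv(channels, 3)       # f^{d1}_{3x3}
        self.conv5_1 = DepthwiseSeparableConv(channels, 5)       # f^{d1}_{5x5}
        self.conv3_2 = DepthwiseSeparableConv(2 * channels, 3)   # f^{d2}_{3x3}
        self.conv5_2 = DepthwiseSeparableConv(2 * channels, 5)   # f^{d2}_{5x5}
        self.fuse = nn.Conv2d(4 * channels, channels, 1)         # f^{n3}_{1x1}
        self.act = nn.SiLU()

    def forward(self, w):
        x1 = self.act(self.conv3_1(w))
        y1 = self.act(self.conv5_1(w))
        # Cross-feed the concatenated multiscale features into the second layer pair
        x2 = self.act(self.conv3_2(torch.cat([x1, y1], dim=1)))
        y2 = self.act(self.conv5_2(torch.cat([y1, x1], dim=1)))
        # 1x1 convolution restores the channel count before the residual addition
        return self.fuse(torch.cat([x2, y2], dim=1)) + w
```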

3.3. Hybrid Attention Block

Image features exhibit significant spatial locality, yet current feature extraction networks do not fully leverage the potential of attention mechanisms. The common practice of employing a single type of attention mechanism often leads to inaccurate feature extraction, thereby affecting the performance of subsequent tasks. To address this issue, this study proposes a hybrid attention module that integrates multiple types of attention mechanisms. This module applies different attention mechanisms to various component groups, fully utilizing the characteristics of each attention mechanism to significantly enhance the effectiveness of image feature extraction.
First, the first and second stages retain more spatial information and have fewer channels, so this paper applies SA (self-attention) [37] with a skip connection to focus on their spatial characteristics. The formulation is as follows:
$$q_x = W_q x, \quad k_x = W_k x, \quad v_x = W_v x,$$
$$\Gamma_{i,j} = \frac{q_{x_i}^{T} k_{x_j}}{\lVert q_{x_i}\rVert\,\lVert k_{x_j}\rVert},$$
$$y_i = x_i + W_j \sum_{j} \mathrm{softmax}_j\big(\alpha\,\Gamma_{i,j}\big)\, v_{x_j}$$
where $x$ is the input of the attention block and $y$ is the output; $W_q, W_k, W_v, W_j \in \mathbb{R}^{C \times C}$ are implemented as 1 × 1 convolutions in the attention block; and $i$ and $j$ are spatial indices ranging from 1 to H × W.
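A minimal PyTorch sketch of this self-attention block with a skip connection is given below. The learnable temperature alpha and the cosine-normalized similarity follow our reading of the equations above; layer names are placeholders, not the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBlock(nn.Module):
    """Sketch of self-attention with a skip connection over all spatial positions."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.proj = nn.Conv2d(channels, channels, 1)    # W_j
        self.alpha = nn.Parameter(torch.tensor(1.0))    # similarity scaling (assumed learnable)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2)                        # (B, C, HW)
        k = self.k(x).flatten(2)
        v = self.v(x).flatten(2)
        # Cosine similarity between every pair of spatial positions
        q = F.normalize(q, dim=1)
        k = F.normalize(k, dim=1)
        gamma = torch.einsum('bci,bcj->bij', q, k)      # (B, HW, HW)
        attn = F.softmax(self.alpha * gamma, dim=-1)
        out = torch.einsum('bij,bcj->bci', attn, v)     # weighted sum of values
        out = self.proj(out.view(b, c, h, w))
        return x + out                                  # skip connection
```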
Second, in the third and fourth stages, the feature map size decreases and the number of channels increases, so the importance of individual channels grows. This study therefore adopts the CA (coordinate attention) [38] mechanism to better capture spatial position information together with channel information, improving the efficiency and accuracy of feature extraction. Specifically, the coordinate attention mechanism is implemented as follows:
$$g^h = \sigma\big(F_h(f^h)\big),$$
$$g^w = \sigma\big(F_w(f^w)\big),$$
$$y(i,j) = x(i,j) \times g^h(i) \times g^w(j)$$
where $x$ is the input of the attention block and $y$ is the output; $g^h \in \mathbb{R}^{C \times H \times 1}$ and $g^w \in \mathbb{R}^{C \times 1 \times W}$ are the attention vectors obtained from the pooled features $f^h \in \mathbb{R}^{C/r \times H \times 1}$ and $f^w \in \mathbb{R}^{C/r \times 1 \times W}$ by applying 1 × 1 convolutions ($F_h$, $F_w$) to restore the channel dimensionality, followed by the sigmoid activation function $\sigma$; $i$ and $j$ are spatial indices ranging from 1 to H and from 1 to W, respectively.
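The following PyTorch sketch illustrates a coordinate attention block of this kind, following the cited design [38]; the reduction ratio, normalization, and activation inside the shared transform are our assumptions.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of coordinate attention as used in stages 3-4 (reduction ratio assumed)."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool along width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool along height -> (B, C, 1, W)
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.SiLU())
        self.conv_h = nn.Conv2d(mid, channels, 1)       # F_h
        self.conv_w = nn.Conv2d(mid, channels, 1)       # F_w

    def forward(self, x):
        b, c, h, w = x.shape
        fh = self.pool_h(x)                              # (B, C, H, 1)
        fw = self.pool_w(x).permute(0, 1, 3, 2)          # (B, C, W, 1)
        y = self.reduce(torch.cat([fh, fw], dim=2))      # shared 1x1 reduction
        fh, fw = torch.split(y, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(fh))                          # (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(fw.permute(0, 1, 3, 2)))      # (B, C, 1, W)
        return x * g_h * g_w                              # per-position reweighting
```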

3.4. Quantitative Transmission

In the feature extraction network, quantizing the output can significantly reduce the number of bits required for transmission. However, model training relies on backpropagation and gradient descent, which requires the computation to be differentiable, whereas quantization is inherently non-differentiable. To address this issue, we adopt the quantization method described in reference [39] and simplify it following the ideas presented in reference [40]. Specifically, given $w = [w_1, w_2, w_3, \ldots]$, each element $w_i$ of $w$ is quantized to the candidate symbol (center) $c_j$ closest to it, selected from a set of centers $c = [c_1, c_2, c_3, \ldots]$. We employ nearest-neighbor assignment to perform this calculation:
$$\bar{w}_i = Q(w_i) = \arg\min_{j} \lVert w_i - c_j \rVert$$
In backpropagation, to make the network differentiable, quantization can be replaced by:
$$\tilde{w}_i = \sum_{j=1}^{L} \frac{\exp\big(-\sigma \lVert w_i - c_j \rVert\big)}{\sum_{l=1}^{L} \exp\big(-\sigma \lVert w_i - c_l \rVert\big)}\, c_j$$
This is equivalent to a weighted average of the centers $c = [c_1, c_2, c_3, \ldots]$, where centers closer to $w_i$ receive higher weights. This method is called soft quantization.
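A minimal sketch of this scheme is shown below: hard nearest-neighbor assignment in the forward pass and the soft, differentiable assignment for the gradient, combined in a straight-through style. This combination is our reading of references [39,40]; the center values and hyperparameters are illustrative assumptions.

```python
import torch

def soft_quantize(w, centers, sigma=1.0, hard=True):
    """Quantize w against a 1-D tensor of L candidate symbols (centers)."""
    # Distances between every element of w and every center: shape (..., L)
    dist = torch.abs(w.unsqueeze(-1) - centers)
    # Soft assignment: softmax over negative scaled distances, weighted sum of centers
    soft = torch.softmax(-sigma * dist, dim=-1) @ centers
    if not hard:
        return soft
    # Hard nearest-neighbor assignment used in the forward pass
    hard_q = centers[dist.argmin(dim=-1)]
    # Forward returns hard values; backward uses the gradient of the soft assignment
    return soft + (hard_q - soft).detach()

# Example: quantize features to 8 evenly spaced centers
centers = torch.linspace(-1, 1, 8)
w = torch.randn(4, 16, requires_grad=True)
w_q = soft_quantize(w, centers, sigma=2.0)
w_q.sum().backward()   # gradients flow through the soft assignment
```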

3.5. Conditional Diffusion Image Generation Module

First, a brief introduction to the theory of conditional diffusion models is provided. The objective of a conditional diffusion model is to generate samples from a model distribution $m(p_0 \mid x)$ fitted by maximum likelihood to the conditional data distribution $c(p_0 \mid x)$ for a given condition $x$. In the Conditional Denoising Diffusion Probabilistic Model (Conditional DDPM), two processes are defined: the forward process and the reverse process. The forward process $n(p_{1:T} \mid p_0)$, also known as the diffusion process, gradually adds Gaussian noise to the data until it becomes random noise. The formula for this process is as follows:
$$n(p_t \mid p_{t-1}) = \mathcal{N}\big(p_t;\ \sqrt{1-\beta_t}\, p_{t-1},\ \beta_t \mathbf{I}\big),$$
With the notation $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, the marginal at step $t$ can be written as:
$$n(p_t \mid p_0) = \mathcal{N}\big(p_t;\ \sqrt{\bar{\alpha}_t}\, p_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big),$$
The reverse process $m_\theta(p_{0:T} \mid x)$ is defined as a Markov chain with learned Gaussian transitions, starting from $m(p_T) = \mathcal{N}(0, \mathbf{I})$; its formula is as follows:
$$m_\theta(p_{0:T} \mid x) = m(p_T) \prod_{t=1}^{T} m_\theta(p_{t-1} \mid p_t, x),$$
$$m_\theta(p_{t-1} \mid p_t, x) = \mathcal{N}\big(p_{t-1};\ \mu_\theta(p_t, x, t),\ \Sigma_\theta(p_t, x, t)\big)$$
The conditional diffusion model is trained by minimizing a variational upper bound on its negative log-likelihood.
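As an illustration, the following is a minimal PyTorch sketch of the forward (noising) process and the standard simplified noise-prediction training step used for DDPM-style models. The linear noise schedule, the hyperparameter values, and the placeholder denoising network eps_model are assumptions; the exact SDM training procedure may differ in detail.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # noise schedule (assumed linear)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)        # \bar{alpha}_t

def q_sample(p0, t, noise):
    """Sample p_t ~ n(p_t | p_0) = N(sqrt(abar_t) p_0, (1 - abar_t) I)."""
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    return ab.sqrt() * p0 + (1 - ab).sqrt() * noise

def training_step(eps_model, p0, x_cond):
    """One simplified training step: predict the added noise given the condition x_cond."""
    t = torch.randint(0, T, (p0.shape[0],))
    noise = torch.randn_like(p0)
    p_t = q_sample(p0, t, noise)
    return F.mse_loss(eps_model(p_t, x_cond, t), noise)
```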
Referring to previous research, this study made some modifications on the basis of the Semantic Diffusion Model (SDM) [41] network to meet the image reconstruction requirements of this paper. In SDM, the conditional denoising network is based on the U-Net structure and aims to estimate the noise components in the input noisy image. This network structure can independently process semantic information and noisy images, where noisy images are introduced into the denoising network during the encoder stage.
In order to utilize semantic information more effectively, this paper improves the SDDResblock, the structure in SDM that processes semantic information. As shown in Figure 4, this study introduces the semantic features obtained from the multiscale feature extraction module as an additional condition in the original SDDResblock structure. Through multi-layer spatially adaptive normalization operations, the semantic label map and the semantic features are injected into the decoder of the denoising network.
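A minimal sketch of such a spatially adaptive normalization layer is given below. The layer composition, channel sizes, and the choice to feed a condition tensor formed from the label map and semantic features are our assumptions for illustration, not the exact SDDResblock implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyAdaptiveNorm(nn.Module):
    """Normalized activations modulated by per-pixel scale and bias predicted
    from the conditioning tensor (semantic label map plus semantic features)."""
    def __init__(self, feat_channels, cond_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.SiLU())
        self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)  # per-pixel scale
        self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)   # per-pixel bias

    def forward(self, h, cond):
        # Resize the condition to the spatial size of the current activation
        cond = F.interpolate(cond, size=h.shape[-2:], mode='nearest')
        c = self.shared(cond)
        return self.norm(h) * (1 + self.gamma(c)) + self.beta(c)
```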

4. Experiments and Discussion of Results

4.1. Setup

Experimental Environment: The experiments were conducted on a computer running Windows 11 64-bit OS, equipped with an Intel i7-12700H processor, NVIDIA GeForce RTX 3070 graphics card, and 8 GB RAM. The programming environment is based on the PyTorch framework, implemented in Python 3.10.
Dataset: The experiments were performed on the ADE20K dataset [42], which was released by MIT CSAIL Computer Vision Group in 2016. It includes 20,210 training images and 2000 validation images, covering 150 semantic categories such as people, vehicles, plants, and buildings. Each image provides pixel-level semantic annotation. Due to the inconsistent size of each image in the ADE20K dataset, we adjusted the size of all images to a resolution of 256 × 256 for subsequent experiments.
Evaluation Metrics: In evaluating the performance of the different algorithms, we employed three key metrics: FID (Fréchet Inception Distance), mIoU (mean Intersection over Union), and accuracy. These metrics measure the authenticity, semantic consistency, and classification accuracy of synthetic images relative to real images from different perspectives.
First, FID, an indicator of image authenticity, is computed as the Wasserstein-2 distance between the distributions of synthetic and real images in a deep feature space. This metric quantifies the perceptual realism of synthetic images, providing an intuitive standard for evaluating the performance of image generation algorithms.
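For reference, with the feature distributions of real and generated images modeled as Gaussians with means $\mu_r$, $\mu_g$ and covariances $\Sigma_r$, $\Sigma_g$ (features extracted by an Inception network), the FID is the squared Wasserstein-2 distance:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^{2}
             + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\big(\Sigma_r \Sigma_g\big)^{1/2}\right)
```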
Second, to assess the semantic accuracy of synthetic images, we introduced two metrics: mIoU and accuracy. Accuracy was evaluated by computing the average pixel classification accuracy across all categories, reflecting the ability to recognize the semantic content of synthetic images. Specifically, we re-segmented the synthetic images semantically and calculated the pixel classification accuracy between the segmentation results and the original semantic labels, thereby obtaining the overall accuracy. As another significant semantic evaluation metric, mIoU comprehensively measured the performance of synthetic images in semantic segmentation tasks by calculating the ratio of the intersection to the union between the true values and predicted values for each category, and then averaging the values for all categories.
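The following NumPy sketch shows how mIoU and the overall pixel accuracy can be computed from the re-segmented reconstructions and the original semantic labels. The confusion-matrix formulation, the 150-class count taken from ADE20K, and the ignore index are our assumptions about the evaluation details.

```python
import numpy as np

def miou_and_accuracy(pred, label, num_classes=150, ignore_index=255):
    """mIoU and overall pixel accuracy between predicted and ground-truth label maps."""
    mask = label != ignore_index
    pred, label = pred[mask], label[mask]
    # Confusion matrix: rows = ground truth, columns = prediction
    conf = np.bincount(label * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - tp
    iou = tp / np.maximum(union, 1)
    present = union > 0                      # average only over classes that appear
    miou = iou[present].mean()
    accuracy = tp.sum() / conf.sum()
    return miou, accuracy
```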
Training method: We trained the model end-to-end as a whole, with the semantic segmentation network kept as a frozen pre-trained model that only provides segmentation results. Specifically, the learning rate was 10⁻⁴ and the Adam optimizer was used. We trained three models with different bpp budgets for the comparative experiments: the small model, with a semantic feature compression rate of 0.01 bpp and a segmentation result compression rate of 0.02 bpp; the medium model, with a semantic feature compression rate of 0.03 bpp and a segmentation result compression rate of 0.02 bpp; and the large model, with a semantic feature compression rate of 0.03 bpp and a segmentation result compression rate of 0.05 bpp.
Experimental Method: For traditional image compression algorithms such as JPEG and BPG, we controlled the compression quality so that the image compression ratio was about 0.03 bpp. For network models that generate images from semantic distribution maps, such as SPADE and Pix2pixHD, we used the segmentation results of the ViT-Adapter model [43] as input, which shows excellent semantic segmentation performance on the ADE20K dataset. At a resolution of 256 × 256, the compression ratio of the segmentation result was about 0.05 bpp; after down-sampling the segmentation result once to a resolution of 128 × 128, the compression ratio was about 0.02 bpp. The model proposed in this paper has two inputs: the first is the semantic features extracted by the deep semantic extraction network, with a minimum compression rate of about 0.01 bpp; the second is the segmentation result of the ViT-Adapter model, the two together yielding a compression rate of about 0.03 bpp. We also tested the performance of the model at different bpp values by extracting semantic features of different sizes.

4.2. Result and Analysis

Through comparative experiments, we comprehensively compared the proposed model with other strong image processing models at various bpps; the results are listed in Table 1. The results indicate that our model outperforms the second-ranked model on all metrics at each bpp. The lower FID values and higher accuracy scores directly reflect the high visual similarity between the generated images and the real images. Meanwhile, the significant increase in mIoU further validates the consistency between the distributions of generated and real images, clearly demonstrating the effectiveness of using semantic conditions to guide the diffusion model during generation.
To further compare how images generated by our method are perceived relative to other coding and decoding methods on the ADE20K dataset, we conducted a user study. In each set of experiments, participants were asked to choose between images generated by our method and those generated by another coding and decoding method, based on which had better visual quality. According to the data presented in Table 2, participants generally preferred our results. Figure 5 shows a comparison of partial results generated using the various methods. It is evident that at low compression rates, JPEG images already begin to show blocking artifacts and WebP images display smearing effects. Meanwhile, images generated by the SPADE and SDM methods exhibit boundary distortions, indicating that these approaches do not fully capitalize on semantic information.

4.3. Ablation Studies

To verify the effectiveness of the key designs in our system, we conducted a series of ablation studies targeting the multiscale residual blocks and the hybrid attention blocks. Without the multiscale residual blocks, a traditional residual block composed of two convolutional layers was used; without the hybrid attention blocks, features were passed directly to the subsequent layers. The quantitative results are summarized in Table 3. We observe that both the multiscale residual blocks and the hybrid attention blocks positively contribute to the performance of our model on the test set.
Under the condition of retaining the multiscale residual blocks and hybrid attention blocks, this study conducted detailed ablation experiments on the attention sub-modules within the hybrid attention blocks and investigated the sequence of these sub-modules. The experimental results are summarized in Table 4. We observed that, compared to employing a single attention sub-module, the strategy of using hybrid attention sub-modules has a more significant positive impact on the overall performance of the model.

5. Conclusions and Future Work

In this work, we developed a low-bit-rate image semantic communication system, which reconstructs images by transmitting key information and semantic label maps, aiming at the consistency of semantic information transmission rather than the pixel-level consistency of the image. Compared to previous efforts, our system can receive information and reconstruct satisfactory images at an extremely low bit rate (below 0.03 bits per pixel (bpp)). This advantage not only significantly conserves bandwidth resources but also fundamentally prevents the occurrence of image degradation phenomena that are common in traditional coding and decoding methods under extremely low-bit-rate reconstruction. Exploring semantic communication systems for video content is a promising direction for future work.

Author Contributions

Conceptualization, J.D. and W.S.; methodology, J.D.; simulation, J.D.; validation, J.D.; investigation, J.D.; resources, T.Y.; writing—original draft preparation, J.D.; writing—review and editing, W.S. and T.Y.; supervision, T.Y.; project administration, W.S. and T.Y.; funding acquisition, T.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Gansu Province Major Science and Technology Projects, Grant number 22ZD6GA041; the Gansu Provincial Key Talent Project, Grant number 6660010201.

Data Availability Statement

The data and source code can be obtained by contacting the first author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wallace, G.K. The JPEG Still Picture Compression Standard. IEEE Trans. Consum. Electron. 1992, 38, xviii–xxxiv.
2. Christopoulos, C.; Skodras, A.; Ebrahimi, T. The JPEG2000 Still Image Coding System: An Overview. IEEE Trans. Consum. Electron. 2000, 46, 1103–1127.
3. Ginesu, G.; Pintus, M.; Giusto, D.D. Objective Assessment of the WebP Image Coding Algorithm. Signal Process. Image Commun. 2012, 27, 867–874.
4. Albalawi, U.; Mohanty, S.P.; Kougianos, E. A Hardware Architecture for Better Portable Graphics (BPG) Compression Encoder. In Proceedings of the 2015 IEEE International Symposium on Nanoelectronic and Information Systems, Indore, India, 21–23 December 2015; pp. 291–296.
5. Huang, D.; Tao, X.; Gao, F.; Lu, J. Deep Learning-Based Image Semantic Coding for Semantic Communications. In Proceedings of the 2021 IEEE Global Communications Conference (GLOBECOM), Madrid, Spain, 7–11 December 2021; pp. 1–6.
6. Qin, Z.; Tao, X.; Lu, J.; Tong, W.; Li, G.Y. Semantic Communications: Principles and Challenges. arXiv 2022, arXiv:2201.01389.
7. Luo, X.; Chen, H.-H.; Guo, Q. Semantic Communications: Overview, Open Issues, and Future Research Directions. IEEE Wirel. Commun. 2022, 29, 210–219.
8. Yang, W.; Du, H.; Liew, Z.Q.; Lim, W.Y.B.; Xiong, Z.; Niyato, D.; Chi, X.; Shen, X.; Miao, C. Semantic Communications for Future Internet: Fundamentals, Applications, and Challenges. IEEE Commun. Surv. Tutor. 2023, 25, 213–250.
9. Friedman, D. Brain Art: Brain-Computer Interfaces for Artistic Expression. Brain-Comput. Interfaces 2020, 7, 36–37.
10. Catalá, A.; Jaén, J.; Mocholí, J.A. A Semantic Publish/Subscribe Approach for U-VR Systems Interoperation. In Proceedings of the 2008 International Symposium on Ubiquitous Virtual Reality, Gwangju, Republic of Korea, 10–13 July 2008; pp. 29–32.
11. Nguyen Dang, T.; Nguyen, L.X.; Le, H.Q.; Kim, K.; Ahsan Kazmi, S.M.; Park, S.-B.; Huh, E.-N.; Seon Hong, C. Semantic Communication for AR-Based Services in 5G and Beyond. In Proceedings of the 2023 International Conference on Information Networking (ICOIN), Bangkok, Thailand, 11–14 January 2023; pp. 549–553.
12. Meza, R.A.C.; Patra, A.N. Beyond Words: Semantic Communication in the Metaverse. In Proceedings of the SoutheastCon 2024, Atlanta, GA, USA, 15–24 March 2024; pp. 136–144.
13. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
14. Bynagari, N.B. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Asian J. Appl. Sci. Eng. 2019, 8, 25–34.
15. Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Garcia-Rodriguez, J. A Review on Deep Learning Techniques Applied to Semantic Segmentation. arXiv 2017, arXiv:1704.06857.
16. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423.
17. Yan, L.; Qin, Z.; Zhang, R.; Li, Y.; Li, G.Y. Resource Allocation for Text Semantic Communications. IEEE Wirel. Commun. Lett. 2022, 11, 1394–1398.
18. Liu, Y.; Jiang, S.; Zhang, Y.; Cao, K.; Zhou, L.; Seet, B.-C.; Zhao, H.; Wei, J. Extended Context-Based Semantic Communication System for Text Transmission. Digit. Commun. Netw. 2022, in press.
19. Pumma, S.; Vishnu, A. Semantic-Aware Lossless Data Compression for Deep Learning Recommendation Model (DLRM). In Proceedings of the 2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), St. Louis, MO, USA, 15 November 2021; pp. 1–8.
20. Huffman, D.A. A Method for the Construction of Minimum-Redundancy Codes. Proc. IRE 1952, 40, 1098–1101.
21. Witten, I.H.; Neal, R.M.; Cleary, J.G. Arithmetic Coding for Data Compression. Commun. ACM 1987, 30, 520–540.
22. Marpe, D.; Schwarz, H.; Wiegand, T. Context-Based Adaptive Binary Arithmetic Coding in the H.264/AVC Video Compression Standard. IEEE Trans. Circuits Syst. Video Technol. 2003, 13, 620–636.
23. Fourier, J.B.J. Théorie Analytique de La Chaleur; Gauthier-Villars: Paris, France, 1888.
24. Pratt, W.K.; Kane, J.; Andrews, H.C. Hadamard Transform Image Coding. Proc. IEEE 1969, 57, 58–68.
25. Ahmed, N.; Natarajan, T.; Rao, K.R. Discrete Cosine Transform. IEEE Trans. Comput. 1974, C–23, 90–93.
26. Toderici, G.; O’Malley, S.M.; Hwang, S.J.; Vincent, D.; Minnen, D.; Baluja, S.; Covell, M.; Sukthankar, R. Variable Rate Image Compression with Recurrent Neural Networks. arXiv 2016, arXiv:1511.06085.
27. Ballé, J.; Laparra, V.; Simoncelli, E.P. Density Modeling of Images Using a Generalized Normalization Transformation. arXiv 2016, arXiv:1511.06281.
28. Rippel, O.; Bourdev, L. Real-Time Adaptive Image Compression. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 2922–2930.
29. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. Commun. ACM 2020, 63, 139–144.
30. Agustsson, E.; Tschannen, M.; Mentzer, F.; Timofte, R.; Van Gool, L. Generative Adversarial Networks for Extreme Learned Image Compression. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 221–231.
31. He, D.; Zheng, Y.; Sun, B.; Wang, Y.; Qin, H. Checkerboard Context Model for Efficient Learned Image Compression. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14766–14775.
32. Zhang, L.; Rao, A.; Agrawala, M. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 3813–3824.
33. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021.
34. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10674–10685.
35. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807.
36. Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. Neural Netw. 2018, 107, 3–11.
37. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-Local Neural Networks. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
38. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13708–13717.
39. Agustsson, E.; Mentzer, F.; Tschannen, M.; Cavigelli, L.; Timofte, R.; Benini, L.; Van Gool, L. Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations. arXiv 2017, arXiv:1704.00648.
40. Mentzer, F.; Agustsson, E.; Tschannen, M.; Timofte, R.; Gool, L.V. Conditional Probability Models for Deep Image Compression. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4394–4402.
41. Wang, W.; Bao, J.; Zhou, W.; Chen, D.; Chen, D.; Yuan, L.; Li, H. Semantic Image Synthesis via Diffusion Models. arXiv 2022, arXiv:2207.00050.
42. Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene Parsing through ADE20K Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5122–5130.
43. Chen, Z.; Duan, Y.; Wang, W.; He, J.; Lu, T.; Dai, J.; Qiao, Y. Vision Transformer Adapter for Dense Predictions. arXiv 2023, arXiv:2205.08534.
44. Wang, T.-C.; Liu, M.-Y.; Zhu, J.-Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8798–8807.
45. Park, T.; Liu, M.-Y.; Wang, T.-C.; Zhu, J.-Y. Semantic Image Synthesis With Spatially-Adaptive Normalization. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2332–2341.
Figure 1. Proposed overall architecture of low-bit-rate semantic communication system.
Figure 2. Multiscale feature extraction network.
Figure 3. Multiscale residual block.
Figure 4. Modified SDDResblock.
Figure 5. A comparison of partial results generated using various methods.
Table 1. Comparison of different methods on the ADE20K dataset. ↓ indicates the lower the better, ↑ indicates the higher the better. Bold denotes the best performance.

| Method | FID ↓ | mIoU ↑ | ACC ↑ | bpp |
|---|---|---|---|---|
| JPEG | 163.32 | 8.81 | 48.39 | 0.03 |
| JPEG2000 | 110.66 | 20.78 | 63.28 | 0.03 |
| WebP | 59.89 | 31.23 | 72.03 | 0.03 |
| BPG | 41.91 | 37.57 | 75.62 | 0.03 |
| Pix2pixHD [44] | 82.31 | 26.83 | 72.23 | 0.02 |
| SPADE [45] | 34.27 | 36.01 | 76.87 | 0.02 |
| SDM | 30.18 | 47.08 | 80.25 | 0.02 |
| Ours-S | **28.89** | **48.93** | **81.42** | 0.03 |
| JPEG | 132.86 | 25.72 | 50.57 | 0.05 |
| JPEG2000 | 97.43 | 31.37 | 68.64 | 0.05 |
| WebP | 54.72 | 35.42 | 75.36 | 0.05 |
| BPG | 37.23 | 40.36 | 77.48 | 0.05 |
| Pix2pixHD [44] | 81.80 | 27.27 | 72.61 | 0.05 |
| SPADE [45] | 33.92 | 36.28 | 77.12 | 0.05 |
| SDM | 29.23 | 47.26 | 80.75 | 0.05 |
| Ours-M | **28.63** | **49.27** | **81.77** | 0.05 |
| JPEG | 83.45 | 41.15 | 78.35 | 0.08 |
| JPEG2000 | 72.67 | 43.42 | 78.94 | 0.08 |
| WebP | 31.37 | 46.68 | 79.61 | 0.08 |
| BPG | 29.64 | 47.34 | 81.44 | 0.08 |
| Ours-L | **27.95** | **49.76** | **82.04** | 0.08 |
Table 2. During the user study, participants made their selection between our method and other methods based on the visual quality of the generated images. The numerical values in the table represent the percentage of user preferences that supported our method. The results indicate that images generated by our method were generally more favored by users.

| Comparison | Preference for Ours |
|---|---|
| Ours vs. JPEG | 82.1% |
| Ours vs. WebP | 64.4% |
| Ours vs. SPADE | 69.6% |
| Ours vs. SDM | 53.3% |
Table 3. Ablation studies about multiscale residual block and hybrid attention block. ↓ indicates the lower the better, ↑ indicates the higher the better.

| MRB ¹ | HAB ² | FID ↓ | mIoU ↑ | Acc ↑ |
|---|---|---|---|---|
|  |  | 29.12 | 48.29 | 81.07 |
|  |  | 28.98 | 48.76 | 81.36 |
|  |  | 28.94 | 48.47 | 81.22 |
|  |  | 28.89 | 48.93 | 81.42 |

¹ MRB represents multiscale residual block. ² HAB represents hybrid attention block.
Table 4. Sequential testing of attention sub-modules in hybrid attention blocks. ↓ indicates the lower the better, ↑ indicates the higher the better.

| Method (attention per stage) | FID ↓ | mIoU ↑ | Acc ↑ |
|---|---|---|---|
| SA ¹ SA SA SA | 28.97 | 48.61 | 81.35 |
| SA SA SA CA ² | 28.91 | 48.77 | 81.38 |
| SA SA CA CA | 28.89 | 48.93 | 81.42 |
| SA CA CA CA | 28.93 | 48.86 | 81.40 |
| CA CA CA CA | 29.03 | 48.52 | 81.33 |

¹ SA represents self-attention with skip connection. ² CA represents coordinate attention.

