Article

Deep Joint Source-Channel Coding for Wireless Image Transmission with Adaptive Models

1 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 Department of Broadband Communication, Peng Cheng Laboratory, Shenzhen 518066, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(22), 4637; https://doi.org/10.3390/electronics12224637
Submission received: 12 September 2023 / Revised: 2 November 2023 / Accepted: 5 November 2023 / Published: 13 November 2023

Abstract: The implementation of joint source-channel coding (JSCC) schemes using deep learning has accelerated the development of semantic communication research. Existing JSCC schemes based on deep learning (DL) are trained at a fixed signal-to-noise ratio (SNR); these trained models are therefore not suited to scenarios in which the SNR varies. To address this, a novel semantic adaptive model for semantic communication, called joint source-channel coding with adaptive models (AMJSCC), with a semantic adaptive model selection (SAMS) module is proposed. The joint source-channel encoding (JSCE) model and the joint source-channel decoding (JSCD) model adapt to both real-time channel conditions and the computational power available to the system. Furthermore, residual networks with different numbers of layers are investigated to further improve the accuracy of information recovery. Simulation results demonstrate that our model achieves higher recovery similarity and is more robust and adaptive to the SNR and communication resources. Meanwhile, compared with state-of-the-art deep JSCC methods, it reduces storage space and communication resource consumption.

1. Introduction

With the rapid growth of transmitted data volumes, the introduction of semantic communication is shifting communication systems from a focus on symbols to a focus on meaning [1,2,3]. This shift promises to improve communication efficiency and has become an important research direction for future communications [4,5,6]. Semantic communication is promising in various domains, such as brain–computer interaction [7], virtual reality [8], augmented reality [9], and mixed reality.
In Shannon’s framework, encoding is composed of two separate modules, source coding and channel coding; likewise, decoding is composed of channel decoding and source decoding. This separation is theoretically optimal in the asymptotic limit of infinitely long source and channel blocks [10]. In practice, however, separate coding can optimize each module individually but not the overall system, and it also suffers from the cliff effect [11]: once a transmission rate is selected at the transmitter, the receiver cannot reliably recover the data when the SNR falls below the corresponding channel-quality threshold.
Currently, some researchers focus on JSCC schemes that outperform the separate approach [12]. JSCC treats source coding and channel coding, and likewise channel decoding and source decoding, as a whole [13]. By jointly optimizing the source coding and channel coding modules, the JSCC scheme improves system performance and relieves the cliff effect [14]. The combination with DL has promoted the study of semantic communication. According to the modality of the source information, such studies fall roughly into three categories: text [15,16,17], image and video [12,18], and speech [19]. Bourtsoulatze, Kurka, and Gündüz proposed a JSCC model based on DL [12]. Xie, Qin, and Li designed an intelligent end-to-end JSCC scheme based on the self-attention mechanism of the transformer and extracted sentence semantics by learning the latent relationships between sentences to improve transmission efficiency [20]. Furthermore, Xie, Qin, and Li used DL techniques to represent the implied meaning of text, extracting the semantic information of long sentences, and used transfer learning to jointly train the codecs [15]. Ding, Li, and Ma designed a new deep JSCC scheme based on an autoencoder to solve the JSCC problem of multi-user image transmission over noisy channels [21].
As is well known, channel quality is one of the most important factors determining communication performance. Kurka and Gündüz made full use of channel feedback information and designed the DeepJSCC-f scheme, which effectively improved the end-to-end reconstruction quality of fixed-length transmissions in an image transmission system, reduced the average delay of variable-length transmissions, and realized the continuous refinement of images [22]. The existing neural-network-based JSCC semantic communication frameworks improve transmission accuracy and reduce bandwidth occupation to a certain extent, but they are not robust to channel fluctuations and still cannot adapt their coding to different SNRs and levels of communication power. This is because the above JSCC schemes are each trained at a specific SNR: the best performance is achieved only when transmitting at that SNR, and the networks must be retrained as the SNR changes. For example, a model trained at an SNR of 10 dB will not achieve the desired metric at an SNR of 0 dB [22,23]. Adapting to all channel transmission environments would therefore require training a large number of neural networks, consuming substantial computing power and time. This is a major challenge for scenarios with large fluctuations in the communication environment and expensive or constrained computing resources [23].
To solve this problem, Xu, Ai, and Chen proposed an adaptive deep joint source-channel coding (ADJSCC) scheme, which adopts a channel soft attention network instead of an artificially designed resource allocation strategy and can adjust the learned image features under different channel SNR conditions [18]. On this basis, Yang and Kim designed a deep joint source-channel coding method with adaptive rate control, which considers both the channel conditions and the image content and studies coding rate adaptation in wireless image transmission [23]. Furthermore, Zhang et al. proposed a predictive and adaptive deep coding (PADC) framework to solve the problem of transmission quality prediction and obtained results similar to those of ADJSCC [24]. Huang, Tao, and Gao et al. designed a coarse-to-fine image semantic coding model for a GAN-based multimedia semantic communication system and introduced a base layer and an enhancement layer to further improve the accuracy of image recovery [25]. Dong, Liang, and Xu et al. proposed the concept of semantic slice-models (SeSMs) and a new semantic measure called semantic service quality (SS), established a layer-based semantic communication system for images (LSCI) framework, and built a layered-image semantic communication system on a simulation platform, demonstrating the feasibility of the proposed system [26]. In addition to channel quality, communication resource consumption should also be considered, especially in scenarios where communication resources are expensive or limited. However, the above research did not consider the computing power of the communication system.
Liu, Guo, and Yang et al. proposed a task-oriented communication architecture based on DL; under a delay constraint, the optimization jointly considers the compression ratio of the semantic information and the available communication resources, such as bandwidth and power, so as to maximize the probability of successful transmission [27]. Yan, Qin, and Zhang et al. proposed a quality-of-experience (QoE) model for semantic communication networks and developed a quality-aware resource allocation strategy based on the number of transmitted semantic symbols, channel allocation, and power allocation [28]. Chi et al. studied the resource allocation problem of JSCC networks, maximizing the number of system users of an OFDM communication system by optimizing resource block allocation, power allocation, and the compression ratio for each user under transmission delay and performance constraints [29]. Wang et al. proposed a proximal-policy-optimization-based reinforcement learning (RL) algorithm integrated with an attention network to maximize the quality of semantic information transmission by jointly optimizing the resource allocation strategies and the portion of semantic information to be transmitted [30].
Based on the above, a joint source-channel coding with adaptive models (AMJSCC) scheme with a semantic adaptive model selection (SAMS) module is established; it adapts the model to dynamic channel conditions and the available system computing power to achieve the target communication quality. Compared with previous work, the major contributions and innovations of this paper are as follows:
  • A new semantic communication paradigm with adaptive models, namely AMJSCC, is proposed. It has two stages, a basic transmission stage and an enhanced transmission stage, and both the base layer and the enhance layer can adaptively select their models.
  • A discriminator, namely the SAMS module, is introduced. This module simultaneously considers the channel quality and the computational capability of the communication system and determines the selection strategy using a regression analysis method. It adaptively selects the appropriate JSCC model and enhance model to complete the target transmission subject to the target recovery quality while reducing the computational cost.
  • The Openimages dataset was used to test the AMJSCC performance. The results show that the proposed AMJSCC achieves better recovery performance and lower computing power consumption than the traditional communication method and state-of-the-art methods, and the SSIM and PSNR of the recovered images outperform the deep JSCC model by up to 16.5% and 17.3%, respectively.
The rest of this article is organized as follows. Section 2 introduces the proposed models. The experiments and results are discussed in Section 3, and Section 4 concludes the article.

2. Proposed Models

In this section, the framework of the proposed AMJSCC method is elaborated, including the overall architecture, the base layer, and the enhance layer.

2.1. Overall Architecture

The proposed communication system is based on the JSCC scheme. It realizes end-to-end semantic information transmission with adaptive model selection, with the goal of reconstructing the original image data at the receiver with minimal distortion. The overall architecture of the AMJSCC scheme is illustrated in Figure 1. The scheme comprises two stages, inspired by [25]. In the first stage, the base layer generates and retains the semantic information; in the second stage, the enhance layer restores the fine details of the image. The final recovered information is obtained by adding the information reconstructed in the first stage to the information obtained in the second stage.

2.2. Base Model

The base layer adopts the deep JSCC method for image transmission, as shown in Figure 1, and comprises a discriminator $S_{\gamma,\mu}$, a trainable encoder $E_\theta$, a quantizer $Q$, a non-trainable physical noise channel $\eta$, and a trainable decoder $D_\phi$, where $\mu$ and $\gamma$ denote the real-time channel quality information received from the noise channel and the system power situation information, respectively, while $\theta$ and $\phi$ denote the encoder and decoder parameters, respectively.
Denote the size of the set of original images $X$ as $N(\text{number}) \times H_x(\text{height}) \times W_x(\text{width}) \times C_x(\text{channel})$, where each element $x = (x_1, x_2, \dots, x_n)$ satisfies $x \in \mathbb{R}^n$, $n$ is the dimension of the input symbols, and $\mathbb{R}$ denotes the set of real numbers. Before an original image enters the system for processing, the discriminator makes a decision to select the appropriate JSCE and JSCD models according to the channel feedback information and the system power situation; this is discussed in detail later.
Denote the size of the set of semantic images $W$ as $N(\text{number}) \times H_w(\text{height}) \times W_w(\text{width}) \times C_w(\text{channel})$. The basic semantic features $w \in W$ of the image $x \in X$ are extracted by the JSCE $E_\theta: \mathbb{R}^n \to \mathbb{C}^m$, where $m$ is the dimension of the semantic symbols and $\mathbb{C}$ denotes the set of complex numbers. Akin to [22,24,31], we call the image size $n = H_x \times W_x \times C_x$ the source bandwidth, the channel input size $m = H_w \times W_w \times C_w$ the channel bandwidth, and $R = m/n$ the bandwidth ratio. The joint source-channel encoding (JSCE) process can be expressed as:

$$ w = E_\theta(x). \quad (1) $$
A quantizer is adopted to quantize the extracted semantic information $w$ using the nearest-neighbor quantization method, which maps the $m$-dimensional complex-valued semantic image $w$ to an $m$-dimensional complex-valued channel input sample $\hat{w}$. There are $L$ quantization center points. The quantization process can be expressed as:

$$ \hat{w}_i = Q(w_i) = \arg\min_{l} \left\| w_i - c_l \right\|, \quad i = 1, 2, \dots, m, \ l = 1, 2, \dots, L, \quad (2) $$

where $c_l$ is a quantization center point satisfying $c_l \in \mathcal{C}$, $\mathcal{C} = \{ c_1, c_2, \dots, c_L \}$.
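To make the nearest-neighbor rule concrete, the following is a minimal NumPy sketch of the quantization step in Equation (2); the center set is a hypothetical unit-power 4-point constellation, not the one used in the paper.

```python
import numpy as np

def nearest_neighbor_quantize(w, centers):
    """Map each complex symbol w_i to the center c_l minimizing |w_i - c_l| (Eq. (2))."""
    dist = np.abs(w[:, None] - centers[None, :])  # (m, L) pairwise distances
    return centers[np.argmin(dist, axis=1)]       # pick the nearest center per symbol

# Toy usage: L = 4 hypothetical centers, m = 8 semantic symbols
centers = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
w = np.random.randn(8) + 1j * np.random.randn(8)
w_hat = nearest_neighbor_quantize(w, centers)
```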
The physical channel can be represented by the function $\eta: \mathbb{C}^m \to \mathbb{C}^m$. The widely used AWGN channel model is considered. The independent and identically distributed (i.i.d.) AWGN samples form the vector $n \in \mathbb{C}^m$ with $n \sim \mathcal{CN}(0, \sigma^2 \mathbf{I})$, and the channel quality is expressed by the SNR, defined as $\mu = 10 \log_{10}(P_s / \sigma^2)$, where $\mathcal{CN}(\cdot,\cdot)$ denotes a circularly symmetric complex Gaussian distribution, $\sigma^2$ denotes the noise power, and $P_s$ denotes the transmit power. A higher SNR means better channel quality, bringing stronger anti-interference ability and better system performance. Other channel models can be incorporated into the proposed system in a similar manner as long as the channel transfer function allows gradient computation and error back-propagation.
After power normalization, $\hat{w}$ is transmitted over the physical channel. The information passing through the physical channel can be expressed as:

$$ y = \eta(\hat{w}) = \hat{w} + n. \quad (3) $$
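The power normalization and AWGN step can be sketched as follows in NumPy; the unit transmit power $P_s = 1$ after normalization is an assumption, from which the noise power follows directly from the SNR definition above.

```python
import numpy as np

def awgn_channel(w_hat, snr_db):
    """Normalize to unit average symbol power, then add complex AWGN (Eq. (3))."""
    m = w_hat.size
    w_norm = w_hat * np.sqrt(m / np.sum(np.abs(w_hat) ** 2))  # enforce P_s = 1
    sigma2 = 10.0 ** (-snr_db / 10.0)                          # sigma^2 = P_s / SNR
    noise = np.sqrt(sigma2 / 2) * (np.random.randn(m) + 1j * np.random.randn(m))
    return w_norm + noise
```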
At the receiver, the source image is restored. The joint source-channel decoding (JSCD) process can be expressed as:

$$ x' = D_\phi(y). \quad (4) $$

Combining Equations (1)–(4) shows that:

$$ x' = D_\phi(\eta(Q(E_\theta(x)))), \quad (5) $$

which can be written compactly as $x' = F(\theta, \phi, x)$.
The distortion between the original image $x$ and the restored image $x'$ is expressed as:

$$ d(x, x') = \mathrm{MSE}(x, x') + \alpha \cdot \mathrm{SSIM}(x, x') + \beta \cdot \mathrm{LPIPS}(x, x'). \quad (6) $$
The mean squared error (MSE), structural similarity (SSIM), and learned perceptual image patch similarity (LPIPS) terms measure the similarity of two images. $\mathrm{MSE}(x, x') = \frac{1}{n} \sum_{i=1}^{n} (x_i - x'_i)^2$ is the mean squared error between the input image $x$ and the reconstructed image $x'$, which constrains the reconstructed image to be consistent with the input image at each pixel.
$\mathrm{SSIM}(x, x') = 1 - l(x, x') \cdot c(x, x') \cdot s(x, x')$ is the structural-similarity term between the input image $x$ and the reconstructed image $x'$, which constrains the reconstructed image to be consistent with the input image in terms of luminance, contrast, and structure. It consists of three losses: $l(x, x') = \frac{2 \mu_x \mu_{x'} + c_1}{\mu_x^2 + \mu_{x'}^2 + c_1}$ is the luminance loss, $c(x, x') = \frac{2 \sigma_x \sigma_{x'} + c_2}{\sigma_x^2 + \sigma_{x'}^2 + c_2}$ is the contrast loss, and $s(x, x') = \frac{\sigma_{xx'} + c_3}{\sigma_x \sigma_{x'} + c_3}$ is the structural loss. The symbols $\mu_x$ and $\mu_{x'}$ denote the means of $x$ and $x'$; $\sigma_x$ and $\sigma_{x'}$ denote the standard deviations of $x$ and $x'$; $\sigma_{xx'}$ denotes the covariance of $x$ and $x'$; and $c_1$, $c_2$, $c_3$ are constants [32].
LPIPS is the learned perceptual image patch similarity between the input image and the reconstructed image, also known as the perceptual loss. LPIPS aligns better with human perception than traditional metrics such as the peak signal-to-noise ratio (PSNR) and the SSIM [33]; a lower LPIPS value indicates higher similarity between two images, while a higher value indicates greater differences.
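As an illustration, the distortion of Equation (6) can be sketched in PyTorch with the third-party pytorch_msssim and lpips packages standing in for the SSIM and LPIPS terms; the package choice and the [0, 1] input range are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # pip install pytorch-msssim
import lpips                     # pip install lpips

lpips_fn = lpips.LPIPS(net="alex")  # perceptual metric of [33]

def distortion(x, x_rec, alpha=0.01, beta=0.001):
    """d(x, x') = MSE + alpha * (1 - l*c*s) + beta * LPIPS, following Eq. (6).

    x, x_rec: (N, 3, H, W) tensors in [0, 1]; alpha and beta follow Section 3.1.
    """
    mse = F.mse_loss(x_rec, x)
    ssim_term = 1.0 - ssim(x_rec, x, data_range=1.0)        # SSIM(x, x') as defined above
    lpips_term = lpips_fn(x_rec * 2 - 1, x * 2 - 1).mean()  # lpips expects [-1, 1]
    return mse + alpha * ssim_term + beta * lpips_term
```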
In the above deep JSCC semantic communication process, the parameters of the JSCE and JSCD are jointly optimized to minimize the distortion between the original image $x$ and the restored image $x'$, which can be expressed as:

$$ (\theta^*, \phi^*) = \arg\min_{\theta, \phi} \mathbb{E}\left[ d(x, x') \right]. \quad (7) $$

2.3. Enhance Layer

Textures and details in regulated semantic areas of an image can be enhanced by the enhance layer, as shown in Figure 1. The symbol $z$ denotes the residual image between $x$ and $x'$, where $x$ is the source information and $x'$ is the image reconstructed by the base model:

$$ z = x - x'. \quad (8) $$
The residual $z$ is first processed by the enhance model to obtain the enhanced residual information $z'$, which can be expressed as:

$$ z' = R_\varphi(z). \quad (9) $$
Then, $z'$ is transmitted over the noisy channel $\eta$, yielding the received residual $\hat{z}$.
The parameters of the enhance layer are optimized to minimize the distortion between the original residual $z$ and the received residual $\hat{z}$, which can be expressed as:

$$ \varphi^* = \arg\min_{\varphi} \mathbb{E}\left[ d(z, \hat{z}) \right]. \quad (10) $$
Finally, the reconstruction $x'$ obtained in the first stage and the detailed information $\hat{z}$ obtained in the second stage are summed to obtain the final recovered information $\hat{x}$, which can be expressed as:

$$ \hat{x} = x' + \hat{z}. \quad (11) $$
The distortion between the original image $x$ and the final recovered image $\hat{x}$ is expressed as:

$$ d(x, \hat{x}) = \mathrm{MSE}(x, \hat{x}) + \alpha \cdot \mathrm{SSIM}(x, \hat{x}) + \beta \cdot \mathrm{LPIPS}(x, \hat{x}). \quad (12) $$
The optimization objective (7) can then be extended to:

$$ (\theta^*, \phi^*, \varphi^*) = \arg\min_{\theta, \phi, \varphi} \mathbb{E}\left[ d(x, \hat{x}) \right]. \quad (13) $$
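The two-stage flow of Equations (8)–(13) can be summarized with the following toy PyTorch sketch; TinyAE is a deliberately minimal stand-in for the trained JSCE/JSCD and enhance networks, and a real-valued AWGN proxy replaces the complex channel for brevity.

```python
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    """Minimal stand-in for an encoder/channel/decoder stage (not the paper's model)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, 16, 3, stride=2, padding=1)
        self.dec = nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1)

    def forward(self, x, snr_db=10.0):
        w = self.enc(x)
        sigma = (10.0 ** (-snr_db / 10.0)) ** 0.5   # unit-power assumption
        y = w + sigma * torch.randn_like(w)         # real-valued AWGN proxy
        return self.dec(y)

base, enhance = TinyAE(), TinyAE()
x = torch.rand(1, 3, 64, 64)
x_base = base(x)                 # stage 1: coarse reconstruction x'
z_hat = enhance(x - x_base)      # stage 2: residual z enhanced and retransmitted
x_final = x_base + z_hat         # Eq. (11): x_hat = x' + z_hat
```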

2.4. Semantic Adaptive Model Selection (SAMS) Module

The SAMS module is proposed to select the appropriate JSCE and JSCD models as the encoder and decoder of the base model and to determine the number of residual network (ResNet) blocks in the enhance model. Specifically, to ensure the accuracy of semantic information transmission, a low-computing-power transmission model is adopted when the SNR is high, while a high-computing-power transmission model with a low bandwidth ratio is adopted when the SNR is low. The structure of the proposed policy network is shown in Figure 2.
The SAMS discriminator receives the real-time channel quality information $\mu \in \mathbb{R}$ and the real-time available computing power information $\gamma \in \mathbb{R}$ of the system, where $\mu$ is the channel SNR. Suppose that the JSCE, the JSCD, and the enhance model each have $k \in \mathbb{N}^+$ candidate models to choose from. According to the computing power requirements $P_r$ of the different JSCC models stored in the system, weighed against the available power $\gamma$, the SAMS discriminator makes a decision $S$ that determines the JSCE model $E_{\theta_i}$ used to extract the semantic information, which can be expressed as:

$$ w = E_{\theta_i}(x), \quad i = 1, 2, \dots, k. \quad (14) $$
The information that has been quantized by the quantizer and passed through the physical channel is decoded by the JSCD model $D_{\phi_i}$ corresponding to the selected JSCE model $E_{\theta_i}$, and the reconstructed information $x'$ is obtained, which can be expressed as:

$$ x' = D_{\phi_i}(y), \quad i = 1, 2, \dots, k. \quad (15) $$
The residual image $z$ of $x$ and $x'$ is feature-enhanced by the enhance model selected by the discriminator, which can be expressed as:

$$ z' = R_{\varphi_i}(z), \quad i = 1, 2, \dots, k. \quad (16) $$
In order to explore the accuracy of image semantic reconstruction under different SNRs and communication resources, we adjust the number of layers of the basic model in LSCI [26] to obtain different candidate JSCC models for the base layer and then add different enhance models. The computational demands of these models and their image restoration performance are calculated after image reconstruction. In general, the functional relationship between channel quality and image restoration performance is nonlinear. In particular, this nonlinear function $f(r)$ should satisfy the following properties, where $r$ denotes the SNR and $f(r)$ the SSIM.
  • SSIM is the structural similarity; it satisfies $0 \le f(r) \le 1$.
  • Since a higher SNR provides higher signal transmission quality, $f(r)$ is a monotonically increasing function of $r$.
  • As the SNR increases, the magnitude of the partial derivative $\left| \partial f(r) / \partial r \right|$ gradually decreases and becomes zero when the SNR is sufficiently high, meaning that increasing the SNR no longer helps information recovery.
  • As the SNR decreases, $f(r)$ approaches 0, and the magnitude of the partial derivative $\left| \partial f(r) / \partial r \right|$ gradually decreases and becomes zero when the SNR is sufficiently low. That is, the original image cannot be reconstructed when the SNR is too small, so decreasing the SNR further does not affect the image restoration quality.
According to the four properties mentioned above, the following nonlinear model, a deformation of the sigmoid function, can be used to capture the relationship between the SNR and the SSIM:

$$ f(r) = t_1 \left( 1 \,/\, \left( 1 + e^{-(r/t_2) - t_3} \right) \right), \quad (17) $$

where $t_1$, $t_2$, $t_3$ are the model parameters. As a deformation of the sigmoid function, it has low computational cost and is convenient for higher-order derivative analysis.
Let $r_i(n)$ denote the $n$-th test SNR value, $\mathrm{SNR}_{\mathrm{test}}$, used when testing the $i$-th model, and let $u_i(n)$ denote the corresponding measured SSIM value, $\mathrm{SSIM}_{\mathrm{act}}$, obtained by testing the $i$-th model at $r_i(n)$. According to (17), the relationship between the SNR and the SSIM of the proposed models is $f(r) = t_1 (1/(1 + e^{-(r/t_2) - t_3}))$, where $r$ represents $\mathrm{SNR}_{\mathrm{test}}$ and $f(r)$ represents the estimated SSIM. The estimate $\mathrm{SSIM}_{\mathrm{est}}$ for the $n$-th test SNR of the $i$-th model can therefore be represented as $f_i(r_i(n))$, and the single difference between $\mathrm{SSIM}_{\mathrm{act}}$ and $\mathrm{SSIM}_{\mathrm{est}}$ is $f_i(r_i(n)) - u_i(n)$. Define the total difference over $N$ trials as $L(t_i)$, which can be expressed as:

$$ L(t_i) = \frac{1}{2} \sum_{n=1}^{N} \left( f_i(r_i(n)) - u_i(n) \right)^2. \quad (18) $$
The parameter vector $t_i = (t_{i1}, t_{i2}, t_{i3})$ that minimizes the objective function $L(t_i)$ can be obtained by the nonlinear least squares method, yielding an instance of (17) with specific parameter values.
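A minimal SciPy sketch of this fitting step is given below; the sign convention in the exponent follows the reconstruction of Equation (17) above, and the measurement arrays are hypothetical stand-ins for one model's test results.

```python
import numpy as np
from scipy.optimize import curve_fit

def f(r, t1, t2, t3):
    """SSIM-vs-SNR model of Eq. (17): f(r) = t1 / (1 + exp(-(r/t2) - t3))."""
    return t1 / (1.0 + np.exp(-(r / t2) - t3))

# Hypothetical test data for one model: N test SNRs (dB) and measured SSIMs
r = np.array([0.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0, 21.0, 24.0])
u = np.array([0.31, 0.47, 0.62, 0.75, 0.84, 0.89, 0.91, 0.92, 0.92])

# Nonlinear least squares for t = (t1, t2, t3), i.e., minimizing L(t) of Eq. (18)
t_opt, _ = curve_fit(f, r, u, p0=[0.9, 3.0, 0.0])
sse = np.sum((f(r, *t_opt) - u) ** 2)  # fit quality, cf. the SSE in Section 3.4
```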
Different models thus have different instances of (17) with different parameter values, obtained by the method described above. The intersection points of the resulting function curves serve as the basis for choosing between coding schemes, and these intersections can be found by Algorithm 1. For convenience of analysis, consider the $p$-th and the $q$-th model, where $p = 1, 2, \dots, k-1$ and $q = p + 1$, and search for their intersection in order of increasing SNR. We assume that if the difference between the two $\mathrm{SSIM}_{\mathrm{est}}$ values is less than or equal to 0.02, the two function curves approximately intersect at $\mathrm{SNR}_{pq}$. If no such approximate intersection exists, set $\mathrm{SNR}_{pq} = +\infty$. $\mathrm{SNR}_{pq}$ represents the SNR threshold for selecting either the $p$-th or the $q$-th model. Given $k$ models, $k-1$ SNR thresholds are obtained, namely $\mathrm{SNR}_{12}, \mathrm{SNR}_{23}, \dots, \mathrm{SNR}_{(k-1)k}$. Sorting these $k-1$ values from smallest to largest gives $\mathrm{SNR}_1^*, \mathrm{SNR}_2^*, \dots, \mathrm{SNR}_{k-1}^*$. Since these SNR thresholds serve as the basis for choosing between coding schemes, the discriminator can choose the semantic information coding scheme according to the channel quality.
Algorithm 1 Semantic Adaptive Model Selection Algorithm
1: Input: $f_i(r_i(n)) = t_{i1} \left( 1 / \left( 1 + e^{-(r_i(n)/t_{i2}) - t_{i3}} \right) \right)$, $i = 1, 2, \dots, k$, $n = 1, 2, \dots, N$
2: Initialize the set of model selection thresholds $\epsilon = (d_1, d_2, \dots, d_{k-1})$, $d_i = 0$, $i = 1, 2, \dots, k-1$
3: for $p = 1$ to $k-1$, $q = p + 1$ do
4:   for $n = 1$ to $N$ do
5:     repeat
6:       $\Delta_{pq}(n) = f_p(r_p(n)) - f_q(r_q(n))$;
7:     until $|\Delta_{pq}(n)| \le 0.02$ or $\Delta_{pq}(n) > 0.02$ for $n = 1$ to $N$
8:     set $\mathrm{SNR}_{pq} = r_p(n) = r_q(n)$ or $\mathrm{SNR}_{pq} = +\infty$;
9:   end for
10:  Sort $\mathrm{SNR}_{pq}$ from smallest to largest
11: end for
12: Output: the set of model selection thresholds $\epsilon = (\mathrm{SNR}_1^*, \mathrm{SNR}_2^*, \dots, \mathrm{SNR}_{k-1}^*)$
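For reference, a compact Python rendering of Algorithm 1 under the reconstructed form of Equation (17) might look as follows; the SNR grid and its step size are placeholders, and the exact thresholds depend on the grid used.

```python
import numpy as np

def f(r, t1, t2, t3):
    return t1 / (1.0 + np.exp(-(r / t2) - t3))  # Eq. (17)

def selection_thresholds(params, r_grid, tol=0.02):
    """Approximate-intersection thresholds of consecutive curves (Algorithm 1).

    params: per-model (t1, t2, t3) tuples ordered by complexity; r_grid: increasing
    test SNRs. Returns sorted thresholds, with inf where curves never come within tol.
    """
    thresholds = []
    for p in range(len(params) - 1):
        diff = np.abs(f(r_grid, *params[p]) - f(r_grid, *params[p + 1]))
        close = np.where(diff <= tol)[0]            # first SNR where the curves meet
        thresholds.append(r_grid[close[0]] if close.size else np.inf)
    return sorted(thresholds)

# Usage with the Table 3 fits (high-, medium-, low-complexity order)
params = [(0.94, 3.66, 2.11), (0.93, 2.81, 1.49), (0.92, 3.23, -0.66)]
eps = selection_thresholds(params, np.arange(0.0, 28.0, 3.0))
```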
At the same time, the discriminator takes into account the real-time system resources and selects the semantic information encoding and decoding scheme according to the policy $S$. The optimization of objective (13) can then be expressed as:

$$ (\theta^*, \phi^*, \varphi^*) = \arg\min_{\theta, \phi, \varphi, S} \mathbb{E}\left[ d(x, \hat{x}) \right]. \quad (19) $$

3. Experiments and Results Discussion

In this section, the network architecture, dataset, and training settings are described in detail, and a series of experiments is conducted to evaluate the image restoration performance of the proposed model. In the following discussion, bpp (bits per pixel) indicates the information compression rate, and the performance of the AMJSCC model with fixed bpp and with unfixed bpp is discussed under different computing power budgets and channel conditions.

3.1. Settings

In the end-to-end transmission experiments, 100 k and 1 k images from the Openimages dataset are chosen as the training and test data, respectively. All input images are resized to 512 × 512, and the Adam stochastic optimizer is adopted. For the loss function settings in Equation (12), we set $\alpha = 0.01$ and $\beta = 0.001$.
The specific neural network setup of the base layer is shown in Table 1 and Table 2. The JSCE architecture consists of convolution modules applied to the image source, as listed in Table 1. Each convolution module has a convolution layer, an instance normalization layer, and a leaky rectified linear unit (Leaky ReLU) activation, except for the last module, which adopts a hyperbolic tangent. The JSCD structure mirrors that of the JSCE; the difference is that the JSCD adds nine residual blocks after its first module, and different models use different parameter settings, as shown in Table 2. The main difference between base layers is the number of encoding and decoding layers.
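To illustrate the module pattern of Table 1, here is a minimal PyTorch sketch of the four-weight-layer JSCE variant; the stride-2 downsampling and padding are assumptions, since Table 1 specifies only the kernel width (3) and the channel counts.

```python
import torch.nn as nn

def conv_module(c_in, c_out, last=False):
    """Conv -> InstanceNorm -> LeakyReLU, with Tanh on the final module (Table 1)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),  # stride assumed
        nn.InstanceNorm2d(c_out),
        nn.Tanh() if last else nn.LeakyReLU(),
    )

def jsce_4layer(C=16):
    """Four-weight-layer JSCE column of Table 1 with bottleneck depth C."""
    return nn.Sequential(
        conv_module(3, 64),
        conv_module(64, 128),
        conv_module(128, 256),
        conv_module(256, C, last=True),
    )
```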
The specific neural network setup of the enhance layer is shown in Figure 3. The enhance model consists of ResNet blocks followed by a convolution block. Each ResNet block consists of modules, where each module has a convolution layer, an instance normalization layer, and a ReLU activation, except for the last module, which adopts a sigmoid. The convolution block following the last ResNet block consists of three modules, each with a convolution layer and a ReLU activation, except for the last, which adopts a sigmoid. The difference between enhance models is the number of ResNet blocks.
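Likewise, a sketch of one enhance-layer ResNet block under the description above; the channel width and the two-module depth per block are assumptions.

```python
import torch
import torch.nn as nn

class EnhanceResBlock(nn.Module):
    """Conv/InstanceNorm modules with ReLU, sigmoid on the last module, plus a skip."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.Sigmoid(),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection
```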

3.2. The Results of Models with Fixed bpp

The image restoration quality under different channel environments is discussed in this section for the deep JSCC model with an enhancement layer (EDJSCC) and the plain deep JSCC (DJSCC) model.
The EDJSCC models are obtained by adding enhance models with different numbers of ResNet blocks to DJSCC models. The DJSCC models differ in the number of convolutional layers and in their training SNRs; correspondingly, the EDJSCC models also differ in the number of ResNet blocks. For example, EDJSCC_L6_C16_E3 ($\mathrm{SNR}_{\mathrm{train}} = 10$ dB) refers to the EDJSCC model whose JSCE has six convolutional layers and bottleneck depth C = 16; E3 means the enhance module has three ResNet blocks, and $\mathrm{SNR}_{\mathrm{train}} = 10$ dB means the JSCC model was trained at an SNR of 10 dB. The corresponding base model of EDJSCC_L6_C16_E3 ($\mathrm{SNR}_{\mathrm{train}} = 10$ dB) is DJSCC_L6_C16 ($\mathrm{SNR}_{\mathrm{train}} = 10$ dB). The dashed and solid lines with regular hexagon markers in Figure 4a show the relationship between the test SNR and the SSIM for these two models, respectively. The bpp of all models in Figure 4 is 0.1875. The following conclusions can be drawn from Figure 4a.
First, adding the enhance layer effectively improves the accuracy of image restoration. Comparing the curves of each EDJSCC model and its corresponding DJSCC model in Figure 4a shows that the two-stage transmission method further improves the accuracy of semantic information recovery. The EDJSCC model achieves better semantic recovery accuracy than the DJSCC model, with an SSIM above 0.88 when SNR > 4 dB, and the SSIM gain grows with increasing SNR. This proves that the enhance layer recovers the detailed semantic information of the images.
Next, the proposed EDJSCC models reconstruct the original images well across different SNRs and can reduce communication overhead as well as save memory space. The performance improvement produced by the enhance model is most pronounced in two cases: for models trained at a higher SNR when the JSCC structures are the same, and for models with fewer convolutional layers when the JSCC structures differ. Typically, the performance of a model trained at a specific SNR degrades at test SNRs that differ from the training SNR; as Figure 4a shows, the model trained at a low SNR outperforms the model trained at a high SNR when SNR < 6 dB. With the proposed EDJSCC, however, the model trained at a high SNR achieves performance similar to the model trained at a low SNR when 6 dB < SNR < 21 dB. It can also be observed in Figure 4a that EDJSCC models with four convolutional layers outperform those with six convolutional layers when SNR > 21 dB. These results demonstrate that EDJSCC models covering all SNRs need to be trained at only a few specific SNRs, which effectively reduces the memory footprint of the system.
The above conclusions about SSIM are also applicable to PSNR, as shown in Figure 4b.

3.3. The Results of Models with Un-Fixed bpps

In order to further improve the performance of the model, unfixed bpps are adopted. The performance of models with different compression factors, channel conditions, computing power requirements, and training SNRs is considered. The architecture of the basic models is shown in Table 1 and Table 2.
The three DJSCC models have four, five, and six encoding layers, respectively, matching the settings in Table 1 and Table 2. Their training SNRs are 10 dB, 5 dB, and 0 dB, respectively, and their bpps are 0.75, 0.375, and 0.1875, respectively. Correspondingly, three EDJSCC models are obtained by adding different enhancement layers on top of these DJSCC models: one, two, and three ResNet blocks are added to the DJSCC models with four, five, and six coding layers, respectively. The three EDJSCC models are EDJSCC_L6_C16_E3 ($\mathrm{SNR}_{\mathrm{train}} = 0$ dB), EDJSCC_L5_C8_E2 ($\mathrm{SNR}_{\mathrm{train}} = 5$ dB), and EDJSCC_L4_C16_E1 ($\mathrm{SNR}_{\mathrm{train}} = 10$ dB). For convenience, we call these the ‘high-complexity model’, ‘medium-complexity model’, and ‘low-complexity model’, respectively. The performance of the three DJSCC models and the three EDJSCC models on the Openimages test set is compared using the SSIM and PSNR metrics, as shown in Figure 5.
First, decreasing the compression rate of the EDJSCC model further improves the system performance. Comparing the curve of EDJSCC_L4_C16_E1 ($\mathrm{SNR}_{\mathrm{train}} = 10$ dB) in Figure 5 with that of EDJSCC_L4_C4_E1 ($\mathrm{SNR}_{\mathrm{train}} = 10$ dB) in Figure 4, whose bpps are 0.75 and 0.1875, respectively, shows that the image restoration performance increases with the bpp. Comparing the curves of EDJSCC_L6_C16_E3 ($\mathrm{SNR}_{\mathrm{train}} = 0$ dB) and EDJSCC_L4_C16_E1 ($\mathrm{SNR}_{\mathrm{train}} = 10$ dB) in Figure 5, whose bpps are 0.1875 and 0.75, respectively, shows that when SNR > 24 dB, the low-complexity model achieves almost the same performance as the high-complexity model. In Figure 4, by contrast, the curves of EDJSCC_L6_C16_E3 ($\mathrm{SNR}_{\mathrm{train}} = 0$ dB) and EDJSCC_L4_C4_E1 ($\mathrm{SNR}_{\mathrm{train}} = 0$ dB), which share the same bpp, still differ by about 0.02 when SNR > 24 dB. This indicates that transmitting more semantic features effectively improves image restoration performance when the channel condition is good.
Next, as the number of stored EDJSCC model types increases, the computing power required for image transmission can be further reduced. Figure 5 shows that the medium-complexity model achieves performance similar to the high-complexity model under most channel conditions. Therefore, if more models are available during training, the application scenario can be refined further, so that fewer trained semantic transmission models suited to a given scenario need to be stored. Meanwhile, since different EDJSCC models produce similar recovery performance under some channel conditions, there is no need to train a network for every SNR.
At the same time, the worse the channel conditions, the larger the image restoration advantage of the high-complexity model over the low-complexity model. Target transmission performance can therefore be achieved by consuming more computing power at low SNRs. This matches the conclusions drawn in Figure 4 for models with the same bpp but different computational complexity.
The above results demonstrate that the three proposed EDJSCC models can adapt the model to the changes in SNR and can effectively reduce the computing power consumption of the system.

3.4. The Results of the AMJSCC Model Compared with the Baselines

Let $k = 3$, and let $f_3(r)$, $f_2(r)$, and $f_1(r)$ denote the SNR–SSIM relationship functions of the high-complexity, medium-complexity, and low-complexity models, respectively. Applying the nonlinear least squares method to expression (17), the parameters of these three functions are obtained as $t_3 = (t_{31}, t_{32}, t_{33})$, $t_2 = (t_{21}, t_{22}, t_{23})$, and $t_1 = (t_{11}, t_{12}, t_{13})$, respectively. The parameter values are shown in Table 3.
The sum of squared errors (SSE) is used to evaluate how well the fitted function matches the original data points; SSE values closer to zero indicate a better fit. The SSE values of the three fitted curves are 0.000029, 0.000063, and 0.002037, respectively, indicating that the proposed fitting functions effectively capture the relationship between channel quality and image transmission quality, as shown in Figure 6.
Then, $\mathrm{SNR}_1^* = 6$ dB and $\mathrm{SNR}_2^* = 21$ dB are obtained by Algorithm 1. The performance of the AMJSCC model is shown in Figure 7, which compares three DJSCC models, three EDJSCC models, the proposed AMJSCC model, JPEG, JPEG2000, DeepJSCC, and LSCI on the Openimages test set using the SSIM and PSNR metrics. JPEG and JPEG2000 use half-rate LDPC coding with 4QAM and are all set at a bpp of roughly 0.75. The settings for the DJSCC and EDJSCC models are the same as those in Figure 5, and the settings for LSCI with the channel-slice model follow the reference article [26].
As Figure 7 shows, the performance of the AMJSCC model is never below that of any EDJSCC model at the same test SNR and is higher than that of all the other comparison models in this paper. The SSIM outperforms the DeepJSCC model ($\mathrm{SNR}_{\mathrm{train}} = 5$ dB) by up to 16.5%, as shown in Figure 7c, and the PSNR outperforms the DeepJSCC model ($\mathrm{SNR}_{\mathrm{train}} = 5$ dB) by up to 16.3%, as shown in Figure 7d. Meanwhile, the AMJSCC model effectively relieves the cliff effect and is more efficient in the low-SNR regime than JPEG and JPEG2000. The AMJSCC model therefore effectively improves the reconstruction quality of the transmitted images.
In Figure 7c,d, we can observe that the AMJSCC model adapts to all channel transmission environments even though only three models are trained. Notably, although the AMJSCC model is trained at only a few specific SNRs ($\mathrm{SNR}_{\mathrm{train}} = 0$ dB, 5 dB, 10 dB), it still attains high and stable recovery performance at all test SNRs by deploying models with different computational complexities and compression ratios. Specifically, as shown in Figure 7c, the SSIM of AMJSCC stays above 0.86 and reaches 0.94 at its highest. The reason is that, in the AMJSCC scheme, the high-complexity model, which has a lower bpp and consumes more computing power, is selected in high-noise environments to obtain a better recovery effect, while the low-complexity model, which has a higher bpp and consumes less computing power, is selected in low-noise environments to reduce the computational complexity. In conclusion, AMJSCC balances channel quality against the computing resources consumed, achieving the target communication effect while saving computational power, as also shown in Figure 8.

3.5. Complexity Analysis

A brief discussion of the computational complexity of the proposed AMJSCC is provided in this section. The number of floating-point operations (FLOPs) and parameters of the proposed base models, enhance models, EDJSCC models, and the AMJSCC model when processing a 512 × 512 × 3 image are shown in Table 4.
According to $\mathrm{SNR}_1^* = 6$ dB and $\mathrm{SNR}_2^* = 21$ dB, obtained from Algorithm 1, the number of FLOPs or parameters can be expressed by the following formula:

$$ v = \sum_{i=1}^{3} p_i \cdot v_i, \quad (20) $$

where $v$ is the weighted number of FLOPs or parameters of the adaptive model, and $v_1$, $v_2$, and $v_3$ are the numbers of FLOPs or parameters of the models with different numbers of convolutional layers. For example, in the JSCE module, $v_1$, $v_2$, and $v_3$ represent the number of FLOPs or parameters of the JSCE module with four, five, and six convolutional layers, respectively. Further, $p_1 = 3/10$, $p_2 = 1/2$, and $p_3 = 1/5$ are the probabilities of selecting the adaptive models with the corresponding numbers of convolutional layers; these weights reproduce the AMJSCC row of Table 4. The number of parameters of the adaptive model can also be obtained by (20), and the calculation results are shown in Table 4.
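As a quick arithmetic check, the following snippet applies Equation (20) to the EDJSCC rows of Table 4 and recovers the AMJSCC row (the model labels are informal shorthand for the three complexity levels):

```python
# Weighted complexity of the adaptive model, Eq. (20), using Table 4 (GMac / M)
flops  = {"low (L4)": 283.90, "medium (L5)": 684.38, "high (L6)": 2061.07}
params = {"low (L4)": 8.06,   "medium (L5)": 10.17,  "high (L6)": 39.15}
probs  = {"low (L4)": 3 / 10, "medium (L5)": 1 / 2,  "high (L6)": 1 / 5}

v_flops  = sum(probs[k] * flops[k]  for k in probs)   # -> 839.57 GMac
v_params = sum(probs[k] * params[k] for k in probs)   # -> 15.33 M
```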
Table 4 and Figure 8 show that the computing power required by the largest model reaches 7.25 times that required by the smallest model, which indicates that AMJSCC can be applied to most channel environments and spans a wide range of model computing power.

4. Conclusions

In this paper, we have proposed a novel, semantic-driven, end-to-end image transmission method named AMJSCC. A semantic adaptive model selection (SAMS) module is proposed to reduce the communication consumption caused by model propagation and to enable flexible model selection under different requirements on model performance, channel conditions, and transmission goals. An adaptive enhancement model is then adopted to retransmit the residual information obtained from the base layer to further improve the quality of image restoration. Experimental results show that, compared with the deep JSCC method, the AMJSCC method improves the SSIM by up to 16.5% and the PSNR by up to 17.3%. Meanwhile, the proposed AMJSCC method covers a wide range of computing power, with the smallest model requiring only 0.14 times the computing power of the largest. It is worth noting that the AMJSCC method consumes more computing power than the deep JSCC method in exchange for higher image restoration quality. In conclusion, the proposed semantic communication framework adapts to channel conditions and system computing power and achieves highly stable image transmission quality while limiting network computational complexity.

Author Contributions

Conceptualization, M.S.; methodology, M.S.; software, M.S.; validation, M.S.; formal analysis, M.S., N.M. and C.D.; writing—original draft, M.S.; writing—review and editing, M.S., N.M., C.D., X.X. and P.Z.; investigation, M.S., N.M., C.D., X.X. and P.Z.; project administration, P.Z.; funding acquisition, P.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was jointly supported by the National Natural Science Foundation of China (62293480, 62293481) and the Major Key Project of the PCL Department of Broadband Communication under Grant PCL2023AS1-1.

Data Availability Statement

Data are contained within the article.

Acknowledgments

We acknowledge the equal contribution of all the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, P.; Xu, W.; Gao, H.; Niu, K.; Xu, X.; Qin, X.; Yuan, C.; Qin, Z.; Zhao, H.; Wei, J.; et al. Toward Wisdom-Evolutionary and Primitive-Concise 6G: A New Paradigm of Semantic Communication Networks. Engineering 2021, 8, 60–73.
  2. Shi, G.; Gao, D.; Song, X.; Chai, J.; Yang, M.; Xie, X.; Li, L.; Li, X. A new communication paradigm: From bit accuracy to semantic fidelity. arXiv 2021, arXiv:2101.12649.
  3. Zhong, Y. A theory of semantic information. China Commun. 2017, 14, 1–17.
  4. Gündüz, D.; Qin, Z.; Aguerri, I.E.; Dhillon, H.S.; Yang, Z.; Yener, A.; Wong, K.K.; Chae, C.B. Beyond transmitting bits: Context, semantics, and task-oriented communications. IEEE J. Sel. Area Commun. 2022, 41, 5–41.
  5. Luo, X.; Chen, H.; Guo, Q. Semantic communications: Overview, open issues, and future research directions. IEEE Wirel. Commun. 2022, 29, 210–219.
  6. Ma, N.; Song, M.; Liu, Y.; Dong, C. Description and Measurement of Semantic Information for the Intelligent Machine Communication. J. Beijing Univ. Posts Telecommun. 2022, 45, 12–21.
  7. Anilkumar, P.; Venugopal, P. A Survey on Semantic Segmentation of Aerial Images using Deep Learning Techniques. In Proceedings of the 2021 Innovations in Power and Advanced Computing Technologies (i-PACT), Kuala Lumpur, Malaysia, 27–29 November 2021; pp. 1–7.
  8. Catalá, A.; Jaén, J.; Mocholí, J.A. A Semantic Publish/Subscribe Approach for U-VR Systems Interoperation. In Proceedings of the 2008 International Symposium on Ubiquitous Virtual Reality, Gwangju, Republic of Korea, 10–13 July 2008; pp. 29–32.
  9. Dang, T.N.; Nguyen, L.X.; Le, H.Q.; Kim, K.; Kazmi, S.A.; Park, S.B.; Huh, E.N.; Hong, C.S. Semantic Communication for AR-based Services in 5G and Beyond. In Proceedings of the 2023 International Conference on Information Networking (ICOIN), Bangkok, Thailand, 11–14 January 2023; pp. 549–553.
  10. Shannon, C. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
  11. Kokalj-Filipovic, S.; Soljanin, E. Suppressing the cliff effect in video reproduction quality. Bell Labs Tech. J. 2012, 16, 171–185.
  12. Bourtsoulatze, E.; Kurka, D.B.; Gündüz, D. Deep joint source-channel coding for wireless image transmission. IEEE Trans. Cogn. Commun. 2019, 5, 576–579.
  13. Guionnet, T.; Guillemot, C. Joint source-channel decoding of quasiarithmetic codes. In Proceedings of the Data Compression Conference (DCC), Snowbird, UT, USA, 23–25 March 2004; pp. 272–281.
  14. Dai, J.; Wang, S.; Tan, K.; Si, Z.; Qin, X.; Niu, K.; Zhang, P. Nonlinear transform source-channel coding for semantic communications. IEEE J. Sel. Area Commun. 2022, 40, 2300–2316.
  15. Xie, H.; Qin, Z.; Li, G.Y.; Juang, B.H. Deep Learning Enabled Semantic Communication Systems. IEEE Trans. Signal Proces. 2021, 69, 2663–2675.
  16. Farsad, N.; Rao, M.; Goldsmith, A. Deep Learning for Joint Source-Channel Coding of Text. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 2326–2330.
  17. Agarwal, N.; Seth, P.; Meleet, M. A New Sentence Similarity Computing Technique Using Order and Semantic Similarity. In Proceedings of the 2021 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), Chennai, India, 24–25 September 2021; pp. 1–5.
  18. Xu, J.; Ai, B.; Chen, W.; Yang, A.; Sun, P.; Rodrigues, M. Wireless image transmission using deep source channel coding with attention modules. IEEE Trans. Circ. Syst. Vid. 2021, 32, 2315–2328.
  19. Weng, Z.; Qin, Z.; Li, G.Y. Semantic communications for speech signals. In Proceedings of the 2021 IEEE International Conference on Communications (ICC), Montreal, QC, Canada, 14–23 June 2021; pp. 1–6.
  20. Xie, H.; Qin, Z.; Li, G.Y.; Juang, B.H. Deep Learning based Semantic Communications: An Initial Investigation. In Proceedings of the 2020 IEEE Global Communications Conference (GLOBECOM), Taipei, Taiwan, 7–11 December 2020; pp. 1–6.
  21. Ding, M.; Li, J.; Ma, M.; Fan, X. SNR-adaptive deep joint source-channel coding for wireless image transmission. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 1555–1559.
  22. Kurka, D.B.; Gündüz, D. DeepJSCC-f: Deep joint source-channel coding of images with feedback. IEEE J. Sel. Area Commun. 2020, 1, 178–193.
  23. Yang, M.; Kim, H.S. Deep joint source-channel coding for wireless image transmission with adaptive rate control. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 5193–5197.
  24. Zhang, W.; Zhang, H.; Ma, H.; Shao, H.; Wang, N.; Leung, V.C. Predictive and Adaptive Deep Coding for Wireless Image Transmission in Semantic Communication. IEEE Trans. Wirel. Commun. 2023, 8, 5486–5501.
  25. Huang, D.; Tao, X.; Gao, F.; Lu, J. Deep learning-based image semantic coding for semantic communications. In Proceedings of the 2021 IEEE Global Communications Conference (GLOBECOM), Madrid, Spain, 7–11 December 2021; pp. 1–6.
  26. Dong, C.; Liang, H.; Xu, X.; Han, S.; Wang, B.; Zhang, P. Semantic Communication System Based on Semantic Slice Models Propagation. IEEE J. Sel. Area Commun. 2022, 41, 202–213.
  27. Liu, C.; Guo, C.; Yang, Y.; Jiang, N. Adaptable semantic compression and resource allocation for task-oriented communications. arXiv 2022, arXiv:2204.08910.
  28. Yan, L.; Qin, Z.; Zhang, R.; Li, Y.; Li, G.Y. QoE-Aware Resource Allocation for Semantic Communication Networks. In Proceedings of the 2022 IEEE Global Communications Conference (GLOBECOM), Rio de Janeiro, Brazil, 4–8 December 2022; pp. 3272–3277.
  29. Chi, K.; Yang, Q.; Yang, Z.; Duan, Y.; Zhang, Z. Resource Allocation for Capacity Optimization in Joint Source-Channel Coding Systems. arXiv 2022, arXiv:2211.11412.
  30. Wang, Y.; Chen, M.; Luo, T.; Saad, W.; Niyato, D.; Poor, H.V.; Cui, S. Performance optimization for semantic communications: An attention-based reinforcement learning approach. IEEE J. Sel. Area Commun. 2022, 40, 2598–2613.
  31. Kurka, D.B.; Gündüz, D. Successive refinement of images with deep joint source-channel coding. In Proceedings of the 2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Cannes, France, 2–5 July 2019; pp. 1–5.
  32. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
  33. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595.
Figure 1. Overall structure of the proposed deep AMJSCC with two stages.
Figure 2. The structure of the proposed policy network.
Figure 3. The structure of the enhance layer with ResNet blocks.
Figure 4. Performance comparison of EDJSCC and DJSCC with fixed bpp 0.1875. (a) The SSIM performance comparison of EDJSCC models and DJSCC models with fixed bpp. (b) The PSNR performance comparison of EDJSCC models and DJSCC models with fixed bpp.
Figure 5. Performance comparison of EDJSCC and DJSCC with unfixed bpps of 0.1875, 0.375, and 0.75 with the number of convolution layers as 4, 5, and 6, respectively. (a) The SSIM performance comparison of EDJSCC models and DJSCC models with unfixed bpps. (b) The PSNR performance comparison of EDJSCC models and DJSCC models with unfixed bpps.
Figure 6. Curve-fitting results: (a) the model trained at SNR = 0 dB with 6 encoding layers in the base model, 3 ResNet blocks in the enhance model, and bpp = 0.1875; (b) the model trained at SNR = 5 dB with 5 encoding layers, 2 ResNet blocks, and bpp = 0.375; (c) the model trained at SNR = 10 dB with 4 encoding layers, 1 ResNet block, and bpp = 0.75.
Figure 7. Performance comparison of AMJSCC, EDJSCC with different layers, DeepJSCC, LSCI with channel-slice model, JPEG with LDPC coding in 4QAM, and JPEG2000 with LDPC coding in 4QAM. (a) The SSIM performance comparison of the DJSCC, EDJSCC and AMJSCC. (b) The PSNR performance comparison of the DJSCC, EDJSCC and AMJSCC. (c) The SSIM performance comparison of the AMJSCC, DeepJSCC, LSCI, JPEG and JPEG2000. (d) The PSNR performance comparison of the AMJSCC, DeepJSCC, LSCI, JPEG and JPEG2000.
Figure 8. The FLOPs and parameters of the base models, enhance models, and the EDJSCC models. (a) The number of FLOPs of the base models, enhance models, and the EDJSCC models. (b) The number of parameters of the base models, enhance models, and the EDJSCC models.
Table 1. JSCE structure of the base model (ConvNet configuration; input: 3 × 512 × 512 images).

| 4 weight layers | 5 weight layers | 6 weight layers |
|---|---|---|
| Conv 3-64 | Conv 3-64 | Conv 3-64 |
| InstanceNorm (64) | InstanceNorm (64) | InstanceNorm (64) |
| LeakyReLU | LeakyReLU | LeakyReLU |
| Conv 3-128 | Conv 3-128 | Conv 3-128 |
| InstanceNorm (128) | InstanceNorm (128) | InstanceNorm (128) |
| LeakyReLU | LeakyReLU | LeakyReLU |
| Conv 3-256 | Conv 3-256 | Conv 3-256 |
| InstanceNorm (256) | InstanceNorm (256) | InstanceNorm (256) |
| LeakyReLU | LeakyReLU | LeakyReLU |
| Conv 3-C | Conv 3-512 | Conv 3-512 |
| InstanceNorm (C) | InstanceNorm (512) | InstanceNorm (512) |
| Tanh | LeakyReLU | LeakyReLU |
| / | Conv 3-C | Conv 3-1024 |
| / | InstanceNorm (C) | InstanceNorm (1024) |
| / | Tanh | LeakyReLU |
| / | / | Conv 3-C |
| / | / | InstanceNorm (C) |
| / | / | Tanh |
Table 2. JSCD structure of the base model (ConvNet configuration; input: C × 512 × 512 images).

| 4 weight layers | 5 weight layers | 6 weight layers |
|---|---|---|
| ConvT 3-512 | ConvT 3-512 | ConvT 3-1024 |
| InstanceNorm (512) | InstanceNorm (512) | InstanceNorm (1024) |
| ReLU | ReLU | ReLU |
| ResBlock (512, 512) × 9 | ResBlock (512, 512) × 9 | ResBlock (1024, 1024) × 9 |
| ConvT 3-256 | ConvT 3-256 | ConvT 3-512 |
| InstanceNorm (256) | InstanceNorm (256) | InstanceNorm (512) |
| ReLU | ReLU | ReLU |
| ConvT 3-128 | ConvT 3-128 | ConvT 3-256 |
| InstanceNorm (128) | InstanceNorm (128) | InstanceNorm (256) |
| LeakyReLU | LeakyReLU | ReLU |
| ConvT 3-3 | ConvT 3-64 | ConvT 3-128 |
| InstanceNorm (3) | InstanceNorm (64) | InstanceNorm (128) |
| Tanh | ReLU | ReLU |
| / | ConvT 3-3 | ConvT 3-64 |
| / | InstanceNorm (3) | InstanceNorm (64) |
| / | Tanh | ReLU |
| / | / | ConvT 3-3 |
| / | / | InstanceNorm (3) |
| / | / | Tanh |
Table 3. The parameter values of the SNR–SSIM relationship function for the high-complexity, medium-complexity, and low-complexity models, with 95% confidence bounds.

| Parameter | Value | 95% Confidence Bounds |
|---|---|---|
| t31 | 0.94 | (0.9388, 0.9425) |
| t32 | 3.66 | (3.288, 4.022) |
| t33 | 2.11 | (2.06, 2.158) |
| t21 | 0.93 | (0.9295, 0.9345) |
| t22 | 2.81 | (2.552, 3.061) |
| t23 | 1.49 | (1.443, 1.538) |
| t11 | 0.92 | (0.9021, 0.9336) |
| t12 | 3.23 | (2.812, 3.653) |
| t13 | −0.66 | (−0.81, −0.5082) |
Table 4. The number of FLOPs and parameters for the different proposed models.

| Model | FLOPs (GMac) | Params (M) |
|---|---|---|
| DJSCC_L4_C16 | 57.86 | 7.22 |
| DJSCC_L5_C8 | 91.05 | 7.91 |
| DJSCC_L6_C16 | 116.12 | 31.73 |
| Enhance model_E1 | 226.04 | 0.84 |
| Enhance model_E2 | 593.33 | 2.26 |
| Enhance model_E3 | 1944.95 | 7.42 |
| EDJSCC_L4_C16_E1 | 283.90 | 8.06 |
| EDJSCC_L5_C8_E2 | 684.38 | 10.17 |
| EDJSCC_L6_C16_E3 | 2061.07 | 39.15 |
| AMJSCC | 839.57 | 15.33 |
