Semantic communication is typically task-oriented. In image transmission, the semantic information of an image refers to the meaning of its content. For a specific task, extracting the task-relevant semantic information and discarding irrelevant information reduces the amount of data that must be transmitted. The task of the SC-IRT algorithm proposed in this paper is image recognition, where the task-relevant semantic information comprises the targets at sea in the images collected by the USVs, such as ships, buoys, and shore objects, while irrelevant information comprises the background parts of the image, such as the sea and sky.
2.2. Image Semantic Encoding with Autoencoder Network
The autoencoder consists of an encoder and a decoder. The encoder transforms the input image into a low-dimensional latent semantic representation, while the decoder reconstructs the image from this representation. Because the latent representation has far fewer dimensions than the input, the autoencoder effectively compresses the image. The operation of the autoencoder network is as follows:
An input image datum x is processed by the encoder network to obtain a new data matrix, which represents the low-dimensional latent semantic information w of the image. At the same time, the decoder network aims to reconstruct the original image based on the semantic information. The mathematical expression for the encoding process is as follows:

w = e(Wx + b),  (1)

where θ = {W, b} represents the parameters of the encoder network, e is the activation function, W is the weight matrix, x is the input image data, and b is the bias vector. From this, it can be seen that the encoding process consists of a linear computation followed by a non-linear activation function.
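As a toy illustration of this linear-then-nonlinear structure, the following Python snippet applies Equation (1) to a flattened image; all dimensions are placeholders, and tanh merely stands in for the unspecified activation function e.

```python
import torch

# Toy illustration of Equation (1): a linear computation followed by a
# non-linear activation. All dimensions are placeholder values.
x = torch.randn(784)       # flattened input image data
W = torch.randn(128, 784)  # weight matrix of the encoder layer
b = torch.randn(128)       # bias vector
w = torch.tanh(W @ x + b)  # low-dimensional latent semantics (tanh stands in for e)
```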
To maximize the preservation of semantic features from the original image during reconstruction, the semantic encoder network in this paper consists of an autoencoder encoding network E, a quantizer Q and an entropy coder. The encoding network employs a Convolutional Neural Network (CNN) to reduce the dimensionality of the image data into a low-dimensional latent semantic representation, significantly reducing the amount of data to be transmitted while retaining the image’s semantics.
The encoding network consists of six convolutional layers: one convolutional layer with a stride of one and a kernel size of seven, four convolutional layers with a stride of two and a kernel size of three, and one bottleneck layer with a stride of one and a kernel size of three. The bottleneck layer has a channel depth of C. The specific parameter settings for each layer of the encoding network are shown in Figure 2. It is assumed that the input image size is w × h × c, where w, h, and c represent the width, height, and depth of the image, respectively (the number of color channels; for an RGB image, c is 3). The image undergoes several downsampling operations through the layers, resulting in a feature matrix of 1024 channels with size (w/16) × (h/16) × 1024, whose width and height are 1/16 of those of the original input image. This significantly reduces the amount of data. Finally, the feature matrix is fed into the bottleneck layer for dimensionality reduction; the bottleneck layer controls the bit rate of the low-dimensional latent semantic representation through its channel depth C. The final semantic information obtained is a low-dimensional latent semantic w of size (w/16) × (h/16) × C. A LeakyReLU activation function is applied after each convolutional layer to enhance the non-linear relationships between adjacent convolutional layers.
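The description above corresponds to a small convolutional network. The following PyTorch sketch is one possible realization: the kernel sizes, strides, 1024-channel feature depth, and bottleneck depth C come from the text, while the intermediate channel widths (64/128/256/512) and padding choices are assumptions.

```python
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Minimal sketch of the six-layer encoding network E (cf. Figure 2)."""

    def __init__(self, in_channels: int = 3, C: int = 8):
        super().__init__()
        self.layers = nn.Sequential(
            # Stride-1, kernel-7 input layer.
            nn.Conv2d(in_channels, 64, kernel_size=7, stride=1, padding=3),
            nn.LeakyReLU(0.2),
            # Four stride-2, kernel-3 layers: total downsampling factor of 16.
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            # Bottleneck: stride 1, kernel 3; channel depth C controls the bit rate.
            nn.Conv2d(1024, C, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, c, h, w) -> latent w: (N, C, h/16, w/16)
        return self.layers(x)

# Example: a 256 x 256 RGB image yields a 16 x 16 x C latent.
w_latent = SemanticEncoder(C=8)(torch.randn(1, 3, 256, 256))  # (1, 8, 16, 16)
```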
The subsequent step is quantization of the low-dimensional latent semantic w obtained from the encoding network of the autoencoder. In image encoding, quantization is essential for reducing the amount of transmitted data: it lowers the message entropy of the image, allowing higher compression ratios by converting the image into a smaller bitstream. After the image data are transformed into a low-dimensional latent semantic representation through the encoding network of the autoencoder, the data are in a floating-point format, which occupies a large amount of storage space. Therefore, it is necessary to quantize the low-dimensional latent semantic representation.
The principle of quantization is to change the data type of the low-dimensional latent semantic representation from floating-point values to integers. The quantization function used in this paper is defined by Equation (2), where w is the low-dimensional latent semantic output from the encoding network of the autoencoder and ŵ is the quantized result. It is important to note that the quantization process is non-differentiable, which means that gradients cannot be propagated back through the quantizer to update the parameters of the layers that precede it. To address this issue, additional techniques are required to enable gradient backpropagation, such as adding random uniform noise. This paper adopts a method that converts the multi-base representation to an integer representation during quantization, which reduces the information loss of the quantization process. By adding random uniform noise during training, gradients can propagate through the quantization step, making the image encoding and decoding more efficient and stable and allowing for more effective transmission of the encoded images.
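A minimal sketch of the uniform-noise technique mentioned above, in the form commonly used in learned image compression, is given below; the paper's exact multi-base quantization function of Equation (2) may differ.

```python
import torch

def quantize(w: torch.Tensor, training: bool) -> torch.Tensor:
    """Differentiable surrogate for hard quantization.

    Training: add uniform noise in [-0.5, 0.5) so gradients can flow
    through the (otherwise non-differentiable) rounding step.
    Inference: round to the nearest integer for transmission.
    """
    if training:
        return w + torch.empty_like(w).uniform_(-0.5, 0.5)
    return torch.round(w)
```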
2.3. Image Semantic Decoding with Generative Adversarial Networks
In the SC-IRT algorithm, the original image is processed and quantized by the semantic encoder to obtain the low-dimensional latent semantic information ŵ. Before transmission, ŵ undergoes Low-Density Parity-Check (LDPC) channel encoding, and the coded signal is sent over the channel, as shown in Equation (3). During the model training phase, this paper adopts the Additive White Gaussian Noise (AWGN) model as the simulated channel:

s′ = h·s + n,  (3)

where s is the output of the LDPC channel encoding, s′ is the received signal, h is the channel gain, and n is the independent Gaussian noise.
It is worth noting that the AWGN channel model simulates the information distortion that occurs in the real physical channel, including the effects of channel gain h and independent Gaussian noise n. This allows the entire USV visual image transmission system to be robust against noise and enhances its reliability.
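For concreteness, a minimal simulation of such an AWGN channel might look as follows; the SNR-based noise calibration and the scalar gain h are assumptions, as the paper does not specify them here.

```python
import torch

def awgn_channel(s: torch.Tensor, snr_db: float, h: float = 1.0) -> torch.Tensor:
    """Simulate transmission over an AWGN channel with gain h.

    snr_db sets the signal-to-noise ratio; the noise power is derived
    from the empirical signal power (an assumed calibration).
    """
    signal_power = s.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))
    n = torch.randn_like(s) * noise_power.sqrt()  # independent Gaussian noise
    return h * s + n
```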
In the SC-IRT algorithm, the received low-dimensional latent semantic information ŵ′ at the receiving end is passed through the generator G to generate the reconstructed image y, as shown in Equation (4):

y = G(ŵ′).  (4)

This process utilizes the generative capabilities of the generator network to reconstruct the image from the received semantic information.
The original image x is encoded by the semantic encoder and then decoded by the semantic decoder, resulting in the reconstructed image y. The probability distribution of the original images is denoted as p_x, while the probability distribution of the images generated by the generator G is denoted as p_y. The discriminator D is trained through adversarial learning using both the original images x and the generated images y. By training the discriminator appropriately, it becomes more effective at extracting the features that characterize the original image distribution p_x.
This adversarial training process helps the generator G to approximate the distribution of the original images and to generate images that are more similar to the original ones. It encourages the generator to produce images that are more likely to be classified as real by the discriminator. As a result, the generator learns to capture the underlying characteristics of the original images and to generate high-quality reconstructed images y that preserve the semantics of the original images.
The semantic decoder in this paper aims to minimize the distortion of semantic information while tolerating some pixel-level errors. The goal is not to generate an exact replica of the original image but to reconstruct the parts of the image that are hardest to recover, namely the semantic features. This approach allows for maximum restoration of the semantic information in the image, even though individual pixel values may differ. A GAN model can effectively restore the semantic information in an image and is therefore well suited to the design of the SC-IRT algorithm; accordingly, this paper selects a GAN model as the semantic decoder for SC-IRT.
The adversarial training of GAN models has significant advantages in restoring semantic information. In this paper, a GAN model is employed as the semantic decoder, which can reconstruct the semantic information into images. The reconstructed images are similar to the input images and visually natural. The naturalness can be measured by the adversarial difference between the probability distributions of the reconstructed and original images, while visual similarity is measured using semantic perceptual loss in the feature space.
The specific architecture of the semantic decoder network is shown in Figure 3. The input image x is processed by the semantic encoder to extract the low-dimensional latent semantic information w, which is quantized into a low-dimensional latent semantic code ŵ, losslessly entropy-encoded, and transmitted through the physical channel. The semantic decoder is based on a GAN model and consists of a generator G and three discriminators (D1, D2, and D3).
Even if the semantic encoder E cannot preserve precise details, the generator G can still produce output images that conform to the distribution of real images, avoiding artifacts or blocky structures. The encoder E and generator G are trained together and optimized using the discriminator D.
The generator G network consists of one convolutional block, nine residual blocks, four upsampling blocks, and one convolutional layer. The specific network parameters are shown in Figure 4.
First, the semantic information is processed by a convolutional block, which includes one convolutional layer, one instance normalization layer, and one ReLU activation layer; the kernel size is 3 × 3, the stride is one, and the number of channels is 1024. Next come nine residual blocks, each consisting of two convolutional layers, two instance normalization layers, and one ReLU activation layer, with a kernel size of 3 × 3, a stride of one, and 1024 channels. These are followed by four upsampling blocks, each including one deconvolutional layer, one normalization layer, and one ReLU activation layer, with a kernel size of 3 × 3, a stride of two (so that the four blocks together restore the 16× downsampling performed by the encoder), and 512, 256, 128, and 64 channels, respectively. Finally, there is a convolutional layer with a kernel size of 7 × 7, a stride of one, and c output channels, matching the number of color channels of the reconstructed image.
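A minimal PyTorch sketch consistent with this description and Figure 4 is given below; the padding and output_padding choices are assumptions needed to make the spatial dimensions work out, and the latent depth C and output depth are placeholders.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convs with instance normalization and an identity skip."""
    def __init__(self, channels: int = 1024):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    """Sketch of G: conv block, nine residual blocks, four 2x upsampling
    blocks (512/256/128/64 channels), and a final 7x7 conv to an image."""
    def __init__(self, C: int = 8, out_channels: int = 3):
        super().__init__()
        blocks = [
            nn.Conv2d(C, 1024, 3, stride=1, padding=1),
            nn.InstanceNorm2d(1024),
            nn.ReLU(inplace=True),
        ]
        blocks += [ResidualBlock(1024) for _ in range(9)]
        in_ch = 1024
        for out_ch in (512, 256, 128, 64):
            blocks += [
                # Each transposed conv doubles the spatial resolution.
                nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2,
                                   padding=1, output_padding=1),
                nn.InstanceNorm2d(out_ch),  # assumed instance normalization
                nn.ReLU(inplace=True),
            ]
            in_ch = out_ch
        blocks += [nn.Conv2d(64, out_channels, 7, stride=1, padding=3)]
        self.net = nn.Sequential(*blocks)

    def forward(self, w_hat):
        # w_hat: (N, C, h/16, w/16) -> reconstructed image (N, 3, h, w)
        return self.net(w_hat)
```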
The discriminator D is an important component of the GAN model and is also composed of multiple layers of convolutional neural networks. It is responsible for distinguishing the authenticity of images, i.e., determining whether they are generated images or original images from the dataset. As a crucial part of the GAN model, the discriminator D is trained in parallel with the generator G to improve the quality of generated images and to make them as close to the original images as possible. In this paper, the discriminator D is unique. It not only receives generated images and original images but also receives semantic information. This means that the generated images are evaluated not only based on their quality but also based on their semantic content. The discriminator plays a crucial role in guiding the generator to generate images that not only look realistic but also capture the desired image semantics.
To enable discrimination of images at multiple scales, this paper adopts a multi-scale discriminator structure, as shown in Figure 3. The discriminator consists of three independent discriminators, each responsible for distinguishing feature differences between the original and generated images at a different scale. In the discrimination process, the first discriminator directly distinguishes the local, pixel-level feature differences between the original and generated images. The second discriminator operates on versions of the original and generated images downsampled by a factor of two, so the dimensions of the new images are halved and each pixel in them represents four pixels in the original images. Similarly, the third discriminator operates on the previous step's images downsampled once more by a factor of two. The three discriminators are denoted as D1, D2, and D3, respectively; each has independent training parameters, but their architectures are similar. The discriminators progressively guide the generator G to generate images that are closer in global semantics to the original images. D3 has the largest discrimination range, with each of its input pixels representing 16 pixels in the original image, and it guides the generator to produce images that best match the global semantic features of the originals. D1 and D2 gradually guide the generator towards lower-level details, aiming to generate images that are more similar to the overall appearance of the original images.
The specific parameters of the discriminator network are shown in Figure 5. The original image x and the generated image y are fed as inputs to the discriminator D1. The input images undergo convolution in the first convolutional layer of the D1 network, which has 64 channels, a kernel size of three, and a stride of two; the output of this layer is then passed through a LeakyReLU activation function. Next, the images go through three convolutional block networks in the discriminator. Each block includes a convolutional layer, an instance normalization layer, and a ReLU activation function, and performs spatial downsampling after its convolution. The number of channels in each convolutional block is twice that of the previous block, with channel sizes of 128, 256, and 512, respectively; the kernel size for all blocks is three, and the stride is two. As the number of channels doubles in each block, the number of feature maps increases while the spatial size decreases. After the three convolutional blocks, a regular convolutional layer follows with a channel size of one, a kernel size of three, and a stride of one. Finally, a Sigmoid function is applied to the output, as the discriminator essentially serves as a binary classifier estimating the probability that the input image is real. To obtain the inputs for discriminators D2 and D3, the original image x and the generated image y are downsampled once and twice, respectively, with a downsampling factor of two, and the downsampled images are fed into the convolutional networks of D2 and D3. Each discriminator outputs a result at its own scale, and the outputs are combined by weighted summation to obtain the output D(x), as shown in Equation (5):

D(x) = a1·D1(x) + a2·D2(x) + a3·D3(x),  (5)

where a1, a2, and a3 are the weighting coefficients.
A multi-scale discriminator is used in this paper to employ a more efficient convolutional layer structure and to enable information fusion across different scales. This approach minimizes image distortion at each scale, discriminating progressively from high-resolution to low-resolution images and reducing image complexity, which allows for better reconstruction of high-quality images and minimizes the distortion of semantic information in the images.
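The following sketch puts the pieces together: one scale discriminator following the layer description of Figure 5, replicated three times over a two-times-downsampled image pyramid and fused by the weighted summation of Equation (5). The use of average pooling for downsampling, the mean over patch scores before fusion, and the equal default weights are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleDiscriminator(nn.Module):
    """One Di (cf. Figure 5): 64-channel stride-2 conv + LeakyReLU, three
    conv blocks (128/256/512, stride 2, instance norm, ReLU), then a
    1-channel conv and a sigmoid giving per-patch real probabilities."""
    def __init__(self, in_channels: int = 3):
        super().__init__()
        layers = [nn.Conv2d(in_channels, 64, 3, stride=2, padding=1),
                  nn.LeakyReLU(0.2)]
        in_ch = 64
        for out_ch in (128, 256, 512):
            layers += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                       nn.InstanceNorm2d(out_ch),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        layers += [nn.Conv2d(512, 1, 3, stride=1, padding=1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, img):
        return self.net(img)

class MultiScaleDiscriminator(nn.Module):
    """D1 sees the full resolution; D2 and D3 see 2x and 4x downsampled
    inputs. Scores are fused by weighted summation as in Equation (5)."""
    def __init__(self, weights=(1.0, 1.0, 1.0)):
        super().__init__()
        self.discriminators = nn.ModuleList(ScaleDiscriminator() for _ in range(3))
        self.weights = weights  # a1, a2, a3

    def forward(self, img):
        out, x = 0.0, img
        for a_i, d_i in zip(self.weights, self.discriminators):
            out = out + a_i * d_i(x).mean()    # average the patch scores
            x = F.avg_pool2d(x, kernel_size=2)  # next, coarser pyramid level
        return out
```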
2.4. Loss Functions
The choice of loss function plays a crucial role in training the network model. In GAN models, the Mean Squared Error (MSE) loss and the Learned Perceptual Image Patch Similarity (LPIPS) loss are commonly used. The MSE loss computes the difference between each pair of corresponding pixels in two images and quantifies their discrepancy, which is important for maintaining similarity between the original and generated images; however, it is not effective at measuring differences in texture or structure. The LPIPS loss, on the other hand, measures the textural and structural differences between the original and generated images, and minimizing it makes the generated images visually closer to the originals. Therefore, in training the network model, a combination of the MSE loss and the LPIPS loss is commonly used to ensure both pixel-level similarity and the preservation of texture and structure in the generated images.
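A combined distortion loss of this kind can be written compactly with the reference LPIPS implementation; the weighting factors below are illustrative placeholders, as the paper does not state its weights here.

```python
import torch
import lpips  # pip install lpips; the LPIPS metric of Zhang et al.

mse_loss = torch.nn.MSELoss()
lpips_loss = lpips.LPIPS(net='alex')  # perceptual distance in feature space

def distortion_loss(x: torch.Tensor, y: torch.Tensor,
                    alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Pixel-level MSE plus perceptual LPIPS between original x and
    reconstruction y, both (N, 3, H, W) in [-1, 1]. alpha and beta are
    placeholder weights for the two terms."""
    return alpha * mse_loss(y, x) + beta * lpips_loss(x, y).mean()
```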
Before conducting adversarial training, this paper first trains with the MSE loss and the LPIPS loss alone, computing the pixel differences and texture-structure differences between the original image x and the generated image y. These two loss functions help guide the training of the generator G and the discriminator D in the right direction. Pretraining the generator G in this way before the final adversarial training benefits its performance: the discriminator D can learn more useful information, and the adversarial loss becomes more reasonable. Without this step, directly conducting adversarial training on the GAN model may leave a significant gap between the images generated by G and the original images, making them easily distinguishable and leading to vanishing gradients for the generator G.
The loss function of the semantic communication image transmission algorithm in this paper includes the adversarial losses of the generator and discriminator as well as the MSE loss and LPIPS loss on the images. The USV visual image transmission algorithm in this paper is based on semantic encoding using autoencoders and semantic decoding using GANs to achieve low-bit-rate image transmission, and adversarial training effectively addresses the image blurring and contour loss that occur at low bit rates. The adversarial loss is defined between the generator G and the discriminator D; the GAN network consists of one generator and three discriminators. The loss function for the generator G is shown in Equation (6).
The loss function of the discriminator D is shown in Equation (7).
Thus, the overall adversarial loss is defined as in Equation (8), where x denotes the input raw image, w is the compressed latent semantic representation, m is the number of discriminators, and λi is the weighting factor of the i-th discriminator.
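As a hedged illustration of such a multi-discriminator adversarial objective, the sketch below uses the standard non-saturating GAN form with per-discriminator weights λi over a two-times-downsampled image pyramid; the exact form of Equations (6)-(8) in this paper may differ.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(x, y, discriminators, lambdas):
    """Non-saturating GAN losses summed over m scale discriminators.

    x: original images; y: images generated from the received latent;
    discriminators: [D1, D2, D3], each returning per-patch probabilities
    in (0, 1); lambdas: weighting factors λi for each discriminator.
    """
    eps = 1e-7  # numerical floor so log() stays finite
    d_loss, g_loss = 0.0, 0.0
    for lam, d in zip(lambdas, discriminators):
        real = d(x).clamp(eps, 1 - eps)
        fake = d(y.detach()).clamp(eps, 1 - eps)
        # Discriminator: push originals toward 1, generated images toward 0.
        d_loss = d_loss + lam * (-torch.log(real) - torch.log(1 - fake)).mean()
        # Generator: push generated images toward 1 (non-saturating form).
        g_loss = g_loss + lam * (-torch.log(d(y).clamp(eps, 1 - eps))).mean()
        # Move to the next, coarser scale of the image pyramid.
        x, y = F.avg_pool2d(x, 2), F.avg_pool2d(y, 2)
    return g_loss, d_loss
```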