Semantic communication is typically task-oriented. In image transmission, the semantic information of an image refers to the meaning of its content. For a specific task, extracting the task-relevant semantic information and discarding irrelevant information reduces the amount of data that must be transmitted. The task of the SC-IRT algorithm proposed in this paper is image recognition, where the task-relevant semantic information comprises the targets at sea in the images collected by the USVs, such as ships, buoys, and shore objects, while irrelevant information comprises the background parts of the image, such as the sea and sky.
2.2. Image Semantic Encoding with Autoencoder Network
The autoencoder consists of an encoder and a decoder. The encoder transforms the input image into a low-dimensional latent semantic representation, while the decoder reconstructs the image from this representation. Because the latent representation has far fewer dimensions than the input, the autoencoder effectively compresses the image. The operation of the autoencoder network is as follows:
An input image datum x is processed by the encoder network to obtain a new data matrix, which represents the low-dimensional latent semantic information w of the image. At the same time, the decoder network aims to reconstruct the original image based on the semantic information. The mathematical expression for the encoding process is as follows:

w = e(Wx + b),  (1)

where θ = {W, b} represents the parameters of the encoder network, e is the activation function, W is the weight matrix, x is the input image data, and b is the bias vector. From this, it can be seen that the encoding process consists of a linear computation followed by a non-linear activation function.
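As a toy illustration of this linear-then-nonlinear structure, the following Python snippet applies Equation (1) to a flattened image; all dimensions are placeholders, and tanh merely stands in for the unspecified activation function e.

```python
import torch

# Toy illustration of Equation (1): a linear computation followed by a
# non-linear activation. All dimensions are placeholder values.
x = torch.randn(784)       # flattened input image data
W = torch.randn(128, 784)  # weight matrix of the encoder layer
b = torch.randn(128)       # bias vector
w = torch.tanh(W @ x + b)  # low-dimensional latent semantics (tanh stands in for e)
```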
To maximize the preservation of semantic features from the original image during reconstruction, the semantic encoder network in this paper consists of an autoencoder encoding network E, a quantizer Q and an entropy coder. The encoding network employs a Convolutional Neural Network (CNN) to reduce the dimensionality of the image data into a low-dimensional latent semantic representation, significantly reducing the amount of data to be transmitted while retaining the image’s semantics.
The encoding network consists of six convolutional layers: one convolutional layer with a stride of one and a kernel size of seven, four convolutional layers with a stride of two and a kernel size of three, and one bottleneck layer with a stride of one and a kernel size of three. The bottleneck layer has a channel depth of C. The specific parameter settings for each layer of the encoding network are shown in Figure 2. It is assumed that the input image size is w × h × c, where w, h, and c represent the width, height, and depth of the image, respectively (the number of color channels; for an RGB image, c is 3). The image undergoes several downsampling operations through the layers, resulting in a feature matrix of 1024 channels with size (w/16) × (h/16) × 1024, whose width and height are 1/16 of those of the original input image. This significantly reduces the amount of data. Finally, the feature matrix is fed into the bottleneck layer for dimensionality reduction; the bottleneck layer controls the bit rate of the low-dimensional latent semantic representation through its channel depth C. The final semantic information obtained is a low-dimensional latent semantic w of size (w/16) × (h/16) × C. A LeakyReLU activation function is applied after each convolutional layer to enhance the non-linear relationships between adjacent convolutional layers.
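The description above corresponds to a small convolutional network. The following PyTorch sketch is one possible realization: the kernel sizes, strides, 1024-channel feature depth, and bottleneck depth C come from the text, while the intermediate channel widths (64/128/256/512) and padding choices are assumptions.

```python
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Minimal sketch of the six-layer encoding network E (cf. Figure 2)."""

    def __init__(self, in_channels: int = 3, C: int = 8):
        super().__init__()
        self.layers = nn.Sequential(
            # Stride-1, kernel-7 input layer.
            nn.Conv2d(in_channels, 64, kernel_size=7, stride=1, padding=3),
            nn.LeakyReLU(0.2),
            # Four stride-2, kernel-3 layers: total downsampling factor of 16.
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            # Bottleneck: stride 1, kernel 3; channel depth C controls the bit rate.
            nn.Conv2d(1024, C, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, c, h, w) -> latent w: (N, C, h/16, w/16)
        return self.layers(x)

# Example: a 256 x 256 RGB image yields a 16 x 16 x C latent.
w_latent = SemanticEncoder(C=8)(torch.randn(1, 3, 256, 256))  # (1, 8, 16, 16)
```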
The subsequent step is quantization of the low-dimensional latent semantic w obtained from the encoding network of the autoencoder. In image encoding, quantization is essential for reducing the amount of transmitted data: it lowers the message entropy of the image, allowing higher compression ratios by converting the image into a smaller bitstream. After the image data are transformed into a low-dimensional latent semantic representation through the encoding network of the autoencoder, the data are in a floating-point format, which occupies a large amount of storage space. Therefore, it is necessary to quantize the low-dimensional latent semantic representation.
The principle of quantization is to change the data type of the low-dimensional latent semantic representation from floating-point values to integers. The quantization function used in this paper is defined by Equation (2), where w is the low-dimensional latent semantic output from the encoding network of the autoencoder and ŵ is the quantized result. It is important to note that the quantization process is non-differentiable, which means that gradients cannot be propagated back through the quantizer to update the parameters of the layers that precede it. To address this issue, additional techniques are required to enable gradient backpropagation, such as adding random uniform noise. This paper adopts a method that converts the multi-base representation to an integer representation during quantization, which reduces the information loss of the quantization process. By adding random uniform noise during training, gradients can propagate through the quantization step, making the image encoding and decoding more efficient and stable and allowing for more effective transmission of the encoded images.
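A minimal sketch of the uniform-noise technique mentioned above, in the form commonly used in learned image compression, is given below; the paper's exact multi-base quantization function of Equation (2) may differ.

```python
import torch

def quantize(w: torch.Tensor, training: bool) -> torch.Tensor:
    """Differentiable surrogate for hard quantization.

    Training: add uniform noise in [-0.5, 0.5) so gradients can flow
    through the (otherwise non-differentiable) rounding step.
    Inference: round to the nearest integer for transmission.
    """
    if training:
        return w + torch.empty_like(w).uniform_(-0.5, 0.5)
    return torch.round(w)
```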
2.3. Image Semantic Decoding with Generative Adversarial Networks
In the SC-IRT algorithm, the original image is processed and quantized by the semantic encoder to obtain the low-dimensional latent semantic information ŵ. Before transmission, ŵ undergoes Low-Density Parity-Check (LDPC) channel encoding, and the coded signal is sent over the channel, as shown in Equation (3). During the model training phase, this paper adopts the Additive White Gaussian Noise (AWGN) model as the simulated channel:

s′ = h·s + n,  (3)

where s is the output of the LDPC channel encoding, s′ is the received signal, h is the channel gain, and n is the independent Gaussian noise.
It is worth noting that the AWGN channel model simulates the information distortion that occurs in the real physical channel, including the effects of channel gain h and independent Gaussian noise n. This allows the entire USV visual image transmission system to be robust against noise and enhances its reliability.
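For concreteness, a minimal simulation of such an AWGN channel might look as follows; the SNR-based noise calibration and the scalar gain h are assumptions, as the paper does not specify them here.

```python
import torch

def awgn_channel(s: torch.Tensor, snr_db: float, h: float = 1.0) -> torch.Tensor:
    """Simulate transmission over an AWGN channel with gain h.

    snr_db sets the signal-to-noise ratio; the noise power is derived
    from the empirical signal power (an assumed calibration).
    """
    signal_power = s.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))
    n = torch.randn_like(s) * noise_power.sqrt()  # independent Gaussian noise
    return h * s + n
```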
In the SC-IRT algorithm, the received low-dimensional latent semantic information ŵ′ at the receiving end is passed through the generator G to generate the reconstructed image y, as shown in Equation (4):

y = G(ŵ′).  (4)

This process utilizes the generative capabilities of the generator network to reconstruct the image from the received semantic information.
The original image x is encoded by the semantic encoder and then decoded by the semantic decoder, resulting in the reconstructed image y. The probability distribution of the original images is denoted as p_x, while the probability distribution of the images generated by the generator G is denoted as p_y. The discriminator D is trained through adversarial learning using both the original images x and the generated images y. By training the discriminator appropriately, it becomes more effective at extracting the features that characterize the original image distribution p_x.
This adversarial training process helps the generator G to approximate the distribution of the original images and to generate images that are more similar to the original ones. It encourages the generator to produce images that are more likely to be classified as real by the discriminator. As a result, the generator learns to capture the underlying characteristics of the original images and to generate high-quality reconstructed images y that preserve the semantics of the original images.
The semantic decoder in this paper aims to minimize the distortion of semantic information while tolerating some pixel-level errors. The goal is not to generate an exact replica of the original image but to reconstruct the parts of the image that are hardest to recover, namely the semantic features. This approach allows for maximum restoration of the semantic information in the image, even though individual pixel values may differ. A GAN model can effectively restore the semantic information in an image and is therefore well suited to the design of the SC-IRT algorithm; accordingly, this paper selects a GAN model as the semantic decoder for SC-IRT.
The adversarial training of GAN models has significant advantages in restoring semantic information. In this paper, a GAN model is employed as the semantic decoder, which can reconstruct the semantic information into images. The reconstructed images are similar to the input images and visually natural. The naturalness can be measured by the adversarial difference between the probability distributions of the reconstructed and original images, while visual similarity is measured using semantic perceptual loss in the feature space.
The specific architecture of the semantic decoder network is shown in Figure 3. The input image x is processed by the semantic encoder to extract the low-dimensional latent semantic information w, which is quantized into a low-dimensional latent semantic code ŵ, losslessly entropy-encoded, and transmitted through the physical channel. The semantic decoder is based on a GAN model and consists of a generator G and three discriminators (D1, D2, and D3).
Even if the semantic encoder E cannot preserve precise details, the generator G can still produce output images that conform to the distribution of real images, avoiding artifacts or blocky structures. The encoder E and generator G are trained together and optimized using the discriminator D.
The generator G network consists of one convolutional block, nine residual blocks, four upsampling blocks, and one convolutional layer. The specific network parameters are shown in Figure 4.
First, the semantic information is processed by a convolutional block, which includes one convolutional layer, one instance normalization layer, and one ReLU activation layer; the kernel size is 3 × 3, the stride is one, and the number of channels is 1024. Next come nine residual blocks, each consisting of two convolutional layers, two instance normalization layers, and one ReLU activation layer, with a kernel size of 3 × 3, a stride of one, and 1024 channels. These are followed by four upsampling blocks, each including one deconvolutional layer, one normalization layer, and one ReLU activation layer, with a kernel size of 3 × 3, a stride of two (so that the four blocks together restore the 16× downsampling performed by the encoder), and 512, 256, 128, and 64 channels, respectively. Finally, there is a convolutional layer with a kernel size of 7 × 7, a stride of one, and c output channels, matching the number of color channels of the reconstructed image.
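A minimal PyTorch sketch consistent with this description and Figure 4 is given below; the padding and output_padding choices are assumptions needed to make the spatial dimensions work out, and the latent depth C and output depth are placeholders.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convs with instance normalization and an identity skip."""
    def __init__(self, channels: int = 1024):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    """Sketch of G: conv block, nine residual blocks, four 2x upsampling
    blocks (512/256/128/64 channels), and a final 7x7 conv to an image."""
    def __init__(self, C: int = 8, out_channels: int = 3):
        super().__init__()
        blocks = [
            nn.Conv2d(C, 1024, 3, stride=1, padding=1),
            nn.InstanceNorm2d(1024),
            nn.ReLU(inplace=True),
        ]
        blocks += [ResidualBlock(1024) for _ in range(9)]
        in_ch = 1024
        for out_ch in (512, 256, 128, 64):
            blocks += [
                # Each transposed conv doubles the spatial resolution.
                nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2,
                                   padding=1, output_padding=1),
                nn.InstanceNorm2d(out_ch),  # assumed instance normalization
                nn.ReLU(inplace=True),
            ]
            in_ch = out_ch
        blocks += [nn.Conv2d(64, out_channels, 7, stride=1, padding=3)]
        self.net = nn.Sequential(*blocks)

    def forward(self, w_hat):
        # w_hat: (N, C, h/16, w/16) -> reconstructed image (N, 3, h, w)
        return self.net(w_hat)
```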
The discriminator D is an important component of the GAN model and is also composed of multiple layers of convolutional neural networks. It is responsible for distinguishing the authenticity of images, i.e., determining whether they are generated images or original images from the dataset. As a crucial part of the GAN model, the discriminator D is trained in parallel with the generator G to improve the quality of generated images and to make them as close to the original images as possible. In this paper, the discriminator D is unique. It not only receives generated images and original images but also receives semantic information. This means that the generated images are evaluated not only based on their quality but also based on their semantic content. The discriminator plays a crucial role in guiding the generator to generate images that not only look realistic but also capture the desired image semantics.
To enable discrimination of images at multiple scales, this paper adopts a multi-scale discriminator structure, as shown in Figure 3. The discriminator consists of three independent discriminators, each responsible for distinguishing feature differences between the original and generated images at a different scale. In the discrimination process, the first discriminator directly distinguishes the local, pixel-level feature differences between the original and generated images. The second discriminator operates on versions of the original and generated images downsampled by a factor of two, so the dimensions of the new images are halved and each pixel in them represents four pixels in the original images. Similarly, the third discriminator operates on the previous step's images downsampled once more by a factor of two. The three discriminators are denoted as D1, D2, and D3, respectively; each has independent training parameters, but their architectures are similar. The discriminators progressively guide the generator G to generate images that are closer in global semantics to the original images. D3 has the largest discrimination range, with each of its input pixels representing 16 pixels in the original image, and it guides the generator to produce images that best match the global semantic features of the originals. D1 and D2 gradually guide the generator towards lower-level details, aiming to generate images that are more similar to the overall appearance of the original images.
The specific parameters of the discriminator network are shown in Figure 5. The original image x and the generated image y are fed as inputs to the discriminator D1. The input images undergo convolution in the first convolutional layer of the D1 network, which has 64 channels, a kernel size of three, and a stride of two; the output of this layer is then passed through a LeakyReLU activation function. Next, the images go through three convolutional block networks in the discriminator. Each block includes a convolutional layer, an instance normalization layer, and a ReLU activation function, and performs spatial downsampling after its convolution. The number of channels in each convolutional block is twice that of the previous block, with channel sizes of 128, 256, and 512, respectively; the kernel size for all blocks is three, and the stride is two. As the number of channels doubles in each block, the number of feature maps increases while the spatial size decreases. After the three convolutional blocks, a regular convolutional layer follows with a channel size of one, a kernel size of three, and a stride of one. Finally, a Sigmoid function is applied to the output, as the discriminator essentially serves as a binary classifier estimating the probability that the input image is real. To obtain the inputs for discriminators D2 and D3, the original image x and the generated image y are downsampled once and twice, respectively, with a downsampling factor of two, and the downsampled images are fed into the convolutional networks of D2 and D3. Each discriminator outputs a result at its own scale, and the outputs are combined by weighted summation to obtain the output D(x), as shown in Equation (5):

D(x) = a1·D1(x) + a2·D2(x) + a3·D3(x),  (5)

where a1, a2, and a3 are the weighting coefficients.
A multi-scale discriminator is used in this paper to employ a more efficient convolutional layer structure and to enable information fusion across different scales. This approach minimizes image distortion at each scale, discriminating progressively from high-resolution to low-resolution images and reducing image complexity, which allows for better reconstruction of high-quality images and minimizes the distortion of semantic information in the images.
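The following sketch puts the pieces together: one scale discriminator following the layer description of Figure 5, replicated three times over a two-times-downsampled image pyramid and fused by the weighted summation of Equation (5). The use of average pooling for downsampling, the mean over patch scores before fusion, and the equal default weights are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleDiscriminator(nn.Module):
    """One Di (cf. Figure 5): 64-channel stride-2 conv + LeakyReLU, three
    conv blocks (128/256/512, stride 2, instance norm, ReLU), then a
    1-channel conv and a sigmoid giving per-patch real probabilities."""
    def __init__(self, in_channels: int = 3):
        super().__init__()
        layers = [nn.Conv2d(in_channels, 64, 3, stride=2, padding=1),
                  nn.LeakyReLU(0.2)]
        in_ch = 64
        for out_ch in (128, 256, 512):
            layers += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                       nn.InstanceNorm2d(out_ch),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        layers += [nn.Conv2d(512, 1, 3, stride=1, padding=1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, img):
        return self.net(img)

class MultiScaleDiscriminator(nn.Module):
    """D1 sees the full resolution; D2 and D3 see 2x and 4x downsampled
    inputs. Scores are fused by weighted summation as in Equation (5)."""
    def __init__(self, weights=(1.0, 1.0, 1.0)):
        super().__init__()
        self.discriminators = nn.ModuleList(ScaleDiscriminator() for _ in range(3))
        self.weights = weights  # a1, a2, a3

    def forward(self, img):
        out, x = 0.0, img
        for a_i, d_i in zip(self.weights, self.discriminators):
            out = out + a_i * d_i(x).mean()    # average the patch scores
            x = F.avg_pool2d(x, kernel_size=2)  # next, coarser pyramid level
        return out
```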
2.4. Loss Functions
The choice of loss function plays a crucial role in training the network model. In GAN models, the Mean Squared Error (MSE) loss and the Learned Perceptual Image Patch Similarity (LPIPS) loss are commonly used. The MSE loss computes the difference between each pair of corresponding pixels in two images and quantifies their discrepancy, which is important for maintaining similarity between the original and generated images; however, it is not effective at measuring differences in texture or structure. The LPIPS loss, on the other hand, measures the textural and structural differences between the original and generated images, and minimizing it makes the generated images visually closer to the originals. Therefore, in training the network model, a combination of the MSE loss and the LPIPS loss is commonly used to ensure both pixel-level similarity and the preservation of texture and structure in the generated images.
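A combined distortion loss of this kind can be written compactly with the reference LPIPS implementation; the weighting factors below are illustrative placeholders, as the paper does not state its weights here.

```python
import torch
import lpips  # pip install lpips; the LPIPS metric of Zhang et al.

mse_loss = torch.nn.MSELoss()
lpips_loss = lpips.LPIPS(net='alex')  # perceptual distance in feature space

def distortion_loss(x: torch.Tensor, y: torch.Tensor,
                    alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Pixel-level MSE plus perceptual LPIPS between original x and
    reconstruction y, both (N, 3, H, W) in [-1, 1]. alpha and beta are
    placeholder weights for the two terms."""
    return alpha * mse_loss(y, x) + beta * lpips_loss(x, y).mean()
```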
Before conducting adversarial training, this paper first trains with the MSE loss and the LPIPS loss alone, computing the pixel differences and texture-structure differences between the original image x and the generated image y. These two loss functions help guide the training of the generator G and the discriminator D in the right direction. Pretraining the generator G in this way before the final adversarial training benefits its performance: the discriminator D can learn more useful information, and the adversarial loss becomes more reasonable. Without this step, directly conducting adversarial training on the GAN model may leave a significant gap between the images generated by G and the original images, making them easily distinguishable and leading to vanishing gradients for the generator G.
The loss function of the semantic communication image transmission algorithm in this paper includes the adversarial losses of the generator and discriminator as well as the MSE loss and LPIPS loss on the images. The USV visual image transmission algorithm in this paper is based on semantic encoding using autoencoders and semantic decoding using GANs to achieve low-bit-rate image transmission, and adversarial training effectively addresses the image blurring and contour loss that occur at low bit rates. The adversarial loss is defined between the generator G and the discriminator D; the GAN network consists of one generator and three discriminators. The loss function for the generator G is shown in Equation (6).
The loss function of the discriminator D is shown in Equation (7).
Thus, the overall adversarial loss is defined as in Equation (8), where x denotes the input raw image, w is the compressed latent semantic representation, m is the number of discriminators, and λi is the weighting factor of the i-th discriminator.
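As a hedged illustration of such a multi-discriminator adversarial objective, the sketch below uses the standard non-saturating GAN form with per-discriminator weights λi over a two-times-downsampled image pyramid; the exact form of Equations (6)-(8) in this paper may differ.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(x, y, discriminators, lambdas):
    """Non-saturating GAN losses summed over m scale discriminators.

    x: original images; y: images generated from the received latent;
    discriminators: [D1, D2, D3], each returning per-patch probabilities
    in (0, 1); lambdas: weighting factors λi for each discriminator.
    """
    eps = 1e-7  # numerical floor so log() stays finite
    d_loss, g_loss = 0.0, 0.0
    for lam, d in zip(lambdas, discriminators):
        real = d(x).clamp(eps, 1 - eps)
        fake = d(y.detach()).clamp(eps, 1 - eps)
        # Discriminator: push originals toward 1, generated images toward 0.
        d_loss = d_loss + lam * (-torch.log(real) - torch.log(1 - fake)).mean()
        # Generator: push generated images toward 1 (non-saturating form).
        g_loss = g_loss + lam * (-torch.log(d(y).clamp(eps, 1 - eps))).mean()
        # Move to the next, coarser scale of the image pyramid.
        x, y = F.avg_pool2d(x, 2), F.avg_pool2d(y, 2)
    return g_loss, d_loss
```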