Article

Time Delay Optimization of Compressing Shipborne Vision Sensor Video Based on Deep Learning

Navigation College, Dalian Maritime University, Dalian 116026, China
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2023, 11(1), 122; https://doi.org/10.3390/jmse11010122
Submission received: 7 December 2022 / Revised: 17 December 2022 / Accepted: 19 December 2022 / Published: 6 January 2023
(This article belongs to the Section Ocean Engineering)

Abstract

As offshore wireless transmission and collaborative innovation in unmanned ships continue to mature, countries have gradually begun researching methods of compressing and transmitting perceptual video for remotely piloted ships. High Efficiency Video Coding (H.265/HEVC) has played an extremely important role in the fields of Unmanned Aerial Vehicles (UAVs) and autonomous driving, and as one of the most advanced coding schemes, it compresses vision sensor video excellently. Given the characteristics of shipborne vision sensor video (SVSV), optimizing the coding stages with high computational complexity is one of the important ways to improve video compression performance. Therefore, an efficient video coding technique is proposed to improve the efficiency of SVSV compression. To optimize SVSV compression performance, an intra-frame coding delay optimization algorithm that works in the intra-frame predictive coding (PC) stage by predicting the Coding Unit (CU) division structure in advance is proposed in combination with deep learning methods. The experimental results show that, compared with the official HEVC test platform HM16.17, the proposed algorithm reduces the total compression time by about 45.49% on average, while the Bjøntegaard Delta Bit Rate (BD-BR) increases by an average of 1.92% and the Bjøntegaard Delta Peak Signal-to-Noise Ratio (BD-PSNR) decreases by an average of 0.14 dB.

1. Introduction

By processing the perception data generated by the shipborne visible light camera (SVLC), shipborne infrared camera (SIRC), automatic identification system, millimeter-wave radar, marine radar, LiDAR, and other perception equipment, an unmanned ship can seamlessly perceive the surrounding navigation situation and achieve autonomous navigation. At the same time, the massive shipborne sensing video needs to be compressed and transmitted to the shore-based console so that the ship's real-time dynamics can be monitored. By collecting navigation environment information inside and around the ship, the shore-based console can intervene in the ship's systems and perform remote control operations at any time. To fill the gap in compressing ship perception video, relieve the pressure on data storage space, improve the efficiency of maritime bandwidth utilization, avoid wasting bandwidth, and provide richer visual information for remote piloting to ensure the navigation safety of intelligent ships, video compression technology for unmanned ships must be researched and optimized to reduce the resources occupied by redundant data during transmission.
In recent years, researchers in various countries have begun to apply high-performance video compression algorithms in the field of intelligent driving, compressing the perceptual video generated by vehicles, traffic systems, aircraft, and ships for storage or transmission. In research on compressing shipborne radar video, Lu et al. [1] proposed a deep learning-based compression method for shipborne radar digital video that combines HEVC with deep learning algorithms to reduce encoding complexity, while reducing the storage pressure on hardware and improving the efficiency of ship-to-shore communication. In reference [2], HEVC is used to compress intelligent video surveillance and achieve real-time classification of "human-vehicle" objects. Reference [3] proposed a method to reduce the complexity of the prediction part of HEVC using Bayesian networks, and the improved HEVC coding scheme was applied to vehicular ad hoc networks (VANETs) to improve road safety. Reference [4] considered applying HEVC to vehicle surveillance to balance surveillance video quality against storage cost, since camera resolution keeps increasing and commercial in-vehicle recorders do not contain large hard disk arrays. To reduce the pressure of storing video recordings in vehicles and to prevent problems such as falsification of digital recording evidence, Kim et al. used the Advanced Video Coding (H.264/AVC) scheme, processed its I-frames, and added an anti-forgery watermark [5]. Reference [6] provided a fast transcoding solution for video in the Internet of Vehicles (IoV) to address the coexistence of AVC and HEVC in vehicular networks. By exploiting the mapping relationship between decoding information in AVC and CUs in HEVC, the authors improved the transmission efficiency of real-time video communication systems in vehicular networks. Reference [7] added the ISODATA clustering algorithm and frame correlation analysis to HEVC for UAV applications to achieve custom keyframes. Reference [8] used H.264/AVC to compress video stream data in intelligent traffic systems and combined it with deep learning methods for real-time monitoring of vehicles in the traffic stream.
Machine learning methods have provided researchers with new directions for optimizing HEVC compression efficiency. Among them, processing schemes based on convolutional neural networks (CNNs) are the most popular. Reference [9] proposed a fast CU division algorithm based on machine learning, designing one CU division prediction algorithm using an online Support Vector Machine (SVM) and another CNN-based algorithm named DeepCNN; the authors compared the two and showed that the CNN method is more effective in reducing the computational complexity of HEVC. Reference [10] designed a variable-filter-size Residue-learning CNN (VRCNN) and proposed a CNN-based post-processing algorithm for HEVC. To address the excessive complexity of the traditional rate distortion cost calculation during Coding Tree Unit (CTU) division, reference [11] proposed a low-complexity shallow asymmetric-kernel CNN for intra-frame mode prediction and designed a fast learning framework for HEVC intra-frame coding. Reference [12] proposed a combined CNN and LSTM structure for CU segmentation prediction, where the LSTM network handles the time-domain correlation in the CU segmentation process; the scheme alleviates the overly complex quadtree-based recursive CU segmentation search in the CTU segmentation process and significantly reduces the coding complexity of HEVC. Reference [13] proposed a CNN-based intra-frame segmentation decision network and a CNN-LSTM inter-frame segmentation decision network, enabling deep learning methods to replace CU segmentation by predicting the intra-frame and inter-frame CU segmentation results, while establishing a large, openly available CU segmentation data set of common coding test sequences. In references [14,15,16], a CNN-based PU angle prediction model named AP-CNN was proposed to replace the original PU angle prediction in the HEVC lossless coding model; in reference [16], an optimized architecture based on AP-CNN, LAP-CNN, was designed to further reduce model complexity. In reference [17], the authors combined a CNN with image recovery techniques for the loop filtering stage of HEVC to improve overall coding performance. In reference [18], CNNs were used to filter the video luminance and chrominance components separately in intra-frame coding mode, replacing the traditional loop filtering mode. Reference [19] introduced CNNs into inter-frame coding and designed a block-level up/down-sampling model to improve the inter-frame coding performance of HEVC. To reduce the distortion of video images compressed at low bit rates, reference [20] designed a quality-enhanced convolutional neural network (QE-CNN) and proposed a time-constrained quality-enhancement optimization (TQEO) scheme that requires no modification of the HEVC encoder; intra-frame coding tests proved the effectiveness of the scheme. In reference [21], to ensure stable transmission in unstable network environments after video compression, a low-complexity neural network-based fault-tolerant coding algorithm named LC-MSRE was designed to improve HEVC coding efficiency while reducing the bit error rate.
With the popularity of HD 3D video, ISO and ITU jointly introduced 3D-HEVC, an extension of the HEVC standard that supports 3D video compression and adds depth map coding techniques. In reference [22], a CNN was used on top of 3D-HEVC to reduce the computational complexity of 3D video compression, and a depth edge classification CNN (DEC-CNN) framework was designed to classify depth map edges. In reference [23], the researchers designed a LeNet-5-based CNN optimization model with an early-termination CU division strategy to reduce the computational complexity of the Rate Distortion cost (RD cost) in intra-frame prediction. In addition to the above approaches, which focus on CU division decisions to reduce HEVC computational complexity, some researchers have considered the relationship between CTU division depth and HEVC computational complexity. Reference [24] set two CTU depth ranges and determined the best CTU division result based on the texture complexity of the currently encoded CTU, which determines the depth range over which the CTU recursively computes the RD cost. Reference [25] transformed the division mode decision problem into a classification problem and proposed a fast CU classification algorithm based on a convolutional neural network that learns image texture, shape, and other features for fast encoding.
Although all of the above methods are effective in reducing the complexity of video coding, the geographical peculiarities of unmanned ships lead to low transmission bandwidth when a ship is underway. At the same time, SVSV differs considerably from ordinary video: specialized data sets are required to train the network and to reduce the distortion of small objects at sea, so these methods are poorly suited to compressing maritime targets (especially small ones). Until now, no accelerated HEVC coding scheme has been developed for SVSV in the maritime domain. Therefore, analyzing the characteristics of SVSV, combining the results of SVSV compression, and optimizing the stages with high compression delay are key steps toward real-time transmission and storage of SVSV from unmanned ships.
The main contributions of this paper are as follows.
  • By analyzing the characteristics of SVSV (visible light video and thermal imaging video) and combining the results of video compression, a deep learning-based algorithm is proposed for optimizing the compression latency of shipboard vision sensor video without affecting the video compression quality.
  • By collecting the segmentation results of shipboard vision sensor video compression, a CTU segmentation structure data set based on SVSV is built, and the proposed CU segmentation prediction model is trained on it to improve prediction accuracy.
  • In the process of compressing SVSV, the encoder predicts the CTU division in advance by invoking the trained CU division prediction model, which avoids the time-consuming CU rate distortion calculation, decreases the coding complexity of shipboard vision sensor video, and significantly reduces the compression latency.
This paper is divided into six parts, organized as follows. Section 2 details the image characteristics of shipboard vision sensor video, covering SVLC video and thermal imaging video. Section 3 analyzes the results of HEVC compression of SVSV, including the relevant experimental parameters, the time consumed in each step of the SVSV compression process, and the final results. In Section 4, we optimize the CTU partitioning process for compressing shipboard vision sensor video, design a hierarchical model of CU partitioning based on SVSV compression, and propose a CNN-based CU partitioning prediction model, which is trained on a large collected SVSV data set to improve prediction accuracy. Section 5 discusses our completed and future work. Finally, Section 6 concludes the paper.

2. SVSV Image Characteristics

2.1. Features of Shipborne Visible Light Camera Video Images

A visible light camera is a camera that images light in the wavelength range of 400 nm to 740 nm. Visible light cameras are among the most commonly used perception devices for visual recognition in the marine field. The images they capture are characterized by rich color detail and high resolution, which facilitates target recognition and foreground segmentation. However, the visible light camera also has certain limitations. For example, in darkness, bad weather, and other poor-visibility conditions, its imaging system is subject to interference that reduces its detection capability, and the noise in the collected video data is difficult to filter out, affecting accurate identification of targets.
The application of vision sensors strengthens a ship's ability to perceive surrounding targets and makes up for the shortcomings of traditional marine radar and AIS. SVLC video is part of the ship video monitoring system and is mainly responsible for imaging the ship's external and cockpit environment. Its basic features are long sight distance, wide coverage, strong timeliness, high image resolution, and continuous, uninterrupted operation, while the video content is easily influenced by the ship's attitude and the weather. When the ship is sailing or anchored at sea, the SVLC continuously and uninterruptedly observes the environment around the ship, generating image content that shows rich visual detail and texture. The brightness of the sky area changes with the intensity of light, the ocean and sky occupy most of the display area and data volume of the whole image, and different camera orientations lead to different sky-to-ocean area ratios in the image. As shown in Figure 1a, ships are usually far away from each other when sailing; therefore, ship targets near the sea-sky line (horizon) are small and may occupy only a few or tens of pixels in the whole image. In offshore waters, in inland waterways, or at anchor, the closer or larger the target is, the larger the share of the image it occupies, and vice versa.
The content of SVLC video images is variable: most of the image area consists of the flat or slowly changing background of sea and sky, and the pixel-to-pixel correlation is very strong. The targets in the images differ in shape but are localized (mostly near the sea-sky line), so the visible light image carries a large amount of redundant data in the spatial domain. The grayscale histogram of an SVLC image is shown in Figure 1b; the grayscale values exhibit a double-peaked distribution with a tail. The first peak, at small gray values, corresponds to the sea background; the second peak, at larger gray values, corresponds to the sky background; and the longer "tail" corresponds to the targets in the image. In good weather, restricted by the ship's own dynamics, the movement of targets in the video captured by the SVLC is generally slow, target changes between video frames are limited, and the sea-sky background may show no obvious change over a certain period of time. From these characteristics it can be inferred that visible light video contains a large amount of time-domain redundant data.
To verify these observations, the intra-frame pixel difference probability density $p(d)$ and the average inter-frame pixel difference $d_k(i,j)$ of 10 sets of SVLC videos were calculated using Equations (1) and (2), respectively. The average intra-frame pixel difference probability density is shown in Figure 2, and the average inter-frame pixel difference statistics are shown in Figure 3. Both the intra-frame pixel difference probability density and the inter-frame pixel difference statistics are concentrated around 0, and the frequency of an inter-frame pixel difference of 0 exceeds 90%. It can therefore be inferred that the spatial and temporal correlations of SVLC video images are high, so applying the advanced PC techniques of HEVC can substantially improve the compression efficiency of SVLC video.
$$p(d) = \frac{1}{\sqrt{2}\,\sigma_d}\exp\!\left(-\frac{\sqrt{2}\,|d|}{\sigma_d}\right) \tag{1}$$
$$d_k(i,j) = f_k(i,j) - f_{k-1}(i,j) \tag{2}$$
where $d$ denotes the difference between neighboring pixels within a frame, $\sigma_d$ is the standard deviation of $d$, $f_k(i,j)$ denotes the pixel value at coordinates $(i,j)$ in the current frame, and $f_{k-1}(i,j)$ denotes the pixel value at coordinates $(i,j)$ in the previous frame.
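To make these statistics concrete, the following Python sketch (a minimal illustration; the function name and the assumption that frames are 8-bit grayscale NumPy arrays are ours, not the paper's) computes the Equation (2) differences over a sequence and reports how often the inter-frame difference is zero:

```python
import numpy as np

def interframe_difference_stats(frames):
    """Eq. (2): d_k(i,j) = f_k(i,j) - f_{k-1}(i,j) for every frame pair,
    plus the fraction of zero differences across the whole sequence."""
    diffs = []
    for k in range(1, len(frames)):
        # Cast to a signed type so negative differences are not wrapped.
        d_k = frames[k].astype(np.int16) - frames[k - 1].astype(np.int16)
        diffs.append(d_k)
    diffs = np.stack(diffs)
    zero_ratio = float(np.mean(diffs == 0))  # reportedly > 0.9 for SVLC video
    return diffs, zero_ratio
```

A histogram of the returned differences, normalized to a probability density, corresponds to the statistics plotted in Figures 2 and 3.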

2.2. Shipborne Infrared Camera Video Image Features

An infrared camera is an infrared-sensing imaging camera whose main principle is to form the contour of an object in the image from spatial variations in radiated infrared energy and reflectivity. Unlike the visible light camera, the infrared camera has a very strong detection capability in darkness, bad weather, and other poor-visibility conditions, and its imaging system is not disturbed by them. Infrared wavelengths are divided into near-infrared, short-wave infrared, mid-wave infrared, long-wave infrared, far infrared, and so on. The SIRC operates in the long-wave band and is also known as a shipborne thermal imager; its main function is to convert temperature-difference signals into video images. That is, the infrared detector senses the thermal radiation emitted by targets in the waters around the ship and, after processing and conversion into a video signal, images those targets. Since there is usually a temperature difference between a target on the water and the water surface, the SIRC can detect surface targets more easily than the visible light camera in poor-visibility environments. During navigation, the video content captured by the SIRC is similar to that of the visible camera, with the sea and sky as the background. The main distinction is that the infrared (IR) video image is relatively blurred and has a low signal-to-noise ratio; it is easily affected by the external environment, so the contrast between target and background is low and the edges are blurred.
SIRC video, like SVLC video, also contains a large amount of redundant data in the spatial and temporal domains. As shown in Figure 4a, the IR image has a uniform background with limited color information, and its content consists of the infrared radiation of the water surface, the sky, and the target, together with ambient noise. Since the infrared intensity of the water surface is much lower than that of the sky, there is a large difference between the grayscale values of the water surface and those of the sky, so the video image generated by the SIRC shows a clearly visible sea-sky line. To visualize the energy distribution of SIRC video, an SIRC image is transformed into a grayscale histogram (Figure 4b), which shows that the energy distribution of the image is concentrated; this characteristic is very favorable for processing with the intra-frame PC technique. The movement characteristics of target and background in IR video are the same as in visible light video. Figure 5 and Figure 6 show the intra-frame pixel variation probability density and the inter-frame pixel variation statistics of SIRC video; both are more concentrated than those of the visible light camera video. Therefore, it is also efficient to use the HEVC PC algorithm to reduce the redundant data in the spatial and temporal domains of IR video.

3. Compression Experiment of Shipborne Vision Sensor Video

According to the analysis of shipborne vision sensor video characteristics, the redundant data of such video mainly comprise intra-frame spatial-domain redundancy and inter-frame time-domain redundancy. In this section, to test the actual performance of HEVC video compression on the redundant data within shipborne vision sensor video, a large number of HD perception video sequences captured in real-ship experiments are compressed and the measured compression results are analyzed.

3.1. SVSV Sequence Parameters

In the process of compressing SVSV, a total of 63 sets of uncompressed SVLC and infrared camera video sequences were acquired in different navigation scenes in the RGB color space; their main parameters are given in Table 1. To match the input format of HEVC, the SVSV color space is preprocessed and converted to YUV format with 4:2:0 sampling, which removes part of the color information redundancy. The conversion and inverse conversion are given in Equations (3) and (4).
$$\begin{aligned} Y &= 0.299R + 0.587G + 0.114B \\ U &= -0.1687R - 0.3313G + 0.5B + 128 \\ V &= 0.5R - 0.4187G - 0.0813B + 128 \end{aligned} \tag{3}$$
$$\begin{aligned} R &= Y + 1.402\,(V - 128) \\ G &= Y - 0.34414\,(U - 128) - 0.71414\,(V - 128) \\ B &= Y + 1.772\,(U - 128) \end{aligned} \tag{4}$$
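For illustration, a minimal NumPy sketch of the Equation (3) forward transform, together with one common way of realizing 4:2:0 chroma subsampling by averaging 2 × 2 blocks (even frame dimensions are assumed; this is not the encoder's own implementation):

```python
import numpy as np

def rgb_to_yuv(rgb):
    """Apply the Eq. (3) color transform to an H x W x 3 uint8 RGB image."""
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.1687 * r - 0.3313 * g + 0.5 * b + 128.0
    v = 0.5 * r - 0.4187 * g - 0.0813 * b + 128.0
    return y, u, v

def subsample_420(u, v):
    """4:2:0 sampling: keep one chroma sample per 2 x 2 luma block."""
    h, w = u.shape
    u420 = u.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    v420 = v.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return u420, v420
```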

3.2. Coding Complexity Analysis

The complexity of the coding computation is one of the main consumers of a ship's computational resources and a key contributor to compression delay. Excessive compression delay degrades the real-time storage and transmission of shipborne vision sensor video and creates hidden dangers for intelligent ship navigation. To analyze the share of computational resources consumed by each stage when compressing shipborne vision sensor video, this section selects two coding modes, All Intra (AI) and Low Delay P (LDP), and conducts coding complexity experiments on the captured video to measure the time consumption of the different stages under the two configurations, taking the average of the measured results as the final result. The AI mode tests the computational complexity of each stage in the intra-frame PC mode, and the LDP mode tests the inter-frame PC mode; within each Group of Pictures (GOP), the first frame is an I-frame and the rest are P-frames.
The experimental platform uses the 64-bit Ubuntu 20.04.2 LTS operating system, and the test environment is built on the official HEVC test platform HM16.17, compiled in C++; the performance analysis tool bundled with Visual Studio is used to diagnose the coding time consumption of each stage. The hardware configuration is an Intel Core i7-8700 CPU @ 3.20 GHz with 16 GB RAM, and the main configuration parameters of the encoder are given in Table 2. The percentage of time the CPU spends executing each major encoding stage during compression of the shipborne vision sensor video is given in Table 3.
In the process of compressing SVSV, the TComTrQuant class, which implements the transform and quantization functions in HEVC, takes the most time in AI mode, 17.69% on average, mainly because this part must calculate the rate distortion cost metric needed for optimal quantization. It is followed by the TComDataCU section (which stores CU data information), the TComPrediction and TEncSearch sections (which perform intra-frame prediction and search), and the TComRdCost section (which implements the rate distortion cost calculation), accounting for 14.44%, 11.95%, 11.6%, and 3.79%, respectively. The intra-frame PC mode requires iterative computation of the rate distortion cost for all possible division methods in order to determine the optimal CTU division, which consumes most of the computational resources. When compressing SVSV in LDP mode, the motion estimation and compensation parts consume most of the computational resources, with the TComInterpolationFilter class, which implements the interpolation filtering function, taking the highest share, 19.91% of total coding time on average. It is followed by the TComDataCU class and the TComRdCost part, which calculates SAD and other rate distortion optimization metrics, consuming 15.61% and 12.25% on average.
From the above results, it can be concluded that when compressing SVSV in AI mode, the CTU division part has the highest computational complexity, mainly because the intra-frame search and rate distortion cost calculations consume the most coding time. When compressing SVSV in LDP mode, motion estimation and motion compensation consume the most time, which raises the computational complexity of the inter-frame prediction part and affects the compression latency.

3.3. Compression Performance Analysis

To test the performance of HEVC on SVSV, and the impact of different scenes, resolutions, and Quantization Parameter (QP) values on compression performance, the LDP mode was used to analyze the performance of compressed SVSV. The evaluation metrics include bit rate, Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM); the formulas for PSNR and SSIM are given in Equations (5) and (6). The experimental platform is the same as in Section 3.2. For the encoder configuration, the LDP default configuration is used, and the average of all measured results is taken as the final result. Table 4 shows the performance comparison of compressing SVSVs of different scenes and resolutions at the same QP value. Figure 7 shows the R-D curves of SVLC video at 1920 × 1080 resolution and SIRC video at 704 × 576 resolution in the anchoring scenario at different QP values (22, 27, 32, and 37). Figure 8 shows the quality comparison of the first frame of the 1920 × 1080 SVLC video in the anchoring scene before and after encoding at the four QPs.
$$\mathrm{PSNR} = 10\cdot\lg\frac{M\cdot N\cdot A^2}{\sum_{n=0}^{N-1}\sum_{m=0}^{M-1}\left(f(m,n)-\hat{f}(m,n)\right)^2} \tag{5}$$
where $M$ and $N$ are the width and height of the video image, $f(m,n)$ and $\hat{f}(m,n)$ are the video images before and after compression, respectively, and $A$ is the peak grayscale value of the video image.
$$\mathrm{SSIM}(x,y) = \frac{\left(2\mu_x\mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}{\left(\mu_x^2+\mu_y^2+C_1\right)\left(\sigma_x^2+\sigma_y^2+C_2\right)} \tag{6}$$
$$l(x,y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2+\mu_y^2+C_1} \tag{7}$$
$$c(x,y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2+\sigma_y^2+C_2} \tag{8}$$
$$s(x,y) = \frac{2\sigma_{xy} + C_3}{\sigma_x\sigma_y+C_3} \tag{9}$$
where $l(x,y)$ is the luminance comparison function, given in Equation (7); $c(x,y)$ is the contrast comparison function, given in Equation (8); and $s(x,y)$ is the structure comparison function, given in Equation (9). $\mu_x$ and $\mu_y$ are the average intensities before and after compression, given in Equation (10); $\sigma_x$ and $\sigma_y$ are the standard deviations, given in Equation (11); $\sigma_{xy}$ is the covariance, given in Equation (12); and $C_1$, $C_2$, and $C_3$ are constants.
$$\mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i,\qquad \mu_y = \frac{1}{N}\sum_{i=1}^{N} y_i \tag{10}$$
$$\sigma_x = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i-\mu_x\right)^2},\qquad \sigma_y = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(y_i-\mu_y\right)^2} \tag{11}$$
$$\sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i-\mu_x\right)\left(y_i-\mu_y\right) \tag{12}$$
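The following Python sketch evaluates Equations (5) and (6) for a pair of frames. Note that it computes SSIM globally over the whole frame, whereas practical implementations usually average SSIM over local windows; the values of C1 and C2 are the commonly used defaults, an assumption on our part:

```python
import numpy as np

def psnr(f, f_hat, peak=255.0):
    """Eq. (5): PSNR between the original frame f and the reconstruction f_hat."""
    mse = np.mean((f.astype(np.float64) - f_hat.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def ssim(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Eq. (6) evaluated over the whole frame (single global window)."""
    x = x.astype(np.float64).ravel()
    y = y.astype(np.float64).ravel()
    mu_x, mu_y = x.mean(), y.mean()                  # Eq. (10)
    sigma_x, sigma_y = x.std(ddof=1), y.std(ddof=1)  # Eq. (11)
    sigma_xy = np.cov(x, y, ddof=1)[0, 1]            # Eq. (12)
    return ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x ** 2 + sigma_y ** 2 + c2))
```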
As can be seen from Table 4, at the same QP value the bit rate of compressed SVSV is affected by the video resolution and the navigation scene: the higher the resolution and the more complex the video content, the higher the bit rate. The SSIM results also show that at a QP value of 32 the compression effect of HEVC is relatively good, and the compressed video shows no obvious perceptual difference from the original. At the same time, for the same video content, the higher the resolution of the SVSV, the higher the compression ratio, which indicates that the larger CTU quadtree division structure of HEVC adapts well to the more complex content regions of high-resolution SVSV while also compressing flat regions effectively. The comparison results in Figures 7 and 8 show that QP values affect the compression performance of different types of SVSV differently: the larger the QP value, the more the content details are distorted, and the lower the QP value, the clearer the video. When the QP value is set above 27, the slope of the R-D curve is larger, indicating that video quality improves significantly while the bit rate changes slowly; when the QP value is below 27, the slope of the R-D curve decreases, indicating that the bit rate grows quickly while the peak signal-to-noise ratio grows slowly. Figure 8e also shows that at a QP value of 37 the edge contours of targets in the visible image become relatively blurred compared with the original, but this does not prevent discrimination of targets on the sea surface.
The experimental results of compressing SVSV show that HEVC intra-frame predictive coding largely removes the spatial-domain redundant data of SVSV. However, during intra-frame prediction, HEVC recursively divides the CU downward and traverses the rate distortion cost computation to find the best division according to the QP and the texture complexity of the CTU, and this CU division decision brings high computational complexity to intra-frame predictive coding. Therefore, optimizing the computational complexity of the compression process for the characteristics of SVSV is a critical step toward further upgrading the remote piloting system of intelligent ships.
In this section, based on the characteristics and usage scenarios of SVSV images, an algorithm acting in the intra-frame PC stage is proposed, in combination with deep learning methods, for fast compression of SVSV by predicting the CU division structure in advance without affecting the compression quality.

4. Time Delay Optimization of Compressing SVSV Based on Deep Learning

4.1. CU Partitioning Mode in HEVC Intra-Frame Prediction

The CU division process of HEVC intra-frame PC is shown in Figure 9. CU division is a hierarchical, recursive search, and a CTU can be regarded as a combination of one or more CUs of different sizes. According to the quadtree division rule in HEVC, a CU takes one of four possible depths or sizes, depending on the texture features of the image content: 64 × 64, 32 × 32, 16 × 16, and 8 × 8. During CU division, HEVC traverses a total of 85 CUs (1 + 4 + 16 + 64) from 64 × 64 down to 8 × 8 and selects the CU division scheme with the lowest rate distortion cost as the actual coding structure. The partitioning proceeds as follows.
  1. First, calculate the rate distortion cost of the 64 × 64 CTU, then divide it down into four 32 × 32 sub-CUs.
  2. Calculate the rate distortion cost of each of the four 32 × 32 sub-CUs separately, and continue dividing down into 16 × 16 sub-CUs.
  3. Calculate the rate distortion costs of the current four 16 × 16 sub-CUs sequentially during the downward division, and continue dividing down into 8 × 8 sub-CUs.
  4. Calculate the rate distortion costs of the current four 8 × 8 sub-CUs sequentially during the downward division.
  5. Compare the summed rate distortion cost of the four 8 × 8 sub-CUs with that of the current 16 × 16 CU, and select the scheme with the lower rate distortion cost as the division structure of the current CU.
  6. Similar to step 5, repeat the comparison for the 16 × 16 sub-CUs, and select the scheme with the lower rate distortion cost as the division structure of the current CU.
  7. Compare the rate distortion cost of the 32 × 32 sub-CUs with that of the whole CTU to obtain the lowest rate distortion cost and the final division structure of the current CTU.
This description of the CU division process shows that, to decide the final CTU division structure, the rate distortion cost must be calculated and compared for CUs of every possible size. HEVC's exhaustive search not only has high computational complexity but also generates a large number of redundant calculations, which is very unfavorable for intelligent ships with limited computational resources.
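To make the cost of the exhaustive search concrete, the sketch below mirrors the recursive decision in Python. The rd_cost callable is a hypothetical stand-in for HEVC's actual rate distortion computation; the point is only that every one of the 85 CUs is evaluated before a structure is chosen:

```python
def best_partition(cu, size, rd_cost, min_size=8):
    """Exhaustive quadtree search over one CU, as HM does: compare the
    cost of coding the block whole against the summed cost of its four
    sub-CUs and keep the cheaper option. Returns (cost, split, children)."""
    cost_here = rd_cost(cu)
    if size == min_size:                      # 8 x 8 CUs are never split
        return cost_here, False, None
    half = size // 2
    children, cost_split = [], 0.0
    for dy in (0, half):
        for dx in (0, half):
            child = best_partition(cu[dy:dy + half, dx:dx + half],
                                   half, rd_cost, min_size)
            children.append(child)
            cost_split += child[0]
    if cost_split < cost_here:
        return cost_split, True, children
    return cost_here, False, None

# Example with a toy cost (block variance as a crude distortion proxy):
# cost, split, _ = best_partition(ctu, 64, rd_cost=lambda b: float(b.var()) + 1.0)
```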

4.2. Modeling of CU Partition Structure

To describe the different ways of dividing the CUs in shipborne vision sensor images, a hierarchical structure model of CU division based on these images is proposed, as shown in Figure 10, where $x \in \{0, 1, 2, 3\}$ and $i, j \in \{1, 2, 3, 4\}$. A CU of any size and position is uniformly represented by $D(x, i, j) \in \{0, 1\}$, where $x$ is the depth corresponding to the current CU size, and $i$ and $j$ are the coordinates of the current sub-CU within the 64 × 64 and 32 × 32 parent CUs, respectively; the binary labels 0 and 1 represent whether the current CU is divided. If $x = 0$, neither $i$ nor $j$ is used; if $x = 1$, only $i$ is used. The specific expressions are given in Table 5. A 64 × 64 CU (i.e., an undivided CTU at depth 0) is denoted D(0), a 32 × 32 CU is denoted D(1,i), and a 16 × 16 CU is denoted D(2,i,j). For example, D(1,3) = 0 means that the current 32 × 32 sub-CU has coordinate 3 in its parent CU and is not divided further; D(2,3,4) = 1 means that the current 16 × 16 sub-CU has coordinates (3,4) in its parent CUs and is divided further into 8 × 8 CUs.
By using neural networks to predict all possible ways of dividing the current CTU, the structured output of a total of 21 CUs can be directly derived, saving the computational time consumed by predicting them one by one. Finally, based on the predicted CTU division structure, the step of calculating and comparing the CU rate distortion cost is skipped, and the computational redundancy caused by recursively traversing the CU rate distortion cost is avoided to a certain extent, which helps to reduce the complexity of intra-frame PC.
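For illustration, the 21 structured outputs can be held in a simple container such as the hypothetical Python sketch below, which also expands a set of predicted labels into the list of CU sizes actually coded:

```python
from dataclasses import dataclass, field

@dataclass
class CtuPartitionLabels:
    """The 1 + 4 + 16 = 21 binary split labels of one 64 x 64 CTU:
    d0 = D(0), d1[i-1] = D(1,i), d2[i-1][j-1] = D(2,i,j)."""
    d0: int = 0
    d1: list = field(default_factory=lambda: [0] * 4)
    d2: list = field(default_factory=lambda: [[0] * 4 for _ in range(4)])

    def coded_cu_sizes(self):
        """Expand the labels into the sizes of the CUs that get coded."""
        if self.d0 == 0:
            return [64]                       # CTU kept whole
        sizes = []
        for i in range(4):
            if self.d1[i] == 0:
                sizes.append(32)              # 32 x 32 CU not split
                continue
            for j in range(4):
                # D(2,i,j) = 1 yields four 8 x 8 CUs, else one 16 x 16 CU.
                sizes.extend([8, 8, 8, 8] if self.d2[i][j] else [16])
        return sizes
```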

4.3. Establishment of Data Sets

High-quality data sets are the basis for training and validating a neural network model and also help improve the efficiency of the algorithm. The abundant collected high-definition SVLC and SIRC video sequences are first processed, and corresponding classification labels are set for training the network model. A large amount of CU division data supports the training of the SVSV CU division prediction model. Each sample contains the Y component of a CU and the corresponding 0/1 division label. All SVLC video samples together constitute a CU division data set based on SVLC video, and all SIRC video samples together constitute a CU division data set based on SIRC video.
All video samples are compressed with the official HEVC test platform HM16.17. In the encoder configuration, the QP is set to the common coding-test values 22, 27, 32, and 37, the default configuration of the AI coding mode is adopted, and the Y component and division information of all CUs are recorded and saved. Because the frames of shipborne vision sensor video are strongly correlated, the frame-level interval sampling (FLIS) method is used in the data processing phase to enhance the differences between samples and avoid overfitting during training caused by highly repetitive data. Considering that the frame rate of SVLC video differs from that of SIRC video, the CU division data of one frame is kept for every 10 encoded frames when compressing SVLC video, and for every 5 encoded frames when compressing SIRC video. The CU division data set of SVLC video comes from 41 SVLC video sequences, totaling 2460 images; the CU division data set of SIRC video comes from 22 SIRC video sequences, totaling 3300 images. The specific parameters of the data set are given in Table 6. After the data set is produced, three sub-data sets are obtained by random sampling: the training set accounts for 80% of the total data set, the validation set for 10%, and the testing set for 10%.
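A minimal sketch of the FLIS step, assuming the per-frame CU division records are stored in encoding order (the names are hypothetical):

```python
def flis_sample(cu_records, interval):
    """Frame-level interval sampling: keep the CU division data of one
    frame out of every `interval` encoded frames, so that consecutive,
    highly similar frames do not dominate the training data."""
    return [rec for frame_idx, rec in enumerate(cu_records)
            if frame_idx % interval == 0]

# Per the text: interval 10 for SVLC video, interval 5 for SIRC video.
# svlc_samples = flis_sample(svlc_cu_records, interval=10)
# sirc_samples = flis_sample(sirc_cu_records, interval=5)
```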

4.4. CU Division Prediction Model for Shipboard Vision Sensor Video Based on Deep Learning

The CNN is an efficient prediction approach developed in recent years. Its structure mainly consists of three parts: convolutional layers, pooling layers, and fully connected layers. A CNN extracts features by multiplying a convolution kernel of a certain size with the corresponding positions of each layer, and the kernel parameters are weight-shared across the feature map to continuously improve learning efficiency. According to the characteristics of shipboard vision sensor video, and drawing on the CNN of reference [13], this paper proposes an intra-frame coding delay optimization algorithm for SVSV, designs a deep learning-based CU partition prediction model for shipboard vision sensor video, optimizes the complexity of the neural network model, and adds a threshold termination mechanism; the improved convolutional network structure is shown in Figure 11.
The convolutional network consists of two preprocessing layers, three convolutional layers, one pooling layer, one merging layer, and three fully connected layers. Before performing PC, the encoder first divides the vision sensor image into N CTUs of 64 × 64 size. The preprocessing layer extracts the luminance matrix, which carries the main visual information of the image, as the input to the CU division prediction model for SVSV, and then performs global homogenization, 32 × 32 local homogenization, 16 × 16 local homogenization, and normalization to speed up gradient descent toward the optimal solution.
Convolutional layers: the convolutional layers are mainly responsible for extracting local features from the SVSV. Each convolutional layer performs convolution on the feature maps input to it, and, to obtain more CU features at low complexity, the convolution scheme of Inception Net [26] is adopted: convolutional kernels of different sizes provide receptive fields of different sizes within the CU, yielding higher-level features. Therefore, in the first convolutional layer, three kernels of different sizes, 8 × 8, 4 × 4, and 2 × 2, extract the low-level features of CU division on three separate branches. The second and third convolutional layers use 2 × 2 kernels of the same size to extract higher-level features on the three branches, finally producing 64 feature maps on each branch. To fit the mutually non-overlapping quadtree CU division rule specified by HEVC, the stride of every convolution is set equal to the kernel edge length, making all convolutions non-overlapping.
The merged feature vectors are processed in turn by three fully connected layers on three branches, comprising two hidden layers and one output layer, and the final outputs are the CU division predictions. According to the experimental results in Section 3, QP is one of the main factors affecting the bit rate and the CU division size when compressing shipboard vision sensor video. Therefore, the QP value is added as an external feature to the feature vectors of the first and second fully connected layers to improve the adaptability of the model to different QP values and the accuracy of CU division prediction.
$$\text{Leaky-ReLU}(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \tag{13}$$
$$S(x) = \frac{1}{1 + e^{-x}} \tag{14}$$
$$H = \frac{1}{N}\sum_{i=1}^{N} H_i \tag{15}$$
In the training and testing phases, to prevent vanishing gradients, all convolutional layers and the first and second fully connected layers are activated with the leaky rectified linear unit (Leaky-ReLU) shown in Equation (13). As a variant of the rectified linear unit (ReLU), Leaky-ReLU introduces a fixed slope $\alpha$ in the negative interval to solve the problem that ReLU neurons stop learning there. The output layer is activated by the sigmoid function to ensure that the model outputs lie in the (0, 1) interval; its expression is given in Equation (14). The cross-entropy loss used in the training phase is given in Equation (15), where $N$ denotes the number of samples and $H_i$ denotes the cross-entropy between the true and predicted values of the $i$-th sample.
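The paper does not publish the network code, so the Keras sketch below is only one plausible reading of the description above: three branches of non-overlapping convolutions (first-layer kernels 8 × 8, 4 × 4, and 2 × 2 with stride equal to kernel size, 64 feature maps after the third layer), the QP concatenated into the first and second fully connected layers, Leaky-ReLU activations, and a 21-way sigmoid output trained with cross-entropy. The filter counts of the first two convolutional layers, the hidden-layer widths, and the optimizer are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cu_partition_model(alpha=0.01):
    """Sketch of the three-branch CU division predictor (sizes assumed)."""
    luma = layers.Input(shape=(64, 64, 1), name="ctu_luma")  # normalized Y
    qp = layers.Input(shape=(1,), name="qp")                 # external feature

    branches = []
    for k in (8, 4, 2):                             # three receptive-field sizes
        x = layers.Conv2D(16, k, strides=k)(luma)   # non-overlapping convolution
        x = layers.LeakyReLU(alpha)(x)
        x = layers.Conv2D(32, 2, strides=2)(x)
        x = layers.LeakyReLU(alpha)(x)
        x = layers.Conv2D(64, 2, strides=2)(x)      # 64 feature maps per branch
        x = layers.LeakyReLU(alpha)(x)
        branches.append(layers.Flatten()(x))

    merged = layers.Concatenate()(branches)          # merging layer
    h = layers.Concatenate()([merged, qp])           # QP into first FC layer
    h = layers.LeakyReLU(alpha)(layers.Dense(128)(h))
    h = layers.Concatenate()([h, qp])                # QP into second FC layer
    h = layers.LeakyReLU(alpha)(layers.Dense(64)(h))
    out = layers.Dense(21, activation="sigmoid",     # D(0), D(1,i), D(2,i,j)
                       name="split_probs")(h)

    model = tf.keras.Model([luma, qp], out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```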

4.5. Intra-Frame Coding Delay Optimization Algorithm Flow for Shipboard Vision Sensor Video

The flow of the intra-frame coding delay optimization algorithm for SVSV is shown in Figure 12. The algorithm avoids the rate distortion cost calculations and comparisons performed when deciding the CTU division structure in HEVC by directly predicting the CU division result of the shipboard vision sensor video image. Meanwhile, a threshold termination mechanism is added to the model: when a predicted value falls below the threshold, the model stops predicting deeper CU division structures and outputs the current hierarchical division result, avoiding wasted computation to a certain extent. The specific workflow is as follows (a code sketch of the decision logic follows the list).
  1. Input the shipboard vision sensor video signal to be compressed into the encoder; if the color space of the video is RGB, preprocess it into YUV format. Before encoding formally starts, the encoder splits each frame of the video into N CTUs to be encoded.
  2. The Y component of each CTU is fed into the trained prediction model, whose network structure is shown in Figure 11, and the pixel matrix is normalized to speed up convergence. The model outputs the probability of each D(x,i,j), i.e., of the corresponding binary labels 0 and 1.
  3. The predicted probability of D(0) is compared with the set threshold (0.5 in this paper) to determine whether the CUs in the current CTU are divided further. If the probability of D(0) is below the threshold, the current CTU is directly determined not to be divided, i.e., D(0) = 0 is output; otherwise D(0) = 1 and division continues downward.
  4. Determine whether the probability of D(1,i) exceeds the threshold; if so, D(1,i) = 1 and the current sub-CU continues dividing downward. Otherwise D(1,i) = 0, the current sub-CU is recorded, and downward division stops. If the predictions of all four current sub-CUs are below the threshold, i.e., $D(1,i)\big|_{i=1}^{4} = 0$, the division structure of the current CTU is determined directly and output.
  5. Similar to step 4, determine whether the probability of D(2,i,j) exceeds the threshold; if so, D(2,i,j) = 1 and the current sub-CU continues dividing downward. Otherwise D(2,i,j) = 0, the current sub-CU is recorded, and downward division stops. If the predictions of all 16 current sub-CUs are below the threshold, i.e., $D(2,i,j)\big|_{i,j=1}^{4} = 0$, the division structure of the current CTU is determined directly and output.
  6. All recorded sub-CU division structures are integrated according to the CU hierarchical division structure model, and the final division structure of the current CTU is determined and output.
  7. In the process of compressing SVSV, the calculation and comparison of rate distortion costs is avoided by directly predicting the CU division structure of the shipboard vision sensor video image. For example, when the prediction result is $D(1,i)\big|_{i=1}^{4} = 0$, the network model terminates the prediction of the 16 × 16 CU division and directly outputs a CTU division structure containing four 32 × 32 CUs, which reduces the compression delay to a certain extent.
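A hedged Python sketch of this decision flow, reusing the CtuPartitionLabels container sketched in Section 4.2 and assuming the 21 model outputs have already been grouped by depth into (p_d0, p_d1, p_d2):

```python
def decide_ctu_partition(probs, threshold=0.5):
    """Map predicted probabilities onto split labels with early termination:
    deeper labels are consulted only where the parent CU actually splits.
    probs = (p_d0, [p_d1] * 4, [[p_d2] * 4] * 4) from the trained model."""
    p_d0, p_d1, p_d2 = probs
    labels = CtuPartitionLabels()
    if p_d0 < threshold:          # step 3: CTU kept whole, stop immediately
        return labels
    labels.d0 = 1
    for i in range(4):
        if p_d1[i] < threshold:   # step 4: 32 x 32 sub-CU not split further
            continue
        labels.d1[i] = 1
        for j in range(4):
            if p_d2[i][j] >= threshold:  # step 5: split 16 x 16 into 8 x 8
                labels.d2[i][j] = 1
    return labels
```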

4.6. Analysis of Experimental Results

To verify the effectiveness of the intra-frame coding delay optimization algorithm in reducing the compression delay of shipboard vision sensor video, this paper uses TensorFlow 1.13.0 to build the convolutional neural network model, embeds the trained model into the HEVC test platform HM16.17, and compiles and tests it under Ubuntu; the hardware configuration of the test environment is the same as in Section 3.2. For the encoder configuration, the AI default configuration was used. The test set contains shipboard vision sensor videos from three different navigation scenarios and at different resolutions. The coding-time saving ratio $\Delta T$, together with the BD-BR and BD-PSNR of the VCEG-M33 proposal, are used as the evaluation metrics. The smaller the BD-BR, the lower the bit rate of the compressed video; the larger the BD-PSNR, the higher the quality of the compressed video; and the larger the $\Delta T$, the greater the encoding-time saving. Equation (16) gives the calculation of $\Delta T$; BD-PSNR is calculated in the same way as the PSNR of Equation (5).
$$\Delta T = \frac{1}{n}\sum_{QP}\frac{T_{\mathrm{HM16.17}} - T_{\mathrm{prop}}}{T_{\mathrm{HM16.17}}}\times 100\% \tag{16}$$
where $T_{\mathrm{HM16.17}}$ is the encoding time of the HM16.17 encoder, $T_{\mathrm{prop}}$ is the encoding time of the method proposed in this section, and $n$ is the total number of encodings.
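For clarity, a small Python helper evaluating Equation (16) over the four tested QP values; the timing numbers in the commented example are purely hypothetical:

```python
def delta_t(t_hm, t_prop):
    """Eq. (16): average encoding-time saving (%) over the tested QPs.
    t_hm, t_prop: per-QP encoding times for HM16.17 and the proposed method."""
    savings = [(hm - prop) / hm for hm, prop in zip(t_hm, t_prop)]
    return 100.0 * sum(savings) / len(savings)

# Hypothetical times (seconds) for QP = 22, 27, 32, 37:
# delta_t([120.0, 95.0, 80.0, 70.0], [66.0, 52.0, 43.0, 38.0])  # -> ~45.6
```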
The results show that, compared with HM16.17 in AI coding mode, the proposed intra-frame coding delay optimization algorithm reduces total compression time by about 45.49% on average, while BD-BR increases by 1.92% on average and BD-PSNR decreases by 0.14 dB on average. The compression time decreases by about 50.70% on average when compressing SVLC video at 1920 × 1080 resolution and by about 43.97% on average at 1280 × 720 resolution. This shows that the higher the resolution of the SVSV, the better the proposed intra-frame coding delay optimization algorithm adapts and the stronger its compression performance.
Table 7 also shows that the QP setting is a major factor in the performance of the intra-frame coding delay optimization algorithm for SVSV: the larger the QP value, the shallower the CTU division depth and the coarser the CU division. For example, when compressing SIRC video at 704 × 576 resolution, the average compression time saving at a QP value of 22 is 38.55%; at a QP value of 37, it increases by nearly 6.11 percentage points to 44.66%. For a given QP value, the performance of the algorithm is also affected by the actual video content: the fewer the targets and the flatter the content, the better the compression performance. For example, when compressing shipboard visible light camera video at 1280 × 720 resolution with a QP value of 32, the average compression time saving in the berthed-in-port environment is 3.12 percentage points higher than in the sailing state and 1.36 percentage points higher than in the anchoring state. From the above analysis, it can be concluded that the proposed intra-frame coding delay optimization algorithm greatly reduces the computational complexity of intra-frame PC and performs well in compressing SVSV, especially high-resolution SVSV, significantly reducing compression time.

5. Discussion

In our research, the characteristics of SVSV are analyzed in detail, including the video image characteristics of the SVLC and the SIRC. Compression experiments are then carried out, through which the computational complexity of each encoding stage and the main causes of compression delay are analyzed. The performance of HEVC in compressing SVSV at different resolutions and in different scenes is summarized, and a deep learning-based intra-frame coding delay optimization algorithm for SVSV is proposed, alongside an analysis of the CTU division process in HEVC. A hierarchical CU division structure model based on SVSV is designed by combining the characteristics of SVLC and infrared video. The proposed deep learning-based CU segmentation prediction model for SVSV is built, and a CU segmentation data set based on SVSV is constructed by encoding a large number of collected high-definition shipboard vision sensor video sequences with the official HEVC testbed HM16.17. The data set contains the CU partition information of SVLC video and infrared camera video at different QP values, and the obtained samples are used to train the CU partition prediction model. The final results show that the proposed algorithm performs well when compressing SVSV: with little impact on the overall clarity of the video, the total compression time is reduced by about 45.49% on average compared with the official HEVC test platform HM16.17.
Finally, to intuitively show the performance differences between the proposed algorithm, the traditional method, and the general optimization algorithm of reference [13] on compressed SVSV, the results of compressing SVLC video under different QP values (22, 27, 32, 37) are visualized in Figure 13, with the three methods shown in different colors. The R-D curve of the proposed method differs little from that of HM16.17, and its bit rate and PSNR results are better than those of reference [13], indicating that the proposed algorithm performs well on compressed SVSV.

6. Conclusions

Real-time and accurate transmission of SVSV is of practical significance for ensuring the safe navigation of intelligent ships. The ship monitoring unit plays a role similar to the "black box" of an aircraft while the ship is sailing, and transmitting and storing the monitoring video occupies substantial network bandwidth and memory. Using efficient data compression algorithms is therefore the most reliable solution for transmitting and storing SVSV and is also the development trend of ship-to-shore data transmission for intelligent ships. Compression delay is considered one of the main contributors to the total ship-to-shore transmission delay of SVSV. Among the coding stages, intra-frame coding has the highest complexity, so the latency caused by coding is further reduced by using the advanced HEVC coding scheme to compress SVSV and convolutional neural networks to optimize the intra-frame coding process of HEVC. The final experimental results show that the proposed algorithm can reduce the storage and transmission bit rates while reducing the delay caused by encoding.
In the development of future intelligent ships, higher-resolution surveillance video and 3D video will also be introduced to advance ship autopilot, provide more detailed visual information to the shore-based maneuvering console, and improve visual obstacle avoidance, but the type and volume of SVSV data will also grow exponentially. Therefore, in future work we will improve the adaptability of the algorithm presented in this paper and further research the optimization of inter-frame coding of SVSV.

Author Contributions

Methodology, H.L.; investigation, H.L. and Z.W.; data curation, H.L.; writing—original draft preparation, H.L.; writing—review and editing, Y.Z. and Z.W.; supervision, Y.Z.; project administration, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Liao Ning Revitalization Talents Program (No. XLYC1902071).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AVC: Advanced Video Coding
BD-BR: Bjøntegaard Delta Bit Rate
BD-PSNR: Bjøntegaard Delta Peak Signal-to-Noise Ratio
CTU: Coding Tree Unit
CU: Coding Unit
CNN: Convolutional Neural Network
FLIS: Frame-Level Interval Sampling
GOP: Group of Pictures
HEVC: High Efficiency Video Coding
PC: Predictive Coding
PSNR: Peak Signal-to-Noise Ratio
QP: Quantization Parameter
RD cost: Rate Distortion Cost
SSIM: Structural Similarity Index Measure
SIRC: Shipborne Infrared Camera
SVLC: Shipborne Visible Light Camera
SVSV: Shipborne Vision Sensor Video
UAV: Unmanned Aerial Vehicle

References

  1. Lu, H.R.; Zhang, Y.J.; Wang, Z.L. Time Delay Optimization of Compressing Shipborne Radar Digital Video Based on Deep Learning. J. Mar. Sci. Eng. 2021, 9, 1279.
  2. Zhao, L.; He, Z.H.; Cao, W.M.; Zhao, D.B. Real-Time Moving Object Segmentation and Classification From HEVC Compressed Surveillance Video. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 1346–1357.
  3. Jiang, X.T.; Feng, J.; Song, T.; Katayama, T. Low-Complexity and Hardware-Friendly H.265/HEVC Encoder for Vehicular Ad-Hoc Networks. Sensors 2019, 19, 1927.
  4. Chen, H.C.; Fan, C.P. Effective Reduction of Memory Storage and Cost by HEVC-based Intra-Frame Video Encoding Technology with Region-Adaptive Quantization for Car Digital Video Recorders. In Proceedings of the 2022 4th International Conference on Computer Communication and the Internet (ICCCI), Chiba, Japan, 1–3 July 2022; pp. 62–71.
  5. Kim, C.; Shin, D.; Shin, D.; Yang, C.N. Secure protection of video recorder video in smart car. Int. J. Distrib. Sens. Netw. 2016, 12, 1550147716681792.
  6. Liu, X.G.; Li, Y.Y.; Dai, C.; Li, P.; Yang, L.T. An Efficient H.264/AVC to HEVC Transcoder for Real-Time Video Communication in Internet of Vehicles. IEEE Internet Things J. 2018, 5, 3186–3197.
  7. Wang, H.; Pan, Z.; Zhai, X.; Huang, Z.; Zhang, X.; Cao, S. A Key Frame Extraction Method of HEVC Video Based on Clustering Algorithm for Electric Utilities Management. In Proceedings of the 2020 IEEE 5th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 23–25 October 2020; pp. 360–365.
  8. Altaf, M.; ur Rehmar, F.; Chughtai, O. Discernible Effect of Video Quality for Distorted Vehicle Detection using Deep Neural Networks. In Proceedings of the 2021 IEEE 94th Vehicular Technology Conference (VTC2021-Fall), Norman, OK, USA, 27–30 September 2021; pp. 1–5.
  9. Bouaafia, S.; Khemiri, R.; Sayadi, F.E.; Atri, M. Fast CU partition-based machine learning approach for reducing HEVC complexity. J. Real-Time Image Process. 2020, 17, 185–196.
  10. Dai, Y.Y.; Liu, D.; Wu, F. A Convolutional Neural Network Approach for Post-Processing in HEVC Intra Coding. In Proceedings of the 23rd International Conference on MultiMedia Modeling (MMM), Reykjavik, Iceland, 4–6 January 2017; Volume 10132, pp. 28–39.
  11. Chen, Z.B.; Shi, J.; Li, W.P. Learned Fast HEVC Intra Coding. IEEE Trans. Image Process. 2020, 29, 5431–5446.
  12. Bouaafia, S.; Khemiri, R.; Maraoui, A.; Sayadi, F.E. CNN-LSTM Learning Approach-Based Complexity Reduction for High-Efficiency Video Coding Standard. Sci. Program. 2021, 2021, 6628041.
  13. Xu, M.; Li, T.Y.; Wang, Z.L.; Deng, X.; Yang, R.; Guan, Z.Y. Reducing Complexity of HEVC: A Deep Learning Approach. IEEE Trans. Image Process. 2018, 27, 5044–5059.
  14. Schiopu, I.; Huang, H.Y.; Munteanu, A. CNN-Based Intra-Prediction for Lossless HEVC. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 1816–1828.
  15. Huang, H.; Schiopu, I.; Munteanu, A. Deep Learning based Angular Intra-Prediction for Lossless HEVC Video Coding. In Proceedings of the 2019 Data Compression Conference (DCC), Snowbird, UT, USA, 26–29 March 2019; p. 579.
  16. Huang, H.Y.; Schiopu, I.; Munteanu, A. Low-Complexity Angular Intra-Prediction Convolutional Neural Network for Lossless HEVC. In Proceedings of the 22nd IEEE International Workshop on Multimedia Signal Processing (MMSP), Tampere, Finland, 21–24 September 2020.
  17. Jia, C.M.; Wang, S.Q.; Zhang, X.F.; Wang, S.S.; Liu, J.Y.; Pu, S.L.; Ma, S.W. Content-Aware Convolutional Neural Network for In-Loop Filtering in High Efficiency Video Coding. IEEE Trans. Image Process. 2019, 28, 3343–3356.
  18. Huang, H.Y.; Schiopu, I.; Munteanu, A. Frame-Wise CNN-Based Filtering for Intra-Frame Quality Enhancement of HEVC Videos. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 2100–2113.
  19. Li, Y.; Liu, D.; Li, H.Q.; Li, L.; Wu, F.; Zhang, H.; Yang, H.T. Convolutional Neural Network-Based Block Up-Sampling for Intra Frame Coding. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 2316–2330.
  20. Yang, R.; Xu, M.; Liu, T.; Wang, Z.L.; Guan, Z.Y. Enhancing Quality for HEVC Compressed Videos. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 2039–2054.
  21. Wang, T.Y.; Li, F.; Qiao, X.Y.; Cosman, P.C. Low-Complexity Error Resilient HEVC Video Coding: A Deep Learning Approach. IEEE Trans. Image Process. 2021, 30, 1245–1260.
  22. Liu, C.; Jia, K.B.; Liu, P.Y. Fast Depth Intra Coding Based on Depth Edge Classification Network in 3D-HEVC. IEEE Trans. Broadcast. 2022, 68, 97–109.
  23. Ting, H.C.; Fang, H.L.; Wang, J.S. Complexity Reduction on HEVC Intra Mode Decision with modified LeNet-5. In Proceedings of the 1st IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Hsinchu, China, 18–20 March 2019; pp. 20–24.
  24. Feng, Z.Q.; Liu, P.Y.; Jia, K.B.; Duan, K. HEVC Fast Intra Coding Based CTU Depth Range Prediction. In Proceedings of the 3rd IEEE International Conference on Image, Vision and Computing (ICIVC), Chongqing, China, 27–29 June 2018; pp. 551–555.
  25. Kuanar, S.; Rao, K.R.; Bilas, M.; Bredow, J. Adaptive CU Mode Selection in HEVC Intra Prediction: A Deep Learning Approach. Circuits Syst. Signal Process. 2019, 38, 5081–5102.
  26. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
Figure 1. Shipborne visible light camera video and grayscale histogram: (a) Shipborne visible light camera video; (b) Grayscale histogram.
Figure 2. Intra pixel difference statistics of shipborne visible light camera video: (a) Level pixel difference probability density; (b) Vertical pixel difference probability density.
Figure 3. Inter pixel difference statistics of shipborne visible light camera video.
Figure 4. Shipborne infrared camera video and grayscale histogram: (a) Shipborne infrared camera video; (b) Grayscale histogram.
Figure 5. Intra pixel difference statistics of shipborne infrared camera video: (a) Level pixel difference probability density; (b) Vertical pixel difference probability density.
Figure 6. Inter pixel difference statistics of shipborne infrared camera video.
Figure 7. R-D curve of compressed shipborne video under different QPs: (a) R-D curve of compressed visible video; (b) R-D curve of compressed IR video.
Figure 8. Coding quality comparison of shipborne visible light camera video under different QPs: (a) Original image; (b) QP = 22; (c) QP = 27; (d) QP = 32; (e) QP = 37.
Figure 9. Flow chart of CU partition in HEVC intra prediction coding mode.
Figure 10. CU hierarchy partition structure model.
Figure 11. CU partition prediction model of shipborne vision sensor video.
Figure 12. Flow chart of intra coding time-delay optimization algorithm for shipborne vision sensor video.
Figure 13. R-D curves contrast.
Table 1. Shipborne perception video parameters.

Video Type | Navigation Environment | Duration | Frame Rate | Resolution | Number of Videos
SVLC video | Anchoring | 20 s | 30 fps | 1920 × 1080 | 6
SVLC video | Anchoring | 20 s | 30 fps | 1280 × 720 | 5
SVLC video | Underway | 20 s | 30 fps | 1920 × 1080 | 8
SVLC video | Underway | 20 s | 30 fps | 1280 × 720 | 9
SVLC video | Port | 20 s | 30 fps | 1920 × 1080 | 8
SVLC video | Port | 20 s | 30 fps | 1280 × 720 | 5
SIRC video | Anchoring | 30 s | 25 fps | 704 × 576 | 8
SIRC video | Underway | 30 s | 25 fps | 704 × 576 | 8
SIRC video | Port | 30 s | 25 fps | 704 × 576 | 6
Table 2. Main parameters of the HEVC encoder.

Configuration Group | Parameter Name | Value
File I/O | InputBitDepth | 8
File I/O | InputChromaFormat | 420
Unit definition | MaxCUWidth | 64
Unit definition | MaxCUHeight | 64
Unit definition | MaxPartitionDepth | 4
Motion Search | FastSearch | 1
Motion Search | SearchRange | 64
Motion Search | HadamardME | 1
Motion Search | FEN | 1
Motion Search | FDM | 1
Motion Search | BipredSearchRange | 4
Coding Tools | SAO | 1
Coding Tools | AMP | 1
Quantization | QP | 32
Coding Structure | GOPSize | 5
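
To make Table 2 easier to reuse, the following minimal Python sketch (our illustration, not part of the paper) writes these settings out as an HM-style configuration file. The parameter names are taken directly from the table; the output file name encoder_svsv.cfg is hypothetical.

# Minimal sketch: emit the Table 2 settings as an HM-style .cfg file.
# Parameter names come from Table 2; the output file name is hypothetical.
hm_params = {
    "InputBitDepth": 8,        # File I/O
    "InputChromaFormat": 420,
    "MaxCUWidth": 64,          # Unit definition: 64x64 CTUs
    "MaxCUHeight": 64,
    "MaxPartitionDepth": 4,    # CU depths 0-3 (64x64 down to 8x8)
    "FastSearch": 1,           # Motion search settings
    "SearchRange": 64,
    "HadamardME": 1,
    "FEN": 1,
    "FDM": 1,
    "BipredSearchRange": 4,
    "SAO": 1,                  # Coding tools
    "AMP": 1,
    "QP": 32,                  # Quantization
    "GOPSize": 5,              # Coding structure
}

with open("encoder_svsv.cfg", "w") as f:
    for key, value in hm_params.items():
        f.write(f"{key:<22}: {value}\n")  # HM cfg files use "Name : value" lines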
Table 3. Time ratio of major coding processes (%).

Module | Visible Light Camera AI | Visible Light Camera LDP | Infrared Camera AI | Infrared Camera LDP
TComYUV | 2.00 | 1.72 | 1.89 | 1.85
TComSlice | 0.01 | 0.02 | 0.02 | 0.02
TComDataCU | 14.67 | 14.77 | 14.21 | 16.44
TComTU | 0.38 | 0.31 | 0.36 | 0.30
TEncSearch | 11.73 | 7.89 | 11.47 | 7.10
TComPrediction | 12.07 | 1.34 | 11.83 | 1.05
TComRdCost | 3.83 | 13.19 | 3.75 | 11.31
TComLoopFilter | 0.09 | 0.01 | 0.13 | 0.01
TComTrQuant | 16.80 | 11.00 | 18.58 | 10.49
TComInterpolationFilter | 0 | 19.42 | 0 | 20.40
TEncBinCABAC | 0.88 | 0.47 | 0.89 | 0.46
TEncSbac | 6.98 | 3.43 | 7.28 | 3.25
TEncEntropy | 0.25 | 0.14 | 0.25 | 0.11
else | 30.31 | 26.29 | 29.34 | 27.21
(AI: All Intra configuration; LDP: Low Delay P configuration.)
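
As a quick sanity check on Table 3, the sketch below (our illustration, using the All Intra visible-light column) verifies that the shares sum to 100% and ranks the most expensive modules; the hot spots in CU partitioning and intra prediction motivate the coding-unit-level optimization in this paper.

# Illustrative sketch: rank the Table 3 modules by time share in the
# All Intra (AI) column for visible light camera video.
ai_visible = {
    "TComYUV": 2.00, "TComSlice": 0.01, "TComDataCU": 14.67, "TComTU": 0.38,
    "TEncSearch": 11.73, "TComPrediction": 12.07, "TComRdCost": 3.83,
    "TComLoopFilter": 0.09, "TComTrQuant": 16.80,
    "TComInterpolationFilter": 0.0, "TEncBinCABAC": 0.88, "TEncSbac": 6.98,
    "TEncEntropy": 0.25, "else": 30.31,
}
assert abs(sum(ai_visible.values()) - 100.0) < 1e-6  # shares sum to 100%
for name, share in sorted(ai_visible.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{name:<24}{share:6.2f}%")
# Besides the aggregate "else" entry, TComTrQuant, TComDataCU,
# TComPrediction and TEncSearch dominate the AI encoding time.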
Table 4. Performance comparison of compressed shipborne video under the same QP (QP = 32).

Video Type | Resolution | Navigation Environment | Bitrate | PSNR (dB) | SSIM
Visible light video | 1920 × 1080 | Anchoring | 780 kbps | 40.614 | 0.946
Visible light video | 1920 × 1080 | Underway | 631 kbps | 41.323 | 0.953
Visible light video | 1920 × 1080 | Port | 869 kbps | 39.965 | 0.940
Visible light video | 1280 × 720 | Anchoring | 319 kbps | 42.464 | 0.953
Visible light video | 1280 × 720 | Underway | 226 kbps | 43.141 | 0.961
Visible light video | 1280 × 720 | Port | 388 kbps | 42.573 | 0.958
Infrared video | 704 × 576 | Anchoring | 27 kbps | 43.430 | 0.989
Infrared video | 704 × 576 | Underway | 18 kbps | 43.501 | 0.989
Infrared video | 704 × 576 | Port | 47 kbps | 42.793 | 0.984
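
The PSNR and SSIM values in Table 4 compare each decoded frame against its source frame. Below is a minimal sketch of how such per-frame quality scores can be computed with scikit-image; this is our illustration, not the paper's measurement code, and the frames are assumed to be 8-bit luma planes.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_quality(original: np.ndarray, decoded: np.ndarray):
    """Return (PSNR in dB, SSIM) for two 8-bit luma frames of equal size."""
    psnr = peak_signal_noise_ratio(original, decoded, data_range=255)
    ssim = structural_similarity(original, decoded, data_range=255)
    return psnr, ssim

# Hypothetical usage with a random 1080p "frame" and a lightly altered copy:
orig = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
dec = orig.copy()
dec[0, 0] ^= 1  # flip one bit so the frames are not identical
print(frame_quality(orig, dec))  # very high PSNR, SSIM close to 1.0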
Table 5. Hierarchy partition structure classification of CU.

Size of CU | Depth of CU | Coordinates of CU | Condition of Division | Quantity of Division
64 × 64 | 0 | D_x | None | 1
32 × 32 | 1 | D_{x,i}, i = 1, …, 4 | D(0) = 1 | 4
16 × 16 | 2 | D_{x,i,j}, i, j = 1, …, 4 | D(1, i) = 1 | 16
8 × 8 | 3 | None | D(2, i, j) = 1 | 64
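
The hierarchy in Table 5 is the standard HEVC quadtree: a CU at depth d either terminates or, when its division flag D(...) equals 1, splits into four CUs at depth d + 1, down to depth 3 (8 × 8). A minimal Python sketch of this structure follows; it is our illustration, and names such as CUNode are hypothetical.

from dataclasses import dataclass, field
from typing import List

CTU_SIZE = 64   # depth-0 CU size
MAX_DEPTH = 3   # 8x8 CUs at depth 3 are never split further

@dataclass
class CUNode:
    x: int                 # top-left luma coordinate inside the CTU
    y: int
    depth: int             # 0..3, CU size = 64 >> depth
    split: bool = False    # division flag: D(...) = 1 means "split"
    children: List["CUNode"] = field(default_factory=list)

    @property
    def size(self) -> int:
        return CTU_SIZE >> self.depth

    def divide(self) -> None:
        """Split this CU into four sub-CUs at the next depth."""
        assert self.depth < MAX_DEPTH, "8x8 CUs cannot be split"
        self.split = True
        half = self.size // 2
        self.children = [
            CUNode(self.x + dx * half, self.y + dy * half, self.depth + 1)
            for dy in (0, 1) for dx in (0, 1)
        ]

# Splitting the CTU once yields the four 32x32 CUs of Table 5:
ctu = CUNode(0, 0, depth=0)
ctu.divide()
print([(c.x, c.y, c.size) for c in ctu.children])
# [(0, 0, 32), (32, 0, 32), (0, 32, 32), (32, 32, 32)]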
Table 6. Parameter information of shipborne vision sensor video data set.

Video Type (Resolution) | Number of Videos (FLIS Frames) | Size of CU | QP | Number of Samples
Visible light camera video (1920 × 1080) | 22 (1320) | 64 × 64 | 22, 27, 32, 37 | 2,534,400
Visible light camera video (1920 × 1080) | 22 (1320) | 32 × 32 | 22, 27, 32, 37 | 7,451,136
Visible light camera video (1920 × 1080) | 22 (1320) | 16 × 16 | 22, 27, 32, 37 | 14,319,360
Visible light camera video (1280 × 720) | 19 (1140) | 64 × 64 | 22, 27, 32, 37 | 1,003,200
Visible light camera video (1280 × 720) | 19 (1140) | 32 × 32 | 22, 27, 32, 37 | 2,638,416
Visible light camera video (1280 × 720) | 19 (1140) | 16 × 16 | 22, 27, 32, 37 | 4,845,456
Infrared camera video (704 × 576) | 22 (3300) | 64 × 64 | 22, 27, 32, 37 | 1,306,800
Infrared camera video (704 × 576) | 22 (3300) | 32 × 32 | 22, 27, 32, 37 | 2,848,824
Infrared camera video (704 × 576) | 22 (3300) | 16 × 16 | 22, 27, 32, 37 | 5,645,376
Total videos: 63 (5760 FLIS frames) | Total number of samples: 42,592,968
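
The 64 × 64 sample counts in Table 6 are consistent with taking every complete 64 × 64 CU of every FLIS-selected frame under all four QPs, with partial border blocks (e.g., the bottom strip of a 1080-line frame) excluded. The small sketch below verifies this reading of the numbers; it is our interpretation, not a statement from the paper, and the 32 × 32 and 16 × 16 counts additionally depend on the actual split decisions of the encoder.

def cu64_samples(width: int, height: int, frames: int, num_qps: int = 4) -> int:
    """Count complete 64x64 CUs over all sampled frames and QPs."""
    return (width // 64) * (height // 64) * frames * num_qps

print(cu64_samples(1920, 1080, 1320))  # 2,534,400  (Table 6, 1080p visible)
print(cu64_samples(1280, 720, 1140))   # 1,003,200  (720p visible)
print(cu64_samples(704, 576, 3300))    # 1,306,800  (infrared)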
Table 7. Performance of intra coding time-delay optimization algorithm for shipborne vision sensor video with different QPs.

Video Type | Navigation Environment | BD-BR/% | BD-PSNR/dB | ΔT/% (QP = 22) | ΔT/% (QP = 27) | ΔT/% (QP = 32) | ΔT/% (QP = 37) | ΔT/% (Average)
Visible light camera video (1920 × 1080) | Anchoring | 2.37 | −0.16 | −48.22 | −49.30 | −50.62 | −53.34 | −50.37
Visible light camera video (1920 × 1080) | Underway | 1.74 | −0.14 | −50.85 | −50.97 | −52.26 | −56.01 | −52.52
Visible light camera video (1920 × 1080) | Port | 2.71 | −0.16 | −47.09 | −48.55 | −49.36 | −51.80 | −49.20
Visible light camera video (1280 × 720) | Anchoring | 1.75 | −0.13 | −40.51 | −41.40 | −45.00 | −47.55 | −43.62
Visible light camera video (1280 × 720) | Underway | 1.49 | −0.14 | −41.10 | −44.58 | −46.76 | −51.37 | −45.95
Visible light camera video (1280 × 720) | Port | 2.24 | −0.15 | −39.44 | −40.33 | −43.64 | −45.90 | −42.33
Infrared camera video (704 × 576) | Anchoring | 1.73 | −0.12 | −38.98 | −41.67 | −43.28 | −45.03 | −42.24
Infrared camera video (704 × 576) | Underway | 1.44 | −0.12 | −40.65 | −41.10 | −43.53 | −45.66 | −42.74
Infrared camera video (704 × 576) | Port | 1.84 | −0.13 | −36.01 | −40.57 | −41.96 | −43.30 | −40.46
Average | | 1.92 | −0.14 | −42.54 | −44.27 | −46.27 | −48.88 | −45.49
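
Here ΔT is the relative change in total encoding time against the HM16.17 anchor, so negative values are speed-ups, while BD-BR and BD-PSNR follow the standard Bjøntegaard metric computed from four R-D points (QP 22, 27, 32, 37). For reference, the sketch below is a minimal cubic-fit Bjøntegaard delta bit rate implementation of our own; BD-PSNR is computed analogously by fitting PSNR over log-rate, and the R-D points in the usage example are hypothetical rather than data from the paper.

import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjøntegaard delta bit rate (%): average bitrate change of the test
    encoder relative to the anchor at equal quality (cubic-fit variant)."""
    lr_a, lr_t = np.log10(rate_anchor), np.log10(rate_test)
    # Fit log-rate as a cubic polynomial of PSNR for each R-D curve.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_diff - 1) * 100

# Hypothetical R-D points (bitrate in kbps, PSNR in dB) for QP 37/32/27/22:
anchor = ([120, 260, 540, 1100], [37.1, 39.2, 41.0, 42.5])
test = ([123, 266, 552, 1125], [37.0, 39.1, 40.9, 42.4])
print(f"BD-BR = {bd_rate(*anchor, *test):.2f}%")  # positive -> bitrate overhead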
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
