In view of the problems encountered by the DeepLabv3+ network when performing semantic segmentation on the underwater image dataset, in this study the encoder and decoder structures of the DeepLabv3+ network were both improved. First, we added a UCM unit to the encoding structure. Second, the decoding structure was further optimized to capture more feature information. The improvements are described in detail below. The proposed network model is shown in
Figure 1.
In the entire network structure, the original image is first processed by the UCM module to improve image quality and then to generate the binary files required for training. Then, the backbone network Xception_65, which has 65 network layers, a 3 × 3 filter size and a stride of 2, is used to extract feature information. The obtained feature maps are input into the ASPP [
20] structure to obtain feature maps with different sampling rates and capture multi-scale context information. The ASPP module contains one 1 × 1 convolution and three 3 × 3 atrous (dilated) convolutions with sampling rates of 6, 12 and 18, respectively, each with 256 filters, together with batch normalization layers and global average pooling. Subsequently, all feature maps are concatenated and fused through a 1 × 1 convolution to obtain the high-level feature map.
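As a rough illustration of how the ASPP branches work, the following NumPy sketch implements a plain single-channel 3 × 3 atrous convolution and fuses three branches at the rates named above. It is a simplification we wrote for illustration (no batch normalization, 1 × 1 branch or pooling branch, and the 64 × 64 toy feature map is arbitrary), not the paper's implementation.

```python
import numpy as np

def atrous_conv2d(image, kernel, rate):
    """Single-channel 3x3 atrous (dilated) convolution with zero padding.

    A rate-r 3x3 kernel samples the input at offsets {-r, 0, +r},
    so its effective receptive field is (2r + 1) x (2r + 1).
    """
    k = kernel.shape[0]                  # kernel is k x k (k = 3 in ASPP)
    pad = rate * (k // 2)
    padded = np.pad(image, pad)
    out = np.zeros_like(image, dtype=float)
    for i in range(k):
        for j in range(k):
            di, dj = i * rate, j * rate
            out += kernel[i, j] * padded[di:di + image.shape[0],
                                         dj:dj + image.shape[1]]
    return out

# ASPP runs parallel atrous convolutions at rates 6, 12 and 18 and
# concatenates the resulting feature maps.
feat = np.random.rand(64, 64)            # toy single-channel feature map
kernel = np.ones((3, 3)) / 9.0           # toy averaging kernel
branches = [atrous_conv2d(feat, kernel, r) for r in (6, 12, 18)]
fused = np.stack(branches)               # stands in for channel concatenation
```

Note that each branch keeps the spatial resolution of its input; only the sampling stride of the kernel changes with the rate.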
The high-level feature map is upsampled by a factor of 2 using bilinear interpolation to obtain an enlarged feature map. After another two rounds of upsampling, the feature map is restored to the same size as the low-level feature map extracted by the Xception structure. The high-level and low-level feature maps are concatenated and a 3 × 3 convolution is used to fine-tune the features. Finally, bilinear interpolation is used to upsample by a factor of 2 to obtain the final predicted segmentation map.
The loss function used in this article is the commonly used cross-entropy loss function [21]. The loss function formula is:

L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]

where L represents the training loss, N is the number of samples, y_i is the actual sample label and p_i is the prediction label. y_i takes the value 0 or 1 and p_i takes values from (0, 1). The smaller the value of L, the more accurate the prediction result and the better the performance of the network model.
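The cross-entropy loss above can be computed as follows. This NumPy sketch is our own illustration (the `eps` clamp to avoid log(0) is an added numerical safeguard), not the authors' training code.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Cross-entropy loss L averaged over N samples.

    y : actual labels, each 0 or 1
    p : predicted probabilities in (0, 1)
    """
    p = np.clip(p, eps, 1.0 - eps)       # guard against log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.7])
loss = binary_cross_entropy(y, p)        # small loss: predictions match labels
```

A predictor that outputs 0.5 everywhere incurs a loss of ln 2 ≈ 0.693, so confident correct predictions like the ones above score markedly lower.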
3.2. UCM Unit
The optical characteristics of water show that objects image less clearly in water than in air. A variety of substances in seawater cause significant attenuation of light. At a depth of 20 m, the underwater environment can absorb 70% of the incident light [23] and environmental visibility is extremely low. Whether for humans or for underwater robots, the difficulty of operating under these conditions increases greatly. Poor lighting makes the imaging effect worse and is prone to producing certain phenomena, for example, false details, self-shadows, false contours and blurring. Underwater images suffer from reduced contrast and non-uniform color cast due to the absorption and scattering of light in the aquatic environment. This affects the quality and reliability of image processing; therefore, color correction is a necessary image processing operation [
24]. Compared with the Rayleigh Distribution [25], RGHS [26] and other underwater enhancement methods, we found that unsupervised color correction (UCM) is an underwater image enhancement method aimed at color correction; an image enhanced by this method has more edge information, and an image with more edge information has more feature content, which is helpful for feature extraction operations [16,24]. In Section 4 of the paper, we perform experiments on these underwater enhancement algorithms to verify the effectiveness of the UCM algorithm.
The RGB and HSI color models are exploited to correct the color and lighting problems of images and improve image quality in water [
4]. The flow chart of the UCM algorithm is shown in
Figure 2:
First, the RGB color model is used to perform color equalization. To obtain a high-quality image, the averages of the RGB color components need to be approximately equal. Let I_c(x, y) denote the pixel value of the red, green or blue component at position (x, y), where c denotes the corresponding channel of the RGB image, M × N is the size of the image and I_c^{max} is the maximum value of color component c in the RGB image. Then, the average value Avg_c of each color component R, G and B is calculated:

Avg_c = \frac{1}{MN} \sum_{x=1}^{M} \sum_{y=1}^{N} I_c(x, y), \quad c \in \{R, G, B\}
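A minimal sketch of this color-equalization step: the NumPy code below computes the per-channel averages and applies Von Kries-style gains, assuming blue is the reference channel, as in a typical underwater color cast. The function name and toy image are our own illustration, not the UCM reference implementation.

```python
import numpy as np

def von_kries_equalize(img):
    """Equalize R and G toward the dominant blue channel (Von Kries).

    img: H x W x 3 float array in RGB order. Assumes blue has the
    largest channel average, i.e. the main color cast is blue.
    """
    avg = img.reshape(-1, 3).mean(axis=0)    # Avg_R, Avg_G, Avg_B
    gains = np.array([avg[2] / avg[0],       # gain_R = Avg_B / Avg_R
                      avg[2] / avg[1],       # gain_G = Avg_B / Avg_G
                      1.0])                  # blue is the reference
    return img * gains

# toy image with a blue cast: the blue average exceeds red and green
img = np.random.rand(8, 8, 3) * np.array([0.3, 0.5, 0.9])
balanced = von_kries_equalize(img)
```

After this scaling the three channel averages coincide exactly, which is the "equal color components" condition stated above.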
The remaining two gain factors are calculated based on the main color cast being blue. The final adjusted pixel values are obtained by the Von Kries assumption, and then the RGB color model is exploited to perform upper- and lower-side contrast correction using the following formula:

P_o = (P_i - a) \times \frac{d - c}{b - a} + c

where P_o is the pixel value after contrast correction, P_i is the pixel value of the current image, c is the lower limit, d is the upper limit, a is the minimum pixel value existing in the current image and b is the maximum pixel value existing in the current image. c and d take values from [0, 255].
If red is the lowest color component, a is the minimum pixel value of the red component; if blue exhibits the heavier color cast, b is the maximum pixel value of the blue component; for a color component between blue and red, each variable in the formula remains unchanged. The HSI color model is then used to perform contrast correction from both the dark and light sides.
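The upper and lower contrast correction described above amounts to a min-max stretch of each channel. A hedged NumPy sketch, assuming an 8-bit output range with lower limit c = 0 and upper limit d = 255:

```python
import numpy as np

def contrast_stretch(channel, c=0.0, d=255.0):
    """Stretch one color channel to the full [c, d] range.

    Implements P_o = (P_i - a) * (d - c) / (b - a) + c, where a and b
    are the minimum and maximum pixel values of the input channel.
    """
    a, b = channel.min(), channel.max()
    if b == a:                       # flat channel: nothing to stretch
        return np.full_like(channel, c, dtype=float)
    return (channel - a) * (d - c) / (b - a) + c

# a low-contrast channel occupying only [60, 120] is stretched to [0, 255]
ch = np.array([[60.0, 90.0], [100.0, 120.0]])
stretched = contrast_stretch(ch)
```

The flat-channel guard is our addition; the formula itself divides by b − a, which is zero for a constant channel.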
The experimental results are shown in
Figure 3. The first row shows the original images and the second row shows the enhanced images after UCM processing. As can be seen, the UCM unit effectively alleviates phenomena of underwater images such as blurring and dim light and increases the illumination of the underwater image, which then has more texture features and detailed information as well as clearer and richer colors. Therefore, adding the UCM module to the encoder structure can improve segmentation accuracy.
In order to further verify the applicability of the UCM algorithm to research on underwater image semantic segmentation, we used the Sobel operator to perform edge detection on the original and enhanced images and compared the results.
Figure 4 is a comparison diagram of the experimental results of the Sobel edge detection algorithm. In the figure, the first row is the edge detection result of the original image and the second row is the edge detection result of the enhanced image. It can be seen that the UCM-processed images in the second row have more edge information, and an image with more edge information is considered to have higher feature content. Therefore, the UCM algorithm is suitable for the study of semantic segmentation of underwater images.
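The Sobel comparison described above is straightforward to reproduce. The sketch below implements the two 3 × 3 Sobel kernels and the gradient magnitude in plain NumPy on a synthetic step image; it illustrates the operator itself, not the paper's experiment code.

```python
import numpy as np

def sobel_magnitude(gray):
    """Gradient magnitude from the two 3x3 Sobel kernels."""
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)   # horizontal gradient
    ky = kx.T                                  # vertical gradient
    H, W = gray.shape
    padded = np.pad(gray, 1)
    gx = np.zeros((H, W))
    gy = np.zeros((H, W))
    for i in range(3):
        for j in range(3):
            patch = padded[i:i + H, j:j + W]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    return np.hypot(gx, gy)

# a sharp vertical step produces a strong edge response at the boundary
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = sobel_magnitude(img)
```

Regions of constant intensity yield zero response, while the step boundary yields the maximal magnitude, which is why richer texture after UCM enhancement shows up as more detected edges.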
3.3. Decoding Structure Optimization
In
Figure 5, the left side is the encoder structure and the right side is the decoder structure. Considering the problem of unclear target boundaries in underwater image semantic segmentation results, directly upsampling by a factor of 4 in the decoding module causes part of the feature information to be lost, and the result also lacks the contour and boundary information of the target if it is not connected to low-level feature information.
Therefore, we add two layers of upsampling to the decoding structure. The high-level feature map is upsampled by a factor of 2 to obtain a partially enlarged feature map, which is concatenated with the low-level feature map of the same resolution obtained from the backbone network to retain more feature information, make the boundary information of the object more complete and make the semantic information clearer.
We also apply depthwise separable convolutions and atrous convolutions in the upsampling layers. Therefore, the decoder structure can more effectively control the resolution of the feature maps extracted from the encoder structure.
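A shape-level sketch of the added decoding step: bilinear × 2 upsampling of the high-level feature map followed by concatenation with a same-resolution low-level map. The interpolation below uses a simplified align-corners-style grid, and the channel counts (256 high-level, 48 low-level) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def upsample2_bilinear(x):
    """Bilinear x2 upsampling of a C x H x W feature map."""
    C, H, W = x.shape
    ys = np.linspace(0, H - 1, 2 * H)        # sample rows in source coords
    xs = np.linspace(0, W - 1, 2 * W)        # sample cols in source coords
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[None, :, None]            # fractional row weights
    wx = (xs - x0)[None, None, :]            # fractional col weights
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

# high-level ASPP output is upsampled x2 and concatenated channel-wise
# with a same-resolution low-level feature map from the backbone
high = np.random.rand(256, 16, 16)
low = np.random.rand(48, 32, 32)
merged = np.concatenate([upsample2_bilinear(high), low], axis=0)
```

Concatenating along the channel axis is what lets the subsequent 3 × 3 convolution see both the semantic (high-level) and boundary (low-level) information at once.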