Article

New Underwater Image Enhancement Algorithm Based on Improved U-Net

1 College of Mechanical and Electrical Engineering, Hohai University, Changzhou 213000, China
2 China Yangtze Power Co., Ltd., Yichang 443002, China
3 College of Information Science and Engineering, Hohai University, Changzhou 213000, China
4 College of Artificial Intelligence and Automation, Hohai University, Changzhou 213000, China
* Author to whom correspondence should be addressed.
Water 2025, 17(6), 808; https://doi.org/10.3390/w17060808
Submission received: 30 December 2024 / Revised: 27 February 2025 / Accepted: 6 March 2025 / Published: 12 March 2025

Abstract

(1) Objective: As light propagates through water, it undergoes significant attenuation and scattering, causing underwater images to suffer color distortion and take on a bluish or greenish tint; suspended particles in the water further degrade image quality. This paper proposes an improved U-Net network model for underwater image enhancement that generates high-quality images. (2) Method: Instead of adding complex modules to the enhancement network, we simplify the classic U-Net architecture. Specifically, we replace the standard convolutions in U-Net with a self-designed efficient basic block that integrates a simplified channel attention mechanism. We employ Layer Normalization to improve training with small numbers of samples and use the GELU activation function to obtain additional benefits in image denoising. We also introduce the SK fusion module into the network to aggregate feature information, replacing the traditional concatenation operation. In the experimental section, we used the “Underwater ImageNet” subset of the “Enhancing Underwater Visual Perception (EUVP)” dataset for training and testing. EUVP, established by Islam et al., is a large-scale dataset comprising paired images (high-quality clear images and low-quality blurry images) as well as unpaired underwater images. (3) Results: We compared the proposed method with several high-performing traditional algorithms and deep learning-based methods. The traditional algorithms include He, UDCP, ICM, and ULAP, while the deep learning-based methods include CycleGAN, UGAN, UGAN-P, and FUnIE-GAN. The results demonstrate that our algorithm is highly competitive on the Underwater ImageNet dataset. Compared to FUnIE-GAN, currently the best lightweight model, our method reduces the number of parameters by 0.969 M and cuts the floating-point operations (FLOPs) by more than half. In terms of image quality, our approach shows a UCIQE reduction of only 0.008 while improving the NIQE by 0.019 compared to state-of-the-art (SOTA) methods. Finally, extensive ablation experiments validate the soundness of the designed network. (4) Conclusions: The underwater image enhancement algorithm proposed in this paper significantly reduces model size and accelerates inference while maintaining high processing performance, demonstrating strong potential for practical applications.

1. Introduction

The vast ocean, as the Earth’s largest and most valuable resource, holds profound significance for human development. Obtaining information about the ocean is critical to fully understanding it and to enhancing the exploitation and utilization of marine resources. Underwater optical images, as one of the key carriers of marine information, play an indispensable role in this process. Because light attenuates rapidly as it propagates through water, the intensity of visible light decreases sharply with increasing depth, making underwater images blurred and dim (Raveendran et al., 2021) [1]. Additionally, refraction and scattering of light by water cause distortions and deformations in underwater images, altering the position, shape, and color of objects captured underwater. Noise sources such as water currents, bubbles, and suspended particles further degrade the quality and clarity of underwater images, leaving them in a bluish-green blurred state (Moghimi et al., 2021) [2], as shown in Figure 1; these images are selected from the Underwater ImageNet subset of the EUVP (Enhancing Underwater Visual Perception) dataset.
As a result, underwater machine vision systems struggle to capture clear images, often requiring specialized light sources and sensors to address light attenuation, which inevitably increases the cost and difficulty of exploring the underwater world. To ensure the normal functioning of computer vision systems equipped on underwater robots and other devices, removing underwater noise and enhancing underwater images has become a challenge that researchers must tackle (Anwar et al., 2020) [3].
Figure 1. Different underwater images [4].
Existing underwater image enhancement methods fall mainly into physical model-based methods, non-physical model-based methods, and deep learning-based methods. Physical model-based methods focus on accurately estimating the medium transmittance and, together with other key underwater imaging parameters such as the global background light, recovering a clear image by inverting a physical underwater imaging model. Priors are the foundation of physical model-based underwater image enhancement. He et al. (2010) [5] proposed the dark channel prior (DCP) for the single-image defogging problem, which is simple in principle and effective, so many scholars have combined the DCP with the attenuation properties of underwater images and proposed numerous underwater image restoration methods. Drews et al. (2013) [6] adapted the DCP to create the underwater dark channel prior (UDCP), which treats the blue and green channels as the sources of underwater visual information, significantly improving the applicability of the DCP underwater, although the processing effect is still limited. Song et al. (2018) [7] proposed a fast and effective underwater scene depth estimation model based on the underwater light attenuation prior (ULAP) and trained it with supervised linear regression to learn the coefficients. However, physical model-based approaches perform poorly in realistic underwater scenes, because the model assumptions do not always hold in complex and dynamic underwater environments and it is difficult to estimate multiple model parameters simultaneously.
Non-physical model-based methods mainly modify pixels in the spatial or frequency domain to enhance color attributes and improve the visual quality of underwater images, and they perform well in enhancing contrast and brightness. For example, Iqbal et al. (2007) [8] proposed an underwater image enhancement method based on an integrated color model (ICM) with sliding stretching, which uses contrast stretching in RGB space to equalize the color contrast in the image and then applies saturation and intensity stretching in HSI space to restore true colors. However, as with most non-physical model-based methods, the algorithm can lead to over-enhancement and over-saturation of the image.
Learning-based methods tend to achieve better results than traditional methods. Fabbri et al. (2018) [9] proposed UGAN, a generative adversarial network for underwater scenes, which uses CycleGAN (Zhu et al., 2017) [10] to generate paired images, providing the model with a higher-quality training set and improving optimization. Islam et al. (2020) [4] proposed FUnIE-GAN, a real-time underwater image enhancement model based on conditional generative adversarial networks, and released a new high-quality underwater dataset, EUVP, which, combined with a multimodal objective function, significantly improves the contrast and clarity of the processed images. UMGAN (Sun et al., 2023) [11] used a global–local discriminator to enhance the underwater image while adaptively refining local regions, effectively improving local image quality. However, most learning-based methods are still limited by insufficient data, and small-scale datasets can greatly reduce the quality of model training.
In addition, the current trend in learning-based models is to boost network performance by stacking complex architectures and embedding large numbers of attention mechanisms, which gradually produces ever-larger underwater image enhancement models whose gains have saturated. Therefore, in this paper, we consider how to design a model that can match, or even surpass, the results above while keeping the network architecture lightweight.
Based on the above analysis, this paper proposes an underwater image enhancement algorithm based on an improved U-Net, building on the classical U-Net architecture. The method is an end-to-end structure that takes low-quality underwater images as input and outputs processed, clear underwater images. The designed model is built from the basic block and the SK fusion module, supplemented by local and global residual operations, and can effectively extract and aggregate spatial and channel information. Experimental results show that our method has a slight advantage in output image quality over current leading methods, while the algorithm complexity and parameter count are greatly reduced. The main contributions of this paper are as follows:
(1) Pointwise convolution and depthwise convolution are used extensively in place of standard convolution, which effectively aggregates spatial information, transforms features, and reduces computation. The SK fusion module is introduced into the model to dynamically fuse global information and improve feature acquisition.
(2) This paper simplifies the attention mechanism in the basic blocks of the U-Net network, significantly reducing computational complexity and further enhancing the network’s ability to extract channel information.
(3) In this study, additional training strategies were incorporated. These strategies are not commonly used for image enhancement, but experimental results demonstrate that they enable the network to achieve excellent training performance on a small-scale dataset.

2. Network Architecture

To address the color distortion and low clarity of underwater images, this paper proposes a network model based on an improved U-Net, which effectively improves image quality while significantly reducing computation compared with SOTA methods. The model architecture is described in detail in this section.

2.1. General Overview of the Model

U-Net was proposed by Olaf Ronneberger et al. [12] in 2015 and was initially designed as a deep learning architecture for image segmentation. Its U-shaped structure is simple and therefore lends itself to component modification and embedding to enhance performance. Figure 2 shows the basic composition of the network in this paper. The network is an end-to-end structure, and the overall architecture follows the symmetric downsampling–upsampling structure of U-Net, generating feature maps of different sizes at each level so that the network can capture features at different scales. We abandon the simple stacking of convolutional layers and max-pooling layers in U-Net and instead incorporate our designed basic blocks into the network to enhance feature extraction. The number of basic blocks in each stage is 2, 2, 2, 4, 2, 2, 2, 2. In addition, we retain the skip connections of U-Net, but instead of directly concatenating the downsampling and upsampling paths, we dynamically fuse the feature maps from the different paths through the SK fusion module. A convolutional layer is cascaded at both the input and the output of the network, and to reduce possible gradient vanishing or explosion during backpropagation, we also add a global residual connection between the input and the output to enhance the stability of the network.

2.2. Basic Block

As a whole, the basic block (BB) consists of two consecutive local residual structures (He et al., 2016) [13], and simplicity and effectiveness are its most notable features. To keep the structure simple, this paper composes the basic block from common neural network components. Normalization has been widely adopted in high-level computer vision tasks and is also popular in low-level vision. The effectiveness of Batch Normalization (Ioffe et al., 2015) [14] is affected by small batch sizes: with smaller batches, the statistical estimates are less accurate, which may lead to poor normalization. With the rapid development of Transformer-based methods (Devlin et al., 2018; Huang et al., 2022; Peng et al., 2023) [15,16,17], Layer Normalization (Ba et al., 2016) [18] has been increasingly applied. Layer Normalization is independent of batch size, i.e., the amount of data involved in its calculation is unaffected by the number of samples, which avoids the potential problems of Batch Normalization. Therefore, as shown in Figure 3, we expect Layer Normalization to play an important role in model training and image restoration and choose it as the first layer of the basic block.
Layer Normalization is followed by pointwise convolution and depthwise convolution in tandem (Howard et al., 2017) [19]. Depthwise convolution, which convolves each channel of the input separately, compresses the model parameters and speeds up computation. Pointwise (1 × 1) convolution extracts features at each spatial position across channels and enhances model performance. In terms of the activation function, the ablation experiments in Section 4 demonstrate that the Gaussian Error Linear Unit (GELU) (Hendrycks et al., 2016) [20] suits the network built in this paper better than the Rectified Linear Unit (ReLU). Therefore, this paper uses the GELU instead of the ReLU commonly used in machine vision tasks, maintaining image denoising performance while gaining a considerable benefit in image deblurring. The formula is as follows:
$$\mathrm{GELU}(x) = x\,\Phi(x)$$
where $\Phi(x)$ denotes the cumulative distribution function of the standard Gaussian distribution, i.e., the integral of the Gaussian density over the interval $(-\infty, x]$. The attention mechanism (Vaswani et al., 2017) [21] plays a crucial role in image enhancement networks (Wang et al., 2020; Xue et al., 2022) [22,23] and directly determines the network’s ability to extract features. Since depthwise convolution already captures local information well, this paper focuses on global attention. The channel attention mechanism is computationally efficient while preserving the global information of each feature, which is one of the reasons why many strong image processing networks, such as FFA-Net (Qin et al., 2020) [24], perform so well. It first compresses the spatial information into channels and then applies a multilayer perceptron to compute the channel attention, which is used to weight the feature maps, as shown in Figure 4a. Its formula can be expressed as follows:
$$\mathrm{CA}(X) = X * \sigma\bigl(W_2\,\max\bigl(0,\; W_1\,\mathrm{pool}(X)\bigr)\bigr)$$
where $X$ denotes the feature map; $\mathrm{pool}$ denotes the global average pooling operation that aggregates spatial information into channels; $\sigma$ is the Sigmoid nonlinear activation function; $W_1$ and $W_2$ are fully connected layers with a ReLU activation between them; and $*$ is a channel-wise product. Adding a channel attention mechanism greatly enhances model performance, but it also increases the complexity of the basic block, so simplification is needed to improve computational efficiency. In NAFNet (Chen et al., 2022) [25], Gated Linear Units (GLUs) are simplified by treating them as a variant of the GELU activation function, turning computationally complex GLUs into simple element-wise matrix multiplications. The channel attention can thus be rewritten in the following form:
$$\mathrm{CA}(X) = X * \psi(X)$$
where $X$ denotes the feature map and $\psi$ denotes the channel attention computation. Next, only the two most critical factors in channel attention, i.e., global information and channel information, are retained, as shown in Figure 4b. Aggregating the two yields the simplified channel attention mechanism:
$$\mathrm{SCA}(X) = X * W\,\mathrm{pool}(X)$$
The resulting SCA is clearly simpler and no longer depends on the nonlinear activation functions ReLU and Sigmoid. We demonstrate in the ablation experiments that this bold simplification does not degrade image quality.
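For concreteness, the following is a minimal PyTorch sketch of a basic block of the kind described above: Layer Normalization, pointwise and depthwise convolution, GELU, simplified channel attention, and a local residual. The class names, channel expansion factor, and kernel sizes are illustrative assumptions rather than the exact configuration used in this paper, and only one of the two local residual stages is shown for brevity.

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """Layer Normalization over the channel dimension of an NCHW feature map."""
    def __init__(self, channels, eps=1e-6):
        super().__init__()
        self.norm = nn.LayerNorm(channels, eps=eps)

    def forward(self, x):
        # (N, C, H, W) -> (N, H, W, C) -> normalize over C -> back to (N, C, H, W)
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

class SimplifiedChannelAttention(nn.Module):
    """SCA(X) = X * W pool(X): global average pooling plus one 1x1 conv, no ReLU/Sigmoid."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.weight = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x * self.weight(self.pool(x))

class BasicBlock(nn.Module):
    """LayerNorm -> 1x1 pointwise conv -> 3x3 depthwise conv -> GELU -> SCA -> 1x1 conv,
    wrapped in a local residual connection."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.norm = LayerNorm2d(channels)
        self.pw1 = nn.Conv2d(channels, hidden, kernel_size=1)              # pointwise
        self.dw = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                            groups=hidden)                                 # depthwise
        self.act = nn.GELU()
        self.sca = SimplifiedChannelAttention(hidden)
        self.pw2 = nn.Conv2d(hidden, channels, kernel_size=1)              # project back

    def forward(self, x):
        y = self.pw2(self.sca(self.act(self.dw(self.pw1(self.norm(x))))))
        return x + y                                                       # local residual

# Quick shape check on a dummy feature map.
if __name__ == "__main__":
    block = BasicBlock(32)
    print(block(torch.rand(1, 32, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])
```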

2.3. SK Fusion Module

Inspired by SKNet (Li et al., 2019) [26], this paper adapts the SK module into the SK fusion module. Let the two feature maps be $x_1$ and $x_2$, where $x_1$ comes from the skip connection and $x_2$ comes from the main path of the network. First, $x_1$ is passed through a 1 × 1 pointwise convolutional layer to obtain $\hat{x}_1$; then $\hat{x}_1$ and $x_2$ are fed into a global average pooling layer and a multilayer perceptron (composed of a dimensionality-reducing convolution, a ReLU activation, and a dimensionality-raising convolution), and the fusion weights are finally obtained through the softmax function. The specific calculation process is as follows:
$$\hat{x}_1 = \sum_{c_{\mathrm{in}}=1}^{C_{\mathrm{in}}} w_{c_{\mathrm{out}},\,c_{\mathrm{in}}} \cdot x_1 + b_{c_{\mathrm{out}}}$$
where $x_1$ represents the value of the input feature map, $w$ denotes the corresponding weight in the convolution kernel, $b_{c_{\mathrm{out}}}$ is the bias term, $c_{\mathrm{in}}$ indexes the $C_{\mathrm{in}}$ input channels, and $c_{\mathrm{out}}$ indexes the output channels.
$$y_G = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} \bigl(\hat{x}_1 + x_2\bigr)$$
where $y_G$ is the output of global average pooling, $\hat{x}_1$ and $x_2$ represent the values of the input feature maps, and $H$ and $W$ are the height and width of the input feature map, respectively.
$$\{a_1, a_2\} = \mathrm{split}\Bigl(\mathrm{softmax}\bigl(f^{(L-1)}(\cdots f^{(1)}(y_G))\bigr)\Bigr)$$
where $f^{(l)}(\cdot)$ is the activation function of the $l$-th layer of the multilayer perceptron, and $a_1$ and $a_2$ denote the fusion weights.
$$y = a_1\,\hat{x}_1 + a_2\,x_2$$
where $y$ is the result of feature fusion. The final modified module structure is shown in Figure 5. Compared with the simple concatenation of features from different layers in U-Net, the SK fusion module better weighs the feature maps during information transfer, so that they carry richer information.
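As an illustration, the SK-style fusion of a skip-connection feature map and a main-path feature map described above can be sketched in PyTorch as follows; the reduction ratio and layer names are our own assumptions.

```python
import torch
import torch.nn as nn

class SKFusion(nn.Module):
    """Fuse a skip-connection feature map x1 with a main-path feature map x2 using
    softmax weights computed from globally pooled statistics."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)   # 1x1 conv on the skip branch
        self.pool = nn.AdaptiveAvgPool2d(1)                        # global average pooling
        self.mlp = nn.Sequential(                                  # dim-reducing conv, ReLU, dim-raising conv
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 2 * channels, kernel_size=1),
        )

    def forward(self, x1, x2):
        x1_hat = self.proj(x1)                          # projected skip-connection features
        y_g = self.pool(x1_hat + x2)                    # pooled statistics of the two branches
        a = self.mlp(y_g).view(x1.size(0), 2, x1.size(1), 1, 1)
        a = torch.softmax(a, dim=1)                     # fusion weights a1, a2 (sum to 1 per channel)
        a1, a2 = a[:, 0], a[:, 1]
        return a1 * x1_hat + a2 * x2                    # y = a1 * x1_hat + a2 * x2

# Example: fuse two 32-channel feature maps of the same spatial size.
if __name__ == "__main__":
    fuse = SKFusion(32)
    y = fuse(torch.rand(1, 32, 64, 64), torch.rand(1, 32, 64, 64))
    print(y.shape)  # torch.Size([1, 32, 64, 64])
```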

2.4. Loss Functions

The Mean Squared Error (MSE, i.e., L2) loss is a commonly used loss function for training image enhancement models. Lim et al. (2017) [27] pointed out that the L1 loss performs better across various metrics in image restoration tasks. Therefore, this paper selects an optimized L1 loss for training. The specific formula is as follows:
$$\mathrm{loss}(y_{gt}, y_o) = \frac{1}{N} \sum_{i=1}^{N} \bigl| y_{gt} - y_o \bigr|$$
where $y_{gt}$ denotes the ground truth corresponding to the input and $y_o$ denotes the image output by the network.
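In PyTorch this objective corresponds directly to the built-in L1 loss; a minimal usage sketch with illustrative tensor shapes:

```python
import torch
import torch.nn as nn

criterion = nn.L1Loss()                                 # mean absolute error over all pixels
y_o = torch.rand(4, 3, 256, 256, requires_grad=True)    # network output (batch of enhanced images)
y_gt = torch.rand(4, 3, 256, 256)                       # ground-truth clear images
loss = criterion(y_o, y_gt)
loss.backward()                                         # gradients flow back toward the network output
print(loss.item())
```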

3. Experiments

3.1. Training Setup

The method in this paper is implemented with the PyTorch framework, and all experiments were performed on two NVIDIA RTX 3090 graphics cards. For training, the images were randomly cropped into 256 × 256 blocks. The number of epochs was set to 1000, with the first 50 epochs used for training warm-up. We optimized the network using the Adam optimizer ($\beta_1$ and $\beta_2$ with default values of 0.9 and 0.999, respectively).
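A sketch of the corresponding data-cropping and optimizer setup is shown below; the placeholder network and the peak learning rate of 1e-3 are assumptions, since the paper does not state these exact values.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Random 256 x 256 crops of the training images, as described above.
train_transform = transforms.Compose([
    transforms.RandomCrop(256),
    transforms.ToTensor(),
])

net = nn.Conv2d(3, 3, kernel_size=3, padding=1)        # stand-in for the enhancement network
optimizer = torch.optim.Adam(net.parameters(),
                             lr=1e-3,                   # assumed peak learning rate (not stated in the paper)
                             betas=(0.9, 0.999))        # beta1 / beta2 defaults given in Section 3.1
```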
For the dataset, images were selected from the paired Underwater ImageNet subset of EUVP (Enhancing Underwater Visual Perception) for training and testing. EUVP, established by Islam et al. (2020) [4], is a large-scale dataset containing both paired (high-quality clear vs. low-quality blurred) and unpaired underwater images. The producers used seven different cameras, including multiple DJI GoPros (Shenzhen, China), an Aqua AUV uEye camera (Prague, Czech Republic), a low-light USB camera, and a Trident ROV high-definition camera (Berkeley, CA, USA), to capture images for the dataset. The images were captured and collected in different locations and under different illumination conditions using ocean exploration and human–computer collaboration techniques. In addition, the producers extracted images from a number of publicly available YouTube videos and added them to the dataset. These added images were carefully selected to cover a variety of natural variations (different scenes, water body types, illumination conditions, etc.). The unpaired images were classified as good or poor quality by subjective human judgment of contrast, color, clarity, and so on. The paired images were created as follows: a CycleGAN-based model was trained on the unpaired images to learn the mapping between good- and poor-quality images, and the learned model was then used to distort good-quality images into corresponding low-quality versions. As a result, the obtained image pairs align well with human perceptual preferences. In this paper, 2000 of these images were randomly selected to simulate a small-scale dataset for training, and 400 images were randomly selected as the test set.

3.2. Training Strategies

In this paper, we adopt the following two techniques that are not commonly used for image enhancement to improve the training results:
Mixed-precision training: This is a technique for optimizing the training speed and memory usage of deep learning models. Traditional deep learning models usually use single-precision (32-bit) floating-point numbers for parameter and gradient computation to obtain high numerical accuracy, but this places high demands on memory and computational resources. Mixed-precision training instead exploits the GPU’s ability to compute with both half-precision (16-bit) and single-precision floating-point numbers. Under this strategy, most forward computations are carried out in half precision, while a master copy of the parameters and numerically sensitive steps such as the weight update remain in single precision. In this paper, we employ mixed-precision training to significantly reduce memory usage and speed up computation without degrading model performance.
Training warm-up: Linear warm-up is commonly used in high-level vision tasks. Since the initial learning rate in this paper is relatively large and mixed-precision training is enabled, we adopt a warm-up strategy to reduce the risk of training collapse and improve the training success rate, i.e., the learning rate is gradually increased from a small value to the set maximum over the first 50 epochs of training.
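Both strategies map onto standard PyTorch utilities. The sketch below combines automatic mixed precision (torch.cuda.amp) with a linear warm-up over the first 50 epochs; the placeholder network, the learning-rate values, and the dummy data are illustrative assumptions.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Conv2d(3, 3, kernel_size=3, padding=1).to(device)   # stand-in for the enhancement network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
criterion = nn.L1Loss()

scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # loss scaling for fp16 training
# Linear warm-up: scale the learning rate from 10% to 100% over the first 50 epochs.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1,
                                           end_factor=1.0, total_iters=50)

for epoch in range(1000):                               # 1000 epochs, as in Section 3.1
    # One dummy batch per epoch, just to illustrate the loop structure.
    lo = torch.rand(4, 3, 256, 256, device=device)      # low-quality inputs
    gt = torch.rand(4, 3, 256, 256, device=device)      # ground-truth targets

    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):   # half precision where safe
        loss = criterion(model(lo), gt)
    scaler.scale(loss).backward()                       # scaled backward pass
    scaler.step(optimizer)                              # unscale gradients and update weights
    scaler.update()
    warmup.step()                                       # advance the warm-up schedule once per epoch
```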

3.3. Reference Metrics

In most cases, the PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index Measure) can effectively reflect the performance of an image enhancement network. However, relying solely on the PSNR and SSIM is insufficient, as high scores in these metrics do not necessarily indicate absolutely superior image quality. Images with high scores may still exhibit texture details and overall chromaticity that do not align with human visual perception. Therefore, this paper employs the PSNR, SSIM, UCIQE (Underwater Color Image Quality Evaluation metrics), and NIQE (Natural Image Quality Evaluator) to objectively assess the output images. The latter two are no-reference image quality evaluation metrics, meaning they do not require comparison with ground truth images. Compared to the PSNR and SSIM, which measure image similarity, UCIQE and the NIQE provide better visual reference value and pose greater challenges.
UCIQE is based on image chrominance, contrast, and saturation, and its calculation formula is shown below:
$$UCIQE = c_1 \times \sigma_c + c_2 \times \mathrm{con}_l + c_3 \times \mu_s$$
where $c_1$, $c_2$, and $c_3$ are weighting coefficients (hyperparameters), $\sigma_c$ is the standard deviation of chrominance, $\mathrm{con}_l$ is the luminance contrast, and $\mu_s$ is the average saturation. A larger UCIQE value indicates better color balance, sharpness, and contrast. It is worth noting that the human perception system correlates well with the standard deviation of chrominance of underwater images, so UCIQE is a reference metric that agrees well with human visual perception.
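For reference, UCIQE can be computed along the following lines for an RGB image; the coefficients shown (0.4680, 0.2745, 0.2576) are the values commonly used in the literature, and the percentile-based contrast term and chroma normalization are implementation assumptions rather than details specified in this paper.

```python
import numpy as np
from skimage import color

def uciqe(rgb, c1=0.4680, c2=0.2745, c3=0.2576):
    """UCIQE = c1*sigma_c + c2*con_l + c3*mu_s for an RGB image with values in [0, 1]."""
    lab = color.rgb2lab(rgb)
    l = lab[..., 0] / 100.0                              # luminance scaled to [0, 1]
    chroma = np.sqrt(lab[..., 1] ** 2 + lab[..., 2] ** 2)
    sigma_c = chroma.std() / 100.0                       # chroma standard deviation (rough normalization)
    con_l = np.percentile(l, 99) - np.percentile(l, 1)   # luminance contrast (assumed percentile range)
    mu_s = color.rgb2hsv(rgb)[..., 1].mean()             # mean saturation
    return c1 * sigma_c + c2 * con_l + c3 * mu_s

# Example on a random image; replace with a real underwater image for a meaningful score.
print(uciqe(np.random.rand(256, 256, 3)))
```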
The NIQE extracts features from a spatial domain natural scene statistics (NSS) model and models them using a multivariate Gaussian (MVG) model, and the quality of the image to be tested is expressed as the distance between the MVG fitting parameters of the NSS features extracted from this image and the parameters of the pre-built model. The NIQE reflects the texture details of the image very well. The specific calculation formula is shown below:
$$D(\nu_1, \nu_2, \Sigma_1, \Sigma_2) = \sqrt{(\nu_1 - \nu_2)^{T} \Bigl(\frac{\Sigma_1 + \Sigma_2}{2}\Bigr)^{-1} (\nu_1 - \nu_2)}$$
where $\nu_1$, $\nu_2$ and $\Sigma_1$, $\Sigma_2$ denote the mean vectors and covariance matrices of the natural MVG model and the distorted-image MVG model, respectively. The smaller the NIQE value, the higher the texture quality of the image.
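The distance itself is a Mahalanobis-style distance between two multivariate Gaussian fits. A sketch of this final step is given below; the NSS feature extraction and the pristine model parameters are omitted, and the feature dimension and random inputs are purely illustrative.

```python
import numpy as np

def niqe_distance(mu_pristine, cov_pristine, mu_test, cov_test):
    """D = sqrt((v1 - v2)^T ((S1 + S2) / 2)^-1 (v1 - v2)) between two MVG fits."""
    diff = mu_pristine - mu_test
    pooled = (cov_pristine + cov_test) / 2.0
    return float(np.sqrt(diff @ np.linalg.pinv(pooled) @ diff))

# Toy example with random parameters standing in for real NSS feature fits.
rng = np.random.default_rng(0)
dim = 18                                                 # illustrative feature dimension
mu1, mu2 = rng.normal(size=dim), rng.normal(size=dim)
a, b = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
cov1, cov2 = a @ a.T, b @ b.T                            # positive semi-definite covariances
print(niqe_distance(mu1, cov1, mu2, cov2))
```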

3.4. Comparison Test

We compare the method proposed in this paper with several excellent traditional algorithms and deep learning methods that have achieved strong results in recent years. Among the compared methods, the traditional algorithms include He, UDCP, ICM, and ULAP, and the deep learning methods include CycleGAN, UGAN, FUnIE-GAN, UMGAN, and U-Net. All deep learning-based methods are trained following the network structure and training strategy of their original papers. The comparison results are shown in Figure 6.
Considering the complexity of the underwater environment, this paper shows images taken under different distortion conditions in the Underwater ImageNet test set together with their processed results. From the visual results, it is easy to see that the He algorithm struggles to adapt to the complex underwater environment and its correction effect is very poor; UDCP enhances the contrast of the underwater image but cannot correct the color cast, and parts of the image are dark; ICM improves the overall brightness of the image, but the color cast remains serious and overexposure appears; ULAP alleviates the greenish cast, but the overall visual effect is still poor; CycleGAN mitigates the greenish-blue cast, but introduces unrealistic artifacts that appear as reddish color and texture distortion in some areas; UGAN performs well in correcting the color of bluish images, but the brightness is overcorrected and some images suffer from oversaturation; the images processed by FUnIE-GAN are slightly dark overall and not bright enough, and some texture details are filtered out; UMGAN further improves the contrast and saturation of the output image, but its performance on low-light image enhancement is lacking. The algorithm in this paper performs well in color correction, detail restoration, and contrast enhancement, and its textures are more delicate than those of the other methods.
Table 1 presents the NIQE, UCIQE, parameter count, floating-point operations, PSNR, and SSIM results for each method. The method proposed in this paper achieved the best or second-best results across all metrics. While significantly reducing both the parameter count and the computational cost, it improved UCIQE by 0.009–0.095, reduced NIQE by 0.256–1.032, improved PSNR by 1.436–2.162, and enhanced SSIM by 0.074–0.196 relative to the compared methods. The reference metrics show that the algorithm proposed in this paper is better able to improve image contrast, sharpness, and texture color. In addition, compared with FUnIE-GAN, currently the best lightweight underwater model, our method has a smaller number of parameters and fewer floating-point operations, making its lightweight design even more prominent.
In addition, we also conducted tests on the highly challenging UIEB underwater dataset; the outputs are shown in Figure 7, with the original images in the left column and the processed images in the right column. The test results show that our proposed network still performs respectably in turbid and complex underwater environments.

4. Discussion

Ablation experiments help demonstrate the rationality and effectiveness of the key design choices in the model. In this section, we use experimental data to show how the network performs when different components are added.

4.1. Basic Block Internal Component Ablation Experiments

4.1.1. Normalization Method

We study the effects of Layer Normalization and Batch Normalization on the basic block. As shown in Table 2 and Table 3, Layer Normalization computes the mean and variance of the feature map at inference time, and using Layer Normalization yields a slight improvement in the quality of the processed images.

4.1.2. Attention Mechanism

We compare the original channel attention mechanism with the simplified channel attention module, as well as with the pixel–channel attention combination used in FFA-Net. We find that the simplified channel attention module still delivers excellent enhancement results, while including the more complex attention mechanisms does not substantially improve the module, because the skip connections of the U-Net structure combined with the SK fusion module already provide considerable global feature extraction capability and no longer require heavy channel attention.

4.1.3. Activation Functions

As shown in Table 2 and Table 3, when investigating the effects of the GELU and ReLU activation functions in the basic block, we find that ReLU gives better results without the channel attention mechanism, whereas GELU is more effective once channel attention is added. ReLU performs well on simple information, but as the feature information becomes progressively more complex, GELU better meets the network’s expressiveness requirements.

4.2. SK Fusion Module

We replace the SK fusion module with the traditional summation or concatenation operation, and the performance of the whole network degrades significantly. Table 4 and Table 5 present the comparative results. It can be seen that the SK fusion module helps extract global information and dynamically fuse the features at each level; the experiments clearly show that it plays an important role in the network.

5. Conclusions

The attenuation of light as it propagates through water and the scattering of light by suspended particles mean that underwater images often exhibit severe color distortion, reduced sharpness and contrast, and an overall bluish or greenish tone. Previous methods for enhancing low-quality underwater images mainly improve network performance by introducing deep residual structures and complex attention mechanisms, which makes them large and difficult to deploy. In this paper, we explore how to perform underwater image enhancement with a minimalist network design. We demonstrate experimentally that a reasonable selection of components and a simplified attention mechanism can also improve model efficiency. Specifically, this paper builds on an improved U-Net network, using concise basic blocks for feature extraction and the SK fusion module to aggregate features at different levels. We test the performance of various methods, including our algorithm, on Underwater ImageNet, and the results show that, relative to other excellent algorithms, our algorithm achieves better results in both qualitative and quantitative comparisons while requiring less computation. In future work, we will compare against more SOTA algorithms and attempt further component replacement and optimization.

Author Contributions

Conceptualization, S.Z. and Z.G.; methodology, S.Z., Y.X. and Z.G.; software, Z.Z.; validation, S.Z., Z.Z., Z.G. and H.Y.; formal analysis, H.Y.; investigation, X.Z.; resources, S.Z. and Z.G.; data curation, S.Z. and H.J.; writing—original draft preparation, Z.G.; writing—review and editing, X.F.; visualization, X.Z.; supervision, H.J.; project administration, X.F. and Z.G.; funding acquisition, X.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (Grant No. 4752022YFB4703402), the China Three Gorges Corporation (Grant No. 2324020012), the National Natural Science Foundation of China (Grant No. 62476080), the Jiangsu Province Natural Science Foundation (Grant No. BK20231186), and Changzhou Sci&Tech Program, Grant/Award Number: CE20235053.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

Authors Sisi Zhu and Zaiming Geng were employed by the company China Yangtze Power Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Raveendran, S.; Patil, M.D.; Birajdar, G.K. Underwater image enhancement: A comprehensive review, recent trends, challenges and applications. Artif. Intell. Rev. 2021, 54, 5413–5467. [Google Scholar] [CrossRef]
  2. Moghimi, M.K.; Mohanna, F. Real-time underwater image enhancement: A systematic review. J. Real-Time Image Process. 2021, 18, 1509–1525. [Google Scholar] [CrossRef]
  3. Anwar, S.; Li, C. Diving deeper into underwater image enhancement: A survey. Signal Process. Image Commun. 2020, 89, 115978. [Google Scholar] [CrossRef]
  4. Islam, M.J.; Xia, Y.; Sattar, J. Fast underwater image enhancement for improved visual perception. IEEE Robot. Autom. Lett. 2020, 5, 3227–3234. [Google Scholar] [CrossRef]
  5. He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 2341. [Google Scholar] [PubMed]
  6. Drews, P.; Nascimento, E.; Moraes, F.; Botelho, S.; Campos, M. Transmission estimation in underwater single images. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 2–8 December 2013; pp. 825–830. [Google Scholar]
  7. Song, W.; Wang, Y.; Huang, D.; Tjondronegoro, D. A rapid scene depth estimation model based on underwater light attenuation prior for underwater image restoration. In Advances in Multimedia Information Processing–PCM 2018, Proceedings of the 19th Pacific-Rim Conference on Multimedia, Hefei, China, 21–22 September 2018; Proceedings, Part I 19; Springer: Berlin/Heidelberg, Germany, 2018; pp. 678–688. [Google Scholar]
  8. Iqbal, K.; Salam, R.A.; Osman, A.; Talib, A.Z. Underwater Image Enhancement Using an Integrated Colour Model. IAENG Int. J. Comput. Sci. 2007, 34. [Google Scholar]
  9. Fabbri, C.; Islam, M.J.; Sattar, J. Enhancing underwater imagery using generative adversarial networks. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 7159–7165. [Google Scholar]
  10. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  11. Sun, B.; Mei, Y.; Yan, N.; Chen, Y. UMGAN: Underwater image enhancement network for unpaired image-to-image translation. J. Mar. Sci. Eng. 2023, 11, 447. [Google Scholar] [CrossRef]
  12. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  13. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  14. Ioffe, S. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  15. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  16. Huang, Z.; Li, J.; Hua, Z.; Fan, L. Underwater image enhancement via adaptive group attention-based multiscale cascade transformer. IEEE Trans. Instrum. Meas. 2022, 71, 1–18. [Google Scholar] [CrossRef]
  17. Peng, L.; Zhu, C.; Bian, L. U-shape transformer for underwater image enhancement. IEEE Trans. Image Process. 2023, 32, 3066–3079. [Google Scholar] [CrossRef] [PubMed]
  18. Ba, J.L. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  19. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications (2017). arXiv 2017, arXiv:1704.04861. [Google Scholar]
  20. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  21. Vaswani, A. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  22. Wang, J.; Li, P.; Deng, J.; Du, Y.; Zhuang, J.; Liang, P.; Liu, P. CA-GAN: Class-condition attention GAN for underwater image enhancement. IEEE Access 2020, 8, 130719–130728. [Google Scholar] [CrossRef]
  23. Xue, L.; Zeng, X.; Jin, A. A novel deep-learning method with channel attention mechanism for underwater target recognition. Sensors 2022, 22, 5492. [Google Scholar] [CrossRef]
  24. Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature fusion attention network for single image dehazing. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11908–11915. [Google Scholar]
  25. Chen, L.; Chu, X.; Zhang, X.; Sun, J. Simple baselines for image restoration. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 17–33. [Google Scholar]
  26. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 510–519. [Google Scholar]
  27. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
Figure 2. Overall network architecture.
Figure 3. Basic block.
Figure 4. (a) Channel attention. (b) Simplified channel attention. (Different colors represent different weights.)
Figure 5. SK fusion module.
Figure 6. Qualitative comparison between our method and other methods.
Figure 7. Results on UIEB.
Table 1. Comparison of different results.

Method       NIQE    UCIQE   Param (M)   FLOPS (G)   PSNR     SSIM
HE           5.192   0.407   -           -           18.733   0.683
UDCP         4.927   0.411   -           -           18.852   0.695
ULAP         4.591   0.418   -           -           19.034   0.716
ICM          4.405   0.426   -           -           19.252   0.732
CycleGAN     5.037   0.380   42.410      7.841       18.792   0.705
UGAN         5.082   0.425   18.155      54.404      18.807   0.697
FUnIE-GAN    4.412   0.417   10.239      7.020       20.129   0.850
UMGAN        4.618   0.440   38.745      13.149      19.239   0.764
U-Net        4.459   0.419   28.513      30.618      19.878   0.805
Ours         4.393   0.430   9.270       2.741       21.565   0.879

Note: Bold indicates the best value in each column; underline indicates the second-best value.
Table 2. Comparison of NIQE and UCIQE metrics in ablation experiments of basic block components.

Configuration (BN / LN / Attention / ReLU / GELU)    NIQE    UCIQE
- - - -                                              4.414   0.426
- - - -                                              4.394   0.429
- PA-CA -                                            4.402   0.368
- CA -                                               4.399   0.432
- SCA -                                              4.395   0.429
- PA-CA -                                            4.403   0.432
- CA -                                               4.389   0.431
- SCA -                                              4.378   0.445

Note: BN (Batch Normalization), LN (Layer Normalization). Bold indicates the best value in each column; underline indicates the second-best value. The ✓ indicates that a component was used in the experiment.
Table 3. Comparison of PSNR and SSIM metrics in ablation experiments of basic block components.

Configuration (BN / LN / Attention / ReLU / GELU)    PSNR     SSIM
- - - -                                              19.132   0.704
- - - -                                              20.024   0.792
- PA-CA -                                            20.560   0.850
- CA -                                               20.793   0.864
- SCA -                                              21.122   0.876
- PA-CA -                                            20.545   0.833
- CA -                                               21.253   0.882
- SCA -                                              21.337   0.887

Note: BN (Batch Normalization), LN (Layer Normalization). Bold indicates the best value in each column; underline indicates the second-best value. The ✓ indicates that a component was used in the experiment.
Table 4. Comparison of NIQE and UCIQE metrics for different fusion methods.

Method              NIQE    UCIQE
Tandem Splicing     4.921   0.389
Summation           5.048   0.421
SK Fusion Module    4.394   0.431

Note: Bold indicates the best value in each column; underline indicates the second-best value.
Table 5. Comparison of PSNR and SSIM metrics for different fusion methods.

Method              PSNR     SSIM
Tandem Splicing     18.863   0.685
Summation           18.857   0.689
SK Fusion Module    21.273   0.884

Note: Bold indicates the best value in each column; underline indicates the second-best value.