Article

Enhancement of Underwater Images through Parallel Fusion of Transformer and CNN

1 Fishery Machinery and Instrument Research Institute, Chinese Academy of Fishery Science, Shanghai 200092, China
2 State Key Laboratory of the Internet of Things for Smart City (IOTSC), University of Macau, Macau 999078, China
3 Digital Industry Research Institute, Zhejiang Wanli University, No. 8 South Qian Hu Road, Ningbo 315199, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(9), 1467; https://doi.org/10.3390/jmse12091467
Submission received: 18 July 2024 / Revised: 18 August 2024 / Accepted: 21 August 2024 / Published: 23 August 2024
(This article belongs to the Section Ocean Engineering)

Abstract: Ocean exploration is crucial for utilizing its extensive resources. Images captured by underwater robots suffer from issues such as color distortion and reduced contrast. To address these issues, an innovative enhancement algorithm is proposed that integrates a Transformer and a Convolutional Neural Network (CNN) in a parallel fusion manner. First, a novel transformer model is introduced to capture local features, employing peak signal-to-noise ratio (PSNR) attention and linear operations. Subsequently, to extract global features, both temporal and frequency domain features are incorporated to construct the convolutional neural network. Finally, the image's high- and low-frequency information is utilized to fuse the different features. To demonstrate the algorithm's effectiveness, underwater images with various levels of color distortion are selected for both qualitative and quantitative analyses. The experimental results demonstrate that our approach outperforms other mainstream methods, achieving superior PSNR and structural similarity index measure (SSIM) metrics and yielding a detection performance improvement of over ten percent.

1. Introduction

Exploration of the ocean is vital for harnessing its abundant resources [1]. Underwater robots are crucial instruments for ocean exploration and enable image-based target detection tasks. Due to light attenuation and scattering in seawater, the quality of underwater images is often compromised (Figure 1). Consequently, underwater image processing faces significant challenges, and underwater image enhancement is an important research area within computer vision and underwater robotics [2,3].
Underwater image enhancement aims to improve the quality of distorted images by correcting their color distortion [4]. Some scholars have explored non-deep-learning approaches and made progress. Non-deep-learning methods rely on statistical assumptions and models to enhance underwater images, such as the Underwater Dark Channel Prior (UDCP) [5], image blur recovery [6], and the Underwater Light Attenuation Prior (ULAP) [7]. Cheng et al. [8] pointed out that dissolved substances in water weaken the imaging process and influence the attenuation parameters of light propagation in water. Drews et al. proposed an underwater prior method utilizing the red channel information [9]. Li et al. proposed an underwater light attenuation prior (ULAP) model to restore image quality [10]. Ma et al. designed a wavelet transform network that decomposes input images into frequency maps to enrich image details [11]. However, the complexity of underwater environments often leads to inaccurate parameter estimation for these methods.
Currently, neural networks are widely employed in various visual tasks [12]. In contrast to prior-based methods, deep learning techniques utilize extensive datasets and specialized loss functions to train deep neural networks for image quality enhancement, including models such as the Underwater Residual Network (UResNet), the Shallow Underwater Network (UWNet), and the Underwater Convolutional Neural Network (UWCNN) [13]. Mean square error loss and edge difference loss have been used to optimize convolutional neural networks for image enhancement [14]. Using conventional convolutions, Naik et al. developed a network specifically for underwater image enhancement that demonstrates effective enhancement on public datasets [15]. Li et al. introduced a residual network-based underwater image enhancement algorithm [16]. Chen et al. introduced an end-to-end neural network enhancement model that integrates residual structures and attention mechanisms [17]. Current enhancement algorithms predominantly rely on convolutional neural networks with a single feature-extraction backbone, and the features extracted by these models are often insufficiently detailed.
Wang et al. utilized Generative Adversarial Networks (GANs) to design a feature enhancement network [18,19]. Moreover, the underwater generative adversarial network (UGAN) [20] scheme has been developed for UIE tasks, employing an encoder–decoder structure [21,22] that effectively preserves rich semantic information. Wu et al. developed a multi-scale fusion generative adversarial network named Fusion Water-GAN (FW-GAN), which integrates four convolutional branches to improve underwater image quality while preserving rich semantic information [23]. Terayama et al. created a dataset containing both image and sonar data for low-light underwater environments and utilized a GAN to improve the image quality; experimental results show that this method achieves better detection performance [24]. Zhang et al. collected images from different angles, calculated the camera pose for each view, and fed the image sequences and their corresponding poses into a Neural Radiance Field (NeRF), synthesizing new viewpoints and improving 3D image reconstruction [25]. However, adversarial learning methods are mostly trained on samples of similar quality or visibility, and acquiring clear sample data for these models remains a formidable task.
Deep learning encompasses various backbone architectures, including the widely used Convolutional Neural Networks (CNNs) and transformers [26,27], which have gained popularity in computer vision tasks. The Transformer architecture incorporates features such as multi-head attention and multi-layer perceptrons, making it versatile for a range of visual tasks [28]. Zamir et al. [29] employed an encoder–decoder structure to obtain features at different scales, achieving image enhancement in rainy and foggy weather conditions. Song et al. modified the attention modules within the network layers, constructing a parameter-adjustable dehazing network [30]. Although the transformer architecture shows great potential for computer vision tasks, its high computational complexity often results in increased computational load and longer processing times.
Despite advancements in current methods for correcting underwater image distortions, challenges persist in achieving high-quality restoration. Our research motivation is to overcome the uncertainty in generated images by exploiting the complementary features of different neural network frameworks. This paper introduces a new model for enhancing the quality of underwater images, aiming to tackle issues such as color shifts, unrealistic colors, and reduced contrast [31]. Furthermore, a matrix linear computation approach is designed to minimize the computational delays caused by the stacked network. To this end, an innovative approach is proposed to extract both local and global image features: the network integrates visual transformer models with CNNs to enhance the overall restoration process. Additionally, the information fusion weights are calculated from the Fourier transform of the original images. The main contributions of our work are as follows:
(1) A novel transformer model that extracts local features is proposed. It incorporates both PSNR attention and linear operations to effectively reduce the computational load and alleviate color artifacts.
(2) A novel global feature extraction network is devised, which leverages both temporal and frequency domain characteristics to enrich the image features.
(3) A feature fusion method leveraging the Fourier transform is proposed to optimize the global and local feature weights. High-frequency Fourier components enhance the global features, while low-frequency components refine the local features. This approach effectively integrates the various features without leaving noticeable fusion traces at the boundaries.

2. Materials and Methods

Figure 2 depicts our enhancement framework. The network incorporates both CNN and transformer backbones, which are designed to extract global and local features, respectively. The extracted features are fused at the smallest down-sampling size. Additionally, the low-frequency and high-frequency information of the original image is obtained via the Fourier transform and serves as the fusion weights for the extracted CNN and transformer features.

2.1. Two-Branch Feature Extraction Network

Conventional low-light image enhancement networks usually employ convolutional structures within the feature layers, predominantly extracting information from the image's bright regions, which are rich in visible content. However, in non-prominent regions, fine-grained features may be lost, leading to reduced detection accuracy.
Based on this, two backbones are employed for image enhancement. Information from the global and local regions is complementary, and this division can be determined from the distribution of the image's Fourier transform. On one hand, for image regions dominated by high-frequency Fourier components, the CNN features predominantly capture the globally salient information. On the other hand, in image regions dominated by low-frequency Fourier components, the transformer predominantly captures local detailed information. This methodology facilitates the extraction of diverse image features.
In the encoder, two different feature-extraction backbones were developed: one employs transformers, while the other utilizes convolutional structures. Each backbone incorporates a three-level, top-down feature pyramid extraction network. The first layer takes a 6-channel input and outputs 64 channels, and the second and third levels both operate on 64 channels.
CNNs excel at extracting edges, textures, and simple shapes from images, whereas transformers excel at identifying long-range dependencies and inferring local information. By integrating the local features extracted by the transformer with the global textures identified by the CNN, richer and more diverse representations can be produced. This hybrid approach is more robust to noise and variations in the data. Moreover, hybrid models can better adapt to different types of data, handling spatial and sequential information simultaneously, which enhances the model's recognition and classification capabilities.
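As a concrete illustration of this two-branch design, the following PyTorch sketch lays out a parallel encoder with a 6-channel input, 64-channel features, and three levels per branch. It is a minimal skeleton under our own assumptions: the class and parameter names are hypothetical, and the transformer branch is represented by placeholder convolutions rather than the actual attention blocks described in Section 2.3.

```python
import torch
import torch.nn as nn

class TwoBranchEncoder(nn.Module):
    """Parallel CNN/transformer encoder skeleton: 6-channel input, 64-channel features, 3 levels."""
    def __init__(self, in_ch=6, feat_ch=64, levels=3):
        super().__init__()
        # CNN (global) branch: a three-level, top-down pyramid of plain convolution blocks.
        cnn_layers, ch = [], in_ch
        for _ in range(levels):
            cnn_layers += [nn.Conv2d(ch, feat_ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2)]
            ch = feat_ch
        self.cnn_branch = nn.Sequential(*cnn_layers)
        # Transformer (local) branch: placeholder convolutions stand in for the attention blocks.
        self.proj = nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1)
        self.transformer_branch = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1),
        )

    def forward(self, x):
        f_global = self.cnn_branch(x)                     # global cues from the CNN branch
        f_local = self.transformer_branch(self.proj(x))   # local cues from the transformer branch
        return f_global, f_local

f_global, f_local = TwoBranchEncoder()(torch.randn(1, 6, 256, 256))
```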

2.2. Implementation of the CNN Branch

Convolutional Neural Networks (CNNs) are deep learning models specifically designed to process image data. CNNs employ convolutional layers to extract local features, pooling layers to reduce data dimensionality and computational complexity, and fully connected layers for classification. The core advantages of CNNs lie in their local connections and shared weights, which make them particularly effective for image recognition and classification tasks.
CNNs extract features by sliding convolutional kernels over the pixel matrix of the input image, computing the weighted sum of local regions to generate feature maps (Figure 3b); "*" denotes the multiplication of the two operands. The kernels capture local features such as edges and textures. Activation functions then apply nonlinear mappings to these features, enhancing the model's ability to select key characteristics. The network parameters are optimized through error back-propagation and iterative learning. This learning process automatically constrains the input and maximizes the activation of the output.
The Residual Network (ResNet) is an improved version of the CNN that introduces skip (residual) connections (Figure 4a). These connections allow gradients to pass directly from later layers to earlier layers, addressing the vanishing and exploding gradient problems during training. By stacking more layers, ResNet achieves deeper architectures than traditional CNNs and has demonstrated strong performance in various visual tasks [33].
Unlike ResNet, some researchers have noted that the brightness degradation of an image resides primarily in the magnitude component of its Fourier transform, while the remaining information resides in the phase component [34]. Inspired by this research, our backbone further exploits the correlation between the magnitude component and brightness to improve feature extraction. In this backbone, a two-stage feature architecture is designed (Figure 4b). In stage one, the brightness of low-light image features is enhanced by optimizing the amplitude in Fourier space. In stage two, features from convolutional layers in the temporal domain are further integrated.
In stage one, for an input image x with dimensions H × W, its transformation to the frequency space can be represented as Equation (1):
\mathcal{F}(x)(u, v) = X(u, v) = \frac{1}{\sqrt{HW}} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} x(h, w)\, e^{-j 2\pi \left( \frac{h}{H} u + \frac{w}{W} v \right)}
where h and w denote coordinates in the temporal (spatial) domain, while u and v denote coordinates in the frequency domain. To acquire frequency features, a fast Fourier transform (FFT) block is used to extract the amplitude and phase components (Figure 4). Two 1 × 1 convolutional layers with LeakyReLU activation are then applied to each branch. Finally, an inverse Fourier transform (iFFT) converts the two components back to the spatial domain.
The Fourier stage relies on operations in the frequency domain to enhance brightness but lacks convolution in the temporal domain for extracting details. Therefore, in the second stage, temporal-domain convolutions are employed to enrich the features. Finally, to optimize feature fusion, dimension-expansion and dimension-reduction operations are applied to the temporal- and frequency-domain features, respectively.
Compared with the commonly used residual network (Figure 4a), our Fourier-transform-based block (Figure 4b) converts convolution operations into multiplications in the frequency domain, improving the efficiency of the convolution calculations. By enhancing specific frequency components, specific edge features can be accentuated, which is especially useful for image enhancement. However, some information may be lost during the inverse Fourier transform, so integrating temporal-domain features is crucial to preserve these characteristics.
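To make the two-stage idea concrete, the sketch below processes the FFT amplitude and phase with 1 × 1 convolutions and LeakyReLU, inverts the transform, and then refines the result with spatial convolutions. This is a hedged sketch under our own assumptions; the layer sizes, normalization, and residual wiring are illustrative rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class FourierConvBlock(nn.Module):
    """Stage 1: adjust amplitude/phase in Fourier space; Stage 2: refine in the temporal (spatial) domain."""
    def __init__(self, ch=64):
        super().__init__()
        self.amp_conv = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.LeakyReLU(0.2), nn.Conv2d(ch, ch, 1))
        self.pha_conv = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.LeakyReLU(0.2), nn.Conv2d(ch, ch, 1))
        self.spatial = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2),
                                     nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        # Stage 1: FFT -> 1x1 convolutions on amplitude and phase -> inverse FFT.
        freq = torch.fft.rfft2(x, norm="ortho")
        amp = self.amp_conv(torch.abs(freq))
        pha = self.pha_conv(torch.angle(freq))
        x_freq = torch.fft.irfft2(torch.polar(amp, pha), s=x.shape[-2:], norm="ortho")
        # Stage 2: temporal-domain convolutions recover details lost in the frequency processing.
        return self.spatial(x_freq) + x_freq

out = FourierConvBlock()(torch.randn(1, 64, 64, 64))
```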

2.3. Implementation of the Transformer Branch

Unlike the CNN branch, which aggregates global information, the transformer branch captures local features from token regions. Local feature extraction is achieved through matrix multiplication, which has been validated in various high-level and low-level vision tasks. Assuming the feature dimensions are h × w × C and the token size is p × p, the total number of feature tokens is m = (h/p) × (w/p) × C. Multi-head self-attention (MSA) modules and multi-layer perceptrons (MLP) are employed in the transformer. Assuming the input features have the same dimensions, the tokens are merged into a sequence of features with multiple heads (Figure 5a). With Q, K ∈ R^{d×n}, computing QK^T ∈ R^{d×d} costs d²n operations, and multiplying the result by V ∈ R^{d×n} to obtain an R^{d×n} output costs another d²n operations, so the total computation load is 2d²n. The feature transformation and the attention computation are depicted in Equations (2) and (3), respectively.
q_i = k_i = v_i = \mathrm{LN}(F_1, F_2, \ldots, F_m)
y_i' = \mathrm{MSA}(q_i, k_i, v_i) + y_{i-1}
y_i = \mathrm{MLP}(\mathrm{LN}(y_i')) + y_i'
\hat{x} = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d}}\right) V
where Q is the query matrix, K is the key matrix, and V is the value matrix.
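For reference, Equation (3) is the standard scaled dot-product attention, which can be written as the short, generic sketch below (not the authors' code; the shapes follow the R^{d×n} convention used above):

```python
import torch

def dot_product_attention(Q, K, V):
    """Standard attention of Equation (3); the cost is about 2*d^2*n for Q, K, V of shape (d, n)."""
    d = Q.shape[0]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (d, d) similarity matrix
    return torch.softmax(scores, dim=-1) @ V      # (d, n) attended features

x_hat = dot_product_attention(torch.randn(96, 256), torch.randn(96, 256), torch.randn(96, 256))
```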
To alleviate the computational complexity, the method in Figure 5b replaces the traditional dot-product multiplication with element-wise multiplication (different colors in the figure denote different dimensions). Typically, Q, K, and V have dimensions of R^{d×n}. To compute the attention weights for extracting features from underwater images, the query matrix is first multiplied by a trainable parameter vector w_n ∈ R^n. This produces a global attention vector η of length d, as shown in Equation (4):
\eta = \frac{\exp\!\left(Q w_n / \sqrt{n}\right)}{\sum_{j=1}^{d} \exp\!\left(Q w_n / \sqrt{n}\right)_j}
Next, the K matrix is weighted by the global attention vector η to yield the global query vector q; as shown in Figure 5b, q ∈ R^{1×n}. This global vector q is then element-wise multiplied with the V matrix to generate global features that merge the information of the Q and K matrices. Unlike the dot-product computation, the computational load of the element-wise multiplication is linear in the parameters (d × n), alleviating the overall computational load. Finally, another transformation activates the output, as shown in Equation (5):
q = \sum_{i=1}^{d} \eta_i K_i, \qquad x = T(q \odot V)
where T denotes the activation operation. To mitigate the influence of extremely dark regions on inference, an SNR map is utilized to guide the learning attention of the transformer. For an input image I ∈ R^{H×W×3} with its corresponding SNR map S ∈ R^{H×W}, S is resized into S′ ∈ R^{h×w} to align with the dimensions of the feature map F. S′ is then partitioned into m patches, and the average value of each patch, S′_i ∈ [0, 1] with i = 1, ..., m, is calculated. This masking mechanism effectively suppresses features with a very low signal-to-noise ratio (SNR), as illustrated in Figure 5b. The i-th mask value of S′ is defined in Equation (6):
S'_i = \begin{cases} 0, & S'_i < s \\ 1, & S'_i \geq s \end{cases}, \qquad i = 1, \ldots, m
The masking calculation process for the x parameter is expressed in Equation (7):
\hat{x} = x \odot S'
As shown in Table 1, the total computation load is 3dn, far less than the previous 2d²n. With the commonly used dot-product multiplication in Figure 5a, the computational complexity of the self-attention mechanism scales quadratically with the sequence length, causing a significant increase in resource consumption when the sequence is long. Owing to the large number of parameters in each layer, transformer models are typically much larger than CNNs, requiring more memory and computational power for training and inference. By adopting the proposed hybridized block modular approach, the computational load is reduced from 2d²n to 3dn, offering a substantial advantage. Although integrating the CNN and Transformer models increases complexity and computation time, the algorithm's complexity is reduced through the element-wise matrix multiplication.
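A hedged sketch of this element-wise (linear) attention with the SNR mask is given below, following our reading of Equations (4)-(7). The module name, the threshold value s, the choice of ReLU as the activation T, and the tensor shapes are assumptions for illustration, not the published implementation.

```python
import torch
import torch.nn as nn

class ElementwiseSNRAttention(nn.Module):
    """Linear attention: roughly 3*d*n operations instead of 2*d^2*n, gated by a binary SNR mask."""
    def __init__(self, n, snr_threshold=0.3):
        super().__init__()
        self.w_n = nn.Parameter(torch.randn(n))   # trainable vector w_n in R^n
        self.s = snr_threshold                    # assumed mask threshold s

    def forward(self, Q, K, V, snr_patches):
        # Equation (4): global attention vector eta in R^d.
        eta = torch.softmax(Q @ self.w_n / Q.shape[1] ** 0.5, dim=0)   # (d,)
        # Equation (5): global query q in R^(1 x n), then element-wise mixing with V.
        q = (eta.unsqueeze(1) * K).sum(dim=0, keepdim=True)            # (1, n)
        x = torch.relu(q * V)                                          # (d, n); ReLU stands in for T
        # Equations (6)-(7): binary SNR mask suppresses tokens with very low signal-to-noise ratio.
        mask = (snr_patches >= self.s).float()                         # (n,)
        return x * mask

attn = ElementwiseSNRAttention(n=256)
out = attn(torch.randn(96, 256), torch.randn(96, 256), torch.randn(96, 256), torch.rand(256))
```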
It is worth noting that similar block or token attention has been developed previously. Zhou et al. divided large images into smaller blocks, utilizing a trained CNN as a block descriptor for image forgery detection [35]. To develop a well-suited denoising model, Zou et al. introduced a block matching and grouping method, applying a convolutional neural network (CNN) within each block for 3D filtering [36]. To generate high-resolution landslide susceptibility maps, Abbas Shahri et al. created a hybrid block-based neural network model, integrating expert modular structures and divide-and-conquer strategies with a genetic algorithm (GA) [37]. In that method, each sub-network module employs input blocks, layers of hidden blocks, and an additional decision block (Figure 6a). Different from these independent-block methods, the element-wise multiplication operation is developed in this research to extract cross-attention (Figure 6b).
To evaluate the performance of the different block-based methods, the two structures in Figure 6 were employed to enhance images in three distinct scenarios. Despite the significant development and proven capability of image processing systems based on advanced block-based or modular structures, the model presented in this study offers three significant advantages. First, it reduces the learning loss and accelerates model convergence (Figure 7a). Second, it captures global forward-backward attention more effectively and extracts continuous features, thereby minimizing the information loss of independent blocks (Figure 7b,c). Third, it produces high-quality enhanced images with higher PSNR and SSIM indexes (Figure 7d).

2.4. Fusion Attention Based on High-Pass and Low-Pass Filters

The torch.add method is commonly used to perform element-wise addition of tensors [35]. It checks the shapes of the input tensors, and when the shapes are aligned, the addition is performed element by element: corresponding elements of the two tensors are added to produce a new tensor. While this method can combine different features, it cannot differentiate between them or exploit the advantages of each.
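As a brief illustration (a generic snippet with hypothetical tensor names, not taken from the paper), torch.add weights both feature maps identically at every position:

```python
import torch

f_cnn = torch.randn(1, 64, 32, 32)          # hypothetical CNN-branch feature map
f_transformer = torch.randn(1, 64, 32, 32)  # hypothetical transformer-branch feature map

# Plain element-wise addition: both branches receive the same fixed weight everywhere,
# so the fusion cannot favor one branch over the other in different image regions.
fused = torch.add(f_cnn, f_transformer)     # equivalent to f_cnn + f_transformer
```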
Unlike the traditional torch.add method, the significant difference between low-frequency and high-frequency features is valuable and can be exploited. The appearance of an image depends on its trigonometric frequency components: high-frequency signals cause rapid changes, producing sharp edges, whereas low-frequency signals vary gradually, contributing to a smoother appearance. The role of a filter is to pass or suppress certain frequency components of the image.
The Fourier transform acts as a bridge between the temporal domain and the frequency domain (Figure 8). Low-pass filtering, a method for image smoothing, retains only the low-frequency components. The transfer function of the (Gaussian) low-pass filter adopted here is represented in Equation (8):
F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi (u x / M + v y / N)}
L(u, v) = e^{-D^2(u, v) / (2 D_0^2)}
L(x, y) = \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} L(u, v)\, F(u, v)\, e^{j 2\pi (u x / M + v y / N)}
where M and N denote the width and height of the image, respectively; f(x, y) is the image in the temporal domain and F(u, v) is its frequency-domain representation. The range of u is [0, M − 1], and the range of v is [0, N − 1]. D(u, v) denotes the distance from the point (u, v) in the frequency domain to the center, and D0 denotes the cutoff frequency. L(u, v) denotes the low-pass filter in the frequency domain, and L(x, y) denotes the low-pass result in the temporal domain. Figure 9 illustrates the low-pass filter functions and their filtering outcomes.
In contrast to low-pass filters, high-pass filters enhance the details and edges of the image by suppressing the low-frequency components: the low-frequency part of the spectrum is attenuated and only the high-frequency components are retained. H(u, v) denotes the high-pass filter in the frequency domain, and H(x, y) denotes the high-pass result in the temporal domain. Figure 10 illustrates the high-pass filter functions and their filtering results. The transfer function of the high-pass filter is given in Equation (9):
F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi (u x / M + v y / N)}
H(u, v) = 1 - e^{-D^2(u, v) / (2 D_0^2)}
H(x, y) = \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} H(u, v)\, F(u, v)\, e^{j 2\pi (u x / M + v y / N)}
The Fourier transform is utilized to calculate the fusion weights for integrating the different backbone features (Figure 11b). The low-frequency components are aligned with the locally smooth features extracted by the transformer, while the sharp edge details captured by the convolutional neural network are matched to the high-frequency components. Both the high- and low-frequency weights are normalized to the [0, 1] range. The two fusion schemes are contrasted in Equation (10):
F_{\mathrm{add}} = F_{\mathrm{CNN}} + F_{\mathrm{Transformer}}
F_{\mathrm{ours}} = F_{\mathrm{CNN}} \odot H(x, y) + F_{\mathrm{Transformer}} \odot L(x, y)
The low-frequency Fourier components weight the transformer features, while the high-frequency components weight the CNN features. Compared with the commonly used torch.add operation in Figure 11a, this combination of the CNN and transformer fully leverages their different advantages; it not only enhances the feature representation capability and robustness but also makes better use of the computational resources.
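One possible implementation of this Fourier-weighted fusion is sketched below, following our reading of Equations (8)-(10). The Gaussian masks, the cutoff D0, and the normalization of the filtered magnitudes to [0, 1] are assumptions for illustration.

```python
import torch
import torch.nn.functional as nnf

def fourier_fusion(image_gray, f_cnn, f_transformer, d0=20.0):
    """Fuse CNN and transformer features with high-/low-pass weights derived from the input image."""
    h, w = f_cnn.shape[-2:]
    # Resize the grayscale input to the feature resolution and take its centered spectrum.
    img = nnf.interpolate(image_gray, size=(h, w), mode="bilinear", align_corners=False)
    spectrum = torch.fft.fftshift(torch.fft.fft2(img, norm="ortho"), dim=(-2, -1))
    # Gaussian low-/high-pass masks; D(u, v) is the distance from the spectrum center.
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist2 = (yy - h // 2) ** 2 + (xx - w // 2) ** 2
    low_mask = torch.exp(-dist2 / (2 * d0 ** 2))
    high_mask = 1.0 - low_mask

    def to_weight(mask):  # filtered result, Equations (8)-(9), normalized to [0, 1]
        filt = torch.fft.ifft2(torch.fft.ifftshift(spectrum * mask, dim=(-2, -1)), norm="ortho").abs()
        return (filt - filt.amin()) / (filt.amax() - filt.amin() + 1e-8)

    # Equation (10): CNN features weighted by high frequencies, transformer features by low frequencies.
    return f_cnn * to_weight(high_mask) + f_transformer * to_weight(low_mask)

fused = fourier_fusion(torch.rand(1, 1, 256, 256), torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```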

3. Experimental Validation

3.1. Dataset and Experimental Designation

We evaluate the algorithm's performance on two publicly available datasets: LSUI [36] (Large-Scale Underwater Image dataset) and the UIEB dataset [37]. LSUI comprises 5000 underwater images with varying exposure levels. The UIEB dataset includes pairs of low-exposure and high-exposure images, with 800 pairs designated for training, 150 pairs for validation, and 90 pairs for testing. The LSUI and UIEB datasets play crucial roles in underwater image enhancement research: LSUI, with its large and diverse data volume, offers ample material for training and testing deep learning models, while UIEB, with its high-quality annotated image pairs, is a key resource for evaluating and optimizing algorithms. Together, the two datasets provide powerful and robust benchmarks for underwater image evaluation.
Our framework was implemented in PyTorch [38], and the training and testing were conducted on a computer equipped with a 2080Ti GPU. The network parameters were randomly initialized from a Gaussian distribution, and standard data augmentation techniques, such as vertical and horizontal flipping, were applied. The encoder includes three layers followed by a feature fusion module; similarly, the decoder comprises three layers, utilizing ChannelShuffle for up-sampling. The Adam optimizer [39] with an initial learning rate of 1 × 10−3 was used to minimize the loss, and the learning rate was decreased by a factor of 0.1 every 100 iterations.
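Under these stated settings, the optimizer and learning-rate schedule could be configured as follows (a minimal sketch; the placeholder model and the interpretation of the decay as a StepLR with gamma = 0.1 every 100 steps are our assumptions):

```python
import torch

model = torch.nn.Conv2d(6, 64, 3, padding=1)   # placeholder for the enhancement network

# Adam with an initial learning rate of 1e-3; learning rate scaled by 0.1 every 100 iterations.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)

for step in range(300):                        # skeleton training loop with dummy data
    optimizer.zero_grad()
    loss = model(torch.randn(1, 6, 64, 64)).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
```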
During training, the model's performance was evaluated through loss functions (such as MSE, PSNR, etc.), which measure the discrepancy between the output and the ground-truth images. The model's weights are saved after each training epoch as .ckpt files. The loss function is expressed in Equation (11):
Total Loss = α · MSE + β · (1 − SSIM) + γ · PSNR
Here, α, β, and γ are weighting coefficients that balance the influence of the different components in the loss function. With this loss, the performance of the underwater image enhancement model can be effectively evaluated and optimized, improving the quality of the enhanced images. When needed, the optimal model weights can be loaded from the stored files for inference or further training.
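A hedged sketch of such a combined objective is shown below. The single-window SSIM approximation follows Equation (17), and the weights α, β, γ are arbitrary placeholders; the exact values and the SSIM variant used for training are not specified here.

```python
import torch
import torch.nn.functional as F

def psnr(pred, target, max_val=1.0):
    return 10.0 * torch.log10(max_val ** 2 / (F.mse_loss(pred, target) + 1e-12))

def ssim_simple(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    # Global (single-window) approximation of Equation (17), for illustration only.
    mu_x, mu_y = pred.mean(), target.mean()
    cov = ((pred - mu_x) * (target - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (pred.var() + target.var() + c2))

def total_loss(pred, target, alpha=1.0, beta=0.5, gamma=0.01):
    # Equation (11): weighted combination of MSE, (1 - SSIM), and a PSNR term.
    return alpha * F.mse_loss(pred, target) + beta * (1 - ssim_simple(pred, target)) + gamma * psnr(pred, target)

loss = total_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```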

3.2. Ablation Study

For the underwater images, the evaluation metrics include the Peak Signal-to-Noise Ratio (PSNR) [40], the Structural Similarity Index (SSIM) [41], and the Mean Squared Error (MSE). MSE is the mean squared error between two approximately matching images I and K, as defined in Equation (12):
\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i, j) - K(i, j) \right]^2
The PSNR metric is the ratio of the maximum possible signal value to the mean squared error, expressed in logarithmic decibel units, as indicated in Equation (13):
\mathrm{PSNR} = 10 \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) = 20 \log_{10} \left( \frac{\mathrm{MAX}_I}{\sqrt{\mathrm{MSE}}} \right)
where MAX_I denotes the maximum possible pixel value of the image. Higher PSNR values indicate a clearer image. SSIM takes two input images and assesses their similarity: one is an uncompressed, undistorted reference image and the other is the restored image, so SSIM can serve as a quality-assessment metric. Assuming x and y are the two input images, SSIM(x, y) is defined in Equation (14):
\mathrm{SSIM}(x, y) = \left[ l(x, y) \right]^{\alpha} \left[ c(x, y) \right]^{\beta} \left[ s(x, y) \right]^{\gamma}
Here, α > 0, β > 0 and γ > 0. l(x, y), c(x, y), and s(x, y) are defined in Equations (15) and (16):
l(x, y) = \frac{2 \mu_x \mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1}, \qquad c(x, y) = \frac{2 \sigma_{xy} + c_2}{\sigma_x^2 + \sigma_y^2 + c_2}, \qquad s(x, y) = \frac{\sigma_{xy} + c_3}{\sigma_x \sigma_y + c_3}
\mu_x = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \sigma_x = \left( \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu_x)^2 \right)^{1/2}, \qquad \mathrm{Cov}(X, Y) = E\left[ (X - E[X])(Y - E[Y]) \right]
Here, c1, c2, and c3 are small constants adopted to prevent errors caused by a zero denominator. In practice, it is common to set α = β = γ = 1 and c3 = c2/2, and σ_xy denotes the covariance of x and y. SSIM then simplifies to Equation (17):
\mathrm{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + c_1)(2 \sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}
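For concreteness, Equations (12), (13), and (17) can be evaluated on 8-bit images as in the NumPy sketch below; a windowed SSIM as in [41] would normally replace this global approximation.

```python
import numpy as np

def mse(I, K):                                  # Equation (12)
    return np.mean((I.astype(np.float64) - K.astype(np.float64)) ** 2)

def psnr(I, K, max_i=255.0):                    # Equation (13)
    return 10 * np.log10(max_i ** 2 / mse(I, K))

def ssim_global(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):   # Equation (17), single window
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    cov = np.mean((x - mu_x) * (y - mu_y))
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (x.var() + y.var() + c2))

reference = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
restored = np.clip(reference + np.random.randint(-5, 6, (64, 64)), 0, 255).astype(np.uint8)
print(psnr(reference, restored), ssim_global(reference, restored))
```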
An ablation study is a commonly used method in machine learning to evaluate the importance and contribution of the various components of a model. By systematically removing certain parts of the model and observing the changes in performance, it is possible to identify which parts are important. A similar approach has been adopted by other scholars [42].
Rigorous ablation experiments were conducted on the LSUI and UIEB datasets to evaluate three key factors: CNN features enhanced by the Fourier transform, transformer features based on PSNR attention and linear operations, and feature fusion with Fourier weights. Figure 12 illustrates the enhancement effect in each ablation experiment; in Figure 12, (b–d) all use the same input shown in (a). Additionally, Table 2 presents the PSNR and SSIM comparison metrics for the ablation study, where "✓" denotes the adopted scheme.
The experimental results indicate that image quality can be enhanced by the CNN and transformer architectures individually. Moreover, the integration of the CNN and transformer features yields a notable further improvement in image enhancement.
Using the appropriate PyTorch utilities, the best trained model was loaded for test verification. Through normalization and resizing operations, the input images were standardized to match the preprocessing steps. Timing tools were used to record the start and end times of model inference, and the inference time for a single image was obtained as the difference between them. In the experiment (Table 3), the two types of backbone feature-extraction networks were employed and their times recorded separately. The experiment demonstrates that the element-wise transformer attention significantly reduces the time consumption. Additionally, while the dual-branch approach increases the processing time, our method achieves satisfactory enhancement with a time consumption similar to that of the single-transformer approach.
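The per-image latency reported in Table 3 can be measured with a pattern like the one below (a generic timing sketch: the placeholder model, the warm-up iterations, and the CUDA synchronization calls are our assumptions to make the measurement meaningful):

```python
import time
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1).eval()   # placeholder for the enhancement network
image = torch.rand(1, 3, 256, 256)
if torch.cuda.is_available():
    model, image = model.cuda(), image.cuda()

with torch.no_grad():
    for _ in range(10):                               # warm-up before timing
        model(image)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    model(image)
    if torch.cuda.is_available():
        torch.cuda.synchronize()                      # wait for the GPU before stopping the clock
    latency_ms = (time.perf_counter() - start) * 1000
print(f"single-image inference latency: {latency_ms:.2f} ms")
```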

3.3. Feature Visualization Process

To validate the robustness of our feature extraction method, feature visualization was conducted (Figure 13). The visualized features include the two types of network features. The transformer features, extracted within the token range, improve the local perception accuracy. The transformer captures long-range dependencies through its self-attention mechanism, which allows it to integrate features from any position within the image; this is crucial for tasks such as image restoration and color correction. According to the visualized results, the transformer effectively restores the local color and structure of images, overcoming the shortcomings of CNNs in local feature extraction.
Conversely, the CNN features provide a global perspective, contributing to improved global perception accuracy. From the visualization, it is easy to see that CNNs excel at capturing the salient features of images. Through convolution operations, CNNs efficiently extract image details such as edges and textures, thereby suppressing noise and enhancing boundary detail. Deeper convolutional layers enable CNNs to progressively extract high-quality features from images, which is particularly effective for removing random noise in underwater images.
Furthermore, we obtained high-pass and low-pass filtered features via the Fourier transform, which are subsequently employed as fusion weights for the two backbones' features. High-pass filters extract the edge details of images, whereas low-pass filters capture the smooth information. This complementary information is multiplied with the transformer and CNN features, respectively, and this matching process improves the accuracy of feature extraction and fusion.
The visual results show that the integration of CNNs with transformers yields superior image enhancement. In summary, the CNN removes most of the noise and enhances the overall color and structure, while the transformer restores the local details of the image. This combination effectively reduces noise and significantly improves the overall image quality, and the effect is particularly notable for complicated underwater images.
Our proposed methods are also compared with other structures. Figure 14b,c show the visualized global features from the improved CNN network and the traditional ResNet, respectively. The results indicate that the proposed method produces more prominent edge features, whereas the traditional ResNet extracts relatively blurred features. The experiments demonstrate the superiority of the proposed method, which integrates both time-domain and frequency-domain features.
Different feature-fusion methods are also compared in Figure 14d,e. The results show that our method optimizes the fusion weights for different objects, which enhances feature diversity. In contrast, the torch.add method reduces the diversity and prominence of the features.

3.4. Comparison with Current Methods

Our approach was qualitatively compared with other state-of-the-art (SOTA) image enhancement methods, including MIR-Net [40], U-Net [41], WaterNet [43], and Ucolor [44]. Additionally, the proposed backbone was compared quantitatively with the traditional CNN and Transformer architectures.

3.4.1. Qualitative Analysis

Visual samples from LSUI are displayed in Figure 15 and compared with other commonly used methods. The proposed approach demonstrates outstanding clarity, showcasing finer details, consistent colors, and higher visibility. Additionally, the method's outputs display fewer visual artifacts, especially in zones with complicated textures.
A visual comparison on the UIEB dataset is presented in Figure 16, highlighting our method's ability to handle noisy and low-light images. The results indicate that our approach significantly enhances image brightness, enriches details, and suppresses noise.

3.4.2. Quantitative Analysis

To compare with other image restoration networks, PSNR and SSIM are used to evaluate performance; generally, a higher SSIM implies that the image preserves more detail and structure. The codes for comparison were obtained from the corresponding publications, and all experiments were conducted on the same original input dataset, without using optimized images from any intermediate process. Table 4 provides a comparative analysis of the various methods, indicating that our algorithm outperforms the others, achieving the highest PSNR and SSIM scores.
It is worth noting that, compared to the Transformer method [26], our linear-multiplication backbone utilizes only 60% of the parameters. Additionally, our method outperforms the Ucolor-based approach [44]. Furthermore, our method outperforms MIR-Net [40], U-Net [41], and WaterNet [43], yielding improvements of 1–3 dB in PSNR and 0.1–0.3 in SSIM.

3.5. Comparison on Detection Tasks

To evaluate the effect of underwater image enhancement on detection tasks, the enhanced images were fed into a series of detection algorithms, including the single-stage methods SSD, RetinaNet, and GIoU-based detectors [45,46]. The enhanced images were utilized as inputs for the various detection tasks. The detection results demonstrate that the proposed method exceeds the competing methods in detection accuracy [47,48]. The visualized detection results in Figure 17 agree with the objective outcomes, demonstrating our approach's superiority.
Using precision-recall and recall-confidence curves as evaluation metrics [49], Figure 18 presents a quantitative comparison of visual detection. Owing to the improved color and brightness, our method demonstrates a notable improvement in the precision and recall indexes [50]. The images enhanced by this method yield superior detection outcomes, marking a significant improvement over the competing techniques.

4. Conclusions

Light absorption and scattering by the surrounding water lead to the loss of certain details and color information in underwater images. To address issues such as low illumination, reduced contrast, and color shift in underwater imagery, an underwater image enhancement algorithm is proposed based on the parallel fusion of a transformer and a CNN. Experiments indicate that this approach effectively combines the local context capture ability of transformers with the global feature extraction capability of CNNs, thereby improving the richness and accuracy of the extracted features. To effectively reduce the computational load and alleviate color artifacts, a novel transformer model integrates PSNR attention and linear operations; through this formulation, the computational complexity is reduced from 2d²n to 3dn while constrained features are extracted. Additionally, by leveraging both temporal and frequency domain characteristics, a novel global feature extraction network is devised to enrich the image features. The high-frequency and low-frequency information of the input image's Fourier transform is extracted and used to fuse the different backbones' features. Experiments show that this method optimizes the fusion weights for the Transformer and CNN features, enriching the diversity of the representation features. Compared with current mainstream algorithms, this method achieves optimal values in the objective evaluation metrics and also produces superior subjective perceptual quality in the generated images.

Author Contributions

Conceptualization, X.L. and F.M.; methodology, X.L. and Z.C.; software, X.L.; validation, Z.C., F.M. and Z.X.; formal analysis, Z.X.; investigation, Z.X. and Z.Z.; data curation, Y.W.; writing—original draft preparation, X.L.; writing—review and editing, F.M. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded in part by Development of Intelligent Breeding Equipment for Cabin Breeding Platform (2022YFD2401104), in part by Central Public-Interest Scientific Institution Basal Research Fund, FMIRI of CAFS (No. 2024YJS011), in part by the Guangdong Basic and Applied Basic Research Foundation (No. 2022A1515110038), in part by the China Postdoctoral Science Foundation (No. 2020T130474), and Macau Young Scholars Program (No. AM2021003).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this paper are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, W.; Liu, W.; Li, L. Underwater Single-Image Restoration with Transmission Estimation Using Color Constancy. J. Mar. Sci. Eng. 2022, 10, 430. [Google Scholar] [CrossRef]
  2. Chiang, J.Y.; Chen, Y.-C. Underwater Image Enhancement by Wavelength Compensation and Dehazing. IEEE Trans. Image Process. 2011, 21, 1756–1769. [Google Scholar] [CrossRef] [PubMed]
  3. Yang, H.; Tian, F.; Qi, Q.; Wu, Q.M.J.; Li, K. Underwater image enhancement with latent consistency learning-based color transfer. IET Image Process. 2022, 16, 1594–1612. [Google Scholar] [CrossRef]
  4. Mustafa, W.A.; Kader, M.M.M.A. A Review of Histogram Equalization Techniques in Image Enhancement Application. J. Physics: Conf. Ser. 2018, 1019, 012026. [Google Scholar] [CrossRef]
  5. Zhou, J.; Wei, X.; Shi, J.; Chu, W.; Zhang, W. Underwater image enhancement method with light scattering characteristics. Comput. Electr. Eng. 2022, 100, 898–915. [Google Scholar] [CrossRef]
  6. Peng, Y.-T.; Zhao, X.; Cosman, P.C. Single underwater image enhancement using depth estimation based on blurriness. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 4952–4956. [Google Scholar] [CrossRef]
  7. Song, W.; Wang, Y.; Huang, D. A rapid scene depth estimation model based on underwater light attenuation prior for underwater image restoration. In Proceedings of the 2018 Advances in Multimedia Information Processing, Hefei, China, 21–22 September 2018; pp. 678–688. [Google Scholar] [CrossRef]
  8. Cheng, C.; Zhang, H.; Li, G. Overview of Underwater Image Enhancement and Restoration Methods. In Proceedings of the International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), Baishan, China, 27–31 July 2022; pp. 520–525. [Google Scholar] [CrossRef]
  9. Drews, P.; Nascimento, E.R.; Botelho, S.S.C.; Campos, M.F.M. Underwater Depth Estimation and Image Restoration Based on Single Images. IEEE Comput. Graph. Appl. 2016, 36, 24–35. [Google Scholar] [CrossRef]
  10. Li, J.; Hou, G.; Wang, G. Underwater image restoration using oblique gradient operator and light attenuation prior. Multimedia Tools Appl. 2023, 82, 6625–6645. [Google Scholar] [CrossRef]
  11. Ma, Z.; Oh, C. A wavelet-based dual-stream network for underwater image enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, 23–27 May 2022; pp. 2769–2773. [Google Scholar] [CrossRef]
  12. Zhang, Y.; Jiang, Q.; Liu, P.; Gao, S.; Pan, X.; Zhang, C. Underwater Image Enhancement Using Deep Transfer Learning Based on a Color Restoration Model. IEEE J. Ocean. Eng. 2023, 48, 489–514. [Google Scholar] [CrossRef]
  13. Wang, K.; Hu, Y.; Chen, J.; Wu, X.; Zhao, X.; Li, Y. Underwater Image Restoration Based on a Parallel Convolutional Neural Network. Remote. Sens. 2019, 11, 1591. [Google Scholar] [CrossRef]
  14. Ueki, Y.; Ikehara, M. Underwater Image Enhancement with Multi-Scale Residual Attention Network. In Proceedings of the IEEE International Conference on Visual Communications and Image Processing (VCIP), Munich, Germany, 5–8 December 2021; pp. 1–5. [Google Scholar] [CrossRef]
  15. Xing, Z.; Cai, M.; Li, J. Improved Shallow-UWnet for Underwater Image Enhancement. In Proceedings of the International Conference on Unmanned Systems (ICUS), Guangzhou, China, 28–30 October 2022; pp. 1191–1196. [Google Scholar] [CrossRef]
  16. Li, C.; Anwar, S.; Porikli, F. Underwater scene prior inspired deep underwater image and video enhancement. Pattern Recognit. 2020, 98, 107038. [Google Scholar] [CrossRef]
  17. Chen, C.; Chen, Q.; Xu, J.; Koltun, V. Learning to See in the Dark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3291–3300. [Google Scholar] [CrossRef]
  18. Wang, H.; Yang, M.; Yin, G.; Dong, J. Self-Adversarial Generative Adversarial Network for Underwater Image Enhancement. IEEE J. Ocean. Eng. 2024, 49, 237–248. [Google Scholar] [CrossRef]
  19. Wang, Y.; Er, M.J.; Chen, J.; Wu, J. A Novel Generative Adversarial Network for Underwater Image Enhancement. In Proceedings of the International Conference on Intelligent Autonomous Systems (ICoIAS), Dalian, China, 23–25 September 2022; pp. 84–89. [Google Scholar] [CrossRef]
  20. Fabbri, C.; Islam, M.J.; Sattar, J. Enhancing underwater imagery using generative adversarial networks. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 7159–7165. [Google Scholar] [CrossRef]
  21. Balakrishnan, G.; Zhao, A.; Dalca, A.V.; Durand, F.; Guttag, J. Synthesizing images of humans in unseen poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8340–8348. [Google Scholar] [CrossRef]
  22. Hu, X.; Naiel, M.A.; Wong, A.; Lamm, M.; Fieguth, P. RUNet: A robust UNet architecture for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 505–507. [Google Scholar] [CrossRef]
  23. Wu, J.; Liu, X.; Lu, Q.; Lin, Z.; Qin, N.; Shi, Q. FW-GAN: Underwater image enhancement using generative adversarial network with multi-scale fusion. Signal Process. Image Commun. 2022, 109, 116855. [Google Scholar] [CrossRef]
  24. Terayama, K.; Shin, K.; Mizuno, K.; Tsuda, K. Integration of sonar and optical camera images using deep neural network for fish monitoring. Aquac. Eng. 2019, 86, 102000. [Google Scholar] [CrossRef]
  25. Zhang, T.; Johnson-Roberson, M. Beyond NeRF Underwater: Learning Neural Reflectance Fields for True Color Correction of Marine Imagery. IEEE Robot. Autom. Lett. 2023, 8, 6467–6474. [Google Scholar] [CrossRef]
  26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  27. Kovács, L.; Csépányi-Fürjes, L.; Tewabe, W. Transformer Models in Natural Language Processing. In International Conference Interdisciplinarity in Engineering; Lecture Notes in Networks and Systems; Springer Nature Switzerland: Cham, Switzerland, 2023; pp. 929–945. [Google Scholar] [CrossRef]
  28. Liu, C.; Wang, G.; Zhang, C.; Patimisco, P.; Cui, R.; Feng, C.; Sampaolo, A.; Spagnolo, V.; Dong, L.; Wu, H. End-to-end methane gas detection algorithm based on transformer and multi-layer perceptron. Opt. Express 2024, 32, 987–1002. [Google Scholar] [CrossRef]
  29. Zamir, S.; Arora, A.; Khan, S. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE Press: Piscataway, NJ, USA, 2022; pp. 5718–5729. [Google Scholar] [CrossRef]
  30. Song, Y.; He, Z.; Qian, H.; Du, X. Vision Transformers for Single Image Dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941. [Google Scholar] [CrossRef] [PubMed]
  31. Berman, D.; Levy, D.; Avidan, S.; Treibitz, T. Underwater single image color restoration using haze-lines and a new quantitative dataset. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 2822–2837. [Google Scholar] [CrossRef]
  32. Aggarwal, C.C. Neural Networks and Deep Learning; Springer: Berlin/Heidelberg, Germany, 2018. [Google Scholar]
  33. Lu, Z.; Jiang, X.; Kot, A. Deep Coupled ResNet for Low-Resolution Face Recognition. IEEE Signal Process. Lett. 2018, 25, 526–530. [Google Scholar] [CrossRef]
  34. Huang, J.; Liu, Y.; Zhao, F.; Yan, K.; Zhang, J.; Huang, Y.; Zhou, M.; Xiong, Z. Deep Fourier-Based Exposure Correction Network with Spatial-Frequency Interaction. In Proceedings of the European Conference Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 163–180. [Google Scholar] [CrossRef]
  35. Zhou, J.; Ni, J.; Rao, Y. Block-Based Convolutional Neural Network for Image Forgery Detection. In Proceedings of the Digital Forensics and Watermarking: 16th International Workshop IWDW, Magdeburg, Germany, 23–25 August 2017; Lecture Notes in Computer Science. Springer International Publishing: Cham, Switzerland, 2017; Volume 10431. [Google Scholar] [CrossRef]
  36. Zou, B.J.; Guo, Y.D.; He, Q.; Ouyang, P.B.; Liu, K.; Chen, Z.L. 3D Filtering by Block Matching and Convolutional Neural Network for Image Denoising. J. Comput. Sci. Technol. 2018, 33, 838–848. [Google Scholar] [CrossRef]
  37. Abbas Shahri, A.; Maghsoudi Moud, F. Landslide susceptibility mapping using hybridized block modular intelligence model. Bull. Eng. Geol. Environ. 2021, 80, 267–284. [Google Scholar] [CrossRef]
  38. Liu, Q.; Su, Y.; Xu, P. Implementation of Artificial Intelligence Anime Stylization System Based on PyTorch. In Proceedings of the Annual International Conference on Network and Information Systems for Computers (ICNISC), Wuhan, China, 27–29 October 2023; pp. 84–87. [Google Scholar] [CrossRef]
  39. Peng, L.; Zhu, C.; Bian, L. U-Shape Transformer for Underwater Image Enhancement. IEEE Trans. Image Process. 2023, 32, 3066–3079. [Google Scholar] [CrossRef]
  40. Li, C.; Guo, C.; Ren, W.; Cong, R.; Hou, J.; Kwong, S.; Tao, D. An Underwater Image Enhancement Benchmark Dataset and Beyond. IEEE Trans. Image Process. 2019, 29, 4376–4389. [Google Scholar] [CrossRef]
  41. Basha, C.; Pravallika, B.; Shankar, E. An Efficient Face Mask Detector with PyTorch and Deep Learning. EAI Endorsed Trans. Pervasive Health Technol. 2021, 7, 167843. [Google Scholar] [CrossRef]
  42. Li, W.; Li, S.; Liu, R. Channel Shuffle Reconstruction Network for Image Compressive Sensing. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 2880–2884. [Google Scholar] [CrossRef]
  43. Zhang, Y.; Liu, Y.; Li, X.; Zhang, C. Salt and pepper noise removal in surveillance video based on low-rank matrix recovery. Comput. Vis. Media 2015, 1, 59–68. [Google Scholar] [CrossRef]
  44. Yao, J.; Liu, G. Improved SSIM IQA of contrast distortion based on the contrast sensitivity characteristics of HVS. IET Image Process. 2018, 12, 872–879. [Google Scholar] [CrossRef]
  45. Liu, R.; Jiang, Z.; Yang, S.; Fan, X. Twin Adversarial Contrastive Learning for Underwater Image Enhancement and Beyond. IEEE Trans. Image Process. 2022, 31, 4922–4936. [Google Scholar] [CrossRef]
  46. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H.; Shao, L. Learning enriched features for real image restoration and enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 45, 1934–1948. [Google Scholar] [CrossRef]
  47. Liu, X.; Chen, G.; Sun, X.; Knoll, A. Ground Moving Vehicle Detection and Movement Tracking Based On the Neuromorphic Vision Sensor. IEEE Internet Things J. 2020, 7, 9026–9039. [Google Scholar] [CrossRef]
  48. Liu, X.; Yang, Z.; Hou, J.; Huang, W. Dynamic Scene’s Laser Localization by NeuroIV-based Moving Objects Detection and LIDAR Points Evaluation. IEEE Trans. Geosci. Remote Sens. 2022, 6, 5230414. [Google Scholar] [CrossRef]
  49. Tan, L.; Huangfu, T.; Wu, L.; Chen, W. Comparison of RetinaNet, SSD, and YOLO v3 for real-time pill identification. BMC Med. Informatics Decis. Mak. 2021, 21, 324–337. [Google Scholar] [CrossRef]
  50. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional network for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar] [CrossRef]
Figure 1. Visual acquisition by our underwater robot and camera. (a) Our underwater robot. (b) The acquired low-quality underwater image without lamp lighting. (c) The acquired low-quality underwater image under lamp lighting.
Figure 2. Architecture of the proposed network, which consists of three parts: encoder, fusion, and decoder.
Figure 3. The learning process for the optimal model [32]. (a) Convolutional Neural Network. (b) The convolutional kernels. (c) The activation functions.
Figure 4. Comparison of different CNN feature-extraction network units. (a) The commonly used ResNet network. (b) Our feature network incorporating both temporal and frequency domain characteristics.
Figure 5. The transformer feature-extraction network with PSNR-based attention and linear operations. (a) The dot-product multiplication operation. (b) The element-wise multiplication operation with PSNR-based attention.
Figure 6. Comparison of different image block-based or modular structures. (a) The independent-block method. (b) The proposed cross-attention method with the element-wise multiplication operation.
Figure 7. Experimental verification of different image block-based or modular structures. (a) The training loss. (b) The discontinuous features learned by the independent-block method. (c) The successive features learned by the cross-attention method. (d) The PSNR and SSIM performance for the differently enhanced images.
Figure 8. Fourier transformation of the image. (a) Original image. (b) Fourier transformation. (c) Shifted frequency.
Figure 9. Low-pass filter. (a) Ideal low-pass filter. (b) Gaussian low-pass filter. (c) Low-pass filter result.
Figure 10. High-pass filter. (a) Ideal high-pass filter. (b) Gaussian high-pass filter. (c) High-pass filter result.
Figure 11. Comparison of different feature fusion methods. (a) The commonly used torch.add method adds the CNN and transformer features. (b) The proposed approach incorporates different features with the weights of low-frequency and high-frequency features.
Figure 12. Ablation experiment with different components. (a) The same input used for the (b–d) methods. (b) The CNN method with the Fourier transform. (c) The transformer method based on PSNR attention and linear operations. (d) The CNN and transformer fusion method with Fourier weights. (e) Ground truth.
Figure 13. Visualization of the features, including the transformer branch, the CNN branch, and the fusion weights from the original image's Fourier transform. (a) Input. (b) Transformer features. (c) CNN features. (d) Low-pass filtering attention. (e) High-pass filtering attention. (f) Ground truth.
Figure 14. Validation of the proposed methods against other commonly used structures. (a) Input image. (b) The proposed CNN network incorporating both temporal and frequency domain characteristics. (c) The original ResNet adopting only the temporal characteristics. (d) The feature fused with the Fourier transform weights of the original image. (e) The feature fused with the torch.add method. (f) Ground truth image.
Figure 15. Qualitative analysis with the LSUI dataset.
Figure 16. Qualitative analysis with the UIEB dataset.
Figure 17. Visualized detection results with different image enhancement effects.
Figure 18. Precision-recall and recall-confidence curves with different image enhancement effects.
Table 1. Computation load for different parameters.

Parameter    Computation Load    Computation Load Summation
η            d·n                 3d·n
q            d·n
x̂            d·n
Table 2. Comparative test of ablation experiments ("✓" denotes the adopted scheme).

Structures: CNN | Fourier | Transformer | SNR Attention      Fusion: Additive Fusion | Fourier Fusion
LSUI (PSNR / SSIM)    UIEB (PSNR / SSIM)
15.22 / 0.47          13.03 / 0.42
18.82 / 0.64          16.77 / 0.60
24.83 / 0.79          21.70 / 0.70
24.42 / 0.75          21.56 / 0.69
26.53 / 0.83          23.85 / 0.78
Table 3. Comparison of different matrix-multiplication attention and computation latency.

Backbone             Attention                   Latency (ms)
Transformer          Dot-product Transformer     2.5
Transformer          Element-wise Transformer    2.1
Transformer + CNN    Dot-product Transformer     3.0
Transformer + CNN    Element-wise Transformer    2.6
Table 4. Comparative evaluation with different image enhancement networks.

Method              LSUI PSNR    LSUI SSIM    UIEB PSNR    UIEB SSIM
CNN [16]            15.28        0.50         13.68        0.48
MIR-Net [40]        18.80        0.66         16.78        0.63
U-net [41]          19.45        0.78         17.46        0.76
WaterNet [43]       19.62        0.80         19.27        0.83
Ucolor [44]         21.62        0.84         20.67        0.81
Transformer [26]    22.83        0.79         21.70        0.70
Ours                24.49        0.85         22.79        0.81