1. Introduction
In recent years, with the continuous development of the power system, the overall length of State Grid transmission lines in China had reached 1.142 million kilometers by the end of 2020. During winter in northern China, transmission lines are susceptible to ice and snow coverage due to variations in air humidity, wind speed, temperature, and other factors, which poses safety and stability risks to the cables [
1,
2]. Ice coverage on transmission lines not only reduces line load capacity but may also cause equipment failures, power outages, and other problems that seriously affect the operation of the power grid [
3,
4,
5]. Therefore, in order to accurately monitor the condition of power transmission lines, we designed a suspended intelligent vibration de-icing robot as shown in
Figure 1. The robot is suspended on the cables between towers, and a camera mounted on its side takes upward shots of the cables; artificial intelligence is then used to detect snow and ice on the cables and analyze their icing condition. The robot then removes thin ice from the cable surface through autonomous judgment and vibration. It can independently perceive its environment, make decisions, and act without external control or intervention.
However, a crucial prerequisite for robots to effectively monitor transmission lines and ensure their safe operation is the collection of high-quality images. In practice, the images collected by robots usually contain noise caused by swaying in the breeze, drifting snow, frost, and similar conditions. This noise hampers the robot's ability to accurately assess the cable icing status and take appropriate action, thereby exacerbating the cable icing problem [
6,
7]. Therefore, denoising images of ice-covered transmission lines is a key step toward improving image quality and enabling accurate monitoring. By employing denoising techniques, we can acquire a precise and unambiguous representation of the cable's actual condition. This information helps the robot assess the presence of ice coating on the cable and serves as a reliable and timely basis for the autonomous vibration system to make accurate judgments. Consequently, the robot can adjust its vibration amplitude accordingly to effectively remove the ice, thereby ensuring the safe operation of the transmission lines [
8,
9].
At present, there is a limited amount of research on image denoising of transmission lines, so we rely on the knowledge and expertise gained from the field of image processing to inform our approach. Traditional image denoising algorithms primarily denoise images by considering the statistical properties of individual pixels or local neighborhoods, such as median filtering [
10], mean filtering [
11], and non-local means (NLM) [
12]. These methods are simple to implement and easy to understand, but because they operate mainly on local image information, they have limited ability to preserve global structure and texture and adapt poorly to complex image content. With the advancement of deep learning, researchers have increasingly applied it to image denoising, with models such as FFDNet [
13], CTNet [
14], and DRSformer [
15]. These methods typically employ a neural network to learn the mapping between noisy and clean images by training on a large number of image pairs. They can learn the intricate structures and characteristics of images, enabling them to remove noise effectively while preserving more detail. However, achieving an optimal denoising effect for certain noise types or complex image structures still requires specific designs and adjustments [
16,
17].
Therefore, this paper investigates and refines current image denoising algorithms so that they are most effective for the unique image features of transmission lines. We propose an enhanced denoising Unet++ for ice-covered transmission line images (EDUNet++). The algorithm uses the Unet++ model to encode and decode images. To enhance the model's feature extraction capability at this stage, better retain the structure and content of the image, and adapt to noise at different scales, we propose a residual attention module (RAM). The RAM effectively suppresses invalid information in features and enhances image details. To combine features at different levels and capture image context, we propose a multilevel feature attention module (MFAM). The MFAM effectively combines local and global features and attends to different features at different levels, which better captures details and texture information in images. To exploit the details, texture, and other information in low-level features, we propose a shared source feature fusion module (SSFFM). The SSFFM avoids the excessive smoothing caused by over-reliance on high-level features and enhances the realism and clarity of the image. To minimize the discrepancy between the generated image and the original image, we propose an error correction module (ECM). The ECM acts on generated images by computing the error between images, thus producing more realistic results. During training, to improve the recovery of detailed information in the generated image, we use a piecewise joint loss function. Based on the above, the main contributions of this paper are as follows:
- (1)
The residual attention module (RAM) and the multilevel feature attention module (MFAM) are proposed in the feature encoding and decoding module (FEADM) to improve the feature extraction capabilities of the model, to effectively combine local and global features, and to suppress the influence of noise information.
- (2)
The shared source feature fusion module (SSFFM) is proposed to enhance the model’s utilization of source feature information, to understand image information at different levels, and to improve the denoising performance of the model.
- (3)
The error correction module (ECM) is proposed to enhance the model’s ability to fully capture the rich potential information in the image, and to help generate more realistic images by calculating the error.
- (4)
The piecewise joint loss function is innovatively employed in this model. Initially, Mean Squared Error (MSE) loss is utilized to enhance the model’s convergence speed during the early stage of training. Subsequently, a joint function consisting of Charbonnier loss and structural similarity index measure (SSIM) loss is employed in the later stage to enhance the model’s ability to handle outliers effectively.
- (5)
A new denoising model is proposed, called EDUNet++. A large number of experiments show that the EDUNet++ is superior to current denoising algorithms in both quantitative and qualitative aspects.
2. Related Work
Traditional image denoising methods mainly relied on prior knowledge. Dabov et al. [
18] proposed the block-matching and 3D filtering method (BM3D), which exploits the self-similarity of natural images to group similar image blocks and then filters the grouped blocks collaboratively in a transform domain to form the denoised image. Li et al. [
19] proposed an adaptive matching and tracking algorithm. First, the sparse coefficients are calculated; then the K-singular value decomposition (K-SVD) algorithm is used to train the dictionary into an adaptive dictionary that effectively reflects the image structure; finally, the sparse coefficients are combined with the adaptive dictionary for image reconstruction. Qian et al. [
20] used sparse coding to optimize the block matching results. This approach utilizes non-local information and introduces graph Laplacian regularization to maintain local information, but image details are severely lost in the presence of strong noise. The feature extraction process of traditional methods is complex, computationally intensive and time-consuming, and has limitations when dealing with complex noise [
21].
In recent years, deep learning-based methods have achieved many results in the field of image denoising. Convolutional neural networks have two major characteristics, local perception and parameter sharing, and they achieve good results in image feature extraction and recognition [
22]. Zhang et al. [
23] proposed a denoising convolutional neural network (DnCNN) model that combines batch normalization and residual learning. It denoises uniformly distributed Gaussian noise well, but the convolution and pooling layers it uses yield a relatively small receptive field, which limits its feature extraction capability and prevents it from fully capturing contextual information for larger image structures. Guo et al. [
24] proposed the convolutional blind denoising network (CBDNet), which uses a fully convolutional network to estimate the noise level to achieve an adaptive denoising effect. However, the fully convolutional network pays more attention to local information and therefore has a negative impact on global consistency. Ding et al. [
25] employed dilated convolutions to construct a two-stage blind denoising network that obtains comprehensive features over receptive fields of different sizes. However, the network structure is relatively simple, which limits the information it can capture. Huang et al. [
26] proposed a channel affine self-attention-based progressively updated network (CasaPuNet), which adopts a multi-stage progressive update method, uses channel affine self-attention to extract channel information from input features, and adaptively fuses multi-stage features through skip connections and residual structures. Mou et al. [
27] integrated local and non-local attention mechanisms into deep networks to restore images containing complex textures. Zamir et al. [
28] proposed a multi-stage progressive image restoration network (MPRNet), which exchanges information at different stages to reduce the loss of detailed information. Potlapalli et al. [
29] proposed a prompt-based learning approach (PromptIR), which generates intermediate representations from input images through multiple decoding stages to reflect semantic features, and uses these intermediate representations as guidance to restore images in the decoding stage. However, this method demands more computing resources during training and inference.
The introduction of the self-attention mechanism in transformer neural networks solves the problem of the limited receptive field of the convolution operator and the inability of the network to flexibly adapt to the input content [
30]. In recent years, the performance of transformer-based modules in visual tasks has also been significantly improved. Wang et al. [
31] proposed a general U-shaped transformer network (Uformer) which, to a certain extent, solves the limitations of traditional convolutional neural networks in denoising problems, but lacks clear spatial position information. Yao et al. [
32] proposed a dense residual transformer network (DenSformer), which combines a transformer with dense and residual connections to bridge and fuse features between different layers. Transformer-based methods apply a self-attention mechanism to capture long-range dependencies between image patches. However, they primarily focus on structure-level characteristics and overlook the enhancement of pixel-level features, which can leave residual textures in the denoised images.
3. Network Structure
The network structure proposed in this article is shown in
Figure 2 and has three primary modules: the feature encoding and decoding module (FEADM), the shared source feature fusion module (SSFFM), and the error correction module (ECM). The input of the network is a non-uniform noisy image, $x$. First, a convolution layer is utilized to extract the initial feature, $F_0$. Then, in order to restore as much image texture and edge information as possible and to remove the noise corresponding to the image content, $F_0$ is input into the FEADM. There, the fused features are encoded and decoded by the residual attention module (RAM) and the multilevel feature attention module (MFAM). Finally, the reconstructed noise-free image is recorded as $\hat{y}_0$. The process is expressed as:
$$F_0 = Conv(x), \qquad F^{i,j} = \begin{cases} \mathcal{H}\big(\mathcal{D}(F^{i-1,0})\big), & j = 0 \\ \mathcal{H}\big(\big[[F^{i,k}]_{k=0}^{j-1},\ \mathcal{U}(F^{i+1,j-1})\big]\big), & j > 0 \end{cases} \qquad \hat{y}_0 = \mathcal{R}(F^{0,J})$$
In the above formulas, $Conv(\cdot)$ represents the convolution operation applied to the noisy input $x$; $F^{i,j}$ represents the intermediate feature at depth $i$ and dense-block index $j$ in the feature encoding and decoding module; $\mathcal{D}(\cdot)$ represents the feature down-sampling operation; $\mathcal{U}(\cdot)$ represents the feature up-sampling operation; $[\cdot]$ represents the feature splicing operation; $\mathcal{H}(\cdot)$ represents the residual attention module (RAM), multilevel feature attention module (MFAM), and source features involved in the encoding and decoding process; and $\mathcal{R}(\cdot)$ represents the reconstruction of the noise-free image $\hat{y}_0$ through the encoding and decoding module.
However, $\hat{y}_0$ still has some distorted texture information in local areas, so after correction by the ECM, the noise-free image $\hat{y}$ is generated. The process is expressed as:
$$\hat{y} = \mathcal{F}_{ECM}(x, \hat{y}_0)$$
In this formula, $\mathcal{F}_{ECM}(\cdot)$ represents the ECM, whose inputs are the noisy image $x$ and the reconstructed image $\hat{y}_0$. An error feature is generated by subtracting the two images and is then weighted by attention; finally, it is used to correct the blurred texture information in the image.
3.1. Feature Encoding and Decoding Module
3.1.1. Residual Attention Module (RAM)
In order to enhance the model’s ability to accurately capture crucial information in the image and preserve valuable details while removing noise, we propose the residual attention module (RAM) as shown in
Figure 3.
The RAM learns a mapping that preserves the original image information by utilizing skip connections, and it uses previously learned features to preserve the details of the image structure. This enhances the network's training process and enables more efficient learning of image features. The module consists of M residual blocks, and each residual block includes two convolution operations, two residual connections, and a combined attention fusion module consisting of channel attention and pixel attention. Channel attention makes the network pay more attention to the channels that matter most for image denoising, helping to improve the model's perception of different features. Pixel attention lets the network focus more intensively on the pixels that are important to the denoising task, helping it better capture details and structural information in the image. The combined attention fusion module utilizes channel-level and pixel-level information so that the model can more comprehensively understand the structure and content of the image and better adapt to noise at different scales and frequencies. By integrating these two levels of information, the model can more accurately and selectively amplify or attenuate specific elements of an image, improving overall denoising performance. The output
$F_n$ of the $n$-th residual block is:
$$F_{mid} = Conv(F_{n-1}) + F_{n-1}, \qquad F_n = PA\big(CA\big(Conv(F_{mid})\big)\big) + F_{n-1}$$
In the above formula, $F_{n-1}$ represents the input of the $n$-th residual block, $CA(\cdot)$ represents the channel attention operation, and $PA(\cdot)$ represents the pixel attention operation. First, $F_{n-1}$ undergoes a convolution operation and is added back to $F_{n-1}$ through a residual connection to obtain the intermediate feature $F_{mid}$. Next, $F_{mid}$ passes through a convolution operation followed by the channel attention and pixel attention operations. Finally, the output feature $F_n$ is obtained by a second residual addition with $F_{n-1}$.
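As a concrete illustration, one residual block of the RAM can be sketched in PyTorch as follows. This is a minimal interpretation of the description above; the kernel sizes, channel-reduction ratio `r`, and the internal layout of the channel and pixel attention branches are our assumptions, since the text does not fully specify them:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Weight each channel by a gate computed from global average pooling."""
    def __init__(self, ch, r=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch, 1), nn.Sigmoid(),
        )
    def forward(self, x):
        return x * self.gate(x)

class PixelAttention(nn.Module):
    """Weight each spatial position by a single-channel attention map."""
    def __init__(self, ch, r=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, 1, 1), nn.Sigmoid(),
        )
    def forward(self, x):
        return x * self.gate(x)

class ResidualAttentionBlock(nn.Module):
    """One RAM block: conv -> +input -> conv -> CA -> PA -> +input."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.ca = ChannelAttention(ch)
        self.pa = PixelAttention(ch)
    def forward(self, x):
        mid = self.conv1(x) + x                    # first residual addition
        return self.pa(self.ca(self.conv2(mid))) + x  # second residual addition
```

Stacking M such blocks yields the full RAM; each block preserves the input's shape, so blocks can be chained freely.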
3.1.2. Multilevel Feature Attention Module (MFAM)
In order to improve the model’s understanding of image content and structure, we propose the multilevel feature attention module (MFAM), which extracts features at different scales, as shown in
Figure 4.
The MFAM uses different convolution kernels to extract and capture features at different scales and abstraction levels. Smaller convolution kernels are better suited to capturing local features of the image, while larger kernels obtain broader global features; fusing multi-level, multi-scale feature information thus enhances the model's understanding of the image. To reduce the number of parameters, we use dilated convolutions instead of traditional large kernels: 3 × 3 convolutions with dilation rates of 2 and 3 take the place of 5 × 5 and 7 × 7 convolutions, respectively. After obtaining these diverse features, we combine them with a concatenation operation. After the module's feature fusion, we use the SE attention mechanism to calculate the weight of each channel, dynamically adjusting the importance of features at each scale and adaptively focusing on the feature information most important to the denoising task, thereby improving the performance of the model. The MFAM enhances the generalization ability of the model, adapts better to image features of different scales and structures, and is more robust to different samples and noise distributions. The output
$F_{out}$ of the MFAM is:
$$F_{cat} = \big[Conv_{1\times1}(F_{in}),\ Conv_{3\times3}(F_{in}),\ DConv_{2}(F_{in}),\ DConv_{3}(F_{in})\big]$$
$$F_{mid} = Conv_{3\times3}\big(Conv_{1\times1}(F_{cat})\big), \qquad F_{out} = SE\big([F_{mid}, F_{in}]\big)$$
In the formula, $F_{in}$ represents the input feature; $Conv_{1\times1}(\cdot)$ represents the 1 × 1 convolution; $Conv_{3\times3}(\cdot)$ represents the 3 × 3 convolution; $DConv_{2}(\cdot)$ represents the 3 × 3 dilated convolution with a dilation rate of 2; and $DConv_{3}(\cdot)$ represents the 3 × 3 dilated convolution with a dilation rate of 3. $[\cdot]$ represents the feature splicing operation, $F_{mid}$ represents the intermediate feature, and $SE(\cdot)$ represents the SE attention mechanism. First, $F_{in}$ passes through convolution kernels of different sizes to obtain features at different scales, which are then spliced together. Next, the spliced features pass through 1 × 1 and 3 × 3 convolution operations to obtain $F_{mid}$. Finally, $F_{mid}$ and $F_{in}$ are concatenated and passed through SE attention to obtain the output feature, $F_{out}$.
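The multi-branch layout above can be sketched in PyTorch as follows. This is a hedged sketch of our reading of the description: the branch channel widths, the SE reduction ratio, and the final 1 × 1 projection that returns the concatenated output to the input channel count are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention."""
    def __init__(self, ch, r=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch, 1), nn.Sigmoid(),
        )
    def forward(self, x):
        return x * self.gate(x)

class MFAM(nn.Module):
    """Parallel 1x1 / 3x3 / dilated 3x3 branches, fusion, then SE-weighted
    combination with the input feature."""
    def __init__(self, ch):
        super().__init__()
        self.b1 = nn.Conv2d(ch, ch, 1)
        self.b3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.d2 = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)  # ~5x5 receptive field
        self.d3 = nn.Conv2d(ch, ch, 3, padding=3, dilation=3)  # ~7x7 receptive field
        self.fuse = nn.Sequential(nn.Conv2d(4 * ch, ch, 1),
                                  nn.Conv2d(ch, ch, 3, padding=1))
        self.se = SEBlock(2 * ch)
        # Projection back to ch channels is our assumption; the paper does not
        # state how the concatenated output is reduced.
        self.proj = nn.Conv2d(2 * ch, ch, 1)
    def forward(self, x):
        cat = torch.cat([self.b1(x), self.b3(x), self.d2(x), self.d3(x)], dim=1)
        mid = self.fuse(cat)
        return self.proj(self.se(torch.cat([mid, x], dim=1)))
```

Note how the dilated 3 × 3 convolutions with padding equal to the dilation rate keep the spatial size unchanged while enlarging the receptive field, which is the parameter-saving substitution the text describes.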
3.2. Shared Source Feature Fusion Module (SSFFM)
Low-level features typically encompass the fundamental structures and intricate elements within images, and they play an important part in the process of image denoising. Therefore, the shared source feature fusion module (SSFFM) is proposed, which corresponds to the down-sampling part on the left side of UNet++ as depicted in
Figure 2. The source feature $F_0$ is combined prior to each down-sampling process primarily because it contains low-frequency information such as small details, textures, and edges in the image. Subtle texture plays a vital role in enhancing the authenticity and sharpness of the image, while also preventing the distortion produced by excessive smoothing. Edge information marks the transitions between different areas in an image, so enhancing these edges improves the clarity of the image and makes objects and structures easier to identify. High-level features usually contain more abstract information, and over-reliance on them may cause the image to be over-smoothed and lose some of the original details during denoising. Therefore, fusing source features helps retain more detail, improves the overall visual quality of the image, and makes the image look more natural and realistic. The module process is expressed as:
$$E_0 = F_0, \qquad E_k = \mathcal{D}\big(\big[E_{k-1},\ F_0^{\downarrow}\big]\big), \quad k = 1, \dots, 4$$
In the formula, $F_0$ represents the source feature ($F_0^{\downarrow}$ denotes it resized to the resolution of $E_{k-1}$), and $E_1$, $E_2$, $E_3$, and $E_4$ represent the intermediate features at successive down-sampling stages. The combination of source features and high-level features makes full use of the various information in the image, enabling the model to understand and process image information at different levels and improving its denoising performance.
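One down-sampling step of this source-feature fusion might look like the following PyTorch sketch. It is only an illustration of the idea under stated assumptions: the channel counts, the 1 × 1 reduction, the strided-convolution down-sampling, and the bilinear resizing of the source feature are all our choices, not details given in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SourceFusionDown(nn.Module):
    """Concatenate a resized copy of the source feature with the current
    encoder feature, reduce channels, then down-sample."""
    def __init__(self, ch_in, ch_src, ch_out):
        super().__init__()
        self.reduce = nn.Conv2d(ch_in + ch_src, ch_out, 1)
        self.down = nn.Conv2d(ch_out, ch_out, 3, stride=2, padding=1)
    def forward(self, x, src):
        # Resize the full-resolution source feature to match x.
        src = F.interpolate(src, size=x.shape[-2:],
                            mode='bilinear', align_corners=False)
        x = self.reduce(torch.cat([x, src], dim=1))
        return self.down(x)
```

Repeating this step at each encoder level keeps the low-frequency texture and edge cues of the source feature available throughout the down-sampling path, which is the stated purpose of the SSFFM.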
3.3. Error Correction Module (ECM)
The high-frequency information within the image generated by the encoding and decoding module will be partially lost. In order to fully capture the rich potential information contained in noisy images, we propose an error correction module (ECM) as shown in
Figure 5, where $x$ is the noisy image, $\hat{y}_0$ is the encoder-generated image, and $\hat{y}$ is the denoised image. This module transfers $x$ to $\hat{y}_0$ and mitigates errors at various levels by computing the error between the two images. It allocates greater weight to regions with significant errors and rectifies errors dispersed across different pixel areas. The ECM strikes a balance between reducing noise and preserving image quality, preventing excessive destruction of image content and helping to produce more lifelike images. The module process is expressed as:
$$F_e = x - \hat{y}_0, \qquad A = \sigma\big(Conv(\delta(Conv(F_e)))\big), \qquad F_a = A \otimes F_e, \qquad \hat{y} = \hat{y}_0 + F_a$$
In the formula, $F_e$ represents the error feature between the noisy image $x$ and the encoder-generated image $\hat{y}_0$; $\delta(\cdot)$ represents the ReLU activation function; $\sigma(\cdot)$ represents the Sigmoid function; $A$ represents the error attention; $F_a$ represents the error feature after the attention operation; and $\hat{y}$ represents the final generated image. The module takes $x$ and $\hat{y}_0$ as inputs: first, the error feature is obtained through a subtraction operation, and a convolution transforms its dimensions from C × H × W to 1 × H × W. Next, the attention map of the error feature is obtained through the Sigmoid operation and is used to weight the error feature, yielding the module's output, $F_a$. Finally, $F_a$ is added to $\hat{y}_0$ to obtain the final denoised image, $\hat{y}$.
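The error-correction step described above can be sketched compactly in PyTorch. This is a minimal interpretation: the two-layer conv/ReLU stack that produces the single-channel attention map is an assumption, since the text only specifies a convolution, ReLU, and Sigmoid.

```python
import torch
import torch.nn as nn

class ECM(nn.Module):
    """Error correction: weight the residual between the noisy input and the
    coarse denoised output, then add the weighted residual back."""
    def __init__(self, ch=3):
        super().__init__()
        self.att = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 1), nn.Sigmoid(),   # C x H x W -> 1 x H x W
        )
    def forward(self, x, y0):
        err = x - y0          # error feature F_e
        a = self.att(err)     # 1 x H x W attention map, values in (0, 1)
        return y0 + a * err   # corrected image
```

Because the attention map lies in (0, 1), the module can re-inject at most the full residual in high-error regions while leaving well-reconstructed regions almost untouched, which matches the balance between noise removal and content preservation described above.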
3.4. Loss Function
When training a model for image denoising, the choice of loss function affects the quality and characteristics of the generated images. MSE loss is often used for image denoising tasks due to its simplicity of calculation and ease of optimization. However, the resulting denoised image may be over-smoothed and lose detailed information, performing poorly on the high-frequency components of the noise. Charbonnier Loss [
33] is less sensitive to outliers than MSE Loss and pays more attention to detailed information, thus helping to retain the subtle structure and texture in the image. SSIM Loss [
34] considers brightness, contrast, and structural similarity. It is more in line with human visual perception, helping to retain the structure and details of the image and to make the generated image closer to what the human eye perceives. Therefore, we use MSE loss during the initial phase of training, enabling the model to be optimized quickly. Then, we use a joint loss that combines Charbonnier loss and SSIM loss to retain more of the image's detailed information and improve the denoising performance of the model.
$$\mathcal{L}_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\big(\hat{y}_i - y_i\big)^2, \qquad \mathcal{L}_{Char} = \frac{1}{N}\sum_{i=1}^{N}\sqrt{\big(\hat{y}_i - y_i\big)^2 + \varepsilon^2}$$
$$\mathcal{L}_{SSIM} = 1 - \frac{\big(2\mu_{\hat{y}}\mu_{y} + C_1\big)\big(2\sigma_{\hat{y}y} + C_2\big)}{\big(\mu_{\hat{y}}^2 + \mu_{y}^2 + C_1\big)\big(\sigma_{\hat{y}}^2 + \sigma_{y}^2 + C_2\big)}$$
In the formulas, $\hat{y}$ represents the image generated by the model, $y$ represents the target image, $N$ represents the number of pixels, and $\varepsilon$ is a small positive number commonly used to avoid numerical instability. $\mu_{\hat{y}}$ and $\mu_{y}$ represent the means, $\sigma_{\hat{y}}^2$ and $\sigma_{y}^2$ the variances, $\sigma_{\hat{y}y}$ the covariance, and $C_1$ and $C_2$ are constants used for stable calculation.
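The piecewise training schedule can be sketched as follows in PyTorch. The switch epoch, the SSIM weight `alpha`, and the single-window SSIM (standard SSIM uses local Gaussian windows) are illustrative assumptions, not values given in the text:

```python
import torch

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier loss: a smooth, outlier-robust variant of L1."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

def ssim_global(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified single-window SSIM over the whole image, for illustration."""
    mu_p, mu_t = pred.mean(), target.mean()
    var_p = pred.var(unbiased=False)
    var_t = target.var(unbiased=False)
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    return ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))

def piecewise_loss(pred, target, epoch, switch_epoch=50, alpha=0.5):
    """MSE in the early phase for fast convergence; Charbonnier + SSIM
    joint loss afterwards to recover detail. Hyperparameters are assumed."""
    if epoch < switch_epoch:
        return torch.nn.functional.mse_loss(pred, target)
    return charbonnier_loss(pred, target) + alpha * (1 - ssim_global(pred, target))
```

At the switch point the loss surface changes, so in practice one would typically also lower the learning rate; the exact schedule is a training detail left open by the text.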
5. Conclusions
In this paper, we propose an enhanced denoising Unet++ (EDUNet++) to improve the quality of images of ice-covered transmission lines. The algorithm consists of three crucial modules: a feature encoding and decoding module (FEADM), a shared source feature fusion module (SSFFM), and an error correction module (ECM). Specifically, a residual attention module (RAM) and a multilevel feature attention module (MFAM) are proposed within the FEADM. The RAM incorporates a cascaded residual structure and a hybrid attention mechanism, which effectively preserves the mapping of feature information. The MFAM uses dilated convolutions to obtain features at different levels and then weights them with SE attention, effectively combining local and global features and attending to different features at different levels. The SSFFM realizes the effective transmission of source features, enhancing feature fusion and the mutual mapping between features. The ECM corrects the errors between images. On our ice-covered transmission lines dataset, quantitative and qualitative experiments show that the PSNR and SSIM reached 29.765 dB and 0.968, respectively, and the visual results exhibit enhanced clarity in capturing finer details. This confirms the robustness and reliability of our algorithm in complex noise environments. The method surpasses traditional methods and other deep learning methods; it can effectively suppress complex noise in images, compensate for the loss of image detail and the signal distortion caused by noise, and restore smoother cable outlines and clear surface detail. This research provides a fresh outlook on the clarity and information recovery of ice-covered transmission line images and offers strong support for detecting the status of ice-covered transmission lines.
In future research, the EDUNet++ will be deployed on de-icing robots and tested in actual ice-covered transmission line environments to verify its scalability across a wider range of application scenarios. We believe that the results of this research will provide valuable reference and inspiration for the advanced detection of transmission line conditions.