1. Introduction
In the rapidly evolving landscape of technology, AI-driven autonomous systems like autonomous vehicles [1,2], underwater computer vision applications [3], video surveillance [4], person re-identification [5], biometric access control [6,7], and medical diagnostics are becoming indispensable. The effectiveness of these systems hinges significantly on the quality of the imagery they process. Clear images are crucial for optimal performance, yet real-world conditions often compromise image quality due to various factors, such as adverse weather, underwater environments, variable lighting, and camera motion. For instance, surveillance systems and medical imaging devices frequently capture images at lower resolutions; moving cameras introduce motion blur; underwater photography is plagued by color distortions and noise; and images captured in misty or rainy conditions are marred by blur and additional noise. These degradations can drastically hinder the ability of visual systems to perform tasks like detection, segmentation, and target tracking effectively. To counter these challenges, developing robust image restoration (IR) algorithms is essential. These algorithms enhance the adaptability of visual systems across diverse environments. IR, a critical yet complex inverse problem, involves recovering a high-quality image from one that has been degraded. This process is typically categorized into six principal types based on the nature of the degradation: super-resolution, noise reduction, blur removal, rain removal, haze removal, and color adjustment.
Image denoising is a critical and dynamic area of research within image processing that is integral to numerous applications across various fields such as satellite imaging, defense reconnaissance, biometric image improvement [8], industrial automation, and forensics, among others. In medical and biomedical imaging, such as COVID-19 imaging [9], denoising algorithms serve as essential pre-processing steps to eliminate noise types like speckle, Rician, and quantum noise [10]. These methods also reduce speckle in synthetic aperture radar images [11] and remove salt-and-pepper noise as well as additive white Gaussian noise in remote sensing [12]. In forensics, images can be corrupted by various kinds of noise, impacting the quality of evidence. Steganography techniques [13] can be used to hide data within images, further complicating the noise environment and the process of IR. Denoising techniques help suppress this noise [14]. Furthermore, denoising has applications in agriculture, such as filtering images of paddy leaves to detect rice plant diseases. Digital images often suffer distortions due to defective imaging equipment, problematic transmission channels, adverse atmospheric conditions, or environmental factors. These images can also be degraded during processes like encryption, compression, and format conversion [15]. Thus, the role of denoising extends to improving the integrity and usability of images in diverse and challenging conditions. Super-resolution techniques seek to generate a high-resolution image using one or multiple low-resolution images; deraining (along with dehazing and deblurring) focuses on removing impediments like raindrops, haze, or blur; and color correction is crucial for addressing distortions, particularly in underwater imaging.
Early image processing techniques relied on linear, non-linear, and non-adaptive filters as primary tools for noise reduction. Traditional denoising methods often utilized fundamental image filtering techniques such as Gaussian smoothing and diffusion-based approaches [16]. These methods generally function as low-pass filters, attenuating sudden intensity variations by replacing pixel values with an average or a similar statistical measure derived from neighboring pixels [17]. While these techniques effectively suppress high-intensity noise, they also tend to eliminate important details, leading to potential degradation in image quality [18]. To address this issue, adaptive filters, which assign weighted coefficients to the pixels within a specified window based on the fundamental principle of adaptive filtering [19], have been developed, including common types such as bilateral [20] and median filters [21]. Although these methods are computationally less demanding and yield reasonably good results, they also have significant drawbacks, including poor optimization during the test phase, the need for manual parameter settings, and reliance on specific denoising models.
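The trade-off described above, where neighbourhood averaging suppresses noise but also blurs detail, can be illustrated with a minimal one-dimensional sketch. The signal values, the window size of 3, and the helper name `sliding_filter` are illustrative assumptions and are not taken from the cited works.

```python
from statistics import mean, median


def sliding_filter(signal, window, reducer):
    """Replace each sample by reducer() of its neighbourhood (windows are clipped at the edges)."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        out.append(reducer(signal[lo:hi]))
    return out


# A step edge (0 -> 10) corrupted by a single impulse at index 2.
noisy = [0, 0, 100, 0, 0, 10, 10, 10, 10, 10]

mean_filtered = sliding_filter(noisy, 3, mean)      # low-pass: smears the impulse and the edge
median_filtered = sliding_filter(noisy, 3, median)  # rank filter: removes the impulse, keeps the edge

print(mean_filtered)
print(median_filtered)
```

In this toy case the mean filter spreads the impulse over its neighbours and softens the step, whereas the median filter restores the step exactly; on real images neither behaviour is guaranteed, and the window size must still be set by hand.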
Recent advances in deep learning (DL) have substantially improved the capabilities of IR technologies. DL-based models are versatile and can be designed to tackle multiple IR tasks simultaneously, benefiting from the substantial learning priors provided by deep neural networks (DNNs). Enhanced by training on diverse and comprehensive publicly available datasets, this multi-task capability not only streamlines the restoration process but also boosts the efficiency and effectiveness of autonomous systems across various applications. The flexibility of DL-based methods has shown potential in overcoming the limitations of traditional filters noted above. In particular, the ongoing challenges of digital image deterioration have consistently created a strong demand for restoration through advanced computing technologies. The advent of machine learning and DL has been crucial, as these disciplines involve a learner that acquires knowledge from data and subsequently applies this learned insight to new, unseen data, proving instrumental in enhancing image denoising techniques.
This study introduces a reinforced residual denoising network, R-REDNet, which is designed specifically to address the inherent complexity and variability of real-world image noise. Traditional approaches and existing DL models often suffer from limitations such as the inadequate preservation of fine image details, poor handling of spatially correlated noise, and residual noise amplification due to additive skip connections. To overcome these challenges, our proposed architecture integrates deeper convolutional layers to capture richer hierarchical features and introduces averaging-based skip connections instead of conventional additive methods. These enhancements not only improve feature extraction and reduce residual noise amplification but also result in a significant improvement in both quantitative metrics and visual quality. Thus, the proposed R-REDNet fills a critical gap by delivering robust and efficient denoising performance, which is essential for applications that rely heavily on image clarity, such as surveillance, medical imaging, and low-light photography. Our contributions are summarized as follows:
We develop a reinforced REDNet architecture, termed R-REDNet, that enhances feature representation by increasing the depth of the encoder and refining the decoder structure.
We introduce averaging-based skip connections to achieve a more balanced fusion of encoder and decoder features, reducing the amplification of residual noise.
We conduct extensive experiments on two real-world noisy image datasets to assess the generalizability of the proposed approach, demonstrating that it significantly outperforms state-of-the-art (SOTA) denoising methods in terms of PSNR and SSIM.
We show that the proposed model exhibits enhanced generalization across diverse noise patterns, making it a robust solution for real-world applications such as surveillance, medical imaging, and photography under challenging conditions.
The paper is organized as follows:
Section 2 provides background on residual neural networks (ResNet) and discusses the challenges in image denoising.
Section 3 reviews related work, emphasizing the limitations of existing denoising techniques and the motivation for the proposed approach.
Section 4 introduces the R-REDNet architecture, describing key modifications such as deeper convolutional layers and averaging-based skip connections.
Section 5 outlines the experimental setup, detailing the datasets, pre-processing steps, and evaluation metrics.
Section 6 provides a comparative analysis against existing denoising techniques. Finally,
Section 7 summarizes the paper’s key findings and outlines potential directions for future research.
2. Background
ResNets were developed to tackle the vanishing gradient problem, which is a frequent challenge in training very deep neural networks. This problem occurs because as gradients are backpropagated through layers, they tend to diminish, leading to very small updates to the weights and, ultimately, stagnating the training process.
ResNets tackle this issue by incorporating skip connections, also known as identity shortcuts, between layers. Skip connections are designed to connect the input of a layer directly to the output of a subsequent layer, bypassing one or more intermediate layers. This innovation significantly improves the training of DNNs by allowing the gradients to flow more easily through the network. The key idea behind ResNet is that instead of learning an unreferenced function $H(x)$, the network learns the residual function $F(x) = H(x) - x$, where $H(x)$ is the true underlying mapping that the network seeks to learn. This relation can then be rewritten as
$$H(x) = F(x) + x.$$
Here, $x$ represents the input to the layers, and $F(x)$ is the output of the stacked layers (i.e., the transformation the network learns). The identity mapping $x$ is added to the output of $F(x)$, which simplifies the learning process, making it easier for the network to optimize and train, especially as the network depth increases. Skip connections play a multi-faceted role in the success of ResNet architectures:
Mitigating the vanishing gradient problem: Skip connections help prevent the gradient from diminishing too much as it backpropagates through the network. This ensures that the weights of earlier layers are updated more effectively, enabling the training of very deep networks.
Enabling the training of deeper networks: Skip connections allow the network to learn identity mappings more easily. In scenarios where additional layers do not improve the performance, the network can effectively “skip” those layers by learning to output the identity function $H(x) = x$, thereby passing the input directly to the output. This adaptability enables the training of networks with hundreds or even thousands of layers without encountering the degradation issue, where deeper networks underperform compared to shallower ones.
Facilitating residual learning: The core idea behind residual learning is that it is often easier to learn the residual than the full mapping. In image denoising, for example, the network can focus on learning the noise pattern in the image, which is then subtracted from the noisy input to produce a clean image. The skip connection ensures that the original input is always available at later stages, making it easier for the network to learn the residual noise rather than reconstructing the entire image from scratch.
Improving gradient flow and network convergence: By creating multiple pathways for gradient propagation, skip connections improve the overall flow of gradients throughout the network. This leads to faster convergence during training and reduces the likelihood of getting stuck in poor local minima.
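The roles listed above can be illustrated with a small numerical sketch. It is a toy model under stated assumptions rather than an actual ResNet: the residual functions are hand-written stand-ins for stacked layers, and the per-block derivative of 0.1 and the depth of 20 are arbitrary illustrative values.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x for a vector x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    # Stand-in for stacked layers whose output has been driven to zero.
    return [0.0 for _ in x]


def small_residual(x):
    # Hypothetical learned correction: a small fraction of the input.
    return [-0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_residual)    # block reduces to the identity mapping
corrected_out = residual_block(x, small_residual)  # input plus a small learned correction


def gradient_through(depth, local_derivative, use_skip):
    """Product of per-block derivatives along a chain of scalar blocks."""
    # With a skip connection, d(F(x) + x)/dx = F'(x) + 1 instead of F'(x).
    factor = (1.0 + local_derivative) if use_skip else local_derivative
    return factor ** depth


plain_gradient = gradient_through(20, 0.1, use_skip=False)
skip_gradient = gradient_through(20, 0.1, use_skip=True)

print(identity_out, corrected_out)
print(plain_gradient, skip_gradient)
```

Without the shortcut, the gradient factor collapses towards zero after 20 blocks, while the identity term keeps it of order one in this toy chain. The same identity term can also make the product grow, which is one reason real networks pair skip connections with normalization and careful initialization.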
In the context of image denoising, ResNet-based models have been widely used because they can effectively capture both low-level details and high-level abstract features. Skip connections are particularly beneficial in denoising tasks as they allow the network to preserve important image details across layers. Since the noise is typically a small perturbation added to the original clean image, it is beneficial for the network to have access to the original image data at multiple stages of the network. Skip connections ensure that these original details are preserved and combined with the learned noise patterns at each residual block. Mathematically, if we denote the noisy image as $y$, the clean image as $x$, and the noise as $n$, we aim to estimate the noise $n$ such that $y = x + n$. The ResNet model learns a mapping $F(y)$ that approximates the noise $n$, and the clean image can be recovered by $\hat{x} = y - F(y)$.
Figure 1 illustrates a typical residual block, where the shortcut connection is depicted by the arch encircling the layers, representing the addition of the input directly to the output of the convolutional layers.
This architecture has proven highly effective in various image denoising tasks, as it allows the network to concentrate on modeling the noise pattern, leading to improved denoising performance and generalization across different types of noise and image datasets.
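The residual formulation for denoising, in which a noise estimate is subtracted from the noisy observation, can be sketched on a one-dimensional signal. The function `estimate_noise` is a hypothetical stand-in for the learned mapping (here simply the observation minus a moving average), and the flat clean signal, noise level, and window size are illustrative assumptions.

```python
import random


def moving_average(values, window=5):
    """Smooth a sequence with a centred window that is clipped at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        neighbourhood = values[max(0, i - half): i + half + 1]
        out.append(sum(neighbourhood) / len(neighbourhood))
    return out


def estimate_noise(y):
    """Hypothetical stand-in for the learned noise mapping: y minus a smoothed y."""
    smooth = moving_average(y)
    return [yi - si for yi, si in zip(y, smooth)]


def mse(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


random.seed(0)
x = [5.0] * 200                                 # clean signal (a flat region)
n = [random.gauss(0.0, 1.0) for _ in x]         # additive noise
y = [xi + ni for xi, ni in zip(x, n)]           # noisy observation: y = x + n

n_hat = estimate_noise(y)                       # estimated noise
x_hat = [yi - ni for yi, ni in zip(y, n_hat)]   # recovered signal: y minus estimated noise

print(mse(y, x), mse(x_hat, x))
```

A trained network replaces the hand-written estimator, but the final subtraction is the same; the better the noise estimate, the closer the recovered signal is to the clean one.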
The challenge of removing noise while preserving important image details has spurred considerable research, leading to the development of various techniques ranging from traditional filters to advanced DL models. Historically, image denoising techniques have included approaches such as Gaussian filtering, median filtering, and wavelet-based methods [22]. These methods, while effective in some scenarios, often struggle with maintaining image details and textures, especially in the presence of high levels of noise. Recent advancements in DL have revolutionized the field of image denoising. One such DL technique is the REDNet, which has demonstrated its effectiveness in denoising real-world images [23]. The REDNet architecture is a DNN that leverages residual learning to achieve SOTA performance in image denoising [24].
While REDNet has shown promising results, the performance of image denoising models can be further enhanced by incorporating additional refinement mechanisms. The primary limitation of REDNet, as observed in practice, is that the output quality can still be improved, particularly in terms of fine detail preservation and noise suppression in highly textured regions. To address this limitation, researchers have proposed integrating refinement networks with existing denoising models. These refinement networks act as post-processing units that fine-tune the outputs of primary denoising models. For instance, methods such as the residual dense network for image restoration [25] have demonstrated that adding a refinement step can significantly enhance the visual quality and reduce artifacts in the denoised images.
In this study, we propose an enhanced approach to image denoising by modifying the existing REDNet architecture to create R-REDNet. Our approach improves feature extraction and noise suppression by introducing deeper convolutional layers and replacing additive skip connections with averaging operations. These enhancements enable the model to achieve superior denoising performance.
3. Related Work
This section provides an analytical overview of recent CNN-based image denoising approaches. We categorize the existing literature into supervised and unsupervised methods, highlighting their contributions, limitations, and relevance to our proposed method.
3.1. Supervised-Based Image Denoising
Supervised learning approaches have dominated the image denoising landscape due to their strong performance on synthetic datasets and controlled environments. These models typically rely on paired noisy–clean data and are trained to minimize pixel-wise differences. A foundational example is DN-ResNet by Ren et al. [26], which uses residual blocks and edge-aware loss functions to boost perceptual quality. Their follow-up, DS-DN-ResNet, reduces computational cost using depthwise separable convolutions while maintaining performance across various noise types. Building on these designs, SRRNet [27] integrates multi-scale dual attention with a high-resolution reconstruction module, enhancing feature recovery in noisy images. Attention mechanisms have proven particularly effective in focusing the model on informative regions and have thus become common in recent architectures.
In addition to structural innovations, supervised methods have begun incorporating iterative refinement strategies. For instance, Pan et al. [28] introduce a correction mechanism for noise-level maps, leveraging guided feature domain residual networks to reduce artifacts. Meanwhile, TSIDNet [29] improves generalization through a dual-subnet structure, combining transfer learning with cross-fusion modules to handle diverse noise patterns. In the realm of low-light conditions, LIVENet [30] proposes a two-stage pipeline that replaces noisy luminance channels using a latent subspace and spatial feature transforms, thereby optimizing both enhancement and denoising simultaneously. While these supervised models show impressive results, they often struggle to generalize to real-world noisy data due to their dependence on clean–noisy pairs and assumptions about noise distribution.
3.2. Unsupervised-Based Image Denoising
Unsupervised denoising methods address the lack of clean target data, which is a significant constraint in real-world scenarios. These approaches generally adopt noise modeling, self-supervision, or zero-shot learning to train models without paired data. FADNet [31] uses a dynamic filtering mechanism with attention modules and Fourier transforms to adapt to different noise types. This adaptive strategy provides robustness to noise variations, which many fixed-architecture supervised methods lack. Complementing this, the clean-to-noisy framework [32] simulates realistic noise from clean data, making traditional CNNs more applicable to real-world settings. Generative methods have also evolved with self-collaboration strategies, as in Lin et al. [33], which iteratively improve synthetic noisy–clean pairs by refining weaker denoisers. This incremental learning approach addresses convergence issues commonly seen in GAN-based methods. Blind-spot network architectures have become prominent for self-supervised denoising. The Conditional Blind-Spot Network by Jang et al. [34] introduces spatial decorrelation while including central pixel information, which was previously excluded. Similarly, Zhang et al. [35] incorporate Transformer blocks for global–local feature fusion, enhancing noise discrimination capabilities. These approaches address a key issue in self-supervised methods: balancing context exploitation with noise suppression.
Noise estimation has also been tackled using non-deep learning techniques. NLH [36] proposes a pixel-level non-local prior using the Haar transform and Wiener filtering. Although less flexible than deep models, such approaches provide strong baselines and insight into interpretable noise reduction. AP-BSN [37] represents an evolution in blind-spot design, using asymmetric pixel-shuffle downsampling and random-replace refinement to counteract pixel-wise independence assumptions, which are often unrealistic in natural images. Furthermore, dataset-free and lightweight approaches have gained attention. Zero-shot Noise2Noise [38] uses a simple two-layer architecture for real-time denoising without any prior training or noise statistics. This minimalistic yet effective model is ideal for constrained environments. MBPDR3 [39] and MASH [40] extend blind-spot learning with multi-branch architectures and refinement via masking and shuffling techniques. These methods address spatially correlated noise, which is often overlooked in pixel-independent models, by incorporating contextual blending. Lastly, curvelet-based methods like CTuNLM [41] offer an alternative to CNNs by focusing on multi-scale frequency decomposition, outperforming several deep models in RGB denoising tasks. The introduction of Deep Curvelet-Net extends this approach with neural layers focused on the finest curvelet coefficients, bridging the gap between traditional and deep learning methods.
Despite their effectiveness, these existing approaches still encounter limitations related to the insufficient preservation of fine-grained image details, amplification of residual noise due to additive skip connections, and inadequate feature extraction depth. Therefore, our research specifically addresses these gaps by proposing R-REDNet, which integrates deeper convolutional layers and averaging-based skip connections. By doing so, we aim to provide substantial improvements in denoising performance, especially in complex real-world scenarios.
4. Proposed Architecture
The original REDNet model, which employs a combination of convolutional and transposed convolutional layers with additive skip connections, was designed to efficiently reconstruct images by learning residual representations. However, to improve its feature extraction capabilities and smooth the reconstructed output, we propose a modified version, termed R-REDNet, which introduces deeper convolutional layers in the encoder and replaces additive skip connections with averaging operations.
4.1. REDNet Architecture
The proposed model is based on a fully convolutional architecture, incorporating deconvolution layers to restore the image to its original resolution. Rectification layers are applied after each convolution and deconvolution step to enhance information propagation. The convolutional layers function as feature extractors, capturing the essential structural components of the image while suppressing unwanted noise. Once the noisy input image passes through these layers, a refined version is generated. However, some fine details may be lost in the process. To address this, deconvolution layers are employed to recover lost details and improve reconstruction accuracy. Additionally, skip connections are introduced between corresponding convolutional and deconvolutional layers. These connections facilitate the direct transfer of feature maps from the encoding path to the decoding path, where they are merged element-wise before undergoing rectification, ensuring better preservation of spatial details.
4.1.1. Convolution–Deconvolution Decoder
Recently, architectures that integrate convolutional and deconvolutional layers have been suggested for semantic segmentation tasks [23]. The convolution operation entails calculating the integral of the product between two functions: the input ($x$) and the kernel ($w$). The outcome, commonly known as the feature map or activation map ($s$), is calculated as follows:
$$s(t) = (x * w)(t) = \int x(a)\, w(t - a)\, da.$$
Unlike convolutional layers, which aggregate multiple input activations within a filter window to generate a single output activation, deconvolutional layers distribute a single input activation across multiple outputs. Typically, deconvolution serves as a learnable upsampling mechanism. In [23], convolutional layers incrementally down-sample the input image, condensing it into a compact representation, while deconvolutional layers up-sample this representation to reconstruct the original resolution. Pooling, which is common in high-level tasks such as segmentation or recognition, can discard fine image details and degrade restoration performance; the REDNet mitigates this issue. Alternatively, substituting deconvolution with convolution results in an architecture resembling recently introduced deep fully convolutional neural networks.
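The contrast between the two layer types can be made concrete with a discrete one-dimensional sketch. It follows the common CNN convention of not flipping the kernel (strictly a cross-correlation), and the kernel values and stride of 2 are illustrative assumptions rather than REDNet settings.

```python
def conv1d(x, w, stride=1):
    """Valid 1-D convolution, CNN style: each output aggregates len(w) inputs."""
    k = len(w)
    return [
        sum(x[i + j] * w[j] for j in range(k))
        for i in range(0, len(x) - k + 1, stride)
    ]


def conv_transpose1d(x, w, stride=1):
    """Transposed convolution: each input activation is spread over len(w) outputs."""
    k = len(w)
    out = [0.0] * ((len(x) - 1) * stride + k)
    for i, xi in enumerate(x):
        for j in range(k):
            out[i * stride + j] += xi * w[j]
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [0.5, 0.5]

encoded = conv1d(signal, kernel, stride=2)             # 6 samples -> 3 samples
decoded = conv_transpose1d(encoded, kernel, stride=2)  # 3 samples -> 6 samples

print(encoded)
print(decoded)
```

The decoded sequence has the original length but not the original values: the down-sampling step discarded detail that a fixed kernel cannot restore, which is the loss that learned deconvolution weights and skip connections are meant to compensate for.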
4.1.2. Skip Connections
Skip connections are introduced between corresponding convolutional and deconvolutional layers, as illustrated in Figure 1. These connections serve two main purposes. First, as the network deepens, image details may be lost, reducing the effectiveness of deconvolution in restoring them. Skip connections, however, transfer feature maps rich in image details, enhancing deconvolution’s ability to produce a cleaner version of the image. Second, skip connections also facilitate better gradient backpropagation to the lower layers, making it easier to train deeper networks. It is important to note that our skip connections differ from other approaches, where the focus was solely on optimization [23].
4.2. Proposed R-REDNet
In this work, we propose a modified version of the REDNet architecture, referred to as reinforced REDNet, which introduces a series of enhancements aimed at improving its capacity for complex feature extraction and more refined image restoration. The base REDNet architecture employs a series of convolutional layers for feature encoding, followed by transposed convolutional layers for decoding, utilizing additive skip connections to recover fine spatial details. The architecture of the proposed network is shown in Figure 2. The modifications made to the original REDNet were the result of extensive experimentation and testing. The input image size was 256 × 256 pixels.
4.2.1. Deeper Encoder
In the original REDNet, the encoder consisted of five convolutional layers with progressively reduced filter counts, starting from 256 and ending with 128 filters. To reinforce the model’s capacity for feature extraction, two additional convolutional layers were added, yielding a total of seven layers in the encoder. Specifically, two new layers (conv layer 6 and conv layer 7) were introduced after the fifth convolutional layer, with 128 and 64 filters, respectively. Increasing the encoder depth allows the network to progressively refine feature representations, capturing both low-level textures and high-level abstract patterns. By stacking additional layers, the model gains the ability to better separate noise from meaningful image features, making denoising more effective. The expanded depth also increases the receptive field, enabling neurons to analyze broader spatial dependencies and structural patterns within the image. This hierarchical learning process ensures that the encoder extracts more robust and noise-resilient feature maps, ultimately leading to improved reconstruction accuracy and enhanced visual clarity in denoised images. The deeper encoder also facilitates the extraction of higher-order spatial features, thereby improving the model’s performance in tasks such as image denoising or restoration.
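The effect of the two extra layers on the receptive field can be estimated with a short calculation. The 3 × 3 kernel and unit stride used here are assumptions for illustration; the text above does not state the kernel size or stride of the encoder layers.

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    """Receptive field (in pixels per axis) of a stack of identical, undilated convolutions."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump   # each layer widens the field by (K - 1) * current jump
        jump *= stride                   # striding increases the spacing between output samples
    return rf


rf_five = receptive_field(5)    # five-layer encoder
rf_seven = receptive_field(7)   # seven-layer encoder

print(rf_five, rf_seven)
```

Under these assumptions, each encoder output position sees an 11 × 11 input neighbourhood with five layers and a 15 × 15 neighbourhood with seven; strided layers would enlarge both figures further.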
4.2.2. Employing Modified Skip Connections
The REDNet architecture employs additive skip connections (via Add layers) between corresponding encoder and decoder layers to preserve spatial information during the upsampling process. In the reinforced REDNet, we propose an alternative to this approach by using averaging (Average layers) for the skip connections. The rationale behind this modification is to allow a more balanced blending of features between the encoder and decoder, potentially leading to smoother reconstructions.
Specifically, averaging is applied at two key points in the network:
- Between the output of the second convolutional layer (encoder) and the corresponding decoder layer (transposed convolution).
- To preserve the residual learning framework, the final skip connection, which adds the input image to the output of the decoder, remains intact. This residual connection helps the model predict image residuals and subsequently reconstruct the denoised or restored version by adding these residuals back to the input.
The average layer serves as a merging operation that combines multiple noisy feature maps by performing an element-wise average. This operation is especially valuable when integrating feature maps from different stages of a denoising network, ensuring that the spatial dimensions of the image are preserved while simultaneously smoothing out noise.
Replacing additive skip connections with averaging operations in R-REDNet is intended to reduce noise amplification and improve feature fusion. Additive skip connections directly sum feature maps, which can inadvertently amplify residual noise, especially in low-SNR (high-noise) scenarios. In contrast, averaging normalizes the combined feature maps, mitigating noise propagation and ensuring a more balanced feature representation. This approach also helps stabilize gradient flow, reducing the risk of vanishing or exploding gradients in deep networks. Since noise is often stochastic and uncorrelated across layers, averaging operations suppress high-frequency noise artifacts while retaining crucial structural details in images. In contrast to element-wise addition, which can amplify certain pixel values and potentially overemphasize noise, the average layer ensures a balanced contribution from all inputs, allowing for effective denoising without amplifying artifacts. This makes the average layer particularly useful in multi-path denoising architectures, as it facilitates the fusion of multiple noisy representations in a way that emphasizes the clean underlying signal, improving the overall quality and accuracy of the denoised image while supporting more stable training.
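The statistical argument above can be checked with a small simulation. It is a sketch under strong assumptions: both paths carry the same underlying feature value, the noise on each path is independent Gaussian, and the values 1.0 and 0.2 are arbitrary; none of these come from the trained network.

```python
import random
from statistics import mean, pstdev

random.seed(1)
N = 20000
signal = 1.0   # same underlying feature value on both paths (assumption)
sigma = 0.2    # per-path noise standard deviation (illustrative)

encoder_feat = [signal + random.gauss(0.0, sigma) for _ in range(N)]
decoder_feat = [signal + random.gauss(0.0, sigma) for _ in range(N)]

added = [e + d for e, d in zip(encoder_feat, decoder_feat)]             # additive skip
averaged = [(e + d) / 2.0 for e, d in zip(encoder_feat, decoder_feat)]  # averaging skip

for name, feat in (("single path", encoder_feat), ("add", added), ("average", averaged)):
    print(f"{name:12s} mean={mean(feat):.3f} std={pstdev(feat):.3f}")
```

With uncorrelated noise, the averaged map stays on the scale of its inputs and its noise standard deviation falls by roughly a factor of √2 relative to a single path, whereas addition doubles the activation scale. In this toy setting the two merges differ only by a constant factor of two, so the sketch illustrates the scale and variance of the merged activations and not the behaviour of the trained network.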
4.2.3. Mathematical Description of R-REDNet
The architecture of the modified REDNet, referred to as reinforced REDNet (R-REDNet), includes encoder and decoder layers with skip connections that use averaging. The detailed mathematical description is as follows:
Encoder layers: The encoder consists of seven convolutional layers $E_1, E_2, \ldots, E_7$. Let the input image be $X$. Each encoder layer applies a convolution operation followed by an activation function:
$$E_i = f(W_i * E_{i-1} + b_i), \quad i = 1, \ldots, 7,$$
where the input image is represented as $E_0 = X$. The weights and biases of the $i$-th layer are denoted by $W_i$ and $b_i$, respectively. The convolution operation is represented by $*$, while $f$ denotes the activation function, such as ReLU.
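Decoder layers: The decoder consists of five transposed convolutional layers $D_1$, $D_2$, …, $D_5$. These layers perform upsampling and reconstruction. For the $j$-th decoder layer, the operation is
$$D_j = g(W'_j \circledast D_{j-1} + b'_j), \quad j = 1, \ldots, 5,$$
where the weights and biases of the $j$-th decoder layer are represented by $W'_j$ and $b'_j$, respectively, and the decoder input is taken to be the final encoder output, $D_0 = E_7$. The transposed convolution operation is denoted by $\circledast$, while $g$ represents the activation function. The output image is denoted by $\hat{X}$.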
Skip connections: At certain layers, the encoder outputs are merged with the decoder outputs through skip connections that use averaging. Let $E_k$ be the output of encoder layer $k$ and $D_j$ be the output of decoder layer $j$. The averaging operation is defined as
$$A_{k,j} = \frac{E_k + D_j}{2}.$$
The decoder layer input is updated with this merged information:
$$D_{j+1} = g(W'_{j+1} \circledast A_{k,j} + b'_{j+1}).$$
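Final output: The output image $\hat{X}$ is produced after processing through the final decoder layer $D_5$, incorporating all relevant skip connections and averaging operations:
$$\hat{X} = X + D_5,$$
where the added input $X$ corresponds to the final residual skip connection, which remains additive.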
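The encoder, decoder, averaging skip connections, and final residual addition described in this subsection can be tied together in a toy forward pass. This is a minimal sketch, not the authors' implementation: each layer is replaced by a per-element affine map followed by ReLU so that shapes are preserved, the weights are arbitrary, and the encoder/decoder pairing in `skips` is hypothetical because the text does not list every averaged pair.

```python
def relu(v):
    return [max(0.0, vi) for vi in v]


def layer(v, weight, bias):
    """Toy stand-in for a (transposed) convolution + ReLU: per-element affine map."""
    return relu([weight * vi + bias for vi in v])


def average(a, b):
    """Averaging skip connection: element-wise mean of two feature maps."""
    return [(ai + bi) / 2.0 for ai, bi in zip(a, b)]


def r_rednet_forward(x, enc_params, dec_params, skips):
    """skips maps a decoder index j to the encoder index k averaged with decoder output j."""
    enc = [x]                                   # index 0 holds the input image
    for w, b in enc_params:                     # encoder layers 1..7
        enc.append(layer(enc[-1], w, b))

    h = enc[-1]                                 # decoder input: final encoder output
    for j, (w, b) in enumerate(dec_params, start=1):
        h = layer(h, w, b)                      # decoder layer j
        if j in skips:
            h = average(enc[skips[j]], h)       # merged map feeds decoder layer j + 1
    return [xi + hi for xi, hi in zip(x, h)]    # final additive residual connection


enc_params = [(1.0, 0.0)] * 7   # seven encoder layers (toy weights)
dec_params = [(1.0, 0.0)] * 5   # five decoder layers (toy weights)
skips = {2: 4, 4: 2}            # hypothetical pairing: decoder 2 with encoder 4, decoder 4 with encoder 2

x = [0.2, 0.5, 0.9]
output = r_rednet_forward(x, enc_params, dec_params, skips)
print(output)
```

With identity weights every intermediate map equals the input, so the averaging steps leave it unchanged and only the final residual addition alters the values; real layers would change the spatial size and channel count, which the toy per-element maps deliberately avoid.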
Complexity: While R-REDNet is computationally more expensive than standard shallow CNN-based denoising models due to its increased depth and hierarchical feature extraction, it offers better denoising performance with less noise amplification. The additional cost is justified by the model’s ability to preserve fine image details while reducing artifacts. However, for real-time applications, optimized implementations (e.g., quantization, pruning) may be required to balance computational efficiency and performance. The computational complexity of a stack of convolutional layers is given by
$$O(L \cdot K^2 \cdot C_{in} \cdot C_{out} \cdot H \cdot W),$$
where $L$ is the number of layers, $K$ is the kernel size, $C_{in}$ and $C_{out}$ represent input and output channels, and $H \times W$ denote the spatial dimensions of the feature map. The original REDNet consists of five convolutional layers in its encoder and five in its decoder, resulting in a total of $L_{\text{REDNet}} = 10$ layers. In contrast, R-REDNet introduces two additional convolutional layers in the encoder, making its total depth $L_{\text{R-REDNet}} = 12$.
Thus, assuming a comparable cost per layer, the complexity ratio between R-REDNet and REDNet can be expressed as
$$\frac{O(\text{R-REDNet})}{O(\text{REDNet})} = \frac{L_{\text{R-REDNet}}}{L_{\text{REDNet}}} = \frac{12}{10} = 1.2.$$
This indicates that R-REDNet has approximately 20% higher computational complexity than REDNet due to the increased depth. However, the introduction of averaging operations instead of additive skip connections has negligible computational overhead, since element-wise averaging involves only one addition and one division per element. Despite the increased complexity, R-REDNet improves feature extraction, reduces noise amplification, and achieves better denoising performance compared to REDNet.
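The depth-based estimate above can be reproduced with a few lines of arithmetic. The sketch assumes that every layer has the same kernel size, channel counts, and feature-map size; the 3 × 3 kernel and 128 channels are illustrative assumptions, while the 256 × 256 spatial size follows the stated input size.

```python
def conv_layer_cost(kernel_size, c_in, c_out, height, width):
    """Multiply-accumulate count of one convolutional layer: K^2 * C_in * C_out * H * W."""
    return kernel_size ** 2 * c_in * c_out * height * width


def network_cost(num_layers, kernel_size, c_in, c_out, height, width):
    """Cost of L layers under the simplifying assumption that all layers are identical."""
    return num_layers * conv_layer_cost(kernel_size, c_in, c_out, height, width)


# Illustrative values only; the kernel size and channel counts are assumptions.
shape = dict(kernel_size=3, c_in=128, c_out=128, height=256, width=256)

rednet_cost = network_cost(10, **shape)     # 5 encoder + 5 decoder layers
r_rednet_cost = network_cost(12, **shape)   # 7 encoder + 5 decoder layers
ratio = r_rednet_cost / rednet_cost

print(ratio)
```

Because the per-layer term cancels, the ratio depends only on the layer counts under this assumption. Since the actual layers use different filter counts, the true overhead may deviate from 20%, and measured parameter counts and training times are the more reliable guide.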
Table 1 provides a detailed summary of the hyperparameters for each layer in both the encoder and decoder. While the REDNet base and R-REDNet models share a similar architectural foundation, they differ in training time, parameter count, and overall model size, as summarized in
Table 1. R-REDNet introduces a slight increase in complexity with a marginally larger parameter count and extended training time. However, this modest increase results in enhanced denoising efficiency, making R-REDNet a more effective choice for noise reduction. The model achieves a well-balanced trade-off between size and performance, offering improved denoising capabilities without a substantial rise in computational cost.
5. Experimental Results
This section presents the evaluation framework used to assess the proposed method. It includes a detailed description of the metrics utilized to measure performance, providing insight into the effectiveness and reliability of the approach. Additionally,
Section 5.2 outlines the characteristics of the data employed in the experiments, highlighting their relevance to the study. Finally,
Section 5.4 analyzes the findings, comparing them with baseline methods and discussing their implications in the context of the research objectives.
5.1. Metrics
As evaluation metrics, PSNR [
42] and SSIM [
43] are employed to assess the effectiveness of the proposed system, as defined in Equations (5) and (6). PSNR quantifies the peak signal-to-noise ratio between the original and reconstructed images, with higher values indicating better reconstruction quality. In contrast, SSIM evaluates the perceptual similarity between the original and reconstructed (or denoised) image. An SSIM value of 1 indicates that the original and reconstructed images are identical, while values below 1 indicate increasing dissimilarity between the two images.
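For readers who want to reproduce the PSNR measurement, the sketch below assumes the standard definition, PSNR = 10 log10(MAX² / MSE), applied to images normalized to [0, 1]. The function names and the flat-list pixel layout are illustrative choices, not taken from the paper.

```python
import math
from typing import Sequence


def mse(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def psnr(reference: Sequence[float], estimate: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(reference, estimate)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


# Toy example: every pixel is off by 0.1, so the MSE is about 0.01.
clean = [0.0, 0.5, 1.0, 0.5]
degraded = [0.1, 0.4, 0.9, 0.6]
score = psnr(clean, degraded)  # about 20 dB
```

For 8-bit images that have not been normalized, `max_value` would be 255 instead of 1.0.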
SSIM evaluates the similarity between two images, which is crucial for image denoising as it helps assess how effectively the denoising method has reconstructed the clean image from the noisy input. Unlike pixel-wise metrics such as MSE, SSIM is computed over windows (groups of pixels) that carry structural information about the object, which makes it a more meaningful and comprehensive comparison for image denoising tasks. Given two windows, $X$ and $Y$, each of size $M \times M$, SSIM is computed as follows:
$$\mathrm{SSIM}(X, Y) = \frac{(2\mu_X \mu_Y + C_1)(2\sigma_{XY} + C_2)}{(\mu_X^2 + \mu_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)},$$
where $\mu_X$, $\mu_Y$, $\sigma_X^2$, $\sigma_Y^2$, and $\sigma_{XY}$ denote the mean of $X$, the mean of $Y$, the variance of $X$, the variance of $Y$, and the covariance between $X$ and $Y$, respectively. The constants $C_1$ and $C_2$ are introduced to maintain numerical stability when the denominator is too small or approaches zero, with default values $C_1 = (0.01L)^2$ and $C_2 = (0.03L)^2$. Here, $L$ represents the dynamic range of pixel values. The SSIM score ranges from −1 to 1, where 1 indicates perfect similarity and 0 signifies no similarity. Therefore, a higher SSIM value, approaching 1, reflects better performance of the denoising technique.
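The single-window SSIM computation can be sketched in a few lines of Python. This is a simplified illustration, not the evaluation code used in the paper: it assumes the commonly used constants C1 = (0.01 L)² and C2 = (0.03 L)², uses population (divide-by-n) statistics, and scores one window only, whereas practical implementations usually average the score over many sliding, often Gaussian-weighted, windows.

```python
from typing import Sequence


def ssim_window(x: Sequence[float], y: Sequence[float],
                dynamic_range: float = 1.0) -> float:
    """SSIM of two flattened windows of equal size."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("windows must be non-empty and of equal size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n

    # Stabilizing constants (assumed defaults, see lead-in).
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# A flattened 2 x 2 window, compared with itself and with its inverse.
window = [0.1, 0.9, 0.1, 0.9]
inverted = [1.0 - v for v in window]

same_score = ssim_window(window, window)        # close to 1
inverted_score = ssim_window(window, inverted)  # negative: anti-correlated
```

The inverted window shows how negative scores arise: the two windows have the same mean and variance but a negative covariance.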
5.2. Datasets
To assess the generalizability of the proposed method, tests were conducted on two datasets instead of one. These datasets are widely used for denoising real images, addressing challenges that extend beyond additive white Gaussian noise to tackle the more complex task of real-world noisy image denoising. The datasets utilized are outlined below:
Dataset 1: The dataset introduced by Xu et al. [
44] is a benchmark dataset specifically designed to capture the complexity of real-world noise across diverse natural scenes. This dataset consists of noisy images captured using five different camera models from three leading brands: Canon EOS (models 5D Mark II, 80D, and 600D), Nikon D800, and Sony A7 II, ensuring a broad representation of sensor characteristics and noise patterns. The dataset includes 40 unique scenes, covering a variety of indoor environments and objects. From these scenes, 100 image patches of size 512 × 512 pixels were cropped, providing a standardized set for evaluation. The dataset offers a diverse collection of real-world noise types, making it an essential resource for developing and benchmarking denoising algorithms under practical conditions.
Figure 3 showcases sample images from this dataset, demonstrating the presence of complex noise patterns across different lighting conditions and scene compositions.
Dataset 2: The dataset introduced by Nam et al. [
45] provides a comprehensive collection of noisy images captured from 11 static scenes using three different camera models: the Canon 5D Mark III, Nikon D600, and Nikon D800. For each scene, 500 JPEG images were taken under consistent settings, and an averaged image was generated to serve as the noise-free ground truth reference. This averaging technique effectively minimizes noise, providing a reliable benchmark for evaluating denoising performance. To facilitate comparative analysis, 15 regions of size 512 × 512 pixels were extracted from the captured scenes, offering a standardized evaluation set for testing denoising algorithms.
Figure 4 illustrates sample images from this dataset, showcasing the diversity of scene content and noise characteristics.
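The ground-truth construction used for Dataset 2 (averaging 500 captures of a static scene) can be illustrated with a small simulation. The sketch below is only an intuition aid: it assumes zero-mean Gaussian noise with a hypothetical standard deviation of 0.05 and a four-pixel "scene", whereas real sensor noise is more complex.

```python
import random
from typing import List, Sequence


def average_captures(captures: Sequence[Sequence[float]]) -> List[float]:
    """Pixel-wise mean over repeated captures of the same static scene."""
    if not captures:
        raise ValueError("at least one capture is required")
    n = len(captures)
    return [sum(pixel_values) / n for pixel_values in zip(*captures)]


rng = random.Random(0)             # fixed seed for reproducibility
scene = [0.2, 0.5, 0.8, 0.4]       # hypothetical clean pixel values
noise_std = 0.05                   # hypothetical noise level

# Simulate 500 noisy captures of the same scene.
captures = [[p + rng.gauss(0.0, noise_std) for p in scene]
            for _ in range(500)]

reference = average_captures(captures)
worst_error = max(abs(r - p) for r, p in zip(reference, scene))
print(f"largest deviation of the averaged image: {worst_error:.4f}")
```

For independent zero-mean noise, the standard deviation of the average shrinks by a factor of √500 ≈ 22, which is why the averaged image can serve as a near noise-free reference.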
5.3. Implementation Details
Hardware configuration: The experiments were conducted on the Kaggle cloud-based computing platform using a dual NVIDIA Tesla P100/T4 GPU setup. The system configuration includes four vCPUs and 16 GB of RAM, ensuring efficient model training and evaluation. The DL model was implemented using TensorFlow and the Keras API, leveraging GPU acceleration for faster processing. Model training and inference were conducted using the CUDA and cuDNN-optimized TensorFlow environment, significantly reducing computation time.
Experimental summary: This study employs a deep convolutional autoencoder with skip connections for real-world image denoising. Dataset 1, used for training, comprises 100 images, which are divided into 60 for training, 30 for validation, and 10 for testing. The training images were resized to 256 × 256 pixels and normalized within the [0, 1] range.
Table 2 summarizes the selected hyperparameters.
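The 60/30/10 partition and the [0, 1] normalization described above can be sketched as follows. The paper does not state how the images were assigned to each subset, so the seeded random shuffle here is an assumption, as are the function names and the 8-bit input range.

```python
import random
from typing import List, Sequence, Tuple


def split_indices(n_items: int, n_train: int, n_val: int, n_test: int,
                  seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle item indices and cut them into train/validation/test subsets."""
    if n_train + n_val + n_test != n_items:
        raise ValueError("subset sizes must sum to the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumed: random assignment
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


def normalize_8bit(pixels: Sequence[int]) -> List[float]:
    """Map 8-bit intensities (0..255) to the [0, 1] range."""
    return [p / 255.0 for p in pixels]


train_ids, val_ids, test_ids = split_indices(100, 60, 30, 10)
normalized = normalize_8bit([0, 128, 255])
```

Fixing the seed keeps the split reproducible across runs, which matters when only 10 images are held out for testing.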
5.4. Result Discussion
To better understand the impact of the individual components within the proposed R-REDNet architecture, an ablation study was conducted and presented in
Table 3. The goal of this study was to evaluate the contribution of key modifications, such as the introduction of deeper convolutional layers in the encoder and the replacement of additive skip connections with averaging operations, to the overall denoising performance. The baseline REDNet model achieves a PSNR of 38.93 dB and an SSIM of 0.9846.
5.4.1. Effect of Deeper Encoder
The encoder depth was increased by adding two additional convolutional layers to the original REDNet architecture. This modification aims to enhance feature extraction by capturing finer details and deeper spatial information. To evaluate the impact of this change, we conducted experiments comparing the original REDNet and the proposed R-REDNet, excluding the additional convolutional layers. The results, summarized in
Table 4, show a significant improvement in both PSNR and SSIM, highlighting the crucial role of the deeper encoder in noise reduction and preserving fine image details. Increasing the encoder depth from five to seven convolutional layers results in a PSNR improvement of +2.32 dB and an SSIM increase of 0.0035, confirming that a deeper encoder improves the model’s ability to capture complex noise structures while maintaining fine details.
5.4.2. Effect of Averaging-Based Skip Connections
Introducing iterative refinement improves PSNR by an additional +1.23 dB, bringing the final R-REDNet performance to 44.01 dB. This confirms that iterative refinement progressively suppresses residual noise, improving overall image quality without excessive smoothing. The results validate that each proposed modification significantly enhances denoising performance with the final R-REDNet outperforming the baseline REDNet by 5.08 dB in PSNR and 0.0085 in SSIM.
To highlight the effectiveness of the proposed R-REDNet model,
Figure 5 and
Figure 6 present a qualitative comparison between input images degraded by Gaussian noise (mean = 0, standard deviation = 0.1) and their corresponding denoised outputs for Dataset 1 and Dataset 2, respectively. The visual results demonstrate that R-REDNet effectively removes noise while preserving fine image details and textures. Compared to the original noisy images, the denoised outputs exhibit significant improvements in clarity with reduced noise artifacts and enhanced structural consistency. These observations align with the quantitative performance gains in PSNR and SSIM, confirming that R-REDNet not only achieves superior numerical results but also produces visually appealing restorations. The ability to generalize across different noise levels and image textures makes R-REDNet a robust solution for real-world denoising applications.
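The degradation used for the qualitative comparison (zero-mean Gaussian noise with a standard deviation of 0.1 on normalized images) can be sketched as below. Clipping the result back to [0, 1] and the function name `add_gaussian_noise` are assumptions; the paper does not describe how out-of-range values were handled.

```python
import random
import statistics
from typing import List, Optional, Sequence


def add_gaussian_noise(pixels: Sequence[float], mean: float = 0.0,
                       std: float = 0.1,
                       seed: Optional[int] = None) -> List[float]:
    """Add Gaussian noise to normalized pixels and clip to [0, 1]."""
    rng = random.Random(seed)
    return [min(1.0, max(0.0, p + rng.gauss(mean, std))) for p in pixels]


# A flat mid-gray patch makes the injected noise level easy to verify.
flat_patch = [0.5] * 1000
noisy_patch = add_gaussian_noise(flat_patch, seed=1)
measured_std = statistics.pstdev(noisy_patch)
print(f"measured noise standard deviation: {measured_std:.3f}")
```

On a mid-gray patch the clipping almost never triggers, so the measured spread stays close to the requested 0.1. Near black or white regions, clipping would reduce it.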
6. Comparison with the SOTA
To assess the performance of the proposed R-REDNet model, we carried out a comparative analysis against SOTA image denoising techniques on two real-world noisy image datasets. The study includes traditional denoising approaches such as trainable non-linear reaction diffusion (TNRD) [
46], noise clinic (NC) [
47], multi-channel weighted nuclear norm minimization (MCWNNM) [
48], and neat image (NI) [
49], along with DL-based models like DnCNN [
50] and REDNet. The effectiveness of these models was evaluated using PSNR and SSIM, which are two widely used metrics for measuring image restoration quality.
The results, as summarized in
Table 4 and
Table 5, demonstrate that DL-based approaches significantly outperform classical denoising techniques. Among classical methods, MCWNNM achieved the best performance, with a PSNR of 38.51 dB and an SSIM of 0.9671 on Dataset 1 and a PSNR of 37.71 dB with an SSIM of 0.9542 on Dataset 2. However, these values remain considerably lower than those achieved by DL-based models, highlighting the limitations of traditional denoising approaches in handling real-world noise complexities. Notably, the DnCNN model exhibited the weakest performance among CNN-based methods, achieving the lowest PSNR and SSIM values in both datasets. This suggests that simple deep networks without specific architectural enhancements struggle to generalize to real-world noise.
Figure 7 presents a comparative analysis of PSNR values across Dataset 1 and Dataset 2, allowing for a direct performance comparison of each denoising method. The results indicate that R-REDNet consistently outperforms all competing methods in both datasets with significantly higher PSNR values. While classical methods such as MCWNNM perform relatively well, their performance remains lower than that of DL-based methods. Furthermore, the REDNet model shows clear improvements over traditional CNN-based models like DnCNN, but R-REDNet achieves the best results, demonstrating the effectiveness of its reinforced residual connections and iterative refinement.
Similarly,
Figure 8 compares the SSIM values for both datasets, further reinforcing the superior performance of R-REDNet. The figure highlights that DL-based methods yield significantly higher SSIM values compared to traditional approaches with R-REDNet achieving near-perfect SSIM scores on Dataset 2. Interestingly, while REDNet already shows strong structural similarity preservation, R-REDNet further improves upon it, indicating that the proposed enhanced feature extraction and noise suppression strategies contribute to preserving finer image details.
REDNet, which incorporates residual learning, demonstrated substantial improvements over both classical methods and DnCNN, achieving a PSNR of 38.93 dB and an SSIM of 0.9846 on Dataset 1 and a PSNR of 42.59 dB with an SSIM of 0.9954 on Dataset 2. This confirms the effectiveness of skip connections and deep encoder–decoder architectures in preserving image details while reducing noise. However, the proposed R-REDNet model further refines this approach, achieving the highest PSNR and SSIM values across both datasets. Specifically, R-REDNet attained a PSNR of 44.01 dB and an SSIM of 0.9931 on Dataset 1, outperforming REDNet by +5.08 dB in PSNR and +0.0085 in SSIM. On Dataset 2, R-REDNet achieved a PSNR of 46.15 dB and an SSIM of 0.9955, representing an improvement of +3.56 dB in PSNR and +0.0001 in SSIM compared to REDNet; the smaller margin reflects the already strong REDNet baseline on this dataset. These results indicate that the integration of deeper convolutional layers, averaging-based skip connections, and iterative refinement mechanisms significantly enhances the model’s ability to suppress noise while preserving fine details.
The superiority of R-REDNet is further highlighted by its performance in Dataset 2, where real-world noise variations pose a greater challenge. Compared to MCWNNM, the best-performing classical method, R-REDNet demonstrated a substantial improvement of +5.5 dB in PSNR on Dataset 1 and +8.44 dB on Dataset 2, along with SSIM gains of +0.026 and +0.041, respectively. These improvements validate the robustness of R-REDNet in handling complex, non-Gaussian noise patterns more effectively than traditional methods. Furthermore, the performance gap between REDNet and R-REDNet underscores the importance of reinforcement mechanisms in CNN architectures, as the proposed enhancements allow for better feature extraction, more stable gradient propagation, and improved image reconstruction.
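The reported gains follow directly from the PSNR and SSIM values quoted in the text, as the short check below shows. The dictionary layout and the helper name `gain` are illustrative; only the numbers come from the paper.

```python
from typing import Dict, Tuple

# (PSNR in dB, SSIM) as quoted in the text for each dataset.
REPORTED: Dict[str, Dict[str, Tuple[float, float]]] = {
    "Dataset 1": {
        "MCWNNM": (38.51, 0.9671),
        "REDNet": (38.93, 0.9846),
        "R-REDNet": (44.01, 0.9931),
    },
    "Dataset 2": {
        "MCWNNM": (37.71, 0.9542),
        "REDNet": (42.59, 0.9954),
        "R-REDNet": (46.15, 0.9955),
    },
}


def gain(dataset: str, baseline: str,
         proposed: str = "R-REDNet") -> Tuple[float, float]:
    """PSNR and SSIM differences (proposed minus baseline), rounded."""
    base_psnr, base_ssim = REPORTED[dataset][baseline]
    prop_psnr, prop_ssim = REPORTED[dataset][proposed]
    return round(prop_psnr - base_psnr, 2), round(prop_ssim - base_ssim, 4)


for dataset in REPORTED:
    for baseline in ("MCWNNM", "REDNet"):
        d_psnr, d_ssim = gain(dataset, baseline)
        print(f"{dataset}, vs {baseline}: +{d_psnr} dB PSNR, +{d_ssim} SSIM")
```

The check reproduces the +5.08 dB and +3.56 dB gains over REDNet and the +5.5 dB and +8.44 dB gains over MCWNNM.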
The experimental findings indicate that R-REDNet achieves superior performance compared to existing denoising methods in both PSNR and SSIM. Its strong generalization across various noise patterns and datasets underscores its suitability for real-world applications requiring reliable image restoration. By incorporating deeper encoder layers, an averaging-based skip connection strategy, and iterative refinement, R-REDNet effectively mitigates the limitations of previous models and achieves SOTA denoising performance.
The statistical boxplots for both Dataset 1 (
Figure 9a) and Dataset 2 (
Figure 9b) distinctly illustrate substantial differences between noised and denoised images, strongly validating the effectiveness of the applied denoising method. In Dataset 1, the median (Q2) values for the mean intensities across all RGB channels in denoised images were noticeably higher and more stable compared to those of noised images. This clearly indicates that denoising effectively restores original brightness levels, which were previously obscured by noise. Additionally, the interquartile range (IQR)—defined by the first quartile (Q1) and third quartile (Q3)—was considerably narrower for denoised images across all color channels, signifying a significant reduction in pixel intensity variability. The tighter grouping of minimum and maximum intensity values further emphasizes the efficacy of noise suppression. Moreover, the frequency and magnitude of outliers were substantially reduced post-denoising, highlighting the successful removal of random artifacts.
The standard deviation and variance metrics reinforce these observations. Across both datasets, median values (Q2) for standard deviation and variance were consistently lower in denoised images for all RGB channels, providing strong quantitative support for denoising efficacy. The narrowed IQR and decreased min–max ranges in denoised images suggest a significant reduction in random pixel intensity fluctuations, contributing to enhanced homogeneity. Conversely, noised images displayed greater dispersion and numerous outliers across all channels, reflecting the presence of significant noise-induced randomness.
Dataset 2 further strengthens these findings, demonstrating even more pronounced improvements. Denoised images exhibited consistently lower median (Q2) values for both standard deviation and variance along with notably narrower IQRs and reduced ranges (min–max). This enhanced consistency across statistical parameters in Dataset 2 highlights the robustness and general applicability of the denoising approach across diverse image scenarios.
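The quantities read off the boxplots (Q1, median, Q3, IQR, minimum, and maximum) can be computed with the Python standard library. In the sketch below, the per-image standard deviation values are hypothetical numbers chosen for illustration, not measurements from either dataset, and the "inclusive" quantile method is one of several conventions that plotting libraries may use.

```python
import statistics
from typing import Dict, Sequence


def five_number_summary(values: Sequence[float]) -> Dict[str, float]:
    """Quartiles, extremes, and interquartile range of a sample."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,
    }


# Hypothetical per-image standard deviations for one color channel.
noisy_std = [0.21, 0.25, 0.19, 0.30, 0.23]
denoised_std = [0.18, 0.19, 0.17, 0.20, 0.18]

noisy_summary = five_number_summary(noisy_std)
denoised_summary = five_number_summary(denoised_std)

print("noisy   :", noisy_summary)
print("denoised:", denoised_summary)
```

A lower median together with a narrower IQR for the denoised sample is the pattern the boxplots are described as showing.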
7. Conclusions
In this study, we present R-REDNet, an upgraded version of the original REDNet model, tailored specifically for restoring images affected by real-world noise. The model incorporates deeper convolutional layers in the encoder and replaces additive skip connections with averaging operations, enhancing feature extraction and leading to smoother reconstructions. Our empirical evaluations demonstrate that these adjustments significantly improve the model’s ability to manage the unpredictable and non-Gaussian nature of real noise commonly found in practical scenarios. R-REDNet provides a robust solution for image enhancement, effectively balancing noise removal with the preservation of fine details, making it well suited to various real-world applications, including low-light photography, medical imaging, and surveillance. However, R-REDNet slightly increases computational complexity and memory requirements, creating a trade-off between resource usage and denoising performance. Although this can slightly elevate real-time processing demands, the model achieves superior image fidelity, making it suitable when denoising quality outweighs efficiency concerns. To address these limitations, future research will focus on model optimization for real-time deployment, adaptive learning techniques, such as reinforcement learning, to enhance generalization, and expanding the dataset to improve robustness across diverse noise patterns. Furthermore, future research could explore extending R-REDNet to video denoising by incorporating temporal consistency mechanisms to effectively capture correlations between consecutive frames. This could involve integrating recurrent connections or spatiotemporal convolutional layers to enhance temporal coherence while preserving fine details. Additionally, hybrid approaches that combine R-REDNet with Transformers or self-attention mechanisms could be investigated. Such models could leverage the global contextual modeling capabilities of attention-based architectures while maintaining the efficiency of R-REDNet, potentially leading to improved performance in both image and video denoising tasks.