2.1. Existing Approaches for Defocus Deblurring
Traditional defocus deblurring methods first estimate the defocus map of the defocused image [2,3,4] and then use that map to perform non-blind deconvolution [5,6,7,8,9] to recover a sharp image. Because the performance of these approaches depends entirely on the accuracy of the defocus map, considerable effort has been devoted to improving it [10,11,12]. In addition, Andres et al. [13] proposed a method that estimates spatially varying defocus blur, training on local frequency image features with regression tree fields to obtain a coherent defocus blur map. Liu et al. [14] proposed estimating the parameters of generalized Gaussian kernels from image patches for non-blind image deblurring. Goilkar et al. [15] applied blind and non-blind deconvolution techniques to restore defocused images. Chan et al. [16] used blind deconvolution to estimate the blur kernel and then applied total variation minimization to recover the background and fuse it with the foreground to obtain a sharp image. However, the aforementioned methods not only require intensive computation but also produce unsatisfactory results owing to inaccurate defocus maps.
In recent years, with the rapid development of deep learning, significant breakthroughs have been made in defocused image restoration. Abuolaim et al. [17] proposed DPDD, the first public dataset for training and validating an end-to-end deep learning framework for defocus deblurring. Lee et al. [18] proposed a convolutional neural network for defocus map estimation and defocus deblurring. Lee et al. [19] proposed an end-to-end approach equipped with an Interactive Filter Adaptive Network (IFAN) and created a dataset named RealDOF (Real Depth of Field); rather than predicting pixel values directly, IFAN generates spatially adaptive per-pixel deblurring filters, which are then applied to features extracted from a blurred image to produce deblurred features. Abuolaim et al. [20] proposed a single-encoder multi-decoder deep neural network for single image deblurring that incorporates the two sub-aperture views into a multi-task framework; they found that jointly learning to predict the two DP (dual-pixel) views from a single blurry image improves the network's ability to deblur. Zhao et al. [21] implemented an adversarial promoting learning framework to jointly handle defocus detection and defocus deblurring. Quan et al. [22] likewise proposed a method for single image defocus deblurring (SIDD).
Zhang et al. [23] proposed a dual network with two subnets for estimating depth and defocus from a single image. Anwar et al. [24] estimated depth with cascaded convolutional and fully connected neural networks and then used the depth to recover sharp images. Karaali et al. [25] put forward a deep convolutional neural network to estimate defocus blur from a single defocused image. Yang et al. [26] first estimated a blur kernel and performed deblurring in the Fourier domain; their method then reuses the blur kernel in a simple convolution for reblurring. Quan et al. [27] proposed a learnable recursive kernel representation that provides a compact yet effective, physics-encoded parametrization of the spatially varying defocus blurring process, together with a physics-driven, efficient deep model with a cross-scale fusion structure. Li et al. [28] proposed GRL (Global, Regional, and Local), a network for image restoration based on anchored stripe self-attention, window self-attention, and channel attention. Ye et al. [29] introduced an approach that first estimates the defocus map of the scene and then learns a direct mapping from the blurry image to the sharp image guided by the estimated map.
Zhao et al. [30] put forward a focused area detection attack (FADA) that forces the focused area to be reblurred, together with a defocused region detection attack that guides realistically blurred regions to be deblurred while training the deblurring network. Ali et al. [31] introduced two encoder–decoder sub-networks fed with the blurry image and the estimated blur map, respectively; the method works well when combined with a variety of blur estimation techniques. Zhang et al. [32] proposed a dynamic multi-scale network for dual-pixel image defocus deblurring, whose encoder is composed of several vision transformer blocks and whose reconstruction module consists of several Dynamic Multi-scale Sub-reconstruction Modules. Saqib et al. [33] proposed a new Deep Neural Network (DNN) architecture in which depth estimation and image deblurring share the same encoder, achieving good results. Nazir et al. [34] created the Indoor Depth from Defocus (iDFD) dataset, which contains naturally defocused images, all-in-focus (AiF) images, and dense depth maps of indoor environments. Mazilu et al. [35] used implicit and explicit regularization techniques to train an autoencoder that enforces linearity relations among the representations of different blur levels in the latent space.
Zhao et al. [36] proposed a defocus blur detection (DBD) method based on adaptive cross-level feature fusion and refinement, which both captures the complementary information of cross-level features and refines it. Zhang et al. [37] reviewed common causes of image blur, benchmark datasets, performance metrics, and deep-learning-based image deblurring approaches. Chai et al. [38] proposed a hybrid CNN–Transformer architecture based on complementary residual learning for defocus blur detection. Fernando et al. [39] created a dataset and trained MobileNetV2 to classify image patches into one of 20 levels of blurriness; the result was then refined with an iterative weighted guided filter, yielding good results in adaptive image enhancement and defocus magnification. Zhang et al. [40] proposed a novel self-supervision training objective that enhances training consistency and stability, together with a hard mining strategy to accelerate the defocus blur detection model. Lin et al. [41] presented an iterative feedback framework for estimating depth maps and all-in-focus (AiF) images simultaneously.
Quan et al. [42] put forward a pixel-wise Gaussian kernel mixture (GKM) model for representing spatially variant defocus blur kernels and designed GKMNet, a lightweight scale-recurrent architecture with a scale-recurrent attention module that estimates the GKM mixing coefficients for defocus deblurring. Zhang et al. [43] proposed an efficient Multi-Refinement Network (MRNet) for dual-pixel image defocus deblurring, whose core components are an alignment module and a reconstruction module. Jung et al. [44] proposed a disparity probability volume module to predict pixel-wise disparity probabilities, which are then incorporated into the deblurring network to address defocus deblurring from dual-pixel image pairs. Zhai et al. [45] proposed a monocular depth estimation network to obtain a depth map and then used the map to guide the network for defocus deblurring. To further improve deblurring results, Ma et al. [46] introduced a network for single image defocus deblurring that uses defocus map estimation as an auxiliary task. Although these methods have achieved promising results in image defocus deblurring, they fail when the defocused image contains large blurry regions.
2.2. Defocus Deblurring Datasets
Extensive work suggests that high-quality, large-scale datasets are crucial for training an optimal model based on generative adversarial networks, so this section introduces the relevant datasets. Although several datasets are publicly available for defocus deblurring research, such as DED [46], LFDOF [47], CUHK [48], RealDOF, SDD (Single-Image Defocus Deblurring) [49], DPDD, and PixelDP, DPDD is the only extensively adopted real-world training dataset. The DED dataset contains 1012 training pairs and 100 test pairs; because defocus maps are used to train the network, the training set also includes 1012 defocus maps. Similarly, LFDOF contains 12,000 images. The CUHK dataset is employed for blur detection; it contains 1000 images with human-labeled ground-truth blur regions. The RealDOF dataset, constructed with dual cameras, contains only 50 pairs of test images and no training images. The DPDD dataset, collected from the real world, is currently the most widely and frequently used: it contains 350 pairs of training images, 74 pairs of validation images, and 76 pairs of test images, each provided as left and right dual-pixel views together with the combined image.
Recently, Li et al. [49] proposed a joint deblurring and reblurring learning (JDRL) framework for image defocus deblurring together with the SDD dataset, which includes 115 pairs of training images and 35 pairs of test images. However, the paired images were not captured simultaneously, and uncontrollable natural factors such as wind and changing illumination cause misalignment between the sharp and defocused images, while some image regions are overexposed; these issues negatively affect both model training and image quality evaluation. Detailed information on the aforementioned datasets is given in Table 1.
2.3. Generative Adversarial Networks
In 2014, Goodfellow et al. [50] proposed generative adversarial networks (GANs), which consist of a generator and a discriminator. The generator aims to produce fake samples that resemble real samples as closely as possible, while the discriminator aims to distinguish whether a given sample is real or fake. Their goals are opposed, and by competing with each other the two networks mutually improve; ideally, even a highly capable discriminator eventually cannot tell whether a given sample is real or was produced by the generator. The minimax objective of generator $G$ and discriminator $D$ is as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim p_g}[\log(1 - D(\tilde{x}))], \quad \tilde{x} = G(z), \tag{1}$$

where $p_{\mathrm{data}}$ is the initial given data distribution, $p_g$ is the distribution of data generated by the generator, $\tilde{x} = G(z)$ represents the fake image generated by $G$ from a random noise vector $z$, and $D(\cdot)$ represents the probability that the discriminator assigns to a sample being real: the more likely the sample is real, the closer the output is to 1, and the more likely it is fake, the closer the output is to 0.
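To make the adversarial objective concrete, the following is a minimal PyTorch sketch of one training step under Equation (1); the modules `G` and `D` (the latter ending in a sigmoid so it outputs probabilities), the optimizers, and all names are illustrative assumptions, not parts of the SIDGAN implementation.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim=128):
    """One vanilla GAN step; assumes D outputs probabilities of shape (n, 1)."""
    n = real.size(0)
    ones = torch.ones(n, 1, device=real.device)
    zeros = torch.zeros(n, 1, device=real.device)

    # Discriminator: maximize log D(x) + log(1 - D(G(z))).
    fake = G(torch.randn(n, z_dim, device=real.device)).detach()  # no grad into G
    d_loss = (F.binary_cross_entropy(D(real), ones)
              + F.binary_cross_entropy(D(fake), zeros))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator: the common non-saturating variant maximizes log D(G(z))
    # instead of minimizing log(1 - D(G(z))).
    g_loss = F.binary_cross_entropy(D(G(torch.randn(n, z_dim, device=real.device))), ones)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```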
GANs have received a great deal of attention since they were proposed, but as Salimans et al. [51] pointed out, vanilla GANs suffer from a series of problems such as vanishing gradients, mode collapse, and hyperparameter sensitivity. Because the original GANs were derived from the JS (Jensen–Shannon) divergence, their biggest flaw is that if the distributions of two samples do not overlap, the JS divergence remains constant at $\log 2$ regardless of how far apart they are. Arjovsky et al. [52] argued that the JS divergence is the cause of training instability in vanilla GANs, and they proposed a new measure, the Wasserstein distance, which in its dual form can be defined as follows:

$$W(p_{\mathrm{data}}, p_g) = \sup_{\|D\|_L \le 1} \Big( \mathbb{E}_{x \sim p_{\mathrm{data}}}[D(x)] - \mathbb{E}_{\tilde{x} \sim p_g}[D(\tilde{x})] \Big). \tag{2}$$
Here, $p_{\mathrm{data}}$, $p_g$, and $D$ have the same meanings as in Equation (1), except that $W$ denotes the Wasserstein distance and the discriminator $D$ must satisfy the 1-Lipschitz condition, which is expressed as:

$$|D(x_1) - D(x_2)| \le \|x_1 - x_2\|, \tag{3}$$

that is, the absolute difference between the discriminator's outputs for two images $x_1$ and $x_2$ must be less than or equal to the norm of their pixel-by-pixel difference. In other words, for different images, whether real or fake, the outputs of the discriminator should not differ too sharply. This means the function's gradient changes smoothly and there are no abrupt jumps during gradient descent.
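As a toy numeric illustration of Equation (3) (not part of any cited method), one can compare the two sides of the inequality for a small, hypothetical linear discriminator; all names here are illustrative:

```python
import torch

# A toy "discriminator" mapping 3x8x8 images to scalar scores.
D = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 1))
x1, x2 = torch.rand(4, 3, 8, 8), torch.rand(4, 3, 8, 8)

lhs = (D(x1) - D(x2)).abs().squeeze(1)   # |D(x1) - D(x2)|
rhs = (x1 - x2).flatten(1).norm(dim=1)   # ||x1 - x2||
print(torch.all(lhs <= rhs))             # holds only if D is 1-Lipschitz in this norm
```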
Therefore, the question is how to satisfy the 1-Lipschitz constraint. Arjovsky et al. [52] first adopted the weight clipping approach, truncating the discriminator parameters to the range $[-c, c]$ (a minimal sketch of this scheme is given below). However, the discriminator's loss attempts to separate the scores of real and fake samples, and because weight clipping independently limits the range of each network parameter, it drives the discriminator's parameters to two extremes, concentrating most of them near $-c$ and $c$.
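The weight clipping scheme itself is simple; the following sketch assumes a PyTorch discriminator `D`, with the function name illustrative and the default bound taken from the value reported in the original WGAN paper:

```python
def clip_weights(D, c=0.01):
    # Truncate every discriminator parameter to [-c, c] after each
    # optimizer step, as in the original WGAN weight clipping scheme.
    for p in D.parameters():
        p.data.clamp_(-c, c)
```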
In response to the problems that arise from weight clipping, Gulrajani et al. [53] replaced it with a gradient penalty and proposed WGAN-GP. The gradient penalty enforces the 1-Lipschitz constraint softly: replacing the hard constraint of Equation (3) with a penalty term, the discriminator objective becomes

$$L = \mathbb{E}_{\tilde{x} \sim p_g}[D(\tilde{x})] - \mathbb{E}_{x \sim p_{\mathrm{data}}}[D(x)] + \lambda\,\mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big], \tag{4}$$

where $\hat{x}$ is sampled uniformly along straight lines between pairs of points drawn from $p_{\mathrm{data}}$ and $p_g$, and $\lambda$ is the penalty coefficient.
The gradient penalty yields more stable training and has been widely used ever since. Because WGAN-GP (Wasserstein Generative Adversarial Network with Gradient Penalty) has achieved strong results in image generation tasks such as image super-resolution [54], image shadow removal [55], image inpainting [56], and illumination processing [57], our SIDGAN also adopts the gradient penalty of WGAN-GP to improve training stability.
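As a concrete reference for the penalty term in Equation (4), the following is a minimal PyTorch sketch of the WGAN-GP gradient penalty; the function and variable names are illustrative assumptions and do not come from the SIDGAN code.

```python
import torch

def gradient_penalty(D, real, fake, lam=10.0):
    """Penalty term of Equation (4) for a critic D mapping images to scalars."""
    # Sample x_hat uniformly on straight lines between real and fake samples,
    # as prescribed by Gulrajani et al.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    scores = D(x_hat)
    # Gradient of the critic output with respect to x_hat.
    grads = torch.autograd.grad(outputs=scores, inputs=x_hat,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    # Penalize deviation of the gradient norm from 1 (soft 1-Lipschitz constraint).
    return lam * ((grad_norm - 1.0) ** 2).mean()

# Usage in the critic loss of Equation (4):
#   d_loss = D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake)
```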