1. Introduction
Synthetic Aperture Radar (SAR) images are becoming increasingly relevant to a large number of applications. They are a natural complement to optical remote sensing images, because of their completely unrelated imaging mechanisms and their ability to ensure all-time, all-weather coverage. SAR-optical fusion is arguably a major topic in remote sensing image processing [1,2,3,4]. Unfortunately, extracting reliable information from full-resolution (single-look) SAR images is a very difficult task due to the presence of intense multiplicative speckle noise. Further problems arise because of the non-stationary nature of the noise and the peculiar statistics of SAR images, markedly different from those of natural images. In this challenging scenario, a SAR despeckling technique should satisfy multiple contrasting requirements, as outlined in [5]:
suppress most of the speckle in homogeneous regions;
preserve textures;
preserve region boundaries and other linear structures;
avoid altering natural or man-made permanent scatterers; and
avoid introducing filtering artifacts.
Research on SAR image despeckling has been going on for several decades [6]. Early methods were based on spatial-domain filtering [7,8], with some form of local adaptivity to deal with signal non-stationarity. In general, they ensure only limited speckle reduction. Later, the spread of the wavelet transform spawned a new generation of filters based on transform-domain coefficient shrinkage. For example, Xie et al. [9] used a Markov random field (MRF) prior to improve regularity in wavelet-domain shrinkage, while Solbo and Eltoft [10] performed homomorphic wavelet maximum a posteriori filtering. Despite stronger speckle rejection, however, these methods ensured only limited detail preservation and introduced disturbing artifacts. More recently, nonlocal methods gained large popularity due to their superior performance. The nonlocal approach was first proposed in the seminal paper of Buades et al. [11], together with the nonlocal means (NLM) algorithm for denoising images corrupted by additive white Gaussian noise (AWGN). Then, in [12], Dabov et al. proposed the highly effective block-matching 3D algorithm (BM3D), a de facto baseline in the field. These ideas and tools proved effective also for SAR despeckling, and several effective nonlocal despeckling filters were soon proposed, including PPB [13], SAR-BM3D [14], FANS [15], NL-SAR [16], and the NLM variants proposed in [17].
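For concreteness, the pixel-wise NLM estimate of [11] can be sketched as follows. This is a minimal illustration for the AWGN case; the patch size, search-window size, and filtering parameter h are arbitrary illustrative choices, not the settings of any of the cited filters:

```python
import numpy as np

def nlm_pixel(img, r, c, patch=3, search=7, h=0.1):
    """Estimate pixel (r, c) by nonlocal means: average the pixels in a
    search window, weighted by the similarity of the patches around them
    to the patch around the target (AWGN formulation of Buades et al.)."""
    p = patch // 2
    s = search // 2
    ref = img[r - p:r + p + 1, c - p:c + p + 1]      # patch around target
    num, den = 0.0, 0.0
    for i in range(r - s, r + s + 1):
        for j in range(c - s, c + s + 1):
            cand = img[i - p:i + p + 1, j - p:j + p + 1]
            d2 = np.mean((ref - cand) ** 2)           # patch distance
            w = np.exp(-d2 / h ** 2)                  # similarity weight
            num += w * img[i, j]
            den += w
    return num / den
```

The output is a convex combination of the pixels in the search window; SAR-oriented variants such as PPB replace the Euclidean patch distance with a speckle-aware one.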
In the last few years, deep learning has enabled a quantum leap in many remote sensing image processing tasks, from land cover classification [18] to segmentation [19], pansharpening [20], and data fusion [21]. Therefore, there is growing interest in deep learning-based SAR image despeckling as well. Methods based on convolutional neural networks (CNNs) [22,23] and generative adversarial networks (GANs) [24] were proposed as early as 2017, and new methods keep appearing at a growing rate [25,26]. Nonetheless, improvements over the previous state of the art have been quite limited to date. This is probably due to the scarcity of high-quality training data, but also to a still insufficient understanding of the despeckling problem and of the potential of deep learning methods towards its solution. From this point of view, nonlocal methods are especially interesting as they shed some light on despeckling mechanisms. They rely on the idea of separating the filtering process into two key steps: (i) finding the best predictors of the target, not necessarily close to it; and (ii) performing the actual estimate based on them. The separate analysis of these two steps provides precious insight into a method's strengths and weaknesses.
In this work, we try to blend the nonlocal concept with CNN-based image processing, with the aim of exploiting their complementary strengths for SAR despeckling. Although some CNN-based nonlocal methods have been proposed for AWGN denoising in the last few years (e.g., [27,28,29,30]), only very recently have researchers begun to explore this promising approach for SAR despeckling [31,32]. In particular, here we follow our recent conference paper [31] and propose a simple CNN-powered nonlocal means filter, that is, plain pixel-wise nonlocal means in which the filter weights are computed by a dedicated convolutional network. By doing so, we pursue a two-fold goal. On the one hand, we look for sheer performance, aiming to improve objective and subjective quality indicators with respect to the current state of the art. On the other hand, we look for interpretable results, which shed light on the potential of this approach and on ways to realize it. Therefore, we do not use deep learning to blindly separate signal from noise, but rather use a strongly constrained CNN-based architecture that provides interpretable results in terms of the relationships among target and predictor pixels. That is, we try to gain new insights on SAR despeckling by studying the strategy followed by data-driven methods to address this problem.
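The resulting filter structure can be sketched as follows. The shapes and the softmax normalization are illustrative assumptions on our part; in the actual method, the weight tensor is produced by the trained CNN:

```python
import numpy as np

def nlm_with_weights(img, weights, search=5):
    """Plain pixel-wise NLM given externally supplied weights.
    `weights` has shape (H, W, search*search): one weight per candidate
    displacement inside the search window, for every target pixel.
    In CNN-NLM these weights would come from a trained network; here
    they are just an input array (illustrative shapes/normalization)."""
    s = search // 2
    H, W = img.shape
    pad = np.pad(img, s, mode='reflect')
    # softmax-normalize so each pixel's weights sum to one
    w = np.exp(weights - weights.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    out = np.zeros_like(img, dtype=float)
    k = 0
    for di in range(-s, s + 1):
        for dj in range(-s, s + 1):
            # weighted contribution of the pixel displaced by (di, dj)
            shifted = pad[s + di:s + di + H, s + dj:s + dj + W]
            out += w[..., k] * shifted
            k += 1
    return out
```

Keeping the estimator in this constrained weighted-average form is what makes the learned weights directly comparable with those of conventional NLM.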
Following the Introduction, in Section 2 we analyze related work on deep learning-based despeckling. Then, in Section 3, we describe the proposed approach, with two implementations based on different CNN architectures. Section 4 presents experimental results on synthetic and real-world SAR data, while Section 5 discusses the results. Finally, Section 6 draws conclusions and outlines future work.
2. Related Work
To the best of our knowledge, the first papers using deep learning for SAR image despeckling date back to 2017. The SAR-CNN proposed in [22] uses homomorphic processing to adapt the DnCNN denoiser, originally proposed in [33] for AWGN data, to SAR despeckling. Training relies on 25-look SAR data taken as a clean reference and a suitable SAR-oriented loss derived from the measure proposed in [13,34]. A significant performance improvement is observed with respect to state-of-the-art conventional methods, but the lack of truly clean (infinite-look) SAR data is pointed out as a major factor limiting performance.
The Image Despeckling CNN (ID-CNN) proposed in [23] resorts to residual learning to estimate the noise component of the image. However, it works directly in the original domain, and the despeckled image is obtained by dividing the original image by the estimated noise, which seems to be an unreliable practice. To circumvent the problem of missing clean data, training is performed on simulated SAR images, obtained by injecting synthetic speckle into optical (e.g., GoogleEarth) images. A combination of Euclidean and Total Variation (TV) losses is used. This approach to training, call it synthetic, is followed by the majority of subsequent papers. Although it allows for virtually unlimited training data, such data do not possess the statistical properties of real-world SAR images: the underlying clean signal differs from a true SAR signal (consider, for instance, the double-reflection lines in urban areas), while the hypothesis of uncorrelated speckle holds only in special circumstances.
In [24], the authors of [23] proposed a despeckling method based on generative adversarial networks (GANs). During training, the discriminator distinguishes clean from despeckled images and teaches the generator how to extract high-quality clean images from the original SAR data. To this end, the Euclidean loss is complemented by a perceptual loss (computed on a pretrained VGG16 model [35]) and an adversarial loss. Again, synthetic training is used, casting doubt on the merits of an otherwise interesting approach. We also point out that the trained model is not published online, making it virtually impossible to replicate the experiments, given the hardship of GAN training.
SAR-DRN [36], inspired by Wang et al. [24], is based on a dilated residual network (DRN). Dilated convolutions make it possible to keep the lightweight structure and small filter size of ID-CNN while enlarging the receptive field. In addition, skip connections help reduce the vanishing gradient problem. Along the same lines, Gui et al. [37] used dilated convolution and residual learning with a densely connected network. Li et al. [38] also relied on dilated convolution and residual training, the main innovation being the use of a convolutional block attention module to enhance representation power and performance. All these methods use synthetic training on the UC-Merced dataset [39]. Dense connections are also used in [40] to face the vanishing gradient problem but, to further reduce computation, a limited block-wise connectivity is considered. Moreover, to help the network preserve image details, a preliminary single-level wavelet transform is computed, and the stacked subbands are fed to the net, using a loss function based on wavelet features. Again, synthetic training based on UC-Merced images is used.
In [41], the U-Net architecture, originally proposed for segmentation, is adapted to despeckling. The loss includes an additional total variation term to better filter smooth areas. After synthetic training on aerial images, the net is fine-tuned on more realistic simulated data, obtained by injecting speckle into multitemporally filtered SAR images. To avoid overfitting, speckle data are generated on the fly. A very simple denoiser is proposed in [42] with the goal of showing the value of an additional loss term accounting for the Kullback–Leibler divergence between original and despeckled data, so as to ensure fidelity of first-order statistics. However, only a very limited experimental validation is carried out.
Training on SAR data is completely bypassed in [43]. Following the MuLoG approach [44], the idea is to exploit denoising architectures designed for additive white Gaussian noise and pre-trained on abundant AWGN data. Suitable adaptation is applied to deal with Fisher–Tippett distributed log-transformed SAR data, which, for a low number of looks, differ significantly from Gaussian data. Pan et al. [45] followed the same approach but replaced the DnCNN denoiser with the faster FFDNet denoiser [46], which uses combined downsampling–upsampling steps to improve efficiency. Then, in [25], homomorphic filtering is performed based on multiple instances of the same CNN [47] trained on Gaussian noise at various levels of intensity. The output images are then combined by means of guided filtering driven by an edge map.
A few very recent papers use noise2noise (N2N) training to circumvent the lack of truly clean SAR data. In [48], it is observed that a CNN denoiser can be trained effectively even in the absence of a clean ground truth, provided that multiple images with the same signal component and independent noise realizations are available. Therefore, clean targets can be replaced by noisy targets. The noise level of the noisy reference is immaterial (hence, blind despeckling is possible); it is only required that the mean value be preserved. In [49], N2N training is used for a slightly modified version of U-Net. Later on, in [50], N2N training is applied to a dense dilated-convolution network without batch normalization. In both cases, however, the authors kept training on images with simulated speckle, hence the true potential of N2N training is never really exploited. A similar flaw affects the method in [51], where samples for N2N training are generated by means of a GAN architecture and a nested U-Net model is used for the final despeckling. In [52], a blind speckle decorrelator is used to pre-process test images and improve their fit to the synthetic images used for N2N training.
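The core N2N observation can be illustrated with a deliberately oversimplified toy example: an L2-optimal predictor fitted to noisy targets recovers the clean value, because unit-mean multiplicative speckle leaves the expectation untouched. Here the "network" is reduced to a single scalar estimate, which is our illustrative stand-in, not the actual training scheme of [48]:

```python
import numpy as np

# Toy noise2noise illustration: with an L2 loss, the optimal constant
# predictor for a set of targets is their sample mean.  If the targets
# are noisy (clean value times unit-mean speckle), that mean still
# approaches the clean value as more examples are seen, so clean
# references are not strictly needed.
rng = np.random.default_rng(0)
clean = 3.0
looks = 1                                   # single-look: strongest speckle
speckle = rng.gamma(shape=looks, scale=1.0 / looks, size=100000)
noisy_targets = clean * speckle             # unit-mean multiplicative noise

# L2-optimal constant predictor = sample mean of the noisy targets
estimate = noisy_targets.mean()
```

The same argument carries over, pixel by pixel, to a CNN trained with an L2 loss against independent noisy realizations of the same scene.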
To the best of our knowledge, the interplay between nonlocal methods and deep learning for SAR despeckling was first explored in two very recent papers. In [32], the approach of Cruz et al. [29] is followed, in which nonlocal processing is used to refine the output of CNN-based filters. Instead, in [31], we proposed to use nonlocal means filtering with weights computed patch-by-patch by a dedicated CNN, so as to compare the weights provided by the network with those output by conventional nonlocal methods.
We conclude this short, and certainly non-exhaustive, review with some general remarks. First, it clearly appears that the use of deep learning for SAR despeckling is raising great interest, and new methods are being proposed by the day. Most of these proposals, however, focus on new architectures, neglecting what is, in our view, the most critical point: the lack of reliable reference data. Synthetic training cannot really make up for this lack; as in many other fields, ad hoc datasets are needed more than new architectures. A further observation is that many papers do not provide code and data to allow for reproducible research, partly due to the restrictive data policies widespread in the remote sensing field. Finally, we note insufficient attention to previous methods and results. For example, for the well-known SAR-BM3D method, two papers report an ENL indicator below 5 and above 500,000, respectively, fluctuations that can hardly be attributed to differences in the original images.
4. Experimental Validation
To assess the proposed approach, we carried out experiments on both simulated and real SAR images. Optical images with injected single-look speckle allowed us to compute objective performance indicators, the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM), which enabled a simple comparison, in ideal conditions, with the state of the art. However, a solid performance validation can only be based on the analysis of real-world SAR images. In the absence of a clean reference, we used visual inspection of despeckled and ratio images to assess the filters' properties, especially regarding the preservation of image details. Speckle suppression ability, instead, was measured objectively through the equivalent number of looks (ENL) computed on homogeneous regions of the image and by means of the no-reference image quality index proposed in [55,56].
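For reference, the full-reference PSNR used on simulated data reduces to the following (a minimal sketch; the peak value depends on the assumed dynamic range of the images):

```python
import numpy as np

def psnr(clean, denoised, peak=255.0):
    """Peak signal-to-noise ratio in dB between a clean reference and a
    despeckled estimate (full-reference: usable only when a clean image
    exists, i.e., on simulated data)."""
    mse = np.mean((clean.astype(float) - denoised.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

SSIM is computed analogously but from local means, variances, and covariances; standard implementations exist in common image-processing libraries.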
To study the improvement with respect to the state of the art, we considered a number of reference methods, chosen both for their performance and for their diffusion in the community. The enhanced Lee [57] and Kuan [8] local filters operate in the spatial domain with adaptive windows (we used a 5 × 5-pixel size) that follow the dominant signal structures. Turning to nonlocal filters, besides plain NLM [11], we considered its SAR-oriented iterative version, PPB [13], and the more advanced NL-SAR [16], together with the nonlocal transform-domain shrinkage methods SAR-BM3D [14] and FANS [15]. Finally, we compared results with two deep learning-based methods, SAR-CNN [22] and ID-CNN [23]. In all cases, the main parameters, e.g., search-area and window size, were set as suggested in the original papers, and the SAR-domain distance proposed in [13] was used. As for the proposed method, the two core CNNs were trained with the ADAM gradient-based optimization method with 32-patch minibatches and patch sizes of 48 × 48 and 104 × 104 pixels, respectively. Synthetic training data were obtained by injecting single-look simulated speckle into 400 different optical images. Real SAR data were acquired by the COSMO-SkyMed satellites. In this latter case, lacking true speckle-free data, we resorted to temporally multilooked images (25 dates) as reference, excluding patches where temporal changes occurred. The two datasets comprise a total of 8000 and 12,800 minibatches, respectively. Training proceeded for 50 epochs, with an initial learning rate of
, divided by ten after every 20 epochs. All code was written in TensorFlow, running on an Intel Xeon CPU at 2.10 GHz and an Nvidia P100 GPU. The trained models will be made available online upon publication of the present paper.
4.1. Experiments on Simulated Images
We generated simulated SAR images through the pixel-wise product of clean optical images with a field of Gamma-distributed independent random variables. In all our experiments, we considered only single-look images, since this is the most challenging case, due to the high intensity of speckle, and also the most interesting for applications, since there is no loss of resolution due to spatial multilooking.
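This simulation procedure amounts to the following (unit-mean Gamma speckle of order L, with L = 1 for the single-look case considered here):

```python
import numpy as np

def inject_speckle(clean, looks=1, seed=0):
    """Simulate an L-look SAR intensity image as the pixel-wise product
    of the clean image with independent unit-mean Gamma(L, 1/L) speckle.
    The speckle has variance 1/L, so L = 1 is the noisiest case."""
    rng = np.random.default_rng(seed)
    speckle = rng.gamma(shape=looks, scale=1.0 / looks, size=clean.shape)
    return clean * speckle
```

Note that the speckle samples are spatially independent here, an assumption that, as discussed in Section 2, holds for real SAR data only in special circumstances.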
In Figure 3, we show the clean and noisy images used in these experiments. Although in the despeckling literature it is customary to use optical remote sensing images for simulation purposes, we chose general-purpose images to better emphasize that this approach does not, in any case, generate faithful approximations of real-world SAR images, and that all results must be taken with due care. With these caveats, Table 3 shows the numerical results obtained on these images. Synthetic results were computed as the average over the 10 test images and, for each image, as the average over 10 realizations of the speckle field. Conventional methods are grouped in the upper part of the table, while methods based on deep learning are in the lower part.
Looking at the PSNR indicator, deep learning-based methods appear to have the potential to provide a clear performance gain over conventional ones. Indeed, while ID-CNN is aligned with advanced nonlocal methods, SAR-CNN improves by about 1 dB over the best of them (SAR-BM3D). As for the proposed method, there is no further improvement with respect to SAR-CNN when the NLM weights are estimated with a fully convolutional CNN. However, about 0.5 dB is gained when the CNN with nonlocal layers is used. Since this behavior is consistent, in the following experiments we considered only this latter version of the proposed method. Turning to SSIM, quite similar behavior is observed. The proposed method (with nonlocal layers) provides the best performance, with an appreciable improvement over SAR-CNN and a more significant gain with respect to all conventional methods. In the last column, we report the average processing time. For nonlocal methods, this is an issue: SAR-BM3D, for example, requires about 50 s of CPU time, mostly for nearest-neighbor search. This is orders of magnitude more than simpler local filters, such as Lee and Kuan, which in fact remain popular among practitioners also for this reason. For deep learning methods, processing time becomes fully manageable again, provided a GPU is used. Of course, training the models may take very long, but this is carried out off-line.
To gain some insight into the quality of the filtered images, Figure 4 shows the output of the best-performing methods for the Barbara image. To allow for an overall view of results, and also to limit space, we display these images in a compact format; suitable zooming is therefore recommended for accurate visual inspection of details. Considering the very noisy input, it seems safe to say that the proposed method provides filtered images of impressive quality. The speckle is effectively suppressed without significantly impairing the image resolution. Moreover, most details are well preserved, even thin lines and complex textures, and no major artifacts are introduced. Even the best conventional nonlocal methods, instead, fail under one or another of these aspects. For example, SAR-BM3D preserves resolution and details but ensures only limited speckle suppression, while NL-SAR removes most of the speckle but at the price of a significant loss of resolution. As for plain NLM, based on the same filtering engine as the proposed method, it causes a strong loss of resolution, only partially remedied by PPB. The most interesting comparison, however, is with SAR-CNN. To better appreciate the improvements, Figure 5 shows a much-enlarged strip of Barbara, chosen for its abundance of patterns. Indeed, on such regular patterns, the improvement granted by the new method is striking, with lines that are barely distinguishable in the noisy input correctly reproduced in the output most of the time. Moreover, the disturbing artifacts produced by both SAR-BM3D and SAR-CNN on Barbara's face no longer occur. Nonetheless, the loss of quality with respect to the clean image is still significant. Our sensitivity to the features of a human face allows us to fully appreciate the sharp loss of detail that actually occurs. Whether a despeckling engine can ever avoid such losses, without the help of further information, is debatable.
To complete the analysis on simulated images, Figure 6 shows the ratio images for Barbara, that is, the ratios between the noisy input and the despeckled output. An ideal filter should remove only the injected speckle; therefore, the ratio image should be a field of uncorrelated speckle samples. This seems to actually be the case for some filters, such as SAR-BM3D, SAR-CNN, and CNN-NLM, while in some other cases, notably for NLM and PPB, there is a clear leakage of signal structures into the ratio image. ID-CNN, instead, seems to have a bias in very dark regions. The proposed method seems very satisfactory also from this point of view. This is also confirmed numerically by the no-reference quality index [55,56], which compares the statistical distribution of the ratio image with that of the theoretical speckle. The analysis was carried out on a set of homogeneous areas of the image, automatically selected by the method. Results are reported in Table 4 for each test image, with smaller values (zero in the ideal case) indicating better performance. The proposed method always exhibits one of the smallest values (best in boldface) and the second smallest on average.
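The ratio-image check can be sketched as follows (a simplified illustration; the actual index of [55,56] performs a more elaborate statistical comparison than the mean check shown here):

```python
import numpy as np

def ratio_image(noisy, despeckled, eps=1e-12):
    """Ratio between the noisy input and the despeckled output.  For an
    ideal filter this is a field of pure speckle: unit mean, variance
    1/looks, and no leftover signal structures."""
    return noisy / np.maximum(despeckled, eps)
```

Visible structures in the ratio image indicate signal leakage (over-smoothing), while a mean far from one indicates a radiometric bias.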
4.2. Experiments on Real-World SAR Images
To validate the proposed method on real-world SAR data, we relied on a stack of 25 co-registered single-look images acquired by the COSMO-SkyMed sensor over the city of Caserta (Italy), spanning a temporal interval of about five years, from 26 July 2010 to 23 March 2015. The images cover an area of about 40 km × 40 km, with 3 m/pixel spatial resolution, for a size exceeding 16,000 × 16,000 pixels. All despeckling experiments were carried out on the first image of the stack. Temporal multilooking was used to obtain reference data for training. Of course, such a reference is far from the ideal "clean" data, not only because of the limited number of looks, which implies imperfect rejection of speckle, but also because of the presence of temporal changes. The latter problem was addressed by discarding areas where a significant temporal change was detected. Eight 600 × 600-pixel clips, sampling various types of land cover, were cropped from the first image and used for testing. Of course, these areas were excluded from the training set, but nearby areas of similar characteristics were included, so as to guarantee a good alignment between training and testing data. All test clips are shown in Figure 7 together with the corresponding multilook reference. Note that these multilooked images were not used in any way in validation (they even include regions in which temporal changes occurred) and are only shown to give some idea of how the clean SAR signal might appear. The white boxes on the multilooked images indicate the regions used to compute the ENL.
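The construction of the temporally multilooked reference can be sketched roughly as follows. The per-pixel change-masking rule and its threshold are our illustrative stand-ins; the actual change detector used in the paper is not specified here:

```python
import numpy as np

def temporal_multilook(stack, change_thresh=2.0):
    """Build a despeckling reference by averaging a stack of co-registered
    single-look intensity images over time, masking out date/pixel pairs
    whose intensity departs too much from the temporal mean (a crude
    stand-in for change detection; the threshold is illustrative)."""
    mean = stack.mean(axis=0)
    # flag date/pixel pairs whose ratio to the temporal mean is implausible
    ratio = stack / np.maximum(mean, 1e-12)
    stable = ratio < change_thresh
    counts = stable.sum(axis=0)
    ref = (stack * stable).sum(axis=0) / np.maximum(counts, 1)
    return ref, counts
```

With 25 dates and no changes, each pixel of the reference is a 25-look average of independent speckle realizations.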
Table 5 reports, for each filter, the ENL measured on all test images and, in the rightmost column, their average. The proposed CNN-NLM always provides the largest (six images) or second-largest (two images) ENL. This is reflected in the largest average ENL (about 250), followed by SAR-CNN (150) and NL-SAR (100).
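The ENL of a homogeneous region is the squared mean over the variance of its intensity values:

```python
import numpy as np

def enl(region):
    """Equivalent number of looks of a homogeneous region: squared mean
    over variance of the intensities.  Higher ENL means stronger speckle
    suppression; for unfiltered L-look intensity data, ENL is close to L."""
    region = np.asarray(region, dtype=float)
    return region.mean() ** 2 / region.var()
```

Since it assumes a constant underlying signal, ENL is meaningful only on the homogeneous areas marked in Figure 7.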
With real-world SAR images, however, even more than with simulated images, visual inspection is necessary for a solid assessment. Therefore, in the following figures, we show detailed visual results for two selected images. Again, for the sake of compactness, we display rather small images that require adequate zooming for analysis, except for two strips shown much enlarged and analyzed in depth later on. In Figure 8 and Figure 9, we show the output of selected filters for Images #5 and #6, respectively, together with the single-look input and the 25-look reference. Visual inspection confirms the good behavior of the proposed method. There is very effective suppression of speckle, as predicted by the ENL numbers of Table 5, but also faithful preservation of relevant details, such as man-made structures, field boundaries, and roads, which all keep their original high resolution. Other methods, such as SAR-BM3D, also preserve image resolution and details, but with very limited speckle suppression. On the other hand, NL-SAR and SAR-CNN suppress speckle very well, but they also degrade resolution or lose entire structures. To better appreciate such differences, Figure 10 and Figure 11 focus on two narrow horizontal strips of Images #5 and #6 (rotated for better display), showing the output of SAR-BM3D, SAR-CNN, and the proposed method next to the single-look input and the 25-look reference. As observed before, SAR-BM3D seems to preserve all the information present in the input, without losing or even blurring informative details, but does not remove much speckle. SAR-CNN, instead, removes speckle very effectively but tends to lose or blur linear structures (roads and boundaries), which are instead very well preserved by the proposed method. This is arguably a consequence of the nonlocal layers' ability to take advantage of image self-similarities.
However, turning to the ratio images, shown in Figure 12 for SAR Image #5, we also observed an undesired behavior. The ratio images of all deep learning methods exhibit a clear leakage of signal, concerning not only linear structures but also the average intensity of some fields. Given the black-box nature of CNNs, we have only an indirect explanation for this phenomenon. However, the fact that it involves both our deep learning methods, and that it happens only with SAR images and not with simulated data, suggests that this problem has to do with the imperfect reference images used in training. In fact, a 25-look image is not the clean SAR signal, but only an approximation of it, based on temporal multilooking. Indeed, the fields characterized by a different average intensity than the rest of the image correspond to areas where the despeckled image approximates the reference fairly well (see again Figure 8) but not the original noisy image. Thus, the CNN behaves as instructed to do based on bad examples, probably due to seasonal changes that escaped the change detector. With these premises, the ratio image-based index can only provide bad results, which is in fact the case, as shown in Table 6, where the proposed method trails all others. If our conjecture is right, however, these problems will be automatically solved when better reference data become available, the first item on our agenda.
Alternatively, one may be tempted to use the network trained on synthetic data, i.e., optical images with injected speckle, far from true SAR data but perfectly reliable and virtually unlimited. Figure 13 shows the corresponding output for SAR Image #5. Speckle suppression is much worse than with the network trained on our 25-look reference (whose output is shown again for easier comparison), and some odd micropatterns appear in the despeckled image, confirming that using real-world SAR data for training is the right way to go.
5. Discussion
We proposed our CNN-NLM architecture with two goals: improving performance and providing new insight into nonlocal filtering. Therefore, we now turn to studying the weights generated by the proposed method and compare them with those of conventional NLM. Indeed, the only difference between the two methods lies in the weights, generated by a CNN in our proposal and set on the basis of a similarity measure in NLM. Thus, we selected some relevant patches from Barbara and SAR Image #6 and analyzed the weights used to estimate their central pixel. The results are shown in Figure 14 and Figure 15, respectively. The selected patches are characterized by the presence of lines (blue), edges (yellow), and texture (green), or else are homogeneous (red). These structures are easily recognized in the clean/25-look reference patches, and much less so in the original noisy ones. For each test patch, we built a subfigure showing, in the top row, the clean/multilook reference and the weights selected by NLM and CNN-NLM superimposed on it, and, in the bottom row, the noisy input and the despeckled outputs provided by NLM and CNN-NLM.
Consider, for example, the blue patch from Barbara and the associated subfigure in the top-left. Diagonal structures are clearly visible in the clean patch, especially a dark line in the center, the dark space between two books. Both the conventional and the CNN-based weights follow this dark line to estimate the (dark) central pixel. In the first case, however, the weights are dispersed over the whole patch and gather information also from pixels farther away from the target, while the CNN weights are much more concentrated. The first choice is more adherent to the spirit of nonlocal filtering, as it tries to exploit relevant information all over the image. Nonetheless, the results speak clearly in favor of the second choice. CNN-NLM outputs quite a faithful copy of the clean signal, while the NLM output patch exhibits a clear loss of resolution, and the dark line almost disappears. This can be explained by looking at how noisy the input patch is. Although sensible in principle, the NLM choice of weights is quite risky, as it relies on a similarity measure that, in the presence of such noisy data, may select bad predictors.
This is even clearer in the second example, the yellow patch from Barbara, featuring a sharp edge. Due to the limited contrast between the dark and bright sides of the edge, and to the intense noise, NLM selects large weights on both sides, with the effect of largely smoothing the edge. On the contrary, all large CNN-NLM weights are on the right side of the edge, and allow for its faithful reproduction.
Of course, risky NLM weights are less of a problem in homogeneous areas (red patch) and they only give rise to some residual noise in the output patch, which is not necessarily wrong. Instead, in the presence of regular patterns (green patch), the dispersion of weights in a large area leads to blurred patterns in output, while the CNN weights, mostly concentrated on the central line, allow for the extraction of such hidden pattern.
In real-world SAR images, we observe the very same phenomena described before, only less pronounced, because of the absence of the sharp contrasts found in optical images (we do not analyze strong scatterers or double-reflection lines, as they are always well reproduced by any reasonably well-behaved filter). Again, lines (blue patch) are severely smoothed by NLM and tend to disappear because too many pixels, not all of them reliable, are used for the estimate. Similar problems affect edges (yellow patch) but are less pronounced. Homogeneous regions are correctly filtered in both cases, with the CNN weights ensuring only a stronger smoothing. Finally, we could find only some subtle regular patterns in our SAR image (green patch), so faint that the CNN could not find preferential directions, and both NLM and CNN-NLM largely smooth them out.