#### *4.1. Dataset*

Because it is difficult to collect real SAR images without speckle noise, we train the networks using synthetic noisy/clean image pairs. The public UC Merced Land Use Dataset (http://weegee.vision.ucmerced.edu/datasets/landuse.html) is chosen as the source of clean training images. The dataset contains 21 scene classes with 100 optical remote sensing images per class. Each image has a size of 256 × 256 pixels and a pixel resolution of 1 foot [52]. Following [27], we randomly select 400 images from the dataset as the training set and use the remaining images for testing. Some training samples are shown in Figure 7. Finally, after grayscale preprocessing, the speckled images are generated using Equation (1), as in [25,53]. The noise levels (L = 2, 4, 8, 12) correspond to the number of looks in SAR, and the code for adding speckle noise is available on GitHub (https://github.com/rcouturier/ImageDenoisingwithDeepEncoderDecoder/tree/master/data_denoise).
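Equation (1) is not reproduced in this section. As a minimal sketch, assuming the standard multiplicative speckle model for an L-look intensity image (the clean image multiplied by unit-mean gamma noise with variance 1/L), the training pairs can be generated as follows; the function name is ours:

```python
import numpy as np

def add_speckle(clean, looks, rng=None):
    """Multiply a clean image by gamma-distributed speckle with unit mean.

    For an L-look intensity image, the speckle factor follows a
    Gamma(shape=L, scale=1/L) distribution, so its mean is 1 and its
    variance is 1/L -- fewer looks means stronger noise.
    """
    rng = np.random.default_rng(rng)
    noise = rng.gamma(shape=looks, scale=1.0 / looks, size=clean.shape)
    return clean * noise

# Lower L gives noisier images: the variance of the speckle factor is 1/L.
clean = np.full((256, 256), 0.5)
speckled = {L: add_speckle(clean, L, rng=0) for L in (2, 4, 8, 12)}
```

Because the noise is multiplicative with unit mean, the speckled image keeps the same average brightness as the clean one, which matches the visual behavior of the dataset generation described above.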

**Figure 7.** Part of the sample images used to train the network.

#### *4.2. Experimental Settings*

All the networks are trained with stochastic gradient descent using a mini-batch size of 32. All weights are initialized by the modified Xavier initialization scheme [54] proposed by He et al. We use the Adam optimizer [55] with tuned hyper-parameters to accelerate training, and the hyper-parameters are kept the same across all layers and all networks. Experiments are implemented on the TensorFlow platform with an Intel i7-8700 CPU and an NVIDIA GTX-1080 (8 GB) GPU.
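The modified Xavier scheme of He et al. [54] draws convolution weights from a zero-mean normal distribution with variance 2/n, where n is the number of input connections per output unit. A minimal numpy sketch (the layer sizes below are illustrative, not the exact MSR-net configuration):

```python
import numpy as np

def he_init(fan_in, fan_out, kernel_size, rng=None):
    """He (MSRA) initialization for a conv layer with ReLU activations.

    Weights are drawn from N(0, sqrt(2 / n)) with n = k * k * fan_in,
    the number of input connections per output unit, which keeps the
    activation variance roughly constant across layers.
    """
    rng = np.random.default_rng(rng)
    n = kernel_size * kernel_size * fan_in
    std = np.sqrt(2.0 / n)
    return rng.normal(0.0, std, size=(kernel_size, kernel_size, fan_in, fan_out))

# Illustrative 3x3 conv layer with 64 input and 64 output channels.
w = he_init(fan_in=64, fan_out=64, kernel_size=3, rng=0)
```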

The details of the model are as follows. The number of kernels in each unit is shown in Figure 3. The kernel sizes for the first and last convolutional layers are 5 × 5, while all others are 3 × 3. Rectified Linear Units (ReLU) are used as the activation function for all layers except the last convolutional layer before the sub-pixel unit. The L1 loss is chosen to train the network, defined as:

$$\mathcal{L}\_1(\Theta) = \frac{1}{N} \sum\_{x=1}^{W} \sum\_{y=1}^{H} \left|\varphi\left(\mathbf{X}(x,y); \Theta\right) - \mathbf{C}(x,y)\right|,\tag{13}$$

where Θ denotes the filter parameters updated during training, N = W × H is the number of pixels, and **C**, **X**, and *ϕ*(·) denote the clean objective image, the input image with speckle noise, and the despeckled output, respectively.
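Equation (13) amounts to a mean absolute error over all N = W × H pixels; a minimal numpy sketch (the array names are ours):

```python
import numpy as np

def l1_loss(output, target):
    """Mean absolute error between the despeckled output and the clean
    target, i.e., Equation (13) averaged over N = W * H pixels."""
    return np.abs(np.asarray(output, dtype=np.float64)
                  - np.asarray(target, dtype=np.float64)).mean()
```

Compared with the L2 loss, the L1 loss penalizes large residuals less severely, which tends to produce fewer over-smoothed regions in image restoration.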

#### *4.3. Experimental Results*

The test results of our proposed network are presented in this section. To verify the proposed method, we compare the performance of our MSR-net with three other despeckling methods: SAR-BM3D [22], ID-CNN [25], and the Residual Encoder-Decoder network (RED-Net) [53]. The first is a traditional nonlocal algorithm based on wavelet shrinkage, and the latter two are based on deep convolutional neural networks.

#### 4.3.1. Results on Synthetic Images

Three classes of synthetic images, building, freeway, and airplane, are chosen as the test set to evaluate the noise-reduction ability of each method. Part of the processing results of the different algorithms under different noise levels is shown in Figures 8 and 9.


**Figure 8.** Test results on images of airplanes with four levels of noise. The noise level from left to right is L = 2, L = 4, L = 8, and L = 12. (**a**) Speckled image, (**b**) SAR-BM3D (Block-Matching and 3D filtering), (**c**) image despeckling convolutional neural network (ID-CNN), (**d**) RED-Net, (**e**) multi-scale recurrent network (MSR-net).


**Figure 9.** Test results on images of buildings with four levels of noise. The noise level from left to right is L = 2, L = 4, L = 8, and L = 12. (**a**) Speckled image, (**b**) SAR-BM3D, (**c**) ID-CNN, (**d**) RED-Net, (**e**) MSR-net.

From the figures, we can observe that the CNN-based methods, including our MSR-net, preserve more details, such as texture features, than SAR-BM3D after despeckling. When the noise is strong, the SAR-BM3D algorithm causes blurring at the edges of objects.

ID-CNN performs well on image despeckling; however, after filtering by the network, salt-and-pepper noise appears in the image, which must subsequently be removed with nonlinear filters such as median or pseudo-median filtering. As the noise intensity increases, the salt-and-pepper noise increases gradually.

MSR-net retains spatial geometry features such as textures, lines, and feature points very well. Compared with the other three algorithms, MSR-net produces smoother homogeneous areas as well as a smaller loss of sharpness at edges and details, especially for strong speckle noise. In general, more detail is lost when the speckle noise is strong, and more local detail is preserved in the output images when the speckle noise is weak.

When the level of noise added to the test set is low, all the CNN-based approaches can achieve state-of-the-art results, so it is difficult to judge the merits of these algorithms by visual assessment alone. Evaluation indexes such as PSNR and SSIM are necessary in these circumstances. The PSNR, SSIM, and EFKR values of the above methods are listed in Tables 1–4, respectively. The bold number represents the optimal value in each row, while the underlined number denotes the suboptimal value. We also test MSR-net with only one scale, called the single-scale network (SS-net), in the experiments.
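As a reference for reading the tables, PSNR can be computed as follows; this is a minimal numpy sketch (SSIM and EFKR involve windowed statistics and edge detection and are omitted here for brevity):

```python
import numpy as np

def psnr(reference, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between a clean reference and a
    despeckled result; higher is better."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

For example, a constant error of 16 gray levels on an 8-bit image gives an MSE of 256 and therefore a PSNR of about 24 dB.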

**Table 1.** The peak signal to noise ratio (PSNR), structural similarity (SSIM), and edge feature keep ratio (EFKR) of test set with noise level L = 2.


The bold number represents the optimal value in each row, while the underlined number denotes the suboptimal value in each row.


**Table 2.** The PSNR, SSIM, and EFKR of test set with noise level L = 4.

The bold number represents the optimal value in each row, while the underlined number denotes the suboptimal value in each row.


**Table 3.** The PSNR, SSIM, and EFKR of test set with the noise level L = 8.

The bold number represents the optimal value in each row, while the underlined number denotes the suboptimal value in each row.



**Table 4.** The PSNR, SSIM, and EFKR of test set with the noise level L = 12.

The bold number represents the optimal value in each row, while the underlined number denotes the suboptimal value in each row.

Consistent with the results shown in Figures 8 and 9, our method has much better speckle-reduction ability than the non-learned SAR-BM3D approach at all noise levels. In addition, the advantage of MSR-net grows as the noise level increases. Taking airplane images as an example, the PSNR/SSIM/EFKR of our proposed MSR-net outperform SAR-BM3D by about 3.082 dB/0.047/0.2075, 2.785 dB/0.036/0.0895, 2.133 dB/0.022/0.0421, and 1.944 dB/0.019/0.0425 for L = 2, 4, 8, and 12, respectively.

Compared with the CNN-based methods, MSR-net still has an advantage when the noise is strong. When L = 2, the PSNR/SSIM/EFKR of MSR-net outperform ID-CNN and RED-Net by about 1.129 dB/0.014/0.1434, 2.356 dB/0.144/0.1689, 0.427 dB/0.008/0.2152 and 0.447 dB/0.012/0.0191, 0.369 dB/0.018/-0.006, 0.643 dB/0.011/0.0194 for building, freeway, and airplane, respectively. When L = 4, the PSNR/SSIM/EFKR of MSR-net outperform ID-CNN and RED-Net by about 0.051 dB/0.134/0.0562, 0.885 dB/0.052/0.0589, 0.299 dB/0.016/0.0722 and 0.004 dB/-0.002/0.0375, 0.072 dB/0.004/0.0139, 0.197 dB/0.002/0.0143 for building, freeway, and airplane, respectively.

In addition, we can see that our network does not always achieve the best test results; we consider that this may be related to the feature distribution of the images. Although MSR-net only obtains sub-optimal test results for a certain class of images, the difference from the best result is small, while for other classes the advantages of MSR-net are more considerable. For example, we can observe from Table 1 that the EFKR of RED-Net only outperforms MSR-net by about 0.006 for freeway images, while MSR-net outperforms RED-Net by about 0.0191 and 0.0194 for building and airplane images. When the noise level is L = 12, our network obtains only four best test values, which suggests that the advantage of the multi-scale network becomes smaller as the noise becomes weaker, as shown in Table 4.

MSR-net with a single scale (SS-net) also has very good speckle-reduction ability. When the noise intensity is weak, its performance is even better than the multi-scale version. For example, when L = 12, the PSNR/SSIM/EFKR of the freeway class are 28.787 dB/0.776/0.6247 and 28.893 dB/0.778/0.6197 for SS-net and MSR-net, respectively. Ultimately, comparison shows that the edge-detection results on the images are significantly improved after despeckling.

#### 4.3.2. Results on Real SAR Images

To further verify the speckle-reduction ability of our network on real SAR images, two SAR scenes are selected, as shown in Figures 10a and 11a; both images are imaging results of the spaceborne SAR RADARSAT-2.

It can be seen by comparing the subgraphs in Figures 10 and 11 that MSR-net generates the visually best output among all the results, retaining edge sharpness as well as detailed structural information while removing the speckle noise. After filtering by SAR-BM3D, the loss of edge sharpness in the original SAR image is obvious, and most lines and texture features are blurred. ID-CNN and RED-Net generate smooth results in homogeneous regions while maintaining textural features in the image. However, the red boxes show that they retain some texture features, but not as well as MSR-net. Although SS-net performs well in despeckling real SAR images, it is still worse than the multi-scale MSR-net.

The ENL results are shown in Table 5. We can observe from the table that MSR-net has an outstanding performance for real SAR image despeckling. For these four evaluation regions, the three highest scores and one second highest score of the ENL are obtained by our MSR-net.
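ENL is commonly defined for a homogeneous intensity region as the squared mean divided by the variance; assuming that definition, a minimal numpy sketch:

```python
import numpy as np

def enl(region):
    """Equivalent number of looks of a homogeneous intensity region:
    ENL = mean^2 / variance.  A smoother despeckled region has lower
    variance and therefore a higher ENL."""
    region = np.asarray(region, dtype=np.float64)
    return region.mean() ** 2 / region.var()
```

Because no clean reference exists for real SAR images, ENL is evaluated on manually selected homogeneous regions, as done for the four regions in Table 5.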

**Figure 10.** Test results of Real SAR images (RADARSAT-2). (**a**) Original, (**b**) SAR-BM3D, (**c**) ID-CNN, (**d**) RED-Net, (**e**) single scale network (SS-net), (**f**) MSR-net.


**Figure 11.** Test results of Real SAR images (RADARSAT-2). (**a**) Original, (**b**) SAR-BM3D, (**c**) ID-CNN, (**d**) RED-Net, (**e**) SS-net, (**f**) MSR-net.

**Table 5.** The equivalent number of looks (ENL) of real SAR regions.


The bold number represents the optimal value in each row.

To obtain FPKR results for real SAR data with the different methods, we first apply the same affine transformation to each image, as shown in Figure 12. SIFT is then applied to detect feature points and compute their descriptors. Key points are ultimately matched by minimizing the Euclidean distance between their SIFT descriptors. Generally, the ratio between distances is used [56] to obtain high matching accuracy. In the experiments, we select three ratios, and the FPKR results are shown in Tables 6 and 7. Comparing the FPKR of each image, MSR-net performs better than SAR-BM3D and also shows advantages over the other neural-network-based algorithms; specifically, MSR-net achieves the best test results in five of the six sets of experiments. This indicates that pre-processing SAR images with MSR-net can effectively enhance the usefulness of the SIFT algorithm for SAR images and improve its performance and efficiency.
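The distance-ratio criterion [56] keeps a match only when the nearest descriptor is markedly closer than the second nearest. A minimal numpy sketch of this ratio test (the descriptor arrays below are illustrative, not actual SIFT output):

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.7):
    """Match descriptors from image A to image B with the distance-ratio
    test: a match (i, j) is kept only when the nearest neighbour j is
    clearly closer than the second nearest, i.e. d1 < ratio * d2."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)  # Euclidean distances
        order = np.argsort(dists)
        nearest, second = order[0], order[1]
        if dists[nearest] < ratio * dists[second]:
            matches.append((i, int(nearest)))
    return matches
```

A smaller ratio threshold yields fewer but more reliable matches, which is why several ratios are evaluated in the tables.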

(**a**) Before despeckling.

(**b**) After despeckling.

**Figure 12.** Search results of feature point pairs for synthetic aperture radar (SAR) image before and after despeckling.

**Table 6.** Feature point keep ratio (FPKR) results before and after despeckling for real SAR image.


The value inside brackets is the number of feature points. The bold number represents the optimal value in each column.


The value inside brackets is the number of feature points. The bold number represents the optimal value in each column.

#### 4.3.3. Runtime Comparisons

To evaluate algorithm efficiency, we measure the runtime of each algorithm in a CPU implementation. The runtimes of the different methods on images of different sizes are listed in Table 8. We can see that the proposed denoiser is very competitive, although its structure is relatively complex. This good compromise between speed and performance of MSR-net is mainly attributed to two reasons. First, two pooling layers that perform spatial dimensionality reduction are embedded in MSR-net; each pooling layer with a 2 × 2 pooling kernel reduces the amount of data that must be processed by the subsequent convolution operations to 25% of its original size. Second, in contrast to transposed convolution, which increases the resolution of feature maps through padding and costly convolution operations, a sub-pixel unit, which up-samples feature maps by a periodic shuffling of pixel values, is adopted in our network.
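The periodic shuffling performed by the sub-pixel unit can be sketched in a few lines of numpy; the function below is our illustrative equivalent of a depth-to-space rearrangement, not the exact implementation:

```python
import numpy as np

def pixel_shuffle(feat, r):
    """Sub-pixel (periodic shuffling) upsampling: rearrange an
    (H, W, C*r*r) feature map into an (H*r, W*r, C) map using only a
    reshape and a transpose, with no convolution arithmetic at all."""
    h, w, crr = feat.shape
    c = crr // (r * r)
    out = feat.reshape(h, w, r, r, c)
    out = out.transpose(0, 2, 1, 3, 4)  # interleave rows, then columns
    return out.reshape(h * r, w * r, c)

# A 64x64 map with 4 channels becomes a 128x128 map with 1 channel.
feat = np.random.default_rng(0).random((64, 64, 4))
up = pixel_shuffle(feat, 2)
```

Since the shuffle is a pure memory rearrangement, its cost is negligible next to a transposed convolution of the same output resolution, which is consistent with the runtime advantage discussed above.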

**Table 8.** Runtime (s) of different methods of images with sizes of 256 × 256 and 512 × 512.

