#### *3.1. Architecture*

MSR-net is built from cascaded subnetworks, each containing three parts: an encoder, a decoder, and a ConvLSTM unit, as illustrated in Figure 2. Different levels of subnetworks correspond to different scales of inputs and outputs. The speckled image at the next scale and the output of the current subnetwork are combined as the input of the next-level subnetwork. In addition, a ConvLSTM unit with a single input and two outputs is embedded between the encoder and the decoder. Specifically, one output is connected to the decoder, and the other, which represents the hidden state, is connected to the ConvLSTM unit of the next subnetwork.

**Figure 2.** The architecture of the Multi-Scale Recurrent network (MSR-net). The yellow arrow denotes a single up-sampling step.

Different from general cascaded networks such as [34], which use three stages of independent subnetworks, in MSR-net all state features flow across scales and share the same training parameters. Owing to the multi-scale recurrence and the parameter-sharing strategy, the number of parameters to be trained in MSR-net is only 1/3 of that in [34].

For each subnetwork, the output **F**^i of the encoder **Net**_en, which takes the speckled image and the despeckled result up-sampled from the previous scale as input, can be defined as:

$$\mathbf{F}^{i} = \mathbf{Net}\_{en}\left( \mathbf{I}\_{in}^{i},\, up\left( \mathbf{I}\_{o}^{i+1} \right); \Theta\_{en} \right), \tag{7}$$

where **I**^i_in is the input image with speckle noise and Θ_en denotes the weight parameters of **Net**_en. *i* = 0, 1, 2, ... is the scale index: the larger *i* is, the lower the resolution; *i* = 0 represents the original resolution, and *i* = 1 indicates down-sampling once. **I**^(i+1)_o is the output of the previous, coarser scale. *up*(·) is the operator that adapts features or images from the (*i* + 1)-th to the *i*-th scale, implemented by bilinear interpolation.

To exploit the information contained in feature maps of different scales, a convolutional LSTM module is embedded between the encoder and the decoder. The ConvLSTM can be defined as:

$$\mathbf{h}^{i}, \mathbf{g}^{i} = \mathrm{ConvLSTM}\left( up(\mathbf{h}^{i+1}), \mathbf{F}^{i}; \Theta\_{LSTM} \right), \tag{8}$$

where Θ_LSTM is the set of parameters of the ConvLSTM, **h**^i is the hidden state passed to the next scale, **h**^(i+1) is the hidden state from the previous scale, and **g**^i is the output at the current scale. Finally, with Θ_de denoting the parameters of the decoder **Net**_de, the output can be defined as:

$$\mathbf{I}\_o^i = \mathbf{Net}\_{de}\left( \mathbf{g}^i; \Theta\_{de} \right). \tag{9}$$
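The coarse-to-fine recurrence of Equations (7)–(9) can be sketched as follows. The real **Net**_en, ConvLSTM, and **Net**_de are learned convolutional modules, so the scalar stand-ins below (and the nearest-neighbor `up`, replacing the paper's bilinear interpolation) are purely illustrative assumptions; only the data flow and the parameter sharing across scales mirror the text:

```python
import numpy as np

def up(x):
    # Nearest-neighbor 2x up-sampling; the paper uses bilinear interpolation.
    return np.kron(x, np.ones((2, 2)))

def net_en(speckled, prev_out_up, theta):
    # Stand-in encoder, Eq. (7): fuses the speckled input with the
    # up-sampled output of the coarser scale via one shared weight.
    return theta * (speckled + prev_out_up)

def conv_lstm(h_prev_up, feat, theta):
    # Stand-in ConvLSTM, Eq. (8): returns (hidden state h, output g).
    h = np.tanh(theta * (feat + h_prev_up))
    return h, h

def net_de(g, theta):
    # Stand-in decoder, Eq. (9).
    return theta * g

# One set of weights shared by all scales (the parameter-sharing strategy).
theta_en, theta_lstm, theta_de = 0.5, 1.0, 1.0

# Image pyramid: i = 2 is the coarsest scale, i = 0 the original resolution.
rng = np.random.default_rng(0)
pyramid = {i: rng.random((64 >> i, 64 >> i)) for i in (2, 1, 0)}

out = np.zeros_like(pyramid[2])   # placeholder I_o^{i+1} at the coarsest scale
h = np.zeros_like(pyramid[2])     # placeholder h^{i+1}
for i in (2, 1, 0):               # coarse-to-fine recurrence
    if i < 2:
        out, h = up(out), up(h)   # adapt previous scale to this resolution
    feat = net_en(pyramid[i], out, theta_en)   # Eq. (7)
    h, g = conv_lstm(h, feat, theta_lstm)      # Eq. (8)
    out = net_de(g, theta_de)                  # Eq. (9)

print(out.shape)  # (64, 64): despeckled output at the original resolution
```

Note that the same `theta_*` values are reused at every scale, which is exactly why the recurrent design needs only a third of the parameters of three independent subnetworks.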

#### *3.2. Single Scale Network*

Details of MSR-net are introduced through the single-scale model in this section. As shown in Figure 3, the single-scale model consists of two parts: an encoder and a decoder. The encoder includes three building blocks: a convolutional layer, a pooling layer, and Res blocks.

**Figure 3.** Architecture of the single-scale model.

The convolution unit performs a convolution operation followed by non-linear activation. Increasing the number of convolutional layers can enhance the feature-extraction ability [35,36], so multiple Res blocks are added after the convolutional layer when designing the network. Unlike the convolution unit, the skip connection proposed by He et al. [37] is built into this block, which effectively avoids exploding or vanishing gradients and increases training speed.

In the despeckling networks designed in [6,25,27], the inputs and outputs of the convolutional layers keep the same size, which increases the amount of computation to a certain extent. We reduce the amount of calculation by decreasing the dimensions of the feature maps, i.e., by adopting pooling layers. We choose max pooling with a 2 × 2 kernel in this layer. It should be noted that the pooling layer could also be replaced by strided convolutions [38].
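As a minimal illustration of how non-overlapping 2 × 2 max pooling halves each spatial dimension (and thus cuts the pixel count fed to later layers to a quarter), here is a NumPy sketch on a single-channel feature map:

```python
import numpy as np

def max_pool_2x2(x):
    # Non-overlapping 2x2 max pooling: halves width and height, so the
    # feature map passed on carries 1/4 of the original pixels.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feat = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool_2x2(feat)
print(pooled)  # [[ 5.  7.]
               #  [13. 15.]]
```

Each output value is the maximum of one 2 × 2 window; a strided convolution would instead learn the down-sampling weights.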

The decoder consists of convolutional layers and sub-pixel units. After down-sampling twice through the pooling layers, the width and height of the feature map entering the decoder are only 1/4 of those of the original image. Therefore, an up-sampling operation is required to make the network output the same size as the input. However, up-sampling operations such as the transposed convolution used in [39,40] are computationally expensive and cause unwanted checkerboard artifacts [41,42]. A typical checkerboard pattern is shown in Figure 4. To reduce the network runtime and avoid these artifacts, the sub-pixel convolution described in Section 3.3 is used to implement the up-sampling operation.

**Figure 4.** Diagram of checkerboard artifacts. (**a**) original image without noise; (**b**) despeckled image with checkerboard artifacts caused by transposed convolution.

#### *3.3. Sub-Pixel Convolution*

Sub-pixel convolution, also called pixel shuffle, is an upscaling method first proposed in [43] for image super-resolution tasks. Different from up-sampling methods commonly used in deep learning, such as transposed convolution and fractionally strided convolution, sub-pixel convolution adopts a channel-to-space approach that achieves spatial up-scaling by rearranging pixels from multiple channels of the feature map, as illustrated in Figure 5.

**Figure 5.** The sub-pixel convolutional operation on the input feature maps, with an upscaling factor *r* = 2 and output channel count *C* = 1.

For a sub-pixel unit with *r*-times up-sampling, denote its output image by **I**^up ∈ R^(W×H×C), where W, H, and C are the width, height, and number of channels of **I**^up. The sub-pixel convolution operation is defined as:

$$\mathbf{I}^{up}(x, y, c) = \mathbf{F}\left( \lfloor x/r \rfloor, \lfloor y/r \rfloor,\; C \cdot r \cdot \mathrm{mod}(y, r) + C \cdot \mathrm{mod}(x, r) + c \right), \tag{10}$$

where **I**^up(x, y, c) is the value of the pixel at position (x, y) in the c-th channel, and **F** ∈ R^(W/r × H/r × Cr²) is the input of the sub-pixel unit. ⌊·⌋ is the floor function, which maps a real number to the greatest integer less than or equal to it [43]. The sub-pixel convolution rearranges the elements of **F** into the output **I**^up by increasing the horizontal and vertical pixel counts and decreasing the channel count. For example, when a 64 × 64 × 4 feature map is passed through a sub-pixel unit with r = 2, an output of shape 128 × 128 × 1 is obtained.
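A direct NumPy transcription of Equation (10) (assuming zero-based indices and writing C for the output channel count) reproduces the 64 × 64 × 4 → 128 × 128 × 1 example:

```python
import numpy as np

def pixel_shuffle(F, r):
    # Eq. (10): rearrange an (H/r, W/r, C*r^2) feature map into (H, W, C)
    # by interleaving each group of r^2 channels into an r x r spatial block.
    h, w, cr2 = F.shape
    C = cr2 // (r * r)
    out = np.zeros((h * r, w * r, C), dtype=F.dtype)
    for y in range(h * r):
        for x in range(w * r):
            for c in range(C):
                out[y, x, c] = F[y // r, x // r,
                                 C * r * (y % r) + C * (x % r) + c]
    return out

# The example from the text: 64 x 64 x 4 becomes 128 x 128 x 1 with r = 2.
F = np.random.default_rng(0).random((64, 64, 4))
up_img = pixel_shuffle(F, 2)
print(up_img.shape)  # (128, 128, 1)
```

The explicit loops keep the correspondence with Equation (10) visible; in practice the same rearrangement is done with a single reshape-and-transpose (e.g., PyTorch's `nn.PixelShuffle`), which involves no arithmetic and is why the operation is cheap compared with transposed convolution.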

#### *3.4. Proposed Evaluation Criterion*

In this paper, the peak signal-to-noise ratio (PSNR) [44], structural similarity (SSIM) [45], equivalent number of looks (ENL) [46], and two newly proposed evaluation criteria, the edge feature keep ratio (EFKR) and the feature point keep ratio (FPKR), are used to evaluate the performance of despeckling methods.

PSNR is the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation, which has been widely used in quality assessment of reconstructed images. SSIM is a metric of image similarity. ENL can describe the smoothness of regions, and no reference image is needed for its calculation, so it can be used to evaluate the performance of despeckling methods for real SAR images.
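PSNR and ENL are straightforward to compute from their definitions. The sketch below assumes an 8-bit peak value of 255 for PSNR and a manually chosen homogeneous region for ENL (SSIM is omitted for brevity):

```python
import numpy as np

def psnr(ref, img, peak=255.0):
    # PSNR in dB: ratio of the peak signal power to the mean squared error.
    mse = np.mean((ref.astype(float) - img.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def enl(region):
    # Equivalent number of looks over a homogeneous region: mean^2 / variance.
    # No clean reference image is needed, so it applies to real SAR images.
    region = region.astype(float)
    return region.mean() ** 2 / region.var()

clean = np.full((32, 32), 100.0)
noisy = clean + np.random.default_rng(0).normal(0.0, 10.0, clean.shape)
psnr_val = psnr(clean, noisy)   # close to 10*log10(255^2 / 10^2), about 28 dB
enl_val = enl(noisy)            # close to 100^2 / 10^2 = 100
print(psnr_val, enl_val)
```

Additive Gaussian noise is used here only to make the arithmetic transparent; speckle is multiplicative, but the two formulas are applied the same way.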

#### 3.4.1. Edge Feature Keep Ratio and Feature Point Keep Ratio

PSNR and SSIM can effectively evaluate the overall performance of despeckling methods. Specifically, PSNR measures noise level or image distortion, SSIM measures the similarity between two images, and ENL measures the degree of region smoothing. They are not, however, capable of directly evaluating how well edges and typical features are retained in despeckling tasks. In this section, we propose two evaluation criteria that compensate for these deficiencies, i.e., the edge feature keep ratio (EFKR) and the feature point keep ratio (FPKR).

(a) **EFKR:** from the edge detection results shown in Figure 6, we make the following observations: (1) the edge outline of the speckled image is blurred, and there are discrete points in the image; (2) after despeckling, the edge outline is clear and there are no discrete points, in agreement with the edge detection results of the clean image. Motivated by this phenomenon, we design EFKR, a quantitative criterion for edge retention based on counting edge pixels: edge detection is applied to both the despeckled image and the reference image, and the edge pixels common to both maps are counted.


**Figure 6.** Edge detection results. Images from left to right correspond to the clean image, the image with speckle noise, and the image after despeckling.

The ratio of the number of common edge pixels to the number of edge pixels in the reference image is the edge feature keep ratio, defined as:

$$EFKR = \frac{sum(\mathbf{edge}(X) \& \mathbf{edge}(Y))}{sum(\mathbf{edge}(Y))},\tag{11}$$

where & and *sum* denote the bit-wise conjunction and summation operations, respectively, and **edge**(·) represents edge detection.
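Given binary edge maps, Equation (11) reduces to a few NumPy operations. The edge maps below are toy inputs standing in for the output of a real edge detector (e.g., Canny) applied to the despeckled image X and the clean reference Y:

```python
import numpy as np

def efkr(edge_x, edge_y):
    # Eq. (11): fraction of the reference edge pixels edge(Y) that are
    # also present in the edge map edge(X) of the despeckled image.
    return np.logical_and(edge_x, edge_y).sum() / edge_y.sum()

# Toy binary edge maps standing in for real edge-detector output.
edge_clean = np.zeros((8, 8), dtype=bool)
edge_clean[3, :] = True                 # one horizontal edge of 8 pixels
edge_despeckled = edge_clean.copy()
edge_despeckled[3, 6:] = False          # despeckling lost 2 edge pixels
ratio = efkr(edge_despeckled, edge_clean)
print(ratio)  # 0.75
```

An ideal despeckler that preserves every reference edge pixel scores 1.0; over-smoothing that erases edges pushes EFKR toward 0.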

(b) **FPKR:** for real SAR images, ENL can only evaluate the smoothing level, not the retention of typical features such as edges and corners. SIFT [51] finds feature points across different scales and obtains a descriptor for each feature. The key points found by SIFT are usually corner points, edge points, bright spots in dark areas, and dark spots in bright areas; these points are robust to illumination, affine, and other transformations. A SIFT-based registration method first uses SIFT to obtain the feature points and descriptors of the image to be registered and of the reference image, then matches the feature points according to their descriptors to obtain one-to-one corresponding feature point pairs. Finally, the transformation parameters are calculated and the image registration is carried out.

For SAR images, feature points and descriptors extracted at the bright spots of speckle are redundant, which reduces the efficiency and accuracy of the subsequent search for matching points. Based on this phenomenon, we design FPKR, an evaluation criterion targeting key feature points. We first apply an affine transformation to the image under evaluation, then use SIFT to find the feature points in the images before and after the transformation, and finally match the feature points. The better the despeckling, the more typical features are preserved; and the more distinctive the SIFT descriptors, the greater the difference between descriptors of different features, so more valid feature point pairs can be found efficiently. FPKR is defined as:

$$FPKR = \frac{N\_{match}(X, X\_t)}{\min(N(X), N(X\_t))},\tag{12}$$

where *N*(*X*) and *N*(*X_t*) are the numbers of SIFT key points detected before and after the affine transformation, and *N*_match(*X*, *X_t*) denotes the number of matched point pairs used to calculate the transformation parameters.
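The criterion can be illustrated with synthetic descriptors. The nearest-neighbor matcher and the 0.5 distance threshold below are simplifying assumptions standing in for SIFT key-point detection and descriptor matching:

```python
import numpy as np

def match_count(desc_a, desc_b, thresh=0.5):
    # Toy nearest-neighbor matcher: a key point counts as matched when its
    # closest descriptor in the other set lies within `thresh` (stand-in
    # for SIFT descriptor matching).
    return sum(np.linalg.norm(desc_b - d, axis=1).min() < thresh
               for d in desc_a)

def fpkr(desc_x, desc_xt):
    # Eq. (12): matched pairs over the smaller of the two key-point counts.
    return match_count(desc_x, desc_xt) / min(len(desc_x), len(desc_xt))

rng = np.random.default_rng(1)
desc_x = rng.normal(size=(10, 8))   # descriptors before the affine transform
# After the transform, 8 of the 10 key points survive, with slightly
# perturbed descriptors.
desc_xt = desc_x[:8] + rng.normal(scale=0.01, size=(8, 8))
score = fpkr(desc_x, desc_xt)
print(score)
```

A high FPKR indicates that the despeckled image yields distinctive, stable key points; speckle-induced spurious points that cannot be re-matched after the transformation drive the score down.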

#### **4. Experiments and Results**
