**1. Introduction**

Synthetic aperture radar (SAR), owing to its all-weather, day-and-night operation, has been widely applied in microwave remote sensing, such as topographic mapping, military target reconnaissance, and natural disaster monitoring [1,2]. SAR achieves high range resolution by exploiting the pulse compression technique and high azimuth resolution by using the motion of the radar platform to form a virtual synthetic aperture along track [3,4]. However, because of the coherent imaging mechanism of SAR, speckle noise appears in the imaging results, which degrades image quality and readability. Meanwhile, the presence of speckle limits the effectiveness of common optical image processing methods when they are applied to SAR images [5]. It thus hinders further understanding and interpretation of SAR images: it increases the difficulty of extracting roads, farmland, and buildings, complicates spatial feature extraction in image registration, and reduces the accuracy of detecting and classifying objects such as vehicles and ships [6]. Speckle suppression is, therefore, an important task in SAR image post-processing.

To improve the quality of SAR images, various speckle suppression methods have been proposed, including multi-look processing during imaging and image filtering after imaging [1,7]. Multi-look processing divides the whole effective synthetic aperture length into multiple segments; the resulting incoherent sub-views are then superimposed to obtain images with a higher signal-to-noise ratio (SNR) [8]. However, multi-look processing reduces the utilization of the Doppler bandwidth, which lowers the spatial resolution of the imaging results and therefore cannot meet high-resolution requirements [9].

The filtering methods fall mainly into three categories: spatial filtering, transform domain filtering, and non-local means filtering. Median filtering and mean filtering are the earliest spatial filtering methods in traditional digital image processing. Although these two methods can suppress speckle to a certain extent, they blur the image and lose object edge information. Later, the Lee filter [10], Frost filter [11], and Kuan filter [12] were designed for speckle suppression in SAR images. Based on the multiplicative speckle model, the Lee filter selects a fixed local window in the image, assuming that the prior mean and variance can be computed from the local region [10]. This method is computationally light, but the choice of the local window size has a great influence on the result, and the details and edge information of the image may be lost [13]. The Frost filter assumes that the SAR image is a stationary random process corrupted by multiplicative speckle noise and uses the minimum mean square error criterion to estimate the true image [11]. Because real SAR images do not fully satisfy this hypothesis, images processed by this method exhibit blurred edges in detail-rich areas. The Kuan filter applies sliding windows to estimate the local statistical properties of the image and then replaces the global characteristics of the image with these local statistics [12].
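As an illustration of the local-window idea behind the Lee filter, the following numpy sketch computes the classic local MMSE estimate Î = Ī + k(I − Ī). The window size and noise variance below are illustrative choices, and the gain formula is a common simplified variant rather than the exact form in [10]:

```python
import numpy as np

def lee_filter(img, win=7, noise_var=0.05):
    """Minimal Lee-style filter: local MMSE estimate under the
    multiplicative speckle model (illustrative sketch)."""
    pad = win // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            w = padded[i:i + win, j:j + win]
            mean, var = w.mean(), w.var()
            # gain k -> 0 in homogeneous areas (pure smoothing),
            # k -> 1 near strong structure (pixel kept unchanged)
            k = max(var - noise_var * mean ** 2, 0.0) / (var + 1e-12)
            out[i, j] = mean + k * (img[i, j] - mean)
    return out
```

In a flat region the local variance is explained entirely by speckle, so the gain collapses to zero and the output is the local mean; near edges the gain approaches one, which is why the Lee filter preserves structure better than plain mean filtering.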

Representative transform domain filtering methods are the threshold filtering method and the multi-resolution analysis method based on the wavelet transform. Donoho et al. first proposed hard-threshold and soft-threshold denoising based on the wavelet transform [14]. Afterwards, Bruce and Gao et al. proposed semi-soft threshold functions to improve on the hard- and soft-threshold methods [15,16], solving the discontinuity of the hard threshold function and the constant bias that the soft threshold function introduces into the reconstructed signal. He et al. later proposed a wavelet Markov model [17], which achieved significant results in SAR image denoising. However, the wavelet transform does not handle two- and higher-dimensional images well, because a one-dimensional wavelet basis cannot provide an optimal representation of a two-dimensional function. In recent years, the emergence of multi-scale geometric analysis has remedied this defect, and its tools are abundant, such as the Ridgelet, Curvelet, Brushlet, and Contourlet transforms [18–20]. These transform domain filtering methods have good speckle suppression capability, preserving image details and edge information while speckle is removed. However, these algorithms operate on local characteristics, are complex, require a large amount of computation, and easily produce pseudo-Gibbs artifacts.
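For reference, the hard and soft threshold functions discussed above can be written in a few lines of numpy. In practice they are applied to the wavelet detail coefficients; the transform itself is omitted here, and the test values are illustrative:

```python
import numpy as np

def hard_threshold(c, t):
    # keep coefficients whose magnitude exceeds t, zero the rest
    # (discontinuous at |c| = t, which causes ringing artifacts)
    return np.where(np.abs(c) > t, c, 0.0)

def soft_threshold(c, t):
    # shrink all magnitudes toward zero by t
    # (continuous, but every kept coefficient is biased by t)
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)
```

The semi-soft schemes of [15,16] interpolate between these two, keeping the continuity of soft thresholding while reducing its constant bias on large coefficients.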

The non-local means (NLM) filtering method [21] proposed by Buades et al. searches the whole image for similar image blocks and replaces noisy regions with weighted averages of similar texture regions to achieve denoising. The authors of [22,23] applied NLM to SAR image denoising, effectively eliminating speckle. Kervrann et al. [24] further improved NLM with a new adaptive algorithm that modified its similarity measurement parameters. Dabov et al. [22] proposed the Block-Matching and 3D filtering (BM3D) algorithm, which applies the local linear minimum mean square error (MMSE) criterion and the wavelet transform, combining the non-local means idea with transform domain filtering. It remains one of the best denoising methods at present. However, this algorithm requires a large number of search operations, at the cost of heavy computation and low efficiency.
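A minimal sketch of the NLM idea follows: each pixel is replaced by a weighted mean of pixels whose surrounding patches look similar. The patch size, search radius, and filtering parameter `h` below are illustrative, and the exhaustive search makes this naive version slow, which is exactly the cost noted above:

```python
import numpy as np

def nlm_denoise(img, patch=3, search=7, h=0.1):
    """Naive non-local means: weighted average over a search
    window, weights from patch similarity (illustrative sketch)."""
    pr, sr = patch // 2, search // 2
    padded = np.pad(img, pr + sr, mode="reflect")
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            ci, cj = i + pr + sr, j + pr + sr
            ref = padded[ci - pr:ci + pr + 1, cj - pr:cj + pr + 1]
            num = den = 0.0
            for di in range(-sr, sr + 1):
                for dj in range(-sr, sr + 1):
                    ni, nj = ci + di, cj + dj
                    cand = padded[ni - pr:ni + pr + 1, nj - pr:nj + pr + 1]
                    # similar patches get weights near 1, dissimilar near 0
                    w = np.exp(-np.mean((ref - cand) ** 2) / h ** 2)
                    num += w * padded[ni, nj]
                    den += w
            out[i, j] = num / den
    return out
```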

In recent years, deep convolutional neural networks (CNNs) have developed rapidly, providing a new approach to SAR image despeckling. Wang et al. constructed an image despeckling convolutional neural network (ID-CNN) in [25]. ID-CNN directly estimates the speckle distribution and removes the estimated speckle from the image to obtain a clean image. Differently from ID-CNN, Yue et al. [6] combined a statistical model with a CNN and proposed a framework that does not require reference images and can work in an unsupervised way when trained with real SAR images. Bai et al. [26] added a fractional total variation loss to the loss function to remove obvious noise while maintaining texture details. The authors of [27] proposed a CNN framework based on dilated convolutions called SAR-DRN. This network enlarges the receptive field via dilated convolutions and further improves performance by exploiting skip connections and a residual learning strategy, achieving state-of-the-art results in both quantitative and visual assessments.

In this study, we design an end-to-end multi-scale recurrent network (MSR-net) for SAR image despeckling. Unlike [9,25–27], which use a CNN only to estimate the speckle distribution and then remove speckle with an additional division or subtraction operation, our network learns the distribution characteristics of the speckle while automatically performing speckle suppression to output clean images. The proposed network is based on the encoder–decoder architecture. To improve efficiency, the decoder up-samples feature maps with a sub-pixel unit instead of a deconvolutional layer. In addition, we apply a multi-scale recurrent strategy: resized inputs at different scales are fed to the network, and all scales share the same weight parameters. The network performance can thus be improved without increasing the number of parameters, and the output is friendly to optical image processing algorithms. A convolutional LSTM unit transmits information between the scales. Although our network, like the networks based on noise output, is fully convolutional, MSR-net contains pooling layers that reduce the dimensionality of the network and further reduce the amount of computation to a great extent. Lastly, we propose two evaluation criteria based on image processing methods.
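The sub-pixel unit mentioned above rearranges r² feature channels into an r-times-larger spatial grid instead of deconvolving. A minimal numpy sketch of this pixel-shuffle operation (the channel layout follows the common deep learning convention and is illustrative, not the exact MSR-net configuration):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) feature map into (C, H*r, W*r):
    sub-pixel upsampling used in place of a deconvolutional layer."""
    c, h, w = x.shape
    assert c % (r * r) == 0
    out_c = c // (r * r)
    x = x.reshape(out_c, r, r, h, w)   # split channels into r x r sub-pixel offsets
    x = x.transpose(0, 3, 1, 4, 2)     # interleave: (out_c, h, r, w, r)
    return x.reshape(out_c, h * r, w * r)
```

Because the rearrangement is a pure memory permutation, up-sampling costs no multiplications, which is the efficiency motivation for preferring it over deconvolution.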

The paper consists of six parts. In Section 2, we analyze the speckle in SAR images and briefly introduce CNNs and the convolutional LSTM. After presenting the framework of the proposed MSR-net in Section 3, the experimental results and discussion are given in Sections 4 and 5. The last section summarizes the paper.

#### **2. Review of Speckle Model and Neural Network**

#### *2.1. Speckle Model of SAR Images*

The multiplicative model is usually used to describe speckle noise [28], and the formula is defined as:

$$I = p\_s \cdot n, \tag{1}$$

where *I* is the image intensity, *ps* is a constant denoting the average scattering coefficient of the object or ground, and *n* denotes the speckle, which is statistically independent of *ps*.

For a homogeneous SAR image, the single-look intensity *I* obeys a negative exponential distribution [29] and its probability density function (PDF) is defined as:

$$p(I) = \frac{1}{p\_s} \exp\left(-\frac{I}{p\_s}\right). \tag{2}$$
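Equation (2) can be checked numerically: single-look intensity is exponential with mean *ps*, so its standard deviation equals its mean, which is why single-look speckle is so severe. A short numpy sketch (the value of `p_s` is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
p_s = 2.5                                   # illustrative mean backscatter
# single-look intensity: exponential with mean p_s (Equation (2))
I = rng.exponential(scale=p_s, size=100_000)
mean, std = I.mean(), I.std()               # both close to p_s: 100% relative fluctuation
```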

The multi-look processing methods are usually used to improve the quality of SAR images by diminishing the speckle noise. If the Doppler bandwidth is divided into *L* sub-bandwidths during imaging, and *Ii* is the single-look intensity image corresponding to each sub-bandwidth, the result of multi-look processing is:

$$I = \frac{1}{L} \sum\_{i=1}^{L} I\_i, \tag{3}$$

where *L* is the number of looks. If *Ii* obeys the exponential distribution in Equation (2), then after multi-look averaging, the L-look intensity image follows the Gamma distribution [1], and the PDF is:

$$p\left(I\right) = \frac{L}{p\_s \Gamma(L)} \left(\frac{LI}{p\_s}\right)^{L-1} \exp\left(-\frac{LI}{p\_s}\right), \quad L \ge 1,\tag{4}$$

where Γ(*L*) denotes the Gamma function. The PDF of the L-look speckle *n* can be obtained by applying the product model of Equation (1) to Equation (4),

$$p(n) = \frac{L^L n^{L-1} \exp\left(-Ln\right)}{\Gamma(L)}, L \ge 1. \tag{5}$$
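The chain from Equations (3)–(5) can also be verified empirically: averaging *L* independent exponential single-look images yields a Gamma-distributed intensity whose speckle term has mean 1 and variance 1/*L*. A numpy sketch (the values of `L` and `p_s` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
L, p_s = 4, 3.0                        # illustrative looks and mean backscatter
# Equation (3): average L independent single-look exponential images
I_single = rng.exponential(scale=p_s, size=(L, 200_000))
I_multi = I_single.mean(axis=0)        # L-look intensity, Gamma distributed (Eq. (4))
n = I_multi / p_s                      # speckle term from Equation (1)
n_mean, n_var = n.mean(), n.var()      # ~1 and ~1/L: more looks, weaker speckle
```

The 1/*L* variance is the quantitative form of the resolution-versus-speckle trade-off discussed in the introduction: each doubling of the look count halves the speckle variance but also halves the usable Doppler bandwidth per look.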

#### *2.2. Convolutional Long Short-Term Memory*

Convolutional neural networks (CNNs) have powerful capabilities for extracting spatial features and can automatically learn general features through back-propagation driven by the dataset [30,31]. However, they cannot process sequence signals directly, because the inputs are independent of each other and information flows strictly in one direction from layer to layer. To solve this problem, we introduce the convolutional long short-term memory (ConvLSTM) [32] into the network; its inner structure is shown in Figure 1. As a special kind of RNN, the long short-term memory (LSTM) network has an internal hidden memory that allows the model to store information about its past computations and to learn long-term dependencies [33]. Different from the standard LSTM, all feature variables of the ConvLSTM, including the input *Xt*, cell state *Ct*, the output of the forget gate *Ft*, input gate *it*, and output gate *Ot*, are three-dimensional tensors whose last two dimensions are the spatial width and height. The key equations of the ConvLSTM are defined as:

$$\begin{aligned} F\_t &= \sigma(\mathcal{W}\_f[H\_{t-1}, \mathcal{X}\_t] + b\_f), \\ i\_t &= \sigma(\mathcal{W}\_i[H\_{t-1}, \mathcal{X}\_t] + b\_i), \\ \tilde{\mathcal{C}}\_t &= \tanh(\mathcal{W}\_\mathcal{C}[H\_{t-1}, \mathcal{X}\_t] + b\_c), \\ \mathcal{C}\_t &= F\_t \ast \mathcal{C}\_{t-1} + i\_t \ast \tilde{\mathcal{C}}\_t, \\ O\_t &= \sigma(\mathcal{W}\_o[H\_{t-1}, \mathcal{X}\_t] + b\_o), \\ H\_t &= O\_t \ast \tanh(\mathcal{C}\_t) \end{aligned} \tag{6}$$

where "∗" and "*σ*" denote the Hadamard product and the logistic sigmoid function, respectively. *Ft* controls which state information from the previous step is discarded, and *it* governs the current state update, i.e., the candidate state *C*˜*t*. *Wf* , *Wi*, *Wc*, and *Wo* represent the weights of each neural unit, with *bf* , *bi*, *bc*, and *bo* denoting the corresponding biases.

**Figure 1.** Inner structure of convolutional long short-term memory (ConvLSTM) [32].
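A minimal numpy sketch of one ConvLSTM step following Equation (6). The naive "same"-padding convolution helper (deep learning cross-correlation convention) and the small tensor shapes are illustrative, not the configuration used in MSR-net:

```python
import numpy as np

def conv2d_same(x, w):
    """'Same' 2-D convolution (DL cross-correlation convention) of a
    (C_in, H, W) tensor with a (C_out, C_in, k, k) kernel, zero padding."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1], x.shape[2]
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(c_in):
            for di in range(k):
                for dj in range(k):
                    out[o] += w[o, i, di, dj] * xp[i, di:di + h, dj:dj + wd]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(X, H, C, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One ConvLSTM step following Equation (6); each gate convolves
    the channel-wise concatenation [H_{t-1}, X_t]."""
    Z = np.concatenate([H, X], axis=0)
    F = sigmoid(conv2d_same(Z, Wf) + bf)          # forget gate
    I = sigmoid(conv2d_same(Z, Wi) + bi)          # input gate
    C_tilde = np.tanh(conv2d_same(Z, Wc) + bc)    # candidate cell state
    C_new = F * C + I * C_tilde                   # Hadamard-gated state update
    O = sigmoid(conv2d_same(Z, Wo) + bo)          # output gate
    H_new = O * np.tanh(C_new)
    return H_new, C_new
```

Because the gates are convolutions rather than dense layers, the hidden state keeps its spatial layout, which is what lets MSR-net pass per-pixel state between the processing scales.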

#### **3. Proposed Method**

An end-to-end network, MSR-net, for SAR image despeckling is proposed in this paper. Rather than using an additional division operation [25,26] or subtraction operation [9,27], our network automatically performs despeckling and generates a clean image. In this section, we first introduce the multi-scale recurrent architecture and then describe the specific details through a single-scale model.
