**Synthetic Aperture Radar (SAR) Meets Deep Learning**

Editors

**Tianwen Zhang Tianjiao Zeng Xiaoling Zhang**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editors*

Tianwen Zhang, University of Electronic Science and Technology of China, China

Tianjiao Zeng, University of Hong Kong, Hong Kong

Xiaoling Zhang, University of Electronic Science and Technology of China, China

*Editorial Office*: MDPI, St. Alban-Anlage 66, 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Remote Sensing* (ISSN 2072-4292) (available at: https://www.mdpi.com/journal/remotesensing/special_issues/synthetic_aperture_radar_meets_deep_learning).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-6382-4 (Hbk) ISBN 978-3-0365-6383-1 (PDF)**

© 2023 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**



## **Preface to "Synthetic Aperture Radar (SAR) Meets Deep Learning"**

Synthetic aperture radar (SAR) is an important active microwave imaging sensor whose all-day and all-weather working capacity gives it an important place in the remote sensing community. Since the United States launched the first SAR satellite, SAR has received much attention in applications such as geological exploration, topographic mapping, disaster forecasting, and traffic monitoring. It is therefore valuable and meaningful to study SAR-based remote sensing applications.

In recent years, deep learning, represented by convolutional neural networks, has driven significant progress in the computer vision community, e.g., in face recognition, autonomous driving, and the Internet of Things (IoT). Deep learning enables computational models with multiple processing layers to learn data representations at multiple levels of abstraction, which can greatly improve the performance of various applications. Today, scholars are realizing the potential value of deep learning in remote sensing. Many remote sensing application techniques incorporate deep learning, e.g., target and oil spill detection, traffic surveillance, topographic mapping, AI-based SAR imaging algorithm updating, coastline surveillance, and marine fisheries management.

Interestingly, when SAR meets deep learning, how to use this advanced technology correctly, and how to achieve the best performance from this "black-box" model, need careful consideration. Notably, deep learning uncritically abandons traditional hand-crafted features and relies excessively on the abstract features of deep networks. Is this reasonable? Can the abstract features of deep networks fully represent real SAR imagery? Should traditional hand-crafted features, backed by mature theories and elaborate techniques, be abandoned completely? These questions are worth pondering when applying deep learning techniques in the SAR remote sensing community. In general, deep learning methods are proposed for natural optical images, whose imaging mechanisms differ greatly from those of SAR.

When SAR meets deep learning, should SAR adapt itself to deep learning, or should deep learning adapt itself to SAR? The relationship between the two needs further exploration and research. Furthermore, is deep learning really suitable for SAR? The number of available SAR samples is far smaller than that of natural optical images. In this case, can we ensure that deep networks are able to thoroughly learn SAR mechanisms?

This Special Issue provides a platform for researchers to address these significant challenges and present innovative, cutting-edge research results on applying deep learning to SAR, in various manuscript types, e.g., articles, letters, reviews and technical reports.

> **Tianwen Zhang, Tianjiao Zeng, and Xiaoling Zhang** *Editors*

### *Editorial* **Synthetic Aperture Radar (SAR) Meets Deep Learning**

**Tianwen Zhang 1, Tianjiao Zeng 2,3 and Xiaoling Zhang 1,\***


#### **1. Introduction**

Synthetic aperture radar (SAR) is an important active microwave imaging sensor. Its all-day and all-weather working capacity gives it an important role in the remote sensing community. Since the launch of the first SAR satellite by the United States [1], SAR has received extensive attention in the remote sensing community [2], e.g., in geological exploration [3], topographic mapping [4], disaster forecasting [5,6], and marine traffic management [7–10]. Therefore, it is valuable and meaningful to study SAR-based remote sensing applications [11].

In recent years, with the rapid development of artificial intelligence, deep learning (DL) [12] has been applied across many domains, such as face recognition, autonomous driving, search and recommendation, and the Internet of Things. DL, represented by convolutional neural networks (CNNs), is driving the evolution of many algorithms and the innovation of advanced technologies. At present, scholars are exploring the application value of DL in the SAR remote sensing field. Many DL-based SAR remote sensing application technologies have emerged, such as land surface change detection, ocean remote sensing, sea-land segmentation, traffic surveillance and topographic mapping.

Aiming to promote the application of DL in SAR, we initiated this Special Issue and collected a total of 14 papers (12 articles, 1 review and 1 technical note) covering various topics, e.g., object detection, classification and tracking; SAR image intelligent processing; data analytics in the SAR remote sensing community; and interferometric SAR technology. An overview of the contributions is given in the following section.

#### **2. Overview of Contributions**

On the topic of object detection, classification and tracking, Li et al. [13] summarized the dataset, algorithm, performance, DL framework, country and timeline of DL-based ship detection methods. They analyzed 177 published papers on DL-based SAR ship detection and attempted to stimulate more research in this field. Xia [14] proposed a visual transformer framework based on contextual joint-representation learning, referred to as CRTransSar. CRTransSar combined the global contextual information perception of transformers with the local feature representation capabilities of convolutional neural networks (CNNs). It was found to produce more accurate ship detection results than other state-of-the-art methods. Note that the authors also released a larger-scale SAR multiclass target detection dataset called SMCDD. Feng et al. [15] established a lightweight position-enhanced anchor-free SAR ship detection algorithm called LPEDet. They designed a lightweight multiscale backbone and a position-enhanced attention strategy to balance detection speed and accuracy. The results showed that their method achieved higher detection accuracy and faster detection speed than other state-of-the-art (SOTA) detection methods. Xu et al. [16] presented a unified framework combining a triangle distance IoU loss (TDIoU loss), an attention-weighted feature pyramid network (AW-FPN), and a Rotated-SARShip dataset (RSSD) for arbitrary-oriented SAR ship detection. Their method showed superior performance on both SAR and optical image datasets, significantly outperforming the SOTA methods. Xiao et al. [17] proposed a simple, yet effective, self-supervised representation learning (Lite-SRL) algorithm for the scene classification task. Note that they successfully evaluated the on-board operational capability of Lite-SRL by transplanting it to the low-power computing platform NVIDIA Jetson TX2. Kačan et al. [18] explored object classification on raw and reconstructed ground-based SAR (GBSAR) data. They showed that processing raw data provides better overall classification accuracy than processing reconstructed data, and highlighted the value of this approach in industrial GBSAR applications where processing speed is critical. Bao et al. [19] proposed a guided anchor Siamese network (GASN) for tracking arbitrary targets of interest (TOI) in Video-SAR. GASN used a matching function to return the most similar area, followed by a guided anchor subnetwork to suppress false alarms. GASN realized TOI tracking with high diversity and arbitrariness, outperforming SOTA methods.

**Citation:** Zhang, T.; Zeng, T.; Zhang, X. Synthetic Aperture Radar (SAR) Meets Deep Learning. *Remote Sens.* **2023**, *15*, 303. https://doi.org/10.3390/rs15020303

Received: 6 December 2022 Accepted: 3 January 2023 Published: 4 January 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

On the topic of SAR image intelligent processing, Tan et al. [20] proposed a feature-preserving heterogeneous remote sensing image transformation model. Through a decoupling network design, the method enhances the detailed information of the generated optical images and reduces their spectral distortion. Results on SEN-2 satellite images revealed that the proposed model has clear advantages in feature reconstruction and parameter economy. Zhang et al. [21] proposed a self-supervised despeckling algorithm with an enhanced U-Net called SSEUNet. Unlike previous self-supervised despeckling works, the noisy-noisy image pairs in SSEUNet were generated from real-world SAR images through a novel training-pair generation module, making it possible to train deep convolutional neural networks using real-world SAR images. Experiments on simulated and real-world SAR images show that SSEUNet notably exceeds SOTA despeckling methods. Habibollahi et al. [22] proposed a DL-based change detection algorithm for bi-temporal polarimetric SAR (PolSAR) imagery called TCD-Net. This method applies three steps: (1) pre-processing; (2) parallel pseudo-label training sample generation based on a pre-trained model and the fuzzy C-means (FCM) clustering algorithm; and (3) classification. TCD-Net could learn stronger and more abstract representations of the spatial information of a given pixel, and was superior to other well-known methods. Fan et al. [23] proposed a high-precision, rapid, large-size SAR image dense-matching method. The method mainly includes four steps: down-sampling image pre-registration, sub-image acquisition, dense matching, and the transformation solution. The experimental results demonstrated that the proposed method is efficient and accurate, providing a new idea for SAR image registration. Zhang et al. [24] proposed a low-grade road extraction network based on decision-level fusion of optical and SAR images, called SDG-DenseNet. Furthermore, they verified that the decision-level fusion of road binary maps from SAR and optical images can significantly improve the accuracy of low-grade road extraction from remote sensing images.

On the topic of data analytics in the SAR remote sensing community, Wangiyana et al. [25] explored the impact of several data augmentation (DA) methods on the performance of building detection on a limited dataset of SAR images. Their results showed that geometric transformations are more effective than pixel transformations, and that DA methods should be used in moderation to prevent unwanted transformations outside the possible object variations. The study provides potential guidelines for future research in selecting DA methods for segmentation tasks in radar imagery.

On the topic of interferometric SAR technology, Pu et al. [26] proposed a robust least-squares phase unwrapping method for InSAR, called PGENet, built on a phase gradient estimation network with an encoder–decoder architecture. Experiments on simulated and real InSAR data demonstrated that PGENet outperformed five other well-established phase unwrapping methods and was robust to noise.

#### **3. Conclusions**

Recently, as many SAR systems have been put into use, massive SAR data have become available, providing important support for exploring how to apply DL to SAR fields. Large amounts of SAR data coupled with the DL methodology jointly promote the development of SAR fields. This Special Issue shows innovative applications in object detection, classification and tracking, SAR image intelligent processing, data analytics in the SAR remote sensing community and interferometric SAR technology. There is no doubt that applying DL to more SAR fields (such as terrain classification, SAR agriculture monitoring, SAR imaging algorithm updating, SAR forest applications, marine pollution, etc.) is of great significance for Earth remote sensing. In addition, we welcome scholars who are interested in applying DL to SAR to contribute to the scientific literature on this subject.

**Author Contributions:** Writing—original draft preparation, T.Z. (Tianwen Zhang); writing—review and editing, T.Z. (Tianjiao Zeng) and X.Z. All authors have read and agreed to the published version of the manuscript.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We thank all authors, reviewers and editors for their contributions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

## *Technical Note* **Data Augmentation for Building Footprint Segmentation in SAR Images: An Empirical Study**

**Sandhi Wangiyana \*, Piotr Samczyński and Artur Gromek**

Institute of Electronic Systems, Faculty of Electronics and Information Technology, Warsaw University of Technology, 00-665 Warsaw, Poland; p.samczynski@elka.pw.edu.pl (P.S.); a.gromek@elka.pw.edu.pl (A.G.) **\*** Correspondence: sandhi.wangiyana.dokt@pw.edu.pl

**Abstract:** Building footprints provide essential information for mapping, disaster management, and other large-scale studies. Synthetic Aperture Radar (SAR) provides more consistent data availability than optical imagery owing to its unique imaging properties, which, however, also make it more challenging to interpret. Previous studies have demonstrated the success of automated methods using Convolutional Neural Networks to detect buildings in Very High Resolution (VHR) SAR images. However, the scarcity of such publicly available datasets can limit research progress in this field. We explored the impact of several data augmentation (DA) methods on the performance of building detection on a limited dataset of SAR images. Our results show that geometric transformations are more effective than pixel transformations. The former improves the detection of objects with different scale and rotation variations. The latter creates textural changes that help differentiate edges better but amplifies non-object patterns, leading to increased false positive predictions. We experimented with applying DA at different stages and concluded that applying similar DA methods in training and inference showed the best performance compared with DA applied only during training. Some DA methods can alter key features of a building's representation in radar images; among them are vertical flips and quarter-circle rotations, which yielded the worst performance. DA methods should be used in moderation to prevent unwanted transformations outside the possible object variations. Error analysis, either through statistical methods or manual inspection, is recommended to understand the bias present in the dataset, which is useful in selecting suitable DAs. The findings from this study can provide potential guidelines for future research in selecting DA methods for segmentation tasks in radar imagery.

**Keywords:** image augmentation; building extraction; SAR; semantic segmentation

#### **1. Introduction**

Buildings are the main structures in any urban area. A building's footprint is the polygon surrounding a building's area when viewed from the top. Maintaining this geographic information is vital for city planning, mapping, disaster preparedness, and other large-scale studies. Synthetic Aperture Radar (SAR) imagery provides an advantage over optical sensors by penetrating clouds and capturing data day and night and in all weather conditions. This consistently available remote sensing data has attracted researchers to study areas frequently covered by clouds, such as during disasters. Temporal changes in large-scale areas are better identified using methods such as change detection [1]. However, SAR's unique properties make it difficult for non-experts to analyze. This has led to the adoption of automated methods such as deep learning with Convolutional Neural Networks (CNNs).

CNN is known for extracting relevant features automatically by learning the underlying function that maps a pair of input and output examples. Automated building detection in Very High Resolution (VHR) SAR images was demonstrated using CNNs in [2,3]. Despite this, detecting buildings in an urban SAR scene is challenging due to the complex background and multi-scale objects [2]. In an urban area, buildings are visible but difficult to distinguish from each other. The major indicator of a building is the double-bounce scattering formed by the ground and walls visible to the sensor. However, in large cities, high-rise buildings can be challenging to detect due to a phenomenon called layover, which projects the building's wall onto the ground toward the sensor. A different approach was taken in [4], where a CNN model was trained to predict these layover areas instead of the traditional building footprint. Predicting the visible building region is indeed intuitive in SAR images, but for most GIS applications, the building footprints are still desirable. An extensive search over various architectures, pre-trained weights, and loss functions for segmenting building footprints from optical and SAR images was performed in [5]. It was found that the diverse building areas and heights in different cities were problematic. Small-area buildings, mostly found in Shanghai, Beijing, and Rio, were undetectable, while high-rise buildings (mostly in San Diego and Hong Kong) degraded the model's performance due to extreme geometric distortions. Those models performed well in cities such as Barcelona and Berlin because most of the buildings were of moderate size and height.

**Citation:** Wangiyana, S.; Samczyński, P.; Gromek, A. Data Augmentation for Building Footprint Segmentation in SAR Images: An Empirical Study. *Remote Sens.* **2022**, *14*, 2012. https://doi.org/10.3390/rs14092012

Academic Editors: Bahram Salehi, Tianwen Zhang, Tianjiao Zeng and Xiaoling Zhang

Received: 2 March 2022 Accepted: 20 April 2022 Published: 22 April 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Predicting well on unseen data, or the ability to generalize, is the main goal of training a deep learning model. It is a generally accepted fact that deep neural networks perform well on computer vision tasks by relying on large datasets to avoid overfitting [6]. Overfitting happens when a model fits its training set too well, resulting in low-accuracy predictions on novel data. For the task of building footprint extraction, a handful of datasets from optical sensors exist [7,8], but unfortunately, not many datasets with VHR SAR data are publicly available. Open, high-quality datasets can be used as a standard benchmark to compare different algorithms and methods. The release of the SpaceNet6 dataset [9] was aimed at promoting further research on this topic.

For data that are expensive to collect and label, such as radar or medical images, a common technique to boost performance is data augmentation. Data augmentation (DA) increases the set of possible data points, artificially growing the dataset's size and diversity. It potentially helps the model avoid focusing on features that are too specific to the training data, thereby increasing generalization (the ability to predict well on data not seen during training) without the need to acquire more images [10].
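
Concretely, the two transformation families compared later in this study can be sketched with plain NumPy. This is an illustrative sketch only, not the augmentation pipeline used in the experiments; the tile size, gamma range, and probabilities are assumptions.

```python
import numpy as np

def geometric_augment(tile, mask, rng):
    """Geometric transform: the image tile and its label mask must move together."""
    if rng.random() < 0.5:                       # random horizontal flip
        tile, mask = tile[:, ::-1], mask[:, ::-1]
    k = int(rng.integers(0, 4))                  # random quarter-circle rotation
    return np.rot90(tile, k), np.rot90(mask, k)

def pixel_augment(tile, rng, gamma_range=(0.8, 1.2)):
    """Pixel transform: alters texture/intensity only, so the mask is untouched."""
    gamma = rng.uniform(*gamma_range)
    return np.clip(tile, 0.0, 1.0) ** gamma

rng = np.random.default_rng(0)
tile = rng.random((450, 450)).astype(np.float32)   # stand-in for one SAR tile
mask = (tile > 0.5).astype(np.uint8)               # stand-in building mask
aug_tile, aug_mask = geometric_augment(tile, mask, rng)
bright_tile = pixel_augment(tile, rng)
```

The key operational difference is that geometric transforms are applied jointly to the image and its mask, while pixel transforms leave the labels unchanged.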

The use of data augmentation for CNN models has proved effective for classifying rare, deadly skin lesions through basic geometric transformations [11] and unique data cleaning methods [12]. In remote sensing, [13] investigated data augmentations for hyperspectral remote sensing images during inference, while [14] performed object augmentation to increase the number of buildings in optical remote sensing images and demonstrated better building extraction performance. In SAR imagery, [15] showed improvements in paddy rice semantic segmentation by applying quarter-circle rotations and random flipping. Random erasing [16] on target ships was performed in [17] to simulate information loss in radar imagery and improve the robustness of object detection. Conventional methods can be used as a form of DA, as demonstrated in [18], which combined handcrafted features from a Histogram of Oriented Gradients (HOG) with automated features from CNN to improve ship classification in SAR imagery.

Generative models are also a popular data augmentation method, generating synthetic samples that retain characteristics similar to the original set. They use an architecture called a Generative Adversarial Network (GAN). However, due to the high computation cost, these are generally applied to low-resolution images, such as in target recognition [19,20], or in limited scene understanding, such as speckle reduction [21,22]. Another promising way of adding synthetic data for SAR is using simulations. In [23], Inverse SAR (ISAR) images were generated using a simulation to augment the limited SAR data for marine floating raft aquaculture detection. Variations in imaging methods and environmental conditions can be consistently created while retaining accurate labels and information about the target. These would be expensive to both collect and analyze using direct measurements. However, the current state-of-the-art simulators are still unable to produce high-fidelity images, which creates a gap between measured and synthetic SAR imagery [24].

To the best of our knowledge, no prior research investigated and compared augmentation methods specifically for building detection from SAR images. In this study, we explored several geometric and pixel transformations and their performance on the SpaceNet6 dataset [9]. The article is organized as follows: Section 2 provides a technical overview of the dataset, CNN model, and training details. All data augmentation methods used in this research are briefly explained in Section 3. The results of the ablation study and the main experiments are presented in Section 4 and discussed in Section 5. Finally, we conclude the article in Section 6.

The main contributions of this paper are:


#### **2. Materials and Methods**

#### *2.1. Dataset Overview*

SpaceNet6 was released as a dataset for a competition to extract building footprints from multi-sensor data. The competition data consist of a training set and a testing set. We do not use the latter because no labels were provided to verify our predictions. The former consists of 3401 tiles of satellite imagery, each from optical RGB, near-infrared (NIR), and SAR sensors. Several works made use of this multi-modal data to boost segmentation performance [5,25,26]. In this study, we are only interested in the SAR data. A quad-polarization X-band sensor mounted on an aerial vehicle was used to image the Port of Rotterdam, the largest port in Europe. The 120 km² coverage is split into tiles of 450 m × 450 m, with 0.5 m/pixel spatial resolution in both range and azimuth directions. An example of an image tile is shown in Figure 1.

**Figure 1.** A tile example shows a pair of (**a**) optical RGB and (**b**) SAR shown as false-colored (R = HH, G = VV, B = VH) over a small oil storage area. (**c**) shows the building footprints overlayed on top of the false-colored SAR. One main challenge of this dataset is differentiating patterns of buildings against objects with high backscatter intensity (such as containers, oil silos, and ships) in the port area.

The SAR data has two orientations, captured using north- and south-facing sensors. Figure 2a shows the tiles over a base map of Rotterdam, marking the positions of images from orient1 (north-facing) in green and orient0 (south-facing) in red. The direction of flight is indicated by the azimuth (*az*) arrow, while the direction the sensor is facing is given by the range (*rg*) arrow.

**Figure 2.** (**a**) The map of Rotterdam Port area (UTM Zone 31N) overlayed with tile boundaries for all 3401 tiles in the SpaceNet6 training set. Green tiles are orient1 while red tiles are orient0. To showcase the data augmentation methods in this study, we only use the green tiles (orient1) for training. Some building footprints are highlighted as an example of layover effects in (**b**) orient1 and (**c**) orient0 tiles. Notice the direction of a layover is always projected towards the sensor's position (near range) while the shadow is cast away from the sensor. The optical RGB image is provided for comparison.

Each orientation creates different characteristics in how a building looks, namely its shadows and layover. The layover creates a projection of the building's body onto the ground. It is caused by the side-looking nature of SAR imaging and the fact that SAR maps pixel values based on the order of received echoes. This is especially prominent for tall buildings.

The quad-channel SAR (also known as polarimetric SAR) is excellent for deep learning applications that benefit from the extra data. However, to showcase the effectiveness of data augmentation methods, we only used tiles from orient1 (covering the northern part of the city) and only a single-channel HH polarization. This constraint enforces a limited-data situation prone to overfitting while still maintaining the distinct land cover of the port area and residential buildings. We chose the HH polarization channel because it is the most commonly available in SAR constellations such as GaoFen-3, TerraSAR-X, and Sentinel-1 [27].

#### *2.2. Preparing the Dataset*

The purpose of training a Deep Learning model is to predict well on new data, which the model never sees during training. This is called generalization. The proper way to develop a model is to prepare 3 separate datasets called the training set, the validation set, and the testing set. As their names suggest, each is used in a different phase of a model's development. For each iteration of training, called an epoch, the model updates its weights by learning from the training set. At the end of each epoch, the model is tested on the validation set without updating its weights. This is used as a guideline for the Deep Learning practitioner to optimize the model or training parameters by studying the generalization ability over each epoch. Finally, when training is finished, the true performance of a model is determined by its score on a separate testing set.
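
The three-way split described above can be sketched as follows. This is illustrative only; the fractions and seed are assumptions, not the split actually used in this study (which is described in the next paragraph).

```python
import numpy as np

def split_tiles(n_tiles, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle tile indices once, then carve out train/validation/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_tiles)
    n_test = int(n_tiles * test_frac)
    n_val = int(n_tiles * val_frac)
    return (idx[n_test + n_val:],        # training set: learn weights each epoch
            idx[n_test:n_test + n_val],  # validation set: monitor generalization
            idx[:n_test])                # testing set: final score only

train_ids, val_ids, test_ids = split_tiles(3401)
```

Shuffling once before splitting keeps the three sets disjoint, which is what makes the validation and test scores meaningful estimates of generalization.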

To train our model, we used roughly half of the SpaceNet6 training set, which was constrained to a single sensor orientation and a single HH channel. We did not use a separate testing set. To evaluate the performance of our model and as a guideline for the impact of augmentation methods, we generated a validation dataset using the expanded version of SpaceNet6. This extra data was released later after the SpaceNet6 competition finished and consisted of unprocessed Single Look Complex (SLC) SAR data with additional polygon labels. We generated tile images from the SLC rasters with the same constraint as our training set: a single orientation and a single HH channel. We matched the exact preprocessing steps explained in [9] using their provided Python library. The steps are:


**Figure 3.** One of the SLC stripes after preprocessing, cropped to the boundaries of the testing set labels. These will be further cropped into non-overlapping tiles and later used as our validation dataset.

We used a search algorithm to crop out the no-data regions. These are the black regions shown in the left and bottom parts of the image tile in Figure 1. They were produced during the pre-processing stage of the dataset, stemming from the need to have square, non-overlapping tile images. The borders between the raster and the no-data regions create jagged lines due to a slight affine rotation during orthorectification. Our experiments showed that these jagged lines affect the results of pixel transformations, so to obtain consistent results, we cropped each tile to a clean rectangular shape. The code used in our methods is available as open source at https://github.com/sandhi-artha/sn6_aug (last accessed on 9 April 2022).
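
One simple way to implement such a crop, sketched below, is to keep only the rows and columns that are mostly valid. This is an illustration, not necessarily the search algorithm used in the released code; the `keep_frac` threshold and `nodata` value are assumptions.

```python
import numpy as np

def crop_nodata(raster, nodata=0.0, keep_frac=0.5):
    """Crop a tile to the rows/columns that are mostly valid, discarding
    jagged no-data borders left over from orthorectification."""
    valid = raster != nodata
    rows = np.flatnonzero(valid.mean(axis=1) >= keep_frac)  # mostly-valid rows
    cols = np.flatnonzero(valid.mean(axis=0) >= keep_frac)  # mostly-valid cols
    return raster[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

tile = np.zeros((10, 10), dtype=np.float32)
tile[2:9, 1:8] = 1.0        # valid data surrounded by a no-data border
clean = crop_nodata(tile)   # interior 7 x 7 block, no zero rows/columns
```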

#### *2.3. Segmentation Model*

Segmentation is the task of partitioning an image into regions based on the similarity (alike characteristics within the same class) and discontinuity (the border or edge between different classes) of pixel intensity. For building footprint extraction, there are only two classes: the positive examples, i.e., pixels belonging to a building's region, and the negative examples, i.e., the rest of the pixels (non-building). The deep learning model is given pairs of images and labels for training. The trained model outputs an image of the same size, classifying each pixel as belonging to one of the two classes.
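
A common way to score such a predicted binary mask against its label is intersection over union (IoU); a minimal sketch (illustrative, independent of the evaluation protocol used later):

```python
import numpy as np

def iou(pred, truth):
    """Intersection over Union between two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # two empty masks agree perfectly
    return np.logical_and(pred, truth).sum() / union
```

IoU penalizes both missed building pixels and false positives, which makes it stricter than plain pixel accuracy on imagery dominated by the non-building class.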

An encoder-decoder type architecture is commonly used for segmentation models, popularized by the famous UNet [28] and its variations. For aerial imagery, DeepLab v3 [29], PSPNet [30], and Feature Pyramid Network (FPN) [31] are commonly used. Based on a previous study [32], we used the FPN architecture combined with the EfficientNet B4 backbone. EfficientNet is a family of CNN models generated using compound scaling to determine an optimal network size [33]. For a deep learning model, the architecture refers to how each layer in the network is connected, while the backbone refers to the feature extraction part of the model.

Building footprints taken from overhead images have various sizes. To differentiate a building from the background, we need to see enough pixels representing the whole or most of the building. This means a higher spatial resolution is required for detecting buildings with a smaller area. A common method in computer vision to help the model learn these multi-sized objects is to use multi-scale input, i.e., the input image downscaled to different pixel resolutions. This is called an image pyramid (Figure 4).
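
An image pyramid can be sketched as repeated downscaling of the input; here 2×2 block averaging is used as one common downscaling choice (illustrative only, with an assumed tile size):

```python
import numpy as np

def image_pyramid(img, levels=4):
    """Build a multi-scale input by repeated 2x2 block averaging."""
    pyramid = [img]
    for _ in range(levels - 1):
        cur = pyramid[-1]
        h, w = cur.shape
        cur = cur[:h - h % 2, :w - w % 2]  # drop an odd edge row/column if any
        pyramid.append(cur.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyramid

pyr = image_pyramid(np.zeros((448, 448), dtype=np.float32))
# pyr holds the tile at 448, 224, 112, and 56 pixels per side
```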

**Figure 4.** FPN architecture with EfficientNet backbone for segmentation tasks.

FPN uses feature pyramids instead. In the Encoder, the image input is scaled down using strided convolution operations that cut the image dimensions in half at each pyramid level. As the data flow up the pyramid, the top layer has the smallest width and height (the original input's size divided by 32) but the richest semantic information (1632 feature maps, or channels). In a classification task, this is compressed further to output a vector with the same size as the number of classification labels [31]. For a segmentation task, an output with the same spatial size as the input is required; therefore, the top layer needs to be upscaled.

A 1 × 1 convolution filter is applied to the final layer in the encoder pyramid to reduce the number of feature maps to 256, without modifying the image dimension. As data flows down the Decoder's pyramid, the width and height increase 2× using nearest neighbors upsampling. In the skip connections (yellow arrow), feature maps from the same pyramid level in the Encoder and Decoder were concatenated. A 1 × 1 convolution was used to scale the feature maps from the Encoder pyramid to 256. This provides context for better localization as the image gradually recovers in pixel resolution. Afterward, feature pyramids from the Decoder go through a Conv and Upsample operation (black arrow), resulting in modules with 128 feature maps and image dimension 1/4 of the original input. These are then stacked channel-wise, creating a module of 512 feature maps. A final Conv and Upsample operation reduces the number of channels to 1 and restores the image dimension back to the original input [34].
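The decoder data flow described above can be traced as a shape-only NumPy sketch (the actual model uses TensorFlow layers; the encoder channel counts here are illustrative placeholders, not the real EfficientNet B4 widths):

```python
import numpy as np

def upsample_nn(x, factor=2):
    """Nearest-neighbor upsampling along height and width of an (H, W, C) array."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def conv1x1_to(x, channels):
    """Stand-in for a 1x1 convolution: keeps H and W, changes the channel count."""
    h, w, _ = x.shape
    return np.zeros((h, w, channels))

# Encoder pyramid for a 320x320 input at strides 4/8/16/32
# (placeholder channel counts).
enc = {s: np.zeros((320 // s, 320 // s, c))
       for s, c in [(4, 48), (8, 96), (16, 232), (32, 448)]}

# Top-down pathway: reduce the top level to 256 feature maps, then repeatedly
# upsample 2x and merge with the lateral (skip) connection at each level.
p = conv1x1_to(enc[32], 256)
decoder_levels = [p]
for stride in (16, 8, 4):
    p = upsample_nn(p) + conv1x1_to(enc[stride], 256)
    decoder_levels.append(p)

# "Conv and Upsample": bring every level to stride 4 with 128 feature maps,
# stack channel-wise (4 x 128 = 512), then restore the input resolution.
merged = [conv1x1_to(upsample_nn(d, 80 // d.shape[0]), 128) for d in decoder_levels]
stacked = np.concatenate(merged, axis=-1)        # (80, 80, 512)
output = conv1x1_to(upsample_nn(stacked, 4), 1)  # (320, 320, 1)
```

The shapes confirm the description in the text: four decoder levels of 128 maps each stack into 512 feature maps at 1/4 of the input resolution, and the final operation restores the original 320 × 320 size with a single channel.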

#### *2.4. Training Details*

The training was performed in a Kaggle Kernel, a cloud computing environment equipped with a 2-core processor and an Nvidia P100 GPU with 16 GB of video memory (VRAM). The training pipeline was built using the TensorFlow framework. The Segmentation-Models library [35] was used to combine the FPN architecture with the EfficientNet B4 backbone with no pre-trained weights. Adam [36] was used as the optimizer with default parameters. We used a learning rate scheduler, which decays the learning rate *α* by a factor of 0.5 · (1 + cos(*nπ*/*N*)), where *n* is the current epoch and *N* is the total number of epochs.
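The cosine decay schedule above can be written as a small function (the base learning rate of 1e-3 is an assumed placeholder matching Adam's common default; the paper does not state the initial rate):

```python
import math

def cosine_decay(epoch, total_epochs, base_lr=1e-3):
    """Learning rate decayed by the factor 0.5 * (1 + cos(n*pi/N)) from the text."""
    return base_lr * 0.5 * (1.0 + math.cos(epoch * math.pi / total_epochs))

# The factor falls smoothly from 1.0 at epoch 0 to 0.0 at the final epoch.
lrs = [cosine_decay(n, 60) for n in range(61)]
```

The schedule starts at the full base rate and decays monotonically to zero, which lets the optimizer take large steps early and fine-tune near the end of training.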

To evaluate the performance of our model, we use the Intersection over Union (IoU) metric. IoU is the ratio of overlap between the predicted area and the real area (Figure 5); in this case, it is a pixel-based metric. A higher IoU indicates better predictive accuracy.

$$\text{IoU} = \frac{\left| y_{gt} \cap y_{pred} \right|}{\left| y_{gt} \cup y_{pred} \right|} = \frac{\text{TP}}{\text{TP} + \text{FP} + \text{FN}} \tag{1}$$

**Figure 5.** How IoU is calculated over an optical image of a warehouse building.

True Positives (TP) are pixels labeled as building and correctly predicted as building. True Negatives (TN) are pixels labeled as background and correctly predicted as background. False Negatives (FN) are building pixels misclassified as background, while False Positives (FP) are background pixels misclassified as building.
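Equation (1) can be sketched directly from these definitions for binary masks:

```python
import numpy as np

def iou(y_gt, y_pred):
    """Pixel-based IoU for boolean masks, following Equation (1)."""
    tp = np.logical_and(y_gt, y_pred).sum()   # building predicted as building
    fp = np.logical_and(~y_gt, y_pred).sum()  # background predicted as building
    fn = np.logical_and(y_gt, ~y_pred).sum()  # building predicted as background
    return tp / (tp + fp + fn)

gt   = np.array([[1, 1, 0, 0]], dtype=bool)
pred = np.array([[1, 0, 1, 0]], dtype=bool)
score = iou(gt, pred)  # one overlapping pixel out of a three-pixel union
```

Note that TN does not appear in the denominator, which is exactly why IoU is robust to the large background regions that dominate this dataset.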

Calculating statistics over each image tile in the training set, 20% of tiles have less than 1% positive samples (pixels classified as buildings) (Figure 6a). This indicates that most tiles are dominated by negative samples (background pixels). One must be cautious in selecting a loss function when training a model on such a skewed data distribution, because the negative samples will dominate the predictions. For example, with binary cross-entropy as the loss function, the model would incur only a minor error even if it predicted the whole image as background.

We experimented with several loss functions and concluded that Dice Loss leads to better convergence on this dataset. It is based on the Dice Coefficient, which measures the similarity between two samples based on their degree of overlap. Dice Loss is simply 1 − *Dice Coefficient*, giving a loss ranging from 0 to 1, where 0 indicates perfect and complete overlap.

$$Loss_{Dice} = 1 - 2 \cdot \frac{\left| y_{gt} \cap y_{pred} \right|}{\left| y_{gt} \right| + \left| y_{pred} \right|} \tag{2}$$
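For soft (probability-valued) predictions, the Dice loss above is commonly implemented with an element-wise product as the intersection; a minimal sketch (the small eps guards against empty masks and is a standard addition, not stated in the paper):

```python
import numpy as np

def dice_loss(y_gt, y_pred, eps=1e-7):
    """Soft Dice loss following Equation (2)."""
    intersection = (y_gt * y_pred).sum()
    return 1.0 - 2.0 * intersection / (y_gt.sum() + y_pred.sum() + eps)

perfect = dice_loss(np.ones((4, 4)), np.ones((4, 4)))   # near 0: complete overlap
disjoint = dice_loss(np.ones((4, 4)), np.zeros((4, 4))) # 1: no overlap
```

Because both numerator and denominator count only building pixels, the loss is insensitive to the dominant background class, unlike binary cross-entropy.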

**Figure 6.** Per-image-tile statistics for the training set, normalized. (**a**) shows the distribution of positive samples relative to the total number of pixels in an image tile; 20% of tiles have less than 1% of total pixels categorized as buildings. (**b**) shows the building count for each image tile; the largest group of tiles (17.5%) contains fewer than five buildings.

#### *2.5. Ablation Study*

First, we studied the impact of each augmentation method in an ablation study, applying the same model and training configuration but different transformations to the dataset. To speed up this process, we trained and validated on only a subset of our main training set. We divided it into five identical column regions. The first and second columns were used as the mini-training dataset, while the last column was used for the mini-validation dataset. We did not include the middle area in the mini-training set, creating a buffer (notice the high overlap of tiles in Figure 2a) that prevents data leakage between the two sets. The mini-training set and the mini-validation set contain 37% and 23% of the images in the main training set, respectively. After concluding which augmentations work well for the mini dataset in the ablation study, we applied combinations of the positively impactful transformations to the main dataset.

#### **3. Data Augmentation**

This section describes the data augmentations used in this study and how they were implemented during the model's development. In general, the geometric transformations (including the reduce transformation) were applied using TensorFlow operations, while pixel transformations were applied with the help of the Albumentation library [37].

#### *3.1. Types of Data Augmentation*

#### 3.1.1. Reduce Transformation

Cropping the no-data regions results in a varying aspect ratio for each tile, which must be addressed. Square-shaped images are preferred as data input to simplify spatial compression and expansion of the image inside the model. The image size also needs to be reduced to fit into the GPU's memory. We decided on a target resolution of 320 × 320 pixels, which allowed a batch size of eight on the single P100 GPU. All resizing methods used bilinear interpolation. Two main resize methods were tested:


This downsampling process can be exploited to introduce randomness that further increases the diversity of the training samples. Cropping at random locations gave better details than just resizing the whole image (Figure 7). However, because it introduces randomness, these methods cannot be used as a reduction method for the validation dataset:


**Figure 7.** Reduce Transformations comparison.

#### 3.1.2. Geometric Transformations

In computer vision tasks, geometric transformations are cheap and easy to implement. However, it is important to choose transformation magnitudes that preserve the label in the image. For example, in optical character recognition, rotating a number by 180° can result in a different label interpretation, as in the case of the numbers six and nine.

Flipping an image along the horizontal or vertical centerline is a common data augmentation method. Referring to Figure 2a, the range direction *rg* for this dataset is on the vertical or *y*-axis, while the flight direction *az* is on the horizontal or *x*-axis. The **Horizontal Flip** does not alter the properties of a radar image. It would be as if the vehicle carrying the sensor was moving in the opposite direction. In contrast, the **Vertical Flip** makes the shadows and layovers appear on the opposite side, creating inconsistency.

Rotation helps the model learn the orientation invariance of a building. We performed **Rotation90**, i.e., quarter-circle rotations {90°, 180°, 270°}, and **Fine Rotation** with a randomized angle range, e.g., [−10°, 10°]. Similar to Vertical Flip, the quarter-circle rotation affects the imaging properties of the radar. The fine rotation exposes an area where image data are unknown. There are several ways to "fill" this blank area, and our experiments showed that filling it with dark pixels (0.0) gave the best result.

Shear is a distortion along a specific axis used to modify or correct perception angles (Figure 8). Despite SAR being a side-looking imaging device, the processed SAR image appears flat owing to the orthorectification process that corrects geometric distortions. In **ShearX**, the edges of the image that are parallel to the *x*-axis stay the same, while the other two edges are displaced depending on the shear angle range. **ShearY** is the exact opposite. We set the shear angle to be randomized within a range of [−10°, 10°].

**Figure 8.** Shear transformation in 2 directions.

**Random Erasing** is an augmentation method inspired by the drop-out regularization technique [16]. It selects a random patch or region in the image and erases the pixels within it. The goal is to increase robustness to occlusion by forcing the model to learn alternative ways of recognizing the covered objects. We decided to fill the patch values with dark pixels (0.0). Random erasing was implemented using the CoarseDropout class from the Albumentation library. The region's width and height were randomized from 30 to 40 pixels, and the number of regions was randomized from two to ten patches. The proposed geometric transformations are shown in Figure 9.
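A minimal NumPy sketch of this CoarseDropout-style erasing, using the ranges stated above (this is an illustration of the idea, not the Albumentations call used in the paper):

```python
import numpy as np

def random_erase(image, rng, n_range=(2, 10), size_range=(30, 40), fill=0.0):
    """Erase a random number of rectangular patches, filled with dark pixels."""
    img = image.copy()
    h, w = img.shape[:2]
    for _ in range(rng.integers(n_range[0], n_range[1] + 1)):
        ph = rng.integers(size_range[0], size_range[1] + 1)  # patch height
        pw = rng.integers(size_range[0], size_range[1] + 1)  # patch width
        y = rng.integers(0, h - ph)
        x = rng.integers(0, w - pw)
        img[y:y + ph, x:x + pw] = fill
    return img

rng = np.random.default_rng(0)
out = random_erase(np.ones((320, 320)), rng)
```

Copying the input first keeps the original tile untouched, so the same image can be re-augmented differently in every epoch.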

**Figure 9.** Summary of Geometric Transformations.

#### 3.1.3. Pixel Transformations

In airborne sensors, unknown perturbations of the sensor's position relative to its expected trajectory can cause several defects such as radiometric distortions and image defocusing. To increase the model's robustness when encountering these defects, we used noise injection methods. The common **Gaussian Noise** was generated using the GaussNoise class, while **Speckle Noise** was amplified by multiplying each pixel with random values using the MultiplicativeNoise class, both from the Albumentation library. Some images suffered from defocusing due to an unpredicted change in the flight trajectory, causing fluctuations in the microwave path length between the sensor and the scene [38]. We simulate this defocusing by applying **Motion Blur** with random kernel size using the MotionBlur class from Albumentation.
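The two noise injections can be sketched in NumPy as follows (illustrative stand-ins for the GaussNoise and MultiplicativeNoise classes; the sigma and strength values are assumed, as the paper does not state its parameters):

```python
import numpy as np

def gaussian_noise(image, rng, sigma=0.05):
    """Additive Gaussian noise: each pixel is shifted by a zero-mean sample."""
    return image + rng.normal(0.0, sigma, image.shape)

def speckle_noise(image, rng, strength=0.1):
    """Multiplicative noise: each pixel is scaled by a random factor around 1,
    amplifying speckle-like variation."""
    return image * rng.uniform(1.0 - strength, 1.0 + strength, image.shape)

rng = np.random.default_rng(42)
img = np.full((64, 64), 0.5)
noisy_add = gaussian_noise(img, rng)
noisy_mul = speckle_noise(img, rng)
```

The key difference is that additive noise perturbs all pixels equally, while multiplicative noise scales with pixel brightness, mimicking how speckle behaves in intensity imagery.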

We used **Sharpening** with a high pass filter to improve edge detection. Consequently, this also increases other high-frequency components such as speckle.

The SAR images in this dataset are SAR-Intensity in the log scale, also referred to as dB images. From samples of the dataset, we observed that the dB images have a pixel distribution close to Rayleigh that mostly occupies a narrow range. Figure 10 shows how histogram equalization methods stretch the pixel distribution to span the full range, increasing contrast and improving edge visibility. Contrast Limited Adaptive Histogram Equalization (**CLAHE**) works best when there are large intensity variations in different parts of the image, limiting the contrast amplification to preserve details. We applied it using the CLAHE class from the Albumentation library. The proposed pixel transformations are shown in Figure 11.
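The core idea of histogram equalization can be sketched in a few lines of NumPy (this is the *global* variant for illustration; CLAHE additionally works on local tiles and clips the histogram to limit contrast amplification, as described above):

```python
import numpy as np

def hist_equalize(image, bins=256):
    """Global histogram equalization: map each pixel through the normalized
    cumulative histogram so the output spans the full [0, 1] range."""
    flat = image.ravel()
    hist, edges = np.histogram(flat, bins=bins)
    cdf = hist.cumsum().astype(np.float64)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])  # normalize CDF to [0, 1]
    return np.interp(flat, edges[:-1], cdf).reshape(image.shape)

# A narrow, Rayleigh-like dB distribution gets stretched to the full range.
rng = np.random.default_rng(0)
sar_db = 10.0 * np.log10(rng.rayleigh(1.0, (64, 64)) ** 2)
equalized = hist_equalize(sar_db)
```

Mapping through the cumulative distribution is what "stretches" the narrow dB histogram: densely populated intensity ranges are spread apart, which is the contrast gain visible in Figure 10.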

**Figure 10.** Histogram of a SAR Log Intensity image, before and after applying histogram equalization using CLAHE.

**Figure 11.** Summary of Pixel Transformations.

#### 3.1.4. Speckle Filters

Although still categorized as a Pixel Transformation, this subsection provides more detail on Speckle Filters. Similar to other coherent imaging methods (laser, ultrasound), SAR suffers from a noise-like phenomenon called speckle, which arises from the interference of multiple scatterers within a resolution cell. Speckle takes the form of granular variations in pixel values that can lower the interpretability of a SAR image for computer vision systems as well as human practitioners.

Speckle reduction filters such as the Box filter can smooth speckle using a local averaging window. This is effective in homogeneous areas, but in applications requiring high-frequency information such as edges, filters that adapt to the local texture better preserve information in heterogeneous areas [39]. A previous study [32] showed a slight performance gain from applying Low Pass filters of varying strength to the UNet model. In this research, we inspected the use of well-known adaptive speckle filters as a form of data augmentation, namely the Enhanced Lee Filter (**eLee**), the **Frost** Filter, and the Gamma Maximum A Posteriori (**GMAP**) Filter.
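To illustrate how such adaptive filters work, here is a sketch of the *basic* Lee filter, which the enhanced Lee, Frost, and GMAP filters used in the paper refine in different ways (the number of looks `enl` and window size are assumed values; the paper applied its filters in MATLAB):

```python
import numpy as np

def lee_filter(image, size=7, enl=4.0):
    """Basic Lee filter: blend each pixel with its local mean, weighted by how
    much the local variance exceeds the variance expected from speckle alone.
    Homogeneous areas collapse toward the mean; edges and textures are kept."""
    pad = size // 2
    padded = np.pad(image, pad, mode='reflect')
    h, w = image.shape
    mean = np.empty_like(image, dtype=np.float64)
    var = np.empty_like(image, dtype=np.float64)
    for i in range(h):                      # naive sliding-window statistics
        for j in range(w):
            win = padded[i:i + size, j:j + size]
            mean[i, j] = win.mean()
            var[i, j] = win.var()
    noise_var = (mean ** 2) / enl           # multiplicative speckle model
    weight = np.clip(1.0 - noise_var / (var + 1e-12), 0.0, 1.0)
    return mean + weight * (image - mean)

# Homogeneous 4-look speckle (gamma-distributed intensity with unit mean).
rng = np.random.default_rng(1)
img = rng.gamma(4.0, 0.25, (32, 32))
filtered = lee_filter(img)
```

On a homogeneous region the local variance roughly equals the speckle variance, the weight drops to zero, and the filter reduces to local averaging, which is exactly the "mean preserved, standard deviation reduced" behavior the quantitative comparison in Table 1 measures.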

In Figure 12a, two sample crops of size 150 × 150 pixels are shown for comparing filtering results. A good filter should retain the mean of an image while reducing speckle [40]; in homogeneous areas, the standard deviation should ideally be 0. Speckle filters were applied in MATLAB to the SAR Intensity image (linear scale), which was later converted back to the Log-Intensity image (dB scale). The results of filtering are shown in Figure 12c,d, while the quantitative comparison is presented in Table 1. We can observe that the GMAP filter was slightly better at preserving the average value and reducing variance in crops 1 and 2, respectively. The image's Equivalent Number of Looks (ENL) also showed a reduction in speckle variance for all filters.

**Figure 12.** A sample SAR Log Intensity (dB) image (**a**) with its equivalent optical RGB (**b**). We further analyze two distinct crop regions of a building (crop 1) and a homogeneous water area (crop 2). The results of filtering with selected speckle filters and various kernel sizes are given in (**c**,**d**). The image in linear scale appears dark because of the wide dynamic range.


**Table 1.** Quantitative comparison of the applied filters. The mean, standard deviation (std), and Equivalent Number of Looks (ENL) are calculated in the linear scale.

#### *3.2. Data Augmentation Design and Strategy*

For each image, we set a 50% probability of loading either the original version or a version with data augmentation applied. The magnitude of each transformation was also randomized within a value range, increasing the variation in every training iteration, except for flipping and quarter-circle rotation, which have a limited set of transformations.

The order of transformations is important when multiple augmentations are combined, as in the main experiment. Pixel transformations are applied first to prevent the presence of no-data pixels from affecting their results. Next comes the reduce transformation and, finally, the geometric transformations. When using multiple pixel transformations, it is important to combine them into a "One Of" group: by chance, only one of the transformations will be applied, preventing destructive stacking of filters. The geometric transformations are not grouped, so an image has a chance to go through all of them, which may introduce no-data pixels but is generally less harmful than multiple filtering operations.
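The ordering and "One Of" grouping can be sketched as a small pipeline function (the transform names used in the demo are hypothetical tags, and the structure is an illustration of the described order, not the actual training code):

```python
import random

def augment(image, pixel_transforms, reduce_transform, geometric_transforms,
            p=0.5, rng=random):
    """Apply augmentations in the order described above. With probability
    1 - p the (reduced) original is used; otherwise one pixel transform from
    the "One Of" group is applied, then the reduce transform, then every
    geometric transform in sequence."""
    if rng.random() >= p:
        return reduce_transform(image)
    image = rng.choice(pixel_transforms)(image)  # "One Of": a single pixel transform
    image = reduce_transform(image)
    for transform in geometric_transforms:       # geometric transforms may stack
        image = transform(image)
    return image

# Track the order of application with tagging stand-ins for real transforms.
order = []
def tag(name):
    return lambda img: (order.append(name), img)[1]

pixel = [tag("clahe"), tag("blur")]
geometric = [tag("hflip"), tag("fine_rotation")]
augment("tile", pixel, tag("random_crop_resize"), geometric,
        rng=random.Random(1))  # seed chosen so the augmented branch is taken
```

Running the demo shows exactly one pixel transform firing before the reduce step, followed by the full geometric chain.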

As categorized in [10], there are three stages of applying Data Augmentation (DA): Online (on-the-fly), Offline, and Test Time Augmentation (TTA).

In Online DA, the input data is manipulated during training. This can lead to a bottleneck if a fast accelerator is used in training, but the augmentation algorithm is slow, leaving the accelerator mostly waiting for data. The advantage is that it does not store the inflated data in storage. On the other hand, Offline DA allows complex manipulation and will not bloat training time. However, since it is applied before training, it takes up storage, and the variations are pre-determined (less randomness). We used Offline DA only for speckle filtered images since they were processed outside the TensorFlow environment, and an image was stored for every applied filter. Other transformations in this study used Online DA, which can have a finer degree of randomness in every iteration.

In TTA, *A* additional images are generated from each test image *x*, where *A* is the number of augmentations applied during the inference or prediction stage. The model then predicts on *A* + 1 samples, and the average is taken as the final prediction. This method of predicting on multiple transformed versions of the input mimics ensemble learning, where a group of models using different architectures or trained on different data combine their predictions to increase generalization. This was investigated in [41], which concluded that TTA helps reduce overconfident incorrect predictions compared to using only a single model.

In a classification task, averaging predictions is straightforward since the output is an array with a size equal to the number of classes. In a segmentation task, one must be cautious with augmentations that modify the location of labels (in our dataset, the building footprints). If such methods are used, the solution is simply to invert the transformation on the prediction before averaging.
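The invert-then-average step for segmentation TTA can be sketched with a horizontal flip, the location-modifying augmentation that is safest to invert:

```python
import numpy as np

def tta_predict(model, image):
    """TTA with a horizontal flip: predict on the original and the flipped
    image, invert the flip on the second prediction, then average. Flips
    retain the full image after the inverse transformation, which is why
    they are convenient for segmentation TTA."""
    pred = model(image)
    pred_flipped = model(image[:, ::-1])[:, ::-1]  # predict, then invert the flip
    return (pred + pred_flipped) / 2.0

# With an identity "model", TTA must return the original image unchanged.
demo = tta_predict(lambda x: x, np.arange(12.0).reshape(3, 4))
```

Without the inversion, the flipped prediction's building footprints would land on the wrong side of the tile and corrupt the average.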

#### **4. Results**

#### *4.1. Ablation Study*

In this isolated experiment, we measured the impact of each augmentation method. The model was trained on the mini-training dataset and evaluated on the mini-validation dataset. Results are shown in Table 2. Loss and IoU are the scores for the training set, while Val Loss and Val IoU are scores for the validation dataset. The training lasted for 60 epochs. The four metrics were taken at the best epoch, i.e., when the model obtained the highest Val IoU. We chose not to take the final score at epoch 60 in order to showcase the best performance of each augmentation objectively. The score at the last epoch was always worse than at the best epoch because the training scores kept improving while the validation scores stagnated or even deteriorated. This can be visualized in Figure 13. The gap between the training and validation scores was smaller when an augmentation method was applied, which delayed overfitting and enabled the model to reach a lower local minimum of the loss, i.e., a temporary performance peak.

**Table 2.** Results of Ablation study. All scores are in percentage units. For loss, the lower is better. For IoU, the higher the better. Scores are color-coded in comparison to Pad Resize, where green color projects positive gain while red projects negative gain.


To fit the image resolution to the model's input, we used two basic Reduce methods: Pad Resize and Distorted Resize. The slightly better score obtained using Distorted Resize might indicate that the black paddings were distracting the model because the full scope of the raster's space was not utilized. However, a problem arises when reshaping the distorted prediction back to the original aspect ratio, so we used Pad Resize as the Reduce method for the validation data. Adding randomness to the Reduce method using Random Crop Resize had the biggest performance gain compared to other augmentations. Randomizing the crop size gives the model the chance to see the image at different scales and levels of detail. Random Crop did not perform well due to its small static crop size of a 160 m × 160 m area, which increases the chance of encountering partial parts of a building.

**Figure 13.** IoU scores comparing the four Reduce Transformations. The solid lines show training IoU scores, while the dashed lines show validation IoU scores. The loosely dotted vertical lines show where the best epoch (highest Val IoU) occurs for each method. Adding more variation to the input delays overfitting, as shown by a later best epoch.

Geometric transforms generally increased performance, except for Vertical Flip and Rotation90. Both were detrimental to performance as they caused extreme displacement of the shadows' and layovers' locations relative to the actual building footprint. Pixel transforms were not as effective, giving similar or slightly worse scores than the baseline. These augmentations affect the recognition of texture, an important feature when the edges of a building or its shape are unrecognizable due to occlusion or noise. However, this filtering can also be destructive, as it amplifies non-building patterns as well.

Training scores were generally lower when augmentations were applied, as the model struggles to find the underlying function among the additional variations. A strong increase in training scores for the GMAP speckle filter indicates better recognition of the training data. However, these variations were not shown among the validation data, hence the lower validation scores.

To validate the effects of our proposed data augmentation, we performed the Ablation Study on different mini datasets, which we briefly describe in the following:


Figure 14 shows that the performance gains and losses on the other datasets mostly agree with the results on SAR orient1. Random Crop Resize, Horizontal Flip, and Fine Rotation showed consistent gains over all datasets. Meanwhile, Rotation90 showed consistent dips in performance, which are more prominent in the datasets from SpaceNet6. The result from PAN highlights the method's impact on a similar geographic region (Rotterdam Port) but a different modality, while the result from Inria highlights the impact when exposed to the different urban settlements of multiple cities. However, due to the stochastic nature of deep neural networks, an optimized model and training method fitted to one dataset might not translate to an optimal solution on another dataset with a different distribution [6]. Therefore, these directive insights should be further tweaked when working on a different dataset.

#### *4.2. Main Experiment*

Using similar concepts to those from the Ablation study, we experimented with several combinations of augmentation methods applied to the main training set and evaluated on our prepared validation set. The same model and training parameters were used except for the increased training duration to 90 epochs. This compensates for the additional variation and gives more room for the model to learn. A callback was set to monitor the best Val IoU score and labeled it as the best epoch. The following augmentation schemes were applied:


Only Baseline used Pad Resize as the Reduce method, while the other combinations used Random Crop Resize. For every augmentation scheme, the model's performance was taken at the best epoch and shown in Table 3. Due to the different datasets, the scores in Table 2 should not be directly compared to the results of this main experiment. The training time *ttrain* was measured as the duration of training but is shown in the table as a scaling factor compared to Baseline, i.e., *ttrain* = *ttrain\_scheme* / *ttrain\_baseline*.

In line with results from the Ablation Study, geometric transformations had better scores than pixel transformations. Increasing the magnitude of transformation did not lead to an increase in performance, as shown by the lower scores obtained in Heavy Geometry. Increasing the diversity of transformations in Combination also did not improve performance despite consisting of transformations that showed positive impacts during the Ablation Study.

All models predicted well on medium-height, elongated residential buildings (Figure 15a). Applying augmentation increased confidence, modeling a more accurate shape characterized by rooftop patterns. However, fine details of building structures and small buildings remained undetected.

**Table 3.** Results of combining multiple augmentations. All loss and IoU scores are in percentage. *ttrain* is the scale of training time compared to the Baseline. A lower value indicates faster training time. For loss, lower is better. For IoU, higher is better. Scores are color-coded where a darker green indicates a better value.



**Figure 15.** Comparison of predictions from the trained models on different scene objects: (**a**) medium-height residential buildings, (**b**) containers in a shipping port, (**c**) an outdoor sports field, and (**d**) high-rise buildings.

For image tiles with a large share of negative samples (pixels belonging to non-building areas), pixel augmentations drive extra attention to high-backscattering objects such as container storage and large shipping/port equipment made of metal (Figure 15b). This leads to an increase in false positives. Geometric augmentations were less prone to this. However, geometric augmentations overfit to non-building objects with building-like backscatter, such as the fences surrounding a sports field (Figure 15c). A combination of geometric and pixel augmentations seems to tune down these false positives and correctly recognize them as non-building patterns.

Occlusion was the biggest problem, especially for high-rise buildings in dense areas (Figure 15d). All models failed to recognize buildings occluded by the layover of a neighboring high-rise building. Interestingly, geometric augmentations tended to classify the overlaid parts as positives.

Although Baseline used no augmentations, the Light Pixel scheme, applied using the Albumentation library, managed to train slightly faster. This was caused by the callback function, which saves the state of the model each time it sees a better monitored value, requiring additional time in certain epochs. It shows that training duration is not a fully objective measurement due to the randomness involved; however, we still included it as a comparison. The other schemes show that adding more transformation methods increases training time.

#### *4.3. Test-Time Augmentation*

The state of the models in the Main Experiment was saved at their best epoch, and TTA was applied after the training ended. We experimented with two methods for TTA:


We applied TTA to the Baseline model and the best model from the Main Experiment, which was trained on the Light Geometry scheme. TTA comes at the cost of additional inference time *ttest*, which is a scaling factor compared to the Baseline's inference time. It increases proportionally to the number of augmentations applied. Compared to the time required by re-training a model, the additional inference time to implement TTA was negligible. Results for TTA are shown in Table 4.

**Table 4.** Results of applying TTA to the Baseline model and Light Geometry. All loss and IoU scores are in percentage. For loss, lower is better. For IoU, higher is better. Scores are color-coded where a darker green indicates a better value.


The Baseline and Light Geometry models benefit from TTA\_1, which consists of simple geometric transformations. Interestingly, when TTA\_2 was applied to the Baseline model, the performance was lower, as it predicted fewer positive samples from the two square patches. The Baseline model, trained on images with a fixed scale, had less confidence in predicting medium-sized buildings compared to the Light Geometry model, which had the chance to see scale variations thanks to the Random Crop Resize reduction method.

#### **5. Discussion**

Deep learning methods explore a vast parameter space, and results vary with different data, models, and training parameters (also known as hyperparameters). Our goal is to provide intuition based on our experiments of modifying the data while keeping the model and hyperparameters unchanged. Radar images, including SAR, have unique properties that differ from optical images; therefore, several transformations can result in poor performance. Selecting augmentation methods requires knowledge of the biases in the training data, obtained either through statistical analysis, in the case of a large dataset, or through the manual inspection of samples. This helps reduce the search space instead of trying every available method.

Tiling is required in remote sensing images as it is impossible to fit a large raster directly to a model. The choice of target resolution will affect the detection of multi-scale objects, such as buildings. Introducing randomness by varying the scale and crop size during dataset loading is a cheap way of boosting performance since there is no need to store extra images, such as in the case of tiling with overlapping regions. However, cropping too much will increase the chance of a large object covering the whole space and hinder performance. No-data regions are inevitable when tiling a large raster, and in our experience, it is better to remove them before feeding the image tile to the model.

Our study shows that pixel transformations are not as effective as geometric transformations. The reason might be that kernel filters, which are the base of most pixel transformations, are already an integral component of the CNN model itself, thus, learnable by an adequately sized model.

Applying transformations at different stages of the model development has different tradeoffs. For instance, Offline DA can utilize the training images with their stored variations, all at once to train the model. We experimented on a dataset transformed with speckle filters and achieved similar scores to Random Crop Resize in Table 2 but at the cost of 3.5× training time. This might be a good option when the data consist of only a few hundred samples and gathering extra data is unfeasible, such as in the case of classifying rare medical images.

We demonstrated that Test-Time Augmentation (TTA) is a cheap method to boost test scores. However, applying a set of augmentations at test time did not achieve better scores than applying the same set of augmentations during training. The model predicted the transformed test samples better when it had seen those variations during training; therefore, applying augmentations in both stages yields better scores. When using Shear and Fine Rotation in TTA, the angle must be kept low because these transformations remove a portion of the image (outside the image boundary) that does not return when performing the inverse transformation after prediction. This is why quarter-circle rotations and flips are more commonly used for TTA: they retain the full image after inverse transformation.

#### **6. Conclusions**

This paper presents several data augmentation methods for semantic segmentation of building footprints in SAR imagery. By artificially increasing the training dataset, we improved the model's generalization on unseen samples from the validation set, thereby reducing overfitting. The results show a 5% increase in Val IoU score when comparing the best augmentation scheme to the baseline model (no augmentation). Data augmentation can be very helpful in situations with limited data, either due to proprietary licenses or an expensive collection process.

For building detection in SAR, geometric transformations were more effective than pixel transformations. However, some transformations (such as vertical flip and quarter circle rotations) that alter key features of a building in SAR images were proven to be detrimental. Therefore, data augmentations must not be overused, especially since it takes more resources to train (either storage or processing time), which does not always lead to a better result. Test Time Augmentations showed an additional performance gain compared to augmentations applied only during training.

We hope this study can be used as a guide for future research to optimize object detection in a limited set of radar imagery or as an inspiration for investigating alternative methods to augment radar images. The search for effective data augmentation methods can be expensive; thus, automated approaches can save time and computing resources. These approaches have not yet been studied in this article. Furthermore, generating new data through generative models or the use of simulations remain interesting avenues to explore.

**Author Contributions:** Conceptualization, S.W. and P.S.; methodology, S.W.; software, S.W. and A.G.; validation, S.W., P.S. and A.G.; formal analysis, S.W.; investigation, S.W.; resources, S.W.; data curation, S.W.; writing—original draft preparation, S.W.; writing—review and editing, S.W., P.S. and A.G.; visualization, S.W.; supervision, P.S. and A.G.; project administration, S.W. and P.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Review* **Deep Learning for SAR Ship Detection: Past, Present and Future**

**Jianwei Li 1, Congan Xu 1,2,\*, Hang Su 1, Long Gao <sup>1</sup> and Taoyang Wang <sup>3</sup>**

	- wangtaoyang@whu.edu.cn

**\*** Correspondence: 7520220053@bit.edu.cn

**Abstract:** After the revival of deep learning in computer vision in 2012, SAR ship detection also entered the deep learning era. Deep learning-based computer vision algorithms work in an end-to-end pipeline, require no manually designed features, and achieve remarkable performance, so they have also been used to detect ships in SAR images. This direction began with the paper we published at BIGSARDATA 2017, in which the first dataset, SSDD, was used and shared with peers. Since then, many researchers have focused their attention on this field. In this paper, we analyze the past, present, and future of deep learning-based ship detection algorithms in SAR images. In the past section, we analyze the difference between traditional CFAR (constant false alarm rate) based and deep learning-based detectors through theory and experiment: the traditional method is unsupervised, the deep learning method is strongly supervised, and their performance differs several-fold. In the present section, we analyze the 177 published papers on SAR ship detection, highlighting the datasets, algorithms, performance, deep learning frameworks, countries, timeline, etc. We then introduce in detail the single-stage, two-stage, anchor-free, train-from-scratch, oriented-bounding-box, multi-scale, and real-time detectors among the 177 papers, and analyze their trade-offs between speed and accuracy. In the future section, we list the open problems and directions of this field. We find that, over the past five years, AP50 on SSDD has risen from 78.8% in 2017 to 97.8% in 2022. We also argue that researchers should design algorithms according to the specific characteristics of SAR images. The next step is to bridge the gap between SAR ship detection and computer vision by merging the small datasets into a large one and formulating corresponding standards and benchmarks. We expect that this survey of 177 papers can help people better understand these algorithms and stimulate more research in this field.

**Keywords:** SAR ship detection; SAR dataset; single-stage detector; two-stage detector; anchor free; train from scratch; oriented bounding box; multi-scale detection; deep learning; computer vision

#### **1. Introduction**

Synthetic aperture radar (SAR) remote sensing has become one of the important methods for marine monitoring due to its all-day, all-weather advantage. Ship detection in SAR images has broad prospects in both military and civilian fields [1,2].

The traditional detection method includes three steps: sea-land segmentation, CFAR (constant false alarm rate) detection, and discrimination [3,4]. In the sea-land segmentation step, land pixels are rejected to avoid interference with the CFAR step; common methods are based on GIS (geographic information system) data or image features, with the gray histogram being the classical feature used for segmentation. In the second step, CFAR is used for ship detection: a distribution function is assumed to fit the pixel distribution of the SAR image, commonly the K, Weibull, or Rayleigh distribution. To keep the probability of a false alarm at a constant value, the CFAR algorithm compares

**Citation:** Li, J.; Xu, C.; Su, H.; Gao, L.; Wang, T. Deep Learning for SAR Ship Detection: Past, Present and Future. *Remote Sens.* **2022**, *14*, 2712. https://doi.org/10.3390/rs14112712

Academic Editor: Dusan Gleich

Received: 6 May 2022 Accepted: 2 June 2022 Published: 5 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

the testing pixel with an adaptive threshold generated from the local background surrounding it. After pre-screening by CFAR, a discriminator is needed to reject the remaining background. The discriminator involves two procedures: feature design and classifier design. By exploiting the feature differences between ship chips and non-ship chips, this step reduces the number of false alarms. The traditional detection method dominated this field for a long time.

With the development of deep learning-based object detection algorithms in computer vision (CV) [5], SAR researchers also began to seek inspiration from computer vision. Three factors explain the revival of deep learning: the rise of computing power, big data, and corresponding algorithms. Because SAR images were not easily accessible, deep learning-based detection could not be applied to SAR ship detection at first.

This problem was solved in 2017, when the first dataset, SSDD (SAR Ship Detection Dataset), was opened to the public. SSDD provides the same data and evaluation criteria to all researchers, solving the problem that traditional algorithms lacked data and were not comparable. Since then, more and more researchers have adopted deep learning-based methods in this area, and these methods show great results compared with the traditional CFAR-based method. The active and open character of computer vision has further promoted the development of this field. We regard the emergence of SSDD as the point at which this field entered the deep learning era.

As far as we know, there are 177 papers [6–182] that use deep learning-based algorithms to detect ships in SAR images, yet no paper has reviewed them. To summarize the achievements of these 177 papers and point the way forward, we wrote this paper, hoping to contribute to the development of this field.

The rest of this review is arranged as shown in Figure 1. Section 2 briefly analyzes work related to our paper. Section 3 summarizes the past: traditional detection algorithms in SAR images, mainly CFAR, hand-crafted features, and limited shallow representations. Section 4 introduces the present: deep learning-based detectors; we review the 177 papers, divide them into 10 categories, and analyze each category. Section 5 shows the future directions of this field. Section 6 concludes the paper.

**Figure 1.** The overall architecture of the paper.

#### **2. Related Work**

As far as we know, few researchers have written review papers in this direction, partly because the direction is still relatively new. At present, only three papers [105,127,170] have performed work related to ours.

Jerzy et al. [105] reviewed papers from the last five years on SAR ship detection. They mainly cover the development of CFAR methods, CNN (convolutional neural network) based methods, GLRT (generalized likelihood ratio test) based methods, feature extraction-based methods, weighted information entropy-based methods, and variational Bayesian inference-based methods. Compared with paper [105], we focus on deep learning-based detection methods rather than traditional ones.

Mao et al. [127] addressed the lack of a performance benchmark for state-of-the-art methods on SSDD, allowing researchers to compare their work under the same experimental setup. They present 21 advanced detection models, including single-stage, two-stage, and train-from-scratch detection algorithms. Compared with paper [127], we not only report performance on different public datasets but also classify all the papers and summarize the principles and results of the algorithms.

Zhang et al. [170] solved the problem of coarse annotations and ambiguous standards in SSDD. These improvements enable fair comparison and have greatly promoted the healthy development of this field; we suggest that researchers follow the standards specified in that paper. Compared with paper [170], our work is not limited to SSDD but also introduces the other datasets in this field. More importantly, we systematically analyze, classify, and comment on the methods used, and point out future research directions, which benefits the development of this field.

In short, our work is different from the other papers. It is the first comprehensive review of SAR ship detection.

#### **3. Past—The Traditional SAR Ship Detection Algorithms**

Traditional detection algorithms in SAR images are based on hand-crafted features and limited shallow-learning representations. They can be divided into three steps: preprocessing, candidate region extraction, and discrimination.

CFAR is a common method for candidate region extraction, selecting potential ship regions. It first statistically models the clutter and then derives a threshold from the desired false alarm rate. Pixels above the threshold are regarded as ship pixels, and those below as background. CFAR is essentially a segmentation-based algorithm: pixels are classified into two categories (ship or non-ship) according to their gray level, and ship pixels are then merged into ship regions. The performance of this method largely depends on the statistical modeling of sea clutter and the parameter estimation of the selected model. For different SAR image products and practical application requirements, different statistical models have been proposed, such as the Gaussian, gamma, log-normal, Weibull, and K distributions; the Gaussian and K distributions are the most commonly used. Generally speaking, when the scene is relatively simple, the CFAR method achieves good results. However, for small ships and complex offshore scenes, modeling is difficult, so CFAR produces more false positives and poorer detection performance.
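As a concrete illustration of the clutter-modeling-and-threshold step described above, here is a minimal cell-averaging CFAR sketch. The exponential (single-look intensity) clutter model, window sizes, and false-alarm rate are illustrative assumptions, not the exact configuration of any cited detector.

```python
import numpy as np

def ca_cfar(img, guard=2, train=4, pfa=1e-3):
    """Cell-averaging CFAR on a 2-D intensity image.

    For each pixel, the clutter power is estimated from a ring of
    training cells outside a guard window; the threshold factor alpha
    follows from the desired false-alarm rate under exponentially
    distributed clutter: Pfa = (1 + alpha/n)^(-n).
    """
    n = (2 * (guard + train) + 1) ** 2 - (2 * guard + 1) ** 2  # training cells
    alpha = n * (pfa ** (-1.0 / n) - 1.0)                      # CA-CFAR factor
    h, w = img.shape
    detections = np.zeros((h, w), dtype=bool)
    r = guard + train
    for i in range(r, h - r):
        for j in range(r, w - r):
            window = img[i - r:i + r + 1, j - r:j + r + 1]
            inner = img[i - guard:i + guard + 1, j - guard:j + guard + 1]
            clutter = (window.sum() - inner.sum()) / n  # mean of training ring
            detections[i, j] = img[i, j] > alpha * clutter
    return detections
```

The fixed `guard`/`train` window is exactly the weakness noted later in this section: closely spaced ships of different sizes leak target energy into the training ring and cause missed detections.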

Discrimination is generally realized with artificially designed features and trained classifiers. In addition to simple features such as length, width, aspect ratio, and scattering point position, features introduced from computer vision, such as integral image features, HOG (histogram of oriented gradients), SURF (speeded up robust features), and LBP (local binary pattern), are also commonly used and are more robust. These features improve the performance of the detection algorithm. In classifier design, decision trees, SVMs, gradient boosting, and their improved versions further improve performance.
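As an example of one of the hand-crafted features named above, a basic 8-neighbour LBP can be computed in a few lines. The neighbour ordering and the >= thresholding convention here are one common choice, not necessarily the exact variant used in the cited detectors.

```python
import numpy as np

def lbp_8(img):
    """Basic 8-neighbour local binary pattern for a 2-D grayscale image.

    Each interior pixel gets an 8-bit code, one bit per neighbour, set
    when the neighbour is >= the centre pixel. Histograms of these codes
    served as texture features for ship/non-ship discrimination.
    """
    c = img[1:-1, 1:-1]                      # interior pixels
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]  # clockwise neighbours
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (di, dj) in enumerate(offsets):
        nb = img[1 + di:img.shape[0] - 1 + di, 1 + dj:img.shape[1] - 1 + dj]
        code |= (nb >= c).astype(np.uint8) << bit
    return code
```

A histogram of these codes over a candidate chip would then be fed to the classifier (SVM, decision tree, etc.) in the discrimination step.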

Feature and classifier design pushed this field forward for years. However, since the rise of deep learning in 2012 [5], these ideas have been dwarfed in both speed and accuracy. The deep learning-based object detection algorithm is an end-to-end method, as shown in Figure 2: it does not optimize multiple independent steps like the traditional method but optimizes the whole detection system jointly. It adapts to various complex scenes (no sea–land segmentation is needed nearshore or in ports) and is very robust. Therefore, in recent years, deep learning-based SAR ship detection algorithms have become a new research hotspot.

**Figure 2.** The differences between the CFAR-based detector and the deep learning-based detector.

The advantages and disadvantages of deep learning-based and traditional detection algorithms in SAR images can be shown by qualitative analysis and quantitative experiments.

Qualitative analysis reveals four main shortcomings of CFAR-based detection. Firstly, CFAR must set the size of the protection window according to the size of the ship; it works well for a single ship in locally uniform clutter, but when several ships of different sizes are close together, the mismatch between the varying ship sizes and the fixed protection window leads to missed detections. Secondly, the CFAR algorithm needs to model SAR images accurately, which is difficult to do. Thirdly, CFAR is essentially an unsupervised algorithm, and its performance is inherently worse than that of supervised algorithms (Faster R-CNN (region-based convolutional neural network), YOLO (you only look once), SSD (single shot detector), etc.). Fourthly, the CFAR and discrimination algorithms form a system pieced together from separately debugged stages, whose performance cannot match an end-to-end deep learning algorithm.

Sun Xian carried out a comparative experiment between classical ship detection algorithms and a deep learning algorithm [65]. In that paper, the classical algorithms (the optimal-entropy automatic threshold method and a CFAR method based on the K distribution) are tested and analyzed on the AIR-SARShip-1.0 dataset. The experimental results, shown in Table 1, indicate that the deep learning algorithm performs significantly better than the traditional algorithms.

**Table 1.** The performance difference between traditional-based detectors and deep learning-based detectors on the same condition [65].


Before SSDD, six papers used convolutional neural networks [1–6] to detect ships in SAR images. We do not consider these six papers deep learning-based, for two reasons. Firstly, some of the algorithms are not end-to-end; they merely use a CNN as a component in the traditional detection pipeline. Secondly, although some of the algorithms are end-to-end, their datasets and evaluation criteria are not public, which makes them difficult for later researchers to reproduce and their results incomparable.

Due to the important role of SSDD, we take the publication date of the SSDD paper as the dividing point between traditional and deep learning-based detection algorithms. We therefore consider that ship detection in SAR images entered the era of deep learning on 1 December 2017, as shown in Figure 3. Since then, a large number of researchers have gradually abandoned traditional CFAR-based detection and adopted advanced deep learning-based methods [6–182]. An overview of these deep learning-based detectors in SAR images is the focus of this paper.

**Figure 3.** The time divisions of past, present and future.

#### **4. Present—The Deep Learning-Based SAR Detection Algorithms**

*4.1. The General Overview of the 177 Papers*

#### 4.1.1. The Countries

From the country view, 90% of the papers' authors are Chinese, as shown in Figure 4. Chinese researchers are undoubtedly the mainstream in this direction; several public datasets were also constructed by Chinese researchers, which further supports this observation.

**Figure 4.** The percentages of the author countries.

#### 4.1.2. Journal or Conference

A total of 63% of the 177 papers were published in journals and 37% in conference proceedings. The most common venues are Remote Sensing and the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), respectively.

#### 4.1.3. Timeline of the 177 Papers

The timeline of the deep learning based SAR ship detectors is shown in Figure 5.

Gray lines and gray circles in the figure represent time, the purple bars represent the number of papers in each year, and the red circles represent the release dates of the datasets. From the timeline, we can see that the number of papers on deep learning-based SAR ship detectors has grown steadily over the past five years. The period 2016–2017 was a transition between traditional and deep learning methods, with only sporadic papers; because there was no unified dataset and the understanding of deep learning and computer vision algorithms was still shallow, deep learning was not yet applied thoroughly. This changed with the first dataset (SSDD) paper at the end of 2017. SSDD disclosed its data and evaluation criteria, laying the foundation for the rapid development of this field; since then, research accelerated and a large number of papers were published. The milestones of this field are the several open-access datasets shown in the red circles. With the increase in available datasets, more and more researchers are paying attention to this field.

#### 4.1.4. The Datasets and Satellites That Are Used

The datasets those papers used are shown in Table 2. SSDD is by far the most frequently used dataset: 83 times, or 62.4% of the total. The usage of several public datasets also shows a gradual upward trend.


**Table 2.** The datasets that are used in the past five years.

Before the first dataset paper was published in 2017, different researchers used different SAR images and indicators to test their detectors, so the results of different papers were not comparable, which hindered the development of this field. To overcome this, we constructed the first dataset, SSDD, and opened it to the public. We also provide another dataset, SSDD+, which shares the same images as SSDD but uses oriented bounding boxes. With the rapid development of deep learning-based computer vision algorithms after 2019, SSDD drew more attention from researchers. Zhang analyzed the usage of SSDD in paper [170], which shows that SSDD has become the most popular dataset despite its many drawbacks. In addition, SAR-Ship-Dataset and AIR-SARShip show great potential to become popular datasets. The other datasets have seldom been used, as they were released relatively late. Since deep learning models need more data to prevent overfitting, the future of this field is to merge the datasets into a large one and provide benchmarks on it with the common detection algorithms of computer vision. We believe that if the dataset is big enough, the benchmark comprehensive enough, and the maintenance regular enough, it will be accepted by most researchers. This is the focus of our future work.

Table 3 shows the SAR satellites used in the papers besides the public datasets. Images from Sentinel-1 are the most frequently used overall, because the data are easy to acquire and free to download.


**Table 3.** The satellites that are used in the paper beside the datasets.

However, since China's first C-band multi-polarization SAR satellite, Gaofen-3, was officially put into use on 23 January 2017, access to Gaofen-3 images has become progressively easier, and more and more papers use Gaofen-3 as the image source.

#### 4.1.5. Deep Learning Framework

A deep learning framework reduces researchers' workload [183–187]. Table 4 shows the frameworks those 177 papers used. In the early years (2017–2018), Caffe (Convolutional Architecture for Fast Feature Embedding) [188] was the most frequently used framework, because it was the first deep learning framework in common use and most detection algorithms in computer vision, such as Faster R-CNN and SSD, were originally implemented in it. Google's TensorFlow [189] is more powerful and easier to use than Caffe, and many researchers adopted it. PyTorch [190], promoted by Facebook FAIR in 2017, is more suitable for researchers, and its user base has gradually surpassed TensorFlow's. Besides Caffe, TensorFlow, and PyTorch, some researchers also use Keras, DarkNet, and PaddlePaddle. Since most detection algorithms in computer vision are based on TensorFlow and PyTorch, we recommend researchers in this area use one of them.


**Table 4.** The deep learning framework those papers used.

#### 4.1.6. Performance Evolution

Tables 5–11 show performance on several public datasets. In the 'AP' column, the large number is AP50 and the small one is AP. AP50 refers to the average precision at IoU (intersection over union) = 50%; AP averages the precision over IoU thresholds from 50% to 95% in steps of 5%. AP50 is therefore normally higher than AP. AP50 and AP are the standard metrics of PASCAL VOC and MS COCO, respectively. Since the datasets contain only one class (ship), the mAP value equals the AP value.
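Both metrics above hinge on box IoU. The following sketch shows the IoU computation and the two threshold conventions; the function name and (x1, y1, x2, y2) box format are our own choices for illustration.

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# PASCAL-VOC-style AP50 uses the single threshold 0.50, whereas
# COCO-style AP averages AP over these ten IoU thresholds:
thresholds = np.arange(50, 100, 5) / 100   # 0.50, 0.55, ..., 0.95
```

A detection counts as a true positive at a given threshold when its IoU with an unmatched ground-truth box meets that threshold; averaging over the stricter threshold set is why AP is normally lower than AP50.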

In the table, the italics represent the performance of two-stage detectors, and the nonitalics represent the performance of single-stage detectors. The blue, red, green, purple, and golden colors represent anchor-free, train from scratch, oriented bounding box, multi-scale, and attention detectors. The underlines represent real-time detectors.

The number of papers in Tables 5–11 is less than in Table 2. This is because some of the papers in Table 2 did not use the AP or AP50 as the evaluation indicator, so we do not show them in Tables 5–11.

From Table 5, we can see that 52 papers train and test on SSDD. Over the past five years, AP50 on SSDD rose from 78.8% in 2017 to 97.8% in 2022, and testing time has steadily decreased. Note that because the train-test division of the original SSDD is ambiguous, the AP values in Table 5 are not fully comparable; that is why we recommend that subsequent researchers adopt the improved SSDD [170] as the new standard.

**Table 5.** The performance evolution of detectors on SSDD (The data come from the 177 papers).


From Table 6, we can see that only four papers train and test on SSDD+; AP50 increased from 84.2% in 2018 to 94.46% in 2021. The overall performance is a bit lower than on SSDD, because detectors on SSDD+ must predict an additional parameter (the angle). We also find that SSDD+ is seldom used compared with SSDD; that is, few researchers in this area are interested in oriented-bounding-box detection.


**Table 6.** The performance evolution of detectors on SSDD+ (the data come from the 177 papers).

From Table 7, we can see that 14 papers train and test on SAR-Ship-Dataset; AP50 rose from 89.07% in 2019 to 96.1% in 2021, and speed reached 60.4 FPS at 96.1% AP50. The overall performance is a bit lower than on SSDD because this dataset is considerably larger.

**Table 7.** The performance evolution of detectors on SAR-Ship-Dataset (the data come from the 177 papers).


From Table 8, we can see that only four papers train and test on AIR-SARShip; AP50 rose from 88.01% in 2019 to 92.49% in 2021, and the running time dropped from 41.6 ms to 5.22 ms, a 7.98-fold speedup. The overall performance is a bit lower than on SSDD.

**Table 8.** The performance evolution of detectors on AIR-SARShip (the data come from the 177 papers).


From Table 9, we can see that only nine papers train and test on HRSID; AP50 rose from 89.3% in 2019 to 94.4% in 2021. The overall performance is a bit lower than on SSDD because this dataset is considerably larger.


**Table 9.** The performance evolution of detectors on HRSID (the data come from the 177 papers).

From Table 10, we can see that only three papers train and test on LS-SSDD-v1.0; AP rose from 72.3% in 2019 to 75.5% in 2022. The overall performance is a bit lower than on SSDD. LS-SSDD-v1.0 is designed for large-scale SAR ship detection, which suits satellite-based SAR systems, and it should be used more in the future.

**Table 10.** The performance evolution of detectors on LS-SSDD-v1.0 (the data come from the 177 papers).


The above datasets are relatively small compared with those used in computer vision. To improve detector generalization, researchers should use a large dataset, and some merge several datasets into one, as shown in Table 11. From Table 11, we can see that three papers train and test on composite datasets, with AP values of 81.13%, 71.4%, and 95.1%, respectively. As deep learning-based detectors are data-hungry, the public datasets should be merged into a large one to prevent overfitting.

**Table 11.** The performance evolution of detectors on other datasets (the data come from the 177 papers).


#### *4.2. The Algorithm Taxonomy of the 177 Papers*

We divide the 177 papers into 10 categories: papers about datasets, two-stage detectors, single-stage detectors, anchor-free detectors, train-from-scratch detectors, detectors with oriented bounding boxes, multi-scale detectors, detectors with attention modules, real-time detectors, and others. The percentage of each category is shown in Table 12. Note that the percentages sum to more than 100% because many algorithms have several attributes; for example, a detector may be both single-stage and trained from scratch.


**Table 12.** The percentage of each algorithm.

From Table 12, we can draw the following conclusions. Firstly, eight papers introduce datasets to researchers; they make a great contribution to this field. Secondly, two-stage detectors are used slightly more than single-stage detectors, partly because two-stage detectors are more accurate in most cases and accuracy is currently the first consideration. Thirdly, anchor-free detectors, detectors trained from scratch, oriented-bounding-box detectors, and detectors with attention modules each account for about 5–6% of the 177 papers; these directions are still rare and not yet noticed by many researchers, even though they can address the abnormal ship-size distribution and the scarcity of SAR images, so they deserve more attention in the future. Fourthly, almost 14% of the papers concern multi-scale SAR ship detection, somewhat more than the other directions; compared with objects in computer vision images, ships in SAR images are rather small, so detectors must pay attention to multi-scale ships to improve performance. Fifthly, 14.20% of the papers are classified as others, meaning they do not belong to the nine categories above. Sixthly, only three papers (1.7% of the 177) are reviews of this field; considering how active the research is, this is not enough, which is one of the motivations for our work.

#### *4.3. The Public Datasets*

#### 4.3.1. Overview

As far as we know, there are 10 public datasets for training and testing ship detectors in SAR images: SSDD (SSDD+) [11], SAR-Ship-Dataset [38], AIR-SARShip-1.0 [65], HRSID [94], LS-SSDD-v1.0 [101], AIR-SARShip-2.0 [191], Official-SSDD [170], SRSDD-v1.0 [177], and RSDD-SAR [192]. Table 13 shows detailed information on the 10 public datasets; the annotations of SSDD+, Official-SSDD, SRSDD-v1.0, and RSDD-SAR are oriented bounding boxes.

**Table 13.** Detail information of existing public datasets.


In addition to these, SMCDD [182] is a good dataset based on HISEA-1, China's first commercial SAR satellite. It contains 1851 bridges, 39,858 ships, 12,319 oil tanks, and 6368 aircraft, giving it a great advantage for multi-class detection.

In the future, it is very necessary to combine the above datasets into a large one to avoid the problem of overfitting.

In the following part, we will introduce the details of the datasets and evaluate their advantages and drawbacks.

#### 4.3.2. SSDD, SSDD+ and Official-SSDD

We made our dataset SSDD publicly available at the BIGSARDATA 2017 conference in Beijing [11]. SSDD is the first open dataset in this community and serves as a benchmark for researchers to train and evaluate their algorithms. SSDD contains a total of 1160 images and 2456 ships. The ships in SSDD are richly diverse, including small ships, complex backgrounds, and dense arrangements near wharves. We also give statistics of the length, width, and aspect ratio of the ship bounding boxes in SSDD. The papers that used SSDD and their performance are shown in Table 5.

At the same time, based on the 1160 SAR images of SSDD, we relabeled the ships with oriented bounding boxes to obtain SSDD+, the first SAR ship detection dataset with oriented bounding boxes. The papers that used SSDD+ and their performance are shown in Table 6.

At that time, there were some problems with SSDD due to our limited understanding of computer vision and deep learning: coarse annotations and ambiguous standards of use, which hindered fair comparison and effective academic exchange in this field.

In September 2021, Zhang [170] systematically analyzed and fixed the problems of SSDD, producing Official-SSDD. They relabeled the ships in SSDD and proposed three new variants: bounding-box SSDD, rotatable-bounding-box SSDD, and polygon-segmentation SSDD. They also formulated several standards: the train-test division, the inshore-offshore protocol, the ship-size definition, the determination of densely distributed small ships, and the determination of densely parallel ships berthing at ports. We suggest that follow-up researchers use Official-SSDD and the standards proposed in paper [170] in their research.

#### 4.3.3. SAR-Ship-Dataset

Training a deep learning model depends on a large amount of data, and SSDD is relatively small. To solve this problem, Wang Chao [38] constructed SAR-Ship-Dataset, which contains 43,819 images and 59,535 ships, far more than SSDD. Its sources are 102 Gaofen-3 images and 108 Sentinel-1 SAR images. The ships have distinct scales and backgrounds, and the resolutions, incidence angles, polarization modes, and imaging modes are also diverse, which helps deep learning models fit different conditions. The papers that used SAR-Ship-Dataset and their performance are shown in Table 7.

#### 4.3.4. AIR-SARShip

A dataset containing more diverse scenes and covering various types of ships will help to train a model with better performance, stronger robustness, and higher practicability. In order to achieve the above purpose, Sun Xian constructed a dataset based on the Gaofen-3 satellite, named AIR-SARShip-1.0 [65].

It contains 31 large images in total: 21 for training and 10 for testing. The image resolutions are 1 m and 3 m, and the image size is about 3000 × 3000 pixels. The information for each image includes the image number, pixel size, resolution, sea state, scene, and number of ships. The dataset is characterized by large scenes and small ships.

On the basis of version 1.0, Sun Xian and colleagues added more Gaofen-3 data to build AIR-SARShip-2.0, which contains 300 SAR images. The scene types include ports, islands, reefs, and sea surfaces under different sea conditions. The annotations give the locations of ships and have been confirmed by professional interpreters.

The papers that used AIR-SARShip and their performance are shown in Table 8.

#### 4.3.5. HRSID

The original SAR images used to construct HRSID [94] include 99 Sentinel-1B images, 36 TerraSAR-X images, and 1 TanDEM-X image. HRSID has 5604 high-resolution SAR images of 800 × 800 pixels to meet the needs of practical GPU training. It is designed for CNN-based ship detection and segmentation and contains only one category, ships. It is divided into a 65% training set and a 35% testing set, and ships are labeled with polygons. To reduce the bias of ship detection algorithms, interference derived from a ship is marked as part of that ship.

According to statistics, 16,951 ships are marked in HRSID, and each SAR image contains three ships on average. Small, medium, and large ships account for 54.5%, 43.5%, and 2% of all ships, respectively, and their bounding-box areas cover 0–0.16%, 0.16–1.5%, and more than 1.5% of the image area, respectively. Ships are therefore sparsely distributed in SAR images.

The papers that used HRSID and their performance are shown in Table 9.

#### 4.3.6. LS-SSDD-v1.0

Zhang Xiaoling [101] constructed LS-SSDD-v1.0, a SAR ship detection dataset with large scenes and small ships. The dataset consists of 15 Sentinel-1 SAR images of 24,000 × 16,000 pixels, each directly divided into 600 sub-images of 800 × 800 pixels, and contains 6015 ships. This organization lets researchers apply the dataset flexibly. Optical information from Google Earth and ship information from AIS were used for annotation. The coastlines of the imaged areas are relatively complex, the land area is smaller than the ocean area, and ships in inland rivers are densely distributed. The dataset is characterized by large scenes, a focus on small ships, and rich pure backgrounds, and it provides a large number of performance benchmarks for detection algorithms.

The papers that used LS-SSDD-v1.0 and their performance are shown in Table 10.

#### 4.3.7. SRSDD-v1.0

The original images of SRSDD-v1.0 come from Gaofen-3 [177]: 30 panoramic SAR images of port areas. It is annotated with oriented bounding boxes, with optical images (Google Earth or GF-2) used to assist annotation. The image size is 1024 × 1024 pixels, and the annotation format is the same as DOTA: the annotation files give the coordinates of the four corners of the box, the category, and whether the target is difficult to identify.

It contains 666 images: 420 images with 2275 ships include land cover, and 246 images with 609 ships contain only sea in the background. It has six categories: ore-oil ships (166), bulk cargo ships (2053), fishing boats (288), law enforcement ships (25), dredger ships (263), and container ships (89), so the dataset has a certain class-imbalance problem.

#### 4.3.8. RSDD-SAR

The RSDD-SAR dataset consists of 84 Gaofen-3 scenes and 41 TerraSAR-X scenes. RSDD-SAR has 7000 images containing 10,263 ships, of which 5000 are randomly selected as the training set and the other 2000 as the testing set. Analyzing the distributions of ship angle and aspect ratio shows that ship angles are evenly distributed between 0° and 180° and aspect ratios are concentrated between two and six, indicating that the ships have arbitrary rotation directions and large aspect ratios. The dataset also has a high proportion of small ships, so it can be used to verify the performance of small-ship detection algorithms. RSDD-SAR covers vast sea areas, ports, docks, waterways, and other scenes at different resolutions, which suits practical applications.

#### *4.4. Two-Stage Detectors*

Deep learning-based object detection algorithms can be divided into single-stage and two-stage detectors. Single-stage detectors use a fully convolutional network to classify and regress the anchor boxes once to obtain the detection results; two-stage detectors use a CNN to classify and regress the anchor boxes twice. The principles of single-stage and two-stage detection algorithms are shown in Figure 6.

**Figure 6.** The principle of single-stage and two-stage detectors.

Classical two-stage detectors are Faster R-CNN, R-FCN (region-based fully convolutional network) [193], feature pyramid networks (FPN) [194], Cascade R-CNN [195], Mask R-CNN [196], and so on [197]. Faster R-CNN is the foundational work, and most two-stage detectors are improvements based on it.

Among the 177 papers, most improve on the following aspects: the backbone network, the region proposal network (RPN), the anchor box, the loss function, and non-maximum suppression (NMS). They are shown in Figure 7. Compared with computer vision, research in this field lags behind, and more advanced two-stage detection algorithms have not yet been used here.

**Figure 7.** The two-stage SAR ship detectors.

#### 4.4.1. Backbone Network

There are three main directions in the improvement of the backbone network, namely FPN, feature fusion, and attention.

FPN builds a feature pyramid that combines low-resolution feature maps with strong semantics and high-resolution feature maps with weak semantics. It includes a bottom-up pathway, a top-down pathway, and lateral skip connections. Predictions are made independently at every level, which adds only minimal extra computation and memory. Because it improves the detection of small ships, it is widely used. Much work has been completed to improve FPN in the computer vision field, such as ASFF [198], NAS-FPN [199], and BiFPN [200].
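
The top-down merge at the heart of FPN can be sketched as follows. This is a minimal illustration with single-channel maps and nearest-neighbor upsampling; the 1 × 1 lateral convolutions and 3 × 3 smoothing convolutions of the full design are omitted, and the function names are ours:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of an (H, W) feature map.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_merge(c_maps):
    """Top-down FPN merge. c_maps: list of (H, W) maps, finest first.
    Lateral 1x1 convs are omitted; channels are collapsed for clarity."""
    p = [None] * len(c_maps)
    p[-1] = c_maps[-1]                           # coarsest level starts the pathway
    for i in range(len(c_maps) - 2, -1, -1):
        p[i] = c_maps[i] + upsample2x(p[i + 1])  # lateral + upsampled top-down
    return p
```

Each output level thus mixes the semantics accumulated at coarser levels into the finer, high-resolution maps, which is why small-ship detection benefits.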

Six papers [42,62,63,69,91,110] adopted and improved FPN in this field. Cui et al. [42] proposed DAPN (dense attention pyramid network), which densely connects convolutional block attention modules from the top to the bottom of the pyramid; in this way, rich features covering both resolution and semantic information are extracted for multi-scale ship detection. Li et al. [62] used a convolutional block attention module (CBAM) [201] to control the degree of upper- and lower-level feature fusion in FPN. Liu et al. [63] proposed a scale-transferrable pyramid network that densely connects each feature map from top to bottom using a scale-transfer layer; this expands the resolution of the feature maps, which helps detection. Wei et al. [69] adopted a parallel high-resolution feature pyramid network to make full use of high- and low-resolution convolutional feature maps for SAR ship detection. Zhao et al. [91] adopted receptive-field blocks and convolutional block attention modules to build a top-down fine-grained feature pyramid, which can capture features of ships with large aspect ratios and enhance local features with their global dependencies. Hu et al. [110] added dense connections to a feature pyramid network in which shallow and deep features are processed differently, taking the differences between levels into account.

Three papers [11,57,115] improved the backbone network through feature fusion. Li et al. [11] fused the feature maps from convolutional layers 3 to 5; the fusion includes normalization and 1 × 1 convolution. Normalizing each RoI (region of interest) pooling tensor reduces the scale differences between the following layers, preventing 'larger' features from dominating 'smaller' ones and making the algorithm more robust; this modification stabilizes the system and increases accuracy. Yue et al. [57] fused semantically strong features with low-level high-resolution features, which helps reduce false alarms. Li et al. [115] presented a jump-connection structure to extract features of targets at each scale in the SAR image, improving recognition and localization.

Five papers [17,29,62,122,137] improve the backbone network through the attention module SENet. SENet squeezes the feature map along the spatial dimensions to model the interdependence between feature channels explicitly and then learns the importance of each channel automatically. According to this importance, it amplifies useful features and suppresses features that are not useful for the current task.
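
The squeeze-and-excitation operation described above fits in a few lines. This is an illustrative NumPy sketch for a single sample, with the two fully connected excitation layers represented by caller-supplied weight matrices `w1` and `w2` (names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    """Squeeze-and-excitation on x of shape (C, H, W).
    w1: (C//r, C) and w2: (C, C//r) are the two FC layers (reduction r)."""
    s = x.mean(axis=(1, 2))            # squeeze: global average pool -> (C,)
    e = np.maximum(w1 @ s, 0.0)        # excitation: FC + ReLU
    scale = sigmoid(w2 @ e)            # FC + sigmoid -> per-channel weights
    return x * scale[:, None, None]    # reweight channels by learned importance
```

The learned per-channel weights are what lets the network emphasize channels that respond to ships and suppress those dominated by clutter.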

#### 4.4.2. RPN

Another direction is improving the RPN module of Faster R-CNN. Papers [16,25,34,46,87] did not use a single feature map to generate proposals but generated proposals from each fused feature map. Liu et al. [36] designed a scale-independent proposal generation module that extracts features such as edges, super-pixels, and strong scattering components from the SAR image to obtain ship proposals and ranks them by the integrity and tightness of the contour. In paper [160], candidate proposals are extracted from the original and denoised SAR images separately and then combined to reduce the impact of noise on ship detection. These methods improve the performance on multi-size ships to some extent.

#### 4.4.3. Loss Function

Faster R-CNN forces the ratio of positive and negative samples to 1:3 to address the imbalance between positive and negative proposals. Similar work in computer vision includes focal loss, OHEM (online hard example mining) [202], GHM (gradient harmonizing mechanism) [203], and Libra R-CNN [204]. Papers [16,78] adopted focal loss to increase the weight of hard negative samples and reduce the weight of easy samples, avoiding the problem that a large number of easy samples overwhelm a small number of hard negatives during training.
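
The reweighting idea of focal loss can be written out directly. A minimal single-prediction sketch; the full loss averages this over all anchors:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one binary prediction.
    p: predicted probability of the positive class, y: label in {0, 1}."""
    pt = p if y == 1 else 1.0 - p          # probability of the true class
    a = alpha if y == 1 else 1.0 - alpha
    # (1 - pt)^gamma shrinks the loss of well-classified (easy) examples,
    # so training is dominated by the hard ones.
    return -a * (1.0 - pt) ** gamma * math.log(pt)
```

With gamma = 0 and alpha = 1 it reduces to plain cross-entropy; increasing gamma downweights easy samples more aggressively.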

#### 4.4.4. Anchor and NMS

Faster R-CNN uses three scales and three aspect ratios, producing a total of 60 × 60 × 9 anchor boxes. However, ships in SAR images are extremely small and sparsely distributed, so using such dense anchor boxes is wasteful. Yue et al. [57] and Wang et al. [122] set the anchor-box parameters based on an analysis of the actual size and distribution of ships, mainly reducing the anchor size and selecting appropriate shapes. Chen et al. [70] and Li et al. [115] used K-means to obtain the distribution of ship sizes, so as to obtain appropriate anchor boxes and reduce the difficulty of learning.
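
K-means anchor selection of the kind used in [70,115] typically replaces Euclidean distance with 1 − IoU so that large boxes do not dominate the clustering. A small self-contained sketch (function names are ours):

```python
import random

def iou_wh(a, b):
    """IoU of two boxes (w, h) aligned at a common top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs into k anchor shapes using
    1 - IoU as the distance, as popularized by YOLOv2."""
    random.seed(seed)
    centers = random.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # assign each box to the center with the highest IoU
            j = max(range(k), key=lambda i: iou_wh(b, centers[i]))
            clusters[j].append(b)
        centers = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers
```

The resulting cluster centers become the anchor widths and heights, matching the dataset's actual ship-size distribution.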

Wei et al. [69] and Wang et al. [78] replaced NMS with soft NMS [205]. Instead of removing every box whose IoU with a higher-scoring box exceeds the threshold, soft NMS attenuates its score with a weight, which avoids the accuracy loss of hard suppression.
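
The Gaussian score decay of soft NMS can be sketched as follows; a compact, unoptimized version (the original paper also describes a linear decay variant):

```python
import math

def iou(a, b):
    """IoU of boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian soft NMS: decay the scores of overlapping boxes
    instead of removing them outright."""
    dets = sorted(zip(boxes, scores), key=lambda d: -d[1])
    keep = []
    while dets:
        best, s = dets.pop(0)
        keep.append((best, s))
        # decay every remaining score by exp(-IoU^2 / sigma)
        dets = [(b, sc * math.exp(-iou(best, b) ** 2 / sigma)) for b, sc in dets]
        dets = [(b, sc) for b, sc in dets if sc > score_thresh]
        dets.sort(key=lambda d: -d[1])
    return keep
```

A heavily overlapping neighbor thus survives with a reduced score rather than being deleted, which helps with densely berthed ships.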

#### 4.4.5. Others

ISASDNet (instance segmentation assisted ship detection network) was proposed based on Mask R-CNN in paper [163]. It has two branches, detection and segmentation, whose outputs interact to improve the detection results. Gui et al. [34] proposed a lightweight detection head with a large separable convolution kernel and position-sensitive pooling, which improves detection speed.

#### *4.5. Single-Stage Detectors*

Two-stage detectors first generate candidate boxes and then classify and regress them, which differs considerably from how human eyes work. Single-stage detectors only need to look at the picture once to predict what an object is and where it is, which is closer to human vision. In addition, they are considerably faster than two-stage detectors.

Classical single-stage detectors are YOLO, SSD, RetinaNet [206], and CornerNet [207]. YOLO and SSD are the two most popular single-stage detection algorithms, and most of the subsequent single-stage works are based on them.

The single-stage ship detectors in SAR images are shown in Figure 8.

**Figure 8.** The single-stage SAR ship detectors.

#### 4.5.1. YOLO and SSD Series in Computer Vision

YOLOv1 [186] regards object detection as a regression problem and outputs spatially separated bounding boxes and the related class probabilities simultaneously. A neural network predicts the bounding boxes and class probabilities from the image in one forward pass. It is very fast, but its location prediction is inaccurate and its recall is low. YOLOv2 [208] uses multi-scale training and predicts offsets rather than the box parameters themselves; because offsets are smaller values, prediction becomes more accurate. It uses an anchor mechanism, obtaining anchor-box parameters by clustering the object sizes in the dataset, and adopts DarkNet-19 as the backbone. Although the detection head grew from 7 × 7 to 13 × 13, small-object detection remains poor. The YOLOv3 [209] detection head has three branches, 13 × 13, 26 × 26, and 52 × 52, which cover large, medium, and small objects and make location prediction more accurate; its anchor mechanism is the same as YOLOv2's. YOLOv4 [210] assigns two anchors to one ground truth, while YOLOv3 uses only one, which alleviates the imbalance between positive and negative samples. CIoU loss is adopted to solve the problems of MSE (mean squared error) loss, IoU loss, GIoU, and DIoU [211–214]. YOLOv4 also uses several other techniques to achieve state-of-the-art results. YOLOv5 adopts adaptive anchors, letting the network learn the anchor parameters; its detection head is the same as those of YOLOv3 and YOLOv4. It is slightly weaker than YOLOv4 in accuracy but much faster, giving it strong advantages in rapid model deployment.

The SSD detection algorithm combines the regression idea with the anchor-box (default-box) mechanism. It eliminates the candidate-region generation and the subsequent pixel or feature resampling stage (RoI pooling) of two-stage algorithms, encapsulating all computation in one network, which makes it easy to train and very fast. RFBNet [215] and M2Det [216] are two successors of SSD; they improve the classical SSD with receptive-field blocks and a multi-level feature pyramid network, respectively.

Single-stage SAR ship detection algorithms can be divided into three categories: SAR image ship detection based on the YOLO series, SAR ship detection based on the SSD series, and other algorithms.

#### 4.5.2. SAR Ship Detection Based on YOLO Series

The YOLO series is widely used in this field. The improvements mainly focus on lightweight backbone-network design, multi-layer feature fusion, anchor-box generation, multi-feature-map prediction, loss functions, etc.

YOLOv2. Deng et al. [33] and Chang et al. [39] adopted YOLOv2 to detect ships in SAR images. Paper [39] proposed YOLOv2-reduced, which removes some layers of YOLOv2. YOLOv2-reduced achieves an AP of 89.76% at 10.937 ms and 44.72 BFLOPS, compared with YOLOv2's 90.05% AP at 25.767 ms and 50.17 BFLOPS.

YOLOv3. Zhang et al. [82] accelerated the original YOLOv3 by using DarkNet-19 as the backbone network and by removing the repeated YOLOv3-scale1, YOLOv3-scale2, and YOLOv3-scale3 heads. Zhu et al. [116], Chaudhary et al. [123], and Jiang et al. [135] used classical YOLOv3 with some techniques to detect ships in SAR images. Wang et al. [119] proposed SSS-YOLO, which redesigns the feature-extraction network to enhance the spatial and semantic information of small ships; it adopts a PAFN (path argumentation fusion network) to fuse different features in a top-down and bottom-up manner and performs better on small ships in SAR images. Hong et al. [158] improved the performance of YOLOv3 with several techniques: the improved clustering algorithm K-means++ generates anchor boxes, improving performance on multi-scale ships; a Gaussian parameter is introduced to add an uncertainty estimate to the positioning of the bounding box; and four anchor boxes are assigned to each detection scale instead of YOLOv3's three. Zhang et al. [40] used the YOLO idea, meshing the input image, and adopted depthwise separable convolution to improve detection speed. MobileNet is used as the feature extractor to detect ships at three scales: 13 × 13, 26 × 26, and 52 × 52, with anchor-box sizes obtained by the K-means algorithm. D-CNN-13 has a large receptive field with anchor widths and heights of (9, 11), (11, 22), and (14, 26); D-CNN-26 has a medium receptive field with (16, 40), (17, 12), and (27, 57); D-CNN-52 has a small receptive field with (28, 17), (57, 28), and (69, 72). Zhang et al. [54] used depthwise and pointwise convolution to replace traditional convolution and adopted multi-scale detection, concatenation, and anchor-box mechanisms to improve detection speed. The detection network has three parts, so it can detect an input SAR image at three scales (5 × 5, 10 × 10, and 20 × 20) and then obtain the final results; with nine anchor boxes over the three detection scales, it can detect up to nine ships in the same grid cell. Zhou et al. [102] designed a CNN named LiraNet with low complexity, few parameters, and strong feature-representation ability; it combines dense connections, residual connections, and group convolution, and includes stem blocks and extractor modules. The network serves as the feature extractor of Lira-YOLO, which needs only 2.980 BFLOPS and 4.3 MB of parameters and achieves good accuracy with less memory and computational cost than tiny-YOLOv3. In [151], DarkNet-53 with residual units is used as the backbone to extract features, and a top-down pyramid structure is added for multi-scale feature fusion. Soft NMS, mix-up, mosaic data augmentation, multi-scale training, and hybrid optimization are used to boost performance. The 13 × 13, 26 × 26, and 52 × 52 feature maps, with large, medium, and small receptive fields, are responsible for large, medium, and small ships, respectively.
The model is trained from scratch to avoid the learning objective bias of pre-training. The detection speed is fast, about 72 frames per second.
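
The speedup from replacing standard convolution with depthwise plus pointwise convolution, as in the MobileNet-style extractors above, comes from the parameter (and FLOP-per-position) count; a quick sketch of the two counts:

```python
def conv_params(cin, cout, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return k * k * cin * cout

def depthwise_separable_params(cin, cout, k):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution."""
    return k * k * cin + cin * cout
```

For a 3 × 3 layer with 32 input and 64 output channels, the standard form needs 18,432 weights while the separable form needs 2336, roughly an 8× reduction, which is why these lightweight SAR detectors favor it.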

YOLOv4. Ma et al. [156] proposed YOLOv4-light, which is tailored to reduce model size, detection time, computational parameters, and memory consumption; three-channel images are used to compensate for the loss of accuracy. Liu et al. [181] proposed a detection method based on YOLOv4-Lit [217], whose backbone is MobileNetv2; a receptive-field block is used for multi-scale target detection. It achieves an AP of 95.03% at 47.16 FPS with a model size of 49.34 M.

YOLOv5. Tang et al. [144] proposed N-YOLO based on YOLOv5. N-YOLO adopts a noise-level classifier to classify the noise level of SAR images, and a SAR-ship potential-area extraction module extracts the complete regions of potential ships. Zhou et al. [179] proposed a multi-scale ship detection network based on YOLOv5; it uses a cross-stage partial network to improve feature-representation capability and a feature pyramid network with a fusion-coefficients module to fuse feature maps adaptively. It offers a good tradeoff between model size and inference time.

Others. Zhang et al. [84] proposed ShipDeNet-20. It has only 20 convolution layers and a model size smaller than 1 MB, lighter than other state-of-the-art detectors. ShipDeNet-20 is based on YOLO and is trained from scratch; a feature fusion module, a feature enhancement module, and a scale-share feature pyramid module are proposed to make up for the accuracy loss of the raw ShipDeNet-20, giving a good tradeoff between accuracy and speed. Zhu et al. [175] proposed DB-YOLO, composed of a feature-extraction network, a duplicate bilateral feature pyramid network, and a detection network. The single-stage network meets real-time detection requirements, and cross-stage partial connections reduce redundant parameters. The duplicate bilateral feature pyramid network enhances the fusion of semantic and spatial information, alleviating the small-ship detection problem. CIoU loss is used as the loss function for its faster convergence and better performance.
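
CIoU loss, adopted here and in YOLOv4-style detectors, augments 1 − IoU with a normalized center-distance term and an aspect-ratio consistency term. A sketch following the published formulation, assuming corner-format boxes:

```python
import math

def ciou_loss(a, b):
    """Complete IoU loss for boxes (x1, y1, x2, y2):
    1 - IoU + center-distance penalty + aspect-ratio penalty."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    iou = inter / (wa * ha + wb * hb - inter)
    # squared distance between box centers
    rho2 = (((a[0] + a[2]) - (b[0] + b[2])) ** 2 +
            ((a[1] + a[3]) - (b[1] + b[3])) ** 2) / 4.0
    # squared diagonal of the smallest enclosing box
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / (1.0 - iou + v) if iou < 1.0 else 0.0
    return 1.0 - iou + rho2 / c2 + alpha * v
```

Unlike plain IoU loss, the center-distance term still produces a useful gradient when the predicted and ground-truth boxes do not overlap at all, which is the source of the faster convergence mentioned above.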

#### 4.5.3. SAR Ship Detection Based on SSD Series

Wang et al. [14,18] used SSD directly without improving it. Papers [51,98,108] are detection algorithms trained from scratch based on SSD. Most of the other papers improve the backbone network of SSD to give the model stronger feature-extraction ability.

Chen et al. [15] adopted a two-stage regression network based on SSD, namely R2RN (robust two-stage regression network), to improve performance on small ships; R2RN connects an anchor-modification module and an object detection module to inherit the essence of the feature pyramid. Ma et al. [30] proposed an SSD model with multi-resolution input, which extracts richer features. Papers [43,44] applied the attention mechanism to SSD and designed a new loss function based on GIoU. Li et al. [47] analyzed the reasons for the low detection accuracy of small and medium-sized ships in SSD and put forward improvement strategies: first, anchor-box optimization based on K-means clustering to improve the matching performance of the anchor box; second, a deconvolution-based feature-fusion method to improve the representation ability of the low-level feature maps. Chen et al. [55] adopted the attention mechanism and multi-level features to improve the feature-extraction ability of the backbone network. Han et al. [99] used deconvolution to enhance the representation of small ships in the pyramid, improving the detection accuracy of SSD. Zhang et al. [113] took the original SAR image and its saliency map as inputs and fused their features to reduce computational complexity and network parameters. Chen et al. [114] proposed SSDv2, which adds a deconvolution module and a prediction module to SSD to improve detection accuracy. Jin et al. [149] improved SSD with feature fusion and a squeeze-excitation module.

Sun et al. [162] proposed SANet (semantic attention-based network), which combines semantic attention, focal loss, and label and anchor assignment to improve performance without increasing computation. Papers [104,159] adopted M2Det to detect ships in SAR images.

#### 4.5.4. Others

RefineDet adopts a two-step cascade regression strategy to predict object position and size, allowing single-stage detectors to reach the accuracy of two-stage detectors without extra computation; it is widely used in computer vision. Zhu et al. [159] adopted RefineDet to detect ships in SAR images, achieving an AP of 98.4%. In [169], GHM was used as the loss function of RefineDet so that the network can make full use of all examples and adaptively increase the weight of hard cases; a multi-scale feature attention module is added to highlight important information and suppress the interference caused by clutter. It achieves 96.61% precision on AIR-SARShip-1.0.

#### *4.6. Anchor Free Detectors*

#### 4.6.1. Development of Anchor Free Detection Algorithm in Computer Vision

The anchor box is key to the success of Faster R-CNN and SSD. The backbone network extracts a feature map from the input image, and each pixel of the map is an anchor point; taking each anchor point as a center and setting different scales and aspect ratios by hand yields multiple anchor boxes. Anchor boxes have two advantages: first, they generate dense candidate boxes, which makes it convenient for the network to classify and regress targets; second, they improve recall, which suits small-target detection.

However, anchor boxes must be designed manually from experience, which has the following defects: first, hyper-parameters must be set, such as the number, size, aspect ratio, and IoU threshold; second, matching the ground-truth boxes requires generating a large number of anchor boxes, which is computationally intensive; third, most anchors are invalid, which leads to an imbalance between positive and negative samples; fourth, the anchors must be adjusted to the size and shape distribution of each dataset.
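
The anchor generation described above, producing one scale/ratio set at every feature-map cell, can be sketched as follows (centering and ratio conventions follow common Faster R-CNN implementations; details vary between codebases):

```python
def make_anchors(fmap_h, fmap_w, stride, scales, ratios):
    """Generate (cx, cy, w, h) anchors: one set of scale/ratio shapes
    centered on every feature-map cell."""
    anchors = []
    for y in range(fmap_h):
        for x in range(fmap_w):
            # map the cell back to image coordinates
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * r ** 0.5   # width/height ratio equals r
                    h = s / r ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors
```

A 60 × 60 map with three scales and three ratios already yields 32,400 anchors, which illustrates why most of them are invalid negatives for sparse SAR ship scenes.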

Anchor-free detectors open up another path by eliminating predefined anchor boxes and directly predicting several key points of the target from the feature map. Examples include CornerNet, ExtremeNet [218], CenterNet [219], Objects as Points [220], FCOS (fully convolutional one-stage) [221], and FoveaBox [222].

Anchor-free detectors can avoid these problems and have great application potential in SAR ship detection. For example, because ships are small and sparsely distributed, most candidate anchor boxes are invalid negative samples; anchor-free detectors skip these invalid anchors and reduce the number of predicted boxes, improving accuracy and speed simultaneously. The anchor-free ship detectors in SAR images are shown in Figure 9.

**Figure 9.** The anchor-free SAR ship detectors.

#### 4.6.2. Development of Anchor-Free SAR Ship Detection Algorithms

Mao et al. [81] proposed a simplified U-Net [223]-based anchor-free detector for SAR images. It includes a ship bounding-box regression network and a score-map regression network: the former regresses a box from each pixel of the input image, while the latter predicts a 2D probability distribution in which the score at each position indicates the likelihood that the position is the center of a ship. Cui et al. [89] proposed a CenterNet (objects-as-points)-based SAR ship detector. It predicts the center point of the target through key-point estimation and uses the image information at the center point to obtain the size and position of the ship. No anchors need to be preset and no NMS is needed, which greatly reduces the number of network parameters and computations; the anchor mismatch of small ships is also avoided. Spatial shuffle-group enhance attention modules are used to extract features with more semantic information. Fu et al. [95] proposed an attention-guided balanced pyramid based on FCOS to improve performance on small ships. Zhou et al. [97] proposed an anchor-free detector with dense attention feature aggregation: a lightweight feature extractor and dense attention feature aggregation extract multi-scale features, and a center-point-based ship predictor regresses centers and sizes. With no preset anchors and no NMS, its computational efficiency is high. Mao et al. [103] proposed a lightweight network named ResSARNet with only 0.69 M parameters and improved FCOS in four aspects: placing center-ness in the bounding-box regression branch rather than the classification branch, center sampling, GIoU loss, and adaptive training sample selection. The network needs only 1.17 M parameters and achieves 61.5% AP and 70.9% AR. An et al. [129] designed an anchor-free rotatable detector that replaces the conventional rotatable prior-box mechanism with center-point, scale, and angle prediction; the training procedure covers positive-sample selection, feature encoding, and loss-function design. Wang et al. [133] proposed a CenterNet-based detector for SAR images in which a spatial group-wise enhanced attention module extracts more semantic features.
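
The NMS-free decoding used by these center-point detectors reduces to finding local maxima on the predicted heatmap. A minimal sketch with plain Python lists; in practice a 3 × 3 max-pool on the GPU plays the role of the neighborhood check:

```python
def decode_centers(heatmap, wh, thresh=0.5):
    """Decode detections from a center-point heatmap (CenterNet-style).
    heatmap: 2D list of center scores; wh: dict (y, x) -> (w, h).
    A cell is a detection if it is a local maximum above thresh."""
    h, w = len(heatmap), len(heatmap[0])
    dets = []
    for y in range(h):
        for x in range(w):
            s = heatmap[y][x]
            if s < thresh:
                continue
            neighbors = [heatmap[ny][nx]
                         for ny in range(max(0, y - 1), min(h, y + 2))
                         for nx in range(max(0, x - 1), min(w, x + 2))
                         if (ny, nx) != (y, x)]
            if all(s >= n for n in neighbors):     # 3x3 local-maximum check
                bw, bh = wh[(y, x)]
                dets.append((x, y, bw, bh, s))
    return dets
```

Because each peak yields exactly one box, no anchors and no post-hoc NMS are required, which is the efficiency argument made by these papers.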

Sun et al. [167] proposed category-position FCOS. The category-position module optimizes the position regression branch of FCOS, and the classification and regression branches are redesigned to alleviate the imbalance between positive and negative samples during training. Zhu et al. [180] adopted FCOS as the base model to remove the effect of anchors. A new sample-definition method replaces the IoU threshold, reflecting the differences between SAR and natural images. A same-resolution feature convolution module, a multi-resolution feature fusion module, and a feature pyramid module are used to extract features, and focal loss and CIoU further improve performance.

In all, researchers in SAR ship detection have realized the benefit of anchor-free detectors, and more and more papers are appearing in this field. However, a problem remains: the innovation is relatively weak, and many existing achievements of computer vision have not yet been applied here.

#### *4.7. Detectors Trained from Scratch*

At present, most SAR detector backbones are pre-trained on natural-image classification datasets and then fine-tuned on a SAR ship detection dataset (for example, SSDD). This transfer learning gives the detector a better initialization and compensates for insufficient samples. However, it brings the following problems. First, there is learning bias: the loss functions and category distributions of classification and detection are essentially contradictory, so models trained for classification do not fit detection. Second, most backbone networks produce a large receptive field through repeated downsampling in the later layers, which is good for classification but harmful for localization. Third, pre-trained backbones are redundant and cannot be modified, which hinders researchers from flexibly designing CNNs according to their needs.

In order to solve the problems of transfer learning, algorithms trained from scratch are proposed in computer vision, for example, DSOD (deeply supervised object detectors), DetNet, ScratchDet, and so on [224–227].

The main idea of DSOD and GRP-DSOD for training from scratch is to design the backbone and front-end networks elaborately [224]. DetNet [225] retains a large feature-map scale in the last few layers, preserving more location information. ScratchDet [226] applies batch normalization in each layer and increases the learning rate, which makes the detector more robust and converge faster. Paper [227] replaced the original BN (batch normalization) with group normalization (GN) and asynchronous BN to make the gradient-normalization parameters more accurate; this makes the gradient descent direction more accurate, accelerating convergence and improving accuracy.

Models trained from scratch not only achieve high accuracy but also greatly reduce model size and computation. Owing to these advantages, training from scratch is also used in SAR ship detection.

Most detectors that are trained from scratch in this field have well-designed networks. They are shown in Figure 10.

**Figure 10.** The detectors trained from scratch in SAR images.

Deng et al. [33] designed a dense backbone network composed of multiple dense blocks. Through dense connections, the front layers can receive additional supervision from the objective function, which makes training easier, and a feature reuse strategy makes the parameters highly efficient. Zhang et al. [51] designed a lightweight detection algorithm that can be trained from scratch and reduces training and testing time without reducing accuracy; it adopts semantic aggregation and feature reuse modules to improve performance on multi-scale ships. Zhang et al. [84] proposed the lightweight detection network ShipDeNet-20, designed with fewer layers and convolution kernels and with depthwise separable convolution; it also adopts a feature fusion module, a feature enhancement module, and a scale-share feature pyramid module to improve detection accuracy. Han et al. [98] integrated a lightweight asymmetric square convolution block into SSD to realize training from scratch, with accuracy and speed better than the classical DSOD. Han et al. [100] proposed a parallel multi-scale-kernel convolution block and a feature-reuse convolution module to enhance feature representation and reduce information loss. Han et al. [108] designed two kinds of asymmetric convolution blocks: an asymmetric and square convolution feature aggregation block and an asymmetric and square convolution feature fusion block; these replace all 3 × 3 convolution layers and are embedded into the classic DSOD to achieve better training-from-scratch results. Guo et al. [121] proposed an effective and stable single-stage algorithm trained from scratch, namely CenterNet++. The model mainly includes three modules: a feature refinement module, a feature pyramid fusion module, and a head enhancement module. Zhao et al. [155] used DetNet as the backbone network to realize training from scratch; it uses stacked convolution instead of down-sampling to address small-ship detection and adopts a feature reuse strategy to improve parameter efficiency.

Compared with other directions, fewer researchers in SAR ship detection have realized the benefits of training from scratch, and the papers using these techniques in this field are not advanced enough; more advanced techniques from computer vision should be adopted here. In all, detectors trained from scratch are not used to their full extent in this field, and some sound conclusions from papers [226,227] should be considered and applied.

#### *4.8. Detectors with Oriented Bounding Box*

The oriented bounding box was originally used in scene text detection, where a large number of achievements have emerged, such as SegLink, RRPN (rotation region proposal network), TextBoxes, TextBoxes++, R2CNN (rotational region convolutional neural network), and so on [228–232]. The ships in remote sensing images also have multi-directional characteristics, and the conventional vertical rectangular bounding box often cannot accurately surround the target. With the improvement of ship detection accuracy, using oriented bounding boxes to realize multi-directional ship detection has become a research hotspot [233–238]. DOTA (dataset for object detection in aerial images) is a commonly used aerial-image object detection dataset in this field, which can be used to develop and evaluate detection algorithms. Similarly, there are many detection algorithms based on oriented bounding boxes in SAR images, which are introduced here. At present, the datasets that can be used to train and test oriented-bounding-box algorithms are SSDD+, RSDD-SAR, and SRSDD-v1.0; their details have been introduced earlier. The oriented bounding box detectors in SAR images are shown in Figure 11.
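To make the representation concrete, an oriented box is commonly parameterized as (cx, cy, w, h, θ) and decoded to four corner points for loss computation and visualization. A minimal NumPy sketch (conventions for the angle and corner order vary between the cited papers; this is one common choice):

```python
import numpy as np

def obb_to_corners(cx, cy, w, h, theta):
    """Decode (cx, cy, w, h, theta) -> 4 corner points, theta in radians."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])           # 2-D rotation matrix
    # Corner offsets in the box's own frame, counter-clockwise.
    offsets = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                        [w / 2, h / 2], [-w / 2, h / 2]])
    return offsets @ R.T + np.array([cx, cy])
```

With θ = 0 this reduces to an ordinary axis-aligned box, which is why the vertical bounding box is a special case of the oriented one.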

**Figure 11.** The SAR ship detectors with oriented bounding box.

Two-stage. Chen et al. [56] proposed a multi-scale adaptive recalibration network to detect multi-scale and arbitrarily oriented ships. It can learn the angle information of ships, and the anchors, NMS, and loss function are also redesigned to fit the large aspect ratio and arbitrary orientation of ships in SAR images. Pan et al. [83] proposed a multi-stage rotational region-based network (MSR2N) to solve the problem of redundant regions. MSR2N includes FPN, RRPN, and a multi-stage rotational detection network, and is more suitable and robust for SAR ship detection. An et al. [129] adopted an oriented detector as the base model to solve the problem that conventional CNN models have too many parameters, which increases the difficulty of transfer learning between different tasks.

Single-stage. Wang et al. [20] proposed a SAR ship detector with an oriented bounding box based on SSD. The detector can predict the class, location, and angle information of ships. The semantic aggregation module is used to capture abundant location and semantic information. The attention module is used to adaptively select meaningful features and neglect weak ones. Multi-orientation anchors, angular regression, and the loss function are used to fit the oriented bounding box. Liu et al. [26] adopted DR-Box [239] to detect ships in SAR images. DR-Box is specially designed to detect targets in any direction in remote sensing images. It can effectively reduce the interference of background pixels and locate the target more accurately. An et al. [41] proposed DR-Box-v2 to detect ships in SAR images. A multi-layer anchor box generation strategy for detecting small ships is proposed. A modified encoding scheme is proposed to estimate the position and orientation precisely. Focal loss and hard negative mining are also used to balance the positives and negatives. Yang et al. [90] regarded a rotatable bounding box detector as the base model to solve the problem of negative sample intra-class imbalance in the training stage. Chen et al. [93] proposed a rotated refined feature alignment detector to fit ships with large aspect ratios, arbitrary orientations, and dense distribution properties. A lightweight attention module, modified anchor mechanism, and feature-guided alignment module are proposed to boost the performance of the oriented detector.

Anchor free. Yang et al. [124] proposed R-RetinaNet to beat DRBox-v1, DRBox-v2, and MSR2N (multi-stage rotational region-based network) in this field. R-RetinaNet uses a scale calibration method to align the scale distribution, a task-wise attention feature pyramid network to alleviate the contradiction between classification and localization, and an adaptive IoU threshold training method to correct the imbalance problem. He et al. [152] proposed a method to solve the boundary discontinuity problem in oriented bounding box detectors by learning polar encodings. The encoding scheme uses a group of vectors pointing from the center of the ship to boundary points to represent an oriented bounding box.

Others. Ding et al. [177] released the SRSDD-v1.0 dataset, which is used for oriented bounding box detectors. The details of the dataset have been described above. They present the performance of several advanced oriented bounding box detection algorithms on the dataset.

Summary. With the emergence of several datasets with oriented bounding boxes, SAR ship detectors based on oriented bounding boxes are becoming more and more advanced. However, this is still not enough compared with DOTA, and more effort should be made in this direction.

#### *4.9. Multi-Scale Ship Detectors*

In MS COCO, the proportions of small, medium, and large objects are 41.43%, 34.32%, and 24.24%, respectively. However, in SAR images, the proportion of small ships is extremely high. For example, the proportions of small, medium, and large ships in the RSDD-SAR dataset are 81.175%, 18.776%, and 0.049%, respectively; in LS-SSDD-v1.0, the proportions are 99.80%, 0.20%, and 0.00%, respectively. Therefore, this field needs to focus on the problem of multi-scale ship detection, especially small ships.
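The small/medium/large split above follows the MS COCO convention, which bins objects by pixel area at thresholds of 32² and 96². A minimal sketch of that binning:

```python
def size_category(w, h):
    """MS COCO size bins: area < 32^2 is small, < 96^2 is medium, else large."""
    area = w * h
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```

Under this rule, a 20 × 20-pixel ship is "small", which is the regime that dominates SAR datasets such as LS-SSDD-v1.0.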

Although CNNs have developed rapidly in computer vision, they perform poorly on small objects. To improve adaptability to multi-scale ships, computer vision methods often fuse low-level and high-level features (as in FPN), increase the receptive field, and improve the anchor-box generation and matching strategies.
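The low-/high-level fusion idea of FPN can be sketched in a few lines of NumPy, with nearest-neighbour upsampling standing in for the learned path (real FPNs add 1 × 1 lateral and 3 × 3 smoothing convolutions; this is only the top-down accumulation):

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_topdown(features):
    """features: list of (C, H, W) maps, shallow -> deep, each half resolution.
    Fuses deep semantics into shallow, high-resolution maps by addition."""
    fused = [features[-1]]
    for f in reversed(features[:-1]):
        fused.append(f + upsample2x(fused[-1]))
    return fused[::-1]  # back to shallow -> deep order
```

The shallowest output keeps its high resolution (good for small ships) while accumulating semantic content from every deeper level.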

SAR ship detection also uses the above methods to improve the performance of multiscale ship detection. The multi-scale ship detectors in SAR images are shown in Figure 12.

**Figure 12.** The multi-scale SAR ship detectors.

Feature fusion. Chen et al. [15] proposed a densely connected multi-scale neural network to solve the problem of multi-scale SAR ship detection. It closely connects each feature map with the other feature maps from top to bottom and generates proposals from each fused feature map. Cui et al. [42] proposed a dense attention pyramid network, which closely connects a convolutional attention module to each feature map from the top to the bottom of the pyramid, so as to extract rich features containing location and semantic information for multi-scale ships. Liu et al. [63] proposed a scale-transferable pyramid network for multi-scale ship detection. It constructs a feature pyramid through horizontal connections and uses a scale transfer layer to closely connect each feature map from top to bottom; the horizontal connections introduce more semantic information, and the dense scale transfer connections expand the resolution of the feature maps. Jin et al. [72] combined all feature maps from top to bottom to exploit contextual semantic information at all scales and used dilated convolution to increase the receptive field exponentially. Han et al. [99] used deconvolution to enhance the feature representation of small and medium-sized ships in FPN, improving the detection accuracy of SSD. Hu et al. [110] proposed a dense feature pyramid network, which processes shallow and deep features differently; compared with traditional FPN, it adapts better to multi-scale ships. Wang et al. [119] proposed a path augmentation fusion network to fuse different feature maps, using bottom-up and top-down paths to fuse more location and semantic information. Hu et al. [161] proposed a two-way convolution network based on a bidirectional convolution structure, which can effectively process shallow and deep feature information and avoid the loss of small-ship information. Zhang et al. [166] proposed a quad feature pyramid network to detect multi-scale ships. It includes a deformable convolution FPN, a content-aware feature reassembly FPN, a path aggregation space attention FPN, and a balance-scale global attention FPN.

Increase the receptive field. Deng et al. [17] designed a feature extractor with multiple receptive fields through ReLU and inception modules. It generates candidate regions in multiple middle layers to match ships of different scales and fuses multiple feature maps so that small-scale ships have a stronger response. Zhao et al. [22] proposed a coupled CNN to detect small-scale ships. It includes a network that generates candidate areas from multiple receptive fields and improves the recognition accuracy by using the context information of each candidate box. Dai et al. [86] did not use a single feature map but fused the feature map in a bottom-up and top-down manner, and generated candidate boxes from each fused feature map.

Anchor box generation and matching strategy. Li et al. [47] first analyzed the reasons for the low detection accuracy of small and medium-sized ships in SSD and made some improvements. An anchor box optimization method based on K-means clustering alleviates the problem of few positive samples and many negative samples, and a deconvolution-based feature fusion method improves the representation ability of the low-level feature maps, addressing their weak ability to recognize small ships. Fu et al. [95] proposed a feature balancing and matching network, which uses an anchor-free strategy to eliminate the influence of anchors and an attention-guided balanced pyramid to semantically balance features at different levels; it performs well on small-scale ships. Hong et al. [158] improved anchor generation in YOLOv3 with an improved K-means++, which alleviates the difficulty of multi-scale ship detection and changes the number of anchor boxes in the YOLO layer from three to four. Sun et al. [167] showed that anchor-free detectors adapt well to small ships and run fast.
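The K-means anchor optimization used in [47,158] is usually YOLO-style clustering on box widths and heights with 1 − IoU as the distance, so the anchors match the actual size distribution of ships. A simplified sketch (the cited papers' exact initialization and update rules may differ):

```python
import numpy as np

def iou_wh(boxes, anchors):
    """YOLO-style IoU between (w, h) pairs, assuming boxes share a corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
            + (anchors[:, 0] * anchors[:, 1])[None] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=50):
    """Cluster (w, h) box sizes with 1 - IoU as the distance, median update."""
    order = np.argsort(boxes.prod(axis=1))
    # Deterministic init: spread seed anchors across the box-area range.
    anchors = boxes[order[np.linspace(0, len(boxes) - 1, k).astype(int)]].astype(float)
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)  # nearest = max IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = np.median(boxes[assign == j], axis=0)
    return anchors
```

On a SAR dataset dominated by small ships, such clustering naturally produces smaller anchors than the defaults tuned for natural images.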

Summary. Small-ship detection is extremely hard but also extremely important for some applications, because people hope to find targets at long distances, at which point the targets must be small. SAR ship detection also proves this point. Although the above methods for small ships have certain effects, they are still far from enough, and innovative work needs to continue.

#### *4.10. Attention Module*

The basic idea of the attention mechanism in computer vision is to make the model ignore irrelevant information and focus on key information. It can be divided into hard attention, soft attention, Gaussian attention, spatial transformation, and so on, and attention can be computed over the spatial domain, channel domain, layer domain, or mixed domains. Representative algorithms include SENet (squeeze and excitation network), SKNet (selective kernel network), CBAM (convolutional block attention module), CCNet (criss-cross attention), OCNet (object context network), DANet (dual attention network), etc. [240–244]. Transformer [245] adopted an encoder–decoder architecture and takes attention to its extreme. It abandons the CNNs and RNNs (recurrent neural networks) used in previous deep learning tasks and shows great advantages in NLP (natural language processing) and CV. Swin Transformer [246] makes the transformer compatible with image classification and object detection, demonstrating the potential of transformer-based models as vision backbones.
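As a concrete example of channel attention, SENet [240] squeezes each channel to a scalar by global average pooling, passes the result through a small bottleneck MLP, and rescales the channels with sigmoid gates. A NumPy sketch with illustrative weight shapes (`w1`, `w2` stand in for the learned FC layers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    """Squeeze-and-excitation on a (C, H, W) map.
    w1: (C/r, C) reduction weights, w2: (C, C/r) expansion weights."""
    s = x.mean(axis=(1, 2))                   # squeeze: global average pool -> (C,)
    e = sigmoid(w2 @ np.maximum(w1 @ s, 0))   # excitation: FC-ReLU-FC-sigmoid
    return x * e[:, None, None]               # reweight each channel
```

Because the gates lie in (0, 1), the block can only suppress channels, which is exactly the "highlight targets, suppress clutter" behavior the SAR detectors below exploit.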

Chen et al. [43,44] proposed an attention-based detector. The attention model is mainly composed of a convolution branch and a mask branch; elements in the mask maps act like weights on the feature maps, enhancing regions of interest and suppressing non-target regions. Cui et al. [89] introduced the spatial shuffle-group enhance attention module into CenterNet. It can extract stronger semantic features while suppressing some noise, reducing false positives caused by inshore and inland interference. Zhao et al. [91] combined a receptive field module and a convolutional block attention module to construct a top-down fine-grained feature pyramid. Wang et al. [122] designed a feature enhancement module based on a self-attention mechanism; its spatial and channel attention work simultaneously to highlight targets and suppress speckle to a certain extent. Wang et al. [131] embedded a soft attention module in the network to suppress the influence of noise and complex backgrounds. Zhu et al. [136] proposed a SAR ship detection method based on a hierarchical attention mechanism, including a global attention module and a local attention module; hierarchical attention strategies are applied at the image level and the target level, respectively. Sun et al. [162] introduced a semantic attention mechanism, which highlights the regional characteristics of ships and enhances the classification ability of the detector. Du et al. [169] embedded a multi-scale feature attention module in the network; by applying channel and spatial attention to the multi-scale feature maps, it highlights important information and suppresses interference caused by clutter.

CRTransSar [182] is the first to use a transformer for SAR image ship detection. It is based on Swin Transformer and shows great advantages. CRTransSar combines the global contextual information perception of transformers and the local feature representation capabilities of convolutional neural networks. It innovatively proposes a visual transformer framework based on contextual joint-representation learning. Experiments on SSDD and SMCDD show the effectiveness of the method.

#### *4.11. Real-Time Detectors*

At present, deep learning-based detectors need large computation and storage resources, which hinders their application in real-time prediction. To solve this problem, many acceleration ideas have appeared in the evolution of object detection algorithms. Firstly, researchers speed up the detection process itself; this idea is reflected in the evolution from R-CNN to Fast R-CNN, Faster R-CNN, R-FCN, and Light-Head R-CNN, in which the detectors share features progressively and the network structures become thinner and faster. Secondly, researchers design lightweight detection networks, accelerating both the backbone network and the detection head. Thirdly, researchers compress and accelerate CNN models, including lightweight network design, model pruning, model quantization, and knowledge distillation [247–253].

The exploration of real-time detection algorithms in SAR ship detection can be divided into three directions, which are shown in Figure 13.

**Figure 13.** The real-time SAR ship detectors.

4.11.1. Improving the Existing Real-Time Algorithms

Many improvements in this field are based on the YOLO and SSD series because they have great speed advantages, especially the YOLO series. Zhang et al. [40] used the idea of the YOLO algorithm and adopted depthwise separable convolution to accelerate the speed; MobileNet is used as the backbone network to improve detection speed while ensuring detection accuracy. Zhang et al. [82] proposed an improved YOLOv3 (using DarkNet-19 and deleting repeated layers), which achieved 90.08% AP50 and 68.1% AP on the SSDD dataset. Mao et al. [103] adopted the FCOS detection algorithm with ResSARNet as the backbone network, together with center-ness on the bounding box regression branch, center sampling, GIoU loss, and adaptive training sample selection; it achieves 61.5% AP with only 1.17 M parameters. Zhong et al. [157] adopted CFAR and YOLOv4 to realize real-time ship detection on China's HISEA-1 SAR images.

#### 4.11.2. Designing a Lightweight Model

Zhang et al. [51] designed a lightweight feature optimization network, LFO-Net, based on SSD. It can be trained from scratch and reduces training and testing time without reducing accuracy; detection performance is further improved by a bidirectional feature fusion module and an attention mechanism. It achieved 80.12% AP50 with a 9.28 ms testing time on SSDD. Zhang et al. [54] used multi-scale detection, a cascade, and an anchor box mechanism to design a lightweight network for real-time SAR ship detection, replacing traditional convolution with depthwise and pointwise convolutions. It achieved 94.13% AP50 with a 9.03 ms testing time on SSDD. Mao et al. [81] used a simplified U-Net as the feature extraction network, which has only 0.47 million learnable weights; this improves speed, and the anchor-free design avoids the problems caused by anchor boxes. The whole model has 0.93 million learnable weights, and its AP on the SSDD dataset is 68.1%. Zhang et al. [84] proposed ShipDeNet-20, which has 20 convolution layers and a 0.82 MB model size. It uses fewer layers and kernels together with depthwise separable convolution, and improves accuracy through a feature fusion module, a feature enhancement module, and a scale-share feature pyramid module. It achieved 97.07% AP50 at 233 FPS on SSDD. Zhang et al. [96] proposed HyperLi-Net. It achieves high accuracy through five modules, namely a multi-receptive-field module, a dilated convolution module, a channel and spatial attention module, a feature fusion module, and a feature pyramid module, and high speed through five more, namely a region-free model, small kernels, narrow channels, separable convolution, and batch normalization fusion. Zhou et al. [102] proposed the lightweight detector Lira-YOLO, which combines dense connections, residual connections, and group convolution in its stem blocks and extractor modules. It achieved 85.46% AP50 with a 4.3 MB model size. Li et al. [115] designed a lightweight network with feature relay amplification and a multi-scale feature skip-connection structure based on Faster R-CNN, and improved anchor box selection and RoI pooling; it achieved 89.8% AP50 with a large speed increase. Zhang et al. [132] proposed the lightweight detection algorithm ShipDeNet-18, which has fewer layers and fewer convolution kernels; a deep and shallow feature fusion module and a feature pyramid module improve detection accuracy. It achieved 93.78% AP50 at 202 FPS. Ma et al. [156] proposed a YOLOv4-tiny that reduces the number of convolutional layers in CSPDarkNet53, achieving 88.08% AP50 at 12.25 ms, compared with YOLOv4's 96.32% AP at 44.21 ms. Sun et al. [165] proposed a lightweight densely connected sparsely activated detector, which constructs a lightweight backbone to balance performance and computational complexity. It achieved 97.2% AP50 and 61.5% AP on SSDD.
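Several of the lightweight networks above rely on depthwise separable convolution, and the saving is easy to quantify by counting weights (bias terms omitted; a back-of-the-envelope sketch):

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution layer."""
    return c_in * c_out * k * k

def separable_params(c_in, c_out, k):
    """Depthwise k x k (one filter per input channel) + pointwise 1 x 1."""
    return c_in * k * k + c_in * c_out

# e.g. 3x3, 64 -> 64 channels: 36,864 vs 4,672 weights, roughly 7.9x smaller
```

The reduction factor is about 1/k² + 1/c_out, which is why 3 × 3 layers shrink nearly nine-fold and why models such as ShipDeNet-20 can fit in under 1 MB.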

#### 4.11.3. Compressing and Accelerating the Detector

Mao et al. [104] proposed a knowledge distillation-based network slimming method. YOLOv3 with DarkNet-53 is pruned at the filter level to obtain a lightweight model, and Kullback-Leibler divergence (KLD) knowledge distillation is used to train the student network against the teacher (YOLOv3@EfficientNet-B7). The resulting model has only 15.4 M parameters, and the AP decreases by only 1%. Chen et al. [118] proposed the Tiny-YOLO-Lite algorithm. It redesigns and prunes the backbone structure, strengthens channel-level sparsity, and uses knowledge distillation to make up for the performance degradation caused by pruning. Tiny-YOLO-Lite reduces the model size and the number of floating-point operations and runs faster with comparable accuracy.
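The distillation step in these pipelines hinges on a temperature-softened KL divergence between teacher and student logits. A minimal Hinton-style sketch of that loss (names and the temperature value are illustrative, not taken from [104,118]):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_kld(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)
```

The pruned student minimizes this term alongside the ordinary detection loss, recovering much of the accuracy the pruning removed.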

#### 4.11.4. Summary

From the above discussion, we can see that real-time ship detection is a hot topic in SAR images. However, the above works are not enough: the deep learning models transferred from computer vision are still too heavy for this field. Researchers should do the following to realize real-time detection. Firstly, anchor-free and training-from-scratch methods should be used to design lightweight detection algorithms. Secondly, model compression and acceleration techniques should be used to improve speed further. Thirdly, lightweight models should be transplanted to high-performance AI chips (e.g., NVIDIA Jetson TX2) so that they can run at the edge (satellite, airplane).

#### *4.12. Other Detectors*

In this part, we mainly introduce weakly supervised, GAN (generative adversarial network) and data augmentation, which are shown in Figure 14.

**Figure 14.** The other SAR ship detectors.

#### 4.12.1. Weakly Supervised

Supervised methods, such as deep learning approaches, need substantial time and manpower to produce training samples [254]. Papers [85,87,126] adopted weakly supervised learning to train ship detection algorithms. The model is trained with two image-level labels, "ship" and "non-ship," and produces a ship-location heatmap, ship bounding boxes, and a pixel-level segmentation product. These methods can partly alleviate the annotation problem, but their accuracy is lower than that of supervised methods.

#### 4.12.2. GAN

Insufficient SAR samples restrict the performance of the algorithms. Zou et al. [112] used a multi-scale Wasserstein auxiliary classifier generative adversarial network [255] to generate high-resolution SAR ship images; the original dataset and the generated data are then combined into a composite dataset to train a YOLOv3 network, solving the problem of low detection accuracy on a small dataset. Based on the idea of generative adversarial networks, an image enhancement module driven by target features is designed to improve the quality of the ships in the image. The experimental results verify the effectiveness of this method.

#### 4.12.3. Data Augmentation

Data augmentation can expand the dataset several times over, improving detection accuracy [256]. A training method based on a feature mapping mask eliminates the gradient noise introduced by random cropping, improving detection performance. SAR images containing ships are generated by electromagnetic numerical analysis, and a sea clutter model is used to simulate realistic SAR image patches containing various ship slices, improving the performance of SSD.
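Even the simplest augmentations must transform the annotations together with the image. A sketch of a horizontal flip that keeps the boxes consistent with the flipped SAR patch (coordinates are in pixels; the convention is illustrative):

```python
import numpy as np

def hflip(image, boxes):
    """Horizontally flip a (H, W) SAR patch and its (x1, y1, x2, y2) boxes."""
    H, W = image.shape
    flipped = image[:, ::-1].copy()
    x1, y1, x2, y2 = boxes.T
    # x-coordinates mirror about the image width; x1/x2 swap roles.
    return flipped, np.stack([W - x2, y1, W - x1, y2], axis=1)
```

Random flips, rotations, and crops applied this way multiply the effective dataset size without collecting new SAR imagery.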

#### *4.13. Problems*

From the 177 papers, we can see that most detection algorithms in this field are borrowed from computer vision, and their development also lags behind detectors in computer vision. Due to the large differences between natural images and SAR images (for example, SAR images are single-channel, ship sizes are small, and the distribution is very sparse), some detection algorithms are not suitable for SAR ship detection. Therefore, we should design detectors according to the real characteristics of ships in SAR images.

The 177 papers mainly exploit the image properties of SAR data; the research on and application of the scattering mechanism are not sufficient. This is one problem we should solve in the future.

At present, there are several public small datasets, but we lack a large one. Models trained on a small dataset face the problem of over-fitting. What we should do next is merge the small datasets into a large one and establish clear train-test division standards, evaluation indicators, and benchmarks. This work can promote the development of this field.

#### **5. Future—The Direction of the Deep Learning-Based SAR Ship Detectors**

*5.1. Anchor Free Detector Deserves Special Attention*

The anchor-free detection algorithm has many advantages, which have been introduced in Section 4.6. It should be emphasized that detection algorithms without anchor boxes are especially suitable for SAR images: since ships in SAR images are sparse and small, going anchor-free can greatly improve detection speed and avoid the various problems of anchor box design and matching. Therefore, anchor-free detection algorithms deserve more attention. Fortunately, researchers in this field have realized this, and many research results have emerged.

#### *5.2. Train Detector from Scratch Deserves More Attention*

At present, there are the following generally accepted conclusions about training from scratch. Firstly, pre-training accelerates convergence, especially in the early stage of training, but the time to train from scratch is roughly equivalent to the total time of pre-training plus fine-tuning. Secondly, if there are enough target images and computing resources, pre-training is not necessary. Thirdly, if the cost of image collection and cleaning is considered, a general large-scale classification dataset is not an ideal choice; collecting images for the detection task itself is more effective. Fourthly, when the target task is sensitive to spatial localization (such as ship detection), pre-training shows no benefit.

Collecting images for detection and training is a solution worth considering, especially when there is a significant gap between the pre-training task and the detection task (such as ImageNet image and SAR image). Therefore, in the field of SAR ship detection, it is very necessary to combine the existing public datasets into a large dataset, so as to ensure training models from scratch.

Due to the difference between natural images and SAR images, it is very necessary to adopt training from scratch detection algorithms in this field, as it can obtain a detection algorithm with stronger adaptability to SAR images and smaller model size. However, the work at this stage is far from enough, so we need to pay more attention to the detection algorithms of training from scratch.

#### *5.3. Many Other Works Need to Be Used for Oriented Bounding Box Detector*

The ship in the SAR image has very changeable directionality, and the vertical bounding box cannot adapt to this scene. It is necessary to use an oriented bounding box. In an inshore scenario, a vertical bounding box is susceptible to interference from onshore buildings and other ships, affecting detection performance, while an oriented bounding box can accurately represent the ship target and reduce redundant interference. In addition, for ship targets in an offshore scenario, an oriented bounding box can obtain information such as heading and aspect ratio, which is of great significance for subsequent trajectory prediction and situation estimation tasks. The scene text detection and the aerial remote sensing image dataset DOTA have conducted in-depth research on the oriented bounding box and achieved many results. We should learn from them.

#### *5.4. Small Ship Detection Is an Eternal Topic*

The main reasons for the poor detection of small ships are as follows. Firstly, few features can be extracted from small ships, and the size and receptive field of the anchor are too large for them. Secondly, anchor sizes are discrete (for example, 16, 32, 64, etc.) while ship sizes are continuous, which lowers the recall rate of small ships. Thirdly, the anchors of a small ship overlap less with the ground-truth bounding box, resulting in fewer positive samples and too many negative samples.
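The matching problem can be seen with a single IoU computation: even a perfectly centred small ship inside the smallest common anchor falls below the usual 0.5 positive threshold (the sizes here are hypothetical but typical):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

ship = (4, 4, 12, 12)     # an 8 x 8 pixel ship
anchor = (0, 0, 16, 16)   # the smallest 16 x 16 anchor, perfectly centred
# IoU = 64 / 256 = 0.25, below the 0.5 threshold, so no positive match
```

This is exactly why K-means anchor redesign and anchor-free heads pay off on SAR datasets dominated by small ships.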

As the proportion of small ships in SAR images is very high, small object detection is difficult, especially in this field. So, it is an eternal topic to study how to improve the detection effect of small ships.

#### *5.5. Real-Time Detection Is the Key to Application*

Real-time SAR ship detection needs to start from many aspects. For example, we can design a lightweight detection network, compress and accelerate the model to improve the speed, and transplant the detection algorithm to high-performance AI chips at the edge (NVIDIA Jetson TX2). At present, most of the work in this field is focused on the first two aspects, and there is less research on the third aspect, which needs to be focused on in the future. Only by realizing this technology can we realize the real-time detection and recognition of ships on satellite or aircraft platforms.

#### *5.6. Transformer Is the Future Trend*

In the past two years, transformers have shown great advantages over CNNs in object detection, for example, DETR (detection transformer) [257] and Swin Transformer. DINO [258] achieves 63.6% AP on the COCO test-dev, surpassing CNN-based detectors by a large margin. Nowadays, the transformer is the hot topic of computer vision. CRTransSar [182] is the first to use a transformer for SAR image ship detection and shows a great advantage in accuracy (97% AP on SSDD). Although some problems remain when transformers are used for SAR ship detection, there is no doubt that the transformer will be a research trend in the future due to its great advantages.
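At the core of all these transformer detectors is scaled dot-product attention; a minimal single-head NumPy sketch:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for (n, d) queries/keys and (n, dv) values."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # rows sum to 1
    return w @ V
```

Every output token is a weighted mix of all value tokens, which is the global contextual perception that CRTransSar combines with convolutional local features.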

#### *5.7. Bridging the Gap between SAR Ship Detection and Computer Vision*

Compared with the field of computer vision, the field of ship detection in SAR images is relatively small and not active enough. Therefore, it is necessary to bring this field closer to computer vision and systematically learn from its rich achievements.

We should also learn from its openness, standardized evaluation, and easily accessible code; such practices can help this field develop rapidly. What we should do is as follows. Firstly, the existing public datasets (SSDD, SAR-Ship-Dataset, AIR-SARShip, HRSID, LS-SSDD-v1.0, SRSDD-v1.0, and RSDD-SAR) should be combined into one large dataset, called LargeSARDataset here; models trained on it can avoid over-fitting. Secondly, determine the training and testing samples. Thirdly, determine the evaluation indicators. Fourthly, release the benchmark. Fifthly, bring it into the field of computer vision, as shown in Figure 15. Through this work, we can bridge the gap between SAR ship detection and computer vision.


**Figure 15.** Bridging the gap between SAR ship detection and computer vision.

In addition to detection, the classification and segmentation of SAR images have also entered the deep learning era [259–265]. Classification and segmentation algorithms borrowed from computer vision are extensively used on SAR images, and we will review them in the future. In detection, only ship and non-ship targets are considered, and the specific content of non-ship targets is not analyzed [266–268]. Some icebergs closely resemble ships in shape and size, and algorithms find them difficult to distinguish; we will study how to solve this problem in the future.

#### **6. Conclusions**

This paper introduces the past, present, and future of deep learning-based ship detection algorithms in SAR images.

Firstly, the history of SAR ship detection is reviewed (before SSDD was publicly released on 1 December 2017). This part mainly introduces detection algorithms based on CFAR and analyzes the great advantages of deep learning-based algorithms; the two are compared both in theory and in experiments.

After that, a comprehensive overview of current (from 1 December 2017 to now) deep learning-based ship detection algorithms is given. This part first analyzes the datasets, countries, timeline, deep learning frameworks, and performance evolution across the 177 papers; in particular, the basic situation of the 10 datasets in this field is introduced. The 177 papers are then classified into categories such as two-stage, single-stage, anchor-free, training from scratch, oriented bounding box, multi-scale, attention models, and real-time detection. The specific algorithms in those papers are analyzed, covering their principles, innovations, performance, and a summary.

Finally, the problems existing in this field and its future development directions are described. The main ideas are to design detection algorithms according to the specific characteristics of SAR images, focus on anchor-free detection algorithms, pay sufficient attention to detection algorithms trained from scratch, learn from existing achievements in natural scene text detection and DOTA, continuously improve the performance on small ships, and realize real-time ship detection through model acceleration and AI chips. We emphasize that an important future task is to bridge the gap between SAR ship detection and computer vision by merging the existing small datasets into a larger one and establishing the relevant standards.

This review can serve as a reference for researchers in, or interested in, this field, helping them quickly understand its current state and future development directions.

**Author Contributions:** Conceptualization, J.L. and C.X.; methodology, H.S., L.G. and T.W.; investigation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, J.L. and C.X.; supervision, C.X.; funding acquisition, C.X. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China, No. 61790550, No. 61790554, No. 61971432, No. 62022092.

**Data Availability Statement:** No new data were created or analyzed in this study. Data sharing is not applicable to this article.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Serial GANs: A Feature-Preserving Heterogeneous Remote Sensing Image Transformation Model**

**Daning Tan <sup>1</sup>, Yu Liu <sup>1,2,\*</sup>, Gang Li <sup>2</sup>, Libo Yao <sup>1</sup>, Shun Sun <sup>1</sup> and You He <sup>1</sup>**


**\*** Correspondence: liuyu77360132@126.com

**Abstract:** In recent years, the interpretation of SAR images has been significantly improved with the development of deep learning technology, and using conditional generative adversarial nets (CGANs) for SAR-to-optical transformation, also known as image translation, has become popular. Most of the existing image translation methods based on conditional generative adversarial nets are modified from CycleGAN and pix2pix and focus on style transformation in practice. In addition, SAR images and optical images are characterized by heterogeneous features and large spectral differences, leading to problems such as incomplete image details and spectral distortion in the heterogeneous transformation of SAR images of urban or semiurban areas and complex terrain. Aiming to solve the problems of SAR-to-optical transformation, Serial GANs, a feature-preserving heterogeneous remote sensing image transformation model, is proposed in this paper for the first time. This model uses the serial Despeckling GAN and Colorization GAN to complete the SAR-to-optical transformation. The Despeckling GAN transforms the SAR images into optical gray images, retaining the texture details and semantic information. The Colorization GAN transforms the optical gray images obtained in the first step into optical color images and keeps the structural features unchanged. The model proposed in this paper provides a new idea for heterogeneous image transformation: through the decoupled network design, structural detail information and spectral information remain relatively independent during the heterogeneous transformation, thereby enhancing the detail information of the generated optical images and reducing their spectral distortion. Using SEN-2 satellite images as the reference, this paper compares the degree of similarity between the images generated by different models and the reference; the results reveal that the proposed model has obvious advantages in feature reconstruction and in the economy of its parameter volume. They also show that Serial GANs have great potential for decoupled image transformation.

**Keywords:** heterogeneous transformation; SAR image; optical image; conditional generative adversarial nets (CGANs)

#### **1. Introduction**

In recent years, there have been more and more applications of remote sensing images in environmental monitoring, disaster prevention, intensive farming, and homeland security. In practice, optical images are widely used due to their high spectral resolution and easy interpretation. The disadvantage is that they are sensitive to meteorological conditions, especially clouds and haze, which severely limits their use for the observation and monitoring of ground targets [1]. In contrast, synthetic aperture radar (SAR) sensors can overcome adverse meteorological conditions by creating images using the longer wavelengths of radio waves, obtaining all-day and all-weather continuous observations. Although SAR images have significant advantages over optical images, their application is still limited by the difficulty of SAR image interpretation. First, because synthetic aperture radar is a side-looking ranging instrument, the imaging result is affected by the distance between

**Citation:** Tan, D.; Liu, Y.; Li, G.; Yao, L.; Sun, S.; He, Y. Serial GANs: A Feature-Preserving Heterogeneous Remote Sensing Image Transformation Model. *Remote Sens.* **2021**, *13*, 3968. https://doi.org/ 10.3390/rs13193968

Academic Editors: Tianwen Zhang, Tianjiao Zeng and Xiaoling Zhang

Received: 24 August 2021 Accepted: 29 September 2021 Published: 3 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

the target and the antenna, which can lead to geometric distortion in SAR images [2]. Therefore, compared with optical images, it is more difficult for human eyes to understand the details of SAR images. Secondly, synthetic aperture imaging is a coherent imaging method in which the radio waves in the radar beam are aligned in space and time. While this consistency provides many advantages (required by the synthetic aperture process to work), it also leads to a phenomenon called speckle, which reduces the quality of SAR images and makes image interpretation more challenging [3]. Therefore, it is difficult to distinguish structural information directly from SAR images, which may not necessarily become easier with the increase in spatial resolution [4]. Considering the above two points, how to effectively use and interpret the target and scene information in SAR images has become an important issue that users of SAR data need to pay attention to. Under the condition of reasonable use of SAR image amplitude information, if the SAR image can be converted into a near-optical representation that is easy to recognize by human eyes, this will create new opportunities for SAR image interpretation.

Deep learning is a powerful tool for the interpretation of SAR images. Some scholars have reconstructed clear images by learning hidden nonlinear relations [5–10]. This type of method uses a residual learning strategy to overcome speckle noise by learning the mapping between the speckle image and the corresponding speckle-free reconstruction so that it can be further analyzed and explained. Although this mapping learning may be an ill-posed problem, it also provides a useful reference for SAR image interpretation.

In addition to convolutional neural networks, image translation methods in the field of natural images and human images provide other ideas for SAR-to-optical image transformation, such as through conditional generative adversarial networks (CGANs) [11–14]. This type of method separates the style and semantic information in image transformation, so it can transform from the SAR image domain to the optical image domain, and also ensures the transformed images have the prior structural information of the SAR images and the spectral information of optical images. In previous studies, CGANs were first applied to the translation tasks of text to text [15], text to image [16], and image to image [17,18], and are suitable for generating unknown sequences (text/image/video frames) from known conditional sequences (text/image/video frames). In recent literature, the applications of CGANs in image processing were mostly in image modification. This includes single image super-resolution [17], interactive image generation [18], image editing [19], image-to-image translation [11], etc. CGANs have been used in SAR-to-optical transformation in recent years. In the literature [20–22], different improved SAR-to-optical transformation models based on CycleGAN and pix2pix have been proposed. The general idea of these models is to improve the model structure and loss function, but they are not designed specifically for the differences of imaging principle between SAR images and optical images, so they do not have universal applicability.

In order to solve the problem of heterogeneous image transformation in principle, as shown in Figure 1a, we decomposed the SAR-to-optical transformation task into two steps: the first step was to implement the transformation from the SAR image domain to optical grayscale image domain through the Despeckling GAN. In this step, we aimed to suppress the speckle effect of SAR images and reconstruct the semantic structural information and texture details of SAR images. In the second step, we transformed the optical grayscale images obtained in the first step into optical color images through the Colorization GAN. The two subtasks are relatively independent and have low coupling, which can reduce the semantic distortion and spectral distortion in the process of direct SAR-to-optical transformation.
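The decoupled two-step design above can be sketched as a simple composition of two mappings. This is a shape-level sketch only; the real generators are the networks described in Section 3:

```python
# Hedged sketch of the Serial GANs pipeline: shapes only, no real networks.
def despeckling_gan(sar_shape):
    # P: SAR domain X -> optical grayscale domain Y (same spatial size).
    h, w, _ = sar_shape
    return (h, w, 1)

def colorization_gan(gray_shape):
    # Q: grayscale domain Y -> optical colour domain Z (RGB).
    h, w, _ = gray_shape
    return (h, w, 3)

def serial_gans(sar_shape):
    # The full task T is the composition of the two sub-tasks.
    return colorization_gan(despeckling_gan(sar_shape))

print(serial_gans((256, 256, 1)))  # (256, 256, 3)
```

Because each sub-task is trained against its own discriminator, failures in despeckling and failures in colorization can be diagnosed and improved independently.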

The main contributions of this paper are as follows.

1. Unlike the existing methods of direct image translation, this paper proposes a feature-preserving SAR-to-optical transformation model, which decouples the SAR-to-optical transformation task into SAR-to-gray and gray-to-color transformations. This design effectively reduces the difficulty of the original task, enhancing the feature details of the generated optical color images and reducing spectral distortion.


The rest of this paper is structured as follows. Section 2 introduces the materials involved in this paper. Section 3 introduces the method in detail, including the network structure and the loss function. In Section 4, the experimental results are given, which are discussed and evaluated based on indexes. Section 5 shows the discussion of this paper. The last part of the paper (Section 6) gives the conclusions and prospects for future work.


**Figure 1.** (**a**) Overview of our method: the SAR image affected by speckling is the input, and the Despeckling GAN generates a corresponding optical grayscale image as output. The optical grayscale image is then sent as input to the second generator network Colorization GAN, and the output is an optical color image. (**b**) Examples of generating optical grayscale images and optical color images through the Serial GANs.

#### **2. Materials**

Due to the lack of large paired SAR and optical image datasets, deep learning-based SAR-to-optical translation research has mainly followed the idea of the CycleGAN [12] model, that is, unpaired image transformation. With the decreasing cost of remote sensing images, a new idea for solving the cross-modal transformation has emerged: an image transformation method based on the generative adversarial network. In the literature [24], Schmitt et al. published the SEN1-2 dataset to promote deep learning research on SAR and optical image fusion. The SEN1-2 dataset is a conventional remote sensing image dataset obtained by the SAR and optical sensors of the Sentinel-1 and Sentinel-2 satellites. As part of the Copernicus Programme of the European Space Agency (ESA), the Sentinel satellites are used for remote sensing tasks in the fields of climate, ocean, and land monitoring. The mission is being carried out jointly by six satellites

with different observation applications. Sentinel-1 and Sentinel-2 provide the two most conventional types of SAR and optical images, respectively, so they have been widely studied in the field of remote sensing image processing. Sentinel-1 is equipped with a C-band SAR sensor, which enables it to obtain high-positioning-accuracy SAR images regardless of weather conditions [25]. In its unique SAR imaging mode, the nominal resolution of Sentinel-1 is not less than 5 m, while providing dual-polarization capability and a very short equatorial revisit time (about 1 week) [26]. In the SEN1-2 dataset, Sentinel-1 images were collected in the interferometric wide (IW) swath mode, and the results obtained are the ground-range-detected (GRD) products. These images contain the backscatter coefficient in dB scale at a pixel spacing of 5 m in azimuth and 20 m in range. In order to simplify the operation, the dataset focuses on the VV polarization data and ignores the VH polarization data. Sentinel-2 consists of two polar-orbiting satellites in the same orbit, with a phase difference of 180 degrees [27]. For the Sentinel-2 part of the SEN1-2 dataset, the researchers used the red, green, and blue channels (i.e., Bands 4, 3, and 2) to generate realistic RGB images. Because cloud occlusion would affect the final result, the cloud coverage of the Sentinel-2 images in the dataset is less than or equal to 1%. SEN1-2 is composed of 282,384 pairs of corresponding image patches, which come from all over the world and all weather conditions and seasons. It is the first large, open dataset of this kind and has significant advantages for learning a cross-modal mapping from SAR images to optical images. With the aid of the SEN1-2 dataset, we were able to build a new model that is different from the previous methods: the Serial GANs model proposed in this paper. Figure 2 shows some examples of image pairs in SEN1-2.

**Figure 2.** Some example patch pairs from the SEN1-2 dataset. Top row: Sentinel-1 SAR image patches; bottom row: Sentinel-2 RGB image patches.

#### **3. Method**

The heterogeneous transformation from SAR images to optical images is an ill-posed problem. The transformation results are often not ideal due to speckle noise, SAR image resolution, and other factors. Inspired by the ideas of pix2pix, CycleGAN and pix2pixHD, as shown in Figure 3a–d, this paper attempted to introduce optical grayscale images as the intermediate transformation domain *Y*. The transformation task from the SAR image domain *X* to the optical color image domain *Z* was completed in two steps by two generators (*P* and *Q*) and two discriminators (*DY* and *DZ*) as shown in Figure 3e. First, the generator *P* completes the mapping: *X* → *Y*, in which the SAR image is transformed into the optical grayscale image, and the corresponding discriminator *DY* is used to promote the transformation of the SAR image in the source domain *X* to the optical grayscale image in the domain *Y*, which is difficult to distinguish from the real optical grayscale image. Then, the generator *Q* completes the mapping: *Y* → *Z*, in which the optical grayscale image is transformed to the optical color image, and the corresponding discriminator *DZ* is used to promote the transformation of the optical grayscale image in the intermediate domain *Y* to the optical color image in the domain *Z*, which is difficult to distinguish from the optical color image. In this way, the original transformation process from the SAR image to the

optical color image is divided into two steps, reducing the semantic distortion and feature loss in the process of direct transformation from the SAR image to the optical color image.

**Figure 3.** Overview of different methods. (**a**,**b**) CycleGAN. It is essentially two mirror-symmetric GANs, which share two generators *G* and *F* with discriminators *DY* and *DX* respectively, and it uses GAN loss and cycle-consistency loss; (**c**) pix2pix, which directly transforms the image from the *X* domain to the *Z* domain, using GAN loss and L1 loss; (**d**) pix2pixHD. Different from pix2pix, it has two generators, *G*<sup>1</sup> and *G*2, and its loss functions are GAN loss, Feature-matching loss, and Content loss; (**e**) the method proposed in this paper. It uses the intermediate state *y* as the transition, and its loss functions are GAN loss, Feature-preserving loss, and L1 loss.

As shown in Figure 4, the transformation from SAR images to optical images can be defined as the composite mapping *T* (*P* : *X* → *Y* followed by *Q* : *Y* → *Z*) from the source domain *X* to the target domain *Z*. Suppose that *x*(*i*) is a random sample taken from the SAR image domain *X*, with distribution P(*x*(*i*)), and that *y*(*i*) is the corresponding sample mapped into the optical grayscale image domain. The final task of the network proposed in this paper is *T* : *X* → *Z*, i.e., generating the optical color image *z*(*i*) = *T*(*x*(*i*)) from each input *x*(*i*).

**Figure 4.** A feature-preserving heterogeneous remote sensing image transformation model is proposed in this paper. Let *X*, *Y*, and *Z* denote the SAR image domain, intermediate optical grayscale image domain, and optical color image domain, respectively, and let *x*(*i*) ∈ *X*, *y*(*i*) ∈ *Y*, and *z*(*i*) ∈ *Z* denote the dataset samples of the corresponding image domains (*i* = 1, 2, ··· , *N*, where *N* denotes the total number of samples in the dataset).

#### *3.1. Despeckling GAN*

Generator *P*: As shown in Figure 5a, this paper used an improved U-net as the generator of Despeckling GAN. The input SAR image was encoded and decoded to output the optical grayscale image. A structure similar to the convolutional self-encoding network enables the generation network to better predict the optical grayscale image corresponding to the SAR image. The encoding and decoding process of the generator works on multiple levels to ensure that the overall contour and local details of the original SAR image are extracted on multi-scales. In the decoding process, the network upsamples the feature map of the previous level to the next level through deconvolution and adds the feature map of the same level in the encoding process through a long-skip connection to get an average merge (Merge). In U-net, this process is completed by concatenation. At the same time, skip connections are also used in each residual block, which has the advantage of overcoming the gradient disappearance problem of the network during training.
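The two kinds of skip connection described above can be sketched minimally, with NumPy arrays standing in for feature maps (the actual layers are convolutional):

```python
import numpy as np

def residual_block(x, f):
    # Short-skip connection inside each Res block: y = x + f(x),
    # which mitigates vanishing gradients during training.
    return x + f(x)

def merge(decoder_feat, encoder_feat):
    # Long-skip "Merge": element-wise average of same-level encoder and
    # decoder feature maps, in place of U-net's channel concatenation.
    return 0.5 * (decoder_feat + encoder_feat)

x = np.ones((4, 4))
print(residual_block(x, lambda v: 2.0 * v)[0, 0])  # 3.0
print(merge(2.0 * x, 0.0 * x)[0, 0])               # 1.0
```

Averaging instead of concatenating keeps the channel count unchanged across the long skip, which is one plausible reason the model stays lighter than a plain U-net; the paper does not state this motivation explicitly.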

Discriminator *DY*: As shown in Figure 5b, PatchGAN, which is commonly used in GAN, was used as the discriminator. The process of heterogeneous image transformation includes the transformation of the content part and feature detail part. The content part refers to the similarity in content between the generated image and the original image, and the feature detail part refers to the similarity in features between the generated image and the target image. With PatchGAN, feature details can be maintained [11].
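PatchGAN judges overlapping image patches rather than the whole image: each score in its output map covers one receptive field. Assuming the common pix2pix-style stack of 4 × 4 convolutions (the paper does not list its exact discriminator layers), the patch size can be computed from the kernel/stride configuration:

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride) conv layers."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= s             # stride compounds the spacing between taps
    return rf

# The standard 70x70 PatchGAN: three stride-2 4x4 convs, then two stride-1
# 4x4 convs (hypothetical here; assumed to match the pix2pix default [11]).
print(receptive_field([(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]))  # 70
```

Because each output score only sees a 70 × 70 patch, the discriminator penalizes local texture and detail rather than global composition, which is why PatchGAN helps preserve feature details.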

**Figure 5.** Architecture of the Despeckling GAN. (**a**) Generator (top). (**b**) Discriminator (bottom). (**c**) The detail of the Res. (The purple, green, and orange blocks in the dotted box correspond to the convolutional layer, the batch normalization layer, and the ReLU or Leaky ReLU layer, respectively). The numbers in brackets refer to the number of filters, filter size, and stride, respectively. The numbers above or below the encoder blocks and images indicate the input and output size of each module. Acronyms in the encoding and decoding modules are as follows: Res: Residual block with three convolutional layers and one skip connection, P: Maxpooling, DC: Deconvolution, C: Convolution, N: Batch Normalization, LR: Leaky ReLU, S: Sigmoid, Merge: Sum to average.

The loss function of the Despeckling GAN generator includes the CGAN loss, the *L*<sup>1</sup> loss, and the feature-preserving loss. Given paired training data, this paper used the CGAN loss function to improve the performance of the generator. Through supervised training, the generator *P* learns the mapping from *X* to *Y* while trying to make the discriminator *DY* judge its outputs as true; the discriminator, in turn, learns to distinguish fake images from real images. Therefore, the CGAN loss from *X* to *Y* is:

$$\mathcal{L}\_{\text{GAN}}(P, D\_Y) = \mathbb{E}\_{\mathbf{x}^{(i)}, y^{(i)}} \left[ \log D\_Y \left( \mathbf{x}^{(i)}, y^{(i)} \right) \right] + \mathbb{E}\_{\mathbf{x}^{(i)}} \left[ \log \left( 1 - D\_Y \left( \mathbf{x}^{(i)}, P \left( \mathbf{x}^{(i)} \right) \right) \right) \right]. \tag{1}$$

In the reconstruction loss design, the *L*<sup>1</sup> loss is used to minimize the difference between the optical gray image and the generated image.

$$\mathcal{L}\_{\text{Recon}}\left(P, \mathbf{x}^{(i)}\right) = \mathbb{E}\left[ \left\| P\left(\mathbf{x}^{(i)}\right) - \mathbf{y}^{(i)} \right\|\_{1} \right]. \tag{2}$$

At the optimum *T*∗, the network output *T*∗(*x*(*i*)) should be similar to the optical gray image *y*(*i*). In order to preserve the feature details of SAR images, this paper proposes a gradient-guided feature-preserving loss [28]. If *M*(·) denotes the operation that calculates the image gradient map, the feature-preserving loss is:

$$\mathcal{L}\_{\text{FP}}\left(P, \mathbf{x}^{(i)}\right) = \mathbb{E}\left[ \left\| M\left(P\left(\mathbf{x}^{(i)}\right)\right) - M\left(\mathbf{y}^{(i)}\right) \right\|\_1 \right].\tag{3}$$

For images *I*, *M*(·) is as follows:

$$\begin{array}{l} I\_x(\mathbf{x}, \mathbf{y}) = I(\mathbf{x} + 1, \mathbf{y}) - I(\mathbf{x} - 1, \mathbf{y}), \\ I\_y(\mathbf{x}, \mathbf{y}) = I(\mathbf{x}, \mathbf{y} + 1) - I(\mathbf{x}, \mathbf{y} - 1), \\ \nabla I(\mathbf{x}, \mathbf{y}) = \left( I\_x(\mathbf{x}, \mathbf{y}), I\_y(\mathbf{x}, \mathbf{y}) \right), \\ M(I) = ||\nabla I||\_2. \end{array} \tag{4}$$

Specifically, the operation *M*(·) can be easily implemented by convolution with a fixed convolution kernel.
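A sketch of *M*(·) from Eq. (4) using explicit central differences, which is equivalent to convolving with the fixed kernel [1, 0, −1] along each axis (border pixels are left at zero here for simplicity):

```python
import numpy as np

def gradient_map(img):
    """M(I): central-difference gradient magnitude, as in Eq. (4)."""
    ix = np.zeros_like(img, dtype=float)
    iy = np.zeros_like(img, dtype=float)
    ix[1:-1, :] = img[2:, :] - img[:-2, :]   # I(x+1, y) - I(x-1, y)
    iy[:, 1:-1] = img[:, 2:] - img[:, :-2]   # I(x, y+1) - I(x, y-1)
    return np.sqrt(ix ** 2 + iy ** 2)        # ||nabla I||_2 per pixel

img = np.tile(np.arange(5.0), (5, 1))  # intensity ramp along columns
print(gradient_map(img)[2, 2])         # 2.0 (unit slope, doubled step)
```

The feature-preserving loss then compares `gradient_map(P(x))` against `gradient_map(y)`, so edges and texture boundaries are penalized directly rather than only through pixel intensities.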

Therefore, the total training loss of the Despeckling GAN is:

$$\mathcal{L}\_{\text{GAN}^1} = \underset{P}{\text{argmin}} \max\_{D\_Y} \mathcal{L}\_{\text{GAN}}(P, D\_Y) + \beta\_1 \mathcal{L}\_{\text{Recon}}\left(P, x^{(i)}\right) + \gamma\_1 \mathcal{L}\_{\text{FP}}\left(P, x^{(i)}\right). \tag{5}$$

where *β*<sup>1</sup> and *γ*<sup>1</sup> are weighting factors.
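Numerically, the generator side of Eq. (5) is just a weighted sum of the three terms. The weight values below are hypothetical, since *β*<sup>1</sup> and *γ*<sup>1</sup> are not given in this excerpt:

```python
def despeckling_generator_loss(l_gan, l_recon, l_fp, beta1=100.0, gamma1=1.0):
    # Weighted sum of Eq. (5); beta1 and gamma1 are illustrative values only.
    return l_gan + beta1 * l_recon + gamma1 * l_fp

# Example with made-up per-batch loss values.
print(despeckling_generator_loss(0.7, 0.05, 0.02))  # 0.7 + 5.0 + 0.02 = 5.72
```

A large reconstruction weight (as in pix2pix, which uses 100 for its L1 term) keeps the output anchored to the target while the adversarial term sharpens detail.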

#### *3.2. Colorization GAN*

The Colorization GAN completes the transformation from optical gray images to optical color images. Its principle comes from [29], which showed that, compared with the scheme in Figure 6a, the colorization result of the scheme in Figure 6b was better, so the latter was adopted in this paper. When a single-channel gray image *y*ˆ(*i*) ∈ R<sup>*H*×*W*×1</sup> is input, the model learns the mapping *z*ˆ*ab*(*i*) = *Q*(*y*ˆ(*i*)) from the input gray channel to the corresponding *Lab*-space color channels *z*ˆ*ab*(*i*) ∈ R<sup>*H*×*W*×2</sup>, where *H* and *W* represent the height and width, respectively. The RGB image *z*ˆ(*i*) is then obtained by combining *z*ˆ*ab*(*i*) with *y*ˆ(*i*). The advantage of this method is that it reduces the ill-posedness of the problem, so that the colorization result is closer to the real image.
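The channel bookkeeping of this scheme (predict the ab channels from L, then stack) can be sketched as follows. The stand-in network and the final Lab→RGB conversion step (e.g., via `skimage.color.lab2rgb`) are assumptions for illustration, not the paper's code:

```python
import numpy as np

def colorize(gray, predict_ab):
    """Q: map the L channel to ab channels, then stack into a Lab image.
    gray: (H, W, 1) luminance; predict_ab: callable returning (H, W, 2)."""
    ab = predict_ab(gray)
    lab = np.concatenate([gray, ab], axis=-1)  # (H, W, 3) Lab image
    # Convert Lab -> RGB afterwards, e.g. with skimage.color.lab2rgb.
    return lab

gray = np.zeros((256, 256, 1))
fake_net = lambda g: np.zeros(g.shape[:2] + (2,))  # stand-in for trained Q
print(colorize(gray, fake_net).shape)  # (256, 256, 3)
```

Because the L channel is passed through unchanged, the network only has to predict 2 of the 3 output channels, which is what reduces the ill-posedness of the task.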

As shown in Figure 7, the generator of the Colorization GAN uses a convolutional self-coding structure, which establishes short-skip connections within different levels and long connections between the same levels of encoding and decoding. This kind of structure design enables different levels of image information to flow in the network so that the hue information of the generated image is more real and full. The discriminator of the Colorization GAN is PatchGAN [11]. Recent studies have shown that adversarial loss helps to make colorization more vivid [29–31], and this paper also followed this idea. During training, we input the reference optical color image and the generated image one by one into the discriminator; the discriminator output was 0 (fake) or 1 (real). According to the previous methods, the loss of the discriminator is the sigmoid cross-entropy.

**Figure 6.** The principle of image colorization. (**a**) The direct mapping from the gray space to the RGB color space; (**b**) the hue Lab mapped from the gray space to the Lab color space.

Among them, the adversarial loss is expressed as follows:

$$\mathcal{L}\_{\text{GAN}}(Q, D\_{\mathcal{Z}}) = \mathbb{E}\_{y^{(i)}, z^{(i)}} \Big[ \log D\_{\mathcal{Z}} \Big( y^{(i)}, z^{(i)} \Big) \Big] + \mathbb{E}\_{y^{(i)}} \Big[ \log \Big( 1 - D\_{\mathcal{Z}} \Big( y^{(i)}, Q \Big( y^{(i)} \Big) \Big) \Big) \Big]. \tag{6}$$

In order to make the generated color distribution closer to the color distribution of the reference image, we defined the L<sup>1</sup> loss in the *Lab* space, which is expressed as follows:

$$\mathcal{L}\_1(Q) = \mathbb{E}\_{\mathcal{y}^{(i)}, \boldsymbol{z}^{(i)}} \Big[ \left\| Q \left( \boldsymbol{y}^{(i)} \right) - \boldsymbol{z}^{(i)} \right\|\_1 \Big]. \tag{7}$$

Therefore, the total loss function of the Colorization GAN model is as follows:

$$\mathcal{L}\_{\text{GAN}^2} = \underset{Q}{\text{argmin}} \max\_{D\_Z} \mathcal{L}\_{\text{GAN}}(Q, D\_Z) + \beta\_2 \mathcal{L}\_1(Q). \tag{8}$$

where *β*<sup>2</sup> is a weighting factor.

**Figure 7.** The network structure of the Colorization GAN generator. The gray image of the input model is first transformed into the L channel in the Lab color space and then trained to map to the AB channels through the network. The obtained hue is spliced with the gray image to get the Lab color image. Finally, the Lab image is transformed into an RGB image. The green block represents the convolutional layer, the yellow block represents the residual block, and the green and light-green blocks represent the average by merge.

#### **4. Experiments and Results**

As the SEN1-2 dataset covers the whole world and contains 282,384 pairs of SAR and optical color images across four seasons, some of which overlap, the original dataset was randomly sampled according to a stratified sampling method in order to facilitate training. The dataset was divided into training, validation, and test sets in proportions of about 6:2:2. The experiments were carried out in PyTorch on a computing platform with two 11 GB NVIDIA GeForce RTX 2080 Ti GPUs and an Intel Core i9-9900K CPU. The input image size was 256 × 256, and the batch size was set to 10. In the experiments, the GANs were trained for 200 epochs and optimized with the Adam optimizer, whose momentum parameters were set to 0.5 and 0.999, respectively. The initial learning rate was set to 0.0002; it remained unchanged for the first 100 epochs and then decreased to 0 according to a linear decay strategy.
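The learning-rate schedule described above (constant for the first 100 epochs, then linear decay to 0) can be written as a simple function, e.g., for use as the multiplier in PyTorch's `torch.optim.lr_scheduler.LambdaLR`:

```python
def lr_at_epoch(epoch, base_lr=2e-4, n_const=100, n_decay=100):
    """Learning rate per the text: constant for the first n_const epochs,
    then linearly decayed to 0 over the remaining n_decay epochs."""
    if epoch < n_const:
        return base_lr
    return base_lr * (1 - (epoch - n_const) / n_decay)

print(lr_at_epoch(0))    # 0.0002
print(lr_at_epoch(150))  # 0.0001
print(lr_at_epoch(200))  # 0.0
```

This is the same schedule used by the public pix2pix/CycleGAN training code, which makes it a natural default for a pix2pix-derived model like this one.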

Considering that season and landscape will affect the training results of the model, we selected image pairs of different seasons and landscapes and followed the principle of equilibrium [32]. As shown in Table 1, the number of SAR and optical image pairs in four seasons is approximately the same, and the number of image pairs of different landscapes in each season is also approximately the same.


**Table 1.** Number of different types of images selected in our dataset.

#### *4.1. Experiment 1*

In order to verify the effectiveness of the proposed method, four groups of experiments were designed using the same dataset under different conditions, following the single-variable principle. In Group 1, the unimproved generators *P* and *Q* were used, and the loss function included the *GAN* loss and the reconstruction loss. In Group 2, the improved generators *P* and *Q* were used, and the loss function included the *GAN* loss and the reconstruction loss. In Group 3, the unimproved generators *P* and *Q* were used, and the loss function included the *GAN* loss, the reconstruction loss, and the feature-preserving loss. In Group 4, the improved generators *P* and *Q* were used, and the loss function included the *GAN* loss, the reconstruction loss, and the feature-preserving loss. The relationships among the four groups of experiments are shown in Table 2.


**Table 2.** Grouping experiments under different conditions.

As shown in Figure 8, the first column shows the SAR images collected by the SEN-1 satellite, and the second column shows the optical images collected by the SEN-2 satellite. The third, fourth, fifth, and sixth columns show the experimental results of Groups 1, 2, 3, and 4, respectively. Visual comparison shows that improving the network structure and the loss function both raise the quality of the SAR-to-optical transformation, especially by enhancing the feature details of the generated images. The method maps the SAR image to the optical color image to the maximum extent and thus helps the interpretation of the SAR image.

**Figure 8.** Results produced under different conditions. From top to bottom, the images are remote sensing images of five kinds of landscape: river valley, mountains and hills, urban residential area, seashore, and desert. From left to right: SEN-1 images, SEN-2 images, images generated by Group 1, images generated by Group 2, images generated by Group 3, images generated by Group 4.

In order to compare the detailed information of the generated images, Figure 9 shows a detailed comparison between the SEN-2 images and the four groups of experimental results. According to the subjective evaluation criteria, the results obtained by improving the model and the loss function at the same time are closest to the SEN-2 images. Improving only the loss function improves the details of the generated images, but its effect is inferior to that of improving the model. The detailed comparison of the four groups of results once again proves that the improvement measures proposed in this paper are effective; comparing the two situations also shows that the model improvement contributes more to the results than the loss-function improvement.

**Figure 9.** Detailed comparison of Experiment 1. We selected the generated results in three scenarios for detailed comparison with the SEN-2 reference image. The improvement measures proposed in this paper had an obvious effect on improving the quality of the generated images. From left to right: SEN-2 images, images generated by Group 1, images generated by Group 2, images generated by Group 3, and images generated by Group 4.

In order to quantify the effectiveness of the method, the final transformation quality was measured with two image quality assessment (IQA) indexes: the structural similarity (SSIM) [33,34] and the feature similarity (FSIM) [35]. Both indexes were calculated between the generated image *z*ˆ and the corresponding SEN-2 image *z*. The SSIM is defined as follows:

$$SSIM(\hat{z}, z) = [l(\hat{z}, z)]^{\alpha} [c(\hat{z}, z)]^{\beta} [s(\hat{z}, z)]^{\gamma}, \quad \alpha, \beta, \gamma > 0 \tag{9}$$

where:

$$l(\hat{z}, z) = \frac{2\mu_{\hat{z}}\mu_{z} + c_1}{\mu_{\hat{z}}^2 + \mu_{z}^2 + c_1} \tag{10}$$

$$c(\hat{z}, z) = \frac{2\sigma_{\hat{z}}\sigma_{z} + c_2}{\sigma_{\hat{z}}^2 + \sigma_{z}^2 + c_2} \tag{11}$$

$$s(\hat{z}, z) = \frac{\sigma_{\hat{z}z} + c_3}{\sigma_{\hat{z}}\sigma_{z} + c_3} \tag{12}$$

*l*(*z*ˆ, *z*), *c*(*z*ˆ, *z*), and *s*(*z*ˆ, *z*) represent the brightness comparison, the contrast comparison, and the structural comparison, respectively. *μz*ˆ and *μ<sup>z</sup>* represent the means of *z*ˆ and *z*; *σz*ˆ and *σ<sup>z</sup>* represent their standard deviations; *σzz*ˆ represents the covariance of *z*ˆ and *z*; and *c*1, *c*2, and *c*3 are small constants that prevent the denominators from being zero. In practice, *α* = *β* = *γ* = 1 and *c*3 = *c*2/2, so the SSIM reduces to:

$$SSIM(\hat{z}, z) = \frac{(2\mu_{\hat{z}}\mu_{z} + c_1)(2\sigma_{\hat{z}z} + c_2)}{\left(\mu_{\hat{z}}^2 + \mu_{z}^2 + c_1\right)\left(\sigma_{\hat{z}}^2 + \sigma_{z}^2 + c_2\right)} \tag{13}$$
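As a concrete illustration, the simplified SSIM of Equation (13) can be computed globally over an image pair. The NumPy sketch below uses the conventional stabilizing constants *c*1 = (0.01·255)² and *c*2 = (0.03·255)² for 8-bit data; these defaults are common practice, not values stated in this paper:

```python
import numpy as np

def ssim(z_hat, z, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Global SSIM of Eq. (13): alpha = beta = gamma = 1, c3 = c2 / 2.
    Computed over the whole image for brevity; the windowed mean SSIM
    used in practice applies the same formula per local window."""
    z_hat = np.asarray(z_hat, dtype=np.float64)
    z = np.asarray(z, dtype=np.float64)
    mu1, mu2 = z_hat.mean(), z.mean()
    var1, var2 = z_hat.var(), z.var()
    cov = ((z_hat - mu1) * (z - mu2)).mean()   # sigma_{z_hat z}
    return ((2 * mu1 * mu2 + c1) * (2 * cov + c2)) / (
        (mu1 ** 2 + mu2 ** 2 + c1) * (var1 + var2 + c2))
```

For identical images the numerator and denominator coincide, so the index equals 1; it decreases toward (and below) zero as brightness, contrast, or structure diverge.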

The other index, the FSIM, is a feature-similarity evaluation index that combines phase congruency (PC) and gradient magnitude (GM) features, as follows:

$$FSIM = \frac{\sum_{\mathbf{x} \in \Omega} S_L(\mathbf{x}) \cdot PC_m(\mathbf{x})}{\sum_{\mathbf{x} \in \Omega} PC_m(\mathbf{x})} \tag{14}$$

where:

$$S\_{PC}(\mathbf{x}) = \frac{2PC\_1(\mathbf{x}) \cdot PC\_2(\mathbf{x}) + T\_1}{PC\_1^2(\mathbf{x}) + PC\_2^2(\mathbf{x}) + T\_1} \tag{15}$$

$$S\_G(\mathbf{x}) = \frac{2G\_1(\mathbf{x}) \cdot G\_2(\mathbf{x}) + T\_2}{G\_1^2(\mathbf{x}) + G\_2^2(\mathbf{x}) + T\_2} \tag{16}$$

$$S_L(\mathbf{x}) = \left[ S_{PC}(\mathbf{x}) \right]^{\alpha} \cdot \left[ S_G(\mathbf{x}) \right]^{\beta} \tag{17}$$

*SPC*(**x**), *SG*(**x**), and *SL*(**x**) represent the phase congruency (PC) similarity, the gradient magnitude (GM) similarity, and the fused PC-GM similarity, respectively, and *PCm*(**x**) = max(*PC*1(**x**), *PC*2(**x**)) weights each location by its maximum phase congruency.
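Given Equations (14)–(17), the FSIM combination step is straightforward once the feature maps exist. The sketch below assumes the phase-congruency maps (`pc1`, `pc2`) and gradient-magnitude maps (`g1`, `g2`) of the two images have already been computed (e.g., with a log-Gabor PC estimator and a Scharr gradient operator); the constants `t1`, `t2` are illustrative defaults, not values from this paper:

```python
import numpy as np

def fsim_from_maps(pc1, pc2, g1, g2, t1=0.85, t2=160.0, alpha=1.0, beta=1.0):
    """FSIM of Eqs. (14)-(17), given precomputed PC and GM maps."""
    s_pc = (2 * pc1 * pc2 + t1) / (pc1 ** 2 + pc2 ** 2 + t1)   # Eq. (15)
    s_g = (2 * g1 * g2 + t2) / (g1 ** 2 + g2 ** 2 + t2)        # Eq. (16)
    s_l = (s_pc ** alpha) * (s_g ** beta)                      # Eq. (17)
    pc_m = np.maximum(pc1, pc2)                                # PC_m(x)
    return (s_l * pc_m).sum() / pc_m.sum()                     # Eq. (14)
```

When both feature maps of the two images coincide, every similarity term equals 1 and the index is exactly 1.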

The similarity indexes of the four experimental schemes are listed in Table 3. By comparing the second and third rows of the table with the first, it can be seen that after improving the generator structure and the loss function, both the SSIM and the FSIM of the generated images improved significantly, and the combined use of the improved generators and the improved loss function obtained better results than improving either one alone.

**Table 3.** The model generated result indicators under different improvement measures. The number in bold indicates the optimal value under the corresponding index.


#### *4.2. Experiment 2*

In order to verify the performance of the proposed method in preserving SAR image features, the proposed algorithm was compared with pix2pix, CycleGAN, and pix2pixHD. During training, the Serial GANs first train the generator *P* and the discriminator *DY*, and then the generator *Q* and the discriminator *DZ*, each for 200 epochs. In Figure 10, the first column shows the SAR images collected by the SEN-1 satellite, the second column shows the optical color images collected by the SEN-2 satellite, and the third, fourth, and fifth columns show the results of pix2pix, CycleGAN, and pix2pixHD, respectively; the sixth column shows the results of the proposed method. According to the results, the proposed method preserves the details of SAR images notably well during the heterogeneous transformation, with results as good as those of pix2pixHD. Moreover, the parameter volume of the model proposed in this paper is significantly lower than that of pix2pixHD.

In order to compare the details of the images generated by different models, Figure 11 compares the results of the proposed method with those of pix2pix, CycleGAN, and pix2pixHD. According to the subjective evaluation criteria, the results of the proposed method and pix2pixHD are closer to the Sentinel satellite image, while the results of pix2pix and CycleGAN are inferior to these two. Although the results of the proposed method and pix2pixHD are not significantly different, the subsequent comparison will show that the proposed method is superior to pix2pixHD.

**Figure 10.** Comparison of the results generated by four different heterogeneous transformation models. From the top to the bottom: the remote sensing images of the river valley, mountains and hills, urban residential area, coastal city and desert. From left to right: SEN-1 images, SEN-2 images, images generated by pix2pix, images generated by CycleGAN, images generated by pix2pixHD, and the images generated by our model.

In order to quantitatively measure the advantages of the method, four image quality assessment (IQA) indexes, namely the PSNR, SSIM, FSIM, and MSE, were selected. As shown in Table 4, the proposed model achieved the best results in PSNR, SSIM, and MSE, and the second-best result in FSIM.

**Table 4.** Comparison of the indexes between the images generated by four methods and the SEN-2 images. The number in bold indicates the optimal value under the corresponding index.


The above experimental results demonstrate, both qualitatively and quantitatively, the effectiveness of the proposed method and its superiority to pix2pix and CycleGAN. To further show that our method is preferable to pix2pixHD, we drew a performance comparison diagram of model size versus FSIM value. As shown in Figure 12, although the FSIM value of our method is 0.0004 lower than that of pix2pixHD, the model size of our method is about half that of pix2pixHD, so our method offers a clearly better cost-performance trade-off.

**Figure 11.** Detailed comparison of Experiment 2. We selected the generated results of different models in three scenarios to compare the details with the SEN-2 reference images. Compared with other image translation models, the proposed model has obvious advantages in improving the generation performance. From left to right: SEN-2 images, images generated by pix2pix, images generated by CycleGAN, images generated by pix2pixHD, and the images generated by our model.

**Figure 12.** Comparison of the results generated by our method and the SOTA methods. The ordinate represents the normalized FSIM value, and the abscissa represents the parameter size (MByte) of the model. The four methods are shown as a scatter diagram; the closer a point is to the upper-left corner (high FSIM, small model), the better the overall cost performance of the model.

#### **5. Discussion**

Existing SAR-to-optical methods are one-step transformation methods; that is, they directly transform SAR images into optical RGB images. However, spectral and texture distortions inevitably occur, reducing the accuracy and reliability of the final transformation result. Moreover, the direct use of CycleGAN or pix2pix for SAR-to-optical transformation only reconstructs the original image at the pixel level, without restoring the spectrum and texture, and such results may not be suitable for further image interpretation. Inspired by image restoration and enhancement techniques, a Serial GAN image transformation method is proposed here and applied to SAR-to-optical tasks.

Based on the SEN1-2 SAR and optical image dataset, the effectiveness of the proposed method was verified through ablation experiments, and its superiority was verified through qualitative and quantitative comparison with several SOTA image transformation methods. The proposed method uses SAR images as prior information to restore and reconstruct them based on the gradient contour and the spectrum. This avoids the mixed distortion caused by directly transforming a SAR image into an optical image, so the final transformation result has better texture detail and an improved spectral appearance. At the same time, our method does not simply learn the SAR-optical mapping; it restores and reconstructs the SAR image from both the texture information and the spectral information, giving it an interpretation advantage similar to that of an optical image. Our method outperformed CycleGAN and pix2pix on the transformation-result indexes, and some indexes were better than those of pix2pixHD. Although the numerical differences were small, intuitive observation shows that the method proposed in this paper is significantly better than CycleGAN and pix2pix. The reason is that our method is not a simple transformation but a reconstruction of SAR images, which restores them from the perspective of image theory. Compared with the SOTA model pix2pixHD, the proposed method has no obvious advantage in the test values, but its parameter size is about half that of pix2pixHD, which makes it more attractive in applications. However, the proposed method also has some potential limitations. First, although we considered different seasons and different land types (urban, rural, semi-urban, and coastal areas) in the training data, supervised learning inevitably depends on the data: for different SAR image resolutions and speckle conditions, the transformation results will differ. In addition, because supervised learning requires a large number of training samples, the model may not train well on a dataset with a small sample size. Therefore, transfer learning, weakly supervised learning, and cross-modal techniques will need to be explored in the future.

#### **6. Conclusions and Prospects**

To address the problem of feature loss and distortion in SAR-to-optical tasks, this paper proposed a feature-preserving heterogeneous image transformation model using Serial GANs, which maintains the consistency of heterogeneous features and reduces the distortion caused by heterogeneous transformation. An improved U-Net structure was adopted in the Despeckling GAN, after which the image is colored by the Colorization GAN to complete the transformation from a SAR image to an optical color image; this effectively alleviates the uncertainty of transformation results caused by the information asymmetry between heterogeneous images. In addition, the end-to-end model architecture enables the trained model to be used directly for SAR-to-optical image transformation. This paper also introduced the feature-preserving loss, which enhances the feature details of the generated image by constraining the gradient map. Intuitive and objective comparisons show that the improved model effectively enhances the detail of the generated image. In our view, Serial GANs have great potential in other heterogeneous image transformations.

Furthermore, they can provide a common framework for SAR image and photoelectric image transformation. In the future, we will consider incorporating multisource heterogeneous images into a Multiple GANs hybrid model to provide support for the cross-modal interpretation of multisource heterogeneous remote sensing images.

**Author Contributions:** Conceptualization, Y.L.; Data curation, D.T. and G.L.; Formal analysis, D.T.; Funding acquisition, Y.L. and Y.H.; Investigation, Y.L. and G.L.; Methodology, D.T., G.L. and S.S.; Project administration, L.Y. and Y.H.; Resources, L.Y.; Software, L.Y. and S.S.; Validation, S.S.; Visualization, S.S.; Writing—original draft, D.T.; Writing—review & editing, Y.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the National Natural Science Foundation of China, Grant Numbers 91538201, 62022092, and 62171453.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The SEN1-2 dataset was used in this study (accessed on 19 July 2021); it is available at https://mediatum.ub.tum.de/1436631. The dataset consists of 282,384 pairs of corresponding synthetic aperture radar and optical image patches, acquired by the Sentinel-1 and Sentinel-2 remote sensing satellites, respectively. It is shared under the open access license CC BY.

**Acknowledgments:** The authors sincerely appreciate the helpful comments and constructive suggestions of the academic editors and reviewers.

**Conflicts of Interest:** The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### **References**


## *Article* **Self-Supervised Despeckling Algorithm with an Enhanced U-Net for Synthetic Aperture Radar Images**

**Gang Zhang 1,\*, Zhi Li 1, Xuewei Li <sup>2</sup> and Sitong Liu <sup>1</sup>**


**Abstract:** Self-supervised methods have proven to be a suitable approach for despeckling synthetic aperture radar (SAR) images. However, most self-supervised despeckling methods are trained with noisy-noisy image pairs constructed from natural images with simulated speckle noise, time-series real-world SAR images, or a generative adversarial network, which limits their practicability on real-world SAR images. Therefore, in this paper, a novel self-supervised despeckling algorithm with an enhanced U-Net is proposed for real-world SAR images. Firstly, unlike previous self-supervised despeckling works, the noisy-noisy image pairs are generated from real-world SAR images through a novel generation training pairs module, which makes it possible to train deep convolutional neural networks using real-world SAR images. Secondly, an enhanced U-Net is designed to improve the feature extraction and fusion capabilities of the network. Thirdly, a self-supervised training loss function with a regularization loss is proposed to address the difference in target pixel values between neighbors in the original SAR images. Finally, visual and quantitative experiments on simulated and real-world SAR images show that the proposed algorithm notably removes speckle noise while better preserving features, exceeding several state-of-the-art despeckling methods.

**Keywords:** self-supervised; synthetic aperture radar (SAR); despeckling; enhanced U-Net

#### **1. Introduction**

Synthetic aperture radar (SAR) [1] is an active remote sensing imaging sensor that transmits electromagnetic signals to targets in a slant-range geometry. Compared with optical imaging sensors, SAR has all-day and all-weather imaging capability. Therefore, SAR has become one of the remote sensors used for disaster assessment [2], resource exploration [3], ocean surveillance [4,5], and statistical analysis [6]. Nevertheless, due to the imaging mechanism, the quality of SAR images is inherently affected by speckle noise [7,8]. Speckle noise is a granular disturbance, usually modeled as multiplicative noise, that affects SAR images as well as all coherent images [8]. Speckle noise may severely diminish detection accuracy [9–12] and information extraction performance [13]. Therefore, the reduction of speckle noise is a key and essential processing step for many applications.

In the past few decades, numerous researchers have attempted to reduce the speckle noise in SAR images. Generally, the existing despeckling methods can be roughly summarized as local window methods, non-local mean (NLM) methods and deep learning (DL) methods. In the first group, local window methods are widely used, such as Lee [14], Frost [15] and Kuan [16]. The despeckling performance of local window methods is very dependent on the window size. The larger the size, the smoother the despeckled image and the better the despeckling performance. However, the despeckled image will lose point targets, linear features and textures.
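As a minimal illustration of the local-window idea, the classic Lee filter shrinks each pixel toward its local mean according to the ratio of the local signal variance to the estimated multiplicative-noise variance. The sketch below is a plain NumPy implementation; the window size and ENL are illustrative defaults, not values from any of the cited papers:

```python
import numpy as np

def box_mean(img, win):
    """Local mean over a win x win window via padded cumulative sums."""
    pad = win // 2
    p = np.pad(img, pad, mode="edge")
    c = np.cumsum(np.cumsum(p, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))
    h, w = img.shape
    s = (c[win:win + h, win:win + w] - c[:h, win:win + w]
         - c[win:win + h, :w] + c[:h, :w])
    return s / (win * win)

def lee_filter(img, win=7, enl=4.0):
    """Lee filter sketch for intensity SAR images:
    x_hat = mean + w * (y - mean), with w in [0, 1]."""
    img = np.asarray(img, dtype=np.float64)
    mean = box_mean(img, win)
    var = np.maximum(box_mean(img ** 2, win) - mean ** 2, 0.0)
    noise_var = (mean ** 2) / enl        # multiplicative speckle variance
    w = var / (var + noise_var + 1e-12)  # adaptive weight
    return mean + w * (img - mean)
```

In homogeneous areas the local variance approaches the noise variance, so `w` is small and the output approaches the smooth local mean; near strong edges `w` approaches 1 and the pixel is kept, which is exactly the window-size trade-off described above.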

**Citation:** Zhang, G.; Li, Z.; Li, X.; Liu, S. Self-Supervised Despeckling Algorithm with an Enhanced U-Net for Synthetic Aperture Radar Images. *Remote Sens.* **2021**, *13*, 4383. https:// doi.org/10.3390/rs13214383

Academic Editors: Tianwen Zhang, Tianjiao Zeng and Xiaoling Zhang

Received: 17 September 2021 Accepted: 28 October 2021 Published: 31 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

To overcome the disadvantage of local window methods, NLM methods are applied to process SAR images, such as PPB [17], SAR-BM3D [18] and NL-SAR [19]. The NLM methods define the similar pixel and pixel weight by measuring the similarity between a local patch centered on the reference pixel and another local patch centered on a selected non-local neighborhood pixel. The greater the similarity, the larger the weight. However, the NLM methods need to use the equivalent number of looks (ENL) as a prior. In practical applications, it is impossible to obtain the accurate ENL of SAR images. In addition, the time complexity of the local window methods and NLM methods is very high.
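The patch-based weighting at the heart of NLM can be sketched as follows. The Gaussian kernel and the bandwidth `h` below are illustrative choices; speckle-specific methods such as PPB and NL-SAR replace this plain Euclidean patch distance with statistically derived similarity measures:

```python
import numpy as np

def nlm_weight(ref_patch, cand_patch, h=10.0):
    """Weight of a candidate pixel: a decreasing function of the distance
    between the patch around the reference pixel and the patch around the
    candidate pixel. Greater similarity yields a larger weight."""
    d2 = np.mean((np.asarray(ref_patch, float)
                  - np.asarray(cand_patch, float)) ** 2)
    return np.exp(-d2 / (h * h))
```

The despeckled value of the reference pixel is then the weighted average of the candidate pixels over the non-local search neighborhood, normalized by the sum of the weights.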

Recently, benefiting from new breakthroughs in deep learning [20,21], more and more researchers have begun to explore DL methods [22–33]. The essence of DL despeckling methods is to learn a mapping from noisy SAR images to noise-free SAR images, which can be described as follows. Firstly, the input of a DL despeckling method is a noisy SAR image. Then, the noisy SAR image is encoded and decoded through convolutional layers, pooling layers, batch normalization layers, and activation function layers. Finally, a noise-free SAR image is obtained. According to whether clean images are available as targets, DL despeckling methods can be divided into three broad categories: supervised, semi-supervised, and self-supervised methods.

The supervised despeckling methods [22–24,26] use noisy-clean image pairs to train convolutional neural networks (CNNs), and the resulting despeckling models are applied to reduce speckle noise in real-world SAR images. Since there are no noise-free SAR images, the training image pairs of the supervised methods are generated by combining ordinary RGB photos (natural images) with simulated speckle noise. The natural images include camera images [23] and aerial images [24]. The advantage of this generation method is that a large number of noisy-clean image pairs can be obtained easily, so a deeper CNN with numerous parameters can be trained. Its disadvantage is that it ignores the peculiar characteristics of SAR images: the noise distribution of SAR images is not the same as in natural images, nor are the content, the texture, or the physical meaning of a pixel value [27]. Compared with real-world SAR images, the simulated SAR images differ considerably in content and geometry; for example, there are strong scattering points in real-world SAR images but not in simulated ones. Therefore, despeckling CNNs trained with simulated images will alter point targets, linear features, and textures in real-world SAR images. Benefiting from the noise2noise method [34], semi-supervised and self-supervised methods [28–31] were proposed. The idea of noise2noise is to directly use noisy-noisy image pairs to train deep CNNs. The semi-supervised despeckling model [27] used a small number of noisy-clean image pairs to train CNNs, and the obtained despeckling model was then fine-tuned on time-series real-world SAR images; that is, the model obtained in the first step was trained again using the time-series real-world SAR images. Compared with supervised despeckling methods, semi-supervised methods can better reduce the speckle noise of real-world SAR images. Nevertheless, time-series SAR images differ across acquisition times, which limits the despeckling performance. The self-supervised despeckling methods directly use extensive noisy-noisy image pairs to train CNNs. The noisy-noisy image pairs are generated using natural images with simulated speckle noise [28], time-series SAR images [31], or a generative adversarial network (GAN) [35], which still limits the practicability of these methods on real-world SAR images.
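For concreteness, the simulation-based pair construction criticized above (natural image multiplied by simulated speckle) can be sketched as follows. Fully developed *L*-look intensity speckle is modeled as unit-mean Gamma-distributed multiplicative noise; the ENL value here is an illustrative choice:

```python
import numpy as np

def speckle_pair(clean, enl=4, rng=None):
    """Generate a noisy-noisy training pair (I1, I2) from a clean natural
    image by multiplying it with two independent draws of fully developed
    multiplicative speckle (Gamma, unit mean, shape = ENL)."""
    rng = np.random.default_rng(rng)
    s1 = rng.gamma(shape=enl, scale=1.0 / enl, size=clean.shape)
    s2 = rng.gamma(shape=enl, scale=1.0 / enl, size=clean.shape)
    return clean * s1, clean * s2
```

Both noisy images share the same underlying clean signal, which is what lets noise2noise-style training use one as the input and the other as the target; the limitation discussed in the text is that `clean` is a natural image rather than a SAR scene.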

Through the above analysis, a brief summary can be made. Firstly, DL despeckling methods are raising great interest; however, most DL methods focus on new networks [22–26,32,33,36] while ignoring the most essential problem, which in our opinion is the lack of truly noise-free SAR images. Simulated SAR images cannot really remedy this deficiency. Secondly, despeckling CNNs are becoming deeper and deeper, and the number of trainable parameters keeps increasing. Simulated SAR images can be generated easily through the pixel-wise product of clean natural images with simulated speckle noise [30], but the despeckling models obtained this way face new challenges when processing real-world SAR images: CNNs trained with simulated image pairs will alter point targets, linear features, and textures in real-world SAR images. Thirdly, although the noise2noise method can directly use noisy images as targets, the despeckling performance is still affected by the natural images [29] and by the performance of the GAN [30]. The previous works [27–31] could not actually use real-world SAR data to train the despeckling CNNs. Therefore, in this paper, inspired by noise2noise [34], a novel self-supervised despeckling algorithm with an enhanced U-Net, called SSEUNet, is proposed. Compared with previous despeckling CNNs [27–31], SSEUNet can directly use real-world SAR images to train deep CNNs. The SSEUNet is composed of a generation training pairs (GTP) module, an enhanced U-Net (EUNet), and a self-supervised training loss function with a regularization loss. The main contributions and innovations of the proposed algorithm are as follows:


The rest of this paper is organized as follows. Section 2 analyzes the related work, including noisy-clean and noisy-noisy despeckling methods. Section 3 describes the proposed method in detail. Section 4 illustrates the visual results and quantitative evaluation metrics. Finally, the conclusions are drawn in Section 5.

#### **2. Related Work**

#### *2.1. Noisy-Clean Despeckling Methods*

In recent years, benefiting from deep learning, noisy-clean despeckling methods have been studied in depth. Inspired by the denoising CNN [38], Chierchia et al. [22] proposed the first despeckling CNN (SAR-CNN), which is composed of 17 convolutional layers. The training image pairs of SAR-CNN were generated from time-series real-world SAR images, with 25-look SAR images as targets. Zhang et al. [24] directly used dilated convolution layers and skip connections to reduce speckle noise in SAR images (SAR-DRN). Dilated convolution allows the CNN to have a lightweight structure and small filter sizes without reducing the receptive field, and the skip connections mitigate the vanishing-gradient problem. Similar to SAR-DRN, Gui et al. [25] proposed a dilated densely connected CNN. After considering the brightness distribution of the speckle noise, Shen et al. [26] proposed a recursive deep convolutional neural prior model, which includes a data-fitting block and a deep CNN prior block; the gradient descent algorithm was used for the data-fitting block, and a pre-trained dilated residual channel attention network was applied in the deep CNN prior block. Pan et al. [32] combined the multi-channel logarithm with Gaussian denoising (MuLoG) algorithm with a fast and flexible denoising CNN to deal with the multiplicative noise of SAR images. Li et al. [33] designed a CNN with a convolutional block attention module to improve representation power and despeckling performance. To help the network retain image details, Zhang et al. [39] proposed a multi-connection CNN with wavelet features. Because the NLM method is one of the most promising algorithms, Cozzolino et al. [23] combined it with a deep CNN to design a non-local-means CNN (NLM-CNN). The NLM-CNN uses the deep CNN to provide

interpretable results for the target pixel and the predicted pixel. Mullissa et al. [36] proposed a two-stage despeckling CNN called deSpeckleNet, which sequentially estimates the speckle-noise distribution and the noise-free SAR image. Vitale et al. [40] designed a weighted loss function that considers the contents of SAR images. Except for SAR-CNN, the noisy-clean despeckling methods use synthetic training on simulated SAR images. Due to the differences in imaging mechanisms and image features between SAR and natural images, i.e., grayscale distribution and spatial correlation [41], training on simulated SAR images is not the best solution [30]. Compared with the most advanced traditional despeckling methods, deep despeckling CNNs have obvious advantages, but the lack of truly noise-free SAR images is a major factor limiting despeckling performance.

#### *2.2. Noisy-Noisy Despeckling Methods*

In the real world, it is impossible to obtain noise-free SAR images. Inspired by noise2noise [34], the noisy-noisy despeckling methods [27–31] have been studied in depth. Yuan et al. [28] designed a self-supervised densely dilated CNN (BDSS-CNN) for blind despeckling. In the BDSS-CNN, the noisy-noisy image pairs were generated by adding simulated speckle noise with a random ENL to natural images; the generated pairs were used to train the BDSS-CNN, and the obtained despeckling model was then applied to real-world SAR images. Inspired by blind-spot denoising networks [42], Molini et al. [29] reported a self-supervised Bayesian despeckling CNN whose training image pairs were constructed from natural images. Yuan et al. [30] designed a practical SAR image despeckling method to reduce the impact of natural images. The method contains two sub-networks: the first is a GAN, which is used to generate the speckle-speckle training pairs, and the second is an enhanced nested U-Net trained with these pairs. Clearly, the quality of the speckle images generated by the GAN directly affects the despeckling performance of the enhanced nested U-Net. Dalsasso et al. [27] and Ma et al. [31] abandoned the use of natural images and GANs and proposed to use time-series real-world SAR images. However, the natural landscape often changes significantly over short periods in time-series real-world SAR images, so the despeckling performance of [27,31] is still limited.

Although noise2noise methods have been used to address the lack of truly noise-free SAR images, the despeckling performance is still affected by the natural images [29] and the performance of the GAN [30]. The previous works [27–31] could not actually use real-world SAR images to train the despeckling CNNs. Therefore, we designed a novel self-supervised despeckling algorithm with an enhanced U-Net (SSEUNet). The SSEUNet includes a GTP module, an EUNet, and a loss function. The GTP module can directly generate training image pairs from real-world noisy images, and the generated pairs are used to train the EUNet through the proposed loss function. When processing real-world SAR images, the proposed SSEUNet eliminates the influence of natural images, GAN performance, and time-series images.

#### **3. Proposed Method**

In order to train the despeckling CNN directly on real-world SAR images, we propose a novel self-supervised despeckling algorithm called SSEUNet. The SSEUNet is mainly composed of a GTP module, an enhanced U-Net (EUNet), and a loss function. The GTP module generates noisy-noisy image pairs for training the proposed EUNet. The EUNet is an enhanced version of U-Net with stronger feature extraction and fusion capabilities. The loss function is a self-supervised training loss with a regularization term, which is used to optimize the EUNet. In this section, Section 3.1 gives a detailed overview of the proposed SSEUNet framework, Section 3.2 introduces the implementation of the proposed GTP module, Section 3.3 describes the proposed EUNet, and Section 3.4 introduces the loss function of SSEUNet.

#### *3.1. Overview of Proposed SSEUNet*

The proposed SSEUNet is composed of a generation training pairs (GTP) module, an enhanced U-Net (EUNet), and a self-supervised training loss function with a regularization loss. Figure 1 shows the overview of the proposed SSEUNet framework, where *y* is the input image of the SSEUNet, {*z*1, *z*2} is the noisy-noisy image pair generated by the proposed GTP module, *Fφ* is the EUNet with parameters *φ*, {*z*ˆ1, *z*ˆ2} is the despeckled image pair generated by the GTP module, and L is the proposed loss function. The SSEUNet is divided into a training phase (Figure 1a) and an inference phase (Figure 1b). In the training phase, a pair of noisy-noisy images {*z*1, *z*2} is generated from a noisy image *y* using the masks G1 and G2 produced by the GTP module. The EUNet takes *z*1 and *z*2 as input and target, respectively. The input image *y* is also fed into the EUNet, and the despeckled image *y*ˆ = *Fφ*(*y*) is obtained in each training epoch. The despeckled-despeckled image pair {*z*ˆ1, *z*ˆ2} of *y*ˆ is then obtained by the GTP module. The loss function is computed using *Fφ*(*z*1), *z*2, *z*ˆ1, and *z*ˆ2. In the inference phase, despeckled SAR images are obtained directly with the trained EUNet.
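The quantities involved in the training-phase loss can be sketched numerically as follows. The exact functional form and the weight `gamma` of the regularization term below are illustrative assumptions, not the paper's definition (the actual loss is introduced in Section 3.4): a reconstruction term compares the network output on *z*1 with the target *z*2, and a regularizer compares that residual with the difference of the despeckled pair *z*ˆ1 − *z*ˆ2:

```python
import numpy as np

def sseunet_loss(f_z1, z2, zhat1, zhat2, gamma=1.0):
    """Sketch of a self-supervised loss with a regularization term:
    MSE between F(z1) and z2, plus a penalty on how far the residual
    F(z1) - z2 deviates from the despeckled-pair gap zhat1 - zhat2.
    The weight gamma is an illustrative choice."""
    recon = np.mean((f_z1 - z2) ** 2)
    reg = np.mean((f_z1 - z2 - (zhat1 - zhat2)) ** 2)
    return recon + gamma * reg
```

When the network output matches the target and the despeckled pair is consistent, both terms vanish; otherwise the regularizer penalizes the neighbor-pixel discrepancy that the text identifies as the motivation for the extra loss term.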

**Figure 1.** Overview of the proposed SSEUNet framework. (**a**) Complete view of the training phase. (**b**) Inference phase using the trained EUNet.

#### *3.2. Proposed GTP Module*

Benefiting from noise2noise despeckling methods, we have conducted further research on the construction of noisy-noisy image pairs. According to Goodman's theory, fully developed speckle noise in SAR images is completely random and independently distributed. Therefore, we attempt to generate noisy-noisy image pairs directly from the real-world SAR images *y* ∈ {*y<sub>i</sub>*}<sup>*N*</sup><sub>*i*=1</sub>, where *N* is the total number of real-world SAR images.

The height and width of *y* are *H* and *W*, respectively, and the noisy-noisy image pair is {*z*<sub>1</sub>, *z*<sub>2</sub>}. The generation process of the proposed GTP module is divided into three steps. Firstly, *y* is divided into ⌊*H*/*k*⌋ × ⌊*W*/*k*⌋ patches, where ⌊·⌋ is the floor operation and *k* is the patch size. Secondly, for a patch at position (*l*, *m*), two pixels of the patch are randomly extracted and used as the pixels of *z*<sub>1</sub> and *z*<sub>2</sub> at position (*l*, *m*), respectively. Finally, the noisy-noisy image pair {*z*<sub>1</sub>, *z*<sub>2</sub>} is obtained by repeating the second step for all ⌊*H*/*k*⌋ × ⌊*W*/*k*⌋ patches, so the size of *z*<sub>1</sub> and *z*<sub>2</sub> is ⌊*H*/*k*⌋ × ⌊*W*/*k*⌋. Compared with methods that use natural images or a GAN, the GTP module can directly generate noisy-noisy image pairs from real-world noisy images. Figure 2 shows three generation methods for noisy-noisy image pairs, where *y* is the noisy image and *I* is the clean natural image. *SN*<sub>1</sub> and *SN*<sub>2</sub> are two independent simulated speckle noises, *I*<sub>1</sub> and *I*<sub>2</sub> are simulated SAR images, and *y<sub>G</sub>* is the speckled SAR image generated by a GAN [35].

**Figure 2.** Comparison of different methods for generating noisy-noisy image pairs. (**a**) The method of proposed GTP module. (**b**) The method of using natural images. (**c**) The method of using GAN.

In order to better explain the difference between the three generation methods, *k* is set to 2. When using the GTP module, *y* is divided into ⌊*H*/2⌋ × ⌊*W*/2⌋ patches and the size of *z*<sub>1</sub> and *z*<sub>2</sub> is a quarter of that of *y*. In Figure 2a, an example of generating an image pair is shown on the right. The original image size is 6 × 6, so the image is divided into 9 (6/2 × 6/2) patches. In each patch, two pixels are randomly chosen and filled with orange and blue, respectively. The pixel filled with orange is taken as a pixel of the noisy image *z*<sub>1</sub>, and the pixel filled with blue is taken as a pixel of the other noisy image *z*<sub>2</sub>. The noisy-noisy image pair {*z*<sub>1</sub>, *z*<sub>2</sub>} is displayed as the orange image and the blue image on the right. In Figure 2b, the clean image *I* is element-wise multiplied with *SN*<sub>1</sub> and *SN*<sub>2</sub> to obtain the noisy-noisy image pair {*I*<sub>1</sub>, *I*<sub>2</sub>}; the size of *I*<sub>1</sub> and *I*<sub>2</sub> is *H* × *W*. In Figure 2c, the image *y<sub>G</sub>* is generated by a GAN, and *y* and *y<sub>G</sub>* are combined to construct the noisy-noisy image pair {*y*, *y<sub>G</sub>*}; the size of *y* and *y<sub>G</sub>* is *H* × *W*. Comparing the three generation methods, the proposed GTP module directly generates noisy-noisy image pairs from noisy images and is therefore not affected by natural images or GAN performance. The size of the noisy-noisy image pairs does not affect the despeckling performance of CNNs.
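The three-step generation procedure above can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' code; the function name `gtp_pairs` and the random-generator argument are our own choices.

```python
import numpy as np

def gtp_pairs(y, k=2, rng=None):
    """Generate a noisy-noisy image pair (z1, z2) from a single noisy image y.

    Every k-by-k patch of y contributes two randomly drawn, distinct pixels:
    one to z1 and one to z2, so each output is floor(H/k) x floor(W/k).
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W = y.shape
    h, w = H // k, W // k                      # floor division
    z1 = np.empty((h, w), dtype=y.dtype)
    z2 = np.empty((h, w), dtype=y.dtype)
    for l in range(h):
        for m in range(w):
            patch = y[l * k:(l + 1) * k, m * k:(m + 1) * k].ravel()
            i, j = rng.choice(patch.size, size=2, replace=False)
            z1[l, m] = patch[i]                # e.g., the "orange" pixel in Figure 2a
            z2[l, m] = patch[j]                # e.g., the "blue" pixel in Figure 2a
    return z1, z2
```

Because the two pixels are drawn without replacement from the same patch, every output position of *z*<sub>1</sub> and *z*<sub>2</sub> carries a different sample of the same local statistics.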

In order to verify that the noisy-noisy image pairs generated by the GTP module do not change the distribution of the original noisy images, Figure 3 shows examples of such pairs. In Figure 3, *μ* and *σ* are the mean and standard deviation, respectively. The size of the original images is 256 × 256 and the size of the noisy-noisy image pairs is 128 × 128. It can be seen that the histograms, *μ*, *σ* and visual effects are very similar.

**Figure 3.** Examples of noisy-noisy image pairs using GTP module. (**a**) Original noisy images *y*. (**b**) Histograms of *y*. (**c**) Noisy images *z*1. (**d**) Histograms of *z*1. (**e**) Noisy images *z*2. (**f**) Histograms of *z*2.

#### *3.3. Enhanced U-Net*

The U-Net [37] was originally proposed for medical image segmentation. It is a fully convolutional network and can be trained with very few images by using data augmentation. The U-Net is widely used in image denoising [43] and super-resolution [44]. At present, there are many variants of U-Net, which are used for various tasks. For example, the nested U-Net [45] and hybrid densely connected U-Net [46] are used for medical image segmentation. Multi-class attention-based U-Net [47] is designed for earthquake detection and seismic phase-picking.

As speckle noise is modeled as multiplicative noise, the non-linear relationship between noise-free SAR images and speckle noise can be learned by a deep CNN. Therefore, an enhanced U-Net (EUNet) is designed to enhance the feature extraction and fusion capabilities of U-Net [37]. The detailed architecture of EUNet is displayed in Figure 4, where *BN*-*RRDC* represents the residual-in-residual densely connected block with batch normalization (BN-RRDC), SSAM1-SSAM4 are sub-space attention modules (SSAM) [48], ISC1-ISC4 denote the proposed improved skip connections, *Conv* and *s* denote a convolutional layer and its stride, and *Cat* and *TConv*1-*TConv*4 represent the concatenation layer and transpose convolution layers, respectively. Compared with U-Net [37], EUNet has four main enhancements. Firstly, we replace the convolutional layers with BN-RRDC blocks. The BN-RRDC block is composed of residual-in-residual dense connections [49] and a batch normalization layer, which significantly improves the feature extraction and representation capabilities of U-Net. Meanwhile, because of the residual structure in the BN-RRDC block, the training difficulty of the EUNet does not increase.

**Figure 4.** The detailed architecture of EUNet.

Secondly, to reduce feature loss, the pooling layers are replaced by convolutional layers with *s* = 2. Thirdly, we design improved skip connections (ISC1-ISC4) to narrow the gap between encoder features and decoder features. In the architecture of U-Net [37], the skip connections directly send the encoder features to the decoder. In EUNet, the improved skip connections contain different numbers of BN-RRDC blocks: ISC1-ISC4 contain 1, 2, 3 and 4 blocks, respectively. Finally, SSAM is introduced into the feature fusion to restore the weak texture and high-frequency information of the image. SSAM is a non-local sub-space attention module, which uses non-local information to generate basis vectors through projection. The reconstructed SAR image can retain most of the original information while suppressing speckle noise that is irrelevant to the basis vectors. SSAM has achieved state-of-the-art denoising performance for removing noise in natural images. The detailed structures of the BN-RRDC block and SSAM are displayed in Figure 5, where *BN* is the batch normalization layer.

**Figure 5.** The detailed structure of BN-RRDC block and SSAM.

#### *3.4. Loss Function of SSEUNet*

The self-supervised training method follows the noise2noise method [34], which does not require clean images as targets; it only requires two noisy observations of the same image. Assuming the noisy-noisy image pair {*y*<sub>1</sub>, *y*<sub>2</sub>} comes from the clean image *x*, noise2noise tries to minimize the following loss function:

$$\mathcal{L}\_{mse} = ||F\_{\Phi}(y\_1) - y\_2||\_2^2, \tag{1}$$

where *F<sub>φ</sub>* is the denoising network and *φ* denotes its parameters. Equation (1) is a pixel-level loss function, and its optimization is the same as that of supervised learning CNNs. In previous noise2noise despeckling methods, Equation (1) is the commonly used loss function. Assume that the noisy SAR image is *y* and the generation masks of the GTP module are G<sub>1</sub> and G<sub>2</sub>. The noisy-noisy image pair is {*z*<sub>1</sub>, *z*<sub>2</sub>}, where *z*<sub>1</sub> and *z*<sub>2</sub> can be written as:

$$z\_1 = \mathcal{G}\_1 \odot y, \; z\_2 = \mathcal{G}\_2 \odot y,\tag{2}$$

where ⊙ denotes the operation of the GTP module. When training EUNet, Equation (1) is rewritten as:

$$\mathcal{L}\_{mse} = ||F\_{\Phi}(z\_1) - z\_2||\_2^2. \tag{3}$$

The despeckled image pair {*z*′<sub>1</sub>, *z*′<sub>2</sub>} of *y*′ can be obtained by Equation (4):

$$z\_1' = \mathcal{G}\_1 \odot y', \ z\_2' = \mathcal{G}\_2 \odot y', \tag{4}$$

where *y*′ = *F<sub>φ</sub>*(*y*). Thus, the regularization loss can be defined as:

$$\mathcal{L}\_{reg} = ||F\_{\Phi}(z\_1) - z\_2 + z\_2' - z\_1'||\_2^2. \tag{5}$$

Meanwhile, *z*′<sub>1</sub> and *z*′<sub>2</sub> can also be obtained by the despeckling network *F<sub>φ</sub>*, which can be written as:

$$z\_1' = F\_\Phi(z\_1), \ z\_2' = F\_\Phi(z\_2). \tag{6}$$

In an ideal state, *F*<sup>∗</sup><sub>φ</sub>(*z*<sub>1</sub>) and *z*<sub>2</sub> are exactly the same, where *F*<sup>∗</sup><sub>φ</sub> is the optimal despeckling network. Meanwhile, *z*′<sub>1</sub> and *z*′<sub>2</sub> are also exactly the same, so L<sub>*reg*</sub> should be equal to 0. Equation (5) thus provides a regularization loss that is satisfied when the despeckling network *F<sub>φ</sub>* is the optimal network *F*<sup>∗</sup><sub>φ</sub>. The regularization loss can narrow the gap of target pixel values between neighbors on the original noisy image. In order to exploit the regularization loss, we do not directly optimize Equation (3), but add Equation (5) to Equation (3). Finally, the loss function of SSEUNet is defined as:

$$\mathcal{L} = \mathcal{L}\_{mse} + \lambda \mathcal{L}\_{reg}, \tag{7}$$

where *λ* is the hyper-parameter.
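Putting Equations (2)-(7) together, the full training objective can be sketched as follows. This is a schematic NumPy version, not the authors' implementation: the GTP masks G<sub>1</sub> and G<sub>2</sub> are modeled as sub-sampling callables `g1`/`g2`, and the network *F<sub>φ</sub>* is any callable `f`.

```python
import numpy as np

def sseunet_loss(f, y, g1, g2, lam=1.0):
    """Total loss of Eq. (7): L = L_mse + lam * L_reg.

    f      -- the despeckling network F_phi (any callable here)
    y      -- noisy SAR image
    g1, g2 -- GTP sub-sampling operators (the masks G1, G2 of Eq. (2))
    """
    z1, z2 = g1(y), g2(y)                          # Eq. (2): noisy-noisy pair
    out = f(z1)
    l_mse = np.mean((out - z2) ** 2)               # Eq. (3)
    y_hat = f(y)                                   # y' = F_phi(y)
    z1p, z2p = g1(y_hat), g2(y_hat)                # Eq. (4): z'_1, z'_2
    l_reg = np.mean((out - z2 + z2p - z1p) ** 2)   # Eq. (5)
    return l_mse + lam * l_reg                     # Eq. (7)
```

With an identity network, the regularizer vanishes (the two residual terms cancel) and only the mean-squared gap between the two sub-sampled images remains, matching the ideal-state argument above.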

#### **4. Experiments and Analysis**

We evaluate the effectiveness of the proposed SSEUNet on simulated and real-world SAR datasets. Firstly, we report the implementation details. Secondly, the training and inference datasets, a simulated SAR dataset and a real-world SAR dataset, are introduced in detail. Thirdly, different evaluation metrics are selected according to the characteristics of simulated and real-world SAR images. Finally, we present the experimental results and analysis on the simulated and real-world SAR datasets.

#### *4.1. Implementation Details*

The proposed SSEUNet does not require a pre-trained model and can be trained end-to-end from scratch. In the training phase, the network parameters are initialized with He initialization [50]. The optimizer is the Adam algorithm [51] with momentum terms *β*<sub>1</sub> = 0.9 and *β*<sub>2</sub> = 0.999. The initial learning rate of SSEUNet is set to 0.00001 for the real-world experiments and 0.0001 for the simulated experiments; the learning rate is reduced with a reduce-on-plateau strategy with a decay ratio of 0.5. The training images are 64 × 64 and the mini-batch size is 10. The number of training epochs is set to 100. The hyper-parameter *λ* in the loss function controls the strength of the regularization term; empirically, *λ* is set to 2 in the simulated SAR experiments and to 1 in the real-world SAR experiments. In the inference phase, the input size of the trained EUNet is 256 × 256. All experiments are conducted on a workstation with Ubuntu 18.04. The hardware is an Intel Xeon(R) CPU E5-2620v3, an NVIDIA Quadro M6000 24 GB GPU and 48 GB of RAM. The deep learning framework is PyTorch 1.4.0.

#### *4.2. Datasets*

We use the UC Merced Land-use (UCML) dataset [52], which is widely used for land-use classification, as the simulated dataset. The UCML dataset is an optical remote sensing dataset containing 2100 images extracted from the United States Geological Survey (USGS) National Map of US regions. The resolution is 0.3 m and the size of each image is 256 × 256 × 3. To generate the simulated SAR images, we convert each image to grayscale. Then, similar to [53], we generate simulated SAR images by multiplying clean grayscale images with simulated speckle noise, which follows a Gamma distribution. In our experiments, we only consider single-look SAR images; because of the high intensity of speckle noise, this is a very challenging case. The mean and variance of the simulated speckle noise are both 1. We randomly divide the 2100 images into a training set (1470), a validation set (210) and a testing set (420). In the training phase, data augmentation is used to train SSEUNet. The augmentation method is cropping with a crop size of 64 × 64, and the augmented training set contains 213,175 patches. The size of the training image pairs generated by the GTP module is 32 × 32. The validation and testing sets do not use data augmentation. Figure 6 displays examples of the generated simulated SAR images.

**Figure 6.** Examples of simulated SAR images. (**a**) RGB images. (**b**) Grayscale images. (**c**) Simulated SAR images.
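The speckle simulation described above can be sketched as follows, assuming fully developed speckle modeled as unit-mean Gamma noise, Gamma(*L*, 1/*L*), multiplied with the clean image; for single-look images *L* = 1 and the variance is also 1. The function name is our own choice.

```python
import numpy as np

def simulate_sar(clean, looks=1, rng=None):
    """Multiply a clean grayscale image by unit-mean Gamma speckle.

    For looks = 1 (single-look), the Gamma(1, 1) noise has mean 1 and
    variance 1, matching the statistics stated in the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.gamma(shape=looks, scale=1.0 / looks, size=clean.shape)
    return clean.astype(np.float64) * noise
```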

In order to validate the practicability of SSEUNet, seven large-scale single-look SAR images acquired by the ICEYE SAR sensors [54] are used. The details of each SAR image are listed in Table 1, where Pol., Level, Mode, Angle and Loc. are the polarization type, image format, imaging mode, look angle and imaging area, respectively, and SLC, VV, SL, SM, Asc. and Des. represent single look complex image, vertical-vertical polarization, spotlight, stripmap, ascending orbit and descending orbit, respectively. We convert the SLC SAR data (*I<sub>SLC</sub>*) into amplitude SAR images (*y*) as follows. Firstly, the amplitude SAR data (*I<sub>A</sub>*) is obtained from the SLC SAR data through Equation (8):

$$I\_A = \sqrt{I\_{Real}^2 + I\_{Imag}^2}, \tag{8}$$

where *I<sub>Real</sub>* and *I<sub>Imag</sub>* are the real and imaginary parts of *I<sub>SLC</sub>*, respectively. Secondly, the obtained amplitude SAR data (*I<sub>A</sub>*) is normalized. The normalization is defined as:

$$I\_A^{norm} = 10 \log\_{10} (\frac{I\_A}{\text{Max}(I\_A)}),\tag{9}$$

where *Max*(·) denotes the maximum function. Finally, the normalized SAR data (*I*<sup>*norm*</sup><sub>*A*</sub>) is encoded to obtain an amplitude SAR image (*y*). The encoding is defined as:

$$y = \frac{I\_A^{norm} - Min(I\_A^{norm})}{Max(I\_A^{norm}) - Min(I\_A^{norm})} \times (2^B - 1),\tag{10}$$

where *Min*(·) denotes the minimum function and *B* is the number of encoding bits. In our experiments, *B* is set to 8.
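Equations (8)-(10) amount to a magnitude, log-normalization and quantization pipeline, which can be sketched as follows. This is an illustrative NumPy version assuming non-zero, non-constant amplitudes; the final rounding step is our own choice, as the paper does not specify one.

```python
import numpy as np

def slc_to_amplitude_image(i_slc, bits=8):
    """Convert complex SLC data to an encoded amplitude image, Eqs. (8)-(10)."""
    i_a = np.sqrt(i_slc.real ** 2 + i_slc.imag ** 2)           # Eq. (8)
    i_norm = 10.0 * np.log10(i_a / i_a.max())                  # Eq. (9), in dB
    y = (i_norm - i_norm.min()) / (i_norm.max() - i_norm.min())
    return np.round(y * (2 ** bits - 1)).astype(np.uint16)     # Eq. (10), B = bits
```

With *B* = 8 the output spans exactly 0-255, so the result can be stored as an ordinary 8-bit grayscale image.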


**Table 1.** Acquisition parameters for the ICEYE-SAR sensors.

Each amplitude SAR image (*y*) is cropped to 256 × 256 with a stride of 64. The total number of cropped images is 19,440. They are randomly divided into a training set (12,800), a validation set (1520) and a testing set (5120). The training set uses the same data augmentation as the simulated images, and the final training data contain 473,600 patches. Examples of the testing images are displayed in Figure 7, where the white boxes mark the selected regions used to compute the no-reference evaluation metrics: R1–R4 are homogeneous regions, R5–R8 are heterogeneous regions, and R9 and R10 are whole images. These regions are excluded from the training set.

**Figure 7.** Real-world SAR images used for evaluation.

#### *4.3. Evaluation Metrics*

For the simulated SAR experiments, the classic evaluation metrics peak signal-to-noise ratio (PSNR, the higher the better) [55] and structural similarity index (SSIM, the closer to 1 the better) [56] are used. PSNR and SSIM are defined as follows:

$$PSNR = 10\log\_{10}\frac{Max\_g^2}{MSE(p, g)}, \tag{11}$$

$$SSIM = \frac{(2\mu\_p\mu\_g + c\_1)(2\sigma\_{pg} + c\_2)}{(\mu\_p^2 + \mu\_g^2 + c\_1)(\sigma\_p^2 + \sigma\_g^2 + c\_2)}, \tag{12}$$

where *p* and *g* are the despeckled image and the clean reference image, respectively. *Max<sub>g</sub>* is the maximum signal value, i.e., 255 for 8-bit grayscale images. *MSE* is computed between the clean reference image and its despeckled image. *μ<sub>p</sub>*, *σ<sub>p</sub>*, *μ<sub>g</sub>* and *σ<sub>g</sub>* represent the means and standard deviations of *p* and *g*, *σ<sub>pg</sub>* is the covariance between *p* and *g*, and *c*<sub>1</sub> and *c*<sub>2</sub> are constants.
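As a concrete reference, Equations (11) and (12) can be written out directly. Note that the standard SSIM [56] is computed over local windows and averaged; the single-window form below only illustrates the formula. The constants *c*<sub>1</sub> and *c*<sub>2</sub> follow the usual choices (0.01·*Max<sub>g</sub>*)² and (0.03·*Max<sub>g</sub>*)², which the paper does not specify.

```python
import numpy as np

def psnr(p, g, max_g=255.0):
    """Eq. (11): PSNR between despeckled image p and clean reference g."""
    mse = np.mean((p.astype(np.float64) - g.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_g ** 2 / mse)

def ssim_global(p, g, max_g=255.0):
    """Eq. (12) on global statistics (single-window illustration)."""
    c1, c2 = (0.01 * max_g) ** 2, (0.03 * max_g) ** 2   # assumed constants
    mu_p, mu_g = p.mean(), g.mean()
    cov = np.mean((p - mu_p) * (g - mu_g))
    return ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_g ** 2 + c1) * (p.var() + g.var() + c2))
```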

For the real-world SAR experiments, the equivalent number of looks (ENL, the higher the better) [57] and the edge-preservation degree based on the ratio of averages (ER, the closer to 1 the better) [58] are used. ENL and ER are defined as:

$$ENL = \frac{\mu\_d^2}{\sigma\_d^2},\tag{13}$$

$$ER = \frac{\sum\_{i}^{m} \left| I\_{d1} \left( i \right) / I\_{d2} \left( i \right) \right|}{\sum\_{i}^{m} \left| I\_{o1} \left( i \right) / I\_{o2} \left( i \right) \right|}, \tag{14}$$

where *μ<sub>d</sub>* and *σ<sub>d</sub>* are the mean and standard deviation of the despeckled image, *i* is the pixel index and *m* is the total number of pixels of the SAR image. *I<sub>d1</sub>*(*i*) and *I<sub>d2</sub>*(*i*) represent adjacent pixel values in the horizontal or vertical direction of the despeckled image, and *I<sub>o1</sub>*(*i*) and *I<sub>o2</sub>*(*i*) are the corresponding adjacent pixel values of the noisy image.
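Equations (13) and (14) translate directly into code. The sketch below computes ER along the horizontal direction only (adjacent column pairs); the paper evaluates both directions. Division by zero for constant regions or zero-valued pixels is not handled.

```python
import numpy as np

def enl(region):
    """Eq. (13): equivalent number of looks of a homogeneous region."""
    region = region.astype(np.float64)
    return region.mean() ** 2 / region.var()

def er_horizontal(despeckled, noisy):
    """Eq. (14): edge-preservation degree (ratio of averages), horizontal
    direction only; transpose the inputs for the vertical direction."""
    d1, d2 = despeckled[:, :-1].astype(np.float64), despeckled[:, 1:].astype(np.float64)
    o1, o2 = noisy[:, :-1].astype(np.float64), noisy[:, 1:].astype(np.float64)
    return np.sum(np.abs(d1 / d2)) / np.sum(np.abs(o1 / o2))
```

An unfiltered image compared against itself yields ER = 1 exactly, which is why values close to 1 indicate good edge preservation.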

#### *4.4. Results and Discussion*

In order to demonstrate the effectiveness of SSEUNet, we compare it qualitatively and quantitatively with Lee [14], Frost [15], SAR-BM3D [18], PPB [17], MuLoG [59], SAR-CNN [22], SAR-DRN [24] and SSUNet. Lee and Frost are local window filters with windows set to 5 × 5. SAR-BM3D, PPB and MuLoG belong to the class of NLM methods; their publicly available Matlab codes are used with the parameters suggested in the original papers. SAR-CNN and SAR-DRN are machine learning methods. We implement and train SAR-CNN and SAR-DRN from scratch on a noisy-noisy dataset, following the specifics given by the authors in the original papers. That noisy-noisy dataset is generated using natural images, so SAR-CNN and SAR-DRN are unsupervised (noise2noise) methods. SSUNet is a self-supervised method whose training data is generated by the GTP module; the network of SSUNet is U-Net [37].

#### 4.4.1. Experiments on Simulated SAR Images

For the simulated SAR image despeckling experiments, a combination of quantitative and visual comparisons is used to analyze the effects of the different methods.

In Table 2, we show the average quantitative evaluation results obtained on simulated single-look SAR images, with the best performance marked in bold and the second-best underlined. *Times* and *Param*. are the average inference speed and the number of parameters, respectively, and *GFLOPS* represents giga floating-point operations per second. *Param*. and *GFLOPS* are not marked, because their values cannot reflect the performance of the despeckling methods. The results are computed as the average over the 420 testing images. The local window methods and NLM methods are grouped in the middle part of the table, while the noise2noise methods are listed in the lower part. The difference between SSUNet and SSEUNet is the network: SSUNet uses U-Net [37] and SSEUNet uses the proposed EUNet. Their results verify that EUNet has better despeckling performance than U-Net.


**Table 2.** Numerical results on simulated single-look images.

As can be seen from Table 2, the proposed SSEUNet outperforms the other methods on MSE, PSNR and SSIM. Looking at the PSNR and SSIM metrics, the noise2noise methods appear to have the potential to provide a clear performance gain over conventional ones. Indeed, although the performance of SAR-CNN is only similar to that of the advanced NLM methods, SSEUNet is about 1 dB and 0.0536 higher than the best conventional method (SAR-BM3D), while processing a 256 × 256 image approximately 157 times faster. Comparing the results of SAR-CNN and SAR-DRN, the more layers the network has, the better the despeckling performance; moreover, the network with dilated convolution layers (SAR-DRN) improves the details of the despeckled image. In addition, compared with the second-best results (SSUNet), the proposed SSEUNet achieves improvements of 0.5 dB and 0.0323, respectively. From the results of SSEUNet and SSUNet, we can conclude that our improvements to U-Net are effective and that the increase in parameters does not significantly reduce the inference speed.

Visual evaluation is another way to qualitatively assess the despeckling performance of the different methods. To visualize the results of the proposed SSEUNet and the other methods, Figure 8 shows the despeckled visual results of a simulated SAR image, where *Ours* is the result of SSEUNet. It can be seen from the visual results that all despeckling methods can remove speckle noise to a certain extent, but the best despeckling results are achieved by the noise2noise-based methods. The results of the local window methods are too smooth, which blurs the images, so the despeckled images lose boundary information. Compared with the local window methods, the NLM methods are significantly better; however, the result of PPB shows a ringing effect at the boundaries, which distorts them. Among the noise2noise methods, the deeper the network, the better the despeckling effect and the clearer the image. The proposed SSEUNet (*Ours*) shows the best ability to reduce speckle noise and retain texture structure.

#### 4.4.2. Experiments on Real-World SAR Images

In order to further illustrate the practicability of the proposed method, the real-world SAR images introduced in detail in Section 4.2 are used for the experiments. Table 3 lists the quantitative ENL evaluation results over the selected regions. These regions are shown in Figure 7, where R1–R4 are homogeneous regions.


**Figure 8.** Visual results of simulated images.

**Table 3.** ENL results on real-world SAR images.


Clearly, as listed by the ENL values in Table 3, the noise2noise methods show better despeckling performance in homogeneous regions. Among the local window methods, the Lee and Frost algorithms show similar despeckling performance. Among the NLM methods, the PPB algorithm has the worst despeckling ability, while SAR-BM3D and MuLoG perform similarly. Although SAR-DRN and SAR-CNN show only subtle advantages over the NLM methods, the proposed SSEUNet improves on the MuLoG method by 20.31, 6.46, 24.95 and 509.02 in the four regions, respectively. Since the SAR-DRN and SAR-CNN methods use training image pairs generated from natural images, they cannot learn the relationship between speckle noise and noise-free SAR images. Compared with SAR-CNN and SAR-DRN, the despeckling performance of SSUNet is significantly improved; its results are obtained by using the GTP module to generate noisy-noisy image pairs. In addition, comparing the results of SSUNet and SSEUNet shows that EUNet has a stronger despeckling ability than U-Net. In general, the proposed SSEUNet can better process real-world SAR images.

Generally, ENL can reflect the effectiveness of the algorithm to some extent, but perfectly homogeneous regions are rare in real-world SAR images. Therefore, a no-reference estimation approach, called the *ENL map*, is used to demonstrate the effectiveness of the proposed SSEUNet. The *ENL map* is obtained by calculating local ENLs with a sliding window (set to 3 × 3) until the whole SAR image is covered [60]. Figures 9 and 10 show the ENL maps of R9 and R10, which are marked in Figure 7. We only show the ENL maps of the NLM and noise2noise despeckling methods, because the results of the local window methods are too smooth, resulting in a very uniform ENL value over the despeckled images. The ENL value should change little, or even not at all, in heterogeneous regions, but should improve greatly in homogeneous regions. This is confirmed in Figures 9 and 10. The ENL maps also reveal the degree of detail loss: PPB and MuLoG lose the most details. Compared with SAR-BM3D, the noise2noise methods not only better preserve image details, but also better remove speckle noise. Among the noise2noise methods, the proposed SSEUNet has the best despeckling and detail preservation ability.
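A minimal version of the ENL-map computation might look as follows. This is an illustrative sketch: perfectly flat windows are mapped to infinity, and no padding is applied, so the map is slightly smaller than the input.

```python
import numpy as np

def enl_map(img, win=3):
    """Local ENL (mean^2 / variance) in a sliding win-by-win window [60]."""
    img = img.astype(np.float64)
    H, W = img.shape
    out = np.zeros((H - win + 1, W - win + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            w = img[i:i + win, j:j + win]
            v = w.var()
            out[i, j] = w.mean() ** 2 / v if v > 0 else np.inf
    return out
```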

Table 4 displays the numerical results of the ER metric on the R5–R8 regions, where V and H are the vertical and horizontal results. It can be seen that the proposed SSEUNet has obvious advantages in preserving horizontal and vertical structures. Among the noise2noise methods, the proposed GTP module and EUNet can better process real-world SAR images. The GTP module can use real-world SAR images to construct noisy-noisy image pairs to train deep CNNs, so that the network can better learn the relationship between speckle noise and noise-free SAR images.


**Table 4.** ER metric on real-world SAR images.

**Figure 9.** ENL maps of different despeckling methods on R9.

**Figure 10.** ENL maps of different despeckling methods on R10.

It is worth carefully studying the despeckled SAR images of the different methods. Figures 11 and 12 show the details of homogeneous and heterogeneous regions in the real-world SAR images, respectively. It can be seen from the results that the local window methods over-smooth and lose contrast: the structure of the despeckled images becomes smooth, and the structural contrast between edges and non-edges becomes blurred. In addition, many strong points and linear structures are lost by the NLM despeckling methods, and the result of PPB also shows a ringing effect at the boundaries. Compared with the other traditional methods, SAR-BM3D performs best and provides an acceptable balance between smoothing and detail preservation. Among the SAR-DRN, SAR-CNN and SSEUNet methods, the result of SSEUNet is the best. Figure 13 shows the visual despeckling results of different scenes, from which it can be seen that the proposed SSEUNet can deal with real-world SAR images in different scenes.

**Figure 11.** Visual results of homogeneous regions on real-world SAR images.


**Figure 12.** Visual results of heterogeneous regions on real-world SAR images.

**Figure 13.** The despeckled results of the SSEUNet in real-world SAR images. (**a**,**c**) Real-world SAR images. (**b**,**d**) Despeckled results.

To gain insight into what the SSAMs learn on real-world SAR images, we pick 8 sample images and inspect the output features of the SSAMs. As can be seen from Figure 4, the input features of SSAM1 are the output features of ISC1 and TConv1, the inputs of SSAM2 are the outputs of ISC2 and TConv2, the outputs of ISC3 and TConv3 are fed into SSAM3, and the input features of SSAM4 are the outputs of ISC4 and TConv4. The output features of SSAM1-SSAM4 are extracted from the *projection* layer in each module; the structures of SSAM1-SSAM4 are exactly the same, and the detailed structure of SSAM is shown in Figure 5. The output feature sizes of SSAM1-SSAM4 are 256 × 32 × 32, 128 × 64 × 64, 64 × 128 × 128 and 32 × 256 × 256, respectively. For visualization, we average the output features over all channels, so each visualization feature is the channel-wise mean; the sizes of the visualization features of SSAM1-SSAM4 are 32 × 32, 64 × 64, 128 × 128 and 256 × 256, respectively. Figure 14 shows the visualization features of SSAM1-SSAM4. It can be seen that the details of weak texture and structure are gradually restored from the noisy SAR images.

**Figure 14.** Visualization features of SSAM1-SSAM4. (**a**) Original SAR images. (**b**–**e**) The visualization features of SSAM1-SSAM4.

#### **5. Conclusions**

In this paper, we propose a novel self-supervised despeckling algorithm with an enhanced U-Net (SSEUNet). The proposed SSEUNet is composed of a generation training pairs (GTP) module, an enhanced U-Net (EUNet) and a self-supervised training loss function with a regularization loss. The proposed SSEUNet has the following advantages. Firstly, unlike previous self-supervised despeckling works, the noisy-noisy image pairs are generated from real-world SAR images through a novel generation training pairs module, which makes it possible to train deep convolutional neural networks using real-world SAR images. The GTP module eliminates the influence of natural images, time-series images and GAN performance. Secondly, the EUNet is designed to improve the feature extraction and fusion capabilities of U-Net. Compared with U-Net, we introduce BN-RRDC blocks, convolutional layers with *s* = 2, improved skip connections and SSAM. Although EUNet has more parameters and a more complex structure, the training difficulty does not increase because of the residual structures in the BN-RRDC block. Thirdly, a self-supervised training loss function is designed to address the difference of target pixel values between neighbors on the original noisy image. The loss function includes a reconstruction loss (MSE) and a regularization loss. Finally, visual and quantitative experiments on simulated and real-world SAR images show that the proposed SSEUNet notably reduces speckle noise while better preserving features, exceeding several state-of-the-art despeckling methods.

However, the inference speed of the proposed SSEUNet needs to be improved. SSEUNet uses complex feature extraction blocks and sub-space attention modules, which leads to a longer time for processing an image. At the same time, the despeckling results of different data augmentation methods also need to be verified on SAR images. In the future, we plan to explore two directions. Firstly, we will explore a lightweight network to replace EUNet to improve the inference speed. Secondly, the despeckling effects of different data augmentation methods will be verified.

**Author Contributions:** Conceptualization, G.Z.; Data curation, G.Z. and S.L.; Formal analysis, G.Z.; Investigation, G.Z. and X.L.; Methodology, G.Z. and X.L.; Resources, Z.L.; Software, G.Z.; Supervision, Z.L.; Validation, G.Z. and X.L.; Visualization, G.Z.; Writing—original draft, G.Z. and X.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the National Science Foundation for Distinguished Young Scholars, Grant Number 61906213.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The UC Merced land-use dataset is available at http://vision.ucmerced.edu/datasets, accessed on 27 October 2021. The ICEYE SAR dataset is available at https://www.iceye.com/sar-data, accessed on 27 October 2021.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **A Novel Guided Anchor Siamese Network for Arbitrary Target-of-Interest Tracking in Video-SAR**

**Jinyu Bao, Xiaoling Zhang \*, Tianwen Zhang, Jun Shi and Shunjun Wei**

School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China; 201811011909@std.uestc.edu.cn (J.B.); twzhang@std.uestc.edu.cn (T.Z.); shijun@uestc.edu.cn (J.S.); weishunjun@uestc.edu.cn (S.W.)

**\*** Correspondence: xlzhang@uestc.edu.cn

**Abstract:** Video synthetic aperture radar (Video-SAR) allows continuous and intuitive observation and is widely used for radar moving target tracking. The shadow of a moving target has the characteristics of stable scattering and no location shift, making moving target tracking using shadows a hot topic. However, the existing techniques mainly rely on the appearance of targets, which is impractical and costly, especially for tracking targets of interest (TOIs) with high diversity and arbitrariness. Therefore, to solve this problem, we propose a novel guided anchor Siamese network (GASN) dedicated to arbitrary TOI tracking in Video-SAR. First, GASN searches the subsequent frames for areas matching the initial area of the TOI in the first frame and returns the most similar area using a matching function, which is learned through general training without TOI-related data. With the learned matching function, GASN can be used to track arbitrary TOIs. Moreover, we also constructed a guided anchor subnetwork, referred to as GA-SubNet, which employs the prior information of the first frame and generates sparse anchors of the same shape as the TOIs. The number of unnecessary anchors is therefore reduced to suppress false alarms. Our method was evaluated on simulated and real Video-SAR data. The experimental results demonstrated that GASN outperforms state-of-the-art methods, including two types of traditional tracking methods (MOSSE and KCF) and two types of modern deep learning techniques (Siamese-FC and Siamese-RPN). We also conducted an ablation experiment to demonstrate the effectiveness of GA-SubNet.

**Keywords:** video synthetic aperture radar (Video-SAR); moving target tracking; guided anchor Siamese network (GASN)

#### **1. Introduction**

Video synthetic aperture radar (Video-SAR) provides high-resolution SAR images at a faster frame rate, which is conducive to the continuous and intuitive observation of ground moving targets. Due to this advantage, Video-SAR enables important applications in SAR moving target tracking [1]. Since the Sandia National Laboratory (SNL) of the United States first obtained high-resolution SAR images in 2003 [2], many scholars have investigated the problem of moving target tracking in Video-SAR [3–7]. However, due to different angles of illumination, the scattering characteristics of moving targets change with the movement of the platform. Worse still, it is difficult to track a moving target directly because the imaging results of the moving target usually shift from their true position.

Fortunately, a shadow is caused by the ground being blocked by the moving target. Due to the absence of energy reflection, shadows appear at the real position of the moving target in the SAR image, with the advantage of a constant grayscale [8]. Therefore, shadow-aided moving target tracking has become a hot topic in Video-SAR. In recent years, many scholars have worked on shadow-aided moving target tracking in Video-SAR [9–11]. Wang et al. [9] fully considered the constant grayscale of shadows and used data multiplexing to achieve moving target tracking. Zhao et al. [10] applied the saliency-based detection mechanism and used spatial–temporal information to achieve moving target

**Citation:** Bao, J.; Zhang, X.; Zhang, T.; Shi, J.; Wei, S. A Novel Guided Anchor Siamese Network for Arbitrary Target-of-Interest Tracking in Video-SAR. *Remote Sens.* **2021**, *13*, 4504. https://doi.org/10.3390/ rs13224504

Academic Editor: Fabio Rocca

Received: 27 August 2021 Accepted: 1 November 2021 Published: 9 November 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

tracking in Video-SAR. Tian et al. [11] utilized a dynamic programming-based particle filter to achieve track-before-detect in Video-SAR. However, the features used by these traditional methods are usually simple, which makes the shadow difficult to distinguish from similar backgrounds. Deep learning methods then emerged to solve shadow tracking due to their high accuracy and fast speed [12–16]. Ding et al. [12] presented a framework for shadow-aided moving target detection using deep neural networks, which applied a faster region-based convolutional neural network (Faster-RCNN) [13] to detect shadows in a single frame and used a bi-directional long short-term memory (Bi-LSTM) [14] network to track the shadows. Zhou et al. [15] proposed a framework combining a modified real-time recurrent regression network and a newly designed trajectory-smoothing long short-term memory network to track shadows. Wen et al. [16] proposed a moving target tracking method based on the dual Faster-RCNN, which combined the shadow detection results in SAR images and the range-Doppler (RD) spectrum to suppress false alarms for moving target tracking in Video-SAR.

However, arbitrary target-of-interest (TOI) tracking is a challenge for the above methods. In this paper, we define the TOI as a specific target in a video that one wants to track; in Video-SAR, the TOI refers to the shadow to be tracked. The reasons why arbitrary TOI tracking is a challenge are as follows: First, these methods are all based on appearance features, such as shape and texture. They require a large number of labeled training samples to extract appearance features, and the training samples must include the TOI. However, when tracking an arbitrary TOI, it is impractical to collect samples of all categories for training because of the targets' diversity and arbitrariness. Moreover, it takes extensive labor and material resources to label a large number of SAR images. Therefore, these methods are both impractical and costly when tracking an arbitrary TOI in Video-SAR.

Thus, we propose a novel guided anchor Siamese network (GASN) for arbitrary TOI tracking in Video-SAR. First, the key of GASN lies in the idea of similarity learning, which learns a matching function to estimate the degree of similarity between two images. After training using a large number of paired images, the learned matching function in GASN, given an unseen pair of inputs (TOI in the first frame as the template, and the subsequent frame as the search image), is used to locate the area that best matches the template. As GASN only relies on the template information, which is independent of the training data, it is suitable for tracking arbitrary TOIs in Video-SAR. Additionally, a guided anchor subnetwork (GA-SubNet) in GASN is proposed to suppress false alarms and to improve the tracking accuracy. GA-SubNet uses the location information of the template to obtain the location probability in the search image, and then it selects the location with a probability greater than the threshold to generate sparse anchors, which can exclude false alarms. To improve the tracking accuracy, the anchor that more closely matches the shape of the TOI is obtained by GA-SubNet through adaptive prediction processing.

The main contributions of our method are as follows:


To verify the validity of the proposed method, we performed experiments on simulated and real Video-SAR data. The results showed that the tracking accuracy of the proposed network is 60.16% on simulated Video-SAR data, which is 4.55% and 16.49% higher than the two deep learning methods Siamese-RPN [17] and Siamese-FC [18], and 18.36% and 28.95% higher than the two traditional methods MOSSE [19] and KCF [20], respectively. Meanwhile, the tracking accuracy is 54.68% on real Video-SAR data, which is higher than those of the other four methods by 1.93%, 13.08%, 14.70%, and 25.04%, respectively. This demonstrates that our method can achieve accurate arbitrary TOI tracking in Video-SAR.

The rest of this paper is organized as follows: Section 2 introduces the methodology, including the network architecture, preprocessing, and tracking processes. Section 3 introduces the experiments, including the simulated and real data, the implementation details, the loss function, and the evaluation indicators. Section 4 introduces the simulated and real Video-SAR data tracking results. Section 5 discusses the research on pre-training and robustness and the ablation experiment. Section 6 provides the conclusion.

#### **2. Methodology**

#### *2.1. Network Architecture*

Figure 1 shows the architecture of GASN for arbitrary TOI tracking in Video-SAR, including the Siamese subnetwork, GA-SubNet, and the similarity learning subnetwork. GASN is based on the idea of similarity learning, which compares a template image *z* to a search image *x* and returns a high score if the two images depict the same target.

**Figure 1.** The architecture of GASN.

To prepare for similarity learning, the Siamese subnetwork consists of a template branch and a search branch. The two branches apply an identical transformation *ϕ* to each input, and the transformation *ϕ* can be considered as feature embedding. Then, similarity learning can be expressed as *f* (*z*, *x*) = *g* (*ϕ*(*z*), *ϕ*(*x*)), where the function *g* is a similarity metric. To suppress false alarms, GA-SubNet receives the prior information from the template to pre-determine the general location and shape of the TOI in the search image using anchors. When tracking an arbitrary TOI that is different from the training samples, we can use the ability of similarity learning to find the TOI in the next frame by providing the template information of said TOI, such as its position and shape. The similarity learning subnetwork is divided into two branches, one for the classification of the shadow and background, and the other for the regression of the shadow's location and shape. In both branches, the similarity between the shadow template and the search area is calculated, and then the target with the maximum similarity to the template of the TOI is chosen as the tracking result.
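The similarity metric *g* can be realized as a channel-summed cross-correlation of the two embedded feature maps: the template features slide over the search features, and the peak of the resulting score map marks the best match. The following NumPy sketch is illustrative only (function and variable names are our own, not the authors' implementation); the 6 × 6 and 22 × 22 feature-map sizes follow Section 2.1.1.

```python
import numpy as np

def cross_correlate(search, template):
    """Valid cross-correlation of a template feature map over a search
    feature map, summed over channels -- a minimal stand-in for the
    similarity metric g(phi(z), phi(x))."""
    c, hs, ws = search.shape
    _, ht, wt = template.shape
    ho, wo = hs - ht + 1, ws - wt + 1
    score = np.zeros((ho, wo))
    for i in range(ho):
        for j in range(wo):
            score[i, j] = np.sum(search[:, i:i + ht, j:j + wt] * template)
    return score

rng = np.random.default_rng(0)
phi_z = rng.standard_normal((256, 6, 6))    # template features (6 x 6 x 256)
phi_x = rng.standard_normal((256, 22, 22))  # search features (22 x 22 x 256)
score = cross_correlate(phi_x, phi_z)
print(score.shape)  # (17, 17); the peak location marks the best-matching area
```

In practice this operation is a grouped convolution on the GPU; the loop version above only makes the arithmetic explicit.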

GASN always uses the previous frame as the template image and the current frame as the search image. After testing the whole SAR image sequence in such a way, GASN can achieve arbitrary TOI tracking in Video-SAR. In the following, we introduce the three subnetworks of GASN in detail in the order of implementation.

#### 2.1.1. Siamese Subnetwork

The Siamese subnetwork (marked as the green region in Figure 1) [21,22] uses CNN for feature embedding. CNN uses different convolutional kernels for multi-level feature embedding of the image. Therefore, compared to the traditional manual features, the features embedded by the Siamese subnetwork are more representative and can describe the TOI better. To obtain the common features of the previous and current frames, the Siamese subnetwork is divided into a template branch (marked with a purple box) and a search branch (marked with a pink box), and the parameters of CNN-1 and CNN-2 in both branches are shared to ensure the consistency of features. The input of the template branch is the TOI area in the previous frame (denoted as *z*), and the input of the search branch is the search area in the current frame (denoted as *x*). See Section 2.2 for details about the preprocessing of the input images. For convenience, we denote the output feature maps of the template and search branches as *ϕ*(*z*) and *ϕ*(*x*).

#### 2.1.2. GA-SubNet

After obtaining the feature maps, we established a GA-SubNet to suppress false alarms and improve the tracking accuracy. The specific architecture of GA-SubNet is shown in Figure 2, including anchor location prediction, anchor shape prediction, and feature adaptation. In the following, we introduce the three modules of GA-SubNet in detail in the order of implementation.

**Figure 2.** The architecture of GA-SubNet: (**a**) anchor location prediction module generates the sparse location of anchors; (**b**) anchor shape prediction module generates the anchor shape that better conforms to the shape of the shadow; (**c**) feature adaptation module generates a new feature map for the best anchor shape.

The purple region in Figure 2a is the anchor location prediction, i.e., the prediction of the location of the anchor containing the center point of a shadow. First, the input to GA-SubNet is two feature maps, one for the template (marked with a blue cube) and the other for the search area (marked with a purple cube). To obtain the prior information of the template that is independent of the training data, the feature map of the template is used as the kernel to convolute the feature map of search area *F*1, so that the score of each location of the output represents the probability that the corresponding location is predicted to be the shadow. Then, the sigmoid function is used to obtain the probability map as shown in the blue box in Figure 2a. After this, the position whose probability exceeds the preset threshold is chosen to be the location of the predicted anchor (marked with a red circle). To learn more information about the shadow, similar to [17], the empirical threshold was chosen as 0.7.

The blue region in Figure 2b is the anchor shape prediction, i.e., the prediction of the anchor shape that better conforms to the shape of a shadow. First, the uniform arbitrary preset anchor shapes are generated (marked with blue boxes) at each location obtained from the anchor location prediction; i.e., several anchor shapes are arbitrarily set at each location, but the anchor shape setting in sparse locations is uniform. The preset anchor shape with the largest IoU with the shadow's ground truth (marked with a green box) is predicted as the leading shape (marked with an orange box). IoU is defined by Equation (1), where *P* denotes the preset anchor shapes, and *G* denotes the shadow's ground truth.

$$\text{IoU} = \frac{\text{area} (P \cap G)}{\text{area} (P \cup G)} \tag{1}$$

The leading shape of the anchor is still set arbitrarily and may differ significantly from the shadow's ground truth. To make the IoU larger, the offset between the leading shape and the shadow's ground truth at each location is calculated. After continuously optimizing the offsets using the loss function (described in Section 3.3), the best anchor shape can be obtained (marked with a white box), which better conforms to the shape of the shadow.
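Equation (1) and the leading-shape selection can be sketched as follows. The (x1, y1, x2, y2) box format and the helper names are illustrative assumptions, not taken from the paper:

```python
def iou(p, g):
    """Equation (1): intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(p[0], g[0]), max(p[1], g[1])
    ix2, iy2 = min(p[2], g[2]), min(p[3], g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (p[2] - p[0]) * (p[3] - p[1])
    area_g = (g[2] - g[0]) * (g[3] - g[1])
    return inter / (area_p + area_g - inter)

def leading_shape(preset_shapes, ground_truth):
    """Pick the preset anchor shape with the largest IoU against the ground truth."""
    return max(preset_shapes, key=lambda p: iou(p, ground_truth))

presets = [(0, 0, 4, 2), (0, 0, 2, 4), (0, 0, 3, 3)]
gt = (0, 0, 4, 3)
print(leading_shape(presets, gt))  # → (0, 0, 3, 3)
```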

The orange region in Figure 2c is the feature adaptation, i.e., the adaptation of the feature map and the SAR image. Because the feature map is obtained by multi-layer convolution of the SAR image, there is a certain correspondence between the feature map and the SAR image; i.e., the leading shape of the anchor in the SAR image corresponds to a specific region in the feature map. However, the leading shape of the anchor at each location is optimized adaptively in the anchor shape prediction, resulting in areas with the same shape in the feature map, corresponding to the areas with different shapes in the SAR image. Therefore, feature adaptation is necessary to satisfy the correspondence between the feature map and the SAR image to ensure the accuracy of tracking. First, 1 × 1 convolution is used to calculate the offset between the leading shape and the best shape. Then, 3 × 3 deformed convolution is applied [23,24] based on this offset to the original feature map *F1* of the search area. Finally, the feature map *F2* is obtained for adaptation to the SAR image for the best anchor shape.

#### 2.1.3. Similarity Learning Subnetwork

After obtaining the sparse anchors that better conform to the shadows' shape, the similarity learning subnetwork (marked with a yellow region in Figure 1) is used for classification and regression. The similarity learning subnetwork consists of a classification branch (marked with an orange box in Figure 1) for distinguishing the shadow from the background and a regression branch (marked with a blue box in Figure 1) for predicting the location and shape of the shadow. First, in both branches, to reduce the calculation complexity of subsequent similarity learning, the 6 × 6 feature map *ϕ*(*z*) is reduced to 4 × 4, and the 22 × 22 feature map *ϕ*(*x*) is reduced to 20 × 20, using convolutions (marked with yellow cubes in Figure 1). In addition, the channel of *ϕ*(*z*) is adjusted to 2*k* × 256 for foreground and background classification in the classification branch and to 4*k* × 256 for determining the location and shape of the shadow in the regression branch. Here, *k* is the number of anchors, 2*k* represents the probability of the foreground and background for each anchor, and 4*k* represents the location (*x*, *y*) and shape (*w*, *h*) of the shadow.

$$\begin{array}{l} A^{\text{cls}}\_{w \times h \times 2k} = [\phi(x)]\_{\text{cls}} \otimes [\phi(z)]\_{\text{cls}}\\ A^{\text{reg}}\_{w \times h \times 4k} = [\phi(x)]\_{\text{reg}} \otimes [\phi(z)]\_{\text{reg}} \end{array} \tag{2}$$

As shown in Equation (2), the similarity learning subnetwork applies pairwise correlations (marked with red rectangles in Figure 1) to calculate the similarity metric, in which the similarity map $A^{\text{cls}}_{w \times h \times 2k}$ is for classification and $A^{\text{reg}}_{w \times h \times 4k}$ is for regression. [·]cls and [·]reg represent the classification and regression, respectively, and ⊗ denotes the convolution operation. We show the feature composition of $A^{\text{cls}}_{w \times h \times 2k}$ and $A^{\text{reg}}_{w \times h \times 4k}$ in Figure 3. $A^{\text{cls}}_{w \times h \times 2k}$ is divided into *k* groups, and each group contains two feature maps, which indicate the foreground and background probabilities of the corresponding anchors. The anchor is the foreground if the probability of the foreground is higher; otherwise, it is the background. Similarly, $A^{\text{reg}}_{w \times h \times 4k}$ is divided into *k* groups, and each group contains four feature maps (*x*, *y*, *w*, and *h*), which indicate the similarity metric between the corresponding anchor and the template. According to the highest similarity, the optimal location and shape of the shadow are obtained.

**Figure 3.** The feature composition of $A^{\text{cls}}_{w \times h \times 2k}$ and $A^{\text{reg}}_{w \times h \times 4k}$.

#### *2.2. Preprocessing*

For all images of Video-SAR to have the same feature dimensions, preprocessing is required before entering GASN. As shown in Figure 4, the input of GASN is a pair of adjacent images in the SAR image sequence. The shadow template is a 127 × 127 area centered on the center (*x*, *y*) of the shadow in frame *t*-1. Similar to the image preprocessing in [17], we cropped a ((*w* + *h*) × 0.5 + *w*, (*w* + *h*) × 0.5 + *h*) area in frame *t*-1 centered on (*x*, *y*) and then resized it to 127 × 127, where (*w*, *h*) is the boundary of the shadow. Here, (*x*, *y*, *w*, *h*) are known in the training stage, while in the testing stage, they represent the prediction results of the previous frame. Because the template size of all existing methods is 127 × 127 [17,18], to ensure the rationality of the comparison, we chose 127 × 127 as the template size. The search area is centered on the center of the shadow in frame *t*: we cropped a (((*w* + *h*) × 0.5 + *w*) × 255/127, ((*w* + *h*) × 0.5 + *h*) × 255/127) area and then resized it to 255 × 255. This area is larger than the shadow's template to ensure that the shadow is always included in the search area.
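Under the cropping rule above, the crop sizes before resizing can be computed as in this short sketch (function names are illustrative assumptions):

```python
def template_crop_size(w, h):
    """Crop around the shadow center in frame t-1; later resized to 127 x 127."""
    pad = (w + h) * 0.5
    return (pad + w, pad + h)

def search_crop_size(w, h):
    """Larger crop around the shadow center in frame t; later resized to 255 x 255."""
    scale = 255 / 127
    tw, th = template_crop_size(w, h)
    return (tw * scale, th * scale)

print(template_crop_size(20, 10))  # → (35.0, 25.0)
```

Scaling the template crop by 255/127 keeps the template and search crops geometrically consistent, so the shadow stays inside the search area.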

**Figure 4.** The input preprocessing of GASN.

#### *2.3. Tracking Process*

The whole process of TOI tracking based on GASN is shown in Figure 5. The details are as follows.

**Figure 5.** The whole process of arbitrary TOI tracking based on GASN.

**Step 1:** Input Video-SAR image sequence.

As shown in Figure 6a, *N* is the number of frames of the input video. For easy observation, we marked the shadow to be tracked with a green box.

**Figure 6.** Image preprocessing and feature embedding. Input SAR Video (**a**), preprocess image (**b**), embed features (**c**).

**Step 2:** Preprocessing SAR images.

For all images of Video-SAR to have the same feature dimensions, we need to crop and resize them. As described in Section 2.2, the shadow in frame *t*-1 is resized to 127 × 127 as the template, and frame *t* is resized to 255 × 255 as the search area, as shown in Figure 6b. *x*, *y*, *w*, and *h* represent the center and boundary of the prediction results in the previous frame. Unlike the three-channel RGB optical images, the SAR images are gray; therefore, all three channels are assigned the same gray value to use the pre-trained weights. Applying models trained on three-channel RGB images to one-channel radar images has been carried out in several published studies [10,12,15], and the results in Section 5.3 show that it is reasonable to do so.

**Step 3:** Embed features by the Siamese subnetwork.

After obtaining the template and search areas, the Siamese subnetwork embeds features to better describe the TOI. The Siamese subnetwork is divided into a template branch and a search branch, and the parameters of CNN-1 and CNN-2 in the two branches are shared to ensure the consistency of the features. The template branch outputs 6 × 6 × 256 as the feature map of the template, and the search branch outputs 22 × 22 × 256 as the feature map of the search area, which are shown in Figure 6c.

**Step 4:** Predict anchor location.

After obtaining the feature maps of the template and the search area, the predict anchor location module pre-determines the general location of the TOI in the search area to suppress false alarms. To only locate the anchors containing the center point of the shadow, the feature map of the template is used to convolute the feature map of the search area to obtain the prior information of the template, so that the score of each location of the output feature map represents the probability that the corresponding location is predicted to be the shadow. Then, the locations whose probability exceeds the preset threshold are used as the locations of the sparse anchors. As shown in Figure 7, the blue regions correspond to the locations of the anchors.
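The thresholding in this step can be sketched as follows (NumPy, illustrative; the 0.7 threshold follows Section 2.1.2, and the correlation scores would come from convolving the template features over the search features):

```python
import numpy as np

def predict_anchor_locations(correlation_scores, threshold=0.7):
    """Apply a sigmoid to the correlation map and keep the locations whose
    shadow probability exceeds the threshold -- these become the sparse anchors."""
    prob = 1.0 / (1.0 + np.exp(-correlation_scores))  # sigmoid
    rows, cols = np.where(prob > threshold)
    return list(zip(rows.tolist(), cols.tolist()))

scores = np.array([[2.0, -1.0],
                   [0.0,  3.0]])
print(predict_anchor_locations(scores))  # → [(0, 0), (1, 1)]
```

Only the two high-scoring cells survive; anchors are then generated at these sparse locations rather than densely over the whole map.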

**Step 5:** Predict anchor shape.

To generate the anchor that conforms to the shadow's shape, the anchor shape prediction module generates an anchor shape with the highest coverage of the real shadow's shape by adaptive prediction processing in the sparse locations. First, after anchor generation, the preset anchor shapes (marked with blue boxes in Figure 8) of the anchor are obtained. Among them, the shape with the largest IoU with the shadow's ground truth (marked with a green box) is predicted as the leading shape (marked with an orange box). After this, the leading shape of the anchor is regressed to obtain the best anchor shape (marked with a white box) that better conforms to the shadow's shape.

**Figure 8.** Predict anchor shape.

**Step 6:** Adapt the feature map guided by anchors.

After the anchor shape prediction, the anchor shape changes, and the feature map needs to be adapted to guarantee the correct corresponding relationship between the feature map and the SAR images. As described in Section 2.1.2, the adapted feature map can be generated by compensating the offset obtained from 1 × 1 convolution using the 3 × 3 deformable convolution. Based on the adapted feature map shown in Figure 9 (marked with a dark purple), the higher quality anchors can be used for shadow tracking.

**Figure 9.** Adapting the feature map guided by anchors.

**Step 7:** Compare the similarity of the feature maps.

To compare the similarity of the feature map of the search area and the template, the similarity learning subnetwork applies the correlation operation as shown in Figure 10a. The blue cube represents the feature map of the template, and the purple cube represents the feature map of the search area. The feature map of the template changes its channel by the convolution according to the number of anchors *k*. The correlation can be achieved using the feature map of the template to convolute the feature map of the search area; then, $A^{\text{cls}}_{w \times h \times 2k}$ and $A^{\text{reg}}_{w \times h \times 4k}$ are output, where 2*k* represents the probability of the foreground and background for each anchor, and 4*k* represents the location (*x*, *y*) and shape (*w*, *h*) of the shadow.

**Figure 10.** The tracking results obtained after comparing the similarity; comparison of the similarity of the feature maps (**a**), classification and regression (**b**), and tracking results (**c**).

**Step 8:** Classification and regression.

The similarity learning subnetwork is divided into classification and regression branches. In the classification branch, the similarity learning probability map of the foreground and background is obtained, and then the foreground anchor with the highest similarity learning metric is the tracking shadow. The regression branch further regresses the best anchor shape (marked with a white box) to achieve a more accurate shadow shape (marked with a red box) in Figure 10b. Using the trained GASN, the shadow tracking in the Video-SAR image sequence can be achieved only using the shadow's location and shape in the first frame.

**Step 9:** Tracking results.

As shown in Figure 10c, after searching the whole Video-SAR image sequence, the shadow, i.e., the TOI tracking of Video-SAR, is realized. Because the shadow's location in the first frame is known, only the tracking results of the subsequent frames are shown here, where the green box represents the real location of the shadow, and the red box represents the tracking results.

To make the tracking process easier to follow, it is summarized in Algorithm 1 below.


#### **3. Experiments**

All of the experiments were implemented on a personal computer with an Intel Core i7-8700K CPU @ 3.40 GHz and an NVIDIA GTX 1080 graphics card with 8 GB of memory. The software environment was Ubuntu 16.04 Linux with Python 3.7 and PyTorch 3.0.

#### *3.1. Experimental Data*

As recognized real Video-SAR data with high resolution, the SNL data [1] have been used by many scholars for moving target detection and tracking [7–10]. In our experiments, we used both simulated and real data to verify the effectiveness of GASN for arbitrary TOI tracking in Video-SAR. We produced the simulated Video-SAR data from the echo to approximate real SAR images, and the details of the data are described below.

In the simulated Video-SAR data, two real SAR backgrounds containing roads and six moving targets were simulated for generality. The radar system parameters and the velocities of the moving targets are listed in Tables 1 and 2. Regarding the simulation of the shadow, the scattering coefficient was set to zero because a shadow produces no reflection. In the experiment, 17 videos were simulated, of which 11 were used for training and 6 for testing. Each video contained 61 frames, and one of the test video sequences is shown in Figure 11. The size of all images was 600 × 600.

The real Video-SAR SNL data contained 50 different moving targets in all 899 frames. When GASN was used for arbitrary TOI tracking, 751 frames with the former 35 targets were set for training, and 148 frames with the latter 15 targets were set for testing. The size of all images was 600 × 600. Compared to the simulated data, there was more noise and clutter in the real Video-SAR data, and the tracking results with clutter are shown in Section 4.2.2.


**Table 1.** The system parameters of simulated Video-SAR.

| Parameter | Value |
|-----------|-------|
| SNR | 40 dB |

**Table 2.** The velocity of the moving targets in the simulated Video-SAR data.

**Figure 11.** Image sequence of a test video: (**a**) third frame in Video 1; (**b**) 23rd frame in Video 1; (**c**) 43rd frame in Video 1; (**d**) 60th frame in Video 1.

#### *3.2. Implementation Details*

To avoid over-fitting, the pre-trained weight of ResNet50 [25] was applied, which was successfully trained from the widely used ImageNet large-scale visual recognition challenge (ILSVRC) data set [26]. Unlike the three-channel RGB for optical images, the SAR images were all gray; therefore, we assigned all three channels to the same gray value to use the pre-trained weights. Due to the limited memory, only conv4 and the upper layers of the pre-trained network weights were fine-tuned for adaptation to the TOI tracking task in Video-SAR. During the training stage, the batch size was four, and the stochastic gradient descent (SGD) [27] was applied, in which the momentum was 0.9, the weight decay was 0.0005, and the learning rate was 0.0001.

Data augmentation techniques were used in our implementation, including translation, scale transformations, blur, and flip. After data augmentation, the amount of data expanded by approximately 10 times, which can better fine-tune the model.

#### *3.3. Loss Function*

As the shadow occupies a small proportion of the SAR image, we used focal loss [28] as the anchor location loss *lossloc* to predict the anchor location:

$$\text{loss}\_{\text{loc}} = -(1-p)^{\gamma} \log(p) \tag{3}$$

where *p* is the probability of the shadow at the location, and *γ* = 2 is the hyper-parameter adjusting the drop speed, following [29].

The anchor shape loss *lossshape* uses a smooth *L*1 loss, inspired by [12]:

$$loss\_{\text{shape}} = smoothL\_1\left(1 - \min\left(\frac{w}{w\_{g}}, \frac{w\_{g}}{w}\right)\right) + smoothL\_1\left(1 - \min\left(\frac{h}{h\_{g}}, \frac{h\_{g}}{h}\right)\right) \tag{4}$$

$$smoothL\_1(x) = \begin{cases} 0.5x^2 & |x| < 1\\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{5}$$

where (*wg*, *hg*) is the ground truth of the shadow, and (*w*, *h*) is the shape of the anchor.

As per Siamese-RPN [17], classification loss *losscls* and regression loss *lossreg* are as follows:

$$\text{loss}\_{\text{cls}} = -\log[p\_i^\* p\_i + (1 - p\_i^\*)(1 - p\_i)]\tag{6}$$

$$loss\_{\text{reg}} = smoothL\_1(t\_i - t\_i^\*)\tag{7}$$

where *pi* represents the probability of shadow, *<sup>t</sup>* represents the ground truth of the center point (*x*, *y*) and shape (*w*, *h*) of the shadow, and \* represents the prediction result.

The total loss function is shown below, where *λ*<sup>1</sup> = *λ*<sup>2</sup> = 5 and *λ*<sup>3</sup> = *λ*<sup>4</sup> = 2 are the hyper-parameters balancing the four parts.

$$\text{loss} = \lambda\_1 \text{loss}\_{\text{loc}} + \lambda\_2 \text{loss}\_{\text{shape}} + \lambda\_3 \text{loss}\_{\text{cls}} + \lambda\_4 \text{loss}\_{\text{reg}} \tag{8}$$

By minimizing the loss functions, GASN finally achieves parameter optimization after the iterations.
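The loss terms above can be sketched in NumPy as a simplified scalar version, using the paper's hyper-parameters (*γ* = 2, *λ*1 = *λ*2 = 5, *λ*3 = *λ*4 = 2). This is illustrative only, not the authors' training code:

```python
import numpy as np

def focal_loss(p, gamma=2.0):
    """Equation (3): anchor location loss for shadow probability p."""
    return -((1.0 - p) ** gamma) * np.log(p)

def smooth_l1(x):
    """Equation (5): smooth L1."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def shape_loss(w, h, w_g, h_g):
    """Equation (4): anchor shape loss against the ground truth (w_g, h_g)."""
    return float(smooth_l1(1.0 - min(w / w_g, w_g / w))
                 + smooth_l1(1.0 - min(h / h_g, h_g / h)))

def total_loss(l_loc, l_shape, l_cls, l_reg, lams=(5.0, 5.0, 2.0, 2.0)):
    """Equation (8): weighted sum of the four loss terms."""
    return (lams[0] * l_loc + lams[1] * l_shape
            + lams[2] * l_cls + lams[3] * l_reg)
```

For example, a perfectly predicted shape gives `shape_loss(w, h, w, h) == 0`, and a confident correct location gives `focal_loss(1.0) == 0`.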

#### *3.4. Evaluation Indicators*

To verify the performance of GASN, three general evaluation indicators were used in this paper.

#### 3.4.1. Tracking Accuracy

The expected average overlap (EAO) can represent the tracking accuracy [30], and the greater the EAO, the more accurate the tracking result. EAO is defined as follows:

$$\text{EAO} = \frac{\sum\_{j=1}^{N\_s} \text{mIoU}(j)}{N\_s}, \quad \text{mIoU} = \frac{\sum\_{i=1}^{N} \text{IoU}(P\_i, G)}{N} \tag{9}$$

where IoU is as defined in Equation (1), *P* is the tracking result, *G* is the shadow's ground truth, *N* is the number of images in the Video-SAR sequence, and *Ns* is the number of videos in the test data. We calculated mIoU including frames with IoU = 0; therefore, EAO truly reflects the tracking accuracy.
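Equation (9) amounts to averaging per-frame IoU within each video and then averaging across the test videos, as in this sketch (names are illustrative):

```python
def m_iou(frame_ious):
    """Mean IoU over the N frames of one video, keeping IoU = 0 frames."""
    return sum(frame_ious) / len(frame_ious)

def eao(videos_frame_ious):
    """Equation (9): average mIoU over the N_s test videos."""
    return sum(m_iou(v) for v in videos_frame_ious) / len(videos_frame_ious)

print(eao([[1.0, 0.0], [0.5, 0.5]]))  # → 0.5
```

Keeping IoU = 0 frames means lost-track frames directly lower the score, which is why EAO reflects true tracking accuracy.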

#### 3.4.2. Tracking Stability

The central location error (CLE) reflects the stability of the tracking method [15]; i.e., the smaller the CLE, the more stable the tracking method, and the CLE is defined as follows:

$$\text{CLE} = \sqrt{(x\_R - x\_G)^2 + (y\_R - y\_G)^2} \tag{10}$$

where (*xR*, *yR*) represents the central location of the tracking result, and (*xG*, *yG*) represents the central location of the shadow's ground truth.
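
Equation (10) is a plain Euclidean distance between the two centers; a one-line helper suffices (the function name is ours):

```python
import math

def cle(center_result, center_gt):
    """Equation (10): distance between the tracked center (x_R, y_R)
    and the ground-truth center (x_G, y_G)."""
    (xr, yr), (xg, yg) = center_result, center_gt
    return math.hypot(xr - xg, yr - yg)

# A tracked center 3 px right and 4 px below the ground truth gives CLE = 5.
```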

#### 3.4.3. Tracking Speed

The frames per second (FPS) represent the tracking speed, which is defined as follows:

$$FPS = \frac{N}{t} \tag{11}$$

where *t* represents the total tracking time, and *N* is the number of images in the Video-SAR sequence.

#### **4. Results**

#### *4.1. Results of the Simulated Video-SAR Data*

Figure 12 shows the tracking results on the simulated Video-SAR data. In the rest of this paper, the red box represents the tracking results, and the green box represents the ground truths of the shadow. The red and green boxes overlap substantially, which means that GASN can track the target effectively.

**Figure 12.** Tracking results of the simulated Video-SAR data: (**a**) 9th frame in Video 2; (**b**) 14th frame in Video 2; (**c**) 37th frame in Video 2; (**d**) 56th frame in Video 2.

We quantitatively analyzed the tracking results of GASN. Because Siamese-FC and Siamese-RPN significantly outperform MOSSE [19] and KCF [20], only the visual comparison results of GASN with Siamese-RPN and Siamese-FC in terms of accuracy, CLE, and FPS indicators are shown.

#### 4.1.1. Comparison with Other Tracking Methods

Figures 13–15 show the results of comparing GASN to Siamese-RPN and Siamese-FC on the six test videos. In the comparative experiments, we retrained Siamese-FC and Siamese-RPN using the same simulated data, and both networks were tuned. Moreover, to ensure the rationality of the experiments, our comparative experiments were all performed under the same conditions, such as the data preprocessing, the hardware and software platforms, and the training mechanism. From the results, we can see that GASN (marked with purple) obtained the highest mIoU (Figure 13) and the lowest CLE (Figure 14) on each video. Moreover, the FPS (Figure 15) of GASN was almost the same as that of Siamese-RPN (marked with green), which indicates that GASN has almost no speed loss at a higher accuracy. Since the same phenomenon also appears in the real data, we explain the reason in detail in the next section. To reveal the performance of GASN more intuitively, we calculated the average tracking performance over the six test videos; the results are shown in Table 3.

In Table 3, the simple frameworks of the two traditional methods (MOSSE and KCF) have two implications. On the one hand, these methods require little computation (105 FPS for MOSSE and 58 FPS for KCF); on the other hand, the simple framework may lose some information, such as the edges and textures, making it impossible to track shadows that are too wide or too long, so the accuracy is low (31.21% for MOSSE and 41.80% for KCF). Among the deep learning methods, the anchors generated by GA-SubNet better conform to the shape of the shadow in SAR images; therefore, the accuracy of GASN (60.16%) is higher than that of Siamese-RPN (55.61%) and Siamese-FC (43.67%). As for the tracking speed, GASN (32 FPS) is slightly faster than Siamese-RPN (31 FPS), because the anchors generated by GA-SubNet are sparse. In addition, GASN achieved the lowest CLE score (6.68), i.e., the best stability, because GA-SubNet generates anchors based on the probability of the shadow's location. Through the above analysis, we can see that the tracking performance of GASN is better than that of the other methods.

**Figure 13.** The comparison results of GASN with Siamese-RPN and Siamese-FC on accuracy.

**Figure 14.** The comparison results of GASN with Siamese-RPN and Siamese-FC on CLE.

**Figure 15.** The comparison results of GASN with Siamese-RPN and Siamese-FC on FPS.



#### 4.1.2. Tracking Results with Distractors

To verify that the proposed method tracks only the TOI, we selected two adjacent targets with similar shapes for tracking. Figure 16a,b show the tracking results in the same frame. TOI-2 can be considered a distractor when we want to track TOI-1 in Figure 16a; similarly, TOI-1 can be considered a distractor when we want to track TOI-2 in Figure 16b. The green box represents the ground truth of the TOI, and the red box represents the tracking results of the proposed method. The overlap between the red and green boxes in both figures is greater than 50%, so the proposed method can accurately track the TOI without errors. The main reasons are as follows: GASN uses the Siamese subnetwork to extract multi-level and more expressive features than the traditional methods; in addition, compared with the existing deep learning methods, GASN uses GA-SubNet to provide the general location and shape of the TOI based on the template, which effectively suppresses the distractors. Through the above analysis, we conclude that the proposed method can accurately track the TOI even when there are distractors in the scene.

**Figure 16.** Tracking results with distractors: (**a**) TOI-1 in the 51st frame of Video 6; (**b**) TOI-2 in the 51st frame of Video 6.

#### 4.1.3. Tracking Results of the Target with a Specific Speed

To verify the tracking capability of the proposed method for a TOI with a specific speed, we simulated two targets that were identical except for their velocity. Figure 17a,b show the tracking results in the same frame. The azimuth velocity of TOI-1 in Figure 17a is 2 m/s and the radial velocity is −2.5 m/s, while the azimuth velocity of TOI-2 in Figure 17b is 1.5 m/s and the radial velocity is −1.5 m/s. The green box represents the ground truth of the TOI in this tracking process, and the red box represents the tracking result of the proposed method. The overlap between the red and green boxes in both figures is greater than 50%. Therefore, the proposed method can accurately track a TOI with a specific speed.

**Figure 17.** Tracking results of the target with a specific speed: (**a**) TOI-1 in the 20th frame of Video 5; (**b**) TOI-2 in the 20th frame of Video 5.

#### *4.2. Results of Real Video-SAR Data*

Figure 18 shows the tracking results using the real Video-SAR data, aiming to verify the effectiveness of GASN on real data. It can be seen that the tracking results (marked with a red box) and the ground truths of the shadow (marked with a green box) overlap substantially (the IoU is greater than 50%), which means that GASN can track the real shadow effectively.

**Figure 18.** Tracking results of the real Video-SAR data: (**a**) third frame in Video 2; (**b**) 35th frame in Video 2; (**c**) 52nd frame in Video 2.

#### 4.2.1. Comparison with Other Tracking Methods

In the comparative experiments, with the same training mechanism as GASN, we first initialized Siamese-FC and Siamese-RPN using the pre-trained model parameters obtained from optical images. Then, we fine-tuned the model parameters using SAR images for tracking in Video-SAR. Moreover, to ensure the rationality of the experiments, our comparative experiments were all performed under the same conditions, such as the data preprocessing and the hardware and software platforms.

Figure 19 shows the accuracy comparison results of the three methods. Siamese-FC (marked with yellow) had the lowest accuracy in each video because it cannot fit the scale transformation of the shadow. For Siamese-RPN (marked with green), the accuracy improved somewhat, because the anchors can handle scale transformation. However, most preset anchors do not fit the actual shape of the shadow well, which results in failure when tracking shadows that are too long or too wide. For GASN (marked with purple), GA-SubNet only locates the anchors containing the center of the shadow to suppress false alarms, and it adaptively refines the shape of the anchor to better fit the shadow's shape, further improving the tracking accuracy. Therefore, it is obvious in Figure 19 that the accuracy of GASN is higher than that of Siamese-RPN and Siamese-FC.

**Figure 19.** Accuracy comparison of the three methods.

To validate the stability of GASN, we used CLE to compare GASN to Siamese-RPN and Siamese-FC. Because GA-SubNet locates the anchors containing the center of the shadow in advance, GASN can locate the center of the shadow more accurately. As shown in Figure 20, the CLE of GASN (marked with purple) is less than that of Siamese-RPN (marked with green) and Siamese-FC (marked with yellow), which means that TOI tracking using GASN is the most stable.

To validate the speed of GASN, we used FPS to compare GASN to Siamese-RPN and Siamese-FC. Figure 21 shows the comparison results of FPS, from which we can see that GASN (marked with purple) is almost identical to Siamese-RPN (marked with green), while Siamese-FC (marked with yellow) is lower. To the best of our knowledge, Siamese-RPN can satisfy real-time tracking [17]. Compared to Siamese-RPN, on the one hand, GASN needs to calculate the location and shape of the anchors, which reduces the tracking speed. On the other hand, the anchors are sparse, which reduces the computation of subsequent processing. It can be seen from the experimental results that the FPS of GASN is almost the same as that of Siamese-RPN; therefore, our method can achieve real-time tracking.

**Figure 20.** CLE comparison of the three methods.

**Figure 21.** FPS comparison of the three methods.

Table 4 shows the average tracking performance on the real Video-SAR data using the different methods. Due to its simple framework, MOSSE has the lowest performance, with 29.64% accuracy and 37.64 CLE, but the highest speed (125 FPS). Moreover, the deep learning methods improved the accuracy over the traditional correlation filtering methods (MOSSE and KCF), because the networks can extract multi-level and more expressive features. Most importantly, GA-SubNet in GASN only locates the sparse anchors containing the center of the shadow to suppress false alarms. Additionally, GA-SubNet refines the anchor's shape to conform to the shape of the shadow, which further improves the tracking accuracy. Therefore, the accuracy of GASN (54.68%) is better than that of Siamese-RPN (52.75%) and Siamese-FC (41.60%). In addition, because the sparse anchors reduce the subsequent computation, there is no speed loss in GASN (33 FPS) compared to Siamese-RPN (33 FPS). The above analysis shows that GASN has the highest accuracy (54.68%) without sacrificing speed.


**Table 4.** Average tracking performance of real Video-SAR data.

#### 4.2.2. Tracking Results with Clutter

To verify the suppression ability of the proposed method against clutter, we selected videos containing two types of clutter in the real data for tracking. Because Siamese-RPN performs excellently in both accuracy and speed in optical tracking, and the proposed method improves upon Siamese-RPN to make it applicable to Video-SAR, we compared the proposed method with Siamese-RPN, as shown in Figure 22. Figure 22a,b show the tracking results of Siamese-RPN and the proposed GASN method under background clutter (e.g., road signs), respectively, where the green boxes represent the ground truths of the TOI during this tracking process, and the red boxes represent the tracking results. The comparison clearly shows that the overlap between the tracking results (red) and the labels of the TOI (green) is greater than 50% for the proposed GASN method, while the overlap for Siamese-RPN is less than 30%. Figure 22c,d show the tracking results of Siamese-RPN and the proposed method under environmental clutter (e.g., imaging sidelobes), respectively; the overlap between the tracking results (red) and the labels of the TOI (green) is higher for the proposed method than for Siamese-RPN. Therefore, we believe that the tracking accuracy of the proposed method is higher than that of Siamese-RPN in the presence of clutter.

**Figure 22.** Tracking results with interference: (**a**) Siamese-RPN with background clutter; (**b**) our method with background clutter; (**c**) Siamese-RPN with environmental clutter; (**d**) our method with environmental clutter.

#### 4.2.3. Tracking Results of Different Frame Rates

Figure 23 shows the tracking results at different frame rates. We created Video 16 from Video 15 at a frame rate of 6.4; the frame rate here refers to the rate at which a video is divided into frames. For example, the frame rate of Videos 1–15 was 3.2, which means that an SAR image was captured every 1/3.2 s in the video. The parameters of Video 15 in Figure 23a and of Video 16 in Figure 23b are the same, except for the frame rate. It is obvious that the two boxes in Figure 23b have higher IoUs, i.e., more accurate tracking results. Although only the comparison results for frame 5 are shown, the results of almost all frames in Video 16 are more accurate than those of Video 15. The main reason is that the higher the frame rate, the smaller the change in the shadow's location and shape between adjacent frames. Therefore, it is reasonable to assume that the frame rate is positively correlated with the tracking accuracy.

**Figure 23.** True tracking results of different frame rates: (**a**) Video 15 at a frame rate of 3.2; (**b**) Video 15 at a frame rate of 6.4 (Video 16).
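
The inverse relationship between frame rate and inter-frame motion can be illustrated numerically. In this sketch the 2 m/s target speed is a hypothetical value chosen for illustration, not a figure from the experiments:

```python
def interframe_displacement(speed_mps, frame_rate_hz):
    """Distance (m) a target moves between consecutive frames,
    given that one frame is captured every 1 / frame_rate seconds."""
    return speed_mps / frame_rate_hz

# Doubling the frame rate from 3.2 to 6.4 halves the per-frame motion,
# so the shadow's location and shape change less between adjacent frames.
d_low = interframe_displacement(2.0, 3.2)    # 0.625 m per frame
d_high = interframe_displacement(2.0, 6.4)   # 0.3125 m per frame
```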

#### 4.2.4. Tracking Results of Another Real Video-SAR Dataset

We conducted an additional experiment on a new dataset derived from [15]. Two videos containing 675 images were used to train the network, and two videos containing 389 images were used to test it. All images were 1000 × 1000 pixels.

Figure 24 shows the tracking results of another real Video-SAR dataset, and Table 5 shows the average tracking performance. From Table 5, we can see that the accuracy of the proposed method is 1.33% higher than that of Siamese-RPN. Therefore, the proposed method is still more accurate than Siamese-RPN.

**Figure 24.** Tracking results of another real Video-SAR dataset: (**a**) 4th frame; (**b**) 45th frame; (**c**) 75th frame.


**Table 5.** Average tracking performance of another real Video-SAR dataset.

#### **5. Discussion**

#### *5.1. Research on the Transfer*

We arranged a set of experiments to verify whether the proposed method relies on the prior information of the TOI, such as the location and shape, rather than on the appearance features of the training data. In the first experiment, we used the simulated data for training and the real data for testing, as shown in Figure 25a,b. In the second experiment, we used the real data for training and the simulated data for testing, as shown in Figure 25c,d. In both experiments, the tracking results (marked with red boxes) and the ground truths of the shadow (marked with green boxes) overlap substantially.

**Figure 25.** The experimental results of cross-validation: (**a**) simulated Video-SAR data for training; (**b**) real Video-SAR data for testing; (**c**) real Video-SAR data for training; (**d**) simulated Video-SAR data for testing.

To reveal the performance of GASN more intuitively, we evaluated the tracking results using accuracy, and the results are shown in Tables 6 and 7.

**Table 6.** Cross-validation for testing the simulated Video-SAR data.


**Table 7.** Cross-validation for testing the real Video-SAR data.


The first set of cross-validation experiments involved training with real data (data B) and testing with simulated data (data A). The results are shown in row 2 of Table 6. For comparison, we also provide the results of both the training and testing using simulated data (see row 1 of Table 6). The experimental results show that their accuracy differs by 0.9%.

The second set of cross-validation experiments involved training with simulation data (data A) and testing with real data (data B). The results are shown in row 2 of Table 7. For comparison, we also provide the results of both the training and testing using real data (row 1 of Table 7). The experimental results show that their accuracy differs by 1.3%.

From the above experiments, we can see that the results of the two cross-validation experiments have little difference in terms of accuracy, which indicates that GASN has good transfer ability.

The proposed GASN is capable of similarity learning. In other words, GASN is trained with a large number of training samples so that the network can measure the similarity of two input images (i.e., the template and the search image in the training data). The greater the similarity, the higher the output score of GASN. Therefore, once a template image of the TOI is given, the information provided by the template (such as the location and shape) can be used to match the target in the next image based on the similarity measurement capability of GASN. The target with the highest similarity is then determined as the tracking result in the next image. Thus, GASN tracks the TOI using the template information instead of the appearance features of the training data, so the proposed GASN is highly robust.
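
The similarity measurement underlying this scheme is a cross-correlation: the template's feature map slides over the search image's feature map, and the peak of the response map marks the best match. A minimal single-channel NumPy sketch with toy feature maps (not the network's real features) follows:

```python
import numpy as np

def cross_correlation_map(template, search):
    """Slide the template feature map over the search feature map;
    each output value is the similarity score at that position."""
    th, tw = template.shape
    sh, sw = search.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template * search[i:i + th, j:j + tw])
    return out

# The tracking result is taken at the position of maximum similarity.
search = np.zeros((6, 6))
search[2:4, 3:5] = 1.0          # toy "shadow" feature in the search image
template = np.ones((2, 2))      # toy template feature
score = cross_correlation_map(template, search)
best = np.unravel_index(np.argmax(score), score.shape)
```

In practice the Siamese subnetwork produces multi-channel feature maps and the correlation is computed as a convolution on the GPU, but the matching principle is the same.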

#### *5.2. Ablation Experiment of GA-SubNet*

We explored the effect of GA-SubNet on false alarms. Figure 26 shows the anchors of Siamese-RPN (Figure 26a) and GASN (Figure 26b). It can be seen that after adding GA-SubNet, the anchors are mainly concentrated around the TOI, and the number of anchors is greatly reduced. Table 8 compares the results with and without GA-SubNet. Because GA-SubNet discards the useless anchors in the background and alleviates the imbalance between positive and negative samples, the accuracy improves by 4.52% after adding GA-SubNet. Therefore, GASN with GA-SubNet can better distinguish the TOI from the background.

**Figure 26.** Ablation experiment on GA-SubNet: (**a**) Siamese-RPN; (**b**) GASN.

**Table 8.** Ablation experiment of GA-SubNet.
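
The anchor sparsification performed by GA-SubNet can be mimicked by thresholding a predicted location-probability map: only positions likely to contain the shadow's center keep an anchor. This is a toy sketch; the probability values and threshold are illustrative, not the network's actual outputs.

```python
import numpy as np

def filter_anchors(location_prob, threshold=0.5):
    """Keep only spatial positions whose predicted probability of
    containing the shadow's center exceeds the threshold."""
    ys, xs = np.where(location_prob > threshold)
    return list(zip(ys.tolist(), xs.tolist()))

prob = np.array([[0.1, 0.2, 0.1],
                 [0.2, 0.9, 0.6],
                 [0.1, 0.3, 0.2]])
kept = filter_anchors(prob)
# Dense anchoring would place anchors at all 9 positions; here only the
# 2 positions near the target survive, which sparsifies the anchor set
# and removes background false alarms.
```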


#### *5.3. Research on Pre-Training*

In recent years, a common practice in the deep learning field has been to pre-train a model on large-scale training data [31–33]. As shown in Figure 6b, the one-channel SAR image needs to be copied three times to use the pre-trained parameters of three-channel RGB optical images. This method of copying one-channel SAR images three times has been widely used in SAR image processing tasks [12,15]. For example, to suit SAR tracking tasks, the pre-trained parameters from optical images are fine-tuned with the one-channel SNL data copied three times, and the tracking results are good.
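
The channel replication described above can be sketched in one line of NumPy (array names are illustrative):

```python
import numpy as np

def sar_to_three_channels(sar_image):
    """Replicate a single-channel SAR image of shape (H, W) into an
    (H, W, 3) array so that RGB-pretrained network weights can be reused."""
    return np.repeat(sar_image[:, :, np.newaxis], 3, axis=2)

img = np.random.rand(256, 256).astype(np.float32)  # toy one-channel SAR image
rgb_like = sar_to_three_channels(img)              # shape (256, 256, 3)
```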

To determine whether it is reasonable to apply a model trained on three-channel RGB images to one-channel radar images from a completely different domain, we arranged a group of experiments. The final tracking results for the simulated data are shown in Table 9. The second row contains the tracking results after pre-training the model using optical images and then fine-tuning using SAR images replicated as three channels. The first row contains the tracking results after training using only the replicated SAR images without pre-training on optical images; its tracking accuracy is approximately 4% lower than that of the second row. This illustrates that it is feasible and reasonable to apply a model trained on three-channel RGB images to one-channel radar images. Therefore, it is wise to use fine-tuning in the absence of sufficient training data.

**Table 9.** Accuracy indexes of research on pre-training.


#### *5.4. Research on the Statistical Analysis*

For a statistical analysis on the small dataset, we added an experiment in which we trained the network 10 times and calculated statistics (the mean and variance of the accuracy and of the central location error (CLE)). The results are shown in Table 10.

**Table 10.** The statistical analysis of the tracking result.


From the table, we can see that our method outperforms Siamese-RPN in terms of accuracy (58.79 > 56.37) and the accuracy variance (0.61 < 0.72), which indicates that our method is accurate and that the accuracy is more stable.

Moreover, our method outperforms Siamese-RPN in terms of the central location error (CLE) (6.56 < 7.49) and the CLE variance (0.89 < 0.98), which indicates that the CLE of our method is smaller and more stable.

#### **6. Conclusions**

To achieve the tracking of arbitrary TOIs in Video-SAR, this paper proposed a novel GASN. GASN is based on the idea of similarity learning: it uses the feature map of the template as a convolution kernel to slide windows over the feature map of the search image, and the output indicates the similarity of the two feature maps. Based on the maximum similarity, GASN determines the tracking result in the search image. GASN tracks the TOI between the first frame and the next one instead of learning the appearance across all separate frames. Additionally, we established a GA-SubNet, which uses the location information of the template to obtain the location probability in the search image and selects the locations with a probability greater than the threshold to exclude false alarms. To improve the tracking accuracy, GA-SubNet obtains anchors that more closely match the shape of the TOI through adaptive prediction. The experimental results showed that the tracking accuracy of the proposed method was 60.16% and 54.68% on the simulated and real Video-SAR data, respectively, which is higher than those of the two deep learning methods, Siamese-RPN and Siamese-FC, and the two traditional methods, MOSSE and KCF.

In the future, we will try to apply the scale-invariant feature transform (SIFT) [34] and the Lee filter [35] to real Video-SAR for more accurate tracking results, and study how to use the accurate tracking trajectory to refocus the moving target.

**Author Contributions:** Conceptualization, J.B. and X.Z.; methodology, J.B.; software, J.B.; validation, J.B., X.Z. and T.Z.; formal analysis, J.B.; investigation, J.B.; resources, J.S.; data curation, J.S.; writing—original draft preparation, J.B.; writing—review and editing, J.B.; visualization, X.Z.; supervision, T.Z.; project administration, X.Z.; funding acquisition, X.Z., J.S. and S.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the National Natural Science Foundation of China under grants 61571099, 61501098, and 61671113.

**Acknowledgments:** The authors thank all reviewers for their comments toward improving our manuscript, as well as the Sandia National Laboratory of the United States for providing SAR images. The authors would also like to thank Durga Kumar for his linguistic assistance during the preparation of this manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**


#### **References**


## *Article* **A Robust InSAR Phase Unwrapping Method via Phase Gradient Estimation Network**

**Liming Pu, Xiaoling Zhang \*, Zenan Zhou, Liang Li, Liming Zhou, Jun Shi and Shunjun Wei**

School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China; puliming@std.uestc.edu.cn (L.P.); 201722020918@std.uestc.edu.cn (Z.Z.); liliang@std.uestc.edu.cn (L.L.); zhouliming@std.uestc.edu.cn (L.Z.); shijun@uestc.edu.cn (J.S.); weishunjun@uestc.edu.cn (S.W.)

**\*** Correspondence: xlzhang@uestc.edu.cn

**Abstract:** Phase unwrapping is a critical step in synthetic aperture radar interferometry (InSAR) data processing chains. In almost all phase unwrapping methods, estimating the phase gradient according to the phase continuity assumption (PGE-PCA) is an essential step. The phase continuity assumption is not always satisfied due to the presence of noise and abrupt terrain changes; therefore, it is difficult to get the correct phase gradient. In this paper, we propose a robust least squares phase unwrapping method that works via a phase gradient estimation network based on the encoder–decoder architecture (PGENet) for InSAR. In this method, from a large number of wrapped phase images with topography features and different levels of noise, the deep convolutional neural network can learn global phase features and the phase gradient between adjacent pixels, so a more accurate and robust phase gradient can be predicted than that obtained by PGE-PCA. To get the phase unwrapping result, we use the traditional least squares solver to minimize the difference between the gradient obtained by PGENet and the gradient of the unwrapped phase. Experiments on simulated and real InSAR data demonstrated that the proposed method outperforms five other well-established phase unwrapping methods and is robust to noise.

**Keywords:** interferometric synthetic aperture radar; deep convolutional neural network; phase unwrapping

#### **1. Introduction**

Synthetic aperture radar interferometry (InSAR) is playing an increasingly important role in the field of surface deformation monitoring and topographic mapping [1–3]. The InSAR system uses two co-registered complex images from different viewing angles to obtain two-dimensional interferometric phase images. Due to the trigonometric functions in the transmitting and receiving models, the obtained interferometric phase is wrapped; that is, its range is (−*π*, *π*] [4,5]. To obtain an accurate elevation measurement of the surveyed area, the unwrapped phase must be obtained by adding the correct wrap count to each pixel of the wrapped phase, which is called phase unwrapping. Therefore, in the InSAR data processing pipeline, the elevation measurement accuracy is highly correlated with the accuracy of phase unwrapping.

Since phase unwrapping is an ill-posed problem, the phase continuity assumption is usually considered in the process of phase unwrapping: the absolute values of the gradients in the two directions of the unwrapped phase are less than *π* [6]. Under this assumption, many kinds of phase unwrapping methods have been presented in recent decades, and they can be divided into two categories: path following [5,7,8] and optimization-based methods [9–15]. A path following method selects the integration path for integrating the estimated phase gradient through the residue distribution or the phase quality map, so as to avoid the local error from being propagated globally. Examples are the branch-cut method [5] and the quality-guided method [7]. An optimization-based method minimizes the difference between the estimated gradient and the unwrapped phase gradient through an objective function to obtain the optimal unwrapped phase. Examples are the least squares (LS) method [13], the statistical-cost, network-flow algorithm for phase unwrapping (SNAPHU) [15], and the phase unwrapping max-flow/min-cut algorithm (PUMA) [10]. Both types of methods need to estimate the phase gradient through the phase continuity assumption before unwrapping. Due to the presence of noise and abrupt terrain changes, the phase continuity assumption is not always satisfied; that is, the unwrapped phase may jump by more than *π* between adjacent pixels, which may cause local errors in the unwrapping process. A local error may produce a global error along the integration path, so the estimated phase gradient directly affects the final unwrapping accuracy. Therefore, it is valuable to seek a more accurate estimation method for the phase gradient instead of relying directly on the traditional estimation method based on the phase continuity assumption.

**Citation:** Pu, L.; Zhang, X.; Zhou, Z.; Li, L.; Zhou, L.; Shi, J.; Wei, S. A Robust InSAR Phase Unwrapping Method via Phase Gradient Estimation Network. *Remote Sens.* **2021**, *13*, 4564. https://doi.org/10.3390/rs13224564

Academic Editor: João Catalão Fernandes

Received: 1 September 2021 Accepted: 9 November 2021 Published: 13 November 2021

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In recent years, deep learning-based phase unwrapping methods have attracted significant interest [16–24]. Most of these methods [16–18] convert the unwrapping problem into a classification problem over the wrap count, and their effectiveness has been verified on optical images. In the field of InSAR, the unwrapping problem becomes more difficult because of two characteristics: the complex wrapped phase caused by topography features and the low coherence coefficient. Therefore, combining traditional phase unwrapping methods with deep learning, instead of relying solely on deep learning, is a promising development trend [19–23]. In [19], a modified fully convolutional network was first applied to classify the wrapped phase into normal pixels and layover residues, which can suppress the error propagation of layover residues during the phase unwrapping process. Additionally, a CNN-based unwrapping method was proposed in [20], which feeds the wrapped phase and the coherence map into the network at the same time for training to obtain the wrap count gradient. In this method, the wrap count reconstruction is necessary for obtaining the final unwrapping result. A deep learning-based method combined with the minimum cost flow (MCF) unwrapping model was proposed in [21]. In this method, the phase gradient is discretized to match the MCF unwrapping model and treated as a three-class deep learning problem, but the number of categories may need to change with the terrain, because three categories cannot cover all situations. In addition, the ambiguity gradient [23] is taken as the ground truth for network training, and the MCF model is used as the postprocessing step for final unwrapped phase reconstruction. However, the MCF unwrapping model is usually computationally complex and requires substantial computational resources [25].

The LS phase unwrapping method is widely used in practical applications and converges quickly [9,26,27]; therefore, we considered combining it and deep learning to improve the unwrapping accuracy while retaining the advantages of the LS method. In the traditional LS method, estimating the phase gradient according to the phase continuity assumption (PGE-PCA) is an essential step. Recent studies [28–32] have indicated that the encoder–decoder architecture based on deep convolutional neural networks (DCNN) can learn the global features from a large number of input images with different levels of noise or other disturbances, which is useful for obtaining the robust phase gradient from noisy wrapped phase images.
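
PGE-PCA itself is simple to state: under the phase continuity assumption, the gradient of the unwrapped phase is estimated by wrapping the finite differences of the wrapped phase back into (−π, π]. A minimal NumPy sketch (the function names are ours) follows:

```python
import numpy as np

def wrap(phase):
    """Wrap arbitrary phase values into the interval (-pi, pi]."""
    return phase - 2 * np.pi * np.ceil((phase - np.pi) / (2 * np.pi))

def pge_pca(wrapped):
    """Estimate the unwrapped-phase gradients in the two image directions
    as wrapped finite differences of the wrapped phase (PGE-PCA)."""
    grad_y = wrap(np.diff(wrapped, axis=0))  # vertical direction
    grad_x = wrap(np.diff(wrapped, axis=1))  # horizontal direction
    return grad_y, grad_x
```

The estimate is exact only while the true gradient magnitude stays below π, which is precisely the phase continuity assumption; noise and abrupt terrain can violate it, and that failure mode is what PGENet is designed to handle.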

In this paper, we propose a robust LS InSAR phase unwrapping method based on a phase gradient estimation network (PGENet-LS). In this method, we transform phase gradient estimation into a regression problem and design a phase gradient estimation network based on the encoder–decoder architecture (PGENet) for InSAR. From a large number of wrapped phase images with topographic features and different levels of noise, PGENet can extract global high-level phase features and recognize the phase gradient between adjacent pixels, so it estimates a more accurate and robust phase gradient than PGE-PCA. Finally, the phase unwrapping result is obtained by using the least squares solver to minimize the difference between the gradient obtained by PGENet and the gradient of the unwrapped phase. In other words, the phase gradient estimated by PGENet replaces PGE-PCA in the traditional LS unwrapping method. Because the gradient estimated by PGENet is significantly more accurate and robust than that estimated by PGE-PCA, the proposed method achieves higher accuracy than the traditional LS phase unwrapping method. A series of experiments on simulated wrapped phases and real InSAR data demonstrate that the proposed method outperforms five other well-established phase unwrapping methods and is robust to noise.

This paper is organized as follows. Section 2 introduces the principles of phase unwrapping, problem analysis, PGENet, and the proposed method. In Section 3, the data generation method, loss function, performance evaluation index, and experiment settings are described. In Section 4, a series of experimental results using simulated and real InSAR data are presented. Section 5 and Section 6 present the discussion and conclusions of the paper, respectively.

#### **2. PGENet-LS Phase Unwrapping Method**

In this section, we first introduce the principle of phase unwrapping and how to use deep neural networks instead of PGE-PCA to estimate the phase gradient. Then we describe the detailed structure of PGENet, and finally introduce the processing flow of the PGENet-LS phase unwrapping method.

#### *2.1. Principle of Phase Unwrapping*

The wrapped phase is distributed in (−*π*, *π*] and contains ambiguities of integer multiples of 2*π*. For an image pixel (*i*, *j*), the relationship between the wrapped phase φ<sub>i,j</sub> and the unwrapped phase ϕ<sub>i,j</sub> can be expressed as

$$
\varphi\_{i,j} = \phi\_{i,j} + 2\pi k\_{i,j} \tag{1}
$$

where *k*<sub>i,j</sub> is an integer called the wrap count. When *k*<sub>i,j</sub> is known, the unwrapped phase can be recovered from the wrapped phase. However, a unique solution *k*<sub>i,j</sub> cannot be obtained because there are two unknowns in (1). Therefore, in the traditional phase unwrapping method, the phase continuity assumption is used to ensure the uniqueness of the phase unwrapping result. Under this assumption, the LS phase unwrapping method can be divided into the following two steps: phase gradient estimation and implementation of the least squares solver. The flowchart is shown in Figure 1.

Step 1: According to the phase continuity assumption, for the two-dimensional phase unwrapping problem, the horizontal and vertical phase gradients can be estimated by

$$
\Delta\_{i,j}^{x} = \begin{cases}
\varphi\_{i,j+1} - \varphi\_{i,j}, & -\pi \le \varphi\_{i,j+1} - \varphi\_{i,j} \le \pi \\
\varphi\_{i,j+1} - \varphi\_{i,j} - 2\pi, & \pi < \varphi\_{i,j+1} - \varphi\_{i,j} \le 2\pi \\
\varphi\_{i,j+1} - \varphi\_{i,j} + 2\pi, & -2\pi \le \varphi\_{i,j+1} - \varphi\_{i,j} < -\pi
\end{cases} \tag{2}
$$

$$
\Delta\_{i,j}^{y} = \begin{cases}
\varphi\_{i+1,j} - \varphi\_{i,j}, & -\pi \le \varphi\_{i+1,j} - \varphi\_{i,j} \le \pi \\
\varphi\_{i+1,j} - \varphi\_{i,j} - 2\pi, & \pi < \varphi\_{i+1,j} - \varphi\_{i,j} \le 2\pi \\
\varphi\_{i+1,j} - \varphi\_{i,j} + 2\pi, & -2\pi \le \varphi\_{i+1,j} - \varphi\_{i,j} < -\pi
\end{cases} \tag{3}
$$

where Δ<sup>x</sup><sub>i,j</sub> and Δ<sup>y</sup><sub>i,j</sub> are the horizontal gradient and vertical gradient, respectively. For brevity, this step is referred to as PGE-PCA.
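PGE-PCA amounts to re-wrapping the neighbour differences of the wrapped phase into (−π, π]. The following NumPy sketch is ours (the function name is illustrative, not from the paper); it uses the identity that `angle(exp(1j·d))` reproduces the three cases of (2) and (3):

```python
import numpy as np

def pge_pca(wrapped):
    """Phase gradient estimation under the phase continuity assumption
    (PGE-PCA): neighbour differences of the wrapped phase, re-wrapped
    into (-pi, pi], which reproduces the three cases of (2) and (3)."""
    def rewrap(d):
        # angle(exp(j*d)) maps any difference back into (-pi, pi]
        return np.angle(np.exp(1j * d))
    dx = rewrap(np.diff(wrapped, axis=1))  # horizontal: (i, j+1) - (i, j)
    dy = rewrap(np.diff(wrapped, axis=0))  # vertical:   (i+1, j) - (i, j)
    return dx, dy
```

For a noiseless smooth phase, these estimates equal the true unwrapped-phase gradients wherever the continuity assumption holds.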

Step 2: After obtaining the estimated horizontal gradient Δ<sup>x</sup><sub>i,j</sub> and vertical gradient Δ<sup>y</sup><sub>i,j</sub>, the final unwrapping result ϕ′<sub>i,j</sub> can be calculated using the least squares solver of (4).

$$\underset{\phi\_{i,j}^{\prime}}{\arg\min} \sum\_{i} \sum\_{j} \left| \phi\_{i,j+1}^{\prime} - \phi\_{i,j}^{\prime} - \Delta\_{i,j}^{x} \right|^{2} + \sum\_{i} \sum\_{j} \left| \phi\_{i+1,j}^{\prime} - \phi\_{i,j}^{\prime} - \Delta\_{i,j}^{y} \right|^{2} \tag{4}$$

The meaning of (4) is to minimize the difference between the estimated gradient and the gradient of the unwrapped phase. To solve (4), there are mainly two classes of fast algorithms: transformation-based methods [13] and multigrid methods [12]. The accuracy of the phase unwrapping result from (4) is directly related to the accuracy of the estimated phase gradient; in other words, if the accuracy of the estimated phase gradient is improved, a more accurate phase unwrapping result can be obtained.
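For intuition, (4) can be solved exactly on a tiny grid by assembling one equation per gradient and calling a dense least squares solver. The sketch below is ours and for illustration only (function name included); real-sized images require the transformation-based or multigrid solvers cited above:

```python
import numpy as np

def ls_unwrap(dx, dy):
    """Solve the least squares problem (4) exactly on a small grid with a
    dense solver. dx has shape (H, W-1) and dy has shape (H-1, W); the
    result is defined up to an additive constant (the null space of the
    difference operator)."""
    H, W = dy.shape[0] + 1, dx.shape[1] + 1
    n = H * W
    idx = lambda i, j: i * W + j
    rows, b = [], []
    # one equation per horizontal gradient: phi[i,j+1] - phi[i,j] = dx[i,j]
    for i in range(H):
        for j in range(W - 1):
            r = np.zeros(n); r[idx(i, j + 1)] = 1; r[idx(i, j)] = -1
            rows.append(r); b.append(dx[i, j])
    # one equation per vertical gradient: phi[i+1,j] - phi[i,j] = dy[i,j]
    for i in range(H - 1):
        for j in range(W):
            r = np.zeros(n); r[idx(i + 1, j)] = 1; r[idx(i, j)] = -1
            rows.append(r); b.append(dy[i, j])
    phi, *_ = np.linalg.lstsq(np.array(rows), np.array(b), rcond=None)
    return phi.reshape(H, W)
```

When the input gradients are consistent (e.g., noiseless), the recovered surface matches the true unwrapped phase up to a constant offset.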

**Figure 1.** Flowchart of the traditional least squares phase unwrapping method.

#### *2.2. Problem Analysis*

In practical applications, the phase continuity assumption is often not satisfied in every pixel due to the presence of noise and abrupt terrain changes. The noise level can be evaluated by the coherence coefficient, which can be expressed as

$$\rho = \left|\frac{E\left\{S\_1 \cdot S\_2^\*\right\}}{\sqrt{E\left\{|S\_1|^2\right\} \cdot E\left\{|S\_2|^2\right\}}}\right| \tag{5}$$

where *S*<sub>1</sub> and *S*<sub>2</sub> are the co-registered master and slave complex images, respectively; ∗ denotes the complex conjugate; and *E*{·} denotes the mathematical expectation.
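In practice the expectations in (5) are unavailable and are commonly replaced by local spatial averages over a small window. The sketch below makes that estimator concrete; the window size and function names are our illustrative choices, not from the paper:

```python
import numpy as np

def coherence(s1, s2, win=5):
    """Sample estimate of the coherence coefficient (5): the expectations
    E{.} are replaced by a win x win moving average (a common practical
    choice; the window size here is illustrative)."""
    def box(x):
        # separable moving average via 1-D convolutions along each axis
        k = np.ones(win) / win
        x = np.apply_along_axis(np.convolve, 0, x, k, 'same')
        return np.apply_along_axis(np.convolve, 1, x, k, 'same')
    num = box(s1 * np.conj(s2))
    den = np.sqrt(box(np.abs(s1) ** 2) * box(np.abs(s2) ** 2))
    return np.abs(num / np.maximum(den, 1e-12))
```

For two images differing only by a constant phase, the estimate is 1 everywhere; decorrelating noise between the images pulls it toward 0.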

Figure 2 shows the influence of different coherence coefficients on the wrapped phase and the corresponding gradients in the horizontal and vertical directions. Figure 2a shows the wrapped phase with different coherence coefficients, and Figure 2b,c shows the corresponding phase gradients in the two directions obtained by (2) and (3), respectively. With a coherence coefficient of 1 (no noise), accurate horizontal and vertical phase gradients can be obtained by (2) and (3); in the presence of noise, however, the estimated gradients are no longer reliable because the gradient information is corrupted by noise, and the lower the coherence coefficient, the more severely the gradients in both directions are corrupted. From (2), (3), and Figure 2, we can see that considering only the relationship between adjacent pixels when calculating the gradient is not enough; more, or even global, phase information may need to be employed. Therefore, in the presence of noise, we use global phase information in the gradient estimation process to improve the accuracy of phase gradient estimation. That is, the gradient estimate at each pixel depends not only on the adjacent pixels but on the whole wrapped image.

In recent years, the encoder–decoder architecture based on DCNNs has been widely used in image processing in the fields of optics, medicine, and SAR [29–32]. These studies have indicated that the encoder–decoder architecture can learn global high-level features from input images with different levels of noise or other disturbances; this powerful feature extraction capability is conducive to extracting global phase features from noisy wrapped images to obtain an accurate phase gradient. In addition, according to (2) and (3), the gradient calculation principles in the two directions are the same. Therefore, while the original wrapped phase image and the horizontal gradient image are used as one image pair, the transposed wrapped phase image and the transposed vertical gradient image of the original wrapped phase form another, equivalent, image pair; thus, by feeding the original and transposed wrapped phase images into a single network, we obtain the horizontal and vertical gradient images of the wrapped phase, respectively. Based on the above analysis, we designed PGENet on the encoder–decoder architecture: it takes the original or transposed wrapped phase images as input and outputs the estimated horizontal or vertical phase gradient images. The network can learn global phase features and the phase gradient between adjacent pixels from a large number of wrapped phase images with topographic features and different coherence coefficients; hence, accurate and robust horizontal and vertical phase gradient images can be predicted after training.
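The equivalence behind this transpose trick is easy to verify numerically: the vertical gradient of a wrapped phase image equals the transpose of the horizontal gradient of its transposed image. A small NumPy check (ours, not from the paper):

```python
import numpy as np

def rewrap(d):
    # map phase differences back into (-pi, pi]
    return np.angle(np.exp(1j * d))

rng = np.random.default_rng(1)
phase = rng.uniform(-np.pi, np.pi, size=(32, 32))

dy = rewrap(np.diff(phase, axis=0))      # vertical gradient of the image
dx_t = rewrap(np.diff(phase.T, axis=1))  # horizontal gradient of the transpose
assert np.allclose(dy, dx_t.T)           # the two image pairs are equivalent
```

This is why one network trained on horizontal-gradient pairs suffices for both directions.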

**Figure 2.** (**a**) Wrapped phase images, and (**b**,**c**) are the corresponding horizontal gradient images and vertical gradient images, respectively. From top to bottom, the coherence coefficients are 1, 0.95, 0.75, and 0.5, respectively.

#### *2.3. PGENet*

PGENet is designed on the encoder–decoder architecture and is mainly composed of two parts: an encoder and a decoder. Its overall structure and the detailed parameters of each layer are shown in Figure 3 and Table 1, respectively. As shown in Figure 3 and Table 1, the encoder part (Encoder) contains eight encoder blocks and converts the input noisy wrapped phase images into a larger number of smaller feature maps, which enriches the global phase gradient features while limiting the memory requirement as the number of feature maps grows. Each encoder block consists of two convolution layers (Conv + Relu), and each layer performs two operations: convolution and activation (Relu). After the encoding process, a large number of global phase features have been extracted into the feature maps. The decoder part (Decoder) contains eight decoder blocks and gradually recovers the phase gradient from these extracted global phase feature maps. Each decoder block consists of a convolution layer and a deconvolution layer (Deconv + Relu); each deconvolution layer performs two operations: deconvolution and activation. During the decoding process, the many feature maps are gradually merged into larger images until the output is the same size as the input. At the same time, owing to the fusion of global phase features, the accurate phase gradient is extracted.

**Figure 3.** Overall architecture of PGENet.



In the process of constructing the Encoder and Decoder, a deep network structure is formed. The deep architecture of PGENet enriches the levels of phase gradient features, which ensures that the network has sufficient phase gradient estimation capability when processing noisy wrapped phase images. While increasing the network depth, feature maps of the same size in the encoder and decoder are connected by skip connections [28]. As shown in Figure 3, the skip connections transfer the extracted global phase features containing more detailed gradient information to the deconvolution layers of the Decoder and accelerate convergence. Therefore, during decoding, the low-, mid-, and high-level global phase gradient features containing more detailed information from the Encoder are compensated and fused into the current phase gradient feature maps. Under the combined effect of the deep structure and skip connections, PGENet can obtain accurate and robust phase gradient estimates for inputs with different noise levels.

#### *2.4. PGENet-LS Phase Unwrapping Method*

As shown in Figure 4, in the PGENet-LS phase unwrapping method, the horizontal and vertical gradients are estimated by PGENet first, and then the least squares solver is employed to minimize the difference between the gradients obtained by PGENet and the gradients of the unwrapped phase. Therefore, the processing flow can be divided into the following two steps: phase gradient estimation and unwrapping using the least squares solver.

**Figure 4.** Flowchart of the proposed method.

Step 1: Estimating the phase gradients with PGENet requires training and testing. PGENet takes original or transposed wrapped phase images with different coherence coefficients and topographic features as input and produces the corresponding horizontal or vertical gradient images. During training, the loss function described in Section 3.2 is used to update the trainable parameters. During testing, the wrapped phase image (simulated or real InSAR data) or its transposed version is fed into the trained PGENet to obtain the estimated horizontal or vertical phase gradient image, respectively.

Step 2: After obtaining the horizontal and vertical gradients estimated by PGENet, any least squares solver based on the phase gradients of (2) and (3) can be used in this step. In this paper, the well-established weighted least squares solver of (6) [26] is selected to obtain the unwrapping result; it can be expressed as

$$\underset{\phi\_{i,j}^{\prime}}{\arg\min} \sum\_{i} \sum\_{j} w\_{i,j}^{x} \left| \phi\_{i,j+1}^{\prime} - \phi\_{i,j}^{\prime} - \Delta\_{i,j}^{x} \right|^{2} + \sum\_{i} \sum\_{j} w\_{i,j}^{y} \left| \phi\_{i+1,j}^{\prime} - \phi\_{i,j}^{\prime} - \Delta\_{i,j}^{y} \right|^{2} \tag{6}$$

where
$$
w\_{i,j}^{x} = \min\left(w\_{i,j+1}^{2}, w\_{i,j}^{2}\right), \qquad w\_{i,j}^{y} = \min\left(w\_{i+1,j}^{2}, w\_{i,j}^{2}\right)
$$
and the weights *w*<sub>i,j</sub> are defined as the normalized cross-correlation coefficient.

#### **3. Experiments**

In this section, the detailed data generation process, loss function, and performance evaluation index are described.

#### *3.1. Data Generation*

To give the trained PGENet good generalization capability, a large number of labeled images with topographic features are needed for training. Therefore, we used a digital elevation model (DEM) to generate wrapped phase images according to the ambiguity height of the real InSAR system [21,33]. After generating the wrapped phase, the corresponding ideal horizontal and vertical phase gradients were obtained according to (2) and (3). In addition, different levels of noise were added to the ideal (noise-free) wrapped phase images; in InSAR, the noise level is expressed by the coherence coefficient [2,33]. A lower coherence coefficient means a higher noise level, and in the absence of noise the coherence coefficient is 1. Wrapped phase images with different coherence coefficients were used to train PGENet, which ensured that the trained network was robust to noise.
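A minimal sketch of this simulation step, assuming the stated mapping of 2π of phase per ambiguity height. The additive-Gaussian noise term, whose standard deviation shrinks as the coherence approaches 1, is a simplified stand-in for a full coherence-driven InSAR noise model, and the function name is ours:

```python
import numpy as np

def simulate_wrapped(dem, h_amb=92.13, coherence=1.0, rng=None):
    """Simulate a wrapped interferometric phase from a DEM.

    h_amb is the ambiguity height: 2*pi of interferometric phase per
    h_amb metres of elevation. The Gaussian phase-noise term, whose
    spread grows as the coherence drops, is an illustrative heuristic,
    not the paper's noise model."""
    phase = 2 * np.pi * dem / h_amb
    if coherence < 1.0:
        if rng is None:
            rng = np.random.default_rng()
        sigma = np.sqrt(1 - coherence ** 2) / coherence  # heuristic scale
        phase = phase + rng.normal(scale=sigma, size=np.shape(dem))
    return np.angle(np.exp(1j * phase))  # wrap into (-pi, pi]
```

With `coherence=1.0` the output is the ideal wrapped phase used as ground truth; lowering the coherence produces the noisy training inputs.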

The details of the generated training data are as follows. Figure 5a shows the DEM (eastern part of Turkey, 2048 × 2048 pixels) used for training in this study. It comes from SRTM 1Sec HGT and can be downloaded through the Sentinel Application Platform (SNAP). This DEM was chosen because its topographic features are similar to those of the real wrapped phase in the subsequent experiments. In addition, the ambiguity height of the simulated system (92.13 m) was the same as that of the real wrapped phase image in the subsequent experiments. Similar topographic features and the same ambiguity height made the phase gradient features of the training data and the real data as similar as possible. According to the ambiguity height and the DEM, the corresponding ideal wrapped phase and horizontal and vertical phase gradient images are shown in Figure 5b–d. Examples of the noisy wrapped phase images are shown in Figure 6. Ten noisy wrapped phase images were generated for training, with coherence coefficients in the range [0.5, 0.95] at an interval of 0.05; this range covers most common InSAR data. To reduce the memory requirement, the whole wrapped phase images were cut into image patches of 256 × 256 pixels. To augment the training data, the crop window was shifted by 128 pixels along the columns or rows so that adjacent patches overlapped by 50%. The total number of horizontal and vertical gradient image patches for training was therefore 4500.

The details of the generated testing data are as follows. Figure 7a shows the reference DEM (1024 × 1024 pixels) used for testing. Ten noisy wrapped phase images were generated, with the same range of coherence coefficients as the training data. Figure 7b shows an example of a noisy wrapped phase with a coherence coefficient of 0.5, and Figure 7c,d shows the ideal horizontal and vertical phase gradients, respectively. The whole wrapped phase images were cut into 256 × 256 pixel patches for testing, again with a 128-pixel shift along the columns or rows to give 50% overlap between adjacent patches. The total number of horizontal and vertical gradient image patches for testing was therefore 980, about 22% of the size of the training set.
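The quoted patch totals follow directly from the crop geometry (256-pixel patches, 128-pixel stride, 10 coherence levels, and one horizontal plus one vertical gradient pair per patch); a quick check with an illustrative helper:

```python
def patch_count(size, patch=256, stride=128):
    """Number of patch positions per axis when cropping with 50% overlap."""
    return (size - patch) // stride + 1

# Training: 2048x2048 DEM -> 15 positions per axis; 10 coherence levels;
# x2 for the horizontal- and (transposed) vertical-gradient image pairs.
n_train = patch_count(2048) ** 2 * 10 * 2   # 15 * 15 * 10 * 2 = 4500

# Testing: 1024x1024 DEM -> 7 positions per axis.
n_test = patch_count(1024) ** 2 * 10 * 2    # 7 * 7 * 10 * 2 = 980
```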

**Figure 5.** (**a**) Reference DEM used for training. (**b**) Ideal wrapped phase simulated by the reference DEM. (**c**) Ideal horizontal gradient. (**d**) Ideal vertical gradient.

**Figure 6.** Examples of the noisy wrapped phase with different coherence coefficients. (**a**) 0.5. (**b**) 0.75. (**c**) 0.95.

**Figure 7.** (**a**) Reference DEM used for testing. (**b**) Wrapped phase with a coherence coefficient of 0.5. (**c**) Ideal horizontal phase gradient. (**d**) Ideal vertical phase gradient.

#### *3.2. Loss Function*

The widely used mean-square error (MSE) [33] is taken as the loss function to update the trainable parameters of PGENet. It is calculated from the ideal phase gradient (ground truth) and the estimated phase gradient image (network output), and can be expressed as

$$\mathcal{L} = \frac{1}{N}\sum\_{i=1}^{N} \left(\Delta\_i - \Delta\_i^{\prime}\right)^2 \tag{7}$$

where *N* is the number of phase gradient image pixels, and Δ<sub>i</sub> and Δ′<sub>i</sub> are the ideal gradient and the gradient estimated by PGENet, respectively.

#### *3.3. Performance Evaluation Index*

To fully evaluate the accuracy and robustness of the proposed method, we used both qualitative and quantitative evaluation. Qualitative evaluation refers to judging the unwrapping accuracy by visual inspection of the images; therefore, the original wrapped images and the DEM images obtained from the unwrapped phases of the six unwrapping methods are presented side by side in this paper. To evaluate the unwrapping accuracy and robustness more comprehensively and objectively, the unwrapping failure rate [19] and root mean square error (RMSE) were adopted for quantitative evaluation.

The unwrapping failure rate can be defined as

$$\text{UFR} = \frac{1}{MN} \sum\_{i=1}^{M} \sum\_{j=1}^{N} d\_{i,j} \tag{8}$$

where
$$
d\_{i,j} = \begin{cases} 1, & \left| \phi\_{i,j}^{\prime} - \phi\_{i,j} \right| \ge \pi \\ 0, & \text{otherwise} \end{cases}
$$
and ϕ′<sub>i,j</sub> and ϕ<sub>i,j</sub> are the estimated and ideal unwrapped phases, respectively; *M* and *N* are the width and height of the unwrapped phase image. A smaller UFR value indicates better unwrapping accuracy, to a certain extent, in simulated data processing.

For simulated data processing, in addition to the unwrapping failure rate, the RMSE between the estimated and ideal unwrapped phases was employed to evaluate the unwrapping accuracy. For real InSAR data processing, because the ideal unwrapped phase is unknown and the ultimate goal of wrapped phase processing is to obtain elevation, we employed the RMSE between the reference DEM and the DEMs obtained from the unwrapped phases of the different methods [20,23]. The formulaic expression of RMSE is

$$\text{RMSE} = \sqrt{\frac{1}{MN}\sum\_{i=1}^{M} \sum\_{j=1}^{N} \left(\phi\_{i,j} - \phi\_{i,j}^{\prime}\right)^2} \tag{9}$$

where ϕ<sub>i,j</sub> is the reference DEM image; ϕ′<sub>i,j</sub> is the DEM image obtained from the estimated unwrapped phase; and *M* and *N* are the DEM image width and height, respectively. A small RMSE means that the estimated DEM is close to the reference DEM; that is, the unwrapping accuracy is high.
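Both quantitative indexes reduce to a few lines of NumPy; a minimal sketch of (8) and (9), with our own function names:

```python
import numpy as np

def ufr(est, ideal):
    """Unwrapping failure rate (8): fraction of pixels whose unwrapped-phase
    error reaches at least pi."""
    return float(np.mean(np.abs(est - ideal) >= np.pi))

def rmse(est, ideal):
    """Root mean square error (9) between an estimate and its reference
    (unwrapped phase on simulated data, DEM on real data)."""
    return float(np.sqrt(np.mean((est - ideal) ** 2)))
```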

#### *3.4. General Experiment Settings*

We implemented all experiments on a PC with an Intel i7-8700K CPU, an NVIDIA GeForce RTX 2080 GPU, and 64 GB of memory. PGENet was trained for 260 epochs with a batch size of 16 on the TensorFlow platform, and the Adam optimizer [34] was used to accelerate training. The initial learning rate was set to 1 × 10<sup>−4</sup> and decayed exponentially to 0. We used the early stopping method [35,36] to determine when to stop training. The trained PGENet was used in the following experiments on simulated and real data.

#### **4. Results**

Three experiments were implemented to evaluate the unwrapping accuracy and robustness of the proposed method. In the first experiment, we demonstrated the accuracy of the phase gradient and the robustness of PGENet, comparing PGENet with PGE-PCA. In the second experiment, we evaluated the unwrapping accuracy and robustness of the proposed method on simulated data, comparing it with the LS [26], QGPU [7], SNAPHU [14], and PUMA [10] unwrapping methods and a state-of-the-art deep learning-based method [20]. In the third experiment, the proposed method was compared with the five aforementioned phase unwrapping methods using real Sentinel-1 InSAR data. For a clear comparison, RMSE was used for performance evaluation in all three experiments.

#### *4.1. Performance Evaluation of PGENet*

We first selected a testing sample with a coherence coefficient of 0.5 from the testing samples described in Section 3.1 to visually analyze the performance of phase gradient estimation and then evaluate the estimation accuracy of PGENet from the perspective of the phase gradient error. Meanwhile, PGENet was compared with PGE-PCA.

The reference DEM is shown in Figure 8a, and the corresponding wrapped phase image with a coherence coefficient of 0.5 is shown in Figure 8b. Figure 8c,d shows the ideal horizontal and vertical phase gradients, respectively. The horizontal and vertical phase gradients estimated by PGE-PCA are shown in Figure 9a,b, and those estimated by PGENet in Figure 9c,d. The corresponding gradient errors between the PGE-PCA estimates and the ideal gradients are shown in Figure 10a,b, and those between the PGENet estimates and the ideal gradients in Figure 10c,d. It can be seen that the horizontal and vertical phase gradients obtained by PGENet are very close to the corresponding ideal gradients, because most pixels of the error images are close to zero. To better quantify the gradient errors of the two methods, error histogram curves were fitted from the gradient error histograms and are shown in Figure 11, whose horizontal axis is the gradient error and whose vertical axis is the corresponding number of pixels in the gradient error image. The histogram curve clearly shows the error distribution and is convenient for comparing the errors of different methods, so it is also used in the subsequent analysis. From Figure 11, for both the horizontal and vertical gradient errors, the error curve of PGENet is more concentrated near zero and sharper than that of PGE-PCA, so the estimation accuracy of PGENet is significantly better than that of PGE-PCA from the perspective of the phase gradient error.

**Figure 8.** A testing sample. (**a**) Ideal unwrapped phase used for performance evaluation. (**b**) Noisy wrapped phase with a coherence coefficient of 0.5. (**c**) Ideal horizontal phase gradient. (**d**) Ideal vertical phase gradient.

**Figure 9.** Phase gradient images estimated by two methods. (**a**) Horizontal phase gradient and (**b**) vertical phase gradient estimated by PGE-PCA. (**c**) Horizontal phase gradient and (**d**) vertical phase gradient estimated by PGENet.

**Figure 10.** Phase gradient estimation errors. (**a**) Horizontal gradient error and (**b**) vertical gradient error of PGE-PCA. (**c**) Horizontal gradient error and (**d**) vertical gradient error of PGENet.

**Figure 11.** Phase gradient error curves. (**a**) Horizontal phase gradient error. (**b**) Vertical phase gradient error.

#### *4.2. Robustness Testing of PGENet*

As described in Section 3.1, the coherence coefficients of all testing samples ranged from 0.5 to 0.95. For the noise robustness testing, we calculated the mean RMSE of PGENet and PGE-PCA over the testing samples with the same coherence coefficient. The results for the horizontal and vertical phase gradients are shown in Figure 12. For both the horizontal and vertical gradient estimates, the RMSE of PGENet is smaller, meaning its accuracy was significantly higher than that of PGE-PCA in every considered case. Moreover, the RMSE of PGENet did not change significantly with the coherence coefficient, indicating that PGENet is robust to noise. To evaluate the overall performance across coherence coefficients, we calculated the mean RMSE over all testing samples; the results are listed in Table 2. PGENet is robust to noise and has higher estimation accuracy than PGE-PCA.

**Figure 12.** Root mean square errors (RMSE) of two phase gradient estimation methods on simulated data with different coherence coefficients. (**a**) Horizontal phase gradient estimation. (**b**) Vertical phase gradient estimation.

**Table 2.** A comparison of two phase gradient estimation methods using root mean square error (RMSE) on simulated data with different coherence coefficients.


#### *4.3. Performance Evaluation of Phase Unwrapping on Simulated Data*

We selected a testing sample with a coherence coefficient of 0.5 (Figure 8b) to analyze the unwrapping accuracy of the proposed method from the perspective of the unwrapped phase error. The proposed method is compared with five widely used phase unwrapping methods: the LS, QGPU, PUMA, and SNAPHU methods and a deep learning-based method [20].

The unwrapped phases obtained by the six methods are shown in Figure 13. Figure 14 shows the corresponding errors between the estimated unwrapped phases and the ideal unwrapped phase (Figure 8a). We can observe that the unwrapped phase obtained by the proposed method is close to the ideal unwrapped phase, because most pixels of its error image are close to zero. To better quantify the unwrapped phase errors of the six methods, their error histogram curves were fitted and are shown in Figure 15. Compared with the other five unwrapping methods, the error curve of the proposed method is more concentrated near zero and sharper, so the unwrapping accuracy of the proposed method was the highest among the six methods.

**Figure 13.** Unwrapped phase images of six phase unwrapping methods on simulated data. (**a**) LS. (**b**) QGPU. (**c**) PUMA. (**d**) SNAPHU. (**e**) CNN [20]. (**f**) Proposed method.

**Figure 14.** Unwrapped phase errors on simulated data. (**a**) LS. (**b**) QGPU. (**c**) PUMA. (**d**) SNAPHU. (**e**) CNN [20]. (**f**) Proposed method.

**Figure 15.** Fitted unwrapped phase error histogram curves of six phase unwrapping methods on simulated data.

#### *4.4. Robustness Testing of Phase Unwrapping on Simulated Data*

As described in Section 3.1, the coherence coefficients of all testing samples ranged from 0.5 to 0.95. For the noise robustness testing, we calculated the mean unwrapping failure rate and RMSE of the six unwrapping methods over the testing samples with the same coherence coefficient; the results are shown in Figure 16. It can be observed that the proposed method had the smallest unwrapping failure rate and RMSE for every considered coherence coefficient; that is, among these six unwrapping methods, the proposed method has the highest unwrapping accuracy. Moreover, the unwrapping failure rate and RMSE of the proposed method did not change significantly with the coherence coefficient, which means the proposed method is robust to noise. To evaluate the overall performance across coherence coefficients, the mean unwrapping failure rate and RMSE over all testing samples were calculated and are listed in Table 3. The unwrapping failure rate and RMSE of the proposed method are the smallest among all six methods. Compared with the PUMA method, the unwrapping failure rate and RMSE of the proposed method were 89.3% and 49.5% lower, respectively; compared with the CNN method [20], they were 96.3% and 61.2% lower, respectively. The reason for this improvement may be that the classification error is amplified in the post-processing of the CNN method [20], because the difference between adjacent categories corresponds to an unwrapped phase of 2*π*, whereas in the proposed method the continuous phase gradient estimation keeps the unwrapped phase error within a small range. From Figure 14, we can indeed observe that the unwrapped phase error range of the CNN method [20] is significantly larger than that of the proposed method.
In addition, the unwrapping failure rate and RMSE of the CNN method [20] increased as the noise level increased, probably because the classification error grows with the noise level. Furthermore, as the noise level decreased, the performance gaps among the different unwrapping methods gradually narrowed. Based on the above analysis, the proposed method has the highest unwrapping accuracy among these six unwrapping methods and is robust to noise.

**Figure 16.** Quantitative indexes of six methods for phase unwrapping results on simulated wrapped phase images with different coherence coefficients. (**a**) Unwrapping failure rate. (**b**) Root mean square error (RMSE).

**Table 3.** Quantitative indexes of six phase unwrapping methods on simulated data.


#### *4.5. Performance Evaluation of Phase Unwrapping on Real InSAR Data*

A real wrapped phase image was used to evaluate the phase unwrapping performance of the proposed method, which was compared with the other five unwrapping methods. Before unwrapping, the widely used Goldstein phase filtering algorithm [37] was employed to suppress the noise of the real wrapped phase image. From the unwrapping results, we performed two operations to obtain the estimated DEM: elevation inversion and terrain correction. These two operations were performed with the standard methods of the SNAP software ("Phase to Elevation" and "Range Doppler Terrain Correction").

The wrapped phase image, covering the eastern part of Turkey, was computed from a pair of SLC images acquired by the Sentinel-1 satellite in Interferometric Wide Swath mode on 2 and 8 July 2019. Figure 17a shows the wrapped phase image, and Figure 17b shows the corresponding reference DEM. The DEM is from SRTM 1Sec HGT, can be downloaded through the official SNAP software, and was processed with bilinear interpolation to match the grid size of the estimated DEM (13.93 m).

**Figure 17.** Sentinel-1 InSAR data. (**a**) Wrapped phase image. (**b**) Reference DEM.

With phase filtering applied, the DEM results obtained from the phase unwrapping results of the six methods are shown in Figure 18. Figure 19 shows the corresponding errors between the DEM solutions and the reference DEM. We can observe that the DEM obtained by the proposed method is the closest to the reference DEM among the six methods, because most pixels of its error image are closest to zero. To better quantify the DEM errors of the six unwrapping methods, the error histogram curves were fitted and are shown in Figure 20. Compared with the other five unwrapping methods, the error curve of the proposed method is more concentrated near zero and sharper, so the unwrapping accuracy of the proposed method was the highest among the six methods.

For quantitative evaluation, the RMSEs between the DEM solutions of the six methods and the reference DEM were calculated and are listed in Table 4. The RMSE of the proposed method is the smallest of the six, and is 3.73% and 7.0% lower than those of the PUMA method and the CNN method [20], respectively. The improvement is smaller than on the simulated data because the noise level was greatly reduced by phase filtering; as the noise level decreases, the performance gaps among the unwrapping methods gradually narrow. Based on the above analysis, the unwrapping accuracy of the proposed method is the highest among the six methods.
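For reference, the RMSE metric used in Table 4 (and Table 5 below) is straightforward to compute; the function name and the optional validity mask are illustrative, not from the paper:

```python
import numpy as np

def dem_rmse(dem_est, dem_ref, mask=None):
    """Root mean square error between an estimated DEM and the reference
    DEM; `mask` optionally restricts the metric to valid pixels."""
    diff = np.asarray(dem_est, float) - np.asarray(dem_ref, float)
    if mask is not None:
        diff = diff[mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```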

**Figure 18.** In the case of performing phase filtering, the DEM results of six phase unwrapping methods on real data. (**a**) LS. (**b**) QGPU. (**c**) PUMA. (**d**) SNAPHU. (**e**) CNN [20]. (**f**) Proposed method.

**Figure 19.** In the case of performing phase filtering, the DEM errors on real data. (**a**) DEM errors of the least squares method. (**b**) DEM errors of QGPU. (**c**) DEM errors of PUMA. (**d**) DEM errors of SNAPHU. (**e**) DEM errors of CNN [20]. (**f**) DEM errors of the proposed method.

**Figure 20.** In the case of performing phase filtering, the fitted DEM error histogram curves of six phase unwrapping methods on real InSAR data.

**Table 4.** In the case of performing phase filtering, the comparison of six phase unwrapping methods using root mean square error (RMSE) on real InSAR data.


#### *4.6. Robustness Testing of Phase Unwrapping on Real InSAR Data*

In real data processing, phase filtering is often applied before unwrapping to improve accuracy. Even after filtering, some noise remains, and its level depends on the noise suppression performance of the filtering algorithm and on the quality of the wrapped phase [38]. To evaluate the robustness of the proposed method, this section presents the DEM estimation results for Figure 17a without phase filtering.

The DEMs obtained from the phase unwrapping results of the six methods without phase filtering are shown in Figure 21, and Figure 22 shows the corresponding errors with respect to the reference DEM. The fitted error histogram curves are shown in Figure 23, and the RMSEs between the DEM solutions and the reference DEM are listed in Table 5. The unwrapping accuracy of the proposed method is still the highest among the six methods because its RMSE is the smallest. Compared with Section 4.5, the LS and QGPU methods failed, and the performances of the PUMA, SNAPHU, and CNN [20] methods degraded significantly at the higher noise level, their RMSEs increasing by 85.04%, 39.60%, and 66.16%, respectively. The performance of the proposed method decreased only slightly, its RMSE increasing by 7.25%. Based on the above analysis, the proposed method has better anti-noise performance than the other five methods in real data processing.

**Figure 21.** In the case of no phase filtering, the DEM results of six phase unwrapping methods on real data. (**a**) LS. (**b**) QGPU. (**c**) PUMA. (**d**) SNAPHU. (**e**) CNN [20]. (**f**) Proposed method.

**Figure 22.** In the case of no phase filtering, the DEM errors on real data. (**a**) DEM errors of the least squares method. (**b**) DEM errors of QGPU. (**c**) DEM errors of PUMA. (**d**) DEM errors of SNAPHU. (**e**) DEM errors of CNN [20]. (**f**) DEM errors of the proposed method.

**Figure 23.** In the case of no phase filtering, the fitted DEM error histogram curves of six phase unwrapping methods on real InSAR data.

**Table 5.** In the case of no phase filtering, the comparison of six phase unwrapping methods using root mean square error (RMSE) on real InSAR data.


#### **5. Discussion**

To analyze the influence of the number of blocks on unwrapping performance, we ran three additional experiments on simulated data with networks of 10, 8, and 5 blocks, respectively. Their RMSEs are listed in Table 6. As the network with eight blocks yielded the smallest RMSE, we selected it. From Tables 3 and 6, different block numbers led to only slight fluctuations in the RMSE of our method, whose unwrapping performance remained better than that of the other five unwrapping methods.

**Table 6.** Root mean square errors (RMSE) of the proposed method in networks with different numbers of blocks.


#### **6. Conclusions**

In this paper, a robust InSAR phase unwrapping method combining PGENet and a least squares solver was proposed to improve phase unwrapping accuracy. PGENet first estimates the horizontal and vertical phase gradients; the unwrapping result is then obtained by using the least squares solver to minimize the difference between the gradients estimated by PGENet and the gradients of the unwrapped phase. The PGENet gradients thus replace the gradients estimated by PGE-PCA in the traditional LS unwrapping method. PGENet can extract global high-level phase features and recognize the phase gradient between adjacent pixels from a large set of wrapped phase images with varied topography and coherence coefficients; it therefore estimates more accurate and robust phase gradients than PGE-PCA, which is why the proposed method achieves higher precision and better robustness than the traditional LS unwrapping method. Experimental results on simulated data showed that the proposed method has the highest unwrapping accuracy among six widely used unwrapping methods and is robust to noise. Furthermore, on real Sentinel-1 InSAR data, the proposed method also performed best among the six methods.
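The least-squares solver stage described above can be sketched independently of the network: given horizontal and vertical gradient estimates (produced by PGENet in the paper; any estimator here), the least-squares phase is the solution of a discrete Poisson equation, which the classical DCT-based solver of Ghiglia and Romero handles directly. This is a generic sketch of that solver, not the authors' implementation:

```python
import numpy as np
from scipy.fft import dctn, idctn

def ls_integrate_gradients(gx, gy):
    """Unweighted least-squares integration of phase gradient estimates.

    gx[i, j] ~ phi[i, j+1] - phi[i, j]  (last column zero)
    gy[i, j] ~ phi[i+1, j] - phi[i, j]  (last row zero)
    Returns the zero-mean phase that best fits both gradient fields.
    """
    m, n = gx.shape
    # Divergence of the gradient field: right-hand side of the Poisson equation.
    rho = np.zeros((m, n))
    rho[:, 0] = gx[:, 0]
    rho[:, 1:] = gx[:, 1:] - gx[:, :-1]
    rho[0, :] += gy[0, :]
    rho[1:, :] += gy[1:, :] - gy[:-1, :]
    # Diagonalize the Neumann Laplacian with a 2-D DCT and divide by its eigenvalues.
    i = np.arange(m)[:, None]
    j = np.arange(n)[None, :]
    denom = 2.0 * (np.cos(np.pi * i / m) + np.cos(np.pi * j / n) - 2.0)
    denom[0, 0] = 1.0                      # avoid 0/0; the DC term is free
    c = dctn(rho, type=2, norm='ortho') / denom
    c[0, 0] = 0.0                          # fix the free constant: zero-mean output
    return idctn(c, type=2, norm='ortho')
```

For consistent (noise-free) gradients this recovers the phase exactly up to an additive constant; with network-estimated gradients it returns the least-squares fit.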

The proposed method successfully combines deep learning with the traditional LS method for InSAR phase unwrapping. Its core is the accurate and robust phase gradient estimation of PGENet, which gives the proposed method its high accuracy and robustness. In future work, to achieve more accurate unwrapping, we will make targeted modifications to PGENet so that it can be paired with more traditional InSAR phase unwrapping methods. In addition, we will apply the proposed phase unwrapping method to more real InSAR data.

**Author Contributions:** Conceptualization, L.P., X.Z. and J.S.; methodology, L.P. and X.Z.; software, L.P., Z.Z. and S.W.; validation, L.P., Z.Z. and J.S.; formal analysis, L.P., S.W., L.L. and L.Z.; investigation, L.P., Z.Z., L.Z. and L.L.; resources, X.Z. and S.W.; data curation, L.P. and X.Z.; writing—original draft preparation, L.P., X.Z. and J.S.; writing—review and editing, L.P., L.L., L.Z. and S.W.; visualization, L.P., Z.Z., L.L. and L.Z.; supervision, X.Z. and J.S.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported in part by the National Key R&D Program of China under grant 2017YFB0502700, and in part by the National Natural Science Foundation of China under grants 61571099, 61501098, and 61671113.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We thank all editors and reviewers for their valuable comments and suggestions for improving this manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **TCD-Net: A Novel Deep Learning Framework for Fully Polarimetric Change Detection Using Transfer Learning**

**Rezvan Habibollahi 1, Seyd Teymoor Seydi 1, Mahdi Hasanlou 1,\* and Masoud Mahdianpari 2,3**


**Abstract:** Due to anthropogenic and natural activities, the land surface continuously changes over time. The accurate and timely detection of these changes is greatly important for environmental monitoring, resource management and planning activities. In this study, a novel deep learning-based change detection algorithm is proposed for bi-temporal polarimetric synthetic aperture radar (PolSAR) imagery using a transfer learning (TL) method. In particular, the method automatically extracts changes in three main steps: (1) pre-processing, (2) parallel pseudo-label training sample generation based on a pre-trained model and the fuzzy c-means (FCM) clustering algorithm, and (3) classification. Moreover, a new end-to-end three-channel deep neural network, called TCD-Net, is introduced. TCD-Net can learn stronger and more abstract representations of the spatial information of a given pixel. In addition, by adding an adaptive multi-scale shallow block and an adaptive multi-scale residual block to the TCD-Net architecture, the model is sensitive to objects of various sizes while using far fewer parameters. Experimental results on two Uninhabited Aerial Vehicle Synthetic Aperture Radar (UAVSAR) bi-temporal datasets demonstrated the effectiveness of the proposed algorithm compared to other well-known methods, with an overall accuracy of 96.71% and a kappa coefficient of 0.82.

**Keywords:** unsupervised change detection; polarimetric synthetic aperture radar (PolSAR); UAVSAR; multi-scale shallow block; multi-scale residual block

#### **1. Introduction**

The proliferation of remote sensing (RS) images at different temporal and spatial resolutions has increased their use in a wide range of global environmental and management applications, including change detection [1–5], target detection [6,7], wetland classification [8–10], oil spill detection [11–13], disaster monitoring [14,15] and so on. Change detection is one of the most important applications of RS and is essential for better resource management.

Change detection (CD) is the process of identifying changes, caused by manmade or natural factors, in multi-temporal Earth Observation (EO) data [16]. CD algorithms are commonly employed to monitor changes in different applications, including land use and land cover (LULC) [17,18], deforestation [19], urban development [20] and natural disaster [20].

In recent years, Synthetic Aperture Radar (SAR) sensors have become one of the most popular alternatives to other RS techniques because they can image in all weather conditions, day or night. SAR sensors are capable of penetrating clouds, rain, smoke, snow, dust and so on, so these factors do not limit their imaging capability. In addition, SAR sensors carry their own source of illumination to detect targets, so the lighting conditions of the area do not affect their imaging [21].

**Citation:** Habibollahi, R.; Seydi, S.T.; Hasanlou, M.; Mahdianpari, M. TCD-Net: A Novel Deep Learning Framework for Fully Polarimetric Change Detection Using Transfer Learning. *Remote Sens.* **2022**, *14*, 438. https://doi.org/10.3390/rs14030438

Academic Editors: Gwanggil Jeon, Tianwen Zhang, Xiaoling Zhang and Tianjiao Zeng

Received: 12 December 2021 Accepted: 17 January 2022 Published: 18 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

In general, SAR systems have more advantages than optical sensors in CD applications, because of their ability to acquire periodic images, regardless of weather or daylight [22].

Polarization is a property of an electromagnetic wave, described by the geometric locus traced over time by the electric field vector on a plane perpendicular to the direction of propagation [22]. PolSAR systems can transmit and receive waves in a variety of linear or circular polarizations, which provides more scattering information from different aspects of a target. For linear polarization, two common basis polarizations are used: horizontal (H) and vertical (V) linear polarization. For circular polarization, the right-handed and left-handed circular basis polarizations are used. In PolSAR systems, waves can be transmitted and received both in the cross-polarizations (e.g., HV or VH) and in the co-polarizations (e.g., HH or VV) [23]. With fully polarimetric data, substantially more polarization information can be extracted because phase measurements between the different polarization channels become available [22]. Nevertheless, the microwave imaging mechanism of PolSAR makes the background more complicated and mixes up the features of the region, which manifests as structural sensitivity, geometric distortion of the image, interference in the imaging systems, and speckle noise. Therefore, compared to other types of EO data, detecting changes in SAR data is more challenging and has thus been less investigated.
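The inter-channel phase information mentioned above is usually organized, per pixel, into a covariance or coherency matrix. A minimal sketch of the Pauli coherency matrix (assuming reciprocity, Shv = Svh; the function name is ours):

```python
import numpy as np

def pauli_coherency(shh, shv, svv):
    """Per-pixel 3x3 coherency matrix T = k k^H from the Pauli scattering
    vector k = [Shh+Svv, Shh-Svv, 2*Shv] / sqrt(2); reciprocity Shv = Svh
    is assumed. T is Hermitian and its trace equals the total power (span)."""
    k = np.stack([shh + svv, shh - svv, 2.0 * shv], axis=-1) / np.sqrt(2.0)
    return k[..., :, None] * np.conj(k[..., None, :])
```

In practice the single-look outer product is spatially averaged (multi-looked) before being used as a CD feature.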

The three main steps of unsupervised CD methods for a SAR image can be summarized as (1) pre-processing, (2) difference image (DI) generation, and (3) analysis of the DI to generate the change map [24,25]. In SAR image pre-processing, multi-looking, co-registration of the images [26] and speckle filtering [27] are the main techniques. The quality of the DI has a significant impact on the final change map. Two common methods for DI production are image differencing and image ratioing. Their main advantage is simplicity, but they do not consider edge or neighboring information and are therefore sensitive to the speckle noise level [28]. The mean operator [29], by contrast, considers neighboring information and has an excellent inhibitory effect on isolated noisy points. To extract more robust features and improve detection performance, transformation-based models have been proposed for SAR CD [30]. These approaches transform the raw feature vectors into a new feature representation to reduce the impact of noise, suppress *no-change* areas and highlight the *changes* in the new feature space [31]. For instance, principal component analysis (PCA), multivariate alteration detection (MAD) and iteratively reweighted multivariate alteration detection (IR-MAD) have been utilized in PolSAR CD [31]. Recent studies show that transformation-based models have a high ability to extract information [31,32]. However, in these methods, manual feature extraction and the identification of information-rich components remain important challenges. Furthermore, these algorithms are pixel-based and do not consider spatial features (e.g., texture).
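As a concrete illustration of the DI operators discussed above, the log-ratio image and a neighbourhood mean-ratio image can be sketched as follows; `eps` and the uniform averaging window are our choices, not taken from the cited papers:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def log_ratio_di(img1, img2, eps=1e-6):
    """Log-ratio difference image: the log turns multiplicative speckle
    into an additive term; eps guards against zero backscatter."""
    return np.abs(np.log((img2 + eps) / (img1 + eps)))

def mean_ratio_di(img1, img2, win=3, eps=1e-6):
    """Neighbourhood mean-ratio operator: local means bring in neighboring
    information and suppress isolated noisy pixels."""
    m1 = uniform_filter(np.asarray(img1, float), win)
    m2 = uniform_filter(np.asarray(img2, float), win)
    return 1.0 - np.minimum(m1, m2) / (np.maximum(m1, m2) + eps)
```

Both return values near zero for unchanged areas and larger values where the backscatter changed between dates.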

After DI generation, the DI is usually analyzed through thresholding or clustering strategies. The key point in thresholding methods is the choice of the threshold value. Several popular methods, such as the Kittler and Illingworth (KI) algorithm [33] and the expectation maximization (EM) algorithm [34], are used on SAR data. In these methods, a model must be established to fit the *no-change* and *change* class conditional distributions. These methods perform poorly when the *change* and *no-change* features overlap or when their statistical distributions are mistakenly modeled, and in some cases they require frustrating trial and error. In addition, a generalized KI (GKI) threshold selection algorithm [35], a histogram optimization method [36], and a semi-EM algorithm [37] have been used to automatically generate a threshold value for SAR data. Since SAR images are strongly affected by speckle noise, methods that determine thresholds automatically cannot eliminate its influence, because the noise affects the estimation of the parameters of the statistical model. Moreover, a single global threshold may not make sense for the entire image and may not cover all sections. Another approach to analyzing the DI is clustering, often based on k-means [38], multiple-kernel k-means [39], or fuzzy c-means (FCM) [40]. Although these algorithms are widely used in SAR CD, they have substantial disadvantages [38–40]. On the one hand, they are distance-based (Euclidean distance, Mahalanobis distance and so on) and thus very sensitive to speckle noise; on the other hand, they assume a balance between the *change* and *no-change* classes. In many cases, the *change* pixels are far fewer than the *no-change* pixels, i.e., the two classes are unbalanced, and traditional clustering methods produce numerous false alarms when faced with unbalanced data.
There are other clustering methods for SAR CD, such as the fuzzy local information c-means algorithm (FLICM) [41] and the reformulated FLICM algorithm (RFLICM) [42], which add local information to the fuzzy method. Clustering methods offer greater flexibility than thresholding methods because no model needs to be constructed; however, they remain sensitive to noise because of inadequate attention to spatial information.
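A bare-bones FCM of the kind applied to DI values in the works cited above can be sketched as follows; unlike FLICM/RFLICM it uses no local spatial information, and the deterministic initialization is our choice:

```python
import numpy as np

def fcm(x, n_clusters=2, m=2.0, n_iter=50):
    """Plain fuzzy c-means on a 1-D feature vector (e.g. flattened DI
    values): alternate fuzzy-membership and center updates."""
    x = np.asarray(x, float).ravel()
    centers = np.linspace(x.min(), x.max(), n_clusters)  # deterministic init
    for _ in range(n_iter):
        d = np.abs(x[:, None] - centers[None, :]) + 1e-12  # point-center distances
        u = d ** (-2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)                  # fuzzy memberships
        um = u ** m
        centers = (um * x[:, None]).sum(axis=0) / um.sum(axis=0)
    return u, centers
```

A hard change map follows from `u.argmax(axis=1)`, which is exactly where the distance-based sensitivity to speckle discussed above shows up.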

Recently, several deep learning (DL) algorithms, such as stacked auto-encoders (SAEs) [43], deep belief networks (DBNs) [44], convolutional neural networks (CNNs) [2], recurrent neural networks (RNNs) [45], pulse coupled neural networks (PCNNs) [46] and generative adversarial networks (GANs) [47], have been proposed for detecting changes in EO data. Among these DL methods, the CNN is commonly employed as a feature extractor for visual tasks. One of the most important advantages of CNNs is the automatic extraction of low- to high-level features; therefore, unlike the PCA, MAD and IR-MAD algorithms, CNNs do not require manual feature selection and extraction.

#### *1.1. Related Works*

The various DL approaches can be divided into several categories. In this study, DL approaches for CD are classified into three categories based on the learning technique and the availability of labeled or unlabeled training data: (1) supervised methods, which train the network using labeled datasets; (2) unsupervised methods, which learn from unlabeled datasets; and (3) semi-supervised methods, which learn from both labeled and unlabeled datasets.

#### 1.1.1. DL Supervised Methods for EO Data

Training deep supervised neural networks poses many challenges, the most important of which is the need for a large training dataset. This need, especially in RS applications where access to the area is sometimes impossible, remains one of the most substantial challenges. Numerous studies have examined the performance of supervised networks in CD applications and have shown that deep neural networks can generate accurate change maps when large amounts of labeled training data are available.

Mou et al. [48] have proposed a supervised dual-branch end-to-end neural network for CD. In this network, a CNN and an RNN are joined, forming a recurrent convolutional neural network (ReCNN) deep architecture. The algorithm was implemented in three main steps: (1) first, convolutional layers construct feature maps automatically from each image in two separate branches; (2) second, after feature extraction from both images, a recurrent sub-network is embedded to preserve temporal dependence in the bi-temporal images; and (3) finally, the output of the recurrent sub-network enters a fully connected layer and a change map is extracted. More specifically, they used three types of recurrent sub-network, i.e., a fully connected RNN, long short-term memory (LSTM) and a gated recurrent unit (GRU), to compute the hidden state information for the current input and restore information [48]. Liu, Jiao, Tang, Yang, Ma and Hou [18] have presented a local restricted CNN (LRCNN), a new version of the CNN, in two main steps: (1) first, they proposed a similarity measure for PolSAR data and produced several layered difference images (LDIs) of the PolSAR images; the LDIs are then improved to discriminative enhanced LDIs (DELDIs) for CNN training, and (2) second, the CNN/LRCNN was trained for CD with tuned hyperparameters. Finally, a change map was obtained based on the optimized trained model [18]. Jaturapitpornchai et al. [49] have proposed a supervised method to identify new building structures in three main steps: (1) first, each pair of 256 × 256-pixel patches at time 1 and time 2 is concatenated and fed to a U-Net-based network; they used HH-polarization ALOS-PALSAR images over the same area at different times; (2) second, a prediction map is derived from the trained U-Net-based model; and (3) finally, by applying a threshold of 0.5, a binary map indicating the positions of newly built constructions is produced [49].

Sun et al. [50] have proposed an end-to-end L-UNet architecture to leverage spatial and temporal characteristics simultaneously. This CD method was implemented in two steps: (1) first, they combined the convolutional and recurrent structures in one layer, introducing a Conv-LSTM layer, and (2) second, they substituted the standard convolutional layers of U-Net with Conv-LSTM layers, forming the new L-UNet architecture [50]. Cao et al. [51] have proposed a CD method for bi-temporal SAR images that introduces a deep denoising network to eliminate SAR image noise in three main steps: (1) first, a deep denoising model is trained on plenty of simulated SAR images to estimate the noise component, so the original SAR image can be cleaned by removing it; (2) second, a denoised DI is generated from the new image pair, and (3) finally, using a three-layer CNN, the denoised DI is classified into *changed* and *no-changed* regions [51]. Wang et al. [52] have designed a new deformable residual CNN (DRNet) for SAR image CD. The DRNet is used to adjust the sampling locations, and two stages are added before the regular convolution: (1) offset field generation and (2) deformable feature map generation. Moreover, a new pooling module called residual pooling was designed by replacing the conventional pooling with a set of smaller pooling kernels to discover multi-scale information about the ground objects.

#### 1.1.2. DL Unsupervised Methods for EO Data

Various supervised DL methods, including CNNs, have demonstrated satisfactory results in computer vision tasks when accompanied by large labeled datasets [18]. For CD tasks, however, the training datasets are often insufficient to construct such models. Additionally, constructing a ground truth (GT) map from real *change* information of terrestrial objects takes a lot of time and effort [17]. Consequently, in many cases, it is more effective to learn *change* features with an unsupervised approach.

For instance, Kiana et al. [53] have proposed an unsupervised CD method for SAR images using the Gaussian mixture model (GMM). The CD framework was implemented in two main steps: (1) first, using the GMM, three Gaussian distributions were modeled (i.e., positive *change*, negative *change* and *no-change* distributions); (2) then, two thresholds were calculated as the intersection points of the distributions: before the first threshold, pixels are *negative changes*; between the two thresholds, they are *no-changes*; and after the second threshold, pixels are *positive changes* [53]. In thresholding methods where the statistical distribution is modeled, the statistical parameters may be difficult to estimate when *change* and *no-change* pixels overlap; in PolSAR data, this problem can be more pronounced because of the strong effect of speckle noise. Liu et al. [54] proposed an unsupervised symmetric convolutional coupling network (SCCN) for CD based on heterogeneous SAR and optical images. They defined a coupling function to determine the network parameters. This CD method was implemented in two steps: (1) first, each of the two images is fed to one side of the SCCN and transferred to a feature space in which the two input images have more harmonious features, and (2) second, a difference map is computed directly through pixel-wise Euclidean distances in the feature space [54]. Bergamasco et al. [55] have proposed an unsupervised CD method based on convolutional auto-encoder (CAE) feature extraction in two steps: (1) first, to train the CAE, the reconstruction error between the reconstructed output and the input, drawn from unlabeled single-time Sentinel-1 image patches, was minimized, and (2) second, the trained CAE was used to extract multi-scale features from both bi-temporal images and derive a change map [55]. Huang et al. [56] have developed an unsupervised DL algorithm that detects changes in buildings from RS images in two steps: (1) first, a convolutional layer is employed to extract spatial, texture and spectral features and produce a low-level feature vector for each pixel, and (2) second, a model based on a deep belief network and extreme learning machine (DBN-ELM) is applied: a DBN is pre-trained on unlabeled samples and the two are then jointly optimized through an ELM classifier [56].

In some cases, a pre-training step is first performed and the pixels most likely to belong to the *change* and *no-change* classes are extracted; these pixels are then used to train the model. For instance, Gao et al. [57] proposed a pre-training scheme in two main steps: (1) first, they used a logarithmic ratio operator and a hierarchical FCM classifier to generate pseudo-label training samples, and (2) next, pixels were classified into *change* and *no-change* classes by a model integrating a CNN with the dual-tree complex wavelet transform, called CWNN [57]. In addition, Zhang et al. [58] proposed an automated method to detect changes in bi-temporal SAR images based on a pre-training scheme and the PCANet algorithm in two main steps: (1) first, a parameterized pooling algorithm is used to develop a deep difference image (DDI); sigmoid nonlinear mapping with two different parameters is then applied to the DDI to give two mapped DDIs, and parallel FCM is applied to produce three types of pseudo-label training pixels: *changed*, *no-changed* and *intermediate*; (2) next, a support vector machine (SVM) was trained using the *changed* and *no-changed* pixels, and the trained model was finally used to classify the *intermediate* pixels and generate a change map [58]. In such methods, the accuracy of the pre-training step is crucial: if the pixels are extracted with little precision, the network will not be trained properly. Therefore, the training pixels must be extracted with high accuracy.
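The sample-selection idea behind these pre-training schemes can be sketched as a simple thresholding of a change-membership map; the thresholds and the function name are illustrative, not taken from [57] or [58]:

```python
import numpy as np

def select_pseudo_labels(memb_change, hi=0.9, lo=0.1):
    """Pseudo-label selection: pixels whose 'change' membership exceeds
    `hi` become changed samples (1), those below `lo` become no-changed
    samples (0), and the rest stay intermediate (-1) for the trained
    model to classify later."""
    memb_change = np.asarray(memb_change, float)
    labels = np.full(memb_change.shape, -1, dtype=int)  # -1 = intermediate
    labels[memb_change >= hi] = 1                       # confident change
    labels[memb_change <= lo] = 0                       # confident no-change
    return labels
```

Only the confident pixels feed the classifier, which is why the precision of this step matters more than its recall.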

In a few cases, a fake-GT is generated with unsupervised methods and used to minimize the DL method's loss function. For instance, Liu et al. [59] developed a CNN-based CD approach whose network was trained with a two-part loss function. The CD framework was implemented in three main steps: (1) first, a U-Net model was pre-trained on an open-source dataset, and the Euclidean distance (ED) was computed between the two feature vectors extracted for each pair of pixels in the bi-temporal images; (2) second, based on a fake-GT, the first part of the loss function minimizes the ED for *no-changed* pixels and maximizes it for *changed* pixels, while the second part is designed to transfer the pre-trained model to the target dataset; and (3) finally, after training, k-nearest neighbors clustering is applied to extract a change map [59].
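The first part of such a two-part loss can be sketched as a contrastive objective over per-pixel feature distances; the margin-hinge form here is our assumption, and the exact loss in [59] may differ:

```python
import numpy as np

def contrastive_cd_loss(f1, f2, fake_gt, margin=1.0):
    """Per-pixel Euclidean distance between bi-temporal feature vectors is
    minimized where the fake-GT says no-change (0) and pushed above a
    margin where it says change (1)."""
    d = np.linalg.norm(np.asarray(f1, float) - np.asarray(f2, float), axis=-1)
    per_pixel = np.where(fake_gt == 0, d ** 2, np.maximum(0.0, margin - d) ** 2)
    return float(per_pixel.mean())
```

The loss is zero when no-change pixels have identical features and change pixels are at least `margin` apart in feature space.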

#### 1.1.3. DL Semi-Supervised Methods for EO Data

In semi-supervised learning, a small labeled dataset is coupled with large quantities of unlabeled data to form a model; it thus falls between unsupervised and supervised learning. One of the most common semi-supervised approaches is TL fine-tuning [60]. When there are not enough samples, TL can be used to adapt features learned in previous tasks by fine-tuning a network pre-trained on general images. To achieve this, the final layers of the pre-trained network are usually retrained on the little data available. Following this approach, Kutlu and Avcı [61] have proposed a method based on AlexNet fine-tuning. They employed a CNN, discrete wavelet transforms (DWT) and an LSTM to obtain the feature vector, translate and strengthen the feature vector, and classify the signal, respectively. The framework was implemented in three main steps: (1) first, they fine-tuned the AlexNet architecture to extract useful features; (2) then, they applied a one-dimensional DWT to each feature vector to obtain the approximation coefficients by convolving the signals with a low-pass filter; and (3) finally, the LSTM was used for classification [61]. Venugopal [62] has introduced a semi-supervised CD method based on ResNet-101 fine-tuning in three steps: (1) first, two bi-temporal SAR images were converted to grayscale images to compute the similarity between them; (2) second, a ResNet-101-based multiple dilated deep neural network was fine-tuned to extract the feature sets; and (3) finally, semantic segmentation was applied to detect changes between the two SAR images [62].

Some networks are made up of several sub-networks, each with a specific purpose; however, this may greatly increase the number of network parameters. To overcome this, some of these sub-networks use the parameters of pre-trained models with zero learning rates. Following this approach, Zhang and Shi [1] proposed an approach based on a deep feature difference CNN (FDCNN) comprising two sub-networks, FD-Net and FF-Net, where FD-Net is trained by sharing parameters with VGG16 and FF-Net is trained on a few pixel-level samples. The CD framework was implemented in three main steps: (1) first, VGG16 is trained on RS datasets to learn deep features; (2) second, the FDCNN is trained with the proposed change-magnitude-guided loss function using a few pixel-level training samples; and (3) third, a binary change map is derived by thresholding the change magnitude map inferred by the FDCNN [1]. Peng, Bruzzone, Zhang, Guan, Ding and Huang [4] have proposed a new SemiCDNet based on a GAN in two main steps: (1) first, they used both labeled and unlabeled data to generate initial predictions (segmentation maps) and entropy maps with an adapted UNet++ model as the generator, optimizing UNet++ in a supervised manner with a binary cross-entropy loss, and (2) second, in the discriminator phase, they introduced two discriminators to enforce the distribution consistency of segmentation maps and entropy maps between labeled and unlabeled data [4]. Although semi-supervised algorithms reduce the need for training data, they can still be challenging in RS applications because they still require high-quality training data.
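The zero-learning-rate idea common to these fine-tuning schemes (freeze the pre-trained feature layers, update only the task head) can be shown with a toy linear model; everything here is a stand-in for illustration, not any of the cited architectures:

```python
import numpy as np

def finetune_step(w_frozen, w_head, x, y, lr=0.01):
    """One gradient step of TL fine-tuning on a toy two-layer linear model:
    the 'pre-trained' layer w_frozen has an effective learning rate of zero,
    so only the task head w_head is updated (squared loss)."""
    h = x @ w_frozen                    # frozen feature extractor
    err = h @ w_head - y                # prediction error
    grad_head = h.T @ err / len(x)      # gradient w.r.t. the head only
    return w_frozen, w_head - lr * grad_head
```

In a real deep network the same effect is obtained by setting `requires_grad`/learning rate to zero on the pre-trained layers and retraining only the final ones.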

#### *1.2. Problem Statements and Contribution*

As mentioned earlier, the performance of deep learning-based CD methods depends heavily on the quality and quantity of the training data, so one of the main challenges in applying DL to CD is providing enough training samples. On the other hand, most deep networks developed for change detection are single-channel or dual-channel. In single-channel architectures, the network takes only one input, so the two images must be combined into one, usually by differencing or stacking them; as a result, information is lost. In dual-channel architectures, each image first enters its own channel and its features are extracted separately; then, as in single-channel architectures, the two feature vectors are merged into one and fed into a fully connected layer. Because there is usually no information transfer or connection between the channels, information is again lost.

To overcome these challenges, we propose a parallel pseudo-label training sample generation method. This method is based on a CNN model pre-trained carefully on two UAVSAR datasets and on an FCM algorithm. First, we use the pre-trained model and the TL technique to calculate the probability change map for our datasets. Then, to improve the reliability of the result, the model is combined in parallel with the FCM algorithm to select the samples that most likely belong to the *change* and *no-change* classes. Additionally, we introduce a novel end-to-end three-channel deep neural network, called TCD-Net. The three channels of TCD-Net are designed so that the first and third channels independently extract features from each image and identify the objects in each image well, while the second channel identifies differences and transfers information from the low levels to the high levels. Compared with a single- or dual-channel architecture, this three-channel architecture not only provides a feature representation of each image but also identifies changes at various levels. In addition, the connections between the three channels prevent information loss. Therefore, the proposed method can learn stronger and more abstract representations of the spatial information of a given pixel. We also utilize an adaptive multi-scale shallow block and an adaptive multi-scale residual block in the TCD-Net architecture to make the network robust to objects of various sizes with far fewer parameters and to transfer information to the final layer.

In particular, our proposed algorithm consists of three parts: (1) parallel pseudo-label training sample generation, (2) model optimization for TCD-Net, and (3) binary change map generation. Therefore, the main contribution of this study can be summarized as:


The rest of the paper is organized as follows. The methodology is described in Section 2. Section 3 presents the case study. Section 4 presents the experimental results and analyses. Section 5 provides the discussion. Finally, the conclusions and future work are presented in Section 6.

#### **2. Methodology**

In this section, we describe the details of the proposed method for CD. According to Figure 1, the general scheme of the proposed method consists of three main steps, including (1) pre-processing, (2) automatic training sample generation, and (3) end-to-end CD learning. We describe these three steps in detail in the following three sub-sections.

**Figure 1.** General scheme of the proposed unsupervised binary change detection (CD) method. CNN is convolutional neural network.

#### *2.1. Pre-Processing*

Data pre-processing is of great importance in PolSAR CD methods. Multi-looking, co-registration of images [26] and speckle filtering [27] are the main techniques in PolSAR image processing. PolSAR images are always affected by speckle noise, which makes the CD process more challenging; of the several available speckle filtering methods, we used the refined Lee filter with a kernel size of 5 [63]. Moreover, geometric correction was used to co-register the images for comparison and matching. Several ground control points (GCPs) were selected for modeling, and a second-order polynomial was used to resample the gray values. The final geometric correction accuracy (i.e., RMSE) was approximately 0.4 pixels.

#### *2.2. Automatic Training Sample Generation*

The purpose of this section is to produce pseudo-label training samples automatically, without human intervention. We use a pre-trained model, trained on large open-source UAVSAR datasets, to extract a probabilistic change map (PCHM); in effect, we apply the TL technique, since we use a pre-trained model instead of training one from scratch. In addition, to improve the reliability and robustness of the results, we combine the results of the pre-trained model and the results of the FCM algorithm in parallel. As shown in Figure 2, the proposed method consists of the following main steps:


$$\begin{cases} \quad (i,j) \in w\_c & \text{if } (i,j) \in w\_c^{\text{CNN}} \text{ and } (i,j) \in w\_c^{\text{FCM}}\\ \quad (i,j) \in w\_n & \text{if } (i,j) \in w\_n^{\text{CNN}} \text{ and } (i,j) \in w\_n^{\text{FCM}} \end{cases} \tag{1}$$

where $w\_c$ and $w\_n$ denote the *change* and *no-change* pixel sets, selected only when the CNN-based pre-trained model and the FCM algorithm agree.

**Figure 2.** Flowchart of the proposed parallel pseudo-label training sample generation. FCM is fuzzy c-means, PCHM is probabilistic change map, and TL is transfer learning.
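The selection rule of Equation (1) amounts to intersecting the confident labels of the two algorithms. The sketch below is illustrative, not the authors' implementation; the array names and the use of a single probability map per algorithm are assumptions, with the 95% confidence threshold taken from the description later in the paper:

```python
import numpy as np

def select_pseudo_labels(pchm_cnn, pchm_fcm, thr=0.95):
    """Keep only pixels that both the pre-trained CNN and the FCM algorithm
    assign confidently to the same class (Equation (1)).
    pchm_*: arrays of P(change) in [0, 1]. Returns a label map where
    1 = change, 0 = no-change, -1 = unlabeled (disagreement/low confidence)."""
    change = (pchm_cnn >= thr) & (pchm_fcm >= thr)             # w_c^CNN and w_c^FCM
    no_change = (pchm_cnn <= 1 - thr) & (pchm_fcm <= 1 - thr)  # w_n^CNN and w_n^FCM
    labels = np.full(pchm_cnn.shape, -1, dtype=int)
    labels[change] = 1
    labels[no_change] = 0
    return labels
```

Pixels on which the two algorithms disagree are simply left out of the pseudo-label set, which is what makes the aggregation more reliable than either algorithm alone.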

#### *2.3. End-to-End Change Detection Learning*

#### 2.3.1. Convolutional Layer

The convolutional layer is the core of CNNs. Each convolutional layer in a CNN contains a set of filters, and the output of the layer is derived from the convolution between the filters and the input. Each filter encodes a specific pattern and responds to that pattern in the image. In the network training process, these filters are expected to extract meaningful patterns from each image. Since detecting only one pattern does not lead to good results and limits the network's performance, the convolutional layer needs multiple filters. Therefore, the output of a convolutional layer is a set of different patterns called feature maps. The output of the convolutional layer in the *n*th layer is expressed by Equation (2).

$$F^n = g\left(w^n F^{n-1} + b^n\right) \tag{2}$$

where $F^{n-1}$ represents the neuron input from the previous layer, $n-1$; $g$ represents the activation function; $b^n$ represents the bias vector of the current layer; and $w^n$ represents the weight template of the current layer.

A 2D convolution can be used to compute the output of the $j$th feature map ($v$) within the $i$th layer at spatial location $(x, y)$, according to Equation (3).

$$\boldsymbol{\upsilon}\_{i,j}^{xy} = \mathbf{g} \left( \boldsymbol{b}\_{i,j} + \sum\_{m} \sum\_{r=0}^{R-1} \sum\_{s=0}^{S-1} \mathsf{W}\_{i,j,m}^{r,s} \boldsymbol{\upsilon}\_{i-1,m}^{(x+r)(y+s)} \right) \tag{3}$$

where $g$ is the activation function, $b$ is the bias, $m$ indexes the feature cubes in the previous layer connected to the current feature cube and $W\_{i,j,m}^{r,s}$ is the $(r, s)$th value of the kernel connected to the $m$th feature cube in the previous layer. Moreover, $R$ and $S$ are the length and width of the convolution kernel, respectively.
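Equation (3) can be rendered directly (if slowly) in NumPy. This is an illustrative sketch of the "valid" convolution the equation describes, not the paper's code:

```python
import numpy as np

def conv_layer(v_prev, W, b, g=lambda x: np.maximum(0.0, x)):
    """Valid 2D convolution per Equation (3).
    v_prev: (M, H, Wd) input feature maps from layer i-1
    W:      (J, M, R, S) kernels; b: (J,) biases; g: activation (ReLU here)."""
    M, H, Wd = v_prev.shape
    J, M2, R, S = W.shape
    assert M == M2, "kernel depth must match number of input feature maps"
    out = np.zeros((J, H - R + 1, Wd - S + 1))
    for j in range(J):                              # output feature map j
        for x in range(H - R + 1):
            for y in range(Wd - S + 1):
                acc = b[j]
                for m in range(M):                  # sum over input maps
                    acc += np.sum(W[j, m] * v_prev[m, x:x + R, y:y + S])
                out[j, x, y] = acc
    return g(out)
```

A 4 × 4 all-ones input convolved with a single 3 × 3 all-ones kernel yields a 2 × 2 map whose entries are all 9, which is a quick sanity check of the triple sum.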

#### 2.3.2. Multi-Scale Block

In RS imagery with meter- and sub-meter-level spatial resolution, there are many objects of different sizes. In addition, there are large structures and fine details in the texture of objects and ground scenes that need to be extracted. Since small-scale features, like short building edges, typically respond to smaller convolutional filters, while large-scale structures respond better to larger ones, we use a multi-scale convolutional block. The multi-scale convolutional block extracts helpful dynamic features and improves feature extraction. Using it, the network can continuously learn a set of features, and the scales at which those features occur, with a minimal increase in parameters.

According to Figure 3, in the *n*th layer of the multi-scale block, three sizes of convolutional filters are set: 1 × 1, 3 × 3 and 5 × 5. With a 1 × 1 convolutional kernel, features are extracted from the pixels themselves. A 3 × 3 convolutional kernel extracts features from a small neighborhood, and a 5 × 5 convolutional kernel extracts features over a larger range, which is suitable for continuous large-scale structures. In the traditional multi-scale approach, the number of filters, N, is the same for each kernel size and the output feature maps have a spectral dimension of 3N. The larger kernel sizes (i.e., 3 × 3 and 5 × 5) therefore require more processing time and add many parameters, so it is better to vary the number of filters per kernel size in the multi-scale block. To achieve this, this research develops an adaptive formula for determining the number of filters (NoFs) in a multi-scale block. To keep the total number of filters in each block constant and to prevent a large increase in the number of parameters, we assign more filters to the kernels with smaller length and width than to those with larger dimensions. According to Equation (4), NoFs is the total number of filters of a multi-scale block and is divided into NoF1, NoF2 and NoF3, the numbers of filters for the 1 × 1, 3 × 3 and 5 × 5 kernels, respectively:

$$\begin{cases} NoF\_1 = \alpha \times NoFs \\ NoF\_2 = \beta \times NoFs \\ NoF\_3 = \gamma \times NoFs \\ \text{s.t. } \alpha + \beta + \gamma = 1 \end{cases} \tag{4}$$

where *α*, *β* and *γ* are coefficients that determine the numbers of filters for the 1 × 1, 3 × 3 and 5 × 5 kernels, respectively. To reduce the network parameters, we choose these coefficients such that *α* > *β* > *γ*.
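Equation (4) can be realized with a small helper function. The default coefficients α = 0.5, β = 0.3, γ = 0.2 below are hypothetical values chosen only to satisfy α > β > γ; the paper does not state the values it uses:

```python
def allocate_filters(nofs, alpha=0.5, beta=0.3, gamma=0.2):
    """Split the total filter budget NoFs of a multi-scale block among the
    1x1, 3x3 and 5x5 kernels (Equation (4)), with alpha > beta > gamma so
    that smaller kernels receive more filters."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9 and alpha > beta > gamma
    nof1 = round(alpha * nofs)
    nof2 = round(beta * nofs)
    nof3 = nofs - nof1 - nof2   # remainder keeps the total exactly NoFs
    return nof1, nof2, nof3
```

Computing the 5 × 5 count as the remainder (rather than rounding γ·NoFs) guarantees the three counts sum exactly to the block's filter budget.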

**Figure 3.** A multi-scale shallow block.

#### 2.3.3. Residual Block

CNNs with deeper layers can generally model more complex patterns and have higher nonlinearity. Visual representations of feature maps show that a deeper network can extract more robust and abstract features [64]. However, there is a substantial problem in training a deep CNN: as the number of layers increases, the vanishing gradient problem during back-propagation worsens, so updating the convolutional kernels and bias vectors toward an optimal setting of all parameters becomes very slow. Additionally, it has been observed that as the number of layers gradually increases, the accuracy first increases, then saturates and finally decreases [64]. For this reason, residual learning has become one of the most effective solutions for training deep CNNs. It replaces the convolutional filtering process $F^n = G^n(F^{n-1})$ with $F^n = F^{n-1} + G^n(F^{n-1})$, called a "skip connection", so that $G^n$ predicts the residual $F^n - F^{n-1}$. This research uses a combination of the multi-scale block and the residual block, called the multi-scale residual block (Figure 4).

**Figure 4.** A multi-scale residual block.
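The skip connection described above amounts to adding the block input back to the block output. A minimal sketch, with `g` standing in for the multi-scale convolutional mapping (a toy scaling here, purely for illustration) and shapes assumed to match:

```python
import numpy as np

def residual_block(f_prev, g):
    """Residual learning: F^n = F^{n-1} + G^n(F^{n-1}), so the mapping g
    only has to model the residual F^n - F^{n-1}."""
    return f_prev + g(f_prev)

# toy mapping standing in for a multi-scale convolution
g = lambda x: 0.1 * x
out = residual_block(np.ones(3), g)   # each element: 1 + 0.1 = 1.1
```

Because the identity path bypasses `g`, gradients flow through the addition unattenuated during back-propagation, which is why the construction mitigates vanishing gradients in deep stacks.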

#### 2.3.4. TCD-Net for CD

Considering two images $I\_{t\_1}$ and $I\_{t\_2}$ taken over the same area at different times $t\_1$ and $t\_2$, the goal is to recognize the areas that have changed between the images. Assume that $\hat{CM}$ is the binary change map derived from $I\_{t\_1}$ and $I\_{t\_2}$ and $\hat{CM}\_{i,j}$ is the change value at location $(i, j)$. Generally, $\hat{CM}\_{i,j} \in \{0, 1\}$: $\hat{CM}\_{i,j} = 1$ indicates that $(i, j)$ is *changed*; otherwise, $(i, j)$ is *no-change*. We propose the TCD-Net architecture to generate the binary change map.

• Architecture

As shown in Figure 5, the proposed TCD-Net architecture includes three channels, each of which is a feature-extracting sub-network. Traditional DL-based CD requires converting the two images into one input for single-channel networks, which loses information. Dual-channel networks extract features from the two images and, in the last layer, convert these features into a vector that is fed to a fully connected layer; since there is no intermediate channel and no information transfer or connection between the channels at different levels, information is again lost. To prevent information loss, we use a three-channel network. Additionally, a multi-channel network converges faster than a single- or dual-channel one. In TCD-Net, the first and third channels take the bi-temporal images $I\_{t\_1}$ and $I\_{t\_2}$ separately, and the second channel learns change information from the features extracted by the first and third channels to obtain the DI. The first and third channels, which are symmetric, each contain an adaptive multi-scale shallow block, three adaptive multi-scale residual blocks and two max-pooling layers. The second channel consists of an adaptive multi-scale shallow block, two adaptive multi-scale residual blocks and two max-pooling layers. An adaptive multi-scale shallow/residual block contains one 1 × 1, one 3 × 3 and one 5 × 5 convolutional block, as mentioned before, where the number of filters for each kernel size is adaptive. In the multi-scale blocks, after concatenating the outputs of these three convolutional blocks, a 3 × 3 convolutional block adjusts the third dimension so that the layer input can be added to the output of this section. The number of features extracted by the multi-scale shallow/residual block therefore no longer needs to be fixed throughout the network.
Moreover, each convolutional block includes an activation function (rectified linear unit (ReLU)), batch normalization and many convolutional filters that extract deep features. We use $f\_l^{t\_1}$ and $f\_l^{t\_2}$, with $l = \{1, \dots, L\}$, to represent the features in the $l$th layer of the first and third channels, corresponding to $t\_1$ and $t\_2$, respectively. For instance, $f\_1^{t\_1}$ represents the features extracted by the multi-scale shallow block in the first layer of the first channel (corresponding to $t\_1$). Finally, we obtain the features $f\_L^{t\_1}$ and $f\_L^{t\_2}$ for $I\_{t\_1}$ and $I\_{t\_2}$, respectively. In the second channel, which we call the intermediate channel, new features, called intermediate features, are also extracted and denoted by $f\_l^{m}$. At the first layer, the features extracted in the first and third channels are subtracted, $f\_1^{t\_1} - f\_1^{t\_2}$, and fed to the second channel, where the difference enters the multi-scale shallow block and $f\_1^{m}$ is extracted. In the next layers, change information is inferred from $f\_2^{t\_1} - f\_2^{t\_2} + f\_1^{m}$, $f\_3^{t\_1} - f\_3^{t\_2} + f\_2^{m}$, ..., $f\_L^{t\_1} - f\_L^{t\_2} + f\_{L-1}^{m}$. That is, at each layer, the features extracted in the first and third channels are subtracted and then added to the intermediate features of the previous layer, which makes our algorithm very powerful in detecting changes. In the last layer, $f\_L^{t\_1} - f\_L^{t\_2} + f\_{L-1}^{m}$ is flattened and fed into a fully connected layer with a ReLU activation function. Moreover, we use ReLU after each convolutional layer as a piecewise linear activation function.
The ReLU function can be formulated using Equation (5).

$$f(\mathbf{x}) = \max\left(0, \mathbf{x}\right) \tag{5}$$

**Figure 5.** The proposed TCD-Net architecture for CD of remote sensing (RS) datasets.
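The intermediate-channel recursion described above can be sketched schematically. Here `shallow_block` and `residual_blocks` are placeholder callables standing in for the adaptive multi-scale blocks, and per-layer features are assumed to already have matching shapes; this is an illustration of the information flow, not the paper's layers:

```python
import numpy as np

def tcd_forward(feats_t1, feats_t2, shallow_block, residual_blocks):
    """Schematic TCD-Net intermediate channel: per-layer features of the
    first and third channels are differenced and fused with the previous
    intermediate features f^m_{l-1}.
    feats_t1, feats_t2: lists [f_1, ..., f_L] of per-layer features."""
    f_m = shallow_block(feats_t1[0] - feats_t2[0])          # layer 1
    for l, block in enumerate(residual_blocks, start=1):    # layers 2..L
        f_m = block(feats_t1[l] - feats_t2[l] + f_m)
    return f_m   # flattened and fed to the fully connected layers
```

With identity blocks and two layers, the output is simply the layer-2 difference plus the layer-1 difference, showing how change evidence accumulates across levels.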

The last fully connected layer is a *softmax* layer. In general, this layer models categorical probability distributions and calculates the probability that each pixel belongs to the *change* and *no-change* classes. Finally, the pixels are divided into the two categories of *change* and *no-change*. The *softmax* function is expressed in Equation (6).

$$f(\mathbf{x}\_i) = \frac{e^{\mathbf{x}\_i}}{\sum\_{j} e^{\mathbf{x}\_j}} \tag{6}$$
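A numerically stable rendering of Equation (6), using the standard trick of subtracting the maximum logit before exponentiating (the shift cancels in the ratio):

```python
import numpy as np

def softmax(x):
    """Equation (6): categorical probabilities over the change/no-change logits."""
    z = np.exp(x - np.max(x))   # shift for numerical stability
    return z / z.sum()
```

For the two-class case here, equal logits give probability 0.5 for each of *change* and *no-change*.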

#### • Model Optimization

As shown in Figure 1, after the automatic training sample generation phase, the samples generated are divided into three categories: training, testing and validation datasets.

TCD-Net is trained on the training dataset, and the loss value is calculated by the loss function on the validation dataset. There is no analytical method for optimizing CNN parameters, so the model parameters are adjusted iteratively; in this research, an Adam optimizer is used. The model is first run with the initial parameter values, then its output is compared with the actual values. The training error is fed to the optimizer, which updates the parameters. In an iterative process, the gradient is descended to minimize the total output error, continuing until a stop condition is reached, i.e., a certain number of iterations or a certain (minimum) error. Through back-propagation, the parameters are updated at each step to decrease the discrepancy between the network output and the training/validation dataset. Finally, the test data are used to evaluate network performance.

In this research, cross-entropy is used as the loss function of the proposed architectures. The performance of the network, given the inputs and the labels, is calculated by the cross-entropy function for the network outputs (*y*) and targets (*t*) using Equation (7):

$$E = -\frac{1}{n} \sum\_{j=1}^{n} \sum\_{i=1}^{k} \left[ t\_{ij} \ln y\_{ij} + \left(1 - t\_{ij}\right) \ln \left(1 - y\_{ij}\right) \right] \tag{7}$$

where $n$ is the number of training samples and $k$ is the number of classes. Additionally, $t\_{ij}$ is the $(i,j)$th entry of the target matrix and $y\_{ij}$ is the corresponding entry of the network output matrix.
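Equation (7), written with the conventional leading minus sign so the loss is non-negative, can be sketched as follows (the clipping constant is an illustrative guard against log(0), not part of the paper):

```python
import numpy as np

def cross_entropy(t, y, eps=1e-12):
    """Binary cross-entropy of Equation (7), averaged over n samples.
    t: (n, k) target matrix; y: (n, k) predicted probabilities."""
    y = np.clip(y, eps, 1 - eps)          # avoid log(0)
    n = t.shape[0]
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y)) / n
```

A confident, correct prediction (e.g., target [1, 0] with output [0.9, 0.1]) yields a small loss of about 0.211, while the loss grows without bound as the prediction approaches the wrong class.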

#### 2.3.5. Accuracy Assessment

Accuracy assessment is an integral part of any RS task and is done in two ways. In the first approach, the results of the proposed method are compared with GT data and in the second approach with sample data. In this study, the final results of the proposed CD method are compared quantitatively as well as qualitatively with the GT data and the results of other SOTA CD methods. The quantitative comparison is based on the metrics described subsequently. Based on the CD results and the GT data, there are four modes: (1) if both the GT data and result are positive, it is considered as True Positive (TP); (2) if the GT data is positive and the result is negative, it is considered as False Negative (FN); (3) if both the GT data and the results are negative, it is considered as True Negative (TN); and (4) if the GT data is negative but the result is positive, it is considered as False Positive (FP). With the help of these four values, the essential criteria such as false-positive rate (FPR) (also called false alarm rate), true-positive rate (TPR) (also called hit rate and recall), false-negative rate (FNR), overall accuracy (OA), precision, detection rate (DR), F1-score, overall error rate (OER), Prevalence (PRE) and kappa coefficient (KC) are calculated by the following relationships shown in Table 1.
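Since the criteria in Table 1 are the standard definitions, they can all be reproduced from the four confusion counts. A sketch (kappa computed via the usual expected-agreement form; a subset of the table's metrics for brevity):

```python
def cd_metrics(tp, fn, tn, fp):
    """Change-detection scores from the confusion counts of Section 2.3.5."""
    n = tp + fn + tn + fp
    oa = (tp + tn) / n                        # overall accuracy
    tpr = tp / (tp + fn)                      # hit rate / recall
    fpr = fp / (fp + tn)                      # false alarm rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kc = (oa - pe) / (1 - pe)                 # kappa coefficient
    return {"OA": oa, "TPR": tpr, "FPR": fpr,
            "precision": precision, "F1": f1, "KC": kc}
```

For example, tp = 40, fn = 10, tn = 45, fp = 5 gives OA = 0.85 and KC = 0.70, illustrating how kappa discounts the agreement expected by chance.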


**Table 1.** Formulas for accuracy assessment criteria.

#### 2.3.6. Comparative Methods

To assess the effectiveness of the intermediate channel of TCD-Net, this research compares the TCD-Net algorithm with a dual-channel deep network. This dual-channel network is identical to TCD-Net except that the intermediate channel is removed. To make the comparison fair, the dual-channel network is trained with the same training samples extracted in the pseudo-label sample generation phase. In addition, the following unsupervised SOTA CD methods are compared and analyzed to confirm the efficiency of TCD-Net: PCA\_kmeans [65], NR\_ELM [66], Gabor\_PCANet [67], CWNN [57] and DP\_PCANet [58], which are described briefly below:


The parameters of these methods are set according to the corresponding publications.

#### **3. Case Study**

Two co-registered L-band UAVSAR full-polarimetric images are utilized to assess the performance of the proposed method. These two images cover the city of Los Angeles, California, and were acquired on 23 April 2009 and 11 May 2015 by the Jet Propulsion Laboratory/National Aeronautics and Space Administration (JPL/NASA) UAVSAR. There are 786 × 300 pixels in the first dataset and 766 × 300 pixels in the second dataset. Figure 6a,b,d,e shows the RGB (Red: |HH−VV|; Green: 2|HV|; Blue: |HH+VV|) Pauli images of the two subsets of the PolSAR scenes. The GT images associated with these subsets, shown in Figure 6c,f, were prepared for the numerical analysis of the CD results using Google Earth images. The GT image is a binary image in which black pixels are *no-change* and white pixels are *change*. The first and second datasets are called dataset#1 and dataset#2, respectively.

**Figure 6.** Pauli decomposition of UAVSAR images taken over Los Angeles, California on (**a**,**d**) 23 April 2009; (**b**,**e**) 11 May 2015; (**c**,**f**) ground truths, where white means change area and black means no-change area. Top: dataset#1. Bottom: dataset#2.

#### **4. Experimental Results and Analysis**

#### *4.1. Parameter Setting*

In NR\_ELM and CWNN, the parameters are a neighborhood size of r = 3 × 3 and a patch size of w = 7, respectively. The PCANet parameters are an image patch size of k = 5, the numbers of filters L1 = L2 = 8 and training samples amounting to 30% of the total data. In PCA\_kmeans, a patch size of h = 5 is used. In Gabor feature extraction, the orientation of the Gabor kernel U = 8, the scale of the Gabor kernel V = 5, the maximum frequency kmax = 2π and the spacing factor between kernels in the frequency domain f = √2 are used. To generate the DDI and parallel clustering in DP\_PCANet, a center bias in the Sigmoid function of b = 0.1 and a number of pooled images accumulated to generate the DDI of T = 7 are used. For TCD-Net, Table 2 lists the configuration details for each channel, and Table 3 shows the total number of filters in each multi-scale block. The model parameters are trained with the mini-batch back-propagation algorithm with a batch size of 150. The error over 250 epochs is calculated from the chosen objective function and the parameters are then updated. The Adam optimizer, with an initial learning rate of 10 × 10<sup>−3</sup> and an epsilon value of 10 × 10<sup>−10</sup>, is used as the optimization algorithm.

**Table 2.** TCD-Net configurations of each channel and block.


$NoF\_{ij}^{k}$ is the number of filters of the $k$th convolutional layer of the multi-scale block in the $i$th channel of the $j$th block.


**Table 3.** The total number of filters in each multi-scale block.

#### *4.2. Pseudo-Label Training Sample Generation*

As previously mentioned, we first use the pre-trained model introduced in [2]. Table 4, which shows the results of the CD framework proposed in [2] for our case study, indicates that this model is not robust across all case studies; its performance depends on the objects in the study area. For this reason, we generate the PCHM using the pre-trained model and then, by applying a reliable threshold, extract the pixels that most likely belong to the *change* and *no-change* classes. The quantitative results show that this increases the performance significantly. In addition, we obtain a PCHM with FCM clustering. Finally, we extract the pixels identified by both algorithms as *change* or *no-change* pixels. The quantitative results show that the aggregation of these two algorithms greatly increases accuracy. For dataset#1, the OA and KC are 94.10% and 0.66 for FCM clustering, 93.85% and 0.75 for the TL-based classification, and 97.58% and 0.84 for the aggregation. For dataset#2, the OA and KC are 95.84% and 0.52 for FCM clustering, 97.64% and 0.82 for the TL-based classification, and 99.52% and 0.91 for the aggregation of the two methods. Therefore, the quantitative results show that the aggregation of these two methods improves the pseudo-label generation accuracy. We considered 5% of the total data as reference data and divided the reference data into 65% for training, 15% for validation, and 20% for testing (Table 5).
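The 5% reference sampling and the 65/15/20 split can be sketched as follows; the function name, seed and use of pixel indices are illustrative assumptions:

```python
import numpy as np

def split_reference(pixel_ids, ref_frac=0.05, fracs=(0.65, 0.15, 0.20), seed=0):
    """Draw a reference subset of the pseudo-labeled pixels, then split it
    into training / validation / testing sets (Section 4.2)."""
    rng = np.random.default_rng(seed)
    ref = rng.permutation(pixel_ids)[: int(ref_frac * len(pixel_ids))]
    n_tr = int(fracs[0] * len(ref))
    n_va = int(fracs[1] * len(ref))
    return ref[:n_tr], ref[n_tr:n_tr + n_va], ref[n_tr + n_va:]
```

From 10,000 pseudo-labeled pixels this yields a 500-pixel reference set split into 325 training, 75 validation and 100 test pixels.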



**Table 5.** The number of *change* and *no-change* pixels extracted from the parallel pseudo-label generation framework and the number of training, testing and validation pixels used in the training process of TCD-Net.


#### *4.3. Comparison of Results for Dataset#1*

Figure 7 illustrates the CD results for dataset#1. As seen, Figure 7a,c,d show the CD results for PCA\_kmeans, Gabor\_PCANet and DP\_PCANet, which contain many noisy pixels, while the other methods produce far fewer. Furthermore, NR\_ELM (Figure 7b) and CWNN (Figure 7e) have missed detections that are evident at the top and bottom of the study areas (red circles). In Figure 7, the red circles mark *no-change* pixels that have been detected as *changed* pixels by all of the methods except TCD-Net and the dual-channel deep network. TCD-Net and the dual-channel deep network give significantly better results than the other methods in detecting *no-change* pixels. However, TCD-Net (Figure 7g) discovers subtle *change* pixels better than the dual-channel deep network (Figure 7f). Therefore, TCD-Net provides promising results in the detection of both *change* and *no-change* pixels.

**Figure 7.** Visualized results of various CD methods on dataset#1; (**a**) PCA\_kmeans, (**b**) NR\_ELM, (**c**) Gabor\_PCANet, (**d**) DP\_PCANet, (**e**) CWNN, (**f**) dual-channel deep network, (**g**) TCD-Net, and (**h**) ground truth. The red circles highlight different output performances in *no-change* pixels.

In Table 6, CD quantitative results show that the TCD-Net algorithm performed better than other methods in terms of OA, KC, precision and DR indicators. In particular, the TCD-Net algorithm has the OA of 95.01% and the KC of 0.80, which is 6.35% and 0.37 more than PCA\_kmeans, 2.75% and 0.14 more than NR\_ELM, 3.29% and 0.15 more than Gabor\_PCANet, 2.43% and 0.11 more than DP\_PCANet, 2.24% and 0.10 more than CWNN, and 0.76% and 0.04 more than the dual-channel deep network. Furthermore, the TCD-Net algorithm has a much higher F1-score. Additionally, the OER is much lower in TCD-Net. These results show that TCD-Net is more effective in CD than other algorithms.


**Table 6.** The accuracy of different change detection methods for dataset#1.

#### *4.4. Comparison of Results for Dataset#2*

The CD results for dataset#2 are shown in Figure 8. Similarly, Gabor\_PCANet, NR\_ELM and DP\_PCANet mark many noisy pixels as *changed* while these pixels are *no-change*. Furthermore, most methods have many missed detections in the *no-change* areas, which are most evident at the top and middle of the region of interest; this can be seen in the CD results, illustrated by the red circles in Figure 8. Compared with dataset#1, the CD methods perform a little differently. In *change* areas, there are differences among the CD methods, a sample of which is illustrated by the green circles in Figure 8. The green circles show that the dual-channel deep network, despite its good performance on *no-change* pixels, cannot detect *change* pixels well. In contrast, TCD-Net gives considerably better results than the other methods in both classes. Additionally, TCD-Net is more sensitive to subtly *changed* pixels, which the other methods did not detect in much detail.

In Table 7, we display the values of the mentioned criteria to evaluate the performance of the CD methods. The results for dataset#2 also confirm the efficiency of the TCD-Net algorithm. As seen, TCD-Net has the highest OA, KC, F1-score, precision and DR. The OA and KC of TCD-Net are 96.71% and 0.82, which are 4.68% and 0.47 higher than PCA\_kmeans, 2.59% and 0.16 higher than NR\_ELM, 2.31% and 0.15 higher than Gabor\_PCANet, 2.11% and 0.14 higher than DP\_PCANet, 2.17% and 0.17 higher than CWNN, and 1.26% and 0.1 higher than the dual-channel deep network. The TCD-Net algorithm also has a much higher TPR and a much lower FNR. In addition, TCD-Net has a much higher KC and DR (approximately 15–60% in DR and 0.1–0.5 in KC). This shows that the proposed method performs better than the other methods implemented in this paper.

**Figure 8.** Visualized results of various CD methods on dataset#2; (**a**) PCA\_kmeans, (**b**) NR\_ELM, (**c**) Gabor\_PCANet, (**d**) DP\_PCANet, (**e**) CWNN, (**f**) dual-channel deep network, (**g**) TCD-Net and (**h**) ground truth. The red circles highlight different output performances in *no-change* pixels. The green circles highlight different output performances in *change* pixels.

**Table 7.** The accuracy of different change detection methods for dataset#2.


#### **5. Discussion**

In this section, we first compare the TCD-Net in terms of accuracy to other CD methods implemented in this article. Then we compare the TCD-Net with the results of other studies implemented on the UAVSAR datasets. Finally, we mention some of the challenges that the TCD-Net algorithm has resolved.

Most CD methods have low efficiency in detecting *change* pixels. In other words, the low values of indices such as precision, TPR and KC originate from the low efficiency of the CD algorithms in detecting *change* pixels. However, TCD-Net simultaneously has high precision, TPR and KC values on both datasets, which indicates high efficiency in detecting *change* pixels. Most algorithms have a reasonable OA value, which indicates that they have been successful in detecting *no-change* pixels; accordingly, the FPR is very low for most CD algorithms. For a better evaluation, *change* and *no-change* pixels should be considered together, so we consider both the TPR and FPR criteria. The TPR is defined on *change* pixels, so a low value indicates that an algorithm is weak in detecting them. Although the PCA\_kmeans algorithm has a low FPR and detects *no-change* pixels well, its TPR is very low, indicating poor performance in detecting *change* pixels. The NR\_ELM algorithm has a higher TPR than PCA\_kmeans, possibly because it uses neighborhood information, but it still has a higher FPR than TCD-Net. The Gabor\_PCANet, DP\_PCANet and CWNN algorithms have a much lower TPR on dataset#2 and a much higher FPR on dataset#1 than TCD-Net. The dual-channel deep network can discover *no-change* pixels well, but there are *changed* pixels, especially at edges in dataset#1 and in other areas in dataset#2, that it cannot detect. However, TCD-Net detects both *changed* and *no-change* pixels well. Comparison of the dual-channel deep network and TCD-Net shows that the intermediate channel plays a key role in detecting *change* pixels and can improve network performance.
These differences are because of the robust and strong architecture of the proposed algorithm (e.g., feature extraction at different levels, separate extraction of features of two images, intermediate connection, sensitivity to different object sizes, and extraction of high-precision training data).

In the following, we quantitatively compare TCD-Net with the results of other research applied to UAVSAR data, according to Table 8. Ratha et al. [68] proposed a method based on the geodesic distance (GD), i.e., the distance between an observed Kennaugh matrix and the Kennaugh matrix associated with an elementary target. This algorithm achieved FPR and KC values of 6.9% and 0.73, respectively, for dataset#1, and an FPR of 3.9% and a KC of 0.75 for dataset#2. Comparing the FPR values of the GD and TCD-Net shows that TCD-Net identifies pixels as *change* or *no-change* more accurately. Bouhlel, Akbari and Méric [3] proposed a determinant ratio test (DRT) statistic for automatic CD in bi-temporal PolSAR images, assuming that the multi-look complex covariance matrix follows the scaled complex Wishart distribution. The DRT algorithm obtained FPR and DR values of 10.58% and 63.38%, respectively, for dataset#1, and an FPR of 8.39% and a DR of 51.49% for dataset#2. The quantitative results demonstrate that TCD-Net provides, on average, a 20% higher DR and an 8% lower FPR than the DRT, which indicates the superiority of TCD-Net. Nascimento et al. [69] compared the likelihood ratio, the Kullback–Leibler (KL) distance, Shannon entropy and Rényi entropy, and showed that entropy-based algorithms may perform better than algorithms based on the KL distance and likelihood ratio statistics. Comparison of TCD-Net with the best entropy-based algorithms in [69] shows that TCD-Net has a much higher DR (about 30% on dataset#1 and 20% on dataset#2). In addition, the KC of TCD-Net is much higher than that of the entropy-based algorithm (by 0.18 on dataset#1 and 0.26 on dataset#2). The methods of [68] and [69] are statistical and operate in an unsupervised manner.
TCD-Net likewise operates unsupervised, yet it is more effective. One important factor in the accuracy of the proposed method is its use of deep features, whereas the other methods operate on the main polarization channels (i.e., HH, HV, VH and VV) only; the limited polarization channels and noise conditions prevent statistics-based methods from performing well. In terms of processing time, statistical algorithms are faster than DL-based algorithms, whose training phase is time-consuming; in general, however, DL-based algorithms are more accurate.


**Table 8.** Comparison of TCD-Net results with other methods developed on UAVSAR images.

As mentioned earlier, one of the main challenges of applying DL-based algorithms to CD applications is finding enough training data. Several studies have proposed methods for automatically extracting pseudo-label training samples, which have been employed in this study. In [67], Gabor wavelet features were used to exploit the changed information, and the FCM algorithm was implemented in a coarse-to-fine procedure to obtain enough pseudo-label training samples. In [58], a parallel FCM clustering was developed for SAR images, combining nonlinear sigmoid mapping, Gabor wavelets and parallel FCM to provide pseudo-label training pixels. These methods are pixel-based and do not take spatial information into account, which may produce isolated pixels in the output. In [66], a pre-classification step was implemented using a neighborhood-based ratio operator and hierarchical FCM clustering. Some studies have also used trained neural networks and TL techniques; in these methods, pixels are classified based on a global threshold, which can lead to mistakes and lower reliability in some cases. In contrast, our pseudo-label sample generation framework is probability-based: it keeps only pixels that the pre-trained model classifies with a probability of more than 95%, and aggregates them with the results of the FCM algorithm for greater reliability.

In addition, detecting changes in PolSAR images poses many challenges. For instance, extracting polarimetric decomposition parameters, a common step in conventional PolSAR CD methods, is time-consuming and challenging, especially when dealing with time-series data; selecting appropriate decompositions with high information content requires optimization algorithms that are also time-consuming. Furthermore, previous studies showed that adding spatial features to scattering information significantly increases the accuracy of CD methods.
However, extracting spatial features such as texture is challenging because of hardware limitations and long processing times. To overcome these problems, we present the TCD-Net algorithm, which extracts deep features from only four bands and requires no additional processing (e.g., feature extraction, feature selection or target decomposition). Additionally, DL-based CD methods automatically employ both spatial and spectral features, and this simultaneous use of spatial and spectral information makes the method more accurate and robust than other CD methods. The TCD-Net architecture also uses residual and multi-scale blocks: the residual blocks allow information to flow from the initial layers to the final layers, mitigating the problems of increasing network depth, while the multi-scale blocks increase the network's sensitivity to objects of different sizes.
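The parallel pseudo-label strategy (keeping only pixels the pre-trained model classifies with more than 95% confidence, aggregated with the FCM result) can be sketched as follows. This is a minimal illustration: the array names and the 0.5 class-decision rule are assumptions for the sketch, not the authors' exact implementation.

```python
import numpy as np

def generate_pseudo_labels(prob_map, fcm_labels, threshold=0.95):
    """Keep a pixel as a pseudo-label training sample only when the
    pre-trained model is confident (class probability above `threshold`)
    AND the FCM clustering result agrees with the model's decision."""
    model_labels = (prob_map >= 0.5).astype(np.uint8)   # change / no-change
    confident = np.maximum(prob_map, 1.0 - prob_map) > threshold
    agree = model_labels == fcm_labels
    mask = confident & agree                            # reliable pixels only
    return model_labels, mask

# Toy 2x2 example: only pixels that are both confident and consistent survive.
prob = np.array([[0.99, 0.60], [0.02, 0.97]])
fcm = np.array([[1, 1], [0, 0]], dtype=np.uint8)
labels, mask = generate_pseudo_labels(prob, fcm)
```

Pixels rejected by the mask are simply excluded from training rather than relabeled, which is what makes the retained samples high-reliability.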

#### **6. Conclusions and Future Work**

In this study, a novel end-to-end DL-based framework is proposed for detecting changes in polarimetric UAVSAR datasets. The proposed method addresses the challenges of conventional CD methods (i.e., thresholding and manual feature extraction) and the training data limitation of DL-based CD methods. First, we propose a parallel pseudo-label training sample generation framework, which generates high-reliability samples for TCD-Net training by combining, in parallel, the results of a pre-trained model and the FCM algorithm. Numerical analysis shows that the generated samples provide an OA of 99.52% and a KC of 0.91. Second, we construct a three-channel TCD-Net architecture based on an adaptive multi-scale shallow block and an adaptive multi-scale residual block, which are sensitive to objects of different sizes and maintain fundamental information by transferring it to higher layers; our proposed method is therefore highly efficient at extracting deep features. The performance of the proposed method is evaluated on two different UAVSAR datasets, and its results are compared with other SOTA PolSAR CD methods and with a dual-channel deep network to evaluate the effectiveness of the intermediate channel embedded in TCD-Net. The CD results are evaluated by visual inspection and numerical accuracy assessment indices. Experimental results show that TCD-Net achieves the highest OA of 96.71% and the best KC of 0.82.
In summary, compared to other CD algorithms, the proposed method has several advantages: (1) it is more accurate than other SOTA CD methods; (2) it provides robust results compared with a dual-channel deep network; (3) it is unsupervised and produces training data of appropriate quality and quantity; (4) it is robust against noise and complicated, multi-size objects; and (5) its end-to-end framework requires no pre-processing (e.g., manual feature extraction, feature selection and PolSAR target decomposition).

One of the limitations of SAR CD is the complexity and noise of SAR data, which can degrade CD results. In this regard, the fusion of multimodal datasets can improve CD results and enhance accuracy; the digital elevation model (DEM) is one of the most important datasets that can support CD in more detail. In addition, we intend to evaluate the performance of TCD-Net across single-, dual- and fully-polarized modes in the future.

**Author Contributions:** Conceptualization, R.H., S.T.S., M.H. and M.M.; methodology, S.T.S., writing original draft preparation, S.T.S.; writing—review and editing, S.T.S., M.H. and M.M.; visualization, R.H. and S.T.S.; supervision, M.H. and M.M.; funding acquisition, M.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Publicly available datasets were analyzed in this study. These datasets can be found here: [https://rslab.ut.ac.ir] (accessed on 15 January 2022).

**Acknowledgments:** The authors would like to thank the anonymous reviewers for their valuable comments on our manuscript.

**Conflicts of Interest:** The authors declare that they have no conflict of interest.

#### **References**


## *Article* **A Transformer-Based Coarse-to-Fine Wide-Swath SAR Image Registration Method under Weak Texture Conditions**

**Yibo Fan, Feng Wang \* and Haipeng Wang**

Key Laboratory for Information Science of Electromagnetic Waves (Ministry of Education), School of Information Science and Technology, Fudan University, Shanghai 200433, China; ybfan19@fudan.edu.cn (Y.F.); hpwang@fudan.edu.cn (H.W.)

**\*** Correspondence: fengwang@fudan.edu.cn

**Abstract:** As an all-weather and all-day remote sensing image data source, SAR (Synthetic Aperture Radar) images have been widely applied, and their registration accuracy has a direct impact on downstream task effectiveness. The existing registration algorithms mainly focus on small sub-images, and accurate matching methods for large-size images are lacking. This paper proposes a high-precision, rapid, large-size SAR image dense-matching method. The method mainly includes four steps: down-sampling image pre-registration, sub-image acquisition, dense matching, and the transformation solution. First, the ORB (Oriented FAST and Rotated BRIEF) operator and the GMS (Grid-based Motion Statistics) method are combined to perform rough matching on the semantically rich down-sampled image. In addition, according to the feature point pairs, a group of clustering centers and corresponding images are obtained. Subsequently, a deep learning method based on Transformers is used to register images under weak texture conditions. Finally, the global transformation relationship can be obtained through RANSAC (Random Sample Consensus). Compared with the SOTA algorithm, our method increases the number of correct matching points by more than 2.47 times and reduces the root mean squared error (RMSE) by more than 4.16%. The experimental results demonstrate that our proposed method is efficient and accurate, which provides a new idea for SAR image registration.

**Keywords:** synthetic aperture radar; image registration; transformer

#### **1. Introduction**

Synthetic aperture radar (SAR) has the advantages of working in all weather, at all times, and having strong penetrability. SAR image processing is developing rapidly in civilian and military applications. There are many practical scenarios for the joint processing and analysis of multiple remote sensing images, such as data fusion [1], change detection [2], and pattern recognition [3], and the accuracy of image matching affects the performance of these downstream tasks. However, SAR image acquisition conditions are diverse: different polarizations, incident angles, imaging methods, time phases, and so on. At the same time, defocusing caused by motion errors degrades image quality. Besides this, the time and space complexity of traditional methods is unacceptable for large images. Thus, for the many scenarios in which multiple SAR images are processed simultaneously, SAR image registration is a real necessity, and the nonlinear distortion and inherent speckle noise of SAR images leave wide-swath SAR image registration an open problem.

The geographical alignment of two SAR images acquired under different imaging conditions is based on a mapping model, which is usually solved from the relative relationship of corresponding parts of the images; one image serves as the reference image and the other as the sensed image to be registered. Generally speaking, conventional geometric transformation models include affine, projection, rigid-body, and nonlinear transformation models. In this paper, we focus on the most pervasive, the affine transformation model.

**Citation:** Fan, Y.; Wang, F.; Wang, H. A Transformer-Based Coarse-to-Fine Wide-Swath SAR Image Registration Method under Weak Texture Conditions. *Remote Sens.* **2022**, *14*, 1175. https://doi.org/10.3390/ rs14051175

Academic Editors: Tianwen Zhang, Tianjiao Zeng and Xiaoling Zhang

Received: 19 January 2022 Accepted: 23 February 2022 Published: 27 February 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

The registration techniques in the Computer Vision (CV) field have continued to spring up for decades. The existing registration methods can be divided mainly into traditional algorithms and learning-based algorithms. The traditional methods mainly include feature-based and region-based methods. The region-based methods find the best transformation parameters based on a maximum similarity coefficient, and include mutual information methods [4], Fourier methods [5], and cross-correlation methods [6,7]. Stone et al. [5] presented a Fourier-based algorithm to solve translations and uniform changes of illumination in aerial photos. Recently, in the field of SAR image registration, Luca et al. [7] used cross-correlation parabolic interpolation to refine matching results. This series of methods uses only raw gray-level information and risks mismatches under speckle noise and radiation variation.

Another major class of registration techniques in the CV field is the feature-based method. It searches for geometric mappings such as points, lines, contours, and regional features based on the stable feature correspondences across two images. The most prevalent method is SIFT (Scale Invariant Feature Transform) [8]. SIFT has been widely used in the field of image registration due to its invariances: rotation, scale, grayscale, and so on. PCA-SIFT (Principal Component Analysis-SIFT) [9] applies dimensionality reduction to SIFT descriptors to improve the matching efficiency. Slightly different from the classical CV field, a series of unique image registration methods has appeared in the SAR image processing field. Given the characteristics of SAR speckle noise, SAR-SIFT [10] adopts a new method of gradient calculation and feature descriptor generation to improve SAR image registration performance. KAZE-SAR [11] uses the nonlinear diffusion filtering method KAZE [12] to build the scale space. Xiang [13] proposed a method to match large SAR images with optical images; to be specific, the method combines dilated convolutional features with epipolar-oriented phase correlation to reduce horizontal errors, and then fine-tunes the matching part. Feature-based methods are more flexible and effective, and as such are more practical under complex spatial change. However, coherent speckle noise degrades the precision of conventional methods, and traditional matching approaches fail to achieve the expected results in complex and varied scenarios.

Deep learning [14] (DL) has exploded in CV fields over the past decade. With strong abilities of feature extraction and characterization, deep learning is in wide usage across remote sensing scenarios, including classification [15], detection [16], image registration [17], and change detection [18]. More and more methods [19,20] use learning-based methods in the registration of the CV field. He et al. [19] proposed a Siamese CNN (Convolutional Neural Network) to evaluate the similarity of patch pairs. Zheng et al. [20] proposed SymReg-GAN, which achieves good results in medical image registration by using a generator that predicts the geometric transformation between images, and a discriminator that distinguishes the transformed images from the real images. Specific to the SAR (remote sensing) image registration field, Li et al. [21] proposed a RotNET to predict the rotation relationship between two images. Mao et al. [22] proposed a multi-scale fused deep forest-based SAR image registration method. Luo et al. [23] used pre-trained deep residual neural features extracted from a CNN for registration. The CMM-Net (cross modality matching net) [24] used a CNN to extract high-dimensional feature maps and build descriptors. DL often requires large training datasets. Unlike optical natural images, SAR images are difficult to label accurately due to the influence of noise. In addition, most DL-based SAR image registration studies generally deal with small image blocks of a fixed size, but in practical applications wide-swath SAR images cannot be directly matched.

As was outlined earlier in this article, Figure 1 lists some of the registration methods for SAR (remote sensing) image domains. Although many SAR image registration methods exist, there are still some limitations:


• The existing SAR image registration methods mainly rely on the CNN structure and, due to the limitations of the receptive field, lack a complete relative relationship between their features.

**Figure 1.** Remote sensing image registration milestones in the last two decades.

Based on the above analysis, this paper proposes a wide-swath SAR image fine-level registration framework that combines traditional methods and deep learning. The experimental results show that, compared with the state of the art, the proposed method obtains better matching results. Comparison and analysis of the matching performance on different data sources show that the proposed method is more effective and robust for SAR image registration.

The general innovations of this paper are as follows:


The remainder of this paper is organized as follows. In Section 2. Methods, the proposed framework of SAR image registration and the learning-based sub-image matching method are discussed in detail. In Section 3. Experimental Results and Analyses, specified experiments, as well as quantitative and qualitative results, are given. In Section 4. Discussion, the conclusion is provided.

#### **2. Methods**

In this study, we propose a phased SAR image registration framework that combines traditional and deep learning methods. The framework is illustrated in Figure 2; the proposed method mainly consists of four steps. First, ORB [25] and GMS [26] are used to obtain the coarse registration result on the down-sampled original image. Second, K-means++ [27] selects cluster centers from the registration points of the previous step, and a series of corresponding original-resolution image slices is obtained. Third, we register the above image pairs through deep learning. The fourth step is to integrate the point-pair subsets and obtain the final global transformation result after RANSAC [28].
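The fourth step, solving a global affine transformation from the merged point-pair subsets, can be sketched with a minimal NumPy RANSAC. This is a generic illustration under assumed names and parameters, not the exact implementation behind RANSAC [28] in the paper.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine A such that dst ~= A @ [x, y, 1]^T."""
    X = np.hstack([src, np.ones((len(src), 1))])        # n x 3
    coeff, *_ = np.linalg.lstsq(X, dst, rcond=None)     # 3 x 2
    return coeff.T                                      # 2 x 3

def ransac_affine(src, dst, iters=200, tol=1.0, rng=None):
    """RANSAC loop: fit on minimal samples (3 pairs), count inliers,
    refit on the best consensus set. Thresholds are illustrative."""
    rng = np.random.default_rng(0) if rng is None else rng
    best_A, best_inliers = None, 0
    ones = np.ones((len(src), 1))
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)    # minimal sample
        A = fit_affine(src[idx], dst[idx])
        pred = np.hstack([src, ones]) @ A.T
        inliers = np.linalg.norm(pred - dst, axis=1) < tol
        if inliers.sum() > best_inliers:
            best_A = fit_affine(src[inliers], dst[inliers])
            best_inliers = inliers.sum()
    return best_A, best_inliers
```

Three non-collinear correspondences are the minimal sample for an affine model (six unknowns, two equations per pair), which is why each RANSAC iteration draws exactly three pairs.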

As a starting point for our work, we first introduce the existing deep learning mainstream.

#### *2.1. Deep Learning-Related Background*

Since AlexNet [29] won first place in the 2012 ImageNet competition, deep learning has played a leading role in CV, NLP (natural language processing), and other fields. The current mainstream of deep learning includes two categories: CNN and Transformer. CNN does well in the extraction of local information from two-dimensional data, such as images. Because deep neural networks can extract key features from massive data, deep CNNs perform outstandingly in image classification [30], detection [16,31], and segmentation [32].

**Figure 2.** The pipeline of the proposed method.

For text and other one-dimensional sequence data, currently the most widely used processing method is Transformer [33], which solves the long-range dependency problem using a unique self-attention mechanism. It is sweeping NLP, CV, and related fields.

Deep learning has been widely used in SAR image processing over the past few years. For example, Hou et al. [16] proposed ship classification using a CNN on an SAR ship dataset. Guo et al. [31] applied an Attention Pyramid Network for aircraft detection. Transformer is also used in recognition [34], detection [35] and segmentation [36]. LoFTR (Local Feature TRansformer) [37] has been proposed as a coarse-to-fine image matching method based on Transformers. However, to our knowledge, Transformer has not been applied to SAR image registration. Inspired by [37], in this article we use Transformer and CNN to improve the performance of SAR image registration.

The method proposed in this paper is mainly inspired by LoFTR. The initial consideration is that in SAR image registration scenes, due to weak texture information, traditional CV registration methods based on gradients, statistical information, and other classical cues cannot obtain enough matching point pairs. LoFTR adopts a two-stage matching mechanism and encodes features with a Transformer, such that each position in the feature map contains global information about the whole image. It works well in natural scenes, and maintains a good matching effect even in flat areas with weak texture information. However, SAR images have weaker texture information than optical images, making it difficult to obtain sufficient feature information.

In order to obtain more matching feature point pairs while balancing model complexity and algorithm accuracy, this paper adopts several modifications for SAR image scenes. (1) A Feature Pyramid Network is used as the feature extraction network in LoFTR; in this paper, an advanced convolutional neural network is adopted as the feature extraction part in order to obtain more comprehensive high- and low-resolution features with feature fusion. (2) This paper analyzes the factors that affect the number of matching point pairs, and finds that the size of the low-resolution feature map has a direct impact on the number of feature point pairs: the higher the resolution, the larger the number of correct matching points finally extracted. Therefore, the (1/2, 1/5) resolution pair is adopted to replace the original (1/2, 1/8) or (1/4, 1/16); this change increases the number of matching point pairs significantly. (3) In order to further reduce algorithm complexity and improve speed, this paper adopts an advanced linear-time-complexity method to encode features, such that the location features at a specific index of the feature map are weighted by full-image information, further improving efficiency while preserving accuracy. The above parts are expanded and analyzed in detail in the following sections.
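The linear-time-complexity feature encoding mentioned in (3) can be illustrated with a kernelized attention sketch in the spirit of linear Transformers; the elu(x)+1 feature map used below is a common assumed choice for the sketch, not necessarily the exact variant adopted in this paper.

```python
import numpy as np

def elu_plus_one(x):
    """elu(x) + 1: a positive feature map, so normalizers stay positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelized linear attention: softmax(Q K^T) V costs O(n^2 d) in
    sequence length n, while phi(Q) @ (phi(K)^T V) costs O(n d^2) --
    linear in n, because the d x d summary phi(K)^T V is built once."""
    Qp, Kp = elu_plus_one(Q), elu_plus_one(K)
    KV = Kp.T @ V                      # d x d value summary
    Z = Qp @ Kp.sum(axis=0)            # per-query normalizer
    return (Qp @ KV) / Z[:, None]
```

Because each output row is a normalized positive combination of the value rows, a constant value matrix passes through unchanged, which is a quick sanity check on the normalization.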

#### *2.2. Rough Matching of the Down-Sampled Image*

The primary reasons that the traditional matching methods SIFT and SURF (Speeded-Up Robust Features) [38] cannot be applied directly to SAR images are the serious coherent speckle noise and the weak, discontinuous texture. It is often impossible to obtain sufficient matching points on original-resolution SAR images with traditional methods. At the same time, the semantic information of the original-size image is relatively scarce. Therefore, we do not simply use traditional methods to process the original image. Considering that the down-sampled image is similar to a high-level feature map in a deep CNN, with rich semantic information, we use the down-sampled image (at a rate of roughly 10) to perform rough pre-matching, as shown in Figure 3. The most representative method is SIFT; however, it runs slowly, especially on large images. The ORB algorithm is two orders of magnitude faster [25] than SIFT, and is a stable and widely used feature point detection and description method.

**Figure 3.** The pipeline of rough matching.

ORB combines and improves the FAST (Features from Accelerated Segment Test) [39] keypoint detector and the BRIEF (Binary Robust Independent Elementary Features) [40] descriptor. FAST's idea is that if a pixel's gray value differs from its surrounding neighborhood (i.e., exceeds a threshold), it may be a feature point. To be specific, FAST uses a circle of 16 neighboring pixels to select the initial candidate points, non-maximum suppression is used to eliminate adjacent points, and Gaussian blurring at different scales is performed on the image to achieve scale invariance.
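The FAST segment test on the 16-pixel circle can be sketched as follows; this is a simplified illustration (the real detector adds a high-speed pre-test and non-maximum suppression), with the threshold and run length chosen as in the common FAST-9 variant.

```python
import numpy as np

# Offsets (dx, dy) of the 16-pixel Bresenham circle of radius 3 used by FAST.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def fast_corner(img, x, y, t=20, n=9):
    """(x, y) is a corner candidate if n contiguous circle pixels are all
    brighter than img[y, x] + t or all darker than img[y, x] - t."""
    p = int(img[y, x])
    vals = np.array([int(img[y + dy, x + dx]) for dx, dy in CIRCLE])
    for hits in (vals > p + t, vals < p - t):
        doubled = np.concatenate([hits, hits])  # handle wrap-around runs
        run = best = 0
        for hit in doubled:
            run = run + 1 if hit else 0
            best = max(best, min(run, 16))
        if best >= n:
            return True
    return False
```

Doubling the boolean circle before scanning is a simple way to count contiguous runs that wrap around the start of the circle.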

The intensity-weighted sum of a patch defines its centroid, and the orientation is obtained from the angle between the current point and the centroid; orientation invariance is enhanced by calculating moments. BRIEF is a binary-coded descriptor that uses binary tests and bitwise XOR operations to speed up the construction of feature descriptors and reduce the time needed for feature matching. Steered BRIEF and rBRIEF are applied for rotation invariance and distinguishability, respectively. Overall, FAST accelerates feature point detection, and BRIEF reduces spatial redundancy.
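The XOR-based matching of binary descriptors can be sketched with NumPy; the byte layout (descriptors packed as uint8 arrays, as in BRIEF/ORB) is the standard convention, while the brute-force strategy is an illustrative assumption.

```python
import numpy as np

def hamming_match(desc_a, desc_b):
    """Brute-force Hamming matching of binary descriptors packed as
    uint8 rows (e.g., 32 bytes = 256 bits for ORB). XOR exposes the
    differing bits; a popcount over the result gives the distance."""
    # Pairwise XOR via broadcasting: (Na, 1, B) ^ (1, Nb, B) -> (Na, Nb, B)
    x = desc_a[:, None, :] ^ desc_b[None, :, :]
    dist = np.unpackbits(x, axis=2).sum(axis=2)    # popcount per pair
    return dist.argmin(axis=1), dist.min(axis=1)   # nearest neighbor in b
```

Because the distance is a bit count rather than a Euclidean norm, matching binary descriptors needs no floating-point arithmetic at all, which is the source of BRIEF's speed advantage.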

GMS is applied after ORB to obtain more matching point pairs; a brief description follows. Suppose the images $I_a$ and $I_b$ have N and M feature points, respectively, and the set of matches between them is $\mathcal{X}_{a \to b} = \{x_1, x_2, \cdots, x_n\}$, where each match $x_i$ links a feature point in $I_a$ to one in $I_b$, and a and b denote the neighborhoods of the two matched feature points. A correct matching pair tends to have more matching points around it as support for its correctness. For the matching pair $x_i$, $S_i = |\chi_i| - 1$ represents the support of its neighboring feature points, where $\chi_i$ is the set of matching pairs in the neighborhood of $x_i$. Because the matching of each feature point is independent, $S_i$ approximately obeys a binomial distribution and can be defined as

$$S_i \sim \begin{cases} B(n, p_t), & x_i \text{ matches correctly} \\ B(n, p_f), & x_i \text{ matches wrongly} \end{cases} \tag{1}$$

n is the average number of feature points in each small neighborhood. Let $f_a$ be one of the supporting features belonging to region a; $p_t$ denotes the probability that region b includes the nearest neighbor of $f_a$ when $x_i$ is a correct match, and $p_f$ the corresponding probability when $x_i$ is a wrong match. They can be obtained by the following formulae:

$$\begin{aligned} p_t &= p\left(f_a^t\right) + p\left(f_a^f\right) p\left(f_a^b \mid f_a^f\right) = t + (1-t)\beta m/M \\ p_f &= p\left(f_a^f\right) p\left(f_a^b \mid f_a^f\right) = (1-t)\beta m/M \end{aligned} \tag{2}$$

$f_a^t$, $f_a^f$ and $f_a^b$ correspond to the events that $f_a$ is correctly matched, $f_a$ is incorrectly matched, and $f_a$'s matching point appears in region b, respectively. m represents the number of feature points in region b of image $I_b$, and M represents the total number of feature points in image $I_b$. In order to further improve the discriminative ability, the GMS algorithm replaces the single-neighborhood model with a multi-neighborhood model:

$$S_i = \sum_{k=1}^{K} \left|\chi_{a_k b_k}\right| - 1 \tag{3}$$

K is the number of small neighborhoods near the matching point, $\chi_{a_k b_k}$ is the set of matching pairs between the two corresponding neighborhoods $a_k$ and $b_k$, and $S_i$ can be extended to

$$S_i \sim \begin{cases} B(Kn, p_t), & x_i \text{ matches correctly} \\ B(Kn, p_f), & x_i \text{ matches wrongly} \end{cases} \tag{4}$$

According to statistics, an evaluation score P is defined to measure the ability of the function Si to discriminate between right and wrong matches, as follows:

$$P = \frac{m_t - m_f}{s_t + s_f} = \sqrt{Kn}\, \frac{p_t - p_f}{\sqrt{p_t(1-p_t)} + \sqrt{p_f(1-p_f)}} \tag{5}$$

Among them, $s_t$ and $s_f$ are the standard deviations of $S_i$ for true and false matches, respectively, and $m_t$ and $m_f$ are the corresponding mean values. It can be seen from Formula (5) that the greater the number of feature points, the higher the matching accuracy. If we set $S_{ij} = \sum_{k=1}^{K=9} \left|\chi_{i_k j_k}\right|$ for grid pair $\{i, j\}$ and take $\tau \approx 6\sqrt{n}$ as the threshold, then $\{i, j\}$ is regarded as a correctly matched grid pair when $S_{ij} > \tau$.

In order to reduce the computational complexity, GMS replaces the circular neighborhood with non-overlapping square grids to speed up the calculation of $S_{ij}$. Experiments have shown that dividing the image into a 20 × 20 grid works well when the number of feature points is about 10,000. The GMS algorithm scales the grid size for image-size invariance, introduces a motion kernel function to process the image, and converts rotation changes into a rearrangement of the corresponding neighborhood grid order to ensure rotation invariance.
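To make these quantities concrete, the following sketch evaluates Formulae (2) and (5) and the threshold; the parameter values in the usage example (t, β, m, M, n) are assumed for illustration only.

```python
import numpy as np

def gms_stats(t, beta, m, M, n, K=9):
    """GMS quantities: p_t and p_f from Formula (2), the discriminability
    score P from Formula (5), and the acceptance threshold tau ~ 6*sqrt(n)."""
    p_t = t + (1 - t) * beta * m / M   # correct match, or chance hit in b
    p_f = (1 - t) * beta * m / M       # chance hit in b given a wrong match
    P = np.sqrt(K * n) * (p_t - p_f) / (
        np.sqrt(p_t * (1 - p_t)) + np.sqrt(p_f * (1 - p_f)))
    tau = 6 * np.sqrt(n)
    return p_t, p_f, P, tau
```

Since P scales with $\sqrt{Kn}$, denser feature points (larger n) make correct and wrong matches easier to separate, which is exactly the observation in the text that more feature points yield higher matching accuracy.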

#### *2.3. Sub-Image Acquisition from the Cluster Centers*

The existing image matching methods mostly apply to small-size images, which have lower time and storage requirements. Although some excellent methods can reach sub-pixel accuracy in local areas, they cannot be extended to a large scale due to their unique gradient calculation methods and scale-space storage. Take the representative algorithm SAR-SIFT, for example: its time and memory consumption vary with image size, as shown in Figure 4, and when the image size reaches 5000–10,000 pixels or more, the memory reaches a certain peak. This computational consumption is unacceptable for ordinary desktop computers. The test experiment here was performed on a high-performance workstation; even so, the memory consumption caused by further expansion of the image size is unbearable.

**Figure 4.** Trends in time and space consumption along with the image size, with SAR-SIFT as the method.

Storage limitation is one of the key considerations, and the time complexity of the algorithm also needs to be taken seriously, because in practical applications most scenarios are expected to be processed in quasi-real time. It can be seen that small images can be processed by SAR-SIFT within seconds, while medium-sized images take roughly minutes. For larger images, although better registration results may be obtained, a program running time of several hours or even longer cannot be accepted. Parallel optimization was tried here, but it did not speed the process up significantly.

According to the above analysis, due to the special gradient calculation method and the storage requirements of the scale space, wide-swath SAR image processing risks an explosion in time and space complexity. As a comparison, we also tried combining ORB with GMS for large-image processing, but the final solution turned out to be wrong. The above shows the time and space complexity from a qualitative point of view; the following uses SIFT as an example to analyze the reasons for the high time complexity from a formula perspective. The SIFT algorithm mainly covers several stages, as shown in Figure 5.

**Figure 5.** The pipeline of the SIFT algorithm.

The overall time complexity is the sum of the complexities of each stage. Assume that the size of the currently processed image is N × N.

1. Regarding Gaussian blur, there are a total of $\hat{s}$ octaves of images, and each octave consists of s scales; for the original-resolution N × N image, the Gaussian filter $G(x, y, \sigma)$ is

$$\mathbf{G}(\mathbf{x}, \mathbf{y}, \sigma) = \frac{1}{2\pi\sigma^2} \mathbf{e}^{-\frac{\mathbf{x}^2 + \mathbf{y}^2}{2\sigma^2}} \tag{6}$$

The corresponding per-octave time complexity is $O(N^2 w^2 s)$, since for each pixel a weighted sum over the surrounding w × w Gaussian window is required:

$$L(x, y, \sigma) = \sum_{u=-\frac{w-1}{2}}^{\frac{w-1}{2}} \sum_{v=-\frac{w-1}{2}}^{\frac{w-1}{2}} G(u, v)\, I(x+u, y+v) \tag{7}$$

The complexity over all of the octaves is

$$O\left(\sum_{j=0}^{\hat{s}-1} \frac{N^2}{2^j} w^2 s\right) = O\left(N^2 w^2 s\right) \tag{8}$$
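For concreteness, the Gaussian filter of Formula (6) and the octave sum of Formula (8) can be sketched numerically; the values of N, w, s and the number of octaves below are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(w, sigma):
    """Discrete w x w Gaussian filter from Formula (6), normalized so
    that the weighted sum in Formula (7) preserves constant regions."""
    r = (w - 1) // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    G = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return G / G.sum()

# The octave sum in Formula (8) is a geometric series: with the per-octave
# area shrinking as N^2 / 2^j, the total work stays below 2x one octave,
# which is why the sum collapses to O(N^2 w^2 s).
N, w, s, octaves = 1024, 5, 3, 4
work = sum(N**2 / 2**j * w**2 * s for j in range(octaves))
```

The normalization `G / G.sum()` matters in practice: the raw continuous density in Formula (6) does not sum to exactly 1 once truncated to a w × w window.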

2. To calculate the difference of Gaussians, each pixel of adjacent scales is subtracted once in one direction:

$$D_i^j = L_{i+1}^j - L_i^j \tag{9}$$

$$O\left(\sum_{j=0}^{\tilde{s}-1} \frac{s N^2}{2^j}\right) = O\left(s N^2\right) \tag{10}$$

3. For extremum detection in scale space, each point is compared with its 26 neighbors in the scale space. If the point is larger or smaller than all of them, it is regarded as an extreme point; the complexity is

$$O\left(\sum_{j=0}^{\tilde{s}-1} \frac{(s+2) N^2}{2^j}\right) = O\left(s N^2\right) \tag{11}$$


4. For orientation assignment, the gradient magnitude and orientation of each point are calculated as

$$m_i^j(x,y) = \sqrt{\left(L_i^j(x+1,y) - L_i^j(x-1,y)\right)^2 + \left(L_i^j(x,y+1) - L_i^j(x,y-1)\right)^2} \tag{12}$$

$$\theta_i^j(x,y) = \tan^{-1} \frac{L_i^j(x,y+1) - L_i^j(x,y-1)}{L_i^j(x+1,y) - L_i^j(x-1,y)} \tag{13}$$

5. Non-keypoint pixels whose magnitudes are close to the peak are added as new keypoints. The total number of output points is

$$\alpha\beta N^2 + \gamma\left(N^2 - \alpha\beta N^2\right) = \alpha\beta N^2(1-\gamma) + \gamma N^2 \cong N^2(\alpha\beta + \gamma) \tag{14}$$

The computational complexity of each point is O(1), and the total complexity is O(N²s).

6. For feature point descriptor generation, the complexity of each point is O(x²), and the total complexity is O(x²N²(αβ + γ)).
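To make the dominant O(N²w²s) Gaussian-blur cost of Eqs. (6) and (7) concrete, the following is an illustrative, unoptimized sketch with our own function names (real SIFT implementations use separable or recursive filters), not the implementation used in this paper:

```python
import numpy as np

def gaussian_kernel(w, sigma):
    """Sample G(x, y, sigma) of Eq. (6) on a w x w grid centered at the origin."""
    half = (w - 1) // 2
    u = np.arange(-half, half + 1)
    uu, vv = np.meshgrid(u, u)
    g = np.exp(-(uu ** 2 + vv ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    return g / g.sum()  # normalize so image brightness is preserved

def gaussian_blur(img, w, sigma):
    """Direct weighted sum of Eq. (7): O(w^2) work per pixel, O(N^2 w^2) per scale."""
    g = gaussian_kernel(w, sigma)
    half = (w - 1) // 2
    padded = np.pad(img, half, mode="edge")
    out = np.empty(img.shape)
    for x in range(img.shape[0]):
        for y in range(img.shape[1]):
            out[x, y] = np.sum(g * padded[x:x + w, y:y + w])
    return out
```

The explicit double loop mirrors the per-pixel cost counted in Eq. (8); repeating it for every scale of every octave is exactly what becomes prohibitive on wide-swath images.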

Based on the analysis of the above results, we believe the reasons for the failure of the above algorithms in actual wide-swath SAR image registration are as follows: (1) It is inefficient to calculate the scale space of the entire image. In most areas, it is difficult to find feature points that can establish a mapping relationship, which leads to potentially ineffective calculations. Not only are these time-consuming, but the corresponding scale space and feature point descriptors also consume a lot of storage resources. (2) For the feature point sets obtained from the two images, one-to-one matching must be carried out by the brute-force calculation of the Euclidean distance, etc. Most points are not possible candidates, yet the calculated Euclidean distances still need to be stored, so again there is invalid calculation during matching.

From the perspective of the algorithm's operation process, we discuss why algorithms such as SAR-SIFT perform well in sub-image registration but fail on wide-swath images. The most obvious factors are time and space consumption, and both can be traced to a common cause: the calculation and storage are undirected. There is redundancy in the calculation of the scale space; candidate areas could be identified beforehand to reduce the calculation range of the scale space, which would also reduce the subsequent mismatch-filtering workload. At the same time, feature point matching lacks directivity: for feature points in a small area, points from most of the other image are not potential matches. Therefore, redundant calculation and storage can be omitted.

In this paper, the idea of improving the practicality of wide-swath SAR image registration is to reduce the calculation range of the original image and the range of candidate points according to certain criteria. Based on the coarse registration results of the candidate regions, we determine the approximate spatial correspondences, and then perform more refined feature calculations and matching in the corresponding image slice regions.

In this work, the corresponding slice areas with a higher probability of feature points are selected. K-means++ is used to obtain the clustering centers of coarse matching points in the first step. The clustering center is marked as the geometric center in order to obtain the image slices. By using the geometric transformation relationship, a set of image pairs corresponding approximately to the same geographic locations are obtained. Adopting this approach has the following advantages:


K-means++ is an unsupervised learning method which is usually used in scenarios such as data mining. K-means++ clusters N observation samples into K categories; here, K = 4. The cluster centers are used as the slice geometric centers, and the slice size is set to 640 × 640. Through the above process, a series of roughly matched image groups are obtained within an error of about ten pixels.
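The slice-selection step above can be sketched as follows. This is a minimal numpy illustration with our own names and parameters (in practice a library implementation such as scikit-learn's `KMeans` with `init='k-means++'` would be used); the seeding rule and the 640 × 640 slice extraction are the only parts taken from the text:

```python
import numpy as np

def kmeans_pp(points, k, iters=20, rng=None):
    """Cluster coarse matching points; the k centers become slice geometric centers."""
    rng = np.random.default_rng(rng)
    # k-means++ seeding: each new center is drawn with probability
    # proportional to the squared distance to the nearest existing center.
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    centers = np.array(centers, dtype=float)
    # Standard Lloyd iterations to refine the centers.
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers

def slice_boxes(centers, img_hw, size=640):
    """640 x 640 slices centered on each cluster center, clipped to the image."""
    h, w = img_hw
    boxes = []
    for cy, cx in centers:
        y0 = int(np.clip(cy - size // 2, 0, max(h - size, 0)))
        x0 = int(np.clip(cx - size // 2, 0, max(w - size, 0)))
        boxes.append((y0, x0, y0 + size, x0 + size))
    return boxes
```

Seeding by squared distance is what makes the K = 4 centers spread across the coarse-match distribution, so the resulting slices cover distinct high-probability regions rather than one dense cluster.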

#### *2.4. Dense Matching of the Sub-Image Slices*

After the above processing is performed on the original-resolution SAR image, a set of SAR image slices is obtained. Compared with optical image registration, SAR image registration faces many difficulties: low resolution and signal-to-noise ratio, layover effects, foreshortening, and weak texture. Therefore, aligning original-resolution SAR images is more difficult than aligning optical images.

This article uses a Transformer. Based on the features extracted by a CNN, Transformers are used to obtain the feature descriptors of the two images. The global receptive field provided by the Transformer enables our method to fuse local features and contextual location information, which produces dense matching even in low-texture areas (where it is usually difficult for feature detectors to generate repeatable feature points).

The overall process consists of several steps, as shown in the lower half of Figure 2:


#### 2.4.1. HRNet

Traditional backbones such as VGGNet [43] and ResNets (Residual Networks) [44] apply a series of convolution and pooling operations, which lose a lot of spatial detail information. The HRNet structure maintains high-resolution feature maps and combines high- and low-resolution subnetworks in parallel to obtain multi-scale information.

HRNet is used as the network model for multi-resolution feature extraction in this method. At the beginning of this work, we tried a variety of convolutional neural network models, including ResNets, EfficientNet [45] and FPN [46], and found that HRNet performs best. The HRNet structure is shown in Figure 6; the network is composed of multiple branches, including fusion layers for information interaction between branches of different resolutions, and transition layers, which generate the 1/2-resolution downsampled branch. By observing the input and output of HRNet at different stages, it can be seen that multi-resolution feature maps with multi-level information are output after the full integration of the branches with different resolutions.

**Figure 6.** HRNet architecture diagram.

The Transformer part of this paper requires high and low resolutions to complete the coarse matching of the feature points and the more accurate positioning of specific areas. HRNet, as a good backbone, outputs feature maps at a variety of resolutions to choose from, with full interaction between the feature maps, such that it contains high-level semantic information at low resolution and low-level detail information at high resolution.

In addition, the subsequent part of this paper makes further attempts to combine different resolutions. As shown in later sections, increasing the resolution of the feature maps in the rough matching stage can significantly increase the number of matched point pairs. By default, HRNet outputs 1/2-, 1/4-, and 1/8-resolution feature maps. Under the constraints of the experimental environment, we chose 1/5 and 1/2 as the low and high resolutions; the reason is discussed in Section 3. The 1/5 resolution can be obtained from the other resolutions by interpolation. The 1/2-resolution feature map is passed through a 1 × 1 convolutional layer, and the output serves as the fine-level feature map. The 1/4- and 1/8-resolution feature maps are both interpolated to 1/5 resolution; after concatenation, the coarse-level feature map is obtained through a 1 × 1 convolutional layer.
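The resolution bookkeeping described above can be sketched as follows. This is our own minimal illustration: nearest-neighbor resizing stands in for the interpolation actually used, and the 1 × 1 convolution is written as a channel-mixing matrix multiply, which is mathematically what a 1 × 1 convolution is:

```python
import numpy as np

def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbor resize of a (C, H, W) feature map (stand-in for interpolation)."""
    _, h, w = fmap.shape
    ys = np.arange(out_h) * h // out_h
    xs = np.arange(out_w) * w // out_w
    return fmap[:, ys][:, :, xs]

def coarse_level_map(f4, f8, full_hw, mix):
    """Interpolate the 1/4 and 1/8 maps to 1/5 resolution, concatenate along
    channels, then apply a 1x1 convolution (a matmul over the channel axis)."""
    h5, w5 = full_hw[0] // 5, full_hw[1] // 5
    stacked = np.concatenate(
        [resize_nearest(f4, h5, w5), resize_nearest(f8, h5, w5)], axis=0)
    # mix: (C_out, 2C) weights of the 1x1 conv.
    return np.tensordot(mix, stacked, axes=([1], [0]))
```

Since 1/5 does not appear in HRNet's native 1/2, 1/4, 1/8 pyramid, resampling both neighbors to the target grid before the channel mix is the natural way to realize the chosen (1/2, 1/5) configuration.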

#### 2.4.2. Performer

The Transformer has achieved outstanding performance in many fields of computer vision, such as classification and detection. With the help of a multi-head self-attention mechanism, the Transformer can capture richer feature information. Generally speaking, the Transformer's complexity is quadratic in the sequence length. To improve the speed of training and inference, a Transformer with linear time complexity, the Performer [42], was recently proposed. It achieves faster self-attention through the positive orthogonal random features approach.

$$\text{Attention}\left(Q, K, V\right) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{15}$$

Self-attention (as shown in Figure 7) performs an attention-weighted summation of the current data and global data, and realizes a special information aggregation by calculating the importance of the current location feature relative to other location features. The feedforward part contains the linear layer and the GELU (Gaussian Error Linear Unit) activation function. Each layer adopts Layer Normalization in order to ensure the consistency of the feature distribution, and to accelerate the convergence speed of the model training.

**Figure 7.** Performer (Transformer) encoder architecture and self-attention schematic diagram.

As shown in reference [42], the Performer achieves a space complexity of O(Lr + Ld + rd) and a time complexity of O(Lrd), whereas the original Transformer's regular attention requires O(L² + Ld) and O(L²d), respectively. The sinusoidal position encoding formula used in this work is as follows:

$$\begin{cases} p_{k,2i} = \sin\left(k/10000^{2i/d}\right) \\ p_{k,2i+1} = \cos\left(k/10000^{2i/d}\right) \end{cases} \tag{16}$$
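Eq. (16) translates directly into code; a short sketch (with d the embedding dimension, assumed even):

```python
import numpy as np

def sinusoidal_encoding(length, d):
    """Eq. (16): p[k, 2i] = sin(k / 10000^(2i/d)), p[k, 2i+1] = cos(same angle)."""
    k = np.arange(length)[:, None]        # position index
    i = np.arange(d // 2)[None, :]        # frequency index
    angles = k / (10000.0 ** (2 * i / d))
    p = np.empty((length, d))
    p[:, 0::2] = np.sin(angles)           # even channels
    p[:, 1::2] = np.cos(angles)           # odd channels
    return p
```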

Based on the features extracted by CNN, Performers are used to obtain the feature descriptors of the two images. The global receptive field provided by Performer enables our method to fuse local features and contextual location information, which can produce dense matching in low-texture areas (usually, in low-texture areas, it is difficult for feature detectors to generate repeatable feature points).

#### 2.4.3. Training Dataset

Due to the influence of noise, it is difficult to accurately annotate the control points of a SAR image, and matching datasets of SAR images are uncommon. MegaDepth [47] contains 196 groups of different outdoor scenes; it applies state-of-the-art (SOTA) methods to obtain depth maps, camera parameters, and other information. The dataset contains different perspectives and periodic scenes. Considering the dataset size and GPU memory, 1500 images were selected as a validation set, and the long side of each image was scaled to 640 during training and 1200 during validation.

#### 2.4.4. Loss Function

$$\mathcal{L} = \mathcal{L}_{c} + \mathcal{L}_{f} = -\frac{1}{\left|\mathcal{M}_{c}^{gt}\right|} \sum_{(\tilde{i},\tilde{j}) \in \mathcal{M}_{c}^{gt}} \log P_{c}\left(\tilde{i},\tilde{j}\right) + \frac{1}{\left|\mathcal{M}_{f}\right|} \sum_{(\tilde{i},\tilde{j}') \in \mathcal{M}_{f}} \frac{1}{\sigma^{2}\left(\tilde{i}\right)} \left\|\tilde{j}' - \tilde{j}'_{gt}\right\|_{2} \tag{17}$$

As in [37], this article uses a similar loss function configuration; a brief explanation follows. Pc is the confidence matrix returned by the dual softmax. The ground-truth confidence matrix is calculated from the camera parameters and depth maps. The mutual nearest neighbors of the two sets of low-resolution grids are used as the ground truth of the coarse matching Mc, and the low-resolution term uses the negative log-likelihood as its loss function.

The high-resolution term adopts the L2 norm. For each point, the uncertainty is measured by calculating the overall variance of the corresponding heatmap. The true position of the current point is calculated from the reference point, camera position, and depth map. The total loss is the sum of the low- and high-resolution terms.
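The two terms of Eq. (17) can be sketched compactly. The names and tensor layouts below are ours; in [37] the confidence matrix comes from a dual softmax and σ² from the fine-level heatmap variance, both of which are taken as given inputs here:

```python
import numpy as np

def total_loss(Pc, coarse_gt_mask, fine_pred, fine_gt, sigma2):
    """Eq. (17): coarse negative log-likelihood over ground-truth coarse matches
    plus a variance-weighted L2 term over the fine-level refinements."""
    eps = 1e-9
    Lc = -np.log(Pc[coarse_gt_mask] + eps).mean()      # first term of Eq. (17)
    l2 = np.linalg.norm(fine_pred - fine_gt, axis=-1)  # ||j' - j'_gt||_2
    Lf = (l2 / sigma2).mean()                          # second term of Eq. (17)
    return Lc + Lf
```

Dividing by the heatmap variance down-weights refinements that the network itself is uncertain about, so noisy fine-level targets do not dominate the gradient.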

#### *2.5. Merge and Solve*

After obtaining the corresponding matching point sets of each image slice pair, the final solution requires mapping point sets of the entire image. Considering that the registration geometric relationships solved from the individual slice pairs are not necessarily identical, this work merges all of the point sets. The RANSAC method is then used to obtain the final result, i.e., a set of corresponding inliers describing the two large images. The number of corresponding points after filtering is necessarily no greater than the sum over the independent slices. Without bells and whistles, the affine matrix of the entire image is then solved.
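The merge-and-solve step can be sketched as a minimal RANSAC loop over the pooled point sets. In practice a library routine such as OpenCV's `cv2.estimateAffine2D` does this job; the threshold and iteration count below are illustrative choices of ours:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine A such that dst ~= [src, 1] @ A.T."""
    X = np.hstack([src, np.ones((len(src), 1))])   # (n, 3) homogeneous points
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)    # (3, 2)
    return A.T                                     # (2, 3)

def ransac_affine(src, dst, iters=500, thresh=3.0, rng=0):
    """Pool all slice-level matches, keep the affine model with the most inliers."""
    rng = np.random.default_rng(rng)
    best_inliers = np.zeros(len(src), dtype=bool)
    X = np.hstack([src, np.ones((len(src), 1))])
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)  # minimal sample
        A = fit_affine(src[idx], dst[idx])
        err = np.linalg.norm(X @ A.T - dst, axis=1)        # reprojection error
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Final refit on all inliers of the best model.
    return fit_affine(src[best_inliers], dst[best_inliers]), best_inliers
```

Fitting one model to the pooled matches is what reconciles slice pairs whose individually solved transformations disagree: slices dominated by mismatches simply contribute fewer inliers.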

#### **3. Experimental Results and Analyses**

In this section, we design several experiments to validate the performance of our method from four perspectives: (1) comparative performance tests against SOTA methods for different data sources, (2) checkerboard visualization of the matching, (3) scale, rotation, and noise robustness tests, and (4) the impact of the network's high- and low-resolution settings on the results. First, a brief introduction to the experimental datasets is given.

#### *3.1. Experimental Data and Settings*

In this work, datasets from five sources were used to verify the algorithm's effectiveness: GF-3, TerraSAR-X, Sentinel-1, ALOS, and SeaSat. These data include a variety of resolutions, polarization modes, orbital directions, and terrains. Table 1 and Figure 8 contain detailed information. DEC and ASC mean "descending" and "ascending", respectively.


**Table 1.** Experimental datasets.

In order to verify the effectiveness of the proposed matching method, several evaluation criteria were used to evaluate the accuracy of the SAR image registration, as shown below:

1. The root mean square error, RMSE, is calculated by the following formula:

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left[\left(x_{i}^{2\prime} - x_{i}^{1}\right)^{2} + \left(y_{i}^{2\prime} - y_{i}^{1}\right)^{2}\right]} \tag{18}$$

2. NCM stands for the number of correctly matched feature point pairs retained after RANSAC filtering, i.e., the feature point pairs participating in the calculation of the spatial transformation model. It is a filtered subset of the matching point pairs output by algorithms such as SAR-SIFT. For the solution of the affine matrix, the larger the value, the better the image registration effect.
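Eq. (18) in code form, with NCM being simply the size of the RANSAC inlier set (function names are ours):

```python
import numpy as np

def rmse(warped_pts, ref_pts):
    """Eq. (18): root mean square of residual distances after registration.
    warped_pts are the sensed-image points mapped through the solved affine."""
    d2 = ((warped_pts - ref_pts) ** 2).sum(axis=1)   # (x'-x)^2 + (y'-y)^2
    return np.sqrt(d2.mean())
```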

#### *3.2. Performance Comparison*

In this section, we compare the proposed method with several methods: SAR-SIFT, HardNet [48], TFeat [49], SOSNet [50], LoFTR, KAZE-SAR, and CMM-Net. HardNet, TFeat, and SOSNet use GFTT [51] as the feature point detector, and the patch size of the above methods is 32 × 32. GFTT will pick the top N strongest corners as feature points. The comparison methods are introduced briefly as follows:


**Figure 8.** The SAR image datasets (**a**–**q**) of datasets 1–17.

As for our method, we set K = 4 and a window size of 640 for the sub-images. SAR-SIFT and KAZE-SAR are traditional methods, while HardNet, TFeat, SOSNet, LoFTR, and CMM-Net are deep learning methods. The algorithm in this paper was trained and tested on a server with an NVIDIA TITAN X GPU (12 GB), an Intel(R) Core(TM) i7-5930K CPU @ 3.50 GHz, and 128 GB of memory. The comparison experiments were carried out on the same hardware. As discussed later (Section 3.4), under the existing hardware conditions, the (1/2, 1/5) resolution configuration achieves the best effect and was used as the final network model to measure the speed and accuracy of the algorithm. For the comparison methods, apart from feature point detection and description, the other processing steps are kept consistent with our method, including the rough matching of sub-sampled images and the acquisition of sub-images. The other settings follow the original settings of each algorithm in order to ensure a fair comparison.

Table 2 shows the performance of the methods on the above datasets, in which the best performance for each indicator is shown in bold. '-' in the table means that the matching result of the corresponding algorithm is incorrect. It can be seen that our method's RMSE is better than those of the comparison methods on more than half of the datasets. On all of the SAR image registration datasets, the performance of our method reaches the sub-pixel level. Considering NCM, our method obtains the best performance on all of the datasets, as well as a better spatial distribution of points, i.e., dense matching. Figure 9 shows that our method's NCMs are higher than those of the other methods, while its RMSEs are lower in most cases.

#### *3.3. Visualization Results*

In order to display the matching accuracy more intuitively, we added the checkerboard mosaic images. In Figure 10, the continuity of the lines can reflect the matching accuracy. As the pictures show, the areas and lines overlap well, indicating the high accuracy of the proposed method.


**Table 2.** RMSE and NCM of diverse methods on the datasets.


**Figure 10.** The checkerboard mosaic images (**a**–**q**) of datasets 1–17.

Considering that our method adopts a strategy of fusing local and global features, it can fully extract matching point pairs in the selected local areas, which leads to better solutions that are closer to the affine transformation relationship of the real images. In data pairs 14 and 17, the two images show relatively strong changes in radiation intensity, and the method in this paper still achieves good matching results, which proves that the method has a certain robustness to changes in radiation intensity. Dataset 5 contains two images from different orbit directions; it can be seen from the roads and other areas in the figure that the matching is precise. Datasets 2 and 7 contain multi-temporal images of different polarizations. Due to the scattering mechanism, the same objects may appear different in images of different polarizations. Our method demonstrates stability under multi-polarization.

#### *3.4. Analysis of the Performance under Different Resolution Settings*

Considering that our proposed method has a coarse-to-fine step, we analyzed the registration performance of different resolution settings to find the best ratio. In this experiment, we tested several combinations of high and low resolutions. Table 3 shows the corresponding performance, with the best result for each dataset indicated in bold. The resolution parameter combinations are (1/4, 1/16), (1/2, 1/8), (1, 1/8), and (1/2, 1/5).


**Table 3.** RMSE and NCM of the different resolution settings on the datasets.

It can be seen from Table 3 that the (1/2, 1/5) configuration has the best performance on the datasets. Table 3 also shows that the low resolution directly affects the total number of matched points: intuitively, the 1/16 resolution yields a quarter as many potential points as the 1/8 resolution (1/2 × 1/2 = 1/4). As such, the (1/2, 1/5) configuration can obtain more matched point pairs. Taking configurations (1/2, 1/8) and (1, 1/8) as an example, for the high-resolution feature maps, the number of matched points increases to a certain extent with the resolution. However, the GPU memory requested by configuration (1, 1/5) exceeds the upper limit of the machine used in this work, so we chose the compromise configuration (1/2, 1/5).

#### **4. Discussion**

The experimental results corroborate the accuracy and robustness of our method. There are three main reasons for this. First, the features extracted by the Transformer are richer, including the local gray-level information of the image itself and global information such as context. Second, the down-sampled image has stronger semantic information and is suitable for traditional registration methods, so the subsequent registration starts from a better initial coarse-matching result. Third, the K-means++ clustering method constructs the relationship between the original images and the sub-images to be registered, and representative sub-images are obtained in order to reduce time and space consumption.

From the performance analysis and the model hyperparameter comparison experiments, it can be seen that our proposed method achieves stable and accurate matching results for different ground-object scenes and data from various sensors. We now further examine the rotation, scale, and noise robustness of the proposed method. Furthermore, another vital criterion, the execution time, needs to be compared. Finally, we show the impact of the matching accuracy on downstream tasks.

#### *4.1. Rotation and Scale Test*

In practical applications, there are often resolution inconsistencies and rotations between the sensed image and the reference image. In order to test the rotation and scale robustness of the proposed sub-image registration method, we experimented on data with simulated variations. The RADARSAT SAR sensor collected the data over Ottawa in May and August 1997, respectively. The size of each original image is 350 × 290, as shown in Figure 11.

**Figure 11.** The Ottawa data: (**a**) May 1997 and (**b**) August 1997.

A matching test between the two images at 5° intervals from −15° to 15° was carried out to verify the rotation robustness. In addition, two scaling ratios, 0.8 and 1.2, were tested to simulate the stability of the image registration algorithm at different resolutions. In all of the above cases, more than 100 matching points could be extracted between the two SAR images, with an RMSE of around 0.7 (sub-pixel level). As Figures 12 and 13 show, the proposed method is robust to rotation and scale.

**Figure 12.** The registration performance of SAR images at varying rotations: (**a**) −15°, (**b**) −10°, (**c**) −5°, (**d**) 5°, (**e**) 10°, and (**f**) 15°.

**Figure 13.** The registration performance of SAR images at varying scales: (**a**) 0.8, and (**b**) 1.2.

In addition, under different rotation and scaling conditions, a large number of matching points can be obtained not only in strong-edge areas but also in weak-texture areas. This is significantly helpful for SAR image registration involving low-texture areas. The simulation experiments reflect the validity of the image registration under the various changes found in real scenes.

#### *4.2. Robustness Test of the Algorithm to Noise*

Previously, we discussed the robustness of the algorithm to scale and rotation. There is often a high degree of noise in SAR images; taking motion error as an example, SAR images in practical applications will contain some unfocused regions. To verify whether the algorithm performs stably in noisy scenes, this paper refers to the method and results of [52]: two sets of images, before and after autofocus, are tested in order to verify the stability of the algorithm.

Here, we used our sub-image method to register the defocused data; the results can be seen in Figure 14. Although the image data contain a high degree of noise due to motion-error defocusing, our method can still obtain a good matching result between the two images, and the number of matching points is maintained at a high level with sub-pixel error. This shows that the proposed method is robust to noise.

**Figure 14.** Algorithm robustness test in a noise scenario.

#### *4.3. Program Execution Time Comparison*

In actual tasks, the registration process needs to achieve real-time or quasi-real-time analysis; as such, in addition to accuracy, timeliness is also an important measure. We compared several representative methods on a selection of characteristic dataset pairs, i.e., 2, 8, 12, and 17. The algorithm in this paper used the (1/2, 1/5) resolution configuration for the comparison. As Table 4 shows, our method is significantly faster than the traditional method, SAR-SIFT, and slightly slower than the other deep learning methods.


**Table 4.** Execution time (s) comparison of the different methods.

Methods like TFeat use a shallow convolutional neural network, so their feature extraction phase is faster. Given that our approach is multi-stage, the model is relatively complex. Beyond the model itself, our method obtains the most matching points and therefore consumes considerably more time in both the feature point matching and filtering stages. Although the running time is slightly longer, the SAR image registration performance is significantly improved. In future work, we will consider improving efficiency by applying distillation learning to the feature extraction module in order to obtain a lightweight network with similar performance.

#### *4.4. Change Detection Application*

In some applications, such as SAR image change detection, the simultaneous analysis of SAR images acquired under different conditions is inevitable. We carried out a simple analysis in which the registration result was applied to the task of SAR image change detection, using the previously mentioned Ottawa dataset.

In this experiment, we rotated one of the images to create a relative image offset. The two Ottawa SAR images were first registered by each of two methods, SAR-SIFT and ours, and the change detection results were then obtained. The PCA-Kmeans [53] method was used as the basic change detection method, and Kappa was used as the performance metric; the formula is as follows:

$$\kappa = \frac{2 \times (\text{TP} \times \text{TN} - \text{FN} \times \text{FP})}{(\text{TP} + \text{FP}) \times (\text{FP} + \text{TN}) + (\text{TP} + \text{FN}) \times (\text{FN} + \text{TN})} \tag{19}$$

The Kappa coefficient measures classification accuracy: the higher the value, the more accurate the classification result. Compared with SAR-SIFT, our proposed method improved the Kappa indicator from 0.307 to 0.677, which shows that accurate registration leads to better change detection results. Intuitively, from Figure 15, our method's result (**b**) is more similar to the ground truth (**c**) than that of SAR-SIFT (**a**). A deviation in the image registration causes different objects to be compared as if they were the same area during change detection, and is thus mistaken for an obvious change.
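Eq. (19) in code form; this algebraic form is equivalent to Cohen's kappa computed from the binary change-map confusion matrix:

```python
def kappa(tp, tn, fp, fn):
    """Eq. (19): Cohen's kappa from the change-detection confusion counts."""
    num = 2.0 * (tp * tn - fn * fp)
    den = (tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)
    return num / den
```

A perfect change map gives kappa = 1, while a classifier no better than chance gives kappa = 0, which is why kappa is preferred over raw accuracy when the changed class is small.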

**Figure 15.** The change map: (**a**) SAR-SIFT, (**b**) the proposed method, and (**c**) the ground truth.

#### **5. Conclusions**

This paper proposes a novel wide-swath SAR image registration method which uses a combination of traditional methods and deep learning to achieve accurate registration. Specifically, we combined clustering methods and traditional registration methods to complete the stable extraction of representative sub-image slices containing high-probability regions of feature points. Inspired by the Performer's self-attention mechanism, a coarse-to-fine sub-image dense-matching method was adopted for SAR image matching under different terrain conditions, including weak-texture areas.

The experimental results demonstrate that our method achieved good performance on different datasets covering multi-temporal, multi-polarization, multi-orbit-direction, rotation, scaling, and noise changes. At the same time, the combination of CNN and Performer verified the effectiveness of strong feature representations in SAR image registration. Under the framework of sub-image matching followed by original-image matching, stable dense matching can be obtained in high-probability regions. This framework overcomes the time-consumption problem of traditional matching methods. Compared with existing methods, more matching point pairs can be obtained by adjusting the model parameter settings of our method. Rotation, scaling, and noise experiments were also carried out to verify the robustness of the algorithm. The results showed that a large number of matching point pairs can be obtained even in regions with weak textures, which shows that our method can combine local and global features to characterize feature points more effectively.

In addition, the experimental results suggest that the running time is significantly less than those of traditional methods but slightly longer than those of similar deep learning methods; as such, the way in which to further simplify the network model will be the focus of the next step. Meanwhile, the matching between heterogeneous images is also a topic that can be discussed further.

**Author Contributions:** Conceptualization, Y.F., H.W. and F.W.; validation, Y.F., H.W. and F.W.; formal analysis, Y.F., H.W. and F.W.; investigation, Y.F., H.W. and F.W.; resources, Y.F., H.W. and F.W.; data curation, Y.F., H.W. and F.W.; writing—original draft preparation, Y.F., H.W. and F.W.; writing—review and editing, Y.F., H.W. and F.W.; visualization, Y.F., H.W. and F.W.; funding acquisition, Y.F., H.W. and F.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the National Natural Science Foundation of China (Grant No. 61901122), the Natural Science Foundation of Shanghai (Grant No. 20ZR1406300, 22ZR1406700), and the China High-resolution Earth Observation System (CHEOS)-Aerial Observation System Project (30-H30C01-9004-19/21).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data sharing not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **CRTransSar: A Visual Transformer Based on Contextual Joint Representation Learning for SAR Ship Detection**

**Runfan Xia 1,2,3, Jie Chen 1,2,3,\*, Zhixiang Huang 1,2, Huiyao Wan 1,2,3, Bocai Wu 3, Long Sun 3,4,5, Baidong Yao 3, Haibing Xiang <sup>3</sup> and Mengdao Xing 4,5**


**Abstract:** Synthetic-aperture radar (SAR) image target detection is widely used in military, civilian and other fields. However, existing detection methods have low accuracy due to the limitations presented by the strong scattering of SAR image targets, unclear edge contour information, multiple scales, strong sparseness, background interference, and other characteristics. In response, for SAR target detection tasks, this paper combines the global contextual information perception of transformers and the local feature representation capabilities of convolutional neural networks (CNNs) to innovatively propose a visual transformer framework based on contextual joint-representation learning, referred to as CRTransSar. First, this paper introduces the latest Swin Transformer as the basic architecture. Next, it introduces the CNN's local information capture and presents the design of a backbone, called CRbackbone, based on contextual joint representation learning, to extract richer contextual feature information while strengthening SAR target feature attributes. Furthermore, the design of a new cross-resolution attention-enhancement neck, called CAENeck, is presented to enhance the characterizability of multiscale SAR targets. The mAP of our method on the SSDD dataset attains 97.0% accuracy, reaching state-of-the-art levels. In addition, based on the HISEA-1 commercial SAR satellite, which has been launched into orbit and in whose development our research group participated, we released a larger-scale SAR multiclass target detection dataset, called SMCDD, which verifies the effectiveness of our method.

**Keywords:** transformer; deep learning; SAR target detection; multiscale learning; ship detection

#### **1. Introduction**

Synthetic-aperture radar (SAR) is an active microwave sensor that produces all-weather earth observations without being restricted by light and weather conditions. Compared with optical remote sensing images, SAR has significant application value. In recent years, SAR target detection and recognition have been widely used in military and civilian fields, such as military reconnaissance, situational awareness, agriculture, forestry management and urban planning. In particular, future war zones will extend from the traditional areas of land, sea and air to space. As a reconnaissance method with unique advantages, synthetic-aperture radar satellites may be used to seize the right to control information on future war zones and even play a decisive role in the outcome of these wars. SAR image target detection and recognition is the key technology with which to realize these military and civilian applications. Its core idea is to efficiently filter out regions and targets of interest through detection algorithms, and accurately identify their category attributes.

**Citation:** Xia, R.; Chen, J.; Huang, Z.; Wan, H.; Wu, B.; Sun, L.; Yao, B.; Xiang, H.; Xing, M. CRTransSar: A Visual Transformer Based on Contextual Joint Representation Learning for SAR Ship Detection. *Remote Sens.* **2022**, *14*, 1488. https://doi.org/10.3390/rs14061488

Academic Editors: Tianwen Zhang, Tianjiao Zeng and Xiaoling Zhang

Received: 3 February 2022 Accepted: 14 March 2022 Published: 19 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The imaging mechanism of SAR images is very different from that of optical images. SAR targets have characteristics such as strong scattering, unclear edge contour information, multiple scales, strong sparseness, weak and small signatures, sidelobe interference, and complex backgrounds, so SAR target detection and recognition tasks present huge challenges. In recent years, many research teams have conducted extensive research on the above-mentioned difficulties. For SAR target imaging problems, phase modulation from a moving target's higher-order movements severely degrades the focusing quality of SAR images, because the conventional SAR ground moving target imaging (GMTIm) algorithm assumes a constant target velocity in high-resolution GMTIm with single-channel SAR. To solve this problem, a novel SAR-GMTIm algorithm [1] in the compressive sensing (CS) framework was proposed to obtain high-resolution SAR images with highly focused responses and accurate relocation. To improve moving target detectors, one study proposed a new moving target indicator (MTI) scheme [2] that combines displaced-phase-center antenna (DPCA) and along-track interferometry (ATI) sequentially to reduce false alarms compared to MTI via either DPCA or ATI alone. As shown by the simulation results, that method can not only reduce the false alarm rate significantly, but can also maintain a high detection rate. Another study proposed a SAR change-detection approach [3] based on a structural similarity index measure (SSIM) and multiple-window processing (MWP). In short, [1] focuses on SAR imaging, while [2] and [3] focus on detecting moving SAR targets and changes in SAR images, respectively; the focus of our study, by contrast, is SAR target detection.

The use of a detector with a constant false-alarm rate (CFAR) [4] is common in radar target detection. CFAR detection is an important part of automatic radar target detection. It can be used as the first step in extracting targets from SAR images and is the basis for further target identification. However, traditional methods rely too heavily on expert experience to design manual features, which have great limitations; they also struggle to adapt to SAR target detection in complex scenes and cannot be used for large-scale practical applications. Among traditional feature-extraction target-detection methods, the histogram of oriented gradients (HOG) is a feature descriptor used for object detection in computer vision and image processing. HOG calculates histograms based not on color values but on gradients: it constructs features by calculating and counting histograms of gradient directions in local areas of the image. HOG features combined with support-vector-machine (SVM) classifiers have been widely used in SAR image recognition. In recent years, with the development of computer vision, convolutional neural networks have been applied to SAR image detection, and a large number of deep neural networks have been developed, including AlexNet [5], VGGNet [6], ResNet [7], and GoogLeNet [8]. Additionally, methods such as Faster R-CNN [9], SSD [10], and YOLO V3 [11] are also widely used in SAR image recognition. We mainly rely on the advantages of CNNs because they are highly skilled at extracting local feature information from images, with more refined local attention capabilities. However, because of the large downsampling coefficient used in CNNs to extract features, such networks easily miss small targets.
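The HOG descriptor mentioned above can be sketched in a few lines: it is essentially a set of magnitude-weighted histograms of gradient orientations over local cells. The following is a minimal NumPy illustration of that idea (the function name and parameters are our own, and real HOG implementations add block normalization on top of this):

```python
import numpy as np

def hog_cell_histograms(image, cell_size=8, n_bins=9):
    """Toy HOG descriptor: per-cell histograms of gradient orientations.

    A simplified sketch of the HOG idea (gradient-orientation histograms
    over local cells); real implementations add block normalization.
    """
    # Finite-difference gradients along y (rows) and x (columns).
    gy, gx = np.gradient(image.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees, as in classic HOG.
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0

    h, w = image.shape
    n_cy, n_cx = h // cell_size, w // cell_size
    hist = np.zeros((n_cy, n_cx, n_bins))
    bin_width = 180.0 / n_bins
    for cy in range(n_cy):
        for cx in range(n_cx):
            sl = (slice(cy * cell_size, (cy + 1) * cell_size),
                  slice(cx * cell_size, (cx + 1) * cell_size))
            bins = (orientation[sl] / bin_width).astype(int) % n_bins
            # Magnitude-weighted vote into orientation bins.
            np.add.at(hist[cy, cx], bins.ravel(), magnitude[sl].ravel())
    return hist.ravel()

# A 32x32 toy "image": descriptor length = 4*4 cells * 9 bins = 144.
feat = hog_cell_histograms(np.random.default_rng(0).random((32, 32)))
print(feat.shape)  # (144,)
```

The resulting vector is what would be fed to a linear SVM in the classic HOG + SVM detection pipeline.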
In addition, a large number of studies have shown that the actual receptive field in a CNN is much smaller than the theoretical receptive field, which is not conducive to making full use of context information, and a CNN's feature-capture ability cannot extract global representations. Although we can enhance a CNN's global capture ability by continuously stacking deeper convolutional layers, this results in a network that is too deep, with too many parameters for the model to learn, difficulty in converging effectively, and the possibility that accuracy may not be greatly improved. Additionally, the model becomes too large, the amount of calculation increases sharply, and timeliness becomes difficult to guarantee.

In recent years, classification and detection frameworks with a transformer [12] as the main body have received widespread attention. Since Google proposed bidirectional encoder representations from transformers (BERT) [13], models such as generalized autoregressive pretraining for language understanding (XLNet) [14] have emerged, but their core structure is still the transformer. The first vision transformer (ViT) for image classification was proposed in [15] and obtained the best results in optical natural scene recognition. Network models with a transformer as the main body, such as the detection transformer (DETR) [16] and the Swin Transformer [17], have since appeared in succession.

The Swin Transformer is currently mainly used for image classification, optical object detection, and instance segmentation of natural scenes in the field of computer vision. In the field of remote sensing, it is mainly used for image segmentation [18] and semantic segmentation [19]. We investigated the papers in this area in detail and did not find any research work in the field of SAR target detection. Transferring the entire framework to SAR target segmentation is also a focus of our future work.

The successful application of transformers in the field of image recognition is mainly due to three advantages. The first is the ability to break through the RNN model's limitation and be computed in parallel. The second is that, compared with a CNN, the number of operations required to calculate the association between two positions does not increase with distance. The third is that self-attention yields more interpretable models: we can inspect the attention distributions of the model, and each attention head can learn to perform different tasks. Compared with the CNN architecture, the transformer has better global feature-capture abilities. Therefore, addressing the key technical difficulties in the above-mentioned SAR target detection task, this paper combines the global context information perception of the transformer and the local feature extraction ability of the CNN, and innovatively proposes a contextual joint-representation-learning visual transformer framework, referred to as CRTransSar. This is the first such framework attempt in the field of SAR target detection. Experimental results on SSDD and our self-built SAR target dataset show that our method achieves higher precision. This paper focuses on the optimization design of the backbone and neck parts of the target detection framework; we take the Cascade Mask R-CNN framework as the basic framework of our method, and our method can be used as a functional module flexibly embedded in any other target detection framework. The main contributions of this paper include the following:


#### **2. Related Work**

#### *2.1. SAR Target Detection Algorithm*

In traditional SAR ship target detection, the use of a detector with a constant false-alarm rate (CFAR) [4] is a common method for radar target detection. It can be used as the first step in extracting targets from SAR images and is the basis for further target identification. Chen et al. [20] proposed a histogram-based CFAR (H-CFAR) method, which directly uses the gray histogram of the SAR image and combines it with CFAR to successfully achieve ship target detection. Li et al. [4] proposed an improved super-pixel-level CFAR detection method, which uses weighted information entropy (WIE) to describe the statistical characteristics of super-pixels and better distinguishes between target and clutter super-pixels. With the development of computer vision, convolutional neural networks have been applied to the detection of SAR images, and a large number of deep neural networks have emerged, such as AlexNet, VGGNet, ResNet, and GoogLeNet, which in turn underpin detectors such as Faster R-CNN, SSD, and YOLO V3 that are widely used in SAR image recognition. These detectors can be roughly divided into two-stage methods, such as Mask R-CNN [21] and Faster R-CNN, and single-stage methods, such as YOLO V3 and SSD. More recently, transformers have been combined with CNNs, and the resulting deep-level extracted features are better suited to SAR targets.

The two-stage detection method joins a fully connected segmentation subnet after the basic feature network, and the original classification and regression task is divided into three tasks: classification, regression, and segmentation. These tasks are applied to SAR target detection to improve ship recognition accuracy. Its working principle is divided into four stages. First, a set of basic convolution, ReLU activation, and pooling layers is used to extract features. Next, the feature maps are passed into the subsequent RPN and the fully connected layer; the RPN network is used to generate region proposals. The RoI pooling layer then collects the feature maps output by the convolutional layers and the proposals generated by the RPN network before passing them into the following fully connected layers to determine the target category. Finally, the proposals and feature maps are used to calculate the category of each proposal, and bounding-box regression is applied again to obtain the detection frame's final precise position.
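The RoI pooling step in this pipeline simply crops each proposal from the feature map and max-pools it onto a fixed grid, so the following fully connected layers always see a constant-size input. A minimal NumPy sketch (the function name and the 7 × 7 output size are illustrative defaults, not the paper's code):

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=7):
    """Simplified RoI max pooling: crop a region of a feature map and
    max-pool it onto a fixed out_size x out_size grid.

    feature_map: (C, H, W); roi: (x1, y1, x2, y2) in feature-map coords.
    """
    c = feature_map.shape[0]
    x1, y1, x2, y2 = roi
    crop = feature_map[:, y1:y2, x1:x2]
    out = np.full((c, out_size, out_size), -np.inf)
    h, w = crop.shape[1:]
    # Split the crop into a roughly even out_size x out_size grid of bins.
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            # Guarantee each bin spans at least one pixel.
            bin_ = crop[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                           xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = bin_.max(axis=(1, 2))
    return out

fmap = np.arange(1 * 16 * 16, dtype=float).reshape(1, 16, 16)
pooled = roi_max_pool(fmap, (2, 2, 14, 14), out_size=7)
print(pooled.shape)  # (1, 7, 7)
```

Whatever the proposal's size, the output grid is fixed, which is what lets proposals of different shapes share the same fully connected classification and regression head.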

Single-stage detection methods, such as SSD [10], mainly detect specific targets directly from many dense anchor points and use features of different scales to predict objects. The main idea is to sample densely and uniformly at different positions of the picture, with different scales and aspect ratios; a CNN then extracts features and directly performs classification and regression. The introduction of the YOLO series further improved detection speed.

Existing deep learning-based SAR ship detection algorithms have huge model sizes and very deep network scales. A series of algorithms has been proposed by Xiaoling Zhang's team. ShipDeNet-20 [22] is a novel SAR ship detector built from scratch; it is lighter than most algorithms and can be applied effectively to hardware transplantation. The detection accuracy of SAR ships is reduced by the huge imbalance in the number of samples across different scenarios; to solve this problem, the authors of [23] proposed a balance scene learning mechanism (BSLM) for offshore and inshore ship detection in SAR images. In addition, the authors of [24] proposed a novel approach for high-speed ship detection in SAR images based on a grid convolutional neural network (G-CNN). This method improves detection speed by meshing the input image, inspired by the basic principle of YOLO, and by using depthwise separable convolution. However, most existing studies improve detection accuracy at the expense of detection speed. To solve this problem, HyperLi-Net [25] was proposed for high-accuracy and high-speed SAR ship detection. In addition, a novel high-speed SAR ship detection approach mainly using a depthwise separable convolutional neural network (DS-CNN) was proposed [26]; it integrates a multi-scale detection mechanism, a concatenation mechanism, and an anchor box mechanism to establish a brand-new lightweight network architecture for high-speed SAR ship detection. There are still some challenges hindering accuracy improvements for SAR ship detection, such as complex background interference, multi-scale ship feature differences, and indistinct small-ship features. To address these problems, a novel quad feature pyramid network (Quad-FPN) [27] was proposed for SAR ship detection.

#### *2.2. Transformer*

Since the emergence of the attention mechanism and its high-quality performance in natural language processing, researchers have tried to introduce the attention mechanism into computer vision, although research there is still mainly focused on optical natural image scenes. Applying transformers in the field of vision has recently become increasingly popular. Vision transformers [15] can simultaneously learn low-level features and high-level semantic information by combining convolutions with regular transformers [12]. Experiments have shown that after replacing the last convolution module of ResNet [7] with a visual transformer, the number of parameters is reduced and the accuracy is improved. DETR [16] uses a complete transformer to build an end-to-end target detection model; its greatest highlight is the decoder. The original transformer decoder predicts and generates sentence sequences, but in the target detection task the decoder's input is a set of learned object vectors, each of which outputs an object category and coordinates after an FFN. The DETR model is simple and straightforward, and it abandons the manual design of anchors. Small objects have less pixel information and are easily lost in the downsampling process; ships at sea, for example, are small in size. To address the detection problems posed by such obvious object-size differences, the classic method is the image pyramid used for multiscale enhancement, but this involves a considerable amount of calculation. Multiscale detection is nonetheless becoming increasingly important in target detection, especially for small targets.

In general, there are two main model architectures in related work that use a transformer in computer vision (CV): a transformer-only structure [11], and a hybrid structure that combines a CNN backbone network with a transformer. In [15], a vision transformer was proposed for image classification for the first time. This research shows that dependence on CNNs is not necessary: when applied directly to a sequence of image blocks, the transformer can also perform image classification tasks well. The research pretrains the model on a large amount of data and migrates it to multiple image recognition benchmark datasets. The results show that the vision transformer (ViT) model is comparable to the current optimal convolutional networks, while the computing resources required for its training are greatly reduced. The approach divides the image into multiple patches and uses the linear embedding sequence of these patches as the input to the transformer; the token processing method from the natural language processing (NLP) field is then used to process the image blocks, and the image classification model is trained in a supervised manner. When trained on a medium-scale dataset (such as ImageNet), the model produces unsatisfactory results. This seemingly frustrating result is predictable: the transformer lacks some of the inductive biases inherent to CNNs, such as translation equivariance and locality, so after training on insufficient data it cannot generalize well. However, if the model is trained on a large dataset (14M–300M images), the situation is quite different. The study found that large-scale training outperforms inductive bias: when pretrained at a large enough data scale and migrated to tasks with fewer data points, the transformer can achieve excellent results.

#### *2.3. Related Datasets in the Field of SAR Target Detection*

On 1 December 2017, at the BIGSARDATA conference held in Beijing, China, SSDD [28], a dataset for ship target detection in SAR images, was released. SSDD is the first public dataset in this field. As of 25 August 2021, of 161 papers on deep learning-based SAR ship detection, 75 used SSDD as the training and test data, accounting for 46.6%, which shows the popularity and significance of SSDD in the SAR remote sensing community. The datasets used in other papers are the five other public datasets proposed in recent years, including the SAR-Ship dataset released by Wang et al. in 2019, AIR-SARShip-1.0 released by Sun et al. [29], HRSID released in 2020, and LS-SSDD-v1.0 released by Zhang et al. [30] in 2020. The original SSDD paper used a random 7:1:2 ratio to divide the dataset into training, validation, and test sets. However, this random partitioning mechanism leads to great uncertainty over the samples in the test set, resulting in different results when the same detection algorithm is trained and tested multiple times. This is because the number of samples in SSDD is too small, only 1160, and random division may destroy the distribution consistency between the training and test sets. As in HRSID [31] and LS-SSDD-v1.0, images containing land are considered near-shore samples, while other images are considered far-sea samples. The numbers of near-shore and far-sea samples are highly unbalanced (19.8% and 80.2%, respectively), a phenomenon consistent with the fact that the oceans cover much more of the Earth's surface than land. The SSDD dataset contains a total of 1160 images and 2456 ships, an average of 2.12 ships per image, and will continue to expand in the future. Compared with the PASCAL VOC [32] dataset, which features 20 categories of objects, SSDD has fewer pictures, but since its only category is ships, it is enough to train the detection model.

The HRSID dataset was released by Su Hao of the University of Electronic Science and Technology of China in January 2020. HRSID is a dataset for ship detection, semantic segmentation, and instance segmentation tasks in high-resolution SAR images. The dataset contains a total of 5604 high-resolution SAR images and 16,951 ship instances. HRSID borrows from the construction process of the Microsoft Common Objects in Context (COCO) [33] dataset, including SAR images at different resolutions, polarizations, sea states, sea areas, and coastal ports. This dataset is a benchmark against which researchers evaluate their methods. For HRSID, the resolutions of the SAR images are 0.5 m, 1 m, and 3 m.

#### **3. The Proposed Method**

This paper combines the respective advantages of the transformer [12] and CNN architectures and is oriented to the SAR target detection task. Thus, we innovatively propose a visual transformer SAR target detection framework based on contextual joint-representation learning, called CRTransSar. The overall framework is shown in Figure 1. This is the first framework attempt in the field of SAR target detection.

**Figure 1.** Overall architecture of the CRTransSar network.

#### *3.1. The Overall Framework of Our CRTransSar*

First, taking the Cascade Mask R-CNN two-stage model as the basic architecture, this paper innovatively introduces the latest Swin Transformer architecture as the backbone, introduces the local feature extraction module of CNNs, and redesigns a target detection framework. The design of the framework can fully extract and integrate global and local joint representations.

Furthermore, this paper combines the respective advantages of a Swin Transformer and CNN to design a brand-new backbone, referred to as CRbackbone. Thus, the model can make full use of contextual information, perform joint-representation learning, extract richer contextual feature information, and improve the multi-characterization and description of multiscale SAR targets.

Finally, we designed a new cross-resolution attention-enhancement neck, called CAENeck. A feature pyramid network [34] is used to convey strong semantic features from top to bottom, enhancing the two-way multiscale connection operation through top-down and bottom-up attention, while also aggregating parameters from different backbone layers into different detection layers, which can guide the multi-resolution learning of dynamic attention modules with little increase in computational complexity.

As shown in Figure 1, CRTransSar is mainly composed of four parts: CRbackbone, CAENeck, RPN-Head, and RoI-Head. First, we used our designed CRbackbone to extract features from the input image and performed a multiscale fusion of the obtained feature maps. The bottom feature map is responsible for predicting small targets, and the high-level feature map is responsible for predicting large targets. The RPN module receives the multiscale feature maps and generates anchor boxes: nine anchors corresponding to each point on the feature map, which can cover all possible objects in the original image. A 1 × 1 convolution produces prediction scores and prediction offsets for each anchor frame, and all anchor frames are matched with labels. Next, we calculate the IoU value to determine whether each anchor frame belongs to the background or the foreground; here, we establish a standard to distinguish positive samples from negative samples. After the above steps, a set of suitable proposals is obtained. The received feature maps and the above proposals are passed into RoI pooling for unified processing, and finally passed to the fully connected R-CNN network for classification and regression.
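The anchor generation and IoU-based foreground/background labeling described above can be sketched as follows. The scales, ratios, and thresholds here are common RPN defaults used only for illustration, not values taken from the paper:

```python
import numpy as np

def make_anchors(cx, cy, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Nine anchors (3 scales x 3 aspect ratios) centred on one
    feature-map point, as (x1, y1, x2, y2) boxes."""
    boxes = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter)

def label_anchors(anchors, gt, pos_thr=0.7, neg_thr=0.3):
    """Label each anchor foreground (1), background (0), or ignore (-1)
    by its IoU with a ground-truth box -- the usual RPN convention."""
    ious = np.array([iou(a, gt) for a in anchors])
    labels = np.full(len(anchors), -1)
    labels[ious >= pos_thr] = 1
    labels[ious < neg_thr] = 0
    return labels

anchors = make_anchors(64, 64)
labels = label_anchors(anchors, gt=(32, 32, 96, 96))
print(len(anchors), labels.tolist())
```

With a 64 × 64 ground-truth box centred on the anchor point, the scale-64, ratio-1 anchor matches exactly and is labelled foreground, clearly mismatched anchors become background, and intermediate-overlap anchors are ignored during RPN training.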

#### *3.2. Backbone Based on Contextual Joint Representation Learning: CRbackbone*

Aiming at the strong scattering, sparseness, multiscale, and other characteristics of SAR targets, this paper combines the respective advantages of transformer and CNN architectures to design a target detection backbone based on contextual joint representation learning, called CRbackbone. It performs joint representation learning, extracts richer contextual feature salient information, and improves the feature description of multiscale SAR targets.

First, we used the Swin Transformer, which currently performs best in NLP and optical classification tasks, as the basic backbone. Next, we incorporated the CNN's multiscale local information acquisition and redesigned the architecture of the Swin Transformer. Influenced by the latest EfficientNet [35] and inspired by the architecture of CoTNet [36], we introduced multidimensional hybrid convolution in the patch-embedding part to expand the receptive field, depth, and resolution, which enhanced the feature perception domain. Furthermore, the self-attention module was introduced to strengthen the comparison between different windows on the feature map and to allow contextual information exchange.

#### 3.2.1. Swin Transformer Module

For SAR images, small target ships in large scenes easily lose information in the process of downsampling. Therefore, we use a Swin Transformer [17]; the framework is shown in Figure 2. The transformer has general modeling capabilities that are complementary to convolution: powerful modeling ability, better connections between vision and language, large throughput, and large-scale parallel processing capabilities. When a picture is input into our network, the transformer [12] first divides the picture into tokens, similar to NLP. The differences between language and images, such as the high resolution of images and large variations in visual scale, motivate a hierarchical transformer whose representation is computed with shifted windows. By limiting self-attention computation to non-overlapping local windows while allowing cross-window connections, the shifted-window scheme achieves higher efficiency. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. This is an improvement over the vision transformer.

**Figure 2.** Overall architecture of Swin Transformer. (**a**) Swin Transformer structure diagram. (**b**) Swin Transformer blocks.

The vision transformer always attends over the patches segmented at the beginning and performs no further operations on them, so its receptive field does not change. The Swin Transformer instead enlarges windows progressively, and self-attention is calculated in units of windows. This is equivalent to introducing locally aggregated information, which is very similar to the convolution process of a CNN in which the stride equals the convolution kernel size, so the windows do not overlap. The difference is that a CNN performs the convolution calculation in each window, and each window finally yields one value representing the characteristics of that window, whereas the Swin Transformer performs the self-attention calculation in each window and obtains an updated window. Next, through the patch-merging operation, windows are merged, and the merged windows continue to perform self-attention. The Swin Transformer groups the patches of four neighbouring windows together in the process of continuous downsampling, so the number of patches decreases; in the end, the entire image is covered by a single window of patches. Therefore, downsampling here means reducing the number of patches while the size of each patch increases, which enlarges the receptive field.
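The patch-merging downsampling described above can be illustrated with a short NumPy sketch: each 2 × 2 neighbourhood of patch features is concatenated along the channel axis, so the patch count drops by 4× while each remaining patch covers a larger image area (the real Swin layer then applies a linear projection from 4C to 2C channels, omitted here):

```python
import numpy as np

def patch_merging(x):
    """Swin-style patch merging sketch: group each 2x2 neighbourhood of
    patches and concatenate along the channel axis.

    x: (H, W, C) grid of patch features -> (H/2, W/2, 4C).
    (The real layer follows this with a linear projection 4C -> 2C.)
    """
    top_l, top_r = x[0::2, 0::2, :], x[0::2, 1::2, :]
    bot_l, bot_r = x[1::2, 0::2, :], x[1::2, 1::2, :]
    return np.concatenate([top_l, bot_l, top_r, bot_r], axis=-1)

x = np.random.default_rng(0).random((8, 8, 96))
y = patch_merging(x)
print(x.shape, "->", y.shape)  # (8, 8, 96) -> (4, 4, 384)
```

Repeating this stage by stage is what builds the hierarchical feature pyramid that makes the Swin Transformer usable as a detection backbone.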

As illustrated in Figure 3, the first module uses a regular window-partitioning strategy, which starts from the top-left pixel, and the 8 × 8 feature map is evenly partitioned into 2 × 2 windows of size 4 × 4 (M = 4). The next module then adopts a windowing configuration shifted from that of the preceding layer by displacing the windows by (M/2, M/2) pixels relative to the regularly partitioned windows. With the shifted window-partitioning approach, consecutive Swin Transformer blocks are computed as:

$$\begin{aligned}
\hat{\mathbf{z}}^{l} &= \text{W-MSA}\left(\text{LN}\left(\mathbf{z}^{l-1}\right)\right) + \mathbf{z}^{l-1}\\
\mathbf{z}^{l} &= \text{MLP}\left(\text{LN}\left(\hat{\mathbf{z}}^{l}\right)\right) + \hat{\mathbf{z}}^{l}\\
\hat{\mathbf{z}}^{l+1} &= \text{SW-MSA}\left(\text{LN}\left(\mathbf{z}^{l}\right)\right) + \mathbf{z}^{l}\\
\mathbf{z}^{l+1} &= \text{MLP}\left(\text{LN}\left(\hat{\mathbf{z}}^{l+1}\right)\right) + \hat{\mathbf{z}}^{l+1}
\end{aligned} \tag{1}$$

where $\hat{\mathbf{z}}^{l}$ and $\mathbf{z}^{l}$ denote the output features of the (S)W-MSA module and the MLP module for block $l$, respectively; W-MSA and SW-MSA denote window-based multi-head self-attention using regular and shifted window-partitioning configurations, respectively.

**Figure 3.** Swin Transformer sliding window. In the figure, 4, 8, 16 represent the number of patches.

A Swin Transformer performs self-attention within each window. Compared with the global attention calculation performed by a standard transformer, whose MSA complexity grows with the square of the number of patches, the saving is substantial. For example, for a 3 × 3 grid of patches, global MSA attends over all (3 × 3)<sup>2</sup> = 81 patch pairs. The Swin Transformer instead calculates self-attention in each local window (the red part in the figure); if each window contains a single patch, each window contributes only 1<sup>2</sup> = 1 pair, and summing over nine windows gives a total of nine, a greatly reduced figure. The computational complexities of MSA and W-MSA are expressed by Formulas (2) and (3).

$$\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C \tag{2}$$

$$\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC \tag{3}$$
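Formulas (2) and (3) can be evaluated directly to see the saving: for a fixed window size M, W-MSA grows linearly in the number of patches hw, whereas global MSA grows quadratically. A quick numerical check (the 56 × 56, C = 96, M = 7 setting matches a typical Swin-T first stage and is used here only as an example):

```python
def msa_flops(h, w, C):
    """Global multi-head self-attention cost, Formula (2):
    quadratic in the number of patches h*w."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M):
    """Window-based self-attention cost, Formula (3):
    linear in h*w for a fixed window size M."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

# e.g. a 56x56 feature map with C = 96 channels and 7x7 windows
print(msa_flops(56, 56, 96))      # ~2.0e9 operations
print(wmsa_flops(56, 56, 96, 7))  # ~1.45e8 operations
```

At this resolution the windowed variant is roughly 14× cheaper, and the gap widens as the feature map grows.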

Although computing self-attention inside the window may greatly reduce the complexity of the model, different windows cannot interact with each other, resulting in a lack of expressiveness. To better enhance the performance of the model, shifted-windows attention is introduced. Shifted windows alternately move between successive Swin Transformer blocks.
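The regular and shifted window partitions can be sketched with a cyclic roll, which is how the shifted configuration is commonly implemented in practice (a simplified single-channel NumPy version; the attention masking that keeps rolled-in pixels from interacting across the map edge is omitted):

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W) feature map into non-overlapping M x M windows."""
    h, w = x.shape
    return (x.reshape(h // M, M, w // M, M)
             .transpose(0, 2, 1, 3)
             .reshape(-1, M, M))

def shift_windows(x, M):
    """Shifted-window step: cyclically roll the map by (M//2, M//2) so
    the next block's windows straddle the previous window borders."""
    return np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))

x = np.arange(64).reshape(8, 8)
regular = window_partition(x, 4)                    # 4 windows of 4x4
shifted = window_partition(shift_windows(x, 4), 4)  # windows straddle borders
print(regular.shape, shifted.shape)  # (4, 4, 4) (4, 4, 4)
```

Alternating the two partitions between consecutive blocks is what lets information propagate across window boundaries without paying for global attention.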

#### 3.2.2. Self-Attention Module

Due to its spatial locality and other characteristics in computer vision tasks, a CNN can only model local information and lacks the ability to model and perceive long-range dependencies. The Swin Transformer introduces a shifted-window partition to mitigate this defect, but the information exchange between different windows is still limited to the exchange of local information. Therefore, based on multi-head attention, this paper takes the CoTNet [36] contextual attention mechanism into account and proposes to integrate the attention module block into the Swin Transformer, so that the otherwise independent Q and K matrices are connected to each other. After the feature extraction network reaches the patch-embedding stage, the feature map input to the network is 640 × 640 × 3. However, the length and width of the input data are not always 640 × 640, so according to the length and width of the feature map we determine whether each is an integer multiple of 4 and whether the feature map needs padding, followed by two convolutional layers. The number of feature channels changes from 3 to 96, and the feature dimension changes to 1/4 of the previous dimension. Finally, the size of the attention module input is 160 × 160 × 96, and the size of the convolution kernel is 3 × 3, as shown in Figure 4. The feature dimension and feature channels of the module remain unchanged, which strengthens the information exchange between the different windows on the feature map.

**Figure 4.** Self-attention module block.

The first step is to define three variables: Q = X, K = X, and V = XWv. V is obtained through 1 × 1 convolution processing, while K undergoes a K × K grouped convolution to produce a static context, which is concatenated with the Q matrix. The concatenated result then passes through two 1 × 1 convolutions, and the calculation is shown in Formula (4).

The self-attention module first encodes the contextual information of the input keys through a 3 × 3 convolution, obtaining a static contextual representation K<sup>1</sup> of the input. The encoded keys are then concatenated with the input query, and a dynamic multi-head attention matrix is learned through two consecutive 1 × 1 convolutions. The resulting attention matrix is multiplied with the input values to obtain a dynamic contextual representation K<sup>2</sup> of the input. The fusion of the static and dynamic contextual representations forms the output O. The architecture of the self-attention module is shown in Figure 5.

$$A = [K^1, Q] W_\theta W_\delta \tag{4}$$

**Figure 5.** Architecture of self-attention module.

Here, *A* models more than the pairwise relationship between *Q* and *K*: guided by context modeling, the communication between the parts is strengthened and the self-attention mechanism is enhanced.

$$K^2 = V \otimes A \tag{5}$$

$$O = K^2 \oplus K^1 \tag{6}$$
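The static/dynamic context fusion of Equations (4)–(6) can be sketched as follows. This is a simplified illustration, not the paper's exact module: the tensor product ⊗ in Equation (5) is reduced to channel-wise softmax weighting for brevity, and the group count of the key convolution is an assumption.

```python
import torch
import torch.nn as nn

class ContextAttention(nn.Module):
    """Simplified sketch of the CotNet-style block of Equations (4)-(6):
    a 3x3 grouped convolution over the keys yields the static context K1;
    [K1, Q] passes through two 1x1 convolutions (W_theta, W_delta) to give
    the attention map A; A weights the values V to give the dynamic context
    K2; the output fuses the two contexts. The product in Eq. (5) is
    reduced to elementwise weighting here for brevity."""
    def __init__(self, dim):
        super().__init__()
        self.key_embed = nn.Conv2d(dim, dim, 3, padding=1, groups=4, bias=False)
        self.value_embed = nn.Conv2d(dim, dim, 1, bias=False)       # W_v
        self.w_theta = nn.Conv2d(2 * dim, dim, 1, bias=False)
        self.w_delta = nn.Conv2d(dim, dim, 1, bias=False)

    def forward(self, x):
        k1 = self.key_embed(x)                      # static context
        v = self.value_embed(x)
        a = self.w_delta(self.w_theta(torch.cat([k1, x], dim=1)))   # Eq. (4)
        k2 = v * torch.softmax(a, dim=1)            # Eq. (5), simplified
        return k1 + k2                              # Eq. (6)

out = ContextAttention(96)(torch.randn(1, 96, 40, 40))
```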

Unlike the traditional self-attention mechanism, the self-attention module block in this paper combines contextual information with self-attention. It also differs from the latest global self-attention mechanisms, such as HOG-ShipCLSNet [37] and PFGFE-Net [38], which distinguish the characteristics of different scales and polarization modes to ensure sufficiently global responses for comprehensively describing SAR ships. The specific process is as follows. First, θ and Φ are computed through two 1 × 1 convolutional layers, and a third 1 × 1 convolutional layer produces the representation g(·). The similarity f is then obtained through the matrix multiplication θ<sup>T</sup>Φ and passed through a softmax layer, and the result is multiplied by g to obtain the self-attention output. Furthermore, to make the output y<sub>i</sub> match the dimension of the input x for the subsequent element-wise addition, an extra 1 × 1 convolutional layer performs dimension reshaping: in the embedded space the number of channels is c/2, which differs from the c input channels. This step is analogous to the residual (skip) connections in ResNet: the weight matrix of the 1 × 1 convolutional layer multiplies y<sub>i</sub>, and the result is added to the input.

#### 3.2.3. Multidimensional Hybrid Convolution Module

To enlarge the receptive field in accordance with the characteristics of SAR targets, this section describes the proposed method in detail. The feature extraction network proposed in this paper improves the backbone on the basis of a Swin Transformer: CNN convolutions and the attention mechanism are integrated into the patchembed module, which is reconstructed accordingly. The overall structure of the feature extraction network is shown in Figure 6. Inspired by EfficientNet [35], a multidimensional hybrid convolution module is introduced into the patchembed module. The motivation is that, by the mechanism of CNNs, the more convolutional layers are stacked, the larger the receptive field of the feature maps becomes. We use this approach to expand the receptive field and the depth of the network, and to increase the resolution, thereby improving network performance.

**Figure 6.** Overall architecture of CRbackbone.

When computing resources increase, exhaustively searching all combinations of the three variables of network width, network depth, and image resolution yields an unbounded search space and very low search efficiency. The key to obtaining higher accuracy and efficiency is to balance the scaling ratios (*d*, *w*, *r*) of the three dimensions using the combined zoom method:

$$\text{depth: } d = \alpha^{\varphi}, \quad \text{width: } w = \beta^{\varphi}, \quad \text{resolution: } r = \gamma^{\varphi}, \quad \text{s.t. } \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2, \; \alpha \ge 1, \beta \ge 1, \gamma \ge 1 \tag{7}$$

*α*, *β*, and *γ* are constants (bounded, because the three determine the amount of computation) that can be obtained by grid search, while the mixing coefficient *φ* is adjusted manually. Doubling the network depth doubles the computation, whereas doubling the network width or the image resolution quadruples it; that is, the computation of the convolution operations (FLOPS) is proportional to *d*, *w*<sup>2</sup>, and *r*<sup>2</sup>, which is why two squared terms appear in the constraint. Under this constraint, once the mixing coefficient *φ* is specified, the computation of the network is about 2<sup>φ</sup> times what it was before.
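The compound scaling of Equation (7) is easy to make concrete. The sketch below uses the EfficientNet-B0 coefficients (α = 1.2, β = 1.1, γ = 1.15) purely as an illustration; the paper does not state which coefficients it adopts.

```python
def compound_scaling(alpha, beta, gamma, phi):
    """EfficientNet-style compound scaling of Equation (7): given base
    coefficients found by grid search and a user-chosen phi, return the
    depth/width/resolution multipliers. FLOPS grow roughly by
    (alpha * beta**2 * gamma**2) ** phi, i.e. about 2**phi when the
    constraint alpha * beta**2 * gamma**2 ~= 2 holds."""
    assert alpha >= 1 and beta >= 1 and gamma >= 1
    d = alpha ** phi        # depth multiplier
    w = beta ** phi         # width multiplier
    r = gamma ** phi        # resolution multiplier
    flops_factor = (alpha * beta ** 2 * gamma ** 2) ** phi
    return d, w, r, flops_factor

# illustrative coefficients (EfficientNet-B0), not taken from the paper
d, w, r, f = compound_scaling(1.2, 1.1, 1.15, phi=1)
```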

We can now combine the above three factors through the hybrid-parameter expansion method. Research in this direction is not lacking: models such as MobileNet [39], ShuffleNet [40], and MnasNet [41] compress the model by reducing the number of parameters and computations so that it can be deployed on mobile and edge devices; the parameter and computation counts drop considerably while the accuracy of the model is largely preserved. The patchembed module mainly increases the channel dimension of each patch: patch partitioning divides the input H × W × 3 image into non-overlapping patches, reducing the size of the feature map before it is sent to the Swin Transformer block. Each feature map enters patchembed with dimensions 2 × 3 × H × W and, after fourfold downsampling through a convolutional layer with a kernel size of 4, leaves for the next module with 96 channels. A layer of the multidimensional hybrid convolution module is stacked before the 3 × 3 convolution; it keeps the number of channels fed into the convolution unchanged while increasing the receptive field and the depth of the network, which improves the efficiency of the model.

#### *3.3. Cross-Resolution Attention Enhancement Neck: CAENeck*

Inspired by the structures of SGE [42] and PAN [43], this paper addresses small targets in large scenes, together with the strong scattering characteristics of SAR imaging and the low discrimination between targets and background, by designing a new cross-resolution attention enhancement neck, called CAENeck.

The specific steps are as follows. The feature map is divided into G groups along the channel dimension, and attention is computed within each group: global average pooling over a group yields a descriptor g, which is multiplied with the original grouped feature map; the result is normalized, passed through a sigmoid, and multiplied again with the original grouped feature map. The specific steps are shown in Figure 7.
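The grouped-attention steps above can be sketched as a small function. This is a hedged reconstruction of the SGE-style mechanism as described, not the paper's exact code; the group count and the normalization details (mean/std over spatial positions) are assumptions.

```python
import torch

def sge_attention(x, groups=8, eps=1e-5):
    """Sketch of the SGE-style grouped attention described above: split
    channels into G groups, compute each group's global-average descriptor
    g, correlate g with the group's features, normalize the resulting map,
    and rescale the features with its sigmoid."""
    b, c, h, w = x.shape
    xg = x.view(b * groups, c // groups, h, w)
    g = xg.mean(dim=(2, 3), keepdim=True)             # global average pooling
    attn = (xg * g).sum(dim=1, keepdim=True)          # similarity map
    # normalize the map over spatial positions, then squash to (0, 1)
    mean = attn.mean(dim=(2, 3), keepdim=True)
    std = attn.std(dim=(2, 3), keepdim=True)
    attn = torch.sigmoid((attn - mean) / (std + eps))
    return (xg * attn).view(b, c, h, w)

out = sge_attention(torch.randn(2, 96, 40, 40), groups=8)
```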

The attention mechanism is added to connect contextual information, and attention is incorporated into the top-down connections. This better integrates the information of shallow and deep feature maps and improves both the feature extraction and the localization of small targets, as shown in Figure 1. During the top-down transfer, the feature maps are upsampled so that their size increases: the deepest feature map is strengthened by the attention module and concatenated with the middle-layer feature map, which in turn passes through the attention module and is concatenated with the shallowest feature map. Specifically, the neck receives feature maps at three scales: 30 × 40 × 384, 60 × 80 × 192, and 120 × 160 × 96. The deepest feature map (30 × 40 × 384) is upsampled, undergoes the attention enhancement operation, and is connected with the 60 × 80 × 192 map; upsampling and attention enhancement are then applied again to connect with the shallowest feature map.

This series of operations proceeds from top to bottom; bottom-up multiscale feature fusion is then performed (the neck part is shown in Figure 1). A SAR target is a very small object in a large scene, especially the marine ship targets of the SSDD dataset: a ship at sea carries very little pixel information, which is easily lost during downsampling. Although the high-level feature maps are rich in semantic information for prediction, they are not conducive to localizing the target, whereas the low-level feature maps carry little semantic information but benefit localization. The FPN [34] structure fuses high-level and low-level features from top to bottom via upsampling. The attention module is added during the upsampling process to unify contextual information mining with the self-attention mechanism, enhancing the ability to extract target-location information. The bottom-up pyramid path then fuses low-level and high-level features after downsampling, enhancing the extraction of semantic features. Small feature maps are responsible for detecting large ships, and large feature maps for small ships; attention enhancement is therefore well suited to multiscale ship detection in SAR images.
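The top-down path described above can be sketched at the tensor level. This is a minimal illustration under stated assumptions: the attention-enhancement step of CAENeck is omitted, and the 1 × 1 channel-reduction convolutions are an assumption (the paper does not specify how channel counts are restored after concatenation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Minimal sketch of the top-down path: the deepest map is upsampled
    2x and concatenated with the next-shallower map, and so on. CAENeck's
    attention enhancement is omitted; 1x1 convolutions (assumed here)
    restore the channel counts after each concatenation."""
    def __init__(self, channels=(96, 192, 384)):
        super().__init__()
        c3, c4, c5 = channels
        self.reduce54 = nn.Conv2d(c5 + c4, c4, 1)
        self.reduce43 = nn.Conv2d(c4 + c3, c3, 1)

    def forward(self, p3, p4, p5):
        # deepest -> middle
        p4 = self.reduce54(torch.cat([F.interpolate(p5, scale_factor=2), p4], 1))
        # middle -> shallowest
        p3 = self.reduce43(torch.cat([F.interpolate(p4, scale_factor=2), p3], 1))
        return p3, p4, p5

# the three scales quoted above: 120x160x96, 60x80x192, 30x40x384
p3 = torch.randn(1, 96, 120, 160)
p4 = torch.randn(1, 192, 60, 80)
p5 = torch.randn(1, 384, 30, 40)
o3, o4, o5 = TopDownFusion()(p3, p4, p5)
```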

#### *3.4. Loss*

The loss function estimates the gap between the model output ŷ and the true value y to guide the optimization of the model. This paper uses different loss functions in the head part: the classification loss in the RPN head uses cross-entropy, and the overall loss function is as follows:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \tag{8}$$

where ∑<sub>i</sub> *L*<sub>cls</sub>(*p<sub>i</sub>*, *p*<sup>∗</sup><sub>*i*</sub>) is the classification loss over the filtered anchors, with *p<sub>i</sub>* the predicted category probability and *p*<sup>∗</sup><sub>*i*</sub> the true category label of each anchor, and ∑<sub>i</sub> *p*<sup>∗</sup><sub>*i*</sub>*L*<sub>reg</sub>(*t<sub>i</sub>*, *t*<sup>∗</sup><sub>*i*</sub>) is the regression loss, computed only for positive anchors. The function used for the regression loss is as follows:

$$L_{reg}(t_i, t_i^*) = \sum_{i \in \{x, y, w, h\}} smooth_{L1}(t_i - t_i^*) \tag{9}$$

$$smooth_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{10}$$
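Equations (9) and (10) translate directly into code. The sketch below implements them for a single anchor; the box-offset tuple order (x, y, w, h) follows Equation (9).

```python
def smooth_l1(x):
    """Smooth-L1 of Equation (10): quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def reg_loss(t, t_star):
    """Regression loss of Equation (9), summed over the box offsets
    (x, y, w, h) of one anchor."""
    return sum(smooth_l1(ti - si) for ti, si in zip(t, t_star))

loss = reg_loss((0.1, 0.2, 1.5, -0.3), (0.0, 0.0, 0.0, 0.0))
```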

#### **4. Experiments and Results**

This section evaluates the proposed detection method experimentally. First, we introduce the SSDD and SMCDD datasets used as experimental data and the parameter settings of the experiments. We then analyze the influence of the attention-enhanced backbone, the reconstructed patchembed module, and the multiscale attention-enhanced neck. Finally, we compare our method with other methods to verify its effectiveness. Our hardware platform is a personal computer with an Intel i5 CPU, an NVIDIA RTX 2060 GPU with 8 GB of video memory, and an Ubuntu 18.04 operating system; the implementation is based on the mmdet [44] framework.

#### *4.1. Dataset*

#### 4.1.1. SSDD Dataset

We used a remote sensing SAR image dataset: SSDD, established in 2017, which set the baseline for SAR ship detection algorithms and has been used by many other scholars. The SSDD dataset covers a variety of scenarios and polarization modes, and adopts the same labeling format as the popular PASCAL VOC dataset. It contains 1160 images and 2456 ships in total, an average of 2.12 ships per image. Although the number of images is small, it serves as a benchmark for ship detection performance. The ratio of training, validation, and test images was 7:2:1. The SSDD dataset is described in Table 1.

**Table 1.** SSDD dataset description.


#### 4.1.2. SMCDD Dataset

Our research group will soon release a SAR dataset called SMCDD, built from data of the HISEA-1 satellite, as shown in Figure 8.

The HISEA-1 satellite is China's first commercial SAR satellite, jointly developed by the 38th Research Institute of China Electronics Technology Group Corporation and Changsha Tianyi Space Science and Technology Research Institute Co., Ltd. (Changsha, China), among other units. Since entering orbit, HISEA-1 has performed more than 1880 imaging tasks, obtaining 2026 stripmap images, 757 spotlight images, and 284 scan images, and it provides stable data services. The slice data of the SMCDD dataset we constructed all come from large-scene SAR images captured by HISEA-1.

Our SMCDD dataset contains four types of targets: ships, airplanes, bridges, and oil tanks, as shown in Figure 9. The source images were all large and covered four polarization modes, so we cut them into slices of 256, 512, 1024, and 2048 pixels. Using the 1024- and 2048-pixel slices, after screening and cleaning, we retained 1851 bridges, 39,858 ships, 12,319 oil tanks, and 6368 aircraft, as shown in the figure. Although the current version of the dataset is unbalanced, we will continue to expand it in the future. We also verified the effectiveness of the method proposed in this paper on this dataset. The data information is shown in Table 2.

**Figure 8.** Some examples of the dataset SMCDD to be released by the research group. (**a**,**b**) are oil tanks; (**c**,**d**) are ships; (**e**,**f**) are bridges; (**g**,**h**) are aircraft.

**Figure 9.** Description of the research group dataset SMCDD.

**Table 2.** Description of the research group dataset SMCDD.


In contrast to the existing open-source SAR target detection dataset, our SMCDD dataset has the following advantages:


recognition, long-tailed distribution (class imbalance), small sample detection and recognition, etc. This will greatly promote the overall development of the SAR target detection field.


#### *4.2. Setup and Implementation Details*

This study used Python 3.7.10, PyTorch 1.6.0, CUDA 10.1, CUDNN 7.6.3, and MMCV 1.3.1; the network was initialized from a Swin-T model pretrained on ImageNet. The entire network was trained for 500 epochs. Owing to hardware limitations and the size of the network itself, the batch size was set to 2, so each iteration sent two images to the network. AdamW was selected as the optimizer, with an initial learning rate of 0.0001 and a weight decay of 0.0001. Strategies such as LoadImageFromFile, LoadAnnotations, RandomFlip, and AutoAugment were used in the training pipeline for online data augmentation, which enhanced the robustness of the algorithm. After adjusting the image size, we selected 640 × 640 as the most suitable input size for the network proposed in this paper.
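The optimizer settings above can be written down concretely. This is a minimal sketch, not the paper's training script: the model here is a stand-in for the network described in Section 3, and only the hyperparameters stated in the text (AdamW, learning rate 1e-4, weight decay 1e-4) are reproduced.

```python
import torch

# Stand-in model; the real network is the one described in Section 3.
model = torch.nn.Conv2d(3, 96, 3)

# AdamW with the hyperparameters stated above: initial learning rate
# 0.0001 and weight decay 0.0001.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
```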

#### *4.3. Evaluation Metric*

To quantitatively evaluate the performance of the proposed detection algorithm (Cascade Mask R-CNN with the improved Swin Transformer as the backbone and CAENeck as the neck), precision, recall, mean average precision (mAP), and the F-measure (F1) were used as evaluation metrics. Precision is the proportion of correct detections among all detection results, and recall is the proportion of ground-truth ships that are correctly detected. Precision and recall are defined as follows:

$$P = \frac{TP}{TP + FP} \tag{11}$$

$$R = \frac{TP}{TP + FN} \tag{12}$$

In the formulas, *TP*, *FP*, and *FN* denote positive samples predicted as positive, negative samples predicted as positive, and positive samples predicted as negative, respectively. In addition, if the IoU between a predicted bounding box and a ground-truth bounding box exceeds the threshold of 0.5, the bounding box is counted as a correctly detected ship. The precision-recall (*PR*) curve shows the precision and recall under different confidence thresholds. mAP is a comprehensive metric that averages the precision over the recall range [0, 1]. mAP is defined as follows:

$$mAP = \int_0^1 P(R) \, \mathrm{d}R \tag{13}$$

In the formula, *R* is the recall value and *P*(*R*) the precision corresponding to that recall. *F*1 evaluates the comprehensive performance of the proposed detection network by jointly considering precision and recall, and is defined as:

$$F1 = \frac{2 \times P \times R}{P + R} \tag{14}$$
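Equations (11), (12), and (14) combine into a few lines of code. The counts in the example are illustrative only and are not taken from the paper's experiments.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 of Equations (11), (12), and (14),
    computed from raw counts at a fixed IoU threshold (0.5 above)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# illustrative counts, not taken from the paper
p, r, f1 = detection_metrics(tp=90, fp=10, fn=10)
```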

#### *4.4. Analysis of Experimental Results*

#### 4.4.1. Ablation Experiments

A. The Influence of CRbackbone on the Experimental Evaluation Index

During the experiment, we first added the improved backbone to the network while keeping the neck consistent with the baseline (Swin Transformer backbone with PAN neck), and evaluated the test metrics. The comparison results are shown in Table 3. Adding the optimized backbone improved mAP(0.5) by 0.4%, mAP(0.75) by 6.5%, and the recall by 1.3%. The improved backbone therefore contributes to the optimization of the network.

**Table 3.** The influence of CRbackbone on the experimental evaluation index.


B. The Influence of the CAENeck Module on the Experimental Evaluation Index

During the experiment, we then added the improved neck to the network while keeping the backbone consistent with the baseline, in order to study the influence of the lightweight attention-enhancement neck module, and evaluated the experimental metrics. The comparison results are shown in Table 4. We observed a 0.1% improvement in mAP(0.5), a 2.5% improvement in mAP(0.75), and a 0.2% improvement in recall; the detection performance therefore also improved in the neck part.

**Table 4.** The influence of CAENeck module on experimental evaluation index.


#### 4.4.2. Experimental Comparison with Current Methods

To compare against both traditional and state-of-the-art methods, we tested them all under the same parameter settings. We evaluated the proposed Cascade Mask R-CNN detection network with the improved Swin Transformer backbone on the SSDD dataset and on the dataset to be released by our research group. The experimental results are shown in the following tables.

The target detection model proposed in this paper achieved a substantial improvement on the SSDD dataset: mAP(0.5) reached 97.0%, mAP(0.75) reached 76.2%, and F1 was 95.3. The improvement of the Swin Transformer, the integration of the CotNet attention mechanism, and the lightweight EfficientNet module in patchembed together promoted the optimization of the backbone, and the cross-resolution attention enhancement neck strengthened the fusion of feature maps at different scales; these methods are all of great help for detecting ships. We compared two-stage, single-stage, and anchor-free methods. The experiments show that the detection accuracy of the proposed method is generally higher than that of two-stage methods such as Faster R-CNN (88.5%) and Cascade R-CNN [45] (89.3%). We also compared our results with those of the single-stage YOLOv3 (95.1%), SSD (84.9%), and RetinaNet (90.5%) [46]; our results again exceed those of the single-stage detection algorithms, which shows that the transformer's powerful attention mechanism, cascaded local information, and enhanced multiscale fusion are more conducive to detecting inshore vessels in the presence of noise and to identifying ships of different sizes. We further compared the most advanced anchor-free detectors, FCOS [47] and CenterNet [48], and the anchor-based detectors Cascade R-CNN [45] and Libra R-CNN [49], as shown in Tables 5 and 6. This paper also plots the PR curves to compare the different networks, in Figure 10.

**Table 5.** Comparison with the latest anchor-free target detection method.



**Table 6.** Comparison with the latest anchor-based target detection method.

**Figure 10.** Comparison with the PR curve of the classic method.

The basic principle of CenterNet is to model each target object by its center point, so that neither candidate boxes nor postprocessing such as non-maximum suppression is required. CenterNet uses a fully convolutional network to generate a high-resolution feature map and classifies each pixel of that map, determining whether it is the center point of a target category or background. This feature map gives the position of each object's center point: the confidence at the center point of a target is 1, and the confidence of a background point is 0. Since there are no anchor boxes, there is no need to compute the IoU between anchor boxes and bounding boxes to select positive samples for training the regressor; instead, every point that lies within a bounding box and has the correct class label is treated as a positive sample and participates in the regression of the bounding-box size parameters.
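The center-point decoding described above can be sketched as follows. This is a simplified illustration of the standard CenterNet decoding idea, not this paper's code: a 3 × 3 max-pool keeps only local maxima of the heatmap (replacing NMS), and the top-k surviving scores give candidate centers.

```python
import torch
import torch.nn.functional as F

def heatmap_peaks(heatmap, k=10):
    """Sketch of CenterNet-style decoding: a 3x3 max-pool keeps only
    local maxima of the class heatmap (replacing NMS), and the top-k
    surviving scores give the candidate center points."""
    pooled = F.max_pool2d(heatmap, 3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap).float()     # suppress non-maxima
    scores, idx = peaks.flatten(2).topk(k)            # per class
    w = heatmap.shape[-1]
    ys, xs = idx // w, idx % w                        # recover coordinates
    return scores, ys, xs

# toy heatmap with two peaks
hm = torch.zeros(1, 1, 8, 8)
hm[0, 0, 2, 3] = 0.9
hm[0, 0, 5, 6] = 0.7
scores, ys, xs = heatmap_peaks(hm, k=2)
```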

This paper cites the latest SAR target detection methods, FBR-Net [50], CenterNet++ [51], NNAM [52], DCMSNM [53], and DAPN [54]. Since the corresponding papers do not specify their data divisions, we could not fully reproduce their results and therefore quote them directly for a horizontal comparison, as shown in Table 7.


**Table 7.** Comparison with the latest SAR target detection method.

To demonstrate the robustness of the proposed algorithm, we conducted comparative experiments under low SNR with salt-and-pepper noise, random noise, and Gaussian noise. With salt-and-pepper noise, our method achieved an mAP of 94.8, which is 5.5% higher than YOLOv3 and 9.7% higher than Faster R-CNN. With random noise, our method achieved an mAP of 96.7, 3% higher than YOLOv3 and 11.1% higher than Faster R-CNN. With Gaussian noise, our method achieved an mAP of 95.8, 1% higher than YOLOv3 and 10.7% higher than Faster R-CNN. The experimental results, presented in Table 8, show that our method delivers reliable SAR target detection performance under low SNR.

**Table 8.** Comparison with low SNR of other advanced methods.


To show that our proposed method can solve the SAR target detection task at an acceptable computational cost, we also compared the computational cost of the Swin Transformer with that of other methods. The FPS and parameter statistics of several representative target detection algorithms are shown in Table 9. Compared with the single-stage YOLOv3, our parameter count is 34M higher and our FPS is 28.5 lower, but our mAP is 1.9% higher. Compared with the two-stage detectors, our parameter count is 52M higher than Faster R-CNN with an FPS 11.5 lower, 8M higher than Cascade R-CNN with an FPS 4.5 lower, and 19M higher than Cascade Mask R-CNN with an FPS 7.5 lower. However, our mAP is 8.5% higher than that of Faster R-CNN, 7.7% higher than that of Cascade R-CNN, and 6.8% higher than that of Cascade Mask R-CNN. The overall architecture of the Swin Transformer is still relatively large, a general problem of transformers in this field, and we plan to make further improvements in model lightweighting and compression in the future.


**Table 9.** Compared with computational cost of other advanced methods.

#### 4.4.3. Comparison between Experimental Results of the SMCDD Data Set

We evaluated our self-built SMCDD dataset with state-of-the-art object detection methods, choosing CRTransSar, RetinaNet, and YOLOv3 as the benchmark algorithms, as shown in Table 10. The large number of densely packed oil tanks and aircraft in the SMCDD dataset poses great challenges for detection. The data show that CRTransSar's mAP reached 16.3, better than RetinaNet and YOLOv3, and its recall was also higher than that of these two models.

**Table 10.** Comparison results on the SMCDD data set.


#### *4.5. Visualization Result Verification and Analysis*

To verify the effectiveness of the method in this paper, we visualized results on the SSDD dataset and on the dataset to be released by our research group, obtaining satisfactory outcomes. We randomly selected some inshore and offshore ships for inspection. The figures show that multiscale fusion of feature maps effectively improves the results for SAR images of different scenes, meaning that the proposed method can extract features with rich semantic information even from complex inshore backgrounds. It can also eliminate interference and accurately identify ships in places where the naked eye has difficulty distinguishing noise from ships, and it can suppress marine clutter so that even ships in the distant sea are accurately distinguished. We also verified the method on ships photographed by HISEA-1.

(1) This section visually verifies the performance of the network on the two datasets, divided into inshore and offshore sets. Figure 11 shows the visual verification of SSDD inshore ships: when the ships are not too densely packed, the network's detection performance is good and is not disturbed by shore noise. Figure 12 shows the dataset to be released by our research group, which contains inshore ships photographed by the HISEA-1 satellite and other high-resolution satellites.

(2) Figures 13 and 14 show the identification results for offshore ships from the SSDD dataset and from the dataset to be released by our research group, respectively. Offshore, the surrounding environment introduces less noise interference, so the recognition accuracy is higher than inshore; almost all target ships are accurately identified in the offshore scenes.

(3) To demonstrate the object detection performance of our proposed method in large scenes, we used our self-built SMCDD dataset for inference. Our original data were obtained from the 38th Research Institute of China Electronics Technology Group Corporation; because the data belong to a military institution, we signed a confidentiality agreement with them. The original large-scene images were captured by HISEA-1. To further demonstrate the effectiveness of our method, we used sliced data of some large scenes with a size of 2048 × 2048. As shown in Figure 15, the visualization results indicate that Figure 15a,f contain missed detections in three places, and Figure 15b,d each contain one false detection.

**Figure 11.** SSDD inshore inspection results. The red rectangular box is the correct visualization result of the CRTransSar method inshore on the SSDD dataset.

**Figure 12.** The results of verification of inshore taken by HISEA-1 Satellite. The red rectangular box is the correct visualization result of the CRTransSar method in the inshore scene of a port captured by HISEA-1 Satellite.

**Figure 13.** SSDD offshore ship identification result. The red rectangular box is the correct visualization result of the CRTransSar method offshore on the SSDD dataset.

**Figure 14.** The results of verification of offshore ships taken by HISEA-1 satellites. The red rectangular box is the correct visualization result of the CRTransSar method in the offshore scene of a port captured by HISEA-1 Satellite.


**Figure 15.** Visualization results of large scenes. (**a**–**f**) are slices of a large scene taken by HISEA-1 Satellite in a port. The red mark in the figure is the visualization result of the ships detected by CRTransSar in this scene.

#### **5. Discussion**

The four images in Figure 16 show some errors in the visualized results. In (a), spurious detection boxes appear around a ship; in (b), some ships clearly went undetected; in (c), one ship is covered by multiple detection boxes; and in (d), several ships are not recognized. We can address the difficult identification of adjacent ships through segmentation, and introduce nonlocal mean models to highlight edge information.

**Figure 16.** Visualization results of misdetected samples. (**a**–**d**) are the inshore visualization results of the CRTransSar method under the SSDD dataset. (**a**,**c**,**d**) are the visualization results of false-alarms. (**b**,**d**) are the visualization results of missed-detection.

#### **6. Conclusions**

SAR target detection has important application value in both military and civilian fields. Aiming to overcome the difficulties of SAR targets, such as strong scattering, sparseness, multiscale variation, unclear contour information, and complex interference, we propose a visual transformer SAR target detection framework based on contextual joint representation learning, called CRTransSar. In this paper, CNN and transformer architectures are innovatively combined to improve the feature representation and detectability of SAR targets in a balanced manner. Building on the Swin Transformer and integrating CNN architectural ideas, we redesigned a new backbone, named CRbackbone, which makes full use of contextual information, conducts joint-representation learning, and extracts richer, more salient contextual features. Furthermore, we constructed a new cross-resolution attention enhancement neck, called CAENeck, which enhances the characterization of SAR targets at different scales.

We conducted experiments on the SSDD dataset and on the SMCDD dataset to be released by our research group, including visual verification of inshore and offshore vessel detection. The high-quality results prove the robustness and practicability of our method, which also achieved higher precision than the two-stage and anchor-free methods in the comparison experiments: the method proposed in this paper reaches 97.0% mAP(0.5) and 76.2% mAP(0.75). In future work, we will first standardize and release the SMCDD dataset of our research group for download and use. In addition, we will introduce segmentation to detect densely adjacent ships and explore more efficient distillation methods that do not require time-consuming training; combined with pruning, this will make model compression more diversified, the network easier to transplant, and its development more lightweight.

**Author Contributions:** Conceptualization, R.X.; methodology, R.X.; software, R.X.; validation, R.X.; formal analysis, R.X.; investigation, R.X.; resources, R.X.; data curation, R.X.; writing—original draft preparation, R.X.; writing—review and editing, R.X., J.C., Z.H. and H.W.; visualization, R.X.; supervision, J.C., Z.H., B.W., L.S., B.Y., H.X. and M.X.; project administration, J.C.; funding acquisition, J.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the National Natural Science Foundation of China under Grant 62001003, in part by the Natural Science Foundation of Anhui Province under Grant 2008085QF284, and in part by the China Postdoctoral Science Foundation under Grant 2020M671851.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data used in this study are open data sets. The dataset can be downloaded at https://pan.baidu.com/s/1paex4cEYdTMjAf5R2Cy9ng (2 February 2022).

**Acknowledgments:** We would like to thank the anonymous reviewers for their constructive and valuable suggestions on the earlier drafts of this manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **A Lightweight Position-Enhanced Anchor-Free Algorithm for SAR Ship Detection**

**Yun Feng 1,2,3, Jie Chen 1,2,3,\*,†, Zhixiang Huang 1,2,3,†, Huiyao Wan 1,2,3, Runfan Xia 1,2,3, Bocai Wu 3, Long Sun 3,4,5 and Mengdao Xing 4,5,†**


**Abstract:** As an active microwave device, synthetic aperture radar (SAR) uses the backscatter of objects for imaging. Ship targets in SAR images are characterized by unclear contour information, complex backgrounds and strong scattering. Existing deep learning detection algorithms derived from anchor-based methods mostly rely on expert experience to set a series of hyperparameters and find it difficult to characterize the unique characteristics of SAR image ship targets, which greatly limits detection accuracy and speed. Therefore, this paper proposes a new lightweight position-enhanced anchor-free SAR ship detection algorithm called LPEDet. First, to resolve unclear SAR target contours and the multiscale performance problem, we used YOLOX as the benchmark framework and redesigned a lightweight multiscale backbone, called NLCNet, which balances detection speed and accuracy. Second, for the strong scattering characteristics of SAR targets, we designed a new position-enhanced attention strategy, which suppresses background clutter by adding position information to the channel attention, highlighting the target information so as to more accurately identify and locate targets. The experimental results for two large-scale SAR target detection datasets, SSDD and HRSID, show that our method achieves higher detection accuracy and faster detection speed than state-of-the-art SAR target detection methods.

**Keywords:** deep learning; SAR ship detection; position-enhanced attention; lightweight backbone

#### **1. Introduction**

SAR is one of the main means of imaging Earth's surface for civilian and military purposes; it can work at any time of day and is not affected by weather conditions. With the rapid development of tools, information and technology, a large number of SAR images have been obtained. Due to the particularities of SAR imaging, the artificial interpretation of SAR images is a time-consuming and labor-intensive process, so a considerable amount of data has not been fully utilized. SAR image target detection aims to automatically locate and identify specific targets in images and has wide application prospects in real life. For example, in a military context, detecting the locations of specific military targets is conducive to tactical deployment and coastal defense early warning capabilities. In a civil context, the detection of smuggling and illegal fishing vessels is helpful for the monitoring and management of maritime transport.

**Citation:** Feng, Y.; Chen, J.; Huang, Z.; Wan, H.; Xia, R.; Wu, B.; Sun, L.; Xing, M. A Lightweight Position-Enhanced Anchor-Free Algorithm for SAR Ship Detection. *Remote Sens.* **2022**, *14*, 1908. https:// doi.org/10.3390/rs14081908

Academic Editors: Tianwen Zhang, Tianjiao Zeng and Xiaoling Zhang

Received: 6 March 2022 Accepted: 13 April 2022 Published: 15 April 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Since optical images are widely used in daily life, many researchers have developed numerous target detection algorithms based on optical images, but there are relatively few studies on SAR images. Due to the long imaging wavelength and complex imaging mechanism of SAR images, their targets are discontinuous; that is, they are composed of multiple discrete and irregular bright spots of scattering centers. Therefore, SAR images are difficult to interpret intuitively. At the same time, SAR images have the characteristics of an uneven target distribution and great sparsity. These characteristics make SAR image target detection very different from common optical image target detection. When target detection models used for optical images are directly used for SAR image detection without considering the particularity of SAR images, the advantages of the algorithm are not fully manifested. The development of SAR image target detection technology can be introduced via the following two aspects: traditional SAR target detection and SAR ship detection using deep learning.

Traditional SAR image target detection algorithms mainly include the constant false alarm rate (CFAR) [1] detection algorithm, based on the statistical distribution of background clutter, and detection algorithms based on hand-crafted image texture features. The CFAR-based method uses the background cells around the target and selects a constant false alarm probability to determine the detection threshold. There are two main reasons for its limited detection rate: first, the same statistical model is used for all the clutter in the sliding window, which easily leads to a mismatch of the statistical model in poorly adapted regions; second, the algorithm does not make full use of the feature information in the image but only uses the statistical distribution of the image gray values. Huang et al. [2] proposed a CFAR algorithm based on target semantic features, which has a lower false alarm rate when detecting targets in high-resolution SAR images. Detection algorithms based on hand-crafted image texture features perform well for some kinds of targets; however, when target features differ greatly, their performance drops significantly. Stein et al. [3] proposed a target detection method based on the rotation-invariant wavelet transform. Compared with the CFAR detection algorithm, texture feature-based algorithms utilize more image information and achieve higher detection accuracy. However, texture features must be designed and extracted manually, a complicated and time-consuming process, so it is difficult to ensure timely detection.
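To make the CFAR idea concrete, here is a minimal one-dimensional cell-averaging CFAR sketch in NumPy; the window sizes and threshold scale are illustrative assumptions, not values from the cited works:

```python
import numpy as np

def ca_cfar_1d(signal, num_train=8, num_guard=2, scale=4.0):
    """Cell-averaging CFAR: estimate the clutter level from training cells
    around each cell under test (CUT), skipping guard cells, and flag the
    CUT if it exceeds scale * clutter_mean."""
    n = len(signal)
    detections = np.zeros(n, dtype=bool)
    half = num_train // 2
    for cut in range(half + num_guard, n - half - num_guard):
        # training cells on both sides of the CUT, excluding guard cells
        left = signal[cut - num_guard - half : cut - num_guard]
        right = signal[cut + num_guard + 1 : cut + num_guard + half + 1]
        clutter = np.mean(np.concatenate([left, right]))
        detections[cut] = signal[cut] > scale * clutter
    return detections

# Toy example: flat exponential clutter with one strong injected scatterer.
rng = np.random.default_rng(0)
clutter = rng.exponential(1.0, 64)
clutter[32] += 30.0
hits = ca_cfar_1d(clutter)
```

The mismatch problem described above corresponds to the single `clutter` statistic: one mean is assumed valid for the whole window, which fails in heterogeneous regions.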

SAR ship detection methods based on deep learning have become a research priority and a large number of methods based on convolutional neural networks have emerged. Zhang et al. [4] proposed a balanced-scene learning mechanism for cases where the number of SAR image samples is extremely unbalanced; it extracts features from images with a generative adversarial network and uses the k-means algorithm to cluster and expand the number of samples used to train the model, achieving good results. The lightweight SAR ship detector ShipDeNet-20 [5] greatly reduces the size of the model and combines feature fusion, feature enhancement and scale-sharing feature pyramid modules to further improve accuracy, which is conducive to hardware transplantation. HyperLi-Net [6] achieves high accuracy and high speed in SAR image ship detection: the high accuracy comes from its multireceptive-field, dilated convolution, channel and spatial attention, feature fusion and feature pyramid modules, while the high speed comes from its region-free design, small kernels, narrow channels, separable convolutions and batch normalization. Its model is also more lightweight, which is more conducive to hardware porting. Zhang et al. [7] analyzed four imbalance problems in the SAR ship detection process and proposed corresponding solutions, which were combined into a new balanced learning network. Zhang et al. [8] mainly used depthwise separable convolution to build a new SAR ship detection method; by integrating multi-scale detection, concatenation and anchor box mechanisms, the model is made more lightweight and detection speed is improved to a certain extent. Zhang et al. [9] gridded the input image and used depthwise separable convolution operations; the backbone convolutional neural network and the detection convolutional neural network are combined to form a new grid convolutional neural network, which has achieved good results in SAR ship detection. RetinaNet [10] is essentially composed of a ResNet backbone, an FPN and two FCN subnetworks. The design idea is that the backbone selects an effective feature extraction network such as VGG or ResNet; the FPN strengthens the use of the multi-scale features formed in the backbone to obtain feature maps with stronger expressiveness that include multi-scale target area information; finally, two FCN subnetworks with the same structure but no shared parameters are applied to the FPN feature maps to complete target category classification and bounding box position regression. The SSD [11] model completely eliminates proposal generation and the subsequent pixel or feature resampling stages and encapsulates all computation in a single network, which makes SSD easy to train and to integrate directly into systems that require a detection component. The core of the SSD approach is to use small convolutional filters to predict class scores and position offsets for a fixed set of default bounding boxes on the feature map. The network model of YOLOv3 [12] is mainly composed of 75 convolutional layers; since no fully connected layer is used, the network can accept input images of any size. In addition, no pooling layers appear in YOLOv3; instead, the stride of the convolutional layers is set to 2 to achieve downsampling, and scale-invariant features are transferred to the next layer. YOLOv3 also uses structures similar to ResNet and FPN, which are likewise beneficial for detection accuracy; it is mainly aimed at small targets, for which accuracy is significantly improved. YOLOX [13] is the first model to apply the anchor-free mode in the YOLO series.
The specific operation is to explicitly define the 3 × 3 region around the center of the ground-truth box projected onto the feature map as the positive sample region and to predict the four values of the target position (the offset of the upper left corner and the height and width of the box). The AFSar [14] network model redesigns the backbone network, replacing the original Darknet-53 with an improved MobileNetV2; at the same time, the detection head and neck are newly designed, making it a lightweight network model. The RFB-net [15] algorithm introduces a receptive field block (RFB) into the SSD [11] network and strengthens the feature extraction ability, inspired by the way the human visual system works.
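The box decoding described above can be sketched as follows, under the assumption that the four predicted values are expressed in units of the feature-map stride (the exact YOLOX parameterization differs in its details):

```python
def decode_box(cell_x, cell_y, pred, stride):
    """Decode one anchor-free prediction into an image-space box.

    pred = (dx, dy, w, h): offset of the box's top-left corner from the
    grid cell origin, plus width and height, all in stride units.
    Returns (x1, y1, x2, y2) in pixels.
    """
    dx, dy, w, h = pred
    x1 = (cell_x + dx) * stride
    y1 = (cell_y + dy) * stride
    return (x1, y1, x1 + w * stride, y1 + h * stride)

# A prediction at grid cell (4, 3) on a stride-8 feature map.
box = decode_box(cell_x=4, cell_y=3, pred=(0.5, 0.25, 2.0, 1.0), stride=8)
# box == (36.0, 26.0, 52.0, 34.0)
```

Because the regression targets are tied to grid cells rather than to preset anchor shapes, no anchor hyperparameters need to be tuned.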

In summary, the following problems still need to be resolved:


To this end, we propose a new lightweight position-enhanced anchor-free SAR ship detection algorithm called LPEDet which improves the accuracy and speed of SAR ship detection from a more balanced perspective. The main contributions are as follows:

(1) To address the heavy dependence of anchor-based detection algorithms on expert-designed hyperparameters, as well as difficulties such as the unclear contour information and complex backgrounds of SAR image ship targets, we introduced an anchor-free target detection algorithm. We adopted the latest YOLOX as the base network and, inspired by the recent lightweight backbone LCNet [16], replaced the Darknet-53 backbone with LCNet, then optimized the design according to SAR target characteristics.


#### **2. Related Work**

The development of SAR image target detection technology ranges from traditional SAR target detection to SAR target detection using deep learning. In deep learning-based target detection, the network takes an image as input and outputs the corresponding feature maps through the backbone network; therefore, detection performance is closely related to the feature extraction capability of the backbone. Many studies have designed different feature extraction backbones for different application scenarios and detection tasks.

(1) Traditional SAR target detection algorithm.

Representative traditional SAR target detection algorithms are as follows. Ai et al. [18] proposed a joint CFAR detection algorithm based on gray-level correlation, utilizing the strong correlation between adjacent pixels inside targets in SAR images. The CFAR algorithm only considers gray-level contrast and ignores target structure information, which causes poor robustness, weak anti-interference ability and poor detection performance under complex background clutter. Kaplan et al. [19] used the extended fractal (EF) feature to detect vehicle targets in SAR images; this feature is sensitive not only to the contrast between target and background but also to the target size, and compared with the CFAR algorithm it reduces the false alarm rate. Charalampidis [20] proposed the wavelet fractal (WF) feature, which can effectively segment and classify different textures in images.

(2) Common SAR image backbone networks based on deep learning.

It can be seen from VGG [21] that a deeper network can be formed by stacking modules with the same dimensions; for a given receptive field, stacking small convolution kernels performs better than using a single large convolution kernel. GoogLeNet [22] adopts a modular (Inception) structure to enrich the network's receptive fields with convolution kernels of different sizes. ShuffleNetV1 [23] and ShuffleNetV2 [24] adopt two core operations, pointwise group convolution and channel shuffling, and exchange information across groups through the channel shuffle. GhostNet [25] divides the original convolution layer into two parts: first, a traditional convolution is applied to the input to generate feature maps; these feature maps are then transformed with cheap linear operations and all the features are merged to obtain the final result. DarkNet-53 has no pooling or fully connected layers; the feature map is reduced by increasing the stride of the convolution kernel. Using the idea of feature pyramid networks (FPNs), it outputs feature layers at three scales, 13 × 13, 26 × 26 and 52 × 52, where 13 × 13 is suitable for detecting large targets and 52 × 52 for small targets. Although the above backbone networks greatly improve detection accuracy, they also introduce a large number of parameters into the model and their detection speed is relatively slow. MobileNetV1 [26] constructs a network from depthwise separable convolutions, each consisting of two steps: a depthwise convolution and a pointwise convolution. MobileNetV2 [27] introduced a residual structure on the basis of MobileNetV1 which first raises and then reduces the dimension; although the model is lightweight, this design suits large models and provides no significant accuracy improvement in small networks. Remote sensing image targets are often dense, and it is difficult to distinguish target contours from the background environment; a new algorithm [28] addressing these difficulties, which can also be used for video target recognition, mainly uses a visual saliency mechanism to extract targets in regions of interest, and experiments show its effectiveness. In addition to SAR image target detection, research on images captured by UAVs continues to advance, because target detection in UAV images has broad application prospects in real life; ref. [29] combines deep learning target detection with existing template matching and proposes a parallel integrated deep learning algorithm for multi-target detection.

(3) SAR image detection algorithm based on deep learning.

Jiao et al. [30] observed that the multi-scale nature of SAR image ship targets and the complex backgrounds of inshore ship targets were not conducive to detection, and innovatively improved the Faster R-CNN framework; they also proposed a new training strategy that makes the training process focus less on simple targets, making it more suitable for detecting ship targets against complex backgrounds in SAR images, improving detection performance and thereby addressing the multi-scale, multi-scene problem. Chen et al. [31] mainly focused on hard-to-distinguish ships on land and densely arranged ships at sea and combined their model with an attention mechanism to better solve these two problems frequently encountered in ship target detection; the attention mechanism helps locate the ship targets to be detected. They also improved the loss function, introducing a generalized cross loss, and used soft non-maximum suppression in the model, so the problem of densely arranged ship targets can be better handled and detection performance improved. Cui et al. [32] considered the multi-scale problem of ship targets in SAR images and used a densely connected pyramid structure in their model; at the same time, a convolutional block attention module was used to refine the feature maps, highlighting salient features and suppressing fuzzy ones, effectively improving the accuracy of SAR image ship target detection. Although the above algorithms generally have high detection accuracy, their models are large, their inference is slow and they do not take the characteristics of SAR images into account, which greatly limits their performance. Wan et al. [14] proposed an anchor-free SAR ship detection algorithm whose backbone is the more lightweight MobileNetV2S network and further improved the neck and head so that the overall model is optimal. However, their improved strategy did not fully consider the characteristics of SAR targets against complex backgrounds, which is an issue worthy of further exploration.

Therefore, we propose a new SAR image detection method that comprehensively considers the tradeoff between algorithm accuracy and speed.

#### **3. Methods**

This paper proposes a position-enhanced anchor-free SAR ship detection algorithm called LPEDet, which comprises the anchor-free benchmark detection network YOLOX, the lightweight feature enhancement backbone NLCNet and a position-enhanced attention strategy. The overall framework is shown in Figure 1. The proposed model is explained in detail below from these three aspects.

**Figure 1.** Overall framework of the model. PEA = position-enhanced attention. (The numbers (1–6) represent the output feature maps of blocks 1–6, respectively, and PEA is added to the adjacent blocks. The subsequent operations of (2), (3) are the same with (1)).

#### *3.1. Benchmark Network*

Considering the unclear edge information of SAR targets, and to avoid the shortcomings of traditional anchor-based methods, we use the latest anchor-free detection framework YOLOX [13] as the benchmark network. YOLOX is the first model in the YOLO series to apply the anchor-free mode. The specific operation is to explicitly define the 3 × 3 region around the center of the ground-truth box projected onto the feature map as the positive sample region and to predict the four values of the target position (the offset of the upper left corner and the height and width of the box). To better allocate fuzzy samples, YOLOX uses the simOTA algorithm for positive and negative sample matching. The general process of the simOTA algorithm is as follows: first, the matching cost of each ground truth-prediction pair is calculated; then, the top k prediction boxes with the smallest cost within a fixed central area are selected; finally, the grids associated with these positive samples are marked as positive.
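The three-step matching process above can be sketched as follows. This is a simplified stand-in with an illustrative cost matrix, a fixed k and a precomputed center mask; real simOTA derives the cost from classification and IoU losses and chooses k dynamically per ground truth:

```python
import numpy as np

def select_positives(cost, center_mask, k=3):
    """For each ground-truth row of `cost` (shape: gt x predictions), keep
    the k candidate predictions with the lowest cost among those whose grid
    cell lies in the fixed central region (center_mask)."""
    num_gt, num_pred = cost.shape
    positives = np.zeros((num_gt, num_pred), dtype=bool)
    for g in range(num_gt):
        candidates = np.where(center_mask[g])[0]
        if candidates.size == 0:
            continue
        # sort the in-center candidates by matching cost, keep the best k
        order = candidates[np.argsort(cost[g, candidates])]
        positives[g, order[:k]] = True
    return positives

# One ground truth, five candidate predictions; the last is outside
# the central region and is excluded regardless of its cost.
cost = np.array([[0.9, 0.1, 0.5, 0.3, 0.7]])
mask = np.array([[True, True, True, True, False]])
pos = select_positives(cost, mask, k=2)
```

Here predictions 1 and 3 (costs 0.1 and 0.3) are marked positive.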

Since YOLOX incorporates various improvements to the YOLO series, including a decoupled head, a new label assignment strategy and an anchor-free mechanism, it is a high-performance detector with a good trade-off between accuracy and speed. For the SAR ship detection problem, these characteristics of YOLOX match the sparsity, small-sample characteristics and scattering of SAR image targets, so we chose YOLOX as the baseline of our network. Although YOLOX has achieved state-of-the-art performance in optical image detection, its model size and complexity are too high for it to be applied directly to SAR image detection. Therefore, we redesigned the backbone network of YOLOX.

#### *3.2. Lightweight Feature Enhancement Backbone: NLCNet*

Most existing YOLO series backbones use the DarkNet-53 and CSPNet architectures. Such backbones are usually excellent in terms of detection effect, but there is still room to improve inference speed, and the easiest way is to reduce the size of the model. To this end, according to the characteristics of the SAR target, we designed the lightweight backbone network NLCNet to better balance speed and accuracy.

NLCNet uses the depthwise separable convolution introduced by MobileNetV1 as its basic block. Depthwise separable convolution is divided into two steps, a depthwise convolution and a pointwise convolution; compared with conventional convolution, it uses only a small fraction of the parameters. Therefore, for the same parameter budget, a network built from separable convolutions can be deeper. We designed a new network based on LCNet by reorganizing and stacking these blocks to form a backbone divided into six blocks. The stem uses standard convolution activated by the h-swish function; block2 to block6 all use depthwise separable convolutions, differing mainly in how many are stacked, and block5 and block6 use 5 × 5 kernels in the depthwise convolution. NLCNet improves on recent work in two respects: (1) discarding the squeeze-and-excitation (SE) module and (2) the design of a lightweight convolution block. The structural details of NLCNet are shown in Figure 2.
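The parameter saving of depthwise separable convolution is easy to verify with a quick count (the channel numbers here are illustrative): a standard *k* × *k* convolution from *C*in to *C*out channels uses *k*·*k*·*C*in·*C*out weights, while the depthwise-plus-pointwise pair uses *k*·*k*·*C*in + *C*in·*C*out.

```python
def conv_params(k, c_in, c_out):
    # standard convolution weights (biases ignored)
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    # depthwise (k*k weights per input channel) + pointwise (1x1) convolution
    return k * k * c_in + c_in * c_out

std = conv_params(3, 128, 128)          # 147456
sep = dw_separable_params(3, 128, 128)  # 1152 + 16384 = 17536
ratio = sep / std                       # ~0.12, i.e. roughly 1/8 for 3x3 kernels
```

The ratio works out to 1/*C*out + 1/*k*², so the saving grows with both kernel size and channel width.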

**Figure 2.** The details of the NLCNet backbone network.

3.2.1. Discarding of the Squeeze-and-Excitation Networks (SE) Module

The SE module [33] is widely used in many networks; it helps the model weight the channels to obtain better features. However, the SE module cannot be added blindly, because it is not always effective. Through our own analysis and experiments, we found that adding the SE attention mechanism to the network brings a certain improvement in classification tasks, but for target detection the effect is not obvious and it sometimes even degrades the results, which may also depend on the network model. Considering this, we removed the SE module from LCNet in our experiments; the accuracy of the model was not reduced and the model has relatively few parameters.

#### 3.2.2. Design of a Lightweight Convolution Block

Experiments showed that convolution kernels of different sizes have a certain impact on network performance: the larger the kernel, the larger the receptive field during convolution and the better the global information of the target can be constructed. In light of this, we chose a larger convolution kernel to balance speed and accuracy. It has been found that placing large convolution kernels at the tail of the network is the best choice, since doing so achieves performance equivalent to replacing the kernels in all layers of the network. Therefore, this substitution was only performed at the end of the network.
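The receptive-field argument can be checked with simple bookkeeping: each layer enlarges the receptive field by (*k* − 1) times the product of the strides of all earlier layers, so enlarging kernels at the tail, where the accumulated stride is largest, buys the most receptive field per layer changed. A sketch with an illustrative layer stack:

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers, each given as
    (kernel, stride): r grows by (k - 1) * product of previous strides."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# Enlarging only the last two (tail) kernels from 3x3 to 5x5 widens
# the receptive field substantially without touching earlier layers.
small = receptive_field([(3, 2), (3, 2), (3, 2), (3, 1), (3, 1)])  # 47
large = receptive_field([(3, 2), (3, 2), (3, 2), (5, 1), (5, 1)])  # 79
```
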

Through simple stacking and the use of the techniques described above, the lightweight backbone used in this paper achieves a certain improvement in accuracy on the SSDD dataset while the number of parameters is significantly decreased. The advantages of NLCNet are therefore obvious. The specific network structure is shown in Table 1.

**Table 1.** The details of NLCNet. PEA = position-enhanced attention.


#### *3.3. Position-Enhanced Attention*

Squeeze-and-excitation attention is a widely used attention mechanism that enhances network performance while avoiding many parameter calculations. It is used in various network models to highlight important channel information in features: different channels are weighted through global pooling and a two-layer fully connected layer, without considering the influence of location information on the features. Location information can further help the model capture target details in the image, thus improving model performance.

To highlight the key location information of features, inspired by coordinate attention [17], we designed a new attention module in the network called position-enhanced attention. It embeds the location information of the target in the image into the channel attention, which better captures the position information of interest for SAR targets against complex backgrounds and provides good global perception ability. At the same time, the computational cost of this process is relatively low. See Figure 3 for the position-enhanced attention architecture.

Since 2D global pooling does not contain location information, position-enhanced attention changes it accordingly, splitting the original channel attention into two 1D global pooling operations. The specific process is as follows: when the feature map is input, two 1D global pooling operations aggregate it along the vertical and horizontal directions to form two independent, direction-aware feature maps. The two generated feature maps with specific directional information are then encoded to form two attention maps, which capture the long-range dependencies of the input feature map along the horizontal and vertical directions, respectively. Position information is thus retained in the generated attention maps, and applying the two attention maps to the input feature map emphasizes the targets of interest in the image for better recognition.

**Figure 3.** The details of the position-enhanced attention block (C, W, H, r represent the number of channels, width, height and reduction ratio, respectively).

With the accurate location information obtained, position-enhanced attention can be applied to encode channel relationships and long-range dependencies. See Figure 4 for details of the position-enhanced attention architecture.

(**a**) Classic SE channel attention block (**b**) Position-enhanced attention block

**Figure 4.** Structural contrast of the classic SE channel attention block and the position-enhanced attention block.

With channel attention, the spatial information in an image is usually condensed into inter-channel relationships through global pooling, but this compression of global information also loses position information. To further utilize the location of targets in the image, we split the 2D global pooling of the SE module into two 1D global pooling operations. These extract the regions of interest along the horizontal and vertical directions, providing better global perception ability, and the two direction-specific feature maps they generate preserve the position information of the target so the image target can be better identified and located. Specifically, given an input X, two 1D global pooling operations are used to encode each channel along the horizontal and vertical directions, with pooling kernel sizes of (*H*, 1) and (1, *W*). Thus, at height *h*, the output of channel *c* can be expressed as:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{1}$$

At width *w*, the output of channel *c* can also be written as:

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{2}$$

Through the above transformation, we can aggregate the input features in two spatial directions and obtain two feature maps with directional perception characteristics. These two feature maps not only enable the corresponding attention module to save the remote dependency relationship between features but also to maintain accurate position information in the spatial direction, thereby helping the network to more accurately detect the target.

As mentioned above, through the extraction process of Equations (1) and (2), the attention branch channel can have a good global receptive field, can well retain global feature information and can encode precise location information.
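Equations (1) and (2) are simply per-channel mean pooling along one spatial axis. In NumPy, for a feature map of shape (C, H, W):

```python
import numpy as np

def directional_pool(x):
    """Eq. (1)-(2): average over width to get z_h of shape (C, H) and
    over height to get z_w of shape (C, W)."""
    z_h = x.mean(axis=2)  # pool each row:    z_c^h(h) = (1/W) sum_i x_c(h, i)
    z_w = x.mean(axis=1)  # pool each column: z_c^w(w) = (1/H) sum_j x_c(j, w)
    return z_h, z_w

x = np.arange(24, dtype=float).reshape(2, 3, 4)  # (C=2, H=3, W=4)
z_h, z_w = directional_pool(x)
```

Each pooled vector keeps one spatial coordinate intact, which is exactly why the attention maps derived from them retain position information.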

Further, considering that the strong scattering of SAR targets against complex backgrounds leaves their contours unclear and that SAR target imaging angles vary greatly, we carefully designed the subsequent attention processing flow. Previous studies have shown that 2D global pooling loses position information; for this reason, Hou et al. [17] adopted two 1D pooling operations followed by channel concatenation. However, this method has difficulty handling the characteristics of SAR targets, mainly for two reasons. First, after the feature extraction and pooling operations of Equations (1) and (2), the results are concatenated into one channel for subsequent processing; since the feature correlations of SAR targets in different spatial directions differ greatly, this loses the salient feature information of the two spatial directions and is not conducive to characterizing the unique features of multi-oriented sparse SAR targets. Second, the concatenation operation also increases the computational complexity of the channel.

To this end, we designed an attention strategy different from that of Hou et al. [17], namely, position-enhanced attention. Our starting point was to overcome the two problems analyzed above by directly designing two parallel branches that extract deep feature information in the two spatial directions separately. This operation better extracts the salient feature information of each spatial direction and so better characterizes sparse SAR targets with different orientations; in addition, the parallel branches obtain a wider receptive field area and thus better global awareness.

Therefore, the aggregated feature maps in the two spatial directions are generated by Equations (1) and (2). A convolution is then applied along each spatial direction, with the convolution function F used for the transformation, generating:

$$\mathbf{f}^h = \delta\left(Bn\left(\mathcal{F}\left(z\_c^h(h)\right)\right)\right) \tag{3}$$

$$\mathbf{f}^w = \delta\left(Bn\left(\mathcal{F}\left(z\_c^w(w)\right)\right)\right) \tag{4}$$

where the *h-swish* activation function is defined as:

$$h\text{-}swish(\mathbf{x}) = \mathbf{x}\,\frac{\operatorname{ReLU6}(\mathbf{x} + 3)}{6} \tag{5}$$

In Equations (3) and (4), *δ* is the *h-swish* activation function and **x** is *Bn*(F(·)), where *Bn* denotes batch normalization. f*<sup>h</sup>* and f*<sup>w</sup>* are the intermediate feature maps; they are transformed into attention tensors by two further 1 × 1 convolution transforms *F<sub>h</sub>* and *F<sub>w</sub>*:

$$\mathbf{g}^{h} = \sigma\left(F\_{h}\left(\mathbf{f}^{h}\right)\right) \tag{6}$$

$$\mathbf{g}^w = \sigma(F\_w(\mathbf{f}^w)) \tag{7}$$

where *σ* is the sigmoid function. Then, *g<sup>h</sup>* and *g<sup>w</sup>* are used in the position-enhanced attention block:

$$\mathbf{y}\_{c}(i,j) = \mathbf{x}\_{c}(i,j) \times \mathbf{g}\_{c}^{h}(i) \times \mathbf{g}\_{c}^{w}(j) \tag{8}$$

Position-enhanced attention considers the encoding of spatial information. As mentioned above, attention is applied to the input tensor along both the horizontal and vertical directions. This coding process allows position-enhanced attention to locate the target position in the image more accurately, thus helping the whole model achieve better recognition. Experiments show that our method does achieve good results.
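As an illustration of Equations (1)–(8), the sketch below reimplements the two parallel direction-wise branches in NumPy. It is a minimal sketch under simplifying assumptions, not the authors' implementation: the 1 × 1 convolutions F, *F<sub>h</sub>*, *F<sub>w</sub>* and the batch normalization *Bn* are modeled as identity maps, so only the directional pooling, h-swish, sigmoid and position-wise re-weighting are shown.

```python
import numpy as np

def h_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6  (Eq. (5))
    return x * np.clip(x + 3, 0, 6) / 6

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def position_enhanced_attention(x):
    """x: feature map of shape (C, H, W).
    Two parallel branches pool along W and H respectively (Eqs. (1)-(2)),
    are activated (Eqs. (3)-(7)), and re-weight the input position-wise
    (Eq. (8)). The 1x1 convolutions and Bn are treated as identity maps
    purely for illustration."""
    z_h = x.mean(axis=2)            # (C, H): pooled along width
    z_w = x.mean(axis=1)            # (C, W): pooled along height
    f_h = h_swish(z_h)              # Eq. (3), F and Bn omitted
    f_w = h_swish(z_w)              # Eq. (4)
    g_h = sigmoid(f_h)              # Eq. (6), F_h omitted
    g_w = sigmoid(f_w)              # Eq. (7), F_w omitted
    # Eq. (8): y_c(i, j) = x_c(i, j) * g_h_c(i) * g_w_c(j)
    return x * g_h[:, :, None] * g_w[:, None, :]

y = position_enhanced_attention(np.random.rand(8, 16, 16))
print(y.shape)  # (8, 16, 16)
```

Because the two gates stay separate until the final product, each output position (i, j) is modulated by one horizontal and one vertical weight, which is what preserves direction-specific saliency.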

#### **4. Experiments**

To verify the proposed method, we conducted a series of experiments to evaluate the model's detection performance. This section first details the experimental settings and the SSDD dataset, then introduces the evaluation metrics, presents ablation experiments on the influence of each proposed module, and compares the model with other target detection algorithms. Finally, LPEDet is compared with other recent SAR ship detection methods.

#### *4.1. Dataset and Experimental Settings*

In our experiments, the SSDD [34] and HRSID datasets were used. For each ship, the detection algorithm predicts the bounding box of the ship target and gives its confidence. SSDD follows the format of the PASCAL VOC dataset, so its data format is compatible with existing algorithms and it can be used with few code changes.

SSDD data were obtained by downloading public SAR images from the internet. Figure 5 shows some of the images in the dataset. The target areas were cropped to approximately 500 × 500 pixels and the ship target locations were manually marked. Any image containing a ship was included, with no restriction on ship type. The data mainly include the HH, HV, VV and VH polarization modes. The dataset contains 1160 images and a total of 2456 ships of various sizes. Although SSDD has few images, the number of ship targets is sufficient for a detection network that only recognizes ships. The correspondence between the number of images and the number of ships in the dataset is shown in Table 2.

**Figure 5.** Illustration of the diversity of ship targets in the SSDD dataset.


**Table 2.** Correspondence between NoS and NoI in the SSDD dataset.

NoS = number of ships; NoI = number of images.

In addition, to verify the detection performance of our proposed method in different scenarios, we introduced another large-scale SAR target detection dataset, the HRSID dataset. The images in this dataset are high-resolution SAR images, mainly used for ship detection, semantic segmentation and instance segmentation tasks. The dataset contains a total of 5604 high-resolution SAR images and 16,951 ship instances. The HRSID dataset borrows from the construction process of the Microsoft Common Objects in Context (COCO) dataset and includes SAR images with different resolutions, polarizations, sea states, sea areas and coastal ports. The resolutions of the SAR images in HRSID are 0.5 m, 1 m and 3 m.

To make a fair comparison with previous work, we used the same settings as previous studies. We randomly divided the original SSDD dataset according to the 8:2 ratio commonly used in existing studies: 80% of the data were used for training all methods, and the remaining 20% were used as a test set to evaluate the detection performance of all methods. The training and test sets did not overlap for any method, ensuring the rigor and fairness of the experiment. The batch size was 8 and the input image size was 640. RandomHorizontalFlip, ColorJitter and multiscale augmentation were adopted, together with the Mosaic and MixUp enhancement strategies. The learning rate was scaled as lr × batchsize/64 with a cosine schedule and an initial lr of 0.01; the weight decay was 0.0005 and the SGD momentum was 0.9. A total of 600 epochs were trained. For the HRSID [35] dataset, we split the data at a ratio of 6.5:3.5, with 65% for training and 35% for testing, the same split as the original authors. The input image size was 800. All experiments in this paper were carried out on an Ubuntu 18.04 system equipped with a GeForce RTX 2060.
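The random 8:2 split described above can be sketched as follows (pure Python; a fixed seed is assumed here for reproducibility, and the image identifiers are hypothetical):

```python
import random

def split_dataset(image_ids, train_ratio=0.8, seed=42):
    """Randomly split image ids into disjoint train/test sets,
    as in the 8:2 SSDD split described above."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)      # fixed seed: repeatable split
    n_train = int(len(ids) * train_ratio)
    return ids[:n_train], ids[n_train:]

# Example: SSDD has 1160 images
train_ids, test_ids = split_dataset(range(1160))
print(len(train_ids), len(test_ids))  # 928 232
```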

#### *4.2. Evaluation Indicators*

We used mean average precision (mAP) to analyze and verify the detection performance of our proposed method. Average precision is derived from precision and recall.

Precision is the proportion of detected targets in the test set that are correct. It is defined in terms of true positives (TPs) and false positives (FPs):

$$P = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{9}$$

TP means that the classifier's prediction is positive and correct; FP means that the prediction is positive but incorrect.

Recall is the proportion of all positive samples in the test set that are correctly identified, derived from true positives (TPs) and false negatives (FNs):

$$R = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{10}$$

FN means that the classifier's prediction is negative but incorrect, i.e., a missed target. From precision and recall, the average precision is obtained. Graphically, it is the area under the precision–recall curve, defined as follows:

$$\text{mAP} = \int\_0^1 P(R)dR \tag{11}$$
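As a concrete illustration of Equations (9)–(11), the sketch below computes AP from a ranked detection list using all-point integration of the area under the precision–recall curve (a common evaluation convention, assumed here; the scores and labels are hypothetical):

```python
def average_precision(scores, labels, n_positives):
    """AP = integral of P(R) dR (Eq. (11)), accumulated over a
    confidence-ranked detection list.
    scores: detection confidences; labels: 1 = correct detection (TP),
    0 = false positive; n_positives: number of ground-truth targets."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    ap = prev_recall = 0.0
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)   # Eq. (9)
        recall = tp / n_positives    # Eq. (10)
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap

# hypothetical ranked detections: the third-highest score is a false positive
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1], n_positives=3)
print(round(ap, 4))  # 0.9167
```

With a single object class, as in ship detection, mAP coincides with the per-class AP.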

#### *4.3. Experimental Results and Analysis*

#### 4.3.1. Ablation Experiments on SSDD Datasets

To clearly compare the advantages of the added modules, we conducted the following ablation experiments. In the first experiment, the backbone network Darknet-53 was replaced with the lightweight backbone NLCNet while all other settings remained unchanged. In the second, the position-enhanced attention module was added to the original network, again without changing any other settings or parameters.

It should be noted that the methods in the ablation experiment were reproduced according to the official open-source code of the comparison method and applied to the SSDD dataset for experimental comparison. The dataset used by the comparison method was exactly the same as that used by our proposed method; the hyperparameters of the comparison method were all set with standard default settings and the number of training epochs was also consistent with our method.

#### Influence of the NLCNet Backbone Network on the Experimental Results

The backbone Darknet-53 was replaced with our proposed NLCNet based on YOLOX, as previously shown in Figure 2. The mAP increased by 0.6%, from 96.2% to 96.8%, and the FLOPs dropped by 8.37, from 26.64 to 18.27. These results show that our redesigned NLCNet has advantages in feature extraction for SAR ship targets, not only improving accuracy but also reducing the number of parameters, making the model more lightweight and easier to deploy in industrial settings.

#### Influence of Position-Enhanced Attention on Experimental Results

To verify the effectiveness and advantages of our proposed position-enhanced attention, we conducted ablation experiments on our dataset with the original network without attention, the network with coordinate attention, and the network with our proposed position-enhanced attention. The experimental results are shown in Table 3. The mAP of the network with position-enhanced attention was greatly improved compared to the network without attention, rising from 96.8% to 97.4%, while the increase in FLOPs and params was negligible. This shows the effectiveness of our proposed position-enhanced attention. Compared with the network with coordinate attention, the detection accuracy of our model increased from 97.1% to 97.4% with the parameters and FLOPs unchanged. Note that we kept two decimal places when reporting the results; at this precision, the FLOPs and params of our position-enhanced attention model and the original coordinate attention model are the same. The results thus demonstrate the advantages of our designed positional attention. The visualization results are shown in Figure 6.

**Table 3.** Results of ablation experiments. (mAP, FLOPs, Params and average inference time represent detection accuracy, computational complexity, parameter amount and average inference time, respectively).


#### 4.3.2. Comparison with the Latest Target Detection Methods Using SSDD Datasets

To further demonstrate the validity of our work, we compared LPEDet with some of the latest target detection methods, including one-stage methods (RetinaNet, SSD300, YOLOv3, YOLOX, YOLOv5 and AFSar), two-stage methods (Faster R-CNN, Cascade R-CNN and FPN) and anchor-free methods (CornerNet, CenterNet and FCOS). Among them, considering that our model mainly focuses on the lightweight design of the backbone network, for a fair comparison of performance, we cite the results of the backbone ablation experiments of AFSar [14]. The results are shown in Table 4. As seen from the table, our work outperformed the other methods not only in terms of precision but also in terms of speed.

**Figure 6.** Visualization effect of the ablation experiment. The purple box is the target that was missed when marked, the red box is the target of misdetection and the blue box is a missed target. CA = coordinate attention; PEA = position-enhanced attention.



It should be noted that, except for AFSar, the methods in the comparison experiments were reproduced from the official open-source code of each comparison method and applied to the SSDD dataset for experimental comparison. The dataset used by the comparison methods was exactly the same as that used by our proposed method; the hyperparameters of the comparison methods were all left at their standard default settings, and the number of training epochs was also consistent with our method.

We also visualized the detection results of these methods. As shown in Figure 7, the missed-detection and false-detection rates of the above methods were significantly higher than those of the proposed LPEDet method. Our method performs well on small targets, complex backgrounds and densely packed targets. These findings show the effectiveness of our approach.

#### *4.4. Comparison with the Latest SAR Ship Detection Methods Using SSDD Datasets*

To further verify the performance of our method, we also compared it with the latest SAR ship detection methods, as shown in Table 5.


**Table 5.** Comparison with the latest SAR ship detection methods.

The comparison methods and related experimental results in Table 5 require special explanation. Since none of these comparison methods provide open-source code, it is difficult for us to exactly reproduce their implementations and parameter settings. Therefore, to fairly compare the performance of the different detection methods, we directly cite the highest detection results reported in the original references. In particular, for the FLOPs and params indicators, most of the comparison methods report no results, so we only cite the best published results for the other indicators. In addition, the results of the comparison methods listed in Table 5 are mainly from references [42,43].

The results show that not only does our LPEDet achieve SOTA in accuracy but it also has a relatively faster inference speed, which shows the high efficiency of our method.

#### *4.5. Comparison with the Latest SAR Ship Detection Methods Using HRSID Datasets*

In order to fully verify the performance stability of the proposed LPEDet method across different datasets, we introduce a new large-scale SAR target detection dataset, HRSID, and compare a variety of state-of-the-art SAR target detection methods on it. The specific results are shown in Table 6, below. The comparison shows that, relative to the latest SAR ship target detection methods, the proposed LPEDet method is superior in accuracy while also having the lowest parameter count and computational complexity, which proves the stability of our method across datasets. As can be seen in Table 6, although CenterNet2 scores slightly higher than our model on AP, AP75, APM and APL, the differences are small, and our model has roughly 1/15 of its parameters; considering both accuracy and speed, our model still performs better overall.

**Figure 7.** Visual detection results of the latest methods. The purple box is the target that was missed when marked, the red box is the target of misdetection and the blue box is a missed target.


**Table 6.** Comparison of the latest SAR target detection methods on HRSID.

#### *4.6. The Effect of the Number of Training Sets on Detection Performance*

In order to verify the robustness of our proposed method under different amounts of training data, we redivided the dataset using 33% and 66% of the data for training, respectively, and evaluated the performance of our model. The results are shown in Table 7, below. We analyzed them together with Table 4 (in which all methods are trained on 80% or more of the data): with only 66% of the training data, the mAP still reaches 96.8%, which is better than most state-of-the-art SAR target detection methods; with only 33% of the training data, the mAP still reaches 94.6%, outperforming RetinaNet and SSD300 and not far from the other state-of-the-art SAR target detection methods. This analysis shows that the proposed LPEDet method can still outperform the latest SAR target detection methods with less training data, that it has good robustness, and that it can greatly reduce the labor cost of manually labeling data.


#### **5. Conclusions**

Multi-platform SAR earth observation equipment has accumulated massive amounts of high-resolution SAR target image data, and SAR image target detection has great engineering application value in military and civilian fields. To address the problems of unclear target contours, complex backgrounds, strong scattering and multiple scales in SAR images, a new anchor-free SAR ship detection algorithm, LPEDet, was proposed to improve the accuracy and speed of SAR ship detection in a balanced manner. First, YOLOX was used as the benchmark detection network; then, a new lightweight backbone, NLCNet, was designed. To further improve localization accuracy, we designed a position-enhanced attention strategy. The experimental results on the SSDD dataset showed that the mAP of LPEDet reached 97.4%, achieving SOTA, while the average inference time for a single image is only 7.01 ms at an input size of 640. On the HRSID dataset, our model is also stable, with an AP50 of 89.7%, which is superior to other state-of-the-art object detection methods, while its computational complexity, number of parameters and average inference time are the lowest. In the future, based on the Hisea-1 SAR satellite that our research group participated in launching, our group has independently constructed a larger-scale multi-type SAR target image dataset, on which we will verify the effectiveness of the proposed LPEDet algorithm. Common SAR image artifacts such as speckle noise can affect SAR target detection results, and Mukherjee et al. [48] have demonstrated that their methods can respond to various types of image artifacts. Therefore, in the future, we will consider introducing image quality metrics to evaluate and correct the quality of input SAR images so as to more comprehensively iterate and verify the robustness of our SAR target detection method.

**Author Contributions:** Conceptualization, Y.F.; Methodology, Y.F.; Project administration, J.C. and Z.H.; Software, Y.F.; Supervision, J.C., Z.H., B.W., R.X., L.S. and M.X.; Validation, Y.F.; Visualization, Y.F.; Writing—original draft, Y.F.; Writing—review & editing, J.C. and H.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the National Natural Science Foundation of China under Grant 62001003, in part by the Natural Science Foundation of Anhui Province under Grant 2008085QF284 and in part by the China Postdoctoral Science Foundation under Grant 2020M671851.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **A Low-Grade Road Extraction Method Using SDG-DenseNet Based on the Fusion of Optical and SAR Images at Decision Level**

**Jinglin Zhang 1, Yuxia Li 1,\*, Yu Si 1, Bo Peng 1, Fanghong Xiao 1, Shiyu Luo <sup>1</sup> and Lei He <sup>2</sup>**


**Abstract:** Low-grade roads have complex features, such as geometry, reflection spectrum, and spatial topology, in optical remote sensing images due to the different materials of these roads, and they are also easily obscured by vegetation or buildings, which leads to low accuracy in low-grade road extraction from remote sensing images. To address this problem, this paper proposes a novel deep learning network referred to as SDG-DenseNet, as well as a fusion method of optical and Synthetic Aperture Radar (SAR) data at the decision level, to extract low-grade roads. On the one hand, in order to enlarge the receptive field and ensemble multi-scale features in commonly used deep learning networks, we develop SDG-DenseNet in terms of three modules: the stem block, the D-Dense block, and the GIRM module. The stem block applies two consecutive small-sized convolution kernels instead of one large-sized convolution kernel, the D-Dense block applies three consecutive dilated convolutions after the initial Dense block, and the Global Information Recovery Module (GIRM) combines the ideas of dilated convolution and the attention mechanism. On the other hand, considering the penetrating capacity and oblique observation of SAR, which can obtain information on low-grade roads obscured by vegetation or buildings in optical images, we integrate the road extraction result from SAR images into that from optical images at the decision level to enhance the extraction accuracy. The experimental results show that the proposed SDG-DenseNet attains higher *IoU* and *F*1 scores than other network models applied to low-grade road extraction from optical images. Furthermore, the decision-level fusion of road binary maps from SAR and optical images further significantly improves the *F*1, *COR*, and *COM* scores.

**Keywords:** low-grade road extraction; remote sensing; image segmentation; SAR image; deep learning

#### **1. Introduction**

Research on extracting road information from remote sensing images has been carried out for many years. However, accurate extraction of road information remains a research frontier and a technical difficulty in the field of remote sensing information extraction, for several reasons: different grades of roads, such as national roads, provincial roads, village roads, and mountain roads, have different width and shape characteristics; roads made of different materials, such as cement, asphalt, and earth, have different color and texture characteristics; and road areas are often blocked by buildings, trees, the central green belt of the road, and many other factors.

Road extraction can be described as a pixel-level binary classification problem that distinguishes whether each pixel belongs to a road or not [1]. Recently, deep convolution neural networks (DCNNs) have been demonstrated to bring significant improvements to typical computer vision tasks such as semantic segmentation [2]. Road semantic segmentation has applications in many fields, such as autonomous driving [3,4], traffic management [5], and smart city construction [6]. Semantic segmentation requires pixel-level classification [7–9], and it must combine pixel-level accuracy with multi-scale contextual reasoning [7–10]. In general, the simplest way to aggregate multi-scale context is to feed multi-scale inputs into the network and merge the features of all scales. Some researchers have made much progress in this direction. Farabet et al. [11] obtained images at different scales by transforming the input image through a Laplacian pyramid. References [12,13] applied multi-scale inputs sequentially, from coarse to fine. References [7,14,15] directly resized the input image to several scales. Another way to aggregate multi-scale context is to adopt an encoder–decoder structure, such as SegNet [16], U-Net [17], RefineNet [18], and other networks [19–21], which have demonstrated the effectiveness of models based on the encoder–decoder structure. In addition, the context module is an effective way to aggregate multi-scale context information, such as merging DenseCRF [9] into DCNNs [22,23]. The spatial pyramid pooling structure is also a common method to aggregate multi-scale context, such as the Pyramid Scene Parsing Network (PSP) [24,25].

**Citation:** Zhang, J.; Li, Y.; Si, Y.; Peng, B.; Xiao, F.; Luo, S.; He, L. A Low-Grade Road Extraction Method Using SDG-DenseNet Based on the Fusion of Optical and SAR Images at Decision Level. *Remote Sens.* **2022**, *14*, 2870. https://doi.org/10.3390/rs14122870

Academic Editor: Giuseppe Scarpa

Received: 5 May 2022; Accepted: 14 June 2022; Published: 15 June 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

A larger receptive field is critical for networks because it can capture more global context information from the input images. For a standard convolutional neural network (CNN), the traditional way to expand the receptive field is to stack more convolutional layers with larger kernel sizes, but this operation results in a rapid expansion of the number of training parameters, which makes networks hard to train. An alternative way to expand the receptive field is to stack more pooling layers, which expand the receptive field by reducing the dimensions of the feature maps while maintaining the salient characteristics. Although pooling operations add no training parameters, much information is lost because of the decrease in spatial resolution.

Reference [23] developed a convolutional network module, dilated convolution, which aggregates multi-scale contextual information without increasing the training parameters or decreasing the resolution. The module can aggregate contextual information at multiple scales by using different expansion rates for the dilated convolution kernels. Moreover, it can be plugged into existing architectures at any image resolution, which makes it appropriate for dense prediction. Therefore, DeepLab v2 [26], DeepLab v3 [27], DeepLab v3+ [28], and D-LinkNet [1], which adopted dilated convolution for semantic segmentation, showed better performance.
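The effect described above can be seen in a one-dimensional toy example (a NumPy sketch of the operation, not code from any of the cited networks): with the same 3-tap kernel and therefore the same parameter count, raising the dilation rate widens the span of input samples that each output value sees.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """'Valid' 1D dilated convolution (cross-correlation form):
    y[i] = sum_k x[i + k*dilation] * w[k]."""
    k = len(w)
    span = (k - 1) * dilation + 1      # input samples seen per output
    return np.array([sum(x[i + j * dilation] * w[j] for j in range(k))
                     for i in range(len(x) - span + 1)])

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])          # 3 parameters in both cases
print(dilated_conv1d(x, w, dilation=1))  # each output spans 3 samples
print(dilated_conv1d(x, w, dilation=2))  # each output spans 5 samples
```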

Another effective strategy to increase the ability to capture global features is to introduce an attention mechanism. Reference [29] first introduced an attention mechanism into computer vision tasks, and it has been proven to be reliable. DANet [30] adopts spatial and channel attention modules to obtain more global context information. CBAM [31] introduced a lightweight spatial and channel attention module. DA-RoadNet [32] constructed a novel attention mechanism module to improve the network's ability to explore and integrate road features.

Existing network structures for semantic segmentation are divided into several parts, and the networks in [1,26–28] adopt dilated convolutions in only one part. In fact, the encoder and decoder parts of existing architectures for semantic segmentation are built by stacking residual blocks or dense blocks. We therefore add dilated convolution layers after each block to capture more global context information. In this research, a new structure, the D-Dense block, combining traditional convolution layers and dilated convolution layers, is proposed. Further, a network for road extraction from satellite images is built with D-Dense blocks and the center part of D-LinkNet. To increase the ability to capture global features, the DA mechanism [30] is also introduced into the network. With the above design, dilated convolution runs through the whole network and is effectively integrated with the attention mechanism to obtain more global features and information. The presented network was evaluated through controlled experiments on the Massachusetts Roads dataset. The experiments demonstrate that the D-Dense block with attention mechanism architecture reliably increases the pixel-level accuracy of semantic segmentation.

Since SAR works in all weather and has strong penetration, SAR images have irreplaceable advantages in remote sensing road extraction and can further improve the accuracy of road information extraction. Many traditional road segmentation methods for SAR images have been proposed and proven effective. Methods based on human–computer interaction are called semi-automatic methods. Bentabet et al. [33] were the first to use the snake model for SAR image road extraction. Their experiments show that straight or curved roads can be accurately extracted by this model, but it requires a large number of human–computer interactions [34]. Some automatic methods have also proven useful. Cheng et al. [35] proposed a method based on Markov random fields (MRFs); to maximize computational efficiency, their method performs GPU-accelerated road extraction. There are also deep learning methods for road extraction from SAR images. Wei et al. [36] used ordinal regression and introduced a road-topology loss, which improves the baseline by up to 11.98% in the *IoU* metric on their own dataset.

Focusing on the problems of low-grade roads in remote sensing images, we study how to improve the accuracy of road extraction in complex scenes using the powerful feature expression ability of deep learning and the penetrating ability of SAR images.

In this paper, we propose a novel deep learning network model called SDG-DenseNet to improve the accuracy of low-grade road extraction from optical remote sensing images. We fuse the extraction results from the SAR image into that of the optical image at the decision level, which improves the accuracy of low-grade road extraction in practical application scenarios. Therefore, the main contribution of this study can be summarized as:


#### **2. Methods**

In order to improve image semantic segmentation accuracy, a novel SDG-DenseNet network for low-grade road extraction from optical images is proposed. The novel SDG-DenseNet for optical image semantic segmentation is composed of three parts: an encoder path, a decoder path, and the center part, the global information recovery module. The encoder path takes RGB images as input and extracts features by stacking convolutional layers and pooling layers. The decoder path restores the detailed information and expands the spatial dimensions of the feature maps with deconvolutional layers. The center part is responsible for enlarging the receptive field, integrating multi-scale features, and maintaining the detailed information simultaneously. Skip connections encourage the reuse of feature maps to help the decoder path recover spatially detailed information. In addition, a decision-level fusion method is introduced to fuse the results of the optical image and the SAR image, which mainly contains six steps: data preparation, pretreatment, image registration, road extraction, road segmentation, and decision-level integration. Figure 1 shows the overall structure of the proposed method.

**Figure 1.** The overall technical flow of the proposed method.
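As a simple illustration of the decision-level integration step, the sketch below fuses two binary road maps. The pixelwise-OR rule used here is an assumption for illustration only, chosen because it lets SAR recover road pixels occluded in the optical image; it is not necessarily the exact fusion rule of the proposed method.

```python
def fuse_decision_level(optical_mask, sar_mask):
    """Fuse two binary road maps (nested lists of 0/1) at the decision
    level. Assumed rule: pixelwise OR, so a pixel is road if either
    source detects it. (Illustrative rule only.)"""
    return [[o | s for o, s in zip(o_row, s_row)]
            for o_row, s_row in zip(optical_mask, sar_mask)]

optical = [[1, 0, 0],
           [1, 0, 0]]   # road broken by occlusion in the optical image
sar     = [[0, 0, 0],
           [1, 1, 1]]   # SAR sees the road through the vegetation
print(fuse_decision_level(optical, sar))
# [[1, 0, 0], [1, 1, 1]]
```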

#### *2.1. Architecture of SDG-DenseNet Network*

Because low-grade roads are easily blocked by vegetation or buildings, fracture and discontinuity problems often occur when extracting low-grade roads from optical images. At the same time, due to the low construction standard of low-grade roads, their materials are often consistent with the surrounding environment, and they often blend into the background in the optical orthographic projection, resulting in a poor extraction effect. Given these problems, it is necessary to design a specialized network and improve its ability to extract global information.

Based on D-LinkNet, SDG-DenseNet is proposed. In order to improve the extraction of global information, the global information recovery module is introduced into the proposed network for semantic segmentation. Furthermore, the novel network takes DenseNet as its backbone instead of ResNet and replaces the initial block with the stem block. Additionally, an attention mechanism is introduced to improve the ability to obtain global information. The overall structure of the SDG-DenseNet network is shown in Figure 2.

#### *2.2. Improved D-Dense Block and Stem Block*

The construction of the D-Dense block is shown in Figure 3. In contrast to the original Dense block, we add three consecutive dilated convolution layers with different expansion rates after the original Dense block. The expansion rates of these three dilated convolutions are 2, 4, and 8, respectively. The structure of each dilated convolution is BN-ReLU-Conv (1 × 1)-BN-ReLU-D\_Conv (3 × 3, rate = 2, 4, or 8). The same computation process as in the original Dense block is repeated (*n* + 3) times, so the D-Dense block generates feature maps with (*n* + 3) × *k* channels.
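The receptive-field gain from the three added dilated layers can be checked with a short calculation: for stride-1 layers, each k × k convolution with dilation d adds (k − 1)·d to the receptive field. The figures below assume 3 × 3 kernels throughout, as in the D-Dense block description.

```python
def receptive_field(layers):
    """Receptive field of a stack of stride-1 convolution layers.
    layers: list of (kernel_size, dilation) pairs; each layer adds
    (kernel_size - 1) * dilation to the field seen by one output."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

# three plain 3x3 convs vs. the three dilated convs with rates 2, 4, 8
print(receptive_field([(3, 1)] * 3))              # 7
print(receptive_field([(3, 2), (3, 4), (3, 8)]))  # 29
```

So the three dilated layers widen the receptive field from 7 to 29 pixels along each axis without adding any parameters beyond three ordinary 3 × 3 convolutions.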

**Figure 2.** The construction of the SDG-DenseNet.

**Figure 3.** The construction of the D-Dense block.

The encoder starts with an initial block that performs convolution on the input image with a 7 × 7 kernel and a stride of 2, followed by 3 × 3 max pooling; the output of the initial block has 64 channels. Inspired by Inception v3 [37] and v4 [38], References [39,40] replaced the initial block [41] (a 7 × 7 convolution layer with stride 2 followed by a 3 × 3 max pooling layer) with the stem block. The stem block is composed of three 3 × 3 convolution layers and one 2 × 2 mean pooling layer. The stride of the first convolution layer is 2 and the others are 1, and the output of each of the three convolution layers has 64 channels. The experimental results in Reference [40] showed that the initial block loses much information due to two consecutive aggressive down-sampling operations, making it hard to recover the marginal information of objects in the decoder phase. The stem block is helpful for object detection, especially for small objects, so this research also adopts the stem block at the beginning of the encoder phase.
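Both blocks reduce the spatial size by a factor of 4 overall; the difference is that the stem block performs only one strided convolution before its pooling step. A minimal sketch of the output-size arithmetic, assuming 'same' padding throughout:

```python
import math

def down(size, stride):
    # output spatial size of a 'same'-padded conv or pool with this stride
    return math.ceil(size / stride)

def initial_block(size):
    size = down(size, 2)   # 7x7 conv, stride 2
    size = down(size, 2)   # 3x3 max pool, stride 2
    return size

def stem_block(size):
    size = down(size, 2)   # 3x3 conv, stride 2
    size = down(size, 1)   # 3x3 conv, stride 1
    size = down(size, 1)   # 3x3 conv, stride 1
    size = down(size, 2)   # 2x2 mean pool, stride 2
    return size

# same overall downsampling factor of 4, but the stem block extracts
# features with two extra convolutions before the second downsample
assert initial_block(1024) == stem_block(1024) == 256
```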

#### *2.3. Global Information Recovery Module (GIRM) Based on d-Blockplus and Attention Mechanism*

In order to weaken or eliminate the problems of road fracture and low recall in low-grade road extraction, this paper proposes a global information recovery module, which is composed of a dual attention mechanism and d-blockplus. The module aims to further improve the network's ability to obtain global information and thereby ensure the integrity of the extraction results.

As shown in Figure 4, the global information recovery module is mainly composed of two parts. The dual attention mechanism works in two directions, spatial attention and channel attention: it extracts and integrates the global information of space and channel and increases the attention paid to road targets. d-blockplus introduces multi-layer dilated convolution to enlarge the receptive field, so as to improve the network's ability to maintain the integrity of road extraction.

**Figure 4.** The construction of GIRM.

In the center part of the SDG-DenseNet, in addition to the d-blockplus, the position attention module (PAM) and the channel attention module (CAM) are also introduced. PAM and CAM are two reliable self-attention modules, which improve the ability of the network to obtain global information in the spatial dimension and channel dimension, respectively.

Figure 5 shows the structure of PAM. In PAM, the input feature maps go through two branches: one branch is used as *Q* and *K* to generate an (*H* × *W*) × (*H* × *W*) attention probability map, and the other is used as *V*. Here, *V*, *Q*, and *K* represent the value, query, and key features, respectively, and *C*, *H*, and *W* represent the channel, height, and width of the feature map, respectively. The overall structure of PAM is expressed in Equation (1):

$$\begin{array}{l} Att = softmax(Q\_{(HW\times C)} \cdot K\_{(C\times HW)})\\ F\_{out} = (V\_{(C\times HW)} \cdot Att).reshape(C \times H \times W) + Input\_{(C\times H\times W)} \end{array} \tag{1}$$

**Figure 5.** Architecture of the position attention module [30].

Figure 6 shows the structure of CAM. The structure of CAM is basically similar to that of PAM. CAM pays more attention to the information on the channel. In this network structure, the size of the probability map generated by CAM is (*C* × *C*), which helps to boost feature discrimination. The overall structure of CAM is shown in Equation (2):

$$\begin{array}{l} Att = softmax(Q\_{(C\times HW)} \cdot K\_{(HW\times C)})\\ F\_{out} = (Att \cdot V\_{(C\times HW)}).reshape(C \times H \times W) + Input\_{(C\times H\times W)} \end{array} \tag{2}$$
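A simplified numpy sketch of the two attention maps: the spatial map is (*HW* × *HW*) and the channel map is (*C* × *C*). For brevity the input features are used directly as *Q*, *K*, and *V*; the real modules derive them with learned 1 × 1 convolutions and a learnable scale factor:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pam(feat):
    # Position attention: (HW x HW) probability map over spatial locations
    C, H, W = feat.shape
    x = feat.reshape(C, H * W)
    att = softmax(x.T @ x, axis=-1)           # (HW, HW)
    out = (x @ att).reshape(C, H, W)
    return out + feat                         # residual connection

def cam(feat):
    # Channel attention: (C x C) probability map over channels
    C, H, W = feat.shape
    x = feat.reshape(C, H * W)
    att = softmax(x @ x.T, axis=-1)           # (C, C)
    out = (att @ x).reshape(C, H, W)
    return out + feat

f = np.random.default_rng(0).standard_normal((8, 4, 4))
assert pam(f).shape == f.shape and cam(f).shape == f.shape
```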

d-block has four paths containing dilated convolutions, in two cascade modes and two parallel modes, respectively. In each path, dilated convolutions are stacked with different dilation rates. Consequently, the receptive field of each path is different, and the network can aggregate multi-scale context information. Inspired by MobileNetV2 [42], to save network parameters and improve network performance, the bottleneck block is introduced into d-block to build d-blockplus. Figure 7 shows the structure of d-blockplus.

**Figure 7.** The construction of d-blockplus.

#### *2.4. Decision-Level Fusion Algorithm for Low Grade Roads*

In optical images, low-grade roads are often blocked by buildings, vegetation, shadows, and so on. The background of buildings and vegetation is often quite different from the road, and the blocked part is often not judged as a road by the deep learning model, which directly leads to fractures or missed detections in the extraction results for low-grade roads. Figure 8 shows some examples of blocked roads. In these pictures, the roads in the red boxes show fractures in the optical image because they are obscured by vegetation, buildings, or shadows.

**Figure 8.** Figures of blocked roads.

For the complex scenes described above, the imaging mechanism of the optical image means that the SDG-DenseNet model alone cannot solve the problem of poor road continuity well. In this paper, the optical-image extraction results from the SDG-DenseNet model and the SAR-image extraction results based on the Duda and path operators [43] are combined through decision-level fusion.

The Duda operator is a linear feature extraction operator that divides an *N* × *N* window into three parallel linear parts. The specific structure of the Duda operator is shown in Figure 9, where A, B, and C represent the mean gray values of the three parts, and C1 and C2 those of the two halves of the center part. Moreover, the operator shown in Figure 9a has a relatively strong ability to extract roads in the horizontal direction, and the operator shown in Figure 9b has a relatively strong ability to extract roads with a certain inclination angle.

The other two types of Duda Operators are a 90-degree rotation of the above two. The function to determine the new value of a pixel can be expressed as follows:

$$H(x) = \left(1 - \frac{C}{A}\right)\left(1 - \frac{C}{B}\right)\frac{C\_1}{C\_2}.\tag{3}$$
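A minimal sketch of the horizontal Duda response of Equation (3). The assumption that the window is split into three equal horizontal strips, with C1 and C2 the left and right halves of the center strip, is ours; the exact partition follows Figure 9:

```python
import numpy as np

def duda_response(window):
    """Horizontal Duda response for an N x N window (sketch, not the
    authors' code). A and B are the mean gray values of the outer strips,
    C of the center strip, C1/C2 of its left and right halves."""
    n = window.shape[0]
    third = n // 3
    A = window[:third].mean()
    c_strip = window[third:2 * third]
    B = window[2 * third:].mean()
    C = c_strip.mean()
    C1 = c_strip[:, : n // 2].mean()
    C2 = c_strip[:, n // 2:].mean()
    return (1 - C / A) * (1 - C / B) * (C1 / C2)

# a bright background with a dark horizontal line through the center
w = np.full((9, 9), 200.0)
w[3:6, :] = 20.0
assert duda_response(w) > 0      # dark linear feature detected
```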

Path operators refer to path openings and closings, which are morphological filters applied to analyze oriented linear structures in images. The morphological filter defines adjacency graphs as structuring elements. Four different adjacency graphs are defined, corresponding to horizontal lines, vertical lines, and two diagonal lines, respectively. Applying these four adjacency graphs to a binary image, the maximum path length of each pixel can be obtained. Then, the pixels whose maximum path lengths are larger than the threshold L*min* are retained in the image.
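A simplified illustration for the horizontal adjacency graph only: along a horizontal graph, the maximum path length through a pixel reduces to the length of its foreground run, so thresholding by L*min* keeps pixels on sufficiently long runs. The full path operators also handle the vertical and diagonal graphs and tolerant (incomplete) paths:

```python
import numpy as np

def horizontal_path_filter(binary, l_min):
    """Keep pixels lying on horizontal foreground runs of length >= l_min
    (a stand-in for a path opening with the horizontal adjacency graph)."""
    out = np.zeros_like(binary)
    for r in range(binary.shape[0]):
        row = binary[r]
        start = 0
        for c in range(len(row) + 1):
            # close the current run at a background pixel or at end of row
            if c == len(row) or not row[c]:
                if c - start >= l_min:
                    out[r, start:c] = 1
                start = c + 1
    return out

img = np.array([[1, 1, 1, 1, 0],
                [0, 1, 0, 1, 1]])
assert horizontal_path_filter(img, 3).sum() == 4   # only the 4-pixel run survives
```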

**Figure 9.** Window structure of two types of Duda operators.

The specific algorithm flow of the decision-level fusion method for low grade roads is shown in Figure 10.

**Figure 10.** Low-grade road extraction algorithm for decision-level fusion of optical and SAR images.

Figure 10 shows the overall technical process of the road extraction algorithm based on the decision-level fusion of high-resolution optical and SAR remote sensing images. The specific steps of the algorithm are as follows.

*Step 1*: Data preparation. Obtain optical remote sensing images and SAR images in the same area, and their imaging time should be as close as possible;

*Step 2*: Pretreatment. The optical remote sensing image and SAR image are preprocessed, respectively, including radiometric correction, geometric correction, geocoding, and so on;

*Step 3*: Image registration. The optical remote sensing image and SAR image are matched and transformed into the same pixel coordinate system;

*Step 4*: Road extraction. Roads in optical remote sensing images are extracted by SDG-DenseNet and those in SAR images are extracted by the method in Reference [43], which is based on Duda and Path operator;

*Step 5*: Road segmentation. For the road extraction results of the optical remote sensing image and the SAR image, road segments are obtained by a segmentation method, and the attributes of each segment are recorded;

*Step 6*: Decision-level fusion. Taking the line segment as the basic unit, the final road distribution map is obtained by decision-level fusion of the roads extracted from the optical remote sensing image and the SAR image.

#### **3. Experiments**

Our network experiments are performed on the Massachusetts Roads Dataset, and we test the fusion method on our own dataset built from WorldView-2, WorldView-4, and TerraSAR-X images. The TensorFlow platform was selected as the deep learning framework to train and test all networks. All models were trained on one NVIDIA RTX 2080 Ti GPU.

#### *3.1. Dataset and Data Augmentation*

Three sets of satellite images were applied to evaluate the low-grade road extraction method. To verify the effectiveness of the proposed SDG-DenseNet network on public datasets, we tested the SDG-DenseNet on the Massachusetts dataset. In addition, we conducted low-grade road extraction experiments on the self-built Chongzhou–Wuzhen dataset. Finally, we conducted decision-level fusion experiments on two sets of large-scale images from the Chongzhou and Wuzhen regions including optical and SAR images.

We trained and tested our SDG-DenseNet network model on the Massachusetts Roads Dataset [44], which consists of 1108 training images, 14 validation images, and 49 test images, each of size 1500 × 1500. We cut each 1500 × 1500 image into four 1024 × 1024 images, obtaining 4432 training images, 56 validation images, and 196 test images. Further, we performed data augmentation on the training set, including rotation, flipping, cropping, and color jittering, which helps prevent overfitting to the training set. After data augmentation, the final dataset contains 22,160 training images, 56 validation images, and 196 test images.
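The tile and image counts can be checked with a small sketch. The assumption that the four 1024 × 1024 tiles are taken from the four corners of each 1500 × 1500 image (with overlap) is ours; the paper only states that each image is cut into four tiles:

```python
def corner_crops(size=1500, tile=1024):
    """Offsets of four overlapping corner crops covering a size x size
    image with tile x tile tiles (hypothetical cropping scheme)."""
    off = size - tile   # 476: second crop starts here so tiles overlap
    return [(r, c) for r in (0, off) for c in (0, off)]

crops = corner_crops()
assert len(crops) == 4
assert all(r + 1024 <= 1500 and c + 1024 <= 1500 for r, c in crops)

# image counts quoted in the text:
assert 1108 * 4 == 4432            # tiling the training images
assert 4432 * 5 == 22160           # 5x growth after augmentation
```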

In order to test the proposed SDG-DenseNet network for low-grade road extraction, this paper also tests the SDG-DenseNet on the self-built Chongzhou–Wuzhen dataset. Table 1 displays the three source images of the self-built dataset. We cut the three source images into 13,004 images of 512 × 512 pixels, obtaining 11,788 training images, 204 validation images, and 1012 test images. After data augmentation of the training set, the final dataset contains 47,152 training images, 204 validation images, and 1012 test images.


**Table 1.** The three source images of the self-built low-grade road dataset.

We also test our decision-level fusion experiments on two sets of large-scale images from the Chongzhou and Wuzhen regions including optical and SAR images. The optical images came from WorldView-2 and WorldView-4, while we got the SAR images from TerraSAR-X. As shown in Table 2, in order to test the effect under application conditions, the decision-level fusion experiment is mainly tested on the two large-scale images.


**Table 2.** Two sets of large-scale images used in decision-level fusion experiments.

#### *3.2. Hybrid Loss Function and Implementation Details*

In previous work, most networks train their models only by using the cross-entropy loss [45], which is defined as Equation (4):

$$L\_{ce} = -\frac{1}{N} \sum\_{i=1}^{N} (y\_i \log y\_i' + (1 - y\_i) \log(1 - y\_i')),\tag{4}$$

where *N* indicates the number of categories, and *y* and *y*′ denote the label and prediction vectors, respectively. Since an image consists of pixels, for road area segmentation the imbalance of sample points (the roads cover only a small part of the whole image) pushes the gradient descent direction toward the back corner (Figure 11a), which leads to a local optimum, especially in the early stage [46]. The Jaccard loss function is defined as:

$$L\_{jaccard} = \frac{1}{N} \sum\_{i=1}^{N} \frac{y\_i y\_i'}{y\_i + y\_i' - y\_i y\_i'} \tag{5}$$

**Figure 11.** Different loss function surface. (**a**) Cross entropy surface; (**b**) Jaccard surface.

Its surface is shown in Figure 11b. As we can see, the Jaccard loss can address this problem if we sum the Jaccard loss and the cross-entropy loss together. So, the whole loss function is defined as:

$$L = L\_{ce} - \lambda \log L\_{jaccard} \tag{6}$$

where λ is the weight of the Jaccard loss in the whole loss. Furthermore, the red, green, and blue points in Figure 11 represent the local maxima, saddle points, and local minima on the loss surface, respectively.
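The hybrid loss of Equation (6), with the soft Jaccard index of Equation (5), can be sketched in numpy as follows (the small `eps` terms for numerical stability are our addition):

```python
import numpy as np

def hybrid_loss(y, y_pred, lam=1.0, eps=1e-7):
    """Cross-entropy minus lambda * log(soft Jaccard index), per Eq. (6).
    y, y_pred: arrays of labels and predictions in [0, 1]."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    ce = -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
    jaccard = np.mean(y * y_pred / (y + y_pred - y * y_pred + eps))
    return ce - lam * np.log(jaccard + eps)

y = np.array([1.0, 0.0, 1.0, 1.0])
good = hybrid_loss(y, np.array([0.9, 0.1, 0.8, 0.9]))
bad = hybrid_loss(y, np.array([0.5, 0.5, 0.5, 0.5]))
assert good < bad   # confident correct predictions yield a lower loss
```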

In the training phase, we chose Adam as our optimizer and initially set the learning rate to 0.0001. We reduce the learning rate by a factor of 10 whenever the loss value is observed to decrease slowly. The loss weight λ is set to 1, and the batch size during training is set to 1.

#### *3.3. Decision-Level Fusion Experiment*

To verify the effect of every step in the decision-level fusion method for low-grade roads, we apply the fusion method to the road extraction results from the network and from the method based on the Duda operator and Path operator, using the large-scale images mentioned in Table 2. The details of *Step 6*, where decision-level fusion is performed, are shown in Figure 12.

**Figure 12.** The detailed workflow of the decision level fusion.

As shown in Figure 12, the main process is divided into five steps:

*Step 1*: Road binary map extracted from input optical image and SAR image (not segmented);

*Step 2*: Segment the road binary map extracted from the SAR image, including extracting the road feature direction map, decomposing the binary map according to the direction feature, thinning the decomposed layer based on the curve fitting algorithm, and optimizing the line segment overlap, continuity and intersection to obtain the road segment set extracted from the SAR image;

*Step 3*: Segment the road binary map extracted from the optical image, optimize the overlap and continuity of segments, and record the updated segments of continuity optimization;

*Step 4*: For each road segment extracted from the SAR image, we judge whether it meets the fusion conditions with optical image road extraction results according to the overlap ratio in the corresponding optical extraction road binary layer, and record the qualified SAR road segments;

*Step 5*: After morphological expansion according to the width feature, the continuously optimized and updated line segments and the SAR road line segments meeting the fusion conditions are calculated with the original optically extracted road binary map according to pixels to obtain the fused Road Distribution binary map.

The specific method of searching for line segments satisfying the fusion conditions is shown in Figure 13. Assume that *Am* represents the road area on layer m after the decomposition of the optical-image extraction results, and that L*mn* is line segment n on layer m from the SAR-image road extraction results. They belong to the same layer m, that is, the roads have similar directional features. We then count the numbers of pixels *ln*1 and *ln*2 where L*mn* falls inside and outside the *Am* region, respectively, and calculate the overlap rate r = *ln*1/(*ln*1 + *ln*2). If r is greater than the threshold T*r*, L*mn* is recorded as a road segment meeting the fusion conditions. In practical application, the threshold T*r* takes an empirical value of 0.3. We traverse all SAR-extracted road segments until all segments meeting the above fusion conditions are recorded.
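The overlap-ratio test for a single SAR segment can be sketched as follows. The pixel bookkeeping is ours; layer decomposition and segment extraction are assumed to have been done upstream:

```python
import numpy as np

def meets_fusion_condition(optical_mask, sar_segment_pixels, t_r=0.3):
    """Decide whether SAR segment L_mn should be fused.

    optical_mask: binary road layer A_m from the optical result (same
    direction class as the segment); sar_segment_pixels: (row, col)
    pixels of L_mn. r = l_n1 / (l_n1 + l_n2) is the fraction of segment
    pixels inside A_m; t_r = 0.3 is the empirical threshold T_r."""
    rows, cols = zip(*sar_segment_pixels)
    inside = optical_mask[np.array(rows), np.array(cols)].sum()
    r = inside / len(sar_segment_pixels)
    return r > t_r

mask = np.zeros((5, 10), dtype=int)
mask[2, :6] = 1                           # optical road region A_m
seg = [(2, c) for c in range(10)]         # SAR segment along row 2
assert meets_fusion_condition(mask, seg)  # r = 0.6 > 0.3, so fuse
```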

**Figure 13.** Schematic diagram of road fusion condition judgment for optical and SAR extraction.

#### *3.4. Evaluation Metrics*

In order to evaluate the performance of different road segmentation models, four evaluation metrics are used to evaluate the extraction results: intersection-over-union (*IoU*), completeness (*COM*), correctness (*COR*), and *F*1-score [47], which are defined as:

$$IoU = \frac{TP}{TP + FN + FP} \qquad COR = \frac{TP}{TP + FP}$$

$$COM = \frac{TP}{TP + FN} \qquad F1 = \frac{2 \times COM \times COR}{COM + COR}$$

*TP* (True Positive) indicates that the extraction result is determined to be road and is actually part of a road; *FP* (False Positive) indicates that the extraction result is determined to be road but is not actually part of a road; *FN* (False Negative) indicates that the extraction result is determined not to be road but is actually part of a road. The *COM* scores of the different models show their ability to maintain the completeness of the segmented roads: the higher the score, the better the road continuity extracted by the model. The *COR* scores show their ability to reduce false detections of the segmented roads: the higher the score, the fewer areas are falsely detected. The *IoU* and *F*1 scores are overall evaluation metrics that synthesize the *COM* and *COR* scores and evaluate the overall quality of the segmentation results.

Based on these evaluation metrics, we can obtain the performance of model road extraction results in different aspects from *COM* and *COR* scores, and obtain the overall performance judgment from *F*1 and *IoU* scores.
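The four metrics follow directly from the pixel counts; a small sketch:

```python
def road_metrics(tp, fp, fn):
    """IoU, COM (completeness/recall), COR (correctness/precision), and
    F1 from TP/FP/FN pixel counts, as defined above."""
    iou = tp / (tp + fn + fp)
    com = tp / (tp + fn)
    cor = tp / (tp + fp)
    f1 = 2 * com * cor / (com + cor)
    return iou, com, cor, f1

iou, com, cor, f1 = road_metrics(tp=80, fp=20, fn=20)
assert round(iou, 4) == 0.6667
assert com == 0.8 and cor == 0.8 and round(f1, 4) == 0.8
```

Note that *IoU* is always the strictest of the four: it penalizes both false positives and false negatives in a single denominator.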

#### **4. Results and Discussion**

#### *4.1. Results of the Massachusetts Roads Dataset*

In order to further verify the effectiveness of the proposed method, we evaluated our network on the Massachusetts Roads Dataset. We divided the test images into two levels, general scenes and complicated scenes, according to the complexity of the image content. The sample results are shown in Figures 14–18.

**Figure 14.** Road extraction results in general scene images; (**a**) input image; (**b**) label image; (**c**) D-LinkNet; (**d**) S-DenseNet; (**e**) SD-DenseNet; (**f**) SDG-DenseNet.

**Figure 15.** *IoU* scores of the methods in Figure 14.

**Figure 16.** Road extraction results in complicated scene images (**a**) input image; (**b**) label image; (**c**) D-LinkNet; (**d**) S-DenseNet; (**e**) SD-DenseNet; (**f**) SDG-DenseNet.

**Figure 17.** *IoU* scores of the methods in Figure 16.

**Figure 18.** Four detailed areas of road extraction results in complicated scene images (**a**) area of label image; (**b**) D-LinkNet; (**c**) S-DenseNet; (**d**) SD-DenseNet; (**e**) SDG-DenseNet.

Figures 14 and 15 show the extraction results for general scene images. D-LinkNet denotes the network whose encoder is built on residual blocks, DenseNet denotes the network built on the Dense block, S-DenseNet the network built on the Dense block and stem block, SD-DenseNet the network that additionally adds dilated convolution, and SDG-DenseNet the network built on the stem block, D-Dense blocks, and the GIRM module. The extraction results of the D-LinkNet model contain some redundant information, and many isolated patches are left in the image, which affects the overall accuracy. The parking lot areas, which are similar to roads, were successfully identified as background; however, some roads were not completely extracted. SDG-DenseNet further improves the completeness of the roads, and the information extracted by the SDG-DenseNet network structure is more accurate.

Figure 15 shows the *IoU* scores for each image in each row of Figure 14. The proposed SDG-DenseNet achieves high *IoU* scores on all three optical images, which are 9.53%, 9.46%, and 8.18% higher than those of the baseline D-LinkNet, respectively.

Figure 16 shows three extraction results for complicated scene images from the 49 test images. Each road network includes roads of several different levels as well as flyover roads. These complex situations seriously affect the road extraction results of every network model. However, the SDG-DenseNet can better extract every road, including shadow-obscured roads.

Figure 17 shows the *IoU* scores for each image in each row of Figure 16. Similar to Figure 15, on the three optical images the proposed SDG-DenseNet achieves high *IoU* scores, which are 3.16%, 3.8%, and 1.18% higher than the baseline, respectively. At the same time, to assess the reliability of the results from a methodological point of view, note that the effects of different modules often differ from image to image; for example, the *IoU* of Line 1 in Figure 17 shows that the two improved methods perform less well on that optical image. However, it is worth mentioning that the comprehensive statistics show that the average *IoU* score of the optimized model on the test set (196 test images) is higher than that of the baseline.

Figure 18 shows some detailed areas of the first image in Figure 16. The different results on Area 1 and Area 2 imply that SDG-DenseNet segments more correctly: Area 1 shows the novel network's improvement in avoiding false segmentation, while Area 2 shows that SDG-DenseNet also performs well in the recall ratio of extraction. Besides, Area 3 and Area 4 show that the novel network also performs very well in maintaining the completeness of the road extraction results. In Figure 18, columns (1), (2), (3), and (4) correspond to Areas 1, 2, 3, and 4, respectively.

Figures 14–18 show the semantic segmentation results for some randomly selected images in the Massachusetts test set. To further demonstrate the effectiveness of the improved model on the test data, this paper reports the evaluation metrics of the segmentation results of the different models on the test set (196 test images). From the trained models and experiments, we obtain the evaluation metrics of D-LinkNet, DenseNet, S-DenseNet, SD-DenseNet, and SDG-DenseNet, as shown in Table 3. We found the *IoU* and *F*1 scores of the networks built on the Dense block or D-Dense block to be much higher than those of the network built with the residual block. Besides, the DenseNet-based model with the D-Dense block has higher *IoU* and *F*1 scores than that with the Dense block.


**Table 3.** Results of the Massachusetts Roads Dataset of different models. The bold font indicates the optimal value under the current evaluation metrics.

In other words, compared with D-LinkNet, the novel network can extract roads more correctly while maintaining good road completeness. Furthermore, comparing the stem block with the initial block, we find that the network with the stem block is much better in the correctness of road extraction; the stem block also improves the *IoU* and *F*1 scores. The experimental results show that the SDG-DenseNet obtains better *IoU* and *F*1 scores while performing well in the correctness of road extraction. It can also be seen from the table that the SDG-DenseNet balances road completeness and correctness better than the other networks, keeping both the *COR* and *COM* indices at a relatively high level and thus achieving higher *F*1 and *IoU* scores.

#### *4.2. Results on Massachusetts Roads Dataset of Different Methods*

To evaluate our method performance, we compare its *IoU* scores with Residual Unet [46], Joint-Net [48], Dual Path Morph-Unet [49], and DA-RoadNet [32] which have been used in road extraction from satellite images.

Table 4 shows the scores obtained by different methods on the Massachusetts Roads Dataset. The SDG-DenseNet has the highest *F*1 and *IoU*, which proves the excellent performance of SDG-DenseNet in road extraction. Besides, as shown in Table 4, our new network achieves a higher *COM* score than the other networks, while its *COR* score is not much lower than theirs. In other words, our network achieves a good balance between maintaining the completeness and the correctness of segmentation.


**Table 4.** Results of the Massachusetts Roads Dataset of different methods. The bold font indicates the optimal value under the current evaluation metrics.

('\*' represents the metrics not mentioned in the cited papers).

#### *4.3. Results of Low-Grade Roads on the Chongzhou–Wuzhen Dataset*

In order to fully uncover the characteristics of the low-grade road extraction task and the performance of different networks on this task, this paper tests low-grade road extraction on the Chongzhou–Wuzhen dataset. In the test process, according to whether the low-grade roads are blocked, the complexity of the low-grade road structure, and the complexity of the background scene, the extraction difficulty is divided into three cases: simple, general, and complicated.

Figures 19–21 show the extraction results of the four network models in simple, general, and complex scenes, respectively. Figure 19 shows the detection results in a simple scenario: here D-LinkNet performs best, especially for expressway detection; its integrity is higher and there are fewer false detections. Figure 20 shows the detection results in a general scenario; here D-LinkNet exhibits obvious road fractures and missed detections, while SDG-DenseNet performs best and, compared with the other networks, extracts the most complete roads. Figure 21 shows the detection results in complex scenes, where the networks show different degrees of missed and false detections: S-DenseNet shows the strongest ability to maintain integrity but has many false detection areas, while SDG-DenseNet has a certain degree of road fracture but few false detections.

**Figure 19.** Low-grade road extraction in simple scene images (**a**) input image; (**b**) label image; (**c**) D-LinkNet; (**d**) S-DenseNet; (**e**) SD-DenseNet; (**f**) SDG-DenseNet.

**Figure 20.** Low-grade road extraction in general scene images (**a**) input image; (**b**) label image; (**c**) D-LinkNet; (**d**) S-DenseNet; (**e**) SD-DenseNet; (**f**) SDG-DenseNet.

**Figure 21.** Low-grade road extraction in complicated scene images (**a**) input image; (**b**) label image; (**c**) D-LinkNet; (**d**) S-DenseNet; (**e**) SD-DenseNet; (**f**) SDG-DenseNet.

Table 5 shows the *IoU* scores of the different models on the Chongzhou–Wuzhen test set. The results show that SDG-DenseNet achieves the highest *IoU* score while its model size is much smaller than that of D-LinkNet, which proves that the SDG-DenseNet performs best on the low-grade road extraction task. S-DenseNet has the fewest parameters of the four networks, mainly due to the parameter reduction brought by the dense block.

**Table 5.** Results of the Chongzhou–Wuzhen test set on different models. The bold font indicates the optimal value under the current evaluation metrics.


#### *4.4. Extraction Results of Low-Grade Roads on Large-Scale Images of the Fusion Method*

In order to verify the feasibility and effect of the decision-level fusion method, and to test the overall effect of the process in an actual application scenario, we extracted roads from the optical images using SDG-DenseNet and from the SAR images using the Duda operator for the two large-scale images mentioned in Table 2, and then tested the effect of decision-level fusion.

In order to more intuitively reflect the effect of the fusion method, we compare the extracted roads with the roads in the label.

Figures 22 and 23 show the result of the decision-level fusion method tested on our own dataset in the Chongzhou area.

**Figure 22.** Tested data Area 1. Optical and SAR remote sensing images and road extraction results in the Chongzhou area. (**a**) Worldview-4 optical remote sensing image; (**b**) TerraSAR-X remote sensing image; (**c**) road extraction results of optical remote sensing image; (**d**) road extraction results from SAR remote sensing images; (**e**) road fusion extraction results of optical and SAR images; (**f**) ground truth and marking results of road fusion extraction results (green refers to the correctly extracted road, red refers to the incorrectly extracted road, and yellow refers to the omitted real road).

**Figure 23.** Tested data Area 1. Some details in optical and SAR remote sensing images and road extraction results in the Chongzhou area. (**a**) Worldview-4 optical remote sensing image; (**b**) road extraction results of optical remote sensing image; (**c**) road extraction results from the SAR remote sensing images; (**d**) road fusion extraction results of optical and SAR images; (**e**) ground truth and marking results of road fusion extraction results (green refers to the correctly extracted road, red refers to the incorrectly extracted road, and yellow refers to the omitted real road).

Figure 22 displays the extraction effect of the optical image, the SAR image, and decision-level fusion on low-grade roads in a practical application scenario. The roads extracted from the optical image using SDG-DenseNet are more complete and continuous than those extracted from the SAR image using the Duda and Path operators. However, the extraction results of the SAR images contain some information that is not found in the optical-image extraction results, such as some roads obscured by vegetation or buildings.

Figure 23 shows some details from Figure 22. As shown in the figures, by drawing on the extraction results of the SAR images, decision-level fusion fixes some road fractures and missed detections caused by vegetation or building occlusion in the optical images.

We also tested the decision-level fusion method in the Wuzhen area of the self-made dataset, as shown in Figures 24 and 25. Figure 24 shows the extraction results in large-size images in a practical application scenario, which are similar to the results in Chongzhou. The extraction results of the optical images are generally continuous, but there are still obvious road fractures and missed detections. After fusion with the SAR image, the partially occluded roads become continuous, and some roads that were missed in the optical image are detected.

Figure 25 shows some details of the detection results in Wuzhen. Figure 25, region A(b), shows that the extracted road breaks due to bridge interference in the optical image; as can be seen in region A(d), after decision-level fusion the broken extraction result is fixed. A similar situation occurs in region B due to vegetation occlusion, and the roads missed in the optical image are also repaired by the decision-level fusion method.

Table 6 shows the results for the two large-scale images mentioned in Table 2. We use manual interpretation annotation to evaluate and analyze the low-grade road extraction results; in other words, the matching degree between the extracted road network and the reference roads is evaluated through completeness, correctness, and accuracy. As shown in Figures 22–25 and Table 6, through the decision-level fusion of optical and SAR images, the *F*1-scores of road extraction can reach more than 0.85. The *F*1, *COM*, and *COR* scores are significantly higher than the results using only the optical-image extraction method.

**Figure 24.** Tested data Area 2. Optical and SAR remote sensing images and road extraction results in the Wuzhen area. (**a**) Worldview-2 optical remote sensing image; (**b**) TerraSAR-X remote sensing image; (**c**) road extraction results of optical remote sensing image; (**d**) road extraction results from SAR remote sensing images; (**e**) road fusion extraction results of optical and SAR images; (**f**) ground truth and marking results of road fusion extraction results (green refers to the correctly extracted road, red refers to the incorrectly extracted road, and yellow refers to the omitted real road).

**Figure 25.** Tested data Area 2. Some details in optical and SAR remote sensing images and road extraction results in the Wuzhen area. (**a**) Worldview-4 optical remote sensing image; (**b**) road extraction results of optical remote sensing image; (**c**) road extraction results from SAR remote sensing images; (**d**) road fusion extraction results of optical and SAR images; (**e**) ground truth and marking results of road fusion extraction results (green refers to the correctly extracted road, red refers to the incorrectly extracted road, and yellow refers to the omitted real road).


**Table 6.** The results of the two large-scale images for the whole area.

#### **5. Conclusions**

In this research, a D-Dense block module was proposed, which combines traditional convolution and dilated convolution based on a dense connection structure. Further, a new semantic segmentation network (SDG-DenseNet) was built with the D-Dense block, adopting the center part of D-LinkNet for road extraction from high-resolution satellite imagery. Since the network also replaces the initial block with the stem block to retain more detailed information, it is easier to recover the marginal information of objects in the decoder phase. In addition, the introduction of an attention mechanism improves the network's ability to capture global information. Besides, to improve the accuracy of road extraction from large-scale images in practical applications, a decision-level fusion method was proposed, which fuses the information in optical images and SAR images.

Three sets of satellite images were applied to evaluate the network. The extraction results on the Massachusetts Roads dataset show that SDG-DenseNet not only achieves the highest *IoU* and *F*1 scores but is also suitable for extracting roads in complicated scenes. Experiments showed that the *IoU* and *F*1 scores of SDG-DenseNet, based on the D-Dense block and GIRM modules, were 3.61% and 2.75% higher, respectively, than those of the baseline D-LinkNet. The stem block also helps improve road extraction accuracy. Furthermore, the Chongzhou–Wuzhen dataset, based on three large-scale optical images, was applied to evaluate the models' ability to extract low-grade roads. The results show that SDG-DenseNet performs best among the four networks, with an *IoU* score 6.65% higher than that of D-LinkNet, while its model size is reduced by about 600 MB compared with D-LinkNet. Further, two pairs of large-scale optical and SAR images were applied to evaluate the decision-level fusion method. The results show that the fusion method performs well in accurately extracting roads. After decision-level fusion of the road binary maps from SAR and optical images on the two tested datasets, *F*1 is improved by about 8.4–11.5%, *COR* by about 7.4–7.7%, and *COM* by about 9.3–13.7%.

SDG-DenseNet improves the dense block into the D-Dense block and combines it with an attention mechanism, which not only ensures road completeness in the segmentation task but also greatly improves the correctness of the segmentation results. Therefore, the network maintains a good balance between correctness and completeness. In addition, a decision-level fusion method was proposed to improve the extraction of low-grade roads, and the presentation quality is better after decision-level fusion. In future research, the contribution of each part of the network and of every hyperparameter in the training phase should be taken into consideration.

**Author Contributions:** Methodology, J.Z.; software, F.X.; validation, Y.S.; formal analysis, B.P.; investigation, J.Z.; resources, Y.L.; data curation, B.P.; writing—original draft preparation, J.Z.; writing—review and editing, S.L.; supervision, Y.L.; funding acquisition, L.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Key Projects of Global Change and Response of Ministry of Science and Technology of China under Grant 2020YFA0608203, in part by the Science and Technology Support Project of Sichuan Province under Grant 2021YFS0335, Grant 2020YFG0296, and Grant 2020YFS0338, in part by Fengyun Satellite Application Advance Plan under Grant FY-APP-2021.0304.

**Data Availability Statement:** The authors would like to thank the team of National Climate Center and University of Toronto for the data and experiments.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **A Lightweight Self-Supervised Representation Learning Algorithm for Scene Classification in Spaceborne SAR and Optical Images**

**Xiao Xiao, Changjian Li and Yinjie Lei \***

College of Electronics and Information Engineering, Sichuan University, Chengdu 610064, China; xiaoxiaox@stu.scu.edu.cn (X.X.); li\_changjian@stu.scu.edu.cn (C.L.) **\*** Correspondence: yinjie@scu.edu.cn

**Abstract:** Despite the increasing amount of spaceborne synthetic aperture radar (SAR) images and optical images, only a few annotated samples can be used directly for scene classification tasks based on convolutional neural networks (CNNs). In this situation, self-supervised learning methods can improve scene classification accuracy by learning representations from extensive unlabeled data. However, existing self-supervised scene classification algorithms are hard to deploy on satellites due to their high computation consumption. To address this challenge, we propose a simple, yet effective, self-supervised representation learning (Lite-SRL) algorithm for the scene classification task. First, we design a lightweight contrastive learning structure for Lite-SRL: we apply a stochastic augmentation strategy to obtain augmented views from unlabeled spaceborne images, and Lite-SRL maximizes the similarity of the augmented views to learn valuable representations. Then, we adopt the stop-gradient operation so that Lite-SRL's training process does not rely on large queues or negative samples, which reduces the computation consumption. Furthermore, in order to deploy Lite-SRL on low-power on-board computing platforms, we propose a distributed hybrid parallelism (DHP) framework and a computation workload balancing (CWB) module for Lite-SRL. Experiments on representative datasets, including the OpenSARUrban, WHU-SAR6, NWPU-Resisc45, and AID datasets, demonstrate that Lite-SRL can improve scene classification accuracy under limited annotated data and that it is generalizable to both SAR and optical images. Meanwhile, compared with six state-of-the-art self-supervised algorithms, Lite-SRL has clear advantages in overall accuracy, number of parameters, memory consumption, and training latency. Finally, to evaluate the proposed work's on-board operational capability, we transplant Lite-SRL to the low-power computing platform NVIDIA Jetson TX2.

**Keywords:** synthetic aperture radar; optical images; scene classification; on-board; lightweight self-supervised algorithm

#### **1. Introduction**

The remote sensing scene classification (RSSC) task aims to classify scene regions into different semantic categories [1–5] and plays an essential role in various Earth observation applications, e.g., land resource exploration, forest inventory, and urban-area monitoring [6–8]. In recent years, Landsat, Sentinel, and other missions have provided an increasing number of spaceborne images for the scene classification task, including synthetic aperture radar (SAR) images and optical images. With more available data, scene classification methods based on convolutional neural networks (CNNs) have undergone rapid growth [7,9].

However, the amount of annotated scene data available for supervised CNN training remains limited. Taking SAR data as an example, SAR images are affected by speckle noise due to the imaging mechanism, resulting in poor image quality [10,11]. In addition, the random fluctuation of pixels makes it difficult to distinguish between scene categories [12]. Therefore, the annotation of SAR images requires experienced experts and

**Citation:** Xiao, X.; Li, C.; Lei, Y. A Lightweight Self-Supervised Representation Learning Algorithm for Scene Classification in Spaceborne SAR and Optical Images. *Remote Sens.* **2022**, *14*, 2956. https://doi.org/ 10.3390/rs14132956

Academic Editors: Tianwen Zhang, Tianjiao Zeng and Xiaoling Zhang

Received: 24 May 2022 Accepted: 15 June 2022 Published: 21 June 2022



is a time-consuming task [13]. The same problem of high annotation costs exists for optical images. As a result, the total number of images in RSSC datasets, e.g., OpenSARUrban [14], WHU-SAR6 [11], NWPU-Resisc45 [3], and AID [15], is much smaller than in natural image datasets such as ImageNet [16]; the specific number of images in each dataset is shown in Figure A1. With limited annotated samples, CNNs tend to overfit after training [17], leading to poor generalization performance in the RSSC task. Therefore, exploring methods to reduce the RSSC task's reliance on annotated data is appealing.

Recently, self-supervised learning (SSL) has emerged as an attractive candidate for solving the problem of labeled data shortage [18]. SSL methods can learn valuable representations from unlabeled images by solving pretext tasks [19]; a network trained in a self-supervised fashion can be used as a pre-trained model to enable higher accuracy with fewer training samples [20]. To this end, an increasing number of RSSC studies have concentrated on SSL. In practice, remote sensing images (RSIs) differ significantly from natural images in the acquisition and transmission stages: RSIs suffer from noise and high transmission costs [21]. Performing self-supervised training on satellites can solve these issues; however, existing SSL algorithms are hard to deploy on satellites due to their high computation consumption. The method based on self-supervised instance discrimination [22] was first applied to the RSSC task; soon after, SSL algorithms represented by contrastive multiview coding [20] showed good performance in RSSC tasks. These methods rely on a large batch of negative samples, and the training process needs to maintain large queues, which consumes considerable computation resources. Other self-supervised methods utilize images of the same geographic regions from different times and introduce loss functions based on geographic coordinates with complex feature extraction modules [23], which also consume substantial resources during training. Therefore, we need to reduce the computation consumption during self-supervised training.

As mentioned above, we attempt to deploy a self-supervised algorithm on satellites. A lightweight network is necessary, while a practical on-board training approach can also provide support. Since it is impracticable to carry high-power GPUs on satellites, the current trend is to use edge devices, e.g., the NVIDIA Jetson TX2 [24], as on-board computing devices [25]. The latest radiation-characterized on-board computing modules, such as the S-A1760 Venus [26], utilize the TX2 inside the product to help spacecraft achieve high-performance AI computing. Accordingly, we also use the TX2 as the deployment platform. Under resource-limited scenarios (limited memory, e.g., a memory size of 8 GB, and limited computation resources, e.g., a bandwidth of 59.7 GB/s), distributed strategies are typically applied to train the network; thus, a flexible distributed training framework is required for on-board training. However, the approaches adopted by deep learning frameworks, e.g., PyTorch [27], TensorFlow [28], and Caffe [29], for distributed training remain primitive. Existing dedicated distributed training frameworks, such as Mesh-TensorFlow [30] and Nemesyst [31], are likewise unsuitable for on-board scenarios, because they fail to consider the case of limited on-board computation resources.

Based on the above observations, we need (i) a self-supervised learning algorithm that simultaneously guarantees accuracy and low computation consumption; and (ii) an effective distributed strategy for on-board self-supervised training deployment. To address these challenges, we propose a lightweight On-board Self-supervised Representation Learning (Lite-SRL) algorithm for the RSSC task. Lite-SRL uses a contrastive learning structure that contains lightweight modules, maximizing the similarity of RSIs' augmented views to capture distinguishable features from unlabeled images. The augmentation strategies used to obtain contrastive views differ slightly between SAR and optical images. Meanwhile, inspired by the self-supervised learning algorithms BYOL [32] and SimSiam [33], we use the stop-gradient operation so that the training process does not rely on a large batch size, queues, or negative sample pairs, which greatly reduces the computation workload while maintaining accuracy. Moreover, the structure of Lite-SRL is adapted to distributed training for deployment. Experiments on representative scene classification datasets including OpenSARUrban, WHU-SAR6, NWPU-Resisc45, and AID demonstrate that

Lite-SRL can improve the scene classification accuracy with limited annotated data and that Lite-SRL is generalizable to both SAR and optical images in the RSSC task. Meanwhile, comparisons with six state-of-the-art self-supervised algorithms demonstrate that Lite-SRL has clear advantages in overall accuracy, number of parameters, memory consumption, and training latency.

In order to deploy the Lite-SRL algorithm on the low-power computing platform Jetson TX2, we propose a distributed hybrid parallelism (DHP) training framework along with a generic computation workload balancing (CWB) module. Since a single TX2 node cannot complete the whole network training, CWB automatically partitions the network according to the workload balancing principle (see Algorithm 2 for details) and assigns each part to DHP to realize distributed hybrid parallelism training. The integration of CWB and DHP enables training neural networks under limited on-board resources. Eventually, we transplant the Lite-SRL algorithm to the on-board computing platform through the distributed training modules.

The main contributions of this article are as follows:


The remainder of this paper is organized as follows: Section 2 covers research related to this article. Section 3 presents the detailed research steps. Section 4 presents the experimental setups. In Section 5, detailed experimental results are presented and summarized. Section 6 provides detailed records of the deployment process. Section 7 provides conclusions. Appendix A lists all the abbreviations in this article and their corresponding full names.

#### **2. Related Works**

In this section, we provide a brief review of existing related works. We present solutions from related RSSC works under limited labeled samples, among which methods based on self-supervised contrastive learning show excellent results; we then present the development of self-supervised contrastive learning. We also review existing studies on distributed training.

#### *2.1. RSSC under Limited Annotated Samples*

Recently, self-supervised learning (SSL) has attracted considerable interest in the study of RSSC for solving the problem of labeled data shortage. SCL\_MLNet [34] introduced an end-to-end self-supervised contrastive learning-based metric network for the few-shot RSSC task. Li et al. [35] proposed the Meta-FSEO model to improve the performance of the few-shot RSSC task in varying urban scenes. These few-shot learning tasks validate that SSL enables RSSC models to achieve good generalization performance from only a few annotated samples. Meanwhile, studies [20,36] proved that using images from the same domain for SSL training in the RSSC task can help overcome classical transfer learning problems, which further demonstrates the effectiveness of using SSL as a pre-training process in RSSC. The authors of [20,36,37] explored

the effectiveness of several SSL networks in the RSSC task, among which the contrastive learning-based [22,23] SSL algorithms performed best. Moreover, Jung et al. [38] presented a self-supervised contrastive learning solution with smoothed representations for RSSC based on the SimCLR [22] framework. Zhao et al. [39] introduced a self-supervised contrastive learning algorithm to achieve hyperspectral image classification for problems with few labeled samples. The above works prove that self-supervised contrastive learning provides a great improvement for the RSSC task; thus, our work also adopts the self-supervised contrastive learning method for the RSSC task.

#### *2.2. Self-Supervised Contrastive Learning*

Through solving pretext tasks, self-supervised methods utilize unlabeled data to learn representations that can be transferred to downstream tasks. In self-supervised learning methods, relative position prediction [19,40], image inpainting [41], and instance-wise contrastive learning [22] are three common pretext tasks. As mentioned above, contrastive learning is superior to image inpainting and relative position prediction in RSSC tasks. Current state-of-the-art contrastive learning methods differ in their details. SimCLR [22] and MoCo [42] benefit from a large queue of negative samples. Based on the earlier version, MoCo-v2 [43] adds the same nonlinear layer as SimCLR to the encoder representation. MoCo-v2 and SimCLR perform well when maintaining a larger batch. SwAV [44] follows a clustering-based idea that combines clusters into contrastive learning networks; it computes assignments separately from the two augmented views to perform unsupervised clustering. The clustering-based approach likewise requires large queues or memory banks to supply sufficient samples for clustering. BYOL [32] is characterized by not requiring negative sample pairs and can thus eliminate the need to maintain a very large queue of negative samples. With no reliance on negative samples, BYOL is more robust to the choice of data augmentation methods. SimSiam [33] is similar to BYOL but has no momentum encoder; meanwhile, it directly shares weights between the two branches. SimSiam's experiments demonstrate that, without using negative sample pairs, large batches, or momentum encoders, contrastive learning structures can still learn valuable representations. We applied these methods to RSSC tasks and compared them systematically with our proposed algorithm.

#### *2.3. Distributed Training under Limited Resources*

Distributed training assigns the training process to multiple computing devices for collaborative execution [45]. Current mainstream distributed training methods can be divided into data parallelism [46,47], model parallelism [47,48], and hybrid parallelism [49,50]. In data parallelism, each node trains a replica of the model using different mini-batches of data. All nodes contain a complete copy of the model and compute gradients individually [47]; after training, the parameters of the final model are updated through the server. In model parallelism, the network layers are divided into multiple partitions and distributed over multiple nodes for parallel training [47,49]. During model-parallel training, each node holds different parameters, is responsible for the computation of different partition layers, and updates only the weights of its assigned partitions. Hybrid parallelism is a combination of data parallelism and model parallelism and is the development trend of distributed training. Mesh-TensorFlow [30] and Nemesyst [31] are two end-to-end hybrid parallel training frameworks, both using small independent batches of data for training. Based on Mesh-TensorFlow, Moreno-Alvarez et al. [51] proposed a static load balancing approach for the model parallelism scheme. Akintoye et al. [49] proposed a generalized hybrid parallelization approach to optimize partition allocation on available GPUs. The FlexFlow [47] framework applied a simulator to predict the optimal parallelization strategy in order to improve training efficiency on GPU clusters. However, the above distributed training frameworks fail to consider resource-limited scenarios; in addition, they do not perform computation workload balancing for the training process.
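The data-parallel scheme described above can be sketched in a few lines. In this hypothetical illustration (a simple linear model with an MSE loss, not code from any of the cited frameworks), each "node" computes the gradient of the same model replica on its own data shard, and a server-style step averages the gradients:

```python
import numpy as np

def local_gradient(w, x_batch, y_batch):
    # Each node holds a full model replica and computes the MSE gradient
    # on its own mini-batch of data.
    pred = x_batch @ w
    return x_batch.T @ (2 * (pred - y_batch)) / len(x_batch)

def data_parallel_step(w, shards, gamma=0.01):
    # The "server" averages the per-node gradients and updates the shared model.
    grads = [local_gradient(w, x, y) for x, y in shards]
    return w - gamma * sum(grads) / len(grads)
```

With equal-size shards, the averaged gradient equals the full-batch gradient, so data parallelism changes where the computation happens but not the update itself.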

#### **3. Methods**

#### *3.1. Overview of the Proposed Framework*

An overview of the proposed work is shown in Figure 1; our work consists of two main parts: (i) we propose a self-supervised algorithm, Lite-SRL, for the RSSC task, which simultaneously guarantees accuracy and low computation consumption; and (ii) we use a low-power computing platform for deployment and propose a set of distributed training modules to satisfy its requirements.

**Figure 1.** Overview of the proposed work. Lite-SRL: on-board self-supervised representation learning algorithm for RSSC task; CWB: computation workload balancing module; DHP: on-board distributed hybrid parallelism training framework.

To improve the scene classification accuracy with limited annotated data, Lite-SRL learns valuable representations from unlabeled RS images. During algorithm deployment, CWB automatically partitions the training process of Lite-SRL according to the workload balancing principle (see Algorithm 2 for details) and assigns each partition to the DHP training framework, achieving efficient on-board self-supervised training.

#### *3.2. Lite-SRL Self-Supervised Representation Learning Network*

#### 3.2.1. Network Structure

We propose an On-board Self-supervised Representation Learning (Lite-SRL) network for RSSC tasks. Since SimSiam [33] and BYOL [32] have excelled as effective self-supervised contrastive learning methods for many downstream tasks, we use a similar structure as the pretext task for self-supervised contrastive learning and make the training process less resource-intensive. Following SimSiam's experimental results, Lite-SRL directly maximizes the similarity of two augmented views of an image without using either negative pairs or momentum encoders; thus, the training process does not rely on large batches or queues. Lite-SRL adopts lightweight structures, as detailed in Figure 2, allowing us to achieve high accuracy with fewer parameters and less training resource usage.

**Figure 2.** Network structure.

The structure of Lite-SRL is shown in Figure 2, where two randomly augmented views *x<sup>a</sup>* and *x<sup>b</sup>* are obtained from the training batch {*x*<sub>1</sub>, *x*<sub>2</sub>, ··· , *x<sub>k</sub>*} as inputs, with the top and bottom paths sharing the parameters of the encoder. The two views are processed separately by encoder *E*, which consists of the backbone and the projection. The prediction module is denoted as *P*; it converts the output of one view after the encoder and matches it with the other view. The output vectors of *x<sup>a</sup>* and *x<sup>b</sup>* are expressed as *p<sup>a</sup>* = *P*(*E*(*x<sup>a</sup>*)) and *e<sup>b</sup>* = *E*(*x<sup>b</sup>*). The above procedure is then performed in reverse order for *x<sup>a</sup>* and *x<sup>b</sup>*, giving the output vectors *p<sup>b</sup>* and *e<sup>a</sup>*. The vectors' negative cosine similarity is expressed as follows:

$$N\left(p^a, e^b\right) = -\frac{p^a}{\| p^a \|\_2} \cdot \frac{e^b}{\| e^b \|\_2} \tag{1}$$

where $\| \cdot \|\_2$ is the *l*<sub>2</sub>-norm, $\| x \|\_2 = \sqrt{\sum\_{i=1}^{n} x\_i^2}$. The symmetrized loss is expressed as follows:

$$L\left(\mathbf{x}^{a},\mathbf{x}^{b}\right) = -\frac{1}{2} \frac{P(E(\mathbf{x}^{a}))}{\|\:P(E(\mathbf{x}^{a}))\:\|\_{2}} \cdot \frac{E\left(\mathbf{x}^{b}\right)}{\|\:E(\mathbf{x}^{b})\:\|\_{2}} - \frac{1}{2} \frac{P\left(E\left(\mathbf{x}^{b}\right)\right)}{\|\:P(E(\mathbf{x}^{b}))\:\|\_{2}} \cdot \frac{E(\mathbf{x}^{a})}{\|\:E(\mathbf{x}^{a})\:\|\_{2}}\tag{2}$$

Using Equation (1) to simplify the symmetrized loss calculation, Equation (2) reduces to the following equation:

$$L = \frac{1}{2}N\left(p^a, e^b\right) + \frac{1}{2}N\left(p^b, e^a\right) \tag{3}$$

The overall loss during training is the average over all images in the batch. The studies of SimSiam and BYOL demonstrated that the stop-gradient operation is the key to avoiding collapse during training. More importantly, the stop-gradient operation allows the training process to not rely on a large batch size, queues, or negative sample pairs, which greatly reduces the computation workload. We also use the Stop-Grad operation, as shown in Figure 2: for the branch that does not pass through *P*, we apply the stop-gradient operation when performing back-propagation, modifying Equation (1) as follows:

$$N\left(p^a, \text{stop\\_gradient}\left(e^b\right)\right) \tag{4}$$

which means that *e<sup>b</sup>* is considered a constant in this term. By adding the Stop-Grad operation, Equation (3) is realized as:

$$L = \frac{1}{2}N\left(p^a, \text{stop\\_gradient}\left(e^b\right)\right) + \frac{1}{2}N\left(p^b, \text{stop\\_gradient}\left(e^a\right)\right) \tag{5}$$

The encoder of *x<sup>b</sup>* in the first term of Equation (5) does not receive a gradient from *e<sup>b</sup>*; instead, it receives the gradient from *p<sup>b</sup>* in the second term, and the operation performed on the gradient of *x<sup>a</sup>* is the opposite of that for *x<sup>b</sup>*. After obtaining the contrastive loss, we use the stochastic gradient descent (SGD) optimizer to perform back-propagation and update the network parameters. The learning procedure is formally presented in Algorithm 1. The structures of the projection and prediction multi-layer perceptron (MLP) modules in Lite-SRL are shown in Figure 2. We use lightweight MLP modules: each fully connected layer in the projection MLP is followed by a batch normalization (BN) layer and a rectified linear unit (ReLU), and we incorporate two concatenate layers in the structure. The prediction MLP uses a bottleneck structure, as detailed in Figure 2. Neither BN nor ReLU is used in the last output layer; such a structure prevents training collapse [39,44].
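Equations (1)–(5) translate almost directly into code. The following is a minimal NumPy sketch of the symmetrized loss (the function names are illustrative; in an autograd framework, the stop-gradient of Equation (4) would be implemented by detaching the encoder output, e.g., `.detach()` in PyTorch):

```python
import numpy as np

def neg_cosine(p, e):
    # Equation (1): negative cosine similarity of l2-normalized vectors,
    # averaged over the batch. In an autograd framework, e would be
    # detached here to implement the stop-gradient of Equation (4).
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)
    e = e / np.linalg.norm(e, axis=-1, keepdims=True)
    return -np.mean(np.sum(p * e, axis=-1))

def symmetrized_loss(p_a, e_a, p_b, e_b):
    # Equation (5): each prediction is matched against the (constant)
    # encoder output of the other view.
    return 0.5 * neg_cosine(p_a, e_b) + 0.5 * neg_cosine(p_b, e_a)
```

When the two views produce identical vectors, the loss reaches its minimum of −1, i.e., perfect agreement between the views.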


Lite-SRL uses a simple, yet effective, network structure, which has significant advantages over existing self-supervised algorithms in (i) network parameters, (ii) memory consumption, and (iii) the average training latency. Detailed experimental results are shown in Section 5.1.

#### 3.2.2. Lite-SRL Network Partition

In order to deploy the algorithm on low-power computing platforms, the training process of Lite-SRL is adapted to a sequential structure, as shown in Figure 3a. For the two views *x<sup>a</sup>* and *x<sup>b</sup>* of an augmented image, we first concatenate them and send them to the encoder together, obtaining the combined outputs *e<sup>a</sup>* and *e<sup>b</sup>*; the values of *e<sup>a</sup>* and *e<sup>b</sup>* are kept, but their gradient information is not preserved. They are then sent to the prediction part to obtain *p<sup>a</sup>* and *p<sup>b</sup>*. The retained *e<sup>a</sup>* and *e<sup>b</sup>* are used when calculating the contrastive loss with *p<sup>a</sup>* and *p<sup>b</sup>*, treating *e<sup>a</sup>* and *e<sup>b</sup>* as constant values when applying the stop-gradient operation. This allows the two contrastive losses to be calculated simultaneously.

**Figure 3.** (**a**) We design the training process of Lite-SRL as a sequential structure to adapt model parallelization. (**b**) Schematic of the proposed distributed hybrid parallel (DHP) training baseline.
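The sequential restructuring of Figure 3a can be illustrated with a stand-in linear encoder and prediction head (all weights, shapes, and names below are hypothetical, chosen only to show the concatenate-encode-split flow):

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(32, 16))    # stand-in encoder weights (assumption)
W_pred = rng.normal(size=(16, 16))   # stand-in prediction weights (assumption)

def encoder(x):
    return x @ W_enc

def prediction(e):
    return e @ W_pred

# Two augmented views of a batch of 8 images.
xa = rng.normal(size=(8, 32))
xb = rng.normal(size=(8, 32))

# Concatenate the views and run the encoder once over the combined batch.
e = encoder(np.concatenate([xa, xb], axis=0))
# e is kept only as values (no gradient would be stored for it),
# matching the stop-gradient treatment when the loss is computed.
p = prediction(e)

# Split back into the per-view outputs used by the contrastive loss.
ea, eb = np.split(e, 2, axis=0)
pa, pb = np.split(p, 2, axis=0)
```

Because the encoder runs once on the concatenated batch, both contrastive loss terms can be evaluated in a single sequential pass.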

Each convolution layer within a CNN structure can be used as a single partition to achieve highly efficient model parallelism. Given a network *M* that consists of layers {*L*<sub>1</sub>, *L*<sub>2</sub>, ··· , *L<sub>q</sub>*}, we divide *M* into *n* partitions {*P*<sub>1</sub>, *P*<sub>2</sub>, ··· , *P<sub>n</sub>*}, where *P<sub>i</sub>* = {*L<sub>j</sub>*, *L*<sub>*j*+1</sub>, ··· , *L*<sub>*j*+*K*</sub>} denotes that partition *P<sub>i</sub>* starts from layer *L<sub>j</sub>* and contains *K* layers. The calculation of all partitions is sequential: partition *P<sub>i</sub>* transmits its output features to the next partition *P*<sub>*i*+1</sub>, while the gradient calculated by partition *P<sub>i</sub>* is transmitted to the preceding partition *P*<sub>*i*−1</sub>. At iteration *t*, during forward propagation, the input *A<sup>t</sup>*<sub>*L*<sub>*i*−1</sub></sub> is sent from partition *P*<sub>*i*−1</sub> to partition *P<sub>i</sub>*, which delivers the activation *A<sup>t</sup>*<sub>*L<sub>i</sub>*</sub>. Identically, during backward propagation of iteration *t*, *G<sup>t</sup>*<sub>*L*<sub>*i*+1</sub></sub> indicates the gradient calculated by partition *P*<sub>*i*+1</sub>. For each layer *L<sub>x</sub>* with *L<sub>i</sub>* ≤ *L<sub>x</sub>* ≤ *L<sub>q</sub>*, we denote the weight parameter of layer *L<sub>x</sub>* as *w<sub>x</sub>*; the gradient is given as:

$$G\_{w\_x}^{t-i} = \frac{\delta A\_{L\_x}^{t-i}}{\delta w\_x^{t-i-1}} \cdot G\_{L\_{x+1}}^{t-i} \tag{6}$$

Denoting the learning rate as *γ*<sub>*t*−*i*</sub>, the weight is updated using the gradient from Equation (6) as follows:

$$w\_x^{t-i} = w\_x^{t-i-1} - \gamma\_{t-i} \cdot G\_{w\_x}^{t-i} \tag{7}$$

For layers in non-sequential CNNs, parallel paths are not partitioned; instead, the parallel zone is treated as a single block. After network partitioning, the network can be trained in model parallelism mode.

In Figure 3b, features and gradients are transferred between devices. Take Device 1 and Device 2, for example: Device 1 is in charge of Partition 1's training and Device 2 is in charge of Partition 2's training. In iteration *t*, during forward propagation, the last layer of Partition 1 on Device 1 transmits the feature value *A<sup>t</sup>*<sub>*L*<sub>1</sub></sub> to the first layer of Partition 2 on Device 2. During backward propagation, Device 2 transmits the gradient value *G<sup>t</sup>*<sub>*L*<sub>2</sub></sub> to Device 1; the gradient of Partition 1's last layer is $\frac{\delta A\_{L\_1}^t}{\delta w\_1^{t-1}} \cdot G\_{L\_2}^t$, where $w\_1^{t-1}$ is the weight parameter of Partition 1's last layer obtained from iteration *t* − 1. Device 1 updates the weight parameters according to Equations (6) and (7): $w\_1^t = w\_1^{t-1} - \gamma\_t \cdot \frac{\delta A\_{L\_1}^t}{\delta w\_1^{t-1}} \cdot G\_{L\_2}^t$, where $\gamma\_t$ is the learning rate.
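This exchange can be shown with a toy two-device version in NumPy, with each partition reduced to a single linear layer; the MSE loss, shapes, and learning rate are assumptions made for the sketch, not the paper's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 8))    # Partition 1, held by "Device 1"
w2 = rng.normal(size=(8, 2))    # Partition 2, held by "Device 2"
gamma = 1e-3                    # learning rate (illustrative)

x = rng.normal(size=(16, 4))    # mini-batch input on Device 1
y = rng.normal(size=(16, 2))    # target for the illustrative MSE loss

# Forward: Device 1 computes its activation A^t_{L1} and "transmits" it.
a1 = x @ w1
out = a1 @ w2                   # Device 2 finishes the forward pass

# Backward: Device 2 computes its local gradients and sends G^t_{L2} back.
g_out = 2 * (out - y) / len(x)  # dLoss/dout for the MSE loss
g_w2 = a1.T @ g_out             # gradient of Partition 2's weights
g_l2 = g_out @ w2.T             # G^t_{L2}, transmitted back to Device 1

# Device 1 applies the chain rule locally, as in Equations (6) and (7).
g_w1 = x.T @ g_l2
w1 -= gamma * g_w1
w2 -= gamma * g_w2
```

Only the activation `a1` and the gradient `g_l2` cross the device boundary; each device updates only its own partition's weights.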

#### *3.3. Distributed Training Strategy*

We use a combination of six TX2 nodes and one high-speed switch to form the low-power computing platform, as shown in Figure 15.

With multiple nodes, different numbers of nodes can be flexibly scheduled to participate in training according to the computing requirements. We propose a distributed hybrid parallelism (DHP) training framework based on PyTorch; the schematic of DHP is shown in Figure 3b. The DHP framework uses the TCP communication protocol, and the data transmitted between nodes mainly include the output features of each layer in forward propagation, the gradient values obtained from each layer in backward propagation, and the layer parameters aggregated by each node after reaching the set number of iterations. Meanwhile, we propose a generic computation workload balancing module, CWB, which can perform model partitioning and workload balancing for a given network structure and working conditions. CWB is the core component that enables training CNNs under limited computing power. Furthermore, based on our DHP framework, we propose a dynamic chain system that can increase training speed without sacrificing training accuracy.

#### 3.3.1. Computation Workload Balancing Module

Under model parallelism, each node has different parameters and is responsible for the computation of different model layers, updating only the weights of the assigned layers. Setting appropriate partitioning points can improve the efficiency of distributed training. The TX2 uses a Jetson-series SoC, with the CPU and GPU sharing 8 GB of memory, and the memory requirements of the Lite-SRL training process exceed the computing capacity of a single TX2; thus, network partitioning and workload balancing are required.

We propose a generic Computation Workload Balancing module (CWB); it works as follows for a given network structure and a specified batch of input data, taking Lite-SRL as an example. Lite-SRL contains a total of $q$ network layers $\{L_1, L_2, \cdots, L_q\}$. CWB first collects the forward inference and backward propagation time of each layer running on the TX2, where the forward inference times are denoted as $\{T_{f1}, T_{f2}, \cdots, T_{fq}\}$ and the backward propagation times as $\{T_{b1}, T_{b2}, \cdots, T_{bq}\}$. CWB then calculates the memory occupied by the model parameters of each layer, $\{M_{w1}, M_{w2}, \cdots, M_{wq}\}$, and the memory occupied by the outputs of the intermediate layers, $\{M_{I1}, M_{I2}, \cdots, M_{Iq}\}$. CWB partitions Lite-SRL into $n$ partitions $\{P_1, P_2, \cdots, P_n\}$ and assigns them to $n$ TX2 devices $\{TX2_1, TX2_2, \cdots, TX2_n\}$, where $P_i = \{L_j, L_{j+1}, \cdots, L_{j+K}\}$. Between $P_i$ and $P_{i+1}$, that is, between $TX2_i$ and $TX2_{i+1}$, the feature data must be transmitted from layer $L_{j+K}$ to layer $L_{j+K+1}$, and the gradient values must be transmitted back during backward propagation. Recording the ratio of the file size to the transmission rate between TX2 devices as the theoretical transmission latency $T_t$, CWB calculates the transmission latencies $\{T_{t1}, T_{t2}, \cdots, T_{tq}\}$ for all candidate partition points.

The training time *Tall* for a mini-batch is:

$$T_{all} = \sum_{i=0}^{q} T_{fi} + \sum_{i=0}^{q} T_{bi} + T_{ta} + T_{tb} \tag{8}$$

The equation for calculating the equipment utilization index is as follows:

$$E = -\ln \frac{\sum \left(\frac{T^{n}_{tx2}}{T_{all}} - \frac{1}{n}\right)^2}{n} \tag{9}$$

The process of CWB searching for the best partition point is formally presented in Algorithm 2. After network partitioning, each partition is assigned to the DHP system for distributed training. The detailed implementation of CWB is recorded in Figure 13.

#### **Algorithm 2.** CWB search for the best partition point

Step 1: CWB performs memory workload balancing. Step 2: CWB performs time equalization.

1: Assign $\{M_{w1}, M_{w2}, \cdots, M_{wq}\}$ and $\{M_{I1}, M_{I2}, \cdots, M_{Iq}\}$ to $\{TX2_1, TX2_2, \cdots, TX2_n\}$

2: Assume 3 TX2s can satisfy the memory allocation; then 2 sets of candidate partition points that satisfy memory workload balancing are recorded as $[[a, a+1, \cdots], [b, b+1, \cdots]]$

3: **for** $a$ in $[a, a+1, \cdots]$ **do**

4: **for** $b$ in $[b, b+1, \cdots]$ **do**

5: Partition point 1 adopts $a$, partition point 2 adopts $b$

6: Denote the running time of $TX2_1$ as $T^{1}_{tx2} = \sum_{i=0}^{a} T_{fi} + \sum_{i=0}^{a} T_{bi}$

7: Denote the running time of $TX2_2$ as $T^{2}_{tx2} = \sum_{i=a}^{b} T_{fi} + \sum_{i=a}^{b} T_{bi}$

8: Denote the running time of $TX2_3$ as $T^{3}_{tx2} = \sum_{i=b}^{q} T_{fi} + \sum_{i=b}^{q} T_{bi}$

9: The training time $T_{all}$ for a mini-batch is $T_{all} = \sum_{i=0}^{q} T_{fi} + \sum_{i=0}^{q} T_{bi} + T_{ta} + T_{tb}$

10: With partition points $[a, b]$, the ratio of running time to waiting time of $TX2_n$ is $T^{n}_{tx2}/T_{all}$

11: Calculate the equipment utilization index $E$ using Equation (9)

12: **end for**

13: **end for**

14: The partition point combination with the highest $E$ is the best partition point
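The search in Algorithm 2 can be sketched in plain Python. The per-layer timings and candidate partition points below are hypothetical placeholders (the paper's measured values appear in Figure 13); partition boundaries follow Python slicing, so `a` and `b` split the layer list into three contiguous runs.

```python
import math

# Hypothetical per-layer forward/backward times (seconds) for q = 10 layers.
Tf = [0.4, 0.3, 0.5, 0.2, 0.6, 0.3, 0.4, 0.5, 0.3, 0.2]
Tb = [0.5, 0.4, 0.6, 0.3, 0.7, 0.4, 0.5, 0.6, 0.4, 0.3]
T_trans = 0.2  # assumed per-boundary transmission latency (Tta = Ttb)

def node_times(a, b):
    # Running times of the three TX2 nodes for partition points [a, b]
    t1 = sum(Tf[:a]) + sum(Tb[:a])
    t2 = sum(Tf[a:b]) + sum(Tb[a:b])
    t3 = sum(Tf[b:]) + sum(Tb[b:])
    return [t1, t2, t3]

def utilization(a, b):
    # Equation (9): E = -ln( sum( (T_n/T_all - 1/n)^2 ) / n )
    times = node_times(a, b)
    n = len(times)
    T_all = sum(Tf) + sum(Tb) + 2 * T_trans
    return -math.log(sum((t / T_all - 1 / n) ** 2 for t in times) / n)

# Candidate partition points assumed to satisfy memory workload balancing
cand_a, cand_b = [2, 3, 4], [6, 7, 8]
best = max(((a, b) for a in cand_a for b in cand_b),
           key=lambda ab: utilization(*ab))
```

A higher $E$ means the three nodes' runtimes deviate less from the ideal $1/n$ share, so `best` is the most balanced split among the candidates.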

#### 3.3.2. Dynamic Chain System

Figure 4a shows our hybrid parallel distributed training baseline schematic, where each node in the chain is fixedly linked to its preceding and succeeding nodes, and the later nodes in the chain have to wait for the earlier nodes to finish forward and backward propagation. Overlapping computation time with transmission time is a common method to improve efficiency in distributed training [52]. Our modified dynamic chain is shown in Figure 4b, where three nodes are responsible for the computation of partition 1, two for partition 2, and one for partition 3 of the model. We add a communication scheduler module to our distributed training framework, enabling the node that first completes its computation to search for available nodes in the next stage. Each mini-batch forms a dynamic chain that performs forward and backward propagation; after each node completes its current backward propagation, it automatically leaves the current chain and constructs a new chain with a waiting node. The dynamic chain system has higher training efficiency than the baseline, improving node utilization without reducing training accuracy. It also generalizes well to different training demands; we conducted additional experiments for different training computations, as detailed in Section 6.2.
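The scheduler's core idea can be sketched as a set of idle-node queues, one per partition: a finishing node links to the first idle node of the next partition, and a node that finishes its backward pass rejoins its queue. Node names and the 3/2/1 partition assignment mirror Figure 4b; everything else is illustrative.

```python
from collections import deque

# Idle nodes per partition: three on partition 1, two on 2, one on 3.
idle = {1: deque(["n1", "n2", "n3"]), 2: deque(["n4", "n5"]), 3: deque(["n6"])}

def acquire(partition):
    # Grab an idle node for this partition, or None if all are busy.
    q = idle[partition]
    return q.popleft() if q else None

def release(partition, node):
    # After backward propagation, the node leaves its chain and waits again.
    idle[partition].append(node)

# One mini-batch forms a dynamic chain across the three partitions.
chain = [acquire(p) for p in (1, 2, 3)]
release(3, chain[2])      # the partition-3 node finishes and becomes idle
next_node = acquire(3)    # a waiting chain can immediately link to it
```

A fixed chain would instead block until its hard-wired successor frees up, which is exactly the idleness the dynamic chain avoids.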

**Figure 4.** Illustration of our proposed distributed hybrid parallel training baseline and dynamic chain system. (**a**) Distributed hybrid parallel training baseline. (**b**) Dynamic chain system. In iteration 1, Devices 1, 4, and 6 form a computation chain, while Devices 3 and 5 are in a waiting state. During this time, Device 5 completes the forward computation from Device 2. At the end of iteration 1, Device 6 disconnects from Device 4, automatically links to Device 5, and immediately performs the third part of the training; Device 4 links to Device 3 and waits to link with Device 6. In iteration 3, Device 4 links to Device 6, and the remaining nodes also link to available nodes. Subsequent iterations follow the same procedure.

#### **4. Experimental Setups**

#### *4.1. Datasets Description*

For SAR images, we use the OpenSARUrban [14] dataset and the WHU-SAR6 [11] dataset for experiments. We use a small number of training samples to predict a large number of test samples; for both the OpenSARUrban and WHU-SAR6 datasets, the training proportions are set to 10% and 20%.


For optical images, we use the NWPU-RESISC45 [3] dataset and the Aerial Image Dataset (AID) [15]. The training proportions for the NWPU-RESISC45 dataset are 10% and 20%, which are more challenging since both require using a small number of training samples to predict labels for a large number of test samples. For the AID dataset, we set the training proportions to 10%, 20%, and 50%. Detailed information is shown in Table 1.



**Table 1.** Datasets description and training proportions.

<sup>1</sup> OpenSARUrban dataset has VH and VV polarizations, we used the VH data. <sup>2</sup> For WHU-SAR6 dataset, we cropped the images into small patches of 256 × 256 pixels to increase the dataset volume.

#### *4.2. Data Augmentation*

By performing random crop and resize on the target image, the network's receptive field can capture both global and local views, which is crucial for the RSSC task. We perform spatial transformations such as random crop, flip, rotate, and resize to enable the model to learn rotation and scaling invariance simultaneously. Further, we simulate temporal transformations with Gaussian blur, color jitter, and random grayscale. The augmentation strategies differ between SAR images and optical images. The detailed data augmentation results are shown in Figure 5.

**Figure 5.** Illustration of data augmentations.
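The spatial transforms above can be sketched without any imaging library by treating an image as a nested list of pixel values; a real pipeline would use a library such as torchvision, and the 4 × 4 toy image here is purely illustrative.

```python
import random

def random_crop(img, size, rng):
    # Pick a random top-left corner and slice out a size x size patch.
    h, w = len(img), len(img[0])
    top, left = rng.randrange(h - size + 1), rng.randrange(w - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]

def hflip(img):
    # Horizontal flip: reverse each row.
    return [row[::-1] for row in img]

def rot90(img):
    # Rotate 90 degrees counter-clockwise: transpose, then reverse the rows.
    return [list(row) for row in zip(*img)][::-1]

rng = random.Random(0)
img = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy "image"
aug = rot90(hflip(random_crop(img, 3, rng)))
```

Chaining crop, flip, and rotation like this is what exposes the encoder to both global and local views of the same scene.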

#### *4.3. Implementation Details*

The input images were resized to 224 × 224 and used the data augmentation settings shown in Figure 5. The batch size was set to 64, and all methods were trained for 400 epochs. For all competitive algorithms, we used ResNet-18 [53] as the backbone and removed the fully connected layer after the average pooling layer in ResNet-18. Since the loss functions and optimizers of the competitive methods differ, the experimental results are obtained under each method's respective optimal hyperparameter settings. All competitive algorithms were implemented using PyTorch 1.7 and Python 3.7. The proposed Lite-SRL method used an SGD optimizer with a momentum of 0.9 and a weight decay of 1 × 10<sup>−4</sup>; the initial learning rate was 0.05 and decreased following a cosine decay schedule.
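The cosine decay mentioned above can be written in a few lines. The base learning rate (0.05) and epoch count (400) are the paper's settings; the exact annealing variant is not given, so the standard cosine rule is assumed here.

```python
import math

def cosine_lr(epoch, base_lr=0.05, total_epochs=400):
    # Standard cosine annealing: lr = 0.5 * base_lr * (1 + cos(pi * t / T)),
    # starting at base_lr and decaying toward 0 by the final epoch.
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))

lrs = [cosine_lr(e) for e in range(400)]  # the full 400-epoch schedule
```

With PyTorch, `torch.optim.lr_scheduler.CosineAnnealingLR` implements the same rule on top of an `SGD` optimizer.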

The experimental section consists of two parts.


#### **5. Experimental Results**

The flowchart of the self-supervised learning experiments is shown in Figure 6. Encoders obtained by self-supervised training are used both as (i) frozen feature extractors (freeze experiment) and as (ii) initial models for fine-tuning (fine-tune experiment). For both experiments, we connected a linear classifier after the encoder and used an Adam optimizer with a batch size of 64, with the learning rate reduced in a cosine manner over 200 epochs.

**Figure 6.** The flowchart of self-supervised learning experiments.

#### *5.1. Guaranteed Accuracy with Less Computation*

We compare (i) overall accuracy, (ii) number of parameters, (iii) memory consumption, and (iv) average training latency with competitive self-supervised algorithms. The memory consumption during network training consists of two parts: the memory occupied by the model, including parameters, gradients, and optimizer momentum; and the memory occupied by the intermediate layers' outputs, including the inputs and outputs of each layer.

Considering that the on-board scenario is highly sensitive to computation workload, the algorithm is required to achieve higher accuracy and lower computation simultaneously. Experiments show that Lite-SRL achieves optimal classification accuracy with minimum computation. As shown in Figure 7, Lite-SRL shows the best accuracy in the RSSC task while having a clear advantage in terms of computation consumption. Thus, Lite-SRL provides a lightweight yet effective solution for on-board self-supervised representation learning.

**Figure 7.** Guaranteed accuracy with less computation. (**a**) Fine-tune and freeze experiment results on NWPU-45 dataset with training proportion of 20%, the horizontal axis compares the number of parameters. (**b**) Freeze experiment results on NWPU-45 dataset with training proportion of 20%; horizontal axis compares the training time consumption per iteration, and the diameter of the bubble is proportional to the memory consumption during network training.

#### *5.2. Self-Supervised Representation Extractor*

In the freeze training experiment, we use the encoders obtained from each method as feature extractors to evaluate their performance in scene classification. To visualize the effectiveness of Lite-SRL, we fed the test set images to the pre-trained model learned from Lite-SRL and applied t-SNE [54] to map the output features to a 2-dimensional space. As shown in Figure 8, features from different classes are well separated by our self-supervised method, with significantly better results than the ImageNet supervised pre-trained model. This demonstrates that by utilizing unlabeled RSI data, our proposed representation learning strategy enables the model to produce a valuable feature representation for the downstream RSSC task.

In Figure 8c, we marked the samples from the OpenSARUrban dataset that Lite-SRL failed to distinguish. We found that these SAR samples contain confusable features; for instance, the six different scene categories in Figure 8c all contain a river flowing through the city. Since we do not use any labels during self-supervised learning, Lite-SRL may extract the wrong features for these confusing scene images.

Table 2 shows the results of freeze training. Experimental results show that in the RSSC task, these self-supervised models achieve better results than supervised models pre-trained on ImageNet, despite the fact that the datasets used for self-supervised pre-training are much smaller than the ImageNet dataset (the OpenSARUrban dataset has 16,670 images, the WHU-SAR6 dataset has 17,590 images, and the NWPU-RESISC45 dataset has a total of 31,500 images, whereas the ImageNet pre-trained model used approximately 1.5 million images). Lite-SRL achieved the highest classification accuracy, and still higher accuracy can be achieved with fine-tuning, as detailed in Table 3.

#### *5.3. Improving the Scene Classification Accuracy with Limited Annotated Data*

The proposed self-supervised learning method can solve the problem of annotated data shortage in scene classification task, as high accuracy is achieved in the test set using a small number of training samples.

The fine-tune results of the competitive self-supervised methods are shown in Table 3. Note that due to the differences among these methods, our experiments record the best results of each method under different learning rates. All of the self-supervised methods showed significant improvements over randomly initialized models, and at the same time, all of them outperformed the models pre-trained in a supervised manner on ImageNet. In the 10% training proportion experiments, we used a small number of training samples to predict a large number of test samples. Even so, we achieved high classification accuracy with a simple classification network structure by using the self-supervised pre-trained model as the starting point for fine-tuning, proving the effectiveness of the proposed self-supervised learning method.

**Figure 8.** The t-SNE visualization of feature distributions on different datasets. (**a**) Lite-SRL model on WHU-SAR6 dataset; (**b**) fine-tuned Lite-SRL model on WHU-SAR6 dataset; (**c**) Lite-SRL model on OpenSARUrban dataset; (**d**) fine-tuned Lite-SRL model on OpenSARUrban dataset; for the SAR datasets, due to the imaging mechanism, we did not use an ImageNet pre-trained model. (**e**) ImageNet pre-trained model on NWPU-45 dataset; (**f**) Lite-SRL model on NWPU-45 dataset; (**g**) fine-tuned Lite-SRL model on NWPU-45 dataset; (**h**) ImageNet pre-trained model on AID dataset; (**i**) Lite-SRL model on AID dataset; (**j**) fine-tuned Lite-SRL model on AID dataset.


**Table 2.** Results of freeze experiment in terms of overall accuracy (%).

<sup>1</sup> The ImageNet is the encoder obtained by supervised pre-training on ImageNet dataset.



Note that our method exhibited higher accuracy with a small training batch; with a large training batch, methods such as SimCLR, MoCo-v2, and SwAV, which need to maintain large queues or many negative sample pairs, would show improved accuracy.

In Table 4 we illustrate the classification performance of some state-of-the-art methods.

**Table 4.** Compare with some SOTA methods, in terms of overall accuracy (%).


These include multi-granularity canonical appearance pooling (MG-CAP) [57], recurrent transformer networks (RTN) [56], and MTL [58], which uses a self-supervised approach. Our Lite-SRL produces an accuracy close to that of ResNet-101 while using ResNet-18 as the encoder. Furthermore, we set up experiments using ResNet-101 as the encoder in Lite-SRL and achieved a top accuracy of 94.43%, which is on par with the ResNet-101+MTL [58] approach, representing state-of-the-art performance. The promising performance of Lite-SRL further validates the effectiveness of self-supervised learning in the RSSC task.

#### *5.4. Confusion Matrix Analysis*

As can be seen from the OpenSARUrban (20%) confusion matrix shown in Figure 9a, the accuracy on the entire test set is 85.43%. The High Building category showed the lowest recognition accuracy, with 6.2% of its samples incorrectly identified as Single Building. Urban building categories, including Gen.Res, High Building, Single Building, and Denselow, showed high misclassification rates, as these urban functional areas have similar characteristics. Due to class imbalance, Railway had only 20 test samples, three of which were classified incorrectly.

**Figure 9.** Confusion matrix of fine-tuned results: (**a**) on OpenSARUrban with 20% training proportion; (**b**) on WHU-SAR6 with 20% training proportion.

Figure 9b shows the confusion matrix of fine-tuned results on WHU-SAR6 (20%), the accuracy of the entire test set is 95.83%, with four of the six categories achieving 95% or higher accuracy. Lake and Bridge are the two classes with the highest confusion rates because these two categories both contain water areas.

As can be seen from the NWPU-45 (20%) confusion matrix shown in Figure 10, the accuracy on the entire test set is 93.51%, with 38 of the 45 categories achieving 90% or higher accuracy. Churches and palaces are the two classes with the highest confusion rates because the buildings in these two groups have similar distributions and appearances.

Figure 11 shows the confusion matrix of fine-tuned results on the AID (50%) dataset. With 26 of the 30 categories reaching 90% or higher accuracy and 23 categories achieving higher than 95%, the accuracy of the entire test set is 95.78%. Resort and park, and center and square are the categories with the highest confusion rate because the images of resorts and parks have a similar distribution of greenery, while center and square are urban scenes with similar characteristics.

**Figure 10.** Confusion matrix of fine-tuned results on NWPU-45 20% training proportion.

**Figure 11.** Confusion matrix of fine-tuned results on AID 50% training proportion.

#### **6. Deployment of Lite-SRL**

We applied the Lite-SRL self-supervised algorithm to the proposed DHP distributed training system. The flowchart of deployment is shown in Figure 12.

**Figure 12.** The flowchart of Lite-SRL's deployment corresponds to the content in the following. N-Layers corresponds to Figure 13a; Statistic the Memory Usage corresponds to Figure 13b; Statistic the Time Consumption corresponds to Figure 13c; Candidate Partition Points corresponds to Figure 14a; Best Partition Point corresponds to Figure 14b.

**Figure 13.** Data collected by CWB. (**a**) Partitionable layers contained in the Lite-SRL network structure, corresponding to 28 partitionable points {*p*1, *p*2, ··· *p*28}. (**b**) CWB calculated the memory workload occupied by each network layer during the training process, including the intermediate variables and network parameters for each layer. (**c**) CWB measured time consumption, including inference latency and backward propagation latency of each layer when trained on TX2, together with the data transmission latency between TX2. The transmission latency was derived from the gradient data size between two layers and the inter-device transfer rate.

**Figure 14.** CWB calculates the optimal partition points. Two sets of candidate partition points are {*p*2, *p*3, *p*4, *p*5} and {*p*7, *p*8, *p*9, *p*10}, the rest of the partition points have been screened out as they cannot satisfy the memory allocation requirements. (**a**) Runtime proportion of each node under candidate partition points. (**b**) Using Equation (9) to calculate equipment utilization evaluation indices under candidate partition points.

#### *6.1. Computation Workload Balancing*

As shown in Figure 13a, CWB identified all partitionable points over the given Lite-SRL network structure. CWB requires the following training setup information to calculate the figures needed for workload balancing: (i) the training batch size, (ii) the type of optimizer being used, and (iii) the data exchange rate between TX2 devices. In the experiment, the batch size was set to 64 and the SGD optimizer with momentum was used for training. The rate of data transmission between TX2 devices was simulated by the Linux traffic control tool. According to the partition points and the above setup information, CWB collected statistics on the memory usage and time consumption of each layer when training on the TX2.

In the experiments we uniformly use the float32 data type, in which each value occupies 4 bytes of memory. CWB first performed memory workload balancing and computed the candidate partition points using the data shown in Figure 13b; the theoretical memory usage for all intermediate values during training is 7599.7 MB. Based on experience, each TX2 achieves a preferable working state when allocated about 3 GB of memory for computation, so 3 TX2s are required to collaborate on the training of one mini-batch. The two sets of candidate partition points calculated by CWB are {*p*2, *p*3, *p*4, *p*5} and {*p*7, *p*8, *p*9, *p*10}, ensuring that the training of each partition can be carried out on a single TX2; the rest of the partition points were screened out. CWB then performed time equalization utilizing the data shown in Figure 13c, accumulating the forward and backward latency of individual layers under different candidate partition points to obtain the running time of each TX2 node during a training batch.
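The float32 accounting above is simple to reproduce: every tensor element costs 4 bytes, so a layer's memory footprint is the product of its output dimensions times 4. The feature-map shape below is a hypothetical example, not one of Lite-SRL's actual layer outputs.

```python
BYTES_PER_FLOAT32 = 4  # each float32 value occupies 4 bytes

def tensor_mb(shape):
    # Memory footprint in MB of a float32 tensor with the given shape.
    n = 1
    for dim in shape:
        n *= dim
    return n * BYTES_PER_FLOAT32 / (1024 ** 2)

# e.g., a batch of 64 feature maps with 64 channels at 112 x 112 resolution
mb = tensor_mb((64, 64, 112, 112))
```

Summing such per-layer footprints (outputs plus parameters, gradients, and optimizer state) is how CWB arrives at the total intermediate memory it must split across TX2 nodes.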

As shown in Figure 14, CWB used the equipment utilization index to find the optimal partition points among the candidates. Figure 14a shows the runtime proportion of each node under the candidate partition points; the transmission latency varies with different combinations of partition points. In Figure 14b, CWB found the partition point combination with the highest equipment utilization, [*p*3, *p*8], representing the optimal partition points. For the given Lite-SRL network structure and training settings, CWB partitioned it into the following three parts: partition 1 {*p*1, *p*2}, partition 2 {*p*3, *p*4, *p*5, *p*6, *p*7}, and partition 3 {*p*8, ··· *p*28}.

#### *6.2. Distributed Training with Higher Efficiency*

We used six TX2 nodes to compose an on-board computation platform and tested the proposed distributed training baseline along with the improved dynamic chain system. In the baseline experiments, the six nodes formed two-chain hybrid parallel training according to Figure 4a. After workload balancing, we allocated the training of the three network partitions to the TX2 nodes on each chain. In the dynamic chain system experiments, three nodes were responsible for the computation of partition 1, two for partition 2, and one for partition 3, as shown in Figure 4b. Each of the two distributed training methods was run for 1000 iterations; the distributed training system performed parameter aggregation every 100 iterations and updated the model parameters in each node using the aggregated parameters. Here, one iteration refers to the completion of one mini-batch's forward and backward propagation.

As shown in Tables 5 and 6, the total runtime of 1000 iterations in the baseline is 3572 s, while in the dynamic chain system it is 2750 s.


**Table 5.** Distributed training baseline.

**Table 6.** Improved dynamic chain system.


The baseline used 6 nodes to complete 1000 iterations of training in *3572 s*, with the 2 chains each running for 500 iterations. In comparison, the dynamic chain system used the same nodes to complete 1000 iterations of training in *2750 s*. With the scheduling of the communication module, the system can be viewed as containing 3 chains; the nodes responsible for the first and second partitions end up with different numbers of running iterations, and node 6 completes all 1000 iterations. The dynamic chain system improved training efficiency by 23.01% over the baseline without compromising training accuracy.

The experimental platform is shown in Figure 15. Our distributed system consists of six nodes, allowing flexible distributed training: more chains can be constituted when the computation demand is low, and more nodes can be invoked to join the training when the computation demand is high. Training the proposed Lite-SRL with ResNet-18 as the backbone can be completed using 3 TX2 nodes; the baseline and dynamic chain configurations are shown in Figure 15a. When more complex networks need to be trained, more nodes can be invoked to join the training, which manifests the advantage of distributed multi-node systems. To this end, we conducted additional experiments using the Lite-SRL algorithm with more complex backbone structures, as detailed in the following.

**Figure 15.** The left side is the illustration of baseline and the right side is the illustration of dynamic chain system. (**a**) Lite-SRL with ResNet-18 as encoder; three nodes are required to complete the training of a mini-batch, baseline uses six nodes to form two chains, and dynamic can form three chains. (**b**) Lite-SRL with ResNet-34 as encoder; four nodes are required to complete the training of a mini-batch. Baseline forms one chain with two nodes idle, while the dynamic chain system can schedule all nodes for training. (**c**) Lite-SRL with ResNet-50 as encoder, five nodes are required to complete the training of a mini-batch. Baseline forms one chain with 1 node idle, while the dynamic chain system can schedule all nodes for training.

The time consumption of DHP under different training computations is shown in Table 7. The dynamic chain system avoids node idleness and thus improves training efficiency. Furthermore, this experiment demonstrates the potential of our distributed training system, which can be applied to a wider range of neural network training tasks.



<sup>1</sup> For ResNet50 the training batch size is 32, the rest of training settings remain unchanged. <sup>2</sup> To compare the accuracy, we use the NWPU-45 dataset and the accuracy test method is the same as the freeze experiment above.

#### **7. Conclusions**

In this article, we propose a self-supervised algorithm, Lite-SRL, for the scene classification task. Our algorithm has clear advantages in terms of overall accuracy, number of parameters, memory consumption, and training latency. We demonstrate that self-supervised algorithms can effectively alleviate the shortage of labeled remote sensing data. Taking the experimental results on the NWPU-45 dataset as an example, with training proportions of 10% and 20%, which require few labeled data to predict a large number of test samples, we achieve 92.77% and 93.51% accuracy with a simple network structure after self-supervised pre-training. Previous RSSC studies usually require more complex structures and multiple tricks to achieve such classification accuracies. Meanwhile, our algorithm performs far better than other methods under the 10% training proportion, proving that Lite-SRL's self-supervised training provides an effective feature extractor.

We exploit the advantage of self-supervised learning by training on satellites. The integration of CWB and DHP enables training neural networks under limited on-board resources. In addition, we add a communication scheduler module to the DHP framework to improve the training speed on top of the baseline. On the experimental computing platform, we successfully transplant Lite-SRL and verify the effectiveness of proposed on-board distributed training modules.

We believe that on-board self-supervised distributed training can facilitate the development of on-board data processing techniques. Not only the RSSC task but also other remote sensing tasks, such as remote sensing image segmentation [59] and target detection [10], can utilize this working paradigm. Our proposed distributed training modules provide strong adaptability; other types of deep learning algorithms can also be deployed in the distributed training framework, making it possible to enhance the intelligence of remote sensing applications.

The next step of our work will be as follows:


**Author Contributions:** Conceptualization, X.X. and Y.L.; methodology, X.X. and C.L.; software, X.X. and C.L.; investigation, X.X.; resources, X.X.; data curation, X.X. and C.L.; writing—original draft preparation, X.X. and Y.L.; writing—review and editing, X.X. and Y.L.; visualization, X.X.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** No new data were created or analyzed in this study. Data sharing is not applicable to this article.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

As shown in Table A1, we list all the abbreviations and their corresponding full names in this article.


**Table A1.** The abbreviations and corresponding full names, organized in alphabetical order.

#### **References**


## *Article* **Triangle Distance IoU Loss, Attention-Weighted Feature Pyramid Network, and Rotated-SARShip Dataset for Arbitrary-Oriented SAR Ship Detection**

**Zhijing Xu 1, Rui Gao 1,\*, Kan Huang <sup>1</sup> and Qihui Xu <sup>2</sup>**


**\*** Correspondence: 202030310004@stu.shmtu.edu.cn; Tel.: +86-198-2173-5586

**Abstract:** In synthetic aperture radar (SAR) images, ship targets are characterized by varying scales, large aspect ratios, dense arrangements, and arbitrary orientations. Current horizontal and rotation detectors fail to accurately recognize and locate ships due to the limitations of loss function, network structure, and training data. To overcome these challenges, we propose a unified framework combining triangle distance IoU loss (TDIoU loss), an attention-weighted feature pyramid network (AW-FPN), and a Rotated-SARShip dataset (RSSD) for arbitrary-oriented SAR ship detection. First, we propose a TDIoU loss as an effective solution to the loss-metric inconsistency and boundary discontinuity in rotated bounding box regression. Unlike recently released approximate rotational IoU losses, we derive a differentiable rotational IoU algorithm to enable back-propagation of the IoU loss layer, and we design a novel penalty term based on triangle distance to generate a more precise bounding box while accelerating convergence. Second, considering the shortcomings of feature fusion networks in connection pathways and fusion methods, AW-FPN combines multiple skip-scale connections and an attention-weighted feature fusion (AWF) mechanism, enabling high-quality semantic interactions and soft feature selections between features of different resolutions and scales. Finally, to address the limitations of existing SAR ship datasets, such as insufficient samples, small image sizes, and improper annotations, we construct a challenging RSSD to facilitate research on rotated ship detection in complex SAR scenes. As a plug-and-play scheme, our TDIoU loss and AW-FPN can be easily embedded into existing rotation detectors with stable performance improvements. Experiments show that our approach achieves 89.18% and 95.16% AP on two SAR image datasets, RSSD and SSDD, respectively, and 90.71% AP on the aerial image dataset, HRSC2016, significantly outperforming the state-of-the-art methods.

**Keywords:** synthetic aperture radar (SAR) image; arbitrary-oriented ship detection; differentiable rotational IoU algorithm; triangle distance IoU loss; attention-weighted feature pyramid network; multiple skip-scale connections; attention-weighted feature fusion; Rotated-SARShip dataset (RSSD)

#### **1. Introduction**

As an active microwave sensor, synthetic aperture radar (SAR) enables all-day, all-weather, and long-distance space-to-Earth observation without being limited by light and climate conditions [1]. With the development of spaceborne SAR high-resolution imaging technology, ship detection in SAR images has become a current research hotspot [2–8].

In recent years, with the breakthrough of convolutional neural networks (CNNs) [9] in computer vision, CNN-based methods have been introduced into SAR ship detection [10–15]. Though these works have promoted the development of this field to some extent, most of them simply apply the horizontal bounding box (HBB)-based methods used in natural scenes to SAR scenes, which still encounter severe challenges, stated as follows:

**Citation:** Xu, Z.; Gao, R.; Huang, K.; Xu, Q. Triangle Distance IoU Loss, Attention-Weighted Feature Pyramid Network, and Rotated-SARShip Dataset for Arbitrary-Oriented SAR Ship Detection. *Remote Sens.* **2022**, *14*, 4676. https://doi.org/10.3390/ rs14184676

Academic Editor: Domenico Velotto

Received: 10 August 2022 Accepted: 13 September 2022 Published: 19 September 2022


**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).


**Figure 1.** Densely arranged ships in complex inshore scenes. Here, (**a**,**b**) show the detecting of ship targets using the HBB-based RetinaNet [17]; (**c**,**d**) show the detecting of ship targets using the OBB-based RetinaNet with the proposed TDIoU loss and AW-FPN. The red and green boxes denote the detection results.

To eliminate the defects of HBB-based methods in detecting ships in SAR scenes, oriented bounding box (OBB)-based methods have emerged [18–22]. As shown in Figure 1c,d, OBBs can effectively avoid overlap and attenuate the influence of background clutter, enabling more precise prediction of the location and orientation of ships.

However, OBB-based methods still have the following limitations in SAR scenes:


To overcome these bottlenecks, we propose a unified framework for rotated SAR ship detection. First, inspired by IoU-based losses in horizontal detection, we develop a triangle distance IoU loss (TDIoU loss) and implement its forward and backward processes to ensure trainability. Thanks to its well-designed penalty term, TDIoU loss not only solves the problems caused by angle regression but also dramatically improves convergence speed and simplifies computation. Second, to enable more effective multi-scale feature fusion for detecting ships with large aspect ratios and varying scales in complex SAR scenes, an attention-weighted feature pyramid network (AW-FPN) combining multiple skip-scale connections and the attention-weighted feature fusion (AWF) mechanism is proposed.

Finally, to promote further research in this field, a novel dataset, the rotated-SARShip dataset (RSSD), is released to provide a challenging benchmark for arbitrary-oriented ship detection in SAR images. Extensive experiments and visual analysis on three datasets prove that our approach achieves better detection accuracy than other advanced methods. To sum up, the main contributions of this paper are summarized as follows:


The rest of the paper is organized as follows: Section 2 reviews related works. Section 3 describes the problems in angle regression and conventional IoU-based losses. Section 4 introduces the proposed TDIoU loss and the AW-FPN for rotated SAR ship detection. Section 5 presents details of the proposed RSSD. Extensive experiments and comprehensive discussions are provided in Section 6. Section 7 summarizes the whole work.

#### **2. Related Work**

In this section, we first review CNN-based SAR ship detection methods, then discuss the related works dealing with the problems caused by angle regression and multi-scale feature fusion, and finally analyze several existing publicly available SAR ship datasets.

#### *2.1. SAR Ship Detection Methods Based on Convolutional Neural Networks*

In the field of object detection, convolutional neural networks have become the mainstream algorithm. In recent years, CNN-based methods have made significant progress in SAR ship detection. As a pioneering work, Li et al. [10] discussed the defects of Faster R-CNN [34] in SAR ship detection and proposed an improved framework based on feature fusion and hard negative mining. Zhang et al. [11] proposed a novel concept of balance learning (BL) for high-quality SAR ship detection. Zhang et al. [12] proposed a grid convolutional network with depthwise separable convolution that accelerates ship detection by gridding the input image. To enhance the detailed features of ships, Liang et al. [13] proposed a visual attention mechanism. Furthermore, the means dichotomy method and speed block kernel density estimation method were used for adaptive hierarchical ship detection. Gao et al. [14] achieved better ship detection accuracy by using the anchor-free CenterNet [35] based on an attention mechanism and feature reuse strategy. Zhang et al. [15] designed a quad feature pyramid network consisting of four unique FPNs and verified its effectiveness on five SAR datasets.

However, the above methods fail to take into account the large aspect ratio and multi-angle characteristics of ships, leading to missed and false detections. Therefore, in recent years, there has been some research on rotated ship detection. For instance, Wang et al. [18] added angle regression and a semantic aggregation method to SSD. An attention module was used to adaptively select meaningful features of ships. Chen et al. [19] presented a feature-guided alignment module and a lightweight non-local attention module to balance the detection accuracy and inference speed of single-stage rotation detectors. Pan et al. [16] constructed a multi-stage rotational region-based network that generates rotated anchors through a rotation-angle-dependent strategy. To reduce the false alarm rate, Yang et al. [20] devised a novel loss to balance the loss contribution of various negative samples. To enhance the detection of small ships, An et al. [21] proposed an anchor-free rotation detector with a flexible frame. Sun et al. [22] applied a bi-directional feature fusion module and an angle classification technique to a YOLO-based rotated ship detector.

#### *2.2. Loss-Metric Inconsistency and Angular Boundary Discontinuity*

To eliminate the gap between the bounding box regression loss and the evaluation metric, IoU-based losses have been introduced in horizontal detectors [36–40]. Unfortunately, they cannot be simply applied to rotation detection, as the general rotational IoU algorithm is non-differentiable for back-propagation. In addition, unlike other bounding box parameters, the angle parameter is periodic in nature, which will lead to a surge in loss value at the boundary of the angle definition range when using *ln*-norm losses.

Some studies have attempted to address part of the above issues from two perspectives. One idea is to design differentiable approximate IoU losses for angle regression. To control the loss value by the amplitude of IoU, Yang et al. [41] added an extra IoU factor into the smooth L1 loss. Furthermore, PIoU [42] estimated the intersection area of two rotated bounding boxes by roughly counting the number of pixels. Aiming to address the uncertainty of convex shapes, Zheng et al. [43] presented an affine transformation to estimate the intersection area. The GWD [23] converted the oriented bounding box to a two-dimensional Gaussian distribution, using the Gaussian–Wasserstein distance to approximate the rotational IoU loss. Although these improved regression losses alleviate the problems to some extent, their gradient directions are still not dominated by IoU, and they cannot accurately guide training.

Another idea is to treat the angle prediction as a discrete classification task so as to properly constrain the prediction results. Yang et al. [44] developed a circular smooth label (CSL) technique that directly uses the angle parameter as the category label to tackle the periodicity of the angle and improve the tolerance of adjacent angles. The DCL [45] analyzed the problems of over-thick prediction heads in sparse coded labels and converted the angle categories into dense codes, such as the binary codes and gray codes, to further improve the detection efficiency. Although angle classification techniques avoid angular boundary discontinuity, they are still limited by angular discretization granularity, which inevitably leads to theoretical errors in high-precision angle prediction.

As of now, no full-fledged method exists to address all the above issues. In a sense, the proposed differentiable rotational IoU algorithm opens up the possibility of using the IoU-based loss for rotated bounding box regression, and the newly designed TDIoU loss fundamentally eliminates all these problems in an ingenious manner.

#### *2.3. Multi-Scale Feature Fusion*

In CNNs, high-level features contain richer semantic information and broader receptive fields, making them beneficial for detecting large ship targets. Low-level features are of high resolution and contain abundant shallow information, which is conducive to locating small ship targets. One of the difficulties in SAR ship detection is how to effectively fuse multi-scale features. Figure 2 displays several mainstream feature fusion networks [24–27]. Analysis shows that they still suffer from the following limitations in SAR scenes:

**Figure 2.** Feature fusion networks. Here, *P*<sup>i</sup> indicates the feature pyramid level i. (**a**) The FPN proposes a top-down pathway to fuse multi-scale features from *P*<sup>3</sup> to *P*<sup>7</sup>; (**b**) PANet builds up an extra bottom-up pathway; (**c**) NAS-FPN designs the network topology by neural architecture search; (**d**) BiFPN adds transverse skip-scale connections and learnable scalar fusion weights; (**e**) our AW-FPN with multiple skip-scale connections and attention-weighted feature fusion (AWF) mechanism.


In recent years, several investigations on visual attention have begun to focus on the fusion method. In SKNet [46] and ResNeSt [47], the global channel attention mechanism [48] is used to conduct dynamic weighted averaging of features from multiple kernels or groups. Although these attention-based approaches achieve non-linear feature fusion, they only consider feature selection within the same layer, offering no solution for fusing cross-level features of inconsistent semantics and scales. Furthermore, global channel attention only generates a scalar fusion weight for each channel of the feature map, which is clearly inappropriate for scenes with large variations in target scale. Generally speaking, multi-scale networks need to learn diverse feature representations, and a single global channel interaction will weaken the context information of small targets. Recently, aiming to provide a paradigm for cross-level feature fusion, Dai et al. [49] proposed an attentional feature fusion (AFF) mechanism. Regrettably, as with previous approaches, AFF only attends to the salience representations of features in the channel dimension, which might result in the loss of multi-scale spatial contexts.

Our AW-FPN has improved on both of the above. To enrich the semantic and location information in feature maps, both transverse and longitudinal skip-scale connections are used. To generate high-quality fusion weights, a novel AWF mechanism is proposed. The MCAM and MSAM in AWF aggregate both multi-scale channel and spatial contexts, so as to emphasize the region around real ship targets and suppress background clutter.

#### *2.4. SAR Image Datasets for Ship Detection*

Due to the limitations of SAR imaging conditions, the datasets of SAR scenes are not as diverse as those of natural scenes. Recent research has been committed to constructing larger and more comprehensive SAR ship detection datasets. Table 1 shows the statistics of six existing datasets [28–33]. However, they still suffer from the following defects:



**Table 1.** Statistics of the six SAR ship detection datasets released in references [28–33] and our proposed RSSD.

Our proposed RSSD acquires data from three SAR satellites with different resolutions, polarizations, and imaging modes. The imaging areas are selected in ports and canals with busy trade. All images have been meticulously pre-processed and split into 8013 ship slices of 800 × 800 pixels. With the help of professional tools, 21,479 ships are precisely annotated by OBBs. All these treatments contribute to the complexity and diversity of our dataset.

#### **3. Analysis of Angle Regression Problems and Conventional IoU-Based Losses**

In this section, we first discuss two major problems in the existing rotation detectors mainly caused by angle regression. Then, we review the conventional IoU-based losses and analyze the limitations they may encounter in rotated bounding box regression. Finally, we summarize several requirements that should be met for the rotational IoU loss.

#### *3.1. Problems of Rotation Detectors Based on Angle Regression*

Figure 3 demonstrates two generic parametric definitions of oriented bounding boxes (i.e., OpenCV definition and long-edge definition). According to the above two definitions, any two-dimensional bounding box can be represented as a group of five parameters (*cx*, *cy*, *w*, *h*, and *θ*), where (*cx*, *cy*) represents the centroid coordinate of the oriented bounding box, *w* and *h* indicate the width and height, respectively, and *θ* denotes the rotation angle. To predict the angle *θ* of the bounding box, most rotation detectors directly introduce an additional output channel into the regression subnet and use *ln*-norms as the regression loss during the training phase. However, in the testing stage, the performance is evaluated by IoU. Obviously, such a mismatch may present some problems, which we will now summarize.
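Both definitions can be converted to vertex coordinates, which is also the first step of the IoU computation described later. A minimal Python sketch of this conversion (our illustration, not the authors' code; the function name is ours, and angles are taken in degrees):

```python
import math

def obb_to_vertices(cx, cy, w, h, theta_deg):
    """Convert a (cx, cy, w, h, theta) oriented box to its four corners.

    theta is the rotation angle in degrees; w is the side aligned with
    the rotated x-axis. The OpenCV and long-edge definitions differ only
    in which ranges theta and w/h may take, not in this conversion.
    """
    t = math.radians(theta_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    # Corner offsets in the box's local frame, in counterclockwise order.
    local = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each local offset and translate by the centroid.
    return [(cx + x * cos_t - y * sin_t, cy + x * sin_t + y * cos_t)
            for x, y in local]
```

For example, `obb_to_vertices(0, 0, 4, 2, 0)` yields the axis-aligned corners of a 4 × 2 box centered at the origin.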

**Figure 3.** Two generic parametric definitions of oriented bounding boxes. (**a**) OpenCV definition, where *θ* indicates the acute or right angle between the width *w* and the *x*-axis; (**b**) Long-edge definition, where *w* and *h* signify the long side and short side of a bounding box, respectively. Here, *θ* denotes the angle from the *x*-axis to the direction of the width *w*.

#### 3.1.1. Loss-Metric Inconsistency

In Figure 4a, we compare the relationships between different regression losses and angle differences. Despite the fact that they are all monotonic, only the IoU loss (the light blue curve) and our TDIoU loss (the navy blue curve) are concave, indicating that the gradient directions of *ln*-norms are inconsistent with that of IoU. Figure 4b displays the relationship between the rotational IoU and angle differences under different aspect ratios. For a target with a large aspect ratio, a slight angle difference will also lead to a rapid drop in the IoU value. Figure 4c displays the relationships between different regression losses and aspect ratios. All *ln*-norm losses remain constant regardless of aspect ratio variations, while the IoU-based losses vary dramatically. The loss-metric inconsistency leads to the conclusion that even a small training loss cannot guarantee high detection performance.

**Figure 4.** Loss-metric inconsistency. All ground truths (GT) and predicted boxes (PB) are represented as (*cx*, *cy*, *w*, *h*, and *θ*) under the long-edge definition. (**a**) Regression loss variations versus angle differences (AD); (**b**) rotational IoU variations versus angle differences under different aspect ratios (AR); (**c**) regression loss variations versus aspect ratios.

#### 3.1.2. Angular Boundary Discontinuity

The angular boundary discontinuity refers to the surge in loss at the boundary of the angle definition range due to the periodicity of the angle (PoA) and the exchangeability of edges (EoE) [23]. Figure 5a shows the boundary problem under the OpenCV definition. Suppose there is a blue anchor/proposal and a green ground truth. The angle of the anchor/proposal is exactly around the maximum or minimum of the defined range. The ideal regression form is to rotate the anchor/proposal counterclockwise by a small angle to the position of the red box. However, due to the angle periodicity, the angle of the predicted box exceeds the defined range [−90°, 0), and the width and height are interchanged relative to the ground truth, leading to a large smooth L1 loss. At this point, the anchor/proposal has to be regressed in a more complex way. For example, it should be rotated clockwise by a larger angle, and its width and height should be scaled at the same time. A similar phenomenon also occurs under the long-edge definition, as shown in Figure 5b.

**Figure 5.** Angular boundary discontinuity under (**a**) the OpenCV definition and (**b**) the long-edge definition.

In essence, angular boundary discontinuity is a kind of manifestation of loss-metric inconsistency. In the boundary case, even if the IoU between the predicted box and the ground truth is very high, a considerable loss will be incurred. Based on the above analysis, we can conclude that the *ln*-norms are inapplicable to rotated bounding box regression.
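The discontinuity can also be demonstrated numerically: under an OpenCV-style definition, the same physical rectangle admits two parameterizations whose *ln*-norm distance is huge. A hypothetical Python sketch (helper name is ours; rounding absorbs floating-point noise from the rotation):

```python
import math

def obb_vertices(cx, cy, w, h, theta_deg):
    """Corners of a (cx, cy, w, h, theta) box, rounded and sorted so
    that two parameterizations of the same rectangle compare equal."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    local = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return sorted((round(cx + x * c - y * s, 6), round(cy + x * s + y * c, 6))
                  for x, y in local)

box_a = (0, 0, 10, 4, -90)  # OpenCV-style parameters at the boundary
box_b = (0, 0, 4, 10, 0)    # same rectangle, edges exchanged

same_shape = obb_vertices(*box_a) == obb_vertices(*box_b)  # True
param_l1 = sum(abs(a - b) for a, b in zip(box_a, box_b))   # 102
```

The two parameter vectors describe one and the same box (IoU = 1), yet their L1 distance is 102, so an *ln*-norm regression loss penalizes a perfect prediction heavily at the boundary.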

#### *3.2. Limitations of Conventional IoU-Based Losses*

It has been demonstrated in horizontal detection methods that the IoU-based losses [36–40] can ensure that the training target remains consistent with the evaluation metric. In theory, they should also work in the rotation case, as the only difference is that the IoU computation for oriented bounding boxes is more complex than that for horizontal ones.

Compared to *ln*-norms, the IoU loss has several merits. Firstly, the IoU computation involves all of the geometric properties of bounding boxes, including location, orientation, shape, etc. Secondly, instead of treating the parameters as independent variables as in the case of *ln*-norms, IoU implicitly encodes the relationship between each parameter by area calculation. Finally, IoU is scale-invariant, making it ideal for solving scale and range disparities between individual parameters. The original IoU loss is defined as follows [37]:

$$L_{IoU} = 1 - \text{IoU} \tag{1}$$

Here, *LIoU* is valid only when the two bounding boxes overlap and offers no moving gradient in non-overlapping cases. Moreover, it cannot reflect the manner in which the boxes intersect. In Figure 6, the relative positions between the predicted box and the ground truth are obviously different, while the evaluation results of *LIoU* remain constant.

**Figure 6.** Comparison between different IoU-based losses. (**a**) Different IoU-based loss curves versus angle differences; (**b**) some examples from (**a**). When *Bpb* and *Bgt* with coincident centroids are in a containment relationship and their widths and heights are constant, GIoU loss, CIoU loss, and EIoU loss all degenerate into the original IoU loss. In contrast, our TDIoU loss (the navy blue curve) is still able to stably reflect the angle difference and is informative for learning.

The GIoU loss [37] alleviates the issue of gradient disappearance in the non-overlapping case by adding an additional penalty term, which is expressed as follows:

$$L_{GIoU} = 1 - \text{IoU} + \frac{\left|C - B^{pb} \cup B^{gt}\right|}{|C|} \tag{2}$$

where *Bpb* and *Bgt* are the predicted box and the ground truth, and *C* denotes the smallest enclosing box covering *Bpb* and *Bgt*. Research shows that GIoU first tries to increase the size of *Bpb* to overlap *Bgt* and then uses the IoU term to maximize the intersection area of the bounding boxes [40]. Moreover, GIoU loss requires more iterations to converge.
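For intuition, the GIoU penalty of Equation (2) is easy to compute for axis-aligned boxes. The sketch below (our illustration, not the paper's code) shows that, unlike the plain IoU loss, it keeps growing as non-overlapping boxes move apart:

```python
def giou_loss(a, b):
    """GIoU loss for axis-aligned boxes (x1, y1, x2, y2), per Equation (2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Area of the smallest enclosing box C.
    c_area = ((max(a[2], b[2]) - min(a[0], b[0]))
              * (max(a[3], b[3]) - min(a[1], b[1])))
    return 1 - iou + (c_area - union) / c_area

# Non-overlapping boxes: the IoU loss saturates at 1, while the GIoU
# loss still grows with the gap and therefore provides a gradient.
near = giou_loss((0, 0, 2, 2), (3, 0, 5, 2))   # 1.2
far = giou_loss((0, 0, 2, 2), (8, 0, 10, 2))   # 1.6
```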

When designing the penalty term, CIoU loss [38] takes into account the centroid distance and the aspect ratio of the bounding boxes, which is defined as follows:

$$L_{CIoU} = 1 - \text{IoU} + \frac{\rho^2\left(b^{pb}, b^{gt}\right)}{c^2} + \alpha v \tag{3}$$

$$v = \frac{4}{\pi^2} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w^{pb}}{h^{pb}} \right)^2, \quad \alpha = \frac{v}{(1 - \text{IoU}) + v} \tag{4}$$

where *bpb* and *bgt* represent the centroids of *Bpb* and *Bgt*, respectively; *ρ*(·) indicates the Euclidean distance; *c* denotes the diagonal length of the smallest enclosing box; *wpb* and *hpb* signify the width and height of *Bpb*, respectively; and *wgt* and *hgt* signify the width and height of *Bgt*, respectively. In CIoU loss, *v* only reflects the difference in the aspect ratio, rather than the actual difference between *wpb* and *wgt* (or *hpb* and *hgt*).
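The *v* and *α* terms can be sketched directly from Equation (4) (following the standard CIoU definition [38]; the function name is ours). The example confirms the weakness noted above: boxes with the same aspect ratio but very different sizes yield *v* = 0:

```python
import math

def ciou_terms(w_pb, h_pb, w_gt, h_gt, iou):
    """Aspect-ratio term v and trade-off weight alpha from Equation (4)."""
    v = (4 / math.pi ** 2) * (math.atan(w_gt / h_gt)
                              - math.atan(w_pb / h_pb)) ** 2
    alpha = v / ((1 - iou) + v)
    return v, alpha

# Same 2:1 aspect ratio at different absolute sizes gives v = 0, so
# the actual width and height differences are invisible to the penalty.
v, _ = ciou_terms(4, 2, 8, 4, iou=0.25)
# v == 0.0 although the predicted box is half the ground-truth size
```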

To solve this problem, EIoU loss [39] proposes a more efficient form of penalty term:

$$L_{EIoU} = 1 - \text{IoU} + \frac{\rho^2\left(b^{pb}, b^{gt}\right)}{c^2} + \frac{\rho^2\left(w^{pb}, w^{gt}\right)}{c_w^2} + \frac{\rho^2\left(h^{pb}, h^{gt}\right)}{c_h^2} \tag{5}$$

where *cw* and *ch* indicate the width and height of the smallest enclosing box, respectively. The EIoU loss directly minimizes the difference in the width and height between *Bpb* and *Bgt*, leading to faster convergence and more accurate bounding box regression.

Recently, a new form of penalty term was released in CDIoU loss [40], which narrows the difference between *Bpb* and *Bgt* by minimizing the distance between their vertices, as follows:

$$L_{CDIoU} = 1 - \text{IoU} + \frac{\left\| B^{pb} - B^{gt} \right\|_2}{c^2} \tag{6}$$

where ∥*Bpb* − *Bgt*∥<sub>2</sub> is the distance between the corresponding vertices of *Bpb* and *Bgt*.

However, the above IoU-based losses are all designed for horizontal detection. Due to the introduction of the angle parameter, applying them to oriented bounding box regression will bring some problems. As shown in Figure 6a,b, when *Bpb* and *Bgt* with coincident centroids are in a containment relationship and their widths and heights are constant, the values of GIoU loss, CIoU loss, and EIoU loss remain the same regardless of changes in the angle of *Bpb*. At this point, they completely degenerate into the original IoU loss, making the regression more difficult and the convergence slower. In other words, general parameter-based penalty terms cannot effectively measure the angle difference between *Bpb* and *Bgt*. A natural idea is to introduce the angle parameter into the penalty term. Nevertheless, such a treatment will reintroduce the angular boundary discontinuity, which goes against our original intention. In addition, we also find that the penalty term of CDIoU loss based on the vertex distance is sensitive to the angle parameter. Unfortunately, the denominator of its penalty term involves computing the smallest enclosing box covering *Bpb* and *Bgt*, an extremely tricky task for two rotated boxes. Since the shape of the convex hull formed by the vertices of *Bpb* and *Bgt* is not fixed, the oriented minimum bounding box algorithm [50] requires exhaustive enumeration to obtain the final result, which will consume a lot of computing time and delay the whole training process.

To sum up, a qualified rotational IoU loss should at least meet the following four requirements:


#### **4. The Proposed Method**

This section elaborates on our proposed unified framework for detecting arbitrary-oriented ships in SAR images, including the differentiable rotational IoU algorithm based on the Shoelace formula, the triangle distance IoU loss (TDIoU loss), and the attention-weighted feature pyramid network (AW-FPN) combining multiple skip-scale connections and the attention-weighted feature fusion (AWF) mechanism.

#### *4.1. Differentiable Rotational IoU Algorithm Based on the Shoelace Formula*

Figure 7 visualizes the computation of the intersection-over-union (IoU) for horizontal and oriented bounding boxes. For two-dimensional object detection, the IoU between the ground truth *Bgt* and the predicted box *Bpb* is defined as follows [51]:

$$\text{IoU}\left(B^{gt}, B^{pb}\right) = \frac{\left|B^{gt} \cap B^{pb}\right|}{\left|B^{gt} \cup B^{pb}\right|} = \frac{Area_{intersect}}{Area_{union}} = \frac{Area_{intersect}}{Area_{gt} + Area_{pb} - Area_{intersect}} \tag{7}$$

where *<sup>B</sup>gt* ∩ *<sup>B</sup>pb* and *Areaintersect* signify the area of the intersection area, and *<sup>B</sup>gt* ∪ *<sup>B</sup>pb* and *Areaunion* imply the area of the union area. *Areagt* and *Areapb* denote the area of *Bgt* and *Bpb*, respectively. It can be found that how to calculate *Areaintersect* is the core issue. However, as shown in Figure 7b, the IoU computation for OBBs is more complex than that for HBBs, since the shape of the intersection area in the rotation case could be any polygon with fewer than eight edges. In addition, the general rotational IoU algorithm [52] is non-differentiable, as it uses triangulation to calculate *Areaintersect*. To address the above issue, we derive a differentiable rotational IoU algorithm based on the Shoelace formula [53], whose pseudo code is provided in Algorithm 1 (Pseudo code of the proposed rotational IoU algorithm based on the Shoelace formula). To further apply it to the IoU loss layer, we implement its forward and backward computation, as illustrated in Figure 8.

**Algorithm 1:** IoU computation for oriented bounding boxes


**Figure 7.** IoU computation for (**a**) horizontal and (**b**) oriented bounding boxes. Red and green boxes represent the predicted box and the ground truth, and the intersection area is highlighted in orange.

**Figure 8.** The forward and backward computation of the proposed rotational IoU algorithm. Green and purple boxes signify tensors and operators, respectively. Black and grey arrows indicate forward and backward processes, respectively.

#### 4.1.1. Forward Process

On the basis of Algorithm 1 and Figure 8, the forward process is as follows:

**Step 1**—convert the ground truth *Bgt* and the predicted box *Bpb* into vertex coordinate representations and calculate their areas (i.e., *Areagt* and *Areapb*, respectively);

**Step 2**—find the vertices of the intersection area of *Bgt* and *Bpb*. These vertices arise in two cases: (1) a vertex of *Bgt* or *Bpb* that falls inside the other box, and (2) an intersection point between the edges of the two rotated boxes.

In the former case, we use the dot product to calculate the projection of each vertex of *Bgt* and *Bpb* onto two adjacent edges of the other box, respectively, and then determine whether the vertex falls inside the other box, by judging whether the projection exceeds the extent of the corresponding edge. In the latter case, since each edge of rotated boxes is a line segment defined by two vertices, the problem is transformed into locating the intersection point between two line segments in two-dimensional space [54].
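The projection test for the former case can be sketched as follows (our illustration; it works for any rectangle, axis-aligned or rotated, given its vertices in order):

```python
def point_in_obb(p, verts):
    """Test whether point p lies inside a rectangle given by its four
    vertices in order (A, B, C, D), using dot-product projections onto
    the two adjacent edges AB and AD, as described in Step 2."""
    ax, ay = verts[0]
    abx, aby = verts[1][0] - ax, verts[1][1] - ay   # edge vector AB
    adx, ady = verts[3][0] - ax, verts[3][1] - ay   # edge vector AD
    apx, apy = p[0] - ax, p[1] - ay                 # vector AP
    proj_ab = apx * abx + apy * aby                 # projection onto AB
    proj_ad = apx * adx + apy * ady                 # projection onto AD
    # Inside iff both projections stay within the extent of their edge.
    return (0 <= proj_ab <= abx * abx + aby * aby
            and 0 <= proj_ad <= adx * adx + ady * ady)

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
# point_in_obb((2, 2), square) → True; point_in_obb((5, 2), square) → False
```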

Suppose *<sup>L</sup>*<sup>1</sup> is an edge of *<sup>B</sup>gt*, defined by two vertices (*x*1, *<sup>y</sup>*1) and (*x*2, *<sup>y</sup>*2), and *<sup>L</sup>*<sup>2</sup> is an edge of *<sup>B</sup>pb*, defined by two vertices (*x*3, *<sup>y</sup>*3) and (*x*4, *<sup>y</sup>*4). The line segments *<sup>L</sup>*<sup>1</sup> and *L*<sup>2</sup> can be defined in terms of first-degree Bezier parameters, as follows [55]:

$$L_1 = \begin{bmatrix} x_1 \\ y_1 \end{bmatrix} + t \begin{bmatrix} x_2 - x_1 \\ y_2 - y_1 \end{bmatrix}, \quad L_2 = \begin{bmatrix} x_3 \\ y_3 \end{bmatrix} + u \begin{bmatrix} x_4 - x_3 \\ y_4 - y_3 \end{bmatrix} \tag{8}$$

where both *t* and *u* are real numbers, which can be expressed as follows:

$$t = \frac{\det\begin{bmatrix} x_1 - x_3 & x_3 - x_4 \\ y_1 - y_3 & y_3 - y_4 \end{bmatrix}}{\det\begin{bmatrix} x_1 - x_2 & x_3 - x_4 \\ y_1 - y_2 & y_3 - y_4 \end{bmatrix}}, \quad u = \frac{\det\begin{bmatrix} x_1 - x_3 & x_1 - x_2 \\ y_1 - y_3 & y_1 - y_2 \end{bmatrix}}{\det\begin{bmatrix} x_1 - x_2 & x_3 - x_4 \\ y_1 - y_2 & y_3 - y_4 \end{bmatrix}} \tag{9}$$

where det[·] represents the determinant computation. If, and only if, 0 ≤ *t* ≤ 1 and 0 ≤ *u* ≤ 1, an intersection point (*Px*, *Py*) exists, as follows:

$$\left(P_x, P_y\right) = \left(x_1 + t(x_2 - x_1),\; y_1 + t(y_2 - y_1)\right) = \left(x_3 + u(x_4 - x_3),\; y_3 + u(y_4 - y_3)\right) \tag{10}$$

In particular, when *L*<sup>1</sup> and *L*<sup>2</sup> are collinear (parallel or coincident), they do not intersect. By traversing each edge of *Bgt* and *Bpb*, we obtain all the intersection points.
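Equations (8)–(10) translate almost directly into code. A minimal sketch (the function name is ours; it returns None for collinear or non-intersecting segments, matching the treatment above):

```python
def segment_intersection(p1, p2, p3, p4):
    """Intersection of segments p1p2 and p3p4 via the first-degree
    Bezier parameters t and u of Equations (8)-(10)."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if denom == 0:  # collinear (parallel or coincident): no intersection
        return None
    t = ((x1 - x3) * (y3 - y4) - (y1 - y3) * (x3 - x4)) / denom
    u = ((x1 - x3) * (y1 - y2) - (y1 - y3) * (x1 - x2)) / denom
    if 0 <= t <= 1 and 0 <= u <= 1:
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
    return None  # lines cross outside one of the segments

# Crossing diagonals of the unit square meet at its centre:
# segment_intersection((0, 0), (1, 1), (0, 1), (1, 0)) → (0.5, 0.5)
```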

By computing the above two cases, we finally determine the vertices of the intersection area. If the vertex does not exist, the IoU value is zero;

**Step 3**—sort the vertices of the intersection area. In general, the vertices of the intersection area form a convex hull. To compute its area, we need to sort its vertices. First, calculate the mean value of the abscissa and the ordinate of these vertices, and note it as the centroid of the polygon. Second, compute the vectors from the centroid to each vertex and normalize them to simplify the sort operation. Finally, scan all the vertices in counterclockwise order from the positive direction of the *x*-axis to obtain the sorted vertex indices.
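Step 3 can be sketched with a single angular sort (our illustration; `atan2` plus a modulo starts the counterclockwise scan from the positive *x*-axis):

```python
import math

def sort_ccw(points):
    """Sort convex-polygon vertices counterclockwise around their
    centroid, starting from the positive x-axis direction (Step 3)."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    # Angle of each centroid-to-vertex vector, mapped to [0, 2*pi).
    return sorted(points,
                  key=lambda p: math.atan2(p[1] - cy, p[0] - cx) % (2 * math.pi))

pts = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
# sort_ccw(pts) → [(1, 1), (-1, 1), (-1, -1), (1, -1)]
```

Note that sorting by angle is equivalent to the normalization-and-scan procedure described above, since only the angular order of the vectors matters.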

**Step 4**—perform the gather operation to successively fetch the actual coordinate values of the sorted vertices from the unsorted vertex tensor according to the indices;

**Step 5**—compute the area of the intersection polygon using the Shoelace formula, as follows [56]:

$$Area_{intersect} = \frac{1}{2}\left|\sum_{i=1}^{n} x_i (y_{i+1} - y_{i-1})\right| = \frac{1}{2}\left|\sum_{i=1}^{n} y_i (x_{i+1} - x_{i-1})\right| = \frac{1}{2}\left|\sum_{i=1}^{n} \det\begin{bmatrix} x_i & x_{i+1} \\ y_i & y_{i+1} \end{bmatrix}\right| \tag{11}$$

where *n* represents the number of edges of the intersection polygon, and (*xi*, *yi*) indicate the sorted vertices of the polygon, with i = 1, 2, ··· , *n*. Note that *x*<sub>*n*+1</sub> = *x*<sub>1</sub> and *y*<sub>*n*+1</sub> = *y*<sub>1</sub>;

**Step 6**—compute the rotational IoU value of *Bgt* and *Bpb* according to Equation (7).
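Steps 5 and 6 reduce to a few lines once the sorted vertices are available. A sketch of Equations (11) and (7) (function names are ours):

```python
def shoelace_area(verts):
    """Polygon area by the Shoelace formula of Equation (11); verts must
    be sorted around the polygon (either orientation, since the
    absolute value handles both)."""
    n = len(verts)
    s = sum(verts[i][0] * verts[(i + 1) % n][1]
            - verts[(i + 1) % n][0] * verts[i][1] for i in range(n))
    return abs(s) / 2

def iou_from_areas(area_gt, area_pb, area_intersect):
    """Equation (7): IoU computed from the three areas."""
    return area_intersect / (area_gt + area_pb - area_intersect)

triangle = [(0, 0), (4, 0), (0, 3)]
# shoelace_area(triangle) → 6.0
```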


#### 4.1.2. Backward Process

During the forward process, the sort operation returns the indices of sorted vertices in counterclockwise order. Since the return value is discrete (an integer number) rather than continuous (a float number), it is non-differentiable and, therefore, cannot participate in the backward process. However, the computation part of the rotational IoU is still differentiable. This is because we only use the gather operation to obtain the coordinate values of the sorted vertices on the basis of the indices returned by the sort operation, and then adopt the Shoelace formula to compute the area of the intersection area. Throughout the process, the sort operation is not really involved in the area calculation. In most existing deep learning frameworks, the gather function is defined to gather values from the input tensor along a specified dimension and according to a specified index. As its return value is continuous by definition, it is differentiable. Furthermore, the computing process of IoU, including the dot product, the line–line intersection algorithm, and the Shoelace formula, only comprises some essential additive and multiplicative operations, ensuring that the process is robust to the rotational case and feasible for back-propagation.

#### *4.2. Triangle Distance IoU Loss*

The proposed rotational IoU algorithm enables back-propagation of the IoU loss layer and, thus, meets **Requirement 1**. In this part, we aim to design a rotational IoU-based loss, which fulfills **Requirements 2, 3, and 4** by constructing a proper penalty term.

Similarly to [37], we define the IoU-based loss as follows:

$$L = 1 - \text{IoU} + \mathcal{R}\left(B^{pb}, B^{gt}\right) \tag{12}$$

where R(*Bpb*, *Bgt*) is the penalty term for the predicted box *Bpb* and the ground truth *Bgt*.

Inspired by CDIoU, we apply the distance between corresponding sampling points (i.e., centroids and vertices) of *Bgt* and *Bpb* to the penalty term to measure the overall similarity between them, while avoiding the angular boundary discontinuity caused by the direct introduction of the angle parameter. To reduce the computing complexity, a novel reference term, namely triangle distance, is devised as the denominator of the penalty term to replace the diagonal length of the smallest enclosing box. Following this idea, we design a triangle distance IoU loss (TDIoU loss), which is defined as follows:

$$L\_{\text{TDIoU}} = 1 - \text{IoU} + \mathcal{R}\_{\text{TDIoU}} \tag{13}$$

According to Figure 9a, the penalty term of TDIoU loss is defined as follows:

$$\mathcal{R}\_{\text{TDIoU}} = \frac{|AE| + |BF| + |CG| + |DH| + |PQ|}{\Delta\_{AEQ}^{AQ,EQ} + \Delta\_{BFQ}^{BQ,FQ} + \Delta\_{CGQ}^{CQ,GQ} + \Delta\_{DHQ}^{DQ,HQ} + \Delta\_{APQ}^{AP,AQ}} \tag{14}$$

where *ABCD* and *EFGH* indicate the corresponding vertices of the predicted box *Bpb* and the ground truth *Bgt*. Here, *P* and *Q* represent the centroids of *Bpb* and *Bgt*, respectively. Furthermore, |·| refers to the Euclidean distance between two sampling points, while Δ<sub>*AEQ*</sub><sup>*AQ*,*EQ*</sup> indicates the sum of the two edges *AQ* and *EQ* of Δ*AEQ* (the same applies for other similar terms).

**Figure 9.** The schematic diagram of the TDIoU loss. (**a**) The computation of R*TDIoU*. The red and blue boxes indicate the predicted box *Bpb* and the ground truth *Bgt*, respectively. The red and blue lines denote the distance between sampling points; (**b**) the process of bounding box regression guided by TDIoU loss. After back-propagation, the model tends to pull the centroids and vertices of the anchor/proposal toward the corresponding points of the ground truth until they overlap.

Note that each group of corresponding sampling points is exploited to construct independent triangles in R*TDIoU*. To illustrate this process, here we use the vertices *A* and *E*. As shown in Figure 9a, we use *A*, *E*, and the centroid of *Bgt*, *Q*, to construct Δ*AEQ*, which obviously satisfies |*AE*| < |*AQ*| + |*EQ*|. Then, |*AE*| is put into the numerator of R*TDIoU* to directly measure the distance between the vertices *A* and *E*, while |*AQ*| and |*EQ*| are introduced into the denominator of R*TDIoU* as part of the reference term. In this way, we finally establish the entire reference term by traversing each group of sampling points, specifically as follows:

$$\begin{array}{l} |AE| < |AQ| + |EQ| \\ |BF| < |BQ| + |FQ| \\ |CG| < |CQ| + |GQ| \\ |DH| < |DQ| + |HQ| \\ |PQ| < |AP| + |AQ| \end{array} \tag{15}$$

In the denominator reference term of R*TDIoU*, the triangle distance plays a similar role to the diagonal length of the smallest enclosing box, ensuring that the value of the penalty term is limited to [0, 1). The difference is that the computing process of the triangle distance is much simpler than that of the latter as it only involves the computation of the distance between two points, which is able to save more training resources and time.
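Under the definitions above, the penalty term of Equation (14) reduces to a handful of point-to-point distances. A minimal plain-Python sketch (an illustration, not the authors' implementation; box corners and centroids are passed in explicitly, which is an assumed data layout):

```python
import math

def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def tdiou_penalty(pred_corners, gt_corners, p_centroid, q_centroid):
    """Penalty term of Equation (14): distances between corresponding
    sampling points over the triangle-distance reference term.

    pred_corners / gt_corners: four (x, y) vertices A..D and E..H in
    corresponding order; p_centroid / q_centroid: centroids P and Q.
    """
    q = q_centroid
    # Numerator: |AE| + |BF| + |CG| + |DH| + |PQ|
    num = sum(dist(a, e) for a, e in zip(pred_corners, gt_corners))
    num += dist(p_centroid, q)
    # Denominator: for each vertex pair, the two triangle edges through Q,
    # plus |AP| + |AQ| for the centroid pair.
    den = sum(dist(a, q) + dist(e, q) for a, e in zip(pred_corners, gt_corners))
    den += dist(pred_corners[0], p_centroid) + dist(pred_corners[0], q)
    return num / den

sq = [(0, 0), (1, 0), (1, 1), (0, 1)]
c = (0.5, 0.5)
print(tdiou_penalty(sq, sq, c, c))  # 0.0 for identical boxes
```

By the triangle inequalities of Equation (15), the numerator is strictly smaller than the denominator for non-coincident boxes, so the penalty stays within [0, 1).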

Overall, our TDIoU loss is a unified solution to all the above requirements. Compared to other bounding box regression losses, it has several advantages in rotation detection, as follows:


3. The penalty term of TDIoU loss takes into account the computing complexity by using triangles formed by each group of sampling points to construct the denominator, which significantly reduces the training time and satisfies **Requirement 4.**

Additionally, as a novel training metric, TDIoU loss has the following properties:


#### *4.3. Attention-Weighted Feature Pyramid Network*

In this part, we introduce the main idea of the proposed attention-weighted feature pyramid network (AW-FPN), which improves the conventional feature fusion networks from the following two aspects: the connection pathway and the fusion method.

#### 4.3.1. Skip-Scale Connections

First used as the identity mapping shortcut in residual blocks [57–59], the skip connection has been a significant component in convolutional networks. In BiFPN, same-level features at different scales are fused via transverse skip-scale connections. However, this single same-level feature reuse neglects the semantic interactions between cross-level features and fails to avoid the semantic loss during layer-to-layer transmission. To search for a better network topology, NAS-FPN uses the neural architecture search (NAS) technique. Although it has a haphazard structure that is difficult to interpret, it can guide us in designing a preferable feature network. As shown in Figure 2c, NAS-FPN contains not only transverse skip-scale connections but also longitudinal skip-scale connections.

Motivated by the above analysis, we devise a more effective feature pyramid network structure, as demonstrated in Figure 10. First, we retain transverse skip-scale connections used in BiFPN for the same-level feature reuse while avoiding adding much cost. Second, to enhance semantic interactions between features of different resolutions, two types of longitudinal skip-scale connections are added in the bi-directional pathways, as follows:


**Figure 10.** Architecture of AW-FPN, where **A** denotes the attention-weighted feature fusion (AWF).

#### 4.3.2. Attention-Weighted Feature Fusion (AWF)

When fusing features of inconsistent semantics and scales, a common approach is to directly add them together. The BiFPN assigns a learnable scalar weight for each connection pathway. Nevertheless, in the case of considerable variations in target scales, these linear fusion methods still face obstacles. The AFF [49] provides a non-linear attentional feature fusion scheme. To some extent, our proposed attention-weighted feature fusion (AWF) mechanism can be regarded as its follow-up work, but differs in at least three aspects, as follows:


Figure 11 describes the process of implementing the AWF. Given *N* input features from different pyramid levels, F<sub>*n*</sub> ∈ R<sup>C×H<sub>*n*</sub>×W<sub>*n*</sub></sup> (*n* = 1, 2, ··· , *N*), which are of different widths and heights, we resize them to the same resolution in advance, as follows:

$$\text{Resize: } \mathbf{F}\_n \to F\_n' \in \mathbb{R}^{C \times H \times W} \tag{16}$$

where *Resize* is an upsampling or downsampling operation. To integrate the information flows of different scales from multiple inputs, we first combine them to construct a fully context-aware initial integration **U** ∈ R<sup>C×H×W</sup>, where ⊕ is an element-wise summation, as follows:

$$\mathbf{U} = F\_1' \oplus F\_2' \oplus \cdots \oplus F\_N' \tag{17}$$

**Figure 11.** Diagram of the AWF. To generate the non-linear fusion weight a*n*, an element-wise softmax is performed on the integrated attention descriptors A*n*, fused by multi-scale channel attention C*n* and multi-scale spatial attention S*n*.

Then, to aggregate global and local feature contexts, the initial integration **U** is transmitted to two parallel multi-scale attention modules MCAM and MSAM, as shown in Figure 12.

**MCAM**—to aggregate global spatial information for each channel, we employ both average-pooling and max-pooling operations to squeeze the spatial dimension of **U**, so as to generate two distinct channel-wise statistics. Next, we merge them via an element-wise summation to obtain a refined global channel descriptor. Meanwhile, we follow the idea of AFF to aggregate local channel contexts by altering the pooling size. A simple approach is to directly use **U** as the local channel descriptor. After that, the global and local channel descriptors are fed into two independent excitation branches. As the fully connected layer used in [46,48] cannot be directly performed on the three-dimensional tensor, we adopt the convolution operation with a kernel size of 1 × 1, which only uses point-wise channel interactions at each spatial location to learn the non-linear association between channels. The global channel context C<sub>*g*</sub>(U) and the local channel context C<sub>*L*</sub>(U) are defined as follows:

$$\mathbf{C}\_{g}(\mathbf{U}) = \mathcal{B}\left(Conv\_{2}^{1\times1}\left(\delta\left(\mathcal{B}\left(Conv\_{1}^{1\times1}(\text{AvgPool}(\mathbf{U}) \oplus \text{MaxPool}(\mathbf{U}))\right)\right)\right)\right) \tag{18}$$

$$\mathbf{C}\_{L}(\mathbf{U}) = \mathcal{B}\left(Conv\_{2}^{1\times1}\left(\delta\left(\mathcal{B}\left(Conv\_{1}^{1\times1}(\mathbf{U})\right)\right)\right)\right) \tag{19}$$

where C<sub>*g*</sub>(U) ∈ R<sup>*NC*×1×1</sup> and C<sub>*L*</sub>(U) ∈ R<sup>*NC*×*H*×*W*</sup>. Here, B denotes batch normalization [60]. Additionally, *δ* is the ReLU function, and *Conv*<sup>1×1</sup> is the 1 × 1 convolution. To simplify computation, the first convolution of each branch is used for channel reduction, while the second restores the channel dimension. Hence, the numbers of filters of *Conv*<sub>1</sub><sup>1×1</sup> and *Conv*<sub>2</sub><sup>1×1</sup> are set to *C*/*r* and *NC*, respectively, where *r* is the channel reduction ratio. Then, C<sub>*g*</sub>(U) and C<sub>*L*</sub>(U) are fused via the broadcasting mechanism to construct the multi-scale channel context C(U), as follows:

$$\mathbf{C}(\mathbf{U}) = \mathbf{C}\_{g}(\mathbf{U}) \oplus \mathbf{C}\_{L}(\mathbf{U}) \tag{20}$$

**Figure 12.** Diagram of each attention sub-module. Multi-scale feature contexts are aggregated in both MCAM and MSAM.

Since C(U) ∈ R<sup>*NC*×H×W</sup> is a channel context aggregation of *N* input features, it is subsequently split into C<sub>*n*</sub> ∈ R<sup>C×H×W</sup> as the multi-scale channel attention for each input.

**MSAM**—similarly to MCAM, to learn the global and local cross-spatial relationships of the initial integration **U,** we use two parallel branches. First, to obtain a refined global spatial descriptor, we perform the average-pooling and max-pooling operations along the channel dimension and concatenate them, while the initial integration is simply treated as the local spatial descriptor. Then, the convolution layer with a kernel size of 7 × 7, which has a broader receptive field, is selected as the spatial context aggregator to encode emphasized or suppressed positions of spatial descriptors. On this basis, the global spatial context S*g*(U) and the local spatial context S*l*(U) can be defined as follows:

$$\mathbf{S}\_{g}(\mathbf{U}) = \text{Resize}\left(Conv\_{1}^{7\times7}(\text{AvgPool}(\mathbf{U}) \circledast \text{MaxPool}(\mathbf{U}))\right) \tag{21}$$

$$\mathbf{S}\_{l}(\mathbf{U}) = \text{Resize}\left(Conv\_{2}^{7\times7}(\mathbf{U})\right) \tag{22}$$

where S<sub>*g*</sub>(U) ∈ R<sup>*N*×1×H×W</sup> and S<sub>*l*</sub>(U) ∈ R<sup>*N*×C×H×W</sup>. Here, ⊛ indicates a concatenate operation. The numbers of filters of *Conv*<sub>1</sub><sup>7×7</sup> and *Conv*<sub>2</sub><sup>7×7</sup> are set to *N* and *NC*, respectively. As the convolution outputs of the two branches cannot be added directly, we resize them and then fuse them via the broadcast mechanism to obtain the multi-scale spatial context S(U) ∈ R<sup>*N*×C×H×W</sup>, as follows:

$$\mathbf{S}(\mathbf{U}) = \mathbf{S}\_{g}(\mathbf{U}) \oplus \mathbf{S}\_{l}(\mathbf{U}) \tag{23}$$

We split S(U) into S<sub>*n*</sub> ∈ R<sup>C×H×W</sup> as the multi-scale spatial attention, and the integrated attention descriptor A<sub>*n*</sub> ∈ R<sup>C×H×W</sup> can be computed as A<sub>*n*</sub> = C<sub>*n*</sub> ⊕ S<sub>*n*</sub>. Next, to generate the non-linear fusion weight a<sub>*n*</sub> for each input feature, a softmax operation is executed on each group of corresponding elements of all attention descriptors A<sub>*n*</sub>, as follows:

$$\mathbf{a}\_{n} = \frac{e^{\mathbf{A}\_{n}}}{e^{\mathbf{A}\_{1}} \oplus e^{\mathbf{A}\_{2}} \oplus \cdots \oplus e^{\mathbf{A}\_{N}}} \tag{24}$$

Each element a<sub>*n*</sub><sup>*x*,*y*,*z*</sup> of a<sub>*n*</sub> is a real number between 0 and 1 and fulfills ∑<sub>*n*=1</sub><sup>*N*</sup> a<sub>*n*</sub><sup>*x*,*y*,*z*</sup> = 1. As a<sub>*n*</sub> ∈ R<sup>C×H×W</sup> have the same size as the resized features F′<sub>*n*</sub>, they preserve and emphasize the subtle details in all inputs, enabling high-quality soft feature selections between F′<sub>*n*</sub>, as follows:

$$\mathbf{O} = \left(\mathbf{a}\_1 \otimes F\_1'\right) \oplus \left(\mathbf{a}\_2 \otimes F\_2'\right) \oplus \cdots \oplus \left(\mathbf{a}\_N \otimes F\_N'\right) \tag{25}$$

Here, **O** signifies the final fused feature and ⊗ implies an element-wise multiplication.
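The weight generation of Equation (24) and the fusion of Equation (25) amount to an element-wise softmax across the *N* attention descriptors followed by a weighted sum. A toy plain-Python sketch (flat lists of floats stand in for the C×H×W tensors, which is an assumed simplification):

```python
import math

def awf_fuse(features, descriptors):
    """Element-wise softmax over attention descriptors (Eq. 24), then a
    weighted sum of the resized input features (Eq. 25).

    features, descriptors: lists (one per input) of equal-length lists of
    floats, one entry per spatial position.
    """
    n_inputs = len(features)
    n_elems = len(features[0])
    fused = []
    for j in range(n_elems):
        exps = [math.exp(descriptors[i][j]) for i in range(n_inputs)]
        total = sum(exps)
        weights = [e / total for e in exps]  # sum to 1 at each position
        fused.append(sum(w * features[i][j] for i, w in enumerate(weights)))
    return fused

# Equal descriptors -> uniform weights -> plain average of the inputs.
out = awf_fuse([[2.0, 4.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]])
print(out)  # [1.0, 2.0]
```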

#### 4.3.3. The Forward Process of the AW-FPN

Our ultimate AW-FPN combines both multiple skip-scale connections and attention-weighted feature fusion. As shown in Figure 10, it takes the level 3–7 features extracted by the backbone network as the input C<sub>in</sub> = {C<sub>3</sub>, C<sub>4</sub>, C<sub>5</sub>, C<sub>6</sub>, C<sub>7</sub>}, where C<sub>*i*</sub> denotes a feature level with a resolution of 1/2<sup>*i*</sup> of the input image. The top-down and bottom-up aggregation pathways are constructed successively. Here, we take level 5 as an example to illustrate the forward process. On the top-down pathway, the intermediate feature of level 6 (P′<sub>6</sub>) is upsampled 2× and then fused with C<sub>5</sub> via AWF, followed by a 3 × 3 convolution to generate the intermediate feature P′<sub>5</sub>. On the bottom-up pathway, the outputs of levels 4 and 3 (P<sub>4</sub> and P<sub>3</sub>) are subjected to 2× and 4× downsampling operations, respectively, and then fused with C<sub>5</sub> and P′<sub>5</sub>. The final output P<sub>5</sub> is generated by the 3 × 3 convolution, as follows:

$$P\_5' = \text{Conv}\left(\text{AWF}\left(C\_5, \text{Resize}\left(P\_6'\right)\right)\right) \tag{26}$$

$$P\_5 = \text{Conv}\left(\text{AWF}\left(C\_5, P\_5', \text{Resize}(P\_4), \text{Resize}(P\_3)\right)\right) \tag{27}$$

where *Conv* implies the 3 × 3 convolution, which is followed by a batch normalization operation and a ReLU function. All other feature levels are constructed in a similar way.

#### **5. Rotated-SARShip Dataset**

In this section, we introduce the collection process and data statistics of our proposed rotated-SARShip dataset (RSSD) for arbitrary-oriented ship detection in SAR images.

#### *5.1. Original SAR Image Acquisition*

Table 2 provides detailed information on the original SAR imagery used to establish our RSSD. First, from the Copernicus Open Access Hub [61], we downloaded three raw Sentinel-1 images with a resolution of 5 m × 20 m, characterized by large scales and wide coverages (25,340 × 17,634 pixels on average). As shown in Figure 13, the imagery acquisition areas are selected in the Dalian Port, the Panama Canal, and the Tokyo Port (these ports have huge cargo throughputs, and the canal has busy trade). In general, the polarization, imaging mode, and incident angle of sensors tend to influence the imaging condition of SAR images to a certain extent. For the Sentinel-1 images, the basic polarimetric combination is VV and VH. The imaging mode is interferometric wide swath (IW), which is the primary sensor mode for data acquisition in marine surveillance zones. Furthermore, to minimize redundant interferences, such as foreshortening, layover, and shadowing of vessels, we choose an incident angle of 27.6~34.8° [32].

**Table 2.** Detailed information of the original SAR imagery used to establish our RSSD.


**Figure 13.** Coverage areas of No. 1–3 Sentinel-1 images. (**a**) The Dalian Port; (**b**) the Panama Canal; (**c**) the Tokyo Port.

To ensure complex and diverse image scenes, we also screen 252 and 5535 SAR images from AIR-SARShip [31] and HRSID [32], respectively. As shown in Table 2, the HRSID images shot by Sentinel-1 and TerraSAR-X have resolutions of 0.5 m, 1 m, and 3 m. The polarizations are HH, HV, and VV, and the imaging modes are S3-StripMap (S3-SM), Staring SpotLight (ST), and High-Resolution SpotLight (HS). The AIR-SARShip images from Gaofen-3 have resolutions of 1 m and 3 m, polarizations of single and VV, and imaging modes of SpotLight and StripMap (SM). Since these images have different resolutions and imaging conditions, ships in them usually appear in different forms. Notably, images with a resolution of less than 3 m can retain the detailed features of ships, while images with a resolution of 5 m × 20 m can increase the number of small ship targets.

#### *5.2. SAR Image Pre-Processing and Splitting*

The above original SAR imageries still need to be pre-processed before annotation. To display recognizable features of ships, we first apply the Sentinel-1 toolbox [62] to convert the raw Sentinel-1 data into grayscale images in the 16-bit tag image file format (TIFF), followed by geometrical rectification and radiometric calibration operations. Since images selected from AIR-SARShip and HRSID have already undergone the above processing, we directly perform the de-speckling operation on all the original images to suppress the influence of background noise. Finally, we transform all images into portable network graphics (PNG) files in the same format as the DOTA dataset [63].

Due to the side-scan imaging mechanism of SAR satellites, the original SAR imagery generally has a huge size and should be split into ship slices to fit the input size of CNN-based detectors. First, to avoid duplicate splitting, the offshore areas with a relatively dense ship distribution are separated from the images in advance [32]. After that, a sliding window of 800 × 800 pixels is used to shift over the whole image with a stride of 600 pixels in width and height (25% overlap rate) to preserve the relatively intact features of ships. Since the images screened from HRSID have already been cropped to the expected size matching the network input, the splitting operation is performed only on the images from Sentinel-1 and AIR-SARShip. Furthermore, we reserve the complex inshore scenes containing ships and artificial facilities and remove the negative samples with only pure background.
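The splitting scheme above (800 × 800 window, stride 600, 25% overlap) can be sketched as follows (an illustrative helper only; border handling in the authors' pipeline may differ):

```python
def sliding_windows(width, height, win=800, stride=600):
    """Top-left corners of win x win crops shifted with the given stride.
    The last window in each direction is clamped to the image border so
    that no pixels are lost."""
    xs = list(range(0, max(width - win, 0) + 1, stride))
    ys = list(range(0, max(height - win, 0) + 1, stride))
    if xs[-1] + win < width:
        xs.append(width - win)
    if ys[-1] + win < height:
        ys.append(height - win)
    return [(x, y) for y in ys for x in xs]

# A 2000 x 1400 image yields columns at x = 0, 600, 1200 and rows at y = 0, 600.
print(sliding_windows(2000, 1400))
```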

#### *5.3. Dataset Annotation*

With the assistance of the official document and the Sentinel-1 toolbox, we can easily acquire the precise imaging time and geographic location of each Sentinel-1 image, which will help the automatic identification system (AIS) and Google Earth to provide support for the annotation work. As shown in Figure 14, we first identify the approximate location of the imaging area of each Sentinel-1 image in AIS and Google Earth. Since AIS provides the movement trajectories of most ships around the time the images were shot, it is possible to grasp the approximate distribution of ships and estimate their possible positions in the imaging area. Subsequently, we match the AIS message with each Sentinel-1 image and determine the topographical features and marine conditions of the coverage area with the help of Google Earth. On this basis, we adopt RoLabelImg [64] to annotate the oriented bounding boxes of ships, obtaining relatively accurate ground truths. To ensure that the annotations meet the requirements of most rotation detectors, we convert them to the DOTA format, using four ordered vertices to represent ship ground truths, as shown in Figure 15. In fact, there are still some islands and reefs incorrectly labeled as ships. Thus, we employ Google Earth for further in-depth inspection and correction to ensure the accuracy of the annotations. Note that since the specific shooting information of the images from AIR-SARShip and HRSID cannot be acquired directly, we first refer to their original horizontal ground truths and carefully check whether there are errors and omissions. Then, we annotate them with more elaborate oriented bounding boxes.

**Figure 14.** AIS query and Google Earth correction (take No.1 Sentinel-1 image as an example). (**a**) AIS information of the coverage. Marks of different shapes and colors represent different types of ships; (**b**) corresponding Google Earth image.

So far, we have established the RSSD, which contains 8013 SAR images with corresponding annotation files, including 21,479 ship targets annotated by rotated ground truths. Figure 16 displays ship ground truth annotations of diverse SAR images in RSSD.

**Figure 15.** The ship annotation. (**a**) The OBB label in a SAR image; (**b**) the xml label file annotated by RoLabelImg. Each ship target is represented by an oriented bounding box as its ground truth, where (*cx*, *cy*) is the centroid, *w* and *h* denote the width and height, and *θ* indicates the rotation angle; (**c**) the txt label file in DOTA format. Each ship is represented by four ordered vertices. Note that the top-left vertex is taken as the starting point, and the four vertices are arranged in clockwise order.

**Figure 16.** Ship ground truth annotations of diverse SAR images in RSSD. Real ships are accurately marked in green OBBs. (**a**) Offshore single ship; (**b**) offshore multiple ships; (**c**,**d**) densely arranged ships; (**e**) ships lying off the port; (**f**) ships with large aspect ratios; (**g**,**h**) small ships in the canal.

#### *5.4. Statistical Analysis on the RSSD*

Figure 17 visualizes the comprehensive statistical comparison between our RSSD and SSDD, both of which adopt OBB annotations. As Table 3 shows, 70% of the RSSD images are randomly selected as the training set, and 30% are selected as the test set. For the SSDD, we divide all images in the ratio of 8:2 according to [28]. As shown in Table 1, SSDD contains 1160 SAR images with 2587 annotated ships, indicating that each image contains only 2.2 ships on average, while in our dataset, each image contains about 2.7 ships. Figure 17a,e display the width and height distribution of ship ground truths. Compared to the extreme funnel-like distribution of SSDD, our RSSD features a more uniform ship size distribution and more prominent multi-scale characteristics. As per Figure 17b,f, the aspect ratio of ship ground truths in the SSDD is generally below 3, whereas it is concentrated in the range of 2~5 in the RSSD, indicating that most instances in our dataset have relatively high aspect ratios. Since the difficulty in detecting ships typically increases with the aspect ratio, our RSSD is more challenging compared to other datasets. As per Figure 17c,g, according to the MS COCO evaluation metric [65], the numbers of small ships (*Area*<sub>OBBs</sub> < 1024 pixels), medium ships (1024 < *Area*<sub>OBBs</sub> < 9216 pixels), and large ships (*Area*<sub>OBBs</sub> > 9216 pixels) in the RSSD are 13,369, 7741, and 369, respectively (62.24%, 36.04%, and 1.72% of all ships, respectively), while in the SSDD, the proportions are 71.12%, 28.30%, and 0.58%, respectively. Ships in both datasets are relatively small in size but have large variations in scale. As shown in Figure 17d,h, the angle distribution of ship ground truths in the RSSD is more balanced than that in the SSDD. This ensures that rotation detectors learn the multi-angle features better.

**Figure 17.** Statistical comparison between the proposed RSSD and the SSDD. Here, (**a**) and (**e**) show the width and height distribution of ship ground truths in RSSD and SSDD, respectively; (**b**) and (**f**) display the aspect ratio distribution of ship OBBs; (**c**) and (**g**) indicate the area distribution of ship OBBs; (**d**) and (**h**) show the rotation angle distribution of ship OBBs.



Based on the above analysis, it is obvious that the ship targets in our RSSD not only differ significantly in orientation degrees but also have multi-scale characteristics, which provides a challenging benchmark for arbitrary-oriented ship detection in SAR images.
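For reference, the MS COCO-style size buckets used in the statistics above can be expressed as a small helper (illustrative only):

```python
def ship_size_class(obb_area):
    """MS COCO-style size buckets used for the RSSD statistics:
    small < 1024 px, medium between 1024 and 9216 px, large > 9216 px."""
    if obb_area < 1024:
        return "small"
    if obb_area < 9216:
        return "medium"
    return "large"

print(ship_size_class(500), ship_size_class(5000), ship_size_class(20000))
# small medium large
```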

#### **6. Experiments and Discussion**

In this section, we first present the benchmark datasets, implementation details, and evaluation metrics. Then, extensive comparative experiments with existing methods are carried out to verify the superiority and robustness of our approach. Meanwhile, comprehensive discussions are provided to analyze and interpret the experimental results.

#### *6.1. Benchmark Datasets and Implementation Details*

The proposed rotated-SARShip dataset (RSSD) and the public SAR ship detection dataset (SSDD), specific information about which is provided in Section 5, are used to evaluate the performance of the proposed method. In our experiments, all SSDD images are resized to 512 × 512 pixels, with a padding operation to avoid distortion, while the RSSD images of 800 × 800 pixels are directly used as the network input. The ratio of the training set to the test set for the RSSD is set to 7:3, while that for the SSDD is set to 8:2. To better assess the performance of our approach in different SAR scenes, we further divide the test sets into inshore and offshore scenes. Figure 18 and Table 3 show the details of dataset division.

**Figure 18.** The proportion of inshore and offshore scenes in the test sets of (**a**) RSSD and (**b**) SSDD.

Furthermore, a public benchmark for OBB-based ship detection in optical remote sensing images, the HRSC2016 dataset [66], is used to verify the generalization ability of the proposed method across different scenarios. It contains 1061 high-resolution aerial images, including 2976 different types of ships annotated by oriented bounding boxes, with the image size ranging from 300 × 300 to 1500 × 900 pixels. We employ the training (436 images) and validation (181 images) sets for training, and the test set (444 images) for testing. All images are resized to 800 × 512 pixels without altering the original aspect ratio.

The experiments are conducted on a platform with Ubuntu 18.04 OS, 32 GB of RAM, and an NVIDIA GTX 1080Ti GPU. For all datasets, we train the models for 72 epochs. The SGD optimizer is adopted with a batch size of 2 and an initial learning rate of 0.0025. The momentum and weight decay are 0.9 and 0.0001, respectively. As for the learning schedule, we apply the warmup strategy for 500 iterations, and the learning rate is dropped 10-fold at each decay step. If not specified, ResNet50 [57] is employed as the default backbone network. Its parameters are initialized by ImageNet pretrained weights. For fair comparisons with other methods and to avoid over-fitting, we only use random flipping and rotation for data augmentation in the training phase. If not specified, no extra tricks are used.

#### *6.2. Evaluation Metrics*

To qualitatively and quantitatively evaluate the detection performance of different methods in our experiments, two normative metrics, the precision–recall curve (P–R curve) and average precision (AP), are leveraged. Specifically, the precision and the recall can be expressed as follows:

$$Precision = \frac{TP}{TP + FP} \tag{28}$$

$$Recall = \frac{TP}{TP + FN} \tag{29}$$

where *TP* (true positives), *FP* (false positives), and *FN* (false negatives) represent the number of correctly detected ships, false alarms, and undetected ships, respectively. The P–R curve, with precision as the *y*-axis and recall as the *x*-axis, reveals the relationship between these two metrics. The AP is defined as the area under the P–R curve, as follows:

$$\text{AP} = \int\_0^1 P(R)dR \tag{30}$$

where *P* and *R* indicate the precision and recall, respectively. The AP evaluates the overall performance of detectors under different IoU thresholds (0.5 by default) and, the larger the value, the better the performance. Furthermore, we use the total training time as a metric to evaluate the computing complexity and training efficiency of different losses.
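In practice, the integral of Equation (30) is approximated numerically from discrete P–R pairs. A minimal sketch using all-point interpolation (one common convention; evaluation toolkits may instead use the 11-point variant):

```python
def average_precision(recalls, precisions):
    """Approximate the area under the P-R curve, Equation (30).

    recalls: increasing recall values; precisions: matching precision
    values. Precision is first replaced by its monotone envelope
    (max precision at any recall >= r), as in all-point interpolation.
    """
    prec = list(precisions)
    for i in range(len(prec) - 2, -1, -1):
        prec[i] = max(prec[i], prec[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, prec):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# A perfect detector holds precision 1 at every recall level -> AP = 1.
print(average_precision([0.5, 1.0], [1.0, 1.0]))  # 1.0
```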

#### *6.3. Ablation Study*

In this part, we first introduce two robust rotation detectors as baselines. On this basis, a series of component-wise experiments on the RSSD, the SSDD, and the HRSC2016 are carried out to validate the effectiveness of the proposed TDIoU loss and AW-FPN.

#### 6.3.1. Baseline Rotation Detectors

Two rotation detectors, RetinaNet [17] and CS2A-Net [67], are selected as baselines in our experiments. As a typical single-stage detector, RetinaNet consists of a backbone network, a feature pyramid network, and detection heads. It uses a ResNet [57] to generate a multi-scale feature pyramid and attaches a detection head to each pyramid level (*P*<sub>3</sub> to *P*<sub>7</sub>). Each detection head is made up of a classification sub-network and a regression sub-network. To implement a RetinaNet-based rotation detector (RetinaNet-R), we modify the regression output to an OBB (*cx*, *cy*, *w*, *h*, and *θ*) under the long-edge definition, where (*cx*, *cy*), *w*, *h*, and *θ* denote the centroid, the width, the height, and the angle, respectively, and *θ* ∈ [−45°, 135°). Accordingly, the angle *θ* is taken into consideration in the anchor generation. At each pyramid level, we set anchors in three aspect ratios, {1:2, 1:1, and 2:1}, three scales, {2<sup>0</sup>, 2<sup>1/3</sup>, and 2<sup>2/3</sup>}, and six angles, {−45°, −15°, 15°, 45°, 75°, and 105°}. The proposed TDIoU loss and AW-FPN can be easily embedded into RetinaNet-R, as shown in Figure 19a.
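This configuration yields 3 × 3 × 6 = 54 rotated anchors per feature point, which can be enumerated as follows (illustrative sketch; `base_size` is a hypothetical per-level base anchor size, not a value from the paper):

```python
import itertools
import math

# Anchor hyper-parameters used for RetinaNet-R (angles in degrees).
aspect_ratios = [0.5, 1.0, 2.0]                  # 1:2, 1:1, 2:1
scales = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]
angles = [-45, -15, 15, 45, 75, 105]

def anchors_per_point(base_size=32):
    """Enumerate (w, h, theta) for every ratio/scale/angle combination."""
    out = []
    for ratio, scale, theta in itertools.product(aspect_ratios, scales, angles):
        area = (base_size * scale) ** 2
        w = math.sqrt(area / ratio)   # chosen so that h / w = ratio
        h = w * ratio                 # and w * h = area
        out.append((w, h, theta))
    return out

print(len(anchors_per_point()))  # 54
```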


**Figure 19.** Architectures of two baselines. As a plug-and-play scheme, TDIoU loss and AW-FPN can be easily embedded into the above rotation detectors. (**a**) The regression output of RetinaNet is modified to an OBB under long-edge definition. Here, 'C' denotes the number of categories, and 'N' denotes the number of anchors on each feature point; (**b**) the CS2A-Net head consisting of the FAM and ODM can be cascaded to improve accuracy. The number of cascade heads is set to 2 by default.

The CS2A-Net is an advanced rotation detector based on the RetinaNet architecture. Its detection head consists of a feature alignment module (FAM) and an oriented detection module (ODM), which can be cascaded to improve accuracy. The FAM uses an anchor refinement network (ARN) to generate refined rotated anchors, and then sends refined anchors and input features to an alignment convolution layer (ACL) to learn aligned features. In ODM, the active rotating filter (ARF) learns orientation-sensitive features, and then a pooling operation extracts the orientation-invariant features for classification and regression. Our TDIoU loss and AW-FPN can also be integrated into CS2A-Net, as shown in Figure 19b.

The multi-task loss function of two baseline detectors is defined as follows:

$$L = \frac{\lambda\_1}{N} \sum\_{n=1}^{N} S\_n L\_{reg}\left(B\_n^{pb}, B\_n^{gt}\right) + \frac{\lambda\_2}{N} \sum\_{n=1}^{N} L\_{cls}\left(p\_n^{pb}, p\_n^{gt}\right) \tag{31}$$

where λ<sub>1</sub> and λ<sub>2</sub> indicate the loss balance hyper-parameters and are set to 1 by default, *N* denotes the number of anchors in a mini-batch, and S<sub>*n*</sub> is a binary value (S<sub>*n*</sub> = 1 for positive anchors and S<sub>*n*</sub> = 0 for negative anchors). The vectors B<sub>*n*</sub><sup>pb</sup> and B<sub>*n*</sub><sup>gt</sup> denote the locations of the *n*-th predicted box and the corresponding ground truth, respectively. The values p<sub>*n*</sub><sup>pb</sup> and p<sub>*n*</sub><sup>gt</sup> indicate the predicted classification score and the true label of the *n*-th object, respectively. In our experiments, the regression loss *L<sub>reg</sub>* is set as the smooth L1 loss, the TDIoU loss, etc. The classification loss *L<sub>cls</sub>* is set as the focal loss [17], as follows:

$$L\_{\text{focal}}(p\_t) = -\alpha\_t (1 - p\_t)^{\gamma} \log(p\_t) \tag{32}$$

where (1 − *p<sub>t</sub>*)<sup>γ</sup> and *α<sub>t</sub>* are two modulation factors that satisfy the following conditions:

$$p\_t = \begin{cases} p\_n^{pb} & p\_n^{gt} = 1 \\ 1 - p\_n^{pb} & \text{otherwise} \end{cases} \text{ and } \alpha\_t = \begin{cases} \alpha, & p\_n^{gt} = 1 \\ 1 - \alpha, & \text{otherwise} \end{cases} \tag{33}$$

where α and γ are two hyper-parameters, which are set to 0.25 and 2, respectively, by default.
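For reference, the focal loss of (32) and (33) can be written in a few lines of NumPy. This is a minimal sketch; the function name and vectorized form are ours and not taken from the paper's codebase:

```python
import numpy as np

def focal_loss(p_pred, y_true, alpha=0.25, gamma=2.0):
    """Focal loss of Eqs. (32)-(33) for binary anchor labels.

    p_pred : predicted probability of the positive class, in (0, 1)
    y_true : 1 for a positive anchor, 0 otherwise
    """
    p_t = np.where(y_true == 1, p_pred, 1.0 - p_pred)
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma down-weights easy, well-classified examples
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# A confident correct prediction contributes almost nothing to the loss...
easy = focal_loss(np.array([0.99]), np.array([1]))
# ...while a confident mistake is penalized heavily.
hard = focal_loss(np.array([0.99]), np.array([0]))
```

The modulation factor is what lets one-stage detectors such as RetinaNet cope with the extreme foreground/background imbalance among anchors.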

6.3.2. Effectiveness of the TDIoU Loss

We evaluate the TDIoU loss with two baseline detectors on three datasets, as shown in Tables 4–6. Both detectors adopt ResNet50 and the original FPN. To ensure the objectivity and richness of the ablation study, we implement two approximate IoU losses (IoU-smooth L1 and GWD loss) and five IoU-based losses (IoU, GIoU, CIoU, EIoU, and CDIoU loss) to compare the performance of different regression losses. Only the regression loss is modified, and all other settings remain intact for fair comparisons.

**Table 4.** Comparison of different regression losses on RSSD. Here, R-50-FPN denotes ResNet50 with FPN, **LMI** and **ABD** denote the loss-metric inconsistency and angular boundary discontinuity, respectively, and a mark in these columns indicates that the method suffers from the corresponding issue. **Training** represents the total training time (in hours) for 72 epochs with a single GPU and a batch size of 2. Bold items are the best result of each column.


Table 4 shows results on our RSSD. Compared with smooth L1, RetinaNet-R based on approximate IoU losses improves the AP of inshore scenes, offshore scenes, and the entire test set by 1.19~3.98%, 0.99~1.98%, and 1.09~2.79%, respectively. Conventional IoU-based losses improve the AP by 3.07~4.24%, 1.73~2.08%, and 2.23~2.92%, respectively. The proposed TDIoU loss improves the AP by 5.38%, 2.71%, and 3.80%, respectively. Even with the advanced CS2A-Net, TDIoU loss still improves the inshore AP, offshore AP, and test AP by 4.18%, 0.43%, and 1.70%, respectively, indicating that our method dramatically improves ship detection performance, especially in complex inshore scenes. Similar experimental conclusions are also reflected in the other two datasets. Table 5 shows results on the SSDD. The TDIoU-based RetinaNet-R is improved by 8.50%, 1.01%, and 3.42% on inshore AP, offshore AP, and test AP, respectively, compared to the approximate IoU losses (2.24~5.88%, 0.27~0.61%, and 0.90~2.25%) and the traditional IoU-based losses (2.94~6.75%, 0.39~0.67%, and 1.17~2.61%). When CS2A-Net is used as the base detector, our TDIoU loss further improves the AP by 3.47%, 0.73%, and 1.51%. Table 6 shows results on the HRSC2016. The RetinaNet-R achieves the best accuracy by using the TDIoU loss (i.e., improvement by 3.49% and 4.39% in terms of the 2007 and 2012 evaluation metrics, respectively). Similarly, our TDIoU loss achieves considerable improvement on CS2A-Net, with an increase of 0.32% and 2.48%, respectively.


**Table 5.** Comparison of different regression losses on SSDD.

**Table 6.** Comparison of different regression losses on HRSC2016. Here, AP07 and AP12 indicate the PASCAL VOC 2007 and 2012 metrics, respectively.


Figure 20 shows P–R curves of RetinaNet-R using different regression losses on three datasets. The area under the P–R curve of TDIoU loss is always larger than that of the other losses, indicating that the overall performance of our method is better. The possible causes are summarized as follows: (1) Compared to the approximate IoU losses, we fundamentally eliminate the loss-metric inconsistency by introducing the differentiable rotational IoU algorithm. (2) In contrast to the parameter-based IoU losses, the TDIoU penalty term effectively reflects the overall difference between OBBs by measuring the distance between sampling points. In Table 4, to further investigate the effect of the angle parameter, we directly introduce it into the EIoU penalty term, which is named AIoU loss. However, AIoU loss is prone to non-convergence in the training phase, probably because the direct introduction of the angle parameter brings back the boundary discontinuity. On the contrary, the distance-based penalty term can reflect angle differences without directly employing the angle parameter. (3) Compared to the CDIoU loss, the introduction of the centroid distance is able to speed up bounding box alignment. Figure 21 displays different regression loss curves in the training phase. The TDIoU loss directly minimizes the distance between corresponding centroids and vertices of two boxes and, thus, converges much faster than other losses. Moreover, since we use the triangle distance rather than the diagonal length of the smallest enclosing box to construct the denominator of the penalty term, TDIoU loss reduces the training time by nearly half compared with other IoU-based losses, indicating that the computational complexity of our method is greatly reduced. All in all, the proposed TDIoU loss is more applicable to rotated bounding box regression.
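The rotational IoU computation that underlies these losses can be illustrated with a plain NumPy sketch (not the authors' implementation): Sutherland–Hodgman clipping yields the intersection polygon of two convex OBBs, and the Shoelace formula gives its area:

```python
import numpy as np

def cross2(u, v):
    """z-component of the 2D cross product u x v."""
    return u[0] * v[1] - u[1] * v[0]

def shoelace_area(poly):
    """Polygon area via the Shoelace formula; `poly` is an (N, 2) array."""
    x, y = poly[:, 0], poly[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def clip(subject, a, b):
    """Sutherland-Hodgman step: keep the part of `subject` left of edge a->b."""
    out = []
    for i in range(len(subject)):
        p, q = subject[i], subject[(i + 1) % len(subject)]
        sp, sq = cross2(b - a, p - a), cross2(b - a, q - a)
        if sp >= 0:
            out.append(p)
        if sp * sq < 0:  # the edge p->q crosses the clip line
            out.append(p + sp / (sp - sq) * (q - p))
    return out

def rotated_iou(poly1, poly2):
    """IoU of two convex quadrilaterals given as CCW corner lists."""
    inter = [np.asarray(p, float) for p in poly1]
    poly2 = np.asarray(poly2, float)
    for i in range(len(poly2)):
        if not inter:
            break
        inter = clip(inter, poly2[i], poly2[(i + 1) % len(poly2)])
    ai = shoelace_area(np.asarray(inter)) if len(inter) >= 3 else 0.0
    a1, a2 = shoelace_area(np.asarray(poly1, float)), shoelace_area(poly2)
    return ai / (a1 + a2 - ai)

# Two unit squares offset by half a side: intersection 0.25, union 1.75.
box1 = [(0, 0), (1, 0), (1, 1), (0, 1)]
box2 = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5)]
iou = rotated_iou(box1, box2)  # 1/7
```

Because both the clipped area and the intersection points are polynomial in the box corners, this computation can be made differentiable, which is the property the paper exploits.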

**Figure 20.** P–R curves of RetinaNet-R based on different regression losses on (**a**) RSSD, (**b**) SSDD, and (**c**) HRSC2016.

**Figure 21.** Regression loss curves of RetinaNet-R on RSSD. (**a**) Overall comparison of different regression losses; (**b**) comparison between TDIoU loss and approximate IoU losses; (**c**) comparison between TDIoU loss and other IoU-based losses.

#### 6.3.3. Effectiveness of the AW-FPN

Since the proposed AW-FPN combines both multiple skip-scale connections and the attention-weighted feature fusion (AWF) strategy, we want to understand their respective contributions to accuracy improvement. Hence, we implement seven feature fusion networks with different connection pathways and fusion methods to verify the effectiveness of the AW-FPN, as shown in Tables 7–9. Notably, to eliminate the effect of irrelevant factors, the structure of all feature fusion networks is used only once.

Table 7 shows the results on our RSSD. The comparison between different connection pathways shows that the traditional FPN is inevitably limited by a single top-down information flow and achieves the lowest accuracy. The PANet with an extra bottom-up pathway improves by 0.68%, 0.59%, and 0.63% on inshore AP, offshore AP, and test AP, respectively. The BiFPN with single transverse skip-scale connections and the linear weighted fusion (LWF) strategy improves the AP by 1.87%, 1.13%, and 1.35%, respectively. For the AW-FPN with both transverse and longitudinal skip-scale connections, even the simplest additive fusion method can achieve performance similar to that of BiFPN. When using the same LWF method as BiFPN, the AW-FPN improves the AP by 2.38%, 1.34%, and 1.58%, indicating that longitudinal skip-scale connections are also crucial for feature fusion. For comparisons between different fusion methods, AW-FPN improves by 2.91%, 1.63%, and 1.97% when using AFF (channel attention only) and by 4.09%, 2.06%, and 2.84% when using the proposed AWF (both channel and spatial attention), indicating that attention-based fusion methods outperform linear fusion methods and that, to generate non-linear fusion weights, it is better to use both channel and spatial attention than channel attention alone. When CS2A-Net is used as the base detector, our ultimate AW-FPN further improves the AP by 3.96%, 0.38%, and 1.50%. Similar experimental results are obtained on the other two datasets. From Tables 8 and 9, for the SSDD and HRSC2016, the proposed AW-FPN achieves the most outstanding performance on RetinaNet-R and a considerable improvement on the advanced CS2A-Net, which proves the effectiveness of our approach.

**Table 7.** Comparison of different feature fusion networks on RSSD. The structure of all feature fusion networks is used only once in our experiment. Here, ADD represents the direct addition of feature maps, while LWF indicates the linear weighted fusion method in BiFPN [27].


**Table 8.** Comparison of different feature fusion networks on SSDD.


**Table 9.** Comparison of different feature fusion networks on HRSC2016.


Figure 22 shows the P–R curves of RetinaNet-R with different feature fusion networks. The P–R curve of AW-FPN is always higher than that of other methods. This may be because multiple skip-scale connections enhance semantic interactions between features of different resolutions and scales, which contributes to the complement of context information. In addition, in contrast to other linear fusion methods and the AFF using only channel attention, the proposed AWF aggregates global and local feature contexts in both the multi-scale channel attention module (MCAM) and the multi-scale spatial attention module (MSAM) to generate higher quality fusion weights. Figure 23 shows the feature visualization of different feature fusion networks. The region of interest (ROI) is highlighted in the feature heat map. The ROI in the feature maps generated by other methods is usually overlarge and contains considerable background clutter. In contrast, the contour and location of ships in the feature map generated by our AW-FPN is more distinct and accurate, which helps the detectors to focus more on the real ship targets rather than background clutter, and to learn more useful context information.
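The idea of attention-weighted fusion can be sketched schematically in NumPy. This is a simplified stand-in, not the actual MCAM/MSAM modules: a non-linear weight map built from channel and spatial context blends two feature maps as a soft selection:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_weighted_fusion(x, y):
    """Schematic fusion of two (C, H, W) feature maps.

    A channel descriptor (global average over H, W) and a spatial
    descriptor (average over channels) are combined into a non-linear
    weight map w in (0, 1); the output is w * x + (1 - w) * y.
    """
    s = x + y                                  # joint context of both inputs
    chan = s.mean(axis=(1, 2), keepdims=True)  # (C, 1, 1) channel context
    spat = s.mean(axis=0, keepdims=True)       # (1, H, W) spatial context
    w = sigmoid(chan + spat - s.mean())        # broadcasts to (C, H, W)
    return w * x + (1.0 - w) * y

x, y = np.random.rand(8, 4, 4), np.random.rand(8, 4, 4)
fused = attention_weighted_fusion(x, y)  # same shape as the inputs
```

Since the weight map lies strictly between 0 and 1, each fused value is a convex combination of the two inputs, i.e., a per-position soft feature selection rather than a fixed linear blend.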

**Figure 22.** P–R curves of RetinaNet-R with different feature fusion networks on (**a**) RSSD, (**b**) SSDD, and (**c**) HRSC2016.

#### *6.4. Comparison with the State-of-the-Art*

We embed the proposed AW-FPN into CS2A-Net and train it with our TDIoU loss, and then compare our approach with the state-of-the-art methods on three datasets.

#### 6.4.1. Results on the RSSD

Table 10 provides a quantitative comparison of different methods on RSSD. As can be seen, the latest two-stage detection methods, such as CSL, SCRDet++, and ReDet, generally achieve outstanding performance. However, they always adopt complex structures in exchange for improved accuracy at the expense of detection efficiency. Lately, some single-stage detection methods, such as R3Det, GWD, and CS2A-Net, have been presented, which show competitive performance and efficiency on RSSD. Our method can further improve the accuracy of these rotation detectors and has a minimal impact on detection efficiency. As per Table 10, the proposed approach achieves 75.41%, 96.62%, and 87.87% accuracy in terms of inshore AP, offshore AP, and test AP on CS2A-Net, respectively, without using multi-scale training and testing, which is already extremely close to the performance of the advanced ReDet and GWD. When employing a stronger backbone (i.e., ResNet101) and multi-scale training and testing, our approach achieves state-of-the-art performance, with APs of 77.65%, 97.35%, and 89.18%, respectively, which are 1.98%, 0.65%, and 1.11% higher than those of the suboptimal method (i.e., GWD). Furthermore, the inference speed of our method reaches 12.1 fps, which is 11.1 fps and 2.5 fps faster than that of ReDet and GWD, respectively. Compared to the original CS2A-Net, our approach trades off a speed loss of only 1.1 fps for significant accuracy gains of 3.48%, 0.87%, and 1.85%.

**Table 10.** Comparison with state-of-the-art methods on RSSD. Here, **MS** indicates the multi-scale training and testing, **FPS** is obtained by calculating the overall inference time and the number of images, TDIoU + AW-FPN represents the CS2A-Net detector based on TDIoU loss and AW-FPN, R-50 refers to ResNet50 (likewise R-101, R-152), and ReR-50 and H-104 denote ReResNet50 [68] and Hourglass104, respectively [69].


Figure 24 shows qualitative results of different methods on RSSD. As per the results of the offshore scene containing multiple ships (the first row), the other four methods detect islands and reefs incorrectly as ships, while our method is more robust in distinguishing small ships from background components. For the complex inshore scenes (the second row to the fourth row), the results of other methods include false alarms and leave some vessels undetected. In contrast, our method succeeds in detecting all ships and locating them more precisely, especially for densely arranged ships close to man-made facilities.

#### 6.4.2. Results on the SSDD

Table 11 shows experimental results of different methods on the SSDD. Since SSDD contains few SAR images and the scenes are relatively simple, the ship detection accuracy is generally high. As shown in Table 11, based on CS2A-Net (R-50), our approach achieves 80.75%, 99.64%, and 94.05% of inshore AP, offshore AP, and test AP, respectively. When using ResNet101 as the backbone network, the AP of our approach reaches 84.34%, 99.71%, and 95.16%, compared to the state-of-the-art detectors ReDet (82.80%, 99.18%, and 94.27%) and GWD (81.99%, 99.66%, and 94.35%). Moreover, our approach improves the overall accuracy by 1.26% and 0.44% compared to BiFA-YOLO and R2FA-Det, respectively, and the inference speed by 5.5 fps compared to the suboptimal R2FA-Det, indicating that the proposed method achieves the best performance and satisfies high detection efficiency.

**Figure 24.** Detection results of different methods on RSSD. (**a**) Ground truth (GT); (**b**) RetinaNet-R; (**c**) CS2A-Net; (**d**) ReDet; (**e**) GWD; (**f**) TDIoU + AW-FPN (ours). Green and red boxes represent real ship targets and detection results, respectively.

**Table 11.** Comparison with state-of-the-art methods on SSDD. Here, V-16 and C-53 denote VGG16 [75] and CSPDarknet53 [76]. The method with **\*** indicates that its results are from the corresponding paper. Here, (<800) indicates that the long side of images is less than 800 pixels.


Figure 25 visualizes some detection results of different methods on the SSDD. In the complex inshore scenes, the other three methods suffer from missed and false detections under background clutter interference. In contrast, our approach is highly robust and displays superiority in detecting densely distributed small ships.

**Figure 25.** Detection results of different methods on SSDD. (**a**) GT; (**b**) CS2A-Net; (**c**) GWD; (**d**) TDIoU + AW-FPN (ours).

#### 6.4.3. Results on the HRSC2016

To verify the effectiveness and robustness of our approach in optical remote sensing scenarios, we conduct experiments with state-of-the-art methods on the HRSC2016, which contains a great number of ships with large aspect ratios and arbitrary orientations. As shown in Table 12, our approach achieves 90.71% and 98.65% accuracy on the metrics AP07 and AP12, respectively, outperforming other comparison methods. Compared with the suboptimal approach (i.e., ReDet), the proposed method improves the accuracy by 0.25% and 1.02%. In addition, the inference speed of our method is 16.9 fps, which is much faster than that of the two-stage method ReDet (<1.0 fps). As per the above results, our method shows excellent generalization ability in other rotation detection scenarios.

**Table 12.** Comparison with state-of-the-art methods on HRSC2016. The method with **\*** indicates that its results are from the corresponding paper.


To evaluate the capability of our method to detect ships with extreme aspect ratios, we choose three images containing ships with large aspect ratios. As shown in Figure 26, our approach has fewer false alarms than the other methods. In addition, the position and orientation of the predicted box generated by our method are much closer to those of the ground truth.

**Figure 26.** Detection results of different methods on HRSC2016. (**a**) GT; (**b**) CS2A-Net; (**c**) GWD; (**d**) TDIoU + AW-FPN (ours).

Figure 27 displays P–R curves of different methods on RSSD, SSDD, and HRSC2016. It can be found that the P–R curve of our method is almost always higher than those of the other methods. Through all the above experiments and discussions, we can draw the conclusion that the proposed TDIoU loss and AW-FPN can improve the detection accuracy of arbitrary-oriented ships in both SAR scenes and optical remote sensing scenes, especially in the case of extreme scale and aspect ratio variations. This may be attributed to the fact that TDIoU loss fundamentally eliminates the loss-metric inconsistency and angular boundary discontinuity, so as to guide the rotation detector to achieve more accurate bounding box regression. Furthermore, the proposed AW-FPN is improved in terms of both the connection pathway and the fusion method, enabling high-quality semantic interactions and soft feature selections between features of inconsistent resolutions and scales.

**Figure 27.** P–R curves of different methods on (**a**) RSSD, (**b**) SSDD and (**c**) HRSC2016.

#### **7. Conclusions**

In this paper, a unified framework combining TDIoU loss, AW-FPN, and RSSD is proposed to improve the capability of rotation detectors in recognizing and locating ships in SAR images. (1) The rotational IoU algorithm based on the Shoelace formula opens up the possibility of using IoU-based loss for rotated bounding box regression. On this basis, an effective TDIoU penalty term is designed to overcome the defects of existing IoU-based losses and solve the problems caused by angle regression. (2) AW-FPN improves on previous methods in terms of both connection pathways and fusion methods. Skip-scale connections enhance semantic interactions between multi-scale features, and the AWF generates attention fusion weights via MCAM and MSAM to encode emphasized and suppressed positions in feature maps, making detectors focus more on real ship targets. (3) We construct a challenging benchmark, namely RSSD, for arbitrary-oriented SAR ship detection. Ships in RSSD not only differ significantly in orientation but also feature multi-scale characteristics. In addition, 15 baseline results are provided for research. (4) Extensive experiments are conducted on three datasets. When using TDIoU loss and AW-FPN, even the advanced CS2A-Net improves the AP by 1.85%, 1.69%, and 0.54% on RSSD, SSDD, and HRSC2016, respectively, fully demonstrating the effectiveness and robustness of our approach.

Our future work is summarized as follows:


**Author Contributions:** Conceptualization, R.G.; methodology, R.G.; software, R.G.; validation, R.G.; formal analysis, R.G.; investigation, R.G. and Z.X.; resources, R.G. and Z.X.; data curation, R.G., Z.X. and Q.X.; writing—original draft preparation, R.G.; writing—review and editing, R.G., Z.X., K.H. and Q.X.; visualization, R.G.; supervision, Z.X. and K.H.; project administration, R.G.; funding acquisition, Z.X. and K.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by National Key Research and Development Program, grant number 2019YFB1600605; The Youth Fund from National Natural Science Foundation of China, grant number 62101316; Shanghai Sailing Program, grant number 20YF1416700.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Deep Learning Approach for Object Classification on Raw and Reconstructed GBSAR Data**

**Marin Kačan \*, Filip Turčinović, Dario Bojanjac and Marko Bosiljevac**

Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia

**\*** Correspondence: marin.kacan@fer.hr

**Abstract:** The availability of low-cost microwave components today enables the development of various high-frequency sensors and radars, including Ground-based Synthetic Aperture Radar (GBSAR) systems. Similar to optical images, radar images generated by applying a reconstruction algorithm to raw GBSAR data can also be used for object classification. The reconstruction algorithm provides an interpretable representation of the observed scene, but may also negatively influence the integrity of the obtained raw data due to the approximations applied. In order to quantify this effect, we compare the results of a conventional computer vision architecture, ResNet18, trained on reconstructed images versus one trained on raw data. In this process, we focus on the task of multi-label classification and describe the crucial architectural modifications that are necessary to process raw data successfully. The experiments are performed on a novel multi-object dataset, RealSAR, obtained using a newly developed 24 GHz GBSAR system, where the radar images in the dataset are reconstructed using the Omega-K algorithm applied to the raw data. Experimental results show that the model trained on raw data consistently outperforms the image-based model. We provide a thorough analysis of both approaches across hyperparameters related to model pretraining and the size of the training dataset. In conclusion, this shows that processing raw data provides better overall classification accuracy and is inherently faster, since there is no need for image reconstruction, making it a useful tool in industrial GBSAR applications where processing speed is critical.

**Keywords:** object classification; radar image reconstruction; convolutional neural networks; ResNet18; GBSAR; Omega-K algorithm

#### **1. Introduction**

Synthetic Aperture Radar (SAR) technology is crucial in many modern monitoring applications where optical images are not sufficient or where restrictions in terms of light conditions or cloud coverage play a major role. To reach adequate resolution, a conventional antenna-based radar system would need a very large sensor antenna. The resolution of an optical image obtained using the Sentinel-2 satellite, with a mean orbital altitude of 786 km, is around 20 m [1]. In order to achieve an equal resolution using a C-band sensor (common in SAR satellites) from that altitude, the sensor antenna would have to be over 2 km long, which is not practical. To virtually extend (synthesize) the length of the antenna (or antenna array), the SAR concept utilizes the sensor's movement to combine data acquired from multiple positions and reconstruct a radar image of the observed area. In the Sentinel-1 SAR satellite, launched in 2014, the movement of the 12.3-m-long antenna provides coverage of a 400 km wide area with a spatial resolution of 5 m [1].

The same principle can be applied in a terrestrial remote sensing imaging system: Ground-based SAR (GBSAR). The main concept of GBSAR is based on a sensor antenna that radiates perpendicular to its moving path while the sensor moves along a ground track, covering the area in front of it. Even though, in many applications, distances between sensors and observed objects can be up to several kilometers, a smaller distance in combination with a wider frequency bandwidth can be interesting for sensing small deformations. Satellite frequencies above the X band are rarely used in SAR due to H<sub>2</sub>O absorption. However, since that is not a strict limitation for terrestrial GBSAR, higher frequencies and wider bandwidths can be used to achieve higher resolutions in radar images, which is essential for anomaly detection and small object recognition.

**Citation:** Kačan, M.; Turčinović, F.; Bojanjac, D.; Bosiljevac, M. Deep Learning Approach for Object Classification on Raw and Reconstructed GBSAR Data. *Remote Sens.* **2022**, *14*, 5673. https://doi.org/10.3390/rs14225673

Academic Editors: Tianwen Zhang, Tianjiao Zeng and Xiaoling Zhang

Received: 13 September 2022 Accepted: 2 November 2022 Published: 10 November 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Such images, in combination with machine learning algorithms, enable surface deformation monitoring [2–6], snow avalanche identification [7], bridge [8] and dam structure monitoring [9], open pit mine safety management [10], and terrain classification [11]. Moreover, machine learning enables object classification using only radar images. Applications include the classification of military targets from the MSTAR dataset [12–19], ship detection [20–24], and subsurface object classification with an ultra-wideband GBSAR [25]. Radars with 32–36 GHz and 90–95 GHz frequency bands have been shown to accurately locate small metallic targets in the near-field region [26]. To facilitate deep learning research for SAR data, LS-SSDD-v1.0 [27], a dataset for small ship detection from SAR images was released, together with many standardized baselines. This resulted in subsequent advances in deep learning methods for ship classification [28,29], detection [30], and instance segmentation [31,32] in SAR images.

On the other hand, object classification on GBSAR data has not been explored sufficiently since most of the aforementioned work was focused on solving problems encountered in typical GBSAR applications. The potential of GBSAR systems lies in industrial applications, such as monitoring and object detection in harsh environments, and classification of concealed objects. We perform object classification on two modalities of GBSAR data: images reconstructed using the Omega-K algorithm and raw GBSAR data. Both approaches are based on a popular convolutional neural network (CNN) architecture—ResNet18—with certain modifications to accommodate processing raw GBSAR data.

A crucial part of the image reconstruction algorithm is an approximation step, which negatively influences the integrity of the data. We tested and quantified its impact on classification results by comparing multiple models based on raw and reconstructed GBSAR data. The idea of using raw GBSAR data is similar to various end-to-end learning approaches [33], which became more prevalent with the advent of deep learning. In such a paradigm, the model implicitly learns optimal representations of raw data [34], without any explicit transformations during data preprocessing.

The rest of the paper is organized as follows: Section 2 introduces the theory behind GBSAR and the radar image reconstruction algorithm. It also covers the implementation used in generating measurement sets presented in Section 3. Section 4 describes relevant deep learning concepts, such as feature engineering, end-to-end learning, and multi-label classification. It describes two concrete approaches to address GBSAR object classification, which correspond to the two input modalities. The experimental setup and evaluation results are described and analyzed in Section 5. Further discussions and interpretations are given in Section 6. Section 7 provides conclusions and future work.

#### **2. GBSAR Theory and Implementation**

#### *2.1. GBSAR*

The main idea of GBSAR is, following the SAR concept, to virtually extend the sensor antenna by utilizing sensor movement along a set track while it emits and receives EM waves. The set of measurements enables extraction of the distances to the observed object at each sensor position and, consequently, radar image reconstruction. The range and azimuth resolution of the radar image are mostly determined by the sensor frequency bandwidth and the length of the used track, respectively. Hence, a wider bandwidth and a longer track (for example, some sort of rail) provide better range and azimuth resolution [35].
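The bandwidth-resolution relationship can be made concrete with the standard FMCW range-resolution formula ΔR = c/(2B). The sketch below evaluates it for illustrative bandwidths; the values are not those of a specific system described here:

```python
# Theoretical range resolution of an FMCW radar: dR = c / (2 * B).
C = 299_792_458.0  # speed of light, m/s

def range_resolution(bandwidth_hz):
    """Range cell size for a sweep bandwidth B given in Hz."""
    return C / (2.0 * bandwidth_hz)

# Doubling the sweep bandwidth halves the range cell size.
res_1ghz = range_resolution(1e9)  # ~0.15 m
res_2ghz = range_resolution(2e9)  # ~0.075 m
```

This is why the wider bandwidths available to terrestrial GBSAR translate directly into finer range resolution, while azimuth resolution is governed separately by the synthesized track length.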

There are two operational modes regarding sensor movement: continuous and stop-and-go mode. In the former, the sensor moves with constant speed from the start to the end of the rail, while in the latter, sensor movement is paused at each sampling position to acquire data without the impact of motion. In stop-and-go mode, the step size and total aperture length can be set precisely, but it should be emphasized that the chosen values affect azimuth resolution. In both modes, the moving sensor has to repeatedly provide distance information, which is commonly obtained using the Frequency Modulated Continuous Wave (FMCW) radar principle as the base radar system due to its relatively simple and well-known implementation.

FMCW radar radiates a continuous signal whose operating frequency changes during transmission. The operating frequency sweeps through a previously defined frequency band *B* following a known function, which is usually linear (most commonly, a sawtooth-type function is used [36]). The same frequency sweep is observed in the echo (received) signal, which is delayed by the time required for the signal to travel to the object and be reflected back. The emitted and received signals are then mixed in order to eliminate the high-frequency content in the received signal and use the low-frequency difference to extract the delay. To provide the analytical basis, we start with a typical sawtooth wave, which consists of a periodic repetition (period *T*) of upchirp (frequency increases) or downchirp (frequency decreases) signals. The upchirp's frequency changes according to a linear function with a rate of change equal to *γ* = *B*/*T*. The carrier frequency is denoted by *f<sub>c</sub>*. In complex form, the emitted signal is given by the function

$$S\_t(t) = e^{i(2\pi f\_c t + \pi \gamma t^2)}.\tag{1}$$

The received signal is delayed by the time it takes the signal to travel to the observed object and return. The geometry of a standard GBSAR system and the position of the object are described in Figure 1. The time delay between transmission and detection is denoted by *t<sub>d</sub>*,

$$S\_r(t) = S\_t(t - t\_d). \tag{2}$$

After the mixing process, we obtain the signal *S* without the high-frequency content 2*π f<sub>c</sub>t*:

$$S = S\_r(t)\,\overline{S\_t(t)} = S\_t(t - t\_d)\,\overline{S\_t(t)} = e^{-i2\pi t\_d(f\_c + \gamma t)}\,e^{i\pi\gamma t\_d^2}. \tag{3}$$
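The mixing identity in (3) can be checked numerically; the parameter values below are illustrative only and are not those of the system used later in the paper:

```python
import numpy as np

# Illustrative FMCW parameters (not from the paper's 24 GHz system).
f_c = 24e9            # carrier frequency, Hz
B, T = 1e9, 1e-3      # sweep bandwidth and chirp period
gamma = B / T         # chirp rate of change
t_d = 2 * 30.0 / 3e8  # two-way delay for an object 30 m away

t = np.linspace(0.0, T, 1000)
# Emitted chirp, Eq. (1), and its delayed echo, Eq. (2).
s_t = np.exp(1j * (2 * np.pi * f_c * t + np.pi * gamma * t**2))
s_r = np.exp(1j * (2 * np.pi * f_c * (t - t_d) + np.pi * gamma * (t - t_d) ** 2))

mixed = s_r * np.conj(s_t)  # mixing removes the high-frequency content
# Closed form of Eq. (3): a low-frequency beat plus a constant phase.
model = np.exp(-1j * 2 * np.pi * t_d * (f_c + gamma * t)) * np.exp(1j * np.pi * gamma * t_d**2)
```

`mixed` and `model` agree to numerical precision, so all object information that survives mixing sits in the low-frequency beat term, which is what the reconstruction algorithms operate on.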

All information about the observed object that can be extracted by the radar is stored in the signal *S*, and reconstruction algorithms operate directly on *S*. The signal *S* can be interpreted in various ways, and the algorithm used for reconstruction depends on the chosen interpretation. Using the Fourier transform in the spatial domain, *S* can be interpreted as the spatial spectrum in the azimuth variable *K<sub>x</sub>* and range variable *K<sub>r</sub>* [37].

$$S(K\_r, K\_x) = e^{-iK\_x x\_0}\, e^{-i\sqrt{K\_r^2 - K\_x^2}\, y\_0}\, e^{i\frac{vc\Delta K\_r K\_x}{4\pi\gamma}}\, A e^{-i\frac{\pi}{4}}.\tag{4}$$

The position of the object is stored in the *x*<sub>0</sub> and *y*<sub>0</sub> coordinates in the first two complex exponential functions. The coordinates are multiplied by the wave vector variables *K<sub>x</sub>* and √(*K<sub>r</sub>*<sup>2</sup> − *K<sub>x</sub>*<sup>2</sup>). The third exponential function is a residual of the finite recording step size and the radar's finite bandwidth *B*; it manifests through the Δ*K<sub>r</sub>* parameter in the Fourier domain. The radar moves with a constant speed *v*, *c* is the speed of light, and *γ* represents the chirp's rate of change.

We note that this is only one of many ways of interpreting the signal *S*, and the choice of reconstruction algorithm depends on this interpretation. This kind of interpretation leads to frequency-domain reconstruction algorithms. Time-domain algorithms usually have higher computational complexity but are more robust to irregular sensor movement during radar operation [38].

**Figure 1.** Ground-based SAR geometry. Coordinates (*x*0, *y*0) represent position of a point object in real space. In the dashed square, coordinates of the Fourier domain space or *K*-space are presented. *Kr* is a wave vector coordinate in the range direction while *Kx* is a wave vector coordinate in the *x* direction.

#### *2.2. Image Reconstruction*

Image reconstruction algorithms are used to generate a radar image from the signal *S*. In this work, signals were recorded using a GBSAR radar operating in stop-and-go mode. The received signal is interpreted according to (3), and image reconstruction is based on the Omega-K algorithm [39]. This algorithm has many different implementations for images generated by GBSAR [37,40], but the main idea of this type of algorithm is to apply several processing steps to the spatial spectrum and then use the inverse Fourier transform to extract the position of the object.

The spatial spectrum (4) can be interpreted as a product of several functions carrying information about the object and about various side effects of SAR acquisition. The task of the Omega-K algorithm is to separate these two parts and filter out the information about side effects. The second part of the spatial spectrum *S*,

$$e^{i\frac{vc\Delta K\_r K\_x}{4\pi\gamma}},\tag{5}$$

represents the residual frequency modulation due to the finite step between two acquisitions. If we filter out this part, the remaining part of the spatial spectrum contains information about the spatial coordinates of the object,

$$e^{-iK\_x x\_0} e^{-i\sqrt{K\_r^2 - K\_x^2}\, y\_0},\tag{6}$$

which can be extracted using the inverse Fourier transform. This yields *δ*(*x* − *x*0, *y* − *y*0), i.e., the object position. The constant

$$A e^{-i\frac{\pi}{4}}\tag{7}$$

does not affect the image reconstruction process.

Every Omega-K algorithm is a discrete implementation of the previously described steps. Implementations differ in the way they treat the spatial spectrum *S* in order to separate the information about the position of the object from the various side effects of image acquisition. They usually include a step for residual frequency modulation compensation and an interpolation from the spherical to the Cartesian coordinate system, and they end with a 2D inverse Fourier transform (2D IFFT).

Reconstruction algorithms are used to recreate the captured image according to a predefined signal model, such as (3). Ideally, a reconstruction algorithm takes the signal *S* and reproduces the image of the observed object without losing any of the information in *S*. In reality, every step necessarily introduces an error into the reconstruction process and affects the information in *S*. Numerical algorithms, such as the FFT, have well-known and well-described numerical errors, and these steps do not affect the information in *S* significantly. On the other hand, approximations such as frequency modulation compensation, residual-phase compensation, and the interpolation from the spherical to the Cartesian coordinate system are approximation steps tied to the predefined signal model. The algorithms used in those steps are not as thoroughly researched as the FFT, they are not implemented in standard numerical libraries, and they necessarily degrade the information in *S* as a result of their approximate character [37].

Although it is easier for humans to notice useful information and recognize the captured object in the reconstructed image than in the raw data, this does not mean that the reconstructed image contains more information. As described above, the opposite is true, due to all the approximation steps; the reconstructed representation of the signal *S* is simply easier for humans to grasp than the raw data. This means that a neural network classifier operating directly on the raw data has significant potential for higher accuracy than a classifier operating on reconstructed images.

#### *2.3. Implementation*

In order to obtain radar images of objects from the ground while retaining flexibility regarding step size, total aperture length, polarization, and observation angle, a GBSAR named GBSAR-Pi was developed. It is based on a Raspberry Pi 4B (RPi) microcomputer that controls the voltage-controlled oscillator (VCO) in the FMCW module Innosent IVS-362 [41]. The module operates in the 24 GHz band and, besides the VCO, has an integrated transmitting antenna, receiving antenna, and mixer. The mechanical platform, which contains the RPi with an AD/DA converter, the FMCW module, and an amplifier, is custom 3D printed to enable precise linear movement and change of polarization. Polarization can be manually set to horizontal (HH) or vertical (VV). Movement of the platform is provided by a 5 V stepper motor. Figure 2 shows the developed GBSAR-Pi: on one side of the 1.2 m long rail track is the stepper motor, which is controlled by the RPi located in the movable white 3D-printed platform. The platform also contains the FMCW module on the front and a touch-screen display used for measurement parameter setup, monitoring, and ultimately the presentation of the reconstructed radar image.

**Figure 2.** Developed GBSAR-Pi.

The developed GBSAR-Pi works in stop-and-go mode. The process of obtaining the measurements for one radar image is as follows: the RPi, through the DA converter, controls the VCO of the FMCW module to emit up-chirp signals using the *Vtune* pin, as shown in Figure 3. In each step, the module transmits and receives signals and mixes them, and the resulting low-frequency signal is sent to the microcomputer over the in-phase pin (IF1 in Figure 3) and the AD converter. In order to maximize the SNR, the system emits multiple signals in each step and stores the average of the received ones. When the result is stored, the RPi runs the stepper motor via four control pins to move the platform by one step of the previously set size. The enable pin in Figure 3 is used to switch off the power supply of the VCO. The process continues until the last step is reached. After that, the obtained matrix of averaged signals from each step is stored locally and sent to the server.

**Figure 3.** GBSAR-Pi scheme.

The FMCW signals transmitted by the module have a central frequency of 24 GHz, and the sweep range is set to a 700 MHz bandwidth. Hence, from the well-known equation for FMCW range resolution, *Rr* = *c*/(2*B*), the used bandwidth provides a system range resolution of *Rr* = 21.4 cm. Each sweep in the emitted signal is generated by the RPi with 1024 frequency points and has a duration of 166 ms, which gives a chirp rate of *γ* = 4.2 × 10<sup>9</sup> Hz/s. The signal stored in each step is the average of 10 received signals. The FMCW module has an output power (EIRP) of 12.7 dBm. The number of steps and the step size are adjustable; in our measurements there were two cases: a 0.4 cm step size with 160 steps, and a 1 cm step size with 30 steps. The antenna polarization can be changed manually by rotating the case with the FMCW module set on the front side of the platform. Since the antennas are integrated within the module, the system is limited to HH and VV polarizations only. GBSAR-Pi uses two battery sources: the first powers the RPi, the AD/DA converter, and the FMCW module, while the second powers the stepper motor. The whole system is, thus, fully autonomous, with the system being controlled by the display set on the platform and, optionally, an external keyboard.
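As a sanity check, the quoted range resolution and chirp rate follow directly from the stated parameters; a short sketch (variable names are ours):

```python
# Sanity check of the GBSAR-Pi parameters quoted above.
C = 3e8            # speed of light, m/s
B = 700e6          # sweep bandwidth, Hz
T_SWEEP = 0.166    # chirp duration, s (166 ms)

# FMCW range resolution: R_r = c / (2B)
R_r = C / (2 * B)
print(f"range resolution: {R_r * 100:.1f} cm")   # ~21.4 cm

# Chirp rate: bandwidth swept per unit time
gamma = B / T_SWEEP
print(f"chirp rate: {gamma:.2e} Hz/s")           # ~4.2e9 Hz/s

# Total aperture lengths for the two measurement cases
print(160 * 0.004, "m")  # LabSAR: 0.64 m
print(30 * 0.01, "m")    # RealSAR: 0.30 m
```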

Following Section 2.2, the Omega-K image reconstruction algorithm was implemented in the Python programming language with the NumPy and SciPy libraries. The most important methods in the implementation are NumPy's FFT (Fast Fourier Transform) and IFFT (inverse FFT), and SciPy's one-dimensional array interpolation (interp1d). The visualization of a reconstructed radar image was accomplished using the matplotlib (pyplot) and seaborn (heatmap) libraries. The implementation consists of the following steps: Hilbert transform, RVP (residual video phase) compensation, Hanning window, FFT, reference function multiply (RFM), interpolation, IFFT, and visualization. The complete program code with adjustable central frequency, bandwidth, chirp duration, and step size is given in [42].
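The steps above can be sketched as follows. This is a simplified, illustrative reconstruction, not the authors' published code (which is available in [42]): the RVP and RFM steps are folded into a bare Stolt-style interpolation, and all function names, parameter names, and defaults are our assumptions.

```python
import numpy as np
from scipy.signal import hilbert
from scipy.interpolate import interp1d

C = 3e8  # speed of light, m/s

def omega_k(raw, f0=24.0e9, bw=700.0e6, step=0.01):
    """Illustrative Omega-K sketch for dechirped FMCW GBSAR data.

    raw: (n_steps, n_freq) real-valued matrix, one FMCW sweep per row.
    Returns a complex reflectivity image of the same shape.
    """
    n_steps, n_freq = raw.shape
    # 1. Analytic signal (Hilbert transform) and range window (Hanning)
    s = hilbert(raw, axis=1) * np.hanning(n_freq)[None, :]
    # 2. Range wavenumber axis K_r = 4*pi*f / c over the swept band
    f = f0 + np.linspace(-bw / 2, bw / 2, n_freq)
    kr = 4 * np.pi * f / C
    # 3. Azimuth FFT -> spatial spectrum S(K_x, K_r)
    S = np.fft.fftshift(np.fft.fft(s, axis=0), axes=0)
    kx = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(n_steps, d=step))
    # 4. Stolt-style interpolation: resample onto a uniform K_y grid,
    #    where K_y = sqrt(K_r^2 - K_x^2) (spherical -> Cartesian)
    ky_min = np.sqrt(max(kr[0] ** 2 - kx.max() ** 2, 0.0))
    ky = np.linspace(ky_min, kr[-1], n_freq)
    St = np.zeros((n_steps, n_freq), dtype=complex)
    for i in range(n_steps):
        ky_i = np.sqrt(np.maximum(kr ** 2 - kx[i] ** 2, 0.0))
        St[i] = interp1d(ky_i, S[i], bounds_error=False, fill_value=0.0)(ky)
    # 5. 2D inverse FFT -> image of the scene
    return np.fft.ifft2(np.fft.ifftshift(St, axes=0))
```

With the GBSAR-Pi defaults (24 GHz center, 700 MHz bandwidth, 1 cm step), *Kr* is always much larger than *Kx*, so the square root stays real.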

#### **3. Data Acquisition**

#### *3.1. Measurements*

The measurements were taken using GBSAR-Pi. There were two sets of measurements: the first was conducted in laboratory conditions, while the second was obtained in a more complex "real-world" environment. Thus, we named the collected datasets LabSAR and RealSAR, respectively.

In LabSAR, three test objects set at the same position in an anechoic chamber were recorded. The test objects were a big metalized box, a small metalized box, and a cylindrical plastic bottle filled with water; hence, the objects differed in size and material reflectance. The distance between GBSAR-Pi and the observed object did not change between measurements and was approximately 1 m. The azimuth position of the object was at the center of the GBSAR-Pi rail track. Only one object was recorded in each measurement for this setup. All measurements were conducted using horizontal polarization. The 160 steps of size 4 mm gave a total aperture length of 64 cm.

In RealSAR, the conditions were not as ideal. The measurements were intentionally conducted in a room full of various objects in order to produce additional noise. Once again, there were three test objects. However, the objects were empty bottles of similar size and shape, distinguishable only by the different reflectances of their materials: the three bottles were made from aluminium, glass, and plastic. Compared to LabSAR, there was much more variance in the recorded scenes, since any of the eight (2<sup>3</sup>) possible subsets of the three objects could appear in a given scene (including an empty scene without any objects). Moreover, the different object positions, which vary in both the azimuth and range directions, and the different polarizations used for recording also contribute to the complexity of the task. We note that any scene can include at most one object of a certain material. Compared to LabSAR, the step size was increased to 1 cm, while the total aperture length was decreased to 30 cm. The RealSAR objects are shown in Figure 4. A comparison between the measurement sets is given in Table 1.

**Figure 4.** RealSAR objects: aluminium, glass and plastic bottle, and GBSAR-Pi.


**Table 1.** LabSAR and RealSAR measurement set comparison. The first five rows describe the objects and scenes, while the remaining rows list the GBSAR-Pi parameters used in the measurements.

#### *3.2. Datasets*

Using the measurement sets LabSAR and RealSAR, four datasets were created: LabSAR-RAW, LabSAR-IMG, RealSAR-RAW, and RealSAR-IMG. LabSAR-RAW and RealSAR-RAW consist of the raw data from the measurement sets mentioned in their names, while LabSAR-IMG and RealSAR-IMG contain radar images generated by applying the reconstruction algorithm to that raw data. Since the number of frequency points in GBSAR-Pi did not change throughout the measurements, the dimensions of the matrices in the RAW datasets depend only on the number of steps: in LabSAR-RAW, each matrix is 160 × 1024, while in RealSAR-RAW, it is 30 × 1024. The dimensions of both IMG datasets are the same: 496 × 369 px. The LabSAR datasets contain 150 and the RealSAR datasets 337 matrices (RAW) and images (IMG). Specifically, in the LabSAR datasets, there are 37 data points containing a big metalized box, 70 data points with a small metalized box, and 43 data points containing a bottle of water. It is important to emphasize that the measurement conditions and test objects of LabSAR proved not to be sufficiently challenging for object classification: models trained on LabSAR-RAW and those trained on LabSAR-IMG all achieved extremely high accuracy and could, thus, not be meaningfully compared. This is why we primarily focused on classification and comparisons based on the RealSAR datasets, while the LabSAR datasets were used to pretrain the neural network models.

Along with single-object measurements, the RealSAR datasets also include all possible combinations of multiple objects, as well as scenes with no objects, which affects the number of measurements per object. The 337 raw matrices of RealSAR measurements include 172 matrices of scenes with an aluminium bottle, 172 matrices with a glass bottle, and 179 matrices with a plastic bottle. The RealSAR datasets also contain 29 measurements of a scene without objects.

Four examples from the two RealSAR datasets are shown in Figure 5. The left image of each example is a heatmap of the raw data (an example from RealSAR-RAW), while the right one is the image reconstructed from that data (an example from RealSAR-IMG). The aforementioned dimensions of the matrices in the RealSAR-RAW dataset can be seen in the raw data examples: the horizontal axis represents the number of steps in the GBSAR measurement (in our case 30), while the vertical axis represents the number of frequency points in one FMCW frequency sweep (in our case 1024). The four depicted examples stand for four scenes: (a) an empty scene, (b) a scene with an aluminium bottle, (c) a scene with aluminium and glass bottles, and (d) a scene with all three bottles. The examples highlight three possible problems for image-based classification:


On the other hand, since the input data in the raw-based classification model are not preprocessed, the integrity of the data is preserved; hence, such a model has an advantage over an image-based one.

Both variants of the RealSAR dataset were split into training, validation, and testing splits in a ratio of approximately 60:20:20. The splits consist of 204, 67, and 67 examples, respectively. A single class, in our setting, is any of the 2*<sup>n</sup>* subsets of *n* different objects that can be present in the scene; thus, in our case, there are 8 classes. The split was performed in a stratified fashion [43], meaning that the distribution of examples over the classes in each split corresponds to the distribution in the whole dataset. Since RealSAR-IMG was created by applying the Omega-K algorithm to the examples in RealSAR-RAW, both variants contain the same examples. The training, validation, and testing splits, as well as any subset of the training set used in the experiments, also contain the same examples in both the RAW and IMG variants.
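A stratified split of this kind can be sketched in pure NumPy. The label encoding (three presence bits mapped to one of the 2<sup>3</sup> class ids), the helper name, and the random labels below are illustrative assumptions, not the authors' tooling:

```python
import numpy as np

def stratified_split(class_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split indices so each split mirrors the overall class distribution."""
    rng = np.random.default_rng(seed)
    splits = [[] for _ in fractions]
    for c in np.unique(class_ids):
        members = rng.permutation(np.flatnonzero(class_ids == c))
        # cut points proportional to the requested fractions
        cuts = (np.cumsum(fractions)[:-1] * len(members)).round().astype(int)
        for split, part in zip(splits, np.split(members, cuts)):
            split.extend(part.tolist())
    return [np.array(sorted(s)) for s in splits]

# Illustrative labels: one row per scene, bits = (aluminium, glass, plastic)
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(337, 3))
class_ids = labels @ np.array([4, 2, 1])  # each of the 2^3 subsets -> class id
train, val, test = stratified_split(class_ids)
```

Splitting per class and concatenating guarantees that each of the 8 subset-classes appears in every split in roughly its overall proportion.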

**Figure 5.** *Cont.*

**Figure 5.** Pairs of RealSAR-RAW (left) and RealSAR-IMG (right) examples. Example pair (**a**) represents an empty scene, (**b**) scene with an aluminium bottle, (**c**) scene with an aluminium and a glass bottle, while (**d**) contains all three bottles.

#### **4. Deep Learning**

#### *4.1. End-to-End Learning vs. Feature Engineering*

In classical machine learning, we usually need to transform our raw data to make it suitable as input to a machine learning model. We extract a set of features from each raw data point using classical algorithms or hand-crafted rules. The process of choosing an appropriate set of features, with a corresponding procedure for their extraction, is called feature engineering [44].

With the advent of deep learning, approaches with fewer preprocessing steps have become more popular. Instead of manually transforming raw data into representations appropriate for the model, new model architectures were developed that were able to consume raw data [33]. Such models implicitly learn their own optimal representations [34] of the raw data, often outperforming models that learned from fixed, manually extracted features [45]. This paradigm is also referred to as end-to-end learning.

We designed our models and experiments to compare and contrast these two paradigms. The first approach uses an existing image classification architecture, ResNet18, which has been shown to tackle a diverse set of computer vision problems [46,47]. This model is trained on RGB images reconstructed from radar data using the Omega-K algorithm [39]. The second is an end-to-end learning approach, where the input is the raw data collected with our GBSAR-Pi. Its model architecture is based on ResNet18, with a few modifications to accommodate learning from raw data.

#### *4.2. Multi-Task Learning and Multi-Label Classification*

Our task consists of detecting the presence of multiple different objects in a scene. Instead of training a separate model for recognizing each individual object, we adopt the multi-task learning paradigm [48] and train a single model which recognizes all objects at once. The model consists of a single convolutional backbone which learns shared feature representations [49] that are then fed to multiple independent fully-connected binary classification output layers (heads). Each classification head is designated for classifying the presence of one of the objects that can appear in a scene. We calculate the binary cross-entropy loss for each head individually; the final loss optimized during training is the arithmetic mean of all individual binary cross-entropy losses. The described formulation is also known as multi-label classification [50]. An alternative formulation would be multi-class classification, in which each possible subset of objects that can occur in a scene would be treated as a separate class, and a single head would classify examples into exactly one of the classes. Although we do care that our model learns to recognize all possible object subsets well, we chose not to address the problem in this way because it does not scale: for *n* objects, there are 2*<sup>n</sup>* different subsets, so the number of outputs of a multi-class model grows exponentially with the number of objects that can appear in a scene. In contrast, in the multi-label formulation, there are only *n* outputs for *n* different objects.

#### *4.3. Models*

We train and evaluate multiple approaches for raw data and image-based classification. For image data, we test two popular efficient convolutional architectures for low-powered devices, MobileNetV3 [51] and ResNet18 [46], and compare their results. For raw radar data classification, we consider two baseline approaches to compare against: the first is a single-layer fully connected classifier, while the second uses an LSTM [52] network to process the data before classifying it. We also apply the two convolutional architectures, MobileNetV3 and ResNet18, to classify raw radar data, for multiple reasons. Firstly, these networks have been shown to work well across a wide array of classification tasks. Secondly, the inductive biases and assumptions of convolutional neural networks regarding spatial locality [53] make sense for raw data as well, because raw radar data points are 2D matrices of values: the horizontal dimension corresponds to the lateral axis along which the radar moves as it records the scene (shown in Figure 5a), while the values in the vertical dimension represent the signal after the mixer in the FMCW system, whose frequency correlates with the distance of objects. Finally, our main contribution in this regard is a modification to the ResNet18 network (RAW-RN18) when applied to raw radar data, which prevents subsampling of the input matrix in the horizontal dimension.

Each of our image-based classification models takes a batch of RGB images with dimensions (496, 369, 3) as input. The pixels of the RGB images are preprocessed by subtracting the mean and dividing by the standard deviation of the pixel values, with the statistics calculated on the training set; this is common practice in computer vision. For each image in the batch, the model outputs three numbers, each of which is then fed to the sigmoid activation function. The output of each sigmoid function is a posterior probability *P*(*Yi* = 1|*x*), where *i* ∈ {*aluminium*, *glass*, *plastic*}, i.e., the probability that the corresponding object (aluminium, glass, or plastic) is present in the scene. To summarize: *P*(*Yi* = 1|*x*) = *σ*(*model*(*x*)*i*).

Raw data classification models take a batch of 2D matrices of real numbers with dimensions (1024, 30, 1) as input. We preprocess raw data matrices in the same way that we do for images, with the raw data statistics also calculated on the train set. Raw data models produce the outputs in the same format as the image-based models. Figure 6 describes the machine learning setup for both input modalities.

All models are trained in the multi-label classification fashion. For each of the three outputs, we separately calculate a binary cross-entropy loss: *L* = −*y* log *P*(*Y* = 1|*X*) − (1 − *y*)log(1 − *P*(*Y* = 1|*X*)), where *y* is the ground truth label and *P*(*Y* = 1|*X*) is the predicted probability of the corresponding object. The total loss is calculated as the arithmetic mean of the three individual losses.

**Figure 6.** The machine learning setup for all image and raw data models. For image-based classification, the dimensions of input images are (496, 369, 3). For raw data classification, the dimensions of the input matrix are (1024, 30, 1). All models produce three posterior probability distributions *P*(*Yi* = 1|*x*), where *i* ∈ {*aluminium*, *glass*, *plastic*}.

During training, we use random color jittering and horizontal flipping as data augmentation procedures. These are common ways of artificially increasing dataset size and improving model robustness in computer vision. To extend color jittering to raw data classification, we add Gaussian noise with variance *σ*<sup>2</sup> = 0.01 to each element of the input matrix.
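The raw-data augmentation described above can be sketched as follows; the function name, flip probability, and NumPy formulation are our assumptions:

```python
import numpy as np

def augment_raw(matrix, rng, flip_p=0.5, noise_var=0.01):
    """Augment one raw GBSAR matrix of shape (n_freq, n_steps).

    A horizontal flip mirrors the scene laterally (label-preserving),
    and small Gaussian noise plays the role of colour jittering for
    raw data, as described in the text.
    """
    out = matrix.copy()
    if rng.random() < flip_p:
        out = out[:, ::-1]  # flip along the step (lateral) axis
    # additive Gaussian noise with variance sigma^2 = 0.01
    out = out + rng.normal(0.0, np.sqrt(noise_var), size=out.shape)
    return out

rng = np.random.default_rng(0)
raw = rng.standard_normal((1024, 30))
aug = augment_raw(raw, rng)
```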

#### 4.3.1. Baselines

We develop two baseline approaches for raw data classification. The first is a single-layer fully connected classification neural network. It flattens the input matrix of dimensions (1024, 30) into a vector of 30,720 elements, which is then fed to a fully-connected layer that produces an output vector with three elements. The network is shown in Figure 7. The second approach uses a single-layer bidirectional LSTM network for classification [54], which processes the input sequentially in the horizontal dimension. It treats the input matrix with dimensions (1024, 30) as a sequence of 30 tokens of size 1024. Each 1024-dimensional input vector is embedded into a 256-dimensional representation by a learned embedding matrix [55], which is a standard way of transforming LSTM inputs. Because the LSTM network is bidirectional, it aggregates the input sequence by processing it in both directions, using two separate unidirectional LSTM networks. The dimension of the hidden state is 256. The final hidden states of both directions are concatenated and then fed to a fully-connected classification layer. The network is shown in Figure 8.

**Figure 7.** Single-layer fully connected neural network classifier for raw data. The input matrix of dimensions (1024, 30, 1) is flattened into a vector of 30,720 elements. The vector is then fed to a fully connected classifier with the sigmoid activation function, which produces an output vector of size 3 representing the three posterior probability distributions *P*(*Yi* = 1|*x*), where *i* ∈ {*aluminium*, *glass*, *plastic*}.

**Figure 8.** Raw data classifier based on the long short-term memory network. Since the LSTM is a sequential model, the input matrix with dimensions (1024, 30) is processed as a sequence of 30 vectors of size 1024. Each 1024-dimensional input vector is embedded into a 256-dim representation by a learned embedding layer (Emb). The dimension of the hidden state is 256. The bidirectional LSTM network aggregates the input sequence by processing it in two directions and concatenating the final hidden states of both directions. As with the fully connected classifier, the resulting vector is given to a fully connected classifier with the sigmoid activation function, which produces the three posterior probability distributions.
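The bidirectional LSTM baseline might look like this in PyTorch (an illustrative sketch; the framework and any layer details beyond those stated in the text are our assumptions):

```python
import torch
import torch.nn as nn

class RawLSTMClassifier(nn.Module):
    """BiLSTM baseline: 30 steps x 1024 frequency points -> 3 probabilities."""
    def __init__(self, in_dim=1024, emb_dim=256, hidden=256, n_objects=3):
        super().__init__()
        self.embed = nn.Linear(in_dim, emb_dim)      # learned embedding (Emb)
        self.lstm = nn.LSTM(emb_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_objects)   # concat of both directions

    def forward(self, x):                            # x: (B, 30, 1024)
        _, (h_n, _) = self.lstm(self.embed(x))
        h = torch.cat([h_n[0], h_n[1]], dim=1)       # final fwd/bwd hidden states
        return torch.sigmoid(self.fc(h))             # (B, 3) probabilities
```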

#### 4.3.2. Computer Vision Models

For image-based classification, we employ two popular efficient convolutional neural network classification architectures: MobileNetV3 [51] and ResNet18 [46]. These lightweight, efficient networks have been shown to work well across a wide array of classification tasks. We chose them because they are tailored to work on low-powered devices, such as the Raspberry Pi, on which we want to run model inference in real-time. Furthermore, the dataset which we developed is small, so we opt for smaller models and forgo using larger classification architectures. We use both networks without any modifications for image and raw data classification.

Our main contribution comes from analysing the ResNet18 architecture and devising an architectural modification to make the network more suitable for raw data classification. The ResNet18 network is composed of 17 convolutional layers and one fully-connected output layer. After the initial convolutional layer (conv1), which is followed by a max pooling layer, the remaining 16 convolutional layers are divided into 4 groups (conv2, conv3, conv4, conv5). Each group consists of 2 residual blocks, each of which consists of 2 convolutional layers. Residual blocks have skip (residual) connections which perform an identity mapping and add the input of the residual block to the output of its two convolutional layers. In the vanilla ResNet18 architecture, the convolutional layers transform the input by gradually reducing the spatial dimensions and increasing the depth (number of feature maps). The spatial dimensions are halved in 5 places during the forward pass of the network, by using a stride of 2 in the following layers:


Figure 9 shows all of the ResNet18 layers and their groups, with the four convolutional layers which perform subsampling emphasized. Because the input tensor is halved in both spatial dimensions 5 times, the output of the last convolutional layer is a tensor whose spatial dimensions are both 32 times smaller than those of the input. The depth of the tensor is 512. Global average pooling (GAP) pools this tensor across the spatial dimensions into a vector of size 512. This shared representation is given as input to the fully-connected output layer, which outputs a posterior probability *P*(*yi* = 1|*x*) for each object *i*.

**Figure 9.** All 17 convolutional layers and one fully-connected layer of the unmodified ResNet18 network. The convolutional layers are grouped into 5 groups: conv1, conv2, conv3, conv4, and conv5. There are four convolutional layers that downsample the input tensor by using a stride of 2. They are the first convolutional layers in groups conv1, conv3, conv4, and conv5. These layers are emphasized in the image. There is also a max pooling layer between groups conv1 and conv2, which also downsamples the image by using a stride of 2. Since the input image is downsampled five times, the resulting tensor which is output by the final convolutional layer is 32 times smaller in both spatial dimensions than the input image.

We considered the dimensionality of our input data: the vertical dimension is 1024, while the horizontal dimension is 30. Furthermore, while the two spatial dimensions of reconstructed images are similar (496 and 369), with an aspect ratio of 1.34, we can see this is not the case with raw radar data, where the horizontal dimension is much smaller than the vertical. In general, since we are focused on GBSAR with a limited aperture size, the horizontal dimension—which corresponds to the number of steps—will usually be low.

With that in mind, we see that the tensor resulting from the convolutional layers of ResNet18 is downsampled to spatial dimensions (32, 1). Our modification changes the network so that it does not downsample the input data in the horizontal dimension at all. As described above, subsampling is performed in four of the convolutional layers and one max pooling layer of the network. In all cases, the subsampling is performed by using a stride of 2 in both spatial dimensions for the sliding window of either a convolutional or a max pooling layer. We change the horizontal stride of these downsampling layers from 2 to 1, while keeping the vertical stride unchanged. Figure 10 shows how the raw data input matrix is gradually subsampled in ResNet18 after the first convolutional layer (conv1) and each convolutional group (conv2, conv3, conv4, conv5), with and without our modification. The dimensions in the figure are not to scale, since displaying them to scale would be impractical; rather, the figure shows symbolically that, without our modification, the horizontal dimension is subsampled from 30 to 1, while with the modification it remains 30.

**Figure 10.** The spatial dimensions of the input raw data matrix and the subsequent subsampled intermediate representations after each group of convolutional layers. For groups conv1, conv3, conv4, and conv5, the subsampling is performed in the first convolutional layer of the group. For group conv2, the subsampling is performed in the max pooling layer immediately before the first convolutional layer of conv2. The first diagram shows how, without any modification to the ResNet18 architecture, both the vertical and horizontal dimensions are halved five times: the vertical dimension is downsampled from 1024 to 32, while the horizontal dimension is downsampled from 30 to 1. The second diagram shows how only the vertical dimension is subsampled after our modification to the ResNet18 architecture, while the horizontal dimension remains constant, because our modification sets the horizontal stride of the subsampling layers to 1. Note that the dimensions in the figure are not to scale due to the impracticality of displaying very tall and narrow matrices.

This modification does not change the dimensionality of any kernels of the convolutional layers, so it does not add any parameters to the model. Thus, it does not prevent us from using any pretrained set of parameters of the original ResNet18 architecture. Even though the horizontal stride of the modified network is different, which can change the horizontal scale and appearance of features in deeper layers of the network, using existing ImageNet-pretrained weights is still a sensible initialization procedure. ImageNet pretraining has been shown to be consistently beneficial across a wide array of image classification tasks, some of which have different image dimensions and object scales, or even cover an entirely different image domain than the ImageNet-1k dataset [56,57]. We empirically validate the contribution of pretraining for weight initialization in our experiments.

The modification results in the tensor output by the last convolutional layer having dimensions (32, 30), since the input is not subsampled in the horizontal dimension. The depth of that tensor remains 512, and the global average pooling layer aggregates the tensor across the spatial dimensions into a vector of size 512. Thus, the vector representation fed to the fully-connected output layer has the same dimensionality as before the modification, and the output layer also does not require any changes.

#### **5. Experiments**

#### *5.1. Experiment Setup*

To evaluate our models, we perform experiments on the RealSAR dataset. We evaluate and compare two main approaches, each with a number of modifications regarding the data augmentation procedures, weight initialization, the size of the training set, and the classification architecture used. The first approach consists of models which work on reconstructed images; we train and evaluate them on the RealSAR-IMG dataset. In the second approach, we evaluate models which work with raw radar data; these experiments are performed on the RealSAR-RAW dataset. All experiments were conducted in a multi-label classification setting, where each label denotes the absence or presence of one of the three different objects that we detect. The training procedure minimized the mean of the binary cross-entropy losses of all labels.

We chose all hyperparameters manually, by testing and comparing different combinations of values on the validation set. In all experiments for both models, we use the Adam optimizer with a learning rate of 2 × 10<sup>−4</sup> and the weight decay parameter set to 3 × 10<sup>−4</sup>.

Experiments on the full training set (156 examples) are performed over 30 epochs, with a batch size of 16, which results in 300 parameter updates. To ensure fair comparisons when training on subsets of the training set, the number of epochs is chosen such that the total number of parameter updates stays the same.
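Keeping the number of parameter updates constant across subset sizes can be sketched as follows (the helper name and the rounding of subset sizes are illustrative):

```python
import math

BATCH = 16
TARGET_UPDATES = 300   # 30 epochs on the full 156-example training set

def epochs_for(n_examples, batch=BATCH, target=TARGET_UPDATES):
    """Epochs needed so that training on a subset performs roughly the
    same number of parameter updates as training on the full set."""
    updates_per_epoch = math.ceil(n_examples / batch)
    return math.ceil(target / updates_per_epoch)

for frac in (0.25, 0.5, 0.75, 1.0):
    n = round(156 * frac)
    print(f"{frac:.0%}: {n} examples -> {epochs_for(n)} epochs")
```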

To prevent our models from overfitting and to increase robustness to noise, we use two stochastic data augmentation procedures during training: random horizontal flipping and color jittering. Since the radar moves laterally when recording a scene, the raw recorded data are equivariant to changes in the lateral positions of objects in the scene, and the Omega-K algorithm preserves this equivariance. Thus, for both the raw and the vision representation of any given scene, the horizontally flipped representation corresponds to rearranging the scene to be symmetric with respect to the middle of the lateral axis. The horizontal flip transformation therefore does not change the semantics, i.e., the labels of that representation, and can be used as an augmentation procedure. Color jittering is used in computer vision to stochastically apply small photometric transformations to images. This artificially increases the dataset size, since in every epoch each image is transformed with different, randomly sampled parameters. One novelty we introduce is to extend this technique to raw data as well, by adding small Gaussian noise to each example. We validate this contribution empirically.

We trained both the raw and the image-based models with different modifications. To explore the contribution of transfer learning [34], we used three different weight initialization procedures: random initialization [58], initializing with ImageNet-1k pretrained weights, and initializing by pretraining on the LabSAR dataset. To test how the model improves as more data are collected, we performed experiments on 25%, 50%, 75%, and 100% of examples in the training set. The validation and test sets remained fixed throughout all experiments to ensure fair comparisons. Through these experiments, we converged on the two best models: one based on raw radar data, and the other based on reconstructed images. We used these two models to perform further analyses.

To test whether combining models trained on different input modalities yields an increase in performance, we evaluated an ensemble model. The ensemble consists of the best RAW and the best IMG model. Its output vector is the mean of the output vectors of the two models.
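The ensembling step is simply an element-wise average of the two probability vectors (the values below are illustrative):

```python
import numpy as np

def ensemble_predict(p_raw, p_img):
    """Average the per-label probability vectors of the two best models."""
    return (np.asarray(p_raw) + np.asarray(p_img)) / 2.0

p = ensemble_predict([0.9, 0.1, 0.4], [0.7, 0.3, 0.6])
```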

#### *5.2. Metrics*

All of our models are trained in the multi-label classification setting. The model output *P* is a vector of *n* numbers, where *n* is the number of different objects which can appear in a scene. The element *i* of the output vector (*P*[*i*]) is the probability given by the model that the corresponding object *i* is present in the scene. The probability of its absence is 1 − *P*[*i*]. Thus, the detection of each individual object can be seen as a separate binary classification problem.

In binary classification evaluation, we have a set of binary ground truth labels for all examples and the corresponding model predictions, either as probabilities or as binary values (0 and 1). When the predictions are binary values, the pair consisting of the ground truth label and the prediction for any example in the dataset can be classified into one of four sets:

- true positives (TP): the label and the prediction are both positive;
- true negatives (TN): the label and the prediction are both negative;
- false positives (FP): the label is negative, but the prediction is positive;
- false negatives (FN): the label is positive, but the prediction is negative.

Various metrics can be defined using the described quantities. Accuracy measures the percentage of correct predictions:

$$\text{Acc} = \frac{\text{TN} + \text{TP}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \tag{8}$$

Precision (*P*) measures how many positive predictions were correct:

$$\text{P} = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{9}$$

It decreases as the number of false positives increases. Recall (R) measures how many positive examples were captured:

$$\text{R} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{10}$$

It decreases as the number of false negatives increases. If we want a strict metric which will penalize both excessive false positives and false negatives, we can use the *F*1-score, which is the harmonic mean of precision and recall:

$$F1 = \frac{2 \cdot \text{P} \cdot \text{R}}{\text{P} + \text{R}} = \frac{2 \cdot \text{TP}}{2 \cdot \text{TP} + \text{FP} + \text{FN}} \tag{11}$$
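Equations (8)-(11) expressed directly from the four confusion counts (the example counts in the comment are illustrative, not from the paper):

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)   # Equation (8)

def precision(tp, fp):
    return tp / (tp + fp)                    # Equation (9)

def recall(tp, fn):
    return tp / (tp + fn)                    # Equation (10)

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)       # Equation (11)

# e.g. 8 TP, 55 TN, 2 FP, 2 FN gives P = R = F1 = 0.8
```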

Since the output *P*[*i*] of the model is the probability of the positive class, we classify an example as positive if the probability is higher than a certain threshold *T*[*i*]. Otherwise, we classify it as negative. The natural threshold to choose is 0.5. However, the actual optimal threshold varies depending on the metric. For example, recall increases as the threshold decreases. The minimum threshold (0) will maximize recall (1.0), since it will capture all positive examples. A threshold that is higher than the highest prediction probability will minimize recall (0.0), since it will not capture any positive examples. Conversely, a higher threshold tends to result in higher precision, since only very certain predictions will be classified as positive.

To remove the variability of the choice of the threshold from model evaluation and comparison, the average precision (AP) metric is often used. It is calculated as the average over precision values for every possible recall value. To extend it to the multi-label setting (multiple binary classification problems), we use the mean average precision (mAP) which is the arithmetic mean of the AP values of all individual tasks.
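One common way to compute AP and mAP is sketched below (the paper does not specify its exact AP formulation, so this non-interpolated variant, precision averaged at each positive when examples are ranked by score, is an assumption):

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one label: rank examples by decreasing score, then average
    the precision measured at the rank of each positive example."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels)[order]
    tp = np.cumsum(y)                                # positives captured so far
    precision_at_k = tp / np.arange(1, len(y) + 1)   # precision at each rank
    return precision_at_k[y == 1].mean()

def mean_average_precision(score_matrix, label_matrix):
    """mAP: arithmetic mean of the per-label AP values (columns = labels)."""
    return float(np.mean([average_precision(s, l)
                          for s, l in zip(np.asarray(score_matrix).T,
                                          np.asarray(label_matrix).T)]))

ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```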

Once the best model has been chosen according to mAP, in order to use it to make predictions, we have to find the optimal threshold parameter for each individual output. We want our model to perform well on all classes, i.e., all 8 possible subsets of objects that can appear in the scene. Thus, we choose a vector of thresholds such that it maximizes the macro-*F*1 score on the validation set [59].

The macro-*F*1 score is a multi-class extension of the *F*1 score. It is calculated as the arithmetic mean over the *F*1-score of each class.

$$\text{macro-}F1 = \frac{1}{n}\sum\_{k=1}^{n}F1\_k\tag{12}$$

To calculate the *F*1-score of a given class *ck* in a multi-class setting, the results are reformulated as a binary classification problem. Class *ck* is treated as the positive class, while all other classes are grouped into a single negative class. The multi-class confusion matrix is accordingly transformed into a binary confusion matrix. The quantities necessary for calculating the binary precision and recall values are then obtained as shown in Figure 11.
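The one-vs-rest reduction and Equation (12) can be sketched directly from a multi-class confusion matrix (our row/column convention below, `conf[true, pred]`, is an assumption):

```python
import numpy as np

def macro_f1(conf):
    """Macro-F1 from a multi-class confusion matrix conf[true, pred]:
    each class c_k is treated as positive against all other classes."""
    conf = np.asarray(conf, dtype=float)
    f1_scores = []
    for k in range(conf.shape[0]):
        tp = conf[k, k]
        fp = conf[:, k].sum() - tp  # predicted c_k, actually another class
        fn = conf[k, :].sum() - tp  # actually c_k, predicted another class
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return float(np.mean(f1_scores))  # Equation (12)
```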

**Figure 11.** The transformation of a multi-class confusion matrix into a binary confusion matrix for class *ck*. The macro-*F*1 score is calculated as the arithmetic mean of *F*1 scores of all classes [60].

#### *5.3. Results*

We consider two main approaches for GBSAR object classification, which correspond to the two possible input modalities: raw radar data (RAW) and reconstructed images (IMG). For raw data approaches, we compare five different classification architectures described in Section 4: the fully-connected baseline model (FC), the LSTM-based baseline model (LSTM), the ResNet18 (RN18) and MobileNetV3 (MNv3) convolutional architectures applied without any modifications, and, finally, our ResNet18 modified to handle raw radar data (RAW-RN18). For image-based approaches, we compare classifiers based on two convolutional architectures: ResNet18 and MobileNetV3.

Firstly, we validate the contribution of extending jittering as a data augmentation procedure to the raw input data modality. Table 2 shows the mAP results for all raw data approaches with and without jittering. For a given class, the AP metric averages the precision score over all possible recall scores, which correspond to different threshold values for classification. This makes our comparison of different variants of our approaches invariant to the threshold value. The mean AP score (mAP) is obtained by averaging the AP over all classes. It can be seen that jittering improves performance across the board. We also see that the popular computer vision architectures ResNet18 and MobileNetV3 offer considerable improvements over the two baseline approaches, even when applied to raw data without any modification.

**Table 2.** Comparison of performance of all raw data models with and without jittering. The metric used is mean average precision (mAP).


We compare all considered raw and image-based classification models, combined with different weight initialization procedures. The weight initialization procedures that we consider are the following: random initialization (random), pretraining on the LabSAR dataset (LabSAR), pretraining on the ImageNet-1k dataset (ImageNet), and pretraining on ImageNet-1k followed by LabSAR (ImageNet + LabSAR). Table 3 shows the mAP results for all combinations for the raw and image-based classification models. We see that ResNet18 and MobileNetV3 improve upon the baseline for both input modalities. However, for raw data, our modified ResNet18 (RAW-RN18), which ensures that the input tensor is not subsampled in the spatial dimension, significantly outperforms those models. As expected, the vanilla, unmodified ResNet18 and MobileNetV3 architectures are more apt for image-based input, as they achieve significantly better results there than on raw data. We also observe that, of the two convolutional architectures considered, ResNet18 consistently outperforms MobileNetV3 for this task. Finally, pretraining on ImageNet generally improves performance, while LabSAR pretraining is only beneficial in some cases, with a smaller impact. Because LabSAR measurements were captured under controlled laboratory conditions and thus exhibit limited variance, the knowledge learned from pretraining on them did not transfer as well to the more complex RealSAR dataset.

**Table 3.** Comparison of performance of all considered raw and image-based classification models in combination with all different weight initialization procedures on the validation set. The metric used is mean average precision (mAP).


Based on the results of these experiments, we converged on two main approaches for subsequent experiments. For raw data, we use our modified ResNet18 (RAW-RN18) model, while for image data we use the standard ResNet18. To observe how the size of the training dataset impacts performance, we train our models on subsets that contain 25%, 50%, and 75% of all training set examples. Thus, we train each of the two models with every combination of weight initialization procedure and training set size.


The results on the validation set are displayed in Table 4. They show that models based on raw data experience a smaller drop in performance when trained on a smaller training set.

**Table 4.** Comparison of performance of the two best model configurations for image and raw data classification. The chosen image model was an unmodified ResNet18 (RN18), while the chosen raw data model was a ResNet18 with our modification which prevents horizontal subsampling (RAW-RN18). We compare the two models across different weight initialization procedures and training set sizes on the validation set. The metric used is mean average precision (mAP).


We chose one RAW model and one IMG model with the best performance on the validation set to perform further analyses. The best IMG model was pretrained on ImageNet, while the best RAW model was pretrained first on ImageNet, then on LabSAR. Per-class AP scores on the test set for the two chosen models are displayed in Table 5. The mAP scores on the test set of both models are significantly lower than results on the validation set. This is expected since all results in Table 4 are chosen from the best epoch, as measured on the validation set. The validation set is used to choose the best hyperparameters for the model, as is standard practice in machine learning [43,61]. On the other hand, the test set results of chosen models are realistic mAP performance estimates. The RAW model has the higher AP score for classes aluminium and plastic, while the IMG model is slightly better for the glass bottles. The RAW model also has the higher mean AP score. This suggests that classification based on raw radar data, which circumvents lossy reconstruction steps, coupled with architectural modifications of neural networks can yield better performance than traditional computer vision approaches on reconstructed images. Figure 12 shows the average test mAP of the RAW and IMG pairs of models for each combination of the weight initialization scheme and training set size. We notice the general trend of increasing performance as the training set grows. The plots in the graph do not seem to be in a saturation regime, so additional data would be expected to increase the performance further. ImageNet pretraining is shown to be beneficial both in regards to the total performance and to the stability across different training set sizes.

**Table 5.** Per-class AP performance of the best IMG and RAW models on the test set.


To obtain a thorough evaluation of the performance of our two chosen models, we also test them in a multi-class setting. There are eight classes that cover all possible combinations of objects in the scene. In this way, we can gain more insight into how the models behave for objects appearing individually in the scene versus in different groups. This is especially interesting since the considered objects have varying reflectances and, thus, certain pairs of objects might be more difficult to distinguish than others. We average the F1 score of each class to obtain the macro-F1 score. Although the comparison of different approaches in Table 4 was performed over all thresholds (mAP), for a multi-class comparison of the two chosen models, we need to choose concrete threshold parameters. For each of the two models, we find a threshold vector which maximizes the macro-F1 score over all classes on the validation set. Table 6 shows the validation F1 score of each class, along with the macro-F1 score, for both models. Once the thresholds have been chosen using the validation set, we evaluate the models on the test set. The test set results are shown in Table 7. The RAW model outperforms the IMG model in all classes except for the glass-plastic combination. The test set results also include the results of the ensemble model created by averaging the predictions of the two models.

**Figure 12.** Average test mAP of the RAW and IMG pair of models for each combination of weight initialization scheme and training set size.

**Table 6.** Per-class F1 scores and the macro F1 score for the best threshold values on the validation set. E—Empty, A—Aluminium, P—Plastic, and G—Glass.


**Table 7.** Per-class F1 scores and the macro F1 score of the two best models and their ensemble on the test set. Classes: E—Empty, A—Aluminium, P—Plastic, and G—Glass.


Out of the 67 test examples, the IMG model misclassifies 11 examples, while for the RAW model, the number of incorrect classifications is 8. This yields accuracy scores of 83.6% and 88.1%, respectively. Looking at the error sets, we found that all of the examples misclassified by the RAW model are also misclassified by the IMG model. In other words, there are no examples where the IMG model outperformed the RAW model. This explains why the ensemble model did not improve upon the performance of the single RAW model. The general idea of ensemble models is to aggregate predictions of different base estimators, each of which is more specialized in a certain sub-region of the input space than other estimators. In our case, the image-based model is not better than the raw model in any sub-region of the input space.

The confusion matrices of both models are shown in Figure 13a,b. The numbers of misclassified examples for different combinations of ground truth and prediction values are contained in elements off the diagonal.

**Figure 13.** Confusion matrices for the best IMG and RAW models. E—Empty, A—Aluminium, P—Plastic, and G—Glass.

#### **6. Discussion**

The main goal of this work was to compare two deep learning approaches for object classification on GBSAR data. Experimental results presented in the paper show that the model trained on raw data (RAW) consistently outperforms the image-based model (IMG), which uses the same data preprocessed with an image reconstruction algorithm. Specifically, out of 67 test examples, the IMG model misclassified 11, while the RAW model misclassified 8. The overlap of the two misclassification sets is striking: all 8 examples misclassified by the RAW model are also misclassified by the IMG model. This means there is no example in the test set where using the IMG model was beneficial compared to the RAW model. The three test examples where the IMG model made a mistake while the RAW model was correct (one of which is shown in Figure 14a) indicate that reconstruction algorithms degrade the information in signals obtained by GBSAR due to approximations, as described in Section 2.2.

Similar behavior is shown in Table 7: the F1 score is higher in the RAW model for every class except for the combination of glass and plastic (G, P). Half of the 8 examples incorrectly classified by the RAW model were predicted to be in the (G, P) class, as seen in the confusion matrix in Figure 13b. This significantly decreased the precision score, and, consequently, the F1 score, of the RAW model on that class. The table also shows that both models have difficulties classifying aluminium and combinations with aluminium. Aluminium has a much higher reflectance than glass and especially plastic. Consequently, it is hard for models to differentiate between scenes where the aluminium object appears alone versus ones where it appears together with another object of significantly lower reflectance. One such example is shown in Figure 14b, in which a scene containing both an aluminium and a plastic bottle is mistaken by both models for a scene containing only an aluminium bottle. The confusion matrices in Figure 13a,b also capture this phenomenon. Most of the misclassifications of both models are found in rows and columns that represent the aluminium class. The confusion matrices also highlight the difficulty of distinguishing an empty scene from one containing only a plastic bottle due to the very low reflectance of plastic.

In addition to reflectance and the aforementioned approximations, there is another factor impacting the results of the IMG model: the heatmap scale of reconstructed images. Total GBSAR signal intensities of scenes containing aluminium are much larger than signal intensities of empty scenes and scenes containing plastic bottles. Consequently, using a fixed scale in the visualization would result in objects fading out of reconstructed images due to their very high or very low intensity responses. By choosing not to set a fixed scale, we reduced the likelihood of such a scenario. However, this means that the resulting pixel intensities in reconstructed images are not absolute. For example, two pixels with equal color intensities in a scene with aluminium and an empty scene correspond to different raw signal intensities from which they were reconstructed. Hence, similar reconstructed images of two scenes with different objects can lead to misclassifications, such as the one presented in Figure 14a. The RAW model classified that test example correctly as a glass bottle, but the IMG model mistook it for a plastic one. In problems involving materials with less variance in reflectance, the scale might be fixed.

**Figure 14.** RealSAR-RAW (left) and RealSAR-IMG (right) misclassified examples. Example (**a**) is misclassified by the IMG model, while example (**b**) is misclassified by both the RAW and IMG models. In (**a**), the recorded scene included a glass bottle, but the IMG model classified it as a plastic bottle. In (**b**), the recorded scene included an aluminium and a plastic bottle, but both models classified it as a scene with only an aluminium bottle.

Even though the comparisons between the two input modalities suggest the raw-data approach is superior, the use of reconstructed images has certain advantages. Reconstructed images are more interpretable to humans, especially regarding the locations of the objects in the scene. It is extremely challenging for humans to discern objects and their locations in raw GBSAR data. Reconstructed images also allow for meaningful error analyses, making it easier to deduce why particular examples are misclassified. A valid paradigm might be to use raw data for the actual classification and only generate reconstructed images when we need to analyze specific examples or localize objects.

Regarding weight initialization, both models reached the highest classification accuracy when they were pretrained on the ImageNet dataset. ImageNet pretraining also decreased the variance of the results with respect to the size of the training set, as seen in the plot in Figure 12. Table 4 shows that the ImageNet-pretrained variant of the model based on reconstructed images achieved the highest mean average precision. In the case of the raw data model, the highest mAP was achieved by pretraining on both ImageNet and LabSAR datasets. This highlights the importance of pretraining and suggests that the RealSAR dataset might also be useful in future radar-related deep learning research.

#### **7. Conclusions**

The presented paper investigates the potential and benefits of processing raw GBSAR data in automated radar classification. This can be particularly useful for industrial applications focused on monitoring and object detection in environments with limited visibility, where the approach can save considerable resources and time. It differs from, and in this paper is contrasted with, the standard practice of applying classification algorithms to reconstructed images. The testing setup was developed around an FMCW-based GBSAR system designed and constructed from low-cost components. The developed GBSAR was used in a series of measurements performed on several objects made of different materials, with the final intention of realizing a complete GBSAR radar system with embedded computer control capable of classifying these objects.

For classification purposes, a detailed analysis and comparison of two deep learning approaches to GBSAR object recognition was performed. The first approach was a more conventional SAR classification approach based on reconstructed GBSAR images, while in the second approach the classification was performed essentially on raw (unprocessed) data. Multiple deep learning architectures for both approaches were trained, tested, and compared. These included two baselines based on fully-connected layers and LSTM networks, two standard convolutional neural network architectures popular in computer vision (ResNet18 and MobileNetV3), and a modified variant of the ResNet18 network which makes it suitable for processing raw GBSAR data. The modification consists of preventing subsampling of the input in the horizontal dimension, which is generally small due to the nature of GBSAR data. This modification is our main contribution regarding deep learning model design. It was validated through experiments which showed that the best results on raw data are achieved by the modified version of ResNet18. Additionally, color jittering, a common data augmentation procedure in computer vision, was extended to the classification of raw GBSAR data and was shown to provide consistent, though small, improvements. Furthermore, it was shown that classification on raw data outperforms classification based on reconstructed images. This was partially expected, since SAR image reconstruction algorithms necessarily introduce certain approximations and thereby degrade the integrity of the recorded data. In addition to better classification performance, raw data classification is inherently faster, since it avoids the need for image reconstruction, and is thus more suitable for embedded computer implementations, which opens possibilities for various application scenarios. 
On the other hand, it limits human visual confirmation and precludes approaches in which radar images are combined with optical ones. However, keeping in mind that the primary focus is on applications in embedded systems, this is not a serious hindrance.

Even though this study has shown the applicability of this concept, it was tested on a relatively small dataset in order to focus on the comparison between approaches. Using larger datasets, more classes, and more general problem formulations could lead to more powerful and useful models. Another direction is to seek improvements in terms of more efficient networks and implementations on embedded computers with limited resources. With this in mind, for potential use in future research, we generated and publicly released these datasets of raw GBSAR data and reconstructed radar images, which we intend to update over time.

**Author Contributions:** Conceptualization, M.K., F.T., D.B. and M.B.; methodology, M.K., F.T., D.B. and M.B.; software, M.K. and F.T.; validation, M.K., F.T., D.B. and M.B.; formal analysis, M.K. and F.T.; investigation, M.K. and F.T.; resources, M.B.; data curation, M.K. and F.T.; writing—original draft preparation, M.K. and F.T.; writing—review and editing, D.B. and M.B.; visualization, M.K. and F.T.; supervision, D.B. and M.B.; project administration, M.B.; funding acquisition, M.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by Croatian Science Foundation (HRZZ) under the project number IP-2019-04-1064.

**Data Availability Statement:** The RealSAR datasets (RAW and IMG) used in this research are publicly available: https://data.mendeley.com/datasets/m458grc688/draft?a=ff342b09-dd03-4d09-a169-560af2f87773 (accessed on 1 November 2022).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Remote Sensing* Editorial Office E-mail: remotesensing@mdpi.com www.mdpi.com/journal/remotesensing
