1. Introduction
Synthetic aperture radar (SAR) is an advanced active microwave sensor. Its all-weather, all-day operation capability and insensitivity to illumination and weather conditions have led to its widespread application in fields such as geological surveying, disaster relief, climate monitoring, agricultural management, and maritime surveillance [
1]. Under adverse weather conditions, high-resolution SAR images can provide effective detection and monitoring of typical objects such as ships, buildings, aircraft, and oil tanks [
2,
3,
4]. Due to the unique imaging modality of SAR, its imaging results differ significantly from those of optical remote sensing [
5], as shown in
Figure 1. Compared with optical images, SAR images have lower signal-to-noise ratios and spatial resolution and are heavily contaminated by image noise, which makes object detection difficult. In the SAR object detection community, the approach of utilizing optical images to assist in the interpretation of SAR images has therefore attracted substantial attention.
Traditional approaches for object detection in SAR images can be broadly classified into two groups [
6]. The first group utilizes non-deep learning feature extraction methods, including detection algorithms based on structural features, grayscale characteristics, and image texture properties. Gu et al. [
7] proposed a multifeature joint algorithm that extracts both the size and orientation of the object, with binary search employed for precise orientation search. The Constant False Alarm Rate algorithm (CFAR) [
8], a typical grayscale-based detection approach, calculates the detection threshold by considering background clutter features to determine whether a pixel in the SAR image belongs to an object. Image texture segmentation and classification are performed using rotation-invariant features [
9], which effectively represent domain scale-related texture features. However, these algorithms have low detection accuracy, high rates of missed detections and false alarms, and are easily affected by background clutter. The second group of methods capitalizes on deep learning techniques, specifically convolutional neural networks (CNNs), and is mainly divided into two-stage object detection algorithms that generate candidate bounding boxes and single-stage object detection algorithms based on regression. Kang et al. [
10] pioneered the application of the two-stage detector Faster R-CNN to SAR target detection. Cui et al. [
11] proposed a dense attention pyramid network for multiregion analysis, specifically tailored for multiscale ship detection in SAR. Sun et al. [
12] proposed an innovative SAR ship detector based on YOLO, which incorporates bidirectional feature fusion and angle classification, enabling arbitrary direction detection. Numerous studies [
13,
14,
15] have demonstrated that methods based on CNN have achieved exceptional performance in the field of SAR object detection.
However, methods based on CNN require a large and diverse dataset [
16]. Unlike optical images, SAR images exhibit speckle and intricate texture-related details, and this poor visual interpretability brings challenges to SAR image interpretation [
17]. In addition, the rapid advancement of remote sensing technology has led to an abundance of SAR image data. However, these data originate from various carriers and imaging platforms, each with distinct technical specifications (e.g., radar parameters, imaging modes, and angles). The feature distributions of SAR images from different imaging platforms often vary significantly. This variation results in a phenomenon where a model trained on one dataset frequently underperforms when tested on another dataset.
Domain adaptation (DA) emerges as a valuable technique to mitigate the challenge posed by limited annotated samples in the field of object detection [
18]. The fundamental goal of domain adaptation is to train a model on a source dataset and ensure its robust performance on a significantly different target dataset. Typically, the source domain comprises the data distribution used for model training, while the target domain represents the distinct data distribution encountered during testing [
19]. Methods of domain adaptation have been widely applied in cross-domain target detection tasks [
20,
21,
22]. Recently, several studies have employed domain adaptation methods for SAR image object detection. Shi et al. [
23] extended the pioneering domain-adaptive object detection work of Chen et al. [24] by integrating multiple discriminators. Pan et al. [
25] proposed an end-to-end domain-adaptation-based ship detection network consisting of imbalanced discriminant alignment and imbalanced prediction consistency. Xu et al. [
26] introduced a multilevel alignment network that transfers knowledge from the optical domain to the SAR domain using domain adversarial strategies. However, most current cross-domain methods overlook two issues. The first is the neglect of the supervisory role that a few annotated SAR images can play in training the network; in practical applications, a small amount of annotated data in the SAR domain (target domain) is often accessible. The second is the focus on adversarial feature alignment while ignoring fusion-based alignment of the raw data across the two domains.
Therefore, we propose a semi-supervised cross-domain object detection method that transfers knowledge from the optical domain to the SAR domain. This method leverages a large amount of labeled optical images (source domain) and a few labeled SAR images (target domain) to facilitate knowledge transfer for SAR object detection. Our method focuses on the data-processing aspects to gradually transfer knowledge at the image, instance, and feature levels. First, we propose a data augmentation method of image mixing and instance swapping to generate a mixed domain that is more similar to the SAR-domain feature distribution. This method focuses on data processing and fully utilizes the few available SAR annotations to reduce domain shift at the image and instance levels. Second, at the feature level, we propose an adaptive optimization strategy to filter out mixed-domain samples that significantly deviate from the SAR feature distribution and select data similar to SAR samples to train the feature extractor. For mixed SAR and optical data, convolution-based local feature extraction still attends to the SAR and optical images separately, which limits its ability to extract fused features. Therefore, we adopt the Vision Transformer (ViT) as the feature extractor; through its global receptive field, ViT can better extract the features of the mixed images. This choice enhances the feature extraction process and improves the overall performance of the model in handling mixed SAR and optical data. In addition, in SAR images, manmade objects such as ships, aircraft, and oil tanks have small effective energy regions and relatively scarce detail information. The conventional intersection over union (IoU) metric, along with its extensions, is highly sensitive to small deviations in object localization. To address this problem, we propose an alternative approach that models bounding boxes as two-dimensional Gaussian distributions and measures their similarity with the normalized Wasserstein distance (NWD) [
27]. The NWD measures the similarity of the corresponding Gaussian distributions and can be seamlessly integrated into any anchor-based detector, replacing the commonly used IoU metrics. In summary, the main contributions of this paper are as follows.
- (1)
Feature-level processing: We propose a novel adaptive optimization strategy that utilizes metric learning to identify and filter feature samples that are more conducive to knowledge transfer, rather than blindly increasing the amount of data. This approach not only focuses on optimizing the quality of the selected samples but also effectively prevents overfitting by addressing the issue of data scarcity.
- (2)
Image-level and instance-level processing: We construct a two-step data augmentation method called Domain Mix. In the first step, we randomly combine images from the optical and SAR domains to enhance the diversity of the training data. In the second step, we separate a limited number of instance annotations from the SAR domain and interchange these annotations with the optical domain. This augmentation technique focuses on aligning the two domains at the image and instance levels, significantly enhancing the diversity of the dataset and playing a crucial role in improving detection performance.
- (3)
Two improvements for SAR object detection: In contrast to the local feature extraction of convolution, we employ ViT on the mixed SAR–optical images, aligning SAR and optical image features better through a global receptive field. Additionally, considering the small effective energy regions of SAR image objects, we model the bounding box as a two-dimensional Gaussian distribution and utilize the normalized Wasserstein distance (NWD) metric to improve detection accuracy in complex scenarios.
3. Methodology
We present the details of a novel semi-supervised cross-domain object detection framework from optical to SAR domain. The proposed method’s overall architecture is shown in
Figure 2, and it mainly involves three iterative steps to optimize the detector so that knowledge is transferred from the fully labeled optical domain to the SAR domain with only a few labeled samples. Firstly, to address the issue of insufficient SAR image data in the target domain and align the image-level and instance-level features between the two domains, we construct a new data augmentation method to generate a new domain that is more similar to the distribution of the SAR domain. Secondly, we enhance the acquisition of global features by using ViT to extract features from both the expanded hybrid images and the SAR images. Thirdly, to enhance the effectiveness of the data features, we introduce a metric matrix to filter the extracted features. Finally, we input the optimized samples for iterative optimization of the detector, and the IoU in the detector head is replaced with the NWD.
In our cross-domain detection task, we have a large labeled optical domain (source domain) $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ and a few labeled examples from the SAR domain (target domain) $D_t = \{(x_j^t, y_j^t)\}_{j=1}^{N_t}$, where $x^s$ and $x^t$ represent optical images and SAR images, respectively, and each label $y = (b, c)$ represents the bounding boxes and categories of the objects in the image. The target domain also has some test data. The goal of our method is to train an adaptive detector that alleviates the performance drop caused by the domain gap. In the following subsections, we present the details of SAR-CDSS.
3.1. Data Augmentation: Domain Mix
In order to reduce the domain gap between the source and target domains, we propose a new data augmentation strategy called Domain Mix. The hybrid images generated by this method are close to the distribution of the target domain. The augmentation mixes the two domains at both the image level and the instance level.
Image-level augmentation: To increase the diversity of images and align image-level features, we randomly mix images from the source domain and the target domain. Given a batch of data in the source domain $D_s$ and the target domain $D_t$, we sample $n_s$ and $n_t$ images, respectively, from the two domains and randomly mix them into a single image $x_{mix}$: starting from an initialized empty image $x_0$, whose dimensions differ from those of the source image $x^s$ and the target image $x^t$, a hand-crafted transformation matrix $T$ maps each image pair $(x^s, x^t)$ onto $x_0$, and the mixing weights $\lambda_s$ and $\lambda_t$, with $\lambda_s + \lambda_t = 1$, correspond to $x^s$ and $x^t$, respectively.
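To make the image-level mixing concrete, the snippet below gives a minimal sketch of one possible realization in PyTorch; the canvas size, the resize-based transformation, and the default weight $\lambda_s$ are illustrative assumptions rather than the exact formulation of our implementation.

```python
import torch
import torch.nn.functional as F

def domain_mix_image(x_s: torch.Tensor, x_t: torch.Tensor,
                     canvas_hw=(640, 640), lam_s: float = 0.5) -> torch.Tensor:
    """Sketch of image-level Domain Mix.

    x_s: optical (source) image of shape (C, H_s, W_s).
    x_t: SAR (target) image of shape (C, H_t, W_t), same channel count assumed.
    Both images are mapped onto an initialized empty canvas x_0 by a simple
    resize transform and blended with weights lam_s + lam_t = 1.
    """
    c = x_s.shape[0]
    x_0 = torch.zeros(c, *canvas_hw)                               # empty canvas x_0
    xs_r = F.interpolate(x_s[None], size=canvas_hw,
                         mode="bilinear", align_corners=False)[0]  # transform of x_s
    xt_r = F.interpolate(x_t[None], size=canvas_hw,
                         mode="bilinear", align_corners=False)[0]  # transform of x_t
    lam_t = 1.0 - lam_s
    return x_0 + lam_s * xs_r + lam_t * xt_r                       # mixed image x_mix
```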
Instance-level augmentation: To make the most of limited instance annotations and align instance-level features between the two domains, we can separate limited instance annotations from the background and randomly place them in other images. Unlike previous pixel-level copy–paste methods [
44], we separate the entire bounding box annotation and then paste it onto other images. Given bounding boxes $b^s$ and $b^t$ from the source and target domains, each resized to matching dimensions, we perform an exchange operation to combine their distinctive characteristics: the pixels at each index $(i, j)$ within $b^s$ and $b^t$ are exchanged between the two images, with a weight assigned to each index. The visualization of image-level and instance-level augmentations is shown in Figure 3.
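A minimal sketch of the instance-level swap is shown below; the box format, the resize-and-replace pasting, and the function names are illustrative assumptions (the actual implementation also updates the corresponding annotations and weights each index), and both images are assumed to have the same number of channels.

```python
import torch
import torch.nn.functional as F

def swap_instances(img_a: torch.Tensor, box_a, img_b: torch.Tensor, box_b):
    """Sketch of instance-level swapping between an optical and a SAR image.

    img_a, img_b: images of shape (C, H, W); box_a, box_b: (x1, y1, x2, y2)
    integer pixel coordinates of one annotated instance in each image.
    The whole annotated patch (not a pixel-level mask) is resized to the
    destination box and pasted there.
    """
    def crop(img, box):
        x1, y1, x2, y2 = box
        return img[:, y1:y2, x1:x2]

    def paste(dst, patch, box):
        x1, y1, x2, y2 = box
        resized = F.interpolate(patch[None], size=(y2 - y1, x2 - x1),
                                mode="bilinear", align_corners=False)[0]
        out = dst.clone()
        out[:, y1:y2, x1:x2] = resized          # replace the destination region
        return out

    patch_a, patch_b = crop(img_a, box_a), crop(img_b, box_b)
    new_a = paste(img_a, patch_b, box_a)        # SAR instance pasted into the optical image
    new_b = paste(img_b, patch_a, box_b)        # optical instance pasted into the SAR image
    return new_a, new_b
```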
3.2. Global Feature Extractor: Vision Transformer
In ViT, the role of self-attention is similar to that of the convolutional layer in a CNN. For the input token embedding sequence $Z$, we create the query $Q$, the key $K$, and the value $V$, which are obtained through three learnable linear projectors $W_Q$, $W_K$, and $W_V$ applied to layer-normalized features. Then, we match the query sequence with the keys to construct an $N \times N$ self-attention matrix, where each element signifies the semantic relevance between the corresponding query–key pair. These embedded states can be learned and used as image representations. In both the pretraining and fine-tuning stages, a classification head of the same dimension is attached. We also add one-dimensional position embeddings to the patch embeddings to retain positional information. Notably, ViT employs the standard Transformer encoder and produces its output prior to the multilayer perceptron (MLP) head. Typically, ViT is pretrained on large datasets and subsequently fine-tuned for downstream tasks using smaller datasets. Importantly, ViT's output is computed as a weighted sum of the values based on the self-attention matrix, as expressed by the following equation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
where $\mathrm{softmax}(\cdot)$ denotes the softmax function and $d_k$ is the dimension of the key vectors. Equation (3) computes scores between different vectors using $QK^{\top}$, determining the attention weights for encoding tokens at the current position. These scores are divided by $\sqrt{d_k}$ for gradient stability and transformed into probabilities by the softmax. Finally, each value vector is weighted by the corresponding probabilities and summed. Following the principles of SAR image target detection, we extract SAR image blocks centered on individual pixels and create training sets by randomly sampling from the labeled blocks. These patches are then fed into the Vision Transformer pipeline to construct feature representations. ViT is constructed layer by layer, comprising data preprocessing layers, self-attention layers, and multilayer perceptron layers. Analogous to the role of convolution in CNNs, ViT primarily constructs features through the self-attention mechanism. The underlying idea is to assess the importance of each pixel relative to the others, capturing their long-range interactions. Self-attention generates a weighted average of the embedding values, facilitating robust representation learning.
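For reference, the following is a minimal single-head self-attention block in PyTorch corresponding to Equation (3); the actual ViT backbone uses multi-head attention, but the computation within each head follows the same pattern.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Minimal single-head self-attention, as in Equation (3)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.w_q = nn.Linear(dim, dim, bias=False)   # learnable projector W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)   # learnable projector W_K
        self.w_v = nn.Linear(dim, dim, bias=False)   # learnable projector W_V
        self.scale = dim ** -0.5                     # 1 / sqrt(d_k)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: token embedding sequence of shape (batch, N, dim)
        zn = self.norm(z)                            # layer-normalized features
        q, k, v = self.w_q(zn), self.w_k(zn), self.w_v(zn)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # N x N score matrix
        attn = attn.softmax(dim=-1)                    # normalized into probabilities
        return attn @ v                                # weighted sum of values
```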
3.3. Adaptive Optimization Strategy
Domain adaptation assumes that the source domain and target domain exist in distinct feature distribution spaces, and attempts to align their data distributions for effective knowledge transfer. Two common approaches for achieving feature alignment are discrepancy-metric-based and adversarial-based methods [
19]. Discrepancy-metric-based methods typically design metrics to quantify the distribution difference between the source and target domains, minimizing these metrics to achieve alignment. As shown in
Figure 2, we can generate a set of data $D_{mix}$ using the introduced data augmentation method called Domain Mix, which we expect to be as close as possible to the target domain distribution. In our framework, the detector, with parameters $\theta$, is composed of a backbone $F$ and a head $H$. SAR-CDSS leverages $F$ as a feature extractor to produce feature representations $f_{mix} = F(x_{mix})$ for the samples $x_{mix}$ in $D_{mix}$ and $f_t = F(x_t)$ for the samples $x_t$ in $D_t$. Then, SAR-CDSS employs a discrepancy-metric-based measure of the distance between $f_{mix}$ and $f_t$ to sort the mixed candidate samples in $D_{mix}$. To mitigate noise in $D_{mix}$, we introduce a shrinkage ratio $k$ to reduce the number of expanded candidates, and we define an optimization function $\Phi$ that refines $D_{mix}$ by retaining only the candidates closest to the target distribution, resulting in the optimized extended domain denoted as $D_{opt}$.
Through $\Phi$, we select the top $k$ fraction of candidates from $D_{mix}$. This process yields an optimized extended domain, denoted as $D_{opt}$, which better aligns with the target domain distribution $D_t$. However, the suitability of $D_{opt}$ may change as $\theta$ converges. To solve this problem, we optimize $D_{opt}$ iteratively. Consider detectors $\theta_a$ and $\theta_b$ that have gone through $a$ and $b$ training epochs, respectively, with $a < b$. Since the parameters are continually updated, the errors of $\theta_b$ on the source domain and the target domain are expected to be smaller than those of $\theta_a$. Consequently, the feature extractor $F_b$ represents both $D_{mix}$ and $D_t$ more accurately than $F_a$. Leveraging this insight, we iteratively optimize $D_{opt}$ using the metric $\mathcal{M}$, updating the feature representations $f_{mix}$ and $f_t$ at each epoch. After the $t$-th training epoch, we obtain the updated optimized domain $D_{opt}^{t}$ by filtering $D_{mix}$ with the refreshed feature representations.
Finally, the adaptive detector $\theta$ can also be iteratively optimized on $D_{opt}^{t} \cup D_t$ as follows:
$$\theta^{t+1} = \mathrm{Opt}\!\left(\theta^{t}, \eta, \nabla_{\theta}\,\mathcal{L}\!\left(\theta^{t}; D_{opt}^{t} \cup D_t\right)\right)$$
Given an optimizer denoted as $\mathrm{Opt}$, a learning rate $\eta$ for $\theta$, and a loss function $\mathcal{L}$, the detector parameters are updated as above. To assess the correlation between $f_{mix}$ and $f_t$, we adopt the widely used maximum mean discrepancy (MMD) as the metric function $\mathcal{M}$. In a reproducing kernel Hilbert space, MMD quantifies the distance between two distributions. Specifically, we instantiate $\mathcal{M}(f_{mix}, f_t)$ using the following formula:
$$\mathcal{M}(f_{mix}, f_t) = \left\lVert \frac{1}{n_{mix}} \sum_{i=1}^{n_{mix}} \phi\!\left(f_{mix}^{\,i}\right) - \frac{1}{n_t} \sum_{j=1}^{n_t} \phi\!\left(f_t^{\,j}\right) \right\rVert_{\mathcal{H}}^{2}$$
where $\phi(\cdot)$ is the feature mapping associated with the kernel of the reproducing kernel Hilbert space $\mathcal{H}$, and $n_{mix}$ and $n_t$ denote the numbers of mixed-domain and target-domain samples, respectively.
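The snippet below sketches one way the adaptive sample selection could be instantiated: each mixed-domain sample is scored by an RBF-kernel MMD against the SAR feature set, and the fraction $k$ of samples closest to the target distribution is kept. The single-bandwidth kernel and the per-sample scoring are illustrative choices, not necessarily the exact instantiation used in SAR-CDSS.

```python
import torch

def mmd_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Squared MMD between two feature sets using an RBF kernel with one bandwidth."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def select_candidates(feat_mix: torch.Tensor, feat_tgt: torch.Tensor, k: float = 0.85):
    """Keep the fraction k of mixed samples whose features are closest to the
    SAR (target) features, following the adaptive optimization strategy.

    feat_mix: (N_mix, D) backbone features of mixed-domain samples.
    feat_tgt: (N_t, D) backbone features of target-domain samples.
    """
    scores = torch.stack([mmd_rbf(f[None], feat_tgt) for f in feat_mix])
    keep = max(1, int(k * feat_mix.shape[0]))
    idx = torch.argsort(scores)[:keep]   # smaller distance = closer to the SAR domain
    return idx
```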
3.4. NWD-Based Head
As is well known, IoU is essentially the Jaccard similarity coefficient of two finite sets. However, IoU exhibits varying sensitivity to objects of different scales. Notably, for small objects, even a minor positional deviation can lead to a substantial IoU drop, resulting in inaccurate label assignment. Conversely, for objects at a normal scale, IoU remains relatively stable under the same positional deviation. Therefore, in detection tasks involving small objects, IoU is not suitable for evaluating the positional relationship between objects, so we propose utilizing the Wasserstein distance, based on optimal transport theory, as an alternative metric for SAR object detection. Real-world objects are rarely strictly rectangular, so their bounding boxes often contain background pixels. In such bounding boxes, foreground and background pixels tend to concentrate around the center and the boundary, respectively. To better characterize the importance of different pixels within the bounding box, we model it as a two-dimensional Gaussian distribution in which the center pixel carries the highest weight and pixel importance gradually decreases from the center toward the boundary. Specifically, consider a horizontal bounding box $R = (c_x, c_y, w, h)$, where $(c_x, c_y)$ denotes the center coordinates and $w$ and $h$ represent the width and height, respectively. The equation of its inscribed ellipse can be expressed as follows:
$$\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2} = 1$$
where $(\mu_x, \mu_y)$ represents the center coordinates of the ellipse, and $\sigma_x$ and $\sigma_y$ denote the semi-axis lengths along the $x$ and $y$ axes, respectively. Consequently, we have $\mu_x = c_x$, $\mu_y = c_y$, $\sigma_x = \frac{w}{2}$, and $\sigma_y = \frac{h}{2}$. Next, the probability density function of a two-dimensional Gaussian distribution is
$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{2\pi \lvert \boldsymbol{\Sigma} \rvert^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$
where $\mathbf{x}$, $\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$ represent the coordinates $(x, y)$, the mean vector, and the covariance matrix of the Gaussian distribution, respectively. When $(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = 1$, the ellipse above is a density contour of the two-dimensional Gaussian distribution. Consequently, we can model a horizontal bounding box $R = (c_x, c_y, w, h)$ as a two-dimensional Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where
$$\boldsymbol{\mu} = \begin{bmatrix} c_x \\ c_y \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} \dfrac{w^2}{4} & 0 \\ 0 & \dfrac{h^2}{4} \end{bmatrix}$$
We employ the Wasserstein distance from optimal transport theory to quantify the dissimilarity between distributions. Consider two two-dimensional Gaussian distributions $\mathcal{N}_1(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $\mathcal{N}_2(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$. The second-order Wasserstein distance between $\mathcal{N}_1$ and $\mathcal{N}_2$ can be expressed as follows:
$$W_2^2(\mathcal{N}_1, \mathcal{N}_2) = \left\lVert \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 \right\rVert_2^2 + \left\lVert \boldsymbol{\Sigma}_1^{1/2} - \boldsymbol{\Sigma}_2^{1/2} \right\rVert_F^2$$
where $\lVert \cdot \rVert_F$ denotes the Frobenius norm. For Gaussian distributions $\mathcal{N}_a$ and $\mathcal{N}_b$ modeled from bounding boxes $A = (c_{x_a}, c_{y_a}, w_a, h_a)$ and $B = (c_{x_b}, c_{y_b}, w_b, h_b)$, the formula can be further simplified as follows:
$$W_2^2(\mathcal{N}_a, \mathcal{N}_b) = \left\lVert \left( c_{x_a}, c_{y_a}, \frac{w_a}{2}, \frac{h_a}{2} \right)^{\top} - \left( c_{x_b}, c_{y_b}, \frac{w_b}{2}, \frac{h_b}{2} \right)^{\top} \right\rVert_2^2$$
However, $W_2^2(\mathcal{N}_a, \mathcal{N}_b)$ is a distance metric. To achieve a value range akin to IoU (i.e., between 0 and 1), we employ an exponential nonlinear transformation to remap the Wasserstein distance into another space. This transformation yields the normalized Wasserstein distance (NWD), defined as follows:
$$\mathrm{NWD}(\mathcal{N}_a, \mathcal{N}_b) = \exp\!\left( -\frac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{C} \right)$$
where $C$ represents a constant closely tied to the dataset. Empirically, setting $C$ to the average absolute object size of the dataset yields optimal performance. By adopting this approach, we achieve a more accurate evaluation of the positional relationships among SAR objects.
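As a concrete illustration, the NWD between two horizontal boxes in $(c_x, c_y, w, h)$ format can be computed as in the sketch below; the default value of $C$ is a placeholder and should be replaced by the average absolute object size of the dataset at hand.

```python
import torch

def nwd(box_a: torch.Tensor, box_b: torch.Tensor, C: float = 32.0) -> torch.Tensor:
    """Normalized Wasserstein distance between horizontal boxes (cx, cy, w, h).

    Each box is modeled as a 2-D Gaussian with mean (cx, cy) and diagonal
    covariance diag(w^2/4, h^2/4); C is a dataset-dependent constant
    (32.0 here is only a placeholder).
    """
    cxa, cya, wa, ha = box_a.unbind(-1)
    cxb, cyb, wb, hb = box_b.unbind(-1)
    # squared second-order Wasserstein distance between the two Gaussians
    w2 = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
          + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return torch.exp(-torch.sqrt(w2) / C)
```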
4. Evaluation
4.1. Datasets Description and Implementation Details
In this section, we evaluate our proposed method for domain adaptation from optical to SAR object detection datasets. To assess the generalization ability of our framework, we select four datasets, divided into ship detection and oil tank detection. Ship detection consists of the optical domain HRSC2016 [
45] (source domain) and the SAR domain HRSID [
46] (target domain); oil tank detection consists of optical and SAR images from the SpaceNet 6(SN6) [
47] dataset, which contains 820 pairs of single-polarization synthetic aperture radar (target domain) and corresponding optical images (source domain) of oil tanks. In addition, the statistics of these real-world public datasets are shown in
Table 1.
Ship detection (HRSC2016 → HRSID): In this scenario, we use HRSC2016 as the source dataset. It was extracted from six major ports in Google Earth and contains 1061 valid annotated images with sizes ranging from 300 × 300 to 1500 × 900, all of which we use for training. HRSID contains 5604 SAR images with resolutions ranging from 0.5 m to 3 m and a size of 800 × 800, including 16,951 ship instances. To make the detection task more challenging in complex scenes, we only use the 471 inshore images as the target dataset, with 5% of the images used for training and 95% for testing.
Oil tank detection (SN6 Optical → SAR): SpaceNet 6 is a multisensor all-weather mapping dataset that combines SAR and optical image datasets. The spatial resolution of its SAR and optical images is 0.5 m/pixel, and the size is 900 × 900. It contains 820 pairs of oil tank images. We use its optical images as the source dataset and SAR images as the target dataset. We use all optical images and 5% of SAR images as the training set, and 95% of SAR images as the test set.
All experiments are conducted using PyTorch. In our experiments, we use the supervised single-stage detector YOLOv5 as the baseline network and compare it with other unsupervised and semi-supervised cross-domain methods. For the unsupervised domain adaptation (UDA) setting, we use the labeled optical dataset as the source domain and the unlabeled SAR dataset as the target domain. For the semi-supervised domain adaptation (SSDA) setting, we use the labeled optical dataset as the source domain and a few labeled SAR images as the target domain. In our semi-supervised setting, we randomly select 5% of the labeled images in the SAR dataset for supervision, and the results are averaged over five such random selections. The shrinkage ratio in the domain-adaptive optimization is k = 0.85, the input image size is adjusted to 640 × 640 in all experiments, the number of training epochs is 300, and the average precision at an IoU threshold of 0.5 is reported as AP50.
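For clarity, the main training settings described above can be summarized as follows (the dictionary keys are illustrative, not the actual configuration fields of our code):

```python
# Illustrative summary of the experimental settings (keys are hypothetical).
EXPERIMENT_SETTINGS = {
    "baseline_detector": "YOLOv5",
    "labeled_target_fraction": 0.05,   # 5% of SAR images used for supervision
    "shrinkage_ratio_k": 0.85,         # fraction of mixed samples kept by the filter
    "input_size": (640, 640),
    "epochs": 300,
    "evaluation_metric": "AP50",       # AP at an IoU threshold of 0.5
}
```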
4.2. Evaluation Metrics
We employ the standard MS COCO evaluation metrics to compare the performance of ship and oil tank detection on SAR images. All results are reported on the test set. The intersection over union (IoU) measures the overlap between the predicted and ground truth bounding boxes as the ratio of their intersection area to their union area. If the IoU is above a certain threshold, the detection is considered a true positive (TP); if it is below the threshold, it is considered a false positive (FP). A false negative (FN) is a real object that is not detected. The formulas for calculating precision (PR) and recall (RE) are as follows:
$$\mathrm{PR} = \frac{TP}{TP + FP}, \qquad \mathrm{RE} = \frac{TP}{TP + FN}$$
Then, we sort the detections by confidence score to generate the precision–recall (PR) curve. Based on the PR curve, the average precision (AP) score is obtained by calculating the area under the curve:
$$\mathrm{AP} = \int_0^1 P(R)\, dR$$
In the experiments of this paper, we use AP50 as the final evaluation indicator, i.e., the AP calculated with an IoU threshold of 0.5.
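A minimal sketch of how AP can be computed from confidence-sorted detections is given below; the matching of detections to ground truth (at IoU ≥ 0.5 for AP50) is assumed to have been done beforehand, and the step-wise integration is a simplified variant of the COCO procedure.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP as the area under the precision-recall curve.

    scores: detection confidences; is_tp: 1 if the detection matches an unused
    ground truth box (IoU >= 0.5 for AP50), else 0; num_gt: number of ground
    truth objects. Detection-to-ground-truth matching is assumed to be done already.
    """
    order = np.argsort(-np.asarray(scores))          # sort by descending confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-12)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):              # step-wise integration over recall
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```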
4.3. Comparison Experiments
To demonstrate the effectiveness and superiority of our proposed method for cross-domain object detection on the HRSC2016, HRSID, and SpaceNet 6 datasets, this section compares our method with several advanced cross-domain methods.
YOLOv5 [
48]: This is the baseline of our approach and is currently a mainstream, high-performing single-stage detection method. While achieving detection accuracy comparable to the two-stage detector Faster R-CNN, it is faster. Our detector is built on YOLOv5.
SWDA [
20]: This is a UDA object detection method based on the two-stage detector Faster R-CNN. It performs strong alignment on local features and weak alignment on global scenes; it is one of the pioneering works in domain-adaptive object detection and has provided ideas for subsequent research.
SSDA-YOLO [
49]: This is a new semi-supervised domain adaptation method based on YOLOv5. It uses the style transfer algorithm CUT for scene style transfer, cross-generates pseudo-images in different domains to bridge image-level differences, and adapts the knowledge distillation framework with the mean–teacher model to enable the student model to obtain instance-level features of the unlabeled target domain.
OS-SSL [
50]: This is a method for enhancing SAR images with multichannel optical images for oil tank detection. The feature extractor is trained with the OS-SSL method in the pretraining stage, and an optical-image knowledge distillation algorithm with an attention mechanism is used when training the detection network to further learn optical feature knowledge. The oil tank dataset established in that work is the one used in our experiments.
SoftTeacher [
51]: This is an end-to-end semi-supervised object detection method. The proposed soft teacher mechanism effectively balances the classification loss of unlabeled boxes, and the box network mechanism selects reliable pseudo boxes for learning box regression.
To be fair, we assign the same dataset partition to different algorithms. In
Table 2, we report the AP(%) of various comparison algorithms on the test set. The third column of
Table 2 shows the results of the various algorithms for ship detection when the source domain is HRSC2016 and the target domain is HRSID. The fourth column shows the results for oil tank detection when the source domain is the SN6 optical images and the target domain is the SN6 SAR images. From
Table 2, we can see that using optical remote sensing data to assist SAR image target detection achieves significant gains. The unsupervised cross-domain methods improve detection performance by about 9%. The significant differences in imaging mechanisms between optical and SAR images pose a great challenge to cross-domain detection. In practical applications, a small amount of annotated SAR data is often available, so on top of the unsupervised cross-domain setting we add a small amount of labeled SAR data for supervision. Detection performance increases by more than 15% compared with the unsupervised cross-domain methods in both the ship and oil tank experiments. Compared with the advanced semi-supervised cross-domain object detection method SoftTeacher, our proposed method improves detection performance by more than 7%. This also validates that our approach of performing fusion on the raw data can effectively reduce the shift between the two domains.
Figure 4 and
Figure 5 illustrate the visualization results of the SAR image object detection task in the ship and oil tank scenarios, respectively. Our method is capable of accurately regressing bounding boxes while addressing the issue of densely distributed ships and oil tanks. This is attributed to our proposed data augmentation method, which aligns features between the two domains at both the image and instance levels and effectively improves detection performance in situations where objects are densely distributed. In the ship experiment, we specifically focus on inshore scenes, which represent one of the most challenging contexts for SAR image ship detection. In the oil tank experiment, we show oil tanks of different sizes and high-density oil tank areas. In the comparison experiments, our method exhibits relatively fewer false alarms but occasionally misses some objects; this is because convolution-based methods focus on local features, while our ViT-based feature extractor pays more attention to global features. From
Figure 4 and
Figure 5, we can see that our method still performs well in complex scenes. Overall, our method achieves the best detection performance.
The specific implementation details of our proposed framework are presented in Algorithm 1.
Algorithm 1: Details of SAR-CDSS.
Input: Initialized detector $\theta_0$, the labeled source domain $D_s$, few labeled target domain $D_t$, total epochs $T$, metric function $\mathcal{M}$, the number of steps per epoch $N$, domain-adaptive optimization function $\Phi$, shrinkage ratio $k$, loss function $\mathcal{L}$.
Output: Adaptive detector $\theta_T$
1: Initialize $D_{mix} \leftarrow \mathrm{DomainMix}(D_s, D_t)$
2: Initialize feature extractor $F_0$ and head $H_0$
3: for $t = 1, \dots, T$ do
4:   $f_{mix} \leftarrow F_{t-1}(D_{mix})$, $f_t \leftarrow F_{t-1}(D_t)$
5:   $D_{opt}^{t} \leftarrow \Phi\big(D_{mix}, \mathcal{M}(f_{mix}, f_t), k\big)$
6:   for $n = 1, \dots, N$ do
7:     sample a batch from $D_{opt}^{t} \cup D_t$
8:     compute the loss $\mathcal{L}$ on the batch
9:     Update $\theta_{t-1}$ to minimize the loss $\mathcal{L}$
10:   end for
11:   $\theta_t \leftarrow \theta_{t-1}$ (updated $F_t$ and $H_t$)
12: end for
13: return the adaptive detector $\theta_T$
4.4. Ablation Experiment
4.4.1. Component Analysis
In order to evaluate the impact of each component of our proposed method, we conducted ablation experiments in the ship and oil tank scenarios to further validate the effectiveness of our modules.
Table 3 shows the detailed results for the various modules. Our ablation experiments were divided into four groups: (1) incorporating the Domain Mix data augmentation module while keeping all other components constant; (2) solely substituting the original Darknet-53 feature extractor of YOLOv5 with ViT while keeping all other components constant; (3) incorporating the Domain Mix augmentation and using ViT as the feature extractor while keeping all other components constant; (4) retaining the data augmentation and ViT and substituting the IoU used by the YOLOv5 head with the NWD. The results indicate that our proposed Domain Mix data augmentation significantly enhances the detection performance of the model, improving it by over 5%. ViT, which emphasizes global feature extraction, reduces false alarms and improves detection performance by approximately 2%. Given the limited effective energy area of SAR image targets, the detection head combined with the NWD effectively enhances the detection precision of SAR objects. Furthermore,
Table 4 illustrates the enhancement of small object detection performance by the NWD module, evaluated with the multiscale metrics of the COCO evaluation system. As evidenced in Table 4, the NWD module effectively boosts the detection performance for small objects, with an increase of over 1% in the small-object AP.
4.4.2. SAR Data Quantity Analysis
We conduct a series of experiments to verify the effect of different amounts of data in the target domain, choosing 1%, 5%, and 10% of the data from HRSID and SN6-SAR for model training.
Table 5 illustrates the performance of the three methods under the supervision of different quantities of SAR images. Due to the limited amount of training data, the performance of all algorithms is unsatisfactory with only 1% of the SAR data. However, our method exhibits outstanding performance in all cases.
4.4.3. Shrinkage Ratio k Analysis
Furthermore, we investigated the impact of the shrinkage rate
k, used to filter samples during the adaptive optimization process, on detection performance, where k = 1 indicates no optimization, as shown in
Figure 6. The line graph in the figure suggests that our proposed adaptive optimization method can better assist the model in cross-domain training. Simultaneously, it can be observed that within a certain range of
k values, the detection performance of the model remains relatively stable.
5. Discussion
In the oil tank and ship datasets, extensive experiments and analyses have substantiated the efficacy of our proposed method. As indicated in
Table 2, we found that the detection performance of the detector is low when only a small number of labeled SAR images is used for training. Introducing data from another modality, such as optical remote sensing data, can significantly enhance the detection performance of the detector. Optical data provide a more potent representation for learning from SAR images. Compared to optical images, SAR images exhibit distinct characteristics: discontinuous contours, severe geometric deformation, and the presence of speckle noise. These factors weaken the semantic and appearance correlations among objects, rendering semantic interpretation uncertain and SAR image analysis challenging.
To address these challenges, we leverage both labeled SAR image information and optical remote sensing data for detector training. The results show that the proposed training method significantly improves the detection performance. Our method combines the scattering intensity information of SAR images and optical remote sensing prior knowledge to reduce the domain gap between SAR images and optical images at the image level, instance level, and feature level. In addition, unlike previous convolutional operations that focus on local feature extraction, we use a global feature extractor to better extract common features of targets in both domains, which is suitable for the characteristics of SAR image targets. Our method was validated on oil tank and ship targets and can be extended to other SAR image object detection tasks. The use of optical data to assist SAR image target detection tasks has significant implications for deep-learning-based SAR image interpretation.
6. Conclusions
In this study, we propose an innovative semi-supervised cross-domain object detection method that bridges the gap between the optical domain and the SAR domain. By leveraging a limited number of annotated SAR images and a large number of annotated optical images, our method can effectively perform cross-domain detection on SAR images. This approach addresses the scarcity of SAR data and its labeling difficulties, as well as the domain shift between optical and SAR images, and provides new insights for more stable and efficient object detection in SAR images. We experimentally validated our method on ship and oil tank objects, and the results show that, compared with existing advanced cross-domain detection methods, our method significantly improves detection performance and exhibits excellent generalization ability and robustness.
Looking forward, we believe there is still much to explore in this field. Integrating more advanced semi-supervised learning techniques or introducing additional data sources may further enhance the performance of our method. In addition, our method can be extended to other tasks in computer vision, such as segmentation and classification. We expect our work to inspire future research in this exciting field.