1. Introduction
Due to the special characteristics of synthetic aperture radar (SAR) [1], SAR image processing has been widely applied in many fields [2,3]. In particular, some applications [4,5,6,7,8], such as SAR image change detection [4], SAR image fusion [5,6], and open set recognition [9], require the simultaneous processing of two or more SAR images. However, SAR images are generally acquired by different sensors under different conditions (e.g., time, viewpoint, and noise), which leads to differences among these images. Therefore, the registration of two or more SAR images is indispensable. SAR image registration [10,11,12] matches two SAR images, called the reference image and the sensed image, by estimating the geometric transformation model between them. At present, many registration methods [13,14,15,16,17] have been proposed to register two SAR images, and they can be roughly divided into traditional registration methods [13,18,19] and deep-learning-based registration methods [12,20,21,22,23]. Due to the prominent performance of deep learning, deep-learning-based SAR image registration has recently received more attention than traditional methods.
In general, a deep-learning-based registration model is designed from the perspective of a two-classification problem, where pairs of matched points and pairs of non-matched points are regarded as the positive and negative categories, respectively [12,16,21,22,24]. Therefore, to achieve better registration performance, a mass of matched-point and non-matched-point pairs is expected to be fed to the deep model. However, unlike non-matched-point pairs, many reliable matched-point pairs are difficult to obtain directly from two SAR images, which limits the performance of the deep model.
Interestingly, we find that each point in an SAR image is essentially distinct and independent from the others. This characteristic leads to a worthwhile question: why not directly treat multiple key points as multiple classes and construct a multi-classification deep model for SAR image registration, instead of a two-classification deep model? Recently, Mao et al. [25] proposed an adaptive self-supervised SAR image registration method, which utilizes each key point as an independent instance to train a self-supervised deep model and then compares the latent features of each key point with those of other points to search for the final matched-point pairs. Inspired by this, we aim to design an SAR image registration method from the perspective of multi-classification (discriminating multiple key points), where each key point is considered an independent class, abandoning the idea of constructing a two-classification model (discriminating matched and non-matched pairs).
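To make the "each key point is a class" idea concrete, the following minimal Python sketch builds such a dataset: every key point contributes one sub-image patch whose class label is simply its own index. The patch size, padding mode, and function name are our illustrative choices, not details from the paper.

```python
import numpy as np

def keypoints_to_classes(image, keypoints, patch=32):
    """Treat every key point as its own class: one sub-image per class label.

    image     : 2-D SAR amplitude array
    keypoints : list of (row, col) key-point coordinates
    Returns (patches, labels), where label k simply indexes key point k.
    """
    half = patch // 2
    padded = np.pad(image, half, mode="reflect")   # so border points get full patches
    patches, labels = [], []
    for k, (r, c) in enumerate(keypoints):
        patches.append(padded[r:r + patch, c:c + patch])  # patch centered at (r, c)
        labels.append(k)                           # class id == key-point index
    return np.stack(patches), np.array(labels)
```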
Noticeably, the purpose of an SAR image registration model is to find $K$ matched-point pairs ($\mathcal{M}$) between $m$ key points ($\mathcal{P}^{r}$) on the reference image and $n$ key points ($\mathcal{P}^{s}$) on the sensed image. Given $p_{i}^{r} \in \mathcal{P}^{r}$ and $p_{j}^{s} \in \mathcal{P}^{s}$, if $p_{i}^{r}$ and $p_{j}^{s}$ are a pair of matched points, it means that they come from the same location in the two images, and the image information corresponding to them should also be consistent. In brief, the mathematical description of the three sets is given as

$$\mathcal{P}^{r}=\{p_{1}^{r},\dots,p_{m}^{r}\},\quad \mathcal{P}^{s}=\{p_{1}^{s},\dots,p_{n}^{s}\},\quad \mathcal{M}=\{(p_{i_{k}}^{r},\,p_{j_{k}}^{s})\}_{k=1}^{K},$$

where $p_{i_{k}}^{r}\in\mathcal{P}^{r}$ and $p_{j_{k}}^{s}\in\mathcal{P}^{s}$, $K\le\min(m,n)$. Therefore, if each key point is considered as a class, the set of matched points is equivalent to the overlap between $\mathcal{P}^{r}$ (with $m$ classes) and $\mathcal{P}^{s}$ (with $n$ classes). This means that some classes overlap when each key point is considered as an independent class to construct the multi-classification model.

As we know, in a learning model, the categories of the training and testing sets are generally consistent, whereas in SAR image registration, only the overlapping key points between the reference and sensed images are consistent, since these points are matched points; the remaining key points are different. In other words, if all key points ($m+n$ points) are considered as classes, the number of truly independent classes is $m+n-K$ (not $m+n$), but $K$ is unknown and must be obtained by the model. Between the $m$ classes and the $n$ classes, there are only $K$ shared classes, while the remaining $m+n-2K$ classes are different. This raises a difficult problem: how to construct a multi-classification model based on $m+n$ given key points for SAR image registration.
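As a concrete illustration of this class counting (the numbers are ours, not from the datasets used later):

$$m = 100,\quad n = 120,\quad K = 30 \;\Rightarrow\; m + n = 220 \text{ nominal classes, but only } m + n - K = 190 \text{ independent classes},$$

with $m - K = 70$ classes appearing only on the reference image and $n - K = 90$ classes appearing only on the sensed image, while the $K = 30$ matched points are counted twice in the nominal total.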
Based on the above analyses, in this paper, we propose a double-transformation network for SAR image registration from the perspective of multi-classification, which mainly consists of a coarse-matching module based on a double network and a precise-matching module. Considering that the key points from the reference and sensed images partially overlap, the proposed method first constructs two multi-classification sub-networks, based on the key points from the two images, to seek coarse matched points. In each sub-network, the key points from one image are used as classes to train a multi-classification network, while the key points from the other image are used as the testing set; Swin-Transformer is adopted as the basic network. With the two sub-networks, the predictions of the $m$ key points are obtained by the model trained with the $n$ key points, and the predictions of the $n$ key points are obtained by the other model trained with the $m$ key points. Since only some of the key points (classes) from the two images are matched, the predictions of the two models are inaccurate, but some predictions should be consistent. Therefore, a precise-matching module is designed to seek these consistent predictions from the results of the two models as matched points, and the registration transformation is then computed from the obtained matched points. In addition, to weaken the effect of inherent differences between the two SAR images, the key points from the sensed image are converted into the reference image by an initial transformation matrix, and the reference image is used as the base image to capture the sub-image of each key point. Finally, experimental results illustrate that the proposed method achieves higher registration performance than state-of-the-art methods. A minimal sketch of the coarse cross-prediction and consistency check is given below.
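The following minimal Python sketch illustrates the consistency idea behind the coarse matching, assuming each sub-network outputs a hard class prediction for every key point of the other image; the function name and the cycle-consistency rule are our simplified reading, not the paper's exact module.

```python
import numpy as np

def mutual_consistent_matches(pred_s_by_r, pred_r_by_s):
    """Coarse matching via cross-prediction consistency (illustrative sketch).

    pred_s_by_r[j] = class (reference key-point index) predicted for sensed
                     key point j by the sub-network trained on reference points.
    pred_r_by_s[i] = class (sensed key-point index) predicted for reference
                     key point i by the sub-network trained on sensed points.
    A pair (i, j) is kept only when the two sub-networks agree.
    """
    matches = []
    for j, i in enumerate(pred_s_by_r):
        if pred_r_by_s[i] == j:          # cycle-consistency between the two models
            matches.append((i, j))
    return matches

# Toy usage with made-up predictions (5 reference points, 4 sensed points).
pred_s_by_r = np.array([2, 0, 1, 2])     # predictions for sensed points 0..3
pred_r_by_s = np.array([1, 2, 0, 3, 3])  # predictions for reference points 0..4
print(mutual_consistent_matches(pred_s_by_r, pred_r_by_s))
# -> [(2, 0), (0, 1), (1, 2)]  only mutually consistent pairs survive
```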
Compared with most existing methods, the contributions of the proposed method are listed as follows:
We utilize each key point directly as a class to design the multi-class model of SAR image registration, which avoids the difficulty of constructing the positive instances (matched-point pairs) in the traditional (two-classification) registration model.
We design the double-transformation network with a coarse-to-precise structure, where the key points from the two images are used to train two sub-networks that alternately predict the key points from the other image. This addresses the problem that the categories of the training and testing sets are inconsistent.
A precise-matching module is designed to modify the predictions of the two sub-networks and obtain consistent matched points, where the nearest points of each key point are introduced to refine the predicted matched points (a sketch of this refinement follows this list).
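Continuing the notation of the coarse-matching sketch above, here is a hedged sketch of a neighborhood-based refinement: a candidate pair is kept only when the spatially nearest reference points of a key point are predicted to land near the corresponding sensed point. The voting rule and the parameters `k` and `tol` are our assumptions; the paper's precise-matching module is described in Section 3.

```python
import numpy as np

def refine_with_neighbors(cand_pairs, ref_pts, sen_pts, pred_r_by_s, k=3, tol=5.0):
    """Neighborhood-based refinement of candidate matches (our reading, hedged).

    For each candidate pair (i, j), look at the k spatially nearest reference
    key points of i; the pair is kept only if most of their predicted sensed
    points fall near the location implied by (i, j), within `tol` pixels.
    `k` and `tol` are illustrative values, not parameters from the paper.
    """
    ref_pts, sen_pts = np.asarray(ref_pts, float), np.asarray(sen_pts, float)
    kept = []
    for i, j in cand_pairs:
        d = np.linalg.norm(ref_pts - ref_pts[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]                    # k nearest reference points
        offset = sen_pts[j] - ref_pts[i]                 # shift implied by the pair
        votes = 0
        for t in nbrs:
            j_t = pred_r_by_s[t]                         # neighbor's predicted match
            if np.linalg.norm(sen_pts[j_t] - (ref_pts[t] + offset)) < tol:
                votes += 1
        if votes >= (k + 1) // 2:                        # majority of neighbors agree
            kept.append((i, j))
    return kept
```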
The rest of this paper is organized as follows. Section 2 reviews related works. Section 3 introduces the proposed method in detail. Section 4 presents the experimental results and analyses that verify the performance of the proposed method. Section 5 provides a discussion, and Section 6 concludes this paper.
4. Experiments and Analyses
In order to validate the registration performance of the proposed method, we conduct experiments and analyses on four items: (1) comparing the registration performance of the proposed method with the state-of-the-art methods; (2) visualizing the registration results on chessboard diagrams; (3) analyzing the precise-matching module; (4) analyzing the double-transformation network. In the experiments, four datasets are used to validate the registration performance, namely the Wuhan, Australia-Yama, YellowR1, and YellowR2 datasets; more detailed descriptions are available in [16,25]. The four datasets are shown in Figure 3, Figure 4, Figure 5 and Figure 6, respectively.
All experiments are implemented on an NVIDIA GeForce RTX 2080Ti under Windows 10, with 64 GB memory and an Intel(R) Xeon(R) CPU E5-2605 v3 @ 2.30 GHz, using the PyTorch framework. In the data-enhancement process, three transformations are used: scale, rotation, and contrast. Note that the parameter settings differ between the training set and the validation set: for the training samples, the rotation parameter is selected between 1 and 20 degrees, while for the validation samples, it is selected between 1 and 10 degrees; the scale and contrast ranges likewise differ between the two sets. For the parameters of Swin-Transformer, the batch size is set to 128, the feature-dim size to 128, and the temperature to 0.5. Following [40,42], the layer numbers of the blocks are set to 2, 2, 18, and 2. An AdamW optimizer is used to train the network for 300 epochs with a cosine-decay learning-rate scheduler; the initial learning rate is set to 0.001 and the weight decay to 0.05. A minimal, self-contained sketch of this training setup is given below.
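The following self-contained PyTorch sketch mirrors the optimizer and scheduler settings stated above (AdamW, initial learning rate 0.001, weight decay 0.05, cosine decay over 300 epochs, batch size 128, feature dimension 128); the tiny CNN stands in for the actual Swin-Transformer backbone, and the dummy batch replaces the real sub-image patches.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in for the Swin-Transformer backbone (block depths 2, 2, 18, 2 in the
# paper); a tiny CNN keeps this sketch self-contained and runnable.
model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(16, 128))            # feature-dim 128 (see text)

optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = CosineAnnealingLR(optimizer, T_max=300)  # cosine decay, 300 epochs

# Dummy batch standing in for 128 sub-image patches and their class labels.
patches = torch.randn(128, 1, 32, 32)
labels = torch.randint(0, 128, (128,))

for epoch in range(300):
    logits = model(patches)                          # one (dummy) batch per epoch
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```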
In addition, eight quantified evaluation indicators [16] are used to validate the registration performance: $RMSE_{all}$, $N_{red}$, $RMSE_{LOO}$, $P_{quad}$, $BPP(r)$, $Skew$, $Scat$, and a combined indicator. The eight indicators are detailed as follows (a computation sketch of the two RMSE indicators follows this list):
- 1. $RMSE_{all}$ expresses the root mean square error of the registration result. Note that $RMSE_{all} < 1$ means that the performance reaches sub-pixel accuracy.
- 2. $N_{red}$ is the number of matched-point pairs. A higher value may be beneficial for obtaining a transformation matrix with better registration performance.
- 3. $RMSE_{LOO}$ expresses the error obtained based on the Leave-One-Out strategy and the root mean square error: each matched point is left out in turn, the transformation is re-estimated from the remaining points, and the error of the left-out point is computed; $RMSE_{LOO}$ is the average of these errors over all points.
- 4. $P_{quad}$ is used to detect whether the retained feature points are evenly distributed over the quadrants, and its value should be less than the critical value of the corresponding statistical test.
- 5. $BPP(r)$ expresses the bad-point proportion among the obtained matched-point pairs, where a point whose residual exceeds a given threshold $r$ is called a bad point.
- 6. $Skew$ denotes the absolute value of the calculated correlation coefficient. Note that either the Spearman or the Pearson correlation coefficient is applied, depending on the distribution of the residuals [16].
- 7. $Scat$ is a statistical evaluation of the feature-point distribution over the entire image [43], and it should also be less than the critical value of its test.
- 8. The combined indicator is a linear combination of the above seven indicators. When one of the indicators is not used, the combination is simplified accordingly, and its value should be less than 0.605.
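As referenced above, here is a minimal sketch of how the two RMSE indicators can be computed for an affine transformation model estimated by least squares; the function names and the choice of an affine model are our assumptions.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform mapping src -> dst (both N x 2 arrays)."""
    A = np.hstack([src, np.ones((len(src), 1))])     # rows [x, y, 1]
    coef, *_ = np.linalg.lstsq(A, dst, rcond=None)   # 3 x 2 affine parameters
    return coef

def rmse_all(src, dst):
    """RMSE of all matched points under the fitted transformation."""
    pred = np.hstack([src, np.ones((len(src), 1))]) @ fit_affine(src, dst)
    return float(np.sqrt(np.mean(np.sum((pred - dst) ** 2, axis=1))))

def rmse_loo(src, dst):
    """Leave-one-out error: refit without each point, then average the errors
    of the left-out points (as described in the indicator list above)."""
    errs = []
    for k in range(len(src)):
        keep = np.arange(len(src)) != k
        coef = fit_affine(src[keep], dst[keep])
        pred = np.append(src[k], 1.0) @ coef
        errs.append(np.linalg.norm(pred - dst[k]))
    return float(np.mean(errs))
```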
4.1. Comparison and Analysis of the Experimental Results
In this part, to validate the registration performance, we compare the proposed method with eight existing methods: SIFT [
44], SAR-SIFT [
13], VGG16-LS [
45], ResNet50-LS [
46], ViT-LS [
39], DNN+RANSAC [
12], MSDF-Net [
16], and AdaSSIR [
25]. In the nine compared methods, SIFT and SAR-SIFT are two traditional registration methods.
SIFT mainly matches points using the Euclidean distance ratio between the nearest and second-nearest neighbors of the corresponding features.
SAR-SIFT is an improvement of the SIFT method that is more consistent with SAR image characteristics.
VGG16-LS, ResNet50-LS, and ViT-LS are deep-learning-based classification methods.
DNN+RANSAC [12] constructs the training sample set by self-learning and then uses DNN networks to obtain matched image pairs.
MSDF-Net [16] uses deep forests to construct multiple matching models based on multi-scale fusion to obtain the matched-point pairs, and then uses RANSAC to calculate the transformation matrix.
AdaSSIR [25] is an adaptive self-supervised SAR image registration method, where registration is treated as a self-supervised learning problem and each key point is regarded as a category-independent instance to construct a contrastive model for searching out accurate matched points.
Table 1, Table 2, Table 3 and Table 4 show the registration results on the four datasets, respectively. From the four tables, it can be seen that the proposed method (STDT-Net) obtains better registration performance than the other methods on all four datasets. Notably, several indicators of the proposed method are the best among all compared methods on every dataset, and it also obtains the lowest bad-point proportion ($BPP$), although the finally retained key points are not always well distributed over the quadrants ($P_{quad}$). In short, the results on the four datasets demonstrate that the proposed method can improve the performance of SAR image registration through the double-transformation network based on Swin-Transformer.
Compared with the six deep-learning-based methods (VGG16-LS, ResNet50-LS, ViT-LS, DNN+RANSAC, MSDF-Net, and AdaSSIR), the proposed method (STDT-Net) obtains a relatively large number of matched pairs ($N_{red}$) and the minimum error in obtaining the correct matching pairs, and it also achieves better registration accuracy ($RMSE_{all}$ and $RMSE_{LOO}$). In general, registration accuracy is better when a method has a larger $N_{red}$ and a smaller $RMSE_{all}$. It is seen that, compared with the other classification networks, the feature space obtained with the Swin-Transformer network as the basic classification model is the best. This means that the coordinate errors between the obtained matched-point pairs are smaller, which decreases the proportion of bad points and improves the related indicators, meanwhile reducing the registration error. In short, the proposed method obtains better registration performance than the compared methods.
4.2. The Visual Results of SAR Image Registration
In this part, we draw the chessboard map (CB-map) of two matched SAR images to visually show the registration results on the four datasets. Figure 7, Figure 8, Figure 9 and Figure 10 show the CB-maps for the four datasets, respectively. In each figure, the chessboard mainly focuses on the overlapping region of the two registered images, and the size of each checker is set based on the size of each dataset. Outside the overlapping region, the other areas are filled by their corresponding images (reference or sensed image). For example, in Figure 9, the rightmost area of the CB-map is filled by the sensed image, while the leftmost area is filled by the reference image. To enhance the visual effect, the contrast between the reference and sensed images is increased by darkening one image or brightening the other.
For each chessboard map, if the edges and overlapping regions are more continuous and consistent, the registration result is more accurate. It is observed from the four CB-maps that the two images are matched well by the proposed method on each dataset: regions such as rivers, roads, and croplands are continuous and consistent across checker borders, which illustrates that the two images are accurately registered. These visual results also validate that the proposed method can obtain accurate registration results. A minimal sketch of how such a CB-map can be composed is given below.
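For reference, composing such a CB-map from two images already registered into the same frame can be sketched as follows; the checker size parameter is illustrative.

```python
import numpy as np

def chessboard_map(ref, warped_sensed, checker=64):
    """Chessboard (CB) map of two registered images of equal shape.

    Alternating checkers are taken from the reference image and from the
    sensed image warped into the reference frame; continuous edges across
    checker borders indicate accurate registration. `checker` is illustrative.
    """
    h, w = ref.shape
    rows, cols = np.indices((h, w))
    take_ref = ((rows // checker) + (cols // checker)) % 2 == 0
    return np.where(take_ref, ref, warped_sensed)
```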
4.3. Analyses on the Precise-Matching Module
In this part, we analyze the precise-matching module to validate its effectiveness in enhancing the registration performance. In this experiment, we show the results of the two branches (the R-S sub-network and the S-R sub-network) with and without precise-matching. Matching accuracy is used as the quantitative indicator, and Table 5 shows the experimental results. From Table 5, it is seen that, for all four datasets, the accuracy is improved by precise-matching regardless of whether the R-S or the S-R network is used. This indicates that our designed precise-matching module is effective for finding more accurate matched points.
Additionally, to verify the effectiveness of the precise-matching module more intuitively, we show some visual results. First, nine points are selected from the reference image of the Yellow River R1 data, and the matched points corresponding to these nine points before and after using the precise-matching module are given. Then, their corresponding sub-images are captured from the reference image. Figure 11 compares the sub-images corresponding to the nine matched-point pairs obtained by the proposed method without and with the precise-matching module, where the locations of the nine points in the reference image are given and the matched results with the precise-matching module are labeled in red boxes. From Figure 11, it is observed that the sub-images of the matched points obtained with the precise-matching module are more similar to the original sub-images of the nine points than those obtained without it. This also illustrates that precise-matching is effective for improving the performance of our method.
4.4. Analyses on the Double-Transformation Network
In this part, we give an analysis of the proposed double-transformation network. Figure 12 shows the comparison of the root mean square error (RMSE) obtained by the proposed method (with two branches) and by using only one branch (the R-S network or the S-R network) on the four datasets. From Figure 12, it is seen that the proposed method obtains more accurate registration results than using a single network alone, which indicates that our double-transformation network can seek more accurate matched points between two images than either single network, since the matched-point pairs are obtained from two multi-classification models trained in two different feature spaces.
5. Discussion
Based on the above experimental results and analyses, the proposed method obtains more accurate registration of SAR images than the existing methods. According to our analyses, the reasons mainly lie in several items. First, the proposed method treats key points as independent classes to handle SAR image registration, which effectively avoids the defects of the traditional two-classification model for SAR image registration. Compared with existing methods, directly using key points as independent classes to construct a multi-classification model is novel. Additionally, key points are easily obtained from an image, which means that more key points directly yield more training samples. Second, considering the inconsistent categories of the training and testing sets in SAR image registration, the proposed method adopts a double-transformation network to construct the multi-classification model. Specifically, the two sub-networks effectively complement each other in obtaining predicted matched points, since they are trained on different key points (categories). Third, a precise-matching module based on nearest points is designed to modify the predictions of the two sub-networks and obtain more consistent matched points. Moreover, since key-point detection is not crucial for the proposed method, a simple method is used to detect the key points from the two SAR images in our model; this does not mean that other, more advanced detectors are unsuitable for our model. In practice, such methods can also be used to detect the key points.
In addition, Swin-Transformer is used as the basic network to construct the training model of the proposed method. Therefore, we also compare the classification performance of Swin-Transformer with three other basic networks, namely VGG16 [45], ResNet50 [46], and ViT [39], which may help researchers select a basic classification model. In this part, the four networks use the same hyperparameters and number of training iterations, and the classification accuracy and the running time are used as the indicators to analyze the classification performance of the four basic networks. Table 6 shows the experimental results of the four networks on the Yellow River R1 data and the Wuhan data, where the best results are in bold. From Table 6, it is obvious that Swin-Transformer obtains higher classification accuracy and shorter running times than the other three basic networks. This indicates that using Swin-Transformer as the basic classification network is more effective and more suitable for SAR image registration.