### **1. Introduction**

Image registration is an essential and fundamental task in remote sensing interpretation. It aims to align images obtained from different sensors, different perspectives, different times, or different imaging conditions [1], and is an essential preliminary step for image fusion [2], 3D modeling [3], and change detection [4]. With the rapid advance of remote sensing systems, more and more data sources can now be acquired, and the complementary information between multimodal remote sensing images can significantly improve the capacity and effectiveness of the interpretation. The efficiency and accuracy of the registration result greatly affect the performance of the subsequent processing [1]. However, multimodal remote sensing images exhibit severe nonlinear radiation distortion (NRD), as a result of their different physical imaging mechanisms, which brings significant challenges to image registration.

Generally speaking, image registration methods can be divided into two categories: area-based methods and feature-based methods. The area-based methods adopt the intensity values of the image itself, and the transformation model for the registration is calculated by optimizing a similarity measure between the image to be registered and the reference image. Correlation [5–7], mutual information [8–13], or frequency-domain information [14,15] can be used as metrics to measure whether the images are registered. However, the area-based methods often reach only a locally optimal solution when optimizing the model transformation, especially when there is NRD among the images. At the same time, the optimization of the image transformation parameters is of high computational complexity. The feature-based methods are highly robust to geometric distortion and NRD, so they are commonly adopted in the registration of multimodal remote sensing images. The feature-based methods achieve the matching goal by identifying reliable feature correspondences between the images, and they are not directly based on the image intensity [16]. The features considered by the feature-based methods include point features [17–19], line features [20], and structural features [21,22]. Scale-invariant feature transform (SIFT) [23] is a classic feature point matching method. A number of improved versions of SIFT have since been developed, including affine SIFT (ASIFT) [24], speeded-up robust features (SURF) [25], the SIFT-like algorithm for synthetic aperture radar (SAR) images (SAR-SIFT), and principal component analysis SIFT (PCA-SIFT). When SIFT or one of its improved algorithms is used to describe the features, an estimated reference orientation must be assigned to make the descriptors more distinctive and robust to rotation. Feature-based image registration is confronted with two problems: (1) in the feature description stage, the estimation of the principal orientation based on the local features of the image is often prone to error, and many corresponding points are removed due to incorrect principal orientation estimation; and (2) in the feature matching stage, due to the significant NRD, consistent feature detection results often cannot be obtained between the images, so that there are many outliers in the matching result.

In recent years, research into deep learning has exploded in the computer vision field [26,27]. In the study of medical image registration, deep learning has been used to model the relationship between images of different modalities [28,29]. However, studies of remote sensing image registration based on deep learning are relatively rare, especially for multimodal remote sensing images [30–33]. The main reasons for this are as follows: (1) Compared with natural or medical images, remote sensing images have an extensive range, complex distortion, and weak uniqueness of targets. It is also common in remote sensing images that the same ground object type presents different forms, or that the same form corresponds to different ground object types. (2) Compared with the large number of natural image datasets that can be used for training, manually annotated remote sensing image datasets are very rare, as the labeling of remote sensing images requires considerable expert knowledge and manpower. Furthermore, applying a model trained on natural images to remote sensing registration tasks is impractical. (3) Deep learning is essentially a supervised learning method, and it is almost impossible to use a model trained on one image modality to register another modality. For example, the performance on a LiDAR and optical image registration task can be unsatisfactory when using a model trained on SAR and optical images. Therefore, there is a need to develop a universal handcrafted descriptor for the multimodal remote sensing image registration task, from the perspective of the physical radiation mechanism and the imaging geometric model, in the case of limited training data.

In this paper, we propose a scale-radiation-rotation-invariant feature transform (SRIFT) algorithm for the registration of multimodal remote sensing images. The contributions of this paper can be summarized as follows:

(1) A modality-free multimodal remote sensing image registration method is proposed, which can handle scale, radiation, and rotation distortions at the same time. The structural characteristics are captured, whereby the same kind of ground object presents a similar structural distribution. Thus, the corresponding features of the images are mapped into a unified space to establish a one-to-one relationship.

(2) The nonlinear diffusion scale (NDS) space is constructed using a nonlinear diffusion filter, instead of a Gaussian filter, to preserve more structure and detail information, which is of great importance for the feature extraction in the registration. The structural characteristics are captured by computing the local orientation and scale phase congruency (LOSPC) value in the NDS space of the image. The minimum and maximum moment maps of LOSPC are then used to detect the feature points. Rotation invariance is achieved by a rotation-invariant coordinate (RIC) system in the feature description stage. The points in the neighborhood are statistically aggregated through a continually changing local coordinate system, which itself realizes rotation invariance, without the need for an estimated orientation to be assigned.

The rest of this paper is organized as follows. The related work is introduced in Section 2. Section 3 details the process of image registration using the SRIFT algorithm. In Section 4, the experimental results obtained using both simulated and real multimodal remote sensing images are provided for the experimental verification and analysis. A summary and our conclusion are presented in Section 5.

### **2. Related Work**

For the registration of homologous images, the methods described above operate on the intensity or the gradient of the image; however, for multimodal images, the structural information presented in the image by the same point on the ground will be entirely different, due to the NRD caused by the different imaging mechanisms. As a result, the above methods will match many pseudo-corresponding features.

With regard to multimodal remote sensing image registration, scholars have put forward various descriptors on the basis of structural information [34,35]. Compared with gradient information, structural information is less sensitive to nonlinear intensity changes [36]. For example, the edge-oriented histogram (EOH) descriptor encodes the shape and contour information of the local image centered on each keypoint, instead of the gradient [37]. The partial intensity invariant feature descriptor (PIIFD) was proposed to solve the problem of the relative gradient direction of the corresponding points [38]. A descriptor for the distribution of internal geometric structures, captured on a log-polar grid, was introduced as the local self-similarity (LSS) descriptor [39]. A dense LSS (DLSS) method was used for the registration of optical and SAR images in [40]. The histogram of oriented phase congruency (HOPC) [41] algorithm extends the phase congruency model, and its description process adds phase orientation statistics to increase the robustness. Radiation-invariant feature transform (RIFT) [42] also considers the phase congruency, and uses a maximum index map, instead of the gradient, in the feature description. The phase congruency-based structural descriptor (PCSD) [43] has also been successful in optical and SAR image registration without rotational distortion.

In this section, the HOPC and RIFT algorithms are introduced as typical registration methods for multimodal remote sensing images.

The HOPC method successfully introduces an oriented phase congruency algorithm into the automatic registration of remote sensing images, while innovatively mapping the two images acquired under different physical mechanisms into a unified space. As a result, corresponding features that cannot otherwise be matched one by one can be mapped in this space. However, there are three significant deficiencies in HOPC: (1) it needs relatively accurate geographic information (or a rough geometric correction) for the images in the execution process, but many multimodal remote sensing images do not contain accurate geographic information; (2) as a template matching algorithm, it is sensitive to geometric distortion, such as rotation and scale; and (3) it uses the Harris corner detector, which is very sensitive to NRD when extracting feature keypoints.

The RIFT algorithm was developed to address the limitations of the HOPC algorithm. It also uses a phase congruency algorithm in the stage of mapping the corresponding features, but it adopts a novel descriptor for the feature description. The RIFT algorithm describes the features per pixel, and also takes rotation invariance into account. However, its rotation invariance and robustness are not outstanding, because of the possibility of losing spatial information, and its calculation process is time-consuming. Furthermore, it is not scale-invariant. The main disadvantages of the RIFT algorithm are as follows: (1) the RIFT algorithm adopts a "convolution sequence ring" to deal with the rotation distortion of the images, which may lose some spatial information, resulting in insufficiently unique feature vectors being generated and unfavorable feature matching; and (2) the RIFT algorithm does not consider scale invariance, so it is susceptible to scale changes.

Although image registration has been the subject of numerous studies over the last few decades, there is still no unified registration framework that can automatically register multiple multimodal images while considering the scale and rotation distortions.

### **3. Image Registration Based on SRIFT**

In this section, we describe the proposed SRIFT method in detail. Figure 1 presents a registration flowchart based on SRIFT. The highlighted feature extraction and matching parts in the second column of the figure represent the main innovations of this paper. The first two steps of the algorithm address the feature extraction, and the last step addresses the feature description. Initially, the NDS space is constructed using a nonlinear diffusion filter, instead of a Gaussian filter, to preserve more structural and detail information, in order to achieve scale invariance. The structural characteristics are then captured by computing the LOSPC values, in which the same kind of ground object will present similar structural distributions. Finally, the RIC system is applied, which itself realizes rotation invariance, without the need for an estimated reference orientation.

**Figure 1.** Multimodal image registration flowchart based on scale-radiation-rotation-invariant feature transform (SRIFT).

The fundamental reason that restricts the accurate registration of multi-source images is the nonlinear mapping of the corresponding features caused by the existence of NRD. Therefore, as long as the corresponding features can be made to correspond one-to-one, the registration problem for the different source images can be transformed into a registration problem for same-source images. Hence, in the third column of Figure 1, we can draw on the many well-studied homologous image registration methods for the image transformation and resampling.

#### *3.1. Scale Invariance Through the Nonlinear Diffusion Scale (NDS) Space*

In the construction of the scale space, the SIFT algorithm uses a Gaussian space, which is generated by convolving the original image with Gaussian filters of different scales. However, because Gaussian filtering is a smoothing operator, some structure and contour information in the image is lost in the filtering process, and this loss of information adversely affects the feature extraction of multimodal remote sensing images with NRD.

The detailed structural information needs to be included as much as possible when the scale space is established. Inspired by anisotropic diffusion [44], a nonlinear diffusion function, instead of a Gaussian function, is adopted to generate the scale space of the image in the proposed algorithm:

$$\frac{\partial f(x, y)}{\partial t} = f_t = \operatorname{div}(c(x, y, t)\nabla f) = c(x, y, t)\Delta f + \nabla c \cdot \nabla f \tag{1}$$

where *t* is a scale parameter, div is the divergence operator, ∇ is the gradient operator, Δ is the Laplace operator, and *c*(*x*, *y*, *t*) is the diffusion coefficient. In the particular case where the nonlinear diffusion is assumed to be isotropic, *c*(*x*, *y*, *t*) is a constant, and Equation (1) reduces to the heat equation, whose solution is equivalent to Gaussian filtering. Since there is no analytical solution to the nonlinear diffusion equation, a numerical method is needed to approximate the solution. Applying an additive operator splitting (AOS) strategy to Equation (1) yields the following semi-implicit update:

$$f^{k+1} = \left(I - \tau \sum_{l=1}^{m} A_l(f^k)\right)^{-1} f^k \tag{2}$$

where *I* is an identity matrix, τ signifies the time step, and *l* indexes the coordinate direction; the matrix *Al* encodes the diffusivities along the *l*-th coordinate axis.
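To make the diffusion process concrete, the sketch below applies an explicit Perona–Malik-style update rather than the semi-implicit AOS update of Equation (2), with an exponential diffusivity governed by a contrast parameter *k*; the function names, the border handling, and all parameter values are illustrative assumptions, not details fixed by the paper:

```python
import numpy as np

def nonlinear_diffusion_step(f, tau=0.15, k=0.05):
    """One explicit nonlinear-diffusion step (illustrative stand-in for the
    semi-implicit AOS update of Equation (2)).

    f   : 2-D float image
    tau : time step (kept small so the explicit scheme stays stable)
    k   : contrast parameter of the diffusivity c = exp(-(d/k)^2)
    """
    # differences to the four neighbours (periodic border, for brevity)
    dn = np.roll(f, -1, axis=0) - f
    ds = np.roll(f, 1, axis=0) - f
    de = np.roll(f, -1, axis=1) - f
    dw = np.roll(f, 1, axis=1) - f
    c = lambda d: np.exp(-(d / k) ** 2)  # edge-stopping diffusion coefficient
    return f + tau * (c(dn) * dn + c(ds) * ds + c(de) * de + c(dw) * dw)

def smooth_to_time(image, t, tau=0.15):
    """Iterate the step until the accumulated diffusion time reaches t."""
    f = image.astype(np.float64)
    for _ in range(int(np.ceil(t / tau))):
        f = nonlinear_diffusion_step(f, tau)
    return f
```

Because the diffusivity shrinks where local differences are large, strong edges are preserved while homogeneous areas are smoothed, which is the property the NDS space relies on.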

The same tactic is applied as in SIFT, where the scale space is discretized into a series of *O* octaves and *S* sublevels. By using the original image as the initial condition, the multi-scale space for the multimodal image is generated as a series of smoothed images. Each scale value is converted into a diffusion (evolution) time:

$$t_i = \sigma_i^2 / 2 \tag{3}$$

where the σ*i* values are calculated from the following expression:

$$\sigma_i(o, s) = \sigma_0 2^{o + \frac{s}{S}}, \quad o \in [0, \dots, O - 1], \ s \in [0, \dots, S + 2], \ i \in [0, \dots, N - 1] \tag{4}$$

where σ0 is the base scale level; *o* and *s* are the indices of octave *O* and sublevel *S*, respectively; and *N* is the total number of smoothed images. It is notable that the image is downsampled when the last sublevel is reached in each octave, and the downsampled image is used as the initial image for the next octave, as shown in Figure 2.
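For concreteness, the schedule of Equations (3) and (4) can be tabulated as follows; σ0 = 1.6, *O* = 4, and *S* = 3 are illustrative values rather than parameters fixed by the paper:

```python
def nds_scale_schedule(sigma0=1.6, num_octaves=4, num_sublevels=3):
    """Enumerate (octave, sublevel, sigma, diffusion time) per Equations (3)-(4)."""
    levels = []
    for o in range(num_octaves):                 # o in [0, ..., O - 1]
        for s in range(num_sublevels + 3):       # s in [0, ..., S + 2]
            sigma = sigma0 * 2.0 ** (o + s / num_sublevels)  # Equation (4)
            t = sigma ** 2 / 2.0                 # Equation (3)
            levels.append((o, s, sigma, t))
    return levels
```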

Through the construction of the NDS space, the attained multi-scale images can preserve the structural and detail information. Therefore, it is anticipated that the proposed SRIFT method will be able to detect many more keypoints.

**Figure 2.** Creation of the nonlinear diffusion scale (NDS) space.

#### *3.2. Radiation Invariance Through Local Orientation and Scale Phase Congruency (LOSPC)*

It should be noted that the orientation and scale referred to in this section are different from the scale referred to in Section 3.1 and the orientation referred to in Section 3.3. The orientation and scale in this section serve to give the spatial mapping more structural information, whereas the scale in Section 3.1 and the orientation in Section 3.3 are considered in order to make the feature description more stable.

#### 3.2.1. Frequency Domain Spatial Mapping via Phase Congruency

The feature extraction can be carried out after the establishment of the multi-scale space. To solve the NRD problem, the LOSPC algorithm is proposed to construct a unified and describable space, which is a prerequisite for the subsequent extraction of the corresponding features.

Feature extraction in the spatial domain of multimodal images often fails due to the distortion of the grayscale and gradient information. Instead, in the frequency domain, an image is decomposed into amplitude and phase components, where the same kind of ground object will present similar structural distribution features in the multimodal images. Phase congruency can be used to measure the degree of local phase information consistency at various angles [45]. Instead of considering the locations with the maximum intensity gradient as being the edges, the phase congruency model regards the edge points as being where the Fourier components are maximally in phase [46]. The phase information is the measurement describing the structural distribution features of the image in the frequency domain.

A 2-D phase congruency operator has been developed for the calculation of the phase congruency of any point in the plane, and this theory can also be applied to image processing [47]. We define an image as *I*(*x*, *y*); the even-symmetric part *Eso*(*x*, *y*) and the odd-symmetric part *Oso*(*x*, *y*) can then be obtained by convolving the image *I*(*x*, *y*) with the even- and odd-symmetric log-Gabor wavelets:

$$[E_{so}(x, y), O_{so}(x, y)] = [I(x, y) \otimes L^{even}(x, y, s, o), \ I(x, y) \otimes L^{odd}(x, y, s, o)] \tag{5}$$

where, at scale *s* and orientation *o*, *Leven*(*x*, *y*, *s*, *o*) and *Lodd*(*x*, *y*, *s*, *o*) stand for the even-symmetric and odd-symmetric log-Gabor wavelets, respectively. The amplitude and phase components of image *I*(*x*, *y*) can be expressed as:

$$A_{so}(x, y) = \sqrt{E_{so}(x, y)^2 + O_{so}(x, y)^2} \tag{6}$$

$$\phi_{so}(x, y) = \arctan(O_{so}(x, y) / E_{so}(x, y)) \tag{7}$$

When all the scales *s* and orientations *o* are considered, the two-dimensional phase congruency is calculated as follows:

$$PC(x, y) = \frac{\sum_{s} \sum_{o} w_o(x, y) \lfloor A_{so}(x, y) \Delta \Phi_{so}(x, y) - T \rfloor}{\sum_{s} \sum_{o} A_{so}(x, y) + \xi} \tag{8}$$

where *wo*(*x*, *y*) is a weight function; *T* is a noise threshold; ξ is a small positive constant that prevents division by zero; ⌊·⌋ denotes truncation to zero, i.e., the enclosed quantity is set to 0 when it is negative; and *Aso*(*x*, *y*)ΔΦ*so*(*x*, *y*) is a phase deviation function, which is defined as:

$$A_{so}(x, y)\Delta\Phi_{so}(x, y) = \left(E_{so}(x, y)\overline{\phi}_E(x, y) + O_{so}(x, y)\overline{\phi}_O(x, y)\right) - \left|E_{so}(x, y)\overline{\phi}_O(x, y) - O_{so}(x, y)\overline{\phi}_E(x, y)\right| \tag{9}$$

where

$$\overline{\phi}_E(x, y) = \sum_{s} \sum_{o} E_{so}(x, y) / C(x, y) \tag{10}$$

$$\overline{\phi}_O(x, y) = \sum_{s} \sum_{o} O_{so}(x, y) / C(x, y) \tag{11}$$

$$C(x, y) = \sqrt{\left(\sum_{s} \sum_{o} E_{so}(x, y)\right)^{2} + \left(\sum_{s} \sum_{o} O_{so}(x, y)\right)^{2}} \tag{12}$$

Through the above formulas, each pixel in the image acquires a statistical value for the phase congruency, which contains orientation and scale information and is based on the structural distribution. In different multimodal remote sensing images, the same ground object will have the same structural distribution, so, although the images have undergone different physical radiation mechanisms, their phase congruency values will be consistent.
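The computation of Equations (8)–(12) can be sketched as below, starting from stacked log-Gabor responses; the array layout, the omission of the weight *wo*(*x*, *y*), and the parameter values are simplifying assumptions for illustration:

```python
import numpy as np

def phase_congruency(E, O_, T=0.1, xi=1e-4):
    """Phase congruency map from log-Gabor responses (Equations (8)-(12)).

    E, O_ : arrays of shape (n_scales, n_orients, H, W) holding E_so and O_so
    T     : noise threshold; xi : small stabilising constant
    """
    sum_e = E.sum(axis=(0, 1))
    sum_o = O_.sum(axis=(0, 1))
    C = np.sqrt(sum_e ** 2 + sum_o ** 2) + xi          # Equation (12)
    phi_e, phi_o = sum_e / C, sum_o / C                # Equations (10)-(11)
    # Equation (9): energy along the mean phase minus the deviation from it
    dev = (E * phi_e + O_ * phi_o) - np.abs(E * phi_o - O_ * phi_e)
    A = np.sqrt(E ** 2 + O_ ** 2)                      # Equation (6)
    # Equation (8); the floor brackets clamp negative values to zero
    return np.maximum(dev - T, 0).sum(axis=(0, 1)) / (A.sum(axis=(0, 1)) + xi)
```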

#### 3.2.2. Feature Point Extraction

In the feature extraction stage, it is necessary to design a feature extraction method based on the statistical index of the phase congruency. Equation (8) is used to calculate the phase congruency of the image pixel by pixel, so an edge graph can be obtained, which is robust for the various multimodal remote sensing images. However, when the image is convolved with log-Gabor filters in different directions, the phase congruency changes with the direction of the filter, and this information is not recorded. Therefore, in order to preserve how the phase congruency changes with direction, it is necessary to calculate the moment of the phase congruency in each direction, and to record the values as the direction changes. In moment statistics, the principal axis is the axis about which the moment is smallest [46]. The moment perpendicular to the principal axis is the maximum moment. If the maximum moment at a point is large, the point is likely to be an edge point in the image, while if the minimum moment is large, it is likely to be a corner point.

According to the moment analysis algorithm, the following three statistics can be obtained from the typical moment calculation formula:

$$a = \sum_{o} \left( PC(\theta_o) \cos(\theta_o) \right)^2 \tag{13}$$

$$b = 2\sum_{o} \left( PC(\theta_o) \cos(\theta_o) \right) \left( PC(\theta_o) \sin(\theta_o) \right) \tag{14}$$

$$c = \sum_{o} \left( PC(\theta_o) \sin(\theta_o) \right)^2 \tag{15}$$

The angle of the principal axis ψ can then be calculated by the following formula:

$$\psi = \frac{1}{2} \arctan\left(\frac{b}{a - c}\right) \tag{16}$$

After obtaining the principal axis, the minimum moment *m*ψ and the maximum moment *M*ψ are calculated as follows:

$$m_{\psi} = \frac{1}{2} \left( c + a - \sqrt{b^2 + (a - c)^2} \right) \tag{17}$$

$$M_{\psi} = \frac{1}{2} \left( c + a + \sqrt{b^2 + (a - c)^2} \right) \tag{18}$$

By detecting the extreme values of the maximum moment and minimum moment maps, a group of feature points, called keypoints, can be obtained, which completes the feature extraction step. The feature points are detected in the frequency domain, instead of directly from the image intensity or gradient values, as in the traditional approaches, so the proposed method can better deal with the NRD in multimodal remote sensing images. These keypoints are then used in the subsequent feature matching.
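A minimal sketch of the moment analysis of Equations (13)–(18) is given below; the stack of per-orientation phase congruency maps and the orientation angles are assumptions about the data layout:

```python
import numpy as np

def lospc_moment_maps(pc_stack, thetas):
    """Minimum/maximum moment maps from per-orientation PC maps.

    pc_stack : array (n_orients, H, W), PC(theta_o) for each orientation
    thetas   : orientation angles theta_o in radians
    """
    cos_t = np.cos(thetas)[:, None, None]
    sin_t = np.sin(thetas)[:, None, None]
    a = ((pc_stack * cos_t) ** 2).sum(axis=0)                    # Eq. (13)
    b = 2.0 * (pc_stack * cos_t * pc_stack * sin_t).sum(axis=0)  # Eq. (14)
    c = ((pc_stack * sin_t) ** 2).sum(axis=0)                    # Eq. (15)
    root = np.sqrt(b ** 2 + (a - c) ** 2)
    m_min = 0.5 * (c + a - root)   # Eq. (17): large at corner points
    m_max = 0.5 * (c + a + root)   # Eq. (18): large at edge points
    return m_min, m_max
```

The keypoints are then taken as the local extrema of these two maps.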

#### *3.3. Rotation Invariance Through a Rotation-Invariant Coordinate (RIC) System*

After extracting a large number of stable feature keypoints, these points then need to be described. The process of description should consider the change of features with the transformation of the various influencing factors, and should highlight the uniqueness of the features, to ensure the uniqueness of the subsequent feature matching. In order to achieve rotation invariance in the feature description stage, inspired by the work of "aggregating gradient distributions into intensity orders" [48], a RIC system is proposed. In the feature description, the sample points neighboring a keypoint are statistically calculated through a constantly changing local coordinate system, which itself realizes rotation invariance, without the need for the dominant direction. Firstly, we select several candidate support regions, according to a certain proportion. The difference between local orientation and scale phase congruency (DLOSPC) histogram is then calculated in the local RIC system at each sub-region. Finally, by connecting each DLOSPC vector in the image neighborhood of the multiple support regions, the descriptor is constructed. A flowchart of the construction of the descriptor is shown in Figure 3.

#### 3.3.1. The Local Rotation-Invariant Coordinate (RIC) System

In order to realize the rotation invariance of the proposed algorithm, the descriptor of each support region is calculated in the local RIC system. Specifically, a RIC system is built around each keypoint. Let *pi* be a point in one support region of the keypoint *p*; the line connecting *p* and *pi* is set as the y-axis, with the direction of the vector *ppi* as the positive y-direction, and a local Cartesian (x-y) coordinate system is then constructed. For each sample point *pi*, the pixels in its neighborhood participate in the calculation in a rotation-invariant manner; that is to say, the local structure in the neighborhood of the sample points is retained. Therefore, the feature description in this locally invariant coordinate system is rotation-invariant. A local RIC system is constructed in this way for keypoint *p*, with *pi* set as the origin; the first pixel along the direction of the y-axis is labeled *pi*1, and the remaining pixels in the eight-neighborhood of *pi* are marked as *pi*2, *pi*3, ..., *pi*8, as shown in Figure 3b.

**Figure 3.** A flowchart of the construction of the descriptor. (**a**) Multiple supported regions. (**b**) Local RIC system. (**c**) Difference between local orientation and scale phase congruency (DLOSPC). (**d**) Vectors of SRIFT.
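The construction of the local frame can be summarized in a few lines; the vector representation below is an illustrative sketch of the RIC idea rather than the paper's exact implementation:

```python
import numpy as np

def ric_axes(p, p_i):
    """Local rotation-invariant axes at sample point p_i of keypoint p.

    The y-axis points from keypoint p towards p_i, and the x-axis is its
    perpendicular, so the frame rotates together with the image content.
    """
    v = np.asarray(p_i, dtype=float) - np.asarray(p, dtype=float)
    y_axis = v / np.linalg.norm(v)               # assumes p_i != p
    x_axis = np.array([y_axis[1], -y_axis[0]])   # y-axis rotated by -90 degrees
    return x_axis, y_axis
```

Because the frame is tied to the geometry between *p* and *pi*, rotating the whole image rotates the frame identically, so the sampled local structure is unchanged and no dominant orientation needs to be estimated.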

#### 3.3.2. The Difference between Local Orientation and Scale Phase Congruency (DLOSPC) in the RIC System

For each sample point *pi*, its difference of local orientation and scale phase congruency can be computed in the local RIC system. The calculation formulas are as follows:

$$Dx(p_i) = l(p_{i3}) - l(p_{i7}), \tag{19}$$

$$Dy(p_i) = l(p_{i1}) - l(p_{i5}), \tag{20}$$

where the *pij* are point *pi*'s neighboring points along the x-axis and y-axis in the local x-y coordinate system, and *l*(*pij*) stands for the value at *pij* on the LOSPC map. The magnitude *D*(*pi*) and orientation θ(*pi*) can then be computed as:

$$D(p_i) = \sqrt{Dx(p_i)^2 + Dy(p_i)^2}, \tag{21}$$

$$\theta(p_i) = \arctan(Dy(p_i) / Dx(p_i)). \tag{22}$$

Note that θ(*pi*) is converted into the range of [0, 2π), according to the signs of *Dx*(*pi*) and *Dy*(*pi*). The gradient of *pi* is then constructed as a *d*-dimensional vector *FG*(*pi*) = (*fG*1, *fG*2, ..., *fGd*). To do this, [0, 2π) is divided into *d* equal bins as *dirj* = (2π/*d*) × (*j* − 1), *j* = 1, 2, ..., *d*, and θ(*pi*) is then assigned to the bins by linear distance weighting of *D*(*pi*):

$$f_j^G = \begin{cases} D(p_i) \dfrac{2\pi/d - \alpha(\theta(p_i), dir_j)}{2\pi/d}, & \alpha(\theta(p_i), dir_j) < \dfrac{2\pi}{d} \\ 0, & \text{otherwise} \end{cases} \tag{23}$$

where α(θ(*pi*), *dirj*) is the angle between θ(*pi*) and *dirj*.
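Putting Equations (19)–(23) together, the DLOSPC vector of a single sample point can be sketched as follows; for brevity the differences are taken along the image axes, whereas the paper evaluates them in the local RIC system, and *d* = 8 is an illustrative bin count:

```python
import numpy as np

def dlospc_vector(lospc, row, col, d=8):
    """d-bin DLOSPC vector of the sample point at (row, col) on the LOSPC map."""
    dx = lospc[row, col + 1] - lospc[row, col - 1]   # Eq. (19)
    dy = lospc[row - 1, col] - lospc[row + 1, col]   # Eq. (20)
    mag = np.hypot(dx, dy)                           # Eq. (21)
    theta = np.arctan2(dy, dx) % (2.0 * np.pi)       # Eq. (22), in [0, 2*pi)
    dirs = (2.0 * np.pi / d) * np.arange(d)          # bin centres dir_j
    # angular distance alpha(theta, dir_j), wrapped into [0, pi]
    alpha = np.abs((theta - dirs + np.pi) % (2.0 * np.pi) - np.pi)
    width = 2.0 * np.pi / d
    # Eq. (23): linear distance weighting into the nearby bins
    return np.where(alpha < width, mag * (width - alpha) / width, 0.0)
```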

#### 3.3.3. Construction of the Keypoint Descriptors Through Multiple Supported Regions

As shown in Figure 3, a single support region is not sufficient to distinguish correct matches from a large number of wrong matches. Furthermore, two non-corresponding keypoints may exhibit similarity in some support regions. However, two corresponding keypoints should have a similar appearance in all the support regions of different sizes, although there may be some small differences due to positioning errors of the keypoints and area detection. That is, when multiple support regions are used, mismatches can be better handled than when only a single support region is used.

N nested support regions are selected with different radii *ri*, centered on a keypoint, as shown in Figure 3a. The minimum support region is defined as *A* ∈ ℝ<sup>2×2</sup>, and the other support regions can then be expressed as *Ai* = (1/*ri*)*A*, where *ri* represents the size of the *i*-th support region. *ri* is defined as *ri* = 1 + 0.5 × (*i* − 1) in this paper, so that the radius increments of the support regions are equal.

The cumulative vectors are combined to form a vector in each support region, and then the cumulative vectors of the different support regions are connected together to describe the features of the keypoint. *F*(*Ri*) is used to represent the cumulative vector for a support region:

$$F(R_i) = \sum_{p \in R_i} F_G(p), \tag{24}$$

Finally, all the vectors calculated in the *N* support regions are connected together to form the final descriptor {*F*1, *F*2, ..., *FN*}.
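The final assembly of Equation (24) and the concatenation then amount to the following sketch; the list-of-arrays input and the final normalization are assumptions for illustration:

```python
import numpy as np

def srift_descriptor(region_point_vectors):
    """Build the SRIFT descriptor from per-region DLOSPC vectors.

    region_point_vectors : list of N arrays, each of shape (n_points_i, d),
    holding F_G(p) for every sample point p in support region R_i.
    """
    parts = [vecs.sum(axis=0) for vecs in region_point_vectors]  # Eq. (24)
    descriptor = np.concatenate(parts)                           # {F_1, ..., F_N}
    norm = np.linalg.norm(descriptor)
    return descriptor / norm if norm > 0 else descriptor         # hypothetical L2 norm
```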

### **4. Experiments and Analyses**

The performance of the proposed SRIFT method was tested in both simulated and real-data experiments. In the simulated experiments, the robustness of the algorithm to geometric distortion was tested by artificially adding various scale and rotation distortions. In the real-data experiments, the registration performance of the SRIFT method was compared with that of eight state-of-the-art image registration methods (SIFT [23], ASIFT [24], SAR-SIFT [49], PSO-SIFT [50], DLSS [40], HOPC [41], PCSD [43], and RIFT [51]), using different modal image pairs.
