1. Introduction
Image matching [1,2] is an important research topic in computer vision and image processing, and is widely used in visual 3D reconstruction [3], tracking [4], object recognition [5] and content-based image retrieval [6]. Its purpose is to find one or more transformations in the transformation space that bring two or more images of the same scene into spatial alignment, where the images may come from different times, different sensors or different viewpoints. Among the many types of image matching methods, feature-based matching is more robust to image distortion, noise and occlusion. However, its performance depends to a large extent on the quality of feature extraction, which is one of the research hotspots in pattern recognition [7,8,9,10]. The most fundamental of these methods is the scale-invariant feature transform (SIFT) algorithm proposed by Lowe in 2004 [11,12]. However, SIFT considers only the Euclidean distance between feature vectors when matching and does not exploit any structural information contained in the dataset itself, so its search efficiency is relatively low. When image noise or the difference between the matched objects is large, mismatches become frequent. To solve this problem, Ooi and Weinberger proposed using a k-d tree to partition the space before performing nearest-neighbor queries [13,14], but building such an index structure is relatively costly. Chen and Torr proposed the M-estimation method [15] and the maximum likelihood estimation sample consensus (MLESAC) algorithm [16] to estimate the matching matrix. The M-estimation method relies entirely on linear least squares, so the initial estimate of the matrix has low accuracy and poor stability; the MLESAC algorithm is incapable of modeling outliers and has low estimation accuracy. In response to these problems, Choi proposed a better-performing random sample consensus (RANSAC) algorithm [17] to filter the matching pairs for consistency. Because RANSAC depends on the chosen number of iterations, its results may contain errors and are not guaranteed to be optimal. At present, researchers have proposed a variety of description methods for local image feature regions, such as descriptors based on Gaussian derivatives, invariant moments, steerable filters, time-frequency analysis, the distribution of pixel gray values, or the distribution of pixel gradients. Of these, Lowe's SIFT descriptor has received the most attention.
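To make the dependence of RANSAC on the number of iterations concrete, the following is a minimal sketch of plain RANSAC. It is not Choi's variant [17]. It fits the simplest possible model, a 2D translation between putative point matches, so that each hypothesis needs only one match. The function names `ransac_iterations` and `ransac_translation` are hypothetical. The iteration count follows the standard formula N = log(1 − p) / log(1 − w^s), where p is the desired probability of drawing at least one outlier-free sample, w is the inlier ratio and s is the sample size. Because w is unknown in practice, a poor guess for N gives exactly the suboptimal results mentioned above.

```python
import math
import random


def ransac_iterations(p_success, inlier_ratio, sample_size):
    """Samples N needed so that, with probability p_success, at least one minimal
    sample is outlier-free: N = log(1 - p) / log(1 - w**s)."""
    return math.ceil(math.log(1.0 - p_success) / math.log(1.0 - inlier_ratio ** sample_size))


def ransac_translation(matches, threshold, iterations, seed=0):
    """Fit a 2D translation (dx, dy) to putative matches ((x1, y1), (x2, y2)).

    A translation needs only one match per hypothesis, which keeps the sketch
    short; feature matching in practice fits a homography or fundamental matrix.
    """
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.choice(matches)      # minimal sample: one match
        dx, dy = x2 - x1, y2 - y1                      # hypothesised model
        inliers = [m for m in matches
                   if math.hypot(m[1][0] - m[0][0] - dx, m[1][1] - m[0][1] - dy) <= threshold]
        if len(inliers) > len(best_inliers):           # keep the largest consensus set
            best_inliers = inliers
    # Refit on the consensus set: the least-squares translation is the mean shift.
    n = len(best_inliers)
    dx = sum(m[1][0] - m[0][0] for m in best_inliers) / n
    dy = sum(m[1][1] - m[0][1] for m in best_inliers) / n
    return (dx, dy), best_inliers


if __name__ == "__main__":
    pts = [(0, 0), (10, 2), (4, 7), (8, 8), (1, 9), (6, 3), (3, 3), (9, 5)]
    good = [(p, (p[0] + 5, p[1] - 3)) for p in pts]    # true shift (5, -3)
    bad = [((2, 2), (40, 40)), ((7, 1), (-20, 15))]    # gross mismatches
    print(ransac_iterations(0.99, 0.5, 4))             # 72 (4-point sample, 50% inliers)
    model, inliers = ransac_translation(good + bad, threshold=1.0, iterations=20)
    print(model, len(inliers))
```

The two mismatched pairs never join the consensus set of a correct hypothesis, so the refit uses only the consistent matches. If `iterations` is set too low for the true outlier ratio, every sample may be an outlier and the returned model is wrong.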
The SIFT descriptor is constructed by building a three-dimensional histogram of gradient locations and orientations over the neighborhood of each feature point. SIFT features are not only invariant to image scaling and rotation, but also robust to illumination changes and image deformation, and they are highly discriminative. Building on this, researchers have improved and extended SIFT, for example with the PCA-SIFT descriptor proposed by Ke and Sukthankar [18], the gradient location-orientation histogram (GLOH) descriptor proposed by Mikolajczyk and Schmid [19], the rotation-invariant feature transform (RIFT) descriptor proposed by Lazebnik [20] and the speeded-up robust features (SURF) descriptor proposed by Bay [21].
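The three-dimensional histogram behind the SIFT descriptor has two spatial axes (a 4 × 4 grid of cells) and one orientation axis (8 bins), which gives 4 × 4 × 8 = 128 dimensions. The sketch below illustrates only this layout. It assumes a 16 × 16 gray-level patch that is already centred on the keypoint, and the function name `sift_like_descriptor` is hypothetical. It deliberately omits parts of Lowe's full method: Gaussian weighting, trilinear interpolation between bins, rotation to the dominant orientation and the clip-and-renormalize step.

```python
import math


def sift_like_descriptor(patch):
    """Simplified 4x4x8 = 128-D gradient-orientation histogram of a 16x16 patch."""
    assert len(patch) == 16 and all(len(row) == 16 for row in patch)
    hist = [[[0.0] * 8 for _ in range(4)] for _ in range(4)]    # [cell_y][cell_x][orientation bin]
    for y in range(16):
        for x in range(16):
            # Central differences (one-sided at the patch border)
            gx = patch[y][min(x + 1, 15)] - patch[y][max(x - 1, 0)]
            gy = patch[min(y + 1, 15)][x] - patch[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            theta = math.atan2(gy, gx) % (2 * math.pi)              # angle in [0, 2*pi)
            b = int(theta / (2 * math.pi) * 8) % 8                  # one of 8 orientation bins
            hist[y // 4][x // 4][b] += mag                          # 4x4-pixel spatial cells
    vec = [v for row in hist for cell in row for v in cell]         # flatten to 128 values
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    # Unit length cancels a global contrast change; gradients already ignore a brightness offset
    return [v / norm for v in vec]


if __name__ == "__main__":
    ramp = [[float(x) for x in range(16)] for _ in range(16)]       # brightness increases to the right
    d = sift_like_descriptor(ramp)
    print(len(d))                                                   # 128
```

Because gradients are differences and the final vector is normalized, the descriptor of `3 * patch + 50` equals that of `patch` up to rounding. This is a simple form of the illumination robustness described above.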
After evaluating many representative descriptors, reference [22] found that SIFT-like descriptors performed best. The local binary pattern (LBP) is one of the more effective texture features for two-dimensional images [23]. It is essentially a texture descriptor based on the ordering of pixel gray values, and it uses local patterns as texture primitives.
LBP is simple to compute and invariant to linear illumination changes, and it has been widely used in face recognition, background extraction and image retrieval [24,25,26,27]. Reference [28] was the first to apply the LBP operator to the construction of local image feature descriptors, proposing the center-symmetric local binary pattern (CS-LBP) descriptor for local image regions. Experimental results showed that the CS-LBP descriptor achieves better matching performance than the SIFT descriptor and has clear advantages over SIFT in storage space and computational overhead.
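The difference between basic LBP and CS-LBP can be sketched on a 3 × 3 window. Basic LBP compares each of the 8 neighbors with the center pixel, which gives 2^8 = 256 possible codes. CS-LBP instead compares the 4 center-symmetric neighbor pairs, which gives only 2^4 = 16 codes, so each region histogram is 16 times shorter. The neighbor ordering and bit assignment below are assumptions (conventions vary between papers), and the function names are hypothetical. The threshold `t` in CS-LBP suppresses flat-region noise.

```python
# Neighbours of a 3x3 window, visited clockwise from the top-left corner.
# The bit ordering is an assumed convention; papers differ.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, y, x):
    """Basic LBP: neighbour i sets bit i when it is >= the centre pixel (256 codes)."""
    c = img[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if img[y + dy][x + dx] >= c:
            code |= 1 << bit
    return code


def cs_lbp_code(img, y, x, t=0):
    """CS-LBP: bit i is set when n_i - n_(i+4) > t for the 4 centre-symmetric pairs (16 codes)."""
    n = [img[y + dy][x + dx] for dy, dx in OFFSETS]
    code = 0
    for i in range(4):
        if n[i] - n[i + 4] > t:
            code |= 1 << i
    return code


def code_histogram(img, code_fn, bins, **kwargs):
    """Histogram of codes over all interior pixels of a region."""
    hist = [0] * bins
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            hist[code_fn(img, y, x, **kwargs)] += 1
    return hist


if __name__ == "__main__":
    img = [[9, 1, 4],
           [2, 5, 8],
           [3, 6, 0]]
    print(lbp_code(img, 1, 1), cs_lbp_code(img, 1, 1), cs_lbp_code(img, 1, 1, t=2))
    # prints: 41 13 9
```

Applying any increasing linear gray-level change such as `2 * v + 10` to `img` leaves both codes unchanged (for CS-LBP with `t=0`), because only the signs of the comparisons matter.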
Tan and Triggs extended the LBP operator to a ternary code and proposed the local ternary pattern (LTP) operator [29]. The LTP feature is more discriminative than the LBP feature, but its histogram dimension is much larger, which makes it unsuitable for directly describing local image feature regions. Extending the CS-LBP descriptor directly to a center-symmetric local ternary pattern (CS-LTP) descriptor reduces the descriptor dimension to some extent, but it still does not meet the needs of practical applications.
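The dimensionality trade-off can be sketched as follows. LTP assigns each neighbor a ternary value (+1, 0 or −1) relative to the center using a user-chosen threshold `t`. Following Tan and Triggs, the code is commonly split into an "upper" and a "lower" binary pattern, so a region needs 2 × 256 = 512 histogram bins instead of 256. A center-symmetric ternary comparison of the 4 symmetric pairs yields 3^4 = 81 codes, which is more than the 16 codes of CS-LBP. The base-3 encoding of CS-LTP below (digits 0, 1, 2 for −1, 0, +1) and the function names are assumptions made for illustration.

```python
# Neighbours of a 3x3 window, visited clockwise from the top-left corner (assumed convention).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def ltp_codes(img, y, x, t):
    """LTP: ternary code per neighbour, split into upper (+1) and lower (-1) binary patterns."""
    c = img[y][x]
    upper = lower = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        u = img[y + dy][x + dx]
        if u >= c + t:          # ternary +1
            upper |= 1 << bit
        elif u <= c - t:        # ternary -1
            lower |= 1 << bit
        # |u - c| < t is ternary 0: neither pattern gets a bit
    return upper, lower


def cs_ltp_code(img, y, x, t):
    """CS-LTP: ternary comparison of the 4 centre-symmetric pairs, encoded in base 3 (81 codes)."""
    n = [img[y + dy][x + dx] for dy, dx in OFFSETS]
    code = 0
    for i in range(4):
        d = n[i] - n[i + 4]
        digit = 2 if d >= t else (0 if d <= -t else 1)    # map +1/-1/0 to 2/0/1
        code += digit * 3 ** i
    return code


if __name__ == "__main__":
    img = [[9, 1, 4],
           [2, 5, 8],
           [3, 6, 0]]
    print(ltp_codes(img, 1, 1, t=2), cs_ltp_code(img, 1, 1, t=2))
    # prints: (9, 210) 65
```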
Many algorithms have been derived from SIFT: GLOH, proposed by Mikolajczyk [30]; CSIFT, proposed by Abdel-Hakim [31]; ASIFT, proposed by Morel [32]; the simplified SSIFT, proposed by Liu Li [33]; PSIFT, proposed by Cai Guorong [34]; a Laplace-based local feature description method, proposed by Tang Yonghe [35]; and image matching based on adaptive redundant keypoint elimination in SIFT [36]. Even so, the efficiency of SIFT still leaves considerable room for improvement.
To address the poor real-time performance of SIFT, this paper first changes the way the scale space is computed and adds a stability factor to reduce both the matching error and the computation time. It then constructs a feature descriptor over a cross-shaped partition of the keypoint neighborhood, which reduces the descriptor dimension from 128 to 96. This cuts the amount of matching computation and shortens the matching time.
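Why a shorter descriptor shortens matching can be seen from exhaustive nearest-neighbor matching by Euclidean distance. Its cost is proportional to N_a × N_b × d, where N_a and N_b are the keypoint counts and d is the descriptor dimension, so going from d = 128 to d = 96 removes 25% of the distance work. The sketch below counts the squared-difference terms. It assumes brute-force matching; the paper's actual matching strategy may differ (for example, a k-d tree search), and `match_nearest` is a hypothetical name.

```python
import math


def match_nearest(desc_a, desc_b):
    """Exhaustive nearest-neighbour matching by Euclidean distance.

    Returns a list of (index_in_desc_b, distance) for each descriptor in desc_a,
    and the number of squared-difference terms evaluated.
    """
    matches, ops = [], 0
    for a in desc_a:
        best_j, best_d2 = -1, math.inf
        for j, b in enumerate(desc_b):
            d2 = sum((p - q) ** 2 for p, q in zip(a, b))
            ops += len(a)                       # one term per descriptor dimension
            if d2 < best_d2:
                best_j, best_d2 = j, d2
        matches.append((best_j, math.sqrt(best_d2)))
    return matches, ops


if __name__ == "__main__":
    # Same keypoint counts; descriptor length 128 (SIFT) versus 96 (reduced)
    for dim in (128, 96):
        a = [[float(i % 7)] * dim for i in range(50)]
        b = [[float(j % 5)] * dim for j in range(60)]
        _, ops = match_nearest(a, b)
        print(dim, ops)
    # prints:
    # 128 384000
    # 96 288000
```

The ratio 288000 / 384000 = 0.75 holds for any keypoint counts, because only d changes.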
The rest of the article is organized as follows. Section 2 presents the original SIFT algorithm and Section 3 the improved SIFT algorithm. Section 4 analyzes the experimental results, and Section 5 presents the conclusions.