Article

Research on Unsupervised Feature Point Prediction Algorithm for Multigrid Image Stitching

College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu 610059, China
* Author to whom correspondence should be addressed.
Symmetry 2024, 16(8), 1064; https://doi.org/10.3390/sym16081064
Submission received: 9 July 2024 / Revised: 9 August 2024 / Accepted: 14 August 2024 / Published: 18 August 2024
(This article belongs to the Special Issue Advances in Image Processing with Symmetry/Asymmetry)

Abstract:
The conventional feature point-based image stitching algorithm exhibits inconsistencies in the quality of feature points across diverse scenes. This may result in the deterioration of the alignment effect or even the inability to align two images. To address this issue, this paper presents an unsupervised multigrid image alignment method that integrates the conventional feature point-based image alignment algorithm with deep learning techniques. The method postulates that the feature points are uniformly distributed in the image and employs a deep learning network to predict their displacements, thereby enhancing the robustness of the feature points. Furthermore, the precision of image alignment is enhanced through the parameterization of APAP (As-projective-as-possible image stitching with moving DLT) multigrid deformation. Ultimately, based on the symmetry exhibited by the homography matrix and its inverse matrix throughout the projection process, image chunking inverse warping is introduced to obtain the stitched images for the multigrid deep learning network. Additionally, the mesh shape-preserving loss is introduced to constrain the shape of the multigrid. The experimental results demonstrate that in the real-world UDIS-D dataset, the method achieves notable improvements in feature point matching and homography estimation tasks, and exhibits superior alignment performance on the traditional image stitching dataset.

1. Introduction

In the contemporary digital age, image processing technology plays an increasingly pivotal role in a multitude of fields. Image alignment is a crucial problem in computer vision, with applications in image stitching [1], object recognition [2], 3D reconstruction [3], and other areas. Its use in these contexts can enhance image quality and accuracy, which is of great significance.
Conventional alignment techniques for image stitching are typically based on pixel alignment error or on feature point-based image warping. These techniques generally align the images by estimating the homography between the two images, an invertible mapping of a plane between two viewpoints. Pixel alignment error-based methods optimize the homography parameters between two images iteratively [4], but they generally yield good results only when the overlap is high. Conventional feature point-based image alignment methods typically employ various feature extractors [5,6,7] and robust estimation methods [8,9] to detect and precisely match feature points, and then use different alignment strategies to align the images. Alignment strategies such as APAP [10] employ a grid-based approach to align the image as closely as possible. SVA [11] introduces a smoothly varying affine stitching field while retaining the good extrapolation and occlusion-handling properties of parametric transforms. ELA [12] is a robust image stitching method based on the TPS [13] transform that addresses the parallax-tolerance issue in image stitching. TFA [14] employs triangular facet approximation, which divides the image into small triangular facets and performs locally adaptive image alignment for each triangular region. SPHP [15] employs a semi-projective transform, which lies between affine and projective transforms, to warp the image, thereby preserving its underlying geometry to a certain extent. GSP [16] enhances image stitching by introducing a global similarity prior, which results in a more natural and coherent stitching outcome. AANAP [17] proposes a novel image stitching method that combines multiple techniques to render the panorama more natural. DHW [18] utilizes two homographies to better align the image. However, feature point-based image alignment methods are subject to certain limitations when applied to low-texture and low-light images. For instance, the distribution of feature points can become uneven under non-uniform lighting, which can distort the alignment in regions with fewer features; likewise, images with low overlap may yield too few matching feature points, causing the alignment to fail.
In comparison with traditional methods, deep learning image stitching methods [19,20,21,22] predict the displacements of an image's four corner points and solve for a global homography, which only describes the mapping of points lying on the same plane in the two viewpoints; this limits their ability to handle parallax. Ref. [23] proposed meshing the image and predicting four corner displacements for each mesh in order to solve for a multigrid homography. This enables multigrid warping in deep learning, but the pixel-to-pixel mapping cannot be determined outside the mesh space, so obtaining the final stitched image is difficult. Furthermore, because these networks predict the displacements of mesh points rather than feature points, traditional feature point-based image alignment algorithms cannot be applied to the homography estimation.
To address these challenges, this study proposes a deep learning-based feature point prediction method. The method enhances the robustness of deep feature extraction by exploring the matching relationship between deep feature maps. Combining the traditional APAP (as-projective-as-possible image stitching with moving DLT) multigrid image alignment algorithm with deep learning allows the traditional alignment algorithm to be applied within a deep learning framework, improving the accuracy and robustness of image alignment. To obtain a complete stitched image, we introduce a mesh shape-preserving loss and train the model by warping the target image; after training, we chunk the reference image, inversely warp the chunks to obtain a complete warped image, and combine it with the target image to produce the complete stitched image. Our method employs multigrid homography estimation to address challenging scenarios, thereby improving alignment accuracy compared with existing deep learning methods [22] that predict a single global homography.
Specifically, our method employs singular value decomposition (SVD) to solve the homography of each grid. This is achieved by first obtaining a weighted map of all feature points in each grid. This is conducted by assuming a uniform distribution of feature points, predicting the displacements of the feature points, and simultaneously calculating the weights of each feature point within each grid. Subsequently, the accurate image alignment is achieved by warping the target image in the grid space. The proposed framework can be readily trained in an unsupervised manner using pixel-level content loss. Additionally, we introduce a shape-preserving loss to constrain the mesh inverse warping, thereby preventing the stitched images from appearing cracked (see Figure 1b). Once training is complete, the reference image is divided into grid form. The invertibility of the homography matrix is then employed to inverse warp the corresponding image blocks, resulting in a warped map of each image block in the target image space. Finally, all the warped image blocks are synthesized into a complete warped map, which is then spliced with the target image.
In the course of our experiments, we assess the efficacy of our approach in the domains of homography estimation, image alignment, and feature point prediction. The experimental results in real scenarios demonstrate the superiority of the method. The principal contributions of this paper are as follows:
1. A deep learning network for predicting feature points is proposed. The network achieves more robust deep feature extraction and, because the number of feature points can be set explicitly, always outputs the desired number of feature points.
2. The APAP multigrid deformation parameterization was implemented for unsupervised multigrid image alignment.
3. Image chunking inverse warping and a mesh shape-preserving loss are proposed. The reference image is first chunked, the image chunks are inversely warped using the invertibility of the homography matrix, and the mesh shape-preserving loss is then used to constrain the distances between image chunks in order to obtain a complete stitched image.

2. Related Work

2.1. Traditional Image Stitching

The conventional approach to image stitching typically comprises three distinct stages. The initial step is the detection and matching of feature points. The second step is adaptive warping. The final step in the process is image fusion.
Feature point detection and matching: Feature point detection algorithms extract key points of special significance from an image, providing an important foundation for applications such as image matching, target tracking, and 3D reconstruction. The SIFT algorithm is a feature point detection algorithm based on scale invariance; it can detect the same feature points under different scales, rotations, and brightness conditions. The SURF algorithm is a fast and robust feature point detection algorithm that accelerates feature point detection and matching by using techniques such as integral images and approximated Hessian responses. The ORB algorithm combines speed, rotation invariance, and scale invariance, making it suitable for feature extraction and matching in real-time image processing and in environments with limited computational resources. The Harris corner detection algorithm is a classical feature point detection algorithm that searches for significant features by detecting corner points in the image. To obtain more accurate matches, RANSAC [24] employs random sampling and iterative optimization to estimate the model parameters and remove outliers in the data. MAGSAC [9] enhances RANSAC [24] by using an M-estimator function and adaptive weight updating to improve resilience to outliers. MAGSAC++ [8] adds a preprocessing step, an optimized sampling strategy, and an inlier evaluation method on top of MAGSAC [9] to further improve robustness and efficiency.
Adaptive warping is a common method for aligning the images to be stitched. SVA [11], ELA [12], TFA [14], and AutoStitch [1] extract distinctive keypoints to construct a global transform. DHW [18] aligns the foreground and background using two homographies. In contrast, APAP [10] grids the image, giving greater weight to nearer feature points and lesser weight to farther feature points, and then seeks a local homography for each grid, thereby achieving superior alignment accuracy. Ref. [25] proposed eliminating feature points with large residuals under the local homography transform, culling them to obtain a more accurate alignment. SPHP [15], GSP [16], and AANAP [17] were proposed to better preserve image shape.
Image fusion combines two or more images into a single, complete image, and is typically achieved through seam cutting followed by blending. For instance, seam cutting [26] identifies the optimal seam based on pixel value differences, thereby enhancing the visual coherence of the overlapping regions. Other seam criteria include gradient differences [27,28], motion and exposure differences [29], saliency differences [30], and unsupervised seam learning [31]. For blending, one approach decomposes the images with a Gaussian pyramid [32], fuses them across different frequency bands, and finally reconstructs a seamless stitched image; another approach, proposed by Nie et al. [22], reconstructs the stitched image through unsupervised learning.

2.2. Deep Learning Image Stitching

In the context of deep learning image stitching, the determination of the homography between images is typically achieved through the prediction of the offsets of the image vertices.
DeTone et al. [33] proposed the initial deep homography method, which determines the homography by predicting the displacement of four vertices of an image. Nguyen et al. [19] subsequently proposed an unsupervised deep homography method, which determines the homography by unsupervised loss. Zhang et al. [20] proposed a content-aware unsupervised network, which helps to improve the performance of deep homography. Nie et al. [22] proposed the first end-to-end deep image stitching network, which is divided into an image warping part and an image reconstruction part. The warping part employs the network framework of Nie et al. [21], which connects the feature pyramid and feature correlation in a unified framework. This enables the network to predict homography from coarse to fine and to handle scenes with a relatively large baseline. Nie et al. [23] predicts homography by meshing the image and predicting the displacements of the grid points separately to determine the homography for each grid. In contrast, Nie et al. [31] warps an image by using a network to predict the control points for the TPS [13] algorithm. This is achieved by first predicting a global homography warping of the image to initially distort the control points, and then solving for the final TPS warping matrix while predicting the residual displacements of the control points.

3. Methodological Process

The method comprises two stages: deep multigrid warping and chunked inverse transformation. In the first stage, as illustrated in Figure 2, our method takes the reference image $I_r$ and the target image $I_t$ as inputs, generates the displacements of the feature points, computes the weight matrix $W$, and multiplies it with the feature point mapping relation matrix $A$. The multigrid distortion matrices are obtained by singular value decomposition of $WA$. The second stage is depicted in Figure 3: the reference image is segmented, an inverse transformation is applied to every image segment and the results are superimposed, producing the reference image in the target-image view; the overlapping regions are then fused by weighted averaging to obtain the final panorama.

3.1. Multi-Mesh Warping Parameterization

The homography transform is a common method for image alignment. It is an invertible mapping from one image to another with eight degrees of freedom, accounting for translation, rotation, scaling, and the mapping of the line at infinity. However, a homography can only align a single plane and is therefore inadequate for real images, which generally contain more than one plane. The traditional APAP algorithm handles this problem more effectively, but it relies on traditional feature point extraction, which may make it difficult to align the images when the feature points are unevenly distributed or too few. While deep learning methods can effectively address this issue, current deep learning frameworks are primarily based on the displacement prediction of grid points, which prevents them from using the APAP alignment algorithm directly. To address this, we parameterize the APAP warping: the feature points are assumed to be uniformly distributed on the image in the shape of grid points, deep learning is employed to predict the displacement of each feature point, the weight of each feature point in each grid is calculated, and the homography of each grid is obtained through singular value decomposition.
The APAP alignment algorithm computes the local homography of each grid from two matched sets of feature points, $X^1 = \{x_1^1, x_2^1, \ldots, x_N^1\}$ on the reference image and $X^2 = \{x_1^2, x_2^2, \ldots, x_N^2\}$ on the distorted image. We assume that $X^1$ is uniformly distributed in the form of a grid on the reference image, and that $X^2$ is obtained by adding the displacements predicted by the deep learning network to $X^1$. The i-th pair of feature points is $(x_i^1, x_i^2)$, with $x_i^1, x_i^2 \in \mathbb{R}^{2 \times 1}$. Let $C = \{c_1, c_2, \ldots, c_M\}$, $c_k \in \mathbb{R}^{2 \times 1}$, $k \in \{1, \ldots, M\}$, be the set of grid centers, with $c_k$ the center of the k-th grid. Based on the distance between the reference image feature points and $c_k$, the weight $w_k^i$ ($k \in \{1, \ldots, M\}$, $i \in \{1, \ldots, N\}$) of each feature point in the local distortion of the k-th grid is calculated as
$$w_k^i = \max\!\left( e^{-\left\| c_k - x_i^1 \right\|^2 / \sigma^2},\; \gamma \right). \qquad (1)$$
The scale parameter $\sigma$ controls the spatial extent of each feature point's influence, while $\gamma$ bounds the minimum weight of the feature points. When $\gamma$ equals one, the distortion degenerates to a global projective distortion.
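As a concrete illustration, the weights of Equation (1) can be computed for all grids and feature points at once as in the following PyTorch sketch; the tensor names and shapes are illustrative assumptions rather than the authors' implementation.

```python
import torch

def grid_weights(centers, ref_points, sigma=1.0, gamma=1e-4):
    """Eq. (1): weight of every reference feature point for every grid.

    centers:    (M, 2) grid-center coordinates c_k
    ref_points: (N, 2) uniformly distributed feature points x_i^1
    returns:    (M, N) weights w_k^i
    """
    # Squared distance between every grid center and every feature point.
    d2 = torch.cdist(centers, ref_points).pow(2)   # (M, N)
    w = torch.exp(-d2 / (sigma ** 2))              # Gaussian fall-off with scale sigma
    return torch.clamp(w, min=gamma)               # w_k^i = max(exp(...), gamma)
```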
Let the projection matrix be $\hat{h}$:
$$\hat{h} = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{bmatrix} = \begin{bmatrix} h_{r1} \\ h_{r2} \\ h_{r3} \end{bmatrix}, \qquad (2)$$
where $h_{ri}$ denotes the i-th row of $\hat{h}$. From a single pair of feature points, Equation (3) can be derived:
$$\begin{bmatrix} x_i^1 & y_i^1 & 1 & 0 & 0 & 0 & -x_i^1 x_i^2 & -y_i^1 x_i^2 & -x_i^2 \\ 0 & 0 & 0 & x_i^1 & y_i^1 & 1 & -x_i^1 y_i^2 & -y_i^1 y_i^2 & -y_i^2 \end{bmatrix} \begin{bmatrix} h_{r1}^{T} \\ h_{r2}^{T} \\ h_{r3}^{T} \end{bmatrix} = a_i \hat{h} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \qquad (3)$$
By employing all available feature points, the following equation can be derived.
$$\begin{bmatrix} x_1^1 & y_1^1 & 1 & 0 & 0 & 0 & -x_1^1 x_1^2 & -y_1^1 x_1^2 & -x_1^2 \\ 0 & 0 & 0 & x_1^1 & y_1^1 & 1 & -x_1^1 y_1^2 & -y_1^1 y_1^2 & -y_1^2 \\ & & & & \vdots & & & & \\ x_N^1 & y_N^1 & 1 & 0 & 0 & 0 & -x_N^1 x_N^2 & -y_N^1 x_N^2 & -x_N^2 \\ 0 & 0 & 0 & x_N^1 & y_N^1 & 1 & -x_N^1 y_N^2 & -y_N^1 y_N^2 & -y_N^2 \end{bmatrix} \begin{bmatrix} h_{r1}^{T} \\ h_{r2}^{T} \\ h_{r3}^{T} \end{bmatrix} = A \hat{h} = \mathbf{0}. \qquad (4)$$
The solution $\hat{h}$ is the right singular vector of $A$ associated with its smallest singular value, obtained through singular value decomposition of the matrix assembled from all the feature points:
$$\hat{h} = \underset{h}{\arg\min} \sum_{i=1}^{N} \left\| a_i h \right\|^2 = \underset{h}{\arg\min} \left\| A h \right\|^2, \quad \text{subject to } \left\| h \right\| = 1. \qquad (5)$$
Here $A \in \mathbb{R}^{2N \times 9}$ is obtained by vertically stacking all blocks $a_i$. The local deformation matrix $\hat{h}_k$ of the k-th mesh is computed as
$$\hat{h}_k = \underset{h}{\arg\min} \left\| W_k A h \right\|^2, \quad W_k = \mathrm{diag}\!\left( w_k^1, w_k^1, w_k^2, w_k^2, \ldots, w_k^N, w_k^N \right), \qquad (6)$$
where $W_k \in \mathbb{R}^{2N \times 2N}$, so that $W_k A \in \mathbb{R}^{2N \times 9}$. The right singular vector of $W_k A$ associated with its smallest singular value gives $\hat{h}_k$.
The homography matrices of all meshes are collected as
$$H = \begin{bmatrix} \hat{h}_1 & \cdots & \hat{h}_M \end{bmatrix} = \begin{bmatrix} \underset{h_1}{\arg\min} \left\| W_1 A h_1 \right\|^2 & \cdots & \underset{h_M}{\arg\min} \left\| W_M A h_M \right\|^2 \end{bmatrix}, \qquad (7)$$
where $W = \begin{bmatrix} W_1 & \cdots & W_M \end{bmatrix}$ collects the per-grid weight matrices, so the local homographies of all grids can be solved in parallel.
In summary, in our multigrid warping parameterization, the feature points are defined to be uniformly distributed over the image as grid points and their displacements are predicted by the network. The weights of the feature points within each grid are then calculated, and singular value decomposition (SVD) is used to obtain the homography of each grid.
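To make the parameterization of Equations (3)–(7) concrete, the sketch below assembles the DLT matrix A from the matched point sets and solves the weighted problem of Equation (6) for every grid with a batched SVD; the function names and the batched formulation are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def build_A(x1, x2):
    """Stack the 2x9 blocks a_i of Eq. (3) into the matrix A of Eq. (4).

    x1, x2: (N, 2) matched points in the reference and warped image.
    returns: (2N, 9)
    """
    N = x1.shape[0]
    x, y = x1[:, 0], x1[:, 1]
    u, v = x2[:, 0], x2[:, 1]
    zero, one = torch.zeros(N), torch.ones(N)
    r1 = torch.stack([x, y, one, zero, zero, zero, -x * u, -y * u, -u], dim=1)
    r2 = torch.stack([zero, zero, zero, x, y, one, -x * v, -y * v, -v], dim=1)
    # Interleave the two rows of every point pair: (N, 2, 9) -> (2N, 9).
    return torch.stack([r1, r2], dim=1).reshape(2 * N, 9)

def local_homographies(A, weights):
    """Eqs. (6)-(7): one homography per grid from the weighted DLT matrix.

    A:       (2N, 9) from build_A
    weights: (M, N) per-grid weights from Eq. (1)
    returns: (M, 3, 3)
    """
    # Duplicate each weight so it multiplies both rows of a_i (the diagonal of W_k).
    W = weights.repeat_interleave(2, dim=1)                  # (M, 2N)
    WA = W.unsqueeze(-1) * A.unsqueeze(0)                    # (M, 2N, 9)
    # The right singular vector with the smallest singular value solves Eq. (6).
    _, _, Vh = torch.linalg.svd(WA, full_matrices=False)     # Vh: (M, 9, 9)
    h = Vh[:, -1, :]
    return h.reshape(-1, 3, 3)
```

Normalizing the point coordinates before assembling A, as in the standard normalized DLT, would improve the numerical conditioning of the SVD.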

3.2. Network Framework

Figure 2 provides a concise overview of the multigrid deep homography network. The images $I_r$ and $I_t$ are processed by a ResNet50 [34] model that maps the 512 × 512 image to semantic feature blocks. These are subsequently mapped into a two-channel 64 × 64 feature flow by the Contextual Correlation Layer [23], and a regression network then predicts from this feature flow the N × 2 displacement parameters of all feature points. The initial feature points and the predicted displacements are substituted into Equation (4) to yield the matrix $A$, and the weights of all feature points in each grid are obtained according to Equation (1). Since the local distortion matrices of the grids differ only in their weights, we can solve for the local distortion matrices of all grids, $H$, in parallel. After obtaining $H$, we apply the multigrid distortion to the target image $I_t$ to obtain the distorted target image $I_{tr}$. In Figure 2, the red dots in $I_r^f$ represent the initially assumed feature points on the reference image, the green dots in $I_t^f$ indicate the predicted locations of the corresponding feature points on the target image, and the yellow dots in $I_t^f$ represent the centers of the grids, which are used to compute the weight of each feature point within that grid. $I_{tr}$ is the distorted target image.
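A rough PyTorch sketch of this warping branch is given below; the Contextual Correlation Layer of [23] is replaced by a simple placeholder convolution, and all layer sizes are assumptions for illustration rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class FeaturePointRegressor(nn.Module):
    """ResNet50 features -> (stubbed) contextual correlation -> N x 2 displacements."""

    def __init__(self, num_points=33 * 33):
        super().__init__()
        self.num_points = num_points
        backbone = torchvision.models.resnet50(weights=None)
        # Keep the convolutional stages only; discard the classification head.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        # Stand-in for the Contextual Correlation Layer of [23], which produces
        # a two-channel 64 x 64 feature flow from the two feature maps.
        self.correlation = nn.Conv2d(2 * 2048, 2, kernel_size=1)
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * 64 * 64, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_points * 2),
        )

    def forward(self, ref, tgt):
        f_r, f_t = self.features(ref), self.features(tgt)        # (B, 2048, h, w)
        flow = self.correlation(torch.cat([f_r, f_t], dim=1))    # placeholder correlation
        flow = F.interpolate(flow, size=(64, 64), mode='bilinear', align_corners=False)
        return self.regressor(flow).view(-1, self.num_points, 2)  # predicted displacements
```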

3.3. Chunking and Reverse Distortion

Deep multigrid warping differs from deep single-grid warping and from warps defined by a single coordinate transformation formula (e.g., TPS [13]): in single-grid warping and TPS, every pixel shares the same warping parameters, whereas deep multigrid warping has to assign a distinct homography to each pixel within each grid after warping. As illustrated in Figure 4a, a homography is assigned only to the pixels that fall inside the deformed grids, which correspond to the grid shapes before distortion shown in Figure 4b; for pixels outside the grid regions in Figure 4b, the warping parameters cannot be determined. Because the grid shapes are irregular, determining the precise area of each pixel block and performing pixel interpolation on a single image is complicated. Consequently, it is challenging to generate a complete distorted image by directly warping the target image.
To address this issue, we observe that the homography matrix $H$ and its inverse $H^{-1}$ exhibit symmetry during the projection process: a transformation applied with a homography matrix followed by the inverse transformation with its inverse matrix restores the original image. Moreover, a homography maps straight lines to straight lines, and its inverse has the same property. Consequently, the inverses of the per-grid distortion matrices can be collected as $H^{-1} = \begin{bmatrix} \hat{h}_1^{-1} & \cdots & \hat{h}_M^{-1} \end{bmatrix}$, which represents the multigrid transformation from the reference image $I_r$ to the target image $I_t$. The reference image is then partitioned into blocks corresponding to the grid divisions, $P = \{p_1, \ldots, p_M\}$. Next, the distorted blocks of the reference image, seen from the viewpoint of the target image, are obtained by applying the corresponding inverse distortion to each block. Finally, all the distorted blocks are superimposed to obtain the distorted image $I_{rt}$:
$$I_{rt} = \varphi\!\left(P, H^{-1}\right) = \sum_{i=1}^{M} \varphi\!\left(p_i, \hat{h}_i^{-1}\right), \qquad (8)$$
where $\varphi(A, B)$ denotes warping image A with warp B.
Once the image $I_{rt}$ and the target image $I_t$ have been obtained, the stitched image can be generated by applying an image fusion algorithm such as average fusion or Graphcut Textures [26], as illustrated in Figure 3.
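The chunked inverse warping and fusion described above can be sketched with OpenCV as follows; the panorama size, the per-block masks, and the simple averaging of overlaps are illustrative assumptions.

```python
import cv2
import numpy as np

def chunked_inverse_warp(ref_img, H_list, blocks, pano_size):
    """Warp each reference-image block p_i with the inverse homography of its grid
    and superimpose the results (Eq. (8)).

    ref_img:   reference image I_r as an HxWx3 uint8 array
    H_list:    per-grid 3x3 homographies h_k used to warp the target image
    blocks:    list of (x0, y0, x1, y1) pixel boxes partitioning I_r into the grids
    pano_size: (width, height) of the output canvas in the target-image space
    """
    acc = np.zeros((pano_size[1], pano_size[0], 3), np.float32)
    cnt = np.zeros((pano_size[1], pano_size[0], 1), np.float32)
    for H, (x0, y0, x1, y1) in zip(H_list, blocks):
        # Keep only the pixels of this block, then warp with the inverse homography.
        block = np.zeros_like(ref_img)
        block[y0:y1, x0:x1] = ref_img[y0:y1, x0:x1]
        mask = np.zeros(ref_img.shape[:2], np.float32)
        mask[y0:y1, x0:x1] = 1.0
        H_inv = np.linalg.inv(H)
        warped = cv2.warpPerspective(block.astype(np.float32), H_inv, pano_size)
        warped_mask = cv2.warpPerspective(mask, H_inv, pano_size)
        acc += warped
        cnt += warped_mask[..., None]
    # Average where warped blocks overlap; empty pixels remain zero.
    return (acc / np.maximum(cnt, 1.0)).astype(np.uint8)
```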

3.4. Loss Function

In existing real-world image stitching networks, the losses can be classified into two categories: content loss and shape-preserving loss. The content loss measures the alignment of the overlapping image regions and is used to optimize the network. However, relying on a content alignment loss alone may lead to unnatural mesh distortions, such as self-intersections, so constraints must be imposed with a shape-preserving loss, which relates neighboring meshes to their surroundings and keeps all meshes in a consistent shape. As our network does not predict the displacements of grid points but rather the displacements of feature points, we can only solve the distortion matrices of the different grids using the content alignment loss and then inversely warp the grids. As a result, the image blocks obtained from the inverse warping are not connected to each other (see Figure 1a,b). To ensure continuity between the image blocks, we introduce a grid shape-preserving loss $L_1$:
$$L_1 = \sum_{i=1}^{Y-1} \sum_{j=1}^{X-1} \left( \sqrt{\left\| m_{i,j}^{4} - m_{i+1,j}^{2} \right\|^2 + 10^{-14}} + \sqrt{\left\| m_{i,j}^{4} - m_{i+1,j+1}^{1} \right\|^2 + 10^{-14}} + \sqrt{\left\| m_{i+1,j+1}^{1} - m_{i,j+1}^{3} \right\|^2 + 10^{-14}} + \sqrt{\left\| m_{i+1,j}^{2} - m_{i,j+1}^{3} \right\|^2 + 10^{-14}} \right). \qquad (9)$$
In the context of inverse warping, $m_{i,j}$ denotes the grid in row i and column j, and the superscripts 1, 2, 3, and 4 denote its upper-left, upper-right, lower-left, and lower-right corner points, respectively. Y is the number of grid rows and X the number of grid columns. The $10^{-14}$ term in Equation (9) serves to prevent singular value decomposition errors during training.
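A possible PyTorch implementation of Equation (9) is shown below, assuming the inversely warped grid corners are stored in a (Y, X, 4, 2) tensor ordered upper-left, upper-right, lower-left, lower-right; placing the 10^-14 term under the square root is our reading of the equation.

```python
import torch

EPS = 1e-14  # the 10^-14 term of Eq. (9)
UL, UR, LL, LR = 0, 1, 2, 3  # corner order: upper-left, upper-right, lower-left, lower-right

def shape_preserving_loss(m):
    """Eq. (9): pull together the corners of neighbouring inverse-warped grids.

    m: (Y, X, 4, 2) corner coordinates of every grid after inverse warping.
    """
    def dist(a, b):
        # sqrt(||a - b||^2 + eps); the eps keeps the square root numerically stable.
        return torch.sqrt(((a - b) ** 2).sum(dim=-1) + EPS)

    term1 = dist(m[:-1, :-1, LR], m[1:, :-1, UR])   # m_{i,j}^4 vs m_{i+1,j}^2
    term2 = dist(m[:-1, :-1, LR], m[1:, 1:, UL])    # m_{i,j}^4 vs m_{i+1,j+1}^1
    term3 = dist(m[1:, 1:, UL],  m[:-1, 1:, LL])    # m_{i+1,j+1}^1 vs m_{i,j+1}^3
    term4 = dist(m[1:, :-1, UR], m[:-1, 1:, LL])    # m_{i+1,j}^2 vs m_{i,j+1}^3
    return (term1 + term2 + term3 + term4).sum()
```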
To guarantee effective image alignment, the content alignment loss $L_2$ is also required:
$$L_2 = \left\| I_r \odot \varphi(E, H) - \varphi(I_t, H) \right\| + \left\| I_t \odot \varphi(E, H^{-1}) - \varphi(I_r, H^{-1}) \right\|, \qquad (10)$$
where $\varphi(A, B)$ represents the distortion operation B applied to image A, $\odot$ denotes the pixel-wise product, and E denotes the all-ones matrix. The total network loss is
$$L = L_1 + L_2. \qquad (11)$$
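Equations (10) and (11) could be implemented along the following lines; the multigrid warp φ is assumed to be available as a differentiable function warp(image, H) (e.g., built on grid sampling), and the choice of an L1 photometric penalty is an assumption, since the norm is not specified in the text.

```python
import torch

def content_loss(I_r, I_t, H, H_inv, warp):
    """Eq. (10): photometric alignment loss over the valid overlap region.

    I_r, I_t: (B, C, H, W) reference and target images in [0, 1]
    H, H_inv: multigrid warps and their inverses
    warp:     differentiable warping operator phi(image, warp)
    """
    ones = torch.ones_like(I_t)  # the all-ones matrix E masks the valid overlap
    loss_fwd = torch.abs(I_r * warp(ones, H) - warp(I_t, H)).mean()
    loss_bwd = torch.abs(I_t * warp(ones, H_inv) - warp(I_r, H_inv)).mean()
    return loss_fwd + loss_bwd

def total_loss(l_shape, l_content):
    """Eq. (11): the two terms are simply summed."""
    return l_shape + l_content
```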

4. Experimental Results and Analysis

4.1. Data Sets and Experimental Environments

The model was trained on the UDIS-D [22] real-world dataset, which comprises 10,440 image pairs for training and 1106 image pairs for testing. The dataset encompasses a range of overlap rates, degrees of parallax, and variable scenes, including indoor, outdoor, night, dark, snow, and zoom.
The network was trained using the Adam optimizer with an exponentially decaying learning rate. The batch size was set to 4, and the network was trained for 400 epochs. The number of feature points was set to (32 + 1) × (32 + 1), the number of meshes was set to 16 × 16, and σ and γ were set to 1 and 0.0001, respectively. All implementations were based on PyTorch (https://pytorch.org/), using a single NVIDIA RTX 3060 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA).
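The training configuration above can be reproduced roughly as follows; the initial learning rate and the exponential decay factor are placeholder assumptions, since they are not reported in the text, and FeaturePointRegressor refers to the sketch given in Section 3.2.

```python
import torch

model = FeaturePointRegressor(num_points=33 * 33)  # (32 + 1) x (32 + 1) feature points
# Initial learning rate and decay factor are assumed values, not taken from the paper.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.97)

batch_size, epochs = 4, 400
sigma, gamma_weight = 1.0, 1e-4   # sigma and gamma of Eq. (1)
grid_rows, grid_cols = 16, 16     # 16 x 16 meshes
```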

4.2. A Comparative Analysis of Homography Estimation Techniques

The network warping results are compared with those obtained using SIFT [5] + RANSAC [24] (the pipeline of AutoStitch [1]), SPW [35], AANAP [17], APAP [10], LCP [36], DAMGDH [23], and UDIS [22]. Image alignment performance is evaluated using the PSNR and SSIM metrics on the real-world UDIS-D dataset, with all 1106 samples of the test set used for evaluation; higher values indicate better alignment. The results of the comparison are presented in Table 1. The samples are grouped into three levels by metric value: "Easy" is the average over the images with the best stitching performance, "Moderate" the average over images with medium stitching performance, and "Hard" the average over images with poor stitching performance, while "Average" is the average over all images. The unwarped image pair ($I_{3\times3}$, i.e., the identity homography) is included as a reference. For some image pairs, the number of matching feature points was insufficient for the traditional methods to compute an alignment; in such cases, the original image is used in the metric calculation instead. The experimental results show that, on the average PSNR and SSIM metrics, our method outperforms the other methods, indicating that its robustness is superior. The parameters used in the traditional APAP algorithm and in our method are identical, which suggests that warping images with deep learning-predicted feature points is more effective than warping based on traditionally extracted features.
As illustrated in Figure 5, the blue channel in the reference image and the red channel in the deformed target image were set to zero in order to fuse the reference image and the target image. The non-overlapping region is represented by orange, while the unaligned part in the overlapping region is highlighted with different colors. To facilitate comparison, please zoom in on the red box position in the image. It is evident that the alignment accuracy of our method is superior to that of other methods.

4.3. Comparison Experiment of Spliced Images

4.3.1. A Comparative Analysis of Stitched Images from the UDIS-D Dataset

As illustrated in Figure 6, a comparison of our stitching results with those of other methods on the UDIS-D dataset reveals that the traditional methods SPW [35], LCP [36], and APAP [10] exhibit pronounced artifacts, while the deep learning method UDIS [22] still displays some artifacts after reconstruction. In contrast, our method produces superior stitching results in both weak-texture and large-parallax cases, demonstrating its better performance.

4.3.2. A Comparative Analysis of Spliced Images Derived from Conventional Datasets

A comparison is presented on the traditional datasets from ANAP [17], APAP [10], DHW [18], and LCP [36]. To obtain the stitched image for our network, the input images were cropped to squares and resized to a height and width of 512 pixels. To obtain a better stitched image and improved local alignment, the model was fine-tuned on the input images for 50 iterations. As illustrated in Figure 7, a comparison is presented with the APAP [10], LCP [36], SPW [35], and UDIS [22] methods, focusing on the overlapping region. The results demonstrate that the stitching quality of our network is comparable to, and in some cases superior to, that of the other methods.

4.4. A Comparison of Feature Point Matching in Low-Light Environments

As illustrated in Figure 8, the results of the network's predicted feature points are presented, with the red dots indicating the hypothesized grid feature points and the green dots indicating the locations of the predicted corresponding feature points. It should be noted that some feature points are located beyond the image space, specifically the green feature points situated within the black region. A comparison is made between the traditional feature point matching method SIFT + RANSAC and the proposed approach in low-light conditions. Figure 9a,b show a selection of matched feature points: the upper image shows the results of matching by the traditional method, while the lower image shows the results predicted by the proposed network. The traditional method is prone to matching errors in low light, and these erroneous matches are indicated by red circles in the figure. In contrast, our method predicts the feature points inside and outside the image boundary with greater accuracy.

4.5. Ablation Experiment

A series of ablation studies was conducted to assess the efficacy of different loss functions and the potential benefits of incorporating a multigrid deformation module.
A comparison was made between the stitched images generated with and without the shape-preserving loss. Figure 1b shows the stitched image without the shape-preserving loss: it is composed of many image blocks, and notable black cracks are evident between the blocks. Figure 1d shows the result after introducing the shape-preserving loss; the continuity between image blocks is well maintained. A series of ablation experiments was also conducted to assess the impact of the APAP multigrid deformation module for varying numbers of feature points. As the number of meshes and feature points increases, the video memory required for computation also increases linearly; the mesh was therefore divided into a 16 × 16 grid. In the absence of the multigrid deformation module, a single global matrix is used to align the feature points and warp the image. We use V1 to denote the case without the multigrid deformation module and V2 to denote the case with it. As shown in Table 2, the PSNR and SSIM metrics exhibit a notable improvement with the incorporation of the multigrid deformation module. Furthermore, the alignment of the overlapping image regions improves gradually as the number of feature points increases.

5. Discussion

This section examines the scalability of the proposed method in terms of alignment performance, processing time, space complexity, the handling of larger datasets, and the ability to accommodate different resolutions. The pseudo-code of our network model (Algorithm 1) is provided at the end of this section.
With regard to alignment performance, the method solves local homography matrices from the predicted feature points. While this does enhance grid-based image alignment, every grid relies on the same set of feature points, so there are significant constraints between the local homographies. Consequently, the image cannot be warped and aligned in a fully flexible manner. As Figure 1a shows, even when no grid-constraint loss is applied, the mesh is distorted only to a limited, natural degree; consequently, the image content can remain misaligned in the presence of significant parallax.
In the future, we intend to address this issue in two ways. One approach is to reduce the constraints between the grids, thereby improving their alignment with the image. Conversely, an alternative approach would be to ascertain whether there exists a superior methodology for partitioning the local distortion space of the image. This could entail the division of disparate distorted image blocks according to their respective objects, with the objective of addressing the issue of large parallax.
In the context of larger datasets, the processing time required is necessarily longer. As the training of the model and the acquisition of the spliced images are conducted as two separate processes, we evaluated both in terms of the time required for training and the time required for acquiring the spliced images with the trained model. Table 3 illustrates the time required for a single training batch with varying numbers of feature points and grids. It can be observed that the time necessary for a single training session increases gradually with the increase in the number of feature points and grids. Notably, the increase in time associated with an increase in the number of grids is more pronounced. Table 4 illustrates the time required to obtain a stitched image through chunked reverse warping using the trained model with varying numbers of grids. It can be observed that due to the implementation of warping for each image chunk, the number of warping operations increases with an increase in the number of grids, resulting in a prolonged processing time.
Accordingly, for larger datasets, a smaller number of grids and a reduced number of features may be selected, thereby enabling the stitching results to be obtained in a more expeditious manner.
In terms of space complexity, this study examines the impact of the number of feature points and the number of meshes on the space complexity of the multigrid warping parameterization module. Suppose the number of feature points is M and the number of grids is N. Computing the matrix $A$ requires $O(M)$ space. In the weight computation stage, the weights of each grid require $O(M)$ space, and the total space required for all N grids is $O(NM)$. The matrix $WA$ requires $O(NM)$ space. The largest of the three matrices generated by the singular value decomposition (SVD) has space complexity $O(NM^2)$. It is evident that the number of feature points has the more pronounced impact on the memory requirement. Furthermore, the influence of the number of grids on the space complexity of the chunked inverse warping module is examined. Suppose the image has dimensions $H \times W$ and there are N grids. The space complexity of chunked inverse warping is $O(NHW)$. Given that the dimensions of the stitched image are typically greater than those of the input image, the actual space requirement may be somewhat higher than this theoretical value.
Accordingly, in the multigrid warping parameterization module, reducing the number of feature points used to compute each mesh's distortion matrix can effectively reduce the memory footprint. In the future, a further optimization may be to solve the distortion matrix of each mesh independently using only a small number of feature points. For the chunked inverse warping module, the stitched image could be generated significantly faster if the warping matrix of each pixel could be determined directly at image-generation time; this could speed up image generation by up to a factor of N.
At present, our network only performs image stitching at a resolution of 512 × 512 pixels. In practical applications, however, it is often necessary to handle images of other resolutions. Because our network is based on feature point displacement prediction, we can scale the image to 512 × 512 pixels, predict the displacements of the image feature points, and then rescale these feature points back to the original resolution. Using the restored feature points, the distortion matrices can be computed at the original resolution, enabling image stitching at different resolutions.
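The resolution-handling idea described above amounts to a simple rescaling of the predicted points; a minimal sketch is given below, with illustrative names and a hypothetical predict_feature_points call standing in for the network.

```python
import numpy as np

def rescale_points(points_512, orig_w, orig_h):
    """Map feature points predicted at 512 x 512 back to the original resolution."""
    scale = np.array([orig_w / 512.0, orig_h / 512.0], dtype=np.float32)
    return points_512 * scale   # (N, 2) points scaled per axis

# Usage sketch: resize, predict, rescale, then solve the per-grid warp at full size.
# small = cv2.resize(img, (512, 512))
# pts_small = predict_feature_points(small)              # hypothetical network call
# pts_full = rescale_points(pts_small, img.shape[1], img.shape[0])
```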
Algorithm 1: Overall algorithmic process of the network
[Algorithm 1 figure]
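Because the original pseudo-code is only available as an image, the following Python-style sketch reconstructs the overall pipeline from the description in Section 3; it is not the authors' exact Algorithm 1, the helpers uniform_grid_points, grid_centers, partition, canvas_size, and fuse are hypothetical placeholders, and build_A, grid_weights, local_homographies, and chunked_inverse_warp refer to the sketches given earlier.

```python
def stitch(I_r, I_t, model, grid_rows=16, grid_cols=16):
    # 1. Predict feature-point displacements and form the matched point sets.
    x1 = uniform_grid_points(I_r, 33, 33)              # hypothetical helper: assumed X^1
    x2 = x1 + model(I_r, I_t)                          # X^2 = X^1 + predicted displacements
    # 2. Build the DLT matrix and per-grid weights, then solve every local homography.
    A = build_A(x1, x2)                                # Eq. (4)
    W = grid_weights(grid_centers(grid_rows, grid_cols), x1)   # Eq. (1)
    H = local_homographies(A, W)                       # Eqs. (6)-(7), via SVD
    # 3. Chunk the reference image and warp each block with its inverse homography.
    blocks = partition(I_r, grid_rows, grid_cols)      # hypothetical helper
    I_rt = chunked_inverse_warp(I_r, H, blocks, canvas_size(I_r, I_t, H))
    # 4. Fuse the warped reference with the target (average fusion or graph cut [26]).
    return fuse(I_rt, I_t)                             # hypothetical helper
```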

6. Conclusions

This article presents an innovative approach to image alignment: a novel method for predicting feature point displacements is proposed and combined with traditional feature point-based warping techniques to achieve more accurate image alignment. By assuming a uniform distribution of feature points and using a deep learning network to predict their displacements, we simulate the positions of one image's feature points in the viewpoint of the other image, achieving robust feature point extraction. Furthermore, we parameterize the APAP algorithm to realize deep learning-based APAP multigrid image stitching, thereby obtaining more accurate alignment results. Finally, we introduce a post-processing method of multigrid inverse warping together with a mesh shape-preserving loss to generate crack-free panoramic images.
The experimental results demonstrate that the proposed method achieves a significant improvement in the image alignment task, indicating its potential in real-world applications. The combination of deep learning and traditional algorithms enhances the accuracy of traditional image alignment methods, providing substantial support and guidance for solving image processing problems in practical applications.

Author Contributions

Conceptualization, Y.C.; methodology, Y.C.; validation, Y.C. and J.L.; formal analysis, A.M.; investigation, J.L.; resources, J.L.; data curation, J.L.; writing—original draft preparation, Y.C.; writing—review and editing, A.M.; visualization, Y.C.; supervision, J.L.; project administration, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data supporting the reported results are contained within the article. No new data were created or analyzed beyond what is presented in the article. For more information, please contact [email protected].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Brown, M.; Lowe, D.G. Automatic panoramic image stitching using invariant features. Int. J. Comput. Vision 2007, 74, 59–73. [Google Scholar] [CrossRef]
  2. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; pp. 1150–1167. [Google Scholar]
  3. Harris, C.; Stephens, M. A combined corner and edge detector. In Proceedings of the Alvey Vision Conference, Manchester, UK, 31 August–2 September 1988; pp. 147–152. [Google Scholar]
  4. Lucas, B.D.; Kanade, T. An iterative image registration technique with an application to stereo vision. In Proceedings of the IJCAI’81: 7th International Joint Conference on Artificial Intelligence, Vancouver, BC, Canada, 24–28 August 1981; pp. 674–679. [Google Scholar]
  5. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 2004, 60, 91–110. [Google Scholar] [CrossRef]
  6. Bay, H.; Tuytelaars, T.; Gool, L.V. Surf: Speeded up robust features. In Proceedings, Part I, Computer Vision—ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar]
  7. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  8. Barath, D.; Noskova, J.; Ivashechkin, M.; Matas, J. MAGSAC++: A fast, reliable and accurate robust estimator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1304–1312. [Google Scholar]
  9. Barath, D.; Matas, J.; Noskova, J. MAGSAC: Marginalizing sample consensus. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10197–10205. [Google Scholar]
  10. Zaragoza, J.; Chin, T.J.; Brown, M.S.; Suter, D. As-projective-as-possible image stitching with moving DLT. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2339–2346. [Google Scholar]
  11. Lin, W.Y.; Liu, S.; Matsushita, Y.; Ng, T.T.; Cheong, L.F. Smoothly varying affine stitching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011; pp. 345–352. [Google Scholar]
  12. Li, J.; Wang, Z.; Lai, S.; Zhai, Y.; Zhang, M. Parallax-tolerant image stitching based on robust elastic warping. IEEE Trans. Multimed. 2017, 20, 1672–1687. [Google Scholar] [CrossRef]
  13. Bookstein, F.L.; Green, W.D.K. A thin-plate spline and the decomposition of deformations. Math. Methods Med. Imaging 1993, 2, 14–28. [Google Scholar]
  14. Li, J.; Deng, B.; Tang, R.; Wang, Z.; Yan, Y. Local-adaptive image alignment based on triangular facet approximation. IEEE Trans. Image Process. 2019, 29, 2356–2369. [Google Scholar] [CrossRef] [PubMed]
  15. Chang, C.H.; Sato, Y.; Chuang, Y.Y. Shape-preserving half-projective warps for image stitching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 3254–3261. [Google Scholar]
  16. Chen, Y.S.; Chuang, Y.Y. Natural image stitching with the global similarity prior. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 186–201. [Google Scholar]
  17. Lin, C.C.; Pankanti, S.U.; Natesan Ramamurthy, K.; Aravkin, A.Y. Adaptive as-natural-as-possible image stitching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1155–1163. [Google Scholar]
  18. Gao, J.; Kim, S.J.; Brown, M.S. Constructing image panoramas using dual-homography warping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011; pp. 49–56. [Google Scholar]
  19. Nguyen, T.; Chen, S.W.; Shivakumar, S.S.; Taylor, C.J.; Kumar, V. Unsupervised deep homography: A fast and robust homography estimation model. IEEE Robot. Autom. Lett. 2018, 3, 2346–2353. [Google Scholar] [CrossRef]
  20. Zhang, J.; Wang, C.; Liu, S.; Jia, L.; Ye, N.; Wang, J.; Zhou, J.; Sun, J. Content-aware unsupervised deep homography estimation. In Proceedings, Part I, Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 653–669. [Google Scholar]
  21. Nie, L.; Lin, C.; Liao, K.; Zhao, Y. Learning edge-preserved image stitching from large-baseline deep homography. arXiv 2020, arXiv:2012.06194. [Google Scholar]
  22. Nie, L.; Lin, C.; Liao, K.; Liu, S.; Zhao, Y. Unsupervised deep image stitching: Reconstructing stitched features to images. IEEE Trans. Image Process. 2021, 30, 6184–6197. [Google Scholar] [CrossRef] [PubMed]
  23. Nie, L.; Lin, C.; Liao, K.; Liu, S.; Zhao, Y. Depth-aware multi-grid deep homography estimation with contextual correlation. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 4460–4472. [Google Scholar] [CrossRef]
  24. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  25. Lee, K.Y.; Sim, J.Y. Warping residual based image stitching for large parallax. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8198–8206. [Google Scholar]
  26. Kwatra, V.; Schödl, A.; Essa, I.; Turk, G.; Bobick, A. Graphcut textures: Image and video synthesis using graph cuts. ACM Trans. Graph. (TOG) 2003, 22, 277–286. [Google Scholar] [CrossRef]
  27. Agarwala, A.; Dontcheva, M.; Agrawala, M.; Drucker, S.; Colburn, A.; Curless, B.; Salesin, D.; Cohen, M. Interactive digital photomontage. In Proceedings of the SIGGRAPH04: Special Interest Group on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, 8–12 August 2004; pp. 294–302. [Google Scholar]
  28. Dai, Q.; Fang, F.; Li, J.; Zhang, G.; Zhou, A. Edge-guided composition network for image stitching. Pattern Recognit. 2021, 118, 108019. [Google Scholar] [CrossRef]
  29. Eden, A.; Uyttendaele, M.; Szeliski, R. Seamless image stitching of scenes with large motions and exposure differences. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; pp. 2498–2505. [Google Scholar]
  30. Li, N.; Liao, T.; Wang, C. Perception-based seam cutting for image stitching. Signal Image Video Process. 2018, 12, 967–974. [Google Scholar] [CrossRef]
  31. Nie, L.; Lin, C.; Liao, K.; Liu, S.; Zhao, Y. Parallax-tolerant unsupervised deep image stitching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 7399–7408. [Google Scholar]
  32. Ouyang, N.; Zhai, Z.; Shou, Z.; Zhang, T.; Yuan, H. Image Stitching of Multi-band Blending Based on Graph Cut. Microelectron. Comput. 2013, 30, 107–110. [Google Scholar]
  33. DeTone, D.; Malisiewicz, T.; Rabinovich, A. Deep image homography estimation. arXiv 2016, arXiv:1606.03798. [Google Scholar]
  34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  35. Liao, T.; Li, N. Single-perspective warps in natural image stitching. IEEE Trans. Image Process. 2019, 29, 724–735. [Google Scholar] [CrossRef] [PubMed]
  36. Jia, Q.; Li, Z.; Fan, X.; Zhao, H.; Teng, S.; Ye, X.; Latecki, L.J. Leveraging line-point consistence to preserve structures for wide parallax image stitching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 12186–12195. [Google Scholar]
Figure 1. The stitched image (b) obtained without the shape-preserving loss and the corresponding grid image (a), together with the stitched image (d) obtained with the shape-preserving loss and the corresponding grid image (c).
Figure 2. Unsupervised multigrid deep homography estimation network framework.
Figure 3. Flowchart of image chunking and inverse warping.
Figure 4. (a) Grid partitioning of the distorted image. (b) The corresponding grid shapes in the target image space, where pixels within the grids have a defined mapping relationship, while the mapping of pixels outside the grids cannot be determined.
Figure 5. Qualitative comparison with other image warping methods on the UDIS-D dataset. For each instance, a red box marks the enlarged region used to compare alignment performance.
Figure 6. Comparison of stitched images on the UDIS-D dataset; areas with severe artifacts are marked with red arrows.
Figure 7. A comparative analysis of the stitching results obtained with our method and with other methods on the traditional datasets from ANAP, APAP, DHW, and LCP.
Figure 8. Feature points predicted by the network.
Figure 9. Comparison, in two low-light environments, between our network-predicted feature points and the traditional SIFT + RANSAC feature point matching method. For display, the conventional approach randomly shows one-tenth of its matched feature points, whereas our method randomly shows one-eightieth of its matched feature points. (a,b) illustrate the results of the two methods, with the red lines in the top images representing the traditional method and the bottom images representing the network prediction.
Table 1. Comparison of PSNR and SSIM metrics for various distortion methods on the UDIS-D dataset. The best ones are marked in red, and the second best ones are marked in blue.
| Method | PSNR (Easy) | PSNR (Moderate) | PSNR (Hard) | PSNR (Average) | SSIM (Easy) | SSIM (Moderate) | SSIM (Hard) | SSIM (Average) |
|---|---|---|---|---|---|---|---|---|
| $I_{3\times3}$ | 15.87 | 12.76 | 10.68 | 12.86 | 0.530 | 0.286 | 0.146 | 0.303 |
| AANAP [17] | 27.58 | 23.85 | 19.70 | 23.31 | 0.896 | 0.823 | 0.659 | 0.779 |
| APAP [10] | 28.36 | 24.40 | 18.39 | 23.17 | 0.912 | 0.837 | 0.642 | 0.781 |
| SPW [35] | 26.98 | 22.67 | 16.77 | 21.60 | 0.880 | 0.758 | 0.490 | 0.687 |
| LCP [36] | 26.94 | 22.63 | 19.31 | 22.59 | 0.878 | 0.764 | 0.610 | 0.736 |
| UDIS [22] | 27.84 | 23.95 | 20.70 | 23.80 | 0.902 | 0.830 | 0.684 | 0.793 |
| SIFT [5] + RANSAC [24] | 28.75 | 24.08 | 18.55 | 23.27 | 0.916 | 0.833 | 0.636 | 0.779 |
| DAMGDH [23] | 29.52 | 25.24 | 21.20 | 24.89 | 0.923 | 0.860 | 0.708 | 0.817 |
| Ours | 29.56 | 25.29 | 21.20 | 24.92 | 0.947 | 0.869 | 0.698 | 0.824 |
Table 2. A comparison of the average PSNR and SSIM metrics under the UDIS dataset in the presence and absence of the multigrid deformation module for different numbers of feature points is presented.
| Number of Feature Points | PSNR (V1) | PSNR (V2) | SSIM (V1) | SSIM (V2) |
|---|---|---|---|---|
| 8 × 8 | 23.69 | 24.83 | 0.767 | 0.818 |
| 16 × 16 | 23.74 | 24.85 | 0.769 | 0.822 |
| 32 × 32 | 23.80 | 24.92 | 0.774 | 0.824 |
Table 3. The time in seconds required to complete a training batch with varying numbers of feature points and grids. The testing was conducted using a NVIDIA RTX 3060 Ti graphics processing unit (GPU).
| Number of Grids | 8 × 8 Feature Points | 16 × 16 Feature Points | 32 × 32 Feature Points |
|---|---|---|---|
| 1 × 1 | 0.057 | 0.057 | 0.063 |
| 2 × 2 | 0.060 | 0.061 | 0.069 |
| 4 × 4 | 0.067 | 0.069 | 0.076 |
| 8 × 8 | 0.101 | 0.106 | 0.128 |
| 16 × 16 | 0.210 | 0.235 | 0.261 |
Table 4. The table illustrates the time (in seconds) necessary to obtain the stitched image by chunking the reverse distortion for varying numbers of meshes. The testing was conducted using a NVIDIA RTX 3060 Ti graphics processing unit (GPU).
| Number of Grids | 1 × 1 | 2 × 2 | 4 × 4 | 8 × 8 | 16 × 16 |
|---|---|---|---|---|---|
| Running time (s) | 0.079 | 0.094 | 0.162 | 0.419 | 1.214 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, J.; Chen, Y.; Mu, A. Research on Unsupervised Feature Point Prediction Algorithm for Multigrid Image Stitching. Symmetry 2024, 16, 1064. https://doi.org/10.3390/sym16081064

AMA Style

Li J, Chen Y, Mu A. Research on Unsupervised Feature Point Prediction Algorithm for Multigrid Image Stitching. Symmetry. 2024; 16(8):1064. https://doi.org/10.3390/sym16081064

Chicago/Turabian Style

Li, Jun, Yufeng Chen, and Aiming Mu. 2024. "Research on Unsupervised Feature Point Prediction Algorithm for Multigrid Image Stitching" Symmetry 16, no. 8: 1064. https://doi.org/10.3390/sym16081064
