Article

Image Registration Algorithm for Stamping Process Monitoring Based on Improved Unsupervised Homography Estimation

School of Electrical and Control Engineering, Shaanxi University of Science and Technology, Xi’an 710021, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7721; https://doi.org/10.3390/app14177721
Submission received: 12 July 2024 / Revised: 14 August 2024 / Accepted: 19 August 2024 / Published: 2 September 2024


Featured Application

This paper is applied in the industrial monitoring of the stamping process, where vibrations are common, necessitating the use of image registration and homography estimation methods to align template images with test images. By employing machine-vision and image-processing techniques, the process is monitored in real-time to detect any anomalies, ultimately aiming to protect the stamping molds.

Abstract

Homography estimation is a crucial task in aligning template images with target images in stamping monitoring systems. To enhance the robustness and accuracy of homography estimation against random vibrations and lighting variations in stamping environments, this paper proposes an improved unsupervised homography estimation model. The model takes as input the channel-stacked template and target images and outputs the estimated homography matrix. First, a specialized deformable convolution module and Group Normalization (GN) layer are introduced to expand the receptive field and enhance the model’s ability to learn rotational invariance when processing large, high-resolution images. Next, a multi-scale, multi-stage unsupervised homography estimation network structure is constructed to improve the accuracy of homography estimation by refining the estimation through multiple stages, thereby enhancing the model’s resistance to scale variations. Finally, stamping monitoring image data is incorporated into the training through data fusion, with data augmentation techniques applied to randomly introduce various levels of perturbation, brightness, contrast, and filtering to improve the model’s robustness to complex changes in the stamping environment, making it more suitable for monitoring applications in this specific industrial context. Compared to traditional methods, this approach provides better homography matrix estimation when handling images with low texture, significant lighting variations, or large viewpoint changes. Compared to other deep-learning-based homography estimation methods, it reduces estimation errors and performs better on stamping monitoring images, while also offering broader applicability.

1. Introduction

With the advancement of Industry 4.0, machine-vision technology has replaced traditional manual inspection methods, enabling real-time monitoring of the stamping process to protect molds [1]. By capturing images of the metal sheet before and after the mold closes, the process is analyzed for any abnormalities. If an anomaly is detected, an emergency stop signal is issued to halt the stamping equipment immediately, thereby protecting the molds, reducing economic losses, and enhancing production efficiency. However, the actual stamping process may be affected by factors such as the impact force of the mold opening and closing or unstable feed rates from the feeder, causing random vibrations in the metal sheets. This results in random offsets in the images captured each time, necessitating image registration of the template and real-time detection images. The difference between these two images is then used to determine if there are any abnormalities in the process. Therefore, image registration algorithms are indispensable in the application of monitoring the stamping process.
Image registration refers to the process of transforming different images of the same object from different spatial coordinate systems to the same coordinate system through a certain mapping relationship. In this process, homography estimation plays an important role. Traditional homography estimation methods are divided into direct methods and feature-based matching methods [2]. Direct methods like the Lucas–Kanade algorithm initialize homography parameters and use gradient descent to minimize errors [3,4]. Other methods include optical flow and image gradient-based techniques [5,6], which are sensitive to lighting changes and require good initial values to prevent local optima. The optimization is computationally intensive, complicating their use in high-resolution and complex scenes. The second method involves feature-based matching, which starts with feature detection algorithms like SIFT, SURF, and ORB to extract key points from images [7,8,9]. Each detected key point is assigned a descriptor that characterizes its surrounding image area. These feature descriptors are then compared between two images using matchers like FLANN or Brute-Force to find matching pairs [10]. Finally, robust estimation methods such as RANSAC or LMedS are used to compute the optimal homography matrix [11,12]. While these methods generally perform faster than direct methods, they heavily rely on the detection and matching of key points. In areas with low texture, sufficient key points may not be detectable, and significant changes in perspective or lighting between images can lead to incorrect key point matches, directly impacting the accuracy of homography estimation.
The stamping surveillance environment is influenced by external factors like lighting changes and equipment instability, as well as human factors such as camera angle adjustments. These factors can significantly alter the comparison between real-time detection and template images, impacting the accuracy of feature-point detection and homography estimation. Leveraging the success of deep convolutional neural networks in computer vision, deep-learning-based homography estimation methods can effectively tackle these robustness issues and the reliance on feature points seen in traditional approaches [13]. Detone et al. introduced HomographyNet, a simple convolutional neural network trained end-to-end to directly regress the parameters of the homography matrix [14]. It outperforms traditional methods in complex scenarios such as low-texture environments, but the model is trained on synthetic datasets, leading to poor inference capabilities on real images. Real-world image labeling is also highly costly. Nguyen et al. [15] were the first to propose an unsupervised homography estimation method. Building on a network structure similar to previous approaches, they enhanced the robustness of homography estimation to lighting variations by learning the homography relationship through minimizing the pixel intensity difference between two images. This method eliminates the need for labeled data. However, it performs poorly on images with significant viewpoint differences and lacks scale and rotation invariance. In recent years, there has been increasing research on unsupervised homography estimation [16,17,18,19,20]. Unsupervised learning methods eliminate the complex process of labeling training data, reducing the impact of human bias. These methods are better suited to adapt to complex and changing environments and demonstrate improved generalization ability when handling variations in real-world conditions, such as different lighting conditions, motion blur, shifts, and rotations. Zhang J. et al. [16] proposed an unsupervised learning model that learns feature representations for homography estimation and image comparison. This model calculates loss in the feature space rather than in the pixel space, achieving robustness to varying lighting conditions. Koguciuk et al. [17] extended this work by adding perceptual loss computation [21], further enhancing the robustness of deep homography estimation methods to lighting changes. However, neither of these newer unsupervised homography estimation methods is applicable to scenes with small image overlapping regions and low similarity. Liu S. et al. [18] proposed a new unsupervised learning framework that improves homography estimation by learning motion bases. This approach can estimate global homography transformations and handle local homography transformations at a finer granularity, thus improving the accuracy of homography estimation. However, due to the complex feature extraction and learning processes involved, this method requires significant computational resources and time, making it unsuitable for industrial applications.
To address the issues mentioned above, this paper proposes an improved unsupervised homography estimation model based on existing models, tailored for the specific industrial context of stamping process monitoring. By incorporating a special deformation convolution module and a GN normalization layer into the backbone network, the model enhances its rotational invariance when handling large-scale, high-resolution images. A multi-scale, multi-stage network architecture is developed, utilizing unsupervised training through minimizing pixel-level photometric loss at each stage, which progressively refines the homography transformations between images from coarse to fine, improving the accuracy of homography estimation and enhancing the model’s learning ability for scale invariance. To make the model suitable for the specific industrial application of the stamping process and improve its robustness against complex environmental changes such as lighting variations and vibration disturbances, the training dataset integrates stamping surveillance image data with random perturbations, lighting, contrast, and blur for data augmentation, addressing the issue of dataset scarcity in specific industrial applications of deep learning.
The remainder of this paper is organized as follows. Section 2 first describes the network structure of the unsupervised homography estimation model, then improves on it by introducing a special deformable convolution module and a GN normalization layer, and constructs the overall multi-scale, multi-stage network structure, establishing the coupling relations between stages and minimizing the photometric loss function. Comparison and ablation experiments are carried out in Section 3 to analyze the effectiveness of the proposed method, and conclusions are summarized in Section 4.

2. Materials and Methods

2.1. Unsupervised Homography Estimation Model

The unsupervised homography estimation method proposed by Nguyen et al. [15] consists of four main parts: a four-point parameterization homography estimation layer, a tensor-based direct linear transformation layer, a spatial transformation layer, and a loss function computation layer. The overall network structure of this method is illustrated in Figure 1.

2.1.1. Four-Point Parameterization Homography Estimation Layer

This layer is the HomographyNet network model, which serves as the backbone network. It is a regression network similar to the VGG architecture. As shown in Figure 2, the network model includes 8 convolutional layers, 4 max pooling layers, 1 dropout layer, and 2 fully connected layers. The model input consists of two stacked grayscale images. After processing through the convolutional and fully connected layers, the final output is the 4-point parameterized homography matrix $H_{4\text{points}}$, which is equivalent to the 3 × 3 homography matrix $H$.
The essence of planar homography estimation is to compute the linear mapping between pixels of two images to provide a non-singular linear relationship between points in the two images. The homography matrix has 8 degrees of freedom, so theoretically, only 4 pairs of matching points are needed to estimate the result. The HomographyNet network model discards the traditional 3 × 3 homography matrix’s rotation, translation, scaling, and other component errors that are difficult to reflect in the loss function, which can affect the training of deep neural network models. Instead, it uses a 4-point parameterized form to represent the homography matrix. The principle is as follows:
Assuming points $x = (u, v)^T$ and $x' = (u', v')^T$ are a pair of 2D coordinates related by the transformation, represented in homogeneous coordinates as $x = (u, v, 1)^T$ and $x' = (u', v', 1)^T$, the mapping between the two points through the 3 × 3 homography matrix $H$ is expressed as:

$$\begin{bmatrix} u' \\ v' \\ 1 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$

As long as there are four pairs of matching points $(u_i, v_i) \leftrightarrow (u'_i, v'_i)$, the unique solution for $H$ can be calculated. Equivalently, four pairs of offsets $(\Delta u_i, \Delta v_i)$, where $\Delta u_i = u'_i - u_i$ and $\Delta v_i = v'_i - v_i$ are the horizontal and vertical offsets of the matching points, also determine $H$ uniquely. Therefore, the 4-point parameterized matrix $H_{4\text{points}}$ is used in place of the 3 × 3 homography matrix $H$:

$$H_{4\text{points}} = \begin{bmatrix} \Delta u_1 & \Delta v_1 \\ \Delta u_2 & \Delta v_2 \\ \Delta u_3 & \Delta v_3 \\ \Delta u_4 & \Delta v_4 \end{bmatrix} \sim H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}$$

The mapping between the two points $x = (u, v, 1)^T$ and $x' = (u', v', 1)^T$ is then expressed through $H_{4\text{points}}$ as:

$$x' = H_{4\text{points}}(x)$$
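As a concrete illustration of this equivalence (our own sketch, not the authors' code; the helper name is hypothetical), the snippet below converts a set of random corner offsets, i.e., $H_{4\text{points}}$, into the equivalent 3 × 3 homography with OpenCV:

```python
import numpy as np
import cv2

def four_point_to_homography(corners_a: np.ndarray, offsets: np.ndarray) -> np.ndarray:
    """Convert the 4-point parameterization (corner offsets) to a 3x3 homography.

    corners_a: (4, 2) corner coordinates (u_i, v_i) in image A.
    offsets:   (4, 2) offsets (du_i, dv_i), i.e., H_4points.
    """
    corners_b = corners_a + offsets  # perturbed corners (u'_i, v'_i)
    # Four point correspondences fully determine the 8-DoF homography.
    return cv2.getPerspectiveTransform(corners_a.astype(np.float32),
                                       corners_b.astype(np.float32))

# Example: a 128x128 patch with random corner perturbations in [-32, 32].
corners = np.array([[0, 0], [127, 0], [127, 127], [0, 127]], dtype=np.float32)
h4p = np.random.uniform(-32, 32, size=(4, 2)).astype(np.float32)
H = four_point_to_homography(corners, h4p)
print(H)  # 3x3 matrix equivalent to the 4-point parameterization
```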

2.1.2. Tensor Direct Linear Transformation Layer

This layer essentially applies the Direct Linear Transformation (DLT) algorithm to the tensors in the deep-learning network. Its function is to convert the 4-point parameterized matrix $H_{4\text{points}}$ obtained in the previous stage into the 3 × 3 homography matrix $H$ using the DLT algorithm. Additionally, it is differentiable so that backpropagation can occur during training. The input of this layer is the coordinates of the four corner points before and after the transformation, and the output is the 3 × 3 homography matrix $H$. Suppose $H$ is the homography matrix relating the corresponding corner points $x_i \leftrightarrow x'_i$. According to the homography relationship, $x'_i \sim H x_i$, which is equivalent to $x'_i \times H x_i = 0$. Let $h^{jT}$ denote the j-th row of $H$, and let $h^j$ denote the corresponding column vector. Then $H x_i$ can be expressed as:

$$H x_i = \begin{bmatrix} h^{1T} x_i \\ h^{2T} x_i \\ h^{3T} x_i \end{bmatrix} = \begin{bmatrix} x_i^T h^1 \\ x_i^T h^2 \\ x_i^T h^3 \end{bmatrix}$$

Substituting $x'_i = (u'_i, v'_i, 1)^T$ into $x'_i \times H x_i = 0$, we have:

$$\begin{bmatrix} u'_i \\ v'_i \\ 1 \end{bmatrix} \times \begin{bmatrix} x_i^T h^1 \\ x_i^T h^2 \\ x_i^T h^3 \end{bmatrix} = \begin{bmatrix} v'_i x_i^T h^3 - x_i^T h^2 \\ x_i^T h^1 - u'_i x_i^T h^3 \\ u'_i x_i^T h^2 - v'_i x_i^T h^1 \end{bmatrix} = 0$$

This can be arranged as:

$$\begin{bmatrix} 0_{3 \times 1}^T & -x_i^T & v'_i x_i^T \\ x_i^T & 0_{3 \times 1}^T & -u'_i x_i^T \\ -v'_i x_i^T & u'_i x_i^T & 0_{3 \times 1}^T \end{bmatrix} \begin{bmatrix} h^1 \\ h^2 \\ h^3 \end{bmatrix} = 0$$

Therefore, given the position changes of the four corner points, substituting them into the above equation allows the system to be solved directly for $h^j\ (j = 1, 2, 3)$, thereby obtaining the 3 × 3 homography matrix.
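A minimal NumPy sketch of this DLT step is given below for illustration; the paper's layer performs the same computation with differentiable tensor operations inside the network, whereas here the linear system is simply solved with an SVD:

```python
import numpy as np

def dlt_homography(pts_a: np.ndarray, pts_b: np.ndarray) -> np.ndarray:
    """Solve for the 3x3 homography H mapping pts_a -> pts_b (4 correspondences).

    Builds the 8x9 system A h = 0 from x'_i x (H x_i) = 0 (two independent rows
    per correspondence) and takes the null-space vector via SVD.
    """
    A = []
    for (u, v), (up, vp) in zip(pts_a, pts_b):
        x = [u, v, 1.0]
        A.append([0, 0, 0] + [-xi for xi in x] + [vp * xi for xi in x])
        A.append(x + [0, 0, 0] + [-up * xi for xi in x])
    A = np.asarray(A)
    _, _, vt = np.linalg.svd(A)
    h = vt[-1]              # right singular vector of the smallest singular value
    H = h.reshape(3, 3)
    return H / H[2, 2]      # normalize so that h33 = 1
```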

2.1.3. Spatial Transformation Layer

This layer applies the 3 × 3 homography matrix, H , obtained in the previous stage, to each pixel coordinate, x i , of the target image, I A , resulting in the corresponding coordinates, H ( x i ) , after the homographic transformation. These transformed coordinates are necessary for the unsupervised method to compute the pixel intensity loss function between the two images. This layer comprises three parts: the normalization inverse computation of homography estimation, the parametric sampling grid generator, and the differential image sampling. Additionally, it must be differentiable to allow the error gradient to backpropagate for network training.
  • Normalized Inverse Computation of Homography Estimation
The purpose of the spatial transformation layer is to align the target image with the reference image using the homography matrix. Let I A and I B be a pair of images with a homography relationship, where I A is the target image to be transformed and I B is the reference image. First, the pixel coordinates of input images I A and I B are normalized to the range [−1, 1]. Then, the inverse homography matrix, H inv , is calculated.
$$H_{inv} = M^{-1} H^{-1} M$$

where $M = \begin{bmatrix} W/2 & 0 & W/2 \\ 0 & H/2 & H/2 \\ 0 & 0 & 1 \end{bmatrix}$ and $W$ and $H$ are the width and height of image $I_B$.
  • Parametric Sampling Grid Generator
Create a grid, G = { G i } , of the same size as I B for sampling, where each grid element, G i = ( u i , v i ) , corresponds to the pixel coordinates of image I B . Apply the previously computed normalized inverse homography matrix, H inv , to the grid, G , to obtain the pixel grid of predicted image I A .
$$\begin{bmatrix} u'_i \\ v'_i \\ 1 \end{bmatrix} = H_{inv}(G_i) = H_{inv} \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix}$$
  • Differential Image Sampling
Based on the sampled points $H_{inv}(G_i)$ obtained in the previous step, a transformed image $V$ with height $h$, width $w$, and $C$ channels is generated, where $V(x_i) = I_A(H(x_i))$. The image $V$ is defined as:

$$V_i^c = \sum_n^{h} \sum_m^{w} I_{nm}^c\, k(u_i - m;\, \Phi_u)\, k(v_i - n;\, \Phi_v), \quad \forall i \in [1, \dots, hw],\ \forall c \in [1, \dots, C]$$

where $h$ and $w$ are the height and width of input image $I_A$, $k(\cdot)$ denotes the sampling kernel, $\Phi_u$ and $\Phi_v$ are parameters of the sampling kernel, $I_{nm}^c$ is the pixel value of the input image at channel $c$ and coordinates $(n, m)$, and $V_i^c$ is the pixel value of the output image at channel $c$ and coordinates $(u_i, v_i)$. Because the transformed coordinates are generally non-integer, bilinear interpolation is employed to sample values at integer pixel locations. Additionally, to ensure that the loss can be backpropagated during network training, the sampling must be differentiable. The gradient of the bilinearly interpolated image is therefore defined as:

$$\frac{\partial V_i^c}{\partial I_{nm}^c} = \sum_n^{h} \sum_m^{w} \max\left(0, 1 - |u_i - m|\right) \max\left(0, 1 - |v_i - n|\right)$$
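The following PyTorch sketch (an illustration under our own assumptions, not the authors' implementation; function names are ours) combines the normalized inverse homography, the sampling grid, and differentiable bilinear sampling in one step using `F.grid_sample`:

```python
import torch
import torch.nn.functional as F

def warp_by_homography(img: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """Differentiably warp img (N, C, h, w) with homography H (N, 3, 3).

    H is assumed to act on pixel coordinates; it is first conjugated into the
    normalized [-1, 1] coordinates expected by grid_sample (the H_inv = M^-1 H^-1 M
    step above), then used to build the sampling grid.
    """
    n, _, h, w = img.shape
    # M maps normalized coordinates to pixel coordinates (W/2, H/2 scaling as in the text).
    M = torch.tensor([[w / 2, 0, w / 2],
                      [0, h / 2, h / 2],
                      [0, 0, 1.0]], device=img.device).expand(n, 3, 3)
    H_inv = torch.linalg.inv(M) @ torch.linalg.inv(H) @ M

    # Regular grid over the reference image in normalized coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=img.device),
                            torch.linspace(-1, 1, w, device=img.device), indexing="ij")
    ones = torch.ones_like(xs)
    grid = torch.stack([xs, ys, ones], dim=-1).reshape(1, -1, 3).expand(n, -1, 3)

    # Apply H_inv and convert back from homogeneous coordinates.
    warped = (H_inv @ grid.transpose(1, 2)).transpose(1, 2)
    warped = warped[..., :2] / warped[..., 2:].clamp(min=1e-8)
    grid = warped.reshape(n, h, w, 2)

    # Bilinear sampling, differentiable w.r.t. both img and H.
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)
```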

2.1.4. Loss Calculation Layer

The spatial transformation layer warps the initial image patch $P_A$ using the estimated 3 × 3 homography matrix obtained earlier, yielding the predicted patch $\hat{P}_B$. The photometric loss $L_{PW}$ is then computed pixel-wise between the warped image and the reference image $I_B$ and backpropagated to HomographyNet, under the assumption that the network's output, the 4-point parameterized homography estimate $H_{4\text{points}}$, should minimize this photometric loss. The formula for $L_{PW}$ is:

$$L_{PW} = \frac{1}{|x_i|} \sum_{x_i} \left| I_A\left(H(x_i)\right) - I_B(x_i) \right|$$
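Assuming a warp such as the one sketched above, the pixel-wise photometric loss reduces to a mean absolute difference; a minimal sketch:

```python
import torch

def photometric_loss(warped_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
    """L_PW = (1/|x_i|) * sum |I_A(H(x_i)) - I_B(x_i)|.

    warped_a: image A warped into B's frame (e.g., by warp_by_homography above).
    img_b:    reference image B, same shape.
    """
    return torch.mean(torch.abs(warped_a - img_b))
```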

2.2. Improved Unsupervised Homography Estimation Model

During the stamping process, the opening and closing of the mold causes random vibrations to the metal sheet, and the complexity of the on-site environment also changes. These factors lead to random shifts in the sheet in the real-time collected images. Therefore, when estimating the homography matrix for image registration, the model is required to have scale and rotation invariance comparable to traditional methods, as well as robustness to lighting changes and noise. The unsupervised homography estimation method mentioned earlier has shown improved performance under lighting changes compared to the supervised method in [14], but it still lacks resistance to deformations and position changes in real environments, making it unsuitable for the scenarios discussed in this paper. Additionally, current supervised and unsupervised homography estimation models are trained on synthetic datasets, which are far removed from real-world conditions.
To address the aforementioned issues, this paper proposes an improved unsupervised homography estimation model that introduces a special deformable convolution module. This module expands the receptive field while enhancing the model’s ability to learn rotational invariance during homography estimation. A GN normalization layer is added after each convolution to improve the model’s stability when training with large-resolution images in small batches. A multi-scale image pyramid is constructed from the input image, allowing for homography estimation at different scales and improving the model’s ability to learn scale invariance. Additionally, a real dataset of stamping process images collected with an industrial camera is created. The dataset is augmented by randomly adding various disturbances, lighting conditions, contrast, and blur levels to enhance the model’s ability to estimate homography for sheet metal shifts in a vibrating stamping environment.

2.2.1. Improved HomographyNet Model for Enhanced Rotational Invariance

The task of unsupervised homography estimation is a pixel-level task, where pixels outside the receptive field do not affect the network output. To ensure that important information is not ignored during decision-making and to better handle images with significant perspective changes, it is generally considered that a larger receptive field is better. Deepening the network can increase the receptive field, but when the network depth reaches a certain level, continuing to add convolutional layers increases computational load and risks network degradation [22,23]. Pooling operations can also increase the receptive field but at the cost of reduced spatial resolution. Dilated convolution is another method to increase the receptive field, as it can expand the receptive field without losing spatial resolution.
This paper introduces a special deformable convolution module [24], which, like dilated convolution, can expand the receptive field without reducing resolution. Additionally, it enhances the network’s ability to extract deformations, thereby improving rotational invariance in homography estimation. As shown in Figure 3, this module differs from regular convolution and dilated convolution, demonstrating its ability to learn rotational characteristics. Introducing the special deformable convolution too early can result in the loss of local information, while introducing it too late makes it difficult for the network to learn rotational invariance. This paper finds that the best approach is to introduce the special deformable convolution module to replace regular convolution right after the first downsampling, enhancing the backbone network’s ability to learn rotational invariance in homography estimation tasks.
In addition, the parameters of each layer in the convolutional neural network change continuously during training. These parameter changes alter the probability distribution of input data for the next layer, necessitating weight initialization and reduced learning rates. Therefore, this paper augments the original network by adding GN normalization layers after each convolutional layer. Unlike Batch Normalization (BN), which depends on batch size, GN normalizes by computing the mean and variance across channels for each sample group, reducing dependency on batch size. This allows the model to maintain stable training, even with smaller batches, making it suitable for training with irregular input sizes and high-resolution images, which are more applicable to the training scenarios in this paper. The improved network with the introduction of the special deformable convolution module and GN normalization is depicted in Figure 4.
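One possible form of such a block, sketched with torchvision's `DeformConv2d` and `GroupNorm` (an illustrative assumption; the exact module configuration follows Figure 4, which may differ), is:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformGNBlock(nn.Module):
    """Sketch of one modified backbone block: a deformable 3x3 convolution whose
    sampling offsets are predicted from the input, followed by Group Normalization
    and ReLU."""

    def __init__(self, in_ch: int, out_ch: int, gn_groups: int = 8):
        super().__init__()
        # 2 offsets (dx, dy) per position of the 3x3 kernel.
        self.offset_conv = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(gn_groups, out_ch)   # batch-size independent
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset_conv(x)          # learned sampling offsets
        x = self.deform_conv(x, offset)       # deformable convolution
        return self.act(self.norm(x))

# Example: applied right after the first downsampling on the stacked grayscale input.
block = DeformGNBlock(in_ch=64, out_ch=64)
y = block(torch.randn(2, 64, 256, 256))
```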

2.2.2. Unsupervised Multi-Stage Homography Estimation Model for Improved Scale Invariance

Previous studies have shown that using a multi-stage approach to progressively estimate homography can effectively handle large variations in viewpoint in input images [25,26,27]. Therefore, building on the improved HomographyNet introduced in Section 2.2.1, this paper proposes an unsupervised multi-scale, multi-stage homography estimation method. Initially, the input image pairs are processed into a three-level, down-sampled image pyramid. Subsequently, at different stages of the method, images of varying resolutions are used as inputs, starting with high-resolution images. Local homography transformations are estimated on the high-resolution inputs, while global homography transformations are estimated on the lower-resolution inputs. The coarse-to-fine integration refines the relationships between the homography transformations across progressively scaled images, thereby enhancing the homography estimation network’s ability to learn scale invariance.
The overall network architecture of the proposed method in this paper is illustrated in Figure 5. Template image I A and target image I B are inputted in a channel-stacked manner. Image patches of the same size are extracted at corresponding positions for downsampling, constructing a three-level image pyramid with varying resolutions. The resolution of each subsequent level in the pyramid is half of the previous one, with the resolutions of the three pyramid levels being 512 × 512, 256 × 256, and 128 × 128, respectively. These three pyramid levels serve as inputs for different stages of the network. The overall network model consists of three stages, each sharing the same network structure. Each stage includes the Improved HomographyNet layer, the Tensor DLT layer, and the Spatial Transformation layer. The latter two layers draw inspiration from the work described in [15], as detailed in Section 2.1.2 and Section 2.1.3, and therefore will not be elaborated on here.
In the first stage, image patches of size 512 × 512 are input into the homography estimation network comprising the Improved HomographyNet layer, the Tensor DLT layer, and the Spatial Transformation layer. This yields the homography matrix $H_1$, which transforms image $P_{B1}$ to $P_{A1}$, while the corresponding four-corner coordinate offsets $\Delta_1$ are retained. To minimize the overall training error and account for the coupling between stages, the input resolution of the second stage is half that of the first, so the coordinate offsets predicted in the first stage are scaled down by a factor of 2: $\Delta'_1 = \Delta_1 / 2$. Subsequently, $P_{A2}$ is transformed by the inverse homography transformation $H_1^S$, where $S$ is the scaling factor of 1/2, to obtain $\bar{P}_{A2}$. Then, $\bar{P}_{A2}$ and $P_{B2}$ are input into the second-stage homography estimation network, which follows the same process as the first stage and yields $H_2$ and $\Delta_2$. The third stage follows the same procedure to obtain $H_3$ and $\Delta_3$. The final coordinate offsets in the four-point parameterized form, corresponding to the overall homography, are:

$$\Delta = \left(\Delta_1 / 2 + \Delta_2\right) / 2 + \Delta_3$$
The homography estimation networks in the first to third stages estimate homography transformations at different scales based on input images of varying resolutions, which helps improve the network’s ability to learn scale invariance.
The unsupervised multi-scale homography estimation network proposed in this paper trains the network in an unsupervised manner by minimizing the pixel differences between images at each stage. The loss function includes three branches:
$$L = \alpha_1 \left\| W(P_{B1}, H_1) - P_{A1} \right\|_1 + \alpha_2 \left\| W(P_{B2}, H_2) - \bar{P}_{A2} \right\|_1 + \alpha_3 \left\| W(P_{B3}, H_3) - \bar{P}_{A3} \right\|_1$$

where $H_1$, $H_2$, and $H_3$ represent the homography matrices obtained in the three stages; $W(\cdot)$ denotes the spatial transformation that applies a homography matrix to an image; and $\alpha_1$, $\alpha_2$, and $\alpha_3$ are the corresponding weight coefficients.
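The coarse-to-fine coupling and the three-branch loss can be sketched as follows. Here `estimator` and `warp` stand in for one stage of the Improved HomographyNet + Tensor DLT pipeline and the spatial transformation layer (e.g., the `warp_by_homography` sketch above), respectively, and the pooling-based pyramid is our simplification rather than the authors' exact preprocessing:

```python
import torch
import torch.nn.functional as F

def scale_homography(H: torch.Tensor, s: float) -> torch.Tensor:
    """Re-express a homography so it acts on images whose coordinates are scaled by s."""
    S = torch.tensor([[s, 0, 0], [0, s, 0], [0, 0, 1.0]], device=H.device, dtype=H.dtype)
    return S @ H @ torch.linalg.inv(S)

def multi_stage_loss(patch_a, patch_b, estimator, warp, weights=(0.5, 0.3, 0.2)):
    """Three-stage, coarse-to-fine objective (a sketch of Figure 5, not the exact code).

    patch_a, patch_b: (N, 1, 512, 512) template/target patches.
    estimator: callable (pa, pb) -> (H, delta) for one stage (placeholder here).
    warp:      callable (img, H) -> warped img, e.g., a grid_sample-based warp.
    Returns the weighted L1 photometric loss and the composed offsets
    delta = (delta1 / 2 + delta2) / 2 + delta3.
    """
    pyr_a = [patch_a, F.avg_pool2d(patch_a, 2), F.avg_pool2d(patch_a, 4)]  # 512/256/128
    pyr_b = [patch_b, F.avg_pool2d(patch_b, 2), F.avg_pool2d(patch_b, 4)]

    loss, delta_total = 0.0, None
    pa = pyr_a[0]
    for stage in range(3):
        pb = pyr_b[stage]
        H, delta = estimator(pa, pb)
        loss = loss + weights[stage] * torch.mean(torch.abs(warp(pb, H) - pa))
        delta_total = delta if delta_total is None else delta_total / 2 + delta
        if stage < 2:
            # Pre-align the next (half-resolution) template level with the current
            # estimate so the following stage only refines the residual motion.
            pa = warp(pyr_a[stage + 1], scale_homography(H, 0.5))
    return loss, delta_total
```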

2.2.3. Data Augmentation Techniques Applied to Stamping Monitoring Images

There are no publicly available datasets containing real-world images for supervised and unsupervised homography estimation tasks. Typically, the COCO dataset is used to generate synthetic data for training. Neural network training requires large datasets, and one major challenge in applying deep learning in industry is the dataset issue. Therefore, this paper uses a portion of the MSCOCO2014 dataset as the original image library and integrates stamping process images captured by industrial cameras for data fusion training. This approach ensures the dataset’s scale while enhancing the model’s generalization ability and robustness in the specific industrial application of stamping monitoring.
Detone et al. [14] first proposed the S-COCO synthetic dataset for homography estimation, using PatchA and PatchB as inputs to the HomographyNet network model and the 4 corner offsets $(\Delta x_i, \Delta y_i)$ as supervision labels. Although it performs poorly for homography estimation on real-world images, it pioneered the task of homography estimation based on deep learning. Nguyen et al. [15] improved robustness to lighting variations by randomly injecting color, brightness, and gamma shifts when training unsupervised deep models. Koguciuk et al. [17] introduced the PDS-COCO dataset, which builds on the S-COCO dataset by using data augmentation techniques, such as randomly adjusting brightness, contrast, saturation, and noise in certain image regions, and randomly swapping image color channels with a 50% probability. This approach addresses the impact of lighting and noise on homography estimation.
So far, datasets used for homography estimation tasks are based on three-channel color images. Considering the specific industrial application of stamping monitoring, where the images captured by industrial cameras are single-channel grayscale images, the COCO dataset used in this paper is first converted to grayscale. Additionally, this paper adopts the data generation algorithm and data augmentation techniques from Detone et al. [14]. To expand the dataset for the specific industrial application scenario of stamping process monitoring, data augmentation is performed by randomly adding different perturbations, lighting variations, and degrees of blurring, simulating the uncertainties in actual stamping environments. This approach makes the model more suitable for the homography estimation of sheet metal displacement in stamping vibration environments, enhancing its robustness. The dataset generation process is shown in Figure 6.
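A simple augmentation routine in the spirit of the above (parameter ranges are illustrative, not the paper's; the input file name is hypothetical) might look like:

```python
import cv2
import numpy as np

def augment_stamping_patch(patch: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random photometric perturbations for grayscale stamping patches."""
    img = patch.astype(np.float32)

    # Random brightness and contrast: out = alpha * img + beta.
    alpha = rng.uniform(0.6, 1.4)          # contrast
    beta = rng.uniform(-40, 40)            # brightness shift
    img = alpha * img + beta

    # Random Gaussian blur to mimic defocus/motion during vibration.
    if rng.random() < 0.5:
        k = int(rng.choice([3, 5, 7]))
        img = cv2.GaussianBlur(img, (k, k), 0)

    # Additive Gaussian noise.
    if rng.random() < 0.5:
        img = img + rng.normal(0, rng.uniform(2, 8), size=img.shape)

    return np.clip(img, 0, 255).astype(np.uint8)

# Usage: applied to each PatchA/PatchB pair generated as in Figure 6.
rng = np.random.default_rng(0)
patch = cv2.imread("stamping_frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
aug = augment_stamping_patch(patch, rng)
```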

3. Results and Discussion

This method was implemented using Python 3.9.16 and PyTorch 2.0.0+cu118. The software platform was PyCharm 2022.1.4, and the hardware environment was a Windows 11 64-bit operating system with a 12th Gen Intel(R) Core(TM) i9-12900H processor (2.50 GHz) and 16 GB of memory. All training was conducted on a single NVIDIA GeForce RTX 3060 Laptop GPU with a batch size of 64 and 30,000 training iterations. The initial learning rate was set to 0.005, and it was reduced to one-tenth of its original value every 6000 iterations. The Adam optimizer was used with hyperparameters set to β 1 = 0.9 , β 2 = 0.999 , and ε = 10 8 . The balance weights for the three branches of the loss function were set to α 1 = 0.5 , α 2 = 0.3 , and α 3 = 0.2 , respectively. The experiments used the MSCOCO2014 dataset and stamping monitoring images as the original image library. The stamping monitoring images were single-channel grayscale images collected by an industrial area-scan monochrome camera with a resolution of 1920 × 1080. The MSCOCO2014 dataset was divided into training, validation, and test sets, generating 90,000 data samples. The stamping monitoring image dataset consisted of 1000 images and generated 10,000 data samples, which were divided into training, validation, and test sets in a 5:3:2 ratio.
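For reference, the stated optimizer and learning-rate schedule correspond to the following PyTorch configuration (a sketch mirroring the reported hyperparameters, not the authors' training script):

```python
import torch
import torch.nn as nn

def make_optimizer(model: nn.Module):
    """Adam(lr=0.005, betas=(0.9, 0.999), eps=1e-8) with the learning rate cut to
    one-tenth every 6000 iterations, as reported above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.005,
                                 betas=(0.9, 0.999), eps=1e-8)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6000, gamma=0.1)
    return optimizer, scheduler

# scheduler.step() is called once per training iteration (30,000 in total),
# not once per epoch, so that the decay happens every 6000 iterations.
```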

3.1. Evaluation Indicators

To measure the accuracy of different methods for homography estimation, this paper uses the Mean Average Corner Error (MACE) as the evaluation metric, which is the average Euclidean distance between the predicted and the true positions of the corner points. Given a pair of images $(I_s, I_t)$ and their corresponding ground-truth homography matrix $H_{st}$, let the coordinates of the four corner points of image $I_s$ be $p_s^i = [x_s^i, y_s^i, 1]^T,\ i \in \{1, 2, 3, 4\}$, and let $\hat{H}_{st}$ be the homography matrix predicted by the estimation model. The true positions $p_t^i$ and the predicted positions $\hat{p}_t^i$ of the four corner points in the target image $I_t$ are calculated as:

$$p_t^i = H_{st}\, p_s^i, \qquad \hat{p}_t^i = \hat{H}_{st}\, p_s^i, \qquad i \in \{1, 2, 3, 4\}$$

The Mean Average Corner Error is defined as the average of the four corner errors over the $M$ pairs of test images:

$$\mathrm{MACE} = \frac{1}{4M} \sum_{j=1}^{M} \sum_{i=1}^{4} \left( \left\| p_t^i - \hat{p}_t^i \right\|_2 \right)_j$$
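For a single image pair, the metric can be computed as in the following sketch (our own helper; averaging it over all M test pairs gives the reported MACE):

```python
import numpy as np

def mace(H_true: np.ndarray, H_pred: np.ndarray, corners: np.ndarray) -> float:
    """Mean Average Corner Error for one image pair: average L2 distance between the
    corner positions mapped by the ground-truth and the predicted homography."""
    pts = np.hstack([corners, np.ones((4, 1))])        # (4, 3) homogeneous corners of I_s
    def project(H):
        q = (H @ pts.T).T
        return q[:, :2] / q[:, 2:3]                    # back to inhomogeneous coordinates
    return float(np.mean(np.linalg.norm(project(H_true) - project(H_pred), axis=1)))

# Corners of a 128x128 patch, as used when generating the test data.
corners = np.array([[0, 0], [127, 0], [127, 127], [0, 127]], dtype=float)
```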

3.2. Comparative Experiment Results and Analysis

To evaluate the effectiveness of the proposed method, it will be compared with traditional homography estimation methods and deep-learning-based homography estimation methods. The average L2 distance between the true corner coordinates and the predicted coordinates will be calculated for all four corners of all images in the test set. Specifically, ORB + RANSAC and SIFT + RANSAC will be chosen as traditional feature-based homography estimation methods due to their superior performance over most traditional methods. HomographyNet and PFNet [28] will be selected as supervised learning-based homography estimation methods, where HomographyNet is the first model to estimate homography using deep learning, and PFNet is an excellent supervised homography estimation model that performs well on various datasets. The homography estimation model from Nguyen et al. [15] will be chosen as the unsupervised learning-based homography estimation method, as it is the first to apply unsupervised learning to the homography estimation task and has been used as a basis in subsequent studies. To demonstrate the novelty of the method proposed in this paper, the content-aware unsupervised homography estimation network (C-AHomE) from Zhang J. et al. [16] and the bidirectional implicit loss-based unsupervised homography estimation network (biHomE) from Koguciuk et al. [17] are selected as the latest research methods for comparison.
The comparison results of the mean corner error between the proposed method and other homography estimation methods are shown in Figure 7. It can be seen that the two traditional methods based on image features have significantly higher errors. This is because they overly rely on the quality of feature points during the feature extraction stage and the content of the image itself. Due to the fact that the feature points on the surface of the metal sheet are concentrated on the edges, punching holes, etc., and are not sufficiently abundant or well-distributed, they are prone to errors from lighting changes and perspective changes, resulting in poor homography estimation accuracy. Deep-learning methods, due to their strong feature representation capabilities, extract features of higher quality than traditional algorithms, resulting in lower errors. Among them, PFNet performs better. Unsupervised homography estimation networks do not rely on real labeled data during training and improve robustness to lighting changes by minimizing pixel-level photometric errors, resulting in better accuracy than supervised learning methods. However, this method uses a single-stage network structure. The proposed method achieves the lowest error estimation, reducing the MACE error by 2.66 compared to the previous unsupervised homography estimation network, with a 61.4% improvement in performance. When compared to the latest unsupervised homography estimation networks, C-AHomE and biHomE, the performance is improved by 13.9% and 8.2%, respectively.
Figure 8 presents the visualized results of homography estimation using the aforementioned eight methods across eight different test images, including five sets of stamping monitoring images and three sets of images from the MSCOCO2014 dataset. This selection aims to demonstrate the effectiveness of the proposed method in the specific industrial context of stamping monitoring while also highlighting its general applicability to various types of images. The leftmost column shows the template images of the eight test images, with blue frames representing the initial image patches and corner points. The green frames indicate the predicted positions of the corner points, with different levels of precision based on the homography matrices estimated by each method. The higher the overlap between the blue and green frames, the higher the accuracy of the homography estimation and the lower the error. From Figure 8, it can be seen that the proposed method achieves the highest alignment in the visualization results, indicating that the predicted homography matrix is the most accurate and has the smallest error compared to the actual homography matrix. This performance surpasses that of the other seven methods.
Next, the impact of different offsets on various methods is analyzed. When generating test data, different offset values, ρ , are set, representing the range of random perturbations of the four corner points. The smaller the offset, ρ , the less distortion in the homography-transformed image, and vice versa. In this experiment, the offset, ρ , is chosen from the range [10, 60]. The error results of different methods are shown in Table 1. It can be seen that, under the same offset value, the proposed method has the lowest homography estimation error. As the input image offset increases, the error of HomographyNet increases significantly, while PFNet performs relatively better. The two traditional feature-based methods, despite having larger errors, show relatively smaller increases in error changes due to their reliance on feature-point extraction. The error of the unimproved unsupervised homography estimation network is also significantly higher than that of the proposed method. Although the new unsupervised homography estimation methods, C-AHomE and biHomE, exhibit relatively low error estimates, their stability significantly decreases as the perturbation range increases, falling short of the stability demonstrated by the method proposed in this paper. This indicates that the proposed method performs well, even when the input images have a low overlap rate, and demonstrates greater robustness to different scaling, rotation, and other viewpoint changes.

3.3. Ablation Experiment Results and Analysis

To validate the effectiveness of the proposed improvements on homography estimation performance, experimental analysis was conducted on the module selection of the improved HomographyNet, the construction of the scale pyramid, and the number of stages in the overall network.
Firstly, ablation studies were conducted on the special deformation convolution layer and GN normalization layer introduced in the improved HomographyNet, as shown in Table 2. The results indicate that the original HomographyNet, even with a three-stage network structure, had the highest error in homography estimation. Introducing either the special deformation convolution layer or the GN normalization layer independently improved network performance to different extents, with the deformation layer having a greater impact. This is because this module expands the receptive field and enhances the learning ability for feature rotational invariance during the feature extraction process, which reduces the error in homography estimation. Meanwhile, the GN normalization layer provides stable training capabilities when dealing with small batches of large sizes, but it does not directly affect the error in homography estimation. The improved HomographyNet model proposed in this paper, which incorporates both the special deformation convolution module and the GN normalization module, shows the smallest error in homography estimation.
Subsequently, the method proposed in this paper was tested by using images of different scales at various stages and images of the same scale at all stages to verify the effectiveness of constructing an image pyramid for multi-scale input in homography estimation. As shown in Table 3, the method with same-scale input exhibited higher homography estimation errors at each stage compared to the method proposed in this article. To demonstrate the effectiveness of the improvements, the same test images from the previous experiment were used for homography estimation at three stages, with both same-scale and different-scale inputs. The visualization results, as shown in Figure 9, include the following: (a) the template image with the initial positions of the four corners marked by red boxes; (b), (c), (d) the estimation results at three stages with same-size image inputs; and (e), (f), (g) the estimation results at three stages with different-scale image inputs, where the yellow boxes indicate the predicted positions of the four corners. The higher the overlap between the two boxes, the smaller the error in the estimated homography matrix.
Finally, an ablation study was conducted on the number of network stages used in the proposed method, with the results tested on the aforementioned eight sets of images, as shown in Figure 10. The number of network stages is a crucial hyperparameter in our method. As indicated in the graph, the homography estimation error significantly decreases as the number of network stages increases from one to three, demonstrating that a multi-stage network structure can progressively refine the homography estimation task, reducing errors and improving accuracy. However, when the number of stages increases to four, the errors begin to rise, and at the fifth stage the errors significantly increase. This is attributed to the reduced image size for input when the network exceeds three stages, leading to unstable network training. Therefore, a three-stage network is optimal for this method.

4. Conclusions

Due to the random vibrations of the metal sheet caused by the opening and closing of the mold during the stamping process, image registration between the template image and the real-time monitoring image becomes crucial, making homography estimation especially important. Addressing the issues of large estimation errors, poor generalization, and limited industrial applications in current unsupervised homography networks under conditions of significant viewpoint changes and low image overlap, this paper proposes a multi-stage, multi-scale unsupervised homography estimation model. First, the HomographyNet model is improved by introducing a specialized deformable convolution module and GN (Group Normalization) layer, enhancing the model’s ability to learn rotation invariance when processing large-scale, high-resolution images. Next, based on the unsupervised homography estimation network model, a multi-scale, multi-stage network structure is constructed. This structure estimates homography at different scales across various stages, minimizing pixel-level photometric loss at each stage for unsupervised training. By progressively refining the homography transformation relationship between images through multiple stages, the accuracy of homography estimation is improved, along with the model’s ability to learn scale invariance. Finally, by integrating training with stamping process image data and the MSCOCO2014 dataset, and applying data augmentation through random perturbations, lighting variations, and different levels of blurriness, the model’s robustness to the complex changes in the stamping monitoring environment is enhanced. This approach not only ensures the data scale required for network training but also addresses the issue of data scarcity in applying deep learning to specific industrial environments. It is worth noting that, due to the uncertainty in the stamping environment, the images of the sheet surface may be significantly occluded. Therefore, improving the robustness of homography estimation under occlusion will be the focus of future research.

Author Contributions

Methodology, Y.Z. and Y.D.; validation and formal analysis, Y.D.; supervision, Y.Z.; writing—original draft, Y.D.; writing—review and editing, Y.Z. and Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, L.; Zhao, B. Analysis of the current situation and development trend of automobile sheet metal stamping technology. Forg. Equip. Manuf. Technol. 2022, 57, 7–10. [Google Scholar]
  2. Szeliski, R. Image alignment and stitching: A tutorial. Found. Trends Comput. Graph. Vis. 2007, 2, 1–104. [Google Scholar] [CrossRef]
  3. Lucas, B.; Kanade, T. An iterative image registration technique with an application to stereo vision. In Proceedings of the IJCAI’81: 7th International Joint Conference on Artificial Intelligence, Vancouver, BC, Canada, 24–28 August 1981; pp. 674–679, HAL ID: hal-03697340. [Google Scholar]
  4. Baker, S.; Matthews, I. Lucas-kanade 20 years on: A unifying framework. Int. J. Comput. Vis. 2004, 56, 221–255. [Google Scholar] [CrossRef]
  5. Syed, T.; Xiang, X. Traditional and modern strategies for optical flow: An investigation. SN Appl. Sci. 2021, 3, 2523–3963. [Google Scholar]
  6. Luo, Y.; Wang, X.; Wu, Y.; Shu, C. Detail-Aware Deep Homography Estimation for Infrared and Visible Image. Electronics 2022, 11, 4185. [Google Scholar] [CrossRef]
  7. Lowe, D. Object recognition from local scale-invariant features. In Proceedings of the 7th IEEE International Conference on Computer Vision, Corfu, Greece, 20–25 September 1999; IEEE Press: New York, NY, USA, 1999; pp. 1150–1157. [Google Scholar]
  8. Bay, H.; Tuytelaars, T.; Gool, L. Surf: Speeded up robust features. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Springer International Publishing: Cham, Switzerland, 2006; pp. 404–417. [Google Scholar]
  9. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 IEEE International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE Press: New York, NY, USA, 2011; pp. 2564–2571. [Google Scholar]
  10. Muja, M.; Lowe, D. Fast approximate nearest neighbors with automatic algorithm configuration. In Proceedings of the International Conference on Computer Vision Theory and Applications, Lisboa, Portugal, 5–8 February 2009; INSTICC Press: Lisboa, Portugal, 2009; pp. 331–340. [Google Scholar]
  11. Fischler, M.; Bolles, R. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  12. Rousseeuw, P.J. Least Median of Squares Regression. J. Am. Stat. Assoc. 1984, 79, 871–880. [Google Scholar]
  13. Luo, Y.; Wang, X.; Liao, Y.; Fu, Q.; Shu, C.; Wu, Y.; He, Y. A Review of Homography Estimation: Advances and Challenges. Electronics 2023, 12, 4977. [Google Scholar] [CrossRef]
  14. Detone, D.; Malisiewicz, T.; Rabinovich, A. Deep image homography estimation. arXiv 2016, arXiv:1606.03798. [Google Scholar]
  15. Nguyen, T.; Chen, S.; Shivakumar, S. Unsupervised deep homography: A fast and robust homography estimation model. IEEE Robot. Autom. Lett. 2018, 3, 2346–2353. [Google Scholar] [CrossRef]
  16. Zhang, J.; Wang, C.; Liu, S.; Jia, L.; Ye, N.; Wang, J.; Zhou, J.; Sun, J. Content-Aware Unsupervised Deep Homography Estimation. In Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2020; Volume 12346, pp. 653–669. [Google Scholar]
  17. Koguciuk, D.; Arani, E.; Zonooz, B. Perceptual Loss for Robust Unsupervised Homography Estimation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Virtual Conference, 19–25 June 2021; pp. 4274–4283. [Google Scholar]
  18. Liu, S.; Lu, Y.; Jiang, H.; Ye, N.; Wang, C.; Zeng, B. Unsupervised Global and Local Homography Estimation With Motion Basis Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 7885–7899. [Google Scholar] [CrossRef] [PubMed]
  19. Hu, W.; He, C.; Lin, M.; Zhou, H. Unsupervised deep homography with multi-scale global attention. IET Image Process. 2023, 17, 2937–2948. [Google Scholar] [CrossRef]
  20. Nie, L.; Lin, C.; Liao, K.; Liu, S.; Zhao, Y. Unsupervised deep image stitching: Reconstructing stitched features to images. IEEE Trans. Image Process. 2021, 30, 6184–6197. [Google Scholar] [CrossRef] [PubMed]
  21. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; Volume 9906, pp. 694–711. [Google Scholar]
  22. He, K.; Sun, J. Convolutional neural networks at constrained time cost. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; IEEE Press: New York, NY, USA, 2015; pp. 5353–5360. [Google Scholar]
  23. Shafiq, M.; Gu, Z. Deep residual learning for image recognition: A survey. Appl. Sci. 2022, 12, 8972. [Google Scholar] [CrossRef]
  24. Dai, J.; Qi, H.; Xiong, Y. Deformable convolutional networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; IEEE Press: New York, NY, USA, 2017; pp. 764–773. [Google Scholar]
  25. Erlik, N.; Laganiere, R.; Japkowicz, N. Homography estimation from image pairs with hierarchical convolutional networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; IEEE Computer Society: Piscataway, NJ, USA, 2017; pp. 913–920. [Google Scholar]
  26. Li, Y.; Pei, W.; He, Z. SRHEN: Stepwise-refining homography estimation network via parsing geometric correspondences in deep latent space. In Proceedings of the 28th ACM International Conference on Multimedia (ACM MM), Seattle, WA, USA, 12–16 October 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 3063–3071. [Google Scholar]
  27. Le, H.; Liu, F.; Zhang, S.; Aseem, A. Deep homography estimation for dynamic scenes. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE Computer Society: Piscataway, NJ, USA, 2020; pp. 7652–7661. [Google Scholar]
  28. Zeng, R.; Denman, S.; Sridharan, S.; Fookes, C. Rethinking planar homography estimation using perspective fields. In Proceedings of the 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2019; Volume 11366, pp. 571–586. [Google Scholar]
Figure 1. Network structure of the unsupervised homography estimation model.
Figure 2. HomographyNet network model.
Figure 3. Different convolution sampling methods: (a) regular convolution; (b) dilated convolution; (c) special deformable convolution.
Figure 4. Structure of the improved HomographyNet model.
Figure 5. Structure of the unsupervised multi-scale, multi-stage homography estimation network.
Figure 6. Data generation process: (a) randomly extract a rectangular image patch PatchA from the image; (b) apply random perturbations to the 4 corners of PatchA; (c) calculate H AB based on the perturbations (∆xi,∆yi); (d) apply the inverse matrix of H AB to the entire image and extract a corresponding image patch PatchB at the same position and size.
Figure 7. Comparison results of MACE between the proposed method and other methods.
Figure 8. Visualization results of different homography estimation methods.
Figure 9. Visualization results of same-size image input versus multi-scale image input.
Figure 10. Ablation study results for eight test images across different network stages.
Table 1. Error results of different methods under different offset values.

| Offset ρ | ORB + RANSAC | SIFT + RANSAC | HomographyNet | PFNet | Unsupervised HomographyNet | C-AHomE | biHomE | Ours |
|---|---|---|---|---|---|---|---|---|
| ρ = 10 | 8.34 | 6.82 | 3.19 | 2.72 | 2.26 | 0.97 | 0.89 | 0.72 |
| ρ = 20 | 10.82 | 8.57 | 6.56 | 4.85 | 3.49 | 1.40 | 1.31 | 1.19 |
| ρ = 30 | 12.10 | 10.73 | 9.41 | 6.36 | 4.33 | 1.94 | 1.82 | 1.67 |
| ρ = 40 | 15.79 | 13.32 | 15.13 | 8.91 | 5.27 | 3.82 | 3.51 | 2.23 |
| ρ = 50 | 18.35 | 17.91 | 19.78 | 11.28 | 6.12 | 5.11 | 4.67 | 2.81 |
| ρ = 60 | 21.27 | 20.45 | 24.30 | 13.14 | 7.91 | 6.39 | 5.75 | 3.56 |
Table 2. Ablation study results on module selection in the improved HomographyNet (✓ = module included, × = not included; the row assignment of the two single-module variants follows the discussion above, in which the deformable convolution layer has the greater impact).

| Special Deformable Convolution Layer | Group Normalization Layer | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|---|
| × | × | 10.15 | 6.21 | 4.32 |
| ✓ | × | 4.28 | 3.16 | 1.91 |
| × | ✓ | 9.93 | 5.38 | 3.23 |
| ✓ | ✓ | 3.91 | 2.85 | 1.67 |
Table 3. Ablation study results for same-size image input versus multi-scale image input.

| Image Scale (Stages 1–3) | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|
| 128 × 128 at all stages | 4.57 | 3.44 | 2.12 |
| 512 × 512, 256 × 256, 128 × 128 | 3.91 | 2.85 | 1.67 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
