Article

Universal Network for Image Registration and Generation Using Denoising Diffusion Probability Model

Department of Mechanical, Electrical and Information Engineering, Shandong University, Weihai 264209, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2024, 12(16), 2462; https://doi.org/10.3390/math12162462
Submission received: 15 July 2024 / Revised: 3 August 2024 / Accepted: 8 August 2024 / Published: 9 August 2024
(This article belongs to the Special Issue Mathematical Methods for Image Processing and Computer Vision)

Abstract

Classical diffusion model-based image registration approaches require separate diffusion and deformation networks to learn the reverse Gaussian transitions and to predict deformations between paired images, respectively. However, such cascaded architectures introduce noisy inputs into the registration, leading to excessive computational complexity and low registration accuracy. To overcome these limitations, a diffusion model-based universal network for image registration and generation (UNIRG) is proposed. Specifically, the training process of the diffusion model is generalized as a process of matching the posterior mean of the forward process to a modified mean. Subsequently, the equivalence between the training processes for image generation and image registration is verified by incorporating the deformation information of the paired images to obtain the modified mean. In this manner, UNIRG integrates image registration and generation within a unified network with shared training parameters. Experimental results on 2D facial and 3D cardiac medical images demonstrate that the proposed approach integrates the capabilities of image registration and guided image generation. Meanwhile, UNIRG achieves registration performance with NMSE of 0.0049, SSIM of 0.859, and PSNR of 27.28 on the 2D facial dataset, along with Dice of 0.795 and PSNR of 12.05 on the 3D cardiac dataset.

1. Introduction

Image registration plays a fundamental role in various applications [1,2,3,4,5]. It aligns images by establishing correspondences between points and mapping them to the same spatial coordinates. By registering images acquired at different times, the trajectory or trend of a target's motion can be analyzed, which has wide application in medical image registration [6]. As imaging technologies advance and image quality improves, the registration task becomes increasingly demanding. Traditional registration approaches [7,8], however, are hindered by issues such as poor robustness and high computational complexity, which constrain further advances in image registration.
To overcome these difficulties, innovative approaches [9,10,11] using deep learning techniques have been applied in image registration. The typical deep learning-based image registration approach involves feeding moving and fixed images into a network and training the network to estimate the deformation field between the paired images, which is used to transform and register the images. Prominent examples of such approaches include DIRNet [12], VoxelMorph (VM) [13], and VM-diff [14], which demonstrate fast and accurate image deformations during evaluation while ensuring high precision and robustness even on unseen images within the same dataset.
Deep learning-based image registration tasks have recently embraced the use of generative models such as GANs [15] and VAEs. These models are widely applied in image translation, generation [16,17], and super-resolution processing [18]. By leveraging the learned distribution from the training data, the generative model enables the network to generate realistic target samples, effectively capturing complex data distributions.
In a typical example, Mahapatra et al. [19] successfully applied a cyclic GAN to achieve accurate multimodal retinal and cardiac image registration by incorporating additional constraints. Fan et al. [20,21] employed a GAN in medical image registration, leveraging unsupervised autonomous learning loss to obtain precise results. Rezayi et al. [22] proposed an efficient approach for simultaneous super-resolution and image registration with a continuous generative model utilizing Gaussian kernels instead of conventional interpolation.
Among generative models, diffusion models have gained attention for their excellent performance in image generation, prompting researchers to explore their applications in various image processing tasks. By introducing noise and enforcing consistency between the noise distributions of the forward and reverse processes, a diffusion model can meet the requirements of the target task. Notable instances of such models include the denoising diffusion probabilistic model in [23] based on Markov Chains, as well as conditional diffusion models [24,25] designed for generating images with target semantics.
It is worth mentioning the notable image registration method called DiffuseMorph (DM) proposed by Kim et al. [26], which utilized the denoising diffusion model in image registration for the first time. The DM method employs two separate networks to handle the generation and registration tasks. By incorporating latent variables, DM enables joint training of both tasks through data transfer.
In this paper, we propose an innovative approach to address the issues of computational cost and task balancing by sharing the training process of generating and registering tasks. Specifically, our approach utilizes a universal model based on the U-net [27] architecture. By feeding diffused fixed images and incorporating additional spatial conditions into the generative model during training, we achieve image registration and obtain deformed images and deformation fields. Then, we leverage the learned target distribution to perform sampling for generating synthetic deformed images. Furthermore, we employ a mean-shift strategy to guide the generation of images with desired semantics.
The proposed approach offers several advantages over classical diffusion model-based registration methods. Primarily, it performs both image registration and generation tasks using a unified network, avoiding the need to design separate network structures for different tasks. Such a strategy results in increased registration accuracy and reduced computational cost while avoiding the challenge of balancing energy functions based on different tasks. Additionally, unlike traditional deep learning-based registration methods, the proposed model maintains accurate registration performance even in the presence of perturbations in the input image. Moreover, the model allows for control over the quality and content of the generated images by attaching a shift factor as a generative constraint.

2. Background and Related Works

2.1. Deformable Image Registration Model

Image registration can be described as an optimization problem, that is, finding the optimal deformation field $d^*$ between a moving image $J$ and a fixed image $I$ such that the dissimilarity between the deformed image $J(d)$ and $I$ is minimized:

$d^* = \arg\min_d E_D(I, J(d)) + \lambda E_R(d)$ (1)

where $E_D(I, J(d))$ is the dissimilarity function used to measure the similarity between $J(d)$ and $I$, the term $E_R(d)$ represents the regularization constraint responsible for preserving the smoothness of the predicted deformation field, and the hyperparameter $\lambda$ is usually introduced to adjust the contributions of $E_D(I, J(d))$ and $E_R(d)$.
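To make the objective concrete, the following minimal PyTorch sketch evaluates this energy with illustrative choices that Equation (1) does not prescribe: mean squared error for $E_D$ and squared first-order finite differences of $d$ for $E_R$.

```python
import torch

def registration_energy(I, J_d, d, lam=0.01):
    """E_D(I, J(d)) + lambda * E_R(d), with illustrative choices: MSE for the
    dissimilarity term and squared first differences of d for the smoothness term."""
    e_data = torch.mean((I - J_d) ** 2)              # E_D: mean squared error
    dy = d[:, :, 1:, :] - d[:, :, :-1, :]            # vertical finite differences of d
    dx = d[:, :, :, 1:] - d[:, :, :, :-1]            # horizontal finite differences of d
    e_reg = dy.pow(2).mean() + dx.pow(2).mean()      # E_R: first-order smoothness
    return e_data + lam * e_reg
```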

2.2. Denoising Diffusion Probabilistic Model

A denoising diffusion probabilistic model (DDPM) [23] is a type of generative model that learns the Markov Chain from a simple Gaussian distribution to the data distribution. The forward diffusion process of DDPM involves gradually adding noise $\epsilon_t \sim \mathcal{N}(0, I)$, $t \in [1, T]$, to the original data $x_0$ using a fixed Markov Chain. Here, each step in sampling the latent variables $x_t$ is defined as a Gaussian transition:

$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)$ (2)

where the noise variance $\beta_t \in (0, 1)$ follows the fixed variance schedule. Given data $x_0$, the resulting sampling of $x_t$ is then expressed as follows:

$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I)$ (3)

where $\bar{\alpha}_t = \prod_{i=1}^{t} (1 - \beta_i)$. According to the reparameterization trick [28], given $x_0$ and $\epsilon \sim \mathcal{N}(0, I)$, $x_t$ can be sampled by

$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon.$ (4)
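The forward process is cheap to simulate because Equation (4) gives $x_t$ in closed form. Below is a small PyTorch sketch; the schedule values ($T = 2000$, $\beta$ from $10^{-6}$ to $10^{-2}$) are taken from the implementation details in Section 4.1.2, and the 0-based time indexing is a convention of the sketch.

```python
import torch

T = 2000
betas = torch.linspace(1e-6, 1e-2, T)           # linear schedule (Section 4.1.2)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # alpha_bar_t = prod_{i<=t} (1 - beta_i)

def q_sample(x0, t, eps=None):
    """Diffuse x0 to x_t via Equation (4); t is a 0-based LongTensor of shape (B,)."""
    if eps is None:
        eps = torch.randn_like(x0)              # eps ~ N(0, I)
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps
```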
The generative (or reverse) process in DDPM shares the same functional form as the forward process and involves learning parameterized Gaussian transitions $p_\theta(x_{t-1} | x_t)$, expressed as follows:

$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)).$ (5)

Training the DDPM involves optimizing the usual variational bound on the negative log-likelihood. For tractability, the training loss is approximated by aligning $p_\theta(x_{t-1} | x_t)$ with the forward process posteriors conditioned on $x_0$:

$q(x_{t-1} | x_t, x_0) = \mathcal{N}(x_{t-1}; \mu_q(x_t, t), \Sigma_q(x_t, t))$ (6)

where

$\mu_q(x_t, t) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{(1 - \bar{\alpha}_{t-1}) \sqrt{\alpha_t}}{1 - \bar{\alpha}_t} x_t$ (7)

$\Sigma_q(x_t, t) = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t I.$ (8)

Thereby, the generated $\hat{x}_{t-1}$ can be obtained by sampling from the learned target distribution $p_\theta$ and represented as a combination of $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$:

$\hat{x}_{t-1} = \mu_\theta(x_t, t) + \epsilon \sqrt{\Sigma_\theta(x_t, t)}$ (9)

where

$\mu_\theta(x_t, t) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} \tilde{x}_0(\theta, x_t, t) + \frac{(1 - \bar{\alpha}_{t-1}) \sqrt{\alpha_t}}{1 - \bar{\alpha}_t} x_t$ (10)

and $\Sigma_\theta = \Sigma_q$. Here, $\tilde{x}_0(\theta, x_t, t)$ is a parameterized model that estimates $x_0$ from $x_t$, with $\alpha_t = 1 - \beta_t$.
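A single reverse step then combines the network's $\tilde{x}_0$ estimate with $x_t$ according to Equations (7)-(10). A minimal sketch, continuing the conventions of the earlier snippet:

```python
import torch

def p_sample(x_t, t, x0_pred, betas, alpha_bar):
    """One reverse step (Equations (7)-(10)): build mu_theta from the network's
    x0 estimate and sample x_{t-1}; t is a 0-based integer index."""
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
    A = ab_prev.sqrt() * betas[t] / (1.0 - ab_t)                   # coefficient of x0_pred, Eq. (10)
    B = (1.0 - ab_prev) * (1.0 - betas[t]).sqrt() / (1.0 - ab_t)   # coefficient of x_t, Eq. (10)
    var = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t]                # Sigma_q, Eq. (8)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return A * x0_pred + B * x_t + var.sqrt() * noise
```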

2.3. Conditional Diffusion Models

Recently, conditional diffusion models have been proposed to condition the generative process of a well-performing unconditional DDPM, enabling the model to generate images with semantics similar to a reference image. Among the conditional diffusion models, the iterative latent variable refinement model (ILVR) [24] and the classifier-guided diffusion model (CGD) [25] stand out as the most representative.

2.3.1. Iterative Latent Variable Refinement Model

Under the framework of DDPM, ILVR [24] controls the semantics of the generated image $\hat{x}_{t-1}$ by introducing a condition $c: \phi_N(y_{t-1}) = \phi_N(x_{t-1})$ into the unconditional generative process $x_{t-1} \sim p_\theta(x_{t-1} | x_t)$ of DDPM. Here, $\phi_N(\cdot)$ is the low-pass filtering operation (https://github.com/assafshocher/ResizeRight, accessed on 7 August 2024) with scale $N$, while $y_{t-1} \sim q(y_{t-1} | y_0)$ is the diffused version of the reference image $y_0$ containing the target semantics. Therefore, the generated $\hat{x}_{t-1}$ can be expressed as

$\hat{x}_{t-1} = \phi_N(y_{t-1}) - \phi_N(x_{t-1}) + x_{t-1}.$ (11)
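A sketch of one ILVR refinement step follows; the bilinear down/up-sampling stand-in for $\phi_N$ and the scale $N = 4$ are assumptions of the sketch (the paper points to the ResizeRight filter), and q_sample is the helper defined earlier:

```python
import torch
import torch.nn.functional as F

def ilvr_refine(x_prev, y0, t, N=4):
    """ILVR conditioning step (Equation (11)): keep the high frequencies of the
    unconditional sample x_{t-1} and the low frequencies of the diffused reference."""
    def phi(z):                                   # crude low-pass: down- then up-sample
        h, w = z.shape[-2:]
        z = F.interpolate(z, size=(h // N, w // N), mode='bilinear', align_corners=False)
        return F.interpolate(z, size=(h, w), mode='bilinear', align_corners=False)
    t_idx = torch.full((y0.shape[0],), t - 1, dtype=torch.long)
    y_prev = q_sample(y0, t_idx)                  # y_{t-1} ~ q(y_{t-1} | y_0)
    return phi(y_prev) - phi(x_prev) + x_prev
```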

2.3.2. Classifier-Guided Diffusion Model

Similar to ILVR, CGD [25] does not require additional changes to the training process of DDPM. Instead, it directly decomposes the conditional distribution $p_{\theta,\psi}(x_{t-1} | x_t, y)$ into an unconditional process $p_\theta(x_{t-1} | x_t)$ and a conditional classifier-guided process $p_\psi(y | x_{t-1})$. Consequently, the generated target $\hat{x}_{t-1}$ can be obtained from the sampling process:

$\hat{x}_{t-1} = \mu_\theta + s \cdot \Sigma_\theta \cdot grad + \epsilon \sqrt{\Sigma_\theta}$ (12)

where $grad = \nabla_{x_{t-1}} \log p_\psi(y | x_{t-1}) \big|_{x_{t-1} = \mu_\theta}$ is the gradient of the classifier, the scale $s$ determines the intensity of the classifier guidance, and $\theta$ and $\psi$ represent the learned parameters of the generation and classification networks, respectively.
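The guidance term can be computed with ordinary autograd. In the sketch below, the classifier interface (logits of shape (B, num_classes)) is an assumption:

```python
import torch

def cgd_step(mu, var, y, classifier, s=1.0):
    """Classifier-guided sampling (Equation (12)): shift the unconditional mean by
    s * Sigma * grad log p_psi(y | x) evaluated at x = mu."""
    x = mu.detach().requires_grad_(True)
    log_py = classifier(x).log_softmax(dim=-1)              # assumed (B, num_classes) logits
    grad = torch.autograd.grad(log_py[torch.arange(len(y)), y].sum(), x)[0]
    return mu + s * var * grad + var.sqrt() * torch.randn_like(mu)
```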

2.4. DiffuseMorph

To the best of our knowledge, DiffuseMorph (https://github.com/DiffuseMorph/DiffuseMorph, accessed on 7 August 2024) (DM) [26] is the first work to apply diffusion models to image registration; the other diffusion models are mainly used for generation tasks. In DM, the conditional noise added in the forward process is predicted by the generative network. The predicted noise $\hat{\epsilon}_\theta(I_0, t, c = (I_0, J_0))$ is then utilized by the registration network to learn the optimal deformation field $d^*$ for image registration:

$d^* = \arg\min_{\theta, \psi} E_G(\theta, I_0, t, c) + \lambda_1 E_D(J(d_\psi), \hat{\epsilon}_\theta) + \lambda_2 E_R(d_\psi)$ (13)

where $E_G(\theta, I_0, t, c)$ and $E_D(J(d_\psi), \hat{\epsilon}_\theta)$ denote the generation and registration loss functions, respectively, $\theta$ and $\psi$ represent the learned parameters of the generation and registration networks, respectively, $E_R(d_\psi)$ describes the regularization constraint on the deformation field, and $\lambda_1$ and $\lambda_2$ are hyperparameters that balance the losses of the different tasks. However, the requirement of two separate networks in DM leads to a series of issues, including challenges in model optimization, difficulty in balancing the losses of the different networks, and low registration accuracy.

3. Proposed Method

Leveraging the capability of DDPM, we propose a universal network that performs image registration and image generation simultaneously. The proposed method is illustrated in Figure 1. The network takes a moving image $J_0$, a fixed image $I_0$, and the diffused fixed image $I_t$ (obtained by Equation (4) with $x_0 = I_0$) as inputs, then computes the deformation field $d$. Subsequently, we warp the moving image $J_0$ to $\tilde{I}_0$, i.e., $J_0(d) = \tilde{I}_0$, using the Spatial Transformer Layer (STL) [29], allowing the similarity between $I_0$ and $\tilde{I}_0$ to be evaluated. Furthermore, we use the predicted $\tilde{I}_0$ and the equations in Figure 1 to perform the reverse diffusion process for image generation.
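The STL is essentially differentiable grid sampling. A 2D sketch under assumed tensor layouts ((B, C, H, W) images and (B, 2, H, W) displacement fields in pixels):

```python
import torch
import torch.nn.functional as F

def stl_warp(J0, d):
    """Spatial Transformer Layer: warp moving image J0 (B, C, H, W) with a
    displacement field d (B, 2, H, W), given in pixels, via differentiable
    grid sampling."""
    B, _, H, W = J0.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    xs, ys = xs.to(J0), ys.to(J0)                 # identity grid, image dtype/device
    gx = 2.0 * (xs + d[:, 0]) / (W - 1) - 1.0     # displaced x, normalized to [-1, 1]
    gy = 2.0 * (ys + d[:, 1]) / (H - 1) - 1.0     # displaced y, normalized to [-1, 1]
    grid = torch.stack((gx, gy), dim=-1)          # (B, H, W, 2) in (x, y) order
    return F.grid_sample(J0, grid, align_corners=True)
```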

3.1. Universal Training Process of UNIRG

Referring to [23,24,25], diffusion model-based image generation can be achieved by solving a probabilistic matching problem between the forward and reverse diffusion processes (or the matching problem between true noise and predicted noise). Because the distributions of the forward and reverse processes are Gaussian, the matching problem can be transformed into a mean matching problem. Furthermore, the generated targets in Equations (9), (11) and (12) can be reformulated as follows:

$\hat{x}_{t-1} = M_\theta + \epsilon \sqrt{\Sigma_\theta}$ (14)

where $M_\theta$ denotes the modified mean. For example, in DDPM we have $M_\theta = \mu_\theta$, while for ILVR and CGD we have

$\mathrm{ILVR}: M_\theta = \phi_N(y_{t-1}) - \phi_N(x_{t-1}) + \mu_\theta, \qquad \mathrm{CGD}: M_\theta = s \cdot \Sigma_\theta \cdot grad + \mu_\theta.$ (15)

Because these models share the same training process as DDPM, $\tilde{x}_0(\theta, x_t, t)$ can be obtained by solving $\arg\min_\theta \| \mu_\theta - \mu_q \|^2$. Therefore, we can make the following modifications to incorporate the modified means into the training process:

$\mathrm{ILVR}: \arg\min_\theta \| M_\theta - \mu_q + \phi_N(y_{t-1}) - \phi_N(x_{t-1}) \|^2, \qquad \mathrm{CGD}: \arg\min_\theta \| M_\theta - \mu_q + s \cdot \Sigma_\theta \cdot grad \|^2.$ (16)

In general, we define $M_\theta = A \cdot \tilde{x}_0(\theta, x_t, t) + B \cdot x_t$ following the form of Equation (10), where $A = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}$, $B = \frac{(1 - \bar{\alpha}_{t-1}) \sqrt{\alpha_t}}{1 - \bar{\alpha}_t}$, and $\tilde{x}_0(\theta, x_t, t)$ is the modified parameterized model, which varies between tasks. Subsequently, Equation (16) can be expressed as

$\mathrm{ILVR}: \arg\min_\theta \left\| \tilde{x}_0 - x_0 + \frac{\phi_N(y_{t-1}) - \phi_N(x_{t-1})}{A} \right\|^2, \qquad \mathrm{CGD}: \arg\min_\theta \left\| \tilde{x}_0 - x_0 + \frac{s \cdot \Sigma_\theta \cdot grad}{A} \right\|^2.$ (17)
To summarize, the training objective of the generative model $\tilde{x}_0$ is to generate an image $x_0$ with the desired semantics. This training process shares similarities with dissimilarity loss optimization in image registration models. Therefore, by considering the fixed image $I_0$ as the desired target image, i.e., $x_0 = I_0$, and incorporating the condition $c = (I_0, J_0)$ carrying the deformation information of the paired images into the generative model $\tilde{I}_0(\theta, d_\theta, I_t, t, c)$, we can simultaneously achieve image registration. The deformation field is estimated by

$d^* = \arg\min_\theta \| \tilde{I}_0(\theta, d_\theta, I_t, t, c) - I_0 \|^2.$ (18)

Furthermore, a regularization term is added to Equation (18) to preserve the smoothness of the predicted deformation field $d$:

$d^* = \arg\min_\theta \| \tilde{I}_0(\theta, d_\theta, I_t, t, c) - I_0 \|^2 + \lambda \| \nabla d_\theta \|^2.$ (19)

The overall training step of UNIRG is shown in Algorithm 1.
Algorithm 1: Training Process
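The published pseudocode is rendered as an image in the web version; the following PyTorch sketch reconstructs one training step from Section 3.1 and Equation (19), reusing q_sample and stl_warp from the earlier sketches. The network interface model(I_t, t, c), with the condition c = (I_0, J_0) channel-concatenated, is an assumption of the sketch.

```python
import torch

def train_step(model, optimizer, I0, J0, lam=2.5e-2):
    """One UNIRG training step (cf. Algorithm 1): diffuse the fixed image,
    predict the deformation field, warp the moving image with the STL, and
    minimize Equation (19)."""
    t = torch.randint(0, T, (I0.shape[0],))          # random time step per sample
    I_t = q_sample(I0, t)                            # diffused fixed image, Eq. (4)
    d = model(I_t, t, torch.cat((I0, J0), dim=1))    # field from condition c = (I0, J0)
    I0_tilde = stl_warp(J0, d)                       # deformed image J0(d)
    smooth = (d[:, :, 1:, :] - d[:, :, :-1, :]).pow(2).mean() \
           + (d[:, :, :, 1:] - d[:, :, :, :-1]).pow(2).mean()
    loss = (I0_tilde - I0).pow(2).mean() + lam * smooth   # Eq. (19)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```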

3.2. Image Generation via Reverse Diffusion

3.2.1. Generation

Similar to DDPM, UNIRG can perform a reverse diffusion process to generate target images. Specifically, according to Equation (18), the mean of the generated distribution can be written as $M_\theta = A \cdot \tilde{I}_0(\theta, d_\theta, I_t, t, c) + B \cdot I_t$. Then, the specific form of the generated samples can be obtained:

$\hat{I}_{t-1} = A \cdot \tilde{I}_0^*(\theta, d_\theta, I_t, t, c) + B \cdot I_t + \epsilon \sqrt{\Sigma_{d_\theta}}$ (20)

where $\epsilon \sim \mathcal{N}(0, I)$ from $t = T$ to $t = 1$, $\tilde{I}_0^*$ represents the deformed image under the optimal deformation field $d_\theta^*$ obtained after network training, and $\Sigma_{d_\theta} = \Sigma_q$, the same as in DDPM. The pseudocode for image generation is described in Algorithm 2.
Algorithm 2: Sampling Process
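As with Algorithm 1, the published pseudocode is an image; below is a hedged reconstruction of the sampling loop from Equation (20), reusing the helpers defined above:

```python
import torch

@torch.no_grad()
def sample(model, I0, J0):
    """Reverse diffusion (cf. Algorithm 2): start from noise and iterate
    Equation (20), I_{t-1} = A * I0_tilde + B * I_t + eps * sqrt(Sigma)."""
    I_t = torch.randn_like(I0)
    for t in reversed(range(T)):
        tt = torch.full((I0.shape[0],), t, dtype=torch.long)
        d = model(I_t, tt, torch.cat((I0, J0), dim=1))
        I0_tilde = stl_warp(J0, d)                        # deformed image with the learned field
        I_t = p_sample(I_t, t, I0_tilde, betas, alpha_bar)
    return I_t
```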

3.2.2. Guided Generation

In addition, UNIRG can achieve guided generation in the same way as ILVR or CGD by further modifying Equation (18) to the form of Equation (17):

$\arg\min_\theta \left\| \tilde{I}_0(\theta, d_\theta, I_t, t, c) - I_0 + \frac{\eta}{A} \right\|^2$ (21)

where $\tilde{I}_0$ is a latent model and $\eta$ is set as a shift factor to control the direction of the generation. The guided generated samples are provided by

$\hat{I}_{t-1} = A \cdot \tilde{I}_0^*(\theta, d_\theta, I_t, t, c) + B \cdot I_t + \eta + \epsilon \sqrt{\Sigma_{d_\theta}}.$ (22)

In particular, $\eta = 0$ denotes the unconditional reverse process, while $\eta = \phi_N(y_{t-1}) - \phi_N(x_{t-1})$ and $\eta = s \cdot \Sigma_d \cdot grad$ recover the same conditional reverse processes as ILVR and CGD, respectively. Here, we set $\eta = g(J_0 - I_0)$ to perform guided generation in this paper and produce an image with content similar to $J_0$, where $g(\cdot)$ denotes an image transformation operation.
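A sketch of the shift factor follows; the masking-plus-averaging form of $g(\cdot)$ loosely follows the center mask of width 40 mentioned in the Figure 5 caption, while the 5 × 5 averaging kernel is an assumption:

```python
import torch
import torch.nn.functional as F

def eta_shift(J0, I0, width=40):
    """Shift factor eta = g(J0 - I0). As an illustrative g, keep a smoothed
    central patch of the difference image (the center mask of width 40 follows
    the Figure 5 caption; the averaging filter is an assumption)."""
    diff = J0 - I0
    H, W = diff.shape[-2:]
    mask = torch.zeros_like(diff)
    mask[..., (H - width) // 2:(H + width) // 2, (W - width) // 2:(W + width) // 2] = 1.0
    return F.avg_pool2d(diff * mask, kernel_size=5, stride=1, padding=2)
```

During sampling, $\eta$ is simply added to the mean of each reverse step, as in Equation (22).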

3.3. Network Architecture

UNIRG utilizes a typical encoder–decoder network (U-net) as the backbone parameterized by $\theta$. The encoding path consists of residual and downsampling layers, while the decoding path uses residual and upsampling layers instead. Meanwhile, encoding and decoding features of the same dimension are matched and fused through skip connections. To comprehensively improve registration and generation performance, we modify the U-net architecture used in VM as follows:
  • Adding a multihead attention module to enhance the model’s attention regarding the input features.
  • Embedding time information to help the model determine the time step.
  • Using the low-pass filtering operation in ILVR instead of the downsampling or pooling operation to retain more features and improve the accuracy of image registration.
The modified architecture ensures accurate capture of the deformation field between images even in the presence of considerable input noise, thereby improving the robustness of the model. With these modifications and the use of a dense STL, the proposed model achieves better registration and generation performance. The specific network structure is shown in Figure 2.
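The sketch below assembles the three modifications into one illustrative encoder stage; the channel sizes, the GroupNorm, and the bilinear stand-in for the low-pass Resizer are assumptions rather than the exact blocks of Figure 2:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownStage(nn.Module):
    """One illustrative encoder stage: residual convolution with time embedding
    (ResConv), multihead self-attention (ResAtten), and a 1/2 low-pass resize
    (Resizer) in place of strided pooling. c_out is assumed divisible by 8."""
    def __init__(self, c_in, c_out, t_dim=128, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)     # kernel 3, stride 1, padding 1
        self.t_proj = nn.Linear(t_dim, c_out)                # inject the time embedding
        self.attn = nn.MultiheadAttention(c_out, heads, batch_first=True)
        self.norm = nn.GroupNorm(8, c_out)

    def forward(self, x, t_emb):
        h = F.leaky_relu(self.conv(x), 0.2)                  # LeakyReLU(0.2) as in Figure 2
        h = h + self.t_proj(t_emb)[:, :, None, None]         # add time information
        B, C, H, W = h.shape
        a, _ = self.attn(*[h.flatten(2).transpose(1, 2)] * 3)   # self-attention over pixels
        h = self.norm(h + a.transpose(1, 2).reshape(B, C, H, W))
        # low-pass 1/2 resize instead of pooling (bilinear stands in for ResizeRight)
        return F.interpolate(h, scale_factor=0.5, mode='bilinear', align_corners=False)
```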

4. Experimental Results

We conducted experiments on various datasets to validate the effectiveness of the proposed method with the modified network architecture. The evaluation focused on the performance of the proposed method in both image registration and generation tasks. We compared UNIRG with VM [13], VM-diff [14], and DM [26]. Specifically, our experiments used the 2D Radboud Faces Database (RaFD) [30] and the 3D Automated Cardiac Diagnosis Challenge Dataset (ACDC) [31]. Further details regarding the datasets, experiments, and analysis of the results are provided below.

4.1. Radboud Faces Database

4.1.1. Dataset and Preprocessing

The RaFD database comprises 67 subjects exhibiting eight emotional expressions: angry, disgusted, fearful, happy, sad, surprised, contemptuous, and neutral. Each emotion is displayed in three different gazes (left, frontal, and right) while maintaining a consistent head orientation. We divided the dataset into a training set (60 groups) and a test set (7 groups) to better adapt to the image registration task. Additionally, we resized the images to 128 × 128 pixels for further processing.

4.1.2. Implementation Details

During the training process, we used the Adam optimizer with a learning rate of $5 \times 10^{-6}$. Training was conducted on a single NVIDIA RTX 2080S GPU for 50 epochs with a batch size of 8, considering both time consumption and accuracy. The time step was set to $T = 2000$, and a linear noise schedule was applied with levels ranging from $10^{-6}$ to $10^{-2}$. The number of attention heads was set to 4. Furthermore, to ensure that the regularization loss remained smaller than the dissimilarity loss, and considering the randomness introduced by the initial noise and random sample selection, we found that setting the coefficient $\lambda$ to $2.5 \times 10^{-2}$ yielded better results.
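For reference, the hyperparameters reported above, collected into a single configuration object (the dictionary layout is ours):

```python
config = dict(
    optimizer='Adam', lr=5e-6,          # Adam with learning rate 5e-6
    epochs=50, batch_size=8,            # single NVIDIA RTX 2080S GPU
    T=2000,                             # diffusion time steps
    beta_start=1e-6, beta_end=1e-2,     # linear noise schedule
    attention_heads=4,                  # multihead attention
    lam=2.5e-2,                         # regularization coefficient lambda
)
```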

4.1.3. Image Registration

During the evaluation, we fixed the random seed to 0 for easier analysis. To ensure fairness, we conducted comparison experiments using VM and VM-diff models of the same size, as well as a DM model whose registration network has the same size. The registration performance of the proposed method was verified by comparing it with VM, VM-diff, and DM. Visual comparisons of the 2D facial grayscale image registration results can be seen in Figure 3. Additionally, by applying the trained deformation field to the image in each channel, we obtained the registration results for 2D facial RGB expressions, as illustrated in Figure 4.
As part of the image evaluation, we carried out numerical comparisons for further analysis. We utilized NMSE, SSIM, PSNR, and the ratio of non-positive Jacobian determinants over the deformation field ($|J_a(d)| \le 0$) as metrics to enrich the experimental comparison. The first three metrics assess the accuracy of image registration, while the last metric compares the folding degree and smoothness of the deformation field. The results of these comparisons are presented in Table 1.
The results shown in Figure 3 and Figure 4 demonstrate the accuracy of registration achieved by the proposed method, even on unseen images. Moreover, in comparison with the other models, our model achieves superior registration accuracy: the evaluation values in the figures indicate higher SSIM values and lower NMSE values for our model. Notably, our model exhibits a nearly two-fold performance improvement in NMSE compared to VM and VM-diff.
Table 1 further demonstrates the effectiveness of the proposed model in image registration. Compared to VM, our model shows significant improvements in NMSE and PSNR along with a slight enhancement in SSIM; compared to VM-diff, the improvement is more noticeable. These results indicate the better registration performance of the proposed model. Furthermore, our model aligns well with VM and VM-diff in terms of $|J_a(d)| \le 0$, validating its ability to preserve topology during registration.

4.1.4. Guided Generation

Utilizing $\eta = g(J_0 - I_0)$, we performed reverse diffusion starting from $t = 200$ for generation. This process begins with a noisy fixed image $I_{200} = \sqrt{\bar{\alpha}_{200}}\, I_0 + \sqrt{1 - \bar{\alpha}_{200}}\, \epsilon$, which serves as the initial point for generating an image resembling the moving image. Figure 5 displays the visualization of generation results on 2D facial grayscale images. Furthermore, the comparison results combined with registration are shown in Figure 6.
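A minimal sketch of this partially noised start, reusing the helpers and names from the earlier sketches (the 0-based indexing is a convention of the sketch):

```python
import torch

t0 = 200
tt = torch.full((I0.shape[0],), t0 - 1, dtype=torch.long)
I_t = q_sample(I0, tt)                   # noisy fixed image I_200, Eq. (4)
eta = eta_shift(J0, I0)                  # shift factor g(J0 - I0)
with torch.no_grad():
    for t in reversed(range(t0)):        # reverse diffusion from t = 200 down to 1
        d = model(I_t, torch.full((I0.shape[0],), t, dtype=torch.long),
                  torch.cat((I0, J0), dim=1))
        I_t = p_sample(I_t, t, stl_warp(J0, d), betas, alpha_bar) + eta  # Eq. (22)
```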
The experimental results presented in Figure 5 and Figure 6 show that the proposed method using the shift factor enables guidance in image generation, resulting in high-quality generated images when there are slight changes in expressions or poses. Despite ghosting in cases of significant image changes, the generated images preserve the desired semantics from the target image. Additionally, we find that the pixel value range impacts the quality of the generated results. In this paper, we normalize the pixel values to the range $[-1, 1]$, consistent with previous DDPM generation tasks.

4.2. Automated Cardiac Diagnosis Challenge Dataset

4.2.1. Dataset and Preprocessing

The ACDC dataset is a fully annotated public MRI cardiac dataset. It comprises 150 exams from different patients and includes additional information such as weight, height, and diastolic/systolic phase instants. The dataset is divided into five subgroups: normal subjects (NOR), patients with previous myocardial infarction (MINF), patients with dilated cardiomyopathy (DCM), patients with hypertrophic cardiomyopathy (HCM), and patients with abnormal right ventricle (RV). Each subgroup consists of 30 cases. For each case, 4D MRI images and corresponding segmentation images at the diastolic and systolic phases are provided. We resampled each case to a voxel spacing of 1.5 × 1.5 × 3.15 mm³ and then cropped the volumes to a size of 128 × 128 × 32 voxels. Subsequently, we selected one sample out of every five to form a test set of 30 cases, with the remaining 120 cases used as training samples.

4.2.2. Implementation Details

In contrast to the 2D facial image training process, we changed the learning rate to $2 \times 10^{-4}$, set the number of epochs to 800, and used a batch size of 1. The other parameters remained the same.

4.2.3. Continuous Image Registration

Considering the medical image registration task, continuous deformations can be taken into account to make the results more meaningful. In this paper, the continuous deformation from the end-diastolic (ED) phase to the end-systolic (ES) phase can be observed by treating the 4D MRI serial images at each phase as fixed images. Specifically, a parameter $Phase_\gamma$ was defined, where $\gamma = 0$ corresponds to the ED phase and $\gamma = 1$ to the ES phase. The 3D MRI image at phase $\gamma \in [0, 1]$ was chosen as the fixed image, and the image at phase $\gamma = 0$ was used as the moving image for the registration experiment. By observing the deformation fields obtained in the cardiac registration experiment from the ED to the ES phase, the continuous cardiac deformations could be analyzed. The continuous deformations are displayed in Figure 7 and Figure 8 to illustrate the motion of the anatomical structures. A common observable and significant change from the ED to the ES state is the reduction of the diastolic volume. This reduction is a key focus of our observations and highlights a consistent trend across different cases. Moreover, the performance of the various models was evaluated based on how closely the registration results resemble the fixed images. Figure 9 shows the registration results from the ED to ES phases for different groups with various pathological conditions, offering a comprehensive understanding of the registration performance in different cardiac states.
In addition, the NMSE, PSNR, and Dice metrics were employed to quantify the registration accuracy, and the $|J_a(d)| \le 0$ ratio was selected to measure the quality of the deformation field. The registration results for phases $\gamma = 0.4$ and $\gamma = 1$ are presented in Table 2. Meanwhile, the mask images of the end-diastolic and end-systolic phases were also used for registration quality analysis, as shown in Table 3.
As can be seen from Figure 7 and Figure 8, the proposed model achieves continuous deformation in medical images. Compared to the other models, our model exhibits better performance in anatomical structure registration. Furthermore, our model achieves accurate deformations while maintaining robustness across different pathological features. According to Figure 9, the model in this paper can effectively capture useful features and preserve essential image details to complete the registration task, even for complex anatomical structures. The results of UNIRG in Figure 9 show that our model achieves higher similarity between the registration results and the fixed images, demonstrating improved registration performance.
Moreover, according to Table 2, our model achieves the minimum NMSE and the maximum PSNR at each stage of continuous deformations while maintaining similar Jacobian ratios as the other models. These results demonstrate that the proposed model maintains a topological structure and achieves higher registration accuracy. From Table 3, it can be seen that the proposed model is also advantageous for the mask image registration task, even when the mask images are not used for network training.

4.2.4. Ablation Study on Module Modifications

Based on our experiments on the 3D cardiac diagnosis dataset, we evaluated the impact of different model structures on end-diastolic and end-systolic mask image registration. The results in Table 4 demonstrate the positive influence of the module modifications on registration; combining all three modules yields the most significant improvement. The number of modules used correlates positively with the PSNR evaluation results. The Resizer module has a stronger effect than the ResAtten module, which in turn outperforms the ResConv module. The Resizer module also plays a crucial role when considering the NMSE metric, resulting in a smaller difference between the deformed and fixed images and achieving higher registration accuracy. These results confirm the effectiveness of the proposed method in enhancing 3D cardiac mask image registration.

5. Discussion and Conclusions

In this paper, we have presented an analysis of DDPM and applied it to image registration tasks. By leveraging a modified U-net architecture, we successfully implement both image generation and registration within a unified network. In addition, we propose a shift factor to realize guidance of the target generation. Regarding the improvements to the network architecture, the Resizer module replaces the traditional pooling operation, thereby enabling noise filtering and reducing the impact of irrelevant information on the results; the ResAtten module enhances the focus on target features; and the ResConv module captures temporal features, which improves the predictive performance of the model. Each of the three modules serves a distinct purpose, and according to the ablation studies, their combined use significantly enhances registration accuracy. Experimental results demonstrate the superior registration performance of our model compared to the VM, VM-diff, and DM models. Specifically, we achieve impressive results in guided generation and registration for facial expression images as well as accurate continuous motion estimation for medical images.
It is noteworthy that NMSE and similar metrics are determined by the difference in pixel intensity between the images being compared. UNIRG achieves higher similarity between the deformed image and the target image, demonstrating superior performance on these metrics. Although the proposed model is not explicitly designed for enhanced topological preservation, it does leverage convolution operations, similar to the VM and VM-diff models. Convolutional layers ensure that spatially local information is processed together, thereby preserving the spatial structure. Moreover, the Resizer module, which likewise operates on local information, further supports topological preservation.
Nevertheless, some directions for further improvement are worth noting, such as reducing computational time, improving image preprocessing quality, and exploring the impact of the pixel value range on generation. Moreover, the ResAtten module, which incorporates the attention mechanism, requires careful control of the number of such modules to avoid introducing excessive parameters. Additionally, the Resizer module modifies the feature scale, requiring consideration of its placement. Future research can consider these issues to enhance the performance of the proposed model and handle more complex registration tasks.

Author Contributions

Writing—original draft preparation, H.J.; writing—review and editing, H.J., P.X. and E.D.; supervision, E.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Fundamental Research Funds for the Central Universities, in part by the National Natural Science Foundation of China under Grants 62171261, 81671848, and 81371635, in part by the Natural Science Foundation for Young Scholars of Shandong Province under Grant ZR2023QF058, and in part by the Innovation Ability Improvement Project of Science and Technology Small and Medium-Sized Enterprises of Shandong Province under Grants 2021TSGC1028 and 2023TSGC0650.

Data Availability Statement

The data on which this study is based were accessed from a repository and are available for download through the following link: https://rafd.socsci.ru.nl/?p=main; https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html (accessed on 7 August 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Linger, M.E.; Goshtasby, A.A. Aerial image registration for tracking. IEEE Trans. Geosci. Remote Sens. 2014, 53, 2137–2145.
  2. Seetharaman, G.; Gasperas, G.; Palaniappan, K. A Piecewise Affine Model for Image Registration in Nonrigid Motion Analysis. In Proceedings of the 2000 International Conference on Image Processing (Cat. No. 00CH37101), Vancouver, BC, Canada, 10–13 September 2000; Volume 1, pp. 561–564.
  3. Stockman, G.; Kopstein, S.; Benett, S. Matching images to models for registration and object detection via clustering. IEEE Trans. Pattern Anal. Mach. Intell. 1982, 3, 229–241.
  4. Brown, L.G. A survey of image registration techniques. ACM Comput. Surv. (CSUR) 1992, 24, 325–376.
  5. Sheng, Y.; Shah, C.A.; Smith, L.C. Automated image registration for hydrologic change detection in the lake-rich Arctic. IEEE Geosci. Remote Sens. Lett. 2008, 5, 414–418.
  6. El-Gamal, F.E.Z.A.; Elmogy, M.; Atwan, A. Current trends in medical image registration and fusion. Egypt. Inform. J. 2016, 17, 99–124.
  7. Crum, W.R.; Hartkens, T.; Hill, D. Non-rigid image registration: Theory and practice. Br. J. Radiol. 2004, 77, S140–S153.
  8. Wyawahare, M.V.; Patil, P.M.; Abhyankar, H.K. Image registration techniques: An overview. Int. J. Signal Process. Image Process. Pattern Recognit. 2009, 2, 11–28.
  9. Yang, X.; Kwitt, R.; Styner, M.; Niethammer, M. Quicksilver: Fast predictive image registration—A deep learning approach. NeuroImage 2017, 158, 378–396.
  10. Haskins, G.; Kruger, U.; Yan, P. Deep learning in medical image registration: A survey. Mach. Vis. Appl. 2020, 31, 1–18.
  11. Fu, Y.; Lei, Y.; Wang, T.; Curran, W.J.; Liu, T.; Yang, X. Deep learning in medical image registration: A review. Phys. Med. Biol. 2020, 65, 20TR01.
  12. De Vos, B.D.; Berendsen, F.F.; Viergever, M.A.; Staring, M.; Išgum, I. End-to-End Unsupervised Deformable Image Registration with a Convolutional Neural Network. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, 14 September 2017; pp. 204–212.
  13. Balakrishnan, G.; Zhao, A.; Sabuncu, M.R.; Guttag, J.; Dalca, A.V. An Unsupervised Learning Model for Deformable Medical Image Registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9252–9260.
  14. Dalca, A.V.; Balakrishnan, G.; Guttag, J.; Sabuncu, M.R. Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Med. Image Anal. 2019, 57, 226–236.
  15. Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65.
  16. Han, C.; Hayashi, H.; Rundo, L.; Araki, R.; Shimoda, W.; Muramatsu, S.; Furukawa, Y.; Mauri, G.; Nakayama, H. GAN-Based Synthetic Brain MR Image Generation. In Proceedings of the 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC, USA, 4–7 April 2018; pp. 734–738.
  17. Semeniuta, S.; Severyn, A.; Barth, E. A hybrid convolutional variational autoencoder for text generation. arXiv 2017, arXiv:1702.02390.
  18. Zhu, X.; Zhang, L.; Zhang, L.; Liu, X.; Shen, Y.; Zhao, S. GAN-based image super-resolution with a novel quality loss. Math. Probl. Eng. 2020, 2020, 1–12.
  19. Mahapatra, D. GAN based medical image registration. arXiv 2018, arXiv:1805.02369.
  20. Fan, J.; Cao, X.; Wang, Q.; Yap, P.T.; Shen, D. Adversarial learning for mono- or multi-modal registration. Med. Image Anal. 2019, 58, 101545.
  21. Fan, J.; Cao, X.; Xue, Z.; Yap, P.T.; Shen, D. Adversarial Similarity Network for Evaluating Image Alignment in Deep Learning Based Registration. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, 16–20 September 2018; pp. 739–746.
  22. Rezayi, H.; Seyedin, S.A. A Joint Image Registration and Superresolution Method Using a Combinational Continuous Generative Model. IEEE Trans. Circuits Syst. Video Technol. 2016, 28, 834–848.
  23. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
  24. Choi, J.; Kim, S.; Jeong, Y.; Gwon, Y.; Yoon, S. ILVR: Conditioning method for denoising diffusion probabilistic models. arXiv 2021, arXiv:2108.02938.
  25. Dhariwal, P.; Nichol, A. Diffusion models beat GANs on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794.
  26. Kim, B.; Han, I.; Ye, J.C. DiffuseMorph: Unsupervised Deformable Image Registration Using Diffusion Model. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; pp. 347–364.
  27. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241.
  28. Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114.
  29. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28, 2017–2025.
  30. Langner, O.; Dotsch, R.; Bijlstra, G.; Wigboldus, D.H.; Hawk, S.T.; Van Knippenberg, A. Presentation and validation of the Radboud Faces Database. Cogn. Emot. 2010, 24, 1377–1388.
  31. Bernard, O.; Lalande, A.; Zotti, C.; Cervenansky, F.; Yang, X.; Heng, P.A.; Cetin, I.; Lekadir, K.; Camara, O.; Ballester, M.A.G.; et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Trans. Med. Imaging 2018, 37, 2514–2525.
Figure 1. The graphical model. The blue arrows indicate the direction of feature propagation when training the network. The orange arrow depicts the basic generation task. The guided generation process is indicated by the purple arrow. Input: moving image $J_0$, fixed image $I_0$, and diffused fixed image $I_t$. Output: deformed image $\tilde{I}_0$, generated image $\hat{I}_0$, and guided generated image $\hat{J}_0$.
Figure 2. Architecture of the registration network. The number of output channels is denoted as $a/b$, where $a$ corresponds to 2D tasks and $b$ corresponds to 3D tasks. For 3D image registration, only two CRBlocks are used before the output. Among the various residual blocks, only the first CRBlock adjusts the channel size of the features, while the second CRBlock maintains the same channel size. The LeakyReLU activation function is used with a parameter of 0.2 for all CRBlocks in the experiment. Moreover, all CRBlocks preserve the feature size and only adjust the number of channels, that is, they use convolution layers with a kernel size of 3, stride of 1, and padding of 1. In addition, time embedding is employed to project the time steps and embed temporal information. A scaling factor of 1/2 is chosen in the encoding phase for the low-pass filtering operation, and a 2× low-pass filtering operation is performed in the decoding phase. Linear interpolation is used for all interpolation operations. Because the deepest feature size may be smaller than one pixel, we still assign it as one pixel. In the decoding path, following the idea of super-resolution, encoded and decoded features of the same scale are concatenated and then fed into the CRBlocks to enhance image sharpness and retain more detailed features.
Figure 3. Comparison results for 2D facial expression grayscale image registration. Original images (left two columns), deformed images (middle four columns), deformation fields (right four columns), and NMSE/SSIM values for grayscale image registration. Top: Fearful front gaze (moving) to surprised front gaze (fixed). Bottom: Disgusted left gaze (moving) to happy left gaze (fixed).
Figure 4. Comparison results for 2D facial expression RGB image registration. From top to bottom: surprised front gaze (moving) to fearful front gaze (fixed); sad left gaze (moving) to angry left gaze (fixed). The NMSE/SSIM values below correspond to grayscale image registration.
Figure 5. Visualization of 2D facial grayscale image generation results. Original image (left), guided generated images with $\eta = J_0 - I_0$ (middle), and guided generated images with $\eta_1 = 0$ and $\eta_2 = g(J_0 - I_0)$ (right), where $g(\cdot)$ in $\eta_2$ represents using a center mask with a width of 40 for smoothing operations. From top to bottom: sad front gaze (moving) to neutral right gaze (fixed), contemptuous front gaze (moving) to angry right gaze (fixed), and happy front gaze (moving) to disgusted left gaze (fixed).
Figure 6. Comparison of 2D facial expression grayscale image registration and generation results, showing moving–fixed expressions. From top to bottom: contemptuous front gaze–neutral front gaze; angry left gaze–contemptuous left gaze; happy front gaze–contemptuous front gaze; angry front gaze–sad front gaze.
Figure 7. Continuous registration results for cardiac MRI images. Visualization of patient No. 35 with hypertrophic cardiomyopathy. Left column: original image in $Phase_\gamma$. Middle columns: registration results obtained by deforming the moving image to the $Phase_\gamma$ image. Right columns: corresponding deformation fields.
Figure 8. Continuous registration results for cardiac MRI images. Visualization of fixed images for registration from ED to ES phase of a normal subject (No. 110) on the left. The middle columns show the registration results and the right columns display the corresponding deformation fields.
Figure 9. ED–ES registration results for different pathological cases. Cases 70, 120, 10, 40, and 85, representing NOR, MINF, DCM, HCM, and RV, respectively, are selected for display.
Table 1. Numerical results for grayscale facial image registration.

| Method | NMSE (×10⁻¹) | SSIM | PSNR | \|Ja(d)\| ≤ 0 |
| Origin | 0.301 (0.213) | 0.668 (0.100) | 19.692 (3.172) | |
| DM | 0.279 (0.100) | 0.643 (0.065) | 19.210 (1.347) | 0.424 (0.047) |
| VM | 0.098 (0.037) | 0.828 (0.058) | 23.770 (1.664) | 0.506 (0.013) |
| VM-diff | 0.103 (0.038) | 0.823 (0.060) | 23.567 (1.647) | 0.510 (0.021) |
| UNIRG | 0.049 (0.030) | 0.859 (0.054) | 27.280 (2.617) | 0.500 (0.026) |
Table 2. Evaluation results of cardiac MRI image registration.

| Phase | Method | NMSE | PSNR | \|Ja(d)\| ≤ 0 |
| γ = 0.4 | Origin | 0.102 (0.168) | 26.307 (6.149) | |
| | DM | 0.112 (0.162) | 24.330 (4.775) | 0.498 (0.001) |
| | VM | 0.091 (0.178) | 30.068 (6.863) | 0.501 (0.005) |
| | VM-diff | 0.095 (0.179) | 28.947 (6.622) | 0.501 (0.006) |
| | UNIRG | 0.079 (0.159) | 31.162 (6.690) | 0.506 (0.009) |
| γ = 1 | Origin | 0.135 (0.163) | 22.644 (4.292) | |
| | DM | 0.144 (0.158) | 21.915 (3.720) | 0.498 (0.001) |
| | VM | 0.092 (0.165) | 27.540 (5.251) | 0.498 (0.004) |
| | VM-diff | 0.099 (0.166) | 26.499 (4.938) | 0.499 (0.006) |
| | UNIRG | 0.079 (0.149) | 28.914 (5.117) | 0.502 (0.009) |
Table 3. Evaluation results of cardiac MRI mask registration.

| Method | Dice | PSNR | NMSE | Time |
| Origin | 0.708 (0.184) | 10.753 (1.745) | 0.197 (0.091) | |
| DM | 0.708 (0.182) | 10.691 (1.671) | 0.202 (0.090) | 0.533 (0.549) |
| VM | 0.770 (0.145) | 11.500 (1.981) | 0.279 (0.311) | 0.160 (0.441) |
| VM-diff | 0.786 (0.139) | 11.986 (1.855) | 0.235 (0.233) | 0.182 (0.233) |
| UNIRG | 0.795 (0.124) | 12.050 (2.126) | 0.227 (0.160) | 0.516 (0.536) |
Table 4. Comparison among UNIRG variants.

| ResConv | ResAtten | Resizer | Dice | PSNR | NMSE |
| | | | 0.791 (0.132) | 11.943 (2.191) | 0.255 (0.245) |
| | | | 0.783 (0.145) | 11.949 (2.286) | 0.275 (0.335) |
| | | | 0.789 (0.140) | 11.953 (2.187) | 0.284 (0.413) |
| | | | 0.790 (0.129) | 11.975 (2.132) | 0.241 (0.187) |
| | | | 0.789 (0.137) | 11.962 (2.267) | 0.254 (0.235) |
| | | | 0.794 (0.122) | 11.983 (2.064) | 0.230 (0.164) |
| | | | 0.789 (0.139) | 12.005 (2.254) | 0.272 (0.348) |
| | | | 0.795 (0.124) | 12.050 (2.126) | 0.227 (0.160) |