1. Introduction
In the past two decades, diffeomorphic registration has become a fundamental problem in medical image analysis [1]. The diffeomorphic transformations estimated from the solution of the image registration problem constitute the starting point in Computational Anatomy studies for modeling and understanding population trends and longitudinal variations, and for establishing relationships between imaging phenotypes and genotypes in Imaging Genetics [2,3,4,5,6,7,8]. Moreover, diffeomorphic registration can be as useful as any other deformable image registration framework in the fusion of multi-modal information from different sensors, the capture of correlations between structure and function, the guidance of computerized interventions, and many other applications [9,10,11].
A relevant issue in deformable image registration is the quest for the most sensible transformation model for each clinical domain. On the one hand, there are domains where the underlying biophysical model of the transformation is known. The incompressible motion of the healthy heart is a relevant example [12]. On the other hand, there are also important clinical contexts where the deformation model is not known, although there is active research on finding the most plausible transformation among those explained by a physical model [13,14,15]. The most relevant examples are the deformation between healthy and diseased brains and the longitudinal evolution of brain changes in healthy and diseased individuals.
Although the differentiability and invertibility of diffeomorphisms are fundamental features for Computational Anatomy, the diffeomorphic constraint does not guarantee that a transformation computed with a given method is physically meaningful for the clinical domain of interest. To obtain physically meaningful diffeomorphisms, diffeomorphic registration methods should be able to impose a plausible physical model on the computed transformations.
PDE-constrained LDDMM (PDE-LDDMM) registration methods have arisen as an appealing paradigm for computing diffeomorphisms under plausible physical models [13,14,15,16,17,18,19,20,21,22]. In the cases where the model is known, the PDE-LDDMM formulation allows the introduction of the priors of the particular model [23,24,25,26]. The PDE-constrained formulation is also helpful in the quest for plausible transformations in clinical applications with an unknown deformation model [15]. In addition, PDE-LDDMM is a well-suited approach for the estimation of registration uncertainty [27,28].
The different PDE-LDDMM methods differ in the variational problem formulation, diffeomorphism parameterization, regularizers, image similarity metrics, optimization methods, and additional PDE constraints. Among them, the use of Gauss–Newton–Krylov optimization [13,14,22], the addition of nearly incompressible terms in the variational formulation [14,19], the use of variants involving the deformation state equation [18,22], and the introduction of the band-limited parameterization and HPC or GPU implementations [19,20,22,29] constitute the most successful contributions to the realistic and efficient computation of physically meaningful diffeomorphisms so far. Some attention has been given to the image similarity metric, where the sum of squared differences (SSD) between the final state variable and the target image has been mostly used [13,16,22]. Only two variations of PDE-LDDMM, with normalized gradient fields (NGFs) and mutual information (MI), have been proposed in [18,30] (arXiv preprint). These two methods use gradient-based techniques for optimization.
SSD is based on image subtraction, so it is only well suited to uni-modal registration of images where the intensities of corresponding structures do not vary much. Indeed, SSD is not robust to noise, intensity inhomogeneity, and partial volume effects. Moreover, SSD is not suitable for multi-modal registration, even for transformation models with few degrees of freedom such as those in rigid or affine image registration [31,32]. Therefore, there is a need for PDE-LDDMM methods, preferably with Gauss–Newton–Krylov optimization, that support image similarity metrics better behaved than SSD. The solution to this problem will promote the use of PDE-LDDMM in the wide variety of clinical applications that depend on deformable image registration of images acquired from different sensors.
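For reference, one common convention for the SSD and NCC metrics between the final state variable m(1) (the warped source) and the target image I_1 is

SSD(m(1), I_1) = (1/2) ∫_Ω (m(1)(x) − I_1(x))² dx,
NCC(m(1), I_1) = ⟨m(1) − m̄(1), I_1 − Ī_1⟩² / (‖m(1) − m̄(1)‖² ‖I_1 − Ī_1‖²),

where bars denote mean intensities and lNCC evaluates the same expression over local neighborhoods; the exact normalization used in this work may differ from these textbook forms.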
This work proposes a unifying framework for introducing different image similarity metrics in the two best-performing variants of PDE-LDDMM [22,33]. From the Lagrangian variational problems, we have identified that a change in the image similarity metric involves changing the initial adjoint and the initial incremental adjoint variables. We have derived the equations of these variables needed for gradient-descent and Gauss–Newton–Krylov optimization with Normalized Cross Correlation (NCC), its local version (lNCC), Normalized Gradient Fields (NGFs), and Mutual Information (MI).
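As an illustration of how the metric enters the optimality system (a sketch under the usual PDE-constrained LDDMM conventions; the paper's own notation and signs may differ), the adjoint variable is initialized at the final time with the first variation of the image similarity term. For SSD, this terminal condition reads

λ(1) = I_1 − m(1),

where m(1) is the final state and I_1 the target image. Switching to NCC, lNCC, NGFs, or MI amounts to replacing this condition, and its incremental counterpart in Gauss–Newton–Krylov optimization, with the first- and second-order variations of the new metric.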
NCC, lNCC, and MI have accompanied SSD in different deformable image registration methods since their inception [34]. NGFs is an interesting metric for a wide variety of deformable registration problems [35]. These metrics are available in relevant deformable image registration packages such as the Insight Toolkit (www.itk.org, accessed on 1 January 2022), NiftyReg (https://sourceforge.net/projects/niftyreg, accessed on 1 January 2022), or FAIR [32], among others. In the framework of diffeomorphic image registration, ANTS registration was implemented for SSD, lNCC, and MI (www.nitrc.org/projects/ants, accessed on 1 January 2022), and the performance of the different image similarity metrics was evaluated in [36]. We have selected these metrics as a starting point, although our framework is extensible to other image similarity metrics proposed in the literature, provided that the first- and second-order variations of the image similarity metric can be written in the expected form [37,38].
Our experiments focused on the spatial (SP) and band-limited (BL) stationary parameterizations of diffeomorphisms, although the proposed methods can be straightforwardly extended to the non-stationary parameterization [39]. We obtained successful Gauss–Newton–Krylov methods for SSD, NCC, and lNCC, with evaluation results clearly outperforming gradient-descent and competing with the respective versions of ANTS diffeomorphic registration [40]. For NGFs, the second-order method did not provide satisfactory results in comparison with gradient-descent. For MI, the memory load of the second-order method hindered a proper evaluation in 3D datasets. We extensively studied the performance of our methods in the NIREP16 and Klein et al. evaluation frameworks [41,42], obtaining interesting insight into the impact of the different image similarity metrics on the PDE-LDDMM framework.
Since the advances that made it possible to learn the optical flow using convolutional neural networks (FlowNet [43]), dozens of deep-learning data-based methods have been proposed to approach the problem of deformable image registration in different clinical applications [44]. Some of them are specifically devised for diffeomorphic registration, where the different LDDMM ingredients are used as a backbone for diffeomorphism parameterization and the definition of the loss functions [45,46,47,48,49,50,51,52,53,54]. These methods use SSD and NCC metrics in the image similarity loss function, and the proposed models are usually limited to a single modality where the appearance of the image pairs needs to be similar to the training data. Among them, only SynthMorph proposed a model valid for multi-modal registration, through the extensive generation of simulated data for training, which yields fast inference for diffeomorphism computation once the difficulties with training have been overcome [53].
Although the authors of SynthMorph provided an extensive study on the generalization capability of their models in different multi-modal experiments, the loss function is restricted to the Dice Similarity Coefficient (DSC) on image segmentations; therefore, interesting questions such as the actual ability of deep-learning models to deal with multimodality or the influence of the image similarity loss function on the registration accuracy remain unanswered. In addition, results from the Learn2Reg challenge question the superiority of deep-learning approaches and open new research directions into hybrid methods, for which contributions to traditional optimization-based methods like ours may be of interest [55].
In the following, Section 2 reviews the foundations of PDE-LDDMM and BL PDE-LDDMM, from the original variant proposed in [13,16] to the variants used in this work. Section 3 analyzes the change of image similarity metrics in PDE-LDDMM and derives the equations needed for gradient-descent and Gauss–Newton–Krylov optimization for the considered metrics. Section 4 gathers the experimental setup for the evaluation of the methods and the numerical and implementation details of the proposed methods and the benchmarks. Section 5 shows the evaluation results. Finally, Section 7 gathers the most remarkable conclusions of our work.
4. Experimental Setup
4.1. Datasets
We used five different databases in our evaluation:
NIREP16 was proposed in [41] for the evaluation of non-rigid registration. NIREP16 consists of 16 T1 Magnetic Resonance Imaging (MRI) images. NIREP16 images were acquired at the Human Neuroanatomy and Neuroimaging Laboratory, University of Iowa. They were selected for the NIREP project from a database of 240 normal volunteers. Datasets correspond to 8 males and 8 females with a mean age of and years, respectively. The images are skull stripped and aligned according to the anterior and posterior commissures. Image dimension is with a voxel size of mm. Images are distributed with the segmentation of 32 gray matter regions at the frontal, parietal, temporal, and occipital lobes. The most remarkable feature of this dataset is its excellent image quality. The geometry of the segmentations provides an especially challenging framework for deformable registration evaluation. In our previous works, a subsampled version of this dataset has been extensively used for the evaluation of different LDDMM methods. The images of this dataset were subsampled by reducing the image dimension to with a voxel size of mm. Subsampling is needed to be able to run interesting but memory-demanding benchmark methods and to maintain the continuity of the evaluation results shown in previous works.
Klein datasets were proposed in [42] in the first extensive evaluation study of non-rigid registration methods. The datasets contain the T1 MRI images and segmentations from the LPBA40, IBSR18, CUMC12, and MGH10 databases. The four databases provide images with different levels of quality, posing varying difficulties for deformable registration [62]. Image dimension is with a voxel size of mm.
LPBA40 contains 40 skull-stripped brain images without the cerebellum and the brain stem. LPBA40 provides the segmentation of 50 gray matter structures together with the left and right caudate, putamen, and hippocampus. LPBA40 protocols can be found at https://loni.usc.edu/research/atlases (accessed on 1 January 2022).
IBSR18 contains 18 brain images with the segmentation of 96 cerebral structures. This dataset provides the segmentation of brain structures of interest for the evaluation of image registration methods. The image quality is low. For example, most of the images show motion artifacts. The variability of the ventricle sizes is high.
CUMC12 contains 12 full brain images with the segmentation of 130 cerebral structures. Overall, the image quality is acceptable, although some of the images are noisy. The variability of the ventricle sizes is high.
MGH10 contains 10 full brain images with the segmentation of 106 cerebral structures. Overall, the image quality is acceptable, although some of the images are noisy. Ventricle sizes are usually large.
In addition, we studied the performance of our methods in a multi-modal experiment, where the images were obtained from:
OASIS. The Open Access Series of Imaging Studies (https://www.oasis-brains.org/, accessed on 1 January 2022) is a project aimed at making neuroimaging data sets of the brain freely available to the scientific community. OASIS-3 compiles images from more than 1000 participants ranging from cognitively normal to various stages of cognitive decline. For each participant, the study includes different MRI sessions including T1, T2, FLAIR, and others. Our multimodal experiment selected a T2 image from an Alzheimer's disease participant as the source image, and a T1 image from a cognitively normal participant as the target image.
4.2. Image Registration Pipeline
The evaluation in NIREP16 was performed consistently with our previous works on PDE-LDDMM diffeomorphic registration. The registrations were carried out from the first subject to every other subject in the database, yielding a total of 15 registrations per variant, optimization method, and image similarity metric. The subsampled NIREP16 database was obtained from the resampling of the original images into volumes of size with a voxel size of mm after the alignment to a common coordinate system using affine transformations. The images were scaled between 0 and 1 for SSD and NCC metrics, and between 0 and 255 for lNCC, NGFs, and MI. The affine alignment and subsampling were performed using the Insight Toolkit (ITK).
The LPBA40, IBSR18, CUMC12, and MGH10 images were preprocessed similarly to [42]. The input images were selected from the Synapse repository (https://www.synapse.org/#%21Synapse:syn3217707, accessed on 1 January 2022) in the folder hosting FLIRT affine-registered images. In the first place, histogram matching was applied to all the images. The images were then scaled between 0 and 1 for the SSD and NCC metrics, and between 0 and 255 for lNCC, NGFs, and MI. To perform these preprocessing steps, we used the algorithms available in ITK.
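A minimal sketch of this preprocessing, assuming the SimpleITK Python binding of ITK (the original pipeline may rely on other ITK interfaces, and the filter settings shown are illustrative):

import SimpleITK as sitk

def preprocess(moving_path, reference_path, out_max=255.0):
    moving = sitk.ReadImage(moving_path, sitk.sitkFloat32)
    reference = sitk.ReadImage(reference_path, sitk.sitkFloat32)
    # Histogram matching of the moving image to the reference image.
    matcher = sitk.HistogramMatchingImageFilter()
    matcher.SetNumberOfHistogramLevels(256)
    matcher.SetNumberOfMatchPoints(7)
    matcher.ThresholdAtMeanIntensityOn()
    matched = matcher.Execute(moving, reference)
    # Rescale intensities: [0, 1] for SSD/NCC, [0, 255] for lNCC, NGFs, and MI.
    return sitk.RescaleIntensity(matched, 0.0, out_max)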
OASIS images were finely aligned to the MNI152 atlas with NiftyReg and then skull stripped using the robust brain extraction software ROBEX (https://www.nitrc.org/projects/robex, accessed on 1 January 2022).
4.3. Numerical Details, Parameter Configuration, and Implementation Details
The experiments were run on a cluster of two machines equipped with four NVidia GeForce GTX 1080 Ti GPUs with 11 GB of video memory and an Intel Core i7 with 64 GB of DDR3 RAM, and with two NVidia Titan RTX GPUs with 24 GB of video memory and an Intel Core i7 with 64 GB of DDR3 RAM, respectively. The codes were developed on the GPU with Matlab 2017a and CUDA.
Regularization parameters were selected from a search of the optimal parameters in the registration experiments performed in our previous work [22]. We selected the parameters , , and and a unit-domain discretization of the image domain [56]. We also tested the parameters , , and and a spatial-domain discretization of , selected as optimal in [58]. For gradient-descent optimization, we obtained excellent evaluation results; however, the obtained maximum Jacobians were much higher than recommended. On the other hand, Gauss–Newton–Krylov showed convergence problems during PCG, with negative curvature values found at early inner iterations. This suggests that the specific selection of parameters in [58] may achieve fairly high structural overlaps at the cost of very aggressive underlying deformations, which are glimpsed in the malfunctioning of Gauss–Newton.
The BL experiments were performed with band sizes of for BL Variants I and II. This selection was found to be optimal for each method in our previous work [20,22,33].
Gradient-descent was implemented with an efficient method for the update of the step size based on offline backtracking line-search combined with a check on Armijo's condition. We used the stopping conditions in [13,32]. Otherwise, the optimization was stopped after 50 iterations.
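As an illustration of this line-search strategy, a generic backtracking sketch with Armijo's sufficient-decrease condition (the constants and the offline caching used in the actual Matlab implementation are assumptions of this sketch):

import numpy as np

def armijo_backtracking(energy, v, grad, direction, step0=1.0, c1=1e-4, shrink=0.5, max_trials=10):
    # Return a step size t satisfying E(v + t*d) <= E(v) + c1 * t * <grad, d>.
    e0 = energy(v)
    slope = np.vdot(grad, direction).real
    t = step0
    for _ in range(max_trials):
        if energy(v + t * direction) <= e0 + c1 * t * slope:
            return t
        t *= shrink  # backtrack by halving the step
    return t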
Gauss–Newton–Krylov was also implemented with an offline backtracking line-search combined with a check on Armijo's condition. The number of PCG iterations was set to five. The PCG tolerance was selected from . We used the stopping conditions in [13,32]. Otherwise, the optimization was stopped after 10 iterations. These parameters were selected as optimal in our previous work, since the methods achieved state-of-the-art accuracy in a reasonable amount of time [22].
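A minimal sketch of the inner Krylov solve, assuming a matrix-free Gauss–Newton Hessian-vector product and an identity preconditioner (the actual preconditioner and truncation rule are not reproduced); the search direction falls back to the negative gradient if negative curvature is detected, as mentioned above for the parameters of [58]:

import numpy as np

def truncated_pcg(hess_vec, grad, max_iter=5, tol=1e-1):
    # Approximately solve H d = -g with (preconditioned) conjugate gradients.
    d = np.zeros_like(grad)
    r = -grad.copy()            # residual of H d = -g at d = 0
    p = r.copy()
    rr = np.vdot(r, r).real
    norm0 = np.sqrt(rr)
    for _ in range(max_iter):
        Hp = hess_vec(p)
        curv = np.vdot(p, Hp).real
        if curv <= 0.0:         # negative curvature: stop and keep the current iterate
            break
        alpha = rr / curv
        d += alpha * p
        r -= alpha * Hp
        rr_new = np.vdot(r, r).real
        if np.sqrt(rr_new) <= tol * norm0:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return d if np.any(d) else -grad   # gradient direction if CG made no progress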
PDE-LDDMM was embedded into a multi-resolution scheme. The images were subsampled, and the velocity fields were resampled similarly to [63,64]. The PDE-LDDMM registration methods were executed on each resolution level. For the multi-resolution experiments, the pyramid was built with three levels with the same number of outer and inner iterations as for the single resolution.
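Schematically, the coarse-to-fine driver may look as follows (a sketch; register_level stands for the single-level PDE-LDDMM routine and is an assumption of this snippet, as is the channels-last velocity layout; resampled shapes may need minor cropping in practice):

import numpy as np
from scipy.ndimage import zoom

def multiresolution_register(source, target, register_level, levels=3):
    v = None                                        # stationary velocity field, shape (nx, ny, nz, 3)
    for level in reversed(range(levels)):           # level = 2, 1, 0 (coarsest first)
        factor = 1.0 / (2 ** level)
        src_l = zoom(source, factor, order=1)       # subsample the images
        tgt_l = zoom(target, factor, order=1)
        if v is not None:                           # resample the coarse velocity to the new grid
            scale = [s_new / s_old for s_new, s_old in zip(src_l.shape, v.shape[:-1])]
            v = np.stack([zoom(v[..., c], scale, order=1) for c in range(v.shape[-1])], axis=-1)
        v = register_level(src_l, tgt_l, v)         # run PDE-LDDMM on this level
    return v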
To integrate the PDEs, we used the semi-Lagrangian Runge–Kutta schemes proposed in [33] for the SSD versions of Variants I and II. The solutions were computed at the Chebyshev–Gauss–Lobatto discretization of the temporal domain . The number of time steps was set to five. Since Matlab lacks a 3D GPU cubic interpolator, we implemented in a CUDA MEX file the GPU cubic interpolator with prefiltering proposed in [65].
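For reference, the Chebyshev–Gauss–Lobatto nodes mapped to a unit temporal domain [0, 1] (the unit interval is assumed here, since the paper's temporal domain is elided above) can be generated as:

import numpy as np

def chebyshev_gauss_lobatto(n_steps=5, t0=0.0, t1=1.0):
    # CGL nodes cos(k*pi/N) on [-1, 1], affinely mapped to [t0, t1] and sorted.
    k = np.arange(n_steps + 1)
    nodes = np.cos(np.pi * k / n_steps)
    return np.sort(0.5 * (t1 - t0) * (nodes + 1.0) + t0)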
The computation of differentials was approached using Fourier spectral methods as a machine-precision-accurate alternative to the commonly used finite-difference approximations [66]. Spectral methods allow solving ODEs and PDEs with high accuracy in simple domains for problems involving smooth data. To this end, the images were smoothed with a Gaussian filter as a preprocessing step. However, for the Gauss–Newton–Krylov version of NGFs, we used the matrix version of the differential operators, and then the computation of differentials must be approached with finite-difference approximations. To be consistent with the input data, the images were also smoothed as a preprocessing step.
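A one-dimensional sketch of spectral differentiation, assuming a periodic unit domain and numpy's FFT conventions (the actual implementation works on 3D GPU arrays):

import numpy as np

def spectral_derivative(f, length=1.0):
    # First derivative of a periodic, uniformly sampled signal via the FFT.
    n = f.shape[0]
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)   # angular wavenumbers
    return np.real(np.fft.ifft(1j * k * np.fft.fft(f)))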
For lNCC, the size of the neighborhood was set to four. For NGFs, the value of for the -norms was equal to 1000. For MI, the number of histogram bins was set to 16. The computation of the adjoint variable for MI required the use of sparse matrices and was implemented on the CPU, since Matlab does not yet have GPU support for these data structures.
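As an illustration of the MI metric with 16 bins (a sketch based on a hard joint histogram; the differentiable estimator and the sparse-matrix adjoint computation used in the paper are not reproduced here):

import numpy as np

def mutual_information(warped, target, bins=16):
    # Mutual information from a bins-by-bins joint intensity histogram.
    joint, _, _ = np.histogram2d(warped.ravel(), target.ravel(), bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0                                   # avoid log(0)
    return np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz]))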
4.4. Benchmarks
For benchmarking, we ran single- and multi-resolution versions of ANTS registration with SSD, lNCC, and MI image similarities [40]. We also extended Stationary LDDMM (St-LDDMM), proposed in [67] as an efficient stationary variant of Beg et al.'s LDDMM [56], with the NCC, lNCC, NGFs, and MI metrics. The details of the method extension can be found in Appendix A. In addition, we studied the accuracy obtained with QuickSilver, a supervised deep-learning-based method with SSD in the loss function [46], and VoxelMorph, an unsupervised deep-learning-based model with SSD and NCC metrics in the loss function [50].
St-LDDMM was run with the same parameters as PDE-LDDMM in the common steps of the algorithms. ANTS was run with the following parameters for the single-resolution experiments: $synconvergence = “[50,,10]”, $synshrinkfactors = “1”, and $synsmoothingsigmas = “3vox”. For the multi-resolution experiments, the parameters were set to $synconvergence = “[50×50×50,,10]”, $synshrinkfactors = “4×2×1”, and $synsmoothingsigmas = “3×2×1vox”. The selection of the number of iterations was in agreement with the number of iterations used in gradient-descent and the number of outer × inner iterations used in Gauss–Newton–Krylov optimization for PDE-LDDMM.
In the multi-modal experiment, we compared our proposed methods with NiftyReg, a software package for efficient registration developed at the Centre for Medical Image Computing, University College London, UK (https://sourceforge.net/projects/niftyreg/, accessed on 1 January 2022). NiftyReg is usually selected as a benchmark for non-rigid multimodal image registration. We also included in the comparison ANTS and SynthMorph, a VoxelMorph adaptation for building deep-learning models capable of dealing with multimodality [53].
4.5. Metrics and Statistical Analysis for Registration Evaluation
The evaluation in NIREP16 and the Klein datasets is based on the accuracy of the registration results for template-based segmentation. The Dice Similarity Coefficient (DSC) is selected as the evaluation metric. Given two segmentations S and T, the DSC is defined as

DSC(S, T) = 2 |S ∩ T| / (|S| + |T|).

This metric provides a value of one if S and T exactly overlap and gradually decreases towards zero depending on the overlap of the two volumes. The statistical distribution of the DSC results across the segmented structures is shown in the shape of box and whisker plots following the evaluation methods in [42]. The DSC distribution is taken over the DSC values of the different segmentation labels. This way of computing the DSC distribution follows the recommendation given in [68] that evaluating non-rigid registration on the segmentation of sufficiently local labeled regions of interest is needed to obtain reliable measurements of the performance of the registration.
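The per-label computation behind these distributions can be sketched as follows (assuming integer label volumes; the label conventions are those of each dataset):

import numpy as np

def dice_per_label(seg_warped, seg_target, labels):
    # One DSC value per segmentation label; the box plots summarize these values.
    scores = {}
    for label in labels:
        s = seg_warped == label
        t = seg_target == label
        denom = s.sum() + t.sum()
        scores[label] = 2.0 * np.logical_and(s, t).sum() / denom if denom > 0 else np.nan
    return scores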
The evaluation in NIREP16 was completed with two different statistical analyses of the DSC values. In the first place, an analysis of variance (ANOVA) was conducted in order to assess whether the means of the DSC distributions differ across image similarity metrics when observations are grouped by type of method (baseline vs. PDE-LDDMM) or variant (I vs. II). Baseline methods include ANTS, Stationary LDDMM, VoxelMorph, and QuickSilver. In the second place, pairwise right-tailed Wilcoxon rank-sum tests were conducted for the assessment of the statistical significance of the difference of medians of the distributions of the DSC values. The alternative hypothesis is that the median of the first distribution is higher than the median of the second one.
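A sketch of these two tests, assuming the per-label DSC values are collected in arrays and using SciPy (version 1.7 or later for the one-sided rank-sum test); the exact grouping and any multiple-comparison handling of the paper are not reproduced:

import numpy as np
from scipy.stats import f_oneway, ranksums

# Placeholder data: per-label DSC values (32 NIREP16 structures) for three metrics of one method group.
dsc_by_metric = {"SSD": np.random.rand(32), "NCC": np.random.rand(32), "lNCC": np.random.rand(32)}

# One-way ANOVA on the effect of the metric within a group of methods.
f_stat, p_anova = f_oneway(*dsc_by_metric.values())

# Right-tailed Wilcoxon rank-sum test: is the median DSC of NCC higher than that of SSD?
stat, p_wilcoxon = ranksums(dsc_by_metric["NCC"], dsc_by_metric["SSD"], alternative="greater")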
Finally, we include for NIREP16 the quantitative assessment provided by the mean and standard deviation of the relative image similarity error after registration, the relative gradient magnitude, and the extrema of the Jacobian determinant.
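A common convention for these quantities, assumed here only for concreteness (the paper's exact normalization may differ), is

MSE_rel = ‖m(1) − I_1‖² / ‖I_0 − I_1‖²,    g_rel = ‖g_end‖_∞ / ‖g_0‖_∞,

i.e., the final image mismatch relative to the initial mismatch and the final gradient magnitude relative to the gradient at the first iteration; the Jacobian extrema are the minimum and maximum of det(Dφ) over the image domain.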
5. Results
In this section, we show the experiments conducted to evaluate the performance of the two PDE-LDDMM variants for the different image similarity metrics. First, we provide an extensive evaluation of our proposed methods in the NIREP16 database, where we have extensively evaluated previous LDDMM and PDE-LDDMM registration methods [20,22,33,67]. Next, we evaluate our proposed methods in the Klein et al. databases. Finally, we compare the behavior of the different metrics in a challenging multimodal experiment.
5.1. Results on NIREP16
5.1.1. Evaluation
Figure 1 shows, in the shape of box and whisker plots, the statistical distribution of the DSC values obtained after registration across the 32 segmented structures. In addition, Figure 2 gathers the results obtained with Gauss–Newton–Krylov optimization grouped by variant for a better assessment of the best-performing combination of variant and metric. Our first observation is that the DSC coefficients for the multi-resolution experiments outperform the single-resolution experiments. The improvement is substantial for the lNCC, NGFs, and MI metrics.
Single-resolution. Regarding the single-resolution experiments, the low performance of ANTS for all metrics is striking. Both St-LDDMM and PDE-LDDMM perform more reasonably than ANTS. In general, PDE-LDDMM methods tend to outperform St-LDDMM.
For PDE-LDDMM and SSD, the differences between Gauss–Newton and gradient-descent are small. Gauss–Newton optimization significantly outperforms gradient-descent for NCC and lNCC. On the contrary, for NGFs, gradient-descent optimization outperforms Gauss–Newton in all cases. The spatial version of Variant II exhibits an especially lower performance. For St-LDDMM, the trends observed in PDE-LDDMM are also observed for SSD and NCC metrics. However, gradient-descent performs similarly to Gauss–Newton for lNCC and Gauss–Newton outperforms gradient descent for the NGFs metric.
The results obtained with the spatial versions of the PDE-LDDMM variants are similar to the corresponding BL versions. Comparing the accuracy of both variants, Variant II provides in general better performance than Variant I.
For all variants, the overall best performing metric is NCC. For almost all variants, the gradient-descent version of NGFs performs similarly to the Gauss–Newton version of lNCC. For the BL methods, the Gauss–Newton version of NGFs also performs similarly to the gradient-descent version of SSD. Moreover, the gradient-descent version of MI performs similarly to the gradient-descent version of SSD.
Multi-resolution. Regarding the multi-resolution experiments, the performance of ANTS reached the level of accuracy of St-LDDMM and the PDE-LDDMM methods. ANTS with the lNCC metric greatly outperformed the other ANTS variants using SSD and MI, ranking among the best-performing methods. In general, PDE-LDDMM methods tend to outperform St-LDDMM, with the exception of the NGFs metric.
As happened with the single-resolution experiments for PDE-LDDMM, the differences between Gauss–Newton optimization and gradient-descent are small for the SSD metric. Gauss–Newton also outperforms gradient-descent for NCC and lNCC. For NGFs, gradient-descent optimization greatly outperforms Gauss–Newton in the case of Variant II. However, for Variant I, the differences between both optimization methods are small, especially for the BL version of the methods. The performance of the NGFs metric is further explored in Appendix B for a better understanding of these observations. For St-LDDMM, the trends observed with the single-resolution experiments are mostly maintained.
For Variant I, the spatial version tends to slightly outperform the BL version of the same variant. However, the performance of the BL version of Variant II is similar to or even better than that of the spatial version for almost all metrics. As happened with the single-resolution experiments, Variant II provides better performance than Variant I. The best-performing metric for Variant I is lNCC, while for Variant II, the best-performing metric is still NCC, closely followed by lNCC. The resemblance in performance between the MI and SSD metrics in the single-resolution experiments remains for the multi-resolution experiments. However, the excellent performance of the NGFs metric with gradient-descent optimization is remarkable, ranking close to the best-performing metrics for Variant II.
Comparing ANTS with PDE-LDDMM methods, Variant I with SSD and gradient-descent performs similarly to ANTS-SSD. In the case of MI, PDE-LDDMM methods outperform ANTS-MI. Some PDE-LDDMM methods achieve results competing with ANTS-lNCC for the NCC and lNCC metrics.
Deep-learning methods. Because of the increasing relevance of deep-learning methods in the field, we added to our evaluation the performance of VoxelMorph [50] and QuickSilver [46]. VoxelMorph with SSD and QuickSilver with the correction step performed similarly to Variant II of PDE-LDDMM with the SSD metric. Diffeomorphic VoxelMorph with SSD ranked among the best-performing methods, with a box-plot distribution similar to Variant I with lNCC and BL Variant II with NCC and lNCC. Although all LDDMM methods agree on the much better performance of the NCC and lNCC metrics over SSD, Diffeomorphic VoxelMorph trained with a loss function based on SSD greatly outperformed the method trained with NCC. Lastly, it is remarkable that, although VoxelMorph is informed of the performance during training through the DSC, our best-performing PDE-LDDMM methods were able to achieve competitive results without the use of this information.
Statistical analysis. Table 1 shows the results of the analysis of variance (ANOVA) for the effects of method and image similarity metric selection on the distribution of the DSC values obtained in the multi-resolution experiments with Gauss–Newton optimization (with the exception of the methods combined with the MI metric). The methods in the first factor were grouped by type of method (baseline vs. PDE-LDDMM) and by variant. The tests only showed no statistical significance for the differences between the spatial versions of Variant I and Variant II. The selection of the metric resulted in statistical significance for all cases.
Figure 3 shows the p-values of pairwise right-tailed Wilcoxon rank-sum tests for the distribution of the DSC values obtained in the multi-resolution experiments with Gauss–Newton optimization (with the exception of the methods combined with the MI metric). The figure shows statistical significance for the better performance of the NCC and lNCC metrics over SSD and MI. For NGFs, statistical significance depends on the method. Among the best-performing methods, no statistical significance was found for the difference of medians.
5.1.2. Quantitative Assessment
Table 2 shows, averaged over the number of experiments, the mean and standard deviation of the relative image similarity error, the relative gradient, and the extrema of the Jacobian determinant obtained with PDE-LDDMM in the NIREP16 dataset. We restrict the results to the methods with Gauss–Newton–Krylov optimization, with the exception of the methods with the MI metric. For NGFs, the results with gradient descent and different Gauss–Newton approximations are analyzed in depth in Appendix B. Table 3 shows the average relative error values and the extrema of the Jacobian determinant for VoxelMorph and QuickSilver.
For the single-resolution experiments, the best relative error values were obtained by the NCC metric, followed by SSD. Although the relative error values for the lNCC and MI metrics ranged higher than , their performance in the evaluation reported a similar distribution. For lNCC, NGFs, and MI, the correlation between the lowest relative error values and the highest DSC results that is usually seen for SSD in previous works does not hold anymore [22,33].
The spatial methods slightly outperformed the BL methods in terms of the relative error values, as expected. Variant II performed better than Variant I. The relative gradient was reduced to average values ranging from to , except for the lNCC and NGFs metrics. This means that the optimization was stopped at acceptable energy values in all these cases. Although the relative gradient obtained with lNCC was higher than recommended, the corresponding DSC distributions indicate that the lNCC methods can reach a local minimum providing good registration results. All the Jacobians remained above zero.
For the multi-resolution experiments, the results regarding the relative error values and the Jacobians were consistent with the single-resolution experiments. However, the high values of the relative gradient indicate a stagnation of the convergence at the finest resolution level, which may be due to the method already starting close to the convergence point at the beginning of this resolution level.
Both VoxelMorph and QuickSilver usually obtained relative error values greater than PDE-LDDMM with the corresponding image similarity metric. The magnitude of the Jacobian extrema obtained by VoxelMorph and its diffeomorphic version is striking, indicating that the accuracy of the registration results shown in Figure 2 is obtained through large folds in a considerable number of locations.
Figure 4 shows the evolution of the convergence curves for the image similarity metrics in the single-resolution experiments. For all the metrics, the trend of the energy values is decreasing. The most unexpected behavior is for the curves of the lNCC metric, where the standard deviation remains stable and large in comparison with the energy reduction. The curves of the NGFs metric show the stagnation of the energy values for the BL variants. This is the cause of the low DSC distributions already shown in Figure 1.
Spatial methods show slightly better energy values than BL methods. Comparing the variants, Variant II provides slightly lower values than Variant I. These results are consistent with the evaluation and the quantitative assessment shown in Figure 1 and Table 2.
5.1.3. Qualitative Assessment
For a qualitative assessment of the proposed registration methods, we show the registration results obtained by the different metrics for the BL version of Variant II in the multi-resolution experiments. Figure 5 shows the warped images, the difference between the warped and the target images after registration, the velocity fields, and the logarithm of the Jacobian determinant. The resemblance of the differences between the warped and the target images was high for all the metrics except for NGFs. Focusing on the registration results at the ventricle, SSD and NCC achieved the best compression of the structure, while NGFs obtained the worst registration results at this location.
Figure 6 shows the warped images, the difference between the warped and the target images, the displacement fields, and the logarithm of the Jacobian determinant for VoxelMorph. The resemblance of the differences between the warped and the target images was higher for SSD than NCC. The displacement fields were visually less smooth than the velocity fields obtained with PDE-LDDMM. The Jacobian determinant had negative regions all over the image. In particular, the registration results at the ventricle were achieved through large expansions and foldings in its upper boundary.
5.1.4. Computational Complexity
Table 4 shows the average and standard deviation of the total computation time and the peak VRAM memory reached through the computations in the NIREP16 database for the single-resolution experiments. The BL methods achieved a substantial time and memory reduction over the spatial methods, as already demonstrated in [22,33,39]. Among the Gauss–Newton methods, the methods with the SSD and NCC image similarity metrics were the most efficient ones, as expected. On the other side, the methods with MI were the most time-consuming ones. Regarding memory usage, the methods using SSD and NCC were more efficient than lNCC. The memory efficiency shown by the NGFs and MI metrics was due to the combination with gradient-descent and the need to perform operations involving sparse matrices on the CPU.
5.2. LPBA40, IBSR18, CUMC12, and MGH10 Evaluation Results
Figure 7 shows the statistical distribution of the DSC values obtained with PDE-LDDMM for the Klein databases [42]. As a benchmark, we include the results reported in [42] for affine registration (FLIRT) and three diffeomorphic registration methods: Diffeomorphic Demons, SyN, and Dartel. We also include the results of QuickSilver and VoxelMorph.
For LPBA40, Variant I with lNCC and Variant II with metrics from SSD to NGFs reached a performance similar to SyN, with many outliers significantly reduced. For each metric, Variant II outperformed the corresponding Variant I. The worst-performing results were consistently achieved by NGFs with Gauss–Newton–Krylov optimization. QuickSilver performed slightly better than the SSD and NCC versions of Variant I. VoxelMorph was the worst-performing method for all metrics.
For IBSR18, SP and BL Variant I with lNCC, SP Variant II with NCC, and BL Variant II with NCC and lNCC were the best-performing PDE-LDDMM methods. Their performance was slightly above that exhibited by QuickSilver and greatly above that obtained with VoxelMorph. However, in all cases, these methods underperformed SyN and Dartel.
For CUMC12, the best-performing PDE-LDDMM methods were SP and BL Variant I with lNCC and Variant II with NCC and lNCC. As happened with IBSR18, these methods slightly outperformed QuickSilver while greatly outperforming VoxelMorph. The low performance of the BL variants with NGFs and Gauss–Newton–Krylov optimization is remarkable. All the methods underperformed SyN and Dartel.
Finally, for MGH10, the best performance was achieved by Variants I and II with the lNCC similarity metric. The low performance of Variant I with NGFs is remarkable, with gradient descent underperforming the Gauss–Newton–Krylov optimizers. In this case, the methods underperformed SyN, while the best-performing methods showed a DSC distribution similar to Dartel. QuickSilver and VoxelMorph achieved performance similar to the SSD version of Variant I.
These results corroborate the better performance of Variant II over Variant I obtained in the evaluation with NIREP16 for the majority of metrics. The lNCC metric is positioned as the best-performing one for the majority of methods and databases. The NGFs metric has shown better performance with gradient-descent optimization in the great majority of experiments. The best PDE-LDDMM combinations of variants and metrics outperformed the deep-learning-based methods in all the datasets.
Regarding the consistent outperformance of SyN and Dartel over all the considered methods, we found out that SyN used a probabilistic image similarity metric while Dartel used tissue probability maps as inputs. The images in IBSR18, CUMC12, and MGH10 have low contrast, and, therefore, the algorithmic choices made by SyN and Dartel overcome the difficulty of these challenging inputs. We have also seen that performing histogram equalization for contrast enhancement, as in the original QuickSilver paper [46], improved the evaluation results, reaching SyN and Dartel performance. However, this preprocessing reduces the influence of the used metrics on the obtained DSCs and provides less informative results.
7. Discussion and Conclusions
In this work, we have presented a unifying framework for introducing different image similarity metrics in the two best-performing variants of PDE-LDDMM with Gauss–Newton–Krylov optimization [22,33]. From the Lagrangian variational problem, we have identified that the change in the image similarity metric involves changing the initial adjoint and the initial incremental adjoint variables. We derived the equations of these variables for NCC, its local version (lNCC), NGFs, and MI. PDE-LDDMM with Gauss–Newton–Krylov optimization was successfully extended from SSD to the NCC and lNCC image similarity metrics. For NGFs, the method was not able to surpass gradient-descent optimization. With MI, the computation of the Hessian-matrix product required the product of dense matrices demanding more than 5000 GB of memory, thus becoming far from feasible. Therefore, we obtained varying degrees of success in our initial objective.
The evaluation performed in the NIREP16 database has shown the superiority of Variant II with respect to Variant I, as happened in [22,33]. In addition, the results reported for the BL version of Variant II were statistically indistinguishable from the SP (spatial) version. For any image similarity metric, BL Variant II surpassed the baseline established by ANTS. For BL Variant II, NCC and its local version were the best-performing metrics, closely followed by the gradient-descent version of NGFs. The superiority of these metrics was statistically significant. The outperformance of lNCC was quantified for the first time for ANTS diffeomorphic registration with gradient-descent and LPBA40 in [36]. Our best-performing variants surpassed QuickSilver, a supervised deep-learning method for diffeomorphic registration. In addition, they provided competitive results when compared with VoxelMorph, with the added value of PDE-LDDMM being agnostic to the evaluation metric and providing purely diffeomorphic solutions.
The relative error values were in agreement with the DSC distributions obtained with NCC and SSD. However, for lNCC, NGFs, and MI, the correlation between the relative error values and the DSC usually seen for SSD in previous works does not hold anymore.
The experiments with the Klein databases corroborated the superiority of Variant II over Variant I for almost all the metrics. The evaluation in LPBA40 has shown how PDE-LDDMM based on the deformation state equation performs similarly to SyN for the majority of metrics with a reduced number of outliers. The evaluation in the IBSR18, CUMC12, and MGH10 datasets has consistently shown lNCC as the best-performing metric for PDE-LDDMM. It is striking that the optimum DSC values greatly vary depending on the dataset used for evaluation. For example, SyN obtains an average DSC value greater than 0.7 for LPBA40, while the average DSC value is close to 0.5 for the IBSR18, CUMC12, and MGH10 data. We believe that the disparity of the obtained DSC values depends on the geometry of the anatomies involved in each dataset, which may downgrade the overall accuracy.
Although not being able to report functional Gauss–Newton–Krylov PDE-LDDMM methods for NGFs and MI has been disappointing, it encourages us to embed PDE-LDDMM into different optimization methods that compete with gradient-descent as Gauss–Newton–Krylov does for the SSD, NCC, and lNCC metrics. In future work, we will address the problem with limited-memory BFGS or, in the framework of Krylov subspace methods, with the generalized minimal residual method (GMRES).
Our method has shown visually acceptable registration results on a challenging multi-modal intra-subject experiment. The results were competitive with SynthMorph, a deep-learning method that uses a loss function based on the DSC of image segmentations. The experiment highlighted the differences among the combinations of optimization methods and metrics. In future work, we will explore in depth the influence of metric and optimization selection on the accuracy of multi-modal registration.
Despite the methodological improvements that have been subsequently proposed in PDE-LDDMM for efficiency (Gauss–Newton–Krylov optimization, band-limited parameterization, and semi-Lagrangian Runge–Kutta integration), our PDE-LDDMM methods compute a diffeomorphism in a volume of size in one to five minutes, depending on the variant and the metric. This may be considered an unacceptable amount of time in comparison with modern deep-learning approaches, where inference takes about one second. However, the time and resources needed for training are not usually considered in the comparison, while they should be at least apportioned. In addition, deep-learning methods are not memory efficient, while our proposed methods run on a commodity graphics card with a VRAM of less than 4 GB.
BL Variant II with SSD, NCC, and lNCC has recently been included as the diffeomorphic normalization step in the pipeline of Spasov et al. [69] for the prediction of stable vs. progressive mild cognitive impairment (MCI) conversion in Alzheimer's disease with multi-task learning and Convolutional Neural Networks [70]. PDE-LDDMM surpassed ANTS-lNCC for this task in terms of accuracy, sensitivity, and specificity. ANTS-lNCC obtained a median accuracy of 84%, a sensitivity of 88%, and a specificity of 81%. Variant II with NCC achieved the best-performing accuracy, with a median value of 89%, and sensitivity and specificity values among the best ones, with median values of 94% and 91%, respectively. Indeed, NCC surpassed the lNCC metric in this task, despite the comparable performance achieved by both metrics in the template-based evaluation presented in this work. As future work, we will perform a comprehensive study to find out the reasons behind the improved performance of a given configuration with respect to the others.
Our PDE-LDDMM method may serve as a benchmark for the exploration of different image similarity metrics in the loss functions of deep-learning methods. In addition, it may be a good candidate in applications where there are not enough data to generate accurate learning-based models. Even more, it may be used as the backbone of hybrid approaches that combine traditional with modern learning-based models, which have been pointed out as one promising research direction [55].