**1. Introduction**

In recent years, non-rigid three-dimensional (3D) multi-modal medical image registration has attracted significant attention [1–4]. This mainly stems from two aspects. Firstly, different 3D imaging modalities are often fused to produce a precise diagnosis, since they provide complementary information for interpreting anatomy, tissues, and organs. As a necessary prerequisite for image fusion, multi-modal medical image registration is significant in relating clinically significant information across the different images. However, the relationship of intensity values in multi-modal 3D medical images can be highly complicated due to differences between the imaging principles, which makes it difficult to construct an appropriate similarity measure. Secondly, non-rigid deformation generally cannot be ignored for soft organs that deform easily. Accordingly, a non-rigid transformation must be used as the deformation model in non-rigid multi-modal medical image registration. However, the non-rigid transformation often involves numerous parameters, which renders accurate image registration difficult [5–8]. Therefore, non-rigid multi-modal 3D medical image registration has become a challenging task [9–12].

The grey information and spatial information of 3D images are generally considered together in order to construct a suitable similarity measure for non-rigid multi-modal 3D medical image registration. The typical similarity measure construction method is to combine mutual information (MI) with spatial information [13]. Rueckert et al. [14] proposed using the second-order MI to encode the local information by considering both the intensity information and the structural information of images. However, this method needs a four-dimensional (4D) histogram to calculate the similarity measure, so the number of grey levels cannot be too large if the curse of dimensionality of high-dimensional histograms is to be avoided. Pluim et al. [15] put forward a method that combines the normalized MI with the gradient amplitude and direction for rigid multi-modal image registration. Loeckx et al. [16] presented an image registration method based on conditional MI, which adopts a 3D joint histogram incorporating the grey levels and the spatial information distribution of the reference and float images. However, MI for 3D images is itself computationally expensive, and its combination with spatial information further increases the computational complexity of the above registration methods.

The structural representation methods have been presented to more effectively measure the similarity between different images [17–22]. Wachinger et al. [17] presented the entropy images based structural representation method, in which the entropy images are produced by calculating the histogram of image blocks, and the sum of squared differences (SSD) on the entropy images is then used as the similarity measure for image registration. As this method tends to produce blurred entropy images, it cannot ensure satisfactory registration results. Heinrich et al. [18] proposed a modality independent neighborhood descriptor (MIND) for non-rigid multi-modal image registration. Based on the concept of image self-similarity introduced in non-local means image denoising, MIND first extracts distinctive, multi-dimensional features based on the intensity differences within a search region around each voxel in each modality. Subsequently, the SSD between the MIND representations of two images is used as the similarity metric within a standard non-rigid registration framework. Although MIND is robust to non-functional intensity relations and image noise, it cannot provide an effective structural representation for complicated medical images with weak, discontinuous, and complex details, because it only utilizes the similarity of image intensities. The self-similarity context (SSC) descriptor, an improved version of MIND, was proposed in [19]. The SSC descriptor was designed to capture the context around the voxel of interest, and the point-wise distance between SSC descriptors was used as the metric for deformable registration on a minimum spanning tree with dense displacement sampling (deeds) [20]. Zhu et al. [21] explored a self-similarity inspired local descriptor for structural representation based on the low-order Zernike moments, which are robust to image noise. However, this method does not work well for ultrasound (US) images and positron emission tomography (PET) images with blurred features, because it ignores the high-order Zernike moments, which have better feature representation ability than the low-order ones.

Apart from the difficulty in constructing the similarity measure, the solution of the deformation parameters involved in the transformation model, as a high-dimensional optimization problem, is a very challenging task for non-rigid multi-modal 3D medical image registration. One approach to this problem is to use local optimization methods (e.g., the L-BFGS-B method [23]), global optimization methods (e.g., evolutionary strategies [24] and particle swarm optimization (PSO) [25]), or combined methods (e.g., the hybrid L-BFGS-B and cat swarm optimization (HLCSO) method [26]). However, these methods cannot produce satisfactory registration results for high-dimensional optimization problems. Another popular approach is to reduce the dimension of the transformation parameters by using knowledge-based geometric transform models [27–30]. These methods require sufficient understanding of the material properties of organs or tissues to establish a suitable geometric transform. However, some organs and tissues are so complicated that the existing methods cannot accurately characterize their material properties. Meanwhile, determining the geometry and the boundary conditions requires an accurate segmentation of the anatomy in medical images, which is itself a very challenging task. Some alternative methods can be adopted to address this problem. For example, by means of a mask image, the areas in the images that involve no non-rigid deformation can be covered up to reduce the number of deformation field variables involved in the optimization process. However, the shape of such areas is often irregular, which leads to difficulty in accurately determining the mask image.

We have proposed a novel registration method using an improved modality independent neighborhood descriptor (MIND) based on the foveated nonlocal self-similarity to address these problems in the construction of the similarity measure and the solution of the non-rigid transformation parameters. The contributions of our work lie in two aspects. On the one hand, we have designed the foveated MIND (FMIND) for the effective structural representation of 3D medical images, thereby ensuring accurate image registration. On the other hand, a spatial constraint method based on the FMIND is proposed and introduced into the Markov random field (MRF) optimization to reduce the number of non-rigid transformation parameters and to restrict the calculation of the energy function to the image regions involving local non-rigid deformation, thereby ensuring efficient image registration. Extensive experiments on multi-modal medical images demonstrate that the proposed method provides higher registration accuracy (except for computed tomography-positron emission tomography (CT-PET) images) and higher computational efficiency than the other evaluated registration methods.

### **2. Methods**

### *2.1. The Framework of the FMIND Based Image Registration Method*

Figure 1 shows the flowchart of the proposed image registration based on the FMIND. Firstly, the FMIND is constructed based on the foveated nonlocal self-similarity and applied to the reference image *IR* and the float image *IF* to produce the corresponding structural representations FMIND(*IR*) and FMIND(*IF*), respectively. Afterwards, the objective function, i.e., the energy function, is established based on the free-form deformation (FFD) model and the similarity measure defined as the sum of absolute differences (SAD) between FMIND(*IR*) and FMIND(*IF*). Finally, the FMIND based spatial constraint is introduced to produce the mask image for the MRF discrete optimization. During the iterative optimization, the deformation vector, i.e., the vector of parameters defining the deformation field, is updated at each iteration. The final optimal deformation vector *T*' is obtained once the optimization procedure terminates, and it is utilized to produce the registration result.

**Figure 1.** Flowchart of the foveated modality independent neighborhood descriptor (FMIND) based image registration.

### *2.2. The Foveated Modality Independent Neighborhood Descriptor*

The FMIND is presented based on the characteristics of the human visual system (HVS). In the HVS, the distribution of cone cells is uneven: the fovea has the highest density, and the cell density falls off rapidly with distance from the fovea. The optic nerve cells have similar characteristics. Therefore, when we fixate on a point in an image, this point is perceived with the highest sensitivity, and the sensitivity drops with increasing distance from the point. Inspired by these characteristics of the HVS, Foi et al. [31] proposed calculating the patch similarity based on the Euclidean distance $d^{\text{FOV}}$ between the foveated patches, defined as:

$$d^{\text{FOV}}(I, \mathbf{x}_1, \mathbf{x}_2) = \|I_{\mathbf{x}_1}^{\text{FOV}} - I_{\mathbf{x}_2}^{\text{FOV}}\|_2^2 \tag{1}$$

where $I_{\mathbf{x}_1}^{\text{FOV}}$ and $I_{\mathbf{x}_2}^{\text{FOV}}$ denote the foveated patches obtained by foveating the image *I* at the two fixation points $\mathbf{x}_1$ and $\mathbf{x}_2$. By applying the foveation operator *F* to the image *I*, the foveated patch $I_{\mathbf{x}}^{\text{FOV}}$ is produced as:

$$I_{\mathbf{x}}^{\text{FOV}}(u) = F[I, \mathbf{x}](u), \quad u \in S \tag{2}$$

where *u* denotes the location of any pixel in the foveated patch domain *S*. The foveation operators designed in [31] mainly include the isotropic and anisotropic foveation operators. As the latter describes image edges and textures better than the former, it is used as the foveation operator here. This operator is defined as:

$$F_{\rho,\theta}[I, \mathbf{x}](u) = \sum_{\xi \in \mathbb{Z}^2} I(\xi + \mathbf{x})\, v_u^{\rho,\theta}(\xi - u), \quad \forall u \in S \tag{3}$$

where $v_u^{\rho,\theta}$ denotes the blur kernel, which is constructed from the elliptical Gaussian probability density function (PDF); $\rho$ determines the elongation of the Gaussian PDF and $\theta$ denotes the angular offset. The blur kernel $v_u^{\rho,\theta}$ is defined as [31]:

$$v_u^{\rho,\theta} = \begin{cases} \sqrt{\kappa(u)}\, g_{\frac{1}{2\sqrt{\pi}}\sqrt{\kappa(0)/\kappa(u)}}^{\rho,\, \angle u + \theta}, & u \neq 0 \\ \sqrt{\kappa(0)}\, g_{\frac{1}{2\sqrt{\pi}}}, & u = 0 \end{cases} \tag{4}$$

where $g_{\sigma}^{\rho,\angle u + \theta}$ denotes the elliptical Gaussian PDF with standard deviation $\sigma$, elongation $\rho$, and axis orientation determined by $\angle u + \theta$; $\kappa(0)$ and $\kappa(u)$ are normalization factors; and $g_{\frac{1}{2\sqrt{\pi}}}$ denotes the Gaussian PDF with standard deviation $\frac{1}{2\sqrt{\pi}}$.

Figure 2 gives an example of two anisotropic foveation operators, where *S* is a 7 × 7 foveated patch, θ = 0, and the kernel elongation parameters ρ = 2 and ρ = 6 are used, respectively. Clearly, the radial design of these anisotropic foveation operators is consistent with the features of the HVS, which leads to an effective structural representation of images for the FMIND.

**Figure 2.** Anisotropic foveation operators with a 7 × 7 foveated patch and θ = 0. (**a**) ρ = 2; and, (**b**) ρ = 6.
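To make the anisotropic foveation operator concrete, the following sketch builds a foveated patch by blurring each pixel with a radially oriented elliptical Gaussian whose spread grows with the distance from the fixation point. This is a minimal 2D illustration under stated assumptions: the growth law `sigma = base * (1 + dist)` and the unit-sum kernel normalization are simplifications, not the exact κ(u)-based kernels of Equation (4).

```python
import numpy as np

def elliptical_gaussian(size, sigma_u, sigma_v, angle):
    """Elliptical Gaussian kernel with axis stds sigma_u, sigma_v rotated by angle."""
    half = size // 2
    yy, xx = np.mgrid[-half:half + 1, -half:half + 1]
    # rotate coordinates into the kernel's principal axes
    ur = xx * np.cos(angle) + yy * np.sin(angle)
    vr = -xx * np.sin(angle) + yy * np.cos(angle)
    g = np.exp(-0.5 * ((ur / sigma_u) ** 2 + (vr / sigma_v) ** 2))
    return g / g.sum()   # normalize to unit sum (assumption)

def foveated_patch(image, x, patch=7, rho=2.0, theta=0.0):
    """Foveated patch I_x^FOV: pixel u is the image at x+u blurred by a
    radially oriented elliptical Gaussian whose spread grows with |u|."""
    half = patch // 2
    base = 1.0 / (2.0 * np.sqrt(np.pi))   # std at the fixation point, as in Eq. (4)
    out = np.zeros((patch, patch))
    for du in range(-half, half + 1):
        for dv in range(-half, half + 1):
            dist = np.hypot(du, dv)
            sigma = base * (1.0 + dist)          # illustrative growth law
            ang = np.arctan2(du, dv) + theta     # radial orientation, angle of u plus offset
            # isotropic kernel at the fixation point, elongated elsewhere
            sig_u = rho * sigma if dist > 0 else sigma
            k = elliptical_gaussian(patch, sig_u, sigma, ang)
            cy, cx = x[0] + du, x[1] + dv
            region = image[cy - half:cy + half + 1, cx - half:cx + half + 1]
            out[du + half, dv + half] = float(np.sum(region * k))
    return out
```

Since every kernel is normalized to unit sum, a constant image is reproduced exactly, while textured regions are blurred progressively more toward the patch border.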

*Sensors* **2019**, *19*, 4675

Borrowing the idea of self-similarity from non-local means denoising, we propose the FMIND based on the foveated nonlocal self-similarity between different image patches in the same image. The FMIND is expressed as:

$$\text{FMIND}(I, \mathbf{x}, r) = \frac{1}{n} \exp\left(-\frac{d^{\text{FOV}}(I, \mathbf{x}, \mathbf{x} + r)}{V^{\text{FOV}}(I, \mathbf{x})}\right), \quad r \in R \tag{5}$$

where *R* denotes a search window centered at $\mathbf{x}$; $d^{\text{FOV}}(I, \mathbf{x}, \mathbf{x} + r)$ denotes the distance between the foveated image patches $I_{\mathbf{x}}^{\text{FOV}}$ and $I_{\mathbf{x}+r}^{\text{FOV}}$; *n* is a normalization constant ensuring that the maximum of FMIND(*I*, $\mathbf{x}$, *r*) is 1; and $V^{\text{FOV}}(I, \mathbf{x})$ denotes the variance of the foveated image patch $I_{\mathbf{x}}^{\text{FOV}}$ centered at $\mathbf{x}$ in the image *I*, which controls the attenuation degree of the function in Equation (5). The variance $V^{\text{FOV}}(I, \mathbf{x})$ is estimated as the mean of the foveated distances over all pixels in the foveated patch *S*:

$$V^{\text{FOV}}(I, \mathbf{x}) = \frac{1}{|S|} \sum_{m \in S} d^{\text{FOV}}(I, \mathbf{x}, \mathbf{x} + m) \tag{6}$$

where |*S*| denotes the number of pixels in *S*.

By means of the FMIND, the structural information around the pixel $\mathbf{x}$ in the image *I* is described by a one-dimensional vector of size |*R*|, where |*R*| denotes the number of pixels in the search window *R*. After obtaining the FMIND for the reference and float images, the similarity metric SADF(*I*($\mathbf{x}$), *J*($\mathbf{x}$)) between two pixels at the same position $\mathbf{x}$ in the images *I* and *J* can be expressed as the mean SAD between FMIND(*I*, $\mathbf{x}$, *r*) and FMIND(*J*, $\mathbf{x}$, *r*) over all offsets *r* in *R*:

$$\text{SADF}(I(\mathbf{x}), J(\mathbf{x})) = \frac{1}{|R|} \sum_{r \in R} \left| \text{FMIND}(I, \mathbf{x}, r) - \text{FMIND}(J, \mathbf{x}, r) \right| \tag{7}$$

where *R* is taken as a six-neighborhood in this paper.
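The construction of Equations (5)–(7) can be sketched as follows for a 2D image. This is an illustrative simplification: a plain patch SSD stands in for the foveated distance $d^{\text{FOV}}$, and a four-neighborhood plays the role of the paper's six-neighborhood in 3D; all function names are hypothetical.

```python
import numpy as np

def patch_dist(img, x, y, half=2):
    # Stand-in for d^FOV of Eq. (1): SSD between the (2*half+1)^2 patches at x and y
    # (the true d^FOV compares *foveated* patches instead).
    px = img[x[0]-half:x[0]+half+1, x[1]-half:x[1]+half+1]
    py = img[y[0]-half:y[0]+half+1, y[1]-half:y[1]+half+1]
    return float(np.sum((px - py) ** 2))

# Four-neighborhood as the 2D analogue of the paper's six-neighborhood R.
R = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def fmind(img, x, half=2):
    # Eq. (6): variance as the mean patch distance over the local patch S.
    S = [(m0, m1) for m0 in range(-half, half+1) for m1 in range(-half, half+1)]
    V = np.mean([patch_dist(img, x, (x[0]+m0, x[1]+m1), half) for m0, m1 in S])
    V = max(V, 1e-12)                     # guard against perfectly flat regions
    # Eq. (5): exponential descriptor over the search window R, maximum normalized to 1.
    d = np.array([patch_dist(img, x, (x[0]+r0, x[1]+r1), half) for r0, r1 in R])
    desc = np.exp(-d / V)
    return desc / desc.max()

def sadf(img_i, img_j, x, half=2):
    # Eq. (7): mean absolute difference between the two descriptors.
    return float(np.mean(np.abs(fmind(img_i, x, half) - fmind(img_j, x, half))))
```

Because each descriptor entry lies in (0, 1] with a maximum of 1, SADF is bounded in [0, 1) and vanishes for identical structures.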

### *2.3. MRF Optimization Based on the Spatial Constraint*

### 2.3.1. Discrete Optimization Based on the MRF

After obtaining the similarity measure for the two images of different modalities, we use the FFD as the transformation model and Markov random field (MRF) optimization [32] to obtain the transformation parameters of the FFD. The reason for choosing this discrete optimization method is that it does not need the gradient of the energy function during optimization, which facilitates good registration results by avoiding local minima. In this method, the image registration problem is converted into an MRF based discrete optimization problem:

$$E_{\text{MRF}}(\mathbf{l}) = \frac{1}{|G|} \sum_{p \in G} \left( V_p(l_p) + \lambda \sum_{q \in N(p)} V_{pq}(l_p, l_q) \right) \tag{8}$$

$$V_p(l_p) = \int_{\Omega} \text{SADF}(I(\mathbf{x}), J \circ T_{l_p}(\mathbf{x}))\, d\mathbf{x} \tag{9}$$

$$V_{pq}(l_p, l_q) = \left\| T_{l_p} - T_{l_q} \right\|_1 \tag{10}$$

where $E_{\text{MRF}}(\mathbf{l})$ denotes the energy function, i.e., the general form of a first-order MRF, and λ is a constant; *G* is the set of vertices and |*G*| denotes the number of vertices in *G*, where *G* can be regarded as the vertex set of the FFD control grid, because this method uses the FFD as the deformation model; *N*(*p*) refers to the neighborhood of vertex *p*; $\mathbf{l}$ is the discrete labelling, while $l_p$ and $l_q$ are the labels assigned to the vertices *p* and *q*, respectively; $V_p(l_p)$ denotes the data term of the
energy function $E_{\text{MRF}}(\mathbf{l})$, while $V_{pq}(l_p, l_q)$ represents its smoothness regularization, which takes the $L_1$-norm to encourage the neighboring vertices *p* and *q* to have similar displacements.

Accordingly, the MRF optimization actually seeks to assign a label associated with a deformation to each vertex so that the energy function in Equation (8) is minimized. In this paper, the fast primal-dual (Fast-PD) algorithm [33] is used for the MRF optimization to produce the registration result. More details about the Fast-PD algorithm can be found in [33].
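A minimal sketch of evaluating the MRF energy of Equations (8)–(10) on a small 2D control grid. It assumes the data term $V_p$ has already been precomputed per vertex and label (the paper obtains it by integrating SADF under each label's FFD displacement); `disp` maps each label to a displacement vector, and all names are illustrative.

```python
import numpy as np

def mrf_energy(labels, data_cost, disp, grid_shape, lam=0.01):
    """Energy of Eq. (8): mean over control points of the data term plus the
    lambda-weighted L1 difference of neighboring displacements (Eq. 10).

    labels    : flat list, one label per control point
    data_cost : (num_points, num_labels) precomputed V_p values
    disp      : (num_labels, dim) displacement associated with each label
    """
    H, W = grid_shape
    E = 0.0
    for p in range(H * W):
        i, j = divmod(p, W)
        lp = labels[p]
        E += data_cost[p, lp]                                   # data term V_p
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):  # N(p)
            if 0 <= ni < H and 0 <= nj < W:
                lq = labels[ni * W + nj]
                # smoothness term V_pq: L1 distance between label displacements
                E += lam * np.abs(disp[lp] - disp[lq]).sum()
    return E / (H * W)
```

A solver such as Fast-PD would search over `labels` to minimize this quantity; the sketch only shows how a candidate labelling is scored.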

### 2.3.2. Spatial Constraint Based on the FMIND

When the above registration method is applied to three-dimensional (3D) medical images, the number of deformation field variables will be large. If the displacement of every pixel along the *x*, *y*, and *z* directions is considered, the number of dense deformation field variables will be 3·|*Ix*|·|*Iy*|·|*Iz*|, where |*Ix*|, |*Iy*|, and |*Iz*| denote the number of pixels in the *x*, *y*, and *z* dimensions of the image *I*, respectively. For example, there will be 50,331,648 deformation field variables when |*Ix*| = |*Iy*| = |*Iz*| = 256. Addressing such a high-dimensional optimization problem is indeed very time-consuming.
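As a quick sanity check of the variable count quoted above:

```python
# Dense deformation field for a 256^3 volume: three displacement components per voxel.
n_vars = 3 * 256 * 256 * 256
print(n_vars)  # 50331648
```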

In the reference and float images, sometimes only some areas involve non-rigid deformation. In addition, non-rigid registration is unnecessary for some smooth areas. For these regions, a mask can be used to indicate that they will be excluded from the registration process. In this way, we can not only reduce the number of variables describing the deformation field, but also focus the calculation of the energy function on the image areas that indeed involve local non-rigid deformation. However, the shape of the areas without non-rigid deformation is often irregular. Generally, manual intervention or image segmentation is needed to obtain an appropriate mask image. However, these techniques cannot guarantee a satisfactory mask image for US and PET images due to their low contrast, blurriness, and discontinuous edges.

We put forward the spatial constraint method based on the FMIND to address the above problem. From Equation (7), it can be seen that SADF(*I*(*x*), *J*(*x*)) contains the corresponding relationship of the local spatial information at the pixel *x* in the images *I* and *J*. This information can be used to reduce the number of variables describing the deformation field and to limit the control nodes of the deformation field to move in the areas with local non-rigid deformation. In the FMIND based spatial constraint method, the vertex set *G* is divided into the set *Gs* of static vertices and the set *Gd* of dynamic vertices based on the local spatial information included in the FMIND. The vertices in *Gs* correspond to the smooth areas and the areas involving no deformation, while the vertices in *Gd* correspond to the non-smooth areas involving deformation. By moving only the dynamic vertices with the local non-rigid deformation, the calculation of the energy function can be restricted to the areas involving non-rigid deformation. In this way, the number of deformation field variables decreases from 3·|*Gx*|·|*Gy*|·|*Gz*| to 3·|*Gd*,*x*|·|*Gd*,*y*|·|*Gd*,*z*|, where |*Gd*,*x*|, |*Gd*,*y*|, and |*Gd*,*z*| denote the number of vertices in the *x*, *y*, and *z* dimensions of *Gd*. By utilizing the FMIND based spatial constraint, we can obtain the division of vertices, thereby generating the mask image without manual intervention or image segmentation.

To ensure the effective division of *G* in the FMIND based spatial constraint method, two requirements must hold: no vertex of the whole MRF model may be omitted, and no vertex may be assigned twice. Correspondingly, the logical relationship among *G*, $G_s$, and $G_d$ can be expressed as $G = G_s \cup G_d$ and $G_s \cap G_d = \emptyset$, where $\emptyset$ denotes the empty set. We have designed the following vertex partition algorithm according to these requirements. For any vertex $p_{i,j,k}$ of the set $G = \{p_{i,j,k} \mid 1 \le i \le |G_x|, 1 \le j \le |G_y|, 1 \le k \le |G_z|\}$ in the MRF model, we check the similarity metric SADF for each pixel $\mathbf{x}$ in the local image patch $LP(p_{i,j,k})$ with radius $R_{LP}$, which takes $p_{i,j,k}$ as its center, as shown in Figure 3. Let *con* denote the number of pixels in $LP(p_{i,j,k})$ whose similarity $1 - \text{SADF}(I(\mathbf{x}), J(\mathbf{x}))$ is greater than a threshold δ. If the ratio of *con* to the patch size $|LP(p_{i,j,k})|$ is greater than the static factor ε, $p_{i,j,k}$ is regarded as a static vertex. In a similar way, we determine the other static vertices to generate the final set $G_s$. Accordingly, $G_d$ is easily computed as $G_d = G - G_s$.

**Figure 3.** The illustration of the local image block *LP*(*pi*, *<sup>j</sup>*, *<sup>k</sup>*).

Obviously, the performance of the vertex partition algorithm depends on three key parameters: $R_{LP}$, δ, and ε. Here, δ influences the decision of whether two pixels are similar, and ε adjusts the probability that $p_{i,j,k}$ is assigned to $G_s$. Section 3.1 discusses the choice of these parameters. Algorithm 1 shows the detailed implementation of the proposed vertex partition algorithm.

**Algorithm 1.** Partition of the vertex set *G*

```
Input: FMIND(I, x, r), FMIND(J, x, r), G, δ, ε, RLP
Output: Gs, Gd
(1)  Gs = ∅;
(2)  for (i = 1; i ≤ |Gx|; i++)
(3)    for (j = 1; j ≤ |Gy|; j++)
(4)      for (k = 1; k ≤ |Gz|; k++)
(5)        con = 0;
(6)        while x ∈ LP(pi,j,k)
(7)          if (1 − SADF(I(x), J(x)) > δ)
(8)            con++;
(9)          end if
(10)       end while
(11)       if (con / |LP(pi,j,k)| > ε)
(12)         Gs = {pi,j,k} ∪ Gs;
(13)       end if
(14)     end for
(15)   end for
(16) end for
(17) Gd = G − Gs;
(18) return Gs, Gd;
```
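Algorithm 1 can be sketched in Python, assuming the SADF values of Equation (7) have been precomputed as a volume; the function and parameter names are illustrative, and patches are simply clipped at the volume borders.

```python
import numpy as np

def partition_vertices(sadf_vol, grid_points, r_lp=2, delta=0.9, eps=0.75):
    """Split the control-point set G into static (Gs) and dynamic (Gd) vertices.

    sadf_vol    : 3D array of precomputed SADF(I(x), J(x)) values
    grid_points : list of (i, j, k) vertex coordinates
    """
    Gs, Gd = [], []
    for (i, j, k) in grid_points:
        # local patch LP(p) of radius r_lp around the vertex, clipped at borders
        patch = sadf_vol[max(i - r_lp, 0):i + r_lp + 1,
                         max(j - r_lp, 0):j + r_lp + 1,
                         max(k - r_lp, 0):k + r_lp + 1]
        con = np.count_nonzero(1.0 - patch > delta)   # pixels deemed similar
        if con / patch.size > eps:
            Gs.append((i, j, k))   # well matched region -> static vertex
        else:
            Gd.append((i, j, k))   # poorly matched region -> dynamic vertex
    return Gs, Gd
```

By construction every vertex is assigned to exactly one of the two sets, which matches the requirements $G = G_s \cup G_d$ and $G_s \cap G_d = \emptyset$.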
### **3. Results**

In this section, we first discuss the selection of several key parameters in the FMIND method, and then use anatomical landmarks selected by doctors to compare the registration accuracy and efficiency of the proposed FMIND method with those of the entropy images based SSD (ESSD) [17], MIND [18], SSC [19], and HLCSO [26] methods. To evaluate registration performance, we have used four datasets with synthetic deformation, including simulated 3D MR images in the BrainWeb database [34], 3D CT and MR images in the NA-MIC database [35], 3D CT and PET images in the NA-MIC database [36], and real 3D MR images from the Retrospective Image Registration Evaluation project [37]. Besides, we have used the real MR and US images with unknown real deformation in the Brain Images of Tumors for Evaluation (BITE) database [38] available at [39] to assess the practicality of the proposed method. Here, the implementation efficiency of the evaluated registration methods is measured by their running time. All evaluated methods are implemented on a personal computer with a 2.40 GHz CPU and 4 GB RAM using mixed programming of Matlab and C++.

In the case of synthetic deformation, the registration accuracy is assessed by the target registration error (TRE) [40], defined as:

$$\text{TRE} = \frac{1}{N} \sum_{i=1}^{N} \sqrt{\left(T_{L,x}^i - T_{D,x}^i\right)^2 + \left(T_{L,y}^i - T_{D,y}^i\right)^2 + \left(T_{L,z}^i - T_{D,z}^i\right)^2} \tag{11}$$

where *TL* is the synthetic deformation (i.e., the ground truth generated by using a linear combination of radial basis functions), *TD* is the deformation estimated by the registration methods, and *N* denotes the number of landmarks selected manually based on doctors' advice from the reference images. For each pair of reference and float images, 25 different synthetic deformations are applied to the float image, and we manually select 90 (*N* = 90) landmarks from each 3D reference image to compute the TRE. The mean of the TRE values for registering the 25 deformed images is used to assess the registration accuracy. Figure 4 gives an example of the chosen landmarks in one slice of a simulated 3D PD weighted image and a real 3D T1 weighted image for MR image registration, a 3D CT image for CT-MR image registration, and a 3D CT image for CT-PET image registration.
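Equation (11) amounts to the mean Euclidean distance between corresponding deformation vectors at the landmarks; a minimal sketch:

```python
import numpy as np

def tre(T_L, T_D):
    """Eq. (11): mean Euclidean distance between the ground-truth (T_L) and
    estimated (T_D) deformations at the N landmarks; inputs are N x 3 arrays."""
    T_L, T_D = np.asarray(T_L, float), np.asarray(T_D, float)
    return float(np.mean(np.linalg.norm(T_L - T_D, axis=1)))
```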

**Figure 4.** Landmarks in one slice of three-dimensional (3D) medical images. (**a**) simulated proton density (PD) weighted image; (**b**) real T1 weighted image; (**c**) abdomen computed tomography (CT) image; and, (**d**) whole-body CT image.

### *3.1. Parameter Setting*

In the FMIND method, we fix θ = 0, set *S* to be a 5 × 5 foveated patch in Equation (3), and set λ = 0.01 in Equation (8). The remaining parameters include the kernel elongation parameter ρ, the image patch radius $R_{LP}$, the similarity threshold δ, and the static factor ε. To determine these parameters effectively, we conduct experiments on three pairs of simulated MR images (T2-T1, PD-T2, and PD-T1) from the BrainWeb database, where the former and the latter of each pair are used as the reference and float images, respectively. These simulated MR images are realistic MRI data volumes produced by an MRI simulator using three sequences (T1, T2, and PD weighted) and a variety of slice thicknesses, noise levels, and levels of intensity non-uniformity.
