1. Introduction
Nonrigid multimodal image registration plays an important role in medical image processing and analysis. Owing to the different principles of imaging technologies such as ultrasound (US), computed tomography (CT), magnetic resonance (MR) imaging, and positron emission tomography (PET), each modality has its own advantages in reflecting the anatomical or functional information of the human body. Multimodal image fusion can combine information from different modalities to facilitate disease diagnosis and treatment. Image registration, which aims to find the correct spatial alignment between corresponding structures in images, is a prerequisite for effective image fusion and has been widely studied. For example, the registration of transrectal ultrasound (TRUS) images with preoperative MR images has been widely studied for the guidance of prostate biopsies [1,2]. In craniomaxillofacial surgery [3], the intraoperative images must be aligned with the three-dimensional (3D) virtual model during surgical navigation. In the treatment of epilepsy [4], MRI is registered with functional imaging modalities such as PET to help identify the functionally eloquent regions of brain tissue and guide electrode placement.
For image registration, the similarity metric plays an important role in ensuring accurate registered results. However, finding an effective similarity metric for multimodal image registration is challenging. Widely used metrics such as the sum of squared differences (SSD) and cross-correlation [5] are not directly suitable for multimodal medical image registration because of the different intensity distributions of multimodal images. Mutual information (MI) [6], which captures the statistical intensity relationship across images, has been widely used in multimodal image registration. However, MI is time-consuming to compute, and because it is not a convex function, the optimization easily falls into local optima and produces misalignment [7]. Various improved MI metrics have been presented to address these problems, including normalized mutual information (NMI) [8,9], regional mutual information [10,11], conditional mutual information [12], and self-similarity weighted mutual information [13]. However, these methods ignore local structural information, which may degrade registration performance, especially in edge regions.
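To make the role of MI concrete, the sketch below estimates it from a joint intensity histogram; the function name, bin count, and histogram-based estimator are illustrative choices, not a specific implementation from the cited works.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """MI between two images, estimated from their joint intensity histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of image a
    py = pxy.sum(axis=0, keepdims=True)   # marginal of image b
    nonzero = pxy > 0                     # avoid log(0)
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px * py)[nonzero])))
```

Maximizing such a histogram-based quantity over a space of transformations is exactly what makes MI-driven registration slow and, since the cost surface is non-convex, prone to the local optima noted above.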
To overcome the drawbacks of the above-mentioned similarity metrics, structural representation (SR) methods have been investigated, which first transform a multimodal registration into a mono-modal one and then compute the SSD between the SR results. The local gradient orientation has been utilized to find correspondences across image modalities [14,15]; however, gradient estimation is sensitive to noise. Yang et al. [16] have presented a Weber local descriptor (WLD)-based SR method, but it is also sensitive to image noise and sometimes fails to provide consistent structural representations of the same organs in multimodal images. Heinrich et al. have proposed the modality-independent neighborhood descriptor (MIND) [17] based on the principle of local self-similarity used in nonlocal means denoising. Although this descriptor is robust to noise, intensity differences, and non-uniform bias fields, it is not rotationally invariant. Wachinger et al. [18] have proposed constructing structural representations based on entropy images. This method estimates the probability density function of image patches centered on each pixel and then constructs entropy images based on the Shannon entropy. However, entropy images tend to be fuzzy, which may lead to inaccurate similarity computation. The idea of dimensionality reduction has also been applied to image registration [19], and the construction of structural representations based on dimensionality reduction has been studied extensively. A structural representation based on Laplacian eigenmaps is proposed in [18]. In this method, a neighborhood graph is constructed from all image patches and used to compute the graph Laplacian, which determines a low-dimensional embedding; the structural representation is produced by aligning the embeddings of the different modalities. Compared with an entropy image, a Laplacian image looks more like the original image, and thus its appearance is more consistent across modalities. The disadvantages of the Laplacian image lie in its high computational cost and vulnerability to image noise. Piella has proposed using the image intensities and pixel positions to construct diffusion maps for multimodal image registration [20]. The registration performance of this method is limited by its use of only the first diffusion coordinate, which represents just the coarse geometry of the image. A probabilistic edge map (PEM) generated by a structured decision forest has been proposed by Oktay et al. for multimodal ultrasound image registration [21]. In general, these SR methods are mostly based on hand-crafted low-level features, which cannot properly represent the complicated characteristics of the wide variety of medical images.
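The local self-similarity principle behind MIND can be illustrated with a simplified sketch: compare each pixel's patch with the patches at its eight neighboring offsets and pass the distances through a decay normalized by their local mean. The box-filtered patch distances, wrap-around shifts via `np.roll`, and parameter values are our simplifications, not the reference implementation of [17].

```python
import numpy as np
from scipy.ndimage import uniform_filter

def self_similarity_descriptor(img, radius=1, patch=3, eps=1e-6):
    """Simplified MIND-style descriptor built from local patch self-similarity."""
    offsets = [(dy, dx) for dy in (-radius, 0, radius)
                        for dx in (-radius, 0, radius) if (dy, dx) != (0, 0)]
    # patch-wise SSD to each shifted copy via a box filter over squared differences
    # (np.roll wraps at the borders -- acceptable for a sketch)
    dists = np.stack([uniform_filter((img - np.roll(img, o, axis=(0, 1))) ** 2, patch)
                      for o in offsets])
    variance = dists.mean(axis=0) + eps   # local normalization term
    desc = np.exp(-dists / variance)
    return desc / desc.max(axis=0, keepdims=True)   # per-pixel normalization
```

Because the distances are normalized by a local estimate before the exponential decay, the descriptor is largely insensitive to affine intensity changes, which is what makes this family of descriptors usable across modalities.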
Learning features automatically from the data of interest provides a plausible route to accurate structural image representation. Deep learning, as a popular machine learning approach, is well suited to automatic feature learning. Various deep learning models, such as the deep belief network (DBN), the convolutional neural network (CNN), and the stacked autoencoder (SAE), can extract intrinsic features from large amounts of data through multilayer linear and nonlinear transformations. Recently, deep learning has been utilized for image registration in two kinds of methods. The first kind extracts image features using networks such as CNNs and SAEs and feeds them into traditional registration methods to produce the registered images [22,23,24]. In these methods, improper parameter selection in the CNN or SAE may lead to locally optimal registered results, and these models generally train slowly. The second kind uses deep learning to realize end-to-end image registration by learning the deformation field directly from the input images. Deformable image registration methods based on CNNs are proposed in [25,26]; however, these methods are not suitable for multimodal image registration. Hu et al. [27] and Hessam et al. [28] extended CNNs to multimodal deformable image registration. Because these methods rely on supervised learning, their performance suffers from the scarcity of labeled medical imaging data and from the inaccurate training labels produced by the traditional registration methods used to train the CNN.
To address these problems, this paper proposes to realize structural representation for image registration based on PCANet. PCANet, proposed by Chan et al. [29], comprises an input layer, hidden layers, and an output layer. The network applies a principal component analysis (PCA)-based unsupervised learning method to image patches to produce the parameters of the hidden layers. Compared with a CNN, PCANet can be trained much more easily and requires no labeled data for network training. However, the network uses binary hashing in the output layer to produce the final output, which may discard some image features. Therefore, an improved version of PCANet is proposed in this paper to realize structural image representation. In the proposed registration framework, the improved PCANet-based structural representation (PSR) method extracts multilevel features of medical images, and the extracted features are combined to produce the structural representation results. The L2 distance between the PSR results of the images serves as the similarity metric. With the free-form deformation (FFD) [30] model as the transformation model, the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm [31] is used to optimize the similarity metric and obtain the parameters of the transformation model. Extensive experiments have been conducted on simulated and real medical images to evaluate the registration performance of the proposed method in terms of the target registration error (TRE).
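The overall pipeline — structural representation, L2 similarity, transformation model, L-BFGS optimizer — can be sketched in miniature. The toy below replaces the FFD model with a plain 2D translation for brevity and uses SciPy's L-BFGS-B; the function names and parameters are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift
from scipy.optimize import minimize

def register(ref_sr, flo_sr):
    """Minimize the L2 distance between two structural representations with
    L-BFGS; a translation stands in for the FFD transformation model."""
    def cost(t):
        # warp the floating representation by the current transform parameters
        warped = nd_shift(flo_sr, t, order=3, mode='nearest')
        return float(np.sum((ref_sr - warped) ** 2))
    return minimize(cost, x0=np.zeros(2), method='L-BFGS-B').x
```

In the full method the two-parameter translation is replaced by the FFD control-point displacements, but the structure of the optimization loop is the same.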
This paper is organized as follows:
Section 2 describes the construction of PSR and the implementation of the proposed nonrigid multimodal registration method.
Section 3 presents the parameter settings in the proposed method and a comparison of its registration performance with that of the state-of-the-art registration methods. Finally, conclusions and future research directions are given in
Section 4.
3. Experimental Results
For the assessment of registration accuracy, the target registration error (TRE), which is the average Euclidean distance between corresponding anatomical landmarks in the images to be registered, is used and is defined as:

$$\mathrm{TRE} = \frac{1}{N} \sum_{p_i \in P} \left\| p_i - T(q_i) \right\|_2,$$

where $P$ and $N$ denote the set of anatomical landmarks and the number of landmarks selected in the reference image, respectively, $p_i$ and $q_i$ represent the pixel positions of the landmarks in the reference and floating images, respectively, and $T$ is the estimated deformation obtained by the registration method. The floating image is produced based on a B-spline function: B-spline control points applied to the image are randomly displaced to generate a random deformation field, from which the floating image is produced. In this paper, the landmarks are selected manually based on the doctors' advice.
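The floating-image simulation described above can be sketched as follows; cubic upsampling of a coarse random displacement grid stands in for the full B-spline machinery, and the grid size and displacement magnitude are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import zoom, map_coordinates

def random_deform(img, grid=5, max_disp=4.0, seed=0):
    """Randomly displace a coarse control grid, interpolate it to a dense
    deformation field, and warp the image to produce a floating image."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    # one smooth displacement map per axis, upsampled from the control grid
    dy, dx = (zoom(rng.uniform(-max_disp, max_disp, (grid, grid)),
                   (h / grid, w / grid), order=3) for _ in range(2))
    ys, xs = np.mgrid[0:h, 0:w]
    floating = map_coordinates(img, [ys + dy, xs + dx], order=1, mode='nearest')
    return floating, (dy, dx)
```

The returned displacement maps serve as the ground-truth deformation field against which a registration result can be scored.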
Figure 3 shows an example of the distribution of the chosen landmarks in MR images from the Atlas, BrainWeb, and RIRE datasets.
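The TRE defined above can be computed directly from the landmark sets; the function and argument names below are illustrative.

```python
import numpy as np

def target_registration_error(ref_pts, flo_pts, deform):
    """Mean Euclidean distance between the reference landmarks and the
    deformed positions of the corresponding floating-image landmarks."""
    warped = np.array([deform(q) for q in flo_pts], dtype=float)
    return float(np.mean(np.linalg.norm(np.asarray(ref_pts, float) - warped, axis=1)))
```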
3.1. Parameter Settings
The main parameters of the PCANet include the number of stages, the number of filters in each stage, the patch size, and the decay parameters. In our experiments, a two-stage PCANet is found to be sufficient for image registration; a deeper network does not lead to much better performance. The number of filters in each stage is set to 8, inspired by the Gabor filters [34] with eight orientations. The remaining two parameters, i.e., the patch size and the decay parameters, are tuned on the Atlas dataset, from which 100 images are chosen. To ensure sufficient training data, four different nonrigid deformations are applied to each image, and the resultant 500 images are used to train the PCANet.
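The patch-based PCA learning at the heart of a PCANet stage can be sketched as follows; the non-overlapping stride, image sizes, and the direct SVD route are our simplifications, not the paper's training code.

```python
import numpy as np

def learn_pca_filters(images, k=7, n_filters=8):
    """One PCANet stage: gather mean-removed k x k patches from the training
    images and keep the leading principal directions as convolution filters."""
    patches = []
    for img in images:
        h, w = img.shape
        for i in range(0, h - k + 1, k):        # non-overlapping stride keeps it cheap
            for j in range(0, w - k + 1, k):
                p = img[i:i + k, j:j + k].ravel()
                patches.append(p - p.mean())    # patch-mean removal, as in PCANet
    x = np.stack(patches)                       # (num_patches, k * k)
    # leading right singular vectors = top eigenvectors of the patch covariance
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[:n_filters].reshape(n_filters, k, k)
```

A second stage repeats the same procedure on the response maps of the first stage, which is how the multilevel features used for the PSR arise.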
3.1.1. Impact of the Patch Size
To explore the effect of the patch size on image registration, the patch height and width are both varied from 3 to 11.
Figure 4 shows the PSR of a T2-weighted image. The PSR becomes more blurry as the patch size increases. As shown in
Figure 4d–f, when the patch size is large, the features become less obvious and some weak features even disappear, which is disadvantageous for image registration. To evaluate the effect of the patch size objectively, the chosen 10 images are aligned and their TRE values are computed.
Figure 5 shows the TRE of our method for the various patch sizes. It demonstrates that a too-large patch size has a negative influence on registration accuracy, and the best registered result is obtained at the patch size indicated in Figure 5.
3.1.2. Impact of the Decay Coefficients
The two decay coefficients have an important influence on the PSR. When the coefficients are small, the decay function becomes sharp and the PSR values approach zero, which makes the features in the PSR difficult to distinguish. When they are large, the response becomes overly broad, which is also disadvantageous for feature discrimination.
Figure 6b–d show the PSR of a T1-weighted MR image for different global thresholds. As seen in
Figure 6b, a small global threshold results in some false features and makes it difficult to distinguish the smooth regions from the edge regions. As shown in
Figure 6d, a high threshold weakens some features, and the less-obvious features in the image disappear. By comparison, a suitable threshold facilitates a clear structural representation, as shown in
Figure 6c.
To evaluate the influence of the coefficients objectively, the first coefficient is varied from 0.2 to 1 and the second from 0.4 to 1.2, both with a step size of 0.1. The corresponding TRE results are shown in
Figure 7. Relatively low TRE values are obtained when the first coefficient is between 0.6 and 0.9 and the second is between 0.5 and 0.8. Based on this analysis, the two coefficients are set to 0.8 and 0.6, respectively, to ensure good registration results.
3.2. Comparison with State-of-the-Art Registration Methods
To demonstrate the superiority of our method, we compare the PSR with the structural representations obtained by other methods on the BrainWeb dataset.
Figure 8a–c show the PD-, T1-, and T2-weighted MR images, and Figure 8d–f, g–i, and j–l show the corresponding structural representation results of the proposed method, the ESSD method, and the WLD method, respectively. As
Figure 8 shows, the PSR of our method is more consistent across the multimodal images than the results of the compared methods, and a higher contrast exists between the smooth regions and the edge regions in the PSR. Compared with the PSR, the ESSD result appears blurry. The WLD method produces many artifacts around the boundaries and poorer consistency among the different modalities; for example, the intensities of the structural representations of the same tissue in the T1 and T2 images differ greatly, as shown in
Figure 8k,l.
In addition to the comparison of structural representations, the registration accuracy of the proposed method has been compared with that of the state-of-the-art methods on the BrainWeb and RIRE datasets. For the proposed method, the penalty parameter in the optimization algorithm is set to 0.01, the PCANet patch size to the optimal value determined in Section 3.1.1, the decay parameters to 0.8 and 0.6, and the number of filters in each layer to 8. For the compared methods, the related parameters are tuned to ensure the best registered results on the two datasets.
3.2.1. Test on the BrainWeb Dataset
Figure 9 shows some registered results of the five compared methods. Obviously, all of these methods can correct the deformations to some extent. However, the NMI, WLD, and ESSD methods often fail to align the floating and reference images under large deformations. By comparison, the registered results of the MIND and PSR methods are much closer to the reference image, and the proposed method recovers the image details better than the MIND method. For a more intuitive comparison, the deformation fields are shown in
Figure 10. The ground-truth deformation field is the spatial difference between the reference and floating images. Clearly, the PSR method provides a deformation field closer to the ground truth than the MIND method does.
To evaluate and compare the registered results objectively, 50 registration experiments have been conducted.
Table 1 shows the mean and standard deviation (std) of the TRE values for all registration methods, where '/' indicates that no registration is performed. The results show that the proposed method achieves higher registration accuracy and is more stable than the other methods. A paired t-test is performed between the PSR method and each compared method, and the results show a significant difference between the proposed registration method and every compared method (p < 0.05). The superiority of the PSR method can be explained as follows: medical images are complicated, and hand-designed descriptors struggle to represent their complex features, whereas the strong feature learning ability of the PCANet allows more intrinsic features to be extracted automatically and effectively, thereby leading to better structural representation results.
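Such a significance test can be reproduced with SciPy's paired t-test; the TRE arrays below are synthetic stand-ins with plausible magnitudes, not the paper's measurements.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
tre_psr = rng.normal(1.2, 0.2, 50)    # hypothetical per-case TREs, proposed method
tre_mind = rng.normal(1.6, 0.3, 50)   # hypothetical per-case TREs, compared method
# paired test: both methods are evaluated on the same 50 registration cases
t_stat, p_value = stats.ttest_rel(tre_psr, tre_mind)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}, significant: {p_value < 0.05}")
```

A paired test is the appropriate choice here because each method is run on the same set of image pairs, so per-case differences, not the pooled variances, drive the statistic.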
To explore the robustness of the registration methods, different levels (1–10%) of Rician noise have been added to the MR images.
Figure 11 shows the TRE of the PSR, MIND, and ESSD registration methods, and
Figure 12 shows some registered results of T1–T2 images with different levels of Rician noise for the PSR. The results show that the registered results degrade to some extent as the noise level increases, but on the whole, our method is robust to noise.
3.2.2. Test on the RIRE Dataset
To further test the registration performance of the proposed method, experiments have been conducted on the RIRE dataset.
Figure 13 shows the filters learned by the PCANet. It can be seen that the network learns different convolution kernels for the CT and MR images, which facilitates effective feature extraction for the different images.
Figure 14 shows the TRE over 50 registrations of CT and PD-weighted MR images. The proposed method has a lower mean TRE and is more robust than the other methods; in particular, compared with the most competitive MIND method, our method improves the mean TRE by 0.33. To compare the registration methods intuitively, some examples of MR and CT image registration are shown in
Figure 15. Among the five registration methods, the NMI and WLD methods provide the worst registration results: as shown in the red boxes in
Figure 15f,g, the registered results are not ideal for either the contour or the details. The ESSD method performs better than the NMI and WLD methods, but it cannot provide a satisfactory registered result for the weak edges, as shown in
Figure 15e. The registered result of the MIND method is worse than that of the PSR method for both the contour and the image details, as shown in the red boxes in
Figure 15d. These experiments demonstrate the superiority of our method for MR and CT image registration.
4. Conclusions
In this paper, a novel PCANet-based structural representation method has been proposed for multimodal medical image registration. Compared with hand-designed feature extraction methods, the PCANet can automatically learn intrinsic features from a large number of medical images through multilevel linear and nonlinear transformations. Distinctively, the proposed method provides effective structural representations for multimodal images by utilizing the multilevel image features extracted in the various layers of the PCANet. Extensive experiments on the Atlas, BrainWeb, and RIRE datasets demonstrate that the proposed method provides lower TRE values and visually more satisfactory registered results than the MIND, ESSD, WLD, and NMI methods.
Our future work will focus on extending the proposed approach to more challenging data, in particular 3D ultrasound data. This extension is an arduous task because of the severe noise and unclear anatomical edges in ultrasound images, which pose a challenge for the construction of the structural representation. To make the proposed method applicable to such challenging data, the training of the PCANet and the construction of the structural representation will be modified, and more clinical samples will be collected to facilitate the training and testing of the PCANet-based registration method.