*Article* **2D/3D Non-Rigid Image Registration via Two Orthogonal X-ray Projection Images for Lung Tumor Tracking**

**Guoya Dong 1,2,3,†, Jingjing Dai 1,2,3,4,†, Na Li 5, Chulong Zhang 4, Wenfeng He 4, Lin Liu 4, Yinping Chan 4, Yunhui Li 4, Yaoqin Xie 4 and Xiaokun Liang 4,\***

<sup>1</sup> School of Health Sciences and Biomedical Engineering, Hebei University of Technology, Tianjin 300130, China

<sup>2</sup> Hebei Key Laboratory of Bioelectromagnetics and Neural Engineering, Tianjin 300130, China


**Abstract:** Two-dimensional (2D)/three-dimensional (3D) registration is critical in clinical applications. However, existing methods suffer from long registration times and high radiation doses. In this paper, a non-rigid 2D/3D registration method based on deep learning with orthogonal angle projections is proposed. The method achieves registration quickly using only two orthogonal projections. We tested the method on lung data (with and without tumors) and phantom data. The results show that the Dice coefficient and normalized cross-correlation are greater than 0.97 and 0.92, respectively, and the registration time is less than 1.2 s. In addition, the proposed model showed the ability to track lung tumors, highlighting the clinical potential of the proposed method.

**Keywords:** 2D/3D registration; orthogonal X-ray; deep learning

**Citation:** Dong, G.; Dai, J.; Li, N.; Zhang, C.; He, W.; Liu, L.; Chan, Y.; Li, Y.; Xie, Y.; Liang, X. 2D/3D Non-Rigid Image Registration via Two Orthogonal X-ray Projection Images for Lung Tumor Tracking. *Bioengineering* **2023**, *10*, 144. https://doi.org/10.3390/bioengineering10020144

Academic Editors: Paolo Zaffino and Maria Francesca Spadea

Received: 26 December 2022; Revised: 10 January 2023; Accepted: 16 January 2023; Published: 21 January 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

With the rapid growth of modern medical technology, medical imaging has become indispensable for diagnosing and treating diseases. Image registration is crucial in medical image processing because it supports the prediction, diagnosis, and treatment of disease. Among the images to be registered, three-dimensional (3D) medical images, with their rich anatomical and structural information, are the natural choice for many clinical problems. Unfortunately, 3D imaging entails a higher radiation dose and slower acquisition, which is inconvenient for real-time clinical tasks such as image-guided radiotherapy and interventional surgery. Two-dimensional (2D) images, on the other hand, lack some spatial structure information but can be acquired very quickly. Therefore, in recent years, 2D/3D image registration, with its faster speed and simpler imaging equipment, has attracted much attention. The 2D images are usually X-ray [1–4], fluoroscopic [5], digital subtraction angiography (DSA) [6,7], or ultrasound [8] images, whereas the 3D images are typically computed tomography (CT) [1–4] or magnetic resonance imaging (MRI) [8].

2D/3D registration methods can be divided into traditional and deep learning-based approaches. In traditional image registration, 2D/3D alignment is usually translated into the problem of maximizing the similarity between digitally reconstructed radiographs (DRRs) and X-ray images. Similarity metrics are usually intensity-based, such as mutual information [9–11], normalized cross-correlation (NCC) [12], and the Pearson correlation coefficient [13], or gradient-based [14]. To reduce the dimensionality of the transformation parameters, regression models that rely on a priori information are usually built using B-splines [15] or principal component analysis (PCA) [16–19]. However, organ motion and deformation can cause errors in regression models that rely too heavily on prior information. By incorporating finite element information into the regression model, Zhang et al. [18,19] obtained more realistic and effective deformation parameters. However, adding finite element information makes the iterative, model-driven search for the optimal solution even less efficient. This process therefore remains a constraint on the development of real-time 2D/3D registration and tumor tracking algorithms.

With the development of artificial intelligence and deep learning, learning-based methods replace the tedious iterative optimization process with a single forward prediction at test time, greatly improving computational efficiency. Zhang [20] proposed an unsupervised 2D/3D deformable registration network that addresses registration from a finite set of angles. Li et al. [4] proposed an unsupervised multi-scale encoder–decoder framework to achieve non-rigid 2D/3D registration based on a single 2D lateral brain image and a 3D CBCT image. Ketcha et al. [21] used multi-stage rigid registration based on convolutional neural networks (CNNs) to obtain a deformable spine model, and Zhang et al. [22] achieved deformable registration of the skull surface. Unfortunately, the above learning-based approaches evaluate the similarity between DRR and X-ray images, reducing the 2D/3D registration problem to 2D/2D registration, so some spatial information is inevitably lost. In addition, even with Graphics Processing Unit (GPU) support, the forward projection, backward projection, and DRR generation involved in these methods are computationally expensive. Researchers have since achieved end-to-end 2D/3D registration by integrating forward/inverse projection spatial transformation layers into neural networks [3,23]. Frysch et al. [2] used Grangeat's relation instead of expensive forward/inverse projection to perform 2D/3D registration from a single projection at an arbitrary angle, which greatly accelerated computation; however, this is a rigid transformation that is difficult to apply to elastic organs. Likewise, deep learning researchers have attempted to use statistical deformation models to build learning-based regression models. Using a priori information to build patient-specific deformation spaces, convolutional neural networks regress PCA coefficients [1,24,25] or B-spline parameter coefficients [26,27] to obtain patient-specific registration networks. Tian et al. [28] obtained the predicted deformation field from the regression coefficients. However, a deformation space built entirely from a priori information may lead to errors at the clinical application stage. Some researchers [29,30] have also accomplished 2D/3D image registration by extracting feature points. With the maturation of point cloud technology, many researchers have built point-to-plane alignment models by extracting global point clouds to complete 2D/3D alignment, but outlier removal for such models remains a challenge [31–33]. Graph neural networks have also been used for 2D/3D registration under low-contrast conditions [34]. Shao et al. [35] tracked liver tumors by adding finite element modeling, but the introduction of finite elements again increased the registration time.

Therefore, we developed a deep learning-based method for non-rigid 2D/3D image registration of the same subject. Compared with traditional algorithms based on iterative optimization, this approach significantly improves registration speed. Compared with methods that optimize the reduced-dimension similarity between DRR and X-ray images, we optimize the similarity of 3D/3D images, which effectively moderates the loss of spatial information. Additionally, only two projections at orthogonal angles are required for the 2D images, further reducing the irradiation dose. The proposed method is used to study how elastic organs change as respiratory motion proceeds. More significantly, we also investigated the change in tumor position with respiratory motion, which can be used to track tumors based on orthogonal-angle projections during radiotherapy.

The contributions of our work are summarized as follows:

1. We propose a deep learning-based 2D/3D elastic registration framework that can track organ shape at a lower dose using only two X-ray projections at orthogonal angles.

2. Our framework is expected to be applicable to tumor tracking, with a tumor localization accuracy of up to 0.97 and a registration time within 1.2 s, making it a potential solution for image-guided surgery and radiotherapy.

The organizational structure of this article is as follows. Section 2 describes the method. Section 3 describes the experimental setup. Section 4 presents the results, Section 5 provides the discussion, and Section 6 concludes the paper.

#### **2. Methods**

#### *2.1. Overview of the Proposed Method*

The framework of this method is shown in Figure 1. We design a non-rigid 2D/3D registration framework based on deep learning with orthogonal-angle projections. Since it is a deep learning-based model, a large amount of data is needed for training, and real paired 2D/3D medical images acquired at the same time are very scarce, so the first task is data augmentation. We chose 4D CT of the lungs as the experimental subject. The end-expiratory phase was used as the moving image *MCT*, and hybrid data augmentation [36,37] was used to obtain a large number of CT volumes *FCT* representing each respiratory phase of the lung (this procedure is described in Section 3.1). Then, the ray-casting method is used to obtain a pair of 2D DRRs of *FCT* at orthogonal angles. After that, the orthogonal DRRs and the moving image *MCT* are input into the 2D/3D registration network, which outputs a 3D deformation field *φp*. The moving image *MCT* is then transformed by the spatial transformation layer [38] to obtain the corresponding predicted CT image, and the similarity between the predicted CT image and the ground truth *FCT* is maximized. Through continuous iterative optimization, the model is trained. In the inference phase, only the X-ray projections or DRRs and the moving image need to be input to the trained network to obtain the corresponding 3D image.
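To make the training phase concrete, the PyTorch-style sketch below outlines one optimization step under our reading of the pipeline. It is an illustrative sketch, not the authors' implementation: `net`, `optimizer`, and `loss_fn` are placeholders, the two orthogonal DRRs are assumed to be precomputed by ray casting, and the spatial transformation layer is approximated with `grid_sample`.

```python
import torch
import torch.nn.functional as F

def warp(moving, flow):
    """Spatial transformation layer (approximation): warp a 3D volume
    (B, 1, D, H, W) with a dense voxel-displacement field (B, 3, D, H, W),
    assumed ordered (dz, dy, dx), via trilinear sampling."""
    B, _, D, H, W = moving.shape
    theta = torch.eye(3, 4, device=moving.device).unsqueeze(0).repeat(B, 1, 1)
    grid = F.affine_grid(theta, size=moving.shape, align_corners=True)
    # convert voxel displacements to normalized (x, y, z) offsets in [-1, 1]
    norm = torch.tensor([2.0 / max(W - 1, 1), 2.0 / max(H - 1, 1), 2.0 / max(D - 1, 1)],
                        device=moving.device)
    offset = flow.permute(0, 2, 3, 4, 1)[..., [2, 1, 0]] * norm
    return F.grid_sample(moving, grid + offset, align_corners=True)

def train_step(net, optimizer, drr_0, drr_90, M_ct, M_seg, F_ct, F_seg, loss_fn):
    """One optimization step: predict the deformation field from the two
    orthogonal DRRs plus the moving CT, warp, and back-propagate the loss."""
    phi_p = net(drr_0, drr_90, M_ct)   # predicted 3D deformation field
    P_ct = warp(M_ct, phi_p)           # predicted CT image
    P_seg = warp(M_seg, phi_p)         # predicted lung segmentation
    loss = loss_fn(F_ct, P_ct, F_seg, P_seg, phi_p)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, only the forward call `net(drr_0, drr_90, M_ct)` and the warp are needed, which is what keeps the registration time short.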

**Figure 1.** Overview of the proposed method. (**a**) Flowchart of the training phase. First, a large number of CT volumes *FCT* and segmentations *Fseg* representing each phase are obtained by hybrid data augmentation of the moving image *MCT* and the corresponding segmentation image *Mseg*. Then, the *FCT* images are projected to obtain the 2D projections *DRR*90 and *DRR*0. After that, they are fed into the registration network together with the moving image *MCT* to obtain the predicted deformation field *φp*. Finally, the moving image *MCT* and the moving segmentation map *Mseg* are transformed to obtain the corresponding predicted images, *PCT* and *Pseg*. (**b**) The process of hybrid data augmentation. The deformation field *φinter* is first obtained by inter-phase registration using traditional image registration. The small deformation *φintra* is simulated by TPS interpolation. The hybrid deformation field *φhybrid* is obtained by summing the two with random weights for data augmentation. (**c**) Inference stage. The 2D projections and moving image are input directly to the trained network to obtain the prediction *φp*, and the registration is then completed by transformation.
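The random-weight combination behind the hybrid augmentation in Figure 1b can be sketched as follows; the uniform weight range is an assumption made for illustration, not the setting used in the paper.

```python
import numpy as np

def hybrid_field(phi_inter, phi_intra, rng=None):
    """Synthesize one augmented deformation field by combining an inter-phase
    field and a simulated intra-phase (TPS) field with random weights.
    Both inputs are dense displacement volumes of identical shape."""
    rng = np.random.default_rng() if rng is None else rng
    w_inter = rng.uniform(0.0, 1.0)   # illustrative weight range (assumption)
    w_intra = rng.uniform(0.0, 1.0)
    return w_inter * phi_inter + w_intra * phi_intra
```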

#### *2.2. 2D/3D Registration Network*

Figure 2 shows the registration network. For 2D/3D image registration, the first thing to consider is the consistency of spatial dimensions. We therefore use a feature dimension-lifting approach to transform the 2D/3D registration problem into a 3D/3D registration problem. A residual network is used to extract the 2D features; its key component is the identity mapping, which prevents the gradient from vanishing and eases network training. Thus, the two DRRs at orthogonal angles are first concatenated along the channel dimension as the input of the residual network and then passed in turn through a convolution layer, a max-pooling layer, and two residual blocks with 64 and 128 output channels, respectively. The channel dimension is then treated as the third spatial dimension to form a 3D feature map, which is input to the feature extraction network together with the moving image.
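A hedged PyTorch sketch of this dimension-lifting step is shown below: the two orthogonal DRRs are concatenated along the channel axis, passed through a small residual encoder, and the resulting channel dimension is reinterpreted as the depth axis of a 3D feature map. The stem width and exact block layout are assumptions; only the 64- and 128-channel residual blocks follow the text.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Basic 2D residual block with identity mapping."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch))
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class DRREncoder2D(nn.Module):
    """Lift two orthogonal DRRs to a single-channel 3D feature volume."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(2, 32, 7, padding=3),
                                  nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.block1 = ResBlock(32, 64)    # residual block with 64 output channels
        self.block2 = ResBlock(64, 128)   # residual block with 128 output channels

    def forward(self, drr_0, drr_90):
        x = torch.cat([drr_0, drr_90], dim=1)       # (B, 2, H, W)
        x = self.block2(self.block1(self.stem(x)))  # (B, 128, H/2, W/2)
        return x.unsqueeze(1)                       # channels become depth: (B, 1, 128, H/2, W/2)
```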

We selected the 3D Attention-U-net [37,39] (3D Attu) as the feature extraction network in this study; it can be regarded as the 3D/3D matching network. 3D Attu adds an attention gate mechanism to the original U-net, which automatically distinguishes the target shape and scale and learns more useful information. It also employs encoding–decoding and skip-connection mechanisms, effectively blending high- and low-level semantic information while widening the receptive field. It has been used in many medical image processing tasks with excellent results. In this model, we feed the moving image *MCT* and the 3D feature map into the 3D Attu, and the output is the predicted deformation field.
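The attention gate at the heart of 3D Attu can be summarized by the additive gating sketch below, following the general Attention U-net formulation; the channel sizes and the assumption that the gating signal is already resampled to the skip features' resolution are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AttentionGate3D(nn.Module):
    """Additive attention gate: skip-connection features x are re-weighted by a
    spatial attention map computed from x and the coarser gating signal g."""
    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.theta_x = nn.Conv3d(x_ch, inter_ch, 1)
        self.phi_g = nn.Conv3d(g_ch, inter_ch, 1)
        self.psi = nn.Conv3d(inter_ch, 1, 1)

    def forward(self, x, g):
        # g is assumed to be resampled to the spatial size of x beforehand
        att = torch.sigmoid(self.psi(torch.relu(self.theta_x(x) + self.phi_g(g))))
        return x * att  # suppress irrelevant regions, keep target structures
```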

**Figure 2.** 2D/3D registration network. First, 2D DRRs at orthogonal angles are processed by residual blocks to obtain 3D feature maps. Then, the feature maps and moving images are fed into a 3D Attu-based encode–decode network. The final output of this network is the predicted 3D deformation field.

#### *2.3. Loss Function*

The mutual information (MI) between the ground truth *FCT* and the predicted 3D CT *PCT* obtained by the registration network constitutes the loss term *LMI*(*FCT*, *PCT*). The second term is *LDice*(*Fseg*, *Pseg*), obtained by computing the Dice between the corresponding segmented images, which allows the model to focus more on the lung region. The last term is the regularized smoothing constraint *LReg*(*φp*) on the deformation field.

$$L_{Dice}(F_{seg}, P_{seg}) = \frac{1}{n}\sum_{i=1}^{n} \frac{2\left|F_{seg}^{i} \cap P_{seg}^{i}\right|}{\left|F_{seg}^{i}\right| + \left|P_{seg}^{i}\right|} \tag{1}$$

$$L = \lambda_1 L_{Dice}(F_{seg}, P_{seg}) + \lambda_2 L_{MI}(F_{CT}, P_{CT}) + \lambda_3 L_{Reg}(\varphi_p) \tag{2}$$

where *n* denotes the number of categories in the image and *i* denotes the *i*-th category. *φp* denotes the entire predicted deformation field, over all of whose elements the regularization is computed. *λ*1, *λ*2, and *λ*3 denote the weights of *LDice*, *LMI*, and *LReg*, respectively, which were set to 0.5, 0.5, and 0.1 in this experiment.
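A possible implementation of Equations (1) and (2) is sketched below. Three details are assumptions about points the text does not specify: the Dice term is written as 1 − Dice so that minimizing the total loss maximizes overlap, the smoothness constraint is taken as the mean squared spatial gradient of the field, and the mutual-information term is left as a user-supplied differentiable callable.

```python
import torch

def dice_loss(F_seg, P_seg, eps=1e-6):
    """Soft multi-class Dice term of Eq. (1), implemented here as (1 - Dice).
    F_seg, P_seg: one-hot / soft label maps of shape (B, n, D, H, W)."""
    dims = (0, 2, 3, 4)
    inter = (F_seg * P_seg).sum(dims)
    denom = F_seg.sum(dims) + P_seg.sum(dims)
    dice = (2.0 * inter + eps) / (denom + eps)
    return 1.0 - dice.mean()

def smoothness_loss(phi_p):
    """Regularizer L_Reg (assumed form): mean squared spatial gradient of the
    deformation field phi_p of shape (B, 3, D, H, W)."""
    dz = phi_p[:, :, 1:, :, :] - phi_p[:, :, :-1, :, :]
    dy = phi_p[:, :, :, 1:, :] - phi_p[:, :, :, :-1, :]
    dx = phi_p[:, :, :, :, 1:] - phi_p[:, :, :, :, :-1]
    return (dz ** 2).mean() + (dy ** 2).mean() + (dx ** 2).mean()

def total_loss(F_ct, P_ct, F_seg, P_seg, phi_p, mi_loss,
               lam1=0.5, lam2=0.5, lam3=0.1):
    """Weighted sum of Eq. (2) with the weights reported in the text;
    mi_loss is a user-supplied differentiable mutual-information term."""
    return (lam1 * dice_loss(F_seg, P_seg)
            + lam2 * mi_loss(F_ct, P_ct)
            + lam3 * smoothness_loss(phi_p))
```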

#### **3. Experiment Setups**
