## **Simulation Study of Low-Dose Sparse-Sampling CT with Deep Learning-Based Reconstruction: Usefulness for Evaluation of Ovarian Cancer Metastasis**

**Yasuyo Urase 1, Mizuho Nishio 1,2,\*, Yoshiko Ueno 1, Atsushi K. Kono 1, Keitaro Sofue 1, Tomonori Kanda 1, Takaki Maeda 1, Munenobu Nogami 1, Masatoshi Hori 1 and Takamichi Murakami 1**


Received: 11 May 2020; Accepted: 24 June 2020; Published: 28 June 2020

**Abstract:** The usefulness of sparse-sampling CT with deep learning-based reconstruction for detection of metastasis of malignant ovarian tumors was evaluated. We obtained contrast-enhanced CT images (*n* = 141) of ovarian cancers from a public database, which were randomly divided into 71 training, 20 validation, and 50 test cases. Sparse-sampling CT images were calculated slice-by-slice by software simulation. Two deep learning models for deep learning-based reconstruction were evaluated: Residual Encoder-Decoder Convolutional Neural Network (RED-CNN) and deeper U-net. For the 50 test cases, we evaluated the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as quantitative measures. Two radiologists independently performed a qualitative evaluation of the following points: entire CT image quality; visibility of the iliac artery; and visibility of peritoneal dissemination, liver metastasis, and lymph node metastasis. The Wilcoxon signed-rank test and the McNemar test were used to compare image quality and metastasis detectability between the two models, respectively. The mean PSNR and SSIM were better with deeper U-net than with RED-CNN. For all items of the visual evaluation, deeper U-net scored significantly better than RED-CNN. The metastasis detectability with deeper U-net was more than 95%. Sparse-sampling CT with deep learning-based reconstruction proved useful in detecting metastasis of malignant ovarian tumors and might contribute to reducing overall CT radiation exposure.

**Keywords:** deep learning; neoplasm metastasis; ovarian neoplasms; radiation exposure; tomography; x-ray computed

#### **1. Introduction**

Ovarian cancer is the eighth leading cause of female cancer death worldwide [1]. The incidence of ovarian cancer increases with age and peaks in women in their 50s [2]. In addition, malignant germ cell tumors are common in young patients with ovarian cancer [3].

CT is the major modality for diagnosing ovarian tumors, detecting metastases, staging ovarian cancer, following up after surgery, and assessing the efficacy of chemotherapy. On the other hand, CT radiation exposure may be associated with elevated risks of thyroid cancer and leukemia in adults of all ages and of non-Hodgkin lymphoma in younger patients [4]. Because patients with ovarian cancer tend to be relatively young, the reduction of CT radiation exposure is essential. The radiation exposure of CT is mainly controlled by adjusting the tube current and voltage [5]. Lowering the radiation dose increases image noise, so techniques that reduce image noise and artifacts while maintaining image quality are needed. Low-dose CT images were reconstructed by filtered back projection (FBP) until the 2000s. However, iterative reconstruction (IR) has been the mainstream since the first IR technique was clinically introduced in 2009 [5]. IR technology has evolved into hybrid IR, followed by model-based IR (MBIR). IR has been reported to reduce the radiation dose by 23–76% without compromising image quality compared to FBP [5].

In recent years, a technique called sparse-sampling CT, which resembles compressed sensing in MRI, has attracted attention as a possible new technique to reduce exposure. This technique reconstructs CT images using a combination of sparse sampling and artificial intelligence (AI), especially deep learning, which may reduce CT radiation exposure by more than a factor of two compared with current technology [5]. A few studies have shown that the combination of sparse-sampling CT and deep learning could enable lower-dose CT [6,7].

Research on the noise reduction of CT images using deep learning started around 2017 [6–15]. In 2017, image-patch-based noise reduction was performed using deep learning models on low-dose CT images [7,11]. Later, Jin et al. showed that entire CT images could be directly denoised using U-net [9]. To improve perceptual image quality, generative adversarial networks (GAN) were introduced for CT noise reduction [12,13]. Following these advances, Nakamura et al. evaluated noise reduction using deep learning on a real CT scanner [16]. However, most of these studies focused on quantitative measures such as peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). To the best of our knowledge, there are few studies in which radiologists visually evaluated abnormal lesions such as metastases on CT images processed with deep learning [16]. Furthermore, quantitative measures such as PSNR and SSIM do not always agree with human-perceived image quality [17]. Therefore, we suggest that PSNR and SSIM alone cannot assure clinical usefulness and accuracy of lesion detection.

The present study aimed to evaluate the usefulness of sparse-sampling CT with deep learning-based reconstruction for radiologists to detect the metastasis of malignant ovarian tumors. This study used both quantitative and qualitative assessments of sparse-sampling CT denoised with deep learning: PSNR and SSIM, the radiologists' visual scores, and the detectability of metastasis.

#### **2. Materials and Methods**

This study used anonymized data from a public database. The regulations of our country did not require approval from an institutional review board for the use of a public database.

#### *2.1. Dataset*

Our study used abdominal CT images obtained from The Cancer Imaging Archive (TCIA) [18–20]. We used one public database of abdominal CT images available from TCIA: The Cancer Genome Atlas Ovarian Cancer (TCGA-OV) dataset. The dataset was constructed by the research community of The Cancer Genome Atlas, which focuses on the connection between cancer phenotypes and genotypes by providing clinical images. In TCGA-OV, clinical, genetic, and pathological data reside in the Genomic Data Commons Data Portal, while radiological data are stored on TCIA.

TCGA-OV provides 143 cases of abdominal contrast-enhanced CT images. Two cases were excluded from the current study because the pelvis was outside the CT scan range. The other 141 cases were included in the current study. The 141 cases were randomly divided into 71 training cases, 20 validation cases, and 50 test cases. For training, validation, and test cases, the number of CT images was 6916, 1909, and 4667, respectively.

#### *2.2. Simulation of Sparse-Sampling CT*

As in a previous study [9], sparse-sampling CT images were simulated for the 141 sets of abdominal CT images of TCGA-OV. The original CT images of the TCGA-OV were converted into sinograms with 729 pixels by 1000 views using ASTRA-Toolbox (version 1.8.3, https://www.astra-toolbox.com/), an open-source MATLAB and Python toolbox of high-performance graphics processing unit (GPU) primitives for two- and three-dimensional tomography [21,22]. To simulate sparse-sampling CT images, we uniformly (at regular view intervals) subsampled the sinograms by a factor of 10, which corresponded to 100 views. While a 20-fold subsampling rate was used in the previous study [9], our preliminary analysis revealed that the abdominal CT images simulated with a 20-fold subsampling rate were too noisy. As a result, we utilized a 10-fold subsampling rate in the current study. The 10-fold subsampled sinograms were converted into the sparse-sampling CT images using FBP of the ASTRA-Toolbox.
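Under the stated settings, the per-slice simulation can be sketched as follows. This is a minimal sketch, not the authors' code: the parallel-beam geometry, unit detector spacing, and 180° angular range are assumptions, while the 729-pixel detector, 1000 views, 10-fold uniform subsampling, and FBP reconstruction follow the text.

```python
import numpy as np
import astra  # ASTRA-Toolbox Python bindings

def simulate_sparse_ct(image, n_views=1000, n_det=729, factor=10):
    """Forward-project one CT slice, subsample the sinogram views,
    and reconstruct the sparse-sampling image with FBP."""
    vol_geom = astra.create_vol_geom(*image.shape)
    angles = np.linspace(0, np.pi, n_views, endpoint=False)

    # Full sinogram: 729 detector pixels x 1000 views
    full_geom = astra.create_proj_geom('parallel', 1.0, n_det, angles)
    full_proj = astra.create_projector('cuda', full_geom, vol_geom)
    sino_id, sino = astra.create_sino(image, full_proj)

    # Uniform 10-fold view subsampling (1000 -> 100 views), then FBP
    sub_geom = astra.create_proj_geom('parallel', 1.0, n_det, angles[::factor])
    sub_proj = astra.create_projector('cuda', sub_geom, vol_geom)
    rec_id, rec = astra.create_reconstruction('FBP_CUDA', sub_proj, sino[::factor])

    astra.data2d.delete([sino_id, rec_id])
    astra.projector.delete([full_proj, sub_proj])
    return rec
```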

#### *2.3. Deep Learning Model*

To denoise the sparse-sampling CT images, a deep learning model was employed in the current study. The outline of the training phase and deployment (denoising) phase using a deep learning model is represented in Figure 1. In the training phase, pairs of original and noisy CT images were used for constructing a deep learning model. In the deployment phase, we used the deep learning model for denoising noisy CT images. We used a workstation with GPU (GeForce RTX 2080 Ti with 11 GB memory, NVIDIA Corporation, Santa Clara, California, USA) for training and denoising.

**Figure 1.** Outline of the training phase and deployment phase of the deep learning model.

Two types of deep learning models were evaluated: Residual Encoder-Decoder Convolutional Neural Network (RED-CNN) [7] and deeper U-net with skip connections [6]. RED-CNN combines an autoencoder, a deconvolution network, and shortcut connections in its network structure, and it performed well in denoising low-dose CT images. RED-CNN used image patches extracted from the CT image (size 55 × 55 pixels) for training [7]. Nakai et al. developed deeper U-net for denoising sparse-sampling chest CT images and showed that deeper U-net was superior to conventional U-net with skip connections [6]. In contrast to RED-CNN, deeper U-net made it possible to use entire CT images (size 512 × 512 pixels) as training data. In the current study, the usefulness of deeper U-net was evaluated and compared with RED-CNN.

We implemented deeper U-net using Keras (version 2.2.2, https://keras.io/) with the TensorFlow (version 1.10.1, https://www.tensorflow.org/) backend. The major differences in network structure between our deeper U-net and Nakai's deeper U-net were as follows: (i) the number of maxpooling and upsampling operations was 9; (ii) the number of feature maps in the first convolution layer of our U-net was 104. After each maxpooling layer, the number of feature maps in the convolution layer was doubled; however, once the number of feature maps reached 832, it was not increased further. The changes in the network structure of our deeper U-net, including (i) and (ii), are shown in Appendix A in more detail. To train deeper U-net, pairs of original CT images and sparse-sampling CT images were prepared. The mean squared error (MSE) between the original and denoised CT images was used as the loss function of deeper U-net. Adam was used as the optimizer with a learning rate of 0.0001. The number of training epochs was 100, and approximately 4000 seconds were required to train deeper U-net per epoch.
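For reference, the structure described above can be sketched in Keras. This is a schematic reconstruction rather than our exact implementation: the two 3 × 3 convolutions per level, the ReLU activations, and the linear output layer are assumptions, whereas the depth of 9, the 104 initial feature maps, the 832-map cap, the MSE loss, and Adam with a learning rate of 0.0001 follow the text (written against tf.keras rather than Keras 2.2.2/TensorFlow 1.10.1).

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions per level; the pattern and activation are assumed.
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return layers.Conv2D(filters, 3, padding='same', activation='relu')(x)

def build_deeper_unet(size=512, base=104, cap=832, depth=9):
    inputs = layers.Input((size, size, 1))
    x, skips, f = inputs, [], base
    for _ in range(depth):                     # encoder with skip connections
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
        f = min(f * 2, cap)                    # doubling stops at 832 maps
    x = conv_block(x, f)
    for skip in reversed(skips):               # decoder mirrors the encoder
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, skip.shape[-1])
    outputs = layers.Conv2D(1, 1)(x)           # linear output for CT values
    model = Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss='mse')
    return model
```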

RED-CNN was trained using its PyTorch implementation (https://github.com/SSinyu/RED\_CNN). RED-CNN was trained on an image patch size of 55 × 55 pixels. Network-related parameters of RED-CNN were retained as described previously [7].

#### *2.4. Quantitative Image Analysis*

To evaluate the denoising performance of deep learning models, we used two quantitative measures, PSNR and SSIM, on the 4667 CT images from 50 test cases [23]. These parameters are frequently used as standard objective distortion measures and for quantitative assessment of the reconstructed images [10]. PSNR is defined as

$$PSNR = 20 \log_{10}\left(\frac{MAX_I}{\sqrt{MSE}}\right), \tag{1}$$

where *MSE* is calculated between the denoised and original CT images, and $MAX_I$ is the maximum value of the original CT image. SSIM is a metric intended to reflect human visual perception better than PSNR does. It is defined as

$$SSIM(x,y) = \frac{(2u_xu_y + c_1)(2s_{xy} + c_2)}{(u_x^2 + u_y^2 + c_1)(s_x^2 + s_y^2 + c_2)}, \tag{2}$$

where *x* and *y* are the denoised and original CT images, respectively; $u_x$ and $u_y$ are the means of *x* and *y*, respectively; $s_x^2$ and $s_y^2$ are the variances of *x* and *y*, respectively; $s_{xy}$ is the covariance of *x* and *y*; and $c_1$ and $c_2$ are constants determined by the dynamic range of the pixel values to stabilize the division with a weak denominator. Scikit-image (version 0.13.0, https://scikit-image.org/) was used to calculate these two quantitative measures.
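A per-slice evaluation can be sketched with scikit-image as follows; the modern metric names are shown (version 0.13.0, used in the study, exposed comparable functions under skimage.measure). Passing the maximum of the original image as data_range reproduces the $MAX_I$ definition in Equation (1).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_slice(original, denoised):
    max_i = float(original.max())  # MAX_I in Equation (1)
    psnr = peak_signal_noise_ratio(original, denoised, data_range=max_i)
    ssim = structural_similarity(original, denoised, data_range=max_i)
    return psnr, ssim
```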

#### *2.5. Qualitative Image Analysis*

On the denoised sparse-sampling CT images of the 50 test cases, normal and abnormal lesions were visually evaluated. For the qualitative evaluation, four radiologists participated: two performed the visual assessments as readers, and the other two defined and extracted the lesions to be assessed. The two groups were independent of each other.

As readers, two board-certified radiologists (with 17 and 10 years of experience, respectively) independently evaluated the denoised CT images while referring to the original images on 3D Slicer (version 4.10.2, https://www.slicer.org/) [24]. For all visual evaluations described in the following section, we used a five-point scale: (1) Unacceptable, (2) Poor, (3) Moderate, (4) Good, and (5) Excellent. The definition of each score and the detailed procedure of the qualitative evaluation are shown in Tables 1–3 and Appendix B, respectively.


**Table 1.** Score criteria for entire CT image quality.

**Table 2.** Score criteria for the evaluation of normal local lesions (common iliac artery, internal iliac artery, and external iliac artery).


**Table 3.** Score criteria for the evaluation of abnormal lesions.


The image quality of the entire CT and of the normal local lesions was evaluated. For the entire CT image quality, (A) overall image quality and (B) noise and artifacts were evaluated. The overall image quality represented a comprehensive evaluation, including noise, artifacts, and the visibility of anatomical structures.

As an evaluation of the normal local lesions, the visibility of the iliac artery (the common iliac artery, internal iliac artery, and external iliac artery) was evaluated. A score was given on whether or not the diameter could be reliably measured at each of the three points of the common iliac artery, internal iliac artery, and external iliac artery.

Peritoneal dissemination, liver metastasis, and lymph node metastasis were visually evaluated as abnormal lesions by the two radiologists. The abnormal lesions were determined by the consensus of two other independent board-certified radiologists (with 6 and 14 years of experience, respectively) on the original CT images based on the following criteria. Peritoneal dissemination was defined, as previously established, as either (1) an isolated mass or (2) subtle soft-tissue infiltration and reticulonodular lesions [25]. Lymph node metastasis was defined as a short axis ≥ 10 mm. With reference to RECIST v1.1, we defined peritoneal dissemination as non-measurable or measurable (long axis ≥ 10 mm) and liver metastasis as a lesion with long axis ≥ 10 mm [26]. The measurable lesions of peritoneal dissemination were further subdivided into long axis ≤ 20 mm and > 20 mm because the FIGO 2014 staging differs depending on the size [27].

#### *2.6. Statistical Analysis*

For the quantitative assessment of the denoised images, the mean PSNR and SSIM of deeper U-net and RED-CNN were calculated. All the qualitative image quality scores were compared between deeper U-net and RED-CNN using the Wilcoxon signed-rank test. For each abnormal lesion, a 5-point score ≥ 3 was regarded as a true positive (TP), and a 5-point score < 3 as a false negative (FN). The detectability of abnormal lesions was calculated as $TP/(TP + FN)$ (sensitivity). The detectability of abnormal lesions was compared between deeper U-net and RED-CNN using the McNemar test. Statistical analyses were performed using JMP® (version 14.2, SAS Institute Inc., Cary, NC, USA). All tests were two-sided with a significance level of 0.05.
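The analyses were performed in JMP; for illustration only, an equivalent sketch in Python (SciPy/statsmodels) is shown below. The score vectors and the 2 × 2 table are placeholders, not the study's data.

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Paired 5-point visual scores for the same items under both models
scores_unet = np.array([5, 4, 4, 5, 3])       # illustrative values only
scores_redcnn = np.array([2, 2, 3, 2, 1])
print(wilcoxon(scores_unet, scores_redcnn))   # paired image-quality test

# 2x2 detection table: rows = deeper U-net (TP, FN), cols = RED-CNN (TP, FN)
table = np.array([[75, 43],                   # illustrative counts only
                  [3, 3]])
print(mcnemar(table, exact=True))             # paired detectability test
```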

#### **3. Results**

A summary of patient demographics of the 141 cases is provided in Table 4. The location of ovarian cancer and clinical stage were available from TCIA in 140 cases. Age was obtained from DICOM data of CT images. For the 50 test cases, 124 abnormal lesions were determined, including 6 liver metastases, 25 lymph node metastases, and 93 peritoneal disseminations. For the peritoneal disseminations, the numbers of non-measurable lesions, measurable lesions with long axis ≤ 20 mm, and measurable lesions with long axis > 20 mm were 53, 28, and 12, respectively.

For normal local lesions and abnormal lesions, representative images of the original CT and the denoised CT obtained using deeper U-net and RED-CNN are shown in Figure 2. Additionally, representative images of the original CT, the sparse-sampled CT before denoising, and the denoised CT obtained using deeper U-net and RED-CNN are shown in Figure 3.


**Table 4.** Patient demographics of TCGA-OV.

Note: \* and \*\* indicate that data were obtained from 139 and 140 cases, respectively. The clinical stages of patients were extracted from the TCGA-OV dataset; it is unknown whether the clinical stage is based on the FIGO classification or the TNM classification.

**Figure 2.** Representative images of the original CT and denoised CT obtained using deeper U-net and RED-CNN. Note: (**A**) Visual scores of common iliac artery (red arrow): 5 points for deeper U-net, 2 points for RED-CNN for reader 1; 4 points for deeper U-net, 2 points for RED-CNN for reader 2. (**B**) Visual scores of liver metastasis (yellow arrow): 3 points for deeper U-net, 2 points for RED-CNN for reader 1; 4 points for deeper U-net, 2 points for RED-CNN for reader 2. Abbreviation: RED-CNN, Residual Encoder-Decoder Convolutional Neural Network.


**Figure 3.** Representative images of the original CT, sparse-sampling CT before denoising, and denoised CT obtained using deeper U-net and RED-CNN. Note: (**A**) Case 1: Visual scores of peritoneal dissemination (white circle): 4 points for deeper U-net, 1 point for RED-CNN for reader 1; 4 points for deeper U-net, 1 point for RED-CNN for reader 2. (**B**) Case 2: Visual scores of lymph node metastasis (yellow arrow): 5 points for deeper U-net, 2 points for RED-CNN for reader 1; 4 points for deeper U-net, 2 points for RED-CNN for reader 2. (**C**) Case 3: Visual scores of liver metastasis (red arrow): 4 points for deeper U-net, 2 points for RED-CNN for reader 1; 4 points for deeper U-net, 2 points for RED-CNN for reader 2. (**D**) Case 4: Visual scores of peritoneal dissemination (red arrow): 4 points for deeper U-net, 2 points for RED-CNN for reader 1; 4 points for deeper U-net, 1 point for RED-CNN for reader 2.

#### *3.1. Quantitative Image Analysis*

We evaluated PSNR and SSIM on the 4667 CT images from the 50 test cases; thus, the number of samples for calculating PSNR and SSIM was 4667. The PSNR and SSIM (mean ± standard deviation) were 29.2 ± 1.49 and 0.75 ± 0.04 for the sparse-sampling images before denoising, 48.5 ± 2.69 and 0.99 ± 0.01 for deeper U-net, and 37.3 ± 1.97 and 0.93 ± 0.02 for RED-CNN, respectively.

#### *3.2. Qualitative Image Analysis*

The results of the visual evaluation are shown in Figures 4 and 5 and Tables 5 and 6. For all items of the visual evaluation, deeper U-net scored better than RED-CNN for both readers, as shown in Table 7.

Streak artifacts tended to be stronger on images at the upper abdomen level, especially where the lung and abdominal organs were visualized on the same image.

**Figure 4.** Visual evaluation of entire CT image quality by the two readers using different deep learning algorithms.

**Table 5.** Visual evaluation of entire CT image quality by the two readers using different deep learning algorithms.


**Figure 5.** Visual evaluation of normal local lesions and abnormal lesions by the two readers using different deep learning algorithms.

**Table 6.** Visual evaluation of normal local lesions and abnormal lesions by the two readers using different deep learning algorithms.



**Table 7.** Results of visual evaluations by the two readers using different deep learning algorithms.

Abbreviation: IQR, interquartile range; RED-CNN, Residual Encoder-Decoder Convolutional Neural Network.

The detectability of abnormal lesions with deeper U-net was significantly better than that with RED-CNN: 95.2% (118/124) vs. 62.9% (78/124) (*p* < 0.0001) for reader 1 and 97.6% (121/124) vs. 36.3% (45/124) (*p* < 0.0001) for reader 2. The number of FNs with deeper U-net was six and three for readers 1 and 2, respectively. All of these lesions were non-measurable peritoneal disseminations, identified as subtle soft-tissue infiltration and reticulonodular lesions on the original CT images. Representative images of an FN case are shown in Figure 6.

**Figure 6.** A representative undiagnosable lesion on denoised CT image with deeper U-net. Note: The circle shows non-measurable peritoneal dissemination. With deeper U-net, the score was 1 point for reader 1 and 2 points for reader 2.

#### **4. Discussion**

In the current study, we compared the quantitative and qualitative image quality of sparse-sampling CT denoised with deeper U-net and RED-CNN. RED-CNN was compared with our deeper U-net because of its similar network structure [7]. For quantitative analysis, mean scores of PSNR and SSIM of CT image quality with deeper U-net were better than those with RED-CNN. For all of the visual evaluation items, the scores of CT image quality with deeper U-net were significantly better than those of RED-CNN. In addition, the detectability of ovarian cancer metastasis was more than 95% in deeper U-net.

A few studies on deep learning-based reconstruction have shown that it improves image quality and reduces noise and artifacts better than hybrid IR and MBIR [8,16,28]. Nakamura et al. reported that deep learning reconstruction could reduce noise and artifacts more than hybrid IR and that it may improve the detection of low-contrast lesions when evaluating hypovascular hepatic metastases [16]. While their study did not evaluate low-dose CT, deep learning reconstruction is also considered a promising method for enabling lower-dose CT techniques, such as sparse sampling, with clinically acceptable results [5]. Our results showed that denoising with deeper U-net could be used to detect ovarian cancer metastasis.

To the best of our knowledge, this was the first study to evaluate the detectability of cancer metastasis, including peritoneal dissemination, liver metastasis, and lymph node metastasis, on deep learning-based reconstructed CT images. The usefulness of sparse-sampling CT with deep learning has been previously reported [6,7,9,10], but image evaluation was limited to quantitative measures in most of these studies. While Nakai et al. reported quantitative and qualitative assessments of the efficacy of deep learning on chest CT images [6], our study evaluated the usefulness of sparse-sampling CT denoised with deep learning from the clinical viewpoint. We have shown that deeper U-net has an excellent ability to improve image quality and the detectability of metastasis, and it could prove effective in clinical practice.

The performance difference between deeper U-net and RED-CNN was significant when assessing sparse-sampling CT images. Strong streak artifacts around bony structures affected the image quality of sparse-sampling CT [29]. Therefore, to improve the image quality of sparse-sampling CT, an ideal deep learning model should reduce streak artifacts associated with anatomical structures. RED-CNN used image patches (size 55 × 55 pixels) for its training; therefore, the algorithm had difficulty discerning between streak artifacts and anatomical structures. As a result, RED-CNN may be limited in reducing streak artifacts related to anatomical structures. In contrast, since deeper U-net used the entire CT image (size 512 × 512 pixels) for training, it could be optimized to reduce streak artifacts related to anatomical structures. This difference between the two deep learning models may explain the performance differences observed in the current study.

Since a score of 5 was defined as image quality and visualization equivalent to the original CT (Tables 1–3), the denoised CT images of deeper U-net did not reach the same image quality as the original CT images in terms of the scores. However, the visual scores and detectability of deeper U-net were sufficiently high.

Although patients with peritoneal dissemination are diagnosed at an advanced stage, complete debulking surgery can be expected to improve the prognosis in epithelial malignant ovarian tumors [30]. In addition, some histological types, such as yolk sac tumor, have a favorable prognosis owing to successful chemotherapy [31]. Thus, the reduction of CT radiation exposure is essential for patients with ovarian cancer. With our proposed method, theoretically, the CT radiation exposure can be reduced to one-tenth of that of the original CT. This reduction of radiation exposure may reduce the incidence of radiation-induced cancer. Furthermore, while we evaluated only the detection of metastasis of malignant ovarian tumors in the current study, we speculate that the proposed method may be applicable to other diseases.

While our results show that deeper U-net proved useful in detecting cancer metastasis, the model had several drawbacks. First, fine anatomical structures were obscured by excessive denoising. This effect might be minimized by blending FBP and deep learning-based reconstructed images, as is done in hybrid IR and MBIR, while adjusting the radiation exposure (rate of sparse sampling) and the blending rate. Second, strong streak artifacts around prostheses and in the upper abdomen compromised diagnostic ability near these regions. Furthermore, streak artifacts tended to be stronger on images at the upper abdominal level, especially where the lung and abdominal organs were visualized on the same image. This may have resulted from the relatively small number of training images that included both lung and abdominal organs compared to images that included only abdominal organs. Since ovarian cancer primarily metastasizes to the peritoneal surface of the liver and to the liver, improving the image quality in these areas is a topic for future research. Increasing the number of training images with cross-sections displaying both the lung and abdominal organs may help improve image quality and reduce streak artifacts in deep learning models, including deeper U-net.

Our study had several limitations. First, we used images from only one public database. The application of our deep learning model should be further evaluated on other databases. Second, sparse-sampling images cannot be obtained from real CT scanners at the current time. Our simulated subsampled images may differ from images acquired on real scanners. In the future, we need to evaluate the performance of our deeper U-net using real CT acquisitions. Third, images obtained with a GAN-based deep learning model tend to appear more "natural" than those obtained with a conventional deep learning model. However, the noise reduction of GAN is weaker than that of a conventional deep learning model [17]. There was a concern that the radiologists' ability to detect metastasis might decline if the noise reduction was insufficient; therefore, GAN was not used in the current study. Finally, because of our study design, we did not evaluate false positives, true negatives, or specificity. Therefore, it is necessary to conduct radiologists' observer studies in which false positives and true negatives are evaluated.

#### **5. Conclusions**

Sparse-sampling CT with deep learning reconstruction could prove useful in detecting metastasis of malignant ovarian tumors and might contribute to reducing CT radiation exposure. With our proposed method, theoretically, the CT radiation exposure can be reduced to one-tenth of that of the original CT while keeping the detectability of ovarian cancer metastasis above 95%. This may reduce the incidence of radiation-induced cancer.

**Author Contributions:** Conceptualization, M.N. (Mizuho Nishio); methodology, M.N. (Mizuho Nishio); software, M.N. (Mizuho Nishio); validation, M.N. (Mizuho Nishio), Y.U. (Yasuyo Urase); formal analysis, Y.U. (Yasuyo Urase), M.N. (Mizuho Nishio); investigation, Y.U. (Yasuyo Urase), M.N. (Mizuho Nishio), Y.U. (Yoshiko Ueno), A.K.K.; resources, M.N. (Mizuho Nishio); data curation, M.N. (Mizuho Nishio); writing—original draft preparation, Y.U. (Yasuyo Urase); writing—review and editing, Y.U. (Yasuyo Urase), M.N. (Mizuho Nishio), Y.U. (Yoshiko Ueno), A.K.K., K.S., T.K., T.M. (Takaki Maeda), M.N. (Munenobu Nogami), M.H., T.M. (Takamichi Murakami); visualization, Y.U. (Yasuyo Urase), and M.N. (Mizuho Nishio); supervision, T.M. (Takamichi Murakami); project administration, M.N. (Mizuho Nishio); funding acquisition, M.N. (Mizuho Nishio). All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the JSPS KAKENHI (Grant Number JP19K17232 and JP19H03599).

**Acknowledgments:** The authors would like to thank Izumi Imaoka from the Department of Radiology, Kobe Minimally Invasive Cancer Center, for her suggestions about defining metastatic lesions of ovarian cancer.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A. Network Structure of Deeper U-Net**

The network structure of our deeper U-net is slightly modified from the original deeper U-net [6]. The differences between the two networks were as follows:


#### **Appendix B. Detail Procedure of Qualitative Evaluation by Radiologists**

The detailed procedure of the qualitative evaluation was as follows. Two board-certified radiologists (17 and 10 years of experience, respectively) independently performed the visual evaluation as readers. The patient IDs of TCGA-OV in the 50 test cases were sorted in alphabetical and numerical order, and the two radiologists interpreted the CT images in this order. All 50 sets of denoised CT images with deeper U-net were evaluated first, followed by those with RED-CNN. Image quality evaluation of the entire CT and the local normal lesions was performed by comparing the original image and the denoised image on 3D Slicer (version 4.10.2, https://www.slicer.org/) [24], and then the abnormal lesions were evaluated. Two other board-certified radiologists (6 and 14 years of experience, respectively) determined the abnormal lesions on the original images by consensus and recorded their locations in a file. In evaluating the abnormal lesions, the two readers referred to this file for the locations of the abnormal lesions. At the time of interpretation, the two readers were informed of the patient's age and blinded to all other clinical data. The image quality differed greatly between the two models; therefore, the readers could easily determine the deep learning model with which the given CT images were denoised. For this reason, the interpretation order of the denoised images with deeper U-net and RED-CNN was not randomized. It was presumed that bias in the evaluation of the denoised CT images was inevitable even if the interpretation order of deeper U-net and RED-CNN had been randomized or the evaluations of CT images denoised with the two models had been performed separately at a long interval.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Automatic Pancreas Segmentation Using Coarse-Scaled 2D Model of Deep Learning: Usefulness of Data Augmentation and Deep U-Net**

#### **Mizuho Nishio 1,\*, Shunjiro Noguchi 1 and Koji Fujimoto 2**


Received: 27 March 2020; Accepted: 9 May 2020; Published: 12 May 2020

**Abstract:** Combinations of data augmentation methods and deep learning architectures for automatic pancreas segmentation on CT images are proposed and evaluated. Images from a public CT dataset of pancreas segmentation were used to evaluate the models. Baseline U-net and deep U-net were chosen as the deep learning models of pancreas segmentation. The data augmentation methods included conventional methods, mixup, and random image cropping and patching (RICAP). Ten combinations of the deep learning models and the data augmentation methods were evaluated. Four-fold cross validation was performed to train and evaluate these models with the data augmentation methods. The dice similarity coefficient (DSC) was calculated between the automatic segmentation results and the manually annotated labels, and the results were visually assessed by two radiologists. The performance of the deep U-net was better than that of the baseline U-net, with mean DSC of 0.703–0.789 and 0.686–0.748, respectively. In both the baseline U-net and the deep U-net, the methods with data augmentation performed better than those without, and mixup and RICAP were more useful than the conventional method. The best mean DSC was obtained using a combination of deep U-net, mixup, and RICAP, and the two radiologists scored the results from this model as good or perfect in 76 and 74 of the 82 cases, respectively.

**Keywords:** pancreas; segmentation; computed tomography; deep learning; data augmentation

#### **1. Introduction**

Identification of anatomical structures is a fundamental step for radiologists in the interpretation of medical images. Similarly, automatic and accurate organ identification or segmentation is important for medical image analysis, computer-aided detection, and computer-aided diagnosis. To date, many studies have worked on automatic and accurate segmentation of organs, including lung, liver, pancreas, uterus, and muscle [1–5].

It was estimated that 606,880 Americans would die from cancer in 2019, with 45,750 of those deaths due to pancreatic cancer [6]. Among all major types of cancer, the five-year relative survival rate of pancreatic cancer was the lowest (9%). One of the reasons for this low survival rate is the difficulty of detecting pancreatic cancer in its early stages, because the organ is located in the retroperitoneal space and is in close proximity to other organs. A lack of symptoms is another reason for the difficulty of its early detection. Therefore, computer-aided detection and/or diagnosis using computed tomography (CT) may contribute to a reduction in the number of deaths caused by pancreatic cancer, similar to the effect of CT screenings on lung cancer [7,8]. Accurate segmentation of the pancreas is the first step in a computer-aided detection/diagnosis system for pancreatic cancer.

Compared with conventional techniques of organ segmentation, which use hand-tuned filters and classifiers, deep learning, such as convolutional neural networks (CNN), is a framework that lets computers learn and build these filters and classifiers from a huge amount of data. Recently, deep learning has been attracting much attention in medical image analysis, as it has been demonstrated to be a powerful tool for organ segmentation [9]. Pancreas segmentation on CT images is challenging because the pancreas does not have a distinct border with its surrounding structures. In addition, the pancreas has large shape and size variability among individuals. Therefore, several different approaches to pancreas segmentation using deep learning have been proposed [10–15].

Previous studies designed to improve the deep learning model of automatic pancreas segmentation [10–15] can be classified along three major aspects: (i) the dimension of the convolutional network, two-dimensional (2D) versus three-dimensional (3D); (ii) the use of a coarse-scaled versus a fine-scaled model; (iii) improvement of the network architecture. In (i), the accuracy of pancreas segmentation was improved in a 3D model compared with a 2D model; the 3D model makes it possible to fully utilize the 3D spatial information of the pancreas, which is useful for capturing the large variability in pancreas shape and size. In (ii), an initial coarse-scaled model was used to obtain a rough region of interest (ROI) of the pancreas, and then the ROI was used for segmentation refinement with a fine-scaled model of pancreas segmentation. The difference in mean dice similarity coefficient (DSC) between the coarse-scaled and fine-scaled models ranged from 2% to 7%. In (iii), the network architecture of a deep learning model was modified for efficient segmentation. For example, when an attention unit was introduced in a U-net, the segmentation accuracy was better than in a conventional U-net [12].

In previous studies, the usefulness of data augmentation in pancreas segmentation was not fully evaluated; only conventional methods of data augmentation were utilized. Recently proposed methods of data augmentation, such as mixup [16] and random image cropping and patching (RICAP) [17], were not evaluated.

In conventional data augmentation, horizontal flipping, vertical flipping, scaling, rotation, etc., are commonly used. It is necessary to find an effective combination of these, since among the possible combinations, some degrade the performance. Due to the number of the combinations, it is relatively cumbersome to eliminate the counterproductive combinations in conventional data augmentation. For this purpose, AutoAugment finds the best combination of data augmentation [18]. However, it is computationally expensive due to its use of reinforcement learning. In this regard, mixup and RICAP are easier to adjust than conventional data augmentation because they both have only one parameter.

The purpose of the current study is to evaluate and validate combinations of different types of data augmentation and a network architecture modification of U-net [19]. A deep U-net was used to evaluate the usefulness of the network architecture modification.

#### **2. Materials and Methods**

The current study used anonymized data extracted from a public database. Therefore, institutional review board approval was waived.

#### *2.1. Dataset*

The public dataset (Pancreas-CT) used in the current study includes 82 sets of contrast-enhanced abdominal CT images in which the pancreas was manually annotated slice-by-slice [20,21]. This dataset is publicly available from The Cancer Imaging Archive [22]. The Pancreas-CT dataset is commonly used to benchmark the segmentation accuracy of the pancreas on CT images. The CT scans in the dataset were obtained from 53 male and 27 female subjects. The age of the subjects ranged from 18 to 76 years, with a mean age of 46.8 ± 16.7 years. The CT images were acquired with Philips and Siemens multi-detector CT scanners (120 kVp tube voltage). The spatial resolution of the CT images is 512 × 512 pixels with varying pixel sizes, and the slice thickness is between 1.5 and 2.5 mm. As part of image preprocessing, the pixel values for all sets of CT images were clipped to [−100, 240] Hounsfield units and then rescaled to the range [0, 1]. This preprocessing is commonly used for the Pancreas-CT dataset [15].
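A minimal sketch of this preprocessing step:

```python
import numpy as np

def preprocess_ct(ct_hu: np.ndarray) -> np.ndarray:
    """Clip to [-100, 240] HU, then rescale linearly to [0, 1]."""
    clipped = np.clip(ct_hu, -100.0, 240.0)
    return (clipped + 100.0) / 340.0  # (x - min) / (max - min)
```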

#### *2.2. Deep Learning Model*

U-net was used as the baseline deep learning model in the current study [19]. U-net consists of an encoding–decoding architecture. Downsampling and upsampling are performed in the encoding and decoding parts of U-net, respectively. The most important characteristic of U-net is the presence of shortcut connections between the encoding part and the decoding part at equal resolution. While the baseline U-net performs downsampling and upsampling 4 times [19], the deep U-net performs downsampling and upsampling 6 times. In addition to the number of downsampling and upsampling operations, the number of feature maps in the convolution layers and the use of dropout were changed in the deep U-net; the number of feature maps in the first convolution layer was set to 40 and the dropout probability to 2%. In the baseline U-net, 64 feature maps and no dropout were used. In both the baseline U-net and the deep U-net, the number of feature maps in the convolution layer was doubled after each downsampling. Figure 1 presents the deep U-net model of the proposed method. Both the baseline U-net and the deep U-net utilized batch normalization. Keras (https://keras.io/) with the TensorFlow (https://www.tensorflow.org/) backend was used for the implementation of the U-net models. The image dimension of the input and output in the two U-net models was 512 × 512 pixels.
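One level of the deep U-net described above can be sketched as follows; the kernel size, the activation, and the two-convolutions-per-level pattern are assumptions, while the dropout probability of 2%, batch normalization, and the coupling of dropout with each convolution (Figure 1) follow the text.

```python
from tensorflow.keras import layers

def deep_unet_block(x, filters):
    """One resolution level of the deep U-net (sketch)."""
    for _ in range(2):                    # two conv layers per level (assumed)
        x = layers.Dropout(0.02)(x)       # dropout coupled with the convolution
        x = layers.Conv2D(filters, 3, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    return x

# Encoder: 6 downsamplings, starting at 40 feature maps and doubling each time.
```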

**Figure 1.** Illustration of the deep U-net model. The number of downsampling and upsampling is 6 in the deep U-net. Except for the last convolution layer, dropout and convolution layer are coupled. Abbreviations: convolution layer (conv), maxpooling layer (maxpool), upsampling and convolution layer (up conv), rectified linear unit (ReLU).

#### *2.3. Data Augmentation*

To prevent overfitting in the training of the deep learning model, we utilized the following three types of data augmentation methods: conventional method, mixup [16], and RICAP [17]. Although mixup and RICAP were initially proposed for image classification tasks, we utilized them for segmentation by merging or cropping/patching labels in the same way as is done for images.

Conventional augmentation methods included ±5° rotation, ±5% x-axis shift, ±5% y-axis shift, and 95–105% scaling. Both the image and its label were changed by the same transformation when using a conventional augmentation method.

Mixup generates a new training sample from a linear combination of existing images and their labels [16]. Here, two sets of training samples are denoted by (*x*, *y*) and (*x'*, *y'*), where *x* and *x'* are images, and *y* and *y'* are their labels. A generated sample (*x*#, *y*#) is given by:

$$\mathbf{x}^{\#} = \lambda \mathbf{x} + (1 - \lambda)\mathbf{x}' \tag{1}$$

$$y^\# = \lambda y + (1 - \lambda)y' \tag{2}$$

where λ ranges from 0 to 1 and is distributed according to beta distribution: λ~*Beta*(β, β) for β ∈ (0, ∞). The two samples to be combined are selected randomly from the training data. The hyperparameter β of mixup was set to 0.2 empirically.
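A minimal mixup sketch for segmentation, merging the label exactly as the image per Equations (1) and (2), with β = 0.2 as stated:

```python
import numpy as np

def mixup(x1, y1, x2, y2, beta=0.2, rng=None):
    """Blend two (image, label) training pairs into one sample."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(beta, beta)  # lambda ~ Beta(beta, beta)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```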

RICAP generates a new training sample from four randomly selected images [17]. The four images are randomly cropped and patched according to a boundary position (*w*, *h*), which is determined according to beta distribution: *w*~*Beta*(β, β) and *h*~*Beta*(β, β). We set the hyperparameter β of RICAP to 0.4 empirically. For four images to be combined, the coordinates (*x*k, *y*k) (*k* = 1, 2, 3, and 4) of the upper left corners of the cropped areas are randomly selected. The sizes of the four cropped images are determined based on the value (*w*, *h*), such that they do not increase the original image size. A generated sample is obtained by combining the four cropped images. In the current study, the image and its label were cropped at the same coordinate and size.
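A minimal RICAP sketch under the stated settings (β = 0.4, image and label cropped at the same coordinates); the exact quadrant layout follows the usual RICAP formulation and is an assumption here.

```python
import numpy as np

def ricap(images, labels, beta=0.4, rng=None):
    """Crop four (image, label) pairs and patch them into one sample.

    images: array of shape (4, H, W[, C]); labels: same leading shape.
    """
    rng = rng or np.random.default_rng()
    H, W = images.shape[1:3]
    # Boundary position (w, h) drawn from Beta(beta, beta)
    w = int(round(rng.beta(beta, beta) * W))
    h = int(round(rng.beta(beta, beta) * H))
    sizes = [(h, w), (h, W - w), (H - h, w), (H - h, W - w)]
    offsets = [(0, 0), (0, w), (h, 0), (h, w)]
    new_img, new_lbl = np.zeros_like(images[0]), np.zeros_like(labels[0])
    for k, ((hk, wk), (oy, ox)) in enumerate(zip(sizes, offsets)):
        # Random upper-left corner (x_k, y_k) of the crop in image k
        yk = rng.integers(0, H - hk + 1)
        xk = rng.integers(0, W - wk + 1)
        new_img[oy:oy + hk, ox:ox + wk] = images[k, yk:yk + hk, xk:xk + wk]
        new_lbl[oy:oy + hk, ox:ox + wk] = labels[k, yk:yk + hk, xk:xk + wk]
    return new_img, new_lbl
```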

#### *2.4. Training*

The Dice loss function was used as the optimization target of the deep learning models; a minimal sketch of this loss is given after the list below. RMSprop was used as the optimizer, and its learning rate was set to 0.00004. The number of training epochs was set to 45. Following previous works on pancreas segmentation, we used 4-fold cross-validation to assess the robustness of the model (20 or 21 subjects were chosen for validation in each fold). The hyperparameters related to U-net and its training were selected using random search [23]. After the random search, the hyperparameters were fixed. The following 10 combinations of deep learning models and data augmentation methods were used: (1) Baseline U-net + no data augmentation; (2) Baseline U-net + conventional method; (3) Baseline U-net + mixup; (4) Baseline U-net + RICAP; (5) Baseline U-net + RICAP + mixup; (6) Deep U-net + no data augmentation; (7) Deep U-net + conventional method; (8) Deep U-net + mixup; (9) Deep U-net + RICAP; (10) Deep U-net + RICAP + mixup.
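A minimal tf.keras sketch of the loss and optimizer settings; the smoothing constant is an assumption (not stated in the text) added to avoid division by zero on empty masks.

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1.0):
    """1 - Dice coefficient; smooth is an assumed stabilizing constant."""
    y_true_f = tf.reshape(y_true, [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    return 1.0 - (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth)

# model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=4e-5),
#               loss=dice_loss)  # learning rate 0.00004 as stated
```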


#### *2.5. Evaluation of Pancreas Segmentation*

For each validation case of the Pancreas-CT dataset, three-dimensional CT images were processed slice-by-slice using the trained deep learning models, and the segmentation results were stacked. Except for the stacking, no complex postprocessing was utilized. Quantitative and qualitative evaluations were performed for the automatic segmentation results.

The metrics of quantitative evaluation were calculated using the three-dimensional segmentation results and annotated labels. Four types of metrics were used for the quantitative evaluation of the segmentation results: dice similarity coefficient (DSC), Jaccard index (JI), sensitivity (SE), and specificity (SP). These metrics are defined by the following equations:

$$DSC = \frac{2|P \cap L|}{|P| + |L|} \tag{3}$$

$$JI = \frac{|P \cap L|}{|P| + |L| - |P \cap L|} \tag{4}$$

$$SE = \frac{|P \cap L|}{|L|} \tag{5}$$

$$SP = 1 - \frac{|P| - |P \cap L|}{|I| - |L|} \tag{6}$$

where |*P*|, |*L*|, and |*I*| denote the number of voxels in the pancreas segmentation result, the annotated label of pancreas segmentation, and the three-dimensional CT images, respectively. |*P* ∩ *L*| represents the number of voxels where the deep learning models accurately segmented the pancreas (true positives). Before calculating the four metrics, a threshold of 0.5 was applied to the output of the U-net to obtain the pancreas segmentation mask [24]. The threshold of 0.5 was fixed for all 82 cases. A Wilcoxon signed-rank test was used to test statistical significance among the DSC results of the 10 combinations of deep learning models and data augmentation methods. Bonferroni correction was used to control the family-wise error rate; *p*-values less than 0.05/45 = 0.00111 were considered statistically significant.
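Equations (3)–(6) translate directly to NumPy; a sketch with the stated 0.5 threshold:

```python
import numpy as np

def segmentation_metrics(output, label):
    """DSC, JI, SE, and SP for one stacked 3D case (Equations (3)-(6))."""
    p = output >= 0.5                  # predicted pancreas mask P (threshold 0.5)
    l = label.astype(bool)             # annotated label L
    tp = np.logical_and(p, l).sum()    # |P ∩ L|
    dsc = 2 * tp / (p.sum() + l.sum())
    ji = tp / (p.sum() + l.sum() - tp)
    se = tp / l.sum()
    sp = 1 - (p.sum() - tp) / (p.size - l.sum())  # p.size = |I|
    return dsc, ji, se, sp
```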

For the qualitative evaluation, two radiologists with 14 and 6 years of experience visually evaluated both the manually annotated labels and the automatic segmentation results using a 5-point scale: 1, unacceptable; 2, slightly unacceptable; 3, acceptable; 4, good; 5, perfect. Inter-observer variability between the two radiologists was evaluated using weighted kappa with squared weights.

#### **3. Results**

Table 1 shows the results of the qualitative evaluation of the pancreas segmentation of Deep U-net + RICAP + mixup and of the manually annotated labels. The mean visual scores of the manually annotated labels were 4.951 and 4.902 for the two radiologists, and those of the automatic segmentation results were 4.439 and 4.268. The mean scores of the automatic segmentation results demonstrate that the accuracy of the automatic segmentation was good; 92.6% (76/82) and 87.8% (74/82) of the cases were scored as 4 or above. Notably, Table 1 shows that the manually annotated labels were scored as 4 (good, but not perfect) in four and eight cases by the two radiologists. Weighted kappa values between the two radiologists were 0.465 (moderate agreement) for the manually annotated labels and 0.723 (substantial agreement) for the automatic segmentation results.

**Table 1.** Results of qualitative evaluation of automatic pancreas segmentation and manually annotated labels.


Table 2 shows the results of the quantitative evaluation of pancreas segmentation. The mean and standard deviation of DSC, JI, SE, and SP were calculated from the validation cases of the 4-fold cross validation for the Pancreas-CT dataset. The mean DSC of the deep U-net (0.703–0.789) was better than that of the baseline U-net (0.686–0.748) across all data augmentation methods. Because the mean SP was 1.00 in all the combinations, non-pancreas regions were not segmented by the models. Therefore, the mean DSC was mainly affected by the mean SE (segmentation accuracy for the pancreas region only), as shown in Table 2. Table 2 also shows the usefulness of data augmentation. In both the baseline U-net and the deep U-net, the model combined with any of the three types of data augmentation performed better than the model with no data augmentation. In addition, mixup and RICAP were more useful than the conventional method; the best mean DSC was obtained using the deep U-net combined with RICAP and mixup.

**Table 2.** Results of quantitative evaluation of automatic pancreas segmentation from the 82 cases using 4-fold cross validation.


Note: data are shown as mean ± standard deviation. Abbreviations: Random image cropping and patching (RICAP), dice similarity coefficient (DSC), Jaccard index (JI), sensitivity (SE), and specificity (SP).

Table A2 of Appendix B shows the results of the Wilcoxon signed rank test. After the Bonferroni correction, the DSC differences between Deep U-net + RICAP + mixup and the other six models were statistically significant.

Representative images of pancreas segmentation are shown in Figures 2 and 3. In the case of Figure 2, the manually annotated label was scored as 4 by the two radiologists because the main pancreas duct and its surrounding tissue were excluded from the label.

**Figure 2.** Representative image of automatic pancreas segmentation. (**a**) Original computed tomography (CT) image; (**b**) CT image with manually annotated label in red, scored as not perfect by two radiologists; (**c**) CT image with automatic segmentation in blue.

**Figure 3.** Representative image of a low-quality automatic pancreas segmentation. (**a**) Original computed tomography (CT) image; (**b**) CT image with manually annotated label in red; (**c**) CT image with automatic segmentation in blue, with part of the pancreas excluded from the segmentation.

#### **4. Discussion**

The results of the present study show that the three types of data augmentation were useful for the pancreas segmentation in both the baseline U-net and deep U-net. In addition, the deep U-net, which is characterized by additional layers, was overall more effective for automatic pancreas segmentation than the baseline U-net. In data augmentation, not only the conventional method, but also mixup and RICAP were useful for pancreas segmentation; the combination of mixup and RICAP was the most useful.

Table 3 summarizes results of previous studies using the Pancreas-CT dataset. While Table 3 includes the studies with coarse-scaled models, Table A1 includes the studies with fine-scaled models. As shown in Table 3, the coarse-scaled 2D model of the current study achieved sufficiently high accuracy, comparable to those of previous studies. While the present study focused on the 2D coarse-scaled models, the data augmentation methods used in the present study can be easily applied to 3D fine-scaled models. Therefore, it can be expected that the combination of the proposed data augmentation methods and 3D fine-scaled models might lead to further improvement of automatic pancreas segmentation.


**Table 3.** Summary of coarse-scaled models using the Pancreas-CT dataset.

These data augmentation methods were originally proposed for classification models, and the effectiveness of mixup has been validated for segmentation on brain MRI images [25]. The results of the current study demonstrate the effectiveness of multiple types of data augmentation methods for the two U-net models for automatic pancreas segmentation. To the best of our knowledge, the current study is the first to validate the usefulness of multiple types of data augmentation methods in pancreas segmentation.

Table 2 shows that the deep U-net was better than the baseline U-net. The deep U-net included additional layers in its network architecture compared with the baseline U-net. It is speculated that these additional layers led to the performance improvement for pancreas segmentation. Nakai et al. [26] showed that deeper U-net could efficiently denoise low-dose CT images; they also showed that deeper U-net was better than baseline U-net. Kurata et al. [4] showed that their U-net with additional layers was effective for uterine segmentation. The results of the current study are consistent with the results of these studies. The effectiveness of deep/deeper U-net has not been sufficiently investigated so far. Because U-net can be used for segmentation, image denoising, detection, and modality conversion, it is necessary to evaluate for which tasks deep/deeper U-net is effective.

The combined use of mixup and RICAP was the best data augmentation in the current study. The combination of mixup and RICAP was also used in a study of bone segmentation [24]. The results of bone segmentation showed that the effectiveness of data augmentation was observed in a dataset with limited cases and that the optimal combination was the conventional method plus RICAP. Based on the studies of bone and pancreas segmentation, the usefulness of combining the conventional method, mixup, and RICAP should be further investigated.

Sandfort et al. used CycleGAN as data augmentation to improve generalizability in organ segmentation on CT images [27]. CycleGAN was also used for data augmentation in the classification task [28]. Because the computational cost of training CycleGAN is relatively high, the use of CycleGAN as a data augmentation method needs some consideration. In this regard, computational cost of mixup and RICAP is relatively low, and mixup and RICAP are easy to implement.

The accuracy of pancreas segmentation was visually evaluated by the two radiologists in the current study. To our knowledge, no previous deep learning study has evaluated the segmentation accuracy of the pancreas visually. The visual scores indicate that the automatic segmentation model of the current study performed well. The proposed model may be useful for clinical cases if the clinical CT images are of similar condition and quality to those of the Pancreas-CT dataset.

In the current study, we evaluated automatic pancreas segmentation using the public Pancreas-CT dataset. Although this dataset has been used in several studies, as shown in Table 3, the manually annotated labels of four or eight cases were scored as not perfect in the visual assessment of the current study. In most of these cases, the labels for the pancreas head were assessed as low-quality. It is presumed that the low-quality labeling was caused by annotators not fully understanding the boundary between the pancreas and other organs (e.g., the duodenum). Reliable labeling is mandatory for evaluating segmentation accuracy; for this purpose, a new database for pancreas segmentation is desirable.

There were several limitations to the present study. First, we investigated the usefulness of data augmentation only in segmentation models. The usefulness of data augmentation should be evaluated for other models, such as classification, detection, and image generation. Second, the 3D fine-scaled model of pancreas segmentation was not evaluated. Because U-net, mixup, and RICAP were originally proposed for 2D models, we constructed and evaluated the 2D model of pancreas segmentation. We will apply the proposed methods to the 3D fine-scaled model in future research.

#### **5. Conclusions**

The combination of deep U-net with mixup and RICAP achieved automatic pancreas segmentation, which the radiologists scored as good or perfect. We will further investigate the usefulness of the proposed method for the 3D coarse-scaled/fine-scaled models to improve segmentation accuracy.

**Author Contributions:** Conceptualization, M.N.; methodology, M.N.; software, M.N. and S.N.; validation, M.N. and S.N.; formal analysis, M.N.; investigation, M.N.; resources, M.N. and K.F.; data curation, M.N. and S.N.; writing—original draft preparation, M.N.; writing—review and editing, M.N., S.N., and K.F.; visualization, M.N.; supervision, K.F.; project administration, M.N.; funding acquisition, M.N. All authors have read and agreed to the published version of the manuscript.

**Funding:** The present study was supported by JSPS KAKENHI, grant number JP19K17232.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A**


**Table A1.** Summary of fine-scaled models using Pancreas-CT dataset.

#### **Appendix B**


**Table A2.** Results of Statistical significance for DSC difference.

Note: In Target 1 and Target 2, the values of the cells denote the following: (1) Baseline U-net + no data augmentation, (2) Baseline U-net + conventional method, (3) Baseline U-net + mixup, (4) Baseline U-net + RICAP, (5) Baseline U-net + RICAP + mixup, (6) Deep U-net + no data augmentation, (7) Deep U-net + conventional method, (8) Deep U-net + mixup, (9) Deep U-net + RICAP, (10) Deep U-net + RICAP + mixup. *p*-values less than 0.05/45 = 0.00111 were considered statistically significant.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
