Percentages were calculated after excluding missing cases from the denominator; & Log-transformed; Data are displayed as number (percentage) for categorical variables and mean (±standard deviation) for normally distributed continuous data. BMI: body mass index; hsCRP: high-sensitivity C-reactive protein; CKD-EPI: Chronic Kidney Disease Epidemiology Collaboration; HDL: high-density lipoprotein; LDL: low-density lipoprotein.

#### *3.2. Risk Factors for DISH*

Results of logistic regression analyses are listed in Table 2. After adjusting for age and sex, DISH was significantly associated with presence of metabolic syndrome (OR 1.78 (95%CI: 1.43–2.24)), the presence of diabetes (OR 1.50 (95%CI: 1.18–1.91)), and glucose (per 1 mmol/L) (OR 1.10 (95%CI: 1.04–1.17)). Systolic blood pressure (per 1 mmHg), the presence of hypertension, and pulse pressure (per 1 mmHg) were also associated with DISH, whereas diastolic blood pressure was not. Regarding blood lipid profile, DISH was associated with HDL-cholesterol.




**Table 2.** *Cont*.

\* Sex adjusted; # Age adjusted; & Log-transformed. OR: odds ratio; CI: confidence interval; BMI: body mass index; hsCRP: high-sensitivity C-reactive protein; CKD-EPI: Chronic Kidney Disease Epidemiology Collaboration; HDL: high-density lipoprotein; LDL: low-density lipoprotein.

#### *3.3. Intra-Abdominal Fat Measurements and Adiposity Markers in Relation to DISH in Males*

Results for adiposity measurements, expressed per 1 SD increase, in relation to the presence of DISH in males are listed in Table 3. In the crude analysis, the presence of DISH was associated with the adiposity measures weight, BMI, waist circumference, subcutaneous fat, VAT, and VAT%. After full adjustment, the significant adiposity markers were weight (OR 1.56; 95%CI: 1.36–1.79), BMI (OR 1.58; 95%CI: 1.28–1.94), waist circumference (OR 1.45; 95%CI: 1.15–1.82), and VAT (OR 1.35; 95%CI: 1.20–1.54). An increase of 1 SD in subcutaneous fat, the waist-to-hip ratio, or VAT% was not significantly associated with the presence of DISH. In general, the adiposity measures weight, BMI, waist circumference, and VAT were significant for all grades of DISH in the crude and fully adjusted analyses. In the most severe DISH group, the relation between VAT and the presence of DISH became stronger (OR 1.61; 95%CI: 1.31–1.98). Moreover, in this group with the most severe DISH, a 1 SD increase in subcutaneous fat was negatively associated with the presence of DISH (OR 0.65; 95%CI: 0.49–0.95), whereas VAT% was positively associated with the presence of DISH (OR 1.80; 95%CI: 1.25–2.68). These relations for subcutaneous fat and VAT% were not observed in the groups with grade 1 or grade 2 DISH.

**Table 3.** Adiposity measurements per SD with different severities of DISH as outcome in males.




Model 1: DISH crude; Model 2: adjusted for age; Model 3: adjusted for age, systolic blood pressure, diabetes, non-HDL cholesterol, smoking status, and renal function. <sup>a</sup> *p* < 0.05, SD: standard deviation; OR: odds ratio; CI: confidence interval; BMI: body mass index; VAT: visceral adipose tissue; VAT%: visceral adipose tissue in relation to total abdominal fat.

#### *3.4. Intra-Abdominal Fat Measurements and Adiposity Markers in Relation to DISH in Females*

Table 4 lists the results of adiposity measures in females in relation to the presence of DISH. The presence of DISH was related to the markers weight (OR 1.52; 95%CI: 1.20–1.94), BMI (OR 1.55; 95%CI: 1.28–1.89), waist circumference (OR 1.54; 95%CI: 1.06–2.24), and VAT (OR 1.71; 95%CI: 1.33–2.19). After adjusting for cardiovascular risk factors, the relation between the presence of DISH and waist circumference became attenuated (OR 1.39; 95%CI: 0.89–2.16), while an increase of 1 SD in subcutaneous fat was associated with the presence of DISH (OR 1.43; 95%CI: 1.14–1.80). The adiposity markers weight (OR 1.75; 95%CI: 1.29–2.38), BMI (OR 1.66; 95%CI: 1.30–2.13), and VAT (OR 1.43; 95%CI: 1.06–1.93) remained significantly associated after full adjustment. Across the different grades of DISH, the adiposity measures weight and BMI were significant for all grades in the crude and fully adjusted analyses.



Model 1: DISH crude; Model 2: adjusted for age; Model 3: adjusted for age, systolic blood pressure, diabetes, non-HDL cholesterol, smoking status, and renal function. <sup>a</sup> *p* < 0.05, SD: standard deviation; OR: odds ratio; CI: confidence interval; BMI: body mass index; VAT: visceral adipose tissue; VAT%: visceral adipose tissue in relation to total abdominal fat.

#### **4. Discussion**

In the current study, we aimed to assess the relation between different severities of DISH and various measurements of adiposity in both males and females with a high risk for cardiovascular disease. We found that, in males, all adiposity markers except for subcutaneous fat and the waist-to-hip ratio were associated with the presence of DISH. When analyzing the group with the most severe DISH, the relation between VAT and the presence of DISH became stronger. Moreover, increased subcutaneous fat was negatively associated with cases of DISH with extensive ossification, reinforcing the importance of adipose tissue distribution in the pathogenesis of DISH.

In females, the adiposity markers associated with the presence of DISH were weight, BMI, subcutaneous fat, and VAT. Unlike in males, waist circumference was not associated with the presence of DISH after full adjustment, whereas increased subcutaneous fat was positively associated with the presence of DISH.

The risk factors we identified for DISH in our cohort also relate strongly to the presence of VAT and obesity [16], pointing to a probable causal relation between VAT and insulin resistance. The formation of bone in DISH is potentially linked with metabolic derangements via the insulin-like growth factor-I pathway, which is able to induce proliferation in chondrocytes and osteoblasts [17].

The prevalence of DISH in our cohort was 9.0% and our data confirm previously observed associations between DISH and BMI [3,18–21], diabetes [3,19–21], waist circumference [5,18,22], metabolic syndrome [5,18], systolic blood pressure [18,23], and hypertension [5,18]. A higher level of HDL-cholesterol was significantly associated with the presence of DISH in our study, whereas other cohorts did not find this relation [5,18]. These risk factors are described to strongly relate to excess levels of VAT and the presence of insulin resistance [16]. In line with previous work, no association was found between DISH and hsCRP [18]. As our patient population had increased risk for cardiovascular disease, a large portion of our cohort was treated with statin therapy for cardiovascular risk management. The use of statins is associated with a reduction in levels of hsCRP [24], which may explain why no significant difference was observed for hsCRP between the groups with and without DISH in our cohort.

Our results show that the presence of DISH is associated with VAT, which is in accordance with Lantsman et al. [25] and Okada et al. [26], who measured VAT in DISH patients using CT imaging. In the study by Okada and colleagues, the area of VAT was significantly increased in DISH patients (mean ± SD: 130.7 ± 58.2 cm<sup>2</sup> vs. 89.0 ± 48.1 cm<sup>2</sup>).

Interestingly, females with DISH had both increased subcutaneous fat and increased VAT in our cohort. In males, by contrast, increased VAT was linked with DISH while increased subcutaneous fat was not. When estimating the percentage of VAT in relation to total abdominal fat, no association was found between VAT% and DISH in either sex. This might be explained by the poor reliability of ultrasound-based adiposity measurements as proxies for VAT accumulation in relation to total abdominal fat. Ideally, CT-based segmentations in the coronal plane are preferred, as these can more accurately measure the total area of visceral fat in relation to the total area of abdominal fat. To minimize this discrepancy, our measurements adhered to a strict protocol, and the estimations were averaged over multiple measurements of the same patient.

Although other adiposity markers had stronger observed associations with DISH than VAT in our study, our results still indicate that a one SD increase in VAT is associated with a 35% and 43% increase in the odds of DISH in males and females, respectively. VAT is known to increase with older age, and a higher percentage of VAT is found in men [27,28]. Furthermore, it is now well established that VAT produces different adipokines and inflammatory molecules including leptin, adiponectin, tumor necrosis factor-α, and interleukin-6. In the literature, few studies have reported these adipokines in relation to DISH. Visceral obesity results in lower levels of adiponectin [29], which was reported for DISH in two studies [30,31]. Moreover, increased levels of leptin [31,32] and visfatin [30] were also observed in DISH patients. Both leptin and adiponectin are known to influence bone metabolism and bone homeostasis [31,33]. An adequate explanation for the role of these adipokines in the pathogenesis of DISH remains to be determined. Recently, Mader et al. [34] reviewed the involvement of a possible inflammatory component in DISH, and concluded that local inflammation, prior to or as a consequence of metabolic derangements, could play a crucial role in the development of DISH. Our results support the notion that research on VAT and inflammation should be further (re)explored in patients with DISH.

#### *Strengths and Limitations*

The strengths of our study are the relatively large sample size of our prospective cohort, with extensive and accurate information on a broad array of cardiovascular risk factors. Moreover, we studied the relative importance of adiposity measurements and corrected for confounders, which has not been reported previously in DISH.

Our study, however, also has limitations. First, visceral and subcutaneous fat measurements obtained with ultrasonography have been reported to be prone to measurement variability. However, an interobserver coefficient of variation of 5.4% was found for our cohort, indicating good measurement reliability [12]. Secondly, the Resnick criteria for DISH are arbitrary, and some milder forms or earlier stages of DISH will be misclassified. This can result in some underestimation of the associations. Finally, the cross-sectional design of our study warrants caution when drawing causal etiological conclusions.

#### **5. Conclusions**

To summarize, measurements of adiposity, including visceral adipose tissue thickness, were associated with the presence of DISH in both males and females. Subcutaneous adipose tissue thickness was negatively associated in males with most severe DISH. In females, subcutaneous adipose tissue was positively associated with the presence of DISH. Our research supports further investigation into the role of visceral adipose tissue and insulin resistance in the pathogenesis of DISH.

**Author Contributions:** Conceptualization, N.I.H., J.W. and P.A.d.J.; methodology, N.I.H. and J.W.; software, N.I.H.; validation, N.I.H. and J.W.; formal analysis, N.I.H.; visualization, N.I.H.; data curation, UCC-SMART-Study Group, F.A.A.M.H., P.A.d.J., W.F., M.E.H., R.W., P.H.v.d.V., and B.v.G.; writing—original draft preparation, N.I.H., J.W., P.A.d.J., and F.A.A.M.H.; writing—review and editing, N.I.H., J.W., W.F., M.E.H., R.W., P.H.v.d.V., B.v.G., J.S.K., J.-J.V., P.A.d.J., and F.A.A.M.H.; supervision, J.W., P.A.d.J., J.-J.V., and F.A.A.M.H.; project administration, N.I.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** The UCC-SMART study was financially supported by a grant from the University Medical Center Utrecht. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Institutional Review Board Statement:** The UCC-SMART study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of the University Medical Center Utrecht (NL45885.041.13).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The informed consent that was signed by the study participants is not compliant with publishing individual data in an open access institutional repository or as supporting information files with the published paper. However, a data request can be sent to the SMART Steering Committee at uccdatarequest@umcutrecht.nl.

**Acknowledgments:** We gratefully acknowledge the contribution of the research nurses; R. van Petersen; B. van Dinther and the Members of the Utrecht Cardiovascular Cohort-Second Manifestations of ARTerial disease-Study Group (UCC-SMART-Study Group); F.W. Asselbergs and H.M. Nathoe, Department of Cardiology; G.J. de Borst, Department of Vascular Surgery; M.L. Bots and M.I. Geerlings, Julius Center for Health Sciences and Primary Care; M.H. Emmelot, Department of Geriatrics; P.A. de Jong and T. Leiner, Department of Radiology; A.T. Lely, Department of Obstetrics and Gynecology; N.P. van der Kaaij, Department of Cardiothoracic Surgery; L.J. Kappelle and Y.M. Ruigrok, Department of Neurology; M.C. Verhaar, Department of Nephrology; F.L.J. Visseren and J. Westerink, Department of Vascular Medicine, University Medical Center Utrecht and Utrecht University.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

#### **References**


#### *Article*

## **Evaluation of Scalability and Degree of Fine-Tuning of Deep Convolutional Neural Networks for COVID-19 Screening on Chest X-ray Images Using Explainable Deep-Learning Algorithm**

#### **Ki-Sun Lee <sup>1,</sup>\*<sup>,†</sup>, Jae Young Kim <sup>1,†</sup>, Eun-tae Jeon <sup>1</sup>, Won Suk Choi <sup>2</sup>, Nan Hee Kim <sup>3</sup> and Ki Yeol Lee <sup>4</sup>**


Received: 22 October 2020; Accepted: 30 October 2020; Published: 7 November 2020

**Abstract:** According to recent studies, patients with COVID-19 have different feature characteristics on chest X-ray (CXR) than those with other lung diseases. This study aimed to evaluate the effect of layer depth and degree of fine-tuning in transfer learning with deep convolutional neural network (CNN)-based COVID-19 screening on CXR, in order to identify efficient transfer learning strategies. The CXR images used in this study were collected from publicly available repositories, and the collected images were classified into three classes: COVID-19, pneumonia, and normal. To evaluate the effect of layer depth within the same CNN architecture, the CNNs VGG-16 and VGG-19 were used as backbone networks. Each backbone network was then trained with different degrees of fine-tuning and comparatively evaluated. The highest AUC value for COVID-19 classification, 0.950, was obtained in the experimental group in which only 2 of the 5 blocks of the VGG16 backbone network were fine-tuned. In conclusion, in the classification of medical images with a limited amount of data, a deeper network may not guarantee better results. In addition, even if the same pre-trained CNN architecture is used, an appropriate degree of fine-tuning can help to build an efficient deep learning model.

**Keywords:** COVID-19; chest X-ray; deep learning; convolutional neural network; Grad-CAM

#### **1. Introduction**

Coronavirus disease (COVID-19) has quickly become a global pandemic since it was first reported in December 2019, reaching approximately 21.3 million confirmed cases and 761,799 deaths as of 16 August 2020 [1]. Due to the highly infectious nature of the virus and the unavailability of appropriate treatments and vaccines, early screening of COVID-19 is crucial to prevent the spread of the disease through the timely isolation of susceptible individuals and the proper allocation of limited medical resources.

Reverse transcription polymerase chain reaction (RT-PCR) has been introduced as the gold standard screening method for COVID-19 [2]. However, since the overall positive rate of RT-PCR using nasal and throat swabs is reported to be 60–70% [3], there is a risk that a false-negative patient may again act as another source of infection in a healthy community. Conversely, high sensitivity for COVID-19 screening has been reported for radiological tests such as chest computed tomography and chest X-ray (CXR) [3–5]. According to reports on the CXR characteristics of patients with confirmed COVID-19, the images showed multi-lobar involvement and peripheral airspace opacities, most frequently in the form of ground-glass opacities [6]. However, because in the early stages of COVID-19 this ground-glass pattern may appear at the edges of the lung vessels or as asymmetric diffuse airspace opacities [7], it can be difficult to visually detect the characteristic patterns of COVID-19 on X-rays. Therefore, given that the number of suspected patients increases exponentially while the number of highly trained radiologists is limited, diagnostic support from an automated screening algorithm that produces objective, reproducible, and scalable results can speed up early and precise diagnosis.

In recent years, deep learning (DL) technology, a specific field of artificial intelligence (AI) technology, has made remarkable advances in medical image analysis and diagnosis, and is considered to be a potentially powerful tool to solve such problems [8,9]. Despite the lack of available published data to date, DL approaches for the diagnosis of COVID-19 from CXR have been actively studied [10–17]. Because the available data are limited, previous research has focused on creating new DL architectures based on deep convolutional neural networks (CNNs) to provide effective diagnosis algorithms. However, these studies assessed only the efficacy of the newly created network through comparison between different CNNs, so the effect of layer depth (referred to here as scalability) and of the degree of fine-tuning in transfer learning with CNNs has not been comparatively studied. Therefore, the main objective of this study was to investigate the effect of layer depth within the same CNN architecture, and the effect of the degree of fine-tuning in transfer learning with the same CNN under the same hyper-parameters. Furthermore, by employing the gradient-weighted class activation map (Grad-CAM) [18,19], this study provides a visual interpretation of the image regions whose features most influence the classification prediction of the DL model.

#### **2. Materials and Methods**

#### *2.1. Experimental Design*

The overall experimental steps and experimental groups used in this study are shown in Figure 1. The experiment consisted of 12 experimental subgroups. To evaluate the scalability of the same CNN architecture, the experiment consisted of two main groups according to the layer depths of each CNN. Each CNN main group is divided into 6 subgroups according to the degree of fine-tuning.

**Figure 1.** The experiment consists of a total of 12 experimental subgroups. It is divided into two main groups according to the layer depth, and each convolutional neural network (CNN) main group is divided into 6 subgroups according to the degree of fine-tuning.

#### *2.2. Datasets*

The datasets used for classification are described in Table 1. Several publicly available image data repositories were used to collect COVID-19 chest X-ray images. Normal and pneumonia samples were extracted from the open source NIH chest X-ray dataset used for the Radiological Society of North America (RSNA) pneumonia detection challenge [20]. The total dataset was curated into three classes: normal, pneumonia, and COVID-19. Since class balance is a very important factor in classification analysis, as many COVID-19 images as possible were collected, and the images of the other two classes were then randomly sampled to match the number of COVID-19 images.


**Table 1.** Description of datasets for COVID-19 classification.

The entire dataset combined 607 COVID-19 images that were publicly shared at the time of the study with 607 normal and 607 pneumonia chest radiographs randomly extracted from the RSNA Pneumonia Detection Challenge dataset, giving 1821 images in total. For the COVID-19 class, four public datasets were used, and only one copy of an image was kept when the same image appeared in more than one source. In the public datasets used in the experiment, patient information was de-identified or not provided.

The entire collected dataset was randomly divided into training and test sets at a ratio of 80:20 for each class, and the training data were further randomly divided at a training-to-validation ratio of 80:20 for use in the 5-fold cross validation.
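
As an illustration of the per-class 80:20 split described above, the following is a minimal sketch using only the Python standard library. The image identifiers, the random seed, and the use of `round` for the per-class training count are assumptions made for this example; only the class size of 607 and the 80:20 ratio come from the text.

```python
import random


def stratified_split(items_by_class, train_fraction, seed=0):
    """Split each class separately so every class keeps the same train/test ratio."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, items in items_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        # Assumption: the per-class training count is rounded to the nearest integer.
        n_train = round(train_fraction * len(shuffled))
        train[label] = shuffled[:n_train]
        test[label] = shuffled[n_train:]
    return train, test


# Hypothetical image identifiers; only the per-class count (607) comes from the text.
data = {
    label: [f"{label}_{i:04d}" for i in range(607)]
    for label in ("normal", "pneumonia", "covid19")
}
train, test = stratified_split(data, 0.8)
n_train = sum(len(v) for v in train.values())
n_test = sum(len(v) for v in test.values())
print(n_train, n_test)
```

Under this rounding assumption, each class contributes 486 training images, which gives the 1458 training images referred to in Section 2.6 and leaves 363 images for testing.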

#### *2.3. Image Preprocessing*

Because the image data used in this experiment were collected from multiple centers, most of the images have different contrast and dimensions. Therefore, all images used in this study required contrast correction through the histogram equalization technique and resizing to a uniform size before the experiment. In this study, preprocessing was performed using the contrast limited adaptive histogram equalization (CLAHE) technique [25], which has been adopted in previous studies related to lung segmentation and pneumonia classification [26–28]. Figure 2 shows sample images with CXR contrast corrected using the CLAHE technique. For the consistency of image analysis, each image was resized to a uniform size of 800 × 800.
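
To make the contrast-limiting idea concrete, the sketch below applies clipped histogram equalization to a single tile in plain Python. It is a simplified illustration, not the CLAHE implementation used in this study: full CLAHE additionally splits the image into tiles and interpolates between neighboring tile mappings. The function name, the `clip_limit` value, and the toy tile are assumptions.

```python
def clipped_equalize(tile, clip_limit, levels=256):
    """Contrast-limited histogram equalization of one tile (rows of integer pixels)."""
    pixels = [p for row in tile for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Clip each histogram bin and collect the excess counts.
    excess = 0
    for value in range(levels):
        if hist[value] > clip_limit:
            excess += hist[value] - clip_limit
            hist[value] = clip_limit

    # Redistribute the excess uniformly over all bins (limits contrast amplification).
    bonus = excess / levels
    hist = [count + bonus for count in hist]

    # Map each gray level through the cumulative distribution of the clipped histogram.
    total = len(pixels)
    cumulative = 0.0
    mapping = []
    for count in hist:
        cumulative += count
        mapping.append(min(levels - 1, round((levels - 1) * cumulative / total)))
    return [[mapping[p] for p in row] for row in tile]


# Toy low-contrast tile (hypothetical values clustered around gray level 100).
tile = [[100, 100, 101], [101, 102, 102]]
out = clipped_equalize(tile, clip_limit=1)
print(out)
```

The mapping is monotone, so the ordering of gray levels is preserved while the narrow input range is spread over a wider output range; lowering `clip_limit` brings the result closer to the identity mapping.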

#### *2.4. Convolutional Neural Networks*

This study employed two different deep CNNs as backbone networks: VGG-16 and VGG-19. VGG [29] is a pre-trained CNN from the Visual Geometry Group, Department of Engineering Science, University of Oxford. The numbers 16 and 19 represent the number of layers with trainable weights in the VGG networks. The VGG architecture has been widely adopted and recognized as state of the art in both general and medical image classification tasks [30]. Since VGG-16 and VGG-19 have the same neural network architecture but different layer depths, a comparative evaluation of performance according to layer depth can be performed under the same architectural conditions.

**Figure 2.** Sample images after applying contrast correction by contrast limited adaptive histogram equalization (CLAHE) and the semantic segmentation of lung on original chest X-ray (CXR) images.

#### *2.5. Fine-Tuning*

When the training dataset is relatively small, transferring a network pre-trained on a large annotated dataset and fine-tuning it for a specific task can be an efficient way to achieve acceptable accuracy with less training time [31]. Although the classification of diseases from CXR images differs from object classification in natural images, the two tasks can share similar learned features [32]. During fine-tuning, model weights were initialized from pre-training on a general image dataset and kept frozen, except that some of the last blocks were unfrozen so that their weights were updated in each training step. VGG-16 and VGG-19, used in this study as backbone networks, each consist of 5 blocks regardless of the network layer depth. Therefore, fine-tuning was performed at 6 levels, in which 0 to 5 blocks were unfrozen sequentially, starting from the last block. As a result, each backbone network was divided into 6 subgroups according to the degree of fine-tuning. Figure 3 shows the schematic diagrams of the layer composition and the degree of fine-tuning of VGG-16 and VGG-19.
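
The block-wise unfreezing scheme can be sketched as a small lookup, shown below in plain Python. The numbers of convolutional layers per block are those of the standard VGG-16 and VGG-19 definitions, and the layer names follow the `block{b}_conv{c}` pattern used by common VGG implementations; the function names and the exact group labels other than the `VGG16-FT2` style used in the Results are assumptions for illustration.

```python
# Convolutional layers per block in the standard VGG definitions
# (13 and 16 convolutional layers; with 3 fully connected layers: 16 and 19).
VGG_CONVS_PER_BLOCK = {
    "VGG-16": [2, 2, 3, 3, 3],
    "VGG-19": [2, 2, 4, 4, 4],
}


def layer_names(convs_per_block):
    """Return convolutional layer names grouped by block, e.g. 'block5_conv3'."""
    return [
        [f"block{b}_conv{c}" for c in range(1, n + 1)]
        for b, n in enumerate(convs_per_block, start=1)
    ]


def trainable_conv_layers(backbone, n_unfrozen_blocks):
    """Names of the convolutional layers updated when the last n blocks are unfrozen."""
    blocks = layer_names(VGG_CONVS_PER_BLOCK[backbone])
    if not 0 <= n_unfrozen_blocks <= len(blocks):
        raise ValueError("n_unfrozen_blocks must be between 0 and 5")
    unfrozen = blocks[len(blocks) - n_unfrozen_blocks:]
    return [name for block in unfrozen for name in block]


# The 12 subgroups: 2 backbones x 6 degrees of fine-tuning (FT0 to FT5).
subgroups = {
    f"{backbone.replace('-', '')}-FT{k}": trainable_conv_layers(backbone, k)
    for backbone in VGG_CONVS_PER_BLOCK
    for k in range(6)
}
print(subgroups["VGG16-FT2"])
```

In a deep learning framework, the returned names would be used to set the corresponding layers as trainable while all other convolutional layers stay frozen; that step is omitted here to keep the sketch free of framework dependencies.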

#### *2.6. Training*

The 1458 images selected as the training dataset were randomly divided into five folds. This was done to perform 5-fold cross validation to evaluate the model training while avoiding overfitting or bias [33–35]. In each round, the dataset was partitioned into independent training and validation sets using an 80:20 split. The selected validation set was a fold completely independent of the training folds and was used to evaluate the training status during training. After one model training round was completed, another independent fold was used as the validation set, and the previous validation set was reused as part of the training set. An overview of the 5-fold cross validation performed in this study is presented in Figure 4. As an additional measure to prevent overfitting, dropout was applied to the last fully connected layers, and early stopping was applied by monitoring the validation loss at each epoch.
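
The fold rotation described above can be sketched as follows with the Python standard library. This is a minimal illustration, not the authors' code: the seed and function name are assumptions, and the sketch does not stratify folds by class because the text does not state whether stratification was used.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, validation) index lists; each fold is the validation set once."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(1458, k=5))
for round_number, (training, validation) in enumerate(splits, start=1):
    print(round_number, len(training), len(validation))
```

Because 1458 is not divisible by 5, three folds hold 292 images and two hold 291, so each round trains on roughly 80% of the training data and validates on the remaining 20%.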

The above training process was repeated for all 12 experimental groups (Figure 1). All deep CNN models were trained and evaluated on an NVIDIA DGX Station™ (NVIDIA Corp., Santa Clara, CA, USA) with an Ubuntu 18 operating system, 256 GB system memory, and four NVIDIA Tesla V100 GPUs. The building, training, validation, and prediction of DL models were performed using the Keras [36] library with the TensorFlow [37] backend engine. The initial learning rate of each model was 0.00001. The ReduceLROnPlateau method was employed, which reduces the learning rate when the monitored training performance stops improving. The RMSprop algorithm was used as the solver. After training all the 5-fold deep CNN models, the best model was identified by testing with the test dataset.
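
The reduce-on-plateau behavior can be illustrated with a small, framework-free sketch. Only the initial learning rate of 0.00001 comes from the text; the reduction factor, the patience, the monitored quantity, and the loss values are hypothetical, and the class is a simplified stand-in rather than the Keras callback itself.

```python
class ReduceOnPlateau:
    """Lower the learning rate when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, initial_lr, factor=0.1, patience=3, min_lr=0.0):
        self.lr = initial_lr
        self.factor = factor      # assumption: multiplicative reduction factor
        self.patience = patience  # assumption: epochs to wait before reducing
        self.min_lr = min_lr
        self.best = float("inf")
        self.wait = 0

    def step(self, monitored_loss):
        """Update the state with the loss of one epoch and return the learning rate to use next."""
        if monitored_loss < self.best:
            self.best = monitored_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr


# Hypothetical loss curve: improvement stalls for two epochs, then resumes.
scheduler = ReduceOnPlateau(initial_lr=1e-5, factor=0.5, patience=2)
history = [scheduler.step(loss) for loss in [0.90, 0.70, 0.71, 0.72, 0.60, 0.61]]
print(history)
```

With these illustrative settings, the learning rate is halved once, after the second consecutive epoch without improvement, and then stays at the reduced value.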

**Figure 4.** The overview of the 5-fold cross validation applied in this study.

#### *2.7. Performance Evaluation*

To comprehensively evaluate the screening performance on the test dataset, the accuracy, sensitivity, specificity, receiver operating characteristic (ROC) curve, and precision recall (PR) curve were calculated. The accuracy, sensitivity, and specificity score can be calculated as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}.$$

*TP* and *FP* are the numbers of correctly and incorrectly predicted positive images, respectively. Similarly, *TN* and *FN* are the numbers of correctly and incorrectly predicted negative images, respectively. The area under the ROC curve (AUC) was also calculated in this study.
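
For a three-class problem, these scores are typically computed per class in a one-vs-rest manner. The sketch below shows this for the formulas above; the confusion matrix contains made-up counts for illustration only and does not reproduce the results of this study, and the function names are assumptions.

```python
def one_vs_rest_counts(confusion, labels, positive):
    """Collapse a multi-class confusion matrix (rows = true, columns = predicted) to TP, FP, TN, FN."""
    p = labels.index(positive)
    n = len(labels)
    tp = confusion[p][p]
    fn = sum(confusion[p][j] for j in range(n) if j != p)
    fp = sum(confusion[i][p] for i in range(n) if i != p)
    tn = sum(confusion[i][j] for i in range(n) for j in range(n) if i != p and j != p)
    return tp, fp, tn, fn


def screening_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity as defined in the equations above."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


labels = ["N", "P", "C"]
# Made-up counts for illustration only; not the results reported in this study.
confusion = [
    [45, 4, 1],
    [5, 43, 2],
    [2, 3, 45],
]
counts = one_vs_rest_counts(confusion, labels, positive="C")
metrics = screening_metrics(*counts)
print(counts, metrics)
```

Here the COVID-19 class is treated as positive and the normal and pneumonia classes together as negative, so confusions between normal and pneumonia count as true negatives for this class.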

#### *2.8. Interpretation of Model Prediction*

Because it is difficult to know the process of how deep CNNs make predictions, DL models have often been referred to as non-interpretable black boxes. To determine the decision-making process of the model, and which features are most important for the model to screen COVID-19 in CXR images, this study employed the gradient-weighted class activation mapping technique (Grad-CAM) [18,19] so that the most significant regions for screening COVID-19 in CXR images were highlighted.
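
The core Grad-CAM computation can be sketched without a deep learning framework, assuming the feature maps of the last convolutional layer and the gradients of the class score with respect to them have already been obtained. Each feature map is weighted by the spatial mean of its gradient, the weighted maps are summed, and negative values are removed with a ReLU. The toy values and the final peak normalization are assumptions made for this example.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps (each H x W) into one class activation map.

    activations[k][i][j]: feature map k of the last convolutional layer.
    gradients[k][i][j]: gradient of the class score with respect to that activation.
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average of the gradients over the spatial positions.
    weights = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]

    # Weighted sum of the feature maps followed by ReLU.
    cam = [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Assumption: scale to [0, 1] for display as a heat map.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return weights, cam


# Toy example with two 2 x 2 feature maps (hypothetical values).
activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
weights, cam = grad_cam(activations, gradients)
print(weights, cam)
```

In this toy case the second feature map has a negative weight, so the positions it activates are suppressed by the ReLU and only the regions supported by the first map remain in the heat map. In practice the map is upsampled to the input image size and overlaid on the CXR image.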

#### **3. Results**

#### *3.1. Classification Performance*

Table 2 summarizes the classification performance of the three classes, normal (N), pneumonia (P), and COVID-19 (C), for each experimental group.


**Table 2.** Performance metrics of experimental groups where N, P and C are normal, pneumonia and COVID-19, respectively.

Among all the tested deep CNN models, the VGG-16 model fine-tuned on its last two blocks (VGG16-FT2) achieved the highest COVID-19 classification performance in terms of accuracy (95.9%), specificity (97.5%), sensitivity (92.5%), and AUC (0.950). For both backbones, fine-tuning the last two convolutional blocks gave higher classification performance than fine-tuning any other number of convolutional blocks. In addition, the models in which all convolutional blocks were frozen (no fine-tuning) showed the lowest classification performance, regardless of the depth of the backbone network. In general, the fine-tuned models using VGG16 as the backbone architecture performed better than those using VGG19.

Figure 5 shows how the number of fine-tuned deep CNN blocks influences the classification performance in terms of the accuracy of COVID-19 screening. In this figure, classification performance did not increase in proportion to the degree of fine-tuning of the base model. For both deep CNNs, classification accuracy decreased when more than three convolutional blocks were fine-tuned. In addition, regardless of the number of fine-tuned blocks, the VGG19 models, which have more convolutional layers, had lower classification accuracy than the VGG16 models. The confusion matrix and ROC curve of VGG16-FT2, which achieved the highest performance in multi-class classification, are presented in Figures 6 and 7.

**Figure 5.** COVID-19 classification performance versus the number of fine-tuned convolutional blocks.

**Figure 6.** Confusion matrix of the best performing classification model (VGG16-FT2) in this study.

**Figure 7.** Receiver operating characteristics (ROC) curve of the best performing classification model (VGG16-FT2) in this study.

#### *3.2. Interpretation of Model Decision Using Grad-CAM*

Figures 8–10 show examples of a visualized interpretation of predictions using deep CNN models in this study. In each example, the color heat map presented which areas were most affected by the classification of the deep CNN model.

**Figure 8.** Samples of original and gradient-weighted class activation mapping (Grad-CAM) images correctly predicted by the best performing classification model (VGG16-FT2) in this study.

**Figure 9.** Original and Grad-CAM sample images presumed to be misclassified according to the wrong reason by the best performing classification model in this study (VGG16-FT2).

**Figure 10.** Original and Grad-CAM sample images presumed to be correctly classified according to the wrong reason by the best performing classification model in this study (VGG16-FT2).

Figure 8 shows representative examples of correctly classified cases for each of the three classes (normal, pneumonia, and COVID-19) in the VGG16-FT2 experimental group, which showed the highest classification performance. The Grad-CAM results in Figure 8 make it possible to identify the regions in which the CXR image features of the three classes differ. Figures 9 and 10 show representative examples of wrong and right classifications based on the wrong reasons. In most cases where classification was based on the wrong reason, there was a foreign body in the chest cavity in the CXR image.

#### **4. Discussion**

Given the prolonged duration of the COVID-19 pandemic and the similarity of its symptoms to those of other types of pneumonia, limited medical resources and a lack of expert radiologists have greatly increased the importance of screening for COVID-19 from CXR images, both for the proper allocation of medical resources and for the isolation of potential patients. To overcome these limitations, various cutting-edge artificial intelligence (AI) technologies have been applied to screen for COVID-19 from various medical data. Accordingly, numerous new DL models, such as COVID-Net [10], Deep-COVID [16], CVDNet [38], and Covid-resnet [13], have recently been proposed to classify COVID-19 from publicly shared CXR images, and comparison studies based on transfer learning with various pre-trained DL models have been presented [39,40]. These previous papers reported high accuracies of more than 95%. However, most of them performed transfer learning without mentioning the specific degree of fine-tuning, and a qualitative evaluation was rarely included. As a result, it is often difficult to reproduce a similar degree of accuracy with the same pre-trained DL model. Therefore, in the present study, the effects of the degree of fine-tuning and layer depth of deep CNNs on the screening performance for COVID-19 from CXR images were evaluated. Furthermore, these influences were visually interpreted using the Grad-CAM technique.

#### *4.1. Scalability of Deep CNN*

It is known that the VGG architecture used as the deep CNN backbone network in this experiment does not leverage residual principles, has a lightweight design, and low architectural diversity, so it is convenient to fine-tune [10]. In particular, the VGG-16 and VGG-19 used in this study have the same architecture with five convolutional blocks; however, the depth of the layers of VGG-19 is deeper than that of VGG-16 (Figure 3).

According to Table 2 and Figure 5, the overall classification performance of VGG-16 was higher than that of VGG-19, regardless of the degree of fine-tuning. These results are consistent with previous reports that the latest and deepest neural networks do not guarantee higher accuracy in the classification of medical images such as CXR images [39]. For medical images that require fewer than 10 classes, deep CNNs with low scalability may perform better, unlike the classification of general objects, which requires more than 1000 classes.

#### *4.2. Degree of Fine-Tuning of Deep CNN*

In general, a deep CNN model pre-trained on a large natural image dataset can be used to classify common images but cannot be readily used for specific medical image classification tasks. However, according to previous studies that described the effects and mechanisms of fine-tuning on deep CNNs, when certain convolutional blocks of a deep CNN model are fine-tuned, the model can be further specialized for a specific classification task [32,41]. More specifically, the earlier layers of a deep CNN contain generic features that should be useful for many classification tasks, whereas later layers contain features that are progressively more specific to the details of the classes in the original dataset. Using this property, when the parameters of the early layers are preserved and those of the later layers are updated during training on a new dataset, the deep CNN model can be effectively used in a new classification task. In short, fine-tuning takes the parameters learned from previous training of the network on a large dataset and then adjusts the parameters of the later layers using the new dataset, improving the performance and accuracy in the new classification task.
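The freezing scheme described above can be sketched in a few lines of plain Python. This is only an illustration of the bookkeeping, not the training code used in the study: the block labels and the function name `trainable_flags` are hypothetical, and the sketch assumes that a fine-tuning degree of *n* means the last *n* of the five VGG convolutional blocks are updated while the earlier ones stay frozen.

```python
# Labels for the five VGG convolutional blocks (illustrative names only).
VGG_BLOCKS = ["block1", "block2", "block3", "block4", "block5"]


def trainable_flags(n_finetuned, blocks=VGG_BLOCKS):
    """Map each block to True (updated on the new data) or False (frozen).

    Assumption: fine-tuning always starts from the last block and
    extends backwards towards the input.
    """
    if not 0 <= n_finetuned <= len(blocks):
        raise ValueError("n_finetuned must be between 0 and the number of blocks")
    n_frozen = len(blocks) - n_finetuned
    return {name: index >= n_frozen for index, name in enumerate(blocks)}


if __name__ == "__main__":
    # List which blocks would be updated for each degree of fine-tuning.
    for degree in range(len(VGG_BLOCKS) + 1):
        flags = trainable_flags(degree)
        updated = [name for name, trainable in flags.items() if trainable]
        print(degree, updated)
```

In a deep learning framework, the same flags would typically be applied to the layers of each block before compiling the model, so that only the selected blocks receive gradient updates.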

To the best of the authors' knowledge, no previous study has evaluated the accuracy of COVID-19 screening according to the degree of fine-tuning. According to Figure 5, regardless of the scalability of VGG, classification accuracy increases as the degree of fine-tuning increases; however, fine-tuning beyond a certain number of convolutional blocks (more than 3 blocks in this experiment) decreases the classification accuracy. Therefore, it seems necessary to treat the degree of fine-tuning in transfer learning as a hyper-parametric variable, like the batch size or learning rate in DL, and to search for its appropriate value.

#### *4.3. Visual Interpretation Using Grad-CAM*

Grad-CAM uses the gradient information flowing into the last convolutional layer of the deep CNN to understand the significance of each neuron for making decisions [18]. In this experiment, a qualitative evaluation of classification adequacy was performed using the Grad-CAM technique. For the deep CNN model that showed the best classification performance (Figure 8), the image feature points for each class were located within the lung cavity in the CXR images. However, as shown in Figure 9, if there is a foreign body in the lung cavity in a CXR image, the image can be classified incorrectly. Moreover, even if a CXR image is correctly classified, it can be classified for an incorrect reason, as shown in Figure 10. These results are similar to those of previous studies in which implanted port catheters and pacemaker or defibrillator generators interfered with the performance of DL algorithms in CXR image analysis by causing false positives or false negatives [42]. This demonstrates the value of the Grad-CAM technique and suggests candidate areas or foreign bodies that could be excluded through image preprocessing to improve classification accuracy.
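As a rough illustration of the Grad-CAM computation mentioned above, the sketch below uses plain Python lists. It assumes the feature maps of the last convolutional layer and the gradients of the class score with respect to them are already available; how they are obtained depends on the framework and is not shown. The function name `grad_cam` and the toy inputs are hypothetical.

```python
def grad_cam(activations, gradients):
    """Return a coarse class-activation heatmap.

    activations, gradients: lists of K feature maps, each an H x W nested
    list. Each channel weight is the spatial mean of its gradient map; the
    heatmap is the ReLU of the weighted sum of the feature maps.
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average pooling of the gradients.
    weights = [
        sum(sum(row) for row in grad_map) / (height * width)
        for grad_map in gradients
    ]

    # Weighted sum of the feature maps.
    cam = [[0.0] * width for _ in range(height)]
    for weight, feature_map in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * feature_map[i][j]

    # ReLU keeps only the regions that support the class of interest.
    return [[max(0.0, value) for value in row] for row in cam]


if __name__ == "__main__":
    toy_activations = [[[1, 0], [0, 1]], [[0, 2], [2, 0]]]
    toy_gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
    print(grad_cam(toy_activations, toy_gradients))
```

In practice the heatmap is upsampled to the input resolution and overlaid on the CXR image, which is how regions such as implanted devices become visible as drivers of a decision.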

#### **5. Conclusions**

This experiment identified an appropriate transfer learning strategy for a deep CNN to screen for COVID-19 in CXR images, as follows. When using deep CNNs for COVID-19 screening in CXR images, increasing their complexity and layer depth does not always guarantee cutting-edge results. In addition, when applying transfer learning to a deep CNN for classification, an appropriate degree of fine-tuning is required, and this must also be treated as an important hyper-parametric variable that affects the accuracy of DL. In particular, for image classification using DL, it is also necessary to evaluate qualitatively whether a classification has been made for the correct reason, using visual interpretation methods such as the Grad-CAM technique.

**Author Contributions:** Conceptualization, K.-S.L., J.Y.K., W.S.C., N.H.K. and K.Y.L.; data curation, K.-S.L., J.Y.K. and E.-t.J.; formal analysis, K.-S.L., J.Y.K. and E.-t.J.; funding acquisition, K.-S.L.; investigation, K.-S.L., J.Y.K. and K.Y.L.; methodology, K.-S.L. and E.-t.J.; project administration, K.-S.L. and N.H.K.; resources, N.H.K.; software, K.-S.L. and E.-t.J.; supervision, K.-S.L., J.Y.K. and W.S.C.; validation, J.Y.K., W.S.C., N.H.K. and K.Y.L.; visualization, K.-S.L.; writing—original draft, K.-S.L.; writing—review and editing, K.-S.L., W.S.C., N.H.K. and K.Y.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Research Foundation of Korea under Grant NRF-2019R1I1A1 A01062961 and a Korea University Ansan Hospital Grant O2000301.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**

1. World Health Organization. *Coronavirus Disease (COVID-19): Situation Report, 182*; World Health Organization: Geneva, Switzerland, 2020.


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Towards Personalised Contrast Injection: Artificial-Intelligence-Derived Body Composition and Liver Enhancement in Computed Tomography**

**Daan J. de Jong <sup>1</sup>, Wouter B. Veldhuis <sup>1</sup>, Frank J. Wessels <sup>1</sup>, Bob de Vos <sup>2</sup>, Pim Moeskops <sup>2</sup> and Madeleine Kok <sup>1,</sup>\***


**Citation:** de Jong, D.J.; Veldhuis, W.B.; Wessels, F.J.; de Vos, B.; Moeskops, P.; Kok, M. Towards Personalised Contrast Injection: Artificial-Intelligence-Derived Body Composition and Liver Enhancement in Computed Tomography. *J. Pers. Med.* **2021**, *11*, 159. https://doi.org/10.3390/jpm11030159

Academic Editors: Pim A. de Jong, Wouter Foppen and Nelleke Tolboom

Received: 12 December 2020 Accepted: 18 February 2021 Published: 24 February 2021

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Abstract:** In contrast-enhanced computed tomography, total body weight adapted contrast injection protocols have proven successful in achieving a homogeneous enhancement of vascular structures and liver parenchyma. However, because solid organs have greater perfusion than adipose tissue, the lean body weight (fat-free mass) rather than the total body weight is theorised to cause even more homogeneous enhancement. We included 102 consecutive patients who underwent a multiphase abdominal computed tomography between March 2016 and October 2019. Patients received contrast media (300 mgI/mL) according to bodyweight categories. Using regions of interest, we measured the Hounsfield unit (HU) increase in liver attenuation from unenhanced to contrast-enhanced computed tomography. Furthermore, subjective image quality was graded using a four-point Likert scale. An artificial intelligence algorithm automatically segmented and determined the body compositions and calculated the percentages of lean body weight. The hepatic enhancements were adjusted for iodine dose and iodine dose per total body weight, as well as percentage lean body weight. The associations between enhancement and total body weight, body mass index, and lean body weight were analysed using linear regression. Patients had a median age of 68 years (IQR: 58–74), a total body weight of 81 kg (IQR: 73–90), a body mass index of 26 kg/m<sup>2</sup> (SD: ±4.2), and a lean body weight percentage of 50% (IQR: 36–55). Mean liver enhancements in the portal venous phase were 61 ± 12 HU (≤70 kg), 53 ± 10 HU (70–90 kg), and 53 ± 7 HU (≥90 kg). The majority (93%) of scans were rated as good or excellent. Regression analysis showed significant correlations between liver enhancement corrected for injected total iodine and total body weight (*r* = 0.53; *p* < 0.001) and between liver enhancement corrected for lean body weight and the percentage of lean body weight (*r* = 0.73; *p* < 0.001). 
Most benefits from personalising iodine injection using %LBW additive to total body weight would be achieved in patients under 90 kg. Liver enhancement is more strongly associated with the percentage of lean body weight than with the total body weight or body mass index. The observed variation in liver enhancement might be reduced by a personalised injection based on the artificial-intelligence-determined percentage of lean body weight.

**Keywords:** computed tomography; artificial intelligence; contrast media; body composition

#### **1. Introduction**

Although ultrasound represents the first-line technique for the assessment of liver structure and potential lesions [1], contrast-enhanced computed tomography (CT) is commonly used to detect and characterise liver lesions [2,3]. The majority of these lesions are hypovascular and are, therefore, better identifiable with portal venous contrast enhancement [4,5]. A minimum enhancement of liver tissue of 50 HU is considered essential to ensure appropriate detectability [6–8]. The degree of contrast enhancement in CT is dependent on different factors: CT scan parameters (e.g., tube voltage, scan delay), injection parameters (e.g., amount of injected iodine), and patient-related factors (e.g., height, weight, cardiac output) [9]. The most widespread practice is to administer iodine contrast in fixed contrast media injection protocols. Fixed protocols result in varying enhancement levels because of differences in body size and composition [9]. Lowering the dose of contrast media decreases the sensitivity and specificity in the detection and characterisation of liver lesions [10]. Higher doses of contrast media are costly and might increase the risk of renal toxicity [11,12]. A personalised protocol for iodine dosing should be preferred to the standard fixed-contrast protocol [13]. In this respect, body-weight-adapted contrast injection protocols have proven successful in achieving a more homogeneous enhancement of vascular structures and liver parenchyma in patients [8,14–17]. However, total body weight (TBW) is not the only relevant body-size-related factor; lean body weight (LBW) and body mass index (BMI) might also be important. Solid organs have greater perfusion than adipose tissue [18]; consequently, using LBW (or the fat-free mass) as the basis for determining the amount of iodine is hypothesised to result in more uniform liver enhancement than using TBW or BMI [18,19].

Some previous studies concluded that injection protocols based on LBW rather than on TBW alone performed better in terms of liver enhancement [13,18–20]. However, we find these results not to be generalisable to our clinic because many of the aforementioned studies were performed in populations with smaller ranges in weight.

Furthermore, these studies did not use body composition on a per patient basis, but performed analysis on averaged body composition values [13,19] or estimated the body composition using empirically derived formulas [18,20].

We want to take personalised medicine a step further, using artificial intelligence as a way to determine body composition. We will use a tool that automatically segments clearly visible structures such as fat, muscle, and bone on scanned images and determines the body composition of a patient. The automated nature of this technique makes it possible to dose contrast material in real-time and in a personalised fashion, and may have wide implications.

In this study, we retrospectively evaluated the influence of TBW, BMI, and artificial-intelligence-derived LBW on liver enhancement in multiphase abdominal CT, showing that subjective image quality was related to liver enhancement.

#### **2. Materials and Methods**

#### *2.1. Patients*

We retrospectively included patients from the period of March 2016 to October 2019. We included the first CT scan of all patients who underwent a multiphase abdominal CT, including an unenhanced CT for suspicion of a kidney tumour, on a spectral CT scanner in the University Medical Center Utrecht. Inclusion criteria were an age of 18 years or older and known patient weight and height. Based on these criteria, we identified 122 patients. Exclusion criteria were patients with liver cirrhosis (*n* = 2), a fatty liver (<40 HU) (*n* = 12), numerous liver metastases (*n* = 1), a partial hepatectomy (*n* = 2), and technical problems during CT examination (*n* = 1), leaving a study population of 102 patients. The Dutch Law on Medical Research (WMO) did not apply to this retrospective cohort study according to the local medical ethical committee (METC, ref. 20-025/C). No informed consent was obtained given the anonymous research data handling.

#### *2.2. Imaging Protocols*

All included multiphase CTs were performed on a spectral CT scanner (IQon Spectral CT, Philips Healthcare, Best, The Netherlands). The scan range for the unenhanced and arterial phase was the upper abdomen. The scan range for the portal venous phase was set from approximately 1 cm cranial of the diaphragm to the lower pelvis. The scan range for the (possible) equilibrium phase was set from the kidneys to just caudal of the bladder.

Scans were performed with the following parameters: tube voltage 120 kV, 64 × 0.625 mm collimation, gantry rotation time of 0.27 s, and tube current was switched on with a quality reference tube current of 116 mAs. Image reconstruction was performed in the axial plane for the unenhanced and arterial phase, with 3 and 5 mm slice thicknesses and 2 and 4 mm increments. Image reconstruction was performed in the axial, coronal, and sagittal plane for the portal venous phase, with 5 mm slice thicknesses and 4 mm increments. All images were reconstructed using a B (abdominal) kernel at iDose level 3.

All scans were performed with bolus tracking. A circular region of interest (ROI) was placed in the abdominal aorta with a threshold of 150 HU. The post-threshold delay before scanning was 20 s for the arterial phase and 90 s for the portal venous phase.

#### *2.3. Contrast Material Injection and CT Protocols*

All patients received an 18–20 G cannula in an antecubital vein before injection. Preheated iodinated contrast (Ultravist, Iopromide 300 mgI/mL; Bayer Healthcare, Berlin, Germany) was injected using a standard dual-head CT power injector (Stellant, Bayer Healthcare, Berlin, Germany). The contrast media was preheated to 37 °C to decrease viscosity [21].

In current clinical practice, body-weight-adapted protocols are used for the multiphase abdominal CT. Injection parameters were divided into three different weight groups: ≤70 kg, 70–90 kg, and ≥90 kg. The total injected volume, iodine, and flow rate were: 120 mL, 36.0 gI, 4 mL/s for group ≤70 kg; 150 mL, 45.0 gI, 4.5 mL/s for group 70–90 kg, and 185 mL, 55.5 gI, 5 mL/s for group ≥90 kg, respectively. A saline flush of 50 mL followed the contrast bolus at the same flow rate. In some cases, technicians adapted the amount of contrast media according to their experience, which was recorded in the scan protocol. In further analysis, we did not analyse weight groups, but instead used the weight of the patient; therefore, changes in scan protocol had no effect on analyses.
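The relation between injected volume and iodine dose in this protocol is simple arithmetic (volume × concentration), and a minimal sketch is given here. The function name `protocol_for_weight` is hypothetical, and the handling of patients weighing exactly 70 kg or 90 kg is an assumption, because the listed categories overlap at those boundaries.

```python
CONCENTRATION_MG_I_PER_ML = 300  # Iopromide 300 mgI/mL, as stated in the text


def protocol_for_weight(tbw_kg):
    """Return the standard injection parameters for a total body weight.

    Assumption: exactly 70 kg falls in the lightest group and exactly
    90 kg in the heaviest group; the source does not specify this.
    """
    if tbw_kg <= 70:
        group, volume_ml, flow_ml_s = "<=70 kg", 120, 4.0
    elif tbw_kg < 90:
        group, volume_ml, flow_ml_s = "70-90 kg", 150, 4.5
    else:
        group, volume_ml, flow_ml_s = ">=90 kg", 185, 5.0

    # mL x mgI/mL = mgI; divide by 1000 to obtain grams of iodine.
    iodine_g = volume_ml * CONCENTRATION_MG_I_PER_ML / 1000
    return {
        "group": group,
        "volume_ml": volume_ml,
        "flow_ml_s": flow_ml_s,
        "iodine_g": iodine_g,
    }


if __name__ == "__main__":
    for weight in (65, 81, 95):
        print(weight, protocol_for_weight(weight))
```

The computed iodine doses reproduce the 36.0, 45.0, and 55.5 gI quoted for the three groups.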

#### *2.4. Quantitative Image Analysis*

The body composition was calculated with the Quantib-U body composition algorithm [22] on unenhanced images (Figure 1) [23]. Firstly, using a convolutional neural network, the method automatically detected the slice at the third lumbar vertebra from the CT data set (resampled to 5 mm slices). Secondly, this slice was automatically segmented into visceral fat, subcutaneous fat, psoas muscle, abdominal muscle, and long spine muscle using a second convolutional neural network. Using the areas of these segmentations in proportion to those of the entire slice, percentages of body composition were calculated. To minimise the influence of the exact slice that was selected, the areas were computed by segmenting a total of five slices around the detected L3 level (two above and two below) and averaging the results. The %LBW (percentage of lean body weight) was defined as 100% − % total body fat (= subcutaneous fat % + visceral fat %). Total fat and LBW in kilograms were then calculated using TBW. Note that %LBW is an areal measure, whereas LBW is expressed in kilograms.
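The %LBW definition above can be written out as a short sketch. It does not reproduce the segmentation algorithm itself; it only shows the arithmetic applied to the segmented areas. The dictionary keys and function names are hypothetical, and averaging the per-slice fat percentages (rather than averaging areas first) is an assumption, since the text does not say which was done.

```python
from statistics import mean


def percent_lbw(slices):
    """Compute %LBW from segmented areas of the slices around L3.

    slices: list of dicts with the areas 'subcutaneous_fat', 'visceral_fat'
    and 'total' (any consistent unit, e.g. cm^2).
    """
    fat_percentages = [
        100 * (s["subcutaneous_fat"] + s["visceral_fat"]) / s["total"]
        for s in slices
    ]
    # %LBW = 100% - % total body fat, averaged over the slices.
    return 100 - mean(fat_percentages)


def lbw_kg(tbw_kg, pct_lbw):
    """Convert the areal %LBW into kilograms using total body weight."""
    return tbw_kg * pct_lbw / 100


if __name__ == "__main__":
    # Five identical slices using the percentages of the Figure 1 example.
    example = [{"subcutaneous_fat": 27.0, "visceral_fat": 36.4, "total": 100.0}] * 5
    pct = percent_lbw(example)
    print(round(pct, 1), round(lbw_kg(80, pct), 1))
```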

CT liver enhancement values (HU) were measured (M.K., who has seven years of experience in CT imaging) on the unenhanced and portal venous phase images using circular regions of interest (ROI) of 1–2 cm in diameter. ROIs were placed in three different liver segments (S2, S8, and S7) according to the Couinaud segmental classification and mean values were calculated (Figure 2). The degree of contrast enhancement in the liver was defined as the change in enhancement values (∆HU) and was calculated by subtraction of the unenhanced values from post-contrast enhancement values.

**Figure 1.** Fully automatic measurement of body composition at the lumbar 3 level [22]. Lean body weight (LBW) was defined as the difference between body weight and body fat weight, expressed in kilograms. In this example, LBW is 36.6% of the total body weight (100% − 27.0% (subcutaneous fat) − 36.4% (visceral fat) = 36.6% (LBW)).

**Figure 2.** Region of interest (ROI) placement according to the Couinaud segmental classification to measure liver enhancement. ROIs were drawn in S2, S8, and S7 of the liver (when available) in unenhanced and enhanced images (portal venous phase). The degree of enhancement (∆HU) was calculated by subtracting the unenhanced enhancement values (**A**) from enhanced enhancement values (**B**).

#### *2.5. Qualitative Image Analysis*

The quality of all scans was independently graded by two radiologists (F.W. and M.K., with eleven and four years of experience in abdominal radiology, respectively) who were blinded to the injection protocols. The timing of the scans and the subjective liver enhancement were scored. For scan timing, a five-point scale was used to evaluate enhancement of the common portal vein (1 = too early (non-diagnostic); 2 = early (moderate, but still diagnostic); 3 = portal venous phase (good); 4 = late (moderate, but still diagnostic); 5 = too late (non-diagnostic)). Liver enhancement was assessed using a four-point Likert scale (1 = excellent; 2 = good; 3 = moderate but still diagnostic; 4 = non-diagnostic). We arbitrarily defined enhancements of >70 HU and <40 HU as non-diagnostic.

#### *2.6. Statistical Analysis*

Statistical analyses were performed in SPSS version 26 (SPSS Inc., Chicago, IL, USA). Normality was checked using histograms and the Shapiro–Wilk test. Continuous variables were reported as the mean with standard deviation (±SD) for normally distributed data and as the median with an interquartile range (IQR) for non-normally distributed data. Categorical variables were reported as proportions. Continuous variables with normal distributions were compared using the repeated measures ANOVA for dependent measures or a one-way ANOVA for independent measures. A Kruskal–Wallis test was used for nonparametric continuous variables. All tests were performed with post hoc comparison. The inter-rater variability was determined using Cohen's kappa. All *p*-values were 2-sided and a *p*-value of less than 0.05 was considered to be statistically significant.
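For readers unfamiliar with Cohen's kappa, a minimal sketch of the unweighted statistic follows. The analyses in this study were run in SPSS; whether a weighted variant was used for the ordinal scores is not stated, so the unweighted form here is an assumption, and the function name and toy ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of cases")
    n = len(rater_a)

    # Observed proportion of cases on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance from each rater's marginal frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2

    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy Likert scores from two raters (not study data).
    print(cohens_kappa([1, 1, 2, 2], [1, 1, 2, 3]))
```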

Enhancement parameters of the liver obtained for further analyses were changed into enhancement values per gram of iodine (∆HU/gI). These enhancement values were subsequently adjusted for TBW or LBW in kilograms (∆HU/(gI/TBW) and ∆HU/(gI/LBW)), according to a method proposed by Heiken et al. [8] and Kondo et al. [19]. We used %LBW on a per-patient basis. Both single- and multivariable linear regressions between TBW, BMI, and %LBW and changes in enhancement values per gram of iodine (∆HU/gI) or the adjusted enhancement values ∆HU/(gI/TBW) and ∆HU/(gI/LBW) were evaluated (Table S1).
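The enhancement measures defined in Sections 2.4 and 2.6 reduce to a few lines of arithmetic, sketched below. The function and key names are hypothetical; the only assumption is that ∆HU/(gI/TBW) is read literally as ∆HU divided by the iodine dose per kilogram, which equals ∆HU × TBW / gI.

```python
from statistics import mean


def delta_hu(unenhanced_rois, portal_venous_rois):
    """Mean enhancement: post-contrast mean ROI value minus unenhanced mean."""
    return mean(portal_venous_rois) - mean(unenhanced_rois)


def normalised_enhancement(delta, iodine_g, tbw_kg, lbw_kg):
    """Return the enhancement per gram of iodine and its weight-adjusted forms."""
    per_gram = delta / iodine_g
    return {
        "dHU_per_gI": per_gram,
        # dHU / (gI / TBW) = dHU * TBW / gI
        "dHU_per_gI_per_kg_TBW": per_gram * tbw_kg,
        # dHU / (gI / LBW) = dHU * LBW / gI
        "dHU_per_gI_per_kg_LBW": per_gram * lbw_kg,
    }


if __name__ == "__main__":
    # Illustrative ROI values in HU for S2, S8, and S7 (not study data).
    d = delta_hu([50, 52, 54], [100, 104, 108])
    print(d, normalised_enhancement(d, iodine_g=45.0, tbw_kg=80.0, lbw_kg=40.0))
```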

#### *2.7. Simulation of Future Potential Clinical Applicability*

Using the fitted regression formulas for both %LBW and TBW, we analysed the potential impact for future patients by estimating the amount of contrast media needed to reach sufficient liver enhancement. Sufficient enhancement was defined as an increase of 50 HU in the portal venous phase [6–8].

#### **3. Results**

#### *3.1. Baseline Characteristics*

The 102 patients (70.6% male) had a median age of 68 years (IQR: 57–74). Their median TBW was 81.0 kg (IQR: 72.8–90.0); 19.6% were below 70 kg and 19.6% were above 90 kg. The median %LBW was 49.8% (IQR: 35.8–55.3) and the mean BMI was 26.3 kg/m<sup>2</sup> (SD: ±4.18). Patients in the group ≤70 kg received a median of 36.0 g (IQR: 36.0–43.5) of iodine, the group 70–90 kg received 45.0 g (IQR: 39.0–45.0), and the group ≥90 kg 45.0 g (IQR: 45.0–45.7). Overall, the patients received 42.6 g (SD: ±4.42) of iodine per scan (Table 1).


**Table 1.** Baseline characteristics. Normally distributed data are given as means with ±SDs and non-parametric data are given as medians with interquartile ranges (IQRs). TBW = total body weight; LBW = lean body weight in kilograms or percentage of lean body weight; BMI = body mass index.

#### *3.2. Quantitative Image Quality*

Mean enhancement values in different liver segments were as follows: S2 54.3 HU (SD: ±5.83), S8 54.8 HU (SD: ±6.61), and S7 54.3 HU (SD: ±9.30). There was no significant difference in enhancement between the liver segments for all groups. The overall mean enhancement was 54.6 HU (SD: ±10.2; range: 25.0–93.3) and 28.4% did not reach the proposed enhancement of 50 HU or more. The mean enhancement values were 60.7 HU (SD: ±12.4) for ≤70 kg, 53.3 HU (SD: ±9.25) for 70–90 kg, and 52.4 HU (SD: ±7.45) for ≥90 kg. The between-group difference reached significance (*p* = 0.007) and in post hoc analysis the ≤70 kg group was enhanced significantly more than the 70–90 kg group (*p* = 0.019) and the ≥90 kg group (*p* = 0.034) (Table 2). The percentages of patients enhanced by <50 HU were 20%, 30%, and 35% in the ≤70 kg, 70–90 kg, and ≥90 kg groups, respectively. The percentages of patients enhanced by >70 HU were 30%, 4.8%, and 0.0% in the ≤70 kg, 70–90 kg, and ≥90 kg groups, respectively (Table S2).

#### *3.3. Qualitative Image Quality*

The inter-rater agreement was good for scan timing (*k* = 0.882 (95% CI: 0.825–0.920)) and liver enhancement (*k* = 0.921 (95% CI: 0.833–0.946)). For timing, no scans were found to be non-diagnostic (Table S3). For liver enhancement, nearly all scans were of good (25.5%) or moderate (5.90%) quality, while one scan was scored as non-diagnostic by only one of the observers (objective liver enhancement 25 HU) (Table S4). Most scans of moderate quality had an enhancement lower than 40 HU.

#### *3.4. Regression Analysis*

For the association between liver enhancement values per gram of iodine (∆HU/gI) and body parameters, a correlation was observed with TBW (*r* = 0.531; R<sup>2</sup> = 0.282; *p* < 0.001), while no significant values were observed for BMI (*p* = 0.253) or %LBW (*p* = 0.493) (Table S1). The formula for this relationship is: ∆HU/gI = 2.075 − 0.01 TBW, which can also be written as gI = ∆HU/(2.075 − 0.01 TBW) (Figure 3A). For the liver enhancement values additionally adjusted per gram of iodine per TBW (∆HU/(gI/TBW)), no significant correlations were found (BMI: *p* = 0.139; TBW: *p* = 0.302; %LBW: *p* = 0.628) (Table S1). For the liver enhancement values additionally adjusted per gram of iodine per LBW (∆HU/(gI/LBW)), the strongest association was observed with %LBW (*r* = 0.733; R<sup>2</sup> = 0.538; *p* < 0.001); no significant correlations were observed for BMI (*p* = 0.099) or TBW (*p* = 0.371) (Figure 3B) (Table S1). The formula for this relationship is: ∆HU/(gI/LBW) = 10.3 + 0.823 %LBW or gI = ∆HU/(10.3/LBW + 82.3/TBW).
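The step from the %LBW regression to the dosing form can be made explicit. The derivation below assumes that %LBW = 100 · LBW/TBW, which follows from LBW in kilograms being calculated from %LBW and TBW (Section 2.4); the symbols are those used in the text.

$$
\begin{aligned}
\frac{\Delta HU}{gI/LBW} &= 10.3 + 0.823\,\%LBW, \qquad \%LBW = 100\,\frac{LBW}{TBW} \\
\frac{\Delta HU \cdot LBW}{gI} &= 10.3 + 82.3\,\frac{LBW}{TBW} \\
\frac{\Delta HU}{gI} &= \frac{10.3}{LBW} + \frac{82.3}{TBW} \\
gI &= \frac{\Delta HU}{10.3/LBW + 82.3/TBW}
\end{aligned}
$$

Written this way, the TBW term (82.3/TBW) is visibly the larger contribution for typical values of LBW and TBW.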


**Table 2.** Enhancement in liver segments for the weight groups.

Enhancement values for the liver segments S2, S8, and S7 for the different weight groups. Values are given as means with ± SDs; *p*-values are calculated using a one-way ANOVA or repeated measures ANOVA. The blanco scans are non-enhanced scans, PV scans are scans made in the portal venous phase, the SD is given for the mean region of interest (ROI) SD, and lastly the mean enhancement is given; ∆S2 ∆S8 ∆S7 is the significance of enhancement between the three liver segments. The mean enhancement (mean ∆HU) is the average of ∆S2 ∆S8 ∆S7. Note: \* Post hoc analysis showed a significant difference between ≤70 kg and 70–90 kg weight categories and between ≤70 kg and ≥90 kg weight categories.

**Figure 3.** Regression analysis between enhancement and body size measures: (**A**) relationship between ∆HU/gI and TBW (*r* = 0.531; R<sup>2</sup> = 0.282; *p* < 0.001); (**B**) relationship between ∆HU/(gI/LBW) and %LBW (*r* = 0.733; R<sup>2</sup> = 0.538; *p* < 0.001). Note: TBW: total body weight; LBW: lean body weight.

#### *3.5. Simulation of Future Potential Clinical Applicability*

For the 102 included patients, we used an average of 42.6 g of iodine (SD: ±4.42; range: 36–55.5) per scan in the standard protocol, totalling approximately 4345 g of iodine and 14.5 litres of contrast for 102 patients. This is on average 0.532 g (SD: ±0.0811; range: 0.33–0.75) of iodine per kilogram of TBW. For our regression formula based on %LBW, an estimated average of 39.4 g (SD: ±6.05; range: 27.6–57.5) of iodine per scan would be sufficient to achieve 50 HU for each patient in the study population. This would be on average 0.486 g (SD: ±0.0210; range: 0.44–0.53) of iodine per kilogram of TBW, which is 4019 g of iodine and 13.4 litres of contrast for 102 patients (Figure 4). As an example, we would like to illustrate the added value of LBW for two patients weighing 80 kg with different %LBW values. The first patient had a %LBW of 35.5% and the expected amount of iodine to reach 50 HU was 35.9 g. The second patient had a %LBW of 78.5% and the expected amount of iodine to reach 50 HU was 41.9 g. Hence, there would be a difference of six grams of iodine between these patients, who each received 45 g of iodine and were enhanced by 63 HU and 57 HU, respectively.
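The two-patient example and the cohort totals can be checked with a short sketch of the regression formulas from Section 3.4. The function names are hypothetical, and the code assumes that LBW in kilograms equals TBW × %LBW / 100 and that the contrast concentration is 300 mgI/mL, as stated in the Methods.

```python
def iodine_g_from_tbw(target_hu, tbw_kg):
    """Iodine dose (g) from the TBW regression: gI = dHU / (2.075 - 0.01 TBW)."""
    hu_per_gram = 2.075 - 0.01 * tbw_kg
    if hu_per_gram <= 0:
        raise ValueError("TBW is outside the range where the regression is usable")
    return target_hu / hu_per_gram


def iodine_g_from_lbw(target_hu, tbw_kg, pct_lbw):
    """Iodine dose (g) from the %LBW regression: gI = dHU / (10.3/LBW + 82.3/TBW)."""
    lbw_kg = tbw_kg * pct_lbw / 100  # assumption: LBW derived from TBW and %LBW
    return target_hu / (10.3 / lbw_kg + 82.3 / tbw_kg)


def contrast_litres(total_iodine_g, concentration_mg_per_ml=300):
    """Convert grams of iodine to litres of contrast medium."""
    # g -> mg (x1000), mg / (mg/mL) -> mL, mL -> L (/1000): the factors cancel.
    return total_iodine_g / concentration_mg_per_ml


if __name__ == "__main__":
    # Two 80 kg patients with different %LBW, target enhancement 50 HU.
    print(round(iodine_g_from_lbw(50, 80, 35.5), 1))  # 35.9
    print(round(iodine_g_from_lbw(50, 80, 78.5), 1))  # 41.9
    # Cohort totals for 102 patients from the reported mean doses.
    print(round(contrast_litres(42.6 * 102), 1))  # 14.5
    print(round(contrast_litres(39.4 * 102), 1))  # 13.4
```

The outputs match the values quoted in the text. Any clinical use of such a calculation would require prospective validation.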

**Figure 4.** Analysis of future contrast applications: grams of iodine (gI) per kilogram of TBW (total body weight) in the LBW (lean body weight) formula for the population of our study. The grams of iodine per kilogram of TBW all lay between the patient with the maximum LBW (iodine (gr) maximum LBW/kg) and the patient with the lowest LBW (iodine (gr) minimum LBW/kg). Herein, the highest spreads were found in the ≤70 kg and 70–90 kg weight groups.

#### **4. Discussion**

Our results showed that the highest influence on liver enhancement was of %LBW, followed by TBW. Although the mean enhancement was >50 HU in all weight groups, the spread within groups was substantial; over one-quarter of patients did not reach the 50 HU liver enhancement threshold. Those who were enhanced by <40 HU were nearly all heavyweight patients of 90 kg or heavier, while patients enhanced by >70 HU were mostly patients weighing less than 70 kg. This indicates that our current protocol based on three weight categories overadministers contrast in lightweight patients and underadministers contrast in heavier patients. A more personalised protocol based on artificial-intelligence-determined body composition might both reduce overall contrast usage in our population and make liver enhancement more consistent between patients, but this requires prospective confirmation.

Several previous studies have investigated TBW- and LBW-adjusted contrast dosing protocols [8,18–20,24]. Heiken et al. [8] suggested the use of 0.521 g of iodine per kilogram of TBW for a 50 HU liver enhancement, while Kondo et al. [19] indicated that the use of LBW rather than TBW served better to achieve a consistent enhancement with reduced patient-to-patient variability. They suggested using an amount of 0.642 g of iodine per kilogram of LBW, based on a hepatic enhancement of 50 HU and a fixed average body fat percentage. Our finding supports the findings of Kondo et al. [19] and other studies [18,20,24]. However, both the studies of Kondo et al. [19] and Matsumoto et al. [24] concluded that LBW-based protocols perform best in the normal and high weight/BMI groups. In contrast, we found that LBW played the most important role in the weight groups ≤70 kg and 70–90 kg, wherein the spread of gI/TBW was the highest. For the group ≥90 kg, there was only a minor spread, and thus LBW played a less important role in this group in our study.

The differences between our results and the above-mentioned studies could be explained by the fact that the population in the study by Kondo et al. [19] was only partially comparable to our population. Our population represented a broader weight spectrum, with a range of 54–126 kg and with a median just above 80 kg, whereas in the study by Kondo et al. [19] the study population had a TBW range of 30–80 kg, with a mean just above 50 kg. Our study might, thus, have implications for a population with a wider range in weight. Reassuringly, we found the same results in the overlapping parts of the studies by Kondo et al. [19] and Matsumoto et al. [20]; LBW might be a better variable to determine the amount of iodine contrast used for light and average weight patients.

Contrast administration based on LBW might be economically effective. There was a difference of 3.2 g between the mean iodine dose per scan in the formula based on LBW and the mean iodine dose used in our current protocol. Based on the 600,000 yearly abdominal CT scans performed in the Netherlands, the new personalised method could save 1.8 tonnes of iodine a year [25], which is approximately €580,000 of yearly savings. However, although LBW performs better in personalising contrast application and implementing this finding might be beneficial if replicated prospectively, we conclude that the influence of LBW is minor. Similar to Kondo et al. [19], our equation is based on both TBW and LBW (the complement of body fat percentage), and when dissecting the formula based on LBW (gI = ∆HU/(10.3/LBW + 82.3/TBW)) we find that TBW is still the most important factor and that LBW has less influence.

In the study by Kondo et al. [19], an average body fat percentage of 23% was assumed for every patient in the analysis, whilst some patients in their population had body fat percentages of up to 50%. We used per-patient calculated body composition for our analysis. For the calculation of %LBW, we were able to use an artificial intelligence algorithm that automatically calculates body composition based on CT slices [22]. The tool proved useful for determining the body composition values for the large number of patients in our study, especially because this process was fully automated.

In the literature, several methods have been used to estimate the LBW (e.g., methods proposed by James [26], Boer [27], and Janmahasatian [28]), yet no consensus has been reached on a gold standard. Therefore, our artificial intelligence tool [22] may have wide implications because it measures LBW rather than estimating it. In a clinical scenario, the tool can be used in protocols containing unenhanced or arterial phase scans. If the protocol does not contain unenhanced or arterial phase scans, the body composition can be determined in several ways: from earlier recorded scans or by acquiring one single slice through the abdomen before scanning (as done for bolus timing acquisitions). Furthermore, bolus tracking slices may be (re)used in the future once the algorithm has been tested on such arterial slices; this still has to be evaluated in future research. Moreover, while this study addresses abdominal scans, the AI algorithm can also segment neck, chest, pelvis, or lower extremity scans, when available, to calculate the body composition without the use of the abdomen. Once validated, the benefit could extend to those regions as well.
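For comparison with a measured LBW, the three estimation methods named above can be sketched as follows. The coefficients are the commonly cited forms of the James, Boer, and Janmahasatian equations (weight in kg, height in cm) and should be checked against the original publications [26–28] before any use; the function names and the example patient are illustrative assumptions and were not part of this study.

```python
def lbw_james(weight_kg: float, height_cm: float, male: bool) -> float:
    """James equation (commonly cited form), height in cm."""
    ratio_sq = (weight_kg / height_cm) ** 2
    if male:
        return 1.10 * weight_kg - 128.0 * ratio_sq
    return 1.07 * weight_kg - 148.0 * ratio_sq


def lbw_boer(weight_kg: float, height_cm: float, male: bool) -> float:
    """Boer equation (commonly cited form), height in cm."""
    if male:
        return 0.407 * weight_kg + 0.267 * height_cm - 19.2
    return 0.252 * weight_kg + 0.473 * height_cm - 48.3


def lbw_janmahasatian(weight_kg: float, height_cm: float, male: bool) -> float:
    """Janmahasatian equation (commonly cited form), based on BMI."""
    bmi = weight_kg / (height_cm / 100.0) ** 2
    if male:
        return 9270.0 * weight_kg / (6680.0 + 216.0 * bmi)
    return 9270.0 * weight_kg / (8780.0 + 244.0 * bmi)


# Hypothetical male patient: 80 kg, 180 cm.
estimates = {
    "James": lbw_james(80.0, 180.0, male=True),
    "Boer": lbw_boer(80.0, 180.0, male=True),
    "Janmahasatian": lbw_janmahasatian(80.0, 180.0, male=True),
}
for name, value in estimates.items():
    print(f"{name}: {value:.1f} kg")
```

For this average-sized example patient the three estimates differ by only about 1 kg, and none of them uses a measurement of the individual patient's body composition.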

This study has several limitations. First, it had a retrospective design and included a limited number of patients. As we needed to calculate enhancement, regular abdominal CT could not be included. Second, there were two outliers with enhancement levels <40 HU. The low enhancement could be due to small contrast extravasation, although this was not recorded. Another explanation could be a poor cardiac output, which results in poor enhancement and image quality [17]. However, we used premonitoring for contrast timing in our scan protocol and no scans were found to be non-diagnostic based on the timing of the scan.

Future prospective studies could investigate the impact of personalised dosing on liver enhancement and diagnostic properties, and should also take tube voltage into account [14]. Many studies have already investigated the potential of low kVp settings (e.g., 70, 80, and 100 kVp) [2,14,29–32] or virtual monochromatic imaging with low kV reconstruction [33–37] in combination with a reduced amount of injected iodine in lighter-weight populations using CT angiography protocols, wherein only the signal during the first pass of contrast media is crucial [29]. However, this has not yet been properly investigated for abdominal protocols, which rely on longer contrast media boluses to provide homogeneous enhancement of parenchymal organs, such as the liver. With the newest CT technologies (e.g., automated kVp selection, monochromatic data reconstruction, and iterative reconstruction), it is expected that more CT scans will be performed using lower kVp settings in the future [38]. As lower kVp/kV settings result in higher attenuation values, there is an opportunity to achieve savings in contrast media beyond the above-mentioned €580,000. We anticipate that the benefit of personalised contrast dosing is at least partly additive to that of the above-mentioned technological innovations.

#### **5. Conclusions**

In summary, in this study, we investigated the relationship between body parameters, such as TBW, LBW, and BMI, and liver enhancement in CT. We found that contrast-enhanced CT values of 40 HU and higher were of diagnostic value when assessed visually. Our data suggest that the use of an artificial intelligence body composition-based algorithm to determine LBW can reduce interpatient variability in liver enhancement whilst saving contrast media. The automated nature of the algorithm makes real-time personalisation of contrast dosing technically feasible. Further research should focus on how to integrate body-composition-based personalised contrast dosing with lower tube voltage settings or monochromatic imaging.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/2075-4426/11/3/159/s1, Table S1: Correlations in regression. Table S2: Enhancement values per group. Table S3: Subjective phase classification per rater. Table S4: Subjective enhancement classification on the four-point Likert scale per rater.

**Author Contributions:** Conceptualisation, M.K., W.B.V., P.M., B.d.V. and D.J.d.J.; methodology, M.K. and D.J.d.J.; software, W.B.V., P.M. and B.d.V.; validation, M.K.; formal analysis, M.K. and D.J.d.J.; investigation, M.K. and F.J.W.; resources, P.M. and B.d.V.; writing—original draft preparation, M.K. and D.J.d.J.; writing—review and editing, W.B.V., P.M., B.d.V. and F.J.W.; supervision, M.K.; project administration, M.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Ethical review and approval were waived for this study. The Dutch Law on Medical Research (WMO) did not apply to this retrospective cohort study according to the local medical ethical committee (METC, ref. 20-025/C).

**Informed Consent Statement:** Patient consent was waived due to the anonymous research data handling (METC, ref. 20-025/C).

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author. The data are not publicly available due to ongoing unpublished research.

**Acknowledgments:** We would like to thank the radiology technicians of the UMC Utrecht for their work in collecting data. Moreover, we would like to thank A. Schilham for contributing useful advice. D. de Jong is a medical student participating in the Honours programme of the Faculty of Medicine, UMC Utrecht.

**Conflicts of Interest:** The scientific guarantor of this publication is M. Kok. The authors of this manuscript declare that the Department of Radiology of the UMC Utrecht receives research support from Philips Healthcare, and some of the contributing authors declare having a relationship with Quantib-U (B. de Vos and P. Moeskops).

#### **References**


MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Journal of Personalized Medicine* Editorial Office E-mail: jpm@mdpi.com www.mdpi.com/journal/jpm

ISBN 978-3-0365-2109-1