*Article* **Image Embeddings Extracted from CNNs Outperform Other Transfer Learning Approaches in Classification of Chest Radiographs**

**Noemi Gozzi 1,2, Edoardo Giacomello 3, Martina Sollini 1,4,\*, Margarita Kirienko 5, Angela Ammirabile 1,4, Pierluca Lanzi 3, Daniele Loiacono 3 and Arturo Chiti 1,4**


**Abstract:** To identify the best transfer learning approach for the identification of the most frequent abnormalities on chest radiographs (CXRs), we used embeddings extracted from pretrained convolutional neural networks (CNNs). An explainable AI (XAI) model was applied to interpret black-box model predictions and assess its performance. Seven CNNs were trained on CheXpert. Three transfer learning approaches were thereafter applied to a local dataset. The classification results were ensembled using simple and entropy-weighted averaging. We applied Grad-CAM (an XAI model) to produce a saliency map. Grad-CAM maps were compared to manually extracted regions of interest, and the training time was recorded. The best transfer learning model was that which used image embeddings and random forest with simple averaging, with an average AUC of 0.856. Grad-CAM maps showed that the models focused on specific features of each CXR. CNNs pretrained on a large public dataset of medical images can be exploited as feature extractors for tasks of interest. The extracted image embeddings contain relevant information that can be used to train an additional classifier with satisfactory performance on an independent dataset, demonstrating it to be the optimal transfer learning strategy and overcoming the need for large private datasets, extensive computational resources, and long training times.

**Keywords:** medical imaging; X-rays; artificial intelligence; transfer learning; explainability

**Citation:** Gozzi, N.; Giacomello, E.; Sollini, M.; Kirienko, M.; Ammirabile, A.; Lanzi, P.; Loiacono, D.; Chiti, A. Image Embeddings Extracted from CNNs Outperform Other Transfer Learning Approaches in Classification of Chest Radiographs. *Diagnostics* **2022**, *12*, 2084. https://doi.org/10.3390/diagnostics12092084

Academic Editors: Sameer Antani and Sivaramakrishnan Rajaraman

Received: 7 July 2022; Accepted: 24 August 2022; Published: 28 August 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

The world's population increased by about threefold between 1950 and 2015 (from 2.5 to 7.3 billion), and this trend is projected to continue in the coming decades (a population of 19.3 billion people is expected in 2100), with a growing share of the aging population (≥65 years) (https://www.eea.europa.eu/data-and-maps/indicators/total-population-outlook-from-unstat-3/assessment-1 (accessed on 30 April 2021)). This projected trend is strongly linked to the increasing demand for medical doctors, including imagers. The medical community has warned about the urgent need to act (https://www.rcr.ac.uk/press-and-policy/policy-priorities/workforce/radiology-workforce-census (accessed on 30 April 2021)), suggesting that artificial intelligence (AI) might partially fill this gap [1]. The joint venture between AI and diagnostic imaging relies on the advantages offered by machine learning approaches to the medical field, which include the automation of repetitive tasks, the prioritization of unhealthy cases requiring urgent referral, and the development of computer-aided systems for lesion detection and diagnosis [2]. Nonetheless, the majority of such AI-based methods are still research prototypes, and only a few have been introduced in clinical practice [3], despite increasing evidence of the superior performance of AI relative to that of doctors [4,5]. A number of reasons may be called upon to explain this fact [6–8]. A successful AI-based tool relies on three main ingredients: an effective algorithm, high computational power, and a reliable dataset. Whereas the first two ingredients are generally available and can leverage several applications in different domains, the latter is perhaps the most critical in medical imaging. Obtaining the quality and amount of data necessary for machine learning approaches is still challenging or unfeasible in most clinical trials [6]. Accordingly, several strategies can be used to cross the hurdle of datasets in the medical imaging field. These comprise virtual clinical trials [9,10], privacy-preserving multicenter collaboration [11], and transfer learning approaches [12]. In particular, transfer learning, i.e., leveraging patterns learned on a large dataset to improve generalization for another task, is an effective approach for computer vision tasks on small datasets. Besides enabling training with a smaller amount of data and reducing the risk of overfitting, transfer learning has shown remarkable performance in generalizing from one task and/or domain to another [13]. However, the optimal transfer learning strategy has not yet been defined due to the lack of dedicated comparative studies. In this work, we propose and compare three transfer learning approaches that exploit convolutional neural networks (CNNs) pretrained on a large public dataset of chest radiographs, and we apply an explainable AI method to interpret the resulting models.


We tested this proof-of-concept approach on chest radiographs (CXRs). CXR is the most frequently performed radiological examination. Thus, the semiautomatic interpretation of CXRs could significantly impact medical practice by potentially offering a solution to the shortage of radiologists.

#### **2. Materials and Methods**

#### *2.1. Datasets*

The experimental analysis discussed in this paper involved two datasets: (i) CheXpert, a large public dataset, which was used to pretrain several classification models; and (ii) HUM-CXR, a smaller local dataset, which was used to evaluate the investigated transfer learning approaches.

CheXpert. This dataset comprises 224,316 CXRs of 65,240 patients collected from the Stanford Hospital from October 2002 to July 2017 [14]. For this study, 191,027 CXRs from the original dataset that presented a full reported diagnosis were selected. Each image was annotated with a vector of 14 labels corresponding to major findings in a CXR. Mentions of diseases were extracted from radiology reports with an automatic rule-based system and mapped, for each disease, to positive, negative, or uncertain labels according to the level of confidence. Table 1 shows the data distribution among the 14 labels included in the dataset.

HUM-CXR. We retrospectively collected all chest X-rays performed between 1 May 2019 and 20 June 2019 from the IRCCS Humanitas Research Hospital institutional database. We excluded records (1) not focused on the chest, (2) without images stored in the institutional PACS, (3) without an available medical report, and (4) without an anteroposterior view. HUM-CXR is composed of 1002 CXRs, including anteroposterior, lateral, and portable (i.e., in bed) CXRs. Labels were manually extracted from medical reports (C.J.). Uncertain cases were reassessed by two independent reviewers (M.S. and M.K.), and discordant findings were resolved by consensus. Each image was annotated as normal or abnormal; abnormalities were further specified as mediastinum, pleura, diaphragm, device, other, gastrointestinal (GI), pneumothorax (PNX), cardiac, lung, bone, or vascular, resulting in a vector of 12 labels. It was not possible to use available automatic labelers [14] because they are designed for English-language reports, whereas our radiological reports were written in Italian. The mediastinum, diaphragm, other, GI, and vascular labels were not included in this work due to the limited number of available X-rays (<30) and significant inconsistencies with the CheXpert labels. Ultimately, 941 CXRs were included in the analysis. Table 2 shows the data distribution of the labels selected for this study.


**Table 1.** Absolute frequencies of positive, uncertain, and negative samples for each finding (relative frequencies are reported in parentheses) in the CheXpert dataset (*n* = 191,027).

**Table 2.** Absolute frequencies of positive, uncertain, and negative samples for each finding (relative frequencies are reported in parentheses) in the HUM-CXR dataset (*n* = 941).


This study was approved by the Ethical Committee of IRCCS Humanitas Research Hospital (approval number 3/18, amendment 37/19); due to the retrospective design, specific informed consent was waived.

Preprocessing. For both datasets, we selected only anteroposterior images. Concerning CheXpert, following the approach described in [15], we resized the images to 256 × 256, and a chest region of 224 × 224 was extracted using a template-matching algorithm. We then normalized the images by scaling their values in the range [0, 1]; because the original models were pretrained on ImageNet, we further standardized them with respect to the ImageNet mean and standard deviation. Concerning HUM-CXR, we selected X-rays acquired with an anteroposterior view, screening the images according to the series description in DICOM format, which had to be anteroposterior, posteroanterior, or portable; the final sample comprised 941 images of 746 patients. First, we clipped pixel values with a maximum threshold of the 0.9995 quantile to minimize the noise due to the landmark (see Figure 1).

**Figure 1.** Preprocessing by clipping values larger than the 0.9995 quantile. The presence of a landmark, significantly whiter than the other pixels, created significant noise after normalization (**a**); original image (**b**); clipped image (**c**); normalized original image (**d**).

To match the input dimension of the models, we resized the images to 224 × 224 and encoded them as RGB images by repeating each image across three channels. This step was necessary in order to use the state-of-the-art image classification networks already pretrained on the ImageNet dataset. Then, we normalized each image by scaling its values in the range [0, 1].
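For illustration, the preprocessing pipeline above can be expressed as a short script. The sketch below assumes NumPy and OpenCV; the function name `preprocess_cxr` and the exact order of clipping, rescaling, and resizing are illustrative assumptions rather than the authors' released code.

```python
import numpy as np
import cv2  # OpenCV, used here only for resizing

def preprocess_cxr(pixels: np.ndarray, size: int = 224, clip_q: float = 0.9995) -> np.ndarray:
    """Clip bright landmarks, rescale to [0, 1], resize, and replicate to three channels."""
    pixels = pixels.astype(np.float32)
    # Clip very bright pixels (e.g., lead markers) at the 0.9995 quantile
    pixels = np.clip(pixels, None, np.quantile(pixels, clip_q))
    # Scale the values to the range [0, 1]
    pixels = (pixels - pixels.min()) / (pixels.max() - pixels.min() + 1e-8)
    # Resize to the input size expected by the CNNs
    pixels = cv2.resize(pixels, (size, size), interpolation=cv2.INTER_AREA)
    # Repeat the grayscale image across three channels for ImageNet-pretrained models
    return np.stack([pixels] * 3, axis=-1)
```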

#### *2.2. Pretraining on CheXpert*

In this work, we trained several classifiers on the CheXpert dataset to predict CXR findings. Following the protocol described in [15], we considered seven convolutional neural networks (CNNs) with different topologies and numbers of parameters: DenseNet121 (7M parameters), DenseNet169 (12.5M parameters), DenseNet201 (18M parameters) [16], InceptionResNetV2 (54M parameters) [17], Xception (21M parameters) [18], VGG16 (15M parameters) [19], and VGG19 (20M parameters) [19]. We selected these seven network architectures because (i) they are the most common architectures used to perform classification, and (ii) the performance of each architecture differed depending on the labels; with no predominant architecture, aggregating multiple models can improve the final performance. To use these networks as classifiers, we removed the original dense layer and replaced it with a global average pooling (GAP) [20] layer, followed by a fully connected layer with a number of outputs that matched the number of labels. These seven networks were not trained from scratch; instead, following a common practice in CNN training, we performed a first transfer learning step by initializing the convolutional layers of the networks with the weights of models pretrained on the ImageNet dataset [21]. Then, we trained all the weights (both convolutional and classification layers) on the CheXpert dataset, using 90% of the sample for training and 10% for validation (further details on the training process can be found in our previous work [15]).
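As a concrete illustration of this setup, the Keras sketch below builds one of the seven classifiers (DenseNet121) with an ImageNet-initialized backbone, a GAP layer, and a sigmoid output head. The optimizer, learning rate, and function name are illustrative assumptions, not the exact training configuration of [15].

```python
import tensorflow as tf
from tensorflow.keras import layers, models, applications

def build_chexpert_classifier(n_labels: int = 14) -> tf.keras.Model:
    """DenseNet121 initialized on ImageNet with a GAP + dense head for the CheXpert labels."""
    backbone = applications.DenseNet121(include_top=False, weights="imagenet",
                                        input_shape=(224, 224, 3))
    x = layers.GlobalAveragePooling2D(name="gap")(backbone.output)
    outputs = layers.Dense(n_labels, activation="sigmoid")(x)  # multi-label output
    model = models.Model(backbone.input, outputs)
    # All weights (convolutional and classification layers) remain trainable on CheXpert
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(multi_label=True)])
    return model
```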

Once trained to classify images, the convolutional blocks of CNN models can be employed as a means to extract a vector of features from images, usually called an image embedding. CNNs learn to classify images by learning an effective input representation directly from raw data; the sequence of convolutional layers progressively reduces the size of the input and extracts features, from low-level features (e.g., edges and pixel intensities) in the early convolutional layers to high-level semantic features in the last convolutional layers. Accordingly, the last convolutional block resulting from the training process is designed to output a vector with the relevant features.
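In practice, the embedding can be read directly from the GAP layer of a trained network. The sketch below reuses the model from the previous snippet (with its layer named "gap"); the batch size and function name are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

def extract_embeddings(model: tf.keras.Model, images: np.ndarray,
                       embedding_layer: str = "gap") -> np.ndarray:
    """Return the GAP-layer outputs (image embeddings) for a batch of preprocessed CXRs."""
    extractor = tf.keras.Model(inputs=model.input,
                               outputs=model.get_layer(embedding_layer).output)
    # For a DenseNet121 backbone this yields a 1024-dimensional vector per image
    return extractor.predict(images, batch_size=32)
```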

#### *2.3. Transfer Learning*

In this paper, we propose three transfer learning approaches, as depicted in Figure 2. As a reference standard, we mapped the CheXpert labels to the HUM-CXR labels and used the pretrained CNNs directly, without retraining. The first transfer learning approach consisted of combining the outputs of the pretrained CNNs using stacking [22]. The second approach exploited the pretrained CNNs to compute the image embeddings from HUM-CXR data and used them to train tree-based classifiers. The last approach consisted of tuning the CNNs pretrained on CheXpert on the HUM-CXR data. In the remainder of this section, we describe these four approaches in detail.

Pretrained CNNs. This was the most straightforward of the investigated approaches and was used mainly as a baseline. It consisted of providing the HUM-CXR images as input to the CNNs trained on CheXpert and using the output of the networks to classify them based on a mapping between the labels of the two datasets. Table 3 shows the mapping designed as a result of an analysis of the images and labels in the two datasets.

**Table 3.** Correspondence between CheXpert and HUM-CXR labels.


When an HUM-CXR label corresponded to multiple CheXpert labels, we selected the maximum output probability of the network among those CheXpert labels as the predicted value for the respective HUM-CXR outcome. As reported in previous works [15,23], none of the trained CNNs consistently outperformed the others across all labels. Thus, to improve the overall classification performance, we combined the outputs of the trained CNNs through two ensemble methods: simple average and entropy-weighted average. In the case of simple average, the predictions of the classifiers were combined as:

$$
\tilde{y}_i = \frac{1}{N} \sum_{k=1}^{N} p_{k,i} \tag{1}
$$

where $p_{k,i}$ is the prediction of classifier $k$ for label $i$, $N$ is the number of classifiers, and $\tilde{y}_i$ is the resulting prediction of the ensemble for label $i$.

When using entropy-weighted average, the predictions were combined as:

$$
\tilde{y}_i = \sum_{k=1}^{N} \left(1 - H(p_{k,i})\right) p_{k,i} \tag{2}
$$

where $p_{k,i}$ is the prediction of classifier $k$ for label $i$, $N$ is the number of classifiers, $H(p) = -p\log_2(p) - (1-p)\log_2(1-p)$ is the binary entropy function, and $\tilde{y}_i$ is the resulting prediction of the ensemble for label $i$.
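A minimal NumPy implementation of Equations (1) and (2) could look as follows; the array layout (classifiers × samples × labels) and the function names are assumptions made for this sketch.

```python
import numpy as np

def simple_average(preds: np.ndarray) -> np.ndarray:
    """Equation (1): preds has shape (n_classifiers, n_samples, n_labels)."""
    return preds.mean(axis=0)

def entropy_weighted_average(preds: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Equation (2): weight each prediction by one minus its binary entropy."""
    p = np.clip(preds, eps, 1.0 - eps)
    entropy = -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)
    weights = 1.0 - entropy
    # Equation (2) is reproduced as written; in practice the weighted sum could also be
    # normalized by the sum of the weights to keep the ensemble output within [0, 1].
    return (weights * preds).sum(axis=0)
```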

**Figure 2.** An overview of our experimental design. In the first phase, state-of-the-art image classification networks were tuned on a large public dataset of X-rays (CheXpert [14]). Then, we performed four different steps on the HUM-CXR dataset: (1) we tested the originally trained networks on the X-rays of the new dataset, mapping the HUM-CXR labels to CheXpert labels; (2) we used the originally pretrained networks with a metaclassifier to combine the predictions of each network on the new dataset; (3) we fine-tuned the networks by removing the fully connected classification layer from the seven CNNs trained on CheXpert and replacing it with a seven-output layer that matched the HUM-CXR labels; and (4) we extracted the image embeddings from each network and trained tree-based classifiers to predict the HUM-CXR labels starting from the extracted embeddings.

Stacking. This approach extends the previous approach by using a method called stacked generalization or stacking [22]. Instead of combining the outputs of the CNNs with a simple or an entropy-weighted average as described above, we combined them using a metaclassifier trained for this purpose. Thus, we trained a random forest (RF) to predict the label for HUM-CXR samples based on the predictions of the seven CNNs trained on CheXpert and mapped to labels of HUM-CXR, as shown in Table 3. The data were divided into a training set (70%) and a test set (30%).
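For illustration, such a stacking metaclassifier could be trained with scikit-learn as sketched below. The variable names, the number of trees, and the flattening of the seven mapped CNN outputs into a single feature vector per image are assumptions of this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_stacking_classifier(cnn_preds: np.ndarray, y: np.ndarray, seed: int = 0):
    """Train a random forest metaclassifier on stacked CNN predictions.

    cnn_preds: (n_samples, n_cnns * n_labels) mapped predictions of the seven CNNs
    y: (n_samples, n_labels) HUM-CXR ground-truth labels
    """
    X_train, X_test, y_train, y_test = train_test_split(
        cnn_preds, y, test_size=0.3, random_state=seed)
    meta = RandomForestClassifier(n_estimators=200, random_state=seed)
    meta.fit(X_train, y_train)  # scikit-learn random forests handle multi-label targets natively
    return meta, (X_test, y_test)
```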

Tree-based classifiers. This approach exploits the CNNs trained on CheXpert to compute the image embeddings of the CXRs included in the HUM-CXR dataset. Image embeddings can be used to predict the label of the corresponding images using much simpler models than CNNs, such as tree-based models. The benefit of using tree-based models with respect to CNNs is that they do not require either high computational resources or extremely large datasets for training, making them suitable for smaller single-institution datasets. In this work, we focused on three kinds of tree-based methods: decision tree (DT), random forest (RF), and extremely randomized trees (XRT). For each method, we trained seven classifiers using the seven CNNs pretrained on CheXpert to compute the image embeddings from the HUM-CXR dataset, with 70% of the samples used for training and 30% for testing. As with the pretrained CNNs, we applied the two ensemble methods, i.e., the simple average and the entropy-weighted average, to combine these seven classifiers. We tuned the training hyperparameters of the tree-based classifiers with a grid-search optimization using stratified K-fold cross validation (Table 4 shows the parameters).


**Table 4.** Embedding model hyperparameters.

The results were combined with simple average and entropy-weighted average.
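A sketch of this step for a single binary label using scikit-learn is shown below. The hyperparameter grid here is only a placeholder (the grids actually used are those in Table 4), and fitting one classifier per label with stratified folds is an assumption about how the multi-label problem was decomposed.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Placeholder grid; the grids actually explored are reported in Table 4
PARAM_GRID = {"n_estimators": [100, 300, 500], "max_depth": [None, 10, 20]}

def fit_embedding_classifier(embeddings, labels):
    """Grid-search a random forest on CNN image embeddings for one binary label.

    embeddings: array of shape (n_samples, embedding_dim)
    labels: binary array of shape (n_samples,)
    DT and XRT classifiers can be tuned analogously by swapping the estimator.
    """
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    search = GridSearchCV(RandomForestClassifier(random_state=0), PARAM_GRID,
                          scoring="roc_auc", cv=cv, n_jobs=-1)
    search.fit(embeddings, labels)
    return search.best_estimator_
```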

Fine-tuning. This is a common transfer learning approach in deep learning that consists of adapting and retraining the last layers of a pretrained neural network on different data or tasks [13]. Therefore, we removed the fully connected classification layer from the seven CNNs trained on CheXpert and replaced it with a seven-output layer that matched the HUM-CXR labels. Then, the HUM-CXR dataset (70% training, 10% validation) was used to fine-tune the original networks. The models were fine-tuned for five epochs with early stopping on the validation AUC set to three epochs. Binary cross entropy was used as the loss function, and the learning rate was initially set to 1 × 10<sup>−4</sup>, to be reduced by a factor of 10 after each epoch. For each CNN, the best-performing model upon validation was tested on the remaining 20% of the HUM-CXR dataset. The performances were evaluated with simple average, entropy-weighted average, and stacking.
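The fine-tuning stage can be sketched in Keras as follows. The layer name "gap", the choice to freeze the convolutional backbone, and the callback configuration are assumptions for illustration and may differ from the authors' exact setup.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

def fine_tune_on_hum_cxr(pretrained: tf.keras.Model, train_ds, val_ds, n_labels: int = 7):
    """Replace the CheXpert head with a seven-output layer and fine-tune it on HUM-CXR."""
    for layer in pretrained.layers:
        layer.trainable = False                      # keep the convolutional backbone fixed
    features = pretrained.get_layer("gap").output    # GAP output of the pretrained CNN
    outputs = layers.Dense(n_labels, activation="sigmoid")(features)
    model = models.Model(pretrained.input, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(multi_label=True, name="auc")])
    cbs = [
        # learning rate reduced by a factor of 10 after each epoch, starting from 1e-4
        callbacks.LearningRateScheduler(lambda epoch, lr: 1e-4 * (0.1 ** epoch)),
        # stop if the validation AUC does not improve for three epochs
        callbacks.EarlyStopping(monitor="val_auc", mode="max", patience=3,
                                restore_best_weights=True),
    ]
    model.fit(train_ds, validation_data=val_ds, epochs=5, callbacks=cbs)
    return model
```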

#### *2.4. Performance Assessment*

To assess the performance of our classifiers, we computed the area under the receiver operating characteristic (ROC) curve. The ROC curve was obtained by plotting the true positive rate (TPR, or sensitivity) versus the false positive rate (FPR, or 1 − specificity). Values higher than 0.8 were considered excellent [24], and the training time was recorded.
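For reference, the per-label AUCs and their average can be computed with scikit-learn as in this small sketch (the function name and the label ordering are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_label_auc(y_true: np.ndarray, y_score: np.ndarray, label_names):
    """AUC of the ROC curve for each HUM-CXR label, plus the average."""
    aucs = {name: roc_auc_score(y_true[:, i], y_score[:, i])
            for i, name in enumerate(label_names)}
    aucs["average"] = float(np.mean(list(aucs.values())))
    return aucs
```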

#### *2.5. Explainability*

Despite their proven predictive performance, CNNs are recognized as black-box models, i.e., the reasoning behind the algorithm is unknown, or known but not interpretable by humans. In order to build trust in AI systems, it is necessary to provide the user with details and reasons that make their functioning clear or easy to understand [25]. We applied gradient-weighted class activation mapping (Grad-CAM) [26], a state-of-the-art class-discriminative localization technique for CNN interpretation that outputs a visualization of the regions of the input (heat map) that are relevant for a specific prediction. Grad-CAM uses the gradient of an output class with respect to the final convolutional layer to produce a saliency map that highlights the areas of the image relevant to the detection of that class. Then, the map is upsampled to the dimensions of the original image, and the mask is superimposed on the CXR. Grad-CAM is considered an outcome explanation method, providing a local explanation for each instance. Therefore, we applied Grad-CAM to randomly selected HUM-CXR data. Grad-CAM heat maps were computed for each CNN model and averaged. In addition to superimposing them on the original image, we used the Grad-CAM heat maps to automatically generate a bounding box surrounding the area associated with the outcome. We created a mask with the salient part of the heat map (pixel importance larger than the 0.8 quantile) and used its contours to draw a bounding box highlighting the region of the input that contributed most to the prediction. Grad-CAM saliency maps were compared to saliency masks manually extracted by a radiologist (A.A.). The agreement was evaluated as the intersection area over the total area identified by the imager. DeGrave et al. [27] suggested that single local explanations are not enough to validate the correctness of a model against shortcuts and spurious correlations. Therefore, we also propose a population-level explanation obtained by averaging the saliency maps of 200 randomly sampled images, computed for the prediction with the highest probability in each image.
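The sketch below outlines how the Grad-CAM heat map, the 0.8-quantile bounding box, and the agreement score described above could be computed; it follows the standard TensorFlow Grad-CAM recipe, and the layer name, function names, and mask handling are illustrative assumptions.

```python
import cv2
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_idx, conv_layer_name):
    """Grad-CAM heat map for one image and one output class, rescaled to [0, 1]."""
    grad_model = tf.keras.Model(model.input,
                                [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, class_idx]
    grads = tape.gradient(score, conv_out)            # gradient of the class score
    weights = tf.reduce_mean(grads, axis=(1, 2))      # average-pooled gradients = channel weights
    cam = tf.reduce_sum(weights[:, tf.newaxis, tf.newaxis, :] * conv_out, axis=-1)[0]
    cam = tf.nn.relu(cam).numpy()
    cam = cv2.resize(cam, image.shape[:2][::-1])      # upsample to the input resolution
    return cam / (cam.max() + 1e-8)

def bounding_box_and_agreement(cam, expert_mask, q=0.8):
    """Binarize the heat map at its 0.8 quantile and compare it with the radiologist's mask."""
    salient = cam >= np.quantile(cam, q)
    ys, xs = np.where(salient)
    box = (xs.min(), ys.min(), xs.max(), ys.max())    # (x0, y0, x1, y1)
    # agreement = intersection area over the area identified by the radiologist
    agreement = (salient & expert_mask).sum() / max(int(expert_mask.sum()), 1)
    return box, agreement
```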

#### **3. Results**

In this section, we first introduce the baseline results (of the networks originally trained on CheXpert) and the performance of the networks following stacking, embedding, and fine tuning. Then, we present an in-depth analysis of the classification by applying Grad-CAM and comparing the extracted saliency maps with those generated by radiologists.

#### *3.1. Baseline with Pretrained CNN*

Table 5 shows the performance on the test set achieved by transfer learning without retraining in terms of AUC for each HUM-CXR class and on average.

The results are shown for each CNN, ensembling with averaging, and weighted entropy averaging. Generally, the networks pretrained on CheXpert showed promising performance on the new dataset (HUM-CXR). Failures occurred mainly for bone. Ensembling generally achieved better average results compared to single-model performance.


**Table 5.** CNN results with pretrained networks without retraining in terms of AUC. Each column represents an HUM-CXR label. We report the results for each network and for the two ensembling strategies. The best results for each class and average are highlighted in bold.

#### *3.2. Stacking and Embeddings*

Combining the predictions with a metaclassifier (stacking) significantly improved bone classification and the mean classification AUC compared to the baseline. Furthermore, the embeddings extracted from pretrained CNNs were used to train tree-based classifiers. Table 6 shows the performance achieved by stacking and embeddings with DT, RF, and XRT ensembled with simple average and entropy-weighted average.

**Table 6.** Results of stacking and tree-based models trained on embeddings extracted from pretrained CNNs in terms of AUC. Each column represents an HUM-CXR finding. We report the results for each tree model and for both ensembling strategies. Best results for each class and average are highlighted in bold.


The best model (RF + simple averaging) achieved a mean AUC of 0.856 with a maximum of 0.94 for pleura. The results show that stacking and embedding achieved better classification performance compared to the baseline. Complex machine learning models (XRT and RF) achieved better performance than simple decision tree classifiers.

#### *3.3. Fine Tuning*

The last set of experiments consisted of fine tuning the classification layers of the pretrained CNNs (Table 7). Single-model performance improved with respect to transfer learning without retraining, except for VGG16 and VGG19. The ensemble AUC increased for all strategies. Fine tuning combined with stacking achieved the best AUC for PNX (0.97), whereas, on average, it was less performant than the best embedding model. These results show that fine tuning alone is not enough to achieve competitive performance, and an additional metaclassifier is required to combine the results. All the described models are available at https://github.com/DanieleLoiacono/CXR-Embeddings.


**Table 7.** CNN results with fine tuning of the classification layer of pretrained networks in terms of AUC. Each column represents an HUM-CXR finding. We report the results for each single network, for the two ensembling strategies, and for stacking. The best results for each class and average are highlighted in bold.

#### *3.4. Grad-CAM*

We averaged the saliency maps of two batches of 200 randomly sampled images computed with Grad-CAM. The Grad-CAM heat map emphasizes the salient area within the image in shades of red and yellow, whereas the rest of the image is colored in blues and greens. Figure 3 shows that at a population level, the model was generally focused on the lung field and did not take into account shortcuts or spurious correlations that could be present in the borders.

**Figure 3.** Average of Grad-CAM saliency maps for two batches (panels **a**,**b**) of 200 randomly sampled images. Images confirm that the model focuses on the lung field (image in shades of red and yellow) and does not take into account shortcuts or spurious correlations that could be present in the borders.

We visualized the areas of the CXRs that the model considered most indicative of each prediction using gradient-weighted class activation maps (Grad-CAM) [26] and drew a bounding box around them. Randomly selected examples are shown in Figures 4–7.

**Figure 4.** Visualization of pleura prediction maps for two selected CXRs. The panels represent the saliency mask obtained with Grad-CAM (panels **a**,**d**), the relevant area (mask values higher than the 0.8 quantile) (panels **b**,**e**), and the respective bounding box (panels **c**,**f**). The saliency mask focuses on pleura abnormalities, as shown by the heat map (panels **a**,**d**).

**Figure 5.** Visualization of device prediction maps for two selected CXRs. The panels represent the saliency mask obtained with Grad-CAM (panels **a**,**d**), the relevant area (mask values higher than the 0.8 quantile) (panels **b**,**e**), and the respective bounding box (panels **c**,**f**). The saliency mask focuses on the device (hardware and/or leads), as shown by the heat map (panels **a**,**d**).

**Figure 6.** Visualization of pneumothorax prediction maps for a selected CXR. The panels represent the saliency mask obtained with Grad-CAM (panel **a**), the relevant area (mask values higher than the 0.8 quantile) (panel **b**), and the respective bounding box (panel **c**). The saliency mask, as emphasized by the heat map (panel **a**), focuses on the right lung field, which shows the pneumothorax.

**Figure 7.** Visualization of lung prediction maps for a selected CXR. The panels represent the saliency mask obtained with Grad-CAM (panel **a**), the relevant area (mask values higher than the 0.8 quantile) (panel **b**), and the respective bounding box (panel **c**). The saliency mask, as emphasized by the heat map (panel **a**), focuses on the left lung, which shows lung abnormality.

In Figure 8, we superimposed the bounding boxes for two classes to show how the model looks at different input areas depending on the specific class.

**Figure 8.** Superimposition of bounding boxes for cardiac (panel **a**, cardiac in blue and device in red) and device (panel **b**, device in blue and cardiac in red) outcomes for two examples.

Furthermore, we compared Grad-CAM maps with saliency masks extracted by a radiologist in terms of the common area over the full area identified by the expert. Our models achieved an overall average agreement of 75% (80% lung, 65% pleura, 84% cardiac, 75% PNX, and 67% device), showing that the models automatically learned meaningful features from the images, similarly to an expert radiologist. Explainable AI (XAI) algorithms for visualization are a successful approach to identify potential spurious shortcuts that the network may have learned. Overall, our CNNs focused on meaningful areas of the image for the respective prediction. We found some inconsistencies in a few examples of device predictions, especially with pacemakers. Figure 9 shows an example of a correct classification that is nonetheless based on an area that does not match well the hardware of the cardiac implantable electronic device (CIED).

**Figure 9.** Shortcut for the identification of a pacemaker focusing on the leads: the saliency mask obtained with Grad-CAM (panel **a**), the relevant area (mask values higher than the 0.8 quantile) (panel **b**), and the respective bounding box (panel **c**).

The saliency map highlights the intracardiac leads as the region responsible for device prediction.

#### **4. Discussion**

In this work, we first developed and trained CNN models to extract features; thereafter, we applied different transfer learning approaches that reuse the feature extractor stage of the pretrained CNNs on a test dataset, proving the efficiency of transfer learning for domain and task adaptation in medical imaging. Finally, we used Grad-CAM saliency maps to interpret, understand, and explain the CNNs and to investigate the presence of potential Clever Hans effects, spurious shortcuts, and dataset biases. Our results support the use of transfer learning to overcome the need for large datasets, toward promising AI-powered medical imaging that assists imagers in automating repetitive tasks and prioritizing unhealthy cases.

CNNs were first introduced for handwritten zip code recognition in [28], dramatically increasing the performance of deep learning models, especially with N-dimensional matrix input (e.g., three-channel images). Since then, CNNs have proven successful for image analysis, understanding, and classification. Convolutional layers are used in sequence to progressively reduce the input size and simultaneously perform feature extraction, starting from simple patterns in early convolutional layers (edges, curves, etc.) to semantically strong high-level features in deeper layers. The feature maps, i.e., the output at each convolutional step, can be represented as a continuous vector that contains a low-dimensional representation of the image, namely the image embedding. Image embeddings meaningfully represent the original input in a transformed space, reducing the dimensionality. Image embeddings can be used as input to train classifiers based on trees, kernels, Bayesian statistics, etc. Thereby, the advantage of using embeddings lies in benefiting from the feature extraction capabilities of CNNs trained on a large dataset of images while designing a specific classifier for new data and, eventually, for a slightly different task.

We trained our CNNs with a large publicly available dataset [14] to create an efficient feature extractor that could learn from a large corpus of images. Next, we proposed three transfer learning approaches to apply the feature extractor stage of the pretrained CNNs to a new local, independent dataset, HUM-CXR. Transfer learning has shown remarkable capabilities in computer vision, boosting performance for applications with small datasets. Transfer learning reduces overfitting, in addition to enabling generalization from one task to another [13], although the generalization capability decreases with the dissimilarity between the base task and the target task. Transfer learning has been successful in several fields, including image classification [21,29,30], natural language processing [31–34], cancer subtype discovery [35], and gaming [36]. We applied transfer learning to medical image understanding and classification, envisioning the possibility of developing a library of pretrained models for different medical imaging modalities and tasks. Our first transfer learning approach consisted of stacking the predictions of the pretrained CNNs and training an additional metaclassifier to learn the correspondence between them and the HUM-CXR outcomes. The second approach involved two steps: first, the image embeddings of the last convolutional layer were extracted; then, additional tree-based classifiers were trained to map them to the output vector. Finally, we applied a more conventional fine tuning of the last classification layer of each CNN.
In this way, the classification layer was customized to the label vector of the new dataset, and the final weights were updated to learn the correspondence between the features extracted by the CNN and the output. In addition to achieving a best classification performance of 0.856 average AUROC, transfer learning with image embeddings has the advantage of minimizing the computational power, dataset dimensions, and time required to adapt the pretrained models to a new dataset and task. The time required to train our tree-based model was in the order of a few minutes, overcoming the need for considerable computational resources, long training times, and GPU availability.

As a proof of concept, we applied this framework to CXRs. CXRs are commonly used for diagnosis, screening, and prognosis; thus, large labeled datasets are already available, such as CheXpert [14], MIMIC-CXR [37], and ChestX-ray [38]. Several previous studies have focused on CXR diagnosis with deep learning using these publicly available datasets. CheXNet [39] achieved state-of-the-art performance on fourteen disease classification tasks with ChestX-ray data [38], and the modified version CheXNeXt [40] achieved radiologist-like performance on ten diseases. On the same dataset, Ye et al. [41] proposed localization of thoracic diseases in addition to CXR classification. Along with the publication of the dataset, Irvin et al. [14] proposed a solution achieving performance comparable to that of expert radiologists for the classification of five thoracic diseases. Recently, Pham et al. [23] improved state-of-the-art results on CheXpert, proposing an ensemble of CNN architectures. We used the same dataset as Irvin et al. [14], Pham et al. [23], and Giacomello et al. [15] for pretraining; however, whereas they focused on only five representative findings, we enlarged the classification to seven classes. We can compare the performance for cardiomegaly and pleural effusion, the two findings that are most similar between HUM-CXR and CheXpert. With respect to cardiomegaly, [14,15,23] achieved a best AUROC of 0.828, 0.854, and 0.910, respectively. With respect to pleural effusion, [14,15,23] achieved a best AUROC of 0.940, 0.964, and 0.936, respectively. Our models, obtained by transferring the knowledge acquired on CheXpert to an independent local dataset, achieved a best AUROC of 0.88 and 0.97 for cardiac and pleura, respectively. However, [14,15,23] trained and tested on data from the same dataset, i.e., with the same distribution, demographic and geographic characteristics (USA residents, Stanford Hospital), and potential biases in the data. Hence, these models are potentially prone to the "Clever Hans" effect [42], which limits their actual transition to clinical application. Weber et al. [43] discussed the importance of evaluating the performance of a DL model on applications for which it was not explicitly trained in order to characterize its generalization capabilities and avoid the Clever Hans effect. Similarly, in a recent analysis of COVID-19 machine learning predictors, Roberts et al. [44] claimed that none of the works under review was reliable enough for the transition from scientific research to clinical routine due to dataset biases, insufficient model evaluation, limited generalizability, and lack of reproducibility, among other reasons. Furthermore, they argued that the scientific community is focusing too much on outperforming benchmarks on public datasets. Using only public datasets without generalizing to new data can lead to overfitting, strongly hindering clinical translation. For these reasons, in this work, we did not focus on outperforming the state of the art in CXR classification, instead proposing a reproducible framework to overcome some of the main limitations of DL in medical imaging toward a more robust AI-powered clinical routine. In particular, we achieved the following insights. The original models performed poorly on the baseline task (best average AUROC: 0.777), i.e., when using the CNNs directly in inference on the new external independent dataset; therefore, even though they were trained on an extremely large dataset, the CNNs were not able to generalize to a new domain and additional data. On the other hand, using transfer learning, in particular with image embeddings, it is possible to adapt the original models to a new domain, i.e., a new hospital with different geographic and demographic characteristics, and to new tasks, i.e., different labels, with minimum effort and competitive performance (best average AUROC: 0.856). Our approach is not limited to our dataset and the highlighted application; it could be adopted and successfully applied by any other research group or hospital that needs to classify medical images but does not have either a sufficient volume of data or the computational resources to train a model from scratch. Following this framework, the resulting models will have excellent feature extraction capability learned from large public datasets, but they will be validated, tailored, and improved with respect to the specific application to achieve optimal results.

Although adherence to the FAIR principles [45] is recommended for scientific data management, a recent systematic review demonstrated the poor reproducibility of deep learning research. The majority of published deep learning studies focused on medical imaging were non-randomized retrospective trials (only 7% were prospective studies tested in a real-world clinical setting), affected by a high risk of bias (72%), with low adherence to existing reporting standards and without access to data and code (available in 5% and 7% of cases, respectively). Furthermore, deep learning studies typically describe the methods used only scantly and elusively, affecting external validity and implementation in clinical settings [7]. To comply with the FAIR principles, respect legal requirements, and preserve the institutional policy, we exhaustively described our methods, providing details for each step, from image analysis to model building, and we made our models available (https://github.com/DanieleLoiacono/CXR-Embeddings). Regardless of the individual AUC values for each class and the direct comparison between HUM-CXR and CheXpert labels, we demonstrated the efficiency of the proposed method. We believe that by making the models available, we guarantee the reproducibility of the proposed methodology, strongly encouraging other groups to repeat our approach with CXRs and/or other images (e.g., computed tomography (CT)).

Finally, CNNs are black-box models that are difficult to interpret, significantly hindering their acceptance in critical fields, such as medicine. DeGrave et al. [27] demonstrated that DL models for COVID-19 detection relied on spurious shortcuts, such as lateral markers, image annotations, and borders, to distinguish between positive and negative patients instead of identifying real markers of COVID-19 in the lung field. They suggested that explainable AI (XAI) models should be applied to every AI application in medicine and should be a prerequisite for clinical translation to routine practice. The trustworthiness of AI models for clinical diagnosis and prognosis has to be accurately assessed before they can be applied in a real setting. Several algorithms have been proposed to overcome the intrinsic black-box nature of CNNs. DeepLIFT [46] and SHAP GradientExplainer [47] are based on feature importance, with the aim of measuring the relevance and importance of each input feature in the final predicted output, usually using the coefficients of linear models as interpretability models. Another approach is DGN-AM [48], which evaluates which neurons are maximally activated by a particular input observation, with the aim of identifying input patterns that maximize the output activation. CAM [49], Grad-CAM [26], and LRP [50] create coarse localization maps of the input that define the discriminative regions for a specific prediction.

For these reasons, we applied Grad-CAM [26] to our problem, with the aim of (1) interpreting, understanding, and explaining the CNNs' black-box behavior through comprehensible explanations to increase trust in and acceptance of AI for medical imaging toward translation to clinical routine; and (2) investigating the presence of potential Clever Hans effects, spurious shortcuts, and dataset biases. Overall, the explanations provided by Grad-CAM showed a satisfactory ability of the model to identify specific markers and features of the identified class. Grad-CAM saliency maps were located inside the lung field, with particular attention to the correct side of the chest. Double-class images correctly showed the differences between chest findings. However, DeGrave et al. [27] were skeptical about presenting only a few examples of explanations, as they may not truthfully represent the real behavior of the model. They discussed the need for a population-level explanation to demonstrate the correctness and reasoning of the entire model, in addition to selecting single examples. In this work, we presented randomly selected examples and population-level explanations averaging two batches of 200 CXRs. The averaged saliency maps presented in Figure 3 show a high level of attention in the center of the image, whereas the borders contribute almost nothing. Our findings demonstrate that the models were generally focused on the lung field without exploiting shortcuts and spurious correlations that may be present outside the lung field, such as annotations, different border dimensions, and lateral markers. Overall, the local explanations did not indicate the use of shortcuts, in agreement with the population-level behavior of the model. The only exception we identified concerns the device class, particularly when detecting a CIED. Whereas the model generally correctly focused on the hardware components, in some examples (Figure 9), it correctly classified "device" but exploited the intracardiac leads. This finding is not incorrect, but we would expect the model to focus more on the hardware, i.e., the main box. We believe this might be caused by the original dataset on which the models were pretrained. The device class is broad and includes lines, tubes, valves, catheters, CIEDs, hardware, coils, etc. However, the percentage of each subclass is not publicly available, so it is possible that "some objects", such as tubes, leads, electrodes, and catheters, are more prevalent than CIEDs, inducing the model to focus on them. Furthermore, we investigated false-positive predictions with respect to the device class. In most cases, we found that the model correctly classified devices, although the ground truth was incorrect. The main reason for such false positives is that our labels were extracted from unstructured medical reports. Whereas diseases are clearly written and discussed in the report, cardiac devices, electrodes, prostheses, and other "objects" may be omitted because they are not considered as "abnormal" as medical pathologies or clinically relevant. We reasonably believe that with further effort in the definition of the ground truth, the performance on the "normal" and "device" labels can be improved.

In contrast to the reports by Saporta et al. [51] and Arun et al. [52], who recently demonstrated the unreliability of current saliency methods for explaining deep learning algorithms in chest X-ray interpretation, we found a satisfactory match between Grad-CAM saliency maps and a human benchmark (overall average agreement of 75%), although our data confirmed the same issues (a larger gap between Grad-CAM and radiologist saliency maps in cases of diseases characterized by multiple instances, complex shapes, and small size [51]). We found more variability in some classes, such as pleura and device (65% and 67%, respectively), whereas lung, cardiac, and PNX exhibited greater agreement (80%, 84%, and 75%, respectively).

The results of our study may be of value for both the medical and the scientific communities, as well as for the general population. Overall, our results may impact AI applicability in the medical field, speeding up the translation of machine and deep learning algorithms toward clinical application and partially addressing the problem of the increasing demand for medical doctors. In this work, we analyzed the adaptability and applicability of state-of-the-art image classification techniques to a new dataset of images collected in a different country with different scanners. Our models achieved competitive performance (AUC > 0.85), correctly identifying and labeling seven classes from X-ray images. We also showed that our models interpreted X-rays similarly to expert radiologists. We proved the feasibility of our approach to train large models and apply them in different countries and hospitals. The next steps of this work will include the investigation of more recent CNN models [53–55] and the validation of this proof of concept on other datasets, as well as on different kinds of images (e.g., computed tomography). Finally, we have to acknowledge some limitations of our study. First, our results are limited by the retrospective design of the study. Secondly, we did not evaluate the optimal transfer learning approach when we trained the seven CNNs on the CheXpert dataset; however, this was outside the scope of the present work. Thirdly, different and more recent CNN models should also be evaluated in future research.

#### **5. Conclusions**

In this work, we proposed three transfer learning approaches for medical imaging classification. We demonstrated that CNNs pretrained on a large public dataset of medical images can be exploited as feature extractors for a different task (i.e., different classes) and domain (different country, scanner, and hospital) than the original one. In particular, the extracted image embeddings contain relevant information to train an additional classifier with satisfactory performance on an independent local dataset. This overcomes the need for large private datasets, considerable computational resources, and long training times, which are major limitations for the successful application of AI in clinical practice. Finally, we showed that saliency maps can be relied upon for deep learning explainability in medical imaging, demonstrating that the models automatically learned to interpret X-rays in agreement with expert radiologists.

**Author Contributions:** P.L., D.L. and A.C. ideated the project; D.L., E.G., M.S. and M.K. planned the project; M.S., A.A. and M.K. contributed to data collection and image labelling and retrieval; N.G. performed the analyses; E.G. and D.L. supervised the analyses; N.G., M.S., D.L. and M.K. prepared the draft; A.C. and P.L. commented on the paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** This retrospective study was performed in accordance with the principles of the Declaration of Helsinki. Approval was granted by the Ethics Committee of IRCCS Humanitas Research Hospital (authorization number 3/18 of 17 April 2018; amended on 22 October 2019 with authorization number 37/19).

**Informed Consent Statement:** Informed consent was waived (observational retrospective study).

**Data Availability Statement:** This manuscript represents valid work, and neither this manuscript nor one with substantially similar content under the same authorship has been published or is being considered for publication elsewhere. Arturo Chiti had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. All the described models are available at https://github.com/DanieleLoiacono/CXR-Embeddings.

**Acknowledgments:** We thank Elisa Maria Ragaini, Calvin Jones, and Alessandro Gaja Levra for their support in data collection.

**Conflicts of Interest:** Arturo Chiti reports a fellowship grant from Sanofi and personal fees from AAA, Blue Earth Diagnostics, and General Electric Healthcare outside the scope of the submitted work. The other authors do not report any conflict of interest.

#### **References**


## *Article* **Detecting Coronary Artery Disease from Computed Tomography Images Using a Deep Learning Technique**

**Abdulaziz Fahad AlOthman 1,\*, Abdul Rahaman Wahab Sait <sup>1</sup> and Thamer Abdullah Alhussain <sup>2</sup>**


**Abstract:** In recent times, coronary artery disease (CAD) has become one of the leading causes of morbidity and mortality across the globe. Diagnosing the presence and severity of CAD in individuals is essential for choosing the best course of treatment. Presently, computed tomography (CT) provides high-spatial-resolution images of the heart and coronary arteries in a short period and allows excellent visualization of the coronary arteries. On the other hand, there are many challenges in analyzing cardiac CT scans for signs of CAD. Research studies apply machine learning (ML) to overcome these limitations and achieve high accuracy and consistent performance. Convolutional neural networks (CNN) are widely applied in medical image processing to identify diseases. However, there is a demand for efficient feature extraction to enhance the performance of ML techniques, as the feature extraction process is one of the factors determining their efficiency. Thus, this study intends to develop a method to detect CAD from CT angiography images. It proposes a feature extraction method and a CNN model for detecting CAD in minimal time with optimal accuracy. Two datasets are utilized to evaluate the performance of the proposed model. The present work is unique in applying a feature extraction model with a CNN for CAD detection. The experimental analysis shows that the proposed method achieves 99.2% and 98.73% prediction accuracy, with F1 scores of 98.95 and 98.82, for the two benchmark datasets. In addition, the outcome suggests that the proposed CNN model achieves areas under the receiver operating characteristic and precision-recall curves of 0.92 and 0.96 for dataset 1 and 0.91 and 0.90 for dataset 2, respectively. The findings highlight that the performance of the proposed feature extraction and CNN model is superior to that of the existing models.

**Keywords:** coronary artery disease; deep learning; machine learning; cardiopulmonary disease; faster CNN

**\*** Correspondence: afalothman@kfu.edu.sa

**Citation:** AlOthman, A.F.; Sait, A.R.W.; Alhussain, T.A. Detecting Coronary Artery Disease from Computed Tomography Images Using a Deep Learning Technique. *Diagnostics* **2022**, *12*, 2073. https://doi.org/10.3390/diagnostics12092073

Academic Editors: Sameer Antani and Sivaramakrishnan Rajaraman

Received: 23 June 2022; Accepted: 23 August 2022; Published: 26 August 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Coronary artery disease (CAD) has recently come to be regarded as one of the most dangerous and life-threatening chronic diseases [1]. Blockage and narrowing of the coronary arteries is the primary cause of heart failure; the coronary arteries must be open to provide the heart with adequate blood [2–4]. According to a recent survey, the United States has the highest heart disease prevalence and the highest ratio of heart disease patients [5]. Shortness of breath, swollen feet, and fatigue are among the most frequent symptoms of heart disease. CAD is the most common type of heart disease and can cause chest discomfort, stroke, and heart attack. Other cardiac conditions include heart rhythm disorders, congestive heart failure, congenital heart disease, and cardiovascular disease [6].

Traditional methods of investigating cardiac disease are complex [7–10]. The lack of medical diagnostic instruments and automated systems makes pulmonary heart disease detection and treatment challenging in developing nations. However, to reduce the impact of CAD, an accurate and appropriate diagnosis of cardiac disease is necessary. Developing countries are experiencing an alarming rise in the number of people dying from heart disease [11–16]. According to the WHO, CAD is the most frequent type of heart disease, claiming the lives of 360,900 individuals globally in 2019 [17]. This figure accounts for nearly 30% of all deaths worldwide. The number of people affected is increasing exponentially. Multiple risk factors are involved in CAD prediction. Thus, healthcare centers require a tool to detect CAD at earlier stages. Recent developments in CNN models enable researchers to develop prediction models for CAD. However, CNN architectures are complex and need a powerful graphical processing unit (GPU) to process complex images.

Among conventional approaches, analytical angiography is considered one of the most accurate procedures for detecting heart abnormalities. The disadvantages of angiography include its high cost, various side effects, and the need for a high level of technological competence [18]. Due to human error, conventional methods often yield inaccurate diagnoses and take longer to complete. In addition, they are costly and time-consuming and require considerable processing.

Artificial intelligence (AI) applications have been increasingly included in clinical diagnostic systems during the last three decades to improve their accuracy. Data-driven decision-making using AI algorithms has become increasingly common in the CAD field in recent years [19]. Diagnostic accuracy can be improved by automating and standardizing the interpretation and inference processes. AI-based systems can help speed up decision-making. Healthcare centers can obtain, evaluate, and interpret data from these emerging technologies and facilitate better patient service [20]. The quality of the raw data can significantly affect the quality and performance of AI approaches. As a result, extensive collaboration between AI engineers and clinical professionals is required to improve the quality of diagnosis [21]. Recent CAD detection techniques are based on images. Faster predictions can be made for clinicians and computer scientists by removing irrelevant features. The key features representing the crucial characteristics of CAD determine the performance of AI techniques [22]. Many studies use deep learning (DL) to determine the existence of CAD.

Convolutional neural networks (CNN) are becoming increasingly popular in medical image processing. CNN was initially demonstrated in medical image analysis in the work of [23] for lung nodule diagnosis. Numerous medical imaging techniques are based on this concept [24–27]. Using a pre-trained network as a feature generator and fine-tuning a pre-trained network to categorize medical images are two strategies for transferring the knowledge stored in pre-trained CNNs. Standard networks can be divided into multiple classes when used as pre-trained medical image analysis models. Kernels with large receptive fields are used in the higher layers near the input, while smaller kernels are used in the deeper levels. Among the networks in this group, AlexNet is the most widely used and has many applications in medical image processing [28–31].

Deep learning networks are advanced AI techniques and have gained popularity in the medical field. The first network in this category was GoogleNet [32–36]. However, existing methods have shortcomings, such as long computation times and the need for high-end systems. In addition, the performance of current CNN architectures is limited in terms of accuracy and F-measure, and the literature on integrating feature minimization with CAD detection techniques is scarce. Therefore, this study intends to develop a CNN-based classifier to predict CAD with high accuracy. The objective of the study is to develop a feature extraction method and a CNN model that detect CAD from CT angiography images in minimal time with optimal accuracy.


The research questions of the proposed study are:

Research Question 1 (RQ1): How can the performance of a CAD detection technique be improved?

Research Question 2 (RQ2): How can the performance of a CAD detection technique be evaluated?

The structure of the study is organized as follows: Section 2 presents the recent literature related to CNN and CAD. Section 3 outlines the methodology of the proposed research. Results and discussion are highlighted in Section 4. Finally, Section 5 concludes the study and outlines its future improvement.

#### **2. Literature Review**

High-accuracy data-mining techniques can identify risk factors for heart disease, and several existing studies have addressed the diagnosis of CAD [1–5]. For example, an artificial immune recognition system (AIRS), K-nearest neighbor (KNN), and clinical data were used to develop a system for diagnosing CAD that achieved an accuracy of 87%.

The authors of [1] developed and evaluated a deep-learning algorithm for diagnosing CAD based on facial photographs. Patients who underwent coronary angiography or CT angiography at nine Chinese sites participated in a multicenter cross-sectional study to train and evaluate a deep CNN that detects CAD from patient facial images. More than 5796 patients were included in the study and were randomly assigned to training and validation groups for algorithm development. According to the findings, a deep-learning algorithm based on facial photographs can help predict CAD.

According to a study [2], semi-upright and supine stress myocardial perfusion imaging combined with deep learning can be used to predict the presence of obstructive disease; the total perfusion deficit was calculated using standard gender and camera-type limits. Another study [3] employed interferometric OCT in cardiology to characterize coronary artery tissues, yielding a resolution of between 10 and 20 μm. Using OCT, the authors of [3] investigated various deep learning models for robust tissue characterization, learning the intracoronary pathological formations induced by Kawasaki disease (KD). A total of 33 retrospective cases of intracoronary cross-sectional images from different pediatric patients with KD were used in the experiments. The authors analyzed and compared deep features generated from three pre-trained convolutional networks, and voting was conducted to determine the final classification.

The authors of [6] used deep-learning analysis of the left-ventricular myocardium to identify individuals with functionally significant coronary stenosis in rest coronary CT angiography (CCTA). The study included 166 participants who had sequentially undergone invasive FFR measurements and CCTA scans. Analyses were carried out in stages to identify patients with functionally significant coronary artery stenosis.

Using deep learning, the researchers of [7] investigated the accuracy of automatically predicting obstructive disease from myocardial perfusion imaging compared with the total perfusion deficit. Single-photon emission computed tomography can be used to build deep convolutional neural networks that better predict coronary artery disease in individual patients and individual vessels. Obstructive disease was found in 1018 patients (62%) and in 1797 of 4914 (37%) arteries in this study. The area under the receiver operating characteristic curve for disease prediction was larger with deep learning than with the total perfusion deficit, indicating that myocardial perfusion imaging can be improved using deep learning compared with existing clinical techniques.

In the study [8], several deep-learning algorithms were used to classify electrocardiogram (ECG) data into CAD, myocardial infarction, and congestive heart failure. For this classification task, CNNs and LSTMs tend to be the most effective architectures. The study built and verified a 16-layer LSTM model using a 10-fold cross-validation procedure, achieving a classification accuracy of 98.5%. The authors claimed their algorithm might be used in hospitals to identify and classify aberrant ECG patterns.

The authors of [9] proposed an enhanced DenseNet algorithm based on transfer learning techniques for fundus medical imaging. Two separate experiments were conducted on fundus medical imaging data. A DenseNet model can be trained from scratch or fine-tuned using transfer learning; transferring models pre-trained on a natural image dataset to fundus medical images improves the model's performance. This method improves the categorization accuracy of fundus medical images, which is critical for determining a patient's medical condition.

The study [10] developed and implemented a heterogeneous low-light image-enhancement approach based on a DenseNet generative adversarial network. A generative adversarial network is first implemented using the DenseNet framework and is then employed to learn the feature map from low-light to normal-light images.

To overcome the gradient-vanishing problem in deep networks, the DenseNet convolutional neural network with dense connections combines the strengths of ResNet and Highway networks [11,12]. As a result, all network layers can be directly connected through the DenseNet: each layer is directly connected to the subsequent layers, and the input of each layer is derived from the outputs of all preceding layers. Weak information transmission in deep networks is the primary cause of vanishing gradients [13]. A more efficient way to reduce gradient disappearance and improve network convergence is the dense block design, in which each layer is directly coupled to the input and the loss [14].

The authors of [15] employed a bright-pass filter and logarithmic transformation to improve image quality. The authors of [16] proposed a weighted variational model for simultaneous reflectance and illumination estimation (SRIE) to deal with the issue of over-enhanced dark areas. The authors of [17] developed low-light image enhancement via illumination map estimation (LIME), which estimates only the illumination component; the reflection component of the image was calculated using local consistency and structural perception constraints, and the output was based on this calculation.

The study [18] used the Doppler signal and a neural network to obtain the best possible CAD diagnosis. By combining exercise test data with a support vector machine (SVM), the authors of [19] achieved an accuracy of 81.46% in the diagnosis of coronary artery disease (CAD). Employing multiple neural networks, the authors of [20] achieved an accuracy of 89.01% for CAD diagnosis on the Cleveland dataset [21]. Artery stenosis disease can be predicted using various feature selection approaches, including CBA, filter, genetic algorithm, wrapper, and numerical and nominal attribute selection. In addition, Ref. [22] uses a new feature creation method to diagnose CAD.

Inception-v3 [24] is an enhanced version of GoogleNet and has been applied in medical image analysis, for example, to categorize knee images by training support vector machines on deep features extracted from CaffeNet. Adults' retinal fundus images were analyzed using a fine-tuned network to detect diabetic retinopathy [24], and classification results obtained with fine-tuned networks compete with human expert performance [25]. Recent research has focused on applying deep learning techniques to segment retinal optical coherence tomography (OCT) images [26–28]. By combining CNN and graph search methods, OCT retinal images can be segmented; layer-boundary classification probabilities from the Cifar-CNN architecture are used to partition the graph search layer [29,30].

The authors of [31] proposed a deep learning technique to quantify and segment intraretinal cystoid fluid using a fuzzy CNN. Geographic atrophy (GA) segmentation using a deep network is the subject of another study [33]. An automated CAD detector was developed using a CNN with an encoder–decoder architecture [34]. In another study, researchers employed GoogleNet to identify retinal diseases in OCT images [35].

Several grayscale features extracted from echocardiogram images of normal and CAD participants were proposed in [36] as a computer-aided diagnosis approach. In [24], HR signals derived from ECG data of normal and CAD participants were evaluated. Various methods were used to examine the heart rate data, including non-linear, frequency-domain, and time-domain analyses, and CAD participants' heart rate signals were found to be less erratic than those of normal subjects. Recent CNN models are widely applied in CAD diagnostics [36]. In [37], the authors proposed a model for identifying cardiovascular diseases and obtained a prediction accuracy of 96.75%. Ali Md Mamun et al. [38] argued that a simple supervised ML algorithm can predict heart disease with high accuracy. The authors of [39] developed a biomedical electrocardiogram (ECG)-based ML technique for detecting heart disease. Jiely Yan et al. [40] proposed a model to predict ion channel peptides from images. Table 1 outlines the features and limitations of the existing CNN models.


**Table 1.** Features of the existing literature.

#### **3. Research Methodology**

Guided by the research questions, the researchers developed a CNN architecture to predict positive CAD patients from CT images. Figure 1 presents the proposed architecture. Initially, the images are processed to extract features; the CNN model then processes the extracted features and generates the output through an activation function. The remainder of this section provides information on the datasets, feature extraction, CNN construction, and evaluation metrics.

**Figure 1.** Proposed CNN network for CAD.

In this study, researchers employed two datasets of CT angiography images. The details of the datasets are as follows:

Dataset 1 [4] contains coronary artery image sets of 500 patients. Eighteen views of the same straightened coronary artery are shown in each mosaic projection view (MPV). The training–validation–test sets follow a 3/1/1 ratio (300/100/100 patients), with 50% normal and 50% diseased cases in each subset. To improve modeling and dataset balance, 2364 (i.e., 394 × 6) artery images were obtained from the 300 training cases. Only 2304 training images were augmented; the standard (original) component of the training set, all the validation images, and all the testing images were left unaugmented. Balance was maintained in the validation dataset by randomly selecting one artery per normal case (50 images) and per diseased patient (50 images). Figure 2a,b shows CT images of positive and negative CAD patients.

**Figure 2.** (**a**): Positive individual, (**b**): negative individual.

Dataset 2 [5] consists of CT angiography images of 200 patients. This dataset used images from a multicenter registry of patients who had undergone clinically indicated coronary computed tomography angiography (CCTA). The annotated ground truth included the ascending and descending aortas (PAA, DA), superior and inferior venae cavae (SVC, IVC), pulmonary artery (PA), coronary sinus (CS), right ventricular wall (RVW), and left atrial wall (LAW). Figure 3 shows the CT images of dataset\_2. Table 2 outlines the description of the datasets. Both datasets contain CT images of CAD and non-CAD patients.

**Figure 3.** Superior vena cava images of individuals.

**Table 2.** Description of datasets.


The study applies the following steps for identifying CAD using CNN architecture from datasets:

Step 1: Preprocess images

The CCTA images are processed to fit the feature extraction phase. All images are resized to 600 × 600 pixels, a size that suits the feature extraction process and allows a reduced set of features to be generated without losing valuable information.

Step 2: Feature extraction

To answer RQ1, the proposed study applies an enhanced features from accelerated segment test (FAST) [6] algorithm to extract features that support the pooling layer of the CNN in producing effective feature maps. To reduce the processing time of the FAST algorithm, the researchers employed the enhanced FAST [5]. Figure 4 shows a feature extracted from a 4 × 4 image into a 2 × 2 image; it also highlights that the actual image can be reconstructed from the 2 × 2 image back to the 4 × 4 image.

**Figure 4.** Process of feature extraction.
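As a rough illustration of this step, the sketch below runs OpenCV's standard FAST detector on a stand-in CT slice. It is not the enhanced FAST of [5,6]; the threshold, non-max suppression setting, and synthetic input image are all assumptions made only for demonstration.

```python
import cv2
import numpy as np

# Stand-in for a preprocessed 600 x 600 CCTA slice; in practice this would be
# loaded with cv2.imread(..., cv2.IMREAD_GRAYSCALE) and resized per Step 1.
image = (np.random.rand(600, 600) * 255).astype(np.uint8)

# OpenCV's standard FAST detector, used here only as a stand-in for the
# enhanced FAST algorithm; threshold and non-max suppression are assumptions.
fast = cv2.FastFeatureDetector_create(threshold=25, nonmaxSuppression=True)
keypoints = fast.detect(image, None)

print(f"Detected {len(keypoints)} FAST keypoints")
annotated = cv2.drawKeypoints(image, keypoints, None, color=(0, 255, 0))
cv2.imwrite("ccta_slice_fast.png", annotated)
```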

The extraction process is described as follows:

Let an image $I$ of $M_1 \times M_2$ pixels be divided into segments of size $S_1 \times S_n$. The number of segments is $N_1 \times N_2$, where $N_1 = M_1/S_1$ and $N_2 = M_2/S_n$. The segments are represented in Equation (1).

$$I = \begin{bmatrix} Sd\_{1,1} & Sd\_{1,2} & \dots & Sd\_{1,N\_n} \\ \vdots & \vdots & & \vdots \\ Sd\_{N\_1,1} & Sd\_{N\_1,2} & \dots & Sd\_{N\_1,N\_n} \end{bmatrix} \tag{1}$$

where $Sd_{x,y}$ denotes the image segment in the $x$ and $y$ directions and is described in Equation (2):

$$Sd\_{x,y} = I(i,j) \tag{2}$$

where $i$ and $j$ represent the pixel indices of the image segment $Sd_{x,y}$.

Equations (3) and (4) describe the ranges of the pixel indices within each image segment:

$$i = (y-1)M\_2,\ (y-1)M\_2 + 1,\ \dots,\ yM\_2 - 1 \tag{3}$$

$$j = (x-1)M\_1,\ (x-1)M\_1 + 1,\ \dots,\ xM\_1 - 1 \tag{4}$$

The transformation function ensures that the image or segment can be reconstructed to its original form, which enables the proposed method to backtrack through the CNN to fine-tune its performance. The transformation function for each segment is given in Equation (5):

$$\varphi Sd\_{x,y} = Z\_{S\_1}\, Sd\_{x,y}\, Z\_{S\_n}^T \tag{5}$$

where $\varphi Sd_{x,y}$ represents the extracted feature (transform coefficients) of the image segment, $x = 1, \dots, N_1$, $y = 1, \dots, N_n$, $Z$ denotes the transform matrix with $Z_{S_1} \in Z^{O}_{S_1}$, and $O$ represents the order of the transformation. The segment can be reconstructed as in Equation (6):

$$Sd\_{x,y} = Z\_{S\_1}^T\, \varphi Sd\_{x,y}\, Z\_{S\_n} \tag{6}$$

Sequentially, the process must be repeated $N_1 \times N_n$ times to extract the full set of features from the image. The transform coefficients of all image segments can then be assembled using Equations (7)–(11).

$$\varphi = \begin{bmatrix} Z\_{S\_1} Sd\_{1,1} Z\_{S\_n}^T & \dots & Z\_{S\_1} Sd\_{1,N\_n} Z\_{S\_n}^T \\ \vdots & & \vdots \\ Z\_{S\_1} Sd\_{N\_1,1} Z\_{S\_n}^T & \dots & Z\_{S\_1} Sd\_{N\_1,N\_n} Z\_{S\_n}^T \end{bmatrix} \tag{7}$$

Equations (8) and (9) define the block-diagonal matrices $F_{S_1}$ and $F_{S_n}$, which are constructed from $Z_{S_1}$ and $Z_{S_n}$, respectively:

$$F\_{S\_1} = \begin{bmatrix} Z\_{S\_1} & O & \dots & O \\ O & Z\_{S\_1} & \dots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \dots & Z\_{S\_1} \end{bmatrix} \text{ of order } N\_1 \tag{8}$$

$$F\_{S\_n} = \begin{bmatrix} Z\_{S\_n} & O & \dots & O \\ O & Z\_{S\_n} & \dots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \dots & Z\_{S\_n} \end{bmatrix} \text{ of order } N\_n \tag{9}$$

Equation (10) shows a sample set of features, $\partial_{nm}$:

$$\sum\_{x \in X} \left( F\_{S(n,x)} \ast F\_{S(m,x)} \right) = \partial\_{nm} \tag{10}$$

Equation (11) defines the reconstruction of the image using the extracted features.

$$I = F\_{S\_1}^T\, \varphi\, F\_{S\_n} \tag{11}$$
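To make the segment-wise transform of Equations (5)–(11) concrete, the sketch below applies an orthonormal transform to every segment and then recovers the image from the coefficients. The paper does not specify which transform matrix $Z$ is used, so a DCT basis is assumed here purely for illustration, as are the segment size and the random stand-in image.

```python
import numpy as np
from scipy.fft import dct

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis, used as a stand-in for the transform matrix Z."""
    return dct(np.eye(n), norm="ortho", axis=0)

def extract_features(image: np.ndarray, s1: int, sn: int) -> np.ndarray:
    """Apply Z * Sd_{x,y} * Z^T to every s1 x sn segment (Equations (5) and (7))."""
    m1, m2 = image.shape                      # image dimensions must be divisible by s1, sn
    z1, zn = dct_matrix(s1), dct_matrix(sn)
    coeffs = np.zeros_like(image, dtype=float)
    for x in range(0, m1, s1):
        for y in range(0, m2, sn):
            segment = image[x:x + s1, y:y + sn]
            coeffs[x:x + s1, y:y + sn] = z1 @ segment @ zn.T
    return coeffs

def reconstruct(coeffs: np.ndarray, s1: int, sn: int) -> np.ndarray:
    """Invert the transform segment by segment (Equation (6))."""
    m1, m2 = coeffs.shape
    z1, zn = dct_matrix(s1), dct_matrix(sn)
    image = np.zeros_like(coeffs)
    for x in range(0, m1, s1):
        for y in range(0, m2, sn):
            block = coeffs[x:x + s1, y:y + sn]
            image[x:x + s1, y:y + sn] = z1.T @ block @ zn
    return image

img = np.random.rand(600, 600)                     # stand-in for a 600 x 600 preprocessed CT image
feat = extract_features(img, 4, 4)
print(np.allclose(img, reconstruct(feat, 4, 4)))   # True: reconstruction is lossless
```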

Step 3: Processing features

The extracted features $F_{S_1}, \dots, F_{S_n}$ are used as the input of the proposed CNN. DenseNet ensures the transmission of information between layers; one of its characteristics is the direct link between each layer, so a backpropagation method can be implemented in DenseNet. The feature extraction process reduces the number of blocks in DenseNet and improves its performance; therefore, the modified DenseNet contains a smaller number of blocks and parameters. Research studies highlight that more complex networks require larger numbers of samples. This study applies DenseNet-161 (K = 48), which includes three block modules. Figure 5 illustrates the proposed DenseNet model. Most CNN models depend on the extracted features to make a decision, so the feature extraction process is crucial in disease detection techniques; a minimal set of features reduces the training time of the CNN model, but the features must still support the CNN in generating effective results. The researchers additionally applied an edge-detection technique.

**Figure 5.** Fine-tuned DenseNet Architecture.

Step 3.1: Pooling layer

Two-dimensional filters integrate the features in the area covered by the filter as it slides over each channel of the feature map. The dimensions of the pooling layer output are given in Equation (12):

$$\left( (I\_h - f + 1)/s \right) \ast \left( (I\_w - f + 1)/s \right) \ast I\_c \tag{12}$$

where $I_h$ is the height of the feature map, $I_w$ is the width of the feature map, $I_c$ is the number of channels in the feature map, $f$ is the filter size, and $s$ is the stride length.
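A quick sanity check of Equation (12) is sketched below with a Keras pooling layer. The feature-map size, filter size, and stride are assumptions chosen for illustration; with a unit stride, the formula matches the framework's reported output shape exactly.

```python
import tensorflow as tf

i_h, i_w, i_c = 600, 600, 3   # feature-map height, width, channels (assumed)
f, s = 2, 1                   # pooling filter size and stride (assumed)

inputs = tf.keras.Input(shape=(i_h, i_w, i_c))
pooled = tf.keras.layers.MaxPooling2D(pool_size=f, strides=s)(inputs)

print(pooled.shape)                                  # (None, 599, 599, 3)
print((i_h - f + 1) // s, (i_w - f + 1) // s, i_c)   # Equation (12): 599 599 3
```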

Step 3.2: Generating output

Transfer learning is adopted to alter the DenseNet architecture, and Leaky ReLU is used as the activation function. The existing CNN architectures are implemented using the GitHub repository (https://github.com/titu1994/DenseNet, accessed on 7 December 2021). The studies [10,18,21] are used to evaluate the performance of the proposed CNN (PCNN) model, and additional CNN models, including GoogleNet and Inception V3, are used for performance comparison. A sigmoid function is applied at the output of the modified DenseNet. Figure 6 presents the proposed feature extraction algorithm for pre-processing the CT images and extracting the valuable features, and Figure 7 highlights the proposed CNN technique for predicting CAD from the CT images.

**Figure 6.** Proposed feature extraction algorithm.

**Figure 7.** Proposed CNN model.
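A minimal sketch of this transfer-learning setup is given below, using the Keras DenseNet121 application as a stand-in for the modified DenseNet-161 described above, with a Leaky ReLU hidden layer and a sigmoid output for the CAD/non-CAD decision. The input size, hidden width, learning rate, and frozen backbone are all assumptions, not the study's exact configuration.

```python
import tensorflow as tf

# DenseNet backbone pretrained on ImageNet, used as a feature extractor
# (DenseNet121 stands in for the modified DenseNet-161 of the text).
base = tf.keras.applications.DenseNet121(
    include_top=False, weights="imagenet",
    input_shape=(600, 600, 3), pooling="avg")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(128),
    tf.keras.layers.LeakyReLU(alpha=0.01),           # Leaky ReLU activation
    tf.keras.layers.Dense(1, activation="sigmoid"),  # CAD probability
])

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```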

The study constructs a feed-forward back-propagation network and employs Leaky ReLU as the activation function. The standard ReLU is given in Equation (13):

$$f(x) = \max(0, x) \tag{13}$$

Leaky ReLU extends ReLU by treating negative inputs as a small linear component of $x$ rather than mapping them to zero. It is defined as follows:

```
def leaky_relu(x, alpha=0.01):
    # Pass positive inputs through unchanged; scale negative inputs
    # by a small linear slope (alpha = 0.01).
    if x < 0:
        return alpha * x
    return x
```

Step 4: Evaluation metrics
The study applies the benchmark evaluation metrics, including accuracy, recall, precision, F-measure, and specificity, to provide a solution for RQ2. The metrics are computed as shown in Equations (14)–(18), based on the following definitions:

True positive ($TP_{CI}$) = predicting a valid positive CAD patient from the CT images (CI). True negative ($TN_{CI}$) = predicting a valid negative CAD patient from CI. False positive ($FP_{CI}$) = predicting a negative CAD patient as positive from CI. False negative ($FN_{CI}$) = predicting a positive CAD patient as negative from CI.

$$\text{Recall} = \frac{\text{TP}\_{\text{CI}}}{\text{TP}\_{\text{CI}} + \text{FN}\_{\text{CI}}} \tag{14}$$

$$\text{Precision} = \frac{\text{TP}\_{\text{CI}}}{\text{TP}\_{\text{CI}} + \text{FP}\_{\text{CI}}} \tag{15}$$

$$\text{F}-\text{measure} = \frac{2 \ast \text{Recall} \ast \text{Precision}}{\text{Recall} + \text{Precision}} \tag{16}$$

$$\text{Accuracy} = \frac{\text{TP}\_{\text{CI}} + \text{TN}\_{\text{CI}}}{\text{TP}\_{\text{CI}} + \text{TN}\_{\text{CI}} + \text{FP}\_{\text{CI}} + \text{FN}\_{\text{CI}}} \tag{17}$$

$$\text{Specificity} = \frac{\text{TN}\_{\text{CI}}}{\text{TN}\_{\text{CI}} + \text{FP}\_{\text{CI}}} \tag{18}$$

In addition, Matthews correlation coefficient (MCC) (Equation (19)) and Cohen's Kappa (K) (Equation (20)) are employed to ensure the performance of the proposed method.

$$\text{MCC} = \frac{(\text{TP}\_{\text{CI}} \ast \text{TN}\_{\text{CI}}) - (\text{FP}\_{\text{CI}} \ast \text{FN}\_{\text{CI}})}{\sqrt{\left(\text{TP}\_{\text{CI}} + \text{FP}\_{\text{CI}}\right) \ast \left(\text{TP}\_{\text{CI}} + \text{FN}\_{\text{CI}}\right) \ast \left(\text{TN}\_{\text{CI}} + \text{FP}\_{\text{CI}}\right) \ast \left(\text{TN}\_{\text{CI}} + \text{FN}\_{\text{CI}}\right)}} \tag{19}$$

The minimum MCC is −1, which indicates a wrong prediction, whereas the maximum MCC is +1, which denotes a perfect prediction.

$$\mathcal{K} = \frac{2 \ast \left( (\text{TP}\_{\text{CI}} \ast \text{TN}\_{\text{CI}}) - (\text{FP}\_{\text{CI}} \ast \text{FN}\_{\text{CI}}) \right)}{(\text{TP}\_{\text{CI}} + \text{FP}\_{\text{CI}}) \ast (\text{FP}\_{\text{CI}} + \text{TN}\_{\text{CI}}) + (\text{TP}\_{\text{CI}} + \text{FN}\_{\text{CI}}) \ast (\text{FN}\_{\text{CI}} + \text{TN}\_{\text{CI}})} \tag{20}$$

MCC and K are class-symmetric and reflect the classification accuracy of the ML techniques. Finally, the computational complexity of the CNN technique is reported in terms of time and space.
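For reference, the sketch below computes Equations (14)–(20) from confusion-matrix counts. The counts are placeholders, not results from the study, and the Kappa denominator follows the corrected form of Equation (20).

```python
import math

tp, tn, fp, fn = 480, 470, 10, 12   # hypothetical TP_CI, TN_CI, FP_CI, FN_CI

recall      = tp / (tp + fn)                                  # Equation (14)
precision   = tp / (tp + fp)                                  # Equation (15)
f_measure   = 2 * recall * precision / (recall + precision)   # Equation (16)
accuracy    = (tp + tn) / (tp + tn + fp + fn)                 # Equation (17)
specificity = tn / (tn + fp)                                  # Equation (18)

mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))            # Equation (19)

kappa = 2 * (tp * tn - fp * fn) / (
    (tp + fp) * (fp + tn) + (tp + fn) * (fn + tn))            # Equation (20)

print(f"accuracy={accuracy:.3f}, F-measure={f_measure:.3f}, "
      f"MCC={mcc:.3f}, kappa={kappa:.3f}")
```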

To assess the predictive uncertainty of the proposed CNN (PCNN), the researchers applied the standard deviation (SD) and entropy (E). The mathematical expression of the confidence interval (CI) is defined in Equation (21).

$$\text{CI} = a \pm z \frac{\sigma}{\sqrt{N}} \tag{21}$$

where *a* represents the mean of the predictive distribution of an image *a*(*i*), *N* is the total number of predictions, and *z* is the critical value of the distribution. The researchers computed CI at 95% confidence. Thus, the value of *Z* is 1.96.

Finally, the researchers used the entropy *E* of the predictions to evaluate the uncertainty of the proposed model. It is calculated over the mean predictive distribution and is defined in Equation (22).

$$E\left(P(y^\*|a^\*)\right) = -\sum\_{i=1}^{C} P(y\_i^\*|a^\*)\log\left(P(y\_i^\*|a^\*)\right) \tag{22}$$

#### **4. Experiment and Results**

The PCNN is implemented in Python on the Windows 10 Professional platform, and the existing algorithms are implemented from their GitHub repositories. Both datasets are divided into training and testing sets, and the CNN architectures are trained with the relevant training sets of dataset\_1 and dataset\_2.

To evaluate the performance of PCNN, 5-fold cross-validation is applied to each dataset. Statistical tests, including SD, CI for binary classification, and E, are applied accordingly to dataset\_1 and dataset\_2. Table 3 presents the performance of PCNN during cross-validation on dataset\_1; it highlights that PCNN achieves more than 98% accuracy, precision, recall, F-measure, and specificity. Likewise, Table 4 presents the cross-validation outcome for dataset\_2.


**Table 3.** Performance analysis of PCNN model for dataset\_1.

**Table 4.** Performance analysis of PCNN model for dataset\_2.


#### *4.1. Uncertainty Estimation*

In this study, the researchers apply Monte Carlo dropout (MC dropout) to compute the model uncertainty. The dropout value ensures that the predictive distribution is not overly diverse and that the CI remains narrow. The researchers experimentally found that an MC dropout value of 0.379 is optimal for this model. The predictive distribution is obtained by evaluating PCNN 200 times for each image, and the model uncertainty is then computed using the CI, SD, and E.
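The procedure can be sketched as follows: keep dropout active at inference time, collect 200 stochastic predictions per image, and summarize them with the SD, the 95% CI of Equation (21), and the entropy of Equation (22). The network below is only a placeholder for the trained PCNN; the single value taken from the text is the dropout rate of 0.379.

```python
import numpy as np
import tensorflow as tf

# Placeholder network standing in for the trained PCNN; only the dropout rate
# of 0.379 comes from the text, everything else is illustrative.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(64, 64, 1)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.379),
    tf.keras.layers.Dense(2, activation="softmax"),
])

image = np.random.rand(1, 64, 64, 1).astype("float32")   # stand-in CT image

# training=True keeps dropout active, so each forward pass is a stochastic sample.
samples = np.stack([model(image, training=True).numpy()[0] for _ in range(200)])

mean_pred = samples.mean(axis=0)                          # mean predictive distribution
sd = samples.std(axis=0)                                  # SD per class
ci_half_width = 1.96 * sd / np.sqrt(len(samples))         # Equation (21), 95% confidence
entropy = -np.sum(mean_pred * np.log(mean_pred + 1e-12))  # Equation (22)

print(mean_pred, sd, ci_half_width, entropy)
```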

Tables 5 and 6 highlight the model uncertainty for dataset\_1 and dataset\_2, respectively. The proposed model achieved low entropy and SD for both datasets. It can be observed in Tables 5 and 6 that the average CIs of [98.55–98.61] and [98.45–98.51] for dataset\_1 and dataset\_2 indicate that the proposed model has high confidence and minimal variance in its outcome.

**Table 5.** Model uncertainty analysis outcome for dataset\_1.


**Table 6.** Model uncertainty analysis outcome for dataset\_2.


Table 7 highlights the performance measures for dataset\_1. Among the CNN architectures, PCNN scored superior accuracy, precision, recall, F-measure, and specificity of 98.96, 98.2, 98.52, 98.36, and 98.7, respectively. The performance of the Banerjee model [18] is lower than that of the other CNN architectures, and PCNN performs better than the existing CNN models for CAD prediction. Dataset\_1 contains a larger number of images, and the mapping of features caused the CNN architectures to generate more features; however, the feature extraction process of the proposed method enabled PCNN to produce a smaller number of features while maintaining better performance than the existing architectures. Figure 8 presents the comparative analysis outcome, and it is evident that the performance of PCNN is higher than that of the current architectures.


**Table 7.** Comparative analysis outcome of CNN model for dataset\_1.

**Figure 8.** Comparative analysis outcome: Dataset\_1.

Likewise, Table 8 outlines the performance of the CNN architectures on Dataset\_2. PCNN achieves accuracy, precision, recall, F-measure, and specificity of 98.96, 98.2, 98.52, 98.36, and 98.7, respectively, whereas GoogleNet scores the lowest values of 97.1, 96.7, 97.1, 96.9, and 96.4, respectively. The absence of temporary memory is one of the limitations of the Banerjee model that reduces its predictive performance. In addition, the outcomes in Tables 7 and 8 suggest that the performance of PCNN is higher than that of the existing CNN architectures. Figure 9 shows the corresponding graph for Table 8.


**Table 8.** Comparative analysis outcome of CNN model for dataset\_2.

**Figure 9.** Comparative analysis outcome: Dataset\_2.

In addition to the initial comparative analysis, the researchers applied MCC and Kappa to evaluate the performance of PCNN. Figures 10 and 11 reveal that PCNN achieved superior MCC and K scores compared to the existing models.

**Figure 10.** MCC and Kappa: Dataset\_1.

**Figure 11.** MCC and Kappa: Dataset\_2.

Table 9 outlines the memory size and computing time during the training phase. PCNN consumes 121.45 MB and 118.45 MB for Dataset\_1 and Dataset\_2, respectively, and its computing time is 99.32 min and 99.21 min, respectively. PCNN thus outperforms the existing CNNs in computing time while using less memory. Figure 12 highlights the space and computation time of the CNN models for both Dataset\_1 and Dataset\_2.


**Table 9.** Memory sizes of CNN for Dataset\_1 and Dataset\_2.

**Figure 12.** Computation time of CNN models.

Table 10 outlines the error rate of the CNN architectures during the testing phase. The error rate of PCNN is 15.1 and 13.9 for Dataset\_1 and Dataset\_2, respectively, whereas the Jingsi model scores 20.5 and 19.6, higher than the other CNN models. This outcome emphasizes the efficiency of the feature extraction process of PCNN. Figure 13 illustrates the error rates of the CNN models.



Figure 14 presents the receiver operating characteristic (ROC) and precision–recall (PR) curves for dataset\_1 during the testing phase. It shows that PCNN achieves a better area under the ROC curve (AUC) for both the CAD and no-CAD classes.

**Figure 14.** Receiver operating characteristic (ROC) and precision–recall curve: dataset\_1.

Similarly, Figure 15 reflects the ROC and PR curve for dataset\_2. It outlines that PCNN achieves a better ROC AUC score of 0.93. Furthermore, the AUC score of the PR curve (0.91) indicates that PCNN predicts CAD better than the existing models.

**Figure 15.** Receiver operating characteristic (ROC) and precision–recall curve: dataset\_2.

Table 11 highlights the computational complexities of the CNN models for Dataset\_1. It is evident from the outcome that PCNN requires a smaller number of parameters (4.3 M), a lower learning rate (1 × 10<sup>−4</sup>), fewer FLOPs (563 M), and less computation time (1.92 s).


**Table 11.** Computational complexities of CNN for Dataset\_1.

Likewise, Table 12 reflects the outcome for Dataset\_2. It shows that PCNN generates its output with fewer parameters and FLOPs and a lower learning rate than the existing CNN models.


**Table 12.** Computational complexities of CNN for Dataset\_2.

#### *4.2. Clinical Insights and Limitations*

PCNN generates outcomes superior to those of the existing CNN models. It can be employed in real-time applications to support physicians in diagnosing CAD, and it can be integrated with Internet of Things devices to help healthcare centers identify CAD at an earlier stage. The feature extraction and pooling layers of PCNN can detect CAD from complex CT images. The dropout layer reduces the number of active neurons to avoid limitations such as overfitting and underfitting. PCNN applies a loss function to compute the kernels and weights of the model, which optimizes the model's performance and generates a meaningful outcome.

PCNN produces effective results and supports the CAD diagnosis process. However, a few limitations need to be addressed in future studies. The multiple layers of the CNN increase the training time and require a better graphical processing unit. An imbalanced dataset may reduce the performance of the proposed method. The researchers introduced the concept of temporary storage to hold the intermediate results.

Nonetheless, there is a possibility of losing information due to the large number of features. The lack of coordinate frames may lead to adversarial visualization of images, and the feature selection process can improve the images' internal representation. Finally, the structure of PCNN requires a considerable amount of data to produce effective results, and data pre-processing is necessary to handle image rotation and scaling and to maintain good performance.

#### **5. Conclusions**

This study developed a CNN model for predicting CAD from CT images. The existing CNN architectures require a high-end hardware configuration for processing complex images, so a feature extraction technique is employed to support the proposed CNN model. The proposed method modifies the existing DenseNet architecture in order to implement a feed-forward back-propagation network. Two benchmark datasets are used for the performance evaluation. The outcome of the experimental analysis highlights the superior performance of the proposed CNN model in terms of accuracy, precision, recall, F-measure, and specificity. Moreover, the proposed CNN's memory consumption and computation time during the training phase are lower than those of the existing CNNs. In addition, the ROC and PR curve analyses suggest that the proposed method can predict CAD with a lower false positive rate and higher prediction accuracy. Thus, the proposed method can support physicians in detecting CAD and preventing its progression. In the future, the proposed model can be extended to predict CAD from electronic health records.

**Author Contributions:** Conceptualization, A.F.A., A.R.W.S. and T.A.A.; Data curation, A.F.A. and A.R.W.S.; Formal analysis, A.R.W.S.; Investigation, A.R.W.S.; Methodology, A.F.A. and A.R.W.S.; Project administration, A.F.A. and A.R.W.S.; Resources, A.F.A. and A.R.W.S.; Software, A.R.W.S.; Validation, A.R.W.S.; Visualization, T.A.A.; Writing—original draft, A.F.A. and A.R.W.S.; Writing—review & editing, T.A.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. GRANT843].

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author.

**Conflicts of Interest:** The authors declare that they have no conflict of interest.

#### **References**


## *Article* **Deep Transfer Learning for the Multilabel Classification of Chest X-ray Images**

**Guan-Hua Huang 1,\*, Qi-Jia Fu 1, Ming-Zhang Gu 1, Nan-Han Lu 2,3,4, Kuo-Ying Liu <sup>5</sup> and Tai-Been Chen 1,4**


**Abstract:** Chest X-ray (CXR) is widely used to diagnose conditions affecting the chest, its contents, and its nearby structures. In this study, we used a private data set containing 1630 CXR images with disease labels; most of the images were disease-free, but the others contained multiple sites of abnormalities. Here, we used deep convolutional neural network (CNN) models to extract feature representations and to identify possible diseases in these images. We also used transfer learning combined with large open-source image data sets to resolve the problems of insufficient training data and optimize the classification model. The effects of different approaches of reusing pretrained weights (model finetuning and layer transfer), source data sets of different sizes and similarity levels to the target data (ImageNet, ChestX-ray, and CheXpert), methods integrating source data sets into transfer learning (initiating, concatenating, and co-training), and backbone CNN models (ResNet50 and DenseNet121) on transfer learning were also assessed. The results demonstrated that transfer learning applied with the model finetuning approach typically afforded better prediction models. When only one source data set was adopted, ChestX-ray performed better than CheXpert; however, after ImageNet initials were attached, CheXpert performed better. ResNet50 performed better in initiating transfer learning, whereas DenseNet121 performed better in concatenating and co-training transfer learning. Transfer learning with multiple source data sets was preferable to that with a source data set. Overall, transfer learning can further enhance prediction capabilities and reduce computing costs for CXR images.

**Keywords:** convolutional neural network; deep learning; source data set; supervised classification

#### **1. Introduction**

A chest X-ray (CXR), which is generated by exposing the chest to a small dose of ionizing radiation, is a projection radiograph of the chest used for imaging subtle lesions and the density of human tissues. It is commonly used for visualizing the condition of the thoracic cage, chest cavity, lung tissue, mediastinum, and heart. It can thus facilitate the diagnosis of common thorax diseases, including aortic sclerosis or calcification, arterial curvature, abnormal lung fields, anomalous lung patterns, spinal lesions, intercostal pleural thickening, and cardiac hypertrophy.

Computer vision technology and hardware computing capabilities have progressed considerably. Considering the overload of medical resources and the high demand for medical image analysis, the development of computer-aided diagnosis systems with a high diagnostic efficiency and accuracy is warranted. CXR images containing large amounts of physiological data can aid data-hungry deep learning paradigms in the construction of valuable intelligent auxiliary systems. Deep learning is a branch of machine learning;

**Citation:** Huang, G.-H.; Fu, Q.-J.; Gu, M.-Z.; Lu, N.-H.; Liu, K.-Y.; Chen, T.-B. Deep Transfer Learning for the Multilabel Classification of Chest X-ray Images. *Diagnostics* **2022**, *12*, 1457. https://doi.org/10.3390/ diagnostics12061457

Academic Editors: Sameer Antani and Sivaramakrishnan Rajaraman

Received: 25 May 2022 Accepted: 10 June 2022 Published: 13 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

through linear or non-linear transformations across multiple layers, deep learning can automatically extract sufficient and representative features from a data set. In traditional machine learning, features are usually extracted using handcrafted rules, which are created by relevant domain experts. After the data characteristics are understood, useful and effective features can be produced. However, the ability of deep learning to automatically extract features can reduce the time spent by experts on feature engineering. Therefore, deep learning may afford an excellent performance in applications where machines may have failed in the past.

Several studies have used deep learning for CXR analysis, particularly for image classification. Most of these studies have trained deep learning models by using well-designed convolutional neural network (CNN) architectures such as VGG [1], GoogleNet [2], ResNet [3], and DenseNet [4]. CNN architecture depth, data augmentation, input preprocessing methods, image size, and pretraining schemes can affect the performance of a deep learning model [5]. No standardized design methodology for improving deep learning model performance has been reported thus far. Most of the relevant studies have focused on comparing the performance of multiple design methodologies for a specific task, rather than reporting novel methodologies [6–8]. Some studies have achieved methodological novelty by utilizing methods that can aid in improving model performance. For example, Hu et al. [9] and Wu et al. [10] used an extreme learning machine (ELM) to replace the conventional fully connected layer in a deep CNN for real-time analysis and applied a Chimp optimization algorithm or sine-cosine algorithm to ameliorate the ELM's ill-conditioning and nonoptimal problems. Wang et al. [11] trained CNN models by using the whale optimization algorithm, which can resolve difficulties related to requiring a considerable amount of manual parameter tuning and parallelizing the training process of traditional gradient descent-based approaches. Khishe et al. [12] proposed an efficient biogeography-based optimization approach for automatically finetuning model hyperparameters (e.g., number of output channels, convolution kernel size, layer type, training algorithm's learning rate, epoch, and batch size), which are typically selected manually. CXR images are commonly classified as normal or abnormal in the literature [7]. Although CXR can be used to detect multiple diseases of the thorax, few methods have been proposed for classifying multiple disease labels [13].

Applying deep learning methods to CXR image analysis may have promising applications. However, the advancement of automatic image analysis is hindered by several underlying limitations. The main limitation is the lack of large-scale CXR datasets. Although the number of parameters required for deep learning models is considerably large, CXR image training data are limited; this can cause model overfitting. Compared with ordinary images, collecting and labeling CXR images can be difficult and cost intensive. CXR images generated using various instruments with different settings and in different environments cannot be naively analyzed together because this can lead to various errors. Transfer learning [14], which uses the knowledge learned from one task as a starting point for related tasks, can aid in making the best use of different CXR databases. Transfer learning mainly involves using a large amount of open-source data to make up for target data shortages so as to achieve improved performance in model fitting.

In this study, we used one private target data set—1630 chest radiographs provided by the E-Da Hospital, I-Shou University, Taiwan—and three open-source datasets—the ImageNet dataset [15], ChestX-ray dataset [16,17] (with >100,000 chest radiograph images provided by the National Institutes of Health (NIH)), and CheXpert dataset [18] (collected by the Stanford ML Group, comprising nearly 220,000 chest radiographs). The images from the private target data set were labeled by radiology specialists as either being disease-free or containing multiple sites of abnormalities—representing a typical multilabel classification problem. Although ImageNet is sufficiently large for deep learning, most of its 14 million images were dissimilar to those in our private data set; therefore, this data set may not have been able to provide accurate feature representations to classify images in our private data set. ChestX-ray and CheXpert, although modest in size, are more similar to our private data set, and thus, they may be able to support target data training more efficiently.

Transfer learning has been widely applied in CXR deep learning analysis. Most studies have first trained deep CNN models on the large ImageNet dataset for natural image classification, followed by the use of trained weights for initialization to retrain all layers or only retrain the final (fully connected) layer for target CXR image classification [8]. The performance of transfer learning might be affected by several factors such as the sizes of the source and target data set, similarity between these data sets, retraining of all or partial layers in the target task, and CNN architecture [19]. For natural image classification, Azizpour et al. [19] and Cui et al. [20] have reported best practices on how these factors should be set in generic and domain-specific tasks, respectively. However, only a few studies have focused on the factors that affect the transferability of medical image analysis approaches. Tajbakhsh et al. [21] considered four different medical imaging applications and demonstrated that CNNs pretrained on ImageNet performed better and were more robust to the size of the target training data than the CNNs trained from scratch. Gozes and Greenspan [22] pretrained their model on the ChestX-ray data set and demonstrated that the pretrained weights enabled the model to exhibit improved predictions on small-scale CXR data sets compared with the performance of a model pretrained on ImageNet.

To maximize the performance of transfer learning for CXR image classification, the effects of different transfer learning characteristics in medical image analysis must be systematically investigated. Because of the substantial differences between natural and medical images, we could not apply the knowledge learned for natural images in previous studies [19,20] to our current CXR analysis. Accordingly, our study focused on the multilabel classification of CXR images, which is a crucial topic that warrants research. We thoroughly investigated the aforementioned transferability factors and included the ImageNet, ChestX-ray, and CheXpert data sets as the source data. Previous studies have typically used one type of source data at a time for transfer learning. In this study, we developed new approaches for integrating different source data sets practically to eventually obtain novel powerful source data sets. Our results may aid in devising best practices for the efficient use of different types of data sets to alleviate the insufficiency of training data and enhance the performance of deep learning models in the medical field.

#### **2. Materials**

The target data set analyzed in this study contained the CXR images from the E-Da Hospital, I-Shou University, Taiwan. A deep learning model was trained to classify these images as being disease-free or containing multiple sites of abnormalities. We selected three source datasets with different sizes to pretrain our deep learning model and to improve its performance for the target data. Of all our source data sets included here, ImageNet contains the largest amount of data, followed by CheXpert and then ChestX-ray. Although ImageNet is the largest in size, its data had less similarity to the target data than the other two data sets had. Table 1 lists the basic characteristics of the data sets used.


**Table 1.** Characteristics of the included data sets.

<sup>1</sup> Thirteen common thoracic disease labels and a "no finding" label (indicating the absence of any disease). <sup>2</sup> Thirteen disease labels and a normal label. <sup>3</sup> Seven disease labels and a normal label. <sup>4</sup> One hundred and ten images labeled as "hernia" in the original dataset were discarded in this analysis due to the small sample size.

#### *2.1. Target Data Set*

The target data set contained CXR images that were collected from patients who received a CXR between January 2008 and December 2018; the images were stored in the DICOM (Digital Imaging and Communications in Medicine) format. These images were retrospectively extracted from the picture archiving and communication system (PACS) of E-Da Hospital. Patients' gender, age, and diagnostic reports from radiology specialists were also provided. Images were excluded if their quality and the corresponding interpretation of diagnostic reports were unclear. Images of minors (aged < 18 years) were also excluded from this study. This clinical study was approved by the Institutional Review Board of E-Da Hospital. All patients signed written informed consent before participating.

The image size ranged between 1824 and 2688 pixels in length and 1536 and 2680 pixels in width. The image resolution was 0.16 mm per pixel. In experiments with images resized to 1024 × 1024 pixels, the gains were nonsignificant and the processing was cost intensive. Therefore, we resized all images to 512 × 512 pixels for analysis.

We first removed duplicate and outlier images. We also discarded five images uniquely labeled as "heart pacemaker placement" due to their inconsistent disease property. The analyzed data set comprised 1630 images with 1 normal label and 17 disease labels, which had been integrated into 8 categories (including normal) with guidance from physicians. Of these images, 1485 had a single label and 145 had multiple labels. Table 2 lists the numbers of images that contained certain labels in the data set.


**Table 2.** Numbers and labels of images in the target data set.

#### *2.2. Source Data Sets*

#### 2.2.1. ImageNet

ImageNet is a large visual database designed for visual object recognition. It contains more than 14 million images that have been hand-labeled to indicate their object category, of which there are >20,000. The data set size is approximately 1 TB. The Python package Keras provides the pretrained weights for various networks, which eliminates the need to train them from scratch.

#### 2.2.2. CheXpert

CheXpert is a large CXR image data set collected by the Stanford ML Group; it comprises 224,316 chest radiographs labeled to indicate the presence of 13 common thoracic diseases or labeled "no finding" to indicate the absence of all diseases [18]. Natural language processing (NLP) is used to extract observations from radiology reports, and this extracted information serves as the basis of labeling. The training labels in the data set for each category are 0 (negative), 1 (positive), or *u* (uncertain). Different approaches for handling uncertainty labels during model training may lead to differences in network performance. In this study, we followed the results from the original paper for dealing with the uncertainty label in relation to five diseases: atelectasis, edema, pleural effusion, cardiomegaly, and consolidation. That is, we reconstructed a five-dimensional label vector for the five aforementioned diseases—where *u* was replaced with 1 for the first three diseases and with 0 for the final two—and applied it as the label for five-class multilabel classification. In other words, we transformed the pretraining task into a task for classifying whether the images had these five diseases.
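A small sketch of this uncertainty-label policy is given below. The toy data frame stands in for the CheXpert training CSV (where −1 encodes *u* and blanks encode unmentioned findings); the column names follow the public CheXpert release but should be treated as assumptions here.

```python
import pandas as pd

# Toy stand-in for the CheXpert training CSV: 1 = positive, 0 = negative,
# -1 = uncertain (u), NaN = unmentioned.
df = pd.DataFrame({
    "Atelectasis":      [1, -1, 0, None],
    "Edema":            [0, -1, None, 1],
    "Pleural Effusion": [-1, 0, 1, None],
    "Cardiomegaly":     [None, -1, 1, 0],
    "Consolidation":    [0, 1, -1, None],
})

U_ONES = ["Atelectasis", "Edema", "Pleural Effusion"]   # u -> 1
U_ZEROS = ["Cardiomegaly", "Consolidation"]             # u -> 0

labels = df.fillna(0.0)
labels[U_ONES] = labels[U_ONES].replace(-1.0, 1.0)
labels[U_ZEROS] = labels[U_ZEROS].replace(-1.0, 0.0)

y_train = labels.to_numpy(dtype="float32")   # five-dimensional label vectors
print(y_train)
```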

In this study, we resized the images to 512 × 512 pixels to save memory. Moreover, some of the images from the data set were gray-scale images (where they only had one channel), whereas the other images had four channels. To feed them into the available pretrained models through a three-channel input, we replicated the one-channel images three times but removed the fourth channel of the four-channel images.

#### 2.2.3. ChestX-ray

ChestX-ray is an open-source data set compiled by the National Institutes of Health; it comprises 112,120 chest radiographs with 1 normal label and 14 disease labels [16,17]. Similar to CheXpert, ChestX-ray uses NLP for labeling, but it does not have an uncertainty label. We stacked and removed the channels of the images to fit them into the available pretrained networks as we did for CheXpert. We also resized the original images to 512 × 512 pixels for fast processing and excluded the disease label "hernia" (and thus deleted 110 images) from the analysis because its sample size was relatively small.

#### **3. Methods**

This analysis aimed to predict common thorax diseases in patients through their CXR images. The images were labeled as being disease-free or containing multiple sites of abnormalities, representing a typical multilabel classification problem. Deep learning architectures were used to automatically extract features and build a classifier, and transfer learning was conducted to combine information from different datasets to improve model performance. We trained deep learning models using the well-designed convolutional neural network (CNN) architectures ResNet50 [3] and DenseNet121 [4]. Implementing transfer learning in the CNN involved reusing the first several layers of the network for classifying images in the open-source data sets ImageNet, CheXpert, and ChestX-ray (the source task). Pretrained weights from these layers served as the starting points for classifying the CXR images from E-Da Hospital (the target task). By combining three source data sets in various manners, we obtained several different sets of pretrained weights. To transfer the pretrained weights to the target task, we either used the pretrained weights as the initial values and retrained the model from scratch or fixed the weights of some early layers at the corresponding pretrained weights and reconstructed the others.

Image augmentation techniques were applied to artificially create variations in the existing images; this expanded the training data set to represent a comprehensive set of possible images. Our target dataset was imbalanced: the number of images containing each disease label was unequal. Deep learning algorithms can be biased toward the majority labels and fail to detect the minority labels. To address this imbalance, sample weighting in the loss function was used. Model performance was evaluated using various metrics and stratified five-fold cross-validation. Our analytic approach is illustrated in Figure 1. The programming language Python [23] was used to implement these methods.

**Figure 1.** Flow chart of our deep transfer learning approach for the multilabel classification of the chest X-ray images.

#### *3.1. Multilabel Classification*

Image classification aims at building a model that maps the input of the $i$th image $x_i$ to a label vector $y_i = (y_{i1}, \cdots, y_{iK})$, where $y_{ik} = 0$ or $1$, $k = 1, \cdots, K$, is the indicator for the $k$th disease class. Multiclass classification involves classifying an image into one of multiple classes; in other words, the label vector of multiclass classification has only a single element equal to 1. However, in multilabel classification, more than one class may be assigned to the image, with the label vector possibly having 0 or 1 in each element.

CXR images in our target data set were labeled for seven diseases, and multiple diseases were often identified in one image. The target task was thus multilabel classification, defined over a seven-dimensional label vector with an all-zero vector (0, 0, 0, 0, 0, 0, 0) representing a normal status (where none of the seven diseases was detected). We treated multilabel classification as a multiple binary classification problem; thus, the loss function was the sum of multiple binary cross-entropies. However, our data were seriously imbalanced: almost all elements of the label vectors were equal to 0. The number of 0 s was much larger than that of 1 s, which could have misled our model toward predicting 0 s if the usual loss function were used. Therefore, we adjusted the common binary cross-entropy with weights that considered the proportions of 0 s and 1 s in the same sampling batch.

Suppose that the entire training set is divided into $J$ batches, with each batch size being $M$. Let $x_{jm}$ be the input for the $m$th image in the $j$th batch and $y_{jm} = \left(y_{jm1}, \cdots, y_{jmK}\right)$ be its label vector. Then, the proposed weighted binary cross-entropy (WBCE) loss is defined as:

$$L\_{\rm WBCE} = \sum\_{j=1}^{J} \sum\_{m=1}^{M} \left\{ \beta\_{Pj} \sum\_{k:\, y\_{jmk} = 1} \left[ -\ln \left( \sigma \left( f\_k \left( \mathbf{x}\_{jm} \right) \right) \right) \right] + \beta\_{Nj} \sum\_{k:\, y\_{jmk} = 0} \left[ -\ln \left( 1 - \sigma \left( f\_k \left( \mathbf{x}\_{jm} \right) \right) \right) \right] \right\}, \tag{1}$$

where $f_k\left(\mathbf{x}_{jm}\right)$ is $\mathbf{x}_{jm}$'s $k$th input to the final fully connected layer, $\sigma(z) = e^z/(1 + e^z)$ is the sigmoid function, $\beta_{Pj}$ is set to $\left(|P_j| + |N_j|\right)/|P_j|$, and $\beta_{Nj}$ is set to $\left(|P_j| + |N_j|\right)/|N_j|$, with $|P_j|$ and $|N_j|$ being the numbers of 1 s and 0 s in the label vectors of the $j$th batch, respectively.
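One way this loss could be written as a Keras-compatible function is sketched below: the batch weights are recomputed from the proportions of 1 s and 0 s in each batch, as in Equation (1). The clipping epsilon and the assumption that the model outputs sigmoid probabilities are implementation choices, not details from the paper.

```python
import tensorflow as tf

def weighted_bce(y_true, y_pred, eps=1e-7):
    """Weighted binary cross-entropy per Equation (1), computed over one batch."""
    y_true = tf.cast(y_true, tf.float32)
    n_pos = tf.reduce_sum(y_true)            # |P_j|: number of 1s in the batch
    n_neg = tf.reduce_sum(1.0 - y_true)      # |N_j|: number of 0s in the batch
    total = n_pos + n_neg
    beta_p = total / (n_pos + eps)
    beta_n = total / (n_neg + eps)
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)   # sigmoid outputs assumed
    loss = -(beta_p * y_true * tf.math.log(y_pred)
             + beta_n * (1.0 - y_true) * tf.math.log(1.0 - y_pred))
    return tf.reduce_sum(loss)

# Usage sketch: model.compile(optimizer="adam", loss=weighted_bce)
```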

#### *3.2. Image Augmentation*

Training deep CNNs requires a considerable amount of data, but the sample size of medical images is typically not sufficiently large. Image augmentation is a powerful technique for generating images using a combination of affine transformations—such as shift, zoom in-zoom out, rotate, flip, distort, and shade with a hue—which can feed more images into the neural networks and exploit information in the original data more fully. We augmented our training images by using the ImageDataGenerator function in the Keras Python library. "Online" augmentation was applied; here, we applied the image augmentation techniques in mini-batches and then fed them to the model. The model with online augmentation was presented with different images at each epoch; this aided the model in generalizing, and the model did not need to save the augmented images on the disk, which reduced the computing burden.
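A minimal sketch of this online augmentation with the Keras ImageDataGenerator is shown below; the specific transformation ranges, array shapes, and training call are assumptions for illustration only.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Affine transformations are applied on the fly in mini-batches, so augmented
# images are never written to disk (the "online" scheme described above).
datagen = ImageDataGenerator(
    rotation_range=10,            # rotate
    width_shift_range=0.1,        # shift
    height_shift_range=0.1,
    zoom_range=0.1,               # zoom in / zoom out
    horizontal_flip=True,         # flip
    brightness_range=(0.9, 1.1),  # shade
)

# x_train: (N, 512, 512, 3) images, y_train: (N, 7) multilabel vectors (assumed shapes)
# model.fit(datagen.flow(x_train, y_train, batch_size=16), epochs=30)
```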

#### *3.3. Deep CNNs*

Image classification using deep CNNs has an outstanding performance compared with traditional machine learning approaches. CNN is a deep learning method in which a series of layers is constructed to progressively extract higher-level features from the raw input. In a CNN, the typical layers comprise the input layer (which imports image data for model training), the convolution layer (which uses various filters to automatically learn representation of features at different levels), the pooling layer (which selects the most prominent features to reduce the dimension of subsequent layers), and the fully connected layer (which flattens the matrix feature maps into a single vector for label prediction).

The CNN architecture refers to the overall structure of the network: the types of layers it should have, the number of units each layer type should contain, and the manner in which these units should be connected to each other. CNN architectures such as VGG [1], GoogleNet [2], ResNet [3], and DenseNet [4] have been widely used. We included two high-performing CNN models: 50-layer ResNet (ResNet50) and 121-layer DenseNet (DenseNet121). ResNet50 has fewer filters and a lower complexity than VGG nets, and DenseNet121 requires fewer parameters than ResNet50 does.

#### 3.3.1. ResNet

In theory, the deeper the network model, the better its results. Nevertheless, the degradation problem may arise: as the network becomes deeper, the model's accuracy may become saturated or even decrease. This problem differs from overfitting because the training error itself increases. ResNet made a historical breakthrough in deep neural networks by solving the degradation problem through residual learning.

In this study, we established a deeper model by stacking new layers on a shallower architecture. Let $x$ denote the output of the shallow part of the model and $H(x)$ denote the output of the deeper model. No higher training error should be obtained if the added layers perform identity mapping. Rather than expecting the stack of layers to learn identity mapping, ResNet argues that it is easier to let these layers fit a residual mapping $F(x) = H(x) - x$ toward zero and recast the output of the deeper model as $\hat{F}(x) + x$, where $\hat{F}(x)$ is the fitted residual. Generally, $\hat{F}(x)$ will not be zero; therefore, the stacked layers can still learn new features and demonstrate improved performance.

The formulation $\hat{F}(x) + x$ can be realized using feedforward neural networks with shortcut connections, which skip some of the layers in the neural network and feed the output of one layer as the input to later layers. A series of shortcut connections (residual blocks) forms ResNet. This study applied the deeper ResNet50, which contains 50 layers and uses a stack of three layers with 1 × 1, 3 × 3, and 1 × 1 convolutions as the building residual block. This three-layer residual block adopts a bottleneck design to improve computational efficiency, where the 1 × 1 layers are responsible for reducing and then increasing (restoring) the dimensions, leaving the 3 × 3 layer as a bottleneck with smaller input and output dimensions [3]. In ResNet50, batch normalization (BN) [24] is adopted immediately after each convolution and before ReLU activation, and global average pooling (GAP) [25] is performed to form the final fully connected layer.
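The bottleneck residual block described above can be sketched with the Keras functional API roughly as follows; the filter counts and projection rule are illustrative rather than the exact ResNet50 configuration.

```python
# Sketch of a ResNet bottleneck residual block (1x1 -> 3x3 -> 1x1) with batch
# normalization after each convolution; the filter counts are illustrative.
from tensorflow.keras import layers

def bottleneck_block(x, filters, stride=1):
    shortcut = x
    # 1x1 convolution reduces the channel dimension.
    y = layers.Conv2D(filters, 1, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    # 3x3 convolution operates on the reduced (bottlenecked) representation.
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    # 1x1 convolution restores (expands) the channel dimension.
    y = layers.Conv2D(4 * filters, 1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut when shapes differ so the element-wise addition is valid.
    if stride != 1 or shortcut.shape[-1] != 4 * filters:
        shortcut = layers.Conv2D(4 * filters, 1, strides=stride)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))
```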

#### 3.3.2. DenseNet

As CNNs become increasingly deep, information about the input passes through many layers and can vanish by the time it reaches the end of the network. Different approaches, which vary in network topology and training procedure, create short paths from early layers to later layers to address this problem. DenseNet proposes an architecture that distills this insight into a simple connectivity pattern of dense blocks and transition layers. A dense block is a module containing many layers connected densely with feature maps of the same size. In a dense block, each layer obtains additional inputs from all preceding layers, and it passes on its own feature maps to all the subsequent layers. The transition layer connects two adjacent dense blocks, and it reduces the size of the feature map through pooling. Compared with ResNet that connects layers through element-level addition, layers in DenseNet are connected by concatenating them at the channel level.

In a dense block, the convolution in each layer produces *k* feature maps, which correspond to *k* channels in the output. The value *k* is a hyperparameter known as the growth rate of DenseNet, which is usually set to a small value. Assuming that the initial number of channels is $k_0$, the number of input channels of the $\ell$th layer is $k_0 + k(\ell - 1)$. As the number of layers increases, the input can become extremely large even if *k* is small. Because the input to later layers grows quickly, the bottleneck design is introduced into the dense block to reduce the computational burden: a 1 × 1 convolution and then a 3 × 3 convolution are applied in each layer to generate the output. Similar to ResNet, DenseNet uses a composite of three consecutive operations for each convolution: BN + ReLU + convolution.
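A dense block with this bottleneck design can be sketched roughly as follows; the 4 × *k* bottleneck width and the number of layers are illustrative assumptions.

```python
# Sketch of a DenseNet dense block with the bottleneck design
# (BN -> ReLU -> 1x1 conv, then BN -> ReLU -> 3x3 conv); the 4*k bottleneck
# width and the number of layers are illustrative choices.
from tensorflow.keras import layers

def dense_block(x, num_layers, growth_rate=12):
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.ReLU()(y)
        y = layers.Conv2D(4 * growth_rate, 1, padding="same")(y)  # bottleneck
        y = layers.BatchNormalization()(y)
        y = layers.ReLU()(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)      # k new feature maps
        x = layers.Concatenate()([x, y])  # channel-level concatenation
    return x
```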

#### 3.3.3. Implementation

We used the Python package Keras to implement ResNet50 and DenseNet121. The Adam optimizer [26] was used with a mini-batch size of 16 for 30 epochs. The learning rate started from 0.0001 and was divided by 10 when the validation loss did not decrease for 10 epochs. For DenseNet121, the growth rate was set to *k* = 12.
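A minimal sketch of this training configuration is given below; `model` stands for a previously built ResNet50 or DenseNet121 classifier, and plain binary cross-entropy is shown only as a placeholder for the weighted loss defined earlier.

```python
# Sketch of the training configuration described above; "model" stands for a
# previously built ResNet50 or DenseNet121 classifier, and binary
# cross-entropy is a placeholder for the weighted loss used in the study.
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Divide the learning rate by 10 when the validation loss has not decreased
# for 10 epochs.
lr_schedule = ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=10)
# model.fit(train_generator, epochs=30, callbacks=[lr_schedule],
#           validation_data=val_generator)
```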

#### *3.4. Transfer Learning*

In practice, we usually do not have sufficient data to train a deep and complex network such as ResNet or DenseNet. Techniques such as image augmentation alone are insufficient for resolving this problem. The lack of data can cause overfitting, which makes the trained model rely excessively on a particular data set and generalize poorly to additional data. This problem can be addressed using transfer learning [14]. The main idea of transfer learning is to first train the network with sufficiently large source data and then transfer the structure and weights from this trained network to predict the target data.

In the current study, we used the data from the ImageNet [15], ChestX-ray [16,17], and CheXpert [18] data sets as the source data and the CXR images from the E-Da Hospital as the target data. ImageNet contains more than 14 million natural images and thus is sufficiently large for a deep learning application; however, it does not contain images similar to medical images and therefore may fail to provide useful feature representations for classifying our target data. ChestX-ray and CheXpert—consisting of about 100,000 and 220,000 CXR images, respectively—are modest in size but are more similar to our target data set.

Implementing transfer learning in the CNN involves reusing the parameter estimators pretrained on the source data when fitting the target data. Model finetuning is a method in which these pretrained parameter estimators are used as the initial values and finetuned to fit the target data. In the layer transfer approach, target data fitting preserves some of the layers from a pretrained model and reconstructs the others. We thus adopted model finetuning or layer transfer to reuse the parameter estimators pretrained on the source data (from ImageNet, ChestX-ray, or CheXpert).
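The two ways of reusing pretrained weights can be sketched in Keras roughly as follows; freezing the first 22 layers is an illustrative choice, not a prescription.

```python
# Sketch of the two ways of reusing pretrained weights; freezing the first
# 22 layers is an illustrative choice, not a prescription.
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights="imagenet", include_top=False, pooling="avg")

# Layer transfer: freeze the lower layers and retrain only the remaining ones.
for layer in base.layers[:22]:
    layer.trainable = False

# Model finetuning: keep all layers trainable and use the pretrained weights
# only as initial values.
# for layer in base.layers:
#     layer.trainable = True
```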

We next tried to combine these source data sets to obtain new powerful source data sets for transfer learning.

#### 3.4.1. Initiating Transfer Learning

In Keras, we can implement transfer learning with initial weights selected randomly or from ImageNet pretraining. ImageNet pretraining may result in robust parameter estimation due to the diversity and richness of the data set, and the pretraining of the similar data sets ChestX-ray and CheXpert may accelerate the convergence of the model. Therefore, we adopted five different pretrained weights in modeling our target data: ImageNet pretraining with randomly selected initials, ChestX-ray pretraining with random or ImageNet pretraining initials, and CheXpert pretraining with random or ImageNet pretraining initials.

#### 3.4.2. Concatenating Transfer Learning

In this approach, we first constructed two ResNet backbone models by using different pretrained weights: one from ImageNet pretraining with random initials (IR) and the other from ChestX-ray pretraining with random initials (CR). Thereafter, we concatenated the outputs of the final convolution layer from two models and then used the GAP to reduce the dimension, and finally applied the fully connected layer to generate the final prediction (Figure 2). The ResNet model concatenating IR and CheXpert pretraining with random initials (XR) was obtained in a similar manner. The DenseNet backbone models concatenating IR and CR and concatenating IR and XR could also be obtained. Models using different pretrained weights may extract different features. In contrast to the first approach aimed at extracting features from a single domain, this approach could expand features from two different domains. However, compared with the first approach, this approach used twice the memory and time to store and upgrade parameters.
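A minimal sketch of the concatenating approach is shown below; the input size is illustrative, "chestxray_pretrained.h5" is a hypothetical weight file, and 17 is the number of disease labels in the target data set.

```python
# Sketch of concatenating transfer learning: two ResNet50 backbones with
# different pretrained weights, their final convolutional feature maps
# concatenated, followed by GAP and a fully connected prediction layer.
# "chestxray_pretrained.h5" is a hypothetical weight file.
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import ResNet50

inputs = layers.Input(shape=(224, 224, 3))

backbone_ir = ResNet50(weights="imagenet", include_top=False)  # IR backbone
backbone_cr = ResNet50(weights=None, include_top=False)        # CR backbone
backbone_cr._name = "resnet50_cr"  # avoid a duplicate-name clash with backbone_ir
# backbone_cr.load_weights("chestxray_pretrained.h5")

features = layers.Concatenate()([backbone_ir(inputs), backbone_cr(inputs)])
pooled = layers.GlobalAveragePooling2D()(features)
outputs = layers.Dense(17, activation="sigmoid")(pooled)  # 17 disease labels
model = Model(inputs, outputs)
```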

#### 3.4.3. Co-Training Transfer Learning

In this approach, we aimed to combine two source data sets, namely ChestX-ray and CheXpert, into a larger CXR data set to serve as source data for medical image classification tasks. ChestX-ray and CheXpert cannot be combined directly: although both data sets contain only CXR images, they use different class definitions. To resolve this issue, we applied a co-training approach. First, we fed images from the two source data sets to jointly train the convolutional layers but connected them to different fully connected layers to predict their corresponding classes. In other words, the two data sets shared the parameters of the convolutional layers but not those after the convolutional layers. Finally, we kept these shared convolutional weights as the pretrained weights for target data model fitting. The reason that we did not co-train ImageNet with ChestX-ray (or CheXpert) was that merging data from different domains may mislead the model and reduce its efficiency. The co-training approach is illustrated in Figure 3.
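A minimal sketch of the shared-backbone, two-head design could look like this; the 14-label heads are illustrative assumptions for the two source data sets.

```python
# Sketch of co-training transfer learning: one shared convolutional backbone
# with two separate fully connected heads, one per source data set. The
# 14-label heads are illustrative; in practice, each mini-batch updates only
# the head that matches its source data set.
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import DenseNet121

inputs = layers.Input(shape=(224, 224, 3))
shared_features = DenseNet121(weights=None, include_top=False, pooling="avg")(inputs)

head_chestxray = layers.Dense(14, activation="sigmoid", name="chestxray")(shared_features)
head_chexpert = layers.Dense(14, activation="sigmoid", name="chexpert")(shared_features)

cotrain_model = Model(inputs, [head_chestxray, head_chexpert])
# After joint training, only the shared convolutional weights are kept as the
# pretrained weights for fitting the target data.
```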

**Figure 2.** Flow chart of concatenating transfer learning.

*3.5. Evaluation*

3.5.1. Stratified *K*-Fold Cross-Validation

In *K*-fold cross-validation, first, the data are shuffled to ensure randomness when the data are split. Second, the whole data set is split into *K* groups, where *K* is usually set to 5 or 10 based on the sample size. Third, one of the groups is considered "test data," and the others are considered "training data," yielding a total of *K* different test-training combinations (*K* cross-validation rounds). Finally, the "training data" are further divided into "training" and "validation" sets, which are used to train the model parameters and to evaluate the performance of various hyperparameter settings during training, respectively.

In stratified *K*-fold cross-validation, every group must have the same class distribution. This method has great applicability for multiclass classification, where each sample belongs to only one class. In a multilabel task, every sample may have multiple class categories. To appropriately perform stratified *K*-fold cross-validation for a multilabel classification task, performing iterative stratification is essential [27]; it can be implemented using Python's scikit-learn-compatible package MultilabelStratifiedKFold.
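A minimal sketch of such a split is shown below; it assumes the scikit-learn-compatible iterative-stratification package, which provides the MultilabelStratifiedKFold class, and uses toy data only.

```python
# Sketch of iterative stratification for a multilabel split; assumes the
# scikit-learn-compatible "iterative-stratification" package, which provides
# the MultilabelStratifiedKFold class.
import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

X = np.random.rand(100, 8)               # toy feature matrix
Y = np.random.randint(0, 2, (100, 17))   # toy multilabel targets (17 labels)

mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in mskf.split(X, Y):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    # ... fit the model on the training fold and evaluate on the test fold
```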

**Figure 3.** Flow chart of co-training transfer learning.

#### 3.5.2. Metrics

For an imbalanced data set, good accuracy does not indicate that a classifier is reliable. A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier (with vs. without the label) at different thresholds, where the x-axis represents the false positive rate and the y-axis represents the true positive rate. The area under the ROC curve (AUC) is used to summarize the ROC curve. A precision-recall (PR) curve is another type of plot that evaluates the performance of a model, with the x-axis being the recall and the y-axis being the precision. The area under the PR curve is the average precision (AP). Moreover, training accuracy represents the average accuracy on the training sets during the five cross-validation rounds. Test accuracy, AUC, and AP refer to the accuracy, AUC, and AP on the five test groups, respectively. In multilabel classification, we can treat the problem as a combination of multiple binary classifications; thus, we can evaluate every label individually with binary metrics and obtain the average of these binary metrics (i.e., the mean metric). In this study, we calculated the training accuracy, test accuracy, test AUC, and test AP for each individual label and the mean training accuracy, test accuracy, test AUC, and test AP over all labels.
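A minimal sketch of these per-label and mean metrics, using scikit-learn on arrays of true labels and predicted probabilities, could look like this:

```python
# Sketch of the per-label and mean metrics described above; y_true and
# y_score are (n_samples, n_labels) arrays of binary labels and predicted
# probabilities.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def multilabel_metrics(y_true, y_score):
    aucs = [roc_auc_score(y_true[:, j], y_score[:, j])
            for j in range(y_true.shape[1])]
    aps = [average_precision_score(y_true[:, j], y_score[:, j])
           for j in range(y_true.shape[1])]
    return {"per_label_auc": aucs, "per_label_ap": aps,
            "mean_auc": float(np.mean(aucs)), "mean_ap": float(np.mean(aps))}
```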

#### **4. Results**

This section presents the results of all the experiments performed in the current study. Table 3 presents a summary of our experimental configurations and parameter settings.

#### *4.1. Layer Transfer versus Model Finetuning*

To use the layer transfer approach, we fixed some lower layers of the network at the weights obtained from the source data and retrained the remaining layers on the target data. We first froze 10, 22, 40, and 49 layers in ResNet50, which corresponded to the end of the first, second, third, and fourth residual blocks. We used weights pretrained on ImageNet with random initials, ChestX-ray with random initials, and ChestX-ray with ImageNet pretraining initials.

The mean test accuracies and AUCs from the layer transfer or model finetuning on different pretrained weights are listed in Table 4. When adopting the layer transfer approach, it is not recommended to modify only the final layer, which is equivalent to freezing the first 49 layers in ResNet50. Here, for pretrained weights from ChestX-ray with random or ImageNet pretraining initials, freezing more layers typically led to a significantly higher accuracy but a somewhat worse AUC. For the ImageNet pretrained weights, freezing more layers resulted in a lower accuracy and AUC, although the difference was negligible. These results indicate that the ideal number of frozen layers depends on the similarity between the source and target data. When the target data are distinct from the source data, more layers may need to be unfrozen.

**Table 3.** Summary of experimental configurations and parameter settings.


**Table 4.** Mean test accuracies and AUCs of different approaches of reusing pretrained weights in ResNet50 1.


<sup>1</sup> Bold numbers indicate the top two approaches in each metric. <sup>2</sup> The model without transfer learning. <sup>3</sup> RB\_1 = Pretrained weights on the first 10 layers, which correspond to the end of the first residual block; RB\_2 = Pretrained weights on the first 22 layers, which correspond to the end of the second residual block; RB\_3 = Pretrained weights on the first 40 layers, which correspond to the end of the third residual block; RB\_4 = Pretrained weights on the first 49 layers, which correspond to the end of the fourth residual block. <sup>4</sup> AC = Accuracy.

Compared with the layer transfer approach, model finetuning afforded a higher accuracy when the ImageNet pretrained weights were used directly and provided considerably larger AUCs for all types of pretrained weights. The model finetuning approach appeared to be an attractive option for maintaining prediction accuracy on every disease label without sacrificing the minority labels.

#### *4.2. Effects of Transfer Learning*

After transfer learning was applied with model finetuning and pretrained weights from ImageNet, ResNet50's mean test accuracy increased by 11% and its mean test AUC increased by 4% compared with when transfer learning was not applied (Table 4). When this transfer learning approach was used to train DenseNet121, the mean test accuracy and mean test AUC were 0.799 and 0.803, respectively; they dropped to 0.657 and 0.736, respectively, when DenseNet121 was trained without transfer learning. With transfer learning, DenseNet121's accuracy increased by 18% and its AUC increased by 8%. Model finetuning-based transfer learning could improve the model fit and appeared to have a greater influence on models with a more complex structure (i.e., those with a greater network depth). These performance improvements remained when model finetuning with pretrained weights from ChestX-ray (with random or ImageNet pretraining initials) was applied.

The layer transfer approach did not always improve model performance (Table 4). Even with the best selected number of frozen layers, models with transfer learning had worse AUCs than those without transfer learning.

#### *4.3. Comparison of Various Transfer Learning Approaches*

The results from Sections 4.1 and 4.2 demonstrated that compared with layer transfer, model finetuning—re-training the whole model with pretrained weights as initial values—provided a better prediction model. We thus adopted model finetuning for subsequent analyses.

This section presents the results from initiating, concatenating, and co-training transfer learning methods, which use different approaches to combine several source data sets to provide an improved model performance. This performance, evaluated using mean accuracy, AUC, and AP on training and test data, is presented in Table 5. In the subsequent subsections, we compare this performance from the perspective of backbone models, source data sets, and combined methods.


**Table 5.** Training and test performance of various transfer learning approaches 1.

<sup>1</sup> Bold numbers indicate the top three approaches in each metric. <sup>2</sup> IR = ImageNet pretraining with random initials, CR = ChestX-ray pretraining with random initials, CI = ChestX-ray pretraining with ImageNet pretraining initials, XR = CheXpert pretraining with random initials, XI = CheXpert pretraining with ImageNet pretraining initials, I + C = Concatenating ImageNet + ChestX-ray, I + X = Concatenating ImageNet + CheXpert, C ∪ X = Co-training ChestX-ray + CheXpert.

#### 4.3.1. Backbone Model Comparison

In this study, we used ResNet50 and DenseNet121 as the backbone model for transfer learning. Compared with DenseNet121, ResNet50 required less time to train a model, but it had more parameters to save. To summarize the two backbone models' performance levels, the radar plots for mean test AUCs and APs from various transfer learning approaches were created (Figures S1 and S2). In initiating transfer learning, ResNet50 outperformed DenseNet121. By contrast, DenseNet121 performed better in concatenating and co-training transfer learning. Therefore, a model with a relatively complex structure may provide more accurate results if a larger, more diverse source data set created by combining different data sets is applied.

#### 4.3.2. Source Data Comparison

In this study, we used data from three sources (ImageNet, ChestX-ray, and CheXpert) to perform transfer learning. Each has its own strengths and limitations. Although it is nearly 100 times the size of the other two data sets, ImageNet had the weakest association with our target data. Although CheXpert is twice the size of ChestX-ray, it contains uncertainty labels, which might increase the difficulty of the training process. ChestX-ray is the smallest data set among those included in this study; nevertheless, it demonstrated the strongest connection to our target data and the most precise labeling. To compare the performance of the different source data, we collected the results from ResNet50 with initial weights from IR, CR, and XR. Their PR and ROC curves for each disease label on test data are presented in Figures S3–S5. DenseNet121's corresponding PR and ROC curves are presented in Figures S6–S8.

For both the backbone models, the use of ChestX-ray as source data led to a better performance than using CheXpert (Table 5). Even though ChestX-ray is only half the size of CheXpert, its accurate labeling made up for the lack of data. Although ImageNet had the best performance in the training process, its performance in the test process was worse than that of ChestX-ray when it was fit in the ResNet50 model (Table 5). Therefore, the use of ImageNet as the source data possibly led to overfitting. In general, the ChestX-ray data had the best performance when only one single-source data set was adopted as in previous studies.

#### 4.3.3. Comparison of Combined Methods

We used three methods to combine source data sets. The initiating transfer learning approach adopted ImageNet initials when training pretrained weights from ChestX-ray or CheXpert data to ensure a robust estimation of these pretrained weights. Concatenating transfer learning aimed at collecting the features obtained using distinct source data sets (e.g., ImageNet and ChestX-ray) to expand the features' coverage. Finally, cotraining transfer learning combined two similar source data sets (e.g., ChestX-ray and CheXpert) to train pretrained weights, with the assumption that a large data size can improve model performance.

The combining methods' PR and ROC curves for individual disease labels on test data are shown in Figures S9–S18. In summary, first, for both backbone models, using ChestX-ray with random initials as the source data led to a better performance than using ChestX-ray with ImageNet initials, whereas using CheXpert with ImageNet initials as source data led to a better performance than using CheXpert with random initials. ImageNet initials ensured a robust estimation of pretrained weights for CheXpert but not for ChestX-ray. Second, concatenating transfer learning provided the highest training accuracy, but it did not achieve the best performance in the test process, possibly because the model was overloaded with too many parameters (i.e., twice the number of parameters), causing overfitting. Third, initiating transfer learning was the most suitable for ResNet50, whereas concatenating and co-training transfer learning were the most suitable for DenseNet121. Notably, under-fitting may have arisen for co-training transfer learning in DenseNet121; more training epochs can overcome this issue. Fourth, DenseNet121 with transfer learning from various source data sets performed better than with a single-source data set. ResNet50 provided similar results; however, the effect was not as pronounced as in DenseNet121.

#### **5. Discussion**

When reusing pretrained weights in transfer learning, the approach that re-trains the whole model with pretrained weights as initial values (i.e., the model finetuning approach) typically afforded excellent results but required many more computational resources compared with the other approaches. The layer transfer approach, which freezes some layers on pretrained weights, demonstrated advantages over the model finetuning approach by allowing larger batch sizes and requiring a shorter run time and less GPU/CPU memory. For building the most cost-effective model by using the layer transfer approach, the number of frozen layers should be determined accurately. Our results demonstrated that the higher the similarity between the source and target data, the larger the allowed number of frozen layers should be. However, none of the included approaches were universally applicable; selecting the most appropriate approach would require the assessment of its benefits and costs on the basis of specific goals and available resources.

When adopting only one single-source dataset, ChestX-ray demonstrated a better performance than CheXpert. ChestX-ray is only half the size of CheXpert; nonetheless, its accurate labeling was found to make up for its smaller size. However, after ImageNet initials were attached, CheXpert outperformed ChestX-ray. As such, ImageNet appeared to enhance the data volume and variety of these data sets.

ResNet50 was suitable for initiating transfer learning, whereas DenseNet121 performed better in concatenating and co-training transfer learning. Transfer learning combined with various source data sets was also preferable to the use of a single-source data set; however, DenseNet121 derived greater benefits from it than ResNet50 did. Compound weights from several source data sets may be superior to single weights because they contain additional information offered by another data set. Nevertheless, a more complex transfer process may produce more noise. Compared with ResNet50, DenseNet121 was more complex, and its dense block mechanism could process more data and absorb more information; consequently, DenseNet121 is more suitable than ResNet50 for integrating source data sets.

Few studies have focused on the factors that affect the performance of transfer learning in medical image analysis. The strength of the present study lies in its systematic approach to investigating the transferability of CXR image analytic approaches. Nevertheless, this study had several limitations that should be resolved in future studies. First, our experiments were based on the 50-layer ResNet and 121-layer DenseNet architectures, and the derived weights were estimated using the gradient descent-based optimizer Adam [26] with manually selected hyperparameter values. ResNet and DenseNet are both widely used for conducting deep learning analyses on CXR images; nevertheless, they differ in several aspects, and such differences can be leveraged to investigate the impact of the CNN architecture on transfer learning. Alternatively, new architectures such as EfficientNet [28] and CoAtNet [29], which have shown high performance in challenging computer vision tasks, could be used for analysis. Moreover, to enhance the efficiency of model parameter estimation, Adam may be replaced with recent optimizers such as Chimp [9] and Whale [11], and biogeography-based optimization can be applied to automatically finetune model hyperparameters [12]. Second, CXR images can be taken in posteroanterior, anteroposterior, and lateral views. In this study, we included only one target data set in which all CXR images were in the posteroanterior view; this may have limited the real-world applicability of our findings. Moreover, we included only image classification as the target task. However, CXR images can be used for several other types of deep learning tasks such as segmentation, localization, and image generation [5]. Accordingly, future studies could use CXR images taken in different positions and could consider a wider range of deep learning tasks.

#### **6. Conclusions**

In this study, we conducted a thorough investigation of the effects of various transfer learning approaches on deep CNN models for the multilabel classification of CXR images. Our target data set, collected through general clinical pipelines, contained 1630 chest radiographs with 17 clinically common disease labels. Transfer learning methods that reused pretrained weights through model finetuning and layer transfer were examined. We considered three source data sets with different sizes and different levels of similarity to our target data and assessed their effect on transfer learning effectiveness. These source data sets could be incorporated into transfer learning individually or in combination. We also proposed initiating, concatenating, or co-training different source data sets for joint transfer learning and used two backbone CNN models with different network architectures to adopt the aforementioned transfer learning approaches.

Several substantial findings were obtained. The results demonstrated that transfer learning could improve the model fit. Compared with the layer transfer approach, the model finetuning approach typically afforded better prediction models. When only one single-source data set was adopted as in previous studies, ChestX-ray outperformed ImageNet and CheXpert. However, CheXpert with ImageNet initials attached performed better than ChestX-ray with ImageNet initials attached. ResNet50 performed better in initiating transfer learning, whereas DenseNet121 performed better in concatenating and co-training transfer learning. Transfer learning with multiple source data sets was preferable to that with a single-source data set.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/diagnostics12061457/s1, Figure S1: Radar plots for mean test AUCs from various transfer learning approaches; Figure S2: Radar plots for mean test APs from various transfer learning approaches; Figure S3: PR and ROC curves for each disease label in test data with initial weights from ImageNet pretraining with random initials in ResNet50; Figure S4: PR and ROC curves for each disease label in test data with initial weights from ChestX-ray pretraining with random initials in ResNet50; Figure S5: PR and ROC curves for each disease label in test data with initial weights from CheXpert pretraining with random initials in ResNet50; Figure S6: PR and ROC curves for each disease label in test data with initial weights from ImageNet pretraining with random initials in DenseNet121; Figure S7: PR and ROC curves for each disease label in test data with initial weights from ChestX-ray pretraining with random initials in DenseNet121; Figure S8: PR and ROC curves for each disease label in test data with initial weights from CheXpert pretraining with random initials in DenseNet121; Figure S9: PR and ROC curves for each disease label in test data for initiating transfer learning with pretrained weights from ChestX-ray adopting ImageNet initials in ResNet50; Figure S10: PR and ROC curves for each disease label in test data for initiating transfer learning with pretrained weights from CheXpert adopting ImageNet initials in ResNet50; Figure S11: PR and ROC curves for each disease label in test data for initiating transfer learning with pretrained weights from ChestX-ray adopting ImageNet initials in DenseNet121; Figure S12: PR and ROC curves for each disease label in test data for initiating transfer learning with pretrained weights from CheXpert adopting ImageNet initials in DenseNet121; Figure S13: PR and ROC curves for each disease label in test data for concatenating transfer learning of ImageNet + ChestX-ray in ResNet50; Figure S14: PR and ROC curves for each disease label in test data for concatenating transfer learning of ImageNet + CheXpert in ResNet50; Figure S15: PR and ROC curves for each disease label in test data for concatenating transfer learning of ImageNet + ChestX-ray in DenseNet121; Figure S16: PR and ROC curves for each disease label in test data for concatenating transfer learning of ImageNet + CheXpert in DenseNet121; Figure S17: PR and ROC curves for each disease label in test data for co-training transfer learning of ChestX-ray + CheXpert in ResNet50; Figure S18: PR and ROC curves for each disease label in test data for co-training transfer learning of ChestX-ray + CheXpert in DenseNet121.

**Author Contributions:** Conceptualization, G.-H.H. and T.-B.C.; Data curation, M.-Z.G., N.-H.L., K.-Y.L. and T.-B.C.; Formal analysis, G.-H.H., Q.-J.F. and M.-Z.G.; Funding acquisition, G.-H.H.; Investigation, N.-H.L., K.-Y.L. and T.-B.C.; Methodology, G.-H.H., Q.-J.F. and M.-Z.G.; Project administration, G.-H.H.; Resources, T.-B.C.; Software, Q.-J.F. and M.-Z.G.; Supervision, G.-H.H.; Writing original draft, G.-H.H., Q.-J.F. and M.-Z.G.; Writing—review and editing, G.-H.H., N.-H.L., K.-Y.L. and T.-B.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was partially supported by grants from the Ministry of Science and Technology, Taiwan (MOST 107-2118-M-009-005-MY2, and MOST 109-2118-M-009-004-MY2).

**Institutional Review Board Statement:** The study was conducted in accordance with the guidelines of the Declaration of Helsinki. All experimental procedures were approved by the Institutional Review Board of the E-Da Hospital, Kaohsiung, Taiwan (approval number EMRP-108-115).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the target data set of this study.

**Data Availability Statement:** The data used and analyzed in this study are available from the corresponding author upon reasonable request.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **A Deep Modality-Specific Ensemble for Improving Pneumonia Detection in Chest X-rays**

**Sivaramakrishnan Rajaraman \*,†, Peng Guo †, Zhiyun Xue and Sameer K. Antani**

Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA; peng.guo@nih.gov (P.G.); zhiyun.xue@nih.gov (Z.X.); santani@mail.nih.gov (S.K.A.) **\*** Correspondence: sivaramakrishnan.rajaraman@nih.gov

† These authors contributed equally to this work.

**Abstract:** Pneumonia is an acute respiratory infectious disease caused by bacteria, fungi, or viruses. Fluid-filled lungs due to the disease result in painful breathing difficulties and reduced oxygen intake. Effective diagnosis is critical for appropriate and timely treatment and improving survival. Chest X-rays (CXRs) are routinely used to screen for the infection. Computer-aided detection methods using conventional deep learning (DL) models for identifying pneumonia-consistent manifestations in CXRs have demonstrated superiority over traditional machine learning approaches. However, their performance is still inadequate to aid in clinical decision-making. This study improves upon the state of the art as follows. Specifically, we train a DL classifier on large collections of CXR images to develop a CXR modality-specific model. Next, we use this model as the classifier backbone in the RetinaNet object detection network. We also initialize this backbone using random weights and ImageNet-pretrained weights. Finally, we construct an ensemble of the best-performing models resulting in improved detection of pneumonia-consistent findings. Experimental results demonstrate that an ensemble of the top-3 performing RetinaNet models outperformed individual models in terms of the mean average precision (mAP) metric (0.3272, 95% CI: (0.3006,0.3538)) toward this task, which is markedly higher than the state of the art (mAP: 0.2547). This performance improvement is attributed to the key modifications in initializing the weights of classifier backbones and constructing model ensembles to reduce prediction variance compared to individual constituent models.

**Keywords:** chest X-ray; deep learning; modality-specific knowledge; object detection; RetinaNet; ensemble learning; pneumonia; mean average precision

#### **1. Introduction**

Pneumonia is an acute respiratory infectious disease that can be caused by various pathogens such as bacteria, fungi, or viruses [1]. The infection affects the alveoli in the lungs by filling them with fluid or pus, thereby reducing oxygen intake and causing breathing difficulties. The potency of the disease depends on several factors, including age, health, and the source of infection. According to the World Health Organization (WHO) report (https://www.who.int/news-room/fact-sheets/detail/pneumonia, accessed on 11 December 2021), pneumonia is an infectious disease with a high mortality rate, particularly in children: about 22% of all deaths in children aged 1 to 5 years are attributed to this infection. Effective diagnosis and treatment of pneumonia are therefore critical to improving patient care and survival rates.

Chest X-rays (CXRs) are commonly used to screen for pneumonia infection [2,3]. Analysis of CXR images can be particularly challenging in low- and middle-income countries due to a lack of expert resources, socio-economic factors, etc. [4]. Computer-aided detection systems using conventional deep learning (DL) methods, a sub-class of machine learning (ML) algorithms, can alleviate this burden and have demonstrated superiority over traditional machine learning methods in detecting disease regions of interest (ROIs) [5,6].

**Citation:** Rajaraman, S.; Guo, P.; Xue, Z.; Antani, S.K. A Deep Modality-Specific Ensemble for Improving Pneumonia Detection in Chest X-rays. *Diagnostics* **2022**, *12*, 1442. https://doi.org/10.3390/ diagnostics12061442

Academic Editor: Henk A. Marquering

Received: 17 May 2022 Accepted: 8 June 2022 Published: 11 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Such algorithms (i) automatically detect pneumonia-consistent manifestations on CXRs; and (ii) can support clinical decision-making by facilitating swift referrals for critical cases to improve patient care.

#### *1.1. Related Works*

A study of the literature reveals several studies that propose automated methods using DL models for detecting pneumonia-consistent manifestations on CXRs. However, DL models vary in their architecture and learn discriminative features from different regions in the feature space. They are observed to be highly sensitive to data fluctuations, resulting in poor generalizability due to varying degrees of bias and variance. An approach to achieving low bias and variance and ensuring reliable outcomes is ensemble learning, which is an established ML paradigm that combines predictions from multiple diverse DL models and improves performance compared to individual constituent models [7]. The authors of [8] proposed an ensemble of Faster-RCNN [9], Yolov5 [8], and EfficientDet [8] models to localize and predict bounding boxes containing pneumonia-consistent findings in the publicly available VinDr-CXR [8] dataset and reported a mean Average Precision (mAP) of 0.292. The following methods used ensembled object detection models to detect pneumonia-consistent findings using the CXR collection hosted for the RSNA Kaggle pneumonia detection challenge (https://www.kaggle.com/c/rsna-pneumonia-detection-challenge accessed on 3 March 2022). The current state-of-the-art method according to the challenge leaderboard (https://www.kaggle.com/competitions/rsna-pneumonia-detection-challenge/leaderboard accessed on 3 March 2022) has a mAP of 0.2547. In [10], an ensemble of RetinaNet [11] and Mask RCNN models with ResNet-50 and ResNet-101 classifier backbones delivered a performance with a mAP of 0.2283 using the RSNA Kaggle pneumonia detection challenge CXR dataset. Another study [12] proposed a weighted-voting ensemble of the predictions from Mask R-CNN and RetinaNet models to achieve an mAP of 0.2174 in detecting pneumonia-consistent manifestations. These studies used a randomized test set split from the challenge-provided training data. This is a serious concern since the organizers have not made the blinded test set used during the challenge available for further use. This prevents follow-on research, such as ours, from making fair comparisons.

#### *1.2. Rationale for the Study*

All the above studies used off-the-shelf DL object detection models with ImageNet [13] pretrained classifier backbones. However, ImageNet is a collection of stock photographic images whose visual characteristics, including shape and texture among others, are distinct from those of CXRs. Moreover, the disease-specific ROIs in CXRs are relatively small, and many go unnoticed, which may result in suboptimal predictions [14]. Our prior works and other literature have demonstrated that knowledge transferred from DL models retrained on a large collection of CXR images improves performance on relevant target medical visual recognition tasks [15–17]. To the best of our knowledge, no literature has discussed the use of CXR modality-specific backbones in object detection models, particularly applied to detecting pneumonia-consistent findings in CXRs.

#### *1.3. Contributions of the Study*

Our study improves upon the state-of-the-art as follows:


(i). We train a DL classifier on large collections of CXR images to develop a CXR modality-specific model.

(ii). We use this model as the classifier backbone in the RetinaNet object detection network; the backbone is also initialized with random weights and ImageNet-pretrained weights for comparison.

(iii). Finally, we construct an ensemble of the aforementioned models, resulting in improved detection of pneumonia-consistent findings.

Through this approach, we aim to study the combined benefits of various weight initializations for classifier backbones and construct an ensemble of the best-performing models to improve detection performance. The models' performance is evaluated in terms of mAP, and statistical significance is reported in terms of confidence intervals (CIs) and *p*-values.

Section 2 discusses the datasets, model architecture, training strategies, loss functions, evaluation metrics, statistical methods, and computational resources; Section 3 elaborates on the results; and Section 4 concludes this study.

#### **2. Materials and Methods**

*2.1. Data Collection and Preprocessing*

The following data collections are used for this study:


We used the frontal CXRs from the CheXpert and TBX11K data collections during CXR image modality-specific retraining and those from the RSNA CXR collection to train the RetinaNet-based object detection models. All images are resized to 512 × 512 spatial dimensions to reduce computational complexity. The contrast of the CXRs is further increased by saturating the top 1% and bottom 1% of all image pixel values. For CXR modality-specific retraining, the frontal CXR projections from the CheXpert and TBX11K datasets are divided at the patient level into 70% for training, 10% for validation, and 20% for testing. This patient-level split prevents the leakage of data and subsequent bias during model training. For object detection, the frontal CXRs from the RSNA CXR dataset that show pneumonia-consistent manifestations are divided at the patient level into 70% for training, 10% for validation, and 20% for testing. Table 1 shows the number of CXR images across the training, validation, and test sets used for CXR modality-specific retraining and object detection, respectively.
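A minimal sketch of this preprocessing (resizing and saturating the top and bottom 1% of pixel values) is shown below; OpenCV and NumPy are assumed tools here rather than the study's exact pipeline.

```python
# Sketch of the preprocessing described above: resize to 512 x 512 and
# stretch contrast by saturating the top 1% and bottom 1% of pixel values.
import cv2
import numpy as np

def preprocess_cxr(image):
    image = cv2.resize(image, (512, 512))
    lo, hi = np.percentile(image, [1, 99])   # 1st and 99th intensity percentiles
    image = np.clip(image, lo, hi)           # saturate the extreme pixel values
    return ((image - lo) / (hi - lo + 1e-8)).astype(np.float32)
```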


**Table 1.** Patient-level dataset splits show the number of images for CXR modality-specific retraining and object detection. Note: TBX11K and RSNA datasets have one image per patient.

#### *2.2. Model Architecture*

#### 2.2.1. CXR Modality-Specific Retraining

The ImageNet-pretrained DL models, viz., VGG-16, VGG-19, DenseNet-121, ResNet-50, EfficientNet-B0, and MobileNet have demonstrated promising performance in several medical visual recognition tasks [14,19,21–23]. These models are further retrained on a large collection of CXR images to classify them as showing cardiopulmonary abnormal manifestations or no abnormalities. Such retraining helps the models to learn CXR image modality-specific features that can be transferred and fine-tuned to improve performance in a relevant task using CXR images. The best-performing model with the learned CXR image modality-specific weights is used as the classifier backbone to train the RetinaNet-based object detection model toward detecting pneumonia-consistent manifestations. Figure 1 shows the block diagram illustrating the steps involved in CXR image modality-specific retraining.

**Figure 1.** Steps illustrating CXR image modality-specific retraining of the ImageNet-pretrained models.

#### 2.2.2. RetinaNet Architecture

We used RetinaNet as the base object detection architecture in our experiments. The architecture of the RetinaNet model is shown in Figure 2. As a single-stage object detection structure, RetinaNet shares a similar concept of "anchor proposal" with [24]. It uses a feature pyramid network (FPN) [25], in which features on each image scale are computed separately in the lateral connections and then summed through convolutional operations via the top-down pathways. The FPN combines low-resolution features with strong semantic information and high-resolution features with weak semantics through top-down paths and horizontal connections. Thus, feature maps with rich semantic information are obtained, which is beneficial for detecting ROIs consistent with pneumonia that are relatively small compared to the other parts of the CXR image. Furthermore, when trained to minimize the focal loss [5], RetinaNet was reported to deliver strong performance by focusing on hard, misclassified examples.

**Figure 2.** Method flowchart for the RetinaNet network.

2.2.3. Ensemble of RetinaNet Models with Various Backbones

We initialized the weights of the VGG-16 and ResNet-50 classifier backbones used in the RetinaNet model using three strategies: (i) Random weights; (ii) ImageNet-pretrained weights, and (iii) CXR image modality-specific retrained weights as discussed in Section 2.2.1. Each model is trained for 80 epochs and the model weights (snapshots) are stored at the end of each epoch. Varying modifications of the RetinaNet model classifier backbones and loss functions are mentioned in Table 2.

**Table 2.** RetinaNet model classifier backbones with varying weight initializations and loss functions. The loss functions mentioned are used for classification. For bounding box regression, only the smooth-L1 loss function [26] is used in all cases.


We adopted non-maximum suppression (NMS) in RetinaNet training with an IoU threshold of 0.5 and evaluated the models using all predictions with a confidence score over 0.9. A weighted averaging ensemble is constructed using (i) the top-3 performing models from the 12 RetinaNet models mentioned in Table 2, and (ii) the top-3 performing snapshots (model weights) for each classifier backbone. We empirically assigned weights of 1, 0.9, and 0.8 to the predictions of the 1st, 2nd, and 3rd best-performing models. A schematic of the ensemble procedure is shown in Figure 3. An ensembled bounding box is generated if the IoU of the weighted average of the predicted bounding boxes and the ground truth (GT) boxes is greater than 0.5. The ensembled model is evaluated based on the mean average precision (mAP) metric.
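A minimal sketch of the weighted box averaging and the IoU test used above is given below; boxes are assumed to be [x1, y1, x2, y2] arrays, and the reference box stands for the GT box used in the acceptance test.

```python
# Sketch of the weighted averaging of bounding boxes from the top-3 models,
# with the weights 1, 0.9, and 0.8 mentioned above; boxes are [x1, y1, x2, y2].
import numpy as np

def weighted_box_average(boxes, weights=(1.0, 0.9, 0.8)):
    boxes = np.asarray(boxes, dtype=np.float32)
    w = np.asarray(weights, dtype=np.float32)[: len(boxes)]
    return (boxes * w[:, None]).sum(axis=0) / w.sum()

def iou(box_a, box_b):
    x1, y1 = np.maximum(box_a[:2], box_b[:2])
    x2, y2 = np.minimum(box_a[2:], box_b[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

# An ensembled box is kept only if its IoU with the reference box exceeds 0.5.
# ensemble_box = weighted_box_average([box_model1, box_model2, box_model3])
# keep = iou(ensemble_box, reference_box) > 0.5
```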

**Figure 3.** Method Schematic of the ensemble approach.

2.2.4. Loss Functions and Evaluation Metrics

#### CXR Image Modality-Specific Retraining

During CXR image modality-specific retraining, the DL models are retrained on a combined selection of the frontal CXR projections from the CheXpert and TBX11K datasets (details in Table 1). The training is performed for 128 epochs to minimize the categorical cross-entropy (CCE) loss. The CCE loss is the most commonly used loss function in classification tasks, and it helps to measure the distinguishability between two discrete probability distributions. It is expressed as shown in Equation (1).

$$\text{CCE}_{loss} = -\sum_{k=1}^{\text{output size}} y_k \log \hat{y}_k \tag{1}$$

Here, $\hat{y}_k$ denotes the $k$th scalar value in the model output, $y_k$ denotes the corresponding target, and *output size* denotes the number of scalar values in the model output. The term $y_k$ denotes the probability that event $k$ occurs, and the sum of all $y_k$ equals 1. The minus sign in the CCE loss equation ensures that the loss is minimized when the two distributions become less distinguishable. We used a stochastic gradient descent optimizer with an initial learning rate of $1 \times 10^{-4}$ and a momentum of 0.9 to reduce the CCE loss and improve performance. Callbacks are used to store the model checkpoints, and the learning rate is reduced after a patience parameter of 10 epochs when the validation performance ceases to improve. The weights of the model that delivered superior performance on the validation set are used to predict the test set. The models are evaluated in terms of accuracy, the area under the receiver-operating characteristic curve (AUROC), the area under the precision-recall (PR) curve (AUPRC), sensitivity, precision, F-score, Matthews correlation coefficient (MCC), and Kappa statistic.
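A minimal sketch of this retraining configuration is given below; `model` stands for one of the ImageNet-pretrained classifiers, the checkpoint filename is hypothetical, and the learning-rate reduction factor is an assumption.

```python
# Sketch of the retraining configuration described above; "model" stands for
# one of the ImageNet-pretrained classifiers, the checkpoint filename is
# hypothetical, and the learning-rate reduction factor is an assumption.
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=1e-4, momentum=0.9),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    ModelCheckpoint("best_weights.h5", monitor="val_loss", save_best_only=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=10),
]
# model.fit(train_data, epochs=128, validation_data=val_data, callbacks=callbacks)
```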

#### RetinaNet-Based Detection of Pneumonia-Consistent Findings

Considering medical images, the disease ROIs span a relatively smaller portion of the whole image. This results in a considerably high degree of imbalance in the foreground ROI and the background pixels. These issues are particularly prominent in applications such as detecting cardiopulmonary manifestations like pneumonia where the number of pixels showing pneumonia-consistent manifestations is markedly lower compared to the total number of image pixels. Generalized loss functions such as balanced cross-entropy loss do not take this data imbalance into account. This may lead to a learning bias and subsequent adversity in learning the minority ROI pixels. Appropriate selection of the loss function is therefore critical for improving detection performance. In this regard, the authors of [11] proposed the focal loss for object detection, an extension of the cross-entropy loss, which alleviates this learning bias by giving importance to the minority ROI pixels while downweighting the majority background pixels. Minimizing the focal loss thereby reduces the loss contribution from majority background examples and increases the importance of correctly detecting the minority disease-positive ROI pixels. The focal loss is expressed as shown in Equation (2).

$$\text{Focal loss}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{2}$$

Here, $p_t$ denotes the probability that the object detection model predicts for the GT class. The parameter $\gamma$ controls the rate at which the majority (background, non-ROI) samples are down-weighted. The equation reduces to the conventional cross-entropy loss when $\gamma = 0$. We empirically selected $\gamma = 2$, which delivered superior detection performance.
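A minimal sketch of the focal loss in Equation (2) for a binary ROI-versus-background prediction could look like this; $\gamma = 2$ follows the text, while the $\alpha$ value is an assumed default taken from the original focal-loss formulation.

```python
# Sketch of the focal loss in Equation (2) for a binary (ROI vs. background)
# prediction; gamma = 2 follows the text, while alpha = 0.25 is an assumed
# default rather than a value reported in this study.
import tensorflow as tf

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    # p_t is the predicted probability of the ground-truth class.
    p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
    alpha_t = tf.where(tf.equal(y_true, 1.0), alpha, 1.0 - alpha)
    return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
```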

Another loss function, the Focal Tversky loss function [27], a generalization of the focal loss function, has been proposed to tackle the data imbalance problem and is given in Equation (3). The Focal Tversky loss function generalizes the Tversky loss, which is based on the Tversky index and helps achieve a superior tradeoff between recall and precision when trained on class-imbalanced datasets. The Focal Tversky loss function uses a smoothing parameter $\gamma$ that controls the non-linearity of the loss at different values of the Tversky index to balance between the minority pneumonia-consistent ROI and the majority background classes. In Equation (3), $TI_c$ denotes the Tversky index for class $c$, expressed as shown in Equation (4).

$$FT_{loss} = \sum_{c} \left(1 - TI_c\right)^{\gamma} \tag{3}$$

$$TI_c = \frac{\sum_{i=1}^{M} t_{ic}\, g_{ic} + \epsilon}{\sum_{i=1}^{M} t_{ic}\, g_{ic} + \alpha \sum_{i=1}^{M} t_{ic}\, g_{i\hat{c}} + \beta \sum_{i=1}^{M} t_{i\hat{c}}\, g_{ic} + \epsilon} \tag{4}$$

Here, $g_{ic}$ and $t_{ic}$ denote the ground truth and predicted labels for the pneumonia class $c$, where $g_{ic}, t_{ic} \in \{0, 1\}$. That is, $t_{ic}$ denotes the probability that pixel $i$ belongs to the pneumonia class $c$, and $t_{i\hat{c}}$ denotes the probability that pixel $i$ belongs to the background class $\hat{c}$; the same holds for $g_{ic}$ and $g_{i\hat{c}}$. The term $M$ denotes the total number of image pixels. The term $\epsilon$ provides numerical stability to avoid divide-by-zero errors. The hyperparameters $\alpha$ and $\beta$ are tuned to emphasize recall under class-imbalanced training conditions. The Tversky index is adapted into a loss function by minimizing $\sum_c (1 - TI_c)$. After empirical evaluations, we fixed $\gamma = 4/3$, $\alpha = 0.7$, and $\beta = 0.75$.
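A minimal sketch of the focal Tversky loss in Equations (3) and (4) for a single class could look like this; $\alpha = 0.7$, $\beta = 0.75$, and $\gamma = 4/3$ follow the text, and $\epsilon$ is a small constant for numerical stability.

```python
# Sketch of the focal Tversky loss in Equations (3) and (4) for a single
# class; alpha = 0.7, beta = 0.75, and gamma = 4/3 follow the text.
import tensorflow as tf

def focal_tversky_loss(y_true, y_pred, alpha=0.7, beta=0.75, gamma=4.0 / 3.0, eps=1e-7):
    tp = tf.reduce_sum(y_true * y_pred)            # ROI predicted as ROI
    fp = tf.reduce_sum((1.0 - y_true) * y_pred)    # background predicted as ROI
    fn = tf.reduce_sum(y_true * (1.0 - y_pred))    # ROI predicted as background
    tversky_index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return tf.pow(1.0 - tversky_index, gamma)
```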

As is known, the loss function within RetinaNet is the sum of two terms, one for classification and the other for bounding box regression. We left the smooth-L1 loss used for bounding box regression unchanged. For classification, we explored the performance with the focal loss and focal Tversky loss functions individually for training the RetinaNet models with varying weight initializations. We used the bounding box annotations [20] associated with the RSNA CXRs showing pneumonia-consistent manifestations as the GT bounding boxes and measured their agreement with those generated by the models initialized with random weights, ImageNet-pretrained weights, and CXR image modality-specific retrained classifier backbones. Let TP, FP, and FN denote the true positives, false positives, and false negatives, respectively. Given a predefined IoU threshold, a predicted bounding box is considered a TP if it overlaps with the GT bounding box by a value equal to or exceeding this threshold. An FP denotes a predicted bounding box with no associated GT bounding box, and an FN denotes a GT bounding box with no associated predicted bounding box. The mAP is measured as the area under the precision-recall curve (AUPRC), as shown in Equation (5). Here, *P* denotes precision, which measures the accuracy of predictions, and *R* denotes recall, which measures how well the model identifies all the TPs; they are computed as shown in Equations (6) and (7). The value of mAP lies in the range [0, 1].

$$\text{mean average precision } (mAP) = \int\_0^1 P(R)dR \tag{5}$$

$$Precision\ (P) = \frac{TP}{TP + FP} \tag{6}$$

$$Recall = \frac{TP}{(TP + FN)}\tag{7}$$

We used a Linux system with a 1080Ti GPU, the TensorFlow backend (v. 2.6.2) with Keras, and CUDA/cuDNN libraries for GPU acceleration to train the object detection models, which were configured in the Python environment.

#### *2.3. Statistical Analysis*

We evaluated statistical significance using the mAP metric achieved by the models trained with various weight initializations and loss functions. The 95% confidence intervals (CIs) are measured as the binomial interval using the Clopper-Pearson method.

#### **3. Results and Discussion**

We organized the results from our experiments into the following sections: Evaluating the performance of (i) CXR image modality-specific retrained models and (ii) RetinaNet object detection models using classifier backbones with varying weight initializations and loss functions.

#### *3.1. Classification Performance during CXR Image Modality-Specific Retraining*

Recall that the ImageNet-pretrained DL models are retrained on the combined selection of CXRs from the CheXpert and TBX11K collections. Such retraining is performed to adapt the weight layers to the CXR image modality and let the models learn CXR modality-specific features that improve performance when the learned knowledge is transferred and fine-tuned for a related medical image visual recognition task. The performance achieved by the CXR image modality-specific retrained models on the hold-out test set is listed in Table 3, and the performance curves are shown in Figure 4. The *no-skill* line in Figure 4 denotes the performance of a classifier that fails to discriminate between the normal and abnormal CXRs and therefore predicts a random outcome or a specific category under all circumstances.

**Figure 4.** The collection of performance curves for the CXR image modality-specific retrained models. The performance is recorded at the optimal classification threshold measured with the validation data. (**a**) ROC and (**b**) PR curves.


**Table 3.** Performance of the CXR image modality-specific retrained models with the hold-out test set. Bold numerical values denote superior performance. The values in parenthesis denote the 95% CI for the MCC metric.

We could observe from Table 3 that the CXR image modality-specific retrained VGG-16 model demonstrates the best performance compared to other models in terms of all metrics except sensitivity. Of these, the MCC metric is a good measure to use because, unlike the F-score, it considers a balanced ratio of TPs, TNs, FPs, and FNs. We noticed that the MCC values achieved by the various CXR image modality-specific retrained models are not significantly different (*p* > 0.05). Based on its performance, we used VGG-16 as the backbone for the RetinaNet detector. However, to enable fair comparison with other conventional RetinaNet-based results, we included the ResNet-50 backbone for detecting pneumonia-consistent manifestations. The VGG-16 and ResNet-50 classifier backbones are also initialized with random and ImageNet-pretrained weights for further comparison.

#### *3.2. Detection Performance Using RetinaNet Models and Their Ensembles*

Recall that the RetinaNet models are trained with different initializations of the classifier backbones. The performance achieved by these models using the hold-out test set is listed in Table 4. Figure 5 shows the PR curves obtained with the RetinaNet model using varying weight initializations for the selected classifier backbones. These curves show the precision and recall value of the model's bounding box predictions on every sample in the test set. We observe from Table 4 that the RetinaNet model with the CXR image modality-specific retrained ResNet-50 classifier backbone and trained using the focal loss function demonstrates superior performance in terms of mAP. Figure 6 shows the bounding box predictions of the top-3 performing RetinaNet models for a sample CXR from the hold-out test set.

We used two approaches to combine the bounding box predictions. They are (i) using the bounding box predictions from the top-3 performing RetinaNet models, viz., ResNet-50 with CXR image modality-specific weights + focal loss, ResNet-50 with CXR image modality-specific weights + focal Tversky loss, and ResNet-50 with random weights + focal loss; and (ii) using the bounding box predictions from the top-3 performing snapshots (weights) within each model. The results are presented in Table 5 and Figure 7. A weighted averaging ensemble of the bounding boxes is generated when the IoU of the predicted bounding boxes is greater than the threshold value, which is set at 0.5. Recall that the models are trained for 80 epochs and a snapshot (i.e., the model weights) is stored at the end of each epoch. We observed that the ensemble of the top-3 performing RetinaNet models delivered superior performance in terms of the mAP metric compared to other models and ensembles. Figure 8 shows a sample CXR image with GT and predicted bounding boxes using the weighted averaging ensemble of the top-3 individual models and the top-3 snapshots of the best-performing model.

**Table 4.** Performance of RetinaNet with the varying weight initializations for the classifier backbones and training losses. The values in parenthesis denote the 95% CI for the mAP metric. Bold numerical values denote superior performance.


**Figure 5.** PR curves of the RetinaNet models initialized with varying weights for the classifier backbones. (**a**) ResNet-50 with CXR image modality-specific weights + focal Tversky loss; (**b**) ResNet-50 with CXR image modality-specific weights + focal loss, and (**c**) ResNet-50 with random weights + focal loss.

**Figure 6.** Bounding box predictions of the RetinaNet models initialized with varying weights for the classifier backbones. Green boxes denote the model predictions and red boxes denote the ground truth. (**a**) A sample CXR with ground truth bounding boxes. (**b**) ResNet-50 with CXR image modality-specific weights + focal Tversky loss; (**c**) ResNet-50 with CXR image modality-specific weights + focal loss, and (**d**) ResNet-50 with random weights + focal loss.

**Table 5.** Ensemble performance with the top-3 performing models (from Table 4) and the top-3 snapshots for each of the models trained with various classifier backbones and weight initializations. The values in parentheses denote the 95% CI for the mAP metric. Bold numerical values denote superior performance.


**Figure 7.** PR curves of the model ensembles. (**a**) PR curve obtained with the weighted averaging ensemble of the top-3 performing models (ResNet-50 with CXR modality-specific weights + focal loss, ResNet-50 with CXR modality-specific weights + focal Tversky loss, and ResNet-50 with random weights + focal loss) and (**b**) PR curve obtained with the ensemble of the top-3 performing snapshots while training the ResNet-50 with CXR modality-specific weights + focal loss model.

**Figure 8.** Bounding box predictions using the ensemble of RetinaNet models initialized with varying weights for the classifier backbones. Green boxes denote the individual model predictions, blue boxes denote the ensemble predictions, and red boxes denote the ground truth. (**a**) ResNet-50 with CXR image modality-specific weights + focal Tversky loss; (**b**) ResNet-50 with CXR image modality-specific weights + focal loss; (**c**) ResNet-50 with random weights + focal loss, and (**d**) the ensembled bounding box prediction.

#### **4. Conclusions and Future Work**

In this study, we demonstrated the combined benefits of training CXR image modality-specific models, using them as classifier backbones in an object detection model, evaluating them under different loss settings, and constructing ensembles of the best-performing models to improve performance in a pneumonia detection task. We observed that both CXR image modality-specific classifier backbones and ensemble learning improved detection performance compared to the individual constituent models. This study is, however, limited in that we investigated the effect of CXR modality-specific classifier backbones only in a RetinaNet-based object detection model for detecting pneumonia-consistent findings. The efficacy of this approach in detecting other cardiopulmonary disease manifestations is a potential avenue for future research. Additional diversity in the training process could be introduced by using CXR images and their disease-specific annotations collected from multiple institutions. With the advent of high-performance computing and current advancements in DL-based object detection, future studies could explore the use of Mask R-CNN, transformer-based models, and other advanced detection methods [28–31], as well as their ensembles, to improve detection performance. Novel model optimization methods and loss functions could also be proposed to further improve detection performance. However, the objective of this study was not to propose a new object detection model but to validate the use of CXR modality-specific classifier backbones in existing models to improve performance.

As the organizers of the RSNA Kaggle pneumonia detection challenge have not made the blinded GT annotations of the test set publicly available, we are unable to compare our results directly with the challenge leaderboard. However, the performance of our method on a random split of the challenge-provided training set, in which we sequestered 10% of the images for testing and used 70% for training and 20% for validation, is markedly superior to that of the best-performing method on the leaderboard.
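A minimal sketch of such a 70/20/10 random split is shown below; the identifier list is a placeholder and the fixed random seed is an assumption, not the seed used in this study:

```python
from sklearn.model_selection import train_test_split

# Placeholder image identifiers standing in for the challenge training set.
image_ids = [f"img_{i}" for i in range(1000)]

# Sequester 10% for testing, then split the remaining 90% so that the
# validation set is 20% of the full data (0.2 / 0.9 of the remainder).
train_val_ids, test_ids = train_test_split(image_ids, test_size=0.10, random_state=42)
train_ids, val_ids = train_test_split(train_val_ids, test_size=0.2 / 0.9, random_state=42)
print(len(train_ids), len(val_ids), len(test_ids))  # ~700 / ~200 / 100
```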

**Author Contributions:** Conceptualization, S.R., P.G. and S.K.A.; Data curation, S.R. and P.G.; Formal analysis, S.R., P.G. and Z.X.; Funding acquisition, S.K.A.; Investigation, Z.X. and S.K.A.; Methodology, S.R. and P.G.; Project administration, S.K.A.; Resources, S.K.A.; Software, S.R. and P.G.; Supervision, Z.X. and S.K.A.; Validation, S.R., P.G. and S.K.A.; Visualization, S.R. and P.G.; Writing—original draft, S.R. and P.G.; Writing—review & editing, S.R., P.G., Z.X. and S.K.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health. The funders had no role in the study design, data collection, and analysis, decision to publish, or preparation of the manuscript.

**Institutional Review Board Statement:** Ethical review and approval were waived for this study because of the retrospective nature of the study and the use of anonymized patient data.

**Informed Consent Statement:** Patient consent was waived by the IRBs because of the retrospective nature of this investigation and the use of anonymized patient data.

**Data Availability Statement:** The data required to reproduce this study are publicly available and cited in the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

