1. Introduction
Over the past few years, there has been a surge in studies exploring deep learning techniques to diagnose COVID-19 and pneumonia. Systematic reviews of AI-enabled COVID-19 detection can be found in [1,2,3] in 2021, [4,5,6,7,8,9,10] in 2022, and [11,12,13] in 2023. While all reviews delved into deep learning neural networks and medical image databases, there were apparent shifts in focus: from demonstrating the feasibility of deep learning with limited databases in 2021, to model performance comparisons and database surveys in 2022, and to model enhancement/development and real-world applications more recently. Because COVID-19 medical images were still limited in 2021, prior studies often focused on demonstrating the feasibility of deep learning models in distinguishing COVID-19 cases from normal subjects or patients with other respiratory diseases, such as pneumonia [14,15,16]. These studies used either chest X-ray images or CT scans [17,18,19,20,21,22] and were designed as either binary or multi-class classifications [23,24]. In general, classification accuracy was higher with chest X-ray images (90%+) than with CT (~85%), and was 1–5% higher in binary than in multi-class classifications [10,25,26]. To remedy data shortages, various data augmentation techniques were used, including scaling, rotation, shifting, and image filtering such as grayscale conversion and Gaussian blur [27,28,29].
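As an illustrative sketch only (the parameter values are assumed, not taken from the cited studies), such augmentations can be expressed as a torchvision transform pipeline:

```python
# Hypothetical augmentation pipeline mirroring the techniques cited above:
# scaling, rotation, shifting, grayscale filtering, and Gaussian blur.
from torchvision import transforms

augment = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),        # grayscale, kept 3-channel for CNN input
    transforms.RandomAffine(degrees=10,                  # small rotations
                            translate=(0.05, 0.05),      # shifting
                            scale=(0.9, 1.1)),           # scaling
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 1.0)),  # Gaussian blur
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```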
With the increasing availability of COVID-19 X-rays and CT scans since 2021, research has pivoted to comparing various machine learning models and developing new models tailored for COVID-19 detection. For instance, Garg et al. [30] considered 20 CNN models for COVID-19 detection, including EfficientNet-B5, DenseNet169, InceptionV3, ResNet-50, and VGG16, using 4173 CT images, and demonstrated that EfficientNet-B5 and ResNet-50 were consistently superior in accuracy and sensitivity, followed by DenseNet, EfficientNet, and Xception, while VGG-19 remained the lowest across models. However, mixed results were frequently reported by other studies. Chouat et al. [31] compared ResNet-50, InceptionV3, VGG-19, and Xception using CT scans and X-ray images and reported that VGG-19 performed best on CT scans (87% accuracy) and Xception performed best on X-ray images (98% accuracy). In another study using CT images [32], ResNet-50 had the highest accuracy (96.97%), followed by Xception (90.71%), Inception-v3 (89.38%), and VGG16 (87.12%). Because such variability reflects the inherent advantages and limitations of individual CNN models, hybrid or ensemble models leveraging multiple networks have been explored. By combining InceptionV3 with VGG16, Srinivas et al. [33] achieved 98% accuracy in COVID-19 prediction using 243 X-ray images, outperforming InceptionV3, VGG16, ResNet-50, DenseNet121, and MobileNet tested individually. Similarly, Wang et al. [34] integrated features extracted from Xception, MobileNetV2, and NasNetMobile and made the classification via a confidence fusion method.
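As a hedged illustration of the feature-fusion idea (not the cited authors' exact architectures), the sketch below concatenates pooled features from two ImageNet-pretrained torchvision backbones, with ResNet-50 and MobileNetV2 standing in for the networks named above, and feeds them to a shared three-class head:

```python
# Minimal feature-level fusion sketch (illustrative stand-in architectures).
import torch
import torch.nn as nn
from torchvision import models

class FusionNet(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        resnet = models.resnet50(weights="IMAGENET1K_V1")        # torchvision >= 0.13 weight string
        mobile = models.mobilenet_v2(weights="IMAGENET1K_V1")
        self.backbone_a = nn.Sequential(*list(resnet.children())[:-1])  # -> (N, 2048, 1, 1)
        self.backbone_b = mobile.features                               # -> (N, 1280, H', W')
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(2048 + 1280, num_classes)

    def forward(self, x):
        fa = torch.flatten(self.backbone_a(x), 1)                 # pooled ResNet-50 features
        fb = torch.flatten(self.pool(self.backbone_b(x)), 1)      # pooled MobileNetV2 features
        return self.classifier(torch.cat([fa, fb], dim=1))        # fused 3-class prediction

logits = FusionNet()(torch.randn(2, 3, 224, 224))                 # sanity check: shape (2, 3)
```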
A notable variance across prior studies was the reported classification accuracies, ranging broadly from 78% to 100%, presumably due to differences in the size or quality of the training/testing datasets. For instance, in 2021, Karar et al. [35] reported 99.9% classification accuracy for VGG-19 and ResNet-50. However, only 263 original X-ray images (including 56 normal, 49 COVID-19, and 128 pneumonia) were used, which were increased to 3325 images per category through augmentation techniques such as flipping, rotation, and shifting. In 2022, using 4326 chest X-ray images, Kumar et al. [36] reported an accuracy of 100% for binary classification (normal vs. COVID-19) and 98.82% for multi-class classification (normal, COVID-19, pneumonia). On the other hand, Garg et al. [30] reported 78% accuracy using VGG-19 and 4137 CT images.
Despite exhibiting high accuracies in training and validation, CNN-based transfer learning may underperform during the testing phase [37,38,39,40,41]. This could be due to overfitting, a common issue in medical image classification where the number of discriminatory image features is limited [7,13,42,43,44]. Prominent CNN models typically encompass more than ten layers, contain over 60 million trained parameters, and are trained on expansive datasets such as ImageNet, which spans 1000 categories. In contrast, medical images have limited features, and their differences are often imperceptible to the human eye. Consequently, the features/filters/convolutional layers trained on ImageNet may differ from those needed for medical images, resulting in lower testing performance [45]. In transfer learning, the adapted filters may retain irrelevant features, contaminating the classification process.
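A minimal sketch of this standard transfer-learning setup is shown below (an assumed configuration, not necessarily the one used in the cited studies): the ImageNet-trained ResNet-50 backbone is frozen and only a new three-class head is trained, so nearly all filters remain those learned from natural images rather than from radiographs:

```python
# Transfer-learning sketch: frozen ImageNet backbone + new 3-class head.
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False                      # keep the ImageNet filters as-is
model.fc = nn.Linear(model.fc.in_features, 3)    # new head: normal / COVID-19 / pneumonia

frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"frozen (ImageNet) parameters: {frozen:,}; trainable parameters: {trainable:,}")
# Unfreezing deeper blocks (e.g., model.layer4) lets those filters adapt to X-ray features.
```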
In summary, researchers have achieved accuracy levels exceeding 95% for COVID-19 detection through the selection and refinement of appropriate CNN models. Yet, systems based on these CNN models were often developed using data specific to certain clinics and acquired with their unique imaging modalities. Most proposed models or systems have not been tested on other datasets. Moreover, even though COVID-19 datasets have become more available, most have not been validated and can be subject to mislabeling, noise, incompleteness, corruption, or low quality [46,47,48,49]. One relevant question is how models or systems trained on one database will perform on other datasets that inevitably have certain disparities. This question is even more pronounced when the training dataset contains a limited number of images. In this scenario, will the high accuracy of trained models be maintained when tested on larger datasets or on images from other clinics? If not, how much of a decrease in accuracy should be expected and/or tolerated? Will adding new images to the original training dataset always improve the model's performance, and what level of improvement can be expected by re-training the model with extended data? To achieve a desirable classification accuracy, how many new images should be added to the original training dataset?
The objective of this study was to evaluate the performance of CNN models in diagnosing COVID-19 and pneumonia in the lungs based on X-ray images from different data sources, as well as to assess the performance variation from multi-level training. Specific aims include the following:
Comparing the three-class classification performances of four CNN models: AlexNet, ResNet-50, MobileNet, and VGG-19.
For a given model, evaluating the model’s performance on X-ray images from different sources (i.e., inside and outside the training space).
Quantifying the benefits of multi-round training with extended data on the model’s performance.
Evaluating the model’s interpretability via heatmaps and intermediate activation features.
2. Methods
2.1. COVID-19 and Pneumonia Datasets from Multiple Sources
Chest X-ray images were selected from three sources, with each source encompassing the categories of normal, COVID-19, and pneumonia. The first dataset [50,51,52] contains 10,192 normal cases, 3616 COVID-19 positive cases, and 1345 viral pneumonia images. Among these, approximately 80% were used for training (11,538 images) and 20% for testing (3035 images), as shown in the left panel of Figure 1a. Both the training and testing datasets were split so that each retained approximately the same proportion of samples per class as in the original dataset (i.e., stratified random sampling). Among the 11,538 training images, 7791 are normal, 2712 are COVID-19, and 1030 are viral pneumonia, a disproportionate distribution of 68%, 24%, and 9%, respectively. Likewise, among the 3035 testing images, 2032 are normal, 742 are COVID-19, and 261 are viral pneumonia. Three sample images typical of normal, COVID-19, and viral pneumonia cases are also shown in Figure 1a.
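A minimal sketch of this stratified 80/20 split is given below using scikit-learn; the file paths and class ordering are hypothetical placeholders for the actual Dataset 1 files:

```python
# Stratified 80/20 split sketch; paths and label construction are placeholders.
from sklearn.model_selection import train_test_split

image_paths = [f"dataset1/img_{i}.png" for i in range(15153)]   # hypothetical file names
labels = [0] * 10192 + [1] * 3616 + [2] * 1345                  # 0 = normal, 1 = COVID-19, 2 = viral pneumonia

train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths, labels,
    test_size=0.20,       # ~20% held out for Level 1 testing
    stratify=labels,      # preserve class proportions in both splits
    random_state=42,
)
```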
The second dataset [53] contains 6432 images. As with Dataset 1, stratified random sampling was used to set aside 20% (1288 images) as the Level 2 test set, comprising 855 normal, 317 COVID-19, and 116 pneumonia images (Figure 1a). These multi-center X-ray images were tested to evaluate model applicability in external centers.
A subfolder named “Outliers” was created by collecting images from the first and second datasets that are visibly distorted or irregular. Some of these radiographic images exhibited rotation, distortion, differing magnifications, and variations in contrast and brightness, as shown in the lower panel of Figure 1a. These outlier images were excluded from the Round 1 training dataset; instead, designated proportions (25% and 50%) were included in training in Rounds 2 and 3. The “Outliers” dataset consisted of 247 images (104 normal, 107 COVID-19, and 36 pneumonia) and was also used to test the CNN models as the Level 3 test set. Because these images have a lower level of similarity to the training images, lower classification accuracies were expected on this test set than on the Level 1 and Level 2 test sets.
The third dataset [54] contains 7135 X-ray images. Our initial study indicated that CNN models trained on Dataset 1 consistently performed well on Dataset 2 but poorly on Dataset 3; we thus inferred that the images in Dataset 3 differed more strongly from the training data. In this study, around 10% of Dataset 3 (730 images) was designated, via stratified random sampling, as the Level 4 test set, comprising 234 normal, 105 COVID-19, and 390 pneumonia cases (right panel, Figure 1a). This dataset served a similar role to Dataset 2 (Level 2) and was used to substantiate statistical significance when testing images from an external medical center.
2.2. Selection of CNN Models
The selection of CNN models depended on the purpose of training and on each model's characteristics. The purpose of this study was to develop a CNN-based automated diagnostic system with high performance for COVID-19 and pneumonia patients based on chest X-ray images. Relevant criteria included the following: high classification accuracy, high sensitivity and specificity, robust performance across centers, the ability to learn continuously, and the ability to identify discriminative features among categories. Learned features are often abstract, beyond human comprehension or interpretation. However, they can sometimes capture inherent differences that are consistent with our perceptions, which can be useful in understanding either the etiology or manifestations of a specific disease. Ideally, only defining features are learned from the training dataset; the subsequent addition of highly similar images will not bring new features to the CNN model and thus can only marginally improve it. Only new images that introduce new features to the model will provide additional discriminative information to refine classification of the test dataset.
Four convolutional neural network (CNN) models were selected for this study: AlexNet, ResNet-50, MobileNet, and VGG-19. AlexNet and ResNet-50 were selected because they were the 2012 and 2015 winners of the ImageNet competition, respectively [32,33]. AlexNet was groundbreaking in its use of GPUs for training deep neural networks, while ResNet-50 introduced residual connections between layers to improve gradient flow and enable the training of even deeper networks [34,35]. MobileNet was chosen for its simpler architecture and smaller computational requirements [36,37,38,39]; it would be desirable to run a computer-aided diagnostic (CAD) system on a personal computer or even a smartphone, provided it can achieve sufficient diagnostic accuracy. VGG-19 was selected for its simplicity: it uses only 3 × 3 convolutional layers stacked on top of each other in increasing depth. One drawback of VGG-19 is that it requires more memory than the other three models, primarily due to its fully connected layers.
To determine whether the architecture of a pre-trained network is suitable for a test dataset, the following steps were taken. First, the network was trained and tested on the same dataset, providing an initial evaluation of the model's performance on one testing dataset. In this step, various metrics were evaluated, including accuracy, AUC, specificity, sensitivity, precision, and the ROC curve, to identify areas for improvement (Figure 1c). Second, the pre-trained network was fine-tuned based on the identified areas for improvement by adjusting the hyperparameters or replacing some of its layers. After fine-tuning, the pre-trained network was re-evaluated on the test dataset until its performance reached a threshold classification accuracy.
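The sketch below illustrates this evaluate-then-adjust cycle; the accuracy threshold, learning-rate schedule, and number of passes are assumed values, not the study's exact protocol:

```python
# Illustrative fine-tuning loop: re-train and re-evaluate until a target accuracy is reached.
import torch
import torch.nn as nn

def test_accuracy(model, loader, device="cpu"):
    """Fraction of correctly classified test images."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, targets in loader:
            preds = model(images.to(device)).argmax(dim=1)
            correct += (preds == targets.to(device)).sum().item()
            total += targets.numel()
    return correct / total

def fine_tune(model, train_loader, test_loader, target_acc=0.95, device="cpu"):
    criterion = nn.CrossEntropyLoss()
    lr = 1e-4                                               # assumed starting learning rate
    for _ in range(3):                                      # a few fine-tuning passes
        optimizer = torch.optim.Adam(
            (p for p in model.parameters() if p.requires_grad), lr=lr)
        model.train()
        for images, targets in train_loader:                # one pass over the training data
            optimizer.zero_grad()
            loss = criterion(model(images.to(device)), targets.to(device))
            loss.backward()
            optimizer.step()
        if test_accuracy(model, test_loader, device) >= target_acc:
            break                                           # threshold classification accuracy reached
        lr *= 0.5                                           # otherwise adjust hyperparameters and retry
    return model
```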
The fine-tuned network model was further tested on additional datasets from the same medical center that the model had never seen before (for instance, newly acquired images). This step evaluates the pre-trained model’s interpolation capacity.
In the following step, the model was tested on datasets from other medical centers that belong to the same category (e.g., COVID-19 images, but the model has never seen them before). This step evaluates the pre-trained model’s extrapolation capacity. Even though the X-ray images acquired at any medical center share inherent similarities among COVID-19 patients, differences among COVID-19 images can exist across medical centers, and the performance of the pre-trained model can be affected by such differences.
In the next step, the model was re-trained by adding new images to the training dataset and re-tested. In this step, two metrics were evaluated: (1) the improvement in performance, and (2) the minimal number of images needed to reach a satisfactory classification accuracy. In doing so, we hoped to provide a practical example of how a CNN-based system performs when applied in medical centers other than the one where it was originally trained.
2.3. Study Design for Multi-Level CNN Model Training and Testing
The training protocol comprised five rounds, and in each round, testing was conducted at four levels on images differing in quality or similarity (as detailed in Table 1). In each test, a three-class classification (normal, COVID-19, and pneumonia) was performed. By training one model through several rounds with augmented datasets and testing its performance on datasets of decreasing similarity, we aimed to (1) select the optimal CNN model, (2) test the model's ability for interpolation and extrapolation, and (3) test the model's ability to learn from new data.
Round 1, referred to as the “baseline training”, incorporated 11,538 radiographic images (baseline, or “Base” for short in Table 1). It was aimed at validating the CNN models (i.e., Level 1) as well as evaluating whether the models could classify new samples that are similar (Level 2), outliers (Level 3), or more strongly dissimilar (Level 4). In doing so, a preliminary evaluation of the models' interpolation and extrapolation capacity was obtained, based on which further training and testing could be designed.
To evaluate the effects of re-training on model performance, five rounds of training were performed in this study with incrementally extended data. The selection of added images was determined by the testing results in Round 1. In Round 2, 25% of the outlier (Level 3) images were added in the hope that the CNN model could learn the inherent features of those outliers and thus improve the classification accuracy of these outlier images. In Round 3, 50% of the outlier images were added. In comparison to Round 2, this round aimed to quantitatively evaluate the performance improvement as a function of added images.
Round 2 and Round 3 training exemplified the processes of model development and optimization within one medical center based on its own data. In contrast, Rounds 4 and 5 represented scenarios in which the automated diagnostic system is used in other medical centers and, when producing sub-optimal performance, is further improved by including new images to re-train the system. Similarly, this two-round training (Rounds 4 and 5) was designed to quantify performance improvement versus added images. Round 4 added 25% of the Dataset 3 training images to the Base plus 50% outliers, while Round 5 added 50% of the Dataset 3 training images (Figure 1b, lowest row).
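As an illustration only, the five training sets could be assembled as follows; the list sizes and path names are hypothetical placeholders for the actual files, and the sampled subsets are drawn independently per round in this sketch:

```python
# Illustrative assembly of the five training rounds (placeholder file lists).
import random

random.seed(0)
base_train = [f"base/img_{i}.png" for i in range(11538)]        # Round 1 baseline images
outlier_train = [f"outlier/img_{i}.png" for i in range(247)]    # outlier images (hypothetical count)
d3_train = [f"dataset3/img_{i}.png" for i in range(6405)]       # Dataset 3 training images (hypothetical count)

def sample(items, fraction):
    """Randomly draw a given fraction of a list."""
    return random.sample(items, int(fraction * len(items)))

rounds = {
    1: base_train,                                                         # baseline only
    2: base_train + sample(outlier_train, 0.25),                           # + 25% outliers
    3: base_train + sample(outlier_train, 0.50),                           # + 50% outliers
    4: base_train + sample(outlier_train, 0.50) + sample(d3_train, 0.25),  # + 25% of Dataset 3
    5: base_train + sample(outlier_train, 0.50) + sample(d3_train, 0.50),  # + 50% of Dataset 3
}
```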
To evaluate the network classification performance, various indices were quantified, including accuracy, sensitivity, specificity, precision, AUC (area under the curve), and the ROC (receiver operating characteristic) curve. For three-class classification (i.e., classes 1, 2, 3), the performance metrics were extended from their binary counterparts. Based on the three-class confusion matrix (blue dashed rectangle) in Figure 1c, the accuracy is the ratio of correct predictions over all predictions (Figure 1c, bottom left). The category-wise metrics (specificity, sensitivity, and precision) were calculated using the equations listed to the right of Figure 1c. For the ROC curve, the One-vs-Rest (OvR) approach was used, e.g., COVID-19 vs. (normal plus pneumonia). In this study, all category-wise metrics are reported for the COVID-19 class unless stated otherwise.
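A short sketch of these computations using scikit-learn is given below; the labels and softmax probabilities are randomly generated stand-ins for the actual model outputs (classes ordered normal = 0, COVID-19 = 1, pneumonia = 2):

```python
# Three-class metrics from a confusion matrix, plus One-vs-Rest ROC-AUC for COVID-19.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=500)            # stand-in ground-truth labels
y_prob = rng.dirichlet(np.ones(3), size=500)     # stand-in softmax probabilities

y_pred = y_prob.argmax(axis=1)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

accuracy = np.trace(cm) / cm.sum()               # correct predictions over all predictions

k = 1                                            # category-wise metrics for the COVID-19 class
tp = cm[k, k]
fn = cm[k, :].sum() - tp
fp = cm[:, k].sum() - tp
tn = cm.sum() - tp - fn - fp
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)

# One-vs-Rest ROC-AUC: COVID-19 vs (normal plus pneumonia)
auc_covid = roc_auc_score((y_true == k).astype(int), y_prob[:, k])
```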
The CNN model training/testing was conducted using an AMD Ryzen 3960X 24-Core workstation with 3.79 GHz processors, 256 GB RAM, and 24 GB GPU (PNY NVIDIA GeForce RTX 3090). For the approximately 12,000 images used in this study, one round of training for one CNN model took approximately 4–7 h.