Article

Hybrid Vision Transformer and Convolutional Neural Network for Multi-Class and Multi-Label Classification of Tuberculosis Anomalies on Chest X-Ray

by Rizka Yulvina 1, Stefanus Andika Putra 1, Mia Rizkinia 1, Arierta Pujitresnani 2,3, Eric Daniel Tenda 3,4, Reyhan Eddy Yunus 3,5, Dean Handimulya Djumaryo 6, Prasandhya Astagiri Yusuf 2,3,* and Vanya Valindria 3
1 Department of Electrical Engineering, Faculty of Engineering, Depok Campus, University of Indonesia, Jakarta 16424, Indonesia
2 Department of Medical Physiology and Biophysics, Faculty of Medicine, University of Indonesia, Jakarta 10430, Indonesia
3 Medical Technology Cluster, Indonesian Medical Education and Research Institute (IMERI), Faculty of Medicine, Universitas Indonesia, Jakarta 10430, Indonesia
4 Division of Respirology and Critical Illness, Department of Internal Medicine, Faculty of Medicine Universitas Indonesia, Dr. Cipto Mangunkusumo National General Hospital, Jakarta 10430, Indonesia
5 Department of Radiology, Faculty of Medicine Universitas Indonesia, Dr. Cipto Mangunkusumo National General Hospital, Jakarta 10430, Indonesia
6 Department of Clinical Pathology, Faculty of Medicine Universitas Indonesia, Dr. Cipto Mangunkusumo National General Hospital, Jakarta 10430, Indonesia
* Author to whom correspondence should be addressed.
Computers 2024, 13(12), 343; https://doi.org/10.3390/computers13120343
Submission received: 21 October 2024 / Revised: 20 November 2024 / Accepted: 25 November 2024 / Published: 17 December 2024

Abstract:
Tuberculosis (TB), caused by Mycobacterium tuberculosis, remains a leading cause of global mortality. While TB detection can be performed through chest X-ray (CXR) analysis, numerous studies have leveraged AI to automate and enhance the diagnostic process. However, existing approaches often focus on partial or incomplete lesion detection, lacking comprehensive multi-class and multi-label solutions for the full range of TB-related anomalies. To address this, we present a hybrid AI model combining vision transformer (ViT) and convolutional neural network (CNN) architectures for efficient multi-class and multi-label classification of 14 TB-related anomalies. Using 133 CXR images from Dr. Cipto Mangunkusumo National Central General Hospital and 214 images from the NIH dataset, we tackled data imbalance with augmentation, class weighting, and focal loss. The model achieved an accuracy of 0.911, a loss of 0.285, and an AUC of 0.510. Given the complexity of handling not only multi-class but also multi-label data with imbalanced and limited samples, the AUC score reflects the challenging nature of the task rather than any shortcoming of the model itself. By classifying the most distinct TB-related labels in a single AI study, this research highlights the potential of AI to enhance both the accuracy and efficiency of detecting TB-related anomalies, offering valuable advancements in combating this global health burden.

1. Introduction

Tuberculosis (TB), caused by Mycobacterium tuberculosis, remains a significant global health challenge and one of the leading causes of death worldwide [1]. Accurate diagnosis is particularly difficult in immunocompromised individuals and pediatric patients, where sample collection for bacterial confirmation is challenging [2]. In Indonesia alone, an estimated 824,000 TB cases were reported, with 97,800 deaths in 2021 [3]. The gold standard for TB diagnosis, bacterial isolation, takes 2 to 3 weeks for results, while the Mantoux test, using purified protein derivative (PPD), requires 48 to 72 h [4]. Faster alternatives, such as polymerase chain reaction (PCR), are costly and less accessible in certain regions [5]. To address these limitations, the World Health Organization (WHO) recommends using chest X-ray (CXR) as a screening tool [6], though its effectiveness is limited by variability in interpretation among radiologists, introducing subjectivity into the process.
Recent advancements in artificial intelligence (AI) have paved the way for computer-aided diagnosis (CAD) systems to automatically detect TB by analyzing CXR images through image segmentation, feature extraction, and classification [7]. Deep learning (DL), a form of machine learning (ML), uses multiple neural network layers to process raw data and has proven highly effective in image classification tasks [8,9,10].
Several recent studies have applied deep learning models to detect TB using CXR images, with significant results. For example, one study achieved an accuracy of 87%, utilizing CNN architectures such as InceptionV3, Xception, ResNet50, VGG19, and VGG16 [7]. However, this study focused solely on binary classification, determining whether a CXR image indicated TB or not. Similarly, a study of deep learning-based classification and semantic segmentation of lung tuberculosis lesions reached 100% accuracy for distinguishing only four lesion types: infiltrations/bronchiectasis and opacity/consolidation [11]. A third study also performed well, achieving 99.29% accuracy using a UNet for lung segmentation and Xception for TB classification [12]. Yet, like other efforts, it primarily focused on binary classification (TB vs. normal) or a very limited range of abnormalities.
While these results are promising, they remain limited in scope, focusing on detecting TB presence or classifying only a few lesion types. For instance, lesions like infiltrations, opacity, and bronchiectasis are often classified, but these approaches fail to address a broader range of TB-related abnormalities or handle the complexity of multi-label classification where several abnormalities can co-exist in a single image.
In this study, we introduce a novel hybrid AI model that combines convolutional neural networks (CNNs) [8] and vision transformers (ViTs) [13] to detect TB anomalies in CXR images through a multi-label classification framework. Our model focuses on 14 distinct TB-related anomalies, a comprehensive range that exceeds previous studies in scope, making this the most extensive multi-label classification effort for TB detection using AI. We tackled data imbalance using augmentation, class weighting, and focal loss to ensure robust performance across all classes.
To further validate the performance of our model, we conducted a comparative analysis against several state-of-the-art CNN architectures, including Inception [14], ResNet [15], EfficientNet [16], VGG [17], and DenseNet [18]. Furthermore, we evaluated vision transformer (ViT) models, including ViT Base and ViT Large [13], to determine their effectiveness in handling the multi-label classification of tuberculosis (TB) abnormalities. This evaluation aims to identify the most effective architecture for accurate and efficient TB anomaly detection, particularly in resource-constrained settings such as Indonesia.
The key contribution of this study lies in its ability to classify 14 distinct TB-related abnormalities in a single AI-driven approach. The labeled abnormalities include (1) Infiltrate, (2) Fibroinfiltrates, (3) Consolidation, (4) Cavity, (5) Pleural Effusion, (6) Fibrosis, (7) Bronchiectasis, (8) Pleural Thickening, (9) Atelectasis, (10) Lymphadenopathy, (11) Pneumothorax, (12) Bullae, (13) Tuberculoma, and (14) Miliary. One of the primary challenges in this study is that two or more abnormalities can occur in a single patient, making multi-label classification essential in addition to traditional multi-class classification.
This approach provides enhanced decision support for radiologists and facilitates faster, more reliable diagnoses in clinical settings, ultimately aiding in more efficient TB management.

2. Materials and Methods

2.1. Datasets

2.1.1. Primary Dataset

The primary datasets utilized in this research were sourced from patient data at the Indonesian Medical Education and Research Institute (IMERI) and Dr. Cipto Mangunkusumo National Central General Hospital (RSCM), in collaboration with the Faculty of Medicine, Universitas Indonesia. The collection of these datasets strictly adhered to ethical guidelines, and the research proposal received ethical approval on 27 July 2020, following an extensive review process (KET797/UN2.F1/ETIK/PPM.00.02/2020). The chest X-rays (CXRs) themselves were performed in 2018. The mean age of patients was 45.17 ± 15.15 (mean ± SD) years, with a gender distribution of 58.33% male and 41.67% female.
The dataset comprises 133 digital chest X-ray (CXR) images, each accompanied by a corresponding CSV file containing evaluations provided by radiologists regarding the patients’ medical conditions. These evaluations were used to assign labels to the images, indicating the presence of various abnormalities that could lead to tuberculosis (TB). It is important to note that the images are labeled descriptively, meaning the abnormalities are documented in the CSV file without specific region-based annotations on the images themselves. Due to ethical considerations, only limited information, such as the year of CXR acquisition, mean patient age, gender distribution, and the labels assigned to each CXR, can be disclosed in this study.
As mentioned earlier, the main challenge of this study lies in the fact that two or more abnormalities can occur in a single patient. Below, Table 1 provides a few examples of data samples and their corresponding labels, demonstrating the variety in the number of abnormalities per patient.

2.1.2. Supplementary Dataset: NIH Chest X-Ray

To enrich the dataset and enhance model training, an additional 214 CXR images were incorporated from the National Institutes of Health (NIH) Chest X-Ray Dataset [19]. This dataset comprises 112,120 X-ray images with disease labels from 30,805 unique patients. Labels were generated using natural language processing (NLP) to text-mine disease classifications from radiology reports, with an expected accuracy of over 90%. Since the NIH data were already provided in PNG format, this allowed for straightforward integration with the RSCM dataset without the need for additional format conversion. However, only a subset of the NIH dataset was used in this study because not all labels needed for this project were available in the NIH dataset. Specifically, we selected only those labels that were present and aligned with our classification needs to avoid increasing the imbalance between the classes.
The NIH dataset was collected at the NIH Clinical Center in the United States and released in 2017. It comprises data from patients with a mean age of 46.63 ± 16.60 (mean ± SD) years, with a gender distribution of 56.49% male and 43.51% female.
Below, Table 2 provides a few examples of data samples from the NIH dataset and their corresponding labels.
For the classes that were underrepresented and not available in the NIH dataset, we applied data augmentation to create additional samples and ensure a better representation of these rare classes. As mentioned below, this strategy, combined with focal loss and class weighting, helped address the challenge of handling imbalanced data and improved model performance.
The evaluation of the NIH and RSCM datasets as separate entities was deemed unnecessary due to the study’s overarching goal of creating a robust classification model for tuberculosis (TB) anomalies. The focus was on developing a unified approach that leverages the combined datasets to enhance the model’s performance and generalizability.
From a visual perspective, as illustrated in Figure 1 (RSCM) and Figure 2 (NIH), both datasets exhibit comparable quality in terms of resolution, clarity, and diagnostic relevance. This similarity ensures that image quality does not introduce any significant bias or variation.
To provide a comprehensive overview of the dataset, Table 3 presents the distribution of the different classes of abnormalities observed within the dataset.
Figure 3 is a bar plot illustrating the distribution of data for each label across the RSCM and NIH datasets. This visualization highlights the count of each label contributed by both datasets, providing insight into data balance and availability for each condition.

2.2. Transformer

The Transformer architecture, introduced by Google in 2017, revolutionized natural language processing (NLP) through its attention mechanism and sequence-to-sequence model [20]. It consists of an encoder–decoder structure, incorporating key components such as multi-head attention, layer normalization, and a multi-layer perceptron (MLP). The multi-head attention mechanism processes inputs in parallel, while layer normalization ensures consistent neuron distribution across layers, and the MLP transforms input into final outputs [20].
Building on the success of the Transformer, the vision transformer (ViT) was developed for computer vision tasks [13]. ViT divides images into patches and processes them similarly to how the Transformer handles sequences. Patches are embedded and passed through an encoder, with a final MLP head generating the model’s output [13].
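As a concrete illustration of this patch-token idea, the following minimal sketch (written in TensorFlow/Keras, which this study uses elsewhere; it is not the authors' implementation) splits an image into non-overlapping patches and linearly projects each one, using the patch size and hidden size of the ViT Base configuration as assumed defaults.

```python
# Minimal sketch (not the authors' implementation): splitting an image into patch
# tokens for a ViT-style encoder. Patch size 16 and hidden size 768 follow the
# ViT Base defaults and are assumptions for illustration.
import tensorflow as tf
from tensorflow.keras import layers

PATCH_SIZE = 16
EMBED_DIM = 768

def image_to_patch_tokens(images):
    """Split a batch of images into flattened, non-overlapping patches."""
    patches = tf.image.extract_patches(
        images=images,
        sizes=[1, PATCH_SIZE, PATCH_SIZE, 1],
        strides=[1, PATCH_SIZE, PATCH_SIZE, 1],
        rates=[1, 1, 1, 1],
        padding="VALID",
    )
    batch = tf.shape(images)[0]
    num_channels = images.shape[-1]
    return tf.reshape(patches, [batch, -1, PATCH_SIZE * PATCH_SIZE * num_channels])

# Each flattened patch is linearly projected to the embedding dimension; a learnable
# position embedding and a class token would then be added before the encoder.
project_patches = layers.Dense(EMBED_DIM)
tokens = project_patches(image_to_patch_tokens(tf.zeros([1, 224, 224, 3])))  # (1, 196, 768)
```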
Both ViTs and CNNs are widely used in computer vision but differ in their approach. CNNs use convolutional layers to extract spatial features and pooling layers for information aggregation [8]. This hierarchical structure helps CNNs excel at capturing local patterns and spatial relationships, making them effective for tasks like image classification and object detection [21].

2.3. Focal Loss

Focal loss is a loss function designed to address class imbalance in tasks like object detection and multi-label classification by focusing on hard-to-classify examples [22]. It incorporates a modulating factor and a focusing parameter, which downweight well-classified examples and emphasize misclassified ones during training. Unlike binary cross-entropy (BCE) [23], which treats all errors equally, focal loss shifts attention to minority classes, improving model performance on imbalanced datasets.
Focal loss plays a crucial role in handling the significant class imbalance in this multi-label classification of TB-related anomalies task. Since some anomalies are underrepresented in the dataset, focal loss ensures that the model is not overwhelmed by the majority of classes. Dynamically adjusting the loss contribution allows the model to focus more on challenging cases, such as those with rare labels, leading to better classification performance across all anomalies.
The mechanism involves applying a modulating factor that reduces the impact of well-classified examples and increases the focus on misclassified ones, helping the model learn from the harder examples. This dynamic adjustment helps improve the model’s ability to generalize, especially in imbalanced scenarios like this project, where certain TB-related anomalies are far less frequent than others [22].
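To make this mechanism concrete, the following is a minimal sketch of a binary focal loss for multi-label sigmoid outputs in TensorFlow/Keras; the α and γ values are common illustrative defaults from the focal loss paper [22], not necessarily the values tuned in this study.

```python
# Minimal sketch of binary focal loss for multi-label outputs (TensorFlow/Keras).
# alpha and gamma are illustrative defaults, not the values used in this study.
import tensorflow as tf

def focal_loss(alpha=0.25, gamma=2.0):
    def loss_fn(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        # p_t is the predicted probability assigned to the true state of each label.
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        # The (1 - p_t)^gamma modulating factor downweights well-classified labels,
        # shifting attention to harder, often minority-class, examples.
        return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
    return loss_fn
```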

2.4. Class Weight

Class weighting is a commonly used technique in machine learning to address class imbalance, particularly in classification tasks where certain classes are underrepresented. This approach assigns higher weights to minority classes and lower weights to majority classes, ensuring the model does not become biased toward the majority class during training. By giving more importance to minority classes, class weighting helps improve model performance on underrepresented data [24,25].
In this project, class weighting complements the use of focal loss. While focal loss targets difficult-to-classify examples by adjusting the loss based on prediction difficulty, class weighting ensures that the model does not overlook minority classes. Together, these techniques allow the model to effectively manage multi-class and multi-label tasks, where some labels are rare. For example, in the dataset used for this project, some anomalies like bullae are rare, while others like infiltrate are more frequent. Class weighting increases the importance of correctly classifying rare anomalies, while focal loss ensures the model focuses on challenging cases across all labels [22].
The class weights are calculated based on the inverse frequency of each class, giving more weight to less frequent classes. This strategy, combined with focal loss, enhances the model’s ability to generalize, improving performance on rare TB-related anomalies while maintaining accuracy for more common ones [24].
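As an illustration of the inverse-frequency idea, the sketch below (assuming a multi-hot label matrix of shape samples × labels) computes one weight per label; the exact normalization used in this study may differ.

```python
# Minimal sketch: per-label weights proportional to the inverse frequency of
# positive samples in a multi-hot label matrix y of shape [n_samples, n_labels].
import numpy as np

def inverse_frequency_weights(y):
    n_samples, n_labels = y.shape
    positives = y.sum(axis=0)                     # positive count per label
    positives = np.clip(positives, 1, None)       # avoid division by zero for empty labels
    weights = n_samples / (n_labels * positives)  # rarer labels receive larger weights
    return {i: float(w) for i, w in enumerate(weights)}
```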

2.5. Data Augmentation

Data augmentation is a widely used technique to artificially expand the size of a dataset by applying transformations to existing data, creating variations that simulate different conditions. This method is particularly useful in dealing with imbalanced datasets, as it helps increase the representation of minority classes and reduces the risk of overfitting in models with limited data. In medical imaging, especially with chest X-rays (CXRs), data augmentation can introduce subtle variations that simulate different imaging conditions, enhancing the diversity of the training set and helping models become more robust [26].
For this project, we used the ImageDataGenerator from Keras to apply various augmentation techniques, including rotation (up to 15 degrees), small width and height shifts, zooming (up to 10%), horizontal flipping, and brightness adjustments (within a range of 0.8 to 1.2). These transformations help simulate different angles, slight positional variations, and different lighting conditions, while ensuring the anatomical integrity of the images is preserved [27].
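The settings described above translate into an ImageDataGenerator configuration roughly like the following sketch; the shift magnitudes are illustrative values, since only "small" width and height shifts are specified.

```python
# Augmentation settings matching those described above (Keras ImageDataGenerator).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,            # rotations up to 15 degrees
    width_shift_range=0.05,       # small horizontal shifts (illustrative value)
    height_shift_range=0.05,      # small vertical shifts (illustrative value)
    zoom_range=0.10,              # zooming up to 10%
    horizontal_flip=True,
    brightness_range=[0.8, 1.2],  # brightness adjustments within 0.8 to 1.2
)
```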
Data augmentation plays a crucial role in addressing the multi-class, multi-label imbalance present in TB-related anomalies. By augmenting rare classes, we effectively increase the size of these subsets, ensuring that the model does not overfit to the limited samples and can generalize better across all anomalies [28]. Given the multi-class and multi-label nature of the project, we applied augmentation carefully. For cases where images contain both rare labels (e.g., bullae) and frequent labels (e.g., infiltrate), we ensured that augmentation was applied proportionally to avoid excessively increasing the representation of the already dominant classes. This ensures that the augmentation strategy benefits the rare classes while maintaining balance across the dataset.
The augmentation process complements class weighting and focal loss, which handle class imbalance by adjusting the model’s focus toward underrepresented and harder-to-classify examples. Together, these techniques enhance the model’s ability to handle the imbalanced, multi-label nature of the dataset, improving its capacity to accurately detect TB-related anomalies.

2.6. Data Storage and Computational Environment

In accordance with the regulations set by the Indonesian Ministry of Health (Kementerian Kesehatan), medical data, including electronic medical records, must be stored and processed within Indonesia to ensure data confidentiality, integrity, and accessibility. This regulation mandates that all health data must remain on data servers within the country, and access is restricted only to authorized healthcare professionals to protect patient privacy and security [29].
To comply with these regulations, all datasets used in this research are securely stored in the Cloud Storage of the Big Data Center IMERI (bdc-imeri-idealab.ui.ac.id/). The model development, training, and processing were conducted using the Cloud Computing Analytics platform provided by Big Data Center IMERI. This ensures that the entire data pipeline, including storage, computation, and model training, occurs within Indonesian servers, adhering to the legal requirements in Indonesia.
By using this computational environment, this project ensures that sensitive medical data are securely processed and complies with Indonesian health regulations regarding data privacy and security.

2.7. Proposed Method

2.7.1. Pre-Processing

To ensure optimal data preparation for the Python framework, a comprehensive pre-processing method was applied to the CXR digital images in this study. The process began by converting the images from NIfTI format to PNG format using the SimpleITK library. It was identified that some images had incorrect rotations, prompting a rotation correction step to ensure consistency. Additionally, the images were converted to grayscale, simplifying the input data by removing color information while retaining necessary visual details. This pre-processing, which involved rotation correction, grayscale conversion, and format transformation, ensured the images were standardized and suitable for further analysis in the research pipeline.
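The image pre-processing described above can be sketched as follows; this is an illustrative reconstruction rather than the study's exact script, with the rotation correction shown as a simple manual 90-degree rotation flag and file paths as placeholders.

```python
# Minimal sketch of the image pre-processing step: NIfTI-to-PNG conversion with
# SimpleITK, optional rotation correction, and 8-bit grayscale normalization.
import numpy as np
import SimpleITK as sitk
from PIL import Image

def nifti_to_png(nifti_path, png_path, rotate_k=0):
    image = sitk.ReadImage(nifti_path)
    array = sitk.GetArrayFromImage(image)           # (slices, H, W) or (H, W)
    array = np.squeeze(array).astype(np.float32)
    if rotate_k:                                     # manual rotation correction, if needed
        array = np.rot90(array, k=rotate_k)
    # Rescale intensities to 8-bit grayscale before saving as PNG.
    array = (array - array.min()) / max(array.max() - array.min(), 1e-8) * 255.0
    Image.fromarray(array.astype(np.uint8), mode="L").save(png_path)
```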
Similarly, the CSV data containing the radiologist’s evaluations underwent preprocessing to convert the evaluations into labels suitable for utilization in a classification model. This transformation process, borrowed from the concept of natural language processing (NLP), involved mapping the textual evaluations to corresponding labels. By applying NLP techniques, the textual evaluations were converted into structured labels, enabling their integration into the classification model.
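As a simplified illustration of this mapping step, the sketch below converts a free-text evaluation into a multi-hot label vector by keyword matching; the actual NLP pipeline and keyword dictionary used in the study are not reproduced here.

```python
# Illustrative sketch: mapping a radiologist's free-text evaluation to a multi-hot
# label vector. The keyword list simply reuses the 14 study labels; the study's
# actual text-to-label mapping is more involved.
LABELS = ["Infiltrate", "Fibroinfiltrates", "Consolidation", "Cavity",
          "Pleural Effusion", "Fibrosis", "Bronchiectasis", "Pleural Thickening",
          "Atelectasis", "Lymphadenopathy", "Pneumothorax", "Bullae",
          "Tuberculoma", "Miliary"]

def evaluation_to_multi_hot(evaluation_text):
    text = evaluation_text.lower()
    return [1 if label.lower() in text else 0 for label in LABELS]
```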
Additionally, data from the NIH Chest X-ray Dataset, which were already in PNG format, were integrated into the dataset. The NIH dataset required minimal pre-processing, as its images were already in the correct format. The main task involved ensuring uniformity in label naming conventions between the NIH and the existing dataset. This process involved adjusting the labels from the NIH dataset to match the format and terminology used in the main dataset, and then appending the corresponding labels to the CSV file for a consistent classification model.
The pre-processing stage played a critical role in refining the data and facilitating its compatibility with the subsequent analysis. Through careful transformations and adjustments, the CXR images were prepared for efficient processing within the Python framework, while the doctor’s evaluations and NIH data were transformed into informative labels for effective classification modeling.

2.7.2. Architecture

This work employs a novel hybrid model, combining convolutional neural network (CNN) and vision transformer (ViT), as illustrated in Figure 4.
In this hybrid model, the input image was first processed by the CNN, which performs convolutional operations to extract local features and generate a feature map. This feature map was then passed to the ViT model, specifically the transformer encoder, which utilizes self-attention mechanisms to capture global contextual information and relationships between features. Depending on the selected configuration, the ViT model processed the feature map using different transformer architectures, varying in terms of depth, hidden size, and number of attention heads. Table 4 outlines the configurations for the ViT Base, ViT Large, and ViT Huge models, highlighting their structural differences and potential impacts on model performance. By combining local and global features, the hybrid model can effectively capture both fine-grained details and high-level contextual information, resulting in enhanced image representation and classification accuracy.
To adapt the ViT model for image classification, the input image was divided into patches, and each patch was treated as a token for processing by the transformer encoder. This enabled the ViT model to leverage its attention mechanisms to capture dependencies and relationships between image patches. The hybrid model also included additional layers, such as dense layers, to perform the final classification based on the combined features of the CNN and ViT components.
The advantage of the proposed hybrid model was expected to come from leveraging the complementary strengths of both CNNs and ViTs. CNNs are well-suited for capturing local and spatial features in images, while ViTs are designed to capture global context and long-range dependencies. By integrating these capabilities, the hybrid model is anticipated to improve performance in image classification tasks, potentially offering a more comprehensive understanding of the visual content and enabling more accurate predictions.
The hybrid CNN-ViT model in this theoretical framework could represent a promising strategy for improving performance, robustness, and accuracy in computer vision applications by combining the strengths of both architectures.
In this model, the output from the CNN’s convolutional layers, which extract a feature map from the input image, was used as input to the ViT transformer encoder. The encoder applied attention mechanisms to process these data, allowing the model to focus on different parts of the image for more effective classification. To support the classification of 14 distinct tuberculosis-related anomalies along with the “normal” class, a dense layer containing 15 neurons was incorporated into the model, paired with dropout layers.
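To make this data flow concrete, the following minimal Keras sketch wires an EfficientNetV2L feature map into a small transformer encoder and a 15-neuron sigmoid head with dropout, as described above; the token dimension, number of encoder blocks, and input size are illustrative assumptions rather than the study's exact configuration.

```python
# Minimal sketch of the hybrid idea: a CNN backbone produces a feature map whose
# spatial positions are treated as tokens for a transformer encoder, followed by a
# 15-way sigmoid head. Dimensions and block counts are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 15     # 14 TB-related anomalies plus the "normal" class
EMBED_DIM = 256      # illustrative token dimension

def transformer_block(x, num_heads=8, mlp_dim=512, dropout=0.1):
    # Pre-norm multi-head self-attention with a residual connection.
    h = layers.LayerNormalization(epsilon=1e-6)(x)
    h = layers.MultiHeadAttention(num_heads=num_heads, key_dim=EMBED_DIM // num_heads)(h, h)
    x = layers.Add()([x, layers.Dropout(dropout)(h)])
    # Position-wise MLP with a residual connection.
    h = layers.LayerNormalization(epsilon=1e-6)(x)
    h = layers.Dense(mlp_dim, activation="gelu")(h)
    h = layers.Dense(EMBED_DIM)(h)
    return layers.Add()([x, layers.Dropout(dropout)(h)])

def build_hybrid(input_shape=(384, 384, 3)):
    inputs = layers.Input(shape=input_shape)
    backbone = tf.keras.applications.EfficientNetV2L(
        include_top=False, weights="imagenet", input_shape=input_shape)
    fmap = backbone(inputs)                              # CNN feature map (H', W', C)
    tokens = layers.Reshape((-1, fmap.shape[-1]))(fmap)  # flatten spatial grid into tokens
    tokens = layers.Dense(EMBED_DIM)(tokens)             # project channels to token dimension
    for _ in range(4):                                    # a few encoder blocks (illustrative)
        tokens = transformer_block(tokens)
    x = layers.GlobalAveragePooling1D()(tokens)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="sigmoid")(x)  # multi-label head
    return tf.keras.Model(inputs, outputs)
```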
The following experiments were designed to thoroughly evaluate the performance of CNN and ViT models independently, as well as in a hybrid configuration. The experiments focused on identifying the best-performing architectures from each group and investigating how combining their strengths in a hybrid approach could lead to improved performance:
  • CNN-based Experiments: In this experiment, five CNN transfer learning architectures—Inception ResNet V2, EfficientNet V2L, VGG16, Xception, and DenseNet-201—were evaluated to determine the best-performing CNN model. Binary cross-entropy was used as the loss function. The CNN model with the highest performance metrics was selected for subsequent integration into a hybrid CNN-ViT architecture in later experiments.
  • ViT-based Experiments: In this experiment, Vision Transformer (ViT) models, specifically ViT Base and ViT Large, were assessed independently. Similar to the CNN-based experiment, binary cross-entropy was used as the loss function. The objective was to identify the most effective ViT model, which would then be used in combination with the best CNN model to create a hybrid architecture in the following experiment.
  • Hybrid-based Experiment: In this experiment, the best-performing CNN model from the CNN-based experiment was combined with the ViT Base model to create a hybrid CNN-ViT architecture. To address class imbalance, focal loss was implemented, and class weighting was applied to emphasize underrepresented classes. The hybrid model was fine-tuned by adjusting several key parameters, including 50 epochs; the Adam optimizer; batch sizes of 4, 8, and 16; and a learning rate of 0.0001. The Adam optimizer stabilized gradient descent, while the low learning rate allowed for smaller, precise updates. Varying batch sizes were tested to balance computational efficiency with model performance. This setup aimed to optimize the hybrid model for accuracy and efficiency, ensuring robust performance across varying conditions; a minimal configuration sketch is given after this list.
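A minimal configuration sketch corresponding to the hybrid experiment settings above might look as follows; build_hybrid, focal_loss, and inverse_frequency_weights refer to the illustrative sketches introduced earlier, the training arrays are placeholders, and the way Keras applies class_weight to multi-label targets is a simplification of the weighting described in Section 2.4.

```python
# Illustrative compile/fit call for the hybrid experiment settings listed above
# (Adam, learning rate 0.0001, 50 epochs, batch size 8, focal loss, class weights).
# x_train, y_train, x_val, y_val are placeholder arrays, not the study's data.
import tensorflow as tf

model = build_hybrid()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=focal_loss(),
    metrics=["accuracy", tf.keras.metrics.AUC(multi_label=True)],
)
history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=50,
    batch_size=8,
    class_weight=inverse_frequency_weights(y_train),
)
```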

3. Results

This section presents the results obtained from the different experiments, highlighting the key findings and observations derived from the study. The experiments were conducted in a high-performance computing environment with a GPU equipped with 40 GB of VRAM, capable of handling complex computations efficiently, especially for deep learning tasks involving large models like the hybrid EfficientNetV2L-ViT Base.

3.1. CNN-Based Experiments

As shown in Table 5 below, EfficientNetV2L showed an excellent combination of accuracy (0.870), the lowest loss (0.323), and a respectable AUC score (0.460). While VGG16 slightly outperformed EfficientNetV2L in accuracy (0.871) and had a comparable loss (0.331), its AUC score (0.454) was lower. Additionally, VGG16 has fewer parameters (14.9 million) but does not provide the level of detail and feature extraction efficiency that EfficientNetV2L achieves, justifying the latter's higher parameter count (118 million). Similarly, DenseNet-201, with a higher AUC (0.500), had lower accuracy (0.855) and the highest loss (0.370), making it less desirable for this specific task where minimizing loss is critical.
The images in Figure 5 present the prediction outcomes from the EfficientNetV2L model. Each image includes a comparison between the actual diagnosis and the predicted diagnosis, along with the model's confidence level for each predicted label. This analysis demonstrates the model's ability to detect various TB-related abnormalities from chest X-ray (CXR) images.
(a) The ground truth is normal, and the model correctly predicts it with a confidence of 33%. The prediction aligns with the actual diagnosis, though the confidence could be higher.
(b) The ground truth is consolidation, and the model predicts it with 44% confidence. The model correctly identifies the abnormality, demonstrating its potential in recognizing consolidation.
(c) The ground truth includes fibrosis, bronchiectasis, and pneumothorax. The model predicts these conditions with confidences of 29% (fibrosis), 35% (bronchiectasis), and 15% (pneumothorax). These predictions reflect a reasonable overlap with the ground truth.
(d) The ground truth includes consolidation and pleural thickening. The model predicts consolidation with 44% confidence and pleural thickening with 27% confidence, which aligns well with the ground truth.

3.2. ViT-Based Experiments

As shown in Table 6, ViT Base strikes an ideal balance between performance and computational efficiency. While ViT Large showed higher accuracy (0.879) and AUC (0.565), the massive increase in parameters (306 million) makes it less practical for the project's needs. In contrast, ViT Base, with 88.9 million parameters, performed very well, with an accuracy of 0.874, a loss of 0.327, and an AUC of 0.500. This makes it a more computationally efficient option for global feature extraction, especially when considering the trade-off between computational demand and model performance.
The reduced parameter count in ViT Base directly contributes to its lower memory usage, which is critical when deploying models in resource-limited environments or in real-time processing scenarios. The performance gap between ViT Base and ViT Large is relatively minor, making ViT Base the more practical choice for balancing accuracy and efficiency. This balance is particularly relevant for medical imaging tasks such as tuberculosis anomaly detection, where speed and accuracy are both crucial for effective diagnostics.
Figure 6 shows the prediction results of the ViT Base model, demonstrating its ability to classify normal and tuberculosis (TB) anomalies in chest X-rays. Each label is accompanied by a confidence score, highlighting the model's certainty for each predicted abnormality.
(a) The actual diagnosis is consolidation, and the model predicts it with 38% confidence.
(b) The actual diagnosis is miliary tuberculosis, and the model predicts miliary TB with 14% confidence.
(c) The ground truth includes consolidation, fibrosis, and pleural thickening. The model predicts these with confidences of 53% (fibrosis), 58% (consolidation), and 41% (pleural thickening), all of which align closely with the actual diagnosis.
(d) The actual diagnosis is fibrosis and bullae, and the model predicts fibrosis with 45% confidence and bullae with 13%. The prediction covers both conditions but is less confident about bullae.
Compared to InceptionResNet V2 (accuracy of 0.870, loss of 0.330, and AUC of 0.432) and Xception (accuracy of 0.866, loss of 0.343, and AUC of 0.475), both EfficientNetV2L and ViT Base offer better overall performance with lower losses and higher AUC scores, making them the ideal candidates for integration into a hybrid architecture.
In conclusion, EfficientNetV2L was chosen for its superior handling of local features with low loss and strong accuracy, while ViT Base offered efficient global feature extraction with better AUC and computational efficiency. Together, they form a robust foundation for the hybrid model, balancing computational demand with high performance across key metrics.

3.3. Hybrid-Based Experiment

The results of this experiment, which combined EfficientNetV2L and ViT Base, are shown in Table 7 and reveal important insights into how different batch sizes affect model performance across 50 epochs with a learning rate of 0.0001.
The hybrid model using EfficientNetV2L and ViT Base showed the best performance with a batch size of 8 due to its strong balance between accuracy and loss. With an accuracy of 0.911, the lowest loss of 0.285, and a competitive AUC score of 0.510, this batch size allows for effective learning while avoiding issues like overfitting or underfitting. The balance achieved here is crucial for complex multi-label tasks like tuberculosis classification, where the model needs to process both local and global image features efficiently.
When comparing this performance to other batch sizes, batch size 4 yielded lower accuracy (0.833) and a much higher loss (0.531). This suggests that smaller batch sizes may lead to noisier updates and less stable training, as seen in the higher loss figure, which indicates that the model struggles to converge. Conversely, batch size 16 saw a slight drop in accuracy (0.884) and an increase in loss (0.341). Larger batch sizes can lead to less frequent weight updates, which slows convergence and can lead to suboptimal performance, as evidenced by the increase in both loss and lower AUC (0.480).
The batch size of 8 strikes the right balance between frequent updates (allowing for faster learning) and stability (ensuring better convergence). This balance is particularly important for training deep models like the hybrid EfficientNetV2L-ViT.
The model’s strong performance was also attributed to the fine-tuning of other hyperparameters, such as the learning rate of 0.0001, which ensured that the model makes smaller, more precise weight updates, leading to better convergence. Furthermore, class weighting plays a critical role in addressing the class imbalance in the dataset, ensuring that underrepresented classes receive appropriate attention during training. This is especially relevant for multi-label tasks, where some labels (e.g., rare TB-related anomalies) are significantly underrepresented. By giving more weight to these classes, the model avoids bias toward the more common labels.
Finally, the use of focal loss further improves model performance by focusing on harder-to-classify examples. In multi-label tasks where labels overlap or are imbalanced, focal loss ensures that the model focuses on correctly classifying difficult cases, reducing the chance of misclassification. This tuning, combined with the optimal batch size of 8, allows the hybrid model to effectively handle the complex task of TB detection, maximizing accuracy and minimizing loss.
Following the results of this experiment, as shown in Figure 7, the hybrid model showed significant improvements in multi-label classification accuracy. The predictions from the hybrid model demonstrate notable improvements compared to the individual results of the EfficientNetV2L and ViT Base models. The hybrid architecture effectively combines the strengths of both models, particularly in multi-label classification.
First, the hybrid model showed enhanced accuracy in multi-label prediction, as shown in Figure 8. For example, in cases with multiple abnormalities such as infiltrate and lymphadenopathy, the hybrid model balanced the confidence levels between the labels, predicting infiltrate with 46% and lymphadenopathy with 24% confidence. This represents a significant improvement in consistency compared to EfficientNetV2L, which struggled with lower confidence when predicting multiple labels.
Additionally, the hybrid model provided more balanced predictions for complex cases involving overlapping conditions like infiltrate, consolidation, and bronchiectasis. The model predicted these conditions with greater confidence: infiltrate at 46%, consolidation at 41%, and bronchiectasis at 36%. This shows a better integration of local and global features, addressing the limitations of the individual models.
In terms of detecting pleural thickening and consolidation, the hybrid model performed significantly better than the standalone models. For instance, in one case, it predicted consolidation at 41% and pleural thickening at 32% with greater accuracy and balance. This improved detection of subtle abnormalities can be attributed to the hybrid model's ability to extract both fine-grained and high-level features effectively.
Moreover, the hybrid model maintained strong performance on normal cases, predicting normal at 34% with balanced confidence. This consistency across both abnormal and normal cases demonstrates the model's robustness, reducing the likelihood of false positives, which is crucial for clinical accuracy.

3.4. Evaluation Metrics Analysis of EfficientNetV2L, ViT Base, and Hybrid Model

The performance of the three models—EfficientNetV2L, ViT Base, and the Hybrid EfficientNetV2L-ViT Base model—was evaluated using key metrics: Accuracy, AUC, Precision, Recall, F1 Score, and Hamming Loss. Table 8 provides a detailed comparison, highlighting the strengths and weaknesses of each model.
Accuracy: The accuracy scores across the models reflect the complexity of the multi-label classification task. The Hybrid EfficientNetV2L-ViT Base model achieved the highest accuracy at 0.911, outperforming both EfficientNetV2L (0.870) and ViT Base (0.874). This improvement demonstrates the hybrid model’s capability to leverage the strengths of CNN and Transformer architectures for enhanced performance in recognizing complex patterns of TB-related anomalies.
AUC: Although the AUC scores remain modest due to the challenges of multi-label classification with imbalanced data, the Hybrid model recorded an AUC of 0.510, showing a slight improvement over ViT Base (0.500) and EfficientNetV2L (0.460). This indicates that the hybrid approach is marginally more effective at differentiating between classes, likely due to the combination of EfficientNetV2L’s local feature extraction and ViT’s capacity for global context representation.
Precision and Recall: Precision values were generally low, with the Hybrid model scoring 0.181, lower than both ViT Base (0.319) and EfficientNetV2L (0.238). This low precision across models indicates a tendency for false positives, common in multi-label tasks with overlapping classes. However, in terms of recall, the Hybrid model achieved a perfect score of 1.000, significantly higher than ViT Base (0.485) and EfficientNetV2L (0.673). This reflects the Hybrid model's effectiveness in capturing all relevant cases, an essential trait in medical diagnostics where minimizing false negatives is critical.
F1 Score: The F1 score, which balances precision and recall, highlights the trade-off between false positives and false negatives. The Hybrid model achieved an F1 score of 0.301, surpassing ViT Base (0.217) while falling slightly below EfficientNetV2L (0.317). This score reflects the Hybrid model's ability to maintain a balanced performance, essential for handling multi-label, imbalanced datasets.
Hamming Loss: Hamming Loss, which quantifies the fraction of incorrect labels, was lowest for the Hybrid model at 0.326, followed by ViT Base (0.464) and EfficientNetV2L (0.526). The lower Hamming Loss for the Hybrid model indicates fewer label misclassifications, further supporting its suitability for this multi-label classification task.
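For reference, metrics of this kind can be computed for multi-label predictions roughly as in the following scikit-learn sketch; the 0.5 decision threshold and the micro/macro averaging choices are assumptions, as the study does not specify them here.

```python
# Minimal sketch of multi-label evaluation metrics with scikit-learn. y_true is the
# multi-hot ground truth, y_prob the sigmoid outputs; the 0.5 threshold and the
# averaging modes are illustrative assumptions.
from sklearn.metrics import (accuracy_score, roc_auc_score, precision_score,
                             recall_score, f1_score, hamming_loss)

def multilabel_report(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true.ravel(), y_pred.ravel()),  # label-wise accuracy
        "auc": roc_auc_score(y_true, y_prob, average="macro"),       # requires both classes per label
        "precision": precision_score(y_true, y_pred, average="micro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="micro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="micro", zero_division=0),
        "hamming_loss": hamming_loss(y_true, y_pred),
    }
```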

3.5. Performance and Parameter of Models

Table 9 provides an overview of the model parameters, training time (in minutes), and loss for each of the models: EfficientNetV2L, ViT Base, and the hybrid EfficientNetV2L-ViT Base.
The hybrid model, with a total of 177,975,727 parameters, combines the advantages of both EfficientNetV2L and ViT Base architectures, resulting in a training time of 23.85 min and the lowest loss of 0.2854 among the models tested. This demonstrates that, despite its increased parameter count, the hybrid model achieves a favorable balance between performance and efficiency, surpassing the standalone models in terms of both accuracy and loss.
The EfficientNetV2L model, while robust with 118,410,415 parameters, required a longer training time of 35.57 min, which reflects its high computational demand. Although it achieved a reasonably low loss of 0.323, it did not outperform the hybrid model.
The ViT Base model, with 88,903,415 parameters, was the most lightweight among the three models and had the shortest training time at 4.51 min. However, this efficiency comes at a cost in terms of performance, as it recorded a slightly higher loss of 0.327, indicating that it is less effective in handling complex multi-label classifications compared to the hybrid model.
The hybrid EfficientNetV2L-ViT Base model demonstrates the best trade-off between training time and performance, achieving a balance between parameter complexity and effectiveness in multi-label classification. Future improvements could explore further optimization in architecture to reduce the computational cost while maintaining high accuracy and low loss.

3.6. Comparative Analysis of Hybrid Model Variants: The Role of Focal Loss and Class Weight

This section compares the performance and efficiency of three hybrid model variants: the hybrid model with both focal loss and class weight, the hybrid model without focal loss, and the hybrid model without class weight. The analysis highlights the importance of these techniques in addressing class imbalance and improving multi-class, multi-label classification tasks.

3.6.1. Performance Metrics Comparison

Table 10 shows the performance metrics of the three model variants in terms of accuracy, AUC, precision, recall, F1 score, and Hamming loss. The hybrid model with both focal loss and class weight achieves the best overall performance, particularly in recall (1.000), which is critical for ensuring no anomalies are missed in clinical applications.
  • Hybrid without Focal Loss: Precision drops to 0.120, while recall decreases significantly to 0.712. This indicates that, without focal loss, the model struggles to identify minority classes effectively, leading to more false negatives and a lower F1 score.
  • Hybrid without Class Weight: Precision improves slightly (0.150), but recall remains low (0.750) compared to the default model. Hamming loss increases, showing the model’s decreased ability to handle multi-label predictions effectively.

3.6.2. Efficiency Metrics Comparison

Table 11 compares the computational efficiency of the models in terms of training time and loss.
  • Training Time: Removing focal loss or class weight slightly reduces training time due to simplified loss calculations, but the performance trade-off is significant.
  • Loss: The default hybrid model has the lowest training loss (0.285), while the other two variants show higher losses, indicating less effective optimization.

3.6.3. Importance of Focal Loss and Class Weight

Focal Loss: Focal loss helps tackle the class imbalance issue by assigning higher penalties to misclassified samples of minority classes. This makes the model focus more on learning difficult or underrepresented classes, leading to significantly better recall and a more balanced F1 score. As seen in the results, removing focal loss caused a drastic reduction in recall and overall model performance, particularly for rare anomalies.
Class Weight: Class weight addresses the imbalance by adjusting the contribution of each class to the total loss, ensuring that minority classes are not overshadowed by majority ones during training. Without class weighting, the model shows lower performance across all metrics, especially in recall and Hamming loss. This underscores its importance in multi-label scenarios, where balancing predictions across multiple classes is critical.
Clinical Implications: In clinical contexts, false negatives can have severe consequences, such as missing critical diagnoses. The default hybrid model with focal loss and class weight demonstrates its superiority by minimizing false negatives (high recall) and ensuring better overall model performance. Additionally, reducing Hamming loss is essential in multi-label classification to avoid misclassifications that could lead to unnecessary or incorrect medical interventions.
The comparative analysis clearly demonstrates the crucial role of focal loss and class weight in improving both the performance and reliability of the hybrid model. These techniques effectively address class imbalance, enhance multi-label classification performance, and are vital for deploying models in sensitive domains such as healthcare.

3.7. Separated Confusion Matrices for Positive and Negative Predictions

The confusion matrices, as shown in Figure 9, provide a detailed look at the model’s performance in predicting TB-related abnormalities across various classes. The matrix on the left represents positive predictions (True Positives, False Positives, True Negatives), while the matrix on the right represents negative predictions (True Negatives, False Negatives, True Positives).
  • High True Positives (TP) for Common Conditions: Conditions like infiltrate, pleural effusion, and fibrosis show relatively high true positive (TP) rates, indicating the model’s effective detection of these conditions. This suggests that the model is well-calibrated to recognize these more common TB-related abnormalities, which might have more distinct features that the model can capture reliably.
  • Low False Positives (FP) Across Classes: The confusion matrix reveals that false positives (FP) are relatively low across most classes, indicating the model’s conservative approach in identifying anomalies. This low FP rate is beneficial for clinical application, as it reduces the chances of misclassifying healthy patients as having TB-related abnormalities.
  • False Negatives (FN) Impacting Sensitivity for Rare Anomalies: In the negative predictions matrix (right), certain classes like cavity, lymphadenopathy, and tuberculoma show notable numbers in the FN column. This suggests that the model is less sensitive to these rare conditions, possibly due to their lower representation in the dataset, leading to a higher likelihood of missing these anomalies. Addressing this issue would require either data augmentation or more targeted training to improve detection sensitivity for these less common conditions.
  • Balanced Prediction for Conditions with Lower Complexity: For simpler and more distinct classes, such as normal and consolidation, the model maintains a good balance between TPs and low FNs, reflecting reliable performance in these categories. This balance is crucial for the model’s utility in clinical settings, where it must accurately differentiate between normal and abnormal cases.
  • Challenges with Overlapping Features: Some conditions, such as fibroinfiltrate and bronchiectasis, exhibit moderate FNs. This could be due to the overlapping nature of their radiographic features with other conditions, leading to occasional misclassifications. This observation highlights a challenge in multi-label classification tasks where conditions have similar radiological characteristics, which can confuse the model.
The confusion matrices indicate that the hybrid model is effective in detecting common TB-related abnormalities but has limitations in detecting rare or complex conditions. Enhancing the model’s sensitivity to these conditions may require balanced datasets and additional feature refinement, especially for overlapping or visually similar anomalies.
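Per-class matrices of this kind can be produced with scikit-learn's multilabel_confusion_matrix, as in the brief sketch below; y_true and y_pred are assumed to be multi-hot arrays and LABELS the class list defined earlier.

```python
# Sketch of per-class confusion counts for a multi-label problem; scikit-learn
# returns one 2x2 matrix [[TN, FP], [FN, TP]] per class.
from sklearn.metrics import multilabel_confusion_matrix

def per_class_confusion(y_true, y_pred, labels):
    matrices = multilabel_confusion_matrix(y_true, y_pred)
    for name, m in zip(labels, matrices):
        tn, fp, fn, tp = m.ravel()
        print(f"{name}: TP={tp} FP={fp} FN={fn} TN={tn}")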

3.8. Single Confusion Matrix with True Classes and Predicted Classes

The confusion matrix, as shown in Figure 10, provides a detailed assessment of the hybrid model’s performance across various TB-related abnormalities. Here is a breakdown of key observations:
  • True Positives for Common Classes: Classes like pleural effusion, fibrosis, and infiltrate exhibit relatively high numbers in the diagonal elements, indicating that the model is effectively identifying these common TB-related abnormalities. This suggests that the model has learned to recognize specific features associated with these conditions accurately.
  • High False Negatives in Rare Classes: Rare conditions such as lymphadenopathy, miliary, and tuberculoma show a high number of false negatives, where actual instances of these classes are misclassified as other conditions. This limitation reflects the model’s difficulty in recognizing less frequent abnormalities, likely due to an imbalance in the dataset or insufficient distinctive features to differentiate these classes from others.
  • Misclassification Between Similar Conditions: Conditions with overlapping or visually similar features, such as fibroinfiltrate and fibrosis, or pleural effusion and consolidation, show notable misclassification rates. This suggests that the model struggles to differentiate between these classes, likely due to similarities in radiographic appearance. Such misclassification is common in multi-label medical imaging tasks where conditions share anatomical or structural traits.
  • Confusion with Normal Cases: The model occasionally misclassifies abnormal cases, such as pleural thickening and bullae, as normal, indicating that subtle anomalies may be challenging for the model to detect. This could reduce the sensitivity of the model in clinical applications, where missing abnormal cases can have significant implications.
  • Impact of Class Imbalance on Predictions: The matrix shows an imbalance in detection accuracy across classes, with more prevalent classes like fibrosis and pleural effusion having better detection rates than rare classes. This highlights the need for further class balancing or augmentation to improve the model’s sensitivity to underrepresented classes.
In summary, the confusion matrix reveals that, while the hybrid model effectively identifies common TB-related abnormalities, it faces challenges with rare or visually similar conditions. Improvements could focus on enhancing the dataset’s balance and incorporating more advanced distinguishing features to aid in differentiating between similar abnormalities.

3.9. Saliency Map of the Hybrid Model

To gain insights into the interpretability of the hybrid model, saliency maps were generated for various chest X-ray images, each annotated with multiple TB-related abnormalities. The purpose of these saliency maps is to visualize the regions of the images the model focuses on to make its predictions, revealing its interpretive process and alignment with clinically significant areas. This analysis serves as a qualitative evaluation of the model’s capacity to identify relevant pathological features in complex, multi-label scenarios.
Each set of images in the saliency map figures presents three visualizations to interpret how the hybrid model detects TB-related anomalies in chest X-ray (CXR) images. The leftmost image is the original CXR, showing the patient’s lungs and thoracic region, labeled with ground truth diagnoses (e.g., pleural effusion, consolidation). The middle image is the saliency map, highlighting the regions the model deems critical for its prediction, with brighter red areas indicating stronger focus. The rightmost image is a colorful overlay combining the original CXR and the saliency map, allowing for a direct comparison between the model’s attention regions and the anatomical structures in the X-ray. Together, these visualizations offer insights into the model’s interpretative focus, showing how well it aligns with clinically relevant areas for TB diagnosis.
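One common way to produce such maps is a gradient-based saliency computation, sketched below with TensorFlow's GradientTape; this illustrates the general idea rather than the exact visualization pipeline used in this study.

```python
# Minimal sketch of a gradient-based saliency map: the gradient of a chosen class
# score with respect to the input pixels, normalized for display as a heatmap.
import tensorflow as tf

def saliency_map(model, image, class_index):
    x = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)  # add batch dimension
    with tf.GradientTape() as tape:
        tape.watch(x)
        score = model(x, training=False)[0, class_index]
    grads = tape.gradient(score, x)[0]
    saliency = tf.reduce_max(tf.abs(grads), axis=-1)   # strongest gradient across channels
    saliency /= (tf.reduce_max(saliency) + 1e-8)       # normalize to [0, 1]
    return saliency.numpy()
```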
Figure 11 and Figure 12 show several saliency maps for different TB anomaly cases generated by the hybrid model, illustrating the regions of focus used by the model to detect various anomalies in chest X-ray images.
Case 1: Pleural Effusion and Consolidation. In this case, the saliency map emphasizes regions in the lung areas commonly associated with pleural effusion and consolidation, particularly in the middle and lower lobes. The highlighted regions on the saliency map indicate the model’s focus on areas of fluid accumulation and lung tissue opacity, which are characteristic of these conditions. The overlay map reinforces this by showing intensified attention in regions that align with typical clinical presentations of pleural effusion and consolidation.
Case 2: Consolidation and Pneumothorax. In this case, the saliency map reveals focused attention on the upper lung regions, where pneumothorax and consolidation effects are typically observed. The highlighted areas suggest the model’s sensitivity to abnormal air accumulation and tissue consolidation patterns within these areas. The colorful overlay further illustrates that the model emphasizes regions where pneumothorax and consolidation are expected, indicating its capability to recognize both conditions effectively, aligning well with clinical observations.
The saliency maps demonstrate that the hybrid EfficientNetV2L-ViT model effectively highlights regions associated with various TB-related anomalies, with its attention aligning well with clinical expectations. For instance, it focuses on the upper lung regions for pneumothorax and the lower lung areas for pleural effusion, indicating its ability to generalize across multiple TB-related pathologies.
However, some interpretability limitations are noted, particularly in cases where the saliency maps appear diffuse, making it difficult for the model to localize attention to specific regions. This may reflect the inherent challenges in multi-label classification, where highly overlapping abnormalities complicate precise localization. Notably, the color-encoded regions shown in the saliency maps correspond to the pathological processes confirmed in the original chest X-ray. However, some highlighted areas lie outside the annotated abnormalities, potentially reflecting non-pathological structures or regions the model incorrectly considers relevant. Future work could address these discrepancies through refinement efforts, such as advanced visualization techniques or additional attention mechanisms, to enhance interpretability and improve localization for complex multi-label cases.
Overall, these saliency maps support the robustness and clinical applicability of the hybrid model, underscoring its potential for real-world application in TB diagnostics.

3.10. Inference Time and Resource Utilization per Image Prediction

To evaluate the efficiency and resource requirements of the hybrid EfficientNetV2L-ViT Base model for single-image predictions, the model was tested on a set of 74 images.
Average Inference Time: The model demonstrated an average inference time of approximately 0.135 s per image. This rapid processing rate highlights the model’s capability to handle high-throughput demands, enabling efficient image analysis workflows, especially valuable in clinical settings where timely results are essential.
GPU Utilization per Image: During testing, the hybrid model utilized an average of 6.1 GB of VRAM on an NVIDIA GPU (manufactured by NVIDIA Corporation, Santa Clara, California, USA). This GPU is part of the computational infrastructure provided by the Big Data Center (BDC) IMERI, located in Indonesia (as mentioned in Section 2.6). The GPU has a total of 15 GB of VRAM and processed a batch of 74 images, translating to an average GPU memory usage of approximately 82.4 MB per image. This moderate per-image VRAM usage indicates that the model is optimized for efficient resource usage and could potentially run on mid-range GPUs, making it suitable for deployment even in settings with limited GPU resources.
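Per-image inference time of the kind reported above can be measured with a simple timing loop such as the following sketch; model and the pre-processed image array are placeholders rather than the study's actual objects.

```python
# Illustrative measurement of average per-image inference time for a Keras model.
import time

def average_inference_time(model, images):
    _ = model.predict(images[:1])                   # warm-up call to exclude graph build time
    start = time.perf_counter()
    for img in images:
        model.predict(img[None, ...], verbose=0)    # one image per prediction call
    return (time.perf_counter() - start) / len(images)
```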

4. Discussion

The integration of EfficientNetV2L and ViT Base in this study offers a robust solution for handling the complexities of multi-label classification in tuberculosis (TB) anomalies detection from chest X-ray (CXR) images. Unlike typical multi-class classification, where each image is assigned a single label, multi-label classification requires the model to predict multiple labels for each image. This task is inherently more complex due to the overlapping nature of labels and the imbalanced distribution of labels across the dataset, presenting unique challenges in medical imaging tasks such as TB anomaly detection.
Comparison with Existing Approaches: Our study contrasts with recent studies that have primarily focused on simpler binary classification tasks or the identification of only a few TB-related abnormalities.
For instance, a study using CNN architectures like InceptionV3, Xception, ResNet50, VGG19, and VGG16 achieved 87% accuracy, but only classified images as either TB-positive or TB-negative, without addressing the variety of TB-related lesions that can coexist in a patient [7]. Similarly, another work focused on distinguishing between four lesion types (infiltrations/bronchiectasis and opacity/consolidation), achieving 100% accuracy but covering a much narrower range of abnormalities [11].
In comparison, our model aimed to classify 14 distinct TB-related abnormalities, offering a more comprehensive approach. While UNet-based models combined with Xception have reached accuracy as high as 99.29% in binary TB classification tasks [12], these models do not address the complexities of multi-label classification, where several abnormalities may overlap in a single image.
Against the background of these existing studies, our work shows the potential advantages of combining CNNs and ViTs for handling a wider range of TB abnormalities while also focusing on model robustness.
Performance and Model Selection: Five CNN architectures and two vision transformer (ViT) models were evaluated to identify the best-performing backbone for each branch. EfficientNetV2L was selected for its balanced performance, with high accuracy (0.870) and low loss (0.323) compared to the other CNN models. Its ability to handle multi-label tasks efficiently stems from its compound scaling mechanism, which balances depth, width, and resolution for more efficient feature extraction across different input sizes. This property is essential in medical imaging, where detecting subtle patterns can significantly affect classification accuracy [16].
In ViT-based experiments, the ViT Base model was chosen for its computational efficiency and capacity to capture global image features. Although ViT Large demonstrated slightly better performance with an accuracy of 0.879 and an AUC of 0.565, its high computational cost and longer training time due to its large parameter count (306 million) made ViT Base a more practical choice. ViT Base, with its accuracy of 0.874 and AUC of 0.500, provided nearly equivalent results with fewer parameters (88.9 million), making it a more balanced option for the hybrid model [13].
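A minimal sketch of one way such a late fusion can be wired is shown below. The pooled feature sizes (1280 for EfficientNetV2L’s global-average-pooled output, 768 for the ViT Base embedding, the latter matching Table 4) and the 512-unit fusion layer are illustrative assumptions; the architecture actually used follows Figure 4.

```python
import tensorflow as tf

NUM_LABELS = 14

# Pooled feature vectors produced by the two branches for the same image.
cnn_feat = tf.keras.Input(shape=(1280,), name="efficientnetv2l_features")
vit_feat = tf.keras.Input(shape=(768,), name="vit_base_features")

# Late fusion: concatenate complementary local (CNN) and global (ViT)
# representations, then apply a shared multi-label sigmoid head.
fused = tf.keras.layers.Concatenate()([cnn_feat, vit_feat])
fused = tf.keras.layers.Dense(512, activation="relu")(fused)
fused = tf.keras.layers.Dropout(0.3)(fused)
outputs = tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid")(fused)

fusion_head = tf.keras.Model([cnn_feat, vit_feat], outputs)
```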
Batch Size and Fine-Tuning: In hybrid-based experiments, we opted not to implement early stopping during training of the hybrid EfficientNetV2L-ViT Base model, choosing instead to allow training to continue for a fixed 50 epochs. This decision was made to ensure consistent convergence across the entire dataset and to observe the model’s full learning curve without interruption. Although early stopping can prevent overfitting by halting training once the validation loss ceases to improve, we aimed to thoroughly evaluate the impact of training across the defined epochs, especially given the model’s multi-label classification complexity.
Batch size played a critical role in balancing frequent weight updates with stable training. After testing various batch sizes, a batch size of 8 yielded the best performance, achieving a high accuracy (0.911), low loss (0.285), and competitive AUC (0.510). Smaller batch sizes, such as 4, produced noisier updates, leading to instability and higher loss (0.531). Conversely, larger batch sizes, such as 16, were more computationally efficient but sacrificed some accuracy (0.884) and showed a higher loss (0.341) due to delayed convergence. Therefore, a batch size of 8 provided an optimal balance, enabling the model to manage the multi-label nature of the dataset effectively without encountering overfitting or underfitting.
Training Stability Technique: Based on the loss curves in Figure 7, it is evident that the model achieves convergence relatively early (<50 epochs), with minimal divergence between training and validation loss. Since the training process is guided by the loss function, the stability of the loss plot is a reliable indicator of model performance. The Adam optimizer, with its adaptive learning rate mechanism, plays a crucial role in promoting efficient convergence and better generalization by dynamically adjusting the learning rate during training. Given that the model converges early, extending the number of epochs may lead to overfitting, a condition where early stopping is advisable. This is consistent with findings from Kingma and Ba [30], which emphasize Adam’s effectiveness in improving optimization efficiency and enhancing model stability during training.
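In Keras terms, the configuration discussed above amounts to something like the fragment below; the learning rate and the stand-in model are illustrative placeholders, and the built-in focal loss (available in recent TensorFlow releases) is shown only as one way to plug in the loss described in the next subsection.

```python
import tensorflow as tf

NUM_LABELS = 14
BATCH_SIZE = 8   # best trade-off between update noise and stability (Table 7)
EPOCHS = 50      # fixed training length, no early stopping

# Stand-in model; in the experiments this is the hybrid EfficientNetV2L-ViT Base.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2048,)),
    tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),   # adaptive learning rate
    loss=tf.keras.losses.BinaryFocalCrossentropy(gamma=2.0),
    metrics=[tf.keras.metrics.AUC(multi_label=True, name="auc")],
)
# BATCH_SIZE is applied when building the training dataset (not shown), e.g.:
# history = model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)
```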
Impact of Class Weighting and Focal Loss: Class imbalance posed a significant challenge in this study, as certain TB-related abnormalities, like infiltrates, appeared far more frequently than others, such as bullae or tuberculoma. This imbalance, combined with the multi-label nature of the data, required strategies that allowed the model to fairly treat all classes, regardless of their frequency.
Class weighting was employed to mitigate this issue by assigning higher importance to underrepresented classes, ensuring that the model did not bias towards the more frequent labels. This technique improved model generalization, enabling better predictions for rare abnormalities like bullae [24]. By increasing the influence of these rarer labels during training, the model could better capture the diverse range of TB-related anomalies.
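One simple weighting scheme consistent with this idea is sketched below: each label’s weight is inversely proportional to how often it appears in the multi-hot training matrix. The exact formula used in our experiments may differ; this is only an illustration.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Per-label weights for an (N images x C labels) multi-hot matrix.

    Rare labels (e.g., bullae, tuberculoma) receive larger weights so the
    loss is not dominated by frequent labels such as infiltrate.
    """
    labels = np.asarray(labels, dtype=float)
    pos_counts = labels.sum(axis=0)                       # positives per label
    n_samples, n_labels = labels.shape
    return n_samples / (n_labels * np.maximum(pos_counts, 1.0))

# Toy example: a label present in ~90/100 images gets a small weight,
# one present in ~5/100 images gets a much larger one.
print(inverse_frequency_weights(np.random.rand(100, 3) < [0.9, 0.4, 0.05]))
```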
Additionally, focal loss was used to handle the imbalance further. As described by Lin et al. (2017), focal loss focuses more on harder-to-classify examples, making it well-suited for multi-label classification tasks where overlapping and imbalanced labels are prevalent [22]. By focusing more on these difficult cases, the model improved its performance on rare and challenging anomalies, as demonstrated by the reduction in loss and the improvements in accuracy across various configurations.
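For reference, a common implementation of this loss for sigmoid (multi-label) outputs is sketched below, following the formulation of Lin et al. [22]; the default alpha and gamma values are those of the original paper, not necessarily the values tuned in this study.

```python
import tensorflow as tf

def binary_focal_loss(gamma=2.0, alpha=0.25):
    """Focal loss for multi-label sigmoid outputs (Lin et al. [22]).

    Easy examples (p_t near 1) are down-weighted by (1 - p_t)^gamma,
    shifting the training signal toward hard, often rare, labels.
    """
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        fl = -alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t)
        return tf.reduce_mean(tf.reduce_sum(fl, axis=-1))  # sum over labels, mean over batch
    return loss
```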
Comparative Analysis of ViT and CNN: ViTs have gained attention more recently as an alternative approach to image understanding [13]. Instead of relying on convolutional layers, ViTs adopt the Transformer architecture, originally developed for natural language processing tasks, to treat images as sequences of patches processed using self-attention mechanisms. By modeling global interactions among patches, ViTs capture long-range dependencies and relationships in images, enabling them to understand both local and global contexts. This allows ViTs to grasp the holistic structure of an image effectively, yielding promising results in image classification, object detection, and even tasks like image generation.
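The patch-embedding step that turns an image into such a sequence can be expressed very compactly; the sketch below is for intuition only and uses a 224 x 224 input with 16 x 16 patches, giving (224/16)^2 = 196 tokens of dimension 768 (the ViT Base hidden size in Table 4).

```python
import tensorflow as tf

PATCH = 16
HIDDEN = 768   # ViT Base hidden size (Table 4)

# A strided convolution with kernel = stride = patch size is equivalent to
# splitting the image into non-overlapping patches and linearly projecting each.
patchify = tf.keras.layers.Conv2D(filters=HIDDEN, kernel_size=PATCH, strides=PATCH)

image = tf.random.uniform((1, 224, 224, 3))       # dummy chest X-ray tensor
patches = patchify(image)                         # (1, 14, 14, 768)
tokens = tf.reshape(patches, (1, -1, HIDDEN))     # (1, 196, 768) token sequence
print(tokens.shape)                               # self-attention then operates on these tokens
```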
Compared to CNNs, ViTs have a few advantages. First, ViTs eliminate the need for handcrafted convolutional operations and can be applied directly to various input sizes without architectural modifications. Additionally, ViTs have demonstrated strong performance in handling large-scale datasets and complex visual patterns. Finally, ViTs offer more interpretability, as they can attend to specific image regions, making them especially suitable for tasks requiring localization or attention to fine-grained details.
However, ViTs also have limitations, such as computational expense due to the self-attention mechanism and a requirement for large datasets for optimal performance. CNNs, meanwhile, remain more computationally efficient and have undergone extensive optimization and study.
Deployment Considerations in Real-World Clinical Settings: Deploying the hybrid EfficientNetV2L-ViT Base model in clinical settings, particularly in resource-constrained environments, requires careful consideration of its performance on unseen data and computational efficiency. The model’s average inference time of 0.135 s per image and VRAM usage of around 82.4 MB per image suggest it can efficiently handle high-throughput demands and operate on mid-range GPUs, making it feasible for facilities with limited resources. This compact memory footprint and quick processing speed enable timely TB diagnosis, which is critical in clinical workflows. For healthcare facilities with varying hardware capabilities, the model’s low memory requirements enhance its scalability, allowing it to process multiple images simultaneously on shared or lower-end systems.
In rural areas with limited access to doctors and radiology specialists, this model could significantly support TB diagnosis, provided an X-ray machine is available and operational. By generating early diagnostic results, the model can facilitate timely referrals to physicians for consultation and appropriate treatment, expediting intervention in TB cases. This approach aligns with TB eradication efforts, particularly in regions like Indonesia, where healthcare resources are often constrained, thereby supporting national TB control programs.
Diagnostic Effect and Advantages of AI Models: Traditional CXR interpretation relies heavily on radiologists’ expertise, leading to potential variability and subjectivity in diagnosis. This can result in missed or delayed TB detection, especially in complex cases involving subtle or overlapping abnormalities. In contrast, the hybrid AI model provides consistent and objective analysis by leveraging its robust feature extraction and attention mechanisms. The model not only identifies multiple TB-related anomalies simultaneously but also highlights key regions of interest through saliency maps, improving interpretability for medical professionals.
Additionally, the AI model significantly reduces diagnostic turnaround time, enabling faster decision-making and improving patient outcomes. This is particularly advantageous in high-volume clinical environments or rural areas where timely diagnosis and treatment initiation are critical. By automating the initial diagnostic process, the model supports overburdened healthcare systems, allowing radiologists and physicians to focus on more complex cases and improving overall workflow efficiency.
Limitations: While the hybrid model demonstrated strong accuracy and low loss, there are notable limitations in its evaluation metrics, particularly regarding AUC, precision, recall, and interpretability, as seen in the confusion matrix and saliency map analyses. Each limitation points to a source of complexity inherent in multi-label classification, which presents unique challenges in TB anomaly detection.
  • Low AUC Score (Figure 13): A low AUC score is not uncommon in multi-label classification tasks due to the inherent complexity of predicting multiple overlapping labels. The AUC metric, typically used in binary classification, is less straightforward in multi-label tasks, where each label has its own ROC curve; when averaged, the score can be skewed by the overrepresentation of frequent labels, such as infiltrates, at the expense of rarer labels like bullae. Because certain labels appear far more often than others, the model tends to predict these common labels more readily, which diminishes sensitivity to rare labels and further depresses the averaged AUC. Although focal loss and class weighting were applied to counteract these biases, these initial measures highlight the need for further refinement to improve performance across all TB-related anomalies. As Saito and Rehmsmeier [31] note, AUC can be biased toward majority classes, making it less reliable for assessing minority-class performance in imbalanced datasets. The AUC score of 0.510 therefore reflects the complexity of the task rather than a major deficiency in the model’s ability to detect TB-related anomalies. The model performs strongly on other metrics, such as accuracy (0.911) and recall (1.000), which better capture its effectiveness in multi-label classification and show that, despite a lower AUC, it maintains high sensitivity and generalization, ensuring clinical relevance across categories. Clinically, it would be counterproductive to exclude specific labels to artificially boost AUC, as each carries crucial diagnostic value. Moving forward, we plan to refine class weighting, apply selective oversampling/undersampling to better represent rare classes, and, although resource-intensive, prioritize collecting additional samples for underrepresented classes to improve sensitivity toward rarer TB-related anomalies.
  • Class Imbalance Impact on Precision and Recall: As indicated in the confusion matrix analysis, the hybrid model exhibits uneven sensitivity and specificity across different classes. Common anomalies are detected more consistently, while rarer classes, due to their limited representation in the training dataset, are often overlooked. This imbalance results in lower precision for certain classes, as the model may favor frequent labels in its predictions. Techniques like oversampling, undersampling, or advanced class weighting methods may further enhance the model’s sensitivity and precision, ensuring balanced performance across all TB-related anomalies.
  • Saliency Map Interpretability Limitations: While saliency maps provide visual insight into the model’s attention regions, they show only general activation patterns rather than precise, clinically relevant features. This limitation can make it difficult for radiologists to interpret the model’s reasoning, especially for subtle TB-related anomalies. The saliency maps, while useful for assessing model focus, lack the granularity needed for more complex interpretability. Future work could explore more advanced interpretability techniques, such as Grad-CAM or guided backpropagation, to provide clearer insights into how the model makes specific decisions, potentially aligning the output more closely with radiological assessment.
  • Challenges with Rare Anomaly Detection in Confusion Matrix Analysis: The confusion matrix analysis reveals that the model’s performance on rare anomalies like miliary and bullae is limited, with a tendency to misclassify these conditions as “Not Present.” This may stem from the imbalance in label frequency and the lack of distinctive features for these rare anomalies, which are underrepresented in the dataset. A future improvement could involve augmenting these rarer classes to provide the model with more examples of each anomaly, potentially through synthetic data generation or targeted augmentation strategies.
  • Limitations in Recall and Precision: While the model shows acceptable recall for common classes, precision remains a challenge: the model is reasonably effective at flagging an anomaly when it is present, but it occasionally marks images without the condition as positive. Such false positives could, in a clinical context, lead to unnecessary follow-up. Refining the decision threshold for each class or using ensemble methods to confirm predictions could improve precision, particularly for multi-label cases; a minimal sketch of per-label AUC evaluation and threshold tuning follows this list.
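The sketch below illustrates the per-label evaluation and threshold tuning mentioned in these limitations: ROC-AUC is computed separately for each label (labels with no positive test samples are skipped, one reason an averaged AUC can mislead), and an F1-maximizing decision threshold is chosen per label from its precision-recall curve. The variable names and the F1 criterion are illustrative choices, not the exact procedure used in this study.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

def per_label_auc_and_thresholds(y_true, y_prob):
    """Per-label ROC-AUC and an F1-maximizing threshold for each label.

    y_true: (N x C) multi-hot ground truth; y_prob: matching predicted probabilities.
    """
    aucs, thresholds = {}, {}
    for c in range(y_true.shape[1]):
        if y_true[:, c].min() == y_true[:, c].max():
            continue                     # only one class present: AUC undefined
        aucs[c] = roc_auc_score(y_true[:, c], y_prob[:, c])
        prec, rec, thr = precision_recall_curve(y_true[:, c], y_prob[:, c])
        f1 = 2 * prec[:-1] * rec[:-1] / np.maximum(prec[:-1] + rec[:-1], 1e-8)
        thresholds[c] = thr[int(np.argmax(f1))]
    return aucs, thresholds
```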
Future Work: Future research on tuberculosis chest X-ray classification with a hybrid CNN and vision transformer model could explore several key areas to enhance model performance and applicability:
  • Future Improvements with Advanced Architectures: Although the hybrid EfficientNetV2L-ViT Base model has demonstrated promise, more advanced architectures such as Swin Transformer and ConvNeXt could offer additional performance gains. These architectures provide enhanced feature extraction capabilities and may better capture complex patterns within CXR images, particularly in a multi-label setting. Implementing such advanced architectures could address current limitations in rare anomaly detection and overall AUC performance.
  • Transfer learning: Examine the potential of transfer learning by leveraging models pre-trained on large-scale datasets, such as ImageNet or CheXpert, and fine-tuning them on tuberculosis chest X-ray images. This could enhance the model’s performance by incorporating diverse visual features learned from broader datasets.
  • Interpretability and explainability: Improve the interpretability and explainability of the hybrid model’s predictions. Techniques such as attention maps and saliency mapping can shed light on the model’s decision-making process, revealing image regions contributing most to the classification.
  • Real-world deployment and clinical validation: Validate the hybrid model’s performance on larger and more diverse clinical datasets, involving multiple medical institutions and patient populations. Collaborate with healthcare professionals to ensure the model’s applicability, reliability, and clinical relevance.
By exploring these research directions, one can advance the field of tuberculosis chest X-ray classification, improve the accuracy and robustness of hybrid CNN and vision transformer models, and ultimately contribute to more effective tuberculosis diagnosis and patient care.
Future Research Directions and Proposed Solutions: This section outlines a plan for future extensions of the research to tackle the identified challenges effectively. The following suggested strategies aim to improve model performance, interpretability, and clinical applicability:
  • One major challenge identified is the significant class imbalance in the dataset, particularly for rare anomalies such as bullae or tuberculoma. This imbalance results in high false-negative rates and reduced model sensitivity for underrepresented conditions. To address it, future research can implement advanced data augmentation techniques, such as generative adversarial networks (GANs) and diffusion models, to synthetically generate diverse samples for rare conditions [32]. Dynamic re-weighting strategies such as class-balanced loss, which adjusts weights based on the effective number of samples, could also be employed [33]; a weight-computation sketch following this list illustrates the idea. Oversampling techniques, including SMOTE, are also recommended to ensure a more balanced dataset representation [34].
  • Another area requiring improvement is precision in multi-label tasks, where the current hybrid model suffers from a high rate of false positives, as evidenced by its low precision score (0.181). Threshold optimization for each label, using precision–recall trade-off analysis, could help balance precision and recall [35]. Post hoc calibration methods such as Platt scaling or isotonic regression may also improve the reliability of predicted probabilities [36].
  • High misclassification rates for anomalies with overlapping visual features, such as fibroinfiltrate and bronchiectasis, present another critical problem. Feature refinement using advanced attention mechanisms, such as the Convolutional Block Attention Module (CBAM), could help differentiate overlapping features more effectively [37]. Additionally, adopting multi-task learning (MTL), where auxiliary tasks like feature segmentation are introduced, could improve the model’s ability to learn discriminative feature representations [38].
  • Model interpretability and explainability remain essential for clinical applications, yet saliency maps generated by the current hybrid model sometimes highlight irrelevant areas. To improve interpretability, attention-based explainability methods could be adopted, leveraging attention maps to provide clearer visual explanations [20]. Advanced Grad-CAM techniques, such as Grad-CAM++ or Score-CAM, offer more fine-grained and reliable visualizations [39]. Quantitative evaluation of saliency maps, using metrics like Intersection over Union (IoU), would further ensure their reliability [40].
  • The hybrid model’s modest improvement in AUC (e.g., 0.510) indicates limited effectiveness in leveraging its added complexity. Model ensembling, combining the hybrid architecture with other high-performing models like Swin Transformer or ConvNeXt, could enhance overall performance through techniques such as soft voting [41]. Moreover, feature fusion techniques, including cross-attention and multi-scale fusion, could better integrate complementary strengths of the CNN and ViT components, boosting performance [42].
  • To ensure clinical relevance, validation on diverse datasets remains crucial. Cross-institutional validation, involving datasets from multiple institutions, could provide a more robust assessment of the model’s generalizability [43]. Prospective clinical trials would further evaluate the model’s impact on diagnostic workflows, focusing on accuracy, efficiency, and clinician feedback [44]. Integration with PACS systems for real-time inference could facilitate seamless adoption of the model in clinical settings [45].
  • Lastly, the potential of transfer learning should be further explored by leveraging pre-trained models on large medical datasets such as CheXpert or NIH CXR14. Fine-tuning these models on tuberculosis-specific data could significantly enhance performance by incorporating broader visual knowledge [43].
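As a concrete illustration of the class-balanced re-weighting proposed above, the sketch below computes weights from the “effective number of samples” of Cui et al. [33]; beta = 0.999 is a commonly used value, not one tuned for this dataset, and the example counts are taken from Table 3.

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.999):
    """Class-balanced weights: effective_n = (1 - beta^n) / (1 - beta), weight = 1 / effective_n.

    Weights are normalized so they sum to the number of classes.
    """
    n = np.asarray(samples_per_class, dtype=float)
    effective_num = (1.0 - np.power(beta, n)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights * len(n) / weights.sum()

# Example with counts from Table 3: infiltrate = 97, bullae = 12, miliary = 11.
print(class_balanced_weights([97, 12, 11]))   # rarer classes receive larger weights
```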

5. Conclusions

This study developed a hybrid AI model combining EfficientNetV2L and Vision Transformer (ViT) Base for multi-label classification of tuberculosis (TB) abnormalities in chest X-ray (CXR) images. The model’s strength lies in its ability to extract both detailed local features through EfficientNetV2L and capture global image context with ViT Base, addressing the challenge of predicting multiple co-occurring TB-related abnormalities. This hybrid approach outperformed standalone models, particularly in handling complex multi-label medical imaging tasks.
EfficientNetV2L proved highly effective in identifying subtle anomalies like infiltrates and fibrosis, with a solid accuracy of 0.870 and low loss of 0.323. Meanwhile, ViT Base provided almost comparable performance to ViT Large but with far fewer parameters, making it a more efficient option for global feature extraction, crucial in detecting diffuse anomalies like consolidation and lymphadenopathy.
To address class imbalance, focal loss and class weighting were applied, ensuring that rarer abnormalities received appropriate attention during training. This contributed to more balanced learning and improved generalization across TB-related anomalies. Testing revealed that a batch size of 8 offered the optimal balance between frequent weight updates and stable training, achieving the highest accuracy of 0.911, the lowest loss of 0.285, and an efficient inference time. Average inference time per image was approximately 0.135 s, with GPU memory consumption per image at around 82.4 MB, indicating suitability for real-time applications in clinical environments.
One limitation observed was the relatively modest AUC score (0.510), a known challenge in multi-label classification due to overlapping labels and the skew introduced by more frequent classes like infiltrates. This can lead to an underestimation of model sensitivity for rare classes when interpreting AUC scores. Therefore, accuracy and loss metrics are more reliable indicators of the model’s performance in this setting. Additionally, saliency maps and confusion matrices underscored the model’s capacity for effective feature localization, although further refinement is needed to enhance predictions on rarer anomalies.
In conclusion, the hybrid EfficientNetV2L-ViT Base model is a promising solution for multi-label classification in TB detection, with its ability to manage both local and global features effectively. This approach has significant potential to enhance TB diagnostics and could be scaled for real-world clinical applications, particularly in resource-constrained environments. Future work should focus on exploring advanced architectures such as Swin Transformer and ConvNeXt to maximize the model’s clinical utility, improve sensitivity for rare anomalies, and ensure robust performance in diverse clinical settings.

Author Contributions

V.V. and M.R. designed the project and the experiments, and R.Y. performed the experiments. D.H.D. and A.P. collected the data, which were analyzed by R.Y. under the supervision of V.V., M.R., P.A.Y., E.D.T. and R.E.Y. S.A.P. drafted the first version of the manuscript, which V.V. and P.A.Y. edited. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Q2 International Indexed Publication Grant 2020 from Universitas Indonesia No. NKB764/UN2.RST/HKP.05.00/2020 (to PY).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of Universitas Indonesia No. 20-07-0806 KET-797/UN2.F1/ETIK/PPM.00.02/2020, 27 July 2023.

Informed Consent Statement

Patient consent was waived by the ethics committee because the data used in this study were retrospective and anonymized, so patients could not be identified.

Data Availability Statement

The original contributions presented in the study are included in the article, and further inquiries can be directed to the corresponding authors.

Acknowledgments

The Direktorat Riset dan Pengembangan Universitas Indonesia (RISBANG UI) provided funding for this project. We thank Dr. Cipto Mangunkusumo National General Hospital of Jakarta and Universitas Indonesia Hospital, where the data for this project were collected, and Syarifaha Ihsan and Kahlil Gibran for their contribution to the data collection process.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MDPI    Multidisciplinary Digital Publishing Institute
CNNs    Convolutional Neural Networks
CXR     Chest X-Ray
ViT     Vision transformer
MLP     Multi-Layer Perceptron
NLP     Natural Language Processing

References

  1. Goletti, D.; Petruccioli, E.; Joosten, S.A.; Ottenhoff, T.H. Tuberculosis biomarkers: From diagnosis to protection. Infect. Dis. Rep. 2016, 8, 6568. [Google Scholar] [CrossRef] [PubMed]
  2. Lange, C.; Mori, T. Advances in the diagnosis of tuberculosis. Respirology 2010, 15, 220–240. [Google Scholar] [CrossRef] [PubMed]
  3. World Bank Open Data. 2022. Available online: https://data.worldbank.org/indicator/SH.TBS.INCD?locations=ID (accessed on 19 September 2022).
  4. Alli, O.A.; Ogbolu, O.D.; Alaka, O.O. Direct molecular detection of Mycobacterium tuberculosis complex from clinical samples—An adjunct to cultural method of laboratory diagnosis of tuberculosis. N. Am. J. Med Sci. 2011, 3, 281. [Google Scholar] [CrossRef] [PubMed]
  5. Lewinsohn, D.M.; Leonard, M.K.; LoBue, P.A.; Cohn, D.L.; Daley, C.L.; Desmond, E.; Keane, J.; Lewinsohn, D.A.; Loeffler, A.M.; Mazurek, G.H.; et al. Official American Thoracic Society/Infectious Diseases Society of America/Centers for Disease Control and Prevention clinical practice guidelines: Diagnosis of tuberculosis in adults and children. Clin. Infect. Dis. 2017, 64, e1–e33. [Google Scholar] [CrossRef]
  6. World Health Organization. Chest Radiography in Tuberculosis Detection: Summary of Current WHO Recommendations and Guidance on Programmatic Approaches; Technical Report; World Health Organization: Geneva, Switzerland, 2016. [Google Scholar]
  7. Showkatian, E.; Salehi, M.; Ghaffari, H.; Reiazi, R.; Sadighi, N. Deep learning-based automatic detection of tuberculosis disease in chest X-ray images. Pol. J. Radiol. 2022, 87, 118–124. [Google Scholar] [CrossRef] [PubMed]
  8. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  9. Allaouzi, I.; Ahmed, M.B. A novel approach for multi-label chest X-ray classification of common thorax diseases. IEEE Access 2019, 7, 64279–64288. [Google Scholar] [CrossRef]
  10. Cai, L.; Gao, J.; Zhao, D. A review of the application of deep learning in medical image classification and segmentation. Ann. Transl. Med. 2020, 8, 713. [Google Scholar] [CrossRef]
  11. Ou, C.Y.; Chen, I.Y.; Chang, H.T.; Wei, C.Y.; Li, D.Y.; Chen, Y.K.; Chang, C.Y. Deep Learning-Based Classification and Semantic Segmentation of Lung Tuberculosis Lesions in Chest X-ray Images. Diagnostics 2024, 14, 952. [Google Scholar] [CrossRef]
  12. Sharma, V.; Nillmani; Gupta, S.K.; Shukla, K.K. Deep learning models for tuberculosis detection and infected region visualization in chest X-ray images. Intell. Med. 2023, 4, 104–113. [Google Scholar] [CrossRef]
  13. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  14. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  16. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; PMLR: Seattle, WA, USA, 2019; pp. 6105–6114. [Google Scholar]
  17. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  18. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  19. National Institutes of Health. Chest X-Ray Dataset. Kaggle. 2024. Available online: https://www.kaggle.com/datasets/nih-chest-xrays/data (accessed on 1 September 2024).
  20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30. [Google Scholar]
  21. Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G.; Cai, J.; et al. Recent advances in convolutional neural networks. Pattern Recognit. 2018, 77, 354–377. [Google Scholar] [CrossRef]
  22. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  23. Mannor, S.; Peleg, D.; Rubinstein, R. The cross entropy method for classification. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; pp. 561–568. [Google Scholar]
  24. Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27. [Google Scholar] [CrossRef]
  25. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
  26. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  27. Taylor, R.; Nitsch, V.; Bagus, M. Improving deep learning with image preprocessing: Rotation and flipping. In Proceedings of the International Conference on Artificial Neural Networks, Rhodes, Greece, 4–7 October 2018; pp. 33–44. [Google Scholar]
  28. Shen, D.; Wu, G.; Suk, H.I. Deep learning in medical image analysis. Annu. Rev. Biomed. Eng. 2019, 19, 221–248. [Google Scholar] [CrossRef] [PubMed]
  29. Indonesia, K.K.R. Peraturan Menteri Kesehatan Republik Indonesia Nomor 24 Tahun 2022 Tentang Rekam Medis. 2022. Available online: https://rc.kemkes.go.id/aktivitas-rme-menurut-permenkes-nomor-24-tahun-2022-fc09e6 (accessed on 15 October 2024).
  30. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2015, arXiv:1412.6980. [Google Scholar]
  31. Saito, T.; Rehmsmeier, M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef] [PubMed]
  32. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Volume 33, pp. 6840–6851. [Google Scholar]
  33. Cui, Y.; Jia, M.; Lin, T.-Y.; Song, Y.; Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  34. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  35. Dembczyński, K.; Waegeman, W.; Cheng, W.; Hüllermeier, E. Advances in Neural Information Processing Systems. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, NV, USA, 3–6 December 2012. [Google Scholar]
  36. Niculescu-Mizil, A.; Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005. [Google Scholar]
  37. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  38. Zhang, Y.; Yang, Q. A Survey on Multi-Task Learning. IEEE Trans. Knowl. Data Eng. 2022, 34, 5586–5609. [Google Scholar] [CrossRef]
  39. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Improved visual explanations for deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  40. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  41. Dietterich, T.G. Ensemble methods in machine learning. In Multiple Classifier Systems, Proceedings of the First International Workshop, MCS 2000, Cagliari, Italy, 21–23 June 2000; Springer: Berlin/Heidelberg, Germany, 2000. [Google Scholar]
  42. Xu, Y.; Zhang, Z.; Zhang, Q.; Zhang, L.; Huang, Y.; Gao, X.; Tong, Y. Multiscale Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 4811–4820. [Google Scholar] [CrossRef]
  43. Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R.; Shpanskaya, K.; et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019. [Google Scholar]
  44. Rajpurkar, P.; Irvin, J.; Ball, R.L.; Zhu, K.; Yang, B.; Mehta, H.; Duan, T.; Ding, D.; Bagul, A.; Langlotz, C.P.; et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med. 2018, 15, e1002686. [Google Scholar] [CrossRef] [PubMed]
  45. Zech, J.R.; Badgeley, M.; Liu, M.; Costa, A.B.; Titano, J.J.; Oermann, E.K. Confounding variables can degrade generalization performance of radiological deep learning models. PLoS Med. 2018, 15, e1002683. [Google Scholar] [CrossRef]
Figure 1. Sample image from the RSCM dataset. R: Right.
Figure 2. Sample image from the NIH dataset. L: Left.
Figure 3. Distribution of data for each label in bar plot.
Figure 4. Hybrid architecture of CNN and ViT model.
Figure 5. EfficientNetV2L model prediction results for normal and TB anomalies, with confidence score for each label. Each subfigure represents a chest X-ray (CXR) image with the corresponding ground truth and model predictions.
Figure 6. ViT Base model prediction results for normal and TB anomalies, with confidence score for each label. Each subfigure represents a chest X-ray (CXR) image with the corresponding ground truth and model predictions.
Figure 7. Hybrid CNN-ViT Base model accuracy and loss.
Figure 8. EfficientNetV2L model prediction results for normal and TB anomalies, with confidence score for each label.
Figure 9. Confusion matrices for each class in the hybrid model.
Figure 10. Confusion matrices for each class in the hybrid model.
Figure 11. Hybrid model saliency map for pleural effusion and consolidation.
Figure 12. Hybrid model saliency map for consolidation and pneumothorax.
Figure 13. Hybrid CNN-ViT model prediction results with AUC for each label.
Table 1. Examples of data samples and corresponding labels.
Data        Label(s)
32301310    bronchiectasis, fibroinfiltrate, infiltrate, cavity
32967370    fibroinfiltrate
34999600    infiltrate, consolidation
36764870    consolidation
36993100    infiltrate
38160860    infiltrate, consolidation, pleural thickening
41558270    atelectasis, fibroinfiltrate
42232070    bronchiectasis, bullae, fibroinfiltrate, pleural thickening
43637780    atelectasis, bronchiectasis, pleural effusion, infiltrate, consolidation, pleural thickening
Table 2. Examples of data samples and corresponding labels.
Data           Label(s)
00000001002    pleural effusion
00000005007    pleural effusion, infiltrate
00000013030    atelectasis, pneumothorax
00000013046    infiltrate
Table 3. Distribution of data for each label.
Label                 RSCM    NIH    Total
Infiltrate            50      47     97
Fibroinfiltrates      64      -      64
Consolidation         36      19     55
Cavity                13      -      13
Pleural Effusion      9       81     90
Fibrosis              13      37     50
Bronchiectasis        32      -      32
Pleural thickening    15      35     50
Atelectasis           8       56     64
Lymphadenopathy       11      -      11
Pneumothorax          2       54     56
Bullae                12      -      12
Tuberculoma           11      -      11
Miliary               11      -      11
Normal                19      31     50
Table 4. Vision transformer configuration.
Model        Layers    Hidden Size    MLP Size    Heads
ViT Base     12        768            3072        12
ViT Large    24        1024           4096        16
ViT Huge     32        1280           5120        16
Table 5. Experimental result of CNN models.
Architecture       Parameters     Accuracy    Loss     AUC Score
InceptionResNet    56,393,943     0.870       0.330    0.432
VGG16              14,985,039     0.871       0.331    0.454
Xception           23,430,687     0.866       0.343    0.475
DenseNet-201       20,763,191     0.855       0.370    0.500
EfficientNetV2L    118,410,415    0.870       0.323    0.460
Table 6. Experimental result of ViT.
Architecture    Parameters     Accuracy    Loss     AUC Score
ViT Base        88,903,415     0.874       0.327    0.500
ViT Large       306,762,999    0.879       0.326    0.565
Table 7. Experimental result of hybrid models.
Batch Size    Accuracy    Loss     AUC Score
4             0.833       0.531    0.520
8             0.911       0.285    0.510
16            0.884       0.341    0.480
Table 8. Evaluation metrics for EfficientNetV2L, ViT Base, and Hybrid EfficientNetV2L-ViT Base model.
Architecture       Accuracy    AUC      Precision    Recall    F1 Score    Hamming Loss
EfficientNetV2L    0.870       0.460    0.238        0.673     0.317       0.526
ViT Base           0.874       0.500    0.319        0.485     0.217       0.464
Hybrid             0.911       0.510    0.181        1.000     0.301       0.326
Table 9. Performance and parameter analysis of the models.
Architecture       Parameters     Training Time (min)    Loss
EfficientNetV2L    118,410,415    35.57                  0.323
ViT Base           88,903,415     4.51                   0.327
Hybrid             177,975,727    23.85                  0.285
Table 10. Performance metrics of hybrid model variants.
Architecture                   Accuracy    AUC      Precision    Recall    F1 Score    Hamming Loss
Hybrid                         0.911       0.510    0.181        1.000     0.301       0.326
Hybrid without Focal Loss      0.855       0.461    0.120        0.712     0.206       0.410
Hybrid without Class Weight    0.868       0.472    0.150        0.750     0.253       0.384
Table 11. Efficiency metrics of hybrid model variants.
Architecture                   Training Time (min)    Loss
Hybrid                         23.85                  0.285
Hybrid without Focal Loss      22.90                  0.340
Hybrid without Class Weight    22.75                  0.320
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
