1. Introduction
Tuberculosis (TB), caused by Mycobacterium tuberculosis, remains a significant global health challenge and one of the leading causes of death worldwide [1]. Accurate diagnosis is particularly difficult in immunocompromised individuals and pediatric patients, where sample collection for bacterial confirmation is challenging [2]. In Indonesia alone, an estimated 824,000 TB cases were reported, with 97,800 deaths in 2021 [3]. The gold standard for TB diagnosis, bacterial isolation, takes 2 to 3 weeks to yield results, while the Mantoux test, which uses purified protein derivative (PPD), requires 48 to 72 h [4]. Faster alternatives, such as polymerase chain reaction (PCR), are costly and less accessible in certain regions [5]. To address these limitations, the World Health Organization (WHO) recommends chest X-ray (CXR) as a screening tool [6], though its effectiveness is limited by variability in interpretation among radiologists, which introduces subjectivity into the process.
Recent advancements in artificial intelligence (AI) have paved the way for computer-aided diagnosis (CAD) systems to automatically detect TB by analyzing CXR images through image segmentation, feature extraction, and classification [7]. Deep learning (DL), a form of machine learning (ML), uses multiple neural network layers to process raw data and has proven highly effective in image classification tasks [8,9,10].
Several recent studies have applied deep learning models to detect TB using CXR images, with significant results. For example, one study achieved an accuracy of 87%, utilizing CNN architectures such as InceptionV3, Xception, ResNet50, VGG19, and VGG16 [7]. However, this study focused solely on binary classification, determining whether a CXR image indicated TB or not. Similarly, a study of deep learning-based classification and semantic segmentation of lung tuberculosis lesions reached 100% accuracy for distinguishing only four lesion types: infiltrations/bronchiectasis and opacity/consolidation [11]. A third study also performed well, achieving 99.29% accuracy using a UNet for lung segmentation and Xception for TB classification [12]. Yet, like other efforts, it primarily focused on binary classification (TB vs. normal) or a very limited range of abnormalities.
While these results are promising, they remain limited in scope, focusing on detecting TB presence or classifying only a few lesion types. For instance, lesions like infiltrations, opacity, and bronchiectasis are often classified, but these approaches fail to address a broader range of TB-related abnormalities or handle the complexity of multi-label classification where several abnormalities can co-exist in a single image.
In this study, we introduce a novel hybrid AI model that combines convolutional neural networks (CNNs) [8] and vision transformers (ViTs) [13] to detect TB anomalies in CXR images through a multi-label classification framework. Our model focuses on 14 distinct TB-related anomalies, a comprehensive range that exceeds previous studies in scope, making this the most extensive multi-label classification effort for TB detection using AI. We tackled data imbalance using augmentation, class weighting, and focal loss to ensure robust performance across all classes.
To further validate the performance of our model, we conducted a comparative analysis against several state-of-the-art CNN architectures, including Inception [14], ResNet [15], EfficientNet [16], VGG [17], and DenseNet [18]. Furthermore, we evaluated vision transformer models, including ViT Base and ViT Large [13], to determine their effectiveness in handling the multi-label classification of TB abnormalities. This evaluation aims to identify the most effective architecture for accurate and efficient TB anomaly detection, particularly in resource-constrained settings such as Indonesia.
The key contribution of this study lies in its ability to classify 14 distinct TB-related abnormalities in a single AI-driven approach. The labeled abnormalities include (1) Infiltrate, (2) Fibroinfiltrates, (3) Consolidation, (4) Cavity, (5) Pleural Effusion, (6) Fibrosis, (7) Bronchiectasis, (8) Pleural Thickening, (9) Atelectasis, (10) Lymphadenopathy, (11) Pneumothorax, (12) Bullae, (13) Tuberculoma, and (14) Miliary. One of the primary challenges in this study is that two or more abnormalities can occur in a single patient, making multi-label classification essential in addition to traditional multi-class classification.
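To make the multi-label framing concrete, the target for each image can be expressed as a multi-hot vector over the 14 abnormality classes, so that co-existing findings are represented simultaneously. The following minimal sketch is our own illustration (the function and variable names are assumptions, not taken from the paper):

```python
# The 14 TB-related abnormality labels listed above.
LABELS = [
    "Infiltrate", "Fibroinfiltrates", "Consolidation", "Cavity",
    "Pleural Effusion", "Fibrosis", "Bronchiectasis", "Pleural Thickening",
    "Atelectasis", "Lymphadenopathy", "Pneumothorax", "Bullae",
    "Tuberculoma", "Miliary",
]

def multi_hot(findings):
    """Map the set of findings for one CXR to a 14-dim binary target vector."""
    return [1 if name in findings else 0 for name in LABELS]

# A patient with two co-existing abnormalities gets two active bits:
y = multi_hot({"Consolidation", "Pleural Thickening"})
```

Unlike a one-hot multi-class target, several entries of `y` may be 1 at once, which is what makes per-label (rather than softmax) outputs necessary.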
This approach provides enhanced decision support for radiologists and facilitates faster, more reliable diagnoses in clinical settings, ultimately aiding in more efficient TB management.
3. Results
This section presents and analyzes the outcomes of the experimental study, highlighting the key findings and observations derived from the research. The experiments were conducted in a high-performance computing environment with a GPU equipped with 40 GB of VRAM, capable of handling complex computations efficiently, especially for deep learning tasks involving large models such as the hybrid EfficientNetV2L-ViT Base.
3.1. CNN-Based Experiments
As shown in Table 5, EfficientNetV2L showed an excellent combination of accuracy (0.870), the lowest loss (0.323), and a respectable AUC score (0.460). While VGG16 slightly outperformed EfficientNetV2L in accuracy (0.871) and had a comparable loss (0.331), its AUC score (0.454) was lower. Additionally, VGG16 has fewer parameters (14.9 million) but does not provide the level of detail and feature extraction efficiency that EfficientNetV2L achieves, which justifies the latter's higher parameter count (118 million). Similarly, DenseNet-201, with a higher AUC (0.500), had lower accuracy (0.855) and the highest loss (0.370), making it less desirable for this task, where minimizing loss is critical.
Figure 5 presents the prediction outcomes from the EfficientNetV2L model. Each image includes a comparison between the actual diagnosis and the predicted diagnosis, along with the model’s confidence level for each predicted label. This analysis demonstrates the model’s ability to detect various TB-related abnormalities from chest X-ray (CXR) images.
- (a) The ground truth is normal, and the model correctly predicts it with a confidence of 33%. The prediction aligns with the actual diagnosis, though the confidence could be higher.
- (b) The ground truth is consolidation, and the model predicts it with 44% confidence. The model correctly identifies the abnormality, demonstrating its potential in recognizing consolidation.
- (c) The ground truth includes fibrosis, bronchiectasis, and pneumothorax. The model predicts fibrosis with 29%, bronchiectasis with 35%, and pneumothorax with 15% confidence. These predictions reflect a reasonable overlap with the ground truth.
- (d) The ground truth includes consolidation and pleural thickening. The model predicts consolidation with 44% confidence and pleural thickening with 27% confidence, which aligns well with the ground truth.
3.2. ViT-Based Experiments
As shown in Table 6, ViT Base strikes an ideal balance between performance and computational efficiency. While ViT Large achieved higher accuracy (0.879) and AUC (0.565), the massive increase in parameters (306 million) makes it less practical for the project’s needs. In contrast, ViT Base, with 88.9 million parameters, performed very well, with an accuracy of 0.874, a loss of 0.327, and an AUC of 0.500. This makes it a more computationally efficient option for global feature extraction, especially when considering the trade-off between computational demand and model performance.
The reduced parameter count in ViT Base directly contributes to its lower memory usage, which is critical when deploying models in resource-limited environments or in real-time processing scenarios. The performance gap between ViT Base and ViT Large is relatively minor, making ViT Base the more practical choice for balancing accuracy and efficiency. This balance is particularly relevant for medical imaging tasks such as tuberculosis anomaly detection, where speed and accuracy are both crucial for effective diagnostics.
Figure 6 shows the prediction results of the ViT Base model, demonstrating its ability to classify normal and tuberculosis (TB) anomalies in chest X-rays. Each label is accompanied by a confidence score, highlighting the model’s certainty for each predicted abnormality.
- (a) The actual diagnosis is consolidation, and the model predicts it with 38% confidence.
- (b) The actual diagnosis is miliary tuberculosis, and the model predicts miliary TB with 14% confidence.
- (c) The ground truth includes consolidation, fibrosis, and pleural thickening. The model predicts fibrosis with 53%, consolidation with 58%, and pleural thickening with 41% confidence, all of which align closely with the actual diagnosis.
- (d) The actual diagnosis is fibrosis and bullae, and the model predicts fibrosis with 45% confidence and bullae with 13%. The prediction covers both conditions but is less confident about bullae.
Compared to InceptionResNet V2 (accuracy of 0.870, loss of 0.330, and AUC of 0.432) and Xception (accuracy of 0.866, loss of 0.343, and AUC of 0.475), both EfficientNetV2L and ViT Base offer better overall performance with lower losses and higher AUC scores, making them the ideal candidates for integration into a hybrid architecture.
In conclusion, EfficientNetV2L was chosen for its superior handling of local features with low loss and strong accuracy, while ViT Base offered efficient global feature extraction with better AUC and computational efficiency. Together, they form a robust foundation for the hybrid model, balancing computational demand with high performance across key metrics.
3.3. Hybrid-Based Experiment
The results of this experiment, which combined EfficientNetV2L and ViT Base, are shown in Table 7 and reveal important insights into how different batch sizes affect model performance across 50 epochs with a learning rate of 0.0001.
The hybrid model using EfficientNetV2L and ViT Base showed the best performance with a batch size of 8 due to its strong balance between accuracy and loss. With an accuracy of 0.911, the lowest loss of 0.285, and a competitive AUC score of 0.510, this batch size allows for effective learning while avoiding issues like overfitting or underfitting. The balance achieved here is crucial for complex multi-label tasks like tuberculosis classification, where the model needs to process both local and global image features efficiently.
When comparing this performance to other batch sizes, batch size 4 yielded lower accuracy (0.833) and a much higher loss (0.531). This suggests that smaller batch sizes may lead to noisier updates and less stable training, as seen in the higher loss figure, which indicates that the model struggles to converge. Conversely, batch size 16 saw a slight drop in accuracy (0.884) and an increase in loss (0.341). Larger batch sizes can lead to less frequent weight updates, which slows convergence and can lead to suboptimal performance, as evidenced by the increase in both loss and lower AUC (0.480).
The batch size of 8 strikes the right balance between frequent updates (allowing for faster learning) and stability (ensuring better convergence). This balance is particularly important for training deep models like the hybrid EfficientNetV2L-ViT.
The model’s strong performance was also attributed to the fine-tuning of other hyperparameters, such as the learning rate of 0.0001, which ensured that the model makes smaller, more precise weight updates, leading to better convergence. Furthermore, class weighting plays a critical role in addressing the class imbalance in the dataset, ensuring that underrepresented classes receive appropriate attention during training. This is especially relevant for multi-label tasks, where some labels (e.g., rare TB-related anomalies) are significantly underrepresented. By giving more weight to these classes, the model avoids bias toward the more common labels.
Finally, the use of focal loss further improves model performance by focusing on harder-to-classify examples. In multi-label tasks where labels overlap or are imbalanced, focal loss ensures that the model focuses on correctly classifying difficult cases, reducing the chance of misclassification. This tuning, combined with the optimal batch size of 8, allows the hybrid model to effectively handle the complex task of TB detection, maximizing accuracy and minimizing loss.
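The two imbalance remedies described above can be sketched numerically. The following is our own minimal illustration (the function names, the γ = 2 setting, and the inverse-frequency weighting scheme are assumptions for demonstration, not the authors' exact implementation):

```python
import numpy as np

def class_weights(label_matrix):
    """Inverse-frequency weights: rarer labels contribute more to the loss."""
    freq = label_matrix.mean(axis=0)        # fraction of positives per label
    return 1.0 / np.clip(freq, 1e-6, None)

def focal_loss(y_true, p_pred, gamma=2.0, eps=1e-7):
    """Binary focal loss: down-weights easy examples via (1 - p_t)^gamma."""
    p = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)   # probability of the true label
    return float(np.mean(-((1 - p_t) ** gamma) * np.log(p_t)))

# A label present in every sample gets weight 1; one present in half gets 2:
cw = class_weights(np.array([[1, 0], [1, 1], [1, 0], [1, 1]]))

# A confident correct prediction incurs far less loss than a confident error,
# so training gradients concentrate on the hard, misclassified cases:
easy = focal_loss(np.array([1.0]), np.array([0.9]))
hard = focal_loss(np.array([1.0]), np.array([0.1]))
```

The `(1 - p_t)^gamma` factor is what shifts the optimization toward the harder-to-classify, typically minority-class, examples discussed above.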
Following the results shown in Figure 7, the hybrid model achieved significant improvements in multi-label classification accuracy. Its predictions demonstrate notable gains over the individual results of the EfficientNetV2L and ViT Base models, as the hybrid architecture effectively combines the strengths of both, particularly in multi-label classification.
First, the hybrid model showed enhanced accuracy in multi-label prediction, as shown in Figure 8. For example, in cases with multiple abnormalities such as infiltrate and lymphadenopathy, the hybrid model balanced the confidence levels between the labels, predicting infiltrate at 46% and lymphadenopathy at 24%. This represents a significant improvement in consistency compared to EfficientNetV2L, which struggled with lower confidence when predicting multiple labels.
Additionally, the hybrid model provided more balanced predictions for complex cases involving overlapping conditions like infiltrate, consolidation, and bronchiectasis. The model predicted these conditions with greater confidence: infiltrate 46%, consolidation 41%, and bronchiectasis 36%. This shows a better integration of local and global features, addressing the limitations of the individual models.
In terms of detecting pleural thickening and consolidation, the hybrid model performed significantly better than the standalone models. For instance, in one case, it predicted consolidation 41% and pleural thickening 32% with greater accuracy and balance. This improved detection of subtle abnormalities can be attributed to the hybrid model’s ability to extract both fine-grained and high-level features effectively.
Moreover, the hybrid model maintains strong performance on normal cases, predicting normal 34% with balanced confidence. This consistency across both abnormal and normal cases demonstrates the model’s robustness, reducing the likelihood of false positives, which is crucial for clinical accuracy.
3.4. Evaluation Metrics Analysis of EfficientNetV2L, ViT Base, and Hybrid Model
The performance of the three models—EfficientNetV2L, ViT Base, and the Hybrid EfficientNetV2L-ViT Base model—was evaluated using key metrics: Accuracy, AUC, Precision, Recall, F1 Score, and Hamming Loss. Table 8 provides a detailed comparison, highlighting the strengths and weaknesses of each model.
Accuracy: The accuracy scores across the models reflect the complexity of the multi-label classification task. The Hybrid EfficientNetV2L-ViT Base model achieved the highest accuracy at 0.911, outperforming both EfficientNetV2L (0.870) and ViT Base (0.874). This improvement demonstrates the hybrid model’s capability to leverage the strengths of CNN and Transformer architectures for enhanced performance in recognizing complex patterns of TB-related anomalies.
AUC: Although the AUC scores remain modest due to the challenges of multi-label classification with imbalanced data, the Hybrid model recorded an AUC of 0.510, showing a slight improvement over ViT Base (0.500) and EfficientNetV2L (0.460). This indicates that the hybrid approach is marginally more effective at differentiating between classes, likely due to the combination of EfficientNetV2L’s local feature extraction and ViT’s capacity for global context representation.
Precision and Recall: Precision values were generally low; the Hybrid model scored 0.181, below both ViT Base (0.319) and EfficientNetV2L (0.238). This low precision across models indicates a tendency toward false positives, common in multi-label tasks with overlapping classes. In terms of recall, however, the Hybrid model achieved a perfect score of 1.000, significantly higher than ViT Base (0.485) and EfficientNetV2L (0.673). This reflects the Hybrid model’s effectiveness in capturing all relevant cases, an essential trait in medical diagnostics where minimizing false negatives is critical.
F1 Score: The F1 score, which balances precision and recall, summarizes how each model handles both false positives and false negatives. The Hybrid model achieved an F1 score of 0.301, surpassing ViT Base (0.217), though slightly below EfficientNetV2L (0.317) owing to its lower precision despite perfect recall. The score nonetheless reflects the Hybrid model’s ability to maintain balanced performance on a multi-label, imbalanced dataset.
Hamming Loss: Hamming Loss, which quantifies the fraction of incorrect labels, was lowest for the Hybrid model at 0.326, followed by ViT Base (0.464) and EfficientNetV2L (0.526). The lower Hamming Loss for the Hybrid model indicates fewer label misclassifications, further supporting its suitability for this multi-label classification task.
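For clarity, the micro-averaged metrics discussed above can be re-derived from the raw label matrices. The sketch below is our own minimal re-implementation for illustration (not the authors' evaluation code; standard definitions are assumed):

```python
import numpy as np

def micro_metrics(y_true, y_pred):
    """Micro-averaged precision, recall, F1, and Hamming loss over all label bits."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    hamming = float(np.mean(y_true != y_pred))  # fraction of incorrect label bits
    return precision, recall, f1, hamming

# Two samples, three labels; one spurious positive in the first sample:
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 1, 1], [0, 1, 0]])
p, r, f1, h = micro_metrics(y_true, y_pred)
```

Note how a single false positive leaves recall at 1.000 while lowering precision and raising Hamming loss, the same pattern seen in the Hybrid model's scores.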
3.5. Performance and Parameter of Models
Table 9 provides an overview of the model parameters, training time (in minutes), and loss for each of the models: EfficientNetV2L, ViT Base, and the hybrid EfficientNetV2L-ViT Base.
The hybrid model, with a total of 177,975,727 parameters, combines the advantages of both EfficientNetV2L and ViT Base architectures, resulting in a training time of 23.85 min and the lowest loss of 0.2854 among the models tested. This demonstrates that, despite its increased parameter count, the hybrid model achieves a favorable balance between performance and efficiency, surpassing the standalone models in terms of both accuracy and loss.
The EfficientNetV2L model, while robust with 118,410,415 parameters, required a longer training time of 35.57 min, which reflects its high computational demand. Although it achieved a reasonably low loss of 0.323, it did not outperform the hybrid model.
The ViT Base model, with 88,903,415 parameters, was the most lightweight among the three models and had the shortest training time at 4.51 min. However, this efficiency comes at a cost in terms of performance, as it recorded a slightly higher loss of 0.327, indicating that it is less effective in handling complex multi-label classifications compared to the hybrid model.
The hybrid EfficientNetV2L-ViT Base model demonstrates the best trade-off between training time and performance, achieving a balance between parameter complexity and effectiveness in multi-label classification. Future improvements could explore further optimization in architecture to reduce the computational cost while maintaining high accuracy and low loss.
3.6. Comparative Analysis of Hybrid Model Variants: The Role of Focal Loss and Class Weight
This section compares the performance and efficiency of three hybrid model variants: the hybrid model with both focal loss and class weight, the hybrid model without focal loss, and the hybrid model without class weight. The analysis highlights the importance of these techniques in addressing class imbalance and improving multi-class, multi-label classification tasks.
3.6.1. Performance Metrics Comparison
Table 10 shows the performance metrics of the three model variants in terms of accuracy, AUC, precision, recall, F1 score, and Hamming loss. The hybrid model with both focal loss and class weight achieves the best overall performance, particularly in recall (1.000), which is critical for ensuring no anomalies are missed in clinical applications.
Hybrid without Focal Loss: Precision drops to 0.120, while recall decreases significantly to 0.712. This indicates that, without focal loss, the model struggles to identify minority classes effectively, leading to more false negatives and a lower F1 score.
Hybrid without Class Weight: Precision improves slightly (0.150), but recall remains low (0.750) compared to the default model. Hamming loss increases, showing the model’s decreased ability to handle multi-label predictions effectively.
3.6.2. Efficiency Metrics Comparison
Table 11 compares the computational efficiency of the models in terms of training time and loss.
Training Time: Removing focal loss or class weight slightly reduces training time due to simplified loss calculations, but the performance trade-off is significant.
Loss: The default hybrid model has the lowest training loss (0.285), while the other two variants show higher losses, indicating less effective optimization.
3.6.3. Importance of Focal Loss and Class Weight
Focal Loss: Focal loss helps tackle the class imbalance issue by assigning higher penalties to misclassified samples of minority classes. This makes the model focus more on learning difficult or underrepresented classes, leading to significantly better recall and a more balanced F1 score. As seen in the results, removing focal loss caused a drastic reduction in recall and overall model performance, particularly for rare anomalies.
Class Weight: Class weight addresses the imbalance by adjusting the contribution of each class to the total loss, ensuring that minority classes are not overshadowed by majority ones during training. Without class weighting, the model shows lower performance across all metrics, especially in recall and Hamming loss. This underscores its importance in multi-label scenarios, where balancing predictions across multiple classes is critical.
Clinical Implications: In clinical contexts, false negatives can have severe consequences, such as missing critical diagnoses. The default hybrid model with focal loss and class weight demonstrates its superiority by minimizing false negatives (high recall) and ensuring better overall model performance. Additionally, reducing Hamming loss is essential in multi-label classification to avoid misclassifications that could lead to unnecessary or incorrect medical interventions.
The comparative analysis clearly demonstrates the crucial role of focal loss and class weight in improving both the performance and reliability of the hybrid model. These techniques effectively address class imbalance, enhance multi-label classification performance, and are vital for deploying models in sensitive domains such as healthcare.
3.7. Separated Confusion Matrices for Positive and Negative Predictions
The confusion matrices, as shown in Figure 9, provide a detailed look at the model’s performance in predicting TB-related abnormalities across various classes. The matrix on the left summarizes positive predictions (true positives and false positives), while the matrix on the right summarizes negative predictions (true negatives and false negatives).
High True Positives (TP) for Common Conditions: Conditions like infiltrate, pleural effusion, and fibrosis show relatively high true positive (TP) rates, indicating the model’s effective detection of these conditions. This suggests that the model is well-calibrated to recognize these more common TB-related abnormalities, which might have more distinct features that the model can capture reliably.
Low False Positives (FP) Across Classes: The confusion matrix reveals that false positives (FP) are relatively low across most classes, indicating the model’s conservative approach in identifying anomalies. This low FP rate is beneficial for clinical application, as it reduces the chances of misclassifying healthy patients as having TB-related abnormalities.
False Negatives (FN) Impacting Sensitivity for Rare Anomalies: In the negative predictions matrix (right), certain classes like cavity, lymphadenopathy, and tuberculoma show notable numbers in the FN column. This suggests that the model is less sensitive to these rare conditions, possibly due to their lower representation in the dataset, leading to a higher likelihood of missing these anomalies. Addressing this issue would require either data augmentation or more targeted training to improve detection sensitivity for these less common conditions.
Balanced Prediction for Conditions with Lower Complexity: For simpler and more distinct classes, such as normal and consolidation, the model maintains a good balance between TPs and low FNs, reflecting reliable performance in these categories. This balance is crucial for the model’s utility in clinical settings, where it must accurately differentiate between normal and abnormal cases.
Challenges with Overlapping Features: Some conditions, such as fibroinfiltrate and bronchiectasis, exhibit moderate FNs. This could be due to the overlapping nature of their radiographic features with other conditions, leading to occasional misclassifications. This observation highlights a challenge in multi-label classification tasks where conditions have similar radiological characteristics, which can confuse the model.
The confusion matrices indicate that the hybrid model is effective in detecting common TB-related abnormalities but has limitations in detecting rare or complex conditions. Enhancing the model’s sensitivity to these conditions may require balanced datasets and additional feature refinement, especially for overlapping or visually similar anomalies.
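In a multi-label setting, the separated matrices above amount to per-label TP/FP/FN/TN counts obtained by thresholding each label's probability independently. The following sketch is our own illustration of that bookkeeping (the 0.5 threshold and the toy data are assumptions, not values from the paper):

```python
import numpy as np

def per_label_confusion(y_true, y_prob, threshold=0.5):
    """Per-label TP/FP/FN/TN counts for a multi-label classifier."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)
    fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
    fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)
    tn = ((y_true == 0) & (y_pred == 0)).sum(axis=0)
    return tp, fp, fn, tn

# Three images, two labels; the second label collects one false positive:
y_true = np.array([[1, 0], [1, 1], [0, 0]])
y_prob = np.array([[0.8, 0.2], [0.4, 0.7], [0.1, 0.6]])
tp, fp, fn, tn = per_label_confusion(y_true, y_prob)
```

Rare classes with few positive samples accumulate FN counts quickly under this scheme, which is the sensitivity issue noted above for cavity, lymphadenopathy, and tuberculoma.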
3.8. Single Confusion Matrix with True Classes and Predicted Classes
The confusion matrix, as shown in Figure 10, provides a detailed assessment of the hybrid model’s performance across various TB-related abnormalities. Key observations are broken down below:
True Positives for Common Classes: Classes like pleural effusion, fibrosis, and infiltrate exhibit relatively high numbers in the diagonal elements, indicating that the model is effectively identifying these common TB-related abnormalities. This suggests that the model has learned to recognize specific features associated with these conditions accurately.
High False Negatives in Rare Classes: Rare conditions such as lymphadenopathy, miliary, and tuberculoma show a high number of false negatives, where actual instances of these classes are misclassified as other conditions. This limitation reflects the model’s difficulty in recognizing less frequent abnormalities, likely due to an imbalance in the dataset or insufficient distinctive features to differentiate these classes from others.
Misclassification Between Similar Conditions: Conditions with overlapping or visually similar features, such as fibroinfiltrate and fibrosis, or pleural effusion and consolidation, show notable misclassification rates. This suggests that the model struggles to differentiate between these classes, likely due to similarities in radiographic appearance. Such misclassification is common in multi-label medical imaging tasks where conditions share anatomical or structural traits.
Confusion with Normal Cases: The model occasionally misclassifies abnormal cases, such as pleural thickening and bullae, as normal, indicating that subtle anomalies may be challenging for the model to detect. This could reduce the sensitivity of the model in clinical applications, where missing abnormal cases can have significant implications.
Impact of Class Imbalance on Predictions: The matrix shows an imbalance in detection accuracy across classes, with more prevalent classes like fibrosis and pleural effusion having better detection rates than rare classes. This highlights the need for further class balancing or augmentation to improve the model’s sensitivity to underrepresented classes.
In summary, the confusion matrix reveals that, while the hybrid model effectively identifies common TB-related abnormalities, it faces challenges with rare or visually similar conditions. Improvements could focus on enhancing the dataset’s balance and incorporating more advanced distinguishing features to aid in differentiating between similar abnormalities.
3.9. Saliency Map of the Hybrid Model
To gain insights into the interpretability of the hybrid model, saliency maps were generated for various chest X-ray images, each annotated with multiple TB-related abnormalities. The purpose of these saliency maps is to visualize the regions of the images the model focuses on to make its predictions, revealing its interpretive process and alignment with clinically significant areas. This analysis serves as a qualitative evaluation of the model’s capacity to identify relevant pathological features in complex, multi-label scenarios.
Each set of images in the saliency map figures presents three visualizations to interpret how the hybrid model detects TB-related anomalies in chest X-ray (CXR) images. The leftmost image is the original CXR, showing the patient’s lungs and thoracic region, labeled with ground truth diagnoses (e.g., pleural effusion, consolidation). The middle image is the saliency map, highlighting the regions the model deems critical for its prediction, with brighter red areas indicating stronger focus. The rightmost image is a colorful overlay combining the original CXR and the saliency map, allowing for a direct comparison between the model’s attention regions and the anatomical structures in the X-ray. Together, these visualizations offer insights into the model’s interpretative focus, showing how well it aligns with clinically relevant areas for TB diagnosis.
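The core idea behind such saliency maps is to measure how sensitive the model's class score is to each input pixel. The sketch below illustrates this with a toy differentiable scorer and a finite-difference gradient; it is our own simplified stand-in, not the authors' visualization pipeline (which operates on the full hybrid network):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((4, 4))          # tiny stand-in for a CXR
weights = np.zeros((4, 4))
weights[1:3, 1:3] = 1.0             # toy scorer only "attends" to the centre

def score(x):
    """Toy class score: a linear function of the pixels."""
    return float(np.sum(weights * x))

def grad_saliency(x, f, eps=1e-6):
    """Absolute gradient of the score w.r.t. each pixel (central differences)."""
    g = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            d = np.zeros_like(x)
            d[i, j] = eps
            g[i, j] = (f(x + d) - f(x - d)) / (2 * eps)
    return np.abs(g)

saliency = grad_saliency(image, score)
saliency = saliency / saliency.max()   # normalize to [0, 1] for overlay
```

For this linear scorer the saliency recovers exactly the region the model uses; in practice, frameworks compute the gradient by backpropagation rather than finite differences, and the normalized map is colour-mapped and overlaid on the CXR as in Figures 11 and 12.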
Figures 11 and 12 present saliency maps for several TB anomaly cases generated by the hybrid model, illustrating the regions of focus used by the model to detect various anomalies in chest X-ray images.
Case 1: Pleural Effusion and Consolidation. In this case, the saliency map emphasizes regions in the lung areas commonly associated with pleural effusion and consolidation, particularly in the middle and lower lobes. The highlighted regions on the saliency map indicate the model’s focus on areas of fluid accumulation and lung tissue opacity, which are characteristic of these conditions. The overlay map reinforces this by showing intensified attention in regions that align with typical clinical presentations of pleural effusion and consolidation.
Case 2: Consolidation and Pneumothorax. In this case, the saliency map reveals focused attention on the upper lung regions, where pneumothorax and consolidation effects are typically observed. The highlighted areas suggest the model’s sensitivity to abnormal air accumulation and tissue consolidation patterns within these areas. The colorful overlay further illustrates that the model emphasizes regions where pneumothorax and consolidation are expected, indicating its capability to recognize both conditions effectively, in line with clinical observations.
The saliency maps demonstrate that the hybrid EfficientNetV2L-ViT model effectively highlights regions associated with various TB-related anomalies, with its attention aligning well with clinical expectations. For instance, it focuses on the upper lung regions for pneumothorax and the lower lung areas for pleural effusion, indicating its ability to generalize across multiple TB-related pathologies.
However, some interpretability limitations are noted, particularly in cases where the saliency maps appear diffuse, making it difficult for the model to localize attention to specific regions. This may reflect the inherent challenges in multi-label classification, where highly overlapping abnormalities complicate precise localization. Notably, the color-encoded regions shown in the saliency maps correspond to the pathological processes confirmed in the original chest X-ray. However, some highlighted areas lie outside the annotated abnormalities, potentially reflecting non-pathological structures or regions the model incorrectly considers relevant. Future work could address these discrepancies through refinement efforts, such as advanced visualization techniques or additional attention mechanisms, to enhance interpretability and improve localization for complex multi-label cases.
Overall, these saliency maps support the robustness and clinical applicability of the hybrid model, underscoring its potential for real-world application in TB diagnostics.
3.10. Inference Time and Resource Utilization per Image Prediction
To evaluate the efficiency and resource requirements of the hybrid EfficientNetV2L-ViT Base model for single-image predictions, the model was tested on a set of 74 images.
Average Inference Time: The model demonstrated an average inference time of approximately 0.135 s per image. This rapid processing rate highlights the model’s capability to handle high-throughput demands, enabling efficient image analysis workflows, especially valuable in clinical settings where timely results are essential.
GPU Utilization per Image: During testing, the hybrid model utilized an average of 6.1 GB of VRAM on an NVIDIA GPU (manufactured by NVIDIA Corporation, Santa Clara, California, USA). This GPU is part of the computational infrastructure provided by the Big Data Center (BDC) IMERI, located in Indonesia (as mentioned in
Section 2.6). The GPU has a total of 15 GB of VRAM and processed a batch of 74 images, translating to an average GPU memory usage of approximately 82.4 MB per image. This moderate per-image VRAM usage indicates that the model is optimized for efficient resource usage and could potentially run on mid-range GPUs, making it suitable for deployment even in settings with limited GPU resources.
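The per-image figures above follow from simple arithmetic over the batch measurements; a minimal sketch using the reported values (taking 1 GB as 1000 MB):

```python
# Per-image resource figures derived from the reported batch measurements:
# 6.1 GB of VRAM for a 74-image batch, 0.135 s average per image.
batch_vram_gb = 6.1
n_images = 74
per_image_mb = batch_vram_gb * 1000 / n_images   # ~82.4 MB per image
throughput = 1 / 0.135                           # ~7.4 images per second
print(f"{per_image_mb:.1f} MB/image, {throughput:.1f} images/s")
```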
4. Discussion
The integration of EfficientNetV2L and ViT Base in this study offers a robust solution for handling the complexities of multi-label classification in tuberculosis (TB) anomaly detection from chest X-ray (CXR) images. Unlike typical multi-class classification, where each image is assigned a single label, multi-label classification requires the model to predict multiple labels for each image. This task is inherently more complex because labels overlap and are unevenly distributed across the dataset, presenting unique challenges in medical imaging tasks such as TB anomaly detection.
Comparison with Existing Approaches: Our study contrasts with recent studies that have primarily focused on simpler binary classification tasks or the identification of only a few TB-related abnormalities.
For instance, a study using CNN architectures like InceptionV3, Xception, ResNet50, VGG19, and VGG16 achieved 87% accuracy, but only classified images as either TB-positive or TB-negative, without addressing the variety of TB-related lesions that can coexist in a patient [
7]. Similarly, another work focused on distinguishing between four lesion types (infiltrations/bronchiectasis and opacity/consolidation), achieving 100% accuracy but covering a much narrower range of abnormalities [
11].
In comparison, our model aimed to classify 14 distinct TB-related abnormalities, offering a more comprehensive approach. While UNet-based models combined with Xception have reached accuracy as high as 99.29% in binary TB classification tasks [
12], these models do not address the complexities of multi-label classification, where several abnormalities may overlap in a single image.
Compared with these existing studies, our work demonstrates the potential advantages of combining CNNs and ViTs for handling a wider range of TB abnormalities while also focusing on model robustness.
Performance and Model Selection: In CNN-based experiments, five CNN architectures and two vision transformer (ViT) models were evaluated to identify the best-performing models. EfficientNetV2L was selected for its balanced performance, with high accuracy (0.870) and low loss (0.323) compared to other CNN models. EfficientNetV2L’s ability to handle multi-label tasks efficiently stems from its compound scaling mechanism, which balances depth, width, and resolution for more efficient feature extraction across different input sizes. This feature is essential in medical imaging, where detecting subtle patterns can significantly affect classification accuracy [
16].
In ViT-based experiments, the ViT Base model was chosen for its computational efficiency and capacity to capture global image features. Although ViT Large demonstrated slightly better performance with an accuracy of 0.879 and an AUC of 0.565, its high computational cost and longer training time due to its large parameter count (306 million) made ViT Base a more practical choice. ViT Base, with its accuracy of 0.874 and AUC of 0.500, provided nearly equivalent results with fewer parameters (88.9 million), making it a more balanced option for the hybrid model [
13].
Batch Size and Fine-Tuning: In hybrid-based experiments, we opted not to implement early stopping during training of the hybrid EfficientNetV2L-ViT Base model, choosing instead to allow training to continue for a fixed 50 epochs. This decision was made to ensure consistent convergence across the entire dataset and to observe the model’s full learning curve without interruption. Although early stopping can prevent overfitting by halting training once the validation loss ceases to improve, we aimed to thoroughly evaluate the impact of training across the defined epochs, especially given the model’s multi-label classification complexity.
Batch size played a critical role in balancing frequent weight updates with stable training. After testing various batch sizes, a batch size of 8 yielded the best performance, achieving a high accuracy (0.911), low loss (0.285), and competitive AUC (0.510). Smaller batch sizes, such as 4, produced noisier updates, leading to instability and higher loss (0.531). Conversely, larger batch sizes, such as 16, were more computationally efficient but sacrificed some accuracy (0.884) and showed a higher loss (0.341) due to delayed convergence. Therefore, a batch size of 8 provided an optimal balance, enabling the model to manage the multi-label nature of the dataset effectively without encountering overfitting or underfitting.
Training Stability Technique: Based on the loss curves in
Figure 7, it is evident that the model achieves convergence relatively early (<50 epochs), with minimal divergence between training and validation loss. Since the training process is guided by the loss function, the stability of the loss plot is a reliable indicator of model performance. The Adam optimizer, with its adaptive learning rate mechanism, plays a crucial role in promoting efficient convergence and better generalization by dynamically adjusting the learning rate during training. Given that the model converges early, extending the number of epochs may lead to overfitting, a condition where early stopping is advisable. This is consistent with findings from Kingma and Ba [
30], which emphasize Adam’s effectiveness in improving optimization efficiency and enhancing model stability during training.
Impact of Class Weighting and Focal Loss: Class imbalance posed a significant challenge in this study, as certain TB-related abnormalities, like infiltrates, appeared far more frequently than others, such as bullae or tuberculoma. This imbalance, combined with the multi-label nature of the data, required strategies that allowed the model to fairly treat all classes, regardless of their frequency.
Class weighting was employed to mitigate this issue by assigning higher importance to underrepresented classes, ensuring that the model did not bias towards the more frequent labels. This technique improved model generalization, enabling better predictions for rare abnormalities like bullae [
24]. By increasing the influence of these rarer labels during training, the model could better capture the diverse range of TB-related anomalies.
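One common way to derive such weights is inverse-frequency weighting over the per-label positive counts. A minimal sketch, with an illustrative (not actual) three-label target matrix standing in for the study’s 14-label data:

```python
import numpy as np

# Hypothetical multi-label target matrix: rows are images, columns are
# abnormality labels (1 = present). Values are illustrative only.
Y = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 0],
              [1, 0, 1]])  # first label dominates, last label is rare

pos_counts = Y.sum(axis=0)            # positives per label: [4, 1, 1]
n_samples, n_labels = Y.shape
# Inverse-frequency weighting: rare labels receive proportionally
# larger weight; the max() guard avoids division by zero.
class_weights = n_samples / (n_labels * np.maximum(pos_counts, 1))
print(class_weights)  # the rare third label gets the largest weight
```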
Additionally, focal loss was used to handle the imbalance further. As described by Lin et al. (2017), focal loss focuses more on harder-to-classify examples, making it well-suited for multi-label classification tasks where overlapping and imbalanced labels are prevalent [
22]. By focusing more on these difficult cases, the model improved its performance on rare and challenging anomalies, as demonstrated by the reduction in loss and the improvements in accuracy across various configurations.
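Focal loss itself is compact; the sketch below implements the binary form of Lin et al.’s formulation as it would apply per label in a multi-label setting (a NumPy stand-in for the framework’s loss, with the standard defaults gamma = 2, alpha = 0.25):

```python
import numpy as np

def binary_focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Per-label binary focal loss (Lin et al., 2017).

    y_pred holds sigmoid probabilities; gamma down-weights easy,
    well-classified examples, alpha rebalances positives vs. negatives.
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, y_pred, 1 - y_pred)
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))

# An easy, correct prediction contributes far less than a hard error:
easy = binary_focal_loss(np.array([1.0]), np.array([0.95]))
hard = binary_focal_loss(np.array([1.0]), np.array([0.30]))
print(easy, hard)  # the hard example dominates the loss
```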
Comparative Analysis of ViT and CNN: ViTs have gained attention more recently as an alternative approach to image understanding [
13]. Instead of relying on convolutional layers, ViTs adopt the Transformer architecture, originally developed for natural language processing tasks, to treat images as sequences of patches processed using self-attention mechanisms. By modeling global interactions among patches, ViTs capture long-range dependencies and relationships in images, enabling them to understand both local and global contexts. This allows ViTs to grasp the holistic structure of an image effectively, yielding promising results in image classification, object detection, and even tasks like image generation.
Compared to CNNs, ViTs have several advantages. First, ViTs eliminate the need for convolutional operations and can be applied to various input sizes with minimal architectural modification. Additionally, ViTs have demonstrated strong performance on large-scale datasets and complex visual patterns. Finally, ViTs offer greater interpretability, as their attention can be traced to specific image regions, making them especially suitable for tasks requiring localization or attention to fine-grained details.
However, ViTs also have limitations, such as computational expense due to the self-attention mechanism and a requirement for large datasets for optimal performance. CNNs, meanwhile, remain more computationally efficient and have undergone extensive optimization and study.
Deployment Considerations in Real-World Clinical Settings: Deploying the hybrid EfficientNetV2L-ViT Base model in clinical settings, particularly in resource-constrained environments, requires careful consideration of its performance on unseen data and computational efficiency. The model’s average inference time of 0.135 s per image and VRAM usage of around 82.4 MB per image suggest it can efficiently handle high-throughput demands and operate on mid-range GPUs, making it feasible for facilities with limited resources. This compact memory footprint and quick processing speed enable timely TB diagnosis, which is critical in clinical workflows. For healthcare facilities with varying hardware capabilities, the model’s low memory requirements enhance its scalability, allowing it to process multiple images simultaneously on shared or lower-end systems.
In rural areas with limited access to doctors and radiology specialists, this model could significantly support TB diagnosis, provided an X-ray machine is available and operational. By generating early diagnostic results, the model can facilitate timely referrals to physicians for consultation and appropriate treatment, expediting intervention in TB cases. This approach aligns with TB eradication efforts, particularly in regions like Indonesia, where healthcare resources are often constrained, thereby supporting national TB control programs.
Diagnostic Effect and Advantages of AI Models: Traditional CXR interpretation relies heavily on radiologists’ expertise, leading to potential variability and subjectivity in diagnosis. This can result in missed or delayed TB detection, especially in complex cases involving subtle or overlapping abnormalities. In contrast, the hybrid AI model provides consistent and objective analysis by leveraging its robust feature extraction and attention mechanisms. The model not only identifies multiple TB-related anomalies simultaneously but also highlights key regions of interest through saliency maps, improving interpretability for medical professionals.
Additionally, the AI model significantly reduces diagnostic turnaround time, enabling faster decision-making and improving patient outcomes. This is particularly advantageous in high-volume clinical environments or rural areas where timely diagnosis and treatment initiation are critical. By automating the initial diagnostic process, the model supports overburdened healthcare systems, allowing radiologists and physicians to focus on more complex cases and improving overall workflow efficiency.
Limitations: While the hybrid model demonstrated strong accuracy and low loss, there are notable limitations in its evaluation metrics, particularly regarding AUC, precision, recall, and interpretability as seen in the confusion matrix and saliency map analyses. Each limitation points to areas of complexity inherent in multi-label classification, which presents unique challenges in TB anomaly detection.
Low AUC Score as shown in
Figure 13: The lower AUC score is not uncommon in multi-label classification tasks due to the inherent complexity of predicting multiple overlapping labels. The AUC metric, typically used in binary classification, is less straightforward in multi-label tasks, where each label has its own ROC curve. When averaged, the AUC score can be skewed by the overrepresentation of frequent labels, such as infiltrates, at the expense of rarer labels like bullae. One limitation observed in this study is that certain labels appear more frequently than others, creating an imbalance that can lead the model to predict these common labels more often. This can diminish sensitivity to rare labels, further skewing AUC scores. While techniques like focal loss and class weighting were applied to address these biases, these initial measures highlight the need for further refinement to enhance the model’s performance across all TB-related anomalies. As Saito and Rehmsmeier [
31] note, AUC can be biased toward majority classes, making it less reliable for assessing minority class performance in imbalanced datasets. Although the AUC score of 0.510 may seem low, it reflects the complexity of the task rather than any major deficiency in the model’s ability to detect TB-related anomalies. Our model, however, demonstrates strong performance in other metrics, such as accuracy (0.911) and recall (1.000), which better capture its effectiveness in handling multi-label classifications. These metrics show that, despite a lower AUC, the model maintains high sensitivity and generalization, ensuring clinical relevance by identifying TB-related anomalies across various categories. Clinically, it would be counterproductive to exclude specific labels to artificially boost AUC, as each carries crucial diagnostic value. Moving forward, we plan to enhance class balance further through refined class weighting and selective oversampling/undersampling to better represent rare classes. Additionally, increasing data samples for underrepresented classes, though resource-intensive, will be prioritized to improve model sensitivity toward rarer TB-related anomalies.
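The skew described above is easy to reproduce: macro-averaged AUC weights every label equally, so a single poorly ranked rare label drags the average down even when a frequent label is ranked perfectly. A self-contained sketch with hypothetical scores (AUC computed via the rank formulation, so no external metric library is needed):

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC via the Mann-Whitney (pairwise ranking) formulation."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # fraction of (positive, negative) pairs ranked correctly; ties = 0.5
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical: perfect ranking on a frequent label, poor on a rare one.
y_common = np.array([1, 1, 1, 1, 0, 0])
s_common = np.array([0.9, 0.8, 0.7, 0.6, 0.2, 0.1])
y_rare = np.array([1, 0, 0, 0, 0, 0])
s_rare = np.array([0.3, 0.5, 0.4, 0.2, 0.6, 0.1])

per_label = [auc_score(y_common, s_common), auc_score(y_rare, s_rare)]
macro_auc = np.mean(per_label)  # both labels weighted equally
print(per_label, macro_auc)     # one bad rare label halves the average
```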
Class Imbalance Impact on Precision and Recall: As indicated in the confusion matrix analysis, the hybrid model exhibits uneven sensitivity and specificity across different classes. Common anomalies are detected more consistently, while rarer classes, due to their limited representation in the training dataset, are often overlooked. This imbalance results in lower precision for certain classes, as the model may favor frequent labels in its predictions. Techniques like oversampling, undersampling, or advanced class weighting methods may further enhance the model’s sensitivity and precision, ensuring balanced performance across all TB-related anomalies.
Saliency Map Interpretability Limitations: While saliency maps provide visual insight into the model’s attention regions, they show only general activation patterns rather than precise, clinically relevant features. This limitation can make it difficult for radiologists to interpret the model’s reasoning, especially for subtle TB-related anomalies. The saliency maps, while useful for assessing model focus, lack the granularity needed for more complex interpretability. Future work could explore more advanced interpretability techniques, such as Grad-CAM or guided backpropagation, to provide clearer insights into how the model makes specific decisions, potentially aligning the output more closely with radiological assessment.
Challenges with Rare Anomaly Detection in Confusion Matrix Analysis: The confusion matrix analysis reveals that the model’s performance on rare anomalies like miliary and bullae is limited, with a tendency to misclassify these conditions as “Not Present.” This may stem from the imbalance in label frequency and the lack of distinctive features for these rare anomalies, which are underrepresented in the dataset. A future improvement could involve augmenting these rarer classes to provide the model with more examples of each anomaly, potentially through synthetic data generation or targeted augmentation strategies.
Limitations in Recall and Precision: While the model shows acceptable recall for certain common classes, precision remains a challenge. This means that, while the model is reasonably effective at identifying when an anomaly is present, it occasionally misclassifies images without the condition as positive cases. This limitation may lead to false positives, which, in a clinical context, could result in unnecessary follow-ups. Refining the threshold for each class or using ensemble methods to confirm predictions could potentially improve precision, particularly for multi-label cases.
Future Directions: Future research in tuberculosis chest X-ray classification with a hybrid CNN and vision transformer model could explore several key areas to enhance model performance and applicability:
Future Improvements with Advanced Architectures: Although the hybrid EfficientNetV2L-ViT Base model has demonstrated promise, more advanced architectures such as Swin Transformer and ConvNeXt could offer additional performance gains. These architectures provide enhanced feature extraction capabilities and may better capture complex patterns within CXR images, particularly in a multi-label setting. Implementing such advanced architectures could address current limitations in rare anomaly detection and overall AUC performance.
Transfer learning: Examine the potential of transfer learning by leveraging pre-trained models on large-scale datasets, such as ImageNet or CheXpert, and fine-tune them on tuberculosis chest X-ray images. This could enhance the model’s performance by incorporating diverse visual features learned from broader datasets.
Interpretability and explainability: Improve the interpretability and explainability of the hybrid model’s predictions. Techniques such as attention maps and saliency mapping can shed light on the model’s decision-making process, revealing image regions contributing most to the classification.
Real-world deployment and clinical validation: Validate the hybrid model’s performance on larger and more diverse clinical datasets, involving multiple medical institutions and patient populations. Collaborate with healthcare professionals to ensure the model’s applicability, reliability, and clinical relevance.
By exploring these research directions, one can advance the field of tuberculosis chest X-ray classification, improve the accuracy and robustness of hybrid CNN and vision transformer models, and ultimately contribute to more effective tuberculosis diagnosis and patient care.
Future Research Directions and Proposed Solutions: This section outlines a plan for future extensions of the research to tackle the identified challenges effectively. The following suggested strategies aim to improve model performance, interpretability, and clinical applicability:
One major challenge identified is the significant class imbalance in the dataset, particularly for rare anomalies such as bullae or tuberculoma. This imbalance results in high false negative rates and reduced model sensitivity for underrepresented conditions. To address this, future research can implement advanced data augmentation techniques, such as generative adversarial networks (GANs) and diffusion models, to synthetically generate diverse samples for rare conditions [
32]. Additionally, dynamic re-weighting strategies like class-balanced loss, which adjusts weights based on the effective number of samples, could be employed [
33]. Oversampling techniques, including SMOTE, are also recommended to ensure a more balanced dataset representation [
34].
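The class-balanced loss mentioned above replaces raw counts with an "effective number" of samples, E_n = (1 − β^n)/(1 − β), and weights each label by its inverse, so rare labels are boosted less aggressively than under plain inverse frequency. A sketch with illustrative counts (not the study’s actual label frequencies):

```python
import numpy as np

# Class-balanced weighting (Cui et al., 2019): the effective number of
# samples grows sub-linearly with the raw count. Counts are illustrative.
counts = np.array([5000, 300, 40])   # e.g. a common, a mid, and a rare label
beta = 0.999
effective_num = (1.0 - beta ** counts) / (1.0 - beta)
weights = 1.0 / effective_num
weights = weights / weights.sum() * len(counts)  # normalize to mean 1
print(weights)  # the rarest label receives the largest weight
```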
Another area requiring improvement is precision in multi-label tasks, where the current hybrid model suffers from a high rate of false positives, as evidenced by its low precision score (e.g., 0.181). Threshold optimization for each label, using precision–recall trade-off analysis, could help balance precision and recall [
35]. Post hoc calibration methods like Platt scaling or isotonic regression may also improve the reliability of predicted probabilities [
36].
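Per-label threshold optimization of the kind suggested here can be sketched as a simple F-beta sweep over candidate cut-offs (the label scores below are hypothetical, and beta = 1 recovers the ordinary F1 trade-off):

```python
import numpy as np

def best_threshold(y_true, probs, beta=1.0):
    """Sweep candidate thresholds for one label and keep the one with
    the best F-beta score -- a basic precision-recall trade-off search."""
    best_t, best_f = 0.5, -1.0
    for t in np.unique(probs):
        pred = (probs >= t).astype(int)
        tp = int(np.sum((pred == 1) & (y_true == 1)))
        fp = int(np.sum((pred == 1) & (y_true == 0)))
        fn = int(np.sum((pred == 0) & (y_true == 1)))
        if tp == 0:
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        f = (1 + beta**2) * prec * rec / (beta**2 * prec + rec)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# Hypothetical scores for one label: the default 0.5 cut-off admits a
# false positive (0.55) that a higher threshold avoids.
y = np.array([0, 0, 0, 1, 1])
p = np.array([0.2, 0.4, 0.55, 0.7, 0.9])
t, f = best_threshold(y, p)
print(t, f)
```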
High misclassification rates for anomalies with overlapping visual features, such as fibroinfiltrate and bronchiectasis, present another critical problem. Feature refinement using advanced attention mechanisms, such as the Convolutional Block Attention Module (CBAM), could help differentiate overlapping features more effectively [
37]. Additionally, adopting multi-task learning (MTL), where auxiliary tasks like feature segmentation are introduced, could improve the model’s ability to learn discriminative feature representations [
38].
Model interpretability and explainability remain essential for clinical applications, yet saliency maps generated by the current hybrid model sometimes highlight irrelevant areas. To improve interpretability, attention-based explainability methods could be adopted, leveraging attention maps to provide clearer visual explanations [
20]. Advanced Grad-CAM techniques, such as Grad-CAM++ or Score-CAM, offer more fine-grained and reliable visualizations [
39]. Quantitative evaluation of saliency maps, using metrics like Intersection over Union (IoU), would further ensure their reliability [
40].
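Quantitative saliency evaluation via IoU reduces to thresholding the map and comparing it against a binary annotation mask; a toy sketch on a 4×4 grid (real masks would come from radiologist annotations):

```python
import numpy as np

def saliency_iou(saliency, mask, thresh=0.5):
    """IoU between a thresholded saliency map and a binary annotation
    mask -- a simple check on whether attention lands on the lesion."""
    hot = saliency >= thresh
    inter = np.logical_and(hot, mask).sum()
    union = np.logical_or(hot, mask).sum()
    return inter / union if union else 0.0

# Toy example: annotation in the upper-left, saliency mostly overlapping
# but spilling one column outside the annotated region.
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
sal = np.zeros((4, 4))
sal[:2, :3] = 0.9
print(saliency_iou(sal, mask))  # 4 overlapping cells / 6 in the union
```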
The hybrid model’s modest improvement in AUC (e.g., 0.510) indicates limited effectiveness in leveraging its added complexity. Model ensembling, combining the hybrid architecture with other high-performing models like Swin Transformer or ConvNeXt, could enhance overall performance through techniques such as soft voting [
41]. Moreover, feature fusion techniques, including cross-attention and multi-scale fusion, could better integrate complementary strengths of the CNN and ViT components, boosting performance [
42].
To ensure clinical relevance, validation on diverse datasets remains crucial. Cross-institutional validation, involving datasets from multiple institutions, could provide a more robust assessment of the model’s generalizability [
43]. Prospective clinical trials would further evaluate the model’s impact on diagnostic workflows, focusing on accuracy, efficiency, and clinician feedback [
44]. Integration with PACS systems for real-time inference could facilitate seamless adoption of the model in clinical settings [
45].
Lastly, the potential of transfer learning should be further explored by leveraging pre-trained models on large medical datasets such as CheXpert or NIH CXR14. Fine-tuning these models on tuberculosis-specific data could significantly enhance performance by incorporating broader visual knowledge [
43].
5. Conclusions
This study developed a hybrid AI model combining EfficientNetV2L and Vision Transformer (ViT) Base for multi-label classification of tuberculosis (TB) abnormalities in chest X-ray (CXR) images. The model’s strength lies in its ability to extract both detailed local features through EfficientNetV2L and capture global image context with ViT Base, addressing the challenge of predicting multiple co-occurring TB-related abnormalities. This hybrid approach outperformed standalone models, particularly in handling complex multi-label medical imaging tasks.
EfficientNetV2L proved highly effective in identifying subtle anomalies like infiltrates and fibrosis, with a solid accuracy of 0.870 and low loss of 0.323. Meanwhile, ViT Base provided almost comparable performance to ViT Large but with far fewer parameters, making it a more efficient option for global feature extraction, crucial in detecting diffuse anomalies like consolidation and lymphadenopathy.
To address class imbalance, focal loss and class weighting were applied, ensuring that rarer abnormalities received appropriate attention during training. This contributed to more balanced learning and improved generalization across TB-related anomalies. Testing revealed that a batch size of 8 offered the optimal balance between frequent weight updates and stable training, achieving the highest accuracy of 0.911, the lowest loss of 0.285, and an efficient inference time. Average inference time per image was approximately 0.135 s, with GPU memory consumption per image at around 82.4 MB, indicating suitability for real-time applications in clinical environments.
One limitation observed was the relatively modest AUC score (0.510), a known challenge in multi-label classification due to overlapping labels and the skew introduced by more frequent classes like infiltrates. This can lead to an underestimation of model sensitivity for rare classes when interpreting AUC scores. Therefore, accuracy and loss metrics are more reliable indicators of the model’s performance in this setting. Additionally, saliency maps and confusion matrices underscored the model’s capacity for effective feature localization, although further refinement is needed to enhance predictions on rarer anomalies.
In conclusion, the hybrid EfficientNetV2L-ViT Base model is a promising solution for multi-label classification in TB detection, with its ability to manage both local and global features effectively. This approach has significant potential to enhance TB diagnostics and could be scaled for real-world clinical applications, particularly in resource-constrained environments. Future work should focus on exploring advanced architectures such as Swin Transformer and ConvNeXt to maximize the model’s clinical utility, improve sensitivity for rare anomalies, and ensure robust performance in diverse clinical settings.