1. Introduction
A brain tumor (BT) is defined as a conglomeration of abnormal cells in various parts of the brain [
1]. Abnormal cellular proliferation in the brain can affect normal cells and their functions, leading to recurrent headaches, speech alterations, concentration difficulties, seizures, auditory problems, and memory loss. Brain tumors are classified into many categories. Malignant tumors are referred to as cancerous, whereas benign tumors are termed non-cancerous. Primary brain tumors begin in the brain itself. Secondary tumors or metastatic tumors can start in places outside the brain and spread to the brain [
2,
3]. Malignant brain tumors proliferate rapidly and disseminate to various regions of the brain and spinal cord, rendering them more perilous than benign tumors. A glioma is a tumor that develops in the brain and spinal cord. A glioma’s ability to impair brain function and pose a threat to survival depends on its location and pace of growth [
4].
In accordance with the World Health Organization (WHO), gliomas are categorized into four grades of severity, ranging from grade I to grade IV [
5], and include types such as high-grade glioma (HGG), low-grade glioma (LGG), meningioma, and pituitary tumors of the central nervous system. Meningiomas originate from the meninges, the protective tissue that encases the brain and spinal cord. Due to their slow growth and reduced tendency to metastasize, they are generally classified as benign tumors. Pituitary tumors, on the other hand, mostly arise from spontaneous mutations, although some result from hereditary defects. These tumors are benign and have a minimal risk of metastasis. Even so, their presence in critical areas of the brain may lead to significant health issues.
How severely a brain tumor impacts the functioning of the nervous system is dependent on the tumor’s location and its pace of development. Brain tumors are treated differently depending on their nature, location, and size. Although it is essential for achieving better results, early identification and diagnosis of brain tumors remain challenging [
6]. A variety of diagnostic methods exist to detect brain disorders. The predominant non-invasive technology used in the medical profession for diagnosing various brain disorders is the Magnetic Resonance Imaging (MRI) technique, which is utilized to assess malignancies in brain scans [
6,
7]. The majority of specialists choose MRI for its capacity to highlight intricacies of brain malignancies and specific regions of the brain [
8]. MRI scans provide superior contrast resolution, facilitating the identification of minute cancers that may otherwise go undetected. Moreover, its capacity to record numerous planes of the targeted brain regions while patients maintain a stationary posture considerably augments the Accuracy and Precision of the diagnostic procedure [
9,
10].
Additionally, MRI uses magnetic fields, radiofrequency pulses, and computer processing to image organs and bones throughout the body. The process of diagnosing brain issues from such scans is referred to as brain MRI analysis [
11]. A brain MRI delivers clear views of the posterior brain and brainstem, unlike a computerized tomography (CT) scan. MRI offers improved clarity of image slices in various sequences, including T1, T1CE, FLAIR, and T2, compared with CT and Positron Emission Tomography (PET). MRI is advantageous for accurately identifying low and high-grade lesions during tumor detection [
12,
13,
14].
In medical image analysis, segmentation and classification are two fundamental tasks [
14]. While classification involves identifying disease types or grading illnesses based on image attributes, such as color, texture, and shape, segmentation entails isolating specific sections of interest from medical images, including organs, tissues, or lesions. For clinical applications such as illness diagnosis, tracking treatment progress, and rehabilitation evaluation, segmentation and classification Accuracy are crucial [
15]. The early diagnosis of brain tumors is crucial for enhancing treatment efficacy and thus improving patient survival rates [
16,
17]. Nonetheless, the examination of brain neoplasms using medical imaging can be complex. Moreover, manual segmentation of brain tumors is both expensive and labor-intensive [
18]. Studies that evaluate radiological errors consistently report daily error rates between 3% and 5%, although in some clinical settings retrospective discrepancy rates exceed 30%. In neuroimaging, secondary evaluations by experienced neuroradiologists may reveal major discrepancies in up to 13% of cases. Studies also report varying levels of accuracy for human diagnosis in the multi-class classification of brain tumors. Taken together, these data show the intrinsic limitations of expert interpretation [
19]. Consequently, automated methods are highly valued. The automated identification of brain tumors is a formidable medical challenge [
17]. In recent years, many methodologies, from basic machine learning (ML) models to advanced approaches such as CNN and vision transformers (ViT), have been used to automatically identify brain tumors via brain MRI [
20]. Nonetheless, the development of a translational and interpretable model for the precise identification and classification of brain cancers remains a prominent scientific endeavor [
21].
Researchers are developing Computer-Aided Diagnosis (CAD) systems to assist physicians in the rapid and accurate diagnosis process. Deep learning (DL), especially using CNNs, has shown remarkable efficacy in several classification tasks, including medical imaging [
3,
4,
5,
22,
23]. Methods such as transfer learning (TL) and fine-tuning have significantly improved their diagnostic efficacy by enabling models to leverage existing knowledge. However, brain tumor classification typically demands a large quantity of labeled data, which is particularly challenging to obtain [
24]. To avoid this issue, researchers employ pre-trained models [
25]. These models proficiently integrate, scale, and compress features, while efficiently using residual information via diverse layer architectures to enhance performance [
10]. Therefore, our motivation stems from the convergence of five key factors: (i) the clinical need to improve non-invasive and early diagnosis of intracranial tumors, (ii) the technical complexity inherent in differentiating tumors with overlapping radiological appearances, (iii) the translational potential of deep learning to integrate with radiology practice and reduce reliance on invasive techniques, (iv) the reduction of the workload on radiologists, which is especially valuable in countries such as Mexico, where radiologists are scarce, and (v) the contribution of advanced deep learning technologies to the early detection and classification of brain tumors in MRI to improve people’s quality of life.
The structure of this paper is presented as follows: In
Section 2, the state of the art of works related to BT classification is presented. The materials and methods are delineated in
Section 3, which includes descriptions of the pre-trained models, the dataset, and image preprocessing techniques.
Section 4 presents the performance analysis, which consists of a comparison of the outcomes of each model and a comprehensive examination of the findings, including their analysis, validation, research limitations, and broader implications in the area. The study is finally concluded in
Section 5, which also identifies potential areas for future research in the classification of brain tumors using DL techniques.
2. Related Work
Numerous studies have focused on classification algorithms for the precise detection of brain cancers. The following section provides an overview of the research that developed multiple CNN models and motivated the present study.
In [
1], a deep learning approach is proposed that utilizes transfer learning with EfficientNet variants (B0–B4) to classify brain tumors in MR images into glioma, meningioma, and pituitary types of brain tumors. EfficientNetB2 achieved the highest results with 98.86% accuracy, 98.65% precision, and 98.77% recall when used with the public CE-MRI Figshare dataset. Grad-CAM visualizations verified the model’s emphasis on tumor regions, and data augmentation enhanced generalization.
The authors in [
26] used a Kaggle dataset to classify brain tumors from X-ray images. Preprocessing, including noise reduction and data augmentation, preserved important edges and generated synthetic variations. After fine-tuning, VGG19, InceptionV3, and MobileNetV2 were used. The most accurate model was VGG19 (98.58%), beating InceptionV3 (97.6%) and MobileNetV2 (98.47%).
Research in [
27] suggested a hybrid model that combines MobileNetV2 (a feature extractor) with a support vector machine (SVM) classifier. The MRI dataset Msoud is available on the Kaggle repository. Findings included an AUC of 0.99 for glioma, 0.97 for meningioma, and 1.0 for pituitary and no tumor.
In [
28], a new attentional TL model, Pre-trained Attention-fused Image SpectraNet, is proposed to enhance brain tumor detection and classification in MRI images. A CNN-based architecture is used. Training stability is improved using the Adam optimizer. The four classifications are normal, pituitary, glioma, and meningioma. The system achieved 98.33% Accuracy and 98.35% Precision.
In [
29], the authors implemented BrainNeuroNet, a teacher-student model for brain tumor detection that utilizes a Hierarchical DConv Transformer (HD) for global feature extraction and a MultiScale Attention (MSA) network for local feature extraction. Preprocessing steps included image scaling, normalization, and quality improvement. Images were collected from the BR35H and Brain Tumor MRI datasets. With a 98.63% Accuracy rate, the model demonstrated its effectiveness in precise brain tumor diagnosis, surpassing previous approaches. Similarly, in [
30], the authors proposed a deep convolutional neural network architecture with parallel dilated convolutions (PDCNN). The model extracts fine and coarse features using parallel paths with varying dilation rates to mitigate DCNN-based overfitting and preserve global context. The network is trained using an average ensemble technique and assessed on Chakrabarty Brain MRI Images (binary), Figshare (multi-class), and Msoud (Kaggle) (multi-class), achieving accuracies of 98.67%, 98.13%, and 98.35%, respectively. These findings indicate that the proposed model outperforms earlier techniques. In [
31], the study introduces a two-stage structural MRI brain tumor classification methodology. A pre-trained convolutional neural network automatically extracts information in the initial step, reducing training time and processing requirements. To avoid overfitting, a filter-based deep feature selection method is employed. SVMs with polynomial kernels classify multi-class data. On the MSoud dataset, the model achieved an Accuracy of 98.17%. The Crystal Clean: Brain Tumors MRI Dataset and Figshare datasets achieved 99.46% and 98.70% Accuracies, respectively.
In another case, work [
32] presented a model that enhances the Accuracy and transparency of brain tumor classification in MRI. The model is based on the EfficientNetB0 architecture and utilizes explainable artificial intelligence (XAI) approaches, specifically Grad-CAM. In a multi-class classification approach, the model was trained and tested using the Msoud dataset. It achieved an Accuracy of 98.72%.
In the work in [
33], a two-stage brain tumor classification method utilizing the BR35H dataset was introduced. First, modern image enhancement algorithms (GFPGAN and Real-ESRGAN) improve MRI picture quality and resolution. Nine DL models are trained using five optimizers. In the second step, the top classifiers are combined to use ensemble learning methods such as weighted sum, fuzzy ranking, and majority voting. By employing GFPGAN and the five top models, the system outperformed prior brain tumor classification methods with 100% Accuracy.
In [
2], the BRATS 2015 dataset and the Figshare Dataset were used for training the proposed Multi-Class Convolutional Neural Network model (MCCNN). Two experiments, Experiment I and Experiment II, were undertaken to evaluate the performance. The suggested MCCNN-based model achieved 99% Accuracy in Experiment I and 96% in Experiment II. In [
4], authors used pre-trained DL models, Xception, MobileNetV2, InceptionV3, ResNet50, VGG16, and DenseNet121, to identify brain MRI scans in four classes. CNN models were trained using a publicly accessible Brain Tumor MRI dataset, Msoud. Xception performed best with a weighted Accuracy of 98.73%. Similarly, in [
5], a lightweight Multi-path Convolutional Neural Network (M-CNN) was proposed. During training, the model was instructed to recognize four distinct types of tumors. Sartaj, a publicly available Brain Tumor MRI dataset, was used to train the model. The model achieved a performance Accuracy level of 96.03%.
An ensemble of CNNs was introduced in [
6]. The ensemble model integrated VGG16 and ResNet152V2 architectures, demonstrating a classification Accuracy of 99.47% on the complex four-class Msoud dataset. Similarly, authors in [
16] introduced a novel ensemble using Swin Transformer and ResNet50V2 (SwT + ResNet50V2). The design utilizes self-attention and DL techniques to enhance diagnostic Precision while minimizing training complexity and memory consumption. The model was trained with the BR35H and Msoud datasets, achieving an Accuracy of 99.9% on BR35H and 96.8% on Msoud. In [
20], five CNN models trained via TL and fine-tuning were combined in an ensemble model, which was optimized using Particle Swarm Optimization (PSO). Three brain tumor datasets, namely Figshare (Dataset 1), Sartaj (Dataset 2), and Msoud (Dataset 3), were used for evaluation. In Figshare, the model achieved an Accuracy of 99.35%, in Sartaj, 98.77%, and in Msoud, 99.92%.
In [
10], the authors developed an innovative hybrid system named TUMbRAIN. Additionally, they used the Msoud dataset for training, which has four classes. The findings indicate that TUMbRAIN surpasses most contemporary neural network models, achieving an exceptional total Accuracy of 97.94% with just 1.04 million parameters. In [
34], the DeepNeuroXpert (DN-XPert) model was introduced for accurate brain tumor detection, along with three complementary models: NSAS-Net for segmentation, AI2CF for classification, and WPSO for parameter tuning. Two brain tumor imaging collections, Figshare and Msoud, helped the study. Performance criteria, including Accuracy, reached 99.4%, suggesting the potential of the proposed models to enhance brain tumor detection and classification.
The research presented in [
35] employed a novel CNN architecture with explainable artificial intelligence (XAI) algorithms, such as Grad-CAM, SHAP, and LIME, to classify brain tumors. Fewer layers and parameters improve model interpretability and resilience compared with earlier models. Using the Msoud and NeuroMRI Datasets, the technique achieved 99% Accuracy on known data and 95% Accuracy on unknown data, demonstrating its generalizability and clinical value. In [
36], the authors proposed an advanced brain tumor multi-class classification method that uses TL and evolutionary algorithms. The pre-trained models EfficientNetB3 and DenseNet121 were optimized for hyperparameters using CEGA. The study used Msoud dataset. Without data augmentation, CEGA-EfficientNetB3 and CEGA-DenseNet121 achieved accuracies of 99.39% and 99.01%, respectively, surpassing the state-of-the-art approaches.
In [
37], the authors introduced the M-C&M-BL model, which uses a CNN for image feature extraction and a BiLSTM network for sequential data processing. MRI data from Br35H were used to assess the model. The results, with 99.33% Accuracy, suggest that this CNN is suitable for integration into clinical decision support systems, online and mobile diagnostic platforms, and hospital picture archiving and communication systems (PACS).
The work in [
38] presented a methodology for brain tumor identification that incorporates model optimization, modeling of realistic settings, and sophisticated data augmentation techniques. Utilizing the Msoud dataset, optimizers such as Adam and augmentation methods like CutMix, PatchUp, Gaussian Noise, and Blur were employed, resulting in an Accuracy of 99.45% under optimal conditions. Nonetheless, the performance diminished when confronted with synthetic data that was disturbed, highlighting the model’s limitations in robustness within authentic clinical contexts.
To classify brain tumors in histological images, the research in [
39] suggested using a CNN architecture in conjunction with a Vision Transformer (ViT) model. The algorithm was trained on the four-class Msoud dataset. The approach outperformed prior methods, whose performance ranged from 95% to 98%, reporting an Accuracy of 99.42% with a 95% confidence interval and an overall Accuracy of 99.64%.
In [
40], the study presented a deep TL methodology for the early identification of brain cancers in magnetic resonance imaging (MRI), using preprocessing, segmentation by OTSU, and feature extraction via Gabor Wavelet Transform, optimized by Grey Wolves Optimization (GWO). Five architectures were assessed: VGG19, InceptionV3, InceptionResNetV2, ResNet152, and DenseNet121, with the latter demonstrating the highest Accuracy. The model underwent training on the Msoud dataset. DenseNet121 had the best Accuracy at 99.43%, surpassing the other evaluated designs.
The research in [
41] introduced a brain tumor classification model using pre-trained CNN architectures augmented with supplementary feature extraction layers and diverse activation functions (ReLU, PReLU, Swish). Seven architectures were assessed: VGG19, InceptionV3, ResNet50V2, InceptionResNetV2, DenseNet201, MobileNetV2, and EfficientNetB7, combined using a majority vote ensemble method. The models were trained using the Chakrabarty N. Brain MRI Images for Brain Tumor Detection dataset, including records from 253 patients. The model attained an Accuracy of 99.34%. Similarly, work [
42] utilized Chakrabarty N. Brain MRI images for brain tumor detection (BTD) training, employing a CNN model integrated with a multilayer perceptron (MLP) for feature extraction. The model obtained 99.6% Accuracy.
The study in [
43] presented ParMamba, a parallel architecture that combines Convolutional Attention Patch Embedding (CAPE) with the ConvMamba block, which incorporates CNN, Mamba, and a channel improvement module, to improve brain tumor identification. The model was evaluated using the Msoud and Figshare datasets, achieving accuracies of 99.62% and 99.35%, respectively.
In [
44], the authors presented the Superimposed AlexNet models (SAlexNet-1 and SAlexNet-2) for precise classification of primary brain tumors, incorporating three principal enhancements: Hybrid Attention Mechanism (HAM), 3 × 3 convolutional layers for comprehensive feature extraction, and semi-transfer learning (STL) for encoder pre-training. The models were assessed using the SARTAJ (multi-class classification) and BR35H (binary classification) datasets, with SAlexNet-1 achieving a Precision of 98.78% and 98.07%, and SAlexNet-2 achieving 99.69% and 99.17%, respectively.
Improving the process of early brain tumor diagnosis is the main objective of this work. This diagnosis has a direct effect on patient care. Modern technologies, such as DL and TL, are being promoted as a solution to the problems with older approaches that depend on human interpretation and specialized expertise. To boost processing speed and reduce human error, these automated categorization techniques are essential. This work utilizes complex DL and TL methods to identify brain MRI images, therefore improving the Accuracy and efficiency of medical imaging systems and assisting medical professionals in detecting brain cancer more efficiently.
This study uses the openly accessible Msoud dataset, which contains MRI scans of four classes: glioma, meningioma, pituitary tumor, and no tumor. It evaluates four pre-trained deep learning models for brain tumor classification, utilizing various dataset percentages and a stratified five-fold cross-validation technique. Incorporating these state-of-the-art methods into a clinical system could enable improved clinical decision-making through the provision of accurate and scalable brain tumor classification. This work paves the way for future AI-driven medical research and contributes to our understanding of DL’s potential in medical image processing, particularly for the detection of brain tumors.
The main contributions of this paper are as follows:
Transfer learning (TL) and pre-trained deep learning models are employed for brain MRI classification, aiming to achieve faster, more accurate, and consistent diagnostic outcomes compared with traditional clinical methods.
Several state-of-the-art pre-trained models (DeiT3_Base_Patch16_224, Xception41, Inception_v4, and Swin_Tiny_Patch4_Window7_224) are used to classify brain tumors, leveraging TL to reduce training time and computational load while enhancing classification performance.
The classification performance of the selected models is evaluated using the Msoud brain tumor MRI dataset, categorized into four classes (glioma, meningioma, pituitary tumor, and no tumor), focusing on Accuracy, Precision, Recall, F1-Score, and MCC.
A parameter optimization strategy is implemented in combination with training on multiple dataset partitions, ranging from 10% to 100%, to assess the scalability and robustness of the models.
Graph-based visualization tools are utilized to analyze key hyperparameters and learning behavior during training, validation, and testing phases.
Deployment feasibility of the best-performing model is validated through real-time inference benchmarking on embedded AI hardware platforms (NVIDIA Jetson AGX Xavier and Jetson Orin Nano), demonstrating high-speed, low-latency performance suitable for edge-level clinical applications.
4. Results and Discussion
Training began with stratified five-fold cross-validation. Implementing it with the appropriate Python libraries ensures that each of the five folds contains approximately the same number of samples and preserves the class proportions of the full dataset.
Figure 6 shows the stratified distribution of samples over five folds using Stratified K-Fold cross-validation on the Msoud dataset. Each fold represents the proportion of glioma, meningioma, pituitary tumors, and no tumor, respectively.
Figure 6 shows a balanced dataset across categories. This stratified approach ensures that each fold maintains the same class distribution as the entire dataset, hence preventing bias during model training and evaluation. In the analysis of medical imaging data, it is crucial that the distribution stays consistent across folds. Class imbalance has a significant impact on the model’s learning process and overall performance. Stratified K-Fold cross-validation maintains an equivalent distribution of classes across each fold, thus enhancing result reliability, minimizing variability across folds, and rendering the model’s outcomes more valuable and reproducible.
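The stratified partitioning described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function name `stratified_k_fold`, the seed, and the class sizes are hypothetical, and the text does not state which library was used (scikit-learn's `StratifiedKFold` is one common option).

```python
import random
from collections import Counter, defaultdict

def stratified_k_fold(labels, n_splits=5, seed=0):
    """Assign every sample index to one of n_splits folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the indices of this class round-robin so that every fold
        # receives (almost) the same number of samples of the class.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Hypothetical class sizes, chosen for illustration only (not the Msoud counts).
CLASSES = ["glioma", "meningioma", "notumor", "pituitary"]
labels = (["glioma"] * 130 + ["meningioma"] * 135
          + ["notumor"] * 160 + ["pituitary"] * 145)

folds = stratified_k_fold(labels, n_splits=5, seed=42)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
for k, counts in enumerate(fold_counts, start=1):
    print(f"fold {k}:", {c: counts[c] for c in CLASSES})
```

Because each class is distributed separately, every fold reproduces the class proportions of the full label list, which is the property Figure 6 is meant to show.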
Figure 7 shows a parallel coordinates plot illustrating the outcomes of a parameter optimization technique during the training of four DL models using a brain MRI dataset. Each line represents a distinct experimental run, defined by the configurations of significant hyperparameters and the resulting performance metrics. The axes represent the dataset ratio employed, model architecture, batch size, learning rate, training loss, and test Accuracy. The architectures examined were Xception41, Swin_tiny_patch4_window7_224, InceptionV4, and Deit3_base_patch16_224, over various segments of the data (25%, 50%, 75%, and 100%), to evaluate their performance at different scales.
The color gradient indicates the test Accuracy, with lighter yellow representing results near 100%, which mark the most effective training configurations. This performance was supported by the advanced data augmentation techniques shown in
Table 2, the stratified 5-fold cross-validation used to ensure class-balanced partitions, as illustrated in
Figure 6, the systematic hyperparameter optimization summarized in
Table 3, and the deliberate selection of robust deep learning architectures tailored for multi-class tumor classification, described in
Section 3.5. The best results typically occur with 75% of the dataset, a batch size of around 64, and a learning rate of about
, while minimizing the training loss. These consistent traces indicate that certain hyperparameter combinations are highly stable, which makes them well suited for fine-tuning in deployment scenarios. The figure also illustrates that even small adjustments in the learning rate or model selection can significantly affect Accuracy, which demonstrates how sensitive model generalization is to parameter selection. The parallel coordinates plot thus serves as a diagnostic tool for analyzing the interplay of training dynamics within the design space, thereby facilitating informed decisions on model optimization.
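To make the structure of such a search concrete, the sketch below enumerates one run per combination of the axes named above and picks the best record. It is only an illustration under stated assumptions: the model names and dataset ratios come from the text, whereas the batch-size and learning-rate lists are hypothetical placeholders (the values actually searched are listed in Table 3), and `build_runs` and `best_run` are illustrative helper names.

```python
from itertools import product

# Names and ratios as given in the text.
MODELS = [
    "Xception41",
    "Swin_tiny_patch4_window7_224",
    "InceptionV4",
    "Deit3_base_patch16_224",
]
DATASET_RATIOS = [0.25, 0.50, 0.75, 1.00]

# Hypothetical placeholders, NOT the values used in the study.
BATCH_SIZES = [32, 64]
LEARNING_RATES = [1e-3, 1e-4]

def build_runs():
    """Return one configuration dict per point of the search grid."""
    keys = ("dataset_ratio", "model", "batch_size", "learning_rate")
    grid = product(DATASET_RATIOS, MODELS, BATCH_SIZES, LEARNING_RATES)
    return [dict(zip(keys, combo)) for combo in grid]

def best_run(results):
    """Pick the record with the highest test accuracy.

    Ties are broken in favor of the lower training loss.
    """
    return max(results, key=lambda r: (r["test_accuracy"], -r["train_loss"]))

runs = build_runs()
print(len(runs), "configurations")  # 4 ratios x 4 models x 2 x 2 = 64
```

Each configuration would correspond to one line of the parallel coordinates plot once its training loss and test Accuracy have been recorded.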
Figure 8a,b summarize the optimal results in model training and validation phases, utilizing 75% of the dataset, a batch size of around 64, and a learning rate of about
for each model.
Figure 8 shows training results approaching 99% and a steadily improving validation curve that reaches almost 98%.
Figure 9 shows the four DL models after 10 training epochs.
Figure 9a,b show the training and validation loss, respectively, both of which decrease rapidly in all models.
Figure 9a, which represents the training loss, shows a steady and rapid drop for all models, indicating that the models are converging and learning in a stable manner. In contrast, the validation loss in
Figure 9b demonstrates how effectively each design generalizes. The validation loss decreases for all models, although Swin_tiny_patch4_window7_224 and Inception_V4 exhibit smaller and more stable validation losses by the final epochs, which indicates better generalization and less overfitting.
To complement the information presented in the training and validation stages,
Table 4 shows the quantitative results of the final testing set. Performance metrics, including Accuracy, Precision, Recall, F1-score, and Matthews correlation coefficient (MCC), were used to verify how well each model classified the test dataset.
Table 4 illustrates the performance metrics of four DL models on the test set. Swin_tiny_patch4_window7_224 obtained the best test Accuracy of 99.24%. It also consistently yielded better results, with Precision of 0.9924, Recall of 0.9924, F1-Score of 0.9924, and MCC of 0.9898. These results demonstrate that this model can make accurate predictions and has very few classification errors, indicating that it can generalize well across all types of tumors.
The second-best model was Deit3_base_patch16_224, which achieved a test Accuracy of 99.08%. Its Precision, Recall, and F1-Score were balanced at around 0.9908, and its MCC reached 0.9877. These statistics highlight its strong classification performance. Both transformer-based models are highly effective at identifying complex patterns in medical images, and they outperformed the CNN-based models Xception and InceptionV4.
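For readers who wish to reproduce metrics of this kind, the sketch below derives Accuracy, Precision, Recall, F1-Score, and the multi-class MCC from a confusion matrix. The matrix is hypothetical, and support-weighted averaging is an assumption. With that averaging, Recall equals Accuracy by construction, which is consistent with the identical values in Table 4 but is not confirmed by the text.

```python
import math

def classification_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    true_counts = [sum(cm[k]) for k in range(n)]                       # row sums
    pred_counts = [sum(cm[i][k] for i in range(n)) for k in range(n)]  # column sums

    precision = recall = f1 = 0.0
    for k in range(n):
        p_k = cm[k][k] / pred_counts[k] if pred_counts[k] else 0.0
        r_k = cm[k][k] / true_counts[k] if true_counts[k] else 0.0
        f_k = 2 * p_k * r_k / (p_k + r_k) if (p_k + r_k) else 0.0
        weight = true_counts[k] / total  # support-weighted average (assumption)
        precision += weight * p_k
        recall += weight * r_k
        f1 += weight * f_k

    # Multi-class generalization of the Matthews correlation coefficient.
    numerator = correct * total - sum(p * t for p, t in zip(pred_counts, true_counts))
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    mcc = numerator / denominator if denominator else 0.0

    return {
        "accuracy": correct / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical 4-class confusion matrix (rows: true class, columns: predicted).
example_cm = [
    [48, 2, 0, 0],
    [1, 49, 0, 0],
    [0, 0, 50, 0],
    [0, 0, 0, 50],
]
metrics = classification_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The MCC uses all cells of the matrix, so it is typically slightly lower than Accuracy when errors are present, as is also the case for the values reported above.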
To gain a deeper understanding of the data presented in
Table 4, it is essential to examine how the performance of the models varies with architectural complexity. Transformer-based models such as Swin_tiny_patch4_window7_224 and Deit3_base_patch16_224 exhibit superior test Accuracy, despite significant variations in their parameter counts.
In practical medical environments, it is crucial to select a model that optimally balances predictive performance and computational efficiency, rather than focusing solely on test Accuracy.
Figure 10 examines the relationship between model size (in millions of parameters) and test Accuracy. Each model is drawn as a bubble whose size is determined by its number of parameters and which is labeled with the model name. The horizontal axis represents the number of parameters, ranging from around 20 million to 100 million. The vertical axis indicates the test Accuracy, ranging from 0.986 to 0.994.
Swin_tiny_patch4_window7_224 has the highest test Accuracy (0.9924) among all evaluated models while having a low parameter count (28.3M), which demonstrates its efficiency. Deit3_base_patch16_224 has the highest number of parameters (about 86 million), yet its test Accuracy is slightly lower (around 0.9908). This indicates that Accuracy does not necessarily improve with larger models. Inception_v4, with nearly 42.7 million parameters, achieves a test Accuracy of around 0.9893. Xception41, the most compact model (about 22.9 million parameters), has the lowest Accuracy (around 0.9870) among all the architectures analyzed. The relationship between model complexity and performance is therefore not straightforward and must be assessed empirically, particularly when deciding which models to employ in practical applications of DL in the medical field.
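One way to formalize this trade-off is a Pareto analysis over the two quantities quoted above. The sketch below is an illustrative reading of those figures, not part of the original evaluation. A model is kept if no other model is at least as small and at least as accurate while being strictly better on one of the two criteria. The helper name `pareto_front` is hypothetical, and the DeiT3 parameter count is taken as the approximate 86 million stated in the text.

```python
# (parameters in millions, test Accuracy) as quoted in the text.
MODEL_STATS = {
    "Swin_tiny_patch4_window7_224": (28.3, 0.9924),
    "Deit3_base_patch16_224": (86.0, 0.9908),   # "about 86 million"
    "Inception_v4": (42.7, 0.9893),
    "Xception41": (22.9, 0.9870),
}

def pareto_front(stats):
    """Return the models that are not dominated on (fewer params, higher accuracy)."""
    front = []
    for name, (params, acc) in stats.items():
        dominated = any(
            o_params <= params and o_acc >= acc
            and (o_params < params or o_acc > acc)
            for other, (o_params, o_acc) in stats.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

front = pareto_front(MODEL_STATS)
print(front)
```

With these numbers, Deit3_base_patch16_224 and Inception_v4 are both dominated by Swin_tiny_patch4_window7_224 (fewer parameters and higher Accuracy), while Xception41 remains on the front only because it is the smallest model.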
Swin_tiny_patch4_window7_224 is identified as the best model because it offers the best balance between Accuracy and parameter efficiency, as shown in
Figure 10. According to
Table 4, this architecture also had the highest test Accuracy. Its performance is therefore examined more closely using the confusion matrices from the test phase. The training setup used a batch size of 64, a learning rate of
, and a 5-fold stratified cross-validation scheme. It is important to note that only 75% of the Msoud dataset was used in this experiment.
Figure 11 shows the confusion matrices obtained in the test stage for all five folds. They were produced by the Swin_tiny_patch4_window7_224 model, a vision transformer designed to capture spatial-contextual dependencies in image data. The model was applied to a multi-class classification task that distinguishes between four categories: glioma, meningioma, no tumor, and pituitary tumor.
Each matrix compares the predicted labels with the actual labels for one fold and thus reflects the model’s performance in that fold. The counts are consistently concentrated on the main diagonal in all folds, which indicates that the model classifies the samples very effectively and makes few errors. The categories “no tumor” and “pituitary tumor” show high discrimination capacity, with virtually no cases misclassified in any of the folds. This demonstrates the model’s ability to distinguish images of healthy brains and of pituitary tumors from the other classes. The confusion matrix in
Figure 11 also indicates that the majority of misclassifications occurred between the glioma and meningioma categories. This is likely because both tumor types may display similar morphological features in magnetic resonance imaging sequences, particularly in T1-weighted images. Despite these errors, the model demonstrated strong overall discriminatory performance. These misclassifications, although statistically minimal, underscore the need to incorporate expert radiology analysis, additional clinical data, and multi-sequence imaging in diagnostic settings to reduce potential diagnostic uncertainty.
Fold 5 shows a slight increase in the number of misclassifications between glioma_tumor and meningioma_tumor, but the overall predictions remain largely correct. These results indicate that the Swin Transformer-based model generalizes well and remains consistent across folds.
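The kind of inspection described above, locating the class pair that accounts for most errors across folds, can be sketched as follows. The two fold matrices are hypothetical and do not reproduce Figure 11, and the helper names `sum_matrices` and `most_confused_pair` are illustrative.

```python
from itertools import combinations

CLASSES = ["glioma", "meningioma", "no tumor", "pituitary"]

def sum_matrices(matrices):
    """Add per-fold confusion matrices cell by cell."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) for j in range(n)] for i in range(n)]

def most_confused_pair(cm, class_names):
    """Return (class_a, class_b, errors) for the pair with the most mutual errors."""
    best = None
    for i, j in combinations(range(len(cm)), 2):
        errors = cm[i][j] + cm[j][i]  # confusions in both directions
        if best is None or errors > best[2]:
            best = (class_names[i], class_names[j], errors)
    return best

# Hypothetical confusion matrices for two folds (rows: true, columns: predicted).
fold_matrices = [
    [[48, 2, 0, 0], [1, 49, 0, 0], [0, 0, 50, 0], [0, 0, 0, 50]],
    [[47, 3, 0, 0], [2, 47, 0, 1], [0, 0, 50, 0], [0, 0, 0, 50]],
]

total_cm = sum_matrices(fold_matrices)
pair = most_confused_pair(total_cm, CLASSES)
print(pair)
```

In this illustrative data the glioma and meningioma pair dominates the off-diagonal cells, mirroring the pattern reported for the real matrices.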
The confusion matrices give a clear picture of the classification Accuracy for each class over all test folds. Receiver Operating Characteristic (ROC) curves complement them by showing the model’s ability to distinguish between classes across all decision thresholds.
Figure 12 illustrates the ROC curves and AUC values for each category. This provides more insights into the sensitivity-specificity equilibrium and the overall efficacy of the Swin_tiny_patch4_window7_224 model.
Figure 12 shows five subplots, one per fold (fold 1 to fold 5). Each subplot displays one ROC curve per tumor class, along with the Area Under the Curve (AUC) metric. All classes achieve an AUC of 1.00, indicating that positive and negative samples can be separated with essentially no trade-off between sensitivity and specificity. This degree of performance is atypical in medical imaging applications and suggests that the Swin_tiny_patch4_window7_224 model reliably captures high-level spatial patterns in MRI data.
For every class, the curves of true positive rate versus false positive rate lie close to the upper-left corner of the ROC space, as expected for a model with near-perfect discrimination. The consistency across folds suggests that the model is not overfitting to a particular data partition, although its applicability to varied patient groups still needs to be confirmed on external data. The outcomes align with the confusion matrices in
Figure 11, confirming the model’s ability to accurately predict all tumor types.
The no-tumor and pituitary-tumor classes exhibit clean curves, indicating few false positives. The model also separates clinically challenging categories, such as glioma and meningioma, which often have similar imaging appearances. The Swin Transformer’s hierarchical self-attention mechanism may help to identify the subtle features that differentiate tumor types. The ROC analysis indicates that the Swin_tiny_patch4_window7_224 model performs consistently well across all test folds, making it a promising option for automated brain tumor identification in clinical settings.
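The per-class AUC values discussed above can be illustrated with a one-vs-rest computation. The sketch below uses the rank interpretation of AUC (the probability that a positive sample receives a higher score than a negative one); the function names and the toy probability rows are assumptions for illustration, not the study's outputs.

```python
def binary_auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, p in zip(scores, is_positive) if p]
    negatives = [s for s, p in zip(scores, is_positive) if not p]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(probabilities, y_true, classes):
    """Treat each class in turn as 'positive' and all other classes as 'negative'."""
    return {
        c: binary_auc([row[k] for row in probabilities], [t == c for t in y_true])
        for k, c in enumerate(classes)
    }


CLASSES = ["glioma", "meningioma", "no_tumor", "pituitary"]
# Hypothetical softmax outputs; the last glioma sample is misclassified
# as meningioma by argmax (0.55 > 0.40).
probabilities = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.80, 0.05, 0.05],
    [0.02, 0.03, 0.90, 0.05],
    [0.01, 0.04, 0.05, 0.90],
    [0.40, 0.55, 0.03, 0.02],
]
y_true = ["glioma", "meningioma", "no_tumor", "pituitary", "glioma"]

auc = one_vs_rest_auc(probabilities, y_true, CLASSES)
print(auc)
```

In this toy case every class reaches an AUC of 1.0 even though one sample is misclassified, because AUC measures how well the scores rank positives above negatives rather than the accuracy of the final argmax decision. This is why an AUC of 1.00 can coexist with a test Accuracy slightly below 100%.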
Examining the interpretability of DL models is crucial, particularly in the medical domain, where model transparency is paramount, and it goes beyond quantitative performance metrics such as confusion matrices and ROC curves. Gradient-weighted Class Activation Mapping (Grad-CAM) was used to visualize the spatial regions that most influenced the classification decisions of the Swin_tiny_patch4_window7_224 model. The resulting activation maps offer insight into the model’s decision-making process.
Figure 13 presents Grad-CAM visualizations for test samples representative of all five folds across the four classes: glioma, meningioma, no tumor, and pituitary tumor. The class activation maps are superimposed on the MRI slices to illustrate the regions utilized by the Swin Transformer model to differentiate between classes.
Each column shows a different fold, and each row shows a different category. The activation maps are spatially consistent across folds, which suggests that the model’s decisions rely on stable and anatomically relevant regions. In the tumor classes (glioma, meningioma, and pituitary tumor), the highlighted areas correspond to pathological structures, indicating that the model attends to clinically relevant features. In the glioma and meningioma classes, activations occur primarily in the areas surrounding the lesions. For pituitary tumors, the sellar region receives the strongest activation.
In the “no tumor” class, the Grad-CAM maps exhibit dispersed, low-intensity activations, indicating the absence of concentrated anomalies. This behavior is consistent with a correct classification based on overall anatomical traits. This interpretability analysis supports the clinical validity of the Swin Transformer model by showing graphically which regions drive its decisions. With Grad-CAM, the model combines high predictive Accuracy with a degree of transparency, which is crucial for practical medical applications and for clinicians’ confidence in the model.
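The core Grad-CAM computation can be sketched as follows: each feature map is weighted by the spatial average of the gradient of the class score with respect to that map, the weighted maps are summed, and a ReLU keeps only positive evidence. The toy arrays are hypothetical, and the choice of layer, as well as the reshaping of transformer tokens into a spatial grid that a Swin backbone would require, are not specified in the text and are left out here.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps (each H x W) into one normalized heatmap.

    activations[k][i][j]: value of feature map k at location (i, j)
    gradients[k][i][j]:   gradient of the class score w.r.t. that value
    """
    height, width = len(activations[0]), len(activations[0][0])

    # Channel weights: global average of the gradients over all locations.
    weights = [
        sum(sum(row) for row in grad) / (height * width)
        for grad in gradients
    ]

    # Weighted sum of the feature maps, followed by ReLU.
    cam = [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Scale to [0, 1] so the map can be overlaid on the image.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Hypothetical 2-channel, 2x2 example: channel 0 supports the class,
# channel 1 argues against it.
activations = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
gradients = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]

heatmap = grad_cam(activations, gradients)
print(heatmap)
```

The location activated by the supporting channel is highlighted, while the location activated by the opposing channel is suppressed by the ReLU. Dispersed, low-intensity maps such as those of the "no tumor" class correspond to cases where no location accumulates strong positive evidence.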
To enhance the validation of the Swin_tiny_patch4_window7_224 performance and evaluate the efficacy of other models in this study beyond conventional metrics, a comparative analysis employing model ranking was performed across all training models. This section of the work aims to identify the architecture that exhibits the most stability and consistent Accuracy across both the training and testing stages.
To ensure methodological consistency, each model underwent training and testing with identical preprocessing and augmentation parameters. A normalized rank-based scoring system (range: 0.0–1.0) implemented in [
59] was applied to assess the performance of the models. This method integrates Accuracy, Recall, and F1-Score into a singular metric.
Figure 14 indicates that Swin_tiny_patch4_window7_224 and Deit3_base_patch16_224 took the first and second positions, with scores of 1.0 and 0.8, respectively, which makes them suitable choices for this task. Inception_v4 performed adequately, with a score of 0.6; however, Xception41 underperformed, falling outside the “Good” threshold. The results indicate that transformer-based models outperform convolutional models in this context.
The subsequent assessment expands upon the comparative ranking analysis and examines the statistical robustness of the testing phase, emphasizing both class-level consistency and overall predictive Accuracy.
Figure 15 compares test Accuracy with the variability of class-wise performance, quantified by the standard deviation of class-wise Recall, for the four DL models of this study. Because it is based only on the testing phase, the figure shows both the overall efficacy of each architecture and its stability across classes. The Swin_tiny_patch4_window7_224 model has the highest test Accuracy (around 0.992) and little variability across classes, an effective balance between predictive power and reliability. Deit3_base_patch16_224 had comparable Accuracy (0.991) but the largest standard deviation, indicating greater sensitivity to class-specific characteristics. Inception_v4 performed adequately, with an Accuracy of around 0.989 and a moderate standard deviation of class-wise Recall of roughly 0.003. Xception41 had the lowest Accuracy together with the smallest standard deviation, indicating steady, although somewhat worse, overall performance. These results underscore the need to examine both aggregate metrics and dispersion indicators during testing, particularly where robustness across class distributions is crucial.
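The two quantities compared in this analysis can be derived from a single confusion matrix, as in the minimal sketch below. The matrix values are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not state which was used.

```python
from statistics import pstdev


def accuracy_and_recall_spread(matrix):
    """Return overall accuracy, per-class recall, and the std of the recalls.

    matrix[i][j] counts samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    recalls = [matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))]
    return correct / total, recalls, pstdev(recalls)


# Hypothetical confusion matrix (glioma, meningioma, no tumor, pituitary).
matrix = [
    [97, 3, 0, 0],
    [2, 98, 0, 0],
    [0, 0, 100, 0],
    [0, 0, 0, 100],
]

accuracy, recalls, spread = accuracy_and_recall_spread(matrix)
print(f"accuracy={accuracy:.4f}, recalls={recalls}, std={spread:.4f}")
```

A model can have high accuracy and still a comparatively large spread if its errors are concentrated in one or two classes, which is the situation the dispersion indicator is meant to expose.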
Performance metrics provide valuable information about the effectiveness of a model; however, computational limitations also influence the feasibility of implementation. The next step in the research examines the resource consumption of the evaluated architectures during the training stage, under the previously specified parameters, where the best results were obtained.
Figure 16 illustrates how computational resource use, namely GPU utilization and process-specific CPU usage, varied over time during the training phase of the four architectures used in this study. The GPU utilization panel indicates that Deit3_base_patch16_224 frequently approached 100% GPU utilization. This behavior is consistent with its transformer-based architecture, which relies on extensive self-attention operations and dense matrix multiplications across several layers, and it implies a heavy, sustained GPU load. Swin_tiny_patch4_window7_224 also exhibited significant GPU activity, with peaks of 90% and substantial variation over time, which is typical of its hierarchical attention mechanism with shifted windows that alter spatial locality at each stage. In contrast, Inception_v4 and Xception41 showed lower GPU usage, reaching 60% and 70%, owing to their internal architectures, which use spatial convolutions with predictable computational and memory patterns. The CPU results reveal additional, model-dependent processing load. The CPU consumption of Deit3_base_patch16_224, for example, increased gradually to a maximum of around 40%, which was attributed to its structure and data pipelining configuration. Inception_v4 and Xception41 used approximately 30% to 35%, while Swin_tiny_patch4_window7_224 used between 20% and 25%, the lowest of the four. These results suggest that transformer-based architectures are more effective at making predictions but can also require more computational resources, particularly GPU resources, than conventional convolutional networks. Such an analysis is crucial for determining whether a model can be effectively applied in settings with limited resources.
Figure 17 illustrates the GPU memory allocation during the training phase for each of the four models evaluated. Xception41 maintained a consistently high usage of around 40%, which is mainly attributed to its convolutional architecture. Deit3_base_patch16_224 showed usage fluctuating around 30% over time; transformer-based models require dynamic memory allocation, especially during multi-head self-attention and token projection operations. In contrast, both Swin_tiny_patch4_window7_224 and Inception_v4 showed lower and more stable usage (approximately 20%), indicating more economical memory use. This can be attributed to their reduced feature hierarchies or the factorization of their convolutional modules.
Figure 18 illustrates the GPU power consumption during the inference phase for the four models that were tested. Deit3_base_patch16_224 used the most power, often exceeding 250 W, because its transformer-based architecture involves numerous matrix operations and multi-head self-attention mechanisms. Xception41 and Inception_v4 consumed between 200 and 220 W, in line with their convolutional architectures. Swin_tiny_patch4_window7_224 consumed around 180 W and was the least power-intensive option, which is consistent with its lightweight hierarchical attention design. These results are similar to those for GPU memory use and point to a relationship between the complexity of an architecture and the energy required during inference. Such insights are particularly important when energy budgets are tight.
After evaluating computational and energy resource consumption, it is essential to emphasize performance using established benchmarks and compare the results with those of other current studies.
Table 5 compares recent studies that have performed multi-class brain tumor classification. The comparison covers the proposed architectures, datasets, number of classes, and maximum reported accuracies, whether in training or testing. The Swin_tiny_patch4_window7_224 model presented in this work achieved a competitive test Accuracy of 99.24% on the Msoud dataset. As the preceding evaluations indicate, this model ranks among the most effective and is also computationally efficient.
4.1. Real-Time Inference Benchmarking of the Best DL Model on Embedded Systems
This subsection presents a real-time inference benchmark of the best-performing deep learning model, deployed on high-performance embedded AI platforms. The objective is to evaluate the feasibility of achieving low-latency, high-throughput predictions in resource-constrained environments, which are typical of real-world clinical applications requiring time-critical responses. Two GPU-based devices were selected for this purpose: the Jetson AGX Xavier (32 GB), manufactured by NVIDIA Corporation in Huizhou, China, and the Jetson Orin Nano Developer Kit (16 GB), manufactured by Yahboom Technology Co., Ltd. in Shenzhen, China. Performance was assessed in terms of mean inference time, throughput in frames per second (FPS), and execution stability, using the ONNX Runtime and TensorRT inference frameworks. The Jetson AGX Xavier is equipped with a 512-core Volta GPU featuring 64 Tensor Cores, an 8-core ARMv8.2 Carmel CPU, and 32 GB of LPDDR4X memory with a bandwidth of approximately 136.5 GB/s. It supports up to 30 TOPS (INT8) and includes dual NVDLA accelerators along with a dedicated vision processor, making it suitable for intensive AI workloads. The Jetson Orin Nano, in turn, is based on the Ampere architecture (2022) and integrates a 1024-core GPU with 32 Tensor Cores, a 6-core ARM Cortex-A78AE CPU, and up to 16 GB of LPDDR5 RAM. It delivers up to 40 TOPS (INT8) in a more compact and energy-efficient design, with selectable power modes of 7 W or 15 W. Both devices run Ubuntu 20.04 with NVIDIA’s JetPack SDK and support ONNX and TensorRT for model inference. These embedded platforms were chosen to test the best-performing DL model under deployment conditions with limited resources, while still meeting the demands of real-time medical image classification.
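The three reported quantities (mean inference time, its standard deviation, and mean FPS) can be collected with a simple timing loop such as the sketch below. The `benchmark` helper, the warm-up and run counts, and the `dummy_inference` stand-in are hypothetical; in a real measurement the callable would wrap an ONNX Runtime or TensorRT inference call. Averaging per-run FPS values, rather than inverting the mean latency, is also an assumption about how mean FPS was obtained.

```python
import time
from statistics import mean, stdev


def benchmark(run_inference, n_warmup=5, n_runs=50):
    """Time a callable and summarize latency (ms) and throughput (FPS)."""
    # Warm-up runs are excluded so one-time initialization does not skew results.
    for _ in range(n_warmup):
        run_inference()

    latencies_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_inference()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    fps_per_run = [1000.0 / t for t in latencies_ms]
    return {
        "mean_ms": mean(latencies_ms),
        "std_ms": stdev(latencies_ms),
        "mean_fps": mean(fps_per_run),
    }


def dummy_inference():
    # Stand-in workload; replace with the actual model call on the device.
    sum(i * i for i in range(10_000))


stats = benchmark(dummy_inference)
print({name: round(value, 3) for name, value in stats.items()})
```

A low standard deviation relative to the mean indicates stable execution, which matters as much as raw speed for time-critical use.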
The performance of the Swin_Tiny_Patch4_Window7_224 model on the Jetson AGX Xavier using both ONNX Runtime in CPU mode and TensorRT inference engines is presented in
Table 6. TensorRT significantly reduced the mean inference time from 177.59 ms to 18.23 ms, while also lowering the standard deviation from 55.84 ms to 3.18 ms, indicating improved stability and execution consistency. Furthermore, TensorRT achieved a substantial increase in throughput, with a mean frame rate of 56.30 FPS compared with only 6.13 FPS under ONNX Runtime. These results demonstrate the Xavier module’s capability to deliver high-speed, low-latency inference suitable for real-time medical applications when optimized with TensorRT.
Figure 19 illustrates the comparative performance gains achieved by employing TensorRT over ONNX Runtime on the Jetson AGX Xavier. As shown in
Figure 19a, the model execution achieved a substantial reduction in inference time.
Figure 19b reflects the 9.74× speedup obtained, while
Figure 19c shows the 818.4% increase in computational efficiency. The speedup was computed as the ratio between ONNX Runtime and TensorRT inference times, and the efficiency gain was derived from the relative increase in mean FPS [
62]. These improvements highlight the impact of hardware-aware optimizations provided by TensorRT, enabling substantial reductions in processing latency and resource consumption—critical factors in real-time clinical deployments.
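The two derived metrics can be reproduced from the Jetson AGX Xavier values in Table 6 using the definitions given above. The function names are chosen for this sketch only.

```python
def speedup(baseline_ms, optimized_ms):
    """Ratio of baseline to optimized mean inference time."""
    return baseline_ms / optimized_ms


def efficiency_gain_percent(baseline_fps, optimized_fps):
    """Relative increase in mean FPS, expressed as a percentage."""
    return (optimized_fps - baseline_fps) / baseline_fps * 100.0


# Jetson AGX Xavier: ONNX Runtime (CPU mode) versus TensorRT.
xavier_speedup = speedup(177.59, 18.23)
xavier_gain = efficiency_gain_percent(6.13, 56.30)

print(f"{xavier_speedup:.2f}x speedup, {xavier_gain:.1f}% efficiency gain")
# prints "9.74x speedup, 818.4% efficiency gain"
```

Both metrics are relative to the ONNX Runtime baseline on the same device, so they describe the benefit of the optimization rather than the absolute speed of the hardware.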
The performance metrics summarized in
Table 7 demonstrate the substantial inference acceleration achieved on the Jetson Orin Nano using TensorRT with the compressed Swin_Tiny_Patch4_Window7_224 model. The optimization process reduced the mean inference time from 792.47 ms to 23.04 ms and significantly lowered the execution variability, as indicated by the standard deviation drop from 104.19 ms to 2.92 ms. Additionally, the throughput increased dramatically from a mean of 1.29 FPS to 44.04 FPS, validating the hardware’s potential for real-time medical image analysis when properly optimized with TensorRT.
Figure 20 visualizes the inference acceleration achieved by TensorRT on the Jetson Orin Nano. As depicted in
Figure 20a, inference time was significantly reduced compared with ONNX Runtime.
Figure 20b,c present a 34.39× speedup and a 3339.4% gain in computational efficiency, respectively. The speedup ratio was computed by dividing the mean inference time of ONNX Runtime by that of TensorRT, while the efficiency gain corresponds to the relative percentage increase in frames per second (FPS) [
62]. These results confirm that the Orin Nano, despite its compact and energy-efficient design, can deliver competitive inference performance when leveraging TensorRT optimizations, making it suitable for edge-level deployment in clinical applications.
Upon optimization with TensorRT, both the Jetson AGX Xavier and the Jetson Orin Nano demonstrated the capability for real-time inference. As shown in
Figure 20a, the Orin Nano achieved a substantial reduction in inference time, despite being a more compact and energy-efficient device.
Figure 20b,c report a 34.39× speedup and a 3339.4% efficiency improvement, respectively, surpassing Xavier’s 9.74× speedup and 818.4% gain. These ratios are measured against each device’s own ONNX Runtime baseline; in absolute terms, the Xavier remained faster with TensorRT (18.23 ms and 56.30 FPS versus 23.04 ms and 44.04 FPS). These findings indicate that the Orin Nano is highly effective for lightweight, cost-efficient deployment scenarios, while the Xavier remains a reliable option for medical AI applications that require greater processing power. Depending on specific operational constraints, either device may be suitable for integrating DL into clinical environments.
4.2. Real-World Usage Scenario
Recent studies indicate that experienced radiologists err 3% to 5% of the time when categorizing brain tumors as gliomas, meningiomas, or pituitary tumors on brain MRIs. The error rate can increase to 13% when a second opinion is obtained [
19]. The proposed Swin_tiny_patch4_window7_224 model achieved an Accuracy of 99.24% during testing, indicating a strong capability to differentiate between tumor classes under controlled conditions. It is essential to emphasize that the objective of this work is not to replace expert radiologists but to provide a diagnostic tool that aids clinical decision-making, reduces workload, and enhances the consistency of interpretations. AI-based systems can be highly beneficial in hospitals with substantial workloads or in regions where expert assistance is scarce.
Figure 21 illustrates a potential application of the proposed approach in a clinical environment.
The recommended clinical deployment scenario comprises five steps. First, the radiology suite acquires MRIs with routine imaging protocols, and the hospital’s PACS stores the volumetric brain MRI images in DICOM format. Second, a DICOM listener or secure API interface provides access to the PACS and allows the AI system to find and retrieve the relevant studies. Third, the Preprocessing and AI Inference Module, running on the Jetson AGX Xavier platform, converts the DICOM data into a model-compatible format through resizing, intensity normalization, and tensor conversion. This module uses the optimized Swin_tiny_patch4_window7_224 model to predict the tumor type and a confidence score, and it produces Grad-CAM-based saliency maps to explain its predictions. Fourth, the Results Visualization Interface sends these outputs to the radiologist’s web client or local network workstation, where the AI-generated findings can be reviewed alongside the anatomical images. Fifth, the radiologist makes the final decision and writes the report, combining the AI output with clinical judgment. Because data are processed locally, this setup reduces latency and keeps data within the hospital network, so that the AI model acts as a real-time decision-support tool. This deployment scenario is one of many and can be adapted to the needs of a specific technical infrastructure.
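The step in which the inference module turns the model output into a tumor type and a confidence score can be sketched as follows. It assumes that the model emits one raw score (logit) per class and that the confidence is the softmax probability of the predicted class; the function name and the example logits are hypothetical.

```python
import math

CLASSES = ["glioma", "meningioma", "no_tumor", "pituitary"]


def predict_with_confidence(logits, classes=CLASSES):
    """Convert raw class scores into (label, confidence, probabilities)."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probabilities = [e / total for e in exps]

    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best], probabilities[best], probabilities


# Hypothetical logits for one MRI slice.
label, confidence, probabilities = predict_with_confidence([2.0, 0.5, -1.0, 0.1])
print(f"prediction: {label} (confidence {confidence:.2f})")
```

Presenting the confidence next to the label and the saliency map lets the radiologist give more scrutiny to low-confidence cases, in keeping with the decision-support role described above.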
4.3. Limitations of the Study
This study’s limitations lie primarily in the dataset. The dataset covered three tumor types plus a no-tumor class, for a total of four classes, and included images in several planes, including coronal, sagittal, and axial. However, a dataset that includes different types of clinical cases from patients of various ages, geographical areas, and hospitals would be necessary. The architectures in this study were evaluated on only four classes and five folds of a single dataset; more cases are needed to gain a comprehensive understanding of the results and to achieve reliable performance in a real medical environment. The experiments were conducted on high-end GPU infrastructure (such as NVIDIA A100 units), which makes it difficult to predict the results in an environment with limited computational resources, such as a public hospital, where portable diagnostic devices or health monitoring platforms are commonly used. Models such as DeiT3_base_patch16_224 were accurate but consumed a significant amount of CPU and GPU resources, which could make them difficult to use in real-time applications. The Swin_tiny_patch4_window7_224 model may need to be pruned or quantized to make it usable on lightweight devices.
4.4. Future Work
Further research will focus on validation using larger, multi-institutional datasets with greater diversity in patient age, regional origin, and scan acquisition protocols. Including such variability is crucial to assess how well the model generalizes to diverse clinical settings and to facilitate its safe incorporation into diagnostic processes. Pathological heterogeneity will also be addressed by including tumor subtypes and different tumor grades. Another focus of future work will be to have expert radiologists review the images and determine whether the label assigned by the model is correct and corresponds to the indicated tumor type. This is needed because the confusion matrices of the models show some incorrect classifications, both false positives and false negatives, that must be verified by an expert. Finally, the best model from this study will be applied in a hospital workflow environment, integrated into an embedded system so that diagnoses can be supported quickly in a real clinical setting.
5. Conclusions
This study presented a robust and carefully engineered framework for the automated multi-class classification of brain tumors into four classes using DL and transformer-based architectures. The proposed strategy combined TL strategies, stratified k-fold cross-validation, GPU profiling, class-level interpretability, and parameter optimization methodologies. The evaluation used the publicly available Msoud brain MRI dataset, which is well balanced across four diagnostic classes: glioma, meningioma, pituitary tumor, and no tumor. The dataset had an entropy-based balance close to 1 (EB = 0.998) and a very low imbalance ratio, which allowed a fair and thorough comparison of the models.
The Swin_tiny_patch4_window7_224 model consistently outperformed the other three architectures (Xception41, Inception_v4, and DeiT3_base_patch16_224) across all primary evaluation metrics. The model achieved the highest test Accuracy (99.24%), with consistently outstanding values for test Precision, test Recall, test F1-Score, and test Matthews Correlation Coefficient, even when using only 75% of the available data for five-fold cross-validation. Importantly, it demonstrated a favorable balance between prediction performance and computational efficiency. It used little GPU memory (approximately 20%), required little CPU capacity (approximately 25%), and had the lowest GPU power consumption (approximately 180 W) among the evaluated models. These features make it more suitable for settings such as hospitals and clinics where computational resources are limited.
To validate the feasibility of deploying the proposed classification model in resource-constrained clinical environments, inference tests were conducted on two embedded platforms: Jetson AGX Xavier and Jetson Orin Nano. Using optimized versions of the Swin_tiny_patch4_window7_224 model exported via ONNX and accelerated with TensorRT, both devices achieved real-time performance. While the Xavier delivered robust inference speeds, the Orin Nano notably surpassed it in both speedup ratio and computational efficiency, despite its lower power profile. These results confirm the suitability of both platforms for edge deployment, with the Orin Nano standing out as a highly efficient and cost-effective solution for portable medical AI applications.
This work reported the Accuracy of four recent architectures. Although previous studies employed different datasets, ensemble configurations, and architectural depths, the Swin Transformer-based method proved to be one of the most effective approaches while requiring comparatively few computational resources. This combination of performance and efficiency sets it apart, particularly when compared with models that prioritize architectural complexity over practical effectiveness. Grad-CAM explainability techniques also showed that the model consistently focused on the diagnostically important areas of the MRI images. Such activation maps can give clinicians greater confidence that the model is functioning correctly and make its decisions easier to understand. ROC curve analysis showed that the architecture could effectively distinguish between tumor types in every fold, achieving the optimal AUC value (1.0) in all cases. The model was therefore stable across the data partitions evaluated.
In conclusion, this study demonstrates that the Swin Transformer-based architecture can be used in the medical field to classify brain tumors automatically. It also provides a clear, reproducible, and computationally efficient methodology. This method contributes to the field of AI-driven medical image analysis and offers a useful reference point for the future development of reliable and widely accessible CAD systems.