Enhanced Multi-Class Brain Tumor Classification in MRI Using Pre-Trained CNNs and Transformer Architectures

Gómez-Guzmán, Marco Antonio; Jiménez-Beristain, Laura; García-Guerrero, Enrique Efren; Aguirre-Castro, Oscar Adrian; Esqueda-Elizondo, José Jaime; Ramos-Acosta, Edgar Rene; Galindo-Aldana, Gilberto Manuel; Torres-Gonzalez, Cynthia; Inzunza-Gonzalez, Everardo

doi:10.3390/technologies13090379

by Marco Antonio Gómez-Guzmán ¹, Laura Jiménez-Beristain ², Enrique Efren García-Guerrero ¹,*, Oscar Adrian Aguirre-Castro ¹, José Jaime Esqueda-Elizondo ², Edgar Rene Ramos-Acosta ¹, Gilberto Manuel Galindo-Aldana ³, Cynthia Torres-Gonzalez ³ and Everardo Inzunza-Gonzalez ¹,*

¹ Facultad de Ingeniería, Arquitectura y Diseño, Universidad Autónoma de Baja California, Carretera Transpeninsular Ensenada-Tijuana No. 3917, Ensenada 22860, Baja California, Mexico

² Facultad de Ciencias Químicas e Ingeniería, Universidad Autónoma de Baja California, Calzada Universidad No. 14418, Parque Industrial Internacional, Tijuana 22424, Baja California, Mexico

³ Laboratory of Neuroscience and Cognition, Facultad de Ciencias Administrativas, Sociales e Ingeniería, Universidad Autónoma de Baja California, Carr. Est. No. 3 s/n Col. Gutierrez, Mexicali 21700, Baja California, Mexico

* Authors to whom correspondence should be addressed.

Technologies 2025, 13(9), 379; https://doi.org/10.3390/technologies13090379

Submission received: 30 June 2025 / Revised: 7 August 2025 / Accepted: 20 August 2025 / Published: 22 August 2025

(This article belongs to the Special Issue Advancements in Medical and Assistive Technologies Using Artificial Intelligence and Deep Learning Techniques)

Abstract

Early and accurate identification of brain tumors is essential for determining effective treatment strategies and improving patient outcomes. Artificial intelligence (AI) and deep learning (DL) techniques have shown promise in automating diagnostic tasks based on magnetic resonance imaging (MRI). This study evaluates the performance of four pre-trained deep learning architectures, two convolutional neural networks (CNNs) and two vision transformers, for the automatic multi-class classification of brain tumors into four categories: Glioma, Meningioma, Pituitary, and No Tumor. The proposed approach utilizes the publicly accessible Brain Tumor MRI Msoud dataset, consisting of 7023 images, with 5712 provided for training and 1311 for testing. To assess the impact of data availability, subsets containing 25%, 50%, 75%, and 100% of the training data were used. A stratified five-fold cross-validation technique was applied. The architectures evaluated are DeiT3_base_patch16_224, Xception41, Inception_v4, and Swin_Tiny_Patch4_Window7_224, all fine-tuned using transfer learning. The training pipeline incorporated advanced preprocessing and image data augmentation techniques to enhance robustness and mitigate overfitting. Among the models tested, Swin_Tiny_Patch4_Window7_224 achieved the highest classification Accuracy of 99.24% on the test set using 75% of the training data. This model demonstrated superior generalization across all tumor classes and effectively addressed class imbalance issues. Furthermore, we deployed and benchmarked the best-performing DL model on embedded AI platforms (Jetson AGX Xavier and Orin Nano), demonstrating their capability for real-time inference and highlighting their feasibility for edge-based clinical deployment. The results highlight the strong potential of pre-trained deep CNN and transformer-based architectures in medical image analysis. The proposed approach provides a scalable and energy-efficient solution for automated brain tumor diagnosis, facilitating the integration of AI into clinical workflows.

Keywords:

brain tumor classification; medical image analysis; magnetic resonance imaging (MRI); multi-class classification; computer-aided diagnosis (CAD); deep learning; transfer learning; convolutional neural networks (CNNs); vision transformers (ViT); artificial intelligence

1. Introduction

A brain tumor (BT) is defined as a conglomeration of abnormal cells in various parts of the brain [1]. Abnormal cellular proliferation in the brain can affect normal cells and their functions, leading to recurrent headaches, speech alterations, concentration difficulties, seizures, auditory problems, and memory loss. Brain tumors are classified into many categories. Malignant tumors are referred to as cancerous, whereas benign tumors are termed non-cancerous. Primary brain tumors begin in the brain itself. Secondary tumors or metastatic tumors can start in places outside the brain and spread to the brain [2,3]. Malignant brain tumors proliferate rapidly and disseminate to various regions of the brain and spinal cord, rendering them more perilous than benign tumors. A glioma is a tumor that develops in the brain and spinal cord. A glioma’s ability to impair brain function and pose a threat to survival depends on its location and pace of growth [4].

According to the World Health Organization (WHO), brain tumors are categorized into four grades of severity, ranging from grade I to grade IV [5], and include types such as high-grade glioma (HGG), low-grade glioma (LGG), meningioma, and pituitary tumors of the central nervous system. Meningioma tumors originate from the meninges, the protective tissue around the brain and spinal cord. The meninges safeguard the spinal cord and entirely encase the brain. Due to their slow growth and reduced tendency to metastasize, meningiomas are generally classified as benign tumors. Most pituitary tumors arise from spontaneous mutations, although some result from hereditary defects. These tumors are typically benign and have a minimal risk of metastasis. Nevertheless, their presence in critical areas of the brain may lead to significant health issues.

How severely a brain tumor impacts the functioning of the nervous system is dependent on the tumor’s location and its pace of development. Brain tumors are treated differently depending on their nature, location, and size. Although it is essential for achieving better results, early identification and diagnosis of brain tumors remain challenging [6]. A variety of diagnostic methods exist to detect brain disorders. The predominant non-invasive technology used in the medical profession for diagnosing various brain disorders is the Magnetic Resonance Imaging (MRI) technique, which is utilized to assess malignancies in brain scans [6,7]. The majority of specialists choose MRI for its capacity to highlight intricacies of brain malignancies and specific regions of the brain [8]. MRI scans provide superior contrast resolution, facilitating the identification of minute cancers that may otherwise go undetected. Moreover, its capacity to record numerous planes of the targeted brain regions while patients maintain a stationary posture considerably augments the Accuracy and Precision of the diagnostic procedure [9,10].

Additionally, MRI uses magnetic fields, radiofrequency pulses, and computer processing to visualize organs and bones throughout the body. The process of diagnosing brain issues is referred to as brain MRI analysis [11]. A brain MRI delivers clear views of the posterior brain and brainstem, unlike a computerized tomography (CT) scan. MRI offers improved clarity of image slices in various sequences, including T1, T1CE, FLAIR, and T2, compared with CT and Positron Emission Tomography (PET). MRI is advantageous for accurately identifying low- and high-grade lesions during tumor detection [12,13,14].

In medical image analysis, segmentation and classification are two fundamental tasks [14]. While classification involves identifying disease types or grading illnesses based on image attributes, such as color, texture, and shape, segmentation entails isolating specific sections of interest from medical images, including organs, tissues, or lesions. For clinical applications such as illness diagnosis, tracking treatment progress, and rehabilitation evaluation, segmentation and classification Accuracy are crucial [15]. The early diagnosis of brain tumors is crucial for enhancing treatment efficacy and thus improving patient survival rates [16,17]. Nonetheless, the examination of brain neoplasms using medical imaging can be complex. Moreover, manual segmentation of brain tumors is both expensive and labor-intensive [18]. Studies that evaluate radiological errors consistently report day-to-day error rates between 3% and 5%, whereas in some clinical settings retrospective discrepancy rates exceed 30%. In neuroimaging, secondary evaluations by experienced neuroradiologists may reveal significant discrepancies in up to 13% of cases. Reported levels of human diagnostic accuracy in the multi-class classification of brain tumors vary across studies; taken together, these data show the intrinsic limitations of expert interpretation [19]. Consequently, automated methodologies are highly valued. The automated identification of brain tumors is a formidable medical challenge [17]. In recent years, many methodologies, from basic machine learning (ML) models to advanced approaches such as CNNs and vision transformers (ViT), have been used to automatically identify brain tumors via brain MRI [20]. Nonetheless, the development of a translational and interpretable model for the precise identification and classification of brain cancers remains a prominent scientific endeavor [21].

Researchers are developing Computer-Aided Diagnosis (CAD) systems to assist physicians in rapid and accurate diagnosis. Deep learning (DL), especially using CNNs, has shown remarkable efficacy in several classification tasks, including medical imaging [3,4,5,22,23]. Methods such as transfer learning (TL) and fine-tuning have significantly improved their diagnostic efficacy by enabling models to leverage existing information. Training a brain tumor classifier from scratch requires a large quantity of labeled data, which is particularly difficult to obtain [24]. To avoid this issue, researchers employ pre-trained models [25]. These models proficiently integrate, scale, and compress features, while efficiently using residual information via diverse layer architectures to enhance performance [10]. Therefore, our motivation stems from the convergence of five key factors: (i) the clinical need to improve non-invasive and early diagnosis of intracranial tumors, (ii) the technical complexity inherent in differentiating tumors with overlapping radiological appearances, (iii) the translational potential of deep learning to integrate with radiology practice and reduce reliance on invasive techniques, (iv) reducing the workload on radiologists, which is especially valuable in countries such as Mexico, where radiologists are scarce, and (v) contributing to the use of advanced deep learning technologies in the early detection and classification of brain tumors in MRI to improve people’s quality of life.

The structure of this paper is presented as follows: In Section 2, the state of the art of works related to BT classification is presented. The materials and methods are delineated in Section 3, which includes descriptions of the pre-trained models, the dataset, and image preprocessing techniques. Section 4 presents the performance analysis, which consists of a comparison of the outcomes of each model and a comprehensive examination of the findings, including their analysis, validation, research limitations, and broader implications in the area. The study is finally concluded in Section 5, which also identifies potential areas for future research in the classification of brain tumors using DL techniques.

2. Related Work

Numerous studies have focused on classification algorithms for the precise detection of brain cancers. The following section provides an overview of the research that developed multiple CNN models and motivated the present study.

In [1], a deep learning approach is proposed that utilizes transfer learning with EfficientNet variants (B0–B4) to classify brain tumors in MR images into glioma, meningioma, and pituitary types of brain tumors. EfficientNetB2 achieved the highest results with 98.86% accuracy, 98.65% precision, and 98.77% recall when used with the public CE-MRI Figshare dataset. Grad-CAM visualizations verified the model’s emphasis on tumor regions, and data augmentation enhanced generalization.

The authors in [26] used a Kaggle dataset to classify brain tumors from X-ray images. Preprocessing, including noise reduction and data augmentation, preserved important edges and generated synthetic variations. After fine-tuning, VGG19, InceptionV3, and MobileNetV2 were used. The most accurate model was VGG19 (98.58%), beating InceptionV3 (97.6%) and MobileNetV2 (98.47%).

Research in [27] suggested a hybrid model that combines MobileNetV2 (a feature extractor) with a support vector machine (SVM) classifier. The MRI dataset Msoud is available on the Kaggle repository. Findings included an AUC of 0.99 for glioma, 0.97 for meningioma, and 1.0 for pituitary and no tumor.

In [28], a new attentional TL model, Pre-trained Attention-fused Image SpectraNet, is proposed to enhance brain tumor detection and classification in MRI images. A CNN-based architecture is used. Training stability is improved using the Adam optimizer. The four classifications are normal, pituitary, glioma, and meningioma. The system achieved 98.33% Accuracy and 98.35% Precision.

In [29], the authors implemented BrainNeuroNet, a teacher-student model for brain tumor detection that utilizes a Hierarchical DConv Transformer (HD) for global feature extraction and a MultiScale Attention (MSA) network for local feature extraction. Preprocessing steps included image scaling, normalization, and quality improvement. Images were collected from the BR35H and Brain Tumor MRI datasets. With a 98.63% Accuracy rate, the model demonstrated its effectiveness in precise brain tumor diagnosis, surpassing previous approaches. Similarly, in [30], the authors proposed a deep convolutional neural network architecture with parallel dilated convolutions (PDCNNN). The model extracts fine and coarse features using parallel routes with varying dilation rates to mitigate DCNN-based overfitting and preserve global context. The network is trained using an average ensemble technique and assessed on Chakrabarty Brain MRI Images (binary), Figshare (multi-class), and Msoud (Kaggle) (multi-class), achieving accuracies of 98.67%, 98.13%, and 98.35%, respectively. These findings demonstrate that the proposed model outperforms earlier techniques. In [31], the study introduces a two-stage structural MRI brain tumor classification methodology. A pre-trained convolutional neural network automatically extracts information in the initial step, reducing training time and processing requirements. To avoid overfitting, a filter-based deep feature selection method is employed. SVMs with polynomial kernels classify multi-class data. On the MSoud dataset, the model achieved an Accuracy of 98.17%. The Crystal Clean: Brain Tumors MRI Dataset and Figshare datasets achieved 99.46% and 98.70% Accuracies, respectively.

In another case, work [32] presented a model that enhances the Accuracy and transparency of brain tumor classification in MRI. The model is based on the EfficientNetB0 architecture and utilizes explainable artificial intelligence (XAI) approaches, specifically Grad-CAM. In a multi-class classification approach, the model was trained and tested using the Msoud dataset. It achieved an Accuracy of 98.72%.

In the work in [33], a two-stage brain tumor classification method utilizing the BR35H dataset was introduced. First, modern image enhancement algorithms (GFPGAN and Real-ESRGAN) improve MRI picture quality and resolution. Nine DL models are trained using five optimizers. In the second step, the top classifiers are combined to use ensemble learning methods such as weighted sum, fuzzy ranking, and majority voting. By employing GFPGAN and the five top models, the system outperformed prior brain tumor classification methods with 100% Accuracy.

In [2], the BRATS 2015 dataset and the Figshare Dataset were used for training the proposed Multi-Class Convolutional Neural Network model (MCCNN). Two experiments, Experiment I and Experiment II, were undertaken to evaluate the performance. The suggested MCCNN-based model achieved 99% Accuracy in Experiment I and 96% in Experiment II. In [4], authors used pre-trained DL models, Xception, MobileNetV2, InceptionV3, ResNet50, VGG16, and DenseNet121, to identify brain MRI scans in four classes. CNN models were trained using a publicly accessible Brain Tumor MRI dataset, Msoud. Xception performed best with a weighted Accuracy of 98.73%. Similarly, in [5], a lightweight Multi-path Convolutional Neural Network (M-CNN) was proposed. During training, the model was instructed to recognize four distinct types of tumors. Sartaj, a publicly available Brain Tumor MRI dataset, was used to train the model. The model achieved a performance Accuracy level of 96.03%.

An ensemble of CNNs was introduced in [6]. The ensemble model integrated VGG16 and ResNet152V2 architectures, demonstrating a classification Accuracy of 99.47% on the complex four-class Msoud dataset. Similarly, authors in [16] introduced a novel ensemble using Swin Transformer and ResNet50V2 (SwT + ResNet50V2). The design utilizes self-attention and DL techniques to enhance diagnostic Precision while minimizing training complexity and memory consumption. The model was trained with the BR35H and Msoud datasets. An Accuracy of 99.9% was achieved in BR35H and 96.8% in Msoud. In [20], five CNN models trained via TL and fine-tuning were combined in an ensemble model, which was optimized using Particle Swarm Optimization (PSO). Three brain tumor datasets, namely Figshare (Dataset 1), Sartaj (Dataset 2), and Msoud (Dataset 3), were used for evaluation. In Figshare, the model achieved an Accuracy of 99.35%, in Sartaj, 98.77%, and in Msoud, 99.92%.

In [10], the authors developed an innovative hybrid system named TUMbRAIN. Additionally, they used the Msoud dataset for training, which has four classes. The findings indicate that TUMbRAIN surpasses most contemporary neural network models, achieving an exceptional total Accuracy of 97.94% with just 1.04 million parameters. In [34], the DeepNeuroXpert (DN-XPert) model was introduced for accurate brain tumor detection, along with three complementary models: NSAS-Net for segmentation, AI2CF for classification, and WPSO for parameter tuning. Two brain tumor imaging collections, Figshare and Msoud, helped the study. Performance criteria, including Accuracy, reached 99.4%, suggesting the potential of the proposed models to enhance brain tumor detection and classification.

The research presented in [35] employed a novel CNN architecture with explainable artificial intelligence (XAI) algorithms, such as Grad-CAM, SHAP, and LIME, to classify brain tumors. Fewer layers and parameters improve model interpretability and resilience compared with earlier models. Using the Msoud and NeuroMRI Datasets, the technique achieved 99% Accuracy on known data and 95% Accuracy on unknown data, demonstrating its generalizability and clinical value. In [36], the authors proposed an advanced brain tumor multi-class classification method that uses TL and evolutionary algorithms. The pre-trained models EfficientNetB3 and DenseNet121 were optimized for hyperparameters using CEGA. The study used Msoud dataset. Without data augmentation, CEGA-EfficientNetB3 and CEGA-DenseNet121 achieved accuracies of 99.39% and 99.01%, respectively, surpassing the state-of-the-art approaches.

In [37], the authors introduced the M-C&M-BL model, which uses a CNN for image feature extraction and a BiLSTM network for sequential data processing. MRI data from Br35H were used to assess the model. The results, with 99.33% Accuracy, suggest that this CNN is suitable for integration into clinical decision support systems, online and mobile diagnostic platforms, and hospital picture archiving and communication systems (PACS).

The work in [38] presented a methodology for brain tumor identification that incorporates model optimization, modeling of realistic settings, and sophisticated data augmentation techniques. Utilizing the Msoud dataset, optimizers such as Adam and augmentation methods like CutMix, PatchUp, Gaussian Noise, and Blur were employed, resulting in an Accuracy of 99.45% under optimal conditions. Nonetheless, the performance diminished when confronted with synthetic data that was disturbed, highlighting the model’s limitations in robustness within authentic clinical contexts.

To classify brain tumors in histological images, the research in [39] suggested using a CNN architecture in conjunction with a Vision Transformer (ViT) model. The Msoud dataset was used to train the algorithm with four classes. Outperforming prior methods, with performance ranging from 95% to 98%, a 95% confidence interval and an additional Accuracy of 99.42% were achieved, resulting in an overall Accuracy of 99.64%.

In [40], the study presented a deep TL methodology for the early identification of brain cancers in magnetic resonance imaging (MRI), using preprocessing, segmentation by OTSU, and feature extraction via Gabor Wavelet Transform, optimized by Grey Wolves Optimization (GWO). Five architectures were assessed: VGG19, InceptionV3, InceptionResNetV2, ResNet152, and DenseNet121, with the latter demonstrating the highest Accuracy. The model underwent training on the Msoud dataset. DenseNet121 had the best Accuracy at 99.43%, surpassing the other evaluated designs.

The research in [41] introduced a brain tumor classification model using pre-trained CNN architectures augmented with supplementary feature extraction layers and diverse activation functions (ReLU, PReLU, Swish). Seven architectures were assessed: VGG19, InceptionV3, ResNet50V2, InceptionResNetV2, DenseNet201, MobileNetV2, and EfficientNetB7, combined using a majority vote ensemble method. The models were trained using the Chakrabarty N. Brain MRI Images for Brain Tumor Detection dataset, including records from 253 patients. The model attained an Accuracy of 99.34%. Similarly, work [42] utilized Chakrabarty N. Brain MRI images for brain tumor detection (BTD) training, employing a CNN model integrated with a multilayer perceptron (MLP) for feature extraction. The model obtained 99.6% Accuracy.

The study in [43] presented ParMamba, a parallel architecture that amalgamates Convolutional Attention Patch Embedding (CAPE) with the ConvMamba block, which incorporates CNN, Mamba, and a channel improvement module, to boost brain tumor identification. The model was evaluated using the Msoud and Figshare datasets, achieving accuracies of 99.62% and 99.35%, respectively.

In [44], the authors presented the Superimposed AlexNet models (SAlexNet-1 and SAlexNet-2) for precise classification of primary brain tumors, incorporating three principal enhancements: Hybrid Attention Mechanism (HAM), 3 × 3 convolutional layers for comprehensive feature extraction, and semi-transfer learning (STL) for encoder pre-training. The models were assessed using the SARTAJ (multi-class classification) and BR35H (binary classification) datasets, with SAlexNet-1 achieving a Precision of 98.78% and 98.07%, and SAlexNet-2 achieving 99.69% and 99.17%, respectively.

Improving the process of early brain tumor diagnosis is the main objective of this work. This diagnosis has a direct effect on patient care. Modern technologies, such as DL and TL, are being promoted as a solution to the problems with older approaches that depend on human interpretation and specialized expertise. To boost processing speed and reduce human error, these automated categorization techniques are essential. This work utilizes complex DL and TL methods to identify brain MRI images, therefore improving the Accuracy and efficiency of medical imaging systems and assisting medical professionals in detecting brain cancer more efficiently.

This study uses the openly accessible Msoud dataset, which contains MRI scans of Glioma, Meningioma, no tumor, and Pituitary tumors. This study evaluates four innovative pre-trained deep learning models for brain tumor classification, utilizing various dataset percentages and a stratified five-fold cross-validation technique. Incorporating these state-of-the-art methods into a system will enable improved clinical decision-making through the provision of accurate and scalable brain tumor classification. This work sets the way for future AI-driven medical research and contributes to our understanding of DL’s potential in medical image processing, particularly for the detection of brain tumors.

The main contributions of this paper are as follows:

Transfer learning (TL) and pre-trained deep learning models are employed for brain MRI classification, aiming to achieve faster, more accurate, and consistent diagnostic outcomes compared with traditional clinical methods.
Several state-of-the-art pre-trained models (DeiT3_Base_Patch16_224, Xception41, Inception_v4, and Swin_Tiny_Patch4_Window7_224) are used to classify brain tumors, leveraging TL to reduce training time and computational load while enhancing classification performance.
The classification performance of the selected models is evaluated using the Msoud brain tumor MRI dataset, categorized into four classes (glioma, meningioma, pituitary tumor, and no tumor), focusing on Accuracy, Precision, Recall, F1-Score, and MCC.
A parameter optimization strategy is implemented in combination with training on multiple dataset partitions, ranging from 25% to 100%, to assess the scalability and robustness of the models.
Graph-based visualization tools are utilized to analyze key hyperparameters and learning behavior during training, validation, and testing phases.
Deployment feasibility of the best-performing model is validated through real-time inference benchmarking on embedded AI hardware platforms (NVIDIA Jetson AGX Xavier and Jetson Orin Nano), demonstrating high-speed, low-latency performance suitable for edge-level clinical applications.

3. Materials and Methods

3.1. Proposed Methods

The aim of this work is to train and assess four DL architectures for classifying brain tumors using the publicly available Msoud dataset from Kaggle. Glioma, meningioma, no tumor, and pituitary are the classes that are included in the Msoud dataset. The methodology used in this study is shown in Figure 1.

The methods proposed in this paper have been used previously in other studies, for example, in [2,3,4,17,45,46]. This method involves a process designed to analyze and interpret magnetic resonance imaging data to classify brain tumors using convolutional neural networks (CNNs). This method encompasses everything from data acquisition to the interpretation of results in the final stage, ensuring a clinically relevant approach. The initial phase involved selecting the Msoud dataset, which contains axial, sagittal, and coronal slices that represent tumor pathologies. The dataset was manually annotated by clinical experts and served as the basis for the supervised training process.

Subsequently, a preprocessing stage was carried out to normalize the images that serve as input data for the models, thereby improving their performance. This process included labeling the classes, resizing the images to a fixed input resolution suitable for each model, and converting the images to RGB to ensure compatibility with the pre-trained models. In addition, data augmentation techniques, such as random rotations and random horizontal flips, were employed to increase the diversity of the training set, reduce the likelihood of overfitting, and improve the generalization ability of the models. Image preprocessing is a crucial step in developing an automated CNN-based system for medical image classification. This procedure enhances the visual information included in medical images [47].

Current pre-trained DL models in the literature were used for the training, validation, and testing stages, as previous studies have proven their effectiveness in medical image classification. These models included DeiT3_base_patch16_224, a data-efficient variant of the Vision Transformer; Xception41, which uses depthwise separable convolutions for efficient feature extraction; Inception_v4, known for its multiscale convolutional approach; and Swin_tiny_patch4_window7_224, a hierarchical vision transformer that uses shifted windows for local self-attention. Stratified K-fold cross-validation was employed to train each model, and each model was subsequently tested on the held-out test partition to ensure the reliability of the results.

The models were evaluated using Accuracy, Precision, Recall, F1-Score, and the Matthews correlation coefficient (MCC). The latter is especially useful when there are more than two classes and the data is not evenly distributed across them. We also created confusion matrices to see how the models assigned samples to each class and to determine which models worked best for this specific task.
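As an illustration of how these metrics relate to a confusion matrix, the following is a minimal sketch in plain Python. It assumes macro averaging over classes for Precision, Recall, and F1-Score (the averaging scheme is an assumption, not stated above) and uses the standard multi-class generalization of the MCC. The function name `classification_metrics` and the toy matrix are hypothetical and do not reproduce any result from this study.

```python
import math


def classification_metrics(cm):
    """Return Accuracy, macro Precision/Recall/F1-Score, and multi-class MCC.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(k))
    true_counts = [sum(cm[c]) for c in range(k)]                       # row sums
    pred_counts = [sum(cm[r][c] for r in range(k)) for c in range(k)]  # column sums

    precisions, recalls, f1_scores = [], [], []
    for c in range(k):
        tp = cm[c][c]
        precision = tp / pred_counts[c] if pred_counts[c] else 0.0
        recall = tp / true_counts[c] if true_counts[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    # Multi-class MCC computed directly from the confusion matrix.
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    mcc = numerator / denominator if denominator else 0.0

    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / k,  # macro average (assumption)
        "recall": sum(recalls) / k,        # macro average (assumption)
        "f1": sum(f1_scores) / k,          # macro average (assumption)
        "mcc": mcc,
    }


# Hypothetical toy confusion matrix for four classes (NOT results from the paper).
toy_cm = [
    [8, 1, 1, 0],
    [1, 9, 0, 0],
    [0, 0, 10, 0],
    [0, 1, 0, 9],
]
toy_metrics = classification_metrics(toy_cm)
print(toy_metrics)
```

Unlike Accuracy, the MCC uses both the row and column totals of the confusion matrix, which is why it remains informative when class sizes differ.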

Finally, explanations were added to help physicians understand the model’s predictions. We utilized methods such as Grad-CAM (Gradient-weighted Class Activation Mapping) to generate heat maps highlighting the most significant areas in the input MRI scans. These visual tools not only clarify the model’s decision-making process but also enable medical experts to verify whether the model prioritizes important tumor characteristics in the MRI.

3.2. Msoud MRI Dataset

The Msoud MRI dataset [48] from Kaggle is constructed from three publicly available datasets: Figshare, Sartaj, and Br35H. It has a total of 7023 grayscale magnetic resonance images in JPEG format, including coronal, sagittal, and axial cuts, divided into four diagnostic classes: glioma, meningioma, no tumor, and pituitary tumor. This dataset was employed for training, validation, and testing of various TL-based models implemented in this study. The training subset comprises 1321 glioma, 1339 meningioma, 1595 no-tumor, and 1457 pituitary tumor images, totaling 5712 images. The testing subset consists of 300 glioma, 306 meningioma, 405 no-tumor, and 300 pituitary tumor images, totaling 1311 images. The class-wise distribution is shown in Table 1. The dataset’s diversity in anatomical presentation and tumor morphology enhances its suitability for developing and validating robust DL models for brain tumor classification. According to Table 1, the dataset was explicitly divided into 5712 images for training (81.33%) and 1311 images for testing (18.67%). These partitions were predefined by the dataset’s original authors and were respected in our implementation to ensure consistency with existing literature. Importantly, regarding the validation, we further applied a stratified 5-fold cross-validation strategy within the training subset, ensuring class balance across folds and preventing any overlap of images between folds. Although the dataset does not include patient-level identifiers, all experiments were conducted at the slice level as provided by the source. Each model was evaluated using a test dataset completely isolated from the training and validation processes, ensuring strict independence between training and testing phases. The use of stratified cross-validation and fixed test partition helps ensure the robustness and reproducibility of the reported results.
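The stratified five-fold partitioning of the training subset can be sketched as follows. This is a simplified, standard-library illustration and not the authors' implementation: it assigns the shuffled indices of each class to folds in round-robin order, so every fold keeps roughly the original class proportions and no image index appears in two validation folds. The label strings, the seed, and the function name `stratified_kfold` are hypothetical; in practice a library routine such as scikit-learn's `StratifiedKFold` would typically be used.

```python
import random
from collections import Counter, defaultdict

# Per-class image counts of the training subset, as reported in the text.
TRAIN_COUNTS = {"glioma": 1321, "meningioma": 1339, "no_tumor": 1595, "pituitary": 1457}


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs that preserve class proportions per fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Round-robin assignment keeps per-class fold sizes within one sample.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, val_idx


# Stand-in label list: one label per training image (no image data is loaded).
labels = [name for name, n in TRAIN_COUNTS.items() for _ in range(n)]
splits = list(stratified_kfold(labels, k=5, seed=0))

for fold_number, (train_idx, val_idx) in enumerate(splits, start=1):
    fold_class_counts = Counter(labels[i] for i in val_idx)
    print(fold_number, len(train_idx), len(val_idx), dict(fold_class_counts))
```

The fixed 1311-image test partition is deliberately absent from this sketch, since it takes no part in fold construction.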

Figure 2 illustrates the class distribution in the training subset of the Msoud brain tumor dataset, which comprises 5712 MRI images divided into four diagnostic groups: 1321 glioma images (23.13%), 1339 meningioma images (23.44%), 1457 pituitary tumor images (25.51%), and 1595 non-tumorous brain scans (27.92%). The donut chart shows that all classes are represented fairly evenly, with the No Tumor class slightly larger than the others.
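The reported percentages can be reproduced from the class counts with a few lines of Python. The snippet below is only an arithmetic check; the dictionary keys are hypothetical labels chosen for illustration.

```python
# Per-class image counts of the training subset and the predefined split sizes,
# as reported in the text.
TRAIN_COUNTS = {"glioma": 1321, "meningioma": 1339, "pituitary": 1457, "no_tumor": 1595}
TEST_SIZE = 1311

total_train = sum(TRAIN_COUNTS.values())
total_images = total_train + TEST_SIZE

# Share of each class within the training subset, in percent.
shares = {name: round(100 * n / total_train, 2) for name, n in TRAIN_COUNTS.items()}

# Share of the training and test partitions within the whole dataset, in percent.
train_share = round(100 * total_train / total_images, 2)
test_share = round(100 * TEST_SIZE / total_images, 2)

print(total_train, total_images)
print(shares)
print(train_share, test_share)
```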

The almost equal distribution among the different classes reduces the potential for bias caused by data imbalance during the model training process. However, to ensure the robustness and generalization capacity of the predictive model, it is still advisable to adopt methodological measures. These measures may include techniques such as data augmentation, class reweighting, or oversampling of underrepresented categories in situations involving small training batches. Additionally, K-fold stratified cross-validation is employed to maintain consistent class proportions within each fold. This method enables a robust statistical evaluation of the model, as it ensures that each fold accurately represents the class distribution of the entire dataset, thereby improving replicability and confidence in performance metrics.
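The stratified partitioning described above can be sketched in a few lines. This is a minimal illustration in pure Python, not the authors' implementation (the paper does not state which library was used; scikit-learn's `StratifiedKFold` is a common choice). The helper name `stratified_folds`, the class label strings, and the seed are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds so each fold keeps the class proportions.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin so every fold receives
        # roughly 1/k of the samples of this class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Training-subset class counts reported in the text.
counts = {"glioma": 1321, "meningioma": 1339, "no_tumor": 1595, "pituitary": 1457}
labels = [name for name, n in counts.items() for _ in range(n)]

folds = stratified_folds(labels, k=5)
for i, fold in enumerate(folds, start=1):
    print(i, len(fold), dict(Counter(labels[j] for j in fold)))
```

Each image index lands in exactly one fold, and the per-class counts differ by at most one image between folds, which is the property the stratified scheme relies on.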

Figure 3 presents a representative sample of the Msoud dataset, illustrating the visual characteristics and grayscale intensity distribution typical of the MRI scans used for model training and evaluation.

3.3. Class Balance Analysis in the Msoud Dataset

In this subsection, the class distribution of the Msoud dataset is analyzed using the Imbalance Ratio (IR) [49,50] and the Entropy Balance (EB) [51].

The dataset consists of the following number of samples per class:

Glioma tumor: 1321 images.
Meningioma tumor: 1339 images.
No tumor: 1595 images.
Pituitary tumor: 1457 images.

The total number of images in the dataset is

N = 1321 + 1339 + 1595 + 1457 = 5712

3.3.1. Imbalance Ratio (IR) Calculation in Msoud Dataset

The IR quantifies the disparity between the majority and minority classes. It is defined as

I R = \frac{n_{majority}}{n_{minority}}

where the following are used:

$n_{majority} = 1595$ (No tumor)
$n_{minority} = 1321$ (Glioma tumor)

I R = \frac{1595}{1321} \approx 1.21

Interpretation: An IR of approximately 1.21 suggests a mild imbalance, with the largest class containing about 21% more samples than the smallest class.

3.3.2. Entropy Balance (EB) Calculation in Msoud Dataset

The EB is a normalized measure of the uncertainty or disorder in the class distribution. It is calculated as

E B = \frac{H (P)}{H_{\max}}

where the following are used:

$H (P) = - \sum_{i = 1}^{4} P_{i} \log_{2} (P_{i})$ is the entropy of the class probabilities.
$H_{\max} = \log_{2} (4) = 2$ is the maximum possible entropy for four equally likely classes.

The probabilities of each class are

P_{1} = \frac{1321}{5712} \approx 0.231 (Glioma), P_{2} = \frac{1339}{5712} \approx 0.234 (Meningioma)

P_{3} = \frac{1595}{5712} \approx 0.279 (No tumor), P_{4} = \frac{1457}{5712} \approx 0.255 (Pituitary)

Now, we compute the entropy:

H (P) = - (0.231 \log_{2} (0.231) + 0.234 \log_{2} (0.234) + 0.279 \log_{2} (0.279) + 0.255 \log_{2} (0.255))

H (P) \approx - (0.231 \times (-2.114) + 0.234 \times (-2.095) + 0.279 \times (-1.842) + 0.255 \times (-1.971))

H (P) \approx 0.488 + 0.490 + 0.514 + 0.503 = 1.995

Finally, we compute the Entropy Balance:

E B = \frac{1.995}{2} \approx 0.998

The EB value is approximately 0.998, indicating that the class distribution is very close to uniform. The Entropy Balance takes values between 0 and 1, with 1 indicating perfect balance, so the dataset is almost perfectly balanced across the four classes.

The IR and EB values for the Msoud dataset are shown in Figure 4. The IR value of 1.21 indicates a slight class imbalance, while the EB value of 0.998 shows that the class distribution is nearly uniform. Together, these values show that the dataset has a fairly balanced structure, which is favorable for training and testing models.
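Both balance measures can be recomputed directly from the class counts. The following is a minimal sketch that uses only the Python standard library; the variable names are chosen for the example and do not come from the original work.

```python
import math

# Training-subset class counts reported in the text.
counts = {"glioma": 1321, "meningioma": 1339, "no_tumor": 1595, "pituitary": 1457}

total = sum(counts.values())

# Imbalance Ratio: largest class divided by smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Entropy Balance: Shannon entropy normalized by its maximum, log2(number of classes).
probs = [n / total for n in counts.values()]
entropy = -sum(p * math.log2(p) for p in probs)
entropy_balance = entropy / math.log2(len(counts))

print(total, round(imbalance_ratio, 2), round(entropy_balance, 3))
# prints: 5712 1.21 0.998
```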

3.4. Image Preprocessing Techniques

The preprocessing methods were coded in Python (version 3.11.12) on the Google Colab platform, using PyTorch version 2.6.0+cu124 and an Nvidia A100 GPU. Weights and Biases (Wandb) [52] was used during the training, validation, and testing stages to generate data visualizations.

3.4.1. Image Data Augmentation

Image data augmentation comprises a range of strategies that diversify training datasets and help build more robust DL models. Standard methods include resizing, color adjustment, kernel filtering, rotation, stochastic blurring, and feature-space alteration [46]. Table 2 shows the methods used for data augmentation during the training process. Augmentations can be geometric, such as resizing, rotating, or flipping an image, or photometric, such as adjusting brightness, contrast, saturation, or color. The settings for each augmentation are sampled randomly to diversify the training dataset. Images are resized to a fixed dimension of 224 × 224 pixels, rotations are randomly applied up to a maximum angle of 30°, and brightness, contrast, and saturation are adjusted within a 0–20% range. Moreover, horizontal and vertical flips are applied to the training subset to enhance variability and improve model generalization.
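To make the random configuration concrete, the sketch below draws one set of augmentation parameters per image using the ranges stated above. It is an illustration only: the helper name `sample_augmentation`, the symmetric sampling of the rotation angle and the ±20% factors, and the 0.5 flip probability are assumptions not given in the text. In a PyTorch pipeline these settings would typically map onto torchvision transforms such as `RandomRotation`, `ColorJitter`, and `RandomHorizontalFlip`.

```python
import random

def sample_augmentation(rng):
    """Draw one random augmentation configuration (hypothetical helper)."""
    return {
        "size": (224, 224),                      # fixed resize target
        "angle_deg": rng.uniform(-30.0, 30.0),   # rotation of at most 30 degrees (assumed symmetric)
        "brightness": rng.uniform(0.8, 1.2),     # multiplicative factor within 20% (assumed symmetric)
        "contrast": rng.uniform(0.8, 1.2),
        "saturation": rng.uniform(0.8, 1.2),
        "hflip": rng.random() < 0.5,             # flip probability 0.5 is an assumption
        "vflip": rng.random() < 0.5,
    }

rng = random.Random(42)
for _ in range(3):
    print(sample_augmentation(rng))
```

Because a new configuration is drawn for every image at every epoch, the model rarely sees exactly the same input twice.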

3.4.2. Hyperparameters Setup

Table 3 presents the hyperparameters used in the model training phase, following the approach reported in [46]. The models were trained for 10 epochs with batch sizes of 32 and 64 to analyze their impact on convergence and generalization ability. To understand the effect of the training dataset size on the results, data ratios of 25%, 50%, 75%, and 100% (the full dataset) were used. The initial learning rate was set to $1 \times 10^{-5}$, $1 \times 10^{-4}$, and $1 \times 10^{-3}$, allowing for an analysis of learning stability and speed. The Adam optimizer was used due to its adaptive learning capabilities, and a ReduceLROnPlateau learning rate scheduler dynamically reduced the learning rate when a performance plateau was detected, enhancing training efficiency and preventing stagnation.
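The search space implied by these settings can be enumerated as a simple grid. The sketch below assumes a full Cartesian product over models, batch sizes, data ratios, and learning rates; the text does not state whether every combination was actually run, and the dictionary keys are names chosen for the example.

```python
from itertools import product

models = [
    "deit3_base_patch16_224",
    "xception41",
    "inception_v4",
    "swin_tiny_patch4_window7_224",
]
batch_sizes = [32, 64]
data_ratios = [0.25, 0.50, 0.75, 1.00]
learning_rates = [1e-5, 1e-4, 1e-3]

# Full Cartesian product (assumption: every combination is a candidate run).
grid = [
    {"model": m, "batch_size": b, "data_ratio": r, "lr": lr, "epochs": 10}
    for m, b, r, lr in product(models, batch_sizes, data_ratios, learning_rates)
]
print(len(grid))  # 4 * 2 * 4 * 3 = 96

# Number of training images available at each data ratio (5712 in total).
subset_sizes = {r: round(5712 * r) for r in data_ratios}
print(subset_sizes)
```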

Figure 5 illustrates examples from the training set after applying the data augmentation shown in Table 2. The dataset includes four classes: glioma, meningioma, pituitary tumor, and no tumor. Resizing, rotation, adjustment of brightness, contrast, and saturation, and horizontal and vertical flipping are all augmentation techniques that preserve anatomical structures while adding variety to the training data. Because the augmented examples differ from one another, the model is exposed to additional patterns during training. This is particularly useful for medical imaging tasks, where datasets are often limited and some classes closely resemble each other.

3.5. Pre-Trained CNN Models

3.5.1. Data Efficient Image Transformer

DeiT3_base_patch16_224 belongs to the Data-efficient Image Transformer (DeiT) family, which builds on the Transformer architecture that Google introduced in 2017, originally for Natural Language Processing (NLP). Transformers use self-attention to weight the most relevant elements of an input sequence during prediction, making them efficient for a wide range of tasks. Vision transformers (ViTs) have attained considerable prominence due to their proficiency in processing visual input. Transformer-based models can be enhanced by transferring knowledge from a larger, pre-trained teacher model to a more compact student model, enabling accelerated training without sacrificing Accuracy [46,53].

3.5.2. Xception41

The Xception41 CNN, derived from the Inception model, employs depthwise separable convolutions as its fundamental innovation. It replaces Inception modules with depthwise separable convolutions, each consisting of a depthwise convolution applied independently to each channel, followed by a pointwise 1 × 1 convolution that integrates information across channels [33,54].
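The saving from this factorization can be seen by counting weights. The sketch below compares a standard convolution with its depthwise separable counterpart for one layer; the layer shape (3 × 3 kernel, 128 input channels, 256 output channels) is an arbitrary example, not a layer taken from Xception41, and bias terms are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out

def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution mixing the channels
    return depthwise + pointwise

standard = standard_conv_params(3, 128, 256)
separable = separable_conv_params(3, 128, 256)
print(standard, separable, round(standard / separable, 1))
# prints: 294912 33920 8.7
```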

3.5.3. Inception_v4

Inception_v4, introduced by Google in 2016, builds upon the core ideas of the Inception family, whose first version dates from 2014. Inception is based on the principle that convolutions of several sizes can be applied in parallel branches, allowing for the extraction of information at different levels and dimensions. With the addition of factorized convolutions, residual connections, and label smoothing, Inception_v4 offers notable improvements over previous versions [46,55].

3.5.4. Swin Transformer

The Swin_tiny_patch4_window7_224 architecture introduces the shifted-window (“Swin”) attention mechanism. Rather than attending over the whole token sequence, it computes self-attention within local windows whose positions shift between successive layers. This enables the model to capture long-range relationships across layers while reducing the computational cost of attention [46,56].
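A rough count of token pairs shows where the saving comes from. The sketch below reads the patch size (4), window size (7), and input resolution (224) from the model name and compares global attention with window attention for the first stage only. It counts query-key pairs and ignores attention heads, channel width, and later stages, so it is an order-of-magnitude illustration rather than a FLOP estimate.

```python
def attention_pairs(image_size=224, patch_size=4, window_size=7):
    """Count query-key pairs for global vs. windowed self-attention (first stage)."""
    tokens_per_side = image_size // patch_size          # 56 patches per side
    n_tokens = tokens_per_side ** 2                     # 3136 tokens
    windows_per_side = tokens_per_side // window_size   # 8 windows per side
    n_windows = windows_per_side ** 2                   # 64 windows
    tokens_per_window = window_size ** 2                # 49 tokens per window
    global_pairs = n_tokens ** 2
    windowed_pairs = n_windows * tokens_per_window ** 2
    return global_pairs, windowed_pairs

global_pairs, windowed_pairs = attention_pairs()
print(global_pairs, windowed_pairs, global_pairs // windowed_pairs)
# prints: 9834496 153664 64
```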

3.6. Performance Metrics

The performance metrics described below follow those used in previous studies, such as [4,6,20,45,46,57].

Accuracy is among the most basic and intuitive performance metrics. It is the proportion of correctly predicted observations to all observations, computed as in Equation (1). However, Accuracy is a reliable indicator only when datasets are balanced and the numbers of false positives and false negatives are similar.

Accuracy = \frac{T P + T N}{T P + F P + T N + F N},

(1)

where the following are used:

TP: True Positives (correctly classified positive cases).
TN: True Negatives (correctly classified negative cases).
FP: False Positives (incorrectly classified as positive).
FN: False Negatives (incorrectly classified as negative).

Precision is defined in Equation (2) as the ratio of correctly predicted positive observations to all predicted positive observations. Also known as Positive Predictive Value, it assesses how reliable the model’s positive predictions are.

Precision = \frac{T P}{T P + F P} .

(2)

Recall, sometimes referred to as sensitivity, hit rate, or true positive rate, quantifies the ratio of correctly predicted positive observations to the total actual positive observations. Equation (3) shows how Recall is computed.

Recall = \frac{T P}{T P + F N} .

(3)

The F1-Score is a harmonic mean that integrates Precision and Recall into a singular, complete score. Although Precision and Recall offer significant insights individually, neither can comprehensively reflect a model’s total efficacy. A model may achieve elevated Precision while exhibiting diminished Recall, or vice versa. The F1-Score mitigates this constraint by equilibrating the trade-off between Precision and Recall into a singular score.

As demonstrated in Equation (4), the F1-Score may be calculated by synthesizing the Precision and Recall metrics following a binary or multi-class classification operation.

F1\text{-}Score = \frac{2 \times Recall \times Precision}{Recall + Precision}

(4)

The Matthews Correlation Coefficient (MCC) [58] is a strong indicator of classification quality. Because it takes all elements of the confusion matrix into account, unlike simpler measures such as Accuracy, MCC is especially useful for unbalanced datasets. Equation (5) defines the MCC:

MCC = \frac{T P \times T N - F P \times F N}{\sqrt{(T P + F P) (T P + F N) (T N + F P) (T N + F N)}}

(5)
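Equations (1) to (5) translate directly into code. The sketch below evaluates them for a single set of binary confusion counts; the counts are made-up illustration values, not results from this study, and the function name is chosen for the example. For the four-class task, the same formulas would be applied per class in a one-vs-rest fashion.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Evaluate Equations (1)-(5) from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that the sketch does not guard against empty denominators, which can occur when a class is never predicted.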

4. Results and Discussion

Training began with stratified cross-validation. Implemented with the appropriate Python libraries, it ensures that each of the five training folds preserves the class proportions of the full training subset.

Figure 6 shows the stratified distribution of samples over five folds using Stratified K-Fold cross-validation on the Msoud dataset. Each fold shows the proportion of glioma, meningioma, pituitary tumor, and no tumor samples, and the classes are balanced across folds. This stratified approach ensures that each fold maintains the same class distribution as the entire training subset, reducing bias during model training and evaluation. In the analysis of medical imaging data, it is crucial that the distribution stays consistent across folds, because class imbalance can significantly affect the model’s learning process and overall performance. By keeping the class distribution equivalent in every fold, Stratified K-Fold cross-validation improves the reliability of the results, reduces variability across folds, and makes the model’s outcomes more reproducible.

Figure 7 shows a parallel coordinates plot illustrating the outcomes of the hyperparameter search during the training of four DL models on the brain MRI dataset. Each line represents a distinct experimental run, defined by the configuration of the main hyperparameters and the resulting performance metrics. The axes represent the dataset ratio employed, model architecture, batch size, learning rate, training loss, and test Accuracy. The architectures examined were Xception41, Swin_tiny_patch4_window7_224, Inception_v4, and Deit3_base_patch16_224, trained on various fractions of the data (25%, 50%, 75%, and 100%) to evaluate their performance at different scales.

The color gradient indicates the test Accuracy, with lighter yellow representing results near 100%, a clear indication of very effective training configurations. This performance was further supported by the data augmentation techniques shown in Table 2, the stratified 5-fold cross-validation that ensures class-balanced partitions (Figure 6), the systematic hyperparameter search summarized in Table 3, and the selection of robust deep learning architectures for multi-class tumor classification described in Section 3.5. The best results most often occur with 75% of the dataset, a batch size of 64, and a learning rate of $1 \times 10^{-4}$, together with the lowest training loss. These tightly clustered traces indicate that certain hyperparameter combinations are notably stable, which makes them well suited to fine-tuning in deployment scenarios. The figure also shows that even small changes in the learning rate or model selection can significantly affect Accuracy, which demonstrates how sensitive generalization is to parameter selection. The parallel coordinates plot therefore serves as a diagnostic tool for analyzing how the training settings interact across the design space, supporting informed decisions on model optimization.

Figure 8a,b summarize the best results in the model training and validation phases, obtained with 75% of the dataset, a batch size of 64, and a learning rate of $1 \times 10^{-4}$ for each model. The curves in Figure 8 approach 99% Accuracy in training and follow a stable trajectory throughout validation, reaching almost 98%.

Figure 9 shows the loss curves of the four DL models over 10 training epochs. The training loss in Figure 9a drops steadily and rapidly for all models, indicating that they converge and learn in a stable manner. The validation loss in Figure 9b shows how well each architecture generalizes. It decreases for all models, although Swin_tiny_patch4_window7_224 and Inception_v4 reach lower and more stable validation losses by the final epochs, which points to better generalization and less overfitting.

To complement the information presented in the training and validation stages, Table 4 shows the quantitative results of the final testing set. Performance metrics, including Accuracy, Precision, Recall, F1-score, and Matthews correlation coefficient (MCC), were used to verify how well each model classified the test dataset.

Table 4 illustrates the performance metrics of four DL models on the test set. Swin_tiny_patch4_window7_224 obtained the best test Accuracy of 99.24%. It also consistently yielded better results, with Precision of 0.9924, Recall of 0.9924, F1-Score of 0.9924, and MCC of 0.9898. These results demonstrate that this model can make accurate predictions and has very few classification errors, indicating that it can generalize well across all types of tumors.

The second-best model was Deit3_base_patch16_224, with a test Accuracy of 99.08%. Its Precision, Recall, and F1-Score were all around 0.9908, and its MCC was 0.9877. These results highlight its strong classification performance. The transformer-based models proved highly effective at identifying complex patterns in medical images, outperforming the CNN-based models Xception41 and Inception_v4.

To gain a deeper understanding of the data presented in Table 4, it is essential to examine how the performance of the models varies with architectural complexity. Transformer-based models such as Swin_tiny_patch4_window7_224 and Deit3_base_patch16_224 exhibit superior test Accuracy, despite significant variations in their parameter counts.

In practical medical environments, it is crucial to select a model that optimally balances predictive performance and computational efficiency, rather than focusing solely on its correctness in testing.

The subsequent Figure 10 examines the relationship between model size (in millions of parameters) and test Accuracy. Each data point is represented as a bubble, with the size of the bubble determined by the number of parameters in the model. The model designation is inscribed on the bubble. The horizontal axis represents the number of parameters, ranging from around 20 million to 100 million. The vertical axis indicates the test’s Accuracy, ranging from 0.986 to 0.994.

Swin_tiny_patch4_window7_224 has the highest test Accuracy (0.9924) among all evaluated models while having a low parameter count (28.3 M), which demonstrates its efficiency. Deit3_base_patch16_224 has the highest number of parameters (about 86 million), yet its test Accuracy is slightly lower (around 0.9908). This indicates that Accuracy does not necessarily improve with larger models. Inception_v4, with nearly 42.7 million parameters, achieves a test Accuracy of around 0.9893. Xception41, the most compact model (about 22.9 million parameters), has the lowest Accuracy (around 0.9870) among the architectures analyzed. The relationship between model complexity and performance is therefore not monotonic and must be assessed case by case, particularly when selecting models for practical applications of DL in the medical field.
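To put these accuracy differences in perspective, they can be converted into approximate numbers of misclassified test images. The sketch below assumes that each reported Accuracy is the plain fraction of the 1311 test images classified correctly. This is an assumption, since the reported figures may instead be averages over folds, so the counts are indicative only.

```python
TEST_IMAGES = 1311

# (parameters in millions, reported test Accuracy), as stated in the text.
models = {
    "swin_tiny_patch4_window7_224": (28.3, 0.9924),
    "deit3_base_patch16_224": (86.0, 0.9908),
    "inception_v4": (42.7, 0.9893),
    "xception41": (22.9, 0.9870),
}

# Approximate number of misclassified test images implied by each Accuracy.
errors = {name: round((1 - acc) * TEST_IMAGES) for name, (_, acc) in models.items()}

for name, (params_m, acc) in sorted(models.items(), key=lambda kv: kv[1][0]):
    print(f"{name:32s} {params_m:5.1f} M  acc={acc:.4f}  ~{errors[name]} errors")
```

Under this assumption, the gap between the best and worst model corresponds to only a handful of test images, which is worth keeping in mind when ranking the architectures.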

Swin_tiny_patch4_window7_224 is identified as the best model because it strikes the best balance between Accuracy and parameter efficiency, as shown in Figure 10. According to Table 4, this architecture also had the highest test Accuracy. Its performance is therefore examined more closely using the confusion matrices from the test phase. The training setup used a batch size of 64, a learning rate of $1 \times 10^{-4}$, and a 5-fold stratified cross-validation scheme. It is important to note that only 75% of the Msoud training data was used in this experiment.

Figure 11 shows the confusion matrices obtained in the test stage for all five folds. They were produced by the Swin_tiny_patch4_window7_224 model, a vision transformer designed to capture spatial-contextual dependencies in image data. The model was applied to a multi-class classification task that involved distinguishing between four classes: glioma, meningioma, no tumor, and pituitary tumor.

Each matrix compares the predicted labels with the actual labels and thus reflects the model’s performance in that particular fold. The counts are consistently concentrated on the main diagonal in all folds, which indicates that the model classifies the images very effectively and makes few errors. The categories “no tumor” and “pituitary tumor” show high discrimination capacity, with virtually no cases misclassified in any of the folds. This demonstrates the model’s ability to distinguish between images of healthy brains and those showing pituitary tumors. The confusion matrices in Figure 11 also indicate that the majority of misclassifications occurred between the glioma and meningioma categories. This is likely because both tumor types may display similar morphological features in magnetic resonance imaging sequences, particularly in T1-weighted images. Despite these minor errors, the model demonstrated strong overall discriminatory performance. However, these misclassifications, although statistically minimal, underscore the need to incorporate expert radiology analysis, additional clinical data, and multi-sequence imaging in diagnostic settings to reduce potential diagnostic uncertainty.

Fold 5 shows a slight increase in the number of misclassifications between glioma_tumor and meningioma_tumor, but the overall predictions remain largely correct. These results demonstrate that the Swin Transformer-based model can generalize well and remain consistent across folds.

The confusion matrices offered a definitive representation of the classification Accuracy for each class over all test folds. Nonetheless, Receiver Operating Characteristic (ROC) curves provide a more accurate representation of the model’s ability to distinguish across classes. Figure 12 illustrates the ROC curves and AUC values for each category. This provides more insights into the sensitivity-specificity equilibrium and the overall efficacy of the Swin_tiny_patch4_window7_224 model.

Figure 12 shows five subplots, corresponding to each fold (fold 1 to fold 5). Each subplot displays a unique ROC curve for each tumor classification, along with the Area Under the Curve (AUC) metric. It is remarkable that all classes achieve an AUC of 1.00, indicating that positive and negative predictions can be correctly distinguished without any compromise between sensitivity and specificity. This degree of performance is atypical in medical imaging applications, rendering the Swin_tiny_patch4_window7_224 model far more dependable for identifying high-level spatial patterns in MRI data.
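The AUC for one class against the rest can be computed as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below implements this pairwise definition in pure Python; the scores are made-up illustration values, not model outputs from this study, and the function name is chosen for the example.

```python
def auc_one_vs_rest(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical class scores: every positive outranks every negative.
perfect = auc_one_vs_rest([0.98, 0.91, 0.12, 0.05], [True, True, False, False])

# Hypothetical class scores with one positive ranked below one negative.
imperfect = auc_one_vs_rest([0.9, 0.4, 0.5, 0.1], [True, True, False, False])

print(perfect, imperfect)  # prints: 1.0 0.75
```

An AUC of 1.00 therefore means that, for that class, no negative sample was scored above any positive sample.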

The true positive rate vs. false positive rate curves for each class cluster tightly in the upper-left corner of the ROC space, which is characteristic of models with near-perfect discrimination. The consistency across folds suggests that the model is not overfitting to a particular data partition and may generalize to varied patient groups. The outcomes align with the confusion matrices in Figure 11, supporting the model’s ability to accurately predict all tumor types.

The no-tumor and pituitary-tumor classes exhibit clean curves, indicating few false positives. The model also distinguishes clinically challenging categories, such as glioma and meningioma, which often present similar imaging features. The Swin Transformer’s hierarchical self-attention may identify complex, discriminative features that separate the tumor types. The ROC analysis indicates that the Swin_tiny_patch4_window7_224 model performs consistently across all test folds, making it a promising option for automated brain tumor identification in clinical settings.

Examining the interpretability of DL models is crucial, particularly in the medical domain, where model transparency is paramount. This extends beyond examining only quantitative performance metrics, such as confusion matrices and ROC curves. Gradient-weighted Class Activation Mapping (Grad-CAM) was used to illustrate the spatial regions that significantly influenced the classification decisions of the Swin_tiny_patch4_window7_224 model. The activation maps generated from this offer a good understanding of the algorithm’s decision-making process.

Figure 13 presents Grad-CAM visualizations for test samples representative of all five folds across the four classes: glioma, meningioma, no tumor, and pituitary tumor. The class activation maps are superimposed on the MRI slices to illustrate the regions utilized by the Swin Transformer model to differentiate between classes.

Each column shows a different fold, and each row shows a different class. The activation maps are spatially consistent across folds, which suggests that the model’s decisions are based on stable and anatomically relevant areas. In the tumor classes (glioma, meningioma, and pituitary tumor), the highlighted areas correspond to pathological structures. This indicates that the model can identify clinically relevant features. In the glioma and meningioma classes, activations primarily occur in the areas surrounding the lesions. For pituitary tumors, the sellar region is the most strongly activated.

In the “no tumor” class, the Grad-CAMs exhibit dispersed yet low-intensity activations, indicating the absence of concentrated anomalies. This behavior aligns with the model’s accurate classification based on overarching anatomical traits. This interpretability research enhances the clinical validity of the Swin Transformer model by graphically demonstrating its decision-making process. The model achieves superior predictive Accuracy while maintaining transparency through the use of Grad-CAM. This is crucial for practical medical applications and for doctors to have confidence in the model.

To enhance the validation of the Swin_tiny_patch4_window7_224 performance and evaluate the efficacy of other models in this study beyond conventional metrics, a comparative analysis employing model ranking was performed across all training models. This section of the work aims to identify the architecture that exhibits the most stability and consistent Accuracy across both the training and testing stages.

To ensure methodological consistency, each model underwent training and testing with identical preprocessing and augmentation parameters. A normalized rank-based scoring system (range: 0.0–1.0) implemented in [59] was applied to assess the performance of the models. This method integrates Accuracy, Recall, and F1-Score into a single metric. Figure 14 indicates that Swin_tiny_patch4_window7_224 and Deit3_base_patch16_224 ranked first and second, with scores of 1.0 and 0.8, respectively, which shows that they are suitable choices for this task. Inception_v4 performed adequately, with a score of 0.6; however, Xception41 underperformed, falling outside the “Good” threshold. The results indicate that transformer-based models outperform convolutional models in this context.

The subsequent assessment expands upon the comparative ranking analysis and examines the statistical robustness of the testing phase, emphasizing both class-level consistency and overall predictive Accuracy.

Figure 15 compares test Accuracy with the variability of class-wise performance, quantified by the standard deviation of class-wise Recall, for the four DL models in this study. The figure is based only on the testing phase and illustrates both the overall efficacy of each architecture and its stability across classes. The Swin_tiny_patch4_window7_224 model has the highest test Accuracy (around 0.992) and little variability across classes, which demonstrates an effective balance between predictive power and reliability. Deit3_base_patch16_224 had comparable Accuracy (0.991) but the largest standard deviation, indicating greater sensitivity to class-specific characteristics. Inception_v4 performed adequately, with an Accuracy of around 0.989 and a moderate class-wise standard deviation of roughly 0.003. Xception41 had the lowest Accuracy alongside the smallest standard deviation, indicating steady, although somewhat worse, overall performance. These results underscore the need to examine both aggregate metrics and dispersion indicators during testing, particularly where robustness across class distributions is crucial.
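The dispersion measure used here can be derived from a confusion matrix by taking the Recall of each class and then the standard deviation of those values. The sketch below shows the computation. The matrix is entirely hypothetical: its row totals match the test-subset class sizes, but the individual cell values are invented for illustration and are not the results reported in this study. The use of the population standard deviation is also an assumption.

```python
import statistics

classes = ["glioma", "meningioma", "no_tumor", "pituitary"]

# HYPOTHETICAL confusion matrix (rows = true class, columns = predicted class).
# Row totals match the test subset (300, 306, 405, 300); cell values are invented.
confusion = [
    [295, 5, 0, 0],
    [3, 302, 0, 1],
    [0, 0, 405, 0],
    [0, 1, 0, 299],
]

# Class-wise Recall: diagonal entry divided by the row total.
recalls = [row[i] / sum(row) for i, row in enumerate(confusion)]

accuracy = sum(confusion[i][i] for i in range(len(classes))) / sum(map(sum, confusion))
recall_std = statistics.pstdev(recalls)  # population standard deviation (assumption)

for name, r in zip(classes, recalls):
    print(f"{name:12s} recall={r:.4f}")
print(f"accuracy={accuracy:.4f}  std of class-wise recall={recall_std:.4f}")
```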

Performance metrics provide valuable information about the effectiveness of a model; however, computational limitations also influence the feasibility of implementation. The next step in the research examines the resource consumption of the evaluated architectures during the training stage, under the previously specified parameters, where the best results were obtained.

Figure 16 illustrates the temporal variations in computational resource use, namely GPU and process-specific CPU usage, during the training phase of the four architectures used in this study. The GPU utilization panel indicates that Deit3_base_patch16_224 frequently approached 100% GPU usage. This behavior aligns with its transformer-based architecture, which requires extensive self-attention operations and dense matrix multiplications across several layers, and it indicates a sustained, heavy GPU load over an extended period of time. Swin_tiny_patch4_window7_224 exhibited significant GPU activity, reaching peaks of 90%, with substantial temporal variations typical of its hierarchical attention mechanism, which employs shifting windows that alter spatial locality at each stage. In contrast, Inception_v4 and Xception41 showed lower GPU usage, peaking at about 60% and 70%, respectively, due to their internal architectures, which use spatial convolutions characterized by predictable computational and memory patterns. The CPU results reveal additional processing loads that vary depending on the model. The CPU consumption of Deit3_base_patch16_224, for example, increased gradually, reaching a maximum of around 40%, which can be attributed to its large architecture and data pipelining configuration. Inception_v4 and Xception41 used approximately 30% to 35%, while Swin_tiny_patch4_window7_224 used the least, between 20% and 25%. These results suggest that the transformer-based architectures in this study make more accurate predictions but also require more computational resources, particularly on the GPU, than conventional convolutional networks. This analysis is important for determining whether a model can be effectively applied in settings with limited resources.

Figure 17 illustrates the GPU memory allocation during the training phase for each of the four models evaluated. Xception41 maintained a consistently high usage rate of around 40%, mainly due to its convolutional architecture. Deit3_base_patch16_224 showed usage fluctuating around 30% over time, since transformer-based models require dynamic memory allocation, especially during multi-head self-attention and token projection procedures. In contrast, both Swin_tiny_patch4_window7_224 and Inception_v4 demonstrated lower and more consistent usage (approximately 20%), indicating more efficient memory use. This can be attributed to their reduced feature hierarchies or the factorization of their convolutional modules.

Figure 18 illustrates the GPU power consumption during the inference phase for the four models that were tested. Deit3_base_patch16_224 used the most power, often exceeding 250 W, because its transformer-based architecture involves numerous matrix operations and multi-head self-attention mechanisms. Xception41 and Inception_v4 consumed between 200 and 220 W, consistent with their less complex convolutional architectures. Swin_tiny_patch4_window7_224 consumed around 180 W and was the least power-intensive option, which is consistent with its lightweight hierarchical attention design. These results suggest a relationship between architectural complexity and the energy required during inference. Such insights are particularly important when energy budgets are constrained.

After evaluating computational and energy resource consumption, it is essential to emphasize performance using established benchmarks and compare the results with those of other current studies. Table 5 provides a comprehensive comparison of current studies that have performed multi-class brain tumor classification tasks. The comparison focuses on the proposed architectures, datasets, number of classes, and maximum documented accuracies, whether in training or testing. The Swin_tiny_patch4_window7_224 model presented in this work stood out for achieving a competitive Accuracy of 99.24% on the Msoud dataset during testing. As previous evaluations indicate, this algorithm ranks among the most effective, and it is also computationally efficient.

4.1. Real-Time Inference Benchmarking of the Best DL Model on Embedded Systems

This subsection presents a real-time inference benchmark of the best-performing deep learning model, deployed on high-performance embedded AI platforms. The objective is to evaluate the feasibility of achieving low-latency, high-throughput predictions in resource-constrained environments, which are typical of real-world clinical applications requiring time-critical responses. Two GPU-based devices were selected for this purpose: the Jetson AGX Xavier (32 GB), manufactured by NVIDIA Corporation in Huizhou, China, and the Jetson Orin Nano Developer Kit (16 GB), manufactured by Yahboom Technology Co., Ltd. in Shenzhen, China. Performance was assessed in terms of mean inference time, throughput in frames per second (FPS), and execution stability using the ONNX Runtime and TensorRT inference frameworks.

The Jetson AGX Xavier is equipped with a 512-core Volta GPU featuring 64 Tensor Cores, an 8-core ARMv8.2 Carmel CPU, and 32 GB of LPDDR4X memory, achieving a memory bandwidth of approximately 136.5 GB/s. It supports up to 30 TOPS (INT8) and includes dual NVDLA accelerators along with a dedicated vision processor, making it suitable for intensive AI workloads. In comparison, the Jetson Orin Nano is based on the Ampere architecture (2022) and integrates a 1024-core GPU with 32 Tensor Cores, a 6-core ARM Cortex-A78AE CPU, and up to 16 GB of LPDDR5 RAM. It delivers up to 40 TOPS (INT8) in a more compact and energy-efficient design, supporting selectable power modes of 7 W or 15 W. Both devices run Ubuntu 20.04 with NVIDIA’s JetPack SDK and support ONNX and TensorRT for model inference. These embedded platforms were chosen to test the best-performing DL model under deployment conditions with limited resources, while still meeting the demands of real-time medical image classification.

The performance of the Swin_Tiny_Patch4_Window7_224 model on the Jetson AGX Xavier using both ONNX Runtime in CPU mode and TensorRT inference engines is presented in Table 6. TensorRT significantly reduced the mean inference time from 177.59 ms to 18.23 ms, while also lowering the standard deviation from 55.84 ms to 3.18 ms, indicating improved stability and execution consistency. Furthermore, TensorRT achieved a substantial increase in throughput, with a mean frame rate of 56.30 FPS compared with only 6.13 FPS under ONNX Runtime. These results demonstrate the Xavier module’s capability to deliver high-speed, low-latency inference suitable for real-time medical applications when optimized with TensorRT.
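The three quantities reported here (mean inference time, its standard deviation, and mean FPS) can be collected with a simple timing loop. The following is a minimal sketch using only the Python standard library; the function names `benchmark` and `dummy_infer`, the number of runs, and the warm-up count are illustrative assumptions and do not reproduce the authors' measurement script.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(infer: Callable[[], object], runs: int = 100, warmup: int = 10) -> Dict[str, float]:
    """Time repeated calls to `infer` and summarize latency and throughput."""
    for _ in range(warmup):
        infer()  # warm-up calls are executed but not timed

    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    # Per-run throughput; runs with a zero measured duration are skipped.
    fps = [1000.0 / t for t in latencies_ms if t > 0.0]
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "std_ms": statistics.stdev(latencies_ms),  # requires runs >= 2
        "mean_fps": statistics.mean(fps) if fps else float("inf"),
    }


def dummy_infer() -> int:
    # Hypothetical stand-in for one model forward pass.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(benchmark(dummy_infer, runs=50, warmup=5))
```

In this sketch the mean FPS is the average of per-run FPS values, which in general differs from 1000 divided by the mean latency; a reported mean FPS therefore need not be the exact reciprocal of the reported mean inference time.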

Figure 19 illustrates the comparative performance gains achieved by employing TensorRT over ONNX Runtime on the Jetson AGX Xavier. As shown in Figure 19a, the model execution achieved a substantial reduction in inference time. Figure 19b reflects the 9.74× speedup obtained, while Figure 19c shows the 818.4% increase in computational efficiency. The speedup was computed as the ratio between ONNX Runtime and TensorRT inference times, and the efficiency gain was derived from the relative increase in mean FPS [62]. These improvements highlight the impact of hardware-aware optimizations provided by TensorRT, enabling substantial reductions in processing latency and resource consumption—critical factors in real-time clinical deployments.
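The two derived metrics can be checked directly from the Table 6 values quoted above. The sketch below follows the definitions given in the text (speedup as the ratio of mean inference times, efficiency gain as the relative increase in mean FPS); the function names are illustrative.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of the baseline mean inference time to the optimized one."""
    return baseline_ms / optimized_ms


def efficiency_gain_pct(baseline_fps: float, optimized_fps: float) -> float:
    """Relative increase in mean FPS, expressed as a percentage."""
    return (optimized_fps - baseline_fps) / baseline_fps * 100.0


# Jetson AGX Xavier values quoted in the text (ONNX Runtime vs. TensorRT).
xavier_speedup = speedup(177.59, 18.23)
xavier_gain = efficiency_gain_pct(6.13, 56.30)

print(f"{xavier_speedup:.2f}x, {xavier_gain:.1f}%")  # prints "9.74x, 818.4%"
```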

The performance metrics summarized in Table 7 demonstrate the substantial inference acceleration achieved on the Jetson Orin Nano using TensorRT with the compressed Swin_Tiny_Patch4_Window7_224 model. The optimization process reduced the mean inference time from 792.47 ms to 23.04 ms and significantly lowered the execution variability, as indicated by the standard deviation drop from 104.19 ms to 2.92 ms. Additionally, the throughput increased dramatically from a mean of 1.29 FPS to 44.04 FPS, validating the hardware’s potential for real-time medical image analysis when properly optimized with TensorRT.

Figure 20 visualizes the inference acceleration achieved by TensorRT on the Jetson Orin Nano. As depicted in Figure 20a, inference time was significantly reduced compared with ONNX Runtime. Figure 20b,c present a 34.39× speedup and a 3339.4% gain in computational efficiency, respectively. The speedup ratio was computed by dividing the mean inference time of ONNX Runtime by that of TensorRT, while the efficiency gain corresponds to the relative percentage increase in frames per second (FPS) [62]. These results confirm that the Orin Nano, despite its compact and energy-efficient design, can deliver competitive inference performance when leveraging TensorRT optimizations, making it suitable for edge-level deployment in clinical applications.

Upon optimization with TensorRT, both the Jetson AGX Xavier and the Jetson Orin Nano demonstrated the capability for real-time inference. The Orin Nano obtained the larger relative improvement, with a 34.39× speedup and a 3339.4% efficiency gain (Figure 20b,c), compared with the Xavier’s 9.74× speedup and 818.4% gain (Figure 19b,c). This difference partly reflects the Orin Nano’s much slower ONNX Runtime baseline (792.47 ms versus 177.59 ms); in absolute terms, the Xavier remained faster after optimization (18.23 ms and 56.30 FPS versus 23.04 ms and 44.04 FPS). These findings indicate that the Orin Nano is highly effective for lightweight, cost-efficient deployment scenarios, while the Xavier remains a reliable option for medical AI applications that require greater processing power. Depending on specific operational constraints, either device may be suitable for integrating DL into clinical environments.

4.2. Real-World Usage Scenario

Recent studies indicate that experienced radiologists err 3% to 5% of the time when categorizing brain tumors as gliomas, meningiomas, and pituitary tumors on brain MRI. The error rate can increase to 13% when a second opinion is obtained [19]. The proposed Swin_tiny_patch4_window7_224 model achieved an Accuracy of 99.24% during testing, indicating a strong capability to differentiate between tumor classes under controlled conditions. It is essential to emphasize that the objective of this work is not to replace expert radiologists. Its purpose is to provide a diagnostic tool that aids clinical decision-making, reduces workload, and improves the consistency of interpretations. AI-based systems can be highly beneficial in hospitals with substantial workloads or in regions where expert assistance is scarce. Figure 21 illustrates a potential application of the proposed approach in a clinical environment.

The recommended clinical deployment scenario comprises five steps. First, MRI scans are acquired in the radiology suite using routine imaging protocols, and the resulting volumetric brain MRI studies are stored in DICOM format in the hospital’s PACS. Second, a DICOM listener or secure API interface provides access to the PACS and automatically identifies and retrieves the relevant studies. Third, the Preprocessing and AI Inference Module, running on the Jetson AGX Xavier platform, converts the DICOM data into a model-compatible format through resizing, intensity normalization, and tensorization; it then uses the optimized Swin_tiny_patch4_window7_224 model to predict the tumor type, together with confidence scores and Grad-CAM-based saliency maps that explain the predictions. Fourth, the Results Visualization Interface delivers these outputs to the radiologist’s web client or local network workstation, where the AI-generated findings can be reviewed alongside the anatomical images. Fifth, the radiologist makes the final decision and writes the report, combining the AI output with clinical judgment. Because data are processed locally, this setup reduces latency and keeps patient data on site, allowing the AI model to serve as a real-time decision-support tool. This deployment scenario is only one of many possible configurations and can be adapted to the technical infrastructure available.
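The preprocessing stage of this scenario (resizing, intensity normalization, and tensorization) can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' pipeline: nearest-neighbor resizing, min-max normalization, replication of a grayscale slice into three channels, and all function names are assumptions made for the example, and DICOM decoding is omitted. The 224 × 224 target size follows the input resolution in the model name.

```python
from typing import List

Image = List[List[float]]  # one grayscale slice as rows of pixel intensities


def resize_nearest(img: Image, out_h: int, out_w: int) -> Image:
    """Resize by nearest-neighbor sampling (a simple stand-in for real interpolation)."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize_min_max(img: Image) -> Image:
    """Scale intensities to [0, 1]; a constant image maps to all zeros."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in img]
    return [[(v - lo) / span for v in row] for row in img]


def to_chw(img: Image, channels: int = 3) -> List[Image]:
    """Replicate the slice into a channels-first (C, H, W) nested list."""
    return [[row[:] for row in img] for _ in range(channels)]


def preprocess(img: Image, size: int = 224) -> List[Image]:
    """Resize, normalize, and tensorize one slice."""
    return to_chw(normalize_min_max(resize_nearest(img, size, size)))


# Hypothetical 4x4 slice; the intensity values are illustrative only.
slice_2d = [
    [0, 100, 200, 300],
    [50, 150, 250, 350],
    [0, 0, 400, 400],
    [10, 20, 30, 40],
]
tensor = preprocess(slice_2d, size=224)
print(len(tensor), len(tensor[0]), len(tensor[0][0]))  # prints "3 224 224"
```

In practice, a deployed pipeline would typically apply the same normalization statistics used during training so that inference inputs match the distribution the model was fine-tuned on.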

4.3. Limitations of the Study

This study’s limitations lie primarily in the dataset. The dataset comprises four classes (three tumor types and a no-tumor class) and includes images acquired in different planes, including coronal, sagittal, and axial. However, a dataset covering a broader range of clinical cases, from patients of various ages, geographical areas, and hospitals, would be necessary. The architectures in this study were evaluated only on these four classes with five-fold cross-validation; more cases are needed to understand the results comprehensively and to achieve reliable performance in a real medical environment. The experiments were conducted on high-end GPU infrastructure (such as NVIDIA A100 units), which makes it difficult to predict behavior in environments with limited computational resources, such as public hospitals, where portable diagnostic devices or health monitoring platforms are commonly used. Models such as DeiT3_base_patch16_224 were accurate but consumed a significant amount of CPU and GPU resources, which could make them difficult to use in real-time applications. The Swin_tiny_patch4_window7_224 model may need to be pruned or quantized to run on lightweight devices.

4.4. Future Work

Further research will focus on validation using larger, multi-institutional datasets that include greater diversity in patient age, regional origin, and scan acquisition protocols. Incorporating such variability is crucial to assess how well the model generalizes across clinical settings and to facilitate its safe incorporation into diagnostic processes. Pathological heterogeneity will also be addressed by including tumor subtypes and different tumor grades. Another focus of future work will be to consult expert radiologists, who will analyze the images and determine whether the label predicted by the model is correct and corresponds to the indicated tumor type. This step is needed because the confusion matrices of the models show some misclassifications (false positives and false negatives) that must be verified by an expert. Finally, an important goal is to apply the best model from this study in a hospital workflow by integrating it into an embedded system that supports rapid diagnosis in a real clinical setting.

5. Conclusions

This study presented a robust and carefully engineered framework for the automated multi-class classification of brain tumors into four classes using DL and transformer-based architectures. The proposed strategy enhanced current standards in medical imaging classification by integrating TL strategies, stratified k-fold cross-validation, GPU profiling, class-level interpretability, and parameter optimization methodologies. The evaluation used the publicly available Msoud brain MRI dataset, which is well balanced across four diagnostic classes: glioma, meningioma, pituitary tumor, and no tumor. The dataset had a near-maximal Entropy Balance (EB = 0.998) and a very low imbalance ratio, which enabled a fair and thorough comparison of the models.
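The two balance indicators mentioned above can be illustrated with a short calculation. The sketch below assumes that Entropy Balance is the Shannon entropy of the class distribution normalized by its maximum, log(k), for k classes, and that the imbalance ratio is the largest class count divided by the smallest; both definitions are common conventions adopted here as assumptions, and the class counts are hypothetical rather than the Msoud dataset counts.

```python
import math
from typing import Sequence


def entropy_balance(counts: Sequence[int]) -> float:
    """Normalized Shannon entropy in [0, 1]; 1 means perfectly balanced classes."""
    total = sum(counts)
    k = len(counts)  # requires k >= 2 so that log(k) is non-zero
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(k)


def imbalance_ratio(counts: Sequence[int]) -> float:
    """Largest class size divided by smallest class size; 1 means balanced."""
    return max(counts) / min(counts)


# Hypothetical class counts for a four-class problem.
balanced = [500, 500, 500, 500]
skewed = [900, 50, 30, 20]

print(round(entropy_balance(balanced), 3), imbalance_ratio(balanced))  # prints "1.0 1.0"
print(round(entropy_balance(skewed), 3), imbalance_ratio(skewed))
```

Under these definitions, a value close to 1 for the Entropy Balance together with an imbalance ratio close to 1 indicates that no class dominates the dataset.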

The Swin_tiny_patch4_window7_224 model consistently outperformed the other three architectures (Xception41, Inception_v4, and DeiT3_base_patch16_224) across all primary evaluation metrics. The model achieved the highest test Accuracy (99.24%), with consistently high values for test Precision, test Recall, test F1-Score, and test Matthews Correlation Coefficient, even when using only 75% of the available training data for five-fold cross-validation. Importantly, it demonstrated a favorable balance between predictive performance and computational efficiency. It used little GPU memory (approximately 20%), required little CPU capacity (approximately 25%), and had the lowest GPU power consumption (approximately 180 W) among the evaluated models. These features make it a stronger candidate for settings such as hospitals and clinics, where computational resources are limited.
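As an illustration of how two of these metrics relate to a confusion matrix, the sketch below computes Accuracy and the multi-class Matthews Correlation Coefficient in its standard confusion-matrix form. The 4 × 4 matrix is hypothetical and does not reproduce the results of this study; rows are taken to be true classes and columns predicted classes, which is an assumption of the example.

```python
import math
from typing import List

Matrix = List[List[int]]  # rows: true class, columns: predicted class


def accuracy(cm: Matrix) -> float:
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def multiclass_mcc(cm: Matrix) -> float:
    """Multi-class Matthews Correlation Coefficient from a confusion matrix."""
    k = len(cm)
    s = sum(sum(row) for row in cm)                            # total samples
    c = sum(cm[i][i] for i in range(k))                        # correct predictions
    t = [sum(cm[i]) for i in range(k)]                         # true count per class
    p = [sum(cm[i][j] for i in range(k)) for j in range(k)]    # predicted count per class
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = math.sqrt(
        (s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t))
    )
    return numerator / denominator if denominator else 0.0


# Hypothetical confusion matrix for four classes (illustrative values only).
cm = [
    [48, 2, 0, 0],
    [1, 47, 1, 1],
    [0, 0, 50, 0],
    [0, 1, 0, 49],
]
print(f"accuracy={accuracy(cm):.4f}, mcc={multiclass_mcc(cm):.4f}")
```

Unlike Accuracy, the MCC uses the full matrix, including the per-class row and column totals, which is why it is often reported alongside Accuracy for multi-class problems.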

To validate the feasibility of deploying the proposed classification model in resource-constrained clinical environments, inference tests were conducted on two embedded platforms: Jetson AGX Xavier and Jetson Orin Nano. Using optimized versions of the Swin_tiny_patch4_window7_224 model exported via ONNX and accelerated with TensorRT, both devices achieved real-time performance. While the Xavier delivered robust inference speeds, the Orin Nano notably surpassed it in both speedup ratio and computational efficiency, despite its lower power profile. These results confirm the suitability of both platforms for edge deployment, with the Orin Nano standing out as a highly efficient and cost-effective solution for portable medical AI applications.

This work reported the Accuracy of four pre-trained architectures. Although previous studies employed different datasets, ensemble configurations, and architectural depths, the Swin Transformer-based method proved to be one of the most effective approaches while requiring few computational resources. This combination of performance and efficiency sets it apart, particularly when compared with models that prioritize architectural complexity over practical effectiveness. Grad-CAM explainability techniques also showed that the model consistently focused on the diagnostically important areas in MRI images. Such activation maps can give clinicians greater confidence that the model is functioning correctly and make its decisions easier to understand. ROC curve analysis revealed that the architecture could effectively distinguish between tumor types across all folds, achieving the optimal AUC value (1.0) in all cases. These results suggest that the model is stable and could be applied in various situations.

In conclusion, this study demonstrates that the Swin Transformer-based architecture can be used in the medical field to automatically classify brain tumors. It also provides a clear, reproducible, and computationally efficient methodology. This method not only contributes new knowledge to the field of AI-driven medical image analysis but also establishes a useful reference for the future development of scalable and reliable CAD systems.

Author Contributions

Conceptualization, E.I.-G. and L.J.-B.; data curation, O.A.A.-C. and J.J.E.-E.; formal analysis, C.T.-G. and G.M.G.-A.; funding acquisition, E.I.-G.; investigation, M.A.G.-G. and E.E.G.-G.; methodology, M.A.G.-G. and E.R.R.-A.; project administration, E.I.-G.; resources, E.E.G.-G.; software, M.A.G.-G. and E.R.R.-A.; supervision, E.I.-G. and L.J.-B.; validation, C.T.-G. and G.M.G.-A.; visualization, J.J.E.-E. and O.A.A.-C.; writing—original draft, M.A.G.-G.; writing—review and editing, E.I.-G. and E.E.G.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Universidad Autónoma de Baja California (UABC) through the 25th internal call for research projects with grant number 402/6/C/53/25. The authors also thank SECIHTI (Secretaría de Ciencia, Humanidades, Tecnología e Innovación) for the scholarship awarded to M. A. Gómez-Guzmán and E. R. Ramos-Acosta.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available on the Kaggle platform as the Brain Tumor MRI Dataset: https://doi.org/10.34740/KAGGLE/DSV/2645886.

Acknowledgments

We want to thank UABC for all the support provided to the researchers, and SECIHTI for the scholarship granted to M.A.G.-G. and E.R.R.-A.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study, in the collection, analysis, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
BT	Brain Tumor
CNN	Convolutional Neural Network
CAD	Computer-Aided Diagnosis
CT	Computerized Tomography
DL	Deep Learning
DN-XPert	DeepNeuroXpert
HGG	High-grade glioma
LGG	Low-grade glioma
ML	Machine Learning
MRI	Magnetic Resonance Imaging
MCCNN	Multi-Class Convolutional Neural Network
SECIHTI	Secretaría de Ciencia, Humanidades, Tecnología e Innovación
STL	Semi-Transfer Learning
TL	Transfer Learning
ViT	Vision Transformers
WHO	World Health Organization
XAI	Explainable Artificial Intelligence

References

1. Zulfiqar, F.; Ijaz Bajwa, U.; Mehmood, Y. Multi-class classification of brain tumor types from MR images using EfficientNets. Biomed. Signal Process. Control 2023, 84, 104777.
2. Jaspin, K.; Selvan, S. Multiclass convolutional neural network based classification for the diagnosis of brain MRI images. Biomed. Signal Process. Control 2023, 82, 104542.
3. Hosny, K.M.; Mohammed, M.A.; Salama, R.A.; Elshewey, A.M. Explainable ensemble deep learning-based model for brain tumor detection and classification. Neural Comput. Appl. 2025, 37, 1289–1306.
4. Disci, R.; Gurcan, F.; Soylu, A. Advanced Brain Tumor Classification in MR Images Using Transfer Learning and Pre-Trained Deep CNN Models. Cancers 2025, 17, 121.
5. Batool, A.; Byun, Y.C. A lightweight multi-path convolutional neural network architecture using optimal features selection for multiclass classification of brain tumor using magnetic resonance images. Results Eng. 2025, 25, 104327.
6. Tonni, S.I.; Sheakh, M.A.; Tahosin, M.S.; Hasan, M.Z.; Shuva, T.F.; Bhuiyan, T.; Almoyad, M.A.A.; Orka, N.A.; Rahman, M.T.; Khan, R.T.; et al. A Hybrid Transfer Learning Framework for Brain Tumor Diagnosis. Adv. Intell. Syst. 2025, 7, 2400495.
7. El Amoury, S.; Smili, Y.; Fakhri, Y. Design of an Optimal Convolutional Neural Network Architecture for MRI Brain Tumor Classification by Exploiting Particle Swarm Optimization. J. Imaging 2025, 11, 31.
8. Velayudham, A.; Kumar, K.M.; Priya MS, K. Enhancing clinical diagnostics: Novel denoising methodology for brain MRI with adaptive masking and modified non-local block. Med. Biol. Eng. Comput. 2024, 62, 3043–3056.
9. Abou Ali, M.; Dornaika, F.; Arganda-Carreras, I.; Chmouri, R.; Shayeh, H. Enhancing MRI brain tumor classification: A comprehensive approach integrating real-life scenario simulation and augmentation techniques. Phys. Medica 2024, 127, 104841.
10. Montalbo, F.J.P. TUMbRAIN: A transformer with a unified mobile residual attention inverted network for diagnosing brain tumors from magnetic resonance scans. Neurocomputing 2025, 611, 128583.
11. Guder, O.; Cetin-Kaya, Y. Optimized attention-based lightweight CNN using particle swarm optimization for brain tumor classification. Biomed. Signal Process. Control 2025, 100, 107126.
12. Basthikodi, M.; Chaithrashree, M.; Ahamed Shafeeq, B.; Gurpur, A.P. Enhancing multiclass brain tumor diagnosis using SVM and innovative feature extraction techniques. Sci. Rep. 2024, 14, 26023.
13. Tonmoy, M.R.; Shams, M.A.; Adnan, M.A.; Mridha, M.; Safran, M.; Alfarhood, S.; Che, D. X-Brain: Explainable recognition of brain tumors using robust deep attention CNN. Biomed. Signal Process. Control 2025, 100, 106988.
14. Deol, G.; Priyadarsini, P.I.; Nallagattla, V.G.; Amarendra, K.; Seelam, K.; Latha, B. A Novel SegNet Segmentation with MobileNet Brain Tumor Classification Using MRI Images. SN Comput. Sci. 2025, 6, 477.
15. Sun, J.; Chen, K.; He, Z.; Ren, S.; He, X.; Liu, X.; Peng, C. Medical image analysis using improved SAM-Med2D: Segmentation and classification perspectives. BMC Med. Imaging 2024, 24, 241.
16. Al Bataineh, A.F.; Nahar, K.M.; Khafajeh, H.; Samara, G.; Alazaidah, R.; Nasayreh, A.; Bashkami, A.; Gharaibeh, H.; Dawaghreh, W. Enhanced Magnetic Resonance Imaging-Based Brain Tumor Classification with a Hybrid Swin Transformer and ResNet50V2 Model. Appl. Sci. 2024, 14, 10154.
17. Li, Z.; Dib, O. Empowering Brain Tumor Diagnosis through Explainable Deep Learning. Mach. Learn. Knowl. Extr. 2024, 6, 2248–2281.
18. Abdusalomov, A.; Rakhimov, M.; Karimberdiyev, J.; Belalova, G.; Cho, Y.I. Enhancing automated brain tumor detection accuracy using artificial intelligence approaches for healthcare environments. Bioengineering 2024, 11, 627.
19. Brady, A.P. Error and discrepancy in radiology: Inevitable or avoidable? Insights Imaging 2017, 8, 171–182.
20. Çetin-Kaya, Y.; Kaya, M. A novel ensemble framework for multi-classification of brain tumors using magnetic resonance imaging. Diagnostics 2024, 14, 383.
21. De Benedictis, S.G.; Gargano, G.; Settembre, G. Enhanced MRI brain tumor detection and classification via topological data analysis and low-rank tensor decomposition. J. Comput. Math. Data Sci. 2024, 13, 100103.
22. Wang, J.; Lu, S.Y.; Wang, S.H.; Zhang, Y.D. RanMerFormer: Randomized vision transformer with token merging for brain tumor classification. Neurocomputing 2024, 573, 127216.
23. Subba, A.B.; Sunaniya, A.K. Computationally optimized brain tumor classification using attention based GoogLeNet-style CNN. Expert Syst. Appl. 2025, 260, 125443.
24. Afroj, M.; Mondal, M.R.H.; Hassan, M.R.; Akter, S. MobDenseNet: A Hybrid Deep Learning Model for Brain Tumor Classification Using MRI. Array 2025, 26, 100413.
25. Malik, M.G.A.; Saeed, A.; Shehzad, K.; Iqbal, M. DEF-SwinE2NET: Dual enhanced features guided with multi-model fusion for brain tumor classification using preprocessing optimization. Biomed. Signal Process. Control 2025, 100, 107079.
26. Gomaa, M.M.; Zain elabdeen, A.G.; Elnashar, A.; Zaki, A.M. Brain tumor X-ray images enhancement and classification using anisotropic diffusion filter and transfer learning models. Int. J. Inf. Technol. 2024, 16, 3771–3779.
27. Adamu, M.J.; Kawuwa, H.B.; Qiang, L.; Nyatega, C.O.; Younis, A.; Fahad, M.; Dauya, S.S. Efficient and accurate brain tumor classification using hybrid mobileNetV2–support vector machine for magnetic resonance imaging diagnostics in neoplasms. Brain Sci. 2024, 14, 1178.
28. Priya, A.; Vasudevan, V. Advanced Attention-Based Pre-Trained Transfer Learning Model for Accurate Brain Tumor Detection and Classification from MRI Images. Opt. Mem. Neural Netw. 2024, 33, 477–491.
29. Poornam, S.; Angelina, J.J.R. BrainNeuroNet: Advancing brain tumor detection with hierarchical transformers and multiscale attention. Int. J. Inf. Technol. 2024, 16, 4749–4756.
30. Rahman, T.; Islam, M.S.; Uddin, J. MRI-based brain tumor classification using a dilated parallel deep convolutional neural network. Digital 2024, 4, 529–554.
31. Kar, S.; Aich, U.; Singh, P.K. Efficient Brain Tumor Classification Using Filter-Based Deep Feature Selection Methodology. SN Comput. Sci. 2024, 5, 1033.
32. Mahesh, T.; Gupta, M.; Anupama, T.; Geman, O. An XAI-enhanced efficientNetB0 framework for precision brain tumor detection in MRI imaging. J. Neurosci. Methods 2024, 410, 110227.
33. Bouguerra, O.; Attallah, B.; Brik, Y. MRI-based brain tumor ensemble classification using two stage score level fusion and CNN models. Egypt. Inform. J. 2024, 28, 100565.
34. Dutta, A.K.; Bokhari, Y.; Alghayadh, F.; Alsubai, S.; Alhalabi, H.R.S.; Umer, M.; Sait, A.R.W. A synaptic deep tumor sense predictor system for brain tumor detection and classification. Alex. Eng. J. 2025, 123, 29–45.
35. Iftikhar, S.; Anjum, N.; Siddiqui, A.B.; Ur Rehman, M.; Ramzan, N. Explainable CNN for brain tumor detection and classification through XAI based key features identification. Brain Inform. 2025, 12, 10.
36. Ali, A.A.; Hammad, M.T.; Hassan, H.S. A Co-Evolutionary Genetic Algorithm Approach to Optimizing Deep Learning for Brain Tumor Classification. IEEE Access 2025, 13, 21229–21248.
37. Başarslan, M.S. MC &M-BL: A novel classification model for brain tumor classification: Multi-CNN and multi-BiLSTM. J. Supercomput. 2025, 81, 502.
38. Abou Ali, M.; Charafeddine, J.; Dornaika, F.; Arganda-Carreras, I. Enhancing Generalization and Mitigating Overfitting in Deep Learning for Brain Cancer Diagnosis from MRI. Appl. Magn. Reson. 2025, 56, 359–394.
39. Chandraprabha, K.; Ganesan, L.; Baskaran, K. A novel approach for the detection of brain tumor and its classification via end-to-end vision transformer-CNN architecture. Front. Oncol. 2025, 15, 1508451.
40. Pandey, A.; Pandey, V.K. Wavelet Based Classification Using Meta-Heuristic Algorithm with Deep Transfer Learning Technique. SN Comput. Sci. 2025, 6, 208.
41. Panigrahi, S.; Adhikary, D.R.D.; Pattanayak, B.K. Brain tumor classification: A blend of ensemble learning and fine-tuned pre-trained models. Discov. Appl. Sci. 2025, 7, 274.
42. Suthar, O.P.; Zinzuvadia, Y.; Ullah, W.; Khan, H.; Agarwal, C. Visual Intelligence in Neuro-Oncology: Effective Brain Tumor Detection through Optimized Convolutional Neural Networks. IECE Trans. Sens. Commun. Control 2025, 2, 25–35.
43. Su, G.; Li, H.; Chen, H. ParMamba: A Parallel Architecture Using CNN and Mamba for Brain Tumor Classification. Comput. Model. Eng. Sci. (CMES) 2025, 142, 2527–2545.
44. Qureshi, S.A.; Sadiq, T.; Usman, A.; Khawar, A.; Shah, S.T.H.; ul Rehman, A. SAlexNet: Superimposed AlexNet using residual attention mechanism for accurate and efficient automatic primary brain tumor detection and classification. Results Eng. 2025, 25, 104025.
45. Gómez-Guzmán, M.A.; Jiménez-Beristaín, L.; García-Guerrero, E.E.; López-Bonilla, O.R.; Tamayo-Perez, U.J.; Esqueda-Elizondo, J.J.; Palomino-Vizcaino, K.; Inzunza-González, E. Classifying brain tumors on magnetic resonance imaging by using convolutional neural networks. Electronics 2023, 12, 955.
46. Ramos-Acosta, E.R.; García-Guerrero, E.E.; López-Bonilla, O.R.; Tamayo-Pérez, U.J.; Aguirre-Castro, O.A.; Ramírez-Rios, L.Y.; Inzunza-Gonzalez, E. A novel system for the classification of zinc-plated components by benchmarking deep neural networks. Expert Syst. Appl. 2024, 255, 124866.
47. Bohmrah, M.K.; Kaur, H. Advanced Hybridization and Optimization of DNNs for Medical Imaging: A Survey on Disease Detection Techniques. Artif. Intell. Rev. 2025, 58, 122.
48. Nickparvar, M. Brain Tumor MRI Dataset. 2021. Available online: https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset (accessed on 19 August 2025).
49. Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27.
50. Wang, A.X.; Le, V.T.; Trung, H.N.; Nguyen, B.P. Addressing imbalance in health data: Synthetic minority oversampling using deep learning. Comput. Biol. Med. 2025, 188, 109830.
51. Yevick, D.; Hutchison, K. Neural Network Characterization and Entropy Regulated Data Balancing through Principal Component Analysis. arXiv 2023, arXiv:2312.01392.
52. Weights & Biases. Experiment Tracking. 2025. Available online: https://wandb.ai/site/experiment-tracking/ (accessed on 19 August 2025).
53. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual Event, 18–24 July 2021; Volume 139, pp. 10347–10357.
54. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
55. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; p. 887.
56. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
57. Ahmed, A.E.A.; Elmogy, M. A robust tuned EfficientNet-B2 using dynamic learning for predicting different grades of brain cancer. Egypt. Inform. J. 2025, 30, 100694.
58. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6.
59. Theissler, A.; Vollert, S.; Benz, P.; Meerhoff, L.A.; Fernandes, M. ML-ModelExplorer: An explorative model-agnostic approach to evaluate and compare multi-class classifiers. In Proceedings of the Machine Learning and Knowledge Extraction: 4th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2020, Dublin, Ireland, 25–28 August 2020; Proceedings 4. Springer: Berlin/Heidelberg, Germany, 2020; pp. 281–300.
60. Khan, S.U.R.; Zhao, M.; Li, Y. Detection of MRI brain tumor using residual skip block based modified MobileNet model. Clust. Comput. 2025, 28, 248.
61. Mijwil, M.M. Smart architectures: Computerized classification of brain tumors from MRI images utilizing deep learning approaches. Multimed. Tools Appl. 2025, 84, 2261–2292.
62. Hennessy, J.L.; Patterson, D.A. Computer Architecture: A Quantitative Approach; Elsevier: Amsterdam, The Netherlands, 2011.

Figure 1. Flowchart of the proposed approach for BT multi-class classification.

Figure 2. Class distribution percentages in the Msoud dataset.

Figure 3. Sample of the Msoud dataset outlining its components as cited in [6].

Figure 4. Balance indicators for the Msoud dataset: imbalance ratio and Entropy Balance.

Figure 5. Grid of augmented training samples, organized by tumor class. Each row corresponds to one of four categories: Glioma, Meningioma, Pituitary tumor, and No Tumor, with columns representing different augmented instances. This arrangement shows the variability introduced by the augmentation process, while preserving the class-specific visual features relevant to classification.

Figure 6. Stratified 5-fold cross-validation distribution in the Msoud dataset.

Figure 7. Parallel coordinates visualization of hyperparameter optimization for classifying brain tumors.

Figure 8. Training Accuracy and Validation Accuracy during the training process with 75% of the dataset, batch size of 64, and learning rate of $1 \times 10^{-4}$: (a) Training Accuracy, (b) Validation Accuracy.

Figure 9. Training Loss and Validation Loss during the training process with 75% of the dataset, batch size of 64, and learning rate of $1 \times 10^{-4}$: (a) Training Loss, (b) Validation Loss.

Figure 10. Test Accuracy versus number of parameters in selected DL models.

Figure 11. Confusion matrices across five stratified k-folds during the testing stage of the Swin_tiny_patch4_window7_224 model: (a) Confusion matrix k-fold = 1. (b) Confusion matrix k-fold = 2. (c) Confusion matrix k-fold = 3. (d) Confusion matrix k-fold = 4. (e) Confusion matrix k-fold = 5.

Figure 12. ROC curves for each stratified test fold of the Swin_tiny_patch4_window7_224 model: (a) Fold 1, where the curve bends almost perfectly toward the upper-left corner, indicating clear class separation. (b) Fold 2, showing stable, strong classification performance with low variance. (c) Fold 3, with an AUC close to unity, indicating a consistent separation between true-positive and false-positive rates. (d) Fold 4, showing high predictive accuracy and low variation, which supports generalization. (e) Fold 5, indicating reliable performance with balanced sensitivity and specificity across tumor classes.

Figure 13. Grad-CAM results for the Swin_tiny_patch4_window7_224 model in the test stage.

Figure 14. Model ranking. Chart generated by the ML-ModelExplorer tool [59].

Figure 15. Model similarity, shown as the standard deviation of the Recalls versus model Accuracy. Chart produced using the ML-ModelExplorer tool [59].

Figure 16. GPU and CPU utilization during the training process: (a) Behavior of GPU during the training process. (b) Behavior of CPU during the training process.

Figure 17. GPU memory allocation during the training process.

Figure 18. GPU power usage during the training process.

Figure 19. Performance metrics for ONNX Runtime and TensorRT executed on Jetson AGX Xavier: (a) Inference time comparison showing the substantial reduction achieved by TensorRT. (b) Speedup ratio obtained by dividing ONNX Runtime’s mean inference time by that of TensorRT [62]. (c) Efficiency gain expressed as the percentage increase in mean FPS after optimization [62].

Figure 20. Performance metrics for ONNX Runtime and TensorRT executed on Jetson Orin Nano: (a) Inference time comparison highlights the drastic latency reduction with TensorRT. (b) Speedup ratio calculated based on the inference times of both frameworks [62]. (c) Efficiency gain measured as the percentage of relative increase in frames per second (FPS) [62].

Figure 21. Graphical representation of a real-world usage scenario for real-time AI-assisted brain tumor classification using MRI and on-device inference with Jetson AGX Xavier.

Table 1. Distribution of images in the Msoud Brain MRI Dataset by class and data subset.

Class	Training Images	Testing Images	Total Images
Glioma Tumor	1321	300	1621
Meningioma Tumor	1339	306	1645
No Tumor	1595	405	2000
Pituitary Tumor	1457	300	1757
Total	5712	1311	7023
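
The class counts in Table 1 are enough to recompute the balance indicators shown in Figure 4. The following is a minimal sketch that assumes the imbalance ratio is the largest class size divided by the smallest, and that Entropy Balance is the Shannon entropy of the class proportions normalized by the logarithm of the number of classes; these definitions and the function names are assumptions made for illustration and may differ from the exact formulas used by the authors.

```python
import math

# Training-set class counts copied from Table 1.
train_counts = {
    "Glioma": 1321,
    "Meningioma": 1339,
    "No Tumor": 1595,
    "Pituitary": 1457,
}


def imbalance_ratio(counts):
    """Largest class size divided by smallest class size (assumed definition)."""
    values = list(counts.values())
    return max(values) / min(values)


def entropy_balance(counts):
    """Shannon entropy of class proportions, normalized by log(k) (assumed definition)."""
    total = sum(counts.values())
    proportions = [n / total for n in counts.values() if n > 0]
    entropy = -sum(p * math.log(p) for p in proportions)
    return entropy / math.log(len(counts))


ir = imbalance_ratio(train_counts)
eb = entropy_balance(train_counts)
print(f"IR = {ir:.3f}, EB = {eb:.4f}")
```

Under these definitions, a value of 1 for either indicator corresponds to perfectly balanced classes, so values close to 1 indicate only mild imbalance.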

Table 2. Image Data Augmentation Techniques and Parameters, taken from [46].

Augmentation	Description
Rescaling	All images are resized to a fixed resolution of 224 × 224 pixels.
Rotation	Randomly rotates images by up to 30 degrees to simulate angular variance.
Brightness Variation	Random changes in brightness within a 0–20% interval.
Contrast Adjustment	Contrast levels are randomly modified between 0–20%.
Saturation Shift	Random variation of saturation within the range of 0–20%.
Horizontal Flip	Images are flipped horizontally at random during training.
Vertical Flip	Vertical flipping is randomly applied to training samples.
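
To make the parameter ranges in Table 2 concrete, the sketch below draws one random set of augmentation parameters per image using only the Python standard library. It does not reproduce the authors' pipeline: the symmetric reading of the ranges (rotation in [-30, +30] degrees, color factors in [0.8, 1.2]), the flip probability of 0.5, and the names `sample_augmentation` and `TARGET_SIZE` are assumptions introduced for illustration.

```python
import random

# Ranges taken from Table 2. The symmetric interpretation and the flip
# probability are assumptions; the table does not state them explicitly.
MAX_ROTATION_DEG = 30.0
MAX_COLOR_JITTER = 0.20
FLIP_PROBABILITY = 0.5
TARGET_SIZE = (224, 224)


def sample_augmentation(rng):
    """Draw one random set of augmentation parameters for a single image."""

    def jitter():
        # Multiplicative factor within +/-20% of the original value.
        return 1.0 + rng.uniform(-MAX_COLOR_JITTER, MAX_COLOR_JITTER)

    return {
        "size": TARGET_SIZE,
        "rotation_deg": rng.uniform(-MAX_ROTATION_DEG, MAX_ROTATION_DEG),
        "brightness_factor": jitter(),
        "contrast_factor": jitter(),
        "saturation_factor": jitter(),
        "horizontal_flip": rng.random() < FLIP_PROBABILITY,
        "vertical_flip": rng.random() < FLIP_PROBABILITY,
    }


rng = random.Random(0)  # fixed seed so the draws are reproducible
samples = [sample_augmentation(rng) for _ in range(1000)]
print(samples[0])
```

Each dictionary describes the transformation that an image-processing library would then apply to one training image; the test images would only be resized.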

Table 3. Hyperparameters employed during the training stage [46].

Parameter	Value
Epochs	10
Batch size	32, 64
Data size	25%, 50%, 75%, 100%
Initial learning rate	$1 \times 10^{- 4}$ , $1 \times 10^{- 3}$
Optimizer	Adam
Learning rate scheduler	ReduceLROnPlateau
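
The values in Table 3 define a small search space. As an illustration, the sketch below enumerates every combination of batch size, data fraction, and initial learning rate; whether the authors evaluated the full Cartesian product is an assumption, and the names `search_space`, `FIXED`, and `enumerate_configs` are hypothetical.

```python
from itertools import product

# Values copied from Table 3.
search_space = {
    "batch_size": [32, 64],
    "data_fraction": [0.25, 0.50, 0.75, 1.00],
    "learning_rate": [1e-4, 1e-3],
}
FIXED = {"epochs": 10, "optimizer": "Adam", "scheduler": "ReduceLROnPlateau"}


def enumerate_configs(space, fixed):
    """Return one dict per combination of the searched values plus the fixed settings."""
    keys = list(space)
    combos = product(*(space[key] for key in keys))
    return [dict(zip(keys, combo), **fixed) for combo in combos]


configs = enumerate_configs(search_space, FIXED)
print(len(configs), "configurations per model")
print(configs[0])
```

With two batch sizes, four data fractions, and two learning rates, this yields 16 configurations per model, each of which would be repeated for every cross-validation fold.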

Table 4. Performance metrics on the test dataset for each evaluated model.

Model	Test Accuracy	Test Precision	Test Recall	Test F1-Score	Test MCC
`Swin_tiny_patch4_window7_224`	0.9924	0.9924	0.9924	0.9924	0.9898
`Deit3_base_patch16_224`	0.9908	0.9908	0.9908	0.9908	0.9877
`Xception41`	0.9870	0.9871	0.9870	0.9870	0.9826
`Inception_v4`	0.9893	0.9895	0.9893	0.9894	0.9857
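
The metrics in Table 4 can all be derived from a multi-class confusion matrix. The sketch below assumes support-weighted averaging for Precision, Recall, and F1-Score, which is an assumption suggested by the fact that Recall equals Accuracy in every row of the table (weighted Recall is equal to Accuracy by construction). The confusion matrix `toy_cm` is a hypothetical example and is not taken from the paper.

```python
import math


def classification_metrics(cm):
    """Accuracy, support-weighted precision/recall/F1, and multi-class MCC.

    cm[i][j] counts samples of true class i predicted as class j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    true_counts = [sum(cm[i]) for i in range(k)]
    pred_counts = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    correct = sum(cm[i][i] for i in range(k))

    precision = recall = f1 = 0.0
    for i in range(k):
        p = cm[i][i] / pred_counts[i] if pred_counts[i] else 0.0
        r = cm[i][i] / true_counts[i] if true_counts[i] else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = true_counts[i] / total  # class support as averaging weight
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    # Multi-class Matthews correlation coefficient from the confusion matrix.
    numerator = correct * total - sum(p * t for p, t in zip(pred_counts, true_counts))
    denominator = math.sqrt(
        (total**2 - sum(p * p for p in pred_counts))
        * (total**2 - sum(t * t for t in true_counts))
    )
    mcc = numerator / denominator if denominator else 0.0

    return {
        "accuracy": correct / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical 4-class confusion matrix (50 samples per class), for illustration only.
toy_cm = [
    [48, 2, 0, 0],
    [1, 47, 1, 1],
    [0, 0, 50, 0],
    [0, 1, 0, 49],
]
metrics = classification_metrics(toy_cm)
print({name: round(value, 4) for name, value in metrics.items()})
```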

Table 5. Comparison of the proposed approach with related works on brain tumor classification.

Reference	Best model	Dataset	Classes	Best Accuracy
Ref. [1]	EfficientNetB2	Figshare	3	98.86%
Ref. [2]	MCCNN	BRATS 2015, Figshare	2, 3	99%
Ref. [3]	DenseNet121 and InceptionV3 Ensemble	Figshare	3	99.02%
Ref. [4]	Xception	Msoud	4	98.73%
Ref. [5]	Multi-path CNN	Sartaj	4	96.03%
Ref. [6]	VGG16 and ResNet152V2 Ensemble	Msoud	4	99.47%
Ref. [7]	Proposed CNN	Msoud	4	99.19%
Ref. [10]	TUMbRAIN	Msoud	4	97.94%
Ref. [16]	SwT + ResNet50V2	BR35H, Msoud	2, 4	99.9%, 96.8%
Ref. [17]	ResNet-50, Xception, InceptionV3	Msoud	4	99%
Ref. [21]	Tucker decomp. + Extra-Trees	Msoud	4	97.28%
Ref. [23]	GoogLeNet-style CNN	Figshare	3	97.62%
Ref. [34]	DN-XPert	Figshare, Msoud	3, 4	99.4%
Ref. [35]	Proposed CNN	Msoud, NeuroMRI Dataset	4	99%
Ref. [36]	EfficientNetB3 and DenseNet121 with CEGA	Msoud	4	99.39%, 99.01%
Ref. [37]	M-C&M-BL	BR35H	2	99.33%
Ref. [40]	DenseNet121	Msoud	4	99.43%
Ref. [43]	ParMamba	Figshare, Msoud	3, 4	99.62%, 99.35%
Ref. [60]	Modified MobileNet model	BR35H, Figshare	2, 3	96.95%, 99.93%
Ref. [61]	MobileNetV2	Sartaj	4	96.5%
This Work	Swin_tiny_patch4_window7_224	Msoud	4	99.24%

Table 6. Performance comparison between ONNX Runtime and TensorRT on Jetson AGX Xavier.

Metric	ONNX Runtime	TensorRT
Mean Inference Time [ms]	177.59	18.23
Std. Deviation [ms]	55.84	3.18
Mean FPS	6.13	56.30
Max FPS	9.67	69.42

Table 7. Performance comparison between ONNX Runtime and TensorRT on Jetson Orin Nano.

Metric	ONNX Runtime	TensorRT
Mean Inference Time [ms]	792.47	23.04
Std. Deviation [ms]	104.19	2.92
Mean FPS	1.29	44.04
Max FPS	1.88	51.76
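
The speedup ratio and efficiency gain plotted in Figures 19 and 20 follow directly from the mean values in Tables 6 and 7. The sketch below applies the definitions given in the figure captions (speedup as ONNX Runtime mean inference time divided by TensorRT mean inference time, and efficiency gain as the percentage increase in mean FPS); the function and variable names are hypothetical.

```python
# Mean inference time [ms] and mean FPS copied from Tables 6 and 7.
benchmarks = {
    "Jetson AGX Xavier": {
        "onnx": {"time_ms": 177.59, "fps": 6.13},
        "tensorrt": {"time_ms": 18.23, "fps": 56.30},
    },
    "Jetson Orin Nano": {
        "onnx": {"time_ms": 792.47, "fps": 1.29},
        "tensorrt": {"time_ms": 23.04, "fps": 44.04},
    },
}


def speedup_ratio(onnx_time_ms, trt_time_ms):
    """ONNX Runtime mean inference time divided by TensorRT mean inference time."""
    return onnx_time_ms / trt_time_ms


def efficiency_gain_percent(onnx_fps, trt_fps):
    """Relative increase in mean FPS after optimization, in percent."""
    return (trt_fps - onnx_fps) / onnx_fps * 100.0


summary = {}
for device, runs in benchmarks.items():
    summary[device] = {
        "speedup": speedup_ratio(runs["onnx"]["time_ms"], runs["tensorrt"]["time_ms"]),
        "fps_gain_percent": efficiency_gain_percent(runs["onnx"]["fps"], runs["tensorrt"]["fps"]),
    }
    print(
        f"{device}: speedup = {summary[device]['speedup']:.2f}x, "
        f"FPS gain = {summary[device]['fps_gain_percent']:.1f}%"
    )
```

With these inputs, the speedup is roughly 9.7x on the Jetson AGX Xavier and roughly 34.4x on the Jetson Orin Nano, the larger ratio on the Orin Nano reflecting its much slower ONNX Runtime baseline rather than a faster TensorRT result.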

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Gómez-Guzmán, M.A.; Jiménez-Beristain, L.; García-Guerrero, E.E.; Aguirre-Castro, O.A.; Esqueda-Elizondo, J.J.; Ramos-Acosta, E.R.; Galindo-Aldana, G.M.; Torres-Gonzalez, C.; Inzunza-Gonzalez, E. Enhanced Multi-Class Brain Tumor Classification in MRI Using Pre-Trained CNNs and Transformer Architectures. Technologies 2025, 13, 379. https://doi.org/10.3390/technologies13090379
