1. Introduction
Artificial intelligence is a field of technology in which programs and machines attempt to mimic human intelligence and offer solutions based on the information they collect. Its main purpose is to adapt the intelligent behavior observed in humans to computers and machines. In recent years, artificial intelligence technologies have been used widely in the defense industry, the economy, social life and the health sector.
Machine learning, a subset of artificial intelligence, builds the architectures of intelligent algorithms that can make predictions through self-learning models. There are also many studies in the literature on deep learning and image processing, an area of machine learning. These studies have focused especially on feature extraction, classification and object detection, and the success rates of different algorithms have varied across different and similar datasets [
1].
The interpretation of radiologic images always requires precision and care. In diseases where X-ray and computed tomography images are used frequently, such as during the recent coronavirus (COVID-19) epidemic, the responsibilities of specialists have increased and the examination of radiological images has become more time-consuming. Radiologists therefore work very carefully and meticulously: a report is created by interpreting MR images of many different parts of the body (brain, chest, leg, etc.), and based on these results, specialist physicians in different branches make decisions about diagnosis and treatment [
2].
Image processing is a very important field of study and its applications in the medical field make it even more important. A mistake made here can have critical consequences for patients. For this reason, experts working in this field have to be very careful [
3]. Brain tumors are known to be highly sensitive cases due to their location. Interpreting MRI images from hundreds of thousands of different patients can be complex, and it is sometimes difficult for specialists to determine whether tumors of different sizes and shapes are benign or malignant. Computer-based systems are needed to ease this workload and minimize the margin of error. For this purpose, deep learning methods are of interest for detecting tumor cells more accurately and quickly [
4].
The vast majority of biomedical image classification studies in the literature have relied on a single CNN model for feature extraction. However, this approach has limitations: methods that focus on a single model may fail to capture the complexity and variability of brain MRI images, so classification accuracy remains low. In contrast, our study proposes a hybrid method for brain tumor diagnosis. The proposed method uses several trained CNNs to extract features from brain MRI images and multiple ML classifiers to assign the images to four categories: normal brain tissue and three tumor types. The features extracted by four different trained CNN models are evaluated with five different ML classifiers to select the strongest representations, and the features from the various trained CNN models are combined in a feature-ensemble approach to tackle the brain image classification problem. The resulting hybrid model is then classified and tested for accuracy. This approach combines the complementary information gathered by multiple CNN models rather than relying on features extracted from a single model. In addition, to improve accuracy, the proposed approach is tuned with grid search optimization, and the dataset was classified with the most effective ML classifier after optimization. The experimental results show that our proposed hybrid method significantly improves performance.
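The feature-ensemble idea can be sketched in a few lines: features produced by several trained CNN backbones for the same images are concatenated per image before classification. The backbone names and feature dimensions below are illustrative stand-ins, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical feature matrices for a small batch of MR images, as four
# pre-trained CNN backbones might produce them (dimensions are illustrative).
rng = np.random.default_rng(0)
n_images = 8
feats_vgg      = rng.normal(size=(n_images, 512))   # VGG-style features
feats_resnet   = rng.normal(size=(n_images, 2048))  # ResNet-style features
feats_densenet = rng.normal(size=(n_images, 1024))  # DenseNet-style features
feats_squeeze  = rng.normal(size=(n_images, 512))   # SqueezeNet-style features

# Feature ensemble: concatenate each image's features from all backbones so
# the downstream ML classifier sees one combined representation per image.
combined = np.concatenate(
    [feats_vgg, feats_resnet, feats_densenet, feats_squeeze], axis=1
)
print(combined.shape)  # (8, 4096)
```

A classifier such as an SVM would then be trained on `combined` rather than on any single backbone's output.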
2. Related Studies
Computer vision is a software discipline that focuses on images, clips from videos, and the identification and understanding of objects. The main purpose of image processing, as a product of artificial intelligence, is to create systems that can replicate human visual abilities and learn on their own. Image processing applications therefore use machine and deep learning architectures to imitate the visual processing that occurs in living things. Many image processing techniques are in use, and medical image processing in particular is widely employed for rapid diagnosis. When the literature was examined, it was seen that studies have been carried out to diagnose tumors of different types and sizes using magnetic resonance images.
In a sample research study, the aim was to classify different brain tumors (pituitary, glioma and meningioma) using CNN algorithms on MR images and to determine the importance of brain sections such as coronal, axial and sagittal in classification. On the same topic, researchers proposed a new model derived from the DenseNet algorithm; the results were classified with machine learning algorithms and high success rates were achieved [
5]. In a study, VGG architecture is preferred because it is easy to understand. The results obtained from 253 brain MRI images, 155 of which had tumors, showed that VGG achieved a 98% success rate [
4]. In a similar study, images of three different brain tumors were used to extract sub-layer images that are different from medical images using pre-trained models. Here, feature extraction and merging methods are used while solving the problem. Inception-v3 and DenseNet architectures were used in this problem, and success rates of 99.34% and 99.51% were achieved, respectively, from these two models [
6]. In another study, magnetic resonance images were used for brain tumor detection. In this research, a rule-based detection system was introduced and morphological features were utilized. Respectively, for 497 brain MRI images, preprocessing, segmentation, tumor region detection and tumor detection stages were followed. Here, a success rate of 84.26% was achieved [
7]. In addition, the Swati study group proposed transfer learning for multi-class brain tumor classification. For this purpose, the AlexNet, VGG-16 and VGG-19 CNN models were used, achieving accuracy rates of 89.95%, 94.65% and 94.82%, respectively. A statistical study comparing textural characteristics to classify benign and malignant brain tumors used the nearest-neighbor algorithm and achieved a classification accuracy of 80% [
8]. In a segmentation study developed in addition to CNN structures, a diagnosis study was conducted using MR images. In this study, a thresholding technique was applied using a search algorithm. Morphological operations and connected component analysis were used to reduce the noise in the images and to identify brain tumors at a higher rate. The results obtained were compared with CNN algorithms and high success was achieved [
9].
In their research with two different datasets, ref. [
10] aimed to see the success rates of classification by applying different labels. A CNN-based deep learning algorithm was tested to classify 3580 open-access brain MRI images, and an accuracy rate of 96.13% was achieved using first- and fourth-stage tumors. In a study using a neuro-fuzzy inference system, brain MR images were divided into their component tissues, and regions such as cerebrospinal fluid, edema and tumor were separated. Unlike similar studies in the literature, the skull was stripped and only brain tissue was evaluated. The statistical features obtained from the system were compared with the segmented tissue areas and evaluated with the membrane index. As a result of the research, it was shown that the neuro-fuzzy system gave very successful results in the segmentation of MR images [
11]. The VGG algorithm has also been used to classify brain tumor images: in that research, results were obtained with the VGG-19 model both before and after data augmentation, several optimization techniques were applied, and high success rates were achieved [
12]. In the literature, it has been seen that the VGG algorithm is frequently used in brain tumor detection and classification studies. The biggest reason for this is that the algorithm gives successful results in similar datasets.
In another study, MR images of three different tumor types such as glioma, meningioma and pituitary were classified using ResNet architecture. In order to obtain a better result in the research, changes were made in the layers, and the number of layers was increased. During the training, the Figshare MRI dataset consisting of 3064 T1-weighted MR images of 233 patients with three different tumor types containing, respectively, 1426, 708 and 930 images was used. The accuracy rate obtained as a result of the research was 98.67% [
13]. Another study aimed to improve the detection and precise localization of brain cancer to improve the prognosis and treatment outcomes of patients by leveraging the information provided by brain medical images. Here, 300 brain images were analyzed using the YOLO model and a success percentage of 0.94 was achieved [
14]. In research conducted with the transfer learning method, contrast stretching and histogram equalization methods were applied to the input images using the pre-trained ResNet50 architecture, and the success rates were compared in terms of precision and sensitivity. Here, the ResNet50 method achieved a very high success rate of 99.15%, with contrast stretching for the classification process [
15]. In another study using three different convolutional neural networks, brain tumor types (pituitary, glioma and meningioma) were classified via VGGNet, GoogleNet and AlexNet; the VGG16 architecture achieved a success rate of 98.69% in classification and detection [
16].
A review of the literature showed that some studies applied no optimization to the CNN algorithms and used only machine learning methods for classification. In one such study, brain tumor segmentation was performed using the BRaTS 2020 dataset, yielding an 86% similarity rate and an 80% sensitivity [
17]. In addition, using the BraTS 2018 dataset, a U-Net-based model was developed to classify the tumor region using colored pixel-label segmentation. As a result of this classification, a 98% success rate was achieved [
18]. In research proposing a random forest classifier-based system that divides brain images into two classes, an adaptive median filter was first applied to the MR images in the preprocessing stage to preserve the pixels at the image edges. Feature extraction was then applied to determine the tumor region, and a weighted voting technique was used to distinguish between tumor and non-tumor regions [
19]. Another study, employing contrast-enhanced MRI, aimed to predict the 1p/19q co-deletion status in 159 lower-grade gliomas (LGG) by analyzing post-contrast MRI images with convolutional neural network (CNN) algorithms on a dataset prepared specifically for LGG 1p/19q. It achieved 93% sensitivity, 82% specificity and 87% accuracy using classical machine learning techniques [
20].
Table 1 shows the data sets used by similar studies and the accuracy rates found.
3. Material and Method
3.1. Datasets
The brain is one of the most complex and important structures in the human body. It consists of more than 50 billion nerve cells working together through millions of connections. As the control center of the whole body, the brain organizes the coordinated work of the heart, lungs, blood vessels and all other organs, and all of our senses are connected to it [
21].
Brain tumors are a deadly disease that develops, especially in adults, with the formation and proliferation of abnormal cells, usually caused by abnormal growth of central nervous system or brain cells. Brain tumors are divided into two groups, primary and secondary, and the tumor types in these two groups are categorized separately. Knowing and classifying brain tumors by group is of vital importance for patients. Primary brain tumors originate from a cell or tissue in the brain and can themselves be benign or malignant. Benign tumors occur in a single region and grow relatively slowly; if the surgical operation is performed correctly, these tumors most likely do not recur, so it is very important to completely remove the tumor and clean the area. Malignant tumors in the brain and spinal cord spread and multiply rapidly. Secondary brain tumors, in contrast, start elsewhere in the body and spread to the brain. Tumors are graded between 1 and 4 according to their growth rate: grades 1 and 2 are considered benign, while grades 3 and 4 are considered malignant [
22].
In this research, the authors used the “Brain Tumor MRI dataset”, whose MR images were published as open access by Masoud Nickparvar on the Kaggle platform. This dataset is a combination of three different datasets (Figshare, SARTAJ, Br35H) and contains 7022 brain MRI images in total, divided into 4 classes. The figures below show sample images of the four classes together with the class counts [
23].
As seen in Figure 1 and Figure 2, tumorous tissues are marked with green dots. However, this is not always possible, because not all tumors are large enough to be seen, and missed cases can become life-threatening in a very short time. The underlying principle of this research is to minimize human error by utilizing artificial intelligence. The dataset consists of 4 classes: glioma, meningioma, pituitary and no tumor. The distribution of the classes is as follows: glioma (1321 images), meningioma (1339 images), pituitary tumor (1456 images) and no tumor (1595 images). The original dataset is divided into 5711 training images and 1311 test images; for evaluation, however, 40% of the test set is allocated to the evaluation set and the remaining 60% is returned to the training set.
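The re-partitioning of the original 1311-image test set can be sketched as follows; the random seed and the rounding of the 40% share are our own illustrative choices.

```python
import numpy as np

# Sketch of the split described above: 40% of the 1311 original test images
# go to the evaluation set, and the remaining 60% are returned to training.
rng = np.random.default_rng(42)
n_test = 1311
idx = rng.permutation(n_test)          # shuffle image indices
n_eval = int(round(n_test * 0.40))     # 40% for evaluation
eval_idx = idx[:n_eval]
back_to_train_idx = idx[n_eval:]
print(len(eval_idx), len(back_to_train_idx))  # 524 787
```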
Figure 3 shows the flow chart of the model. Here, all classification and CNN algorithms are presented.
3.2. Convolutional Neural Network (CNN) Model
Convolutional neural networks (CNNs) are a type of artificial neural network that has been successfully used in computer vision, voice recognition, natural language processing and various other tasks. CNNs are typically designed to work with two- or three-dimensional input data, such as visual data analysis. CNNs contain convolutional layers that are specifically designed for use in visual recognition tasks. These layers use filters or feature maps to learn and recognize features in the input data. They can be highly effective in visual tasks, for example, recognizing edges, patterns or objects in an image. In
Figure 4, the general structure of CNN architecture is expressed.
The general components of CNNs are the convolution layers, which extract feature maps by performing convolution on the input data. In this way, features are learned hierarchically and the learned information is transferred between layers. After the convolution layers, an activation function is usually applied. Activation functions are used to introduce nonlinearity: when applying deep learning methods, the values obtained after the matrix multiplications in the convolution layer are linear [
24]. Activation functions are chosen depending on the structure of the estimation problem. Sigmoid, Softmax, Hyperbolic Tangent and ReLU are commonly preferred activation functions. After the convolution layers, pooling layers are usually used to provide scaling and position invariance. These layers can reduce the size of feature maps and images by reducing the number of parameters and highlighting important features. Finally, for the classification or regression process, the features extracted by the convolution layers are used in the fully connected layers to achieve the desired result. With the rapid development of the CPUs and GPUs of workstation computers, computational techniques are used to train CNNs more efficiently [
25]. When the studies in the literature are examined, convolutional neural networks are the most popular and powerful tool for image processing, classification and segmentation. CNNs have achieved great success, especially in visual tasks such as image classification, object recognition and face recognition. Various architectural variations can be found, but the basic principles are generally similar [
26].
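The convolution–activation–pooling sequence described above can be shown on a toy example. The image and filter below are illustrative; the hand-rolled convolution is a sketch of what CNN frameworks compute far more efficiently.

```python
import numpy as np

# Toy 6x6 grayscale "image" (a horizontal intensity ramp) and a 3x3
# vertical-edge filter, followed by ReLU and 2x2 max pooling.
img = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[-1., 0., 1.],
                   [-1., 0., 1.],
                   [-1., 0., 1.]])

def conv2d_valid(x, k):
    # "valid" cross-correlation, as used by most deep learning frameworks
    h = x.shape[0] - k.shape[0] + 1
    w = x.shape[1] - k.shape[1] + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i+3, j:j+3] * k)
    return out

feat = conv2d_valid(img, kernel)                    # feature map, shape (4, 4)
feat = np.maximum(feat, 0.0)                        # ReLU activation
pooled = feat.reshape(2, 2, 2, 2).max(axis=(1, 3))  # 2x2 max pooling -> (2, 2)
print(pooled)  # every entry is 6.0 for this ramp image
```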
While algorithms infer information from images, different extraction methods can be developed according to the characteristics of the image. Generally, the pixel values of images are used in classification: algorithms read images as combinations of pixel values, so changing the pixel values also changes the image. The pixel values become the inputs/features of the neural network; the model reads them for an image and performs feature extraction and classification. A loss function is then defined that measures how far the model's predictions are from the true labels, providing a metric to evaluate the model's performance. If the accuracy is not at the desired level, back-propagation and optimization techniques are applied to update the weights so as to minimize the loss; stochastic gradient descent or its variants can be used here [
27]. These steps are used in the training phase of the deep learning model. Once the model is trained, the trained network can be used to make inferences from new images. That is, new, unseen data can be predicted using the features learned by the model. This is usually performed on test data to evaluate the applicability and generalization capabilities of the model. Apart from the models we use, there are many studies in the literature, especially on GoogleNet and MobileNet.
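The loss-then-update cycle described above can be sketched with a one-weight toy model; the data and learning rate are illustrative, not taken from the paper's experiments.

```python
import numpy as np

# Minimal training-loop sketch: a loss function measures the error and
# gradient descent updates the weight to reduce it.
# Model: y = w * x (one weight), squared-error loss, true weight w* = 3.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 3.0 * x

w, lr = 0.0, 0.1
losses = []
for epoch in range(50):
    pred = w * x
    loss = np.mean((pred - y) ** 2)        # loss function
    grad = np.mean(2 * (pred - y) * x)     # dLoss/dw (back-propagation)
    w -= lr * grad                         # gradient-descent weight update
    losses.append(loss)
print(round(w, 3))  # converges toward 3.0
```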
GoogLeNet is a complex architecture due to the Inception modules in its structure. With 22 layers, it won the ImageNet 2014 (ILSVRC) classification competition with a top-5 error rate of 6.67%. It was one of the first CNN architectures to move away from simply stacking convolution and pooling layers on top of each other in a sequential structure. The design also has a significant impact on memory and power utilization: parallel, interconnected Inception modules were used to avoid excessive power consumption [
28].
MobileNet, like the other models, is an efficient convolutional neural network for image recognition applications. It uses depthwise separable convolutions and has 28 layers when the depthwise and pointwise convolutions are counted separately. This significantly reduces the number of parameters compared to regular convolutional networks of the same depth: a depthwise separable convolution splits a filter's spatial and depth (channel) dimensions. In addition, MobileNet provides two simple global hyperparameters that efficiently trade off between latency and accuracy. The network structure is another factor that improves performance, and the model requires relatively little computational power to run or to use for transfer learning [
29].
3.3. VGG Architecture
VGG (Visual Geometry Group) is a deep learning algorithm and one of the many network models that emerged after the success of AlexNet. It is a network of 13 convolutional and 3 fully connected layers used by the University of Oxford Visual Geometry Group to achieve higher success rates in the ILSVRC-2014 competition. There are 41 layers in total, including max-pooling, ReLU, fully connected, dropout and softmax layers in the network structure. In this architecture, the image fed to the input layer is 224 × 224 × 3 in size, and the last layer is the classification layer [4]. The VGG architecture uses a simple, uniform structure instead of many hyperparameters, which also simplifies the neural network design. The names VGG16 and VGG19 in the literature are distinguished by the number of layers.
Figure 5 shows the VGG algorithm adapted to our model.
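A defining design choice of VGG is stacking small 3 × 3 filters: two stacked 3 × 3 convolutions cover the same receptive field as one 5 × 5 convolution but with fewer parameters. The sketch below verifies this arithmetic; the channel count is an illustrative assumption, and biases are ignored.

```python
# Why VGG stacks small 3x3 filters: the receptive field grows with depth
# while the parameter count stays below that of one large filter
# (illustrative count for C input and C output channels, no biases).
def receptive_field(num_3x3_layers):
    # each additional 3x3 conv with stride 1 adds 2 to the receptive field
    return 1 + 2 * num_3x3_layers

C = 64
params_two_3x3 = 2 * (3 * 3 * C * C)   # two stacked 3x3 conv layers: 18*C^2
params_one_5x5 = 5 * 5 * C * C         # one 5x5 conv layer: 25*C^2

print(receptive_field(2))               # 5 -> same field of view as one 5x5
print(params_two_3x3 < params_one_5x5)  # True
```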
3.4. ResNet Architecture
The ResNet architecture has a different structure from architectures such as VGG and AlexNet. Its micro-architecture is built from modules in which some transitions between layers can be skipped, passing activations directly to a deeper layer. With these features, ResNet succeeded in pushing success rates to higher levels, and it won the ILSVRC competition in 2015. By introducing the concept of residual learning to CNNs, it made a 152-layer convolutional model trainable and provided an effective method for training very deep networks; with this, ResNet was the first architecture to exceed human-level performance on this task. The most important feature distinguishing ResNet from classical models is the residual connection: blocks add their input to the output that has passed through the linear and ReLU layers and feed the sum to the next layers, producing a model that trains faster. Each residual block contains two 3 × 3 convolution filters, and a stride of 2 is used for downsampling. Since the model becomes harder to optimize as it gets deeper, ResNet's solution is the skip connection, which takes the activation from one layer and feeds it directly to a later layer.
Figure 6 shows the transition between layers of the ResNet architecture.
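The skip connection can be sketched with plain matrices standing in for the block's two convolutions; the weights and sizes below are illustrative, not a real ResNet block.

```python
import numpy as np

# Residual (skip) connection sketch: the block output is ReLU(F(x) + x),
# so the layers only have to learn the residual F. Random matrices stand
# in for the two 3x3 convolutions of a real residual block.
rng = np.random.default_rng(0)
x = rng.normal(size=16)

W1 = rng.normal(size=(16, 16)) * 0.1
W2 = rng.normal(size=(16, 16)) * 0.1

def relu(v):
    return np.maximum(v, 0.0)

f_x = W2 @ relu(W1 @ x)   # the "residual branch" F(x)
out = relu(f_x + x)       # skip connection adds the input back

# If the residual branch learned nothing (all-zero output), the block still
# passes its input through -- this is what eases the training of deep nets.
identity_out = relu(np.zeros(16) + x)
print(np.allclose(identity_out, relu(x)))  # True
```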
3.5. DenseNet Architecture
During neural network training, feature maps shrink due to convolution and subsampling operations, and some image properties are lost in the transitions between layers. The DenseNet architecture was developed to use image features more effectively. Owing to its connectivity structure, DenseNet connects each layer forward to all subsequent layers: each layer uses the feature maps of all previous layers as input and passes its own feature maps to all subsequent layers. Another important feature of DenseNet is that it reduces the number of parameters while enabling feature propagation. DenseNet is thus one of the architectures that makes the best use of feature reuse, and the propagation rate of features within the network is quite high [
30].
Figure 7 shows the layers of the DenseNet architecture.
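The dense connectivity pattern implies a predictable growth of channel counts, which the short sketch below traces; the growth rate and block depth are illustrative values, not DenseNet's published configuration.

```python
# Dense connectivity sketch: each layer receives the feature maps of ALL
# previous layers as input, so channel counts grow by a fixed "growth rate".
growth_rate = 32        # channels each layer adds (hypothetical value)
channels_in = 64        # channels entering the dense block

input_channels = []
c = channels_in
for layer in range(4):          # a 4-layer dense block
    input_channels.append(c)    # this layer sees all features produced so far
    c += growth_rate            # and contributes growth_rate new channels

print(input_channels)  # [64, 96, 128, 160]
print(c)               # 192 channels leave the block
```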
3.6. SqueezeNet Architecture
Compared to the AlexNet architecture, the SqueezeNet model uses far fewer parameters while providing similar accuracy, and it fits in a small amount of memory by using a feature compression method. SqueezeNet is one of the leading models developed for classification with convolutional neural networks, which are popular in image processing. It was first introduced in 2016 in the paper “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”. Its main goal is to achieve the same level of accuracy with far fewer parameters than typical large CNN models. The SqueezeNet architecture also runs faster than comparable algorithms because its efficiently designed layers reduce the workload in the neural network [
31].
Figure 8 shows the layers and connections of the SqueezeNet architecture.
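The parameter saving of SqueezeNet's squeeze/expand ("fire") design can be illustrated by counting weights; the channel sizes below are illustrative choices, not the exact published configuration, and biases are ignored.

```python
# Parameter-count sketch of a SqueezeNet-style "fire" module versus a plain
# 3x3 convolution (no biases; the concrete channel numbers are illustrative).
c_in, c_out = 128, 128
squeeze = 16                       # 1x1 "squeeze" channels
e1, e3 = c_out // 2, c_out // 2    # 1x1 and 3x3 "expand" channels

params_fire = (1 * 1 * c_in * squeeze     # squeeze 1x1 layer
               + 1 * 1 * squeeze * e1     # expand 1x1 branch
               + 3 * 3 * squeeze * e3)    # expand 3x3 branch
params_plain = 3 * 3 * c_in * c_out       # ordinary 3x3 convolution

print(params_fire, params_plain)          # 12288 147456
print(params_plain // params_fire)        # 12x fewer parameters here
```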
3.7. Machine Learning Classifiers
Machine learning is an application of artificial intelligence that allows computers to learn and improve on their own by accessing the data provided to them. Machine learning can also be defined as the process of teaching a model to make accurate predictions with the correct parameter values after filtering the data with different feature extraction techniques. Machine learning problems fall into three main categories: supervised, unsupervised and reinforcement learning; in addition, semi-supervised learning models are also used. This categorization reflects how the data corresponding to each learning method are processed and analyzed. The literature contains many machine learning algorithms; Support Vector Machines (SVM), Linear Discriminant Analysis (LDA), K-Nearest Neighbor (k-NN), Naive Bayes (NB) and Decision Tree (DT) are among the most used [
22].
Support Vector Machines: The main purpose of the SVM method is to map samples that are not linearly separable into a higher-dimensional space, using different kernel functions, where they become separable. The key issue here is the role of kernel functions in the transition from linearity to nonlinearity. The best-known kernel functions are the polynomial, linear, sigmoid and radial basis functions [
32]. Although linear kernel functions are often used for large feature sets, quadratic kernels are a common type of polynomial kernel.
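The kernel functions named above can be written out for a pair of sample vectors; the vectors and the hyperparameter values (degree, gamma, coef0) are illustrative choices.

```python
import numpy as np

# The four kernels named above, evaluated on two illustrative vectors.
x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

k_linear  = x @ y                                  # linear kernel
k_poly    = (x @ y + 1.0) ** 2                     # quadratic polynomial kernel
k_rbf     = np.exp(-0.5 * np.sum((x - y) ** 2))    # radial basis, gamma = 0.5
k_sigmoid = np.tanh(0.1 * (x @ y))                 # sigmoid kernel

print(k_linear, k_poly)  # -1.5 0.25
```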
K-Nearest Neighbor: This algorithm is one of the most easily understood classification methods, working with a heuristic approach: unlabeled objects are assigned to the class of the most similar labeled examples. The class label of an input feature vector is determined by its closest neighbors, and the most basic rule is to measure the distance between the input vector and the training samples using the Euclidean distance [
33].
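The Euclidean-distance rule can be sketched directly; the training points and labels below are illustrative toy data.

```python
import numpy as np

# Minimal k-NN sketch: a query point receives the majority label among
# its k nearest training samples under the Euclidean distance.
train_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])   # two illustrative classes

def knn_predict(query, k=3):
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(dists)[:k]                        # k closest samples
    votes = np.bincount(train_y[nearest])
    return int(np.argmax(votes))

print(knn_predict(np.array([0.5, 0.5])))  # 0 (near the first cluster)
print(knn_predict(np.array([5.5, 5.5])))  # 1 (near the second cluster)
```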
Decision Tree: Decision Tree (DT), which is an inductive learning method, consists of a root node for a dataset, several connected internal nodes and leaf nodes for the remaining parts. Here, each leaf node corresponds to a decision, while all other nodes correspond to feature matching. Each non-leaf node in the created algorithm contains a subset. The data samples are divided into sub-nodes according to the feature matching results. Here, the part known as the root node covers the entire dataset. The easiest way to construct a Decision Tree is to split feature fields.
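The feature-splitting step at the heart of a Decision Tree can be sketched for a single feature; the data and the use of Gini impurity as the split score are illustrative assumptions (other criteria, such as information gain, are also common).

```python
import numpy as np

# Sketch of the core Decision Tree operation: choose the feature threshold
# that best splits the data, scored here with Gini impurity.
X = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])   # one illustrative feature
y = np.array([0,   0,   0,   1,   1,   1])

def gini(labels):
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return 1.0 - np.sum(p ** 2)

best_thr, best_score = None, float("inf")
for thr in (X[:-1] + X[1:]) / 2:                 # candidate split points
    left, right = y[X <= thr], y[X > thr]
    score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    if score < best_score:
        best_thr, best_score = thr, score

print(best_thr, best_score)  # 5.5 0.0 -- a perfect split of the two classes
```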
Logistic Regression: Logistic regression is a statistical method used to understand and classify complex and fuzzy events. The logistic function used here and applied as a machine learning technique is actually an analysis method used for classification. Although it is called regression, it is frequently used especially in linear classification problems [
34].
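The logistic function underlying this classifier can be sketched in a few lines; the weights, bias and input below are illustrative stand-ins for values a trained model would have learned.

```python
import numpy as np

# Logistic regression sketch: a linear score passed through the logistic
# (sigmoid) function yields a class probability; 0.5 is the usual boundary.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.5, -2.0])   # illustrative "learned" weights
b = 0.25                    # illustrative bias

x = np.array([2.0, 1.0])
p = sigmoid(w @ x + b)      # probability of the positive class
label = int(p >= 0.5)
print(round(float(p), 3), label)  # 0.777 1
```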
Naïve Bayes: Bayes' theorem, put forward by Thomas Bayes, relates the conditional and marginal probabilities of two random events and is generally used to calculate unknown probabilities from known ones. Naive Bayes classifiers, which build on this theorem, can be trained very efficiently in a supervised learning setting [
35].
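Bayes' rule can be worked through with illustrative numbers in the diagnostic setting of this paper; the prior and test rates below are made-up values, not statistics from the dataset.

```python
# Bayes' theorem sketch with illustrative numbers:
# P(disease | positive test) from the prior, sensitivity and false-positive rate.
p_disease = 0.01            # prior P(D)
p_pos_given_d = 0.95        # P(+ | D), the test's sensitivity
p_pos_given_not_d = 0.05    # P(+ | not D), the false-positive rate

# Total probability of a positive result, then Bayes' rule.
p_pos = p_pos_given_d * p_disease + p_pos_given_not_d * (1 - p_disease)
p_d_given_pos = p_pos_given_d * p_disease / p_pos

print(round(p_d_given_pos, 3))  # about 0.161
```

Even a sensitive test yields a modest posterior when the disease is rare, which is why base rates matter in medical classification.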
3.8. Ensemble Learning
Ensemble learning builds a model by training multiple learners as members of a committee instead of training a single learner. The aim is for the combined predictions of the models to yield a more accurate decision than any individual prediction [
36]. The success of these methods is evaluated according to the learning performance of the base learners and their diversity. Ensemble learning increases model performance by exploiting the strengths and discounting the weaknesses of the individual learners, while at the same time eliminating the risk of a single bad choice [
37]. In this research, the results obtained by using the voting method, one of the ensemble learning methods, were compared with other methods.
Voting Method: One or more classification algorithms can be trained with the same training set, or a single model can be trained on the same dataset with different parameter values. Different classification models are created in this way, and the final output is produced by a vote over all of their outputs.
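Hard (majority) voting can be sketched as follows; the per-classifier predictions below are illustrative, and the classifier names in the comments are only examples.

```python
import numpy as np

# Majority (hard) voting sketch: each trained classifier predicts a class
# for each sample, and the most frequent vote becomes the ensemble output.
predictions = np.array([
    [0, 1, 2, 2],   # classifier 1 (e.g. SVM)
    [0, 1, 1, 2],   # classifier 2 (e.g. k-NN)
    [0, 2, 2, 2],   # classifier 3 (e.g. Decision Tree)
])

# Vote column by column (one column per sample).
ensemble = np.array([np.bincount(col).argmax() for col in predictions.T])
print(ensemble)  # [0 1 2 2]
```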
3.9. Parameter Optimization
In deep learning algorithms, hyperparameters are special parameters that control the learning process of the model and need to be tuned. These hyperparameters are the values that drive the architecture of the network and the training process and can influence the success of a particular deep learning model. Correctly tuning the hyperparameters can help the model achieve better performance [
38].
Parameter optimization has become increasingly necessary in the development of deep learning models in recent years, as networks have grown while the goal remains the best accuracy with as few weights and parameters as possible. Because choosing hyperparameters is difficult, adapting them to experimental settings is also difficult, and tuning them is a complex, carefully designed process. For widely used models, hyperparameters can be set manually because researchers can draw on previous studies, and for small-scale models manual adjustment is feasible. For larger or newly published models, however, finding good hyperparameters requires a great deal of experimentation [
39].
Hyperparameters can be divided into two groups: those used for model training and those used for model design. Choosing appropriate training hyperparameters enables neural networks to learn faster and achieve better performance. The most widely adopted optimization algorithms for training a deep neural network are stochastic gradient descent with momentum, as well as AdaGrad, RMSprop and Adam. Batch size and learning rate are the most important factors, as they determine the convergence rate of the neural network during training. Hyperparameters used for model design relate more to the structure of the network; the most typical examples are the number of hidden layers and the width of the layers [
40]. To explain the most important parameter values:
Learning Rate: This hyperparameter controls the amount by which the weights of the network are updated. A high learning rate updates the weights quickly but risks overshooting the minimum, while a low learning rate can slow down the learning process. In most cases, the learning rate must be tuned manually during model training, and this tuning is often necessary to achieve high accuracy [
41].
Epoch Number: An epoch means that the entire training set has been presented to the model once, so the number of epochs determines how many times the model sees all the training data. Too many epochs can lead to overfitting, while too few may not allow the model to finish learning.
Mini-Batch Size: The batch size is the number of samples used in each training iteration. Small batch sizes generally speed up each training step but can affect overall model performance. The randomly sampled training subsets used by stochastic gradient descent are called mini-batches, and the gradient is computed over the samples in each mini-batch.
The process of hyperparameter tuning usually involves trial and error. By trying different hyperparameter values, one tries to find the combination that provides the best performance. This process is important to increase the generalizability of the model and avoid overfitting to the training data.
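Grid search, the tuning strategy used in this study, exhaustively tries every combination on a grid and keeps the best-scoring one. In the sketch below, the grid values are illustrative and the toy scoring function stands in for actually training and validating a model with each setting.

```python
import numpy as np

# Grid search sketch: try every hyperparameter combination and keep the one
# with the best validation score.
learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [16, 32, 64]

def validation_score(lr, batch):
    # Hypothetical stand-in for "train the model with (lr, batch) and
    # return its validation accuracy"; peaks at lr=0.01, batch=32.
    return -((np.log10(lr) + 2) ** 2) - ((batch - 32) / 32) ** 2

best = max(
    ((lr, b) for lr in learning_rates for b in batch_sizes),
    key=lambda cfg: validation_score(*cfg),
)
print(best)  # (0.01, 32) maximizes the toy objective
```

In practice the grid grows multiplicatively with each added hyperparameter, which is why grid search is usually restricted to a small set of candidate values.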
3.10. Evaluation Criteria
Artificial intelligence applications work on the principle of trial, feedback, correction and result. Before the research, a model is created and feedback on the validity of the model is checked. Afterwards, necessary improvements are made and the model is expected to reach the expected accuracy. Test results are measured with different metric values. The performance of the model is determined according to the results obtained from here. Evaluation criteria play a very important role in comparing different models and distinguishing results.
Various performance measures are used to estimate success rates in classification. The best-known criterion for classification problems is the accuracy (ACC) metric. However, accuracy alone is not always conclusive; other metrics are needed for a more precise and reliable analysis [
42].
When the studies in the literature are examined, the precision (Prec), sensitivity (Recall) and F1 score (F1) metrics are observed alongside the accuracy metric. These values can be calculated from the confusion matrix, whose entries are the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) counts of the classification results [
43].
Table 2 shows the components of the confusion matrix.
A confusion matrix is a table that is often used to numerically determine the performance of a classification method on a test dataset where the actual values are known. The results of the confusion matrix on the data sets we used in the study are as follows:
TP (True Positive): a sick person correctly identified as sick.
FP (False Positive): a healthy person incorrectly identified as sick.
TN (True Negative): a healthy person correctly identified as not sick.
FN (False Negative): a sick person incorrectly identified as not sick.
Here, the sensitivity (recall) metric gives the percentage of sick people who are correctly identified, while the precision metric shows what percentage of those labeled sick are actually sick. These two metrics pull against each other, and the F1 score is used to resolve the resulting ambiguity; it uses the harmonic mean of precision and recall rather than the arithmetic mean. In some cases one side of this trade-off matters more than the other. It is more acceptable to mislabel a healthy person as having cancer and call them to the hospital than to miss a real cancer patient and endanger their life; however, calling everyone to the hospital would find all cancer cases at the cost of many false positives. Precision works the other way: it rewards being absolutely sure before labeling someone as sick, so a single confirmed diagnosis yields 100% precision even if all the remaining patients go undetected, which is a serious error.
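The four metrics discussed above follow directly from the confusion-matrix counts. A small sketch (the counts below are made up for illustration, not results from the study):

```python
# Compute accuracy, precision, recall (sensitivity) and F1 score
# from confusion-matrix counts.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                              # sensitivity
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical counts for a two-class (sick / not sick) test set.
acc, prec, rec, f1 = metrics(tp=90, fp=10, tn=85, fn=15)
print(f"ACC={acc:.2f} Prec={prec:.2f} Recall={rec:.2f} F1={f1:.2f}")
```

Note how precision and recall use different denominators: precision divides by everyone *labeled* sick (TP + FP), while recall divides by everyone who *is* sick (TP + FN), which is exactly the trade-off described above.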
The applications were developed in Python 3.12.1 using open-source libraries on the Google Colab platform, which provides free GPU support; this language was used to train the models and classify the results. The study was conducted on a laptop with an Apple M1 Pro processor (8-core CPU, 14-core GPU, 16-core Neural Engine), 200 GB/s memory bandwidth, a 3024 × 1964 display at 254 pixels per inch, 32 GB unified memory and a 1 TB SSD.
4. Experimental Results
In this part of the research, the classification results for the benign and malignant tumors in the dataset described in the method section are presented. First, precision, sensitivity, F1 score and accuracy were calculated from the results obtained with classical machine learning methods, and confusion matrices were created. In addition, graphs showing the accuracy rates of the CNN algorithms across epochs are presented.
In order to increase the performance of the individual classifiers, the performance criteria were first calculated with the default values of the CNN algorithms; the results were then re-examined after parameter optimization. In addition, the ensemble learning method was applied, and a confusion matrix was created using the voting method. All parameter values, model accuracy percentages and model comparisons made during training are shown, and the optimum parameters for each dataset are given together with the classification algorithms. These values were found through many experiments, with calculations based on the optimum settings. In deep learning methods, the number of epochs is usually determined by the problem to be solved; in our study, for example, it could have been set to 50 or 100, but increasing the epoch count was observed to lower the success rate, while also incurring higher computational cost and consuming more processing power. Therefore, all values were set to their optima.
4.1. Machine Learning Method and Results
Here, results for brain MR image classification were first obtained using our dataset. Our purpose was to compare results obtained using pre-trained CNNs as feature extractors with different ML algorithms for classification. Feature extraction was performed with the four architectures, each pre-trained on ImageNet, which allowed us to leverage the knowledge captured by these models in the form of learned features. The effectiveness of each pre-trained model was evaluated by measuring key performance metrics such as F1 score, recall, precision and accuracy. The machine learning algorithms LDA, SVM, K-NN, DT and NB use a model learned from the split dataset to make predictions or classifications for new data points. The models were trained using the Scikit-Learn library with default parameters. In addition, a vector of 30,056 features was created by combining the features obtained separately from the four CNN models (VGG, DenseNet, ResNet, SqueezeNet); this is the total number of features extracted from the MR images across the models. These newly combined features were then trained with the same classification algorithms.
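The hybrid pipeline described above can be sketched as follows. This is not the study's code: in the real pipeline the per-model blocks come from pre-trained CNNs, whereas here random, label-shifted vectors stand in for them so the sketch is self-contained; the SVM uses Scikit-Learn defaults as in the study.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 300
labels = rng.integers(0, 2, size=n)  # toy labels: 0 = benign, 1 = malignant

# Per-model feature blocks (synthetic stand-ins for CNN-extracted features,
# shifted by the label so the classes are separable).
feats = {
    "VGG": rng.normal(loc=labels[:, None], size=(n, 64)),
    "DenseNet": rng.normal(loc=labels[:, None], size=(n, 64)),
    "ResNet": rng.normal(loc=labels[:, None], size=(n, 64)),
    "SqueezeNet": rng.normal(loc=labels[:, None], size=(n, 64)),
}

# Concatenate the four blocks into one hybrid feature vector per image,
# mirroring the 30,056-dimensional combined vector described in the text.
X = np.hstack(list(feats.values()))

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.4,
                                          random_state=0)
clf = SVC()            # default Scikit-Learn parameters, as in the study
clf.fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```

The same `X` can be handed to any of the other classifiers (LDA, K-NN, DT, NB) unchanged, which is what makes the per-classifier comparison in Table 3 straightforward.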
Table 3 shows the accuracy rates of the models and classification algorithms used in feature extraction.
Looking at the model and classification accuracy rates, the SVM classifier gave the highest accuracy for every model, and the features provided by the DenseNet algorithm showed the highest success rate when classified with SVM. In contrast, all classifications with DT showed very low accuracy. This may be because a single decision tree fits the training features too closely and fails to generalize to the test data; in other words, overfitting may occur.
Furthermore, the ML classifiers were applied to the hybrid method we developed. This method, whose results were confirmed through numerous experiments, classifies the trained CNN algorithms by combining their features into binary, ternary and quadruple combinations.
Figure 9 shows the DenseNet + SVM confusion matrix and
Figure 10 shows the SqueezeNet + DenseNet + ResNet + VGG confusion matrix.
Figure 11,
Figure 12,
Figure 13 and
Figure 14 show performance metrics for all classification methods. The figures show that all classification accuracy percentages vary. There may be many reasons for this. For example, the DenseNet + DT classification shows a low accuracy rate in general, but the SVM classification of the same model shows the highest accuracy rate. Therefore, only the confusion matrix for this classification is shown. In addition, the accuracy rate of the hybrid model, whose confusion matrix is shown in
Figure 10, was found to be 83%. As a result of our initial training and classification studies, we can say that the results we found are quite successful considering similar studies in the literature. Based on these promising results, we further investigated the effectiveness of combining multiple features extracted by different pre-trained CNN models. In the second phase of our study, we optimized our trained data with the most appropriate parameter values to obtain more successful results.
4.2. Results of CNN Models before–after Parameter Optimization
Parameter optimization is a process used to improve the performance of a machine learning model. In this process, selecting or tuning specific parameter values is critical to achieving the best performance. In this section, the accuracy rates of the CNN algorithms after optimization with the most appropriate parameter values are calculated. Many trials were performed for each CNN algorithm; during these experiments, the number of epochs, learning rate, number of layers and activation functions were varied repeatedly. In addition, while performing parameter optimization, model overfitting and underfitting, cross-validation and training times were all meticulously tested.
Figure 15,
Figure 16,
Figure 17 and
Figure 18 show the confusion matrices of the CNN algorithms before parameter optimization. These four models, pre-trained on the ImageNet dataset, were trained after adding a classification layer to each. Each model was trained for 30 epochs. For model selection, the accuracy on the evaluation set was tracked during training, and the checkpoint from the epoch with the highest evaluation accuracy was selected.
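The checkpoint-selection rule just described reduces to a one-line maximum over the per-epoch evaluation accuracies. A sketch (the accuracy values are synthetic stand-ins, not numbers from the study):

```python
# Keep the checkpoint from the epoch with the highest evaluation accuracy.
# Hypothetical per-epoch evaluation accuracies from a training run:
eval_accuracy = [0.71, 0.80, 0.86, 0.91, 0.89, 0.93, 0.92, 0.90]

# Epochs are 1-indexed; pick the one whose accuracy is largest.
best_epoch = max(range(len(eval_accuracy)), key=lambda e: eval_accuracy[e]) + 1
print(best_epoch)  # → 6: the checkpoint saved after epoch 6 is restored
```

Note this selects by *evaluation* accuracy, not training accuracy, which is what protects the chosen checkpoint from the overfitting seen in the later epochs of the synthetic curve above.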
Section 3.9 provides some information on parameter optimization. Here,
Table 4 and
Table 5 show the parameter values and some optimization functions common to all calculations. Our experiments show that the adaptive moment estimation (Adam) and stochastic gradient descent (SGD) functions give the best results. Adam is a widely used optimization algorithm, especially in deep learning models; it is a gradient-based method designed to speed up the learning process and make it more efficient. The parameter update rule of the Adam function is as follows:
- (1) First, the first and second moments of the gradient are calculated: the moving average of the gradient (momentum) and the moving average of the squared gradient (RMSProp): m_t = β₁m_{t−1} + (1 − β₁)g_t, v_t = β₂v_{t−1} + (1 − β₂)g_t².
- (2) The calculated moments are corrected with bias-correction terms: m̂_t = m_t/(1 − β₁^t), v̂_t = v_t/(1 − β₂^t).
- (3) The parameters are updated: θ_t = θ_{t−1} − α·m̂_t/(√v̂_t + ε).
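A minimal sketch of the Adam update steps (1)-(3) above, using the commonly cited default hyperparameters (α = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8); the quadratic test function is an illustrative assumption, not part of the study:

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # (1) first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2     # (1) second moment (RMSProp)
    m_hat = m / (1 - beta1**t)                # (2) bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # (3) update
    return theta, m, v

# Usage: minimize f(theta) = theta^2 starting from theta = 1.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):                      # t is 1-indexed for step (2)
    grad = 2 * theta                          # gradient of theta^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # approaches 0
```

The bias correction in step (2) matters most early in training: at t = 1, dividing by (1 − β₁) rescales the tiny first moving average up to the full gradient magnitude.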
Here,
Table 4 shows the parameter values that are used.
Here,
Table 5 shows which CNN algorithm uses which optimization function and their accuracy rates.
Another optimization function used in the experiments, SGD, serves to improve model performance just like the Adam function. The most important feature of grid search optimization is that it systematically tries all combinations within a given set of hyperparameters to find the best-performing one. Grid search forms a grid by specifying a set of parameters and their candidate values; it then trains the model with each combination on this grid, evaluates each combination with a chosen performance metric, and keeps the combination with the best performance. The parameters for each CNN architecture used in the experiments were searched with grid search. As a result of these experiments, learning rate values of 0.0001, 0.001, 0.01 and 0.1 for the Adam and SGD functions seen in
Table 5 were evaluated. In addition, a patience (tolerance) value of five was used for the evaluation loss obtained in each training epoch: if the evaluation loss did not decrease for five consecutive epochs, the model was considered to have memorized the training data and training was stopped.
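The two mechanisms just described, a grid search over the learning-rate candidates and early stopping with a patience of five, can be sketched together. This is not the study's code: `train_and_evaluate` returns a synthetic loss curve standing in for a real CNN training run.

```python
import math

def train_and_evaluate(lr, max_epochs=30, patience=5):
    """Toy stand-in for model training: returns the best validation loss.
    The loss curve is synthetic; a real run would train the CNN."""
    best_loss, epochs_without_improvement = math.inf, 0
    for epoch in range(1, max_epochs + 1):
        # Synthetic validation loss: lowest when lr is near 0.001,
        # and decaying as epochs accumulate.
        val_loss = abs(math.log10(lr) + 3) + 1.0 / epoch
        if val_loss < best_loss - 1e-9:
            best_loss, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:   # early stopping
            break
    return best_loss

# Grid search: evaluate every candidate and keep the best-performing one.
grid = [0.0001, 0.001, 0.01, 0.1]
best_lr = min(grid, key=train_and_evaluate)
print(best_lr)  # → 0.001 for this synthetic curve
```

In a full grid search, the learning-rate axis would be crossed with the other hyperparameter axes (optimizer, batch size), and each combination would run with the same early-stopping rule.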
As a result of optimization, the ResNet architecture was observed to reach an accuracy rate of 100%. The factors affecting the model, such as the learning rate, number of epochs and optimization function, were found through many trials.
Figure 19 shows the accuracy rates after parameter optimization.
Figure 20 shows the learning curve of the ResNet architecture across epochs after optimization. By the end of the 30th epoch, the model reaches 100% learning success; further training would not change the model accuracy.
Figure 21 also shows the confusion matrix of the ResNet architecture, since it gives the best results. Because the appropriate activation function and number of layers were chosen, the learning ability of the model is maximized. In addition, adjusting the learning rate and the other hyperparameters (number of epochs, mini-batch size) allows the model to learn faster and more effectively. Thus, when the performance metrics in
Table 6 are analyzed, it is seen that all the results are accurate.
4.3. Ensemble Learning Results
Here, using the ensemble learning method, the decisions of the four models were combined into a single decision by voting. The ensemble model exploits the strengths of the individual features and yields a more robust and comprehensive representation of the image models. The features extracted from the four CNN networks used in our model were combined into a hybrid model, and the accuracy rates were calculated for this single model. The result is the confusion matrix shown in
Figure 22 and the results in
Table 7. When this hybrid method was tested on the test set, it achieved 99% accuracy. Because the model presented here was developed as a hybrid that combines features from several CNN algorithms, it brings a different perspective and an element of innovation to studies in this field. The result obtained is quite successful.
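The voting step above can be sketched as hard majority voting over the class predictions of the four models (not the study's code; the predictions below are made-up toy values):

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (n_models, n_samples) integer class labels.
    Returns, per sample, the class that received the most votes."""
    preds = np.asarray(predictions)
    n_classes = preds.max() + 1
    # Count, for each sample (column), how many models voted for each class.
    votes = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return votes.argmax(axis=0)

# Toy example: four models, five test samples, 0 = benign, 1 = malignant.
preds = [
    [0, 1, 1, 0, 1],  # e.g. VGG
    [0, 1, 0, 0, 1],  # e.g. DenseNet
    [1, 1, 1, 0, 1],  # e.g. ResNet
    [0, 0, 1, 0, 1],  # e.g. SqueezeNet
]
print(majority_vote(preds))  # → [0 1 1 0 1]
```

Note that with an even number of voters a 2-2 tie is possible; `argmax` then resolves it in favor of the lower class index, so a tie-breaking rule (e.g. deferring to the strongest individual model) may be worth making explicit.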
5. Conclusions
The most important issue in medicine is the early diagnosis and treatment of diseases. Therefore, early diagnosis is vital, especially in cancer. It is very important for patients to start the treatment process early and manage the entire process correctly. This critical situation constitutes the basic principle of this study. The main purpose of this study is to assist healthcare professionals by utilizing artificial intelligence technologies applied in the field of healthcare. In conclusion, the research presents a comprehensive and innovative approach to the classification and diagnosis of brain tumors using artificial intelligence, specifically employing convolutional neural networks (CNNs) such as VGG, ResNet, DenseNet and SqueezeNet.
The study utilized a sizable dataset of 7022 brain MR images obtained from the Kaggle library, which was split into 60% for training and 40% for testing to ensure an unbiased evaluation. When the features of the four CNN architectures were classified with machine learning methods, the highest accuracy, 85%, was obtained with the SVM classifier on the DenseNet features. In addition, a success rate of 83% was achieved by classifying the hybrid feature set created from the four CNN architectures with LDA.
In the second part of the experiments, the ResNet architecture reached 99% accuracy with its default parameter values, before parameter optimization. A 100% success rate was then achieved by optimizing the ResNet parameter functions and re-applying the model to the test set. When the studies in the literature on this dataset and the ResNet model are examined, such a high success rate appears to have been achieved for the first time. Finally, ensemble learning was applied to the classification, and a 99% success rate was achieved with the voting method. The utilization of ensemble learning methods added another layer of sophistication to the classification process, ultimately contributing to the identification of the most effective validation method. The entire study was conducted using the Python programming language, emphasizing the adaptability and versatility of AI applications in the biomedical field.
The findings of this study not only contribute to the growing body of knowledge in the domain of biomedical image processing but also highlight the potential of artificial intelligence, particularly ResNet architecture, in achieving highly accurate classifications in the context of brain tumor diagnosis. The 100% accuracy rate attained by ResNet underscores its robustness and effectiveness in handling complex medical imaging tasks.
In comparison with the existing literature, the research results were benchmarked, showcasing competitive or superior performance. The systematic evaluation and comparison of various architectures and machine learning methods contributes to a deeper understanding of their applicability in real-world scenarios. Overall, this research underscores the promising role of artificial intelligence in advancing diagnostic capabilities in the field of medical imaging, offering new possibilities for accurate and efficient brain tumor classification.