We assess several performance indicators, including overall accuracy, F1 score, recall, and precision, to determine each model’s performance in classifying the different dementia stages. The F1 score is the harmonic mean of recall and precision, whereas accuracy measures the proportion of correctly classified individuals. After the models were developed, these metrics were used to tune their parameters. Five-fold cross-validation is used to determine the performance of each model; performance is assessed in a multi-class setting and summarized by the confusion matrix. In the results section, we present the loss and accuracy curves, the confusion matrix, and the table of classification scores for the best fold of the 5-fold cross-validation for every model. Reporting the best fold rather than the outputs of all five folds keeps the results uncluttered and easier to follow. Moreover, predicting the stage of AD and dementia is crucial, since knowing the stage gives physicians a more comprehensive understanding of the disease’s impact on the patient.
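The 5-fold protocol described above can be sketched as follows. This is a minimal illustration assuming scikit-learn; the `images` and `labels` arrays are hypothetical stand-ins for the MRI data and its four stage labels, not the actual pipeline used in this study.

```python
# Minimal sketch of 5-fold cross-validation (hypothetical data, not the
# study's actual pipeline).
import numpy as np
from sklearn.model_selection import StratifiedKFold

labels = np.array([0, 1, 2, 3] * 25)       # dummy stage labels (4 classes)
images = np.zeros((len(labels), 8, 8))     # dummy image tensor

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for fold, (train_idx, test_idx) in enumerate(skf.split(images, labels)):
    # Here a model would be trained on images[train_idx] and evaluated
    # on images[test_idx]; we record a placeholder per-fold "score".
    fold_scores.append(len(test_idx) / len(labels))

best_fold = int(np.argmax(fold_scores))    # the best-performing fold is reported
```

Stratified splitting preserves the class proportions in every fold, which matters here because the moderate class is heavily under-represented.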
4.4. Evaluation Metrics
For each class of the dataset provided to each model, we calculated the F1 score (4), precision (2), recall (3), and accuracy (1) to evaluate the performance of the proposed techniques. For a given class, a true positive (TP) is an image correctly classified as belonging to that class. A false positive (FP) is an image that belongs to another class but is mistakenly assigned to that class. A false negative (FN) is an image that belongs to the class but is mistakenly assigned to another class. A true negative (TN) is an image that does not belong to the class and is correctly classified as not belonging to it.
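As a concrete reference for Equations (1)–(4), the sketch below shows how the per-class metrics follow from the TP, FP, FN, and TN counts. It uses the standard definitions in plain Python and is not the evaluation code used in this study.

```python
# Per-class metrics from TP/FP/FN/TN counts, following the standard
# definitions of Equations (1)-(4); zero denominators return 0.
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)
```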
In Figure 4, the value of the loss function is plotted against the number of completed iterations. Both the training and testing losses decreased over the first fifty iterations, after which the testing loss leveled off while the training loss continued to fall. This divergence between the training and testing phases after approximately 50 iterations is due to the model’s complexity: the CNN has enough capacity to fit the training data so closely that its ability to generalize to the testing data is hindered.
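This kind of divergence can also be detected programmatically. The sketch below, using toy loss curves rather than the paper’s values, flags the first iteration at which the testing loss plateaus while the training loss is still falling:

```python
# Illustrative overfitting check on toy loss curves (not the paper's data).
def divergence_point(train_loss, test_loss, tol=1e-3):
    """Return the first index where the test loss plateaus while the
    training loss still decreases by more than `tol`, else None."""
    for i in range(1, len(train_loss)):
        test_flat = abs(test_loss[i] - test_loss[i - 1]) < tol
        train_falling = train_loss[i - 1] - train_loss[i] > tol
        if test_flat and train_falling:
            return i
    return None

train = [1.0, 0.6, 0.4, 0.3, 0.2, 0.1]   # keeps decreasing
test  = [1.0, 0.7, 0.5, 0.5, 0.5, 0.5]   # plateaus at 0.5
```

In practice such a check is the basis of early stopping: training can be halted near the detected iteration instead of continuing to fit noise.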
The accuracy-versus-iterations curve provides the information needed to validate the performance of the CNN model. In Figure 5, we can see that both the training and testing accuracy increase, indicating that the model’s learning progresses up to about 100 iterations, beyond which the accuracy stays constant when the model is run for more iterations. So, in terms of accuracy, there was no large deviation between the training and testing phases.
The confusion matrix produced for the CNN model when classifying the healthy, very mild, mild, and moderate classes is shown in Figure 6. Although no moderate case is predicted correctly, 445 healthy, 100 very mild, and 16 mild cases are classified correctly. Because the dataset contains only 64 images of the moderate class, precise prediction of this class is difficult.
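For reference, a multi-class confusion matrix like the one in Figure 6 can be produced as follows. This is a toy sketch assuming scikit-learn, with illustrative labels rather than the paper’s 1280 test images:

```python
# Toy multi-class confusion matrix (illustrative labels only).
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["healthy", "very mild", "mild", "moderate"]
y_true = np.array([0, 0, 1, 1, 2, 3])   # ground-truth class indices
y_pred = np.array([0, 0, 1, 0, 2, 0])   # the moderate case is misclassified

cm = confusion_matrix(y_true, y_pred, labels=range(len(classes)))
correct_per_class = np.diag(cm)          # diagonal = correct predictions
```

Rows correspond to true classes and columns to predicted classes, so the per-class counts reported in the text are the diagonal entries of the matrix.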
The F1 score, precision, and recall of each class for the CNN model are detailed in Table 1. The overall accuracy of this model is 43.83%. The F1 scores for the healthy, very mild, mild, and moderate classes are 0.59, 0.30, 0.10, and 0, respectively. The CNN algorithm properly predicted 561 out of 1280 test images. According to Table 1, the class predicted most precisely was healthy, while the class predicted least precisely was moderate dementia.
In Figure 7, the value of the loss function is plotted against the number of completed iterations. Initially, the loss was high because of the randomly initialized parameters, but it decreased as the number of iterations increased. After iteration 40, both the training and testing losses were stable for the VGG16 with additional convolutional layers.
We can verify the performance of the VGG16 model with additional convolutional layers from the accuracy-versus-iteration curve. In Figure 8, we can see that the training and testing accuracy are both around 80%, which shows that the model’s learning progresses until about 40 iterations and remains fixed after that.
In Figure 9, the confusion matrix for the VGG16 model with additional convolutional layers is shown for the healthy, very mild, mild, and moderate classes. The numbers of correctly predicted healthy, very mild, and mild cases are 505, 383, and 23, respectively, while none of the 13 moderate test images is predicted correctly. Since there are only 64 moderate-class images in the dataset, a correct prediction of the moderate class is very unlikely.
Table 2 reports the F1 score, precision, and recall of all the classes for the VGG16 with additional convolutional layers. The overall accuracy of this model is 71.17%. The F1 scores for the healthy, very mild, mild, and moderate classes are 0.83, 0.68, 0.23, and 0, respectively. Out of 1280 test images, this model correctly predicts 911.
Table 2 shows that the class predicted most accurately is healthy, while the class predicted least accurately is moderate dementia. Since there are very few moderate-class images, the model’s performance on this class is very low.
The loss function’s value versus the number of iterations is shown in Figure 10. Although the loss was very high initially, the GCN model’s loss decreased for both the training and testing images. After iteration 18, the loss was fixed at approximately zero for both.
The accuracy-versus-iteration curve gives us the evidence needed to verify the GCN model’s efficacy. Figure 11 shows accuracy rising during both training and testing until it becomes fixed close to 1. This shows that the model’s learning progresses for about 20 iterations, after which the accuracy levels off and remains constant.
The confusion matrix produced by the GCN model for the categorization of the healthy, very mild, mild, and moderate classes is shown in Figure 12. The numbers of correctly predicted healthy, very mild, and mild cases are 640, 448, and 180, respectively, although none of the 12 moderate test images was identified. Even though the GCN model predicted the images of every other category accurately, it was unable to predict the moderate dementia cases.
Table 3 shows the F1 score, precision, and recall of the GCN model for each class. The overall accuracy of this model is 99.06%. The F1 scores for the healthy, very mild, mild, and moderate classes are 1, 0.99, 1, and 0, respectively. Out of a total of 1280 test images, the GCN algorithm correctly predicted 1268. The healthy and mild categories were predicted most accurately, while the moderate dementia category was predicted least accurately.
Figure 13 depicts the loss function’s value versus the number of iterations. Although the loss on the test images was initially quite high, the CNN-GCN model’s loss dropped sharply. After the first couple of iterations, the loss for both the training and testing images settled at approximately zero.
The accuracy versus iteration curve provides the evidence necessary to confirm the efficacy of the CNN-GCN model.
Figure 14 depicts a rise in accuracy during both training and testing until it approaches 1 and stabilizes. This demonstrates that the model learns within the first couple of iterations, after which the accuracy levels off and remains constant.
Figure 15 displays the confusion matrix produced by the CNN-GCN model for the categorization of the healthy, very mild, mild, and moderate classes. The numbers of correctly predicted healthy, very mild, mild, and moderate cases are 641, 448, 179, and 12, respectively. All category images are predicted correctly by the CNN-GCN model.
Table 4 shows the F1 score, precision, and recall of the CNN-GCN model for each class. We obtained an overall accuracy of 100% from the CNN-GCN model. The F1 scores for the healthy, very mild, mild, and moderate classes are 1, 1, 1, and 1, respectively. The CNN-GCN algorithm successfully predicts all of the 1280 images.
Our proposed CNN-GCN model achieves 100% accuracy on both the training and test data, and it may therefore be overfitting the training data by capturing irrelevant and abnormal patterns such as noise and outliers. Consequently, the model may perform poorly when presented with novel, unfamiliar data. Such high accuracy may indicate a lack of generalizability to new contexts or datasets, which is particularly crucial for health-related data, since such data can exhibit significant variability. Hence, it is important to verify the accuracy of the CNN-GCN model on distinct test datasets and to ensure its efficacy in real-world scenarios rather than relying only on controlled experimental settings. We therefore collected a separate dataset on which to apply our proposed CNN-GCN model. Neeraj [28] provided a dataset on Kaggle that consists of 2D images extracted from the ADNI baseline dataset, which originally comprised NIfTI images. The dataset has three distinct classes: AD, MCI, and CN. After applying our proposed CNN-GCN model, we present the confusion matrix for the AD, MCI, and CN classes in Figure 16. There are 8 correctly predicted instances of AD, 21 of MCI, and 15 of CN. The CNN-GCN model predicted all of the category images accurately, which illustrates its potential for use with novel, unfamiliar data.