There are four main types of prediction instance to consider when formulating the above-mentioned evaluation measures: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). These are described in detail below.
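As a minimal illustration (the helper below and its class names are hypothetical, not from the paper), a single prediction can be assigned to one of these four categories with respect to a chosen positive class:

```python
def outcome(actual, predicted, positive):
    """Categorize one prediction as TP, FP, TN, or FN with respect to
    a single positive class (the one-vs-rest view of a multiclass problem)."""
    if predicted == positive:
        return "TP" if actual == positive else "FP"
    return "FN" if actual == positive else "TN"

print(outcome("canker", "canker", positive="canker"))  # TP
print(outcome("dot", "canker", positive="canker"))     # FP
print(outcome("dot", "rust", positive="canker"))       # TN
print(outcome("canker", "rust", positive="canker"))    # FN
```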
4.1.4. Cohen's Kappa Index
The last measure provides statistical confidence in the results, as statistical analysis is used broadly across scientific work. Therefore, a statistical measure that expresses confidence in the confusion-matrix values was calculated for evaluation. The kappa index gives confidence over a range of confusion-matrix-based values. If the index (scaled to 0–100) lies in the range 0–20, only 0–4% of the data are reliable for prediction; 21–39 indicates 4–15% data reliability; 40–59 indicates 15–35%; 60–79 indicates 35–63%; 80–90 indicates strong reliability (64–81%); and above 90 indicates 82–100% reliability. It can be calculated as:
κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement (the proportion of instances on the confusion-matrix diagonal) and p_e is the agreement expected by chance, computed from the marginal totals. For a two-class matrix, the agreement is calculated using C1, C2, R1, and R2, where C1 and C2 represent Columns 1 and 2 and R1 and R2 represent Rows 1 and 2 of the confusion matrix. This is the general two-class form; in the case of our proposed methodology, there are five classes, which extends the formulation to five rows and columns.
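As a sketch of this calculation (the example matrix below is hypothetical, not from the paper), Cohen's kappa can be computed for any square confusion matrix by comparing the observed diagonal agreement with the agreement expected from the row and column totals:

```python
import numpy as np

def cohens_kappa(cm):
    """Cohen's kappa for a square confusion matrix (rows = actual, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    p_o = np.trace(cm) / n  # observed agreement (diagonal proportion)
    # chance agreement from the row/column marginal totals
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical two-class matrix: 45 + 40 correct, 15 wrong
print(cohens_kappa([[45, 5],
                    [10, 40]]))  # 0.7
```

The same function applies unchanged to the paper's five-class matrices, since only the matrix dimensions grow.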
The total augmented data were then split into a 70/30 ratio of training and testing data, where the fine-tuned parameters for each of the five training models remained different and produced different results. This gave 2023 training instances and 866 testing instances.
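A minimal sketch of such a split is below (the 2889-image total is implied by the reported 2023/866 counts; the exact sizes depend on how the 70% cut point is rounded, so this illustration lands one instance off the paper's figures):

```python
import random

def split_train_test(items, train_frac=0.7, seed=42):
    """Shuffle a dataset and split it into training/testing portions."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = split_train_test(range(2889))
print(len(train), len(test))  # 2022 867 with this rounding; the paper reports 2023/866
```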
Five different architectures were applied to classify guava diseases using the augmented image data. The prediction results on the testing data are discussed per class along with the overall results. Accuracy, sensitivity, specificity, precision, recall, and the kappa index were used as statistical evaluation measures. The results are shown in
Table 7.
The individual class and overall results for each model were evaluated using several measures: accuracy, specificity, F1 score, precision, and kappa.
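All of these per-class measures can be derived from one-vs-rest TP/FP/TN/FN counts taken from the multiclass confusion matrix. A sketch is below (the example matrix is hypothetical, not one of the paper's):

```python
import numpy as np

def class_metrics(cm, k):
    """One-vs-rest evaluation measures for class k of a multiclass
    confusion matrix (rows = actual, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = cm[k, k]
    fp = cm[:, k].sum() - tp   # other classes predicted as k
    fn = cm[k, :].sum() - tp   # class k predicted as something else
    tn = cm.sum() - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)    # also called sensitivity
    return {
        "accuracy": (tp + tn) / cm.sum(),
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

m = class_metrics([[8, 2], [1, 9]], k=0)
print({name: round(v, 4) for name, v in m.items()})
```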
For the canker class of the testing data, the first model, AlexNet, showed 98% accuracy, 99% specificity, a 97.60% F1 score, 97.14% precision, and a kappa value of 0.5296. Accuracy is a general measure over all correctly and incorrectly predicted positive and negative instances; the 98% accuracy for the canker class indicates accurate predictions of both positive and negative classes. Specificity, the proportion of true negatives among TN and FP, showed that true negatives were mostly predicted correctly. Precision, TP over TP and FP, was 97.14%, meaning the positive predictions were slightly less accurate. To see the combined effect on the positive class, the F1 score was used; at 97.60%, it summarizes both of the above measures. The kappa index indicated a weak level of agreement for the canker class. The dot class showed 98.04% accuracy, 99.69% specificity, a 98.52% F1 score, 99% precision, and a 0.53 kappa value. Its accuracy was close to that of the canker class, but the other evaluation measures must be considered to analyze the predictions of positive and negative instances. Specificity, 99.69% for the dot class, represents TN among TN and FP, while precision was 99% for TP over TP and FP. The F1 score was 98.52%, summarizing recall and precision, and can be considered a more informative factor than sensitivity, specificity, or precision alone. The kappa statistic again showed a weak level of agreement for this class. The third class, healthy, showed fully accurate results in all networks, and its agreement value of 0.97 indicates an extraordinary level of agreement on the given data. The next class, mummification, showed 98.66% accuracy, 99.53% specificity, a 98.66% F1 score, 99.53% precision, and a kappa value of 0.485.
The accuracy for the mummification class was slightly better than that for the canker and dot classes. Its specificity was lower than that of the dot class and higher than that of the canker class, though the differences were small. Its precision was higher than that of the canker and dot classes. To summarize precision and recall, the F1 score was used; it was nearer to that of the dot class and higher than that of the canker class. The kappa value, a decision-making index, was 0.485, less than the canker and dot values, and also lies in the weak agreement range of data reliability. The kappa value for the last class was 0.56, owing to values of one for both precision and specificity, but 0.56 also lies in the weak agreement range. Lastly, the overall (mean) results were 98.50% accuracy, 99.60% specificity, a 98.75% F1 score, 98.75% precision, and a kappa value of 0.9531. The mean values are the actual representation of the model, which produced good results in accuracy, in specificity (specifically for TN), in precision (for TP), and in the F1 score (representing recall and precision). A kappa value above 0.90 indicates a strong agreement level, with correspondingly high data reliability. Therefore, AlexNet overall showed satisfactory results.
GoogLeNet's testing results for the canker class were 96.635% accuracy, 99.69% specificity, a 97.81% F1 score, 99.01% precision, and a 0.5285 kappa value. Accuracy showed promising results; the specificity over the TN values was 99.69%. Similarly, the precision over the TP values was 99%, and the F1 score over precision and recall was 97.81%, giving more confidence than precision or specificity alone. The last important measure, the kappa index, was 0.5285, a weak level of agreement, as it reflects only one class against the others; the mean value shows the actual effect of the kappa statistic. The dot class showed 97.56% accuracy, 98.638% specificity, a 96.618% F1 score, 95.69% precision, and a 0.5269 kappa value. Its accuracy was higher than that of the canker class, while its specificity was slightly lower, and its precision was lower than that of the canker class. The F1 score based on precision and recall was slightly higher for the canker class, and the kappa Cohen index was similar, again lying in the weak agreement range of data reliability. The mummification class showed 99.42% accuracy, 99.37% specificity, a 97.29% F1 score, 98.18% precision, and a 0.49 kappa value. Its accuracy was higher than that of the canker and dot classes, while its specificity was lower than that of the canker class and slightly higher than that of the dot class, meaning that some TN instances varied across these cases. The F1 score was higher than that of the dot class and lower than that of the canker class, meaning the combined effect of TP and FP was more promising for mummification than for the dot class.
The last class, rust, showed 99.47% accuracy, 99.114% specificity, a 98.172% F1 score, 96.907% precision, and 0.56 for the Cohen index. Its accuracy was higher than that of the canker, dot, and mummification classes. The F1 score of 98.172% was close to those of the canker, mummification, and dot classes. Accuracy was therefore the main measure for analyzing the test prediction results, while other values such as the F1 score also mattered and made the results distinguishable.
SqueezeNet's testing results for the canker class were 97.596% accuracy, 99.392% specificity, a 97.838% F1 score, 98% precision, and a 0.52 kappa value. Accuracy, a general indication of model performance, was lower here than specificity, precision, and the F1 score. Specificity was much higher, meaning true negatives over TN and FP were predicted at a high rate for the canker class. TP cases were also predicted well, with 98% precision over TP and FP. The recall- and precision-based F1 score was 97.838%, intermediate between the precision and specificity. The kappa value of 0.52 lies in the weak agreement range. The dot class results showed 100% accuracy, 100% specificity, a 96.68% F1 score, 100% precision, and a 0.51 kappa value. Accuracy and specificity both reached 100% for this class; the F1 score, which covers recall and precision, was 96.68%. The mummification class showed 98.214% accuracy, 98.91% specificity, a 97.56% F1 score, 96.916% precision, and 0.48 for the kappa index. The last class, rust, showed 91.53% accuracy, 100% specificity, a 95.58% F1 score, and 100% precision. Its accuracy was lower than that of the other three classes, while precision and specificity were 100%. The global (overall) results across all classes revealed the primary issue: accuracy was 97.11%, lower than that of AlexNet and GoogLeNet; specificity was also lower than in both of those models; and the F1 score, precision, and kappa were likewise lower for this model's testing predictions. The kappa index nevertheless showed promise for these data.
ResNet-50 and -101 had much better results than AlexNet, SqueezeNet, and GoogLeNet. ResNet-50's canker class results were 99.51% accuracy, 99.68% specificity, a 99.28% F1 score, 99.03% precision, and a kappa index of 0.51. Compared to the previous cases of AlexNet, GoogLeNet, and SqueezeNet, the results improved across all measures. Similarly, the dot class showed 100% accuracy, 99.697% specificity, a 99.515% F1 score, 99.034% precision, and a 0.52 kappa index. By the accuracy measure, the canker class results were not better than those of the dot class, with only a slight difference in the F1 scores. The mummification results showed 98.661% accuracy, 100% specificity, a 100% F1 score, 98.25% precision, and a 0.48 kappa value; its specificity and F1 score reached 100%, higher than for the canker and dot classes, while its precision was slightly lower. The rust class showed 100% accuracy, 100% specificity, a 100% F1 score, and a 100% precision value. The overall results improved on the mean results of the models above: the global mean accuracy was 99.54%, better than that of SqueezeNet, GoogLeNet, and AlexNet, and specificity, the F1 score, and precision were all better than in those three models. Although the kappa value was higher than in the previous models, it fell in the same confidence class; data reliability was high across all models.
The last model of the proposed study, ResNet-101, also achieved good results compared to the other models. Its canker class predictions showed 99.519% accuracy, 98.784% specificity, an F1 score of 97.87%, a precision of 96.27%, and a 0.51 kappa index. ResNet-101's accuracy dominated the corresponding class results of the other models, and the other evaluation values also improved in this model. Where the accuracy improved, the other values did not always improve as much, but the different cases showed overall improvement of the results. In ResNet-101, the dot class showed 98.049% accuracy, 99.84% specificity, 99.505% precision, and a 0.53 kappa value. Accuracy, specificity, precision, and the F1 score improved for each class, which did not happen for any class in the previous models' results. The mummification class again showed 100% accuracy in this model's testing; it showed 100% accurate results in other models as well, but the other values were not improved to the same extent: 98.41% accuracy, 99.84% specificity, a 98.93% F1 score, and 99.47% precision. The last class showed consistency in the improvement of the results, also showing promising results here. Lastly, the overall mean results of ResNet-101 proved it to be more accurate than the four other models; its accuracy was better than the others', and the other measures similarly showed excellent performance. The graphical illustration of all five models' mean testing results is shown in
Figure 4.
The overall results showed that the kappa value indicated excellent data reliability for all models. The mummification, healthy, and dot classes were easier to distinguish from the other two classes, as they frequently showed 100% correct results. The densest model, with more residual connections, showed the most accurate results, which means that using a small kernel size with an increasingly denser network improved the classification results.
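The residual connection behind this observation can be sketched in a few lines (a toy numpy illustration, not the paper's actual ResNet code): the block's convolutional path learns only a correction that is added back onto its input, which is what lets very deep networks keep training effectively.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy residual block: output = relu(f(x) + x), where f is a small
    two-layer transform. The identity shortcut carries x through unchanged."""
    f_x = relu(x @ w1) @ w2   # the learned residual correction
    return relu(f_x + x)      # identity shortcut added back

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))
w1 = np.zeros((8, 8))         # with zero weights, f(x) = 0 ...
w2 = np.zeros((8, 8))
y = residual_block(x, w1, w2)
print(np.allclose(y, relu(x)))  # ... so the block reduces to relu(x): True
```

Because the shortcut makes the identity mapping trivial to represent, stacking many such blocks (as in ResNet-50 and -101) does not degrade gradients the way plain deep stacks do.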
For individual cases or data-based testing analysis, the confusion matrices were designed and evaluated, and they are shown in
Table 8.
The confusion matrices of all architectures showed that the rust class was the most distinguishable among the guava diseases. For the AlexNet model, four canker cases were wrongly predicted, falling in the rust and mummification classes; four wrong predictions were found for the dot class, falling in canker and rust. There were three wrong predictions for mummification, all in the canker class, and two wrong predictions in the rust class, both predicted as dot. The second architecture, GoogLeNet, made 201 correct predictions in the canker class, with seven wrong predictions falling in the dot, mummification, and rust classes and none in the healthy class. Of the five wrong predictions for the dot class, two were predicted as mummification and three as rust, while 200 were predicted correctly; this makes it one case less accurate than AlexNet, which made four wrong predictions in the dot class. Nine mummification cases were wrongly predicted, falling in the canker, dot, and rust categories, while 216 cases were predicted correctly. In rust, 188 cases were predicted correctly and one was wrongly predicted as dot; GoogLeNet was more efficient than AlexNet in this category, as AlexNet predicted two wrong cases as dot and GoogLeNet only one. The SqueezeNet model made five wrong predictions of the canker class, with 203 correct predictions; all five wrong predictions fell in the mummification class. In the dot class, there were no wrong predictions, and all 205 test instances were predicted correctly. In mummification, there were four wrongly predicted cases and 220 correctly predicted cases; most of the wrong cases, three instances, were predicted as dot, with one predicted as canker. Rust had 16 wrong cases, most of them, 11 out of 16, in the dot class.
ResNet-50 was similar to ResNet-101, the difference being mainly the number of layers and parameters. ResNet-50 predicted one wrong case for the canker class, falling in the dot class, while 207 were predicted correctly. All 205 dot class instances were predicted correctly. The mummification class had 221 correct predictions, with one wrong prediction in the dot class and two in the canker class. ResNet-101 also had higher accuracy results compared to all the other models. According to its confusion matrix, it made one wrong prediction for the canker class, into the rust class; like ResNet-50, it had one wrong case, but in a different class. For the dot class, ResNet-50 made no wrong predictions, while ResNet-101 made four. In the third case, the mummification class, ResNet-101 made four wrong predictions where ResNet-50 made three. There were three wrong predictions in the last class, rust, with 186 correct predictions for ResNet-101.
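Reading these error counts off a confusion matrix can be automated. A sketch is below, using a hypothetical matrix shaped like the ones in Table 8 (rows = actual class, columns = predicted class; the counts are illustrative, not the paper's exact figures):

```python
import numpy as np

LABELS = ["canker", "dot", "healthy", "mummification", "rust"]

def misclassifications(cm, labels=LABELS):
    """Return (actual, predicted, count) for every nonzero off-diagonal cell."""
    cm = np.asarray(cm)
    return [(labels[i], labels[j], int(cm[i, j]))
            for i in range(cm.shape[0])
            for j in range(cm.shape[1])
            if i != j and cm[i, j] > 0]

# Illustrative five-class confusion matrix (not the paper's exact counts)
cm = [[203,   0,   0,   5,   0],
      [  0, 205,   0,   0,   0],
      [  0,   0, 181,   0,   0],
      [  1,   3,   0, 220,   0],
      [  0,  11,   0,   5, 172]]
for actual, predicted, n in misclassifications(cm):
    print(f"{n} {actual} case(s) predicted as {predicted}")
```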
The above analysis of the five models shows that ResNet-50 had the highest overall rate of correct predictions. The dot class data were highly difficult for every model, being predicted as the wrong class in most error cases. The other classes also misled the models, but the most challenging and least robust class was dot; it may therefore need more confident and robust approaches to separate it from the other classes. The mummification class had few wrong predictions in ResNet-50 and -101 and its highest rate of wrong predictions in SqueezeNet, with four, while the two remaining models also made several inaccurate predictions for this class. The healthy class remained accurate overall, with no wrong predictions by any model, making the normal class easily distinguishable by any model. The challenge, however, was differentiating between the guava disease categories.