In this study, the proposed method used the DDR [32], APTOS [33], and IDRiD [34] datasets. The following sections describe the characteristics of these datasets, the metrics used to assess the method’s performance, the experimental setup, the qualitative analysis, and the quantitative analysis of the results.
4.2. Metrics
The performance of the model was evaluated using accuracy and kappa score, as both are widely used measures for multi-class classification problems. Accuracy summarizes overall correctness across all classes, while the kappa score corrects for the agreement expected by chance, which makes it better suited than accuracy alone to the imbalanced distribution of the classes in the dataset. Together, they provide a more comprehensive assessment of the model’s performance for the classification of diabetic retinopathy.
To evaluate the effectiveness of the presented method, it is compared to other methods using two evaluation metrics: kappa score and accuracy [19]. These metrics are computed using the following equations [19].
In Equation (1), TP, TN, FP, and FN refer to true positives, true negatives, false positives, and false negatives, respectively [19].
In Equation (2), $p$ represents the number of classes, $w_{ij}$ represents the weight assigned to the agreement between the predicted class label and the true label for class $i$ and class $j$, $O_{ij}$ represents the observed agreement for the $(i, j)$ class pair, and $E_{ij}$ represents the expected agreement for the $(i, j)$ class pair. The weight assigned to the agreement depends on how often the two classes appear together in the confusion matrix. The weights are calculated by dividing the number of instances that are labeled as both class $i$ and class $j$ by the total number of instances.
The kappa score is a measure of the agreement between two raters, and its value ranges from −1 to 1. A value of 1 indicates perfect agreement between the two raters, while a value of −1 indicates complete disagreement. A value of 0 indicates that the agreement is no better than chance. The kappa score is often used in medical diagnosis, where two or more doctors may assess the same patient. It can also be used in other contexts where multiple people rate or categorize items.
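The two metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the study: the quadratic weighting $(i - j)^2 / (p - 1)^2$ is an assumption (a common choice for ordinal DR grades), the function names are hypothetical, and the labels at the bottom are made up.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int], p: int) -> List[List[int]]:
    """Count how often true class i was predicted as class j."""
    matrix = [[0] * p for _ in range(p)]
    for t, q in zip(y_true, y_pred):
        matrix[t][q] += 1
    return matrix


def accuracy(matrix: List[List[int]]) -> float:
    """Fraction of samples that lie on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def quadratic_weighted_kappa(matrix: List[List[int]]) -> float:
    """Weighted kappa using assumed quadratic disagreement weights."""
    p = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(p)) for j in range(p)]

    observed = 0.0
    expected = 0.0
    for i in range(p):
        for j in range(p):
            w = (i - j) ** 2 / (p - 1) ** 2  # assumed weighting, zero on the diagonal
            observed += w * matrix[i][j] / total
            expected += w * row_sums[i] * col_sums[j] / total ** 2
    return 1.0 - observed / expected


# Made-up labels for illustration only (grades 0..4).
y_true = [0, 0, 1, 1, 2, 2, 3, 4]
y_pred = [0, 0, 1, 2, 2, 2, 3, 3]
cm = confusion_matrix(y_true, y_pred, p=5)
print(f"accuracy = {accuracy(cm):.3f}")
print(f"kappa    = {quadratic_weighted_kappa(cm):.3f}")
```

In practice, a library routine such as scikit-learn's `cohen_kappa_score(y_true, y_pred, weights="quadratic")` would normally be used instead of a hand-written loop.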
4.3. Quantitative Analysis
The presented model was tested on the DDR [32] and APTOS [33] datasets. The model was designed to predict a DR score for each case, and the class label was determined from this score. The classification criteria are outlined in Table 3. The proposed method was evaluated using the accuracy and kappa score to provide a thorough assessment of its performance and to compare it to the results of previous work in the literature.
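The step from a continuous DR score to a class label can be sketched as a simple threshold lookup. The cut-off values below are hypothetical placeholders (midpoints between integer grades); the actual criteria are those given in Table 3. The five grade names follow the usual 0 to 4 DR severity scale.

```python
from bisect import bisect_right

# Hypothetical cut-offs between adjacent grades; Table 3 defines the real criteria.
THRESHOLDS = [0.5, 1.5, 2.5, 3.5]
GRADES = ["No DR", "Mild", "Moderate", "Severe", "Proliferative DR"]


def score_to_grade(score: float) -> int:
    """Map a continuous severity score to a grade index in 0..4."""
    # bisect_right counts how many cut-offs are less than or equal to the score.
    return bisect_right(THRESHOLDS, score)


for s in (-0.2, 0.49, 0.5, 1.7, 3.9, 5.0):
    g = score_to_grade(s)
    print(f"score {s:+.2f} -> grade {g} ({GRADES[g]})")
```

Scores below the first cut-off or above the last one are clamped to the lowest and highest grade, so an out-of-range regression output still yields a valid label.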
In the field of diabetic retinopathy classification, various deep learning architectures [36] have been explored for their ability to accurately predict the severity of DR from retinal images. Some of the popular deep learning architectures [36] used for this task include MobileNet, VGG, ResNet50, Inception v3, SE-ResNeXt, multi-scale attention network, and Efficientnet-B5. These architectures have been pre-trained on large image datasets, such as ImageNet, and then fine-tuned on the DR classification task using a technique called transfer learning.
In transfer learning, the parameters of a pre-trained network are adjusted to fit the specific task at hand, in this case DR classification. This allows the network to take advantage of the knowledge gained from the pre-training task and reduce the amount of training data required while still allowing for fine-tuning to the specific task.
The presented model was compared with these popular deep learning architectures by training each architecture using transfer learning and evaluating its performance on the APTOS and DDR test datasets. The results are explained in the following sections.
4.3.1. APTOS Results
Table 6 shows the comparative results for the APTOS test dataset. As indicated in Table 6, the presented model outperformed previous literature and the top-performing baseline model (SE-ResNeXt) in terms of accuracy, with a 2.3% improvement. The method also outperformed the multi-scale attention model. In terms of agreement, the proposed method achieved a kappa score of 0.939, indicating that it is effective in classifying DR. A normalized confusion matrix is also provided to assess the model’s accuracy for each class. The normalized confusion matrix of the proposed method on the APTOS dataset is shown in Figure 4.
Figure 4 shows that the presented method classified images without diabetic retinopathy with 98.3% accuracy. However, there was some misclassification between adjacent classes due to the close relationship between neighboring DR grades.
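A normalized confusion matrix of this kind is obtained by dividing each row of the raw confusion matrix by its row total, so that the diagonal holds the per-class accuracy. The sketch below illustrates the computation; the counts are made up for illustration and are not the results reported in Figure 4, and the function name is hypothetical.

```python
from typing import List


def normalize_rows(matrix: List[List[int]]) -> List[List[float]]:
    """Divide each row by its sum; row i becomes the distribution of predictions for true class i."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Made-up counts for illustration only (rows: true grade, columns: predicted grade).
counts = [
    [58, 2, 0, 0, 0],
    [3, 20, 7, 0, 0],
    [0, 6, 40, 4, 0],
    [0, 0, 5, 12, 3],
    [0, 0, 2, 4, 14],
]

norm = normalize_rows(counts)
for grade, row in enumerate(norm):
    print(grade, " ".join(f"{v:.2f}" for v in row))
```

In this toy matrix most of the off-diagonal mass sits next to the diagonal, which is the pattern of adjacent-grade confusion described above.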
4.3.2. DDR Results
To further evaluate the effectiveness of the presented method, it was also tested on the DDR test dataset. The results of this evaluation were compared with those of previous work, using accuracy and kappa score as the evaluation metrics.
Table 7 shows the comparison results.
The presented method demonstrated better performance in terms of accuracy and kappa score compared to other models, as shown in Table 7. Additionally, the normalized confusion matrix in Figure 5 illustrates that the presented model effectively distinguishes between DR classes.
The presented model in the study outperformed existing state-of-the-art approaches, as evidenced by its higher accuracy and kappa score. The presented method incorporated both regression and Efficientnet-B0 architecture. This unique combination of techniques likely contributed to the improved performance of the model.
4.4. Qualitative Analysis
Heatmaps are graphical visualizations that use color to encode data values [37,38]. In image classification, heatmaps can be used to highlight the areas of an image that are most important for making a prediction. This can provide insight into the decision-making process of the algorithm and help identify patterns in the data. Heatmaps can also be used for qualitative analysis as they can offer a more interpretable visual representation of the data compared to raw numerical values.
The presented method was evaluated qualitatively using heatmaps. Figure 6 shows the input image and its corresponding heatmap representation. The heatmap in our study represents the areas of the image that have the greatest impact on the predictions made by the model. The color scale used in the heatmap, COLORMAP_JET [39], is a standard colormap in computer vision and is commonly used to visualize data that spans a large range of values. The scale ranges from blue, representing low values, to red, representing high values, with yellow in between. This allows one to visualize and understand the contribution of different areas of the image to the model’s predictions. As shown in Figure 6, the model was able to identify DR lesions and classify the image based on these lesions. The red color in this figure indicates that the prediction gives more importance to that particular region, which suggests that the model learned to focus on DR lesions when making a prediction. The DR severity score and predicted class label corresponding to each image are also depicted in Figure 6.
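The blue-to-red color coding described above can be illustrated with a small pure-Python sketch. It uses a piecewise-linear approximation of a jet-style colormap, not OpenCV's exact COLORMAP_JET lookup table, and the function names and the blending factor `alpha` are assumptions made for illustration.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def jet(value: float) -> RGB:
    """Approximate a jet-style colormap for an importance value in [0, 1]."""
    v = min(max(value, 0.0), 1.0)

    def channel(center: float) -> int:
        # Each channel is a clipped triangle centred at a different position.
        level = 1.5 - abs(4.0 * v - center)
        return round(255 * min(max(level, 0.0), 1.0))

    # Red peaks at high values, green in the middle, blue at low values.
    return channel(3.0), channel(2.0), channel(1.0)


def overlay(pixel: RGB, importance: float, alpha: float = 0.4) -> RGB:
    """Blend one image pixel with the color of its importance value."""
    heat = jet(importance)
    return tuple(round((1 - alpha) * p + alpha * h) for p, h in zip(pixel, heat))


for v in (0.125, 0.375, 0.625, 0.875):
    print(v, jet(v))
```

With OpenCV, the equivalent step is typically `cv2.applyColorMap(gray, cv2.COLORMAP_JET)` on an 8-bit importance map, which returns the colors in BGR order.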
Heatmaps can also reveal patterns in the data that may not be immediately apparent from other forms of analysis, such as confusion matrices or accuracy scores. This information can be used to improve the model’s performance by fine-tuning its hyperparameters, adding or removing layers, or modifying the data pre-processing techniques used.
4.5. Computational Time
Ophthalmologists spend several minutes closely examining fundus images to determine the severity of diabetic retinopathy and assign a grade [40]. This process can be challenging, especially when the patient also has other diseases. The presented method aims to automate this process and takes an average of 200 milliseconds for prediction and heatmap generation. This is significantly faster than manual diagnosis. The computational time analysis was performed on a machine with 4 cores, 28 GB of RAM, and 56 GB of disk space.
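An average per-image time of this kind can be measured as sketched below. The `fake_predict` function is a hypothetical stand-in for the model and heatmap step (it only sleeps for a few milliseconds), and the warm-up count is an assumption; the sketch shows the measurement procedure, not the study's benchmark.

```python
import statistics
import time
from typing import Callable, List


def mean_latency_ms(predict: Callable[[object], object], inputs: List[object], warmup: int = 3) -> float:
    """Average wall-clock time per call, in milliseconds."""
    for item in inputs[:warmup]:
        predict(item)  # warm-up runs are not timed
    timings = []
    for item in inputs:
        start = time.perf_counter()
        predict(item)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)


def fake_predict(image: object) -> float:
    """Stand-in for prediction plus heatmap generation; sleeps for about 5 ms."""
    time.sleep(0.005)
    return 0.0


print(f"{mean_latency_ms(fake_predict, list(range(20))):.1f} ms per image")
```

Excluding the warm-up calls keeps one-time costs, such as loading weights or allocating buffers, out of the reported average.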
In this study, we have evaluated the computational time of the proposed model and compared it with the time required for manual diagnosis. The results showed that the proposed model significantly reduces the time required for DR prediction compared to manual diagnosis. This faster prediction time has the potential to improve the efficiency and accuracy of DR detection in real-world applications.
The reduction in the time required for DR prediction can lead to a more streamlined diagnostic process, allowing healthcare professionals to diagnose and treat the disease more quickly and effectively. This, in turn, can improve the outcomes for patients with DR by allowing for earlier detection and management of the disease.
Furthermore, the faster prediction time also has the potential to increase the accessibility of DR diagnosis, particularly in resource-limited settings where access to trained healthcare professionals is limited. This can lead to improved outcomes for patients and a reduction in the global burden of DR.
4.6. Discussion
A regression-based model for diabetic retinopathy classification using Efficientnet-B0 as the backbone was proposed in this study. The model was trained and tested on a dataset of retinal images, and it achieved an overall multi-class classification accuracy of 85.5% and a kappa score of 0.921 on the APTOS [33] and DDR [32] datasets.
The presented model demonstrated superior performance when compared to existing state-of-the-art approaches in the literature that employ alternative architectures. This can be attributed to the high efficiency and capability for superior feature extraction and classification offered by Efficientnet-B0 [10]. Furthermore, by approaching the problem of diabetic retinopathy classification through the lens of a regression problem, the accuracy of the model was enhanced. To the best of our knowledge, this is the first study that views the problem of DR detection as a regression problem, which leads to improved performance and diagnostic accuracy.
A regression-based approach for DR classification has several advantages over traditional multi-class classification methods. One of the main advantages is the ability to provide a continuous output that can be interpreted as a measure of the severity of DR. This is particularly useful in cases where a quantitative measure of DR severity is needed to triage patients for further examination, such as in telemedicine systems. Additionally, regression-based approaches can handle imbalanced datasets more effectively, which can lead to improved diagnostic accuracy and better patient outcomes. Furthermore, regression-based approaches can also be more robust to noise and outliers in the data as they are able to make predictions based on a broader range of information. This can be beneficial in medical imaging applications, where the quality of the images can vary widely. Overall, a regression-based approach for DR classification can provide a more accurate and comprehensive assessment of the disease and can be an important tool for the early detection and treatment of DR.
It is important to note that the model’s continuous output representing the severity of DR was mapped to class labels for the purpose of performance evaluation as the publicly available dataset used in the study had class labels and not severity scores. This mapping to class labels is simply a way to evaluate the model’s performance using the metrics and standards of traditional multi-class classification.
When compared to other studies in the field, the proposed model demonstrates superior performance in terms of accuracy, kappa score, and computational efficiency. For example, a recent study using a ResNet50 architecture as the backbone achieved an accuracy of 74.6% on the APTOS dataset. Our proposed model, using Efficientnet-B0 with a regression approach, demonstrated an 11.6% increase in accuracy. Additionally, our model is computationally more efficient, which is important for real-world implementation. Moreover, the presented model is able to provide the DR severity score in addition to the class label.
The results of our study showed that the Efficientnet-B0 architecture outperformed the Efficientnet-B5 architecture in detecting diabetic retinopathy. The comparison between these two architectures highlights the trade-off between complexity and performance. The more complex architecture of Efficientnet-B5, with its larger number of parameters, appears to have overfitted the training data. Additionally, the larger size of Efficientnet-B5 made it more computationally intensive, which led to longer training times and potentially impacted its performance. Furthermore, it is possible that the more complex architecture of Efficientnet-B5 was not well-suited for the specific task of diagnosing diabetic retinopathy and that the simpler architecture of Efficientnet-B0 was more effective in this context. These findings emphasize the importance of considering the architecture of a model when selecting it for a specific task.
The results of this study have important implications for the early detection and treatment of DR. The proposed model can be used as a reliable tool for screening patients in resource-limited settings, where access to ophthalmologists is limited. Additionally, the model can be easily integrated into telemedicine systems, which can aid in the early detection and treatment of DR in remote areas.
The key findings of the study demonstrate the potential of a regression-based approach incorporating the Efficientnet-B0 architecture to effectively classify and predict the severity of diabetic retinopathy. This approach provides a more interpretable way to predict DR compared to traditional classification methods.
However, it is important to acknowledge that this approach also has limitations that could impact its performance. The limitations of this study include the difficulty in fully incorporating the inter-relationships between different severity levels into the model, and the fact that the results are based on a specific dataset, requiring further investigation to assess the generalizability of the proposed model to other datasets.
Future research could focus on exploring alternative regression models and architectures to improve performance and incorporating additional sources of information, such as patient medical history, to enhance accuracy. Additionally, the integration of advanced machine learning techniques, such as transfer learning or ensemble methods, could be explored to further optimize the model’s performance.