Various deep learning techniques have been utilized to diagnose diabetic retinopathy. This section reviews recent advances in the early diagnosis of this condition, focusing on machine learning (ML) and deep learning (DL) algorithms. Numerous studies have investigated convolutional neural networks (CNNs), ensemble learning, transfer learning, and feature extraction methods to automate the detection and classification of diabetic retinopathy severity levels from retinal images. Diabetic retinopathy classification falls into two broad categories: binary classification, which determines the presence or absence of diabetic retinopathy (DR), and multi-class classification, which identifies the severity or stage of the disease, typically Mild Non-Proliferative Diabetic Retinopathy (NPDR), Moderate NPDR, Severe NPDR, and Proliferative Diabetic Retinopathy (PDR). In this section, we examine research that focuses on binary classification, studies that concentrate on multi-class classification, and work that addresses both.
2.1. Binary Classification of DR
Zeng et al.’s study [16] addressed the automatic identification of referable diabetic retinopathy by classifying fundus images into two severity classes. The study utilized the high-quality fundus images of the “Diabetic Retinopathy Detection” dataset from Kaggle. Their method employed a convolutional neural network adapted from an existing architecture through transfer learning. In contrast to other techniques, this model examined images from both eyes simultaneously and predicted the severity for each eye independently. The method produced an area under the ROC curve of 0.951, marginally better (by 0.011) than an equivalent approach that analyzes one eye at a time.
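As an illustration of the binocular idea, the sketch below runs a shared backbone over the left- and right-eye images and predicts a severity class for each eye from the combined features. The backbone choice, head design, and all names are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
from torchvision import models

class BinocularDRNet(nn.Module):
    """Sketch: one shared CNN encodes both eyes; each eye gets its own
    prediction head that also sees the other eye's features."""
    def __init__(self, num_classes=2):
        super().__init__()
        backbone = models.resnet18(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()               # expose the 512-d pooled features
        self.backbone = backbone
        self.left_head = nn.Linear(512 * 2, num_classes)
        self.right_head = nn.Linear(512 * 2, num_classes)

    def forward(self, left_eye, right_eye):       # each: (N, 3, 224, 224)
        f_left = self.backbone(left_eye)
        f_right = self.backbone(right_eye)
        both = torch.cat([f_left, f_right], dim=1)
        return self.left_head(both), self.right_head(both)
```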
In a paper by Voets et al. [17], the researchers attempted to replicate a previous study on detecting diabetic retinopathy (DR) in retinal images using publicly available data. Instead of the dataset used in the original work, they utilized the Kaggle EyePACS and Messidor-2 datasets. As in the original method, the algorithm was trained using the InceptionV3 neural network architecture, and ensemble learning techniques were incorporated to further improve training. The replicated approach obtained area under the curve (AUC) scores of 0.951 on the held-out Kaggle EyePACS test set and 0.853 on Messidor-2.
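A minimal sketch of this kind of transfer-learning setup, assuming torchvision's Inception-V3 with ImageNet weights and a two-class (referable/non-referable) head; the fine-tuning details are assumptions, not the study's exact configuration:

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained Inception-V3 and replace its classifier
# heads for binary DR screening; Inception-V3 expects 299x299 inputs.
model = models.inception_v3(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, 2)  # auxiliary head, active only in training
```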
Chetoui and Akhloufi [18] developed a deep convolutional neural network (CNN) for detecting diabetic retinopathy (DR) in retinal images. To assess its effectiveness, they tested the model on a large collection of over 90,000 images sourced from nine public databases: EyePACS, MESSIDOR, MESSIDOR-2, DIARETDB0, DIARETDB1, STARE, IDRID, E-ophtha, and UoA-DR. Their approach featured an “explainability” component that visually highlights the regions the model identifies as indicative of DR. On the EyePACS dataset, their system surpassed existing benchmarks, achieving an Area Under the Curve (AUC) of 0.986. The model performed robustly across all nine datasets, with AUC values consistently above 0.95: 0.963, 0.979, 0.986, 0.988, 0.964, 0.957, 0.984, and 0.990 on MESSIDOR, MESSIDOR-2, DIARETDB0, DIARETDB1, STARE, IDRID, E-ophtha, and UoA-DR, respectively.
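The paper's exact explainability mechanism is not detailed here; one common way to produce such highlight maps is a Grad-CAM-style visualization, sketched below with an assumed ResNet-50 backbone and a random tensor standing in for a preprocessed fundus image:

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V1").eval()
feats, grads = {}, {}

def hook(_module, _inputs, output):
    feats["act"] = output                                 # last conv-stage activations
    output.register_hook(lambda g: grads.update(grad=g))  # and their gradients

model.layer4.register_forward_hook(hook)

x = torch.randn(1, 3, 224, 224)                 # stand-in for a preprocessed fundus image
score = model(x)[0].max()                       # score of the top predicted class
score.backward()

weights = grads["grad"].mean(dim=(2, 3), keepdim=True)    # channel importance = mean gradient
cam = F.relu((weights * feats["act"]).sum(dim=1))         # weighted sum of activation maps
cam = F.interpolate(cam.unsqueeze(1), size=x.shape[2:], mode="bilinear")
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-7)  # normalize to a [0, 1] heatmap
```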
A study by Zong et al. [19] proposed an automated method for segmenting hard exudates in retinal images to aid DR diagnosis. They employed 81 annotated images from the public IDRiD dataset and used superpixel segmentation (SLIC) to mitigate the limited amount of training data. Their network architecture is based on U-Net, augmented with inception modules and residual connections to improve performance. On the IDRiD dataset, this architecture achieved a high accuracy of 97.95%.
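How exactly SLIC was applied is not spelled out above; one plausible reading is that superpixels supply many small training regions from few annotated images. A minimal sketch with scikit-image, where the file name and parameter values are illustrative:

```python
from skimage import io
from skimage.segmentation import slic

img = io.imread("fundus.png")   # hypothetical input image
# Partition the image into ~500 perceptually uniform superpixels; each
# labeled region can then serve as a candidate training patch, one way
# to stretch a small annotated dataset.
labels = slic(img, n_segments=500, compactness=10, start_label=1)
```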
Abdelsalam and Zahran [20] examined a novel approach to early DR detection using retinal images acquired from the Ophthalmology Center at Mansoura University, Egypt. Their approach applied multifractal geometry analysis, a methodology used in other scientific domains, to characterize the complex branching patterns of the retinal vasculature. Combining these multifractal features with a support vector machine (SVM) classifier yielded an accuracy of 98.5%. Beyond early DR identification, the method holds potential for identifying different phases of diabetic retinopathy and other retinal diseases that affect blood vessel distribution.
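As a simplified illustration of fractal vessel analysis (the monofractal case, not the authors' full multifractal treatment), the box-counting dimension of a binarized vessel map can be estimated as follows; the box sizes and input are assumptions:

```python
import numpy as np

def box_counting_dimension(vessel_mask, sizes=(2, 4, 8, 16, 32, 64)):
    """Estimate the fractal dimension of a binary vessel map by box counting."""
    counts = []
    h, w = vessel_mask.shape
    for s in sizes:
        view = vessel_mask[: h - h % s, : w - w % s]
        # split into s-by-s boxes and count boxes containing any vessel pixel
        boxes = view.reshape(h // s, s, w // s, s).any(axis=(1, 3))
        counts.append(max(boxes.sum(), 1))   # guard against log(0)
    # dimension = slope of log(count) versus log(1/size)
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(sizes)), np.log(counts), 1)
    return slope
```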
In a paper by Maqsood et al. [21], the authors presented a technique to detect hemorrhages, a crucial early indicator of DR. Their method combines a custom CNN architecture with a modified contrast enhancement step. A pre-trained CNN extracts features from candidate hemorrhage regions, and a feature fusion step then selects the most informative features. On a dataset of 1509 images drawn from six sources (HRF, DRIVE, STARE, MESSIDOR, DIARETDB0, and DIARETDB1), the approach achieved an average accuracy of 97.71%. Compared with other hemorrhage detection techniques, the system shows better quantitative performance and visual quality.
In a study by Ayala et al. [22], the authors introduced a deep learning model for detecting diabetic retinopathy (DR) using fundus images. They utilized two datasets: the APTOS 2019 Blindness Detection dataset from Kaggle and MESSIDOR-2. The approach employed convolutional neural networks (CNNs), which are particularly effective for image classification tasks, and the two datasets were cross-tested to evaluate the model’s ability to learn complex features. The CNN was trained to differentiate between Non-Proliferative Diabetic Retinopathy (NPDR) and Proliferative Diabetic Retinopathy (PDR) in fundus images, with transfer learning used to initialize the network weights. The study demonstrates that deep learning can produce accurate DR detection models, achieving an accuracy of 81% on the APTOS dataset and 64% on the MESSIDOR-2 dataset. However, the study had limitations, including a lack of diversity in the datasets.
Maaliw et al. [23] proposed an improved method for diagnosing diabetic retinopathy (DR). Their approach uses Atrous Spatial Pyramid Pooling (ASPP) for segmentation and a ResNet-based CNN for classification. On a dataset combining DIARETDB0 and DIARETDB1, the technique achieved strong results (accuracy: 99.2%, precision: 98.9%, sensitivity: 99.4%, specificity: 98.9%).
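A minimal PyTorch sketch of an ASPP block, which applies parallel dilated convolutions so the segmentation network sees lesion context at several receptive-field sizes; the channel counts and dilation rates are illustrative, not the authors' configuration:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated 3x3 convolutions
    capture context at several receptive-field sizes, then a 1x1
    convolution fuses the concatenated branch outputs."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```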
A study by Rahman et al. [24] proposed a hybrid intelligent approach that combines transfer learning and classical machine learning to detect DR. The model consists of a pretrained feature extractor, ResNet50, whose output features are fed to a support vector machine (SVM) classifier. On the APTOS dataset, the model achieved its highest accuracy of 96.9%.
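A minimal sketch of this hybrid pattern: strip the ResNet50 classifier to expose its pooled features, then fit an SVM on them. The stand-in data, preprocessing, and SVM kernel are assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC

backbone = models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()               # expose the 2048-d pooled feature vector
backbone.eval()

images = torch.randn(8, 3, 224, 224)      # stand-in for preprocessed fundus images
labels = [0, 1, 0, 1, 0, 1, 0, 1]         # stand-in DR / no-DR labels

with torch.no_grad():
    feats = backbone(images).numpy()      # (N, 2048) deep features

clf = SVC(kernel="rbf").fit(feats, labels)   # kernel choice is illustrative
```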
Table 1 summarizes the techniques, datasets, and results of the binary classification studies reviewed above.
2.2. Multi-Class Classification of DR
This section discusses recent studies demonstrating how deep learning models can improve the accuracy and efficiency of diagnosing and classifying diabetic retinopathy. Research on multi-stage diabetic retinopathy diagnosis using deep learning has shown promising results in terms of accuracy, efficiency, and scalability. However, some limitations and challenges remain, requiring further exploration and refinement in future research.
Alyoubi et al. [4] conducted an in-depth study that introduced a comprehensive approach for classifying diabetic retinopathy (DR) images into five categories: no DR, mild DR, moderate DR, severe DR, and proliferative DR. Their technique also identifies individual lesions on the retinal surface, classifying them into two main groups: red lesions, which include microaneurysms (MA) and hemorrhages (HM), and bright lesions, which encompass soft and hard exudates (EX). The study utilized two publicly available fundus retina databases, DDR and the Asia Pacific Tele-Ophthalmology Society (APTOS) dataset, both of which contain photographs of the various DR stages. The system architecture incorporates two deep learning components. The first employs three CNN-based models for DR classification: one uses transfer learning through EfficientNetB0, while the other two, CNN512 and CNN299, are designed and trained specifically for the task; notably, CNN512 processes the full image to classify it by DR stage. The second component is a modified version of YOLOv3, which identifies and localizes DR lesions. Before training, the system benefited from image preprocessing and data augmentation techniques. Performance metrics showed that CNN512 achieved an accuracy of 88.6% on the DDR dataset and 84.1% on the APTOS dataset. For lesion localization, the adapted YOLOv3 model produced a mean Average Precision (mAP) of 0.216 on the DDR dataset. The fusion of the CNN512 and YOLOv3 models yielded an accuracy of 89%, establishing a new benchmark in the field. However, a limitation of the study is that sensitivity for the mild and severe DR stages was lower than for the other categories, an outcome attributed to class imbalance in the training datasets.
The goal of a study by Gao et al. [25] was to automatically diagnose DR and provide patients with useful recommendations. The dataset of 4476 images was provided by the ophthalmology, endocrinology and metabolism, and health management departments of Sichuan Provincial People’s Hospital, which ranks second in Sichuan Province for ophthalmology. Using this dataset, they classified the severity of DR fundus photographs by applying transfer learning with deep convolutional neural network models. The base model was the Inception-V3 network; the image was divided into four portions, each fed into a separate Inception-V3 model, and the resulting architecture was named Inception4. Their experiments achieved 88.72% accuracy, 95.77% precision, and 94.84% recall.
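A sketch of the quadrant idea follows: the input is split into four sub-images, each handled by its own Inception-V3, and the per-quadrant predictions are averaged. The fusion rule and head sizes are assumptions, not the authors' exact design (inputs must be at least 598×598 so each quadrant meets Inception-V3's 299×299 minimum):

```python
import torch
import torch.nn as nn
from torchvision import models

class QuadrantInception(nn.Module):
    """Sketch: one Inception-V3 per image quadrant, predictions averaged."""
    def __init__(self, num_classes=5):
        super().__init__()
        def branch():
            m = models.inception_v3(weights=None, aux_logits=False, init_weights=True)
            m.fc = nn.Linear(m.fc.in_features, num_classes)
            return m
        self.branches = nn.ModuleList([branch() for _ in range(4)])

    def forward(self, x):                          # x: (N, 3, H, W), H and W >= 598
        h, w = x.shape[2] // 2, x.shape[3] // 2
        quadrants = [x[:, :, :h, :w], x[:, :, :h, w:],
                     x[:, :, h:, :w], x[:, :, h:, w:]]
        logits = [b(q) for b, q in zip(self.branches, quadrants)]
        return torch.stack(logits).mean(dim=0)     # simple average fusion (assumed)
```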
Qummar et al. [26] created an ensemble of deep CNN models trained on retinal images to improve classification accuracy across the various stages of DR. Using the publicly available Kaggle Diabetic Retinopathy Detection dataset, the study addressed the shortcomings of existing models in reliably detecting the different stages of DR, particularly the early ones. An ensemble of five deep CNN models (Resnet50, Inceptionv3, Xception, Dense121, and Dense169) was trained on this dataset. Preprocessing and augmentation involved scaling the images to 786 × 512 pixels, randomly clipping five 512 × 512 patches from each image, flipping, rotating the images by 90 degrees, and applying mean normalization. The resulting model achieved an F1-score of 53.74%, an accuracy of 80.8%, a recall of 51.5%, a specificity of 86.72%, and a precision of 63.85%, outperforming previous methods on the Kaggle dataset in detecting all levels of DR.
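A sketch of this preprocessing pipeline with torchvision transforms; the sizes and operations follow the description above, while the composition order, flip probabilities, and normalization statistics are assumptions:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.Resize((512, 786)),      # (height, width), per the reported 786x512 scaling
    transforms.RandomCrop(512),         # one of the random 512x512 patches
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomApply([transforms.RandomRotation((90, 90))], p=0.5),  # 90-degree rotation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # illustrative normalization values
                         std=[0.229, 0.224, 0.225]),
])
```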
Kalyani et al. [27] presented an advanced capsule network to identify and categorize the stages of diabetic retinopathy. The study’s primary goal is to determine whether patients have diabetic retinopathy and, if so, its stage. They used the well-known Messidor dataset, which divides diabetic retinopathy into four grades, 0 through 3. The work was motivated by the rise of capsule networks in deep learning, which have outperformed classic machine learning approaches in numerous applications. Fundus images are processed by the convolution and primary capsule layers for feature extraction, after which the class capsule and softmax layers determine the class likelihood of the image. During preprocessing, only the green channel of the RGB retinal image is retained, because it has better contrast and accentuates features such as microaneurysms. The resulting capsule network achieved an accuracy of 97.98%, precision of 95.62%, recall of 96.11%, and an F1 score of 96.36%. The study is limited, however, by its restriction to grades 0–3 of diabetic retinopathy; other databases categorize DR into five stages, and the current network has not been trained on such datasets.
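The green-channel step is simple to illustrate; a minimal sketch with OpenCV, where the file name is a placeholder:

```python
import cv2

img = cv2.imread("fundus.png")   # hypothetical path; OpenCV loads in BGR channel order
green = img[:, :, 1]             # keep only the green channel, where vessels and
                                 # microaneurysms show the strongest contrast
```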
The main goal of a study by Khan et al. [28] was to classify the multiple stages of DR while reducing the number of learnable parameters to accelerate model training and convergence. The study used Kaggle’s EyePACS dataset, which contains 88,702 images, of which 53,576 are unlabeled and 35,126 are labeled; only a subset of the labeled images was used for the analysis. Methodologically, the study builds on a modified CNN: the VGG-NiN model, a highly nonlinear, scale-invariant deep model constructed from VGG16, a Spatial Pyramid Pooling (SPP) layer, and the Network-in-Network (NiN) architecture. The model achieved 91% specificity, 55.6% recall, 67% precision, 85% accuracy, and a 59.6% F1-score. One limitation of the research was its reliance on labeled images only, which excluded a substantial portion of the available data.
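The SPP layer is the piece that makes the model scale-invariant: it pools the final feature map at several fixed grid sizes and concatenates the results, yielding a fixed-length vector for any input resolution. A minimal PyTorch sketch, with illustrative pyramid levels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    """Pool a feature map at several fixed grid sizes and concatenate,
    producing a fixed-length vector regardless of input resolution."""
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels

    def forward(self, x):                     # x: (N, C, H, W), any H and W
        n = x.size(0)
        pooled = [F.adaptive_max_pool2d(x, level).view(n, -1) for level in self.levels]
        return torch.cat(pooled, dim=1)       # (N, C * (1 + 4 + 16)) for levels (1, 2, 4)
```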
Bhardwaj et al. [29] proposed the Quadrant-based Ensemble InceptionResNet-V2 (QEIRV-2) model, a fusion of four InceptionResNet-V2 architectures for automated DR detection, with the goal of enhancing early detection of retinal abnormalities. Two datasets, MESSIDOR (1200 images) and IDRiD (454 images with non-proliferative DR), were used to train the model, and data augmentation and preprocessing were applied to improve performance. The approach achieved an accuracy of 93.33%.
Jagan et al. [30] presented a way to enhance the classification of diabetic retinopathy (DR) using deep learning on retinal images. Their method includes entropy image conversion, which strengthens the model’s deep learning capability, and an ensemble of four feature-selection methods (mRMR, Chi-square, ReliefF, and F-test) to decrease redundancy and increase accuracy. The paper employs three datasets, Kaggle, the Indian Diabetic Retinopathy Image Dataset (IDRID), and MESSIDOR-2, and assesses Naive Bayes classifiers, K-Nearest Neighbors (KNN), and Support Vector Machines (SVM). SVMs outperformed KNN and Naive Bayes, with a best accuracy of 97.8%.
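Entropy image conversion replaces each pixel with the Shannon entropy of its neighborhood, emphasizing texture irregularities. A minimal sketch with scikit-image, where the file name and window radius are assumptions:

```python
from skimage import io, img_as_ubyte
from skimage.color import rgb2gray
from skimage.filters.rank import entropy
from skimage.morphology import disk

gray = img_as_ubyte(rgb2gray(io.imread("fundus.png")))  # hypothetical input image
# Local Shannon entropy over a radius-5 circular window; lesioned,
# texture-rich regions tend to produce higher entropy values.
entropy_img = entropy(gray, disk(5))
```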
A study by Yasashvini et al. [31] proposed a deep learning approach to classify diabetic retinopathy (DR) stages from retinal images. Three models were assessed: a plain CNN, a hybrid CNN-ResNet, and a hybrid CNN-DenseNet. On the Kaggle APTOS 2019 Blindness Detection dataset, all models performed well, but the hybrid CNN-DenseNet model performed best, with an accuracy of 96.22%. The study acknowledges limitations, however, including its small sample size and the lack of cross-validation for assessing generalizability.
A recent study by Adak et al. [32] focused on the automated detection of diabetic retinopathy (DR) severity stages from fundus images. The researchers employed transformer-based models to efficiently extract important features from retinal images, enabling a detailed assessment of DR severity. They used an ensemble of four models: Class-Attention in Image Transformers (CaiT), Bidirectional Encoder representation for Image Transformers (BEiT), Data-efficient Image Transformers (DeiT), and Vision Transformers (ViT). Together, these models assessed fundus photographs to determine the severity of DR. They also introduced two ensembled image transformers, EiTmv and EiTwm: EiTmv uses majority voting, while EiTwm employs a weighted-mean combination of the model outputs. Experiments on the publicly available APTOS-2019 blindness detection dataset produced promising results: EiTwm achieved an accuracy of 94.63%, while EiTmv attained 91.26%. Despite the dataset’s inherent imbalance, the models performed exceptionally well, surpassing contemporary state-of-the-art methods in detecting DR severity stages.
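The two combination rules are easy to sketch with NumPy: majority voting takes the most common predicted class across models, while the weighted mean averages their probability outputs with per-model weights. The stand-in probabilities and weights below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-ins for the (N, K) softmax outputs of the four transformer models
probs = [rng.dirichlet(np.ones(5), size=10) for _ in range(4)]

def majority_vote(probs):
    votes = np.stack([p.argmax(axis=1) for p in probs])        # (models, N)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

def weighted_mean_vote(probs, weights):
    avg = np.tensordot(weights, np.stack(probs), axes=1)       # (N, K) weighted average
    return avg.argmax(axis=1)

print(majority_vote(probs))
print(weighted_mean_vote(probs, weights=[0.4, 0.3, 0.2, 0.1]))  # illustrative weights
```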
Luo et al.’s study [33] proposed a deep CNN for DR detection that captures both local and long-range dependencies in retinal fundus images. The method has two components: a patch-wise relationship learning module, which improves local patch features by exploiting the relationships between patches, and a Long-Range unit, which captures the long-range dependencies of lesion features dispersed across the image. The methodology was assessed on the MESSIDOR and E-Ophtha datasets, where it outperformed state-of-the-art techniques on DR detection tasks, achieving accuracies of 92.1% for normal/abnormal and 93.5% for referable/non-referable classification on MESSIDOR, and 83.6% on E-Ophtha.
The goal of a study by Jena et al. [34] was to create a new deep learning-based DR screening method using asymmetric deep learning features. It used the freely accessible APTOS and MESSIDOR datasets, both of which include images with and without DR across the various DR stages. The research extracted asymmetric deep learning features from the retinal images using a U-Net architecture, and the extracted features were classified into DR and non-DR groups using a CNN and an SVM. The CNN was trained on the APTOS dataset and assessed on the MESSIDOR dataset. The method identified non-DR images with accuracies of 98.6% and 91.9% on the APTOS and MESSIDOR datasets, respectively, and detected exudates with accuracies of 96.7% and 98.3%.
In a study by Usman et al. [35], the authors proposed a deep learning model that uses principal component analysis (PCA) for multi-label feature extraction and classification to detect and classify DR. The model used the publicly accessible color fundus photographs (CFPs) of the Kaggle Diabetic Retinopathy Detection dataset, labeled with five DR stages: no DR, mild DR, moderate DR, severe DR, and proliferative DR. The model, built on top of a pre-trained CNN architecture, comprises two primary parts: a feature extraction module, which uses PCA to extract useful features from the CFPs, and a classification module, which assigns the extracted features to the five DR stages using a CNN. The proposed model yielded an accuracy of 94.40%, a sensitivity of 76.35%, an F1 score of 72.87%, and a Hamming loss of 0.0560.
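A minimal sketch of PCA-based feature reduction with scikit-learn; the stand-in feature matrix and the number of retained components are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

feats = np.random.default_rng(0).normal(size=(500, 2048))  # stand-in CNN features
pca = PCA(n_components=256)          # number of retained components is illustrative
reduced = pca.fit_transform(feats)   # (500, 256) decorrelated features for the classifier
```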
Research by Ali et al. [36] presented a novel DL model for the early classification of DR. To maximize recognition accuracy, the model uses color fundus images and concentrates on the most critical components of the disease while excluding unimportant elements. Eighty percent of the Kaggle Diabetic Retinopathy Detection dataset was used for training and twenty percent for testing. The model employs a hybrid deep learning classification technique combining ResNet50 and Inceptionv3, both used for feature extraction: ResNet50 improved performance without excessive complexity, while Inceptionv3 used varying filter widths to reduce parameters and computational cost. Image preprocessing included histogram equalization, intensity normalization, and augmentation. The study yielded a precision of 96.46%, sensitivity of 99.28%, specificity of 98.92%, accuracy of 96.85%, and an F1 score of 98.65%.
Manjula et al. [37] proposed an ensemble machine learning technique for DR detection using a Kaggle dataset. The study investigated several machine learning techniques, including Random Forests, Decision Trees, K-Nearest Neighbor, the AdaBoost classifier, the J48graft classifier, and Logistic Regression. On the Kaggle dataset, the researchers achieved a 96.34% accuracy rate, demonstrating the promise of machine learning for early DR detection.
Alwakid et al. [38] built a deep learning model to classify the five stages of diabetic retinopathy (DR) in retinal images. They trained the model on two datasets (DDR and APTOS) containing high-resolution images of all DR stages (0–4). The study compared two approaches: one with image enhancement (CLAHE and ESRGAN) and one without. With enhancement, the DenseNet-121 model achieved a remarkable accuracy of 98.7% on the APTOS dataset; without it, accuracy dropped significantly to 81.23%. On the DDR dataset, the accuracies with and without enhancement were 79.6% and 79.2%, respectively.
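Of the two enhancement steps, CLAHE is straightforward to sketch with OpenCV: contrast-limited histogram equalization applied to the lightness channel only, so colors are preserved. The clip limit, tile size, and file name are illustrative:

```python
import cv2

img = cv2.imread("fundus.png")                    # hypothetical input image
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)        # work in LAB color space
l, a, b = cv2.split(lab)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced_lab = cv2.merge((clahe.apply(l), a, b))  # equalize the lightness channel only
enhanced = cv2.cvtColor(enhanced_lab, cv2.COLOR_LAB2BGR)
```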
Kumari et al. [39] proposed a deep learning method for automated DR detection using a ResNet-based neural network, investigating how well color and grayscale retinal images serve classification. Using 10-fold cross-validation, the model was trained and assessed on two datasets, EyePACS and APTOS; a ResNet architecture was used for feature extraction, and missing data was addressed by vectorization. The method achieved a high accuracy of 98.9% on high-quality images, showing potential for clinical application. However, accuracy fell to 94.9% on low-quality images, emphasizing how crucial image quality is for accurate DR detection.
Table 2 summarizes the techniques, datasets, and results of the multi-class classification studies reviewed above.
2.3. Joint Binary and Multi-Class Classification of DR
Without early detection and treatment, diabetic retinopathy can cause vision loss; once symptoms are detected, the disease’s severity must also be graded so that appropriate treatment can be given. Yaqoob et al. [40] proposed a deep learning-based method for categorizing and evaluating diabetic retinopathy images, employing a Random Forest classifier on the ResNet-50 feature map. The approach is compared with state-of-the-art models, including ResNet-50, VGG-19, Inception-v3, MobileNet, Xception, and VGG16, on two publicly available datasets, Messidor-2 and EyePACS. The Messidor-2 dataset has two classes of diabetic macular edema, “No Referable DME Grade” and “Referable DME Grade”, while the EyePACS dataset has five disease categories: no diabetic retinopathy, mild, moderate, severe, and proliferative diabetic retinopathy. The proposed method outperforms the comparable methods in accuracy, attaining 96% and 75.09% on the two datasets, respectively.
Novitasari et al. [41] proposed a hybrid fundus image classification system that combines Convolutional Neural Networks (CNNs) and Deep Extreme Learning Machines (DELMs) for identifying the stages of Diabetic Retinopathy (DR). The study utilized publicly accessible data in three configurations: two-class MESSIDOR, four-class MESSIDOR, and the Digital Retinal Images for Vessel Extraction (DRIVE) dataset. The hybrid approach uses CNNs together with Extreme Learning Machines (ELMs) [42] to assess the severity of DR. Five CNN architectures were employed to extract features from fundus images, ResNet-18, ResNet-50, ResNet-101, GoogleNet, and DenseNet, and the extracted features were then classified by the DELM. A key objective of the study was to identify the best CNN architecture for feature extraction from fundus images; the performance of the DELM was also evaluated with various kernel functions. The resulting CDELM method achieved remarkable results across all experiments, reaching 100% accuracy on the two-class task for both the MESSIDOR and DRIVE datasets, and a highest accuracy of 98.20% on the four-class MESSIDOR task. Despite these advantages, the CDELM approach faces challenges on extensive datasets: errors may occur during training when the DELM must multiply several large square matrices involving 60,000 or more data points.
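An ELM trains in closed form: the hidden layer’s weights are random and fixed, and only the output weights are solved by a least-squares pseudo-inverse, which is also the step that becomes costly at the 60,000+ sample scale noted above. A minimal NumPy sketch, with an illustrative hidden size and activation:

```python
import numpy as np

def train_elm(X, y_onehot, hidden=1000, seed=0):
    """Extreme Learning Machine: random fixed hidden layer, closed-form output weights."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], hidden))  # random input-to-hidden weights (never trained)
    b = rng.standard_normal(hidden)
    H = np.tanh(X @ W + b)                         # hidden activations, shape (N, hidden)
    beta = np.linalg.pinv(H) @ y_onehot            # least-squares solve for the output weights
    return W, b, beta

def predict_elm(X, W, b, beta):
    return (np.tanh(X @ W + b) @ beta).argmax(axis=1)
```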
A reliable approach for screening and grading the stages of diabetic retinopathy was introduced in a study by Shakibania et al. [43]; the technique shows significant potential for enhancing clinical decision-making and patient care in terms of classification accuracy and other metrics. The research presents a deep learning approach that identifies and grades diabetic retinopathy stages from a single fundus retinal image. The model uses two sophisticated pre-trained networks, ResNet50 and EfficientNetB0, as feature extractors, refining them on new data through transfer learning. The data was divided into subsets for training (70%), validation (10%), and testing (20%), and class imbalance was addressed with the complement cross-entropy (CCE) loss function. Three datasets were used to train the model: the Indian Diabetic Retinopathy Image Dataset (IDRID), MESSIDOR-2, and APTOS 2019 Blindness Detection. In binary classification, the proposed technique achieves 98.50% accuracy, 99.46% sensitivity, and 97.51% specificity; in stage grading, it achieves a 93.00% quadratic weighted kappa, 89.60% accuracy, 89.60% sensitivity, and 97.72% specificity.
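The CCE loss augments standard cross entropy with a complement-entropy term that flattens the predicted distribution over the incorrect classes, which helps under class imbalance. The sketch below follows the published CCE formulation as commonly stated; the exact scaling and the default gamma should be treated as assumptions:

```python
import torch
import torch.nn.functional as F

def complement_cross_entropy(logits, target, gamma=-1.0):
    """Cross entropy plus a balanced complement-entropy term (sketch).

    The complement term is the entropy of the predicted distribution
    renormalized over the non-target classes; with gamma < 0, minimizing
    the loss pushes that distribution toward uniform."""
    ce = F.cross_entropy(logits, target)
    probs = F.softmax(logits, dim=1)
    k = probs.size(1)
    p_true = probs.gather(1, target.unsqueeze(1))                   # (N, 1) true-class probability
    mask = torch.ones_like(probs).scatter(1, target.unsqueeze(1), 0.0)
    p_comp = probs / (1.0 - p_true + 1e-7) * mask                   # renormalized wrong-class probs
    comp_entropy = -(p_comp * torch.log(p_comp + 1e-7)).sum(dim=1).mean()
    return ce + (gamma / (k - 1)) * comp_entropy
```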
Table 3 summarizes the techniques, datasets, and results of the joint binary and multi-class classification studies reviewed above. It is worth noting that relatively few studies target joint detection and classification of DR. Another observation is that the binary classification results, in terms of accuracy and other metrics, are better than the multi-class results. This is understandable: as the granularity of the classification increases, accuracy is somewhat compromised.