1. Introduction
Diabetic retinopathy (DR) is a retinal disease that impairs vision. Elevated blood sugar damages the light-sensitive layer of the retina [1]. Without treatment, these degenerative processes can lead to visual impairment and, ultimately, complete loss of vision [2]. Timely diagnosis of the disease is necessary so that laser or drug therapy can be carried out quickly, which can slow down the progression of the disease and, in many cases, protect against blindness [3]. Even if patients initially show no symptoms, regular monitoring with eye examinations is crucial to prevent irreversible progression. Since the introduction of artificial intelligence (AI) in ophthalmology, the detection of DR has improved, and machine learning (ML) technologies are now being used to analyze vast numbers of eye images [4]. Nevertheless, such an approach has limitations that make it difficult to use AI for DR diagnosis. The success of AI-based techniques depends heavily on the reliability of the retinal images: systems designed to analyze images are susceptible to conditions such as under- and overexposure, camera shake, and other distortions. This is especially true for convolutional neural networks (CNNs), which are very sensitive to image quality during training, and degraded inputs can change the overall functioning of the system [5]. Furthermore, an unbalanced ratio of DR-negative and DR-positive cases increases the probability of false negatives, because classification is sensitive both to the structural and functional characteristics of a retinal image and to the differences between DR cases.
Consistent development and diverse data utilization are necessary to improve the performance and applicability of AI in DR diagnosis and treatment. Deep learning (DL) is an advanced branch of ML that uses models such as CNNs to examine images in great detail in search of DR indicators [6]. Although recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) have been used in various architectures, CNNs have prevailed in this regard. These AI-based approaches have already been shown to work with demonstrable efficiency, with DR being adequately diagnosed by more than a few systems, clearly indicating that AI has the potential to address DR at early stages.
Wide-field OCTA (WF-OCTA) provides detailed information about the peripheral retina and thus improves the diagnosis of DR. The growing number of diabetic patients has increased the need for automated DR diagnostic systems that rely on WF-OCTA imaging [7]. One method uses Vision Transformers (VIT) to automatically diagnose DR from fovea-centered WF-OCTA images [8], demonstrating the effectiveness of VIT in detecting and grading DR. This inspired the development of a transformer for detecting DR grades that divides images into patches, converts them into sequences, and processes them through multi-head attention layers. This method showed the potential of VIT for DR detection.
Several studies have proposed hybrid DL methods that combine fine-tuned vision transformers with a modified capsule network for predicting the severity of DR. These methods include preprocessing steps such as image transformation and adaptive histogram equalization. Using the APTOS, Messidor-2, DDR, and EyePACS datasets, adequate accuracy was achieved that outperforms trending methods [9]. Thus, these methods show that a well-designed computer-aided diagnosis (CAD) system has promising efficiency in detecting DR. In particular, DL algorithms, especially those using VIT and self-attention networks (SAN), show significant potential for reliably detecting DR stages.
VIT demonstrates improved DR detection compared to traditional CNN methods. These transformers can be equipped with Masked Autoencoders (MAE), which further improve effectiveness in classifying the different stages of DR. Self-attention is likewise a promising and substantial approach for recognizing the different DR stages: the SAN architecture considers different retinal features across different DR stages and thus improves generalization ability. In this study, we utilize a hybrid model of SAN and VIT to improve the accuracy of detection for the different disease phases by applying vision techniques. The contributions of this work are presented below:
Review of the essential literature on DR to understand the adaptability of DR detection approaches;
Examination of the applicability of different DR datasets with their corresponding phases;
Description of the various preprocessing steps applied to the DR dataset;
Implementation of the SAN, VIT, and HybridFusionNet architectures;
Evaluation of binary and multi-class classification using various performance parameters, such as accuracy, ROC, and AUC curves;
Comparative analysis of the SAN, VIT, and HybridFusionNet models against trending methods such as ResNet, AlexNet, and VGG16.
These contributions are described in various sections of this article.
Section 2 reviews the relevant literature in the field of DR detection.
Section 3 describes the dataset, the preprocessing steps, and the SAN, VIT, and HybridFusionNet methods.
Section 4 presents and discusses the results.
Section 5 summarizes the findings, highlights potential future research topics, and provides a roadmap for further progress in this discipline.
2. Literature Review
The fundus is the posterior part of the eye that contains the retina, the optic nerve, and the blood vessels [10]. With the pupil dilated, a special camera photographs the back of the eye; the procedure takes only a few minutes. Fundus images are not medically necessary to document the presence of DR [11]; however, they may be medically necessary to provide a baseline for assessing the progression of the disease. In addition, fundus imaging can aid in the interpretation of fluorescein angiography, as certain retinal landmarks visible on fundus images cannot be seen on fluorescein angiograms [12]. It is important that the eyes are dilated before the procedure, since dilated pupils give technicians a better view of the back of the eye.
A key drawback of several existing DR screening techniques that rely on fundus photography is their dependence on two-dimensional fundus images [13]. OCT has become a popular DR screening technology because it allows direct two- and three-dimensional viewing of histologic changes in the layered retinal structures and accurate quantitative evaluation with ultra-high scan rate and resolution [14]. OCT has the advantage of being non-contact and non-invasive; it captures high-resolution images, measures the thickness of the retina and the retinal layers, and acquires the images quickly. Although image quality is often better with a dilated pupil, OCT can often be performed on an undilated patient. Equipment cost, media opacity limitations, operator skill and training requirements, difficulty in obtaining images from patients unable to fixate, and imaging artefacts introduced by automation tools are some of the problems associated with OCT images [15].
Machine vision is a technique for enhancing or extracting information from an image through computational procedures. The basic steps of machine-vision-based diagnostic solutions for DR include preprocessing of fundus images, retinal vessel segmentation, optic disc localization, red lesion extraction, bright lesion extraction, and finally DR detection. One image processing solution is specially designed for DR screening: after preprocessing the fundus image, certain features such as the location of blood vessels, exudates, and structural aspects are extracted, and these features are categorized into different stages, from normal to proliferative, using a Support Vector Machine (SVM). The image resolution used was 256 × 256 pixels, and the model was evaluated on a publicly available dataset [16].
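A minimal sketch of this classical pipeline, assuming hand-crafted features have already been extracted; the synthetic features, labels, and SVM settings below are illustrative stand-ins, not the cited study's implementation:

```python
# Classical DR screening sketch: hand-crafted features -> SVM classifier.
# Feature extraction is replaced by synthetic data for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))      # e.g., 13 hand-crafted features per image
y = rng.integers(0, 2, size=500)    # 0 = no DR, 1 = DR (synthetic labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
clf = SVC(kernel="rbf", C=1.0).fit(scaler.transform(X_tr), y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(scaler.transform(X_te))))
```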
Using the Fisher method and mutual information, an advanced feature set with thirteen essential elements was constructed and evaluated with classifiers such as Bare Logistic, Multi-Layer Perceptrons, and Sequential Minimal Optimization. This approach achieved 99.73% accuracy when analyzing retinal fundus images from Bahawal Victoria Hospital, Pakistan, and 98.83% on alternative public datasets. A subsequent study examined a dataset [17] with 3662 retinal fundus images using a CNN for feature inference and dimensionality reduction. An SVM then effectively categorized the images into the DR stages, from no DR through mild, moderate, and severe to proliferative, achieving an impressive 98.4% accuracy.
Another study [18] used the diaretDB1 dataset [19], which consisted of 89 live fundus images, 84 of which showed signs of microaneurysms while the rest were normal. In addition, data from the Retinal Vascular Disease Online Challenge [20] were used. A perceptron responsible for highlighting areas of interest extracted unique 19 × 19 pixel patches. Using a composite classification approach that includes techniques such as bottom-hat filtering and Radon feature extraction, the images were categorized into microaneurysm and non-microaneurysm classes. With the diaretDB1 dataset, the researchers achieved commendable sensitivity, precision, and specificity values of 92.32%, 95.93%, and 93.87%, respectively.
To broaden the study horizon, another study [21] focused on 80 non-dilated retinal images from Thammasat University Hospital [22]. Mathematical morphology was decisive for the extraction of eighteen crucial features. Naive Bayes classifiers were used to categorize the images into no, mild, moderate, and severe DR, achieving a remarkable sensitivity of 87.15% and values for precision and specificity of 99.99%. In addition, research initiatives have tapped into databases such as UTHSC, the Retinal Vascular Disease Online Challenge, and diaretDB1 and extracted 87 key features that preceded classification [23]. In a further step, a study using the diaretDB1 dataset [24] used 66 image attributes and classified images into primary, moderate, and advanced stages of DR, achieving an 88% accuracy rate with SVM and LR classifiers. Finally, in the study by [25], preprocessing techniques were applied to the DiaretDB1 dataset, and exudates and blood vessels were identified as crucial features; the obtained accuracy varies between 85.8% and 88.6% depending on the algorithm used.
Table 1 presents a detailed analysis of DR detection using different ML and DL models.
Table 2 summarizes the different studies on DR detection using different types of datasets.
Datasets for DR Screening. There are many publicly accessible datasets for DR and retinal vasculature detection. These datasets are often employed for system training, validation, testing, and system quality comparison.
Machine vision applications in DR screening have made significant progress in diagnosis and referral recommendations. The majority of machine vision studies reported thus far have primarily examined offline DR screening. According to several research papers, DR may be diagnosed using AI with excellent sensitivity and specificity. Fundus images can be assessed at varied levels of image quality, and image enhancement and restoration can be performed before DR diagnosis. A generalized model can be designed to diagnose DR across geographic variations, since diabetes is connected with geographic factors such as food habits and weather conditions, and acceptable accuracy is attained.
Self-learning neural networks perform exceptionally well in detecting DR by autonomously acquiring essential characteristics from extensive collections of labeled retinal images. These networks exhibit an exceptional capacity to detect subtle patterns and fluctuations, ensuring validated accuracy in spotting early indicators of DR. Their scalability enables efficient processing of diverse image datasets, including variances in quality and patient demographics. Furthermore, their speed and automation enhance the efficiency of the screening procedure, enabling early detection of DR at significant scale. With the increasing availability of data, these networks are constantly improving, providing promising opportunities for prompt intervention and the preservation of vision in diabetic patients.
3. Methods and Materials
We used a dataset from Kaggle containing retinal images, clinically scored on a scale of 0 to 4 for DR: 0 (no DR), 1 (mild), 2 (moderate), 3 (severe), and 4 (proliferative DR). Our goal is to develop an automated system that classifies these images based on the given scale. The details of the dataset and examples can be viewed at
https://www.kaggle.com/c/diabetic-retinopathy-detection/data (accessed on 21 November 2024).
Figure 1 shows the pixel intensity distributions for the different DR classes. Each subplot represents the frequency of pixel intensities for a particular DR class.
The histograms in Figure 1 show the pixel intensity distribution for the DR levels. The no-DR class shows a frequency concentrated around one intensity, indicating a relatively uniform distribution with fewer extreme values. The mild class shows a similar pattern, but is slightly shifted and shows lower frequencies compared to the no-DR class. The moderate class shows a peak in the same intensity range but with significantly lower frequencies, indicating a lower variance. The severe class has a lower frequency than the no-DR class, indicating some challenges, and the proliferative class has greater variability in pixel intensities. This analysis shows that the no-DR and mild classes have rather similar variation distributions, while the moderate, severe, and proliferative classes differ noticeably.
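The per-class histograms of Figure 1 can be reproduced along the following lines; the folder layout and class names below are assumptions for illustration, not the authors' exact script:

```python
# Per-class pixel-intensity histograms for the five DR grades.
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from pathlib import Path

classes = ["No_DR", "Mild", "Moderate", "Severe", "Proliferative"]
fig, axes = plt.subplots(1, len(classes), figsize=(20, 3), sharey=True)
for ax, name in zip(axes, classes):
    pixels = []
    for path in Path("data", name).glob("*.jpeg"):   # hypothetical layout
        pixels.append(np.asarray(Image.open(path).convert("L")).ravel())
    if pixels:                                       # skip empty folders
        ax.hist(np.concatenate(pixels), bins=64, range=(0, 255))
    ax.set_title(name)
    ax.set_xlabel("intensity")
axes[0].set_ylabel("frequency")
plt.tight_layout()
plt.show()
```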
3.1. HybridFusionNet
The HybridFusionNet model combines the functions of SAN and VIT. The model first uses SAN to detail the features in fundus images, which are later integrated by VIT to classify DR. The SAN mechanism computes attention scores that describe each pixel in relation to the others, calculated as follows:

$$\mathrm{Attention}(C, D, T) = \mathrm{softmax}\!\left(\frac{CD^{\top}}{\sqrt{d_k}}\right)T,$$

where C is the query matrix, D is the key matrix, T is the value matrix, and $d_k$ is the dimension of the key vectors. In order to capture extensive dependencies and features in the images, VIT divides the input fundus images into patches, which are arranged in linear order and passed to the encoder. VIT then stacks multiple layers of SAN and Feed-Forward Neural Networks (FFNN):

$$\mathrm{FFNN}(x) = Wx + b,$$

where W is the weight matrix and b is the bias term. The transformer encoder is applied with SAN, normalization, and FFNN. The model can perform both binary and multi-class classification: the binary classification refers to the presence or absence of DR, while the multi-class classification categorizes the DR levels into no DR, mild, moderate, severe, and proliferative.
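As a worked sketch of the attention equation above, using the paper's naming (C = query, D = key, T = value) with illustrative shapes:

```python
# Scaled dot-product attention with the paper's C/D/T naming.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(C, D, T):
    d_k = D.shape[-1]                      # dimension of the key vectors
    scores = C @ D.T / np.sqrt(d_k)        # pairwise relevance of tokens
    return softmax(scores) @ T             # weighted combination of values

n, d_k, d_v = 16, 32, 32                   # 16 tokens (patches/pixels)
rng = np.random.default_rng(0)
C, D, T = (rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k)),
           rng.normal(size=(n, d_v)))
print(attention(C, D, T).shape)            # (16, 32)
```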
The architecture in Figure 2 begins with a preprocessing phase in which the input fundus images are transformed.
The SAN presented after preprocessing consists of several input nodes (A1 to A4) representing input values connected through a fully connected network with nodes for resistance and similarity values (R1 to R4), generating attention values (G1 to G4) and computed values (D1 to D4) through operations such as addition and multiplication, as described in Algorithm 1. This network captures salient features and relationships within the retinal image and provides a comprehensive representation for classification. The SAN is followed by the VIT, which utilizes the transformer encoder mechanism and processes retinal image patches with position embeddings through a series of transformer encoders capable of capturing long-range dependencies and complex patterns. The linear projection of the flattened image patches is fed into the transformer encoder and then into a Multi-Layer Perceptron (MLP) head, which outputs the classification results. Both the SAN and the VIT provide results for binary and multi-class classification, which determine the presence and severity of DR in retinal images. This dual-output structure ensures a comprehensive analysis that takes into account both the presence and the progression of the disease. The architecture provides a robust solution for DR diagnosis through preprocessing, SAN, and VIT, which could improve early detection and management of the disease.
Algorithm 1 HybridFusionNet Model for Multi-Class and Binary-Class Classification
Require: Preprocessed fundus images and model parameters for SAN and VIT
Ensure: Binary-class or multi-class DR classification
1: procedure HybridFusionNet
2: Preprocess the input fundus images.
3: Apply SAN to compute attention scores.
4: Use SAN to compute query, key, and value matrices.
5: Extract deep features from fundus modalities.
6: Split features into patches (VIT).
7: Pass patches through the VIT encoder.
8: For the binary stage, detect the presence of DR.
9: For the multi-class stage, classify the DR stage.
10: Output the final DR classification result.
11: end procedure
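A hedged PyTorch sketch of the pipeline in Algorithm 1 is given below; applying the SAN stage directly to the patch tokens, as well as all layer sizes and the mean-pooled readout, are our assumptions for illustration rather than the authors' exact design:

```python
# HybridFusionNet sketch: SAN feature refinement -> ViT-style encoder
# -> binary and multi-class heads.
import torch
import torch.nn as nn

class HybridFusionNetSketch(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=128, n_classes=5):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, patch, stride=patch)   # patch embedding
        n_patches = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.san = nn.MultiheadAttention(dim, 4, batch_first=True)  # SAN stage
        enc = nn.TransformerEncoderLayer(dim, 8, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)     # ViT stage
        self.binary_head = nn.Linear(dim, 2)          # DR vs. no DR
        self.multi_head = nn.Linear(dim, n_classes)   # five DR grades

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2) + self.pos  # (B, N, dim)
        attn_out, _ = self.san(tokens, tokens, tokens)  # self-attention scores
        tokens = tokens + attn_out                      # residual fusion
        z = self.encoder(tokens).mean(dim=1)            # pooled representation
        return self.binary_head(z), self.multi_head(z)

b_logits, m_logits = HybridFusionNetSketch()(torch.randn(2, 3, 224, 224))
print(b_logits.shape, m_logits.shape)  # torch.Size([2, 2]) torch.Size([2, 5])
```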
In the dataset, the number of images differed between the categories. This imbalance can lead to problems if models become biased in classification. The data augmentation approach creates new images by slightly modifying existing images, mainly for the group with fewer images. The image is rotated slightly to the left or right: for an image matrix M, a new image $M'$ is created by rotating it by an angle $\theta$:

$$M' = R_{\theta} M, \qquad R_{\theta} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}.$$

The image is then flipped. Thus, for an image M with H rows and W columns, the flipped versions are $M_h(i, j) = M(i, W - j + 1)$ for the horizontal flip and $M_v(i, j) = M(H - i + 1, j)$ for the vertical flip. Zooming can be thought of as enlarging or shrinking part of an image: by resampling the image matrix M with a factor s, we obtain

$$M'(i, j) = M(i/s,\; j/s).$$

Cropping is a further crucial preprocessing step. Additionally, we used SMOTE, the Synthetic Minority Over-sampling Technique. For two images a and b, a new image c is synthesized as

$$c = a + \lambda\,(b - a),$$

where $\lambda$ is a random value that lies between 0 and 1. SMOTE first picks two images (or data points) from the minority class. These images belong to the same group, which means they have characteristics in common; in a set of images showing diabetic retinopathy, for instance, both a and b might be marked as showing a certain grade of retinopathy. Usually, a and b are picked based on how similar their features are, or how close they are to each other in the feature space, using methods such as k-nearest neighbors. This ensures that a and b are somewhat alike and that the synthetic images between them match the traits of the minority class. In our experiments, the value of $\sigma$ (sigma) used in the Gaussian filter was 1.5. This value determines the extent of smoothing applied to the image: larger values of $\sigma$ result in greater smoothing, whereas smaller values preserve finer details. For DR images, $\sigma = 1.5$ was found to be effective in reducing noise. After filtering, the original image P becomes $P'$:

$$P' = P * G,$$

where G is the Gaussian kernel

$$G(x, y) = \frac{1}{2\pi\sigma^{2}}\, e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}}.$$

To ensure that all images have the same level of clarity, we used the same $\sigma$ value and kernel size for the Gaussian filter on all of them. The Gaussian curve's width, or spread, is determined by $\sigma$: a bigger $\sigma$ gives a smoother, broader distribution, while a lower $\sigma$ gives one that is narrower and more concentrated around the center.
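The augmentation and balancing steps above can be sketched as follows; the rotation angle, zoom factor, and the synthetic stand-in images are illustrative, while sigma = 1.5 follows the text:

```python
# Rotation, flips, zoom, SMOTE-style interpolation, and Gaussian smoothing.
import numpy as np
from scipy.ndimage import rotate, zoom, gaussian_filter

rng = np.random.default_rng(0)
M = rng.random((128, 128))                       # stand-in grayscale fundus image

rotated   = rotate(M, angle=10, reshape=False)   # small rotation
flipped_h = np.fliplr(M)                         # horizontal flip
flipped_v = np.flipud(M)                         # vertical flip
zoomed    = zoom(M, 1.2)[:128, :128]             # zoom in, then crop to size

# SMOTE-style synthesis: c = a + lambda * (b - a), lambda in (0, 1),
# where a and b are neighboring minority-class samples.
a, b = rng.random((128, 128)), rng.random((128, 128))
lam = rng.uniform(0, 1)
c = a + lam * (b - a)

smoothed = gaussian_filter(M, sigma=1.5)         # noise reduction (sigma = 1.5)
print(rotated.shape, c.shape, smoothed.shape)
```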
3.2. Self-Attention Network
The SAN mechanism in Algorithm 2 is a sophisticated filtering process that selectively highlights parts of the input data, allowing the model to effectively capture and utilize contextual relationships within the data, as shown in Figure 3. This improves the model's ability to understand and process the complex patterns of the DR stages.
In this approach, each sequence element is assigned a pointer coefficient, which is calculated by evaluating the alignment between S and D. This alignment serves as an indicator of the relevance of the associated value O, where O represents the output values weighted by the alignment score between S and D. A special feature of the SAN method is the equality of S, D, and O, which are all derived from the same input sequence:

$$\mathrm{Attention}(S, D, O) = \mathrm{softmax}\!\left(\frac{SD^{\top}}{\sqrt{d_D}}\right)O,$$

where Attention stands for the SAN mechanism, which processes the three matrices S, D, and O; S represents the sensor matrix, analogous to the query in attention mechanisms, while D and O are analogous to the key and value; $D^{\top}$ is the transpose of the descriptor matrix D; and $\sqrt{d_D}$ denotes the square root of the dimension of the descriptor matrix, used for scaling in the attention mechanism to maintain the stability of the gradients.
The term $\mathrm{head}_i$ depicts the outcome of the Attention function for the i-th mapping in multi-head attention. The matrices $W_i^S$, $W_i^D$, and $W_i^O$ are parameters for the i-th mapping, specifically transforming S, D, and O:

$$\mathrm{head}_i = \mathrm{Attention}\!\left(SW_i^S,\; DW_i^D,\; OW_i^O\right).$$

The Multi-attention function integrates the results from the individual heads and acts as an extended version of the SAN mechanism:

$$\mathrm{MultiHead}(S, D, O) = \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right)W^{M}.$$
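A compact numerical sketch of this multi-head extension, with illustrative dimensions and randomly initialized projection matrices:

```python
# Multi-head attention over the paper's S/D/O matrices.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(S, D, O):
    return softmax(S @ D.T / np.sqrt(D.shape[-1])) @ O

def multi_head(S, D, O, n_heads=4, d_head=16):
    rng = np.random.default_rng(0)
    d_model = S.shape[-1]
    heads = []
    for _ in range(n_heads):
        # Per-head projections W_i^S, W_i^D, W_i^O
        W_s, W_d, W_o = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(attention(S @ W_s, D @ W_d, O @ W_o))
    W_out = rng.normal(size=(n_heads * d_head, d_model))  # output projection
    return np.concatenate(heads, axis=-1) @ W_out

S = D = O = np.random.default_rng(1).normal(size=(10, 64))
print(multi_head(S, D, O).shape)  # (10, 64)
```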
Algorithm 2 SAN for Image Classification
Require: Feature map X with dimensions (batch_size, channels, height, width) and weight matrices
Ensure: Feature map with SAN applied
1: procedure SelfAttention
2: Optionally reduce the dimensionality of X with a convolution
3: Compute Query (Q), Key (K), and Value (V) via convolution
4: Calculate the attention scores
5: Optionally apply a mask to the attention scores
6: Normalize the scores using Softmax
7: Compute the weighted sum of V using the attention weights
8: Optionally apply a linear transformation
9: Add X to the weighted sum (residual connection)
10: Output the feature map with self-attention
11: end procedure
The SAN mechanism in CNNs starts with an input feature map X with specific dimensions and optional parameters for the attention mechanism. At the beginning of the process, a 1 × 1 convolution is applied to X for dimensionality reduction [29]. Q, K, and V are derived by convolving X with their respective weight matrices. Then, the dot product of Q and K is computed, and a mask can be applied to these dot products to selectively control attention. The attention weighting is achieved by the softmax activation function. The SAN feature map is obtained by multiplying these weights by V and adding the result to the original input. Repeating this process over the input batch yields SAN-enhanced feature maps ready for the subsequent CNN layers.
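A hedged PyTorch sketch of Algorithm 2, using 1 × 1 convolutions for Q, K, and V and a residual connection; the channel-reduction factor and the learnable residual scale are assumptions for illustration:

```python
# Self-attention block over a CNN feature map (Algorithm 2 sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or max(channels // 8, 1)   # optional reduction
        self.q = nn.Conv2d(channels, reduced, 1)     # 1x1 conv -> Query
        self.k = nn.Conv2d(channels, reduced, 1)     # 1x1 conv -> Key
        self.v = nn.Conv2d(channels, channels, 1)    # 1x1 conv -> Value
        self.gamma = nn.Parameter(torch.zeros(1))    # learnable residual scale

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)     # (B, HW, C')
        k = self.k(x).flatten(2)                     # (B, C', HW)
        v = self.v(x).flatten(2)                     # (B, C, HW)
        attn = F.softmax(q @ k / (k.shape[1] ** 0.5), dim=-1)  # (B, HW, HW)
        out = (v @ attn.transpose(1, 2)).reshape(b, c, h, w)   # weighted sum of V
        return x + self.gamma * out                  # residual connection

x = torch.randn(2, 64, 32, 32)
print(SelfAttention2d(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```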
VIT is a DL model that applies the transformer architecture to image classification tasks, as shown in Figure 4. VITs operate on image patches and use the SAN mechanism to model global relationships within an image.
An input image X of dimensions (batch_size, channels, height, width) is divided into non-overlapping patches $x_p^1, \ldots, x_p^n$, which are flattened and linearly projected:

$$z_i = x_p^i E,$$

where E is the learnable projection matrix.
The transformer encoder consists of multiple layers of multi-head SAN and FFNN, each with residual connections and layer normalization. Queries (A), Keys (B), and Values (C) are computed from the input embeddings Z:

$$A = ZW_A, \qquad B = ZW_B, \qquad C = ZW_C.$$

Attention scores are computed as

$$\mathrm{Attention}(A, B, C) = \mathrm{softmax}\!\left(\frac{AB^{\top}}{\sqrt{d_B}}\right)C.$$

The output of the self-attention mechanism is passed through a FFNN:

$$\mathrm{FFNN}(x) = \sigma(xW_1 + b_1)\,W_2 + b_2,$$

where $\sigma$ is a nonlinear activation. The output embedding corresponding to the classification token [CLS] is fed into a linear classifier:

$$\hat{y} = \mathrm{softmax}\!\left(z_{\mathrm{[CLS]}}W + b\right).$$

The VIT model is trained on the labeled dataset using the cross-entropy loss for multi-class classification:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log\hat{y}_{i,c},$$

where N represents the number of samples, C the number of classes, y the ground-truth labels, and $\hat{y}$ the predicted probabilities. The performance of the model is evaluated using metrics such as accuracy, the confusion matrix (CM), and AUC-ROC to ensure effective classification of DR levels. By leveraging VIT's ability to model long-range dependencies and global context, this approach has shown promising results in medical imaging tasks, including DR detection, and provides an alternative to traditional CNN-based methods.
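A minimal ViT classification sketch matching the equations above (patch projection, [CLS] token with position embeddings, transformer encoder, linear head with cross-entropy); the hyperparameters are illustrative, not the paper's configuration:

```python
# Tiny ViT sketch: patches -> [CLS] + positions -> encoder -> linear head.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, dim=128, depth=4, heads=8, n_cls=5):
        super().__init__()
        n = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)  # linear patch projection
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))  # position embeddings
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, n_cls)                    # linear classifier

    def forward(self, x):
        t = self.embed(x).flatten(2).transpose(1, 2)         # (B, N, dim)
        t = torch.cat([self.cls.expand(len(x), -1, -1), t], dim=1) + self.pos
        return self.head(self.encoder(t)[:, 0])              # classify on [CLS]

logits = TinyViT()(torch.randn(2, 3, 224, 224))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 3]))   # cross-entropy loss
print(logits.shape, float(loss))
```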
3.3. Multi-Class Classification
In this hybrid model, we classify the stages of DR using a training and validation approach with the defined model, which helps us to recognize the differences between these stages, as in Algorithm 3.
Algorithm 3 Multi-class classification using SAN
Require: Disease categories (mild DR, moderate DR, severe DR, proliferative DR, non-DR), data directory, input image size, number of epochs
Ensure: Test accuracy
1: procedure DRImageClassification
2: Prepare and preprocess data: resize and convert images to grayscale
3: Define and compile the attention CNN model:
   Input layer, convolutional layers, attention mechanism, global average pooling, fully connected layers
   Compile with an optimizer and cross-entropy loss
4: Train the model:
   Split data into training and validation sets
   Train for the specified epochs, validating during training
5: Evaluate the model on the test set and record the test accuracy
6: end procedure
Algorithm 4 for DR classification using a VIT was developed to categorize retinal images into several levels: mild DR, moderate DR, severe DR, proliferative DR, and non-DR. The process begins with data preparation, where the images and associated labels are loaded from the specified data directory. The images are scaled to the required input size and normalized so that the model can effectively learn from the image data.
Algorithm 4 Multi-class classification using VIT
Require: Disease categories (mild DR, moderate DR, severe DR, proliferative DR, non-DR), data directory, input image size, number of epochs
Ensure: Test accuracy
1: procedure DRImageClassification
2: Prepare and preprocess data: resize and normalize images
3: Define and compile the VIT model:
   Input layer, patch embedding, position embedding, transformer encoder layers, classification head
   Compile with an optimizer and categorical cross-entropy loss
4: Train the model:
   Split data into training and validation sets
   Train for the specified epochs, validating during training
5: Evaluate the model on the test set and record the test accuracy
6: end procedure
Then, the vision transformer model is included in the pipeline for the DR process. The model starts with an input layer that processes image patches and is connected to a patch embedding layer, which converts the patches into vectors. A position embedding layer is added to the patch embeddings to retain the position information within the image. The core of the model consists of multiple transformer encoder layers, each utilizing SAN mechanisms and FFNNs to extract features from the data. The classification head then performs classification over the stages of DR with a linear layer. After defining the model, it is compiled using an optimizer and the categorical cross-entropy loss function commonly used for DL classification tasks. In the training phase, the data are split into training and validation sets; the model is trained over several epochs, performance is measured, and overfitting is avoided. After evaluating the model, the test accuracy is recorded; the final test provides an effective classification of DR stages and a reliable tool for early detection.
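A hedged sketch of the training procedure in Algorithms 3 and 4; the stand-in model, the Adam optimizer, and the synthetic data are assumptions for illustration:

```python
# Train/validation split, epoch loop with cross-entropy, validation accuracy.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

data = TensorDataset(torch.randn(200, 3, 64, 64), torch.randint(0, 5, (200,)))
train_set, val_set = random_split(data, [160, 40])    # training/validation split
train_dl = DataLoader(train_set, batch_size=16, shuffle=True)
val_dl = DataLoader(val_set, batch_size=16)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 5))  # stand-in model
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    model.train()
    for x, y in train_dl:                             # train for an epoch
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    model.eval()
    with torch.no_grad():                             # validate during training
        correct = sum((model(x).argmax(1) == y).sum().item() for x, y in val_dl)
    print(f"epoch {epoch}: val accuracy = {correct / len(val_set):.2f}")
```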
4. Result Analysis and Discussion
DL models such as ResNet, AlexNet, and VGG16, along with Vision Transformers (VIT), self-attention networks (SAN), and the hybrid model (HybridFusionNet), are used in the development of this CAD system to identify DR. SAN and VIT are used in the initial evaluation phase. The SAN model uses an attention mechanism to evaluate features and to categorize images based on the extracted features. In binary classification with SAN, the results were encouraging, as the model efficiently discriminates between DR and non-DR cases. However, when the categorization was extended to the five DR classes, the problems became more apparent, requiring further modifications and a stronger emphasis on recognition performance.
Figure 5 illustrates the various evaluation parameters calculated for the analysis of DR using the SAN architecture. Figure 5a shows the consistency of the model's classification accuracy across training and validation, while Figure 5b describes the multi-class classification accuracy. The accuracy of the model increased steadily, with training and validation curves close to each other, indicating a significant ability to discriminate between the two binary classes. For the multi-class task, the validation accuracy was slightly lower than the training accuracy, indicating some difficulties in reliably categorizing multiple classes. This gap suggests that the model may need further modification to improve generalization across all class categories and maintain consistent performance in class prediction. The distribution of positive and negative cases over correct and false predictions is shown in Figure 5c. The model shows near error-free performance in detecting "No DR" instances, but has some difficulty in identifying "DR", with 87% correctly categorized and 34 cases misclassified as "No DR". The multi-class classification, see Figure 5d, presents new difficulties, namely in distinguishing between the stages of DR. The prediction accuracy for "No DR" is 91%; however, there are significant difficulties in distinguishing between the mild, moderate, and severe stages, with some of these stages identified accurately in only 43% and 69% of instances. The Receiver Operating Characteristic (ROC) curves for binary classification in Figure 5e show that the model produces robust prediction results. In contrast, the multi-class classification in Figure 5f shows inconsistent effectiveness across the different phases of DR. The model performs consistently well in identifying "No DR", achieving an AUC of 98%; however, as the complexity of the classes (mild, moderate, severe, and proliferative) increases, the AUC decreases, especially for the more advanced stages.
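For reference, the confusion matrices and ROC/AUC values discussed here are typically computed as follows; the predictions below are synthetic stand-ins, not the paper's outputs:

```python
# Confusion matrix and ROC/AUC with scikit-learn on synthetic predictions.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)              # binary: DR vs. no DR
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, 300), 0, 1)
y_pred = (y_score >= 0.5).astype(int)

print(confusion_matrix(y_true, y_pred))            # counts per true/predicted class
print("AUC:", roc_auc_score(y_true, y_score))
fpr, tpr, _ = roc_curve(y_true, y_score)           # points of the ROC curve
```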
In the next phase, the VIT is evaluated. As shown in Figure 6a, the accuracy curve shows a consistent increase in both training and validation. The model shows potential effectiveness in discriminating between the two categories ("No DR" and "DR") without overfitting, as the validation performance is very close to the training performance. Figure 6b illustrates similar patterns in multi-class classification, where the training and validation accuracies almost converge to consistent levels. Although the model shows strong performance across multiple categories, the validation accuracy hardly varies, suggesting that discriminating between more than two categories is not a major challenge. Figure 6c shows the confusion matrix for the binary classification, demonstrating how well the VIT model can distinguish between the "No DR" and "DR" situations, with only a few false positives. In Figure 6d, the model correctly identifies the majority of "No DR" cases, but becomes increasingly confused as the degree of DR increases. The model struggles to distinguish between adjacent severity levels, resulting in incorrect labeling among the mild, moderate, severe, and proliferative segments. Apart from these problems, the model still performs well overall, although it could be better at discriminating between finer DR levels. In Figure 6e, the model's ROC value (AUC) of 99% confirms its effectiveness and reliability in binary classification. The multi-class ROC shown in Figure 6f is robust but shows greater variability between the different classes. The model is effective in detecting DR but shows more modest AUCs for the later stages, particularly for the more advanced categories, where the AUC drops to 83%. This suggests that the model performs excellently for binary classification, while fine-grained grading remains more challenging.
To improve the detection accuracy, we fed the preprocessed dataset into the proposed hybrid model of SAN and VIT (HybridFusionNet), shown in Figure 7. For both the binary and multi-class classification tasks, the HybridFusionNet model delivers sustained performance. The training accuracy in Figure 7a tends to increase steadily, while the validation accuracy remains stable. For the multi-class classification in Figure 7b, the model attains reasonable accuracy, with the validation accuracy falling only slightly below the training accuracy. The model is highly optimized, but may need further refinement to cope with the complexity of the multi-class task. The confusion matrix for the binary task in Figure 7c quantifies the performance, showing a minimal number of misclassifications between "DR" and "non-DR". The multi-class confusion matrix in Figure 7d reveals confusion between neighboring phases of DR, such as mild and moderate, but classifies "No DR" exactly. The binary ROC in Figure 7e shows almost perfect performance with an AUC of 99%. In contrast, the multi-class ROC in Figure 7f shows uneven performance, with "No DR" achieving the highest AUC of 99% and the weakest class an AUC of 83%. In general, the HybridFusionNet model shows exceptional performance in binary classification and provides satisfactory results in multi-class classification. The salient observation in the evaluation parameters concerns the effectiveness relative to the other trending ML and DL methods. In Table 3, the key observations are the consistently excellent performance of VIT and HybridFusionNet, with 99% accuracy, precision, and recall in all phases. Models such as ResNet and AlexNet perform slightly worse, especially in the more severe phases, where ResNet achieves 72 to 82% accuracy. SAN has a balanced performance and achieves more than 90% across all phases for all parameters.
Figure 8 illustrates the classification efficacy of five models (AlexNet, VGG16, SAN, VIT, and HybridFusionNet) for diabetic retinopathy (DR) across five severity categories: no DR, mild, moderate, severe, and proliferative. Trends in accuracy, precision, and recall are shown using scatter plots, while score distributions are represented by box plots. The hybrid model attains almost flawless metrics across all severity levels, consistently surpassing the individual models. SAN and VIT are effective, especially for the severe categories, whereas AlexNet and VGG16 exhibit significant deficiencies in the performance measures. The Hybrid and VIT metrics show greater robustness and reduced variability in the box plots.
The VIT, SAN, and HybridFusionNet models have clear strengths and potential for improvement. VIT lags slightly behind in both binary and multi-class classification, although it handles similar DR severities well. SAN faces greater multi-class classification challenges at the moderate and severe DR levels. HybridFusionNet combines the strengths of both models and outperforms them in accuracy, precision, and recall. The performance analysis against the different trending models is described in Table 3.
The models ResNet [30], AlexNet [31], and VGG16 [32] are used for the evaluation, assessed by three key performance metrics: accuracy, precision, and recall. While the DL models ResNet, AlexNet, and VGG16 show competitive performance, the use of transformer models resulted in SAN and VIT achieving 91 and 99 percent accuracy in the binary task and 87 and 90 percent accuracy in the multi-class task, respectively, as shown in Table 3 and Table 4. This demonstrates the ability to classify both DR and No DR with appropriate accuracy. However, the multi-class approach poses a challenge in classes four and five. After combining the features of both models to perform the classification, the hybrid model outperforms all trending DL and ML models, as shown in Figure 9.