1. Introduction
Diabetic retinopathy (DR) is a retinal disease that impairs vision. Elevated blood sugar damages the light-sensitive layer of the retina [1]. Without treatment, these degenerative processes can lead to visual impairment and, ultimately, complete loss of vision [2]. Timely diagnosis of the disease is necessary so that laser or drug therapy can be carried out quickly, which can slow down the progression of the disease and, in many cases, protect against blindness [3]. Even if patients initially show no symptoms, regular monitoring with eye examinations is crucial to prevent irreversible progression. Since the introduction of artificial intelligence (AI) in ophthalmology, the detection of DR has improved, and machine learning (ML) technologies are now being used to analyze vast numbers of eye images [4]. Nevertheless, such an approach has limitations that make it difficult to use AI for DR diagnosis. The success of AI-based techniques depends heavily on the reliability of the retinal images: systems designed to analyze images are susceptible to conditions such as under- and overexposure, camera shake, and other distortions. This is especially true for convolutional neural networks (CNNs), which are very sensitive to image quality during training, and degraded inputs can change the overall functioning of the system [5]. Furthermore, an unbalanced ratio of DR-negative and DR-positive cases increases the probability of false negatives, because classification is sensitive both to the structural and functional characteristics of a retinal image and to the differences between DR cases.
Consistent development and diverse data utilization are necessary to improve the performance and applicability of AI in DR diagnosis and treatment. Deep learning (DL) is an advanced branch of ML that uses models such as CNNs to examine images in great detail in search of DR indicators [6]. Although recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) have been used in various architectures, CNNs have prevailed in this regard. These AI-based approaches have already been shown to work with demonstrable efficiency, with DR being adequately diagnosed by more than a few systems, clearly indicating that AI has the potential to address DR at early stages.
Wide-field OCTA (WF-OCTA) provides detailed information about the peripheral retina and thus improves the diagnosis of DR. The growing number of diabetic patients has increased the need for automated DR diagnostic systems that rely on WF-OCTA imaging [7]. One method uses Vision Transformers (VIT) to automatically diagnose DR from fovea-centered WF-OCTA images [8], demonstrating the effectiveness of VIT in detecting and grading DR. This inspired the development of a transformer for detecting DR grades that divides images into patches, converts them into sequences, and processes them through multi-head attention layers. This method showed the potential of VIT for DR detection.
Several studies have proposed hybrid DL methods that combine fine-tuned vision transformers with a modified capsule network for predicting the severity of DR. These methods include preprocessing steps such as image transformation and adaptive histogram equalization. Using the APTOS, Messidor-2, DDR, and EyePACS datasets, adequate accuracy was achieved that outperforms trending methods [9]. Thus, these methods show that a well-designed computer-aided diagnosis (CAD) system has promising efficiency in detecting DR. In particular, DL algorithms, especially those using VIT and self-attention networks (SAN), show significant potential for reliably detecting DR stages.
VIT demonstrates improved DR detection compared to traditional CNN methods. These transformers can be equipped with Masked Autoencoders (MAE), which further improve effectiveness in classifying the different stages of DR. Self-attention is likewise a promising and substantial approach for recognizing the different DR stages: the SAN architecture considers different retinal features across different DR stages and thus improves generalization ability. In this study, we utilize a hybrid model of SAN and VIT to improve the accuracy of detection for the different disease phases by applying vision techniques. The contributions of this work are presented below:
Review of the essential literature on DR to understand the adaptability of DR detection approaches;
Examination of the applicability of different DR datasets with their corresponding phases;
Description of the various preprocessing steps applied to the DR dataset;
Implementation of the SAN, VIT, and HybridFusionNet architectures;
Evaluation of binary and multi-class classification using various performance parameters, such as accuracy, ROC, and AUC curves;
Comparative analysis of the SAN, VIT, and HybridFusionNet models against trending methods such as ResNet, AlexNet, and VGG16.
These contributions are described in various sections of this article.
Section 2 reviews the relevant literature in the field of DR detection.
Section 3 describes the dataset, the preprocessing steps, and the SAN, VIT, and HybridFusionNet methods.
Section 4 presents and discusses the results.
Section 5 summarizes the findings, highlights potential future research topics, and provides a roadmap for further progress in this discipline.
2. Literature Review
The fundus is the posterior part of the eye that contains the retina, the optic nerve, and the blood vessels [10]. With the pupil dilated, a special camera photographs the back of the eye; the procedure takes only a few minutes. Fundus images are not medically necessary to document the presence of DR [11]; however, they may be medically necessary to provide a baseline for assessing the progression of the disease. In addition, fundus imaging can aid in the interpretation of fluorescein angiography, as certain retinal landmarks visible on fundus images cannot be seen on fluorescein angiograms [12]. It is important that the eyes are dilated before the procedure, since dilated pupils give technicians a better view of the back of the eye.
A key drawback of several existing DR screening techniques that rely on fundus photography is their dependence on two-dimensional fundus images [13]. OCT has become a popular DR screening technology because it allows direct two- and three-dimensional viewing of histologic changes in the layered retinal structures and accurate quantitative evaluation with ultra-high scan rate and resolution [14]. OCT has the advantage of being non-contact and non-invasive; it captures high-resolution images, measures the thickness of the retina and the retinal layers, and acquires the images quickly. Although image quality is often better with a dilated pupil, OCT can often be performed on an undilated patient. Equipment cost, media opacity limitations, operator skill and training requirements, difficulty in obtaining images from patients unable to fixate, and imaging artefacts introduced by automation tools are some of the problems associated with OCT images [15].
Machine vision is a technique for enhancing or extracting information from an image through computational procedures. The basic steps of machine-vision-based diagnostic solutions for DR include preprocessing of fundus images, retinal vessel segmentation, optic disc localization, red lesion extraction, bright lesion extraction, and finally DR detection. One image processing solution is specially designed for DR screening: after preprocessing the fundus image, certain features such as the location of blood vessels, exudates, and structural aspects are extracted, and these features are categorized into different stages, from normal to proliferative, using a Support Vector Machine (SVM). The image resolution used was 256 × 256 pixels, and the model was evaluated on a publicly available dataset [16].
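A minimal sketch of this classical pipeline, assuming hand-crafted features have already been extracted; the synthetic features, labels, and SVM settings below are illustrative stand-ins, not the cited study's implementation:

```python
# Classical DR screening sketch: hand-crafted features -> SVM classifier.
# Feature extraction is replaced by synthetic data for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))      # e.g., 13 hand-crafted features per image
y = rng.integers(0, 2, size=500)    # 0 = no DR, 1 = DR (synthetic labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
clf = SVC(kernel="rbf", C=1.0).fit(scaler.transform(X_tr), y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(scaler.transform(X_te))))
```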
Using the Fisher method and mutual information, an advanced feature set with thirteen essential elements was constructed and evaluated with classifiers such as Bare Logistic, Multi-Layer Perceptrons, and Sequential Minimal Optimization. This approach achieved 99.73% accuracy when analyzing retinal fundus images from Bahawal Victoria Hospital, Pakistan, and 98.83% on alternative public datasets. A subsequent study examined a dataset [17] with 3662 retinal fundus images using a CNN for feature inference and dimensionality reduction. An SVM then effectively categorized the images into the DR stages, from no DR through mild, moderate, and severe to proliferative, achieving an impressive 98.4% accuracy.
Another study [18] used the diaretDB1 dataset [19], which consisted of 89 live fundus images, 84 of which showed signs of microaneurysms while the rest were normal. In addition, data from the Retinal Vascular Disease Online Challenge [20] were used. A perceptron responsible for highlighting areas of interest extracted unique 19 × 19 pixel patches. Using a composite classification approach that includes techniques such as bottom-hat filtering and Radon feature extraction, the images were categorized into microaneurysm and non-microaneurysm classes. With the diaretDB1 dataset, the researchers achieved commendable sensitivity, precision, and specificity values of 92.32%, 95.93%, and 93.87%, respectively.
To broaden the study horizon, another study [21] focused on 80 non-dilated retinal images from Thammasat University Hospital [22]. Mathematical morphology was decisive for the extraction of eighteen crucial features. Naive Bayes classifiers were used to categorize the images into no, mild, moderate, and severe DR, achieving a remarkable sensitivity of 87.15% and values for precision and specificity of 99.99%. In addition, research initiatives have tapped into databases such as UTHSC, the Retinal Vascular Disease Online Challenge, and diaretDB1 and extracted 87 key features that preceded classification [23]. In a further step, a study using the diaretDB1 dataset [24] used 66 image attributes and classified images into primary, moderate, and advanced stages of DR, achieving an 88% accuracy rate with SVM and LR classifiers. Finally, in the study by [25], preprocessing techniques were applied to the DiaretDB1 dataset, and exudates and blood vessels were identified as crucial features; the obtained accuracy varies between 85.8% and 88.6% depending on the algorithm used.
Table 1 presents a detailed analysis of DR detection using different ML and DL models.
Table 2 summarizes the different studies on DR detection using different types of datasets.
Datasets for DR Screening. There are many publicly accessible datasets for DR and retinal vasculature detection. These datasets are often employed for system training, validation, testing, and system quality comparison.
Machine vision applications in DR screening have made significant progress in diagnosis and referral recommendations. The majority of machine vision studies reported thus far have primarily examined offline DR screening. According to several research papers, DR may be diagnosed using AI with excellent sensitivity and specificity. Fundus images can be assessed at varied levels of image quality, and image enhancement and restoration can be performed before DR diagnosis. A generalized model can be designed to diagnose DR across geographic variations, since diabetes is connected with geographic factors such as food habits and weather conditions, and acceptable accuracy is attained.
Self-learning neural networks perform exceptionally well in detecting DR by autonomously acquiring essential characteristics from extensive collections of labeled retinal images. These networks exhibit an exceptional capacity to detect subtle patterns and fluctuations, ensuring validated accuracy in spotting early indicators of DR. Their scalability enables efficient processing of diverse image datasets, including variances in quality and patient demographics. Furthermore, their speed and automation enhance the efficiency of the screening procedure, enabling early detection of DR at significant scale. With the increasing availability of data, these networks are constantly improving, providing promising opportunities for prompt intervention and the preservation of vision in diabetic patients.
3. Methods and Materials
We used a dataset from Kaggle containing retinal images, clinically scored on a scale of 0 to 4 for DR: 0 (no DR), 1 (mild), 2 (moderate), 3 (severe), and 4 (proliferative DR). Our goal is to develop an automated system that classifies these images based on the given scale. The details of the dataset and examples can be viewed at
https://www.kaggle.com/c/diabetic-retinopathy-detection/data (accessed on 21 November 2024).
Figure 1 shows the pixel intensity distributions for the different DR classes. Each subplot represents the frequency of pixel intensities for a particular DR class.
The histograms in Figure 1 show the pixel intensity distribution for the DR levels. The no-DR class shows a frequency concentrated around one intensity, indicating a relatively uniform distribution with fewer extreme values. The mild class shows a similar pattern, but is slightly shifted and shows lower frequencies compared to the no-DR class. The moderate class shows a peak in the same intensity range but with significantly lower frequencies, indicating a lower variance. The severe class has a lower frequency than the no-DR class, indicating some challenges, and the proliferative class has greater variability in pixel intensities. This analysis shows that the no-DR and mild classes have rather similar variation distributions, while the moderate, severe, and proliferative classes differ noticeably.
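The per-class histograms of Figure 1 can be reproduced along the following lines; the folder layout and class names below are assumptions for illustration, not the authors' exact script:

```python
# Per-class pixel-intensity histograms for the five DR grades.
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from pathlib import Path

classes = ["No_DR", "Mild", "Moderate", "Severe", "Proliferative"]
fig, axes = plt.subplots(1, len(classes), figsize=(20, 3), sharey=True)
for ax, name in zip(axes, classes):
    pixels = []
    for path in Path("data", name).glob("*.jpeg"):   # hypothetical layout
        pixels.append(np.asarray(Image.open(path).convert("L")).ravel())
    if pixels:                                       # skip empty folders
        ax.hist(np.concatenate(pixels), bins=64, range=(0, 255))
    ax.set_title(name)
    ax.set_xlabel("intensity")
axes[0].set_ylabel("frequency")
plt.tight_layout()
plt.show()
```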
3.1. HybridFusionNet
The HybridFusionNet model combines the functions of SAN and VIT. The model first uses SAN to detail the features in fundus images, which are later integrated by VIT to classify DR. The SAN mechanism computes attention scores that describe each pixel in relation to the others, calculated as follows:

$$\mathrm{Attention}(C, D, T) = \mathrm{softmax}\!\left(\frac{CD^{\top}}{\sqrt{d_k}}\right)T,$$

where C is the query matrix, D is the key matrix, T is the value matrix, and $d_k$ is the dimension of the key vectors. In order to capture extensive dependencies and features in the images, VIT divides the input fundus images into patches, which are arranged in linear order and passed to the encoder. VIT then stacks multiple layers of SAN and Feed-Forward Neural Networks (FFNN):

$$\mathrm{FFNN}(x) = Wx + b,$$

where W is the weight matrix and b is the bias term. The transformer encoder is applied with SAN, normalization, and FFNN. The model can perform both binary and multi-class classification: the binary classification refers to the presence or absence of DR, while the multi-class classification categorizes the DR levels into no DR, mild, moderate, severe, and proliferative.
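As a worked sketch of the attention equation above, using the paper's naming (C = query, D = key, T = value) with illustrative shapes:

```python
# Scaled dot-product attention with the paper's C/D/T naming.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(C, D, T):
    d_k = D.shape[-1]                      # dimension of the key vectors
    scores = C @ D.T / np.sqrt(d_k)        # pairwise relevance of tokens
    return softmax(scores) @ T             # weighted combination of values

n, d_k, d_v = 16, 32, 32                   # 16 tokens (patches/pixels)
rng = np.random.default_rng(0)
C, D, T = (rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k)),
           rng.normal(size=(n, d_v)))
print(attention(C, D, T).shape)            # (16, 32)
```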
The architecture in Figure 2 begins with a preprocessing phase in which the input fundus images are transformed.
The SAN presented after preprocessing consists of several input nodes (A1 to A4) representing input values connected through a fully connected network with nodes for resistance and similarity values (R1 to R4), generating attention values (G1 to G4) and computed values (D1 to D4) through operations such as addition and multiplication, as described in Algorithm 1. This network captures salient features and relationships within the retinal image and provides a comprehensive representation for classification. The SAN is followed by the VIT, which utilizes the transformer encoder mechanism and processes retinal image patches with position embeddings through a series of transformer encoders capable of capturing long-range dependencies and complex patterns. The linear projection of the flattened image patches is fed into the transformer encoder and then into a Multi-Layer Perceptron (MLP) head, which outputs the classification results. Both the SAN and the VIT provide results for binary and multi-class classification, which determine the presence and severity of DR in retinal images. This dual-output structure ensures a comprehensive analysis that takes into account both the presence and the progression of the disease. The architecture provides a robust solution for DR diagnosis through preprocessing, SAN, and VIT, which could improve early detection and management of the disease.
Algorithm 1 HybridFusionNet Model for Multi-Class and Binary-Class Classification
Require: Preprocessed fundus images and model parameters for SAN and VIT
Ensure: Binary-class or multi-class DR classification
1: procedure HybridFusionNet
2: Preprocess the input fundus images.
3: Apply SAN to compute attention scores.
4: Use SAN to compute query, key, and value matrices.
5: Extract deep features from fundus modalities.
6: Split features into patches (VIT).
7: Pass patches through the VIT encoder.
8: For the binary stage, detect the presence of DR.
9: For the multi-class stage, classify the DR stage.
10: Output the final DR classification result.
11: end procedure
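A hedged PyTorch sketch of the pipeline in Algorithm 1 is given below; applying the SAN stage directly to the patch tokens, as well as all layer sizes and the mean-pooled readout, are our assumptions for illustration rather than the authors' exact design:

```python
# HybridFusionNet sketch: SAN feature refinement -> ViT-style encoder
# -> binary and multi-class heads.
import torch
import torch.nn as nn

class HybridFusionNetSketch(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=128, n_classes=5):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, patch, stride=patch)   # patch embedding
        n_patches = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.san = nn.MultiheadAttention(dim, 4, batch_first=True)  # SAN stage
        enc = nn.TransformerEncoderLayer(dim, 8, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)     # ViT stage
        self.binary_head = nn.Linear(dim, 2)          # DR vs. no DR
        self.multi_head = nn.Linear(dim, n_classes)   # five DR grades

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2) + self.pos  # (B, N, dim)
        attn_out, _ = self.san(tokens, tokens, tokens)  # self-attention scores
        tokens = tokens + attn_out                      # residual fusion
        z = self.encoder(tokens).mean(dim=1)            # pooled representation
        return self.binary_head(z), self.multi_head(z)

b_logits, m_logits = HybridFusionNetSketch()(torch.randn(2, 3, 224, 224))
print(b_logits.shape, m_logits.shape)  # torch.Size([2, 2]) torch.Size([2, 5])
```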
In the dataset, the number of images differed between the categories. This imbalance can lead to problems if models become biased in classification. The data augmentation approach creates new images by slightly modifying existing images, mainly for the group with fewer images. The image is rotated slightly to the left or right: for an image matrix M, a new image $M'$ is created by rotating it by an angle $\theta$:

$$M' = R_{\theta} M, \qquad R_{\theta} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}.$$

The image is then flipped. Thus, for an image M with H rows and W columns, the flipped versions are $M_h(i, j) = M(i, W - j + 1)$ for the horizontal flip and $M_v(i, j) = M(H - i + 1, j)$ for the vertical flip. Zooming can be thought of as enlarging or shrinking part of an image: by resampling the image matrix M with a factor s, we obtain

$$M'(i, j) = M(i/s,\; j/s).$$

Cropping is a further crucial preprocessing step. Additionally, we used SMOTE, the Synthetic Minority Over-sampling Technique. For two images a and b, a new image c is synthesized as

$$c = a + \lambda\,(b - a),$$

where $\lambda$ is a random value that lies between 0 and 1. SMOTE first picks two images (or data points) from the minority class. These images belong to the same group, which means they have characteristics in common; in a set of images showing diabetic retinopathy, for instance, both a and b might be marked as showing a certain grade of retinopathy. Usually, a and b are picked based on how similar their features are, or how close they are to each other in the feature space, using methods such as k-nearest neighbors. This ensures that a and b are somewhat alike and that the synthetic images between them match the traits of the minority class. In our experiments, the value of $\sigma$ (sigma) used in the Gaussian filter was 1.5. This value determines the extent of smoothing applied to the image: larger values of $\sigma$ result in greater smoothing, whereas smaller values preserve finer details. For DR images, $\sigma = 1.5$ was found to be effective in reducing noise. After filtering, the original image P becomes $P'$:

$$P' = P * G,$$

where G is the Gaussian kernel

$$G(x, y) = \frac{1}{2\pi\sigma^{2}}\, e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}}.$$

To ensure that all images have the same level of clarity, we used the same $\sigma$ value and kernel size for the Gaussian filter on all of them. The Gaussian curve's width, or spread, is determined by $\sigma$: a bigger $\sigma$ gives a smoother, broader distribution, while a lower $\sigma$ gives one that is narrower and more concentrated around the center.
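The augmentation and balancing steps above can be sketched as follows; the rotation angle, zoom factor, and the synthetic stand-in images are illustrative, while sigma = 1.5 follows the text:

```python
# Rotation, flips, zoom, SMOTE-style interpolation, and Gaussian smoothing.
import numpy as np
from scipy.ndimage import rotate, zoom, gaussian_filter

rng = np.random.default_rng(0)
M = rng.random((128, 128))                       # stand-in grayscale fundus image

rotated   = rotate(M, angle=10, reshape=False)   # small rotation
flipped_h = np.fliplr(M)                         # horizontal flip
flipped_v = np.flipud(M)                         # vertical flip
zoomed    = zoom(M, 1.2)[:128, :128]             # zoom in, then crop to size

# SMOTE-style synthesis: c = a + lambda * (b - a), lambda in (0, 1),
# where a and b are neighboring minority-class samples.
a, b = rng.random((128, 128)), rng.random((128, 128))
lam = rng.uniform(0, 1)
c = a + lam * (b - a)

smoothed = gaussian_filter(M, sigma=1.5)         # noise reduction (sigma = 1.5)
print(rotated.shape, c.shape, smoothed.shape)
```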
3.2. Self-Attention Network
The SAN mechanism in Algorithm 2 is a sophisticated filtering process that selectively highlights parts of the input data, allowing the model to effectively capture and utilize contextual relationships within the data, as shown in Figure 3. This improves the model's ability to understand and process the complex patterns of the DR stages.
In this approach, each sequence element is assigned a pointer coefficient, which is calculated by evaluating the alignment between S and D. This alignment serves as an indicator of the relevance of the associated value O, where O represents the output values weighted by the alignment score between S and D. A special feature of the SAN method is the equality of S, D, and O, which are all derived from the same input sequence:

$$\mathrm{Attention}(S, D, O) = \mathrm{softmax}\!\left(\frac{SD^{\top}}{\sqrt{d_D}}\right)O,$$

where Attention stands for the SAN mechanism, which processes the three matrices S, D, and O; S represents the sensor matrix, analogous to the query in attention mechanisms, while D and O are analogous to the key and value; $D^{\top}$ is the transpose of the descriptor matrix D; and $\sqrt{d_D}$ denotes the square root of the dimension of the descriptor matrix, used for scaling in the attention mechanism to maintain the stability of the gradients.
The term $\mathrm{head}_i$ depicts the outcome of the Attention function for the i-th mapping in multi-head attention. The matrices $W_i^S$, $W_i^D$, and $W_i^O$ are parameters for the i-th mapping, specifically transforming S, D, and O:

$$\mathrm{head}_i = \mathrm{Attention}\!\left(SW_i^S,\; DW_i^D,\; OW_i^O\right).$$

The Multi-attention function integrates the results from the individual heads and acts as an extended version of the SAN mechanism:

$$\mathrm{MultiHead}(S, D, O) = \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right)W^{M}.$$
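A compact numerical sketch of this multi-head extension, with illustrative dimensions and randomly initialized projection matrices:

```python
# Multi-head attention over the paper's S/D/O matrices.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(S, D, O):
    return softmax(S @ D.T / np.sqrt(D.shape[-1])) @ O

def multi_head(S, D, O, n_heads=4, d_head=16):
    rng = np.random.default_rng(0)
    d_model = S.shape[-1]
    heads = []
    for _ in range(n_heads):
        # Per-head projections W_i^S, W_i^D, W_i^O
        W_s, W_d, W_o = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(attention(S @ W_s, D @ W_d, O @ W_o))
    W_out = rng.normal(size=(n_heads * d_head, d_model))  # output projection
    return np.concatenate(heads, axis=-1) @ W_out

S = D = O = np.random.default_rng(1).normal(size=(10, 64))
print(multi_head(S, D, O).shape)  # (10, 64)
```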
Algorithm 2 SAN for Image Classification
Require: Feature map X with dimensions (batch_size, channels, height, width) and weight matrices
Ensure: Feature map with SAN applied
1: procedure SelfAttention
2: Optionally reduce the dimensionality of X with a convolution
3: Compute Query (Q), Key (K), and Value (V) via convolution
4: Calculate the attention scores
5: Optionally apply a mask to the attention scores
6: Normalize the scores using Softmax
7: Compute the weighted sum of V using the attention weights
8: Optionally apply a linear transformation
9: Add X to the weighted sum (residual connection)
10: Output the feature map with self-attention
11: end procedure
The SAN mechanism in CNNs starts with an input feature map X with specific dimensions and optional parameters for the attention mechanism. At the beginning of the process, a 1 × 1 convolution is applied to X for dimensionality reduction [29]. Q, K, and V are derived by convolving X with their respective weight matrices. Then, the dot product of Q and K is computed, and a mask can be applied to these dot products to selectively control attention. The attention weighting is achieved by the softmax activation function. The SAN feature map is obtained by multiplying these weights by V and adding the result to the original input. Repeating this process over the input batch yields SAN-enhanced feature maps ready for the subsequent CNN layers.
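A hedged PyTorch sketch of Algorithm 2, using 1 × 1 convolutions for Q, K, and V and a residual connection; the channel-reduction factor and the learnable residual scale are assumptions for illustration:

```python
# Self-attention block over a CNN feature map (Algorithm 2 sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or max(channels // 8, 1)   # optional reduction
        self.q = nn.Conv2d(channels, reduced, 1)     # 1x1 conv -> Query
        self.k = nn.Conv2d(channels, reduced, 1)     # 1x1 conv -> Key
        self.v = nn.Conv2d(channels, channels, 1)    # 1x1 conv -> Value
        self.gamma = nn.Parameter(torch.zeros(1))    # learnable residual scale

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)     # (B, HW, C')
        k = self.k(x).flatten(2)                     # (B, C', HW)
        v = self.v(x).flatten(2)                     # (B, C, HW)
        attn = F.softmax(q @ k / (k.shape[1] ** 0.5), dim=-1)  # (B, HW, HW)
        out = (v @ attn.transpose(1, 2)).reshape(b, c, h, w)   # weighted sum of V
        return x + self.gamma * out                  # residual connection

x = torch.randn(2, 64, 32, 32)
print(SelfAttention2d(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```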
VIT is a DL model that applies the transformer architecture to image classification tasks, as shown in Figure 4. VITs operate on image patches and use the SAN mechanism to model global relationships within an image.
An input image X of dimensions (batch_size, channels, height, width) is divided into non-overlapping patches $x_p^1, \ldots, x_p^n$, which are flattened and linearly projected:

$$z_i = x_p^i E,$$

where E is the learnable projection matrix.
The transformer encoder consists of multiple layers of multi-head SAN and FFNN, each with residual connections and layer normalization. Queries (A), Keys (B), and Values (C) are computed from the input embeddings Z:

$$A = ZW_A, \qquad B = ZW_B, \qquad C = ZW_C.$$

Attention scores are computed as

$$\mathrm{Attention}(A, B, C) = \mathrm{softmax}\!\left(\frac{AB^{\top}}{\sqrt{d_B}}\right)C.$$

The output of the self-attention mechanism is passed through a FFNN:

$$\mathrm{FFNN}(x) = \sigma(xW_1 + b_1)\,W_2 + b_2,$$

where $\sigma$ is a nonlinear activation. The output embedding corresponding to the classification token [CLS] is fed into a linear classifier:

$$\hat{y} = \mathrm{softmax}\!\left(z_{\mathrm{[CLS]}}W + b\right).$$

The VIT model is trained on the labeled dataset using the cross-entropy loss for multi-class classification:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log\hat{y}_{i,c},$$

where N represents the number of samples, C the number of classes, y the ground-truth labels, and $\hat{y}$ the predicted probabilities. The performance of the model is evaluated using metrics such as accuracy, the confusion matrix (CM), and AUC-ROC to ensure effective classification of DR levels. By leveraging VIT's ability to model long-range dependencies and global context, this approach has shown promising results in medical imaging tasks, including DR detection, and provides an alternative to traditional CNN-based methods.
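A minimal ViT classification sketch matching the equations above (patch projection, [CLS] token with position embeddings, transformer encoder, linear head with cross-entropy); the hyperparameters are illustrative, not the paper's configuration:

```python
# Tiny ViT sketch: patches -> [CLS] + positions -> encoder -> linear head.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, dim=128, depth=4, heads=8, n_cls=5):
        super().__init__()
        n = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)  # linear patch projection
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))  # position embeddings
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, n_cls)                    # linear classifier

    def forward(self, x):
        t = self.embed(x).flatten(2).transpose(1, 2)         # (B, N, dim)
        t = torch.cat([self.cls.expand(len(x), -1, -1), t], dim=1) + self.pos
        return self.head(self.encoder(t)[:, 0])              # classify on [CLS]

logits = TinyViT()(torch.randn(2, 3, 224, 224))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 3]))   # cross-entropy loss
print(logits.shape, float(loss))
```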
3.3. Multi-Class Classification
In this hybrid model, we classify the stages of DR using a training and validation approach with the defined model, which helps us to recognize the differences between these stages, as in Algorithm 3.
Algorithm 3 Multi-class classification using SAN
Require: Disease categories (mild DR, moderate DR, severe DR, proliferative DR, non-DR), data directory, input image size, number of epochs
Ensure: Test accuracy
1: procedure DRImageClassification
2: Prepare and preprocess data: resize and convert images to grayscale
3: Define and compile the attention CNN model:
   Input layer, convolutional layers, attention mechanism, global average pooling, fully connected layers
   Compile with an optimizer and cross-entropy loss
4: Train the model:
   Split data into training and validation sets
   Train for the specified epochs, validating during training
5: Evaluate the model on the test set and record the test accuracy
6: end procedure
Algorithm 4 for DR classification using a VIT was developed to categorize retinal images into several levels: mild DR, moderate DR, severe DR, proliferative DR, and non-DR. The process begins with data preparation, where the images and associated labels are loaded from the specified data directory. The images are scaled to the required input size and normalized so that the model can effectively learn from the image data.
Algorithm 4 Multi-class classification using VIT
Require: Disease categories (mild DR, moderate DR, severe DR, proliferative DR, non-DR), data directory, input image size, number of epochs
Ensure: Test accuracy
1: procedure DRImageClassification
2: Prepare and preprocess data: resize and normalize images
3: Define and compile the VIT model:
   Input layer, patch embedding, position embedding, transformer encoder layers, classification head
   Compile with an optimizer and categorical cross-entropy loss
4: Train the model:
   Split data into training and validation sets
   Train for the specified epochs, validating during training
5: Evaluate the model on the test set and record the test accuracy
6: end procedure
Then, the vision transformer model is included in the pipeline for the DR process. The model starts with an input layer that processes image patches and is connected to a patch embedding layer, which converts the patches into vectors. A position embedding layer is added to the patch embeddings to retain the position information within the image. The core of the model consists of multiple transformer encoder layers, each utilizing SAN mechanisms and FFNNs to extract features from the data. The classification head then performs classification over the stages of DR with a linear layer. After defining the model, it is compiled using an optimizer and the categorical cross-entropy loss function commonly used for DL classification tasks. In the training phase, the data are split into training and validation sets; the model is trained over several epochs, performance is measured, and overfitting is avoided. After evaluating the model, the test accuracy is recorded; the final test provides an effective classification of DR stages and a reliable tool for early detection.
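A hedged sketch of the training procedure in Algorithms 3 and 4; the stand-in model, the Adam optimizer, and the synthetic data are assumptions for illustration:

```python
# Train/validation split, epoch loop with cross-entropy, validation accuracy.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

data = TensorDataset(torch.randn(200, 3, 64, 64), torch.randint(0, 5, (200,)))
train_set, val_set = random_split(data, [160, 40])    # training/validation split
train_dl = DataLoader(train_set, batch_size=16, shuffle=True)
val_dl = DataLoader(val_set, batch_size=16)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 5))  # stand-in model
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    model.train()
    for x, y in train_dl:                             # train for an epoch
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    model.eval()
    with torch.no_grad():                             # validate during training
        correct = sum((model(x).argmax(1) == y).sum().item() for x, y in val_dl)
    print(f"epoch {epoch}: val accuracy = {correct / len(val_set):.2f}")
```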
4. Result Analysis and Discussion
DL models such as ResNet, AlexNet, and VGG16, along with Vision Transformers (VIT), self-attention networks (SAN), and the hybrid model (HybridFusionNet), are used in the development of this CAD system to identify DR. SAN and VIT are used in the initial evaluation phase. The SAN model uses an attention mechanism to evaluate features and to categorize images based on the extracted features. In binary classification with SAN, the results were encouraging, as the model efficiently discriminates between DR and non-DR cases. However, when the categorization was extended to the five DR classes, the problems became more apparent, requiring further modifications and a stronger emphasis on recognition performance.
Figure 5 illustrates the various evaluation parameters calculated for the analysis of DR using the SAN architecture. Figure 5a shows the consistency of the model's classification accuracy across training and validation, while Figure 5b describes the multi-class classification accuracy. The accuracy of the model increased steadily, with training and validation curves close to each other, indicating a significant ability to discriminate between the two binary classes. For the multi-class task, the validation accuracy was slightly lower than the training accuracy, indicating some difficulties in reliably categorizing multiple classes. This gap suggests that the model may need further modification to improve generalization across all class categories and maintain consistent performance in class prediction. The distribution of positive and negative cases over correct and false predictions is shown in Figure 5c. The model shows near error-free performance in detecting "No DR" instances, but has some difficulty in identifying "DR", with 87% correctly categorized and 34 cases misclassified as "No DR". The multi-class classification, see Figure 5d, presents new difficulties, namely in distinguishing between the stages of DR. The prediction accuracy for "No DR" is 91%; however, there are significant difficulties in distinguishing between the mild, moderate, and severe stages, with some of these stages identified accurately in only 43% and 69% of instances. The Receiver Operating Characteristic (ROC) curves for binary classification in Figure 5e show that the model produces robust prediction results. In contrast, the multi-class classification in Figure 5f shows inconsistent effectiveness across the different phases of DR. The model performs consistently well in identifying "No DR", achieving an AUC of 98%; however, as the complexity of the classes (mild, moderate, severe, and proliferative) increases, the AUC decreases, especially for the more advanced stages.
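For reference, the confusion matrices and ROC/AUC values discussed here are typically computed as follows; the predictions below are synthetic stand-ins, not the paper's outputs:

```python
# Confusion matrix and ROC/AUC with scikit-learn on synthetic predictions.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)              # binary: DR vs. no DR
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, 300), 0, 1)
y_pred = (y_score >= 0.5).astype(int)

print(confusion_matrix(y_true, y_pred))            # counts per true/predicted class
print("AUC:", roc_auc_score(y_true, y_score))
fpr, tpr, _ = roc_curve(y_true, y_score)           # points of the ROC curve
```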
In the next phase, the VIT is evaluated. As shown in Figure 6a, the accuracy curve shows a consistent increase in both training and validation. The model shows potential effectiveness in discriminating between the two categories ("No DR" and "DR") without overfitting, as the validation performance is very close to the training performance. Figure 6b illustrates similar patterns in multi-class classification, where the training and validation accuracies almost converge to consistent levels. Although the model shows strong performance across multiple categories, the validation accuracy hardly varies, suggesting that discriminating between more than two categories is not a major challenge. Figure 6c shows the confusion matrix for the binary classification, demonstrating how well the VIT model can distinguish between the "No DR" and "DR" situations, with only a few false positives. In Figure 6d, the model correctly identifies the majority of "No DR" cases, but becomes increasingly confused as the degree of DR increases. The model struggles to distinguish between adjacent severity levels, resulting in incorrect labeling among the mild, moderate, severe, and proliferative segments. Apart from these problems, the model still performs well overall, although it could be better at discriminating between finer DR levels. In Figure 6e, the model's ROC value (AUC) of 99% confirms its effectiveness and reliability in binary classification. The multi-class ROC shown in Figure 6f is robust but shows greater variability between the different classes. The model is effective in detecting DR but shows more modest AUCs for the later stages, particularly for the more advanced categories, where the AUC drops to 83%. This suggests that the model performs excellently for binary classification, while fine-grained grading remains more challenging.
To improve the detection accuracy, we fed the preprocessed dataset into the proposed hybrid model of SAN and VIT (HybridFusionNet), shown in Figure 7. For both the binary and multi-class classification tasks, the HybridFusionNet model delivers sustained performance. The training accuracy in Figure 7a tends to increase steadily, while the validation accuracy remains stable. For the multi-class classification in Figure 7b, the model attains reasonable accuracy, with the validation accuracy falling only slightly below the training accuracy. The model is highly optimized, but may need further refinement to cope with the complexity of the multi-class task. The confusion matrix for the binary task in Figure 7c quantifies the performance, showing a minimal number of misclassifications between "DR" and "non-DR". The multi-class confusion matrix in Figure 7d reveals confusion between neighboring phases of DR, such as mild and moderate, but classifies "No DR" exactly. The binary ROC in Figure 7e shows almost perfect performance with an AUC of 99%. In contrast, the multi-class ROC in Figure 7f shows uneven performance, with "No DR" achieving the highest AUC of 99% and the weakest class an AUC of 83%. In general, the HybridFusionNet model shows exceptional performance in binary classification and provides satisfactory results in multi-class classification. The salient observation in the evaluation parameters concerns the effectiveness relative to the other trending ML and DL methods. In Table 3, the key observations are the consistently excellent performance of VIT and HybridFusionNet, with 99% accuracy, precision, and recall in all phases. Models such as ResNet and AlexNet perform slightly worse, especially in the more severe phases, where ResNet achieves 72 to 82% accuracy. SAN has a balanced performance and achieves more than 90% across all phases for all parameters.
Figure 8 illustrates the classification efficacy of five models (AlexNet, VGG16, SAN, VIT, and HybridFusionNet) for diabetic retinopathy (DR) across five severity categories: no DR, mild, moderate, severe, and proliferative. Trends in accuracy, precision, and recall are shown using scatter plots, while score distributions are represented by box plots. The hybrid model attains almost flawless metrics across all severity levels, consistently surpassing the individual models. SAN and VIT are effective, especially for the severe categories, whereas AlexNet and VGG16 exhibit significant deficiencies in the performance measures. The Hybrid and VIT metrics show greater robustness and reduced variability in the box plots.
The VIT, SAN, and HybridFusionNet models have clear strengths and potential for improvement. VIT lags slightly behind in both binary and multi-class classification, although it handles similar DR severities well. SAN faces greater multi-class classification challenges at the moderate and severe DR levels. HybridFusionNet combines the strengths of both models and outperforms them in accuracy, precision, and recall. The performance analysis against the different trending models is described in Table 3.
The models ResNet [30], AlexNet [31], and VGG16 [32] are used for the evaluation, assessed by three key performance metrics: accuracy, precision, and recall. While the DL models ResNet, AlexNet, and VGG16 show competitive performance, the use of transformer models resulted in SAN and VIT achieving 91 and 99 percent accuracy in the binary task and 87 and 90 percent accuracy in the multi-class task, respectively, as shown in Table 3 and Table 4. This demonstrates the ability to classify both DR and No DR with appropriate accuracy. However, the multi-class approach poses a challenge in classes four and five. After combining the features of both models to perform the classification, the hybrid model outperforms all trending DL and ML models, as shown in Figure 9.