1. Introduction
Visual representation learning is one of the fundamental research topics in computer vision, and significant breakthroughs have been made in the deep learning era. In recent years, the Transformer [1] has become a mainstream pillar for a variety of tasks, underpinning prominent models such as BERT [2] and the GPT family [3,4,5,6]. The main contribution of BERT is a new bidirectional Transformer that successfully combines pre-training and fine-tuning to achieve excellent language understanding. BERT propelled progress in natural language processing and became the basis for much subsequent research and many applications; by showing how to exploit large-scale unsupervised learning effectively, it laid the foundation for smarter language understanding systems and pointed to new directions for future research. The GPT series expanded and improved the Transformer architecture, adding multimodal capabilities, performance gains, and security considerations, and it offers a comprehensive perspective on the capabilities and challenges of modern large language models; by showing how to learn efficiently with large-scale models, these works laid the groundwork for smarter artificial intelligence systems. To represent complex patterns in visual data, two major classes of backbone networks, convolutional neural networks (CNNs) [7,8,9,10,11] and vision transformers (ViTs) [12,13,14,15], have been proposed and are widely used for various visual tasks. The CNN works proposed a variety of deep convolutional architectures and contributed pre-trained models, greatly advancing deep learning in computer vision and becoming cornerstones of the field. The ViT works promoted the application of the self-attention mechanism to visual tasks, proved the potential of the Transformer in image processing, and inspired subsequent research; hierarchical designs and shifted-window mechanisms significantly improved the efficiency and performance of Transformers on computer vision tasks, providing new perspectives that continue to drive the field forward. Compared with CNNs, ViTs often exhibit superior modeling capability and generally incorporate a self-attention mechanism [1,14], which provides a global receptive field and dynamically predicted weighting parameters [16].
However, the self-attention mechanism also has side effects: its complexity grows quadratically with input size, leading to significant computational overhead in downstream tasks with large spatial resolution [17]. To address this problem, considerable effort has gone into improving the efficiency of attention, mainly by constraining the size or stride of the computational window [15,18,19]. For example, the main contribution of ConvBERT [20] is a dynamic convolution mechanism that enhances the local context modeling of BERT; by incorporating convolutional layers, ConvBERT achieves significant gains on multiple natural language processing tasks, demonstrating the potential of convolution for text data. Lite Transformer [21] improves the computational efficiency and performance of the Transformer by optimizing the attention mechanism; its long- and short-range attention significantly improves both efficiency and accuracy, offering an efficient solution for resource-limited environments and opening up new directions for model design. Wu et al. [22] proposed a model architecture based on dynamic convolution that improves computational efficiency by reducing reliance on global self-attention; it demonstrates the potential of lightweight convolution in natural language processing and provides important ideas for designing more efficient deep learning models. Linformer [23] significantly reduces computation and memory overhead through low-rank projection, making it more efficient on long sequences; it has shown excellent performance on multiple natural language processing tasks and provides an effective solution to the computational challenges of long-sequence processing. Longformer [24] proposes a Transformer architecture specifically optimized for long documents; its sparse self-attention significantly improves the efficiency of processing long texts and reduces computational complexity while maintaining good performance. Big Bird [25] likewise proposes a Transformer architecture optimized for long sequences; its sparse attention mechanism significantly reduces computational complexity and enhances long-range dependency modeling, enabling it to handle long documents and complex natural language processing tasks effectively. Although these methods are effective, they inevitably trade effective receptive field against computational efficiency, limiting the ability to establish long-range dependencies in visual data.
In fact, the dominance of the Transformer in the field of large models has been difficult to shake. However, the limitations of this architecture have become increasingly obvious as model scale grows and the sequences to be processed become longer. The emergence of the Mamba family of models is changing this in a powerful way, and its excellent performance is quickly gaining widespread attention. Vision Mamba (Vim) has already demonstrated great potential to become the next-generation backbone for vision models. Researchers from the Chinese Academy of Sciences, Huawei, and Pengcheng Laboratory have proposed VMamba, a vision Mamba model with a global receptive field and linear complexity [26].
Diabetic retinopathy (DR), a complication of diabetes mellitus, is caused by lesions in the microvessels of the fundus, which in turn lead to retinal hemorrhage, edema, ischemia, retinal proliferative membrane formation, and retinal detachment, ultimately causing blindness. Compared with traditional machine learning methods, the biggest advantage of using deep learning for automatic DR classification of retinal blood vessels is that no features need to be extracted by hand: the deep model autonomously learns the intrinsic connections between features, avoiding the influence of subjective human factors on the results. For the five-class diabetic retinopathy experiments, this article reproduces several open-source classic models. The results show that VMamba achieves better classification performance and higher accuracy than traditional CNN models. However, whereas reported accuracies in many binary image classification tasks reach 99%, the accuracy of five-class DR grading remains comparatively low.
In this paper, we improve the VMamba model. Building on the original architecture, we modify the VSS module by adding a self-designed local attention module and an SE channel attention module. The focal loss function is adopted during training to enhance classification performance; together, these changes increase inference speed, maximizing accuracy while reducing computation time.
2. Related Work
The World Health Organization’s first-ever World Vision Report revealed that at least 2.2 billion people worldwide are visually impaired or blind, and that the majority of visual impairment and blindness can be avoided through early prevention [27]. In diabetic fundus screening, professional ophthalmologists grade the degree of a patient’s retinopathy based on the characteristics of the blood vessels together with other diseased areas, and take appropriate measures to reduce the risk of blindness. Because of the large number of patients worldwide and the tiny size of the vessels and lesions in the retina, doctors in underdeveloped areas may misdiagnose or miss cases during the diagnostic process.
In early DR classification tasks, machine learning methods were generally used. These required experienced physicians to manually annotate lesion features and then make judgments based on the manually extracted features, so the results depended heavily on the feature extraction approach, and the high cost of manual diagnosis was not addressed. Moreover, because no publicly available retinal datasets existed in the early days, relatively few studies used machine learning methods. Among past studies, Acharya et al. [28] used higher-order spectral techniques with a support vector machine to classify the degree of lesions. Jaspreet Kaur et al. [29] used the K-nearest neighbor (KNN) algorithm to classify diabetic retinopathy and proposed a traditional machine learning framework based on image processing and feature extraction. Du and Li [30] extracted pathological features with a morphological approach and then graded the degree of retinopathy using support vector machine classifiers. Pinz et al. [31] mapped basic anatomical features and pathological changes by fusing feature information from different scanning laser fundus images, and then used a support vector machine to classify patients’ lesions into three categories. Al-Antary et al. [32] processed each fundus image with various linear and nonlinear image filters to generate feature data, and then trained a random forest classifier on these features to achieve classification.
With the development of deep learning technology, the amount of data processed by neural networks has grown steadily, and CNNs have a great advantage in processing high-dimensional information such as images. The Kaggle competition platform released the large-scale retinopathy datasets EyePACS and APTOS in 2015 and 2019, respectively; since then, using deep learning to classify diabetic retinopathy has become the mainstream approach. Compared with traditional machine learning methods, the biggest advantage of deep learning for automatic segmentation and DR classification of retinal blood vessels is that it requires no manual feature extraction: deep models autonomously learn the intrinsic connections between features, avoiding the influence of subjective factors on the results. In addition, deep learning-based segmentation and classification greatly reduce the manpower and resources required, improve the efficiency of disease screening, and ease the tension between the growing number of ophthalmic diseases and the shortage of professional physicians. Many CNN-based DR classification models have therefore emerged in subsequent research. Earlier work generally performed binary DR classification, such as the transfer learning approach of Gulshan et al. [33] based on the InceptionV3 model and the visualization-based explicit lesion-region algorithm designed by Gondal et al. [34]. In fact, grading the severity of DR is more helpful for clinical diagnosis, so more researchers have carried out five-class studies. In terms of model structure, a 10-layer CNN designed by Pratt et al. [35] obtained good results without any feature-specific detection. Xu et al. [36] proposed an 18-layer CNN for DR classification, containing 12 convolutional layers, 4 max pooling layers, and 2 fully connected layers, and obtained good results on private datasets. BiRA-Net, designed by Zhao et al. [37], uses a bilinear classification structure. Sea-Net, proposed by Zhao et al. [38], enhances feature extraction by inserting multiple attention modules between the convolutional layers. In terms of data processing, Bravo and Arbeláez [39] found a good preprocessing method for fundus images through experimental analysis, which has served as a basis for much later work. The Balanced Mix-Up algorithm proposed by Galdran et al. [40] achieved good overall classification results on severely unbalanced DR datasets. Quellec et al. [41] proposed the deep learning-based ExplAIn model, which can not only classify DR lesions but also classify individual pixels in the image as lesion or non-lesion. Saeed et al. [42] proposed a two-stage fine-tuning model: the model is first pre-trained on the ImageNet dataset; then the diseased regions of the retinal image are extracted, the fully connected layer of the pre-trained model is removed, a designed PCA layer is introduced, and the model is pre-trained again; finally, the weights obtained from the second pre-training are fine-tuned to obtain the final classification model.
In recent years, Suedumrong et al. [43] used background removal and data augmentation techniques to eliminate irrelevant information in the images, allowing the model to focus more on lesion-related features; transforming the training data increased sample diversity, improving the generalization ability and robustness of the model and effectively improving the detection of diabetic retinopathy. Mutawa et al. [44] proposed a model based on deep learning and the discrete wavelet transform (DWT): applying the DWT to the input image decomposes it into low-frequency and high-frequency parts, retaining important feature information while removing noise and significantly improving model accuracy. Yi et al. [45] proposed a new network called RA-EfficientNet, adding a residual attention (RA) module to EfficientNet to extract more features and address the small differences between lesions, overcoming the limitations of manual feature extraction. Dihin et al. [46] combined the Swin Transformer architecture with the wavelet transform and an attention mechanism, processing images through a sliding-window mechanism and hierarchical construction while effectively reducing computational complexity.
In past years of research, CNNs and ViTs have been the mainstream frameworks for visual feature extraction. CNNs have the advantages of a simple model, parameter sharing, and high computational efficiency; their disadvantages are the lack of a global receptive field, weakness at processing multimodal data, and aging architectures whose many layers and large parameter counts result in high computational complexity. Various model variants have brought convolutional networks close to their capacity bottleneck [47]. The ViT is based on the Transformer, which has the advantages of a simple architecture, a global receptive field, and dynamic weights; its disadvantage is that its computational efficiency is inferior to that of convolutional neural networks, since the computation of the self-attention mechanism grows quadratically with context length [48].
The Mamba series of models uses a selective state space to process sequences, solving the Transformer’s computational efficiency problem on long sequences [49]. Its sequence-mixing layer scales linearly with sequence length, enabling faster inference. The Mamba family controls the memory range by integrating a selection mechanism, which improves the model’s generalization ability and offers potential for applications in the computer vision domain. The Mamba series also features excellent scalability, allowing models to be scaled to larger sizes more easily without loss of performance. The aim of this paper is to propose a new deep learning-based method for diabetic retinopathy classification, improved from the VMamba model and named VMamba-m. Unlike traditional classification models, the model proposed in this paper computes faster and achieves higher accuracy, addressing the high computational complexity of traditional models.
3. Methodology
3.1. Overview of the Methodology
In this paper, we build on VMamba, a visual backbone network architecture that introduces a state space model (SSM)-based module to process visual data efficiently, reducing computational complexity and increasing inference speed while maintaining the model’s performance.
As shown in Figure 1, the 2D Selective Scan (SS2D) [50] module enables the selective SSM to process visual data efficiently, bridging the gap between 1D scanning and 2D visual data. VMamba has demonstrated excellent performance on several visual tasks, including image classification, object detection, and semantic segmentation. In particular, its computational complexity grows only linearly with input size, giving it significant scalability for large inputs.
The VMamba-m model proposed in this paper is an improvement on VMamba. First, as shown in Figure 2, a dual branch is added to the original VSS module by introducing local attention and channel attention. The VSS module in the original VMamba model can only extract global features; after we add local attention and SE attention to the VSS module, a weight matrix can assign different weights to different positions and channels of the image, capturing more important feature information. The purpose of adding dual-branch attention is to extract local features; the design also trains faster and reaches higher accuracy than the previous single-branch attention. In addition, the use of the focal loss function in the training code improves classification performance by introducing a focusing factor and adjusting sample weights so that the model pays more attention to hard-to-classify samples.
3.2. Focal Loss Function
In object detection tasks, there is usually an extreme imbalance between the numbers of positive and negative samples. Taking one image as an example, the number of candidate boxes that match a target (positive samples) may be only a dozen or a few tens, while the number of unmatched candidate boxes (negative samples) may reach tens of thousands or even hundreds of thousands. This imbalance makes the model favor the negative samples during training and neglect learning from the positive samples. Moreover, even among the negative samples there is an imbalance between hard and easy examples: most negatives are easy to classify, while the small number of hard negatives is crucial for model performance.
Focal loss is an effective method for dealing with the class imbalance problem: by introducing a focusing factor and adjusting sample weights, it makes the model pay more attention to hard-to-classify samples, thereby improving classification performance [51]. It is especially suitable for object detection and other class-imbalanced tasks. The formula for focal loss is as follows:
$$\mathrm{FL}(p_t) = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t)$$

Here, $p_t$ is the model’s predicted probability for the target class, $\alpha_t$ is a balancing factor that adjusts the influence of positive versus negative samples, and $\gamma$ is the focusing factor used to adjust the weights of difficult samples. The modulating factor $(1 - p_t)^{\gamma}$ reduces the loss contribution of easy-to-distinguish samples: regardless of whether a sample belongs to the foreground or background class, the larger $p_t$ is, the easier the sample is to distinguish and the smaller the modulating factor becomes.
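For reference, a minimal PyTorch sketch of this loss is given below. The default values α = 0.25 and γ = 2 follow the original focal loss paper and are illustrative assumptions, not the exact settings of our training code.

```python
import torch
import torch.nn.functional as F


class FocalLoss(torch.nn.Module):
    """Multi-class focal loss: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)."""

    def __init__(self, alpha: float = 0.25, gamma: float = 2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # Per-sample cross-entropy equals -log(p_t); recover p_t from it.
        ce = F.cross_entropy(logits, targets, reduction="none")
        p_t = torch.exp(-ce)
        # Down-weight easy samples with the modulating factor (1 - p_t)^gamma.
        loss = self.alpha * (1.0 - p_t) ** self.gamma * ce
        return loss.mean()


# Usage: logits from the classifier head, integer labels 0-4 for the five DR grades.
criterion = FocalLoss(alpha=0.25, gamma=2.0)
```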
Focal loss, as a loss function designed to address category imbalance and the imbalance between hard and easy samples, has shown great potential in deep learning. By reducing the weight of easy-to-categorize samples and increasing the weight of hard-to-categorize ones, it enables the model to focus on samples that are difficult to categorize correctly, improving overall performance. As deep learning technology continues to evolve, focal loss is expected to play an even greater role in future applications.
3.3. Local Attention
This attention module is designed independently and contains several layers for local attention operations on images. The module composition is shown in Figure 3.
The first layer is a batch normalization layer, which reduces internal covariate shift by standardizing the mini-batch inputs of each layer, thereby improving training speed and stability. Next is a 2D convolutional layer using a 3 × 3 kernel with stride 1 and padding 1, which extracts local features, followed by another batch normalization layer that normalizes the convolutional output. The second 3 × 3 convolutional layer is similar to the first and performs further feature extraction. The third 2D convolutional layer uses a 1 × 1 kernel with stride 1, which is typically used to compress the channel dimension of the feature map. Although the module structure is not complex, it works well.
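A minimal PyTorch sketch consistent with this layer sequence is shown below. The ReLU placement and the final sigmoid gating of the input are our assumptions, since the text specifies only the normalization and convolution layers.

```python
import torch
import torch.nn as nn


class LocalAttention(nn.Module):
    """Local attention branch following the layer order described above.

    The ReLU activation and the sigmoid gating at the end are assumptions;
    the text only specifies the BatchNorm and convolution layers.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels),                               # normalize the input
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),  # extract local features
            nn.BatchNorm2d(channels),                               # normalize conv output
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),  # further extraction
            nn.Conv2d(channels, channels, 1, stride=1),             # 1x1 channel mixing
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate the input with a learned local weighting map.
        return x * torch.sigmoid(self.body(x))
```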
3.4. SE Attention
The goal of using this module is to improve the representation ability of the network by modeling the interdependencies between convolutional feature channels. The core idea is to let the network learn feature weights according to the loss, so that effective feature maps receive large weights and ineffective or less useful feature maps receive small weights, training the model to achieve better results [52]. Embedding an SE block in the original classification network inevitably adds some parameters and computation, but this is acceptable given the improvement in performance. Specifically, the SE attention mechanism consists of the following steps: squeeze, excitation, and reweighting.
- (1)
In the squeeze step, the feature map U is aggregated across its spatial dimensions H × W to generate a channel descriptor: a global average pooling operation compresses the H × W × C input into a 1 × 1 × C vector, so that this global spatial information can be utilized by the layers that follow. The formula is as follows:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i, j)$$
- (2)
In the excitation step, the vector $z$ obtained in the previous step is processed by two fully connected layers, $W_1$ and $W_2$, and a sigmoid function compresses each element of the result to between 0 and 1, yielding the channel weight vector $s$ we want. The different values in $s$ represent the weight information of the different channels, assigning each channel its own weight. The formula is as follows:

$$s = F_{ex}(z, W) = \sigma\left(W_2\,\delta(W_1 z)\right)$$

where $\delta$ denotes the ReLU activation and $\sigma$ the sigmoid function.
- (3)
Reweight: the weights output by the excitation step are applied to the input features through channel-wise multiplication, producing the weighted feature map:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$$
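The three steps can be sketched in PyTorch as follows; the reduction ratio r = 16 is the common default from the SE paper and an assumption here.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: squeeze (global average pooling), excitation
    (two fully connected layers), and channel-wise reweighting."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # W1: compress
        self.fc2 = nn.Linear(channels // reduction, channels)  # W2: restore
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                               # squeeze: H x W x C -> 1 x 1 x C
        s = torch.sigmoid(self.fc2(self.act(self.fc1(z))))   # excitation weights in (0, 1)
        return x * s.view(b, c, 1, 1)                        # reweight each channel
```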
This attention module can make the model more focused on key information through its clever attention mechanism, which effectively promotes the application of deep learning in many fields.
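As an illustration of how the pieces fit together, a hypothetical assembly of the dual-branch block of Section 3.1 is sketched below, reusing the LocalAttention and SEBlock sketches above. The additive fusion and the vss_core placeholder are assumptions; the paper specifies only that the two attention branches are added to the VSS module.

```python
import torch
import torch.nn as nn

# Assumes the LocalAttention and SEBlock classes sketched in
# Sections 3.3 and 3.4 are in scope.


class DualBranchVSSBlock(nn.Module):
    """Hypothetical assembly of the improved VSS block: the original global
    (SSM) path plus the local-attention and SE channel-attention branches.
    Additive fusion is an assumption."""

    def __init__(self, channels: int, vss_core: nn.Module):
        super().__init__()
        self.vss = vss_core                     # original SS2D-based VSS path
        self.local = LocalAttention(channels)   # local feature branch (Section 3.3)
        self.se = SEBlock(channels)             # channel attention branch (Section 3.4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fuse the global SSM features with the locally attended,
        # channel-reweighted features.
        return self.vss(x) + self.se(self.local(x))
```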
4. Experimental Section
4.1. Dataset and Preprocessing
In this paper, model validation is performed on APTOS2019, a publicly available dataset officially provided by Kaggle (San Francisco, CA, USA). The APTOS2019 dataset was obtained by the Aravind team while performing disease screening in medically deprived villages in India, relying on trained doctors to review images and provide diagnoses. The database contains 3662 images with lesions categorized into five categories, where 0 indicates a healthy retina and categories 1–4 indicate mild, moderate, severe, and proliferative retinopathy, respectively. A ratio of 8:2 was used to divide the training set and test set during the training process of this paper.
Table 1 shows the distribution of sample size for each type of lesion in the APTOS2019 dataset.
Because the samples in the DR dataset were captured under different lighting conditions and with different equipment, the images differ greatly in size and color, and some suffer from underexposure, overexposure, or heavy noise. To train the network on samples of consistent color and size, the images must be preprocessed and enhanced. First, the black border of the fundus image must be removed, because its pixel values differ sharply from those of the fundus region and affect the classification results. Since the aspect ratio varies from image to image and the width of the peripheral black border varies, the full image size cannot be used as a reference; instead, the radius of the eyeball is used as the benchmark. Finally, for image enhancement, the method used in Kaggle’s DR classification competition is adopted to improve the brightness and contrast of the images. The process is shown below.
$$I_g(x, y) = G_{\sigma} * I(x, y) \quad (4)$$

$$I_{out}(x, y) = \alpha\, I(x, y) + \beta\, I_g(x, y) + \gamma \quad (5)$$

In Formula (4), $I$ is the preprocessed fundus image and $G_{\sigma}$ represents Gaussian convolution with standard deviation $\sigma$. Formula (5) is a weighted sum of the images before and after blurring. $\alpha$ represents the transparency or blending coefficient: when images are fused or colors are mixed, $\alpha$ controls the degree of blending between the two images, and in a weighted sum it adjusts the influence of the first image. $\beta$ weights the blurred image and mainly adjusts the contrast of the result. $\gamma$ is an additive brightness offset that adjusts the overall brightness of the image and improves its display under different lighting conditions. In this paper, adjusting these values mainly affects the brightness and contrast of the image. $\sigma$ represents the standard deviation of the Gaussian filter and mainly determines the degree of blurring: it controls the width of the Gaussian function and thus the filter’s ability to retain image detail, with a smaller $\sigma$ producing less blur and a larger $\sigma$ making the image more blurred. For the widely varying images in the dataset, after multiple rounds of tuning we found that setting $\alpha$, $\beta$, $\sigma$, and $\gamma$ to 4, −4, 10, and 128, respectively, achieved the best effect, solving the problems of underexposure, overexposure, and high noise.
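A minimal OpenCV sketch of this enhancement with the parameter values reported above is shown below; the eyeball-radius cropping step is omitted for brevity, and the function name is ours.

```python
import cv2
import numpy as np


def enhance_fundus(img: np.ndarray, alpha: float = 4, beta: float = -4,
                   sigma: float = 10, gamma: float = 128) -> np.ndarray:
    """Weighted blend of an image with its Gaussian blur (Formulas (4)-(5)).

    Assumes img has already been cropped to the eyeball radius.
    """
    blurred = cv2.GaussianBlur(img, (0, 0), sigma)            # Formula (4): G_sigma * I
    return cv2.addWeighted(img, alpha, blurred, beta, gamma)  # Formula (5)
```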
Figure 4 shows the preprocessing results. Through these steps, the black border area becomes gray, reducing its impact on the classification results, and the contrast and clarity of the blood vessels and lesion areas are improved, making them easier to distinguish.
4.2. Comparison Experiments of Different Models
To verify the classification advantages of the improved model proposed in this paper, this section compares VMamba, Mamba, and VMamba-m with several locally deployed open-source classic network models on the five-class task. The experiments use the Ubuntu operating system with the CUDA 11 parallel computing architecture; the hardware platform is an RTX 3080 Ti with 12 GB of memory (NVIDIA, Santa Clara, CA, USA) and 32 GB of RAM. All models are built with the PyTorch 2.0 deep learning framework under Python 3.8, with identical preprocessing and parameter settings. During training, the initial learning rate is set to 0.0001, the batch size to 32, and the number of iterations to 150 rounds. We measure model performance with key indicators including accuracy, precision, recall, AUC, F1 score, and iteration time per round. After multiple rounds of experiments, the results of each model on the APTOS2019 dataset are shown in Table 2. The table shows that the Mamba series models perform well on the diabetic retinopathy classification task. On the five-class task, the VMamba-m model proposed in this paper performs best, followed by VMamba, with classification accuracies of 0.791 and 0.714, respectively, higher than the other models, and iteration times per round of 34.027 and 76.322.
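For reference, these indicators can be computed with scikit-learn roughly as follows; macro averaging over the five classes is an assumption, and y_true and y_prob are hypothetical arrays of labels and predicted class probabilities.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)


def evaluate(y_true: np.ndarray, y_prob: np.ndarray) -> dict:
    """y_true: integer labels 0-4; y_prob: per-class probabilities, shape (N, 5)."""
    y_pred = y_prob.argmax(axis=1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "auc": roc_auc_score(y_true, y_prob, multi_class="ovr"),  # one-vs-rest AUC
    }
```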
The training and validation accuracy curves are shown in Figure 5. We will increase the number of iterations in future experiments to obtain a more stable model. Similarly, the validation loss is depicted in Figure 6.
The ROC curves of our VMamba-m model are shown in Figure 7. The ROC curve and AUC value indicate how close the prediction is to a perfect classifier, which corresponds to the upper left corner of the ROC plot. The AUC value is the area under the ROC curve; the closer it is to 1, the better the model performance. The figure shows AUC values of 99.1%, 85.5%, 87.5%, 87%, and 86.8% for the no DR, mild, moderate, severe, and proliferative DR categories, respectively. Since morphological changes in the fundus images affect the identification of pathological structures, the lowest AUC, 85.5%, is observed for mild DR.
Figure 8 depicts the confusion matrix, which evaluates the per-class performance of the model. The number of examples correctly classified in each class appears in the diagonal cells. For the model studied in this article, class 0 (healthy) was correctly predicted for 354 images, class 1 (mild) for 43, class 2 (moderate) for 126, class 3 (severe) for 20, and class 4 (proliferative) for 34. In the future, we will consider adding more images of classes 3 and 4 to obtain better performance.
To further demonstrate the robustness of VMamba-m, Table 3 compares our model with four recent studies on the APTOS2019 dataset by Mutawa et al., Dihin et al., Bodapati et al., and Dondeti et al. Although the accuracy of our model, 79.1%, may seem modest compared with these studies, we believe our work brings real value to the field of DR recognition: the model is more efficient, consumes fewer computing resources, and can handle larger input images without memory overflow, and the improved loss function converges faster, accelerating the training process.
4.3. Ablation Experiments with Different Mechanisms
An important goal of this paper is to address the long run times and low accuracy of other current studies. We therefore introduced the SE attention mechanism, the local attention mechanism, and the focal loss function. Ablation experiments let us determine the impact of each mechanism on model performance and judge whether these problems are effectively alleviated.
The ablation experiments with the three mechanisms on the original modeling task are shown in Table 3. Using the VMamba architecture as the baseline, accuracy improves by 5.3% and per-round iteration time falls by 52.7% after introducing the SE attention mechanism; accuracy improves by 5.9% and per-round iteration time falls by 53.8% after introducing the local attention mechanism. Replacing the loss with the focal function improves accuracy by 2%, with computation time essentially unchanged.
After the focal function replaces the original loss in training, the ablation experiments are shown in Table 4. Based on the VMamba architecture, accuracy increases by 9.5% and per-round iteration time decreases by 54.9% after introducing the SE attention mechanism; after introducing the local attention branch, accuracy increases by 8.8% and per-round iteration time decreases by 53.7%. The VMamba-m model proposed in this paper, which replaces the loss with the focal function and introduces both the SE attention mechanism and the local attention mechanism, improves accuracy by 12.3% and reduces per-round iteration time by 55.4%.
Table 4 also shows the objective evaluation indexes for the five-class model proposed in this paper. Analysis of the ablation results shows that introducing the SE attention mechanism and the local attention mechanism affects the prediction results to a greater extent, while replacing the loss with the focal function affects them relatively little. In summary, the modules added to the VMamba model in this paper are helpful for retinopathy classification.
5. Conclusions
This paper presents the improved model VMamba-m, designed for efficient visual representation learning using state space models (SSMs). The advantages of selective SSMs, including global receptive fields, input-dependent weighting parameters, and linear computational complexity, are brought into visual data processing. In addition, to address the poor classification accuracy of existing methods and the tiny, diverse lesions in DR fundus images, this paper improves the original architecture of the VMamba model by adding a dual-branch attention mechanism to extract local features, which significantly improves inference speed and accuracy. Extensive experiments show that VMamba-m and the original Mamba series models are effective, exceeding the performance of traditional models on the preprocessed dataset. Moreover, VMamba-m exhibits significant scalability as the input resolution increases, showing minimal performance degradation while maintaining linear computational complexity. Due to constraints on the experimental environment and time, this paper could not explore additional directions. Future research can be expanded in the following aspects: (1) although the Mamba series models are faster per iteration and more accurate than traditional models, there is still room for improvement in precision, recall, and F1 score; (2) the Mamba series models can be applied to image processing in more medical fields, which could greatly improve the accuracy and efficiency of medical image analysis and provide more accurate data support for diagnosis and treatment in medical imaging.