1. Introduction
Visual representation learning is one of the fundamental research topics in computer vision, and significant breakthroughs have been made in the deep learning era. In recent years, the Transformer [1] has become a mainstream pillar for a variety of tasks, underpinning prominent models such as BERT [2] and the GPT family [3,4,5,6]. The main contribution of BERT is a new bidirectional Transformer that successfully combines pre-training and fine-tuning to achieve excellent language understanding. BERT propelled progress in natural language processing and became the basis for much subsequent research and many applications; by showing how to exploit large-scale unsupervised learning effectively, it laid the foundation for smarter language understanding systems and pointed to new directions for future research. The GPT series expanded and improved the Transformer architecture, adding multimodal capabilities, performance gains, and security considerations, and it offers a comprehensive perspective on the capabilities and challenges of modern large language models; by showing how to learn efficiently with large-scale models, these works laid the groundwork for smarter artificial intelligence systems. To represent complex patterns in visual data, two major classes of backbone networks, convolutional neural networks (CNNs) [7,8,9,10,11] and vision transformers (ViTs) [12,13,14,15], have been proposed and are widely used for various visual tasks. The CNN works proposed a variety of deep convolutional architectures and contributed pre-trained models, greatly advancing deep learning in computer vision and becoming cornerstones of the field. The ViT works promoted the application of the self-attention mechanism to visual tasks, proved the potential of the Transformer in image processing, and inspired subsequent research; hierarchical designs and shifted-window mechanisms significantly improved the efficiency and performance of Transformers on computer vision tasks, providing new perspectives that continue to drive the field forward. Compared with CNNs, ViTs often exhibit superior modeling capability and generally incorporate a self-attention mechanism [1,14], which provides a global receptive field and dynamically predicted weighting parameters [16].
However, the self-attention mechanism also has side effects: its complexity grows quadratically with input size, leading to significant computational overhead in downstream tasks with large spatial resolution [17]. To address this problem, considerable effort has gone into improving the efficiency of attention, mainly by constraining the size or stride of the computational window [15,18,19]. For example, the main contribution of ConvBERT [20] is a dynamic convolution mechanism that enhances the local context modeling of BERT; by incorporating convolutional layers, ConvBERT achieves significant gains on multiple natural language processing tasks, demonstrating the potential of convolution for text data. Lite Transformer [21] improves the computational efficiency and performance of the Transformer by optimizing the attention mechanism; its long- and short-range attention significantly improves both efficiency and accuracy, offering an efficient solution for resource-limited environments and opening up new directions for model design. Wu et al. [22] proposed a model architecture based on dynamic convolution that improves computational efficiency by reducing reliance on global self-attention; it demonstrates the potential of lightweight convolution in natural language processing and provides important ideas for designing more efficient deep learning models. Linformer [23] significantly reduces computation and memory overhead through low-rank projection, making it more efficient on long sequences; it has shown excellent performance on multiple natural language processing tasks and provides an effective solution to the computational challenges of long-sequence processing. Longformer [24] proposes a Transformer architecture specifically optimized for long documents; its sparse self-attention significantly improves the efficiency of processing long texts and reduces computational complexity while maintaining good performance. Big Bird [25] likewise proposes a Transformer architecture optimized for long sequences; its sparse attention mechanism significantly reduces computational complexity and enhances long-range dependency modeling, enabling it to handle long documents and complex natural language processing tasks effectively. Although these methods are effective, they inevitably trade effective receptive field against computational efficiency, limiting the ability to establish long-range dependencies in visual data.
In fact, the dominance of the Transformer in the field of large models has been difficult to shake. However, the limitations of this architecture have become increasingly obvious as model scale grows and the sequences to be processed become longer. The emergence of the Mamba family of models is changing this in a powerful way, and its excellent performance is quickly gaining widespread attention. Vision Mamba (Vim) has already demonstrated great potential to become the next-generation backbone for vision models. Researchers from the Chinese Academy of Sciences, Huawei, and Pengcheng Laboratory have proposed VMamba, a vision Mamba model with a global receptive field and linear complexity [26].
Diabetic retinopathy (DR), a complication of diabetes mellitus, is caused by lesions in the microvessels of the fundus, which in turn lead to retinal hemorrhage, edema, ischemia, retinal proliferative membrane formation, and retinal detachment, ultimately causing blindness. Compared with traditional machine learning methods, the biggest advantage of using deep learning for automatic DR classification of retinal blood vessels is that no features need to be extracted by hand: the deep model autonomously learns the intrinsic connections between features, avoiding the influence of subjective human factors on the results. For the five-class diabetic retinopathy experiments, this article reproduces several open-source classic models. The results show that VMamba achieves better classification performance and higher accuracy than traditional CNN models. However, whereas reported accuracies in many binary image classification tasks reach 99%, the accuracy of five-class DR grading remains comparatively low.
In this paper, we improve the VMamba model. Building on the original architecture, we modify the VSS module by adding a self-designed local attention module and an SE channel attention module. The focal loss function is adopted during training to enhance classification performance; together, these changes increase inference speed, maximizing accuracy while reducing computation time.
2. Related Work
The World Health Organization’s first-ever World Vision Report revealed that at least 2.2 billion people worldwide are visually impaired or blind, and that the majority of visual impairment and blindness can be avoided through early prevention [27]. In diabetic fundus screening, professional ophthalmologists grade the degree of a patient’s retinopathy based on the characteristics of the blood vessels together with other diseased areas, and take appropriate measures to reduce the risk of blindness. Because of the large number of patients worldwide and the tiny size of the vessels and lesions in the retina, doctors in underdeveloped areas may misdiagnose or miss cases during the diagnostic process.
In early DR classification tasks, machine learning methods were generally used. These required experienced physicians to manually annotate lesion features and then make judgments based on the manually extracted features, so the results depended heavily on the feature extraction approach, and the high cost of manual diagnosis was not addressed. Moreover, because no publicly available retinal datasets existed in the early days, relatively few studies used machine learning methods. Among past studies, Acharya et al. [28] used higher-order spectral techniques with a support vector machine to classify the degree of lesions. Jaspreet Kaur et al. [29] used the K-nearest neighbor (KNN) algorithm to classify diabetic retinopathy and proposed a traditional machine learning framework based on image processing and feature extraction. Du and Li [30] extracted pathological features with a morphological approach and then graded the degree of retinopathy using support vector machine classifiers. Pinz et al. [31] mapped basic anatomical features and pathological changes by fusing feature information from different scanning laser fundus images, and then used a support vector machine to classify patients’ lesions into three categories. Al-Antary et al. [32] processed each fundus image with various linear and nonlinear image filters to generate feature data, and then trained a random forest classifier on these features to achieve classification.
With the development of deep learning technology, the amount of data processed by neural networks has grown steadily, and CNNs have a great advantage in processing high-dimensional information such as images. The Kaggle competition platform released the large-scale retinopathy datasets EyePACS and APTOS in 2015 and 2019, respectively; since then, using deep learning to classify diabetic retinopathy has become the mainstream approach. Compared with traditional machine learning methods, the biggest advantage of deep learning for automatic segmentation and DR classification of retinal blood vessels is that it requires no manual feature extraction: deep models autonomously learn the intrinsic connections between features, avoiding the influence of subjective factors on the results. In addition, deep learning-based segmentation and classification greatly reduce the manpower and resources required, improve the efficiency of disease screening, and ease the tension between the growing number of ophthalmic diseases and the shortage of professional physicians. Many CNN-based DR classification models have therefore emerged in subsequent research. Earlier work generally performed binary DR classification, such as the transfer learning approach of Gulshan et al. [33] based on the InceptionV3 model and the visualization-based explicit lesion-region algorithm designed by Gondal et al. [34]. In fact, grading the severity of DR is more helpful for clinical diagnosis, so more researchers have carried out five-class studies. In terms of model structure, a 10-layer CNN designed by Pratt et al. [35] obtained good results without any feature-specific detection. Xu et al. [36] proposed an 18-layer CNN for DR classification, containing 12 convolutional layers, 4 max pooling layers, and 2 fully connected layers, and obtained good results on private datasets. BiRA-Net, designed by Zhao et al. [37], uses a bilinear classification structure. Sea-Net, proposed by Zhao et al. [38], enhances feature extraction by inserting multiple attention modules between the convolutional layers. In terms of data processing, Bravo and Arbeláez [39] found a good preprocessing method for fundus images through experimental analysis, which has served as a basis for much later work. The Balanced Mix-Up algorithm proposed by Galdran et al. [40] achieved good overall classification results on severely unbalanced DR datasets. Quellec et al. [41] proposed the deep learning-based ExplAIn model, which can not only classify DR lesions but also classify individual pixels in the image as lesion or non-lesion. Saeed et al. [42] proposed a two-stage fine-tuning model: the model is first pre-trained on the ImageNet dataset; then the diseased regions of the retinal image are extracted, the fully connected layer of the pre-trained model is removed, a designed PCA layer is introduced, and the model is pre-trained again; finally, the weights obtained from the second pre-training are fine-tuned to obtain the final classification model.
In recent years, Suedumrong et al. [43] used background removal and data augmentation techniques to eliminate irrelevant information in the images, allowing the model to focus more on lesion-related features; transforming the training data increased sample diversity, improving the generalization ability and robustness of the model and effectively improving the detection of diabetic retinopathy. Mutawa et al. [44] proposed a model based on deep learning and the discrete wavelet transform (DWT): applying the DWT to the input image decomposes it into low-frequency and high-frequency parts, retaining important feature information while removing noise and significantly improving model accuracy. Yi et al. [45] proposed a new network called RA-EfficientNet, adding a residual attention (RA) module to EfficientNet to extract more features and address the small differences between lesions, overcoming the limitations of manual feature extraction. Dihin et al. [46] combined the Swin Transformer architecture with the wavelet transform and an attention mechanism, processing images through a sliding-window mechanism and hierarchical construction while effectively reducing computational complexity.
In past years of research, CNNs and ViTs have been the mainstream frameworks for visual feature extraction. CNNs have the advantages of a simple model, parameter sharing, and high computational efficiency; their disadvantages are the lack of a global receptive field, weakness at processing multimodal data, and aging architectures whose many layers and large parameter counts result in high computational complexity. Various model variants have brought convolutional networks close to their capacity bottleneck [47]. The ViT is based on the Transformer, which has the advantages of a simple architecture, a global receptive field, and dynamic weights; its disadvantage is that its computational efficiency is inferior to that of convolutional neural networks, since the computation of the self-attention mechanism grows quadratically with context length [48].
The Mamba series of models uses a selective state space to process sequences, solving the Transformer’s computational efficiency problem on long sequences [49]. Its sequence-mixing layer scales linearly with sequence length, enabling faster inference. The Mamba family controls the memory range by integrating a selection mechanism, which improves the model’s generalization ability and offers potential for applications in the computer vision domain. The Mamba series also features excellent scalability, allowing models to be scaled to larger sizes more easily without loss of performance. The aim of this paper is to propose a new deep learning-based method for diabetic retinopathy classification, improved from the VMamba model and named VMamba-m. Unlike traditional classification models, the model proposed in this paper computes faster and achieves higher accuracy, addressing the high computational complexity of traditional models.
3. Methodology
3.1. Overview of the Methodology
In this paper, we build on VMamba, a visual backbone network architecture that introduces a state space model (SSM)-based module to process visual data efficiently, reducing computational complexity and increasing inference speed while maintaining the model’s performance.
As shown in Figure 1, the 2D Selective Scan (SS2D) [50] module enables the selective SSM to process visual data efficiently, bridging the gap between 1D scanning and 2D visual data. VMamba has demonstrated excellent performance on several visual tasks, including image classification, object detection, and semantic segmentation. In particular, its computational complexity grows only linearly with input size, giving it significant scalability for large inputs.
The VMamba-m model proposed in this paper is an improvement on VMamba. First, as shown in Figure 2, a dual branch is added to the original VSS module by introducing local attention and channel attention. The VSS module in the original VMamba model can only extract global features; after we add local attention and SE attention to the VSS module, a weight matrix can assign different weights to different positions and channels of the image, capturing more important feature information. The purpose of adding dual-branch attention is to extract local features; the design also trains faster and reaches higher accuracy than the previous single-branch attention. In addition, the use of the focal loss function in the training code improves classification performance by introducing a focusing factor and adjusting sample weights so that the model pays more attention to hard-to-classify samples.
3.2. Focal Loss Function
In object detection tasks, there is usually an extreme imbalance between the numbers of positive and negative samples. Taking one image as an example, the number of candidate boxes that match a target (positive samples) may be only a dozen or a few tens, while the number of unmatched candidate boxes (negative samples) may reach tens of thousands or even hundreds of thousands. This imbalance makes the model favor the negative samples during training and neglect learning from the positive samples. Moreover, even among the negative samples there is an imbalance between hard and easy examples: most negatives are easy to classify, while the small number of hard negatives is crucial for model performance.
Focal loss is an effective method for dealing with the class imbalance problem: by introducing a focusing factor and adjusting sample weights, it makes the model pay more attention to hard-to-classify samples, thereby improving classification performance [51]. It is especially suitable for object detection and other class-imbalanced tasks. The formula for focal loss is as follows:
$$\mathrm{FL}(p_t) = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t)$$

Here, $p_t$ is the model’s predicted probability for the target class, $\alpha_t$ is a balancing factor that adjusts the influence of positive versus negative samples, and $\gamma$ is the focusing factor used to adjust the weights of difficult samples. The modulating factor $(1 - p_t)^{\gamma}$ reduces the loss contribution of easy-to-distinguish samples: regardless of whether a sample belongs to the foreground or background class, the larger $p_t$ is, the easier the sample is to distinguish and the smaller the modulating factor becomes.
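For reference, a minimal PyTorch sketch of this loss is given below. The default values α = 0.25 and γ = 2 follow the original focal loss paper and are illustrative assumptions, not the exact settings of our training code.

```python
import torch
import torch.nn.functional as F


class FocalLoss(torch.nn.Module):
    """Multi-class focal loss: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)."""

    def __init__(self, alpha: float = 0.25, gamma: float = 2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # Per-sample cross-entropy equals -log(p_t); recover p_t from it.
        ce = F.cross_entropy(logits, targets, reduction="none")
        p_t = torch.exp(-ce)
        # Down-weight easy samples with the modulating factor (1 - p_t)^gamma.
        loss = self.alpha * (1.0 - p_t) ** self.gamma * ce
        return loss.mean()


# Usage: logits from the classifier head, integer labels 0-4 for the five DR grades.
criterion = FocalLoss(alpha=0.25, gamma=2.0)
```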
Focal loss, as a loss function designed to address category imbalance and the imbalance between hard and easy samples, has shown great potential in deep learning. By reducing the weight of easy-to-categorize samples and increasing the weight of hard-to-categorize ones, it enables the model to focus on samples that are difficult to categorize correctly, improving overall performance. As deep learning technology continues to evolve, focal loss is expected to play an even greater role in future applications.
3.3. Local Attention
This attention module is designed independently and contains several layers for local attention operations on images. The module composition is shown in Figure 3.
The first layer is a batch normalization layer, which reduces internal covariate shift by standardizing the mini-batch inputs of each layer, thereby improving training speed and stability. Next is a 2D convolutional layer using a 3 × 3 kernel with stride 1 and padding 1, which extracts local features, followed by another batch normalization layer that normalizes the convolutional output. The second 3 × 3 convolutional layer is similar to the first and performs further feature extraction. The third 2D convolutional layer uses a 1 × 1 kernel with stride 1, which is typically used to compress the channel dimension of the feature map. Although the module structure is not complex, it works well.
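A minimal PyTorch sketch consistent with this layer sequence is shown below. The ReLU placement and the final sigmoid gating of the input are our assumptions, since the text specifies only the normalization and convolution layers.

```python
import torch
import torch.nn as nn


class LocalAttention(nn.Module):
    """Local attention branch following the layer order described above.

    The ReLU activation and the sigmoid gating at the end are assumptions;
    the text only specifies the BatchNorm and convolution layers.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels),                               # normalize the input
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),  # extract local features
            nn.BatchNorm2d(channels),                               # normalize conv output
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),  # further extraction
            nn.Conv2d(channels, channels, 1, stride=1),             # 1x1 channel mixing
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate the input with a learned local weighting map.
        return x * torch.sigmoid(self.body(x))
```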
3.4. SE Attention
The goal of using this module is to improve the representation ability of the network by modeling the interdependencies between convolutional feature channels. The core idea is to let the network learn feature weights according to the loss, so that effective feature maps receive large weights and ineffective or less useful feature maps receive small weights, training the model to achieve better results [52]. Embedding an SE block in the original classification network inevitably adds some parameters and computation, but this is acceptable given the improvement in performance. Specifically, the SE attention mechanism consists of the following steps: squeeze, excitation, and reweighting.
- (1)
In the squeeze step, the feature map U is aggregated across its spatial dimensions H × W to generate a channel descriptor: a global average pooling operation compresses the H × W × C input into a 1 × 1 × C vector, so that this global spatial information can be utilized by the layers that follow. The formula is as follows:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i, j)$$
- (2)
In the excitation step, the vector $z$ obtained in the previous step is processed by two fully connected layers, $W_1$ and $W_2$, and a sigmoid function compresses each element of the result to between 0 and 1, yielding the channel weight vector $s$ we want. The different values in $s$ represent the weight information of the different channels, assigning each channel its own weight. The formula is as follows:

$$s = F_{ex}(z, W) = \sigma\left(W_2\,\delta(W_1 z)\right)$$

where $\delta$ denotes the ReLU activation and $\sigma$ the sigmoid function.
- (3)
Reweight: the weights output by the excitation step are applied to the input features through channel-wise multiplication, producing the weighted feature map:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$$
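The three steps can be sketched in PyTorch as follows; the reduction ratio r = 16 is the common default from the SE paper and an assumption here.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: squeeze (global average pooling), excitation
    (two fully connected layers), and channel-wise reweighting."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # W1: compress
        self.fc2 = nn.Linear(channels // reduction, channels)  # W2: restore
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                               # squeeze: H x W x C -> 1 x 1 x C
        s = torch.sigmoid(self.fc2(self.act(self.fc1(z))))   # excitation weights in (0, 1)
        return x * s.view(b, c, 1, 1)                        # reweight each channel
```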
This attention module can make the model more focused on key information through its clever attention mechanism, which effectively promotes the application of deep learning in many fields.
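As an illustration of how the pieces fit together, a hypothetical assembly of the dual-branch block of Section 3.1 is sketched below, reusing the LocalAttention and SEBlock sketches above. The additive fusion and the vss_core placeholder are assumptions; the paper specifies only that the two attention branches are added to the VSS module.

```python
import torch
import torch.nn as nn

# Assumes the LocalAttention and SEBlock classes sketched in
# Sections 3.3 and 3.4 are in scope.


class DualBranchVSSBlock(nn.Module):
    """Hypothetical assembly of the improved VSS block: the original global
    (SSM) path plus the local-attention and SE channel-attention branches.
    Additive fusion is an assumption."""

    def __init__(self, channels: int, vss_core: nn.Module):
        super().__init__()
        self.vss = vss_core                     # original SS2D-based VSS path
        self.local = LocalAttention(channels)   # local feature branch (Section 3.3)
        self.se = SEBlock(channels)             # channel attention branch (Section 3.4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fuse the global SSM features with the locally attended,
        # channel-reweighted features.
        return self.vss(x) + self.se(self.local(x))
```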
4. Experimental Section
4.1. Dataset and Preprocessing
In this paper, model validation is performed on APTOS2019, a publicly available dataset officially provided by Kaggle (San Francisco, CA, USA). The APTOS2019 dataset was obtained by the Aravind team while performing disease screening in medically deprived villages in India, relying on trained doctors to review images and provide diagnoses. The database contains 3662 images with lesions categorized into five categories, where 0 indicates a healthy retina and categories 1–4 indicate mild, moderate, severe, and proliferative retinopathy, respectively. A ratio of 8:2 was used to divide the training set and test set during the training process of this paper.
Table 1 shows the distribution of sample size for each type of lesion in the APTOS2019 dataset.
Because the samples in the DR dataset were captured under different lighting conditions and with different equipment, the images differ greatly in size and color, and some suffer from underexposure, overexposure, or heavy noise. To train the network on samples of consistent color and size, the images must be preprocessed and enhanced. First, the black border of the fundus image must be removed, because its pixel values differ sharply from those of the fundus region and affect the classification results. Since the aspect ratio varies from image to image and the width of the peripheral black border varies, the full image size cannot be used as a reference; instead, the radius of the eyeball is used as the benchmark. Finally, for image enhancement, the method used in Kaggle’s DR classification competition is adopted to improve the brightness and contrast of the images. The process is shown below.
$$I_g(x, y) = G_{\sigma} * I(x, y) \quad (4)$$

$$I_{out}(x, y) = \alpha\, I(x, y) + \beta\, I_g(x, y) + \gamma \quad (5)$$

In Formula (4), $I$ is the preprocessed fundus image and $G_{\sigma}$ represents Gaussian convolution with standard deviation $\sigma$. Formula (5) is a weighted sum of the images before and after blurring. $\alpha$ represents the transparency or blending coefficient: when images are fused or colors are mixed, $\alpha$ controls the degree of blending between the two images, and in a weighted sum it adjusts the influence of the first image. $\beta$ weights the blurred image and mainly adjusts the contrast of the result. $\gamma$ is an additive brightness offset that adjusts the overall brightness of the image and improves its display under different lighting conditions. In this paper, adjusting these values mainly affects the brightness and contrast of the image. $\sigma$ represents the standard deviation of the Gaussian filter and mainly determines the degree of blurring: it controls the width of the Gaussian function and thus the filter’s ability to retain image detail, with a smaller $\sigma$ producing less blur and a larger $\sigma$ making the image more blurred. For the widely varying images in the dataset, after multiple rounds of tuning we found that setting $\alpha$, $\beta$, $\sigma$, and $\gamma$ to 4, −4, 10, and 128, respectively, achieved the best effect, solving the problems of underexposure, overexposure, and high noise.
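A minimal OpenCV sketch of this enhancement with the parameter values reported above is shown below; the eyeball-radius cropping step is omitted for brevity, and the function name is ours.

```python
import cv2
import numpy as np


def enhance_fundus(img: np.ndarray, alpha: float = 4, beta: float = -4,
                   sigma: float = 10, gamma: float = 128) -> np.ndarray:
    """Weighted blend of an image with its Gaussian blur (Formulas (4)-(5)).

    Assumes img has already been cropped to the eyeball radius.
    """
    blurred = cv2.GaussianBlur(img, (0, 0), sigma)            # Formula (4): G_sigma * I
    return cv2.addWeighted(img, alpha, blurred, beta, gamma)  # Formula (5)
```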
Figure 4 shows the preprocessing results. Through these steps, the black border area becomes gray, reducing its impact on the classification results, and the contrast and clarity of the blood vessels and lesion areas are improved, making them easier to distinguish.
4.2. Comparison Experiments of Different Models
To verify the classification advantages of the improved model proposed in this paper, this section compares VMamba, Mamba, and VMamba-m with several locally deployed open-source classic network models on the five-class task. The experiments use the Ubuntu operating system with the CUDA 11 parallel computing architecture; the hardware platform is an RTX 3080 Ti with 12 GB of memory (NVIDIA, Santa Clara, CA, USA) and 32 GB of RAM. All models are built with the PyTorch 2.0 deep learning framework under Python 3.8, with identical preprocessing and parameter settings. During training, the initial learning rate is set to 0.0001, the batch size to 32, and the number of iterations to 150 rounds. We measure model performance with key indicators including accuracy, precision, recall, AUC, F1 score, and iteration time per round. After multiple rounds of experiments, the results of each model on the APTOS2019 dataset are shown in Table 2. The table shows that the Mamba series models perform well on the diabetic retinopathy classification task. On the five-class task, the VMamba-m model proposed in this paper performs best, followed by VMamba, with classification accuracies of 0.791 and 0.714, respectively, higher than the other models, and iteration times per round of 34.027 and 76.322.
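For reference, these indicators can be computed with scikit-learn roughly as follows; macro averaging over the five classes is an assumption, and y_true and y_prob are hypothetical arrays of labels and predicted class probabilities.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)


def evaluate(y_true: np.ndarray, y_prob: np.ndarray) -> dict:
    """y_true: integer labels 0-4; y_prob: per-class probabilities, shape (N, 5)."""
    y_pred = y_prob.argmax(axis=1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "auc": roc_auc_score(y_true, y_prob, multi_class="ovr"),  # one-vs-rest AUC
    }
```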
The training and validation accuracy curves are shown in Figure 5. We will increase the number of iterations in future experiments to obtain a more stable model. Similarly, the validation loss is depicted in Figure 6.
The ROC curves of our VMamba-m model are shown in Figure 7. The ROC curve and AUC value indicate how close the prediction is to a perfect classifier, which corresponds to the upper left corner of the ROC plot. The AUC value is the area under the ROC curve; the closer it is to 1, the better the model performance. The figure shows AUC values of 99.1%, 85.5%, 87.5%, 87%, and 86.8% for the no DR, mild, moderate, severe, and proliferative DR categories, respectively. Since morphological changes in the fundus images affect the identification of pathological structures, the lowest AUC, 85.5%, is observed for mild DR.
Figure 8 depicts the confusion matrix, which evaluates the per-class performance of the model. The number of examples correctly classified in each class appears in the diagonal cells. For the model studied in this article, class 0 (healthy) was correctly predicted for 354 images, class 1 (mild) for 43, class 2 (moderate) for 126, class 3 (severe) for 20, and class 4 (proliferative) for 34. In the future, we will consider adding more images of classes 3 and 4 to obtain better performance.
To further demonstrate the robustness of VMamba-m, Table 3 compares our model with four recent studies on the APTOS2019 dataset by Mutawa et al., Dihin et al., Bodapati et al., and Dondeti et al. Although the accuracy of our model, 79.1%, may seem modest compared with these studies, we believe our work brings real value to the field of DR recognition: the model is more efficient, consumes fewer computing resources, and can handle larger input images without memory overflow, and the improved loss function converges faster, accelerating the training process.
4.3. Ablation Experiments with Different Mechanisms
An important goal of this paper is to address the long run times and low accuracy of other current studies. We therefore introduced the SE attention mechanism, the local attention mechanism, and the focal loss function. Ablation experiments let us determine the impact of each mechanism on model performance and judge whether these problems are effectively alleviated.
The ablation experiments with the three mechanisms on the original modeling task are shown in Table 3. Using the VMamba architecture as the baseline, accuracy improves by 5.3% and per-round iteration time falls by 52.7% after introducing the SE attention mechanism; accuracy improves by 5.9% and per-round iteration time falls by 53.8% after introducing the local attention mechanism. Replacing the loss with the focal function improves accuracy by 2%, with computation time essentially unchanged.
After the focal function replaces the original loss in training, the ablation experiments are shown in Table 4. Based on the VMamba architecture, accuracy increases by 9.5% and per-round iteration time decreases by 54.9% after introducing the SE attention mechanism; after introducing the local attention branch, accuracy increases by 8.8% and per-round iteration time decreases by 53.7%. The VMamba-m model proposed in this paper, which replaces the loss with the focal function and introduces both the SE attention mechanism and the local attention mechanism, improves accuracy by 12.3% and reduces per-round iteration time by 55.4%.
Table 4 also shows the objective evaluation indexes for the five-class model proposed in this paper. Analysis of the ablation results shows that introducing the SE attention mechanism and the local attention mechanism affects the prediction results to a greater extent, while replacing the loss with the focal function affects them relatively little. In summary, the modules added to the VMamba model in this paper are helpful for retinopathy classification.
5. Conclusions
This paper presents the improved model VMamba-m, designed for efficient visual representation learning using state space models (SSMs). The advantages of selective SSMs, including global receptive fields, input-dependent weighting parameters, and linear computational complexity, are brought into visual data processing. In addition, to address the poor classification accuracy of existing methods and the tiny, diverse lesions in DR fundus images, this paper improves the original architecture of the VMamba model by adding a dual-branch attention mechanism to extract local features, which significantly improves inference speed and accuracy. Extensive experiments show that VMamba-m and the original Mamba series models are effective, exceeding the performance of traditional models on the preprocessed dataset. Moreover, VMamba-m exhibits significant scalability as the input resolution increases, showing minimal performance degradation while maintaining linear computational complexity. Due to constraints on the experimental environment and time, this paper could not explore additional directions. Future research can be expanded in the following aspects: (1) although the Mamba series models are faster per iteration and more accurate than traditional models, there is still room for improvement in precision, recall, and F1 score; (2) the Mamba series models can be applied to image processing in more medical fields, which could greatly improve the accuracy and efficiency of medical image analysis and provide more accurate data support for diagnosis and treatment in medical imaging.