Article

An Improved Microaneurysm Detection Model Based on SwinIR and YOLOv8

Bowei Zhang, Jing Li, Yun Bai, Qing Jiang, Biao Yan and Zhenhua Wang

1 College of Information Science, Shanghai Ocean University, Shanghai 201306, China
2 Department of Ophthalmology, Eye Institute, Eye and ENT Hospital, Fudan University, Shanghai 201114, China
3 The Affiliated Eye Hospital, Nanjing Medical University, Nanjing 211166, China
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Bioengineering 2023, 10(12), 1405; https://doi.org/10.3390/bioengineering10121405
Submission received: 9 November 2023 / Revised: 23 November 2023 / Accepted: 26 November 2023 / Published: 8 December 2023
(This article belongs to the Section Biosignal Processing)

Abstract

Diabetic retinopathy (DR) is a microvascular complication of diabetes. Microaneurysms (MAs) are often observed in the retinal vessels of diabetic patients and represent one of the earliest signs of DR. Accurate and efficient detection of MAs is crucial for the diagnosis of DR. In this study, an automatic model (MA-YOLO) is proposed for MA detection in fluorescein angiography (FFA) images. To obtain detailed features and improve the discriminability of MAs in FFA images, SwinIR was utilized to reconstruct super-resolution images. To solve the problems of missed detection of small features and feature information loss, an MA detection layer was added between the neck and the head sections of YOLOv8. To enhance the generalization ability of the MA-YOLO model, transfer learning was conducted between high-resolution images and low-resolution images. To avoid excessive penalization due to geometric factors and address sample distribution imbalance, the loss function was optimized by taking the Wise-IoU loss as a bounding box regression loss. The performance of the MA-YOLO model in MA detection was compared with that of other state-of-the-art models, including SSD, RetinaNet, YOLOv5, YOLOX, and YOLOv7. The results showed that the MA-YOLO model had the best performance in MA detection, as shown by its optimal metrics, including recall, precision, F1 score, and AP, which were 88.23%, 97.98%, 92.85%, and 94.62%, respectively. Collectively, the proposed MA-YOLO model is suitable for the automatic detection of MAs in FFA images, which can assist ophthalmologists in the diagnosis of the progression of DR.

1. Introduction

Diabetic retinopathy (DR) is a microvascular complication of diabetes that affects the retina and is a major cause of blindness worldwide [1,2]. The pathogenesis of DR is tightly associated with an altered vessel structure caused by increased blood glucose levels. Initially, DR presents as tiny dilations of capillaries, known as microaneurysms (MAs) [3,4]. MAs are primarily distributed in the inner nuclear layer and deep capillary plexus and are often an early clinical manifestation of various retinal and systemic diseases, including DR, retinal vein occlusion, and infections. In fundus images, MAs appear as small dots and represent a visible pathology at the early stages of DR. Therefore, the accurate detection of MAs is crucial for the prevention, diagnosis, and treatment of DR [5].
The advancement of modern retinal imaging techniques, such as fundus fluorescein angiography (FFA) and non-mydriatic fundus photography (NMFCS), has improved the identification of MAs. FFA is a technique in which a contrast agent is injected to visualize the retinal vessels. NMFCS is a non-invasive imaging technique that captures retinal images through fundus photography. Figure 1 shows images obtained by FFA and NMFCS. The fundus images obtained by FFA exhibit higher contrast and present the features of the retinal structure more clearly than the images obtained by NMFCS. In clinical practice, FFA is widely recognized as an important standard for visualizing the retinal vasculature and describing subtle vascular changes.
Figure 2 shows a normal FFA image and an FFA image with microaneurysms. In FFA images, MAs typically appear as round, bright, spot-like structures with diameters ranging from 10 μm to 100 μm. MAs are of great value in disease diagnosis and screening; however, their objective quantitative evaluation is still limited, as it requires manual detection by experienced technicians.
Over the last two decades, automatic MA detection models based on deep learning have developed rapidly. The convolutional neural network (CNN) is a deep learning algorithm that extracts features from images through multiple layers of convolution and pooling operations and uses fully connected layers for classification or regression tasks. CNNs have achieved great success in the field of image processing and are widely used for object recognition and semantic segmentation. Object recognition with CNNs offers high accuracy, application flexibility, automation, and real-time performance, as demonstrated by detectors such as SSD [6], RetinaNet [7], YOLOv5, YOLOv7 [8], and YOLOX [9]. Meanwhile, previous studies have reported several segmentation models for the automatic detection of MAs. Liao et al. proposed a deep convolutional encoder–decoder network with a weighted dice loss for MA localization [10]. Xia et al. introduced a multi-scale model for detecting and classifying MAs using residual and efficient networks [11]. Chudzik et al. proposed a three-stage detection method as an alternative to the traditional five-stage MA detection and demonstrated successful transfer learning between small MA datasets [12]. Zhou et al. proposed a collaborative learning model based on a fine-tuning detection module in a semi-supervised manner to improve the performance of MA detection [13]. Xie et al. proposed a segmentation–emendation–resegmentation–verification framework to predict and correct detection errors in models, enhancing the detection of MAs [14]. Wang et al. utilized a region-based fully convolutional network (R-FCN) incorporating a feature pyramid network and an improved region proposal network for MA detection [15]. Guo et al. proposed a novel end-to-end unified framework for MA detection that utilizes multi-scale feature fusion and multi-channel bin loss [16]. Mateen et al. proposed a hybrid feature embedding approach using pre-trained VGG-19 and Inception-v3 for MA detection [17]. Kumar et al. trained a radial basis function neural network for MA detection [18]. Table 1 summarizes the strengths and weaknesses of the reported models for MA detection. The above-mentioned deep learning-based MA detection models have improved the efficiency of MA detection in FFA images. However, the tiny size of MAs, their low contrast with the background, and the lack of an annotated MA database still pose great challenges for MA detection. Thus, further study is required to design a novel detection method that enhances MA detection efficiency.
MAs are relatively small and often appear as tiny, blurry lesions in retinal images; this is particularly pronounced in low-resolution images, where MAs often resemble blood vessel pixels. Super-resolution reconstruction is an image processing technique that enhances the spatial resolution and detail clarity of an image by recovering high-resolution details from a low-resolution image. The Swin Transformer [19] has shown great promise, as it integrates the advantages of both the CNN and the Transformer. The Swin Transformer processes large images using a self-attention mechanism and models long-range dependencies with a shifted window scheme. An image restoration model, SwinIR [20], was designed based on the Swin Transformer. SwinIR can not only enhance the detailed features of MAs but also improve the visibility and discriminability of MAs in FFA images. Besides the tiny size of MAs in FFA images, sample imbalance and loss of feature information are two problems that affect the accuracy and efficiency of MA detection. YOLOv8 is an object recognition algorithm characterized by its ability to perform object localization and classification in a single forward pass. YOLOv8 contains a backbone, a neck, and a head. The neck utilizes the path aggregation network (PAN)–feature pyramid network (FPN) structure for feature fusion [21,22]. FPN constructs a multi-scale feature pyramid by adding lateral connections and up-sampling layers to capture rich semantic information and better detect objects of different sizes. PAN addresses the issue of feature propagation in FPN by aggregating and propagating features through horizontal and vertical paths. PAN–FPN combines the strengths of FPN and PAN to provide powerful feature representation capabilities. The backbone and neck of YOLOv8 draw inspiration from the design principles of the YOLOv7 ELAN. The C3 structure of YOLOv5 is replaced with the C2f structure in YOLOv8, which has a richer gradient flow and allows for better capturing of image details and contextual information. Given its powerful detection efficiency, multi-scale feature fusion, and contextual information capturing, the proposed MA detection model was designed based on YOLOv8.
Therefore, an improved MA detection model for FFA images based on SwinIR and YOLOv8, called MA-YOLO, is proposed. The major contributions of this study are as follows:
  • SwinIR was used to reconstruct high-resolution FFA images, which could enhance the visibility and discriminability of MAs in FFA images.
  • A detection layer was added to the YOLOv8 model, which could avoid feature information loss in shallow layers and improve the performance of MA detection.
  • Transfer learning was utilized between high- and low-resolution images to expand the data samples and improve the generalization ability.
  • Taking Wise-IoU as the bounding box regression loss, the loss function of MA-YOLO was improved, which could relieve the sample distribution imbalance problem and enhance the generalization performance.
In addition, the proposed MA-YOLO model could calculate the MA area in FFA images, which would assist ophthalmologists in assessing the progression of DR.

2. Materials and Methods

2.1. Materials

2.1.1. Datasets

The experimental dataset used in this study consisted of two datasets. The first dataset was constructed in collaboration with the Affiliated Eye Hospital of Nanjing Medical University and includes 1200 FFA images (768 × 868 pixels) from 1200 eyes of DR patients (age range, 31–81 years). Image acquisition was performed using a Heidelberg retina angiograph (Heidelberg Engineering, Heidelberg, Germany). To ensure data quality, images that were blurry or overexposed because of environmental or equipment factors were excluded from the dataset.
The second dataset originated from a study conducted at the Persian Eye Clinic (Feiz Hospital) at the Isfahan University of Medical Sciences. The dataset includes 70 retinal images (576 × 720 pixels) from a total of 70 patients, with 30 images classified as normal, and 40 images representing different stages of abnormality. Prior to image collection, each patient underwent a comprehensive ophthalmic evaluation, which involved medical history assessment, applanation tonometry, slit-lamp examination, dilated fundus biomicroscopy, and ophthalmoscopy [23].
Based on the above datasets, a total of 1240 FFA images were selected as the experimental dataset. All images were resized to 768 × 768 pixels and annotated by clinical doctors with more than 10 years of clinical experience. The 1240 FFA images were divided into 992 training images, 124 validation images, and 124 test images.
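The split described above can be reproduced with a few lines of code. The following is a minimal sketch, assuming the resized images sit in a single directory; the directory name, file extension, and random seed are illustrative and not taken from the original study.

```python
import random
from pathlib import Path

# Minimal sketch of the 992/124/124 split of the 1240 FFA images.
# Directory name, extension, and seed are illustrative assumptions.
random.seed(42)  # fixed seed so the split is reproducible

images = sorted(Path("ffa_images").glob("*.png"))  # expects 1240 files
random.shuffle(images)

train, val, test = images[:992], images[992:1116], images[1116:]
print(len(train), len(val), len(test))  # 992 124 124
```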

2.1.2. Implementation

The experiments were run on Ubuntu 20.04.5 with two NVIDIA RTX 2080 Ti GPUs (11 GB of memory per GPU). The software environment consisted of the PyTorch 2.0.0 deep learning framework and Python 3.8.

2.1.3. Evaluation Metrics

Five metrics were calculated to estimate the performance of MA detection [24], i.e., recall (Re), precision (Pre), F1 score (F1), average precision (AP), and frames per second (FPS):

Re = TP / (TP + FN)

Pre = TP / (TP + FP)

F1 = 2 \times Pre \times Re / (Pre + Re)

AP = \int_0^1 Pre(Re) \, dRe

FPS = frameNum / elapsedTime

TP, FP, and FN denote true positive, false positive, and false negative regions, respectively. frameNum is the number of FFA images input into the detection model, and elapsedTime is the time consumed by the detection model. Re is the proportion of correctly detected MAs among all MAs, and Pre is the proportion of real MAs among the samples predicted as MAs. F1 is a balanced metric determined by precision and recall. AP is the area under the precision–recall (PR) curve, obtained by plotting recall on the x-axis and precision on the y-axis based on precision and recall values calculated at different thresholds. FPS is the number of FFA images inferred per second.
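The point metrics above can be computed directly from the counts of true positives, false positives, and false negatives. The following is a minimal sketch (not the authors' evaluation code); AP is omitted because it requires the full precision–recall curve over confidence thresholds, and the example counts are made up.

```python
def detection_metrics(tp: int, fp: int, fn: int,
                      frame_num: int, elapsed_time: float) -> dict:
    """Compute Re, Pre, F1, and FPS from raw detection counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fps = frame_num / elapsed_time
    return {"Re": recall, "Pre": precision, "F1": f1, "FPS": fps}

# Example with made-up counts: 150 detected MAs, 3 false positives,
# 20 missed MAs, 124 test images processed in 82 seconds.
print(detection_metrics(150, 3, 20, 124, 82.0))
```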

2.2. Methods

Figure 3 illustrates the flowchart of the proposed MA-YOLO model. In this model, SwinIR was used for the super-resolution reconstruction of FFA images (Figure 3A). The MA detection layer was added to modify YOLOv8 (Figure 3B), and transfer learning was applied to increase the amount of data. The Wise-IoU loss was utilized to enhance the detection capability (Figure 3C).

2.2.1. Super-Resolution FFA Image Reconstruction Based on SwinIR

MAs are small in size and usually appear as tiny and blurry structures in FFA images. The subtle features of MAs are easily lost in low-resolution images. Therefore, reconstructing high-resolution images is helpful for MA detection in FFA images. Here, SwinIR was employed to perform the super-resolution reconstruction of FFA images.
SwinIR, an image restoration technique, contains three modules, i.e., a shallow feature extraction module, a deep feature extraction module, and a high-quality (HQ) image reconstruction module. The shallow feature extraction module uses a convolution layer to extract shallow features, which are directly transmitted to the reconstruction module to preserve low-frequency information. The deep feature extraction module is mainly composed of residual Swin Transformer blocks (RSTB), each of which utilizes several Swin Transformer layers for self-attention and cross-window interaction. Additionally, a convolution layer is incorporated at the end of each block to enhance features, and a residual connection establishes a shortcut for feature aggregation. Finally, both shallow and deep features are transmitted to the HQ image reconstruction module, which uses a sub-pixel convolution layer [25] to up-sample the features for high-quality image reconstruction.
Figure 4 illustrates the structure of the residual Swin Transformer block and the Swin Transformer layer.
Based on SwinIR, the original FFA images with a size of 768 × 768 pixels were reconstructed into super-resolution FFA images with a size of 1536 × 1536 pixels and 2304 × 2304 pixels, respectively, which effectively enhanced the detail features of MAs.
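The reconstruction step can be wrapped as a simple inference routine. The sketch below assumes a pre-trained SwinIR generator is available as a PyTorch module (e.g., from the official SwinIR repository); `load_swinir` is a hypothetical loader, not an actual function of any package named here.

```python
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor, to_pil_image

def upscale_ffa(model: torch.nn.Module, in_path: str, out_path: str) -> None:
    """Run a pre-trained super-resolution model on a single FFA image."""
    model.eval()
    lr = to_tensor(Image.open(in_path).convert("RGB")).unsqueeze(0)  # 1 x 3 x H x W
    with torch.no_grad():
        sr = model(lr).clamp(0.0, 1.0)  # a x2 model maps 768x768 -> 1536x1536
    to_pil_image(sr.squeeze(0)).save(out_path)

# model_x2 = load_swinir(scale=2)  # hypothetical loader for a x2 SwinIR model
# model_x3 = load_swinir(scale=3)  # x3 model for the 2304 x 2304 reconstruction
# upscale_ffa(model_x2, "ffa_768.png", "ffa_1536.png")
```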

2.2.2. YOLOv8 Modified by MA Detection Layer and Transfer Learning

During down-sampling by the convolution layers of YOLOv8, the regions containing MAs become blurred, making it difficult to accurately localize the MAs. Thus, down-sampling convolutions cause the loss of small features and lead to missed and false MA detections.
Here, an MA detection layer was introduced into the neck and head of YOLOv8 to handle shallow feature maps from the P2 layer of the backbone network and integrate them into the PAN–FPN structure. The architecture of the MA detection layer is shown in Figure 5. The MA detection layer up-samples deep-level feature maps with stronger semantic features from the FPN structure and concatenates them with the shallow-level feature maps output by the P2 layer of the backbone network, enhancing the semantic expression of the shallow-level features. After feature extraction by the C2f module, the resulting features are passed into the added detection head. Simultaneously, the MA detection layer down-samples the obtained feature maps using convolution and concatenates them with the deep-level feature maps, which then undergo another round of feature extraction by the C2f module. This process integrates the feature information extracted from the shallow levels into the PAN structure, enhancing the model's localization capability at various scales. Based on the modified YOLOv8, small MA features can be captured, and the accuracy of MA detection can be enhanced.
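The core idea of the added shallow-feature path can be illustrated with a small PyTorch sketch: a deeper (P3-level) map is up-sampled, concatenated with the P2 backbone map, and fused before being passed to an extra detection head. The class name, channel counts, and the single-convolution fusion block are placeholders, not the actual YOLOv8/C2f implementation.

```python
import torch
import torch.nn as nn

class P2FusionBlock(nn.Module):
    """Illustrative sketch of the extra shallow-feature path: up-sample a
    P3-level map, concatenate it with the P2 backbone map, and fuse.
    Channel counts are placeholders, not the actual YOLOv8 values."""

    def __init__(self, p2_ch: int = 64, p3_ch: int = 128, out_ch: int = 64):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.Sequential(  # stand-in for the C2f module
            nn.Conv2d(p2_ch + p3_ch, out_ch, 3, padding=1),
            nn.SiLU(),
        )

    def forward(self, p2: torch.Tensor, p3: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([p2, self.up(p3)], dim=1))

# p2: high-resolution shallow features, p3: deeper semantic features
p2 = torch.randn(1, 64, 192, 192)
p3 = torch.randn(1, 128, 96, 96)
print(P2FusionBlock()(p2, p3).shape)  # torch.Size([1, 64, 192, 192])
```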
Due to the limited amount of MA data, few annotated samples are available for training and evaluation, making it challenging to construct accurate and reliable models. Here, transfer learning was used to further modify YOLOv8. Transfer learning [26] is a machine learning technique that leverages the knowledge gained from one task to improve performance on a different but related task, usually by transferring pre-trained models or features.
Using transfer learning, three different datasets were leveraged while training the model, including the original MA images with a size of 768 × 768 pixels and two super-resolution reconstructed images with a size of 1536 × 1536 and 2304 × 2304 pixels. Figure 6 shows the flowchart of transfer learning applied to these three different datasets. Based on the original MA images of 768 × 768 pixels, the detection model was pre-trained, and the learned knowledge was retained. Based on the super-resolution reconstructed images of 1536 × 1536 pixels, the detection model was transferred and fine-tuned. The learned knowledge was updated. Based on the super-resolution reconstructed images of 2304 × 2304 pixels, the detection model was transferred and fine-tuned, and the learned knowledge was updated again.
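A sketch of this three-stage transfer-learning schedule, assuming the standard Ultralytics YOLOv8 training interface, is given below; the dataset YAML files, checkpoint names, and output paths are illustrative, and the actual MA-YOLO model additionally carries the modified detection layer and loss.

```python
from ultralytics import YOLO  # assumes the Ultralytics YOLOv8 package

# Stage 1: pre-train on the original 768 x 768 FFA images.
model = YOLO("yolov8s.pt")
model.train(data="ma_768.yaml", imgsz=768, epochs=150)

# Stage 2: transfer the learned weights and fine-tune on the x2 reconstructions.
model = YOLO("runs/detect/train/weights/best.pt")  # default output path assumed
model.train(data="ma_1536.yaml", imgsz=1536, epochs=150)

# Stage 3: transfer again and fine-tune on the x3 reconstructions.
model = YOLO("runs/detect/train2/weights/best.pt")
model.train(data="ma_2304.yaml", imgsz=2304, epochs=150)
```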

2.2.3. Loss Function Optimization Based on Wise-IoU

The loss function of the official YOLOv8 consists of two components: classification and regression. For classification, binary cross-entropy loss (BCEL) is used as the loss function, while for regression, distribution focal loss (DFL) [27] and CIoU [28] bounding box regression loss (CIoUL) are incorporated.
The loss function of YOLOv8 is represented as
f_{loss} = \lambda_1 f_{BCEL} + \lambda_2 f_{DFL} + \lambda_3 f_{CIoUL},
Following the official YOLOv8 weight parameter settings, the weight parameters \lambda_1, \lambda_2, and \lambda_3 were set to 0.05, 0.15, and 0.75, respectively.
BCEL is defined as
f_{BCEL} = \mathrm{weight}[\mathrm{class}] \left( -x[\mathrm{class}] + \log \sum_{j} \exp(x[j]) \right),
where class denotes the class index, weight[class] denotes the weight assigned to each class, and x denotes the vector of predicted class scores.
DFL is an optimization of the focal loss function, which generalizes the discrete results of classification into continuous results through integration, denoted as
f_{DFL}(S_i, S_{i+1}) = -\left( (y_{i+1} - y) \log S_i + (y - y_i) \log S_{i+1} \right),
where y_i and y_{i+1} represent the discrete values to the left and right of the continuous label y, satisfying y_i < y < y_{i+1}, with y = \sum_{i=0}^{n} P(y_i) y_i and P(y_i) = S_i; P can be implemented through a softmax layer.
CIoUL reflects the similarity and accuracy of two bounding boxes based on the overlap between the ground truth box and the predicted box, together with the differences in center point distance and aspect ratio, and is defined as
f_{CIoUL} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v,

v = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2,

\alpha = \frac{v}{(1 - IoU) + v},
where \rho(b, b^{gt}) is the distance between the centers of the target box and the prediction box, c is the diagonal length of the smallest enclosing box, w^{gt} and h^{gt} represent the size of the target box, and w and h represent the size of the prediction box. However, CIoUL ignores the issue of sample distribution imbalance and shows limitations for small MAs in the presence of large background noise.
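For reference, the CIoU term defined above can be computed as follows; this is a minimal sketch assuming boxes are given as (x1, y1, x2, y2) tensors, not the loss implementation used inside YOLOv8.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """Sketch of the CIoU loss for boxes in (x1, y1, x2, y2) format."""
    # intersection and union
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union
    # squared centre distance over squared enclosing-box diagonal
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2
            + (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

pred = torch.tensor([[10.0, 10.0, 20.0, 18.0]])
target = torch.tensor([[11.0, 9.0, 21.0, 19.0]])
print(ciou_loss(pred, target))
```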
Here, CIoUL was replaced with the Wise-IoU [29] bounding box regression loss. The Wise-IoU loss uses a dynamic focusing mechanism to evaluate the quality of the anchor box, where an "outlier degree" is used to avoid excessive penalties for geometric factors (such as distance and aspect ratio). Additionally, the Wise-IoU loss borrows the idea of focal loss, using a constructed focusing coefficient to reduce the contribution of easy samples to the loss value. The Wise-IoU loss function is defined as
L_{WIoU} = \beta^{\gamma} \exp\left( \frac{(x - x_{gt})^2 + (y - y_{gt})^2}{W_g^2 + H_g^2} \right) L_{IoU},

\beta = \frac{L_{IoU}}{\overline{L_{IoU}}} \in [0, +\infty),
where W_g and H_g are the width and height of the smallest enclosing box, x and y are the center coordinates of the prediction box, x_{gt} and y_{gt} are the center coordinates of the ground truth box, \gamma is an adjustable hyperparameter (set to 0.5 here), and \beta indicates the degree of abnormality of the prediction box (a small degree of abnormality means that the quality of the anchor box is high). Therefore, \beta assigns small gradient gains to prediction boxes with large outlier degrees, effectively reducing the harmful gradients of low-quality training samples.
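A minimal PyTorch sketch of the Wise-IoU term, following the two equations above (not the reference implementation of [29]), is given below; the toy tensor values are made up, and the focusing coefficient \beta is detached so that the running mean of L_{IoU} does not receive gradients.

```python
import torch

def wise_iou_loss(pred_xy, gt_xy, iou, mean_l_iou, wg, hg, gamma: float = 0.5):
    """Sketch of the Wise-IoU term as written above.

    pred_xy, gt_xy : (N, 2) centre coordinates of predicted / ground-truth boxes
    iou            : (N,) IoU of each box pair; L_IoU = 1 - IoU
    mean_l_iou     : running mean of L_IoU over recent batches
    wg, hg         : (N,) width / height of the smallest enclosing boxes
    """
    l_iou = 1.0 - iou
    beta = (l_iou / mean_l_iou).detach()  # outlier degree, excluded from gradients
    r = torch.exp(((pred_xy - gt_xy) ** 2).sum(dim=1) / (wg ** 2 + hg ** 2))
    return (beta ** gamma) * r * l_iou

# Toy example: two predicted boxes against their ground truths.
pred_xy = torch.tensor([[10.0, 12.0], [40.0, 41.0]])
gt_xy = torch.tensor([[11.0, 12.0], [38.0, 40.0]])
iou = torch.tensor([0.8, 0.6])
print(wise_iou_loss(pred_xy, gt_xy, iou, mean_l_iou=0.35,
                    wg=torch.tensor([20.0, 25.0]),
                    hg=torch.tensor([18.0, 22.0])))
```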

3. Results

To evaluate the detection performance of the MA-YOLO model, two comparative experiments were performed. Experiment one was an ablation experiment, where the MA-YOLO model was compared with YOLOv8 with different settings. Experiment two was a comparative experiment, where the detection performance of the MA-YOLO model was compared with that of other models, including SSD, RetinaNet, YOLOv5, YOLOX, and YOLOv7.

3.1. Ablation Experiment

The proposed MA-YOLO model was compared to YOLOv8 with different settings, including YOLOv8 with SwinIR (YOLOv8-A), YOLOv8 with SwinIR and transfer learning (YOLOv8-B), YOLOv8 with the Wise-IoU loss function (YOLOv8-C), and YOLOv8 with the MA detection layer (YOLOv8-D). Figure 7 and Table 2 show the results of MA detection and the evaluation metrics for MA detection by YOLOv8 with different settings.
Based on Table 2, MA-YOLO achieved the best Re, Pre, F1, and AP scores in MA detection, i.e., 88.23%, 97.98%, 92.85%, and 94.62%, respectively. However, due to the addition of the MA detection layer, its FPS was 1.51, lower than that of YOLOv8. Compared with the YOLOv8 model, YOLOv8-A improved Re, F1, and AP to 83.44% (2.63↑), 84.46% (1.1↑), and 83.46% (1.37↑), respectively. YOLOv8-B improved Re, Pre, F1, and AP to 85.22% (4.41↑), 88.07% (2↑), 86.62% (3.26↑), and 84.13% (2.04↑), respectively. YOLOv8-C achieved Re, Pre, F1, and AP scores of 84.65% (3.84↑), 86.73% (0.66↑), 85.68% (2.32↑), and 87.29% (5.2↑), respectively. YOLOv8-D improved Re, Pre, F1, and AP to 86.15% (5.34↑), 93.19% (7.12↑), 89.53% (6.17↑), and 88.67% (6.58↑), respectively.
As shown in Figure 7, MA-YOLO provided the best performance for MA detection, with few missed and false detection results. We observed some false MA detection with the YOLOv8-A and YOLOv8-B models and some missed MA detection with the YOLOv8, YOLOv8-A, YOLOv8-B, YOLOv8-C, and YOLOv8-D models.
Figure 8 and Figure 9 illustrate the comparison of the loss curves and AP curves of the validation set between the original images and the super-resolution FFA images, where X1 denotes the original images with a size of 768 × 768 pixels, X2 the super-resolution images with a size of 1536 × 1536 pixels, and X3 the super-resolution images with a size of 2304 × 2304 pixels. Based on Figure 8 and Figure 9, it is evident that the model trained with super-resolution images demonstrated superior convergence trends and detection performance compared to the model trained with the original images.

3.2. Comparison Experiment

To evaluate the performance of MA detection, the proposed MA-YOLO model was compared with other models, including SSD, RetinaNet, YOLOv5, YOLOX, and YOLOv7. SSD is a classic one-stage object recognition algorithm, and its high detection speed makes it highly valuable for practical applications. RetinaNet has enhanced the ability of object recognition models to detect small objects by introducing focal loss. YOLOv5, YOLOX, and YOLOv7 are all part of the series of YOLO algorithms, representing newer models introduced in recent years. In addition, two reports were also selected to evaluate the proposed model’s performance in detecting MAs [24,30].
Table 3 and Table 4 show the comparison of MA detection performance and tuning parameters during the training phase among different models. Figure 10 shows the MA detection results of different models, where the red boxes represent the detection results with a confidence score greater than 0.5, the yellow boxes indicate missed detection, and the green boxes represent false positive detection. Table 5 shows the comparison of the MA detection performance of different object recognition models reported in various studies.
According to Figure 10 and Table 3 and Table 4, the detection results of MA-YOLO were close to the ground truth. Parts of the background were mistakenly detected as MAs by the YOLOv5, YOLOX, and YOLOv7 models, and some MAs were missed by the SSD, RetinaNet, YOLOv5, YOLOX, and YOLOv7 models. MA-YOLO achieved the highest Re, Pre, F1, and AP scores among all models and a higher FPS than RetinaNet. According to Table 5, the detection performance of MA-YOLO was superior to that of the other examined methods.

3.3. Calculation of the MA Region

In addition to MA detection, the MA area was calculated from the circle inscribed in the detection bounding box. The MA area could serve as an indicator to assess the progression of DR. Figure 11 shows the calculation results for the MA area, expressed in μm². The MA area was calculated in FFA images captured by the Heidelberg retina angiograph with a 55° lens at 768 × 768 pixels, with each pixel corresponding to 25 μm.
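The area computation can be reproduced from the detection boxes alone. The following is a minimal sketch using the stated scale of 25 μm per pixel; the box dimensions in the example are made up.

```python
import math

UM_PER_PIXEL = 25.0  # scale stated for the 55-degree, 768 x 768 FFA images

def ma_area_um2(box_w_px: float, box_h_px: float) -> float:
    """Area of the circle inscribed in a detection box, in square micrometres."""
    radius_um = 0.5 * min(box_w_px, box_h_px) * UM_PER_PIXEL
    return math.pi * radius_um ** 2

# Example: a 4 x 3 pixel detection box -> inscribed circle radius of 37.5 um.
print(round(ma_area_um2(4, 3)))  # ~4418 um^2
```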

4. Discussion

Microaneurysms (MAs) are recognized as the earliest sign of DR, a disease that leads to retinal vascular injury. The detection of MAs in FFA images facilitates early DR diagnosis and helps prevent vision loss. However, MAs are extremely small, and their contrast with the surrounding background is very subtle, which makes MA detection challenging. The objective and quantitative evaluation of MAs is still limited because it requires manual detection by experienced technicians. This study has great potential, as it allows the detection and precise localization of MAs in retinal images. The proposed model's outputs can be directly used by ophthalmologists for MA detection, eliminating the need for manual intervention. It contributes to the automation of MA detection and can effectively guide and assist ophthalmologists in the treatment and elimination of MAs. The MA area can serve as an indicator to assess the progression of DR: a large area indicates a more severe condition, requiring more proactive treatment and management measures, and changes in the MA area provide information about the stability or deterioration of the condition. The proposed model can be used to calculate the MA area in FFA images; by regularly calculating the MA area, the progression of DR and the effectiveness of treatments can be monitored.
Due to the addition of the MA detection layer and the handling of higher-resolution images, the improved MA detection performance comes at the cost of a somewhat reduced detection speed. In addition, the proposed model was only applied to a limited dataset, and validation of its performance still requires independent data from different patient cohorts across various medical centers. Future research will concentrate on addressing these issues by quantifying the model's uncertainty [31,32], enhancing the detection speed through parameter pruning, and conducting an in-depth analysis of the model's interpretability [33,34,35].

5. Conclusions

This study proposes the MA-YOLO model for the automatic detection of MAs in FFA images, based on image super-resolution reconstruction for data enhancement. This method can accurately and effectively detect MAs in FFA images. The algorithm utilized SwinIR for image super-resolution reconstruction, transforming the size of FFA images from 768 × 768 pixels to 1536 × 1536 pixels and 2304 × 2304 pixels. By reconstructing low-resolution FFA images, the details of MAs as well as their visibility and discriminability in the images were improved. Based on these improvements, the structure and loss function of the YOLOv8 model were further optimized. To address the challenges of extracting small features and the loss of feature information for MA detection, an MA detection layer was added to enhance feature extraction. Additionally, transfer learning was conducted between high-resolution and low-resolution datasets to enhance the model’s generalization. The Wise-IoU bounding box regression loss was employed to avoid excessive penalization due to geometric factors, improving the model’s generalization performance and addressing the problem of sample distribution imbalance. In addition, the MA-YOLO model can be used to calculate the MA area in FFA images to assist ophthalmologists in assessing the progression of DR.
Using the FFA dataset, ablation experiments were conducted to analyze and validate the effectiveness of the proposed model in the automatic detection of MAs. Furthermore, the proposed model was compared with five detection algorithms, i.e., SSD, YOLOv5, YOLOv7, YOLOX, and RetinaNet. The results showed that the proposed model outperformed these algorithms in terms of MA detection. The MA-YOLO model is thus a prospective approach for the early diagnosis of DR. In the future, the model will be further improved by incorporating more feature learning capabilities to achieve a higher detection speed.

Author Contributions

Z.W. and Q.J. were responsible for the conceptualization and data collection. Z.W. and B.Z. were responsible for the experiment design and manuscript writing; J.L. and Y.B. conducted the data collection and data entry; B.Y. was responsible for overall supervision and manuscript revision. All authors have read and agreed to the published version of the manuscript.

Funding

National Natural Science Foundation of China (Grant No. 82271101 and 82070983) and Natural Science Foundation of Jiangsu Province (Grant No. BK20211020).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and was approved by the Ethics Committee of the Affiliated Eye Hospital, Nanjing Medical University (Identifier: NJMUEH-2021-08-16).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

  1. Alifanov, I.; Sakovych, V. Prognostic risk factors for diabetic retinopathy in patients with type 2 diabetes mellitus. J. Ophthalmol. 2022, 6, 19–23. [Google Scholar] [CrossRef]
  2. Yau, J.W.; Rogers, S.L.; Kawasaki, R.; Lamoureux, E.L.; Kowalski, J.W.; Bek, T.; Chen, S.-J.; Dekker, J.M.; Fletcher, A.; Grauslund, J. Global prevalence and major risk factors of diabetic retinopathy. Diabetes Care 2012, 35, 556–564. [Google Scholar] [CrossRef] [PubMed]
  3. Walter, T.; Massin, P.; Erginay, A.; Ordonez, R.; Jeulin, C.; Klein, J.-C. Automatic detection of microaneurysms in color fundus images. Med. Image Anal. 2007, 11, 555–566. [Google Scholar] [CrossRef] [PubMed]
  4. Couturier, A.; Mané, V.; Bonnin, S.; Erginay, A.; Massin, P.; Gaudric, A.; Tadayoni, R. Capillary plexus anomalies in diabetic retinopathy on optical coherence tomography angiography. Retina 2015, 35, 2384–2391. [Google Scholar] [CrossRef] [PubMed]
  5. Wu, B.; Zhu, W.; Shi, F.; Zhu, S.; Chen, X. Automatic detection of microaneurysms in retinal fundus images. Comput. Med. Imaging Graph. 2017, 55, 106–112. [Google Scholar] [CrossRef] [PubMed]
  6. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part 14. pp. 21–37. [Google Scholar]
  7. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  8. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  9. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  10. Liao, Y.; Xia, H.; Song, S.; Li, H. Microaneurysm detection in fundus images based on a novel end-to-end convolutional neural network. Biocybern. Biomed. Eng. 2021, 41, 589–604. [Google Scholar] [CrossRef]
  11. Xia, H.; Lan, Y.; Song, S.; Li, H. A multi-scale segmentation-to-classification network for tiny microaneurysm detection in fundus images. Knowl.-Based Syst. 2021, 226, 107140. [Google Scholar] [CrossRef]
  12. Chudzik, P.; Majumdar, S.; Calivá, F.; Al-Diri, B.; Hunter, A. Microaneurysm detection using fully convolutional neural networks. Comput. Methods Programs Biomed. 2018, 158, 185–192. [Google Scholar] [CrossRef]
  13. Zhou, Y.; He, X.; Huang, L.; Liu, L.; Zhu, F.; Cui, S.; Shao, L. Collaborative learning of semi-supervised segmentation and classification for medical images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2079–2088. [Google Scholar]
  14. Xie, Y.; Zhang, J.; Lu, H.; Shen, C.; Xia, Y. SESV: Accurate medical image segmentation by predicting and correcting errors. IEEE Trans. Med. Imaging 2020, 40, 286–296. [Google Scholar] [CrossRef]
  15. Wang, J.; Luo, J.; Liu, B.; Feng, R.; Lu, L.; Zou, H. Automated diabetic retinopathy grading and lesion detection based on the modified R-FCN object-detection algorithm. IET Comput. Vis. 2020, 14, 1–8. [Google Scholar] [CrossRef]
  16. Guo, S.; Li, T.; Kang, H.; Li, N.; Zhang, Y.; Wang, K. L-Seg: An end-to-end unified framework for multi-lesion segmentation of fundus images. Neurocomputing 2019, 349, 52–63. [Google Scholar] [CrossRef]
  17. Mateen, M.; Malik, T.S.; Hayat, S.; Hameed, M.; Sun, S.; Wen, J. Deep Learning Approach for Automatic Microaneurysms Detection. Sensors 2022, 22, 542. [Google Scholar] [CrossRef] [PubMed]
  18. Kumar, S.; Adarsh, A.; Kumar, B.; Singh, A.K. An automated early diabetic retinopathy detection through improved blood vessel and optic disc segmentation. Opt. Laser Technol. 2020, 121, 105815. [Google Scholar] [CrossRef]
  19. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  20. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1833–1844. [Google Scholar]
  21. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  22. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  23. Hajeb Mohammad Alipour, S.; Rabbani, H.; Akhlaghi, M. A new combined method based on curvelet transform and morphological operators for automatic detection of foveal avascular zone. Signal Image Video Process. 2014, 8, 205–222. [Google Scholar] [CrossRef]
  24. Gao, W.; Shan, M.; Song, N.; Fan, B.; Fang, Y. Detection of microaneurysms in fundus images based on improved YOLOv4 with SENet embedded. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi = J. Biomed. Eng. 2022, 39, 713–720. [Google Scholar]
  25. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  26. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2009, 22, 1345–1359. [Google Scholar] [CrossRef]
  27. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012. [Google Scholar]
  28. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000. [Google Scholar]
  29. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  30. Akut, R.R. FILM: Finding the location of microaneurysms on the retina. Biomed. Eng. Lett. 2019, 9, 497–506. [Google Scholar] [CrossRef]
  31. Abdar, M.; Pourpanah, F.; Hussain, S.; Rezazadegan, D.; Liu, L.; Ghavamzadeh, M.; Fieguth, P.; Cao, X.; Khosravi, A.; Acharya, U.R. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Inf. Fusion 2021, 76, 243–297. [Google Scholar] [CrossRef]
  32. Seoni, S.; Jahmunah, V.; Salvi, M.; Barua, P.D.; Molinari, F.; Acharya, U.R. Application of uncertainty quantification to artificial intelligence in healthcare: A review of last decade (2013–2023). Comput. Biol. Med. 2023, 165, 107441. [Google Scholar] [CrossRef]
  33. Khare, S.K.; Acharya, U.R. Adazd-Net: Automated adaptive and explainable Alzheimer’s disease detection system using EEG signals. Knowl.-Based Syst. 2023, 278, 110858. [Google Scholar] [CrossRef]
  34. Khare, S.K.; Blanes-Vidal, V.; Nadimi, E.S.; Acharya, U.R. Emotion recognition and artificial intelligence: A systematic review (2014–2023) and research recommendations. Inf. Fusion 2023, 102, 102019. [Google Scholar] [CrossRef]
  35. Khare, S.K.; March, S.; Barua, P.D.; Gadre, V.M.; Acharya, U.R. Application of data fusion for automated detection of children with developmental and mental disorders: A systematic review of the last decade. Inf. Fusion 2023, 99, 101898. [Google Scholar] [CrossRef]
Figure 1. Fundus images. (a) FFA image; (b) NMFCS image.
Figure 2. FFA images. (a) Normal FFA image; (b) FFA image with MAs.
Figure 3. Flowchart of MA-YOLO.
Figure 4. Architecture of RSTB and STL. (a) Residual Swin Transformer block (RSTB); (b) Swin Transformer layer (STL).
Figure 5. Architecture of the MA detection layer.
Figure 6. Flowchart of transfer learning.
Figure 7. MA detection by the YOLOv8 model with different settings, where the red boxes represent the detection results with a confidence score greater than 0.5, the yellow boxes represent missed detection, and the green boxes represent false positive detection.
Figure 8. Comparison of the loss curves between the original images and the super-resolution images.
Figure 9. Comparison of the AP curves between the original images and the super-resolution images.
Figure 10. MA detection results by different models.
Figure 11. Calculation of the MA region.
Table 1. Strengths and weaknesses of different models for MA detection.

Model | Strength | Weakness
U-net + DiceLoss + activation function with long tail [10] | Increasing discrimination ability of probability maps | Missed detection of low-contrast MAs
U-net + residual learning + EfficientNet [11] | Improving segmentation performance by adding a classification network | Structural complexity and time-consuming calculation
U-net + BN layers + Dice coefficient function [12] | Simplifying MA extraction using a three-stage method | Presence of a large number of patches and inefficient detection
U-net + semi-supervised learning [13] | Reduction of the reliance on data labeling | Weak learning performance for MA features
SESV framework + DeepLabv3+ [14] | High level of versatility; could be extended to other networks | High spatial and computational complexity and high cost of training time
R-FCN [15] | Improving the ability to detect objects at different scales | MA detection was compromised by the absence of annotated images
L-Seg [16] | Prevention of information loss by multi-scale feature fusion | Serious misclassification problem in MA detection
VGG-19 + Inception-v3 [17] | Obtaining data correlation in the original feature space using feature embedding | Sample distribution imbalance was neglected
Radial basis function neural network [18] | Enhancement of MA extraction by removing morphological structures inside the retinas | Morphological structure removal made MA extraction cumbersome
Table 2. Comparison of the MA detection performance between YOLOv8 models with different settings.

Model | SwinIR | Transfer Learning | Wise-IoU | MA Detection Layer | Re (%) | Pre (%) | F1 (%) | AP (%) | FPS (it/s)
YOLOv8 | – | – | – | – | 80.81 ± 0.03 | 86.07 ± 0.03 | 83.36 ± 0.03 | 82.09 ± 0.03 | 16.11 ± 0.02
YOLOv8-A | ✓ | – | – | – | 83.44 ± 0.01 | 85.50 ± 0.04 | 84.46 ± 0.04 | 83.46 ± 0.09 | 1.99 ± 0.05
YOLOv8-B | ✓ | ✓ | – | – | 85.22 ± 0.12 | 88.07 ± 0.14 | 86.62 ± 0.11 | 84.13 ± 0.10 | 1.99 ± 0.04
YOLOv8-C | – | – | ✓ | – | 84.65 ± 0.07 | 86.73 ± 0.06 | 85.68 ± 0.08 | 87.29 ± 0.12 | 16.11 ± 0.01
YOLOv8-D | – | – | – | ✓ | 86.15 ± 0.05 | 93.19 ± 0.09 | 89.53 ± 0.06 | 88.67 ± 0.03 | 12.79 ± 0.04
MA-YOLO | ✓ | ✓ | ✓ | ✓ | 88.23 ± 0.11 | 97.98 ± 0.06 | 92.85 ± 0.09 | 94.62 ± 0.06 | 1.51 ± 0.03
Table 3. Comparison of the MA detection performance among different models.

Model | Re (%) | Pre (%) | F1 (%) | AP (%)
SSD | 32.77 ± 0.05 | 76.30 ± 0.07 | 45.85 ± 0.16 | 51.53 ± 0.06
RetinaNet | 71.32 ± 0.03 | 70.99 ± 0.14 | 71.15 ± 0.09 | 72.04 ± 0.05
YOLOv5 | 69.62 ± 0.05 | 71.53 ± 0.15 | 70.56 ± 0.02 | 71.57 ± 0.07
YOLOX | 60.72 ± 0.12 | 67.04 ± 0.06 | 63.72 ± 0.13 | 63.33 ± 0.05
YOLOv7 | 68.18 ± 0.02 | 77.78 ± 0.07 | 72.66 ± 0.08 | 76.48 ± 0.02
MA-YOLO | 88.23 ± 0.11 | 97.98 ± 0.06 | 92.85 ± 0.09 | 94.62 ± 0.06
Table 4. Tuning parameters and time of execution of different models.

Model | Epochs | Freeze Backbone Epochs | Unfreeze Epochs | Batch Size | Optimizer | Initial Learning Rate | Learning Rate Decay | FPS (it/s)
SSD | 150 | 50 | 100 | 4 | SGD | 0.01 | cosine annealing | 22.93 ± 0.02
RetinaNet | 150 | 50 | 100 | 4 | SGD | 0.01 | cosine annealing | 0.53 ± 0.02
YOLOv5 | 150 | 50 | 100 | 4 | SGD | 0.01 | cosine annealing | 16.85 ± 0.04
YOLOX | 150 | 50 | 100 | 4 | SGD | 0.01 | cosine annealing | 13.74 ± 0.02
YOLOv7 | 150 | 50 | 100 | 4 | SGD | 0.01 | cosine annealing | 33.86 ± 0.04
MA-YOLO | 150 | 50 | 100 | 4 | SGD | 0.01 | cosine annealing | 1.51 ± 0.03
Table 5. Comparison of MA detection performance among different studies.

Model | Re (%) | Pre (%) | F1 (%) | AP (%)
Rohan [30] | 78.9 | 86.7 | 82.61 | 81.3
Gao [24] | 89.77 | 87.13 | 88.51 | 88.92
MA-YOLO | 88.23 | 97.98 | 92.85 | 94.62

