Article

Retinal Vessel Segmentation Based on Self-Attention Feature Selection

1
The Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People’s Hospital, Quzhou 324000, China
2
Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
3
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
4
Department of Radiology, Sichuan Academy of Medical Sciences Sichuan Provincial People’s Hospital, Chengdu 610072, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2024, 13(17), 3514; https://doi.org/10.3390/electronics13173514
Submission received: 8 July 2024 / Revised: 21 August 2024 / Accepted: 27 August 2024 / Published: 4 September 2024
(This article belongs to the Section Bioelectronics)

Abstract

Many major diseases can cause changes in the morphology of blood vessels, and the segmentation of retinal blood vessels is therefore of great significance for preventing these diseases. Obtaining complete, continuous, and high-resolution segmentation results is very challenging due to the diverse structures of retinal tissues, the complex spatial structure of the vasculature, and the presence of many small vessels. In recent years, deep learning networks such as UNet have been widely used in medical image processing. However, the repeated down-sampling operations in UNet can result in the loss of a significant amount of information. Although skip connections between the encoder and decoder help address this issue, the encoder features still contain a large amount of irrelevant information that the decoder cannot use efficiently. To suppress this irrelevant information, this paper proposes a feature selection module between the encoder and decoder that utilizes the self-attention mechanism of transformers to accurately and efficiently select the encoder features relevant to the decoder. Additionally, a lightweight Residual Global Context module is proposed to obtain dense global contextual information and establish dependencies between pixels, which effectively preserves vascular details and segments small vessels accurately and continuously. Experimental results on three publicly available color fundus image datasets (DRIVE, CHASE, and STARE) demonstrate that the proposed algorithm outperforms existing methods in terms of both performance metrics and visual quality.

1. Introduction

The relationship between retinal vessel morphology and disease has been widely studied. Morphological changes in retinal vessels provide clinicians with important information about a patient's health status, helping to diagnose and treat various diseases at an early stage. These changes can serve as indicators for predicting and diagnosing diseases such as diabetic retinopathy [1], hypertension [2], cardiovascular disease [3], and Alzheimer's disease [4]. Retinal vessels can also reflect overall health [5] and be used to monitor disease progression [6].
Retinal vessels are thus closely related to major diseases in the human body, but their complex structure and interlaced distribution make manual annotation very difficult and time-consuming. Furthermore, the limited and uneven distribution of ophthalmic medical resources in China makes it difficult for many patients to be diagnosed and treated promptly in the early stages of disease. Automatic vessel segmentation can relieve ophthalmologists of the burden of vessel labeling, allowing more patients in need to receive timely treatment. However, achieving high-resolution, accurate, and complete vessel segmentation is a significant challenge due to the intricate morphology and structure of retinal vessels and the small contrast difference between the vessels and surrounding tissues.
Research on retinal vessel segmentation has grown steadily in recent years. Traditional methods mainly include matched filters, vessel tracking, morphological transformations, and model-based algorithms. Neto et al. [7] developed an end-to-end algorithm based on mathematical morphology, spatial dependency, and curvature. Zhao et al. [8] proposed an infinite active contour model that utilizes mixed regional information. Roychowdhury et al. [9] proposed an adaptive threshold method to achieve iterative vessel segmentation. Saroj et al. [10] divided vessel segmentation into three stages: they first used Principal Component Analysis (PCA) and Contrast Limited Adaptive Histogram Equalization (CLAHE) to obtain an enhanced grayscale image; they then chose optimal values for the Fréchet function and matched filter parameters through exhaustive experimental tests; finally, they obtained a clear and complete vascular image by applying entropy-based optimal thresholding together with length filtering and masking. Roychowdhury et al. [11] reduced the number of pixels to classify by eliminating major vessels detected in commonly segmented regions after high-pass filtering and threshold segmentation. Zhou et al. [12] proposed an improved line detector for rapidly extracting the major structures of vessels.
Methods based on deep learning can automatically extract features and perform classification without the manual design and tuning of features and classifiers. They offer high segmentation accuracy and strong robustness and have been increasingly applied in medical image processing. In recent years, numerous deep learning algorithms for retinal vessel segmentation have emerged. Li et al. [13] proposed a lightweight convolutional neural network with an attention mechanism; the network consists of a basic UNet and an attention module, the latter of which captures global information and enhances features through feature fusion. Li et al. [14] proposed a multi-module cascaded U-shaped network that applies dilated convolutions and multi-kernel pooling to retinal vessel segmentation. Gegundez-Arias et al. [15] proposed a convolutional neural network based on a simplified version of the UNet architecture that combines residual blocks and batch normalization in the up- and down-scaling phases. Zhang et al. [16] proposed a pyramid UNet that uses pyramid spatial aggregation blocks (PSAB) in both the encoder and decoder to aggregate features at multiple levels; in this way, context information from coarse to fine is shared and aggregated in each block, improving vessel segmentation performance. This research indicates that current deep learning-based retinal vessel segmentation methods are mainly built on the UNet architecture, with auxiliary modules added to obtain better features. However, they only connect the features of the encoder with those of the decoder through skip connections, ignoring the relationship between the two. The decoder requires additional information to compensate for the effective information lost through continuous down-sampling in the encoder, and although skip connections can supply this lost information, they also carry a lot of redundant information. Therefore, allowing the decoder to select encoder features reasonably is key to improving vessel segmentation performance.
This article proposes a feature selection transformer block (FSTB) based on a self-attention mechanism to help the decoder select effective information from the encoder and remove redundant information. To model the relationship between decoder and encoder features, we build on the non-local block [17] and perform cross-attention with the decoder features as queries and the encoder features as keys and values, so that the decoder can select the relevant encoder features.
In addition, some vessels are very small and have low contrast with the background, making it difficult to accurately segment them due to the lack of fine-grained details in high-level features. Inspired by GCNet [18], we propose a residual global context transformer block (RGCTB) to replace the original convolutional blocks in UNet [19]. The RGCTB can further extract and retain the details of vessels and suppress useless information.
To validate the effectiveness of the proposed algorithm, we conducted experiments on three color fundus image datasets, which were DRIVE, CHASE, and STARE, and compared our algorithm with other algorithms. The experimental results show that our algorithm achieves advanced performance in terms of both visual effects and performance indicators on all three datasets. In addition, on the DRIVE dataset, we verified the effectiveness of the proposed FSTB and RGCTB in improving vessel segmentation performance.
This study makes the following contributions:
  • Compared to simple skip connections, our proposed FSTB helps the decoder select effective information from the encoder and remove redundant information, while the RGCTB effectively extracts and retains information on small vessels, improving the model's ability to segment vessels.
  • To improve the continuity of the vessel segmentation results, we add an error penalty loss $L_{err}$ to the loss function, which applies an $L_2$ loss to misclassified positive and negative samples after binarization. This effectively improves the performance of the model, especially its ability to predict positive samples.
  • We validated the proposed algorithm on three publicly available datasets, DRIVE, CHASE, and STARE, and compared it with other advanced algorithms. The results show that our algorithm achieves advanced performance in terms of both visual effects and performance indicators on all three datasets, with the ACC and AUC being significantly better than those of other algorithms.

2. Method

In this section, we first provide a brief overview of our proposed network and then elaborate on the key network components, which are the RGCTB and FSTB. Finally, we introduce the proposed error correction loss function.

2.1. Overall Architecture

This article proposes a model for retinal vessel segmentation, and its overall structure is shown in Figure 1. Based on a lightweight UNet architecture, we added an FSTB to the skip connections to help the decoder select effective information from the encoder. To better segment small vessels, we replaced the original convolutional blocks in UNet with an RGCTB. The FSTB takes the decoder features as queries, and the corresponding encoder features as keys and values and uses a self-attention mechanism to fuse effective information from the encoder into the decoder. The RGCTB is embedded with a global context (GC) block between two convolutional blocks, which effectively establishes long-range contextual information of features, and uses a 1 × 1 convolution to connect the input and output features as residuals. During the training process, the FSTB is used to complete the interaction between the decoder and encoder features, while the RGCTB is used to extract and retain more fine-grained vessel details.

2.2. Residual Global Context Transformer Block

Traditional convolutional operations are inefficient at capturing long-range dependencies and modeling global context because they only capture local pixel relationships. Deepening the network can increase the receptive field of neurons in its later layers, achieving a similar global modeling effect, but this brute-force approach has three drawbacks: (1) it is not delicate, requiring a large number of parameters and computations; (2) the deeper the network, the harder it is to optimize; and (3) it cannot fully represent the global context and has a limit on the maximum distance. Wang et al. [17] proposed a non-local block based on a self-attention mechanism, which provided a groundbreaking way to capture long-range dependencies by aggregating global context specific to each query. However, Cao et al. [18] found that the context modeled by the non-local block is almost the same for different query positions in the image. To address these issues, they designed a lightweight GC block, which aggregates features from all positions to form a global context feature.
Figure 1. The overall network architecture, where the numbers above the feature maps represent the number of channels.
To better aggregate spatial information, we propose the RGCTB shown in Figure 2. For the input features $F \in \mathbb{R}^{C' \times H \times W}$, after passing through the first convolutional block, which consists of a 3 × 3 convolutional layer, Batch Normalization, and ReLU activation, we obtain the features $F_{c1} \in \mathbb{R}^{C \times H \times W}$, where $C'$ equals $C/2$ in the encoder and $2 \times C$ in the decoder.
Then, the output $F_{c1}$ of convolutional block 1 is used as the input to the GC block. It first passes through a 3 × 3 convolution and a reshape operation, followed by a softmax to obtain the attention weights; $F_{c1}$ is also reshaped into a $C \times HW$ feature map, and the two are multiplied to compute the attention feature $F_g \in \mathbb{R}^{C \times 1 \times 1}$ as follows:
$F_g = \mathrm{Reshape}(F_{c1}) \times \mathrm{Softmax}(\mathrm{Reshape}(\mathrm{Conv}_{3 \times 3}(F_{c1})))$
Then, we use the obtained attention feature $F_g$ as input and perform a transform operation, which passes through 3 × 3 convolution layer 1, layer normalization, ReLU activation, and 3 × 3 convolution layer 2 to obtain the feature map $F_t \in \mathbb{R}^{C \times 1 \times 1}$.
We add $F_{c1}$ as a residual term to $F_t$ and obtain the final output $F_o \in \mathbb{R}^{C \times H \times W}$ of the GC block through a residual connection as follows:
$F_o = F_t + F_{c1}$
where the + operation denotes a broadcasting element-wise sum operation.
The output $F_o$ of the GC block passes through convolutional block 2, and the input feature $F$ is adjusted in channel number by a 1 × 1 convolution and added to the output of convolutional block 2 as a residual term to obtain the final output $F_{out} \in \mathbb{R}^{C \times H \times W}$ of the RGCTB:
$F_{out} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3 \times 3}(F_o))) + \mathrm{Conv}_{1 \times 1}(F)$
The RGCTB can simultaneously capture global information in space and features across different channels, which can effectively extract features of small vessels and improve the model’s ability to segment small vessels.
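To make the above data flow concrete, the following PyTorch sketch reproduces the structure described in this subsection under our reading of it; the use of a single-channel 3 × 3 convolution for the attention logits and the exact layer arrangement are assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCBlock(nn.Module):
    """Global context block: pools all positions into one C x 1 x 1 context vector."""
    def __init__(self, channels):
        super().__init__()
        # Attention logits; a single output channel is assumed so that F_g has shape (C, 1, 1).
        self.mask_conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.transform = nn.Sequential(  # conv -> LayerNorm -> ReLU -> conv, as in the text
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.LayerNorm([channels, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        weights = F.softmax(self.mask_conv(x).view(b, 1, h * w), dim=-1)    # (B, 1, HW)
        context = torch.bmm(x.view(b, c, h * w), weights.transpose(1, 2))   # (B, C, 1)
        context = self.transform(context.view(b, c, 1, 1))                  # (B, C, 1, 1)
        return x + context                                                  # broadcast element-wise sum


class RGCTB(nn.Module):
    """Residual global context transformer block: conv block 1 -> GC block -> conv block 2, plus 1x1 residual."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.gc = GCBlock(out_ch)
        self.conv2 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.residual = nn.Conv2d(in_ch, out_ch, kernel_size=1)             # channel-matching residual

    def forward(self, x):
        return self.conv2(self.gc(self.conv1(x))) + self.residual(x)
```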

2.3. Feature Selection Transformer Block

The overall structure of the FSTB is consistent with the non-local structure in [17], but, unlike the non-local block's single-feature input, we use dual inputs of decoder features and encoder features, as shown in Figure 3. The queries $Q(F_d)$, keys $K(F_e)$, and values $V(F_e)$ are obtained by passing the decoder features $F_d \in \mathbb{R}^{C \times H \times W}$ and encoder features $F_e \in \mathbb{R}^{C \times H \times W}$ through 3 × 3 convolutions and reshape operations, with a reduced channel number $C' = 4$. To reduce computation and memory usage while preserving as much information as possible, inspired by ANN [20], we apply pyramid average pooling to $K$ and $V$ separately, as shown in Figure 4, with output sizes of (1, 3, 6, 8). As a result, the sizes of $K(F_e)$ and $V(F_e)$ change from $C' \times HW$ to $C' \times S$, where
$S = \sum_{n \in \{1, 3, 6, 8\}} n^2 = 110$
The pairwise computation between the queries and keys in the cross-attention head is described as
$F(F) = Q(F_d)^{T} K(F_e)$
Then, the attention weights are obtained by applying a softmax operation to $F(F)$ and are multiplied with $V(F_e)$ to compute the attention feature $G(F) \in \mathbb{R}^{C' \times H \times W}$:
$G(F) = V(F_e)\,\mathrm{Softmax}(F(F))$
Finally, the channel number of $G(F)$ is adjusted back to $C$ through a 1 × 1 convolution, and the result is residual-connected with the decoder feature $F_d$ to obtain the final feature fusion result $F_{out} \in \mathbb{R}^{C \times H \times W}$:
$F_{out} = \mathrm{Conv}_{1 \times 1}(G(F)) \oplus F_d$
where the ⊕ operation denotes the residual connection of element-wise addition.
Through this cross-attention mechanism, the decoder can interact with the encoder features, select effective vascular information from the encoder features, and suppress the irrelevant information from other tissues in the retinal image.
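The cross-attention described above can be sketched in PyTorch as follows; the reduced channel number of 4, the (1, 3, 6, 8) pooling sizes, and the 3 × 3 projection convolutions follow our reading of the text, while names and other details are our assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FSTB(nn.Module):
    """Feature selection transformer block (sketch): decoder features query encoder features."""
    def __init__(self, channels, reduced=4, pool_sizes=(1, 3, 6, 8)):
        super().__init__()
        self.q = nn.Conv2d(channels, reduced, kernel_size=3, padding=1)   # queries from decoder
        self.k = nn.Conv2d(channels, reduced, kernel_size=3, padding=1)   # keys from encoder
        self.v = nn.Conv2d(channels, reduced, kernel_size=3, padding=1)   # values from encoder
        self.out = nn.Conv2d(reduced, channels, kernel_size=1)            # restore channel count
        self.pool_sizes = pool_sizes

    def _pyramid_pool(self, x):
        # Pool K/V to a few coarse grids and flatten: S = 1 + 9 + 36 + 64 = 110 positions.
        b, c = x.shape[:2]
        pooled = [F.adaptive_avg_pool2d(x, s).view(b, c, -1) for s in self.pool_sizes]
        return torch.cat(pooled, dim=-1)                                   # (B, C', S)

    def forward(self, dec, enc):
        b, _, h, w = dec.shape
        q = self.q(dec).view(b, -1, h * w)                                 # (B, C', HW)
        k = self._pyramid_pool(self.k(enc))                                # (B, C', S)
        v = self._pyramid_pool(self.v(enc))                                # (B, C', S)
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)          # (B, HW, S)
        fused = torch.bmm(v, attn.transpose(1, 2)).view(b, -1, h, w)       # (B, C', H, W)
        return self.out(fused) + dec                                       # residual connection with decoder
```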

2.4. Loss Function

We use two loss functions, $L_{bce}$ and $L_{err}$, to segment blood vessels, and the total loss function is defined as follows:
$L = L_{bce} + \lambda L_{err}$
where $L_{bce}$ denotes the binary cross-entropy loss, $L_{err}$ denotes the error correction loss, and $\lambda$ is the weight of the error correction loss, set to 0.1. $L_{err}$ is defined as follows:
$L_{err} = (1 - \mathrm{thre}_1(pred))^2 \cdot tar + (\mathrm{thre}_2(pred))^2 \cdot (1 - tar)$
where $pred$ represents the predicted output of the model and $tar$ represents the ground-truth label. The operator $\mathrm{thre}_1$ thresholds $pred$ at 0.5: predicted values greater than or equal to 0.5 are set to 1, and predicted values less than 0.5 are left unchanged. Similarly, $\mathrm{thre}_2$ sets predicted values less than 0.5 to 0 and leaves predicted values greater than or equal to 0.5 unchanged. In this way, we can effectively guide the model to correct its errors, especially improving its ability to predict vascular (positive) samples.
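A minimal PyTorch sketch of this loss, under the assumption that the per-pixel terms are averaged over the image (the paper does not state whether the terms are summed or averaged), is given below.

```python
import torch
import torch.nn.functional as F


def error_correction_loss(pred, tar, threshold=0.5):
    """L_err sketch: squared penalty on binarized mispredictions (pred and tar in [0, 1])."""
    # thre1: predictions >= 0.5 are snapped to 1, others kept -> penalizes missed vessels.
    thre1 = torch.where(pred >= threshold, torch.ones_like(pred), pred)
    # thre2: predictions < 0.5 are snapped to 0, others kept -> penalizes false vessels.
    thre2 = torch.where(pred < threshold, torch.zeros_like(pred), pred)
    return ((1 - thre1) ** 2 * tar + thre2 ** 2 * (1 - tar)).mean()  # averaging is an assumption


def total_loss(pred, tar, lam=0.1):
    """L = L_bce + lambda * L_err with lambda = 0.1, as stated in the paper."""
    return F.binary_cross_entropy(pred, tar) + lam * error_correction_loss(pred, tar)
```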

3. Implementation Details

3.1. Data Preparation

We evaluated the proposed method on three publicly available datasets: DRIVE [21], CHASE [22], and STARE [23].
The DRIVE (Digital Retinal Images for Vessel Extraction) dataset is a publicly available dataset for retinal vessel segmentation, created to promote research and development in retinal image analysis. It contains 40 retinal images of size 565 × 584, with 20 images used for training and the other 20 for testing.
The CHASE dataset is another publicly available dataset for retinal vessel segmentation. It consists of 28 retinal images of size 999 × 960, with 20 images used for training and the other 8 for testing.
The STARE (Structured Analysis of the Retina) dataset is a publicly available dataset for retinal vessel segmentation containing 20 retinal images of size 700 × 605. The original dataset does not have a predefined train/test split, so we used four-fold cross-validation during training and testing: we split the dataset into four equal parts, trained the model on three of them, and tested on the remaining one, repeating this process with a different part held out for testing in each iteration, as sketched below. This allows us to evaluate the model's performance on the whole dataset and avoids bias caused by an improper split.
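A minimal sketch of such a four-fold protocol follows; the shuffling seed and the exact assignment of images to folds are our assumptions, since the paper does not specify them.

```python
import random


def four_fold_splits(image_ids, seed=0):
    """Split the 20 STARE images into four folds; each fold serves once as the test set."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)            # fold assignment is an assumption
    folds = [ids[i::4] for i in range(4)]       # four equal parts of five images each
    for k in range(4):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# Example: iterate over the four train/test configurations.
for train_ids, test_ids in four_fold_splits(range(1, 21)):
    pass  # train on `train_ids`, evaluate on `test_ids`
```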
To facilitate down-sampling and up-sampling of the retinal images by our network, we zero-pad the edges of the images in the three datasets, resizing DRIVE, CHASE, and STARE to 592 × 592, 1008 × 1008, and 704 × 704, respectively. To ensure rigorous testing, the output images are resized back to their original dimensions before evaluation. Since these three datasets are relatively small, over-fitting is a common issue; to enhance the robustness of the network, we augment the data by randomly rotating the images and flipping them horizontally, vertically, and diagonally. Furthermore, to enhance image contrast while retaining local details, we apply CLAHE [24] to all input images with ClipLimit = 2 and GridSize = 8.
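The padding and CLAHE steps can be sketched with OpenCV as follows; applying CLAHE to the green channel and padding symmetrically are our assumptions (the paper does not state which channel is enhanced or where the padding is placed), and the file path is purely illustrative.

```python
import cv2
import numpy as np


def preprocess(image_bgr, target_size):
    """Pad a fundus image with zeros to `target_size` and enhance contrast with CLAHE."""
    h, w = image_bgr.shape[:2]
    th, tw = target_size
    # Zero-pad so that DRIVE/CHASE/STARE become 592x592, 1008x1008, 704x704 (symmetric padding assumed).
    top, left = (th - h) // 2, (tw - w) // 2
    padded = cv2.copyMakeBorder(image_bgr, top, th - h - top, left, tw - w - left,
                                cv2.BORDER_CONSTANT, value=0)
    # CLAHE with ClipLimit = 2 and an 8x8 tile grid, applied here to the green channel
    # (a common choice for fundus images; the channel is not specified in the paper).
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    green = clahe.apply(padded[:, :, 1])
    return green.astype(np.float32) / 255.0


# Hypothetical file name, shown only to illustrate the call.
enhanced = preprocess(cv2.imread("drive_training_21.tif"), target_size=(592, 592))
```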

3.2. Model Settings

For all models, we set the number of feature channels after the first convolutional layer to 16 and use the Adam optimizer. We train with a batch size of 2 for a total of 50 epochs with a learning rate of 0.001. Our framework is implemented in PyTorch (Python 3.10) and trained on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory.
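A training loop with these settings might look like the following sketch; the model and DataLoader objects are assumed to be supplied by the caller, and total_loss refers to the loss sketch in Section 2.4.

```python
import torch


def train(model, train_loader, epochs=50, lr=1e-3, lam=0.1, device="cuda"):
    """Training loop with the reported settings: Adam, lr = 0.001, batch size 2, 50 epochs."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for images, labels in train_loader:            # DataLoader built with batch_size=2
            images, labels = images.to(device), labels.to(device)
            preds = torch.sigmoid(model(images))       # per-pixel vessel probability map
            loss = total_loss(preds, labels, lam=lam)  # L_bce + 0.1 * L_err (see Section 2.4 sketch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```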

3.3. Evaluation Metrics

To evaluate our model, we compare the segmentation results with the corresponding labels and classify each pixel as a true positive (TP), false positive (FP), false negative (FN), or true negative (TN). We then use the sensitivity (SE), specificity (SP), accuracy (ACC), and area under the receiver operating characteristic curve (AUC) to evaluate the model's performance. The definitions are as follows:
$SE = \frac{TP}{TP + FN}$
$SP = \frac{TN}{TN + FP}$
$ACC = \frac{TP + TN}{TP + TN + FP + FN}$
In addition to using the SE, SP, ACC, and AUC, we also added the IOU, MIOU, MCC, and F1-Score in ablation experiments for quantitative and comparative analysis. In particular, for dense predictions with an imbalanced number of vessel and non-vessel pixels in retinal images, the MCC metric is more informative.
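The four pixel-level metrics can be computed as in the sketch below; whether the evaluation is restricted to a field-of-view mask is not shown here and would follow the paper's protocol.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def evaluate(prob, label, threshold=0.5):
    """Pixel-wise SE, SP, ACC, and AUC from predicted probabilities and binary labels."""
    pred = (prob >= threshold).astype(np.uint8)
    tp = np.sum((pred == 1) & (label == 1))
    tn = np.sum((pred == 0) & (label == 0))
    fp = np.sum((pred == 1) & (label == 0))
    fn = np.sum((pred == 0) & (label == 1))
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    auc = roc_auc_score(label.ravel(), prob.ravel())   # threshold-free ranking metric
    return se, sp, acc, auc
```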

4. Experimental Results and Analysis

We conducted a series of experiments on three publicly available datasets to evaluate the proposed network and compare the experimental results with those of existing state-of-the-art methods. Then, we performed a quantitative analysis of the performance metrics and a qualitative analysis of the segmentation results’ visual effects and feature maps.

4.1. Comparisons with Other Methods

We present the experimental results of the RGB datasets (DRIVE, CHASE, and STARE) in Table 1 to compare the performance of our proposed method with that of other state-of-the-art methods. To facilitate a better comparison, we not only selected UNet-based methods but also compared several non-UNet methods published in the last five years.
On the DRIVE dataset, the best SE, SP, ACC, and AUC values achieved by previous methods were 0.8291, 0.9879, 0.9699, and 0.9872, respectively. The performance metrics of our method are 0.8071, 0.9860, 0.9701, and 0.9874. In comparison, our network generates the best ACC and AUC, which are 0.0002 and 0.0002 higher than the runner-up, demonstrating its higher segmentation precision and robustness.
On the CHASE dataset, previous works reported the highest values for SE, SP, ACC, and AUC as 0.8477, 0.9896, 0.9751, and 0.9898, respectively. However, these values vary depending on the method used. For example, in reference [35], the highest SE was achieved, but the ACC and AUC values were 0.9745 and 0.9898, which are lower than our results. On the other hand, reference [34] achieved the highest SP, which was only 0.0013 higher than our proposed SP. However, our proposed method outperforms the method in [34] in terms of the ACC and AUC, with at least 0.0065 higher values. Specifically, our proposed method achieves the highest ACC and AUC values of 0.9758 and 0.9963, while also ranking high in terms of SE and SP. Overall, our proposed method demonstrates a good balance across all metrics and exhibits excellent performance on the CHASE dataset.
On the STARE dataset, our proposed method achieves the highest AUC of 0.9976 among all the compared methods, 0.0048 higher than that of the second-best method. Although our method does not achieve the highest SP, the methods with higher SP have very low SE, indicating a significant imbalance. Meanwhile, although our ACC is not the highest, it is only 0.0003 lower than the best value, making it very close to the optimal performance. Overall, our proposed method demonstrates a significant advantage in accuracy and stability compared to the other methods, as illustrated by these four metrics.
In conclusion, our proposed method has excellent performance on all three datasets, while many of the methods listed in the table only show good performance on one or two datasets. Our proposed method nearly achieves the best ACC and AUC values across all three datasets, indicating its high accuracy and robustness.

4.2. Visual Comparison and Analysis

Figure 5 shows the overall segmentation results of the three datasets, encompassing the entire retinal vessel structure. It can be seen that the proposed method accurately segments the retinal vessels, producing a complete segmented structure. The algorithm can successfully identify most of the thin vessels. The final row of images in Figure 5 provides more detailed pixel-level information. TP pixels are marked in green, TN pixels in white, FP pixels in red, and FN pixels in blue. The latter two types of pixels represent misclassified pixels. The red and blue pixels are scarce and mostly concentrate at the end of thin vessels and along vessel edges.
For a qualitative comparison, we selected one fundus image from each dataset and show the original image, the segmentation results of our proposed FS-UNet and CAR-UNet [29], and the ground truth label in Figure 6. To demonstrate the details of the vessels and segmentation results, we selected a region on each image (marked with different color boxes) and show an enlarged version in the rightmost column. The results indicate that the segmentation results of FS-UNet are more similar to the ground truth label, surpassing the results of CAR-UNet [29]. Specifically, in areas with low contrast and interference from other tissues or lesions, our FS-UNet can segment tiny vessels more accurately. More importantly, our FS-UNet achieves better continuity of retinal vessels, which is crucial for subsequent analysis and real clinical applications.

4.3. Ablation Studies

We conducted ablation studies to better understand the impact of each component of our network. First, we analyzed the feature selection ability of the FSTB using heatmaps. Then, we analyzed the ability of the RGCTB to retain small-vessel features by comparing the segmentation results for small vessels. Finally, we conducted experiments on all three datasets, which comprehensively demonstrate that the proposed loss function effectively improves the model's ability to recognize positive samples. The detailed metric values are given in Table 2.

4.3.1. The Effect of the FSTB

Table 2 shows that, on the DRIVE dataset, UNet with the FSTB outperforms the original skip connections in seven metrics (all except SP). The SE improves from 0.7079 to 0.8061, an increase of 13.87%. The ACC increases from 0.9675 to 0.9694, and the AUC improves by 1.96%. The MCC, F1-score, IOU, and MIOU increase by 3.02%, 3.83%, 6.30%, and 2.65%, respectively.
We also present the original image, label, and feature maps before and after the FSTB in Figure 7 to provide a more intuitive visualization of the feature selection process. It can be observed that the feature maps after feature selection are significantly refined while preserving useful information for vessel segmentation. Features of other tissues or pathologies, such as exudates, hemorrhages, and microaneurysms (circled portion in the figure), are irrelevant for vessel segmentation and can introduce interference. After feature selection, these features are effectively filtered out. The decoder features, obtained through the encoder and partial decoder, exhibit highly detailed vascular characteristics. Through utilizing self-attention, the FSTB can extract information from the encoder features that align with the decoder vascular features, while filtering out irrelevant information.

4.3.2. The Effect of the RGCTB

To demonstrate the effectiveness of the proposed RGCTB, we compared it with the original convolutional blocks in UNet and observed the improvement. Table 2 shows that, on the DRIVE dataset, the RGCTB-based UNet outperforms the original convolutional blocks in seven metrics (all except SP). The SE increases from 0.7079 to 0.7970, a 12.59% improvement; the ACC increases from 0.9692 to 0.9699; and the AUC increases by 0.02%. The MCC, F1-score, IOU, and MIOU increase by 2.77%, 3.55%, 5.83%, and 2.46%, respectively.
In general, using the RGCTB results in significant improvements in all metrics, especially in SE. This indicates that the global information provided by the RGCTB can significantly enhance the algorithm's ability to segment tiny vessels, thus greatly improving SE. We further present a visual comparison of the vessel segmentation results between the RGCTB and the original convolutional blocks in Figure 8. Retinal vessel images contain many tiny vessels, so we selected three magnified local patches from the DRIVE dataset for clearer visualization. As shown in Figure 8, introducing the RGCTB allows the model to detect more tiny vessels and vessels in low-contrast regions while better preserving the vessel structure.

4.3.3. The Effect of Loss

Based on the proposed FS-UNet, we conducted a comparative analysis of the error correction loss function $L_{err}$. As shown in Table 3, $L_{err}$ effectively improves all metrics except SP on the three datasets, especially SE, which shows a significant improvement. In retinal vessel segmentation, only 9–14% of pixels belong to the vessel class, while the remaining pixels belong to the non-vessel class [39]. Therefore, the model faces greater difficulty in recognizing vessel-class pixels, leading to more errors on vessel samples.
By incorporating $L_{err}$, the model must pay more attention to vessel samples to minimize the loss function, thereby greatly improving its ability to recognize vessel samples. Although we also perform error correction on non-vessel samples, these belong to the majority class, and their recognition difficulty and probability of error are low. As the model's ability to recognize vessel samples is greatly improved, its ability to recognize non-vessel samples may decrease.

4.3.4. Qualitative Visualization

In order to better demonstrate the role of the FSTB module in the model, we use Grad-CAM to visualize three stages during the model's forward inference. The positions of the stages are as follows:
  • FSTB1: After the first FSTB;
  • FSTB2: After the second FSTB;
  • FSTB3: After the third FSTB.
As a comparison, we also include the original image, which is displayed together in Figure 9. The fundus image shows the retinal blood vessels and the optic disc region. This image serves as the input data for the model, which is used to generate the subsequent attention heatmaps. FSTB1 is the attention heatmap generated from the shallow feature map. We can see that the contours of the blood vessels are relatively vague, and the model’s attention is focused on the larger main vessels. The majority of the color is between blue and green, indicating relatively low attention weights. At this shallow feature stage, the model mainly focuses on basic features of the image. FSTB2 is the attention heatmap generated from the middle-layer feature map. Compared to FSTB1, the model’s attention at this layer begins to focus more on the blood vessels, particularly on the major vessels, revealing more details. The yellow and red areas in the image indicate increased attention, showing that the middle-layer features have a clearer response to the target areas. FSTB3 is the attention heatmap generated from the deep feature map. Compared to the previous two images, the attention at this layer is more concentrated and refined, especially in the distribution of the smaller blood vessels, which become visible. The red areas are more prominent, indicating that the model assigns higher attention weights to the important details in the image during the deep feature stage.
Through attention heatmaps, medical experts can intuitively see which areas the model focuses on more. For example, in the images of FSTB2 and FSTB3, the model concentrates its attention on the major blood vessels in the retina, which aligns with the areas that doctors focus on during examination. The model’s attention to these key areas indicates its ability to recognize structures that are significant for diagnosis. From a medical interpretability perspective, this visualization reveals how the model progressively focuses on diagnostically critical areas by showing the attention distribution on retinal blood vessels at different levels. This visualization not only helps doctors understand the model’s decision-making process and enhances their trust in the model but also serves as an auxiliary diagnostic tool, alerting doctors to potential areas of concern, thereby improving the reliability and safety of automated diagnosis.
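For readers who wish to reproduce this kind of visualization, a minimal Grad-CAM sketch using plain PyTorch hooks is given below; the layer handle (e.g., a hypothetical model.fstb2 attribute) and the choice of the summed vessel logits as the target score are our assumptions, not necessarily the exact setup used for Figure 9.

```python
import torch


class GradCAM:
    """Minimal Grad-CAM: hook a chosen block's output and weight its activations by pooled gradients."""
    def __init__(self, model, target_layer):
        self.model = model
        self.activations, self.gradients = None, None
        target_layer.register_forward_hook(self._save_activation)
        target_layer.register_full_backward_hook(self._save_gradient)

    def _save_activation(self, module, inputs, output):
        self.activations = output.detach()

    def _save_gradient(self, module, grad_input, grad_output):
        self.gradients = grad_output[0].detach()

    def __call__(self, image):
        score = self.model(image).sum()                            # scalar objective: summed vessel logits
        self.model.zero_grad()
        score.backward()
        weights = self.gradients.mean(dim=(2, 3), keepdim=True)    # global average pooling of gradients
        cam = torch.relu((weights * self.activations).sum(dim=1))  # (B, H, W) class activation map
        return cam / (cam.max() + 1e-8)                            # normalize to [0, 1]


# Usage (hypothetical attribute name): heatmap = GradCAM(model, model.fstb2)(image)
```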

4.4. Generalization Analysis

Generalization ability is crucial for computer-aided diagnosis systems. We used a cross-training scheme [40,41] to verify the generalization ability of the proposed FS-UNet. Specifically, we first trained FS-UNet on the DRIVE training set and then tested it on the STARE dataset without fine-tuning. We then trained FS-UNet on the STARE dataset and tested it on the DRIVE test set, again without fine-tuning. We also compared our FS-UNet with existing methods and present the results in Table 4. The results show that FS-UNet achieves the highest ACC and AUC and the second-highest SP when testing on the DRIVE dataset, as well as the highest SE, ACC, and AUC when testing on the STARE dataset. These results demonstrate that our FS-UNet exhibits a stronger generalization ability for retinal vessel segmentation than the other existing models.
However, it is observed that the SE value on the DRIVE dataset significantly decreases after cross-training compared to the results in Table 1. This may be attributed to the limited annotations of tiny vessels in the STARE dataset, which leads to the insufficient learning of FS-UNet in extracting tiny and low-contrast vessels. Conversely, the manual annotations in the DRIVE dataset contain more tiny vessels, so applying a model trained on the DRIVE dataset to the STARE dataset slightly reduces the SP but achieves the highest SE and AUC during testing.
When FS-UNet is trained on the HRF dataset, its performance varies. On the DRIVE dataset, the FS-UNet model trained on HRF exhibits decreased sensitivity (SE) compared to the model trained on DRIVE, which may be attributed to differences in image size between the HRF and DRIVE datasets. Nevertheless, the model trained on HRF performs well on the STARE dataset, achieving the highest specificity (SP) and accuracy (ACC), likely due to the greater similarity between the HRF and STARE datasets. However, due to size differences, the SE on STARE is also relatively low.

4.5. Model Complexity Analysis

In deep learning model design and application, model complexity is a key metric typically measured by computational complexity (FLOPs) and model parameter count. Computational complexity refers to the number of floating point operations required for a single forward pass of the model, while parameter count indicates the number of trainable parameters in the model. These two metrics not only affect the training time and inference speed of the model but also directly impact resource consumption and deployment feasibility in practical applications. Therefore, understanding and optimizing model complexity is crucial for achieving efficient model design and practical deployment.
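For reference, the parameter count reported in Table 5 can be reproduced with a few lines of PyTorch, whereas FLOPs are typically measured with a separate profiling tool; the FSUNet name in the comment is a placeholder for the proposed model.

```python
import torch


def count_parameters(model: torch.nn.Module) -> float:
    """Trainable parameter count in millions; FLOPs additionally require a profiler."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6


# Example: count_parameters(FSUNet(...)) should report roughly 0.87 M for the proposed model.
```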
We compared the computational complexity of our model with that of other models, and the results are presented in Table 5.
From the table, we can find that our model’s parameter count is relatively small at 0.87 million, especially when compared to other models such as U-Net (7.8 million) and Attention U-Net (34.8 million). This low parameter count suggests that the model is efficient in terms of memory usage and storage requirements, which can be advantageous for deployment in resource-constrained environments. A smaller parameter count generally leads to reduced storage needs and can facilitate faster model loading and deployment, making it particularly suitable for applications on devices with limited resources.
However, despite the small parameter count, our model has relatively high FLOPs of 47.6 billion, notably larger than those of many other models. This indicates that the model involves complex operations or processes large amounts of data, resulting in substantial computational demands during inference. The elevated FLOPs reflect the computational complexity of the operations performed, such as convolutions and matrix multiplications, which can impact real-time performance and efficiency. Therefore, while the model is efficient in terms of memory, it requires considerable computational resources, which may affect its deployment in scenarios with limited computational power.

5. Conclusions

In this study, we proposed the FS-UNet model for retinal vessel segmentation, integrating the RGCTB and FSTB modules to optimize attention mechanisms and task-specific adaptations, enhancing segmentation performance on complex retinal images. Our experiments demonstrate that FS-UNet outperforms several state-of-the-art methods on the DRIVE, CHASE, and STARE datasets, particularly in capturing fine and intricate retinal vessel structures.
The primary strengths of FS-UNet lie in its superior detail preservation and precise recognition of fine structures. The RGCTB module replaces traditional convolutional blocks, significantly enhancing the model’s capability to extract and retain vessel details. Simultaneously, the FSTB module enables the decoder to selectively extract relevant vessel information from the encoder, filtering out irrelevant tissue information. Additionally, the error correction loss function L e r r , improves the model’s focus on error-prone samples, further boosting overall segmentation accuracy.
Despite these strengths, our model has some limitations in terms of complexity and computational cost. While FS-UNet’s FLOPs (47.6 billion) are lower than the standard UNet’s 54.9 billion, they remain higher than lightweight models like DUNet (0.2 billion) and OCE-Net (0.2 billion). The high FLOPs indicate a higher computational burden, which may limit its use in environments with restricted computing resources. However, it is important to note that our model has a parameter count of only 0.87 million, the lowest among the compared methods, which contributes to faster training and inference speeds. We also prioritized model performance over lightweight design due to our target application in hospitals, where computing resources are generally less restricted. Additionally, our model’s inference speed in terms of FPS is competitive with most compared models and sufficient for our application scenarios.
That said, further optimization of the model’s complexity could broaden its applicability to more resource-constrained environments. Techniques such as model pruning, quantization, or the development of lightweight network architectures could be explored to reduce the computational cost while maintaining or enhancing performance. These optimizations would make FS-UNet more suitable for real-time processing and deployment in settings with limited computing power.
Furthermore, in the medical field, deep learning models require a high level of interpretability. While our RGCTB and FSTB modules provide some degree of explanation for the model’s decision-making process, further efforts are needed to enhance transparency. We plan to integrate explainable AI techniques, such as saliency maps and feature importance analysis, to increase the model’s trustworthiness in clinical applications.
In summary, FS-UNet exhibits significant advantages in retinal vessel segmentation, particularly in detail preservation, structure recognition, and overall accuracy. However, addressing the model’s complexity and computational efficiency remains crucial for extending its applicability to a wider range of clinical environments. These challenges offer important directions for future research, aiming to balance performance with resource efficiency.

Author Contributions

Conceptualization, L.J. and W.L.; methodology, W.L.; software, Z.X.; validation, Z.X. and W.L.; formal analysis, C.H.; investigation, G.Y.; resources, Z.W. and Y.T.; data curation, C.Q. and Y.T.; writing—original draft preparation, W.L.; writing—review and editing, W.L., Z.W., W.X. and L.Z.; supervision, L.J. and Z.W.; project administration, L.J. and Z.W.; funding acquisition, Y.T. and C.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Zhejiang Provincial Basic Public Welfare Project under Grant No. LGF22H120017, Medico-Engineering Cooperation Funds from University of Electronic Science and Technology of China under Grant No. ZYGX2021YGLH214, Quzhou City Science and Technology Project under Grant No. ZD2020155, and the Municipal Government of Quzhou under Grant Nos. 2023D001, 2023D020 and No. 2022D026.

Data Availability Statement

The data involved in the experiments are all publicly available and properly referenced in the paper. The source code of the proposed method in the paper can be shared upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sabanayagam, C.; Lye, W.K.; Klein, R.; Klein, B.E.; Cotch, M.F.; Wang, J.J.; Mitchell, P.; Shaw, J.E.; Selvin, E.; Sharrett, A.R.; et al. Retinal microvascular calibre and risk of diabetes mellitus: A systematic review and participant-level meta-analysis. Diabetologia 2015, 58, 2476–2485. [Google Scholar] [CrossRef] [PubMed]
  2. Chew, S.K.; Xie, J.; Wang, J.J. Retinal arteriolar diameter and the prevalence and incidence of hypertension: A systematic review and meta-analysis of their association. Curr. Hypertens. Rep. 2012, 14, 144–151. [Google Scholar] [CrossRef]
  3. Guo, S.; Yin, S.; Tse, G.; Li, G.; Su, L.; Liu, T. Association between caliber of retinal vessels and cardiovascular disease: A systematic review and meta-analysis. Curr. Atheroscler. Rep. 2020, 22, 1–13. [Google Scholar] [CrossRef] [PubMed]
  4. Jin, Q.; Lei, Y.; Wang, R.; Wu, H.; Ji, K.; Ling, L. A systematic review and meta-analysis of retinal microvascular features in Alzheimer’s Disease. Front. Aging Neurosci. 2021, 13, 683824. [Google Scholar] [CrossRef]
  5. Courtie, E.; Veenith, T.; Logan, A.; Denniston, A.; Blanch, R. Retinal blood flow in critical illness and systemic disease: A review. Ann. Intensive Care 2020, 10, 1–18. [Google Scholar] [CrossRef] [PubMed]
  6. Greferath, U.; Guymer, R.H.; Vessey, K.A.; Brassington, K.; Fletcher, E.L. Correlation of histologic features with in vivo imaging of reticular pseudodrusen. Ophthalmology 2016, 123, 1320–1331. [Google Scholar] [CrossRef]
  7. Neto, L.C.; Ramalho, G.L.; Neto, J.F.R.; Veras, R.M.; Medeiros, F.N. An unsupervised coarse-to-fine algorithm for blood vessel segmentation in fundus images. Expert Syst. Appl. 2017, 78, 182–192. [Google Scholar] [CrossRef]
  8. Zhao, Y.; Rada, L.; Chen, K.; Harding, S.P.; Zheng, Y. Automated vessel segmentation using infinite perimeter active contour model with hybrid region information with application to retinal images. IEEE Trans. Med. Imaging 2015, 34, 1797–1807. [Google Scholar] [CrossRef]
  9. Roychowdhury, S.; Koozekanani, D.D.; Parhi, K.K. Iterative vessel segmentation of fundus images. IEEE Trans. Biomed. Eng. 2015, 62, 1738–1749. [Google Scholar] [CrossRef]
  10. Saroj, S.K.; Kumar, R.; Singh, N.P. Frechet PDF based matched filter approach for retinal blood vessels segmentation. Comput. Methods Prog. Biomed. 2020, 194, 105490. [Google Scholar] [CrossRef]
  11. Roychowdhury, S.; Koozekanani, D.D.; Parhi, K.K. Blood vessel segmentation of fundus images by major vessel extraction and subimage classification. IEEE J. Biomed. Health Inform. 2014, 19, 1118–1128. [Google Scholar]
  12. Zhou, C.; Zhang, X.; Chen, H. A new robust method for blood vessel segmentation in retinal fundus images based on weighted line detector and hidden Markov model. Comput. Methods Prog. Biomed. 2020, 187, 105231. [Google Scholar] [CrossRef] [PubMed]
  13. Li, X.; Jiang, Y.; Li, M.; Yin, S. Lightweight attention convolutional neural network for retinal vessel image segmentation. IEEE Trans. Ind. Inform. 2020, 17, 1958–1967. [Google Scholar] [CrossRef]
  14. Li, J.; Zhang, T.; Zhao, Y.; Chen, N.; Zhou, H.; Xu, H.; Guan, Z.; Xue, L.; Yang, C.; Chen, R.; et al. MC-UNet: Multimodule Concatenation Based on U-Shape Network for Retinal Blood Vessels Segmentation. Comput. Intell. Neurosci. 2022, 2022, 9917691. [Google Scholar] [CrossRef]
  15. Gegundez-Arias, M.E.; Marin-Santos, D.; Perez-Borrero, I.; Vasallo-Vazquez, M.J. A new deep learning method for blood vessel segmentation in retinal images based on convolutional kernels and modified U-Net model. Comput. Methods Prog. Biomed. 2021, 205, 106081. [Google Scholar] [CrossRef] [PubMed]
  16. Zhang, J.; Zhang, Y.; Xu, X. Pyramid u-net for retinal vessel segmentation. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 1125–1129. [Google Scholar]
  17. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  18. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  19. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  20. Zhu, Z.; Xu, M.; Bai, S.; Huang, T.; Bai, X. Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 593–602. [Google Scholar]
  21. Staal, J.; Abràmoff, M.D.; Niemeijer, M.; Viergever, M.A.; Van Ginneken, B. Ridge-based vessel segmentation in color images of the retina. IEEE Trans. Med. Imaging 2004, 23, 501–509. [Google Scholar] [CrossRef] [PubMed]
  22. Owen, C.G.; Rudnicka, A.R.; Mullen, R.; Barman, S.A.; Monekosso, D.; Whincup, P.H.; Ng, J.; Paterson, C. Measuring retinal vessel tortuosity in 10-year-old children: Validation of the computer-assisted image analysis of the retina (CAIAR) program. Investig. Ophthalmol. Vis. Sci. 2009, 50, 2004–2010. [Google Scholar] [CrossRef]
  23. Hoover, A.; Kouznetsova, V.; Goldbaum, M. Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Trans. Med. Imaging 2000, 19, 203–210. [Google Scholar] [CrossRef]
  24. Reza, A.M. Realization of the contrast limited adaptive histogram equalization (CLAHE) for real-time image enhancement. J. VLSI Signal Process. Syst. Signal Image Video Technol. 2004, 38, 35–44. [Google Scholar] [CrossRef]
  25. Jin, Q.; Meng, Z.; Pham, T.D.; Chen, Q.; Wei, L.; Su, R. DUNet: A deformable network for retinal vessel segmentation. Knowl.-Based Syst. 2019, 178, 149–162. [Google Scholar] [CrossRef]
  26. Schlemper, J.; Oktay, O.; Schaap, M.; Heinrich, M.; Kainz, B.; Glocker, B.; Rueckert, D. Attention gated networks: Learning to leverage salient regions in medical images. Med. Image Anal. 2019, 53, 197–207. [Google Scholar] [CrossRef] [PubMed]
  27. Li, L.; Verma, M.; Nakashima, Y.; Nagahara, H.; Kawasaki, R. Iternet: Retinal image segmentation utilizing structural redundancy in vessel networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 3656–3665. [Google Scholar]
  28. Gu, Z.; Cheng, J.; Fu, H.; Zhou, K.; Hao, H.; Zhao, Y.; Zhang, T.; Gao, S.; Liu, J. CE-Net: Context Encoder Network for 2D Medical Image Segmentation. IEEE Trans. Med. Imaging 2019, 38, 2281–2292. [Google Scholar] [CrossRef]
  29. Guo, C.; Szemenyei, M.; Hu, Y.; Wang, W.; Zhou, W.; Yi, Y. Channel attention residual u-net for retinal vessel segmentation. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 1185–1189. [Google Scholar]
  30. Ibtehaz, N.; Rahman, M.S. MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 2020, 121, 74–87. [Google Scholar] [CrossRef]
  31. Guo, F.; Li, W.; Kuang, Z.; Tang, J. MES-Net: A new network for retinal image segmentation. Multimed. Tools Appl. 2021, 80, 14767–14788. [Google Scholar] [CrossRef]
  32. Li, D.; Peng, L.; Peng, S.; Xiao, H.; Zhang, Y. Retinal vessel segmentation by using AFNet. Vis. Comput. 2023, 39, 1929–1941. [Google Scholar] [CrossRef]
  33. Alvarado-Carrillo, D.E.; Dalmau-Cedeño, O.S. Width attention based convolutional neural network for retinal vessel segmentation. Expert Syst. Appl. 2022, 209, 118313. [Google Scholar] [CrossRef]
  34. Li, J.; Gao, G.; Yang, L.; Liu, Y. GDF-Net: A multi-task symmetrical network for retinal vessel segmentation. Biomed. Signal Process. Control 2023, 81, 104426. [Google Scholar] [CrossRef]
  35. Hu, X.; Wang, L.; Li, Y. HT-Net: A Hybrid Transformer Network for Fundus Vessel Segmentation. Sensors 2022, 22, 6782. [Google Scholar] [CrossRef]
  36. Wei, X.; Yang, K.; Bzdok, D.; Li, Y. Orientation and Context Entangled Network for Retinal Vessel Segmentation. Expert Syst. Appl. 2023, 217, 119443. [Google Scholar] [CrossRef]
  37. Zhang, H.; Zhong, X.; Li, Z.; Chen, Y.; Zhu, Z.; Lv, J.; Li, C.; Zhou, Y.; Li, G. TiM-Net: Transformer in M-Net for Retinal Vessel Segmentation. J. Healthc. Eng. 2022, 2022, 9016401. [Google Scholar] [CrossRef]
  38. Li, J.; Gao, G.; Liu, Y.; Yang, L. MAGF-Net: A multiscale attention-guided fusion network for retinal vessel segmentation. Measurement 2023, 206, 112316. [Google Scholar] [CrossRef]
  39. Zhang, J.; Dashtbozorg, B.; Bekkers, E.; Pluim, J.P.; Duits, R.; ter Haar Romeny, B.M. Robust retinal vessel segmentation via locally adaptive derivative frames in orientation scores. IEEE Trans. Med. Imaging 2016, 35, 2631–2644. [Google Scholar] [CrossRef] [PubMed]
  40. Yan, Z.; Yang, X.; Cheng, K.T. Joint segment-level and pixel-wise losses for deep learning based retinal vessel segmentation. IEEE Trans. Biomed. Eng. 2018, 65, 1912–1923. [Google Scholar] [CrossRef]
  41. Wu, Y.; Xia, Y.; Song, Y.; Zhang, Y.; Cai, W. NFN+: A novel network followed network for retinal vessel segmentation. Neural Netw. 2020, 126, 153–162. [Google Scholar] [CrossRef]
  42. Guo, S. DPN: Detail-preserving network with high resolution representation for efficient segmentation of retinal vessels. J. Ambient Intell. Humaniz. Comput. 2023, 14, 5689–5702. [Google Scholar] [CrossRef]
  43. Ye, Y.; Pan, C.; Wu, Y.; Wang, S.; Xia, Y. MFI-Net: Multiscale Feature Interaction Network for Retinal Vessel Segmentation. IEEE J. Biomed. Health Inform. 2022, 26, 4551–4562. [Google Scholar] [CrossRef]
  44. Yuan, Y.; Zhang, L.; Wang, L.; Huang, H. Multi-Level Attention Network for Retinal Vessel Segmentation. IEEE J. Biomed. Health Inform. 2022, 26, 312–323. [Google Scholar] [CrossRef]
  45. Li, Y.; Zhang, Y.; Liu, J.Y.; Wang, K.; Zhang, K.; Zhang, G.S.; Liao, X.F.; Yang, G. Global Transformer and Dual Local Attention Network via Deep-Shallow Hierarchical Feature Fusion for Retinal Vessel Segmentation. IEEE Trans. Cybern. 2023, 53, 5826–5839. [Google Scholar] [CrossRef] [PubMed]
  46. Tong, L.; Li, T.; Zhang, Q.; Zhang, Q.; Zhu, R.; Du, W.; Hu, P. LiViT-Net: A U-Net-like, lightweight Transformer network for retinal vessel segmentation. Comput. Struct. Biotechnol. J. 2024, 24, 213–224. [Google Scholar] [CrossRef]
Figure 2. The overall structure of the RGCTB.
Figure 3. The overall structure of the FSTB.
Figure 4. The process of pyramid average pooling.
Figure 5. The segmentation results of retinal vessel images from the three datasets. The first row shows the original images, the second row shows the ground truth labels, and the third row shows the segmentation results. The first two columns are from the DRIVE dataset, the middle two columns are from the CHASE dataset, and the last two columns are from the STARE dataset.
Figure 6. Comparisons of the retinal vessel segmentation results. From left to right, the images show the original image, the segmentation result of our proposed method, the segmentation result of CAR-UNet, the ground truth, and a magnified version of a selected region.
Figure 7. Comparisons of feature maps before and after the FSTB. From left to right, the original images, the labels, the feature maps before feature selection, and the feature maps after feature selection.
Figure 8. Visual comparisons of the vessel segmentation results between the RGCTB and the original convolutional blocks.
Figure 9. Qualitative visualization result of FS-UNet from different stages.
Table 1. Performance indicator comparison on the DRIVE, CHASE, and STARE datasets.

Methods | Year | DRIVE (SE / SP / ACC / AUC) | CHASE (SE / SP / ACC / AUC) | STARE (SE / SP / ACC / AUC)
UNet [25] | 2018 | 0.7822 / 0.9808 / 0.9555 / 0.9752 | 0.7841 / 0.9823 / 0.9643 / 0.9812 | 0.6681 / 0.9915 / 0.9639 / 0.9710
DUNet [25] | 2019 | 0.7963 / 0.9800 / 0.9566 / 0.9802 | 0.8155 / 0.9752 / 0.9610 / 0.9804 | 0.7595 / 0.9878 / 0.9641 / 0.9832
IterNet [26] | 2019 | 0.7735 / 0.9838 / 0.9573 / 0.9816 | 0.7970 / 0.9823 / 0.9736 / 0.9859 | 0.7715 / 0.9886 / 0.9701 / 0.9881
Attention UNet [27] | 2019 | 0.7663 / 0.9879 / 0.9685 / 0.9834 | 0.8185 / 0.9856 / 0.9750 / 0.9891 | 0.7553 / 0.9896 / 0.9718 / 0.9807
Ce-net [28] | 2019 | 0.7903 / 0.9769 / 0.9550 / 0.9780 | 0.8008 / 0.9723 / 0.9633 / 0.9797 | 0.7909 / 0.9721 / 0.9732 / 0.9597
CAR-UNet [29] | 2020 | 0.8135 / 0.9849 / 0.9699 / 0.9852 | 0.8439 / 0.9839 / 0.9751 / 0.9898 | 0.8445 / 0.9850 / 0.9743 / 0.9898
AACA-MLA-D-UNet [15] | 2021 | 0.8046 / 0.9805 / 0.9581 / 0.9827 | 0.8302 / 0.9891 / 0.9673 / 0.9810 | 0.7914 / 0.9870 / 0.9665 / 0.9824
Sine-Net [30] | 2021 | 0.8260 / 0.9824 / 0.9685 / 0.9852 | 0.7856 / 0.9845 / 0.9676 / 0.9828 | 0.6776 / 0.9946 / 0.9711 / 0.9807
MES-Net [31] | 2021 | 0.8221 / 0.9817 / 0.9667 / 0.9853 | 0.8198 / 0.9845 / 0.9697 / 0.9869 | 0.8210 / 0.9859 / 0.9724 / 0.9897
Bridge-Net [32] | 2022 | 0.7853 / 0.9818 / 0.9565 / 0.9834 | 0.8132 / 0.9840 / 0.9667 / 0.9893 | 0.8002 / 0.9864 / 0.9668 / 0.9901
WA-Net [33] | 2022 | 0.7966 / 0.9810 / 0.9575 / 0.9784 | 0.8042 / 0.9826 / 0.9653 / 0.9841 | 0.7834 / 0.9908 / 0.9752 / 0.9906
MC-UNet [14] | 2022 | 0.8100 / 0.9879 / 0.9678 / 0.9828 | 0.8366 / 0.9829 / 0.9714 / 0.9818 | 0.7360 / 0.9947 / 0.9572 / 0.9686
GDF-Net [34] | 2022 | 0.8291 / 0.9852 / 0.9622 / 0.9859 | 0.7856 / 0.9896 / 0.9660 / 0.9876 | 0.7616 / 0.9957 / 0.9653 / 0.9889
HT-Net [35] | 2022 | 0.8256 / 0.9839 / 0.9700 / 0.9872 | 0.8477 / 0.9834 / 0.9745 / 0.9898 | 0.8478 / 0.9870 / 0.9765 / 0.9928
OCE-Net [36] | 2022 | 0.8018 / 0.9826 / 0.9581 / 0.9821 | 0.8138 / 0.9824 / 0.9678 / 0.9872 | 0.8012 / 0.9865 / 0.9672 / 0.9876
Tim-Net [37] | 2022 | 0.7805 / 0.9816 / 0.9638 / 0.9682 | 0.7867 / 0.9880 / 0.9711 / 0.9670 | 0.7867 / 0.9880 / 0.9711 / 0.9670
MAGF-Net [38] | 2023 | 0.8262 / 0.9862 / 0.9578 / 0.9819 | 0.8328 / 0.9895 / 0.9677 / 0.9873 | 0.8093 / 0.9844 / 0.9649 / 0.9898
Ours | 2023 | 0.8071 / 0.9860 / 0.9701 / 0.9874 | 0.8342 / 0.9853 / 0.9758 / 0.9963 | 0.8380 / 0.9878 / 0.9762 / 0.9976
Table 2. Comparison of the results depending on different combinations of $L_{err}$, the RGCTB, and the FSTB on the DRIVE dataset.

SP | SE | ACC | AUC | MCC | F1-Score | IOU | MIOU
0.9965 | 0.6086 | 0.9624 | 0.9849 | 0.7399 | 0.7374 | 0.5863 | 0.7733
0.9926 | 0.7079 | 0.9675 | 0.9860 | 0.7820 | 0.7907 | 0.6554 | 0.8104
0.9861 | 0.7970 | 0.9693 | 0.9862 | 0.8037 | 0.8188 | 0.6936 | 0.8303
0.9876 | 0.7956 | 0.9696 | 0.9862 | 0.8039 | 0.8185 | 0.6931 | 0.8302
0.9853 | 0.8061 | 0.9694 | 0.9865 | 0.8056 | 0.8210 | 0.6967 | 0.8319
0.9860 | 0.8071 | 0.9701 | 0.9874 | 0.8096 | 0.8246 | 0.7019 | 0.8349
Table 3. Comparison of the results depending on the additional loss functions on DRIVE, CHASE, and STARE.

Dataset | Loss | SP | SE | ACC | AUC | MCC | F1-Score | IOU | MIOU
DRIVE | L_bce | 0.9876 | 0.7856 | 0.9696 | 0.9862 | 0.8039 | 0.8185 | 0.6931 | 0.8302
DRIVE | L_bce + L_err | 0.9860 | 0.8071 | 0.9701 | 0.9874 | 0.8096 | 0.8246 | 0.7019 | 0.8349
CHASE | L_bce | 0.9878 | 0.8061 | 0.9765 | 0.9960 | 0.8008 | 0.8126 | 0.6947 | 0.8294
CHASE | L_bce + L_err | 0.9853 | 0.8342 | 0.9758 | 0.9963 | 0.8012 | 0.8133 | 0.6856 | 0.8300
STARE | L_bce | 0.9897 | 0.8185 | 0.9764 | 0.9980 | 0.8275 | 0.8383 | 0.7238 | 0.8493
STARE | L_bce + L_err | 0.9878 | 0.8380 | 0.9762 | 0.9976 | 0.8288 | 0.8405 | 0.7267 | 0.8506
Table 4. Performance of six existing methods and the proposed method on the DRIVE and STARE datasets.

Validation Dataset | Training Dataset | Methods | Year | SE | SP | ACC | AUC
DRIVE | STARE | Yan [40] | 2018 | 0.7292 | 0.9815 | 0.9494 | 0.9599
DRIVE | STARE | Jin [25] | 2019 | 0.6505 | 0.9914 | 0.9481 | 0.9718
DRIVE | STARE | Wu [41] | 2020 | 0.7187 | 0.9881 | 0.9538 | 0.9761
DRIVE | STARE | Guo [42] | 2020 | 0.7410 | 0.9801 | 0.9499 | 0.9685
DRIVE | STARE | Ye [43] | 2021 | 0.7313 | 0.9867 | 0.9538 | 0.9762
DRIVE | STARE | Yuan [44] | 2022 | 0.7098 | 0.9846 | 0.9497 | 0.9731
DRIVE | STARE | Ours | 2023 | 0.7298 | 0.9890 | 0.9660 | 0.9785
DRIVE | HRF | Ours | 2023 | 0.6250 | 0.9848 | 0.9530 | 0.9700
STARE | DRIVE | Yan [40] | 2018 | 0.7211 | 0.9840 | 0.9569 | 0.9708
STARE | DRIVE | Jin [25] | 2019 | 0.7000 | 0.9759 | 0.9474 | 0.9571
STARE | DRIVE | Wu [41] | 2020 | 0.7378 | 0.9785 | 0.954 | 0.9635
STARE | DRIVE | Guo [42] | 2020 | 0.7100 | 0.9841 | 0.9558 | 0.9689
STARE | DRIVE | Ye [43] | 2021 | 0.7805 | 0.9741 | 0.9550 | 0.9747
STARE | DRIVE | Yuan [44] | 2022 | 0.8079 | 0.9732 | 0.9559 | 0.9735
STARE | DRIVE | Ours | 2023 | 0.8290 | 0.9744 | 0.9632 | 0.9784
STARE | HRF | Ours | 2023 | 0.6742 | 0.9927 | 0.9720 | 0.9754
Table 5. Comparison of the complexity between existing methods and our proposed method.

Methods | FLOPs (Billion) | Parameters (Million) | FPS
U-Net [25] | 54.9 | 7.8 | 15.0
DUNet [25] | 0.2 | 7.4 | 30.4
IterNet [26] | 194.4 | 13.6 | 5.2
Attention U-Net [27] | 296.2 | 34.8 | 3.3
Ce-net [28] | 35.6 | 29.0 | 18.7
MAGF-Net [38] | 4.0 | 34.6 | 24.2
OCE-Net [36] | 0.2 | 6.3 | 28.5
GT-DLA-dsHFF [45] | 473.9 | 26.0 | 2.3
LiVit-Net [46] | 71.1 | 6.9 | 12.1
Ours | 47.6 | 0.87 | 20.0
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Jiang, L.; Li, W.; Xiong, Z.; Yuan, G.; Huang, C.; Xu, W.; Zhou, L.; Qu, C.; Wang, Z.; Tong, Y. Retinal Vessel Segmentation Based on Self-Attention Feature Selection. Electronics 2024, 13, 3514. https://doi.org/10.3390/electronics13173514
