1. Introduction
Medical image segmentation plays a pivotal role in medical image processing and analysis, encompassing tasks such as brain segmentation, cell segmentation, lung segmentation, and retinal blood vessel segmentation. Precise segmentation facilitates accurate lesion localization, aiding doctors in formulating optimal treatment plans. With continuous advances in computer vision and artificial intelligence, medical image segmentation has found extensive application in ophthalmology, where retinal vessel segmentation in particular has attracted much attention. Retinal vessel segmentation technology enables the analysis of vascular structures in fundus images, providing an important reference for the early diagnosis and treatment of ophthalmic diseases.
Retinal vascular segmentation is of great importance to the diagnosis of ocular diseases. The observation of abnormal structures in retinal vascular segmentation images can help physicians detect signs of common eye diseases such as diabetic retinopathy, glaucoma, and macular degeneration in a timely manner [
1], leading to more timely interventions and therapeutic measures. Nonetheless, retinal vessel segmentation and visualization present challenges due to vessel size variability, intertwined branches, and complex structures [
2]. Consequently, there is a pressing need for automated and efficient retinal vessel segmentation methods to enhance ocular disease diagnosis and treatment while enabling the automated analysis of medical images.
Traditional retinal vessel segmentation methods typically include threshold segmentation algorithms, matched filtering algorithms, and machine learning-based approaches. The threshold segmentation method [
3] is straightforward but sensitive to factors like image noise and illumination changes, requiring the re-selection of thresholds for different images and applications. Matched filtering algorithms [
4] leverage 2D convolution operations with filters constructed from 2D Gaussian templates to capture blood vessel features. These algorithms are robust to image noise and illumination variations, but they require manually designed filters and parameters and struggle to handle complex situations such as vessel crossings and overlaps. Machine learning-based methods use manually crafted features (edges, texture, color, etc.) to discern vascular from non-vascular regions, which are then segmented by a classifier. Despite their potential, these approaches rely heavily on feature selection and extraction, demanding considerable human intervention and empirical knowledge, which limits their adaptability across various retinal image types.
Deep learning has witnessed a surge in popularity owing to enhanced computational capabilities, the availability of large-scale datasets, and algorithmic advancements. Convolutional neural networks (CNNs) have achieved remarkable success across various fields, including image classification [
5], object detection [
6], and image segmentation [
7]. In image segmentation, CNNs have significant advantages over traditional methods by automatically extracting features and learning complex representations with robust adaptability and generalization. Several image segmentation networks have been proposed, including SegNet [
8], UNet [
9], PSPNet [
10], and UNet++ [
11]. UNet introduces an encoder–decoder structure along with a skip-connection mechanism. UNet++ enhances UNet by incorporating a dense connection mechanism and constructing multi-layered skip connections. UNet and its improved variants integrate low-resolution information, such as object categories, with high-resolution information, such as edges and details, making them well suited to medical image segmentation tasks. UNet and its variants have been widely used in retinal blood vessel segmentation tasks [
12,
13,
14].
The above CNN models have achieved remarkable results, but the convolution operation captures only local spatial information and lacks robustness in capturing global contextual information. In fundus vessel segmentation tasks, distant pixels may exhibit correlations with local image structures. Because CNNs struggle to model global information and long-range interactions, they ignore the continuity and wholeness of blood vessels, which remains a major limitation in retinal vessel segmentation. To capture global contextual information, the transformer [
15] framework was subsequently proposed. Utilizing the self-attention mechanism [
16], transformers capture the correlations among all positions in a sequence and use them to compute the output, effectively handling long-range dependencies.
17]. However, the substantial parameter count and high computational demands of the ViT present great challenges. Swin Transformer [
18] introduced the windowed self-attention mechanism, partitioning the input image into fixed-size blocks for self-attention computation, thereby reducing computational complexity. Inspired by the U-shaped encoder–decoder structure of U-Net, Swin-UNet [
19] was developed and achieved notable success in medical image segmentation.
Swin-UNet, a semantic segmentation network based on the Swin Transformer, performs well on large-scale images and regular data. However, it encounters challenges when segmenting retinal blood vessel images, which are characterized by numerous small regions and dense boundaries. We summarize the problems of applying Swin-UNet to retinal blood vessel segmentation as follows: (1) During gradual downsampling and upsampling, as the size of the feature map decreases, the network forfeits shallow details that contain richer semantic features of small regions, particularly texture features at the boundaries of the target region. These details are crucial for segmentation, aiding in distinguishing the intersections of multiple region categories and enhancing segmentation quality. Once this information is lost, the network may be unable to fully recover boundary and contour details in the upsampling stage, even though skip connections integrate shallow features with deeper ones, leading to a decrease in segmentation ability. (2) The Swin-UNet loss function employs cross-entropy and Dice losses. While Dice loss suits class-imbalanced segmentation tasks, it is highly unstable for regions with small areas, notably the minuscule regions in retinal blood vessel images, which diminishes segmentation performance to a certain extent. Dice loss only quantifies discrepancies between predictions and ground truth in terms of pixel counts, disregarding specific boundary shapes and distributions. A retinal blood vessel image contains only foreground and background parts, and the former is very small. Inaccuracies in predicting even a few pixels in these small regions may trigger significant changes in the Dice loss, resulting in drastic gradient shifts and ultimately affecting model performance.
To address the above problems, we improve Swin-UNet and introduce TD Swin-UNet. Specifically, to tackle Swin-UNet's limited capability in localizing and segmenting boundaries, we propose a texture-driven retinal vessel segmentation method that improves boundary-wise perception. Previous studies have demonstrated that shallow semantic features, owing to their higher resolution, contain richer textures such as boundaries and contours, which substantially aid model performance. Consequently, we concatenate the outputs of multiple Swin Blocks in the network encoder and feed them into a Cross-level Texture Complementary Module (CTCM). This module amalgamates these feature maps to further augment the model’s ability to extract and represent semantic features such as boundaries and contours. Moreover, we enhance the Swin Block in the decoder and introduce the Pixel-wise Texture Swin Block, which heightens the model’s focus on the vicinity of region boundaries, thereby improving boundary localization and segmentation performance. To address the insensitivity of the Dice loss function to the specific shape and distribution of boundaries, we introduce a Hausdorff distance loss and refine it by incorporating a clip truncation operation to avoid the imbalance caused by differences in region size, ultimately enhancing the model's accuracy in segmenting blood vessel boundaries. Our proposed model demonstrates outstanding segmentation results on two datasets: DRIVE and CHASEDB1. The primary contributions of this paper are summarized as follows:
(1) We propose a texture-driven retinal vessel segmentation method that improves boundary-wise perception and contains two key enhancements. Firstly, we introduce the Cross-level Texture Complementary Module (CTCM) to fuse feature maps during the encoding process, helping the model focus on essential feature information in vessel images and recovering the shallow details lost during the downsampling stages. Additionally, we introduce the Pixel-wise Texture Swin Block (PT Swin Block) via the Pixel-wise Texture Highlighting Module (PTHM), which enhances the model’s capacity to perceive and recognize vessel boundary and contour information;
(2) We improve the loss function by introducing a Hausdorff distance loss tailored to the small target regions of blood vessels. Furthermore, we refine the Hausdorff loss by introducing hyperparameters to weight the different components of the loss function. This enhancement enables the model to better discern the subtle features of blood vessel structures and boundary information;
(3) We conducted experiments on two datasets: DRIVE and CHASEDB1. The experimental results show that our proposed network outperforms the existing methods on the retinal blood vessel segmentation task.
3. Methods
3.1. Overall Framework of the Proposed Network
To address the challenges of key information loss and inaccurate vessel boundary segmentation in Swin-UNet’s downsampling process, we propose TD Swin-UNet. The architecture of TD Swin-UNet is illustrated in
Figure 1. TD Swin-UNet comprises three key components: the encoder module, the decoder module, and the CTCM module.
In the encoder module, the input image is first divided into equal-sized image blocks by the Patch Partition module, which transforms the W × H × 3 image into a sequence of 48-dimensional patch vectors. These vectors are then projected to C dimensions by the Linear Embedding layer and fed into the Swin Block. The Swin Block, the core component of the Swin Transformer, incorporates the shifted window mechanism and allows each position to focus on local neighborhood information, thereby enhancing the capture of spatially local relationships in the image. The output feature map of each Swin Block is connected to the decoder via a skip connection. Simultaneously, it is downsampled by the Patch Merging module, which halves the image height and width while doubling the number of channels. The CTCM fuses feature maps of different scales in the encoder, recovering lost boundary contour information and improving the model’s accuracy in boundary segmentation. In the decoder module, we introduce a new PT Swin Block. Building upon the original Swin Block and patch expanding module, it enriches the feature maps with boundary and contour information through the PTHM, thereby enhancing the model’s ability to learn semantic information on both sides of the boundary. Finally, the segmentation result is obtained through a linear layer.
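To make the shape bookkeeping of the encoder concrete, the following PyTorch sketch illustrates the Patch Partition/Linear Embedding step and one Patch Merging downsampling step, assuming the standard Swin-UNet configuration (patch size 4, embedding dimension C = 96, 224 × 224 input). It illustrates the data flow described above and is not the authors' implementation.

```python
import torch
import torch.nn as nn

class PatchPartitionEmbed(nn.Module):
    """Patch Partition + Linear Embedding: (B, 3, H, W) -> (B, H/4 * W/4, C)."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A strided convolution is equivalent to splitting the image into 4x4
        # patches (48-dimensional vectors) and projecting them to C dimensions.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)   # (B, H/4 * W/4, C)

class PatchMerging(nn.Module):
    """Downsampling: halves height/width while doubling the channels."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        # Gather the four neighbours of each 2x2 block and concatenate channels.
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)
        return self.reduction(self.norm(x))   # (B, H/2 * W/2, 2C)

x = torch.randn(1, 3, 224, 224)
tokens = PatchPartitionEmbed()(x)             # (1, 3136, 96)
tokens = PatchMerging(96)(tokens, 56, 56)     # (1, 784, 192)
print(tokens.shape)
```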
3.2. Swin Transformer Block
The Swin Block serves as the basic component in the Swin Transformer, which improves the Multi-Head Self-Attention (MSA) mechanism used in the ViT. It achieves this improvement by employing a windowed attention mechanism and cross-layer local connections to reduce the number of parameters and computational complexity. Specifically, it introduces two variants of attention mechanisms: Window-based Multi-Head Self-Attention (W-MSA) and Shifted Window-based Multi-Head Self-Attention (SW-MSA). The basic structure of the Swin Block is depicted in
Figure 2. Each Swin Block comprises two Transformer blocks, with each Transformer block consisting of two LayerNorm (LN) layers and an MLP layer. The W-MSA and SW-MSA modules are applied to the front and back Transformer blocks, respectively. The expressions for these modules are shown below:
$$\hat{z}^{l} = \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1}, \qquad z^{l} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l}$$
$$\hat{z}^{l+1} = \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l}, \qquad z^{l+1} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1}$$
where $z^{l-1}$ denotes the output of the previous layer; $\hat{z}^{l}$ and $\hat{z}^{l+1}$ denote the outputs of the W-MSA and SW-MSA modules, respectively; and $z^{l}$ and $z^{l+1}$ represent the outputs of the two MLP modules of layers $l$ and $l+1$.
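The residual structure described by these equations can be sketched as follows. Note that the windowed and shifted-window attention are abstracted here as plain multi-head self-attention, so the window partitioning, cyclic shift, and masking of the actual W-MSA/SW-MSA modules are omitted; this is an illustration, not a full implementation.

```python
import torch
import torch.nn as nn

class SwinBlockPair(nn.Module):
    """Residual structure of one Swin Block (two successive transformer blocks).

    Sketch only: W-MSA / SW-MSA are replaced by standard multi-head
    self-attention for brevity.
    """
    def __init__(self, dim, num_heads=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.wmsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp1 = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                  nn.Linear(mlp_ratio * dim, dim))
        self.norm3 = nn.LayerNorm(dim)
        self.swmsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm4 = nn.LayerNorm(dim)
        self.mlp2 = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                  nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z):                          # z: (B, L, C), output of the previous layer
        h = self.norm1(z)
        z_hat = self.wmsa(h, h, h)[0] + z          # z_hat^l     = W-MSA(LN(z^{l-1})) + z^{l-1}
        z_l = self.mlp1(self.norm2(z_hat)) + z_hat # z^l         = MLP(LN(z_hat^l)) + z_hat^l
        h = self.norm3(z_l)
        z_hat2 = self.swmsa(h, h, h)[0] + z_l      # z_hat^{l+1} = SW-MSA(LN(z^l)) + z^l
        return self.mlp2(self.norm4(z_hat2)) + z_hat2  # z^{l+1}

tokens = torch.randn(1, 3136, 96)
print(SwinBlockPair(96)(tokens).shape)             # torch.Size([1, 3136, 96])
```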
3.3. Cross-Level Texture Complementary Module
The encoder of Swin-UNet contains multiple downsampling stages, yielding feature maps at three different scales through forward propagation. To augment the boundary localization and segmentation capabilities of our model, we pass these three feature maps to the Cross-level Texture Complementary Module (CTCM), which upsamples them to recover lost vessel boundary contour information and weights each channel of the feature maps to prioritize critical feature information, thereby enhancing segmentation accuracy.
The structure of CTCM is shown in
Figure 3. Firstly, we upsample the two lower-resolution feature maps by two and four times, respectively, to match the height and width of the highest-resolution one, and then align the number of channels to 4C via a 1 × 1 convolution. However, upsampling alone fails to recover the intricate texture gradually lost during downsampling. Therefore, we introduce two difference terms to recover the detail texture lost in the downsampling process, as shown in Equations (5) and (6).
where the superscripts denote fourfold and twofold upsampling, respectively.
To further account for the relative importance of the two difference terms, we introduce two hyperparameters to weight them and concatenate the two weighted feature maps. The dimensions are then reduced by a block consisting of convolution, the ReLU activation function, and average pooling. During this process, to avoid discarding the recovered texture details, we chose average pooling instead of max pooling. Finally, we combine this output with the original high-resolution feature map to obtain the output of the CTCM module, as illustrated in Equation (7).
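The following sketch illustrates one plausible reading of the CTCM data flow described above. The channel bookkeeping, the exact form of the difference terms, and the fusion head are assumptions made for illustration rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CTCM(nn.Module):
    """Cross-level Texture Complementary Module (illustrative sketch only).

    Takes three encoder feature maps in (B, C, H, W) layout from successive
    stages, upsamples the two deeper ones to the shallow resolution, forms
    weighted difference maps to recover texture lost during downsampling,
    and fuses them back into the shallow map.
    """
    def __init__(self, c_shallow=96, c_mid=192, c_deep=384, alpha=0.5, beta=0.5):
        super().__init__()
        self.alpha, self.beta = alpha, beta                 # weights on the two difference maps
        self.align1 = nn.Conv2d(c_shallow, c_deep, 1)       # align channels to 4C via 1x1 conv (assumed)
        self.align2 = nn.Conv2d(c_mid, c_deep, 1)
        self.fuse = nn.Sequential(                          # convolution + ReLU + average pooling
            nn.Conv2d(2 * c_deep, c_shallow, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.AvgPool2d(3, stride=1, padding=1),
        )

    def forward(self, f1, f2, f3):
        # f1: shallow (C, H, W); f2: (2C, H/2, W/2); f3: (4C, H/4, W/4).
        ref = self.align1(f1)
        u2 = self.align2(F.interpolate(f2, scale_factor=2, mode='bilinear', align_corners=False))
        u4 = F.interpolate(f3, scale_factor=4, mode='bilinear', align_corners=False)
        d2, d4 = u2 - ref, u4 - ref                         # assumed "difference" terms
        fused = self.fuse(torch.cat([self.alpha * d4, self.beta * d2], dim=1))
        return f1 + fused                                   # combine with the shallow map

f1 = torch.randn(1, 96, 56, 56)
f2 = torch.randn(1, 192, 28, 28)
f3 = torch.randn(1, 384, 14, 14)
print(CTCM()(f1, f2, f3).shape)                             # torch.Size([1, 96, 56, 56])
```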
3.4. Pixel-Wise Texture Swin Block
We enhanced the decoder structure of Swin-UNet, incorporating three newly designed Pixel-wise Texture Swin Blocks (PT Swin Blocks) as depicted in
Figure 4. The Pixel-wise Texture Highlighting Module (PTHM) was developed specifically for the retinal vessel segmentation task in this study; it aims to enhance the model’s ability to perceive the boundary and contour information of blood vessels, thereby improving the model’s semantic learning on both sides of the boundary and guiding fine-grained segmentation at a higher level. The module takes a feature map $F \in \mathbb{R}^{W \times H \times C}$ as input, where $W$ and $H$ denote the width and height of the feature map, respectively, and $C$ represents the number of channels. We first perform pixel normalization (PN) on $F$, and then convolve it with the Sobel operator in both the horizontal and vertical directions to extract the gradient around each pixel location, where larger gradients correspond to boundary regions with richer semantic information. The formulas are shown below:
$$G_{x} = S_{x} * \text{PN}(F), \qquad G_{y} = S_{y} * \text{PN}(F), \qquad Z = \sqrt{G_{x}^{2} + G_{y}^{2}}$$
where $S_{x}$ and $S_{y}$ are the horizontal and vertical Sobel kernels, $*$ denotes convolution, and $G_{x}$ and $G_{y}$ denote the corresponding convolution results. PN denotes pixel normalization, as represented by the following formula:
$$\text{PN}(F)_{x,y} = \frac{F_{x,y}}{\sqrt{\frac{1}{C}\sum_{j=1}^{C}\left(F_{x,y}^{j}\right)^{2} + \epsilon}}$$
where $C$ denotes the number of channels of the feature map $F$; $F_{x,y}$ denotes the value of the feature map at position $(x, y)$; $F_{x,y}^{j}$ denotes the value of the $j$th channel at position $(x, y)$; and $\epsilon$ is a small constant that prevents the denominator from being zero. The purpose of using PN here is twofold. On the one hand, it mitigates the bias introduced by absolute scale differences, ensuring that the features from each channel at the same pixel position are fully taken into account. On the other hand, PN preserves the original semantic relationships between pixels, maintaining the semantic and textural diversity among different localized regions.
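A minimal sketch of the pixel normalization step is shown below, assuming an RMS-style normalization over channels at each pixel; the exact form and the value of the stabilizing constant are assumptions.

```python
import torch

def pixel_norm(feat: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize each pixel's feature vector across the channel dimension.

    feat: (B, C, H, W). Sketch of the PN step described in the text; the
    exact normalization (RMS over channels) and eps value are assumptions.
    """
    # Root-mean-square over channels at every spatial position (x, y).
    rms = feat.pow(2).mean(dim=1, keepdim=True).sqrt()
    return feat / (rms + eps)

f = torch.randn(2, 96, 56, 56)
print(pixel_norm(f).shape)    # torch.Size([2, 96, 56, 56])
```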
For the feature map $Z$, we aimed for the model to concentrate on the texture of a small localized region near the boundary, because this region lies on the contour between categories and assists the model in accurate localization. At the same time, we sought to prevent the model from focusing excessively on this localized area. Therefore, we applied a 3 × 3 Gaussian blur to $Z$ and multiplied the blurred map with the feature map to obtain the weight $G$, as demonstrated in the following formula:
Finally, the final output is obtained by concatenating $V$ with the original feature map, followed by a 1 × 1 convolution to blend the textures across the channels. The formula is shown below:
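Putting these pieces together, the following sketch shows one way the PTHM pipeline described above could be realized (pixel normalization, horizontal and vertical Sobel convolutions, a 3 × 3 Gaussian blur used as a spatial weight, concatenation, and a 1 × 1 convolution). The kernel values, the weighting arithmetic, and the channel handling are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PTHM(nn.Module):
    """Pixel-wise Texture Highlighting Module (illustrative sketch)."""
    def __init__(self, channels):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gauss = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
        self.register_buffer('kx', sobel_x.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.register_buffer('ky', sobel_x.t().view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.register_buffer('kg', gauss.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)  # blend textures across channels
        self.channels = channels

    def forward(self, f):                              # f: (B, C, H, W)
        rms = f.pow(2).mean(dim=1, keepdim=True).sqrt()
        x = f / (rms + 1e-8)                           # pixel normalization (as in the previous sketch)
        gx = F.conv2d(x, self.kx, padding=1, groups=self.channels)   # horizontal Sobel
        gy = F.conv2d(x, self.ky, padding=1, groups=self.channels)   # vertical Sobel
        z = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)       # gradient magnitude: large near boundaries
        g = F.conv2d(z, self.kg, padding=1, groups=self.channels)    # 3x3 Gaussian blur as a weight
        v = f * g                                      # highlight texture around boundaries
        return self.mix(torch.cat([v, f], dim=1))      # concatenate and apply 1x1 convolution

print(PTHM(96)(torch.randn(1, 96, 56, 56)).shape)       # torch.Size([1, 96, 56, 56])
```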
3.5. Loss Function Improvement
Apart from the issue of losing boundary texture during the downsampling process, Swin-UNet's loss function is simply the sum of the cross-entropy loss and the Dice loss. The total loss is defined as follows:
$$\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_{Dice}$$
where $\mathcal{L}_{CE}$ denotes the cross-entropy loss and $\mathcal{L}_{Dice}$ denotes the Dice loss.
To address the insensitivity to the specific shape and boundary distribution of Dice loss, and to further enhance the accuracy of the model predictions for class boundaries in retinal vessel segmentation tasks, it is imperative to introduce a loss term in the loss function that is sensitive to the specific shape of the boundary.
From this perspective, cross-entropy (CE) loss appears to be a natural candidate, as it penalizes the misclassification of boundary pixels and thereby prompts the model to focus more on the accuracy of segmentation boundaries. However, although CE loss partially accounts for boundary information, it influences boundary accuracy only indirectly, through overall pixel classification, rather than by directly optimizing boundary characteristics. Its contribution to improving the model's ability to localize boundaries is therefore quite limited.
As a result, we introduce a Hausdorff distance loss term, denoted as $\mathcal{L}_{HD}$. Suppose that $P$ represents the model output and $G$ denotes the ground-truth labels. $\mathcal{L}_{HD}$ is defined as below:
$$\mathcal{L}_{HD} = d_{H}(P, G)$$
where $d_{H}(X, Y)$ represents the Hausdorff distance between two feature maps $X$ and $Y$.
Compared to CE loss, the Hausdorff distance directly measures the disparity between the predicted boundary and the true boundary, offering a more direct metric. The Hausdorff distance aids in capturing subtle differences in boundaries, particularly when pixel-level classification results are ambiguous or when dealing with fuzzy boundaries. It provides a more precise reflection of the distance between the predicted boundary and the true boundary. This is particularly critical in retinal vessel segmentation, where any subtle boundary changes can influence doctors’ assessment of ocular lesions. Therefore, the introduction of the Hausdorff distance enhances the model’s understanding of the specific shape and distribution characteristics of blood vessel boundaries, thereby further improving segmentation accuracy.
However, $\mathcal{L}_{HD}$ is significantly influenced by the areas of $X$ and $Y$. In the retinal vessel segmentation task, the areas of the vessel and background regions are often highly unbalanced. As a consequence, the loss term for large-area targets tends to be too large and that for small-area targets too small, degrading performance. To address this issue, we introduce a clip truncation operation that caps $\mathcal{L}_{HD}$ at a threshold, mitigating the imbalance caused by differences in area size.
Considering that the Hausdorff distance essentially measures the Euclidean distance between pixels, the clipping threshold must be designed in terms of this pixel-wise Euclidean distance to keep the distance metric consistent. Additionally, since images of different sizes give rise to different ranges of boundary distances, the threshold should scale with the input image size so that it fully accommodates images of different sizes. Moreover, given that medical image segmentation tasks typically involve several types of targets, and an increase in the number of categories leads to more complex boundary cases, the threshold should be inversely proportional to the number of categories. In this way, the threshold adapts to the size of the input image and the number of categories, allowing segmentation results to be effectively evaluated and optimized in various scenarios.
Based on the above considerations, we define the clipping threshold in terms of the height $H$ and width $W$ of the input image and the total number of categories $K$.
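As a reference, the following sketch evaluates a clipped Hausdorff term for binary masks using Euclidean distance transforms. It is a non-differentiable reference computation (a differentiable surrogate would be needed during training), and the exact clipping threshold used here (the image diagonal divided by the number of categories) follows the scaling rationale above but is our assumption, not the paper's definition.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def hausdorff_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Hausdorff distance (in pixels, Euclidean) between two binary masks."""
    if not pred.any() or not gt.any():
        return 0.0
    dist_to_gt = distance_transform_edt(~gt)      # distance of every pixel to the nearest gt pixel
    dist_to_pred = distance_transform_edt(~pred)
    forward = dist_to_gt[pred].max()              # max over predicted pixels of distance to gt
    backward = dist_to_pred[gt].max()             # max over gt pixels of distance to prediction
    return float(max(forward, backward))

def clipped_hausdorff_loss(pred: np.ndarray, gt: np.ndarray, num_classes: int = 2) -> float:
    """Hausdorff term with clip truncation (threshold form is an assumption)."""
    h, w = gt.shape
    # Threshold scales with image size and inversely with the number of categories.
    tau = np.sqrt(h ** 2 + w ** 2) / num_classes
    return min(hausdorff_distance(pred, gt), tau)

pred = np.zeros((64, 64), dtype=bool); pred[10:20, 10:20] = True
gt = np.zeros((64, 64), dtype=bool); gt[12:22, 12:22] = True
print(clipped_hausdorff_loss(pred, gt))
```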
In summary, our total loss is the weighted sum of the cross-entropy loss, the Dice loss, and the clipped Hausdorff loss, where the three hyperparameter weights $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are set to 10, 10, and 1, respectively.
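For completeness, a sketch of how the three terms could be combined is shown below. Here the Hausdorff term is assumed to be precomputed (for example, with the reference computation above); the assignment of the weights 10, 10, and 1 to the cross-entropy, Dice, and Hausdorff terms follows the order in which the terms are introduced and is our reading rather than a confirmed mapping.

```python
import torch
import torch.nn.functional as F

def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss for the foreground (vessel) channel; probs, target: (B, H, W)."""
    inter = (probs * target).sum(dim=(1, 2))
    union = probs.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def total_loss(logits, target, hd_term, w_ce=10.0, w_dice=10.0, w_hd=1.0):
    """Weighted sum of CE, Dice, and a precomputed clipped Hausdorff term.

    logits: (B, 2, H, W); target: (B, H, W) with {0, 1} labels.
    """
    ce = F.cross_entropy(logits, target)
    dice = dice_loss(torch.softmax(logits, dim=1)[:, 1], target.float())
    return w_ce * ce + w_dice * dice + w_hd * hd_term

logits = torch.randn(2, 2, 64, 64)
target = torch.randint(0, 2, (2, 64, 64))
print(total_loss(logits, target, hd_term=torch.tensor(0.5)))
```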
5. Results
5.1. Ablation Experiments
To further assess the contributions of the individual modules in TD Swin-UNet, we conducted ablation experiments on the DRIVE dataset. Swin-UNet served as the baseline, and four evaluation metrics (ACC, SE, SP, and F1) were recorded during the experiments.
Table 1 presents the results of seven sets of ablation experiments: baseline + CTCM, baseline + PTHM, baseline + Hausdorff_loss, baseline + CTCM + PTHM, baseline + PTHM + Hausdorff_loss, baseline + CTCM + Hausdorff_loss, and the overall improvement.
1. Efficacy of the Cross-level Texture Complementary Module
To further validate the efficacy of CTCM, we compared baseline + CTCM with the baseline alone. In comparison to the baseline, CTCM demonstrated improvements of 0.92%, 4.19%, 0.45%, and 3.77% in ACC, SE, SP, and F1, respectively. As the size of the feature maps decreased during downsampling, the baseline tended to lose shallow details, which contained richer semantic features of small regions and subtle blood vessel branching information. This led to challenges in effectively segmenting blood vessel edges and fine branches, resulting in relatively lower metrics. Incorporating CTCM enhanced the segmentation accuracy in both the blood vessel region (SE) and the background region (SP), leading to more precise recognition of blood vessels.
2. Efficacy of the Pixel-wise Texture Highlighting Module
To further confirm the effectiveness of PTHM, we integrated PTHM into the baseline. In comparison to the baseline alone, PTHM yielded improvements of 1.11%, 3.89%, 0.71%, and 4.32% in ACC, SE, SP, and F1, respectively. The incorporation of the improved PT Swin Block in the upsampling process enhanced the model’s ability to perceive the vessel boundary and contour information. The concurrent enhancement of SE and SP indicated that the enhancement module effectively learned semantic information on both sides of the boundary, leading to a significant improvement in the connectivity and wholeness of vessel segmentation.
3. Efficacy of Improved Hausdorff Loss
To address the insensitivity of the original loss function to specific boundary shapes and distributions, we further introduced improved Hausdorff loss into the baseline. The integration of Hausdorff loss resulted in improvements of 0.79%, 3.65%, 0.38%, and 3.24% in ACC, SE, SP, and F1, respectively. Hausdorff_loss more accurately reflected the distance between the predicted boundary and the real boundary, and strengthened the model’s ability to learn the specific boundary shape and distribution characteristics of the blood vessels. Consequently, the segmentation accuracy of the blood vessel region was significantly improved, leading to an overall enhancement in blood vessel segmentation effectiveness.
4. Efficacy of Overall Improvements
By integrating the above improvement modules, the final configuration of baseline + CTCM + PTHM + Hausdorff_loss achieved 96.64%, 84.49%, 98.37%, and 86.53% on ACC, SE, SP, and F1, respectively. Compared to baseline + PTHM + Hausdorff_loss, it exhibited a slight decrease of 0.74% in SE, but SP and F1 increased from 97.67% and 84.90% to 98.37% and 86.53%, resulting in an overall enhancement in vessel segmentation performance. In addition, compared to baseline + CTCM + Hausdorff_loss, the full configuration showed a marginal decrease of 0.2% in SP, yet SE and F1 improved by 2.04% and 0.6%, again leading to better overall blood vessel segmentation. Despite a slight reduction in background region segmentation accuracy, the accuracy of blood vessel segmentation and the overall performance were greatly improved. In conclusion, TD Swin-UNet effectively achieved accurate segmentation of complex vascular structures and exhibited high segmentation accuracy.
5.2. Visualization Results
Figure 7 and
Figure 8 depict the visualized segmentation results of the baseline and our proposed method on the DRIVE and CHASEDB1 datasets, respectively. These figures show the overall segmentation of the retinal vessel structures alongside locally zoomed-in views for a detailed examination of vessel segmentation. Both approaches effectively capture the main branches of the thicker blood vessels in the retinal images. However, retinal images also contain complex, intertwined fine vessel branches. Lacking a dedicated feature fusion mechanism and edge enhancement module, the baseline struggled to localize and segment these fine vessels, leading to imprecise vessel boundary delineation. In comparison, our proposed model exhibited enhanced boundary detection capabilities, enabling more accurate vessel boundary delineation. Notably, as highlighted by the green box, the baseline fared poorly on fine vessels, producing discontinuous vessel structures. In contrast, the proposed method effectively captured the detailed texture information at vessel boundaries, resulting in accurate and continuous segmentation. These visualization results demonstrate the superiority of our model in localizing and segmenting vessel boundaries, particularly in accurately segmenting fine blood vessels with complex structures.
5.3. Comparisons with Existing Methods
To further validate the superiority of TD Swin-UNet, we compared it with 13 retinal vessel segmentation methods proposed over the past ten years on the DRIVE and CHASEDB1 datasets. These methods include SegNet [
8], UNet [
9], Att-Unet [
33], UNet++ [
11], CE-Net [
34], AA-UNet [
20], Efficient BFCN [
35], PSP-UNet [
36], AMF-NET [
37], IterNet++ [
38], TiM-Net [
39], CAS-UNet [
40], and LMSA-Net [
41]. We conducted comparative experiments on the first five methods, employing identical training strategies and environments across all experiments. For the latter eight methods, the results are cited directly from the original publications because open-source code is not available.
Table 2 and
Table 3 present the comparison results on the DRIVE and CHASEDB1 datasets, with the experimental metrics including ACC, SE, SP, and F1, where “-” indicates that the experimental data for the item were not available in the original literature.
On the DRIVE dataset, TD Swin-UNet achieved the highest SE, SP, and F1, reaching 0.8479, 0.9837, and 0.8653, respectively. Despite LMSA-Net [
41] having a slightly higher ACC than our model (by 0.22%), TD Swin-UNet outperformed it with SE, SP, and F1 values that were higher by 1.71%, 0.16%, and 4.39%, respectively. The increase in SE and SP signifies the enhanced accuracy in retinal vessel identification. Although the improvement in SP was relatively small, our model had a significantly higher accuracy for SE in vessel region segmentation. The introduction of CTCM and PTHM restored lost boundary information during downsampling and effectively improved the model’s ability to perceive the boundary and contour information of the blood vessels, leading to more accurate segmentation and increased vascular connectivity and wholeness.
On the CHASEDB1 dataset, TD Swin-UNet achieved the highest ACC, SE, and F1, which were improved by 0.05%, 0.9%, and 1.25% respectively, compared with the maximum values of the other models. Although the SP of TD Swin-UNet (0.9867) was slightly lower than that of AMF-Net [
37] (0.9881), TiM-Net [
39] (0.9880), and CAS-UNet [
40] (0.9896), TD Swin-UNet achieved a significant improvement in the SE of the vessel region segmentation accuracy due to the attention and enhancement of the detailed texture features near the vessel boundary, resulting in the highest ACC (0.9756) and F1 (0.8515). Considering the substantial improvement in SE, the slight deficiency in SP became negligible. TD Swin-UNet demonstrated more accurate segmentation of blood vessels and background regions compared with the other methods, making it more suitable for clinical applications in medical imaging and showing promising prospects in various fields.
We also compared the visualization results of the proposed model with five other methods: SegNet [
8], UNet [
9], Att-Unet [
33], UNet++ [
11], and CE-Net [
34].
Figure 9 and
Figure 10 depict the visual comparisons on the DRIVE and CHASEDB1 datasets. While SegNet exhibited poor segmentation performance with notable background noise, UNet and Att-UNet achieved accurate segmentation of the major arteries and veins but struggled with finer vessel branches. UNet++ introduces a dense connection mechanism and deep supervision, which improved the segmentation of local vessels; however, it fell short in capturing global information, and its holistic vessel segmentation still needed improvement. CE-Net adds a contextual feature extraction module consisting of DAC and RMP blocks to UNet to fuse multi-scale contextual information, but it struggled to effectively capture the semantic information of the vessel structure, resulting in discontinuous vessel segmentation. In contrast, the proposed TD Swin-UNet effectively captured the long-range dependencies of blood vessels, leading to more connected vessel segmentation results. In addition, owing to the introduction of CTCM and PTHM, the proposed model was able to accurately segment the details at the vessel boundaries, yielding superior segmentation outcomes.