Article

Multi-Task Segmentation and Classification Network for Artery/Vein Classification in Retina Fundus

Department of Computer Science and Technology, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
* Author to whom correspondence should be addressed.
Entropy 2023, 25(8), 1148; https://doi.org/10.3390/e25081148
Submission received: 7 May 2023 / Revised: 15 July 2023 / Accepted: 25 July 2023 / Published: 31 July 2023
(This article belongs to the Section Multidisciplinary Applications)

Abstract
Automatic classification of arteries and veins (A/V) in fundus images has gained considerable attention from researchers due to its potential to detect vascular abnormalities and facilitate the diagnosis of some systemic diseases. However, the variability of vessel structures and the marginal distinction between arteries and veins pose challenges to accurate A/V classification. This paper proposes a novel Multi-task Segmentation and Classification Network (MSC-Net) that utilizes the vessel features extracted by a dedicated module to improve A/V classification and alleviate these limitations. The proposed method introduces three modules to enhance A/V classification: a Multi-scale Vessel Extraction (MVE) module, which distinguishes vessel pixels from background using vessel semantics; a Multi-structure A/V Extraction (MAE) module, which classifies arteries and veins by combining the original image with the vessel features produced by the MVE module; and a Multi-source Feature Integration (MFI) module, which merges the outputs of the former two modules to obtain the final A/V classification results. Extensive empirical experiments on several public datasets verify that the proposed MSC-Net outperforms state-of-the-art methods for retinal A/V classification.

1. Introduction

In the field of fundus image analysis, since the retinal vessels are the only internal human vascular structures that can be observed noninvasively, research on the segmentation and classification of retinal vessels plays an essential role in diagnosing ocular diseases such as glaucoma [1], which can lead to blindness. Moreover, many clinical studies have demonstrated that some systemic diseases can lead to irreversible changes in the morphology of the fundus vasculature. For instance, diabetes may cause the growth of many new blood vessels that do not develop properly and can leak easily [2]; systemic arteriolar stenosis and arteriovenous nicking may be caused by long-term hypertension [3]; and the growth of unwanted, abnormal blood vessels in the ordinarily avascular macular area is attributed to Age-related Macular Degeneration (AMD) [4]. Experienced physicians can predict and diagnose some diseases from these changes, and the different states of the arteries and veins may contribute to this process. Therefore, high-quality vessel segmentation and A/V classification results can help speed up the diagnosis of some diseases. Nevertheless, manual annotation of fundus images is time-consuming and labor-intensive, making automatic segmentation and classification methods highly desirable in clinical practice.
Automatic retinal A/V classification has been extensively studied with the development of deep learning algorithms. These methods can be roughly divided into two categories: convolution-based and attention-involved algorithms. In the past decade, convolution-based methods have achieved tremendous success by using a series of feature extractors to obtain hierarchical information. However, due to the limited receptive field of pure convolution, these methods perform poorly on target structures with large inter-patient variation, such as vessels of different shapes and sizes. Attention-involved methods, thanks to the powerful global context modeling of the weight matrix, demonstrate superior performance in improving the overall cohesiveness of the segmented target. Nevertheless, because they capture short-range dependencies poorly, these methods encounter difficulties in feature extraction at vessel intersection areas.
Although many methods have recently been proposed to mitigate these shortcomings, A/V classification remains a challenging task. First, due to the complex environment of the human eyeball, the vessels exhibit highly variable structures and morphology. This challenge is compounded by the high degree of vessel tortuosity [5], which makes the related problems even more difficult to handle. Second, the tiny peripheral differences between arteries and veins make their classification more challenging, especially at the vessel ends, where the differences are even smaller.
To alleviate the above problems, we propose a novel end-to-end deep learning method, called the Multi-task Segmentation and Classification Network (MSC-Net), for A/V classification, which splits the whole A/V classification task into three subtasks: segmenting the vessels, classifying the arteries and veins, and fusing the features. First, a Multi-scale Vessel Extraction (MVE) module is proposed to solve the first subtask, vessel segmentation. To enlarge the receptive field of the convolution-based method, we use groups of tiny-kernel convolution blocks to replace the single convolution layers in the U-like architecture, so that the outputs of the proposed module obtain information from all stages. This enables the module to extract detailed vessel features efficiently. Then, a novel attention-based Multi-structure A/V Extraction (MAE) module is proposed to address the second subtask. Owing to its powerful capability of extracting long-range dependencies between pixels, this module can easily capture information about differences between pixels, which is highly advantageous for accurately distinguishing arteries from veins. In addition, we combine transformer-like and U-like architectures to obtain features at different scales. To reduce the computational cost, we adjust the strategy of obtaining global contextual attention, avoiding the quadratic complexity that would arise from the attention matrix operations. Finally, a newly designed Multi-source Feature Integration (MFI) module is proposed to complete the final subtask. Compared with traditional feature fusion modules, the proposed module captures detailed vessel information to refine the A/V features. The vessel and A/V features come from different sources and have different characteristic distributions; with the help of this module, the network suppresses background-prone features and pays more attention to vessel features. In summary, the main contributions are as follows:
  • We propose a multi-task segmentation and classification network (MSC-Net) for artery and vein classification in the retina fundus. The highly optimized Python implementations of our method will be released at: https://github.com/chenchouyu/MSC-Net (accessed on 7 May 2023).
  • A novel A/V classification strategy is proposed, which splits the whole task into three parts: extraction of vessel features, classification of arteries and veins, and fine-tuning. Three modules are proposed to handle the corresponding subtasks.
  • Extensive and comprehensive experimental results on several public datasets demonstrate the validity of the novel strategy and of MSC-Net for accurate A/V separation, as well as its superior performance over existing state-of-the-art methods.
The remainder of this paper is structured as follows: related works are reviewed in Section 2; the detailed description of our network is presented in Section 3; the experimental results, which include the ablation study and comparison experiments, are summarized in Section 4; and finally, the conclusion is presented in Section 5.

2. Related Work

Presently, a large number of deep learning networks for retinal A/V classification have been proposed. These methods can be divided into two categories: convolution-based networks and attention-involved methods. The first category can be further divided into hybrid and pure methods, depending on whether structural connectivity or topology is used. The second category, according to the composition of the attention module, can be divided into traditional methods and transformer-based methods.

2.1. Convolution-Based Methods

Since the proposal of U-Net [6] in 2015, pure-convolution methods have shown superior capability in medical image processing. To alleviate the impact of the limited receptive field of the convolution operation, dilated convolution is applied in [7]. This network uses atrous convolutions with different dilation rates to generate multi-scale feature maps, which are combined to better capture contextual information at different scales. Jin et al. [8] proposed DUNet, which uses deformable convolution blocks [9] to construct a U-like architecture, demonstrating that the network can capture vessels of various shapes and scales through deformable receptive fields; this property makes it more suitable for segmentation tasks with irregular structures. Moreover, the inception architecture [10] confirms that a wider structure can provide a larger receptive field. Building on this property, AV-Net [11] is a novel network based on the inception architecture that shows a strong capability for extracting detailed features. Galdran et al. [12] formulate the classification of arterial and venous vessels as a four-class semantic segmentation problem, allowing the network to classify vessels as background, artery, vein, or uncertain by considering the inherent uncertainties of the task. Xu et al. [13] further improved A/V classification by introducing a modified fully convolutional network architecture for artery and vein segmentation.
Furthermore, some researchers perform A/V classification in the retina fundus by introducing structural information into pure convolutional networks to alleviate information loss and enlarge the receptive field; these are called hybrid methods. Mishra et al. [14] proposed a method that transforms the vessel features from the image domain into a graph representation, utilizing the vessel topology to enhance artery/vein classification. In [15], the retinal vessel skeleton, which represents the direction of vessels, is extracted from different color spaces and utilized to alleviate disconnections in the result. Estrada et al. [16] combine a tree topology estimation framework with domain-specific features to construct a global likelihood model that is highly effective in classifying A/V content. Zhao et al. [17] developed an approach that adapts the concept of dominant set clustering to address retinal blood vessel topology estimation and A/V classification.
Despite the impressive performance achieved by convolution-based networks in retinal vessel segmentation and A/V classification, the local and limited receptive field of convolution remains a significant challenge for semantic segmentation. To address this issue, the MVE module was proposed. This module replaces the single convolution layer in the U-like architecture with a sequence of small convolution layers, thereby enabling each kernel to extract information from all the feature maps and expand the receptive field of the network.

2.2. Attention-Involved Methods

According to the composition of the attention module, attention-involved methods can be divided into traditional methods and transformer-based methods. SE-Net [18] proposes a novel module to exploit channel dependencies: the SE block squeezes the feature maps along the spatial dimension and expands the channel attention into a global weight matrix. Inspired by the SE block, Li et al. [19] proposed the SK block, which employs an inception-like architecture to fuse feature maps from multiple branches and obtain a detailed channel attention matrix. Fu et al. proposed a dual attention mechanism that utilizes both spatial attention and channel attention to fuse the final attention representation, achieving scene segmentation by manipulating different levels of features. In another work [20], the authors proposed the Bottleneck Attention Module (BAM), which integrates global and local information via a bottleneck structure and a channel attention mechanism; BAM automatically learns to emphasize informative features and suppress redundant ones through adaptive feature recalibration.
Since the proposal of the Transformer [21] in 2017, originally designed for natural language processing, more and more transformer-like architectures have been presented for image processing. For instance, the Vision Transformer (ViT) [22] is the first work that uses a transformer-like architecture to replace standard convolutions in deep neural networks; it processes image 'patches' to predict categories. Thanks to its robust global context extraction, the network can easily establish long-range dependencies between pixels, which benefits downstream tasks such as classification and segmentation. To reduce the computational cost, MixFormer [23] devises an asymmetric attention scheme that handles multiple target templates. To accelerate the network, Li et al. [24] introduced the Contextual Transformer (CoT) block for visual recognition, which fully capitalizes on the contextual information among input keys to guide the learning of a dynamic attention matrix. Liu et al. presented the Shifted-window (Swin) Transformer [25], which uses shifted windows along the spatial dimension to model global and boundary features, so that the feature maps Q, K, and V are computed at a smaller size than in ViT. Although the transformer offers a global receptive field, its expensive computation may limit its application in medical image processing tasks. To tackle this issue, the MAE module is proposed in this work. This module combines transformer-like blocks with a U-like architecture to generate feature maps at different scales while controlling the computational cost.

3. Multi-Task Segmentation and Classification Network (MSC-Net)

To alleviate the challenges in A/V classification, we propose a novel Multi-task Segmentation and Classification Network (MSC-Net). The process of MSC-Net is illustrated in Figure 1.
The MSC-Net consists of three modules: the Multi-scale Vessel Extraction (MVE) module, the Multi-structure A/V Extraction (MAE) module, and the Multi-source Feature Integration (MFI) module. First, in the MVE module, the input image is processed by a fully convolutional network; the semantic and detailed structural information of the vessels is extracted by groups of tiny-kernel convolution layers and reorganized to establish a more accurate tree-like vessel map. Then, in the MAE module, the concatenation of the original image and the output of the MVE module is handled by a hybrid architecture; the added attention operations give this module a better long-range modeling capability, making it more robust to minor differences. Finally, the A/V feature maps extracted by the MAE module and the vessel features extracted by the MVE module are fed to the MFI module, which fuses the detailed information from both to obtain the final classification results, showing that re-introducing the vessel information benefits A/V classification. The novel modules of MSC-Net are described in detail in the following sections.
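For orientation, the following sketch shows how the three modules could be wired together in PyTorch; the class and argument names are illustrative placeholders under these assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MSCNet(nn.Module):
    """Illustrative wiring of the MVE, MAE, and MFI modules described above."""
    def __init__(self, mve: nn.Module, mae: nn.Module, mfi: nn.Module):
        super().__init__()
        self.mve = mve  # Multi-scale Vessel Extraction: image -> vessel probability map
        self.mae = mae  # Multi-structure A/V Extraction: [image, vessel] -> A/V features
        self.mfi = mfi  # Multi-source Feature Integration: fuse vessel detail into A/V result

    def forward(self, image: torch.Tensor):
        vessel = self.mve(image)                   # B x 1 x H x W vessel map
        av_input = torch.cat([image, vessel], 1)   # concatenate image with vessel map
        av_feat = self.mae(av_input)               # A/V feature maps
        av_out = self.mfi(av_feat, vessel)         # final A/V classification
        return vessel, av_out                      # both outputs are supervised by the loss
```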

3.1. The Multi-Scale Vessel Extraction (MVE) Module

The task of A/V classification is divided into three subtasks, including vessel segmentation, A/V classification, and results fine-tuning. We introduce the MVE module to address the first subtask of A/V classification. The MVE module is established on a U-like architecture, using specially adjusted convolution blocks to obtain the multi-scale feature representations. Detailed information about the MVE module is shown in Figure 2.
As a U-like architecture, the MVE module consists of a symmetric encoder and decoder, and the output of each encoder stage is transferred to the corresponding decoder stage. In the encoder of this module, we utilize a Res2Net-like structure to replace the single convolution blocks that serve as the encoder stages in the original U-Net. The detailed structure of the Res2Net-like block is shown in the right subfigure of Figure 2. Inspired by the Res2Net [26] architecture, we utilize a group of small convolution blocks, denoted F_i(·), to process the feature maps extracted by a convolution layer with a kernel size of 1 × 1. The feature map x is divided into several subsets x_i, where i is the sequence number of a subset, each representing a different group of channels. Subsequently, except for the first subset, each subset x_i is concatenated with the output y_{i-1} of the small convolution block F_{i-1}(·) that processes the previous subset x_{i-1}. The concatenation is then passed to F_i(·) to obtain y_i, which represents the detailed features extracted from the corresponding channels. This process can be described as follows:
$$ y_i = \begin{cases} x_i, & \text{if } i = 1 \\ F_i([x_i;\, y_{i-1}]), & \text{if } 1 < i \le n \end{cases} $$
Through this approach, each F_i(·) can obtain information from all preceding feature subsets, so the new feature extraction strategy obtains a larger receptive field than a single convolution. In this module, the feature maps produced by the 1 × 1 convolution layer are split into four subsets. With this connection pattern, the parameter count of each block is reduced by approximately a factor of four.
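A minimal PyTorch sketch of such a Res2Net-like split-and-connect block is given below; the channel widths, normalization, and activation choices are assumptions for illustration rather than the released implementation.

```python
import torch
import torch.nn as nn

class MultiScaleExtract(nn.Module):
    """Res2Net-style split-and-connect block following the recurrence above.

    The 1x1 conv output is split into `scales` channel subsets; each subset after
    the first is concatenated with the previous branch output and refined by a
    small 3x3 conv, so later branches see an increasingly large receptive field.
    """
    def __init__(self, in_ch: int, out_ch: int, scales: int = 4):
        super().__init__()
        assert out_ch % scales == 0
        self.scales = scales
        w = out_ch // scales
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        # F_i: small-kernel convs, each taking the concatenation [x_i ; y_{i-1}] (2*w channels)
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(2 * w, w, kernel_size=3, padding=1),
                          nn.BatchNorm2d(w), nn.ReLU(inplace=True))
            for _ in range(scales - 1)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xs = torch.chunk(self.conv1(x), self.scales, dim=1)   # x_1 .. x_n
        ys = [xs[0]]                                          # y_1 = x_1
        for i in range(1, self.scales):
            cat = torch.cat([xs[i], ys[-1]], dim=1)           # [x_i ; y_{i-1}]
            ys.append(self.branches[i - 1](cat))              # y_i = F_i(...)
        return torch.cat(ys, dim=1)                           # reassemble all channel groups
```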

3.2. The Multi-Structure A/V Extraction (MAE) Module

Obtaining the characteristic expression of arteries and veins (A/V) is more challenging than extracting vessel structural features, so a more robust feature extraction capability is required. In computer vision, the attention mechanism is valued for the solid global modeling ability it provides, and as a result more and more attention-based networks have been proposed. One such network is the Vision Transformer (ViT), introduced in 2021, which can efficiently learn the global dependencies between structures, as shown in Figure 3. In this method, the embedded feature maps are divided into three groups, called keys (K), queries (Q), and values (V), by fully connected layers. The output of the attention mechanism is then calculated as (2), where K^T denotes the transpose of the keys (K) and d_k denotes the dimension of the keys (K) and queries (Q).
$$ \mathrm{Output} = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V $$
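For reference, this standard scaled dot-product attention can be written in a few lines; the (B, N, d_k) tensor layout is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """softmax(Q K^T / sqrt(d_k)) V for tensors of shape (B, N, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (B, N, N): quadratic in the number of tokens N
    return F.softmax(scores, dim=-1) @ v
```

The N × N score matrix is the source of the quadratic cost discussed next.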
However, due to its high computational cost and its lack of multi-scale feature inputs, ViT is unsuitable for pixel-level classification. To address these limitations and obtain a more reasonable A/V representation, we introduce the Multi-structure A/V Extraction (MAE) module, which consists of transformer-like and convolution blocks that efficiently extract A/V feature maps. The detailed architecture of the MAE module is shown in Figure 4. By adding a U-like structure, the transformer-like blocks can obtain multi-scale features, which is beneficial for extracting targets with large scale differences. Additionally, we replace the matrix multiplication between Q and K with concatenation operations, avoiding the quadratic complexity problem.
As shown in the right subfigure of Figure 4, the input of the transformer-like block is first processed by a convolution layer with a kernel size of 3 × 3, resulting in the output X. The value (V) is then computed via W_v, just as in ViT. However, instead of computing the queries (Q) and keys (K) with extra convolution layers (W_q, W_k) as in a typical transformer-like block, we directly take X as Q and K. Next, a convolution layer with a large kernel size processes K to obtain the static feature representation of the input, which reflects the information among the local neighbors of K. We then employ two convolution layers with a kernel size of 1 × 1 and a softmax operation to process the concatenation of the static representation and Q, obtaining the weight matrix used to compute the dynamic feature representation via multiplication between V and the weight matrix. The output of this transformer-like block is the sum of the static and dynamic representations, as described by Equation (3).
$$ \mathrm{Output} = K_s + V \circledast \mathrm{softmax}\!\left([K_s, Q]\, W_\theta\, W_\delta\right) $$
where ⊛ refers to the local matrix multiplication operation, and W_θ and W_δ refer to the two 1 × 1 convolution layers, of which the first is activated by the ReLU function and the second is not activated. K_s refers to the static context representation. With the addition of this optimized transformer, the output of each small encoding block obtains an enlarged receptive field compared with the pure convolution block in U-Net.
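A simplified PyTorch sketch of such a block is shown below. The kernel sizes and the depthwise choice for the large-kernel convolution are assumptions, and the local matrix multiplication of Equation (3) is reduced here to a per-pixel channel weighting for brevity, so this is an illustration of the static-plus-dynamic context idea rather than the exact released implementation.

```python
import torch
import torch.nn as nn

class ContextualAttentionBlock(nn.Module):
    """Sketch of the transformer-like MAE block: static context plus dynamic context.

    Q and K are taken directly from the 3x3-conv output X (no extra W_q/W_k);
    a large-kernel (here depthwise) conv on K gives the static context K_s, and
    two 1x1 convs over [K_s ; Q] give the weights that modulate V.
    """
    def __init__(self, channels: int, large_kernel: int = 7):
        super().__init__()
        self.embed = nn.Conv2d(channels, channels, 3, padding=1)
        self.w_v = nn.Conv2d(channels, channels, 1)
        self.static = nn.Conv2d(channels, channels, large_kernel,
                                padding=large_kernel // 2, groups=channels)
        self.w_theta = nn.Conv2d(2 * channels, channels, 1)   # activated by ReLU
        self.w_delta = nn.Conv2d(channels, channels, 1)       # no activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.embed(x)                    # X: shared source of Q and K
        q, k = x, x
        v = self.w_v(x)
        k_s = self.static(k)                 # static context among local neighbors of K
        w = self.w_delta(torch.relu(self.w_theta(torch.cat([k_s, q], dim=1))))
        attn = torch.softmax(w, dim=1)       # per-pixel weights (simplified attention)
        return k_s + v * attn                # output = static + dynamic representation
```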

3.3. The Multi-Source Feature Integration (MFI) Module

Because the structural features of the vessels are fused into the pipeline, the performance of A/V classification heavily depends on the accurate extraction of vessel features; incorrect or fractured vessel features therefore harm the coherence of the A/V results, which is a limitation of our method. To alleviate this challenge, we propose a novel feature fusion module named the Multi-source Feature Integration (MFI) module, in which the detailed vessel features obtained by the MVE module are utilized to improve the A/V classification results.
In the vessel probability map produced before the final activation, the pixels of thick vessels usually obtain a probability close to 1 and background pixels a probability close to 0, while boundary and micro-vessel pixels lie around 0.5. To enhance the expression of these pixels, we propose a specific activation function, defined as (4). As shown in Figure 1, the vessel maps produced by the MVE module are first processed by this activation function; the activated vessel features, which carry more detailed information on vascular boundaries and microvascular areas, are then employed to refine the A/V classification results produced by the MAE module.
$$ F(x) = e^{-|x - 0.5|} - e^{-0.5} $$
where x refers to the probability map of the vessel segmentation, which is produced by the MVE module.
After processing by this activation function, the capillary and vascular-boundary pixels, whose probability is close to 0.5 in the feature maps [5], are raised to 1 − e^{-0.5}, while background and thick-vessel pixels decline towards 0. Figure 5 shows the results produced by the activation function. As illustrated in the enlarged window, the pixels representing the capillaries and the edges of the thick vessels are preserved. By utilizing the enhanced micro-vessel feature map to improve A/V classification, the impact of imperfect vessel segmentation is alleviated.
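The activation of Equation (4) is easy to express directly; the sketch below assumes the vessel map is a tensor of probabilities in [0, 1], and how the enhanced map is combined with the A/V features (e.g., as a multiplicative weight) is left as an assumption.

```python
import math
import torch

def boundary_emphasis(vessel_prob: torch.Tensor) -> torch.Tensor:
    """F(x) = exp(-|x - 0.5|) - exp(-0.5), as in Equation (4).

    Pixels with probability near 0.5 (vessel boundaries, capillaries) are raised
    towards 1 - e^{-0.5} (about 0.39), while confident background (x ~ 0) and
    thick-vessel (x ~ 1) pixels are pushed towards 0.
    """
    return torch.exp(-torch.abs(vessel_prob - 0.5)) - math.exp(-0.5)
```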

3.4. Loss Function

MSC-Net is constrained by a loss function that combines a vessel segmentation term (L_v) and an A/V classification term (L_a). Based on the binary cross-entropy and the Dice loss [27], L_v is quantified as follows:
$$ L_v = L_v^{bce} + 0.1\, L_v^{dice} $$
where L_v^{bce} refers to the binary cross-entropy loss and L_v^{dice} represents the Dice loss for vessel segmentation, defined as:
$$ L_v^{bce} = -\sum_i \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right] $$
$$ L_v^{dice} = 1 - \frac{2\sum_i y_i p_i + \epsilon}{\sum_i y_i + \sum_i p_i + \epsilon} $$
where y_i refers to the label of the i-th pixel, p_i refers to its prediction, and ϵ is a smoothing factor set to 10^{-6}. Since A/V classification is a multi-class task, L_a is quantified by the widely used cross-entropy function, as shown in (8):
$$ L_a = -\sum_i \sum_{c=1}^{N} y_i^c \log(p_i^c) $$
where N is the number of categories, c refers to the index of a certain category, y_i^c indicates whether the i-th pixel belongs to the c-th category, and p_i^c refers to the prediction produced by the network.
The total loss is defined as
$$ Loss = L_v + L_a $$
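A minimal sketch of this combined loss in PyTorch is given below; it assumes the vessel branch outputs raw logits (hence the sigmoid) and the A/V branch outputs per-class logits, which are assumptions about the interfaces rather than the released code.

```python
import torch
import torch.nn.functional as F

def msc_loss(vessel_logits, vessel_target, av_logits, av_target, eps: float = 1e-6):
    """Total loss: L = L_v + L_a, with L_v = BCE + 0.1 * Dice and L_a = cross-entropy.

    vessel_logits, vessel_target: (B, 1, H, W), target in {0, 1} as float;
    av_logits: (B, N, H, W) for N classes; av_target: (B, H, W) integer labels.
    """
    p = torch.sigmoid(vessel_logits)
    l_bce = F.binary_cross_entropy(p, vessel_target)
    intersection = (p * vessel_target).sum()
    l_dice = 1 - (2 * intersection + eps) / (p.sum() + vessel_target.sum() + eps)
    l_v = l_bce + 0.1 * l_dice
    l_a = F.cross_entropy(av_logits, av_target)   # multi-class cross-entropy for A/V
    return l_v + l_a
```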

4. Experimental Result

4.1. Datasets

To validate the performance of the methods, we employ four public datasets for experiments. Figure 6 shows some samples from the datasets.

4.1.1. DRIVE

The DRIVE (Digital Retinal Images for Vessel Extraction) dataset [28] is a classic retinal vessel segmentation dataset consisting of 40 images with a resolution of 584 × 565. In the original dataset, each image has a binary vessel segmentation label in which the vessel pixels are marked in white. Based on [29], the vessel pixels are further divided into three groups referring to arteries, veins, and uncertain regions, labeled in red, blue, and green, respectively.

4.1.2. HRF

The HRF (High-Resolution Fundus) dataset contains 45 images with a high resolution of 3504 × 2336 [30]. The images belong to three categories: healthy, glaucomatous, and diabetic patients. Each image has a binary vessel segmentation label and an artery/vein classification label. This study selects one-third of each category as training data, while the remainder is used for testing and validation [5]; in other words, 15 images are used for training and 30 images for testing and validation. Before training, all the images are resized to 1536 × 1024 pixels.

4.1.3. LES

The LES dataset includes 22 optic disc-centered images with a 30° field of view (FOV) and a resolution of 1620 × 1444 pixels [31]. Each image has an independent A/V classification label and a vessel segmentation label. Following [5], 11 images are selected as training data, and the rest are used for testing and validation. To better train and validate the model, all the images are resized to 1024 × 1024 pixels.

4.1.4. INSPIRE

The INSPIRE (Iowa Normative Set for Processing Images of the Retina) dataset contains 40 optic disc-centered images with a FOV of 30° and a resolution of 2392 × 2048 pixels [32]. No pixel-wise vessel segmentation ground truth is available for the INSPIRE-AVR dataset; only the A/V classification of vessel centerlines is provided, so it is used to explore the generalization of MSC-Net. Before testing, all the images are resized to 1024 × 1024 pixels.

4.2. Preprocessing and Data Augmentation

Before training, several augmentation methods are utilized to increase the amount of training data and to alleviate overfitting of the network. As a binary segmentation task, vessel segmentation needs more structural information about the vessels; therefore, we use a single channel, the green channel of the original RGB image, which carries the most detailed vessel structure. Second, Contrast Limited Adaptive Histogram Equalization (CLAHE) [33] is applied to the green channel to reduce the interference of color on blood vessel segmentation. The outputs of these operations are shown in Figure 7. For A/V classification, the original image itself is used as the input.
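The green-channel extraction and CLAHE step can be sketched with OpenCV as follows; the clip limit and tile size are illustrative values, not necessarily those used in the paper.

```python
import cv2
import numpy as np

def preprocess_for_vessels(rgb: np.ndarray) -> np.ndarray:
    """Extract the green channel of an H x W x 3 uint8 fundus image and apply CLAHE."""
    green = rgb[:, :, 1]                                   # green channel carries the clearest vessel contrast
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(green)                              # contrast-limited equalized single-channel image
```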
Then, 256 × 256 pixel patches are randomly cropped from the training images, and random rotation and Gaussian noise are applied to these patches. For testing, each image is cropped into patches in order, and the vessel and A/V predictions are obtained by stitching together the results of all patches belonging to one image.
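A possible patch extraction and re-stitching routine is sketched below; it assumes the image dimensions are divisible by the patch size (padding or overlap handling at the borders would otherwise be needed and is not specified here).

```python
import numpy as np

def crop_patches(img: np.ndarray, size: int = 256):
    """Cut an H x W (x C) array into ordered, non-overlapping size x size patches."""
    h, w = img.shape[:2]
    return [img[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

def stitch_patches(patches, h: int, w: int, size: int = 256) -> np.ndarray:
    """Reassemble ordered patches (as produced above) into one H x W prediction map."""
    out = np.zeros((h, w) + patches[0].shape[2:], dtype=patches[0].dtype)
    idx = 0
    for r in range(0, h - size + 1, size):
        for c in range(0, w - size + 1, size):
            out[r:r + size, c:c + size] = patches[idx]
            idx += 1
    return out
```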
We implemented the proposed model using Python, leveraging the PyTorch deep learning framework, and trained and tested it on an NVIDIA TITAN Xp graphics card with 12 GB of RAM, optimizing the model parameters with the Adam optimizer using 100 iterations and a learning rate of 0.001.

4.3. Evaluation Metrics

The metrics accuracy (Acc), sensitivity (Sen), specificity (Spe), and F1-score (F1) are used to evaluate the performance of the network. For A/V classification, these metrics are denoted Acc_a, Sen_a, Spe_a, and F1_a, while for vessel segmentation they are denoted Acc_v, Sen_v, Spe_v, and F1_v. All the metrics are calculated as follows:
$$ Spe = \frac{TN}{TN + FP} $$
$$ Sen = \frac{TP}{TP + FN} $$
$$ Acc = \frac{TP + TN}{TP + FP + TN + FN} $$
$$ F1 = \frac{2 \times TP}{2 \times TP + FP + FN} $$
True positives (TP) and true negatives (TN) count the correctly classified positive and negative pixels, respectively, while false positives (FP) and false negatives (FN) count the misclassified ones.
For A/V classification, the positives describe the artery pixels, whereas the negatives describe the vein pixels. Sensitivity (Sen_a) measures the algorithm's ability to detect arteries, while specificity (Spe_a) assesses its ability to detect veins. Accuracy (Acc_a) represents the algorithm's capacity to differentiate between arteries and veins, while the F1 score (F1_a) represents the overall accuracy of the algorithm.
For vessel segmentation, the positives refer to vessel pixels and the negatives to background pixels. Sensitivity (Sen_v) relates only to whether the algorithm correctly classifies vessel pixels, whereas specificity (Spe_v) relates only to background pixels. Accuracy (Acc_v) and the F1 score (F1_v) measure the algorithm's overall accuracy in separating vessel pixels from background pixels.
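These definitions translate directly into code; the sketch below computes them from binary prediction and label masks (for A/V classification the masks would mark artery versus vein pixels, for vessel segmentation vessel versus background) and does not guard against empty classes.

```python
import numpy as np

def classification_metrics(pred: np.ndarray, label: np.ndarray):
    """Sen, Spe, Acc, and F1 from binary masks, following the formulas above."""
    tp = np.sum((pred == 1) & (label == 1))
    tn = np.sum((pred == 0) & (label == 0))
    fp = np.sum((pred == 1) & (label == 0))
    fn = np.sum((pred == 0) & (label == 1))
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return sen, spe, acc, f1
```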

4.4. Ablation Studies

We conduct ablation studies on the DRIVE dataset for A/V classification and vessel segmentation to validate the contribution of each module in MSC-Net. To demonstrate the contribution of each module more effectively, we choose a double U-Net as the baseline network, which follows the same strategy as MSC-Net: the first U-Net extracts the vessel features, and the second U-Net produces the A/V result. In the following experiments, we regard the U-like network whose first downsampling stage has a depth of 64 as the original U-Net.
Table 1 gives a detailed summary of the performance of various module combinations on the DRIVE dataset. The performance of the networks is greatly improved by the MVE, MAE, and MFI modules. Compared to the single-task networks summarized in the first two rows of Table 1, the multi-task networks exhibit higher evaluation metrics, with even the baseline network surpassing the single-task networks. Replacing the original U-Net with the MVE module improves the baseline network by 0.0126, 0.0024, 0.0007, and 0.0036 on Sen_v, Spe_v, Acc_v, and F1_v, revealing that the MVE module extracts more critical vessel structure information than the original U-Net, which benefits A/V classification. Moreover, replacing the original U-Net in the baseline network with the MAE module improves the baseline by 0.0155 in Sen_a, 0.0045 in Spe_a, 0.0110 in Acc_a, and 0.0111 in F1_a, demonstrating that the novel MAE module significantly improves performance over the original U-Net by capturing long-range dependencies between pixels. Additionally, the baseline network achieves improved performance when combined with the MFI module, which effectively fuses different feature representations; after adding this module, all metrics of the baseline network improve, with Sen_a increasing by 0.0178, Acc_a by 0.0112, and Sen_v by 0.0278. Notably, the MSC-Net combining all three modules achieves the best results, with Sen_a of 0.9494, Spe_a of 0.9440, Acc_a of 0.9469, and F1_a of 0.9398, denoting its superior performance on the A/V classification task. The highest Sen_v of 0.8527, Spe_v of 0.9785, Acc_v of 0.9685, and F1_v of 0.8107 demonstrate that the combination of the three proposed modules also obtains high-quality vessel segmentation results.
Furthermore, Figure 8 compares detailed visualizations of the A/V classification results of the baseline, the variant that replaces the U-Net in the baseline with the MAE module, and MSC-Net, with some specific regions highlighted in different colors. The figure shows that, after replacing the original U-Net with the MAE module, the baseline network extracts tiny vessels better. Moreover, with the addition of the MVE and MFI modules, the enhanced vascular maps preserve the microscopic structures of the original image, which benefits A/V classification. As shown in the green box of Figure 8, at vessel crossing areas the baseline network misclassifies the intersection, while the variant network and MSC-Net perform well. In the purple box, both the baseline and the variant network confuse the category of the vessel ends, whereas MSC-Net, which combines the MVE, MAE, and MFI modules, gives a high-quality, correct classification result.

4.5. Comparison with Existing Methods

The comparison results on vessel segmentation and A/V classification are shown in Table 2 and Table 3. The methods used for comparison are divided into methods based on hybrid structures [12,34,35] and convolution methods combined with graph-based techniques [11,36,37,38]. The A/V classification results of the proposed method and other state-of-the-art methods on the DRIVE and HRF datasets are summarized in Table 3. On the DRIVE dataset, MSC-Net achieves the best Sen_a of 0.9494, Spe_a of 0.9441, Acc_a of 0.9469, and F1_a of 0.9398, which are, respectively, 0.0194, 0.0111, 0.0128, and 0.0208 higher than the existing best results [12,34,35,36]. Moreover, in addition to the commonly used DRIVE dataset, we evaluate the proposed method on the HRF and LES datasets, which have higher resolution and more complicated fundus conditions. As listed in Table 3, our method demonstrates better extraction and reorganization capabilities, with the best Sen_a of 0.9720, Spe_a of 0.9758, and Acc_a of 0.9735 on the HRF dataset, and the best Sen_a of 0.9030, Spe_a of 0.9155, Acc_a of 0.9072, and F1_a of 0.8947 on the LES dataset.
Furthermore, we also compare the vessel segmentation performance of MSC-Net with other state-of-the-art methods. Table 2 summarizes the comparison results on the DRIVE, HRF, and LES datasets, confirming that the proposed method achieves the best performance on all three. MSC-Net achieves the highest Sen_v of 0.8527, Acc_v of 0.9684, and F1_v of 0.8107, and a comparable Spe_v of 0.9785 on the DRIVE dataset, reflecting its superior capability in identifying tiny structures such as micro-vessel pixels. For the HRF dataset, the best Sen_v of 0.8578 and Acc_v of 0.9704 showcase the effectiveness of the proposed method in precisely segmenting vessels even in high-resolution images with complex vessel tree structures. On the LES dataset, the highest Sen_v of 0.8824, Spe_v of 0.9885, Acc_v of 0.9825, and F1_v of 0.8518 demonstrate the superior performance of the proposed method under more complicated fundus conditions.
Additionally, some sample images are selected from the DRIVE, HRF, and LES datasets. Their A/V classification predictions, generated by the original U-Net or the proposed method, are presented in Figure 9, Figure 10 and Figure 11. In Figure 9, it is evident that MSC-Net is more robust than U-Net in identifying vessel crossing areas and the bifurcation of blood vessels. Specifically, MSC-Net could correctly identify the vessel crossing areas in the green box, which presents a challenge for U-Net. Similarly, MSC-Net classifies the primary and secondary vessels into appropriate categories in the purple box, whereas U-Net encounters difficulties in accurately classifying them. Furthermore, as shown in Figure 10 and Figure 11, the MSC-Net obtains higher quality results for high-resolution fundus images than the U-Net. Based on the visualization results, the proposed MSC-Net could extract tiny vessels from the original image and efficiently classify the arteries and veins, demonstrating its superior capability in detail extraction.
Finally, fusing datasets with different scales can enrich the quantity and distribution of the training data and thus enhance the robustness of the proposed model; we validate this on a dataset with a different scale. This approach effectively addresses the issue of limited data availability and distribution, providing a broader range of representative samples for model training. As shown in Table 4, the models in the first three rows are trained only on the DRIVE, HRF, or LES dataset and tested on the INSPIRE dataset, while the model in the last row is trained on the combination of the three datasets. The best results are obtained in the last row, showing that fusing datasets with different scales enhances the robustness of the model.

5. Conclusions

In this paper, we proposed a novel deep learning network named the Multi-task Segmentation and Classification Network (MSC-Net). The network deploys a novel strategy that utilizes the vessel segmentation results produced by the Multi-scale Vessel Extraction (MVE) module to enhance A/V classification. The MVE module employs a series of tiny-kernel convolution blocks to reduce the influence of background pixels and obtain multi-scale vessel features. The enhanced image retains much of the structural information of the original image, which is beneficial to A/V classification. The newly designed MAE module utilizes a hybrid structure of adjusted transformer-like and convolution blocks to learn and classify the different pixels. The MFI module fuses the outputs of the former two modules to obtain high-quality A/V classification and vessel segmentation results simultaneously.
Due to the concatenation of the original image and the output of the Multi-scale Vessel Extraction module, the Multi-structure A/V Extraction module obtains more information about tiny structures from the input data, which helps the network better accomplish the A/V classification task. At the same time, with the addition of the Multi-source Feature Integration module, the inherent limitation of the two-stage method is alleviated. Experimental results show that our method is superior to existing state-of-the-art methods on all the tested public datasets.

Author Contributions

Conceptualization, J.Y. and C.C.; methodology, C.C.; software, C.C.; validation, J.Y. and C.C.; formal analysis, C.C.; writing—original draft preparation, C.C.; writing—review and editing, J.Y.; supervision, J.Y. and C.C.; funding acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Project of Beijing Municipal Commission of Education, grant number KM202010016011, the National Natural Science Foundation of China, grant number 61871020, 62031003, Scientific Research Foundation of Beijing University of Civil Engineering and Architecture, grant number 00331613002, the Fundamental Research Funds for Beijing University of Civil Engineering and Architecture, grant number X18064. The computer resources were provided by Public Computing Cloud Platform of Renmin University of China.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tham, Y.C.; Li, X.; Wong, T.Y.; Quigley, H.A.; Aung, T.; Cheng, C.Y. Global Prevalence of Glaucoma and Projections of Glaucoma Burden through 2040: A Systematic Review and Meta-Analysis. Ophthalmology 2014, 11, 2081–2090. [Google Scholar] [CrossRef] [PubMed]
  2. Rask-Madsen, C.; King, G.L. Vascular Complications of Diabetes: Mechanisms of Injury and Protective Factors. Cell Metab. 2013, 1, 20–33. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  3. Smith, W.; Wang, J.J.; Wong, T.Y.; Rochtchina, E.; Klein, R.; Leeder, S.R.; Mitchell, P. Retinal Arteriolar Narrowing Is Associated With 5-Year Incident Severe Hypertension. Hypertension 2004, 44, 442–447. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  4. Ambati, J.; Atkinson, J.P.; Gelfand, B.D. Immunology of age-related macular degeneration. Nat. Rev. Immunol. 2013, 6, 438–451. [Google Scholar] [CrossRef] [Green Version]
  5. Hu, J.; Wang, H.; Cao, Z.; Wu, G.; Jonas, J.B.; Wang, Y.X.; Zhang, J. Automatic artery/vein classification using a vessel-constraint network for multicenter fundus images. Front. Cell Dev. Biol. 2021, 9, 659941. [Google Scholar] [CrossRef]
  6. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference—Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  7. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  8. Jin, Q.; Meng, Z.; Pham, T.D.; Chen, Q.; Wei, L.; Su, R. DUNet: A deformable network for retinal vessel segmentation. Knowl.-Based Syst. 2019, 178, 149–162. [Google Scholar] [CrossRef] [Green Version]
  9. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  10. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  11. Kang, H.; Gao, Y.; Guo, S.; Xu, X.; Li, T.; Wang, K. Avnet: A retinal artery/vein classification network with category-attention weighted fusion. Comput. Methods Programs Biomed. 2020, 195, 105629. [Google Scholar] [CrossRef]
  12. Galdran, A.; Meyer, M.; Costa, P.; Campilho, A. Uncertainty-aware artery/vein classification on retinal images. In Proceedings of the 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), Venice, Italy, 8–11 April 2019; pp. 556–560. [Google Scholar]
  13. Xu, X.; Wang, R.; Lv, P.; Gao, B.; Li, C.; Tian, Z.; Tan, T.; Xu, F. Simultaneous arteriole and venule segmentation with domain-specific loss function on a new public database. Biomed. Opt. Express 2018, 9, 3153–3166. [Google Scholar] [CrossRef] [Green Version]
  14. Mishra, S.; Wang, Y.X.; Wei, C.C.; Chen, D.Z.; Hu, X.S. VTG-net: A CNN based vessel topology graph network for retinal artery/vein classification. Front. Med. 2021, 8, 750396. [Google Scholar] [CrossRef]
  15. Tan, Y.; Yang, K.F.; Zhao, S.X.; Li, Y.J. Retinal vessel segmentation with skeletal prior and contrastive loss. IEEE Trans. Med Imaging 2022, 41, 2238–2251. [Google Scholar] [CrossRef]
  16. Estrada, R.; Allingham, M.J.; Mettu, P.S.; Cousins, S.W.; Tomasi, C.; Farsiu, S. Retinal artery-vein classification via topology estimation. IEEE Trans. Med Imaging 2015, 34, 2518–2534. [Google Scholar] [CrossRef] [Green Version]
  17. Zhao, Y.; Xie, J.; Zhang, H.; Zheng, Y.; Zhao, Y.; Qi, H.; Zhao, Y.; Su, P.; Liu, J.; Liu, Y. Retinal vascular network topology reconstruction and artery/vein classification via dominant set clustering. IEEE Trans. Med. Imaging 2019, 39, 341–356. [Google Scholar] [CrossRef]
  18. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  19. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  20. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. Bam: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  22. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  23. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. MixFormer: End-to-End Tracking with Iterative Mixed Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618. [Google Scholar]
  24. Li, Y.; Yao, T.; Pan, Y.; Mei, T. Contextual Transformer Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1489–1500. [Google Scholar] [CrossRef]
  25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  26. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  27. Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  28. Hu, Q.; Abràmoff, M.D.; Garvin, M.K. Automated separation of binary overlapping trees in low-contrast color retinal images. In Proceedings of the 16th International Conference—Medical Image Computing and Computer-Assisted Intervention—MICCAI 2013, Nagoya, Japan, 22–26 September 2013; pp. 436–443. [Google Scholar]
  29. Hu, Q.; Abràmoff, M.D.; Garvin, M.K. Automated construction of arterial and venous trees in retinal images. J. Med. Imaging 2015, 2, 044001. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  30. Odstrcilik, J.; Kolar, R.; Budai, A.; Hornegger, J.; Jan, J.; Gazarek, J.; Kubena, T.; Cernosek, P.; Svoboda, O.; Angelopoulou, E. Retinal vessel segmentation by improved matched filtering: Evaluation on a new high-resolution fundus image database. IET Image Process. 2013, 7, 373–383. [Google Scholar] [CrossRef]
  31. Orlando, J.I.; Barbosa Breda, J.; Van Keer, K.; Blaschko, M.B.; Blanco, P.J.; Bulant, C.A. Towards a glaucoma risk index based on simulated hemodynamics from fundus images. In Proceedings of the 21st International Conference—Medical Image Computing and Computer Assisted Intervention—MICCAI 2018, Granada, Spain, 16–20 September 2018; pp. 65–73. [Google Scholar]
  32. Niemeijer, M.; Xu, X.; Dumitrescu, A.V.; Gupta, P.; Van Ginneken, B.; Folk, J.C.; Abramoff, M.D. Automated measurement of the arteriolar-to-venular width ratio in digital color fundus photographs. IEEE Trans. Med. Imaging 2011, 30, 1941–1950. [Google Scholar] [CrossRef] [PubMed]
  33. Reza, A.M. Realization of the contrast limited adaptive histogram equalization (CLAHE) for real-time image enhancement. J. VLSI Signal Process. Syst. Signal Image Video Technol. 2004, 38, 35–44. [Google Scholar] [CrossRef]
  34. Noh, K.J.; Park, S.J.; Lee, S. Combining fundus images and fluorescein angiography for artery/vein classification using the hierarchical vessel graph network. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Lima, Peru, 4–8 October 2020; pp. 595–605. [Google Scholar]
  35. Ma, W.; Yu, S.; Ma, K.; Wang, J.; Ding, X.; Zheng, Y. Multi-task neural networks with spatial activation for retinal vessel segmentation and artery/vein classification. In Proceedings of the 22nd International Conference—Medical Image Computing and Computer Assisted Intervention—MICCAI 2019, Shenzhen, China, 13–17 October 2019; pp. 769–778. [Google Scholar]
  36. Li, L.; Verma, M.; Nakashima, Y.; Kawasaki, R.; Nagahara, H. Joint learning of vessel segmentation and artery/vein classification with post-processing. In Proceedings of the Medical Imaging with Deep Learning, Montreal, QC, Canada, 6–9 July 2020; pp. 440–453. [Google Scholar]
  37. Dashtbozorg, B.; Mendonça, A.M.; Campilho, A. An automatic graph-based approach for artery/vein classification in retinal images. IEEE Trans. Image Process. 2013, 23, 1073–1083. [Google Scholar] [CrossRef] [PubMed]
  38. Girard, F.; Kavalec, C.; Cheriet, F. Joint segmentation and classification of retinal arteries/veins from fundus images. Artif. Intell. Med. 2019, 94, 96–109. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  39. Gu, Z.; Cheng, J.; Fu, H.; Zhou, K.; Hao, H.; Zhao, Y.; Zhang, T.; Gao, S.; Liu, J. Ce-net: Context encoder network for 2d medical image segmentation. IEEE Trans. Med Imaging 2019, 38, 2281–2292. [Google Scholar] [CrossRef] [Green Version]
  40. Guo, C.; Szemenyei, M.; Yi, Y.; Wang, W.; Chen, B.; Fan, C. Sa-unet: Spatial attention u-net for retinal vessel segmentation. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 1236–1242. [Google Scholar]
Figure 1. The structure of MSC-Net.
Figure 2. The structure of the MVE module. The subfigure on the right shows the detailed structure of the Multi-scale Extract block, which is labeled with the yellow arrow in the left subfigure.
Figure 3. The main block of transformer. In this block, the features are produced by the former modules. ⊛ denotes the local matrix multiplication.
Figure 4. The structure of the MAE module. The subfigure on the right shows the detailed structure of a transformer-like block.
Figure 5. Schematic diagram of the effect of MFI module on vascular feature enhancement. The left subfigure shows the original image, and the middle portrays the output of the MVE module. The right subfigure displays the enhanced blood vessels after applying the MFI module, which enhances the details and edges of the vascular features.
Figure 6. Sample images from the DRIVE, HRF, LES, and INSPIRE datasets. The first column shows, from top to bottom, the original image, the vessel ground truth, and the A/V label of a DRIVE sample. The second, third, and fourth columns show samples from the HRF, LES, and INSPIRE datasets, respectively.
Figure 7. Results of preprocessing on the DRIVE dataset. The first column is the original image, the second column represents the green channel, and the last column shows the image processed by the CLAHE.
Figure 8. The processing results of ablation studies.
Figure 9. The A/V classification results on images from the DRIVE dataset.
Figure 10. The A/V classification results on images from the HRF dataset.
Figure 11. The A/V classification results on images from the LES dataset.
Table 1. Results of the ablation studies for A/V classification on DRIVE dataset (Key: Best).
Columns Sen_a–F1_a report A/V classification; columns Sen_v–F1_v report vessel segmentation.

| MVE/U | MAE/U | MFI | Sen_a | Spe_a | Acc_a | F1_a | Sen_v | Spe_v | Acc_v | F1_v |
|-------|-------|-----|-------|-------|-------|------|-------|-------|-------|------|
| -/✓ | -/- | - | - | - | - | - | 0.8019 | 0.9713 | 0.9541 | 0.8002 |
| -/- | -/✓ | - | 0.9242 | 0.9160 | 0.9200 | 0.9098 | - | - | - | - |
| -/✓ | -/✓ | - | 0.9229 | 0.9304 | 0.9273 | 0.9202 | 0.8213 | 0.9757 | 0.9655 | 0.8053 |
| ✓/- | -/✓ | - | 0.9231 | 0.9383 | 0.9309 | 0.9332 | 0.8339 | 0.9781 | 0.9662 | 0.8089 |
| -/✓ | ✓/- | - | 0.9384 | 0.9349 | 0.9383 | 0.9313 | 0.8309 | 0.9774 | 0.9662 | 0.8074 |
| ✓/- | ✓/- | - | 0.9460 | 0.9389 | 0.9438 | 0.9363 | 0.8418 | 0.9775 | 0.9673 | 0.8100 |
| -/✓ | -/✓ | ✓ | 0.9407 | 0.9352 | 0.9385 | 0.9278 | 0.8491 | 0.9779 | 0.9673 | 0.8052 |
| ✓/- | -/✓ | ✓ | 0.9466 | 0.9346 | 0.9433 | 0.9267 | 0.8435 | 0.9784 | 0.9684 | 0.8098 |
| -/✓ | ✓/- | ✓ | 0.9471 | 0.9398 | 0.9436 | 0.9360 | 0.8519 | 0.9782 | 0.9682 | 0.8090 |
| ✓/- | ✓/- | ✓ | 0.9494 | 0.9440 | 0.9469 | 0.9398 | 0.8527 | 0.9785 | 0.9685 | 0.8107 |
U, Original U-Net.
Table 2. Results of the comparison experiments for vessel segmentation (Key: Best).
| Dataset | Methods | Sen_v | Spe_v | Acc_v | F1_v |
|---------|---------|-------|-------|-------|------|
| DRIVE | U-Net [6] | 0.8019 | 0.9713 | 0.9541 | 0.8002 |
| DRIVE | AC-Net [35] | 0.7916 | 0.9811 | 0.9570 | N/A |
| DRIVE | CE-Net [39] | 0.7903 | 0.9769 | 0.9550 | N/A |
| DRIVE | SA-Unet [40] | 0.8112 | 0.9767 | 0.9641 | 0.8027 |
| DRIVE | MSC-Net | 0.8527 | 0.9785 | 0.9684 | 0.8107 |
| HRF | U-Net [6] | 0.8319 | 0.9790 | 0.9610 | 0.7987 |
| HRF | UA-Net [12] | 0.8500 | 0.9100 | 0.9100 | 0.6200 |
| HRF | VC-Net [5] | 0.7903 | 0.9843 | 0.9663 | 0.8101 |
| HRF | MSC-Net | 0.8578 | 0.9785 | 0.9704 | 0.8001 |
| LES | U-Net [6] | 0.8595 | 0.9808 | 0.9748 | 0.7755 |
| LES | UA-Net [12] | 0.8504 | 0.9840 | 0.9722 | 0.8417 |
| LES | MSC-Net | 0.8824 | 0.9885 | 0.9825 | 0.8518 |
N/A, Not Available.
Table 3. Results of the comparison experiments for A/V classification (Key: Best).
| Dataset | Methods | Sen_a | Spe_a | Acc_a | F1_a |
|---------|---------|-------|-------|-------|------|
| DRIVE | U-Net [6] | 0.9242 | 0.9160 | 0.9200 | 0.9098 |
| DRIVE | Li et al. [36] | 0.9000 | 0.8400 | 0.9190 | 0.9190 |
| DRIVE | Girard et al. [38] | 0.8630 | 0.8680 | 0.8650 | N/A |
| DRIVE | Noh et al. [34] | 0.9300 | 0.9220 | 0.9260 | N/A |
| DRIVE | AC-Net [35] | 0.9220 | 0.9330 | 0.9260 | N/A |
| DRIVE | AV-Net [11] | 0.8863 | 0.9272 | 0.9081 | N/A |
| DRIVE | UA-Net [12] | 0.8900 | 0.9250 | 0.9341 | 0.8800 |
| DRIVE | MSC-Net | 0.9494 | 0.9441 | 0.9469 | 0.9398 |
| HRF | U-Net [6] | 0.9274 | 0.9178 | 0.9570 | 0.9714 |
| HRF | VC-Net [5] | N/A | 0.9588 | 0.9704 | N/A |
| HRF | UA-Net [12] | 0.9120 | 0.9013 | 0.9613 | 0.9769 |
| HRF | MSC-Net | 0.9720 | 0.9758 | 0.9735 | 0.9717 |
| LES | U-Net [6] | 0.8640 | 0.9044 | 0.8771 | 0.8885 |
| LES | UA-Net [12] | 0.8800 | 0.8588 | 0.8604 | N/A |
| LES | MSC-Net | 0.9030 | 0.9155 | 0.9072 | 0.8947 |
Table 4. The model is trained under the selected training dataset and tested under the INSPIRE dataset (Key: Best).
Training datasets (✓ = included in training) and A/V classification results on the INSPIRE dataset.

| DRIVE | HRF | LES | Sen_a | Spe_a | Acc_a | F1_a |
|-------|-----|-----|-------|-------|-------|------|
| ✓ | - | - | 0.6873 | 0.6470 | 0.6686 | 0.7151 |
| - | ✓ | - | 0.8475 | 0.7511 | 0.8045 | 0.7714 |
| - | - | ✓ | 0.8640 | 0.6461 | 0.7048 | 0.6385 |
| ✓ | ✓ | ✓ | 0.8742 | 0.8632 | 0.8656 | 0.8565 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

