Article

Fundus-DANet: Dilated Convolution and Fusion Attention Mechanism for Multilabel Retinal Fundus Image Classification

Institution of Computer Science and Technology, Changchun Normal University, Changchun 130000, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8446; https://doi.org/10.3390/app14188446
Submission received: 6 June 2024 / Revised: 3 August 2024 / Accepted: 16 September 2024 / Published: 19 September 2024
(This article belongs to the Topic Color Image Processing: Models and Methods (CIP: MM))

Abstract

Fundus multi-lesion classification is the task of determining which of several diseases are present or absent in a retinal fundus image. Current approaches struggle to extract the similar morphological features that different lesions share and cannot handle the large feature variance that a single lesion exhibits across severity grades. This paper proposes a multi-disease recognition network, Fundus-DANet, built on dilated convolution. It contains two sub-modules that address these issues: the interclass learning module (ILM) and the dilated-convolution convolutional block attention module (DA-CBAM). The DA-CBAM uses a convolutional block attention module (CBAM) and dilated convolution to extract and merge multiscale information from images. The ILM uses the channel attention mechanism to map the features to lower dimensions, facilitating the exploration of latent relationships between categories. The results demonstrate that this model outperforms previous models in classifying multiple fundus lesions on the OIA-ODIR dataset, reaching 93% accuracy.

1. Introduction

According to the 2021 World Health Organization (WHO) report, nearly half of the 2.2 billion people worldwide who experience visual impairment have conditions that could have been averted with early detection and treatment [1]. It is therefore crucial to screen for fundus lesions early and to diagnose and treat them promptly. Optical coherence tomography (OCT), which produces high-resolution tomographic images by scanning the signals reflected from retinal fundus tissue, is the most widely used technique for fundus examination. However, the limited number of ophthalmologists restricts widespread fundus lesion screening and prevention, so there is an urgent need for techniques that automate the diagnosis of fundus lesions.
Relevant research has shown that deep learning methods, including convolutional neural networks, perform well in classifying fundus lesions. For example, Bogacsovics et al. [2] built a model that combined features from convolutional networks with hand-crafted features to improve the prediction of conditions such as diabetic macular edema and diabetic retinopathy. The BFPC-Net designed by Li et al. [3] combined channel and spatial attention mechanisms with an emphasis on fundus lesions. To improve the classification accuracy of each target, Huo et al. [4] proposed a three-branch hierarchical multiscale feature fusion network (HiFuse) for medical image classification, which fused the advantages of the Transformer and the convolutional neural network (CNN) at multiple scales while preserving the modeling strengths of each. Bhati et al. [5] created a Discriminative Kernel Convolution Network (DKCNet) that trains multiscale discriminative features using attention blocks. Nevertheless, there is still considerable room for improvement in classification accuracy and model generalization, because current approaches cannot extract the similar morphological features exhibited by different lesions in fundus images.
Therefore, this paper proposes Fundus-DANet with the following contributions:
1. A dilated-convolution convolutional block attention module (DA-CBAM) is designed in which multiscale image features are extracted and fused with dilated convolution and the resulting feature representation in the convolutional neural network is enhanced with the convolutional block attention module (CBAM) [6], allowing the multiscale features within the image to be extracted effectively.
2. An interclass learning module (ILM) is designed that adapts the channel attention mechanism to address the difficulty that conventional convolutional feature extraction has in capturing the important features shared between classes.
3. Asymmetric polynomial loss (APL) [7] is used to apply different weights to the positive and negative samples, improving the model's ability to recognize rare categories. Specifically, a larger penalty is placed on errors for positive samples, which strengthens the model's focus on them and improves overall performance.
The rest of the paper is organized as follows: Section 2 presents related work on dilated-convolution and attention-based methods for fundus image classification. Section 3 details the methodology and evaluation metrics. Section 4 presents the dataset, experimental setup, and results. Section 5 discusses the limitations of the approach, and Section 6 concludes the paper.

2. Related Work

Fundus lesion classification has attracted increasing interest owing to recent advances in deep learning-based algorithms for fundus image analysis. The CNN is a deep learning architecture commonly used in classification tasks because its convolutional computations accurately capture the spatial properties of fundus images. In addition, the attention mechanism is widely used in image processing because it reduces redundant information, prioritizes important regions, and improves classification accuracy.

2.1. Dilated-Convolution-Based Methods

You et al. [8] proposed a multiscale dilated-convolution-based ensemble learning (MDCEL) method. It introduced dilated convolutions with different dilation rates into a traditional CNN with transfer learning, trained multiple learners simultaneously through an aggregation strategy, and obtained the final result by weighted voting. Unfortunately, this approach could not retain information with high precision and had a limited capacity for feature collection. To address this issue, Sun et al. [9] applied multiple parallel dilated convolutions in the bottleneck layer of the network to fuse multiscale contextual information and used a U-shaped network to fully collect multiscale image features; however, this approach could not fuse the features properly. To gather more information, strengthen the fundamental features, and broaden the receptive field, Panchal et al. [10] created the residual multi-kernel dilated-convolution U-Net (ResMU-Net), which combined a U-shaped network structure with skip connections and dilated convolution at each layer. Furthermore, Tu et al. [11] proposed a residual dense and dilated convolution method (RDDC-3DCNN) for classifying hyperspectral images, which effectively extracts spatial–spectral features by fusing different levels of spatial–spectral features through dilated convolution in residual dilated dense blocks. These two approaches, however, were highly complex, prone to overfitting on datasets with few samples, and limited in generalization capacity. Consequently, Madarapu et al. [12] proposed the multiresolution convolutional attention network (MuR-CAN), which highlights discriminative characteristics to enhance overall performance; it first extracts features with the backbone network and then applies dilated convolution, and it demonstrated superior performance on two datasets with fewer samples.

2.2. Attention Mechanism-Based Methods

Romero-Oraá et al. [13] developed a novel attention method that produces independent attention maps by focusing separately on the bright and dark structures in retinal images. However, this method can only handle images with significant variations in feature color, such as fundus images, and may be difficult to apply to other kinds of image classification. To sharpen the feature differences between lesions and background while focusing on lesion feature information, Li et al. [14] created a Binocular Fundus Photographs Classifying Network (BFPC-Net) with improved generalization by adding a residual network and an attention mechanism fusion module. However, that approach, which employed dual channel and spatial attention mechanisms, may lead to information loss or an overemphasis on specific local elements, causing the model to overlook other significant features and reducing its overall comprehension of the input data. Hence, Madarapu et al. [15] no longer confined themselves to local features but instead proposed a parallel connection of channel and spatial attention mechanisms, also built on the fusion of residual networks and attention mechanisms; they then used the self-attention mechanism to model the interactions between all locations of the feature map and thus capture its long-range dependencies. Nevertheless, their channel attention mechanism was implemented with two tightly coupled modules and carried considerable modeling overhead. In response, Das et al. [16] designed an adapter and enhanced self-attention-based CNN framework (AES-Net), based on parallel connectivity through pooling and three 1 × 1 convolutional kernels, that selectively extracts rich discriminative and stage-specific features from the lesion region; this significantly reduced the model overhead. Table 1 summarizes these previous studies and their key contributions, providing an overview of the research landscape in this field.

3. Methods

This section describes Fundus-DANet in detail. The general framework of the network is depicted in Figure 1. It comprises four components: the image preprocessing module, the DA-CBAM, the ILM, and the loss function. This study aimed to classify multiple fundus lesions accurately. First, InceptionResnet-v2 was used as the backbone network to extract feature F from the preprocessed images. Feature F was fed simultaneously into the DA-CBAM and the ILM. The DA-CBAM extracted and fused local and global image features, while the ILM extracted interclass features using the SE attention mechanism [17]. Each module generated a prediction, denoted prediction1 and prediction2, respectively. Finally, an APL function was applied to each output, and the two losses were averaged to obtain the final loss.
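To make this two-branch wiring concrete, a minimal PyTorch sketch is given below. It is illustrative rather than the authors' released code: the timm backbone call is an assumed way to obtain the 1536-channel feature map, and the DACBAM and ILM classes are placeholders elaborated in the sketches of Sections 3.1.1 and 3.1.2.

```python
import torch
import torch.nn as nn
import timm  # assumption: timm supplies a pretrained Inception-ResNet-v2


class FundusDANet(nn.Module):
    """Illustrative two-branch wiring: shared backbone -> (DA-CBAM, ILM) -> two predictions."""

    def __init__(self, num_classes=8):
        super().__init__()
        # num_classes=0 and global_pool="" make timm return the raw 8x8x1536 feature map.
        self.backbone = timm.create_model(
            "inception_resnet_v2", pretrained=True, num_classes=0, global_pool="")
        self.dacbam = DACBAM(in_channels=1536, num_classes=num_classes)  # sketch in Section 3.1.1
        self.ilm = ILM(in_channels=1536, num_classes=num_classes)        # sketch in Section 3.1.2

    def forward(self, x):
        f = self.backbone(x)       # shared feature F
        pred1 = self.dacbam(f)     # spatial / multiscale branch
        pred2 = self.ilm(f)        # interclass branch
        return pred1, pred2


def total_loss(pred1, pred2, targets, criterion):
    """Supervise each branch independently and average the two losses."""
    return 0.5 * (criterion(pred1, targets) + criterion(pred2, targets))
```

A criterion such as the APL sketch in Section 3.2 would be passed to total_loss so that each branch's prediction is penalized separately.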

3.1. Fundus-DANet

3.1.1. DA-CBAM

We designed the DA-CBAM for spatial feature learning in images. Without reducing the resolution of the feature maps, this module leveraged three dilated convolution layers together with the channel and spatial attention mechanisms of the CBAM [6] to focus on significant regions and to learn and fuse multiscale features. Figure 2 depicts the general architecture of the DA-CBAM. First, the channels of feature F were divided into two groups, and the channels within these groups were rearranged to facilitate the exchange and combination of information between the different channels. Then, dilated convolution layers with a kernel size of 3 and dilation rates of 1, 2, and 3 were applied to the channel-shuffled feature F.
The dilated convolution kernel, depicted in Figure 3, introduces spacing between the kernel elements to enlarge the receptive field of the network without increasing the number of parameters or the computational cost. Dilated convolutions with different dilation rates can capture local details and global information simultaneously, enhancing the perception of features at various scales, which is crucial for handling the fine and diverse structures present in the images. Because the backbone network transforms the image into a tensor with 1536 channels and a spatial size of 8 × 8, the dilation rates had to be chosen carefully: excessive rates would exceed the original spatial dimensions and harm the experimental results. We therefore restricted the dilation rates to 1, 2, and 3, which ensures an appropriate receptive field while avoiding spatial inconsistencies. On this basis, we obtained three feature vectors, F1, F2, and F3; the feature maps at these three scales were batch-normalized and then activated by the ReLU function, as shown in Equations (1) and (2):
$$Y_d(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} X(i + m \cdot d,\, j + n \cdot d) \cdot W(m, n), \qquad k = 3,\ d = 1, 2, 3,$$

where $(i, j)$ denotes the row and column indices of the output feature map $Y_d$, $k$ is the size of the dilated convolution kernel, and $d$ is the dilation rate.

$$F_1, F_2, F_3 = \sigma(BN(Conv(F))),$$

where $\sigma$ is the activation function, and $Conv$ is the dilated convolution.
As shown in Figure 4, each of the three feature maps was passed through a 1 × 1 convolution layer to reduce the number of channels. The feature maps were then combined using maximum and average pooling, as indicated by Equation (3); this fused the feature maps along the channel dimension while preserving the information of the different feature map types, enhancing the model's representation and generalization capabilities. The CBAM attention mechanism was then applied to determine the significance of each spatial location and to strengthen the focus on the relevant positions, helping the network attend to the important regions of the image. The three features were then combined through residual connections and summed to produce F1′, F2′, and F3′; passing features directly from the lower layers to the higher layers and fusing global semantic information with local detail in this way improves the network's representation capability and training efficiency. Finally, the summed feature Fs was reduced in size by average pooling, which preserves the overall characteristics of the image, and a fully connected layer produced the prediction results.
In addition, to prevent overfitting, dropout functions were added that randomly discard neurons with probabilities of 0.25 and 0.3 after pooling and before feature summation, respectively:
$$F_{pool} = \sigma\big(BN(AvgPool(x) + MaxPool(x))\big),$$

where $x$ denotes the input feature map.
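The following PyTorch sketch approximates the DA-CBAM just described, under stated assumptions: the 512-channel width of the 1 × 1 reduction, the 3 × 3 pooling windows, and the compact CBAM block are our own choices, while the two-group channel shuffle, the dilation rates of 1, 2, and 3, the dropout rates of 0.25 and 0.3, and the residual summation follow the text above. It is a sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn


class CBAM(nn.Module):
    """Compact convolutional block attention module: channel then spatial attention [6]."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        ca = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * ca.view(b, c, 1, 1)
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)))
        return x * sa


def channel_shuffle(x, groups=2):
    """Split channels into groups and interleave them so the groups exchange information."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)


class DACBAM(nn.Module):
    """Sketch of DA-CBAM: channel shuffle, dilated convs (rates 1, 2, 3), CBAM, residual fusion."""

    def __init__(self, in_channels=1536, mid_channels=512, num_classes=8):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),      # Equation (2)
                nn.Conv2d(in_channels, mid_channels, 1))                  # 1x1 channel reduction
            for d in (1, 2, 3)])
        self.cbam = CBAM(mid_channels)
        self.drop_pool = nn.Dropout(0.25)   # dropout after pooling
        self.drop_sum = nn.Dropout(0.30)    # dropout before feature summation
        self.fc = nn.Linear(mid_channels, num_classes)

    def forward(self, f):
        f = channel_shuffle(f, groups=2)
        fused = []
        for branch in self.branches:
            y = branch(f)
            # Fuse average- and max-pooled responses (cf. Equation (3)), then apply CBAM.
            pooled = self.drop_pool(torch.relu(
                Fn.avg_pool2d(y, 3, 1, 1) + Fn.max_pool2d(y, 3, 1, 1)))
            fused.append(y + self.cbam(pooled))          # residual connection
        fs = self.drop_sum(fused[0] + fused[1] + fused[2])
        fs = fs.mean(dim=(2, 3))                         # global average pooling
        return self.fc(fs)
```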

3.1.2. ILM

We developed the ILM to identify the critical characteristics that distinguish the various classes for precise classification, focusing on the label dependency of the training set and the problem of interclass similarity in medical images. As illustrated in Figure 5, a 1 × 1 convolution was applied to the features extracted by InceptionResnet-v2 to further capture the variability between classes and prepare for the subsequent classification task. After the original 1536-dimensional features were mapped to an 8-dimensional feature space with the 1 × 1 convolution kernel, the interclass features Fa were obtained through an average pooling operation, which transformed the data of the original feature space into higher-level abstract features and improved the feature representation of the different classes. This enhanced the model's ability to capture subtle interclass differences.
The channel attention mechanism then established full connections between the input and output features several times, using parallel global average and maximum pooling to improve the model's ability to characterize the input features. This strengthened the model's focus on the critical differences between classes and its ability to capture interclass dependencies. Finally, the inter-channel correlations obtained dynamically from the global average and maximum pooling were used to weight the input features, and the weighted features were integrated with Fa through a residual connection to obtain the feature vector Fa′:
$$F_a' = ILM(F_a) = ILM(W \times F)$$
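A corresponding sketch of the ILM is shown below, assuming the 1536-to-8 projection described above; the width of the excitation layers and the final classifier layer are illustrative assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn as nn


class ILM(nn.Module):
    """Sketch of the interclass learning module: 1x1 projection plus SE-style channel attention [17]."""

    def __init__(self, in_channels=1536, num_classes=8, hidden=16):
        super().__init__()
        # Map the 1536-channel backbone feature F into a low-dimensional, class-related space.
        self.proj = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        # Excitation fed by both global average- and max-pooled channel descriptors.
        self.fc = nn.Sequential(
            nn.Linear(num_classes, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes))
        self.classifier = nn.Linear(num_classes, num_classes)

    def forward(self, f):
        fa = self.proj(f)                                    # (B, 8, H, W) interclass features
        avg = fa.mean(dim=(2, 3))                            # global average pooling
        mx = fa.amax(dim=(2, 3))                             # global max pooling
        weights = torch.sigmoid(self.fc(avg) + self.fc(mx))  # channel attention weights
        fa_att = fa * weights[:, :, None, None]              # reweight the input features
        fa_out = fa_att + fa                                 # residual integration
        return self.classifier(fa_out.mean(dim=(2, 3)))
```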

3.2. APL

An unbalanced data distribution results from the significant differences in the number of samples per category in existing public fundus datasets. This imbalance can harm the training and performance of deep learning models: the models become biased toward learning features of the categories with more samples, ignore the categories with fewer samples, or even misclassify such samples into the majority categories.
We used APL [7,18] to address the uneven distribution of the dataset. APL discards negative samples with low prediction probabilities and separates the gradient contributions of the positive and negative samples by using different focusing parameters in the Taylor expansion of the BCE loss. Two hyperparameters with different scaling factors, γ+ and γ−, were first used to adjust the losses L+ and L− for the positive and negative categories and to discard the categories with little impact on the experiment, as shown in Equation (5):
$$L^- = \sum_{n=1}^{\infty} \beta_{i,n} \left[ \max(p_i - p_{th},\, 0) \right]^{\,n + \gamma^-},$$

where $p_i$ is the predicted probability, $p_{th}$ is the hard threshold, and $\beta$ is the coefficient for negative samples. When $p_i$ is less than the threshold, it does not contribute to the loss. On this basis, the loss function was obtained as:
$$L_{APL} = \frac{1}{C} \sum_{i=0}^{C} \Big\{ y_i (1 - p_i)^{\gamma^+} \big[ -\log p_i + (\alpha_1 - 1)(1 - p_i) + (\alpha_2 - \tfrac{1}{2})(1 - p_i)^2 \big] + (1 - y_i)\, p_{res}^{\gamma^-} \big[ -\log(1 - p_i) + (\beta_1 - 1)\, p_{res} \big] \Big\},$$

where $\alpha_1$ and $\alpha_2$ are adjustable parameters for the positive samples, $\beta_1$ is a parameter for the negative samples, $C$ is the number of categories, and $p_{res}$ denotes $\max(p_i - p_{th},\, 0)$.
The loss function of the proposed method consists of two parts: the image learning module loss and the interclass learning module loss. The total network loss is given by:
$$L_{total} = \frac{1}{N} \sum_{n=1}^{N} (L_s + L_a),$$

where $L_s$ is the loss of the image learning module, $L_a$ is the loss of the interclass learning module, and $N$ is the number of samples in an epoch.
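As a reference point, the APL formulation above can be written roughly as the following PyTorch function. The default values for γ+, γ−, α1, α2, β1, and p_th are placeholders rather than the values tuned in this paper, and the clamping and epsilon handling are our own choices.

```python
import torch


def asymmetric_polynomial_loss(logits, targets, gamma_pos=1.0, gamma_neg=4.0,
                               alpha1=1.0, alpha2=0.5, beta1=1.0, p_th=0.05, eps=1e-8):
    """Sketch of the APL loss; hyperparameter defaults here are illustrative placeholders."""
    p = torch.sigmoid(logits)
    p_res = torch.clamp(p - p_th, min=0.0)          # negatives below the hard threshold drop out
    pos = (1 - p) ** gamma_pos * (
        -torch.log(p + eps)
        + (alpha1 - 1.0) * (1 - p)
        + (alpha2 - 0.5) * (1 - p) ** 2)
    neg = p_res ** gamma_neg * (
        -torch.log(1 - p + eps) + (beta1 - 1.0) * p_res)
    loss = targets * pos + (1 - targets) * neg
    return loss.mean()                               # average over classes and batch
```

In training, this criterion can be combined with the two-branch averaging shown in the Section 3 sketch so that each module's prediction is penalized separately.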

3.3. Evaluation Metrics

The six evaluation criteria employed in this work were the area under the curve (AUC), κ, the Final_score, the F1_score, the Hamming loss (HL), and the classification accuracy (ACC).
ACC measures the proportion of samples the model correctly classifies in its predictions. The formula to calculate ACC is as follows:
$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN},$$

where $TP$, $TN$, $FP$, and $FN$ represent the numbers of true positive, true negative, false positive, and false negative samples, respectively.
F1_score was calculated by the formula:
$$F1\_score = \frac{2\,TP}{2\,TP + FP + FN}$$
Final_score was calculated by the formula:
$$Final\_score = \frac{F1\_score + Kappa + AUC}{3}$$
κ measures the degree to which the model's predictions agree with the actual classification outcomes, which makes it useful for evaluating how well the model classifies minority categories in unbalanced datasets. It is computed from the confusion matrix, takes values between −1 and 1, and is usually greater than 0. The formula is as follows:
$$\kappa = \frac{p_0 - p_e}{1 - p_e},$$

where $p_0$ is the prediction accuracy, and $p_e$ is the sum over classes of the product of the actual and predicted counts, divided by the square of the total number of samples.
The Hamming loss measures the discrepancy between the true and predicted labels by dividing the number of label positions in which they differ by the total number of labels; the smaller the ratio, the more accurate the model's predictions. The formula is as follows:
$$HL(y, \hat{y}) = \frac{1}{nm} \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} \mathbf{1}\big(\hat{y}_{i,j} \neq y_{i,j}\big)$$
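Assuming the predicted probabilities and multi-hot labels are collected as NumPy arrays, these metrics can be computed with scikit-learn roughly as follows. Flattening the label matrix for ACC, κ, F1, and AUC is one common convention for multi-label evaluation and may differ from the authors' exact protocol; the 0.5 decision threshold is likewise an assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             hamming_loss, roc_auc_score)


def evaluate(y_true, y_prob, threshold=0.5):
    """y_true, y_prob: (num_samples, num_classes) arrays of labels and sigmoid scores."""
    y_pred = (y_prob >= threshold).astype(int)
    acc = accuracy_score(y_true.ravel(), y_pred.ravel())        # element-wise accuracy
    kappa = cohen_kappa_score(y_true.ravel(), y_pred.ravel())   # kappa on flattened labels
    f1 = f1_score(y_true.ravel(), y_pred.ravel())
    auc = roc_auc_score(y_true.ravel(), y_prob.ravel())
    final = (f1 + kappa + auc) / 3.0                            # Final_score as defined above
    hl = hamming_loss(y_true, y_pred)
    return {"ACC": acc, "AUC": auc, "F1": f1, "Final": final, "Kappa": kappa, "HL": hl}
```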

4. Experiments and Results

This section provides the details of the experiments and results. Section 4.1 describes the dataset. Section 4.2 gives the experimental setup and hyperparameter selection. Section 4.3 details the data preprocessing. Section 4.4 examines the experimental results, including the choice of backbone models, attention mechanisms, ablation studies, and comparisons with existing systems, showing that the proposed model achieves the best results in these respects.

4.1. Datasets

The experimental dataset, OIA-ODIR (https://github.com/nkicsl/OIA-ODIR) (accessed on 3 August 2024) [19], contains color fundus photographs of both eyes of 5000 patients, totaling 10,000 images. As shown in Figure 6, the dataset comprises normal images (N) and six diseases, namely diabetic retinopathy (D), glaucoma (G), cataract (C), age-related macular degeneration (A), myopia (M), and hypertension (H), as well as other diseases (O). Because the dataset provides labels only for binocular image pairs together with diagnostic results for each eye, labels for each single eye were first created from the diagnostic results. After removing images that were misaligned, of poor quality, affected by dusty lenses or abnormal camera exposure, lacking a visible optic disc, or controversial with respect to the pathology, 9476 images were retained. The dataset was randomly split into training, validation, and test sets at a ratio of 8:1:1, giving 7574 images for training, 951 for validation, and 941 for testing.

4.2. Implementation Details

The hardware used for the study comprised an AMD EPYC 7402 24-core CPU, an NVIDIA A30 GPU with 24 GB of video memory, and 86 GB of RAM. The model was implemented with the PyTorch 2.0 framework on Python 3.9 and CUDA 11.7. To make the raw data compatible with the experimental hardware and acceptable to the model, the images were first scaled to 299 × 299 pixels.
Table 2 lists the hyperparameter settings used during model training. The learning rate was initially set to 0.0007, dynamically adjusted via StepLR, and then refined based on the validation set results. The APL loss was computed independently for each module and the two values were averaged, with a fully connected layer used as the output layer of each module.
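The settings in Table 2 translate into roughly the following PyTorch training configuration; FundusDANet, total_loss, and asymmetric_polynomial_loss refer to the earlier sketches, and train_loader is assumed to yield batches of 299 × 299 images with multi-hot labels.

```python
import torch

# Hyperparameters follow Table 2; the model and loss are the sketches given earlier.
model = FundusDANet(num_classes=8).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0007,
                            momentum=0.9, weight_decay=1e-6)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.9)

for epoch in range(30):
    for images, labels in train_loader:            # assumed DataLoader, batch size 16
        images, labels = images.cuda(), labels.cuda()
        pred1, pred2 = model(images)
        loss = total_loss(pred1, pred2, labels.float(), asymmetric_polynomial_loss)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```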

4.3. Image Preprocessing

The images in the OIA-ODIR dataset vary considerably in resolution and contrast because they were captured with different equipment at different times; other issues include uneven illumination and lens blurring. Since the fundus dataset is small and the model is therefore prone to overfitting, this paper enhanced the images before feeding them to the network for training.
Since most of the images in the dataset were rectangular and the model input was square, as shown in Figure 7a, directly scaling an image after cropping the extra black region would cause the distortion shown in Figure 7c. Therefore, as illustrated in Figure 7b, we first searched for non-black pixels in all four directions and cropped the image to a square that contained only the fundus; this square was then scaled to the dimensions required by the network. Next, random horizontal and vertical flips increased image diversity, and adjustments to brightness, contrast, saturation, and hue further mitigated overfitting. Finally, the images were normalized with the mean and standard deviation to speed up model convergence.
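A possible implementation of this preprocessing pipeline with torchvision is sketched below. The crop-then-resize order, the random flips, the color adjustments, and the 299 × 299 input size follow the description above, while the non-black threshold, the jitter strengths, and the ImageNet normalization statistics are assumptions; the bounding-box crop is only an approximation of the square fundus region.

```python
import numpy as np
from PIL import Image
from torchvision import transforms


def crop_to_fundus_square(img, threshold=10):
    """Crop away the black border by locating non-black pixels in all four directions."""
    arr = np.asarray(img.convert("L"))
    rows = np.where(arr.max(axis=1) > threshold)[0]
    cols = np.where(arr.max(axis=0) > threshold)[0]
    return img.crop((cols[0], rows[0], cols[-1] + 1, rows[-1] + 1))


# Augmentation and normalization applied before training (parameter values are typical defaults).
train_transform = transforms.Compose([
    transforms.Lambda(crop_to_fundus_square),
    transforms.Resize((299, 299)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```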

4.4. Results

4.4.1. Performance Evaluation of Different Backbone Networks

To attain the best classification performance, this paper compared multiple backbone models. Because lesions in fundus images are widely distributed and difficult to localize, we prioritized Inception-ResNet-v2 [20]. With its residual connections, this hybrid architecture alleviates the vanishing-gradient problem in deep networks, captures features at various scales well, and keeps training stable for deeper networks. To strengthen the persuasiveness of our findings, this paper also compared four additional high-performing models: VGG16 [21], Inception-v3 [22], ResNet101 [23], and EfficientNet-B4 [24]. The best results were obtained with InceptionResnetV2 as the backbone network, as shown in Table 3: 93% ACC, 82.13% Final_score, 83.74% F1_score, 67.52% κ, 7% HL, and 95.12% AUC. Among the compared models, Inception-v3 and ResNet101 had their strengths, with ACCs of 92.28% and 90.70%, respectively, but they slightly underperformed the composite InceptionResNetV2. EfficientNet-B4 focuses on computational efficiency and resource utilization, which tends to sacrifice some feature extraction capability in complex tasks, resulting in poorer performance. VGG16 lacks an explicit multiscale feature extraction mechanism and residual connections, leading to relatively inferior results.
Table 3 also illustrates the impact of the preprocessing module on experimental outcomes. The results demonstrated that incorporating the preprocessing module improved all the metrics by 1–3%. This indicated that addressing the input image distortion could significantly enhance the model’s feature recognition capability. Furthermore, by significantly augmenting the diversity of image data and improving the quality of input images, the preprocessing module effectively reduced the risk of model overfitting.

4.4.2. Ablation Experiment

We conducted ablation experiments on the DA-CBAM, ILM, and APL to show how each module affects the classification results; in the runs without APL, the cross-entropy loss function was used instead, as shown in Table 4. The first row gives the result of using only the backbone model InceptionResnetV2. Adding the DA-CBAM slightly improved evaluation metrics such as ACC and F1_score, but the gains were limited because the dataset imbalance had not yet been addressed and the interclass feature relations were not captured. Adding APL to the DA-CBAM brought a substantial improvement of 1–6% in all metrics, whereas adding the ILM to the DA-CBAM did not improve the metrics significantly, indicating that proper data balancing was what improved the experimental results. The experiment with the ILM but without the DA-CBAM scored lower, and overall performance suffered when the ILM was used alone, because the ILM and DA-CBAM must collaborate to enhance the system. The confusion matrix in Figure 8 for the experiments that include APL clearly shows that the classification results for the categories with fewer samples improved greatly. Combining the DA-CBAM and ILM still outperformed either module alone, whereas introducing the ILM alone on top of the backbone slightly degraded the results. By complementing each other's shortcomings and providing additional information, the two modules increase overall system performance, which substantiates the need to extract intra- and interclass features concurrently.

4.4.3. Comparative Experiments between Different Attentional Mechanisms

By assigning different weights to different features, the attention mechanism can effectively increase the accuracy of medical image classification, allowing the model to focus more on the relevant features. By analyzing current state-of-the-art attention mechanisms, this research sought to determine the best scheme for the image learning and interclass learning modules. For the image learning module, this paper compared the CBAM, the self-attention mechanism [25], spatial attention with channel attention removed [26], and SGE. For the interclass learning module, this paper compared ECA-Net [27] and the improved SE-Net-based method of Zhang et al. [28]. Table 5 displays the results. All the attention mechanisms were found to increase model accuracy, but the most successful was the combination of the CBAM and the improved SE-Net method, which maximizes the use of spatial features and captures the important information in the image more effectively, thereby improving the model's performance.

4.4.4. Comparative Testing of Different Models

In this paper, Fundus-DANet was compared with the EfficientNetB3 model of Wang et al. [29], the MCGL-Net of Sun et al. [30], the MCGS-Net of Lin et al. [31], and the DCNet of Li et al. [32]. The results are shown in Table 6. MCGL-Net achieved an ACC of 94.46%, which is 1.46% higher than the model in this paper, but it performed worse on the AUC and κ metrics, coming in 1.62% and 3.22% lower, respectively. These results indicate that the proposed model handles the dataset imbalance better. DCNet and MCGL-Net performed better in terms of the F1_score, striking a good balance between comprehensiveness and accuracy while demonstrating significant generalization capability.

5. Discussion and Limitations

In this study, we proposed a Fundus-DANet model. Although our model showed promising outcomes in capturing subtle interclass differences and enhancing overall classification accuracy, implementing it in real-world scenarios still has several limitations and challenges. First, the quality of fundus images was inconsistent, and the number of samples across different classes was severely imbalanced, making model training challenging. The limited data reduced the model’s robustness, making it difficult to maintain consistent performance across various application scenarios. Second, deploying Fundus-DANet requires substantial computational resources due to the complexity of the model and its extensive use of convolutional and attention mechanisms, which may result in longer inference times and higher hardware requirements. This limits its applicability in clinical settings requiring rapid diagnosis or real-time analysis. Lastly, the continuous updating and maintenance of the model are crucial for adapting to new data and evolving medical standards. This necessitates ongoing efforts in both technology and human resources.
To address these issues, we aim to collaborate with hospitals to acquire more data for model training, enhancing the model’s generalization capability and robustness. In addition, we will strive to reduce the model’s computational cost and inference time without compromising its accuracy to better suit the needs of real-time clinical applications.

6. Conclusions

This paper investigated a dilated-convolution fundus multi-disease classification model incorporating an attention mechanism, aimed at accurately recognizing multiple diseases of the retinal fundus. The model extracts the input image's spatial and categorical features and applies the appropriate spatial and channel attention mechanisms to each; the final prediction is obtained by averaging the two branch outputs. Furthermore, the APL function alleviates category imbalance and improves the classification results for minority categories. The experimental results demonstrate that the model achieves an accuracy of 93% in classifying multiple lesions in fundus images and improves all metrics compared with existing fundus multi-disease classification models, validating its ability to classify multiple fundus diseases accurately. The model balances the different data categories efficiently and has the potential to become a reliable clinical tool.

Author Contributions

Conceptualization, Y.Y., L.Y. and W.H.; methodology, L.Y.; software, Y.Y., L.Y. and W.H.; validation, Y.Y., L.Y. and W.H.; writing—original draft preparation, Y.Y. and L.Y.; writing—review and editing, Y.Y.; supervision, W.H.; project administration, Y.Y.; funding acquisition, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Jilin Province, grant number YDZJ202401350ZYTS.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study, OIA-ODIR, is publicly available and can be accessed at [https://github.com/nkicsl/OIA-ODIR] (accessed on 3 August 2024). The data and code used to support the findings of this study are available from the corresponding author upon request ([email protected]).

Acknowledgments

We thank the OIA-ODIR project for providing the dataset used in this study, which is available at [https://github.com/nkicsl/OIA-ODIR] (accessed on 3 August 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Burton, M.J.; Ramke, J.; Marques, A.P.; Bourne, R.R.; Congdon, N.; Jones, I.; Tong, B.A.A.; Arunga, S.; Bachani, D.; Bascaran, C. The lancet global health commission on global eye health: Vision beyond 2020. Lancet Glob. Health 2021, 9, e489–e551. [Google Scholar] [CrossRef] [PubMed]
  2. Bogacsovics, G.; Toth, J.; Hajdu, A.; Harangi, B. Enhancing CNNs through the use of hand-crafted features in automated fundus image classification. Biomed. Signal Process. Control. 2022, 76, 103685. [Google Scholar] [CrossRef]
  3. Meng, Q.; Zhang, W. Multi-label image classification with attention mechanism and graph convolutional networks. In Proceedings of the 1st ACM International Conference on Multimedia in Asia, Beijing China, 15–18 December 2019; pp. 1–6. [Google Scholar]
  4. Huo, X.; Sun, G.; Tian, S.; Wang, Y.; Yu, L.; Long, J.; Zhang, W. HiFuse: Hierarchical multi-scale feature fusion network for medical image classification. Biomed. Signal Process. Control 2024, 87, 105534. [Google Scholar] [CrossRef]
  5. Bhati, A.; Gour, N.; Khanna, P.; Ojha, A. Discriminative kernel convolution network for multi-label ophthalmic disease detection on imbalanced fundus image dataset. Comput. Biol. 2023, 153, 106519. [Google Scholar] [CrossRef]
  6. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  7. Huang, Y.; Qi, J.; Wang, X.; Lin, Z. Asymmetric polynomial loss for multi-label classification. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  8. You, G.-R.; Shiue, Y.-R.; Su, C.-T.; Huang, Q.-L. Enhancing ensemble diversity based on multiscale dilated convolution in image classification. Inf. Sci. 2022, 606, 292–312. [Google Scholar] [CrossRef]
  9. Sun, M.; Li, K.; Qi, X.; Dang, H.; Zhang, G. Contextual information enhanced convolutional neural networks for retinal vessel segmentation in color fundus images. J. Vis. Commun. Image Represent. 2021, 77, 103134. [Google Scholar] [CrossRef]
  10. Panchal, S.; Kokare, M. ResMU-Net: Residual Multi-kernel U-Net for blood vessel segmentation in retinal fundus images. Biomed. Signal Process. Control 2024, 90, 105859. [Google Scholar] [CrossRef]
  11. Tu, C.; Liu, W.; Jiang, W.; Zhao, L. Hyperspectral image classification based on residual dense and dilated convolution. Infrared Phys. Technol. 2023, 131, 104706. [Google Scholar] [CrossRef]
  12. Madarapu, S.; Ari, S.; Mahapatra, K. A multi-resolution convolutional attention network for efficient diabetic retinopathy classification. Comput. Electr. Eng. 2024, 117, 109243. [Google Scholar] [CrossRef]
  13. Romero-Oraá, R.; Herrero-Tudela, M.; López, M.I.; Hornero, R.; García, M. Attention-based deep learning framework for automatic fundus image processing to aid in diabetic retinopathy grading. Comput. Methods Programs Biomed. 2024, 249, 108160. [Google Scholar] [CrossRef]
  14. Li, Z.; Xu, M.; Yang, X.; Han, Y. Multi-label fundus image classification using attention mechanisms and feature fusion. Micromachines 2022, 13, 947. [Google Scholar] [CrossRef] [PubMed]
  15. Madarapu, S.; Ari, S.; Mahapatra, K. A deep integrative approach for diabetic retinopathy classification with synergistic channel-spatial and self-attention mechanism. Expert Syst. Appl. 2024, 249, 123523. [Google Scholar] [CrossRef]
  16. Das, D.; Nayak, D.R.; Pachori, R.B. AES-Net: An adapter and enhanced self-attention guided network for multi-stage glaucoma classification using fundus images. Image Vis. Comput. 2024, 146, 105042. [Google Scholar] [CrossRef]
  17. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  18. Ridnik, T.; Ben-Baruch, E.; Zamir, N.; Noy, A.; Friedman, I.; Protter, M.; Zelnik-Manor, L. Asymmetric loss for multi-label classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 82–91. [Google Scholar]
  19. Li, N.; Li, T.; Hu, C.; Wang, K.; Kang, H. A benchmark of ocular disease intelligent recognition: One shot for multi-disease detection. In Proceedings of the Benchmarking, Measuring, and Optimizing: Third BenchCouncil International Symposium, Bench 2020, Virtual Event, 15–16 November 2020; pp. 177–193. [Google Scholar]
  20. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  21. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  22. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  24. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  26. Chen, B.; Zhang, Z.; Liu, N.; Tan, Y.; Liu, X.; Chen, T. Spatiotemporal convolutional neural network with convolutional block attention module for micro-expression recognition. Information 2020, 11, 380. [Google Scholar] [CrossRef]
  27. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
  28. Zhang, Y.; Luo, L.; Dou, Q.; Heng, P.-A. Triplet attention and dual-pool contrastive learning for clinic-driven multi-label medical image classification. Med. Image Anal. 2023, 86, 102772. [Google Scholar] [CrossRef]
  29. Wang, J.; Yang, L.; Huo, Z.; He, W.; Luo, J. Multi-label classification of fundus images with efficientnet. IEEE Access 2020, 8, 212499–212508. [Google Scholar] [CrossRef]
  30. Sun, K.; He, M.; Xu, Y.; Wu, Q.; He, Z.; Li, W.; Liu, H.; Pi, X. Multi-label classification of fundus images with graph convolutional network and LightGBM. Comput. Biol. Med. 2022, 149, 105909. [Google Scholar] [CrossRef]
  31. Lin, J.; Cai, Q.; Lin, M. Multi-label classification of fundus images with graph convolutional network and self-supervised learning. IEEE Signal Process. Lett. 2021, 28, 454–458. [Google Scholar] [CrossRef]
  32. Li, C.; Ye, J.; He, J.; Wang, S.; Qiao, Y.; Gu, L. Dense correlation network for automated multi-label ocular disease detection with paired color fundus photographs. In Proceedings of the 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), Iowa City, IA, USA, 3–7 April 2020; pp. 1–4. [Google Scholar]
  33. Liu, S.; Wang, W.; Deng, L.; Xu, H. Cnn-trans model: A parallel dual-branch network for fundus image classification. Biomed. Signal Process. Control. 2024, 96, 106621. [Google Scholar] [CrossRef]
Figure 1. The overall structure of Fundus-DANet. The DA-CBAM is the spatial feature learning module (green), the ILM is the interclass learning module (dark blue), and the asymmetric polynomial loss is represented in pink.
Figure 2. The overall architecture of the DA-CBAM. The channels are shuffled between the two groups. Dilated convolutions are then used to extract the features at different scales from each group. These features are processed through hidden layers, and the final results are summed up.
Figure 3. The receptive field with different dilation rates for a 3 × 3 kernel size.
Figure 4. The architecture of the DA-CBAM hidden layer. The parallel processing consists of two pooling layers, two dropout layers, and an attention mechanism.
Figure 5. Structure of interclass learning module.
Figure 6. Fundus images of different categories randomly selected from the OIA-ODIR [19] dataset containing normal images (N) and six diseases, such as diabetic retinopathy (D), glaucoma (G), cataract (C), AMD (A), myopia (M), and hypertension (H), as well as other diseases (O).
Figure 7. Fundus image preprocessing. The images in (ac) are the original, cropped, and uncropped directly trained images, respectively.
Figure 8. Ablation experiment confusion matrix of OIA-ODIR.
Table 1. Overview of previous studies and key contributions of this research.
| Study | Objective | Method | Findings | Limitations | Contribution of This Study |
|---|---|---|---|---|---|
| You et al. (2020) [8] | To propose a new multiscale dilated-convolution-based ensemble learning method (MDCEL) | Introduced dilated convolution and different expansion rates in a traditional CNN with transfer learning, trained multiple learners simultaneously through an aggregation strategy, and obtained the final result using a weighted-voting method | Improved classification performance | Limited information retention and feature collection capacity | Enhanced feature extraction capabilities |
| Sun et al. (2021) [9] | To apply multiple parallel dilated-convolution multiscale fusion of contextual information | Utilized a U-shaped network to fully collect the image multiscale features | Enhanced multiscale feature extraction | Inadequate feature fusion | Used a convolutional block attention module (CBAM) and dilated convolution to extract and merge multiscale information from images |
| Panchal et al. (2022) [10] | To create the residual multi-kernel dilated-convolution U-Net model (ResMU-Net) | Combined a U-shaped network structure with skip connections and dilated convolution at each layer | Broadened receptive field and enhanced feature extraction | High complexity, prone to overfitting | Reduced model complexity |
| Tu et al. (2022) [11] | To propose a method based on residual dense and dilated convolution (RDDC-3DCNN) for hyperspectral image classification | Fused different levels of spatial–spectral features through a dilated convolution of the residual dilated dense block | Effectively extracted spatial–spectral features | High complexity, prone to overfitting | Reduced model complexity |
| Madarapu et al. (2022) [12] | To propose the multiresolution convolutional attention network (MuR-CAN) | Extracted features from the backbone network and then used dilated convolution | Superior performance on two datasets with fewer samples | Small sample size | Validated on larger datasets |
| Romero-Oraá et al. (2023) [13] | To develop a novel attentional method for retinal images | Generated independent attentional maps focusing on bright and dark structures | Improved classification accuracy | Applicable only to images with significant feature color variations | Addressed the issue of small interclass differences |
| Li et al. (2023) [14] | To create the BFPC-Net model with improved generalization | Added a residual network and an attention mechanism fusion module | Enhanced feature information extraction | Risk of information loss, overemphasis on local features | Fused multiscale features |
| Madarapu et al. (2023) [15] | To propose a parallel connection of channel and spatial attention mechanisms | Modeled interactions between all locations of the feature mapping using self-attention | Improved feature extraction and classification performance | High modeling overhead | Reduced model complexity |
| Das et al. (2023) [16] | To design the AES-Net model based on parallel connectivity by pooling and three 1 × 1 convolutional kernels | Selectively extracted rich discriminative and stage-specific features from the lesion region | Reduced model overhead | High complexity | Reduced model complexity |
Table 2. Hyperparameter settings in training.
| Hyperparameter | Value |
|---|---|
| Epochs | 30 |
| Batch size | 16 |
| Optimizer | SGD |
| Learning rate | 0.0007 |
| Momentum parameter | 0.9 |
| Weight decay parameter | 1 × 10−6 |
| Learning rate scheduler | StepLR |
| Scheduler step size | 5 |
| Learning rate decay factor | 0.9 |
Table 3. Results of comparative experiments of the backbone model.
Without preprocessing:

| Backbone | ACC | AUC | F1_Score | Final_Score | κ | HL |
|---|---|---|---|---|---|---|
| Vgg16 [21] | 88.15 | 85.66 | 73.38 | 68.60 | 46.77 | 11.85 |
| Inception-v3 [22] | 90.42 | 90.68 | 77.44 | 74.35 | 54.94 | 9.58 |
| Resnet101 [23] | 89.35 | 90.45 | 71.99 | 68.95 | 44.42 | 10.65 |
| EfficientNet-B4 [24] | 88.58 | 89.39 | 65.47 | 62.43 | 32.43 | 11.42 |
| InceptionResnetV2 [20] | 90.95 | 92.69 | 77.52 | 85.15 | 55.23 | 9.05 |

With preprocessing:

| Backbone | ACC | AUC | F1_Score | Final_Score | κ | HL |
|---|---|---|---|---|---|---|
| Vgg16 [21] | 89.97 | 91.42 | 74.86 | 72.08 | 49.96 | 10.03 |
| Inception-v3 [22] | 92.28 | 94.62 | 82.49 | 80.70 | 65.00 | 7.72 |
| Resnet101 [23] | 90.70 | 93.52 | 78.02 | 75.88 | 56.11 | 9.30 |
| EfficientNet-B4 [24] | 89.50 | 91.46 | 70.32 | 67.75 | 41.48 | 10.50 |
| InceptionResnetV2 [20] | 93.00 | 95.12 | 83.74 | 82.13 | 67.52 | 7.00 |
Table 4. Results of ODIR ablation experiments.
| Model | ACC | AUC | F1_Score | Final_Score | κ | HL |
|---|---|---|---|---|---|---|
| InceptionResnet-V2 | 91.29 | 93.62 | 77.62 | 77.59 | 55.52 | 8.70 |
| InceptionResnet-V2 + DA-CBAM | 91.40 | 93.16 | 77.89 | 75.70 | 56.06 | 8.60 |
| InceptionResnet-V2 + ILM | 90.42 | 92.70 | 74.28 | 72.00 | 49.05 | 9.58 |
| InceptionResnet-V2 + APL | 92.90 | 94.61 | 84.10 | 82.31 | 68.22 | 7.10 |
| InceptionResnet-V2 + DA-CBAM + APL | 92.81 | 94.54 | 83.25 | 81.44 | 66.54 | 7.19 |
| InceptionResnet-V2 + ILM + APL | 90.53 | 92.70 | 74.59 | 72.30 | 49.66 | 9.50 |
| InceptionResnet-V2 + DA-CBAM + ILM | 91.77 | 93.68 | 79.18 | 77.14 | 58.56 | 8.23 |
Table 5. Comparative results of attentional mechanisms.
| Attention Mechanism | ACC | AUC | F1_Score | Final_Score | κ | HL |
|---|---|---|---|---|---|---|
| Self-attention mechanism [25] | 92.39 | 94.58 | 82.59 | 80.79 | 65.20 | 7.61 |
| Spatial attention mechanism [26] | 92.40 | 94.81 | 82.14 | 80.42 | 64.33 | 7.60 |
| ECA-Net [27] | 92.68 | 94.92 | 83.62 | 81.93 | 67.25 | 7.32 |
| SGE [28] | 92.55 | 94.39 | 82.62 | 80.76 | 65.28 | 7.45 |
Table 6. Comparative experiments.
| Model | ACC | AUC | F1_Score | Final_Score | κ | HL |
|---|---|---|---|---|---|---|
| EfficientNetB3 [29] | 92.00 | 74.00 | 89.00 | 72.00 | 52.00 | – |
| MCGL-Net [30] | 94.46 | 93.50 | 91.60 | – | 64.30 | – |
| MCGS-Net [31] | – | 78.16 | 89.66 | – | 57.65 | – |
| DCNet [32] | – | 93.00 | 91.30 | 82.70 | 63.70 | – |
| CNN-Trans [33] | 80.68 | 95.90 | – | – | – | – |
| Ours | 93.00 | 95.12 | 83.74 | 82.13 | 67.52 | 7.00 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yan, Y.; Yang, L.; Huang, W. Fundus-DANet: Dilated Convolution and Fusion Attention Mechanism for Multilabel Retinal Fundus Image Classification. Appl. Sci. 2024, 14, 8446. https://doi.org/10.3390/app14188446
