1. Introduction
Diabetic retinopathy (DR) is a common complication of diabetes that manifests as damage to the retinal blood vessels and is the leading cause of blindness [1]. Compared to the large number of DR patients, there is a severe shortage of ophthalmologists, which reduces the capacity to examine DR patients and delays treatment [2,3,4]. Automated DR diagnosis allows medical experts to perform early, regular, and real-time examinations more easily, slowing or averting the progression of patients' vision impairment. It also saves time, cost, and medical resources in response to the increasing prevalence of DR [3,4,5].
Applications of automated DR detection involve the classification of the presence and severity of DR; the segmentation of lesions such as blood vessels, hemorrhages, and exudates; and the localization and segmentation of the optic disk, macula, and fovea [3,4,5,6]. CNN-based deep learning has attracted major interest in DR detection and provides better performance than traditional approaches [3,4,6]. Numerous studies on DR detection using deep learning have been reported. Ensemble learning combines deep learning with other machine learning algorithms such as principal component analysis (PCA), support vector machines (SVMs), and random forests (RFs). Most of the data used for automated DR detection systems have been fundus images, with some tasks using optical coherence tomography images. The performance of DR detection is usually assessed by accuracy, sensitivity, specificity, precision, F1 score, etc. [3,4,5,6].
The application of deep learning to identifying DR can be considered as having two parts: the first is a feature extractor (i.e., encoder), and the second classifies DR using the extracted features. CNNs are often used as the backbone networks. Features extracted by a CNN model are then used by a dense neural network or by other machine learning algorithms such as SVM, RF, decision trees, Gaussian techniques, or PCA to perform the classification task [3,4]. Learning is usually implemented with the supervised learning baseline, and state-of-the-art models such as VGG, ResNet, and those in the Inception family are commonly employed to extract features [3,4,6]. Because these CNN models have been trained on ImageNet, they can be used as pre-trained models with the learned parameters. Such pre-trained models are popularly used to initialize the feature extractor via transfer learning, which involves removing the top layer of the network [7,8,9,10,11,12,13,14]. In addition to making use of these off-the-shelf networks, CNN-based models are also customized or modified and trained from scratch to learn DR features [15,16,17]. Beyond multilayer perceptrons (MLPs) and the various machine learning methods mentioned above, some studies used ensemble methods to produce classifiers, i.e., they used multiple learning algorithms to obtain better prediction performance than any one algorithm alone [4]. As described by Zhang et al. [8], the classifier is the average of three softmax outputs, each of which is the final output of a four-layer MLP. Tymchenko et al. [10] and Qummar et al. [18] also combined several encoders to form an ensemble model, whose final output is the average of the fused results. In Antal et al. [19], several classifiers using different algorithms such as decision trees, MLP, SVM, and RF were trained to construct an ensemble classifier. Even more interesting is the use of a Siamese-like CNN to process the left and right retinal fundus images of each patient [20]. Designing and adding a useful module to the CNN backbone has been used to address the imbalanced DR grading problem [13]. Another line of work applies self-supervised learning to train a model for retinal disease diagnosis using multimodal data, consisting of raw and transformed fundus images as well as synthesized fundus fluorescein angiography (FFA) data generated by a GAN model [21]. Small network modules comprising a selected number of layers (convolution, batch normalization, ReLU, and max pooling) have been used to learn salient features from the data [22]. A deep-feature generator has also been designed using a non-fixed-size patch division model [23]. Since most DR datasets lack sufficient data or have an imbalanced distribution between classes, several approaches have been considered to deal with this problem. These include using transfer learning to exploit information from models pre-trained on large-scale datasets, applying augmentation techniques or generating synthetic data through GANs to increase the diversity of the data [4], and developing a self-training deep neural network model to utilize unlabeled data [24]. Zhu et al. proposed a brain tumor segmentation method based on the fusion of a semantic segmentation module, an edge detection module, and a feature fusion module; this fusion method outperforms several state-of-the-art brain tumor segmentation methods [25].
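The ensemble strategy described above for Zhang et al. [8] averages several softmax outputs before taking the final decision. The following minimal sketch illustrates that averaging step in pure Python; the function names and the three-class probability vectors are our own invention for illustration, not taken from the cited works:

```python
# Minimal sketch of softmax-output averaging for an ensemble of classifiers.
# The member outputs below are hard-coded stand-ins for real model predictions.

def average_softmax(outputs):
    """Average several softmax probability vectors element-wise."""
    n = len(outputs)
    length = len(outputs[0])
    return [sum(o[i] for o in outputs) / n for i in range(length)]

def ensemble_predict(outputs):
    """Return the class index with the highest averaged probability."""
    avg = average_softmax(outputs)
    return max(range(len(avg)), key=avg.__getitem__)

# Three hypothetical ensemble members voting over three DR severity classes:
members = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
]
print(ensemble_predict(members))  # prints 1: class 1 wins on average
```

Averaging the probabilities, rather than the hard votes, lets a strongly confident member outweigh two weakly confident ones, which is one reason softmax averaging is a common fusion choice.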
Modern deep learning algorithms have recently produced unprecedented achievements in object recognition. Large-scale datasets are one of the main factors determining the success of deep-learning-based models on visual tasks [26]. This demonstrates that more training data allow the learning algorithm to obtain more meaningful and discriminating representations from the data. The models depend on training data; generally, the more data, the better the results, as large amounts of data help avoid overfitting and enable the development of more sophisticated and robust models [27]. However, unlike natural images, medical data are not readily available, especially images from samples medically identified as abnormal, because the number of patients with a given disease is much lower than the number of healthy individuals. Annotation is another factor that makes it difficult to obtain sufficient medical image data. As a result, a shortage of data and an imbalanced interclass distribution often arise in real-world medical settings [28,29]. Unfortunately, neural networks suffer greatly from imbalanced learning due to class-imbalanced data distributions, and they often suffer from overfitting due to the lack of sufficient medical data [30,31].
This challenging situation also exists in DR image data. Because relatively few samples with DR symptoms are available for training a model, the model is biased towards representing the majority category without DR symptoms, leading to false-negative predictions. For a medical diagnosis, a false-negative prediction is more serious and dangerous to patients than a false-positive one because it ignores the disease [32]. Therefore, attention should be paid to this biased classification problem caused by the skewed distribution of training data. However, there are currently very few studies dedicated to DR detection with imbalanced learning.
This paper focuses on learning DR detection from such an interclass imbalanced fundus image dataset. Its purpose is to propose methods that overcome this biased DR detection in order to reduce the misdiagnosis rate and improve the performance of the DR detection system. The principle of supervised learning is to learn to make decisions in the direction of given supervision signals related to target tasks, and the lack of sufficient labeled data limits its generalization ability. To overcome this and further improve the ability of the model to recognize under-represented data, this paper trains the model with a self-supervised or semi-supervised learning-based approach to facilitate learning. Both semi-supervised and self-supervised learning are effective ways to leverage information from unlabeled data, avoiding the expensive cost of collecting and annotating large datasets.
Self-supervised learning, a subset of unsupervised learning methods, has been proposed to learn features from the images themselves without any annotation [33]. Contrastive self-supervised learning is a technique for learning representations by comparing multiple input samples, and its emergence has significantly narrowed the gap between unsupervised and supervised learning. In recent years, its promising performance in both computer vision (CV) and natural language processing (NLP) has shown that the underlying latent representations can be learned from unlabeled data [34]. Semi-supervised learning is a paradigm that uses a combination of labeled and unlabeled data to train a model. Adding unlabeled samples to the training dataset changes the distribution of the original dataset, which consequently affects the decisions the model makes. If two points, x1 and x2, are close in a high-density region, then the corresponding outputs y1 and y2 should also be close. Under such a smoothness assumption, the additional unlabeled data help the model to find a more accurate decision boundary [33]. According to recent advances in deep learning, semi-supervised learning outperforms supervised learning that uses only labeled data [35].
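To make the self-training flavor of semi-supervised learning concrete, the following toy sketch pseudo-labels confident unlabeled points and re-fits a one-dimensional nearest-centroid classifier. Everything here (the classifier, the confidence margin, and the data) is our own simplified illustration of the general idea, not the model used in this study:

```python
# Toy self-training loop: a 1-D nearest-centroid classifier repeatedly
# pseudo-labels the unlabeled points that fall clearly closer to one centroid,
# then re-estimates the centroids from the enlarged labeled set.

def centroids(labeled):
    """Mean of the points in each of the two classes."""
    c = {}
    for cls in (0, 1):
        pts = [x for x, y in labeled if y == cls]
        c[cls] = sum(pts) / len(pts)
    return c

def self_train(labeled, unlabeled, margin=1.0, rounds=5):
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        c = centroids(labeled)
        confident, rest = [], []
        for x in unlabeled:
            d0, d1 = abs(x - c[0]), abs(x - c[1])
            if abs(d0 - d1) >= margin:   # pseudo-label only confident points
                confident.append((x, 0 if d0 < d1 else 1))
            else:
                rest.append(x)
        if not confident:                # nothing confident left: stop early
            break
        labeled.extend(confident)
        unlabeled = rest
    return centroids(labeled)

labeled = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
unlabeled = [0.5, 1.5, 8.5, 9.5, 5.2]
print(self_train(labeled, unlabeled))  # ambiguous point 5.2 is never labeled
```

Note how the ambiguous point near the decision boundary is left unlabeled rather than forced into a class; confidence thresholding of pseudo labels is what keeps self-training from amplifying its own mistakes.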
Few studies have applied either self-supervised or semi-supervised learning to DR detection. Common approaches in other fields usually employ these learning methods to perform the target tasks directly. Instead, self-supervised and semi-supervised learning are both used as a wrapper algorithm in this study. Specifically, the model is first pre-trained on unlabeled retinal fundus data with self-supervised or semi-supervised learning; the learned representations are then transferred to a model that is fine-tuned for DR detection with supervised learning on the labeled data, so that the useful knowledge learned from unlabeled data in the same domain can guide the target model towards appropriate parameters from the beginning. Adding a wrapper algorithm that utilizes unlabeled DR data to the basic supervised learning baseline shows better performance than using supervised learning alone. Furthermore, the combination of self-supervised and semi-supervised learning proposed in this study also improves accuracy and significantly reduces training time. This novel approach is very useful when labeled data are too sparse to train a model for annotating unlabeled data. Regarding imbalanced learning, the biased model caused by imbalanced labels can be re-balanced to some extent under the influence of unlabeled data. To deal further with the imbalanced learning problem, the classifier of the model is additionally fine-tuned on a re-balanced training dataset obtained by re-sampling.
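The re-sampling step mentioned at the end of this pipeline can be sketched as a simple down-sampling of the majority class to the size of the smallest class. The helper name, seed, and toy fundus filenames below are our own choices for illustration:

```python
import random

# Sketch of re-balancing a labeled dataset by down-sampling every class
# to the size of the smallest one, before fine-tuning the classifier.

def downsample_balance(samples, seed=0):
    """samples: list of (item, label); returns a class-balanced subset."""
    by_class = {}
    for item, label in samples:
        by_class.setdefault(label, []).append((item, label))
    smallest = min(len(v) for v in by_class.values())
    rng = random.Random(seed)  # fixed seed for a reproducible draw
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Hypothetical imbalanced fundus dataset: 6 "no DR" images vs. 2 "DR" images.
data = [(f"img{i}", 0) for i in range(6)] + [("img6", 1), ("img7", 1)]
balanced = downsample_balance(data)
print(len(balanced))  # prints 4: two samples per class
```

Down-sampling discards majority-class data, which is why it is applied here only to the short classifier fine-tuning stage rather than to the full feature-learning stage.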
The experimental results demonstrate that the proposed methods enhance DR detection over a supervised learning baseline by improving performance on imbalanced data. Accuracy (ACC), sensitivity (TPR), and specificity (TNR) were used as evaluation metrics. When evaluating the model on the balanced EyePaCS [36] test dataset, the false-negative error rate was reduced from 100% to 14.8%, and the accuracy improved from only 50% to 86.4%. The best performance was obtained on the balanced DDR [37] test dataset, with an ACC of 89.62%, TPR of 86.39%, and TNR of 92.84%; this model was trained on the smaller balanced training set. The model trained on the imbalanced training dataset achieved an ACC of 89.50%, TPR of 87.81%, and TNR of 91.18%. The results obtained on both the EyePaCS and DDR test data are higher than those reported previously [13,15]. Test results on the Messidor-2 [38] dataset significantly outperformed the state-of-the-art results [14], with an ACC of 90.68%, a TPR of 86.0%, and a TNR of 92.33%. The models trained with the wrapper algorithm, which leverages unlabeled DR data through self-supervised or semi-supervised learning, are 4~5% higher in accuracy than those trained only on the labeled data with supervised learning. It can be observed that the performance of the model trained on a relatively small but balanced training set is not worse than that of the model trained on a relatively large but imbalanced dataset. This reveals that it is not necessary to feed deep-learning-based CNN models a large-scale dataset to obtain better performance. Under circumstances where it is very difficult to collect clean, labeled medical data, it is crucial to make the model work well with a small amount of data. The proposed method can be applied to any deep learning model. The main contributions of this study can be outlined as follows:
(1). We present a method in the form of a wrapper algorithm to help improve DR detection using supervised learning. It uses semi-supervised or self-supervised learning to first learn and gain features from unlabeled data, and then transfers these learned features to a model using supervised learning to optimize learning.
(2). A combination of self-supervised and semi-supervised learning is used to perform DR detection, and this combined approach can significantly reduce the learning time while providing a viable solution for training models when labeled data are particularly scarce.
(3). To analyze the impact of data imbalance on learning, this study draws two different training datasets, one larger and imbalanced and the other smaller and balanced, using down-sampling. All experiments are conducted on these two datasets, and the results show that the small balanced dataset is more advantageous for training models than the larger imbalanced one.
(4). To examine the proposed method, we conduct experimental tests on three different DR datasets: EyePaCS, DDR, and Messidor-2.
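The evaluation metrics used throughout (ACC, TPR, TNR) follow the standard confusion-matrix definitions. This small helper, with made-up confusion counts rather than the study's actual ones, shows how they are computed:

```python
# Accuracy, sensitivity (TPR), and specificity (TNR) from confusion counts.
# TP/FN concern the DR-positive class; TN/FP concern the DR-negative class.

def dr_metrics(tp, fn, tn, fp):
    acc = (tp + tn) / (tp + fn + tn + fp)  # fraction of correct predictions
    tpr = tp / (tp + fn)                   # sensitivity: DR cases detected
    tnr = tn / (tn + fp)                   # specificity: healthy cases kept
    return acc, tpr, tnr

# Made-up confusion counts for a balanced test set of 200 fundus images:
acc, tpr, tnr = dr_metrics(tp=85, fn=15, tn=92, fp=8)
print(f"ACC={acc:.2%} TPR={tpr:.2%} TNR={tnr:.2%}")
# prints: ACC=88.50% TPR=85.00% TNR=92.00%
```

Reporting TPR alongside ACC matters here because, on an imbalanced test set, a model that predicts "no DR" for everything can still score a high accuracy while its sensitivity is zero.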
The remainder of this paper is organized as follows: Section 2 describes the proposed methods for DR detection. Section 3 presents experiments, result analyses, and comparisons with previous results. Finally, the discussion and conclusion are given in Section 4 and Section 5, respectively.
4. Discussion
Our experimental results demonstrate that the proposed methods effectively deal with the imbalanced learning challenge, in which the model is otherwise completely biased towards the majority class. The methods are not limited to a particular network and can be applied to different backbone networks as well. The results could be improved further if a larger-scale unlabeled dataset were collected and used, which was not possible with the computational power available. Nonetheless, the feasibility, effectiveness, and generality of the proposed methods are demonstrated experimentally.
The improved performance of the proposed methods in DR detection can be attributed to utilizing unlabeled data from the same domain and to the specific strategies for dealing with imbalanced learning when training a model on an imbalanced dataset. The experimental results demonstrate that the sensitivity of the model can be increased substantially by each of the wrapped self-supervised or semi-supervised algorithms. This means that a model with classification bias caused by an imbalanced data distribution can be re-balanced to give more attention to the minority class (samples with DR disease), and thus its performance can be improved. Moreover, a further improvement was observed when combining self-supervised learning with semi-supervised learning; this approach led to a significant reduction in training time. The effectiveness of this approach is evident: the model trained with supervised learning on labeled data alone did not perform as well as the model fine-tuned from the pre-trained model obtained by self-supervised learning.
However, self-training using semi-supervised learning is relatively inefficient due to its iterative process. The semi-supervised learning method also still suffers from the imbalanced dataset problem, which remains a major challenge for semi-supervised learning; the current practice is to use a fully balanced dataset constructed with reference to the pseudo labels. This is only feasible when pseudo labels can be generated; when this is not possible in practice, it is feasible to use self-supervised learning to train the model first. The combination of semi-supervised and self-supervised learning is therefore also a novel research direction. Although it is difficult to interpret how self-supervised learning acquires the underlying representations, self-supervised learning is said to be the AI principle closest to the way humans learn to see.
It can be seen from the two experiments on the imbalanced and balanced datasets that the performance of the model trained on the relatively smaller balanced dataset was not worse than that of the model trained on the relatively larger imbalanced dataset. Furthermore, when using the small balanced dataset, there was no need to employ the additional re-balancing strategy of fine-tuning the classifier; it also saved computational costs because of its shorter training process. This raises the following question: why use large imbalanced datasets if there are appropriate ways to make the model perform as effectively on smaller datasets as it does on larger ones? This indicates that deep-learning-based CNN models do not necessarily have to use large datasets to achieve better performance. Given the extraordinary difficulty of collecting clean, labeled medical data, it is crucial to make the model perform well with a small amount of data. If a model achieves good results on a small dataset, then it becomes much less dependent on large amounts of labeled data.
Both self-supervised and semi-supervised learning have the potential to learn from unlabeled data, and there should be more research in this field in the future. Studies in self-supervised learning have used interclass-balanced benchmark datasets and have not yet considered the problem of imbalanced learning. Although it is not possible to categorize features learned from unlabeled data, there is no doubt that the learned features will be more representative of the majority class if imbalanced datasets are used to train models through self-supervised learning. Therefore, future research on self-supervised learning with imbalanced data should be initiated, and semi-supervised deep learning algorithms should be improved for imbalanced learning. In addition, a DR diagnostic system should not only discern the presence or absence of symptoms but also grade the disease, so that patients can understand the status of their illness in more detail and take appropriate measures.
This study is limited to binary classification. Using self-training semi-supervised learning, which is not an efficient method, is another limitation of our work. The current semi-supervised learning method with relatively better performance follows the consistency regularization paradigm; however, since it also suffers from imbalanced learning, such models have been restricted to training on balanced datasets. Because we focused only on the learning problem arising from the use of imbalanced labeled datasets, we adopted only the self-training method of iteratively generating balanced data using the reference provided by pseudo labeling. The quantity of unlabeled data used in this work did not reach one hundred times the labeled data, limited by the availability of fundus data and by computational resources. This study focused on how to utilize unlabeled data in the method rather than on improving performance through a huge amount of data.