1. Introduction
The retinal microvasculature is the only part of the human circulation that can be directly and non-invasively visualized in vivo [1]. Hence, it can be easily acquired and analyzed with automatic tools. As a result, retinal fundus images have a multitude of applications, including biometric identification, computer-assisted laser surgery, and the diagnosis of several disorders [2,3]. One important processing step in such applications is the proper segmentation of retinal vessels. Semantic segmentation aims at dense predictions, inferring the object class of each pixel of an image; segmenting digital retinal images therefore allows various quantitative vessel parameters to be extracted and supports more objective and accurate medical diagnoses. In particular, the segmentation of retinal blood vessels can help the diagnosis, treatment, and monitoring of diseases such as diabetic retinopathy, hypertension, and arteriosclerosis [4,5].
Deep Neural Networks (DNNs) have become the standard approach in semantic segmentation [6,7,8] and in many other computer vision tasks [9,10,11,12]. DNN training, however, requires large sets of accurately labeled data, so the availability of annotated images is becoming increasingly critical. This is particularly true in medical applications, where data collection is often difficult and expensive. For this reason, generating synthetic data is of great interest. Nevertheless, synthesizing high-resolution realistic medical images remains an open challenge. Most of the leading approaches for semantic segmentation, in fact, rely on thousands of supervised images, while supervised public datasets for retinal vessel segmentation are very small (most contain fewer than 30 images).
To address the scarcity of data, we propose a new approach for the generation of retinal images along with the corresponding semantic label-maps. Specifically, we propose a novel generation procedure based on two distinct phases. In the first phase, a generative adversarial network (GAN) [13] generates the blood vessel structure (i.e., the vasculature). The GAN is trained to learn the typical semantic label-map distribution from a small set of training samples. To generate high-resolution label-maps, the Progressively Growing GAN (PGGAN) [14] approach is employed. In the second, distinct phase, an image-to-image translation algorithm [15] translates the blood vessel structures into realistic retinal images (see Figure 1).
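For illustration, the following minimal PyTorch sketch shows how the two phases fit together at inference time. The modules, layer sizes, and names (LabelMapGenerator, VesselToRetina) are simplified stand-ins of our own devising, not the actual PGGAN [14] or image-to-image translation [15] architectures:

import torch
import torch.nn as nn

class LabelMapGenerator(nn.Module):
    # Phase 1 (toy stand-in for the PGGAN): latent noise -> vessel label-map.
    def __init__(self, z_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 64, 4, 1, 0), nn.ReLU(),  # 1x1 -> 4x4
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),     # 4x4 -> 8x8
            nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),     # 8x8 -> 16x16
            nn.ConvTranspose2d(16, 8, 4, 2, 1), nn.ReLU(),      # 16x16 -> 32x32
            nn.ConvTranspose2d(8, 1, 4, 2, 1), nn.Sigmoid(),    # 32x32 -> 64x64
        )

    def forward(self, z):
        return self.net(z)  # (B, 1, 64, 64) vessel probability map

class VesselToRetina(nn.Module):
    # Phase 2 (toy stand-in for the translator): label-map -> RGB retinal image.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, label_map):
        return self.net(label_map)

if __name__ == "__main__":
    phase1, phase2 = LabelMapGenerator(), VesselToRetina()
    z = torch.randn(1, 128, 1, 1)           # random latent code
    vessels = (phase1(z) > 0.5).float()     # phase 1: sample a vasculature
    retina = phase2(vessels)                # phase 2: render the retinal image
    print(vessels.shape, retina.shape)      # (1, 1, 64, 64) and (1, 3, 64, 64)

In the actual pipeline, each network is trained separately: the generator on the label-maps of the training set, and the translator on (label-map, image) pairs.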
The rationale behind this approach is that, in many applications, the semantic structure of an image can be learned regardless of its visual appearance. Once the semantic label-map has been generated, visual details can be incorporated using an image-to-image translation algorithm, thus obtaining realistic synthesized images. Separating the whole process into two stages simplifies the generation task and significantly reduces the number of samples required for training. Moreover, the training is very effective: we obtained retinal images of unprecedented resolution and quality, along with their semantic label-maps. It is worth noting that the proposed two-step approach also reduces the GPU memory requirements with respect to a single-step method. Finally, the GAN-based generation of label-maps allows us to synthesize a virtually unlimited number of training samples, each with a different vasculature.
To assess the usefulness and correctness of the proposed approach, the generation procedure was applied to two public datasets, DRIVE [16] and CHASE_DB1 [17]. Moreover, the two-step generation procedure was compared with a single-stage generation, in which label-maps and retinal images are generated simultaneously in two different channels (see Figure 2). In our experiments, the multi-stage approach significantly improves vessel segmentation performance when used for data augmentation. In particular, the generated data were used to train a Segmentation Multiscale Attention Network (SMANet) [18]. Comparable results were obtained by training the SMANet on the generated images in place of real data. It is interesting to note that, if the network is pre-trained on the synthesized data and then fine-tuned on real images, the segmentation results on the DRIVE dataset come very close to those of the best state-of-the-art approach [19]. When the same approach is applied to the CHASE_DB1 benchmark, the results surpass (to the best of our knowledge) those obtained by any other previously proposed method.
This paper is organized as follows. In Section 2, the related literature is reviewed. Section 3 presents a description of the proposed approach. Section 4 shows and discusses the experimental results. Finally, Section 5 draws conclusions and outlines future perspectives.
4. Results and Discussion
In this section, we provide both qualitative and quantitative evaluations of the generated data. In particular, some qualitative results of the generated retinal images for the DRIVE and CHASE_DB1 datasets are given in Figure 6 and Figure 7.
In Figure 8, a zoomed-in view of a random patch of a high-resolution generated image shows that the image-to-image translation effectively converts the generated vessel structures into retinal images while preserving the semantic information provided by the label-map. It is worth noting that, although most of the generated samples closely resemble real retinal fundus images, a few examples are clearly sub-optimal (see Figure 9, which shows disconnected vessels and an unrealistic optic disc).
To further validate the quality of the generation process, a sub-sample of 100 synthetically generated retinal images was examined by an expert ophthalmologist. The evaluation showed that 35% of the images are of medium-high quality. The remaining 65% are visually appealing but contain small details that reveal an unnatural anatomy, such as an optic disc with feathered edges (which actually occurs only in specific diseases) or blood vessels that pass too close to the macula (whereas, except in the case of malformations, the macular region is usually avascular or at least paucivascular).
Table 1 compares the characteristics of the proposed method with those of other learning-based approaches for retinal image generation found in the literature.
It can be observed that our approach synthesizes higher-resolution images from fewer training samples than methods that generate both the image and the corresponding segmentation. Moreover, for such methods, the usefulness of including synthetic images in semantic segmentation was not assessed. In this paper, instead, we demonstrate that synthetic images can be effectively used for data augmentation, which indirectly confirms the high quality of the generated data.
Indeed, the quantitative analysis consists of assessing the usefulness of the generated images for training a semantic segmentation network. This approach, similar to [77], is based on the assumption that the performance of a deep learning architecture is directly related to the quality and variety of the GAN-generated images. The generation procedure described in Section 3 was employed to generate 10,000 synthetic retinal images for both the DRIVE and the CHASE_DB1 datasets; the samples were generated in a single run, without any selection strategy. To evaluate the usefulness of the generated data for semantic segmentation, we employed the following experimental setup (a code sketch of the three regimes is given after the list):
SYNTH—the segmentation network was trained using only the 10,000 generated synthetic images;
REAL—only real data were used to train the semantic segmentation network;
SYNTH + REAL—synthetic data were used to pre-train the semantic segmentation network and real data were employed for fine-tuning.
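The three regimes can be summarized by the following sketch, where SMANet, synth_loader, and real_loader are placeholders for the network of [18] and for data loaders over the synthetic and real training sets, respectively; the hyper-parameters are illustrative, not the ones used in our experiments:

import torch

def train(model, loader, epochs, lr=1e-4):
    # Standard pixel-wise training loop for vessel vs. background segmentation.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for images, label_maps in loader:
            opt.zero_grad()
            loss = loss_fn(model(images), label_maps)
            loss.backward()
            opt.step()
    return model

# SYNTH: train from scratch on the 10,000 generated images only.
#   model = train(SMANet(), synth_loader, epochs=20)
# REAL: train from scratch on the real training split only.
#   model = train(SMANet(), real_loader, epochs=20)
# SYNTH + REAL: pre-train on synthetic data, then fine-tune on real data
# (a lower learning rate for fine-tuning is a common choice).
#   model = train(SMANet(), synth_loader, epochs=20)
#   model = train(model, real_loader, epochs=20, lr=1e-5)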
Table 2 and Table 3 report the results of vessel segmentation for the DRIVE and CHASE_DB1 datasets, respectively.
It can be observed that the semantic segmentation network, trained on synthetic data only, produces results very similar to those obtained by training on real data. This demonstrates that the synthetic images effectively capture the training image distribution, so that they can be used to adequately train a deep neural network. Moreover, if fine-tuning on real data is applied after pre-training on synthetic data, the results further improve with respect to the use of real data alone. This indicates that the generated data can be effectively used to enlarge small training sets, such as DRIVE and CHASE_DB1. Specifically, the AUC improves on both the DRIVE and the CHASE_DB1 datasets.
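As a reference for how such figures can be obtained, the snippet below computes the pixel-wise AUC of a probability map against a ground-truth vessel mask. The function and variable names are our own, and restricting the computation to the field of view (FOV) is one possible convention, which varies across papers:

import numpy as np
from sklearn.metrics import roc_auc_score

def vessel_auc(prob_map, ground_truth, fov_mask):
    # prob_map:     (H, W) float array of predicted vessel probabilities
    # ground_truth: (H, W) binary array, 1 = vessel
    # fov_mask:     (H, W) binary array, 1 = pixel inside the FOV
    inside = fov_mask.astype(bool)
    return roc_auc_score(ground_truth[inside], prob_map[inside])

# Toy usage with random data (DRIVE images are 565 x 584 pixels):
rng = np.random.default_rng(0)
probs = rng.random((584, 565))
gt = (rng.random((584, 565)) > 0.9).astype(int)
fov = np.ones((584, 565), dtype=int)
print(vessel_auc(probs, gt, fov))  # close to 0.5 for random predictions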
Another set of experiments was designed to compare the proposed two-stage generation procedure with a traditional single-step approach, in which the label-maps and the retinal images are generated simultaneously. The results of the single-step approach on the DRIVE and CHASE_DB1 datasets are shown in Table 4 and Table 5.
Table 6 and Table 7 allow us to quickly visualize the differences between the two methods. It can be observed that the two-stage generation approach obtains better results in all the setups. In particular, if only synthetic data are used, the AUC increases by 5.01% (31.68%) with the two-stage method on the DRIVE (CHASE_DB1) dataset. As expected, the difference between the two methods is smaller when fine-tuning on real data is applied. We also observe that the gap widens at higher image resolutions: on the CHASE_DB1 dataset, whose images have twice the resolution of those in DRIVE, the one-step generated images cannot be effectively used for data augmentation.
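Here, the percentages are assumed to denote relative improvements, i.e., the increase of the two-stage AUC over the one-stage AUC divided by the latter; a minimal computation, with purely illustrative values, is:

def relative_increase(auc_two_stage, auc_one_stage):
    # Relative AUC improvement of the two-stage over the one-stage pipeline (%).
    return 100.0 * (auc_two_stage - auc_one_stage) / auc_one_stage

print(relative_increase(0.97, 0.92))  # illustrative values -> about 5.4%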
Finally, Table 8 and Table 9 compare the proposed approach with other state-of-the-art techniques.
The results show that the proposed approach reaches the state of the art on the DRIVE dataset, where it is outperformed only by [19] in terms of AUC, and outperforms all the other methods on the CHASE_DB1 dataset. It is worth remembering that the experimental setups adopted by the previous approaches vary, so that a perfect comparison is impossible. For example, CHASE_DB1 does not provide an explicit training/test split: in [64,65], the same split as in this paper was employed, while in [19,69,71] a fourfold cross-validation strategy was applied (in [71], each fold included three images of one eye and four images of the other). Moreover, in [59], only patches fully inside the field of view were considered. However, even with these inevitable experimental limits, the results in Table 8 and Table 9 suggest that the proposed method is promising and at least as good as the best state-of-the-art techniques.