Article

An Unsupervised Fundus Image Enhancement Method with Multi-Scale Transformer and Unreferenced Loss

1 School of Electronic and Electrical Engineering, Wuhan Textile University, Wuhan 430077, China
2 School of Computer Science, Wuhan University, Wuhan 430072, China
3 School of Information Engineering, Tarim University, Alaer 843300, China
* Authors to whom correspondence should be addressed.
Electronics 2023, 12(13), 2941; https://doi.org/10.3390/electronics12132941
Submission received: 1 May 2023 / Revised: 12 June 2023 / Accepted: 19 June 2023 / Published: 4 July 2023
(This article belongs to the Special Issue Signal, Image and Video Processing: Development and Applications)

Abstract

Color fundus images are now widely used in computer-aided analysis systems for ophthalmic diseases. However, fundus imaging is affected by human, environmental, and equipment factors, which can result in low-quality images that interfere with computer-aided diagnosis. Existing methods for enhancing low-quality fundus images focus on the overall appearance of the image rather than sufficiently capturing pathological and structural features at the finer scales of the fundus image. In this paper, we design an unsupervised method that integrates a multi-scale feature fusion transformer and an unreferenced loss function. Because unpaired training can cause the loss of microscale features, we construct the Global Feature Extraction Module (GFEM), a combination of convolution blocks and residual Swin Transformer modules, to extract feature information at different levels while reducing computational cost. To counteract the blurring of image detail caused by deep unsupervised networks, we define unreferenced loss functions that improve the model's ability to suppress edge sharpness degradation. In addition, since uneven light distribution also degrades image quality, we use an a priori luminance-based attention mechanism to correct illumination unevenness in low-quality images. On the public dataset, we achieve an improvement of 0.88 dB in PSNR and 0.024 in SSIM compared to state-of-the-art methods. Experimental results show that our method outperforms other deep learning methods in terms of vascular continuity and preservation of fine pathological features. Such a framework may have potential medical applications.

1. Introduction

With the development of computer-aided diagnosis, color fundus images are used to diagnose and monitor early ophthalmic disease. However, image quality can be degraded in many ways and to varying degrees. Major contributing factors, such as uneven illumination, low contrast, artifacts, and blur, limit the analysis and diagnosis of fundus images. Low-quality images also reduce the accuracy of downstream image processing tasks such as disease classification [1,2,3,4,5] and retinal vessel segmentation [6,7,8,9]. Therefore, the influence of degradation factors on the image must be eliminated before computer-aided analysis can be conducted for diagnosis.
Traditional image enhancement methods include histogram equalization, spatial domain filtering, and greyscale transformations. However, these methods generalize poorly and sometimes produce artifacts that do not exist in the original when processing low-quality color fundus images, affecting diagnostic accuracy and preventing direct application to fundus images. With the development of deep learning, many learning-based methods have achieved good results in image enhancement.
In the field of natural images, for example, Wei et al. [10] present Retinex-Net, a method based on Retinex theory for enhancing low-illumination images. Wang et al. [11] propose GLAD-Net, which uses an encoder–decoder network to extract prior global illumination information from low-illumination images and thereby guide the enhancement process. In the realm of medical images, Li et al. [12] introduce NuI-Go, an enhancement network that removes uneven illumination from color fundus images using a non-local semantic module. Lee et al. [13] propose a deep learning-based model that automatically enhances low-quality retinal fundus images, improving image quality and counteracting multiple forms of degradation. However, large numbers of medical fundus images are hard to obtain, and it is impractical to obtain pairs of high- and low-quality fundus images of the same patient. Thus, these supervised learning-based methods do not apply to fundus image enhancement.
To avoid the dependence on image pairs, Guo et al. [14] propose Zero-DCE, an unsupervised learning-based low-illumination enhancement method with a series of unreferenced loss functions. Cheng et al. [15] introduce EPC-GAN, a method that includes a contrast loss function and a fundus prior loss function based on a pre-training task for classifying diabetic retinopathy classes. This approach strives to prevent alteration of important information in the images and over-enhancement problems. To ensure the preservation of structural features in the fundus while enhancing the image, Ma et al. [16] present the Still-GAN method. This approach incorporates a luminance constraint loss and a structure-preserving loss function into CycleGAN, in order to maintain the structural information in the image. Both of these methods prove effective in enhancing the dark areas of the fundus. However, they can cause some blurring or semantic disruption of pathological areas due to the lack of treatment of pathological features.
For the field of color fundus images, where high- and low-quality image pairs cannot be acquired, generative adversarial networks are more suitable for achieving color fundus image enhancement tasks. Inspired by EnlightenGAN [17], we propose an unsupervised learning-based enhancement method for color fundus images. Existing GAN networks face difficulties in capturing structural and pathological features such as retinal blood vessels and lesion characteristics. Furthermore, fine-scale features are susceptible to being blurred or erased by deeper network models. Consequently, we incorporate the Swin Transformer into the generator structure, aiming to amplify the network’s capacity for learning structured features. The main contributions of this paper are as follows:
  • Based on the Swin Transformer, we propose a global feature extraction module. The module first extracts shallow features using convolution layers and then extracts deep features using multiple RSTBs (Residual Swin Transformer Blocks). We embed the module in a bottleneck layer to learn the correlation between local and global information in the deep feature space. We also enhance the ability of the model to learn global features at the spatial level.
  • We devise a luminance attention mechanism, based on a priori knowledge. This mechanism, implemented through a luminance map, is integrated within the original same-layer skip connections of the U-Net and merged with the feature map. As a result, the attention mechanism imposes constraints that contribute to a more balanced brightness in the enhanced image.
  • We improve the discriminator used for the natural image enhancement task by designing a non-reference color fundus image quality loss function based on this network to make it more suitable for color fundus image enhancement. We introduce an illumination loss function and a structure-preserving loss function. These functions allow the network to preserve vascular and pathological features during enhancement, reducing fictitious features and feature loss.

2. Related Work

2.1. Traditional Approaches

Image enhancement has been a long-standing research topic in computer vision, and there are some classical approaches, such as Retinex [18] and multi-scale Retinex models [19]. Ref. [20] improves the ability to enhance low-light images based on the Retinex model by adding a noise map to the model. Guo et al. [21] propose a method that estimates the illuminance of each pixel by finding the maximum value across its RGB channels and then imposes a structure prior to refine the illumination map. These methods are equally applicable to color fundus image enhancement. For example, ref. [22] presents a method using contrast-limited adaptive histogram equalization to recover images. Dai et al. [23] introduce a method to enhance fundus images by fusing background information with the original retinal image. While these methods are effective, they rely on global image statistics and mapping functions that may produce non-existent artifacts and distortion.

2.2. Deep Learning Approaches

Deep learning techniques, while well-established and effective in many areas of computer vision, face unique challenges when applied to the enhancement of fundus images. For instance, Eilertsen et al. [24] apply a pixel-level loss function to constrain the model for high-dynamic-range reconstruction, while Dong et al. [25] employ convolutional networks for super-resolution reconstruction. Despite these efforts, the subtle pathological features and unique structure of the retina make it difficult for general convolution-based neural networks to enhance fundus images effectively. NuI-Go, a deep learning-based method for retinal image enhancement, strives to address these challenges by progressively removing inhomogeneous illumination through recursive residual learning. However, this method may generate artificial features not present in the original image, potentially misleading clinical diagnosis.

2.3. Generating Adversarial Networks

The main objective of a generative adversarial network is to create images that closely resemble real ones. Goodfellow [26] first introduces the GAN, a network composed of two parts, a generator and a discriminator, which optimizes a specific loss function through adversarial learning. Mirza et al. [27] put forward a conditional GAN that can produce new image classes by adding extra label constraints to the input. Zhu et al. [28] present a cycle consistency loss function and its network model CycleGAN. This innovation removes the need for paired training data, facilitating the transformation of images between two domains. You et al. [29] propose the Cycle-CBAM method for color fundus image enhancement, which applies cycle-consistency loss to regulate the conversion between high- and low-quality images. Although this approach improves the subjective quality of the image to some extent, it may lead to the generation of non-existent vascular structures.

3. Method

As shown in Figure 1, we take U-Net as the generator and use a global discriminator and a local discriminator to guide the network model. In this section, we first introduce two key modules, the global feature extraction module and the luminance attention module. We then describe the discriminator and the loss functions in detail.

3.1. Global Feature Extraction Module

Raj et al. [30] argue that the bottleneck layer at the bottom of U-Net assists in extracting non-local image feature information. Based on this, they add a global feature extraction module consisting of a Residual Dense Block (RDB) to the bottleneck layer of U-Net. Drawing inspiration from SwinIR, we introduce the Global Feature Extraction Module (GFEM), as shown in Figure 1. GFEM extracts deep-level feature data and learns the correlation between local and global information, thereby further enhancing the vascular continuity of the enhanced images. The procedure is as follows:
$F_{ea} = M_{GFEM}(x)$
where $F_{ea}$ denotes the extracted deep feature data, $M_{GFEM}(\cdot)$ denotes the global feature extraction module, and $x$ denotes the input. GFEM contains convolution layers and residual Swin Transformer blocks (RSTBs). The process of extracting intermediate feature information with the RSTBs can be expressed as follows:
$M_i = M_{RSTB}(M_{i-1}), \quad i = 1, 2, \ldots, N_{RSTB}$
$D = \mathrm{Conv}(M_{N_{RSTB}})$
where $M_{RSTB}(\cdot)$ denotes the RSTB module, and $M_i$ denotes the $i$-th intermediate feature map. $D$ denotes the final extracted deep feature data, and $\mathrm{Conv}(\cdot)$ denotes the convolution layer. The feature information obtained through this module provides the basis for the subsequent fusion of feature information.
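To make the structure above concrete, the following is a minimal PyTorch sketch of a GFEM-style module: shallow convolutional features followed by a stack of residual Swin Transformer blocks and a final convolution. The RSTB here wraps caller-supplied layers (a plain convolution is used as a placeholder when none is given); the channel count, block count, and depth are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class RSTB(nn.Module):
    """Residual Swin Transformer Block (sketch): a stack of transformer-style
    layers followed by a 3x3 conv, wrapped in a residual connection (as in SwinIR).
    `make_stl` builds one layer; any module mapping (B, C, H, W) -> (B, C, H, W) works."""
    def __init__(self, dim, depth, make_stl):
        super().__init__()
        self.body = nn.Sequential(*[make_stl(dim) for _ in range(depth)])
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, x):
        return x + self.conv(self.body(x))      # M_i = x + Conv(STL(...STL(x)))


class GFEM(nn.Module):
    """Global Feature Extraction Module: shallow conv features, N_RSTB residual
    Swin Transformer blocks, and a final conv, i.e. D = Conv(M_{N_RSTB})."""
    def __init__(self, dim=256, n_rstb=2, depth=2, make_stl=None):
        super().__init__()
        # Placeholder STL: replace with real Swin Transformer layers in practice.
        make_stl = make_stl or (lambda d: nn.Conv2d(d, d, 3, padding=1))
        self.shallow = nn.Conv2d(dim, dim, 3, padding=1)
        self.blocks = nn.Sequential(*[RSTB(dim, depth, make_stl) for _ in range(n_rstb)])
        self.conv_last = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, x):                        # x: bottleneck feature map (B, C, H, W)
        m = self.shallow(x)                      # shallow features
        m = self.blocks(m)                       # M_1, ..., M_{N_RSTB}
        return self.conv_last(m)                 # deep features F_ea


# Example: apply the module to a hypothetical U-Net bottleneck feature map.
feat = torch.randn(1, 256, 24, 24)
deep = GFEM(dim=256)(feat)                       # same shape as the input
```

Embedding such a module in the bottleneck leaves the encoder and decoder of the U-Net unchanged, which is why it can be added without altering the skip-connection layout.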

3.2. Swin Transformer Layer

The Swin Transformer Layer (STL) [31] is derived from the standard multi-head self-attention of the original Transformer layer [32], distinguished by its localized attention and shifted-window scheme. The STL is composed of Multi-head Self-Attention (MSA) and a Multilayer Perceptron (MLP). For a given input $x \in \mathbb{R}^{H \times W \times C}$, the STL uses a sliding-window mechanism to reshape the input to $\frac{HW}{M^2} \times M^2 \times C$, partitioning it into non-overlapping local windows of size $M \times M$. The MSA is computed as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}} + B\right)V$
where $B$ is the relative position encoding and $d_k$ is the size of the last dimension of $K$. $Q$, $K$, and $V$ are calculated as follows:
$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$
The multilayer perceptron has two fully connected layers with feature transformations between the layers using GELU. Both the MSA and the MLP use a residual connection whose mathematical expression is
$X = \mathrm{MSA}(\mathrm{LN}(X)) + X$
$X = \mathrm{MLP}(\mathrm{LN}(X)) + X$
where $\mathrm{LN}(\cdot)$ denotes the LayerNorm layer.
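The sketch below shows one such layer in PyTorch: window-based multi-head self-attention with an additive position bias, followed by a GELU MLP, each wrapped in a pre-norm residual connection. Window partitioning, window shifting, and the relative-coordinate indexing of the bias table are omitted for brevity, and the window size, head count, and MLP ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WindowMSA(nn.Module):
    """Window multi-head self-attention with a learnable additive bias B:
    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k) + B) V.
    Input: flattened M x M windows of shape (num_windows*B, M*M, C)."""
    def __init__(self, dim, window_size, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        n = window_size * window_size
        # Simplified stand-in for Swin's relative position bias table.
        self.bias = nn.Parameter(torch.zeros(num_heads, n, n))

    def forward(self, x):
        Bn, N, C = x.shape
        qkv = self.qkv(x).reshape(Bn, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (Bn, heads, N, d_k)
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.bias
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bn, N, C)
        return self.proj(out)

class SwinTransformerLayer(nn.Module):
    """X = MSA(LN(X)) + X; X = MLP(LN(X)) + X (window shifting omitted)."""
    def __init__(self, dim, window_size=8, num_heads=4, mlp_ratio=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = WindowMSA(dim, window_size, num_heads)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                                # x: (num_windows*B, M*M, C)
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

# Example: 4 windows of size 8x8 with 96 channels.
tokens = torch.randn(4, 64, 96)
out = SwinTransformerLayer(96, window_size=8)(tokens)    # (4, 64, 96)
```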

3.3. Luminance Attention

First, we convert the low-quality fundus image into a luminance (grayscale) image. The formula is as follows:
$M_{GrayScale} = 0.299 \times x_R + 0.587 \times x_G + 0.114 \times x_B$
where $x_R$, $x_G$, and $x_B$ denote the normalized values of the three channels of the low-quality color fundus image.
Unlike natural images, according to the principles of color fundus photography, useful information is only retained in the circular area. The black part of the background contains no useful information and introduces some invisible noise. Inspired by the Dark Channel Prior [33], we omit the information in the black area of the background. The formula for its add-mask operation is as follows:
$M_{IA} = (1 - M_{GrayScale}) \otimes M_{mask}$
where $M_{IA}$ denotes the luminance map, $M_{mask}$ is the background mask of $x$, and $\otimes$ denotes element-wise multiplication. The brighter a region of the original image, the smaller the corresponding values in $M_{IA}$.
Next, we concatenate the original image $x$ with $M_{IA}$ and feed the result into the enhancement network. The generator of the enhancement network is a symmetric U-Net. The left-hand (encoder) side of the generator consists of convolution layers with kernel size $3 \times 3$ and max-pooling layers with kernel size $2 \times 2$. The $i$-th convolution layer from left to right is denoted $Conv_{left}^{i}$, and the corresponding pooling layer $Down_{left}^{i}$. The right-hand (decoder) side consists of $2 \times 2$ upsampling operators and convolution layers with kernel size $3 \times 3$. The $i$-th upsampling operator from right to left is denoted $Up_{right}^{i}$, and the corresponding convolution layer $Conv_{right}^{i}$. The output of the upsampling operator is denoted $Out_{up}^{i}$, and the result of passing $M_{IA}$ through the first $i$ max-pooling layers is denoted $M_{IA}^{i}$. The luminance attention can thus be expressed as follows:
$Out_{Attn}^{i} = Out_{down}^{i} \otimes M_{IA}^{i}$
where $Out_{Attn}^{i}$ denotes the output of luminance attention, $Out_{down}^{i}$ is the output of the $i$-th encoder stage, and $\otimes$ denotes element-wise multiplication. Finally, we concatenate $Out_{Attn}^{i}$ with $Out_{up}^{i}$ and use the result as the input to $Conv_{right}^{i}$.
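As a rough illustration, the following PyTorch sketch computes the luminance map and applies it to an encoder feature map before the skip connection. The mask construction and the feature shapes here are hypothetical; in the actual network the mask corresponds to the circular fundus field of view.

```python
import torch
import torch.nn.functional as F

def luminance_attention_map(x, mask):
    """Build the luminance map M_IA from a low-quality fundus image.
    x:    (B, 3, H, W) RGB image normalized to [0, 1]
    mask: (B, 1, H, W) binary mask of the circular fundus region.
    Darker pixels receive larger attention values; the black background
    outside the fundus is zeroed out by the mask."""
    r, g, b = x[:, 0:1], x[:, 1:2], x[:, 2:3]
    gray = 0.299 * r + 0.587 * g + 0.114 * b          # M_GrayScale
    return (1.0 - gray) * mask                        # M_IA

def apply_luminance_attention(enc_feat, m_ia, level):
    """Weight an encoder feature map by the luminance map before it is
    concatenated with the decoder feature map in the U-Net skip connection.
    The map is pooled `level` times to match the encoder resolution (M_IA^i)."""
    m = m_ia
    for _ in range(level):
        m = F.max_pool2d(m, kernel_size=2)            # mirror the encoder's pooling
    return enc_feat * m                               # Out_Attn^i = Out_down^i (x) M_IA^i

# Example with hypothetical shapes: level-2 encoder features at 1/4 resolution.
x = torch.rand(1, 3, 256, 256)
mask = (x.sum(dim=1, keepdim=True) > 0.05).float()    # crude stand-in background mask
m_ia = luminance_attention_map(x, mask)
enc_feat = torch.randn(1, 128, 64, 64)
attended = apply_luminance_attention(enc_feat, m_ia, level=2)   # (1, 128, 64, 64)
```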

3.4. Dual Discriminator and Generator

Natural image enhancement algorithms, while enhancing the image, may alter some fine details but generally do not affect the overall image quality. However, fundus images, unlike natural images, contain significant structural features and fine-scale vascular and pathological details. For enhanced color fundus images, we aim to retain these original detailed features without compromising the fine-scale attributes. To accomplish this, we employ global and local discriminators to constrain the network [17]. The global discriminator assesses the realism of the enhanced color fundus image relative to the true fundus image. The local discriminator, on the other hand, randomly crops five image blocks from the fundus image as input. This strategy helps to avoid over- or under-enhancement in local areas. The expression for our global discriminator’s loss function is
$L_{D}^{Global} = E_{y \sim Y}\left[\left(D_{Ra}(y, y') - 1\right)^{2}\right] + E_{y' \sim Y'}\left[\left(D_{Ra}(y', y)\right)^{2}\right]$
The relativistic GAN [34] proposes using the discriminator to estimate the probability that real data are more realistic than randomly generated fake data. Inspired by this, we adapt its loss function to make it more suitable for color fundus image enhancement. The expression is
$D_{Ra}(y, y') = C(y) - E_{y' \sim Y'}[C(y')], \quad D_{Ra}(y', y) = C(y') - E_{y \sim Y}[C(y)]$
where $y$ denotes the real color fundus image, $y'$ denotes the enhanced image, and $C$ denotes the global discriminator. We adopt LSGAN [35] as the loss function of the local discriminator. Its expression is as follows:
$L_{D}^{Local} = E_{y_p \sim Y_{patches}}\left[\left(D(y_p) - 1\right)^{2}\right] + E_{y'_p \sim Y'_{patches}}\left[\left(D(y'_p) - 0\right)^{2}\right]$
where $y_p$ and $y'_p$ denote local patches of size $32 \times 32$ cropped at random from $y$ and $y'$.
Our goal is to enhance the low-quality color fundus image domain $X$ into the high-quality color fundus image domain $Y'$. The corresponding adversarial losses for the generator are
$L_{G}^{Global} = E_{y' \sim Y'}\left[\left(D_{Ra}(y', y) - 1\right)^{2}\right] + E_{y \sim Y}\left[\left(D_{Ra}(y, y')\right)^{2}\right]$
$L_{G}^{Local} = E_{y'_p \sim Y'_{patches}}\left[\left(D(y'_p) - 1\right)^{2}\right]$
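A minimal sketch of these adversarial terms is given below, assuming the global discriminator $C$ and the local discriminator $D$ output raw (un-activated) scores and that expectations are approximated by batch means. It illustrates the relativistic and LSGAN formulations above rather than the authors' exact training code.

```python
import torch

def d_ra(scores_a, scores_b):
    """Relativistic average term: D_Ra(a, b) = C(a) - E_b[C(b)]."""
    return scores_a - scores_b.mean()

def global_d_loss(real_scores, fake_scores):
    """L_D^Global for the global discriminator (real vs. enhanced images)."""
    return ((d_ra(real_scores, fake_scores) - 1) ** 2).mean() + \
           (d_ra(fake_scores, real_scores) ** 2).mean()

def global_g_loss(real_scores, fake_scores):
    """L_G^Global: the generator pushes enhanced images to look 'more real'."""
    return ((d_ra(fake_scores, real_scores) - 1) ** 2).mean() + \
           (d_ra(real_scores, fake_scores) ** 2).mean()

def local_d_loss(real_patch_scores, fake_patch_scores):
    """LSGAN loss L_D^Local on randomly cropped patches."""
    return ((real_patch_scores - 1) ** 2).mean() + (fake_patch_scores ** 2).mean()

def local_g_loss(fake_patch_scores):
    """L_G^Local: enhanced patches should be scored as real."""
    return ((fake_patch_scores - 1) ** 2).mean()

# Example with hypothetical discriminator outputs on a batch of 8 images.
real = torch.randn(8, 1)        # C(y) on real fundus images
fake = torch.randn(8, 1)        # C(y') on enhanced images
print(global_d_loss(real, fake).item(), global_g_loss(real, fake).item())
```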

3.5. Loss Function

3.5.1. Self Feature Preserving Loss

Ma et al. [16] show that non-uniformity of brightness affects the quality of fundus images. Inspired by EnlightenGAN [17], we use the Self Feature Preserving Loss, which encourages the features of the image before and after enhancement to remain consistent. The loss is formulated as follows:
$L_{SFP}(x) = \frac{1}{W_{i,j} H_{i,j}} \sum_{w=1}^{W_{i,j}} \sum_{h=1}^{H_{i,j}} \left(\phi_{i,j}(x) - \phi_{i,j}(G(x))\right)^{2}$
where $x$ is the input low-quality fundus image and $G(x)$ denotes the generator output. $\phi_{i,j}(x)$ denotes the feature map extracted by the pre-trained model, where $i$ and $j$ indicate the $j$-th convolution layer after the $i$-th max-pooling layer of that model, and $W_{i,j}$ and $H_{i,j}$ are the width and height of the feature map, respectively. This study sets $i = 5$ and $j = 1$.
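A sketch of this perceptual term is shown below, assuming a pre-trained VGG-16 from torchvision as the feature extractor $\phi$; the choice of backbone and the layer cut-off index are assumptions for illustration and would need to be set to the layer the authors actually use.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class SelfFeaturePreservingLoss(nn.Module):
    """MSE between deep feature maps of the input and the enhanced image.
    The cut-off index selecting phi_{i,j} is a hypothetical choice here."""
    def __init__(self, cutoff=26):                    # hypothetical index into vgg16.features
        super().__init__()
        features = vgg16(weights="IMAGENET1K_V1").features[:cutoff]
        for p in features.parameters():               # the extractor is frozen
            p.requires_grad = False
        self.phi = features.eval()

    def forward(self, x, enhanced):
        return torch.mean((self.phi(x) - self.phi(enhanced)) ** 2)

# Usage (hypothetical tensors): loss = SelfFeaturePreservingLoss()(low_q, enhanced)
```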

3.5.2. Illumination Loss

Even when the same device is used, variations in color fundus images can occur due to differences among operators, potentially leading to light leakage or insufficient brightness and ultimately reducing the overall quality of the fundus image. To mitigate this issue, we employ an illumination constraint, drawing inspiration from StillGAN [16]. This function first calculates the global average luminance of the image I, then partitions the image into non-overlapping patches of equal size and computes the average luminance of each patch to obtain the illumination matrix D. Finally, the difference in average brightness between I and D is computed, and the function is minimized to constrain the overall brightness uniformity of the image. The formulation is as follows:
$L_{ill}(x) = E_{x \sim X}\left[E_{global}\left[\left|\mathrm{upsampling}\left(E_{local}^{p \times p}\left[G(x)\right]\right) - E_{global}\left[G(x)\right]\right|\right]\right]$
where $x \sim X$ denotes the input low-quality fundus image, $G(x)$ is the generator output, and $p$ denotes the size of the non-overlapping regions. $E_{local}$ denotes the local (patch-wise) brightness average of the image, $E_{global}$ denotes the global brightness average, and $\mathrm{upsampling}$ denotes bilinear interpolation upsampling.
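The following sketch implements this constraint with average pooling and bilinear upsampling. Using the channel mean as luminance and a patch size of 32 are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def illumination_loss(enhanced, patch_size=32):
    """L_ill: penalize the deviation of local (patch-wise) mean luminance from
    the global mean luminance of the enhanced image G(x).
    enhanced: (B, C, H, W); patch_size is the side length p of the
    non-overlapping regions (value chosen here only as an example)."""
    luminance = enhanced.mean(dim=1, keepdim=True)                 # (B, 1, H, W)
    local_mean = F.avg_pool2d(luminance, kernel_size=patch_size)   # E_local^{p x p}[G(x)]
    local_mean = F.interpolate(local_mean, size=luminance.shape[-2:],
                               mode="bilinear", align_corners=False)  # upsampling
    global_mean = luminance.mean(dim=(2, 3), keepdim=True)         # E_global[G(x)]
    return (local_mean - global_mean).abs().mean()                 # E_global[| ... |]

# Example: a batch of two enhanced 256 x 256 images.
loss = illumination_loss(torch.rand(2, 3, 256, 256), patch_size=32)
```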

3.5.3. Structure Loss

While the illumination loss improves image brightness, it can also cause loss of feature information in the enhanced image. We therefore use a structural loss to ensure that the enhanced image retains the structural features of the original image. Using the similarity between the original and enhanced images, the mean correlation $\bar{\tau}$ is computed over corresponding regions of the image before and after enhancement, and the loss is obtained by minimizing $1 - \bar{\tau}$, which is defined as follows:
$L_{ST}(x, y) = E_{x \sim X, y \sim Y'}\left[1 - \frac{1}{M}\sum_{i=1}^{M}\frac{\delta_{x_i, y_i} + c}{\mu_{x_i}\mu_{y_i} + c}\right]$
where $x \sim X$ is the input low-quality fundus image and $y \sim Y'$ is the high-quality fundus image produced by generator $G$. $\delta_{x_i, y_i}$ denotes the covariance between regions $x_i$ and $y_i$, $\mu_{x_i}$ and $\mu_{y_i}$ are the standard deviations of $x_i$ and $y_i$, respectively, and $c$ is a constant that avoids numerical instability.
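A compact sketch of this term is given below; it computes a patch-wise correlation with torch.nn.Unfold on single-channel inputs, and the patch size and constant c are illustrative choices rather than the paper's exact settings.

```python
import torch

def structure_loss(x, y, patch_size=16, c=1e-4):
    """L_ST: 1 minus the mean patch-wise correlation between the original
    image x and the enhanced image y.
    x, y: (B, 1, H, W) single-channel (e.g. luminance) images."""
    unfold = torch.nn.Unfold(kernel_size=patch_size, stride=patch_size)
    xp = unfold(x)                                  # (B, p*p, M): one column per region x_i
    yp = unfold(y)
    xm = xp - xp.mean(dim=1, keepdim=True)
    ym = yp - yp.mean(dim=1, keepdim=True)
    cov = (xm * ym).mean(dim=1)                     # delta_{x_i, y_i}
    std_x = xp.std(dim=1, unbiased=False)           # mu_{x_i} (standard deviation)
    std_y = yp.std(dim=1, unbiased=False)
    corr = (cov + c) / (std_x * std_y + c)
    return 1.0 - corr.mean()                        # minimize 1 - tau_bar

# Example with a grayscale pair (the "enhanced" image is a toy perturbation).
x = torch.rand(1, 1, 128, 128)
loss = structure_loss(x, x.clamp(0, 1) * 0.9 + 0.05)
```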
Thus, the overall function of the generator of the proposed method for color medical fundus image enhancement in this paper can be defined as
$L = L_{SFP}^{Global} + L_{SFP}^{Local} + L_{G}^{Global} + L_{G}^{Local} + \lambda_{ill} L_{ill} + \lambda_{ST} L_{ST}$
where $\lambda_{ill}$ and $\lambda_{ST}$ are the weight coefficients of $L_{ill}$ and $L_{ST}$, respectively, set to $\lambda_{ill} = 5$ and $\lambda_{ST} = 50$ in this paper.
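Put together, the generator objective is a weighted sum of the terms above; the helper below simply mirrors that sum with the reported weights, assuming each term is a scalar tensor produced by the individual loss sketches.

```python
def total_generator_loss(l_sfp_global, l_sfp_local, l_g_global, l_g_local,
                         l_ill, l_st, lambda_ill=5.0, lambda_st=50.0):
    """Overall generator objective with the weights reported in the paper
    (lambda_ill = 5, lambda_ST = 50)."""
    return (l_sfp_global + l_sfp_local + l_g_global + l_g_local
            + lambda_ill * l_ill + lambda_st * l_st)
```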

4. Experiment and Results

4.1. Datasets

In order to better train our model, we chose the EyeQ dataset [36], a subset obtained by re-labeling images from EyePACS. It is a robust dataset that provides ample scope for testing and validation, contributing to a more accurate and reliable analysis. EyeQ contains a total of 28,792 color fundus images with three quality levels: 'Good', 'Usable', and 'Reject'. The pathological features of images labeled 'Good' are visible; images labeled 'Usable' have a slight degradation factor, but the subject structures and pathological features in the image can still be identified; images labeled 'Reject' have a significant degradation factor and cannot be used for diagnostic tasks. In addition, images in which subject structures such as the optic disc and the macular region of the retina are not visible are also labeled 'Reject'. EyeQ contains 12,543 training images and 12,649 test images. All images are pre-processed using cofe-Net [37]. The images labeled 'Good' in the test set are degraded into corresponding low-quality images containing three degradation factors: illumination, artifacts, and blur. In particular, we use images labeled 'Good' and 'Usable' to train and test the network model, as images labeled 'Reject' lack a significant amount of subject structure and pathological features.

4.2. Implementation

The selection of model parameters during the training phase is determined mainly by experimentation and optimization. The choice of input image size balances computational efficiency and the level of detail captured. We therefore set the input image size to 384 × 384 and the batch size per iteration to 8, following [17], and train for 100 epochs. The input image block size for the local discriminator is set to 64 × 64. We use Adam as the optimizer with a learning rate of 0.0001. For better visual results, we use 512 × 512 images as input during the testing phase. The model is built on PyTorch and trained on NVIDIA GeForce RTX 3090 GPUs, each with 24 GB of memory. Processing time for each image of size 512 × 512 × 3 is approximately 0.061 s.

4.3. Ablation Study

In this paper, PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index Measure) [38] are used as quantitative analysis evaluation indicators. The definition of PSNR is as follows:
$PSNR = 10 \times \log_{10}\left(\frac{(2^{b} - 1)^{2}}{MSE}\right)$
$MSE = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left[I(i,j) - K(i,j)\right]^{2}$
where $b$ represents the bit depth of the image, and $m$ and $n$ represent the width and height of the image, respectively. $I(i,j)$ and $K(i,j)$ denote the pixel values at coordinates $(i,j)$ of the reference image and the enhanced image. The mathematical definition of SSIM is as follows:
$SSIM = \frac{(2\mu_x\mu_y + c_1)(2\delta_{xy} + c_2)}{(\mu_x^{2} + \mu_y^{2} + c_1)(\delta_x^{2} + \delta_y^{2} + c_2)}$
where $x$ and $y$ represent the reference image and the enhanced image, $\mu_x$ and $\mu_y$ represent their mean values, $\delta_x^{2}$ and $\delta_y^{2}$ represent their variances, and $\delta_{xy}$ represents the covariance of $x$ and $y$. We use 8-bit RGB images in this study, so $c_1 = (0.01 \times (2^{8} - 1))^{2}$.
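For reference, the following sketch evaluates both metrics directly from these formulas on whole images. Note that $c_2$ uses the conventional 0.03 factor (an assumption, as only $c_1$ is given above), and practical SSIM implementations average the statistic over local windows rather than computing it globally.

```python
import torch

def psnr(reference, enhanced, bit_depth=8):
    """PSNR in dB for images with integer range [0, 2^b - 1]."""
    mse = torch.mean((reference.float() - enhanced.float()) ** 2)
    max_val = (2 ** bit_depth - 1) ** 2
    return 10.0 * torch.log10(max_val / mse)

def ssim_global(x, y, bit_depth=8):
    """Single-window (global) SSIM following the formula above."""
    L = 2 ** bit_depth - 1
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    x, y = x.float(), y.float()
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(unbiased=False), y.var(unbiased=False)
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# Example on synthetic 8-bit images.
ref = torch.randint(0, 256, (256, 256, 3))
enh = (ref.float() + torch.randn(256, 256, 3) * 5).clamp(0, 255)
print(psnr(ref, enh).item(), ssim_global(ref, enh).item())
```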
For the ablation experiments, we train models with different module configurations on EyeQ. A higher SSIM score indicates better preservation of structural detail. As shown in Table 1, the GFEM + L-attention configuration achieves a PSNR of 24.32 dB, an improvement of 0.85 over the model without GFEM and 0.74 over the model without L-attention. It also reaches an SSIM of 0.8932, which is 0.0062 higher than the model without GFEM and 0.0003 higher than the model without L-attention. Although the margin over the model without L-attention is minimal, it nonetheless underscores the benefit of combining GFEM and L-attention in retaining the structural similarity of the image. Figure 2 shows the impact of the different components on the model. In Sam 1, the absence of luminance attention causes part of the image to be over-lit. Sam 2 shows the impact of GFEM: without GFEM, noise in the dark areas is incorrectly identified and enhanced, resulting in false features such as vessel details that are not present in the original image.

4.4. Comparison with State of the Art

In this section, we compare the performance of the proposed model with the current state-of-the-art methods.

4.4.1. Uneven Illumination Enhancement Comparison

We compare with other GAN-based image enhancement methods, and the comparison results are shown in Figure 3. The first column is the original color fundus image. The second column is pix2pix [39] trained on the dataset based on the synthetic high- and low-quality images. The third to fifth columns are CycleGAN, CutGAN [40], and StillGAN, trained on the dataset using real unpaired high- and low-quality color fundus images. The last column shows the results produced by our method.
As shown in Figure 3, the images enhanced by the pix2pix method show significant uneven enhancement, with the worst performance in terms of brightness uniformity. This may be due to a domain gap when a model trained on a synthetic image dataset is transferred to a real low-quality fundus image enhancement task. CycleGAN achieves good results in terms of overall image quality but performs poorly in terms of color stability. CutGAN performs best in overall image perception, almost eliminating overexposure and underexposure. However, it tends to generate false features in severely degraded, overexposed, or underexposed areas, which may interfere with diagnosis. StillGAN provides some enhancement, but the contrast between the vessels and the background remains limited. The images enhanced by our method show better color consistency and overall brightness uniformity, both within single images and across images, and provide significant contrast enhancement of the fundus vessels.
As shown in Table 2, our method achieves superior results, reaching a PSNR index of 24.32 dB. This is an improvement of 0.88 dB compared to the second-best method, StillGAN, and a significant 2.43 dB better than the lowest-scoring method, CutGAN. For the SSIM index, our method scores 0.8932, slightly lower than the highest-scoring method, Pix2pix (by 0.0014), but surpasses the lowest-scoring method, CycleGAN (by 0.0502). However, Pix2pix shows the worst performance in terms of brightness uniformity. This might be a domain gap issue when its training results on synthesized image datasets are transferred to the task of enhancing real low-quality color fundus images. Compared to Pix2pix, which uses synthesized image training datasets, our method might be more suited for real clinical images.

4.4.2. Structural Analysis

Color fundus images are diagnosed based on anatomical and pathological features in the images. Therefore, fundus image enhancement needs to maintain image viewability, structural features, and pathological features. To verify the retention of structural features after enhancement, we use Iter-Net [41] as the retinal vessel segmentation method. We segment the retinal vascular structure after enhancement by the different methods and analyze the visibility and continuity of the blood vessels in the segmentation results to determine the effect of image enhancement. The comparison results are shown in Figure 4. The enhanced fundus images from our method allow more vascular detail to be segmented than the original images. CycleGAN and CutGAN yield poorer vascular continuity and introduce vascular details that were not present in the original image. In the dark areas of the image, the low contrast between the vessels and the background results in poor visibility; our method notably enhances the depiction of vascular detail there. The vascular structures enhanced by Pix2pix, CycleGAN, and CutGAN appear relatively disordered, whereas StillGAN and our proposed method perform better in terms of vascular detail and continuity.

4.4.3. Pathological Feature Analysis

Figure 5 shows the effect of image enhancement on the retention of pathological features. After enhancement by our method, the edges of pathological features in the dark areas remain clear and distinguishable from the background color. The pathological features enhanced by CycleGAN and CutGAN show blurring and a reduction in contrast, and the pix2pix enhancement is barely apparent. Sam 2 shows an area where delicate vessels interweave with hard exudates. When enhanced by our approach, both the vascular and pathological features surpass those observed in the original image. In darker regions of the original image, StillGAN's enhancement reveals substantial minor hemorrhages or potential microaneurysm symptoms that are not present in the original image. We infer that the method may be mistakenly identifying noise in these darker regions as minor pathological features and excessively amplifying it. The vascular structures enhanced by Pix2pix, CycleGAN, and CutGAN appear disordered, introducing vascular branches that should not exist. Comparatively, while StillGAN achieves sharper edges of hard exudates than the other methods, it also increases image noise. The overall enhancement effect of the Pix2pix method is not substantial, and it may even introduce uneven brightness. In terms of overall image perception, CycleGAN and CutGAN perform well; however, they struggle to preserve fine-scale features, which can lead to the loss of intricate detail and the presence of blurriness. StillGAN performs well in terms of overall image perception and fine-scale feature retention but suffers from a deficiency in vascular continuity and excessive enhancement. In summary, our method effectively enhances the overall quality of the image and achieves high fidelity in the fine-scale structure and pathological features of the fundus image.

5. Conclusions

In this study, we present a novel unsupervised framework for enhancing low-quality color retinal images. We design a Global Feature Extraction Module and incorporate a non-reference loss function to retain delicate feature information within the images. To tackle the issue of uneven illumination, we propose a brightness attention mechanism based on prior knowledge. Despite the absence of paired training data, the proposed framework is readily adaptable to real-world, low-quality color retinal images. Experimental results on publicly available color retinal image datasets show that our approach surpasses existing methods in terms of PSNR and SSIM, with improvements of 0.88 dB and 0.024, respectively. The results also demonstrate that our method is better than other deep learning approaches at preserving vascular continuity and subtle pathological characteristics. However, we have observed that in some cases the degradation is so severe that the valuable information in the retinal image is obscured, making restoration challenging; the proposed framework might not perform well for such images. This may constitute the object of future studies.

Author Contributions

Conceptualization, Y.H., Y.L. and H.Z.; Methodology, Y.H. and H.Z.; Validation, Y.L. and X.Z.; Data curation, X.Z.; Writing—original draft, Y.H.; Writing—review & editing, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Bingtuan Science and Technology Program (2022DB005, 2019BC008).

Data Availability Statement

Not applicable.

Acknowledgments

The authors acknowledge funding from the Bingtuan Science and Technology Program (2022DB005, 2019BC008).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3462–3471.
  2. Schuler, J.P.S.; Romani, S.; Abdel-Nasser, M.; Rashwan, H.; Puig, D. Color-Aware Two-Branch DCNN for Efficient Plant Disease Classification. MENDEL 2022, 28, 55–62.
  3. Huck Yang, C.H.; Liu, F.; Huang, J.H.; Tian, M.; Lin, I.H.; Liu, Y.C.; Morikawa, H.; Yang, H.H.; Tegner, J. Auto-classification of retinal diseases in the limit of sparse data using a two-streams machine learning model. In Asian Conference on Computer Vision; Springer: Cham, Switzerland, 2018; pp. 323–338.
  4. Xing, X.; Liang, G.; Blanton, H.; Rafique, M.U.; Wang, C.; Lin, A.L.; Jacobs, N. Dynamic image for 3D MRI image Alzheimer's disease classification. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 355–364.
  5. Ghani, A.; See, C.H.; Sudhakaran, V.; Ahmad, J.; Abd-Alhameed, R. Accelerating retinal fundus image classification using artificial neural networks (ANNs) and reconfigurable hardware (FPGA). Electronics 2019, 8, 1522.
  6. Khanal, A.; Estrada, R. Dynamic deep networks for retinal vessel segmentation. Front. Comput. Sci. 2020, 2, 35.
  7. Liu, Z. Retinal vessel segmentation based on fully convolutional networks. arXiv 2019, arXiv:1911.09915.
  8. Pissas, T.; Bloch, E.; Cardoso, M.J.; Flores, B.; Georgiadis, O.; Jalali, S.; Ravasio, C.; Stoyanov, D.; Da Cruz, L.; Bergeles, C. Deep iterative vessel segmentation in OCT angiography. Biomed. Opt. Express 2020, 11, 2490–2510.
  9. Guo, C.; Szemenyei, M.; Yi, Y.; Wang, W.; Chen, B.; Fan, C. SA-UNet: Spatial attention U-Net for retinal vessel segmentation. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 1236–1242.
  10. Wei, C.; Wang, W.; Yang, W.; Liu, J. Deep Retinex decomposition for low-light enhancement. arXiv 2018, arXiv:1808.04560.
  11. Wang, W.; Wei, C.; Yang, W.; Liu, J. GLADNet: Low-light enhancement network with global awareness. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi'an, China, 15–19 May 2018; pp. 751–755.
  12. Li, C.; Fu, H.; Cong, R.; Li, Z.; Xu, Q. NuI-Go: Recursive non-local encoder-decoder network for retinal image non-uniform illumination removal. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1478–1487.
  13. Lee, K.G.; Song, S.J.; Lee, S.; Yu, H.G.; Kim, D.I.; Lee, K.M. A deep learning-based framework for retinal fundus image enhancement. PLoS ONE 2023, 18, e0282416.
  14. Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1780–1789.
  15. Guo, J.; He, C.; Zhang, M.; Li, Y.; Gao, X.; Song, B. Edge-Preserving Convolutional Generative Adversarial Networks for SAR-to-Optical Image Translation. Remote Sens. 2021, 13, 3575.
  16. Ma, Y.; Liu, J.; Liu, Y.; Fu, H.; Hu, Y.; Cheng, J.; Qi, H.; Wu, Y.; Zhang, J.; Zhao, Y. Structure and illumination constrained GAN for medical image enhancement. IEEE Trans. Med. Imaging 2021, 40, 3955–3967.
  17. Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. EnlightenGAN: Deep light enhancement without paired supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349.
  18. Land, E.H. The Retinex theory of color vision. Sci. Am. 1977, 237, 108–129.
  19. Jobson, D.J.; Rahman, Z.U.; Woodell, G.A. A multiscale Retinex for bridging the gap between color images and the human observation of scenes. IEEE Trans. Image Process. 1997, 6, 965–976.
  20. Li, M.; Liu, J.; Yang, W.; Sun, X.; Guo, Z. Structure-revealing low-light image enhancement via robust Retinex model. IEEE Trans. Image Process. 2018, 27, 2828–2841.
  21. Guo, X.; Li, Y.; Ling, H. LIME: Low-light image enhancement via illumination map estimation. IEEE Trans. Image Process. 2016, 26, 982–993.
  22. Shome, S.K.; Vadali, S.R.K. Enhancement of diabetic retinopathy imagery using contrast limited adaptive histogram equalization. Int. J. Comput. Sci. Inf. Technol. 2011, 2, 2694–2699.
  23. Dai, P.; Sheng, H.; Zhang, J.; Li, L.; Wu, J.; Fan, M. Retinal fundus image enhancement using the normalized convolution and noise removing. Int. J. Biomed. Imaging 2016, 2016, 5075612.
  24. Eilertsen, G.; Kronander, J.; Denes, G.; Mantiuk, R.K.; Unger, J. HDR image reconstruction from a single exposure using deep CNNs. ACM Trans. Graph. (TOG) 2017, 36, 1–15.
  25. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 184–199.
  26. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
  27. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784.
  28. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
  29. You, Q.; Wan, C.; Sun, J.; Shen, J.; Ye, H.; Yu, Q. Fundus image enhancement method based on CycleGAN. In Proceedings of the 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, 23–27 July 2019; pp. 4500–4503.
  30. Raj, A.; Shah, N.A.; Tiwari, A.K. A novel approach for fundus image enhancement. Biomed. Signal Process. Control 2022, 71, 103208.
  31. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 10012–10022.
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
  33. He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 2341–2353.
  34. Jolicoeur-Martineau, A. The relativistic discriminator: A key element missing from standard GAN. arXiv 2018, arXiv:1807.00734.
  35. Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.; Wang, Z.; Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2794–2802.
  36. Fu, H.; Wang, B.; Shen, J.; Cui, S.; Xu, Y.; Liu, J.; Shao, L. Evaluation of retinal image quality assessment networks in different color-spaces. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2019; pp. 48–56.
  37. Shen, Z.; Fu, H.; Shen, J.; Shao, L. Modeling and enhancing low-quality retinal fundus images. IEEE Trans. Med. Imaging 2020, 40, 996–1006.
  38. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
  39. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134.
  40. Park, T.; Efros, A.A.; Zhang, R.; Zhu, J.Y. Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 319–345.
  41. Li, L.; Verma, M.; Nakashima, Y.; Nagahara, H.; Kawasaki, R. IterNet: Retinal image segmentation utilizing structural redundancy in vessel networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 3656–3665.
Figure 1. Architecture of the proposed network.
Figure 2. Results of the ablation experiment on the Global Feature Extraction Module (GFEM) and luminance attention (L-A).
Figure 3. Qualitative comparison with state-of-the-art methods.
Figure 4. Comparison of vascular segmentation results in color fundus images.
Figure 5. Comparison of subtle pathological features before and after enhancement of fundus images.
Table 1. Analysis of the impact of different components on the model.

Module    No GFEM    No L-Attention    GFEM + L-Attention
PSNR      23.47      23.58             24.32
SSIM      0.8870     0.8929            0.8932
Table 2. Average PSNR and SSIM results on the test set.

Method           PSNR (dB)    SSIM
pix2pix [39]     23.2         0.8946
CycleGAN [28]    22.84        0.8430
CutGAN [40]      21.89        0.8534
StillGAN [16]    23.44        0.8694
Ours             24.32        0.8932
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
