1. Introduction
OCT employs low-coherence interferometry to produce cross-sectional tomographic images of the internal structure of biological tissues [1,2]. It is routinely used for diagnostic imaging, primarily of the retina and coronary arteries [3]. The axial resolution obtainable is in the range of 2 to 15 µm [4], with a depth range of around 1–2 mm. Unfortunately, OCT images are often degraded by speckle noise [5,6], which creates grain-like structures in the image as large as the spatial resolution of the OCT system. Speckle significantly degrades image quality and complicates interpretation and medical diagnosis by confounding tissue anatomy and masking changes in tissue scattering properties.
Speckle suppression is often achieved by incoherent averaging of images with different speckle realizations [7], e.g., through angular compounding [8,9]. Averaging methods attempt to preserve the resolution while suppressing speckle arising from non-resolved tissue structure; nevertheless, some methods produce blurred images. Moreover, although effective at suppressing speckle in ex vivo tissues or in preclinical animal research, the additional time and data throughput required to obtain multiple speckle realizations often make this approach incompatible with clinical in vivo imaging.
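For intuition, the following minimal numpy sketch (illustrative, not from the paper) shows why incoherent averaging works: fully developed speckle has unit contrast (standard deviation over mean), and averaging N independent realizations reduces the contrast roughly by 1/√N.

```python
import numpy as np

rng = np.random.default_rng(0)

def speckle_realization(reflectivity, rng):
    """One coherent intensity image: for fully developed speckle, the
    complex field is circular Gaussian, so the intensity is exponential."""
    re = rng.normal(size=reflectivity.shape)
    im = rng.normal(size=reflectivity.shape)
    return reflectivity * (re**2 + im**2) / 2.0

reflectivity = np.ones((64, 64))  # flat sample: all variation is speckle
single = speckle_realization(reflectivity, rng)
avg = np.mean([speckle_realization(reflectivity, rng) for _ in range(32)],
              axis=0)

contrast_single = np.std(single) / np.mean(single)  # close to 1
contrast_avg = np.std(avg) / np.mean(avg)           # close to 1/sqrt(32)
```

Obtaining the 32 independent realizations is exactly the costly hardware step that makes this approach difficult in vivo.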
Consequently, many numerical algorithms attempt to suppress speckle computationally, to name a few: non-linear filtering [10], non-local means (NLM) [11,12], and block matching and 3D filtering (BM3D) [13]. The majority of these algorithms employ an image denoiser that treats speckle as independent and identically distributed (i.i.d.) Gaussian noise. The solution can be sensitive to parameter fine-tuning. Some algorithms also rely on accurately registered volumetric data, which is challenging to obtain in clinical settings.
Recently, the speckle reduction task has been extensively investigated from a supervised learning perspective [14,15,16]. Most supervised data-driven methods require a large training dataset. In OCT, Dong et al. (2020) [17] trained a super-resolution generative adversarial network (SRGAN) [18,19] with hardware-based speckle-suppressed ex vivo samples defining the ground truth; namely, they used 200,000 speckle-modulating OCT images for training. Chintada et al. (2023) [20] used a conditional GAN (cGAN) [21] trained with hundreds of retinal B-scans, with NLM results [12] as the ground truth. Ma et al. (2018) [22] also used a cGAN to perform speckle reduction and contrast enhancement for retinal OCT images by adding an edge loss function to the final objective; clean images for training were obtained by averaging B-scans from multiple OCT volumes.
That said, there is growing evidence that supervised learning methods, specifically in the context of computational imaging and inverse problems, may require significantly smaller datasets. For example, it was observed that, for image restoration in fluorescence microscopy [23], even a small number of training images led to acceptable restoration quality (e.g., 200 patches). Pereg et al. (2020) [24] used a single simplified synthetic image example for seismic inversion. Several works have explored few-shot learning for transfer learning [25,26]. For example, Huang et al. (2022) employed a recurrent neural network (RNN) for few-shot transfer learning of holographic image reconstruction [25]. The RNN was first trained with ∼2000 unique training images of three sample types; its parameters were then fixed as a backbone model, and the transfer learning phase required only 80 examples. Some progress has been made in few-shot learning for medical imaging, primarily for classification and segmentation [27,28]. To our knowledge, the work presented here is the first to address few-shot learning for OCT noise reduction.
In learning theory, domain shift is a change of the data distribution between the source domain (the training dataset) and the target domain (the test dataset). Despite advances in data augmentation and transfer learning, neural networks often fail to adapt to unseen domains. For example, convolutional neural networks (CNNs) trained for segmentation tasks can be highly sensitive to changes in resolution and contrast, and performance often degrades even within the same imaging modality. A general review of domain adaptation (DA) for medical image analysis can be found in [29]. The different approaches are separated into shallow and deep DA models, further divided into supervised, semi-supervised, and unsupervised DA, depending on the availability of labeled data in the target domain. Generally speaking, the appropriate DA approach depends on the background and properties of the specific problem. Many DA methods map the source and target domains to a shared latent space, whereas generative DA methods attempt to translate the source to the target or vice versa. In our study, we focused on a simple yet efficient physics-aware unsupervised DA approach for the case of a change in the OCT imaging system; namely, only unlabeled data are available for the target domain. This problem is also referred to in the literature as domain generalization [30], and it has hardly been explored in medical imaging so far [31].
Our aim in this work is to investigate few-shot learning as an efficient tool for OCT speckle reduction with limited ground truth training data. To this end, we first prove that the output resolution of a supervised learning speckle-suppression system is determined by the sampling space and the resolution of the source acquisition system. We also mathematically define the effects of the domain shift on the target output image. In light of the theoretical analysis, we promote the use of a patch-based learning approach. We propose a recurrent neural network (RNN) framework to demonstrate the applicability and efficiency of few-shot learning for OCT speckle suppression. We demonstrate the use of a single-image training dataset that generalizes well. The proposed approach introduces a significant decrease in the training time and required computational resources: training takes about 2–25 s on a GPU workstation and a few minutes (2–4 min) on a CPU workstation. We further propose novel upgrades to the original RNN framework and compare their performance. Namely, we introduce a one-shot patch-based RNN-mini-GAN architecture, demonstrate the increased SNR achieved via averaging overlapping patches, and recast the speckle-suppression network as a deblurring system. We also propose a patch-based one-shot learning U-Net [32] and compare its results with those of the three RNN models. We illuminate the dependence of speckle reduction on the acquisition system, via the known lateral and axial sampling spaces and resolutions, and offer simple strategies for training and testing under different acquisition systems. Finally, our approach is applicable to other learning architectures, as well as other applications where the signal can be processed locally, such as speech and audio, video, seismic imaging, MRI, ultrasound, natural language processing, and more. The results in this paper are a substantial extension that replaces our unpublished previous work ([33], Section 6.2).
3. Domain-Aware Speckle Suppression
Let us denote $r(z, x)$ as the ground truth ideal tomogram perfectly describing the depth sample reflectivity, where $(z, x)$ are continuous axial and lateral spatial axes. A measured tomogram can be formulated as
$$y(z, x) = h(z, x) * r(z, x),$$
where $*$ denotes the convolution operation and $h(z, x)$ is a point spread function (PSF). In the discrete setting, assuming $\Delta_z$ and $\Delta_x$ are the axial and lateral sampling intervals, respectively, the set of measured values lies on the grid $y[n, m] = y(n\Delta_z, m\Delta_x)$, $n = 0, \ldots, N_z - 1$, $m = 0, \ldots, N_x - 1$.
A speckle-suppressed tomogram can be viewed as the incoherent mean of $K$ coherent tomograms $y_k$ with different speckle realizations [6,41] (see, e.g., Figure 1):
$$\bar{y}[n, m] = \frac{1}{K} \sum_{k=1}^{K} y_k[n, m].$$
In OCT (using a wavelength-swept source (SS-OCT) or Fourier-domain/spectral-domain (FD/SD-OCT) system) [42,43], the axial direction corresponds to the depth at a certain scan location of the imaged sample. The axial imaging range is given by the central wavelength and the wavelength sampling, and the axial sampling space is this range divided by $N_z$, the total number of pixels in an A-line. In the axial direction, the effective PSF width is determined by the FFT of a zero-padded Hanning window. The lateral direction corresponds to the direction of image scanning, such that assembling all A-lines of a lateral scan into a B-scan forms a cross-sectional image. In the lateral direction, the PSF has a Gaussian shape proportional to $\exp(-2x^2/w_0^2)$, where $w_0$ is referred to as the waist, and $\Delta_x$ is the lateral sampling space. Therefore, $h(z, x)$ is separable and can be expressed as $h(z, x) = h_z(z)\,h_x(x)$. Note that the resolution and sampling rate are known parameters of an OCT imaging system.
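As a concrete sketch (parameter values and the exact exp(−2x²/w²) beam-intensity form are our assumptions, in the spirit of the systems listed in Table 1), the separable PSF can be assembled from the FFT of a zero-padded Hanning window axially and a Gaussian profile laterally:

```python
import numpy as np

def axial_psf(n_measured, n_fft):
    """Axial PSF magnitude: FFT of a Hanning window over the measured
    spectral points, zero-padded to the total FFT length."""
    window = np.zeros(n_fft)
    window[:n_measured] = np.hanning(n_measured)
    h = np.abs(np.fft.fftshift(np.fft.fft(window)))
    return h / h.max()

def lateral_psf(x_um, waist_um):
    """Lateral PSF: Gaussian beam profile with waist w (we assume the
    common intensity form exp(-2 x^2 / w^2))."""
    return np.exp(-2.0 * x_um**2 / waist_um**2)

# Separable 2D PSF: outer product of the axial and lateral profiles.
h_z = axial_psf(844, 1024)            # e.g., 844 points padded to 1024
x = np.arange(-30.0, 31.0, 1.0)       # lateral axis in µm (illustrative)
h_2d = np.outer(h_z, lateral_psf(x, waist_um=8.28))
```

The outer product makes the separability h(z, x) = h_z(z) h_x(x) explicit.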
In matrix–vector form, we denote an input (log-scaled) image $\mathbf{y}$ that is a corrupted version of $\mathbf{x}$, such that $\mathbf{y} = \mathbf{x} + \mathbf{n}$, where $\mathbf{n}$ is an additional noise term. Note that, for the case of image despeckling, we assume neither that the entries of $\mathbf{n}$ are i.i.d. nor that $\mathbf{n}$ is uncorrelated with $\mathbf{x}$. Our task is to recover $\mathbf{x}$; that is, we attempt to find an estimate $\hat{\mathbf{x}}$ of the unknown ground truth.
Let us assume a source training set $S = \{(\mathbf{y}_i, \mathbf{x}_i)\}_{i=1}^{m}$, where $\mathbf{y}_i, \mathbf{x}_i$ are image patches sampled from a source-domain distribution, with $\mathbf{x}_i$ serving as the ground truth. The learning system is trained to output a prediction rule $f(\mathbf{y}) = \hat{\mathbf{x}}$. We assume an algorithm that trains the predictor by minimizing the training error (empirical error or empirical risk). The domain shift problem assumes a target domain with samples drawn from a different distribution.
Figure 1.
Chicken muscle speckle suppression results: (a) speckled acquired tomogram ; (b) ground truth averaged over 901 tomograms; (c) OCT-RNN trained with 100 first columns of chicken muscle; (d) RNN-GAN trained with 100 first columns of chicken muscle and blueberry, ; (e) RNN-GAN trained with 200 columns of chicken decimated by a factor 8/3 in the lateral direction, . System and tissue mismatch: (f) DRNN trained with 100 columns of human retinal image, ; (g) DRNN following lateral decimation of the target input by a factor of 4/3, ; (h) DRNN following lateral decimation of the target input by 8/3, . Scale bars are 200 µm.
Assumption 1 (Speckle Local Ergodicity). Denote by $\mathbf{y}_i$ a patch centered around pixel $i$ of the image $\mathbf{y}$, and by $p(\mathbf{y}_i)$ the probability density of a patch. Under the assumption that pixels in close proximity are a result of shared similar sub-resolution scatterers, we assume ergodicity of the Markov random field (MRF) (e.g., [44]) of patches consisting of pixels in close proximity. In other words, the probability distribution of a group of pixels' values in close spatial proximity is defined by the same density across the entire image. This assumption takes into account that some of these patches correspond to fully developed speckle, some to non-fully developed speckle, and some to a combination of both. Note that the measured pixels' values are correlated. This assumption could be somewhat controversial, particularly in the vicinity of abrupt changes in the signal intensity. However, since our images tend to have a layered structure and the visible range of the PSF is about 7–9 pixels in each direction, we make this assumption.
Definition 1 (Sampling resolution ratio). We define the lateral sampling resolution ratio in pixels as $M_x = \lfloor w_x / \Delta_x \rceil$ and the axial sampling resolution ratio as $M_z = \lfloor w_z / \Delta_z \rceil$, where $w_x, w_z$ are the effective PSF widths and $\lfloor \cdot \rceil$ denotes rounding to the closest integer. That is, in a discrete setting, $M_x$ and $M_z$ are the number of pixels capturing the effective area of the PSF in each direction. The superscripts $t$ and $s$ denote the target and source, respectively.
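Definition 1 amounts to a one-line computation; a small sketch (the example numbers are illustrative, in the spirit of Table 1):

```python
def sampling_resolution_ratio(psf_width_um, sampling_space_um):
    """Pixels spanning the effective PSF width in one direction,
    rounded to the closest integer (Definition 1)."""
    return int(round(psf_width_um / sampling_space_um))

# e.g., a ~8.28 µm lateral waist sampled every 3.06 µm spans ~3 pixels
m_lateral = sampling_resolution_ratio(8.28, 3.06)
```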
Theorem 1 (Domain-Shift Speckle Suppression Theorem). A learned patch-based speckle suppression mapping does not require domain adaptation. However, the output resolution will be determined by the source domain resolution. Mathematically, denote $h^s$, $h^t$ as the discrete PSFs in the source and target domains, respectively, such that
$$h^t = h^s * g, \qquad h^s = h^t * q,$$
where $g$ and $q$ are complementary impulse responses leading from one domain to the other. When applying the trained system to the target input, the output resolution is determined by the source resolution, and the recovered tomogram component is a low-resolution version of the target tomogram. We refer the reader to Appendix A for the proof and a detailed explanation.
If $M_x^t > M_x^s$ or $M_z^t > M_z^s$, then the system's prediction for an input in the target domain may have additional details or artificially enhanced resolution details, which would not naturally occur with other denoising mechanisms. Examples illustrating this phenomenon are shown in Figure 1e,f. Possible remedies: train with a larger analysis patch size, train longer, and upsample (interpolate) the source images (or decimate the target images).
If $M_x^t < M_x^s$ or $M_z^t < M_z^s$, then the network's output is blurred in the corresponding direction (e.g., Figure 1h). Possible remedies: train with a smaller analysis patch size, or downsample (decimate) the training image (or upsample the target images). In this case, the target has details that are smaller (in pixels) than the minimal speckle size of the source; the trained predictor may therefore interpret them as noise and simply smear them out.
Any combination of these relations along the different image axes is possible. For our OCT data, the resolution ratio mostly differs in the lateral direction (see Table 1). Note that, for some OCT systems, the sampling space is below the Nyquist rate. The preprocessing domain adaptation stage can be applied either to the source data or to the target data, interchangeably, depending on the desired target resolution. We note that (6) does not apply to any general pair of PSFs, but in our case study, it is safe to assume that there exist complementary impulse responses that approximately satisfy (6).
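The resampling remedies above can be sketched as follows (linear interpolation is used as a simple stand-in for proper decimation/upsampling filters; the function name and signature are ours):

```python
import numpy as np

def resample_lateral(img, ratio_target, ratio_source):
    """Resample the target image along the lateral axis so that its
    sampling resolution ratio matches the source's: a target ratio
    larger than the source's decimates, a smaller one interpolates."""
    n_rows, n_cols = img.shape
    new_cols = int(round(n_cols * ratio_source / ratio_target))
    x_old = np.arange(n_cols)
    x_new = np.linspace(0, n_cols - 1, new_cols)
    return np.stack([np.interp(x_new, x_old, row) for row in img])
```

For example, a target image whose lateral sampling resolution ratio is twice the source's would be decimated by a factor of 2 laterally before being fed to the source-trained network.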
For simplicity, we assumed a spatially invariant model that does not take into consideration light–matter interactions, optical scattering, and attenuation. It may also be argued that this model is not unique to OCT and could be applied to other modalities. Nevertheless, in practice, the proposed approach is effective and yields perceptually improved results. The above analysis is not restricted to OCT and can be easily modified and applied to other degradation processes and other applications.
4. Patch-Based Few-Shot Learning
The initial RNN setting described in this subsection has previously been employed for seismic imaging [24,45,46]. Hereafter, the mathematical formulation focuses on the settings of the OCT despeckling task. Nonetheless, the model can be applied to a wide range of applications. We emphasize the potential of this framework and expand and elaborate on its application, while connecting it to the theoretical intuition in Theorem 1. We also propose possible upgrades that further enhance the results in our case study, as shown in Figure 2.
Most OCT images have a layered structure and exhibit strong relations along the axial and lateral axes. RNNs can efficiently capture those relations and exploit them. That said, as demonstrated below, the proposed framework is not restricted to images that exhibit a layered structure nor to the specific RNN-based encoder–decoder architecture.
Definition 2 (Analysis patch [46]). We define an analysis patch as a 2D patch enclosing time (depth) samples of consecutive neighboring columns of the observed image. An analysis patch is associated with a pixel in the output image. To produce a point in the estimated image, we set the input to the RNN as an analysis patch; each time-step input is a group of neighboring pixels of the same corresponding time (depth). We set the size of the output vector to one expected pixel, such that the output is expected to be the corresponding reflectivity segment. Lastly, we ignore the first values of the output and set the predicted reflectivity pixel as the last one. The analysis patch moves across the image and produces all predicted points in the same manner. Each analysis patch and a corresponding output segment (or patch) constitute an instance for the net. The size and shape of the analysis patch define the geometrical distribution of data samples for inference.
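A minimal sketch of analysis-patch extraction (the zero padding at the borders, the symmetric window, and all names are our assumptions):

```python
import numpy as np

def analysis_patch(img, i, j, half_depth, half_width):
    """The 2D analysis window of neighboring depth samples and A-lines
    associated with output pixel (i, j), zero-padded at the borders."""
    padded = np.pad(img, ((half_depth, half_depth),
                          (half_width, half_width)))
    return padded[i : i + 2 * half_depth + 1,
                  j : j + 2 * half_width + 1]

img = np.arange(25.0).reshape(5, 5)
patch = analysis_patch(img, 2, 2, 1, 1)  # 3x3 window centered on img[2, 2]
```

Sliding this window over every pixel produces one training or inference instance per output point.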
4.1. Despeckling Reformulated as Image Deblurring
Despite the low-frequency bias of over-parametrized DNNs [47], previous works [46] demonstrated the ability of the proposed framework to promote high frequencies and super-resolution. To explore this possibility, we recast the framework described above as a deblurring task. This is achieved simply by applying a low-pass filter to the input speckled image and, then, training the system to deblur the image. Namely, given a noisy image $\mathbf{y}$, the analysis patches are extracted from the filtered input image $H\mathbf{y}$, where $H$ is a convolution matrix of a 2D low-pass filter. We refer to this denoiser as the deblurring RNN (DRNN).
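The DRNN input preparation can be sketched as follows (a [7, 7] Gaussian kernel as in Section 5; the standard deviation and the circular FFT-based convolution are our assumptions):

```python
import numpy as np

def gaussian_kernel(size=7, sigma=2.0):
    """Normalized 2D Gaussian low-pass kernel."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax**2) / (2.0 * sigma**2))
    k = np.outer(g, g)
    return k / k.sum()

def low_pass(img, kernel):
    """Circular 2D convolution via FFT; the DRNN is then trained to
    recover the clean image from this blurred input."""
    kh, kw = kernel.shape
    padded = np.zeros_like(img, dtype=float)
    padded[:kh, :kw] = kernel
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(padded)))
```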
4.2. Averaging Patches
Given a noisy image $\mathbf{y}$, an alternative approach is to decompose it into overlapping patches, denoise every patch separately, and, finally, combine the results by simple averaging. This approach of averaging overlapping patch estimates is common in patch-based algorithms [48,49], such as the expected patch log-likelihood (EPLL) [50]. It also improves the SNR, since for every pixel we average a set of different estimates. Mathematically speaking, the input is still an analysis patch; however, in this configuration, the output is no longer a 1D segment but a corresponding output 2D patch (see Figure 3).
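The overlap averaging itself is straightforward; a sketch (our helper, not the paper's code): accumulate each denoised patch at its location and divide by the per-pixel hit count.

```python
import numpy as np

def average_overlapping(patches, positions, out_shape):
    """Combine per-patch estimates: every pixel is the mean of all
    patch estimates that cover it, which also raises the SNR."""
    acc = np.zeros(out_shape)
    count = np.zeros(out_shape)
    for patch, (r, c) in zip(patches, positions):
        h, w = patch.shape
        acc[r : r + h, c : c + w] += patch
        count[r : r + h, c : c + w] += 1
    return acc / np.maximum(count, 1)

# two 2x2 estimates overlapping in the middle column of a 2x3 image
out = average_overlapping(
    [np.full((2, 2), 2.0), np.full((2, 2), 4.0)],
    [(0, 0), (0, 1)],
    (2, 3),
)
```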
4.3. Incremental Generative Adversarial Network
Image restoration algorithms are typically evaluated by some distortion measure (e.g., PSNR, SSIM) or by human opinion scores that quantify the perceived perceptual quality. It has long been established that distortion and perceptual quality are at odds with each other [51]. As mentioned above, previous works adopt two-stage training [17,18]: the first stage trains the generator with a content loss, while the second stage, initialized by the generator's pre-trained weights, trains both a generator G and a discriminator D. Therefore, we propose adding a second stage of training with a combined MSE and adversarial loss,
$$\mathcal{L}_G = \mathcal{L}_{\mathrm{MSE}} + \lambda \mathcal{L}_{\mathrm{adv}},$$
where $\lambda$ is a constant balancing the losses. The generator G remains a patch-to-patch RNN-based predictor (with or without averaging patches). To this end, we design and showcase a patch discriminator of extremely low complexity, which consists simply of two fully connected layers. We refer to this approach as RNN-GAN.
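A sketch of the second-stage objective and the two-layer patch discriminator (the layer shapes, the value of the balancing constant `lam`, and the non-saturating adversarial form are our assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator(patch, W1, b1, W2, b2):
    """Extremely low-complexity patch discriminator: two fully
    connected layers (ReLU, then sigmoid) on the flattened patch."""
    h = np.maximum(patch.ravel() @ W1 + b1, 0.0)
    return sigmoid(h @ W2 + b2)

def generator_loss(fake, target, d_fake, lam=1e-3):
    """Second-stage objective: MSE content term plus adversarial
    term, balanced by the constant lam."""
    mse = np.mean((fake - target) ** 2)
    adv = -np.log(d_fake + 1e-12)
    return mse + lam * adv
```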
The above framework can be generalized to 3D images using a 3D analysis volume, defined by the number of A-lines and B-scans taken into account along the lateral axes and the number of depth samples along the axial axis. It can be defined to associate with a point in its center or in an asymmetrical manner. In a similar manner to the 2D configuration, for each output voxel, the analysis volume is an instance input to the RNN. Moving the analysis volume along the 3D observation image produces the entire 3D predicted despeckled volume.
The underlying assumption of the proposed approach is that the mapping from each input patch to an output point or patch is statistically unchanging; that is, the data are stationary. In practice, this assumption is debatable and does not always hold. Yet, assuming spatial invariance is helpful for introducing the major processes affecting the image quality into the model and is standard in the image-processing literature. As presented in Section 5, in practice, this simplification does not necessarily lead to degraded results in comparison with the despeckled ground truth. The learned mapping is able to effectively capture the imaging degradation process despite its inherent statistical complexity.
4.4. Few-Shot U-Net
As is known, the U-Net is a convolutional neural network that was developed for biomedical image segmentation [32] and has achieved state-of-the-art results in numerous applications. One of the U-Net's advantages is that it is flexible with respect to its input size. Inspired by the above approach, we further propose a patch-based one-shot learning U-Net. In other words, the U-Net is trained with random patches cropped from a single input–output pair (or a few images). Then, the U-Net is applied to a larger image, as desired by the user.
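One-shot patch-based training then reduces to sampling aligned random crops from the single pair; a sketch (all names are ours):

```python
import numpy as np

def random_crops(noisy, clean, patch_size, n, rng):
    """Draw n aligned random patches from a single speckled image and
    its ground truth; these pairs form the entire training set."""
    H, W = noisy.shape
    pairs = []
    for _ in range(n):
        r = int(rng.integers(0, H - patch_size + 1))
        c = int(rng.integers(0, W - patch_size + 1))
        pairs.append((noisy[r : r + patch_size, c : c + patch_size],
                      clean[r : r + patch_size, c : c + patch_size]))
    return pairs

rng = np.random.default_rng(0)
pairs = random_crops(np.zeros((64, 64)), np.ones((64, 64)), 16, 8, rng)
```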
Figure 3.
Illustration of the proposed patch-to-patch RNN encoder–decoder.
5. Experimental Results
Here, we show examples of our proposed few-shot domain-aware supervised learning despeckling approach with experimental OCT data. We investigated three challenging one-shot learning cases: (1) matching tissue and acquisition system, where we used one image (or part of an image) for training and other images of the same tissue acquired by the same system for testing; (2) tissue-type mismatch; (3) tissue-type and acquisition-system mismatch.
Table 1 presents the acquisition parameters, namely: the axial and lateral sampling spaces in tissue; the effective number of measured spectral points vs. the total number of FFT points after zero padding; the waist in µm; the axial and lateral sampling resolution ratios in pixels; and the cropped region of interest (ROI) image sizes.
For all experiments, the number of neurons was fixed; increasing it did not improve the results significantly but increased the training time. The analysis patch size can affect the results' higher frequencies: larger patches create a frequency bias in favor of lower frequencies. For the DRNN, we used a Gaussian filter of size [7,7]. For the RNN-GAN, we employed overlapping-patch averaging to promote additional noise reduction. As mentioned above, our discriminator consists solely of two fully connected layers. At the second adversarial stage, the generator's loss was modified to include a content loss term and an adversarial loss term. We used the Adam optimizer [52] with $\beta_1 = 0.5$ and $\beta_2 = 0.9$.
Table 1.
Acquisition system parameters.
| | Chicken and Blueberry | Chicken Skin | Cucumber | Retina | Cardiovascular-I [43] | Cardiovascular-II [53] |
|---|---|---|---|---|---|---|
| Axial sampling space (µm) | 6 | 4.78 | 4.78 | 3.75 | 4.84 | 4.43 |
| Spectral points (measured/FFT) | 1600/2048 | 844/1024 | 844/1024 | 1024/2048 | 768/1024 | 800/1024 |
| Axial sampling resolution ratio | 3 | 3 | 3 | 3 | 3 | 3 |
| Lateral sampling space (µm) | 3.06 | 2.5 | 8 | 9 | ∼12.2 | ∼24.4 |
| Waist (µm) | 8.28 | 4.14 | 8.28 | 18 | 30 | 30 |
| Lateral sampling resolution ratio | 3 | 2 | 1 | 2 | 2 | 1 |
| ROI image size | | | | | | |
As the ground truth for training and testing, we used hardware-based speckle mitigation obtained by dense angular compounding, in a method similar to [8]. That is, the ground truth images for the chicken muscle, blueberry, chicken skin, and cucumber samples presented in Figure 1b and Figures 6–8b were acquired by an angular compounding (AC) system using sample tilting in combination with a model-based affine transformation to generate speckle-suppressed ground truth data [54]. Note that AC via sample tilting is not possible for in vivo samples.
We used retinal data acquired by a retinal imaging system similar to [55]. As the ground truth for training and testing, we used NLM-based speckle-suppressed images [12]. Note that NLM is considered relatively slow (about 23 s per B-scan). The images were cropped to the ROI size listed in Table 1.
Finally, we tested our trained systems with OCT data of coronary arteries acquired with two imaging systems. For these datasets, no ground truth is available. The first dataset, referred to as Cardiovascular-I [43], was acquired with in-house-built catheters for human cadaver imaging. The second human-heart coronary dataset, Cardiovascular-II [53], was acquired with a second, clinical system, where there is usually a guidewire in place. Since imaging time is critical, only 1024 A-lines per rotation were acquired.
Figure 1, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9 and Figure 10 depict the obtained despeckled predictions for the ex vivo samples, as well as for the in vivo retinal data and the intravascular OCT images, employing four methods: RNN, DRNN, RNN-GAN, and U-Net [32]. Please zoom in on-screen to see the differences. The U-Net has roughly eight times the RNN's number of parameters and trains with cropped patches. Note that all four proposed methods were trained with either one example pair of a speckled image and its corresponding ground truth, a cropped part of an image, or very few examples. Visually observing the results in the different scenarios, overall, the proposed approach efficiently suppresses speckle while preserving and enhancing detailed visual structure.
To test the DRNN's performance in different domains, we trained it with 100 columns of an acquired in vivo human retinal cross-section, presented in Figure 4a. Figure 4b presents the ground truth obtained as described in [12]. As can be observed, the DRNN approach generalizes well, both with matching tissues and imaging systems and in cases of tissue and system mismatch. The DRNN produces good visual quality and efficiently suppresses speckle, even without preprocessing domain adaptation. As we theoretically established, applying the source-trained system to a target with a lower lateral sampling resolution ratio indeed smooths the result (e.g., Figure 8e), whereas a target input with a higher lateral sampling resolution ratio results in a detailed structure with minor speckle residuals (e.g., Figure 4e). Visually observing the other methods' results leads to similar conclusions.
We quantitatively evaluated the proposed approaches by comparing the peak-signal-to-noise ratio (PSNR) and structural similarity index (SSIM) of their results with respect to the images assumed as ground truth.
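For reference, the PSNR computation used for such comparisons is the standard formula (the peak value depends on the image scaling and is an assumption here):

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB of an estimate against the
    image assumed as ground truth."""
    mse = np.mean((np.asarray(reference) - np.asarray(estimate)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)
```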
Table 2 compares the average PSNR and SSIM scores of the above four one-shot learning methods with matching system and tissue. As can be seen, a significant increase in the PSNR and SSIM scores is achieved by all methods. RNN-GAN and U-Net have the highest scores in most cases. The U-Net usually yields the highest scores; yet, as can be observed in Figure 6e, it can produce unexpected visible artifacts in some cases. The U-Net has more capacity; therefore, it tends to memorize the training image better but generalize worse.
Figure 4.
Retinal data speckle suppression: (a) cross-sectional human retina in vivo ; (b) despeckled (NLM) image used as the ground truth; (c) DRNN trained with 100 columns of retinal image . System mismatch: (d) DRNN following lateral decimation of the target input by a factor of 2, ; (e) DRNN following lateral interpolating of the input, . System and tissue mismatch: (f) RNN-GAN trained with 100 first columns of chicken muscle and blueberry, ; (g) RNN-GAN trained with 200 last columns of blueberry; (h) U-Net trained with blueberry image of size , . Scale bars, 200 µm.
Note that PSNR and SSIM scores do not always reliably represent the perceptual quality or desired features of the images [51]. An image that minimizes the mean distance in any metric will necessarily suffer from a degradation in perceptual quality (the "perception–distortion" trade-off). Denoising is inherently an ill-posed problem; that is, a given input may have multiple correct solutions. Minimum-mean-squared-error (MMSE) solutions are inclined to average these possible correct outcomes. At low SNRs, an averaging strategy often leads to output images with blurry edges and unclear fine details.
Keep in mind that the AC-despeckled images are the result of averaging numerous images, whereas our system's predictions rely solely on a single observation; therefore, the reconstructions are notably more faithful to the single observed speckled image. Furthermore, although the AC images are referred to as the ground truth, they may suffer from inaccuracies related to the stage tilting and its processing. The NLM ground truth may also suffer from residual speckle and blurring. As can be seen in Figure 4 and Figure 5, the proposed models were able to remove some of these artifacts.
It is worth noting that, despite the growing interest in supervised learning methods for OCT despeckling, many competing (non-few-shot) methods do not provide open access to their training datasets and results. As most of these methods are trained with compounded data or NLM results, and as the goal of this study is to explore few-shot learning and domain awareness rather than to achieve state-of-the-art results, we directly compare our results with the assumed ground truth.
Figure 5.
Retinal data speckle suppression: (a) cross-sectional human retina in vivo, ; (b) despeckled (NLM) image used as the ground truth; (c) DRNN trained with 100 columns of retinal image ; (d) RNN-GAN trained with 100 first columns of chicken muscle and blueberry, ; (e) U-Net trained with retinal image of size , . Scale bars are 200 µm.
Figure 6.
Blueberry speckle suppression results: (a) speckled acquired tomogram; (b) despeckled via angular compounding the used ground truth; (c) RNN-GAN trained with 200 last columns of blueberry, ; (d) DRNN trained with 100 columns of human retinal image, ; (e) U-Net trained with chicken skin image, . Scale bars are 200 µm.
Table 2.
Average PSNR/SSIM obtained for different methods and datasets with training and testing matching acquisition systems and tissue types. Average scores are over 100 tomograms.
Dataset | Input | RNN-OCT | DRNN | RNN-GAN | U-Net |
---|---|---|---|---|---|
Retina | 24.87/0.46 | 33.60/0.87 | 30.46/0.82 | 32.24/0.86 | 33.66/0.89 |
Chicken | 24.29/0.29 | 27.97/0.61 | 29.41/0.63 | 30.81/0.74 | 32.50/0.77 |
Blueberry | 24.98/0.48 | 27.15/0.63 | 27.57/0.69 | 28.18/0.76 | 28.09/0.76 |
Chicken Skin | 26.12/0.44 | 29.64/0.71 | 29.59/0.69 | 30.49/0.78 | 30.26/0.77 |
Cucumber | 25.91/0.59 | 27.31/0.73 | 28.69/0.73 | 28.85/0.79 | 28.52/0.81 |
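The PSNR scores above follow the standard definition; a minimal NumPy sketch is given below (the function name and the 8-bit data range are our assumptions, and SSIM, which requires a windowed computation, is omitted):

```python
import numpy as np

def psnr(reference, test, data_range=255.0):
    """Peak signal-to-noise ratio between a ground-truth tomogram and a despeckled one."""
    # Mean squared error in double precision to avoid integer overflow.
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)
```

In practice, the reference is the compounded or NLM ground truth and the test image is the network output, both on the same intensity scale.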
Table 3 provides quantitative scores for the proposed domain adaptation approach for various source–target pairs, differing in acquisition system and tissue type, for RNN-GAN and U-Net. Notably, both approaches yield a significant increase in PSNR and SSIM. Note that the images differ not only in their sampling resolution ratio, but also in the nature of the ground truth used for training: the AC images have a different texture and visual appearance than the NLM images. Regardless of the PSNR and SSIM scores, the trained model often tends to adopt the visual characteristics of the source data. This tendency may even be perceived as an advantage in the absence of a ground truth, as can be seen in Figure 4g. The observed speckled image may originate from many plausible reconstructions with varying textures, fine details, and semantic information [56]. In this sense, the results offer a user-dependent degree of freedom. Unfortunately, in our experiments, the domain randomization strategy [57] failed to generalize well.
Figure 7.
Chicken skin speckle suppression results: (a) speckled acquired tomogram; (b) AC ground truth, averaged over 60 tomograms; (c) DRNN trained with 100 columns of the human retinal image; (d) RNN-GAN trained with the first 100 columns of chicken muscle and blueberry; (e) RNN-GAN trained with the last 200 columns of blueberry. Scale bars are 200 µm.
Figure 8.
Cucumber speckle suppression results: (a) speckled acquired tomogram; (b) ground truth averaged over 301 tomograms; (c) DRNN trained with the human retina image; (d) RNN-GAN trained with 200 columns of the blueberry image decimated in the lateral direction by a factor of 8/3; (e) RNN-GAN trained with 200 columns of blueberry and chicken. Scale bars are 200 µm.
Table 3.
Domain-aware PSNR/SSIM obtained for different methods and datasets, with a mismatch between training and testing acquisition systems and tissue types, and with preprocessing adapting the sampling resolution ratio. * denotes cases where domain adaptation was not applied.
| Target Data | Source Data | RNN-GAN | U-Net |
|---|---|---|---|
| Retina | Blueberry and Chicken | 30.86/0.87 | 31.80/0.85 |
| Chicken | Chicken Skin | 31.33/0.69 | 31.08/0.68 |
| Chicken | Retina | 29.08/0.63 | 31.89/0.70 |
| Blueberry | Retina | 27.78/0.69 | 28.33/0.77 |
| Chicken Skin | Blueberry and Chicken | 30.68/0.76 | 30.43/0.71 |
| Chicken Skin | Retina | 31.51/0.76 * | 30.88/0.77 * |
| Cucumber | Blueberry | 27.84/0.75 | 27.61/0.72 * |
| Cucumber | Retina | 28.71/0.77 * | 28.11/0.73 |
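The preprocessing that adapts the sampling resolution ratio amounts to resampling the source B-scans along the lateral axis, so that the speckle grain spans a comparable number of pixels in source and target (e.g., decimation by a factor of 8/3 for the blueberry-to-cucumber pair). A minimal NumPy sketch, assuming linear interpolation; the helper name `match_lateral_sampling` is our own, and the paper's exact decimation/interpolation scheme may differ:

```python
import numpy as np

def match_lateral_sampling(bscan, ratio):
    """Resample a B-scan along the lateral (column) axis by `ratio`.
    A ratio of 3/8 decimates the lateral sampling by a factor of 8/3;
    a ratio above 1 interpolates to a denser lateral grid."""
    depth, width = bscan.shape
    new_width = max(1, int(round(width * ratio)))
    old_x = np.arange(width)
    new_x = np.linspace(0, width - 1, new_width)
    # Linear interpolation row by row (each row is one depth index).
    return np.stack([np.interp(new_x, old_x, row) for row in bscan])
```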
The proposed model offers substantial training time efficiency. The first, content-loss training stage requires 5–12 epochs, depending on the analysis patch size, batch size, and training image size. The adversarial-loss training stage takes about 10–30 epochs. The total training time is 5–25 s on a laptop GPU; training without the adversarial stage typically takes about 12 s. As a rule of thumb, training for too long can cause over-fitting and blurry images, so early stopping is recommended to prevent the model from over-fitting the single image used for training. Training times were measured on a standard laptop workstation equipped with a 12th Gen Intel(R) Core(TM) i7-12800H CPU at 2.40 GHz, 32.0 GB RAM, and an NVIDIA RTX A2000 8 GB Laptop GPU. Training can also be performed on a CPU in a few minutes. Inference time is 110.5 ms per B-line. U-Net training is usually longer, taking about 5.76 min (for 16 epochs). To the best of our knowledge, these results represent the state of the art in real-time training with minimal available training data.
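The recommended early stopping can be realized as a simple patience counter on the training loss; a minimal, framework-agnostic sketch (the class name, patience value, and tolerance are our own illustrative choices):

```python
class EarlyStopper:
    """Signal a stop when the loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Call once per epoch; returns True when training should stop."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `if stopper.step(epoch_loss): break` would terminate training before the network starts memorizing the single training image.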
Figure 9.
Cardiovascular-I speckle suppression results (in Cartesian coordinates): (a) speckled acquired tomogram; (b) DRNN trained with 100 columns of the human retinal image; (c) RNN-GAN trained with 200 columns of blueberry and chicken images; (d) U-Net trained with the retinal data image. Scale bar is 500 µm.
Figure 10.
Cardiovascular-II speckle suppression results (in Cartesian coordinates): (a) cropped speckled acquired tomogram; (b) OCT-RNN trained with the first 100 columns of chicken muscle; (c) DRNN trained with 100 columns of the human retinal image; (d) RNN-GAN trained with decimated retinal data; (e) RNN-GAN trained with interpolated retinal data; (f) U-Net trained with the retinal data image. Scale bar is 200 µm.
Lastly, Figure 11 presents a visual comparison of our proposed one-shot RNN-GAN and U-Net results with SM-GAN [17], which was trained with 3900 example pairs of speckled and despeckled OCT images. The retinal SD-OCT images and the corresponding ground truths are borrowed from the dataset in [58]. As can be seen, our approach reduces speckle while preserving perceptual quality and contrast. The comparison showcases good generalization despite training with only a single image, or part of an image.