4. Iterative Inverse Problem Solution
We consider an iterative optimization algorithm denoted by $Q$. The statement of our problem can be reduced to the iterative form
$$x_{k+1} = Q(x_k),$$
where $x_k$ denotes the estimate of the unknown parameters at iteration $k$. Naturally, the promise of the algorithm is to bring us closer to the solution after each step. The proof of convergence of any specific algorithm ensures that the iterates approach the solution as the iteration count grows, and can give even more information on the speed of convergence by deriving a theoretical bound on the error as a function of the iteration number.
In general, this is a hard formula to derive, and it is even more difficult when dealing with complex problems like DOT, with many multidimensional parameters. In practice, the convergence speed is influenced by many factors, related both to the algorithm itself and to the configuration of the problem (physical reality and constraints). A numerical approach based on simulation and statistical analysis proves very useful in tackling these kinds of hard situations, and can help us gain more insight into the choice of optimization algorithm and other practical questions. Since we consider in our study a family of optimization algorithms based on gradient descent, we can point out the learning rate hyperparameter as the main factor of interest in this context.
From a practical point of view, the behavior of the iterates depends on the structure of the problem, and, consequently, the number of iterations needed to converge depends on different factors such as the nature of the inclusions (their number, form, distribution, ...), the properties of the medium, and all the other parameters that shape the forward problem stated in the previous section. In addition, it depends on the choice of the regularization and of the initial guess.
Table 1 below gives an example of the parameters and hyperparameters that can be of interest in studying the practical optimization problem (including the iterative algorithm hyperparameters).
A more focused statement of the iterative optimization algorithm, for the following study in the present paper, can be formulated as
$$x_{k+1} = Q_{\mathrm{Adam}}\big(x_k;\ x_0,\ n,\ \eta,\ \theta\big),$$
where $Q_{\mathrm{Adam}}$ describes the adaptive moment algorithm, $x_0$ the initial guess, $n$ the number of inclusions, $\eta$ denotes the learning rate hyperparameter, and $\theta$ represents all the remaining parameters.
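For concreteness, the following minimal sketch shows a single update step of the three adaptive-moment variants compared later in this paper. It is written in Python/NumPy; the function and variable names (adam_family_step, state, variant) are illustrative rather than taken from our implementation, and the default hyperparameter values are the commonly used ones, not necessarily those of our experiments.

```python
import numpy as np

def adam_family_step(x, grad, state, lr=0.01, beta1=0.9, beta2=0.999,
                     eps=1e-8, variant="adam"):
    """One update step of Adam, Nadam, or AMSGrad (illustrative sketch)."""
    state["t"] += 1
    t = state["t"]
    # First and second moment estimates (exponential moving averages).
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** t)   # bias-corrected second moment

    if variant == "amsgrad":
        # AMSGrad keeps the running maximum of the second moment.
        state["v_max"] = np.maximum(state["v_max"], v_hat)
        denom = np.sqrt(state["v_max"]) + eps
        update = m_hat
    elif variant == "nadam":
        # Nadam adds a Nesterov-style look-ahead to the first moment.
        denom = np.sqrt(v_hat) + eps
        update = beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** t)
    else:  # plain Adam
        denom = np.sqrt(v_hat) + eps
        update = m_hat

    return x - lr * update / denom

def init_state(shape):
    """Zero-initialized optimizer state."""
    return {"t": 0, "m": np.zeros(shape), "v": np.zeros(shape),
            "v_max": np.zeros(shape)}
```

In this notation, the argument lr plays the role of the learning rate $\eta$ and x that of the current iterate $x_k$.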
Hereafter, we restrict our attention to the number of inclusions $n$, the learning rate $\eta$, and the initial guess $x_0$. In our implementation of the optimization problem, we used an objective function $C$ defined as the least-squares misfit between the measured data and the prediction of the forward model described in the previous section, complemented, when applicable, by the chosen regularization term. We define the number of iterations to convergence by
$$N_{\mathrm{conv}} = \min\{\, l \in \mathbb{N} \ :\ C(x_l) < \epsilon \,\}.$$
This formulation guarantees that our optimization algorithm stops whenever $C(x_l)$ is lower than $\epsilon$ or $l$ is greater than $L$, where $\epsilon$ and $L$ are the parameters of the stopping criterion used in this study: iteration stops either when the cost function falls below $\epsilon$ or when the number of iterations exceeds $L$.
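The stopping rule can be expressed as a short driver loop. The sketch below is illustrative: cost and step stand for a generic objective evaluation and a generic optimizer step (hypothetical placeholders), and eps and max_iter play the roles of $\epsilon$ and $L$.

```python
def run_until_converged(x0, cost, step, eps, max_iter):
    """Iterate until the cost drops below eps or the iteration budget is spent.

    Returns the final iterate and the number of iterations performed, which is
    the quantity N_conv studied below when convergence occurs.
    """
    x = x0
    for l in range(1, max_iter + 1):
        x = step(x)
        if cost(x) < eps:          # convergence criterion met
            return x, l
    return x, max_iter             # budget exhausted (counted as divergence)
```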
The aim of our numerical statistical study of convergence speed can then be reduced to the study of the properties of the probability distribution of the number of iterations to convergence, using simulation tools. In the following study, we restrict our attention to the comparison of three algorithms based on the adaptive moment procedure. For more details, we refer the reader to the next section.
6. Results
In this part of our study, we characterize the convergence rate of the three algorithms, compare their convergence/divergence behavior in relation to the simulation parameters, and, finally, examine the quality of the reconstructions produced by the three optimizers.
First of all, in the context of the present analysis, we define divergence as the state of a running optimization in which the error minimization has not improved for more than 200 iterations in total.
The first subject of focus is the convergence rate of each algorithm. Let $X_{A}$, $X_{N}$, and $X_{Am}$ be three random variables representing the state of convergence of the Adam, Nadam, and AmsGrad optimizers. These variables take the value 0 or 1, depending on whether the corresponding algorithm diverges or converges, such that
$$X_{O} = \begin{cases} 1 & \text{if optimizer } O \text{ converges},\\ 0 & \text{if optimizer } O \text{ diverges}, \end{cases} \qquad O \in \{A, N, Am\}.$$
The simulation provided us with three samples of independent and identically distributed realizations of $X_{A}$, $X_{N}$, and $X_{Am}$. To statistically estimate the rates of convergence, namely $p_{A}$, $p_{N}$, and $p_{Am}$, we use the three estimators
$$\hat{p}_{O} = \frac{\#\{\, i\ :\ X_{O}^{(i)} = 1 \,\}}{N_{O}}, \qquad O \in \{A, N, Am\},$$
where $\#$ denotes the count function and $N_{O}$ the sample size, in order to construct confidence intervals based on the large-sample normal approximation, as presented in
Table 2.
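As a sketch of how such an interval can be obtained, the estimator $\hat{p}_O$ and the large-sample normal-approximation interval $\hat{p}_O \pm z_{1-\alpha/2}\sqrt{\hat{p}_O(1-\hat{p}_O)/N_O}$ can be computed as follows; the function name and the commented usage are illustrative and do not reproduce the data behind Table 2.

```python
import numpy as np
from scipy import stats

def convergence_rate_ci(indicators, alpha=0.05):
    """Estimate p = P(convergence) and its normal-approximation CI."""
    x = np.asarray(indicators, dtype=float)   # 0/1 convergence indicators
    n = x.size
    p_hat = x.mean()                          # \hat{p} = #{X = 1} / N
    z = stats.norm.ppf(1 - alpha / 2)
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (p_hat - half_width, p_hat + half_width)

# Illustrative usage with synthetic indicators (not the study's data):
# p_adam, ci_adam = convergence_rate_ci(np.random.binomial(1, 0.8, size=500))
```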
To shed light on the influence of the simulation parameters on the convergence rates, we run a logistic regression to estimate the conditional distributions $P(X_{O} \mid n, \eta, x_0)$, where $O \in \{A, N, Am\}$, $n$ denotes the number of anomalies in the image, and the remaining variables $\eta$ and $x_0$ are the learning rate and the initial guess mentioned earlier. The result of this procedure is presented in
Table 3, which lists the p-values for the statistical significance of each regression parameter.
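A minimal sketch of such a logistic regression, here using statsmodels (which reports a p-value per coefficient); the column names n_inclusions, learning_rate, init_error, and converged are illustrative placeholders for the simulation variables, not the names used in our code.

```python
import pandas as pd
import statsmodels.api as sm

def convergence_logit(df: pd.DataFrame):
    """Fit P(converged = 1 | n_inclusions, learning_rate, init_error).

    df is expected to hold one row per simulation run, with a binary
    'converged' column and the three explanatory variables.
    """
    X = sm.add_constant(df[["n_inclusions", "learning_rate", "init_error"]])
    model = sm.Logit(df["converged"], X).fit(disp=0)
    # model.params holds the logistic coefficients, model.pvalues their
    # significance levels (the kind of quantity summarized in Table 3).
    return model.params, model.pvalues
```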
From
Table 3, we conclude that the main parameter with a significant influence on the convergence of these algorithms is the learning rate hyperparameter. Since the logistic coefficient of $\eta$ is positive for Adam and Nadam, the larger the learning rate, the more likely these algorithms are to converge. This statement is reversed for AmsGrad, for which we observe a negative coefficient for the learning rate. AmsGrad is also negatively impacted by the number of anomalies in the image.
Running our simulation experiment provided us with 1340 convergent instances in total (that is, instances for which the corresponding convergence indicator equals 1). To evaluate the comparative performance of the optimizers in terms of speed of convergence, as measured by the number of iterations taken by each optimizer to reach the solution, we conduct a statistical analysis on the generated data, comparing first the speed globally between optimizers and then relating it to the simulation variables, namely the initial guess error, the choice of learning rate, and the number of anomalies in the image. In addition, the influence of these variables on reconstructed image quality is discussed; PSNR and SSIM scores are calculated for each simulation instance, and we kindly refer the reader to the later discussion of reconstruction quality in this paper for more information on these scores.
A number of statistical methods have been applied and results are examined to describe the convergence speed behavior of each algorithm when applied to the inverse problem of DOT.
Image reconstruction in optical tomography is an ill-posed nonlinear inverse problem. Algorithms based on gradient descent offer no guarantee of converging to the global minimum when the optimization problem at hand has local minima; the convergence point depends heavily on the choice of the starting point of the optimization, and, generally, these algorithms converge (depending also on the learning rate) to the local minimum nearest the initial starting point.
In this section, we address the optimization problem (image reconstruction) from the perspective of speed of convergence, which is one of the most important matters in the practical use of DOT in clinical applications, rather than from the perspective of the sensitivity of the algorithms to the choice of the initial guess and their efficiency in finding the global minimum, which is the other important practical issue in applying DOT. This second perspective is equally relevant and without a doubt deserves particular attention and further analysis, but within the scope of the current paper it remains an open question to follow up, as our randomized simulation design was focused on controlling the factors that influence speed of convergence. The same approach as in this work could be used to quantify, statistically speaking, the efficiency and the sensitivity of reaching the global minimum as a function of problem factors, but this would obviously require redesigning the simulation to generate data suitable for that substantially different analysis objective.
The “blindness” of gradient descent-based algorithms toward the global or local character of the reached optimum is an inherent property, because the gradient is a local concept and by itself carries only local information about the objective function, which makes these algorithms very sensitive to the choice of learning rate and initialization. The adaptive moment features add to this picture only some amount of “memory” of the recent gradients.
A rough observation that can be mentioned here is that, in our generated sample data, the convergent instances for Nadam and Adam most of the time reached the global minimum, but we cannot really draw any statistical evidence from this naïve observation because our randomized simulation design does not support such an analysis.
First of all, we check the distributions of the number of iterations (speed) for normality, in the hope of being able to harness the large body of powerful parametric statistical methods in the literature that rely on the (workhorse) normal distribution.
The probability distributions of the speed of convergence and of the log of the speed of convergence are shown in the QQ plots of
Figure 2a,b, respectively. From these two graphs, it clearly appears that these distributions are very far from being reasonably considered normally or log-normally distributed. This is not a surprising fact, knowing that these distributions are not symmetric to begin with and look (strongly) skewed, but we wanted to exclude the possibility of any approximate (left-truncated) normal distribution.
Confirming this visual observation, the results of running Shapiro–Wilk normality tests on the three data samples are listed in
Table 4. From
Table 4, we conclude that the number of iterations for the different optimizers deviates significantly from being normally distributed, and there is very little evidence, if any at all, supporting normality. We did not test the goodness of fit of other density functions such as the Gumbel, Fréchet, and Weibull distributions, even though the look of the distributions may suggest this family of extreme value distributions (EVD), mainly for two reasons:
First, those EVDs, even if approximately fitted to our empirical distributions, would not, in our best judgment, provide any advantage, considering that the exact nature of the distribution is not our main goal in itself; our interest is rather in the distributions’ locations, while all of the well-known parametric statistical methods for this purpose rest on the assumption that the samples come from an (approximately) normal distribution.
Second, since we stopped the optimization iterations at 200, as mentioned above, we automatically lost information about the distribution in the extreme left part of the tail (which, according to the estimates in
Table 2, represents a non-negligible fraction of the population for the three algorithms). This fact would certainly (and heavily) impact the estimation of any EVD parameters and, consequently, would reduce the power of any parametric test based on such inherently biased and grossly approximate fits, which would minimize the comparative advantage of an eventual parametric method over a non-parametric alternative.
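For reference, the Shapiro–Wilk check reported in Table 4 can be reproduced along the following lines with SciPy; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

def normality_check(n_iterations):
    """Shapiro-Wilk tests of normality for a sample of iteration counts,
    on the raw scale and on the log scale."""
    stat_raw, p_raw = stats.shapiro(n_iterations)
    stat_log, p_log = stats.shapiro(np.log(n_iterations))
    # Small p-values are evidence against the null hypothesis of
    # (log-)normality.
    return (stat_raw, p_raw), (stat_log, p_log)
```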
Following the arguments discussed above, we use non-parametric statistical approaches to recover further information about the performance of the three optimizers from the data, and, since the exact distributions are not well defined, we use the empirical cumulative distribution as a legitimate approximation.
From the superposition of the three optimizers’ empirical density and cumulative distribution functions of the speed of convergence, shown in
Figure 3a,b, respectively, we note differences in the central tendencies of the speed of convergence of the three optimizers, and we remark that the minimization of the objective function converges faster with the AmsGrad algorithm than with the other two. To gain more credible evidence about these preliminary raw observations, we conducted a Kruskal–Wallis test [24] to detect any significant difference in means among the three optimizers. The results of the tests are included in the box plot shown in
Figure 4, together with the p-values. We can conclude with high confidence that there is a significant difference between the speeds of convergence of the three optimizers (see the p-values in Figure 4). Comparing the mean numbers of iterations between each pair of algorithms individually, and especially between Adam and AmsGrad, which look very close mean-wise, we conclude that there is a significant difference between these two groups as well.
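A sketch of these comparisons with SciPy, assuming three arrays of iteration counts (one per optimizer); the names are illustrative.

```python
from scipy import stats

def compare_optimizer_speeds(iters_adam, iters_nadam, iters_amsgrad):
    """Kruskal-Wallis H-test across the three speed samples, plus a
    pairwise rank-based comparison of Adam and AmsGrad."""
    h_stat, p_global = stats.kruskal(iters_adam, iters_nadam, iters_amsgrad)
    u_stat, p_pair = stats.mannwhitneyu(iters_adam, iters_amsgrad,
                                        alternative="two-sided")
    return (h_stat, p_global), (u_stat, p_pair)
```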
To quantify these differences in speed between the three algorithms, we generate confidence intervals for the median differences using the bootstrap method with 10,000 replicates each. Normal, percentile, and pivotal 95% confidence intervals have been calculated, and the results are summarized in
Table 5. From this table, we spot a clear advantage of Nadam and AmsGrad over Adam in speed of convergence (on average), while the difference between AmsGrad and Nadam is only around four steps.
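A minimal percentile-bootstrap sketch for the median difference between two optimizers, with 10,000 replicates as in Table 5 (the normal and pivotal intervals can be derived from the same replicates); the names are illustrative.

```python
import numpy as np

def bootstrap_median_diff_ci(sample_a, sample_b, n_boot=10_000,
                             alpha=0.05, seed=0):
    """Percentile bootstrap CI for median(sample_a) - median(sample_b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(sample_a)
    b = np.asarray(sample_b)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (np.median(rng.choice(a, size=a.size, replace=True))
                    - np.median(rng.choice(b, size=b.size, replace=True)))
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```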
Following the logic of our study, we investigate the relationship between the speed of convergence and each of the three simulation factors, namely the number of anomalies in the image, the initial guess error, and the choice of the learning rate. To verify the impact of the number of anomalies on the speed of convergence, the Kruskal–Wallis test is applied to each algorithm’s speed-of-convergence sample, grouped by the number of inclusions. The Kruskal–Wallis test results are presented in
Figure 5, and we can conclude (by failing to reject the Kruskal–Wallis null hypothesis) that the number of anomalies present in the image does not significantly affect the speed of convergence of the different optimizers (see the p-values in Figure 5).
To complete our investigation, we discuss the impact of the initial guess error and of the learning rate parameter on the number of iterations, as shown in
Figure 6a and
Figure 6b, respectively. Spearman’s coefficient of correlation is used because of its robustness against the outliers that appear in the data. The scatter plots in
Figure 6a,b show the relationship of the initial guess error and of the learning rate parameter with the speed of convergence, respectively; Spearman’s correlation coefficient R and the corresponding p-value are reported at the top of each graph. From
Figure 6b, we notice that, when the learning rate ranges in [0.001, 0.2], the Nadam and Adam algorithms take more iterations than the AmsGrad algorithm. In addition, we note that the AmsGrad algorithm presents some robustness toward the learning rate in this range, with some outliers in [0.001, 0.2]. According to Spearman’s correlation coefficient, we observe a very strong correlation between the learning rate parameter and the number of iterations for the Adam and Nadam optimizers, and a negligible correlation for the AmsGrad optimizer, which presents some outliers when the learning rate ranges in [0.2, 0.3]. On the other hand,
Figure 6a shows the relationship between the initial guess error and the number of iterations taken by each optimizer to reach convergence of the cost functional. We note that the error has the same impact on the Adam and Nadam algorithms, judging by their p-values and correlation coefficients. However, we observe that AmsGrad is more efficient than the other two optimizers even when the initial guess is far from the true image.
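The correlations of Figure 6 correspond to calls of the following form with SciPy; the names are illustrative.

```python
from scipy import stats

def speed_vs_factor_correlation(factor_values, n_iterations):
    """Spearman rank correlation between a simulation factor (learning rate
    or initial guess error) and the convergence speed; rank-based and hence
    robust to the outliers present in the data."""
    rho, p_value = stats.spearmanr(factor_values, n_iterations)
    return rho, p_value
```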
To assess the quality of the images reconstructed by these optimizers, we performed statistical tests for differences of means on the PSNR and SSIM measured on the reconstructed images. These two scores are defined as follows:
$$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}\right), \qquad \mathrm{SSIM}(x, y) = \big[l(x,y)\big]^{\alpha}\,\big[c(x,y)\big]^{\beta}\,\big[s(x,y)\big]^{\gamma},$$
with
$$l(x,y) = \frac{2\mu_{x}\mu_{y} + C_{1}}{\mu_{x}^{2} + \mu_{y}^{2} + C_{1}}, \qquad c(x,y) = \frac{2\sigma_{x}\sigma_{y} + C_{2}}{\sigma_{x}^{2} + \sigma_{y}^{2} + C_{2}}, \qquad s(x,y) = \frac{\sigma_{xy} + C_{3}}{\sigma_{x}\sigma_{y} + C_{3}},$$
where $l$, $c$, and $s$ are the luminance, contrast, and structure variations between the true image $x$ and the reconstructed image $y$, respectively, and $\alpha$, $\beta$, and $\gamma$ are three parameters used to adjust the relative importance of the three components of the similarity measure. $\mathrm{MAX}$ denotes the maximum possible pixel value and $\mathrm{MSE}$ the mean squared error between $x$ and $y$. $\mu_{x}$ and $\mu_{y}$ are the means of the pixel values of $x$ and $y$, respectively. We denote by $\sigma_{x}$, $\sigma_{y}$, and $\sigma_{xy}$ the standard deviations of $x$ and $y$ and the covariance of $x$ and $y$, respectively. $C_{1}$, $C_{2}$, and $C_{3}$ are constants.
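In practice, both scores can be computed with scikit-image, as in the sketch below; it assumes the true and reconstructed absorption maps are stored as floating-point arrays and uses the library's default exponents $\alpha = \beta = \gamma = 1$.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def reconstruction_scores(true_img, recon_img):
    """PSNR and SSIM between the true and reconstructed images."""
    data_range = float(true_img.max() - true_img.min())
    psnr = peak_signal_noise_ratio(true_img, recon_img, data_range=data_range)
    ssim = structural_similarity(true_img, recon_img, data_range=data_range)
    return psnr, ssim
```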
The global comparison of the quality of the reconstructed images is shown in
Figure 7a,b, as the result of running the Kruskal–Wallis test for PSNR (after eliminating AmsGrad PSNR outliers beyond a fixed threshold in dB) and SSIM, grouping each score sample by optimizer. From
Figure 7, it appears that the PSNR and SSIM of AmsGrad are much lower (worse quality) than those of Nadam and Adam. In addition, we can observe that the means of PSNR and SSIM for Adam and Nadam are very close.
To evaluate the influence of the number of inclusions on image quality, we conduct a Wilcoxon test [25]. The test was applied between groups defined by different numbers of inclusions. The resulting p-values of this test are summarized in
Table 6. The analysis of the results shows that there is a statistically significant difference between the means due to the difference in the number of inclusions present in the images (see the p-values in Table 6).
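Assuming the rank-sum (Mann–Whitney) form of the Wilcoxon test for comparing two groups of images with different numbers of inclusions, the corresponding SciPy call would look as follows; the names are illustrative.

```python
from scipy import stats

def quality_by_inclusions(scores_group1, scores_group2):
    """Wilcoxon rank-sum (Mann-Whitney U) test between PSNR or SSIM scores
    of two groups defined by different numbers of inclusions."""
    u_stat, p_value = stats.mannwhitneyu(scores_group1, scores_group2,
                                         alternative="two-sided")
    return u_stat, p_value
```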
A similar conclusion is reached about the influence of the learning rate on PSNR and SSIM. The scatter plots in
Figure 8a,b clearly show this strong influence of the learning rate on PSNR and SSIM, respectively. The resulting Spearman’s correlation coefficients by optimizer (and the corresponding p-values) for PSNR and SSIM are reported at the top of each graph. As shown in
Figure 8a, there is a strong negative correlation between the learning rate hyperparameter and PSNR for Adam and AmsGrad, while for Nadam we note a moderate negative correlation between the choice of learning rate and the PSNR of the reconstructed images. From
Figure 8b, we observe a strong negative correlation between the learning rate parameter and SSIM for Adam and Nadam, and a moderate negative correlation between the learning rate and SSIM for AmsGrad. The p-values reported at the top of each graph indicate that these correlations are statistically significant; consequently, we can draw the same conclusion about the significance of the influence of the learning rate choice on the quality of the resulting reconstructed image. Thus, a small learning rate, in the range 0.001 to 0.2, is recommended.
Concerning the initial guess error, the scatter plots in
Figure 9 illustrate its influence on reconstructed image quality. From
Figure 9a,b, the obtained results show no statistically significant relationship between the initial guess error and the resulting quality (PSNR/SSIM). Thus, we can conclude with high confidence that the image quality is influenced only by the number of anomalies in the image and the choice of the learning rate.
We illustrate some cases from our simulation.
Figure 10 shows the reconstructed absorption coefficient $\mu_a$ for the case of one inclusion, with a fixed initial guess error and different values of the learning rate. The background optical coefficients of the true images (in mm$^{-1}$) are kept fixed. The reconstructions obtained with Nadam and Adam show a good localization of the inclusion; moreover, its size matches that of the true image, with optical properties close to the true values. Some artifacts are observed at the borders, close to the source and detector regions, when the learning rate is higher than 0.1. For the AmsGrad reconstruction, we observe that the size of the reconstructed inclusion matches that of the true image, with some artifacts in the center, when the learning rate is lower than 0.1. However, when the learning rate is greater than 0.1, AmsGrad can still localize the inclusion, but with artifacts at the borders, and the size and shape of the inclusion do not match those of the true image.
Figure 11 shows the reconstructed absorption coefficient $\mu_a$ for the case of two inclusions with different shapes, using the same initial guess error and optical properties as in the one-inclusion case. From
Figure 11, we notice a good localization of both inclusions with Nadam and Adam for the different values of the learning rate. However, when the learning rate is higher than 0.01, we observe some artifacts near the borders. For the AmsGrad reconstruction, it is clear that, when the learning rate exceeds 0.1, the size and shape of the inclusions do not match those appearing in the true image.