3. Comparison of Estimator Performance in Test Cases of Increasing Complexity
3.1. Test Case Design
We investigated the efficiency and accuracy of the different estimators described in
Section 2 on a range of test cases summarized in
Table 1. For each test case, we generated a sample of a fixed size from a given distribution and tested the estimators’ ability to quantify three information-theoretic measures: entropy, KL divergence, and mutual information.
We varied the sample size in order to reveal how data-hungry the different methods are in producing a reasonably good estimate of the true measure for the underlying distribution. Note that the difficulty lies not only in having a large enough sample for the estimation method to work reliably; by its nature, a limited sample can only represent its corresponding theoretical distribution to a limited extent. Hence, we expect estimates based on small sample sizes to be of mixed quality partly because they are not representative enough of the true distribution. We repeatedly drew samples of a given size and report the average and the confidence interval of the estimates. We expect the performance and the confidence of the estimators to generally increase with the sample size due to this representation effect; yet, it remains to be observed how the different estimators behave for the different sample sizes and how they compare with each other in terms of accuracy and efficiency. The robustness of an estimator might be increased by bootstrapping; however, this would not resolve the issue of (non-)representativeness. It is beyond the scope of this study to design and recommend the best-suited bootstrapping approaches for the individual estimators; further, initial tests (not shown here) confirmed the expectation that the variability resulting from bootstrapping is dominated by the much larger sampling variability, which is accounted for in this study. Hence, we neglected the effect of bootstrapping in this investigation.
The complexity of the test cases increases with respect to dimensionality (ranging from one to ten) and skewness (ranging from symmetrical to highly skewed) of the chosen distributions.
For the simplest 1D cases, there is no dependent variable to calculate mutual information on, so in Cases 1 to 3 we only investigated the estimators’ performance in calculating entropy and KL divergence. We deliberately started our investigation with low dimensions and well-behaved distribution shapes so that we can rely on analytical reference solutions for the information-theoretic quantities. Then, for cases of increased complexity, analytical equations no longer exist (or exist only for some quantities of interest), and we have to switch to computationally heavy numerical integration to serve as a reference for the true value of the target quantity (Cases 3, 5, and 6).
3.2. Choice of Hyperparameters
Each estimation method requires tuning certain hyperparameters or making methodological choices. In the case of binning, the hyperparameter is the bin size
, and we investigated the differences in performance due to the four different methods to select the bin size presented in
Section 2.2. In histogram-based methods, it is important to clarify how bins with zero probability, or empty bins, are handled because of the log function in all quantities of interest. In our specific application, where Equations (
20)–(
22) are sums, bins that have zero probability are excluded before the summation operation takes place. The hyperparameters of the QS approach are fixed to the recommended values of the original proposal [
10] as described in
Section 2.2.2.
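As a minimal sketch of the empty-bin convention (our own illustrative code with hypothetical names, not the toolbox implementation), a histogram-based entropy estimate can exclude zero-probability bins before the sum:

```python
import numpy as np

def hist_entropy(sample, bins="sturges"):
    """Plug-in histogram estimate of differential entropy in nats.

    Empty bins are excluded before the summation, following the
    convention that 0 * log(0) contributes nothing."""
    counts, edges = np.histogram(sample, bins=bins)
    widths = np.diff(edges)
    p = counts / counts.sum()              # probability mass per bin
    nz = p > 0                             # drop empty bins: log(0) is undefined
    # discrete entropy of the bin masses plus the bin-width correction
    return -np.sum(p[nz] * np.log(p[nz] / widths[nz]))

rng = np.random.default_rng(0)
h = hist_entropy(rng.uniform(0.0, 2.0, 10_000))
# for U(0, 2), the true differential entropy is ln(2) ≈ 0.693 nats
```

The same exclusion applies analogously to the sums for KL divergence and mutual information.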
For KDE, we show the difference between performing a resubstitution estimate as opposed to an estimate through numerical integration. In both cases, a kernel representation of the PDF of the data is created using a Gaussian kernel and Silverman’s bandwidth estimate, as described in
Section 2.1. Then, the resubstitution estimate is obtained by averaging over the kernel PDF evaluated at all available data points in the sample, while the integration estimate is obtained by taking the expected value. Specifically, for entropy, Equations (
15) and (
18) show how these two processes are different.
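The two variants can be sketched as follows (an illustrative example using SciPy’s gaussian_kde with Silverman’s bandwidth, as described above; the variable names and the test distribution are ours):

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.integrate import quad

rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1.0, 5_000)

# Gaussian kernel with Silverman's bandwidth estimate
kde = gaussian_kde(sample, bw_method="silverman")

# Resubstitution: average of -log density evaluated at the sample points
h_resub = -np.mean(np.log(kde(sample)))

# Integration: expected value of -log density under the kernel PDF itself
def integrand(x):
    px = kde(x)[0]
    return -px * np.log(px)

h_int, _ = quad(integrand, -8.0, 8.0)
# both should lie close to the true entropy of N(0, 1), 0.5*ln(2*pi*e) ≈ 1.419
```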
As for k-NN, we investigated the effect of varying the number of neighbors used, k, between 1 and 15. We capped k at 15 because larger values showed very similar behavior, but with larger variability at small sample sizes.
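For intuition, a 1D k-NN entropy estimate in the Kozachenko–Leonenko style can be sketched as follows (our own minimal code using SciPy’s cKDTree, not the toolbox implementation):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_entropy(sample, k=1):
    """k-NN (Kozachenko-Leonenko-type) entropy estimate in nats for 1D data."""
    x = np.asarray(sample, dtype=float).reshape(-1, 1)
    n = len(x)
    # query k+1 neighbors because the closest "neighbor" is the point itself
    dist, _ = cKDTree(x).query(x, k=k + 1)
    eps = dist[:, -1]                      # distance to the k-th true neighbor
    # log(2) is the log-volume of the 1D unit ball (an interval of radius 1)
    return digamma(n) - digamma(k) + np.log(2.0) + np.mean(np.log(eps))

rng = np.random.default_rng(2)
h = knn_entropy(rng.normal(0.0, 1.0, 5_000), k=1)
# true entropy of N(0, 1) is 0.5*ln(2*pi*e) ≈ 1.419 nats
```

Note that no intermediate density estimate is formed; the local neighbor distances enter the estimate directly.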
3.3. Implementation
The sample sizes ranged between 100 and 100,000 samples, with the upper limit being reduced in some cases for the computationally expensive KDE-based estimator. The test case procedure was repeated across 300 different fixed seeds for random sampling to show confidence intervals related to the effect of sampling variability on an estimate.
Numerical integration as a reference when there is no analytical solution was achieved through the QUADPACK [
38] (quad and nquad for higher dimensions) interface in
SciPy [
For the integration scheme, default settings were used for both the absolute and relative error tolerances. Integration limits were set according to the true support of the underlying distribution.
Each of the estimators was implemented as a Python 3 [
40] function. The binned estimators use the histogram, histogramdd, and digitize functions from
NumPy [
41], the latter two used for multi-dimensional histograms with uneven and even binning widths, respectively. The KDE and
k-NN estimators rely on the Gaussian KDE and the
k-d Tree data structures available in
SciPy [
39], respectively.
The code for the setup of the test cases presented here, which allows for reproducing our results and, e.g., for future testing of further alternative methods, is available at the Data Repository of the University of Stuttgart (DaRUS). Further, all of the estimation methods have been collected as part of the open-source UNITE toolbox. The published toolbox contains the functions for the estimators so that they can be applied to any case study of interest.
3.4. Case 1: 1D, Uniform Distribution
To begin our evaluation of the different estimation methods presented in this paper, we start with the uniform distribution. In the continuous case of the uniform distribution, the total mass of the distribution lies uniformly between two arbitrary bounds. As such, it is the maximum entropy distribution for a random variable
X under no other constraints [
3]. If the bounds of the uniform distribution are a and b, where a < b, then the PDF is:
3.4.1. Analytical Reference Solution
The entropy of a uniform distribution
is given by
If a second uniform distribution has bounds c and d, where c < d, then the KL divergence between the two distributions is:
Figure 2 shows the PDFs of the distributions used to estimate entropy and KL divergence, together with the analytical reference values. For KL divergence, the approximating distribution (shown in red) must have support greater than or equal to that of the true distribution (shown in black), which is the case for the example shown.
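For reference, with p uniform on [a, b] and q uniform on [c, d] such that [a, b] ⊆ [c, d], the standard closed-form expressions read:

```latex
f(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b, \\[4pt] 0, & \text{otherwise,} \end{cases}
\qquad
H(p) = \ln(b-a),
\qquad
D_{\mathrm{KL}}(p \,\|\, q) = \ln\frac{d-c}{b-a}.
```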
3.4.2. Density Estimation
For illustrative purposes, we start off by showing the ability of the different estimation methods to estimate the density of this one-dimensional uniform distribution.
Figure 3 shows the true underlying PDF and the three PDFs derived by each method applied for density estimation, i.e., Equations (
11), (
19), and (
30). The solid line for each method represents the mean of the estimated density across 300 random seeds of 10,000 samples of the uniform distribution, and the density was calculated for 500 evenly spaced points along the x axis. Additionally, the lower and upper percentiles of the density estimated at each point are shown as the shaded area in the same color as the mean for all methods.
As expected, KDE gives a very accurate density estimation within the limits of the underlying distribution, but the effect of the Gaussian kernel is apparent close to and outside of the limits. A boxcar-function-based kernel might be most suitable for this particular example, but such a recommendation becomes more difficult when the data are higher-dimensional; therefore, the Gaussian kernel is the default option [
11]. Although binning produces estimates with higher variance than KDE, it seems to be the most suitable method for this example. The
k-NN-based method was not explicitly designed to perform density estimation, and the example in
Figure 3 graphically shows the large variance that the method has even in a 1D case [
35]. The variance is affected by the number of neighbors used for density estimation; for smaller values of
k, the variance is much higher than for larger values of
k. In Figure 3, a comparatively large value of k was used. This is a much larger value than what is typically recommended, as we will see later, but appears adequate for density estimation, and larger values of k would indeed reduce the variance further. Nevertheless, the uniform distribution remains challenging due to the steps at the limits of the distribution, where only density estimation through binning is able to identify the “hard” cutoff limits.
Recall that density estimation is not the primary goal of our comparison; instead, we are interested in the estimation quality of the three information-theoretic quantities. Only the KDE and binning methods rely on this initial density estimation step; k-NN skips this step and is only shown here for a better intuition about the characteristics of this approach.
3.4.3. Estimation Results
The estimation results for the case of the uniform distribution are shown in
Figure 4. The estimates of the three methods (columns) for the two quantities entropy and KL divergence (rows) are shown as absolute values in nats on the left-hand axes, whereas the relative errors with respect to the reference solution can be read off from the right-hand axes. Note that the axis scaling is consistent within each row (i.e., between estimates for a specific quantity), but not necessarily across the rows (i.e., between the different quantities).
This first test case serves as an intuitive starting point for the interpretation of
Figure 4, as the results for all further test cases follow the same template.
The KDE estimator gives a greatly biased result when estimating entropy due to the effect of using a Gaussian kernel to approximate a uniform PDF (as seen in
Figure 3). This is because the Gaussian kernel widens the true support of the uniform distribution, and therefore, it leads to an overestimation of entropy. Integration reduces this effect and also reduces the variance of the estimate in smaller sample sizes.
Binning underestimates entropy for sample sizes up to 1000 and then yields very accurate and precise estimates. This observation is well in line with the theory: since a too-small-to-be-representative sample will not be perfectly uniform, the true entropy value of the maximum-entropy distribution is approached from below with increasing sample size. The QS approach typically underestimates entropy when compared to a histogram-based approach with one of the considered “rules of thumb”.
The k-NN estimator for entropy gives a similarly accurate result as binning, but with a much higher variance. As discussed in the context of density estimation, the k-NN method does not rely on any smoothing via kernels or bins, and hence, sampling variability strongly affects its result. As expected, this variability is reduced with an increasing number of neighbors; however, a smaller number of neighbors yields more accurate results for smaller sample sizes.
Estimating KL divergence directly using the KDE estimator gives a result very close to the truth even in small sample sizes, with an increase in the number of samples only reducing the variance. Across different samples, the ratio between the true and the approximating distribution seems to stay consistent, leading to this result. Numerical integration leads to a worse outcome as the effect of the Gaussian kernel becomes more apparent (both distributions are first approximated, and then, their ratio is assessed, “doubling” the smearing effect of the kernel).
For the binning estimator, we see a “reversal effect” in the estimate for KL divergence, where Sturges’ rule initially computes a larger KL divergence, but starts to produce results that are lower than those of Scott’s and FD’s rules at 500 samples. The reason is the resulting number of bins obtained from the different methods described in Section 2.2.1: across larger sample sizes, Scott’s and FD’s rules result in a very similar number of bins, with FD’s estimate usually being slightly larger, and both being larger than that of Sturges’ rule. This situation is reversed in smaller sample sizes, with Sturges’ rule producing the largest number of bins.
The k-NN approach yields the best results for KL divergence with a single neighbor (k = 1). While, again, the variance is high for small sample sizes, the estimate is highly accurate across all sample sizes. Obviously, a one-to-one comparison of estimated densities is preferable for an accurate estimate of KL divergence, as also seen for KDE (a direct estimate being much superior to integration). While intermediate numbers of neighbors perform poorly, the highest number investigated here (k = 15) comes close to the accuracy of k = 1, but fails to reach it. It is also remarkable that the k-NN-based estimate of KL divergence shows very little variation across sample sizes, again emphasizing that a local estimate of the ratio between the two distributions is less affected by unrepresentative sampling or distribution reconstruction.
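The local, one-to-one character of this estimate can be illustrated with a minimal 1D nearest-neighbor KL divergence sketch (our own code in the spirit of the k-NN estimators of Section 2; the normal test distributions are chosen for illustration only):

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_kld(x, y, k=1):
    """k-NN estimate of D(p||q) in nats from 1D samples x ~ p and y ~ q."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    n, m = len(x), len(y)
    # distance to the k-th neighbor within x (k+1: the point itself is returned first)
    rho = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    # distance to the k-th neighbor of each x-point within the y sample
    nu = cKDTree(y).query(x, k=k)[0]
    if k > 1:
        nu = nu[:, -1]
    # local log-ratio of densities plus a sample-size correction
    return np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

rng = np.random.default_rng(3)
d = knn_kld(rng.normal(0.0, 1.0, 5_000), rng.normal(1.0, 1.0, 5_000), k=1)
# true value: D(N(0,1) || N(1,1)) = 0.5 nat
```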
3.5. Case 2: 1D, Normal Distribution
The normal or Gaussian distribution is the second of our test cases and also serves as the basis for the applications in higher dimensions, where the estimators will be tested on multivariate normal distributions (
Section 3.10 and
Section 3.11). In the most basic one-dimensional case, the PDF of the normal distribution is:
The normal distribution is often denoted as N(μ, σ), where μ is the expected value of the distribution and σ is the standard deviation.
3.5.1. Analytical Reference Solution
Given the PDF in Equation (
34), the entropy of the normal distribution is:
Given two normal distributions p and q, the KL divergence between these two distributions is:
Figure 5 shows the distributions used in this application case.
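For reference, with p = N(μp, σp) and q = N(μq, σq), the standard closed-form expressions are:

```latex
H(p) = \tfrac{1}{2}\,\ln\!\left(2\pi e\,\sigma_p^2\right),
\qquad
D_{\mathrm{KL}}(p \,\|\, q) = \ln\frac{\sigma_q}{\sigma_p}
 + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2\sigma_q^2} - \frac{1}{2}.
```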
3.5.2. Estimation Results
The estimation results for the case of the normal distribution are shown in
Figure 6. Estimating the density of a normal distribution with a Gaussian kernel is a prime task for KDE, so the relative errors in estimating the two information-theoretic quantities are expected to be small. Across all sample sizes and quantities, a direct evaluation of the KDE estimator performs better than numerical integration over the kernel representation. The latter gives additional importance to the overall shape of the distribution, which is an approximation in both cases, rather than focusing on the specific points that support the kernel representation. This can be seen more clearly for KL divergence, where ratios are considered. Only at large sample sizes, where the overall representation of the underlying distribution is very close to the true distribution, does numerical integration give a result that is almost as good as direct estimation.
All binning methods, with the exception of Sturges’ rule, give very similar results, which are highly accurate for large sample sizes. This is expected, as the rules-of-thumb were all derived with more or less strict assumptions of the distribution being Gaussian. The bin width given by Scott’s rule works best because it is optimal for random samples of normally distributed data, in the sense that it minimizes the integrated mean-squared error of the density estimate [27]. Neither FD’s nor Sturges’ estimate shares this optimality property; instead, they allow for a more flexible representation of the underlying distribution. On normally distributed data and in large sample sizes, FD’s rule equally matches Scott’s rule, but Sturges’ rule overestimates entropy and underestimates Kullback–Leibler divergence because it calculates a larger bin width and a smaller number of bins. The QS approach does not assume normally distributed data (and, hence, also cannot benefit from that knowledge); as in the previous case, it tends to slightly underestimate entropy, but its performance is generally very similar to the other binning methods. Recall that it is only derived for estimating entropy, not KL divergence or other multivariate quantities.
The results of using k-NN to estimate entropy and KL divergence appear consistent with those of the previous case: for small sample sizes, the bias decreases with decreasing k, but the variance increases. However, for larger sample sizes, the bias and variance almost vanish; also, for any choice of hyperparameter and sample size, the relative error of the entropy estimates is generally small, with the largest relative error being ≈6%. This demonstrates the effectiveness of this estimation method when applied to normally distributed data. For KL divergence, again, k = 1 is the best choice of hyperparameter, and given a specific k, the mean estimates of KL divergence are very consistent across sample sizes.
3.6. Case 3: 1D, Normal Mixture Distribution
To exploit the statistical properties of normal distributions while making them more flexible in fitting non-normal real-world data sets, it is common to use weighted mixtures of normal distributions in approximation tasks. Such a mixture offers the possibility to test the estimators presented in this paper in a setting where the data come from a distribution that is not uni-modal and is more skewed. Further, the main interest in evaluating a mixture of normal distributions is that a normal mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated to within any specific nonzero error by a normal mixture model with enough components [
33]. The PDF of a normal mixture distribution is:
where w_i is the individual weight of each of the n components of the mixture, with their individual expected values μ_i and standard deviations σ_i, and the weights sum to one.
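Written out, this standard mixture density reads:

```latex
p(x) = \sum_{i=1}^{n} w_i \, \frac{1}{\sigma_i \sqrt{2\pi}}
       \exp\!\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right),
\qquad w_i \ge 0, \quad \sum_{i=1}^{n} w_i = 1.
```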
3.6.1. Numerical Reference Solution
Because of the logarithm of a sum of exponential functions, the entropy of a mixture of normal distributions cannot be calculated analytically [
42]. The same applies to KL divergence. Therefore, numerical integration was used to obtain the reference value presented in the results.
Figure 7a shows the distribution used for this application case: an equally weighted mixture of two normal distributions.
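As an illustration of such a numerical reference computation, the entropy of an equally weighted mixture can be integrated directly (the component parameters below are illustrative stand-ins, not the exact parameters of this test case):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Equally weighted mixture; N(-2, 1) and N(2, 1) are illustrative components
p = lambda x: 0.5 * norm.pdf(x, loc=-2, scale=1) + 0.5 * norm.pdf(x, loc=2, scale=1)

# Entropy via numerical integration over a range covering all relevant mass
h, _ = quad(lambda x: -p(x) * np.log(p(x)), -12, 12, limit=200)
# h is bounded above by ln(2) + H(N(0, 1)) ≈ 2.112 nats; the overlap of the
# two components makes it slightly smaller.
```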
3.6.2. Estimation Results
The estimation results of the mixture of normal distributions are shown in
Figure 8 and they are very similar to those of the single normal distribution in the previous case. KDE again performs well due to the usage of a Gaussian kernel. Given the two peaks of the distribution as seen in
Figure 7a, for a low number of samples, the kernel density at specific points is underestimated, and therefore, the entropy is overestimated. This is the opposite of Case 2 (pure normal distribution), in which small samples tend to over-represent a narrower high-probability range of the underlying distribution, which leads to an underestimation of entropy. In both cases, direct estimation gives a very accurate result for sample sizes larger than 1000; integration struggles much more in the case of the normal mixture distribution.
For the binning estimator, the results are also very similar to the previous case: both Scott’s and FD’s estimate perform very well; the QS approach converges to their result from below; Sturges’ rule struggles even more than before due to an inadequate number of bins for both estimating entropy and KL divergence. Scott’s rule appears to be best to calculate an adequate number of bins, particularly for KL divergence where the estimate based on said rule appears to consistently match the true value across all sample sizes.
k-NN-based estimation behaves similarly to all previous one-dimensional cases. Here, once again, intermediate values of k give an estimate of entropy that is very close to the true value while having a smaller variance than k = 1; the largest investigated k gives the estimate with the smallest variance, but the largest bias for small sample sizes. For estimating KL divergence, again, k = 1 performs best, and the largest investigated k approaches this result (but still shows a small bias even for large sample sizes).
3.7. Case 4: 2D, Normal Distribution
For extending Case 2 to a normal distribution with d dimensions, we use the notation N(μ, Σ), where μ is a d-dimensional mean vector and Σ is a d × d positive-definite covariance matrix. For this distribution, the PDF is:
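In standard form, this density reads:

```latex
f(\mathbf{x}) = (2\pi)^{-d/2} \, \det(\boldsymbol{\Sigma})^{-1/2}
 \exp\!\left(-\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\top}
 \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right).
```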
3.7.1. Analytical Reference Solution
The entropy of this multivariate normal distribution is given by [
1]:
The KL divergence between two d-dimensional normal distributions is [43]:
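For p = N(μp, Σp) and q = N(μq, Σq), these standard closed-form expressions are:

```latex
H(p) = \tfrac{1}{2}\,\ln\!\left((2\pi e)^d \det \boldsymbol{\Sigma}_p\right),
\qquad
D_{\mathrm{KL}}(p \,\|\, q) = \tfrac{1}{2}\left[
 \ln\frac{\det \boldsymbol{\Sigma}_q}{\det \boldsymbol{\Sigma}_p}
 + \operatorname{tr}\!\left(\boldsymbol{\Sigma}_q^{-1}\boldsymbol{\Sigma}_p\right)
 + (\boldsymbol{\mu}_q-\boldsymbol{\mu}_p)^{\top} \boldsymbol{\Sigma}_q^{-1}
   (\boldsymbol{\mu}_q-\boldsymbol{\mu}_p) - d \right].
```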
In particular, for d = 2, the parameters for the normal distribution become a 2 × 1 vector for the means and a 2 × 2 matrix for the covariance. These parameters can be written as:
This notation is similar to that of Equation (34), with the addition of ρ being the Pearson correlation coefficient between the two dimensions. Using this parametrization, the expression for mutual information for the bivariate normal distribution becomes:
This means that, for a bivariate normal distribution, there is an exact relationship between the correlation coefficient ρ and mutual information [44].
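Since this exact relationship is I = −(1/2) ln(1 − ρ²), it can be cross-checked against the entropy-based definition of mutual information (a small sketch for an arbitrary illustrative ρ; the names are ours):

```python
import numpy as np

rho = 0.7                                   # illustrative correlation coefficient
i_direct = -0.5 * np.log(1.0 - rho**2)      # exact MI of a bivariate normal

# Cross-check via I = H(X) + H(Y) - H(X, Y) with Gaussian entropies
cov = np.array([[1.0, rho],
                [rho, 1.0]])
h_marginal = 0.5 * np.log(2 * np.pi * np.e)                        # H of N(0, 1)
h_joint = 0.5 * np.log((2 * np.pi * np.e) ** 2 * np.linalg.det(cov))
i_entropy = 2 * h_marginal - h_joint
# i_entropy equals i_direct, since det(cov) = 1 - rho**2
```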
Figure 9a shows a bivariate normal distribution, and Figure 9b shows the same, with a second approximating distribution in red. For the true distribution, the value of the correlation coefficient ρ can be directly used in Equation (41). This result is also shown in Figure 9a.
3.7.2. Estimation Results
This initial case in two dimensions starts to introduce the challenges of density estimation in higher dimensions. The estimation results are shown in
Figure 10. More specifically, for KDE-based estimation, direct evaluation of a kernel representation of the probability distribution is very computationally expensive for sample sizes larger than 25,000, and the waiting time to obtain an estimate becomes unreasonable. Therefore, this method was not tested beyond this number of samples. Computational costs will be discussed further in Section 3.12. Nevertheless, estimates for entropy, KL divergence, and now also mutual information using this technique are quite accurate, with a behavior similar to the one-dimensional case in Section 3.5. Numerical integration of the KDE density estimate further increases the computational cost, with the procedure becoming infeasible for sample sizes larger than 10,000. Further, integration only notably improves performance for mutual information.
Binning-based estimation does not suffer from the increase in computational cost as much as KDE-based estimation, but there is a small increase (see
Section 3.12). As opposed to the one-dimensional cases investigated so far, binning yields large relative errors when estimating entropy, even for larger sample sizes. The main drawback of the different binning rules is their inability to account for the multiple dimensions of the data, making the method less effective for dimensions higher than one. Here, Sturges’ rule is superior to the other methods, as it yields relatively accurate estimates for entropy and mutual information for samples of size 1000 or larger. For KL divergence, Scott’s and FD’s rules seem more effective. It appears that, for entropy and mutual information, a smoother histogram created using a larger bin width is more favorable, while for KL divergence, a more detailed histogram with a finer bin width is preferred. Recall that the QS approach has not been extended to higher dimensions yet and, hence, cannot be analyzed in this and the following multi-dimensional test cases.
Finally, for k-NN estimation, the findings from the previous one-dimensional cases seem to hold, at least for two dimensions. Namely, estimation performs best for smaller values of k, both for entropy and KL divergence. This is not true for mutual information, where a large value of k gives better estimates across all sample sizes. Importantly, in certain cases, particularly for small sample sizes and certain values of k (as shown in Figure 10), the second term in Equation (29) becomes larger than all other terms, resulting in a negative estimate of mutual information. As mutual information cannot be negative per its definition (Equation (7)), the estimator reports zero as the result. Hence, the k-NN estimate converges to the true mutual information systematically from below.
3.8. Case 5: 2D, Normal Mixture Distribution
To increase the level of complexity while exploiting the fact that two dimensions still allow for visualization, we again tested on a mixture of now bivariate normal distributions using the scheme in Equation (
37).
3.8.1. Numerical Reference Solution
As in Case 3, analytical expressions for entropy, KL divergence, or mutual information do not exist for this type of distribution. We, therefore, relied on numerical integration to obtain the reference values presented in the results.
Figure 11a shows an equally weighted mixture of two bivariate normal distributions. Figure 11b shows an approximating bivariate normal distribution. The parameters of this distribution come from making the incorrect assumption that the variable pairs shown in Figure 11a are independent.
3.8.2. Estimation Results
The results of this test case are found to be very similar to the previous case of a bivariate normal distribution, with the most notable difference being that KDE seems to struggle more when moving away from a pure normal distribution, meaning that the performance is overall worse in this case.
Note that, for estimating mutual information with the k-NN method, further tests were conducted using larger numbers of neighbors, but there were no noticeable differences between these values of k and the maximum shown in Figure 12.
3.9. Case 6: 2D, Gamma-Exponential Distribution
An example of a skewed distribution in two dimensions is the gamma-exponential distribution, which has the following PDF [21], defined for positive values of both variables:
where the scaling parameter of the distribution is positive and Γ(·) is the gamma function.
3.9.1. Analytical and Numerical Reference Solutions
Darbellay and Vajda [
45] presented a list of multivariate differential entropies and mutual information. According to them, the (joint) entropy of the gamma-exponential distribution is:
where ψ is the digamma function, i.e., the logarithmic derivative of the gamma function. The mutual information is:
For calculating entropy and mutual information in this case, the scaling parameter was set to
. For determining KL divergence, the parameter of the approximating distribution was set to
. Because no reference in the literature was found that provides a theoretical result for the relative entropy between two gamma-exponential distributions, numerical integration was used to obtain the true value of KL divergence presented in the results.
Figure 13 shows a plot of the two PDFs used in this test case.
3.9.2. Estimation Results
The estimation results for the case of the gamma-exponential distribution are shown in
Figure 14. Here, we have the case of a distribution with a different shape than the normal distribution; therefore, the limitations of using a Gaussian kernel in the KDE-based estimator become more apparent. The KDE-based entropy estimate only achieves an accurate result for large sample sizes and when using numerical integration. This seems to contradict the previous cases, but it can be explained by how numerical integration is implemented on top of the KDE estimator. Numerical integration requires a set of limits, and as implemented, the limits were chosen as the maximum and minimum values available in the sample for each dimension, plus and minus the specific bandwidth calculated for the Gaussian kernel. These limits therefore constrain the region where numerical integration happens, even though the kernel-based representation of the distribution has support everywhere due to the usage of the Gaussian kernel. This effect also applies to KL divergence and mutual information, but is more apparent in the latter. Both of these estimates remain significantly biased even with the largest sample sizes tested here.
Binning estimates applying Scott’s rule give the best results, while the FD rule drastically underestimates entropy and overestimates mutual information. Sturges’ rule shows the opposite behavior. As previously commented on, this is due to the FD rule calculating a smaller bin width and, hence, forcing a greater number of bins than Sturges’. This overestimation of the required number of bins causes the representation of the joint distribution to be sparser and to have a larger number of empty bins, leading to a smaller variability in the representation of the density. Generally, estimating entropy with bins using Scott’s rule leads to small relative errors for practically all considered sample sizes, whereas the mutual information estimate using this same rule still contains quite high relative errors, even for the largest sample size of 100,000. Because KL divergence compares the ratios of the two densities, very similar results are obtained across the binning hyperparameters, as the absolute bin size does not matter as much as the fact that equal binning schemes are defined for both distributions.
As with the other cases, k-NN-based estimates improve in accuracy with increasing sample size, and k = 1 performs best for entropy and KL divergence. In the case of entropy, a larger k leads to a worse approximation, although all investigated values of k produce very small errors at large sample sizes. For KL divergence, however, the results become more mixed with larger sample sizes, where the highest considered k of 15 approaches the highly accurate result of k = 1, while the intermediate k values seem to stabilize at large relative errors, underestimating the true KL divergence significantly. In the case of mutual information, a larger k is favorable for all sample sizes, but all k larger than one achieve acceptable relative errors for sample sizes larger than 1000.
3.10. Case 7: 4D, Multivariate Normal Distribution
For higher dimensions, we tested the estimators’ ability to quantify the entropy, KL divergence, and mutual information of a four-dimensional normal distribution, since analytical solutions are available. We adopted the experiments by Wang et al. [
23] and replicated their results, while going further with their experiment by quantifying the sampling uncertainty of their results, as well as investigating additional estimation methods. For entropy and mutual information, data were sampled from a four-dimensional normal distribution, and for KL divergence, data were additionally sampled from a second four-dimensional normal distribution.
3.10.1. Analytical Reference Solution
In this case, we can use Equations (
39) and (
40) as analytical expressions for entropy and KL divergence, respectively.
For mutual information, Arellano-Valle et al. [
46] suggested breaking down the distribution into marginal distributions with n and m dimensions, where n + m = d, and:
In this form, using the expression for mutual information in the first row of Equation (
7) and considering the expression for entropy obtained in Equation (
39), mutual information can be written as:
As we cannot present plots of the PDF of the distributions used in this application case, we show here the reference true values:
The reference solution for mutual information comes from applying Equation (
45) to the distribution. The distribution is split so that the distribution of X contains the first three dimensions of the original distribution; therefore, n = 3; the distribution of Y contains the remaining fourth dimension, and m = 1. More succinctly, mutual information is calculated as the mutual information between the first three dimensions and the fourth.
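Following this splitting scheme, the reference value follows from Gaussian entropies via log-determinants (a sketch with an illustrative positive-definite covariance matrix, not the matrix of the actual test case):

```python
import numpy as np

def h_gauss(cov):
    """Differential entropy in nats of a d-dimensional normal with covariance cov."""
    cov = np.atleast_2d(cov)
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

# Illustrative 4x4 covariance (unit variances, positive definite)
cov = np.array([
    [1.0, 0.5, 0.3, 0.2],
    [0.5, 1.0, 0.4, 0.3],
    [0.3, 0.4, 1.0, 0.5],
    [0.2, 0.3, 0.5, 1.0],
])

# Split into X = first three dimensions (n = 3) and Y = fourth dimension (m = 1)
i_xy = h_gauss(cov[:3, :3]) + h_gauss(cov[3:, 3:]) - h_gauss(cov)
# equivalently: 0.5 * ln(det(Sigma_X) * det(Sigma_Y) / det(Sigma))
```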
3.10.2. Estimation Results
The estimation results for the case of the 4D normal distribution are shown in
Figure 15. In higher dimensions than those of the previous cases, the main drawbacks of some of the estimation methods start to become apparent. To begin, numerical integration of a KDE-based probability distribution becomes too computationally expensive, so this method was not considered in this experiment. Further, as was also the case for some of the previous cases, the evaluation of a KDE-based representation also becomes too expensive for larger sample sizes; therefore, the KDE-based direct resubstitution estimate was only calculated up to 25,000 samples. For entropy and mutual information, it seems that the estimator would converge for a larger number of samples, with the estimates having very little variance after 1000 samples. The estimate for KL divergence, however, is extremely stable with a highly biased result (40% relative error).
Binning-based estimation performs acceptably well only when using Sturges’ rule. Sturges’ rule typically results in the largest bin width, with Scott’s and FD’s rules giving smaller bin widths and, therefore, a larger number of bins in the histogram-based representation of the PDF. In the case of entropy, for smaller bins, the contribution of each bin to the computation of entropy is very small and the correction factor described in
Section 2.2 and Equation (
20) dominates the calculation, typically underestimating entropy and resulting in values that are not shown in the limits of the plot in
Figure 15. Only Sturges’ rule, with its larger bin widths, converges to the true value at the highest investigated sample size; however, it approaches the true value very steeply, meaning that all smaller sample sizes exhibit large relative errors. The same holds for mutual information, although this quantity is usually overestimated at smaller sample sizes. Finally, also for KL divergence, Sturges’ is the only rule showing reasonable convergence behavior. The other rules result in many more bins, making it difficult for samples to fall into matching occupied bins when computing the KL divergence between the histogram of
and
. In contrast to the previous case, in higher dimensions not only do matching binning schemes matter; the quality of the estimate also depends on the bin size. As the data are sparser in higher dimensions, samples that lie close together should be counted in the same bin, which becomes difficult when the number of bins is too large.
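The three rules-of-thumb differ mainly in how the number of bins scales with sample size, which drives the behavior described above. A minimal 1D sketch (the constants follow the standard definitions of the rules; this is illustrative code, not the toolbox implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
n = len(x)

# Sturges' rule: the number of bins grows only logarithmically with n
bins_sturges = int(np.ceil(np.log2(n)) + 1)

# Scott's rule: bin width shrinks as n**(-1/3), so the bin count grows faster
h_scott = 3.49 * x.std(ddof=1) * n ** (-1 / 3)
bins_scott = int(np.ceil((x.max() - x.min()) / h_scott))

# Freedman-Diaconis rule: like Scott's, but based on the interquartile range
iqr = np.subtract(*np.percentile(x, [75, 25]))
h_fd = 2 * iqr * n ** (-1 / 3)
bins_fd = int(np.ceil((x.max() - x.min()) / h_fd))
```

For this sample, Sturges’ rule yields far fewer (and hence wider) bins than Scott’s or FD’s rule, which is why it remains usable in the sparse high-dimensional setting.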
k-NN-based estimation performs best among all investigated methods and quantities with similar behavior as in the previous lower-dimensional cases, where, again, is the best hyperparameter for entropy and KL divergence, while seems best for mutual information. For mutual information, it can also be noted that is not an adequate parameter for estimation, especially for small sample sizes, as the estimator typically reports a value of .
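The k-NN entropy results are based on the Kozachenko–Leonenko construction; a minimal sketch for samples of shape (n, d) (our own illustrative implementation, not the toolbox code):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(x, k=1):
    """Kozachenko-Leonenko k-NN entropy estimate (in nats) for x of shape (n, d)."""
    n, d = x.shape
    tree = cKDTree(x)
    # distance from each point to its k-th nearest neighbor
    # (query index 0 is the point itself, hence k + 1)
    eps = tree.query(x, k=k + 1)[0][:, -1]
    # log-volume of the d-dimensional unit ball
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(eps))

rng = np.random.default_rng(0)
h_hat = knn_entropy(rng.normal(size=(10000, 1)), k=1)
```

For a 1D standard normal sample, `h_hat` should lie close to the true differential entropy 0.5 ln(2πe) ≈ 1.419 nats.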
3.11. Case 8: 10D, Multivariate Normal Distribution
Following the experiments by Wang et al. [
23] and, therefore, using notation similar to that of the previous case, for entropy and mutual information, data were sampled from
, and for KL divergence, data were sampled from
. In this case, both distributions are centered at 0, and we write the covariance matrices using the notation
i for rows and
j for columns, then
,
and
,
, for
.
3.11.1. Analytical Reference Solution
We can again use Equations (
39), (
40), and (
46). Similar to the previous case, to calculate mutual information, Equation (
45) is applied to the distribution
, and once again, the dimensions are separated so that
and
, where
m is the dimension of the highest order; then, mutual information is calculated as
. The following solutions are obtained as reference values:
3.11.2. Estimation Results
KDE-based estimation performs poorly for all quantities. For estimating entropy, the kernel-based representation of the PDF seems not to be as smooth as the true distribution, resulting in an underestimation of entropy. Nevertheless, the representation does improve with additional samples, and it would be expected that, given enough samples, the KDE PDF would resemble the true distribution and the estimator would converge. Then again, more samples mean added computational cost for evaluating the kernel representation of the PDF, which becomes a limiting factor. As in the previous case, the maximum number of samples for entropy and KL divergence was limited to 25,000. For KL divergence, the mean estimates across sample sizes are very consistent, but the kernel-based representations of both the true and the approximating distribution do not resemble the distributions from which the data were sampled; therefore, a biased result is obtained. Finally, mutual information is consistently overestimated across all sample sizes, with added computational cost, as three kernel-based approximations of the true distribution have to be evaluated: , , and .
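The direct resubstitution estimate evaluates the fitted KDE back at the sample points; a minimal 1D sketch for entropy (using SciPy's Gaussian KDE with Silverman's bandwidth, as in this study; the quadratic evaluation cost is exactly what limits the method at large sample sizes):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
x = rng.normal(size=5000)

# fit a Gaussian KDE with Silverman's bandwidth rule-of-thumb
kde = gaussian_kde(x, bw_method="silverman")

# resubstitution estimate: average negative log-density at the sample points;
# evaluating the KDE at all n points costs O(n^2) kernel evaluations
h_hat = -np.mean(np.log(kde(x)))
```

For a 1D standard normal sample, `h_hat` should lie close to the true entropy 0.5 ln(2πe) ≈ 1.419 nats; for skewed or bounded true distributions, the Gaussian kernel introduces the biases discussed above.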
Considering binning methods, similar to the previous case, only Sturges’ rule is able to capture an adequate representation of the underlying distribution with its larger bin widths. For entropy, as the sample size increases, the estimate obtained with Sturges’ rule moves toward the true value. However, even for the largest sample size of 100,000, this rule underestimates the true value of entropy by approximately 20%. Furthermore, when using rules with smaller bin widths, bin occupations and, hence, densities become even lower, resulting in the estimated entropy typically being negative. This can be seen in
Figure 16 as the trajectory that Scott’s rule follows, and this was true also for the previous case in
Section 3.10. Further, for KL divergence, not even the large bins produced by Sturges’ rule prevent the mismatch between bin occupations for high-dimensional data, driving the estimated KL divergence to infinity; these results are therefore not shown in
Figure 16. Finally, in the case of mutual information, once again, only Sturges’ rule produces an estimate that follows the expected behavior of improving as the sample size is increased. Both Scott’s and FD’s rule produce more bins, which typically have higher estimated densities in the joint distribution of
X and
Y than the product of both marginal distributions
X and
Y. As mutual information measures the distance between a joint distribution and the product of its marginals, as described in Equation (
6), mutual information increases with the sample size, as can be seen in
Figure 16, and said increase is not compensated by dividing by the size of the sample, as Equation (
22) indicates.
k-NN-based estimation continues to perform well, but the challenges of estimation in higher dimensions also affect the results of this method. Here, the maximum number of samples was reduced from the typical 100,000 to 50,000. Although not as markedly as with other methods, the computational cost increased from seconds to minutes in comparison to the previous four-dimensional case, even though the number of available samples was restricted. Notably, the estimator gives a much more accurate result for entropy across all sample sizes than KDE or binning achieve even at their largest implemented sample size, using the k-NN hyperparameter . For KL divergence and mutual information, the performance is not as accurate, even when considering an optimal choice of hyperparameters. For KL divergence, estimates using show a downward trend, which suggests that the estimator would eventually converge, but only at a much higher number of available samples. The same is true for mutual information using . However, the relative errors of both quantities are much smaller than those produced by KDE or binning, and remarkably similar in order of magnitude to those of all lower-dimensional test cases.
3.12. Computational Cost
A desirable property of an estimator is a low computational cost. With such a property, the estimator lends itself to practical purposes and processes that require a result to be estimated many times, such as Monte Carlo methods or bootstrapping [
10].
To assess the computational cost of the different estimation methods, we chose to evaluate the computation time required for each test case for a set number of samples and one hyperparameter of each specific method. Given the results presented above, we chose to use 10,000 samples to evaluate computation time because, in most cases, at this sample size, each estimator has (almost) converged to the true result. In terms of hyperparameters, for the binned estimator, we chose to use Scott’s method to calculate an appropriate bin width (see
Section 2.2), Silverman’s bandwidth for the KDE estimator (Equation (
14)), and
for the
k-NN estimator to calculate entropy and KL divergence, while for mutual information, we chose
. The results in
Table 2 show the mean, maximum, and minimum computation time (in seconds) for each experiment. All results were obtained on a single thread of an Intel(R) Xeon(R) CPU E5-2680 v2 with a clock speed of 2.80 GHz.
Across the board,
Table 2 shows that binning- and
k-NN-based methods have the lowest required computational time with a single estimate taking less than a second in most cases. This makes them ideal for tasks that require repeating calculations for different samples of data or using different hyperparameters.
For the binning estimator, the process of creating, filling, and applying a specific equation as described in
Section 2.2.1 typically takes milliseconds, with an increase depending on the number of dimensions of the data. While the required histogram is calculated explicitly for the lower-dimensional cases, for the 4- and 10-dimensional cases we employed a procedure in which every point in the sample is replaced by the index of the bin it occupies, greatly limiting the memory and computational time required for estimation. Interestingly, there is a decrease in the time required to estimate KL divergence between the 4- and 10-dimensional cases. This is a product of having to find the matching and occupied bins given by Scott’s rule-of-thumb for the bin size: as the bins are smaller and thus more sparsely populated in the 10-dimensional case, there are fewer matching bins that are also occupied. Having to calculate the KL divergence estimate for fewer bins makes this process faster in the 10-dimensional case than in the 4-dimensional case.
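The bin-index replacement used for the high-dimensional cases can be sketched as follows. This is a simplified version with a single scalar bin width (the function name and hash-map counting via `Counter` are our own choices for illustration); only occupied bins are stored, so the full d-dimensional grid is never allocated:

```python
import numpy as np
from collections import Counter

def binned_entropy_sparse(x, width):
    """Histogram-based entropy estimate (in nats) for samples x of shape (n, d),
    replacing each point by the integer index of the bin it occupies and
    counting only occupied bins in a hash map."""
    n, d = x.shape
    # integer bin coordinates per sample point
    idx = np.floor((x - x.min(axis=0)) / width).astype(np.int64)
    counts = Counter(map(tuple, idx))
    p = np.array(list(counts.values())) / n
    # discrete entropy plus the log(bin volume) correction that converts it
    # into an approximation of differential entropy
    return -np.sum(p * np.log(p)) + d * np.log(width)

rng = np.random.default_rng(2)
h_hat = binned_entropy_sparse(rng.uniform(size=(100000, 2)), width=0.1)
```

For a 2D uniform sample on the unit square, `h_hat` should be close to the true differential entropy of 0 nats, while memory use stays proportional to the number of occupied bins.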
As introduced by Gupta et al. [
10], the QS estimator uses bootstrapping twice: from a single sample, it first uses bootstrapping to determine the theoretical quantiles that best approximate the true distribution, and then uses bootstrapping again to estimate the bootstrapped confidence intervals of the estimate. Because our experiments were designed to address the effect of sampling variability on the estimators, the second step of the QS estimator was not performed, and bootstrapping was only used to determine the quantiles. This is reflected in the computational times shown in
Table 2 as the QS estimator takes longer than histogram-based methods using rules-of-thumb to determine the bin width. As described in
Section 2.2.2, bootstrapping was performed
times to determine the ideal quantiles.
KDE-based estimation has the highest computation time of all the estimation methods evaluated. Calculating a direct resubstitution estimate for any of the quantities is one to two orders of magnitude slower than using any of the previously discussed methods, and this becomes even worse when numerical integration is used. For a resubstitution estimate, each data point in the sample has to be evaluated once in the KDE-based representation of the PDF. In contrast, for numerical integration the number of evaluations is not fixed in advance: evaluations continue until a certain error tolerance of the method is met. Accordingly, for the lower-dimensional cases, the number of evaluations required by numerical integration is low, making it faster there than evaluating all available samples in the kernel-based PDF for a resubstitution estimate. Nevertheless, due to its high computation time, the KDE-based estimator does not seem suitable for repeated evaluations.
Finally, k-NN estimation requires about 10-times the computational effort of binning in low dimensions, but it is faster than direct KDE evaluation by about a factor of 1000. In high-dimensional cases, the effort of k-NN increases, but still remains well below that of the direct evaluation of KDE.
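These relative costs are easy to reproduce in miniature. The following rough timing sketch for 1D data compares the core operation of each method (sample size and helper names are our own choices; absolute numbers depend on the hardware):

```python
import time
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
x = rng.normal(size=2000)

def timed(f):
    """Wall-clock time of a single call to f, in seconds."""
    t0 = time.perf_counter()
    f()
    return time.perf_counter() - t0

# histogram construction with Scott's rule (core of binning-based estimation)
t_bin = timed(lambda: np.histogram(x, bins="scott"))
# k-d tree build plus nearest-neighbor query (core of k-NN estimation)
t_knn = timed(lambda: cKDTree(x[:, None]).query(x[:, None], k=2))
# KDE resubstitution: evaluating the kernel density at all sample points
t_kde = timed(lambda: gaussian_kde(x, bw_method="silverman")(x))
```

In our runs, the KDE evaluation was the slowest by a wide margin, consistent with the ordering reported in Table 2.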
3.13. Synthesis of Findings from Test Cases
Histogram-based estimation, or binning, is the most common approach for calculating entropy, KL divergence, or mutual information. In this study, we found that it is an accurate method, especially for one-dimensional data when using Scott’s rule-of-thumb to determine an adequate binning scheme. Nevertheless, the method loses accuracy for data in higher dimensions, where there are no established good practices for selecting a particular bin size and the rules-of-thumb extrapolate poorly. Only Sturges’ rule, which calculates the largest bin widths, seems to generalize to higher dimensions, as the available data for estimation are sparser there and fewer bins ensure that an adequate number of points falls into each occupied bin.
As an alternative, KDE-based estimation places a kernel at each point in the available sample to create a smooth representation of the probability density function. The performance of this method is good compared to the other estimation methods, but it might not be practical due to its high computational cost and the fact that, as implemented in this study, it used a Gaussian kernel and was applied to data sampled from normal distributions (multivariate normal distributions in higher dimensions). Extrapolating from the lower-dimensional test cases, we expect that, for high-dimensional distributions with different shapes, this estimation method might not perform as well.
Across all of the experiments, k-NN-based estimation was among the top-scoring methods, independent of the shape and dimensionality of the distribution sampled from, and the given sample size. We found a very consistent best-performing choice of the hyperparameter k, i.e., the number of neighbors to consider in the formulation of the estimator. Notably, the optimal choice of k depends on the specific quantity to be estimated: for entropy and KL divergence, performed best, while for mutual information, a high number of was shown to be optimal. Higher values of k were tested, but did not show significant improvement, while using for mutual information is generally not recommended. These findings suggest that a direct estimation in the immediate vicinity of each sample point is beneficial for estimating entropy (thereby, not smearing out the sample distribution too much) and KL divergence (where the ratios of probabilities are estimated per sampling point, thereby not accumulating approximation errors before calculating the ratios). For mutual information, a more aggregated view of the sample is required, favoring a higher number of neighbors to consider.
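The k-NN mutual information results build on the Kraskov–Stögbauer–Grassberger (KSG) construction, which aggregates neighbor counts in the marginal spaces around each point's k-th joint neighbor. A minimal sketch for two 1D variables (our own illustrative implementation; the small radius offset enforces the strict inequality in the neighbor counts):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mutual_information(x, y, k=4):
    """KSG (algorithm 1) k-NN mutual information estimate (in nats)."""
    n = len(x)
    xy = np.column_stack([x, y])
    # Chebyshev (max-norm) distance to the k-th nearest neighbor in joint space
    eps = cKDTree(xy).query(xy, k=k + 1, p=np.inf)[0][:, -1]
    # count neighbors strictly within eps in each marginal space
    # (subtract 1 to exclude the point itself)
    nx = cKDTree(x[:, None]).query_ball_point(
        x[:, None], eps - 1e-12, p=np.inf, return_length=True) - 1
    ny = cKDTree(y[:, None]).query_ball_point(
        y[:, None], eps - 1e-12, p=np.inf, return_length=True) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

rng = np.random.default_rng(4)
x = rng.normal(size=5000)
y = 0.8 * x + 0.6 * rng.normal(size=5000)  # correlation 0.8
mi_hat = ksg_mutual_information(x, y, k=4)
```

For this correlated Gaussian pair, `mi_hat` should lie close to the analytical value −0.5 ln(1 − 0.8²) ≈ 0.511 nats.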
Moreover, the computational cost is relatively low, as described in
Section 3.12. Hence,
k-NN-based estimation is readily implementable in practical applications and lends itself to repeated sampling such as in Monte Carlo or bootstrapping analyses.
We found that KDE and binning both performed well in some cases, but far less consistently than k-NN, and with the best-performing variant differing between cases (e.g., direct estimation vs. integration for KDE, and Scott’s rule vs. Sturges’ and FD’s rules for binning). Hence, the clear recommendation on how to set k-NN’s hyperparameter is a further advantage of this method.
4. Summary, Conclusions, and Outlook
Estimating information-theoretic quantities from sample data, possibly in higher dimensions, poses a challenge. Most methods rely on an initial density estimation step, but density estimation itself is known to be computationally demanding or even prohibitive, and to produce unreliable results with unknown accuracy and precision in practical settings. Further, most methods assume a certain shape of distribution that the sample stems from, with an unknown impact on the estimator’s performance if the true distribution deviates from that assumption. As an alternative, nearest-neighbor-based methods to directly estimate specific information-theoretic quantities have been proposed, skipping the initial density estimation step altogether. Hence, they show promise to also perform well for higher dimensions and arbitrary distributions, but have never been systematically compared to density-based methods such as kernel density estimation (KDE) or binning (histogram-based schemes). In fact, most users in science and practice seem to favor binning for its straightforward implementation, but with no further justification. We hypothesize that this is due to a lack of systematic evaluation and guidance on which method to choose in what settings. With this investigation, we aimed to close this research gap.
Typically, new methods have been introduced individually, without systematic comparative analysis against other methods. In some cases, a brief comparison was made against another method that uses the same basic principle, such as histograms or k-NN. In response, we presented and discussed three of the most widely used non-parametric estimation methods for information-theoretic quantities, namely binning, KDE, and k-nearest neighbors (k-NN). To evaluate the estimators’ performance, we designed test cases that used data samples from distributions with different shapes and different numbers of dimensions to quantify the information-theoretic quantities entropy, KL divergence, and mutual information. Depending on the level of complexity of each case, analytical solutions existed or were approximated with a high-quality numerical reference solution. We tested the estimation methods on each case and reported the performance in a chart that serves to intercompare all methods for all target quantities as a function of sample size. The true distributions used to generate the data for our experiments ranged from simple 1D uniform or normal distributions, through more skewed or bimodal shapes, up to multivariate distributions in 4 and 10 dimensions. Sample sizes ranged from 100 to 100,000, with a reduction of the maximum sample size in cases where the computational effort exploded. As described, this is a very practical and easy-to-understand basis for our study. We also accounted for sampling variability, as each experiment was repeated with 300 random seeds for the data-generating sampling procedure. Further, we considered different choices of hyperparameters for each estimation method. Finally, we assessed and compared the computation time required to obtain an estimate for a sample size of 10,000, which produced a well-converged estimate for most estimation methods.
All methods showed larger relative errors for sample sizes typically below 1000, pointing to the fact that samples of such a small size are not representative enough of the underlying true distribution, and hence, any estimation method necessarily fails in reliably quantifying the underlying distribution’s particular property. For larger sample sizes, however, estimation results typically converged to a more or less biased estimate, with distinct differences between the methods.
For binning, the most important parameter is the bin width or
used to build the histogram because it controls the trade-off between oversmoothing and undersmoothing the data with respect to the true distribution. Typical rules-of-thumb were implemented and tested in this study (namely Scott’s, Sturges’, and FD’s rule), as well as the Quantile Spacing method by Gupta et al. [
10], but alternative methods using piecewise constant approximations for density [
47], as well as methods that calculate an estimate of
based on the minimization of the mean-integrated-squared error (MISE) should be explored as well in the future. These latter methods aim to reduce the error between the true distribution
and the histogram-based distribution
[
48].
Our KDE-based estimator used a Gaussian kernel as a typical choice; however, the choice of kernel is a hyperparameter that was not analyzed in this study, and it is expected that the performance of the estimator would vary depending on the choice of kernel used. A boxcar kernel, for example, would improve estimation in cases where the true distribution has distinct bounds such as a uniform distribution in one dimension, or a gamma-exponential distribution in two dimensions. However, such improvement would be accompanied by the known high computational cost of kernel-based methods, and as such, if only the particular information-theoretic quantity is of interest and no other property related to the kernel-based representation of the PDF, other more efficient methods seem better suited.
Finally, the evaluation of the
k-NN-based estimator proved its performance and served as a showcase for its computational efficiency. The speed of the estimator comes from the usage of the
k-d Tree data structure, which performs the look-up operations required to identify nearest neighbors. While we found that, in its original form, the estimator performs well across the diverse set of experiments we tested it on, recent variations have been proposed to further improve the estimation of entropy by combining
k-NN with normalizing flows [
49] and the estimation of mutual information by combining
k-NN with KDE [
22] or using neural networks [
50].
In general, across all test cases, k-NN produced among the lowest relative errors in estimating entropy, KL divergence, and mutual information, independent of the sample size and shape and dimensionality of the distribution sampled from. Also, a clear identification of the best-performing value of the hyperparameter k, i.e., the number of neighbors to be considered in the vicinity of a sampling point, was possible; interestingly, the optimal choice of k depends on the specific quantity to be estimated. For entropy and KL divergence, performed best, while for mutual information, a more integral approximation with a high number of was shown to be optimal (even higher values were tested, but did not show significant further improvement).
All tested methods might benefit from a specifically tailored bootstrapping approach to increase the stability of the estimator for large samples that are (sufficiently) representative of the true underlying distribution; this is recommended for further analysis. For small sample sizes, however, our study has confirmed the dominant impact of non-representativeness.
Based on our evaluation of the performance, ease of implementation, and computational effort, we recommend
k-NN-based estimation for estimating information-theoretic quantities from sample data, especially in higher dimensions, due to its clear advantages. However, the significance of this work extends beyond a theoretical comparison and practical performance assessment: we have also collected these methods in a publicly available Python 3
toolbox, ensuring transparency and accessibility for the wider research community. This toolbox shall serve as a valuable resource, enabling researchers and practitioners to integrate information-theoretic concepts into their data analysis and modeling workflows. Specifically, this study in combination with the toolbox shall enable them to make a well-informed decision on the choice of estimation method.