1. Introduction
The evaluation of thermodynamic potentials such as the entropy or the free energy is key to understanding the equilibrium properties of physical systems [1]. In real-sized classical problems, computer simulations based on Molecular Dynamics or Monte Carlo methods cannot generically access them, mainly because of the size of the space of states to sample, which grows exponentially with the number of particles. This effect is particularly easy to quantify in magnetic models of classical two-state spin systems, where the volume of the phase space grows as $2^N$, with $N$ the total number of spins. Quantities such as the Helmholtz free energy $F$ in the canonical ensemble, proportional to the logarithm of the partition function [2,3],
$$F = -k_B T \ln Z, \qquad Z = \sum_{\mathbf{x}} e^{-E(\mathbf{x})/k_B T},$$
are out of reach, as the sum extends over all possible states $\mathbf{x}$, with $E(\mathbf{x})$ the corresponding energy, $k_B$ the Boltzmann constant, and $T$ the temperature. In fact, finding the value of $Z$ is known to be an NP-hard problem [4], which prevents an exact evaluation unless the system is small.
The relevance of $Z$, together with the computational complexity of its determination, has spurred the development of methods to approximate it in a tractable way. One remarkable technique designed to tackle this problem was developed by Bennett [5], where the free energy difference between two overlapping canonical ensembles is estimated directly in a Monte Carlo simulation. When one of the two values of $F$ is known, the method yields the value of the other, thus giving access to $Z$. Another interesting approach towards the evaluation of the partition function is derived from the Wang–Landau algorithm [6,7,8], where a stochastic exploration of the phase space is used to recover the density of energy states $g(E)$ corresponding to the Hamiltonian of the system under study. In this framework, the partition function is recovered as the integral of $g(E)\, e^{-E/k_B T}$ over the energy range spanned by the system configurations. This method has proven to reliably reproduce the physics of different systems such as the 2D Ising model, although it can be difficult to apply to more complex situations involving an intricate $g(E)$.
An alternative approach to the problem was devised in 2001 by R. M. Neal [9,10]: the Annealed Importance Sampling (AIS) algorithm, where an annealing procedure is implemented to obtain reliable samples from an otherwise intractable probability distribution, starting from samples of a simpler and tractable one. In this method, the partition function is one of the simplest quantities to evaluate, although, as in most sampling schemes, convergence towards the exact value of $Z$ is only guaranteed in the infinite limit, both in the number of samples and in the number of intermediate annealing steps. In practical terms, when a finite number of samples and intermediate annealing chains is employed, the predicted value of $Z$ depends on the different simulation inputs, particularly on the initial probability distribution.
Surprisingly, and despite its broad formulation in terms of an initial and a final probability distribution, little use has been made of the AIS algorithm in the numerical simulation of physical systems, to the best of our knowledge. More applications have emerged in the world of neural networks, particularly in the field of machine learning with Restricted Boltzmann Machines (RBMs) [11,12], where the evaluation of $Z$ is key to a precise optimization of the system parameters along learning in an exact gradient descent scheme. In this context, the AIS algorithm turns out to be most efficient, since the random walk exploration can be performed by means of Gibbs sampling, which is fully parallelizable [13]. A review and unifying framework of the algorithms for the estimation of the partition function with AIS in RBMs can be found in [14].
In any case, the AIS algorithm is particularly suited to binary state unit problems like spin systems or RBMs, where the different probability distributions involved along the annealing chains are cost-effective and simple to evaluate. Notice that the RBM is a mathematical model that can be used to describe magnetic spin systems, where the weights and bias are directly related to the correlations, external fields and temperature (usually known or modeled a priori) [15,16,17]. In this sense, an RBM can be used to analyze the thermodynamics of these systems without resorting to a training set or a learning scheme. In this work, we focus on that situation, as we consider the RBM network parameters to be known. We use AIS to compute the partition function of different systems at several but low temperatures, where the calculation of $Z$ is known to be harder than at high temperatures. Notice, though, that AIS is a general algorithm with a broad range of applications that go beyond its use in RBM modeling [18,19].
To be precise, in this work, we study how AIS can be used to produce reliable estimates of $Z$ in magnetic physical systems that can be mapped into RBMs. Our goal is to achieve that using a suitable starting probability distribution with a small computational cost, even in realistically large problems. We discuss how to obtain the optimal mean field probability distribution $p_0(\mathbf{x})$ that is closest to the Boltzmann distribution of the real model under study. After a brief derivation of how to obtain $p_0(\mathbf{x})$ from average system properties, we propose two strategies to find approximations to it in Ising systems, Spin Glass systems and artificial models, where the exact value of the partition function can be determined. We also compare the results obtained with the standard procedure, where the uniform probability distribution is employed as the starting point of the AIS algorithm [14,20], a procedure that shows a non-stable behavior when measured along learning [21]. Notice that our methodology does not use any external data other than the two-body correlations and external fields defining the model.
2. Annealed Importance Sampling
The AIS algorithm, developed by R. Neal in the late 1990s [9,10], allows sampling from a probability distribution that would otherwise be intractable. It can be used to estimate $Z$, but it is more general and allows finding approximate values of any observable quantity $\mathcal{O}(\mathbf{x})$ over a probability distribution $p(\mathbf{x})$. In a general sense, this computation can be very inefficient for two main reasons. On one hand, the probability distribution $p(\mathbf{x})$ can be impossible to sample because its exact form is not known, as happens in many quantum physics problems [22,23,24,25]. On the other hand, the number of samples required to obtain an accurate estimate of the average value of $\mathcal{O}(\mathbf{x})$ may be unreasonably large. In order to deal with these problems, one usually resorts to some form of Importance Sampling, where the exploration of the space is guided by a known and suitable probability distribution $q(\mathbf{x})$ [26]. In this way, one typically evaluates $\langle \mathcal{O} \rangle$ using stochastic techniques, where samples are drawn from $q(\mathbf{x})$. Importance Sampling is employed to reduce the variance of the estimator, or to reduce the number of samples needed to achieve the same statistical accuracy. In any case, Importance Sampling can only be performed when a suitable $q(\mathbf{x})$ is at hand, which may not always be the case. The AIS method allows building $q(\mathbf{x})$ starting from a trivial probability distribution, performing an annealing process through a set of intermediate distributions corresponding to decreasing temperatures.
As explained in [9,10], in order to estimate $p_K(\mathbf{x}) \equiv p(\mathbf{x})$ starting from a trivial $p_0(\mathbf{x})$, one builds a chain of intermediate distributions $p_1(\mathbf{x}), \ldots, p_{K-1}(\mathbf{x})$ that interpolate between $p_0$ and $p_K$. Denoting by $p_k^*(\mathbf{x})$ the corresponding unnormalized probability distributions, a common scheme is to define
$$p_k^*(\mathbf{x}) = p_0^*(\mathbf{x})^{1-\beta_k}\, p_K^*(\mathbf{x})^{\beta_k},$$
with $0 = \beta_0 < \beta_1 < \cdots < \beta_K = 1$. The approach used in AIS is to turn the estimation of $Z_K/Z_0$ into a multidimensional integration of the form
$$\frac{Z_K}{Z_0} = \int \frac{\bar{p}(\mathbf{x}_1, \ldots, \mathbf{x}_K)}{\bar{q}(\mathbf{x}_1, \ldots, \mathbf{x}_K)}\, \bar{q}(\mathbf{x}_1, \ldots, \mathbf{x}_K)\, d\mathbf{x}_1 \cdots d\mathbf{x}_K, \qquad (4)$$
where
$$\bar{q}(\mathbf{x}_1, \ldots, \mathbf{x}_K) = p_0(\mathbf{x}_1)\, T_1(\mathbf{x}_1 \to \mathbf{x}_2) \cdots T_{K-1}(\mathbf{x}_{K-1} \to \mathbf{x}_K)$$
and
$$\bar{p}(\mathbf{x}_1, \ldots, \mathbf{x}_K) = p_K(\mathbf{x}_K)\, \tilde{T}_{K-1}(\mathbf{x}_K \to \mathbf{x}_{K-1}) \cdots \tilde{T}_1(\mathbf{x}_2 \to \mathbf{x}_1)$$
are normalized joint probability distributions for the set of variables $\{\mathbf{x}_1, \ldots, \mathbf{x}_K\}$. In these expressions, $T_k(\mathbf{x} \to \mathbf{x}')$ represents a transition probability of moving from state $\mathbf{x}$ to state $\mathbf{x}'$, which asymptotically leads to the equilibrium probability $p_k(\mathbf{x})$. In the same way, $\tilde{T}_k$ represents the reversal of $T_k$. The detailed balance condition implies that the transition probabilities fulfill the relation
$$p_k(\mathbf{x})\, T_k(\mathbf{x} \to \mathbf{x}') = p_k(\mathbf{x}')\, T_k(\mathbf{x}' \to \mathbf{x})$$
in order to be able to sample the space ergodically [27]. Therefore, $Z_K/Z_0$ can be estimated from Equation (4) with
$$w = \prod_{k=1}^{K} \frac{p_k^*(\mathbf{x}_k)}{p_{k-1}^*(\mathbf{x}_k)}, \qquad (8)$$
as $\bar{q}$ is easily sampled starting from the trivial $p_0(\mathbf{x})$.
In practice, one uses $\bar{q}$ to generate $N_s$ samples of all the intermediate distributions, such that for every set of values $\{\mathbf{x}_1^{(i)}, \ldots, \mathbf{x}_K^{(i)}\}$, with $i = 1, \ldots, N_s$, one obtains a weight $w^{(i)}$ upon substitution in Equation (8). In this way, $\langle \mathcal{O} \rangle$ is estimated according to
$$\langle \mathcal{O} \rangle \simeq \frac{\sum_{i=1}^{N_s} w^{(i)}\, \mathcal{O}(\mathbf{x}_K^{(i)})}{\sum_{i=1}^{N_s} w^{(i)}},$$
with
$$w^{(i)} = \prod_{k=1}^{K} \frac{p_k^*(\mathbf{x}_k^{(i)})}{p_{k-1}^*(\mathbf{x}_k^{(i)})},$$
which defines the set of importance weights $\{w^{(i)}\}$ obtained from the product of the ratios of the unnormalized probabilities. Notice that $p_k^*(\mathbf{x})$ is an accessible quantity, while $p_k(\mathbf{x})$ is not, just because one does not have access to the normalization constant $Z_k$. One important consequence of this formalism is that a simple estimator of the partition function $Z_K$ associated to the distribution $p_K(\mathbf{x})$ is directly given by the average value
$$\hat{Z}_K = \frac{Z_0}{N_s} \sum_{i=1}^{N_s} w^{(i)}.$$
Since the values of $w^{(i)}$ are usually large, one typically draws samples of $\ln w^{(i)}$. In this way, one uses a set of $Z_0$-normalized AIS samples $\hat{z}^{(i)}$, such that $\ln \hat{z}^{(i)} = \ln Z_0 + \ln w^{(i)}$, and defines
$$\ln \hat{Z} = \ln \left[ \frac{1}{N_s} \sum_{i=1}^{N_s} e^{\ln \hat{z}^{(i)}} \right] \qquad (12)$$
as an approximation to $\ln Z$. Notice that this value is different from the mean of the samples $\ln \hat{z}^{(i)}$, although the two quantities do not differ much when the variance of the samples is small compared to the mean value. In other situations, the nonlinear character of the operation in Equation (12) makes the result dominated by the largest samples, to the point that, in the extreme case, the largest sample exhausts the total sum.
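To make the procedure concrete, the following minimal sketch (Python/NumPy) implements the annealing loop of Equation (8) and the estimator of Equation (12) for a generic model. It is an illustration rather than the authors' code: the helpers log_p0_star, log_pK_star, sample_p0 and transition are hypothetical placeholders for the unnormalized log-probabilities, the initial sampler and the MCMC kernel of the problem at hand.

import numpy as np

def ais_log_z(log_p0_star, log_pK_star, sample_p0, transition, log_z0,
              n_samples=1024, n_beta=1000, rng=None):
    """Estimate ln Z by AIS along a linear grid of inverse temperatures.

    log_p0_star, log_pK_star: unnormalized log-probabilities of the initial
        and target distributions, acting on a batch of states (n_samples, N).
    sample_p0: callable drawing a batch of states from the tractable p_0.
    transition: MCMC kernel leaving p_beta invariant (e.g., one Gibbs sweep).
    log_z0: exact ln Z_0 of the initial distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    betas = np.linspace(0.0, 1.0, n_beta + 1)       # beta_0 = 0, ..., beta_K = 1

    def log_p_star(x, beta):                        # p_k* = (p_0*)^(1-b) (p_K*)^b
        return (1.0 - beta) * log_p0_star(x) + beta * log_pK_star(x)

    x = sample_p0(n_samples, rng)                   # x_1 drawn from p_0
    log_w = np.zeros(n_samples)                     # accumulates ln w^(i)
    for k in range(1, n_beta + 1):
        # ratio p_k*(x_k) / p_{k-1}*(x_k), in log space (Equation (8))
        log_w += log_p_star(x, betas[k]) - log_p_star(x, betas[k - 1])
        if k < n_beta:
            x = transition(x, betas[k], rng)        # x_{k+1} ~ T_k(x_k -> .)

    # Equation (12): log-mean-exp of the Z_0-normalized samples ln z^(i)
    log_z = log_z0 + log_w
    m = log_z.max()
    return m + np.log(np.mean(np.exp(log_z - m)))

Subtracting the maximum before exponentiating leaves Equation (12) unchanged while avoiding overflow, which is the standard way of handling the large $\ln w^{(i)}$ values mentioned above.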
3. The Restricted Boltzmann Machine
An RBM with binary units is a spin model describing a mixture of two different species, where intra-species interactions are forbidden and the units play the role of the spins. In general, though, RBM units take values in $\{0,1\}$ rather than $\{-1,+1\}$. Furthermore, only one component of this mixture, usually called the visible layer, is assumed to be accessible to the external observer. The other species, usually called the hidden layer, is assumed to have no contact with the outside world, and is present to build up correlations in the model. As a consequence, one is only interested in the marginal probability distribution associated to the visible units.
The energy function of a binary RBM with $N_v$ visible units $\mathbf{v}$ and $N_h$ hidden units $\mathbf{h}$ is defined as [28,29]:
$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{v}^T \mathbf{w}\, \mathbf{h} - \mathbf{b}^T \mathbf{v} - \mathbf{c}^T \mathbf{h}, \qquad (13)$$
where $\mathbf{w}$ is the two-body weight matrix setting the coupling strength between the two species, while $\mathbf{b}$ and $\mathbf{c}$ represent the external fields acting on each layer and are generically denoted as bias. In this expression, $\mathbf{v}^T$ stands for the transpose of vector $\mathbf{v}$.
The energy in Equation (13) can be cast as a quadratic form, where visible and hidden units are organized as row and column vectors preceded by a constant value of 1 to account for the bias terms,
$$\tilde{\mathbf{v}} = (1, v_1, \ldots, v_{N_v}), \qquad \tilde{\mathbf{h}} = (1, h_1, \ldots, h_{N_h})^T,$$
leading to
$$E(\mathbf{v}, \mathbf{h}) = -\tilde{\mathbf{v}}\, \mathbf{W}\, \tilde{\mathbf{h}}, \qquad \mathbf{W} = \begin{pmatrix} 0 & \mathbf{c}^T \\ \mathbf{b} & \mathbf{w} \end{pmatrix},$$
where $\mathbf{W}$ is the $(N_v+1) \times (N_h+1)$ extended weight matrix, which includes the bias terms.
As usual in energy-based models, the probability of a state $(\mathbf{v}, \mathbf{h})$ follows a Boltzmann distribution
$$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})},$$
with $Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$ and $k_B T$ set to 1. The particular form of the energy function (13) makes both $p(\mathbf{v}|\mathbf{h})$ and $p(\mathbf{h}|\mathbf{v})$ factorize as products of probabilities corresponding to independent random variables. As a consequence, Gibbs sampling can be efficiently used to compute them [30]. In addition, it is also possible to evaluate one of the two sums involved in the partition function. In this way, for $\{0,1\}$ units, one has
$$Z = \sum_{\mathbf{v}} e^{\mathbf{b}^T \mathbf{v}} \prod_{j=1}^{N_h} \left( 1 + e^{c_j + \mathbf{v}^T \mathbf{w}_{:,j}} \right),$$
where index $j$ runs over the whole set of hidden units, and $\mathbf{w}_{:,j}$ stands for the $j$th column of $\mathbf{w}$. However, the evaluation of $Z$ is still prohibitive when the number of visible and hidden variables is large, since it involves an exponentially large number of terms. For that reason, RBMs are computationally hard to evaluate or simulate accurately [31].
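For small systems, the marginalized expression above can be evaluated directly. The sketch below (Python/NumPy, same conventions as the previous snippet) computes the unnormalized log-probability of a batch of visible states and the exact $\ln Z$ by brute-force enumeration; it illustrates the formulas and is not code from the paper.

import numpy as np

def log_p_star_v(v, W):
    """Unnormalized log-probability of a batch of {0,1} visible states,
    with the hidden layer summed out analytically."""
    b, c, w = W[1:, 0], W[0, 1:], W[1:, 1:]
    act = c + v @ w                                 # argument of each hidden factor
    return v @ b + np.sum(np.logaddexp(0.0, act), axis=-1)

def brute_force_log_z(W):
    """Exact ln Z by enumerating all 2^Nv visible states (small Nv only)."""
    Nv = W.shape[0] - 1
    states = ((np.arange(2 ** Nv)[:, None] >> np.arange(Nv)) & 1).astype(float)
    log_terms = log_p_star_v(states, W)
    m = log_terms.max()
    return m + np.log(np.sum(np.exp(log_terms - m)))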
4. Parameters of the Models
In this work, we explore different problems where $Z$ can be computed exactly, which are then used to benchmark the approximations described afterwards. At the end, these approximations are employed to predict the value of $Z$ in a large, realistic system where an exact evaluation is prohibitive. The set of models where the exact $Z$ is accessible includes artificially generated weights, magnetic spin systems that can be directly mapped into an RBM, and weights obtained after an RBM learning process (where a training dataset is available, in contrast to the other cases). The weights and bias generated have similar statistical moments, so that by changing the temperature, the system displays different thermodynamic properties. In the following, we focus on the low-temperature regime, as in this limit the number of states that acquire a significant probability is reduced, as a consequence of the third law of thermodynamics. Due to the large size of the configuration space, the problem of finding $Z$ then becomes much harder than at high temperatures, thus challenging the accuracy of the AIS predictions obtained with a low computational cost.
The sets of parameters analyzed in this work include:
- (1)
Gaussian Weights with Gaussian Moments (GWGM), characterized by an extended matrix of weights of Gaussian random numbers with mean $\mu$ and standard deviation $\sigma$. We have generated a total of 100 models, each one with weights and bias sampled from a normal distribution $\mathcal{N}(\mu, \sigma)$, with both $\mu$ and $\sigma$ also sampled from normal distributions, ensuring the latter is positive. In this way, each model follows a single Gaussian mode with different mean and variance. Notice that there is no explicit temperature dependence in these models, although according to the definition of the RBM energy in Equation (13), a temperature $T$ in the corresponding Boltzmann factors could be understood as being reabsorbed into the weights and bias themselves. Finally, due to the reduced number of visible and hidden units, the exact value of $Z$ for each model has been calculated by brute force.
- (2)
A set of weights obtained after training an RBM with the MNIST dataset [32], with 20 hidden units (MNIST-20h), similar to the simple case studied in Ref. [13]. The network was trained with CD$_1$ (one-step Contrastive Divergence) for 500 epochs, where convergence was already achieved. We monitor and store the weights along the learning process with the aim of having a complete picture of their evolution. In this way, we have snapshots taken at the beginning of the learning, where the training set typically does not correspond to the highest probability states, and at the end, where they are supposed to carry most of the probability mass. Notice that this, together with the MNIST-500h model described at the end of this section, are the only problems where standard RBM learning has been performed. Furthermore, being a learning problem, there is no explicit temperature implied, or equivalently, the temperature is always set to 1.
The previous problems use $\{0,1\}$ binary visible and hidden variables. The next two models correspond to magnetic spin systems, mapped into RBMs using $\{-1,+1\}$ values, which has been an active topic of research in recent years [33,34,35,36,37]. According to [15,16,17], spin systems with nearest-neighbor interactions can be simulated considering two disjoint subnets with half the total number of spins each. In this scheme, the state of all the spins in each subnet can be updated in parallel. This is a perfect fit for an RBM implementation, where units in the visible and hidden layers are arranged according to a checkerboard configuration, as shown in Figure 1. Actually, using an RBM with these weights yields an exact mapping to the standard procedure of sampling the two disjoint subnets mentioned above.
- (3)
Classical Ising and Spin Glass models in one and two dimensions. A one-dimensional Ising model with periodic boundary conditions containing an even number of spins $N$ can be represented by an RBM with the same number of units in each layer, as shown in panel (a) of Figure 1. Identifying even and odd spins with hidden and visible units, corresponding to black and white symbols in the figure, one has
$$w_{ij} = J_{ij}, \qquad b_i = B_i, \qquad c_j = B_j,$$
where $J_{ij}$ is the interaction between spins $i$ and $j$, which vanishes for spins that are not nearest neighbors. Only two entries per row/column of $\mathbf{w}$ can be non-zero in this arrangement. In the Ising model (1DIsing), $J_{ij} = J$ and $B_i = B$ for all spins, while they can take different values in what we denote as a Spin Glass model (1DSG). The partition function of the 1DIsing and 1DSG models can be easily computed using the Transfer Matrix formalism [38,39] (see the sketch at the end of this section). We have generated 100 different 1DIsing models, with the $J$ and $B$ parameters of each model drawn from a normal distribution, giving 100 different 1DIsing Hamiltonians. In much the same way, we have also generated 100 1DSG models, with all the $J_{ij}$ and $B_i$ parameters drawn from the same probability distribution. All these models contain $N$ spins. We have then analyzed these systems at three different temperatures, $T_1$ (1DIsing1 and 1DSG1), $T_2$ (1DIsing2 and 1DSG2), and $T_3$ (1DIsing3 and 1DSG3).
The two-dimensional square-lattice Ising model is much harder to solve, and its analytic solution in the absence of an external field was given by Onsager [40]. Similar to the 1D models, it can be represented by an RBM where visible and hidden units are arranged in a checkerboard configuration, as shown in panel (b) of Figure 1. In this case, four weights can be non-zero in each row and column of $\mathbf{W}$, since there are no bias terms. Three sets of 100 2DIsing models (2DIsing1, 2DIsing2 and 2DIsing3) corresponding to $N$ spins have been generated, with parameters drawn from the same normal distributions used for the previous 1D cases, and at the same temperatures.
Furthermore, we have extended that to what we call a 2D Spin Glass (2DSG), where all two-body correlations are different, while keeping the connectivity restricted to nearest neighbors. In this case, the partition function is computed by brute force, which limits the size of the square lattices, as an even number of spins per dimension is required in order to properly satisfy the periodic boundary conditions. Two different sets of 50 models (2DSG1 and 2DSG2) have been used, with parameters drawn from the same normal distribution and corresponding to temperatures $T_1$ and $T_2$, respectively.
All these models use $\{-1,+1\}$ spin variables, as is standard for spin systems.
Finally, we also analyze the weights of an RBM trained with the MNIST dataset containing 500 hidden units (MNIST-500h), where no exact value of $Z$ can be obtained due to its large size. The training was performed in the same conditions as in the MNIST-20h case.
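The 1D mapping and the Transfer Matrix evaluation described above can be illustrated with a short sketch (Python/NumPy). The checkerboard index convention is our reconstruction of Figure 1 (odd spins visible, even spins hidden) and may differ from the authors' exact labeling; a temperature $T$ can be absorbed by rescaling J and B by $1/T$:

import numpy as np

def ising_chain_to_rbm(J, B):
    """Map a periodic 1D chain with bonds J[i] = J_{i,i+1} and fields B[i]
    onto RBM parameters (w, b, c) with +-1 units. Odd spins become visible
    units and even spins hidden ones, so every row/column of w has exactly
    two non-zero entries."""
    N = len(J)                                   # total (even) number of spins
    M = N // 2
    w = np.zeros((M, M))
    for j in range(M):                           # hidden unit j is spin 2j
        w[j, j] = J[2 * j]                       # bond (2j, 2j+1) -> visible j
        w[(j - 1) % M, j] = J[(2 * j - 1) % N]   # bond (2j-1, 2j) -> visible j-1
    return w, B[1::2], B[0::2]                   # couplings, visible/hidden bias

def transfer_matrix_log_z(J, B, T=1.0):
    """Exact ln Z of the same periodic chain by the Transfer Matrix method."""
    beta, N = 1.0 / T, len(J)
    s = np.array([1.0, -1.0])
    prod, log_scale = np.eye(2), 0.0
    for i in range(N):
        # T_i[s, s'] = exp(beta * (J_i s s' + B_i s / 2 + B_{i+1} s' / 2))
        Ti = np.exp(beta * (J[i] * np.outer(s, s)
                            + 0.5 * B[i] * s[:, None]
                            + 0.5 * B[(i + 1) % N] * s[None, :]))
        prod = prod @ Ti
        norm = prod.max()                        # rescale to avoid overflow
        prod /= norm
        log_scale += np.log(norm)
    return log_scale + np.log(np.trace(prod))

With w, b and c in hand, the chain can be treated as any other RBM, while transfer_matrix_log_z provides the exact reference value of $\ln Z$.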
5. The Optimal Mean Field Approximation
The equilibrium Boltzmann distribution associated to any physical system is given by
$$p(\mathbf{x}) = \frac{1}{Z} e^{-E(\mathbf{x})},$$
where $E(\mathbf{x})$ is the system's energy corresponding to state $\mathbf{x}$. In the spirit of AIS, the partition function associated to $p(\mathbf{x})$ can be obtained from a chain of intermediate probability distributions that start from another, much simpler and easy-to-sample $p_0(\mathbf{x})$, as shown in Section 2. Obtaining a good $p_0(\mathbf{x})$ can ease the job for AIS, and therefore becomes a key ingredient to obtain an accurate estimation of $Z$ with a reasonable number of intermediate chains and samples. A very simple probability distribution can be obtained from a mean-field model containing only external fields $\mathbf{b}^{(0)}$. In this scheme, and for an RBM,
$$E_0(\mathbf{v}) = -\mathbf{b}^{(0)T} \mathbf{v}$$
defines the starting mean field energy, which makes
$$p_0(\mathbf{v}) = \frac{1}{Z_0} e^{\mathbf{b}^{(0)T} \mathbf{v}} \qquad (20)$$
the product of independent distributions for each unit, thus allowing for a very simple and efficient sampling scheme in parallel. Furthermore, for $\{0,1\}$ binary units, the corresponding partition function reads
$$Z_0 = \prod_{i=1}^{N_v} \left( 1 + e^{b_i^{(0)}} \right),$$
while for $\{-1,+1\}$ units, one has
$$Z_0 = \prod_{i=1}^{N_v} 2 \cosh b_i^{(0)}.$$
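Both closed forms, together with the parallel sampling they enable, fit in a few lines (Python/NumPy; a sketch of the formulas above, not the authors' code):

import numpy as np

def mean_field_log_z0(b0, pm_units=False):
    """ln Z_0 of the mean-field distribution with bias b0."""
    if pm_units:                                  # {-1,+1} units
        return np.sum(np.log(2.0 * np.cosh(b0)))
    return np.sum(np.logaddexp(0.0, b0))          # {0,1} units

def sample_mean_field(b0, n_samples, rng, pm_units=False):
    """Draw independent samples from p_0 for all units in parallel."""
    scale = 2.0 if pm_units else 1.0
    p1 = 1.0 / (1.0 + np.exp(-scale * b0))        # p(unit = 1)
    x = (rng.random((n_samples, len(b0))) < p1).astype(float)
    return 2.0 * x - 1.0 if pm_units else x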
Despite dealing with a mean field, obtaining the most suitable $\mathbf{b}^{(0)}$ may not be a trivial task. In most practical applications, and for lack of a better model, the simplest choice $\mathbf{b}^{(0)} = 0$ is adopted, thus turning $p_0(\mathbf{v})$ into the uniform probability distribution. In the spirit of the AIS algorithm, and according to its theoretical development [9,10], one then expects that increasing the number of intermediate distributions should lead to the exact result, no matter what the starting $p_0(\mathbf{v})$ is. Whilst this should be the case, the dynamics of this process is not clear, nor is it clear whether the desired limit is attained with a large but manageable number of intermediate distributions. In other words, one has no clue as to what the convergence properties of the algorithm are, other than knowing that it provides the right result in the infinite limit. In order to test that in practice, we have conducted different experiments with the GWGM and MNIST-20h problems of Section 4. Given that our goal is to obtain reliable estimates of $Z$ with a small computational cost, these experiments are also useful for selecting suitable values of the number of intermediate distributions $N_\beta$ and the number of AIS samples $N_s$. In these experiments and the following ones reported, a linear grid of equidistant inverse temperatures has been employed. We have tried different schemes (such as a logarithmic grid), finding that no significant differences were obtained.
Figure 2 shows the evolution of the prediction of $\ln Z$ with $N_\beta$ for the MNIST-20h weights (left panel) and 10 randomly selected GWGM weights (right panel). In all these calculations, a total of $N_s = 1024$ AIS samples have been employed to build $\ln \hat{Z}$ according to Equation (12). In the MNIST-20h case, both the exact and the predicted values are displayed, while in the GWGM case, the ratio of the AIS estimate to the exact $\ln Z$ is displayed for the sake of clarity. The error bars are obtained after averaging 100 repetitions of the same experiments.
Two immediate conclusions can be drawn from Figure 2. On one hand, it is clear that in both cases a stable prediction is already achieved at moderate values of $N_\beta$. This fact has also been observed with all the other sets of weights tested. Based on that, we have fixed the values of $N_\beta$ and $N_s = 1024$ in all the following AIS runs throughout this work, which seems to be large enough to obtain stable results while still allowing for a fast evaluation of $\ln \hat{Z}$ with a standard computer. On the other hand, one readily notices that, despite providing an apparently converged result, the AIS prediction starting from the uniform distribution may differ substantially from the exact result, even in cases where one of the dimensions of the problem ($N_v$ or $N_h$) is small. The situation is even worse as the error bars diminish with increasing $N_\beta$, leading to the false impression that a reliable prediction has been achieved. The results in the left panel show that this picture remains unaltered even for the largest $N_\beta$ displayed, thus indicating that a completely impractical number of intermediate distributions is probably needed to produce the changes required to bring the AIS prediction close to the exact result, something that is only guaranteed in the asymptotic limit [9,10].
Still, the plots in Figure 2 yield a discouraging picture about the possibility of achieving good results starting from the uniform distribution, an image that should be properly put into perspective. In order to obtain a more complete view, we have conducted AIS experiments starting from $\mathbf{b}^{(0)} = 0$ on all the models described in Section 4. We have computed 10 independent repetitions for each set of weights, each consisting of $N_s = 1024$ AIS samples. For every repetition, an estimation of $Z$ has been obtained from the 1024 samples using Equation (12), and the relative error with respect to the exact value has been calculated. For all the sets of weights belonging to the same system (GWGM, MNIST-20h, …), the total number of estimations producing a relative error below a fixed tolerance has been computed. The bars in Figure 3 show that number as a percentage. As can be seen, the choice $\mathbf{b}^{(0)} = 0$ works in many cases, but not in all of them.
In any case, and despite the fact that the uniform probability distribution corresponding to $\mathbf{b}^{(0)} = 0$ provides a trivial starting point, it is not the only possible simple choice. In fact, any distribution of the mean field form given in Equation (20) is suitable to start AIS from, as all components of $\mathbf{v}$ become independent random variables that can be sampled in parallel. Among all the possible choices of $\mathbf{b}^{(0)}$, therefore, one can look for the optimal one that produces the best possible results with little computational cost. In this context, being optimal means producing a mean field probability distribution that is closest to the actual $p(\mathbf{v})$ one seeks to sample, according to some metric.
In particular, the optimal values $\mathbf{b}^{(0)}_{\rm opt}$ of $\mathbf{b}^{(0)}$ can be obtained by minimizing the Kullback–Leibler (KL) divergence between $p_0(\mathbf{v})$ and the full RBM probability distribution $p(\mathbf{v})$, so we impose the condition
$$\frac{\partial}{\partial b_i^{(0)}} D_{KL} = 0, \qquad D_{KL} = \sum_{\mathbf{v}} p(\mathbf{v}) \ln \frac{p(\mathbf{v})}{p_0(\mathbf{v})},$$
where the sum over $\mathbf{v}$ extends to all the $2^{N_v}$ states, as hidden states have already been marginalized in both $p(\mathbf{v})$ and $p_0(\mathbf{v})$. One thus has, for $\{0,1\}$ units, the moment-matching condition
$$\langle v_i \rangle_0 = \frac{e^{b_i^{(0)}}}{1 + e^{b_i^{(0)}}} = \langle v_i \rangle_n,$$
where the subscript $n$ indicates that the average values are taken over the probability distribution corresponding to the target RBM. In this way, one obtains, for $\{0,1\}$ units,
$$b_{i,{\rm opt}}^{(0)} = \ln \frac{\langle v_i \rangle_n}{1 - \langle v_i \rangle_n} \qquad (25)$$
for each visible unit $i$. For $\{-1,+1\}$ units, a similar procedure leads to
$$b_{i,{\rm opt}}^{(0)} = \tanh^{-1} \langle v_i \rangle_n = \frac{1}{2} \ln \frac{1 + \langle v_i \rangle_n}{1 - \langle v_i \rangle_n}. \qquad (26)$$
These expressions, also appearing in [41], imply that the problem of finding $\mathbf{b}^{(0)}_{\rm opt}$ is equivalent to obtaining the exact average values of the visible units, which may not be a trivial task depending on the problem at hand.
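In code, Equations (25) and (26) reduce to one line per case. The sketch below (Python/NumPy) adds a clipping parameter eps, our own guard against the degenerate averages $\langle v_i \rangle_n \in \{0, 1\}$ (or $\pm 1$), which would otherwise produce an infinite bias:

import numpy as np

def optimal_bias(mean_v, pm_units=False, eps=1e-6):
    """Optimal mean-field bias from <v_i>_n, Equations (25) and (26)."""
    if pm_units:
        m = np.clip(mean_v, -1.0 + eps, 1.0 - eps)
        return np.arctanh(m)                      # Equation (26)
    m = np.clip(mean_v, eps, 1.0 - eps)
    return np.log(m / (1.0 - m))                  # Equation (25)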
In order to test the benefits of using $\mathbf{b}^{(0)}_{\rm opt}$, we perform several AIS runs starting from the optimal mean field and compare the results to the same calculations starting from the uniform probability distribution, corresponding to $\mathbf{b}^{(0)} = 0$. As stated above, in both cases we use $N_\beta$ intermediate chains to obtain $N_s = 1024$ AIS samples.
Figure 4 shows the results obtained in colormap form for one of the most difficult GWGM cases. The horizontal axis indicates the number $n_h$ of hidden units considered, spanning the range from 1 to $N_h$, obtained by discarding weights (that is, setting $\mathbf{w}_{:,j} = 0$ for $j > n_h$), while the vertical axis displays the inverse temperature. In all cases, we use the full set of $N_v$ visible units, as described in Section 4, thus allowing for the exact calculation of $Z$ by brute force. The maps show the percentage of the 1024 samples of $\ln \hat{z}$ that differ from the exact value by less than the prescribed tolerance in each case. As can be readily seen, the fact that the optimal $p_0(\mathbf{v})$ is closer to the RBM probability distribution reduces the work AIS has to do and improves its performance, as expected. Notice, though, that for some combinations of $T$ and $n_h$, the efficiency of AIS suffers even when starting from $\mathbf{b}^{(0)}_{\rm opt}$. This should not be completely surprising, mostly considering that a mean field starting probability distribution can still be too far away from that of the actual RBM, thus indicating that one should look for a different (and unknown) starting probability distribution.
The right panel in Figure 4 also suggests that a mean field starting point can be problematic when the number of hidden units is much larger than the number of visible ones. This problem is easily solved by noticing that $Z$ is invariant under the exchange of $\mathbf{v}$ and $\mathbf{h}$ in the RBM, associated to replacing the array of weights by its transpose. Based on these results, we have conducted additional tests on the whole GWGM set. In fact, the expectation values $\langle v_i \rangle_n$ can always be evaluated when the dimension of the hidden space is small, as in the present case. It is easy to show that, for binary $\{0,1\}$ units, one has
$$\langle v_i \rangle_n = \sum_{\mathbf{h}} p(\mathbf{h})\, \frac{1}{1 + e^{-(b_i + \mathbf{w}_{i,:} \mathbf{h})}}, \qquad (27)$$
where the sum extends over all hidden states, while $p(\mathbf{h})$ and $\mathbf{w}_{i,:}$ stand for the hidden state probability and the $i$th row of the two-body weight matrix, respectively.
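A direct transcription of Equation (27) (Python/NumPy, our own sketch) enumerates the $2^{N_h}$ hidden states, which is only feasible for a small hidden layer, precisely the regime discussed above:

import numpy as np

def exact_mean_v(w, b, c):
    """Exact <v_i>_n for a {0,1} RBM by enumerating all hidden states."""
    Nh = w.shape[1]
    H = ((np.arange(2 ** Nh)[:, None] >> np.arange(Nh)) & 1).astype(float)
    act = b[None, :] + H @ w.T                    # b_i + w_{i,:} h, per state
    # unnormalized log p(h): the visible layer is summed out analytically
    log_p = H @ c + np.sum(np.logaddexp(0.0, act), axis=1)
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    sigma = 1.0 / (1.0 + np.exp(-act))            # <v_i | h>
    return p @ sigma                              # sum_h p(h) <v_i | h>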
Figure 5 shows the relative error obtained after averaging ten repetitions of each AIS run, for the 100 GWGM models. All runs started from $\mathbf{b}^{(0)}_{\rm opt}$, computed from the exact $\langle v_i \rangle_n$, for the transposed and non-transposed configurations. Results have been sorted in ascending error order of the non-transposed configurations in order to obtain a better view. As can be seen, all models are accurately reproduced in the transposed case, where the number of hidden units is smaller than the number of visible ones. On the contrary, a significant fraction of the models show large deviations from the exact result when the original, non-transposed model is evaluated. This behavior is also observed when performing similar calculations on the other problems presenting large differences between the number of hidden and visible units.
6. Approaching the Optimal Mean Field
Despite the simplicity of the expressions in Equations (25) and (26), the problem of finding the optimal $\mathbf{b}^{(0)}$ can actually be as hard as finding $Z$ itself, so one has to devise alternative strategies to approximate it.
Three common strategies are usually employed to face this problem [14]. The simplest one is to disregard Equations (25) and (26), set $\mathbf{b}^{(0)} = 0$ and sample from the uniform probability distribution, as discussed above. Another common strategy is to set $\mathbf{b}^{(0)} = \mathbf{b}$ from Equation (13) and to disregard the contributions of the hidden units. Despite its simplicity, the resulting $p_0(\mathbf{v})$ is usually far away from $p(\mathbf{v})$. The third approach was devised in [13] for the specific case of RBM learning, where $\langle v_i \rangle_n$ is approximated by its average over the training set. However, this procedure cannot be employed when a training set is lacking, as when dealing with magnetic spin systems, for instance, or when the existing training set does not properly represent the underlying probability distribution of the system.
In this work, we introduce two alternative strategies to estimate $\mathbf{b}^{(0)}$ that, on the one hand, imply a low computational cost, and on the other, avoid some of the drawbacks of the aforementioned choices. They both rely on finding a suitable approximation to the $\langle v_i \rangle_n$ entering Equations (25) and (26). At this point, many different choices are possible, while keeping in mind that none of them will perfectly reproduce the exact $\langle v_i \rangle_n$, as we assume the original $p(\mathbf{v})$ is intractable. However, one must keep in mind that the resulting probability distribution is only used as the initial point for AIS, which will afterwards correct it to produce reliable samples of $p(\mathbf{v})$.
Among the many possible choices, we introduce the following ones:
Pseudoinverse (Pinv) approximation: One can look for a state of the complete (visible and hidden) space with large probability. In this case, one works directly with the energy, setting to zero the gradients of the expression in Equation (13). One then finds
$$\mathbf{v}_{\rm Pinv} = -\left(\mathbf{w}^+\right)^T \mathbf{c}, \qquad (28)$$
where $\mathbf{w}^+$ is the pseudoinverse of the $\mathbf{w}$ matrix. In this work, we build $\mathbf{v}_{\rm Pinv}$ by rounding the result of Equation (28) to the $\{0,1\}$ or the $\{-1,+1\}$ range, depending on the units used, and approximate $\langle v_i \rangle_n$ by the components of $\mathbf{v}_{\rm Pinv}$. With that, we build the corresponding mean-field bias $\mathbf{b}^{(0)}_{\rm Pinv}$.
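One plausible reading of this recipe in code (Python/NumPy, reusing the optimal_bias helper sketched in the previous section); the rounding threshold and the softening of the rounded state before applying Equations (25) and (26) are our own choices, needed because exact 0/1 averages would yield an infinite bias:

import numpy as np

def pinv_bias(w, c, pm_units=False, soft=0.99):
    """Pinv strategy (sketch): stationary visible state from Equation (28),
    rounded to the unit range and fed to optimal_bias()."""
    v = -np.linalg.pinv(w).T @ c                  # Equation (28)
    if pm_units:
        m = soft * np.where(v >= 0.0, 1.0, -1.0)  # round to {-1,+1}, soften
    else:
        m = np.where(v >= 0.5, soft, 1.0 - soft)  # round to {0,1}, soften
    return optimal_bias(m, pm_units=pm_units)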
Signs from Random Hidden (Signs_h): The expectation values $\langle v_i \rangle_n$ given in Equation (27) can only be evaluated when the number of hidden units is small, which is unfortunately not usually the case in real problems. For that reason, we resort to a heuristic approximation, where a set of hidden states $\{\mathbf{h}^{(m)}\}$ randomly selected from the uniform probability distribution is used to obtain the same number of visible states $\{\mathbf{v}^{(m)}\}$ from the conditional probabilities
$$p(v_i = 1 \,|\, \mathbf{h}) = \frac{1}{1 + e^{-(b_i + \mathbf{w}_{i,:} \mathbf{h})}}.$$
This expression assigns a probability larger than 0.5 to $v_i = 1$ depending on the sign of the argument in the exponential. Following this, we set the components of each $\mathbf{v}^{(m)}$ equal to 1 when $b_i + \mathbf{w}_{i,:} \mathbf{h}^{(m)} > 0$, and to 0 in the opposite case. As in most of the calculations performed in this work, we build a set of 1024 uniformly sampled $\mathbf{h}^{(m)}$ that are used to generate the corresponding $\mathbf{v}^{(m)}$, which are finally averaged to obtain the estimation of $\langle v_i \rangle_n$ required to compute the approximated bias $\mathbf{b}^{(0)}_{\rm Signs\_h}$. Notice that this is a cost-effective procedure that involves fewer operations than the pseudoinverse procedure outlined above. This approach is trivially extended to $\{-1,+1\}$ units.
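The same strategy as a sketch (Python/NumPy, again reusing optimal_bias; the clipping inside that helper acts as our own guard, as in the Pinv case):

import numpy as np

def signs_h_bias(w, b, n_states=1024, rng=None, pm_units=False):
    """Signs_h strategy: draw uniform random hidden states, set each v_i
    from the sign of its conditional activation, and average."""
    rng = np.random.default_rng() if rng is None else rng
    Nv, Nh = w.shape
    H = (rng.random((n_states, Nh)) < 0.5).astype(float)
    if pm_units:
        H = 2.0 * H - 1.0                         # uniform {-1,+1} states
    act = b[None, :] + H @ w.T                    # b_i + w_{i,:} h
    lo = -1.0 if pm_units else 0.0
    V = np.where(act > 0.0, 1.0, lo)              # sign rule for each v_i
    return optimal_bias(V.mean(axis=0), pm_units=pm_units)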
These two strategies have been used to produce the mean-field probability distributions of Equation (20) that are used to start AIS. We perform 10 repetitions of each experiment for each model, producing a total of 1000 final values for the GWGM weights. Figure 6 shows the statistics obtained for all the cases analyzed, corresponding to the total amount of AIS predictions producing a relative error below the prescribed tolerance with respect to the exact value of $Z$. The lighter, midtone and darker bars correspond to $\mathbf{b}^{(0)} = 0$, Pinv and Signs_h, respectively. As can be seen, both Pinv and Signs_h outperform $\mathbf{b}^{(0)} = 0$ in most cases, yielding similar results in general. It is also worth noticing that for the sets that do not have bias ($\mathbf{b} = \mathbf{c} = 0$ in Equation (13)), $\mathbf{b}^{(0)} = 0$ is the optimal choice when $\{-1,+1\}$ units are employed, since then $\langle v_i \rangle_n = 0$ by symmetry. In this case, all three strategies yield very good and similar results.
The fact that both Pinv and Signs_h lead to overall better AIS predictions than $\mathbf{b}^{(0)} = 0$ is a direct consequence of the distribution of AIS samples in each case. This is illustrated in Figure 7 for the GWGM case, where all samples generated from all repetitions of all models have been used to gather better statistics. The plot shows the percentage of samples that have a relative error, with respect to the exact $\ln Z$, equal to or lower than $\epsilon$, as a function of $\epsilon$, for the $\mathbf{b}^{(0)} = 0$, Pinv and Signs_h strategies. As can be seen, the $\mathbf{b}^{(0)} = 0$ mean field performs worse than the other two in general, although all three strategies produce similar results at small $\epsilon$. For higher values, though, differences are significant, converging once again towards the end of the curve, where all samples fulfill the condition. In any case, we find that Pinv and Signs_h perform very similarly, with minor variations that in the end lead to the small prediction differences displayed in Figure 6. One can thus conclude that, overall, the samples generated by Pinv and Signs_h are closer to the exact value of $\ln Z$ than the set produced by $\mathbf{b}^{(0)} = 0$. Despite that, one could argue that in all cases there is always a large amount of samples that fail to predict anything close to the right value. However, it is worth noticing that this is to be expected due to the stochastic nature of the AIS algorithm and the exponential way in which the generated samples are combined, as displayed in Equation (12). Fluctuations above the exact value of $\ln Z$ are exponentially amplified, and have to be compensated by a large amount of samples that underestimate it, whose contribution is exponentially diminished. We can thus conclude that the AIS algorithm has to produce a lot of apparently bad samples in order to produce an accurate result. Furthermore, this asymmetric generation of samples above and below the exact value leads, when not properly balanced, to an underestimation of $Z$, as noticed in [42]. This picture, though, can be alleviated by increasing the number of intermediate chains $N_\beta$, at the expense of linearly increasing the computational cost.
We finally close the discussion by showing in Figure 8 the value of the partition function estimated with AIS for the MNIST dataset, using an RBM model containing 500 hidden units (MNIST-500h).
For this system, due to its large size, there is no exact calculation of $Z$, and one has to rely on the predictions obtained employing state-of-the-art techniques found in the literature. For that matter, we take as reference the value obtained from the procedure outlined in Ref. [13], where the dataset used to train the RBM is also employed to approximate the mean values required for the evaluation of $\mathbf{b}^{(0)}$ in Equations (25) and (26). With this, we run AIS with a very large number of intermediate distributions to obtain the reference value (green solid line in the figure). Notice that this $N_\beta$ is unreasonably large compared to what one would normally use, the point being to obtain a maximally accurate approximation of $\ln Z$ with the same number of samples used throughout this work. The figure also shows the estimations obtained using $\mathbf{b}^{(0)} = 0$, Pinv and Signs_h (dotted line, crosses and plus symbols, respectively). The first 21 points correspond to the first 21 epochs, where the RBM weights rapidly evolve, while the last two points correspond to epochs 40 and 100. As can be seen, all curves merge at the highest epochs, while the $\mathbf{b}^{(0)} = 0$ prediction departs from the reference curve at the early and intermediate epochs. On the contrary, the proposed strategies are hardly distinguishable from the reference line along the whole curve. Despite the fact that the differences between the $\mathbf{b}^{(0)} = 0$ curve and the rest are small, one should realize that the computational cost involved in using the proposed strategies is very low, while the predictions obtained are closer to the reference value. This is something that should be taken into account if the goal is to obtain the most accurate yet economic prediction of $Z$.