2. The SK Model
The SK model was introduced in the 1970s by D. Sherrington and S. Kirkpatrick [17] and stands as an explicitly solvable mean-field spin glass. In their work, the authors discovered that the solution obtained through the replica symmetric (RS) approximation is not correct at low temperature. With a groundbreaking approach, Parisi identified a new type of solution, nowadays called replica symmetry breaking (RSB), which proved to be correct at any temperature, thereby revealing a novel mathematical and physical structure [18].
The SK model is defined by its Hamiltonian, a function of \(N\) spins \(\sigma=(\sigma_1,\dots,\sigma_N)\in\{-1,+1\}^N\):
\[H_N(\sigma)=\frac{1}{\sqrt{N}}\sum_{1\le i<j\le N}J_{ij}\,\sigma_i\sigma_j,\tag{1}\]
where \((J_{ij})_{1\le i<j\le N}\) is a collection of i.i.d. standard Gaussians. In physical terms, the couplings between pairs of spins can be ferromagnetic or antiferromagnetic with equal probability. Consider also a random variable \(h\) with \(\mathbb{E}|h|<\infty\) and a collection \((h_i)_{i\le N}\) of its i.i.d. copies representing random external fields acting on the spins. The Parisi formula is a representation for the large \(N\) limit of the pressure \(p_N\) defined by
\[p_N=\frac{1}{N}\log\sum_{\sigma\in\{-1,+1\}^N}\exp\Big(\beta H_N(\sigma)+\sum_{i=1}^{N}h_i\sigma_i\Big).\tag{2}\]
In the definition (2), \(\beta>0\) and the law of \(h\) are fixed parameters, and the dependence on the realization of the random collections \((J_{ij})\) and \((h_i)\) is kept implicit. One can prove [5] that \(p_N\) converges, for almost all realizations of the disorder, to its average \(\mathbb{E}p_N\). Notice that \(\mathbb{E}\), taken after the logarithm, averages both the collections \((J_{ij})\) and \((h_i)\), which are called quenched variables. The Hamiltonian (1) can also be regarded as a centered Gaussian process with covariance
\[\mathbb{E}\,H_N(\sigma)H_N(\tau)=\frac{N}{2}\,q^2(\sigma,\tau)-\frac{1}{2},\qquad q(\sigma,\tau):=\frac{1}{N}\sum_{i=1}^{N}\sigma_i\tau_i,\]
where \(q(\sigma,\tau)\) is the overlap between the two spin configurations \(\sigma\) and \(\tau\).
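The covariance structure of the Gaussian process can be checked numerically. The sketch below assumes the standard normalization \(H_N(\sigma)=\frac{1}{\sqrt N}\sum_{i<j}J_{ij}\sigma_i\sigma_j\), for which \(\mathbb{E}\,H_N(\sigma)H_N(\tau)=(Nq^2-1)/2\); all sizes and seeds are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20
sigma = rng.choice([-1, 1], size=N)
tau = rng.choice([-1, 1], size=N)
q = sigma @ tau / N                          # overlap q(sigma, tau)

iu = np.triu_indices(N, k=1)                 # index pairs i < j
pair_s = sigma[iu[0]] * sigma[iu[1]]
pair_t = tau[iu[0]] * tau[iu[1]]

# Exact covariance of the Gaussian process: (1/N) sum_{i<j} s_i s_j t_i t_j
exact = pair_s @ pair_t / N
assert np.isclose(exact, (N * q**2 - 1) / 2)

# Monte Carlo over the disorder J: H(sigma) = (1/sqrt(N)) sum_{i<j} J_ij s_i s_j
J = rng.standard_normal((20_000, len(pair_s)))
mc = np.mean((J @ pair_s) * (J @ pair_t)) / N
print(mc, exact)                             # agree up to Monte Carlo error
```

The first assertion verifies the algebraic identity relating the covariance to the overlap; the Monte Carlo estimate then confirms it at the level of the disorder average.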
The Parisi variational principle for the limiting pressure per particle of this model was proved after almost three decades of efforts, and it is mainly due to the works of Guerra [
8] and Talagrand [
19]. We hereby summarize these milestones in a single theorem.
Theorem 1 (Parisi Formula [8,19]). Let \(\mathcal{M}\) be the space of probability measures on \([0,1]\), and for \(x\in\mathcal{M}\) denote by \(x(t):=x([0,t])\) its distribution function. Consider the Parisi functional, which is defined as
\[\mathcal{P}(x)=\log 2+\mathbb{E}\,f(0,h)-\frac{\beta^2}{2}\int_0^1 t\,x(t)\,dt,\tag{5}\]
where \(f(t,y)\) solves the PDE
\[\partial_t f+\frac{\beta^2}{2}\Big(\partial_y^2 f+x(t)\,(\partial_y f)^2\Big)=0,\qquad f(1,y)=\log\cosh y.\]
The following holds:
\[\lim_{N\to\infty}\mathbb{E}\,p_N=\inf_{x\in\mathcal{M}}\mathcal{P}(x).\tag{7}\]

The key tool for the proof is the (Gaussian) interpolation method, introduced in [9] in order to prove the existence of the large \(N\) limit of \(\mathbb{E}p_N\).
The thermodynamic equilibrium induced by the pressure (2) is called quenched equilibrium and is defined as follows. Physical quantities (e.g., energy) are functions of the disorder variables \((J,h)\) and the spin configurations \(\sigma\). Given a function \(f=f(\sigma,J,h)\), its equilibrium value is defined as
\[\langle f\rangle:=\mathbb{E}\sum_{\sigma\in\{-1,+1\}^N}f(\sigma,J,h)\,G_N(\sigma),\]
where \(G_N\) is the (random) Boltzmann–Gibbs distribution
\[G_N(\sigma)=\frac{\exp\big(\beta H_N(\sigma)+\sum_{i=1}^{N}h_i\sigma_i\big)}{\sum_{\tau}\exp\big(\beta H_N(\tau)+\sum_{i=1}^{N}h_i\tau_i\big)}.\]
The measure \(\langle\cdot\rangle\) is called a quenched measure and can be viewed as a two-step measuring process. Initially, for a given realization of the disorder variables \((J,h)\), one assumes that the system equilibrates according to the canonical Boltzmann–Gibbs distribution \(G_N\), defining a (random) measure on the space of spin configurations. The expectation with respect to \(G_N\) is denoted by \(\omega\), namely
\[\omega(f)=\sum_{\sigma}f(\sigma,J,h)\,G_N(\sigma).\]
In probabilistic terms, \(\omega\) defines a conditional measure given \(J\) and \(h\). The remaining degrees of freedom \(J\), \(h\) are then averaged according to their a priori distribution: \(\langle f\rangle=\mathbb{E}\,\omega(f)\).
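The two-step structure of the quenched average can be made explicit in a toy computation: for a small system one enumerates all \(2^N\) configurations to get the Gibbs average \(\omega(f)\) at fixed disorder, then averages over disorder samples. This is an illustrative sketch (the system size, field, and observable are arbitrary choices), not a computation from the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, beta = 8, 1.0
configs = np.array(list(itertools.product([-1, 1], repeat=N)))
iu = np.triu_indices(N, k=1)
pair = configs[:, iu[0]] * configs[:, iu[1]]     # sigma_i sigma_j for all configs

def omega(f_vals, J, h):
    """Step 1: Boltzmann-Gibbs average omega(f) at fixed disorder (J, h)."""
    energy = beta * (pair @ J) / np.sqrt(N) + configs @ h
    w = np.exp(energy - energy.max())
    return (f_vals @ w) / w.sum()

# Step 2: average the random Gibbs expectation over the disorder: <f> = E omega(f)
f_vals = configs.mean(axis=1)                    # observable: magnetization
h = 0.3 * np.ones(N)                             # constant external field
quenched = np.mean([omega(f_vals, rng.standard_normal(len(iu[0])), h)
                    for _ in range(200)])
print(quenched)                                  # positive: the field biases the spins
```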
An important role is played by the concept of replicas. Replicas are i.i.d. samples from \(G_N\) at fixed disorder. Hence, the equilibrium value of a function \(f\) of \(n\) replicas and the quenched variables is defined by
\[\langle f\rangle:=\mathbb{E}\,\omega^{\otimes n}\big(f(\sigma^{(1)},\dots,\sigma^{(n)},J,h)\big).\tag{11}\]
The computation of derivatives of \(p_N\) shows, using integration by parts, that the SK model is fully characterized by the (joint) distribution of the overlap array \((q_{lm})_{l,m\le n}\), namely the overlaps between any finite number \(n\) of replicas with respect to the measure (11). The main feature of the Parisi theory is the characterization of the mentioned joint measure by means of two structural properties:
- (i) It is uniquely determined by a one-dimensional marginal, namely the distribution of \(q_{12}\);
- (ii) The distribution of three replicas has, with probability one, an ultrametric support:
\[q_{13}\ge\min(q_{12},q_{23})\quad\text{almost surely.}\]
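Ultrametricity of three overlaps is equivalent to saying that the two smallest of the three values coincide (every "triangle" is isosceles with the two shortest sides equal). A tiny helper makes the condition concrete; the tolerance parameter is an arbitrary choice.

```python
def is_ultrametric(q12, q13, q23, tol=1e-9):
    """Ultrametric support: each overlap is >= the minimum of the other two,
    i.e., the two smallest of the three overlaps coincide."""
    a, b, c = sorted((q12, q13, q23))
    return abs(a - b) <= tol

assert is_ultrametric(0.3, 0.8, 0.3)      # isosceles triple: allowed
assert not is_ultrametric(0.2, 0.5, 0.8)  # all distinct: forbidden
```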
Despite having a mathematical proof of the Parisi Formula (7) for the SK model, (i) and (ii) have been rigorously proved only in the mixed p-spin model [6,20,21], an extension of the SK model whose Hamiltonian also contains higher-order interactions (three-body, four-body, etc.).
One of the crucial instruments to achieve a rigorous control of the model is the so-called
Ruelle Probability Cascades (RPCs), defined by Ruelle [
22] when formalizing the properties of the Generalized Random Energy model of Derrida [
23]. See also the characterization of RPC in terms of coalescent processes given in [
24]. The first direct link between RPC and the SK model appeared in the work of Aizenman–Sims–Starr [
25], where the authors found a representation of the thermodynamic limit of the quenched pressure per particle in terms of the
cavity fields distribution. This representation strongly suggested that if the thermodynamic limit of the overlap distribution is described by an RPC, then the Parisi formula is correct.
The first signal that the overlap array is described by an RPC was originally found by Aizenman and Contucci in [10], with the identification of stochastic stability, and by Ghirlanda and Guerra [
26]. Both papers show an (infinite) set of identities for the moments of the overlap array distribution. It turns out that these identities actually imply that the support of the joint distribution of the overlap is ultrametric, as proved by Panchenko [
27]. It should be noticed that Panchenko’s theorem requires identities for the overlap moments of all orders. The latter do not hold for the bare SK model, but it can be shown that there exists a perturbation of the Hamiltonian that forces the SK model to satisfy them without affecting the limit of the quenched pressure [
28].
Once the validity of the Parisi Formula (
7) is established, it is natural to ask for the properties of its solution. The uniqueness of the minimizer of (
7) has been assessed by Auffinger and Chen [
29], and its properties have been investigated for example in [
30,
31].
A relevant question about the minimizer is the following: for which values of the parameters \((\beta,h)\) is the solution of (7) a Dirac delta \(\delta_{\bar q}\) for some \(\bar q\in[0,1]\)? In this case, we say that the model is replica symmetric, and the Parisi Formula (7) reads
\[\lim_{N\to\infty}\mathbb{E}\,p_N=\inf_{\bar q\in[0,1]}\Big[\log 2+\mathbb{E}\log\cosh\big(\beta z\sqrt{\bar q}+h\big)+\frac{\beta^2}{4}(1-\bar q)^2\Big],\qquad z\sim\mathcal{N}(0,1).\tag{13}\]
The replica symmetric region can be identified [6,32] with the region of parameters \((\beta,h)\) where the overlap is a self-averaging quantity, namely
\[\lim_{N\to\infty}\mathbb{E}\big\langle(q_{12}-\bar q)^2\big\rangle=0,\]
where \(\bar q\) is exactly the value that realizes the infimum in (13). The physics conjecture is that the replica symmetric region can be identified by the so-called Almeida–Thouless line [33]:
\[\beta^2\,\mathbb{E}\cosh^{-4}\big(\beta z\sqrt{\bar q}+h\big)\le 1.\]
The above conjecture is proved only in the case of a Gaussian external field \(h\) [34]. An alternative characterization of the replica symmetric region has been obtained in [6,35]. If the minimizer corresponds to a non-trivial distribution (i.e., with non-zero variance), we say that replica symmetry breaking occurs, and the overlap is not a self-averaging quantity.
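In the replica symmetric region, the consistency equation \(\bar q=\mathbb{E}\tanh^2(\beta z\sqrt{\bar q}+h)\) and the Almeida–Thouless stability condition \(\beta^2\,\mathbb{E}\cosh^{-4}(\beta z\sqrt{\bar q}+h)\le 1\) can be evaluated numerically. The sketch below uses Gauss–Hermite quadrature for the Gaussian average; the parameter values are picked only for illustration.

```python
import numpy as np

# Probabilists' Gauss-Hermite quadrature: E_z g(z) for z ~ N(0,1)
nodes, weights = np.polynomial.hermite_e.hermegauss(80)
weights = weights / weights.sum()

def gauss_avg(g):
    return weights @ g(nodes)

def rs_overlap(beta, h, iters=500):
    """Fixed-point iteration for q = E tanh^2(beta*sqrt(q)*z + h)."""
    q = 0.5
    for _ in range(iters):
        q = gauss_avg(lambda z: np.tanh(beta * np.sqrt(q) * z + h) ** 2)
    return q

beta, h = 0.8, 0.4                       # illustrative point with an external field
q = rs_overlap(beta, h)
# Almeida-Thouless stability: beta^2 * E cosh^{-4}(beta*sqrt(q)*z + h) <= 1
at = beta**2 * gauss_avg(lambda z: np.cosh(beta * np.sqrt(q) * z + h) ** -4.0)
print(q, at)                             # at < 1: replica symmetric point
```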
The Parisi formula has been extended to other mean field models with centered Gaussian interactions: vector spins [
36], multispecies models [
11,
37,
38], multiscale models [
39,
40]. Finally, we mention that the SK model fulfills a remarkable universality property: as long as the \(J_{ij}\)'s are independent, centered, and with unit variance, the thermodynamic limit is still described by the Parisi solution [41].
In this work, we show that a class of non-centered Gaussian spin glasses admits a high-dimensional inference interpretation that extends the celebrated correspondence between the spiked Wigner model and the SK model on the Nishimori line, where replica symmetry is always fulfilled [3]. We show that the addition of an SK Hamiltonian to a Hopfield model with a finite number of patterns can be mapped into a high-dimensional mismatched inference problem, where the statistician ignores the correct a priori distribution of the signal components they have to reconstruct. We shall see that even this slight mismatch may lead to the emergence of complexity, namely to the breakdown of replica symmetry, which is instead guaranteed under very mild hypotheses for optimal statisticians.
3. High-Dimensional Inference and Statistical Physics
High-dimensional inference aims at recovering a ground truth signal, in the following denoted by \(X^*\), which is usually a vector with a very large number of components, from some noisy observations of it, denoted by \(Y\). The main feature of this setting is that the dimension of the signal, i.e., the number of real parameters to reconstruct, and the number of observations at disposal are functions of one another, typically polynomial. For instance, for our purposes, \(X^*\) will be an \(N\)-dimensional vector and \(Y\) an \(N\times N\) matrix, for a total of \(O(N^2)\) noisy observations. Hence, if the number of observations becomes large, so does the number of parameters to retrieve. Contrary to what happens in typical low-dimensional settings, where maximum likelihood or Maximum A Posteriori (MAP) approaches yield provably satisfactory reconstruction performance, in a high-dimensional setting this is not always the case. In particular, one needs to devise more refined estimators that exploit the marginal posterior probabilities of each signal component.
Both approaches described above are Bayesian, and the knowledge of a prior distribution on the signal components can play a key role, especially for high-dimensional problems. Furthermore, to compose the posterior measure for the entire signal, one needs the likelihood of the data, which is the probability of an outcome of the observation \(Y\) given a certain ground truth realization \(X^*=x\). As we shall discuss soon, under certain hypotheses, the Bayesian approach highlights the correspondence of relevant information-theoretic quantities with thermodynamic ones. Among others, a key quantity is the mutual information \(I(X^*;Y)\) between the signal and the observations, which quantifies the residual amount of information left in \(Y\) about \(X^*\) after the noise corruption. As intuition may suggest, the mutual information gives access to the best reconstruction error that is information-theoretically achievable.
Finally, we stress that the high dimensionality of the problem can induce phase transitions in some parameters of the model, such as the so-called signal-to-noise ratio (SNR), which tunes the strength of the signal with respect to that of the noise in the observations.
3.1. Bayes-Optimality and Nishimori Identities
For the sake of simplicity, we start by considering a signal \(X^*=(X^*_1,\dots,X^*_N)\) of i.i.d. (independent and identically distributed) components \(X^*_i\sim P_X\), where \(P_X\) has a finite fourth moment. The observations at the disposal of a statistician can be modeled as a stochastic function of the ground-truth signal: \(Y=\varphi(X^*,Z)\), where \(Z\) is the source of randomness, or simply the noise. Knowing the function \(\varphi\), from a Bayesian perspective, translates directly into having the likelihood of the model, namely the conditional distribution \(P_{Y|X^*}\), which we assume to have a density with respect to the Lebesgue measure. Observe that the likelihood is strongly affected by the nature of the noise.
According to Bayes’ rule, the posterior distribution of the signal given the data is:
\[dP_{X|Y}(x\,|\,Y)=\frac{dP_X(x)\,p_{Y|X}(Y\,|\,x)}{p_Y(Y)},\tag{16}\]
where \(dP_X(x)=\prod_{i=1}^N dP_X(x_i)\), and \(p_Y(Y)=\int dP_X(x)\,p_{Y|X}(Y\,|\,x)\) is the probability of a given realization of the data, which is sometimes also called evidence. In practice, the above posterior, which would be ideal to perform inference, is rarely available: the statistician is not aware either of the likelihood, or of the correct prior distribution for the signal, or even both. This motivates the following definition of a special inference setting:
Definition 1 (Bayes optimality). The statistician is said to be Bayes optimal, or in the Bayes-optimal setting, if they are aware both of \(P_X\) and \(p_{Y|X}\); namely, they have access to the posterior (16).

The above is saying that an optimal statistician knows everything about the model except for the ground truth \(X^*\) itself. The Bayes-optimal setting is thus often used as a theoretical framework to establish the information-theoretic limits. Indeed, it is known that the mean square error between the ground truth and any estimator \(\hat x(Y)\) is minimized by an optimal statistician, who can use the posterior mean as an estimator, yielding the minimum mean square error (MMSE):
\[\mathrm{MMSE}:=\mathbb{E}\,\big\|X^*-\mathbb{E}[x\,|\,Y]\big\|^2\le\mathbb{E}\,\big\|X^*-\hat x(Y)\big\|^2.\]
In the following, we shall denote averages with respect to the posterior as \(\langle\cdot\rangle\).
Another important consequence of this setting is the so-called Nishimori identities, which can be stated as follows. Given any continuous bounded function \(f\) of the data \(Y\), the ground truth \(X^*\) and \(n-1\) i.i.d. samples \(x^{(1)},\dots,x^{(n-1)}\) from the posterior, one has
\[\mathbb{E}\big\langle f\big(Y,X^*,x^{(1)},\dots,x^{(n-1)}\big)\big\rangle=\mathbb{E}\big\langle f\big(Y,x^{(n)},x^{(1)},\dots,x^{(n-1)}\big)\big\rangle,\]
where \(x^{(n)}\) is a further independent sample from the posterior. An elementary proof can be found in [42]. These identities enforce a symmetry between replicas drawn from the posterior and the ground truth. For instance, a direct application of the Nishimori identities yields
\[\mathrm{MMSE}=\mathbb{E}\|X^*\|^2-\mathbb{E}\big\|\langle x\rangle\big\|^2.\]
It is important to stress that, as can be seen from the above equation, an optimal statistician is actually able to compute the minimum mean square error using only their posterior.
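The replica symmetry enforced by the Nishimori identities, \(\mathbb{E}\langle x\cdot X^*\rangle=\mathbb{E}\langle x^{(1)}\cdot x^{(2)}\rangle\), is easy to verify in the simplest possible case: a scalar Gaussian channel \(Y=\sqrt{\lambda}X^*+Z\) with a Rademacher signal, where the posterior mean is \(\tanh(\sqrt{\lambda}\,Y)\). This is a self-contained numerical sketch, not tied to the models of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n = 1.5, 500_000
x_star = rng.choice([-1, 1], size=n)
y = np.sqrt(lam) * x_star + rng.standard_normal(n)   # scalar Gaussian channel

m = np.tanh(np.sqrt(lam) * y)    # posterior mean <x> for a Rademacher signal
lhs = np.mean(m * x_star)        # E<x . X*>: one replica against the ground truth
rhs = np.mean(m ** 2)            # E<x^1 . x^2>: two independent replicas
print(lhs, rhs)                  # the Nishimori identity makes them coincide
```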
At this point, the reader will have noticed a similarity with the Statistical Mechanics formalism. In fact, it is possible to interpret \(p_Y(Y)\) as the partition function of a model with Hamiltonian \(-\log p_{Y|X}(Y\,|\,x)\) and unit inverse absolute temperature. The pressure per particle of such a model would thus be
\[\frac{1}{N}\,\mathbb{E}\log p_Y(Y)=-\frac{H(Y)}{N},\]
namely minus the Shannon entropy of the data per signal component, which is related to the mutual information by
\[I(X^*;Y)=H(Y)-H(Y\,|\,X^*).\]
The contribution coming from the conditional entropy \(H(Y\,|\,X^*)\) can be regarded as due only to the noise, since for fixed \(X^*\), the only randomness in \(Y\) is due to \(Z\).
We stress here that Bayes optimality and the Nishimori identities, under rather mild hypotheses [43], are enough to grant replica symmetry in the model, i.e., concentration of the order parameters of the model. For the models we are interested in, the latter can be shown to imply finite-dimensional variational principles for the limiting mutual information.
3.2. The Spiked Wigner Model
The spiked Wigner model was first introduced in [44] as a model for Principal Component Analysis (PCA), and it has since been widely studied. Without any pretension of being exhaustive, we refer the interested reader to [
42,
45,
46,
47,
48,
49,
50,
51]. For our purposes, we restrict ourselves to the case where the signal is an \(N\)-dimensional vector of \(\pm1\)'s, whose components are drawn i.i.d. from a Rademacher distribution, \(X^*_i\sim\frac12(\delta_{-1}+\delta_{+1})\). The function \(\varphi\) is a Gaussian channel, namely
\[Y_{ij}=\sqrt{\frac{\lambda}{N}}\,X^*_iX^*_j+Z_{ij},\qquad 1\le i<j\le N,\]
where \(Z_{ij}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,1)\), and \(\lambda>0\) is a positive parameter called the signal-to-noise ratio. The statistician is tasked with the recovery of \(X^*\) given the observations \(Y\). The Bayes-optimal posterior measure for this inference problem can be written directly as a Boltzmann–Gibbs random measure thanks to the Gaussian nature of the likelihood:
\[dP(x\,|\,Y)=\frac{1}{Z_N(Y)}\prod_{i=1}^{N}dP_X(x_i)\,\exp\bigg(\sum_{i<j}\Big(\sqrt{\frac{\lambda}{N}}\,Y_{ij}x_ix_j-\frac{\lambda}{2N}\Big)\bigg),\]
where we have already exploited the fact that \(x_i^2=1\). We are denoting the posterior samples with \(x\). Since the quantity we are interested in is the quenched pressure of this model,
\[p_N=\frac{1}{N}\,\mathbb{E}\log Z_N(Y),\]
which is connected to the mutual information by a simple shift by an additive constant, we are allowed to perform the gauge transformation \(x_i\to X^*_ix_i\), \(Z_{ij}\to X^*_iX^*_jZ_{ij}\) without altering its value. This results in a Hamiltonian that is independent of the original ground-truth signal,
\[H_N(x)=\sum_{i<j}\Big(\frac{\lambda}{N}+\sqrt{\frac{\lambda}{N}}\,Z_{ij}\Big)x_ix_j,\tag{27}\]
whose couplings between spins are Gaussian random variables with mean equal to their variance. This condition identifies a peculiar region of the phase space of a spin-glass model, called the Nishimori line. In fact, the Nishimori identities were first discovered and studied in the context of gauge spin glasses. Despite looking simpler, the above model retains most of the features we need for our study.
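The gauge invariance of the partition function can be verified directly on a small system: since \(x\mapsto(X^*_ix_i)_i\) is a bijection of \(\{-1,1\}^N\), the log-partition functions computed from the original observations and from the gauged couplings coincide exactly. The sketch below (sizes and seed are arbitrary) also exhibits the gauged couplings, which have mean \(\lambda/N\) equal to their variance, i.e., they sit on the Nishimori line.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
N, lam = 10, 2.0
x_star = rng.choice([-1, 1], size=N)
iu = np.triu_indices(N, k=1)
z = rng.standard_normal(len(iu[0]))
y = np.sqrt(lam / N) * x_star[iu[0]] * x_star[iu[1]] + z   # observations y_ij

configs = np.array(list(itertools.product([-1, 1], repeat=N)))
pair = configs[:, iu[0]] * configs[:, iu[1]]

def log_Z(couplings):
    H = pair @ couplings
    return np.log(np.exp(H - H.max()).sum()) + H.max()

c = np.sqrt(lam / N) * y                          # original posterior couplings
c_gauged = c * x_star[iu[0]] * x_star[iu[1]]      # gauge: y_ij -> x*_i x*_j y_ij
# c_gauged = lam/N + sqrt(lam/N) * (gauged noise): mean = variance = lam/N
print(log_Z(c), log_Z(c_gauged))                  # identical: the gauge is a bijection
```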
For inference models with additive Gaussian noise, like the one above, it is possible to prove the so-called I-MMSE relation:
\[\frac{1}{N}\,\frac{dI(X^*;Y)}{d\lambda}=\frac{1}{4N^2}\,\mathbb{E}\,\big\|X^*(X^*)^\intercal-\langle xx^\intercal\rangle\big\|_F^2,\]
where \(\|\cdot\|_F\) is the Frobenius norm and \(\langle\cdot\rangle\) denotes the expectation with respect to the Boltzmann–Gibbs measure induced by (27). Hence, once the mutual information is known, the MMSE can be accessed through a derivative with respect to the signal-to-noise ratio. A clarification is in order here: the above is the MMSE for the reconstruction of the rank-one matrix \(X^*(X^*)^\intercal\) because, due to flip symmetry, here we do not have any actual information on the single vector \(X^*\), but only on the spike \(X^*(X^*)^\intercal\).
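The I-MMSE relation is easiest to check in the scalar version of the channel, \(Y=\sqrt{\lambda}X^*+Z\) with Rademacher \(X^*\), where \(I(\lambda)=\lambda-\mathbb{E}\log\cosh(\lambda+\sqrt{\lambda}Z)\) and \(\mathrm{MMSE}(\lambda)=1-\mathbb{E}\tanh(\lambda+\sqrt{\lambda}Z)\) are classical closed forms, and the relation reads \(dI/d\lambda=\mathrm{MMSE}/2\). A quadrature sketch:

```python
import numpy as np

# Probabilists' Gauss-Hermite quadrature: E_z g(z) for z ~ N(0,1)
nodes, weights = np.polynomial.hermite_e.hermegauss(120)
weights = weights / weights.sum()

def mutual_info(lam):
    """I(lam) = lam - E log cosh(lam + sqrt(lam) z) for y = sqrt(lam) x + z."""
    return lam - weights @ np.log(np.cosh(lam + np.sqrt(lam) * nodes))

def mmse(lam):
    """MMSE(lam) = 1 - E tanh(lam + sqrt(lam) z)."""
    return 1.0 - weights @ np.tanh(lam + np.sqrt(lam) * nodes)

lam, eps = 1.0, 1e-5
dI = (mutual_info(lam + eps) - mutual_info(lam - eps)) / (2 * eps)
print(dI, mmse(lam) / 2)   # the two values coincide: dI/dlam = MMSE/2
```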
3.3. Sub-Optimality and Replica Symmetry Breaking
There are several ways to break Bayes optimality. Some examples: the statistician does not know the signal-to-noise ratio \(\lambda\) [13,52]; the statistician adopts a likelihood different from that of the true model [14]; the statistician adopts a wrong prior [12,53]; combinations of the previous; and many others. We will focus on the mismatching priors case, where the statistician not only adopts a wrong prior on the ground-truth elements, but is also unaware of the rank of the spiked matrix hidden inside the noise, which is denoted by \(M\). The rest is assumed to be known. The channel of the inference problem is
\[Y_{ij}=\sqrt{\frac{\lambda}{N}}\sum_{\mu=1}^{M}\xi_i^\mu\xi_j^\mu+Z_{ij},\qquad 1\le i<j\le N.\]
If the statistician assumes a Rademacher prior for the signal components and a rank-one hidden matrix, they will write a posterior in the form
where
The slash on quantities emphasizes that they are not the Bayes-optimal ones. In this setting, one can no longer rely on the Nishimori identities, and in principle, replica symmetry is no longer guaranteed. On the contrary, as we shall argue later on, a mismatch in the prior alone is already sufficient to cause replica symmetry breaking.
4. The Model
Let \(M\) be a fixed integer and \(N\) the number of spins. Consider two independent random collections \(J=(J_{ij})_{1\le i<j\le N}\), of i.i.d. standard Gaussians, and \(\xi=(\xi_i^\mu)_{i\le N,\,\mu\le M}\), of i.i.d. copies of a random variable \(\xi\) with finite fourth moment. The above random collections play the role of quenched disorder in the model. Consider
N Ising spins
and the Hamiltonian function
with
. Here,
is the interacting part while
denotes the random external field acting on the spins. The Hamiltonian (
32) is determined by the choice of
and
. For
, the interaction term
coincides with the Hamiltonian (
31). Note that for some special choices of the parameters, we recover some well-known spin glass models:
gives the SK model (
1) at
and random external field
.
gives the Hopfield model [
6,
7,
18] with a finite number of patterns
.
and
gives the SK model on the Nishimori line (
27). As we have seen in
Section 3, the latter can also be viewed as a spiked Wigner model in the Bayes-optimal setting.
Notice that the entire model can be interpreted as a Hopfield model where the traditional Hebbian matrix \(\sum_{\mu=1}^{M}\xi_i^\mu\xi_j^\mu\) is corrupted by Gaussian noise. Furthermore, if the Hebbian coupling is replaced by a constant matrix, the model reduces to an SK model with the addition of a ferromagnetic interaction, which was studied in [54].
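The "corrupted Hopfield" picture can be made concrete: build couplings as a Hebbian matrix with \(M\) patterns plus Gaussian noise, and measure the overlaps of a configuration with the patterns. The normalizations below (\(\lambda/N\) for the Hebbian part, \(\sqrt{\lambda/N}\) for the noise) follow the Nishimori-line scaling discussed above and are meant only as an illustration of the structure, not as the paper's exact definition.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, lam = 2000, 3, 1.0
xi = rng.choice([-1, 1], size=(M, N))                 # M binary patterns

iu = np.triu_indices(N, k=1)
hebbian = (xi[:, iu[0]] * xi[:, iu[1]]).sum(axis=0)   # sum_mu xi_i^mu xi_j^mu
noise = rng.standard_normal(len(iu[0]))
couplings = (lam / N) * hebbian + np.sqrt(lam / N) * noise

def mattis(sigma):
    """Normalized overlaps of a configuration with each pattern."""
    return xi @ sigma / N

sigma = xi[0].copy()                                  # aligned with the first pattern
m = mattis(sigma)
energy_per_spin = couplings @ (sigma[iu[0]] * sigma[iu[1]]) / N
print(m, energy_per_spin)   # m[0] = 1, other overlaps O(1/sqrt(N)); energy near lam/2
```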
Our main result is the computation of the thermodynamic limit of the pressure per particle \(p_N\), whose variance can be shown to converge to 0 as \(N\to\infty\), namely:

Lemma 1. Assume \(\mathbb{E}\xi^4<\infty\). Then, for any \(N\),
\[\mathrm{Var}(p_N)\le\frac{K}{N},\]
where \(K\) is a suitable positive constant.

We thus focus on \(\mathbb{E}p_N\). The proof of this lemma makes use of the Efron–Stein concentration inequality to bound the variance; it is simple but tedious and follows closely that of ([12], Lemma 9). We are now in a position to state our main theorem:
Theorem 2 (Variational solution).
If then where and is the Parisi functional (5) with a random external field, and denotes the expectation with respect to . The consistency equations are

Moreover, there exists such that for any , one has and the supremum in (36) can be restricted to .

The proof of the theorem is based on the concentration of the Mattis magnetization, which is the normalized scalar product between a spin configuration (or a sample from the wrong posterior measure) and one of the patterns:
\[m^\mu_N(\sigma)=\frac{1}{N}\sum_{i=1}^{N}\xi_i^\mu\sigma_i.\tag{40}\]
The Hamiltonian can thus be rewritten using (
40) in the following form:
The Mattis magnetization, in fact, plays the role of an order parameter for this model. The concentration we can prove is only an integral average over some suitably small magnetic fields, which is still sufficient for our purposes:
Proposition 1 (Concentration of Mattis Magnetizations).
Consider a k such that . Let with , for all . For any , we denote by the Boltzmann–Gibbs measure induced by the Hamiltonian . Then, for all and ,

We shall omit the proof of the above result, as it is completely analogous to the one in [
12]. We will need an intermediate lemma that leads to it (see Lemma 2 later) together with a second key ingredient: the
adaptive interpolation technique [
48] combined with Guerra’s replica symmetry-breaking upper bound for the quenched pressure of the SK model [
8].
Proof of Theorem 2.
Here, we outline the main steps of the proof of the variational principle for the thermodynamic limit. The proof is achieved via two bounds that match in the
limit. Let us start by defining the interpolating Hamiltonian
where
and
with
and where the interpolating functions
, which must be continuously differentiable in \(t\) and non-negative, will be suitably chosen. With this interpolation, one is able to prove the following sum rule:
Proposition 2. The following sum rule holds: where

The proof consists of the computation of the derivative of the interpolating pressure related to the model (
43). It follows closely that of ([
12], Proposition 7), to which we refer the interested reader. Since the remainder
is non-negative, the above proposition already yields a bound for the quenched pressure of our model when we choose
constant:
where we used Lipschitz continuity of the SK pressure in the magnetic fields.
The upper bound requires more attention. First, we notice that
is convex in the magnetic fields and that
. Hence, we can use Jensen’s inequality and Lipschitz continuity of
to obtain:
Now, we use Guerra’s bound for the SK pressure, which, importantly, is uniform in
N, and we average over
on both sides
What remains to do is to prove that
for a proper choice of the interpolating functions
. The choice is made through a system of coupled ODEs
One can easily check that the above system is regular enough to admit a unique solution on the interval
. In this case, the remainder to push to 0 would appear as
The goal is now to apply a concentration lemma here:
Lemma 2. Let and denote by the Boltzmann–Gibbs expectation associated to the Hamiltonian where and is the k-th canonical basis vector of . Then, with K a positive constant. Notice that the integral in (
51) is over
and not over the effective magnetic field of the model, which is instead
. Nevertheless, we can integrate over the magnetic fields
with a change of variables. This involves a Jacobian that is larger than 1. In fact, thanks to Liouville’s theorem ([
55], Corollary 3.1, Chapter V), one can prove that
when
.
This allows us to bound the thermal fluctuations in (
51) using (
52) and then Liouville’s theorem:
Since
has a bounded second moment, using the Cauchy–Schwarz inequality, one can show that
is uniformly bounded by a constant
C. Hence,
for any
by construction (recall (
44) and (
50)). Therefore,
.
The fluctuations induced by the disorder can be bounded in a very similar fashion using (
53):
Hence, overall (
51), that equals
is a
.
can be chosen as a function of
N in order to optimize the convergence rate:
. Using Fubini’s theorem in (
49) to exchange the
t and
averages and then Dominated Convergence, one concludes the proof. □
From the variational problem (36), we can also deduce the differentiability properties of the limiting pressure, obtaining the average values of the relevant thermodynamic quantities of the model:
Corollary 1. Let , and . Then
More generally, let y be one of the variables ; then the function is convex. By Danskin’s theorem (see [56]), is differentiable if and only if the set is a singleton.