*Article* **Remote Sampling with Applications to General Entanglement Simulation**

#### **Gilles Brassard, Luc Devroye and Claude Gravel**


Received: 13 June 2018; Accepted: 15 January 2019; Published: 19 January 2019

**Abstract:** We show how to sample exactly discrete probability distributions whose defining parameters are distributed among remote parties. For this purpose, von Neumann's rejection algorithm is turned into a distributed sampling communication protocol. We study the expected number of bits communicated among the parties and also exhibit a trade-off between the number of rounds of the rejection algorithm and the number of bits transmitted in the initial phase. Finally, we apply remote sampling to the simulation of quantum entanglement in its essentially most general form possible, when an arbitrary finite number *m* of parties share systems of arbitrary finite dimensions on which they apply arbitrary measurements (not restricted to being projective measurements, but restricted to finitely many possible outcomes). In case the dimension of the systems and the number of possible outcomes per party are bounded by a constant, it suffices to communicate an expected $O(m^2)$ bits in order to simulate *exactly* the outcomes that these measurements would have produced on those systems.

**Keywords:** communication complexity; quantum theory; classical simulation of entanglement; exact sampling; random bit model; entropy

#### **1. Introduction**

Let $\mathcal{X}$ be a nonempty finite set containing *n* elements and $p = (p_x)_{x\in\mathcal{X}}$ be a probability vector parameterized by some vector $\theta = (\theta_1, \ldots, \theta_m) \in \Theta^m$ for an integer $m \ge 2$. For instance, the set $\Theta$ can be the real interval [0, 1] or the set of Hermitian semi-definite positive matrices, as is the case for the simulation of entanglement. The probability vector *p* defines a random variable *X* such that $\mathbf{P}\{X = x\} \stackrel{\text{def}}{=} p_x$ for $x \in \mathcal{X}$. To sample exactly the probability vector *p* means to produce an output *x* such that $\mathbf{P}\{X = x\} = p_x$. The problem of sampling probability distributions has been and is still studied extensively within different random and computational models. Here, we are interested in sampling *exactly* a discrete distribution whose defining parameters are distributed among *m* different parties. The $\theta_i$'s for $i \in \{1, \ldots, m\}$ are stored in *m* different locations, where the *i*th party holds $\theta_i$. In general, any communication topology between the parties would be allowed but, in this work, we concentrate for simplicity on a model in which we add a designated party known as the *leader*, whereas the *m* other parties are known as the *custodians* because each of them is the sole keeper of the corresponding parameter $\theta_i$; hence there are $m + 1$ parties in total. The leader communicates in both directions with the custodians, who do not communicate among themselves. Allowing inter-custodian communication would not improve the communication efficiency of our scheme and could, at best, halve the number of bits communicated in any protocol. However, it could dramatically improve the sampling *time* in a realistic model in which each party is limited to sending and receiving a fixed number of bits at any given time step, as demonstrated in our previous work [1] concerning a special case of the problem considered here. The communication scheme is illustrated in Figure 1.

**Figure 1.** The communication scheme.

It may seem paradoxical that the leader can sample *exactly* the probability vector *p* with a *finite* expected number of bits sent by the custodians, who may hold *continuous* parameters that define *p*. However, this counterintuitive possibility has been known to be achievable for more than a quarter-century in earlier work on the simulation of quantum entanglement by classical communication, starting with Refs. [2–7], continuing with Refs. [8–14], etc. for the bipartite case and Refs. [15–17], etc. for the multipartite case, and culminating with our own Ref. [1].

Our protocol to sample remotely a given probability vector is presented in Section 2. For this purpose, the von Neumann rejection algorithm [18] is modified to produce an output $x \in \mathcal{X}$ with exact probability $p_x$ using mere approximations of those probabilities, which are computed based on partial knowledge of the parameters transmitted on demand by the custodians to the leader. For the sake of simplicity, and to concentrate on the new techniques, we assume initially that algebraic operations on real numbers can be carried out with infinite precision and that continuous random variables can be sampled. Later, in Section 4, we build on techniques developed in Ref. [1] to obtain exact sampling in a realistic scenario in which all computations are performed with finite precision and the only source of randomness comes from flipping independent fair coins.

In the intervening Section 3, we study our motivating application of remote sampling, which is the simulation of quantum entanglement using classical resources and classical communication. Readers who may not be interested in quantum information can still benefit from Section 2 and most of Section 4, which make no reference to quantum theory in order to explain our general remote sampling strategies. A special case of remote sampling has been used by the authors [1], in which the aim was to sample a specific probability distribution appearing often in quantum information science, namely the *m*-partite Greenberger–Horne–Zeilinger (GHZ) distribution [19]. More generally, consider a quantum system of dimension $d = d_1 \cdots d_m$ represented by a density matrix *ρ* known by the leader (surprisingly, the custodians have no need to know *ρ*). Suppose that there are *m* generalized measurements (POVMs) acting on quantum systems of dimensions $d_1, \ldots, d_m$ whose possible outcomes lie in sets $\mathcal{X}_1, \ldots, \mathcal{X}_m$ of cardinality $n_1, \ldots, n_m$, respectively. Each custodian knows one and only one of the POVMs and nothing else about the others. The leader does not initially know any information about any of the POVMs. Suppose in addition that the leader can generate independent identically distributed uniform random variables on the real interval [0, 1]. We show how to generate a random vector $X = (X_1, \ldots, X_m) \in \mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_m$ sampled from the exact joint probability distribution that would be obtained if each custodian *i* had the *i*th share of *ρ* (of dimension $d_i$) and measured it according to the *i*th POVM, producing outcome $x_i \in \mathcal{X}_i$. This task is defined formally in Section 3, where we prove that the total expected number of bits transmitted between the leader and the custodians using remote sampling is $O(m^2)$ provided all the $d_i$'s and $n_i$'s are bounded by some constant. The exact formula, involving *m* as well as the $d_i$'s and $n_i$'s, is given as Equation (14) in Section 3, where *d* and *n* denote the product of the $d_i$'s and the $n_i$'s, respectively. In Section 4, we obtain the same asymptotic result in the more realistic scenario in which the only source of randomness comes from independent identically distributed uniform random *bits*. This result subsumes that of Ref. [1] since all $d_i$'s and $n_i$'s are equal to 2 for projective measurements on individual qubits of the *m*-partite GHZ state.

#### **2. Remote Sampling**

As explained in the Introduction, we show how to sample *remotely* a discrete probability vector $p = (p_x)_{x\in\mathcal{X}}$. The task of sampling is carried out by a *leader* ignorant of some parameters $\theta = (\theta_1, \ldots, \theta_m)$ that enter the definition of the probability vector, where each $\theta_i$ is known only by the *i*th *custodian*, with whom the leader can communicate. We strive to minimize the amount of communication required to achieve this task.

To solve our conundrum, we modify the von Neumann rejection algorithm [18,20]. Before explaining those modifications, let us review the original algorithm. Let $q = (q_x)_{x\in\mathcal{X}}$ be a probability vector that we know how to sample on the same set $\mathcal{X}$, and let $C \ge 1$ be such that $p_x \le Cq_x$ for all $x \in \mathcal{X}$. The classical von Neumann rejection algorithm is shown as Algorithm 1. It is well known that the expected number of times round the **repeat** loop is exactly *C*.

**Algorithm 1** Original von Neumann rejection algorithm

```
1: repeat
2:   Sample X according to (q_x)_{x∈X}
3:   Sample U uniformly on [0, 1]
4:   if UCq_X ≤ p_X then
5:     return X   {X is accepted}
6:   end if
7: end repeat
```
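As a point of reference, here is a minimal Python rendering of Algorithm 1 (our own illustration, not part of the protocol); `p` and `q` are lists of probabilities indexed by the elements of $\mathcal{X}$, and the expected number of iterations of the loop is *C*.

```python
import random

def von_neumann_sample(p, q, C):
    """Algorithm 1: sample X with P{X = x} = p[x], given a proposal
    distribution q and a constant C >= 1 with p[x] <= C * q[x] for all x."""
    while True:
        X = random.choices(range(len(q)), weights=q)[0]  # line 2
        U = random.random()                              # line 3
        if U * C * q[X] <= p[X]:                         # line 4
            return X                                     # line 5: accepted
```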

If only partial knowledge about the parameters defining *p* is known, it would seem that the condition in line 4 cannot be decided. Nevertheless, the strategy is to build a sequence of increasingly accurate approximations that converge to the left and right sides of the test. As explained below, the number of bits transmitted depends on the number of bits needed to compute *q*, and on the accuracy in *p* required to accept or reject. This task can be achieved either in the *random bit model*, in which only i.i.d. random bits are generated, or in the less realistic *uniform model*, in which uniform continuous random variables are needed. The random bit model was originally suggested by von Neumann [18], but only later given this name and formalized by Knuth and Yao [21]. In this section, we concentrate for simplicity on the uniform model, leaving the more practical random bit model for Section 4.

**Definition 1.** *A t-bit approximation of a real number x is any* $\hat{x}$ *such that* $|x - \hat{x}| \le 2^{-t}$. *A special case of t-bit approximation is the t-bit truncation* $\hat{x} = \operatorname{sign}(x)\lfloor |x| 2^t \rfloor / 2^t$, *where* $\operatorname{sign}(x)$ *is equal to* +1*,* 0 *or* −1 *depending on the sign of x. If* $\alpha = a + bi$ *is a complex number, where* $i = \sqrt{-1}$*, then a t-bit approximation (resp. truncation)* $\hat{\alpha}$ *of α is any* $\hat{a} + \hat{b}i$*, where* $\hat{a}$ *and* $\hat{b}$ *are t-bit approximations (resp. truncations) of a and b, respectively.*

Note that we assume without loss of generality that approximations of probabilities are always constrained to be real numbers between 0 and 1, which can be enforced by snapping any out-of-bound approximation (even if it is a complex number) to the closest valid value.
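In Python, these notions might be sketched as follows (our illustration); the second helper implements the snapping of out-of-bound approximations just mentioned.

```python
import math

def truncate(x: float, t: int) -> float:
    """t-bit truncation of x (Definition 1): sign(x) * floor(|x| * 2^t) / 2^t.
    The result differs from x by at most 2^-t."""
    sign = (x > 0) - (x < 0)
    return sign * math.floor(abs(x) * 2 ** t) / 2 ** t

def snap_to_probability(p_hat) -> float:
    """Constrain an approximation of a probability to [0, 1], dropping
    any spurious imaginary part, as assumed without loss of generality."""
    return min(1.0, max(0.0, complex(p_hat).real))
```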

Consider an integer $t_0 > 0$ to be determined later. Our strategy is for the leader to compute the probability vector $q = (q_x)_{x\in\mathcal{X}}$ defined below, based on $t_0$-bit approximations $p_x(t_0)$ of the probabilities $p_x$ for each $x \in \mathcal{X}$. For this purpose, the leader receives sufficient information from the custodians to build the entire vector *q* at the outset of the protocol. This makes *q* the "easy-to-sample" distribution required in von Neumann's technique, which is easy not from a computational viewpoint, but in the sense that no further communication is required for the leader to sample it as many times as needed.

Let

$$C = \sum\_{x} \left( p\_{x}(t\_{0}) + 2^{-t\_{0}} \right) \tag{1}$$

and

$$q_x = \left(p_x(t_0) + 2^{-t_0}\right) / C \,. \tag{2}$$

Noting that $\sum_x q_x = 1$, these $q_x$ define a proper probability vector $q = (q_x)_{x\in\mathcal{X}}$. Using the definition of a *t*-bit approximation and the definition of $q_x$ from Equation (2), we have that

$$p_x \le p_x(t_0) + 2^{-t_0} = Cq_x \le p_x + 2 \times 2^{-t_0}.$$

Taking the sum over the possible values of *x* and recalling that the set $\mathcal{X}$ is of cardinality *n*,

$$1 \le C \le 1 + 2^{1-t_0}n\,. \tag{3}$$
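Concretely, the leader's construction of *C* and *q* is a few lines of code. In the sketch below (ours), `p_approx(x, t0)` stands for the $t_0$-bit approximation of $p_x$ that the leader assembles from the custodians' messages, already snapped into [0, 1].

```python
def build_proposal(p_approx, n, t0):
    """Build C and q = (q_x) from t0-bit approximations of the p_x's,
    following Equations (1) and (2)."""
    eps = 2 ** -t0
    padded = [p_approx(x, t0) + eps for x in range(n)]
    C = sum(padded)              # Equation (1)
    q = [v / C for v in padded]  # Equation (2)
    # Sanity check of Equation (3): 1 <= C <= 1 + 2^(1 - t0) * n.
    assert 1 <= C <= 1 + 2 ** (1 - t0) * n
    return C, q
```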

Consider any $x \in \mathcal{X}$ sampled according to *q* and *U* sampled uniformly on [0, 1], as in lines 2 and 3 of Algorithm 1. Should *x* be accepted because $UCq_x \le p_x$, this can be certified by any *t*-bit approximation $p_x(t)$ of $p_x$ such that $UCq_x \le p_x(t) - 2^{-t}$ for some positive integer *t*, since $p_x(t) \le p_x + 2^{-t}$. Conversely, any integer *t* such that $UCq_x > p_x(t) + 2^{-t}$ certifies that *x* should be rejected because it implies that $UCq_x > p_x$, since $p_x(t) \ge p_x - 2^{-t}$. On the other hand, no decision can be made concerning $UCq_x$ versus $p_x$ if $-2^{-t} < UCq_x - p_x(t) \le 2^{-t}$. It follows that one can modify Algorithm 1 above into Algorithm 2 below, in which a sufficiently precise approximation of $p_x$ suffices to make the correct decision to accept or reject an *x* sampled according to distribution *q*. A well-chosen value of $t_0$ must be input into this algorithm, as discussed later.

**Algorithm 2** Modified rejection algorithm—Protocol for the leader

**Input:** Value of $t_0$

```
 1: Compute p_x(t0) for each x ∈ X   {the leader needs information from the
    custodians in order to compute these approximations}
 2: Compute C and q = (q_x)_{x∈X} as per Equations (1) and (2)
 3: Sample X according to q
 4: Sample U uniformly on [0, 1]
 5: for t = t0 to ∞ do
 6:   if UCq_X ≤ p_X(t) − 2^-t then
 7:     return X   {X is accepted}
 8:   else if UCq_X > p_X(t) + 2^-t then
 9:     go to line 3   {X is rejected}
10:   else
11:     Continue the for loop   {We cannot decide whether to accept or reject
        because −2^-t < UCq_X − p_X(t) ≤ 2^-t; communication may be required in
        order for the leader to compute p_X(t+1); it could be that bits
        previously communicated to compute p_X(t) can be reused.}
12:   end if
13: end for
```
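Algorithm 2 translates almost line for line into code. In the sketch below (our illustration), the hypothetical oracle `p_approx(x, t)` returns a *t*-bit approximation of $p_x$ and hides all communication with the custodians; machine floating-point numbers stand in for the exact reals of the uniform model, so this is an approximation of the protocol rather than a faithful implementation.

```python
import random

def remote_sample(p_approx, n, t0):
    """Algorithm 2, as run by the leader."""
    eps0 = 2 ** -t0
    padded = [p_approx(x, t0) + eps0 for x in range(n)]  # lines 1-2
    C = sum(padded)
    q = [v / C for v in padded]
    while True:
        X = random.choices(range(n), weights=q)[0]       # line 3
        U = random.random()                              # line 4
        lhs = U * C * q[X]
        t = t0                                           # line 5
        while True:
            eps = 2 ** -t
            pXt = p_approx(X, t)
            if lhs <= pXt - eps:     # line 6: certified acceptance
                return X
            if lhs > pXt + eps:      # line 8: certified rejection
                break                # back to line 3
            t += 1                   # line 11: one more bit of precision
```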
**Theorem 1.** *Algorithm 2 is correct, i.e., it terminates and returns X* = *x with probability* $p_x$*. Furthermore, let T be the random variable that denotes the value of variable t upon termination of any instance of the* **for** *loop, whether the loop terminates in rejection or acceptance. Then,*

$$\mathbf{E}(T) \le t_0 + 3\,. \tag{4}$$

**Proof.** Consider any $x \in \mathcal{X}$ and $t \ge t_0$. To reach $T > t$, it must be that $-2^{-t} < UCq_x - p_x(t) \le 2^{-t}$. Noting that $q_x \neq 0$ according to Equation (2), the probability that $T > t$ when $X = x$ is therefore upper-bounded as follows:

$$\begin{split} \mathbf{P}\{T > t \mid X = x\} &\le \mathbf{P}\{-2^{-t} < UCq_x - p_x(t) \le 2^{-t}\} \\ &= \mathbf{P}\left\{\frac{p_x(t) - 2^{-t}}{Cq_x} < U \le \frac{p_x(t) + 2^{-t}}{Cq_x}\right\} \\ &\le \frac{p_x(t) + 2^{-t}}{Cq_x} - \frac{p_x(t) - 2^{-t}}{Cq_x} = \frac{2 \times 2^{-t}}{Cq_x} \le 2^{t_0 - t + 1}\,. \end{split} \tag{5}$$

The last inequality uses the fact that

$$Cq_x = p_x(t_0) + 2^{-t_0} \ge 2^{-t_0}.$$

It follows that the probability that more turns round the **for** loop are required decreases exponentially with each new turn once $t > t_0 + 1$, which suffices to guarantee termination of the **for** loop with probability 1. Termination of the algorithm itself comes from the fact that the choice of *X* and *U* in lines 3 and 4 leads to acceptance at line 7, and therefore termination, with probability 1/*C*, as demonstrated by von Neumann in the analysis of his rejection algorithm.

The fact that *X* = *x* is returned with probability *px* is an immediate consequence of the correctness of the von Neumann rejection algorithm since our adaptation of this method to handle the fact that only approximations of *pX* are available does not change the decision to accept or reject any given candidate sampled according to *q*.

In order to bound the expectation of *T*, we note that $\mathbf{P}\{T > t \mid X = x\} = 1$ when $t < t_0$ since we start the **for** loop at $t = t_0$. We can also use the vacuous $\mathbf{P}\{T > t_0 \mid X = x\} \le 1$ rather than the worse-than-vacuous upper bound of 2 given by Equation (5) in the case $t = t_0$. Therefore,

$$\begin{aligned} \mathbf{E}(T \mid X = x) &= \sum_{t=0}^{\infty} \mathbf{P}\{T > t \mid X = x\} \\ &= \sum_{t=0}^{t_0} \mathbf{P}\{T > t \mid X = x\} + \sum_{t=t_0+1}^{\infty} \mathbf{P}\{T > t \mid X = x\} \\ &\le t_0 + 1 + 2^{t_0+1} \sum_{t=t_0+1}^{\infty} 2^{-t} = t_0 + 3\,. \end{aligned}$$

It remains to note that, since $\mathbf{E}(T \mid X = x) \le t_0 + 3$ for all $x \in \mathcal{X}$, it follows that $\mathbf{E}(T) \le t_0 + 3$ without condition.

Let *S* be the random variable that represents the number of times variable *X* is sampled according to *q* at line 3, and let $T_i$ be the random variable that represents the value of variable *T* upon termination of the *i*th instance of the **for** loop starting at line 5, for $i \in \{1, \ldots, S\}$. The random variables $T_i$ are independently and identically distributed as the random variable *T* in Theorem 1, and the expected value of *S* is *C*. Let $X_1, \ldots, X_S$ be the random variables chosen at successive passes at line 3, so that $X_1, \ldots, X_{S-1}$ are rejected, whereas $X_S$ is returned as the final result of the algorithm.

To analyse the communication complexity of Algorithm 2, we introduce the function $\gamma_x(t)$ for each $x \in \mathcal{X}$ and $t > t_0$, which denotes the *incremental* number of bits that the leader must receive from the custodians in order to compute $p_x(t)$, taking account of the information that may already be available from having previously computed $p_x(t-1)$. For completeness, we include in $\gamma_x(t)$ the cost of the communication required for the leader to request more information from the custodians. We also introduce the function $\delta(t)$ for $t \ge 0$, which denotes the number of bits that the leader must receive from the custodians in order to compute $p_x(t)$ for all $x \in \mathcal{X}$ in a "simultaneous" manner. Note that it could be much less expensive to compute those *n* values than *n* times the cost of computing any single one of them because some of the parameters held by the custodians may be relevant to more than one of the $p_x$'s. The total number of bits communicated in order to implement Algorithm 2 is therefore given by the random variable

$$Z = \delta(t_0) + \sum_{i=1}^{S} \sum_{t=t_0+1}^{T_i} \gamma_{X_i}(t)\,.$$

For simplicity, let us define the function $\gamma(t) \stackrel{\text{def}}{=} \max_{x\in\mathcal{X}} \gamma_x(t)$. We then have

$$Z \le \delta(t_0) + \sum_{i=1}^{S} \sum_{t=t_0+1}^{T_i} \gamma(t)\,,$$

whose expectation, according to Wald's identity, is

$$\mathbb{E}(Z) \le \delta(t\_0) + \mathbb{E}(S)\mathbb{E}\left(\sum\_{t=t\_0+1}^T \gamma(t)\right). \tag{6}$$

Assuming the value of *γ*(*t*) is upper-bounded by some *γ*,

$$\begin{aligned} \mathbf{E}(Z) &\leq \delta(t\_0) + \mathbf{E}(S)\mathbf{E}(T - t\_0)\gamma \\ &\leq \delta(t\_0) + 3\gamma \mathbf{C} \\ &\leq \delta(t\_0) + 3\gamma \left(1 + 2^{1 - t\_0}n\right) \end{aligned} \tag{7}$$

because **E**(*S*) = *C* and using Equations (4) and (3).

Depending on the specific application, which determines *γ* and the function $\delta(t)$, Equation (7) is key to a trade-off that can lead to an optimal choice of $t_0$ since a larger $t_0$ decreases $2^{1-t_0}$ but is likely to increase $\delta(t_0)$. The value of *γ* may play a rôle in the balance. The next section, in which we consider the simulation of quantum entanglement by classical communication, gives an example of this trade-off in action.
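To see the trade-off numerically, consider a toy cost model (entirely made up for this illustration, not from the text) in which $\delta(t_0)$ grows linearly in $t_0$ while *γ* is constant; the bound of Equation (7) is then minimized near $t_0 = \lg n$, foreshadowing the choice made in the next section.

```python
# Evaluate the right-hand side of Equation (7) under a hypothetical
# linear cost model delta(t0) = delta_rate * t0 and constant gamma.
def bound_on_expected_bits(t0, n, delta_rate, gamma):
    return delta_rate * t0 + 3 * gamma * (1 + 2 ** (1 - t0) * n)

n = 1024  # so lg n = 10
for t0 in (1, 5, 10, 15, 20):
    print(t0, bound_on_expected_bits(t0, n, delta_rate=100, gamma=50))
# The bound drops sharply up to t0 = 10, then slowly increases again.
```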

#### **3. Simulation of Quantum Entanglement Based on Remote Sampling**

Before introducing the simulation of entanglement, let us establish some notation and mention the mathematical objects that we shall need. It is assumed that the reader is familiar with linear algebra, in particular the notions of a semi-definite positive matrix, Hermitian matrix, trace of a matrix, tensor product, etc. For a discussion about the probabilistic and statistical nature of quantum theory, see Ref. [22]. For convenience, we use [*n*] to denote the set $\{1, 2, \ldots, n\}$ for any integer *n*.

Consider integers $m, d_1, d_2, \ldots, d_m, n_1, n_2, \ldots, n_m$, all greater than or equal to 2. Define $d = \prod_{i=1}^{m} d_i$ and $n = \prod_{i=1}^{m} n_i$. Let *ρ* be a $d \times d$ density matrix. Recall that any density matrix is Hermitian, semi-definite positive and unit-trace, which implies that its diagonal elements are real numbers between 0 and 1. For each $i \in [m]$ and $j \in [n_i]$, let $M_{ij}$ be a $d_i \times d_i$ Hermitian semi-definite positive matrix such that

$$\sum_{j \in [n_i]} M_{ij} = I_{d_i}\,, \tag{8}$$

where $I_{d_i}$ is the $d_i \times d_i$ identity matrix. In other words, each set $\{M_{ij}\}_{j\in[n_i]}$ is a POVM (positive-operator valued measure) [22].

As introduced in Section 1, we consider one *leader* and *m* *custodians*. The leader knows the density matrix *ρ* and the *i*th custodian knows the *i*th POVM, meaning that he knows the matrices $M_{ij}$ for all $j \in [n_i]$. If a physical system of dimension *d* in state *ρ* were shared between the custodians, in the sense that the *i*th custodian had possession of the *i*th subsystem of dimension $d_i$, each custodian could perform locally his assigned POVM and output the outcome, an integer between 1 and $n_i$. The joint output would belong to $\mathcal{X} \stackrel{\text{def}}{=} [n_1] \times [n_2] \times \cdots \times [n_m]$, a set of cardinality *n*, sampled according to the probability distribution stipulated by the laws of quantum theory, which we review below.

Our task is to sample $\mathcal{X}$ with the exact same probability distribution even though there is no physical system in state *ρ* available to the custodians, and in fact all parties considered are purely classical! We know from Bell's Theorem [23] that this task is impossible in general without communication, even when *m* = 2, and our goal is to minimize the amount of communication required to achieve it. Special cases of this problem have been studied extensively for expected [1,2,4–6], etc. and worst-case [3,8], etc. communication complexity, but here we solve it in its essentially most general setting, albeit only in the expected sense. For this purpose, the leader will centralize the operations while requesting as little information as possible from the custodians on their assigned POVMs. Once the leader has successfully sampled $X = (X_1, \ldots, X_m)$, he transmits each $X_i$ to the *i*th custodian, who can then output it as would have been the case had quantum measurements actually taken place.

We now review the probability distribution on $\mathcal{X}$ that we need to sample, according to quantum theory. For each vector $x = (x_1, \ldots, x_m) \in \mathcal{X}$, let $M_x$ be the $d \times d$ tensor product of the matrices $M_{ix_i}$ for each $i \in [m]$:

$$M\_x = \bigotimes\_{i=1}^m M\_{ix\_i}.\tag{9}$$

The set $\{M_x\}_{x\in\mathcal{X}}$ forms a global POVM of dimension *d*, which, applied to density matrix *ρ*, defines a joint probability vector on $\mathcal{X}$. The probability $p_x$ of obtaining any $x = (x_1, \ldots, x_m) \in \mathcal{X}$ is given by

$$p\_{\mathbf{x}} = \text{Tr}(\rho M\_{\mathbf{x}}) = \text{Tr}\left(\rho \left(\bigotimes\_{i=1}^{m} M\_{i\mathbf{x}\_{i}}\right)\right). \tag{10}$$
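For concreteness, the following NumPy sketch (ours, purely illustrative: in the protocol no single party ever holds *ρ* and all the POVMs together) computes the whole outcome distribution of Equations (9) and (10).

```python
import numpy as np
from functools import reduce
from itertools import product

def joint_distribution(rho, povms):
    """Return {x: Tr(rho M_x)} for all x, where M_x is the Kronecker
    product of one POVM element per party (Equations (9) and (10));
    povms[i] is the list of matrices M_ij of the i-th party's POVM."""
    p = {}
    for x in product(*(range(len(P)) for P in povms)):
        M_x = reduce(np.kron, (povms[i][j] for i, j in enumerate(x)))
        p[x] = np.trace(rho @ M_x).real
    return p
```

For instance, with *m* = 2 parties measuring qubits, `rho` is a 4 × 4 matrix and `povms` a list of two lists of 2 × 2 matrices; the returned probabilities sum to 1 by Fact 1 below.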

For a matrix *A* of size $s \times s$ and any pair of indices *r* and *c* between 0 and $s - 1$, we use $(A)_{rc}$ to denote the entry of *A* located in the *r*th row and *c*th column. Matrix indices start at 0 rather than 1 to facilitate Fact 2 below. We now state various facts for which we provide cursory justifications since they follow from elementary linear algebra and quantum theory, or they are lifted from previous work.

**Fact 1.** For all $x \in \mathcal{X}$, we have $0 \le p_x \le 1$ when $p_x$ is defined according to Equation (10); furthermore, $\sum_{x\in\mathcal{X}} p_x = 1$. This is obvious because quantum theory tells us that Equation (10) defines a probability distribution over all possible outcomes $x \in \mathcal{X}$, as sampled by the joint measurement. Naturally, this statement could also be proven from Equations (8) and (10) using elementary linear algebra.

**Fact 2.** For each $x = (x_1, \ldots, x_m) \in \mathcal{X}$, matrix $M_x$ is the tensor product of *m* matrices as given in Equation (9). Therefore, each entry $(M_x)_{rc}$ is the product of *m* entries of the $M_{ix_i}$'s. Specifically, consider any indices *r* and *c* between 0 and $d - 1$ and let $r_i$ and $c_i$ be the indices between 0 and $d_i - 1$, for each $i \in [m]$, such that

$$\begin{aligned} r &= r_1 + r_2 d_1 + r_3 d_1 d_2 + \cdots + r_m d_1 \cdots d_{m-1}\,, \\ c &= c_1 + c_2 d_1 + c_3 d_1 d_2 + \cdots + c_m d_1 \cdots d_{m-1}\,. \end{aligned}$$

The $r_i$'s and $c_i$'s are uniquely defined by the principle of mixed-radix numeration. We have

$$(M_x)_{rc} = \prod_{i=1}^{m} \left( M_{ix_i} \right)_{r_i c_i}.$$
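The decomposition of Fact 2 is ordinary mixed-radix arithmetic, i.e., repeated division with the $d_i$'s as bases; a tiny helper (our illustration) makes this explicit.

```python
def mixed_radix_digits(r, dims):
    """Digits (r_1, ..., r_m) of r with r = r_1 + r_2*d_1 + r_3*d_1*d_2 + ...,
    where dims = (d_1, ..., d_m), as in Fact 2."""
    digits = []
    for d in dims:
        r, digit = divmod(r, d)
        digits.append(digit)
    return tuple(digits)

# Example: with dims = (2, 3), the index 5 = 1 + 2*2 has digits (1, 2).
assert mixed_radix_digits(5, (2, 3)) == (1, 2)
```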

**Fact 3.** Let *M* be a Hermitian semi-definite positive matrix. Every entry $(M)_{ij}$ of the matrix satisfies

$$|(M)_{ij}| \le \sqrt{(M)_{ii}(M)_{jj}}\,.$$

This follows from the fact that all principal submatrices of any Hermitian semi-definite positive matrix are semi-definite positive [24] (Observation 7.1.2, page 430). In particular, the principal submatrix

$$\begin{pmatrix} (M)_{ii} & (M)_{ij} \\ (M)_{ji} & (M)_{jj} \end{pmatrix}$$

is semi-definite positive, and therefore it has nonnegative determinant:

$$(M)_{ii}(M)_{jj} - (M)_{ij}(M)_{ji} = (M)_{ii}(M)_{jj} - (M)_{ij}(M)_{ij}^{*} = (M)_{ii}(M)_{jj} - |(M)_{ij}|^{2} \ge 0$$

by virtue of *M* being Hermitian, where $\alpha^{*}$ denotes the complex conjugate of *α*.

**Fact 4.** The norm $|(\rho)_{ij}|$ of any entry of a density matrix *ρ* is less than or equal to 1. This follows directly from Fact 3 since density matrices are Hermitian semi-definite positive, and from the fact that diagonal entries of density matrices, such as $(\rho)_{ii}$ and $(\rho)_{jj}$, are real values between 0 and 1.

**Fact 5.** Given any POVM $\{M_\ell\}_{\ell=1}^{L}$, we have that

$$0 \le (M_\ell)_{ii} \le 1 \quad\text{and}\quad |(M_\ell)_{ij}| \le 1$$

for every $\ell \in [L]$ and all indices *i* and *j*. The first statement follows from the fact that $\sum_{\ell=1}^{L} M_\ell$ is the identity matrix by definition of POVMs, and therefore $\sum_{\ell=1}^{L} (M_\ell)_{ii} = 1$ for all *i*, and the fact that each $(M_\ell)_{ii} \ge 0$ because each $M_\ell$ is semi-definite positive. The second statement follows from the first by applying Fact 3.

**Fact 6** (This is a special case of Theorem 1 from Ref. [1], with *v* = 0)**.** Let $k \ge 1$ be an integer and consider any two real numbers *a* and *b*. If $\hat{a}$ and $\hat{b}$ are arbitrary *k*-bit approximations of *a* and *b*, respectively, then $\hat{a} + \hat{b}$ is a $(k-1)$-bit approximation of $a + b$. If, in addition, *a* and *b* are known to lie in the interval $[-1, 1]$, which can also be assumed without loss of generality concerning $\hat{a}$ and $\hat{b}$ since otherwise they can be safely pushed back to the appropriate frontier of this interval, then $\hat{a}\hat{b}$ is a $(k-1)$-bit approximation of *ab*.

**Fact 7.** Let $k \ge 1$ be an integer and consider any two *complex* numbers *α* and *β*. If $\hat{\alpha}$ and $\hat{\beta}$ are arbitrary *k*-bit approximations of *α* and *β*, respectively, then $\hat{\alpha} + \hat{\beta}$ is a $(k-1)$-bit approximation of $\alpha + \beta$. If, in addition, $k \ge 2$ and the real and imaginary parts of *α* and *β* are known to lie in the interval $[-1, 1]$, which can also be assumed without loss of generality concerning $\hat{\alpha}$ and $\hat{\beta}$, then $\hat{\alpha}\hat{\beta}$ is a $(k-2)$-bit approximation of *αβ*. This is a direct consequence of Fact 6 in the case of addition. In the case of multiplication, consider $\alpha = a + bi$, $\beta = c + di$, $\hat{\alpha} = \hat{a} + \hat{b}i$ and $\hat{\beta} = \hat{c} + \hat{d}i$, so that

$$\alpha\beta = (ac - bd) + (ad + bc)i \quad \text{and} \quad \hat{\alpha}\hat{\beta} = (\hat{a}\hat{c} - \hat{b}\hat{d}) + (\hat{a}\hat{d} + \hat{b}\hat{c})i\,.$$

By the multiplicative part of Fact 6, $\hat{a}\hat{c}$, $\hat{b}\hat{d}$, $\hat{a}\hat{d}$ and $\hat{b}\hat{c}$ are $(k-1)$-bit approximations of *ac*, *bd*, *ad* and *bc*, respectively; and then by the additive part of the same fact (which obviously applies equally well to subtraction), $\hat{a}\hat{c} - \hat{b}\hat{d}$ and $\hat{a}\hat{d} + \hat{b}\hat{c}$ are $(k-2)$-bit approximations of $ac - bd$ and $ad + bc$, respectively.

**Fact 8** (This is Corollary 2 from Ref. [1])**.** Let $m \ge 2$ and $k \ge \lceil \lg m \rceil$ be integers and let $\{a_j\}_{j=1}^{m}$ and $\{\hat{a}_j\}_{j=1}^{m}$ be real numbers and their *k*-bit approximations, respectively, all in the interval $[-1, 1]$. Then, $\prod_{j=1}^{m} \hat{a}_j$ is a $(k - \lceil \lg m \rceil)$-bit approximation of $\prod_{j=1}^{m} a_j$.

**Fact 9.** Let $m \ge 2$ and $k \ge 2\lceil \lg m \rceil$ be integers and let $\{\alpha_j\}_{j=1}^{m}$ and $\{\hat{\alpha}_j\}_{j=1}^{m}$ be complex numbers and their *k*-bit approximations, respectively. Provided it is known that $|\alpha_j| \le 1$ for each $j \in [m]$, a $(k - 2\lceil \lg m \rceil)$-bit approximation of $\prod_{j=1}^{m} \alpha_j$ can be computed from knowledge of the $\hat{\alpha}_j$'s. The proof of this fact follows essentially the same template as Fact 8, except that *two* bits of precision may be lost at each level up the binary tree introduced in Ref. [1], due to the difference between Facts 6 and 7. A subtlety occurs in the need, for Fact 7 to apply, that the real and imaginary parts of all the complex numbers under consideration must lie in the interval $[-1, 1]$. This is automatic for the exact values since the $\alpha_j$'s are upper-bounded in norm by 1 and the product of such-bounded complex numbers is also upper-bounded in norm by 1, which implies that their real and imaginary parts lie in the interval $[-1, 1]$. For the approximations, however, we cannot force their *norm* to be bounded by 1 because we need the approximations to be rational for communication purposes. Fortunately, we can force the real and imaginary parts of all approximations computed at each level up the binary tree to lie in the interval $[-1, 1]$ because we know that they approximate such-bounded numbers. Note that the product of two complex numbers whose real and imaginary parts lie in the interval $[-1, 1]$, such as $1 + 2^{-k}i$ and $1 - 2^{-k}i$, may not have this property, even if they are *k*-bit approximations of numbers bounded in norm by 1.

**Fact 10.** Let $s \ge 2$ and $k \ge \lceil \lg s \rceil$ be integers and let $\{\alpha_j\}_{j=1}^{s}$ and $\{\hat{\alpha}_j\}_{j=1}^{s}$ be complex numbers and their *k*-bit approximations, respectively, without any restriction on their norm. Then $\sum_{j=1}^{s} \hat{\alpha}_j$ is a $(k - \lceil \lg s \rceil)$-bit approximation of $\sum_{j=1}^{s} \alpha_j$. Again, this follows the same proof template as Fact 8, substituting multiplication of real numbers by addition of complex numbers, which allows us to drop any condition on the size of the numbers considered.
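Facts 6 to 10 are easy to spot-check numerically; the test below (our illustration) exercises the additive and multiplicative parts of Fact 6, reusing the `truncate` helper sketched after Definition 1.

```python
import random

k = 20
for _ in range(10_000):
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    a_hat, b_hat = truncate(a, k), truncate(b, k)
    # Sums and products of k-bit truncations of numbers in [-1, 1]
    # are (k-1)-bit approximations of the true sum and product.
    assert abs((a + b) - (a_hat + b_hat)) <= 2 ** -(k - 1)
    assert abs(a * b - a_hat * b_hat) <= 2 ** -(k - 1)
```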

**Fact 11.** Consider any $x = (x_1, \ldots, x_m) \in \mathcal{X}$ and any positive integer *t*. In order to compute a *t*-bit approximation of $p_x$, it suffices to have $(t + 1 + \lceil 2\lg d \rceil + 2\lceil \lg m \rceil)$-bit approximations of each entry of the $M_{ix_i}$ matrices for all $i \in [m]$. This is because

$$\begin{split} p_x = \text{Tr}(\rho M_x) = \sum_{r=0}^{d-1} (\rho M_x)_{rr} &= \sum_{r=0}^{d-1} \sum_{c=0}^{d-1} (\rho)_{rc} (M_x)_{cr} \\ &= \sum_{r=0}^{d-1} \sum_{c=0}^{d-1} (\rho)_{rc} \prod_{i=1}^{m} (M_{ix_i})_{c_i r_i} \end{split} \tag{11}$$

by virtue of Fact 2. Every term of the double sum in Equation (11) involves a product of *m* entries, one per POVM element, and therefore incurs a loss of at most $2\lceil \lg m \rceil$ bits of precision by Fact 9, whose condition holds thanks to Fact 5. An additional bit of precision may be lost in the multiplication by $(\rho)_{rc}$, even though that value is available with arbitrary precision (and is upper-bounded by 1 in norm by Fact 4), because of the additions involved in multiplying complex numbers. Then, we have to add $s = d^2$ terms, which incurs an additional loss of at most $\lceil \lg s \rceil = \lceil 2\lg d \rceil$ bits of precision by Fact 10. In total, $(t + 1 + \lceil 2\lg d \rceil + 2\lceil \lg m \rceil)$-bit approximations of the $(M_{ix_i})_{c_i r_i}$'s will result in a *t*-bit approximation of $p_x$.

**Fact 12.** The leader can compute $p_x(t)$ for any specific $x = (x_1, \ldots, x_m) \in \mathcal{X}$ and integer *t* if he receives a total of

$$(t+2+\lceil 2\lg d \rceil + 2\lceil \lg m \rceil) \sum\_{i=1}^{m} d\_i^2$$

bits from the custodians. This is because the *i*th custodian has the description of matrix $M_{ix_i}$ of size $d_i \times d_i$, which is defined by exactly $d_i^2$ *real* numbers since the matrix is Hermitian. By virtue of Fact 11, it is sufficient for the leader to have $(t + 1 + \lceil 2\lg d \rceil + 2\lceil \lg m \rceil)$-bit approximations of all those $\sum_{i=1}^{m} d_i^2$ numbers. Since each one of them lies in the interval $[-1, 1]$ by Fact 5, well-chosen *k*-bit approximations (for instance *k*-bit truncations) can be conveyed by the transmission of $k + 1$ bits, one of which carries the sign.

Note that the *t*-bit approximation of $p_x$ computed according to Fact 12, say $a + bi$, may very well have a nonzero imaginary part *b*, albeit necessarily between $-2^{-t}$ and $2^{-t}$. Since $p_x(t)$ must be a real number between 0 and 1, the leader sets $p_x(t) = \max(0, \min(1, a))$, taking no account of *b*, although a paranoid leader may wish to test that $-2^{-t} \le b \le 2^{-t}$ indeed and raise an alarm in case it is not (which of course is mathematically impossible unless the custodians are not given proper POVMs, unless they misbehave, or unless a computation or communication error has occurred).

**Fact 13.** For any *t*, the leader can compute $p_x(t)$ for each and every $x \in \mathcal{X}$ if he receives

$$\delta(t) \stackrel{\text{def}}{=} \left( t + 2 + \lceil 2\lg d \rceil + 2\lceil \lg m \rceil \right) \sum\_{i=1}^{m} n\_i d\_i^2$$

bits from the custodians. This is because it suffices for each custodian *i* to send the leader $(t + 1 + \lceil 2\lg d \rceil + 2\lceil \lg m \rceil)$-bit approximations of all $n_i d_i^2$ real numbers that define the entire *i*th POVM, i.e., all the matrices $M_{ij}$ for $j \in [n_i]$. This is a nice example of the fact that it may be much less expensive for the leader to compute at once $p_x(t)$ for all $x \in \mathcal{X}$, rather than computing them one by one independently, which would cost

$$n\left(t+2+\lceil 2\lg d \rceil + 2\lceil \lg m \rceil\right) \sum_{i=1}^{m} d_i^2 = \left(t+2+\lceil 2\lg d \rceil + 2\lceil \lg m \rceil\right) \sum_{i=1}^{m} n\, d_i^2 \gg \delta(t)$$

bits of communication, by applying Fact 12 *n* times.

After all these preliminaries, we are now ready to adapt the general template of Algorithm 2 to our entanglement-simulation conundrum, yielding Algorithm 3. We postpone the choice of *t*<sup>0</sup> until after the communication complexity analysis of this new algorithm.

**Algorithm 3** Protocol for simulating arbitrary entanglement subjected to arbitrary measurements

```
 1: The leader asks each custodian i ∈ [m] for the value of n_i
 2: The leader sends the value of t0 to each custodian
 3: Each custodian i sends the leader (t0 + 1 + ⌈2 lg d⌉ + 2⌈lg m⌉)-bit
    truncations of the real and imaginary parts of the entries defining his
    matrices M_ij for all j ∈ [n_i]
 4: The leader computes p_x(t0) for each x ∈ X
 5: The leader computes C and q = (q_x)_{x∈X} as per Equations (1) and (2)
 6: repeat
 7:   accept ← false
 8:   reject ← false
 9:   The leader samples X = (X_1, ..., X_m) according to q
10:   The leader informs each custodian i ∈ [m] of the value of X_i
11:   The leader samples U uniformly on [0, 1]
12:   t ← t0
13:   repeat
14:     if UCq_X ≤ p_X(t) − 2^-t then
15:       accept ← true   {X is accepted}
16:     else if UCq_X > p_X(t) + 2^-t then
17:       reject ← true   {X is rejected}
18:     else
19:       The leader asks each custodian i ∈ [m] for one more bit in the
          truncation of the real and imaginary parts of the entries defining
          matrix M_iX_i
20:       Using this information, the leader updates p_X(t) into p_X(t+1)
21:       t ← t + 1
22:     end if
23:   until accept or reject
24: until accept
25: The leader requests each custodian i ∈ [m] to output his X_i
```

To analyse the expected number of bits of communication required by this algorithm, we apply Equation (7) from Section 2 after defining explicitly the cost parameters $\delta(t_0)$ for the initial computation of $p_x(t_0)$ for all $x \in \mathcal{X}$ at lines 3 and 4, and *γ* for the upgrade from a specific $p_X(t)$ to $p_X(t+1)$ at lines 19 and 20. For simplicity, we shall ignore the negligible amount of communication entailed at line 1 (which is $\sum_{i=1}^{m} \lceil \lg n_i \rceil \le m + \lg n$ bits), line 2 ($\lceil \lg t_0 \rceil$ bits), line 10 (also $\sum_{i=1}^{m} \lceil \lg n_i \rceil$ bits, but repeated $\mathbf{E}(S) \le 1 + 2^{1-t_0}n$ times) and line 25 (*m* bits) because they are not taken into account in Equation (7) since they are absent from Algorithm 2. If we counted it all, this would add $O((1 + 2^{1-t_0}n)\lg n + \lg t_0)$ bits to Equation (13) below, which would be less than $10\lg n$ bits added to Equation (14), with no effect at all on Equation (15).

According to Fact 13,

$$\delta(t_0) = \left(t_0 + 2 + \lceil 2\lg d \rceil + 2\lceil \lg m \rceil \right) \sum_{i=1}^{m} n_i d_i^2\,.$$

The cost of line 19 is very modest because we use *truncations* rather than general approximations in line 3 for the leader to compute $p_x(t_0)$ for all $x \in \mathcal{X}$. Indeed, it then suffices to obtain a single additional bit of precision in the real and imaginary parts of each entry defining matrix $M_{iX_i}$ from each custodian $i \in [m]$. The cost of this update is simply

$$\gamma = m + \sum\_{i=1}^{m} d\_i^2 \tag{12}$$

bits of communication, where the addition of *m* is to account for the leader needing to request new bits from the custodians. This is a nice example of what we meant by "it could be that bits previously communicated can be reused" in line 11 of Algorithm 2.

Putting it all together in Equation (7), the total expected number of bits communicated in Algorithm 3 in order to sample exactly according to the quantum probability distribution is

$$\begin{split} \mathbb{E}(Z) &\leq \delta(t\_0) + 3\gamma \left( 1 + 2^{1-t\_0}n \right) \\ &\leq \left( t\_0 + 2 + \lceil 2\lg d \rceil + 2\lceil \lg m \rceil \right) \sum\_{i=1}^{m} n\_i d\_i^2 + 3\left( 1 + 2^{1-t\_0}n \right) \left( m + \sum\_{i=1}^{m} d\_i^2 \right) . \end{split} \tag{13}$$

We are finally in a position to choose the value of parameter $t_0$. First note that $n = \prod_{i=1}^{m} n_i \ge 2^m$. Therefore, any constant choice of $t_0$ will entail an expected amount of communication that is exponential in *m* because of the right-hand term in Equation (13). At the other extreme, choosing $t_0 = n$ would also entail an expected amount of communication that is exponential in *m*, this time because of the left-hand term in Equation (13). A good compromise is to choose $t_0 = \lceil \lg n \rceil$, which results in $1 \le C \le 3$ according to Equation (3), because in that case $2^{t_0} \ge n$ and therefore

$$1 \le C \le 1 + 2^{1-t_0}n = 1 + \frac{2n}{2^{t_0}} \le 3\,,$$

so that Equation (13) becomes

$$\mathbb{E}(Z) \le \left( \lceil \lg n \rceil + \lceil 2\lg d \rceil + 2\lceil \lg m \rceil + 2 \right) \sum\_{i=1}^{m} n\_i d\_i^2 + 9 \left( m + \sum\_{i=1}^{m} d\_i^2 \right). \tag{14}$$

In case all the $n_i$'s and $d_i$'s are upper-bounded by some constant *ξ*, we have that $n = \prod_{i=1}^{m} n_i \le \xi^m$, hence $\lg n \le m \lg \xi$, similarly $\lg d \le m \lg \xi$, and also $\sum_{i=1}^{m} n_i d_i^2 \le m\xi^3$. It follows that

$$\mathbb{E}(Z) \le (3\xi^3 \lg \xi)\, m^2 + O(m \log m),\tag{15}$$

which is on the order of $m^2$, thus matching with our most general method the result that was already known for the very specific case of simulating the quantum *m*-partite GHZ distribution [1].

#### **4. Practical Implementation Using a Source of Discrete Randomness**

In practice, we cannot work with continuous random variables since our computers have finite storage capacities and finite precision arithmetic. Furthermore, the generation of uniform continuous random variables does not make sense computationally speaking and we must adapt Algorithms 2 and 3 to work in a finite world.

For this purpose, recall that *U* is the uniform continuous random variable on [0, 1] used in all the algorithms seen so far. For each $i \ge 1$, let $U_i$ denote the *i*th bit in the binary expansion of *U*, so that

$$U = 0.U_1 U_2 \cdots = \sum_{i=1}^{\infty} U_i 2^{-i}.$$

We acknowledge the fact that the $U_i$'s are not uniquely defined in case $U = j/2^k$ for integers $k > 0$ and $0 < j < 2^k$, but we only mention this phenomenon to ignore it since it occurs with probability 0 when *U* is uniformly distributed on [0, 1]. We denote the *t*-bit truncation of *U* by $U[t]$:

$$U[t] \stackrel{\text{def}}{=} \lfloor 2^t U \rfloor / 2^t = \sum_{i=1}^{t} U_i 2^{-i}\,.$$

For all *t* ≥ 1, we have that

$$U[t] \le U < U[t] + 2^{-t}\,. \tag{16}$$

We modify Algorithm 2 into Algorithm 4 as follows, leaving to the reader the corresponding modification of Algorithm 3, thus yielding a practical protocol for the simulation of general entanglement under arbitrary measurements.

**Algorithm 4** Modified rejection algorithm with discrete randomness source—Protocol for the leader

**Input:** Value of $t_0$

```
 1: Compute p_x(t0) for each x ∈ X   {the leader needs information from the
    custodians in order to compute these approximations}
 2: Compute C and q = (q_x)_{x∈X} as per Equations (1) and (2)
 3: Sample X according to q
 4: U[0] ← 0
 5: for t = 1 to t0 − 1 do
 6:   Generate i.i.d. unbiased bit U_t
 7:   U[t] ← U[t−1] + U_t 2^-t
 8: end for
 9: for t = t0 to ∞ do
10:   Generate i.i.d. unbiased bit U_t
11:   U[t] ← U[t−1] + U_t 2^-t
12:   if (U[t] + 2^-t) Cq_X ≤ p_X(t) − 2^-t then
13:     return X   {X is accepted}
14:   else if U[t] Cq_X > p_X(t) + 2^-t then
15:     go to line 3   {X is rejected}
16:   else
17:     Continue the for loop   {We cannot decide to accept or reject because
        −(1 + Cq_X) 2^-t < U[t] Cq_X − p_X(t) ≤ 2^-t; communication may be
        required in order for the leader to compute p_X(t+1); it could be that
        bits previously communicated to compute p_X(t) can be reused.}
18:   end if
19: end for
```
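As before, here is an illustrative Python rendering of Algorithm 4 (ours, with the same hypothetical oracle `p_approx(x, t)` as in Section 2); for brevity, line 3 uses `random.choices` as a stand-in, whereas a strict random-bit implementation would use the Knuth–Yao-style sampler discussed later in this section.

```python
import random

def remote_sample_bits(p_approx, n, t0):
    """Algorithm 4, as run by the leader; all randomness beyond line 3
    comes from i.i.d. unbiased bits (random.getrandbits(1))."""
    eps0 = 2 ** -t0
    padded = [p_approx(x, t0) + eps0 for x in range(n)]  # lines 1-2
    C = sum(padded)
    q = [v / C for v in padded]
    while True:
        X = random.choices(range(n), weights=q)[0]       # line 3
        U_t = 0.0                                        # line 4
        for s in range(1, t0):                           # lines 5-8
            U_t += random.getrandbits(1) * 2 ** -s
        t = t0                                           # lines 9-19
        while True:
            U_t += random.getrandbits(1) * 2 ** -t
            eps = 2 ** -t
            pXt = p_approx(X, t)
            if (U_t + eps) * C * q[X] <= pXt - eps:
                return X              # line 13: accepted
            if U_t * C * q[X] > pXt + eps:
                break                 # line 15: rejected, back to line 3
            t += 1                    # line 17: draw one more bit
```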

**Theorem 2.** *Algorithm 4 is correct, i.e., it terminates and returns X* = *x with probability* $p_x$*. Furthermore, let T be the random variable that denotes the value of variable t upon termination of any instance of the* **for** *loop that starts at line 9, whether it terminates in rejection or acceptance. Then,*

$$\mathbf{E}(T) \le t\_0 + 3 + 2^{-t\_0}.$$

**Proof.** This is very similar to the proof of Theorem 1, so let us concentrate on the differences. First note that it follows from Equation (16) and the fact that $|p_X(t) - p_X| \le 2^{-t}$ that

$$\left(U[t] + 2^{-t}\right) Cq_X \le p_X(t) - 2^{-t} \implies UCq_X \le p_X(t) - 2^{-t} \implies UCq_X \le p_X$$

and

$$U[t]\, Cq_X > p_X(t) + 2^{-t} \implies UCq_X > p_X(t) + 2^{-t} \implies UCq_X > p_X\,.$$

Therefore, whenever *X* is accepted at line 13 (resp. rejected at line 15), it would also have been accepted (resp. rejected) in the original von Neumann algorithm, which shows sampling correctness. Conversely, whenever we reach a value of $t \ge t_0$ such that $(U[t] + 2^{-t})Cq_X > p_X(t) - 2^{-t}$ and $U[t]\,Cq_X \le p_X(t) + 2^{-t}$, we do not have enough information to decide whether to accept or reject, and therefore we reach line 17, causing *t* to increase. This happens precisely when

$$-(1 + Cq_X)2^{-t} < U[t]\,Cq_X - p_X(t) \le 2^{-t}.$$

To obtain an upper bound on $\mathbf{E}(T)$, we mimic the proof of Theorem 1, but in the discrete rather than continuous regime. In particular, for any $x \in \mathcal{X}$ and $t \ge t_0$,

$$\begin{split} \mathbf{P}\{T > t \mid X = x\} &\le \mathbf{P}\left\{-(1+Cq_x)2^{-t} < U[t]\,Cq_x - p_x(t) \le 2^{-t}\right\} \\ &= \mathbf{P}\left\{p_x(t) - (1+Cq_x)2^{-t} < U[t]\,Cq_x \le p_x(t) + 2^{-t}\right\} \\ &= \mathbf{P}\left\{\frac{2^t p_x(t)}{Cq_x} - \frac{1+Cq_x}{Cq_x} < 2^t U[t] \le \frac{2^t p_x(t)}{Cq_x} + \frac{1}{Cq_x}\right\} \\ &\le \left[\left(\frac{2^t p_x(t)}{Cq_x} + \frac{1}{Cq_x}\right) - \left(\frac{2^t p_x(t)}{Cq_x} - \frac{1+Cq_x}{Cq_x}\right) + 1\right] 2^{-t} \end{split} \tag{17}$$

and therefore

$$\mathbf{P}\{T > t \mid X = x\} \le 2\left(1 + \frac{1}{Cq_x}\right)2^{-t} \le 2^{t_0-t+1} + 2^{1-t} \qquad \text{(because } Cq_x \ge 2^{-t_0}\text{)}\,. \tag{18}$$

To understand Equation (17), think of $2^t U[t]$ as an integer chosen randomly and uniformly between 0 and $2^t - 1$. The probability that it falls within some real interval $(a, b]$ for $a < b$ is equal to $2^{-t}$ times the number of integers between 0 and $2^t - 1$ in that interval, the latter being upper-bounded by the number of unrestricted integers in that interval, which is at most $b - a + 1$.

Noting how similar Equation (18) is to the corresponding Equation (5) in the analysis of Algorithm 2, it is not surprising that the expected value of *T* will be similar as well. Indeed, continuing as in the proof of Theorem 1, without belabouring the details,

$$\begin{split} \mathbf{E}(T \mid X = \mathbf{x}) &= \sum\_{t=0}^{\infty} \mathbf{P}\{T > t \mid X = \mathbf{x}\} \\ &= \sum\_{t=0}^{t\_0+1} \mathbf{P}\{T > t \mid X = \mathbf{x}\} + \sum\_{t=t\_0+2}^{\infty} \mathbf{P}\{T > t \mid X = \mathbf{x}\} \\ &\leq t\_0 + 2 + 2^{t\_0+1} \sum\_{t=t\_0+2}^{\infty} 2^{-t} + 2 \sum\_{t=t\_0+2}^{\infty} 2^{-t} = t\_0 + 3 + 2^{-t\_0}. \end{split} \tag{19}$$

We conclude that $\mathbf{E}(T) \le t_0 + 3 + 2^{-t_0}$ without condition since Equation (19) does not depend on *x*.

The similarity between Theorems 1 and 2 means that there is no significant additional cost in the amount of communication required to achieve remote sampling in the random bit model, i.e., if we consider a realistic scenario in which the only source of randomness comes from i.i.d. unbiased bits, compared with an unrealistic scenario in which continuous random variables can be drawn. For instance, the reasoning that led to Equation (7) applies *mutatis mutandis* to conclude that the expected number *Z* of bits that needs to be communicated to achieve remote sampling in the random bit model is

$$\mathbf{E}(Z) \le \delta(t_0) + \left(3 + 2^{-t_0}\right)\left(1 + 2^{1-t_0}n\right)\gamma\,,$$

where *δ* and *γ* have the same meaning as in Section 2.

If we use the random bit approach for the general simulation of quantum entanglement studied in Section 3, choosing $t_0 = \lceil \lg n \rceil$ again, Equation (14) becomes

$$\mathbb{E}(Z) \le \left( \lceil \lg n \rceil + \lceil 2\lg d \rceil + 2\lceil \lg m \rceil + 2 \right) \sum\_{i=1}^{m} n\_i d\_i^2 + 3(3 + 1/n) \left( m + \sum\_{i=1}^{m} d\_i^2 \right),\tag{20}$$

which reduces to the identical bound

$$\mathbb{E}(Z) \le (3\xi^3 \lg \xi) m^2 + O(m \log m)$$

in case all the $n_i$'s and $d_i$'s are upper-bounded by some constant *ξ*, which again is on the order of $m^2$.

In addition to communication complexity, another natural efficiency measure in the random bit model concerns the *expected number of random bits* that need to be drawn in order to achieve sampling. Randomness is needed in lines 3, 6 and 10 of Algorithm 4. A single random bit is required each time lines 6 and 10 are entered, but line 3 calls for sampling *X* according to distribution *q*. Let $V_i$ be the random variable that represents the number of random bits needed on the *i*th passage through line 3. For this purpose, we use the algorithm introduced by Donald Knuth and Andrew Chi-Chih Yao [21], which enables sampling from any finite discrete probability distribution in the random bit model by using an expectation of no more than two random bits in addition to the Shannon binary entropy of the distribution. Since each such sampling is independent from the others, it follows that $V_i$ is independently and identically distributed as a random variable *V* such that

$$\mathbb{E}(V) \le 2 + H(q) \le 2 + \lg n,\tag{21}$$

where *H*(*q*) denotes the binary entropy of *q*, which is never more than the base-two logarithm of the number of atoms in the distribution, here *n*.
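For illustration, here is a simple sampler in the random bit model (our sketch, in the spirit of Knuth and Yao): it refines the dyadic interval known to contain *U*, one bit at a time, until that interval fits inside a single cell of the cumulative distribution of *q*. The genuine Knuth–Yao construction attains the $2 + H(q)$ bound of Equation (21); this simpler interval variant comes close but is not exactly optimal.

```python
import random

def sample_with_bits(q):
    """Sample from the finite distribution q using only i.i.d. unbiased
    bits, by dyadic interval refinement against the CDF of q."""
    cdf, acc = [], 0.0
    for qx in q:
        acc += qx
        cdf.append(acc)
    lo, width = 0.0, 1.0
    while True:
        width /= 2                   # reveal one more bit of U
        if random.getrandbits(1):
            lo += width
        # The cell containing lo is the first x with lo < cdf[x].
        x = next((i for i, c in enumerate(cdf) if lo < c), len(q) - 1)
        if lo + width <= cdf[x]:     # whole interval in one cell: done
            return x
```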

Let *R* be the random variable that represents the number of random bits drawn when running Algorithm 4. Reusing the notation of Section 2, let *S* be the random variable that represents the number of times variable *X* is sampled at line 3 and let $T_i$ be the random variable that represents the value of variable *T* upon termination of the *i*th instance of the **for** loop starting at line 9, for $i \in \{1, \ldots, S\}$. The random variables $T_i$ are independently and identically distributed as the random variable *T* in Theorem 2 and the expected value of *S* is *C*. Since one new random bit is generated precisely each time variable *t* is increased by 1 in any pass through either **for** loop (line 5 or 9), we simply have

$$R = \sum\_{i=1}^{S} \left(V\_i + T\_i\right).$$

By virtue of Equations (3) and (21), Theorem 2, and using Wald's identity again, we conclude that:

$$\begin{aligned} \mathbf{E}(R) &= \mathbf{E}(S)\left(\mathbf{E}(V) + \mathbf{E}(T)\right) \\ &\le \left(1 + 2^{1-t_0}n\right)\left(\lg n + t_0 + 5 + 2^{-t_0}\right). \end{aligned}$$

Taking $t_0 = \lceil \lg n \rceil$ again, remote sampling can be completed using an expected number of random bits in $O(\lg n)$, with a hidden multiplicative constant no larger than 6. The hidden constant can be reduced arbitrarily close to 2 by taking $t_0 = \lceil \lg n \rceil + a$ for an arbitrarily large constant *a*. Whenever the target distribution *p* has close to full entropy, this is only twice the optimal number of random bits that would be required according to the Knuth–Yao lower bound [21] in the usual case when full knowledge of *p* is available in a central place rather than having to perform remote sampling. Note, however, that, if our primary consideration is to optimize communication for the classical simulation of entanglement, as in Section 3, choosing $t_0 = \lceil \lg n \rceil - a$ would be a better idea because the summation in the left-hand term of Equation (13) dominates that of the right-hand term. For this inconsequential optimization, *a* does not have to be a constant, but it should not exceed $\lg(\xi m)$, where *ξ* is our usual upper bound on the number of possible outcomes for each participant (if it exists), lest the right-hand term of Equation (13) overtake the left-hand term. Provided *ξ* exists, the expected number of random bits that needs to be drawn is linear in the number of participants.

The need for continuous random variables was not the only unrealistic assumption in Algorithms 1–3. We had also assumed implicitly that the custodians know their private parameters precisely (and that the leader knows exactly each entry of density matrix *ρ* in Section 3). This could be reasonable in some situations, but it could also be that some of those parameters are transcendental numbers or the result of applying transcendental functions to other parameters, for example $\cos \pi/8$. More interestingly, it could be that the actual parameters are spoon-fed to the custodians by *examiners*, who want to test the custodians' ability to respond appropriately to unpredictable inputs. However, all we need is for the custodians to be able to obtain their own parameters with arbitrary precision, so that they can provide that information to the leader upon request. For example, if a parameter is $\pi/4$ and the leader requests a *k*-bit approximation of that parameter, the custodian can compute some number $\hat{x}$ such that $|\hat{x} - \pi/4| \le 2^{-k}$ and provide it to the leader. For communication efficiency purposes, it is best if $\hat{x}$ itself requires only *k* bits to be communicated, or perhaps one more (for the sign) in case the parameter is constrained to be between −1 and 1. It is even better if the custodian can supply a *k*-bit *truncation* because this opens the possibility of upgrading it to a $(k+1)$-bit truncation by the transmission of a single bit upon request from the leader, as we have done explicitly for the simulation of entanglement at line 19 of Algorithm 3 in Section 3.

Nevertheless, it may be impossible for the custodians to compute truncations of their own parameters in some cases, even when they can compute arbitrarily precise approximations. Consider for instance a situation in which one parameter held by a custodian is $x = \cos\theta$ for some angle $\theta$ for which he can only obtain arbitrarily precise truncations. Unbeknownst to the custodian, $\theta = \pi/3$ and therefore $x = 1/2$. No matter how many bits the custodian obtains in the truncation of $\theta$, however, he can never decide whether $\theta > \pi/3$ or $\theta \le \pi/3$. In the first case, $x < 1/2$ and therefore the 1-bit truncation of *x* should be 0, whereas in the second (correct) case, $x \ge 1/2$ and therefore the 1-bit truncation of *x* is 1/2 (or 0.1 in binary). Thus, the custodian will be unable to respond if the leader asks him for a 1-bit truncation of *x*, no matter how much time he spends on the task.
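
The obstruction can be seen numerically. In the following sketch (the precision policy is our choice), the interval that a *k*-bit truncation of $\theta$ induces for $\cos\theta$ straddles 1/2 for every *k*, so the first bit of a truncation of *x* is never resolved:

```python
from mpmath import mp, mpf, floor, cos, pi

# cos is decreasing on [0, pi], so a k-bit truncation t of theta = pi/3
# confines x = cos(theta) to the interval [cos(t + 2**-k), cos(t)].
for k in (8, 16, 32, 64):
    mp.prec = k + 32
    t = floor(pi / 3 * 2**k) / mpf(2)**k      # k-bit truncation of theta
    lo, hi = cos(t + mpf(2)**(-k)), cos(t)
    print(k, lo < mpf('0.5') < hi)            # True for every k
```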

In this example, by contrast, the custodian can supply the leader with arbitrarily precise *approximations* of *x* from appropriate approximations of $\theta$. Should a situation like this occur, for instance in the simulation of entanglement, there would be two solutions. The first one is for the custodian to transmit increasingly precise truncations of $\theta$ to the leader and let *him* compute the cosine. This approach is only valid if it is known at the outset that the custodian's parameter will be of that form, which was essentially the solution taken in our earlier work on the simulation of the quantum *m*-partite GHZ distribution [1]. The more general solution is to modify the protocol and declare that custodians can send arbitrary approximations to the leader rather than truncations. The consequence on Algorithm 3 is that line 19 would become much more expensive since each custodian *i* would have to transmit a fresh one-bit-better approximation for the real and imaginary parts of the $d_i^2$ entries defining matrix $M^{i}_{X_i}$. As a result, efficiency parameter $\gamma(t)$ in Equation (6) would become

$$\gamma(t) = m + \left(t + 2 + \lceil 2\lg d \rceil + 2\lceil \lg m \rceil \right) \sum_{i=1}^{m} d_i^2,$$

which should be compared with the much smaller (constant) value of $\gamma$ given in Equation (12) when truncations of the parameters are available. Nevertheless, taking $t_0 = \lg n$ again, this increase in $\gamma(t)$ would make no significant difference in the total number of bits transmitted for the simulation of entanglement because it would increase only the right-hand term in Equations (14) and (20), but not enough to make it dominate the left-hand term. All counted, we still have an expected number of bits transmitted that is upper-bounded by $(3\xi^3 \lg \xi)\,m^2 + O(m \log m)$ whenever all the $n_i$'s and $d_i$'s are upper-bounded by some constant $\xi$, which again is on the order of $m^2$.

#### **5. Discussion and Open Problems**

We have introduced and studied the general problem of sampling a discrete probability distribution characterized by parameters that are scattered in remote locations. Our main goal was to minimize the amount of communication required to solve this conundrum. We considered both the unrealistic model in which arithmetic can be carried out with infinite precision and continuous random variables can be sampled exactly, and the more reasonable *random bit model* studied by Knuth and Yao [21], in which all arithmetic is carried out with finite precision and the only source of randomness comes from independent tosses of a fair coin. For a small increase in the amount of communication, we can fine-tune our technique to require only twice the number of random bits that would be provably required in the standard context in which all the parameters defining the probability distribution would be available in a single location, provided the entropy of the distribution is close to maximal.

When our framework is applied to the problem of simulating quantum entanglement with classical communication in its essentially most general form, we find that an expected number of $O(m^2)$ bits of communication suffices when there are *m* participants and each one of them (in the simulated world) is given an arbitrary quantum system of bounded dimension and asked to perform an arbitrary generalized measurement (POVM) with a bounded number of possible outcomes. This result generalizes and supersedes the best approach previously known in the context of multi-party entanglement, which was for the simulation of the *m*-partite GHZ state under projective measurements [1]. Our technique also applies without the boundedness condition on the dimension of individual systems and the number of possible outcomes per party, provided those parameters remain finite.

It would be preferable if we could eliminate the dependency of the expected number of bits of communication on the number of possible measurement outcomes. Is perfect simulation possible at all when that number is infinite, regardless of communication efficiency, a scenario in which our approach cannot be applied? In the bipartite case, Serge Massar, Dave Bacon, Nicolas Cerf, and Richard Cleve proved that classical communication can serve to simulate the effect of arbitrary measurements on maximally entangled states in a way that does not require any bounds on the number of possible outcomes [6]. More specifically, they showed that arbitrary POVMs on systems of *n* Bell states can be simulated with an expected $O(n\,2^n)$ bits of communication. However, their approach exploits the equivalence of this problem with a variant known as *classical teleportation* [5], in which one party has full knowledge of the quantum state and the other has full knowledge of the measurement to be applied to that state. Unfortunately, the equivalence between those two problems breaks down in a multipartite scenario and there is no obvious way to extend the approach. We leave as an open question the possibility of a simulation protocol in which the expected amount of communication would only depend on the number of participants and the dimension of their simulated quantum systems.

Our work leaves several additional important questions open. Recall that our approach provides a bound on the *expected* communication required to perform exact remote sampling. The most challenging open question is to determine if it is possible to achieve the same goal with a guaranteed bounded amount of communication *in the worst case*. If possible, this would certainly require the participants to share ahead of time the realization of random variables, possibly even continuous ones. Furthermore, a radically different approach would be needed since ours is based on the von Neumann rejection algorithm, which has intrinsically no worst-case upper bound on its performance. This task may seem hopeless, but it has been shown to be possible for special cases of entanglement simulation in which the remote parameters are taken from a continuum of possibilities [3,8], despite earlier "proofs" that it is impossible [2].

A much easier task would be to consider other communication models, in which communication is no longer restricted to being between a single leader and various custodians. Would there be an advantage in communicating through the edges of a complete graph? Obviously, the maximum possible savings in terms of communication would be a factor of 2 since any time one participant wants to send a bit to some other participant, he can do so via the leader. However, if we care not only about the total number of bits communicated, but also the *time* it takes to complete the protocol in a realistic model in which each party is limited to sending and receiving a fixed number of bits at any given time step, parallelizing communication could become valuable. We had already shown in Ref. [1] that a parallel model of communication can dramatically improve the time needed to sample the *m*-partite GHZ distribution. Can this approach be generalized to arbitrary remote sampling settings?

Finally, we would like to see applications for remote sampling outside the realm of quantum information science.

**Author Contributions:** According to the tradition in our field, the authors are listed in alphabetical order. Conceptualization, G.B., L.D. and C.G.; Formal Analysis, G.B. and C.G.; Supervision, G.B. and L.D.; Validation, L.D.; Writing—Original Draft, C.G.; Writing—Review and Editing, G.B. and C.G.

**Funding:** The work of G.B. is supported in part by the Canadian Institute for Advanced Research, the Canada Research Chair program, Canada's Natural Sciences and Engineering Research Council (NSERC) and Québec's Institut transdisciplinaire d'information quantique. The work of L.D. is supported in part by NSERC.

**Acknowledgments:** The authors are very grateful to Nicolas Gisin for his interest in this work and the many discussions we have had with him on this topic in the past decade. Marc Kaplan provided important insights in earlier joint work on the simulation of entanglement. We also acknowledge useful suggestions provided by the anonymous referees, including the suggestion to look into Ref. [6].

**Conflicts of Interest:** The authors declare no conflict of interest. The funding sponsors had no role in the design of the study, in the writing of the manuscript, and in the decision to publish the results.

#### **Abbreviations**

The following abbreviations are used in this manuscript:

GHZ: Greenberger–Horne–Zeilinger
POVM: Positive operator-valued measure

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **A Classical Interpretation of the Scrooge Distribution**

#### **William K. Wootters**

Department of Physics, Williams College, Williamstown, MA 01267, USA; william.wootters@williams.edu; Tel.: +1-413-597-2156

Received: 25 June 2018; Accepted: 15 August 2018; Published: 20 August 2018

**Abstract:** The Scrooge distribution is a probability distribution over the set of pure states of a quantum system. Specifically, it is the distribution that, upon measurement, gives up the *least* information about the identity of the pure state compared with all other distributions that have the same density matrix. The Scrooge distribution has normally been regarded as a purely quantum mechanical concept with no natural classical interpretation. In this paper, we offer a classical interpretation of the Scrooge distribution viewed as a probability distribution over the probability simplex. We begin by considering a real-amplitude version of the Scrooge distribution for which we find that there is a non-trivial but natural classical interpretation. The transition to the complex-amplitude case requires a step that is not particularly natural but that may shed light on the relation between quantum mechanics and classical probability theory.

**Keywords:** subentropy; GAP measure; accessible information

#### **1. Introduction**

In the early days of quantum information theory, the term "quantum communication" would typically have been understood to refer to the transmission of *classical* information via quantum mechanical signals. Such communication can be done in a sophisticated way, with the receiver making joint measurements on several successive signal particles [1,2], or it can be done in a relatively straightforward way with the receiver performing a separate measurement on each individual signal particle. In both cases, but especially in the latter case, a particularly interesting quantity, given an ensemble of quantum states to be used as an alphabet, is the ensemble's *accessible information*. This is the maximum amount of information that one can obtain about the identity of the state, on average, by making a measurement on the system described by the specified ensemble. The average here is over the outcomes of the measurement, and the maximization is over all possible measurements. In general, accessible information can be defined for ensembles consisting of pure and mixed states, but in this paper, we consider only pure-state ensembles.

Any ensemble $\{(|\psi_j\rangle, p_j)\}$ of pure quantum states with their probabilities has a unique density matrix. However, for any given density matrix *ρ* representing more than a single pure state, there are infinitely many ensembles—"*ρ*-ensembles"—described by that density matrix. Thus, it is natural to ask the following question: for a given density matrix *ρ*, what pure-state *ρ*-ensemble has the greatest value of the accessible information and what pure-state *ρ*-ensemble has the lowest value? The former question was answered by an early (1973) result in quantum information theory [3]—the pure-state *ρ*-ensemble with the greatest accessible information is the one consisting of the eigenstates of *ρ* with weights given by the eigenvalues. The latter question was answered in a 1994 paper [4], in which the *ρ*-ensemble minimizing the accessible information was called the Scrooge ensemble, or Scrooge distribution, since it is the ensemble that is most stingy with its information.

To see a simple example, consider a spin-1/2 particle whose density matrix *ρ* has the $|{\uparrow}\rangle$ and $|{\downarrow}\rangle$ states as its eigenvectors, with eigenvalues $\lambda_\uparrow$ and $\lambda_\downarrow$. The eigenstate ensemble for *ρ*, that is, the *ρ*-ensemble from which one can extract the most information, is the two-state ensemble consisting of the $|{\uparrow}\rangle$ state with probability $\lambda_\uparrow$ and the $|{\downarrow}\rangle$ state with probability $\lambda_\downarrow$. The optimal measurement in this case—the measurement that provides the most information—is the up-down measurement, and the amount of information it provides is equal to the von Neumann entropy of the density matrix:

$$I = S(\rho) = -(\lambda\_{\uparrow} \ln \lambda\_{\uparrow} + \lambda\_{\downarrow} \ln \lambda\_{\downarrow}).\tag{1}$$

On the other hand, the Scrooge ensemble for this density matrix is represented by a continuous probability distribution over the whole surface of the Bloch sphere. If $\lambda_\uparrow$ is larger than $\lambda_\downarrow$, then this continuous distribution is weighted more heavily towards the top of the sphere. We can write the Scrooge distribution explicitly in terms of the variable $x = (1 + \cos\theta)/2$, where $\theta$ is the angle measured from the north pole:

$$
\sigma(\mathbf{x}) = \frac{2}{\lambda\_\uparrow \lambda\_\downarrow} \cdot \frac{1}{\left(\frac{\mathbf{x}}{\lambda\_\uparrow} + \frac{1-\mathbf{x}}{\lambda\_\downarrow}\right)^3}.\tag{2}
$$

The probability density $\sigma(x)$ is normalized in the sense that $\int_0^1 \sigma(x)\,dx = 1$ (the distribution is uniform over the azimuthal angle). Again, this is the ensemble of pure states from which one can extract the least information about the identity of the pure state, among all ensembles with the density matrix *ρ*. Somewhat remarkably, the average amount of information one gains by measuring this particular ensemble is entirely independent of the choice of measurement, as long as the measurement is complete—that is, as long as each outcome is associated with a definite pure state. This amount of information comes out to be a quantity called the subentropy *Q* of the density matrix:

$$I = Q(\rho) = -\frac{\lambda\_\uparrow^2 \ln \lambda\_\uparrow - \lambda\_\downarrow^2 \ln \lambda\_\downarrow}{\lambda\_\uparrow - \lambda\_\downarrow}. \tag{3}$$

We give more general expressions for both the Scrooge ensemble and the subentropy in Section 2 below.
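
As a quick sanity check on Equation (2), the normalization can be verified symbolically; the eigenvalues 3/4 and 1/4 below are an arbitrary example of ours.

```python
from sympy import symbols, integrate, Rational

x = symbols('x')
lu, ld = Rational(3, 4), Rational(1, 4)              # example eigenvalues
sigma = 2 / (lu * ld) / (x / lu + (1 - x) / ld)**3   # Equation (2)
print(integrate(sigma, (x, 0, 1)))                   # -> 1
```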

In recent years, the Scrooge distribution has made other appearances in the physics literature. Of particular interest is the fact that this distribution has emerged from an entirely different line of investigation, in which the system under consideration is entangled with a large environment and the whole system is in a pure state. In that case, if one looks at the *conditional* pure states of the original system relative to the elements of an orthogonal basis of the environment, one typically finds that these conditional states are distributed by a Scrooge distribution [5–8]. In this context, the distribution is usually called a GAP measure (Gaussian adjusted projected measure, the three adjectives corresponding to the three steps by which the measure can be constructed). On another front, the Scrooge distribution has been used to address the difficult problem of bounding the *locally* accessible information when there is more than one receiver [9].

Meanwhile, the concept of subentropy, which originally arose (though without a name) in connection with the outcome entropy of random measurements [10,11], has appeared not only in problems concerning the acquisition of classical information [12–14], but also in the quantification of entanglement [15] and the study of quantum coherence [16–19]. Many detailed properties of subentropy have now been worked out, especially concerning its relation to the Shannon entropy [20–24].

Though it is possible to devise a strictly classical situation in which subentropy arises [22], the Scrooge distribution has generally been regarded as a purely quantum mechanical concept. It is, after all, a probability distribution over pure quantum states. The aim of this paper is to provide a classical interpretation of the Scrooge distribution, and in this way, to provide a new window into the relation between quantum mechanics and classical probability theory.

We find that it is much easier to make the connection if we begin by considering not the standard Scrooge distribution, but rather the analogous distribution one obtains for the case of quantum theory with *real* amplitudes. In that case, the dimension of the set of pure states is the same as the dimension of the associated probability simplex, and we find that there is a fairly natural distribution within classical probability theory that is essentially identical to the real-amplitude version of the Scrooge distribution. This distribution arises as the solution to a certain classical communication problem that we describe in Section 4.

With this interpretation of the real-amplitude Scrooge distribution in hand, we ask how the classical communication scenario might be modified to arrive at the original Scrooge ensemble for standard, complex-amplitude quantum theory. As we will see, the necessary modification is not particularly natural, but it is simple.

Thus, we begin in Sections 2 and 3 by reviewing the derivation of the Scrooge distribution and by working out the analogous distribution for the case of real amplitudes. Then, in Section 4, we set up and analyze the classical communication problem that, as we show in Section 5, gives rise to a distribution that is equivalent to the real-amplitude Scrooge distribution. In Section 6, we modify the classical communication scenario to produce the standard, complex-amplitude Scrooge distribution. Finally, we summarize and discuss our results in Section 7.

#### **2. The Scrooge Distribution**

There are several ways in which one can generate the Scrooge distribution. In this section, we review the main steps of the derivation given in Ref. [4], which applies to a Hilbert space of finite dimension. (The distribution can also be defined for an infinite-dimensional Hilbert space [5–8].) We begin by setting up the problem.

We imagine the following scenario. One participant, Alice, prepares a quantum system with an *n*-dimensional Hilbert space in a pure state $|x\rangle$ and sends it to Bob. Bob then tries to gain information about the identity of this pure state. Initially, Bob's state of knowledge is represented by a probability density $\sigma(x)$ over the set of pure states. (The symbol *x* represents a multi-dimensional parameterization of the set of pure states.) Bob makes a measurement on the system and thereby gains information. The amount of information he gains may depend on the outcome he obtains, so we are interested in the *average* amount of information he gains about *x*, the average being over all outcomes.

The standard quantification of Bob's average gain in information is the Shannon mutual information between the identity of the pure state and the outcome of the measurement. We can express this mutual information in terms of two probability functions: (i) the probability $p(j|x)$ of the outcome *j* when the state is $|x\rangle$, and (ii) the overall probability $p(j) = \int \sigma(x)\,p(j|x)\,dx$ of the outcome *j* averaged over the whole ensemble. In terms of these functions, the mutual information is

$$I = -\sum\_{j} p(j) \ln p(j) + \int \sigma(\mathbf{x}) \left[ \sum\_{j} p(j|\mathbf{x}) \ln p(j|\mathbf{x}) \right] d\mathbf{x}.\tag{4}$$

The *accessible information* of the ensemble defined by *σ*(*x*) is the maximum value of the mutual information *I*, where the maximum is taken over all possible measurements.

Again, for a given density matrix *ρ*, the Scrooge distribution is defined to be the pure-state *ρ*-ensemble with the lowest value of the accessible information. One can obtain the Scrooge distribution via the following algorithm [4].

We start by recalling the concept of "*ρ* distortion." Consider for now a finite ensemble $\{(|\psi_i\rangle, p_i)\}$ of pure states ($i = 1, \ldots, m$) whose density matrix is the completely mixed state:

$$\sum\_{i=1}^{m} p\_i |\psi\_i\rangle\langle\psi\_i| = \frac{1}{n}I.\tag{5}$$

Let $|\tilde{\psi}_i\rangle$ be the subnormalized state vector $|\tilde{\psi}_i\rangle = \sqrt{p_i}\,|\psi_i\rangle$, so that

$$\sum_{i=1}^{m} |\tilde{\psi}_i\rangle\langle\tilde{\psi}_i| = \frac{1}{n}I.\tag{6}$$

Under *ρ* distortion, each vector $|\tilde{\psi}_i\rangle$ is mapped to another subnormalized vector $|\tilde{\phi}_i\rangle$ defined by

$$|\tilde{\phi}_i\rangle = \sqrt{n\rho}\,|\tilde{\psi}_i\rangle. \tag{7}$$

Note that the density matrix formed by the $|\tilde{\phi}_i\rangle$'s is *ρ*:

$$\sum_{i=1}^{m} |\tilde{\phi}_i\rangle\langle\tilde{\phi}_i| = \sqrt{n\rho} \left(\frac{1}{n}I\right) \sqrt{n\rho} = \rho. \tag{8}$$

In terms of normalized vectors, the new ensemble is $\{(|\phi_i\rangle, q_i)\}$, with the new probabilities $q_i$ equal to

$$q_i = \langle \tilde{\phi}_i | \tilde{\phi}_i \rangle = n\, p_i \langle \psi_i | \rho | \psi_i \rangle. \tag{9}$$

In this way, any ensemble having the completely mixed density matrix can be mapped to a "*ρ* distorted" ensemble with a density matrix *ρ*.

The Scrooge ensemble is a continuous ensemble, not a discrete one, but the concept of *ρ* distortion can be immediately extended to the continuous case, and the Scrooge distribution can be easily characterized in those terms; it is the *ρ* distortion of the *uniform* distribution over the unit sphere in Hilbert space. The uniform distribution is the unique probability distribution over the set of pure states that is invariant under all unitary transformations.

Let us see how the *ρ* distortion works out in this case. First, for the uniform distribution, it is convenient to label the parameters of the pure states by *y* instead of *x*, so that we can reserve *x* for the Scrooge distribution. Let *τ*(*y*) be the probability density over *y* that represents the uniform distribution over the unit sphere (a particular parameterization will be specified shortly). In terms of normalized states, a *ρ* distortion maps each pure state $|y\rangle$ into the pure state $|x\rangle$ defined by

$$|x\rangle = \frac{\sqrt{\rho}|y\rangle}{\sqrt{\langle y|\rho|y\rangle}}.\tag{10}$$

This mapping defines *x* as a function of *y*: *x* = *f*(*y*). (We write *f* explicitly below.) The resulting probability density over *x* is obtained from the continuous version of Equation (9).

$$
\sigma(\mathbf{x}) = n\tau(y)\langle y|\rho|y\rangle \mathcal{J}(y/\mathbf{x}).\tag{11}
$$

Here, $\mathcal{J}(y/x)$ is the Jacobian of the *y* variables with respect to the *x* variables. On the right-hand side of Equation (11), each *y* is interpreted as $f^{-1}(x)$, so that we get an expression that depends only on *x*.

To get an explicit expression for the Scrooge distribution—that is, an explicit expression for the probability density $\sigma(x)$—we need to choose a specific set of parameters labeling the pure states. We choose the same set of parameters to label both the uniform distribution (where we call the parameters *y*) and the Scrooge distribution (where we call the parameters *x*). We define our parameters relative to a set of normalized eigenstates $|e_j\rangle$ of the density matrix *ρ*. A general pure state $|x\rangle$ can be written as

$$|\mathbf{x}\rangle = \sum\_{j=1}^{n} a\_j e^{-i\theta\_j} |e\_j\rangle,\tag{12}$$

where each $a_j$ is a non-negative real number, and each phase $\theta_j$ runs from zero to $2\pi$. For definiteness, employing the freedom to choose an overall phase, we define $\theta_n$ to be zero. We take *x* (or *y*) to consist of the following parameters: the squared amplitudes $x_j = a_j^2$ for $j = 1, \ldots, n-1$, and the phases $\theta_j$ for $j = 1, \ldots, n-1$. This set of $2n - 2$ parameters uniquely identifies any pure state. Later, we also use the symbol $x_n = 1 - x_1 - \cdots - x_{n-1}$. Note that the $x_j$'s are the probabilities of the outcomes of a particular orthogonal measurement associated with the eigenstates of *ρ*.


In terms of these parameters, the uniform distribution over the unit sphere takes a particularly simple form: it is the product of a uniform distribution over the phases and a uniform distribution over the $(n-1)$-dimensional probability simplex whose points are labeled by $\{x_1, \ldots, x_{n-1}\}$ [25]. The Scrooge distribution will likewise be a product and will be uniform over the phases but will typically have a certain bias over the probability simplex. Because the phases are always independent and uniformly distributed in the cases we consider, we omit the phases in our distribution expressions, writing the probability densities as functions of $\{x_1, \ldots, x_{n-1}\}$ (or $\{y_1, \ldots, y_{n-1}\}$).
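
This fact is easy to confirm by Monte Carlo: normalized complex Gaussian vectors are uniformly (Haar) distributed on the unit sphere, and the induced squared amplitudes land uniformly on the simplex, so each $x_j$ has the moments of a Beta$(1, n-1)$ variable. The sample sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 3, 200_000
z = rng.normal(size=(trials, n)) + 1j * rng.normal(size=(trials, n))
y = np.abs(z)**2 / np.sum(np.abs(z)**2, axis=1, keepdims=True)
print(y.mean(axis=0))        # each ~ 1/n = 0.333...
print((y[:, 0]**2).mean())   # ~ 2/(n(n+1)) = 1/6, the uniform-simplex value
```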

Our aim now is to find explicit expressions for each of the factors appearing on the right-hand side of Equation (11). Since the uniform distribution over the unit sphere induces a uniform distribution over the probability simplex, the corresponding probability density *τ*(*y*) is a constant function, with the value of the constant being (*n* − 1)! as required by normalization:

$$(n-1)!\int\_0^1 \int\_0^{1-y\_1} \cdots \int\_0^{1-y\_1-\cdots-y\_{n-2}} dy\_{n-1} \cdots dy\_2 dy\_1 = 1. \tag{13}$$

The function *f*(*y*) defined by the *ρ*-distortion map, Equation (10), is given by

$$x\_j = \frac{\lambda\_j y\_j}{\lambda\_1 y\_1 + \dots + \lambda\_n y\_n}, \quad j = 1, \dots, n - 1,\tag{14}$$

where the *λj*'s are the eigenvalues of the density matrix *ρ*. One finds that the inverse map is

$$y\_j = \frac{\mathbf{x}\_j / \lambda\_j}{\mathbf{x}\_1 / \lambda\_1 + \dots + \mathbf{x}\_n / \lambda\_n},\tag{15}$$

and the Jacobian is

$$\mathcal{J}(y/\mathbf{x}) = \frac{1}{\lambda_1 \cdots \lambda_n} \cdot \frac{1}{\left(\frac{x_1}{\lambda_1} + \cdots + \frac{x_n}{\lambda_n}\right)^{n}}.\tag{16}$$

Meanwhile, the factor $\langle y|\rho|y\rangle$ can be written as

$$\langle y|\rho|y\rangle = \lambda_1 y_1 + \cdots + \lambda_n y_n = \frac{1}{\frac{x_1}{\lambda_1} + \cdots + \frac{x_n}{\lambda_n}}.\tag{17}$$

By substituting the expressions from Equations (16) and (17) into Equation (11), we finally arrive at the probability density defining the Scrooge distribution:

$$\sigma(\mathbf{x}) = \frac{n!}{\lambda\_1 \cdots \lambda\_n} \cdot \frac{1}{\left(\frac{\mathbf{x}\_1}{\lambda\_1} + \cdots + \frac{\mathbf{x}\_n}{\lambda\_n}\right)^{n+1}}.\tag{18}$$

This probability density is normalized in the sense that the integral over the probability simplex is unity:

$$\int\_{0}^{1} \int\_{0}^{1-\mathbf{x}\_{1}} \cdots \int\_{0}^{1-\mathbf{x}\_{1}-\cdots-\mathbf{x}\_{n-2}} \sigma(\mathbf{x}) d\mathbf{x}\_{n-1} \cdots d\mathbf{x}\_{2} d\mathbf{x}\_{1} = 1. \tag{19}$$
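
Because the Scrooge distribution is the *ρ* distortion of the uniform ensemble, it can be sampled by reweighting uniform simplex points with the factor $\langle y|\rho|y\rangle$ and applying the map of Equation (14). The rejection sketch below (eigenvalues and sample size are arbitrary choices of ours) illustrates this.

```python
import numpy as np

rng = np.random.default_rng(0)

def scrooge_sample(lams, size):
    """Sample from the Scrooge density (18): draw y uniformly on the simplex,
    accept with probability <y|rho|y> / lambda_max (the weight of Eq. (9)),
    then apply the distortion map x_j = lambda_j y_j / <y|rho|y> (Eq. (14))."""
    lams, out = np.asarray(lams), []
    while len(out) < size:
        y = rng.dirichlet(np.ones(len(lams)))
        if rng.random() < (lams @ y) / lams.max():
            out.append(lams * y / (lams @ y))
    return np.array(out)

xs = scrooge_sample([0.75, 0.25], 100_000)
print(xs[:, 0].mean())   # ~ 0.75: the ensemble must average to rho itself
```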

Now, how do we know that the distribution given by Equation (18) minimizes the amount of accessible information? First, one can show that for this distribution the mutual information *I* is independent of the choice of measurement as long as the measurement is complete [4]. So, one can compute the value of the accessible information by considering any such measurement, and the easiest one to consider is the orthogonal measurement along the eigenstates. The result is

$$\text{accessible information} = -\sum_{k=1}^{n} \left( \prod_{l \neq k} \frac{\lambda_k}{\lambda_k - \lambda_l} \right) \lambda_k \ln \lambda_k, \tag{20}$$

which defines the subentropy *Q*. One can also show that for *any ρ*-ensemble, the *average* mutual information over all complete orthogonal measurements is equal to *Q*, which implies that *Q* is always a lower bound on the accessible information. Since the Scrooge distribution *achieves* the value *Q*, it achieves the minimum possible accessible information among all *ρ*-ensembles.
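
A direct numerical check that Equation (20) reduces to Equation (3) when *n* = 2 (the eigenvalues are an arbitrary example):

```python
import numpy as np

def subentropy(lams):
    """Subentropy Q of Equation (20); assumes distinct eigenvalues."""
    lams = np.asarray(lams, dtype=float)
    return -sum(
        np.prod([lk / (lk - ll) for ll in lams if ll != lk]) * lk * np.log(lk)
        for lk in lams
    )

lu, ld = 0.75, 0.25
print(subentropy([lu, ld]))                                    # Equation (20)
print(-(lu**2 * np.log(lu) - ld**2 * np.log(ld)) / (lu - ld))  # Equation (3)
```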

#### **3. The Real-Amplitude Analog of the Scrooge Distribution**

Though our own world is described by standard quantum theory with complex amplitudes, we can also consider an analogous, hypothetical theory with real amplitudes. A pure state in the real-amplitude theory is represented by a real unit vector, and a density matrix is represented by a symmetric real matrix with non-negative eigenvalues and unit trace. Time evolution in this theory is generated by an antisymmetric real operator in place of the antihermitian operator *iH*.

The question considered in the preceding section can also be asked in regard to the real-amplitude theory. Given a density matrix *ρ*, we ask what *ρ*-ensemble has the smallest value of accessible information. It turns out that essentially all of the methods used in the preceding section continue to work in the real case. Again one begins with the uniform distribution over the unit sphere of pure states, and again, one obtains the Scrooge ensemble (in this case the real-amplitude Scrooge ensemble) via *ρ* distortion. The arguments leading to the conclusion that the ensemble produced in this way minimizes the accessible information work just as well in the real-amplitude case as in the complex-amplitude case.

The one essential difference between the two cases lies in the form of the initial probability density *τ*(*y*) that is associated with the uniform distribution over the unit sphere in Hilbert space. Whereas in the complex case the induced distribution over the probability simplex is uniform, in the real case, the induced distribution over the probability simplex is more heavily weighted towards the edges and corners.

We can see an example by considering the case with *n* = 2. Instead of starting with a uniform distribution over the surface of the Bloch sphere, one starts with a uniform distribution over the unit circle in a two-dimensional real vector space. Let $\gamma$ be the angle around this circle measured from some chosen axis (once a density matrix has been specified, we will take this axis to be along one of the eigenstates of the density matrix). Then, $\gamma$ is initially uniformly distributed. The parameter analogous to $y_1$ of the preceding section is $y = \sin^2\gamma$. Note that *y* runs from 0 to 1 as $\gamma$ runs from 0 to $\pi/2$. The initial probability density $\tau_r(y)$ is therefore obtained from

$$\tau_r(y)\,dy = (2/\pi)\,d\gamma, \tag{21}$$

which leads to

$$\tau_r(y) = \frac{1}{\pi} \cdot \frac{1}{\sqrt{y(1-y)}}\tag{22}$$

(the subscript *r* represents "real"). This is in contrast to the function $\tau(y) = 1$ that would apply in the complex-amplitude case. We see that in the real case, $\tau_r(y)$ is largest around $y = 0$ and $y = 1$.
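
Equation (22) is the arcsine distribution, and a short simulation confirms it; the cutoff 1/4 and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = rng.uniform(0, 2 * np.pi, 500_000)
y = np.sin(gamma)**2
# CDF check: the integral of 1/(pi*sqrt(y(1-y))) from 0 to 1/4
# equals (2/pi) * arcsin(1/2) = 1/3.
print((y < 0.25).mean())   # ~ 0.333
```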

For *n* dimensions, we take as our parameters specifying a pure state (i) the first *n* − 1 probabilities *yj* (*j* = 1, ... , *n* − 1) of the outcomes of a certain orthogonal measurement (which we will choose to be the measurement along the eigenvectors of the given density matrix), and (ii) a set of discrete phase parameters (each of them taking the values ±1), which will always be independently and uniformly distributed and therefore suppressed in our expressions for the probability densities.

For the uniform distribution over the unit sphere in the *n*-dimensional real Hilbert space, one can show that the induced distribution over the parameters (*y*1,..., *yn*−1) is given by [26]

$$\tau_r(y) = \frac{\Gamma(n/2)}{\pi^{n/2}} \cdot \frac{1}{\sqrt{y_1 \cdots y_n}},\tag{23}$$

where $y_n = 1 - y_1 - \cdots - y_{n-1}$. This probability density is normalized over the probability simplex, as in Equation (19):

$$\int_0^1 \int_0^{1-y_1} \cdots \int_0^{1-y_1-\cdots-y_{n-2}} \tau_r(y)\, dy_{n-1} \cdots dy_2\, dy_1 = 1. \tag{24}$$

The general expression for *σ*(*x*) given in Equation (11) remains valid in the real case, as do Equations (15)–(17) for the various factors in Equation (11). Again, the one difference is in *τr*(*y*), for which we now use Equation (23). By combining these ingredients, we arrive at our expression for the real-amplitude Scrooge ensemble:

$$\sigma_r(\mathbf{x}) = \frac{n\,\Gamma(n/2)}{\pi^{n/2}\sqrt{\lambda_1 \cdots \lambda_n}\,\sqrt{x_1 \cdots x_n}\, \left(\frac{x_1}{\lambda_1} + \cdots + \frac{x_n}{\lambda_n}\right)^{\frac{n}{2}+1}},\tag{25}$$

where, as before, the *λj*'s are the eigenvalues of the density matrix whose Scrooge distribution is being computed.

Though Equation (25) was derived as a distribution over the set of pure states in real-amplitude quantum theory, it reads as a probability distribution over the (*n* − 1)-dimensional probability simplex for a classical random variable with *n* possible values. One can therefore at least imagine that there might be a classical scenario in which this distribution is natural. In the following section, we identify such a scenario.

#### **4. Communicating with Dice**

Ref. [26] imagined the following classical communication scenario. Alice is trying to convey to Bob the location of a point in an (*n* − 1)-dimensional probability simplex. To do this, she constructs a weighted *n*-sided die that, for Bob, has the probabilities corresponding to the point that Alice is trying to convey. She then sends the die to Bob, who rolls the die many times in order to estimate the probabilities of the various possible outcomes. However, the information transmission is limited in that Bob is allowed only a fixed number of rolls—let us call this number *N* (perhaps the die automatically self-destructs after *N* rolls). So, Bob will always have an imperfect estimate of the probabilities that Alice is trying to convey. Alice and Bob are allowed to choose in advance a discrete set of points in the probability simplex—these are the points representing the set of signals Alice might try to send—and they choose this set of points, along with their *a priori* weights, so as to maximize the mutual information between the identity of the point being conveyed and the result of Bob's rolls of the die. The main result of that paper was that in the limit of a large *N*, the optimal distribution of points in the probability simplex approximates the continuous distribution over the simplex expressed by the following probability density:

$$\hat{\tau}(y) = \frac{\Gamma(n/2)}{\pi^{n/2}} \cdot \frac{1}{\sqrt{y_1 \cdots y_n}},\tag{26}$$

where the *yj*s are the probabilities (we use a hat in our labels of probability densities that arise in a classical context). This result is interesting because it is the same probability density as the one induced by the uniform distribution over the unit sphere in real Hilbert space (Equation (23) above). Thus, in a world based on real-amplitude quantum theory as opposed to the complex-amplitude theory, there is a sense in which one could say that nature optimizes the transfer of information.

That paper—and closely related papers [27,28]—deal only with the uniform distribution over the unit sphere, not with non-trivial Scrooge distributions. In the present section, we consider a modification of the above communication scenario, and in the next section, we show that this modified scheme yields the real-amplitude Scrooge distribution.

A natural way to generalize the above communication scheme is this: let the allowed number *N* of rolls of the die vary from one die to another (that is, some dice last longer than others before they self-destruct). Now, once *N* is allowed to vary, it makes sense to let *N* itself be another random variable that conveys information. We are thus led to consider the following scenario.

Alice is trying to convey to Bob an ordered *n*-tuple of non-negative real numbers (*M*1, ... , *Mn*) (Alice and Bob agree in advance on a specific set of such ordered *n*-tuples, any one of which Alice might try to convey). Let us refer to such an *n*-tuple as a "signal." In order to convey her signal, Alice sends Bob an *n*-sided die that Bob then begins to roll over and over, keeping track of the number of times each outcome occurs. *Nj* is the number of times that the outcome *j* occurs. At some point, the die self-destructs. Alice has constructed both the weighting of the die and the self-destruction mechanism so that the *average* value of *Nj* is *Mj*.

However, both the rolling of the die and its duration are probabilistic, and Alice cannot completely control either the individual numbers *Nj* or their sum. For any given signal (*M*1, ... , *Mn*), we assume that each *Nj* is distributed independently according to a Poisson distribution with mean value *Mj*:

$$P(N\_1, \ldots, N\_n | M\_1, \ldots, M\_n) = \prod\_{j=1}^n e^{-M\_j} \frac{M\_j^{N\_j}}{N\_j!}.\tag{27}$$

This is equivalent to assuming that the *total* number *N* of rolls of the die is Poisson distributed with a mean value of *M* = *M*<sup>1</sup> + ··· + *Mn* and that for a given total number of rolls, the numbers of occurrences of the individual outcomes are distributed according to a multinomial distribution with weights *Mj*/*M*. That is, we are assuming the usual statistics for rolling a die, together with a Poisson distribution for the total number of rolls (another model we could have used is to have Alice send Bob a radioactive sample that can decay in *n* ways and that Bob is allowed to observe with detectors for a fixed amount of time).
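
The equivalence between the two descriptions is easy to demonstrate empirically; the mean values and the sample count below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
M, trials = np.array([2.0, 3.0, 5.0]), 100_000

# Description 1: independent Poisson counts, one per outcome.
N1 = rng.poisson(M, size=(trials, 3))
# Description 2: a Poisson total, split multinomially with weights M_j / M.
tot = rng.poisson(M.sum(), size=trials)
N2 = np.array([rng.multinomial(t, M / M.sum()) for t in tot])

print(N1.mean(axis=0), N2.mean(axis=0))  # both ~ M
print(N1.var(axis=0), N2.var(axis=0))    # both ~ M, as a Poisson demands
```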

To make the problem interesting, and to keep Alice from being able to send Bob an arbitrarily large amount of information in a single die, limits are placed on the sizes of $M_1, \ldots, M_n$. This is done by imposing, for each *j*, an upper bound $\mathcal{M}_j$ (script *M*) on the expectation value of the number of times the *j* outcome occurs. This expectation value is an average over all the possible signals that Alice might send.

We also need to say in what sense Alice and Bob are optimizing their communication. There are a number of reasonable options for doing this—e.g., we could say they maximize the mutual information, or minimize the probability of error for a fixed number of signals—but it is likely that many of these formulations will be essentially equivalent when the values of $\mathcal{M}_j$ become very large. Here, we take a simple, informal approach. We say that, in order to make the various signals distinguishable from each other, Alice and Bob choose their *n*-tuples $(M_1, \ldots, M_n)$ so that neighboring signals, say $(M_1, \ldots, M_n)$ and $(M_1 + \Delta M_1, \ldots, M_n + \Delta M_n)$, are at least a certain distance from each other, and we use the Fisher information metric to measure distance. Specifically, we require the Fisher information distance between the probability distributions $P(N_1, \ldots, N_n | M_1, \ldots, M_n)$ and $P(N_1, \ldots, N_n | M_1 + \Delta M_1, \ldots, M_n + \Delta M_n)$ to be greater than or equal to a specified value $d_{\min}$ (or, equivalently for small $\Delta M_j/M_j$, we require the Kullback–Leibler divergence to be at least $(1/2)d_{\min}^2$). For the Poisson distribution and for small values of the ratios $\Delta M_j/M_j$, this condition works out to be

$$\sum\_{j=1}^{n} \frac{(\Delta M\_j)^2}{M\_j} \ge d\_{\min}^2. \tag{28}$$

For our purposes the exact value of *dmin* is not important. We also assume that the various signals have equal *a priori* probabilities. This is a natural choice if one wants to convey as much information as possible. Under these assumptions, Alice and Bob's aim is to maximize the number of distinct signals.
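
A small numerical check of the quadratic approximation behind Equation (28), comparing the exact Kullback–Leibler divergence between product-Poisson signals with $(1/2)\sum_j (\Delta M_j)^2/M_j$; the particular values of *M* and Δ*M* are arbitrary.

```python
import numpy as np
from scipy.stats import poisson

M = np.array([4.0, 9.0, 16.0])
dM = np.array([0.2, -0.3, 0.4])
ks = np.arange(81)                 # covers essentially all of the mass

kl = sum(
    np.sum(poisson.pmf(ks, m) * np.log(poisson.pmf(ks, m) / poisson.pmf(ks, m + dm)))
    for m, dm in zip(M, dM)
)
print(kl, 0.5 * np.sum(dM**2 / M))   # nearly equal for small dM/M
```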

The analysis will be much simpler if we parameterize each die not by (*M*1, ... , *Mn*), but rather by the variables

$$\alpha_j = \sqrt{M_j}, \quad j = 1, \dots, n. \tag{29}$$


Then, for neighboring signals we can write

$$\Delta \alpha_j = \frac{1}{2\sqrt{M_j}}\,\Delta M_j, \tag{30}$$

so that the condition in Equation (28) becomes

$$\sum\_{j=1}^{n} (\Delta \alpha\_j)^2 \ge \frac{1}{4} \, d\_{\text{min}}^2. \tag{31}$$

That is, in the space parameterized by $\vec{\alpha} = (\alpha_1, \ldots, \alpha_n)$, we want the points representing Alice's signals to be evenly separated from each other. Thus Alice's signals will be roughly uniformly distributed over some region of $\vec{\alpha}$-space—she wants to pack in as many signals as possible without exceeding the bounds $\mathcal{M}_j$ on the expectation values of the $N_j$'s. In what follows, we approximate this discrete but roughly uniform distribution of the values of $\vec{\alpha}$ by a continuous probability distribution. The probability density is zero outside the region where Alice's possible signals lie; inside that region, it has a constant value of $1/V$, where *V* is the volume of the region.

The communication problem then becomes a straightforward geometry problem—within the "positive" section of $\vec{\alpha}$-space (that is, the section in which each $\alpha_j$ is non-negative), the aim is to find the region $\mathcal{R}$ of largest volume that satisfies the constraints

$$\frac{1}{V_{\mathcal{R}}} \int_{\mathcal{R}} \alpha_j^2\, d\vec{\alpha} = \mathcal{M}_j, \quad j = 1, \ldots, n,\tag{32}$$

where $V_{\mathcal{R}}$ is the volume of $\mathcal{R}$. We maximize the volume because Alice's signals have a fixed packing density within $\mathcal{R}$; thus the larger the volume, the more signals Alice has at her disposal.

It is not hard to see that the solution to this geometry problem is to make region R the positive section of a certain ellipsoid centered at the origin. To see this, the conditions (32) can be written as

$$\int_{\mathcal{R}} \alpha_j^2\, d\vec{\alpha} = \mathcal{M}_j \int_{\mathcal{R}} d\vec{\alpha}, \quad j = 1, \ldots, n. \tag{33}$$

Now, let $\beta_j = \alpha_j / \sqrt{\mathcal{M}_j}$. In terms of the $\beta_j$'s, the above conditions become

$$\int_{\mathcal{R}'} \beta_j^2\, d\vec{\beta} = \int_{\mathcal{R}'} d\vec{\beta}, \quad j = 1, \dots, n,\tag{34}$$

where $\mathcal{R}'$ is the region of $\vec{\beta}$-space corresponding to the region $\mathcal{R}$ of $\vec{\alpha}$-space. In particular, the equation obtained by summing these *n* conditions must also be true:

$$\int_{\mathcal{R}'} \beta^2\, d\vec{\beta} = n \int_{\mathcal{R}'} d\vec{\beta},\tag{35}$$

where $\beta^2 = \beta_1^2 + \cdots + \beta_n^2$. That is, the average squared distance from the origin over region $\mathcal{R}'$ must be equal to *n*. The maximum volume region $\mathcal{R}'$ satisfying this one condition is the positive section of a sphere, and one can work out that the radius of the sphere must be $\sqrt{n+2}$. Moreover, that region also satisfies all of the conditions (34). So, that same region is the maximum volume region that satisfies those conditions as well. Going back to the $\alpha_j$'s, we see that the maximum volume region satisfying the conditions (32) is the positive section of an ellipsoid, with semi-axis lengths

$$\alpha_j^{\max} = \sqrt{(n+2)\,\mathcal{M}_j}.\tag{36}$$

Thus, the strategy that Alice and Bob adopt is to choose a set of closely packed signals with some minimum separation in $\vec{\alpha}$-space that occupies the positive section of an ellipsoid centered at the origin. Again, in this paper, we treat this discrete but roughly uniform distribution of signals as if it were actually uniform. This approximation becomes more and more reasonable as the values of the $\mathcal{M}_j$'s increase.
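
The geometric facts used above are easy to verify by simulation: sampling uniformly from the positive section of a ball of radius $\sqrt{n+2}$ (by symmetry, its moments agree with those of the full ball) reproduces the conditions (34) and (35). The dimension and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 4, 300_000

u = rng.normal(size=(trials, n))
u /= np.linalg.norm(u, axis=1, keepdims=True)      # uniform on the sphere
r = np.sqrt(n + 2) * rng.random(trials)**(1 / n)   # radial density ~ r**(n-1)
beta = np.abs(u) * r[:, None]                      # fold into positive section

print((beta**2).mean(axis=0))        # ~ 1 each, as in the conditions (34)
print((beta**2).sum(axis=1).mean())  # ~ n, as in Equation (35)
```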

#### **5. A Distribution over the Probability Simplex**

So far, we have not made any connection between our communication problem and the real-amplitude Scrooge distribution. We do this now by seeing how the uniform distribution over the ellipsoid in $\vec{\alpha}$-space induces a certain probability distribution over the $(n-1)$-dimensional probability simplex for Alice's *n*-sided die. We define this probability distribution as follows.

Let us imagine many rounds of communication from Alice to Bob: she has sent him many dice for which the expected numbers of occurrences of the various outcomes, $(M_1, \ldots, M_n)$, cover a representative range of values: the corresponding values of $\vec{\alpha}$ are distributed fairly uniformly over the region $\mathcal{R}$ in $\vec{\alpha}$-space. Bob has rolled each of these dice as many times as it can be rolled. Now consider a small region of the probability simplex, say the region $\mathcal{S}(\mathbf{x}, \Delta\mathbf{x})$ for which the probability of the *j*th outcome lies between $x_j$ and $x_j + \Delta x_j$ for $j = 1, \ldots, n-1$. Some of the dice Alice has sent to Bob have probabilities lying in this region. The weight we want to attach to the region $\mathcal{S}(\mathbf{x}, \Delta\mathbf{x})$ is, roughly speaking, the fraction of the total number of rolls that came from dice in this region. Note that for a die at location $\vec{\alpha}$, the expectation value of the number of times it will be rolled is $\alpha^2 = \alpha_1^2 + \cdots + \alpha_n^2$. So, we multiply the density of signals by the factor $\alpha^2$ to get the "density of rolls." These considerations lead us to the following definition of the weight $\hat{\sigma}(\mathbf{x})\,dx_1 \cdots dx_{n-1}$ that we assign to the infinitesimal region $\mathcal{S}(\mathbf{x}, d\mathbf{x})$:

$$\hat{\sigma}(\mathbf{x})\,dx_1\cdots dx_{n-1} = \frac{\int_{\mathcal{C}(\mathbf{x},d\mathbf{x})} \alpha^2\, d\vec{\alpha}}{\int_{\mathcal{R}} \alpha^2\, d\vec{\alpha}}.\tag{37}$$

Here, C(*x*, *dx*) is the cone (within the region R) representing dice for which the probabilities of the outcomes lie in S(*x*, *dx*):

$$\mathcal{C}(\mathbf{x}, d\mathbf{x}) = \left\{ \vec{\alpha} \in \mathcal{R} \,\middle|\, x_j \le \frac{\alpha_j^2}{\alpha^2} \le x_j + dx_j \right\}. \tag{38}$$

Our use of the weighting factor *α*<sup>2</sup> is reminiscent of the "adjustment" stage in the construction of the GAP measure in Refs. [5–8], and the integration over C(*x*, *dx*) is reminiscent of the projection stage of that same construction. We can express *σ*ˆ(*x*) more formally as

$$\hat{\sigma}(\mathbf{x}) = \frac{\int_{\mathcal{R}} \left[ \prod_{j=1}^{n-1} \delta\!\left( x_j - \frac{\alpha_j^2}{\alpha^2} \right) \right] \alpha^2\, d\vec{\alpha}}{\int_{\mathcal{R}} \alpha^2\, d\vec{\alpha}},\tag{39}$$

where *δ* is the Dirac delta function.

It is not difficult to obtain an explicit expression for $\hat{\sigma}(\mathbf{x})$ starting with Equation (39). For example, in the integral appearing in the numerator of that equation, one can use the integration variables $s_1, \ldots, s_{n-1}$ and $\alpha$, where $s_j = \alpha_j/\alpha$. Then, $d\vec{\alpha}$ becomes $(1/s_n)\,\alpha^{n-1}\,ds_1 \cdots ds_{n-1}\,d\alpha$, and the integral becomes straightforward. Here, though, we take a different path to the same answer, starting with Equation (37). This latter approach turns out to be more parallel to our derivation of the Scrooge distribution in the quantum mechanical setting.

First, note that the numerator in Equation (37) can be written as

$$\int_{\mathcal{C}(\mathbf{x},d\mathbf{x})} \alpha^2\, d\vec{\alpha} = \frac{n}{n+2}\, \alpha_{\max}^2 \cdot (\text{volume of } \mathcal{C}(\mathbf{x}, d\mathbf{x})),\tag{40}$$

where $\alpha_{\max}$ is the largest value of $\alpha$ over all points in $\mathcal{R}$ satisfying $\alpha_j^2/\alpha^2 = x_j$ for $j = 1, \ldots, n$. We get Equation (40) by writing $d\vec{\alpha}$ as $k\,\alpha^{n-1}\,d\alpha$, with some constant *k*, for the purpose of integrating over the cone. We can find the value of $\alpha_{\max}$ by finding the point of intersection between (i) the ellipsoid that defines the boundary of $\mathcal{R}$, given by

$$\frac{\alpha\_1^2}{(n+2)\mathcal{M}\_1} + \dots + \frac{\alpha\_n^2}{(n+2)\mathcal{M}\_n} = 1,\tag{41}$$

and (ii) the line parameterized by *α* and defined by the equations

$$\alpha_j = \sqrt{x_j}\,\alpha, \quad j = 1, \ldots, n. \tag{42}$$

The value of *α* at this intersection point is

$$\alpha_{\max} = \sqrt{\frac{n+2}{\frac{x_1}{\mathcal{M}_1} + \cdots + \frac{x_n}{\mathcal{M}_n}}}.\tag{43}$$

We can therefore rewrite Equation (40) as

$$\int_{\mathcal{C}(\mathbf{x},d\mathbf{x})} \alpha^2\, d\vec{\alpha} = \frac{n}{\frac{x_1}{\mathcal{M}_1} + \cdots + \frac{x_n}{\mathcal{M}_n}} \cdot (\text{volume of } \mathcal{C}(\mathbf{x}, d\mathbf{x})). \tag{44}$$

Meanwhile, it follows from Equation (32) that the denominator in Equation (37) is

$$\int_{\mathcal{R}} \alpha^2\, d\vec{\alpha} = (\mathcal{M}_1 + \cdots + \mathcal{M}_n)\, V_{\mathcal{R}}.\tag{45}$$

Our next step is to compare $\hat{\sigma}(\mathbf{x})$ to the analogous distribution $\hat{\tau}(y)$ induced by the uniform distribution of the vector $\vec{\beta}$—the same $\vec{\beta}$ as in Section 4—over its domain $\mathcal{R}'$ (recall that $\mathcal{R}'$ is the positive section of a sphere):

$$\hat{\tau}(y)\,dy_1\cdots dy_{n-1} = \frac{\int_{\mathcal{C}'(y,dy)} \beta^2\, d\vec{\beta}}{\int_{\mathcal{R}'} \beta^2\, d\vec{\beta}}.\tag{46}$$

Here, $\mathcal{C}'(y, dy)$ is the cone in $\mathcal{R}'$ for which $y_j \le (\beta_j/\beta)^2 \le y_j + dy_j$. We can immediately write down an explicit expression for $\hat{\tau}(y)$. It is the same as the distribution (23) on the probability simplex induced by the uniform distribution over the unit sphere in the *n*-dimensional real Hilbert space—the extra radial dimension represented by $\beta$ has no bearing on the distribution over the probability simplex. Thus,

$$\hat{\tau}(y) = \frac{\Gamma(n/2)}{\pi^{n/2}} \cdot \frac{1}{\sqrt{y_1 \cdots y_n}}.\tag{47}$$

The expression for $\hat{\sigma}(\mathbf{x})$ is determined by finding the factors by which the numerator and denominator in Equation (46) change when the sphere in $\vec{\beta}$-space is stretched into an ellipsoid in $\vec{\alpha}$-space. In this transformation (in which $\alpha_j = \beta_j \sqrt{\mathcal{M}_j}$), the relation between *y* (in Equation (46)) and *x* (in Equation (37)) is given by $y = g(x)$, where *g* takes the point $(\alpha_1^2/\alpha^2, \ldots, \alpha_{n-1}^2/\alpha^2)$ in the probability simplex to the point $(\beta_1^2/\beta^2, \ldots, \beta_{n-1}^2/\beta^2)$.

Essentially, any appearance of $\mathcal{M}_j$ in our expression (37) for $\hat{\sigma}(\mathbf{x})\,dx_1 \cdots dx_{n-1}$ becomes a 1 in Equation (46). Thus, according to Equation (44), when we transform from $\vec{\beta}$ to $\vec{\alpha}$, the numerator in Equation (46) is multiplied by

$$\frac{\dfrac{n}{\frac{x_1}{\mathcal{M}_1} + \cdots + \frac{x_n}{\mathcal{M}_n}} \cdot (\text{volume of } \mathcal{C}(\mathbf{x}, d\mathbf{x}))}{n \cdot (\text{volume of } \mathcal{C}'(y, dy))},\tag{48}$$


and according to Equation (45), in this same transformation, the denominator in Equation (46) is multiplied by

$$\frac{(\mathcal{M}_1 + \cdots + \mathcal{M}_n)\, V_{\mathcal{R}}}{n\, V_{\mathcal{R}'}}.\tag{49}$$

For both the transitions $\mathcal{C}'(y, dy) \to \mathcal{C}(\mathbf{x}, d\mathbf{x})$ and $\mathcal{R}' \to \mathcal{R}$, the volume increases by a factor of $\sqrt{\mathcal{M}_1 \cdots \mathcal{M}_n}$. So, these volume factors cancel out. By inserting the other factors from Equations (48) and (49), it is found that

$$\hat{\sigma}(\mathbf{x}) = \hat{\tau}(y)\, \mathcal{J}(y/\mathbf{x})\, \frac{n}{\left(\frac{x_1}{\mathcal{M}_1} + \cdots + \frac{x_n}{\mathcal{M}_n}\right)(\mathcal{M}_1 + \cdots + \mathcal{M}_n)},\tag{50}$$

where $\mathcal{J}(y/\mathbf{x})$ is the Jacobian of *y* with respect to *x*.

Let us now write *y* explicitly in terms of *x*:

$$y_j = \frac{\beta_j^2}{\beta^2} = \frac{\frac{\alpha_j^2}{\mathcal{M}_j}}{\frac{\alpha_1^2}{\mathcal{M}_1} + \cdots + \frac{\alpha_n^2}{\mathcal{M}_n}} = \frac{\frac{x_j}{\mathcal{M}_j}}{\frac{x_1}{\mathcal{M}_1} + \cdots + \frac{x_n}{\mathcal{M}_n}}.\tag{51}$$

From this, we can get the Jacobian (very much like the one in Equation (16)):

$$\mathcal{J}(y/\mathbf{x}) = \frac{1}{\mathcal{M}_1 \cdots \mathcal{M}_n} \cdot \frac{1}{\left(\frac{x_1}{\mathcal{M}_1} + \cdots + \frac{x_n}{\mathcal{M}_n}\right)^n}.\tag{52}$$

By inserting the results of Equations (51) and (52) into Equation (50), we arrive at

$$\hat{\sigma}(\mathbf{x}) = \frac{n\,\Gamma(n/2)}{\pi^{n/2}\,\mathcal{M}\,\sqrt{\mathcal{M}_1 \cdots \mathcal{M}_n}\,\sqrt{x_1 \cdots x_n}\, \left(\frac{x_1}{\mathcal{M}_1} + \cdots + \frac{x_n}{\mathcal{M}_n}\right)^{\frac{n}{2}+1}},\tag{53}$$

where $\mathcal{M} = \mathcal{M}_1 + \cdots + \mathcal{M}_n$. This is essentially the same as the expression (25) obtained earlier as the real-amplitude Scrooge distribution. The agreement can be made more explicit by defining the ratios $\lambda_j = \mathcal{M}_j/\mathcal{M}$, in which case Equation (53) becomes exactly identical to Equation (25), with these $\lambda_j$'s playing the role of the eigenvalues of the density matrix.

Note that in the above derivation, we see an analog of *ρ* distortion. The stretching of the sphere in $\vec{\beta}$-space into an ellipsoid in $\vec{\alpha}$-space is very much like *ρ* distortion, though in place of the notion of a density matrix, we have a uniform distribution within the sphere or ellipsoid.

It may seem that our communication set-up, in which Alice sends a die equipped with a probabilistic self-destruction mechanism, is rather artificial. However, the mathematics is actually fairly simple and natural. We are considering a set of Poisson-distributed random variables and are basically constructing a measure on the set of values of these variables based on distinguishability (this is the measure derived from the Fisher information metric). That measure then induces a measure on the probability simplex which agrees with the real-amplitude Scrooge distribution.

#### **6. A Classical Interpretation of the Complex-Amplitude Scrooge Distribution**

We now show how to modify the above classical communication scenario to arrive at the original, complex-amplitude Scrooge distribution.

Not surprisingly, we begin by doubling the number of sides of Alice's dice. Let the outcomes be labeled $1a, 1b, 2a, 2b, \ldots, na, nb$. The communication scheme is exactly as it was in Section 4, except that instead of placing an upper bound on the expectation value of the number of times each individual outcome occurs, the *ja* and *jb* outcomes are grouped together and an upper bound $\mathcal{M}_j$ is placed on the expectation value of the total number of times the two *j* outcomes occur. This is done for each $j = 1, \ldots, n$. Again, Alice and Bob are asked to maximize the number of distinguishable signals under this constraint, where "distinguishable" again means having a Fisher-distance separation of at least $d_{\min}$.

As before, it is easiest to view the problem in $\vec{\alpha}$-space; let us label the variables in the space $\alpha_{ja}$ and $\alpha_{jb}$. We now look for the maximum-volume region $\mathcal{R}$ of the positive section of $\vec{\alpha}$-space satisfying the constraints

$$\frac{1}{V_{\mathcal{R}}} \int_{\mathcal{R}} (\alpha_{ja}^2 + \alpha_{jb}^2)\, d\vec{\alpha} = \mathcal{M}_j, \quad j = 1, \ldots, n. \tag{54}$$

In terms of the variables $\beta_{ja} = \alpha_{ja}/\sqrt{\mathcal{M}_j}$ and $\beta_{jb} = \alpha_{jb}/\sqrt{\mathcal{M}_j}$, the constraints become

$$\frac{1}{V_{\mathcal{R}'}} \int_{\mathcal{R}'} (\beta_{ja}^2 + \beta_{jb}^2)\, d\vec{\beta} = 1, \quad j = 1, \dots, n,\tag{55}$$

where $\mathcal{R}'$ is the region in $\vec{\beta}$-space corresponding to $\mathcal{R}$. Upon summing these *n* constraints, the equation

$$\frac{1}{V\_{\mathcal{R}'}} \int\_{\mathcal{R}'} \beta^2 d\vec{\beta} = n \tag{56}$$

is obtained, where *β*<sup>2</sup> = ∑*<sup>n</sup> j*=1(*β*<sup>2</sup> *ja* + *<sup>β</sup>*<sup>2</sup> *jb*). Maximizing the volume under this constraint again gives a sphere in - *β*-space, which becomes an ellipsoid in*α*-space (restricted to the positive section).

Continuing as before, one finds that the induced probability distribution over the $(2n-1)$-dimensional probability simplex associated with a $2n$-sided die is the analog of Equation (53), with the *n* values $\mathcal{M}_1, \ldots, \mathcal{M}_n$ now replaced by the $2n$ values $\mathcal{M}_1/2, \mathcal{M}_1/2, \ldots, \mathcal{M}_n/2, \mathcal{M}_n/2$:

$$\sigma_{ab}(\mathbf{x}) = \frac{n\,\Gamma(n)}{\pi^n\, \lambda_1 \cdots \lambda_n\, \sqrt{x_{1a} x_{1b} \cdots x_{na} x_{nb}}\, \left(\frac{x_{1a}+x_{1b}}{\lambda_1} + \cdots + \frac{x_{na}+x_{nb}}{\lambda_n}\right)^{n+1}},\tag{57}$$

where $\lambda_j = \mathcal{M}_j/\mathcal{M}$. Here, $x_{ja}$ and $x_{jb}$ are the probabilities of the outcomes *ja* and *jb*, and **x** refers to the point $(x_{1a}, x_{1b}, \ldots, x_{(n-1)a}, x_{(n-1)b}, x_{na})$ in the $(2n-1)$-dimensional probability simplex (the value of $x_{nb}$ is determined by the requirement that the probabilities sum to unity).
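
For the record, here is the substitution that produces Equation (57) (our own bookkeeping; the result matches the equation above): apply Equation (53) with $2n$ variables and bounds $\mathcal{M}_j/2$, so that the total $\mathcal{M}$ is unchanged,

$$\sigma_{ab}(\mathbf{x}) = \frac{2n\,\Gamma(n)}{\pi^{n}\,\mathcal{M}\,\prod_{j}(\mathcal{M}_j/2)\,\sqrt{x_{1a}x_{1b}\cdots x_{na}x_{nb}}\,\left(\sum_{j}\frac{2(x_{ja}+x_{jb})}{\mathcal{M}_j}\right)^{n+1}},$$

and then substitute $\mathcal{M}_j = \lambda_j\mathcal{M}$: the factors of 2 combine as $2n \cdot 2^n/2^{n+1} = n$, the powers of $\mathcal{M}$ cancel as before, and Equation (57) results.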

Finally, a distribution over the (*n* − 1)-dimensional probability simplex is obtained by ignoring the difference between the outcomes *ja* and *jb*. We can imagine an observer who, unlike Alice and Bob, cannot see the labels *a* and *b*. For this "*ab*-blind" observer, the distribution of Equation (57) looks like the following distribution over the (*n* − 1)-dimensional probability simplex:

$$\hat{\sigma}(\mathbf{x}) = \int \prod_{j=1}^{n-1} \delta\!\left[x_j - (x_{ja} + x_{jb})\right] \sigma_{ab}(\mathbf{x})\, dx_{1a}\, dx_{1b} \cdots dx_{na}\,.\tag{58}$$

Here, *δ* is the Dirac delta function and the integral is over the (2*n* − 1)-dimensional probability simplex.

The integral in Equation (58) is straightforward, and one finds that

$$\hat{\sigma}(\mathbf{x}) = \frac{n!}{\lambda_1 \cdots \lambda_n} \cdot \frac{1}{\left(\frac{x_1}{\lambda_1} + \cdots + \frac{x_n}{\lambda_n}\right)^{n+1}}.\tag{59}$$
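
The only nontrivial ingredient in passing from Equation (58) to Equation (59) is the elementary integral

$$\int_0^{c} \frac{dt}{\sqrt{t\,(c-t)}} = \pi.$$

With $x_{jb} = x_j - x_{ja}$ fixed by the delta functions (and, for $j = n$, by normalization), each of the $n$ remaining integrations over $x_{ja}$ has exactly this form with $c = x_j$ and contributes a factor of $\pi$, cancelling the $\pi^n$ in Equation (57); since $n\,\Gamma(n) = n!$, Equation (59) follows.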

This is the same as the original Scrooge distribution of Equation (18). The role of the eigenvalues of the density matrix is now played by the values $\lambda_j = \mathcal{M}_j/(\mathcal{M}_1 + \cdots + \mathcal{M}_n)$, where, again, $\mathcal{M}_j$ is the maximum allowed expectation value of the number of times that the outcomes *ja* and *jb* occur.
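
As a quick numerical sanity check of Equation (59), the sketch below (Python with NumPy; the function name, the seed, and the particular $\lambda_j$'s are our choices, not anything from the text) verifies by Monte Carlo that $\hat{\sigma}$ integrates to 1 over the simplex, and that the mean of $x_j$ under $\hat{\sigma}$ comes out close to $\lambda_j$, as one expects for an ensemble whose density matrix has eigenvalues $\lambda_j$.

```python
import math

import numpy as np

rng = np.random.default_rng(seed=1)

def scrooge_density(x, lam):
    """Eq. (59): Scrooge density on the (n-1)-simplex w.r.t. dx_1 ... dx_{n-1}."""
    n = len(lam)
    return math.factorial(n) / np.prod(lam) / np.sum(x / lam, axis=-1) ** (n + 1)

lam = np.array([0.5, 0.3, 0.2])   # stands in for the eigenvalues lambda_j
n = len(lam)

# Dirichlet(1,...,1) is uniform on the simplex, with density (n-1)! w.r.t.
# dx_1 ... dx_{n-1}; dividing sample means by (n-1)! therefore estimates
# integrals over the simplex.
x = rng.dirichlet(np.ones(n), size=1_000_000)
f = scrooge_density(x, lam)

print("normalization:", f.mean() / math.factorial(n - 1))                       # ~ 1.0
print("E[x_j]:       ", (x * f[:, None]).mean(axis=0) / math.factorial(n - 1))  # ~ lam
```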

#### **7. Discussion**

In this paper we have shown how the real-amplitude version of the Scrooge distribution emerges naturally from a classical communication scenario in which information is transmitted via the values of several random variables $N_j$. Essentially, the real-amplitude Scrooge distribution, regarded as a probability distribution over the probability simplex, is derived from an underlying distribution based on distinguishability. Our analysis includes a transformation that plays something like the role of a *ρ* distortion: in place of a density matrix, what is distorted is a distribution over the space of potential signals.

In order to get the original complex-amplitude Scrooge distribution for dimension *n*, we needed to consider a case with twice as many random variables, grouped into pairs, and then we imagined an observer for whom only the *sum* of the variables within each pair was observable.

The reader will probably have noticed that the role played by the concept of *information* in our classical communication problem seems to be exactly the opposite of the role it plays in the quantum origin of the Scrooge distribution. In quantum theory, the Scrooge distribution is the distribution over pure states that, upon measurement, provides an observer with the *least* possible amount of information. In contrast, in our classical communication scenario, the Scrooge distribution emerges from a requirement that Alice convey as much information as possible to Bob. What is common to both cases is that the information-based criterion favors a distribution that is highly *spread out* over the probability simplex. In the quantum case, a distribution spread out over many non-orthogonal states tends to make it difficult for an observer to gain information about the state. In the classical case, Alice and Bob want to spread their signals as widely as possible over the space of possibilities in order to maximize the number of distinguishable signals. Thus, though the two scenarios are quite different, their extremization criteria have similar effects.

An intriguing aspect of our classical scenario is that the probability simplex is not itself taken as the domain in which the problem is formulated. Instead, the problem is formulated in terms of the number of times each outcome occurs. The distribution over the probability simplex is a secondary concept, being derived from a more fundamental distribution over the space of the numbers of occurrences of the outcomes. That is, the $\mathcal{M}_j$ values are more fundamental in the problem than the probabilities of the outcomes, which are defined in terms of the $\mathcal{M}_j$'s by the equation $x_j = \mathcal{M}_j/\mathcal{M}$. In this specific respect, then, the effort to find a classical interpretation of the Scrooge distribution seems to lead us away from the models studied in Refs. [26,28], in which the set of frequencies of occurrence of the measurement outcomes was the only source of information considered.

It is interesting to ask whether this feature of our scenario is necessary in order to get the Scrooge distribution classically. To address this question, in Appendix A we consider another classical communication problem, in which we impose a separate restriction for each outcome as in Section 4, but now with Alice's signals consisting purely of probabilities (which are estimated by Bob through the observed frequencies of occurrence). For simplicity, we restrict our attention to the most basic case, in which there are only two possible outcomes—so Alice's die is now a coin to be tossed—and we are aiming just for the real-amplitude Scrooge distribution as opposed to the complex-amplitude version. We find that the resulting probability distribution over the probability simplex is *not* of the same form as the real-amplitude Scrooge distribution. This result can be taken as one bit of evidence that it is indeed necessary to go beyond the probability simplex and to work in a space of one additional dimension in order to obtain the Scrooge distribution classically. In this connection, it is worth noting that something very similar has been seen in research on *subentropy*—certain simple relations between subentropy and the Shannon entropy can be obtained only by lifting the normalization restriction that defines the probability simplex and working in the larger space of unnormalized *n*-tuples [21,23].

Finally, one might wonder about the potential significance of our need to invoke an "*ab*-blind" observer in order to obtain the complex-amplitude Scrooge distribution. It is well known that the number of independent parameters required to specify a pure quantum state (of a system with a finite-dimensional Hilbert space) is exactly twice the number of independent probabilities associated with a complete orthogonal measurement on the system. Here, we are seeing another manifestation of this factor of two: the classical measurement outcomes, corresponding to the sides of a rolled die, have to be grouped into pairs, and we need to imagine an observer incapable of distinguishing between the elements of any pair. In our actual quantum world, one can reasonably ask whether there is any interesting sense in which we ourselves are "*ab*-blind." This question, though, lies well beyond the scope of the present paper.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **Appendix A. Communicating through Probabilities**

Here, we consider a classical communication problem based directly on probabilities, as opposed to being based on the number of times each outcome occurs. We restrict our attention to the case of two outcomes, which we imagine as "heads" and "tails" for a tossed coin. The question is whether the real-amplitude Scrooge distribution for *n* = 2 can be obtained in this way.

Alice is trying to convey to Bob the identity of a point in the one-dimensional probability simplex (not the two-dimensional space with axes labeled "number of heads" and "number of tails"). The "simplex" in this case is just a line segment, and the points of the simplex are labeled by the probability *x* of heads occurring (the probability of tails occurring is 1 − *x*). Alice conveys her signal by sending Bob a coin with weights (*x*, 1 − *x*). Bob tosses the coin in order to estimate the value of *x*, but he is allowed to toss it only *N* times, at which point the coin will self-destruct. Alice chooses a set of points in the probability simplex in advance that will serve as her potential signals, and she provides Bob with the list of these points. Alice also chooses a function *N*(*x*) that determines how many times Bob will be able to toss the coin if the coin's weights are (*x*, 1 − *x*). However, Bob does not know the function *N*(*x*) and is not allowed to use the observed total number of tosses in his estimation of the value of *x*. He can use only the frequencies of occurrence of heads and tails.

We limit the amount of information that Alice can convey per coin by specifying the values of two quantities: (i) the expectation value $\mathcal{N}$ of the total number of tosses, and (ii) the expectation value $\mathcal{N}_H$ of the number of heads. If we let $\rho(x)\,dx$ be the number of signals lying between the values $x$ and $x + dx$, we can write these two restrictions as follows:

$$\int_0^1 N(x)\,\rho(x)\,dx = \mathcal{N} \int_0^1 \rho(x)\,dx.\tag{A1}$$

$$\int_0^1 x\,N(x)\,\rho(x)\,dx = \mathcal{N}_H \int_0^1 \rho(x)\,dx.\tag{A2}$$

As before, we insist that Alice choose the signal values so that neighboring signals have a certain minimum degree of distinguishability as quantified by the Fisher information metric. For the binomial distributions we are dealing with here, this condition works out to be

$$\Delta x = \sqrt{\frac{x(1-x)}{N(x)}}\; d_{\min},\tag{A3}$$

where Δ*x* is the separation between successive signals. The density *ρ*(*x*) of signals is therefore

$$\rho(x) = \frac{1}{\Delta x} = \sqrt{\frac{N(x)}{x(1-x)}}\,\frac{1}{d_{\min}}.\tag{A4}$$
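
Equation (A3) is the Fisher-distance condition made explicit; the computation is standard: $N$ tosses of a coin with heads probability $x$ carry Fisher information $N/(x(1-x))$, so

$$ds = \sqrt{\frac{N(x)}{x(1-x)}}\; dx, \qquad \Delta s = d_{\min} \;\Longrightarrow\; \Delta x = \sqrt{\frac{x(1-x)}{N(x)}}\; d_{\min}.$$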

Alice wants to maximize the number of distinct signals. So, in choosing the function *N*(*x*), she needs to solve the following optimization problem: maximize the quantity (from Equation (A4))

$$\int_0^1 \sqrt{\frac{N(x)}{x(1-x)}}\, dx\tag{A5}$$

while satisfying the following two constraints (which come from Equations (A1) and (A2), combined with Equation (A4)):

$$\int_0^1 \frac{N(x)^{3/2} - \mathcal{N}\,N(x)^{1/2}}{\sqrt{x(1-x)}}\, dx = 0.\tag{A6}$$

$$\int_0^1 \frac{x\,N(x)^{3/2} - \mathcal{N}_H\,N(x)^{1/2}}{\sqrt{x(1-x)}}\, dx = 0.\tag{A7}$$

This problem can be solved by the calculus of variations, and one finds that Alice should choose $N(x)$ to be of the form

$$N(x) \propto \frac{1}{\frac{x}{\lambda} + \frac{1-x}{1-\lambda}}.\tag{A8}$$

Here, *λ* is a real number between zero and one, fixed by the requirement that the overall probability of heads must equal $\mathcal{N}_H/\mathcal{N}$ (we could have written the result in other ways; we use *λ* only to facilitate our later comparison with the Scrooge distribution). Once the value of *λ* is set, the constant factor multiplying the right-hand side is fixed by Equation (A6).
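
For completeness, here is a sketch of the variational step (our own bookkeeping, with Lagrange multipliers $a$ and $b$ attached to the constraints (A6) and (A7)): setting to zero the derivative with respect to $N$ of the combined integrand gives

$$\frac{\partial}{\partial N}\,\frac{N^{1/2} + a\left(N^{3/2} - \mathcal{N}N^{1/2}\right) + b\left(xN^{3/2} - \mathcal{N}_H N^{1/2}\right)}{\sqrt{x(1-x)}} = 0 \;\Longrightarrow\; N(x) = \frac{a\mathcal{N} + b\mathcal{N}_H - 1}{3\,(a + b\,x)},$$

that is, $N(x) \propto 1/(c + d\,x)$ for constants $c$ and $d$, which is precisely the family in Equation (A8) after reparameterizing in terms of $\lambda$.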

We now use this result to generate the probability distribution $\hat{\sigma}(x)$ over the probability simplex. We define it as follows: in many rounds of communication, we want $\hat{\sigma}(x)\,dx$ to approximate the fraction of the total number of tosses that come from a coin whose probability of heads is between $x$ and $x + dx$. More precisely, we define $\hat{\sigma}(x)$ to be proportional to $N(x)\rho(x)$, with the proportionality constant set by the normalization condition $\int_0^1 \hat{\sigma}(x)\,dx = 1$ (we have multiplied $\rho(x)$ by $N(x)$ to turn the density of signals into the density of tosses). By substituting for $N(x)$ and $\rho(x)$ in accordance with Equations (A4) and (A8), we arrive at

$$\hat{\sigma}(x) = \frac{A}{\sqrt{x(1-x)}} \cdot \frac{1}{\left(\frac{x}{\lambda} + \frac{1-x}{1-\lambda}\right)^{3/2}},\tag{A9}$$

where *A* is the normalization constant. Comparing this form with that of Equation (25), we see that this alternative problem does not lead to the real-amplitude Scrooge distribution: the exponent in the denominator is 3/2 instead of 2. Moreover, *λ* and 1 − *λ* have no obvious meaning in this problem, whereas in the problem considered in Sections 4 and 5, the $\lambda_j$'s can be interpreted directly in terms of the imposed bounds $\mathcal{M}_j$ on the expectation values of the number of times that the various outcomes occur.
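
For a side-by-side comparison, specializing the $\lambda$-form of Equation (53) to $n = 2$, with $x_1 = x$, $x_2 = 1 - x$ and $\lambda_1 = \lambda$, gives the real-amplitude Scrooge distribution on the same one-dimensional simplex:

$$\sigma(x) = \frac{B}{\sqrt{x(1-x)}} \cdot \frac{1}{\left(\frac{x}{\lambda} + \frac{1-x}{1-\lambda}\right)^{2}},$$

with $B$ a normalization constant, so the two densities differ exactly in the exponent, 2 versus 3/2.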

#### **References**


© 2018 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
