#### **Appendix A. VAE: Training and Implementation Details**

When training our VAE, we maximize the log-likelihood $\mathcal{L}(\theta)$ with respect to its parameters $\theta$:

$$\theta\_{\text{MLE}} = \underset{\theta}{\text{argmax}} \, \mathcal{L}(\theta) = \underset{\theta}{\text{argmax}} \log(p[\mathbf{x}|\theta, h]). \tag{A1}$$

Equation (A1) cannot be evaluated directly because of the hidden variables in the structure of $p[\mathbf{x}|\theta, h]$. We can, however, simplify the problem by introducing a distribution over hidden variables $z$. Recall that the probability distribution can be written as $p[\mathbf{x}|\theta, h] = \int p[\mathbf{x}|z, \theta, h] \, p[z] \, dz$, so that the expression for the log-likelihood becomes

$$\mathcal{L}(\theta) = \log \left( \int p[\mathbf{x}|z, \theta, h] \, p[z] \, dz \right) . \tag{A2}$$

We can then use a mathematical trick that might seem counterintuitive at first glance but ultimately proves quite powerful. We multiply the function inside the integral by $\frac{q[z|\mathbf{x},\tilde{\theta},h]}{q[z|\mathbf{x},\tilde{\theta},h]} = 1$, where $q[z|\mathbf{x}, \tilde{\theta}, h]$ is an arbitrary distribution that can be adjusted through its parameters $\tilde{\theta}$, so that

$$\begin{split} \mathcal{L}(\theta) = \log \left( \int p[\mathbf{x}|z, \theta, h] \, p[z] \, dz \right) &= \log \left( \int \frac{q[z|\mathbf{x}, \tilde{\theta}, h]}{q[z|\mathbf{x}, \tilde{\theta}, h]} \, p[\mathbf{x}|z, \theta, h] \, p[z] \, dz \right) \\ &= \log \left( \mathbb{E}\_{q[z|\mathbf{x}, \tilde{\theta}, h]} \, p[\mathbf{x}|z, \theta, h] \frac{p[z]}{q[z|\mathbf{x}, \tilde{\theta}, h]} \right), \end{split} \tag{A3}$$

where $\mathbb{E}\_{f[x]}$ denotes the expectation value with respect to some distribution $f[x]$. We can then use Jensen's inequality to show that

$$\log \left( \mathbb{E}\_{q[z|\mathbf{x},\tilde{\theta},h]} \, p[\mathbf{x}|z,\theta,h] \frac{p[z]}{q[z|\mathbf{x},\tilde{\theta},h]} \right) \ge \mathbb{E}\_{q[z|\mathbf{x},\tilde{\theta},h]} \log \left( p[\mathbf{x}|z,\theta,h] \frac{p[z]}{q[z|\mathbf{x},\tilde{\theta},h]} \right) , \tag{A4}$$

where the right-hand side of this inequality is a lower bound on the log-likelihood: the log-likelihood is always greater than or equal to it, and equality can be achieved by a proper choice of $q$ if it belongs to a sufficiently expressive family.
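For completeness, the size of the gap in Equation (A4) follows from a standard rearrangement:

$$\mathcal{L}(\theta) - \mathbb{E}\_{q[z|\mathbf{x},\tilde{\theta},h]} \log \left( p[\mathbf{x}|z,\theta,h] \frac{p[z]}{q[z|\mathbf{x},\tilde{\theta},h]} \right) = KL\left( q[z|\mathbf{x},\tilde{\theta},h] \,\|\, p[z|\mathbf{x},\theta,h] \right) \ge 0 ,$$

so the bound is tight exactly when $q[z|\mathbf{x},\tilde{\theta},h]$ coincides with the true posterior $p[z|\mathbf{x},\theta,h]$.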

Maximizing this lower bound is thus equivalent to maximizing the log-likelihood. The lower bound, commonly called the evidence lower bound (ELBO), can be decomposed into two terms:

$$\mathcal{L}(\theta) \ge \text{ELBO}(\theta, \tilde{\theta}) = \mathbb{E}\_{q[z|\mathbf{x}, \tilde{\theta}, h]} \log \left( p[\mathbf{x}|z, \theta, h] \right) - \int q[z|\mathbf{x}, \tilde{\theta}, h] \log \frac{q[z|\mathbf{x}, \tilde{\theta}, h]}{p[z]} \, dz . \tag{A5}$$

Note that the second term is the Kullback–Leibler divergence $KL(q[z|\mathbf{x}, \tilde{\theta}, h] \,\|\, p[z])$. In our case, we pick particular distribution forms that reflect the structure of our problem:

$$\begin{split} p[\mathbf{x}|z,\theta,h] &= \prod\_{i=1}^{N} \prod\_{j=1}^{4} \pi\_{ij}(z,\theta,h)^{\mathbf{x}\_{ij}}, \\ q[z|\mathbf{x},\tilde{\theta},h] &= \mathcal{N}\left(\mu\_i(\mathbf{x},\tilde{\theta},h), \text{Diag}(\sigma\_i^2(\mathbf{x},\tilde{\theta},h))\right), \\ p[z] &= \mathcal{N}(0,I), \end{split} \tag{A6}$$

where $\mu\_i$ and $\sigma\_i$ are given by the encoder neural network, and $\pi\_{ij}$ is given by the decoder neural network, with $\sum\_{j=1}^{4} \pi\_{ij} = 1$ and $\pi\_{ij} \ge 0$, which can be achieved by applying the softmax function to the output of the neural network. Now, we can use the reparametrization trick to change the variable in the integral, $z\_i = \sigma\_i(\mathbf{x}, \tilde{\theta}, h) \, \varepsilon\_i + \mu\_i(\mathbf{x}, \tilde{\theta}, h)$ with $\varepsilon \sim \mathcal{N}(0, I)$, which simplifies this expression to

$$\begin{split} \text{ELBO}(\theta, \tilde{\theta}) &= \sum\_{i=1}^{N} \sum\_{j=1}^{4} \mathbf{x}\_{ij} \left\langle \log \left( \pi\_{ij}\left(\sigma\_{i}(\mathbf{x}, \tilde{\theta}, h) \, \boldsymbol{\varepsilon} + \mu\_{i}(\mathbf{x}, \tilde{\theta}, h), \theta, h\right) \right) \right\rangle\_{\boldsymbol{\varepsilon} \sim \mathcal{N}(0, I)} \\ &\quad - \sum\_{i=1}^{N} \left( \frac{\sigma\_{i}^{2}(\mathbf{x}, \tilde{\theta}, h) + \mu\_{i}^{2}(\mathbf{x}, \tilde{\theta}, h) - 1}{2} - \log \sigma\_{i}(\mathbf{x}, \tilde{\theta}, h) \right) . \end{split} \tag{A7}$$
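Here the second line is simply the negative of the Kullback–Leibler term from Equation (A5): for the diagonal Gaussian $q$ of Equation (A6) and the standard normal prior, the divergence takes the well-known closed form (per latent dimension)

$$KL\left( \mathcal{N}(\mu\_i, \sigma\_i^2) \,\|\, \mathcal{N}(0, 1) \right) = \frac{\sigma\_i^2 + \mu\_i^2 - 1}{2} - \log \sigma\_i \, .$$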

The first term is the cross-entropy, which pushes the model probability distribution to be as close as possible to the data. The second term is the regularizer, which keeps the latent variable $z$ from deviating too much from the normal distribution $\mathcal{N}(0, I)$, so that the VAE can be used to generate new data once it is trained. Note that both $\pi\_{ij}$ and $\sigma\_i$ must be positive. Instead of adding explicit constraints to the VAE, which would be cumbersome, we train the VAE in terms of the variables $\Pi = \log \pi$ and $\xi = 2 \log \sigma$. Equation (A7) then becomes

$$\begin{split} \text{ELBO}(\theta, \tilde{\theta}) &= \sum\_{i=1}^{N} \sum\_{j=1}^{4} \mathbf{x}\_{ij} \left\langle \Pi\_{ij}\left( e^{\xi\_i(\mathbf{x},\tilde{\theta},h)/2} \, \boldsymbol{\varepsilon} + \mu\_i(\mathbf{x},\tilde{\theta},h), \theta, h \right) \right\rangle\_{\boldsymbol{\varepsilon} \sim \mathcal{N}(0, I)} \\ &\quad - \frac{1}{2} \sum\_{i=1}^{N} \left( e^{\xi\_i(\mathbf{x},\tilde{\theta},h)} + \mu\_i^2(\mathbf{x},\tilde{\theta},h) - \xi\_i(\mathbf{x},\tilde{\theta},h) - 1 \right) . \end{split} \tag{A8}$$

Now, $\text{ELBO}(\theta, \tilde{\theta})$ can be optimized effectively with gradient-based methods, and the average over $\boldsymbol{\varepsilon}$ can be estimated by sampling. The generalization to a data set $\{\mathbf{x}^k\}\_{k=1}^{M}$ of size $M$ is straightforward:

$$\begin{split} \text{ELBO}(\theta, \tilde{\theta}) &= \sum\_{k=1}^{M} \sum\_{i=1}^{N} \sum\_{j=1}^{4} \mathbf{x}\_{ij}^{k} \left\langle \Pi\_{ij}\left( e^{\xi\_i(\mathbf{x}^k,\tilde{\theta},h)/2} \, \boldsymbol{\varepsilon} + \mu\_i(\mathbf{x}^k,\tilde{\theta},h), \theta, h \right) \right\rangle\_{\boldsymbol{\varepsilon} \sim \mathcal{N}(0, I)} \\ &\quad - \frac{1}{2} \sum\_{k=1}^{M} \sum\_{i=1}^{N} \left( e^{\xi\_i(\mathbf{x}^k,\tilde{\theta},h)} + \mu\_i^2(\mathbf{x}^k,\tilde{\theta},h) - \xi\_i(\mathbf{x}^k,\tilde{\theta},h) - 1 \right) . \end{split} \tag{A9}$$
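To make this estimator concrete, a minimal sketch (in PyTorch) of how Equation (A9) can be evaluated stochastically is shown below; the `encoder`/`decoder` interfaces, the tensor shapes, and the use of a single Monte Carlo sample of $\boldsymbol{\varepsilon}$ per data point are illustrative assumptions rather than implementation details.

```python
import torch

def elbo_estimate(x, encoder, decoder):
    """Single-sample Monte Carlo estimate of Equation (A9).

    x       : (M, N, 4) tensor of one-hot encoded outcomes x^k_{ij}
    encoder : callable mapping x to (mu, xi), each of shape (M, N),
              with xi = 2 * log(sigma)  (illustrative interface)
    decoder : callable mapping latents z of shape (M, N) to unnormalized
              scores of shape (M, N, 4)  (illustrative interface)
    """
    mu, xi = encoder(x)
    eps = torch.randn_like(mu)                    # eps ~ N(0, I)
    z = torch.exp(0.5 * xi) * eps + mu            # reparametrization trick

    # Pi = log(pi); log-softmax over j enforces pi_ij >= 0 and sum_j pi_ij = 1
    Pi = torch.log_softmax(decoder(z), dim=-1)    # (M, N, 4)

    # First term of Eq. (A9): sum over k, i, j of x^k_ij * Pi_ij
    reconstruction = (x * Pi).sum()

    # Second term: closed-form KL of N(mu, e^xi) from N(0, 1), summed over k, i
    kl = 0.5 * (torch.exp(xi) + mu**2 - xi - 1.0).sum()

    return reconstruction - kl
```

Maximizing the ELBO then amounts to minimizing `-elbo_estimate(x, encoder, decoder)` with a gradient-based optimizer.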

A visual representation of the VAE architecture is shown in Figure A1.

**Figure A1.** Architecture of the variational autoencoder.

To solve the optimization problem, we use the Adam optimizer [113] with standard parameters (lr = 0.001, $\beta\_1$ = 0.9, $\beta\_2$ = 0.999). For the encoder and decoder, we use fully connected neural networks with two hidden layers of 256 neurons each. We train the VAE with batches of 100,000 samples for 750 epochs.
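A minimal sketch of this setup is given below, assuming the PyTorch interface of the ELBO estimator above. The hidden-layer widths, optimizer settings, batch size, and number of epochs follow the values stated in the text; the number of sites `N`, the ReLU activations, and the way the encoder output is split into $\mu$ and $\xi$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

N = 16  # number of measured sites (illustrative value)

class Encoder(nn.Module):
    """Fully connected encoder: one-hot outcomes -> (mu, xi)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4 * N, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 2 * N),
        )

    def forward(self, x):
        out = self.net(x.flatten(start_dim=1))    # (M, 2N)
        mu, xi = out.chunk(2, dim=-1)             # split into mu and xi = 2 log sigma
        return mu, xi

class Decoder(nn.Module):
    """Fully connected decoder: latent z -> unnormalized scores per outcome."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 4 * N),
        )

    def forward(self, z):
        return self.net(z).view(-1, N, 4)         # (M, N, 4)

encoder, decoder = Encoder(), Decoder()
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()),
    lr=0.001, betas=(0.9, 0.999),
)

# Training loop skeleton: 750 epochs, batches of 100,000 samples
# for epoch in range(750):
#     for x in data_loader:                               # x: (100_000, N, 4)
#         loss = -elbo_estimate(x, encoder, decoder)      # see sketch above
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
```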

#### **Appendix B. Sampling from POVM-Induced Mass Function**

The mass function $P[\alpha\_1, \alpha\_2, \dots, \alpha\_N]$ induced by the POVM has the form of a matrix product state. Thus, one can easily calculate any marginal mass function, because a summation over any $\alpha\_i$ can be performed locally. Any conditional mass function can also be calculated from the marginal mass functions. One can therefore compute the chain decomposition of the whole mass function:

$$P[\alpha\_1, \alpha\_2, \dots, \alpha\_N] = P[\alpha\_N] \, P[\alpha\_{N-1} | \alpha\_N] \, P[\alpha\_{N-2} | \alpha\_{N-1}, \alpha\_N] \dots P[\alpha\_1 | \alpha\_2, \dots, \alpha\_N] . \tag{A10}$$

With this decomposition, one can produce a sample $\tilde{\alpha}\_N$ from $P[\alpha\_N]$ first, then a sample $\tilde{\alpha}\_{N-1}$ from $P[\alpha\_{N-1}|\tilde{\alpha}\_N]$, and continue up to the end of the chain. The obtained set $\{\tilde{\alpha}\_1, \tilde{\alpha}\_2, \dots, \tilde{\alpha}\_N\}$ is a valid sample from the mass function.
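The sketch below illustrates this ancestral sampling loop. The helper `conditional`, which returns the distribution $P[\alpha\_i | \tilde{\alpha}\_{i+1}, \dots, \tilde{\alpha}\_N]$ obtained from local contractions of the matrix product state, is a hypothetical interface assumed here rather than shown; the four entries of each returned vector correspond to the four POVM outcomes $j = 1, \dots, 4$ used above.

```python
import numpy as np

def sample_chain(conditional, N, rng=None):
    """Draw one sample (alpha_1, ..., alpha_N) following Equation (A10).

    conditional(i, fixed) is assumed to return the length-4 probability vector
    P[alpha_i | alpha_{i+1}, ..., alpha_N] given the outcomes already stored in
    the dict `fixed` (a hypothetical helper built on the MPS marginals).
    """
    if rng is None:
        rng = np.random.default_rng()
    fixed = {}
    # Sample alpha_N first, then condition each earlier alpha on the later ones
    for i in range(N, 0, -1):
        probs = conditional(i, fixed)
        fixed[i] = rng.choice(4, p=probs)
    return [fixed[i] for i in range(1, N + 1)]
```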
