**3. Distributed Entropy-Driven Exploration for Sparse Bayesian Learning**

The learning algorithm described in the previous section estimates the parameters of the model *w* and *γ* given the measurements *y* and *X*. In the following, we focus on the question of how a new measurement is acquired in an optimal fashion. As we will show, the main criterion for this purpose is the information gain or, more specifically, the change in entropy as a function of a candidate sampling location.

#### *3.1. D-Optimality*

One possible strategy for optimally selecting a new measurement location $\tilde{\mathbf{x}}$ is provided by the theory of optimal experiment design. Optimal experiment design aims at optimizing the variance of an estimator through a number of optimality criteria. One of these criteria is the so-called D-optimality: it measures the "size" of an estimator covariance matrix by computing the volume of the corresponding uncertainty ellipsoid. More specifically, the determinant (or rather the logarithm of the determinant) of the covariance matrix is computed, which can then be optimized with respect to the experiment parameters. In our case, the covariance matrix $\boldsymbol{\Sigma}_w$ of the model parameters *w* is readily given in (12) as the second central moment of $p(\mathbf{w}|\mathbf{y})$. Thus, the D-optimality criterion can be formulated as

$$\min \log |\boldsymbol{\Sigma}_w(\mathbf{X}, \boldsymbol{\Pi})|, \tag{30}$$

where the dependency of $\boldsymbol{\Sigma}_w$ on the measurement locations $\mathbf{X}$ has been made explicit. Note that due to the normality of the posterior pdf $p(\mathbf{w}|\mathbf{y})$, the term $\log|\boldsymbol{\Sigma}_w(\mathbf{X}, \boldsymbol{\Pi})|$ is proportional to the entropy of $\mathbf{w}$; thus, minimization of the criterion (30) implies a reduction of the entropy of the parameter estimates. Note that in contrast to [14], the covariance matrix is not approximated here, but computed exactly based on the resulting probabilistic inference model. Our intention is now to evaluate and optimize (30) as a function of the new possible sampling location $\tilde{\mathbf{x}}$.
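Since $p(\mathbf{w}|\mathbf{y})$ is Gaussian, its differential entropy is $\tfrac{1}{2}\log\big((2\pi e)^N |\boldsymbol{\Sigma}_w|\big)$, so the log-determinant in (30) and the entropy share the same minimizer. A minimal numpy sketch of this relation (the covariance values are illustrative, not from the paper):

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a Gaussian: 0.5 * log((2*pi*e)^n |cov|)."""
    n = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    assert sign > 0, "covariance must be positive definite"
    return 0.5 * (n * np.log(2 * np.pi * np.e) + logdet)

# Toy posterior covariance of the weights w (illustrative numbers)
Sigma_w = np.array([[2.0, 0.3],
                    [0.3, 1.0]])

d_opt = np.linalg.slogdet(Sigma_w)[1]   # D-optimality objective log|Sigma_w|
H = gaussian_entropy(Sigma_w)           # entropy of p(w|y)
# H = 0.5 * d_opt + const, so minimizing (30) reduces the entropy of w
```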

Let us consider a modification of the model (7) as a function of the location $\tilde{\mathbf{x}}$. Incorporating $\tilde{\mathbf{x}}$ into (7) implies that the design matrix $\boldsymbol{\Phi}$ is extended as

$$\tilde{\boldsymbol{\Phi}}([\mathbf{X}^T, \tilde{\mathbf{x}}]^T, [\boldsymbol{\Pi}^T, \tilde{\pi}]^T) = \begin{bmatrix} \boldsymbol{\Phi}(\mathbf{X}, \boldsymbol{\Pi}) & \boldsymbol{\phi}(\mathbf{X}, \tilde{\pi}) \\ \boldsymbol{\phi}^T(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) & \phi(\tilde{\mathbf{x}}, \tilde{\pi}) \end{bmatrix}, \tag{31}$$

where $\tilde{\pi}$ is a new parameterization of a basis function $\phi$ based on the new location $\tilde{\mathbf{x}}$, i.e., a new regression feature. Let us stress that in general, the potential measurement at $\tilde{\mathbf{x}}$ does not have to lead to a new column in (31): the columns, i.e., the basis functions in $\boldsymbol{\Phi}$, can be fixed from the initial design of the problem. In the latter case, $\boldsymbol{\Phi}$ is extended only by a row vector $\boldsymbol{\phi}^T(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) = [\phi(\tilde{\mathbf{x}}, \pi_1), \ldots, \phi(\tilde{\mathbf{x}}, \pi_N)]$. However, a basis function whose current weight estimate is zero might be useful for explaining the new measurement value at $\tilde{\mathbf{x}}$ and might thus be activated. Our next step is to consider how the D-optimality criterion is updated in each of these two cases.
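To make the two extension modes concrete, the following sketch builds the extended design matrix of (31) with numpy. The Gaussian basis function, its width, and all locations are illustrative assumptions; the paper leaves $\phi$ generic:

```python
import numpy as np

def phi(x, pi, width=0.5):
    """Hypothetical Gaussian basis function; returns the matrix [phi(x_i, pi_j)]."""
    return np.exp(-0.5 * (np.subtract.outer(np.asarray(x, float),
                                            np.asarray(pi, float)) / width) ** 2)

X = [0.0, 1.0, 2.0]        # existing measurement locations (M = 3)
Pi = [0.5, 1.5]            # existing feature parameters, e.g., centers (N = 2)
x_new, pi_new = 2.5, 2.5   # candidate location and its associated new feature

# Extended design matrix of eq. (31): a new row for the measurement at x_new
# and a new column for the feature pi_new.
Phi_ext = np.block([
    [phi(X, Pi),       phi(X, [pi_new])],
    [phi([x_new], Pi), phi([x_new], [pi_new])],
])

# Measurement-only case: the feature set is fixed, so only a row is appended.
Phi_row_only = np.vstack([phi(X, Pi), phi([x_new], Pi)])
```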


#### 3.1.1. Measurement-Only Update of the D-Optimality Criterion

We begin by considering the update of the D-optimality criterion with respect to a new measurement location $\tilde{\mathbf{x}}$, assuming that only the number of rows in $\boldsymbol{\Phi}$ grows, while the number of features stays constant. In this case, (31) reduces to

$$\tilde{\boldsymbol{\Phi}}([\mathbf{X}^T, \tilde{\mathbf{x}}]^T, \boldsymbol{\Pi}) = \begin{bmatrix} \boldsymbol{\Phi}(\mathbf{X}, \boldsymbol{\Pi}) \\ \boldsymbol{\phi}^T(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) \end{bmatrix}. \tag{32}$$

Based on (32), the new covariance matrix $\tilde{\boldsymbol{\Sigma}}_w$ that accounts for the new measurement location $\tilde{\mathbf{x}}$ can be computed as

$$\tilde{\boldsymbol{\Sigma}}_w(\mathbf{X}, \boldsymbol{\Pi}, \tilde{\mathbf{x}}) = \left( \tilde{\boldsymbol{\Phi}}^T([\mathbf{X}^T, \tilde{\mathbf{x}}]^T, \boldsymbol{\Pi})\, \tilde{\boldsymbol{\Lambda}}\, \tilde{\boldsymbol{\Phi}}([\mathbf{X}^T, \tilde{\mathbf{x}}]^T, \boldsymbol{\Pi}) + \hat{\boldsymbol{\Gamma}}^{-1} \right)^{-1}, \tag{33}$$

where $\tilde{\boldsymbol{\Lambda}} = \mathrm{diag}\{\boldsymbol{\Lambda}, \tilde{\lambda}\} \in \mathbb{R}^{(M+1)\times(M+1)}$ and $\tilde{\lambda}$ is the assumed noise precision at the potential measurement location. It is worth noting that every measurement is assumed to be corrupted by independent white Gaussian noise.

By combining terms that depend on $\tilde{\mathbf{x}}$, we can represent (33) as

$$\begin{split} \tilde{\boldsymbol{\Sigma}}_w(\mathbf{X},\boldsymbol{\Pi},\tilde{\mathbf{x}})^{-1} &= \left[ \boldsymbol{\Phi}^{T}\boldsymbol{\Lambda}\boldsymbol{\Phi} + \hat{\boldsymbol{\Gamma}}^{-1} \right] + \tilde{\lambda}\, \boldsymbol{\phi}(\tilde{\mathbf{x}},\boldsymbol{\Pi})\boldsymbol{\phi}^{T}(\tilde{\mathbf{x}},\boldsymbol{\Pi}) \\ &= \boldsymbol{\Sigma}_{w}^{-1} + \tilde{\lambda}\, \boldsymbol{\phi}(\tilde{\mathbf{x}},\boldsymbol{\Pi})\boldsymbol{\phi}^{T}(\tilde{\mathbf{x}},\boldsymbol{\Pi}). \end{split} \tag{34}$$

As we see from (34), the addition of a new measurement row causes a rank-1 perturbation of the information matrix $\boldsymbol{\Sigma}_w^{-1}$. Using the matrix determinant lemma [37], we can thus compute

$$\log \left| \tilde{\boldsymbol{\Sigma}}_w(\mathbf{X}, \boldsymbol{\Pi}, \tilde{\mathbf{x}}) \right| = -\log \left| \boldsymbol{\Sigma}_w^{-1} + \tilde{\lambda}\, \boldsymbol{\phi}(\tilde{\mathbf{x}}, \boldsymbol{\Pi})\boldsymbol{\phi}^T(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) \right| \tag{35}$$

$$= \log \left| \boldsymbol{\Sigma}_w \right| - \log \left( 1 + \tilde{\lambda}\, \boldsymbol{\phi}^T(\tilde{\mathbf{x}}, \boldsymbol{\Pi})\, \boldsymbol{\Sigma}_w\, \boldsymbol{\phi}(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) \right). \tag{36}$$

Note that $\boldsymbol{\Sigma}_w$ is independent of $\tilde{\mathbf{x}}$, and thus only the second term on the right-hand side of (36) is relevant for selecting the new sampling location.
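The identity in (35) and (36) can be checked numerically. A small sketch with random, illustrative values (variable names mirror the symbols above):

```python
import numpy as np

rng = np.random.default_rng(1)

N = 4                                  # number of features
A = rng.standard_normal((N, N))
Sigma_w = A @ A.T + np.eye(N)          # current posterior covariance (SPD)
phi_x = rng.standard_normal(N)         # phi(x_tilde, Pi) at a candidate location
lam = 2.0                              # assumed noise precision lambda_tilde

# Rank-1 information update of eq. (34), then the log-determinant directly
Sigma_new = np.linalg.inv(np.linalg.inv(Sigma_w) + lam * np.outer(phi_x, phi_x))
lhs = np.linalg.slogdet(Sigma_new)[1]

# Matrix determinant lemma, eq. (36)
rhs = np.linalg.slogdet(Sigma_w)[1] - np.log1p(lam * phi_x @ Sigma_w @ phi_x)
# lhs and rhs agree up to floating-point error
```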

Finally, the D-optimality criterion with respect to a location $\tilde{\mathbf{x}}$ can be formulated as

$$\underset{\tilde{\mathbf{x}}}{\arg\min}\, \log |\tilde{\boldsymbol{\Sigma}}_w| \equiv \underset{\tilde{\mathbf{x}}}{\arg\max}\, \log \left( 1 + \tilde{\lambda}\, \boldsymbol{\phi}^T(\tilde{\mathbf{x}}, \boldsymbol{\Pi})\, \boldsymbol{\Sigma}_w\, \boldsymbol{\phi}(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) \right) = \underset{\tilde{\mathbf{x}}}{\arg\max}\, \log f(\tilde{\mathbf{x}}, \tilde{\lambda}), \tag{37}$$

where we have exchanged the minimization with a maximization by changing the sign of the cost function, and we have defined $f(\tilde{\mathbf{x}}, \tilde{\lambda}) \triangleq 1 + \tilde{\lambda}\, \boldsymbol{\phi}^T(\tilde{\mathbf{x}}, \boldsymbol{\Pi})\, \boldsymbol{\Sigma}_w\, \boldsymbol{\phi}(\tilde{\mathbf{x}}, \boldsymbol{\Pi})$ for later use.
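In practice, (37) is evaluated over a set of candidate locations and the maximizer is chosen. A sketch under an assumed Gaussian feature set and an assumed diagonal posterior covariance (all numbers illustrative):

```python
import numpy as np

Pi = np.array([0.0, 1.0, 2.0])           # feature centers (illustrative)
Sigma_w = np.diag([0.5, 2.0, 0.1])       # current posterior covariance of w
lam = 4.0                                # assumed noise precision lambda_tilde

def features(x, width=0.5):
    """Hypothetical Gaussian features phi(x, Pi)."""
    return np.exp(-0.5 * ((x - Pi) / width) ** 2)

def f_criterion(x):
    """f(x_tilde, lambda_tilde) = 1 + lam * phi^T Sigma_w phi, as in eq. (37)."""
    p = features(x)
    return 1.0 + lam * p @ Sigma_w @ p

candidates = np.linspace(-1.0, 3.0, 81)
scores = np.array([f_criterion(x) for x in candidates])
x_best = candidates[np.argmax(scores)]
# The scan favors sampling near the center of the most uncertain
# feature (variance 2.0 at Pi = 1.0).
```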

#### 3.1.2. Computation of the D-Optimality Criterion with Addition of a New Feature

The computation of the D-optimality criterion becomes more involved when a measurement at a location $\tilde{\mathbf{x}}$ is associated with a new feature $\tilde{\pi}$. This can happen if, e.g., $\tilde{\pi}$ is the center or location of a new basis function.

Then, based on (31), the new covariance matrix $\tilde{\boldsymbol{\Sigma}}_w$ that accounts for $\tilde{\mathbf{x}}$ and $\tilde{\pi}$ is formulated as

$$\tilde{\boldsymbol{\Sigma}}_w(\mathbf{X}, \boldsymbol{\Pi}, \tilde{\mathbf{x}}, \tilde{\pi}) = \left( \tilde{\boldsymbol{\Phi}}^T([\mathbf{X}^T, \tilde{\mathbf{x}}]^T, [\boldsymbol{\Pi}^T, \tilde{\pi}]^T)\, \tilde{\boldsymbol{\Lambda}}\, \tilde{\boldsymbol{\Phi}}([\mathbf{X}^T, \tilde{\mathbf{x}}]^T, [\boldsymbol{\Pi}^T, \tilde{\pi}]^T) + \begin{bmatrix} \hat{\boldsymbol{\Gamma}}^{-1} & \mathbf{0} \\ \mathbf{0} & \tilde{\gamma}^{-1} \end{bmatrix} \right)^{-1}, \tag{38}$$

where $\tilde{\gamma}$ is the sparsity parameter associated with the new column $[\boldsymbol{\phi}^T(\mathbf{X}, \tilde{\pi}), \phi(\tilde{\mathbf{x}}, \tilde{\pi})]^T$. By combining terms that depend on $\tilde{\mathbf{x}}$, we can represent (38) as

$$\begin{split} \tilde{\boldsymbol{\Sigma}}_w(\mathbf{X},\boldsymbol{\Pi},\tilde{\mathbf{x}},\tilde{\pi})^{-1} &= \begin{bmatrix} \boldsymbol{\Phi}^T\boldsymbol{\Lambda}\boldsymbol{\Phi} + \hat{\boldsymbol{\Gamma}}^{-1} & \boldsymbol{\Phi}^T\boldsymbol{\Lambda}\,\boldsymbol{\phi}(\mathbf{X},\tilde{\pi}) \\ \boldsymbol{\phi}^T(\mathbf{X},\tilde{\pi})\,\boldsymbol{\Lambda}\boldsymbol{\Phi} & \boldsymbol{\phi}^T(\mathbf{X},\tilde{\pi})\,\boldsymbol{\Lambda}\,\boldsymbol{\phi}(\mathbf{X},\tilde{\pi}) + \tilde{\gamma}^{-1} \end{bmatrix} \\ &\quad + \tilde{\lambda} \begin{bmatrix} \boldsymbol{\phi}(\tilde{\mathbf{x}},\boldsymbol{\Pi}) \\ \phi(\tilde{\mathbf{x}},\tilde{\pi}) \end{bmatrix} \begin{bmatrix} \boldsymbol{\phi}(\tilde{\mathbf{x}},\boldsymbol{\Pi}) \\ \phi(\tilde{\mathbf{x}},\tilde{\pi}) \end{bmatrix}^T. \end{split} \tag{39}$$

To simplify the notation, let us define

$$\mathbf{c}(\tilde{\pi}) \triangleq \boldsymbol{\Phi}^T \boldsymbol{\Lambda}\, \boldsymbol{\phi}(\mathbf{X}, \tilde{\pi}), \qquad b(\tilde{\pi}) \triangleq \boldsymbol{\phi}^T(\mathbf{X}, \tilde{\pi})\, \boldsymbol{\Lambda}\, \boldsymbol{\phi}(\mathbf{X}, \tilde{\pi}) + \tilde{\gamma}^{-1}, \tag{40}$$

which can be inserted into (39), leading to

$$\tilde{\boldsymbol{\Sigma}}_w(\mathbf{X}, \boldsymbol{\Pi}, \tilde{\mathbf{x}}, \tilde{\pi})^{-1} = \begin{bmatrix} \boldsymbol{\Sigma}_w^{-1} & \mathbf{c}(\tilde{\pi}) \\ \mathbf{c}^T(\tilde{\pi}) & b(\tilde{\pi}) \end{bmatrix} + \tilde{\lambda} \begin{bmatrix} \boldsymbol{\phi}(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) \\ \phi(\tilde{\mathbf{x}}, \tilde{\pi}) \end{bmatrix} \begin{bmatrix} \boldsymbol{\phi}^T(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) & \phi(\tilde{\mathbf{x}}, \tilde{\pi}) \end{bmatrix}. \tag{41}$$

The first term in (41) describes the contribution of the new feature column to the covariance matrix, while the second term represents the contribution of a measurement at the location $\tilde{\mathbf{x}}$. Let us now insert (41) into the D-optimality criterion (30). By applying the matrix determinant lemma [37] to the resulting expression, we compute

$$\begin{split} \log \left| \tilde{\boldsymbol{\Sigma}}_w(\mathbf{X}, \boldsymbol{\Pi}, \tilde{\mathbf{x}}, \tilde{\pi}) \right| &= -\log \left| \begin{matrix} \boldsymbol{\Sigma}_w^{-1} & \mathbf{c}(\tilde{\pi}) \\ \mathbf{c}^T(\tilde{\pi}) & b(\tilde{\pi}) \end{matrix} \right| \\ &\quad - \log \left( 1 + \tilde{\lambda} \begin{bmatrix} \boldsymbol{\phi}(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) \\ \phi(\tilde{\mathbf{x}}, \tilde{\pi}) \end{bmatrix}^T \begin{bmatrix} \boldsymbol{\Sigma}_w^{-1} & \mathbf{c}(\tilde{\pi}) \\ \mathbf{c}^T(\tilde{\pi}) & b(\tilde{\pi}) \end{bmatrix}^{-1} \begin{bmatrix} \boldsymbol{\phi}(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) \\ \phi(\tilde{\mathbf{x}}, \tilde{\pi}) \end{bmatrix} \right). \end{split} \tag{42}$$

Now, let us consider separately the contributions of the two terms on the right-hand side of (42) to the D-optimality criterion. For the first term, we can use the Schur complement [38] $q(\tilde{\pi}) = b(\tilde{\pi}) - \mathbf{c}^T(\tilde{\pi})\, \boldsymbol{\Sigma}_w\, \mathbf{c}(\tilde{\pi})$, such that the first logarithmic term can be reformulated as

$$\log \left| \begin{matrix} \boldsymbol{\Sigma}_w^{-1} & \mathbf{c}(\tilde{\pi}) \\ \mathbf{c}^T(\tilde{\pi}) & b(\tilde{\pi}) \end{matrix} \right| = -\log |\boldsymbol{\Sigma}_w| + \log q(\tilde{\pi}). \tag{43}$$
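The determinant identity (43) follows from the Schur-complement factorization of the block matrix; it can be verified numerically with illustrative random values:

```python
import numpy as np

rng = np.random.default_rng(2)

N = 3
A = rng.standard_normal((N, N))
Sigma_w = A @ A.T + np.eye(N)            # current posterior covariance (SPD)
c = rng.standard_normal(N)               # c(pi_tilde) from eq. (40)
b = float(c @ Sigma_w @ c) + 1.5         # b(pi_tilde), chosen so that q > 0

M = np.block([[np.linalg.inv(Sigma_w), c[:, None]],
              [c[None, :],             np.array([[b]])]])

q = b - c @ Sigma_w @ c                  # Schur complement q(pi_tilde)

lhs = np.linalg.slogdet(M)[1]                         # log|M|
rhs = -np.linalg.slogdet(Sigma_w)[1] + np.log(q)      # -log|Sigma_w| + log q
# lhs and rhs agree up to floating-point error
```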

Note that $\boldsymbol{\Sigma}_w$ is independent of $\tilde{\mathbf{x}}$ and of $\tilde{\pi}$, a fact that will become useful later. To simplify the second term on the right-hand side of (42), we first apply the inversion rules for structured matrices [39], which allow us to write

$$\begin{bmatrix} \boldsymbol{\Sigma}_w^{-1} & \mathbf{c}(\tilde{\pi}) \\ \mathbf{c}^T(\tilde{\pi}) & b(\tilde{\pi}) \end{bmatrix}^{-1} = \begin{bmatrix} \boldsymbol{\Sigma}_w + \boldsymbol{\Sigma}_w\, \mathbf{c}(\tilde{\pi})\, q(\tilde{\pi})^{-1}\, \mathbf{c}^T(\tilde{\pi})\, \boldsymbol{\Sigma}_w & -\boldsymbol{\Sigma}_w\, \mathbf{c}(\tilde{\pi}) / q(\tilde{\pi}) \\ -\mathbf{c}^T(\tilde{\pi})\, \boldsymbol{\Sigma}_w / q(\tilde{\pi}) & 1/q(\tilde{\pi}) \end{bmatrix}, \tag{44}$$

and thus

$$\begin{split} &\log \left( 1 + \tilde{\lambda} \begin{bmatrix} \boldsymbol{\phi}(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) \\ \phi(\tilde{\mathbf{x}}, \tilde{\pi}) \end{bmatrix}^T \begin{bmatrix} \boldsymbol{\Sigma}_w^{-1} & \mathbf{c}(\tilde{\pi}) \\ \mathbf{c}^T(\tilde{\pi}) & b(\tilde{\pi}) \end{bmatrix}^{-1} \begin{bmatrix} \boldsymbol{\phi}(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) \\ \phi(\tilde{\mathbf{x}}, \tilde{\pi}) \end{bmatrix} \right) \\ &\quad= \log \left( 1 + \tilde{\lambda}\, \boldsymbol{\phi}^T(\tilde{\mathbf{x}}, \boldsymbol{\Pi})\, \boldsymbol{\Sigma}_w\, \boldsymbol{\phi}(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) + \tilde{\lambda} \left( \phi(\tilde{\mathbf{x}}, \tilde{\pi}) - \mathbf{c}^T(\tilde{\pi})\, \boldsymbol{\Sigma}_w\, \boldsymbol{\phi}(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) \right)^2 / q(\tilde{\pi}) \right) \\ &\quad= \log \left( f(\tilde{\mathbf{x}}, \tilde{\lambda}) + \tilde{\lambda} \left( \phi(\tilde{\mathbf{x}}, \tilde{\pi}) - \mathbf{c}^T(\tilde{\pi})\, \boldsymbol{\Sigma}_w\, \boldsymbol{\phi}(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) \right)^2 / q(\tilde{\pi}) \right). \end{split} \tag{45}$$

Finally, after inserting (43) and (45) into (42), the D-optimality criterion with respect to a location $\tilde{\mathbf{x}}$ can be formulated as

$$\underset{\tilde{\mathbf{x}}}{\arg\min}\, \log \left| \tilde{\boldsymbol{\Sigma}}_w(\mathbf{X}, \boldsymbol{\Pi}, \tilde{\mathbf{x}}, \tilde{\pi}) \right| \equiv \underset{\tilde{\mathbf{x}}}{\arg\max}\, \log \left[ q(\tilde{\pi})\, f(\tilde{\mathbf{x}}, \tilde{\lambda}) + \tilde{\lambda} \left( \phi(\tilde{\mathbf{x}}, \tilde{\pi}) - \mathbf{c}^T(\tilde{\pi})\, \boldsymbol{\Sigma}_w\, \boldsymbol{\phi}(\tilde{\mathbf{x}}, \boldsymbol{\Pi}) \right)^2 \right], \tag{46}$$

where we have exchanged the minimization with a maximization by changing the sign of the cost function, and we have dropped $\log|\boldsymbol{\Sigma}_w|$ as it is independent of $\tilde{\mathbf{x}}$ and $\tilde{\pi}$.
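Putting the pieces together, the objective in (46) can be evaluated without ever forming the extended covariance matrix. The sketch below (illustrative random values; variable names mirror the symbols above) also cross-checks the result against a direct log-determinant of (41):

```python
import numpy as np

def new_feature_score(phi_x, phi_new, c, b, Sigma_w, lam):
    """log[q*f + lam*(phi(x,pi) - c^T Sigma_w phi)^2], the objective of eq. (46)."""
    q = b - c @ Sigma_w @ c                    # Schur complement, as in (43)
    f = 1.0 + lam * phi_x @ Sigma_w @ phi_x    # f(x_tilde, lambda_tilde), eq. (37)
    resid = phi_new - c @ Sigma_w @ phi_x      # part of the new basis function
                                               # not explained by existing features
    return np.log(q * f + lam * resid ** 2)

rng = np.random.default_rng(3)
N = 3
A = rng.standard_normal((N, N))
Sigma_w = A @ A.T + np.eye(N)     # current posterior covariance (SPD)
phi_x = rng.standard_normal(N)    # phi(x_tilde, Pi)
phi_new = 0.7                     # phi(x_tilde, pi_tilde), scalar
c = rng.standard_normal(N)        # c(pi_tilde), eq. (40)
b = float(c @ Sigma_w @ c) + 2.0  # b(pi_tilde), chosen so that q > 0
lam = 3.0                         # assumed noise precision

score = new_feature_score(phi_x, phi_new, c, b, Sigma_w, lam)

# Direct route: build the extended information matrix of eq. (41) and take its
# log-determinant; the derivation gives log|Sigma_w_tilde| = log|Sigma_w| - score.
v = np.append(phi_x, phi_new)
M = np.block([[np.linalg.inv(Sigma_w), c[:, None]],
              [c[None, :],             np.array([[b]])]]) + lam * np.outer(v, v)
direct = -np.linalg.slogdet(M)[1]   # log|Sigma_w_tilde|
```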
