**1. Introduction**

In exploration tasks that rely on multi-agent systems operating in complex, unstructured terrains, autonomy plays a key role in reducing potential threats to, or tedious work for, human operators, be it in space exploration, disaster relief, or routine industrial facility inspections. The main objective is to give a human operator more detailed information about the explored area, e.g., in terms of a map, and thus to support further decision making. While multiple agents provide an increased sensing aperture and can potentially collect information more efficiently than a single-agent system, they have to rely more heavily on autonomy to compensate for, e.g., possibly large (or unreliable) communication delays [1] or the complexity of teleoperating multiple agents.

One approach to increasing the autonomy of multi-agent systems consists of using in situ analysis of the collected data, performed with the agents' own computing resources, to decide on future actions. In the context of mapping, such an approach is also known as active information gathering [2,3] or exploration. Note that mapping is generally not restricted to sensing with imaging sensors, such as cameras. The exploration of gas sources [4] or of the magnetic field [5] also falls into this category.

An approach for active information gathering is the focus of the presented work. In the following, we provide an overview of work related to the approach discussed in this paper, the arising challenges, and a proposed solution.

**Citation:** Manss, C.; Kuehner, I.; Shutin, D. Experimental Validation of Entropy-Driven Swarm Exploration under Sparsity Constraints with Sparse Bayesian Learning. *Entropy* **2022**, *24*, 580. https://doi.org/ 10.3390/e24050580

Academic Editors: Yaniv Altshuler, Francisco Camara Pereira and Eli David

Received: 28 March 2022 Accepted: 14 April 2022 Published: 20 April 2022


**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

### *1.1. Related Work*

The objective of active information gathering is to utilize the collected data, represented in terms of a parameterized model, to compute information content as a function of space. This can be done using heuristic approaches, as in [6,7], where the authors modify the random walk strategy by adjusting the movement steps of each robot so as to collect more information. Alternatively, information-theoretic approaches can be used. In [8], the authors use a probabilistic description of the model to steer cameras mounted on multiple unmanned aerial vehicles (UAVs). In this case, the information metric can be computed directly based on statistics of the pixels. The resulting quantity is then used to autonomously coordinate UAVs in an optimal configuration. In [9], the authors propose an *exploration driven by uncertainty* by minimizing the determinant of the covariance matrix for an optimal camera placement for a 3D image. This approach essentially implements an optimal experiment design [10], which in turn relates the determinant of the covariance matrix of the model parameters to the Shannon entropy of Gaussian random variables. This connection has been further explored in [11], where the authors compare criteria for optimal experiment design with mutual information for Gaussian process regression and sensor placement. This leads to a greedy algorithm that uses mutual information for finding optimal sensor placements. An extension of [11] to multiple agents and a decentralized estimation of the mutual information is presented in [2,12]. In the latter, the authors also considered robotic aspects, such as optimal trajectory planning along with information gathering, an approach that has been further investigated in [13].

One of the key elements in experiment design-based information gathering is the ability to compute the covariance structure of the model parameters as a function of space and to evaluate it in a distributed fashion. In [14], the authors studied the information-gathering approach for sparsity-constrained models, i.e., under the assumption that the model parameters are sparse. This required implementing non-smooth $\ell\_1$ constraints in the optimization problem, which in turn made the exact computation of the parameter covariance impossible. Instead, the covariance structure was approximated by locally smoothing the curvature of the objective function. In [14], the method was applied to generalized linear models with sparsity constraints for a distributed computation with two versions of data splitting over agents: homogeneous splitting, also called splitting-over-examples (SOE), and heterogeneous splitting, also called splitting-over-features (SOF). However, although in simulations the method performed better than systematic or random exploration approaches, the used approximation was derived with purely empirical arguments.

#### *1.2. Paper Contribution*

To address this, the exploration problem with sparsity constraints has been cast into a probabilistic framework, in which the parameter covariance can be computed exactly. In [15], we formulated a Bayesian approach toward cooperative sparse parameter estimation for SOF, and in [16] for SOE data splitting. However, the distributed computation of the covariance matrix and information-driven exploration have not been considered so far. With this contribution, we close this gap and study an information-driven exploration strategy that is based on a Bayesian approach toward distributed sparse regression. Specifically,


The rest of the paper is structured as follows. We begin with a model formulation and model learning in Section 2. In Section 3, we discuss a distributed computation of the exploration criterion for the considered regression problem. Afterwards, we outline the experimental setting, the collection of ground truth data, and the sensor calibration in Section 4, as well as the overall system design in Section 5. The experimental results are summarized in Section 6, and Section 7 concludes this work.

#### **2. Distributed Sparse Bayesian Learning**

#### *2.1. Model Definition*

We make use of a classical basis function regression [17] to express an unknown scalar physical process $p(\mathbf{x}) \in \mathbb{R}$, with $\mathbf{x} \in \mathbb{R}^d$ and $d \in \mathbb{N}$. Typically, the process is $d$-dimensional, with $d \in \{2, 3\}$. To represent the process $p(\mathbf{x})$, a set of $N \in \mathbb{N}$ basis functions $\phi\_n(\mathbf{x}, \boldsymbol{\pi}\_n) \in \mathbb{R}$, $n = 1, \dots, N$, is used, where $\boldsymbol{\pi}\_n \in \mathbb{R}^s$ depends on the used basis function and $s$ is the number of parameters per basis function.

Each basis function is parameterized with *πn*, *n* = 1, ... , *N*, which can represent centers of corresponding basis functions, their width, etc. More formally, we assume that

$$p(\mathbf{x}) = \sum\_{n=1}^{N} \phi\_n(\mathbf{x}, \boldsymbol{\pi}\_n) w\_n \tag{1}$$

where $w\_n \in \mathbb{R}$ are generally unknown weights in the representation.

To estimate $w\_n$, $n = 1, \dots, N$, we make $M$ observations of the process $p(\mathbf{x})$ at locations $\mathbf{X} = [\mathbf{x}\_1, \dots, \mathbf{x}\_M]^T \in \mathbb{R}^{M \times d}$. The corresponding $m$-th measurement is then represented as

$$y(\mathbf{x}\_m) = p(\mathbf{x}\_m) + \eta(\mathbf{x}\_m) = \sum\_{n=1}^{N} \phi\_n(\mathbf{x}\_m, \boldsymbol{\pi}\_n) w\_n + \eta(\mathbf{x}\_m),\tag{2}$$

where $\eta(\mathbf{x}\_m) \sim \mathcal{N}(0, \lambda^{-1})$ is an additive sample of white Gaussian noise with a known precision $\lambda \in \mathbb{R}\_+$. By collecting the $M$ measurements in a vector $\boldsymbol{y}(\mathbf{X}) = [y(\mathbf{x}\_1), \dots, y(\mathbf{x}\_M)]^T \in \mathbb{R}^M$, we can reformulate (2) in vector-matrix notation. To this end, we define

$$\Pi \triangleq [\pi\_1, \dots, \pi\_N]^T \in \mathbb{R}^{N \times s},\tag{3}$$

$$\boldsymbol{\phi}\_n(\mathbf{X}, \boldsymbol{\pi}\_n) \triangleq \left[\phi\_n(\mathbf{x}\_1, \boldsymbol{\pi}\_n), \dots, \phi\_n(\mathbf{x}\_M, \boldsymbol{\pi}\_n)\right]^T \in \mathbb{R}^M,\tag{4}$$

$$\boldsymbol{\Phi}(\mathbf{X}, \boldsymbol{\Pi}) \triangleq [\boldsymbol{\phi}\_1(\mathbf{X}, \boldsymbol{\pi}\_1), \dots, \boldsymbol{\phi}\_N(\mathbf{X}, \boldsymbol{\pi}\_N)] \in \mathbb{R}^{M \times N},\tag{5}$$

$$\boldsymbol{w} \triangleq \begin{bmatrix} w\_1, \dots, w\_N \end{bmatrix}^T \in \mathbb{R}^N,\tag{6}$$

which allows us to formulate the measurement model in a vectorized form

$$\boldsymbol{y}(\mathbf{X}) = \boldsymbol{\Phi}(\mathbf{X}, \boldsymbol{\Pi})\, \boldsymbol{w} + \boldsymbol{\eta}(\mathbf{X}),\tag{7}$$

with $\boldsymbol{\eta}(\mathbf{X}) \triangleq [\eta(\mathbf{x}\_1), \dots, \eta(\mathbf{x}\_M)]^T \in \mathbb{R}^M$.
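As an illustration, the vectorized model (7) can be simulated numerically. The sketch below uses Gaussian radial basis functions on a grid of centers with a single fixed width; these concrete choices (grid, width, dimensions) are assumptions for the example and are not prescribed by the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions: N basis functions, M measurements, d = 2 spatial dimensions.
N, M, d = 25, 10, 2
# Assumed basis: Gaussian RBFs on a 5 x 5 grid of centers (the parameters pi_n),
# all with the same assumed width.
centers = np.stack(
    np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5)), axis=-1
).reshape(N, d)
width = 0.2

def design_matrix(X, centers, width):
    """Phi(X, Pi): the M x N matrix of basis functions evaluated at the locations X."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dist / (2 * width ** 2))

# Sparse ground-truth weights: only a few basis functions are active.
w_true = np.zeros(N)
w_true[[3, 12, 20]] = [1.5, -2.0, 1.0]

lam = 100.0                                       # known noise precision lambda
X = rng.uniform(0, 1, size=(M, d))                # measurement locations
Phi = design_matrix(X, centers, width)            # underdetermined: N > M
y = Phi @ w_true + rng.normal(0, lam ** -0.5, M)  # y = Phi w + eta, cf. (7)
```

With $N > M$, this setup is exactly the underdetermined regime that motivates the sparsity constraints introduced next.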

Based on (7), we define the likelihood of the parameters *w* as follows

$$p(y(\mathbf{X})|\boldsymbol{w}) \propto \exp\left\{-\frac{\lambda}{2} \left\| y(\mathbf{X}) - \Phi(\mathbf{X}, \Pi) \mathbf{w} \right\|^2 \right\}.\tag{8}$$

Often, the representation (1) is selected such that $N \gg M$, i.e., the system is underdetermined. This implies that there are infinitely many possible solutions for $\boldsymbol{w}$. A popular approach to restricting the set of solutions consists of introducing sparsity constraints on the parameters. Within the Bayesian framework, this can be achieved by defining a prior over the parameter weights $\boldsymbol{w}$. This leads to a class of probabilistic approaches referred to as Sparse Bayesian Learning (SBL).

The basic idea of SBL is to assign an appropriate prior to the $N$-dimensional vector $\boldsymbol{w}$ such that the resulting maximum a posteriori (MAP) estimate $\widehat{\boldsymbol{w}}$ is sparse, i.e., many of its entries are zero. Typically, SBL specifies a hierarchical factorable prior $p(\boldsymbol{w}|\boldsymbol{\gamma})p(\boldsymbol{\gamma}) = \prod\_{n=1}^{N} p(w\_n|\gamma\_n)p(\gamma\_n)$, where $p(w\_n|\gamma\_n) = \mathcal{N}(w\_n|0, \gamma\_n)$, $n \in \{1, \dots, N\}$ [18–20]. For each $n \in \{1, \dots, N\}$, the hyperparameter $\gamma\_n$, also called a sparsity parameter, regulates the width of $p(w\_n|\gamma\_n)$; the product $p(w\_n|\gamma\_n)p(\gamma\_n)$ defines a Gaussian scale mixture (the authors of [21] extend this framework by generalizing $p(w\_n|\gamma\_n)$ to the probability density function (PDF) of a power exponential distribution, which makes the hierarchical prior a power exponential scale mixture distribution).

Bayesian inference on a linear model with such a hierarchical prior is commonly realized via two types of techniques: MAP estimation of $\boldsymbol{w}$ (Type I estimation; note that many traditional "non-Bayesian" methods for learning sparse representations, such as basis pursuit de-noising or re-weighted $\ell\_p$-norm regressions [22–24], can be interpreted as Type I estimation within the above Bayesian framework [21]) or MAP estimation of $\boldsymbol{\gamma}$ (Type II estimation, also called maximum evidence estimation or the empirical Bayes method). Type II estimation has proven (both theoretically and empirically) to perform consistently better than Type I estimation in the present application context. One reason is that the objective function of a Type II estimator typically exhibits significantly fewer local minima than that of the corresponding Type I estimator and promotes greater sparsity [25]. The hyperprior $p(\gamma\_n)$, $n \in \{1, \dots, N\}$, is usually selected to be non-informative, i.e., $p(\gamma\_n) \propto \gamma\_n^{-1}$ [26–28]. The motivation for this choice is twofold. First, the resulting inference schemes typically demonstrate superior (or similar) performance compared to schemes derived based on other hyperprior selections [21]. Second, very efficient inference algorithms can be constructed and studied [26–30].

In the following, we consider only SBL Type II optimization, as it usually leads to sparser parameter vectors $\boldsymbol{w}$ [21], and we drop the explicit dependencies on the measurements $\mathbf{X}$ and the basis function parameters $\boldsymbol{\Pi}$ to simplify notation. The marginalized likelihood for SBL Type II optimization is therefore

$$p(y|\gamma) = \int\_{-\infty}^{\infty} p(y|w)p(w|\gamma)d\mathbf{w} \propto |\boldsymbol{\Sigma}|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\boldsymbol{y}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{y}\right\},\tag{9}$$

where $\boldsymbol{\Sigma} = \lambda^{-1} \boldsymbol{I} + \boldsymbol{\Phi}\boldsymbol{\Gamma}\boldsymbol{\Phi}^T$, $\boldsymbol{\Gamma} = \operatorname{diag}\{\boldsymbol{\gamma}\}$, and $\boldsymbol{I}$ is the identity matrix. Taking the negative logarithm of (9), we obtain the objective function for SBL Type II optimization in the following form

$$\mathcal{L}(\gamma) = -\log p(y|\gamma) = \log(|\boldsymbol{\Sigma}|) + \boldsymbol{y}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{y}.\tag{10}$$

An estimate of hyperparameters *γ* is then found as

$$
\widehat{\gamma} = \underset{\gamma}{\text{arg min }} \mathcal{L}(\gamma). \tag{11}
$$

Once the estimate $\widehat{\boldsymbol{\gamma}}$ is obtained, the posterior probability density function (PDF) of the parameter weights $\boldsymbol{w}$ can easily be computed: it is known to be Gaussian, $p(\boldsymbol{w}|\boldsymbol{y}, \widehat{\boldsymbol{\gamma}}) = \mathcal{N}(\widehat{\boldsymbol{w}}, \boldsymbol{\Sigma}\_w)$, with the moments given as

$$
\hat{w} = \lambda \Sigma\_w \Phi^T y,\qquad \qquad \Sigma\_w = \left(\lambda \Phi^T \Phi + \hat{\Gamma}^{-1}\right)^{-1},\tag{12}
$$

where $\widehat{\boldsymbol{\Gamma}} = \operatorname{diag}\{\widehat{\boldsymbol{\gamma}}\}$ (see also [18]).
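Given an estimate $\widehat{\boldsymbol{\Gamma}}$, the posterior moments (12) amount to a few lines of linear algebra. The sketch below uses randomly generated data purely for illustration and cross-checks $\widehat{\boldsymbol{w}}$ against an equivalent form that follows from the push-through (Woodbury) identity.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, lam = 10, 25, 100.0
Phi = rng.normal(size=(M, N))
y = rng.normal(size=M)
gamma_hat = rng.uniform(0.1, 1.0, N)   # a given estimate of the sparsity parameters

# Posterior moments (12): p(w | y, gamma_hat) = N(w_hat, Sigma_w).
Sigma_w = np.linalg.inv(lam * Phi.T @ Phi + np.diag(1.0 / gamma_hat))
w_hat = lam * Sigma_w @ Phi.T @ y

# Cross-check: by the push-through (Woodbury) identity, w_hat also equals
# Gamma Phi^T Sigma^{-1} y with Sigma from (9).
Sigma = np.eye(M) / lam + (Phi * gamma_hat) @ Phi.T
assert np.allclose(w_hat, (Phi * gamma_hat).T @ np.linalg.solve(Sigma, y))
```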

## *2.2. Sparse Bayesian Learning with the Automatic Relevance Determination*

The key to a sparse estimate of $\boldsymbol{w}$ is the solution of (11). There are a number of efficient schemes [26–28] for solving this problem. The method that we use in this paper is based on [26]; in the following, we briefly outline this algorithm.

In [26], the authors introduced the reformulated automatic relevance determination (R-ARD) method by using an auxiliary function that upper-bounds the objective function $\mathcal{L}(\boldsymbol{\gamma})$ in (10). Specifically, using the concavity of the log-determinant in (10) with respect to $\boldsymbol{\gamma}$, the former can be represented using a Fenchel conjugate as

$$\log|\boldsymbol{\Sigma}| = \min\_{z} \boldsymbol{z}^T\boldsymbol{\gamma} - h^\*(\boldsymbol{z}),\tag{13}$$

where $\boldsymbol{z} \in \mathbb{R}^N$ is a dual variable and $h^\*(\boldsymbol{z})$ is the dual (or conjugate) function (see also [31] (Chapter 5) or [32]).

Using (13), we can now upper-bound (10) as follows

$$\mathcal{L}(\gamma, z) \triangleq \boldsymbol{z}^T \boldsymbol{\gamma} - \boldsymbol{h}^\*(\boldsymbol{z}) + \boldsymbol{y}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{y} \geq \mathcal{L}(\gamma). \tag{14}$$

Note that for any *γ*, the bound becomes tight when minimized over *z*. This fact is utilized for the numerical estimation of *γ*, which is the essence of the R-ARD algorithm. R-ARD alternates between estimating *z*, which can be found in closed form as [26,31]

$$\hat{\mathbf{z}} = \underset{\mathbf{z}}{\text{arg min }} \mathcal{L}(\hat{\boldsymbol{\gamma}}, \mathbf{z}) = \left. \frac{\partial}{\partial \boldsymbol{\gamma}} \log |\boldsymbol{\Sigma}| \right|\_{\boldsymbol{\gamma} = \hat{\boldsymbol{\gamma}}} = \text{diag}\{\boldsymbol{\Phi}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi}\}, \tag{15}$$

and estimating $\widehat{\boldsymbol{\gamma}}$ as the solution to a convex optimization problem

$$
\hat{\boldsymbol{\gamma}} = \underset{\boldsymbol{\gamma}}{\text{arg min }} \mathcal{L}(\boldsymbol{\gamma}, \hat{\boldsymbol{z}}) = \underset{\boldsymbol{\gamma}}{\text{arg min }} \hat{\boldsymbol{z}}^T \boldsymbol{\gamma} + \boldsymbol{y}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{y}. \tag{16}
$$
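The closed-form dual update (15) can be sanity-checked numerically: $\hat{\boldsymbol{z}}$ should coincide with the gradient of $\log|\boldsymbol{\Sigma}|$ with respect to $\boldsymbol{\gamma}$. A small sketch with random data (all dimensions here are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, lam = 8, 12, 50.0
Phi = rng.normal(size=(M, N))
gamma = rng.uniform(0.5, 2.0, N)

def logdet_sigma(gamma):
    """log|Sigma| with Sigma = I/lam + Phi diag(gamma) Phi^T, cf. (9)."""
    return np.linalg.slogdet(np.eye(M) / lam + (Phi * gamma) @ Phi.T)[1]

# Closed form (15): z = diag(Phi^T Sigma^{-1} Phi).
Sigma = np.eye(M) / lam + (Phi * gamma) @ Phi.T
z_hat = np.diag(Phi.T @ np.linalg.solve(Sigma, Phi))

# The same quantity as a numerical gradient of log|Sigma| w.r.t. gamma.
eps = 1e-6
z_num = np.array([
    (logdet_sigma(gamma + eps * np.eye(N)[n]) -
     logdet_sigma(gamma - eps * np.eye(N)[n])) / (2 * eps)
    for n in range(N)
])
assert np.allclose(z_hat, z_num, atol=1e-4)
```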

In order to solve (16), the authors in [26] proposed to use yet another upper bound on $\mathcal{L}(\boldsymbol{\gamma}, \boldsymbol{z})$. Specifically, by noting that

$$y^T \Sigma^{-1} y = \min\_{w} \lambda \left\| y - \Phi w \right\|^2 + \sum\_{l=1}^{N} \frac{w\_l^2}{\gamma\_l} \tag{17}$$

the cost function in (16) can be bounded with

$$\mathcal{L}(\boldsymbol{w}, \boldsymbol{\gamma}, \hat{\boldsymbol{z}}) \triangleq \lambda \|\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{w}\|^2 + \sum\_{l=1}^{N} \left(\hat{\boldsymbol{z}}\_{l} \boldsymbol{\gamma}\_{l} + \frac{\boldsymbol{w}\_{l}^{2}}{\boldsymbol{\gamma}\_{l}}\right) \geq \mathcal{L}(\boldsymbol{\gamma}, \hat{\boldsymbol{z}}).\tag{18}$$

The right-hand side of (18) is convex in both $\boldsymbol{w}$ and $\boldsymbol{\gamma}$. As such, for any fixed $\boldsymbol{w}$, the optimal solution for $\boldsymbol{\gamma}$ can easily be found as $\gamma\_l = \hat{z}\_l^{-\frac{1}{2}} |w\_l|$, $l = 1, \dots, N$. By inserting the latter into (18), we find the solution for $\boldsymbol{w}$ that minimizes the upper bound $\mathcal{L}(\boldsymbol{w}, \boldsymbol{\gamma}, \hat{\boldsymbol{z}})$ as

$$\hat{w} = \underset{w}{\text{arg min }} \mathcal{L}(w, \hat{\gamma}, \hat{z}) = \underset{w}{\text{arg min }} \lambda \left\| y - \Phi w \right\|^2 + 2 \sum\_{l=1}^{N} \hat{z}\_l^{\frac{1}{2}} \left| w\_l \right|, \tag{19}$$

which can be recognized as a weighted least absolute shrinkage and selection operator (LASSO) cost function. Expression (19) forms the basis for a distributed learning of the SBL parameters, since there exist techniques to optimize a LASSO cost function over a network; these are presented in the following sections.
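A minimal centralized sketch of the resulting re-weighted scheme follows, with the weighted LASSO (19) solved by plain proximal-gradient (ISTA) iterations; this solver choice and all problem dimensions are illustrative assumptions, not prescribed by [26].

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, lam = 20, 40, 100.0
Phi = rng.normal(size=(M, N)) / np.sqrt(M)   # underdetermined dictionary (N > M)
w_true = np.zeros(N)
w_true[[5, 17, 30]] = [2.0, -1.5, 1.0]       # sparse ground truth
y = Phi @ w_true + rng.normal(0, lam ** -0.5, M)

def weighted_lasso_ista(Phi, y, lam, reg, iters=2000):
    """Minimize lam*||y - Phi w||^2 + 2*sum_l reg_l*|w_l| by proximal gradient (ISTA)."""
    L = 2 * lam * np.linalg.norm(Phi, 2) ** 2     # Lipschitz constant of the smooth part
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        u = w - 2 * lam * Phi.T @ (Phi @ w - y) / L
        w = np.sign(u) * np.maximum(np.abs(u) - 2 * reg / L, 0.0)  # soft thresholding
    return w

# R-ARD outer loop: alternate the z-update (15), the weighted LASSO (19),
# and the closed-form gamma-update.
z = np.ones(N)
for _ in range(15):
    w = weighted_lasso_ista(Phi, y, lam, np.sqrt(z))
    gamma = np.abs(w) / np.sqrt(z)
    Sigma = np.eye(M) / lam + (Phi * gamma) @ Phi.T   # Sigma = I/lam + Phi Gamma Phi^T
    z = np.diag(Phi.T @ np.linalg.solve(Sigma, Phi)).copy()

print(np.flatnonzero(np.abs(w) > 1e-2))   # indices of the dominant weights
```

The re-weighting drives the penalties $\sqrt{\hat{z}\_l}$ of irrelevant basis functions up, which progressively zeroes out the corresponding weights.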

#### *2.3. The Distributed Automated Relevance Determination Algorithm for SOF Data Splitting*

The derivation of the distributed R-ARD (D-R-ARD) algorithm for SOF is shown in [14]. Here, we would like to present the main aspects of the distribution paradigm and the resulting algorithm. The main aspect of heterogeneous data splitting is that each agent has its own model. Therefore, the parameter weights $\boldsymbol{w}$ are distributed among $K \in \mathbb{N}$ agents as $\boldsymbol{w} = [\boldsymbol{w}\_1^T, \dots, \boldsymbol{w}\_K^T]^T$, and each agent has its part $\boldsymbol{w}\_k \in \mathbb{R}^{N\_k}$, where $N = \sum\_{k=1}^{K} N\_k$. Likewise, the matrix $\boldsymbol{\Phi}$ is partitioned among the $K$ agents as $\boldsymbol{\Phi} = [\boldsymbol{\Phi}\_1, \dots, \boldsymbol{\Phi}\_K]$, where $\boldsymbol{\Phi}\_k \in \mathbb{R}^{M \times N\_k}$. The SOF model is then formulated as

$$y = \begin{bmatrix} \Phi\_1 & \dots & \Phi\_K \end{bmatrix} \begin{bmatrix} w\_1 \\ \vdots \\ w\_K \end{bmatrix} + \eta = \sum\_{k=1}^K \Phi\_k w\_k + \eta. \tag{20}$$

Similarly, the hyperparameters $\boldsymbol{\gamma}$ are partitioned as $\boldsymbol{\gamma} = [\boldsymbol{\gamma}\_1^T, \dots, \boldsymbol{\gamma}\_K^T]^T$.

The solution to cooperative SOF inference then amounts to computing *z* from (15) and optimizing the upper bound (18) over a network of *K* agents.

Unfortunately, in the case of the SOF model, the dual variable $\boldsymbol{z} = [\boldsymbol{z}\_1^T, \dots, \boldsymbol{z}\_K^T]^T$ in (15) cannot be computed exactly. Instead, it is upper-bounded [14] as $\boldsymbol{z}\_k \leq \tilde{\boldsymbol{z}}\_k$, where $\tilde{\boldsymbol{z}}\_k$ is computed by each agent:

$$\tilde{\mathbf{z}}\_k = \text{diag}\left\{ \boldsymbol{\Phi}\_k^T \boldsymbol{\Lambda} \boldsymbol{\Phi}\_k - \boldsymbol{\Phi}\_k^T \boldsymbol{\Lambda} \boldsymbol{\Phi}\_k \boldsymbol{\Sigma}\_{w,k} \boldsymbol{\Phi}\_k^T \boldsymbol{\Lambda} \boldsymbol{\Phi}\_k \right\}, \tag{21}$$

with $\boldsymbol{\Sigma}\_{w,k} = (\boldsymbol{\Phi}\_k^T \boldsymbol{\Lambda} \boldsymbol{\Phi}\_k + \boldsymbol{\Gamma}\_k^{-1})^{-1}$ and $\boldsymbol{\Lambda} = \lambda \boldsymbol{I}$. This approximation preserves the upper bound in (18). Consequently, (19) can be reformulated for SOF as

$$\hat{w}\_k = \underset{w\_k}{\text{arg min }} \mathcal{L}(w, \hat{z}) = \underset{w\_k}{\text{arg min }} \lambda \left\| y - \sum\_{k=1}^{K} \Phi\_k w\_k \right\|^2 + 2 \sum\_{l=1}^{N\_k} \hat{z}\_{k,l}^{\frac{1}{2}} |w\_{k,l}|, \tag{22}$$

which can be solved distributively via the alternating direction method of multipliers (ADMM) algorithm [33] (Section 8.3). The D-R-ARD algorithm for SOF is summarized in Algorithm 1. When using ADMM to solve for $\boldsymbol{w}\_k$, the only communication between the agents takes place inside the ADMM algorithm. The communication load of the ADMM algorithm for SOF is discussed in [33] (Chapter 8).

#### **Algorithm 1** D-R-ARD for SOF


#### *2.4. The Distributed Automated Relevance Determination Algorithm for SOE Data Splitting*

For SOE, we assume that the measurements $\boldsymbol{y}$ at locations $\mathbf{X}$ are partitioned into $K$ disjoint subsets $\{\boldsymbol{y}\_k(\mathbf{X}\_k), \mathbf{X}\_k\}\_{k=1}^{K}$, each associated with the corresponding agent in the network. Hence, each agent $k$ makes $M\_k$ observations $\boldsymbol{y}\_k(\mathbf{X}\_k) = [y\_{k,1}(\mathbf{x}\_{k,1}), \dots, y\_{k,M\_k}(\mathbf{x}\_{k,M\_k})]^T$ at locations $\mathbf{X}\_k = [\mathbf{x}\_{k,1}, \dots, \mathbf{x}\_{k,M\_k}]^T$, such that $M = \sum\_{k=1}^{K} M\_k$, $\boldsymbol{y} = [\boldsymbol{y}\_1^T, \dots, \boldsymbol{y}\_K^T]^T$, $\mathbf{X} = [\mathbf{X}\_1^T, \dots, \mathbf{X}\_K^T]^T$, $\boldsymbol{\Phi} = [\boldsymbol{\Phi}\_1^T, \dots, \boldsymbol{\Phi}\_K^T]^T$, and $\boldsymbol{\eta} = [\boldsymbol{\eta}\_1^T, \dots, \boldsymbol{\eta}\_K^T]^T$. This allows us to rewrite (7) in an equivalent form as

$$y = \begin{bmatrix} y\_1 \\ \vdots \\ y\_K \end{bmatrix} = \begin{bmatrix} \Phi\_1 \\ \vdots \\ \Phi\_K \end{bmatrix} w + \begin{bmatrix} \eta\_1 \\ \vdots \\ \eta\_K \end{bmatrix} \tag{23}$$

where we assumed that perturbations *ηk*, *k* = 1, ... , *K*, are independent between agents, i.e.,

$$E\left\{\eta\_k \eta\_m^T\right\} = \begin{cases} \mathbf{0} & k \neq m \\ \lambda\_k^{-1} I & k = m. \end{cases}$$

To cast R-ARD in a distributed setting, we need to be able to solve (19) and compute $\hat{\boldsymbol{z}}$ in (15) over a network of agents. To this end, let us define

$$D \triangleq \boldsymbol{\Phi}^T \boldsymbol{\Lambda} \boldsymbol{\Phi} = \sum\_{k=1}^K \boldsymbol{\Phi}\_k^T \lambda\_k \boldsymbol{\Phi}\_k = K \times \underbrace{\frac{1}{K} \sum\_{k=1}^K \boldsymbol{\Phi}\_k^T \lambda\_k \boldsymbol{\Phi}\_k}\_{\text{averaged consensus}},\tag{24}$$

where $\boldsymbol{\Lambda} = \operatorname{diag}[\lambda\_1 \boldsymbol{I}\_1, \dots, \lambda\_K \boldsymbol{I}\_K]$ and $\boldsymbol{I}\_k$ is an identity matrix of size $M\_k \times M\_k$, $k = 1, \dots, K$. We point out that $\boldsymbol{D}$, or rather the last factor in (24), can be computed over a network of agents using an averaged consensus algorithm [34,35].
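A sketch of the averaged-consensus computation of the last factor in (24) follows. The ring topology and the Metropolis weight rule are illustrative assumptions (any doubly stochastic weight matrix on a connected network would do); they are not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(4)
K, N = 4, 6
# Each agent k holds its local term Phi_k^T lambda_k Phi_k (an N x N matrix).
lam_k = rng.uniform(50, 150, K)
local = []
for k in range(K):
    Phi_k = rng.normal(size=(5, N))
    local.append(lam_k[k] * Phi_k.T @ Phi_k)

# Metropolis weights for a ring network: doubly stochastic, so iterated
# averaging converges to the network-wide mean (averaged consensus).
W = np.zeros((K, K))
for k in range(K):
    for j in ((k - 1) % K, (k + 1) % K):
        W[k, j] = 1.0 / 3.0
    W[k, k] = 1.0 - W[k].sum()

x = np.stack(local)                  # consensus state of each agent, K x N x N
for _ in range(200):                 # A consensus iterations
    x = np.einsum('kj,jab->kab', W, x)

D = K * x[0]                         # every agent recovers D = sum_k Phi_k^T lam_k Phi_k
assert np.allclose(D, sum(local))
```

After convergence, every agent holds the same mean, so multiplying by $K$ yields $\boldsymbol{D}$ locally at each agent.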

Next, we apply the Woodbury identity to $\boldsymbol{\Sigma}^{-1}$ to obtain

$$
\Sigma^{-1} = \left(\Lambda^{-1} + \Phi \Gamma \Phi^T\right)^{-1} = \Lambda - \Lambda \Phi \Sigma\_w \Phi^T \Lambda,\tag{25}
$$

where $\boldsymbol{\Sigma}\_w = (\boldsymbol{\Phi}^T\boldsymbol{\Lambda}\boldsymbol{\Phi} + \boldsymbol{\Gamma}^{-1})^{-1}$. Inserting (25) and (24) into (15), we get

$$\hat{\boldsymbol{z}} = \text{diag}\{\boldsymbol{\Phi}^T \boldsymbol{\Lambda} \boldsymbol{\Phi} - \boldsymbol{\Phi}^T \boldsymbol{\Lambda} \boldsymbol{\Phi} \boldsymbol{\Sigma}\_w \boldsymbol{\Phi}^T \boldsymbol{\Lambda} \boldsymbol{\Phi}\} = \text{diag}\{\boldsymbol{D} - D \boldsymbol{\Sigma}\_w \boldsymbol{D}\},\tag{26}$$

where $\boldsymbol{\Sigma}\_w = (\boldsymbol{D} + \boldsymbol{\Gamma}^{-1})^{-1}$. Thus, once $\boldsymbol{D}$ becomes available, $\hat{\boldsymbol{z}}$ can be found distributively using expression (26).
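The equivalence of the distributed form (26) with the direct evaluation of (15), which rests on the Woodbury identity (25), can be verified numerically on random data (dimensions here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
M, N = 12, 8
Phi = rng.normal(size=(M, N))
Lam = np.diag(rng.uniform(10, 100, M))   # (block-)diagonal noise precisions
gamma = rng.uniform(0.5, 2.0, N)

# Direct evaluation of (15): z = diag(Phi^T Sigma^{-1} Phi).
Sigma = np.linalg.inv(Lam) + (Phi * gamma) @ Phi.T
z_direct = np.diag(Phi.T @ np.linalg.solve(Sigma, Phi))

# Distributed form (26): only D = Phi^T Lam Phi is needed.
D = Phi.T @ Lam @ Phi
Sigma_w = np.linalg.inv(D + np.diag(1.0 / gamma))
z_dist = np.diag(D - D @ Sigma_w @ D)

assert np.allclose(z_direct, z_dist)
```

The practical point is that the right-hand side involves only $\boldsymbol{D}$ and $\boldsymbol{\Gamma}$, i.e., quantities every agent can obtain via consensus, whereas the left-hand side requires the full stacked $\boldsymbol{\Phi}$.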

To solve (19) distributively, we first note that for the model (23) the likelihood (8) can be equivalently rewritten as

$$p(y|w) \propto \exp\left\{-\frac{1}{2} \sum\_{k=1}^{K} \lambda\_k ||y\_k - \Phi\_k w||^2\right\}.\tag{27}$$

It is then straightforward to show that the upper bound (18) will take the form

$$\mathcal{L}(\boldsymbol{w}, \boldsymbol{\gamma}, \hat{\boldsymbol{z}}) \triangleq \frac{1}{2} \sum\_{k=1}^{K} \lambda\_{k} \left\| \boldsymbol{y}\_{k} - \boldsymbol{\Phi}\_{k} \boldsymbol{w} \right\|^{2} + \sum\_{l=1}^{N} \left( \hat{z}\_{l} \gamma\_{l} + \frac{w\_{l}^{2}}{\gamma\_{l}} \right) \geq \mathcal{L}(\boldsymbol{\gamma}, \hat{\boldsymbol{z}}).\tag{28}$$

Similarly to (18), for any $w\_l$, $l = 1, \dots, N$, the bound is minimized with respect to $\gamma\_l$ at $\gamma\_l = |w\_l|/\sqrt{\hat{z}\_l}$, $l = 1, \dots, N$. Inserting the latter into (28), we obtain an objective function for estimating $\boldsymbol{w}$:

$$
\hat{w} = \underset{w}{\text{arg min }} \frac{1}{2} \sum\_{k=1}^{K} \lambda\_k \|\Phi\_k w - y\_k\|\_2^2 + 2 \sum\_{l=1}^{N} \sqrt{\hat{z}\_l} \, |w\_l|. \tag{29}
$$

Expression (29) can be readily solved distributively using an ADMM algorithm (see, e.g., [33] (Chapter 8) and [36]). Once $\hat{\boldsymbol{w}}$ is found, the optimal parameter values $\hat{\boldsymbol{\gamma}}$ are obtained as $\hat{\gamma}\_l = \hat{z}\_l^{-\frac{1}{2}} |\hat{w}\_l|$, $l = 1, \dots, N$.

In Algorithm 2, we summarize the key steps of the resulting D-R-ARD algorithm for SOE. As can be seen from Algorithm 2, D-R-ARD includes two optimization loops. The inner optimization loop is an ADMM algorithm, which is guaranteed to converge to a solution [33]. The convergence of the outer loop follows from the convergence of the R-ARD algorithm presented in [26].

#### **Algorithm 2** D-R-ARD for SOE

1: $\hat{z}\_n \leftarrow 1, \ \forall n = 1, \dots, N$
2: Compute $D$ using averaged consensus over $\Phi\_k^T \Lambda \Phi\_k$ as in (24)
3: **while** not converged **do**
4: $\quad \hat{w} \leftarrow \underset{w}{\text{arg min }} \mathcal{L}(w, \gamma, \hat{z})$ (see (29); solved distributively using ADMM [33,36])
5: $\quad \hat{\gamma}\_n \leftarrow |\hat{w}\_n| / \sqrt{\hat{z}\_n}, \ \forall n = 1, \dots, N$
6: $\quad \Sigma\_w \leftarrow (D + \hat{\Gamma}^{-1})^{-1}$
7: $\quad \hat{z} \leftarrow$ (26)
8: **end while**

Communication Load of D-R-ARD

In the D-R-ARD algorithm, two communication steps are required. The first communication step involves the computation of the matrix $\boldsymbol{D}$, for which we leverage an averaged consensus algorithm. There, each of the $A \in \mathbb{N}$ consensus steps requires the transmission of $N(N+1)/2$ floats due to the symmetry of $\boldsymbol{D}$. Note that the number $A$ of averaged consensus iterations can vary depending on the connectivity of the network.

The second communication step involves the iterative estimation of the model parameters. Assuming that the update loop of D-R-ARD requires $I \in \mathbb{N}$ iterations, the distributed estimation of the parameters $\boldsymbol{w}$ with $R \in \mathbb{N}$ ADMM iterations then scales as $\mathcal{O}(I \times ARN)$. Thus, the total communication load of the D-R-ARD algorithm behaves as $AN(N+1)/2 + \mathcal{O}(I \times ARN)$. Please note that this estimate of the communication load assumes that the network structure remains unchanged.
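For concreteness, the two contributions can be tallied for illustrative values of $A$, $N$, $I$, and $R$; all of these numbers are assumptions for the example, not measurements from the experiments.

```python
# A back-of-the-envelope evaluation of the D-R-ARD communication load,
# with illustrative (assumed) values for A, N, I, and R.
A, N, I, R = 30, 100, 15, 50

consensus_floats = A * N * (N + 1) // 2   # one-off computation of D
admm_floats = I * A * R * N               # parameter updates, O(I * A * R * N)
total = consensus_floats + admm_floats

print(f"consensus: {consensus_floats:,} floats, ADMM: {admm_floats:,} floats")
```

In this regime the iterative ADMM term dominates, which is why the one-off consensus computation of $\boldsymbol{D}$ is comparatively cheap.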
