2.1.1. $L^1$ Penalized Boosting

The core step of PKB is the optimization of the regularized loss function (see step 3 of Table 1). Note that $\mathcal{G}$ is the union of the pathway-based learner spaces $\mathcal{G}_1, \mathcal{G}_2, \dots, \mathcal{G}_M$, thus

$$\begin{aligned} \hat{f} &= \arg\min_{f \in \mathcal{G}} L_R(f)\\ &= \arg\min_{\hat{f}_m} \left\{ L_R(\hat{f}_m) : \hat{f}_m = \arg\min_{f \in \mathcal{G}_m} L_R(f),\ m = 1, 2, \dots, M \right\}. \end{aligned}$$
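In other words, step 3 fits the best learner within each pathway subspace and keeps the pathway whose fitted learner attains the smallest penalized loss. A minimal sketch of this outer search, where `solve_subspace` is a hypothetical stand-in for the per-pathway optimization derived below, not the authors' implementation:

```python
import numpy as np

def best_pathway_learner(kernel_matrices, solve_subspace):
    """Search every pathway subspace G_m and keep the best fitted learner.

    kernel_matrices : list of M kernel matrices (N x N), one per pathway
    solve_subspace  : callable fitting the optimal learner in G_m; returns
                      (fitted_learner, penalized_loss) -- a stand-in for
                      the per-pathway optimization in Eqs. (4)-(6)
    """
    best_m, best_f, best_loss = None, None, np.inf
    for m, K_m in enumerate(kernel_matrices):
        f_m, loss_m = solve_subspace(K_m)  # inner arg min over G_m
        if loss_m < best_loss:             # outer arg min over m = 1, ..., M
            best_m, best_f, best_loss = m, f_m, loss_m
    return best_m, best_f
```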

To solve for $\hat{f}$, it suffices to obtain the optimal $\hat{f}_m$ in each pathway-based subspace. By construction of the subspaces, within a given pathway $m$, $f$ takes a parametric form as a linear combination of the corresponding kernel functions. This further reduces the optimization problem to

$$\min_{f \in \mathcal{G}_m} L_R(f) = \min_{\beta, c} \frac{1}{N} \sum_{i=1}^N \frac{q_{i,t}}{2} \left( \frac{h_{i,t}}{q_{i,t}} + K_{m,i}^T \beta + c \right)^2 + \Omega(f) \tag{4}$$

$$= \min_{\beta, c} \frac{1}{N} (\eta_t + K_m \beta + \mathbf{1}_N c)^T W_t (\eta_t + K_m \beta + \mathbf{1}_N c) + \Omega(f), \tag{5}$$

where

$$\begin{aligned} \eta_t &= \left( \frac{h_{1,t}}{q_{1,t}}, \frac{h_{2,t}}{q_{2,t}}, \dots, \frac{h_{N,t}}{q_{N,t}} \right)^T,\\ W_t &= \operatorname{diag}\left( \frac{q_{1,t}}{2}, \frac{q_{2,t}}{2}, \dots, \frac{q_{N,t}}{2} \right),\\ K_m &= \left[ K_m(\mathbf{x}_i^{(m)}, \mathbf{x}_j^{(m)}) \right]_{i,j=1,2,\dots,N}. \end{aligned}$$
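For concreteness, a small numpy sketch that builds $\eta_t$ and $W_t$ from given loss statistics $h_{i,t}$ and $q_{i,t}$ and evaluates the objective in Equation (5); the penalty is taken to be the $L^1$ norm introduced just below, and the function name is illustrative rather than part of PKB's implementation:

```python
import numpy as np

def penalized_loss_eq5(h, q, K_m, beta, c, lam):
    """Evaluate Eq. (5): (1/N) (eta_t + K_m beta + 1_N c)^T W_t (eta_t +
    K_m beta + 1_N c), plus the L1 penalty lam * ||beta||_1 for Omega(f)."""
    N = len(h)
    eta_t = h / q                  # eta_t with entries h_{i,t} / q_{i,t}
    W_t = np.diag(q / 2.0)         # W_t = diag(q_{i,t} / 2)
    r = eta_t + K_m @ beta + c * np.ones(N)
    return (r @ W_t @ r) / N + lam * np.abs(beta).sum()
```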

In Equation (4), $K_{m,i}$ denotes the $i$th column of the kernel matrix $K_m$, and $\mathbf{1}_N$ is an $N \times 1$ vector of ones. We use the $L^1$ norm, $\Omega(f) = \lambda \|\beta\|_1$, as the penalty term, where $\lambda$ is a tuning parameter adjusting the amount of penalty imposed on model complexity. We also prove that, after certain transformations, the optimization can be converted to a LASSO problem without an intercept,

$$\min_{\beta} \frac{1}{N} \|\tilde{\eta} + \tilde{K}_m \beta\|_2^2 + \lambda \|\beta\|_1, \tag{6}$$

where

$$\begin{aligned} \tilde{\eta} &= W_t^{\frac{1}{2}} \left[ I_N - \frac{\mathbf{1}_N \mathbf{1}_N^T W_t}{\operatorname{tr}(W_t)} \right] \eta_t,\\ \tilde{K}_m &= W_t^{\frac{1}{2}} \left[ I_N - \frac{\mathbf{1}_N \mathbf{1}_N^T W_t}{\operatorname{tr}(W_t)} \right] K_m. \end{aligned}$$

Therefore, $\beta$ can be efficiently estimated using existing LASSO solvers. The proof of the equivalence between the two problems is provided in Section 1 of the Supplementary Materials.
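A minimal sketch of the full conversion, assuming numpy and scikit-learn (the helper name and interface are illustrative, not the authors' implementation). Note that scikit-learn's `Lasso` minimizes $\frac{1}{2N}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so Equation (6) corresponds to the response $y = -\tilde{\eta}$, design $X = \tilde{K}_m$, and $\alpha = \lambda/2$:

```python
import numpy as np
from sklearn.linear_model import Lasso

def solve_pathway_lasso(h, q, K_m, lam):
    """Transform Eq. (5) into the intercept-free LASSO of Eq. (6), then solve.

    h, q : length-N arrays of h_{i,t} and q_{i,t} (assumes q > 0)
    K_m  : N x N kernel matrix for pathway m
    lam  : L1 penalty parameter lambda
    """
    N = len(h)
    eta_t = h / q
    w = q / 2.0                                   # diagonal of W_t
    # Projection I_N - 1_N 1_N^T W_t / tr(W_t), which absorbs the intercept c
    P = np.eye(N) - np.outer(np.ones(N), w) / w.sum()
    W_half = np.diag(np.sqrt(w))                  # W_t^{1/2}
    eta_tilde = W_half @ P @ eta_t
    K_tilde = W_half @ P @ K_m
    # Eq. (6) in scikit-learn's parameterization: y = -eta_tilde, alpha = lam/2
    model = Lasso(alpha=lam / 2.0, fit_intercept=False)
    model.fit(K_tilde, -eta_tilde)
    return model.coef_                            # estimated beta
```

Because the projection removes the weighted mean, the intercept $c$ drops out of the problem, which is why the LASSO here can be fit with `fit_intercept=False`.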
