**2. Materials and Methods**

Suppose our observed data are collected from $N$ subjects. For subject $i$, we use a $p$-dimensional vector $\mathbf{x}\_i = (x\_{i1}, x\_{i2}, \dots, x\_{ip})$ to denote the normalized gene expression profile and $y\_i \in \{1, -1\}$ to denote its class label. Similarly, the gene expression levels of a given pathway $m$ with $p\_m$ genes can be represented by $\mathbf{x}\_i^{(m)} = (x\_{i1}^{(m)}, x\_{i2}^{(m)}, \dots, x\_{ip\_m}^{(m)})$, which is a sub-vector of $\mathbf{x}\_i$.

The log loss function is commonly used in binary classification and has the following form:

$$d(y, F(\mathbf{x})) = \log(1 + e^{-yF(\mathbf{x})}),$$

and is minimized by

$$F^\*(\mathbf{x}) = \log \frac{p(y=1|\mathbf{x})}{p(y=-1|\mathbf{x})},$$

which is exactly the log odds function. Thus the sign of an estimated *F*(**x**) can be used to classify sample **x** as 1 or −1. Since genes within the same pathway likely have much stronger interactions than genes in different pathways, in our pathway-based model setting, we assume additive effects across pathways and focus on capturing gene interactions within pathways:

$$F(\mathbf{x}) = \sum\_{m=1}^{M} H\_m(\mathbf{x}^{(m)}),$$

where each $H\_m$ is a nonlinear function that depends only on the expression levels of genes in the $m$th pathway and summarizes that pathway's contribution to the log odds. Due to this additive structure, the model captures gene interactions within each pathway but not across pathways.
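To make the setup concrete, the following minimal sketch evaluates an additive decision function and the corresponding empirical log loss. The data, the pathway index sets, and the placeholder $H\_m$'s are all illustrative, not part of the method:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 50                          # subjects, total genes
X = rng.standard_normal((N, p))         # normalized expression profiles
y = rng.choice([-1, 1], size=N)         # class labels in {1, -1}

# Pathways as lists of gene (column) indices; genes may be shared.
pathways = [np.arange(0, 20), np.arange(15, 35), np.arange(30, 50)]

# Placeholder pathway functions; in PKB each H_m is built from kernel
# base learners over the m-th pathway's genes.
pathway_fns = [lambda Z: Z.mean(axis=1) for _ in pathways]

# F(x) = sum_m H_m(x^(m)): additive across pathways.
Fx = sum(H(X[:, idx]) for H, idx in zip(pathway_fns, pathways))

log_loss = np.mean(np.log1p(np.exp(-y * Fx)))   # mean of d(y_i, F(x_i))
y_hat = np.sign(Fx)                             # classify by the sign of F
```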

Two existing methods, NPR [11] and GAR [12], employed the Gradient Descent Boosting (GDB) framework [18] to estimate the functional form of *F*(**x**) nonparametrically. GDB can be viewed as a functional gradient descent algorithm for minimizing the empirical loss: in each descent iteration, an increment function that best aligns with the negative gradient of the loss (evaluated at each sample point) is selected from a space of base learners and added to the target function *F*(**x**). NPR and GAR extended GDB to be pathway-based by applying the descent step to each pathway separately and selecting the base learner from the pathway that best fits the negative gradient.
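The sketch below illustrates one such pathway-wise GDB descent step. The `fit_base_learner` argument and all names are our own illustration of the idea, not NPR's or GAR's actual implementation:

```python
import numpy as np

def gdb_step(X, y, Fx, pathways, fit_base_learner):
    """One pathway-wise GDB iteration: fit a base learner to the negative
    gradient within each pathway and keep the pathway with the best fit."""
    neg_grad = y / (1.0 + np.exp(y * Fx))   # -dL/dF for the log loss
    best = None
    for m, idx in enumerate(pathways):
        g = fit_base_learner(X[:, idx], neg_grad)   # fitted values at x_i's
        sse = np.sum((neg_grad - g) ** 2)           # alignment with -gradient
        if best is None or sse < best[0]:
            best = (sse, m, g)
    _, m_star, increment = best
    return m_star, increment   # selected pathway and its increment values
```

Here `fit_base_learner` would stand in for a regression tree in NPR or a linear model in GAR.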

NPR and GAR differ in how they construct base learners from each pathway: NPR uses regression trees, whereas GAR uses linear models. Because of its linearity assumption, GAR cannot capture complex interactions among genes in the same pathway. Using regression trees as base learners enables NPR to model such interactions; however, the lack of regularization in its gradient descent step can lead to a selection bias toward larger pathways.

Motivated by NPR and GAR, we propose the PKB model, which employs kernel functions as base learners, optimizes the loss function with a second-order approximation [19] that yields Newton-like descent speed, and incorporates regularization into the pathway selection at each boosting iteration.

## *2.1. PKB Model*

Kernel methods have been applied to a variety of statistical problems, including classification [20], regression [21], dimension reduction [22], and others. Results from the theory of Reproducing Kernel Hilbert Spaces [23] show that kernel functions can capture complex interactions among features. For pathway $m$, we construct a kernel-based function space as the space of base learners:

$$\mathcal{G}\_m = \{ g(\mathbf{x}) = \sum\_{i=1}^N K\_m(\mathbf{x}\_i^{(m)}, \mathbf{x}^{(m)}) \beta\_i + c : \beta\_1, \beta\_2, \dots, \beta\_N, c \in \mathbb{R} \},$$

where $K\_m(\cdot, \cdot)$ is a kernel function that defines the similarity between two samples using only the genes in the $m$th pathway. The overall base learner space is the union of the spaces constructed from the individual pathways: $\mathcal{G} = \cup\_{m=1}^{M} \mathcal{G}\_m$.
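As an illustration, the sketch below builds the per-pathway kernel matrices and evaluates a member of $\mathcal{G}\_m$, continuing the toy setup above (reusing `X` and `pathways`). The Gaussian (RBF) kernel is an assumption for the example, not a prescribed choice:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Pairwise K(a, b) = exp(-gamma * ||a - b||^2) between rows of A and B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

# N x N kernel matrix for each pathway, using only that pathway's genes.
K = [rbf_kernel(X[:, idx], X[:, idx]) for idx in pathways]

def base_learner(m, beta, c, X_new):
    """g(x) = sum_i K_m(x_i^(m), x^(m)) * beta_i + c for each row of X_new."""
    idx = pathways[m]
    return rbf_kernel(X_new[:, idx], X[:, idx]) @ beta + c
```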

Estimation of the target function *F*(**x**) is obtained through iterative minimization of the empirical loss function evaluated at the observed data. The empirical loss is defined as

$$L(\mathbf{y}, \mathbf{F}) = \frac{1}{N} \sum\_{i=1}^{N} d(y\_i, F(\mathbf{x}\_i)),$$

where $\mathbf{F} = (F(\mathbf{x}\_1), F(\mathbf{x}\_2), \dots, F(\mathbf{x}\_N))$. In the rest of this article, we use the bold font of a function to denote the vector of that function evaluated at the observed $\mathbf{x}\_i$'s. Assume that at iteration $t$, the estimated target function is $F\_t(\mathbf{x})$. In the next iteration, we aim to find the best increment function $f \in \mathcal{G}$ and add it to $F\_t(\mathbf{x})$. Expanding the empirical loss at $\mathbf{F}\_t$ to the second order gives the following approximation

$$L\_{\text{approx}}(\mathbf{y}, \mathbf{F}\_t + \mathbf{f}) = L(\mathbf{y}, \mathbf{F}\_t) + \frac{1}{N} \sum\_{i=1}^{N} \left[ h\_{t,i} f(\mathbf{x}\_i) + \frac{1}{2} q\_{t,i} f(\mathbf{x}\_i)^2 \right],\tag{1}$$

where

$$\begin{array}{rcl} h\_{t,i} &=& \dfrac{\partial L(\mathbf{y}, \mathbf{F}\_t)}{\partial F\_t(\mathbf{x}\_i)} = -\dfrac{y\_i}{1 + e^{y\_i F\_t(\mathbf{x}\_i)}},\\ q\_{t,i} &=& \dfrac{\partial^2 L(\mathbf{y}, \mathbf{F}\_t)}{\partial F\_t(\mathbf{x}\_i)^2} = \dfrac{e^{y\_i F\_t(\mathbf{x}\_i)}}{(1 + e^{y\_i F\_t(\mathbf{x}\_i)})^2} \end{array}$$

are the first- and second-order derivatives with respect to each $F\_t(\mathbf{x}\_i)$, respectively. We propose a regularized loss function that incorporates both the approximated loss and a penalty on the complexity of $f$:

$$L\_R(\mathbf{f}) = L\_{\text{approx}}(\mathbf{y}, \mathbf{F}\_t + \mathbf{f}) + \Omega(f) \tag{2}$$

$$= \frac{1}{N}\sum\_{i=1}^{N}\frac{q\_{t,i}}{2}\left(\frac{h\_{t,i}}{q\_{t,i}}+f(\mathbf{x}\_i)\right)^2+\Omega(f)+C(\mathbf{y},\mathbf{F}\_t),\tag{3}$$

where $\Omega(\cdot)$ is the penalty function. Since $f \in \mathcal{G}$ is a linear combination of kernel functions calculated from a specific pathway, the norm of the combination coefficients can be used to define $\Omega(\cdot)$. We consider both $L\_1$ and $L\_2$ norm penalties; the solutions for each penalty option are presented in Sections 2.1.1 and 2.1.2, respectively. $C(\mathbf{y}, \mathbf{F}\_t)$ is a constant term with respect to $f$, so we use only the first two terms of Equation (3) as the working loss function in our algorithms, and for brevity we drop $C(\mathbf{y}, \mathbf{F}\_t)$ from the expression of $L\_R(\mathbf{f})$ in the following sections. Such a penalized boosting step has been employed in several methods (e.g., Johnson and Zhang [24]). Intuitively, the regularized loss function prefers simple solutions that also fit the observed data well, which usually leads to better generalization to unseen data.
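For the $L\_2$ case, minimizing Equation (3) over one pathway's base learners reduces to a weighted ridge regression with targets $-h\_{t,i}/q\_{t,i}$ and weights $q\_{t,i}$. The sketch below computes the derivatives and solves that problem in closed form; the choice $\Omega(f) = \lambda \lVert \beta \rVert\_2^2$ with an unpenalized intercept is our illustrative simplification, and the exact solutions appear in Sections 2.1.1 and 2.1.2:

```python
import numpy as np

def derivatives(y, Fx):
    """h_{t,i} and q_{t,i}: first and second derivatives of the log loss."""
    e = np.exp(y * Fx)
    h = -y / (1.0 + e)
    q = e / (1.0 + e) ** 2
    return h, q

def solve_l2(Km, h, q, lam):
    """Minimize (1/N) sum_i (q_i/2)(h_i/q_i + f(x_i))^2 + lam * ||beta||^2
    over f(x) = Km @ beta + c: a weighted ridge fit with targets -h/q."""
    N = len(h)
    Z = np.column_stack([Km, np.ones(N)])    # design matrix [K_m, intercept]
    w = q / N                                # per-sample weights
    A = Z.T @ (w[:, None] * Z)
    A[:-1, :-1] += 2.0 * lam * np.eye(N)     # penalize beta, not the intercept
    b = Z.T @ (w * (-h / q))
    theta = np.linalg.solve(A, b)
    return theta[:-1], theta[-1]             # beta, c
```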

We then optimize the regularized loss for the best increment direction

$$
\hat{f} = \arg\min\_{f \in \mathcal{G}} L\_R(\mathbf{f}).
$$
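Operationally, this search runs the penalized fit within each pathway's space and keeps the best one; a sketch reusing `solve_l2` and the kernel matrices `K` from above (again under the illustrative $L\_2$ penalty):

```python
import numpy as np

def best_increment(K, h, q, lam):
    """Scan all pathways' base learner spaces; return the f minimizing L_R."""
    best = None
    for m, Km in enumerate(K):
        beta, c = solve_l2(Km, h, q, lam)
        fx = Km @ beta + c                    # f evaluated at the x_i's
        LR = np.mean(q / 2.0 * (h / q + fx) ** 2) + lam * beta @ beta
        if best is None or LR < best[0]:
            best = (LR, m, fx)
    return best   # (regularized loss, selected pathway, increment values)
```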

Given the direction, we find the steepest descent step length by minimizing the original loss function

$$\hat{d} = \arg\min\_{d \in \mathbb{R}^+} L(\mathbf{y}, \mathbf{F}\_t + d\hat{\mathbf{f}}),$$

and update the target function to $F\_{t+1}(\mathbf{x}) = F\_t(\mathbf{x}) + \nu \hat{d} \hat{f}(\mathbf{x})$, where $\nu$ is a learning rate parameter. This fitting procedure is repeated until a pre-specified number of iterations is reached. The complete procedure of the PKB algorithm is shown in Table 1.

**Table 1.** An overview of the Pathway-based Kernel Boosting (PKB) algorithm.

1. Initialize the target function as the optimal constant:

$$F\_0(\mathbf{x}) = \arg\min\_{r \in \mathbb{R}} \frac{1}{N} \sum\_{i=1}^{N} d(y\_i, r)$$

For $t$ from 0 to $T-1$ (maximum number of iterations) do:

2. Calculate the first and second derivatives:

$$h\_{t,i} = -\frac{y\_i}{1 + e^{y\_i F\_t(\mathbf{x}\_i)}}, \quad q\_{t,i} = \frac{e^{y\_i F\_t(\mathbf{x}\_i)}}{(1 + e^{y\_i F\_t(\mathbf{x}\_i)})^2}$$

3. Optimize the regularized loss function in the base learner space:

$$\hat{f} = \arg\min\_{f \in \mathcal{G}} L\_R(\mathbf{f})$$

4. Find the step length with the steepest descent:

$$\hat{d} = \arg\min\_{d \in \mathbb{R}^+} L(\mathbf{y}, \mathbf{F}\_t + d\hat{\mathbf{f}})$$

5. Update the target function:

$$F\_{t+1}(\mathbf{x}) = F\_t(\mathbf{x}) + \nu \hat{d} \hat{f}(\mathbf{x})$$

End For. Return $F\_T(\mathbf{x})$.
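Putting the pieces together, the following compact sketch runs the loop of Table 1, reusing `rbf_kernel`, `derivatives`, and `best_increment` from the sketches above. The RBF kernel, the $L\_2$ penalty, the bracketed line search bounds, and the default parameters are all illustrative assumptions rather than the method's prescribed settings:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def pkb_fit(X, y, pathways, T=50, lam=1.0, nu=0.1, gamma=1.0):
    """PKB-style boosting loop; returns F_T evaluated at the training points."""
    N = len(y)
    K = [rbf_kernel(X[:, idx], X[:, idx], gamma) for idx in pathways]
    n_pos = np.sum(y == 1)
    Fx = np.full(N, np.log(n_pos / (N - n_pos)))      # optimal constant F_0
    loss = lambda F: np.mean(np.log1p(np.exp(-y * F)))
    for t in range(T):
        h, q = derivatives(y, Fx)                     # step 2
        _, m_star, fx = best_increment(K, h, q, lam)  # step 3
        d_hat = minimize_scalar(lambda d: loss(Fx + d * fx),
                                bounds=(0.0, 10.0), method="bounded").x  # step 4
        Fx = Fx + nu * d_hat * fx                     # step 5
    return Fx
```

Note that the optimal constant $F\_0$ has the closed form $\log(n\_+/n\_-)$ under the log loss, which the sketch uses directly instead of a numerical minimization.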
