*3.1. Simulation Studies*

We use simulation studies to assess the performance of PKB. We consider the following three underlying true models:


$$\text{Model 1:}\qquad F(\mathbf{x}) = 2\mathbf{x}_1^{(1)} + 3\mathbf{x}_2^{(1)} + \exp\left(0.8\mathbf{x}_1^{(2)} + 0.8\mathbf{x}_2^{(2)}\right) + 4\mathbf{x}_1^{(3)}\mathbf{x}_2^{(3)}$$


$$\text{Model 2:}\qquad F(\mathbf{x}) = 4\sin\left(\mathbf{x}_1^{(1)} + \mathbf{x}_2^{(1)}\right) + 3\left|\mathbf{x}_1^{(2)} - \mathbf{x}_2^{(2)}\right| + 2\left(\mathbf{x}_1^{(3)}\right)^2 - 2\left(\mathbf{x}_2^{(3)}\right)^2$$


$$\text{Model 3:}\qquad F(\mathbf{x}) = 2 \sum_{m=1}^{10} \left\|\mathbf{x}^{(m)}\right\|_2$$

where *F*(**x**) is the true log odds function and *x*<sub>*i*</sub><sup>(*m*)</sup> represents the expression level of the *i*th gene in the *m*th pathway. The three models cover different functional forms of pathway effects in *F*(**x**), including linear, exponential, polynomial and others. In Models 1 and 2, only two genes in each of the first three pathways are informative for sample classes; in Model 3, only the genes in the first ten pathways are informative. We generated a total of six datasets, two for each model, with different numbers of irrelevant pathways (*M* = 50 and 150) corresponding to different noise levels. In all simulations, we set the pathway size to 5 and the sample size to 900. Gene expression values (the **x**<sub>*i*</sub>'s) were generated from the standard normal distribution. We then calculated the log odds *F*(**x**<sub>*i*</sub>) for each sample and used the median-centered *F*(**x**<sub>*i*</sub>) values to generate the corresponding binary outcomes *y*<sub>*i*</sub> ∈ {−1, 1}; median-centering ensures that the proportions of 1's and −1's are approximately 50%.
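The data-generating procedure can be sketched as follows for Model 1; the array layout and variable names are illustrative, not part of the original study's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical settings mirroring the text: M = 50 pathways of 5 genes
# each and 900 samples.
M, pathway_size, n = 50, 5, 900

# x[i, m, j]: expression of gene j in pathway m for sample i,
# drawn from the standard normal distribution.
x = rng.standard_normal((n, M, pathway_size))

# Log odds under Model 1: linear, exponential, and interaction terms
# built from the first two genes of the first three pathways.
F = (2 * x[:, 0, 0] + 3 * x[:, 0, 1]
     + np.exp(0.8 * x[:, 1, 0] + 0.8 * x[:, 1, 1])
     + 4 * x[:, 2, 0] * x[:, 2, 1])

# Median-center the log odds, then assign binary labels in {-1, +1};
# centering makes the two classes approximately balanced.
y = np.where(F - np.median(F) > 0, 1, -1)
```

Because the labels are thresholded at the median of a continuous quantity, exactly half of the 900 samples fall in each class.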

We divided the generated datasets into three folds and each time used two folds as training data and the remaining fold as testing data. The maximum number of iterations *T* is important to PKB, as an excessively large *T* will likely induce overfitting on the training data and poor prediction on the testing data. We therefore performed nested cross validation within the training data to select *T*. We further divided the training data into three folds and each time trained the PKB model on two folds while monitoring the loss function on the held-out fold at every iteration. We then identified the iteration number *T*∗ with the minimum average held-out loss and applied PKB to the whole training dataset for *T*∗ iterations.
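The selection of *T*∗ from the inner folds can be sketched as below. The loss curves here are fabricated U-shaped functions standing in for the per-iteration held-out losses that PKB would actually record; only the averaging-and-argmin step reflects the procedure described above.

```python
import numpy as np

# val_loss[k, t] stands in for the loss on inner fold k after t+1
# boosting iterations; the real values would come from PKB itself.
rng = np.random.default_rng(1)
T_max, n_folds = 200, 3
t = np.arange(1, T_max + 1)
val_loss = np.stack([
    (t - 80 - 10 * k) ** 2 / 1e4 + 0.3      # loss falls, then rises
    + 0.002 * rng.standard_normal(T_max)    # small monitoring noise
    for k in range(n_folds)
])

# Average the held-out loss curves across the inner folds and take
# the iteration with the minimum average loss as T*.
mean_loss = val_loss.mean(axis=0)
T_star = int(np.argmin(mean_loss)) + 1      # iterations are 1-indexed
```

With the fold minima placed at iterations 80, 90, and 100, the averaged curve bottoms out near iteration 90, which is where *T*∗ lands.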

We first evaluated the ability of PKB to correctly identify relevant pathways. For each simulation scenario, we calculated the average optimal weights across the cross validation runs; the results are shown in Figure 1, where the X-axis represents the pathways and the bar heights represent their weights in the prediction functions. Note that in Models 1 and 2, only the first three pathways were relevant to the outcome, while in Model 3, the first ten pathways were relevant. In all cases, PKB successfully assigned the largest weights to the relevant pathways. Since PKB is an iterative approach, certain pathways irrelevant to the outcome may be selected by chance at some iterations and added to the prediction function. This explains the non-zero weights of the irrelevant pathways; their values, however, are clearly smaller than those of the relevant pathways.

**Figure 1.** Estimated pathway weights by PKB in simulation studies. The X-axis represents pathways and the Y-axis represents estimated weights. Based on the simulation settings, the first three pathways are relevant in Models 1 and 2 and the first ten pathways are relevant in Model 3. *M* represents the number of simulated pathways.

We also applied several commonly used methods to the simulated datasets and compared their prediction accuracy with that of PKB. These included two non-pathway-based methods, Random Forest [25] and SVM [20], and two pathway-based methods, NPR [11] and EasyMKL [14]. The model parameters used for these methods are listed in Section 3 of the Supplementary Materials. For each competing method, we performed cross validation using the same three-fold split of the data as for PKB. The average prediction performance of the methods is summarized in Table 2. The pathway-based methods generally performed better than the non-pathway-based methods in all simulated scenarios. Among the pathway-based methods, the kernel-based EasyMKL performed comparably to the tree-based NPR in Models 1 and 2 but clearly better in Model 3. This is likely due to the functional form of the log odds function *F*(**x**) in Model 3: the genes in relevant pathways enter *F*(**x**) through their *L*<sup>2</sup> norms, which are hard to approximate with regression tree functions but are well captured by kernel methods. In all scenarios, the best performance was achieved by one of the PKB methods. In four of the six scenarios, PKB-*L*<sup>2</sup> produced the smallest prediction errors, while in the other two, PKB-*L*<sup>1</sup> was slightly better. Although PKB-*L*<sup>1</sup> and PKB-*L*<sup>2</sup> had similar performance, PKB-*L*<sup>1</sup> was usually computationally faster: in the optimization step of each iteration, the *L*<sup>1</sup> algorithm only searches for a sparse solution of the *β*'s, which can be done more efficiently than the PKB-*L*<sup>2</sup> update, which involves a matrix inversion.
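The three-fold evaluation protocol used for all methods amounts to the following helper; the function and classifier names are hypothetical, and the toy majority-vote classifier merely stands in for PKB, SVM, Random Forest, and the other competitors.

```python
import numpy as np

# Split the data into n_folds folds, use each fold once as the test
# set, and average the classification error rate over the runs.
def cv_error_rate(X, y, fit_predict, n_folds=3, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        y_hat = fit_predict(X[train], y[train], X[test])
        errors.append(np.mean(y_hat != y[test]))
    return float(np.mean(errors))

# Toy stand-in classifier: predicts the majority training label.
def majority_vote(X_train, y_train, X_test):
    labels, counts = np.unique(y_train, return_counts=True)
    return np.full(len(X_test), labels[np.argmax(counts)])
```

Each real method would supply its own `fit_predict`, so all methods are scored on exactly the same fold assignments.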

**Table 2.** Classification error rate from PKB and competing methods in simulation studies. The numbers below each model represent the number of pathways simulated in the data sets.

