*3.2. Real Data Applications*

We applied PKB to gene expression profiles to predict clinical features in three cancer studies: breast cancer, melanoma and glioma. The clinical variables we considered included tumor grade, tumor site and metastasis status, all of which are of great clinical importance.

We used three common pathway databases: KEGG, Biocarta and Gene Ontology (GO) Biological Process pathways. These databases provide lists of pathways with emphasis on different biological aspects, including molecular interactions and involvement in biological processes. The number of pathways in each database ranges from 200 to 700, and there is considerable overlap between pathways. To eliminate redundant information and control this overlap, we applied a preprocessing step to the databases; details are provided in the Supplementary Materials, Section 4.2.

As in the simulation studies, we compared the performance of the different methods using three-fold cross-validation, following the procedure described in Section 3.1. Most of the methods we considered have tuning parameters. For each method, we searched over a range of parameter configurations and reported the best cross-validation result. More details on the data sets and the implementations can be found in the Supplementary Materials, Section 4. Table 3 shows the classification error rates for all methods. The numbers in bold are the lowest error rates in each column. In four out of five classifications, PKB was the best method (usually with the *L*<sup>1</sup> and *L*<sup>2</sup> methods being the top two). In the remaining case (melanoma, stage), NPR yielded the best result, with the PKB methods ranking second and third.
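The evaluation described above (three-fold cross-validation over a grid of tuning parameters, reporting the lowest error rate for each method) can be illustrated with a minimal sketch. It is not the implementation used in this study: the helper names (`k_fold_indices`, `cv_error`, `best_cv_error`), the toy nearest-neighbour classifier `fit_knn` and the toy data are all assumptions made for illustration, and the exact fold construction of Section 3.1 may differ.

```python
import random

def k_fold_indices(n, k=3, seed=0):
    """Shuffle 0..n-1 and split into k nearly equal folds (assumes n >= k)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_error(fit, X, y, params, k=3, seed=0):
    """Mean misclassification rate over k folds for one parameter configuration."""
    fold_errors = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        # `fit` returns a prediction function trained on the remaining folds.
        predict = fit([X[i] for i in train], [y[i] for i in train], **params)
        wrong = sum(predict(X[i]) != y[i] for i in held_out)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / k

def best_cv_error(fit, X, y, grid, k=3, seed=0):
    """Evaluate every configuration in `grid`; return (lowest error, its parameters)."""
    results = [(cv_error(fit, X, y, params, k, seed), params) for params in grid]
    return min(results, key=lambda r: r[0])

def fit_knn(X_train, y_train, n_neighbors=1):
    """Hypothetical stand-in classifier: majority vote among nearest neighbours."""
    def predict(x):
        by_distance = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, xt)), yt)
            for xt, yt in zip(X_train, y_train)
        )
        votes = [label for _, label in by_distance[:n_neighbors]]
        return max(set(votes), key=votes.count)
    return predict

# Toy data (illustrative only): two well-separated classes of six samples each.
X = [(0.1 * i, 0.1 * i) for i in range(6)] + [(10 + 0.1 * i, 10 + 0.1 * i) for i in range(6)]
y = [0] * 6 + [1] * 6
grid = [{"n_neighbors": 1}, {"n_neighbors": 3}, {"n_neighbors": 5}]

best_err, best_params = best_cv_error(fit_knn, X, y, grid)
print(best_err, best_params)
```

The same fold assignment (fixed `seed`) is reused for every configuration, so that differences in error rate reflect the parameters rather than the split.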

In the following, we describe the data sets and clinical variables in more detail and interpret the results from PKB. For brevity, we focus on three outcomes, one from each data set, and leave the other two to the Supplementary Materials (Section 4.4).

**Table 3.** Classification error rates on real data. The names in parentheses after each data set are the variables used as classification outcomes. The best error rate in each column is highlighted in bold.

