**1. Introduction**

High-throughput genomic technologies have enabled cancer researchers to study the associations between genes and clinical phenotypes of interest. A large number of cancer genomic data sets have been collected with both genomic and clinical information from the patients. Analyses of these data have yielded valuable insights into cancer mechanisms, subtypes, prognosis and treatment response.

Although many methods have been developed to identify genes informative of clinical phenotypes and to build prediction models from these data, the results of single-gene approaches are often difficult to interpret, because one gene is often involved in multiple biological processes, and they are not robust when the signals from individual genes are weak. As a result, pathway-based methods have gained much popularity (e.g., Subramanian et al. [1]). A pathway can be considered as a set of genes that are involved in the same biological process or molecular function. It has been shown that gene-gene interactions may have stronger effects on phenotypes when the genes belong to the same pathway or regulatory network [2]. Many pathway databases are available, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [3], the Pathway Interaction Database [4] and Biocarta [5]. By utilizing pathway information, researchers may aggregate weak signals from the same pathway to identify relevant pathways with better power and interpretability. Many pathway-based methods, such as GSEA [1], LSKM [6] and SKAT [7], focus on testing the significance of pathways. These methods evaluate the statistical significance of each pathway's relevance to the phenotype separately, without considering the effects of other pathways.

Given that many pathways likely contribute to the onset and progression of a disease [8–10], it is of interest to study the contribution of a specific pathway to phenotypes conditional on the effects of other pathways. This is usually achieved by regression models. Wei and Li [11] and Luan and Li [12] proposed two similar models, Nonparametric Pathway-based Regression (NPR) and Group Additive Regression (GAR). Both models employ a boosting framework, construct base learners from individual pathways and perform prediction through additive models. Because of the additivity at the pathway level, these models consider interactions only among genes within the same pathway, not across pathways. Since our proposed method is motivated by these two models, they are described in more detail in Section 2. In genomics data analysis, multiple kernel methods [13,14] are also commonly used when predictors have group structures. In these methods, one kernel is assigned to each group of predictors and a meta-kernel is computed as a weighted sum of the individual kernels. The kernel weights are estimated through optimization and can be considered as a measure of pathway importance. Multiple kernel methods have been used to integrate multi-pathway information or multi-omics data sets and have achieved state-of-the-art performance in predictions of various outcomes [15–17].
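The meta-kernel construction described above can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian (RBF) kernel, the bandwidth `gamma`, the pathway names, the gene-to-pathway assignment and the fixed weights are all hypothetical, whereas in actual multiple kernel methods the weights are estimated through optimization.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix between the rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def meta_kernel(X, pathways, weights, gamma=0.1):
    """Weighted sum of per-pathway kernels.

    `pathways` maps a pathway name to the column indices of its genes;
    `weights` maps the same names to non-negative kernel weights.
    Both are hypothetical inputs chosen for this sketch.
    """
    n = X.shape[0]
    K = np.zeros((n, n))
    for name, cols in pathways.items():
        K += weights[name] * rbf_kernel(X[:, cols], gamma)
    return K

# Toy data: 6 samples, 5 genes, two overlapping pathways (assumed).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
pathways = {"pathway_A": [0, 1, 2], "pathway_B": [2, 3, 4]}
weights = {"pathway_A": 0.7, "pathway_B": 0.3}  # fixed here, not estimated
K = meta_kernel(X, pathways, weights)
```

Because each per-pathway kernel is positive semidefinite and the weights are non-negative, the resulting meta-kernel is again a valid kernel; a larger weight indicates that the corresponding pathway contributes more to the combined similarity.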

In this paper, we propose a Pathway-based Kernel Boosting (PKB) method for sample classification. In our boosting framework, we use the second-order approximation of the loss function instead of the first-order approximation used in the usual gradient descent boosting method, which allows for deeper descent at each step. We introduce two types of regularization (*L*<sub>1</sub> and *L*<sub>2</sub>) for the selection of base learners in each iteration and propose algorithms for solving the regularized problems. In Section 3.1, we conduct simulation studies to evaluate the performance of PKB, along with four competing methods. In Section 3.2, we apply PKB to three cancer genomics data sets, where we use gene expression data to predict several patient phenotypes, including tumor grade, stage, tumor site and metastasis status.
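To make the contrast between first- and second-order approximations concrete, the following minimal sketch compares the two kinds of update on a toy problem. It is not the PKB algorithm: it assumes the logistic loss with labels in {-1, +1}, restricts the update to a single constant shift of all scores, and uses hypothetical function names throughout.

```python
import numpy as np

def log_loss(y, F):
    """Mean logistic loss for labels y in {-1, +1} and scores F."""
    return float(np.mean(np.log1p(np.exp(-y * F))))

def grad_hess(y, F):
    """Per-sample first and second derivatives of the loss w.r.t. F."""
    p = 1.0 / (1.0 + np.exp(y * F))
    return -y * p, p * (1.0 - p)

def gradient_step(y, F, step_size=1.0):
    """First-order update: move against the mean gradient."""
    g, _ = grad_hess(y, F)
    return -step_size * np.mean(g)

def newton_step(y, F):
    """Second-order update: minimizer of sum(g * c + 0.5 * h * c**2) over c."""
    g, h = grad_hess(y, F)
    return -np.sum(g) / np.sum(h)

# Toy labels (assumed): three positive samples and one negative sample.
y = np.array([1.0, 1.0, 1.0, -1.0])
F0 = np.zeros_like(y)

loss_start = log_loss(y, F0)
loss_gradient = log_loss(y, F0 + gradient_step(y, F0))
loss_newton = log_loss(y, F0 + newton_step(y, F0))
```

In this toy case the second-order step lowers the loss more than the unit-step gradient update does. The comparison depends on the step size chosen for the gradient update, so it should be read as an illustration only.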
