*2.2. Methods*

We conduct both marginal and joint analysis, where the former analyzes one gene at a time and the latter analyzes all genes in a single model. Both types of analysis have been extensively conducted in existing cancer modeling studies. As they have different implications and cannot replace each other, we conduct both analyses to generate a more comprehensive understanding of cancer prognosis. We develop a penalized regression-based framework to collectively analyze multiple datasets and identify markers associated with the prognosis of multiple cancer types, while effectively accounting for the similarity across cancers. The overall flowchart of analysis is provided in Figure 1.

Assume that there are *K* cancer types, where the *k*th (*k* = 1, ... ,*K*) type has *n*(*k*) independent subjects. For subject *<sup>i</sup>* with the *<sup>k</sup>*th cancer type, let *<sup>T</sup>*(*k*) *<sup>i</sup>* be the log-transformed survival time and *X*(*k*) *<sup>i</sup>* = - *<sup>X</sup>*(*k*) *<sup>i</sup>*<sup>1</sup> , ... , *<sup>X</sup>*(*k*) *ip* be the *p*-dimensional vector of gene expression measurements. In practical analysis, right censoring is usually present. Denote *<sup>C</sup>*(*k*) *<sup>i</sup>* as the log-transformed censoring time, then we observe *y* (*k*) *<sup>i</sup>* <sup>=</sup> min- *<sup>T</sup>*(*k*) *<sup>i</sup>* ,*C*(*k*) *i* and δ (*k*) *<sup>i</sup>* = *I* - *<sup>T</sup>*(*k*) *<sup>i</sup>* <sup>≤</sup> *<sup>C</sup>*(*k*) *i* with *I*(·) being the indicator function.

**Figure 1.** Flowchart of the proposed integrative analysis of The Cancer Genome Atlas (TCGA) data.

#### 2.2.1. Marginal Analysis

We adopt the accelerated failure time (AFT) model for describing prognosis. It has been one of the most popular choices in high-dimensional survival analysis due to its lucid interpretation and, more importantly, computational simplicity [20]. For a specific cancer type, consider the marginal AFT model for the *j*th measurement as:

$$T\_i^{(k)} = \alpha\_j^{(k)} + X\_{ij}^{(k)} \eta\_j^{(k)} + \varepsilon\_{ij}^{(k)},\tag{1}$$

where <sup>α</sup>(*k*) *<sup>j</sup>* and η (*k*) *<sup>j</sup>* are the unknown intercept and coefficient, and ε (*k*) *ij* is the random error. Assume that for each cancer type, data *X*{*k*} *<sup>i</sup>* , *<sup>y</sup>*{*k*} *<sup>i</sup>* , <sup>δ</sup>{*k*} *i* , *i* = 1, ... , *n*{*k*} have been sorted according to *y* (*k*) *<sup>i</sup>* in an ascending order. Then, the following weighted penalized objective function is proposed to collectively analyze multiple cancer types,

$$\sum\_{k=1}^{K} \left[ \frac{1}{2\boldsymbol{\eta}^{[k]}} \sum\_{i} \boldsymbol{w}\_{i}^{[k]} \left[ \boldsymbol{y}\_{i}^{[k]} - \boldsymbol{a}\_{j}^{[k]} - \boldsymbol{x}\_{ij}^{[k]} \boldsymbol{\eta}\_{j}^{[k]} \right]^{2} \right] + \sum\_{k=1}^{K} \rho\_{\text{MCP}} \left( \boldsymbol{\eta}\_{j}^{(k)}, \boldsymbol{\lambda}\_{1}, \boldsymbol{\gamma} \right) + \frac{\lambda\_{2}}{2} \sum\_{k' \neq k}^{K} \rho \left( \boldsymbol{\eta}\_{j}^{(k)}, \boldsymbol{\eta}\_{j}^{(k')} \right) \tag{2}$$

Here, *<sup>w</sup>*(*k*) *<sup>i</sup>* 's are the Kaplan–Meier (KM) weights for accommodating censoring and defined as

$$w\_1^{(k)} = \frac{\delta\_1^{(k)}}{n^{(k)}}, w\_i^{(k)} = \frac{\delta\_i^{(k)}}{n^{(k)} - i + 1} \prod\_{l=1}^{i-1} \left( \frac{n^{(k)} - l}{n^{(k)} - l + 1} \right)^{\delta\_i^{(k)}}, \text{ i } = 2, \dots, n^{(k)}$$

ρ*MCP*(|*v*|, λ1, γ) = λ<sup>1</sup> |*v*| 0 <sup>1</sup> <sup>−</sup> *<sup>x</sup>* λ1γ +*dx* is the minimax concave penalty (MCP) with tuning parameter λ<sup>1</sup> and regularization parameter γ. We consider two types of ρ - η (*k*) *<sup>j</sup>* , η (*k*) *j* with tuning parameter λ2. The first is the magnitude-based shrinkage penalty with

$$\rho\left(\eta\_{j}^{(k)},\eta\_{j}^{(k')}\right) = \left(\eta\_{j}^{(k)} - s\_{j}^{(kk')}\eta\_{j}^{(k')}\right)^{2},\tag{3}$$

where *s* (*kk*) *<sup>j</sup>* = *I* - Sgn- η (*k*) *j* <sup>=</sup> Sgn- η (*k*) *j* with Sgn(·) being the sign function. The second is the sign-based shrinkage penalty with

$$\rho\left(\eta\_{j}^{(k)},\eta\_{j}^{(k')}\right) = \left(\text{Sgn}\left(\eta\_{j}^{(k)}\right) - \text{Sgn}\left(\eta\_{j}^{(k')}\right)\right)^{2} \tag{4}$$

Based on (2), a total of *p* objective functions are developed, and the estimates are defined as the minimizers of these objective functions. With penalization, some values of η (*k*) *<sup>j</sup>* 's can be shrunk to exactly zero, and variables with nonzero η (*k*) *<sup>j</sup>* 's are identified as important prognostic markers and associated with the *k*th cancer type. The magnitudes and signs of η (*k*) *<sup>j</sup>* 's describe the strengths and directions of associations. Following the literature, the coordinate descent (CD) technique is adopted for effectively optimizing the objective functions. Details are provided in Appendix A.

The objective function (2) analyzes one gene at a time, and enjoys stable estimation and simple optimization. It may be limited by a lack of attention to the interconnections among genes and their joint effects on cancer prognosis. Our brief literature search suggests that marginal analysis is still highly popular in high-dimensional omics studies [21]. For marginal analysis, a two-stage method is often adopted for marker identification, where multiple tests are first performed and a multiple comparison adjustment is then conducted on p values using, for example, the false discovery rate approach. By contrast with this strategy, we adopt the penalization technique, which can generate more stable results and, more importantly, effectively accommodate the similarity across cancer types. Specifically, MCP is used for regularized estimation and marker identification, which has been shown to have satisfactory theoretical and numerical properties. The most significant advancement is the ρ - η (*k*) *<sup>j</sup>* , η (*k*) *j* penalty term which promotes similarity between the estimated coefficients of each cancer pair. Data integration is conducted in the discovery process to facilitate early information borrowing. With the magnitude-based shrinkage penalty (3), the magnitudes of gene effects across cancer types are promoted to be similar if they have the same signs, while with the sign-based shrinkage penalty (4), the signs of gene effects are promoted to be similar. Thus, the proposed two types of ρ - η (*k*) *<sup>j</sup>* , η (*k*) *j* promote different types of similarity, with the former for *quantitative* similarity and the latter for *qualitative* similarity. As in practice the relatedness of cancer types may be not accurately known, both penalties can be useful. λ<sup>1</sup> and λ<sup>2</sup> are two tuning parameters which control the sparsity and similarity of coefficients, respectively. For the *p* objective functions, we impose the same values of λ<sup>1</sup> and λ<sup>2</sup> on different η (*k*) *<sup>j</sup>* to be concordant with joint analysis. If λ<sup>2</sup> = 0, the proposed approach goes back to the unintegrated strategy that analyzes each cancer type separately with MCP.

#### 2.2.2. Joint Analysis

For *k* = 1, ... ,*K*, consider the AFT model with the joint effects of all omics measurements,

$$T\_i^{(k)} = \alpha^{(k)} + \mathbf{X}\_i^{(k)} \boldsymbol{\mathcal{B}}^{(k)} + \boldsymbol{\varepsilon}\_i^{(k)},\tag{5}$$

where α(*k*) is the intercept, β(*k*) = - β (*k*) <sup>1</sup> , ... , β (*k*) *p* is the *p*-dimensional unknown coefficient vector, and ε (*k*) *<sup>i</sup>* is the random error. With the same notations as in the marginal analysis, for estimation, consider the following weighted penalized objective function

$$\sum\_{k=1}^{K} \left[ \frac{1}{2n^{[k]}} \sum\_{i} w\_{i}^{[k]} \left[ y\_{i}^{[k]} - \alpha^{[k]} - \mathbf{X}\_{i}^{[k]} \boldsymbol{\theta}^{[k]} \right]^{2} \right] + \sum\_{k=1}^{K} \sum\_{j=1}^{p} \rho\_{\text{MCP}} \left( \boldsymbol{\beta}\_{j}^{(k)}, \boldsymbol{\lambda}\_{3}, \mathbf{y} \right) + \frac{\lambda\_{4}}{2} \sum\_{k' \neq k} \sum\_{j=1}^{p} \rho \left( \boldsymbol{\beta}\_{j}^{(k)}, \boldsymbol{\beta}\_{j}^{(k')} \right), \tag{6}$$

where λ<sup>3</sup> and λ<sup>4</sup> are the tuning parameters. The KM weights, MCP, and two proposals for ρ β (*k*) *<sup>j</sup>* , β (*k*) *j* are also adopted in (6). The proposed estimate is defined as the minimizer of (6). Variables with nonzero estimates are identified as associated with prognosis. For optimization, the CD algorithm is adopted (Appendix A).

Different from (2), objective function (6) jointly analyzes a large number of genes in a single model and thus accommodates a high dimensionality. Compared to marginal analysis, it advances by taking the combined effects of multiple genes into consideration and better describing the underlying disease biology. However, it involves more complex computation and may lead to less stable results. Penalization is adopted to accommodate high dimensionality and identify important genes. It is perhaps the most popular technique in high dimensional data analysis. Different from the existing studies, the magnitude- and sign-based shrinkage penalty terms are also introduced similarly to that in Section 2.2.1. This can effectively accommodate the similarity across cancer types and facilitate information borrowing.

The proposed analysis can be effectively realized. To facilitate data analysis within and beyond this study, we have developed R code and made it publicly available at www.github.com/shuanggema/ IntePanCancer.

## **3. Results**

#### *3.1. Marginal Analysis*

We analyze the TCGA data using the approach described in Section 2.2.1 with penalties (3) (referred to as A1) and (4) (referred to as A2), as well as an alternative marginal approach A3 which analyzes each cancer type separately with MCP for identifying relevant markers. Comparing with the benchmark A3 can straightforwardly establish the merit of the proposed integrative analysis. Detailed estimation results are provided in the Supplementary Excel file. Different approaches are observed to generate different findings. Specifically, a total of 910 genes with 482 unique ones and 1160 genes with 275 unique ones are identified with A1 and A2, respectively, compared to 2655 genes with 999 unique ones with A3.

In Table 2, we present the top five genes with the largest numbers of associated cancer types and refer to the Supplementary Excel file for more detailed results. It is observed that the numbers of multiple cancer types-related genes identified with A1 and A2 are slightly larger than those with A3. For example, both A1 and A2 identify gene *APH1A* as associated with all nine cancer types, but this gene is missed by A3. Literature search suggests that the identified genes with the proposed A1 and A2 may have important biological implications. For example, the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of gene *APH1A* suggests that it is a member of the notch signaling pathway which has an important impact on developmental and cell fate decisions and is deregulated in human solid tumors [22]. APH1A is one of the four essential components of γ-secretase [23]. γ-secretase is a multiprotein intramembrane-cleaving protease, which can cleave ligand-activated endogenous Notch receptors and is a potential drug target for cancer [24]. Gene *MAPK1*, identified as associated with eight

cancer types by both A1 and A2, has been reported to be involved in many cancer related pathways. MAPK1 is one of the MAP kinases in the MAPK pathway. It can phosphorylate transcription factors, which regulate the expressions of genes involved in cell proliferation and differentiation. Besides, MAPK1 is involved in EGFR tyrosine kinase inhibitor resistance [25], which importantly contributes to the etiology of various types of cancer, such as pancreatic cancer [26], paediatric acute lymphoblastic leukemia [27], and others. In this process, MAPK1 acts as a serine/threonine kinase upstream of FRS2, which plays a role in epidermal growth factor (EGF) signaling [28]. MAPK1 has also been reported to have an impact on the malignant behavior of breast cancer cells. Published studies show that gene *ETV6*, identified as associated with eight and seven cancer types by A1 and A2, respectively, is involved in the transcriptional dysregulation of cancer pathways. The dysregulation of transcription factors can alter the expressions of target genes and lead to the tumorigenic process. For example, ETV6 is a negative regulator of transcription 3 (Stat3) transcription factor activity, which has the ability to mediate the inhabitation of the proliferation of tumor cells [29]. Gene *ETV6* is relevant to multiple cancer types, including breast cancer [30], leukemia [31], non-small cell lung cancer [32], and others. These biological findings provide support to the validity of the proposed integrative analysis.


**Table 2.** Marginal analysis: top five genes with the largest numbers of associated cancer types.

To gain a deeper insight into the identification results, we further calculate the relative overlapping between gene sets associated with different cancer types. Specifically, for two gene sets *A* and *B*, their relative overlapping is defined as ROL(*A*, *<sup>B</sup>*) <sup>=</sup> *<sup>A</sup>*∩*<sup>B</sup> <sup>A</sup>*∪*<sup>B</sup>* , with a larger value indicating a stronger similarity. Results for different approaches are shown in Table 3. The average ROL values are 0.143 (A1), 0.308 (A2), and 0.147 (A3), respectively, suggesting that A2 leads to gene sets with a higher level of relative overlapping and A1 and A3 have comparable performance. Take breast invasive carcinoma (BRCA) and ovarian serous cystadenocarcinoma (OV), which are established as related, as an example. The ROL values for A1, A2, and A3 are 0.150, 0.265, and 0.146, respectively. The proposed A2 can improve the *qualitative* similarity of genes selected for multiple cancer types to a certain extent.

**Table 3.** Marginal analysis: relative overlapping between different cancer types.



**Table 3.** *Cont.*

Beyond identification, we also take a closer look at the estimation results. Specifically, we compute the difference of the estimated coefficient matrices for each cancer pair. Consider the relative Euclidean distance defined as *<sup>p</sup> j*=1 - η (*k*) *<sup>j</sup>* − η (*k*) *j* 2 / *<sup>p</sup> j*=1 - η (*k*) *j* <sup>2</sup> *<sup>p</sup> j*=1 - η (*k*) *j* 2 for *k k* , with a smaller value indicating a stronger similarity. Results for the three approaches are provided in Table 4, with the average values being 1.606 (A1), 1.534 (A2), and 2.254 (A3). The relative Euclidean distances with A1 and A2 are observed to be smaller than those with A3. For example, the distance values between BRCA and OV are 1.443 with A1 and 1.220 with A2, which are much smaller than 3.230 with A3. As another example, for the two squamous cell carcinomas, lung squamous cell carcinoma (LUSC) and head and neck squamous cell carcinoma (HNSC), the relative Euclidean distances are 1.644 (A1), 1.855 (A2), and 2.577 (A3), respectively. To more intuitively describe similarity, we conduct the hierarchical clustering analysis based on the relative Euclidean distances and present the results in Figure A1 (Appendix B). Biologically sensible findings are made, for example, the distance between BRCA and OV decreases after integration.


**Table 4.** Marginal analysis: relative Euclidean distances between estimated coefficient matrices.
