*Article* **Comprehensive Profiling of Genomic and Transcriptomic Differences between Risk Groups of Lung Adenocarcinoma and Lung Squamous Cell Carcinoma**

**Talip Zengin 1,2 and Tu ˘gba Önal-Süzek 2,3,\***


**Abstract:** Lung cancer is the second most frequently diagnosed cancer type and responsible for the highest number of cancer deaths worldwide. Lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) are subtypes of non-small-cell lung cancer which has the highest frequency of lung cancer cases. We aimed to analyze genomic and transcriptomic variations including simple nucleotide variations (SNVs), copy number variations (CNVs) and differential expressed genes (DEGs) in order to find key genes and pathways for diagnostic and prognostic prediction for lung adenocarcinoma and lung squamous cell carcinoma. We performed a univariate Cox model and then lasso-regularized Cox model with leave-one-out cross-validation using The Cancer Genome Atlas (TCGA) gene expression data in tumor samples. We generated 35- and 33-gene signatures for prognostic risk prediction based on the overall survival time of the patients with LUAD and LUSC, respectively. When we clustered patients into high- and low-risk groups, the survival analysis showed highly significant results with high prediction power for both training and test datasets. Then, we characterized the differences including significant SNVs, CNVs, DEGs, active subnetworks, and the pathways. We described the results for the risk groups and cancer subtypes separately to identify specific genomic alterations between both high-risk groups and cancer subtypes. Both LUAD and LUSC high-risk groups have more downregulated immune pathways and upregulated metabolic pathways. On the other hand, low-risk groups have both up- and downregulated genes on cancer-related pathways. Both LUAD and LUSC have important gene alterations such as CDKN2A and CDKN2B deletions with different frequencies. SOX2 amplification occurs in LUSC and PSMD4 amplification in LUAD. EGFR and KRAS mutations are mutually exclusive in LUAD samples. EGFR, MGA, SMARCA4, ATM, RBM10, and KDM5C genes are mutated only in LUAD but not in LUSC. CDKN2A, PTEN, and HRAS genes are mutated only in LUSC samples. The low-risk groups of both LUAD and LUSC tend to have a higher number of SNVs, CNVs, and DEGs. The signature genes and altered genes have the potential to be used as diagnostic and prognostic biomarkers for personalized oncology.

**Keywords:** TCGA; non-small-cell lung cancer; lung adenocarcinoma (LUAD); lung squamous cell carcinoma (LUSC); differential expression; SNV; CNV; risk group; signature; survival

#### **1. Introduction**

Lung cancer is the second most frequently diagnosed cancer type and the leading cause of cancer-related mortality worldwide [1]. Lung cancer treatments used in the clinic are surgery, radiotherapy, chemotherapy, targeted therapy, and emerging immunotherapy. The clinical treatment decisions are made based on tumor stage, histology, genetic alterations of a few driver oncogenes for targeted therapies, and patient's condition [2]. However, most of the patients are diagnosed at an advanced and metastatic stage, with

**Citation:** Zengin, T.; Önal-Süzek, T. Comprehensive Profiling of Genomic and Transcriptomic Differences between Risk Groups of Lung Adenocarcinoma and Lung Squamous Cell Carcinoma. *J. Pers. Med.* **2021**, *11*, 154. https://doi.org/ 10.3390/jpm11020154

Academic Editor: Raghu Sinha

Received: 30 December 2020 Accepted: 19 February 2021 Published: 23 February 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

high mortality and poor benefit from therapies [3]. Although the targeted therapeutics and immunotherapeutics including immune-checkpoint inhibitors are introduced for patients at an advanced stage, these options are beneficial only for limited subsets of patients and these patients still can develop resistance [4]. Therefore, the majority of patients with advanced-stage lung cancer die within 5 years of diagnosis [5].

Histologically there are four major types of lung cancer, including small-cell carcinoma (SCLC), and adenocarcinoma, squamous cell carcinoma, large cell carcinoma as grouped non-small-cell carcinoma (NSCLC). Lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) account for 50% and 23% of all lung cancers, respectively [6]. Lung cancer is both histologically and molecularly heterogeneous disease and characterizing the genomics and transcriptomics of its nature is very important for effective therapies. Lung cancer has many subtypes with distinct genetic characteristics, resulting in intra-tumoral heterogeneity [7].

The Cancer Genome Atlas (TCGA) database serves different types of data such as transcriptome profiling, simple nucleotide variation, copy number variation, DNA methylation, clinical and biospecimen data of 84,392 cancer patients with 68 primary sites [8]. The Cancer Genome Atlas Research Network reported molecular profiling of 230 lung adenocarcinoma samples using mRNA, microRNA and DNA sequencing integrated with copy number, methylation and proteomic analyses. They identified 18 significantly mutated genes, including TP53, KRAS which is mutually exclusive with EGFR, BRAF, PIK3CA, MET, STK11, KEAP1, NF1, RB1, CDKN2A, GTPase gene RIT1, including activating mutations and MGA including loss-of-function mutations. DNA and mRNA sequence from the same tumor highlighted splicing alterations including exon 14 skipping in MET mRNA in 4% of cases. They also showed DNA hyper-methylation of several key genes: CDKN2A, GATA2, GATA4, GATA5, HIC1, HOXA9, HOXD13, RASSF1, SFRP1, SOX17, WIF1, and MYC over-expression was significantly associated with the hyper-methylation phenotype as well [9].

The Cancer Genome Atlas Research Network also profiled 178 lung squamous cell carcinomas and detected mutations in 11 genes, including mutations in TP53 (81%), CDKN2A, PTEN, PIK3CA, KEAP1, MLL2, HLA-A, NFE2L2, RB1, NOTCH1 including truncating mutations and loss-of-function mutations in the HLA-A class I major histocompatibility gene. They identified altered pathways such as NFE2L2 and KEAP1 in 34%, squamous differentiation genes in 44%, PI3K pathway genes in 47%, and CDKN2A and RB1 in 72% of tumors. CNV analysis revealed the amplification of NFE2L2, MYC, CDK6, MDM2, BCL2L1 and EYS, and deletions of FOXP1, PTEN and NF1 genes with previously identified CNV genes, SOX2, PDGFRA, KIT, EGFR, FGFR1, WHSC1L1, CCND1, and CDKN2A. They identified overexpression and amplification of SOX2 and TP63, loss-of-function mutations in NOTCH1, NOTCH2 and ASCL4 and focal deletions in FOXP1 which have known roles in squamous cell differentiation. CDKN2A is downregulated in over 70% of samples through epigenetic silencing by methylation (21%), inactivating mutation (18%), exon 1β skipping (4%), or homozygous deletion (29%) [10].

Recently, many studies have been published on gene expression signatures predicting the survival risk of patients with lung adenocarcinoma. These recent studies have been mostly using TCGA data, but their methods generated different gene signatures. Seven-gene expression signature including ASPM, KIF15, NCAPG, FGFR1OP, RAD51AP1, DLGAP5 and ADAM10 genes, was obtained for early stage cases from seven published lung adenocarcinoma cohorts and the signature showed high hazard rations in Cox regression analysis [11]. Shukla et al. developed TCGA RNAseq data-based prognostic signature including four protein-coding genes RHOV, CD109, FRRS1, and the lncRNA gene LINC00941, which showed high hazard ratios for stage I, EGFR wild-type, and EGFR mutant groups [12]. A prognostic signature that was independent of other clinical factors, was developed and validated based on the TCGA data. Patients were grouped into risk groups using signature genes, and patients with high-risk scores tended to have poor survival rate at 1-, 3- and 5-year follow-up. The developed eight-gene signature including

TTK, HMMR, ASPM, CDCA8, KIF2C, CCNA2, CCNB2, and MKI67 were highly expressed in A549 and PC-9 cells [13].

Twelve-gene signature (RPL22, VEGFA, G0S2, NES, TNFRSF25, DKFZP586P0123, COL8A2, ZNF3, RIPK5, RNFT2, ARHGEF12 and PTPN20A/B) was established by using published microarray dataset from 129 patients and the signature was independently prognostic for lung squamous carcinoma but not for lung adenocarcinoma [14]. A four-gene clustering model in 14-Genes (DPPA, TTTY16, TRIM58, HKDC1, ZNF589, ALDH7A1, LINC01426, IL19, LOC101928358, TMEM92, HRASLS, JPH1, LOC100288778, GCGR) was established and these genes plays role in positive regulation of ERK1 and ERK2 cascade, angiogenesis, platelet degranulation, cell–matrix adhesion, extracellular matrix organization and macrophage activation [15].

Lu et.al. identified differentially expressed genes between lung adenocarcinoma and lung squamous cell carcinoma by using microarray data from the Gene Expression Omnibus database. They identified 95 upregulated and 241 downregulated DEGs in lung adenocarcinoma samples, and 204 upregulated and 285 downregulated DEGs in lung squamous cell carcinoma samples, compared to the normal lung tissue samples. The genes play role in cell-cycle, DNA replication and mismatch repair. The top five genes from global network, HSP90AA1, BCL2, CDK2, KIT and HDAC2 have differential expression profiles between lung adenocarcinoma and lung squamous cell carcinoma [16]. Recently, Wu et.al. identified diagnostic and prognostic genes for lung adenocarcinoma and squamous cell carcinoma by using weighted gene expression profiles. The five-gene diagnostic signature including KRT5, MUC1, TREM1, C3 and TMPRSS2 and the five-gene prognostic signature including ADH1C, AZGP1, CLU, CDK1 and PEG10 obtained a log-rank P-value of 0.03 and a C-index of 0.622 on the test set [17].

A considerable number of genetic and transcriptomic alterations have been identified in mostly LUAD and poorly in LUSC. Although many gene expression signatures have been identified in LUAD recently, there is less work on LUSC expression signatures. Additionally, the molecular differences between risk groups of LUAD and LUSC have not yet been systematically described. In this study, we aimed to identify the genomic and transcriptomic differences between risk groups of lung adenocarcinoma and lung squamous cell carcinoma. We performed a univariate Cox model and then Lasso-Regularized Cox Model with Leave-One-Out Cross-Validation (LOOCV) by using TCGA gene expression data in tumor samples, and identified best gene signatures to cluster patients into low- and high-risk groups. We generated 35- and 33-gene signatures for prognostic risk prediction based on the overall survival time of the patients with LUAD and LUSC. When we clustered patients into high- and low-risk groups, the survival analysis showed highly significant results for both training and test datasets. Then, we characterized the differences including significant SNVs, CNVs, DEGs and active subnetwork DEGs between risk groups in LUAD and LUSC.

#### **2. Materials and Methods**

#### *2.1. Data*

Simple Nucleotide Variation (SNV), Transcriptome Profiling, Copy Number Variation (CNV) and Clinical data of patients who have all of these data types in LUAD and LUSC projects, was downloaded separately using *TCGAbiolinks* R package [18]. Using the same package and the reference of hg38; Simple Nucleotide Variations (SNVs) and Copy Number Variations (CNVs); and transcriptomic variations were processed to identify the genomic alterations of the LUAD and LUSC patients (Table 1). The method described below can be found as flowchart in Figure S1.


**Table 1.** Summary of clinical variables of train and test group of patients with LUAD and LUSC analyzed in the study.

#### *2.2. Gene Expression Signature Analysis*

Clinical data and Gene Expression Quantification data (HTSeq counts) of patients with unpaired RNAseq data (tumor samples without normal samples) was downloaded from the TCGA database using the *TCGAbiolinks* R package. Raw HTSeq counts of tumor samples were normalized by TMM (trimmed mean of M values) method and Log<sup>2</sup> transformed after filtering to remove genes that consistently have zero or low counts. Univariate Cox Proportional Hazards Regression analysis was performed using *survival* R package [19] to identify survival-related genes. For these survival-related potential biomarker genes (*p* ≤ 0.05), Lasso-Regularized Cox Model (by using minimum lambda calculated in the model) with Leave-One-Out Cross-Validation (LOOCV) was performed to determine a gene expression signature using *glmnet* R package [20]. Multivariate Cox Regression for the signature genes was performed and the predictive performance of the model was scored using *riskRegression* R package [21]. The risk score of each patient was predicted based on multivariate Cox regression model using the *survival* R package. Patients were clustered into high-risk and the low-risk group based on the best cutoff value for ROC, calculated by *cutoff* R package [22].

For the validation of the gene signature, HTSeq counts belonging to the tumor samples of patients who have paired RNAseq data (tumor samples with the paired adjacent normal samples) were downloaded from the TCGA database, filtered, normalized by TMM method and Log<sup>2</sup> transformed. Multivariate Cox Regression for the signature genes was performed and the predictive performance of the model was scored. The risk score of every patient in the validation group was predicted based on multivariate Cox regression model and each patient was assigned to the high- or low-risk group using the best cutoff value for ROC. These analyses were performed for LUAD and LUSC patients separately.

#### *2.3. Differential Expression Analysis*

Gene Expression Quantification data (HTSeq counts) of both the primary tumor (TP) and the paired normal tissue adjacent to the tumor (NT) was downloaded from the TCGA database. Raw HTSeq counts of both tumor and normal samples were normalized by TMM method after filtering to remove genes which have zero or low counts. Differentially expressed (*q* < 0.01) genes were determined using *limma* [23] and *edgeR* [24] R packages by limma-voom method with duplicate-correlation function. HUGO symbols and NCBI Gene identifiers of the differentially expressed genes were downloaded using the *biomaRt* R package. This analysis was performed for high- and low-risk group patients of LUAD and LUSC, separately.

#### *2.4. Active Subnetwork Analysis*

Active subnetworks of the differentially expressed genes were determined using *DEsubs* R package [25]. *DEsubs* package accepts the differentially expressed genes output of the *limma* package along with their FDR adjusted *p* values (*q* values). *DEsubs* package both computes and plots the active subnetworks. All the plots and computations were generated for the high- and low-risk group patients of the LUAD and LUSC projects, separately.

#### *2.5. Copy Number Variation Analysis*

The Copy Number Variation data of the primary tumor samples of patients was downloaded using *TCGAbiolinks* package (Masked Copy Number Segment as data type). The chromosomal regions which are significantly aberrant in tumor samples were determined and plotted by *gaia* R package [26]. Gene enrichment from genomic regions which have significant differential copy number was performed using *GenomicRanges* [27] and *biomaRt* R packages. R codes used in this analysis were modified from the codes presented at "TCGA Workflow" article [28]. All the computations and the plots were generated for the high- and low-risk groups of LUAD and LUSC projects, separately.

#### *2.6. Simple Nucleotide Variations Analysis*

The masked Mutation Annotation Format (maf) files of the TCGA mutect2 pipeline in tumor samples were downloaded to obtain the somatic mutations. The maf files are filtered using the *maftools* [29] to obtain the subset of the mutations corresponding to the patient barcodes. Summary plot and oncoplot were generated to summarize the mutation data using *maftools* R package. Somatic mutations were filtered and assigned to either oncogene (OG) or tumor suppressor gene (TSG) groups along with a significance score (q < 0.05) using the *SomInaClust* R package [30]. *SomInaClust* computes a background mutation value to identify the hot spots using the known set of somatic mutations in "COSMIC" and the "Cancer Gene Census" (v92) datasets of COSMIC database for GRCh38 [31]. SNV analysis was performed for high- and low-risk group patients of LUAD and LUSC projects, separately.

#### *2.7. Visualization*

Scatter plots showing risk score and survival time of patients were generated by *ggrisk* R package [32] and Kaplan–Meier (KM) survival curves were plotted by *survminer* R package [33] displaying the overall survival difference between the risk groups stratified on the proposed gene signature. ROC curves were plotted for the risk scores based on each gene signature using *survivalROC* R package [34]. Univariate and multivariate Cox regression analyses were performed and forest plots were generated for risk score with clinical variables using *survival* and *forestmodel* [35] R packages.

Gene and pathway enrichment analyses were performed by *biomaRt* [36] and *clusterProfiler* [37] R packages and plotted by *enrichplot* R package [38]. Heatmap plots were generated using *ComplexHeatmap* R package [39]. Mosaic plots to compare the categorical variables were generated using the *vcd* R package [40,41].

OncoPrint showing CNVs among patient samples was generated using *Complex-Heatmap* R package. OncoPlot for significant mutated genes was drawn using *maftools,* and oncoPrint showing SNVs and CNVs together was generated using *ComplexHeatmap* R package. Circos plot showing all non-synonymous SNVs in original data of risk groups and significant CNVs at genome-scale were generated using *circlize* R package [42].

All possible relations between DEGs; active subnetwork DEGs; CNV genes; SNV genes of LUAD and LUSC risk groups were identified by using *VennDiagram* R package [43].

#### **3. Results**

#### *3.1. Gene Expression Signature Analysis of LUAD and LUSC Patients*

In order to identify gene expression prognosis risk model, clinical data and gene expression quantification data of tumor samples of patients with LUAD and h LUSC with unpaired RNAseq data as two separate training groups (Table 1) were downloaded from the TCGA database. A 35-gene expression signature for LUAD and a 33-gene expression signature for LUSC were identified by Lasso-Regularized Cox Model with LOOCV after univariate Cox regression analysis. The risk scores of each patient in training groups and test groups were predicted using signature genes, then patients were clustered into highand low-risk groups based on the cutoff values.

The genes of the LUAD expression signature model identified are AC005077.4, AC113404.3, ADAMTS15, AL365181.2, ANGPTL4, ASB2, ASCL2, CCDC181, CCL20, CD200R1, CPXM2, DKK1, ENPP5, EPHX1, GNPNAT1, GRIK2, IRX2, LDHA, LDLRAD3, LINC00539, LINC00578, MS4A1, OGFRP1, RAB9B, RGS20, RHOQ, SAMD13, SLC52A1, STAP1, TLE1, U91328.1, WBP2NL, ZNF571-AS1, ZNF682, ZNF835. Twenty-seven of them are protein-coding genes while two of them are long intergenic non-protein coding RNA (LINC00539, LINC00578), one is antisense RNA (ZNF571-AS1), three of them are pseudogenes (AC005077.4, AC113404.3, OGFRP1) and two of them are novel transcripts (AL365181.2, U91328.1) (Table S1). Pathway enrichment analysis by using *clusterProfiler* R package did not give any results for this 35-gene list; therefore, enrichment analysis was performed manually using the online KEGG Mapper tool. The genes play role in metabolic pathways, cancer and immune system-related pathways such as Central carbon metabolism in cancer, Glycolysis, Cholesterol metabolism, Amino sugar and Nucleotide sugar metabolism, HIF-1 signaling pathway, TNF signaling pathway, IL-17 signaling pathway, Chemokine signaling pathway and Wnt signaling pathway (Table S2). Multivariate Cox regression analysis was performed for the signature genes and the predictive performance of the model was scored. The AUC was 0.963 (*p* = 1.1 × 10−15) for LUAD training group. The risk score of each patient was predicted and patients were clustered into high- and low-risk groups based on the cutoff value. Low- and high-risk groups have different expression patterns of the signature genes and significantly different survival probabilities (*p* < 0.0001). The prediction power of the risk score is around 0.78 (AUC) for 1, 3, 5 and 8 years for LUAD training group (Figure S2). Risk group clustering is independent from tumor stages because risk groups have also significantly different survival probability for each tumor stage (Figure S3). Vital status is highly correlated with risk groups that high-risk group is positively correlated with death (*p* = 1.5 × 10−13), while only tumor stage IA and III are associated with risk groups (Figure S4). The risk score has highly significant prognostic ability (HR:2.59, *p* < 0.001) when multivariate Cox regression analysis was performed with other clinical variables (Figures S5 and S6).

In order to validate the gene expression signature, gene expression quantification data of tumor samples of patients with LUAD who have paired RNAseq data were downloaded from the TCGA database. The risk scores of each patient in test group were predicted using the gene signature lists and patients were clustered into high- and low-risk groups based on the best cutoff values for ROC. Risk groups have differential signature gene expression patterns; high-risk group has lower survival time and higher number of deaths resulting a significantly different survival probability (*p* < 0.0001). The risk score has high prediction powers, 0.97, 0.92, 0.93 and 0.92 (AUC) for 1, 3, 5 and 8 years, respectively, for LUAD test group (Figure 1).

**Figure 1.** Gene expression signature and risk clustering of LUAD test dataset. Test dataset patients were clustered into high- and low-risk groups based on risk scores of patients calculated by predicting the effect of the signature genes of the signature genes expression on overall survival. (**A**) Expression heatmap of the signature genes in tumor samples of LUAD patients in the test dataset. (**B**) Scatter plot showing risk scores, survival time and separation point of the patients into risk groups. (**C**) KM survival plot showing the overall survival probability between risk groups. (**D**) ROC curve showing prediction power of risk score in the test dataset for 1, 3, 5 and 8 years.

− Risk groups have significantly different survival probability for each tumor stage in LUAD test group as well (Figure S7). Vital status is highly correlated with risk groups. The high-risk group is positively correlated with death (*p* = 3.87 × 10−<sup>7</sup> ), while only tumor stage I is positively associated with low-risk group (*p* = 0.016) (Figure S8). The risk score has highly significant prognostic ability (HR:2.79, *p* < 0.001) as the result of multivariate Cox regression analysis was performed with other clinical variables (Figure S9).

Expression signature model identified for LUSC includes these genes: AC078883.1, AC096677.1, AC106786.1, ADAMTS17, ALDH7A1, ALK, COL28A1, EDN1, FABP6, HKDC1, IGSF1, ITIH3, JHY, KBTBD11, LINC01426, LINC01748, LPAL2, NOS1, PLAAT1, PNMA8B, RGMA, RPL37P6, S100A5, SLC9A9, SNX32, SRP14-AS1, STK24, UBB, UGGT2, WASH8P, Y\_RNA, ZNF160, ZNF703. Twenty-three of them are protein coding genes while two of them are long intergenic non-protein coding RNA (LINC01748, LINC01426), one is antisense RNA (SRP14-AS1), three of them are pseudo-genes (LPAL2, RPL37P6, WASH8P), three of them are novel transcripts (AC106786.1, AC096677.1, AC078883.1) and one is Y RNA (Table S3). They play role in mostly in metabolic pathways, cancer and immunity related pathways such as Arginine and proline metabolism, Glycolysis/Gluconeogenesis, HIF-1 signaling pathway, Non-small-cell lung cancer, PD-L1 expression and PD-1 checkpoint pathway in cancer and TGF-beta signaling pathway (Table S4).

The predictive performance score of the signature model is 80.8 (AUC) (*p* = 1.3 × 10−<sup>6</sup> ) in multivariate Cox regression analysis for LUSC training group. The risk score of each patient was predicted and patients were clustered into high- and low-risk groups based on the cutoff value. Low- and high-risk groups have different expression patterns of the signature genes and significant difference of survival probability (*p* < 0.0001). The AUC values showing prediction power of the risk score are 0.76, 0.82, 0.87 and 0.92 for 1, 3, 5 and 8 years, respectively, for LUSC training group (Figure S10). Risk groups have also significantly different survival probability for tumor stages I, II and III (Figure S11). Risk groups are highly correlated with vital status. The high-risk group has highly significant positive correlation with death (*p* = 8.5 × 10−15), while low-risk group is negatively correlated. Tumor stages did not show any association with risk groups (Figure S12). The risk score has highly significant prognostic ability (HR:2.85, *p* < 0.001) when multivariate Cox regression analysis was performed with other clinical variables (Figure S13).

In order to validate the gene expression signature for LUSC, gene expression quantification data of tumor samples of patients with LUSC who have paired RNAseq data were downloaded. The risk scores of each patient in LUSC test group were predicted using gene signature lists and patients were clustered into high- and low-risk groups based on the best cutoff values for ROC. Risk groups have differential signature gene expression pattern; high-risk group has lower survival time and higher number of deaths. Risk groups have significantly different survival probability (*p* < 0.0001). The risk score has high prediction powers, 0.93, 0.95, 0.96 and 0.97 (AUC) for 1, 3, 5 and 8 years, respectively, for LUSC test group (Figure 2).

Risk groups have also significantly different survival probability for tumor stages in test group (Figure S14). Vital status is not correlated with risk groups of LUSC test group that number of deaths is higher for high-risk group insignificantly (*p* = 0.07). Tumor stages are not associated with risk groups (Figure S15). The risk score has highly significant prognostic ability (HR:2.66, *p* < 0.001) while other clinical variables have no effect on overall survival in multivariate Cox regression analysis (Figure S16).

The expression gene signatures of LUAD and LUSC do not have any common gene, however they share eight common pathways which are mostly metabolic pathways: Central carbon metabolism in cancer, Glycolysis/Gluconeogenesis, HIF-1 signaling pathway, Pyruvate metabolism, PPAR signaling pathway, Amino sugar and nucleotide sugar metabolism, TNF signaling pathway and Pathways of neurodegeneration—multiple diseases.

#### *3.2. Differential Expression and Active Subnetwork Analysis of Risk Groups*

Gene expression quantification data of both primary tumor and adjacent normal tissues of patients who have paired RNAseq data (test groups) in LUAD and LUSC projects were downloaded from the TCGA database. Differentially expressed (*q* < 0.01) genes (DEGs) were determined in tumor samples according to normal samples for high- and lowrisk patient groups in test sets of LUAD and LUSC, separately. Then, active subnetworks of DEGs in tumor samples were determined using the DEGs with their q values.

In tumor samples of the LUAD low-risk group, the number of the genes which are dysregulated significantly (*q* < 0.01) more than 2-fold is 3615 (2439 down-, 1176 upregulated) while 3610 genes (2239 down-, 1371 upregulated) are dysregulated for the LUAD high-risk group. LUAD low- and high-risk groups have 2745 common differentially expressed genes (Figure S17). The top 20 significant DEGs highlighted as purple at volcano plot in Figure 3A,B are different between LUAD risk groups as dysregulation pattern is different between risk groups albeit the shared 2745 DEGs.

**Figure 2.** Gene expression signature and risk clustering of LUSC test dataset. Test dataset patients were clustered into highand low-risk groups based on risk scores of patients calculated by predicting the effect of the signature genes' expression on overall survival. (**A**) Expression heatmap of the signature genes in tumor samples of LUSC patients in the test dataset. (**B**) Scatter plot showing risk scores, survival time and separation point of the patients into risk groups. (**C**) KM survival plot showing the overall survival probability between risk groups. (**D**) ROC curve showing prediction power of risk score in the test dataset for 1, 3, 5, and 8 years.

Seven of the signature genes (GNPNAT1, CCDC181, LDHA, ADAMTS15, IRX2, LINC00578, AC005077.4) are dysregulated in both risk groups. ANGPTL4 is upregulated in the high-risk group while MS4A1, GRIK2, and OGFRP1 are upregulated in the lowrisk group.

Risk groups of LUAD share dysregulated pathways (Figure 3C,D), highly related to cancer, such as Cell cycle, Biosynthesis of amino acids and Protein digestion and absorption which are upregulated for both risk groups (Figure S18), on the other hand, they also share ECM–receptor interaction, Cell adhesion molecules pathways with immune systemrelated pathways such as Complement and coagulation cascades and Cytokine-cytokine receptor interaction which are downregulated for both risk groups (Figure S18). However, the high-risk group has more dysregulated immune system-related pathways such as Allograft rejection, Graft-versus-host disease, Inflammatory bowel disease, Intestinal immune network for IgA production, Rheumatoid arthritis, Staphylococcus aureus infection (Figure 3C,D), which are downregulated pathways in LUAD high-risk group (Figure S18).

Active subnetworks of differentially expressed genes in tumor samples of the LUAD risk groups were identified and low-risk group has 191 genes while high-risk group has 168 genes including 112 common genes, which are acting on active subnetworks (Figure S17).

**Figure 3.** Differential expression analysis of the LUAD risk groups. LUAD test dataset patients were clustered into highand low-risk groups based on risk scores of patients and differentially expressed genes in tumor samples were determined based on expressions in normal tissues. (**A**) Volcano plot showing differentially expressed genes more than 2-fold (Log<sup>2</sup> =1) for LUAD low-risk group. The top 20 significant downregulated and upregulated genes are highlighted as purple. FDR corrected p-values threshold is 0.01 (-Log<sup>10</sup> = 2). Red: Upregulated, Green: Downregulated, Black: Not significant or low than 2-fold. (**B**) Volcano plot showing differentially expressed genes more than two-fold (Log<sup>2</sup> = 1) for the LUAD high-risk group. The top 20 significant downregulated and upregulated genes are highlighted as purple. FDR corrected *p*-values threshold is 0.01 (-Log<sup>10</sup> = 2). Red: Upregulated, Green: Downregulated, Black: Not significant or low than 2-fold. (**C**) Dysregulated pathways of differentially expressed genes for LUAD low-risk group. (**D**) Dysregulated pathways of differentially expressed genes for LUAD high-risk group.

Pathway enrichment of DEGs at active subnetworks shows that the genes playing role in active subnetworks are much more related to cancer pathways such as PI3K-Akt signaling pathway, Ras signaling pathway, Small-cell lung cancer, Breast cancer, Gastric cancer, Proteoglycans in cancer and Rap1 signaling pathway (Figure 4). LUAD risk groups have mostly similar cancer-related active pathways, however only low-risk group has FoxO signaling pathway and TNF signaling pathway while high-risk group has Estrogen signaling pathway, Growth hormone synthesis, secretion, and action with immune system pathways such as Antigen processing and presentation, Intestinal immune network for IgA production and Leukocyte trans-endothelial migration.

The number of dysregulated genes expressed significantly (*q* < 0.01) more than 2 fold in tumor samples of the LUSC low-risk group is 5596 (3394 downregulated, 2202 upregulated) while 5403 genes (3338 downregulated, 2065 upregulated) are dysregulated for LUSC high-risk group. LUSC low- and high-risk groups have 4562 common differentially expressed genes (Figure S17). The top 20 significant DEGs highlighted at volcano plot in Figure 5A,B include common genes and dysregulation pattern is similar between risk groups.

**Figure 4.** Pathway enrichment of differentially expressed genes at active subnetworks of the LUAD risk groups. Active subnetworks were determined by using differential expression analysis results and pathway enrichment analysis was performed for the genes at subnetworks. (**A**) Pathways of differentially expressed genes in active subnetworks for LUAD low-risk group. (**B**) Pathways of differentially expressed genes in active subnetworks for LUAD high-risk group.

**Figure 5.** Differential expression analysis of the LUSC risk groups. LUSC test dataset patients were clustered into highand low-risk groups based on risk scores of patients and differentially expressed genes in tumor samples were determined based on expressions in normal tissues. (**A**) Volcano plot showing differentially expressed genes more than 2-fold (Log<sup>2</sup> = 1) for LUSC low-risk group. The top 20 significant downregulated and upregulated genes are highlighted as purple. FDR corrected p-values threshold is 0.01 (-Log<sup>10</sup> = 2). Red: Upregulated, Green: Downregulated, Black: Not significant or low than 2-fold. (**B**) Volcano plot showing differentially expressed genes more than two-fold (Log<sup>2</sup> = 1) for LUSC high-risk group. The top 20 significant downregulated and upregulated genes are highlighted as purple. FDR corrected p-values threshold is 0.01 (-Log<sup>10</sup> = 2). Red: Upregulated, Green: Downregulated, Black: Not significant or low than 2-fold. (**C**) Dysregulated pathways of differentially expressed genes for LUSC low-risk group. (**D**) Dysregulated pathways of differentially expressed genes for LUSC high-risk group.

LUSC signature genes have 10 common genes (EDN1, JHY, PLAAT1, HKDC1, ITIH3, KBTBD11, RGMA, ZNF703, S100A5, LPAL2) with DEGs of both risk groups. Three of the signature genes, ADAMTS17, IGSF1, and LINC01426, are upregulated in the low-risk group; others, NOS1 and SRP14-AS1 are downregulated while Y\_RNA is upregulated in the high-risk group.

Risk groups of LUSC have common dysregulated pathways (Figure 5C,D), which are highly related to cancer, such as Cell cycle, DNA replication, Base excision repair, p53 signaling pathway which are upregulated at both risk groups (Figure S19), on the other hand, they also share ECM–receptor interaction, Cell adhesion molecules, Focal adhesion pathways with immune system-related pathways such as Chemokine signaling pathway, Complement and coagulation cascades, Cytokine–cytokine receptor interaction, which are downregulated at both risk groups (Figure S19). However, the high-risk group has more upregulated metabolic pathways such as Central carbon metabolism in cancer, Protein digestion and absorption, Alanine, aspartate and glutamate metabolism, Arginine and proline metabolism, Cysteine and methionine metabolism, Glutathione metabolism, Ribosome biogenesis in eukaryotes; and downregulated immune-related pathways such as JAK-STAT signaling pathway, TNF signaling pathway, Primary immunodeficiency, T cell receptor signaling pathway distinctly from low-risk group (Figure S19). LUSC lowrisk group has downregulated PI3K-Akt signaling pathway, Phenylalanine metabolism, Tyrosine metabolism, Phospholipase D signaling pathway, Proteoglycans in cancer and Tight junction pathways with upregulated Hippo signaling pathway and Small-cell lung cancer distinctly from high-risk group (Figure S19).

Active subnetworks of differentially expressed genes in tumor samples of the LUSC risk groups has 357 genes for the low-risk group while 350 genes for high-risk group including 245 common genes (Figure S17). Active pathways of the LUSC risk groups, are highly related to cancer pathways such as PI3K-Akt signaling pathway, Ras signaling pathway, Small-cell lung cancer, Proteoglycans in cancer and Rap1 signaling pathway (Figure 6A,B). LUSC risk groups have mostly similar cancer-related active pathways, however only lowrisk group has Nucleotide excision repair, Adherens junction and Alpha-Linolenic acid metabolism pathways, while high-risk group has cancer and metabolism-related pathways such as Basal cell carcinoma, Prolactin signaling pathway, Apoptosis, Mitophagy, Choline metabolism in cancer, Insulin signaling pathway, Carbohydrate digestion and absorption, Central carbon metabolism in cancer with immune system-related Measles and Influenza A pathways.

**Figure 6.** Pathway enrichment of differentially expressed genes at active subnetworks of the LUSC risk groups. Active subnetworks were determined by using differential expression analysis results and pathway enrichment analysis was performed for the genes at subnetworks. (**A**) Active pathways of differentially expressed genes for LUSC low-risk group. (**B**) Active pathways of differentially expressed genes for LUSC high-risk group.

#### *3.3. Copy Number Variations Analysis*

The significant aberrant genomic regions in tumor samples of patients were determined and then gene enrichment from genomic regions which have differential copy number was performed. Pathway enrichment analysis of genes which have CNVs was performed and plotted. LUAD low- and high-risk groups have different CNV profiles as seen at CNV plots showing amplified or deleted genomic regions on chromosomes. Chromosomes 1, 6, 7, 10, 13, 16, 17, 28 and 20 have different significant aberrant genomic regions (*q* < 0.01) between risk groups (Figure 7A,B). The highest frequencies of the amplified genes are 45%, 49% and the deleted genes are 31%, 45% in the low- and high-risk groups, respectively. The top 10 the highest frequently amplified or deleted genes in tumor samples of risk groups are different and patients in the same group may have different aberration patterns (Figure 7C,D). The numbers of the deleted genes and the amplified genes are 10,144 and 10,412, respectively, in tumor samples of the LUAD low-risk group. LUAD high-risk group has 5379 deleted and 8442 amplified genes in tumor samples. Risk groups have 4921 deleted and 6559 amplified genes in common (Figure S22).

Pathways of CNV genes are different between LUAD risk groups; mostly immune system pathways such as Allograft rejection, Graft-versus-host disease, Antigen processing and presentation, Complement and coagulation cascades, Inflammatory bowel disease and Viral carcinogenesis pathways have amplified CNVs in the low-risk group (Figure S20) while Herpes simplex virus 1, Cytosolic DNA sensing pathway, Natural killer cell mediated cytotoxicity and Nod-like receptor signaling pathways have deleted CNVs (Figure S20) in the high-risk group (Figure 7). Complement and coagulation cascades pathway has amplified genes in both risk groups while Natural killer cell mediated cytotoxicity and Nod-like receptor signaling pathways have deleted genes in both risk groups (Figure S20). The low-risk group patients have immune system pathways with amplified genes whereas high-risk group patients have immune system pathways with deleted genes. On the other hand, high-risk group has amplified genes in metabolic pathways such as Gastric acid secretion and Insulin secretion (Figure S20).

LUSC risk groups have different significant aberrant genomic regions obviously on chromosomes 5, 6, 8 and X (Figure 8A,B). The highest frequencies of amplified genes are 84%, 77% and of the deleted genes are 55%, 51% in the low- and high-risk groups, respectively. LUSC risk groups have higher frequency of amplified genes than deleted genes. Risk groups have common genes from top 25 the highest frequently amplified genes such as SOX2, GHSR, TNFSF10 and miRNAs, miR-7977 and miR-569, with variable frequencies. Risk groups have also common deleted genes such as CDK inhibitors, CDKN2A and CDKN2B, and miR-1284 (Figure 8C,D). LUSC low-risk group has 10,720 deleted and 10,264 amplified genes while LUSC high-risk group has 9477 deleted and 10,250 amplified genes in tumor samples. Risk groups have 7820 deleted and 8659 amplified genes in common (Figure S22).

Pathways of CNV genes highly overlap between LUSC risk groups and they share cancer-related pathways such as PI3K-Akt signaling pathway, JAK-STAT signaling pathway, Ras signaling pathway, Gastric cancer (Figure 8E,F). However, some pathways differ between risk groups, low-risk group has CNVs at mTOR signaling pathway, VEGF signaling pathways and Central carbon metabolism in cancer, while high-risk group has CNVs at Chemical carcinogenesis, Drug metabolism—cytochrome P450, Carbohydrate digestion and absorption pathways (Figure 8E,F). Steroid hormone biosynthesis and Bile secretion pathways have multiple amplified genes while NOD-like receptor signaling pathway has deleted genes, in both risk groups. Only low-risk group has multiple amplified genes at Growth hormone synthesis, secretion and action, and Complement and coagulation cascades pathways. Only high-risk group has amplified genes at Chemical carcinogenesis and Drug metabolism pathways while has deleted genes at Cytokine-cytokine receptor interaction and Fatty acid biosynthesis pathways (Figure S21).

**Figure 7.** Significant Copy Number Variations (CNVs) of the LUAD risk groups. (**A**) CNV plot at genome scale showing amplified or deleted genomic regions on chromosomes of the LUAD low-risk group. Score: -Log10(q value), Horizontal orange line: 0.01 q value threshold. (**B**) CNV plot of the LUAD high-risk group. (**C**) OncoPrint plot showing 25 the highest frequently amplified and deleted genes of the LUAD low-risk group. (**D**) OncoPrint plot showing 25 the highest frequently amplified and deleted genes of the LUAD high-risk group. (**E**) Pathways of CNV genes of the LUAD low-risk group. (**F**) Pathways of CNV genes of the LUAD high-risk group.

**Figure 8.** Significant Copy Number Variations (CNVs) of the LUSC risk groups. (**A**) CNV plot at genome-scale showing amplified or deleted genomic regions on chromosomes of the LUSC low-risk group. (**B**) CNV plot of the LUSC high-risk group. (**C**) OncoPrint plot showing 25 the highest frequently amplified and deleted genes of the LUSC low-risk group. (**D**) OncoPrint plot showing 25 the highest frequently amplified and deleted genes of the LUSC high-risk group. (**E**) Pathways of CNV genes of the LUSC low-risk group. (**F**) Pathways of CNV genes of the LUSC high-risk group.

#### *3.4. Simple Nucleotide Variations Analysis*

Significantly (*q* < 0.05) mutated genes classified as oncogene (OG) or tumor suppressor gene (TSG) based on TSG/OG scores of the genes and the Cancer Gene Census, were identified for LUAD and LUSC risk groups. COSMIC database was used as a reference mutation database for this analysis and Cancer Gene Census data.

LUAD low-risk group has 15,376 mutated genes, while LUAD low-risk group has 12,815 mutated genes, 11,516 genes of which are common between LUAD risk groups (Figure S27). LUAD patients have a wide range of mutation numbers changing from 1518/1158 to 10s with median 167 and 172.5 for low- and high-risk groups, respectively. Missense mutation is the highest frequent mutation type, and C > A and C > T substitutions are the most frequent ones for both risk groups. LUAD risk groups have a similar set of mutated genes with varying frequencies. TP53 is the highest frequently mutated gene with 45% and 53% for low- and high-risk groups, and the following ones are MUC16 (39%, 40%) and CSMD3 (38%, 35%) for both groups (Figure S23). SomInaClust analysis was performed to determine driver genes, and 39 genes and 19 genes are strong candidate driver genes for the low-risk group and high-risk group, respectively (Tables S5 and S6). Interestingly, LUAD risk groups share 18 of these driver genes (Figure S27). SomInaClust calculates TSG and OG scores based on background mutation rate and hot spots, then classifies the genes based on TSG/OG scores and cancer gene census data (Figure S25). The driver genes determined in LUAD low-risk group are KRAS, TP53, EGFR, BRAF, STK11, MGA, NF1, RB1, PIK3CA, ATM, RBM10, SETD2, ARID1A, CTNNB1, CMTR2, SF3B1, CSMD3, ATF7IP, KEAP1, HMCN1, EPHA5, ARID2, TTK, SMAD4, KDM5C, SMARCA4, APC, NFE2L2, RIT1, DDX10, LTN1, CDH10, SPTA1, LRP1B, COL11A1, MAP3K12, USH2A, AKAP6 and RASA1. The driver genes determined in LUAD high-risk group are KRAS, TP53, STK11, EGFR, BRAF, RBM10, PIK3CA, SETD2, ARID2, NF1, RB1, MGA, KEAP1, CSMD3, SMARCA4, CTNNB1, KDM5C, IDH1 and ATM (Figure S25; Tables S5 and S6). TP53 and CSMD3 genes are the most frequently mutated genes with 47%, 56% and 41%, 37% frequencies, respectively for low- and high-risk groups (Figure 9A,B). More than half of the genes are mutated in less than 12% of patients. For common genes, LUAD high-risk group has mostly higher frequencies. TP53 has differential mutation types, while KRAS has mostly missense mutations. CSMD3 has more multi-hits (multiple mutations in one patient) in the low-risk group than the high-risk group. EGFR has in frame deletions in both risk groups and other common genes have similar mutation type pattern between risk groups (Figure 9A,B). Pathways of driver mutated genes are highly lung cancer-related pathways such as Non-small-cell lung cancer, EGFR tyrosine kinase inhibitor resistance, Platinum drug resistance, MAPK signaling, mTOR signaling, Ras signaling pathway, PI3K-Akt signaling (Figure 9C,D) and other immunologic and metabolic pathways such as Signaling pathways regulating pluripotency of stem cells, FoxO signaling pathway, Rap1 signaling pathway, Central carbon metabolism in cancer, Proteoglycans in cancer, Human T-cell leukemia virus 1 infection, PD-L1 expression and PD-1 checkpoint pathway in cancer and Natural killer cell mediated cytotoxicity pathways, for both risk groups. Many common pathways are enriched because these mutated driver genes play role in many crucial important pathways. However, Wnt signaling pathway and Hippo signaling pathways are mutated only in the low-risk group, while Gap junction, GnRH signaling pathway, C-type lectin receptor signaling pathway, T cell receptor signaling pathway, HIF-1 signaling pathway, Growth hormone synthesis, secretion and action and AMPK signaling pathways are mutated only in the high-risk group (Figure 9C,D).

LUSC low-risk group has 14,038 mutated genes, while LUSC low-risk group has 14,616 mutated genes, and 11,947 genes are common (Figure S27). LUSC patients have a range of mutation numbers from 2300/1488 to 10s with median 201 for low- and high-risk groups, respectively. Missense mutation is the highest frequent mutation type, and C > A and C > T substitutions are the most frequent ones for both risk groups. LUSC risk groups have overlapping list of mutated genes with varying frequencies. TP53 is the highest frequently mutated gene with 80% and 78% for low- and high-risk groups, and the following ones are CSMD3 (42%, 42%) and MUC16 (39%, 40%) for both groups (Figure S24). As candidate driver genes, 30 genes and 19 genes were identified for the low-risk group and the high-risk group, respectively (Tables S7 and S8). LUSC risk groups share 14 of these driver genes (Figure S27). The driver genes determined in LUSC low-risk group are TP53, KMT2D, NFE2L2, PIK3CA, CDKN2A, PTEN, RB1, FAT1, ARID1A, NF1, RASA1, CUL3, KDM6A, NRAS, KRT5, ZNF750, EP300, FGFR3, TAOK1, CSMD3, NSD1, HRAS, SI, PDS5B, KRAS, KEAP1, API5, HNRNPUL1, SLC16A1, FBXW7. The driver genes determined in LUSC highrisk group are TP53, NFE2L2, PIK3CA, KMT2D, FAT1, CDKN2A, RB1, PTEN, NOTCH1, ARID1A, RASA1, NF1, KMT2C, BRAF, PIK3R1, CSMD3, STK11, HRAS, KEAP1 (Figure S26; Tables S7 and S8). TP53 (83%, 82%), CSMD3 (44%, 44%) and KMT2D (25%, 23%) are most frequent mutated genes for low- and high-risk groups (Figure 10A,B). For common genes, risk groups have similar frequencies. TP53 and KMT2D genes have differential mutation types, while CSMD3 has mostly missense and multi-hit mutations. CDKN2A has mostly truncating mutations in both risk groups and other common genes have similar mutation type pattern between risk groups (Figure 10A,B). Pathways of driver mutated genes are highly lung cancer-related pathways such as Non-small-cell lung cancer, EGFR tyrosine kinase inhibitor resistance, Platinum drug resistance, MAPK signaling and Ras signaling (Figure 10C,D) and other immunologic and metabolic pathways such as FoxO signaling pathway, Central carbon metabolism in cancer, Proteoglycans in cancer, Hepatitis B, Hepatitis C, PD-L1 expression and PD-1 checkpoint pathway in cancer for both risk groups. Many common pathways are enriched because these mutated driver genes play role in many crucial important pathways. However, Gap junction and Ubiquitin mediated proteolysis pathways are mutated only in the low-risk group, while HIF-1 signaling and TNF signaling pathways are mutated only in the high-risk group (Figure 10C,D).

**Figure 9.** Oncoplot of potential driver genes containing significant SNVs of the LUAD risk groups. (**A**) Oncoplot showing significant SNV genes in tumor samples of the LUAD low-risk group patients. (**B**) Oncoplot showing significant SNV genes in tumor samples of the LUAD high-risk group patients. (**C**) Pathway enrichment of the significant SNV genes of the LUAD low-risk group. (**D**) Pathway enrichment of the significant SNV genes of the LUAD high-risk group.

**Figure 10.** Oncoplot of potential driver genes containing significant SNVs of the LUSC risk groups. (**A**) Oncoplot showing significant SNV genes in tumor samples of the LUSC low-risk group patients. (**B**) Oncoplot showing significant SNV genes in tumor samples of the LUSC high-risk group patients. (**C**) Pathway enrichment of the significant SNV genes of the LUSC low-risk group. (**D**) Pathway enrichment of the significant SNV genes of the LUSC high-risk group.

When venn diagram is drawn by using all driver genes, all cancer and risk groups have TP53, CSMD3, KEAP1, NF1, RB1 and PIK3CA mutations. KRAS, STK11, BRAF, ARID1A, NFE2L2 and RASA1 genes are shared by 3 different groups. LUAD high-risk group has only IDH1 oncogene as different from LUAD low-risk group while LUSC highrisk group has KMT2C, NOTCH1 and PIK3R1 tumor suppressor genes as different from LUSC low-risk group. EGFR, MGA and SMARCA4 are not driver genes in LUSC while CDKN2A, PTEN, HRAS and FAT1 are not driver genes in LUAD groups (Figure 11).

Significant SNVs and CNVs on driver genes are co-displayed as OncoPrint. Although there exist some genes with both SNVs and significant CNVs while others have only SNVs. Moreover, some patients have only SNVs or only CNVs or both for a particular driver gene.

TP53, STK11, KEAP1, SMARCA4 and MGA genes have deletions while CSMD3 and PIK3CA genes have amplification beside SNVs in both LUAD risk group. KRAS and EGFR genes have amplification in the high-risk group; however, they do not have significant CNVs in the low-risk group. Oncogenes tend to have amplifications while tumor suppressor genes tend to have deletions in both risk groups with exceptions (CSMD3, CDH10, HMCN1, AKAP6 and CTNNB1) (Figure 12).

**Figure 11.** Venn diagram of driver genes containing Simple Nucleotide Variation (SNV) in tumor samples of LUAD and LUSC risk groups.

**Figure 12.** OncoPrint of the driver genes containing significant SNVs and CNVs in LUAD risk groups. Significant SNVs and CNVs are plotted together on potential driver genes in tumor samples of the LUAD risk groups. (**A**) OncoPrint of the driver genes in LUAD low-risk group. (**B**) OncoPrint of the driver genes in LUAD high-risk group.

OncoPrints in Figure 13 show that TP53, CDKN2A, FAT1, RASA1, ARID1A and HRAS genes have deletions while only PIK3CA gene has amplification beside SNVs in both LUSC risk groups. PIK3R1, KEAP1 and STK11 genes have deletions only in the high-risk group while SI, CSMD3, ZNF750, KRAS genes have amplification and NSD1, FGFR3, PTEN, SLC16A1, NRAS and CUL3 have deletion only in the low-risk group. Oncogenes tend

to have amplifications while tumor suppressor genes tend to have deletions in both risk groups with exceptions (CSMD3, FGFR3, ZNF750, NRAS, HRAS, KEAP1) (Figure 13).

**Figure 13.** OncoPrint of the driver genes containing significant SNVs and CNVs in LUSC risk groups. Significant SNVs and CNVs are plotted together on potential driver genes in tumor samples of the LUSC risk groups. (**A**) OncoPrint of the driver genes in LUSC low-risk group. (**B**) OncoPrint of the driver genes in LUSC high-risk group.

Circos plots showing all non-synonymous SNVs in original data of risk groups and significant CNVs at genomic scale on chromosomes were drawn to show the genomic alterations between risk groups of LUAD and LUSC.

LUAD low-risk group has more genome-wide CNVs and SNVs than the high-risk group. The low-risk group has more genomics regions containing missense, nonsense and frame-shift insertions/deletions mutations. Moreover, low-risk group has extra deletions on chromosomes 1, 3, 5, 6, 12, 15 and X with extra amplifications on chromosomes 6, 10, 14, and 20. The high-risk group has extra amplifications on chromosomes 7, 11, 12, and 17. The CNVs of high-risk group are localized mostly on 1, 3, 5, 6, 7, 8 and 17 whereas low-risk group has CNVs on more chromosomes (Figure 14).

**Figure 14.** Circos plot of chromosome regions containing all SNVs and CNVs in LUAD risk groups. Significant CNVs (*q* < 0.01) and all SNVs in original data are plotted together on chromosome regions in tumor samples of the LUAD risk groups. (**A**) Circos plot of the LUAD low-risk group. (**B**) Circos plot of the LUAD high-risk group.

LUSC high-risk group has more genomic regions containing missense and nonsense mutations than the low-risk group. However, they have similar amount of CNVs although with different localizations. The high-risk group has extra amplifications on chromosomes 4, 6 and 11; has extra deletions on chromosomes 15, 19 and X. The low-risk group has only extra deletions on chromosomes 1, 5, 6, 11 and 16 (Figure 15).

**Figure 15.** Circos plot of chromosome regions containing all SNVs and CNVs in LUSC risk groups. Significant CNVs (*q* < 0.01) and all SNVs in original data are plotted together on chromosome regions in tumor samples of the LUSC risk groups. (**A**) Circos plot of the LUSC low-risk group. (**B**) Circos plot of the LUSC high-risk group.

#### **4. Discussion**

In order to profile the genetic differences between risk groups of LUAD and LUSC, gene expression signatures were generated and the patients were clustered into low- and high-risk groups and then significant DEGs, DEGs at active subnetworks, CNVs and SNVs were identified in each risk group. The biological alterations for these data types were compared between risk groups and between lung cancer subtypes.

Expression signature for LUAD consists of 35 gene which 27 of are protein-coding genes while two are long intergenic non-protein coding RNA, one is antisense RNA, three are pseudogenes and two are novel transcripts. Many of the coding genes are lung cancer or other cancer types related such as ADAMTS15 [44], ASB2 [45] and EPHX1 [46] with potential tumor suppressor roles; ANGPTL4 [47], ASCL2 [48], CCL20 [49], DKK1 [50], GRIK2 [51], LDHA [52], RGS20 [53], RHOQ [54], TLE1 [55] and WBP2 [56] with potential oncogenic roles; and CD200 [57], CD200R1 [57], CCDC181 [58], GNPNAT1 [59], IRX2 [60], LDLRAD3 [61], STAP1 [62], LINC00578 [63] with prognostic potential. Moreover, MS4A1 is dysregulated in asbestos-related lung squamous carcinoma [64], RAB9B is a target of miR-15/16 which are highly related to lung cancer [65], LINC00539 is related to tumor immune response [66] while long non-coding RNA, OGFRP1, regulates non-small-cell lung cancer progression [67]. The remaining signature genes, CPXM2, ENPP5, SAMD13, SLC52A1, ZNF682, ZNF835, ZNF571-AS1 and U91328.1, have not been related to carcinoma, yet. However, they showed highly prognostic power through risk score to distinguish low- and high-risk of overall survival in LUAD.

LUSC gene expression signature including 33 genes of which ALDH7A1 [68], ALK [69], EDN1 [70], FABP6 [71], HKDC1 [72], IGSF1 [73], KBTBD11 [74], NOS1 [75], SLC9A9 [76], STK24 [77], UBB [78], ZNF703 [79] have been shown with oncogenic relations while RGMA [80] is candidate tumor suppressors. ITIH3 [81] and S100A5 [82] has been related to prognostic biomarker potentials. Other cancer-related genes are ADAMTS17 [83],

LINC01748 [84], LPAL2 [85], SRP14-AS1 [86] and WASH8P [87]. Long intergenic nonprotein coding RNA, LINC01426, promotes cancer progression via AZGP1 and predicts poor prognosis in patients with LUAD [88]. COL28A1 has prognostic values in glioblastoma [89]. Many of the genes such as JHY, PLAAT1, PNMA8B, RPL37P6, SNX32, UGGT2 and Y\_RNA have not been related to any cancer, yet.

Gene expression signatures of LUAD and LUSC share eight pathways which are mostly metabolic pathways. LUAD signature plays role in immune-related pathways as different from those in LUSC. However, pathway enrichment shows us that risk prediction works on metabolic pathways, therefore if we put a name to important mutations as driver mutations, in this case we can say that reprogramming of energy metabolism is the alternative fuel of the cancer [90–92]. The differential expression on them with immune system effect in count can hold the passage of cancer.

High-risk groups of both LUAD and LUSC have more immune pathways including downregulated genes and metabolic pathways including upregulated genes. On the other hand, low-risk groups have both upregulated and downregulated genes on cancer-related pathways. Although LUAD and LUSC seem to have similar characteristics of risk groups, close signature gene pathways and similar differential expression pathways sharing 2106 DEGs in total, they are displayed separately in PCA, especially at analysis of test groups.

At CNV level both risk groups and cancer subtypes have huge number of genes with amplifications or deletions which can cause genomic instability and uncontrolled regulation. Both LUAD and LUSC risk groups have important gene alterations such as CDKN2A and CDKN2B deletions which are associated with NSCLC [93] and promotes KRAS and EGFR mutant tumorigenesis [94,95] while SOX2 oncogene amplification in LUSC which is a common event in squamous cell carcinomas [96,97] and amplification of PSMD4 in LUAD, with oncogenic roles in breast, hepatocellular, colorectal and prostate cancer cells [98–101]. CNVs also play role in metabolic and immune-related pathways which can differ between risk groups and cancer subtypes. If we look from a higher perspective, the LUAD low-risk group has much more CNVs and SNVs on its genome than the high-risk group. On the other hand, the LUSC high-risk group has more SNVs than the low-risk group while CNVs do not vary too much.

SNV analysis gives similar results with literature for example EGFR and KRAS mutations are mutually exclusive in LUAD samples that is confirmed again [9]. Additionally, EGFR [102], MGA [103], SMARCA4 [104], ATM [105], RBM10 [106] and KDM5C [107] which are lung cancer related genes are mutated only in LUAD but not in LUSC. On the other hand, CDKN2A [108], PTEN [109] and HRAS [110] genes are mutated only in LUSC. In general, low-risk groups have more mutated genes for both LUAD and LUSC samples. When SNV and CNV genes are plotted together, it can be seen that LUAD high-risk group has obvious oncogene amplifications and tumor suppressor deletions, while LUAD low-risk group has both tumor suppressor deletions and tumor suppressor amplifications with a few oncogene amplifications. This SNV and copy number differential pattern can cause differential gene expression profiles and characteristics of tumor. LUSC patients have mostly deletions on driver genes with only PIK3CA [111] and KRAS [111] oncogene amplifications. Both LUSC risk groups have obvious TP53 [111] and CDKN2A tumor suppressor gene deletions, but amplification of CSMD3, which has differential roles in lung cancer [112,113], does not occur in LUSC high-risk group. Again, only these driver genes which have differential alterations and frequencies can create the risk difference based on gene expression levels.

#### **5. Conclusions**

This study has been performed to profile the genomic and transcriptomic differences not only between LUAD and LUSC but also between risk groups to understand the driving differences between them. Treatment options can vary between cancer subtypes and risk groups because of differential targetable mutation patterns. Nowadays, many groups and government institutions are working on the integration of the drug bioactivity and molecular data to investigate more effective molecularly targeting therapeutics for individual patients for the personalized therapy.

**Supplementary Materials:** The supplementary data are available online at https://www.mdpi.com/ 2075-4426/11/2/154/s1; Figure S1: Flowchart of method and used R packages in this study. The other R packages not written in this flowchart can be found at Materials and Method part of the article; Figure S2: Gene expression signature and risk clustering of LUAD training dataset; Figure S3: Survival analysis of risk groups clustered by using signature gene expression at different tumor stages in LUAD training dataset; Figure S4: Mosaic plots showing association analysis of categorical variables for LUAD training dataset. Pearson residuals show the positive (blue) or negative (red) association between levels of categories; Figure S5: Multivariate Cox Regression results of clinical variables and risk score in LUAD training dataset. Only risk score has significant result when all clinical variables are included into multivariate analysis; Figure S6: Multivariate Cox Regression results of selected clinical variables (which have significant results in univariate Cox analysis) and risk score in LUAD training dataset. Risk score, t, n, m stages and history of prior malignancy have significant effects on survival. When pathologic tumor stage is used instead of t, n, m stages, only risk score and history of prior malignancy show significant effect on survival; Figure S7: Survival analysis of risk groups clustered by using signature gene expression at different tumor stages in LUAD test dataset; Figure S8: Mosaic plots showing association analysis of categorical variables for LUAD test dataset; Figure S9: Multivariate Cox Regression results of selected clinical variables (which have significant results in univariate Cox analysis) and risk score in LUAD test dataset. Risk score and n stages have significant effect on survival. When pathologic tumor stage is used instead of t, n, m stages, only risk score shows significant effect on survival; Figure S10: Gene expression signature and risk clustering of LUSC training dataset; Figure S11: Survival analysis of risk groups clustered by using signature gene expression at different tumor stages in LUSC training dataset; Figure S12: Mosaic plots showing association analysis of categorical variables for LUSC training dataset. Pearson residuals show the positive (blue) or negative (red) association between levels of categories; Figure S13: Multivariate Cox Regression results of selected clinical variables (which have significant results in univariate Cox analysis) and risk score in LUSC training dataset. Risk score, tissue or organ of origin, t and n stages and history of prior malignancy have significant effects on survival. When pathologic tumor stage is used instead of t, n, m stages, tissue or organ of origin, risk score and history of prior malignancy show significant effect on survival; Figure S14: Survival analysis of risk groups clustered by using signature gene expression at different tumor stages in LUSC test dataset; Figure S15: Mosaic plots showing association analysis of categorical variables for LUSC test dataset. Pearson residuals show the positive (blue) or negative (red) association between levels of categories; Figure S16: Multivariate Cox Regression results of selected clinical variables (which have significant results in univariate Cox analysis) and risk score in LUSC test dataset. Only risk score has significant effect on survival either t, n, m stages or pathologic tumor stage is used instead of t, n, m stages; Figure S17: Venn diagram of differentially expressed genes in tumor samples of risk groups for LUAD and LUSC test groups; Figure S18: Pathway enrichment of DEGs of LUAD risk groups; Figure S19: Pathway enrichment of DEGs of LUSC risk groups; Figure S20: Pathway enrichment of CNV genes of LUAD risk groups; Figure S21: Pathway enrichment of CNV genes of LUSC risk groups; Figure S22: Venn diagram of genes which have significant copy number alterations in tumor samples of LUAD and LUSC risk groups; Figure S23: Summary of SNVs in LUAD risk groups; Figure S24: Summary of SNVs in LUSC risk groups; Figure S25: SomInaClust result of potential driver genes containing significant SNVs in LUAD risk groups. SomInaClust calculates oncogene (OG) score and tumor suppressor gene (TSG) score for each significant gene and classifies the gene according to the score threshold (20) and reference database; Figure S26: SomInaClust result of potential driver genes containing significant SNVs in LUSC risk groups. SomInaClust calculates oncogene (OG) score and tumor suppressor gene (TSG) score for each significant gene and classifies the gene according to the score threshold (20) and reference database; Figure S27: Venn diagram of all genes and potential driver genes containing SNVs of LUAD and LUSC risk groups, Table S1: Gene list of expression signature in LUAD. Ensemble Gene IDs were used in signature analysis and then enriched by using BioMart database; Table S2: KEGG pathway enrichment of expression signature gene list in LUAD by using KEGG Mapper tool; Table S3: Gene list of expression signature in LUSC. Ensemble Gene IDs were used in signature analysis and then enriched by using BioMart database; Table S4: KEGG pathway enrichment of expression signature gene list in LUSC by using *clusterProfiler*

R package; Table S5: SomInaClust result of SNV data in tumor samples of LUAD low-risk group; Table S6: SomInaClust result of SNV data in tumor samples of LUAD high-risk group; Table S7: SomInaClust result of SNV data in tumor samples of LUSC low-risk group; Table S8: SomInaClust result of SNV data in tumor samples of LUSC high-risk group.

**Author Contributions:** Methodology, T.Z.; formal analysis, T.Z.; resources, T.Z., T.Ö.-S.; data curation, T.Z.; writing—original draft preparation, T.Z.; writing—review and editing, T.Ö.-S.; visualization, T.Z.; project administration, T.Ö.-S. All authors have read and agreed to the published version of the manuscript.

**Funding:** T.Z. and T.Ö.-S. were partially funded by Turkish National Institutes of Health (TÜSEB) grant number 4583.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The datasets supporting the conclusions of this article are publicly available and can be downloaded from TCGA data portal (https://portal.gdc.cancer.gov) or by using *TCGAbiolinks* R package [18]. The R code used in this study is available upon request.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Differential Interactome Proposes Subtype-Specific Biomarkers and Potential Therapeutics in Renal Cell Carcinomas**

**Aysegul Caliskan 1,2,† , Gizem Gulfidan 1,† , Raghu Sinha 3,\* and Kazim Yalcin Arga 1,\***

<sup>1</sup> Department of Bioengineering, Marmara University, Istanbul 34722, Turkey;

gulaysecaliskan@hotmail.com (A.C.); gizemgulfidn@gmail.com (G.G.)

<sup>2</sup> Faculty of Pharmacy, Istinye University, Istanbul 34010, Turkey

<sup>3</sup> Department of Biochemistry and Molecular Biology, Penn State College of Medicine, Hershey, PA 17033, USA **\*** Correspondence: rsinha@pennstatehealth.psu.edu (R.S.); kazim.arga@marmara.edu.tr (K.Y.A.)

† These authors contributed equally to this work.

**Abstract:** Although many studies have been conducted on single gene therapies in cancer patients, the reality is that tumor arises from different coordinating protein groups. Unveiling perturbations in protein interactome related to the tumor formation may contribute to the development of effective diagnosis, treatment strategies, and prognosis. In this study, considering the clinical and transcriptome data of three Renal Cell Carcinoma (RCC) subtypes (ccRCC, pRCC, and chRCC) retrieved from The Cancer Genome Atlas (TCGA) and the human protein interactome, the differential protein–protein interactions were identified in each RCC subtype. The approach enabled the identification of differentially interacting proteins (DIPs) indicating prominent changes in their interaction patterns during tumor formation. Further, diagnostic and prognostic performances were generated by taking into account DIP clusters which are specific to the relevant subtypes. Furthermore, considering the mesenchymal epithelial transition (MET) receptor tyrosine kinase (PDB ID: 3DKF) as a potential drug target specific to pRCC, twenty-one lead compounds were identified through virtual screening of ZINC molecules. In this study, we presented remarkable findings in terms of early diagnosis, prognosis, and effective treatment strategies, that deserve further experimental and clinical efforts.

**Keywords:** renal cancers; protein interactome; diagnostic biomarker; prognostic biomarker; virtual screening; docking

#### **1. Introduction**

Kidney cancer is among the 10 most common cancers in adults and renal cell carcinoma (RCC) shows a steady increase in prevalence [1]. RCC is known to be the most common type of kidney cancer and is responsible for up to 85% of cases; it is more common in males than in females (ratio, 1.7:1), and most of the patients are at an older age (average age of 64 years) [1]. Primarily, RCC is categorized into subtypes according to histological classification under a microscope, including clear cell (ccRCC, also known as KIRC), papillary (pRCC, also known as KIRP), chromophobe (chRCC, also known as KICH), and some other, less common subtypes such as collecting duct, medullary RCC, and unclassified RCC [2]. The most prevalent one among kidney cancers is ccRCC which represents 75–80% of RCC [3] and derives its name from its clear cytoplasm on the pathologic analysis [4]. The rest are papillary (10–15%), chromophobe (5%), and rare kidney cancers. Although improvement of the state-of-the-art treatment technologies, the overall prognosis is still poor in RCCs, particularly for patients who present with the advanced-stage disease [1]. Therefore, early diagnosis and successful urological procedures with partial or total nephrectomy can be life-saving. However, only about 10% of RCC patients present with urological problems or other known clinical symptoms. More than sixty percent of patients are incidentally noticed at imaging investigations [5], and metastasis has already begun in nearly 20–30% of the patients when diagnosed [6]. In this context, biomarker

**Citation:** Caliskan, A.; Gulfidan, G.; Sinha, R.; Arga, K.Y. Differential Interactome Proposes Subtype-Specific Biomarkers and Potential Therapeutics in Renal Cell Carcinomas. *J. Pers. Med.* **2021**, *11*, 158. https://doi.org/10.3390/ jpm11020158

Academic Editor: Susan M. Bailey

Received: 28 December 2020 Accepted: 18 February 2021 Published: 23 February 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

identification from secretion fluids is extremely important for early diagnosis. Furthermore, biomarkers are becoming increasingly significant to facilitate the discovery of anti-cancer agents, to distinguish cancer cells from the other cells, to understand drug action mechanisms, to predict prognosis, to design personalized medication, and to understand the mechanisms underlying response to therapy. All types of kidney cancers are different in many respects including tumor location within the kidney, the cell type from which they originate, and alterations on their genotype, making it even more crucial to characterize the pathology of each type and to identify specific proteins as druggable targets.

Biomarkers play an important role in the implementation of personalized medicine in clinics with respect to defining subtype phenotypes, predicting clinical course and prognosis, and determining the appropriate therapeutic approach. In this respect, a comprehensive pool of molecular markers from different biological levels (hub proteins, receptors, miR-NAs, mRNAs, reporter TFs, and metabolites) were presented from a systematic integrative biology perspective with the potential to provide in-depth knowledge into the disease mechanisms in RCC subtypes [7]. On the other hand, the limited diagnostic and prognostic performance of a molecular biomarker revealed the need for system biomarkers to be obtained with approaches that consider interactions between critical molecules such as the differential protein interactome [8,9].

The differential interactome methodology is based on the idea that significant alterations occur in the protein–protein interactions (PPIs) among phenotypes. The success of this approach has been effectively demonstrated in various cancers and their subtypes [8–10]. The differential interactome approach made it possible to estimate the probability distributions for any possible co-expression profile of gene pairs (encoding proteins that interact with each other) across phenotypes and to determine the uncertainty of whether a PPI is meeting the corresponding phenotype.

The Cancer Genome Atlas (TCGA) is one of the comprehensive cancer genomics datasets available. The availability of TCGA allows researchers to uncover the molecular profiling of tumors through the application of genome analysis technologies, including large-scale genome sequencing. In our present study, we investigated the TCGA transcriptome data from 892 individuals and used the differential interactome methodology [8] that integrates transcriptome data with the human protein interactome network to analyze and compare the differential protein–protein interactions among healthy and tumor groups. Three common subtypes (ccRCC, pRCC, and chRCC) of RCC were investigated and compared in terms of the differential interactome profiles. These analyses allowed us to identify differentially interacting proteins (DIPs) that represent significant changes in their interaction patterns during the transition from "normal" to "tumor" phenotypes and are therefore differently related to the corresponding tumor [9]. We also determined candidate protein panels with high diagnostic and/or prognostic performance, which might allow us to develop novel drug candidates and to diagnose patients in the early stage. Furthermore, we offer drug candidates that showed an inhibitory effect on mesenchymal epithelial transition (MET) receptor tyrosine kinase which is one of the DIPs that have activated interactions in the case of pRCC.

#### **2. Materials and Methods**

#### *2.1. Collecting of Gene Expression Data*

The transcriptome datasets consisting of three different subtypes of kidney cancer (chRCC, ccRCC, and pRCC) were acquired from the TCGA database [11] to analyze their gene expression profiles. The number of the primary tumor and the matched normal tissue samples were 538 and 72 for ccRCC, 289 and 32 for pRCC, and 65 and 24 for chRCC, respectively.

#### *2.2. Obtaining Protein–Protein Interactions Data*

Physical PPI data experimentally detected in humans was obtained from the BioGRID database using the latest version (v. 4.0.189) [12]. The data contained 51,745 PPIs among 10,177 human proteins. After filtering the PPI data for proteins encoded by genes having transcriptome data in TCGA datasets, a network was reconstructed with 34,604 PPIs among 8322 proteins.

#### *2.3. Identification of Differential Interactome and Differentially Interacting Proteins*

The gene expression profiles of RCC subtypes were analyzed together with the obtained PPI data through the differential interactome algorithm revised in the study of Gulfidan et al. [8] using R (version 3.6.1). This algorithm presents the differential PPIs (dPPIs) between the tumor phenotype and normal phenotype, taking into account the relative observation frequencies (*q*-value) of each PPI as described earlier [8,9]. The criteria of the algorithm for obtaining significant dPPIs were set as *q*-value < 0.10 (significantly repressed in tumor phenotype), *q*-value > 0.90 (significantly activated in tumor phenotype), and a normalized observation frequency either in normal or tumor phenotype > 20%.

DIPs, the proteins having differential interactions, were classified into two groups according to their interaction patterns: (i) DIPs having repressed interactions under tumor condition, and (ii) DIPs having activated interactions under tumor condition. DIPs that were specific to the RCC subtypes and were common in all subtypes were detected for further analyses. The networks consisting of dPPIs and DIPs were visualized through the Cytoscape 3.4.0 [13].

#### *2.4. Evaluation of the Secretion Levels of Subtype-Specific DIPs in Body Fluids*

The secretion levels (ppm) of subtype-specific DIPs in plasma, serum, urine, and saliva were investigated through protein expression data which is accessible in the GeneCards [14] database curating the proteomics databases; ProteomicsDB [15], MaxQB [16], and MOPED [17].

#### *2.5. Analysis of Diagnostic Performance and Prognostic Power*

Principal component analyses (PCA) were carried out for the assessment of the diagnostic potential of subtype-specific DIPs using the expression values of genes encoding the DIPs which had the secretion levels in body fluids. The simulations were performed using the gene expression data of tumor samples of ccRCC, pRCC, and chRCC datasets for each subtype-specific DIPs, separately.

To explore the prognostic performance of each subtype-specific DIP, survival analyses were carried out through stratification of patients into high- and low-risk groups based on their prognostic index (PI), which is the linear component of the Cox model (PI = β1x<sup>1</sup> + β2x<sup>2</sup> + . . . + βpxp, where β<sup>i</sup> is coefficient acquired from the Cox fitting, x<sup>i</sup> is the expression value of each gene). Analyses were implemented through the SurvExpress tool [18] utilizing two RNA-Seq originated datasets of ccRCC with 415 samples, and pRCC with 278 samples including clinical data. In addition, RNA–Seq originated chRCC dataset with 9 samples with clinical data retrieved from TCGA [11] was analyzed separately through the pipeline established in our previous study [8] due to the absence of any dataset related to the chRCC subtype in the SurvExpress database. The signatures of survival in each risk group were estimated by Kaplan–Meier curves and Hazard Ratios (HR). Statistical significance of each plot was evaluated by the cut-off for log-rank *p*-value < 0.05. Hazard ratio (HR = O1/E1/O2/E2) was calculated to discover the significance of the survival curves based on the ratio between the relative death rate in group 1 (O1/E1) and the relative death rate in group 2 (O2/E2), where O denotes the observed number of deaths, and E denotes the expected number of deaths.

#### *2.6. Identification of Candidate Drugs through Virtual Screening*

We set the following criteria to determine the potential drug target protein among DIPs in docking studies: (i) its interactions should be activated in the disease state, and (ii) it should have at least 5 interactions. Among DIP proteins of pRCC, MET protein satisfied all the criteria and came to the forefront as a potential drug target. Through virtual screening, potential molecules targeting MET were determined. To have an insight into the ligand-receptor interactions, the available X-ray crystal structures of MET were fetched from the Protein Data Bank (PDB) (www.rcsb.org) [19]. PDB entry 3DKF was chosen for all the docking studies according to the resolution, Ramchandran outliers, and structural similarity between the screened ligands and the co-crystallized ligands. Virtual Screening binding analysis was carried out on the assigned binding site of the X-ray crystal structure of MET [20] exploiting ZINC molecules described by the publicly available ZINC15 library [21]. Molecular docking studies were executed for 703 substances retrieved from the ZINC15 library through AutoDock Vina [22] in the PyRx virtual screening tool (v. 0.8) [23].

#### **3. Results**

#### *3.1. Differential Interactome Estimation in Subtypes of RCC*

RNA-seq transcriptome data of three RCC subtypes were retrieved from TCGA to apply differential interactome methodology [8] for prediction of highly probable PPIs in each state and identification of differential PPIs. To this end, we examined transcriptomic data for three common subtypes of RCC with an adequate number of samples (*n* > 24) in both normal and tumor groups (see Materials and Methods section). The scale-free topology of the differential interactome network brings out the presence of hubs called DIPs indicating substantial changes in their interaction patterns during the transition from "normal" to "tumor" phenotypes [8,9]. We determined 628 DIPs for chRCC, 50 DIPs for ccRCC, and 29 DIPs for pRCC as subtype-specific DIPs, whereas 33 DIPs were common in all subtypes (Supplementary Table S1). The tumor-specificity of DIPs varied according to the subtype (Figure 1).

– **Figure 1.** Differential interactome networks reconstructed with differential protein–protein interactions (dPPIs) around differentially interacting proteins (DIPs) in three Renal Cell Carcinoma RCC subtypes. Red nodes represent DIPs specific to the subtype of interest. ccRCC: Clear Cell Renal Carcinoma; pRCC: Papillary Renal Cell Carcinoma; chRCC: Chromophobe Renal Cell Carcinoma.

2,

**ity on s-DIPs**

Further analyses (i.e., determination of prognostic power, diagnostic performance, and druggability) were implemented by taking into account 50 DIPs specific to ccRCC, 29 DIPs specific to pRCC, and the top 50 DIPs having the most interactions specific to chRCC (Table 1). We considered those DIPs as a cluster for each subtype and suggested them as potential systems biomarkers for the development of effective diagnosis, prognosis, and treatment strategies.

**Table 1.** Differentially interacting proteins (DIPs) specific to RCC subtypes.


<sup>1</sup> Protein expression was observed at least in one of the following body fluids: serum, plasma, saliva, urine; <sup>2</sup> Protein expression was not observed in any of the following body fluids: serum, plasma, saliva, urine.

> Then, we filtered DIPs by considering whether they are secreted in body fluids and renamed secreted proteins as "s-DIPs" (Table 1, Figure 2). s-DIPs represent proteins that were expressed at least in one of the following media: serum, plasma, saliva, or urine (www.genecards.org) [14]. The importance of secretion in body fluids that can be accessed without surgery is that it might provide serious convenience for early diagnosis. While s-DIPs were used for diagnosis analysis, all DIPs (s-DIPs and non-s-DIPs) were considered in prognosis and druggability (virtual screening) analyses. " "

> > – –

**Figure 2.** Bubble plots indicating protein expression levels of DIPs specific to three subtypes in different body fluids including serum, plasma, saliva, and urine. The x-axis indicates subtypes while the y-axis indicates protein symbols.

#### *3.2. Prognostic and Diagnostic Capabilities of DIPs Clusters*

We considered the clusters of DIPs as potential systems biomarkers for each RCC subtype and analyzed their diagnostic performance and prognostic power.

The diagnostic analysis was performed via PCA using s–DIPs (Table 1). All s–DIP clusters exhibited significantly high diagnostic performance for relevant subtypes (Figure 3A).

–Meier curves estimating patients' survival for three **Figure 3.** Diagnostic and prognostic performance analysis results for Renal Cell Carcinoma (RCC) subtypes. (**A**) Principal component analyses (PCA) plots, visualized by considering s-DIPs, indicating the individual differences in the gene expression profiles in tumor samples among the subtypes. (**B**) Kaplan–Meier curves estimating patients' survival for three subtypes based on categorization of patients into high- and low-risk groups via prognostic index. ccRCC: Clear Cell Renal Carcinoma; pRCC: Papillary Renal Cell Carcinoma; chRCC: Chromophobe Renal Cell Carcinoma; HR: Hazard Ratio; PC: Principal component.

– − − − Prognostic capabilities of gene clusters were quantified through log-rank p-values and visualized by Kaplan–Meier curves (Figure 3B). Cox (proportional hazards) regression was also engaged to estimate HRs. These analyses were carried out utilizing TCGA clinical datasets (see Materials and Methods section). Gene clusters were significantly predictive in terms of patient survival risk assessment for the respective subtype (Figure 3B, ccRCC *p* < 1 × 10 −15 , pRCC *p* = 5.36 × 10 −5 , chRCC *p* = 1.86 × 10 −3 ). Through Coxproportional hazard analysis, HR values were estimated as 4.33, 4.32, and 7.12 for ccRCC, pRCC, and chRCC, respectively.

#### *3.3. Discovery of Drug Candidates through Virtual Screening Analyses*

In silico simulation techniques have become an indispensable tool for modern-day drug discovery programs. Molecular docking currently offers the best alternative to quickly estimate the binding conformations of ligands that are energy-efficient to interact with a pharmacological receptor site. It has become more popular as it is time and cost effective in the pipeline of drug discovery and development. Interactions of some DIP proteins within the module were activated during the tumorigenesis, while some were found to be repressed. We hypothesized that, if we manage to break through the interactions that are activated, we might model a strategy to cure the disease. For this purpose, we considered DIPs with activated interactions in the tumor state as potential drug targets.

For instance, among DIP proteins of the pRCC subtype, MET protein came into prominence as a potential drug target. Candidate molecules targeting MET were determined via virtual screening of the ZINC15 library via the available crystal structures of MET from PDB. All available X-Ray crystal structures of MET (PDB IDs: 3DKF, 2RFN, 3EFJ, 3U6H, 4EEV) and their bound ligands were superposed, and potential binding sites were determined to identify the binding site location on the receptor (Figure 4A). Virtual Screening binding analysis was accomplished on the assigned binding site of the X-ray structure of MET (PDB ID: 3DKF) utilizing ZINC molecules which were described by the ZINC15 library. The virtual screening analysis revealed twenty-one ZINC molecules with high binding affinities (∆G<sup>0</sup> ≤ −12, LE > 0.35) (Table 2; Figure 4B). high binding affinities (ΔG ≤ −

**Figure 4.** Virtual screening to identify potential hit drug candidates for pRCC. (**A**) Superposition of X-ray crystal structures of MET retrieved from RCSB for the validation of docking protocol. (**B**) 2D structures of ZINC molecules that showed high binding affinities to MET protein in virtual screening.


**Table 2.** The ZINC molecules presented the best binding affinities to MET.

#### **4. Discussion**

Dysregulations in various biochemical pathways play an important role in cancer formation and development. Genetic studies have identified numerous molecular defects in cancer cells and suggested multiple potential targets for therapeutic intervention. Conventional drug design has mainly focused on the inhibition of a single protein, usually an enzyme or receptor; however, this strategy has not been successful enough, as the development and progression of cancers are mostly due to the coordinated action of a group of biological entities rather than a single molecule dysfunction [24]. Hereby, PPIs have become highly promising targets that cover many therapeutic areas and potential intervention points for the development of anticancer agents. Until now, significant progress has been made in identifying small molecule inhibitors of various protein–protein systems in the field of oncology, and powerful and selective drug-like molecules that inhibit many interactions such as p53-MDM2 interaction have been discovered [25]. Furthermore, a number of these small-molecule inhibitors, such as Siremadlin, AMG-232, and APG-115 have progressed to early phase clinical trials [26].

Our study reports the generation of the dPPI networks in RCC subtypes through the implementation of high throughput transcriptome and protein interactome data. The integration of respective RNA-seq datasets and differential interactome approach allowed the identification of dPPIs in different conditions (tumor/normal) in RCC subtypes. The study unveils and compares the dPPIs for each subtype and identifies DIPs through a differential interactome. Further analyses on DIPs may be useful in understanding the tumor mechanisms. For instance, our findings revealed that HspB1 protein is one of the common DIPs for three subtypes. The correlation between HspB1 expression in RCC subtypes and metastasis process has been revealed in previous studies and HspB1 is known to facilitate metastasis by suppressing anti-cancer response such as apoptosis and senescence [7,27].

DIP clusters were used for diagnostic and prognostic analyses for each subtype. Despite the improvements in the state of the art treatment technologies, the overall prognosis is still poor in RCCs and more than 50% of RCCs are diagnosed incidentally [28]. Even the detection of the early asymptomatic stage during routine examination could have a profound impact on clinical outcome. Therefore, an effective, clinically useful test for

early detection of RCC subtypes should be measurable in readily accessible body fluids, such as plasma, serum, urine, or saliva. For this purpose, we filtered DIPs by considering whether they are expressed in those body fluids at the protein level and defined the s-DIP concept here for the first time in literature. s-DIP clusters characterize patients well in terms of the diagnostic group (subtype) to which they belong. Hence, we offer that s-DIPs might be used for the diagnosis of candidate RCC patients after further experimental and clinical validations.

Saliva is one of the complex and important multi-constituent body fluids that reflects a wide variety of physiological knowledge due to its contents extensively supplied by the blood. Moreover, a saliva-based diagnosis has been drawing attention in the diagnosis of systemic diseases such as renal cancers, due to the source, composition, function, and interaction of saliva with the substances that make up the plasma [29,30]. In the present study, besides blood components and urine, we also demonstrated the potential of saliva as a non-invasive potential media for RCC diagnosis, especially in chRCC.

The three basic elements for the art of medicine are diagnosis, therapeutics, and prognosis. Therefore, after making the correct and early diagnosis, determining the optimal treatment strategies would be important and as a follow-up, one could provide up-to-date information on the patient's prognosis. Our present investigation also aimed to provide new targets for the design of novel therapies in RCC subtypes and putative biomarkers with prognostic significance. In this study, DIP clusters appear to be strong putative candidates for the prognostic marker in each related subtype. Survival analyses through stratification of patients according to clinicopathological variables such as tumor stage or grade would demonstrate the prognostic power of the potential biomarkers better. However, despite the presence of comprehensive gene expression profiling efforts such as TCGA, transcriptome data with available clinical information is still limited for RCCs, even for the most common subtypes.

Additionally, to shed light on the further experimental studies, we identified MET protein as an ideal potential drug target in pRCC and showed the high potential of twenty-one Zinc molecules (Table 2) as candidate therapeutics for future preclinical studies. The integration of the transcriptome and protein interactome data with the drug knowledge helped to uncover 21 in silico validated potential drug candidates for pRCC. These in-silico findings can be used further to design and synthesize novel MET inhibitors. Furthermore, ZINC73196087, ZINC72318117, ZINC72318118, ZINC73163075, ZINC73165724, ZINC73196196, and ZINC72318119 have been shown to demonstrate effective anti-proliferative activity against a panel of c-Met-amplified gastric cancer cell lines [31]. We propose that these ZINC compounds should also be evaluated with experimental studies for RCC cell lines and we conclude that these molecules might be potential therapeutics for the management of the pRCC. Further in vitro/in vivo pharmacological evaluations and clinical validations are needed for approval of these candidate drugs.

The major limitation of the study is the lack of experimental validations of the identified ZINC compounds on the RCC samples or cell lines. Future in vitro studies need to be conducted to evaluate the effects of ZINC compounds identified on cell viability, proliferation, and migration. Moreover, the mechanism of actions of these molecules need to be investigated in detail to elucidate their effect on molecular pathways such as apoptosis and cell cycle. Rather than being considered as a single agent, these compounds can also be regarded as adjuvant therapy to the baseline therapeutics, then, the critical extension of this work would be to learn whether the observations of in vitro studies can be recapitulated by in vivo studies and eventually in clinical trials. Another point that has a crucial role in translation to the clinic is sampling where body fluids are favorable for the detection of the biomarkers. Proteomics studies also need be verified for the proteins exhibiting significantly high diagnostic and prognostic performance for relevant subtypes. Moreover, these biomarkers could also assist oncologists to assist in optimal diagnosis and prognosis management.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/2075-442 6/11/2/158/s1, Table S1: List of Differentially Interacting Proteins (DIPs) in chRCC, ccRCC, pRCC.

**Author Contributions:** Conceptualization, K.Y.A., R.S., and A.C.; formal analysis, A.C. and G.G.; investigation, A.C.; writing—original draft preparation, A.C. and G.G.; writing—review and editing, K.Y.A. and R.S.; visualization, G.G. and A.C.; supervision, K.Y.A. and R.S.; project administration, K.Y.A. and, R.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** Publicly available datasets were analyzed in this study. Transcirptome data can be found here: [https://portal.gdc.cancer.gov/, Primary site: Kidney]. Protein interactome data is available here: [https://thebiogrid.org, File repository: BIOGRID-4.0.189].

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**


#### **References**


**Christen A. Khella , Gaurav A. Mehta , Rushabh N. Mehta and Michael L. Gatza \***

Rutgers Cancer Institute of New Jersey, New Brunswick, NJ 08903, USA; cak241@scarletmail.rutgers.edu (C.A.K.); gam160@cinj.rutgers.edu (G.A.M.); rnm53@scarletmail.rutgers.edu (R.N.M.) **\*** Correspondence: michael.gatza@cinj.rutgers.edu; Tel.: +1-732-235-8751

**Abstract:** The underlying molecular heterogeneity of cancer is responsible for the dynamic clinical landscape of this disease. The combination of genomic and proteomic alterations, including both inherited and acquired mutations, promotes tumor diversity and accounts for variable disease progression, therapeutic response, and clinical outcome. Recent advances in high-throughput proteogenomic profiling of tumor samples have resulted in the identification of novel oncogenic drivers, tumor suppressors, and signaling networks; biomarkers for the prediction of drug sensitivity and disease progression; and have contributed to the development of novel and more effective treatment strategies. In this review, we will focus on the impact of historical and recent advances in single platform and integrative proteogenomic studies in breast and ovarian cancer, which constitute two of the most lethal forms of cancer for women, and discuss the molecular similarities of these diseases, the impact of these findings on our understanding of tumor biology as well as the clinical applicability of these discoveries.

**Keywords:** genomics; proteomics; breast; ovarian; cancer

**Citation:** Khella, C.A.; Mehta, G.A.; Mehta, R.N.; Gatza, M.L. Recent Advances in Integrative Multi-Omics Research in Breast and Ovarian Cancer. *J. Pers. Med.* **2021**, *11*, 149. https://doi.org/10.3390/jpm11020149

Academic Editor: Raghu Sinha

Received: 14 January 2021 Accepted: 14 February 2021 Published: 19 February 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

#### **1. Introduction**

Each year more than 1.8 million people are diagnosed with cancer in the United States including more than 270,000 breast cancer patients and 21,000 ovarian cancer patients [1]. Despite advances in diagnostic tools, predictive biomarkers, and new therapies over the past 20 years which have led to declining mortality rates, more than 285,000 people will die each year in the US due to their disease, including more than 50,000 breast and 13,000 ovarian cancer patients [1]. Enormous clinical variability, including disease progression and response to therapy has been shown to exist for most forms of cancer. These observed differences are driven, in part, by underlying genetic, genomic, and proteomic alterations unique to each patient [2–5]. In essence, cancer is not a single disease but rather a collection of genetically driven malignancies affecting a given tissue. As a result, all tumors, even within a given tissue type, cannot be treated equally [6–12]. New tools, therapies, biomarkers, and treatment strategies are being developed or will need to be developed, to identify and target those mutations and/or signaling pathways essential for each tumor to improve clinical outcome and quality of life for each patient.

The underlying genetic heterogeneity within human cancers creates several challenges both clinically and from a basic science perspective. From a mechanistic standpoint, variability in patterns of genomic and proteomic alterations create a challenge in separating the key drivers of oncogenic signaling, tumor development, and progression from those mutations that are tumor-promoting but non-transforming or that do not directly contribute to tumorigenesis (i.e., passenger mutations). This is essential as not all mutated or aberrantly expressed genes are required for tumorigenesis nor do they equally contribute to therapeutic response [13,14]. As a result, there is a need to develop tools and approaches to understand the interplay between altered genes and to determine how these genes or proteins promote aberrant signaling, including the identification of novel signaling networks

and cellular processes that contribute to tumor growth and progression. Finally, utilizing the compendium of alterations across a given tumor type, we must develop approaches to identify novel therapeutic targets and determine predictive biomarkers to recognize patients that are likely to benefit from specific therapeutic regimens.

Over the past 20 years, beginning with the sequencing of the human genome to the more recent development of next-generation sequencing (NGS), advances in genomics, proteomics, and systems biology have allowed us to begin to catalogue, visualize, compare and dissect patterns of DNA mutations and copy number alterations, mRNA and miRNA expression patterns, protein and phosphorylated protein expression and epigenetic alterations between individual patients, across specific forms of cancer and between malignancies affecting different tissues [2–5,15,16]. These studies, coupled with functional genomic studies, have begun to identify and provide insight into key drivers of oncogenic signaling, mediators of specific tumor characteristics, including response to therapy, and identify novel treatment strategies. In this review, we will examine historical and recent advances in genome and proteome-wide analyses in breast and ovarian cancer and discuss the impact of these findings on our understanding of tumor biology as well as the clinical applicability of these discoveries.

#### **2. Clinical Characterization of Breast and Ovarian Cancer**

Breast cancer is the most commonly diagnosed and the second leading cause of cancer-related mortality for women in the United States [1]. While it is estimated that approximately 50,000 women in the US and 522,000 women worldwide will die from this disease annually, survival rates have steadily increased by over 40% over the past 30 years [1,17]. Currently, more than 98% of patients diagnosed with early stage disease are expected to live for at least 10 years and the current 5-year survival rate is ~90% across all stages [1,17–19]. These improvements can be attributed, in part, to increased early detection from earlier screening and improved imaging technology as well as the development of novel therapeutic regimens incorporating chemotherapeutics, targeted therapies, radiation, surgery, and immunotherapy [20–22]. Despite these advances, the prognosis for patients with locally advanced and metastatic disease remains poor. Patients with advanced metastatic disease have a 5-year survival rate of less than 30% and a significant percentage of patients whose tumors are inoperable and/or refractory to current therapies will succumb to their disease within 5 years irrespective of tumor stage at diagnosis [18].

Part of the challenge in developing effective treatments for this disease lies in the molecular and clinical heterogeneity that exists between each patient's tumor. Clinically, breast tumors are classified based on morphological features with ~70% of tumors being classified as invasive ductal carcinomas (IDC), ~15% categorized as invasive lobular carcinoma (ILC), and the remaining tumors regarded as rare subtypes [18,23]. Prognosis and treatment strategies are largely dictated by classical histopathologic features including tumor size, histological grade and stage, lymph node status, and the expression of hormone receptors or HER2 (human epidermal growth factor receptor 2) status [18,19]. Among the histological subtypes, estrogen receptor (ER), progesterone receptor (PR), and HER2 status can be used to further delineate patients into ER+/PR+ (60–70% of patients), HER2+ (10–20%), and triple negative breast cancer (TNBC, 15–20%). However, differences in the prevalence of these histological subtypes are seen between women with different ancestries. Notably, women of African American decent have a higher incidence of TNBC when compared to American women of European ancestry (36.3% vs. 13.7%) [1,24,25]. Importantly, these biomarkers are used to direct current standard-of-care treatments with endocrine-based therapies comprising the core of therapeutic regimens to treat hormone receptor-positive (HR+) breast tumors; HER2-family inhibitors forming the foundation for therapies used to treat HER2+ patients [26], and multi-agent cytotoxic chemotherapies providing the basis for the treatment of TNBC patients [27,28]. While endocrine-based therapies result in remission in the majority of patients with HR+ tumors, approximately

30–50% of patients manifest primary or acquired resistance [29,30]. Recent studies have reported that the emergence of hormone therapy resistance in ER+ breast cancers can arise through four predominant mechanisms including ESR1 mutations (18%), altered MAPK signaling (13%), MYC or transcription factor activation (9%), and other/unknown factors (60%) [31]. The findings of multiple clinical trials have resulted in FDA and international approval for use of the mTOR inhibitor everolimus in conjunction with exemestane for the treatment of patients with advanced or metastatic ER+, PR+, HER2-negative, PIK3CA mutant tumors [32]. Likewise, the PI3K inhibitor alpelisib has been approved for the same patient population in combination with fulvestrant [33]. More recently, CDK4/6 inhibitors palbociclib, ribociclib, and abemaciclib have been approved and, in conjunction with hormone therapy, have become the primary treatment regimen for HR+/HER2- treatment naïve or hormone therapy treated metastatic breast cancer patients [34]. Finally, the treatment of TNBC tumors has begun to evolve to include immune checkpoint inhibitors while patients with BRCA mutations are treated with PARP inhibitors in conjunction with chemotherapy [28,35,36].

In contrast to breast cancer, ovarian cancer is a less frequently diagnosed malignancy with approximately 21,750 women in the US and 250,000 women worldwide being diagnosed with this disease annually [1]. Unfortunately, however ovarian cancer is the most lethal form of gynecological cancer with an estimated 13,940 women dying from this disease in the United States and more than 150,000 women dying worldwide in 2020 [1]. This translates to a death-to-case ratio of approximately 64%, far outpacing the lethality of breast cancer [1].

Similar to breast cancer, the clinical complexity of ovarian cancer is due, in part, to histological and molecular heterogeneity. Ovarian tumors are classified into four major classes: high (70%) and low (4.1%) grade serous, endometrioid (8.3%), clear cell (9.5%) and mucinous (3.2%) carcinoma [37–39]. Of note, a study by Beckmeyer-Borowko and colleagues showed that non-Hispanic Black ovarian cancer patients were more likely to be diagnosed with stage four HGSOC, clear cell or mucinous carcinomas when compared to non-Hispanic White patients [40]. Beyond these classifications, ovarian epithelial tumors have been divided into Type I and Type II tumors [41–44]. Type I tumors typically encompass low-grade and indolent tumors including low-grade serous, low-grade endometrioid, clear cell, and mucinous carcinomas that tend to present as stage I tumors while Type II tumors include more aggressive and high-grade tumors including high-grade serous, high-grade endometrioid, malignant mixed mesodermal tumors, and undifferentiated carcinomas [42]. Genetically, Type I and II tumors are characterized by specific mutations: mutations common to Type I tumors include KRAS, BRAF, ERBB2, CTNNB1, PTEN, PIK3CA, ARID1A, and PPP2R1A, while Type II tumors have a high frequency of TP53 mutations (>95%) as well as mutation or aberrant expression of BRCA1 or BRCA2 [3,42–44]. Importantly, these mutations appear to be largely confined to each subtype with Type II tumors rarely expressing Type I mutations and Type I tumors being largely wild-type for TP53, except for low-grade mucinous tumors (~25%) [45].

While more than 92% of Stage I ovarian cancer patients are successfully treated, only 15% of patients are diagnosed with the early stage disease [1]. High-grade serous ovarian cancer (HGSOC) is the most prominent form of ovarian cancer and accounts for 70% of ovarian cancer-related deaths [46,47]. Although most patients will initially respond favorably to standard-of-care cytoreduction surgery followed by platinum- and taxane-based treatment, approximately 80% will eventually relapse and develop resistance in late stage disease [39,48–51]. In addition, 25% of patients are inherently resistant to standard-of-care therapy and demonstrate disease progression within six months of treatment [52]. More recent studies have determined that HGSOC tumors are characterized by homologous recombination deficiencies (HRD) which render these tumors sensitive to PARP inhibition [53–56]. As such, PARP inhibitors (olaparib, rucaparib, niraparib) were FDA approved for treatment of platinum-sensitive recurrent, BRCA mutated, and HRD-positive epithelial ovarian cancer [34,56–61]. In addition, olaparib in combination with bevacizumab

has been approved for the treatment of patients with advanced epithelial ovarian cancer. This combination treatment nearly doubled the progression-free survival in HRD-positive tumors when compared to bevacizumab alone [62]. Finally, clinical trials examining the impact of multiple novel combinatorial strategies, including VEGF inhibitors (VEGFi) in combination with PARP inhibitors (PARPi) as well as anti-PD-1 inhibitors, alone or in combination with VEGFi and/or PARPi, are ongoing [63]. While significant advances in the molecular characterization of ovarian cancer have led to a better understanding of this disease, the prognosis has not significantly improved over the past several decades; poor prognosis is attributed to lack of early detection and resistance (inherent and acquired) to platinum-/taxane-based therapies [3,46].

Despite the inherent differences in clinical manifestation between breast and ovarian cancer, a portion of these malignancies are intrinsically linked, as women with specific inherited germline mutations including BRCA1, BRCA2, PALB2, TP53, CDH1, and PTEN have an increased lifetime risk of developing either disease [64,65]. BRCA1 or BRCA2 mutations are the most prevalent cause of high penetrance inherited breast or ovarian cancers and have been shown to affect patients irrespective of race or ethnicity. Overall, the rate for germline BRCA1 or BRCA2 mutations is relatively low with 4–6% of breast and 8–15% of ovarian tumors expressing one of these mutations [66–73]. However, it is estimated that 39–63% of women with a BRCA1 mutation will develop ovarian cancer while 46–87% will develop breast cancer by age 70. Likewise, BRCA2 mutation carriers are strongly predisposed to develop ovarian (17–27%) or breast (38–84%) cancer [65,74–77]. Clinically, BRCA1/2-mutated breast tumors tend to be classified as TNBC invasive ductal carcinoma with high nuclear grade while BRCA1/2-mutated ovarian tumors are predominantly classified as HGSOC [78–80]. Although BRCA-mutated breast and ovarian tumors are often highly aggressive, a number of studies suggest that these patients may achieve a slightly better short-term therapeutic response (2–3 year overall survival) compared to patients with wild-type BRCA1 or BRACA2, as these tumors may be more responsive to DNA-damaging drugs; however, long-term survival and/or progression-free survival differences remain unclear [65,79–82].

While inherited breast and ovarian cancers have similar features, including response to specific inhibitors, as we will discuss below, non-familial ovarian and some subsets of breast tumors also demonstrate striking genome- and proteome-wide similarities including somatic mutations, patterns of copy number alterations, and expression of specific genes, proteins and signaling pathways. By utilizing this information, more recent treatment strategies for breast and ovarian cancers have begun to incorporate targeted therapies in conjunction with standard-of-care treatments [18,19,21,83]. While these novel regimens have improved clinical response and quality of life, as we have discussed, these treatments are often limited to patients with specific genomic alterations or clinical subtypes and not all patients will respond equally. These observations highlight the need to not only develop new, more effective therapies but also illustrate that it is necessary to develop a genome- or proteome-wide portrait of the underlying molecular heterogeneity of each of these diseases. Gaining a more complex view of the underlying biological mechanisms driving disease development, progression, and response to treatment will allow investigators to identify and develop biomarkers that will enable the design and evolution of treatment regimens based on the underlying biology of a given patient's tumor.

#### **3. Molecular Classification and Characterization of Breast Cancer**

Seminal studies by Perou and colleagues used microarray-based gene expression profiling and unsupervised hierarchical clustering to identify a 496 intrinsic gene list that defined five molecularly distinct subtypes of breast cancer [84,85]. These subtypes clustered largely along the estrogen receptor status with ER-positive tumors being classified into luminal A (LumA) or luminal B (LumB) subtypes while ER-negative tumors were classified as HER2 enriched (HER2E), basal like, or normal like [84–87] (Figure 1).

**Figure 1.** Gene expression-based classification of breast and ovarian cancers. The major molecular classifications of breast and ovarian cancers are depicted here. Further highlighted are the molecular similarities between high grade serous ovarian and basal-like breast cancer.

The ER-positive luminal tumors express luminal cytokeratins 8 and 18 and are enriched for genes expressed by breast luminal epithelial cells, including GATA3, FOXA1, ESR1, and MYB. Among luminal tumors, LumB tumors are defined by higher expression of proliferation-related genes, high genomic risk, and poorer clinical outcome than LumA tumors. HER2E tumors are predominantly ER negative, characterized by the amplification of the HER2 gene on chromosome 17q12, and are associated with poor prognosis and increased risk of metastasis. The basal-like subtype is largely synonymous with triple negative breast cancer (TNBC). These tumors express basal epithelial cell markers keratin 5/6 and are characterized by enrichment of the genes expressed by breast basal or myoepithelial cells [2]. Basal-like tumors represent the most diverse subtype of breast cancer and are associated with high proliferation rates, high mutational burden, higher risk of metastasis, and poor survival rates [85,86]. Finally, the normal-like breast cancer subtype has also been described and is typified by high expression of genes known to be expressed by basal epithelial cells and adipose cells. However, the biological relevance and clinical importance of this subtype remains unclear [84–87].

The association between molecular subtypes and disease-specific outcomes demonstrate that tumor cell response to treatment is not determined by anatomical prognostic factors but rather inherent molecular features, indicating the potential clinical value of these expression-based patient classifications [84,85]. However, the 'intrinsic' gene set used by Perou and group to experimentally categorize patients was not readily employable in the clinic due to its relatively large size [84–87]. Utilizing microarray data and several minimization methods, Parker et al. developed a reliable 50-gene signature to identify breast cancer intrinsic subtypes [88]. Combined with common histologic criteria, such as tumor grade and pathologic staging, the 50-gene signature (PAM50) provided significant prognostic and predictive value through classification and generating risk-of-relapse

(ROR) scores for all patients [88]. While the clinical implications of the PAM50 subtype predictor remain to be fully resolved, the Prosigna assay, which is derived from the initial intrinsic analyses, is used clinically to help predict risk of relapse and to guide therapeutic intervention [89–91].

The recent advances in gene expression profiling platforms have led to the identification of additional molecular subtypes, further defining the biological and clinical heterogeneity of breast cancer. The claudin-low subtype was identified to be predominantly triple negative and poorly differentiated subgroup of breast tumors which are enriched for cancer stem cell-like genomic signatures and immune response genes [92,93]. These tumors are characterized by low expression of luminal genes, proliferation genes, and genes involved in tight junctions and cell–cell adhesion [92,93]. More recent gene expression studies employed by Lehmann et al. initially categorized TNBC tumors into six molecular subtypes, including BL1 and BL2 (basal-like), immunomodulatory (IM), mesenchymal (M), mesenchymal stem-like (MSL), and luminal androgen receptor (LAR) [94]. This classification has since been further refined to include the four (TNBCtype-4) tumorspecific subtypes (BL1, BL2, M, and LAR) and exclude the IM and MSL subtypes due to the identification of transcripts from infiltrating lymphocytes and tumor-associated stromal cells, respectively [95]. The TNBC type-4 subtypes demonstrated significant differences in histopathology, grade, and local and distant disease progression [95]. These subtypes were characterized by unique identities of pathway activation which stimulated the use of known inhibitors and therapies to exploit signaling vulnerabilities, exhibiting early evidence of clinical applicability [94,95].

Decomposing vast amount of information from profiling studies represents a key step in developing patient-specific therapeutic regimens. In light of this, pathway signatures were developed as an underlying platform to provide a functional interpretation of the gene expression data within each subtype and further dissect the heterogeneity of breast cancer [96]. Integrated analysis using gene expression and pathway activation probabilities contributed to stratifying tumor subtypes and characterizing distinct clinical and biological features [96–100]. Along these lines, Gatza et al. utilized pathway activation probabilities that reflect in vivo activity levels to identify subgroups that reflect the status of important signaling pathways in breast tumors [97]. These subgroups corresponded to the intrinsic subtypes and exhibited distinct patterns of pathway activation, DNA copy number changes as well as clinical and biological characteristics [97,98].

While microarray-based gene expression profiling of breast tumors has been able to distinguish tumor subgroups and begin to define underlying biological and clinical diversity, these studies were limited in their ability to create true "molecular portraits" of breast cancer. Large-scale integration of multiple proteogenomic platforms through The Cancer Genome Atlas (TCGA) project provided a more comprehensive view of breast cancer heterogeneity and underlying biology. The TCGA project (*n* = 1072) used data from six different high-throughput technology platforms, including mRNA expression microarrays (and mRNA sequencing), DNA methylation, genomic DNA copy number arrays, microRNA sequencing, whole-exome sequencing, and reverse-phase protein array (RPPA) to examine specific genetic, epigenetic, and proteomic alterations in breast cancer and to link these alterations to clinical data and characteristics [2]. Intriguingly, while the overall patterns of proteogenomic alterations were found to be variable amongst patients, including between subtypes, intra-subtype variation was limited. Remarkably consistent patterns of genomic and proteomic alterations were found to be associated with each of the mRNA-based PAM50 subtypes.

Luminal tumors are characterized by an increased frequency and diversity of significantly mutated genes in addition to a lower frequency of copy number alterations [2,101,102]. These tumors exhibit increased mutations in luminal genes including GATA3 and FOXA1, as well as genes belonging to the p38-JNK pathway (MAP3K1 and MAP2K4), which were mutated in a mutually exclusive manner. PIK3CA, which is the most frequently mutated gene in breast cancer, was predominantly altered in luminal tumors and was mutated

at a much higher frequency in LumA (45%) relative to LumB (29%) tumors. Despite the high frequency of activating PIK3CA mutations in LumA subtype tumors, the PI3K/AKT signaling axis has not been shown to be consistently upregulated in these tumors. In contrast to LumA tumors, LumB tumors are characterized by higher inactivation of the TP53 pathway associated with a higher rate of mutation in the TP53 gene, loss of ATM2, and MDM2 amplification [2]. More recent integrative analysis using 52 gene expression signatures that measure oncogenic signaling pathways identified a limited number of genes that are amplified and overexpressed in aggressive luminal subtype tumors. Among these genes, a subset (FGD5, METTL6, CPT1A, DTX3, MRPS23, EIF2S2, EIF6, and SLC2A10) was found to be essential for cell growth and, in some instances, correlated with clinical outcome [99,103–105]. This study further suggests that not only do LumA and LumB tumors express unique mutation profiles, but that these alterations result in distinct patterns of oncogenic signaling beyond differences in proliferation.

The HER2E subtype is characterized by high amplification of the HER2 amplicon (80%) on chromosome 17q12. These tumors can be either ER negative or positive and demonstrate increased expression of the HER2 oncogene as well as other genes on the 17q12-amplicon, including GRB7. However, not all clinically defined HER2-positive tumors are categorized into this subtype, as some ER+/HER2+ tumors demonstrate increased expression of specific luminal genes (i.e., GATA3, BCL2, and ESR1) and cluster largely into the LumB subtype. TP53 (72%) and PIK3CA (39%) mutations are highly enriched in this subtype and show significantly higher expression and activation of receptor tyrosine kinases such as FGFR4, EGFR, and HER2 [2].

Basal-like breast cancers represent the most heterogeneous subtype with a high frequency of TP53 mutations which are present in an overwhelming 80–90% of tumors. In addition to TP53 truncating mutations, these tumors are characterized by loss of RB1 and BRCA1 along with amplification and hyperactivation of the MYC and FOXM1 genes. Increased activation of PI3K/AKT signaling, relative to other subtypes is a distinguishing feature of basal-like tumors despite a low incidence of PIK3CA (9%) mutations. Expression of keratins 5, 6, and 17 and cell proliferation genes are significantly upregulated in these tumors owing to the increased expression of FOXM1 as a transcriptional driver of this gene signature [2].

Similar multiplatform analysis was also conducted to provide molecular context to invasive lobular breast cancer, which is the second most commonly diagnosed invasive breast cancer and comprise approximately 10–15% of all cases. Despite histological differences, invasive lobular carcinomas (ILC) and ER+ invasive ductal carcinomas (IDC) patients have historically been treated similarly, emphasizing the need to more robustly understand the molecular underpinnings of the disease for better therapeutic interventions [106]. Multiplatform studies carried out by the TCGA project and Desmedt et al. identified mutations in the E-cadherin (CDH1) gene (63% in ILC vs. 2% in IDC) which is the hallmark feature of ILCs. In addition to CDH1 loss, mutations in PTEN, TBX3, FOXA1, and ESR1 were enriched in ILC relative to IDC tumors. Mutations in PIK3CA were reported in 48% of ILC relative to 33% of IDC tumors which, along with loss of PTEN function, defines the significant upregulation of PI3K signaling in ILC tumors [23,106]. Transcriptomic analysis identified molecular ILC subtypes which were characterized by unique molecular profiles and clinical outcomes with more proliferative tumors demonstrating a worse clinical prognosis [23,106]. Overall, these multiplatform analyses not only better distinguished between lobular and ductal carcinomas but also identified clinically relevant heterogeneity that may help to better differentiate and treat these carcinomas. In addition to TCGA, the METABRIC (Molecular taxonomy of breast cancer international consortium) study used an integrated clustering approach to examine the genomic and transcriptomic architecture of 2000 breast tumors (along with clinical data) and classify them into 10 integrative clusters (IntClust 1–10) which demonstrate distinct alterations and clinical outcomes [9,107]. Importantly, this classification strategy demonstrated that incorporation of both mRNA and cDNA copy number data identified additional granularity within the PAM50 subtypes as well

as molecularly distinct entities based on the underlying genetic alterations. These data, coupled with multi-platform orthogonal analyses performed by TCGA have provided enormous insight into the underlying genetic framework of breast cancer; however, these studies were limited in their ability to associate the genomic and transcriptomic features with the proteome and phosphoproteome that drives the phenotypic characteristics of a tumor. The RPPA platform used by TCGA for quantifying protein abundance and posttranslational modifications is limited by antibody quality and an inability to detect mutant protein forms.

Consistent with this premise, analysis of the proteome and phosphoproteome was performed by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) using mass spectrometry-based analyses to integrate and contextualize genome-scale alterations of 105 tumors and adjacent normal samples [5]. In breast cancer, these analyses resulted in the identification of an average of more than 11,000 proteins and 26,000 phosphosites per tumor significantly extending the previous work from TCGA where only 141 proteins and 31 phosphosites were captured [2,5,23]. Phosphoproteomic analysis informed the translational outcomes of PIK3CA mutations in breast cancer, which often are not correlated with the transcriptional signature of breast tumors. These analyses resulted in the identification of 62 different phosphosites in PIK3CA mutated breast tumors, including *RPS6KA5* and *EIF2AK4*, explaining the activation of the pathway and revealing possible druggable kinases in this pathway [5]. The CPTAC project highlights the need for integrating data across proteogenomic platforms to connect somatic mutations with the activation of various oncogenic signaling pathways in tumors for better therapeutic outcomes. In addition, CyTOF (Cytometry by Time of Flight), has been used for realtime high-dimensional analysis of breast cancer [108–110]. For example, a recent study by Ali et al. emphasized the significance of multiplatform analyses when coupled with multidimensional imaging mass cytometry in highlighting the tumor heterogeneity both on tumor-specific and tumor microenvironment levels which in turn affect the tumor evolution, ecosystem and clinical outcomes [109]. Similarly, imaging mass cytometry has been used to generate high-dimensional images of 281 human breast tumor samples in order to identify the spatial architecture, and to define heterogeneity between intra and inter-tumoral cell subpopulations [110].

#### **4. Molecular Classification and Characterization of Ovarian Cancer**

Similar to studies in breast cancer, studies by Tothill (*n* = 285) [111], the TCGA project (*n* = 489) [3], Helland (*n* = 939) [112], and Konecny (*n* = 174) [113] utilized K-means clustering and non-negative matrix factorization consensus clustering to classify HGSOC into four distinct gene expression-based subtypes. These four molecularly distinct subtypes (Figure 1) were termed immunoreactive, proliferative, mesenchymal, and differentiated based on molecular and clinical characteristics [3]. However, in contrast to molecular subtypes of breast cancer which have clear biological and clinical implications, these relationships do not appear to be as robust in HGSOC.

Mesenchymal tumors have been reported to have the worst clinical prognosis of the four HGSOC molecular subtypes [111,113,114]. These tumors are defined by low tumor purity and demonstrate increased desmoplasia and reactive stromal components, including CD3+ infiltrates [4,111,115]. Phenotypically, these tumors exhibit increased epithelial to mesenchymal transition (EMT), angiogenesis, extracellular matrix (ECM) remodeling, and proteolysis [113,115]. Consistent with these findings, mesenchymal subtype tumors demonstrate increased expression of HOX genes which contribute to development regulation as well as aberrant TGFβ, stromal-associated, wound response, and fos-jun signaling as demonstrated by gene expression signatures [116]. Global proteomic analyses by the CPTAC project further demonstrated that these tumors exhibit increased expression of ECM and cytokine signaling at the protein level [4].

Similar to mesenchymal subtype tumors, immunoreactive tumors are defined by low tumor purity [4]. However, immunoreactive subtype tumors are associated with a good

clinical prognosis [111,113–115]. While these tumors do demonstrate infiltration of stromal cells, immunoreactive tumors appear to be defined by increased immune signaling, likely due to increased immune cell infiltration [117]. Gene and protein expression profiling studies have reported activation of the adaptive immune response as well as increased T and B cell activation markers, antigen presentation, and chemokine signaling [3,111,113,115]. Consistent with these findings, it was reported that mesenchymal and immunoreactive tumors are more closely related to each other, as compared to the proliferative or differentiated subtypes, despite differences in patterns of signaling network activity and clinical outcomes [113,118]. These similarities are likely due to the low tumor cell purity that is apparent in mesenchymal and immunoreactive tumors, while the distinction between these groups is driven by both underlying tumor biology as well as the composition of infiltrating cell populations in the tumor microenvironment. Consistent with these ideas, recent single-cell RNAseq studies have demonstrated that unique aspects of the tumor microenvironment may define signaling within these subtypes; immunoreactive tumors were shown to have immune-related cell clusters while mesenchymal tumors contained cell clusters enriched for cancer-associated fibroblast signaling [117].

Tumors classified in the proliferative subtype are associated with poor overall survival [106,107,109]. In contrast to immunoreactive or mesenchymal tumors, these tumors exhibit high tumor cellularity and low infiltration of CD3+ and CD45+ stromal cells [111,112]. Proliferative subtype tumors are defined by an undifferentiated phenotype and express pro-proliferative signaling including increased expression of developmental transcription factors, proliferation markers, ECM-related genes, and WNT/β-catenin signaling, as well as increased expression of proteins involved in DNA replication [3,4,111,113]. In addition, it has been noted that these tumors express low levels of ovarian cancer marker genes (MUC1, MUC16, KLK6, KLK7, and KLK8) and high expression of the developmental transcription factors HMGA2 and SOX1. These tumors were also associated with an increased expression of FANC genes and homologous recombination [113,115].

Finally, differentiated subtype tumors have been shown to most closely resemble normal fallopian tissue at the gene expression level [111,115]. At the genetic level, these tumors are defined by increased expression of MUC1, MUC16, SLP1 (secretary fallopian tube marker), epithelial cell differentiation markers, and folliculogenesis-related genes which are indicative of increased tumor cell differentiation [3,113,115]. Proteomic analyses from the CPTAC project were able to further dissect signaling networks activated in these tumors to identify enrichment of protein expression programs associated with altered tumor cell metabolism and increased cell-to-cell communication [4] providing additional insight into subtype-specific mechanisms driving tumor development and progression.

Although these seminal studies were able to identify four largely concordant subtypes based on gene expression profiling, a number of recent studies have suggested that these subtypes are not consistent across platforms and populations [113–115,118]. These more recent studies have observed that tumors were able to be more robustly classified into fewer groups and/or that alternative strategies may provide additional insight into the underlying biology of this disease. Notably, studies from the CPTAC project were able to utilize proteome-wide data from 9600 proteins and 6769 phosphoproteins from 174 tumor samples to identify altered signaling networks in the transcriptome based subtypes further refining and validating the distinct signaling networks in these tumors, as well as identifying signaling pathways correlated with homologous recombination deficiency phenotype and patient survival [4]. However, in this proteogenomic analysis of ovarian tumors, Zhang et al. also identified five distinct protein-based subtypes and were able to show that three of the five subgroups were largely concordant with the TCGA mRNA-based subtypes. The remaining two subgroups represented tumors defined by unique underlying biology that would not be apparent by assessing mRNA data alone. Consistent with this premise, a number of recent studies have attempted to move beyond mRNA- or protein-based approaches to incorporate phosphoproteomic or glycoproteomic profiling to investigate the heterogeneity of HGSOC tumors [119,120]. These studies have provided additional depth to our under-

standing of HGSOC tumorigenesis by identifying subgroups defined by unique patterns of active kinases and altered cell signaling which contributing to tumor development, progression, and clinical outcome. Likewise, recent work by Karagoz et al. [116] assessed patterns of oncogenic signaling using a panel of 62 gene expression-based signatures across the four TCGA subtypes in three unique datasets [3,111,121]. As noted above, these studies identified unique oncogenic and tumorigenic signaling pathways associated with each mRNA-based subtype. However, in contrast to similar analyses in breast tumors which demonstrated clear differences in pathway patterns between the PAM50 subtypes, the distinctions amongst ovarian subtypes appeared to be more subtle and included increased intra-subtype heterogeneity [99,116].

Collectively, these data reinforce the premise that ambiguity in HGSOC subtype assignment could be a result of shared common biological underpinnings, the existence of intermediate subtypes, or biased by tumor cellularity and/or composition. As such, it is apparent that further refinement of the molecular subtypes, potentially through the incorporation of multiple genomic or proteomic platforms, may be necessary for these classification schemes to be clinically relevant.

At the molecular level, HGSOC has been classified as a C-class malignancy (chromosomally unstable) that is defined by extensive structural variants [102]. Consistent with this classification, mutational profiling of HGSOC by the TCGA project using whole-exome sequencing has identified a limited number of significantly mutated genes that define this disease [3]. The most prominent among these is TP53 mutations which are evident in nearly all patients and are believed to arise early in the transformation process [3,44,122–124]. Beyond altered p53 signaling, transforming oncogenic mutations in PIK3CA, BRAF, KRAS and NRAS have been detected in HGSOC, albeit at low frequencies (<1%). Almost half of HGSOC tumors are characterized by homologous recombination (HR) deficiency through germline or somatic mutations in BRCA1/2 (20%), BRCA1 hypermethylation (11%), and/or dysregulation of other HR genes including PTEN, ATM or ATR, RAD51C, EMSY and Fanconi anemia genes [3]. While few significant mutations are apparent in HGSOC, DNA copy number alterations are more frequent in these tumors [3,102]. This includes amplification of MECOM, MYC, and CCNE1 which are among the most significant focal amplifications and found in more than 20% of HGSOC cases in addition to KRAS and MAPK1 which are found in more than 10% of cases.

Interestingly, while specific genes are mutated at a low frequency in HGSOC, pathway analyses incorporating orthogonal whole-exome sequencing and copy number data demonstrated that HGSOC tumors are characterized by aberrant RB1/E2F (67%), PI3K/RAS (45%), and NOTCH (22%) signaling as well as dysregulation of the FOXM1 transcription factor network (87%) [3]. Further pathway analysis, based on phosphoproteomic profiles of HGSOC tumors demonstrated differential expression of RhoA-regulatory, PDFRB, and integrin-linked kinase pathways between poor and good prognostic HGSOC patients [4].

Finally, a number of recent studies have used integrative analyses to identify novel oncogenes and tumor suppressors that promote HGSOC and biomarkers to predict therapeutic response and risk. These studies relied on integrative analyses of DNA copy number, methylation, and gene expression data to identify potential oncogenes and tumor suppressor proteins in HGSOC and clear cell carcinoma [125–127]. Similarly, studies from Karagoz et al. assessed orthogonal genomic and proteomic data from human HGSOC tumors from the TCGA and CPTAC studies in the context of a prognosis gene expression signature. These analyses, along with data from a genome-wide RNAi screen in ovarian cancer cell lines, identified ADNP as a novel oncogene in HGSOC and in vitro studies showed that this protein regulates cell survival through altered cell cycle checkpoints [116]. While the therapeutic potential of these genes remains unclear, studies by Kurimchak et al. incorporated kinome profiling of human tumors and PDX models to identify MRCKA as a potentially drug-able oncogene activated in a subset of HGSOC tumors. Subsequent loss-of-function studies demonstrated that this gene could regulate HGSOC tumorigenesis and could be pharmacologically inhibited suggesting it may have potential as a novel

therapeutic target [128]. Finally, studies from Coscia et al. identified CT45 as a biomarker for platinum-sensitivity in HGSOC using global proteomic profiling and demonstrated that mRNA or protein expression was associated significantly with chemosensitivity and disease-free survival [129].

#### **5. Genetic and Genomic Relationship between Breast and Ovarian Tumors**

As discussed above, both familial and non-inherited breast and ovarian cancers have been shown to have similar genetic and genomic features (Figure 1). Beyond the previously discussed correlation between inherited mutations and the increased risk of breast or ovarian tumor development, analysis of human breast tumors demonstrated that HGSOC tumors also express a basal-like gene expression signature [2]. This relationship was further validated by multi-platform genomic analyses in which basal-like and HGSOC tumors were found to have a strong genomic association based on global mRNA profiling and to express a similar pattern of DNA copy number alterations [130]. While similar patterns of gene expression between these two diseases were noted by Hoadley and colleagues in a pan-cancer analysis of 12 tumor types and by the TCGA breast cancer paper, this association was not as clear when studied within the context of 33 tumor types potentially reflecting differences driven by tumor cell of origin, additional variability due to a more diverse tumor population, or other technical or biological factors [2,130,131]. Regardless, basal-like and high-grade serous ovarian tumors are classified as C-class malignancies and are characterized by predominant recurrent copy number alterations [102]. Specifically, these tumors share copy number gains of 1q, 3q, 8q, and 12p, and copy number losses of 4q, 5q, and 8p [2,100]. Among the commonly amplified genes are MYC (8p21.21), CCNE1 (19q13.2), MECOM (3q26.2), FGF3 (4p16.3), MCL1 (1q21.3) and ERBB3 (12q13.2) [43]. Additionally, basal-like and HGSOC tumors share RB1 loss in 20% and 10% of tumors, respectively [2,3].

Beyond copy number alterations, these tumor subtypes have been shown to express similar mutation profiles for a limited number of key oncogenes and tumor suppressor genes. Basal-like and high-grade serous ovarian tumors are enriched for BRCA1/2 inactivation and express TP53 mutations in 90–95% of tumors [2,3,5]. In addition, both tumor types exhibit an increased frequency of genome breakpoints as well as a loss of heterozygosity and allelic imbalance indicating genomic instability and homologous recombination deficiency [132–134]. More recent studies have indicated that these tumors demonstrate high homologous recombination deficiency (HRD) scores, accumulation of large-scale state transitions, increased loss of heterozygosity (LOH), and telomeric allelic imbalance scar signatures. Clinically, these alterations have been shown to be significantly correlated with pathologic complete response and minimal residual disease in TNBC patients treated with platinum-based therapies and with a better prognosis in HGSOC [36,135,136].

In addition to specific mutations and genomic alterations, basal-like breast and HG-SOC tumors have been shown to express similar signaling networks including increased activation of PI3K signaling [2,3,102,137–139]. While PIK3CA mutations are relatively rare events in each tumor type, a number of unique alterations have emerged as contributing to aberrant signaling [2,3,5]. In HGSOC, DNA copy number gains in PIK3CA (18%), AKT1 or AKT2 (9% combined) and to a lesser extent, homozygous deletion of PTEN (7%) are the main drivers for this pathway [3,140,141]. In contrast, basal-like tumors are regulated by alterations in multiple genes (EGFR, IGFR1, AKT3) that occur at a low frequency (2–4%), as well as a loss of PTEN (35%) or INPP4B (30%), SOX4 amplification and overexpression, and MAGI3-AKT3 gene fusion [2,137,142,143]. Interestingly, these data indicate that while both tumor types are characterized by high PI3K signaling, the mutations activating signaling in each tumor type differed in prevalence and composition.

Similarly, on a pathway activity level, basal-like and HGSOC tumors share increased FOXM1, HIF1-α, and MYC signaling [2]. Basal-like breast cancers have increased altered cell cycle checkpoint regulation, DNA damage repair, MYC, and immune response signaling [5], while proteins associated with recurrent copy number alterations in HGSOC

converge on cell migration/invasion and immune regulation pathways [4]. Consistent with common alterations between basal-like and high-grade serous ovarian tumors, Marcotte and colleagues [144] used a genome-wide pooled shRNA screen in 29 breast and 15 ovarian cancer cell lines to identify genes uniformly essential for cell viability as well as genes required within each disease type. While cell line-specific genes were identified, these analyses also identified 66 ovarian and 155 breast cancer-specific genes as well as 297 genes that were essential for viability in the majority of cell lines irrespective of tissue type. While the latter set of genes did not necessarily take into account distinctions between molecular subtypes, these studies further reinforce the shared underlying biology of these diseases.

#### **6. Advances in Genomic Analyses of Breast and Ovarian Cancer**

Single-platform genomic and proteomic analyses have allowed for the identification and cataloging of mutations; copy number alterations; and altered gene, miRNA, protein, or phosphoprotein expression profiles [2,3,7,12,23,145–152]. As we have outlined above, patterns of genomic and proteomic alterations can define tissue- and histological-specific differences in underlying biology and can be used to define molecularly distinct subtypes of cancer, including breast and ovarian cancer [2,3,11,23,97,118–120,136,153–155]. These genomic and proteomic patterns can identify oncogenic mechanisms that contribute to disease development, progression and in some instance can serve as therapeutic targets or markers of therapeutic response [10,62,63,99,102,113,126,128,129,135,137,142,156–172]. However, single platform analyses can be limited in their ability to visualize altered signaling networks and oncogenic processes. Given the complexity of mechanisms regulating these processes, multiplatform analyses, incorporating orthogonal genomic and proteomic data, enable the visualization of various types of alterations, in multiple key components within a given network to better define the state of signaling within specific tumors types and/or subtypes [102]. More importantly, integrative multiplatform analyses have led to the comprehensive identification of actionable alterations through reverse engineering of signaling pathways, while identifying upstream effectors and downstream targets using multiple omics platforms [156,173–176] (Figure 2). Ultimately, integrative analyses have resulted in the discovery of novel tumor-promoting mechanisms with higher confidence.

Breast and ovarian tumors are comprised of a complex collection of cell types including multiple populations of tumor cells, stroma, immune cells, fibroblasts, and other cells that encompass the tumor microenvironment [177,178]. As we have discussed, omics technologies that rely on analysis of the entire tumor (i.e., bulk analysis) have provided an enormous amount of insight into tumor biology; however, these approaches represent an averaged view of the tumor landscape and do not allow for fine resolution at the single cell level. Although a number of approaches including ESTIMATE, and others, have been developed to delineate specific signaling networks that arise from discrete cell populations or to estimate differences in cell composition within tumors using bulk sequencing or proteomic data, these methodologies are unable to fully address these challenges [179]. Advances in single-cell omics have had a significant impact on our understanding of tumor characteristics that are not apparent by bulk genomic, proteomic, or metabolomic approaches. These methods have allowed us to identify and characterize unique cell subpopulations, distinguish cell transition states, map molecular markers, identify novel and previously unrecognized biological features, and in combination with other technologies, are beginning to be used to spatially map tumor cell populations, identify circulating tumor cells and provide mechanistic insight into tumorigenic processes including metastasis and therapeutic response. Given spatial limitations, we point our readers to an excellent collection of review articles that discuss these advances in depth [180–191].

**Figure 2.** Use of single platform and integrative omics in cancer biology and medicine. The major contributions of single and multi-platform omics studies as well as single cell omics are summarized here. Single platform studies enable cataloging of mutation or alteration patterns, identifying signaling networks of interest and defining certain molecular subtypes. Multiplatform studies can further expand single platform-defined molecular subtypes and identify signaling pathways by identifying mutations in multiple genes representing multiple levels of pathway dysregulation. Single cell analyses allow for analyses of tumor cell subpopulations, identify cell transition states, map molecular markers and cell populations and identify circulating tumor cell populations. Orthogonal analysis of these data provides further context to genomic studies. These approaches contribute to a greater understanding of tumor biology as well as clinical advancements in treating cancer.

A number of recent studies have employed single-cell RNA-sequencing (scRNA-seq) analyses to examine tumor immune profiling [192–197]. These approaches have clear implications for both our understanding of the role of the immune system in the tumor microenvironment and for determining or predicting the efficacy of immunotherapy-based treatments. Of note, a recent study by Azizi et al. demonstrated the wide variability in immune cell type composition between breast cancer patient samples in addition to highlighting the phenotypic expansion of intratumoral immune cells using single-cell RNA and T cell receptor sequencing [198]. Further studies by Savas and colleagues demonstrated the association between tissue-resident memory T cell differentiation signature, developed using single-cell RNA-seq, and prognosis in early stage triple-negative breast [199]. Demonstrating the potential clinical implications and applicability of these approaches, investigators have used these technologies to identity mechanisms of therapeutic resistance [200]. Notably, recent work identified enrichment of immunosuppressive immature myeloid cells (IMC) in anti-Her2 and CDK4/6 inhibitor-resistant HER2-positive breast cancer, while combinatorial treatment with cabozantinib (IMC-targeting tyrosine kinase inhibitor) and immune checkpoint blockade overcame resistance [201]. Moreover, scRNAseq has been used to develop gene-expression-signature of the myeloid-derived suppressor cells (MDSCs) in addition to identifying CD84 as a surface biomarker for MDSCs in breast cancer [202]. Similarly, Wan et al. reported reprogramming of inert natural killer and T cells to a highly active cytotoxic state following bispecific anti-PD-1orPD-L1 antibody treatment using single-cell RNA-seq analysis of HGSOC organoid co-cultures; this study identified

a potential advantage of bispecific antibodies in immune checkpoint blockade therapy in HGSOC [203]. Further analyses have identified inter- and intra-tumor heterogeneity in cancer associated fibroblasts cell states in HGSOC and breast cancer [117,204]. Collectively, immune profiling coupled with imaging and single-cell RNA-seq underscored the importance of the spatial architecture of tumor niches in regulating immune infiltration and activation [205–212].

A number of recent studies have employed single-cell analyses to investigate inter and intra- tumoral heterogeneity [213–218]. Of note, recent work by Chung et al. linked tumor-intrinsic and immune cells diversity with TNBC intratumoral heterogeneity while studies by the Ellisen laboratory identified a connecting between intertumoral heterogeneity and clonality of inferred genomic copy number changes in these tumors. These latter studies suggested that cellular genotype drives gene expression programs, including signatures of treatment resistance and metastasis, in individual tumor cell populations [219,220]. Consistent with this premise, investigators have identified rare plastic pre-adapted cell subpopulations in luminal breast tumors which showed resistance to acute endocrine treatment [221]. Similarly, studies by Izar et al. and Geistlinger et al. used scRNA-seq based analyses to link the transcriptomic-based subtype classification of HGSOC to tumor cell type composition rather than intrinsic difference in gene expression patterns present in tumor epithelial cells further highlighting the importance of considering specific subpopulations of cells and the impact of signaling from the microenvironment on tumor characteristics [117,222]. More complex analyses integrating single-cell RNA-seq coupled with cell lineage tracing has been used to detail tumor cell subpopulations that contribute to various aspects of tumor evolution, including identifying pre-EMT (Epithelial to Mesenchymal Transition) cells that are essential for metastasis initiation [223].

Beyond assessing the transcriptome, single-cell DNA sequencing approaches have been developed and used to identify subpopulations of cells that express unique mutational and CNA patterns in therapeutically actionable genes in a given breast tumor [224]. These findings have clear clinical implications as different subpopulations will be likely be uniquely sensitive or resistant to specific therapeutic regimens and contribute to the evolution of the tumor and therapeutic sensitivity. Consistent with this premise, longitudinal sequencing analyses of tumors have demonstrated the emergence and/or re-emergence of clonal populations following treatment [225–227]. Complementary to these studies, single-cell mass cytometry using CyTOF identified rare tumor subtypes in HGSOC in addition to the dominant subsets and demonstrated that one of the identified rare subtypes was enriched for EMT signaling and associated with increased tumor metastasis [228]. Finally, merging single cell proteomics with other omics analysis has enable investigators to capture tumor-immune interactions in breast tumors [212]. Collectively, single-cell omics underscored intra- and inter-tumoral heterogeneity, identified subpopulation-specific vulnerabilities and emphasized the importance of addressing these vulnerabilities with combinatorial targeted therapeutic options [117,214,220,229–233].

Traditionally genome-wide RNAi and CRPSR/Cas9 screens have identified novel essential genes and pathways [144,234,235]. These studies evolved to include chemo-genetic screens which incorporate loss-of-function screens coupled with drugs or small molecule inhibitors in order to identify drug-gene interactions, cancer genetic vulnerabilities, and potential drug resistance mechanisms [21]. More recently, investigators have begun to incorporate these studies with multi-dimensional genomic analyses of human tumors as an added functional filter to identify clinically relevant cancer vulnerabilities and potential novel therapeutic targets. For instance, Marctotte and colleague integrated a pooled shRNA screen with genomic, transcriptomic and proteomic data from 77 breast cancer cell lines to identify breast cancer subtype-specific vulnerabilities. In this study, PSMB3, PSMA6 and ATP6V1B2 were identified as top ranked "basal-selective" essential genes [234]. Likewise, integrative correlative studies between pathway-specific gene expression signature scores, gene level DNA segment scores and RNAi shRNA abundance led to the identification of 21 amplified, essential and putative driver oncogenes in highly proliferative luminal breast

cancers as well as the identification of SOX4 as a driver of PI3K signaling in basal-like breast tumors [99,137]. Similarly, in ovarian cancer, systemic loss-of-function shRNA screen identified 50 essential and amplified genes including CCNE1, PAX8, FRS2, PRKCE, and RPTOR. Of note, PAX8 was found to be amplified in 16% of primary ovarian cancers while shRNA mediated silencing of PAX8 lead to apoptosis in cell lines harboring either PAX8 amplification or overexpression [235]. Likewise, ubiquitin B (UBB) and ubiquitin C (UBC) were identified as a paralog deficiency dependency in ovarian cancers, implying the essentiality of UBC in cell lines with repressed UBB [166]; shRNA mediated silencing of UBC in UBB repressed ovarian cancer xenograft model lead to tumor regression and prolonged survival [160]. More recent studies have evolved to employ machine learning algorithms for predicting functional cancer vulnerabilities while integrating shRNA (DEMETER2) or CRISPR/Cas9 (DepMAp) screens coupled with genomic and proteomic profiling of cancer cell lines [236]. Collectively, genome-wide RNAi and CRPSR/Cas9 loss-of-function screens made a significant contribution in identifying cancer dependencies, and potential novel therapeutic targets [166,237–239].

Unfortunately, spatial limitations prevent an in-depth discussion of the many tools, algorithms, and computational approaches that have been developed for a single platform and integrative analyses. However, biochemical and genetic-based studies as well as large-scale proteogenomic analyses have demonstrated that despite enormous tumor heterogeneity, molecular alterations often converge on a limited number of signaling networks, reflecting pathway activity levels and their role in driving tumor progression [14,102,147,149,240,241]. As a result, a number of tools and approaches have been developed, including the use of gene expression-based signatures, mutational signatures, CNA (copy number alterations) signatures, and protein signatures to quantify pathway activity [7,96,97,242–244]. These approaches include several that incorporate data from multiple technical platforms and use statistical modeling driven by a priori knowledge of signaling pathways and/or protein–protein interaction networks to cluster samples based on similarity networks, detect enriched signaling networks across multi-omics platforms and/or infer pathway activity from the expression or mutation profiles of established pathway components [3,8,23,170,245,246].

As we have outlined, large-scale genomics and proteomics studies including those from the TCGA, METABRIC, and CPTAC projects, as well as many other studies, have enabled the cataloging molecular alterations and signaling pathways in breast and ovarian cancers. While these studies have implications for our understanding of the underlying molecular mechanisms of these diseases, they also highlight the need to personalize therapeutic approaches based on the biology of each patient's disease [6,131,247]. Consistent with this premise, the use of genomic profiling, including DNA sequencing gene panels from Foundation Medicine and others, into clinical trials and practice has identified potential biomarkers, beyond standard immunohistochemistry (IHC)-based assays, to predict response and provided a means to personalize treatment. In addition to DNA sequencingbased assays, multiple molecular biomarkers are currently being used to monitor and track the progression of both ovarian and breast cancer [167,248,249]. For ovarian cancer patients, these biomarkers include single gene markers CEA (Carcinoembryonic Antigen), CA125 (Cancer Antigen 125), and HE4 (Human Epididymis protein 4), as well as multivariate index assays including OVA1, ROMA, and OVERA [250–257]. Similarly, several gene-expression based prognostic tests, including Oncotype DX [258], EndoPredict [259] and MammaPrint [260] as well as the aforementioned Prosigna assays, have been FDA approved to predict risk of recurrence in breast cancer and can be used to help guide clinical decisions. Oncotype DX Recurrence Score is based on the expression level of a panel of 21 genes, which is used to predict the likelihood of the 10-year tumor recurrence and guiding the adjuvant treatment options while weighing the added benefit of adjuvant chemotherapy versus treatment with hormonal therapy alone. Oncotype Recurrence Score stratified patient samples into low-, intermediate- and high-risk groups, predicting high likelihood of added benefit of adjuvant chemotherapy in the high-risk group patient

cohort [258,261,262]. MammaPrint on the other hand is based on the gene expression profile of a panel of 70 genes. This test is used in clinics for assessing clinical outcome and predicting recurrence score in early stage breast cancer. Based on recurrence scores, patient samples are stratified into low or high genomic risk. Studies showed that there is an added benefit to adjuvant chemotherapy in the low genomic risk group when compared to patients who did not receive chemotherapy [260,263]. Finally, emerging data supports the role of analyses of circulating tumor DNA in routine clinical care [48,264–268]. The FDA recently approved the FoundationOne Liquid CDx test, which is a circulating cell-free DNA (cfDNA) based-assay as a companion diagnostic for treatment of BRCA mutant (germline or somatic) ovarian cancer patients with the PARP inhibitor rucaparib as well as alpelisib treatment of HR+/HER2-, PIK3CA mutated breast cancer patients [249].

#### **7. Summary**

Over the past twenty years, proteogenomic profiling of human tumors has drastically expanded our understanding of breast and ovarian cancer biology. The identification of molecular subtypes; novel oncogenes, tumor suppressor proteins and signaling networks; as well as clinically relevant biomarkers have begun to contribute to the development of novel and more effective treatment strategies. The next challenge will be to effectively translate these efforts into the development of new clinical diagnostic tools, biomarkers, and therapeutic strategies in order to personalize cancer treatment and improve the outcome and quality of life for patients.

**Author Contributions:** All authors contributed to writing the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by CA166228 from the National Cancer Institute of the National Institutes of Health, V2016-013 from the V Foundation for Cancer Research, and RSG-19-160-01-TBE from the American Cancer Society to MLG. Additional funding for this project was provided by New Jersey Commission for Cancer Research (DCHS-20PPC-014 to CAK and DHFS-19PPC-019 to GAM), the Cox Foundation for Cancer Research (CAK and GAM) and through a MacMillan Cancer Genetics Summer Undergraduate Research Fellowship from Rutgers University (RNM).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We would like to thank the members of our laboratory and other colleagues for critical review of the manuscript. We apologize to those authors whose work has made essential contributions to this field but that we were unable to discuss due to spatial limitations.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

