1. Introduction
Variation of gene expression among individual tumors enables the personalization of many diagnostic and treatment options [1]. Indeed, multiple gene expression biomarkers were proposed for the prediction of patient survival and drug response (e.g., [1,2,3,4,5]). Several transcriptomic biomarkers were approved for clinical use, e.g., gene expression signatures predicting recurrence and prognosis in breast and thyroid cancers [6,7,8].
However, gene products do not act alone, but rather as the components of complex molecular networks executing specific functions in cell molecular physiology. Thus, cancer-specific alterations of gene expression inevitably lead to dysregulation of multiple molecular pathways [9]. This makes it possible to create the next generation of signatures based on molecular pathway activities and investigate their association with various characteristics of cancers, e.g., tumor grade, invasiveness and histological type, patient survival, and response to therapy [10,11,12,13]. Pathways affected in a tumor can be identified using various statistical methods for both RNA and protein expression data [14,15,16,17]. Alternatively, gene ontology (GO) analysis can identify molecular processes enriched by the differentially expressed genes [18].
Many such approaches ignore pathway functional topology and fail to determine the up- or downregulated state of a pathway and the extent of its activation. Indeed, different components of a molecular pathway may have different functional roles (for example, increased expression of an inhibitory component would act in favor of pathway downregulation, and vice versa). Furthermore, pathways may include feedback loops and other complex interactions that have to be taken into consideration when quantitatively assessing pathway alterations in cancer [19,20].
In order to translate expression data into quantitative measures of pathway deregulation while considering pathway architecture, a measure termed Pathway Activation Level (PAL) was introduced [14,21,22]. For a given molecular pathway, PAL is calculated as a weighted sum of logarithms of case-to-normal ratios for the expression levels of all genes involved in the pathway of interest, with weights ranging from –1 to 1 according to the activator/repressor role of the corresponding gene products.
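The PAL definition above can be illustrated with a minimal sketch. It assumes a base-2 logarithm and applies no normalization by pathway size; both choices are assumptions made here for illustration, since the text only specifies a weighted sum of logarithms of case-to-normal ratios. The function name and the gene identifiers are hypothetical.

```python
import math


def pathway_activation_level(case_expr, normal_expr, weights):
    """Weighted sum of log case-to-normal expression ratios for one pathway.

    case_expr, normal_expr: dicts mapping gene -> positive expression level.
    weights: dict mapping gene -> activator/repressor weight in [-1, 1].
    Assumption: base-2 logarithm, no normalization by pathway size.
    """
    pal = 0.0
    for gene, weight in weights.items():
        # Genes missing from either profile are skipped (an assumption).
        if gene not in case_expr or gene not in normal_expr:
            continue
        case, normal = case_expr[gene], normal_expr[gene]
        if case <= 0 or normal <= 0:
            raise ValueError(f"expression for {gene} must be positive")
        pal += weight * math.log2(case / normal)
    return pal


# Hypothetical two-gene pathway: one activator, one repressor.
weights = {"GENE_A": 1.0, "GENE_B": -1.0}
case = {"GENE_A": 4.0, "GENE_B": 1.0}
normal = {"GENE_A": 1.0, "GENE_B": 2.0}
print(pathway_activation_level(case, normal, weights))  # 3.0
```

In this toy case the activator is upregulated fourfold (contributing +2) and the repressor is downregulated twofold (contributing +1), so both changes push the pathway toward activation.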
When used to discriminate between nine human cancer types, PAL values showed better accuracy than expression levels of individual genes [23]. On its own, PAL was also found to be a good predictor of tissue type in bladder cancer [9] and of sensitivities to some cancer drugs [13,24,25,26,27,28]. Additionally, PALs demonstrated better stability against experimental noise and lower batch bias compared to single gene expression levels in both transcriptomic (microarray and RNAseq) and proteomic data [14,23,29]. These findings suggest the advantage of PAL values or other metrics of that kind as potential molecular biomarkers.
Furthermore, a computational recursive approach was proposed that can algorithmically assign activator/repressor roles to all pathway nodes depending on the pathway molecular architecture and the nature (activation/inhibition/other) of each molecular interaction within the pathway [20]. This enables fast, uniform, and simultaneous annotation of thousands of molecular pathways [30]. In addition, this approach avoids the operator errors that are likely during manual annotation of pathways and interactomes, given their high complexity.
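The idea of recursive role annotation can be sketched as follows. This is an illustrative simplification and not the published algorithm of [20]: it assumes that each interaction is a directed edge with sign +1 (activation) or -1 (inhibition), that the starting node has role +1, and that a node's role is the product of the edge signs along the first path that reaches it. Nodes later reached with a contradictory sign (e.g., through feedback loops) are only flagged. All names are hypothetical.

```python
def annotate_roles(edges, start):
    """Propagate activator (+1) / repressor (-1) roles from a start node.

    edges: iterable of (source, target, sign), sign = +1 or -1.
    Returns (roles, conflicts): roles maps node -> +1 or -1 as assigned on
    first visit; conflicts holds nodes also reached with the opposite sign.
    """
    graph = {}
    for source, target, sign in edges:
        graph.setdefault(source, []).append((target, sign))

    roles = {start: 1}
    conflicts = set()

    def visit(node):
        for target, sign in graph.get(node, []):
            proposed = roles[node] * sign
            if target not in roles:
                roles[target] = proposed
                visit(target)  # recurse only into unvisited nodes, so loops terminate
            elif roles[target] != proposed:
                conflicts.add(target)

    visit(start)
    return roles, conflicts


# Hypothetical pathway: A activates B, B inhibits C, C inhibits D, A inhibits D.
edges = [("A", "B", 1), ("B", "C", -1), ("C", "D", -1), ("A", "D", -1)]
roles, conflicts = annotate_roles(edges, "A")
print(roles)
print(conflicts)
```

Because recursion is plain depth-first search, very large pathways could exceed Python's default recursion limit; an explicit stack would be a straightforward substitute.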
Recently we published an alternative concept of a molecular pathway that is built algorithmically as an interacting network around the central node, the gene product of interest [31]. This approach is based on the whole-interactome model and is fully automatic. It has the advantage of reducing the bias introduced during manual reconstruction of the “classical” pathways. In manually reconstructed pathways, the gene contents are typically investigator hypothesis-driven, with a strong bias toward well-known “topical” molecules. As a result, such featured molecules are overrepresented in classical pathways, whereas others can be ignored or underrepresented.
Thus, using the whole interactome model, we constructed a set of so-called gene-centric pathways: local subnetworks of interacting molecules consisting of a central gene (main node of the pathway) and other molecular components interacting with this gene product either directly or indirectly. Each gene-centric pathway is characterized by the maximal number of molecular interactions that separate the central node from any other node of the pathway (one or two interactions in the published reports) [31]. One such algorithmically constructed pathway centered at gene FREM2 emerged as a promising predictor of tumor grade and survival in human gliomas, strongly exceeding the biomarker performance of the FREM2 gene itself [31]. We then investigated a larger number of gene-centric pathways in human gliomas, where they demonstrated an overall superior diagnostic and prognostic performance compared to single gene expression levels [32].
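The construction of a gene-centric pathway described above can be sketched as a breadth-first search that collects every node within a fixed number of interactions from the central gene. The sketch treats interactions as undirected pairs, which is an assumption made here for simplicity; the interactome model of [31] may distinguish interaction direction and type. The function name and the node labels are hypothetical placeholders.

```python
from collections import deque


def gene_centric_pathway(interactions, central_gene, max_steps=2):
    """Collect nodes within max_steps interactions of the central gene.

    interactions: iterable of (gene_a, gene_b) pairs, treated as undirected
    (an assumption of this sketch).
    Returns a dict mapping each pathway member to its distance, in
    interactions, from the central gene.
    """
    neighbors = {}
    for a, b in interactions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    distance = {central_gene: 0}
    queue = deque([central_gene])
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue  # do not expand beyond the allowed radius
        for nxt in neighbors.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance


# Hypothetical chain G0 - G1 - G2 - G3 centered at G0.
interactions = [("G0", "G1"), ("G1", "G2"), ("G2", "G3")]
print(gene_centric_pathway(interactions, "G0", max_steps=2))
# {'G0': 0, 'G1': 1, 'G2': 2}
```

Repeating this call for every gene product in the interactome would yield one candidate pathway per central gene, which is how the procedure scales to thousands of pathways without manual curation.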
However, the relative performance of the gene-centric pathways in comparison to classical pathways and single genes remained unexplored at the pan-cancer level. Here we used a human interactome model involving 7470 human gene products to algorithmically reconstruct molecular pathways, termed gene-centric pathways, centered around each of these genes. We then assessed their general biomarker characteristics in comparison with the previous generation of molecular pathways (3022 “classical” pathways) and with the transcripts of 24,862 individual genes. To this end, we investigated potential biomarker associations with tumor type and with overall and progression-free survival in 21 human cancer types using RNA sequencing and proteomic data for 8141 and 1117 samples, respectively. For all analytes in RNA and proteomic data, respectively, we found a total of 7441 and 7343 potential biomarker associations for gene-centric pathways, 3020 and 2950 for classical pathways, and 24,349 and 6742 for individual genes. Overall, the percentage of potential RNA biomarkers was statistically significantly higher for both types of pathways than for individual genes (p < 0.05). In turn, both types of pathways showed comparable performance. While the percentage of potential cancer type-specific biomarkers was comparable between proteomic and transcriptomic levels, the proportion of potential survival biomarkers was dramatically lower for the proteomic data: up to only 2.3% versus as much as 36.3% in transcriptomic data. Thus, we conclude that pathway activation level is an advanced type of cancer biomarker for RNA data, and that rapid algorithmic construction of pathways is a credible new alternative to time-consuming, hypothesis-driven manual pathway reconstruction.
4. Discussion
Here we performed the first pan-cancer screening, including gene expression data for 21 human cancer types, to compare the biomarker performance of manually and algorithmically reconstructed molecular pathways and of individual genes. We found statistically significant potential cancer-type biomarkers in each cancer type under analysis among individual genes, gene-centric pathways, and classical molecular pathways. The percentage of cancer-type biomarkers was significantly higher in both types of pathways (gene-centric and classical) than among individual genes. The cancer-type-specific biomarkers may be important for a better understanding of tissue-specific aspects of carcinogenesis. In addition, we screened for potential biomarkers that distinguish tumors from normal tissues and observed the same trend: pathway-based potential biomarkers outperformed single genes or proteins.
In 13 cancer types, we also identified putative prognostic biomarkers of all three types (genes, gene-centric pathways, and classical pathways). For overall survival, gene-centric pathways and classical pathways showed a higher percentage of significant potential biomarkers than individual genes in five and three cancer types, respectively, whereas potential gene biomarkers prevailed in two cancer types. For progression-free survival, individual genes, gene-centric pathways, and classical pathways had the advantage in four, three, and five cancer types, respectively. Thus, we conclude that a pathway-based approach can yield sets of potential survival biomarkers that are more enriched than those obtained from individual genes.
In terms of the magnitudes of hazard ratios (HRs) associated with significant potential survival biomarkers, there were statistically significant yet relatively small differences between the above three biomarker types, and no overall trend favoring any particular biomarker type across all cancers.
Many previous studies attempted to link the activities of genes and their interacting networks with clinical outcomes [52,53,54,55,56,57]. In most of them, the overall analytic pipeline included assessment of differential gene expression and building of co-expression networks, e.g., using Ingenuity Pathway Analysis [52], or identification of fully connected gene sets enriched for certain functions, e.g., using the Cytoscape ExpressionCorrelation tool [53]. Alternatively, genes could be grouped using weighted correlation network analysis (WGCNA) [58], e.g., for studying survival biomarkers in lung adenocarcinoma and in colon and renal cancers [54,55,56]. Protein–protein interactions from the STRING database (http://string-db.org/, accessed on 20 May 2023) were also used to supplement WGCNA for a more accurate prediction of patient survival in bladder cancer [57]. We compare these approaches with the current study findings in terms of input data and output results in Table 7.
Thus, in this study, we considered not only the proximity of genes within topological interaction networks but also their functional roles. Unlike previous research, in addition to well-known classical pathways from popular databases, we also generated algorithmically constructed gene-centric pathways and analyzed them in depth.
Overall, the algorithmic approach was shown to be a robust method of obtaining new molecular pathways. The algorithm selected highly connected gene-centric subnetworks in the human interactome, and the molecular pathways obtained in such a way have demonstrated biomarker values comparable with pathways manually constructed by expert curation.
It is now widely accepted that a combination of biomarkers, such as gene signatures or pathways, is more robust and performs better than using individual genes or proteins. Our results confirm this trend. However, the number of algorithmically constructed pathways was about two times higher than for the source classical pathways. The ultra-fast speed and efficiency of this approach, therefore, make it a useful solution for hypothesis-free algorithmic annotation of the whole connectomes.
In the domain of tumor-type biomarkers, many studies rely on a deep learning approach [59,60,61], including convolutional neural networks [62,63]. However, to our knowledge, the only type of input data in such models was gene expression, and the nature of functional interactions within groups of genes generated was not considered. We speculate here that applying our gene-centric pathway approach, based on the whole-interactome model, to such deep learning settings, can further increase the biomarker capacity of both methods.
Besides gene expression values, we analyzed the biomarker capacity of proteins profiled using two labels (TMT10 and TMT11) and three models of mass spectrometers (Orbitrap Fusion Lumos, Q Exactive HF-X, and Q Exactive Plus). TMT11 and TMT10 labels share the same six nominal reporter ion masses ranging from 126 to 131 Da. The difference between TMT11 and TMT10 is the splitting of the 131 (last) channel into 131-N and 131-C. The analysis of data clustering shows that TMT10- and TMT11-labeled tumor profiles are relatively mixed with each other, which allowed us to analyze proteomic profiles obtained using these two labels as a single dataset. However, we observed very strong clustering of data by the model of mass spectrometer, which was even stronger than clustering by cancer type. The Orbitrap Fusion Lumos is a tribrid mass spectrometer that combines three mass analyzers: quadrupole, Orbitrap, and linear ion trap. The Q Exactive HF-X and Q Exactive Plus combine quadrupole and Orbitrap mass analyzers. However, there are some technical differences between them, e.g., the resolving power is up to 240,000 and 140,000 (FWHM) for the Q Exactive HF-X and Q Exactive Plus, respectively. We demonstrated that the datasets produced by the Orbitrap Fusion Lumos, Q Exactive HF-X, and Q Exactive Plus have different numbers of significant potential biomarkers (the Orbitrap Fusion Lumos platform gave a ~2-fold higher proportion of potential proteomic biomarkers than the Q Exactive Plus, Table 4). Currently, we do not know whether this difference is related to platform-specific data quality or to the biological properties of the tissues investigated with the respective platforms. For the same reasons, we cannot correctly compare the potential biomarker capacities of the TMT10 and TMT11 labels. However, we believe that this issue has to be investigated in detail in the future to enable high-quality comparative combinatorial studies of proteomic datasets.
Furthermore, the resolution of the proteomic platforms investigated here in terms of the number of items for which expression can be quantitatively assessed is ~3.6-fold lower than for the transcriptomic data obtained by RNA sequencing [15]. However, the percentage of potential cancer type-specific biomarkers was comparable between proteomic (21–58%, average 39%) and transcriptomic (7–53%, average 26%) data at the level of single gene products (Table 2 and Table 3). Similarly, the percentage of proteomic pathway-based biomarkers was also similar to the transcriptomic results: 22–66% (average 44%) and 8–65% (average 33%), respectively (Table 2 and Table 3).
However, the proportion of potential survival biomarkers was dramatically lower for the proteomic data, where statistically significant potential biomarkers were found only in four of eight cancers (50%) versus 13 of 21 (62%) for the transcriptomic data, and their percentage was only up to 2.3% versus 36.3% in transcriptomic data (Figure 6 and Figure 7). For example, only six individual proteins and no molecular pathways were associated with overall survival in lung squamous cell carcinoma while no individual genes, 17 gene-centric pathways, and 154 classical pathways were associated at the transcriptomic level. At the same time, we could find survival biomarkers of pancreatic cancer only at the proteomic level (Figure 6 and Figure 7).
We used the same statistical criteria for both transcriptomic and proteomic data. However, despite the similar tumor stage distributions, the CPTAC and TCGA cohorts may differ significantly in treatment. The therapy used is not completely described, and standard treatment protocols may not be the same because the time gap between sample collections is about 10 years. This factor may affect the survival analysis results.
This study used protein abundance data that correspond to the gene level. However, each gene may have multiple proteoforms due to alternative splicing and posttranslational modifications (PTMs). The presence of various proteoforms can have a significant impact on the potential use of a protein as a biomarker. To assess data complexity, we tested the kidney cancer phosphoproteomic CPTAC dataset PDC000128 using the COPF approach [64]. COPF is a data-driven method that detects groups of highly correlated peptides in bottom-up proteomic datasets. Such groups can, but do not have to, represent unique, specific proteoforms. We found that 485 out of 4689 proteins (10.3%) have highly correlated groups of phosphopeptides (p-adjusted < 0.1). Moreover, to assess potential proteoforms, we need information about other PTMs for the same samples, which can substantially increase the number of proteins with highly correlated groups of peptides. Furthermore, methods for the detection of proteoforms in bottom-up proteomics should be developed and validated for different PTMs. An analytical approach for bottom-up proteomics can be used to assess potential proteoform groups; however, top-down data are needed to detect specific proteoforms. We believe that with further accumulation of data on posttranslational modifications for a larger number of samples and cancer types, our biomarker assay should also be repeated at the level of different proteoforms.
In our study, the gene-centric pathways could identify cancer types better than their corresponding central genes (Figure 3). For some cancer types, they also provided a larger proportion of potential biomarkers than classical pathways, yet no clear overall trend could be identified.
On the other hand, in the case of potential survival biomarkers (Figure 9B,D and Figure 10B,D,F), pathways of either type did not show a marked advantage over single genes (Figure 8). In terms of the percentage of successful potential biomarkers, single genes were the best category in six cancer types, whereas gene-centric and classical pathways each ranked first in eight cancer types.
We also speculate here that our approach can be employed not only to screen for cancer type or survival biomarkers but also to identify new therapeutic response biomarkers or tumorigenesis-associated gene networks. Overall, we found that the percentage of high-quality potential biomarkers was statistically significantly higher among the molecular pathways, both gene-centric and classical, than among individual genes. In turn, both types of pathways showed comparable performance. Thus, we conclude that pathway activation level is an advanced, new-generation type of cancer biomarker.
The potential biomarkers identified here may be of interest for molecular cancer research. By analyzing pathway activities, we can gain deeper insights into the pathophysiology of specific cancer types and unravel complex molecular networks that drive tumorigenesis. Moreover, the identification of new algorithmically constructed pathways with clinical relevance may enhance the search for novel drug targets and the development of more effective therapeutic interventions.
Furthermore, we believe that such rapid algorithmic construction of pathways is a credible new alternative to time-consuming, hypothesis-driven manual reconstruction of pathways and may replace it in the near future.