To survey the dosage effects of CNVs at the single-cell level, we constructed a computational framework to examine gene expression and copy number variation. By applying the computational framework to two independent metastatic TNBC single-cell transcriptome datasets, we identified the common functional modules for further functional and clinical evaluation.
2.1. The Computational Framework to Characterize the Concordant Copy Number Variation and Gene Expression Changes at the Single-Cell Level
To reduce cost, many studies infer CNVs directly from single-cell RNA sequencing (scRNASeq) expression data; such CNV calls are not independent of the expression data and are therefore unsuitable for cross-validation. In principle, our framework requires that the input CNV data and the single-cell transcriptomes be derived from different technological platforms and be paired at the patient level. For example, bulk DNA sequencing of a tumor sample from one patient can supply the CNV calls, while scRNASeq is applied independently to thousands of cells from the same sample.
As shown in Figure 1, our pipeline started with a compilation of matched CNV data from DNA sequencing and expression data from single-cell RNA sequencing. Next, we transformed the cell-based RNA expression profile into Z-scores, which describe whether a gene is expressed at a relatively higher or lower level in a particular cell than in the other cells. Because a single cell contains little RNA, scRNAseq data often contain many dropouts, in which no short reads are detected for a gene in a cell. To prevent meaningless Z-scores, we masked the entries without expression values. To extract meaningful Z-scores for mapping to the CNV data, we required an absolute Z-score over 1.96, which corresponds to a two-sided p-value below 0.05. Then, the significant Z-score for each gene in a particular cell was mapped to the CNV data from the same patient. Concordant copy number events were screened by checking whether a copy number gain/loss coincided with a consistent gene up/downregulation in the cell. By counting how many consistent CNV-expression events occurred across all the cells in the input dataset, we further inferred a list of genes with recurrent CNV-based dosage changes. By incorporating known gene–gene interactions and functional similarity, the framework identified functional modules for those recurrent CNV-changed genes detected in hundreds of cells regardless of their cell types.
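The Z-score, masking, and concordance steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `+1/-1/0` encoding of CNV state, the `CNL-DOWN` label, and the choice of the population standard deviation computed over all cells (zeros included) are assumptions made for the example.

```python
from statistics import mean, pstdev

Z_CUTOFF = 1.96  # |Z| > 1.96 ~ two-sided p < 0.05 under a standard normal


def masked_zscores(expr_by_cell):
    """Z-score one gene across all cells, then mask dropouts (zeros) as None.

    Assumption: mean and SD are taken over all cells, zeros included,
    and the population SD is used.
    """
    mu = mean(expr_by_cell)
    sd = pstdev(expr_by_cell)
    if sd == 0:  # gene is constant across cells: no informative Z-score
        return [None] * len(expr_by_cell)
    return [None if x == 0 else (x - mu) / sd for x in expr_by_cell]


def concordant_event(z, cnv_state):
    """Label a gene-cell pair; cnv_state is +1 (gain), -1 (loss), 0 (neutral).

    The encoding and the "CNL-DOWN" label are hypothetical.
    """
    if z is None or cnv_state == 0 or abs(z) <= Z_CUTOFF:
        return None
    if z > 0 and cnv_state > 0:
        return "CNG-UP"    # copy number gain with upregulation
    if z < 0 and cnv_state < 0:
        return "CNL-DOWN"  # copy number loss with downregulation
    return None            # significant but discordant


# Toy gene: one expressing cell among ten, in a patient carrying a gain.
z = masked_zscores([0, 0, 0, 0, 0, 0, 0, 0, 0, 50.0])
events = [concordant_event(v, +1) for v in z]
```

In this toy case only the expressing cell yields an event; the nine dropout cells are masked before the cutoff is applied.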
In summary, the input for our framework was the independent CNV and scRNAseq data from the same patients. The output was a connected functional module that comprised the top mutated genes based on consistent CNV and expression changes. To further reduce the false positives of functional modules inferred from our computational pipeline, we suggest validating those functional modules in another independent dataset. Once the top prioritized functional modules and genes are confirmed, any routine computational or experimental design can be adopted to further explore the significance. For instance, our final step of this study was to take advantage of large-scale cancer genomics data to evaluate whether those top-ranked genes were highly mutated and associated with patient survival.
2.2. The CNG-Driven Increased Gene Expression in Multiple TNBC Cell Types
To focus on TNBC and explore potential novel targets for cancer therapy, we applied this computational pipeline to the TNBC dataset GSE118389, which comprises six patients and an expression profile of 1533 cells. Instead of identifying cell types, we focused on the common mechanisms underlying copy number variants across all the cell types. To avoid bias toward longer genes, which accumulate more sequencing reads, we used transcripts per million (TPM) as the quantification method. Specifically, the sum of all TPMs within a cell is the same for each of the 1533 cells, which makes the proportion of short reads mapped to a gene comparable between cells.
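For readers unfamiliar with TPM, the normalization can be sketched as below. This is the standard definition of TPM rather than code from the study; the function name and the toy counts and gene lengths are illustrative assumptions.

```python
def tpm(counts, lengths_kb):
    """Convert one cell's read counts to transcripts per million.

    counts:     reads mapped to each gene in this cell
    lengths_kb: effective gene lengths in kilobases (same order as counts)
    """
    # Step 1: length-normalize so that long genes are not favored.
    rpk = [c / length for c, length in zip(counts, lengths_kb)]
    total = sum(rpk)
    if total == 0:  # empty cell: nothing to scale
        return [0.0] * len(counts)
    # Step 2: scale so that every cell sums to one million.
    return [r / total * 1e6 for r in rpk]


# Two toy genes: the second has twice the reads but is twice as long.
cell = tpm(counts=[10, 20], lengths_kb=[1.0, 2.0])
```

Because every non-empty cell sums to the same total, a gene's TPM can be read as its share of that cell's transcripts, which is what makes values comparable between cells.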
By starting from a TPM expression matrix with 21,785 genes × 1533 cells, we first generated the same size matrices with 33,396,405 Z-scores (21,785 rows and 1533 columns), which depicted the relative expression of a gene over the averaged expression rate across all the 1533 cells. A positive Z-score means the gene expression is relatively higher than the expressions from the remaining cells. By running through the described computational pipeline, we first masked 29,931,625 zero expression values. Based on the remaining 3,464,780 meaningful Z-scores, we mapped the CNV data from the whole exome sequencing in the same six patients. In total, 1,440,802 non-redundant CNV-expression events in 1245 cells were obtained.
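The bookkeeping in this paragraph can be checked with a few lines of arithmetic. The numbers are taken from the text; the variable names are only illustrative.

```python
genes, cells = 21_785, 1_533

total_z = genes * cells        # size of the Z-score matrix
masked = 29_931_625            # zero expression values that were masked
kept = 3_464_780               # remaining meaningful Z-scores

# The masked and retained entries should partition the full matrix.
assert total_z == 33_396_405
assert masked + kept == total_z

# Fraction of gene-cell entries lost to dropout (roughly nine in ten).
dropout_fraction = masked / total_z
```

The high dropout fraction illustrates why masking is needed before Z-scores are interpreted.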
However, most CNV-expression events (1,351,278) may not be informative because their expression variations are not statistically meaningful (i.e., their absolute Z-score is below 1.96). In contrast, we defined 89,524 meaningful CNV-expression events as consistent events with an absolute Z-score over 1.96 (Table S1). Moreover, some consistent CNV-expression events occurred in only a small number of cells. After collecting all the cells with gains or losses of gene copies, we set a threshold of 100 or more cells to prioritize the most informative copy number variation events. Notably, we focused only on genes with consistent copy number gain and gene upregulation (CNG-UP) because the majority of the Z-scores for downregulation were higher than −1.96 and therefore did not pass the cutoff. Finally, these step-by-step filters identified a total of 47,514 CNG-UP events associated with 94 genes across 1145 cells. It is worth noting that these patterns recur across patients, which supports their reliability and may empower scRNASeq-driven clinical decisions in the future.
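The recurrence filter described above amounts to counting, for each gene, the number of distinct cells with a CNG-UP event and keeping genes that reach the threshold. A minimal sketch follows, assuming events are available as `(gene, cell_id)` pairs; the function name, the input layout, and the toy cell identifiers are hypothetical.

```python
from collections import defaultdict


def recurrent_genes(events, min_cells=100):
    """Return {gene: n_cells} for genes with CNG-UP events in >= min_cells cells.

    events: iterable of (gene, cell_id) pairs, one per CNG-UP event.
    Duplicate pairs are counted once because cells are stored in a set.
    """
    cells_per_gene = defaultdict(set)
    for gene, cell_id in events:
        cells_per_gene[gene].add(cell_id)
    return {
        gene: len(cell_ids)
        for gene, cell_ids in cells_per_gene.items()
        if len(cell_ids) >= min_cells
    }


# Toy input with a lowered threshold so the filter is visible.
toy_events = [
    ("RPL8", "c1"), ("RPL8", "c2"), ("RPL8", "c2"), ("RPL8", "c3"),
    ("SRP9", "c1"),
]
kept = recurrent_genes(toy_events, min_cells=3)
```

With `min_cells=100` the same function corresponds to the threshold used for the primary dataset; the validation dataset, being much smaller, would need a lower value.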
To obtain a functional overview of those 94 top-ranked genes with amplifications in hundreds of cells, we performed a comprehensive functional analysis of gene ontology (GO) terms and well-annotated biological pathways. As shown in Figure 2A, the most significantly enriched functional term was cotranslational protein targeting to membranes, associated with 20 genes (corrected p-value = 10 × 10^−25.133), which may also overlap with regulated exocytosis (corrected p-value = 10 × 10^−5.709), protein folding (corrected p-value = 10 × 10^−25.133), and regulation of translation (corrected p-value = 10 × 10^−4.847). Those genes may also have roles in many fundamental cellular processes such as ER-phagosome (corrected p-value = 10 × 10^−7.659), spliceosome (corrected p-value = 10 × 10^−4.053), and glycolysis and gluconeogenesis (corrected p-value = 10 × 10^−4.806). In addition, the genes take part in multiple signaling pathways associated with VEGFA-VEGFR2 (corrected p-value = 10 × 10^−9.642) and Rho GTPase (corrected p-value = 10 × 10^−7.052). Some of the genes may also influence negative regulation of cell differentiation (corrected p-value = 10 × 10^−3.409), purine ribonucleoside triphosphate biosynthetic processes (corrected p-value = 10 × 10^−7.002), and reproduction (corrected p-value = 10 × 10^−3.920). In summary, this quick functional enrichment analysis provides a novel insight into the association between TNBC and protein synthesis, folding, and sorting at the single-cell level.
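The text does not state which statistical test produced these enrichment p-values. As an illustration only, over-representation of a functional term in a gene list is commonly assessed with a one-sided hypergeometric test, sketched below; the function name and the toy numbers are assumptions, and the multiple-testing correction is omitted.

```python
from math import comb


def hypergeom_enrichment_p(k, n_background, n_annotated, n_selected):
    """P(X >= k) for a hypergeometric draw.

    k:            selected genes that carry the annotation
    n_background: all genes considered
    n_annotated:  background genes that carry the annotation
    n_selected:   size of the selected gene list
    """
    upper = min(n_annotated, n_selected)
    favorable = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_selected - i)
        for i in range(k, upper + 1)
    )
    return favorable / comb(n_background, n_selected)


# Toy case: 5 of 10 background genes are annotated, and all 5 selected
# genes carry the annotation. Only one of comb(10, 5) = 252 draws does so.
p_toy = hypergeom_enrichment_p(k=5, n_background=10, n_annotated=5, n_selected=5)
```

Applying this to the 94-gene list would additionally require the number of background genes annotated with each term, which is not given in the text.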
By mapping those genes to the human interactome, we further identified the functional modules. In practice, the molecular complex detection (MCODE) algorithm [7] was applied to cluster those GO-associated genes into five distinct modules (Table S2). For example, module 1 is a cluster of all the genes from the cotranslational protein targeting to membranes (GO:0006613) and SRP-dependent cotranslational protein targeting to membranes (GO:0006614). Similarly, another four functional modules are represented by purine ribonucleoside triphosphate biosynthetic processes, SCF(Skp2)-mediated degradation of p27/p21, apoptosis, and neutrophil degranulation (Figure 2B). Although these modules confirmed the functional enrichment result, the highly connected complex module also helped us to focus on the most important genes in TNBC development.
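MCODE itself involves vertex weighting and seed-based cluster expansion and is not reproduced here. The sketch below shows only the k-core idea that underlies its notion of a densely connected region: repeatedly removing nodes with fewer than k neighbors until none remain. The function name and the toy network are hypothetical, and this is a simplification rather than the algorithm cited above.

```python
def k_core(adjacency, k):
    """Return the set of nodes in the k-core of an undirected graph.

    adjacency: dict mapping each node to the set of its neighbors
               (must be symmetric: if b is in adjacency[a], a is in adjacency[b]).
    """
    alive = set(adjacency)
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            degree = sum(1 for nb in adjacency[node] if nb in alive)
            if degree < k:
                alive.remove(node)  # peel off weakly connected nodes
                changed = True
    return alive


# Toy interaction network: a triangle (A, B, C) with one pendant node D.
toy_network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
dense_part = k_core(toy_network, k=2)
```

In the toy network the pendant node is removed and the triangle survives, which is the sense in which a module is "highly connected".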
To validate our finding in another independent TNBC dataset (GSE75688), we focused on a single-cell transcriptome from five TNBC patients with 130 cells. By integrating 20,651 CNG events and 7,528,950 Z-scores for all the 57,915 genes, we found 705,873 CNV-expression events based on the same cells and patients. By further narrowing down the genes of interest based on Z-score (>1.96), we obtained a total of 86 genes with CNG-UP events in six or more cells. These consistent CNG-UP genes were mapped to gene–gene interactome data and were used to identify three functional modules. Notably, we confirmed the top-ranked functional module, which was SRP-dependent cotranslational protein targeting to membranes (Figure 2C,D). By overlapping the genes related to protein targeting that were identified from both datasets, we found 33 genes related to SRP-dependent cotranslational protein targeting to membranes (7 in common, Figure 2E). As expected, those 33 genes are highly enriched in ribosome protein functions (29 genes associated, Figure 2F). In addition, 31 of the 33 genes are related to protein translation (corrected p-value = 10 × 10^−56.43) and seven genes are related to the VEGFA-VEGFR2 (vascular endothelial growth factor A–vascular endothelial growth factor receptor 2) signaling pathway (corrected p-value = 10 × 10^−3.76). More importantly, these 33 genes are connected to each other and might form a strongly connected complex. In summary, cross-validation based on an independent technology and an independent TNBC cohort helped us identify the important concordant copy number gain and upregulation events at the single-cell level. These consistent results from thousands of single cells also support the reliability of the top-ranked ribosome protein modules.
2.3. The Mutational and Survival Analysis on 6688 Breast Cancer Samples in 15 Studies
By applying our computational framework, we identified several important somatic CNV features at the single-cell level, particularly with respect to the ribosome proteins and their effects on mRNA translation and sorting. We hypothesized that amplification-induced high expression of ribosome proteins may favor breast cancers that are invasive and prone to metastasis. We therefore asked whether those 33 ribosome complex genes are frequently mutated in breast cancer and metastatic TNBC. As shown in Figure S1, we utilized a public cancer genomic resource and combined 6688 breast cancer samples with genetic mutation data from 12 independent studies. Since these 33 genes formed a strongly connected module, we organized the mutational frequency by interaction map (Figure 3A). Generally, the ribosome proteins with higher mutational frequency tended to have more connections in the network. Of the top ten mutated genes, nine were also connected with more genes, including MRPL13 (mitochondrial ribosomal protein L13, 17%), SRP9 (signal recognition particle 9, 15%), PABPC1 (poly(A) binding protein cytoplasmic 1, 15%), RPL8 (ribosomal protein L8, 15%), RPL30 (ribosomal protein L30, 13%), SSR2 (signal sequence receptor subunit 2, 13%), RPS27 (ribosomal protein S27, 13%), RBM8A (RNA-binding protein 8A, 11%), and RPL7 (ribosomal protein L7, 11%). In fact, topological features, such as the number of connections, have been found to be associated with the mutational rates of cancer driver genes [8].
Among the individual datasets and breast cancer subtypes (Figure 3B), three metastatic breast cancer cohorts were highly mutated (over 50% of patients in the corresponding cohort). In contrast, another seven cohorts, which showed markedly lower mutational frequency, were not metastatic. Two datasets fell in between: TCGA invasive carcinoma and adenoid cystic carcinoma of the breast. These results may imply that the 33 genes related to cotranslational protein targeting to membranes are highly mutated in metastatic breast cancers but not in non-metastatic cancers.
More importantly, survival curves based on those 33 CNG-UP genes showed a significant difference among the combined 4821 breast cancer patients. In the overall survival analysis, patients were segregated into an “altered group” (red line) and an “unaltered group” (blue line). The 1625 patients with genetic mutations in these 33 genes had a median survival of 145.43 months, whereas patients without any such mutations had a median survival of 175.30 months. The log-rank test p-value was 8.94 × 10^−7, which was corrected to a Q-value of 4.24 × 10^−6. Besides the overall survival result, another four survival analyses further confirmed the statistical difference between the two groups, including relapse-free survival (Q-value = 5.26 × 10^−4), disease-specific survival (Q-value = 0.0416), disease-free survival (Q-value = 0.0488), and progression-free survival (Q-value = 0.0488). In summary, these results support the importance of the 33 genes for cancer metastasis and patient prognosis.
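The text reports Q-values without naming the correction method. One widely used procedure for turning a set of p-values into Q-values is the Benjamini–Hochberg adjustment, sketched here as an illustration; whether this is the method behind the values above is an assumption, and the input p-values are toy numbers rather than results from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy p-values for four hypothetical tests.
q_toy = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

The monotonicity step means that neighboring tests can receive identical adjusted values, as happens for the pairs in this toy input.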
Strikingly, these mutations were also associated with other important clinical features (Figure 4), including race category, diagnosis age, histological grade, tumor stage, aneuploidy score, hypoxia score, and chemotherapy treatment. For example, there were relatively fewer Asian patients than White patients in the group with mutations (Figure 4A). For histological grade and tumor stage, patients with mutations tended to be in the higher grades (Figure 4B) or later stages (Figure 4C). Chemotherapy was applied more often to patients without any mutations in these 33 genes (Figure 4D). In terms of diagnosis age, patients with mutations in these 33 genes tended to be diagnosed later (Figure 4E), which might explain their higher grades and later stages. The presence of mutations was also indicative of more aggressive disease in terms of aneuploidy (Figure 4F) and hypoxia (Figure 4G). This further mutational analysis expanded our understanding of the clinical features associated with the 33 ribosome proteins identified by our computational framework.
2.4. Intratumor Heterogeneity and Its Relationship with Key Cell States at the Single-Cell Level
According to the gene dosage hypothesis, gene copy number correlates positively with mRNA expression, and the resulting differences in protein abundance between cells lead to a higher level of heterogeneity. Generally, scRNASeq is used to characterize intratumor heterogeneity, that is, to identify subpopulations among cells that appear superficially homogeneous. Measures of intratumor heterogeneity can also be used to detect changes in cell states and their subsequent impact on heterogeneity. For example, the cell stemness state is one of the key attributes, comprising self-renewal, cell differentiation, and resistance to chemotherapy treatment.
As shown in Figure 5A, we focused only on the six patients from the primary dataset GSE118389. For each patient, we calculated two heterogeneity indices: the Shannon–Wiener index (Figure 5A) and the Simpson index (Figure S2). These two indices are classic diversity indices in ecology that depict alpha diversity, which represents the species richness in a plot. In detail, the Shannon–Wiener index is a measure of diversity that combines species richness and relative abundance in a plot, while the Simpson index is more about the dominance of species, as it accounts for the proportion of each species in a community. Here, both indices were used to characterize how the expressed genes (analogous to species) were distributed within a cell (analogous to an ecological plot). In this way, each cell received an expression richness index that reflects its gene expression variation. By pooling all the cells from a patient, we obtained an overall heterogeneity profile of gene expression. As shown in Figure 5A, patient PT126 had the lowest Shannon–Wiener index, which means fewer genes were expressed, and those genes had lower expression levels, in the cells from PT126. In Figure 5B, we describe the general trend between the two indices. The two indices were highly correlated with each other, which confirmed the expression variations in patients.
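Applied to a single cell, the two indices can be sketched as follows, treating each expressed gene as a species and its expression value as its abundance. The function names are illustrative, the natural logarithm is an assumption, and the Simpson index is given in its dominance form (the sum of squared proportions); the text does not specify which variant was used.

```python
from math import log


def _proportions(expression):
    """Relative abundance of each expressed gene in one cell."""
    total = sum(expression)
    if total <= 0:
        raise ValueError("cell has no expressed genes")
    return [value / total for value in expression if value > 0]


def shannon_index(expression):
    """Shannon-Wiener index H = -sum(p * ln p); higher means more diverse."""
    return -sum(p * log(p) for p in _proportions(expression))


def simpson_index(expression):
    """Simpson dominance D = sum(p^2); higher means fewer genes dominate."""
    return sum(p * p for p in _proportions(expression))


# Toy cells: one with four equally expressed genes, one dominated by a single gene.
even_cell = [5.0, 5.0, 5.0, 5.0, 0.0]
skewed_cell = [97.0, 1.0, 1.0, 1.0, 0.0]
h_even, d_even = shannon_index(even_cell), simpson_index(even_cell)
h_skew, d_skew = shannon_index(skewed_cell), simpson_index(skewed_cell)
```

The evenly expressed cell has the higher Shannon–Wiener index and the lower dominance, which illustrates why the two indices move together (in opposite directions when dominance is used, or in the same direction when it is reported as 1 − D).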
In addition to applying the two classic diversity indices to our dataset, we also used the t-distributed stochastic neighbor embedding (tSNE) method to visualize the variations across all the cells. Based on well-studied marker genes, we further defined five cell states that could be used to describe the cellular microenvironment and explain the intratumor variation. Specifically, the five cell states were breast cancer stemness, pluripotency, differentiation, proliferation, and epithelial–mesenchymal transition (EMT)/metastasis. For example, we combined the expression of CD44 (cluster of differentiation gene 44), ITGA6 (integrin subunit alpha 6), DNER (delta/notch like EGF repeat containing), ALDH1A3 (aldehyde dehydrogenase 1 family member A3), and ABCG2 (ATP binding cassette subfamily G member 2) to characterize breast cancer stemness. Similarly, we combined all of the 33 genes to describe the overall SRP-dependent cotranslational protein targeting function in all the cells (SRP module). By checking the relationship between our ribosome proteins and the five cell states (Figure 5C), we found that those ribosome proteins were positively associated with cell differentiation, stemness, and EMT/metastasis. However, we also observed large differences at the patient level (Figure S3). For example, the SRP module is not well correlated with the other cell states in PT126 because of a lack of sufficient expression data. Together, these findings may imply that ribosome protein-based signatures can be useful for predicting cell stemness and differentiation states, which are important for cancer metastasis or EMT.
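The comparison between the SRP module and a cell state can be sketched as a per-cell score followed by a correlation across cells. The text says only that marker expression was "combined", so the use of a simple mean and of the Pearson coefficient here are assumptions, as are the function names and the toy expression values; the marker names are taken from the text, and only a few of the 33 module genes are used for brevity.

```python
from math import sqrt
from statistics import mean

STEMNESS_MARKERS = ["CD44", "ITGA6", "DNER", "ALDH1A3", "ABCG2"]
SRP_SUBSET = ["RPL8", "RPL30", "RPS27", "RPL7", "SRP9"]  # subset of the 33 genes


def module_score(cell_expression, genes):
    """Mean expression of the module genes measured in this cell (assumed rule)."""
    values = [cell_expression[g] for g in genes if g in cell_expression]
    return mean(values) if values else 0.0


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Three toy cells in which stemness markers rise with ribosomal genes.
toy_cells = [
    {"CD44": 1.0, "ITGA6": 1.0, "RPL8": 2.0, "RPL30": 2.0},
    {"CD44": 2.0, "ITGA6": 2.0, "RPL8": 4.0, "RPL30": 4.0},
    {"CD44": 3.0, "ITGA6": 3.0, "RPL8": 6.0, "RPL30": 6.0},
]
stemness = [module_score(c, STEMNESS_MARKERS) for c in toy_cells]
srp = [module_score(c, SRP_SUBSET) for c in toy_cells]
r = pearson(stemness, srp)
```

A patient with very few expressing cells would yield constant or near-constant scores, in which case the correlation is undefined or unstable; this is consistent with the weak associations reported for PT126.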