1. Introduction
Cancer represents a substantial global health concern, with the World Health Organization reporting 19.3 million new cases and over 10.0 million cancer-related deaths worldwide in 2020. Existing studies highlight the dysregulation of gene expression and genetic mutations as fundamental contributors to carcinogenesis, often stemming from aberrant gene expressions or somatic mutations during cell division [
1]. Contemporary cancer research has unveiled long noncoding RNAs (lncRNAs), a crucial class of biomolecules, as a novel avenue for understanding the developmental intricacies of cancer [
2]. Furthermore, emerging evidence underscores enhancer RNAs (eRNAs) as significant contributors to genetic regulation, exhibiting substantial potential within the realm of oncology [
3].
As a subclass of non-coding RNA molecules, enhancer RNAs (eRNAs) are transcribed from DNA sequences located within the enhancer regions, pivotal components in the regulation of gene expression. Characterized by an average length of 5000 nucleotides, eRNAs exhibit distinct categorization based on polyadenylation and transcriptional directionality [
4]. Significantly, the expression levels of eRNAs correlate with the activity of their associated enhancers in gene targets, emphasizing their integral role in transcriptional regulation [
5]. The observed link between eRNA expression and enhancer function suggests potential implications in tumorigenesis. Numerous studies have underscored the involvement of eRNAs in governing gene expression across both healthy and cancerous cells [
6,
7,
8]. Furthermore, mutations occurring in enhancer regions often trigger aberrant eRNA transcription, implicating eRNAs in the initiation and progression of cancer [
5]. Consequently, eRNAs not only show promise as biological markers for diagnosis and prognosis in oncology, but they also present a potential target for therapeutic intervention.
Substantial efforts have been exerted in the curation of eRNAs, including systematic endeavors to identify, annotate, and catalog eRNAs based on experimental evidence or computational predictions. Several databases and resources have been developed to compile and organize information regarding eRNAs. Notable examples include Enhancer Atlas v2.0 [
9], HACER [
10], eRNAbase [
11], ENdb [
12], etc. These databases incorporate diverse experimental techniques such as RNA sequencing (RNA-seq), cap analysis of gene expression sequencing (CAGE-seq), precision nucleotide run-on sequencing (PRO-seq), and chromatin-based assays to identify and characterize eRNAs. Continuous efforts are made to update and expand these resources, reflecting the evolving understanding of eRNA biology in different cellular contexts.
eRNA can be categorized into two subclasses, unidirectional and bidirectional, according to their transcriptional directionality. The terms “unidirectional” and “bidirectional” refer to different modes of transcription from the enhancer regions, indicating unidirectional and bidirectional transcription, respectively. There are key differences between unidirectional and bidirectional eRNAs. Unidirectional transcription generates long eRNAs (>4 kb) from enhancer regions in a single direction [
13]. Bidirectional transcription generates eRNAs in two opposing directions from the same enhancer region, and the length of bidirectional eRNAs is usually shorter (<2 kb) [
14]. Furthermore, unidirectional eRNAs usually generates poly(A) tails and exhibit a lower H3K4me/me3 ratio in their chromatin signature than do bidirectional eRNAs. In comparison, bidirectional eRNAs have a higher H3K4me/me3 ratio than do unidirectional eRNAs. Unfortunately, the annotation of eRNA directionality is lacking in the current databases. None of the aforementioned eRNA databases provide the precise eRNA directionality annotation, making it difficult to distinguish between unidirectional and bidirectional eRNAs. One solution is to assume that the majority of the eRNAs in the database are bidirectional eRNAs; such a strategy was used by Zhang et al. to study eRNA-targeted therapy in cancer [
6]. In this study, we choose to initially treat all eRNAs as bidirectional. Further separation of unidirectional and bidirectional eRNA was achieved by length selection.
Another characteristic of eRNA is their tissue specificity. The expression of eRNAs is often context-dependent, influenced by the specific cellular environment, developmental stage, and physiological conditions [
15]. Certain eRNAs may exhibit tissue-specific expression patterns. The majority of the current eRNA databases curate eRNA by tissue type. Thus, our analyses were also performed by tissue type.
Mutations represent an established causal factor in the onset of various cancers, manifesting a spectrum of deleterious consequences. Adhering to the foundational principles of the central dogma of molecular biology, it is anticipated that mutations exert their influence on transcriptomes, subsequently influencing proteins, the pivotal entities orchestrating biological processes and pathways. Consequently, the identification and characterization of somatic motifs, representing binding sequences, either suppressed or induced by somatic mutations, are imperative in advancing cancer research
Mutations represent an established causal factor in the onset of various cancers, manifesting a spectrum of deleterious consequences. Adhering to the foundational principles of the central dogma of molecular biology, it is anticipated that mutations exert their influence on protein coding gene expression [
16,
17], and micro RNA expression [
18], subsequently influencing proteins, the pivotal entities orchestrating biological processes and pathways. Consequently, the identification and characterization of somatic motifs, representing binding sequences, either suppressed or induced by somatic mutations, are imperative in advancing cancer research [
19]. In a prior investigation [
20], it was demonstrated that the oncogenic E26 transformation-specific (ETS) factor exhibits binding affinity with a cryptic binding site induced by a prominent somatic mutation within the TERT promoter region. This specific mutation initiates the formation of a novel binding sequence specifically recognized by ETS proteins. Subsequently, the upregulation of TERT expression follows, leading to uncontrolled cellular proliferation and ultimately contributing to the development of cancer. Theoretically, the same somatic motif concept is applicable to eRNAs, implying that mutations within eRNAs can impact their expression levels and modify their regulatory mechanisms. Indeed, empirical evidence supports this notion, as mutated enhancers have been shown to diminish the binding potential of the transcription factors POU2F and YY1 [
21].
Multiple eRNA reviews have pointed out the potential importance of eRNA. Ding et al. pointed out that pathogenic mutations in eRNA are often overlooked [
22]. In an opinion published in
Nature Review Genetics, the author stated that extending the concept of mutation burden and aggregate analysis employed in exome sequencing to regulatory elements poses a significant challenge. Nevertheless, it is plausible that the genetic landscape of numerous common diseases encompasses diverse regulatory mutations scattered across multiple enhancers within an individual [
23]. Given that limited attention has been devoted to the study of mutations within eRNAs, we have therefore formulated a comprehensive investigation aimed at elucidating mutation density patterns within eRNAs. Mutation density has been a hypothesized, albeit underexplored, area of study, depicting regions of the genome where mutations cluster, acting as guides for the identification of oncogenes and the decoding of the complex mechanisms involved in tumorigenesis. Previous investigations have highlighted discernible imbalances in mutational density between sense and antisense RNA strands proximal to DNA replication origins and transcription start sites (TSSs) in protein-coding RNAs [
24]. These observations suggest that mutational density patterns could serve as a historical record of tumorigenesis. Moreover, recent research delving into lncRNAs has unveiled similar mutation density patterns around the TSSs between lncRNAs and protein-coding RNAs. Furthermore, it was shown that the quantified transcriptional strand bias holds prognostic significance, suggesting that a stronger DNA repair mechanism may negatively affect chemotherapy effectiveness [
25]. Given the growing recognition of the interplay between non-coding RNAs and cancer, we propose a comprehensive examination of mutational density patterns in eRNAs to ascertain whether they mirror or differ from those observed in protein-coding RNAs.
3. Discussion
eRNAs play a pivotal role in the intricate regulation of gene expression, possessing substantial significance in diverse biological processes. Although considerable progress has been made in evaluating eRNA functions, many aspects of their biology continue to evade complete understanding. These underexplored areas remain inadequately researched in current scientific literature, making them active subjects for ongoing investigation. Therefore, this study was designed to systematically examine mutation density patterns of eRNAs across multiple types of cancer.
Our results reveal several novel aspects of eRNA in relation to mutations. Previous studies have observed mutation density transcriptional strand bias in protein-coding RNAs [
24] and lncRNAs [
25]. This bias occurs due to the action of the transcription-coupled DNA repair mechanism, which preferentially repairs the template strand, commencing from the transcription start site (TSS). This action mitigates the disruptions caused by DNA lesions on the RNA polymerase, resulting in a diminished mutation density on the template strand compared to the coding strand.
Interestingly, we discovered a similar mutation strand bias within eRNAs. However, in contrast to protein-coding RNAs and lncRNAs, eRNAs are bidirectionally transcribed by RNA polymerase II [
31]. Furthermore, rarely do any eRNA databases annotate eRNA by their transcriptional directionality. Whether or not the traditional transcription-coupled DNA repair mechanism works on eRNAs is inconclusive. However, the broader context of DNA repair mechanisms, including nucleotide excision repair (NER), can still play a role in maintaining genomic integrity, including at the enhancer regions. Mutation strand bias can also occur due to preferential expression on a specific strand of eRNA. The level of enhancer activity on one strand may differ from that of the other, influencing the initiation and extent of bidirectional transcription. The presence of cis-regulatory elements within the enhancer region may contribute to strand-specific differences in transcription. Regardless, the precise mechanisms and regulatory pathways involved in DNA repair at the enhancer regions, especially during eRNA transcription, are areas of ongoing research, and the understanding of these processes is evolving.
In an analysis of all-tissue consensus eRNAs, it was observed almost uniformly across all cancer types that a dip in mutation density existed in the center of the eRNA sequence. This dip was sharply juxtaposed by peaks of mutation density occurring at both the beginning and end of the eRNAs. This pattern persists somewhat when using tissue-specific eRNAs matching the respective cancer types, albeit with diminished prominence and often lacking in statistical significance. Despite the potential introduction of tissue-specific bias or noise through the utilization of all-tissue consensus eRNAs, the discernible and consistent patterns remain interpretable as indicative of an unknown biological mechanism. In the analysis of protein-coding RNA [
24] and lncRNA [
25], distinctive mutation peaks were observed around the transcription start site (TSS), attributed to the mechanism of transcription-coupled repair. However, in the case of eRNAs characterized by bidirectional transcription with the TSS positioned in the middle, the mutation density pattern around eRNA TSS diverges from that observed in protein-coding RNAs and lncRNAs. Notably, the underlying mechanism driving this pattern is distinct and cannot be attributed to transcription-coupled repair.
The exact mechanisms at play within the context of eRNAs remain enigmatic; however, several contributing processes can be hypothesized. Firstly, it has been shown that eRNA activation are through binding to TF [
6,
32]. Such functional constraint within the middle region of eRNA could be instrumental in maintaining enhancer activity. Mutations in this region might cause more significant consequences for the enhancer’s function, thereby exerting selective pressure against mutations. Secondly, the involvement of transcription-coupled repair or DNA repair mechanisms may play a role in mitigating the number of mutations subsequent to eRNA transcription, although this does not account for the observed mutation peak at the end of the eRNAs. Thirdly, the potential elevation of spontaneous mutations during transcriptional activity may arise due to the increased local exposure of DNA during the synthesis of RNA transcripts [
33].
The biggest limitation of this study may be related to the accuracy of eRNA annotation, which can, in part, be ascribed to the sensitivity of the underlying technology. Various methods have been employed for identifying eRNA, including the identification of the binding site of the transcription factor EP300 via ChIP-seq [
34], as well as the sequencing of nascent RNA through techniques such as GRO-seq [
35] or PRO-seq [
36], among others. Each of these techniques exhibits distinct sensitivities and offers unique advantages [
37]. For each unique tissue type, the Enhancer Atlas database [
9] combined eRNA data from multiple sources, based on multiple technologies. The results may not exhibit uniform representation across all tissue types, potentially introducing variability or noise into the results. Furthermore, none of the major eRNA databases annotate eRNAs by transcriptional directionality, making distinguishing unidirectional and bidirectional status difficult. In our study, we further divided the eRNA by size, hoping to reduce the level of noises caused by inaccurate annotation. Moreover, we employed rigorous statistical analyses and validated the findings through the examination of multiple cancer types, thereby enhancing the robustness and reliability of our conclusions.
Significant advancements have been made in recognizing the crucial role of eRNAs in the complex field of gene regulation. Nevertheless, the persistent uncertainties highlighted here emphasize the ongoing complexities of eRNA biology, underlining the need for continued research to clarify their exact roles in cellular processes and disease. Our findings not only substantiate the tissue-specificity of eRNAs in terms of mutation density, but also propose the existence of one or multiple underlying biological mechanisms orchestrating mutations in a non-random fashion within eRNAs.
4. Methods
4.1. Data Collection
Somatic mutation data from cancer patients spanning 80 cohorts representing 35 distinct cancer types were obtained from the International Cancer Genome Consortium (ICGC). The dataset was refined by excluding small indels and silent mutations, retaining only single base substitutions (SBS). From the initial 80 cohorts available from the ICGC, the top 50 cohorts were chosen based on mutation quantity ranking, as higher mutation density assessments are more reliable when the total number of mutations per cohort is substantial. eRNA is known to be tissue-specific. We downloaded 12 eRNA annotation files from Enhancer Atlas [
9]. These eRNA annotation files were matched with specific cancer types. We used BedTools [
38] to identify the pairwise overlapping proportion of eRNA between different tissue types. Due to the lack of annotation of unidirectional and bidirectional eRNA, we first conducted the analysis without distinguishing the directionality, then selected bidirectional eRNAs by size < 2000 bp.
4.2. Mutation Density Pattern and Strand Bias
Given the complementary nature of DNA, the 12 possible single nucleotide mutations are classified into six intuitive mutational categories: C > A (C > A & G > T), C > G (C > G & G > C), C > T (C > T & G > A), T > A (T > A & A > T), T > C (T > C & A > G), and T > G (T > G & A > C). Each mutational category incorporates two reciprocally complementary forms, as exemplified by the C > T and G > A pair. Mutation density analysis relies on a predetermined focal genomic feature. For this study, the focal feature of interest is eRNA TSS, an extension of our previous work in which we assessed transcription strand bias in close proximity to protein-coding RNAs TSS [
3].
In order to dissect the spatial patterns of mutation density near eRNA TSS, the counted mutations included those within the immediate flanking regions, both upstream and downstream, each spanning 2000 nucleotides. These bidirectional flanks were divided into 40 bins, each measuring 100 nucleotides in length. Mutations (singular genomic positions) identified from a specific cancer cohort were tallied within each sequential 100-bp bin in correlation to every occurrence of eRNA TSS. Further normalization took into consideration the G/C to A/T ratio in GRCh38, generating the mutation density evaluated as “mutations per kilo total mutations per megabase” (MPKM). The dense graphical representation of mutation densities for the two complementary strands is enabled using R command line scripting, underpinned by our previously developed R software, MutDens [
3].
MutDens offered two core statistical analyses—the detection of a mutation peak or dip in the vicinity of eRNA TSS, and the determination of strand bias. For the former, a background mutation density was constructed in a Poisson distribution using mutations distant from the focal genomic features. A comparison of the eRNA mutation density for each ICGC cohort against this background attempted to locate possible peaks or dips, corroborated by a nominal p < 1 × 10−5 in the Poisson test. Subsequently, to identify strand bias, a Wilcoxon signed-rank test was deployed. While MutDens supplied results for the upstream flank, downstream flank, and bidirectional flanks of TSS, this study only considered the TSS downstream, aiming to detect transcription strand bias, while maintaining the false discovery rate at 0.05 (pursuant to the Benjamini–Hochberg correction).
4.3. Mutation Density in Target Genomic Regions by Cancer Patient
Within each cancer cohort, ICGC released information that allowed for the delimiting of all of the personal SBS mutations of each cancer patient (a donor). Given all the characterized SBS mutations of a donor, we leveraged the R package GenomicRanges to count the mutations that overlapped with the eRNA spans, the target genomic regions. This overlapping-based total number of mutations per donor,
C, underwent three elements of normalization, resulting in the final mutation density value as “mutations per kilo total mutations per megabase” (MPKM), computed using the following formula:
First, the raw region-bearing mutation count C was divided by total size of the target genomic regions (M), assessed in mega bases, to become an interim value I1. In the current study, because the target regions comprised human enhancers, the normalization factor serving as the denominator equated to summed enhancer lengths in the unit of Mb. Secondly, considering that a specific SBS category can occur on only a specific type of substrate base, the interim value I1 was divided by a GC content-based adjusting factor (f), to become the next interim value I2. For C > A, C > G, and C > T SBSs, the adjusting factor is 0.2, while for T > A, T > C, and T > G, it is 0.3. Finally, the interim value I2 was divided by the total number of mutations in the whole genome of the donor (K), assessed in kilos. For example, the six categories of SBS mutations for donor DO46325, an Australian ovarian cancer patient, were tallied at 1342 (C > A), 1178 (C > G), 1770 (C > T), 912 (T > A), 1160 (T > C), and 503 (T > G), ranking C > T as the most significantly mutated category in the entire genome. Counting only the mutations overlapping the enhancer regions, the six SBS categories revealed 7 (C > A), 10 (C > G), 12 (C > T), 0 (T > A), 2 (T > C), and 5 (T > G) mutations, still ranking C > T as the most significantly mutated category in the enhancer regions.
The ovary enhancer regions collectively occupy 14,522,866 (14.5 Mb) nucleotides. Dividing the enhancer-bearing SBS totals by the total region size (14.5 Mb), the total mutational burdens (1.3 k, 1.2 k, 1.8 k, 0.9 k, 1.1 k, and 0.5 k), and the concordant adjusting factors (0.2 for the former three and 0.3 for the latter three), we ended up with the following mutation density MPKM values for the ovary enhancer regions of donor DO46325: 1.7 (C > A), 2.8 (C > G), 2.2 (C > T), 0 (T > A), 0.4 (T > C), and 2.3 (T > G). Of note, after normalizing the raw enhancer-bearing mutation count from the foresaid three aspects, C > G stood out as the most frequently mutated category in the enhancer regions, with the highest MPKM value derived as 2.8 = 10/(14.5 * 0.2 * 1.2). It is worth noting that the MPKM values assessed per SBS category (Equation (1)) capture the average mutation density of the two mutation forms within an SBS category, rather than summing the two mutation forms. The MPKM values resulting from this current strategy were comparable to the mutation form-based MPKM values proposed in MutDens.