Article

Structural and Functional Classification of G-Quadruplex Families within the Human Genome

1
School of Graduate and Interdisciplinary Studies, University of Louisville, Louisville, KY 40292, USA
2
Department of Neuroscience Training, University of Louisville, Louisville, KY 40292, USA
3
Kentucky IDeA Network of Biomedical Research Excellence (KY INBRE) Bioinformatics Core, University of Louisville, Louisville, KY 40292, USA
4
Department of Biochemistry and Molecular Genetics, University of Louisville, Louisville, KY 40292, USA
*
Author to whom correspondence should be addressed.
Genes 2023, 14(3), 645; https://doi.org/10.3390/genes14030645
Submission received: 13 February 2023 / Revised: 22 February 2023 / Accepted: 2 March 2023 / Published: 4 March 2023
(This article belongs to the Section Bioinformatics)

Abstract

G-quadruplexes (G4s) are short secondary DNA structures located throughout genomic DNA and transcribed RNA. Although G4 structures have been shown to form in vivo, no current search tools that examine these structures based on previously identified G-quadruplexes and filter them based on similar sequence, structure, and thermodynamic properties are known to exist. We present a framework for clustering G-quadruplex sequences into families using the CD-HIT, MeShClust, and DNACLUST methods along with a combination of Starcode and BLAST. Utilizing this framework to filter and annotate clusters, 95 families of G-quadruplex sequences were identified within the human genome. Profiles for each family were created using hidden Markov models to allow for the identification of additional family members and generate homology probability scores. The thermodynamic folding energy properties, functional annotation of genes associated with the sequences, scores from different prediction algorithms, and transcription factor binding motifs within a family were used to annotate and compare the diversity within and across clusters. The resulting set of G-quadruplex families can be used to further understand how different regions of the genome are regulated by factors targeting specific structures common to members of a specific cluster.

1. Introduction

G-quadruplexes are four-stranded secondary structures of guanine-rich nucleic acids, the most common form containing four runs of at least three guanines. These runs are separated by short loops, typically 2–7 nucleotides in length, and the sequence can potentially fold into an intramolecular G-quadruplex structure. The guanine tetrads, each held together by Hoogsteen base pairing, stack on top of one another and are connected by loops of mixed sequence, giving a four-stranded structure with the nucleobases on the inside and the sugar-phosphate backbone on the outside (Figure 1a,b). Metal ions (typically K+ or Na+) sitting between the Hoogsteen-paired bases stabilize the base pairing. The O-6 atoms of the stacked guanines face the center, creating a tubular space able to function as an ion channel. A metal cation in this channel can interact with the eight O-6 atoms of the adjacent guanine quartets. While three-tetrad structures are most frequently observed, two-tetrad structures have been experimentally shown to form G-quadruplexes as well [1].

1.1. Roles of G-quadruplexes

Over the past three decades, guanine-rich quadruplex sequences have been implicated as key structural regulators of gene expression, cellular differentiation, transcription factors, and their cell line and tissue specificity [2]. Similarly, elevated levels of G-quadruplexes have been identified across cancer tissues including breast [3], stomach [4], and liver cancer [5] as well as neurodegenerative diseases [6]. Computational analyses of G-quadruplex patterns have identified the prevalence of G-quadruplexes in oncogenic promoters, introns, splice sites, and intergenic and telomeric ends. Initially, the secondary structures were thought to act as a physical obstacle to RNA polymerase for transcription, as identified through G4-specific antibodies [7,8] and chemical probing [9,10]. Further evidence suggests that the varied tissue-specific functionality of these structures is affected by the crosstalk of additional transcription factors [11], proteins, and physiological conditions. Additionally, G4 structures have a role in genomic instability and are associated with higher rates of double-strand breakage in nucleosome-depleted regions of highly expressed cancer genes. High- and low-density bands of G4 across both chromosomal strands have been observed showcasing a role of G-quadruplex in the pairing of homologous chromosomes during meiosis [12]. Further, recent evidence shows that G4 formation is highest during DNA replication in the S phase and lowest during the G2 and M phases, which is consistent with phases of transcription, replication, and chromatin accessibility [13].

1.2. Characteristics of G4s

Sequence characteristics such as sequence length [14], base composition [15,16], and loop length [17,18,19,20,21] are important parameters for defining the secondary structure and stability of G-quadruplexes. Molecular dynamics show that telomeric G4 repeats (TTAGGG) in the presence of a K+ cation form a structure with three single nucleotide loops in a parallel fashion. Increasing the loop length by a single base causes the sequences to adopt a mixture of parallel and antiparallel folded structures [22]. The conformation and stability of G-quadruplexes have been used to study the effect of transcription factor binding and altered mRNA expression of several genes. Examples include nucleolin [23] and Ewing’s Sarcoma proteins [24], which preferentially bind to structures with longer loop lengths. Computationally, G4s are defined by the pattern GxN1–7GxN1–7GxN1–7Gx, where x ≥ 3 is the length of each guanine repeat. The guanine tracts are separated by loops of any base composition of length 1–7 bases. This pattern is the basis for regular expression-based tools such as Quadparser [25] and QGRSmapper [26]. Experimental data show that different intermolecular structures, long loops, and non-canonical structures with G tracts containing two guanines also exist [1,27,28,29]. Methods such as G4screener [30], PQSfinder [31], and G4Catchall [32] allow searching for G-quadruplexes with variable quartet numbers and longer loops. G4Hunter [33] provides a score for guanine skewness which is based on predefined values, with a score based on the number of consecutive Gs. G4RNAscreener [30] uses a machine learning algorithm trained with experimental RNA sequences from the G4RNA database [34] and incorporates a threshold using metrics from tools such as G4Hunter [33], cG/cCscore, and G4 Neural Network score for G4 prediction. RNAfold [35] has an option to predict the thermodynamic parameters for G-quadruplex formation.
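The regular-expression definition above can be illustrated with a short Python sketch. It is not the Quadparser or QGRSmapper implementation; the function name `find_pqs` and the restriction of loop characters to A, C, G, T, and N are assumptions made for illustration, and only non-overlapping matches on the given strand are reported.

```python
import re

# Canonical pattern: four G-runs of at least three guanines separated by
# three loops of 1-7 bases each (assumed loop alphabet: A, C, G, T, N).
PQS_PATTERN = re.compile(
    r"G{3,}[ACGTN]{1,7}G{3,}[ACGTN]{1,7}G{3,}[ACGTN]{1,7}G{3,}"
)


def find_pqs(seq):
    """Return (start, end, sequence) for each non-overlapping putative G4."""
    seq = seq.upper()
    return [(m.start(), m.end(), m.group(0)) for m in PQS_PATTERN.finditer(seq)]


if __name__ == "__main__":
    # Human telomeric repeat used as a toy input.
    for hit in find_pqs("ttagggttagggttagggttagggtta"):
        print(hit)
```

Scanning the reverse strand would require running the same search on the reverse complement (or an equivalent C-run pattern), which is omitted here for brevity.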
DSSR [36] and ElTetrado [37] use the tertiary structure of each G-quadruplex for annotating and classifying different base pairs and tetrad structures. 3D-NuS [38] allows visualization of 3D DNA structures including duplexes, triplexes, and quadruplexes. Notably, it visualizes the G-quadruplex structure and its strand orientation, loops, and G-quartets based on the energy minimization of G4 structures using experimental data.
The G4 structure was found to be evolutionarily conserved in seven yeast species [39]. While G-quadruplex regions are significantly enriched in regulatory regions of eukaryotes, short loops of G4 are conserved in different species. Protozoa and fungi have limited diversity of G4 while an increase in diversity has been observed across invertebrates and vertebrates [40]. However, the evolutionary mechanism for this structure or the relationship of these structures at an evolutionary scale is not known.
Sequencing read fragments utilizing a customized approach that introduces stabilizing and destabilizing conditions (K+, Li+, pyridostatin) allows for high throughput sequencing of G4 locations [41,42] with a method known as G4 seq. Versions of this method have been used to identify 1,420,841 G-quadruplexes in 12 species. Using a similar method, 161 and 168 G4 sites were identified in the genomes of Pseudomonas [43] and Escherichia [44], respectively.
Over 100,000 G4 sequences have been mapped in vivo to the human genome. Several proteins such as FUS, TAF15, TARDBP, and PCBP1 have been determined to be enriched at G4 loci using artificial G4 binding [45]. SP2, a transcription factor (TF) encoded by a subfamily of the Sp/XKLF family, is a sequence-specific TF that has a strong association with G-quadruplex affinity. SP2 binds to the CCAAT motif independent of the zinc finger domain necessary for binding to GC-rich motifs [46]. It was shown in vitro that the SP1 TF was able to bind to a DNA sequence lacking the consensus motif and was able to form G-quadruplex sequences [47]. Luciferase expression studies show sequences of G4s in the KIT promoter mutated through site-directed mutagenesis were able to create a modulation (on/off) system for KIT expression through SP1 binding [48]. Additionally, G-quadruplex structures can bind to G-quadruplex sites in other promoter locations [49] mediating cis [50] and trans [51] acting regulation of transcriptional and translational processes, respectively, implying that G-quadruplex sequence and structural diversity are key factors for biological functions.

1.3. G4 Families

Previously, a small family of G-quadruplexes labeled Pu27 was identified based on sequence homology [49]. The parent G-quadruplex is a 27 nucleotide (nt) G4 formed in the nuclease hypersensitive element (NHE) region of the c-MYC promoter associated with different forms of cancer, and predominantly involved in the regulation of expression of the c-MYC gene [52]. c-MYC is an oncogene that regulates genes in the cell cycle and molecular metabolism. Rezzoug et al. identified seventeen potential G-quadruplex forming sequences homologous to the Pu27 G4 which have been shown to bind to the NHE region of the c-MYC promoter selectively [49]. In addition, G4 regions regulating VEGF genes have been shown to have an additional G-tract that acts as a spare tire for the formation of the G-quadruplex in the event of oxidative damage to the guanine tracts [53]. Similar sequences have been identified for c-MYC, KRAS [54], BCL2 [55], HIF-1α, and RET genes. This highlights the presence of sequence-specific G-quadruplexes able to form, bind, and regulate gene expression. Further, over the past decade, numerous G-quadruplex stabilizing and destabilizing ligands have been identified that recognize and interact selectively with these G4 sequences. Different classes of these heteroaromatic polycyclic, macrocyclic, and aromatic compounds have been designed to target the diversity of the G4 structure. The subtle differences in grooves, loop composition, and loop length allow for structural variability in these sequences. DNA aptamers that can form G4 are used for binding nucleolin [56]. More than 50 transcription factors with overlapping binding sites to the G4 region have been identified [2,57]. The folding, misfolding, and unfolding of G4 structures have been implicated in different biological processes [58,59].

1.4. Detection of G4 Families

The prediction of G-quadruplexes across genomes can be useful to identify the location of similarly structured G-quadruplexes, which can in turn be used to develop profiles of independent families based on the conservation of a variety of factors. We present a framework for predicting G-quadruplex sequences and identifying similar sequences using trained profile hidden Markov models (HMMs) [60]. We identify putative G4 (pG4) sequences across the human genome and cluster these sequences using the sequence clustering tools CD-HIT [61], MeShClust [62], and DNACLUST [63], as well as Starcode [64] and BLAST [65]. These approaches utilize average weighted clustering to identify the quartet and loop patterns. We then train profile HMMs on these clusters to create families. Despite the short length of G-quadruplex sequences, position-dependent insertion and deletion within loops offer insight into the loop characteristics.

2. Materials and Methods

Dataset Preparation

Since there are no current families or experimental similarities in G4 structures, we start with putative G4s and apply sequence-based methods for clustering. Later, these clusters are used as initial seeds for identifying G-quadruplexes in experimental datasets. Initially, we focused on the G4s identified by Quadparser [25] in the human GRCh38 genome. While two-tetrad structures have been experimentally validated, they were omitted from our clustering due to the elevated false positive rate associated with their computational detection. The following process is applied to all groups of sequences based on the number of GGG tetrads (Figure 2a).
  (1) CD-HIT, MeShClust, DNACLUST, and a combination of Starcode and BLAST with hierarchical clustering are utilized for the initial clustering of G-quadruplex sequences.
  (2) Steps (3)–(7) are repeated separately for each clustering method.
  (3) A multiple sequence alignment (MSA) of each cluster of sequences is carried out in R using the DECIPHER package [66]. The StaggerAlignment and AdjustAlignment functions are used to separate regions of alignment and to shift gaps to improve the alignment.
  (4) Clusters with fewer than four sequences are filtered out. An MSA score for each cluster is calculated as the average number of gaps in each column of an alignment divided by the length using MStatX [67].
  (5) Each alignment is used independently to train a profile HMM with HMMER 3.0 [68] and the aphid package [69] in R version 3.4.1. The transition and emission probability matrices are estimated based on the Plan 7 profile HMM (PHMM) model based on Durbin [60]. An example of a profile HMM showing match, insert, and delete states is given in Figure 2b. There are seven outgoing transitions based on the match, insert, and delete states, i.e., In → In, Mn → In, Mn → Mn+1, Mn → Dn+1; Dn → Mn+1, Dn → Dn+1; In → Mn+1, where n represents each position of the alignment (except the final position). The observed counts of emissions and state transitions are converted into probabilities.
  (6) The sequences in each cluster are used as input for all the profiles and the log odds scores are generated using the forward algorithm.
  (7) A pairwise Wilcoxon rank sum test is carried out to compare each profile using the log odds between the profile HMM through which the sequences were generated and all other profiles (Figure 2c). If a profile is significantly different (p-value < 0.05) from all compared profiles, has a probability of 0.99 for the tested sequences, and has a gap score less than a threshold of 0.10, the profile is saved as a family. Sequences from non-significant comparisons (p-value > 0.05) are returned to the MSA step and are merged and/or clustered using agglomerative clustering. Alignments with a gap score of 0.6 after merging are filtered out. The process is iterated a maximum of 100 times.
  (8) The groups of sequences obtained from all the methods are combined and checked for redundancy using a modification of step (7), with a log odds score threshold of 5, an Akaike weight threshold of 0.7, and an MSA gap score threshold of 0.07, to identify the final families, which are added manually, checked, and filtered.
  (9) The alignment and profile HMM are manually verified, resulting in 95 clusters referred to as families.
Experimentally validated G-quadruplexes were obtained from processed peaks mapped to hg19 from GEO, accession GSE63874 [41], using bedtools [70] and Quadparser2 after conversion to human genome hg38 coordinates by liftover. The models are used as a trained classifier to identify additional sequences. G4 sequences from experimental G4 seq were tested against the cluster HMMs. The likelihood that a query sequence fits the model of an individual family is calculated using the forward algorithm [71], and the normalized Akaike weights [72,73] are calculated. The family whose model gives the maximum Akaike weight for the query is selected as the nearest family to the query sequence. The families are manually verified and the variability of sequences in the families is further analyzed based on the annotation of the G4, thermodynamic scores (folding energy), G4Hunter scores, and literature.
The steps below highlight the method for the combination of Starcode and BLAST with hierarchical clustering.
    (a) Levenshtein distance is used to identify the nearest group of sequences, which are then filtered based on the length of the sequence and the number of G tetrads. Starcode [64] utilizes a modified Needleman-Wunsch dynamic programming approach known as the poucet algorithm for determining the initial and nearest groups of sequences. Sequences below a fixed Levenshtein score are used to identify the groups, and each group is filtered by the length of the sequence and loop sequence content. Using a specific Levenshtein distance as a constraint through this algorithm, one or two nucleotide mismatches can be identified in short DNA sequences.
    (b) The remaining sequences from step (a) that are not in any group are compared using an all vs. all pairwise BLAST. −log(E value) is used as the similarity metric.
    (c) Hierarchical clustering is applied by comparing the agglomerative, Ward, complete, and divisive methods of clustering. The number of clusters is calculated based on the sum of the within-cluster inertia. The optimal number of clusters is the maximum difference from two successive clusters between the groups, i.e., max (Im/Im+1). The mode of the number of clusters was selected as the optimal cluster.
    (d) Pairwise alignment of sequences of individual clusters obtained from steps (a) and (c) is carried out using the pairwise alignment function in the Biostrings [74] package. Hierarchical clustering of the sequences is performed based on the pairwise distance. The consensus of the Silhouette [75], Frey, McClain, C-index, and Dunn indices was used for identifying the optimal number of clusters. The metrics are calculated using the NbClust package [76] in R.
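
The Levenshtein-distance grouping used in the Starcode step above can be illustrated with a minimal Python sketch. This is a plain dynamic-programming edit distance with a simple greedy grouping, not Starcode's poucet algorithm; the function names and the default threshold of two edits are assumptions chosen to mirror the "one or two nucleotide mismatches" mentioned in the text.

```python
def levenshtein(a, b):
    """Edit distance (substitutions, insertions, deletions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def group_by_distance(seqs, max_dist=2):
    """Greedy grouping: a sequence joins the first group whose
    representative (its first member) lies within max_dist edits."""
    groups = []
    for s in seqs:
        for members in groups:
            if levenshtein(members[0], s) <= max_dist:
                members.append(s)
                break
        else:
            groups.append([s])
    return groups


if __name__ == "__main__":
    toy = ["GGGAGGGAGGGAGGG", "GGGTGGGAGGGAGGG", "GGGTTAGGGTTAGGGTTAGGG"]
    print(group_by_distance(toy))
```

A greedy scheme like this depends on input order, which is one reason dedicated tools are preferable for genome-scale data.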

3. Results

In the preliminary step, a combination of Starcode and BLAST was used with hierarchical clustering to identify 2717 clusters of G-quadruplexes with 29,112 sequences. Using DNACLUST, 587 clusters with 4664 sequences were identified. A total of 786 clusters with 6335 sequences were identified using CD-HIT with a k-mer size of 8. MeShClust with an identity threshold of 90% and k-mer size of 9 was able to identify 508 clusters. Any clusters with fewer than four sequences were discarded. The two largest clusters had 1720 and 1410 sequences, respectively. The overall clustering summary is provided in Table 1.
The HMMs for the identified clusters were utilized to predict additional G-quadruplex sequences. In addition, the MSA was used to detect transcription factor site motifs found within each family. The G4 families suffered from the redundancy of motifs because of the high percentage of guanine bases. To identify unique motifs, a pipeline was created to merge and re-cluster the families. Overall, the Starcode and BLAST pipeline identified 95 clusters of G-quadruplex genomic DNA sequences. The MeShClust pipeline identified 72 clusters, while DNACLUST and CD-HIT identified 31 and 30, respectively. The final iteration of the clustering and merging sequences across profiles from the various clustering approaches resulted in 95 distinct families.

3.1. G-quadruplex Families

The resulting 95 families were created from 1739 distinct individual G4s identified from 2145 distinct regions of the hg38 human genome. Given the short sequence length and guanine composition, many of the G4 sequences are not unique. One of the largest families identified, Family 23, is composed of 163 regions with 118 distinct G4s occurring over 122 genes (Supplemental Table S1). Similarly, Family 79 has 130 regions with 99 distinct G4s occurring over 128 genes distributed across all chromosomes (Supplemental Table S2). We identified multiple sequence repeats capable of forming multiple G4 structures with different conformations in Families 46, 62, 88, 89, and 90 based on the available guanines (Supplemental Table S3). The smaller Families 2 and 3 have 7 and 6 distinct sequences occurring in proximity to 8 and 7 genes, respectively (Supplemental Tables S4 and S5). A summary of the predicted G4 sequence families is presented in Table 2.
We analyzed the clusters for their sequence characteristics, functional annotation, and structural features, as presented below. We highlight some of the clusters that have strong biological significance with related biological and molecular processes, including Family 4 (Supplemental Table S6), Family 32 (Supplemental Table S7), Family 75 (Supplemental Table S8), and Family 80 (Supplemental Table S9).

3.2. Categorical Enrichment of Select Families

Family 4 consists of nine sequences distributed over nine genes and seven chromosomes. Figure 3 illustrates the dot-bracket notation of the consensus of the family, along with thermodynamic characteristics. While this family is relatively small, the associated genes are related, showing an enrichment of terms related to neural cells (e.g., glia-guided migration, synapse assembly, dendritic spine development, and gliogenesis) (Supplemental Figure S1, Supplemental Table S10).
Family 32 contains 90 G4 sequences annotated with 85 genes. The thermodynamic properties are illustrated in Figure 4. The genes associated with Family 32 G4s are enriched for cellular organization (e.g., positive regulation of cell projection organization and positive regulation of cellular component organization), axonal development (e.g., neuron projection guidance, axon guidance), mitochondrial localization (e.g., regulation of protein targeting to mitochondrion and regulation of establishment of protein localization to mitochondrion) and size regulation (e.g., regulation of anatomical structure size and regulation of cell size) (Supplemental Figure S2, Supplemental Table S11).
Family 75 is represented by 18 G4 sequences distributed over 10 chromosomes and 16 genes (Figure 5). Enriched GO:BP terms are highly related to immune differentiation and adhesion (e.g., positive regulation of T cell differentiation, positive regulation of lymphocyte differentiation, positive regulation of leukocyte cell-cell adhesion) (Supplemental Figure S3, Supplemental Table S12).
For Family 80, we identified 21 sequences distributed over 12 chromosomes and 21 genes (Figure 6). Genes associated with this family appear to be localized to cellular components, in particular membranes. Enriched GO:CC categories for the genes include the cytoplasmic side of the membrane, plasma membrane, cytoplasmic side of the plasma membrane, plasma membrane region, cell projection membrane, ficolin-1-rich granule membrane, side of the membrane, cell periphery, ruffle membrane, secretory granule membrane, leading-edge membrane, actin filament, the extrinsic component of the cytoplasmic side of the plasma membrane, ruffle, membrane, extrinsic component of the plasma membrane, intrinsic components of the membrane, intrinsic components of the endoplasmic reticulum membrane, plasma membrane protein complex, ficolin-1-rich granule, and tertiary granule (Supplemental Figure S4, Supplemental Table S13). A summary of enriched GO terms as determined from GOprofiler and simplifyEnrichment for selected families is presented in Figure 7.

3.3. Thermodynamic Properties of Select Families

The free energy of the thermodynamic ensemble for the consensus sequence of Family 1 was calculated to be −28.11 kcal/mol. The frequency of the minimum free energy (MFE) structure was 50.62% with an ensemble diversity of 0, suggesting a strict conformation of tetrads for the formation of a G4 structure. The minimum free energy for the family was calculated to be −27.69 kcal/mol. This family consists of six training sequences that have a single-length loop with T-T-A loops (represented by 1-1-1 loops).
For Family 11, the free energy of the thermodynamic ensemble was calculated to be −20.22 kcal/mol. The frequency of the MFE structure in the ensemble is 25.23% and the ensemble diversity is 0, suggesting once again a strict conformation of tetrads for G4 formation. Family 63 is identified with the sequence G3AG3AG3AG3 and is found across 24 chromosomes and 97 genes distributed among intronic, intergenic, and promoter regions. The free energy of the thermodynamic ensemble for Family 63 was calculated to be −36.00 kcal/mol, while the frequency of the MFE structure in the ensemble is 100% and the ensemble diversity is 0.00.
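The link between the MFE, the ensemble free energy, and the frequency of the MFE structure can be checked with a short calculation. The sketch below uses the standard Boltzmann relation, frequency = exp(−(E_MFE − G_ensemble)/RT); the folding temperature of 37 °C is an assumption (it is not stated in the text), and the function name is hypothetical.

```python
import math

GAS_CONSTANT = 1.98717e-3  # kcal / (mol * K)


def mfe_frequency(mfe, ensemble_energy, temp_c=37.0):
    """Boltzmann probability of the MFE structure within the ensemble.

    mfe and ensemble_energy are in kcal/mol; temp_c is an assumed
    folding temperature in degrees Celsius.
    """
    rt = GAS_CONSTANT * (temp_c + 273.15)
    return math.exp(-(mfe - ensemble_energy) / rt)


if __name__ == "__main__":
    # Family 1 values quoted in the text: MFE -27.69, ensemble -28.11 kcal/mol.
    print(f"Family 1: {mfe_frequency(-27.69, -28.11):.4f}")
    # A frequency of 100% corresponds to the MFE equalling the ensemble energy.
    print(f"Equal energies: {mfe_frequency(-36.00, -36.00):.4f}")
```

With the Family 1 energies, this gives a value close to the reported 50.62%; the small difference is expected because the published energies are rounded to two decimals.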
Figure 3a–e, Figure 4a–e, Figure 5a–e and Figure 6a–e illustrate the thermodynamic properties of Families 4, 32, 75, and 80, respectively. Figure 3a, Figure 4a, Figure 5a and Figure 6a represent the base pairing of each base in the G-quadruplex sequence. Figure 3b, Figure 4b, Figure 5b and Figure 6b highlight the centroid secondary structure in dot-bracket notation. A base pairing probability matrix is used to provide additional information about the ensemble G4 secondary structure. Applied initially to identify different secondary structures of RNA sequences, dynamic programming provides efficient computation of base pairing probabilities for secondary structure formation. The MFE secondary structure, colored to encode positional entropy (Figure 3c, Figure 4c, Figure 5c and Figure 6c), is calculated using the consensus sequence of the G4 cluster as predicted by RNAfold. DNA shape features such as the minor groove width and electrostatic potential (Figure 3d, Figure 4d, Figure 5d and Figure 6d) depend upon the charge distribution of nucleotides in a DNA sequence and affect the folding into secondary structure and transcription factor binding in these locations [77]. The difference in stacking energies causing the varying hydrogen bonding patterns can be predicted for each dinucleotide step and can be used to infer minor groove width [78]. The repeated guanine amino groups in G-quadruplexes affect charge distributions in the minor and major grooves of helical DNA, leading to rotation of the tetrads. We use these features to annotate the different families of G-quadruplexes identified here. A dot plot of the structure with MFE is shown in Figure 3e, Figure 4e, Figure 5e and Figure 6e for each of the selected families.
When DNA is bent around secondary structures such as helical or G-quadruplex structures, the bend is separated based on dinucleotide sequences. Propeller twist is defined as the twist along the axis making two bases “non-coplanar” [79]. Previous studies have provided evidence for the flexible nature of the GG and GC dinucleotides with low propeller twist, while AA shows the highest. The flexible nature of such a structure favors G-quadruplex sequences. Low propeller twist is related to the ability of the nucleotides to slide on each other and stack in a stable manner. For each cluster, we calculated the dinucleotide frequency normalized by the individual length of the G-quadruplex, minimum free energy, minor groove width, propeller twist, helical twist, roll, and electrostatic potential with a −10 to +10 region around the identified G-quadruplex clusters using DNAshapeR [80]. These features address the shape, thermodynamic stability, flexibility of rotation of the guanine amino groups, and transcription factor recognition sites.
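As an illustration of the first of these features, the sketch below counts overlapping dinucleotides and divides each count by the sequence length. The function name is hypothetical, and normalizing by the full sequence length (rather than by the number of dinucleotide steps) is one reading of "normalized by the individual length of the G-quadruplex"; the shape features themselves come from DNAshapeR and are not reproduced here.

```python
from collections import Counter


def dinucleotide_frequency(seq):
    """Overlapping dinucleotide counts divided by the sequence length."""
    seq = seq.upper()
    if len(seq) < 2:
        return {}
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return {dinuc: n / len(seq) for dinuc, n in counts.items()}


if __name__ == "__main__":
    for dinuc, freq in sorted(dinucleotide_frequency("GGGAGGGAGGGAGGG").items()):
        print(dinuc, round(freq, 3))
```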

3.4. Classification of Experimentally Validated G4 Sequences

Using the sequences from peaks mapped from a G4 seq experiment (GEO accession GSE63874), and identified using Quadparser2, we found all possible pG4 sequences with four tetrads and used them to query the model classifier. We classified 18,340 individual G4s identified from 22,226 distinct regions of the hg38 human genome into 95 families. Based on the clustering for experimental sequences, the major families represented are Family 73 (917 unique G4s related to 664 genes), Family 2 (25 unique G4s, 29 genes), and Family 93 (26 unique G4s, 25 genes). Family 63 has a distinct G4 sequence G3AG3AG3 that is repeated throughout the genome, occurring 313 times over 23 chromosomes and 204 genes.

3.5. G4 Repeat and Loop Length Characteristics

For genes with repeats of G4 sequences (i.e., more than four tetrads), multiple G4 sequences with a variable loop length are possible (Figure 8). We identify all possible linear combinations of G tetrads for such sequences and classify all combinations of the sequences into families. This provides a way to identify multiple conformations forming G-quadruplexes. One example gene with a variable length sequence is BAHCC1, a chromatin regulator known to interact with transcriptional repressors to ensure gene silencing through recognition of, and binding to, PRC2 complex-mediated H3K27me3, and through chromatin compaction and histone deacetylation [81,82]. Within a single G4 region, we identified repeats of 13 different sequences (length of G4 repeat: 314 bases), with each sequence being distinct enough to occur in a separate family. We also find 29 G4 sequences in NRD2, with most of the sequences occurring in Family 17, and one each occurring in Families 7 and 10.
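One way to read "all possible linear combinations of G tetrads" is to pick every ordered set of four G-runs from a repeat region and keep those whose three intervening loops fall within a length limit. The sketch below follows that reading; it is an illustrative assumption rather than the authors' code, the function name is hypothetical, and the 1–7 base loop limit is borrowed from the canonical pattern described in Section 1.2. Skipped G-runs are treated as part of a loop.

```python
import re
from itertools import combinations


def candidate_g4s(seq, min_run=3, max_loop=7):
    """Enumerate candidate G4s built from any four G-runs, in order.

    Returns (subsequence, (loop1_len, loop2_len, loop3_len)) tuples for
    combinations whose loops are each 1..max_loop bases long.
    """
    seq = seq.upper()
    runs = [(m.start(), m.end()) for m in re.finditer(r"G{%d,}" % min_run, seq)]
    candidates = []
    for combo in combinations(runs, 4):
        # A loop is everything between the end of one run and the start of the next.
        loops = [seq[combo[i][1]:combo[i + 1][0]] for i in range(3)]
        if all(1 <= len(loop) <= max_loop for loop in loops):
            g4 = seq[combo[0][0]:combo[3][1]]
            candidates.append((g4, tuple(len(loop) for loop in loops)))
    return candidates


if __name__ == "__main__":
    # Five G-runs: several distinct four-run combinations are possible.
    for g4, loops in candidate_g4s("GGGAGGGAGGGAGGGAGGG"):
        print(loops, g4)
```

Because the number of combinations grows quickly with the number of G-runs, a long repeat region would likely need a windowed or pruned enumeration in practice.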
We identify similar repeats of five distinct sequences spanning an intronic region in PLOD1, which codes for lysyl hydroxylase and is involved in collagen synthesis. A 45 nucleotide G-quadruplex sequence present in the promoter region of tyrosine hydroxylase (TH) can regulate transcription and has been linked with neurological and psychological disorders such as Parkinson’s disease and schizophrenia [83,84]. We found two additional G-quadruplex sequences in the opposite strand across promoter and intronic regions of TH which have matches to Families 14 and 37, respectively.
Semaphorins are a group of membrane-spanning proteins that bind to Plexin (PLXNA and PLXNB) receptors to regulate axon cue signaling, cytoskeletal development, and cell adhesion [85,86]. The regulation and signaling of SEMA proteins within the plexin family have been a topic of study, and we identified 39 and 37 distinct G-quadruplex forming sequences in the SEMA family and PLXN family, respectively, with similar G4 loops present in both genes. The prediction identified multiple G4 sequences present in SEMA6C, SEMA6D, and PLXND1 with the highest match to Family 48 (Table 3). Similarly, SEMA4D, SEMA4B, and PLXNA4 shared sequences occurring in Family 17. These findings suggest that multiple regions can form G-quadruplex in these genes, resulting in multiple conformations that might allow for differentiation for methylation in a pattern-specific manner.
The PDB structures 22AG, 2KF8, 5LQG, and 5YEY represent telomeric quadruplex DNA forming a range of conformations with antiparallel topology based on varying physiological conditions. These telomeric G4 sequences are determined to have the highest likelihood of matching Family 22. They have a similar loop size to structure 2KM3 [28], which has a variant of CTAGGG repeat instead of TTAGGG repeats. The 2KM3 structure forms a chair-type G-quadruplex in the K+ solution and is most similar to Family 33. Based on the sequence characteristics, these differences in structure which are caused by a one or two bp change can affect the overall prediction of the glycosidic conformation. This, in turn, can be used to help understand the structure based on the local environment and interacting conditions.
The 2LXQ G4 structure is found upstream of the pilin expression locus in Neisseria gonorrhoeae, a human pathogen; its 5′-G3TG3TTG3TG3 sequence is implicated in pilin antigenic variation [87]. Known to form an all-parallel stranded topology, the sequence was predicted to have the highest likelihood score with Family 40. A highly conserved G4 sequence at NHE III1 upstream of promoter one has been studied and found to silence transcription of c-MYC [52,88,89,90,91], as have other short-loop G4 sequences that form a similar topology. TAG3AG3TAG3AG3T was predicted to belong to Family 52 as well as Family 1. Although it follows the same 1:2:1 loop pattern as the 2LXQ structure, the presence of adenosine in place of thymidine in the linker loops places this sequence in a different family.
Experimental evidence shows that G4s with short-loop sequences favor a parallel topology, while structures with longer loops tend to form hybrid or antiparallel structures [92]. Sequences with a single thymine as a loop have been found to have a higher melting temperature than those with a single adenine [93]. Given our clustering scheme, sequences with short loops can show high log odds for multiple families. In these cases, the Akaike weight can help guide the context and identify the multiple families that may contain such sequences.
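As an illustration of how Akaike weights distribute support across candidate families, the sketch below applies the general definition of Akaike weights [73] to a list of log-likelihoods. It is a sketch under stated assumptions rather than the authors' procedure: the function name `akaike_weights`, the example scores, and the default of equal parameter counts across models are all hypothetical.

```python
import math


def akaike_weights(log_likelihoods, n_params=None):
    """Compute Akaike weights for a set of candidate models.

    log_likelihoods: natural-log likelihood of the data under each model.
    n_params: number of free parameters per model. If omitted, every model
    is treated as equally complex (an assumption made for illustration).
    """
    if n_params is None:
        n_params = [0] * len(log_likelihoods)
    # AIC = 2k - 2 ln(L) for each model.
    aic = [2 * k - 2 * ll for ll, k in zip(log_likelihoods, n_params)]
    best = min(aic)
    # Relative likelihood of each model given its AIC difference.
    rel = [math.exp(-(a - best) / 2) for a in aic]
    total = sum(rel)
    return [r / total for r in rel]


# Hypothetical scores: two families fit a sequence equally well, a third does not.
weights = akaike_weights([-10.0, -10.0, -20.0])
```

In this hypothetical case the first two families share nearly all of the weight equally, which mirrors the situation in which a short-loop sequence is plausibly assigned to more than one family.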

3.6. G4 in Enhancers

Potential regulatory roles of G4 families were analyzed by examining the overlap between G4s and enhancers. The overlapping enhancers were then used as input into the Gene-Enhancer link correlation (http://compbio.mit.edu/epimap/ (accessed on 3 March 2023)) to determine whether any of the overlapping enhancers were correlated with gene expression and, if so, in what cell type. We then performed hierarchical clustering of the intersecting G4s based on the correlations. Two main groups of interest emerged.
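The overlap step can be sketched as a simple interval intersection. This is an illustrative Python example only, with made-up coordinates and a hypothetical function name, `overlapping_pairs`; it assumes BED-style half-open intervals and uses a brute-force comparison, whereas a dedicated tool such as BEDTools intersect is the usual choice for genome-scale data.

```python
from collections import defaultdict


def overlapping_pairs(g4s, enhancers):
    """Return (g4_name, enhancer_name) pairs that share at least one base.

    Each record is (chrom, start, end, name) with half-open coordinates,
    so an interval ending at 520 does not overlap one starting at 520.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, name in enhancers:
        by_chrom[chrom].append((start, end, name))
    pairs = []
    for chrom, start, end, name in g4s:
        # Compare only against enhancers on the same chromosome.
        for e_start, e_end, e_name in by_chrom[chrom]:
            if start < e_end and e_start < end:
                pairs.append((name, e_name))
    return pairs


# Made-up records for illustration only.
g4s = [
    ("chr1", 100, 120, "g4_a"),
    ("chr1", 500, 520, "g4_b"),
    ("chr2", 100, 120, "g4_c"),
]
enhancers = [
    ("chr1", 110, 300, "enh_1"),
    ("chr1", 520, 600, "enh_2"),
    ("chr2", 50, 101, "enh_3"),
]
pairs = overlapping_pairs(g4s, enhancers)
```

The resulting pairs would then be the input to any downstream correlation lookup and clustering.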
In the first group, 102 G4 sequences are found in 158 genes, belonging to 57 distinct G4 families. GO:BP analysis of this group results in terms associated with immune system processes (e.g., T cell receptor signaling pathway, regulation of leukocyte proliferation, interleukin-10 production, and regulation of cytokine production involved in immune response) or signaling cascades (e.g., positive regulation of ERK1 and ERK2 cascade, calcium ion transmembrane import into the cytosol, and Fc receptor signaling pathway) (Supplemental Figure S5, Supplemental Table S14).
The second group had a ubiquitously high correlation with all cell types in the dataset (Supplemental Figure S6). We identified 234 genes in this group with 107 distinct G4s belonging to 55 distinct families and found enrichment of terms relating to immune responses (e.g., defense response to virus, cytokine-mediated signaling pathway, and regulation of defense response), regulated cell death (e.g., apoptotic signaling pathway, extrinsic apoptotic signaling pathway via death domain receptors, and positive regulation of programmed cell death), lipid biosynthesis (e.g., regulation of lipid biosynthetic process and response to fatty acid), and migration (e.g., positive regulation of protein localization and positive regulation of mononuclear cell migration) (Supplemental Figure S7, Supplemental Table S15).
Based on the enriched terms of the two groups, G-quadruplexes appear to function across multiple pathways in different cell types. It is possible that tissue-specific conditions control the actual G4 formation, leading to tissue-specific functional regulation. The enhancer-gene correlations suggest that G4 sequences in the enhancer regions of group 1 are more likely to affect genes in thymus, T cells, and lymphoblastoid cells.

4. Discussion

The clustering methodology presented here has allowed for the construction of families of G-quadruplexes based on sequence similarity, loop length and composition, and thermodynamic properties. Further analysis reveals that many of these families have functional enrichments, indicating that their members are potentially regulated by common mechanisms given their structural similarities. Comparing our results to the only previously studied family, Pu27, shows high agreement, with 12 of the 18 Pu27 members belonging to Family 1 (Table 4).
Multiple transcription factors can bind to the alternative motifs present in G-quadruplex regions [57] in response to environmental conditions and stimuli. These conditions trigger the folding and unfolding of G4 structures. We identify Family 40 as an alternate conformation in these sequences, as multiple tetrads allow alternate guanine bonding for a stable structure. Nucleoside diphosphate kinase (NM23-H2) [94,95] has previously been shown to unfold Pu27, increasing c-MYC transcription, while nucleolin [96] has been shown to stabilize the G4 structure. The mechanism of TF binding and the control of c-MYC expression is poorly understood and is beyond the scope of prediction through this model. However, this process sheds light on the collection of multiple structural conformations in equilibrium, which can alter the binding grooves available to transcription factors and thereby affect downstream processes. Failing to take the dynamic nature of Pu27 and other G-quadruplex sequences in the genome into account could limit the effectiveness of any therapeutic compounds designed to target them.
Several G4 ligands are currently being considered for their therapeutic value. For instance, CX-5461 is utilized for the treatment of BRCA1/2 deficient tumors through topoisomerase II inhibition [97,98], and melanoma cell lines have been treated with the G4 ligand RHPS4, which targets the MYC gene [99], among others. G4 ligands such as APTO-253 [100], TMPyP4 [101], and telomestatin [102] have been tested for their effect on leukemia. Despite promising results and inhibition of cell growth, telomere shortening and senescence were observed with some of the G4 ligands in different leukemia cells [103]. Combined with information on G4 formation and the binding of specific ligands to multiple G4 structures, the identification of G4 clusters can provide additional information about resulting DNA damage or novel binding motifs of specific G4 ligands.
G4 structures contribute to genomic instability and the proliferative nature of different cancers. The context and location of an individual G4 can serve as a roadblock for many oncogenes, but the presence of a G4 in the vicinity of a tumor suppressor gene can have the opposite effect. To understand the intended consequences of these targets for all G4 ligands, it is important to characterize the thousands of G4 structures present in the genome and classify them based on their structure, function, or localization.
This study identifies related families of G-quadruplex sequences within the human genome and presents them as clusters, each described by both an MSA and an HMM. The approach described here can easily be applied to other model organisms where G4s are known to play regulatory roles. Many of these clusters were functionally annotated, allowing for a more complete understanding of these structures as well as the identification of multiple targets for testing G4 ligands. Currently, our approach utilizes experimentally validated sequences as part of the clustering algorithm, which makes it more robust to false-negative G4s but also makes it more difficult to compare to strictly computational approaches that might be constructed in the future. However, we provide all the clustering scripts and resulting family-based HMMs in our GitHub repository. As more information on experimentally validated G4 regions becomes available, refinement of clustering methodologies will yield more informative G4 families.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/genes14030645/s1, Figure S1: Top 25 GO:BP enrichments for Family 4; Figure S2: Top 25 GO:BP enrichments for Family 32; Figure S3: Top 25 GO:BP enrichments for Family 74; Figure S4: Top 25 GO:CC enrichments for Family 80; Figure S5: Top 25 GO:BP enrichments for experimentally validated G4s overlapping enhancers, group 1; Figure S6: Correlation of selected enhancers consisting of pG4 with gene expression in multiple cell types utilizing the epimap correlation group-link data; Figure S7: Top 25 GO:BP enrichments for experimentally validated G4s overlapping enhancers, group 2; Table S1: Summary of Family 23; Table S2: Summary of Family 79 G4 sequences; Table S3: Sequence repeats capable of forming multiple G4 structures; Table S4: Summary of Family 2 G4 sequences; Table S5: Summary of Family 3 G4 sequences; Table S6: Summary of Family 4 G4 sequences; Table S7: Summary of Family 32 G4 sequences; Table S8: Summary of Family 75 G4 sequences; Table S9: Summary of Family 80 G4 sequences; Table S10: Enriched GO:BP categories for Family 4; Table S11: Enriched GO:BP categories for Family 32; Table S12: Enriched GO:BP categories for Family 75; Table S13: Enriched GO:CC categories for Family 80; Table S14: Enriched GO:BP categories for experimentally validated G4s overlapping enhancers, group 1; Table S15: Enriched GO:BP categories for experimentally validated G4s overlapping enhancers, group 2.

Author Contributions

Conceptualization, E.C.R.; methodology, A.N., J.H.C. and E.C.R.; software, A.N.; validation, A.N. and E.C.R.; formal analysis, A.N. and E.C.R.; investigation, A.N., J.H.C. and E.C.R.; resources, E.C.R.; data curation, A.N. and E.C.R.; writing—original draft preparation, A.N. and E.C.R.; writing—review and editing, A.N., J.H.C. and E.C.R.; visualization, A.N.; supervision, J.H.C. and E.C.R.; project administration, E.C.R.; funding acquisition, E.C.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Institutes of Health, grant number P20GM103436. The contents of this work are solely the responsibility of the authors and do not reflect the official views of the National Institutes of Health.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All code and resulting data are available in the GitHub repository (https://github.com/UofLBioinformatics/G4-Cluster (accessed on 3 March 2023)).

Acknowledgments

We wish to thank members of the Kentucky IDeA Networks of Biomedical Research Excellence (KY INBRE) Bioinformatics Core and the Rouchka and Park labs for their valuable feedback.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lim, K.W.; Amrane, S.; Bouaziz, S.; Xu, W.; Mu, Y.; Patel, D.J.; Luu, K.N.; Phan, A.T. Structure of the human telomere in K+ solution: A stable basket-type G-quadruplex with only two G-tetrad layers. J. Am. Chem. Soc. 2009, 131, 4301–4309.
  2. Lago, S.; Nadai, M.; Cernilogar, F.M.; Kazerani, M.; Domíniguez Moreno, H.; Schotta, G.; Richter, S.N. Promoter G-quadruplexes and transcription factors cooperate to shape the cell type-specific transcriptome. Nat. Commun. 2021, 12, 3885.
  3. Hänsel-Hertsch, R.; Simeone, A.; Shea, A.; Hui, W.W.; Zyner, K.G.; Marsico, G.; Rueda, O.M.; Bruna, A.; Martin, A.; Zhang, X. Landscape of G-quadruplex DNA structural regions in breast cancer. Nat. Genet. 2020, 52, 878–883.
  4. Biffi, G.; Tannahill, D.; Miller, J.; Howat, W.J.; Balasubramanian, S. Elevated levels of G-quadruplex formation in human stomach and liver cancer tissues. PLoS ONE 2014, 9, e102711.
  5. Liu, G.; Du, W.; Xu, H.; Sun, Q.; Tang, D.; Zou, S.; Zhang, Y.; Ma, M.; Zhang, G.; Du, X. RNA G-quadruplex regulates microRNA-26a biogenesis and function. J. Hepatol. 2020, 73, 371–382.
  6. Wang, E.; Thombre, R.; Shah, Y.; Latanich, R.; Wang, J. G-Quadruplexes as pathogenic drivers in neurodegenerative disorders. Nucleic Acids Res. 2021, 49, 4816–4830.
  7. Biffi, G.; Tannahill, D.; McCafferty, J.; Balasubramanian, S. Quantitative visualization of DNA G-quadruplex structures in human cells. Nat. Chem. 2013, 5, 182–186.
  8. Fernando, H.; Sewitz, S.; Darot, J.; Tavare, S.; Huppert, J.L.; Balasubramanian, S. Genome-wide analysis of a G-quadruplex-specific single-chain antibody that regulates gene expression. Nucleic Acids Res. 2009, 37, 6716–6722.
  9. Kouzine, F.; Wojtowicz, D.; Yamane, A.; Casellas, R.; Przytycka, T.M.; Levens, D.L. In vivo chemical probing for G-quadruplex formation. In G-Quadruplex Nucleic Acids; Springer: Berlin/Heidelberg, Germany, 2019; pp. 369–382.
  10. Ruttkay-Nedecky, B.; Kudr, J.; Nejdl, L.; Maskova, D.; Kizek, R.; Adam, V. G-quadruplexes as sensing probes. Molecules 2013, 18, 14760–14779.
  11. Todd, A.K.; Neidle, S. The relationship of potential G-quadruplex sequences in cis-upstream regions of the human genome to SP1-binding elements. Nucleic Acids Res. 2008, 36, 2700–2704.
  12. Chariker, J.H.; Miller, D.M.; Rouchka, E.C. Computational analysis of G-quadruplex forming sequences across chromosomes reveals high density patterns near the terminal ends. PLoS ONE 2016, 11, e0165101.
  13. Hänsel-Hertsch, R.; Beraldi, D.; Lensing, S.V.; Marsico, G.; Zyner, K.; Parry, A.; Di Antonio, M.; Pike, J.; Kimura, H.; Narita, M. G-quadruplex structures mark human regulatory chromatin. Nat. Genet. 2016, 48, 1267–1272.
  14. Risitano, A.; Fox, K.R. Influence of loop size on the stability of intramolecular DNA quadruplexes. Nucleic Acids Res. 2004, 32, 2598–2606.
  15. Sattin, G.; Artese, A.; Nadai, M.; Costa, G.; Parrotta, L.; Alcaro, S.; Palumbo, M.; Richter, S.N. Conformation and stability of intramolecular telomeric G-quadruplexes: Sequence effects in the loops. PLoS ONE 2013, 8, e84113.
  16. Tippana, R.; Xiao, W.; Myong, S. G-quadruplex conformation and dynamics are determined by loop length and sequence. Nucleic Acids Res. 2014, 42, 8106–8114.
  17. Guédin, A.; De Cian, A.; Gros, J.; Lacroix, L.; Mergny, J.-L. Sequence effects in single-base loops for quadruplexes. Biochimie 2008, 90, 686–696.
  18. Li, Y.Y.; Dubins, D.N.; Le, D.M.N.T.; Leung, K.; Macgregor, R.B., Jr. The role of loops and cation on the volume of unfolding of G-quadruplexes related to HTel. Biophys. Chem. 2017, 231, 55–63.
  19. Li, Y.Y.; Macgregor, R.B., Jr. A thermodynamic study of adenine and thymine substitutions in the loops of the oligodeoxyribonucleotide HTel. J. Phys. Chem. B 2016, 120, 8830–8836.
  20. Piazza, A.; Adrian, M.; Samazan, F.; Heddi, B.; Hamon, F.; Serero, A.; Lopes, J.; Teulade-Fichou, M.P.; Phan, A.T.; Nicolas, A. Short loop length and high thermal stability determine genomic instability induced by G-quadruplex-forming minisatellites. EMBO J. 2015, 34, 1718–1734.
  21. Rachwal, P.A.; Brown, T.; Fox, K.R. Sequence effects of single base loops in intramolecular quadruplex DNA. FEBS Lett. 2007, 581, 1657–1660.
  22. Hazel, P.; Huppert, J.; Balasubramanian, S.; Neidle, S. Loop-length-dependent folding of G-quadruplexes. J. Am. Chem. Soc. 2004, 126, 16405–16415.
  23. Lago, S.; Tosoni, E.; Nadai, M.; Palumbo, M.; Richter, S.N. The cellular protein nucleolin preferentially binds long-looped G-quadruplex nucleic acids. Biochim. Biophys. Acta (BBA)-Gen. Subj. 2017, 1861, 1371–1381.
  24. Takahama, K.; Sugimoto, C.; Arai, S.; Kurokawa, R.; Oyoshi, T. Loop lengths of G-quadruplex structures affect the G-quadruplex DNA binding selectivity of the RGG motif in Ewing’s sarcoma. Biochemistry 2011, 50, 5369–5378.
  25. Huppert, J.L.; Balasubramanian, S. Prevalence of quadruplexes in the human genome. Nucleic Acids Res. 2005, 33, 2908–2916.
  26. Kikin, O.; D'Antonio, L.; Bagga, P.S. QGRS Mapper: A web-based server for predicting G-quadruplexes in nucleotide sequences. Nucleic Acids Res. 2006, 34, W676–W682.
  27. Bolduc, F.; Garant, J.-M.; Allard, F.; Perreault, J.-P. Irregular G-quadruplexes found in the untranslated regions of human mRNAs influence translation. J. Biol. Chem. 2016, 291, 21751–21760.
  28. Lim, K.W.; Alberti, P.; Guedin, A.; Lacroix, L.; Riou, J.-F.; Royle, N.J.; Mergny, J.-L.; Phan, A.T. Sequence variant (CTAGGG)n in the human telomere favors a G-quadruplex structure containing a G·C·G·C tetrad. Nucleic Acids Res. 2009, 37, 6239–6248.
  29. Mukundan, V.T.; Phan, A.T. Bulges in G-quadruplexes: Broadening the definition of G-quadruplex-forming sequences. J. Am. Chem. Soc. 2013, 135, 5017–5028.
  30. Garant, J.-M.; Perreault, J.-P.; Scott, M.S. Motif independent identification of potential RNA G-quadruplexes by G4RNA screener. Bioinformatics 2017, 33, 3532–3537.
  31. Hon, J.; Martínek, T.; Zendulka, J.; Lexa, M. pqsfinder: An exhaustive and imperfection-tolerant search tool for potential quadruplex-forming sequences in R. Bioinformatics 2017, 33, 3373–3379.
  32. Doluca, O. G4Catchall: A G-quadruplex prediction approach considering atypical features. J. Theor. Biol. 2019, 463, 92–98.
  33. Bedrat, A.; Lacroix, L.; Mergny, J.-L. Re-evaluation of G-quadruplex propensity with G4Hunter. Nucleic Acids Res. 2016, 44, 1746–1759.
  34. Garant, J.-M.; Luce, M.J.; Scott, M.S.; Perreault, J.-P. G4RNA: An RNA G-quadruplex database. Database 2015, 2015, bav059.
  35. Gruber, A.R.; Lorenz, R.; Bernhart, S.H.; Neuböck, R.; Hofacker, I.L. The Vienna RNA websuite. Nucleic Acids Res. 2008, 36, W70–W74.
  36. Lu, X.-J. DSSR-enabled innovative schematics of 3D nucleic acid structures with PyMOL. Nucleic Acids Res. 2020, 48, e74.
  37. Zok, T.; Popenda, M.; Szachniuk, M. ElTetrado: A tool for identification and classification of tetrads and quadruplexes. BMC Bioinform. 2020, 21, 40.
  38. Patro, L.P.P.; Kumar, A.; Kolimi, N.; Rathinavelan, T. 3D-NuS: A web server for automated modeling and visualization of non-canonical 3-dimensional nucleic acid structures. J. Mol. Biol. 2017, 429, 2438–2448.
  39. Capra, J.A.; Paeschke, K.; Singh, M.; Zakian, V.A. G-quadruplex DNA sequences are evolutionarily conserved and associated with distinct genomic features in Saccharomyces cerevisiae. PLoS Comput. Biol. 2010, 6, e1000861.
  40. Wu, F.; Niu, K.; Cui, Y.; Li, C.; Lyu, M.; Ren, Y.; Chen, Y.; Deng, H.; Huang, L.; Zheng, S. Genome-wide analysis of DNA G-quadruplex motifs across 37 species provides insights into G4 evolution. Commun. Biol. 2021, 4, 98.
  41. Chambers, V.S.; Marsico, G.; Boutell, J.M.; Di Antonio, M.; Smith, G.P.; Balasubramanian, S. High-throughput sequencing of DNA G-quadruplex structures in the human genome. Nat. Biotechnol. 2015, 33, 877–881.
  42. Marsico, G.; Chambers, V.S.; Sahakyan, A.B.; McCauley, P.; Boutell, J.M.; Antonio, M.D.; Balasubramanian, S. Whole genome experimental maps of DNA G-quadruplexes in multiple species. Nucleic Acids Res. 2019, 47, 3862–3874.
  43. Seviour, T.; Winnerdy, F.R.; Wong, L.L.; Shi, X.; Mugunthan, S.; Foo, Y.H.; Castaing, R.; Adav, S.S.; Subramoni, S.; Kohli, G.S. The biofilm matrix scaffold of Pseudomonas aeruginosa contains G-quadruplex extracellular DNA structures. npj Biofilms Microbiomes 2021, 7, 27.
  44. Shao, X.; Zhang, W.; Umar, M.I.; Wong, H.Y.; Seng, Z.; Xie, Y.; Zhang, Y.; Yang, L.; Kwok, C.K.; Deng, X. RNA G-quadruplex structures mediate gene regulation in bacteria. mBio 2020, 11, e02926-19.
  45. Zheng, K.-W.; Zhang, J.-Y.; He, Y.-D.; Gong, J.-Y.; Wen, C.-J.; Chen, J.-N.; Hao, Y.-H.; Zhao, Y.; Tan, Z. Detection of genomic G-quadruplexes in living cells using a small artificial protein. Nucleic Acids Res. 2020, 48, 11706–11720.
  46. Völkel, S.; Stielow, B.; Finkernagel, F.; Stiewe, T.; Nist, A.; Suske, G. Zinc finger independent genome-wide binding of Sp2 potentiates recruitment of histone-fold protein Nf-y distinguishing it from Sp1 and Sp3. PLoS Genet. 2015, 11, e1005102.
  47. Raiber, E.-A.; Kranaster, R.; Lam, E.; Nikan, M.; Balasubramanian, S. A non-canonical DNA structure is a binding motif for the transcription factor SP1 in vitro. Nucleic Acids Res. 2012, 40, 1499–1508.
  48. Da Ros, S.; Nicoletto, G.; Rigo, R.; Ceschi, S.; Zorzan, E.; Dacasto, M.; Giantin, M.; Sissi, C. G-Quadruplex modulation of SP1 functional binding sites at the KIT proximal promoter. Int. J. Mol. Sci. 2020, 22, 329.
  49. Rezzoug, F.; Thomas, S.D.; Rouchka, E.C.; Miller, D.M. Discovery of a family of genomic sequences which interact specifically with the c-MYC promoter to regulate c-MYC expression. PLoS ONE 2016, 11, e0161588.
  50. David, A.P.; Margarit, E.; Domizi, P.; Banchio, C.; Armas, P.; Calcaterra, N.B. G-quadruplexes as novel cis-elements controlling transcription during embryonic development. Nucleic Acids Res. 2016, 44, 4163–4173.
  51. Beaudoin, J.-D.; Perreault, J.-P. 5′-UTR G-quadruplex structures acting as translational repressors. Nucleic Acids Res. 2010, 38, 7022–7036.
  52. Brooks, T.A.; Hurley, L.H. Targeting MYC expression through G-quadruplexes. Genes Cancer 2010, 1, 641–649.
  53. Fleming, A.M.; Zhou, J.; Wallace, S.S.; Burrows, C.J. A role for the fifth G-track in G-quadruplex forming oncogene promoter sequences during oxidative stress: Do these “spare tires” have an evolved function? ACS Cent. Sci. 2015, 1, 226–233.
  54. Cogoi, S.; Xodo, L.E. G-quadruplex formation within the promoter of the KRAS proto-oncogene and its effect on transcription. Nucleic Acids Res. 2006, 34, 2536–2549.
  55. Agrawal, P.; Lin, C.; Mathad, R.I.; Carver, M.; Yang, D. The major G-quadruplex formed in the human BCL-2 proximal promoter adopts a parallel structure with a 13-nt loop in K+ solution. J. Am. Chem. Soc. 2014, 136, 1750–1753.
  56. Bates, P.J.; Laber, D.A.; Miller, D.M.; Thomas, S.D.; Trent, J.O. Discovery and development of the G-rich oligonucleotide AS1411 as a novel treatment for cancer. Exp. Mol. Pathol. 2009, 86, 151–164.
  57. Spiegel, J.; Cuesta, S.M.; Adhikari, S.; Hänsel-Hertsch, R.; Tannahill, D.; Balasubramanian, S. G-quadruplexes are transcription factor binding hubs in human chromatin. Genome Biol. 2021, 22, 117.
  58. Jana, J.; Vianney, Y.M.; Schröder, N.; Weisz, K. Guiding the folding of G-quadruplexes through loop residue interactions. Nucleic Acids Res. 2022, 50, 7161–7175.
  59. Marchand, A.; Gabelica, V. Folding and misfolding pathways of G-quadruplex DNA. Nucleic Acids Res. 2016, 44, 10999–11012.
  60. Durbin, R.; Eddy, S.R.; Krogh, A.; Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids; Cambridge University Press: Cambridge, UK, 1998.
  61. Li, W.; Godzik, A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22, 1658–1659.
  62. James, B.T.; Luczak, B.B.; Girgis, H.Z. MeShClust: An intelligent tool for clustering DNA sequences. Nucleic Acids Res. 2018, 46, e83.
  63. Ghodsi, M.; Liu, B.; Pop, M. DNACLUST: Accurate and efficient clustering of phylogenetic marker genes. BMC Bioinform. 2011, 12, 271.
  64. Zorita, E.; Cusco, P.; Filion, G.J. Starcode: Sequence clustering based on all-pairs search. Bioinformatics 2015, 31, 1913–1919.
  65. Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W.; Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 1990, 215, 403–410.
  66. Wright, E.S. DECIPHER: Harnessing local sequence context to improve protein multiple sequence alignment. BMC Bioinform. 2015, 16, 322.
  67. Collet, G. Gcollet/MstatX: A Multiple Alignment Analyser. GitHub. Available online: https://github.com/gcollet/MstatX (accessed on 3 March 2023).
  68. Finn, R.D.; Clements, J.; Eddy, S.R. HMMER web server: Interactive sequence similarity searching. Nucleic Acids Res. 2011, 39, W29–W37.
  69. Wilkinson, S.P. aphid: An R package for analysis with profile hidden Markov models. Bioinformatics 2019, 35, 3829–3830.
  70. Quinlan, A.R.; Hall, I.M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26, 841–842.
  71. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286.
  72. Johnson, J.B.; Omland, K.S. Model selection in ecology and evolution. Trends Ecol. Evol. 2004, 19, 101–108.
  73. Wagenmakers, E.-J.; Farrell, S. AIC model selection using Akaike weights. Psychon. Bull. Rev. 2004, 11, 192–196.
  74. Pages, H.; Aboyoun, P.; Gentleman, R.; DebRoy, S.; Pages, M.H.; DataImport, D.; BSgenome, S.; XStringSet-class, R.; MaskedXString-class, R.; XStringSet-io, R. Package ‘Biostrings’. Available online: https://bioconductor.org/packages/release/bioc/html/Biostrings.html (accessed on 3 March 2023).
  75. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
  76. Charrad, M.; Ghazzali, N.; Boiteau, V.; Niknafs, A. NbClust: An R package for determining the relevant number of clusters in a data set. J. Stat. Softw. 2014, 61, 1–36.
  77. Honig, B.; Nicholls, A. Classical electrostatics in biology and chemistry. Science 1995, 268, 1144–1149.
  78. Rohs, R.; West, S.M.; Sosinsky, A.; Liu, P.; Mann, R.S.; Honig, B. The role of DNA shape in protein–DNA recognition. Nature 2009, 461, 1248–1253.
  79. El Hassan, M.; Calladine, C. Propeller-twisting of base-pairs and the conformational mobility of dinucleotide steps in DNA. J. Mol. Biol. 1996, 259, 95–103.
  80. Chiu, T.-P.; Comoglio, F.; Zhou, T.; Yang, L.; Paro, R.; Rohs, R. DNAshapeR: An R/Bioconductor package for DNA shape prediction and feature encoding. Bioinformatics 2016, 32, 1211–1213.
  81. Fan, H.; Lu, J.; Guo, Y.; Li, D.; Zhang, Z.-M.; Tsai, Y.-H.; Pi, W.-C.; Ahn, J.H.; Gong, W.; Xiang, Y. BAHCC1 binds H3K27me3 via a conserved BAH module to mediate gene silencing and oncogenesis. Nat. Genet. 2020, 52, 1384–1396.
  82. Guo, Y.; Zhao, S.; Wang, G.G. Polycomb gene silencing mechanisms: PRC2 chromatin targeting, H3K27me3 'Readout', and phase separation-based compaction. Trends Genet. 2021, 37, 547–565.
  83. Banerjee, K.; Wang, M.; Cai, E.; Fujiwara, N.; Baker, H.; Cave, J.W. Regulation of tyrosine hydroxylase transcription by hnRNP K and DNA secondary structure. Nat. Commun. 2014, 5, 5769.
  84. Farhath, M.M.; Thompson, M.; Ray, S.; Sewell, A.; Balci, H.; Basu, S. G-Quadruplex-enabling sequence within the human tyrosine hydroxylase promoter differentially regulates transcription. Biochemistry 2015, 54, 5533–5545.
  85. Janssen, B.J.; Robinson, R.A.; Pérez-Brangulí, F.; Bell, C.H.; Mitchell, K.J.; Siebold, C.; Jones, E.Y. Structural basis of semaphorin–plexin signalling. Nature 2010, 467, 1118–1122.
  86. Takamatsu, H.; Kumanogoh, A. Diverse roles for semaphorin–plexin signaling in the immune system. Trends Immunol. 2012, 33, 127–135.
  87. Kuryavyi, V.; Cahoon, L.A.; Seifert, H.S.; Patel, D.J. RecA-binding pilE G4 sequence essential for pilin antigenic variation forms monomeric and 5′ end-stacked dimeric parallel G-quadruplexes. Structure 2012, 20, 2090–2102.
  88. González, V.; Hurley, L.H. The c-MYC NHE III1: Function and regulation. Annu. Rev. Pharmacol. Toxicol. 2010, 50, 111–129.
  89. Hurley, L.H.; Von Hoff, D.D.; Siddiqui-Jain, A.; Yang, D. Drug targeting of the c-MYC promoter to repress gene expression via a G-quadruplex silencer element. In Seminars in Oncology; WB Saunders: Philadelphia, PA, USA, 2006; pp. 498–512.
  90. Siddiqui-Jain, A.; Grand, C.L.; Bearss, D.J.; Hurley, L.H. Direct evidence for a G-quadruplex in a promoter region and its targeting with a small molecule to repress c-MYC transcription. Proc. Natl. Acad. Sci. USA 2002, 99, 11593–11598.
  91. Yang, D.; Hurley, L.H. Structure of the biologically relevant G-quadruplex in the c-MYC promoter. Nucleosides Nucleotides Nucleic Acids 2006, 25, 951–968.
  92. Zhang, A.Y.; Bugaut, A.; Balasubramanian, S. A sequence-independent analysis of the loop length dependence of intramolecular RNA G-quadruplex stability and topology. Biochemistry 2011, 50, 7251–7258.
  93. Li, J.; Chu, I.-T.; Yeh, T.-A.; Chen, D.-Y.; Wang, C.-L.; Chang, T.-C. Effects of length and loop composition on structural diversity and similarity of (G3TG3NmG3TG3) G-quadruplexes. Molecules 2020, 25, 1779.
  94. Postel, E.; Berberich, S.; Flint, S.; Ferrone, C. Human c-myc transcription factor PuF identified as nm23-H2 nucleoside diphosphate kinase, a candidate suppressor of tumor metastasis. Science 1993, 261, 478–480.
  95. Shan, C.; Lin, J.; Hou, J.-Q.; Liu, H.-Y.; Chen, S.-B.; Chen, A.-C.; Ou, T.-M.; Tan, J.-H.; Li, D.; Gu, L.-Q. Chemical intervention of the NM23-H2 transcriptional programme on c-MYC via a novel small molecule. Nucleic Acids Res. 2015, 43, 6677–6691.
  96. González, V.; Hurley, L.H. The C-terminus of nucleolin promotes the formation of the c-MYC G-quadruplex and inhibits c-MYC promoter activity. Biochemistry 2010, 49, 9706–9714.
  97. Bywater, M.J.; Poortinga, G.; Sanij, E.; Hein, N.; Peck, A.; Cullinane, C.; Wall, M.; Cluse, L.; Drygin, D.; Anderes, K. Inhibition of RNA polymerase I as a therapeutic strategy to promote cancer-specific activation of p53. Cancer Cell 2012, 22, 51–65.
  98. Xu, H.; Di Antonio, M.; McKinney, S.; Mathew, V.; Ho, B.; O’Neil, N.J.; Santos, N.D.; Silvester, J.; Wei, V.; Garcia, J. CX-5461 is a DNA G-quadruplex stabilizer with selective lethality in BRCA1/2 deficient tumours. Nat. Commun. 2017, 8, 14432.
  99. Leonetti, C.; Scarsella, M.; Riggio, G.; Rizzo, A.; Salvati, E.; D'Incalci, M.; Staszewsky, L.; Frapolli, R.; Stevens, M.F.; Stoppacciaro, A. G-quadruplex ligand RHPS4 potentiates the antitumor activity of camptothecins in preclinical models of solid tumors. Clin. Cancer Res. 2008, 14, 7284–7291.
  100. Local, A.; Zhang, H.; Benbatoul, K.D.; Folger, P.; Sheng, X.; Tsai, C.-Y.; Howell, S.B.; Rice, W.G. APTO-253 Stabilizes G-quadruplex DNA, Inhibits MYC Expression, and Induces DNA Damage in Acute Myeloid Leukemia Cells. Mol. Cancer Ther. 2018, 17, 1177–1186.
  101. Zidanloo, S.G.; Hosseinzadeh Colagar, A.; Ayatollahi, H.; Raoof, J.-B. Downregulation of the WT1 gene expression via TMPyP4 stabilization of promoter G-quadruplexes in leukemia cells. Tumor Biol. 2016, 37, 9967–9977.
  102. Tauchi, T.; Shin-Ya, K.; Sashida, G.; Sumi, M.; Nakajima, A.; Shimamoto, T.; Ohyashiki, J.H.; Ohyashiki, K. Activity of a novel G-quadruplex-interactive telomerase inhibitor, telomestatin (SOT-095), against human leukemia cells: Involvement of ATM-dependent DNA damage response pathways. Oncogene 2003, 22, 5338–5347.
  103. Liu, J.; Deng, R.; Guo, J.; Zhou, J.; Feng, G.; Huang, Z.; Gu, L.; Zeng, Y.; Zhu, X. Inhibition of myc promoter and telomerase activity and induction of delayed apoptosis by SYUIQ-5, a novel G-quadruplex interactive agent in leukemia cells. Leukemia 2007, 21, 1300–1302.
Figure 1. G-quadruplex structure. (a) G-tetrad structure that forms G-quadruplexes. Hydrogen bonds between four guanines form a planar ring (tetrad), and stacked tetrads form the quadruplex. (b) Sequence of a G4 with multiple guanine tetrads. Here, 4:1:1 and 5:2:1 refer to the Quadparser results, reported as number of tetrads: total G4 sequences: non-overlapping G4 sequences. Red line: first occurrence of a G4-forming sequence; green line: alternative G4-forming sequence.
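The tetrads: total: non-overlapping triplet in the caption can be illustrated with a short Python sketch. It is not Quadparser itself. It assumes the commonly used folding rule of four G-runs of equal length separated by loops of 1 to 7 bases, and it assumes that "total" means one candidate per start position; the function names `g4_regex` and `count_g4` and the example sequence are hypothetical.

```python
import re


def g4_regex(tetrads, max_loop=7):
    """Pattern for four G-runs of length `tetrads` joined by loops of 1..max_loop bases."""
    run = "G{%d}" % tetrads
    loop = "[ACGT]{1,%d}" % max_loop
    return run + (loop + run) * 3


def count_g4(seq, tetrads):
    """Return (total, non_overlapping) candidate counts for one tetrad number."""
    pattern = g4_regex(tetrads)
    # A lookahead is zero-width, so every start position is tested and
    # overlapping candidates are all counted (one per start position).
    total = sum(1 for _ in re.finditer("(?=(%s))" % pattern, seq))
    # A plain scan consumes each hit, so only non-overlapping candidates remain.
    non_overlapping = sum(1 for _ in re.finditer(pattern, seq))
    return total, non_overlapping


if __name__ == "__main__":
    example = "GGGGAGGGGTGGGGAGGGG"  # hypothetical sequence with four runs of GGGG
    for tetrads in (3, 4):
        total, distinct = count_g4(example, tetrads)
        print("%d:%d:%d" % (tetrads, total, distinct))
```

Counting per start position is only one possible definition of "total"; a tool that enumerates every loop arrangement would report larger numbers for the same sequence.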
Figure 2. Process for identifying and characterizing G-quadruplex families. (a) Workflow diagram for identifying distinct G-quadruplex families. (b) Process for identifying appropriate profiles for a specific family. Here, S1, …, Sn are the lists of sequences generated from HMM profiles P1, …, Pn, respectively. We compute the average log odds of input S1 against each profile P1, …, Pn and repeat this for every sequence list. For each row, the diagonal element is compared with the non-diagonal values (log odds) using a Wilcoxon rank sum test with null and alternative hypotheses H0: T1–T2 = 0 and H1: T1–T2 > 0. (c) Profile HMM derived from a selected G4 alignment. Match states are represented as rectangles with the four residue emission probabilities indicated as black bars, insert states (I) as diamonds, and delete states as circles. The start and end states are B (begin) and E (end), respectively. Delete states are silent states with no emission probabilities, and weighted lines represent the transition probabilities between states.
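The one-sided comparison in panel (b) can be sketched in plain Python. This is a minimal illustration rather than the code used in the study. It uses midranks for ties and a normal approximation without tie or continuity correction, and the log-odds values in the example are invented for demonstration. The function name `rank_sum_greater` is hypothetical.

```python
import math


def rank_sum_greater(x, y):
    """One-sided Wilcoxon rank-sum test of H1: values in x tend to exceed values in y.

    Returns (U statistic for x, approximate upper-tail p-value).
    """
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z) for a standard normal
    return u, p


if __name__ == "__main__":
    own_profile = [12.1, 10.4, 11.8, 13.0]        # hypothetical log odds of S1 under P1
    other_profiles = [3.2, 4.5, 2.9, 5.1, 4.0]    # hypothetical log odds of S1 under P2..Pn
    u, p = rank_sum_greater(own_profile, other_profiles)
    print("U = %.1f, one-sided p = %.4f" % (u, p))
```

With samples this small an exact test would be preferable; the normal approximation is used here only to keep the sketch dependency-free.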
Figure 3. Thermodynamic properties for Family 4. (a) Centroid secondary structure with a minimum free energy of −9.64 kcal/mol using the consensus sequence of the family. (b) Dot-bracket notation showing the secondary structure. (c) Sequence logo representing the per base information content. (d) Electrostatic potential generated from all the sequences of the family using 10 flanking bases on either side of the identified G4. (e) Dot plot showing the substructures with the highest probabilities.
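Panel (b) of this figure and of the following family figures uses dot-bracket notation. As a reading aid, the sketch below converts such a string into base-pair indices. It assumes the plain form of the notation with only "(", ")", and "." symbols; any extra symbols a folding tool may emit are rejected rather than interpreted. The function name and the example string are hypothetical.

```python
def dot_bracket_pairs(structure):
    """Return sorted 0-based (i, j) index pairs for matching parentheses."""
    stack = []
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)            # opening base waits for its partner
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            pairs.append((stack.pop(), pos))
        elif symbol != ".":              # '.' marks an unpaired base
            raise ValueError("unexpected symbol %r at position %d" % (symbol, pos))
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return sorted(pairs)


if __name__ == "__main__":
    print(dot_bracket_pairs("..((...)).."))
```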
Figure 4. Thermodynamic properties for Family 32. (a) Centroid secondary structure with a minimum free energy of −18.0 kcal/mol using the consensus sequence of the family. (b) Dot-bracket notation showing the secondary structure. (c) Sequence logo representing the per base information content. (d) Electrostatic potential generated from all the sequences of the family using 10 flanking bases on either side of the identified G4. (e) Dot plot showing the substructures with the highest probabilities.
Figure 5. Thermodynamic properties for Family 75. (a) Centroid secondary structure with a minimum free energy of −22.82 kcal/mol using the consensus sequence of the family. (b) Dot-bracket notation showing the secondary structure. (c) Sequence logo representing the per base information content. (d) Electrostatic potential generated from all the sequences of the family using 10 flanking bases on either side of the identified G4. (e) Dot plot showing the substructures with the highest probabilities.
Figure 6. Thermodynamic properties for Family 80. (a) Centroid secondary structure with a minimum free energy of −17.38 kcal/mol using the consensus sequence of the family. (b) Dot-bracket notation showing the secondary structure. (c) Sequence logo representing the per base information content. (d) Electrostatic potential generated from all the sequences of the family using 10 flanking bases on either side of the identified G4. (e) Dot plot showing the substructures with the highest probabilities.
Figure 7. Summary of enriched GO terms for select families as determined by the GOProfiler and simplifyEnrichment R packages.
Figure 8. Example sequences with multiple tetrads. (a) G-quadruplex sequence from chr19:43479561-43479598 overlapping the PHLDB3 gene with guanines labeled in red. (b) Three possible alternate G4 regions for the PHLDB3 region. (c) MFE structure for the PHLDB3 region. (d) G-quadruplex sequence from chr17:81432609-81432932 overlapping the BAHCC1 gene with guanines labeled in red. (e) MFE structure for the BAHCC1 gene.
Table 1. Cluster summary based on different clustering techniques.
Method | Number of Sequences | No. of Clusters | No. of Sequences in 2 Largest Clusters | HMM Clusters (Sequences) | HMM Families, 1st Iteration (Sequences) | Final Families Selected (Sequences)
Starcode + BLAST with hierarchical clustering | 29,112 | 2717 | 419, 323 | 95 (842) | |
DNAclust | 9610 (4664) | 587 | 142, 126 | 31 (1165) | |
Cd-hit (kmer 8) | 6335 | 786 | 182, 115 | 30 (389) | |
Meshclust | 14,222 | 508 | 1720, 1410 | 72 (1843) | |
Total | | | | | 220 (3888) | 95 (2145)
Table 2. Summary of count of G4 sequences identified using predictive models pHMM across different clusters, genes, and chromosomes.
Family | Training G4s | Training Chrs | Training Distinct Sequences | Training Associated Genes | Consensus Using Training Sequences | Predicted G4s | Predicted Chrs | Predicted Distinct Sequences | Predicted Associated Genes
11512514GGGGTGGGTGGGGAGGG64324118468
28578--GGGARKGGSCTGGGACAGGG25132529
310667GGGAGGGGGCTGCWGGGATGGGGG27022257219
49789-GGGCTGGG-GMGGGAAGGAGAGGG1062210695
58677GGGKKGGGGWGAATRGGGCAYGGG-35523341271
68667-GGGGKCTCAGGGGCTGGGCAGRGGG21323200183
77777-GGGC-CCSKGGGCDGSGRGGMRGGG63624614564
87777GGG-MCTTGGGGGTKGGGASAA--GGG-37623369311
9109810-GGGSTGGGGAGGGTGGG35023136276
1020151020GGGGTGGGGGTGGGAGGG26123107187
111512815GGGRGKKKGGTGGGAGGG1642384132
121791717GGGGC-CWGGG-TGGGA-AAGGG-34724330289
1364206262---GG-RWGGGCYKGG-GGGCWGGG14322125125
1452205051-GGGRCGGGGCAGGGG-TG-GGG16324153140
151391313GGGRRAWRGGGTGGGAGGG15122116121
168788GGGGATKDG-GGGAGGGAGGG15223134113
1723111623GGGAAGGG---TCAGGG-CCAGGG31222293286
1814111214GGGTGGGTGGGGKMAGGG43923242345
198888GGGCCMMGGGCTGGGGCAGGG59195963
208678GGGWDGGSMRGGGCM--CAAGGG42123414343
217677GGGGC-AGGGGCAGGGDGTGAGGGG13023120101
228688-GGGCYAGGGT-TGGGWRAGGG60224944
2316323118122--GGGTKG--GKGRWG-GGRTGGGGG79424555603
2435193435GGGGGYRGGGSWGGGGWGGG107219184
2539183237GGGRR-GGG-RTGGGG--CCKGGGG43423418365
269799-GGGGBWGGGGKSAGGGWGGG69196749
271191111-GGG-GCTGGGRMCWGGGCWGGG1132210798
2879217979GGGGA-WGGGMARGGGY-RGGG87218367
2918151718GGGSHWGGGGGGKGGGRGGG1082110398
301261212GGGGKRKGGGKMWGGGKGGG20923180174
3144184344GGGGMRGGGGKKGGGGTGGG107239488
3290238588GGGSTGGGGKKGGGGSWGGG16422146130
3311122102108GGGCTG-------GGGCKGGG--SCWGGG21022184160
349689GGGAATGGGGGGTGGGGG-GGGG101229870
3525162525-GGGCA---GG-GGAGGGMYAGG-----GG17922173148
3652204648-GG--GCCTKGGGG---WGGGAGGG-54023497439
377657-GGGSCAGGGCCAGGGCCAGGG13722125124
387777GGGGYGGGGGR-CAGGGCCAGGG20723200199
391281112GGGGAGRGTGGG-MAGGGTGGG14524143111
4021131620GGGYTGGGRA-TGGGTGGG48923289348
411181011GGGM-CAGGGYKSSGGSSAGGG100229988
4217131717GGGA-GGGAGGGRAACYYSRGG-53423522415
4317111717GGGGCCYGGGCCTGGGGAGGG68226473
449699GGGC-YAGA-GGGTGGGYWGGG15122141125
4528122828-GGGSKK-KGGGCAGGGG--CAGGGG-20723196151
468788-GG-GKTGGGGGMWGGGRGGRGGG83217761
4721131720--GGGGTGGGA--GGGATGGYGGGG-13421118101
4821132121-GG-GRTTGGGGGT-GG-GG-RTGGGG77624724547
4929101212-GGGGGCAGGGCYGGG-GCTGGG54214443
5032193032-GGGAGAGGGT--TKGGKGR--AGGG27123252221
51127910-GGGGTGGGCAGGGMAGMYTGGG14124136118
529899GGGCCCCSGGGGCGGGGCGGG26524264309
5356195450--GGGDGT-G-G-GSGG-AGGGAGGG--15522145127
5433183133GGG-CTCR-GG-RMAGGG-CTGGG21424206196
5521162121-GGGYR-GGGGTGG-GGGGC---RGGG11123110112
569799-GGGGTGGGGTKGGGG-GKRGAGGG33224319258
571411914GGGSC-GGGGCGGGCGGGG31423164328
582791514-GGGCTGGGKGRGGGGA-GCAGGG15523132110
5944164444GGG-SAGGGC-KGGGADRGGGG26523247226
608788-GGGGGTGGGGG--RRWGGGSAGGG1242111598
6110198GGGACTYRTGGGCTTTGGGCCAAGGG--10621105106
6210486GGGGAGACTGGGGAGGCCGGGGYRGAAGGGG73206445
639724197GGGAGGGAGGGAGGG313231204
643191612-GGGGTGKG-GGGGGGRMSGGGG54174229
651611915GGG-GARTGGGCYGGGATGGG-97218672
6658214953-GG-----STGGG--CCYTG--GGK-TG--GGG26823260236
676466GGGGTGGG-CATGGGAG-GCAGGG-21423200171
681311212-GGGGAGG-GGGGTGCCCTGGGTTGGG-13820118119
69117811GGGCAW-GAGGG-A-G-GGKTGGG1292211999
7019111416GGGRGKTGGGTGGGGGTGGG20223155161
716555GGGGAAGGGACAGGGGMMRGGG16223157157
72108810GGGSWG-CAGGG---AGGGCTGGG-20622188158
7312101112GGGTG-GGGTGGGGK-KRGATGGG-94723917664
741281211GGGTGGGGRCAAGGGTRGGG14222129119
7518101316-GG-GGTGGGA-GGGCMKGGG34323180265
766466GGGGTGGGTGGGG-RATGAGGGG45124420329
7719131919-GGRRWGGGGRA--ARGAGGGAGGG29623290223
781091010GGGGAMT-TGGGGGKGGGG-GGG32924321268
791302299128-GGGMGGGG-CGGGGCG--GGG71224400677
8021122121GGG-GCGGGSC---SSGGGGGMGGG-40623389418
8113101313GGGGRAGGG-T-GGGCTTTGGGG34723329270
8238201334GGGCAGGGCAGGG-CAGGG39124211284
83105810GGGT-CTGGGT--CTGGGTCWGGG-11623111102
846455-GGGGCCGGGGTGGGARGYGGG66216462
851281210-GGGKY-AGGGCCAGGGTGGGGG--53215042
868345GGGAGGGTCCWGGGGYTGGG12922116103
879697GGGSBCWGGGWS-AGGGAGGG73206967
881171111-GGGRGRCYTGGGTGGGGGGG-12022107103
89119611-GGGGTGGGGGTGGGGGGG43201240
9010839-GGGGTGGGGTGGGGGGG112231381
919999-GG-GGWGGGAGGGAARACKGGG-75217070
921371113GGGKT-GGGGAGGGGAWTWRGGG45123428367
939879GGGCCTGGGCYTGGGCYDGGG-26162525
9412101212GGGAMAGGGGGSAGGGGCRGGG86208680
958788----GGGGACAGGGRCA-GGGVCAGGG120218879
Table 3. G4 sequences identified in the genic regions associated with the plexin and semaphorin gene families with high similarity to G4 Families 17, 48, and 79.
Location | Sequence | Log Odds | Akaike Weight | Strand | Gene ID | Gene Symbol | Family
chr15:90204178-90204199 | GGGAGGGCACTAGGGCCCTGGG | 8.987 | 0.617 | + | 10509 | SEMA4B | 17
chr3:126991053-126991092 | GGGCAGGGCAGGCAGGGAAGGG | 10.584 | 0.892 | + | 5361 | PLXNA1 | 17
chr9:89440465-89440503 | GGGTAGGGCTCAGGGGCCAGGG | 14.015 | 0.996 | | 10507 | SEMA4D | 17
chr1:151141755-151141776 | GGGATGGGGGTTGGGGGGTGGG | 13.6 | 0.828 | | 10500 | SEMA6C | 48
chr15:47662210-47662233 | GGGGTGGGGGGTGAGGGGATGGGG | 11.857 | 0.994 | + | 80031 | SEMA6D | 48
chr3:129567938-129567973 | GGGTTGGGGTGGGGGGTGGGG | 12.652 | 0.772 | | 23129 | PLXND1 | 48
chr3:129588350-129588372 | GGGTGTCGGGGGTGGGGGAGGGG | 9.599 | 0.787 | | 23129 | PLXND1 | 48
chr3:122983446-122983465 | GGGCGGGGACGGGGCGGGG | 12.301 | 0.981 | | 54437 | SEMA5B | 79
chr3:129606851-129606910 | GGGCGGGGCCGGGGCGGGG | 14.216 | 0.916 | | 23129 | PLXND1 | 79
chr3:50276050-50276067 | GGGAGGGTCGAGGGCGGG | 6.415 | 0.677 | + | 7869 | SEMA3B | 79
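Tables 3 and 4 report an Akaike weight alongside each log-odds score. How the underlying AIC values were obtained from the profile HMM scores is not restated in these tables, so the sketch below only illustrates the standard definition of Akaike weights over a set of candidate models: each weight is exp(−Δ/2) divided by the sum of exp(−Δ/2) over all candidates, where Δ is the difference from the smallest AIC. The function name `akaike_weights` and the input AIC values are hypothetical.

```python
import math


def akaike_weights(aic_values):
    """Convert AIC values for competing models into weights that sum to 1."""
    best = min(aic_values)
    # Relative likelihood of each model given its distance from the best AIC.
    relative = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative)
    return [r / total for r in relative]


if __name__ == "__main__":
    # Hypothetical AIC values for one sequence scored against three family profiles.
    weights = akaike_weights([100.0, 102.0, 110.0])
    print([round(w, 3) for w in weights])
```

A weight close to 1 means that a single candidate family dominates the alternatives; weights spread across several families indicate an ambiguous assignment.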
Table 4. Family prediction for previously identified Pu27 family of G4 sequences.
Overall Sequence | Name | Minimum G4 Sequence | Length | Log Odds | Akaike Weight | Family
TGGGGAGGGTGGGGAGGGTGGGGAAGGPu27-c-MYCGGGGAGGGTGGGGAGGG176.990.891
GGGTGGGGAGGGTGGGG175.70.5940
GGGGAGGGTGGGGAAGG174.950.451
TGGGAGGTGGGGAGGAGGGTTGGGAAGGPu1--PLEKHG5GGGAGGTGGGGAGGAGGGTTGGG237.420.5348
TGGGAGGTGGGGAGGAGGGTTGGGAAGGGGGAGGAGGGTTGGGAAGG196.930.9415
TGGGGAGGGTGGGGAGGCCGGGPu1-2-MYBPHLGGGGAGGGTGGGGAGG162.410.531
TGGGGAGGGTGGGGAGGGTGGGPu3---GGGGAGGGTGGGGAGGG176.990.891
GGGTGGGGAGGGTGGG167.330.99
TGGGGAGGGTGGGGAGGGCGGGGPu3-SOX2GGGGAGGGTGGGGAGGG176.990.891
GGGAGGGTGGGGAGGG165.620.741
TGGGGAGGGTGGGGAGGGTGGTGAGGGT
GGGGAGGGGGAAGG
Pu5-GRM6GGGGAGGGTGGGGAGGG176.990.891
GGGAGGGTGGGGAGGG165.620.741
GGGGAGGGTGGTGAGGGTGGGG227.530.2676
GGGTGGTGAGGGTGGGGAGGGGG237.470.8273
TGGGGAGGGTGGGGAGGGTGGGGAGGGPu7-SDK1GGGGAGGGTGGGGAGGG176.990.891
GGGTGGGGAGGGTGGGG175.70.5940
GGGTGGGGAGGGTGGGGAAGPu9---GGGTGGGGAGGGTGGGG175.70.5940
GGGGAGGGTGGGGAGGGGATGGAAPu9-2BC022036GGGTGGGGAGGGGATGG175.850.3740
GGGAGGGTGGGGAGGGTGGGGAGGGPu10-1--GGGAGGGTGGGGAGGG165.620.741
GGGTGGGGAGGGTGGGG175.70.5940
GGGGAGGGTGGGGAGGG176.990.891
GGGTGGGGAGGGTGGGGAAGGPu10-2--GGGTGGGGAGGGTGGGG175.70.5940
GGGGAGGGTGGGGAAGG174.950.451
GGGGAGGAAGGGGAGGGTGGGGAGGGPu11NAV2GGGGAGGGTGGGGAGGG176.990.891
GGGAGGGTGGGGAGGG165.620.741
GAGGGTGGGGAGGGTGGATGAGGAAGGPu14SPTLC2GGGTGGGGAGGGTGG153.190.639
TGGGGAGGGTGGGGAGGGTGGPu16--GGGGAGGGTGGGGAGGG176.990.891
GGGAGGGTGGGGAGGG165.620.741
GAGGGTGGGGAGGGTGGGGAPu17--GGGTGGGGAGGGTGGGG175.70.5940
GGGGAGGGTGGGGAGGGAGCTGGGGAPu20-CDH4GGGGAGGGTGGGGAGGG176.990.891
GGGTGGGGAGGGAGCTGGGG204.010.4951
TGGGGAGGGTGGGGAGAGGCGGGGTGGGGAGGGPuX-TM4SF2GGGAGGGTGGGGAGAGG173.410.8318
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Neupane, A.; Chariker, J.H.; Rouchka, E.C. Structural and Functional Classification of G-Quadruplex Families within the Human Genome. Genes 2023, 14, 645. https://doi.org/10.3390/genes14030645
