A Computational Approach to Demonstrate the Control of Gene Expression via Chromosomal Access in Colorectal Cancer

Pecka, Caleb J.; Thapa, Ishwor; Singh, Amar B.; Bastola, Dhundy

doi:10.3390/biomedinformatics4030100

¹ College of IS&T, University of Nebraska at Omaha, Omaha, NE 68182, USA

² Department of Biochemistry and Molecular Biology, University of Nebraska Medical Center, Omaha, NE 68198, USA

* Author to whom correspondence should be addressed.

BioMedInformatics 2024, 4(3), 1822-1834; https://doi.org/10.3390/biomedinformatics4030100

Submission received: 2 July 2024 / Revised: 25 July 2024 / Accepted: 31 July 2024 / Published: 2 August 2024

(This article belongs to the Special Issue Editor's Choice Series for the Computational Biology and Medicine Section)

Abstract

Background: Improved technologies for chromatin accessibility sequencing such as ATAC-seq have increased our understanding of gene regulation mechanisms, particularly in disease conditions such as cancer. Methods: This study introduces a computational tool that quantifies and establishes connections between chromatin accessibility, transcription factor binding, transcription factor mutations, and gene expression using publicly available colorectal cancer data. The tool has been packaged using a workflow management system to allow biologists and researchers to reproduce the results of this study. Results: We present compelling evidence linking chromatin accessibility to gene expression, with particular emphasis on SNP mutations and the accessibility of transcription factor genes. Furthermore, we have identified significant upregulation of key transcription factor interactions in colon cancer patients, including the apoptotic regulation facilitated by E2F1, MYC, and MYCN, as well as activation of the BCL-2 protein family facilitated by TP73. Conclusion: This study demonstrates the effectiveness of the computational tool in linking chromatin accessibility to gene expression and highlights significant transcription factor interactions in colorectal cancer. The code for this project is openly available on GitHub.

Keywords:

colorectal cancer; epigenetics; chromatin accessibility; snakemake

1. Introduction

The regulatory mechanisms of gene expression play a critical role in cell differentiation and development, especially in disease conditions such as cancer. Transcription factors (TFs) have been shown to direct the regulation of genes by recognizing transcription factor binding sites (TFBSs) to initiate transcription of downstream genes [1]. TFs are incapable of initiating transcription if their binding site is condensed around a histone octamer structure called a nucleosome. Proteins in the cell can more easily interact with uncondensed chromatin, otherwise called accessible regions of DNA. In this paper, our goal was to design and develop a computational tool that demonstrated the interactions between chromatin access, TF binding, TF mutations, and gene expression using publicly available colorectal cancer (CRC) data.

Chromatin accessibility can be assessed using a variety of protocols, including DNase-seq [2] and ATAC-seq [3]. ChIP-seq [4], a protocol that analyzes TF-DNA interactions, can require hundreds of millions of cells as input [3], while the ATAC-seq protocol only requires a standard input of 50,000 cells, making the technique appropriate for research with precious cell types, including cancer cells [3]. The development of chromatin sequencing technology has made it more feasible for researchers to incorporate chromatin accessibility with the analysis of other gene regulation mechanisms. For example, the interaction between chromatin access and transcription has improved predictive models of gene expression based on HiChIP throughput data [5].

Studies have also demonstrated that there is a correlation between chromatin accessibility and gene expression. Pearson correlations have shown that accessible chromatin of promoter regions had similar correlation patterns with gene expression for both healthy and cancerous tissues [6]. Furthermore, it has been shown that inferred TF binding interactions are capable of predicting and differentiating cell types [6].

As researchers improve our ability to assess gene regulation mechanisms, there is an increased need for computational tools that can perform integrative analysis in an accurate and streamlined manner. There is a lack of tools that can perform analysis on the results of chromatin accessibility data in a user-friendly manner. User-friendly workflows like Galaxy [7] have evolved over the years to enable researchers to perform computational analysis quickly and easily in the bioinformatics domain. However, online web services like Galaxy are limited to what the host domain can offer, and restrict the user’s ability to easily expand or modify the workflow. Using a computational approach, we designed a reproducible workflow environment that requires minimal systems administration knowledge.

Our workflow uses chromatin accessibility data from The Cancer Genome Atlas (TCGA) to predict TFBSs based on motif sequences found in accessible chromatin regions [6]. These results were validated using a database of known TF motifs from JASPAR [8] as well as gene expression profiling data [6]. In addition to statistical validation, we have incorporated a dynamic track-based visualization system that clearly shows the interaction between chromatin accessibility, TF motif sequences, and the genes they regulate. The outputs from our workflow are designed to be compatible with common file formats used in other track-based visualization applications, including the UCSC Genome Browser. Code for this project is publicly available on GitHub (https://github.com/CalebPecka/ATAC-Seq-Pipeline/), accessed on 1 July 2024.

2. Materials and Methods

2.1. Project Overview and Reproducibility

A high-level overview of the project can be seen in Figure 1, including Data Preprocessing, Peak Calling (to identify accessible chromatin regions), Motif Identification (to identify putative TFBSs), Motif Comparison (to compare putative TFBSs against validated databases of TFBSs), Site Matching, Statistical Analysis, and Track-Based Visualization. The high-level overview is color-coded to correspond with individual scripts, inputs, and outputs reflected in the low-level overview given in Figure 2.

Our pipeline requires the user to input a set of binary alignment map (BAM) files for analysis. All other input files are either provided in the GitHub repository or automatically installed by the pipeline. For example, the hg38 human reference genome is automatically installed in a Snakemake script. The pipeline outputs files for the genomic location of accessible chromatin regions (Upstream Peaks), motif sequences identified by BCrank (BC Rank Consensus Sequences), and layered genomic visualizations (tracks.png). For full descriptions of these files and all available fields, please refer to our GitHub documentation.

We employed Snakemake [9] to automatically detect the progress of the workflow and run necessary code based on a configuration file that can be modified by the user, making it possible for most users to ignore the technical intricacies in our low-level overview. Snakemake automatically installs conda environments required for software dependencies. We have also enabled a parameter to configure the conda environment to perfectly replicate the dependency build used in this study. The perfectly reproducible configuration may require an adaptable installation script provided with the workflow.

To the best of our knowledge, there are no other methodologies or tools currently available that can provide a meaningful comparison with our pipeline.

2.2. Data Preprocessing and Indexing

ATAC-seq, RNA-seq, and SNP mutation data for 41 CRC patients were preprocessed by TCGA [6]. The hg38 human reference genome was used as a reference for the Bowtie2 alignment tool [6]. Samtools sorted the mapped reads and Picard removed duplicates, resulting in a set of 41 binary alignment map (BAM) files, one for each of the patient samples [6]. For patients with multiple ATAC-seq BAM files, our mutation data only contained one instance of each patient ID. Seven more patients were missing mutation data in TCGA, as shown in Table 1. The TCGA barcodes are provided as a table in Supplementary File S1. Preprocessing procedures were carried out by the TCGA study and are not included in the GitHub pipeline. The pipeline requires the user to input a BAM file for each sample.

To recreate the results from our study, TCGA BAM files can be downloaded from the TCGA-COAD study: https://portal.gdc.cancer.gov/projects/TCGA-COAD accessed on 1 July 2024. The exact BAM files and identifiers are also listed in Supplementary File S1. Once downloaded, Snakemake is the only required installation to run this pipeline. The following commands will install Snakemake in a computer environment:

conda activate base
conda create -c conda-forge -c bioconda -n snakemake snakemake
conda activate snakemake

To run the entire pipeline, use the following commands:

cd workflow
sh scripts/createCondaConfigurations.sh
snakemake ../results/{A,B,C}.results/matchingSiteTable.csv --use-conda

Here, A, B, and C stand for the BAM files you wish to analyze; any number of files may be listed. The pipeline must be run from within the “workflow” directory (where the Snakefile is located). The shell script (scripts/createCondaConfigurations.sh) will download all necessary dependencies in a series of conda environments. For more information, see the documentation on our GitHub repository: https://github.com/CalebPecka/ATAC-Seq-Pipeline/ accessed on 1 July 2024.

2.3. Peak Calling

MACS2 was used to identify DNA read fragment pileups, also called chromatin peaks [10]. Chromatin peaks are indicative of DNA regions where chromatin structure is accessible. Each peak contains a summit value, indicating the base-pair location where fragment pileup is highest [10]. The first step of our pipeline compiles a list of all peak summits located 100–1000 base-pairs upstream of all genes at the ENSG level.
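The upstream filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's own script: it assumes that distance is measured from each gene's transcription start site and that "upstream" is strand-aware, neither of which is stated explicitly in the text. The `Gene` and `Summit` records and the gene identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    gene_id: str  # hypothetical identifier, e.g. an ENSG-level ID
    chrom: str
    tss: int      # assumed reference point: transcription start site
    strand: str   # "+" or "-"


@dataclass(frozen=True)
class Summit:
    chrom: str
    pos: int      # base-pair position where fragment pileup is highest


def upstream_distance(gene, summit):
    """Distance from the TSS to the summit; positive values lie upstream."""
    if gene.strand == "+":
        return gene.tss - summit.pos
    return summit.pos - gene.tss


def upstream_summits(genes, summits, min_bp=100, max_bp=1000):
    """Return (gene_id, summit position, distance) for summits 100-1000 bp upstream."""
    hits = []
    for gene in genes:
        for summit in summits:
            if summit.chrom != gene.chrom:
                continue
            distance = upstream_distance(gene, summit)
            if min_bp <= distance <= max_bp:
                hits.append((gene.gene_id, summit.pos, distance))
    return hits
```

The nested loop is quadratic and only meant to make the rule explicit; a real implementation over all genes would more likely use an interval index or a genome-arithmetic tool.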

2.4. Motif Identification and Site Matching

The 50 base-pair region centered around each peak was extracted as a FASTA sequence using Biostrings and the hg38 reference genome for Homo sapiens [11]. For each of the 41 CRC samples, BCrank created a list of 1200 motifs from the upstream FASTA sequences [12]. BCrank requires the input FASTA sequences to be ordered according to confidence level. Our script automatically sorts the sequences according to q-score, the confidence level provided by MACS2. BCrank was also used to map those motifs onto the upstream FASTA sequences in order to obtain the location of putative TFBSs [12]. We note that this method is limited by an excess of false positives, a common problem with motif prediction tools.
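As an illustration of the two preparatory steps described above (taking a 50 bp window around each summit and ordering sequences by confidence), the following Python sketch may help. It is not the pipeline's R code; the 1-based inclusive coordinate convention, the `q_score` field name, and the assumption that a higher q-score means higher confidence are ours.

```python
def summit_window(summit_pos, width=50):
    """Return a 1-based inclusive (start, end) window of `width` bp around a summit.

    The window is clamped so that it never starts before position 1.
    """
    half = width // 2
    start = max(1, summit_pos - half)
    end = start + width - 1
    return start, end


def rank_by_qscore(peaks):
    """Order peaks from most to least confident (assumed: higher q-score first)."""
    return sorted(peaks, key=lambda peak: peak["q_score"], reverse=True)


# Hypothetical peak records; real values would come from the peak-calling output.
peaks = [
    {"name": "peak_1", "summit": 1000, "q_score": 5.0},
    {"name": "peak_2", "summit": 20000, "q_score": 42.0},
    {"name": "peak_3", "summit": 10, "q_score": 17.5},
]
ordered = [(p["name"], summit_window(p["summit"])) for p in rank_by_qscore(peaks)]
```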

2.5. Motif Comparison

For each sample, the list of 1200 motifs was reformatted to be compatible with MEME Suite tools [13]. A collection of known TFs from JASPAR was converted into a similar format [8]. TomTom searched for pattern matches between the position weighted matrices of our 1200 putative TFBSs against the known JASPAR collection [14]. The FDR p-value correction method was globally employed across the TomTom results for all samples, and non-significant results were removed from the merged list of all patients ($p_{\mathrm{adjust}} \leq 0.05$).
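The text does not name the specific FDR procedure. A common choice is the Benjamini-Hochberg step-up adjustment, and the sketch below assumes that procedure; it is an illustration of a global correction over pooled p-values, not a description of the pipeline's implementation.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def significant_indices(p_values, alpha=0.05):
    """Indices of results whose adjusted p-value is at most `alpha`."""
    adjusted = benjamini_hochberg(p_values)
    return [i for i, p in enumerate(adjusted) if p <= alpha]
```

Applying the correction "globally" would correspond to concatenating the p-values from every sample before calling `benjamini_hochberg`, rather than correcting each sample separately.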

2.6. Determination of Differentially Expressed Genes

The Data Driven Referencing (DDR) method [15] was employed to create a list of differentially expressed (DE) genes for the 31 non-duplicated TCGA CRC barcodes discussed in “Data preprocessing and indexing”. The DDR method normalizes gene expression levels into five tiers based on the relative expression of housekeeping genes in each sample [15]. We chose the DDR method over other gene expression analysis tools because the DDR normalization process is better suited for accounting for non-biological variabilities [15]. Fisher’s Exact Tests were used to determine which genes are enriched in cancer samples versus healthy samples [15]. DDR outputted a list of differentially upregulated and downregulated genes after performing the FDR p-value correction method and subsetting the results ($p_{\mathrm{adjust}} \leq 0.05$). The resulting DE genes as well as fold changes and p-values are provided as a Supplementary Table in Supplementary File S2.

2.7. Creation of Genome Tracks

Track-based visualizations were created using pyGenomeTracks [16,17]. Bigwig files were created to visualize chromatin accessibility using the bamCoverage program provided by deepTools [18]. Known TF motifs were collected from the JASPAR database [8]. Tracks containing known gene locations were imported from the hg38 reference genome.

3. Results and Discussion

3.1. Chromatin Accessibility across the CRC Genome

In the Peak Calling step, MACS2 returns a score that quantifies chromatin peak accessibility by comparing the fragment pileup relative to various background regions of fragment pileup at a maximum of 10,000 base pairs away [10]. In the original TCGA study by Corces et al., the researchers noted that the MACS score was problematic due to its variability across different datasets [6]. We hypothesized that this issue may not be relevant in our study because we focused on a specific cohort of CRC patients, whereas the TCGA study incorporated many different cohorts of different cancer types. The experiments performed in this subsection were intended to investigate this hypothesis.

We used a one-tailed Wilcoxon rank sum test to compare the mean accessibility scores across all samples in housekeeping genes versus non-housekeeping genes. Our findings revealed that the mean accessibility score for housekeeping genes is significantly higher than non-housekeeping genes (p ≤ 0.05). These results were visualized as density plots comparing the profiles of housekeeping vs non-housekeeping genes in Figure 3.
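For readers unfamiliar with the test, a one-tailed Wilcoxon rank sum comparison can be sketched as follows. This is a simplified version using the normal approximation without tie or continuity corrections, so its p-values may differ slightly from those of a statistics package; it is not the code used to produce the reported results.

```python
import math


def midranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_greater(x, y):
    """Return (U, p) for the alternative that x tends to be larger than y."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return u, p
```

In the setting above, `x` would hold the mean accessibility score of each housekeeping gene and `y` the scores of the non-housekeeping genes.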

In some situations, a gene will not have a mean accessibility score because MACS2 did not identify a statistically significant peak upstream of the gene in any of the 38 non-duplicated patient IDs (see Table 1). In these situations, we can quantify gene accessibility based on a second accessibility metric, whether or not a gene has an accessible upstream promoter region in each patient sample. Out of 3805 housekeeping genes [19], 128 were not identified as significant by MACS2. Our search list included 58,387 unique gene symbols, of which a total of 26,274 were not identified as significant by MACS2. Using a Fisher’s Exact Test, we concluded that the variance in chromatin accessibility can be partially explained by whether or not a gene is a housekeeping gene (p ≤ 0.05). The raw data for these statistical tests are provided as a Table in Supplementary File S3.
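The counts reported above are enough to rebuild the 2×2 contingency table behind this test. The sketch below does so under two assumptions that the text does not state: that all 3805 housekeeping genes are contained in the 58,387-gene search list, and that a one-sided test is appropriate. The one-sided Fisher p-value is computed as a hypergeometric lower tail in log space, because the value is too small for ordinary floating point.

```python
import math


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_hypergeom_pmf(k, total, successes, draws):
    return (log_choose(successes, k)
            + log_choose(total - successes, draws - k)
            - log_choose(total, draws))


def log10_lower_tail(k_obs, total, successes, draws):
    """log10 of P(X <= k_obs) for a hypergeometric X, via log-sum-exp."""
    k_min = max(0, draws - (total - successes))
    logs = [log_hypergeom_pmf(k, total, successes, draws)
            for k in range(k_min, k_obs + 1)]
    peak = max(logs)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logs))
    return log_sum / math.log(10)


total_genes = 58387          # unique gene symbols searched
no_peak_total = 26274        # genes with no significant upstream peak
housekeeping = 3805          # housekeeping genes (assumed to be in the search list)
housekeeping_no_peak = 128   # housekeeping genes with no significant peak

# Rows: housekeeping, non-housekeeping. Columns: no peak, peak.
table = [
    [housekeeping_no_peak, housekeeping - housekeeping_no_peak],
    [no_peak_total - housekeeping_no_peak,
     (total_genes - housekeeping) - (no_peak_total - housekeeping_no_peak)],
]
odds_ratio = (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])
log10_p = log10_lower_tail(housekeeping_no_peak, total_genes,
                           no_peak_total, housekeeping)
```

An odds ratio well below 1 indicates that housekeeping genes are much less likely than other genes to lack an accessible upstream peak, which is the direction of the effect described in the text.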

By definition, we expect housekeeping genes to be expressed in the cell at all times, as they are necessary for basic cellular functions. Therefore, we expect housekeeping genes to also have accessible promoter regions, as they constantly need to be transcribed. Our statistics support these assumptions and help to verify that the MACS2 accessibility score is a useful metric for quantifying chromatin accessibility. Therefore, in the following section (correlation between chromatin access and gene expression), we use the MACS2 accessibility score to correlate chromatin access with gene expression. In all other cases, we quantify accessibility based on our second accessibility metric.

3.2. Correlation between Chromatin Access and Gene Expression

For each gene, Pearson correlations were used to identify a relationship between MACS2 chromatin accessibility scores and the normalized HTSeq gene expression across all 38 patient samples. A one-tailed Wilcoxon rank sum test was used to compare the correlation coefficients for housekeeping and non-housekeeping genes, and we concluded that the correlation coefficients of non-housekeeping genes are significantly higher than those of housekeeping genes (p ≤ 0.05). The data for this observation have been provided in Supplementary File S4.
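The per-gene correlation step can be illustrated with a short Python sketch. The dictionary layout (gene name mapped to a list of per-sample values, in the same sample order for both inputs) is an assumption made for the example, and the gene names used in any call are placeholders.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant series
    return sxy / math.sqrt(sxx * syy)


def per_gene_correlations(accessibility, expression):
    """Correlate accessibility with expression for every gene present in both inputs."""
    return {
        gene: pearson_r(accessibility[gene], expression[gene])
        for gene in accessibility
        if gene in expression
    }
```

The resulting coefficients for two gene groups could then be compared with a rank sum test, as described in the paragraph above.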

Similar tests were performed to compare the correlation coefficients for CRC biomarker genes (determined using the DDR method, a tool previously developed in our lab) and non-biomarker genes [15]. A one-tailed Wilcoxon test suggests that correlation coefficients between chromatin access and gene expression are expected to be higher in our list of differentially expressed genes acquired from DDR (p ≤ 0.05). We can interpret this result to explain the mechanisms of differential gene expression. Genes which have differential gene expression patterns can be closely tied to a respective increase or decrease in chromatin access.

3.3. Motif Comparison

The 1200 motifs produced by BCrank were a highly conservative estimate of the number of motifs necessary to describe the global pattern of TF binding in a patient. We chose 1200 motifs as it was comfortably larger than the total number of JASPAR transcription factor binding profiles (900). For each patient, BCrank calculated 100 motifs to describe a global optimum set. This procedure was repeated 12 times with a different seed generation each time. Any duplicate motif sequences were deleted. TomTom is a commonly used tool that identifies pattern matches between two sets of motif sequences. Using TomTom, we calculated that, on average, 68% of our predicted motifs had a statistically significant pattern match with known JASPAR motifs. The performance of this prediction across all patients has been plotted in Supplementary File S5.

The remaining 32% of identified motifs are not claimed to be definitive binding sites but are instead potential candidates for further validation. We do not incorporate these motifs into subsequent steps of analysis or validation.
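The bookkeeping behind these figures (pooling motifs from repeated runs, dropping duplicates, and measuring the share with a significant match) can be sketched as follows. The consensus strings and the set of matched motifs are hypothetical placeholders, not output from BCrank or TomTom.

```python
def pooled_unique_motifs(runs):
    """Merge motif lists from repeated runs, keeping first occurrences only."""
    seen = set()
    unique = []
    for run in runs:
        for motif in run:
            if motif not in seen:
                seen.add(motif)
                unique.append(motif)
    return unique


def matched_fraction(motifs, significant_matches):
    """Fraction of motifs that have a significant match to a known profile."""
    if not motifs:
        return 0.0
    matched = [motif for motif in motifs if motif in significant_matches]
    return len(matched) / len(motifs)


# Hypothetical consensus sequences from two seeded runs.
runs = [["CACGTG", "TTTSSCGC"], ["CACGTG", "GGGCGG", "TGASTCA"]]
motifs = pooled_unique_motifs(runs)
fraction = matched_fraction(motifs, {"CACGTG", "TTTSSCGC", "GGGCGG"})
```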

Using BCrank, these pattern matches can be mapped back to the hg38 reference genome as predicted locations of TFBSs. An example of these predictions is showcased in Figure 4. These visualizations clearly illustrate the predictive capabilities of the pipeline we developed for matching known and putative motifs within accessible chromatin regions of promoter regions that regulate gene expression.

3.4. Integrated Mutation Data and Interaction of Gene Regulation Mechanisms

To model the interaction of gene regulation mechanisms, we employed Sankey diagrams as shown in Figure 5, Figure 6, Figure 7 and Figure 8. These models showcase the epigenetic interactions in a subset of TFs (accessible and inaccessible, for example). Each TF is classified in terms of its DE pattern, the presence or absence of SNP mutations, and the DE patterns in the TF’s target genes. Raw data for these measurements are provided as a Supplementary Table in Supplementary File S6. SNP mutations can lead to structural changes in the mutated TF. In many cases, this behavior prevents TF binding to the promoter DNA sequence, causing downregulation in that TF’s target genes.

Verified target genes for TF regulation were identified using the Harmonizome CHEA Transcription Factor Targets data set [20]. The regulation of target genes was estimated by subtracting the percentage of differentially downregulated target genes in the DDR list from the percentage of differentially upregulated target genes. If the TF target genes were not found in the list of DE genes from DDR, they were not included in the final column of the Sankey diagram. A list of TF genes and their targets used in this analysis is provided as a table in Supplementary File S7. From these results, we observed that accessible TF genes (Figure 5) were more likely to regulate DE genes than inaccessible TF genes (Figure 6). Furthermore, we observed that accessible TF genes were more likely to produce DE TFs than inaccessible TF genes.
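The target-regulation estimate described above amounts to a difference of two percentages. The sketch below is one possible reading: it assumes that both percentages are taken over all verified targets of the TF, which the text does not specify, and it returns `None` when no target appears in the DE list. Gene names are placeholders.

```python
def regulation_score(targets, up_genes, down_genes):
    """Percentage of upregulated targets minus percentage of downregulated targets.

    Returns None when the TF has no targets or none of them is differentially
    expressed, mirroring the exclusion from the final Sankey column.
    """
    targets = list(targets)
    if not targets:
        return None
    n_up = sum(1 for gene in targets if gene in up_genes)
    n_down = sum(1 for gene in targets if gene in down_genes)
    if n_up == 0 and n_down == 0:
        return None
    return 100.0 * (n_up - n_down) / len(targets)


score = regulation_score(
    ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    up_genes={"GENE_A", "GENE_B"},
    down_genes={"GENE_C"},
)
```

A positive score suggests net upregulation of the TF's targets and a negative score net downregulation.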

Similar diagrams were produced to compare nonmutated TFs (Figure 7) and mutated TFs (Figure 8). In general, mutated genes were more likely to produce differentially downregulated targets. We also observe that upregulated and accessible TFs are more likely to produce downregulated target genes if the TF was mutated, as compared to the nonmutated group. We expected this result because conformational changes to the TF structure prevent it from correctly binding to the TFBS of promoter regions in the TF’s target genes. These expected behaviors help us verify that our pipeline is accurately explaining gene regulation mechanisms and has the potential to be applied to other data sources to understand the origin of disease conditions.

Deviations in the relationship between gene expression and TF accessibility could be explained by gene expression mechanisms unrelated to TFs. For example, gene expression is globally increased in larger cell volumes, rather than on a gene-to-gene basis [21]. RNA polymerase II holoenzyme expression scales with cell volume [21], possibly explaining the mechanism of global gene expression increase, even in genes with downregulated transcription factors. Indeed, experimental models of distributions of gene expression profiles have been improved using exponential scaling of cell volume [22], as well as other non-TF related gene expression mechanisms including dosage compensation, the exponential rate of mRNA maturation, and the first-order kinetics rate of mRNA decay [22]. Depletion of the cohesin complex and CTCF has been experimentally shown to both upregulate and downregulate gene expression, explained by a variety of mechanisms such as CTCF’s direct binding to the gene’s promoter region [23].

It is also important to recognize that mutations in TFs do not guarantee a loss in gene expression. For example, mutations in the region of the TF that binds promoter DNA, or in regions that form complexes with other transcription mediators, are far more likely to compromise the TF’s functionality than mutations elsewhere. To further complicate matters, it has been shown that TF families with similar binding site regions are able to substitute for mutated TFs, reducing the mutation’s impact on overall gene expression [24]. In the future, the overall effectiveness of our tool could be greatly improved by thoroughly investigating and categorizing the impact of mutations on TFs.

3.5. Gene Regulation Mechanisms in CRC

To better understand gene regulation mechanisms in CRC, we mapped accessible, upregulated TFs to their target genes using the TRRUSTv2 transcription factor database [25]. TRRUSTv2 was chosen because it is a manually curated list that includes metadata for whether the target genes are activated or repressed [25]. The data were further subset to only include TFs that activate differentially upregulated genes or TFs that repress differentially downregulated genes.
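The subsetting rule described above can be expressed as a small filter over (TF, target, mode) records. The sketch assumes the interaction mode is recorded as the strings "Activation" and "Repression"; the record layout and the placeholder edge `TF_X`/`GENE_Y` are illustrative, while the named interactions are those discussed in this section.

```python
def consistent_edges(edges, candidate_tfs, up_genes, down_genes):
    """Keep edges where an activator targets an upregulated gene or a
    repressor targets a downregulated gene, for the given candidate TFs."""
    kept = []
    for tf, target, mode in edges:
        if tf not in candidate_tfs:
            continue
        if mode == "Activation" and target in up_genes:
            kept.append((tf, target, mode))
        elif mode == "Repression" and target in down_genes:
            kept.append((tf, target, mode))
    return kept


edges = [
    ("E2F1", "BIRC5", "Activation"),
    ("E2F1", "TP53", "Repression"),
    ("TP73", "BBC3", "Activation"),
    ("TF_X", "GENE_Y", "Unknown"),  # placeholder edge that should be dropped
]
kept = consistent_edges(
    edges,
    candidate_tfs={"E2F1", "TP73", "TF_X"},
    up_genes={"BIRC5", "BBC3", "TP73"},
    down_genes={"TP53"},
)
```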

As seen in Table 2, a total of 39 transcription factors were identified including CEBPB, the E2F family (E2F1, E2F3, E2F6), the FOX family (FOXA2, FOXL1), MYC, MYCN, and TP73. Transcriptional regulation of BIRC5 via E2F1 has been shown to contribute to the pathogenesis of colorectal cancer [26]. Our data similarly observe that the availability of E2F1 in CRC patients is contributing to the activation of BIRC5, which is also upregulated in CRC.

Notably, several upregulated TFs in CRC are known to interact with apoptosis-regulating genes. E2F1 and MYC both upregulate TP73 (see Table 2). Supporting our observation, TP73 has been shown to block transactivation of TP53, further preventing mechanisms of apoptosis [27]. Additionally, E2F1 and MYCN have repressive interactions with the downregulated TP53 gene.

TP73, BBC3, and PMAIP1 were all found to be differentially upregulated in the CRC data. TP73 functions as a transcription factor that activates BBC3 and PMAIP1, both members of the BCL-2 protein family. Many members of the BCL-2 protein family have been targeted for CRC treatment due to their implications in apoptosis [28]. These observations showcase the potential of our pipeline to easily identify mechanisms of gene regulation in disease. In this case, the upregulation of E2F1 and MYC in colon cancer leads to the activation of TP73, further leading to the activation of apoptosis regulators.

4. Conclusions

In this paper, we presented a computational approach to quantify gene regulation mechanisms, including TF binding and chromatin accessibility. The advancement of chromatin accessibility technologies like ATAC-seq presents an exciting opportunity to increase our understanding of gene regulation mechanisms in various disease conditions. We validated our theoretical model by showing that there is a statistical relationship between chromatin access and gene expression data in CRC, especially in genes that encode TFs. We believe that this model has great potential to be applied to additional data sets to improve our understanding of the underlying mechanisms behind differential gene expression in other disease conditions, especially in the context of genetic mutations.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biomedinformatics4030100/s1, Supplementary File S1: TCGA Barcodes; Supplementary File S2: RNA-Seq Gene Expression Profiles; Supplementary File S3: MACS2 Statistics; Supplementary File S4: Housekeeping Gene Correlations; Supplementary File S5: JASPAR Correlations per Patient; Supplementary File S6: SNP Mutation Data; Supplementary File S7: Transcription Factor Interactions; Supplementary File S8: Chromatin Accessibility of Transcription Factor Families.

Author Contributions

Methodology and writing, C.J.P.; conceptualization, I.T. and D.B.; funding acquisition, A.B.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Fund for Undergraduate Scholarly Experience (FUSE) grant provided by The University of Nebraska at Omaha Office of Research and Creative Activity (https://www.unomaha.edu/office-of-research-and-creativeactivity/students/fuse.php), accessed on 1 July 2024. No grant numbers are assigned to this funding. Additionally, this work was supported in part by funds from a VA Merit award (BX002761) and National Institutes of Health R01 grant funding (DK124095; to ABS) (https://www.nih.gov/grants-funding), accessed on 1 July 2024. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available data from The Cancer Genome Atlas (TCGA) were used for this study. TCGA barcodes for all samples used can be found in Supplementary File S1: TCGA Barcodes.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Todeschini, A.L.; Georges, A.; Veitia, R.A. Transcription factors: Specific DNA binding and specific gene regulation. Trends Genet. 2014, 30, 211–219.
2. Song, L.; Crawford, G.E. DNase-seq: A high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harb. Protoc. 2010, 2010, pdb–prot5384.
3. Buenrostro, J.D.; Wu, B.; Chang, H.Y.; Greenleaf, W.J. ATAC-seq: A method for assaying chromatin accessibility genome-wide. Curr. Protoc. Mol. Biol. 2015, 109, 21–29.
4. Landt, S.G.; Marinov, G.K.; Kundaje, A.; Kheradpour, P.; Pauli, F.; Batzoglou, S.; Bernstein, B.E.; Bickel, P.; Brown, J.B.; Cayting, P.; et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012, 22, 1813–1831.
5. Schmidt, F.; Kern, F.; Schulz, M.H. Integrative prediction of gene expression with chromatin accessibility and conformation data. Epigenetics Chromatin 2020, 13, 4.
6. Corces, M.R.; Granja, J.M.; Shams, S.; Louie, B.H.; Seoane, J.A.; Zhou, W.; Silva, T.C.; Groeneveld, C.; Wong, C.K.; Cho, S.W.; et al. The chromatin accessibility landscape of primary human cancers. Science 2018, 362, eaav1898.
7. The Galaxy Community. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update. Nucleic Acids Res. 2022, 50, W345–W351.
8. Fornes, O.; Castro-Mondragon, J.A.; Khan, A.; Van der Lee, R.; Zhang, X.; Richmond, P.A.; Modi, B.P.; Correard, S.; Gheorghe, M.; Baranašić, D.; et al. JASPAR 2020: Update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2020, 48, D87–D92.
9. Molder, F.; Jablonski, K.P.; Letcher, B.; Hall, M.B.; Tomkins-Tinch, C.H.; Sochat, V.; Forster, J.; Lee, S.; Twardziok, S.O.; Kanitz, A.; et al. Sustainable data analysis with Snakemake. F1000Research 2021, 10, 33.
10. Zhang, Y.; Liu, T.; Meyer, C.A.; Eeckhoute, J.; Johnson, D.S.; Bernstein, B.E.; Nusbaum, C.; Myers, R.M.; Brown, M.; Li, W.; et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008, 9, R137.
11. Pages, H.; Aboyoun, P.; Gentleman, R.; DebRoy, S.; Pages, M.H.; IRanges, L. Biostrings: Efficient Manipulation of Biological Strings. R Package Version 2.72.1. Available online: https://bioconductor.org/packages/Biostrings (accessed on 1 July 2024).
12. Ameur, A. BCRANK: Predicting Binding Site Consensus from Ranked DNA Sequences. Available online: https://bioconductor.org/packages/release/bioc/manuals/BCRANK/man/BCRANK.pdf (accessed on 1 July 2024).
13. Bailey, T.L.; Johnson, J.; Grant, C.E.; Noble, W.S. The MEME suite. Nucleic Acids Res. 2015, 43, W39–W49.
14. Gupta, S.; Stamatoyannopoulos, J.A.; Bailey, T.L.; Noble, W.S. Quantifying similarity between motifs. Genome Biol. 2007, 8, R24.
15. Zhang, L.; Thapa, I.; Haas, C.; Bastola, D. Multiplatform biomarker identification using a data-driven approach enables single-sample classification. BMC Bioinform. 2019, 20, 601.
16. Ramírez, F.; Bhardwaj, V.; Arrigoni, L.; Lam, K.C.; Grüning, B.A.; Villaveces, J.; Habermann, B.; Akhtar, A.; Manke, T. High-resolution TADs reveal DNA sequences underlying genome organization in flies. Nat. Commun. 2018, 9, 189.
17. Lopez-Delisle, L.; Rabbani, L.; Wolff, J.; Bhardwaj, V.; Backofen, R.; Grüning, B.; Ramírez, F.; Manke, T. pyGenomeTracks: Reproducible plots for multivariate genomic datasets. Bioinformatics 2021, 37, 422.
18. Ramírez, F.; Ryan, D.P.; Grüning, B.; Bhardwaj, V.; Kilpert, F.; Richter, A.S.; Heyne, S.; Dündar, F.; Manke, T. deepTools2: A next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 2016, 44, W160–W165.
19. Eisenberg, E.; Levanon, E.Y. Human housekeeping genes, revisited. Trends Genet. 2013, 29, 569–574.
20. Lachmann, A.; Xu, H.; Krishnan, J.; Berger, S.I.; Mazloom, A.R.; Ma’ayan, A. ChEA: Transcription factor regulation inferred from integrating genome-wide ChIP-X experiments. Bioinformatics 2010, 26, 2438–2444.
21. Padovan-Merhar, O.; Nair, G.P.; Biaesch, A.G.; Mayer, A.; Scarfone, S.; Foley, S.W.; Wu, A.R.; Churchman, L.S.; Singh, A.; Raj, A. Single mammalian cells compensate for differences in cellular volume and DNA copy number through independent global transcriptional mechanisms. Mol. Cell 2015, 58, 339–352.
22. Cao, Z.; Grima, R. Analytical distributions for detailed models of stochastic gene expression in eukaryotic cells. Proc. Natl. Acad. Sci. USA 2020, 117, 4682–4692.
23. Zuin, J.; Dixon, J.R.; van der Reijden, M.I.; Ye, Z.; Kolovos, P.; Brouwer, R.W.; van de Corput, M.P.; van de Werken, H.J.; Knoch, T.A.; van IJcken, W.F.; et al. Cohesin and CTCF differentially affect chromatin architecture and gene expression in human cells. Proc. Natl. Acad. Sci. USA 2014, 111, 996–1001.
24. Lu, R.; Rogan, P.K. Transcription factor binding site clusters identify target genes with similar tissue-wide expression and buffer against mutations. F1000Research 2018, 7, 1933.
25. Han, H.; Cho, J.W.; Lee, S.; Yun, A.; Kim, H.; Bae, D.; Yang, S.; Kim, C.Y.; Lee, M.; Kim, E.; et al. TRRUST v2: An expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res. 2018, 46, D380–D386.
26. Xu, F.; Xiao, Z.; Fan, L.; Ruan, G.; Cheng, Y.; Tian, Y.; Chen, M.; Chen, D.; Wei, Y. RFWD3 Participates in the Occurrence and Development of Colorectal Cancer via E2F1 Transcriptional Regulation of BIRC5. Front. Cell Dev. Biol. 2021, 9, 675356.
27. Grob, T.; Novak, U.; Maisse, C.; Barcaroli, D.; Lüthi, A.; Pirnia, F.; Hügli, B.; Graber, H.; De Laurenzi, V.; Fey, M.; et al. Human ΔNp73 regulates a dominant negative feedback loop for TAp73 and p53. Cell Death Differ. 2001, 8, 1213–1223.
28. Ramesh, P.; Medema, J.P. BCL-2 family deregulation in colorectal cancer: Potential for BH3 mimetics in therapy. Apoptosis 2020, 25, 305–320.

Figure 1. High-level overview of the project pipeline. Data from Raw FASTQ Reads, TCGA reference genome, JASPAR Transfac motifs, and Data Driven Referencing (DDR) Expression are processed by a series of scripts that lead to a Track-Based Visualization and Statistical Analysis.

Figure 2. Low-level overview of the project pipeline. Inputs and commands are color-coded to match the high-level overview shown in Figure 1. Processes in the dotted region are not included in our GitHub repository, but they are necessary steps in any ATAC-seq analysis.

Figure 3. Density plot of the joint distribution of chromatin accessibility and gene expression. The x-axis displays MACS2 scores for chromatin accessibility. The y-axis displays Log2 fold changes for gene expression. Results are compared for housekeeping vs non-housekeeping genes. Mean chromatin accessibility is significantly higher in housekeeping genes.

Figure 4. Example track-based visualization of the SLC04A1 gene. The x-axis displays chromosomal position. The top track (Peak Chromatin Accessibility) shows read fragment pileup, which quantifies chromatin accessibility. The blue box track beneath it (Narrow Peaks) shows the peak regions called by MACS2, along with each peak’s summit location (black tick marks in each box). The GTF genes track shows the reference genome annotation, with a small directional arrow indicating whether a gene is transcribed on the positive or negative strand. The BCRANK Motifs track shows our predicted motifs mapped onto the reference genome. The Known JASPAR TFBS track shows how closely our predicted motifs line up with validated TFBS resources.

Figure 5. Sankey diagram for subset of TFs with accessible chromatin promoter regions. A subset of this TF list is categorized as differentially and non-differentially expressed (Non DE). These data are further categorized into mutated vs non-mutated TFs. Finally, their target genes are categorized in the right-most column as generally upregulated vs downregulated.

Figure 6. Sankey diagram for subset of TFs with inaccessible chromatin promoter regions. These TFs are more likely to be Non DE and produce fewer DE target genes than the accessible TFs in Figure 5.

Figure 7. Sankey diagram for subset of TFs without mutations. TFs that are upregulated have a roughly equal likelihood of producing a differentially upregulated target gene or a downregulated target gene.

Figure 8. Sankey diagram for subset of TFs with mutations. The subset of TFs that is upregulated and accessible is far more likely to produce a downregulated target when the structural features of the TF are compromised by gene mutation.

Table 1. List of data subsets and the respective number of samples in each group.

Data Subset	Number of Samples
ATAC-seq BAM files	41
Non-duplicated patient IDs	38
Non-duplicated patient IDs with corresponding mutation data	31

Table 2. Notable TF-target gene interactions.

TF	Target Genes	Interaction Type	Target ED	Target Pval Adjust
E2F1	BIRC5	Activation	1.307	$1.46 \times 10^{-16}$
CEBPB	CXCL8	Activation	1.881	$2.12 \times 10^{-20}$
CEBPB	GDF15	Activation	2.050	$2.47 \times 10^{-31}$
FOXA2	MMP7	Activation	2.056	$4.00 \times 10^{-30}$
FOXA2	ABCA1	Repression	−1.386	$3.32 \times 10^{-18}$
FOXL1	BMP4	Activation	1.381	$1.14 \times 10^{-15}$
E2F1	TP73	Activation	0.355	$1.66 \times 10^{-6}$
E2F1	TP53	Repression	−0.207	$3.28 \times 10^{-6}$
MYC	TP73	Activation	0.355	$1.66 \times 10^{-6}$
MYCN	TP53	Repression	−0.207	$3.28 \times 10^{-6}$
TP73	BBC3	Activation	0.805	$7.35 \times 10^{-10}$
TP73	PMAIP1	Activation	0.978	$6.02 \times 10^{-27}$

List of DE target genes regulated by TFs that were found to be differentially upregulated and accessible in our data set. Target ED is a logarithmic adjustment of fold change for the target gene versus wild type, taken from DDR. Target Pval Adjust is the adjusted p-value associated with Target ED. More detailed information, including TF ED values, chromatin accessibility, and frequency of SNP mutations, can be found in Supplementary File S8.


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
