1. Introduction
Transcriptional regulation is the primary means by which most organisms, both prokaryotic and eukaryotic, control gene expression. Transcriptional regulation occurs through the combinatorial recognition of specific DNA elements by sets of proteins, commonly referred to as transcription factors, that affect access of RNA polymerase to gene promoter regions and rates of productive transcription. Understanding the set of genes regulated by a particular transcription factor, known as its regulon, provides insights into the biological function of that protein and its role in an organism’s physiology.
With the advent of massively parallel sequencing techniques, the genomes of hundreds of thousands of organisms are presently known [1]. Within each, simple bioinformatic approaches can be used to identify dozens to hundreds of potential transcription factors, primarily based on their sequence homology with known transcription factors. However, beyond extensions based on work in evolutionarily related organisms, it is not yet possible to predict the DNA sequences recognized by a putative transcription factor or the genes that it regulates solely from genomic sequence information. Such knowledge still requires empirical data.
In organisms possessing powerful genetic tools, a forward genetic approach can be employed, in which a transcription factor gene is manipulated and the resulting changes in gene expression are measured. Comparing potential regulatory regions among the genes most affected may allow the identification of a consensus DNA element involved in their regulation. This can then be directly tested in vitro by a variety of biophysical means to validate transcription factor-DNA interactions. Scanning for consensus sequences throughout an organism’s genome, particularly those intergenic regions most likely involved in promoting transcription, can yield a panel of genes potentially regulated by a particular transcription factor. Bioinformatic analyses of these genes and downstream members of their operons, ranging from identification of homologous domains with known functions in their encoded proteins to phenotypic changes in response to environmental changes, can provide important clues as to the biological roles of an unknown transcription factor. Such an approach has been aptly demonstrated in the model organism Escherichia coli and can be seen in the wealth of information presently available on its transcription factors and the genes they regulate, as exemplified by the database RegulonDB [2]. This body of work provides a paradigm for understanding transcriptional regulatory networks in the host of organisms for which genomic information is now available.
Unfortunately, genetic tools may be limited for some organisms, making the aforementioned forward genetic approach for determining putative transcription factor function challenging. One case in point is the extremely thermophilic bacterium Thermus thermophilus HB8. Originally isolated from the Izu-Mine hot spring in Kawazu, Japan [3], it has been adopted as the model organism for the Structural-Biological Whole Cell Project, with the goal of understanding all biological phenomena in a cell through the structure of its biomolecules [4].
T. thermophilus HB8 is postulated to have 2226 genes in its genome, of which 1214 can be categorized by homology with regard to the potential biological functions of their encoded products. However, while tools such as microarray expression screening and state-of-the-art X-ray crystallography have been widely applied to its proteins, the polyploid nature of T. thermophilus has complicated genetic approaches towards understanding biological questions in this organism [5].
Our laboratory has developed an alternative, biochemistry-centric approach for determining the possible biological functions of putative transcription factors. First, we combine the combinatorial selection method Restriction Endonuclease Protection, Selection, and Amplification (REPSA) with massively parallel sequencing and de novo motif discovery to determine a consensus DNA-binding sequence for a putative transcription factor [6,7,8]. This is then followed by motif scanning within the subject genome and bioinformatic analyses to identify potential regulated genes and their possible biological functions. Whenever possible, each step is validated by available means, ranging from biophysical characterization of transcription factor binding to target sequences to microarray gene expression profile data. To test the utility of our approach, we have chosen to investigate four putative
T. thermophilus HB8 transcription factors, TTHA0101, TTHA0167, TTHA0973, and TTHB023. All are structurally related, possessing an N-terminal α-helix-turn-α-helix motif (pfam00440) that is responsible for their sequence-specific DNA binding and is characteristic of the archetypal TetR transcriptional repressor protein in E. coli. Each has been previously investigated through other means with regard to its DNA-binding specificity and regulatory profile [9,10,11,12]. Here, we present a REPSA-based investigation into the DNA-binding specificity and possible genomic targets of the TTHA0973 protein, comparing our results with those previously obtained by more conventional methods. This study provides us with a better understanding of the strengths and weaknesses of a biochemistry-centric approach for investigating transcription factors and will help shape future studies on other uncharacterized, orphan transcription regulatory proteins.
3. Discussion
As a prelude to using our biochemistry-centric approach to characterize putative transcription factors in the extreme thermophile Thermus thermophilus HB8, we chose to test this approach with four relatively well-characterized transcription factors. In the present report, we investigated the T. thermophilus HB8 transcription factor TTHA0973, which had been previously characterized by more conventional approaches [11]. Following REPSA selection, we isolated a library of DNA sequences that exhibited substantial TTHA0973-dependent cleavage inhibition by the type IIS restriction endonuclease BpmI. Sequencing these and performing MEME motif elicitation yielded consensus sequences that were merged into a single consensus sequence, 5′–AACnAACGTTnGTT–3′. In comparison, Sakamoto et al. defined the TTHA0973-regulated operons based on their homology to phenylacetic acid (PAA) clusters present in other organisms. Comparing their two promoter regions yielded a common pseudopalindromic sequence, 5′–CNAACGNNCGTTNG–3′. Notably, both the REPSA-selected and homology-derived consensus sequences are 14 bp and have elements in common, particularly the sequences 5′–CNAACG–3′ and 5′–CGTTNG–3′. However, with only two promoters identified in the latter approach, it is difficult to define the importance of all nucleotides in the consensus with regard to their role in TTHA0973-DNA recognition. For example, all REPSA-selected sequences had MEME-defined consensus sequences present, providing extremely high statistical significance
(E-values from 1.3 × 10⁻¹⁹⁵¹ to 7.0 × 10⁻¹⁴³⁹). By comparison, a similar MEME analysis of the two promoter sequences did not yield any significant consensus sequences, given the limited number of sequences involved. Thus, with regard to understanding the DNA-binding specificity of a putative transcription factor, a selection-based approach like REPSA yielded a greater number of sequences with which to derive a more detailed consensus sequence. Curiously, Sakamoto et al. did perform a related selection method, genomic SELEX [16], to identify DNA fragments from the
T. thermophilus HB8 genome that avidly bound TTHA0973 protein. They obtained 63 clones in total, 24 of which were unique. Of the latter, 17 contained only sequences within open reading frames, and seven contained some intergenic regions. Within the intergenic group, three unique clones contained TTHA0963 or TTHA0973 promoter sequences, which helped form the basis of their consensus sequence for TTHA0973. The other clones had varying regions of homology to their defined consensus sequence, ranging from 3 to 10 base pairs. While REPSA and genomic SELEX are related combinatorial selection methods, they have different strengths and weaknesses [17]. In this particular circumstance, genomic SELEX did not provide a substantially greater understanding of TTHA0973-DNA binding specificity than what was obtained through homology studies alone and yielded far less data than REPSA.
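The relationship between the two consensus sequences, and the idea of scanning a sequence for them, can be illustrated with a minimal Python sketch. This is not the FIMO analysis used in this work, which scores sites against a position weight matrix; it is a simplified stand-in that treats each consensus as an exact IUPAC pattern. The function names and the 20-nt test sequence are hypothetical and introduced only for illustration.

```python
import re

# Consensus sequences quoted in the text (IUPAC notation, N = any base).
REPSA_CONSENSUS = "AACNAACGTTNGTT"
HOMOLOGY_CONSENSUS = "CNAACGNNCGTTNG"

# Only the IUPAC symbols that occur in these two consensus sequences are handled.
IUPAC_TO_REGEX = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_sites(sequence: str, consensus: str) -> list[int]:
    """Return 0-based start positions where `sequence` matches `consensus`.

    Every start position is tested, so overlapping matches are reported.
    Only the given strand is scanned; for a palindromic consensus the
    opposite strand would yield the same sites.
    """
    pattern = re.compile("".join(IUPAC_TO_REGEX[base] for base in consensus.upper()))
    seq = sequence.upper()
    last_start = len(seq) - len(consensus)
    return [i for i in range(last_start + 1) if pattern.match(seq, i)]


# Hypothetical test sequence: one REPSA-type site flanked by three
# arbitrary nucleotides on each side (not taken from the genome).
TOY_SEQUENCE = "GGC" + "AACTAACGTTAGTT" + "CCG"

if __name__ == "__main__":
    for name, consensus in [("REPSA", REPSA_CONSENSUS), ("homology", HOMOLOGY_CONSENSUS)]:
        is_palindrome = reverse_complement(consensus) == consensus
        print(f"{name} consensus is its own reverse complement: {is_palindrome}")
    print("REPSA-type sites in toy sequence:", find_sites(TOY_SEQUENCE, REPSA_CONSENSUS))
```

Both consensus strings are their own reverse complements, which is why a single-strand scan suffices in this sketch. An exact-match scan of this kind ignores the graded contribution of each position, so it cannot reproduce the ranking of sites that a matrix-based scan provides.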
Having identified TTHA0973-DNA consensus sequences by REPSA selection, we sought to validate these sequences through direct protein-DNA binding assays. EMSA, while qualitatively useful for distinguishing different protein-DNA complexes, is not well suited to measuring kinetic binding parameters such as on- and off-rates [8]. Thus, we used BLI to determine the binding parameters of TTHA0973 to different DNAs. Experiments with different consensus sequence mutants illustrated the importance of each nucleotide in contributing to TTHA0973-DNA binding specificity. In general, several single point mutations affected the dissociation constant only slightly, whereas mutations of the two G/C nucleotides within the palindromic half-site had the most consequential effects. BLI was also performed with the different T. thermophilus HB8 genomic sites to validate whether TTHA0973 binds them avidly. Interestingly, while the binding affinities for these sites generally followed the trends indicated by FIMO, their values varied considerably, with some associations being below detection levels under our experimental conditions. These observations strongly suggest that theoretically determined sites should be validated experimentally before any conclusions are drawn. Sakamoto et al. performed a limited biophysical study of TTHA0973-DNA binding, using a related technology, surface plasmon resonance, and dsDNA fragments containing the
TTHA0963 and
TTHA0973 promoter regions [11]. They determined kon, koff, and KD values of 9.3 × 10⁵ M⁻¹·s⁻¹, 1.0 × 10⁻³ s⁻¹, and 1.1 nM for TTHA0963 and 9.8 × 10⁵ M⁻¹·s⁻¹, 0.9 × 10⁻³ s⁻¹, and 0.9 nM for TTHA0973, respectively, comparable values to what we obtained. Thus, this provides us with increased confidence in the use of BLI to investigate protein-DNA interactions and the values we obtained for different TTHA0973 binding sites.
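As a quick consistency check on the surface plasmon resonance values quoted above, the equilibrium dissociation constant can be recomputed from the rate constants. The sketch below assumes a simple 1:1 binding model, in which the dissociation constant is the off-rate divided by the on-rate; the function and variable names are illustrative only.

```python
def dissociation_constant_nM(k_on: float, k_off: float) -> float:
    """Return K_D in nM for a 1:1 binding model.

    k_on is in M^-1 s^-1 and k_off is in s^-1, so k_off / k_on is in M;
    multiplying by 1e9 converts M to nM.
    """
    return k_off / k_on * 1e9


# Rate constants reported by Sakamoto et al. for the two promoter fragments,
# as quoted in the text: (k_on in M^-1 s^-1, k_off in s^-1).
SPR_RATES = {
    "TTHA0963": (9.3e5, 1.0e-3),
    "TTHA0973": (9.8e5, 0.9e-3),
}

if __name__ == "__main__":
    for promoter, (k_on, k_off) in SPR_RATES.items():
        k_d = dissociation_constant_nM(k_on, k_off)
        print(f"{promoter}: K_D = {k_d:.1f} nM")
```

Rounded to one decimal place, the computed values agree with the reported 1.1 nM and 0.9 nM, so the quoted rate and equilibrium constants are internally consistent under this model.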
For Sakamoto et al., promoter identification and biological role determination for the putative transcription factor TTHA0973 derived primarily from knowledge about orthologs in other organisms [11]. Thus, they were able to identify TTHA0973 as PaaR, the functional homolog of the transcriptional repressor PaaX found in E. coli and Pseudomonas strains, which regulates genes involved in phenylacetic acid degradation. While homology comparisons are highly effective approaches for proposing biological functions of proteins, they tend to be directed to a known outcome and are less open to potential discovery. Conversely, our biochemistry-centric approach produced many possible genomic TTHA0973-binding sites, which then had to be winnowed down through bioinformatic analyses and functional studies to yield a set of reasonable candidate genes regulated by TTHA0973. Most importantly, the two best candidates identified through our studies corresponded to the two gene promoters,
TTHA0963 and
TTHA0973, identified previously by homology. Our other candidate gene promoter,
TTHA0615, while acceptable with regard to sequence, intergenic location, proximity to core promoter elements, position as the first gene in its operon, and experimentally determined binding affinity, is not likely a TTHA0973-regulated gene, as judged from publicly available gene expression profile data comparing wild-type and TTHA0973-deficient T. thermophilus HB8 strains [13]. This demonstrates the limitations of a biochemistry-centric approach for identifying putative transcription factor function but also provides guidance as to which parameters should be considered most important in making such determinations (i.e., binding affinity) and underscores the need for functional validation whenever possible. With that in mind, one additional gene possessing high-affinity TTHA0973 binding sites,
TTHA0647, was identified in our studies but not pursued based on its location within an open reading frame, distance from potential core promoter elements, and placement within its operon. However, clones containing this sequence were among the most abundant obtained by genomic SELEX, constituting 41 of the 63 clones isolated [11]. In addition, we found two potential TTHA0973 binding sites within 500 bp of one another in this gene, which was unique among the TTHA0973 binding sites we identified. GEO data showed no significant change in expression for TTHA0647 when TTHA0973 was deficient, suggesting that TTHA0973 does not play a transcriptional regulatory role for this gene. Thus, while high-affinity TTHA0973 binding sites may be coincidental here, they are worth noting in case they may be involved in an unexpected, DNA-dependent process, as we have observed previously [6].
One concern with a genetic approach for characterizing putative transcription factors is the possibility that the observed changes in gene expression, or lack thereof, may be affected by changes in the activity of endogenous transcription regulators compensating for the loss of the deleted gene product. Thus, upregulation of additional transcription factors involved in regulating phenylacetic acid metabolism could mask identification of the full spectrum of genes controlled by TTHA0973. Given the limited information available on transcription factors in this organism, it is not yet possible to answer whether the expression of any
T. thermophilus HB8 transcription factor is affected by TTHA0973 depletion. However, it is possible to determine whether the expression of genes encoding the other related TetR-family transcriptional repressors,
TTHA0101,
TTHA0167, or
TTHB023, is affected. Using available GEO data, we found minimal changes in the expression of their transcripts (logFC = 0.025279, 0.195387, and 0.503969, respectively) when TTHA0973 was depleted (see Table S2). These data suggest that none of their genes were significantly upregulated to compensate for the reduction in cellular TTHA0973 levels. This is notable in that one of these transcription factors, TTHB023, had been previously identified as being involved in phenylacetic acid metabolism and capable of regulating TTHA0973-responsive gene expression in vitro [12].
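To put the logFC values above on a more intuitive scale, they can be converted to linear fold changes. The sketch below assumes that logFC is a base-2 logarithm, as is conventional for microarray differential expression output; this base is an assumption and is not stated in the text. The names used are illustrative.

```python
# logFC values quoted in the text for the three related TetR-family genes.
LOG_FC = {
    "TTHA0101": 0.025279,
    "TTHA0167": 0.195387,
    "TTHB023": 0.503969,
}


def fold_change(log2_fc: float) -> float:
    """Convert a log2 fold change to a linear fold change (assumes base 2)."""
    return 2.0 ** log2_fc


if __name__ == "__main__":
    for gene, log_fc in LOG_FC.items():
        print(f"{gene}: logFC = {log_fc:.6f}, fold change = {fold_change(log_fc):.2f}")
```

Under this assumption the three values correspond to roughly 1.02-, 1.15-, and 1.42-fold changes, all of which fall below the twofold change often used as a rule-of-thumb threshold in expression studies.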