1. Introduction
The growth of biological databases and the need to understand how the many components of a living cell work together to perform cellular functions justify the growing interest in mathematical, statistical, information-theoretic and computational tools for the analysis of genomic data. In short, the genetic information of an organism is encoded in DNA molecules through units called bases: adenine (A), cytosine (C), guanine (G) and thymine (T). In eukaryotic cells, DNA is divided into genic and intergenic regions, and genes are further divided into exons and introns. The protein-coding sequence is the portion of a gene that encodes a protein, namely its exons; it is also known as the coding sequence (CDS). The non-coding sequences comprise the introns and intergenic regions (see Figure 1).
In this paper, we investigate how to improve the discrimination between coding and non-coding regions of a DNA sequence. In this context, Trifonov and Sussman [1] observed periodicities in DNA sequences by analyzing the autocorrelation function; Tsonis et al. [2] found that, whereas non-coding regions show a rather random pattern, coding sequences reveal periodicities, in particular a three-base periodicity (TBP). The TBP property produces a spectral peak at frequency 2π/3 rad/sample for coding sequences. This periodic phenomenon has attracted the attention of many biologists who are trying to understand and explain it [3,4,5,6]. Thus, it is possible to discriminate between coding and non-coding regions of a DNA sequence by observing its energy spectrum [7,8,9,10].
DNA sequences are symbolic sequences, and, therefore, spectral analysis first requires a numerical representation of the DNA. For that reason, a proper mapping rule from a DNA sequence onto one or more signals of complex or real numbers must be chosen. For a given DNA sequence, a mapping is a rule that associates each element of the set of bases {A, C, G, T} with an element of another set, such as the set of complex numbers. Consequently, the challenge in predicting DNA periodicities is how to choose the mapping rule for such sequences.
A classical method was proposed by Voss [11], in which each of the four bases is associated with a binary indicator signal. Each binary indicator is a discrete-time signal that takes the value 1 when the n-th symbol of the sequence is a given base and 0 otherwise. The energy density spectrum is then the sum of the energy contributions of the binary indicator signals, each evaluated from the discrete Fourier transform (DFT) of the corresponding signal. Other approaches map a DNA sequence to a single signal, in which case the energy spectrum is evaluated from the DFT of that signal. Among the most common mappings, Nair and Sreenadhan [12] proposed a mapping based on the electron-ion interaction pseudopotentials (EIIP); Anastassiou [13] proposed a mapping whose image consists of complex numbers, similar to the QPSK modulation technique; and Galleani and Garello [14] proposed the minimum entropy mapping (MEM) spectrum, in which a real mapping is computed from a spectral entropy minimization criterion.
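The binary-indicator construction described above can be illustrated with a minimal sketch. It assumes NumPy and uses illustrative names (`binary_indicators`, `voss_spectrum`) that are not taken from the authors' repository; the toy sequence is an exactly periodic string chosen only to make the three-base peak visible.

```python
import numpy as np

BASES = "ACGT"

def binary_indicators(seq):
    """Return a 4 x N array; row b is 1 where seq[n] equals BASES[b], else 0."""
    symbols = np.array(list(seq.upper()))
    return np.array([(symbols == b).astype(float) for b in BASES])

def voss_spectrum(seq):
    """Sum over the four bases of the squared DFT magnitude of each indicator."""
    indicators = binary_indicators(seq)
    dfts = np.fft.fft(indicators, axis=1)
    return np.sum(np.abs(dfts) ** 2, axis=0)

# Toy sequence with an exact period of three bases (N = 120).
seq = "ATG" * 40
N = len(seq)
S = voss_spectrum(seq)

# Largest non-DC component of the one-sided spectrum.
k_peak = int(np.argmax(S[1:N // 2 + 1])) + 1
print(k_peak, N // 3)
```

For this toy input the largest non-DC component falls at bin N/3, which corresponds to 2π/3 rad/sample. Real coding sequences are only approximately periodic, so the peak is far less clean.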
However, these approaches have some performance limitations, the most important of which is how to define the mapping. Symbolic sequences have a statistical structure that provides important information about them. We, therefore, expect that a numerical representation of such a sequence should not impose additional features on the resulting signal. For example, a map cannot assume that one symbol is always numerically greater than another. For this reason, a single mapping applied to every DNA sequence necessarily ignores the features that are particular to each sequence. This suggests that a particular mapping should be computed for each DNA sequence.
Assuming that a numerical signal is appropriate for a given DNA sequence, time–frequency analysis can then be applied to detect coding regions in genes. In this context, Tiwari et al. [15] were the first to propose that it is sufficient to evaluate the energy density at frequency 2π/3 rad/sample in a window of W samples, sliding it through the set of binary indicators. Vaidyanathan and Yoon [16] proposed the use of an antinotch filter on the sliding window over the set of binary indicators. Sahu and Panda [17] suggested the use of the S transform, considering the signal resulting from the EIIP mapping. Wang and Johnson [18] extended the spectral envelope approach (initially proposed by Stoffer et al. [19]) to process non-stationary symbolic signals in the time–frequency domain and analyzed the correlation structure of DNA.
Therefore, in this paper, we propose two new algorithms for computing mappings for DNA sequences. Both algorithms are based on the spectral envelope approach. Briefly, the spectral envelope is the new spectrum obtained by maximizing the energy spectrum over the entire frequency range [0, N − 1]. That is, at each frequency in this range, the spectral envelope looks for four constants on a complex hypersphere of unit radius that maximize the energy density spectrum of the signal resulting from the linear combination of the binary indicator signals. Note that N combinations are computed, each of which is a potential DNA mapping.
We then use this mapping to find the numerical signal for the DNA sequence. Thus, we can calculate the respective energy density spectrum of the signal to discriminate between coding and non-coding sequences. The first algorithm searches for the mapping that maximizes the SNR of the energy density spectrum. The second algorithm, on the other hand, takes advantage of prior knowledge about the TBP property, such that the mapping results from the spectral envelope at the frequency 2π/3 rad/sample.
The performance of the new methods is verified by comparing them with four well-established methods from the literature (Voss [11], EIIP [12], QPSK [13] and the MEM spectrum [14]) and by applying them to synthetic and real DNA sequences whose properties are known. In addition, we discuss the intrinsic properties and computational complexities of the proposed algorithms. Our methods improve the discrimination of the TBP in DNA sequences compared with previous works. Moreover, we observed improvements in the SNR and spectral entropy of the respective signals. The algorithms were implemented in Python and are available in a GitHub repository [20].
The present paper is organized as follows. Section 2 provides notations and definitions that are important to the analyses in this paper. In Section 3, we present our methods and the proposed algorithms. In Section 4, we make remarks on the algorithms. The results are presented and discussed in Section 5 and, finally, the conclusions are drawn in Section 6.
3. Methods
3.1. Experimental Data
The data are available from the nucleotide database of the National Center for Biotechnology Information (NCBI), which provides open access to biomedical and genomic information [24]. Each DNA sequence record processed by NCBI is referred to by an accession number. Furthermore, the qualifier that links DNA sequence records and their genes is the geneID. The accession numbers and geneID are both a simple series of digits.
For a detailed analysis of the spectral methods, we use chromosomes XIV, XV, and XVI of Saccharomyces cerevisiae (accession numbers NC_001146.8, NC_001147.6 and NC_001148.4, respectively), which have 398, 546, and 474 coding sequences, respectively. For coding sequences on the complementary strand, we apply the reverse-complement operation so that each sequence starts at the start codon ATG. The data are divided into two datasets: the first has only coding sequences (the coding sequence dataset) and the second has only sequences from intergenic regions (the non-coding sequence dataset). In both cases, we discard sequences shorter than 200 base pairs (bp). After filtering, there are 1388 coding sequences and 1188 non-coding sequences in our dataset.
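The two preprocessing steps above (reverse-complementing coding sequences on the complementary strand and discarding short sequences) can be sketched as follows. The function names and the short fragment are hypothetical and serve only as an illustration. The 200 bp threshold is the one stated in the text.

```python
# Translation table for the Watson-Crick complement of each base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base and reverse the result."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def keep_long_sequences(seqs, min_len=200):
    """Discard sequences shorter than min_len base pairs."""
    return [s for s in seqs if len(s) >= min_len]

# Hypothetical fragment as it would appear on the complementary strand.
minus_strand = "TTACGGCAT"
cds = reverse_complement(minus_strand)
print(cds)  # ATGCCGTAA
```

After the operation the fragment reads in the coding direction, beginning with the start codon.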
Furthermore, we use the portion of gene F56F11 from chromosome III of Caenorhabditis elegans that transcribes the protein F56F11.4, isoform a. The F56F11.4a sequence is used as a benchmark problem for different exon detection techniques [8,13,14]. It has 7990 bp, starting at nucleotide position 7021 of gene F56F11. In addition, F56F11.4a has five well-known distinct exons whose locations relative to nucleotide position 7021 are 928 to 1039, 2528 to 2857, 4114 to 4377, 5465 to 5644 and 7255 to 7605. Note that the first exon is the shortest (112 bp) and usually the most difficult to detect.
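The exon coordinates above can be checked with a few lines of Python. The sketch assumes that positions are 1-based and inclusive, which is consistent with the stated 112 bp length of the first exon. The names `EXONS`, `exon_lengths` and `coding_fraction` are illustrative only.

```python
# Exon boundaries of F56F11.4a relative to nucleotide position 7021 (from the text).
EXONS = [(928, 1039), (2528, 2857), (4114, 4377), (5465, 5644), (7255, 7605)]
SEQUENCE_LENGTH = 7990  # bp, from the text

def exon_lengths(exons):
    """Length of each exon, assuming 1-based inclusive coordinates."""
    return [end - start + 1 for start, end in exons]

def coding_fraction(exons, sequence_length):
    """Fraction of the sequence covered by exons."""
    return sum(exon_lengths(exons)) / sequence_length

lengths = exon_lengths(EXONS)
print(lengths)  # [112, 330, 264, 180, 351]
print(coding_fraction(EXONS, SEQUENCE_LENGTH))
```

Under this convention the exons cover 1237 of the 7990 bases, so most of the sequence is non-coding.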
3.2. Adaptive DNA Mappings
As we have seen previously, the first procedure for the spectral analysis of DNA sequences is mapping the symbolic data to a numeric signal. We have also seen that having a single mapping for all DNA sequences can ignore the intrinsic properties of each sequence. Therefore, an adaptive mapping should be done by searching potential mappings for DNA sequences in order to highlight the structure of their data. To implement the adaptive mapping method, we propose the use of a spectral envelope approach.
Recall that the spectral envelope represents the maximum energy that the signal (6) can have when its vector of mapping coefficients is constrained to unit norm. For each particular frequency k in the entire range [0, N − 1], there is a respective coefficient vector that attains this maximum. These vectors are the search space for our adaptive DNA mapping method based on the spectral envelope. Each such vector has an associated mapping: the image of the mapping, that is, the values assigned to the bases A, C, G and T, consists of the components of the respective vector. Therefore, in our search space, there are up to N potential mappings and N different signals, which can also differ in their spectral composition.
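To make the search space concrete, the following sketch computes one candidate mapping per frequency under a simplified reading of the description given in the Introduction: four complex coefficients of unit norm that maximize the energy, at a single frequency bin, of the linear combination of the binary indicator signals. Under that reading the Cauchy–Schwarz inequality gives a closed form, namely the normalized complex conjugate of the vector of indicator DFT values at that bin. This is an assumption made for illustration. The paper's own formulation (Section 2, not reproduced here) may include normalization or centring steps that the sketch omits, and all function names are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def indicator_dfts(seq):
    """DFTs of the four binary indicator signals, as a 4 x N array."""
    symbols = np.array(list(seq.upper()))
    indicators = np.array([(symbols == b).astype(float) for b in BASES])
    return np.fft.fft(indicators, axis=1)

def envelope_coefficients(X, k):
    """Unit-norm complex vector maximising |sum_b beta_b X_b[k]|^2 (assumed form)."""
    x = X[:, k]
    norm = np.linalg.norm(x)
    if norm < 1e-12:
        # No energy at this bin: any unit vector is a maximiser.
        return np.full(4, 0.5, dtype=complex)
    return np.conj(x) / norm

def mapped_spectrum(X, beta):
    """Energy spectrum of y[n] = sum_b beta_b u_b[n], using linearity of the DFT."""
    return np.abs(beta @ X) ** 2

rng = np.random.default_rng(0)
seq = "".join(rng.choice(list(BASES), size=90))
X = indicator_dfts(seq)
k = 30                                   # bin of 2*pi/3 rad/sample for N = 90
beta = envelope_coefficients(X, k)
mapping = dict(zip(BASES, beta))         # one potential mapping per frequency
energy_at_k = mapped_spectrum(X, beta)[k]
```

Repeating the call to `envelope_coefficients` for every bin yields the up-to-N candidate mappings mentioned above.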
For example, consider the coding sequence of the AIM41 gene (geneID: 854425) from chromosome XV of Saccharomyces cerevisiae. Since it is a coding sequence, we expect the presence of the TBP property, and, therefore, we expect its energy spectrum to reveal a discriminant spectral peak at frequency 2π/3 rad/sample. The spectral envelope for this sequence is shown in
Figure 3a. Note that, instead of what is expected by the TBP property for the spectral envelope, the peak occurs at the
rad/sample. However, when we solve the spectral envelope at the frequency
rad/sample, we obtain
and
. Therefore, the corresponding mapping is given by
The energy density spectrum of the signal obtained with this mapping is shown in Figure 3b. In this case, as expected from the TBP property, the peak occurs at 2π/3 rad/sample. However, the TBP property is not observed for all mappings in the search space. At frequency
rad/sample, the spectral envelope is
, and the corresponding mapping is given by
The energy density spectrum of the signal obtained with this mapping is shown in Figure 3c. Notice that the energy spectrum of a DNA sequence can differ when we change the mapping. This is another reason to look for an adaptive and unique mapping for each sequence.
To select a single mapping for a DNA sequence, we must choose it from the N potential mappings. For this reason, a constraint should be imposed. The first algorithm uses the maximization of the SNR of the energy density spectrum as the constraint. Consequently, from now on, we will call it SNR-SE, where SE is short for spectral envelope. The SNR is the ratio of signal power to noise power. It is computed on the energy density spectrum of the signal as follows. The signal power is estimated as the energy of the highest spectral component; the noise power, or background noise, is the total energy, excluding the signal power and the DC value [4].
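The SNR definition above translates directly into code. The sketch below assumes that the spectrum is given as an array whose first entry is the DC value and that the highest spectral component is searched among the non-DC entries. The function name `spectrum_snr` is illustrative.

```python
import numpy as np

def spectrum_snr(energy):
    """SNR of an energy density spectrum whose entry 0 is the DC value.

    Signal power: energy of the highest non-DC component.
    Noise power: total energy minus the signal power and the DC value.
    """
    energy = np.asarray(energy, dtype=float)
    non_dc = energy[1:]                # exclude the DC value
    signal = non_dc.max()
    noise = non_dc.sum() - signal
    return np.inf if noise == 0 else signal / noise

# DC = 100 is ignored; signal = 8, noise = 1 + 1 = 2.
print(spectrum_snr([100.0, 1.0, 8.0, 1.0]))  # 4.0
```

A spectrum with a single dominant peak and a low background therefore scores a high SNR, which is the property the first algorithm maximizes.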
In this algorithm, the potential mappings are those that solve the spectral envelope for each frequency k in the closed interval from 0 to ⌊N/2⌋. The search space is reduced in this way because the one-sided energy spectrum must contain all the spectral information about the signal. For each potential mapping, the energy spectrum and its SNR are estimated. Finally, we choose the mapping whose respective signal has the energy spectrum with the highest SNR. The pseudocode of this method is shown in Algorithm 2.
The second algorithm is a special case of the first that exploits prior knowledge of the TBP property. From now on, we will call it TBP-SE. We assume that all coding sequences have the TBP property, so a discriminant spectral peak at frequency 2π/3 rad/sample is observed, whereas, in non-coding sequences, this peak is absent. Therefore, from among all potential mappings of the spectral envelope, this algorithm chooses the one that solves the optimization problem of the spectral envelope at frequency 2π/3 rad/sample. The pseudocode of this method is shown in Algorithm 3.
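Algorithm 2 SNR-SE
Input: DNA sequence s
Output: One-sided spectrum
1: best SNR ← −∞
2: for each k in {0, 1, …, ⌊N/2⌋} do
3:   β ← SpectralEnvelope(s, k)
4:   m ← map whose image are the components of β
5:   Compute the energy spectrum of s mapped by m using Equation (11)
6:   SNR ← signal-to-noise ratio of this energy spectrum
7:   if SNR > best SNR then
8:     best SNR ← SNR
9:     best map ← m
10:   end if
11: end for
12: Compute the one-sided spectrum using Equation (11) and the best map
13: return the one-sided spectrum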
Algorithm 3 TBP-SE
Input: DNA sequence s
Output: One-sided spectrum
1: k ← N/3
2: β ← SpectralEnvelope(s, k)
3: m ← map whose image are the components of β
4: Compute the one-sided spectrum using Equation (11) and m
5: return the one-sided spectrum
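The two procedures can be sketched in Python as follows. This is not the authors' implementation: the routine standing in for SpectralEnvelope uses an assumed closed form (unit-norm coefficients proportional to the complex conjugate of the indicator DFT values at bin k), the spectrum is computed directly from the indicator DFTs rather than from the paper's Equation (11), and the bin for 2π/3 rad/sample is taken as N/3 rounded to the nearest integer. All names are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def indicator_dfts(seq):
    """DFTs of the four binary indicator signals, as a 4 x N array."""
    symbols = np.array(list(seq.upper()))
    indicators = np.array([(symbols == b).astype(float) for b in BASES])
    return np.fft.fft(indicators, axis=1)

def envelope_mapping(X, k):
    """Assumed stand-in for SpectralEnvelope(s, k): unit-norm coefficients."""
    x = X[:, k]
    norm = np.linalg.norm(x)
    if norm < 1e-12:
        return np.full(4, 0.5, dtype=complex)   # arbitrary unit vector
    return np.conj(x) / norm

def one_sided_spectrum(X, beta):
    """Energy of the mapped signal at bins 0 .. floor(N/2)."""
    n = X.shape[1]
    return np.abs(beta @ X[:, : n // 2 + 1]) ** 2

def snr(energy):
    """Highest non-DC component over the remaining non-DC energy."""
    non_dc = energy[1:]
    signal = non_dc.max()
    noise = non_dc.sum() - signal
    return np.inf if noise <= 0 else signal / noise

def snr_se(seq):
    """Keep the candidate mapping whose one-sided spectrum has the highest SNR."""
    X = indicator_dfts(seq)
    n = X.shape[1]
    best_snr, best_beta = -np.inf, None
    for k in range(n // 2 + 1):
        beta = envelope_mapping(X, k)
        value = snr(one_sided_spectrum(X, beta))
        if value > best_snr:
            best_snr, best_beta = value, beta
    return best_beta, one_sided_spectrum(X, best_beta)

def tbp_se(seq):
    """Use only the candidate mapping obtained at the three-base-periodicity bin."""
    X = indicator_dfts(seq)
    k = round(X.shape[1] / 3)
    beta = envelope_mapping(X, k)
    return beta, one_sided_spectrum(X, beta)

rng = np.random.default_rng(0)
toy = "".join(rng.choice(list(BASES), size=120))   # random toy sequence
beta_snr, spec_snr = snr_se(toy)
beta_tbp, spec_tbp = tbp_se(toy)
```

Because the mapping used by `tbp_se` is one of the candidates examined by `snr_se`, the SNR reached by `snr_se` can never be lower than that of `tbp_se` in this sketch. The sketch loops over all bins without any attempt at efficiency and makes no claim about the complexity of the authors' implementation.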
3.3. Evaluation and Interpretation
The spectral analysis for the discrimination of the DNA coding sequences is then evaluated as follows. We must check at which frequency the largest spectral peak occurs. If it occurs between frequencies rad/sample, we say that such a sequence is a DNA coding sequence.
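The decision rule can be sketched as follows. The exact frequency band is not reproduced here, so the half-width `tol` around 2π/3 rad/sample is a hypothetical parameter, as is the choice to ignore the DC component when locating the largest peak. The function names are illustrative.

```python
import numpy as np

def peak_frequency(one_sided_energy, n):
    """Frequency (rad/sample) of the largest non-DC component.

    one_sided_energy holds bins 0 .. floor(n/2) of an n-point spectrum.
    """
    k = int(np.argmax(one_sided_energy[1:])) + 1
    return 2 * np.pi * k / n

def is_coding(one_sided_energy, n, tol=0.01):
    """Positive test: the largest peak lies within tol of 2*pi/3 rad/sample."""
    return bool(abs(peak_frequency(one_sided_energy, n) - 2 * np.pi / 3) <= tol)

# Toy spectra for n = 90: one with its peak at bin 30 (2*pi/3), one at bin 10.
coding_like = np.ones(46)
coding_like[30] = 5.0
other = np.ones(46)
other[10] = 5.0
print(is_coding(coding_like, 90), is_coding(other, 90))  # True False
```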
Therefore, the test outcome can be positive (classifying the DNA sequence as a coding sequence) or negative (classifying the DNA sequence as a non-coding sequence). The test results for each DNA sequence may or may not match the real status. In such a setting, we have the following:
True positive: coding sequences that are correctly identified as coding sequences;
False positive: non-coding sequences that are misclassified as coding sequences;
True negative: non-coding sequences that are correctly classified as non-coding sequences;
False negative: coding sequences that are misclassified as non-coding sequences.
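To compare the effectiveness of each DNA coding sequence identification method, we evaluate three measures: accuracy, sensitivity, and specificity. Let TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives, respectively. Accuracy is the global correct classification rate, reflecting the ability to predict correctly over all samples, that is,
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
Sensitivity or true positive rate (TPR) evaluates the ability to correctly predict a coding sequence, that is,
$$\mathrm{TPR} = \frac{TP}{TP + FN}.$$
Specificity or true negative rate (TNR) evaluates the ability to correctly predict a non-coding sequence, that is,
$$\mathrm{TNR} = \frac{TN}{TN + FP}.$$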
If the sensitivity is high, then any DNA sequence that is a coding sequence is likely to be classified as a coding sequence by the method. On the other hand, if the specificity is high, any DNA sequence that is a non-coding sequence is likely to be classified as a non-coding sequence by the test. The best possible prediction method would yield 100% sensitivity (no false negatives) and 100% specificity (no false positives).
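The three measures follow directly from the four counts. The sketch below uses the standard definitions, and the counts in the example are purely illustrative rather than results from this paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (TPR), specificity (TNR) and FPR from the four counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # coding sequences correctly detected
        "specificity": tn / (tn + fp),   # non-coding sequences correctly rejected
        "fpr": fp / (fp + tn),           # equals 1 - specificity
    }

# Illustrative counts only: 100 coding and 100 non-coding sequences.
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
print(metrics)
```

The false positive rate is included because it is the horizontal coordinate of a method's point in ROC space.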
5. Results and Discussions
The energy density spectrum of a DNA sequence can differ when it is computed with different methods of spectral analysis. In general, these spectra are not approximations of one another. For comparison, the energy spectrum of all sequences in the database was evaluated using the two algorithms proposed in this paper, SNR-SE and TBP-SE, in addition to four methods already established in the literature: Voss [11], EIIP [12], QPSK [13] and the MEM spectrum [14].
Consider the specific case of the AIM41 and MPR35 genes, whose energy spectra are shown in Figure 6 and Figure 7, respectively. Note that, contrary to expectation, not all methods detect the TBP property for these genes. There are two possible reasons for this. First, the chosen mapping can hide spectral information about the sequence. For the AIM41 gene, for example, the energy density spectrum, as defined by Voss or using the EIIP, QPSK, and TBP-SE mappings, has its largest peak at frequency 2π/3 rad/sample. However, the background noise increases significantly when the Voss spectrum is evaluated. In addition, this discriminatory frequency is lost when the MEM spectrum and SNR-SE are evaluated (see Figure 6).
The second reason is that, although the TBP property in coding sequences is a classical frequency discriminator in the biological context, some coding sequences do not seem to be distinguished by it. This is the case with the
MPR35 gene. For all methods, the energy density spectrum has the largest peak at the frequency
rad/sample (see
Figure 7). Beyond these cases, in general, the spectrum evaluated by our methods yields improvements in the coding sequence classification and background noise reduction.
Although there are intrinsic limitations in the spectral analysis of a given DNA sequence, some methods discriminate the TBP property of coding sequences better than others. Table 1 compares all the methods already mentioned in terms of accuracy, sensitivity, and specificity. Note that there is often a trade-off between sensitivity and specificity, such that increasing sensitivity can decrease specificity and vice versa.
Table 1 reveals that our proposed method, TBP-SE, had the highest accuracy and sensitivity of all methods. This is especially important in this application since it reduces the probability that a coding sequence will not be identified. In other words, coding sequences are more likely to be correctly identified as coding sequences using TBP-SE. Furthermore, its specificity remained high, and TBP-SE had the most uniform levels of accuracy, sensitivity, and specificity.
On the other hand, among the methods with adaptive mapping, the MEM spectrum does not perform well. It has the lowest levels of accuracy and sensitivity. One possible reason for this is that the search space of this method is constrained by spectral entropy; however, spectral entropy ignores the intrinsic structure of partial order, as pointed out in [23]. Furthermore, this method has the highest computational complexity, which makes it impractical compared with the other spectral analysis methods discussed in this paper.
The other methods seem to perform similarly to each other, but differences can be seen graphically in the receiver operating characteristic (ROC) space, see Figure 8. The ideal ROC curve hugs the top left corner, indicating a high TPR and a low false positive rate (FPR), where FPR = 1 − TNR. Since we use a binary classification without a threshold, the statistics of each method yield a single point in the ROC space.
The ROC curve reveals that QPSK and SNR-SE have similar performance (the difference in
is 0.01 and in
is 0.003). In addition, Voss, EIIP, QPSK, and SNR-SE have approximately the same
, but Voss performs better because it has the lowest
. Taking both the Voss and TBP-SE into account, approximately
of coding sequences that were misclassified as non-coding sequences using TBP-SE were also misclassified using Voss. This phenomenon also occurs in the MRP35 gene (see
Figure 7), but still, the background noise of the DNA spectrum is reduced using TBP-SE. Therefore, TBP-SE can be preferred over Voss, since the TPR level is especially important in this application.
Case Study: Gene F56F11.4a
The gene F56F11.4a has five well-known distinct exons whose locations relative to nucleotide position 7021 are between 928 and 1039, 2528 and 2857, 4114 and 4377, 5465 and 5644, and 7255 and 7605. The first exon is the shortest (112 bases) and is usually the most difficult to detect. In this scenario, the coding regions are identified as follows [8,10,15]. The energy density spectrum at frequency 2π/3 rad/sample is evaluated over a window of W samples, then the window is slid by one or more samples, and the energy density is recalculated, in a process that analyzes the entire DNA sequence. An important criterion for this analysis is the choice of the window length W. For this gene, Tiwari et al. [15] suggest using W = 351. Therefore, a rectangular window of length 351 and step size 5 was used.
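The sliding-window procedure can be sketched as follows for the binary-indicator (Voss) representation, with the window length 351 and step size 5 stated above as defaults. The function name, the use of window centres as positions, and the synthetic test sequence are assumptions made for illustration. For an adaptive mapping, the four indicator rows would be replaced by the single mapped signal.

```python
import numpy as np

BASES = "ACGT"

def tbp_energy_track(seq, window=351, step=5):
    """Energy at 2*pi/3 rad/sample in a rectangular window slid along seq.

    Returns the centre position of each window and the summed energy of the
    four binary indicator signals at that single frequency.
    """
    symbols = np.array(list(seq.upper()))
    indicators = np.array([(symbols == b).astype(float) for b in BASES])
    # Complex exponential of the DFT evaluated only at 2*pi/3 rad/sample.
    phasor = np.exp(-2j * np.pi / 3 * np.arange(window))
    starts = np.arange(0, len(seq) - window + 1, step)
    energy = np.array(
        [np.sum(np.abs(indicators[:, s:s + window] @ phasor) ** 2) for s in starts]
    )
    return starts + window // 2, energy

# Synthetic sequence: a period-3 block flanked by homopolymer runs.
toy = "A" * 400 + "ATG" * 150 + "A" * 400
centres, energy = tbp_energy_track(toy)
normalised = energy / energy.max()          # normalised by the maximum value
best_centre = int(centres[np.argmax(energy)])
```

In this synthetic case the energy is essentially zero in the flanking runs and maximal for windows lying entirely inside the periodic block. Real exons give much less clear-cut tracks.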
For comparison purposes, the results are presented in Figure 9, where the horizontal axis is the relative base position and the vertical axis is the energy density spectrum normalized by its maximum value. There are two possible interpretations. In the first, the peaks in the spectrum should correspond to the regions where the TBP property is present. These regions can be identified by applying a threshold to the spectrum, so that regions with energy above the threshold are considered exons. In this case, the methods generally detect four of the five exons, and the first exon is the one most often missed. For the EIIP method specifically, the energy of the fourth exon is significantly reduced because it is mixed with intronic regions. The other methods have similar performance.
However, the second interpretation expresses more information about the gene. In this case, the shaded areas show the regions where the TBP property is present in the respective sliding window. TBP-SE was the only method to identify the presence of all five exons. EIIP showed more instability in predicting non-coding regions. Voss, QPSK, MEM and SNR-SE had similar performances, but MEM seems to increase the background noise of the spectrum. Although QPSK seems to detect an additional exon at the beginning of the sequence, that shaded area is located far from the true first exon region. Additionally, there are no shaded areas in the region of the last exon. All these results were expected based on the previous analysis of the ROC curve of the methods.
6. Conclusions
DNA sequences are symbolic sequences, and, therefore, their numerical representation should not impose additional features on the mapped signal. As seen previously, the spectrum of these signals is sensitive to the mapping. That is, for distinct maps, the energy spectra of a given DNA sequence are also distinct, and they are not approximations of each other. Furthermore, a single fixed mapping cannot be expected to represent every DNA sequence well. Ideally, each DNA sequence should be mapped to a signal using a particular mapping such that this signal captures as much information as possible about the sequence. Therefore, in this paper, we propose two algorithms for computing mappings for DNA sequences by using the spectral envelope approach: SNR-SE and TBP-SE.
The proposed algorithms are new methods for finding adaptive complex mappings for DNA sequences and, hence, improve the spectral analysis of such symbolic sequences. The remarks about the proposed algorithms are summarized as follows. The spectral envelope approach is used to find adaptive mappings and, thus, convert DNA sequences into discrete-time signals. A mapping is uniquely chosen for each sequence according to one of two constraints: the SNR or the TBP property. The mapping is defined over the complex field. Both algorithms have loglinear complexity, that is, they are O(N log N), where N is the sequence length. Computational efficiency is essential when large DNA sequences and databases need to be processed.
To investigate how our algorithms improve DNA spectral analysis for coding sequence classification, we checked the presence or absence of the TBP property in the DNA spectrum for the following methods: Voss [11], EIIP [12], QPSK [13], MEM spectrum [14], SNR-SE and TBP-SE. In this scenario, the proposed method, TBP-SE, had the highest accuracy and sensitivity of all methods. In addition, the TBP-SE and Voss approaches showed the best performance in this classification. However, TBP-SE should be preferred, as it has the highest sensitivity, which is most important in this application since it reduces the probability that a coding sequence will not be identified. We also analyzed the performance of the methods for identifying exonic regions in the gene F56F11.4. In this case, the first exon is the shortest and is usually the most difficult to detect. However, TBP-SE was the only method to identify the presence of all five exons of the gene.